CUBRID Thread Manager NG — Connection/Worker Pool Redesign for High-Concurrency (CBRD-26177)
Contents:
- Theoretical Background
- Common DBMS Design
- Motivation (CBRD-26152 + CBRD-26177)
- CUBRID’s Approach
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
This document is the next-generation counterpart to
cubrid-thread-worker-pool.md. The sibling doc covers the legacy
baseline — one polling thread per accepted connection plus a
max_clients-sized cubthread::worker_pool, with all dispatch through
a per-pool mutex. The redesign tracked here, delivered under EPIC
CBRD-26177 for the guava release, replaces the front half with a
small bounded set of epoll-driven connection workers, adds a
coordinator that balances connections across them and dynamically
scales their count, bounds per-tick I/O via send/recv budgets, and
rotates context allocation through per-worker freelists so the hot
path no longer touches new/delete. The task-worker pool below is
retained but resized via two new tunables (task_group,
task_worker) to avoid contention at high concurrency.
Theoretical Background
The connection front-end of a database server has to multiplex many TCP sockets onto a finite number of CPU cores. Three architectures have dominated the literature, each with a distinct mapping between sockets, threads, and event-loop iterations.
Thread-per-connection (one-thread-per-client). Each accepted
socket gets a dedicated kernel thread that calls read()/write()
directly. The model is simple — no event-loop bookkeeping, no
demultiplexing — and is the design Stevens describes in UNIX Network
Programming, Vol. 1 (3rd ed., §16.5 “TCP Concurrent Server,
One Child per Client”) as the canonical Unix server. It scales until
the kernel’s thread-switch overhead dominates: at C10K and beyond,
the working set of stacks blows the L1/L2 caches, the scheduler’s
runqueue grows linearly with idle threads, and any shared mutex
between threads serializes the entire server. Database Internals
(Petrov, 2019, §5.3 “Concurrent Execution”) summarises the lesson:
“If you want to scale to tens of thousands of concurrent connections,
having one thread per connection becomes impractical.”
Reactor (event-driven). A small fixed pool of event-loop
threads each blocks on a multiplexer (select/poll/epoll/kqueue)
and dispatches ready sockets synchronously. The reference work is
Pai, Druschel, and Zwaenepoel, “Flash: An Efficient and Portable
Web Server” (USENIX 1999), which demonstrated that a single
asymmetric event loop using non-blocking I/O could match or beat
threaded servers at an order-of-magnitude lower memory cost. The
crucial mechanical refinement is edge-triggered epoll: with
EPOLLET, the kernel reports a readiness transition exactly once,
and the user-space loop is responsible for draining the socket until
EAGAIN. Edge-triggering eliminates wake-up storms but forces the
loop to bound how much it drains per fd — otherwise a single fat
connection can starve the others. This is the head-of-line blocking
problem inside an event-loop worker.
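The drain-until-EAGAIN contract can be sketched in a few lines. This is a minimal illustration of the edge-triggered obligation, not CUBRID's receiver; `drain_until_eagain` is a hypothetical name:

```cpp
#include <cassert>
#include <cerrno>
#include <sys/socket.h>
#include <unistd.h>

// With EPOLLET the kernel reports the readiness transition once, so the
// reader must keep calling recv() until EAGAIN or the edge is lost.
// Returns the number of bytes drained, or -1 on a real error.
ssize_t drain_until_eagain (int fd, char *buf, size_t buflen)
{
  ssize_t total = 0;
  for (;;)
    {
      ssize_t n = recv (fd, buf, buflen, 0);
      if (n > 0)
        {
          total += n;           // keep draining; more may be buffered
          continue;
        }
      if (n == 0)
        return total;           // peer closed the connection
      if (errno == EAGAIN || errno == EWOULDBLOCK)
        return total;           // fully drained: safe to wait for the next edge
      return -1;                // genuine error
    }
}
```

Note that nothing here bounds `total`; that unbounded drain is exactly the head-of-line problem the budgets in CBRD-26392 address.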
Proactor (asynchronous I/O). The kernel signals
completion, not readiness — Windows IOCP, Linux io_uring,
POSIX AIO. Conceptually superior for write-heavy workloads but
operationally heavier and not yet the default for database
front-ends. CUBRID’s redesign deliberately chose reactor + edge
trigger; proactor is out of scope.
Admission control via budgets. Welsh, Culler, and Brewer’s
SEDA (“SEDA: An Architecture for Well-Conditioned, Scalable
Internet Services,” SOSP 2001) framed the front-end as a sequence of
stages connected by bounded queues, with each stage applying its
own admission policy. The empirical observation is that latency
under saturation degrades far less when each stage caps the work it
will absorb in a single tick. CUBRID’s recv_budget_per_connection
and send_budget_per_connection (CBRD-26392) are the SEDA admission
gate applied to a single epoll tick: a fat reader that would happily
drain a megabyte must instead yield after 16 KB, register itself in
an “exhausted” list, and let the worker round-robin back to it on
the next iteration.
Pool sizing — Little’s law. Given an arrival rate λ
(requests/sec) and an average per-request service time S (sec), the
average number of in-flight requests is L = λ · S. A pool with
fewer than L workers will queue indefinitely; a pool with
significantly more workers wastes CPU on context switching and
blocks on internal critical sections. Database Internals (§5.3)
notes that real systems usually pick a small multiple of physical
cores and tune empirically, because S varies with the workload.
CBRD-26424 (score-based assignment) and CBRD-26636 (Worker count
sweep) implement exactly this empirical loop: measure throughput at
several task_worker sizes, pick the local maximum.
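As a worked example of L = λ · S: a server absorbing 60,000 requests/sec at 0.25 ms average service time keeps about 15 requests in flight, so a pool far smaller than 15 queues indefinitely and a pool of hundreds mostly context-switches. The helper below is a hypothetical illustration, not CUBRID code:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Little's law: average in-flight requests L = lambda * S.
// Rounding up gives the smallest pool that does not queue indefinitely.
std::size_t min_workers (double arrivals_per_sec, double service_time_sec)
{
  return static_cast<std::size_t> (std::ceil (arrivals_per_sec * service_time_sec));
}
```

Because S shifts with the workload, this is only a floor; the CBRD-26636 sweep measures around it rather than trusting the formula.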
Atomic-free monitoring. Naïve performance counters use
std::atomic<uint64_t>::fetch_add per event. Under high load the
cache-line of the counter pings between cores; at hundreds of
thousands of events per second per worker the contention itself
becomes the bottleneck the counter was meant to measure. The
established workaround is thread-local accumulation with lazy
aggregation: each worker increments a private counter and the
monitor reader sums them. CBRD-26191 demonstrates the gain on YCSB
(workload-a: 58 K → 60 K ops; workload-b: 70 K → 73 K ops) by
removing only the atomic instructions on the hot path. Connection
worker statistics in this redesign follow the same rule —
statistics::metrics<> is a plain uint64_t[] per worker, summed
by the coordinator on a 1-second timer.
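A minimal sketch of the pattern (types and names illustrative, not the actual statistics::metrics<>): each worker writes only its own plain, cache-line-aligned counter, and the monitor reader sums them lazily, tolerating slightly stale values:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One slot per worker, padded to a cache line so two workers never
// write the same line (no atomics, no fetch_add on the hot path).
struct alignas (64) worker_counter
{
  uint64_t events = 0;   // written only by the owning worker thread
};

// Lazy aggregation: the reader walks all slots. The sum may be a few
// events behind each writer; for monitoring that is acceptable.
uint64_t lazy_sum (const std::vector<worker_counter> &counters)
{
  uint64_t total = 0;
  for (const auto &c : counters)
    total += c.events;
  return total;
}
```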
Common DBMS Design
The shared design space for connection front-ends has narrowed since the C10K era. Almost every modern engine sits at one of four points on the threads × event loop matrix.
PostgreSQL — process per connection. postmaster forks a
postgres backend process per accepted connection. The model gives
strong isolation (a crashing backend can be restarted without
killing peers) at the cost of high per-connection memory (≥10 MB).
The PostgreSQL community has consistently rejected proposals to
replace the model in core; instead, the project recommends external
poolers such as PgBouncer for high-concurrency workloads. There is
no equivalent of CUBRID’s “one CPU-pinned event loop per N
connections” inside PostgreSQL itself.
MySQL — thread-per-connection by default; thread pool plugin
optional. The default Connection_handler_manager runs
one-thread-per-connection, giving each TCP session a dedicated
pthread. The Enterprise Thread Pool plugin replaces this with a
fixed number of thread groups (typically equal to core count) plus
a small admission queue per group. The plugin exists exactly because
the unbounded thread-per-connection model collapses past a few
hundred concurrent sessions on the same workloads CUBRID measured in
CBRD-26152. CUBRID’s redesign moves into this same architectural
neighbourhood — bounded connection workers, group-based task
dispatch, admission via budgets — without making it a plugin.
Oracle — dedicated server vs. shared server (DRCP). The default mode is dedicated-server (process per session). Shared-server mode multiplexes many sessions onto a small pool of server processes via a dispatcher that owns the listening socket and passes requests through queues. Database Resident Connection Pooling (DRCP) generalises this so multiple application servers share the same backend pool. CUBRID’s coordinator has the same arbitration role as the Oracle dispatcher, but with finer per-worker statistics and an auto-scaling rule.
SQL Server — SOS scheduler (cooperative). SQL Server’s SOS scheduler runs a fixed number of worker threads (≈ logical core count) and switches them cooperatively at well-defined yield points inside the engine. Connections are attached to schedulers rather than owning a thread of their own. The CUBRID redesign is closer to this model than to PostgreSQL’s: connection workers are CPU-pinned, fixed in count within a min/max range, and process many sessions per loop iteration.
Where legacy CUBRID sat. Before CBRD-26177 the server ran a
polling thread per connection (each css_master_thread-spawned
session looped on its own socket) plus a cubthread::worker_pool of
size max_clients — see cubrid-thread-worker-pool.md for the
detailed walkthrough. With max_clients set to 2000 the engine
genuinely held ≥4000 threads at full saturation. Each polling thread
contended for the worker-pool’s per-core mutex on every job
dispatch; CBRD-26152 measured the result on YCSB-a as monotonically
decreasing throughput as concurrency rose, with the CPU spending the
extra cycles parked waiting on the mutex rather than running user code.
Where the redesign sits. With CBRD-26177 the front becomes a
small set (min_connection_worker … max_connection_worker,
defaults 4 … cores/2) of epoll-driven cubconn::connection::worker
threads each pinned to a core; the back stays a cubthread::worker_pool
sized by task_group × task_worker (renamed from
thread_core_count × the old worker count). A single
cubconn::connection::coordinator thread, also pinned, brokers
new-client placement, rebalancing, and auto-scaling. The hot path
(connection worker → task push → task worker pop) no longer takes a
shared mutex except briefly for css_conn_entry::cmutex /
rmutex, both of which are per-connection.
Motivation (CBRD-26152 + CBRD-26177)
CBRD-26152 — “[Survey] 동시성 증가에 따른 CPU idle 증가 원인 조사” (“Survey of why CPU idle rises when concurrency increases”) — is the empirical study that motivated the redesign. Yechan Hong ran YCSB workload-b (read 95%, update 5%) with the client/CAS cap at 2000 and swept thread counts from 200 to 1000. The unexpected finding was quoted directly in the ticket:
“스레드의 개수가 200개에서 1000개로 증가하였지만, 오히려 iowait가 아닌 CPU idle이 증가하고 있다.” (As the thread count increased from 200 to 1000, CPU idle — not iowait — increased.)
If the bottleneck were disk, more threads would have shown up as iowait. CPU idle rising under load instead pointed at internal synchronization: threads arriving at the worker-pool dispatch mutex faster than the holder could release it, then the kernel parking them, leaving cores genuinely idle.
CBRD-26177 names two structural causes:
“각 connection 스레드들이 모두 따로 polling하고 cub_server는 이론 상 max_clients × 2 이상의 thread를 가지게 되므로 자원 및 관리 관점에서 비효율적이다.” (Each connection thread polls independently, and cub_server theoretically holds at least max_clients × 2 threads, which is inefficient from both a resource and management perspective.)
“동시성이 점차 높아질수록 각각이 core의 mutex를 잡고 job을 할당 받으려고 하므로 이 contention은 CPU가 idle에 있게 하는 주요 병목 지점이 된다.” (As concurrency rises, each thread contends for a core’s mutex to be assigned a job; this contention is the main bottleneck that keeps the CPU idle.)
The resulting goals were:
- Replace per-connection polling with a small bounded set of epoll-driven connection workers — eliminate excessive poll() calls (Acceptance Criterion 1 of CBRD-26177).
- Make throughput monotonic in concurrency — additional clients should not degrade the rate (Acceptance Criterion 2).
- Add admission-style backpressure inside each worker (CBRD-26392) so a single fat connection cannot starve its peers.
- Add load-aware placement and dynamic resizing (CBRD-26406, CBRD-26407, CBRD-26424) so the engine self-tunes between idle and saturated regimes.
- Strip atomics off the monitoring hot path (CBRD-26191).
CBRD-26177 also issued a hard directive that shaped every subsequent ticket and shapes this document:
“connection worker는 매우 동시성이 높은 hot-path이므로 perfmon 계열의 모니터링 코드를 추가해서는 안된다. 심각한 성능 저하를 일으킬 수 있다.” (The connection worker is a very high-concurrency hot path, so perfmon-class monitoring code must not be added. It can cause serious performance degradation.)
This is the single most important constraint to keep in mind when
reading the source: anything that smells like a global atomic
counter or a perfmon_inc_stat() call on the worker tick is a
regression.
CUBRID’s Approach
The redesign is best understood as three figures, mirroring the diagram pages of the EPIC: the AS-IS baseline, the TO-BE state after CBRD-26212/26255, and the post-CBRD-26407 state after the coordinator is added.
Architecture diagrams
AS-IS (legacy). Each accepted client got a dedicated polling
thread. Each polling thread, on every iteration, would push a
task into the shared cubthread::worker_pool of size
max_clients. The push acquired a per-core mutex; with hundreds
of polling threads the mutex was contended on every dispatch.
```mermaid
flowchart LR
    subgraph "Front (legacy) — N == active clients"
        p1["polling thread 1<br/>poll(fd1)"]
        p2["polling thread 2<br/>poll(fd2)"]
        pN["polling thread N<br/>poll(fdN)"]
    end
    subgraph "Back (legacy) — task workers (size = max_clients)"
        direction TB
        M["per-core mutex<br/>(shared dispatch)"]
        W1["worker 1"]
        W2["worker 2"]
        WK["worker K"]
    end
    p1 --> M
    p2 --> M
    pN --> M
    M --> W1
    M --> W2
    M --> WK
```
TO-BE (CBRD-26212 + CBRD-26255). A small bounded set of
connection_worker threads each runs an epoll_wait loop with
edge-triggered I/O over many client sockets. Each connection
worker is CPU-pinned. When a complete request arrives, the
connection worker calls css_push_server_task into the back-end
task pool. The number of connection workers is controlled by
min_connection_worker/max_connection_worker; the task pool is
sized by task_group × task_worker.
```mermaid
flowchart LR
    subgraph "Front (TO-BE) — bounded epoll workers"
        cw1["connection_worker 0<br/>epoll_wait()"]
        cw2["connection_worker 1<br/>epoll_wait()"]
        cwM["connection_worker M-1<br/>epoll_wait()"]
    end
    subgraph "Back — task workers (task_group × task_worker)"
        direction TB
        G0["group 0<br/>workers"]
        G1["group 1<br/>workers"]
        GG["group g-1<br/>workers"]
    end
    client1 -.fd.- cw1
    client2 -.fd.- cw1
    client3 -.fd.- cw2
    clientK -.fd.- cwM
    cw1 -- "css_push_server_task(idx)" --> G0
    cw2 --> G1
    cwM --> GG
```
Post-CBRD-26407 (coordinator + freelist). A single
coordinator thread, pinned to core 0, owns placement
(new-client → worker), rebalancing (move existing connections
between workers when load skews), and auto-scaling
(hibernate/awaken workers within min..max). Workers send
statistics to it on a slow timer; the coordinator broadcasts
control messages back. Inside each worker, contexts are claimed
from a per-pool freelist instead of new/delete-allocated each
time.
```mermaid
flowchart LR
    C["coordinator<br/>(pinned, core 0)"]
    subgraph "Connection workers (current = 4..max)"
        cw0["worker 0"]
        cw1["worker 1"]
        cwN["worker N-1"]
    end
    FL["pool::freelist<br/>(context cache)"]
    TP["task worker pool<br/>(task_group × task_worker)"]
    CTRL["controller socket<br/>(/tmp/cub_server_∗_coordinator.sock)"]
    C -- "NEW_CLIENT / HANDOFF / HIBERNATE / AWAKEN" --> cw0
    C -- "..." --> cw1
    C -- "..." --> cwN
    cw0 -- "STATISTICS / RETURN_TO_POOL / HANDOFF_REPLY" --> C
    cw0 -- "claim_context / retire_context" --> FL
    cw1 --> FL
    cwN --> FL
    cw0 -- "css_push_server_task" --> TP
    cw1 --> TP
    cwN --> TP
    CTRL -.SHOW_STATS / SCALE_UP / SCALE_DOWN / CLIENT_MOVE.- C
```
Connection worker (CBRD-26212)
The connection worker is implemented as cubconn::connection::worker
in connection_worker.{cpp,hpp}. It owns:
- a Linux epoll instance (cubsocket::epoll m_events);
- two file descriptors registered into that epoll: an eventfd (m_eventfd) for inter-thread wakeups and a timerfd (m_timerfd) for periodic work (hibernation check, statistics push, HA close-all);
- two per-worker message queues (IMMEDIATE, LAZY) implemented with tbb::concurrent_queue<message> and an atomic size counter;
- the live set of context * it owns (m_context), and a deferred removal queue (m_removed_context);
- two budget knobs (m_recv_budget, m_send_budget) and an exhausted-context map (m_exhausted);
- an atomic-free statistics::metrics<statistics::worker> m_stats for self-reporting to the coordinator.
The constructor wires the epoll, registers the eventfd/timerfd, installs three timer handlers, and spawns the worker thread:
```cpp
// worker::worker — src/connection/connection_worker.cpp
m_recv_budget = static_cast<size_t> (prm_get_integer_value (PRM_ID_CSS_RECV_BUDGET_PER_CONNECTION));
m_send_budget = static_cast<size_t> (prm_get_integer_value (PRM_ID_CSS_SEND_BUDGET_PER_CONNECTION));
m_exhausted.reserve (128);

m_eventfd = eventfd (0, EFD_NONBLOCK | EFD_CLOEXEC);
m_timerfd = timerfd_create (CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);
// ... eventfd_register both into m_events ...

eventfd_addtimer (timer_type::HIBERNATE, timer_latency::MEDIUM_LATENCY, &worker::hibernate_check);
eventfd_addtimer (timer_type::STATISTICS, timer_latency::MEDIUM_LATENCY, &worker::statistics_metrics_to_coordinator);
eventfd_addtimer (timer_type::HA, timer_latency::HIGH_LATENCY, &worker::ha_close_all_connections);

m_thread = std::thread (&worker::attach, this);
```

worker::attach is the thread entry point; it calls
initialize → run → finalize. initialize pins the thread to its
assigned core via os::resources::cpu::setaffinity (m_core),
claims a cubthread::entry, and sets the thread name to
"connections" (a name that, as we shall see, leaks into the
task pool in CBRD-26617).
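The pinning step itself is ordinary Linux affinity plumbing. A sketch of what os::resources::cpu::setaffinity presumably reduces to (an assumption, Linux-only; the function name here is hypothetical):

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>

// Bind the calling thread to one logical core so the epoll loop keeps
// its cache working set and never migrates mid-tick.
bool pin_self_to_core (int core)
{
  cpu_set_t set;
  CPU_ZERO (&set);
  CPU_SET (core, &set);
  return pthread_setaffinity_np (pthread_self (), sizeof set, &set) == 0;
}
```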
The main loop is the textbook reactor:
```cpp
// worker::run — src/connection/connection_worker.cpp
while (!m_stop)
  {
    nfds = m_events.wait (events.data (), events.size (),
                          m_exhausted.empty () ? TIMEOUT_INFINITE : TIMEOUT_NOWAIT);
    // ...
    for (i = 0; i < nfds; i++)
      {
        ctx = reinterpret_cast<context *> (events[i].data.ptr);
        if ((events[i].events & (EPOLLHUP | EPOLLRDHUP | EPOLLERR)) && ...)
          {
            this->handle_hangup_or_error (ctx, events[i].events & EPOLLERR);
            continue;
          }
        if (events[i].events & EPOLLIN)
          {
            if (ctx->m_conn->fd == m_eventfd)
              {
                eventfds[0] = true;
                continue;
              }
            if (ctx->m_conn->fd == m_timerfd)
              {
                eventfds[1] = true;
                continue;
              }
            status = this->handle_reception (ctx, false);
            // ...
          }
        if (events[i].events & EPOLLOUT)
          {
            status = this->handle_transmission (ctx, false);
          }
      }

    if (m_exhausted.size () > 0)
      {
        handle_exhausted ();
      }
    if (eventfds[0] || eventfds[1])
      {
        eventfd_handler (eventfds);
      }
  }
```

Note the timeout switch: when there are exhausted contexts to
re-drive (see Send/recv budgets below) the loop polls with
TIMEOUT_NOWAIT so it can immediately revisit them, otherwise it
blocks indefinitely on epoll_wait. The eventfd is the single
inter-thread doorbell — any outside producer (the coordinator,
another connection worker handing off, a task worker returning a
buffer) writes 1 into m_eventfd and the worker drains its
in-process queue once the loop wakes.
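The doorbell idiom itself can be reduced to two calls (illustrative names, not CUBRID's wrappers). A useful property of eventfd is that it coalesces: many rings between wakeups cost the worker a single read:

```cpp
#include <cassert>
#include <cstdint>
#include <sys/eventfd.h>
#include <unistd.h>

// Producer side: any thread may ring; writes add to the kernel counter.
void ring_doorbell (int efd)
{
  uint64_t one = 1;
  ssize_t n = write (efd, &one, sizeof one);
  (void) n;   // a full counter is effectively "already rung"
}

// Consumer side: one read clears the counter and reports how many rings
// arrived since the last read (0 if nobody rang, with EFD_NONBLOCK).
uint64_t clear_doorbell (int efd)
{
  uint64_t count = 0;
  if (read (efd, &count, sizeof count) != sizeof count)
    return 0;
  return count;
}
```

After `clear_doorbell` returns non-zero, the worker drains its whole in-process queue, so missing the exact ring count is harmless.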
The connection::context (connection_context.hpp) is the
per-client object the worker owns. It contains the
css_conn_entry *m_conn, a worker index, a unique 64-bit id, a
receive state machine (HEADER → DATA → ERROR), the receiver and
transmitter, and an inline statistics::metrics<statistics::context>.
A complete request (header + optional data) is parsed inside
worker::handle_reception → handle_packet → handle_header_packet
or handle_data_packet, and the task push into the back-end pool
happens at handle_command_header_packet (when the request has no
following data) or handle_data_packet (after the data arrives):
```cpp
// worker::push_task_into_worker_pool — src/connection/connection_worker.cpp
void
worker::push_task_into_worker_pool (context *ctx)
{
  /* push new task into worker pool */
  css_push_server_task (*ctx->m_conn);
}
```

That single call is the entire interface between the new front
and the legacy back. css_push_server_task (in
server_support.c) wraps the connection in a css_server_task
and routes it to the cubthread worker pool with
push_task_on_core (..., conn_ref.idx, conn_ref.in_method) — the
core hash being the connection index, exactly as in the legacy
design, so a long-running session keeps affinity for the same
back-end core.
Connection lifecycle (close path) is driven by
worker::handle_connection_close. It serialises against
ctx->m_conn->cmutex, drains any in-flight task workers via
net_server_active_workers, retries (re-enqueues a
SHUTDOWN_CLIENT on the LAZY queue) if back-end workers are
still active, and on success removes the fd from epoll, marks the
context m_removed = true, and pushes it into
m_removed_context. The actual context return to the pool is
deferred to purge_stale_contexts, which sends a single
RETURN_TO_POOL message to the coordinator with the batched
list — so the freelist is touched once per loop tick, not once
per closed connection.
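The batching idea (accumulate closed contexts during the tick, then return them in one message) can be sketched as follows; the type and names are hypothetical, standing in for m_removed_context plus the RETURN_TO_POOL send:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Closed contexts are recorded by id during the loop tick and handed
// back in one batch, so the coordinator/freelist round-trip happens
// once per tick rather than once per closed connection.
struct removal_batch
{
  std::vector<uint64_t> ids;   // contexts closed during this tick

  void defer (uint64_t ctx_id)
  {
    ids.push_back (ctx_id);
  }

  // End of tick: the real code would wrap the returned list in a single
  // RETURN_TO_POOL message to the coordinator.
  std::vector<uint64_t> flush ()
  {
    return std::exchange (ids, {});
  }
};
```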
Connection pool (CBRD-26255)
The pool (cubconn::connection::pool in
connection_pool.{cpp,hpp}) is the owner of workers, coordinator,
and context freelist. It exists for the lifetime of the server
and is held by cub_server as a single instance.
The freelist itself is a singly-linked stack of pool::freelist
nodes, each of which embeds the actual context as its first
member so that reinterpret_cast<freelist *> (ctx) recovers the
node. The trick replaces the legacy “new context per connection”
allocation pattern:
```cpp
// pool::freelist — src/connection/connection_pool.hpp
struct freelist
{
  /* THIS MUST BE THE FIRST */
  context m_context;
  freelist *m_next;

  freelist (std::size_t capacity) : m_context (capacity), m_next (nullptr) {}
  ~freelist () = default;
};
```

```cpp
// pool::claim_context / retire_context — src/connection/connection_pool.cpp
context *
pool::claim_context ()
{
  freelist *head;
  assert (m_mutex_holder == std::this_thread::get_id ());

  head = m_freelist.m_head;
  if (head)
    {
      m_freelist.m_head = m_freelist.m_head->m_next;
    }
  else
    {
      head = new freelist (32 * 1024);
    }
  m_freelist.m_claim++;

  return &head->m_context;
}

void
pool::retire_context (context *ctx)
{
  freelist *head;
  // ...
  head = reinterpret_cast<freelist *> (ctx);
  head->m_context.reset ();
  if (m_freelist.m_claim > m_freelist.m_max)
    {
      delete head;   /* over-cap: actually free */
    }
  else
    {
      head->m_next = m_freelist.m_head;
      m_freelist.m_head = head;
    }
  m_freelist.m_claim--;
}
```

The freelist is only manipulated by code holding pool::m_mutex.
The coordinator’s handle_message_queue_new_client (which calls
claim_context) and handle_message_queue_return_to_pool (which
calls retire_context) both run on the coordinator thread, and
the coordinator holds the pool lock for its entire lifetime
(see coordinator::initialize → m_parent->lock_resource ()).
This is the design choice that makes context allocation
single-threaded without ever needing per-context atomics.
pool::initialize is wired via pool::initialize_topology, which
maps the requested max_connection_workers onto an actual NUMA
core layout via os::resources::cpu::effective () and may
additionally serialise NIC RX/TX IRQ to those cores via
os::resources::net::map_nic_to_index (cores). CBRD-26255 also
provides this NIC-pinning, which is the source of the warning
log messages discussed in the ticket comments
(warning: NIC channel configuration failed) — they are
non-fatal, surfacing only when the binary lacks CAP_NET_ADMIN
or runs in a virtualised environment.
The shutdown sequence uses a thread_watcher (a bare condvar
plus int active) to count down workers as they exit, and
pool::finalize_workers waits up to css_get_shutdown_timeout()
for m_watcher->active == 1 (only the coordinator left), then
pool::finalize_coordinator waits for active == 0. Failure
to reach those states triggers _exit(0) after a 10 s
try-lock loop in try_to_lock_resource — a deliberate hard
exit because the alternative is to wait forever for a thread
holding state nothing else can clean up.
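A minimal sketch of a thread_watcher-style countdown, assuming the bare condvar-plus-int shape the text describes (this is not the actual CUBRID struct):

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Workers call enter() at startup and leave() on exit; the finalizer
// waits with a deadline for the active count to fall to a target
// (1 = only the coordinator left, 0 = everything gone).
struct thread_watcher
{
  std::mutex m;
  std::condition_variable cv;
  int active = 0;

  void enter ()
  {
    std::lock_guard<std::mutex> g (m);
    ++active;
  }

  void leave ()
  {
    {
      std::lock_guard<std::mutex> g (m);
      --active;
    }
    cv.notify_all ();
  }

  // Returns true if active dropped to target before the timeout;
  // false means someone is stuck and the hard-exit path takes over.
  bool wait_until_active (int target, std::chrono::milliseconds timeout)
  {
    std::unique_lock<std::mutex> lk (m);
    return cv.wait_for (lk, timeout, [&] { return active <= target; });
  }
};
```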
Send/recv budgets (CBRD-26392)
The budget mechanism is the single most subtle part of the
design. Without it, edge-triggered epoll plus a draining reader
would let a single client with backlog monopolise its worker:
once EPOLLIN fires, the reader is contractually obliged to
drain until EAGAIN; if the peer keeps writing, that drain
loop never returns. CBRD-26392 caps the drain per epoll tick.
Quoting the ticket directly:
“하나의 connection worker는 여러 connection들을 관리한다. 이때 하나의 긴 송수신을 수행하게 되면 다른 송수신들이 계속 blocked되며 response가 지연되게 된다. 이때 한 번에 송수신할 수 있는 양을 제한하여 전체 지연을 안정화한다.” (One connection worker manages many connections. If a single long send or receive runs, the other I/Os remain blocked and their response is delayed. Bound the amount that can be sent or received at once to stabilise the overall latency.)
Defaults: 16 KB receive, 32 KB send (see system_parameter.c).
Both can be set as low as 0 (no limit) or as high as 1 GB.
The implementation lives partly in receiver::drain /
transmitter::fill (their second argument is a size_t limit = 0
budget) and partly in
worker::handle_reception / worker::handle_transmission /
worker::handle_exhausted_add_context /
worker::handle_exhausted (connection_worker.cpp).
```cpp
// worker::handle_reception — src/connection/connection_worker.cpp
io_status = ctx->m_recv.m_receiver.drain (ctx->m_conn->fd, m_recv_budget);
if (io_status == result::PeerReset || io_status == result::Error)
  {
    /* close */
  }

assert (io_status == result::Pending || io_status == result::BudgetExhausted);

if (!in_exhausted && io_status == result::BudgetExhausted)
  {
    handle_exhausted_add_context (ctx, EPOLLIN);
  }

// worker::handle_transmission — src/connection/connection_worker.cpp
status = ctx->m_send.m_transmitter.fill (ctx->m_conn->fd, m_send_budget);
// ...
else if (!in_exhausted && status == result::BudgetExhausted)
  {
    handle_exhausted_add_context (ctx, EPOLLOUT);
  }
```

When a context exhausts its budget, it lands in
m_exhausted keyed by context id. The main loop notices the
non-empty exhausted map and switches epoll_wait to
TIMEOUT_NOWAIT, then re-drives those contexts via
handle_exhausted after serving the current epoll batch. The
prepared flag in exhausted_context is the deferral guard:
the first time a context is added it is marked !prepared and
skipped; only on the second visit does the worker re-drain it.
This ensures every other ready fd in the current epoll batch gets
serviced before the budget-exceeded context is revisited.
The flow control finite-state machine for one fd:
```mermaid
stateDiagram-v2
    [*] --> Idle
    Idle --> Reading : EPOLLIN \n handle_reception
    Reading --> Idle : drain Pending \n EAGAIN
    Reading --> Exhausted : drain BudgetExhausted \n add to m_exhausted, EPOLLIN
    Exhausted --> Reading : revisit on next loop \n prepared flag
    Idle --> Writing : EPOLLOUT \n handle_transmission
    Writing --> Idle : fill Ok
    Writing --> Exhausted : fill BudgetExhausted \n add to m_exhausted, EPOLLOUT
    Reading --> Closing : ClosedConnection or PeerReset
    Writing --> Closing : ClosedConnection or PeerReset
    Closing --> [*] : handle_connection_close
```
Note that result::BudgetExhausted is a distinct enum value from
result::Pending — the difference being that Pending means “the
kernel has no more bytes for me right now” (back-off naturally
until next epoll edge) while BudgetExhausted means “I have more
bytes available but I’m yielding voluntarily” (must come back
this loop or the next).
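The distinction can be made concrete with a budgeted drain over a non-blocking socket. This is a sketch, not receiver::drain; the enum here carries only the subset of outcomes relevant to the budget logic:

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <unistd.h>

// Illustrative subset of the real result enum.
enum class result { Pending, BudgetExhausted, Error };

// Drain up to `budget` bytes (0 = unlimited). Pending means the kernel
// ran out of bytes (wait for the next edge); BudgetExhausted means we
// stopped voluntarily and must be revisited this loop or the next.
result drain_with_budget (int fd, char *buf, size_t buflen,
                          size_t budget, size_t &drained)
{
  drained = 0;
  while (budget == 0 || drained < budget)
    {
      size_t want = buflen;
      if (budget != 0 && budget - drained < want)
        want = budget - drained;
      ssize_t n = recv (fd, buf, want, 0);
      if (n > 0)
        {
          drained += static_cast<size_t> (n);
          continue;
        }
      if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return result::Pending;
      return result::Error;   // 0 (peer closed) or a real error
    }
  return result::BudgetExhausted;
}
```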
Auto scaling (CBRD-26406)
CBRD-26406 wires the mechanism for connection-rebalancing and
worker-count scaling; the policy lives in CBRD-26424
(score-based selection, below). The mechanism is simple in
shape: workers report statistics on a 1-second timer, the
coordinator’s 5-second REBALANCING timer compares per-worker
scores and asks the heaviest worker to hand off one of its
connections to the lightest, the coordinator’s 60-second
SCALING timer drives the auto-scaling state machine.
The scaling_status enum has only two states:
- STABLE — current count is “good enough”, no measurement in progress.
- TRIAL — sweep through count candidate sizes recording their throughput score, then pick the best.
At each SCALING tick:
```cpp
// coordinator::statistics_scaling — src/connection/coordinator.cpp
if (m_scaling_statistics.status == scaling_status::STABLE)
  {
    this->scale_trial ();
    return true;
  }

assert (m_scaling_statistics.status == scaling_status::TRIAL);

bytes_inout = 0;
for (i = 0; i < m_max_worker; i++)
  {
    bytes_inout += m_statistics[i].m_sum.get (statistics::context::BYTES_IN_TOTAL);
    bytes_inout += m_statistics[i].m_sum.get (statistics::context::BYTES_OUT_TOTAL);
  }
m_scaling_statistics.history.push_back ({ m_current_worker,
                                          VAL_TO_SCORE (50, 1000, bytes_inout) + m_task_statistics.completed.first * 2 });
m_scaling_statistics.count--;

if (m_scaling_statistics.count == 0)
  {
    selected = this->scale_selection ();   /* pick max-score scale */
    if (selected < m_current_worker)
      this->scale_down ();
    else if (selected > m_current_worker)
      this->scale_up ();
    /* else stable */
  }
else
  {
    if (m_scaling_statistics.direction == scaling_direction::DOWN)
      this->scale_down ();
    else
      this->scale_up ();
  }
```

scale_trial clears the history, alternates the trial direction
relative to the previous one (so consecutive trials don’t drift
uni-directionally), and sets count to the
auto_scaling_window_size parameter — the hyper-parameter that
trades trial length for sensitivity. The default of 4 means each
trial collects 4 samples (one per SCALING tick = 60 s) before
deciding.
Sliding-window mechanism:
```mermaid
sequenceDiagram
    participant T as SCALING timer (60s)
    participant C as coordinator
    participant H as history (window_size = 4)
    Note over C: status = STABLE
    T->>C: tick
    C->>C: scale_trial()
    Note over C: direction = DOWN (or UP)<br/>count = 4<br/>status = TRIAL
    loop count = 4
        T->>C: tick
        C->>H: push_back({ current_worker, score })
        C->>C: scale_down() or scale_up()
    end
    T->>C: tick
    C->>H: push_back({ current_worker, score })
    C->>C: selected = scale_selection()
    alt selected != current
        C->>C: scale_down() or scale_up() to reach selected
    end
    Note over C: status = STABLE again
```
scale_selection picks any sample within 95% of the maximum
score, then chooses uniformly among them — a small deliberate
randomisation to avoid getting stuck at a flat local maximum
(see CBRD-26424 commentary on the dual local maxima observed in
small-machine measurements).
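The selection rule can be sketched as follows. The shape is inferred from the description above, not copied from coordinator.cpp, and the `seed` parameter stands in for whatever randomness the real code uses:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// history: (worker count, throughput score) samples from one trial.
// Keep every candidate within 95% of the best score, then pick one
// uniformly; near-ties are treated as equivalent so the choice does
// not lock onto a flat local maximum.
int select_scale (const std::vector<std::pair<int, double>> &history,
                  unsigned seed)
{
  double best = 0.0;
  for (const auto &h : history)
    if (h.second > best)
      best = h.second;

  std::vector<int> near_best;
  for (const auto &h : history)
    if (h.second >= best * 0.95)
      near_best.push_back (h.first);

  return near_best[seed % near_best.size ()];
}
```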
scale_up flips the next-in-line hibernating worker out of
HIBERNATING by sending an AWAKEN lazy message to it and
incrementing m_current_worker. scale_down does the reverse
in two phases: scale_down itself migrates every connection of
the draining worker via transfer_connection and parks the
coordinator status as DRAINING; scale_down_finish is the
actual hibernation, called from
handle_message_queue_statistics only once the draining worker
reports an empty context list. This two-phase shutdown is
necessary because worker shutdown is asynchronous and the
coordinator must not allow a worker to be re-targeted by
statistics_find_score_extremes while it is still serving
connections.
Coordinator + context freelist (CBRD-26407)
The coordinator (cubconn::connection::coordinator in
coordinator.{cpp,hpp}) is structurally the same shape as a
worker — pinned thread, epoll instance, eventfd + timerfd,
single-producer-single-consumer (TBB) queue — but it owns three
distinct timers and an external Unix-domain control socket.
```cpp
// coordinator::coordinator — src/connection/coordinator.cpp
m_controller.open ("/tmp/cub_server_" + std::to_string (getpid ()) + "_coordinator.sock",
                   SOCK_NONBLOCK | SOCK_CLOEXEC);
m_ctrlfd = m_controller.get_fd ();
m_eventfd = eventfd (0, EFD_NONBLOCK | EFD_CLOEXEC);
m_timerfd = timerfd_create (CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);

eventfd_register (m_eventfd);
eventfd_register (m_timerfd);
eventfd_register (m_ctrlfd);

eventfd_addtimer (timer_type::STATISTICS, timer_latency::LOW_LATENCY, &coordinator::statistics_update);
eventfd_addtimer (timer_type::REBALANCING, timer_latency::MEDIUM_LATENCY, &coordinator::statistics_rebalancing);
eventfd_addtimer (timer_type::SCALING, timer_latency::HIGH_LATENCY, &coordinator::statistics_scaling);
```

The three timer latencies are 1 second / 5 seconds / 60 seconds
respectively (see the timer_latency enum in coordinator.hpp).
The control socket exposes administrative commands:
- SHOW_STATS — print per-worker EWMA throughput and queue depth (statistics_print) to stdout.
- SCALE_UP / SCALE_DOWN — force one step of the auto-scaling state machine.
- CLIENT_MOVE — manually transfer one connection by id from worker from to worker to.
This is an out-of-band debugging interface; nothing in the data
path uses it. Sending a control_recv struct via
SOCK_DGRAM/SOCK_NONBLOCK triggers a reply with a single
OK/NOK byte. The directive from CBRD-26177 (“no perfmon on
the hot path”) means there is no SHOW server-side equivalent
through the standard server channel — the controller is
intentionally a side door, not a performance counter.
coordinator::handle_message_queue_new_client is where the
placement policy lands. Note that it calls the same EWMA-driven
score-extremes function used by rebalancing:
```cpp
// coordinator::handle_message_queue_new_client — src/connection/coordinator.cpp
std::tie (worker, std::ignore) = statistics_find_score_extremes ();

m_statistics[worker].m_contexts.emplace (id, std::pair</*EWMA*/, /*prev*/>{ });
m_statistics[worker].m_client_num++;

request.type = connection::worker::message_type::NEW_CLIENT;
request.ctx = m_parent->claim_context ();
request.ctx->m_worker = worker;
request.ctx->m_id = id++;
request.conn = item.conn;
workers[worker]->enqueue (queue_type::IMMEDIATE, std::move (request));
workers[worker]->notify ();
this->statistics_update_score (worker);
```

— so every new client is immediately routed to the worker with the lowest current score, and that score is updated on the spot to bias the next placement.
The context migration protocol (used by both rebalancing and scale-down) is a four-step handshake between coordinator and two workers:
```mermaid
sequenceDiagram
    participant C as coordinator
    participant Wf as worker[from]
    participant Wt as worker[to]
    C->>C: m_migrating.insert(id)
    C->>Wf: HANDOFF_CLIENT(id, worker_ptr=Wt, worker_index=to)
    Wf->>Wf: locate ctx, remove from epoll/m_context
    Wf->>Wt: TAKEOVER_CLIENT(ctx)
    Wt->>Wt: register ctx in epoll (EPOLLIN | maybe EPOLLOUT)
    Wf->>C: HANDOFF_REPLY(transferred=true, id, from, to)
    C->>C: m_migrating.erase(id)<br/>fix m_statistics[from/to]
```
m_migrating prevents a connection from being targeted twice in
flight. If the worker discovers the context is already gone (the
client closed concurrently with the migration), the reply carries
transferred=false and the coordinator reverts the projected
stats. This is the single concurrency invariant the design
relies on: a context is only ever owned by exactly one
connection worker at a time, with ownership transferred via
explicit message. No locks are required around the context
itself — only the conn entry’s cmutex, briefly, for adapter
field updates.
The context freelist (described above under CBRD-26255) was finalised in this same ticket. The CBRD-26407 description states the goal directly:
“context는 생성마다 Physical Memory와 Virtual Memory를 할당받고 이를 mapping하므로 이 과정을 생략하도록 한다.” (Each context creation allocates physical and virtual memory and maps the two, so skip this process.)
By pre-warming the freelist with max_connections * 1.1
preallocated freelist nodes (each 32 KB capacity), the runtime
hot path is a pointer swap, not a mmap/page-fault sequence.
Score-based connection assignment (CBRD-26424)
The coordinator’s score function combines three signals into a single comparable scalar per worker:
```cpp
// coordinator::statistics_update_score — src/connection/coordinator.cpp
m_statistics[worker].m_score =
    1 * static_cast<double> (m_statistics[worker].m_client_num) / 1
    + EVAL_WORKER (EWMA (MQ_COMPLETED), EWMA (BLOCKED_RMUTEX))
    + EVAL_CONTEXT (EWMA (BYTES_IN_TOTAL) + EWMA (BYTES_OUT_TOTAL),
                    EWMA (RECV_BUDGET_HIT) + EWMA (SEND_BUDGET_HIT));
```

with the weight macros

```cpp
#define VAL_TO_SCORE(w, m, s) ((w) * static_cast<double> (s) / (m))
#define EVAL_WORKER(mq, rmutex) (VAL_TO_SCORE (25, 3.5, (mq)) + VAL_TO_SCORE (500, 1, (rmutex)))
#define EVAL_CONTEXT(bytes, bgt) (VAL_TO_SCORE (50, 1000, (bytes)) + VAL_TO_SCORE (10, 1, (bgt)))
```

Concretely the weights mean: bytes-of-traffic count for 50 ×
1/1000 (≈ 1 unit per kilobyte); rmutex blocked microseconds count
for 500 × 1 (≈ 500 units per microsecond blocked); MQ completions
count for 25 × 1/3.5 (≈ 7 units per completion). Budget-hit
events (i.e., contexts that hit the recv/send budget cap) are
weighted at 10 — because a high budget-hit count means the worker
is repeatedly running into its admission cap and would benefit
from an extra peer to share load. CBRD-26424’s commentary
explains the dual local maxima visible in measured throughput
curves: small machines exhibit a non-monotonic relationship
between worker count and throughput because of NUMA / RX-TX /
HT-sibling interactions, and a naïve hill-climber gets stuck.
The randomised top-5% selection in scale_selection is the
escape hatch.
EWMA aggregation uses α = 0.06 (EWMA_ALPHA):
```cpp
// coordinator::statistics_EWMA — src/connection/coordinator.hpp
acc = acc * (1 - alpha) + (current - prev) * (alpha / (time_delta * 1e-6));
prev = current;
```

The division by time_delta * 1e-6 normalises to microseconds, so
the EWMA is a smoothed rate (events per microsecond) rather
than a raw delta. With α = 0.06 and a 1 s sampling interval the
effective half-life is roughly 11 samples (≈ 11 s); aged samples
contribute less than 1 % after roughly 75 seconds ((0.94)^75 ≈ 0.0097).
Atomic-free stats (CBRD-26191)
The statistics::metrics<T, VT = uint64_t> template
(connection_statistics.hpp) is a fixed-size VT[STATS_COUNT]
with add / sub / get / set / reset operations. There is
no std::atomic anywhere — every increment is a plain memory
write, because every increment is performed by exactly one
thread (the worker that owns the metric). Aggregation across
workers happens once per second, when the worker copies its
metric block into a coordinator::message::statistics payload
and the coordinator does a per-worker EWMA update inside its own
single-threaded handler:
```cpp
// worker::statistics_metrics_to_coordinator — src/connection/connection_worker.cpp
message.type = coordinator::message_type::STATISTICS;
message.statistics.cpu_time_ns = get_time_ns (CLOCK_THREAD_CPUTIME_ID);
message.statistics.time_ns = get_time_ns (CLOCK_MONOTONIC);
message.statistics.worker.first = m_index;
message.statistics.worker.second = m_stats; /* copy */
message.statistics.contexts.reserve (m_context.size ());
for (context *ctx : m_context)
  message.statistics.contexts.emplace_back (ctx->m_id, ctx->m_stats); /* copy */
m_coordinator->enqueue (std::move (message));
```

The bulk copy is cheap because m_stats is a fixed array (≈ 88
bytes) and the per-context array is at most a few hundred entries
of 56 bytes each. The snapshot travels by value through the
single-producer-single-consumer queue, so the coordinator only ever
reads the copy, never a cache line that the worker is concurrently
writing to. Crucially,
this design exists to uphold the CBRD-26177 directive
(“no perfmon on the hot path”): the worker never increments a
shared counter, never spins on a lock, never executes a memory
barrier in the dispatch loop.
CBRD-26191 measured the wider goal — strip atomics from server-wide monitoring — on YCSB:
| workload | before | after | gain |
|---|---|---|---|
| workloada | 58 464.28 | 60 646.59 | +3.7% |
| workloadb | 70 009.99 | 72 976.31 | +4.2% |
| update | 44 158.66 | 45 128.96 | +2.2% |
| mix | 9 440.82 | 10 115.33 | +7.1% |
The connection-side metrics design follows the same template at the new layer.
TCP keepalive tunables
CBRD-26177 promised three new per-socket keepalive parameters:
tcp_keepalive_idle (start probing after N seconds idle),
tcp_keepalive_interval (interval between probes),
tcp_keepalive_count (consecutive failures = dead). The defaults
are 300 s / 300 s / 3, capped at one year (expressed in seconds). They
are registered in system_parameter.c alongside the existing
tcp_keepalive boolean and are intended to be applied by the
socket-setup helper (tcp.c::css_sockopt) which already calls
setsockopt (SOL_SOCKET, SO_KEEPALIVE, ...) when tcp_keepalive
is set; the three new knobs feed TCP_KEEPIDLE, TCP_KEEPINTVL,
TCP_KEEPCNT respectively for fine-grained tuning of dead-peer
detection. The CUBRIDMAN-333 manual update covers the
documentation rollout.
Task worker rework — task_group and task_worker
The back-end pool is still cubthread::worker_pool in
thread_worker_pool_impl.{hpp,cpp}. Its sizing is now controlled
by two parameters that replace the legacy
thread_core_count/thread_worker_count pair:
- task_group (renamed from thread_core_count) — number of cores in the worker pool. Each “core” in CUBRID terminology is a sub-pool with its own queue, owned by one worker_pool::core.
- task_worker — total number of worker threads across all groups. Default at server startup: css_get_max_connections () (i.e., effectively the legacy max_clients), normalised down if it exceeds the system core count.
The auto-tuning code clamps task_group ≤ system core count and
task_group ≤ task_worker (system_parameter.c boot
sysprm tuning block):
```c
/* sysprm_tune_client_parameters — src/base/system_parameter.c */
task_worker_prm = GET_PRM (PRM_ID_TASK_WORKER);
if (PRM_GET_INT (task_worker_prm->value) < 0)
  {
    /* the value of task worker is default. */
    sprintf (newval, "%d", task_worker);	/* css_get_max_connections() */
    (void) prm_set (task_worker_prm, newval, false);
  }

task_group_prm = GET_PRM (PRM_ID_TASK_GROUP);
if (PRM_GET_INT (task_group_prm->value) > system_cpu_count)
  {
    sprintf (newval, "%d", system_cpu_count);
    (void) prm_set (task_group_prm, newval, false);
  }
if (PRM_GET_INT (task_group_prm->value) > PRM_GET_INT (task_worker_prm->value))
  {
    sprintf (newval, "%d", PRM_GET_INT (task_worker_prm->value));
    (void) prm_set (task_group_prm, newval, false);
  }
```

The semantic shift is that task_worker is now interpreted as the
total worker budget and task_group controls partitioning. The
legacy thread_core_count was loosely “number of cores” with no
policy; the new naming makes the intent explicit, and the
coordinator’s task-completion EWMA (m_task_statistics.completed)
uses css_get_task_stats from server_support.c to read the pool’s
running totals into the score.
CBRD-26636 (“[Performance experiment] Performance trend by worker count”) found
that task_worker ≈ 4–6 × cores consistently outperformed
task_worker = max_clients on read-heavy YCSB workloads, but at
the cost of a deadlock risk when task_worker < max_clients and
many workers wait on a long lock. That risk motivates CBRD-26662
(see Cross-check Notes).
Source Walkthrough
Symbols are grouped by subsystem. CBRD-* annotations attribute each symbol to its driving ticket where one is identifiable.
epoll wrapper (CBRD-26212)
- cubsocket::epoll (class, src/base/epoll.hpp) — RAII wrapper over epoll_create1 / epoll_ctl / epoll_wait. Constructor opens an EPOLL_CLOEXEC instance; destructor closes it.
- cubsocket::epoll::wait — thin shim over epoll_wait.
- cubsocket::epoll::add_descriptor — EPOLL_CTL_ADD with optional void *ptr payload (used to thread context pointers through events[i].data.ptr).
- cubsocket::epoll::modify_descriptor — EPOLL_CTL_MOD, used to add/remove EPOLLOUT when the transmitter queues pending data.
- cubsocket::epoll::remove_descriptor — EPOLL_CTL_DEL.
- cubsocket::nonblocking (parent class, nonblocking.hpp) — defines the result enum (Ok, Pending, BudgetExhausted, PeerReset, Error, ClosedConnection, Skewed, Aborted) that every receiver/transmitter/worker call returns.
connection::worker (CBRD-26212 / 26392 / 26406 / 26407 / 26617)
- cubconn::connection::worker — class definition in connection_worker.hpp. Members include m_parent (pool), m_coordinator, m_watcher, the per-thread state (m_thread, m_core, m_status, m_stop, m_entry), the context set (m_context, m_removed_context), the epoll (m_events), the eventfd/timerfd (m_eventfd, m_timerfd), the timer table (m_timer_handler), the dual-priority message queues (m_queue[IMMEDIATE/LAZY], m_queue_size[]), the budget knobs and exhausted map (m_recv_budget, m_send_budget, m_exhausted), and the worker-side metrics (m_stats).
- worker::worker — constructor; reads system parameters, installs three timers, spawns the thread.
- worker::attach — thread entry; calls initialize → run → finalize.
- worker::initialize — sets affinity, claims thread entry, sets pthread name "connections" (the name leak CBRD-26617 caught).
- worker::run — main reactor loop.
- worker::finalize — drain still-open contexts, retire thread entry, signal watcher.
- worker::enqueue / worker::notify / worker::enqueue_and_notify — outside-thread interface.
- worker::push_task_into_worker_pool — single-line bridge to css_push_server_task (the back-end pool).
- worker::handle_reception / worker::handle_transmission — per-fd I/O drivers; honour m_recv_budget / m_send_budget and emit BudgetExhausted. (CBRD-26392)
- worker::handle_exhausted_add_context / worker::handle_exhausted — exhausted-fd revisitation queue. (CBRD-26392)
- worker::handle_message_queue_new_client — bind a fresh context to a fd; register in epoll with EPOLLET|EPOLLIN|EPOLLRDHUP.
- worker::handle_message_queue_handoff_client / worker::handle_message_queue_takeover_client — the two halves of the migration handshake. (CBRD-26406 / CBRD-26407)
- worker::handle_message_queue_send_packet / worker::handle_message_queue_release_packet — task workers shipping bytes back to a connection use these messages instead of writing the socket directly. Sending may add EPOLLOUT to the fd if the transmitter buffers data.
- worker::handle_message_queue_shutdown_client — close-connection request from outside; calls handle_connection_close.
- worker::handle_message_queue_hibernate / worker::handle_message_queue_awaken — auto-scaling state transitions.
- worker::handle_connection_close — six-step close protocol with retry-via-LAZY-queue when back-end task workers still hold the conn.
- worker::statistics_metrics_to_coordinator — every MEDIUM tick (1 s default), copy m_stats plus per-context metrics into a coordinator::message::STATISTICS. (CBRD-26191)
- worker::hibernate_check — every MEDIUM tick, if status is HIBERNATING and m_context.empty (), stop the timer.
- worker::ha_close_all_connections — every HIGH tick, if css_ha_server_state () == HA_SERVER_STATE_TO_BE_STANDBY, forcibly close all idle connections — the HA mode-change path that interacts with CBRD-26523.
connection::pool (CBRD-26255 / 26407)
- cubconn::connection::pool::freelist — the singly-linked context cache node.
- pool::initialize / pool::finalize — top-level bring-up / tear-down; called by the executable wire-up.
- pool::initialize_topology — interrogates os::resources::cpu::effective () and (where capable) os::resources::net::map_nic_to_index ().
- pool::initialize_freelist — pre-allocate max_connections * 1.1 freelist nodes.
- pool::initialize_workers — create max_connection_workers pinned workers and pre-warm by sending each a START message on both queues.
- pool::initialize_coordinator / pool::start_coordinator / pool::finalize_coordinator — coordinator lifecycle.
- pool::dispatch — accept hand-off; called by master_connector once a TCP connection has completed the CUBRID handshake. Sends a NEW_CLIENT to the coordinator.
- pool::claim_context / pool::retire_context — freelist API; require m_mutex held by the calling thread.
- pool::lock_resource / pool::release_resource / pool::try_to_lock_resource — the pool-wide mutex used by the coordinator for the duration of its lifetime.
connection::coordinator (CBRD-26406 / 26407 / 26424)
- cubconn::connection::coordinator — class definition in coordinator.hpp. Members include m_parent, m_watcher, the controller (Unix-domain socket m_controller, m_ctrlfd), the message queue (m_queue, m_queue_size), the worker count tracking (m_max_worker, m_min_worker, m_current_worker), the migration-in-flight set (m_migrating), the scaling bookkeeping (m_scaling, m_scaling_statistics), and per-worker statistics (m_statistics).
- coordinator::coordinator — opens the controller socket, registers fds into epoll, installs three timers, spawns thread.
- coordinator::run — main reactor loop.
- coordinator::initialize — pin to core 0 (or the first effective core), claim thread entry, set name "coordinator", take the pool lock for life.
- coordinator::handle_message_queue_new_client — placement: pick min-score worker, allocate context, forward NEW_CLIENT. (CBRD-26424)
- coordinator::handle_message_queue_return_to_pool — bulk return from a worker's m_removed_context; clears per-context stats and calls pool::retire_context.
- coordinator::handle_message_queue_handoff_reply — finalise migration; revert stats on transferred=false.
- coordinator::handle_message_queue_statistics — per-worker stats arrival; runs EWMA update via statistics_update_connection, then statistics_update_score; if the reporting worker is the current draining_worker and reports empty contexts, calls scale_down_finish. (CBRD-26424)
- coordinator::handle_message_queue_shutdown — flip m_stop true.
- coordinator::transfer_connection — guarded by m_migrating; sends HANDOFF_CLIENT to the source worker.
- coordinator::scale_up — AWAKEN next worker, bump m_current_worker. (CBRD-26406)
- coordinator::scale_down / coordinator::scale_down_finish — drain target worker's connections, then HIBERNATE. (CBRD-26406)
- coordinator::scale_trial / coordinator::scale_selection / coordinator::statistics_scaling — the auto-scaling state machine. (CBRD-26406 / CBRD-26424)
- coordinator::statistics_rebalancing — every MEDIUM tick (5 s), find score extremes, transfer one context if the gap exceeds 20 % of the high score. (CBRD-26424)
- coordinator::statistics_EWMA — α = 0.06, microsecond-normalised, used for both worker and context metrics.
- coordinator::statistics_find_score_extremes — linear scan over m_statistics[0..m_current_worker) returning (min_index, max_index).
- coordinator::statistics_update_score — applies the EVAL_WORKER + EVAL_CONTEXT + client_num formula.
- coordinator::statistics_print — controller-driven console dump of per-worker score, EWMA, byte counts.
- coordinator::handle_controller / coordinator::handle_controller_request — dispatch the four control-socket commands.
connection::context, controller, statistics
- cubconn::connection::context — per-client state (worker index, id, ignore guard, recv state machine, receiver, transmitter, blocker shared_ptr, per-context metrics). 32 KB inline send/recv buffer.
- cubconn::connection::context::reset — reset for reuse via the freelist.
- cubconn::thread_watcher — mutex + cv + int active, used for ordered shutdown.
- cubconn::message_blocker — single-shot mutex + cv + bool done, used for blocking enqueue_and_notify callers.
- cubconn::connection::controller<RX,TX> — templated Unix-domain datagram socket wrapper (controller.hpp).
- cubconn::statistics::context / cubconn::statistics::worker — enums of metric keys (connection_statistics.hpp).
- cubconn::statistics::metrics<T,VT> — fixed-size array of counters; supports +=, - (returns metrics<T,double>), * (scaling), add, sub, get, set, reset, copy_from. No atomics. (CBRD-26191)
task worker pool changes
- cubthread::worker_pool (thread_worker_pool.hpp) — unchanged abstract interface.
- cubthread::worker_pool::core — now sized by task_group.
- cubthread::worker_pool::execute / execute_on_core — entry points called from css_push_server_task.
- cubthread::worker_pool_task_capper (thread_worker_pool_taskcap.{hpp,cpp}) — the legacy admission-cap wrapper retained for HA daemons; m_tasks_available = m_max_tasks = worker_pool->get_worker_count ().
- css_push_server_task (server_support.c) — the hot-path handoff; partitions by static_cast<size_t> (conn_ref.idx) so a connection always lands on the same task-pool core.
- css_get_task_stats (server_support.c) — fills stats[3] = { requested, started, completed } from the pool's internal counters; consumed by coordinator::statistics_update_task.
system parameters
- PRM_ID_TCP_KEEPALIVE_IDLE / PRM_ID_TCP_KEEPALIVE_INTERVAL / PRM_ID_TCP_KEEPALIVE_COUNT — keepalive tunables.
- PRM_ID_TASK_GROUP (renamed from thread_core_count).
- PRM_ID_TASK_WORKER.
- PRM_ID_CSS_MAX_CONNECTION_WORKER / PRM_ID_CSS_MIN_CONNECTION_WORKER.
- PRM_ID_CSS_AUTO_SCALING_WINDOW_SIZE.
- PRM_ID_CSS_RECV_BUDGET_PER_CONNECTION / PRM_ID_CSS_SEND_BUDGET_PER_CONNECTION.
Position hints (as of 2026-04-30)
| Symbol | File | Line |
|---|---|---|
cubsocket::epoll (class) | src/base/epoll.hpp | 42 |
cubsocket::epoll::epoll | src/base/epoll.cpp | 37 |
cubsocket::epoll::wait | src/base/epoll.cpp | 54 |
cubsocket::epoll::add_descriptor | src/base/epoll.cpp | 59 |
cubsocket::epoll::modify_descriptor | src/base/epoll.cpp | 80 |
cubsocket::epoll::remove_descriptor | src/base/epoll.cpp | 101 |
cubconn::connection::worker (class) | src/connection/connection_worker.hpp | 52 |
worker::message_type (enum) | src/connection/connection_worker.hpp | 106 |
worker::worker | src/connection/connection_worker.cpp | 75 |
worker::attach | src/connection/connection_worker.cpp | 2107 |
worker::initialize | src/connection/connection_worker.cpp | 1943 |
worker::finalize | src/connection/connection_worker.cpp | 1975 |
worker::run | src/connection/connection_worker.cpp | 2007 |
worker::enqueue | src/connection/connection_worker.cpp | 160 |
worker::notify | src/connection/connection_worker.cpp | 182 |
worker::enqueue_and_notify | src/connection/connection_worker.cpp | 218 |
worker::push_task_into_worker_pool | src/connection/connection_worker.cpp | 288 |
worker::purge_stale_contexts | src/connection/connection_worker.cpp | 294 |
worker::handle_connection_close | src/connection/connection_worker.cpp | 386 |
worker::statistics_metrics_to_coordinator | src/connection/connection_worker.cpp | 562 |
worker::hibernate_check | src/connection/connection_worker.cpp | 584 |
worker::ha_close_all_connections | src/connection/connection_worker.cpp | 606 |
worker::handle_message_queue_new_client | src/connection/connection_worker.cpp | 1016 |
worker::handle_message_queue_handoff_client | src/connection/connection_worker.cpp | 1079 |
worker::handle_message_queue_takeover_client | src/connection/connection_worker.cpp | 1160 |
worker::handle_message_queue_shutdown_client | src/connection/connection_worker.cpp | 1227 |
worker::handle_message_queue | src/connection/connection_worker.cpp | 1356 |
worker::handle_reception | src/connection/connection_worker.cpp | 1694 |
worker::handle_transmission | src/connection/connection_worker.cpp | 1782 |
worker::handle_exhausted_add_context | src/connection/connection_worker.cpp | 1837 |
worker::handle_exhausted | src/connection/connection_worker.cpp | 1854 |
cubconn::connection::pool (class) | src/connection/connection_pool.hpp | 39 |
pool::freelist | src/connection/connection_pool.hpp | 42 |
pool::initialize | src/connection/connection_pool.cpp | 62 |
pool::finalize | src/connection/connection_pool.cpp | 89 |
pool::dispatch | src/connection/connection_pool.cpp | 109 |
pool::claim_context | src/connection/connection_pool.cpp | 140 |
pool::retire_context | src/connection/connection_pool.cpp | 160 |
pool::initialize_freelist | src/connection/connection_pool.cpp | 213 |
pool::initialize_topology | src/connection/connection_pool.cpp | 249 |
pool::initialize_workers | src/connection/connection_pool.cpp | 269 |
pool::finalize_workers | src/connection/connection_pool.cpp | 314 |
pool::initialize_coordinator | src/connection/connection_pool.cpp | 353 |
pool::start_coordinator | src/connection/connection_pool.cpp | 376 |
cubconn::connection::coordinator (class) | src/connection/coordinator.hpp | 41 |
coordinator::coordinator | src/connection/coordinator.cpp | 57 |
coordinator::initialize | src/connection/coordinator.cpp | 1192 |
coordinator::run | src/connection/coordinator.cpp | 1240 |
coordinator::transfer_connection | src/connection/coordinator.cpp | 237 |
coordinator::scale_up | src/connection/coordinator.cpp | 281 |
coordinator::scale_down | src/connection/coordinator.cpp | 348 |
coordinator::scale_down_finish | src/connection/coordinator.cpp | 317 |
coordinator::scale_trial | src/connection/coordinator.cpp | 378 |
coordinator::scale_selection | src/connection/coordinator.cpp | 415 |
coordinator::statistics_find_score_extremes | src/connection/coordinator.cpp | 460 |
coordinator::statistics_update_score | src/connection/coordinator.cpp | 482 |
coordinator::statistics_update_connection | src/connection/coordinator.cpp | 502 |
coordinator::statistics_update_task | src/connection/coordinator.cpp | 545 |
coordinator::statistics_rebalancing | src/connection/coordinator.cpp | 586 |
coordinator::statistics_scaling | src/connection/coordinator.cpp | 629 |
coordinator::handle_message_queue_new_client | src/connection/coordinator.cpp | 934 |
coordinator::handle_message_queue_return_to_pool | src/connection/coordinator.cpp | 970 |
coordinator::handle_message_queue_handoff_reply | src/connection/coordinator.cpp | 992 |
coordinator::handle_message_queue_statistics | src/connection/coordinator.cpp | 1032 |
coordinator::handle_controller_request | src/connection/coordinator.cpp | 1110 |
cubconn::connection::context | src/connection/connection_context.hpp | 141 |
cubconn::statistics::metrics | src/connection/connection_statistics.hpp | 111 |
cubconn::connection::controller (template) | src/connection/controller.hpp | 43 |
cubthread::worker_pool | src/thread/thread_worker_pool.hpp | 54 |
cubthread::worker_pool_task_capper | src/thread/thread_worker_pool_taskcap.hpp | 30 |
css_push_server_task | src/connection/server_support.c | 2354 |
css_get_task_stats | src/connection/server_support.c | 2647 |
REGISTER_CONNECTION (macro) | src/thread/thread_manager.hpp | 496 |
PRM_ID_TCP_KEEPALIVE_IDLE (param row) | src/base/system_parameter.c | 5161 |
PRM_ID_TASK_WORKER (param row) | src/base/system_parameter.c | 5197 |
PRM_ID_CSS_MAX_CONNECTION_WORKER (param row) | src/base/system_parameter.c | 5209 |
PRM_ID_CSS_AUTO_SCALING_WINDOW_SIZE (param row) | src/base/system_parameter.c | 5243 |
PRM_ID_CSS_RECV_BUDGET_PER_CONNECTION (param row) | src/base/system_parameter.c | 5259 |
PRM_ID_CSS_SEND_BUDGET_PER_CONNECTION (param row) | src/base/system_parameter.c | 5271 |
Cross-check Notes
Sibling doc — cubrid-thread-worker-pool.md. The legacy
doc describes (a) css_master_thread accept loop, (b) one
polling thread per accepted connection, (c) the
cubthread::worker_pool and its core::worker machinery,
(d) css_push_server_task as the dispatch point. Of those,
(c) and (d) are still live and current. (a) is unchanged at
the master-thread accept layer, but the handover point is
now pool::dispatch (forwarding NEW_CLIENT to the
coordinator) instead of “spawn a polling thread for this
fd”. (b) is replaced: any reference in the legacy doc to
“each connection has a thread” is no longer accurate.
Look-up symbols that moved domains:
- Polling/recv-loop logic in legacy was scattered across per-connection threads driven by css_internal_request_handler; it now lives in cubconn::connection::worker::handle_reception and friends.
- Connection-close protocol in legacy was a synchronous css_close_socket from the polling thread; it is now worker::handle_connection_close with retry-via-LAZY-queue and a separate freelist return phase.
- Stats in legacy were per-worker cubperf::stat_value arrays read with the worker pool's get_stats; on the connection side, those readings no longer exist as counters at all (CBRD-26177 directive). Use the coordinator's controller socket (SHOW_STATS) for diagnostics.
- Admission control in legacy was worker_pool_task_capper for HA daemons only; in NG, every connection worker enforces a per-tick byte budget. The capper class is still in tree but is not on the connection-worker path.
Sibling doc — cubrid-server-session.md. Server session
state lookup happens during request processing inside the
task worker (after css_push_server_task lands in the
back-end pool). The connection worker does not look up
sessions; it only parses the network protocol. The
session_p field on css_conn_entry is read on the task
side (see css_server_task::execute in server_support.c).
This is unchanged from the legacy doc and the redesign does
not move it.
Regressions tracked under the EPIC.
- CBRD-26586 — parallel query uses only one CPU after worker timeout. Root cause confirmed by Hong Yechan to be the interaction between thread_worker_timeout_seconds and affinity inheritance: when the connection worker creates a task worker (because the task pool let a thread expire), the new pthread inherits the connection worker's CPU affinity, pinning all back-end work to the connection worker's core. Fix: do not inherit affinity for newly-spawned task workers. Workaround until the fix lands: set thread_worker_timeout_seconds high so back-end threads are not recycled.
- CBRD-26617 — task worker thread name inherits "connections". Same mechanism (attribute inheritance from the spawning thread). Confusing in core dumps because the thread name is used to label core.<name>... files, so a task-worker crash produced core.connections.*. Fix: set the thread name when the task pool spawns a worker.
- CBRD-26544 — schema_type_str synonym enum coredump. Pre-existing on develop; surfaced under the new build because CCI's enum and its string array drifted out of sync. Fixed in the same merge window.
- CBRD-26523 — HA test cases cbrd_21506_02, cbrd_22705_02 fail. Diagnosed as a pre-existing HA timing bug (logwr/copylog interaction on tid:0 system commits) that the redesign exposes because the new connection structure speeds up state transitions. Not a redesign regression; rerouted to CBRD-26576 for the actual fix.
HA-shell test set after merge. CBRD-26255 comments
record a separate batch of HA shell-test failures
(bug_bts_5212, bug_bts_9047, cbrd_22207, cbrd_23854,
etc.) all attributed to timing changes — the redesign
genuinely is faster, and that exposes test scripts whose
sleeps and grep filters were calibrated to the legacy speed.
The fixes were a mix of test-script timing tweaks and one
genuine bug (-353 Resource temporarily unavailable under
ulimit -n constraint, fixed by raising the FD limit and
documenting the new minimum).
The CBRD-26177 “no perfmon” directive. Repeated here because it is the most likely thing to be broken by a future contributor:
“connection worker는 매우 동시성이 높은 hot-path이므로 perfmon 계열의 모니터링 코드를 추가해서는 안된다. 심각한 성능 저하를 일으킬 수 있다.” (The connection worker is a very high-concurrency hot path, so perfmon-style monitoring code must not be added to it; it can cause severe performance degradation.)
Practical implications when reading or editing the code:
- Do not add perfmon_inc_stat or any global atomic increment to worker::run, worker::handle_reception, worker::handle_transmission, worker::handle_packet, the message-queue handlers, or any of their callees.
- Do add metrics to statistics::metrics<> instances on the worker (they are private uint64_t[]); the coordinator already sums them.
- The controller socket (SHOW_STATS) is the supported read-out path; statistics_print is the renderer.
- Per-context counters belong on context::m_stats, and their aggregation via the coordinator's statistics_update_connection is already wired.
Pivot to CBRD-26662 — Logical-Wait-Aware Concurrency
Control. This redesign delivered “high throughput at high
concurrency” but exposed a follow-on weakness articulated in
CBRD-26636: when task_worker is sized aggressively low
(4–6 × cores) for throughput, lock waits on a few workers
can block the whole back-end. CBRD-26662 introduces a
slot abstraction — workers must hold a slot to be
“active”; a worker entering a logical wait (lock or
condition variable) returns its slot, freeing the slot for a
new worker — bounded above by high_concurrency, a
runtime-tunable. The plan is to retire task_group /
task_worker and replace them with high_concurrency. That
work is in progress; for now, treat task_worker and
task_group as the canonical knobs.
Open Questions
- Affinity-aware connection placement. The coordinator picks the minimum-score worker. When a connection is stateful (HA replication, CDC consumer, log-writer slave), is there value in pinning it to a fixed worker for the connection lifetime? The current transfer_connection will re-balance even long-lived sessions; the only opt-out is is_wait_required returning false for cdc_Gl.conn.fd in worker::is_wait_required. A first-class "affinity-pinned connection" flag would close the gap.
- HA replication's connection model. The connection worker honours HA_SERVER_STATE_TO_BE_STANDBY by force-closing non-active contexts (ha_close_all_connections). What happens during the opposite transition (standby → master), when a fresh batch of clients reconnects en masse and the coordinator has to allocate many contexts in a burst? The freelist is sized to max_connections * 1.1, so it should absorb the burst, but the coordinator is single-threaded on handle_message_queue_new_client. A concrete bound on the new-connection rate the coordinator can sustain has not been measured.
- Score-function weights. The macros EVAL_WORKER (25, 3.5, …) + (500, 1, …) and EVAL_CONTEXT (50, 1000, …) + (10, 1, …) are tuned constants. CBRD-26424 acknowledged this is empirical. What is the sensitivity surface? Could a runtime-tunable weight set obviate auto_scaling_window_size by letting operators bias the score toward latency or throughput?
- Verification gap from CBRD-26421. The task explicitly stated that connection-worker rebalancing and dynamic scaling are not covered by automated tests because the connection pool's internal state is not exposed through any user-visible interface. The controller socket is for debugging only. A read-only SHOW STATS SQL or DBA-RPC view would close the test gap.
- std::nothrow vs. STL exceptions (CBRD-26412). The ticket's resolution is essentially "we cannot guard exhaustively because STL throws and the codebase uses STL". Some hot-path allocations (pool::freelist (32 * 1024), m_context.reserve (256), m_exhausted.reserve (128)) still throw on OOM. What failure semantic should the operator expect — server crash, dropped connection, or graceful degradation? Today it is the first.
- Send/recv budget defaults. 16 KB / 32 KB are reasonable for OLTP but are likely small for bulk-load and CDC streaming. Is there a per-connection-class override path short of editing cubrid.conf?
Sources
Section titled “Sources”
Source paths
Section titled “Source paths”
- `src/connection/connection_worker.cpp` (≈ 58 KB)
- `src/connection/connection_worker.hpp` (≈ 10 KB)
- `src/connection/connection_pool.cpp` (≈ 10 KB)
- `src/connection/connection_pool.hpp` (≈ 3 KB)
- `src/connection/coordinator.cpp` (≈ 35 KB)
- `src/connection/coordinator.hpp` (≈ 10 KB)
- `src/connection/controller.hpp`
- `src/connection/connection_context.hpp`
- `src/connection/connection_statistics.hpp`
- `src/connection/connection_support.{cpp,hpp}`
- `src/connection/server_support.c` — `css_push_server_task` (line 2354), `css_get_task_stats` (line 2647)
- `src/connection/tcp.c` — `setsockopt SO_KEEPALIVE` (line 203)
- `src/base/epoll.{cpp,hpp}`
- `src/thread/thread_worker_pool.hpp` — abstract pool interface (line 54)
- `src/thread/thread_worker_pool_impl.{cpp,hpp}` — pool implementation
- `src/thread/thread_worker_pool_taskcap.{cpp,hpp}` — legacy admission cap
- `src/thread/thread_manager.hpp` — `REGISTER_CONNECTION` (line 496)
- `src/base/system_parameter.{c,h}` — param IDs and rows for `tcp_keepalive_*`, `task_group`, `task_worker`, `min/max_connection_worker`, `auto_scaling_window_size`, `recv/send_budget_per_connection`
- `src/executables/server.c` — `cubconn::connection::pool connections;` (line 557)
JIRA tickets
Section titled “JIRA tickets”- EPIC: http://jira.cubrid.org/browse/CBRD-26177
- Survey: http://jira.cubrid.org/browse/CBRD-26152
- POC: http://jira.cubrid.org/browse/CBRD-26212
- Pool redesign: http://jira.cubrid.org/browse/CBRD-26255
- Send/recv budgets: http://jira.cubrid.org/browse/CBRD-26392
- Rebalancing + auto scaling: http://jira.cubrid.org/browse/CBRD-26406
- Coordinator + freelist: http://jira.cubrid.org/browse/CBRD-26407
- Null-guard for `new`: http://jira.cubrid.org/browse/CBRD-26412
- Verification cases (postponed): http://jira.cubrid.org/browse/CBRD-26421
- Score formula: http://jira.cubrid.org/browse/CBRD-26424
- HA bugs: http://jira.cubrid.org/browse/CBRD-26523
- Synonym-enum coredump: http://jira.cubrid.org/browse/CBRD-26544
- Parallel-query CPU regression: http://jira.cubrid.org/browse/CBRD-26586
- Thread-name inheritance: http://jira.cubrid.org/browse/CBRD-26617
- Worker-count sweep: http://jira.cubrid.org/browse/CBRD-26636
- Atomic-free monitoring: http://jira.cubrid.org/browse/CBRD-26191
- Logical-Wait-Aware (follow-on EPIC): http://jira.cubrid.org/browse/CBRD-26662
- Manual update: http://jira.cubrid.org/browse/CUBRIDMAN-333
Textbook references
Section titled “Textbook references”- Silberschatz, Korth, Sudarshan. Database System Concepts, 6th ed. — Ch. 13 “Storage and File Structure” (buffer basics, framing of front-end vs back-end).
- Petrov, Alex. Database Internals (O’Reilly, 2019). §5.3 “Concurrent Execution” — pool sizing intuition, C10K framing.
- Stevens, W. Richard. UNIX Network Programming, Vol. 1, 3rd ed. — §16.5 “TCP Concurrent Server, One Child per Client” (the model the redesign moves away from).
- Pai, V., P. Druschel, W. Zwaenepoel. Flash: An Efficient and Portable Web Server. USENIX 1999. (event-driven asymmetric multi-process design — direct ancestor of the reactor pattern in this redesign).
- Welsh, M., D. Culler, E. Brewer. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. SOSP 2001. (admission control via bounded stage queues — the intellectual basis for `recv_budget_per_connection` / `send_budget_per_connection`).
- Linux kernel docs — `epoll(7)`, `eventfd(2)`, `timerfd_create(2)`. The `EPOLLET` (edge-triggered) semantics are mandatory background reading for anyone modifying `worker::run`.