CUBRID Monitoring — Perfmon Counters, Statistics Aggregation, and Per-Subsystem Monitors
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
Theoretical Background
A database engine in production is a black box: hundreds of worker threads, dozens of subsystems, billions of small operations per minute. The only way to know what is healthy, where time is being spent, what is contended, is to count and time what it does. The monitoring subsystem is the engine’s instrumentation surface — where every subsystem deposits its events, and where DBAs, statdump, SHOW commands, and auto-tuners pull aggregates back out.
Three textbook ideas frame the shape every modern engine arrives at.
Counter-based monitoring. Hellerstein–Stonebraker (Anatomy of a Database System, Red Book ch. 4) and Petrov (Database Internals ch. 13) describe the engine as a graph of subsystems each owning a fixed catalogue of named numeric counters: counts, totals, maxima, gauges. The catalogue is static (part of the binary), the counters are long-lived (accumulate from server start), and reading them is cheap (a memcpy of an array, not a query through the storage stack). The cost of bookkeeping — the increments inside lock_object, pgbuf_fix, qexec_execute_mainblock — must be small enough that turning monitoring on does not perturb the workload it is measuring (the Heisenberg observation effect).
Per-thread vs server-wide aggregation. The naïve implementation increments a single global counter from every thread, leaving one cache line ping-ponging between cores with coherence traffic that dwarfs the actual work. The textbook fix has two flavours: (a) atomic increments on a single global (one CAS per event), or (b) per-thread shards aggregated lazily by readers (wait-free writes, O(N_threads) reads). Most engines mix the two — atomic for low-frequency counters, shards for high-frequency ones. The mix is the architectural choice, not the existence of counters.
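A minimal sketch of the two flavours in isolation (names and bounds are illustrative, not CUBRID's):

```cpp
// Illustrative only: the two textbook aggregation flavours side by side.
#include <atomic>
#include <cstdint>

constexpr int MAX_THREADS = 256;                      // hypothetical bound

std::atomic<std::uint64_t> g_counter;                 // (a) one atomic RMW per event

struct alignas (64) shard { std::uint64_t v = 0; };   // pad each slot to a cache line
shard g_shards[MAX_THREADS];                          // (b) per-thread slots

void hit_atomic ()         { g_counter.fetch_add (1, std::memory_order_relaxed); }
void hit_sharded (int tid) { g_shards[tid].v++; }     // single writer per slot, no sync

std::uint64_t read_sharded ()                         // reader pays O(N_threads)
{
  std::uint64_t sum = 0;
  for (const shard &s : g_shards)
    sum += s.v;                                       // possibly stale; fine for stats
  return sum;
}
```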
Sampling vs always-on. Always-on counters give exact totals at permanent overhead; sampling (PostgreSQL’s pg_stat_statements, Oracle’s ASH, MySQL’s events_* ring buffers) trades exactness for per-event metadata. Counters answer “how much of what happened?”, samples answer “who did it?”. CUBRID’s monitor subsystem is squarely in the counter camp; sampling-style observability lives elsewhere (SQL trace, broker SQL log, server diagnostics).
The structures in src/monitor/ and src/base/perf_monitor.{h,c} are direct expressions of where CUBRID lands on each axis.
Common DBMS Design
PostgreSQL — pg_stat_* and a stats collector. Each backend writes per-event counters into per-backend shared memory; a background collector aggregates them into snapshots that pg_stat_* views read via ordinary heap-style scans. From PG 15, the collector was replaced by direct shared-memory writes plus a dynamic-snapshot read path, but the architectural split — write-side counters vs read-side views — survives.
MySQL — Performance Schema + INFORMATION_SCHEMA + SHOW. A dedicated storage engine (ha_perfschema) backs ring-buffer event tables and aggregated tables. INFORMATION_SCHEMA carries ANSI-aligned views; SHOW STATUS/SHOW ENGINE INNODB STATUS is syntactic sugar over the same back-end. Instrumentation cost is opt-in via setup_consumers/setup_instruments.
Oracle — V$ over X$ fixed tables. The kernel publishes internal C arrays as X$ fixed tables via a hard-coded access driver; V$/GV$ are catalogue views over X$. AWR/ASH snapshots V$ into real tables for retention.
SQL Server — DMVs. sys.dm_exec_*, sys.dm_os_wait_stats, etc. — mechanically analogous to Oracle’s V$/X$: catalogue views resolved to internal table sources whose rows are computed on demand.
CUBRID lands closer to MySQL/Oracle than to PostgreSQL: counters are written directly into a process-local UINT64 array (pstat_Global.global_stats) rather than handed off to a separate collector process, and reads expose through a SHOW family over the same array. The newer cubmonitor C++ template library adds a registration model structurally close to PostgreSQL’s per-backend sharding (the per-transaction “sheets”) layered on top of a global counter, both halves living in one binary.
CUBRID’s Approach
CUBRID’s monitoring surface is not a single subsystem; it is two cooperating layers plus per-subsystem monitors that hold their own non-counter state.
flowchart LR
HOT["Hot path
(pgbuf_fix, lock_object,
btree_find, qexec_∗)"]
PERF["perf_monitor.{h,c}
pstat_Global.global_stats[]
PSTAT_METADATA pstat_Metadata[]"]
CUB["cubmonitor (C++ templates)
monitor::register_statistics
counter/timer/max statistics"]
TX["transaction_sheet_manager
per-tx sheets (≤ MAX_SHEETS)"]
OVFP["per-subsystem monitor
ovfp_threshold_mgr
(per-vacuum-worker linked list)"]
HOT -->|perfmon_inc/perfmon_add_at_offset| PERF
HOT -->|stat.collect / autotimer| CUB
CUB -.->|extends global| TX
HOT -->|add_read_pages_count| OVFP
SHOW["SHOW STATS / statdump
SHOW EXEC STATISTICS
sysprm dump"]
SHOW -->|perfmon_get_stats / perfmon_calc_diff_stats| PERF
SHOW -.->|monitor::fetch_*_statistics| CUB
SHOW -->|ovfp_threshold_mgr::dump| OVFP
The C perf_monitor array is the larger and older surface — what SHOW EXEC STATISTICS, cubrid statdump, and the trace facilities still read. The cubmonitor C++ library was added later for richer composition (counter + timer + max + average together, optional per-transaction sheet); today it backs the lockfree_hashmap instrumentation and a small number of other call sites. Per-subsystem monitors like ovfp_threshold_mgr are a third pattern: structures whose state is too rich for a flat counter array — sorted linked lists of (BTID, OID, count, max-pages, timestamp).
The three layers share a discipline documented in CBRD-26177: no perfmon increments on the hot path of the connection-worker pool. Every counter increment is a write to a possibly-shared cache line; the per-thread worker pool was rebuilt specifically to avoid these shared writes in the latency-critical request dispatch path. Counters that survived the rebuild are atomic ones backing low-frequency events; the rest moved to per-thread shards or were elided.
cubmonitor::statistic — the building blocks
Every counter in the C++ layer is built from one template hierarchy in monitor_statistic.hpp along three orthogonal axes:
- Representation type — amount_rep = std::uint64_t for counters/totals, floating_rep = double for ratios, time_rep = duration (std::chrono::high_resolution_clock::duration) for timers.
- Collection semantics — accumulator_statistic adds, gauge_statistic overwrites, max_statistic keeps the largest, min_statistic keeps the smallest.
- Synchronisation — _atomic_ variants wrap storage in std::atomic and use fetch_add/compare_exchange_strong; non-atomic variants assume a single writer.
```cpp
template <class Rep>
class accumulator_statistic : public primitive<Rep>
{
  public:
    void collect (const Rep &value);
};

template <class Rep>
class accumulator_atomic_statistic : public atomic_primitive<Rep>
{
  public:
    void collect (const Rep &value);
};

// time_rep cannot live inside std::atomic directly — duration is not trivially atomic
template <>
class atomic_primitive<time_rep>
{
  // ... condensed ...
  std::atomic<time_rep::rep> m_value;   // store the count, not the duration
};
```

The naming scheme is valuetype_collectype[_atomic]_statistic: time_accumulator_atomic_statistic is the most common timer flavour. The aliases at the bottom of monitor_statistic.hpp enumerate every legal combination.
The fetch contract is uniform: every primitive exposes fetch(statistic_value *destination, fetch_mode mode) writing one uint64_t per statistic, and get_statistics_count() returning 1. The statistic_value representation is monomorphic — floating_rep and time_rep are bit-cast / duration-cast into uint64_t slots so a single read-side buffer carries everything. Time goes through microseconds:
```cpp
// statistic_value_cast — monitor_statistic.cpp
statistic_value
statistic_value_cast (const time_rep &rep)
{
  // nanoseconds to microseconds
  return static_cast<statistic_value> (std::chrono::duration_cast<std::chrono::microseconds> (rep).count ());
}
// note: time_rep_cast (statistic_value_cast (stat)) != stat (lossy round-trip)
```

The lossy round-trip is deliberate: microseconds suffice for the dashboards that consume them, and the alternative (preserve nanoseconds) would halve the 64-bit dynamic range or force a wider export ABI.
Max/min statistics are thin layers on accumulators; the atomic flavours use the standard CAS loop and seed their stored value with numeric_limits::min()/::max() (and time_rep::min()/::max() specialisations) so the first collect always wins.
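Condensed, the CAS loop has the familiar shape (a sketch of the behaviour described here, not the header's exact code):

```cpp
// Sketch: max-statistic collect via compare-exchange. The stored value is
// seeded with numeric_limits<Rep>::min(), so the first collect always wins.
template <class Rep>
void
collect_max (std::atomic<Rep> &stored, const Rep &value)
{
  Rep current = stored.load ();
  while (value > current)
    {
      if (stored.compare_exchange_strong (current, value))
        {
          break;   // we published the new maximum
        }
      // on failure, current was reloaded; the loop re-checks against the racer's value
    }
}
```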
Grouped statistics — timer_statistic, counter_timer_statistic, counter_timer_max_statistic
Most call sites want a group: count, total time, max latency, derived average. monitor_collect.hpp composes the primitives:
```cpp
// counter_timer_statistic — monitor_collect.hpp
template <class A = amount_accumulator_statistic, class T = time_accumulator_statistic>
class counter_timer_statistic
{
  public:
    class autotimer { /* RAII: reset on ctor, time_and_increment on dtor */ };

    void time_and_increment (const time_rep &d, const amount_rep &a = 1);
    void time_and_increment (const amount_rep &a = 1);   // use internal timer
    void register_to_monitor (monitor &mon, const char *basename) const;

  private:
    timer m_timer;
    A m_amount_statistic;
    T m_time_statistic;
};
```

Two patterns recur. Internal timer + autotimer — each grouped statistic carries a private cubmonitor::timer that snapshots clock_type::now() on construction; the autotimer RAII class resets on construction and increments on destruction, so scoped instrumentation reads { counter_timer_stat::autotimer at(my_stat); /* work */ }. Group registration — register_to_monitor registers get_statistics_count() + 1 slots, where +1 is a derived average total/count computed at fetch time. Names are prefix-stamped via build_name_vector with Num_, Total_time_, Max_time_, Avg_time_ — the same prefixes the C perf_monitor PSTAT_COUNTER_TIMER_VALUE rows use, which is what makes the two layers visually identical in statdump.
```cpp
// counter_timer_max_statistic::register_to_monitor — monitor_collect.hpp
auto fetch_func = [&] (statistic_value * destination, fetch_mode mode)
{
  this->fetch (destination, mode);
  destination[get_statistics_count ()] = statistic_value_cast (this->get_average_time (mode));
};
mon.register_statistics (stat_count, fetch_func, names);
```

The lambda captures this and synthesises the average as a fourth slot at fetch time. No average counter is ever stored. This is one rationale for the microsecond cast — at fetch time, total / count is more numerically robust on integer microseconds than on integer nanoseconds.
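Put together, a call site looks roughly like this; the subsystem, statistic name, and init function are hypothetical, but the autotimer and register_to_monitor shapes are the ones quoted above:

```cpp
// Hypothetical call site (illustrative; not an actual CUBRID subsystem).
static cubmonitor::atomic_counter_timer_stat my_op_stat;

void
my_subsystem_init ()
{
  // registers Num_my_op, Total_time_my_op, ... plus the derived Avg_time_my_op slot
  my_op_stat.register_to_monitor (cubmonitor::get_global_monitor (), "my_op");
}

void
my_subsystem_do_op ()
{
  cubmonitor::atomic_counter_timer_stat::autotimer at (my_op_stat);
  // ... measured work; the destructor times and increments ...
}
```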
monitor — the central registry
monitor_registration.hpp declares the registry. The whole class is a vector of registrations, each registration being a (count, fetch_function, names) triple. The fetch function is std::function<void (statistic_value *, fetch_mode)>, which lets a registration write multiple slots from arbitrary internal state (the lambdas built by register_to_monitor are exactly such functions).
```cpp
// monitor — monitor_registration.hpp
class monitor
{
  public:
    using fetch_function = std::function<void (statistic_value *, fetch_mode)>;

    void register_statistics (std::size_t statistics_count, const fetch_function &fetch_f,
                              const std::vector<std::string> &names);

    statistic_value *allocate_statistics_buffer (void) const;
    void fetch_global_statistics (statistic_value *destination) const;
    void fetch_transaction_statistics (statistic_value *destination) const;
    void fetch_statistics (statistic_value *destination, fetch_mode mode) const;

  private:
    struct registration
    {
      std::size_t m_statistics_count;
      fetch_function m_fetch_func;
    };

    std::size_t m_total_statistics_count;
    std::vector<std::string> m_all_names;
    std::vector<registration> m_registrations;
};

monitor &get_global_monitor (void);
```

The fetch loop is the simplest possible thing — walk the registrations, call each one’s fetch function, advance the destination pointer by the registration’s m_statistics_count. There is no synchronisation around the fetch; consistency is the writer’s problem (it is what _atomic_ statistics buy you) and snapshot consistency across registrations is explicitly not guaranteed.
```cpp
// monitor::fetch_statistics — monitor_registration.cpp
void
monitor::fetch_statistics (statistic_value *destination, fetch_mode mode) const
{
  statistic_value *stats_iterp = destination;
  for (auto it : m_registrations)
    {
      it.m_fetch_func (stats_iterp, mode);
      stats_iterp += it.m_statistics_count;
    }
}
```

The global monitor is a file-scope static monitor Monitor returned by get_global_monitor (). There is no lifecycle — the monitor lives from server start to shutdown, and its m_registrations vector grows monotonically as subsystems register themselves.
A short note on the registration model. The signature register_statistics (count, fetch_f, names) is called both directly (when a subsystem wants to register a custom set) and indirectly via the helpers on grouped statistics (register_to_monitor). The fetch_f is a lambda that captures the statistic instance by reference — which means the statistic must outlive the global monitor. In practice every C++ statistic that goes into the registry is a static or member variable that lives for the life of the server, and the registration is performed once during init. There is no unregister_statistics.
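A direct registration, assuming the declarations quoted above, looks roughly like this (the statistic and its name are invented for the example):

```cpp
// Illustrative direct registration: statistic and name are hypothetical.
static cubmonitor::amount_accumulator_atomic_statistic my_bytes_written;

void
register_my_stats ()
{
  cubmonitor::monitor &mon = cubmonitor::get_global_monitor ();
  mon.register_statistics (1,
                           [] (cubmonitor::statistic_value *dst, cubmonitor::fetch_mode mode)
  {
    my_bytes_written.fetch (dst, mode);   // writes exactly one slot
  },
  { "My_bytes_written" });
  // no unregister exists: my_bytes_written must live until server shutdown
}
```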
Per-transaction sheets — transaction_statistic and transaction_sheet_manager
The most distinctive piece of the C++ layer is the per-transaction sheet model: a single transaction can be measured in isolation without paying per-thread sharding on every counter.
```cpp
// transaction_statistic — monitor_transaction.hpp
template <class S>
class transaction_statistic
{
  public:
    using statistic_type = S;

    void fetch (statistic_value *destination, fetch_mode mode = FETCH_GLOBAL) const;
    void collect (const typename statistic_type::rep &value);   // global + sheet (if open)

  private:
    void extend (std::size_t to);   // synchronised resize

    statistic_type m_global_stat;
    statistic_type *m_sheet_stats;
    std::size_t m_sheet_stats_count;
    std::mutex m_extend_mutex;
};
```

transaction_statistic<S> wraps any statistic S and adds a dynamically-grown array of per-sheet copies. collect() always increments the global; if the calling transaction has a sheet open, it additionally increments the sheet-local copy. The array grows on demand under m_extend_mutex; a non-watching transaction pays nothing.
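Condensed, the collect path described above reduces to a few lines (a sketch, not the header's verbatim body):

```cpp
// Sketch of transaction_statistic<S>::collect per the description above.
template <class S>
void
transaction_statistic<S>::collect (const typename S::rep &value)
{
  m_global_stat.collect (value);          // always pay the global increment

  transaction_sheet sheet = transaction_sheet_manager::get_sheet ();
  if (sheet == transaction_sheet_manager::INVALID_TRANSACTION_SHEET)
    {
      return;                             // nobody watching this transaction
    }
  if (sheet >= m_sheet_stats_count)
    {
      extend (sheet + 1);                 // grow under m_extend_mutex
    }
  m_sheet_stats[sheet].collect (value);   // sheet-local copy
}
```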
The sheet identifier comes from transaction_sheet_manager, a static class:
```cpp
// transaction_sheet_manager — monitor_transaction.hpp
class transaction_sheet_manager
{
  public:
    static const transaction_sheet INVALID_TRANSACTION_SHEET = std::numeric_limits<std::size_t>::max ();
    static const std::size_t MAX_SHEETS = 1024;

    static bool start_watch (void);
    static void end_watch (bool end_all = false);
    static transaction_sheet get_sheet (void);

  private:
    static std::size_t s_current_sheet_count;
    static unsigned s_sheet_start_count[MAX_SHEETS];
    static transaction_sheet *s_transaction_sheets;
    static std::mutex s_sheets_mutex;
};
```

Two pieces of state cooperate: s_transaction_sheets[i] maps the i-th transaction to its sheet slot (or INVALID); s_sheet_start_count[k] counts nested start_watch calls on sheet k, so watches are re-entrant and end_watch only frees the slot when the count drops to zero.
start_watch takes s_sheets_mutex, looks up the calling transaction via logtb_get_current_tran_index(), and either bumps its existing sheet count or scans s_sheet_start_count[] for the first count-0 slot. If all MAX_SHEETS = 1024 slots are taken, it returns false — sheets are an acquired resource.
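The slot-assignment logic, condensed from the description above (a sketch; allocation and error handling elided):

```cpp
// Sketch of start_watch slot assignment (behaviour per the text; not verbatim).
bool
transaction_sheet_manager::start_watch (void)
{
  std::unique_lock<std::mutex> ulock (s_sheets_mutex);
  int tran_index = logtb_get_current_tran_index ();

  if (s_transaction_sheets[tran_index] != INVALID_TRANSACTION_SHEET)
    {
      s_sheet_start_count[s_transaction_sheets[tran_index]]++;   // re-entrant watch
      return true;
    }
  for (std::size_t k = 0; k < MAX_SHEETS; k++)
    {
      if (s_sheet_start_count[k] == 0)    // first free slot
        {
          s_transaction_sheets[tran_index] = k;
          s_sheet_start_count[k] = 1;
          s_current_sheet_count++;
          return true;
        }
    }
  return false;                           // all MAX_SHEETS taken
}
```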
The architectural lever is get_sheet’s fast path:
```cpp
// transaction_sheet_manager::get_sheet — monitor_transaction.cpp
if (s_current_sheet_count == 0)
  return INVALID_TRANSACTION_SHEET;   // no sheets; early out
```

When no transaction has ever called start_watch (the common case), every transaction_statistic::collect pays one global counter increment plus a comparison. Zero per-collect cost when no one is watching.
The trade-off: sheets are reused and reused sheets inherit values from their previous tenant. Per the header text, “the correct way of inspecting a transaction is to fetch two snapshots, once at the beginning and once at the end, and do the difference”. Snapshot-difference is the consumer’s problem — same semantic as PostgreSQL’s pg_stat_statements_reset() + deltas.
sequenceDiagram
participant TX as Transaction (tx_idx)
participant TSM as transaction_sheet_manager
participant TS as transaction_statistic<S>
participant G as global stat S
TX->>TSM: start_watch ()
TSM-->>TX: (assign sheet k, count 1)
TX->>TS: collect(value)
TS->>G: m_global_stat.collect(value)
TS->>TSM: get_sheet ()
TSM-->>TS: k
TS->>TS: extend if k >= m_sheet_stats_count
TS->>TS: m_sheet_stats[k].collect(value)
Note over TX,TSM: end of measurement window
TX->>TS: fetch(buf, FETCH_TRANSACTION_SHEET)
TS->>TSM: get_sheet ()
TSM-->>TS: k
TS-->>TX: m_sheet_stats[k] value (snapshot)
TX->>TSM: end_watch ()
TSM-->>TSM: --s_sheet_start_count[k]; if 0, free slot
The fetch path applies the same early-out. monitor::fetch_transaction_statistics first checks whether the current transaction even has a sheet:
```cpp
// monitor::fetch_transaction_statistics — monitor_registration.cpp
void
monitor::fetch_transaction_statistics (statistic_value *destination) const
{
  if (transaction_sheet_manager::get_sheet () == transaction_sheet_manager::INVALID_TRANSACTION_SHEET)
    {
      // no transaction sheet, nothing to fetch
      return;
    }
  fetch_statistics (destination, FETCH_TRANSACTION_SHEET);
}
```

If no sheet, the destination buffer is left untouched. If a sheet exists, the same fetch_statistics walk is invoked but with FETCH_TRANSACTION_SHEET instead of FETCH_GLOBAL, and each transaction_statistic reaches into its m_sheet_stats[sheet] slot rather than m_global_stat. Statistics that are not transaction_statistic-wrapped (plain primitives, plain accumulators) ignore the mode and write zero in transaction mode — they have no per-sheet copy to return.
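A consumer following the two-snapshot discipline ends up with code of roughly this shape (illustrative; buffer ownership elided):

```cpp
// Illustrative two-snapshot delta over the calling transaction's sheet.
cubmonitor::monitor &mon = cubmonitor::get_global_monitor ();
cubmonitor::statistic_value *before = mon.allocate_statistics_buffer ();
cubmonitor::statistic_value *after = mon.allocate_statistics_buffer ();

cubmonitor::transaction_sheet_manager::start_watch ();
mon.fetch_transaction_statistics (before);    // snapshot at window start
// ... the measured window ...
mon.fetch_transaction_statistics (after);     // snapshot at window end
cubmonitor::transaction_sheet_manager::end_watch ();

for (std::size_t i = 0; i < mon.get_statistics_count (); i++)
  {
    // reused sheets inherit the previous tenant's values; only the delta is meaningful
    cubmonitor::statistic_value delta = after[i] - before[i];
  }
```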
Hot-path discipline — atomic-free where the volume demands it
The C perf_monitor.c array is older and more pervasive: any counter under PSTAT_* lives there. Writes go through perfmon_inc, perfmon_add_stat, or perfmon_add_stat_at_offset for compound counters. Atomic flavours (ATOMIC_INC_64) are always safe; non-atomic flavours are used where contention is known to be rare or a slight skew is acceptable in exchange for not paying a lock cmpxchg on every page fix.
The CBRD-26177 directive — reflected in the worker-pool rebuild — is “no perfmon on the connection-worker hot path”. Every counter increment is a write that must eventually become visible across cores, and that visibility costs more than the increment itself. The rebuild moved instrumentation either onto slower job-execution paths or onto per-thread shards that readers walk at fetch time.
The C++ cubmonitor layer offers a different escape: because the registration model knows whether the calling transaction is watching, the active writer count is bounded by opened sheets, not threads. With s_current_sheet_count == 0, every collect is a single non-contended global increment plus a constant-time check; with one sheet open, two writes (global + sheet-local).
Three patterns recur.
Atomic where writer volume is high and reads are infrequent. lockfree_hashmap uses cubmonitor::atomic_counter_timer_stat. Every concurrent inserter pays a fetch_add; SHOW STATS pays a load. Acceptable because hashmap-op frequency is bounded by surrounding business logic.
Non-atomic where one writer is structural. Daemon statistics (log flusher, page flusher, vacuum master) are written by one thread and read by many. Non-atomic accumulator_statistic is correct: no write race, and torn 64-bit reads are fine on x86-64 (aligned 64-bit writes are atomic) and acceptable as approximate elsewhere.
Per-transaction sheets where the writer is the user, not the engine. SHOW EXEC STATISTICS reports counters for the calling session: the session opens a sheet at connect time, the engine increments both the global and the sheet, disconnection ends the watch. The sheet machinery expresses this without re-implementing per-session counters per subsystem.
flowchart LR
W1["Worker 1
plain accumulator_statistic
(no sync; one writer)"]
W2["Worker 2
amount_accumulator_atomic_statistic
(fetch_add per event)"]
W3["Worker 3
transaction_statistic<accum_atomic>
collect → global atomic + optional sheet"]
W1 -->|writes| G1["m_value (uint64_t)"]
W2 -->|fetch_add| G2["m_value (atomic uint64_t)"]
W3 -->|atomic add| G3["m_global_stat.m_value"]
W3 -.->|if sheet open| S1["m_sheet_stats[k].m_value"]
R["fetch (FETCH_GLOBAL)"]
R --> G1
R --> G2
R --> G3
RT["fetch (FETCH_TRANSACTION_SHEET)
(only if get_sheet () != INVALID)"]
RT -.-> S1
Reset and snapshot semantics
There is no monitor::reset — the C++ layer does not support resetting a statistic. The only paths that “reset” anything are:
- transaction_statistic::extend allocates fresh statistic instances (new statistic_type[to]) for newly-grown sheet slots; the fresh ones start from zero.
- transaction_sheet_manager::start_watch on a reused sheet does not zero the sheet’s counters — it just bumps the start count. The caller must take a snapshot at the start and another at the end and do the difference itself, exactly as the header text demands.
- The C perf_monitor.c does have a reset path (perfmon_get_stats_and_clear) that clears the in-memory array; this is invoked by the RESET STATISTICS family from the client side and by the per-session statistic block.
The “snapshot atomicity” question — can SHOW STATS observe a coherent set of counters? — is answered the same way as in MySQL/Oracle: no. Each counter is read independently. The fetch loop walks registrations sequentially, and a writer may increment between two reads. This is acceptable because the consumers (statdump, dashboards) are looking at long-running aggregates where a per-counter race of one event is within the noise floor.
Per-subsystem monitor — ovfp_threshold_mgr
Not every monitoring concern fits a counter. The “vacuum overflow-page threshold” tracker — the subsystem behind monitor_vacuum_ovfp_threshold.{cpp,hpp} — is the canonical example. The question it answers is: which (class, index) pairs have exceeded a configured number of overflow-page reads, and when did that happen? The answer is a sorted list of records, not a counter, so the module owns its own data structure rather than registering with the monitor.
The header is server-only:
```cpp
#if !defined (SERVER_MODE)
#error Belongs to server module
#endif
```
```cpp
class ovfp_monitor_lock final
{
#define LOCK_FREE_OWNER_ID (-1)
#define LOCK_ALL_OWNER_ID (VACUUM_MAX_WORKER_COUNT)
#define LOCK_ITEMS_SIZE (VACUUM_MAX_WORKER_COUNT)

  private:
    std::mutex m_ovfp_monitor_mutex;
    int m_lock_arr[LOCK_ITEMS_SIZE];

  public:
    void lock (int lock_index, int owner_id);
    void unlock (int lock_index, int owner_id);
};
```

Three classes layer the implementation. ovfp_threshold is the per-worker linked list of (BTID, OID, recent_pages, max_pages, recent_time, max_time, hit_cnt) records. ovfp_printer extends it with a sort and a merge-into-master path for dump time. ovfp_threshold_mgr is the singleton coordinator that fans out per-vacuum-worker arrays.
```cpp
// ovfp_threshold_mgr — monitor_vacuum_ovfp_threshold.hpp
class ovfp_threshold_mgr
{
  private:
    ovfp_monitor_lock m_ovfp_lock;
    ovfp_threshold m_ovfp_threshold[VACUUM_MAX_WORKER_COUNT];

    UINT64 m_over_secs;
    char m_since_time[32];
    int m_threshold_pages;

  public:
    void init ();
    void add_read_pages_count (THREAD_ENTRY *thread_p, int worker_idx, BTID *btid, int npages);
    void dump (THREAD_ENTRY *thread_p, FILE *outfp);

    inline int get_threshold_page_cnt () const
    {
      return m_threshold_pages;
    }
};
```

The single global instance lives in vacuum.c:
```cpp
// g_ovfp_threshold_mgr — query/vacuum.c
class ovfp_threshold_mgr g_ovfp_threshold_mgr;
```

and the vacuum worker calls into it on every overflow-page read once the per-thread page count crosses the threshold:
```cpp
if (thread_p->read_ovfl_pages_count >= g_ovfp_threshold_mgr.get_threshold_page_cnt ())
  {
    g_ovfp_threshold_mgr.add_read_pages_count (thread_p, worker->idx, btid_int.sys_btid,
                                               thread_p->read_ovfl_pages_count);
  }
```

Three design choices matter here. Per-worker partitioning — m_ovfp_threshold[VACUUM_MAX_WORKER_COUNT] gives each worker its own list and lock slot, so add_info never blocks on a sibling. Collapse-on-dump — dump locks every worker with LOCK_ALL_OWNER_ID, folds entries into a single ovfp_printer which merges duplicate (BTID, OID) tuples by summing hit_cnt and taking max(read_pages[MAX_POS]), sorts by recent time, prints. Hold-and-sleep mutex — ovfp_monitor_lock::lock is a hand-rolled token lock layered on std::mutex:
```cpp
// ovfp_monitor_lock::lock — monitor_vacuum_ovfp_threshold.cpp
void ovfp_monitor_lock::lock (int lock_index, int owner_id)
{
  m_ovfp_monitor_mutex.lock ();
  while (m_lock_arr[lock_index] != LOCK_FREE_OWNER_ID)
    {
      m_ovfp_monitor_mutex.unlock ();
      usleep (1);
      m_ovfp_monitor_mutex.lock ();
    }
  m_lock_arr[lock_index] = owner_id;
  m_ovfp_monitor_mutex.unlock ();
}
```

It guards an integer ownership token under a global mutex while yielding when contended. Suboptimal in theory but harmless in practice — contention exists only between worker-i’s hot path and dump’s all-workers grab, and dump runs out-of-band.
The user-facing surface is a fixed-format dump headed by m_since_time (set by init) and the configured m_threshold_pages (PRM_ID_VACUUM_OVFP_CHECK_THRESHOLD); records older than m_over_secs (PRM_ID_VACUUM_OVFP_CHECK_DURATION) are pruned at dump time by check_over_duration_times.
SHOW integration
The hand-off to SQL is via the SHOW family — see cubrid-show-commands.md for the dispatch architecture. Show types touching monitoring use S_SHOWSTMT_SCAN and read from PSTAT slots: SHOWSTMT_PAGE_BUFFER_STATUS peeks PSTAT_PB_DIRTY_CNT/PSTAT_PB_LRU*_CNT/PSTAT_PB_VICT_CAND; SHOWSTMT_TRAN_TABLES walks the transaction descriptor table; SHOWSTMT_THREADS reads worker-pool state plus pstat_Global peeks per thread. SHOW EXEC STATISTICS computes a session-scoped delta via perfmon_get_stats + perfmon_calc_diff_stats over the C array, not the C++ monitor. The two layers do not currently merge output; the architectural intent is for the C++ layer to subsume the C layer over time, but the migration is incremental.
pstat_Metadata and the C-side surface
The C catalogue is a single static array PSTAT_METADATA pstat_Metadata[] indexed by PERF_STAT_ID:
```cpp
// pstat_metadata — base/perf_monitor.h
struct pstat_metadata
{
  PERF_STAT_ID psid;
  const char *stat_name;
  PSTAT_VALUE_TYPE valtype;   // ACCUMULATE_SINGLE / PEEK_SINGLE /
                              // COUNTER_TIMER / COMPUTED_RATIO / COMPLEX
  int start_offset;           // computed at startup
  int n_vals;
  PSTAT_DUMP_IN_FILE_FUNC f_dump_in_file;
  PSTAT_DUMP_IN_BUFFER_FUNC f_dump_in_buffer;
  PSTAT_LOAD_FUNC f_load;
};
```

start_offset/n_vals are filled at perfmon_initialize so the counters live in one contiguous UINT64 buffer (pstat_Global.global_stats). PSTAT_COUNTER_TIMER_VALUE rows occupy four slots (count, total, max, derived avg) laid out by PSTAT_COUNTER_TIMER_*_VALUE macros. The Num_*/Total_time_*/Max_time_*/Avg_time_* naming matches cubmonitor’s register_to_monitor — the visual continuity that makes the two layers indistinguishable in statdump output.
PSTAT_COMPLEX_VALUE rows are hand-rolled blobs: PSTAT_PBX_FIX_COUNTERS is a flat array of dimensions module × page_type × page_mode × latch × cond indexed by PERF_PAGE_FIX_STAT_OFFSET, with custom dump/load callbacks. A single PSTAT row exposes hundreds of dimensions without bloating the enum. The page-buffer hot path uses perfmon_add_stat_at_offset (thread_p, psid, offset, amount) to write into pstat_Global.global_stats[start_offset + offset]. The CBRD-26177 directive forbids such calls in the connection-worker dispatch loop; once the request is on a job-execution thread, per-fix instrumentation is fine.
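The flattening itself is ordinary row-major index arithmetic. A schematic with invented dimension sizes (the real *_CNT values live in perf_monitor.h):

```cpp
// Schematic only: row-major flattening of module × page_type × page_mode ×
// latch × cond into one slot offset. The dimension sizes here are invented.
constexpr int N_PAGE_TYPE = 24, N_PAGE_MODE = 5, N_LATCH = 3, N_COND = 3;

inline int
page_fix_offset (int module, int page_type, int page_mode, int latch, int cond)
{
  return (((module * N_PAGE_TYPE + page_type) * N_PAGE_MODE + page_mode)
          * N_LATCH + latch) * N_COND + cond;
}
// the write then lands at pstat_Global.global_stats[start_offset + offset]
```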
flowchart TB
subgraph "C side (perf_monitor)"
META["pstat_Metadata[PSTAT_∗]
{name, valtype, offset, n_vals}"]
GLOBAL["pstat_Global.global_stats[]
contiguous UINT64 buffer"]
META -->|describes| GLOBAL
end
subgraph "C++ side (cubmonitor)"
REG["m_registrations[]
fetch_function + count + names"]
STATS["accumulator/gauge/max/min
primitives + transaction_statistic"]
REG -.->|fetch_function calls| STATS
end
subgraph "Per-subsystem"
OVFP2["ovfp_threshold_mgr
m_ovfp_threshold[VACUUM_MAX_WORKER_COUNT]"]
end
HOT2["Hot paths
pgbuf_fix, lock_object,
btree_∗, qexec_∗"]
HOT2 -->|perfmon_add_stat[_at_offset]| GLOBAL
HOT2 -->|stat.collect / autotimer| STATS
HOT2 -->|add_read_pages_count| OVFP2
CONS["Consumers
SHOW STATS / EXEC STATISTICS
statdump / sysprm
trace facilities
ovfp dump"]
CONS -->|perfmon_get_stats / calc_diff_stats| GLOBAL
CONS -.->|monitor::fetch_*| REG
CONS -->|ovfp_threshold_mgr::dump| OVFP2
Source Walkthrough
The symbols below are canonical anchors; line numbers are hints scoped to this doc’s updated: date.
Definitions and casts (monitor_definition.hpp, monitor_statistic.{hpp,cpp}): cubmonitor::statistic_value (the 64-bit slot type), clock_type/time_point/duration (chrono aliases), fetch_mode with FETCH_GLOBAL/FETCH_TRANSACTION_SHEET, amount_rep/floating_rep/time_rep, statistic_value_cast/amount_rep_cast/floating_rep_cast/time_rep_cast (the time round-trip is lossy by design).
Primitive statistics (monitor_statistic.hpp): primitive<Rep> (single-value), atomic_primitive<Rep> (atomic counterpart), atomic_primitive<time_rep> (specialised — duration is not trivially atomic). Subclasses: accumulator_statistic/accumulator_atomic_statistic (add), gauge_statistic/gauge_atomic_statistic (overwrite), max_statistic/max_atomic_statistic (CAS loop), min_statistic/min_atomic_statistic. Aliases enumerate every (rep × kind × atomic?) combination.
Grouped statistics (monitor_collect.hpp): timer (RAII clock snapshot), timer_statistic<T>, counter_timer_statistic<A,T>, counter_timer_max_statistic<A,T,M>, plus autotimer RAII inner classes, register_to_monitor, build_name_vector. Aliases: counter_timer_stat, atomic_counter_timer_stat, timer_stat, transaction_timer_stat, transaction_atomic_timer_stat.
Registry (monitor_registration.{hpp,cpp}): monitor, fetch_function, registration, register_statistics(count, fetch_f, names) and the template overload, add_registration, allocate_statistics_buffer, fetch_global_statistics, fetch_transaction_statistics, fetch_statistics, get_statistics_count/get_registered_count/get_statistic_name, get_global_monitor.
Per-transaction sheets (monitor_transaction.{hpp,cpp}): transaction_sheet (size_t), transaction_sheet_manager with INVALID_TRANSACTION_SHEET/MAX_SHEETS = 1024, start_watch/end_watch/get_sheet/static_init, internal state s_current_sheet_count/s_sheet_start_count[]/s_transaction_sheets[]/s_sheets_mutex. The wrapper transaction_statistic<S> with collect/extend.
Vacuum overflow-page threshold (monitor_vacuum_ovfp_threshold.{hpp,cpp}): ovfp_monitor_lock (per-worker token mutex), INDEX_OVFP_INFO (per-(BTID,OID) record), ovfp_threshold (add/find/alloc_info_mem/free_info_mem/check_over_duration_times/set_worker_idx), ovfp_printer (sort/add_info merge), ovfp_threshold_mgr (init/add_read_pages_count/dump/get_classoid/get_class_name_index_name/time_to_string/print), and the global g_ovfp_threshold_mgr declared in query/vacuum.c.
C-side perfmon (cross-reference): PERF_STAT_ID enum, pstat_Metadata[], PSTAT_METADATA struct, PSTAT_VALUE_TYPE enum, pstat_Global.global_stats[]. Write paths perfmon_inc/perfmon_add_stat/perfmon_add_stat_at_offset/perfmon_add_at_offset; read paths perfmon_get_stats/perfmon_calc_diff_stats/perfmon_calc_diff_stats_for_trace. PSTAT_COMPLEX_VALUE callbacks include f_{load,dump_in_file,dump_in_buffer}_Num_data_page_fix_ext.
Position hints (as of this revision)
| Symbol | File | Line |
|---|---|---|
| statistic_value, FETCH_GLOBAL/FETCH_TRANSACTION_SHEET | src/monitor/monitor_definition.hpp | 33, 42–43 |
| primitive<Rep>, atomic_primitive<Rep>, atomic_primitive<time_rep> | src/monitor/monitor_statistic.hpp | 127, 169, 220 |
| accumulator_statistic::collect, max_atomic_statistic::collect | src/monitor/monitor_statistic.hpp | 494, 537 |
| statistic_value_cast(time_rep), accumulator_atomic_statistic<time_rep>::collect | src/monitor/monitor_statistic.cpp | 67, 169 |
| timer, timer_statistic<T>, counter_timer_statistic<A,T>, counter_timer_max_statistic<A,T,M> | src/monitor/monitor_collect.hpp | 37, 80, 132, 194 |
| counter_timer_statistic::register_to_monitor, counter_timer_max_statistic::register_to_monitor | src/monitor/monitor_collect.hpp | 388, 519 |
| build_name_vector (base) | src/monitor/monitor_collect.cpp | 26 |
| monitor class | src/monitor/monitor_registration.hpp | 65 |
| monitor::fetch_statistics, fetch_transaction_statistics, register_statistics, get_global_monitor | src/monitor/monitor_registration.cpp | 92, 109, 138, 159 |
| transaction_sheet_manager, transaction_statistic<S>, transaction_statistic::extend, ::collect | src/monitor/monitor_transaction.hpp | 54, 99, 152, 257 |
| transaction_sheet_manager::start_watch/end_watch/get_sheet | src/monitor/monitor_transaction.cpp | 71, 128, 179 |
| ovfp_monitor_lock, ovfp_threshold, ovfp_threshold_mgr | src/monitor/monitor_vacuum_ovfp_threshold.hpp | 35, 75, 112 |
| ovfp_monitor_lock::lock, ovfp_threshold::add/find/check_over_duration_times | src/monitor/monitor_vacuum_ovfp_threshold.cpp | 50, 99, 169, 263 |
| ovfp_printer::sort/add_info, ovfp_threshold_mgr::init/add_read_pages_count/dump | src/monitor/monitor_vacuum_ovfp_threshold.cpp | 303, 351, 570, 580, 596 |
| g_ovfp_threshold_mgr global instance | src/query/vacuum.c | 663 |
| pstat_metadata, PSTAT_VALUE_TYPE, PSTAT_COUNTER_TIMER_*_VALUE macros | src/base/perf_monitor.h | 724, ~700, 140–143 |
| perfmon_get_stats, perfmon_calc_diff_stats, perfmon_add_stat_at_offset | src/base/perf_monitor.c | 792, 1404, 3481 |
| ct_stat_type = atomic_counter_timer_stat | src/base/lockfree_hashmap.hpp | 114 |
Cross-check Notes
Two coexisting layers, partial overlap. cubmonitor (C++ templates) and perf_monitor (C array) are separate counter systems in the same binary. The naming convention is shared (Num_*, Total_time_*, Max_time_*, Avg_time_*) but the storage is independent. SHOW EXEC STATISTICS reads the C array; the C++ monitor::fetch_global_statistics is reachable but populated by only a small set of subsystems (lockfree_hashmap and a few other call sites). Readers must be clear which array they mean.
atomic_primitive<floating_rep> is gated. The header guard MONITOR_ENABLE_ATOMIC_FLOATING_REP keeps atomic floating-point statistics behind a compile-time switch — std::atomic<double>::fetch_add is C++20 and the project still uses C++17. The CAS loop is implemented for completeness but not compiled in.
Snapshot consistency is explicitly weak. monitor::fetch_statistics walks registrations with no global lock. Per-counter consistency holds for atomic statistics; cross-counter consistency does not. The intended consumer is a delta consumer (read at t1, read at t2, subtract).
Sheets reuse and inherit values. transaction_sheet_manager::start_watch reuses sheet slots whose start count has reached zero, but the reused sheet’s transaction_statistic slot is not zeroed. The new watcher must take its own snapshot at start.
transaction_statistic::extend is per-instance, not per-manager. Each transaction_statistic<S> carries its own m_sheet_stats[] and m_extend_mutex. The first time a sheet number k is seen, the array grows to k+1. Cold-start pays an allocation; subsequent collects pay only the increment.
ovfp_monitor_lock is hold-and-sleep, not spin. The usleep (1) yields rather than busy-spins. Acceptable here because contention is structurally rare (only worker-i hot path vs all-workers dump-time grab); not acceptable for a connection-worker hot path.
Monitor lifetime is server-lifetime. The C++ monitor has no unregister_statistics. Registered statistics capture this by reference and must outlive the global monitor. The lifetime contract is not enforced by the type system.
time_rep_cast (statistic_value_cast (stat)) != stat. Time values export as microseconds and re-import as nanoseconds, losing sub-microsecond precision. Documented in monitor_statistic.cpp.
C perf_monitor carries its own dimensions. PSTAT_COMPLEX_VALUE rows like PSTAT_PBX_FIX_COUNTERS flatten 5-dimensional spaces (module × page_type × page_mode × latch × cond) into one slot range. The C++ monitor has no equivalent — multidimensional statistics need either many separate registrations or a custom fetch lambda.
CBRD-26177 directive lives in commit history. The “no perfmon on the connection-worker hot path” rule manifests as the absence of perfmon_inc/perfmon_add_* calls inside the worker-pool dispatch loop. Future contributors adding instrumentation there need to know the constraint is intentional.
Open Questions
Will the C perf_monitor array migrate to cubmonitor? The C++ design is structurally cleaner — type-safe, composable, sheets out of the box — but the C array has hundreds of counters, each with its own pstat_Metadata row. A piecemeal migration risks splitting SHOW STATS output; a big-bang migration is months of work. Whether the current “two layers” state is permanent or transitional is not committed in the source.
Should transaction_statistic extend eagerly? The dynamic resize is correct but introduces an allocation in the hot path the first time each sheet number is hit. Pre-allocating to MAX_SHEETS = 1024 would cost server-start memory but eliminate the resize. Worth revisiting if SHOW EXEC STATISTICS becomes always-on per session.
Is MAX_SHEETS = 1024 the right ceiling? The header text says “we can consider reducing this”. 1024 is generous for concurrent watchers but caps per-statistic memory at 1024 * sizeof (S). A configurable ceiling tied to NUM_NORMAL_TRANS may be a better default.
Should ovfp_threshold_mgr use lock-free lists? Per-worker locks avoid cross-worker contention; the usleep is heavier than needed for the rare collision. A lock-free linked list per worker would be faster but the contention frequency is structurally low — only at dump time.
Do we want sampling on top of always-on counters? cubmonitor is purely counter-shaped. CUBRID’s sampling-style observability lives in SQL trace and broker SQL log, outside the monitoring subsystem. Whether to bring sampling under the same umbrella is deferred.
Should the C++ monitor support unregister_statistics? Today’s lambda-by-reference capture rules out registrations that come and go. If future subsystems become more dynamic (loadable plugins, per-database instrumentation), the model becomes a liability.
Sources
Primary headers and implementations under src/monitor/:
- monitor_definition.hpp — statistic_value, clock_type, fetch_mode, the foundational typedefs.
- monitor_statistic.hpp / monitor_statistic.cpp — primitive and atomic statistics, accumulator/gauge/max/min, type casts.
- monitor_collect.hpp / monitor_collect.cpp — grouped statistics (timer, counter+timer, counter+timer+max), register_to_monitor, build_name_vector.
- monitor_registration.hpp / monitor_registration.cpp — the monitor registry class and the file-scope global instance.
- monitor_transaction.hpp / monitor_transaction.cpp — transaction_sheet_manager, transaction_statistic<S>.
- monitor_vacuum_ovfp_threshold.hpp / monitor_vacuum_ovfp_threshold.cpp — per-vacuum-worker overflow-page threshold tracker.
Cross-references inside the engine:
- src/base/perf_monitor.h / src/base/perf_monitor.c — the older C counter array, PSTAT_* enum, pstat_Metadata[], perfmon_* write/read API.
- src/base/perf.cpp / src/base/perf_monitor_trackers.cpp — perf utilities and trackers.
- src/base/lockfree_hashmap.hpp — example consumer of cubmonitor::atomic_counter_timer_stat.
- src/query/vacuum.c — declares g_ovfp_threshold_mgr and calls add_read_pages_count from the vacuum overflow-page read path.
- src/base/system_parameter.{c,h} — defines PRM_ID_VACUUM_OVFP_CHECK_DURATION, PRM_ID_VACUUM_OVFP_CHECK_THRESHOLD.
Related curated docs in this knowledge base:
- cubrid-show-commands.md — how SHOW STATS, SHOW EXEC STATISTICS, SHOW THREADS, SHOW PAGE BUFFER STATUS dispatch through S_SHOWSTMT_SCAN and reach the counter arrays.
- cubrid-vacuum.md — the vacuum subsystem architecture that g_ovfp_threshold_mgr supplements.
- cubrid-thread-worker-pool.md — the worker-pool rebuild whose “no perfmon on the connection-worker hot path” directive shaped which counters survived where.
JIRA references:
- CBRD-26177 — the directive eliminating perfmon increments from the connection-worker hot path; the rationale is contention on shared counter cache lines under the rebuilt worker pool.