CUBRID Design Philosophy — Why the Codebase Looks the Way It Does

The detail docs in this directory answer what each subsystem is and how it works. They do not justify the shape of the system as a whole. Why does the lock manager still exist when there is also MVCC? Why is the WAL append path queued through a singly-linked list instead of going straight into the log buffer? Why is procedural code shipped to a separate process? The answers belong to the layer above the layer: the historical pressures, the academic references the original authors carried in their heads, and the engineering bets that closed off some directions and opened others.

This document is that layer. Each section names a decision, explains what else the team could have done, says what they read or what they were trying to avoid, and points at the consequence in the source. Where a section’s what is already covered elsewhere, the cross-reference is named and the prose stays at the level of why. The decisions are organised from the lowest-level storage commitments outward to the operational shell.

```mermaid
flowchart LR
  subgraph HIST["Historical / academic input"]
    UNISQL["UniSQL OODB heritage<br/>(Won Kim, 1992)"]
    ARIES["ARIES<br/>(Mohan TODS 1992)"]
    SELINGER["System R / Selinger<br/>(SIGMOD 1979)"]
    VOLCANO["Volcano<br/>(Graefe TKDE 1994)"]
    PG["PostgreSQL XLogInsert<br/>(centralised latch)"]
    ORA["Oracle in-process JVM<br/>cautionary tale"]
  end

  subgraph DEC["CUBRID decisions"]
    D1["1) OODB heritage"]
    D2["2) ARIES recovery"]
    D3["3) MVCC + locks"]
    D4["4) Volcano executor"]
    D5["5) Selinger optimizer"]
    D6["6) Lock-free prior list"]
    D7["7) DWB"]
    D8["8) Separate JVM"]
    D9["9) Broker + CAS pool"]
    D10["10) Local-only HA"]
    D11["11) Page-based"]
    D12["12) SA + CS dual build"]
  end

  subgraph CONS["Source consequences"]
    C1["class_object.c +<br/>locator workspace"]
    C2["log_recovery_∗.c +<br/>fuzzy checkpoint"]
    C3["mvcc.h + lock_manager.c +<br/>vacuum.c side-by-side"]
    C4["qexec_∗ tree walk"]
    C5["query_graph.c +<br/>query_planner.c"]
    C6["log_append.cpp +<br/>logpb_prior_lsa_∗"]
    C7["double_write_buffer.cpp"]
    C8["pl_engine/ + pl_connection"]
    C9["broker.c + cas.c +<br/>SCM_RIGHTS handoff"]
    C10["heartbeat.c local quorum"]
    C11["heap, btree, page_buffer<br/>in-place"]
    C12["libcubridsa.so /<br/>libcubridcs.so"]
  end

  UNISQL --> D1
  ARIES --> D2 & D3 & D6 & D11
  SELINGER --> D5
  VOLCANO --> D4
  PG --> D6
  ORA --> D8

  D1 --> C1
  D2 --> C2
  D3 --> C3
  D4 --> C4
  D5 --> C5
  D6 --> C6
  D7 --> C7
  D8 --> C8
  D9 --> C9
  D10 --> C10
  D11 --> C11
  D12 --> C12
```

ARIES is the only input that fans out to several decisions (recovery, MVCC’s vacuum coupling, the WAL pipeline, page-based storage) because it is the algorithm the engine is fundamentally built around.

1. OODB heritage: every table is a class, every row an object

Decision. Every table is a class, every row is an instance with a persistent OID, and the in-memory schema graph carries methods, inheritance, partition trees, and resolutions alongside attributes and constraints. The client side maintains a workspace — a hash table of in-memory MOP (memory object pointer) handles keyed by OID — that mediates all access to persistent objects, batches dirty objects into LC_COPYAREA buffers, and ships them to the server-side locator_*_force family. The abstraction is not a thin veneer; it is the spine the engine grew on.

Why. CUBRID started life as UniSQL/X, an object-oriented database designed in the early 1990s by Won Kim’s group at MCC and commercialised as one of the first ORDBMSes. The OODB lineage is the directly named ancestor of class_object.c, of the metaclass / root-class distinction, and of the OID-as-physical-address (volid, pageid, slotid) triple that every fetch and every index leaf carries. The EXODUS storage manager (Carey & DeWitt, 1986) and Stonebraker’s POSTGRES (1986) are the academic precursors. The legacy was never deleted because deletion would have meant rewriting the storage layer — every page format, every index leaf, every catalog row references OIDs. The relational evolution therefore layered the relational features on top of the object scaffolding: a row is an object whose class is the table; a foreign key is an attribute of object-reference type; a partition is a class with a parent class link.
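
The OID triple and the MOP workspace can be pictured with a minimal sketch; this is not CUBRID source, and the hash function and map below are illustrative stand-ins for the workspace hash table the prose describes.

```cpp
// Illustrative sketch (not CUBRID source): the OID-as-physical-address triple
// and a workspace that hands out one MOP handle per OID.
#include <cstdint>
#include <unordered_map>

struct OID {                 // physical address of an instance
  int16_t volid;             // volume the page lives in
  int32_t pageid;            // page within the volume
  int16_t slotid;            // slot within the page
  bool operator==(const OID &o) const {
    return volid == o.volid && pageid == o.pageid && slotid == o.slotid;
  }
};

struct OIDHash {             // illustrative hash, not the workspace's real one
  size_t operator()(const OID &o) const {
    return (size_t(o.pageid) << 16) ^ (size_t(o.volid) << 8) ^ size_t(o.slotid);
  }
};

struct MOP {                 // in-memory handle mediating access to one object
  OID   oid{};
  void *object = nullptr;    // cached copy of the instance, if fetched
  bool  dirty  = false;      // queued for the next bulk flush through the locator
};

// The workspace: every persistent object is reached through exactly one MOP.
using Workspace = std::unordered_map<OID, MOP, OIDHash>;
```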

Consequence. class_object.c is 220 KB because it holds a graph: SM_CLASS with SM_ATTRIBUTE lists, an SM_METHOD list, an SM_RESOLUTION list for inheritance disambiguation, an SM_PARTITION subtree, an SM_TEMPLATE for in-flight DDL, and an SM_REPRESENTATION chain recording every historical layout the class has had. The locator (locator_cl.c / locator_sr.c) is the bulk-fetch / flush bridge: client-side a workspace of MOPs, server-side LC_COPYAREA routed through heap, btree, lock, log, FK, and replication. The PL family inherits the method-callback abstraction from the OODB era — modern cub_pl is the renovated form of what was originally an in-process method-on-class invocation.

Cross-references: cubrid-class-object.md, cubrid-locator.md, cubrid-pl-javasp.md.

2. ARIES recovery, by the book

Decision. Recovery is a three-pass replay over the WAL anchored at the most-recent fuzzy checkpoint: an analysis pass that rebuilds the dirty-page table and transaction table, a redo pass that reapplies every record whose target page LSN is older than the record’s, and an undo pass that walks each loser transaction backward emitting compensation log records (CLRs). The log is physiological for heap data and logical for index operations; the checkpoint is fuzzy (consistent checkpoints would freeze the engine).

Why. Mohan et al.’s ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging (TODS 17.1, 1992) is the canonical algorithm of disk-resident relational recovery, and it was canonical when CUBRID’s earliest commits were written. The paper argues for a small set of properties that the algorithm jointly satisfies — repeating-history (redo every record, including those of loser transactions, before undoing), CLRs to make undo restartable, and a fuzzy checkpoint that does not block the engine — and a generation of engines (DB2, SQL Server, InnoDB, CUBRID) implemented it almost literally. CUBRID has no incentive to deviate because the algorithm is correct and the alternatives (shadow paging, log structuring) would each break some other property the engine depends on.

Consequence. log_recovery.c and log_recovery_redo.cpp are organised around the three-pass shape; redo dispatches per record through an RV_fun[] array indexed by the record’s recovery index, so adding a new record kind means extending the table rather than restructuring the loop. Fuzzy checkpoint is LOG_START_CHKPT / LOG_END_CHKPT bracketing; recovery anchors at LOG_END_CHKPT because that record carries the DPT/TT snapshot. Redo is parallelisable per page (log_recovery_redo_parallel.cpp shards by VPID). CLRs are type LOG_COMPENSATE with an explicit undo_nxlsa pointing past the record being undone, so a crash mid-undo never reapplies an undo. The whole machinery is “ARIES, by the book” — almost nothing CUBRID-specific except which subsystems the record types describe.
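
The redo-pass shape is worth seeing in miniature. The sketch below is textbook ARIES rather than CUBRID's code: the record layout, the page lookup, and the function table are illustrative stand-ins for the log reader and the RV_fun[] dispatch.

```cpp
// Textbook-ARIES sketch of the redo pass: forward scan from the checkpoint,
// dispatch through a recovery-function table, skip changes the page already has.
#include <cstdint>
#include <functional>
#include <vector>

struct LSA { int64_t v; };                      // log sequence address
struct Page { LSA lsa{}; /* page body ... */ }; // lsa = last change applied
struct LogRecord {
  LSA   lsa;                                    // address of this record
  int   rcvindex;                               // index into the redo table
  Page *target;                                 // page it applies to (resolved via its VPID)
};

using RedoFn = std::function<void(Page &, const LogRecord &)>;  // one per recovery index

void redo_pass(const std::vector<LogRecord> &log, const std::vector<RedoFn> &rv_fun)
{
  for (const LogRecord &rec : log) {            // forward scan from the checkpoint
    Page &pg = *rec.target;
    if (pg.lsa.v >= rec.lsa.v)                  // page already reflects this change
      continue;
    rv_fun[rec.rcvindex](pg, rec);              // repeat history: reapply in place,
    pg.lsa = rec.lsa;                           // then stamp the page with the record's LSA
  }
}
```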

Cross-references: cubrid-recovery-manager.md, cubrid-checkpoint.md, cubrid-log-manager.md.

3. MVCC for reads, locks for writes

Decision. Reads see a snapshot that is built once at transaction (or statement) start; writes acquire an X-lock on the row before mutating it. The two mechanisms are not alternatives — they coexist. Snapshot isolation delivers the fast read path; the lock manager catches write-write conflicts and serialises them in the order of acquisition; vacuum reclaims versions older than the global oldest-visible watermark.

Why. Pure 2PL has one crippling cost on read-heavy workloads: every read takes an S-lock, every S-lock blocks concurrent X-locks, and a long-running scan stalls the entire write side. The opposite extreme — pure MVCC — looks attractive because reads never block writes, but the cost shows up in write-write conflict detection: without a lock manager, two updates to the same row must each materialise, detect the conflict at commit time, and roll one back. PostgreSQL’s serialisable snapshot isolation (SSI; Cahill et al. SIGMOD 2008) is the production design for full-MVCC and it is expensive — predicate locks, dependency graphs, cycle detection. CUBRID picks the hybrid: MVCC for reads, locks for writes. Write skew at SERIALIZABLE is handled by escalating to lock-based serialisation rather than by SSI; the read path stays cheap.

The MVCC implementation (mvcc.c, mvcc_table.cpp, mvcc_active_tran.cpp) shipped in 2014, well after the engine’s original 2PL-only era. The engineering bet was explicit: read-heavy OLTP is the dominant workload, and keeping the lock manager in place was cheaper than rewriting the write path to handle conflict detection in MVCC alone.

Consequence. The codebase has three sibling subsystems that share one vocabulary (the MVCCID) and three rendezvous points (per-record header, per-transaction snapshot, global oldest-visible watermark). mvcc.h::MVCC_REC_HEADER carries mvcc_ins_id, mvcc_del_id, and prev_version_lsa on every heap and index record. lock_manager.c sits next to MVCC, not under it. vacuum.c runs as a background process keyed off m_oldest_visible. The textbook cost — a long-running write transaction pinning the watermark and stalling vacuum — is structural and visible in bug reports about table bloat.
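
A minimal sketch of the snapshot visibility test those header fields imply, assuming a simplified snapshot (lowest-active and highest-completed bounds plus an active set); the real test also has to handle the reader's own in-flight changes and other edge cases.

```cpp
// Simplified sketch of the visibility test implied by MVCC_REC_HEADER. The
// header field names follow the prose; the snapshot layout is an illustration.
#include <cstdint>
#include <unordered_set>

using MVCCID = uint64_t;

struct MvccRecHeader {
  MVCCID mvcc_ins_id = 0;   // transaction that created this version
  MVCCID mvcc_del_id = 0;   // transaction that deleted it (0 = not deleted)
};

struct Snapshot {
  MVCCID lowest_active;                  // every id below this had committed
  MVCCID highest_completed;              // every id at or above had not started
  std::unordered_set<MVCCID> active;     // in-flight ids inside the window

  bool committed(MVCCID id) const {
    if (id == 0) return false;           // "no transaction"
    if (id < lowest_active) return true;
    if (id >= highest_completed) return false;
    return active.count(id) == 0;        // inside the window but already committed
  }
};

// Visible if the inserter committed before the snapshot and the deleter did not.
// (The real test also treats the reader's own changes specially.)
bool is_visible(const MvccRecHeader &h, const Snapshot &s)
{
  return s.committed(h.mvcc_ins_id) && !s.committed(h.mvcc_del_id);
}
```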

Cross-references: cubrid-mvcc.md, cubrid-lock-manager.md, cubrid-vacuum.md.

4. Volcano-style executor with iterator-per-XASL-node

Decision. A query plan is a tree of XASL nodes; each node owns an operator (heap scan, index scan, list scan, sort, group-by, hash join, …) that obeys the open / next / close contract. The executor walks the tree recursively, pulling tuples on demand. Operators are functions, not classes — the dispatch on xasl->type happens in qexec_execute_mainblock_internal, and the per-tuple work happens in scan_next_scan → scan_next_scan_local → scan_next_<type>_scan. The plan is fully serialisable: an XASL tree can be packed into a flat byte buffer, cached, and shipped between processes.
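
The open / next / close contract is easiest to show as a tiny iterator pair. The sketch below renders it as C++ classes for brevity; CUBRID's operators are C functions dispatched on the XASL node type, so treat it as the shape of the contract, not the code.

```cpp
// The open / next / close contract, rendered as classes for brevity.
#include <functional>
#include <optional>
#include <vector>

using Tuple = std::vector<int>;

struct Iterator {
  virtual void open() = 0;
  virtual std::optional<Tuple> next() = 0;   // pull one tuple, or nothing at end
  virtual void close() = 0;
  virtual ~Iterator() = default;
};

// A non-blocking operator pipelines its child: no intermediate materialisation.
struct Filter : Iterator {
  Iterator &child;
  std::function<bool(const Tuple &)> pred;

  Filter(Iterator &c, std::function<bool(const Tuple &)> p) : child(c), pred(std::move(p)) {}

  void open() override { child.open(); }
  std::optional<Tuple> next() override {
    while (auto t = child.next())            // keep pulling until a tuple qualifies
      if (pred(*t)) return t;
    return std::nullopt;                     // child exhausted
  }
  void close() override { child.close(); }
};
```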

Why. Goetz Graefe’s Volcano — An Extensible and Parallel Query Evaluation System (TKDE 1994) is the canonical reference for the iterator model, and the design has not been bettered for the OLTP + light-OLAP workload CUBRID targets. The model has three structural advantages the codebase visibly exploits: pipelining is automatic for non-blocking operators (a Filter over a Scan over an Index produces no intermediate materialisation), composition is by tree (the executor walks the same tree the optimiser produced), and blocking operators (sort, hash-build, group-by, top-N) materialise into list files that downstream operators consume as ordinary list scans. The cost — per-tuple virtual call overhead — is acceptable for OLTP where tuple counts are small; it would not be acceptable for analytic scans of billions of rows, which is one reason CUBRID positions itself for OLTP rather than OLAP (decision 13).

XASL serialisability is the secondary win. Because each operator’s state is a struct with offsets into the XASL byte buffer rather than native pointers, the entire plan can be cached intact and shipped across the broker / CAS / cub_server boundary as a buffer copy.

Consequence. query_executor.c is ~600 KB of glue because each operator type gets its own qexec_<op>_proc / scan_next_<type>_scan pair, but the glue is uniform. The vectorised alternative (MonetDB/X100, HyPer-style emit-batches push) would require restructuring every operator’s interface; the work has not happened. Parallel-query extended the model with intra-operator parallelism but did not change the iterator contract — the parallel scan is still a SCAN_ID whose next() happens to multiplex across threads.

Cross-references: cubrid-query-executor.md, cubrid-xasl-generator.md, cubrid-xasl-cache.md.

5. Selinger-style cost optimizer with a query graph

Decision. Optimisation is a four-stage pipeline: PT_NODE (semantically checked parse tree) is lowered to a QO_ENV query graph of QO_NODEs (tables / derived relations), QO_SEGMENTs (column references), and QO_TERMs (predicates); join enumeration is dynamic-programming over a 2^N join_info vector, building best plans for every subset of size k from best plans for subsets of size k-1; cost is a System R-style fixed-cpu/io + variable-cpu/io model that consumes statistics produced by xstats_update_statistics; the surviving QO_PLAN tree is finalised into the XASL tree.
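
The size-by-size enumeration is a short dynamic programme. The sketch below assumes a placeholder join-cost function and mirrors the join_info vector in shape only.

```cpp
// Sketch of left-deep join enumeration: best plan for every subset of relations,
// built from best plans for smaller subsets. Cost is a caller-supplied stub.
#include <cstdint>
#include <limits>
#include <vector>

struct Plan { double cost; std::vector<int> order; };

std::vector<Plan> best_left_deep(int n, double (*join_cost)(const Plan &, int rel))
{
  std::vector<Plan> best(1u << n, {std::numeric_limits<double>::infinity(), {}});
  for (int r = 0; r < n; ++r)
    best[1u << r] = {0.0, {r}};                       // single-relation base plans
  for (uint32_t s = 1; s < (1u << n); ++s) {          // every subset; contributors are smaller
    if (best[s].cost == std::numeric_limits<double>::infinity()) continue;
    for (int r = 0; r < n; ++r) {
      if (s & (1u << r)) continue;                    // relation already joined
      uint32_t t = s | (1u << r);
      double c = best[s].cost + join_cost(best[s], r);
      if (c < best[t].cost) {                         // keep the cheapest plan per subset
        best[t] = best[s];
        best[t].cost = c;
        best[t].order.push_back(r);
      }
    }
  }
  return best;                                        // best[(1 << n) - 1] is the full join
}
```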

Why. Selinger et al.’s Access Path Selection in a Relational Database Management System (SIGMOD 1979) is the foundational paper of relational optimisation; left-deep dynamic programming is the dominant tractable approximation of the super-exponential plan space, running in O(2^N · N) time and O(2^N) space — manageable for the join cardinalities CUBRID’s OLTP workloads encounter. The alternative — Cascades-style bushy-plan search (Graefe 1995) — is what SQL Server and CockroachDB use; its advantages (transformation rules, group memoisation, adaptive re-optimisation) are most visible on the analytic shapes CUBRID is not chasing.

The query-graph stage is the deliberate intermediate representation. The parse tree is too close to the surface SQL; the XASL is too close to the executor. QO_NODE / QO_TERM is the minimal canonical form on which a cost-based enumeration can run.

Consequence. optimizer/query_graph.c builds the graph; optimizer/query_planner.c runs the DP; optimizer/plan_generation.c emits XASL. The cost model lives inline in query_planner.c as qo_*_cost functions that compute fixed and variable cpu/io for each access method; the constants are calibrated against CUBRID’s storage layer and tuned by the build, not by the operator. Statistics are produced by xstats_update_statistics and read directly out of the catalog at optimisation time. The structure mirrors System R faithfully enough that a reader of the 1979 paper could identify the corresponding functions by name.

Cross-references: cubrid-query-optimizer.md, cubrid-statistics.md.

6. Lock-free prior-list as the WAL commit pipe

Decision. A producer transaction does not write directly into the authoritative log buffer. It builds a LOG_PRIOR_NODE outside any global mutex (including the optional zlib compression of undo/redo payloads), then takes prior_lsa_mutex for the O(1) work of assigning the next monotonic LSN and linking the node onto the tail of a singly-linked list. The log-flush daemon runs the latch-protected drain: detach the list under the mutex, release, copy bytes from the detached nodes into the authoritative LOG_PAGE buffer, write to disk, advance nxio_lsa, broadcast the group-commit condition variable. Producers that have committed park on the CV until their LSN is durable.

Why. PostgreSQL’s XLogInsert is the design CUBRID deliberately inverts. Postgres’s appenders take a WALInsertLock (8 striped locks in current versions), copy bytes directly into the WAL buffer pages at the LSN they just claimed, and release. The lock is held across a memcpy whose size is the record length; under high concurrency the lock itself becomes the bottleneck. Mohan and DeWitt’s 1992 commit-pipe research had already argued for the structural cure: producers build records outside the latch, and a single drain thread does the latch-protected work of moving bytes into the authoritative buffer. CUBRID’s prior list is exactly this pipe.

The win is not just reduced lock contention. Because the producer’s expensive work (record formatting, MVCC stamping, zlib compression) happens outside the mutex, N producers can compress in parallel; the mutex hold time is bounded by O(1) link-tail manipulation. Group commit emerges naturally from a lock-free producer + batched consumer: many committers sleeping on one CV all wake from one fsync’s advance. The design is also the foundation of the CBRD-26177 high-concurrency redesign tracked in cubrid-thread-manager-ng.md.

Consequence. prior_lsa_alloc_and_copy_data and _crumbs are the producer-side allocators in log_append.cpp; prior_lsa_next_record_internal is the attach; logpb_prior_lsa_append_all_list in log_page_buffer.c is the drain. The detach is the simplest possible: under the mutex, swap prior_list_header to NULL, swap prior_list_tail to NULL, zero list_size, release. The mutex is held for exactly three pointer stores and one INT64 store; no bytes move while it is held. Backpressure is a soft cap: when list_size >= logpb_get_memsize(), producers wake the daemon and yield 1 ms.

Cross-references: cubrid-prior-list.md, cubrid-log-manager.md, cubrid-thread-manager-ng.md.

7. Double Write Buffer for torn-page protection

Decision. Every dirty data page is staged through a sequential double-write volume and fsynced there before the home write is issued. On recovery, a DWB scan happens before the analysis pass: any DWB slot whose home page is torn is replaced by the DWB copy. The DWB is a fixed-size circular file with deterministic slot assignment, so the recovery scan is bounded and the steady-state cost is one extra sequential write per home write.

Why. The asymmetry between DBMS page size and filesystem / device atomic write size is the root cause of torn pages. CUBRID writes data pages at IO_PAGESIZE (4-16 KiB depending on configuration); modern Linux filesystems guarantee atomic writes only at the hardware sector boundary (typically 512 B or 4 KiB). A crash mid-write can leave the home location holding a Frankenstein page whose first half corresponds to the new image and whose second half to the old. The three production responses are: full-page-image-on-first-modify (PostgreSQL’s full_page_writes), double-write buffer (MySQL InnoDB, CUBRID), and reliance on a copy-on-write filesystem (ZFS / Btrfs). PostgreSQL’s choice doubles WAL volume on every checkpoint boundary; the InnoDB / CUBRID choice is one extra sequential write per dirty-page flush, much cheaper at the cost of a fixed-size on-disk DWB structure.

The bet CUBRID made was that WAL throughput is the more precious resource. CUBRID’s WAL pipeline (decision 6) is the most carefully tuned part of the engine; making the WAL twice as large by inlining full pages would have eaten that headroom.

Consequence. double_write_buffer.cpp sits on the page-buffer flush hot path. Every pgbuf_flush for a data page goes through dwb_set_data_on_next_slot → WAL force → dwb_add_page → home write, in that order. The WAL force is the prior-list group-commit parking described in decision 6; a dirty page cannot be home-written unless its log-LSN has been drained from the prior list and fsync’d. Recovery’s first action — before analysis — is a DWB scan that replaces torn home pages from intact DWB slots. The structure imposes a hard ordering on the storage subsystem that the rest of the engine must respect.
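
The ordering reads naturally as a single flush function. The step names below follow the prose; the bodies are stubs, not the CUBRID implementations.

```cpp
// Sketch of the flush ordering the DWB imposes on every dirty data page.
struct Page { long long lsa = 0; /* page identity and bytes ... */ };

void force_wal_up_to(long long /*lsa*/) { /* drain the prior list, fsync the log */ }
void dwb_stage(const Page & /*p*/)      { /* copy the page into the next DWB slot */ }
void dwb_sync()                         { /* fsync the double-write volume */ }
void write_home(const Page & /*p*/)     { /* write the page to its home location */ }

void flush_data_page(const Page &p)
{
  force_wal_up_to(p.lsa);   // WAL rule: the page's log records must be durable first
  dwb_stage(p);             // stage into the sequential double-write volume
  dwb_sync();               // the DWB copy is now crash-safe
  write_home(p);            // only then is the in-place home write issued
}
```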

Cross-references: cubrid-double-write-buffer.md, cubrid-page-buffer-manager.md.

8. Stored procedures in a separate JVM process (cub_pl)

Decision. PL/CSQL and JavaSP do not run in the database server process. They run in cub_pl, a separately forked JVM-hosting process that talks to cub_server over a Unix domain socket (or TCP). The catalog rows are owned by cub_server, the wire protocol is pl_connection.cpp / pl_comm.c, and the executor is the C++ cubpl::executor that sends every invocation to cub_pl and handles JDBC back-channel callbacks on the same socket. Both PL languages share the JVM: PL/CSQL is compiled to Java bytecode and dispatched through the same ExecuteThread that handles JavaSP reflective dispatch.

Why. Running user code in-process is dangerous. A buggy stored procedure can corrupt the heap, exhaust file descriptors, leak threads, or — most catastrophically — call System.exit() and take the entire database server down. Oracle had to learn the same lesson. Oracle’s early in-process JVM (the “Aurora JVM” embedded in oracle.exe) caused operational pain that drove the partial migration to external JVM processes. CUBRID’s earlier architecture also embedded the JVM via JNI directly into cub_server (pl_sr_jvm.cpp is the surviving artefact); the migration to the forked external cub_pl process was driven by the same isolation argument.

The JVM-startup cost is amortised by keeping cub_pl alive as a long-running sidecar. The classloader hierarchy is structured so user JARs reload without restarting the JVM; the security manager (SpSecurityManager) blocks System.exit() and native library loading from user classloaders. Sharing one JVM between PL/CSQL and JavaSP is the secondary bet: PL/CSQL is compiled to Java bytecode, so the natural place to host its runtime is wherever JavaSP is already hosted.

Consequence. pl_engine/ is a separate Gradle project that builds pl_server.jar; the build is decoupled from the C/C++ CMake build. pl_connection.cpp manages the connection pool (10 connections per cub_server). The server_monitor_task (a 1-second daemon) reaps and re-forks cub_pl if it dies. The JDBC back-channel runs on the same socket — CUBRIDServerSidePreparedStatement.execute() serialises a METHOD_CALLBACK_QUERY_PREPARE request that cub_server’s executor::response_callback_command() dispatches into the query machinery, so a JavaSP that issues SQL never opens a new TCP connection to itself.
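
The keep-alive pattern behind server_monitor_task is a plain fork / waitpid loop. The sketch below is an assumption-level rendering: the real task also re-establishes the connection pool and logs the restart.

```cpp
// Illustrative keep-alive monitor: poll the child once a second, re-fork on exit.
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

pid_t spawn_cub_pl()
{
  pid_t pid = fork();
  if (pid == 0) {                              // child: become the PL server
    execlp("cub_pl", "cub_pl", (char *)nullptr);
    _exit(127);                                // exec failed
  }
  return pid;                                  // parent: remember the child pid
}

void monitor_loop()
{
  pid_t pl = spawn_cub_pl();
  for (;;) {
    sleep(1);                                  // 1-second polling cadence
    int status = 0;
    if (waitpid(pl, &status, WNOHANG) == pl)   // child exited or crashed
      pl = spawn_cub_pl();                     // reap it and fork a replacement
  }
}
```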

Cross-references: cubrid-pl-javasp.md, cubrid-pl-plcsql.md.

9. Broker process pool in front of cub_server

Decision. Client SQL traffic does not go directly to cub_server. A cub_broker parent process owns the listening TCP socket and forks a fixed pool of cub_cas worker processes; on accept, the parent passes the client’s file descriptor to an idle CAS via SCM_RIGHTS over a Unix-domain rendezvous channel. The CAS worker authenticates the client, parses the CSS framing, and proxies the SQL traffic to cub_server. ACL state, SQL log, monitoring counters, and the broker administration interface are coordinated through a single SysV shared-memory segment.

Why. A relational engine’s in-process state — parser, XASL cache, prepared-statement registry, authenticated transaction descriptor — is expensive to set up and tear down. The driver wants the simplest possible model: open a socket, send SQL, read rows, close. The operator wants a horizontal cap on how many of these heavyweight contexts the host actually has to feed. CUBRID chose the broker-mediated model for three reasons:

  1. Isolation. A CAS that crashes does not take the broker, peer CAS workers, or cub_server down. The parent reaps the dead CAS and forks a replacement.
  2. Operational surface. Per-CAS SQL log and per-CAS error log are viable when each CAS is one process; they would be much harder if cub_server had to multiplex log buffers per client thread.
  3. Pluggable front-ends. The CAS abstraction allows ODBC and JDBC bridges (cas_cgw.c) without modifying cub_server.

The choice splits a SQL request across four processes — JDBC client → broker (parent) → CAS (worker) → cub_server — with the broker step being a pure file-descriptor handoff that does not touch the SQL bytes. The latency cost is on the order of microseconds; the operational dividend pays for it.

Consequence. broker.c is the parent: it allocates the SysV shared-memory segment, forks the CAS workers, opens the listening socket, and selects the next idle CAS from the shared-memory job queue. cas.c is the worker: it recvmsgs the FD, runs the CSS handshake, opens its upstream connection to cub_server, and proxies traffic. broker_send_fd.c / broker_recv_fd.c wrap SCM_RIGHTS. broker_acl.c enforces IP-based ACL in the parent before the FD handoff. Each layer has its own debug log; correlating a single SQL request requires walking client log → broker log → CAS log → server log.
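
The SCM_RIGHTS handoff itself is small. The sketch below shows roughly what a broker_send_fd.c-style wrapper has to do; the one-byte payload and the error handling are illustrative.

```cpp
// Passing an accepted client descriptor to a worker over a Unix-domain socket.
#include <cstring>
#include <sys/socket.h>
#include <sys/types.h>

ssize_t send_fd(int channel, int fd_to_pass)
{
  char dummy = 'F';                        // ancillary data needs at least one payload byte
  iovec iov{&dummy, 1};

  char ctrl[CMSG_SPACE(sizeof(int))] = {0};
  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl;
  msg.msg_controllen = sizeof(ctrl);

  cmsghdr *cm = CMSG_FIRSTHDR(&msg);       // the control message carries the descriptor
  cm->cmsg_level = SOL_SOCKET;
  cm->cmsg_type = SCM_RIGHTS;
  cm->cmsg_len = CMSG_LEN(sizeof(int));
  std::memcpy(CMSG_DATA(cm), &fd_to_pass, sizeof(int));

  return sendmsg(channel, &msg, 0);        // the kernel duplicates the fd into the receiver
}
```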

Cross-references: cubrid-broker.md, cubrid-network-protocol.md.

10. Local-only HA election, no global consensus

Decision. Each cub_master peer in a CUBRID HA cluster decides on its own whether the other peers are alive, based on UDP heartbeats and local timeouts. There is no global consensus protocol — no Raft, no ZAB, no Paxos. The election is a local view election: each node scores itself and its peers and decides who should be master in its own opinion. The ha_ping_hosts parameter lists external addresses each node pings to detect its own network isolation, guarding against the worst split-brain scenario.

Why. Consensus protocols are expensive — multi-round message exchanges, persistent state machines, leader leases, log replication. Chandra & Toueg’s Unreliable Failure Detectors for Reliable Distributed Systems (JACM 1996) is the formal reference. CUBRID’s HA target is not a 5-node cluster with quorum-based writes — it is the much more modest primary / standby pair with optional read-only replicas. For that topology a global consensus protocol is overkill: there are two outcomes (the primary is alive, or it is gone and the standby promotes), and the worst case is split-brain. Local-view election with a ha_ping_hosts external arbiter covers split-brain at the cost of operator-provided ping targets.

The bet was that simplicity and operational clarity beat theoretical elegance for the workload CUBRID targets. A consensus protocol would require persistent state on every node, a log-replication step that does not match the engine’s existing WAL, and a more complex failover dialogue. The local-view design is a few thousand lines of UDP gossip, a job-queue FSM, and four worker threads. The cost is real: asymmetric-partition scenarios — where node A can reach node B but B cannot reach A — can cause both to promote. ha_ping_hosts reduces this risk but does not eliminate it; the remediation is operator awareness.
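
The local-view decision can be sketched as a recency check plus the ping-host guard. The timeout value and the boolean isolation test below are illustrative; the real FSM scores peers and queues state-transition jobs rather than returning a flag.

```cpp
// Illustrative local failure detector: no consensus, just a per-node opinion.
#include <chrono>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;

struct PeerView {
  std::map<std::string, Clock::time_point> last_heartbeat;  // host -> last UDP heartbeat seen
  std::chrono::seconds timeout{10};                          // illustrative threshold

  bool peer_alive(const std::string &host, Clock::time_point now) const {
    auto it = last_heartbeat.find(host);
    return it != last_heartbeat.end() && now - it->second < timeout;
  }
};

// Promote only if the primary looks dead *and* the operator-configured ping
// hosts are still reachable (otherwise assume we are the isolated side).
bool should_promote(const PeerView &v, const std::string &primary,
                    bool ping_hosts_reachable, Clock::time_point now)
{
  return !v.peer_alive(primary, now) && ping_hosts_reachable;
}
```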

Consequence. master_heartbeat.c and heartbeat.c implement the gossip and the FSM. State transitions — slave → to-be-master → master, master → slave — are driven by a job queue consumed by four worker threads. The lack of a consensus log means failover is fast but not formally safe; cubrid-ha-replication.md is the user-facing HA documentation, and the local-only design is what makes that document’s “operator must configure ping hosts” guidance non-negotiable.

Cross-references: cubrid-heartbeat.md, cubrid-ha-replication.md.

11. Page-based, in-place storage

Decision. Every storage subsystem in CUBRID assumes pages are mutated in place. Heap pages, B+Tree pages, catalog pages, overflow pages — all support direct updates. The WAL records the changes; the WAL invariant ensures the log is durable before the dirty page is written to its home location; the home location is the location. There is no log-structured rewrite, no compaction-based reclamation, no LSM-tree levels.

Why. Page-based, in-place storage is the design that ARIES was written for. Mohan’s algorithm assumes the engine can fetch a page, look at its LSN, decide whether to redo, and apply the redo function to the page in place. The two branches of storage design make opposite trade-offs: B-trees give cheap point reads at the cost of expensive in-place updates and the torn-page problem; LSM-trees give cheap writes at the cost of expensive reads and ongoing compaction.

CUBRID was written when the LSM choice was not the default — log-structured storage existed as a research idea (Rosenblum & Ousterhout, SOSP 1991), but no major relational engine had built on it for transactional workloads. SQL Server, DB2, Oracle, PostgreSQL, MySQL were all page-based. The OLTP workload — many small reads, many small writes, point queries by primary key — is what page-based storage was optimised for.

Consequence. The choice cascades through every storage subsystem. Vacuum (decision 3) reclaims dead row versions in place rather than rewriting the page. DWB (decision 7) protects the multi-sector write at the home address. The B+Tree splits and merges pages while keeping their home addresses stable. The page buffer is a fixed-size hash table of (volid, pageid) → buffer frame; if pages moved on every update there would be no stable key to hash on. Every storage path under src/storage/ assumes in-place mutability. There is no separate compaction scheduler. The closest thing CUBRID has to compaction is cub_compactdb (an SA-mode utility, decision 12), an explicit operator-driven operation rather than a continuous background process. The cost of in-place updates — torn pages, vacuum bloat, B+Tree fragmentation — is paid by the corresponding subsystems rather than amortised into the write path itself.

12. SA mode utilities for offline operations

Decision. Admin utilities — cub_compactdb, cub_unloaddb, cub_loaddb, cub_restoredb — link the entire server engine into their own process and operate on the on-disk database directly. They do not go through the broker, the CAS pool, or cub_server. Each utility is built twice in the CMake graph: once as <utility>_sa linking libcubridsa.so (the in-process server), once as <utility>_cs linking libcubridcs.so (the client-side stub). A runtime classification table (SA_ONLY, CS_ONLY, SA_CS) plus a dlopen of either library at startup selects the active mode.

Why. Some operations want exclusive access to the database file and do not benefit from network IPC. cub_compactdb wants to read every record, decide which are dead, and physically move the live ones; doing this through the daemon means convincing the daemon to take an exclusive lock on every class. cub_unloaddb wants to dump the entire database to a flat file as fast as the disk can write; the wire protocol’s CSS framing and per-row marshalling would be wasted overhead. cub_loaddb in SA mode bypasses the lock manager (no concurrent access by definition) and the WAL (the database is rebuilt; recovery is unnecessary), giving order-of-magnitude faster ingestion.

The dual-build mechanism is the engineering cost the design accepts. The same source file is compiled three times — once into cub_server (SERVER_MODE), once into libcubridsa.so (SA_MODE), once into libcubridcs.so (CS_MODE) — with #ifdef blocks selecting which networking layer is active. Most engine code is identical across all three; the divergence is concentrated in the network interface.

Consequence. sa/CMakeLists.txt and cs/CMakeLists.txt produce libcubridsa.so and libcubridcs.so as sibling shared libraries. util_admin.c, util_sa.c, and util_cs.c together form the dispatch layer: util_admin.c reads the per-utility classification and dlopens the right library. The classification table makes some utilities SA-only (cub_compactdb, cub_unloaddb), some CS-only (interactive csql against a running server), some dual (cub_loaddb, which is faster in SA but supports both). The #if defined(SERVER_MODE) / #if defined(SA_MODE) / #if defined(CS_MODE) blocks scattered through the engine encode the three-way split at the source level.
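
The classification-then-dlopen dispatch can be sketched in a few lines. The two-entry table, utility names, and entry-point handling below are illustrative; the real table and the lookup of the utility entry point live in util_admin.c and friends.

```cpp
// Illustrative sketch of the mode classification and dlopen selection.
#include <dlfcn.h>
#include <cstdio>
#include <cstring>

enum UtilMode { SA_ONLY, CS_ONLY, SA_CS };

struct UtilEntry { const char *name; UtilMode mode; };

static const UtilEntry utilities[] = {
  {"compactdb", SA_ONLY},      // needs exclusive access to the volume files
  {"loaddb",    SA_CS},        // fast in SA, but also works against a server
};

void *open_engine(const char *utility, bool prefer_sa)
{
  for (const UtilEntry &u : utilities) {
    if (std::strcmp(u.name, utility) != 0) continue;
    bool use_sa = (u.mode == SA_ONLY) || (u.mode == SA_CS && prefer_sa);
    const char *lib = use_sa ? "libcubridsa.so" : "libcubridcs.so";
    void *handle = dlopen(lib, RTLD_NOW | RTLD_GLOBAL);
    if (!handle) std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return handle;               // caller then looks up the utility entry point
  }
  return nullptr;                // unknown utility
}
```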

Cross-references: cubrid-sa-cs-runtime.md.

13. Non-goals: what CUBRID deliberately is not

The decisions above describe what CUBRID is. The non-goals — the shapes the engine deliberately does not take — are equally part of the design philosophy, because they explain why the codebase has the silhouette it has rather than some other silhouette.

CUBRID is not a distributed database. There is no cross-node transaction coordinator beyond two-phase commit, no global query optimiser, no shard-aware planner. The HA story is primary / standby with read-only replicas (decision 10), not a Spanner / CockroachDB / TiDB-style multi-region cluster.

CUBRID is not log-structured. Storage is page-based and in-place (decision 11). There is no SSTable compaction, no LSM-level threading, no write-amplification analysis. The vacuum process reclaims MVCC versions in place; it is not a compaction scheduler.

CUBRID is not vectorised. The executor is iterator-per-tuple (decision 4). There is no MonetDB/X100-style batch execution, no HyPer-style code generation, no LLVM-JIT’d query pipeline. The cost is paid in analytic-scan throughput; the win is OLTP simplicity.

CUBRID is not MPP. The parallel-query work is intra-node parallelism — one operator’s tuple stream fanned out to multiple worker threads on the same machine. There is no cross-node exchange operator, no shuffle phase, no distributed cost model.

Where this leaves CUBRID. The positioning is OLTP plus light OLAP on a single primary with HA standby. Within that target the engine is competitive on read latency (snapshot isolation reads without locks), on write throughput (the prior-list pipe absorbs high-concurrency commits), on operational predictability (ARIES recovery, fuzzy checkpoints, DWB-protected writes), and on the narrow surface of administrative utilities. It is not competitive — and does not try to be — on the workloads the non-goals exclude. A reader contemplating a feature that would push CUBRID into the excluded territory should expect the engine to fight back; the codebase is optimised for what it is, not for what it might become.

The thirteen decisions are the answer to the question this document opens with: why does CUBRID look the way it does? They are the shape underneath the shape. The detail docs in this directory walk the shape; this one names what made it.