CUBRID Storage Engine — Section Overview


The storage engine in CUBRID is everything that sits below the catalog and the locator and above the operating system’s filesystem. It is the layer that turns a handful of OS-level byte streams into fixed-size pages, hot-caches them in memory, lays out variable-length user records on top of them, builds ordered access paths over those records, and protects every byte against torn writes, dirty-then-crash failures, and (optionally) on-disk theft. The locator and the catalog treat this layer as an abstraction — they ask for a heap page or a B+Tree leaf and the storage engine produces it; how the page got there, how it stays consistent on flush, and how it survives a power loss are all internal concerns of this section.

The boundary above is sharp: heap and B+Tree mutations enter through locator_*_force (cubrid-locator.md) and through catalog_manager (cubrid-catalog-manager.md); WAL records leave through the prior list (cubrid-prior-list.md) and the log manager (cubrid-log-manager.md); MVCC visibility is decided by cubrid-mvcc.md reading the headers this section’s heap layer plants. The boundary below is just as sharp: the disk manager owns every pread/pwrite against a CUBRID volume, and nothing else in the engine is allowed to touch a data file directly. Everything in between is what the 9 docs in this section cover. The AREA slab pool, which used to be the tenth, has moved to cubrid-overview-base-infra.md along with the lock-free primitives — the rationale is that AREA is a memory-allocator concern that every layer can use (parser, optimizer, executor), not a storage layer in the on-disk sense.

The storage engine is a strict, easy-to-name layering. Each layer treats the layer below as an abstraction and exports a narrower abstraction upward. Understanding the layering is the prerequisite for understanding any read or write path through CUBRID.

flowchart TB
  subgraph UPPER["Upper layers (cross-section)"]
    LOC["locator / catalog<br/>(cubrid-locator.md, cubrid-catalog-manager.md)"]
    LOG["log manager + prior list<br/>(cubrid-log-manager.md, cubrid-prior-list.md)"]
    MVCC["MVCC + vacuum<br/>(cubrid-mvcc.md, cubrid-vacuum.md)"]
  end

  subgraph KEY["Keys on pages"]
    BTREE["btree<br/>cubrid-btree.md"]
    EHASH["extendible_hash<br/>cubrid-extendible-hash.md"]
  end

  subgraph REC["Records on pages"]
    HEAP["heap_manager<br/>(slotted pages, MVCC headers)<br/>cubrid-heap-manager.md"]
    OFLOW["overflow_file<br/>(big-record / OID-list chains)<br/>cubrid-overflow-file.md"]
  end

  subgraph EXT["Out-of-band data"]
    LOB["LOB (BLOB / CLOB)<br/>(files outside data volume)<br/>cubrid-lob.md"]
  end

  subgraph CACHE["Pages in memory"]
    PB["page_buffer<br/>(BCB cache, three-zone LRU)<br/>cubrid-page-buffer-manager.md"]
    DWB["double_write_buffer<br/>(torn-page protection)<br/>cubrid-double-write-buffer.md"]
  end

  subgraph CRYPTO["Page-level encryption"]
    TDE["tde<br/>(AES/ARIA-256-CTR, MK -> DEK)<br/>cubrid-tde.md"]
  end

  subgraph DISK["Volumes & files (OS boundary)"]
    DM["disk_manager + file_manager<br/>(volume / sector / file / page)<br/>cubrid-disk-manager.md"]
    VOLS[("OS files<br/>(_dbname / _dbname_t / _lgar*)")]
  end

  LOC --> HEAP
  LOC --> BTREE
  LOC --> LOB
  LOG --> PB
  HEAP --> OFLOW
  BTREE --> OFLOW
  HEAP --> PB
  BTREE --> PB
  EHASH --> PB
  OFLOW --> PB
  PB --> TDE
  PB --> DWB
  TDE --> DWB
  DWB --> DM
  PB -. clean page read .-> DM
  DM --> VOLS
  LOB -. host filesystem .-> VOLS

  MVCC -. reads / writes record headers .-> HEAP
  MVCC -. vacuum reclaims .-> BTREE

Reading the diagram bottom-up, matching each layer to its detail doc, gives the six-piece structure of the section:

  • Volumes & files (cubrid-disk-manager.md). The OS-file boundary. A volume is one OS file; a sector is 64 contiguous pages and the disk-manager’s allocation unit; a file is a sector bundle; a page is the I/O unit (VPID = (volid, pageid)). The disk cache splits permanent and temporary purposes so temp files cannot starve the permanent space, drives a two-step sector reservation, and decides when to extend a volume adaptively. The file manager turns reserved sectors into pages via three extensible-data tables (Partial / Full / User).
  • Pages in memory (cubrid-page-buffer-manager.md + cubrid-double-write-buffer.md). The page buffer maps VPID -> BCB -> in-memory frame through a per-bucket hash, runs a three-zone LRU split into per-thread private lists with adjustable quotas plus a shared list, hands victims directly to sleeping waiters via lock-free queues, and protects each BCB with a custom read/write/flush latch. Every dirty page must pass the DWB on the way out — the DWB stages the page into a sequential, fixed-size volume that is fsync’d before the home write is issued, so a torn write at the home location is recoverable from the DWB copy before log replay even begins.
  • Records on pages (cubrid-heap-manager.md + cubrid-overflow-file.md). The heap manager stores variable-length user records on slotted pages, dispatches the nine record types (REC_HOME, REC_RELOCATION, REC_BIGONE, REC_NEWHOME, …), threads insert/update/delete/read through them, and embeds the MVCC header (insert-MVCCID, delete-MVCCID, prev-version chain) directly inside the record so visibility decisions can be answered without leaving the heap page. When a record outgrows what any heap page can hold, the heap delegates to the overflow file (FILE_MULTIPAGE_OBJECT_HEAP) and stores a small reference at the home OID; the same overflow-file module also handles B+Tree overflow keys (FILE_BTREE_OVERFLOW_KEY) and per-key OID lists too long for a leaf entry.
  • Keys on pages (cubrid-btree.md + cubrid-extendible-hash.md). The B+Tree is CUBRID’s primary access path: slotted-page nodes, separate non-leaf and leaf record formats, key||OID concatenation for non-unique indexes, latch-coupling on the read path, and unique-constraint enforcement at the OID-suffix level (with per-key OID-list overflow when duplicates pile up). Extendible hash is the second on-page organisation, used by a small set of internal callers — class-name → OID lookup, catalog → repr-id lookup, UPDATE/DELETE OID dedup tables — through an EHID-rooted directory file with a doubling pointer count and slotted bucket pages with binary search and per-bucket local depth.
  • Big external data (cubrid-lob.md). BLOB and CLOB columns whose payload is too large to keep next to the row are stored as files outside the data volume (host filesystem or an object store), named by a locator URI string in the record itself. A per-transaction red-black tree on the TDES tracks every locator the transaction touched, and a single dispatch point at commit / rollback reconciles file-system reality with the transaction outcome.
  • Encryption (cubrid-tde.md). TDE wraps the page buffer’s encrypt-on-flush / decrypt-on-read hooks. A two-level key hierarchy keeps the master key (MK) outside the database in a separate <db>_keys file; the MK wraps three per-database data encryption keys (DEKs) — DATA, LOG, TEMP — and AES-256-CTR or ARIA-256-CTR with per-page nonces (LSA for permanent pages, atomic counter for temp, logical pageid for log) does the actual byte-level work. The TDE flag is a per-file tablespace property that propagates down to each page’s pflag bits.
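The volume/sector/page vocabulary from the first bullet reduces to simple arithmetic. A minimal sketch, assuming illustrative names (`vpid_t` and `DISK_SECTOR_NPAGES` are stand-ins, not CUBRID's actual identifiers):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative mirror of the disk-manager vocabulary: a VPID names a page
 * as (volume id, page id); a sector is a run of 64 contiguous pages and is
 * the disk manager's allocation grain. */
#define DISK_SECTOR_NPAGES 64

typedef struct
{
  int16_t volid;   /* which volume (one OS file) */
  int32_t pageid;  /* page index inside that volume */
} vpid_t;

/* Sector that contains a given page: reservation happens at this grain. */
static int32_t
vpid_to_sector (const vpid_t * vpid)
{
  return vpid->pageid / DISK_SECTOR_NPAGES;
}

/* First page of a sector: a reservation hands back this 64-page range. */
static int32_t
sector_first_page (int32_t sectid)
{
  return sectid * DISK_SECTOR_NPAGES;
}
```

The point of the sketch is only that page identity (VPID) and allocation identity (sector) differ by a fixed shift; every higher layer speaks VPIDs, and only the disk manager thinks in sectors.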

The crucial structural fact across the diagram is that every record-organisation layer talks to the page buffer, never to the disk manager directly. This is what lets the page buffer enforce WAL ordering — a heap page mutation produces a log record first (routed through the prior list to the log manager), and the buffer manager refuses to flush the dirty page until the matching log LSA is durable. The same constraint applies to B+Tree, ehash, and catalog mutations.
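The flush gate amounts to a single LSA comparison. A minimal sketch, assuming an illustrative `lsa_t` layout and function names (the real `LOG_LSA` and pgbuf flush path live in the log manager and page buffer):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative log sequence address: (log page, offset in page). */
typedef struct { int64_t pageid; int16_t offset; } lsa_t;

static int
lsa_cmp (lsa_t a, lsa_t b)
{
  if (a.pageid != b.pageid) return a.pageid < b.pageid ? -1 : 1;
  if (a.offset != b.offset) return a.offset < b.offset ? -1 : 1;
  return 0;
}

/* WAL rule the page buffer enforces: a dirty page may go to the DWB/disk
 * only after the log record that last modified it (page_lsa) is durable,
 * i.e. at or below the log's flushed-up-to point (durable_lsa). */
static bool
page_may_flush (lsa_t page_lsa, lsa_t durable_lsa)
{
  return lsa_cmp (page_lsa, durable_lsa) <= 0;
}
```

When the check fails, the flusher asks the log manager to advance its durable point first; it never drops the page or the ordering.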

The 9 docs in this section are not equally important to a newcomer, and they are not best read in alphabetical or diagram-bottom-up order. The recommended sequence is calibrated to the engineering effort behind each — start where most of the code lives, branch into the rest as needed.

  1. cubrid-disk-manager.md — Read first. It defines the vocabulary (VPID, sector, file, volume) that every other doc in the section reuses, and it is also the cleanest layer to understand in isolation because it has no upward dependency beyond “give me an OS file and a page size”. Pay attention to the two-step sector reservation and the permanent / temporary split — those two ideas reappear under different names in the page buffer and in DWB.
  2. cubrid-page-buffer-manager.md — Read second. This is where the bulk of the storage-engine engineering effort lives. The BCB array, the three-zone LRU with private and shared lists, the direct victim handoff to sleeping waiters, and the custom RW/flush latch are all dense topics; budget time. The page buffer is also the one layer in the section that touches every upper layer (heap, btree, ehash, catalog, log) — it is the central coordinator.
  3. cubrid-double-write-buffer.md — Read immediately after the page buffer; it is the durability complement. The DWB is conceptually small (sequential staging volume + fsync + home write) but its placement in the flush path and its restart recovery routine are easy to misunderstand without the page buffer’s flush sequence in mind.
  4. cubrid-heap-manager.md + cubrid-btree.md — Read the two main record/key access methods next. The heap manager shows the on-page record vocabulary and the MVCC header layout; the B+Tree shows the latch-coupling discipline and the key||OID convention. They reference each other through the overflow file, so reading them in either order works, but doing them as a pair is what produces the “I now understand how a row gets read” mental model.
  5. cubrid-overflow-file.md — Skim after heap + btree. Its job is “the page chain everyone falls back to when the primary slot is too small”; understanding it without first knowing what spills into it leads to confusion.
  6. cubrid-extendible-hash.md — Skim per need. Few callers use it (class-name lookup, repr-id lookup, OID dedup), and none of them are on the SELECT/UPDATE hot path. Read it when debugging the catalog, schema reload, or unique-index maintenance.
  7. cubrid-lob.md — Skim per need. Most CUBRID workloads have no LOB columns; this doc is essential when they do.
  8. cubrid-tde.md — Skim per need. The encryption path is largely orthogonal to the rest of the storage engine (encrypt-on-flush / decrypt-on-read at well-defined hooks), so it composes cleanly on top of everything else but does not change the read or write paths described above.

A reader who works through 1-4 in order will have the full “how a page gets read or written” mental model. The remaining four docs each fill in a side-channel that the core path either avoids (LOB, ehash) or transparently piggybacks on (overflow, TDE).
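The torn-write repair the DWB performs at restart (doc 3 above) boils down to a checksum comparison. A minimal sketch, assuming a toy checksum, an illustrative `PAGE_SIZE`, and invented function names (this is not CUBRID's recovery code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 16384

/* Toy additive checksum standing in for the real page checksum. */
static uint32_t
page_checksum (const unsigned char *page)
{
  uint32_t sum = 0;
  for (size_t i = 0; i < PAGE_SIZE; i++)
    sum = sum * 31 + page[i];
  return sum;
}

/* Restart-time rule: the DWB slot was fsync'd whole before the home write
 * began, so if the home copy is torn (checksum mismatch) the DWB copy is
 * guaranteed intact and repairs it before log replay starts. */
static bool
dwb_repair_home (unsigned char *home, const unsigned char *dwb_copy,
                 uint32_t expected_sum)
{
  if (page_checksum (home) == expected_sum)
    return false;                       /* home write completed; keep it */
  memcpy (home, dwb_copy, PAGE_SIZE);   /* torn home page: restore */
  return true;
}
```

The ordering is what makes this sound: the staging volume is durable before the home write is even issued, so at most one of the two copies can be torn at any instant.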

Cross-cutting concerns inside this section


Three concerns thread through every layer in the diagram and are worth naming once at the section level rather than rediscovering in each detail doc.

Latching discipline. The page buffer exports a custom RW/flush latch on each BCB (PGBUF_LATCH_READ, PGBUF_LATCH_WRITE, PGBUF_LATCH_FLUSH); it is the only page-grain latch in the storage engine, and every record/key layer above is required to hold it for the duration of a page mutation. The heap manager and the B+Tree share this discipline: a heap insert acquires WRITE on the target page, possibly READ on the overflow head; a B+Tree descend acquires READ on each non-leaf, releases ancestors as it moves down (latch-coupling), and upgrades to WRITE only at the leaf. Extendible hash uses the same primitive on its directory and bucket pages; TDE inherits it transparently because encryption is layered below the latch (the BCB latch protects the plaintext frame; the cipher boundary is at the I/O edge). Page latches are intentionally separate from transactional locks — they are short, embedded in the BCB, and unrelated to isolation; see cubrid-page-buffer-manager.md §“Lock vs latch separation” and the cross-reference in cubrid-lock-manager.md.
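The latch-coupling descent can be sketched with a toy single-child tree. Everything here is illustrative: the latch enum is not the pgbuf API, and the plain re-acquire "upgrade" at the leaf is a simplification of the real BCB latch, which has its own upgrade protocol:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative latch modes and node; not CUBRID's pgbuf structures. */
typedef enum { LATCH_FREE, LATCH_READ, LATCH_WRITE } latch_t;

typedef struct node
{
  latch_t latch;
  int is_leaf;
  struct node *child;           /* single child keeps the sketch tiny */
} node_t;

static void latch_acquire (node_t *n, latch_t m) { n->latch = m; }
static void latch_release (node_t *n) { n->latch = LATCH_FREE; }

/* Descend root->leaf: READ-latch the child BEFORE releasing the parent
 * (latch coupling), so no concurrent split can slip between the two
 * levels; take WRITE only at the leaf that will be mutated. */
static node_t *
btree_descend (node_t *root)
{
  node_t *cur = root;
  latch_acquire (cur, LATCH_READ);
  while (!cur->is_leaf)
    {
      node_t *child = cur->child;
      latch_acquire (child, LATCH_READ);   /* couple: child first... */
      latch_release (cur);                 /* ...then drop the parent */
      cur = child;
    }
  latch_acquire (cur, LATCH_WRITE);        /* leaf mutation needs WRITE */
  return cur;
}
```

The invariant worth noticing is that at most two levels are latched at any moment, which bounds latch footprint regardless of tree height.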

WAL participation. Every storage subsystem in this section emits log records for ARIES recovery. The disk manager logs sector reservations and volume extensions (RVDK_*); the heap logs every record-type-specific mutation (RVHF_*); the B+Tree logs every node mutation (RVBT_*); extendible hash logs directory and bucket changes (RVEH_*); the overflow file logs its page-chain extensions; LOB logs locator-state transitions that the per-tx red-black tree replays at commit; even TDE participates indirectly because its per-page nonce is the LSA stamped by the log manager. The page buffer is the enforcer: it refuses to flush a dirty page until the LSA of the last modifying log record is durable. The actual log infrastructure lives outside this section in cubrid-log-manager.md and cubrid-prior-list.md, but every layer here is a first-class participant. Recovery (three-pass ARIES; cubrid-recovery-manager.md) replays those records against the same page buffer that produced them.
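The replay side of this contract hinges on the LSA stamped into each page. A minimal sketch of the ARIES redo-idempotence rule, assuming illustrative types (the real record application is per-RVxx_* redo function, not a single integer write):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t lsa_t;          /* illustrative scalar LSA */

typedef struct
{
  lsa_t page_lsa;               /* LSA of the last record applied here */
  int value;                    /* stand-in for page contents */
} page_t;

/* Redo rule: re-apply a log record only if the page's stamped LSA shows
 * the page has not already seen it; stamping afterwards makes a second
 * replay of the same record a no-op. */
static bool
redo_apply (page_t *page, lsa_t rec_lsa, int new_value)
{
  if (page->page_lsa >= rec_lsa)
    return false;               /* already on the page: skip */
  page->value = new_value;
  page->page_lsa = rec_lsa;
  return true;
}
```

This is why the flush ordering above matters: a page that reached disk without its log record would carry an LSA the redo pass cannot reason about.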

MVCC integration. CUBRID is an MVCC engine and the heap manager is where MVCC physically lives — every heap record header carries an insert-MVCCID, a delete-MVCCID, and a prev-version pointer to the previous physical version on the same heap. The B+Tree carries OIDs without MVCC headers and relies on the heap’s headers when the visibility predicate follows an OID into the heap; the catalog and the system tables inherit the same convention because they are stored in heap files too. Vacuum (cubrid-vacuum.md) reads these headers in WAL replay order, decides what is dead under the oldest-active-MVCCID watermark, and physically reclaims the slot — a heap-page mutation in its own right that takes the WRITE latch and emits its own RVHF_* records. LOB cleanup piggybacks on the same commit/rollback hooks that drive the red-black-tree dispatch in cubrid-lob.md. The MVCC vocabulary itself is owned by cubrid-mvcc.md; this section’s contribution is the on-page realisation.
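The header fields named above support a visibility test that can be sketched in a few lines. This is a grossly simplified snapshot rule (everything strictly below the oldest-active watermark is treated as committed; the real test also consults the snapshot's set of in-flight MVCCIDs, see cubrid-mvcc.md), with illustrative names throughout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t mvccid_t;
#define MVCCID_NULL 0

/* The header fields the heap plants in every record (names illustrative). */
typedef struct
{
  mvccid_t insert_id;   /* MVCCID of the inserting transaction */
  mvccid_t delete_id;   /* MVCCID of the deleter, or MVCCID_NULL */
} mvcc_header_t;

/* Simplified visibility: the record is visible iff its inserter committed
 * before the watermark and no deleter has committed before it. */
static bool
mvcc_visible (const mvcc_header_t *h, mvccid_t oldest_active)
{
  bool inserted = h->insert_id < oldest_active;
  bool deleted = (h->delete_id != MVCCID_NULL
                  && h->delete_id < oldest_active);
  return inserted && !deleted;
}
```

Note that the same watermark drives vacuum: a record whose delete-MVCCID is below the oldest-active-MVCCID is invisible to every possible snapshot and is therefore safe to reclaim.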

Per-doc one-line summaries:

  • cubrid-disk-manager.md: Four-level hierarchy (OS file = volume, 64-page sector = allocation unit, file = sector bundle, page = I/O unit); permanent-vs-temporary disk cache; two-step sector reservation; adaptive volume extension; three extensible-data tables (Partial / Full / User).
  • cubrid-page-buffer-manager.md: BCB array; three-zone LRU split into per-thread private lists with adjustable quotas plus a shared list; direct victim handoff via lock-free queues; custom read/write/flush latch per BCB.
  • cubrid-double-write-buffer.md: Sequential staging volume fsync’d before the home write, giving torn-write protection between page buffer and data files; restart compares the DWB copy to the home page and replaces it if needed before log replay starts.
  • cubrid-heap-manager.md: Slotted pages; nine record types; INSERT/UPDATE/DELETE/READ flow; MVCC versioning inside the record header (insert-MVCCID, delete-MVCCID, prev-version chain); hot-path caches.
  • cubrid-overflow-file.md: Heap big-record and B+Tree overflow-OID page chains; one symbol-level overflow-file module shared by FILE_MULTIPAGE_OBJECT_HEAP, FILE_BTREE_OVERFLOW_KEY, and per-key OID-list overflow; WAL discipline for crash safety.
  • cubrid-lob.md: BLOB/CLOB stored as files outside the data volume; locator-URI naming in the record; per-transaction red-black tree on TDES; single dispatch point at commit/rollback for filesystem reconciliation.
  • cubrid-btree.md: Slotted-page nodes; separate non-leaf and leaf record formats; key||OID concatenation for non-unique indexes; latch-coupling on the read path; unique-constraint enforcement at the OID-suffix level with per-key OID-list overflow.
  • cubrid-extendible-hash.md: Fagin-style EHID-rooted directory file with doubling pointer count; slotted bucket pages with binary search and per-bucket local depth; system-op-bracketed splits/merges; RVEH_* WAL records.
  • cubrid-tde.md: Two-level key hierarchy (master key wraps three per-database DEKs); AES-256-CTR or ARIA-256-CTR with per-page nonces (LSA for permanent, atomic counter for temp, logical pageid for log); encrypt-on-flush / decrypt-on-read hooks; separate <db>_keys master-key file; per-file TDE flag propagates to each page’s pflag bits.

The storage engine is the bottom of the cub_server process, but not the bottom of the system — three neighbouring sections own the boundaries it touches.

  • above (DDL & Schema) — Catalog and class-object. The catalog manager (cubrid-catalog-manager.md) stores per-class disk representation and statistics in a heap file anchored by CTID, and the parallel system classes (_db_class, _db_attribute, _db_index, …) bootstrap from a fixed root-class OID; the in-memory SM_CLASS graph (cubrid-class-object.md) materialises the catalog in the client-side workspace. This section’s heap manager holds the bytes; the DDL & Schema section interprets them. See cubrid-overview-ddl-schema.md for the full topology.
  • above (Transaction & Recovery) — Log, vacuum, checkpoint, recovery. Every dirty page in this section is gated by an LSA the log manager produced; every restart replays through the same page buffer; vacuum walks the WAL and reclaims dead versions through the heap manager’s record headers; the DWB recovery handshake runs before log replay even starts. The storage engine is the substrate WAL protects; the Transaction & Recovery section is the protocol that protects it. See cubrid-overview-txn-recovery.md.
  • above (Server Architecture) — The locator (cubrid-locator.md) is the single named bridge between the storage engine and the higher layers. The client-side workspace batches dirty objects into LC_COPYAREA buffers and ships them to a server-side locator_*_force family that fans out into heap, btree, lock, log, FK, and replication paths through one canonical entry point. Triggers and integrity rules also fire from there. From the storage engine’s perspective, every mutating call site is reachable through the locator.

The cross-section boundary is sharp on purpose. A reader who wants to follow a SELECT into the heap, or a COMMIT through the WAL, or a DDL into the catalog, traverses these section boundaries explicitly — and the overview docs at each section name the exact crossing.