CUBRID Storage Engine — Section Overview
Contents:
- What this section covers
- The layered stack
- Reading order
- Cross-cutting concerns inside this section
- Detail-doc summaries
- Adjacent sections
What this section covers
The storage engine in CUBRID is everything that sits below the catalog and the locator and above the operating system’s filesystem. It is the layer that turns a handful of OS-level byte streams into fixed-size pages, hot-caches them in memory, lays out variable-length user records on top of them, builds ordered access paths over those records, and protects every byte against torn writes, dirty-then-crash failures, and (optionally) on-disk theft. The locator and the catalog treat this layer as an abstraction — they ask for a heap page or a B+Tree leaf and the storage engine produces it; how the page got there, how it stays consistent on flush, and how it survives a power loss are all internal concerns of this section.
The boundary above is sharp: heap and B+Tree mutations enter through
locator_*_force (cubrid-locator.md) and through catalog_manager
(cubrid-catalog-manager.md); WAL records leave through the prior list
(cubrid-prior-list.md) and the log manager (cubrid-log-manager.md);
MVCC visibility is decided by cubrid-mvcc.md reading the headers this
section’s heap layer plants. The boundary below is just as sharp: the
disk manager owns every pread/pwrite against a CUBRID volume, and
nothing else in the engine is allowed to touch a data file directly.
Everything in between is what the 9 docs in this section cover.
The AREA slab pool, which used to be the tenth, has moved to
cubrid-overview-base-infra.md along with the lock-free
primitives — the rationale is that AREA is a memory-allocator
concern that every layer can use (parser, optimizer, executor),
not a storage layer in the on-disk sense.
The layered stack
The storage engine is a strict, easy-to-name layering. Each layer treats the layer below as an abstraction and exports a narrower abstraction upward. Understanding the layering is the prerequisite for understanding any read or write path through CUBRID.
```mermaid
flowchart TB
subgraph UPPER["Upper layers (cross-section)"]
LOC["locator / catalog<br/>(cubrid-locator.md, cubrid-catalog-manager.md)"]
LOG["log manager + prior list<br/>(cubrid-log-manager.md, cubrid-prior-list.md)"]
MVCC["MVCC + vacuum<br/>(cubrid-mvcc.md, cubrid-vacuum.md)"]
end
subgraph KEY["Keys on pages"]
BTREE["btree<br/>cubrid-btree.md"]
EHASH["extendible_hash<br/>cubrid-extendible-hash.md"]
end
subgraph REC["Records on pages"]
HEAP["heap_manager<br/>(slotted pages, MVCC headers)<br/>cubrid-heap-manager.md"]
OFLOW["overflow_file<br/>(big-record / OID-list chains)<br/>cubrid-overflow-file.md"]
end
subgraph EXT["Out-of-band data"]
LOB["LOB (BLOB / CLOB)<br/>(files outside data volume)<br/>cubrid-lob.md"]
end
subgraph CACHE["Pages in memory"]
PB["page_buffer<br/>(BCB cache, three-zone LRU)<br/>cubrid-page-buffer-manager.md"]
DWB["double_write_buffer<br/>(torn-page protection)<br/>cubrid-double-write-buffer.md"]
end
subgraph CRYPTO["Page-level encryption"]
TDE["tde<br/>(AES/ARIA-256-CTR, MK -> DEK)<br/>cubrid-tde.md"]
end
subgraph DISK["Volumes & files (OS boundary)"]
DM["disk_manager + file_manager<br/>(volume / sector / file / page)<br/>cubrid-disk-manager.md"]
VOLS[("OS files<br/>(_dbname / _dbname_t / _lgar*)")]
end
LOC --> HEAP
LOC --> BTREE
LOC --> LOB
LOG --> PB
HEAP --> OFLOW
BTREE --> OFLOW
HEAP --> PB
BTREE --> PB
EHASH --> PB
OFLOW --> PB
PB --> TDE
PB --> DWB
TDE --> DWB
DWB --> DM
PB -. clean page read .-> DM
DM --> VOLS
LOB -. host filesystem .-> VOLS
MVCC -. reads / writes record headers .-> HEAP
MVCC -. vacuum reclaims .-> BTREE
```
Reading the diagram bottom-up — and matching each layer to its detail doc — gives the six-layer structure of the section:
- Volumes & files (cubrid-disk-manager.md). The OS-file boundary. A volume is one OS file; a sector is 64 contiguous pages and the disk manager’s allocation unit; a file is a sector bundle; a page is the I/O unit (VPID = (volid, pageid)). The disk cache splits permanent and temporary purposes so temp files cannot starve the permanent space, drives a two-step sector reservation, and decides when to extend a volume adaptively. The file manager turns reserved sectors into pages via three extensible-data tables (Partial / Full / User). A minimal addressing sketch follows this list.
- Pages in memory (cubrid-page-buffer-manager.md + cubrid-double-write-buffer.md). The page buffer maps VPID -> BCB -> in-memory frame through a per-bucket hash, runs a three-zone LRU split into per-thread private lists with adjustable quotas plus a shared list, hands victims directly to sleeping waiters via lock-free queues, and protects each BCB with a custom read/write/flush latch. Every dirty page must pass through the DWB on the way out — the DWB stages the page into a sequential, fixed-size volume that is fsync’d before the home write is issued, so a torn write at the home location is recoverable from the DWB copy before log replay even begins (see the flush-ordering sketch after this list).
- Records on pages (cubrid-heap-manager.md + cubrid-overflow-file.md). The heap manager stores variable-length user records on slotted pages, dispatches the nine record types (REC_HOME, REC_RELOCATION, REC_BIGONE, REC_NEWHOME, …), threads insert/update/delete/read through them, and embeds the MVCC header (insert-MVCCID, delete-MVCCID, prev-version chain) directly inside the record so visibility decisions can be answered without leaving the heap page. When a record outgrows what any heap page can hold, the heap delegates to the overflow file (FILE_MULTIPAGE_OBJECT_HEAP) and stores a small reference at the home OID; the same overflow-file module also handles B+Tree overflow keys (FILE_BTREE_OVERFLOW_KEY) and per-key OID lists too long for a leaf entry.
- Keys on pages (cubrid-btree.md + cubrid-extendible-hash.md). The B+Tree is CUBRID’s primary access path: slotted-page nodes, separate non-leaf and leaf record formats, key||OID concatenation for non-unique indexes, latch coupling on the read path, and unique-constraint enforcement at the OID-suffix level (with per-key OID-list overflow when duplicates pile up). Extendible hash is the second on-page organisation, used by a small set of internal callers — class-name → OID lookup, catalog → repr-id lookup, UPDATE/DELETE OID dedup tables — through an EHID-rooted directory file with a doubling pointer count and slotted bucket pages with binary search and per-bucket local depth.
- Big external data (cubrid-lob.md). BLOB and CLOB columns whose payload is too large to keep next to the row are stored as files outside the data volume (host filesystem or an object store), named by a locator URI string in the record itself. A per-transaction red-black tree on the TDES tracks every locator the transaction touched, and a single dispatch point at commit / rollback reconciles file-system reality with the transaction outcome.
- Encryption (cubrid-tde.md). TDE wraps the page buffer’s encrypt-on-flush / decrypt-on-read hooks. A two-level key hierarchy keeps the master key (MK) outside the database in a separate <db>_keys file; the MK wraps three per-database data encryption keys (DEKs) — DATA, LOG, TEMP — and AES-256-CTR or ARIA-256-CTR with per-page nonces (LSA for permanent pages, atomic counter for temp, logical pageid for log) does the actual byte-level work. The TDE flag is a per-file tablespace property that propagates down to each page’s pflag bits.
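To make the disk-manager vocabulary concrete, here is a minimal sketch of the volume/sector/page addressing described above, using the numbers this section quotes (64 pages per sector). The type and constant names are illustrative stand-ins, not the disk manager’s actual definitions; cubrid-disk-manager.md has the authoritative layout.

```c
/* Illustrative only: volume/sector/page addressing as described in this
 * section (64 pages per sector).  Names are stand-ins, not the real
 * disk-manager definitions. */
#include <stdint.h>

#define DISK_SECTOR_NPAGES 64        /* a sector is 64 contiguous pages      */

typedef int16_t VOLID;               /* volume identifier (one OS file)      */
typedef int32_t PAGEID;              /* page identifier within a volume      */
typedef int32_t SECTID;              /* sector identifier within a volume    */

typedef struct vpid
{
  PAGEID pageid;                     /* which page inside the volume         */
  VOLID volid;                       /* which volume (OS file)               */
} VPID;

/* Sector that owns a given page: reservation happens at this granularity. */
static SECTID
vpid_to_sector (const VPID * vpid)
{
  return vpid->pageid / DISK_SECTOR_NPAGES;
}

/* First page of a reserved sector: the file manager hands out pages from
 * here upward until the sector is full. */
static PAGEID
sector_first_page (SECTID sectid)
{
  return sectid * DISK_SECTOR_NPAGES;
}
```

The flush-ordering sketch referenced in the pages-in-memory entry follows the same spirit: the double write buffer’s torn-write protection reduces to a strict stage → fsync → home-write ordering. The sketch below uses plain POSIX calls; the descriptors, offsets, and page size are hypothetical placeholders, and the real module adds batching, slot bookkeeping, and restart-recovery metadata that this sketch omits.

```c
/* Illustrative only: torn-write protection as a write ordering.
 * dwb_fd / vol_fd, the offsets, and IO_PAGESIZE are placeholders. */
#include <sys/types.h>
#include <unistd.h>

#define IO_PAGESIZE 16384            /* assumed page size for the sketch */

int
flush_page_through_dwb (int dwb_fd, off_t dwb_slot_off,
                        int vol_fd, off_t home_off, const char *page)
{
  /* 1. stage the page in the sequential double write volume */
  if (pwrite (dwb_fd, page, IO_PAGESIZE, dwb_slot_off) != IO_PAGESIZE)
    return -1;

  /* 2. make the staged copy durable before touching the home location */
  if (fsync (dwb_fd) != 0)
    return -1;

  /* 3. only now issue the home write; if it tears, restart recovery finds
        an intact copy in the DWB and repairs the page before log replay */
  if (pwrite (vol_fd, page, IO_PAGESIZE, home_off) != IO_PAGESIZE)
    return -1;

  return 0;
}
```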
The crucial structural fact across the diagram is that every record-organisation layer talks to the page buffer, never to the disk manager directly. This is what lets the page buffer enforce WAL ordering — a heap page mutation produces a log record first (routed through the prior list to the log manager), and the buffer manager refuses to flush the dirty page until the matching log LSA is durable. The same constraint applies to B+Tree, ehash, and catalog mutations.
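A compact way to state that constraint: before writing a dirty frame, the flusher compares the page’s last-modification LSA with what the log manager has already made durable. The helper names below are placeholders for that check, not the page buffer’s real symbols.

```c
/* Conceptual sketch of the WAL-ordering check; page_last_mod_lsa and
 * log_get_durable_lsa are placeholders, not real page-buffer symbols. */
#include <stdbool.h>
#include <stdint.h>

typedef struct log_lsa { int64_t pageid; int16_t offset; } LOG_LSA;

/* LSA of the last log record that modified this frame (stamped at mutation
   time) and the highest LSA the log manager has made durable. */
extern LOG_LSA page_last_mod_lsa (const void *page_frame);
extern LOG_LSA log_get_durable_lsa (void);

static bool
lsa_le (LOG_LSA a, LOG_LSA b)
{
  return a.pageid < b.pageid || (a.pageid == b.pageid && a.offset <= b.offset);
}

/* The page buffer may write a dirty frame only once the log record that
 * last touched it is on stable storage; otherwise it must first ask the
 * log manager to flush up to that LSA (or skip the page for now). */
static bool
can_flush_dirty_page (const void *page_frame)
{
  return lsa_le (page_last_mod_lsa (page_frame), log_get_durable_lsa ());
}
```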
Reading order
The 9 docs in this section are not equally important to a newcomer, and they are not best read in alphabetical or diagram-bottom-up order. The recommended sequence is calibrated to the engineering effort behind each — start where most of the code lives, branch into the rest as needed.
1. cubrid-disk-manager.md — Read first. It defines the vocabulary (VPID, sector, file, volume) that every other doc in the section reuses, and it is also the cleanest layer to understand in isolation because it has no upward dependency beyond “give me an OS file and a page size”. Pay attention to the two-step sector reservation and the permanent / temporary split — those two ideas reappear under different names in the page buffer and in the DWB.
2. cubrid-page-buffer-manager.md — Read second. This is where the bulk of the storage-engine engineering effort lives. The BCB array, the three-zone LRU with private and shared lists, the direct victim handoff to sleeping waiters, and the custom RW/flush latch are all dense topics; budget time. The page buffer is also the one layer in the section that touches every upper layer (heap, btree, ehash, catalog, log) — it is the central coordinator.
3. cubrid-double-write-buffer.md — Read immediately after the page buffer; it is the durability complement. The DWB is conceptually small (sequential staging volume + fsync + home write), but its placement in the flush path and its restart recovery routine are easy to misunderstand without the page buffer’s flush sequence in mind.
4. cubrid-heap-manager.md + cubrid-btree.md — Read the two main record/key access methods next. The heap manager shows the on-page record vocabulary and the MVCC header layout; the B+Tree shows the latch-coupling discipline and the key||OID convention. They reference each other through the overflow file, so reading them in either order works, but doing them as a pair is what produces the “I now understand how a row gets read” mental model.
5. cubrid-overflow-file.md — Skim after heap + btree. Its job is “the page chain everyone falls back to when the primary slot is too small”; understanding it without first knowing what spills into it leads to confusion.
6. cubrid-extendible-hash.md — Skim as needed. Few callers use it (class-name lookup, repr-id lookup, OID dedup), and none of them are on the SELECT/UPDATE hot path. Read it when debugging the catalog, schema reload, or unique-index maintenance.
7. cubrid-lob.md — Skim as needed. Most CUBRID workloads have no LOB columns; this doc is essential when they do.
8. cubrid-tde.md — Skim as needed. The encryption path is largely orthogonal to the rest of the storage engine (encrypt-on-flush / decrypt-on-read at well-defined hooks), so it composes cleanly on top of everything else but does not change the read or write paths described above.
A reader who works through 1-4 in order will have the full “how a page gets read or written” mental model. The remaining four docs each fill in a side-channel that the core path either avoids (LOB, ehash) or transparently piggybacks on (overflow, TDE).
Cross-cutting concerns inside this section
Three concerns thread through every layer in the diagram and are worth naming once at the section level rather than rediscovering in each detail doc.
Latching discipline. The page buffer exports a custom RW/flush
latch on each BCB (PGBUF_LATCH_READ, PGBUF_LATCH_WRITE,
PGBUF_LATCH_FLUSH); it is the only page-grain latch in the
storage engine, and every record/key layer above is required to
hold it for the duration of a page mutation. The heap manager and
the B+Tree share this discipline: a heap insert acquires WRITE on
the target page, possibly READ on the overflow head; a B+Tree
descend acquires READ on each non-leaf, releases ancestors as it
moves down (latch-coupling), and upgrades to WRITE only at the
leaf. Extendible hash uses the same primitive on its directory
and bucket pages; TDE inherits it transparently because encryption
is layered below the latch (the BCB latch protects the
plaintext frame; the cipher boundary is at the I/O edge). Page
latches are intentionally separate from transactional locks —
they are short, embedded in the BCB, and unrelated to isolation;
see cubrid-page-buffer-manager.md §“Lock vs latch separation”
and the cross-reference in cubrid-lock-manager.md.
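As a shape reference for that latch-coupling discipline, here is a simplified read-path descent. It is written against the page-buffer vocabulary this section names (PGBUF_LATCH_READ and the fix/unfix pair), but the argument lists are abbreviated and the btree_*_sketch helpers are hypothetical stand-ins for the real B+Tree code in cubrid-btree.md.

```c
/* Sketch only: latch coupling on the B+Tree read path.  The pgbuf_fix /
 * pgbuf_unfix calls and the *_sketch helpers are simplified stand-ins; the
 * point is the ordering — the child is latched before the parent is
 * released, so the descent can never observe a half-split path. */
PAGE_PTR
btree_descend_to_leaf_sketch (THREAD_ENTRY * thread_p, const VPID * root_vpid,
                              const DB_VALUE * key)
{
  VPID child_vpid;
  PAGE_PTR cur, child;

  cur = pgbuf_fix (thread_p, root_vpid, OLD_PAGE, PGBUF_LATCH_READ,
                   PGBUF_UNCONDITIONAL_LATCH);

  while (cur != NULL && !btree_node_is_leaf_sketch (cur))
    {
      /* pick the child whose key range covers 'key' (binary search over
         the non-leaf records) */
      btree_find_child_sketch (thread_p, cur, key, &child_vpid);

      child = pgbuf_fix (thread_p, &child_vpid, OLD_PAGE, PGBUF_LATCH_READ,
                         PGBUF_UNCONDITIONAL_LATCH);
      if (child == NULL)
        {
          pgbuf_unfix (thread_p, cur);
          return NULL;
        }
      pgbuf_unfix (thread_p, cur);   /* release the ancestor only after the
                                        child READ latch is held            */
      cur = child;
    }

  /* a mutating caller takes PGBUF_LATCH_WRITE at the leaf before touching
     any slot */
  return cur;
}
```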
WAL participation. Every storage subsystem in this section
emits log records for ARIES recovery. The disk manager logs
sector reservations and volume extensions (RVDK_*); the heap
logs every record-type-specific mutation (RVHF_*); the B+Tree
logs every node mutation (RVBT_*); extendible hash logs
directory and bucket changes (RVEH_*); the overflow file logs
its page-chain extensions; LOB logs locator-state transitions
that the per-tx red-black tree replays at commit; even TDE
participates indirectly because its per-page nonce is the
LSA stamped by the log manager. The page buffer is the
enforcer: it refuses to flush a dirty page until the LSA of the
last modifying log record is durable. The actual log
infrastructure lives outside this section in
cubrid-log-manager.md and cubrid-prior-list.md, but every
layer here is a first-class participant. Recovery (three-pass
ARIES; cubrid-recovery-manager.md) replays those records
against the same page buffer that produced them.
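The per-layer pattern behind all of those RV* record types is the same three-step shape. The sketch below shows it for a heap update; the entry-point names follow the heap/log/page-buffer vocabulary used in this section, but the argument lists are abbreviated and the real code paths add MVCC bookkeeping, system operations, and error handling this sketch omits.

```c
/* Sketch only: the mutate / log / mark-dirty shape of a single-page heap
 * update.  Argument lists are abbreviated; RVHF_UPDATE stands in for
 * whichever recovery index the mutation actually needs. */
static void
heap_update_one_slot_sketch (THREAD_ENTRY * thread_p, const VFID * vfid,
                             PAGE_PTR page_p, PGSLOTID slotid,
                             const RECDES * old_rec, const RECDES * new_rec)
{
  /* 1. mutate the in-memory frame; the caller already holds the WRITE
        latch on page_p */
  spage_update (thread_p, page_p, slotid, new_rec);

  /* 2. emit the undo/redo record; the prior list assigns it an LSA and
        that LSA is stamped on the page */
  log_append_undoredo_data2 (thread_p, RVHF_UPDATE, vfid, page_p, slotid,
                             old_rec->length, new_rec->length,
                             old_rec->data, new_rec->data);

  /* 3. mark the frame dirty; the page buffer will not flush it until the
        LSA from step 2 is durable (the WAL rule described above) */
  pgbuf_set_dirty (thread_p, page_p, DONT_FREE);
}
```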
MVCC integration. CUBRID is an MVCC engine and the heap
manager is where MVCC physically lives — every heap record
header carries an insert-MVCCID, a delete-MVCCID, and a
prev-version pointer to the previous physical version on the
same heap. The B+Tree carries OIDs without MVCC headers and
relies on the heap’s headers when the visibility predicate
follows an OID into the heap; the catalog and the system tables
inherit the same convention because they are stored in heap
files too. Vacuum (cubrid-vacuum.md) reads these headers in
WAL replay order, decides what is dead under the
oldest-active-MVCCID watermark, and physically reclaims the
slot — a heap-page mutation in its own right that takes the
WRITE latch and emits its own RVHF_* records. LOB cleanup
piggybacks on the same commit/rollback hooks that drive the
red-black-tree dispatch in cubrid-lob.md. The MVCC vocabulary
itself is owned by cubrid-mvcc.md; this section’s
contribution is the on-page realisation.
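A reduced picture of that on-page realisation: the record-header fields the heap plants and the shape of the test that consumes them. Field names and the visibility predicate below are simplified (a real snapshot also consults the set of MVCCIDs that were active when it was taken); cubrid-mvcc.md and cubrid-heap-manager.md define the actual structures.

```c
/* Illustrative only: a reduced per-record MVCC header and a simplified
 * visibility test against a snapshot watermark. */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t MVCCID;
#define MVCCID_NULL ((MVCCID) 0)

typedef struct mvcc_rec_header_sketch
{
  MVCCID insert_mvccid;   /* transaction that created this version            */
  MVCCID delete_mvccid;   /* transaction that deleted it (MVCCID_NULL if live) */
  /* prev-version pointer to the previous physical version omitted here      */
} MVCC_REC_HEADER_SKETCH;

typedef struct mvcc_snapshot_sketch
{
  MVCCID lowest_active;   /* oldest MVCCID still active when the snapshot was
                             taken; vacuum uses the same watermark to decide
                             what is safely dead                              */
} MVCC_SNAPSHOT_SKETCH;

static bool
record_visible_to_snapshot (const MVCC_REC_HEADER_SKETCH * h,
                            const MVCC_SNAPSHOT_SKETCH * s)
{
  bool inserted_before = h->insert_mvccid < s->lowest_active;
  bool deleted_before = (h->delete_mvccid != MVCCID_NULL
                         && h->delete_mvccid < s->lowest_active);
  return inserted_before && !deleted_before;
}
```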
Detail-doc summaries
| Doc | One-line summary |
|---|---|
| cubrid-disk-manager.md | Four-level hierarchy (OS file = volume, 64-page sector = allocation unit, file = sector bundle, page = I/O unit); permanent-vs-temporary disk cache; two-step sector reservation; adaptive volume extension; three extensible-data tables (Partial / Full / User). |
| cubrid-page-buffer-manager.md | BCB array, three-zone LRU split into per-thread private lists with adjustable quotas plus a shared list, direct victim handoff via lock-free queues, custom read/write/flush latch per BCB. |
| cubrid-double-write-buffer.md | Sequential staging volume fsync’d before home write — torn-write protection between page buffer and data files; restart compares DWB copy to home page and replaces it if needed before log replay starts. |
| cubrid-heap-manager.md | Slotted pages, nine record types, INSERT/UPDATE/DELETE/READ flow, MVCC versioning inside the record header (insert-MVCCID, delete-MVCCID, prev-version chain), hot-path caches. |
| cubrid-overflow-file.md | Heap big-record and B+Tree overflow-OID page chains; one symbol-level overflow-file module shared by FILE_MULTIPAGE_OBJECT_HEAP / FILE_BTREE_OVERFLOW_KEY / per-tree OID overflow; WAL discipline for crash safety. |
| cubrid-lob.md | BLOB/CLOB stored as files outside the data volume, locator-URI naming in the record, per-transaction red-black tree on TDES, single dispatch point at commit/rollback for filesystem reconciliation. |
| cubrid-btree.md | Slotted-page nodes, separate non-leaf and leaf record formats, key\|\|OID concatenation for non-unique indexes, latch coupling on the read path, unique-constraint enforcement at the OID-suffix level with per-key OID-list overflow. |
| cubrid-extendible-hash.md | Fagin-style EHID-rooted directory file with doubling pointer count, slotted bucket pages with binary search and per-bucket local depth, system-op-bracketed splits/merges, RVEH_* WAL records. |
| cubrid-tde.md | Two-level key hierarchy (master key wraps three per-database DEKs); AES-256-CTR or ARIA-256-CTR with per-page nonces (LSA for permanent, atomic counter for temp, logical pageid for log); encrypt-on-flush / decrypt-on-read hooks; separate <db>_keys master-key file; per-file TDE flag propagates to each page’s pflag bits. |
Adjacent sections
The storage engine is the bottom of the cub_server process, but not the bottom of the system — three neighbouring sections own the boundaries it touches.
- Above (DDL & Schema) — Catalog and class-object. The catalog manager (cubrid-catalog-manager.md) stores per-class disk representation and statistics in a heap file anchored by CTID, and the parallel system classes (_db_class, _db_attribute, _db_index, …) bootstrap from a fixed root-class OID; the in-memory SM_CLASS graph (cubrid-class-object.md) materialises the catalog in the client-side workspace. This section’s heap manager holds the bytes; the DDL & Schema section interprets them. See cubrid-overview-ddl-schema.md for the full topology.
- Above (Transaction & Recovery) — Log, vacuum, checkpoint, recovery. Every dirty page in this section is gated by an LSA the log manager produced; every restart replays through the same page buffer; vacuum walks the WAL and reclaims dead versions through the heap manager’s record headers; the DWB recovery handshake runs before log replay even starts. The storage engine is the substrate WAL protects; the Transaction & Recovery section is the protocol that protects it. See cubrid-overview-txn-recovery.md.
- Above (Server Architecture) — The locator (cubrid-locator.md) is the single named bridge between the storage engine and the higher layers. The client-side workspace batches dirty objects into LC_COPYAREA buffers and ships them to a server-side locator_*_force family that fans out into heap, btree, lock, log, FK, and replication paths through one canonical entry point. Triggers and integrity rules also fire from there. From the storage engine’s perspective, every mutating call site is reachable through the locator.
The cross-section boundary is sharp on purpose. A reader who wants to follow a SELECT into the heap, a COMMIT through the WAL, or a DDL statement into the catalog traverses these section boundaries explicitly — and the overview doc for each neighbouring section names the exact crossing.