CUBRID Catalog Manager — Disk Representation, System Classes, and Statistics

The catalog is the database’s self-description: every other subsystem — parser, optimizer, MVCC, lock manager, vacuum, CDC — asks the catalog questions like “what attributes does class C have?”, “where is its heap file?”, “what indexes target it?”, “how many rows does it have?”. Database Internals (Petrov, ch. 1 §“Database storage” and ch. 7 §“Storage Engines”) frames this as one of the two universal database invariants: schema and storage layout must be reconstructible from the bytes on disk without any out-of-band knowledge.

Two implementation choices the model leaves open shape every real engine and frame the rest of this document:

  1. Where the bootstrap “root” lives. The catalog needs a well-known starting point — an OID, a fixed page id, or a file id — that is the same on every CUBRID database file. Without it, opening the catalog would itself need a catalog. Engines pick: a fixed OID for a root class (CUBRID, OODB-style), a fixed page in the system tablespace (Oracle’s bootstrap segment), or a fixed table OID with hard-coded schema (PostgreSQL’s pg_class).
  2. Whether storage layout and user-visible schema are the same structure. Some engines unify them: PostgreSQL’s pg_class is the table catalog, accessed via the same heap+index machinery as user tables. Others split them: an internal “catalog manager” stores compact disk-representation records for engine use, and a parallel set of user-visible system classes (_db_class, _db_attribute, …) lets SQL queries inspect the schema. CUBRID picks the split design — the internal system_catalog and the user-visible catcls_* tables coexist, with the user-side tables driven from the internal records.

Once those choices are named, every CUBRID-specific structure in this document either implements one of them or makes the access faster.

Every relational engine reaches for a similar set of patterns around catalog storage and bootstrap.

The catalog is stored in heap files like any other class, but the heap manager needs catalog records to interpret pages. This chicken-and-egg problem is resolved by bootstrap classes with hard-coded schemas that the engine can interpret without consulting the catalog. CUBRID’s root class is the seed; from the root class the engine learns about _db_class, from _db_class it learns about _db_attribute, and so on.

Engines distinguish between the physical disk representation (the byte order of attributes, fixed vs. variable, pad bytes) and the logical schema (column names, types, constraints). The disk representation can change without invalidating existing rows by versioning: each class has a list of representations indexed by REPR_ID, and every heap row carries the REPR_ID it was written under. ALTER TABLE bumps the representation; old rows decode with the old representation, new rows with the new.
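A minimal sketch of version-tagged decoding may make this concrete. The types and helpers below (toy_repr, toy_row, toy_find_repr, toy_read_attr) are hypothetical stand-ins, not CUBRID's API, and falling back to a default value for attributes the row's own representation lacks is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for REPR_ID and DISK_REPR. */
typedef int toy_repr_id;

#define TOY_MAX_ATTRS 4

struct toy_repr
{
  toy_repr_id id;   /* representation identifier */
  int n_attrs;      /* number of int attributes a row of this version stores */
};

struct toy_row
{
  toy_repr_id repr_id;         /* representation the row was written under */
  int values[TOY_MAX_ATTRS];   /* only the first n_attrs entries are valid */
};

/* Find the representation with the given id, or NULL when it is unknown. */
static const struct toy_repr *
toy_find_repr (const struct toy_repr *reprs, size_t n_reprs, toy_repr_id id)
{
  for (size_t i = 0; i < n_reprs; i++)
    {
      if (reprs[i].id == id)
        {
          return &reprs[i];
        }
    }
  return NULL;
}

/* Read attribute attr_idx from a row, decoding with the row's own
   representation. Attributes added by a later representation fall back to
   default_value (an assumption of this sketch).
   Returns 0 on success, -1 on an unknown representation or a bad index. */
static int
toy_read_attr (const struct toy_repr *reprs, size_t n_reprs,
               const struct toy_row *row, int attr_idx,
               int default_value, int *out)
{
  const struct toy_repr *repr = toy_find_repr (reprs, n_reprs, row->repr_id);

  if (repr == NULL || attr_idx < 0 || attr_idx >= TOY_MAX_ATTRS)
    {
      return -1;
    }
  *out = (attr_idx < repr->n_attrs) ? row->values[attr_idx] : default_value;
  return 0;
}
```

In this sketch a row written under representation 1 (two attributes) stays readable after representation 2 adds a third attribute: the old row yields the default for the new attribute, and no existing row has to be rewritten.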

Statistics are the optimizer’s input, not part of the schema — they change continuously. Engines store them adjacent to the catalog but with a different update cadence. PostgreSQL has pg_statistic, InnoDB has mysql.innodb_index_stats, CUBRID has CLS_INFO carrying per-class stats and BTREE_STATS per index.

The optimizer asks at three granularities: per-class (heap_num_objects, heap_num_pages), per-attribute (n_distinct_values), per-index (B+Tree key count, leaf count, height, partial-key cardinality for compound indexes). CUBRID’s CLASS_STATS, ATTR_STATS, BTREE_STATS map one-to-one to these levels.

Two access flavours: server-side and client-side

The server reads the catalog during query execution; the client reads it during DDL parsing and schema introspection. CUBRID ships parallel *_sr.c (server) and *_cl.c (client) sources for statistics, with the _sr side authoritative.

| Theoretical concept | CUBRID name |
|---|---|
| Catalog identifier (boot anchor) | CTID { vfid, xhid, hpgid } (system_catalog.h); global catalog_Id |
| Disk representation of one class | DISK_REPR { id, n_fixed, fixed[], n_variable, variable[] } (system_catalog.h) |
| Per-attribute disk info | DISK_ATTR { id, location, type, val_length, value, position, classoid, n_btstats, bt_stats, ndv } |
| Per-class catalog info | CLS_INFO { ci_hfid, ci_tot_pages, ci_tot_objects, ci_time_stamp, ci_rep_dir } |
| Per-access transient state | CATALOG_ACCESS_INFO { class_oid, dir_oid, class_name, is_update, … } |
| User-visible system classes | _db_class, _db_attribute, _db_index, _db_domain, _db_method, … |
| Catalog-class entry-point family | catcls_* functions (catalog_class.c) |
| Catalog primary heap | catalog_Id.vfid — file holding catalog records |
| Catalog directory hash | catalog_Id.xhid — extendible-hash index class_oid → dir_oid |
| Catalog header page | catalog_Id.hpgid — fixed page id holding catalog metadata |
| Class statistics | CLASS_STATS { time_stamp, heap_num_objects, heap_num_pages, n_attrs, attr_stats[] } (statistics.h) |
| Per-attribute statistics | ATTR_STATS { id, type, n_btstats, bt_stats[], ndv } |
| Per-index statistics | BTREE_STATS { btid, leafs, pages, height, keys, has_function, key_type, pkeys[] } |
| Recovery functions for catalog | catalog_rv_new_page_redo, catalog_rv_insert_{redo,undo}, catalog_rv_delete_{redo,undo}, catalog_rv_update, catalog_rv_ovf_page_logical_insert_undo |
| Server boot | boot_restart_server (boot_sr.c) |
| Catalog start at access | catalog_start_access_with_dir_oid |
| Catalog end at access | catalog_end_access_with_dir_oid |

The catalog has four moving parts: the catalog identifier and its three-part layout, the disk-representation records that capture per-class storage, the parallel user-visible system classes that surface the same data through SQL, and the statistics records, which are updated on a different cadence from the schema. We walk them in that order.

flowchart LR
  subgraph BOOT["Bootstrap"]
    ROOT["Root class (fixed OID)"]
    BOOTFN["boot_restart_server"]
  end
  subgraph CTID["catalog_Id (CTID)"]
    VFID["vfid: catalog heap file"]
    XHID["xhid: extendible hash class_oid → dir_oid"]
    HPGID["hpgid: catalog header page"]
  end
  subgraph CR["Catalog records (system_catalog.c)"]
    DR1["DISK_REPR for class A repr 1"]
    DR2["DISK_REPR for class A repr 2"]
    CLI["CLS_INFO for class A"]
    CR1["..."]
  end
  subgraph SC["User-visible system classes (catcls)"]
    DBC["_db_class"]
    DBA["_db_attribute"]
    DBI["_db_index"]
    DBD["_db_domain"]
    DBT["_db_data_type"]
  end
  subgraph ST["Statistics"]
    CS["CLASS_STATS"]
    AS["ATTR_STATS"]
    BS["BTREE_STATS"]
  end
  BOOTFN --> ROOT
  BOOTFN --> CTID
  CTID --> CR
  ROOT --> SC
  SC -.synchronised with.-> CR
  CR -.cardinality from.-> ST
  ST --> CS --> AS --> BS

The figure encodes three boundaries:

  • Boot / runtime: the root class is the only OID the engine knows ahead of time; everything else is reachable from it.
  • Internal / user-visible: catalog records under catalog_Id are compact disk representations the engine reads on hot paths; system classes under _db_* are the schema-introspection face SQL queries hit.
  • Schema / stats: representations describe shape and do not change often; statistics describe size and change with every statistics update.

CTID — the catalog’s three-part identity

The catalog itself is a small data structure with three pointers:

// CTID — src/storage/system_catalog.h
struct ctid
{
  VFID vfid;     /* catalog volume identifier — heap file holding catalog records */
  EHID xhid;     /* extendible hash index identifier — class_oid → dir_oid map */
  PAGEID hpgid;  /* catalog header page identifier */
};
typedef struct ctid CTID;
extern CTID catalog_Id;  /* global catalog identifier */

The three components correspond to three on-disk objects:

  • vfid — a heap file that stores the catalog’s DISK_REPR and CLS_INFO records. It is treated like any other heap by the heap manager (cubrid-heap-manager.md), with one extra invariant: catalog records must never be vacuumed away while any transaction can still see the class they describe.
  • xhid — an extendible-hash index keyed on class_oid, pointing to the directory record OID for that class. The directory record in turn lists the OIDs of all DISK_REPR records for the class (one per representation, plus one for CLS_INFO).
  • hpgid — the catalog header page, holding global catalog metadata (version, last-allocated representation id, etc.). Its page id is fixed at boot time and never changes.

catalog_initialize (system_catalog.c) populates the global catalog_Id from these three values during boot. From that moment, every catalog access starts with catalog_Id and descends into one of the three components.
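The lookup path through xhid and the directory record can be sketched as two small steps. This is an illustration under stated assumptions: toy_oid, toy_hash_entry, toy_dir_record and both lookup functions are hypothetical, and a linear scan stands in for the extendible hash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified object identifier; a stand-in for CUBRID's OID. */
struct toy_oid
{
  int volid;
  int pageid;
  int slotid;
};

static bool
toy_oid_eq (struct toy_oid a, struct toy_oid b)
{
  return a.volid == b.volid && a.pageid == b.pageid && a.slotid == b.slotid;
}

/* One class_oid -> dir_oid mapping; an array of these stands in for xhid. */
struct toy_hash_entry
{
  struct toy_oid class_oid;
  struct toy_oid dir_oid;
};

#define TOY_MAX_REPRS 4

/* Directory record: one (repr_id, record OID) pair per representation. */
struct toy_dir_record
{
  int n_reprs;
  int repr_ids[TOY_MAX_REPRS];
  struct toy_oid repr_oids[TOY_MAX_REPRS];
};

/* Step 1: class_oid -> dir_oid. A linear scan here; the real index is an
   extendible hash. Returns NULL when the class is not in the catalog. */
static const struct toy_oid *
toy_lookup_dir_oid (const struct toy_hash_entry *entries, size_t n,
                    struct toy_oid class_oid)
{
  for (size_t i = 0; i < n; i++)
    {
      if (toy_oid_eq (entries[i].class_oid, class_oid))
        {
          return &entries[i].dir_oid;
        }
    }
  return NULL;
}

/* Step 2: directory record -> OID of the record that holds the requested
   representation. Returns NULL when the representation id is unknown. */
static const struct toy_oid *
toy_lookup_repr_oid (const struct toy_dir_record *dir, int repr_id)
{
  for (int i = 0; i < dir->n_reprs && i < TOY_MAX_REPRS; i++)
    {
      if (dir->repr_ids[i] == repr_id)
        {
          return &dir->repr_oids[i];
        }
    }
  return NULL;
}
```

A caller would resolve the class OID to its directory OID once, fetch the directory record from the catalog heap, and then pick the representation it needs. Keeping the directory as a separate record is what lets several representations of one class coexist.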

Disk representation — one record per attribute layout

Per class, per representation:

// DISK_REPR — src/storage/system_catalog.h
struct disk_representation
{
  REPR_ID id;                       /* representation identifier */
  int n_fixed;                      /* number of fixed-length attributes */
  struct disk_attribute *fixed;     /* fixed attribute structures */
  int fixed_length;                 /* total length of fixed attrs */
  int n_variable;                   /* number of variable-length attrs */
  struct disk_attribute *variable;  /* variable attribute structures */
};

The split between fixed and variable is the on-disk layout choice: fixed-length attributes pack tightly with no per-attribute offset overhead, variable-length attributes carry an offset table at the front of the row. Iterating over disk_representation::fixed[] and ::variable[] is exactly the order the heap manager uses to interpret a row.
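The two access paths can be illustrated with a toy row format. The layout below (offset table, then fixed area, then variable area) and the helper names are assumptions made for this sketch; CUBRID's real record format has additional header and bound-bit fields that are omitted here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy row layout (an assumption for illustration only):
     [ (n_variable + 1) x uint16_t offsets ][ fixed area ][ variable area ]
   Offsets are relative to the start of the row; entry i is the start of
   variable attribute i and entry i + 1 is its end. */

static size_t
toy_offset_table_size (int n_variable)
{
  return (size_t) (n_variable + 1) * sizeof (uint16_t);
}

/* Fixed attribute: 'location' is an exact byte offset inside the fixed
   area, so reading it needs no per-row lookup. */
static int32_t
toy_get_fixed_int (const uint8_t *row, int n_variable, int location)
{
  int32_t v;

  memcpy (&v, row + toy_offset_table_size (n_variable) + location, sizeof v);
  return v;
}

/* Variable attribute: 'location' is an index into the offset table. The
   value spans the bytes between two consecutive offsets. */
static const uint8_t *
toy_get_variable (const uint8_t *row, int location, size_t *len)
{
  uint16_t start, end;

  memcpy (&start, row + (size_t) location * sizeof (uint16_t), sizeof start);
  memcpy (&end, row + (size_t) (location + 1) * sizeof (uint16_t), sizeof end);
  *len = (size_t) (end - start);
  return row + start;
}
```

The fixed path costs one addition per attribute, while the variable path costs two offset reads. That difference is the reason a representation keeps the two groups in separate arrays.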

Each attribute carries:

// DISK_ATTR — src/storage/system_catalog.h
struct disk_attribute
{
  ATTR_ID id;             /* attribute identifier */
  int location;           /* fixed: exact offset; variable: index into the offset table */
  DB_TYPE type;           /* int / varchar / float / … */
  int val_length;         /* default value length, ≥ 0 */
  void *value;            /* default value (no default expression) */
  int position;           /* storage position (fixed only) */
  OID classoid;           /* source class — for inherited attrs */
  int n_btstats;          /* number of B+tree statistics */
  BTREE_STATS *bt_stats;  /* per-index stats array */
  INT64 ndv;              /* Number of Distinct Values */
};

Two fields are worth marking up:

  • classoid: for inherited attributes, classoid identifies which class along the inheritance chain originally defined the attribute. The optimizer uses this to avoid double-counting inherited attributes when computing class cardinality.
  • bt_stats[] and ndv: statistics live inside the attribute record, not in a separate file. This is the trade-off: a statistics update (UPDATE STATISTICS) rewrites the attribute record, which is heavier than a separate stats table would be, but reading the disk representation gives the optimizer everything in one fetch.

Per-class info — the heap pointer and rough counts

CLS_INFO is the per-class summary record:

// CLS_INFO — src/storage/system_catalog.h
struct cls_info
{
  HFID ci_hfid;                /* heap file identifier for the class */
  int ci_tot_pages;            /* total pages in the heap file */
  int ci_tot_objects;          /* total live objects */
  unsigned int ci_time_stamp;  /* timestamp of last update */
  OID ci_rep_dir;              /* representation directory record OID */
};

ci_hfid is the most-read field. Every query that scans a class starts by fetching the class’s CLS_INFO from the catalog, reading ci_hfid, and handing the heap file to the scan manager.

ci_rep_dir is the back-pointer from CLS_INFO to the directory record listing all representations of this class. Following the chain class_oid → xhid → dir_oid → DISK_REPR is the standard lookup; following CLS_INFO::ci_rep_dir → DISK_REPR is the inverse for traversal during ALTER.

ci_time_stamp is the cache-validation token: the optimizer caches CLS_INFO in process memory and invalidates the cache when ci_time_stamp advances.
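The timestamp check can be sketched as follows. The structures and toy_cache_get are hypothetical, and the sketch refreshes whenever the two timestamps differ; a real cache would fetch only the timestamp first and re-read the full record on a mismatch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down copy of the per-class summary. */
struct toy_cls_info
{
  int tot_pages;
  int tot_objects;
  unsigned int time_stamp;   /* plays the role of ci_time_stamp */
};

/* Process-local cache slot for one class. */
struct toy_cls_cache
{
  bool valid;
  struct toy_cls_info cached;
  int n_refreshes;           /* how many times the catalog copy was re-read */
};

/* Return the cached summary, re-reading it from the catalog copy when the
   slot is empty or the stored timestamp no longer matches the current one. */
static const struct toy_cls_info *
toy_cache_get (struct toy_cls_cache *cache,
               const struct toy_cls_info *catalog_copy)
{
  if (!cache->valid || cache->cached.time_stamp != catalog_copy->time_stamp)
    {
      cache->cached = *catalog_copy;
      cache->valid = true;
      cache->n_refreshes++;
    }
  return &cache->cached;
}
```

Repeated reads with an unchanged timestamp are served from memory. Once a statistics update bumps the timestamp in the catalog, the next read sees the mismatch and picks up the new counts.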

Every catalog read or write goes through a CATALOG_ACCESS_INFO session:

// CATALOG_ACCESS_INFO — src/storage/system_catalog.h
struct catalog_access_info
{
  OID *class_oid;
  OID *dir_oid;               /* cached after first xhid lookup */
  char *class_name;
  bool is_update;             /* update access — needs X locks */
  bool need_unlock;           /* unlock at end-access time */
  bool access_started;        /* guard against double-start */
  bool need_free_class_name;
#if !defined (NDEBUG)
  bool is_systemop_started;
#endif
};

The session is opened with catalog_start_access_with_dir_oid and closed with catalog_end_access_with_dir_oid. Between the two, the caller has the directory OID cached, the class lock acquired (S for read, X for update), and (in update sessions) a system-op bracket open so partial catalog updates can be rolled back as a logical unit. The debug-only is_systemop_started field assert-checks this discipline.
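The start/end discipline can be sketched with a toy session object. Everything here is hypothetical (toy_access_info, toy_start_access, toy_end_access, the single-character lock mode); the real functions also resolve the directory OID and talk to the lock manager, which this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced session state. */
struct toy_access_info
{
  bool is_update;        /* update access takes an exclusive lock */
  bool need_unlock;      /* release the lock at end-access time */
  bool access_started;   /* guard against double-start */
  char lock_mode;        /* 'S', 'X', or 0 when no lock is held */
};

/* Open a session. Returns -1 when a session is already open on this
   object, which is the misuse the access_started flag exists to catch. */
static int
toy_start_access (struct toy_access_info *info, bool is_update)
{
  if (info->access_started)
    {
      return -1;
    }
  info->is_update = is_update;
  info->lock_mode = is_update ? 'X' : 'S';
  info->need_unlock = true;
  info->access_started = true;
  return 0;
}

/* Close a session and drop the lock. Returns -1 when nothing was open. */
static int
toy_end_access (struct toy_access_info *info)
{
  if (!info->access_started)
    {
      return -1;
    }
  if (info->need_unlock)
    {
      info->lock_mode = 0;
      info->need_unlock = false;
    }
  info->access_started = false;
  return 0;
}
```

The point of the bracket is that every exit path from a catalog operation, including error paths, runs the end call, so a lock taken at start is never leaked.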

Catalog-class machinery — user-visible schema

Parallel to catalog_Id-rooted records, CUBRID maintains a set of user-visible classes that mirror the same data through SQL. The classes are conventionally named _db_*:

  • _db_class — one row per class, with name, OID, owner, type (table / view / partition), creation time.
  • _db_attribute — one row per attribute of a class.
  • _db_index — one row per index.
  • _db_domain — one row per type domain (used for compound domains).
  • _db_data_type — system data-type catalogue.
  • _db_method, _db_meth_arg, _db_meth_file, _db_method_sig — for OODB methods.
  • _db_partition — partitioning info.
  • _db_trigger — triggers.
  • _db_serial — sequences.
  • _db_collation — collation catalogue.
  • _db_charset, _db_servers, _db_user, _db_auth, _db_password, _db_synonym, … — auxiliary.

The catcls_* family in catalog_class.c is the bridge:

// catalog_class.h — src/storage/catalog_class.h
extern bool catcls_Enable;
int catcls_compile_catalog_classes (THREAD_ENTRY *thread_p);
int catcls_insert_catalog_classes (THREAD_ENTRY *thread_p, RECDES *record);
int catcls_delete_catalog_classes (THREAD_ENTRY *thread_p, const char *name, OID *class_oid);
int catcls_update_catalog_classes (THREAD_ENTRY *thread_p, const char *name, RECDES *record,
                                   OID *class_oid_p, UPDATE_INPLACE_STYLE force_in_place);
int catcls_finalize_class_oid_to_oid_hash_table (THREAD_ENTRY *thread_p);
int catcls_remove_entry (THREAD_ENTRY *thread_p, OID *class_oid);
int catcls_get_server_compat_info (THREAD_ENTRY *thread_p, INTL_CODESET *charset_id_p,
                                   char *lang_buf, const int lang_buf_size, char *timezone_checksum);
int catcls_get_db_collation (THREAD_ENTRY *thread_p, LANG_COLL_COMPAT **db_collations, int *coll_cnt);
int catcls_update_class_stats (THREAD_ENTRY *thread_p, const char *class_name,
                               unsigned int ci_time_stamp, bool with_fullscan);

When DDL creates or alters a class, the engine writes an internal DISK_REPR to the catalog and inserts a row into _db_class (and related rows into _db_attribute, _db_index, …). The two writes are bracketed in a single transaction so they commit together; if they diverge (e.g., crash between writes), the recovery pass rolls the partial work back as one unit. Whether concurrent readers can ever observe the two faces out of step under lock-mode escalation is open (see Open Q3).

catcls_compile_catalog_classes is the function that originally builds the system classes from a hard-coded schema; it runs at install time. (The neighbouring symbol catcls_insert_catalog_classes is the row-insert entry point used at DDL time, not the install-time compiler.) The schema source lives in src/object/schema_system_catalog_install.cpp (cubrid AGENTS.md mentions strict formatting rules there).

Bootstrap — boot_restart_server and the root class

boot_restart_server (boot_sr.c) is the post-recovery entry point that brings the catalog online:

  1. Initialize log and recover. log_initialize runs the three-pass restart (cubrid-recovery-manager.md).
  2. Initialize disk and file managers.
  3. Read catalog_Id. From the database parameter file (the boot_DB_parm record) plus the on-disk header; this gives (vfid, xhid, hpgid).
  4. catalog_initialize (catalog_Id). Sets up in-memory structures.
  5. Bind the root class. The root class has a fixed OID stored in the boot parameters; the engine reads its DISK_REPR from the catalog and primes the metaclass cache.
  6. Bind the system classes. Walking from the root class, _db_class, _db_attribute, etc. are loaded with the class-name ↔ OID mapping cached in catcls_class_oid_to_oid_hash_table.
  7. Initialize statistics caches.
  8. Start vacuum master + workers (cubrid-vacuum.md).

After step 7 the server is ready to accept queries. boot_DB_parm is the on-disk parameter record; updates to it flow through catcls_update_catalog_classes so they’re durable.
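The ordering constraint in the boot steps above (each step depends on the ones before it) can be sketched as a table of step functions that stops at the first failure. The runner, the step type, and the placeholder steps are all hypothetical; they are not how boot_restart_server is structured internally.

```c
#include <assert.h>
#include <stddef.h>

/* A boot step returns 0 on success and nonzero on failure. */
typedef int (*toy_boot_step_fn) (void *ctx);

struct toy_boot_step
{
  const char *name;        /* label for diagnostics */
  toy_boot_step_fn run;
};

/* Run the steps in order and stop at the first failure. Returns the number
   of steps that completed; the boot succeeded only if that equals n_steps. */
static size_t
toy_run_boot (const struct toy_boot_step *steps, size_t n_steps, void *ctx)
{
  size_t done = 0;

  while (done < n_steps && steps[done].run (ctx) == 0)
    {
      done++;
    }
  return done;
}

/* Placeholder step that succeeds and counts how many steps have run. */
static int
toy_step_ok (void *ctx)
{
  (*(int *) ctx)++;
  return 0;
}

/* Placeholder step that always fails. */
static int
toy_step_fail (void *ctx)
{
  (void) ctx;
  return -1;
}
```

With this shape, a failure while reading catalog_Id leaves the later steps (binding the root class, binding the system classes) unexecuted, which matches the dependency order in the list.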

Statistics — separate cadence, separate access

Statistics are stored in the same DISK_ATTR record but are updated on a different cadence: each time UPDATE STATISTICS runs, in either sampling or full-scan mode. The statistics structures:

// statistics.h — src/storage/statistics.h
struct btree_stats
{
  BTID btid;
  int leafs;            /* leaf pages including overflow */
  int pages;            /* total pages */
  int height;           /* tree depth */
  int keys;             /* distinct keys */
  int has_function;     /* function index? */
  TP_DOMAIN *key_type;
  int pkeys_size;       /* compound-key partial-cardinality array */
  int *pkeys;           /* pkeys[k] = NDV of first k+1 components */
  int dedup_idx;        /* SUPPORT_DEDUPLICATE_KEY_MODE */
};
struct attr_stats
{
  int id;
  DB_TYPE type;
  int n_btstats;
  BTREE_STATS *bt_stats;
  INT64 ndv;            /* Number of Distinct Values */
};
struct class_stats
{
  unsigned int time_stamp;
  int heap_num_objects;
  int heap_num_pages;
  int n_attrs;
  ATTR_STATS *attr_stats;
};

The pkeys[] array is worth marking up. For a compound index (a, b, ..., x) of size pkeys_size = k, pkeys[i] is the cardinality of the first i+1 columns. The optimizer uses this to estimate selectivity for queries that filter on a prefix of the index — without it, every prefix query would have to assume independent column distributions.
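A small worked sketch shows how a prefix estimate could be read off pkeys[]. The function toy_prefix_eq_rows, the uniform-distribution assumption, and the choice to reuse the deepest tracked prefix when the query prefix is longer than the array are all assumptions of this sketch, not the optimizer's actual formula.

```c
#include <assert.h>

#define TOY_PKEYS_MAX 8   /* mirrors BTREE_STATS_PKEYS_NUM = 8 cited in this document */

/* Estimated number of rows matching an equality predicate on the first
   n_prefix_cols columns of a compound index, assuming rows are spread
   uniformly over the distinct prefix values:
       rows ~= total_rows / pkeys[n_prefix_cols - 1]
   Falls back to total_rows when no usable statistic exists. */
static double
toy_prefix_eq_rows (const int *pkeys, int pkeys_size, int n_prefix_cols,
                    double total_rows)
{
  int idx;

  if (pkeys_size <= 0 || n_prefix_cols <= 0)
    {
      return total_rows;
    }
  idx = n_prefix_cols - 1;
  if (idx >= pkeys_size)
    {
      idx = pkeys_size - 1;     /* prefix longer than tracked: use deepest */
    }
  if (idx >= TOY_PKEYS_MAX)
    {
      idx = TOY_PKEYS_MAX - 1;
    }
  if (pkeys[idx] <= 0)
    {
      return total_rows;
    }
  return total_rows / (double) pkeys[idx];
}
```

For example, with 10,000 rows and pkeys = {10, 200, 1000}, an equality filter on the first column is estimated at 10,000 / 10 = 1,000 rows and a filter on the first two columns at 10,000 / 200 = 50 rows. If the second column alone had 50 distinct values, an independence assumption would guess 10 × 50 = 500 distinct pairs and therefore 20 rows, underestimating the real figure because the columns are correlated.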

STATS_SAMPLING_THRESHOLD = 5000 and NUMBER_OF_SAMPLING_PAGES = 5000 (declared in statistics.h) are the sampling defaults; full-scan mode (STATS_WITH_FULLSCAN) is the alternative. The stats_adjust_sampling_weight inline (in statistics.h) applies a differential weight when sampling NDV is below 1% of expected; the assumption is that “if the sample data is a lot of duplicated, there will also be duplicate in the overall data”.

Server-side statistics access goes through statistics_sr.c; client-side (the SQL interface) through statistics_cl.c plus stats_get_statistics (declared in statistics.h, SERVER_MODE-disabled).

sequenceDiagram
  participant CL as DDL parser
  participant CAT as catalog (system_catalog)
  participant CC as catalog_class (catcls_*)
  participant LM as log_manager (sysop)
  participant LCK as lock_manager

  CL->>LCK: X-lock class_oid
  CL->>LM: log_sysop_start (DDL atomic)
  CL->>CAT: catalog_start_access_with_dir_oid (X)
  CL->>CAT: catalog_add_representation (new DISK_REPR with new REPR_ID)
  CL->>CAT: catalog_update_class_info (CLS_INFO ci_time_stamp bumped)
  CL->>CC: catcls_update_catalog_classes (_db_class row)
  CL->>CC: catcls_update_catalog_classes (_db_attribute rows)
  CL->>CAT: catalog_end_access_with_dir_oid
  CL->>LM: log_sysop_commit
  CL->>LCK: X-lock release at commit

The two catalog faces — internal records and user-visible system classes — are updated in the same system op, so a crash mid-update either rolls them both back or keeps them both. The DDL transaction commits as a single atomic unit even though it touched ~10 different files.
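The all-or-nothing behaviour can be sketched with a toy bracket that snapshots state at start and restores it on abort. This is a simplification under stated assumptions: the names are hypothetical, and the snapshot stands in for the log-based undo that a real system op relies on.

```c
#include <assert.h>
#include <stdbool.h>

/* Two counters stand in for the two catalog faces. */
struct toy_catalog
{
  int internal_repr_id;      /* internal catalog record (DISK_REPR side) */
  int db_class_row_version;  /* user-visible _db_class side */
};

/* Toy system-op bracket: remembers the state at start. */
struct toy_sysop
{
  struct toy_catalog *target;
  struct toy_catalog snapshot;
  bool active;
};

static void
toy_sysop_start (struct toy_sysop *op, struct toy_catalog *cat)
{
  op->target = cat;
  op->snapshot = *cat;
  op->active = true;
}

static void
toy_sysop_commit (struct toy_sysop *op)
{
  op->active = false;
}

/* Undo everything done since start by restoring the snapshot. */
static void
toy_sysop_abort (struct toy_sysop *op)
{
  if (op->active)
    {
      *op->target = op->snapshot;
      op->active = false;
    }
}

/* Apply one DDL change to both faces inside a single bracket.
   fail_between_writes simulates a failure after the first write. */
static int
toy_alter_class (struct toy_catalog *cat, bool fail_between_writes)
{
  struct toy_sysop op;

  toy_sysop_start (&op, cat);
  cat->internal_repr_id++;             /* write 1: internal record */
  if (fail_between_writes)
    {
      toy_sysop_abort (&op);           /* both faces return to the old state */
      return -1;
    }
  cat->db_class_row_version++;         /* write 2: user-visible row */
  toy_sysop_commit (&op);
  return 0;
}
```

After a simulated failure both counters are back at their starting values, and after a successful call both have advanced together, so the two faces never end up at different versions.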

Anchor on symbol names, not line numbers.

  • CTID (system_catalog.h) — catalog identifier.
  • DISK_REPR / DISK_ATTR (system_catalog.h) — disk representation records.
  • CLS_INFO (system_catalog.h) — per-class summary.
  • CATALOG_ACCESS_INFO (system_catalog.h) — per-access session.
  • CATALOG_DIR_REPR_KEY = -2 macro (system_catalog.h) — directory key sentinel.
  • BTREE_STATS / ATTR_STATS / CLASS_STATS (statistics.h).
  • BTREE_STATS_PKEYS_NUM = 8 macro (statistics.h) — compound-key array bound.
  • STATS_SAMPLING_THRESHOLD / NUMBER_OF_SAMPLING_PAGES (statistics.h).
  • catalog_initialize (system_catalog.c).
  • catalog_finalize (system_catalog.c).
  • catalog_create (system_catalog.c) — first-time setup; called only by root.
  • catalog_destroy (system_catalog.c) — drop catalog.
  • catalog_reclaim_space (system_catalog.c) — compact fragmented catalog records.
  • catalog_get_class_info (system_catalog.c).
  • catalog_get_representation (system_catalog.c).
  • catalog_get_representation_directory (system_catalog.c).
  • catalog_get_last_representation_id (system_catalog.c).
  • catalog_get_class_info_from_record (system_catalog.c) — decode CLS_INFO from a heap record.
  • catalog_get_dir_oid_from_cache (system_catalog.c) — cache-aware lookup.
  • catalog_add_representation (system_catalog.c).
  • catalog_add_class_info (system_catalog.c).
  • catalog_update_class_info (system_catalog.c).
  • catalog_drop_old_representations (system_catalog.c).
  • catalog_insert / catalog_update / catalog_delete (system_catalog.c) — generic record-level ops.
  • catalog_start_access_with_dir_oid (system_catalog.c).
  • catalog_end_access_with_dir_oid (system_catalog.c).
  • catalog_rv_new_page_redo, catalog_rv_insert_redo / _undo, catalog_rv_delete_redo / _undo, catalog_rv_update, catalog_rv_ovf_page_logical_insert_undo (declared in system_catalog.h, defined in system_catalog.c).
  • catalog_get_cardinality (system_catalog.c).
  • catalog_get_cardinality_by_name (system_catalog.c).
  • catcls_compile_catalog_classes (catalog_class.c) — install-time schema build.
  • catcls_insert_catalog_classes (catalog_class.c).
  • catcls_update_catalog_classes (catalog_class.c).
  • catcls_delete_catalog_classes (catalog_class.c).
  • catcls_remove_entry (catalog_class.c).
  • catcls_get_server_compat_info (catalog_class.c) — charset / locale / timezone compatibility check at boot.
  • catcls_get_db_collation (catalog_class.c).
  • catcls_update_class_stats (catalog_class.c).
  • catcls_finalize_class_oid_to_oid_hash_table (catalog_class.c).
  • catcls_find_and_set_cached_class_oid (catalog_class.c).
  • boot_restart_server (boot_sr.c) — main boot entry.
  • The boot parameter record (boot_DB_parm) — disk-resident database parameters including catalog OIDs.
  • stats_get_statistics (statistics.h, defined in statistics_cl.c) — client-side fetch.
  • stats_dump / stats_ndv_dump (statistics.h) — debugging.
  • stats_make_select_list_for_ndv (statistics.h).
  • stats_get_ndv_by_query (statistics.h).
  • stats_adjust_sampling_weight (statistics.h, inline).

| Symbol | File | Line |
|---|---|---|
| CTID (struct) | system_catalog.h | 45 |
| DISK_REPR (struct) | system_catalog.h | 63 |
| DISK_ATTR (struct) | system_catalog.h | 80 |
| CLS_INFO (struct) | system_catalog.h | 96 |
| CATALOG_ACCESS_INFO (struct) | system_catalog.h | 106 |
| catalog_Id (extern global) | system_catalog.h | 153 |
| BTREE_STATS (struct) | statistics.h | 61 |
| ATTR_STATS (struct) | statistics.h | 82 |
| CLASS_STATS (struct) | statistics.h | 93 |
| stats_adjust_sampling_weight (inline) | statistics.h | 135 |
| catalog_initialize | system_catalog.c | 2577 |
| catalog_finalize | system_catalog.c | 2607 |
| catalog_get_class_info_from_record | system_catalog.c | 504 |
| catalog_initialize_max_space | system_catalog.c | 549 |
| catalog_initialize_new_page | system_catalog.c | 598 |
| catalog_add_representation | system_catalog.c | 2815 |
| catalog_add_class_info | system_catalog.c | 3029 |
| catalog_update_class_info | system_catalog.c | 3172 |
| catalog_get_class_info | system_catalog.c | 4113 |
| catcls_insert_catalog_classes | catalog_class.c | 4310 |
| boot_restart_server | boot_sr.c | 1969 |

  • Catalog source files are system_catalog.{c,h} and catalog_class.{c,h}, not catalog_manager.{c,h}. Verified by find -name 'catalog*'. The skeleton’s references: list named catalog_manager.{c,h} (which doesn’t exist); corrected at draft time. The misnomer comes from the vendor decks calling the subsystem “catalog manager” in prose.

  • The catalog identifier CTID is a triple of (vfid, xhid, hpgid). Verified by reading the CTID struct in system_catalog.h. The three correspond to a heap file (catalog records), an extendible-hash index (class_oid → dir_oid), and a fixed header page.

  • catalog_Id is a global, not per-thread state. Verified by the extern CTID catalog_Id declaration in system_catalog.h. The identifier is set once at boot (catalog_initialize) and never changes for the lifetime of the server.

  • DISK_REPR splits attributes into fixed and variable arrays, not a single ordered list. Verified by the disk_representation struct in system_catalog.h (separate n_fixed / fixed[] and n_variable / variable[] fields). Implication: row decoding reads fixed attributes by exact offset, variable attributes via the per-row offset table.

  • Statistics live inline on the attribute record, not in a separate file. Verified by the n_btstats, bt_stats, and ndv fields on DISK_ATTR in system_catalog.h. Cost: stats updates rewrite the attribute record. Benefit: optimizer reads schema and stats in one fetch.

  • CATALOG_ACCESS_INFO carries a debug-only is_systemop_started field. Verified by the NDEBUG-conditional field on the catalog_access_info struct in system_catalog.h. Its purpose is to assert that update-mode catalog accesses are properly bracketed in a system op.

  • The user-visible system classes are populated through the catcls_* family in catalog_class.c, separate from the internal catalog_* in system_catalog.c. Verified by reading catalog_class.h (the entire header is small — under ~50 lines) and grep-finding catcls_insert_catalog_classes in catalog_class.c. The two surfaces share the same transaction so DDL is atomic across both.

  • catcls_Enable is a global toggle for catalog-class maintenance. Verified by the extern bool catcls_Enable declaration in catalog_class.h. When false, the system classes aren’t kept in sync — used during installation and migration.

  • CLS_INFO::ci_time_stamp is the cache-validation token. Verified by the ci_time_stamp field on the cls_info struct in system_catalog.h. Optimizer caches CLS_INFO in process memory; cache invalidation compares stored and current timestamps.

  • Seven recovery functions handle catalog log records. Verified by the recovery-function declarations in system_catalog.h: catalog_rv_new_page_redo, catalog_rv_insert_redo, catalog_rv_insert_undo, catalog_rv_delete_redo, catalog_rv_delete_undo, catalog_rv_update, catalog_rv_ovf_page_logical_insert_undo. The last one is notable — overflow-page insertion has a logical undo (the redo would replay page allocation; logical undo de-allocates the page through the file manager).

  • Default sampling page count is 5000 with a sampling threshold of 5000. Verified by the sampling constants in statistics.h. STATS_SAMPLING_THRESHOLD = 5000 is the trial count; NUMBER_OF_SAMPLING_PAGES = 5000 is the page budget; EXPECTED_ROWS_PER_PAGE = 20 is the fan-out assumption.

  • The compound-key partial-cardinality array pkeys[] is sized at 8 by default. Verified by the BTREE_STATS_PKEYS_NUM = 8 macro in statistics.h. Compound indexes deeper than 8 columns lose per-prefix selectivity tracking past the 8th.

  1. Where exactly does the root class’s OID live on disk? The boot_DB_parm record holds boot parameters, but the root class’s OID is one specific field there. Investigation path: read boot_sr.c around line 1969 (boot_restart_server) and trace where it loads the root-class OID.

  2. Catalog overflow-page logical-undo discipline. catalog_rv_ovf_page_logical_insert_undo is logical, but what’s the sequence of file-manager calls during the undo path that ensures the overflow page returns cleanly? Investigation path: read its body and chase file_dealloc_page calls.

  3. Synchronisation between internal catalog and user system classes. A DDL must update both in lockstep. What prevents another reader from observing the internal catalog updated but _db_class not yet? Investigation path: trace lock acquisition order in DDL paths; check whether the catalog X-lock covers both faces.

  4. catcls_update_class_stats cadence. Stats updates flow through this function. Is it synchronous with the SQL UPDATE STATISTICS command, or is there a background sweep? Investigation path: grep callers; check for daemon registration.

  5. Catalog-class cache invalidation across servers in HA. On a slave server, when a master DDL is replayed, does the slave invalidate its catalog-class hash table? Investigation path: cubrid-cdc.md plus catcls_finalize_class_oid_to_oid_hash_table.

  6. catalog_reclaim_space cadence and triggers. Catalog compaction is presumably rare, but the trigger isn’t named in the header. Investigation path: grep for callers; check for use in boot_restart_server or a background daemon.

Beyond CUBRID — Comparative Designs and Research Frontiers

Pointers, not analysis.

  • PostgreSQL pg_class — single catalog table, accessed through normal heap+index machinery. Bootstrap via genbki.pl script + pg_*_d.h macros. CUBRID’s split design trades unification for a more compact internal record format the optimizer reads directly.

  • MySQL data dictionary (8.0+) — stored in InnoDB tables since 8.0; before that, table definitions lived in per-table .frm files. CUBRID’s split predates this and is conceptually closer to pre-8.0 MySQL (a binary structure separate from the SQL face).

  • Oracle’s bootstrap segment — single-row obj$ seed read at instance start. CUBRID’s boot_DB_parm plus root-class OID is the same idea with two anchors.

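  • Schema versioning by REPR_ID is loosely comparable to how PostgreSQL handles ALTER TABLE ... ADD COLUMN: both engines keep rows written before the change decodable, so an online ALTER TABLE need not rewrite every row immediately. The difference is in where the version lives: CUBRID stores it per row (REPR_ID in the row header), while PostgreSQL stores no representation id; each tuple header records how many attributes the tuple carries, and columns added later are filled in from pg_attribute (atthasmissing / attmissingval).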

  • InnoDB’s mysql.innodb_index_stats — separate table for per-index stats. Compared to CUBRID’s inline bt_stats[], this is heavier to query but lighter to update.

  • HyPer / Vectorwise compressed catalogs — research engines that compress the catalog structure for in-memory caching. CUBRID’s DISK_REPR is already compact; in-memory variants could collapse it further.

Raw analyses (raw/code-analysis/cubrid/storage/catalog_manager/)

  • 1._Catalog_Overview.pdf
  • 2._Root_Class.pdf
  • 3._System_Catalog_n_Statistics.pdf
  • 4._Catalog_Classes_n_boot_DB_parm.pdf
  • cls_info_rec.pptx
  • CUBRID Catalog Access.pptx
  • knowledge/code-analysis/cubrid/cubrid-heap-manager.md — heap files the catalog records live on.
  • knowledge/code-analysis/cubrid/cubrid-btree.md — BTREE_STATS consumers; index-stats source.
  • knowledge/code-analysis/cubrid/cubrid-recovery-manager.md — catalog catalog_rv_* functions in RV_fun[].
  • knowledge/code-analysis/cubrid/cubrid-log-manager.md — system-op bracket discipline DDL uses.
  • knowledge/code-analysis/cubrid/cubrid-cdc.md — DDL events surfaced from catalog mutations; in-progress in the same batch.

Textbook chapters (under knowledge/research/dbms-general/)

  • Database Internals (Petrov), Ch. 1 §“Database storage” (boot anchors), Ch. 7 §“Storage Engines” (catalog as metadata).

CUBRID source (/data/hgryoo/references/cubrid/)

  • src/storage/system_catalog.{c,h}
  • src/storage/catalog_class.{c,h}
  • src/storage/statistics.h, statistics_{cl,sr}.{c,h}
  • src/transaction/boot_sr.{c,h}
  • src/object/schema_system_catalog_install.cpp — install-time hard-coded schema for the system classes (CUBRID AGENTS.md §“Add info schema view”).