CUBRID unloaddb — Schema and Data Export, Four-File Output Layout, and the Per-Class Multi-Thread · Multi-Process Driver

A logical exporter is the database utility that walks the catalog and writes out a textual or near-textual representation of the schema and data, suitable for re-import by the engine’s loader into another database (same engine, possibly different version, or a fresh empty database for cross-version migration / DR drills / test-fixture builds). Every relational engine has one — pg_dump, mysqldump, expdp, mongodump — and every implementation makes the same set of design choices.

Three axes dominate:

  1. Schema-first vs. data-first ordering. Almost every loader needs the schema to exist before the data can be inserted, so exporters emit schema first and data second. The interesting question is what counts as schema: table definitions are obvious; views depend on tables and must come after; triggers reference tables (and may reference each other); indexes can be built from already-loaded data and are therefore deferable. The ordering choice — and which artefacts are deferred to a post-load pass — is the export’s contract with the loader.

  2. One file vs. many files. A single export file is simple to ship around and easy to replay (cat dump.sql | psql). Many files (one per phase, or one per class) parallelise both export and import, and let operators selectively reload pieces. The tradeoff is operational: many files mean an exporter process coordinates output streams, and an importer needs to know the filename convention.

  3. Snapshot semantics. What counts as “the database” at the moment of export? A naive read sees committed-but-then-changed rows differently for early-extracted vs. late-extracted classes. Engines with MVCC let the exporter pin a single snapshot (PostgreSQL pg_dump uses repeatable-read; CUBRID exposes the --latest-image toggle to choose between the last committed image vs. the snapshot at export start). Exporters that don’t pin a snapshot rely on the operator to suspend writes — which is sometimes acceptable for cold backups, never for online ones.

The unloaddb tool addresses all three with fixed choices for ordering (schema → triggers → indexes → data), a configurable split (one file per phase by default; one file per class as an option), and an explicit --latest-image snapshot toggle.

| Engine | Phase ordering | File layout | Concurrency | Snapshot |
|---|---|---|---|---|
| PostgreSQL pg_dump | data is bracketed by --section=pre-data (schema), --section=data, --section=post-data (indexes, FKs, triggers) | one file (text/custom/directory format); directory format → one file per relation | --jobs N (directory format only) — parallel per-table COPY | repeatable-read snapshot pinned at start |
| MySQL mysqldump | DDL inline before each INSERT | one file by default; --tab makes per-table .sql + .txt pairs | none in mysqldump itself; mysqlpump is the parallel variant | --single-transaction pins via REPEATABLE READ; otherwise no snapshot |
| Oracle expdp (Data Pump) | METADATA_ONLY / DATA_ONLY phases; full export emits both with metadata first inside the dumpfile | one or many .dmp dumpfiles in a directory object; format is binary | PARALLEL=N worker processes | FLASHBACK_TIME / FLASHBACK_SCN |
| MongoDB mongodump | metadata .metadata.json + BSON .bson per collection | one directory per database, one file pair per collection | implicit per-collection parallelism | none by default; --snapshot for collection-scan consistency |
| CUBRID unloaddb | schema → triggers → indexes → data (loaddb replays data before indexes) | four files (_schema/_trigger/_indexes/_objects) by default; --datafile-per-class adds one data file per class | --thread-count N (per-class fetch parallelism); --mt-process N/M (whole-process partition by class) | --latest-image toggle: latest committed vs. start-of-export |

CUBRID’s distinguishing choice is the two-axis concurrency: either thread-pool inside one process (--thread-count) for network-bound fetch parallelism, or multiple processes (--mt-process N/M) where each process exports a disjoint subset of the class list and writes to a per-process file prefix. The two modes are mutually exclusive at invocation time but address different bottlenecks — threads for network round-trip latency, processes for CPU-bound text serialisation.

A vanilla unloaddb mydb produces four files in the output directory:

| File | Contents | Loaded by |
|---|---|---|
| <prefix>_schema | CREATE CLASS ... for tables, CREATE VCLASS ... for views, in a topologically-correct order; column definitions, primary keys, owner / auth grants when do_auth = 1 | loaddb phase 1 (schema) |
| <prefix>_trigger | CREATE TRIGGER ... statements for every trigger; emitted after schema because triggers reference classes | loaddb phase 2 (triggers) |
| <prefix>_indexes | CREATE INDEX ... statements for every secondary index; emitted last so indexes can be built post-load over fully populated heaps (avoids the per-row index-maintenance cost during the data insert) | loaddb phase 4 (after data) |
| <prefix>_objects | The CUBRID object-loader format — one record per row, schema-aware text encoding | loaddb phase 3 (data) |

<prefix> defaults to the database name and is overridden by --output-prefix / -p. --output-path / -od selects the output directory.

The phase order in the export (schema → trigger → index → data) is not the load order. Loaddb consumes them as: schema first (must be present for any data), then triggers (must be present before data so AFTER triggers fire correctly during the load — unless triggers are explicitly disabled with --no-trigger), then data, then indexes (built over the now-populated heaps in one batch, much faster than maintaining them per-row).

The unloaddb function in unloaddb.c is the SA/CS dual entry loaded by the unified cubrid admin CLI (see cubrid-cub-admin.md §“The database-admin arm”). It takes a UTIL_FUNCTION_ARG * and unpacks ~25 options grouped into five families:

```c
// unloaddb — unloaddb.c
input_filename = utility_get_option_string_value (arg_map, UNLOAD_INPUT_CLASS_FILE_S, 0);
include_references = utility_get_option_bool_value (arg_map, UNLOAD_INCLUDE_REFERENCE_S);
required_class_only = utility_get_option_bool_value (arg_map, UNLOAD_INPUT_CLASS_ONLY_S);
datafile_per_class = utility_get_option_bool_value (arg_map, UNLOAD_DATAFILE_PER_CLASS_S);
est_size = utility_get_option_int_value (arg_map, UNLOAD_ESTIMATED_SIZE_S);
cached_pages = utility_get_option_int_value (arg_map, UNLOAD_CACHED_PAGES_S);
output_dirname = utility_get_option_string_value (arg_map, UNLOAD_OUTPUT_PATH_S, 0);
do_schema = utility_get_option_bool_value (arg_map, UNLOAD_SCHEMA_ONLY_S);
do_objects = utility_get_option_bool_value (arg_map, UNLOAD_DATA_ONLY_S);
latest_image_flag = utility_get_option_bool_value (arg_map, UNLOAD_LATEST_IMAGE_S);
output_prefix = utility_get_option_string_value (arg_map, UNLOAD_OUTPUT_PREFIX_S, 0);
hash_filename = utility_get_option_string_value (arg_map, UNLOAD_HASH_FILE_S, 0);
verbose_flag = utility_get_option_bool_value (arg_map, UNLOAD_VERBOSE_S);
order = utility_get_option_bool_value (arg_map, UNLOAD_KEEP_STORAGE_ORDER_S)
  ? FOLLOW_STORAGE_ORDER : FOLLOW_ATTRIBUTE_ORDER;
split_schema_files = utility_get_option_string_value (arg_map, UNLOAD_SPLIT_SCHEMA_FILES_S, 0);
is_as_dba = utility_get_option_string_value (arg_map, UNLOAD_AS_DBA_S, 0);
g_pre_alloc_varchar_size = utility_get_option_int_value (arg_map, UNLOAD_STRING_BUFFER_SIZE_S);
g_request_pages = utility_get_option_int_value (arg_map, UNLOAD_REQUEST_PAGES_S);
thread_count = utility_get_option_int_value (arg_map, UNLOAD_THREAD_COUNT_S);
sampling_records = utility_get_option_int_value (arg_map, UNLOAD_SAMPLING_TEST_S);
/* --mt-process N/M parsed as "process_idx/process_total" */
```

Five families:

  • What to extract. --schema-only / --data-only, --input-class-file, --input-class-only, --include-reference.
  • Where to put it. --output-path, --output-prefix, --datafile-per-class, --split-schema-files.
  • How to fetch. --cached-pages, --request-pages, --string-buffer-size, --latest-image, --keep-storage-order.
  • How to parallelise. --thread-count or --mt-process N/M.
  • Auth / verbose. --user, --password, --as-dba, --verbose, --enhanced-estimates, --sampling-test, --estimated-size, --hash-file.

After option parse and db_restart_ex, the unloaddb function runs in two halves gated by do_schema / do_objects:

```c
// unloaddb (paraphrased)
if (!status && (do_schema || !do_objects))      // schema half
  {
    if (g_parallel_process_cnt <= 1 || g_parallel_process_idx == 1)
      {
        create_filename_trigger (output_dirname, output_prefix,
                                 trigger_output_filename, ...);
        create_filename_indexes (output_dirname, output_prefix,
                                 indexes_output_filename, ...);
        unload_context.do_auth = 1;
        unload_context.storage_order = order;
        unload_context.exec_name = exec_name;
        unload_context.login_user = user;
        unload_context.output_prefix = output_prefix;
        extract_classes_to_file (unload_context);                           // _schema
        extract_triggers_to_file (unload_context, trigger_output_filename); // _trigger
        extract_indexes_to_file (unload_context, indexes_output_filename);  // _indexes
        unload_context.clear_schema_workspace ();
      }
  }

AU_SAVE_AND_ENABLE (au_save);
if (!status && (do_objects || !do_schema))      // data half
  {
    extract_objects (unload_context, output_dirname,
                     thread_count, sampling_records, enhanced_estimates);
  }
AU_RESTORE (au_save);
db_shutdown ();
```
Two important conditions:

  • Schema half runs only in process 1 of an --mt-process N/M partition. There’s only one _schema / _trigger / _indexes file regardless of process count; emitting it from every process would race the file system. Process 1 emits, processes 2..N skip and proceed straight to the data half.
  • Auth toggle around the data half. AU_SAVE_AND_ENABLE / AU_RESTORE brackets the data extraction so that even if the caller arrived with auth disabled (e.g., from an internal script), the actual row-by-row read enforces the connecting user’s authorisation. The schema half doesn’t need this because its emitter explicitly checks is_dba_user / is_dba_group_member via extract_context.

extract_context (declared in extract_schema.hpp) is the bag threaded through every emitter function:

| Field | Purpose |
|---|---|
| output_dirname | Where files go |
| output_prefix | Per-database filename prefix |
| exec_name | "unloaddb" — for error messages |
| login_user | The connecting user |
| is_dba_user / is_dba_group_member | Authorisation flags resolved at startup |
| do_auth | Whether to emit GRANT statements (1 = yes) |
| storage_order | FOLLOW_STORAGE_ORDER (column order on disk) vs. FOLLOW_ATTRIBUTE_ORDER (declaration order from catalog); affects column ordering in CREATE CLASS and the data file |
| classes | DB_OBJLIST of class objects to export — populated by extract_classes_to_file for the schema half, reused for triggers/indexes |
| has_indexes | Cached flag set during schema extraction |
| vclass_list_has_using_index | View-level USING INDEX tracker |
| clear_schema_workspace() | Frees the per-half workspace |

The struct’s role is to keep the emitter functions stateless across classes — every call passes the context, and the context carries all the policy.

Schema extraction (extract_classes_to_file → emit_schema)
extract_classes_to_file (defined in unload_schema.c:1378) opens <prefix>_schema, populates ctxt.classes from the catalog, emits per-user-class headers, and calls emit_schema(ctxt, output, EXTRACT_CLASS) to write tables — then a second call with EXTRACT_VCLASS to write views. The two-pass split is what enforces “tables before views”; if any view references a table not yet emitted, the second pass fails with an explicit error rather than producing a non-replayable file.

emit_schema itself iterates ctxt.classes and per class produces the full CREATE CLASS statement: column list with CHAR(N) / NUMERIC(p,s) / etc. domain renderers; constraints (PRIMARY KEY, UNIQUE, NOT NULL inline; FOREIGN KEY at end); inheritance / partition clauses; OWNER name and GRANT ... TO ... when do_auth. The storage_order setting picks between two column-iteration helpers: attribute-order is the catalog declaration order (matches the user’s mental model), storage-order matches the on-disk slot layout (matters for cross-platform binary-tier replication).

extract_triggers_to_file and extract_indexes_to_file follow the same pattern: open output, walk the catalog, emit CREATE TRIGGER / CREATE INDEX per object. Triggers are emitted with their condition expressions verbatim; indexes preserve INDEX hint and filtered-index WHERE clauses.

Data extraction (extract_objects → process_class)

extract_objects (defined in unload_object.c:699) is the data half’s entry. It:

  1. Re-walks the catalog (or the input class file if -i was given) to build the class list.
  2. If --mt-process N/M, partitions the class list — class i is handled by process (i mod M) + 1. The partition is over the class list, so each process touches a disjoint subset of tables; combined output is the union of per-process <prefix>_objects files (or per-prefix-and-class files when --datafile-per-class is also on).
  3. For each class assigned to this process, calls process_class.

process_class (line 1773) is the per-class loop. It:

  • Issues a server-side fetch for the class’s heap pages via db_get_all_objects / cursor APIs, batched by --cached-pages (default 100) and --request-pages (default 100).
  • For each fetched object, walks the class’s attribute list per storage_order and writes one record to the _objects file in the loader’s text format: %id|attr1=val1, attr2=val2, ... with type-specific encoders for OID references, set / sequence / multiset, JSON, LOB locators, MVCC visibility metadata.
  • If --datafile-per-class, the per-class output stream is a separate file <prefix>_<class>_objects; otherwise everything appends to the single _objects file with class headers separating sections.

When --thread-count > 1, process_class splits the per-class fetch across the configured number of worker threads — each thread fetches a disjoint range of OIDs and serialises to a per-thread buffer, then the buffers are concatenated under a class-level lock before being flushed to the output file. This parallelises the network round-trip, which is the dominant cost on remote-database exports.

--latest-image toggles between two visibility modes for the fetch:

  • Default (off). Fetch with the snapshot taken at the start of extract_objects — every class sees a consistent view of committed-as-of-start data.
  • --latest-image on. Fetch the most recent committed version of each row, regardless of when other classes were extracted. Faster (no snapshot tracking) but produces an export that may reflect transactions committed during the export run, which would surprise a downstream consumer expecting a coherent snapshot.

Class selection (-i input-class-file, -io input-class-only, -ir include-reference)

-i <file> restricts the export to a list of class names read from a file (one per line). -io (input-class-only) further restricts to only those classes — the default behaviour is to also pull in classes referenced from the listed ones (foreign keys, set element domains, view base classes), recursively, until closure.

-ir (include-reference) is the explicit form of the reference crawl — used when the user wants the reference-following behaviour without using -i at all (i.e., on a class list derived elsewhere).

required_class_only = true (set from -io) is checked inside extract_classes_to_file at the catalog walk: classes not in the input list are skipped even if they’re referenced from listed ones.

By default, unloaddb extracts only classes owned by the connecting user (or all classes if the user is DBA). --as-dba asks for cross-owner extraction — useful when a non-DBA admin wants to dump another user’s schema for debugging or migration. The flag is gated:

```c
// unloaddb (paraphrased)
if (is_as_dba)
  {
    unload_context.is_dba_group_member = au_is_dba_group_member (Au_user);
    if (!unload_context.is_dba_group_member)
      {
        fprintf (stderr, "\n--%s is an option available only when "
                 "the user is a DBA Group.\n", UNLOAD_AS_DBA_L);
        goto end;
      }
  }
else
  {
    unload_context.is_dba_user = ws_is_same_object (Au_dba_user, Au_user);
  }
```

A user not in the DBA group invoking --as-dba is rejected before any extraction; the option is therefore an explicit operator opt-in rather than something that silently degrades.

Estimation aids (--sampling-test, --enhanced-estimates)

--sampling-test N instructs extract_objects to fetch only the first N records per class — used for export-pipeline timing and disk-budget estimation without paying the full extraction cost. The normal _objects output is produced; downstream tooling that just wants byte counts can use this and divide by class cardinality to estimate full-export size.

--enhanced-estimates (only meaningful with --verbose) adds fine-grained timing breakdowns per class: catalog read time, per-row serialisation time, write time, lock time. It costs ~5% overhead and is gated behind --verbose because the extra rows in the report would otherwise pollute clean export logs.

When the schema half produces tens or hundreds of CREATE / GRANT statements, --split-schema-files writes one schema file per class instead of one big <prefix>_schema. Useful for downstream tooling (diffs, PR review, partial reload) that wants to operate on per-class units. The trigger and index files are not split.

--mt-process N/M means “this process is partition N of M total, doing the data half only”. Examples:

  • unloaddb --mt-process 1/4 mydb — process 1 of 4; emits the schema half (because it’s process 1) and the data half for classes assigned to partition 1.
  • unloaddb --mt-process 2/4 mydb — process 2 of 4; skips the schema half (only process 1 emits it); data half for classes assigned to partition 2.

Each process gets its own <prefix>_objects (or per-class if --datafile-per-class) named with the partition index in the prefix. The downstream loaddb concatenates them or loads them sequentially.

The per-class assignment is deterministic on the class list order returned by the catalog walk: class index i goes to partition (i mod M) + 1. This means a class added to the catalog between two unloaddb runs may end up in a different partition the second time — operators should not rely on per-partition class membership across runs.

The MAX_PROCESS_COUNT = 36 bound is hard-coded; the --thread-count bound is MAX_THREAD_COUNT = 127. The two flags are mutually exclusive at invocation time — --thread-count > 1 combined with --mt-process N/M (M > 1) is rejected.

Driver (unloaddb.c):

| Symbol | Role |
|---|---|
| unloaddb | Library entry; option parse; auth setup; orchestrates schema-half then data-half; calls db_shutdown at end |
| unload_usage | Prints usage from message catalog |
| do_multi_processing (gated by MULTI_PROCESSING_UNLOADDB_WITH_FORK) | Optional fork-spawn of N children when in fork-mode multi-process; not the default code path |
| output_prefix, output_dirname, etc. (file-scope globals) | Carry the option values into helpers that don't take the context |

Schema emitters (unload_schema.c):

| Symbol | Role |
|---|---|
| extract_classes_to_file | Opens <prefix>_schema; walks catalog; calls emit_schema(EXTRACT_CLASS) then emit_schema(EXTRACT_VCLASS) |
| emit_schema | Per-class emitter; renders CREATE CLASS / CREATE VCLASS with columns, constraints, owner, grants |
| extract_triggers_to_file | Opens <prefix>_trigger; per-trigger CREATE TRIGGER emit |
| extract_indexes_to_file | Opens <prefix>_indexes; calls emit_indexes |
| emit_indexes | Per-class index iteration; emits CREATE INDEX with hint / filtered clauses preserved |
| Per-domain renderers (emit_domain, emit_attribute_def, emit_constraint_def, emit_partition_def, etc.) | Type-specific emitters used by emit_schema |

Data extraction (unload_object.c):

| Symbol | Role |
|---|---|
| extract_objects | Data-half entry; partitions class list per --mt-process; loops over classes calling process_class |
| process_class | Per-class fetch + serialise loop; honours --cached-pages, --request-pages, --thread-count, --latest-image |
| Per-DB_TYPE writers (object/set/sequence/multiset/JSON/OID/LOB) | Type-specific text encoders into the loader format |

Output-file plumbing (unload_object_file.{c,h})

| Symbol | Role |
|---|---|
| Output-file open / close / flush helpers | Used by the data-half writers; abstracts per-thread buffering and the eventual write to <prefix>_objects (or the per-class file) |
| Filename builders (create_filename_*) | <prefix>_schema, <prefix>_trigger, <prefix>_indexes, <prefix>_objects, plus the per-class variants when --datafile-per-class |

Context struct (extract_schema.{cpp,hpp}):

| Symbol | Role |
|---|---|
| extract_context (struct) | The bag threaded through every emitter; fields described above |
| extract_context::clear_schema_workspace | Frees ctxt.classes and per-half scratch state |
| Symbol | Path |
|---|---|
| unloaddb (entry) | src/executables/unloaddb.c:115 |
| unload_usage | src/executables/unloaddb.c:99 |
| Option-unpack block | src/executables/unloaddb.c:144–227 |
| Schema-half driver | src/executables/unloaddb.c:404–455 |
| Data-half driver | src/executables/unloaddb.c:457–492 |
| extract_classes_to_file | src/executables/unload_schema.c:1378 |
| extract_triggers_to_file | src/executables/unload_schema.c:1534 |
| extract_indexes_to_file | src/executables/unload_schema.c:1606 |
| emit_schema | src/executables/unload_schema.c:1708 |
| emit_indexes | src/executables/unload_schema.c:1671 |
| extract_objects | src/executables/unload_object.c:699 |
| process_class | src/executables/unload_object.c:1773 |
| process_class declaration | src/executables/unload_object.c:394 |
| extract_context (struct) | src/executables/extract_schema.hpp |

Symbol names are the canonical anchor; line numbers are hints valid as of the document's updated: date.

  • Loaddb’s load order is not unloaddb’s emit order. unloaddb emits schema → trigger → index → data; loaddb consumes schema → trigger → data → index. Index creation is deferred at load time to avoid per-row maintenance, which is why the index file is third in the unload but fourth in the load.
  • Schema-only and data-only are not strictly inverse predicates. --schema-only runs the schema half and skips the data half; --data-only runs the data half and skips the schema half; neither set means both run. The default is both. Expressing schema-and-trigger-but-no-index isn’t directly available — the operator would have to redirect or split files post-hoc.
  • --mt-process is the only multi-process mode actually used. An older MULTI_PROCESSING_UNLOADDB_WITH_FORK code path exists in the source (gated by a build define) that did fork-spawning inside one invocation; it’s not in the default build. The modern path is multiple invocations of the binary with different --mt-process N/M values, each producing per-partition output, coordinated by an outside script.
  • --use-delimiter is deprecated. Still parsed (the value is read and discarded) for backward compatibility with older scripts; has no effect.
  • Owner / grant emission is gated on do_auth = 1 which is hard-coded in the schema-half driver. There’s no flag to suppress authorization in the schema export — the schema file always includes OWNER and GRANT clauses if the connecting user can see them.
  • --latest-image and --mt-process interact unintuitively. Without --latest-image, each process pins its own snapshot, so partitions can see different snapshot-of-database states unless scheduled to start within the same wall-clock instant. With --latest-image, every process sees latest committed; coherence is the operator’s problem either way for multi-process exports.
  • storage_order = FOLLOW_STORAGE_ORDER changes both the schema column declaration order and the data file’s attribute-list order — the loader reads columns positionally, so the two must agree. If they disagree (e.g., from a manually edited schema file), loaddb mis-assigns values silently. This is why the flag is rarely used in practice — it’s there for cross-platform binary-tier replication where on-disk order matters and the schema file is regenerated.
  • AU_SAVE_AND_ENABLE around the data half uses the snapshot of Au_user taken at startup, not whatever Au_user happens to be at half-boundary. A schema half that ran with --as-dba doesn’t re-elevate the data half — both halves run with the same Au_user, just with the data half’s authorisation force-enabled.
  • Per-class data-file naming with --mt-process + --datafile-per-class. The collision rules between the partition-index-prefixed naming and the per-class-suffixed naming aren’t documented; in practice the output is <prefix>_<partidx>_<class>_objects but the filename builders aren’t called out in AGENTS.md.
  • Server-side cost of --thread-count. The client opens N parallel cursors, and each cursor consumes server resources (a thread-pool worker per fetch). If the server's cubrid.conf thread_count is lower than --thread-count, the export is bottlenecked at server capacity; the client does not warn about this interaction.
  • Resumability. A failed export leaves a partial _objects file; there’s no resume mode. Operators rerun from scratch. Some engines (Oracle Data Pump) support RESUME jobs; CUBRID does not.
  • Compression. No built-in compression; the operator pipes the output through gzip post-hoc. mongodump and pg_dump’s custom format both support inline compression.
  • Cross-version compatibility of _objects format. The loaddb format is documented as stable across CUBRID versions but the exact compatibility matrix (especially around new data types like JSON Tables, TDE-encrypted classes) isn’t recorded.
  • src/executables/unloaddb.c — entry, option parse, two-half orchestration, multi-process partition logic
  • src/executables/unloaddb.h — option-string constants, shared types
  • src/executables/unload_schema.c — schema / trigger / index emitters; per-domain and per-constraint renderers
  • src/executables/unload_object.c — data extraction driver and per-class loop
  • src/executables/unload_object_file.{c,h} — output-file plumbing, per-thread buffering, filename builders
  • src/executables/extract_schema.{cpp,hpp} — extract_context struct definition and helpers
  • src/executables/AGENTS.md — agent guide; binary→source mapping
  • Adjacent docs: cubrid-loaddb.md (the inverse — loaddb is the consumer of every file unloaddb produces and its phase order is what dictates the unload’s emit order), cubrid-cub-admin.md (the unified cubrid admin entry that loads the unloaddb symbol; the SA/CS routing per --SA-mode / --CS-mode), cubrid-csql.md (the --loaddb-output mode that lets csql emit data in the same loader format unloaddb produces, useful for ad-hoc partial exports without invoking unloaddb at all), cubrid-class-object.md (the catalog walk that schema extraction relies on), cubrid-locator.md (the OID materialisation that data extraction depends on), cubrid-mvcc.md (the snapshot semantics underlying --latest-image)