CUBRID loaddb — Bulk Loader, Direct-Path Heap+B+Tree Insert, and Post-Load Statistics Rebuild

A bulk loader is the part of a database engine whose job is to take N rows that originate outside the system — a flat file dumped from another database, an export from a previous version of the same database, an ETL pipeline output — and make them visible to readers as fast as the storage layer physically allows. The defining choice that separates a “bulk loader” from “INSERT in a loop” is whether the loader is willing to bypass the per-row machinery that the engine runs for ordinary INSERT … VALUES (…). Petrov’s Database Internals (Ch. 4 “Implementing B-Trees” and Ch. 5 “Transaction Processing”) frames this in three terms — write amplification, index maintenance, and constraint checking — and argues that an INSERT-loop loader pays the full per-row tax for all three, whereas a direct-path loader is allowed to amortize them across the entire batch.

The write-amplification axis is the most direct. An ordinary INSERT that lands in a transactionally logged engine produces, at minimum, one heap-record write, one WAL physiological-log record per modified page, one index-leaf write per index, and one WAL physiological-log record per index leaf. With k indexes, the amplification factor is roughly 1 + 2k page touches per row. A bulk loader can drop this in two complementary ways:

  1. Page-level logging instead of record-level logging. If the loader is the only writer to a page (which is true if the class is exclusively locked for the duration of the load), a single “page-image redo” record can cover an entire freshly built page regardless of how many records it holds. Hundreds of rows now share one log record.

  2. Index-build deferral. Instead of inserting into each B+Tree as each row arrives — which generates near-random page touches across k trees — the loader can collect all (key, OID) pairs, sort them externally, and bulk-build each B+Tree bottom-up. The sort bounds memory; the bottom-up build touches each leaf page exactly once and writes it sequentially.
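
The two shortcuts can be compared with back-of-envelope arithmetic on log-record counts. The sketch below is illustrative only, not CUBRID code; `rows_per_page` and `keys_per_leaf` are assumed parameters.

```cpp
#include <cassert>

// Illustrative cost model (not CUBRID code): WAL records emitted when
// loading n rows into a table with k indexes.

// Per-row logging: one heap redo per row plus one index redo per index.
long long per_row_log_records (long long n, int k)
{
  return n * (1 + k);
}

// Page-level logging with deferred index build: one page-image redo per
// filled heap page, plus each bulk-built B+Tree leaf logged once.
long long page_level_log_records (long long n, int k,
                                  int rows_per_page, int keys_per_leaf)
{
  long long heap_pages = (n + rows_per_page - 1) / rows_per_page;
  long long leaf_pages = (long long) k * ((n + keys_per_leaf - 1) / keys_per_leaf);
  return heap_pages + leaf_pages;
}
```

With one million rows, two indexes, 100 rows per heap page, and 200 keys per leaf, the per-row schedule emits 3,000,000 log records against 20,000 for the page-level schedule — the orders-of-magnitude gap that motivates the design.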

The index-maintenance axis interacts with the constraint-checking axis. A unique B+Tree must reject duplicate keys; if the index is incrementally maintained as the loader runs, every row pays the duplicate-check cost online and the engine still has to acquire key-range locks to be safe against concurrent inserters. If the index is built after the load, the duplicate check happens during the sort (adjacent equal keys signal a violation) and no key-range locks are needed because no other transaction can be touching the half-loaded table — provided the load held an exclusive class-level lock.
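
A minimal sketch of that sort-time duplicate check, using plain ints in place of (key, OID) pairs:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Deferred uniqueness check (illustrative, not CUBRID's btree code):
// sort the collected keys, then one pass over adjacent entries finds
// any duplicate. Returns true when all keys are unique.
bool unique_after_sort (std::vector<int> keys)
{
  std::sort (keys.begin (), keys.end ());
  return std::adjacent_find (keys.begin (), keys.end ()) == keys.end ();
}
```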

The constraint-checking axis is broader: foreign-key, NOT NULL, CHECK, trigger firing, etc. A bulk loader almost always defers or disables foreign-key checks (CUBRID sets locator_Dont_check_foreign_key = true in SA mode) and always disables trigger firing (CUBRID calls db_disable_trigger () before loading). The cost model is: a deferred foreign-key check is an anti-join executed once over the loaded data versus a per-row probe; deferred is cheaper above a few thousand rows.
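
The deferred check can be pictured as a hash anti-join, sketched here with assumed int keys (the real check would probe the referenced table's index):

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Deferred FK check as an anti-join (sketch): build a hash set over the
// referenced table's keys once, then probe each loaded row's FK value.
// Returns the FK values with no match — the would-be violations.
std::vector<int> fk_violations (const std::vector<int> &loaded_fks,
                                const std::vector<int> &referenced_keys)
{
  std::unordered_set<int> ref (referenced_keys.begin (), referenced_keys.end ());
  std::vector<int> missing;
  for (int fk : loaded_fks)
    if (ref.find (fk) == ref.end ())
      missing.push_back (fk);
  return missing;
}
```

The set is built once per load rather than probed per insert, which is where the "cheaper above a few thousand rows" crossover comes from.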

The fourth textbook concern is parallel partitioning. If the input file is large but the table is unpartitioned, the loader can still parallelise by batching the input file and assigning batches to worker threads. The catch is that batches share the same heap file and the same indexes, so the workers must coordinate on (end-of-batch → commit, start-of-next-batch → continue) ordering to preserve the OID-monotonicity guarantee that some loaders advertise and to keep WAL recovery deterministic. CUBRID’s worker pool model (see load_worker_manager.cpp) takes exactly this batching approach, ordered commits and all.

The post-load statistics rebuild is the last textbook ingredient. Cost-based optimisers depend on column histograms, distinct-value counts, and per-class row counts; a fresh load invalidates all three. Running ordinary UPDATE STATISTICS after the load is therefore not optional. CUBRID’s loaddb makes this a built-in step: the SA loader calls sm_update_statistics per class and the CS loader calls loaddb_update_stats, which in turn drives xstats_update_statistics on the server. Without this step, the optimiser would plan against empty statistics and pick catastrophic plans for the first several queries against the freshly loaded table.

This document tracks how CUBRID’s loaddb realises all four pieces — direct-path heap insert under a Bulk-Update class lock, deferred / batched commit, deferred index loading via a separate -i index file, and post-load statistics rebuild — across the files in src/loaddb/ and the storage primitives they call.

The textbook gives the model; this section names the engineering conventions that almost every row-oriented engine adopts when shipping a production bulk loader. CUBRID’s specific choices, detailed in the “CUBRID’s Approach” section below, are best read as one set of dials within this shared design space.

Every loader needs an unambiguous on-disk syntax. The conventions in the wild are:

  • CSV with header line, optionally with explicit column-list and delimiter override (PostgreSQL COPY FROM, MySQL LOAD DATA INFILE, SQL Server BCP). Cheap to produce, no schema in the file itself.
  • Binary table-dump format, table-aware and portable across versions of the same engine (Oracle SQL*Loader’s direct-path, PostgreSQL pg_dump custom format, MySQL mysqldump --tab’s .txt/.sql pair). Faster to parse than text but engine-specific.
  • CUBRID-format object file, which is a hybrid: SQL-style schema statements followed by %class / %id directives that name the target table and assign it a numeric class-id, then per-instance lines whose tokens are typed via a small lexer (string, integer, timestamp, \KRW for monetary, {…} for sets, @oid for object references, + line continuation, etc.). The format originates in the unloader (unloaddb) and is symmetric to it.

An “INSERT-style” loader threads each row through:

parser → name-resolution → semantic-check → XASL-gen → executor →
locator_insert_force per row → trigger fire → constraint check

A direct-path loader collapses that to:

load-file lexer → load-file parser → DB_VALUE per attribute →
record_descriptor → locator_multi_insert_force on a vector →
(optional) deferred index entry → batch commit

The savings come from never invoking the SQL parser, never building an XASL, never going through the per-row optimiser, and from passing a vector of records to the storage layer so it can pack multiple records onto one heap page and emit one WAL record for the whole page. Postgres’s pg_bulkload extension, Oracle’s SQL*Loader direct path, and CUBRID’s loaddb all sit on this side of the line; PostgreSQL’s built-in COPY FROM sits in between (it bypasses the optimiser but goes through the executor’s ExecInsert).

Drop-and-rebuild vs incremental index maintenance

When the loader is given a table with k secondary indexes, it has two schedules:

  1. Incremental maintenance. Every inserted heap record produces k index-leaf writes inline. Simple to implement (the heap insert already drives locator_attribute_info_force which calls btree_insert), but the index pages are touched in random key order, so the buffer pool churns and the WAL grows by k records per row.

  2. Drop-and-rebuild. The loader drops or never-creates the secondary indexes, loads the heap, then rebuilds each index bottom-up via external sort + sequential leaf-page write. The cost is one external sort per index plus one sequential scan of the heap; the benefit is dense, well-ordered leaf pages and a B+Tree whose root-to-leaf paths are minimal.
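
The bottom-up build’s page count is easy to model: each level holds ceil(previous-level entries / fanout) sequentially written pages. A sketch under an assumed uniform fanout (the real btree_load_index packs by byte capacity, not entry count):

```cpp
#include <cassert>
#include <vector>

// Pages written per level by a bottom-up build of n sorted keys,
// leaf level first. Assumed model with a fixed fanout per page.
std::vector<long long> bottom_up_pages (long long n, int fanout)
{
  std::vector<long long> levels;
  long long entries = n;
  do
    {
      long long pages = (entries + fanout - 1) / fanout;
      levels.push_back (pages);
      entries = pages;
    }
  while (entries > 1);
  return levels;
}
```

One million keys at fanout 100 give 10,000 leaf pages, 100 internal pages, and one root — each touched exactly once, in key order.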

CUBRID’s loaddb supports the second schedule by separating the input into a schema file (-s / --schema-file), an object file (-d / --data-file), and an index file (-i / --index-file). The unloader emits the indexes into the index file as ordinary CREATE INDEX statements, and the loader executes them after the heap load via ldr_exec_query_from_file. The build itself is then the engine’s normal index-create path, which uses btree_load_index + external_sort for sorted bottom-up construction.

Three patterns are visible across engines:

  • Single-process, multi-threaded batches (CUBRID CS-mode loaddb, pg_bulkload). One reader splits the file into batches; a worker pool consumes batches; commits are ordered.
  • Multi-process, partition-aware (Oracle SQL*Loader’s parallel-direct option, Greenplum gpload). Each worker writes to its own segment / partition. Cheap parallelism but only available when the table is partitioned.
  • External sharding then per-shard load (Vertica, ClickHouse). The work is offloaded to whatever sharded the data; the engine itself runs N sequential bulk loads.

CUBRID is in the first camp. Workers share one heap file under a single transaction-per-batch model: the session holds a Bulk-Update (BU_LOCK) lock on the class for the entire load, each worker opens a sub-transaction (logtb_assign_tran_index), inserts its batch, and commits in batch-id order (session::wait_for_previous_batch).

For the post-load statistics rebuild, three patterns recur:

  • Run as a separate command afterward (PostgreSQL ANALYZE, MySQL ANALYZE TABLE, Oracle DBMS_STATS.GATHER_TABLE_STATS). Cheap to implement, easy to forget.
  • Implicit on commit (some auto-analyze daemons trigger when the row count delta crosses a threshold).
  • Built into the loader (CUBRID loaddb, Oracle SQL*Loader’s STATISTICS clause). The loader knows exactly which classes were touched and runs the analysis itself.

CUBRID picks the third: SA mode calls sm_update_statistics per class in ldr_update_statistics; CS mode calls loaddb_update_stats which fetches the class-OID list from the load session and the client iterates with stats_update_statistics(STATS_WITH_SAMPLING) per class.

Loaders measured in millions of rows must survive partial failure. The conventions are:

  • Periodic commit. Every N rows (--periodic-commit N / commit-period) the loader commits, recording the line number reached. Restart skips ahead to that line. CUBRID defaults PERIODIC_COMMIT_DEFAULT_VALUE = 10240.
  • Error-control file. A list of error codes the loader is allowed to ignore (CUBRID --error-control-file); a row that triggers one of them is logged and skipped instead of aborting the batch.
  • Syntax-only mode. A dry run that parses the file but does no inserts (CUBRID --check-only / args.syntax_check); used to validate a freshly produced unload file before committing the database to it.
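
The restart contract of periodic commit reduces to “skip ahead to the recorded line”. A toy sketch of the skip, assuming the committed line number was recorded at the last checkpoint:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Restart sketch: consume input lines up to and including the last
// committed line, then return the first line that must be re-processed.
// (Illustrative; the real loader records the line reached per commit.)
std::string first_line_after_restart (std::istream &in, int last_committed_line)
{
  std::string line;
  for (int i = 0; i < last_committed_line; i++)
    std::getline (in, line);    // already durable, skip
  std::getline (in, line);      // first uncommitted line
  return line;
}
```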

CUBRID supports all three. The combination of periodic-commit and syntax-check is precisely what production ETL operators want: a syntax-only dry run first, then the real load with a checkpoint every 10K rows.

CUBRID realises bulk-load as a two-process pipeline in CS mode and a single-process direct-path in SA mode. The two share the same load-file grammar (load_grammar.yy + load_lexer.l), the same batching algorithm (cubload::split in load_common.cpp), and the same cubload::driver mediator class. They diverge at the object_loader / class_installer interfaces (load_common.hpp): SA mode uses sa_object_loader (a wrapper around the legacy ldr_* callbacks in load_sa_loader.cpp) while CS mode uses server_object_loader (the direct-path implementation in load_server_loader.cpp).

flowchart TD
    subgraph CLIENT["loaddb client process (load_db.c)"]
        A1["main: loaddb_user → loaddb_internal"] --> A2["ldr_validate_object_file<br/>· get_loaddb_args"]
        A2 --> A3["db_login + db_restart<br/>(connect to server)"]
        A3 --> A4["ldr_load_schema_file<br/>(if -s)"]
        A4 --> A5{SA_MODE?}
        A5 -- yes --> SA["ldr_sa_load<br/>load_sa_loader.cpp"]
        A5 -- no --> CS["ldr_server_load<br/>· split + loaddb_load_batch"]
        SA --> A6["ldr_exec_query_from_file<br/>(if -i index file)"]
        CS --> A6
        A6 --> A7["sm_update_catalog_statistics"]
        A7 --> A8["db_shutdown"]
    end
    subgraph SERVER["cub_server process (CS mode only)"]
        B1["sloaddb_init<br/>→ new cubload::session"]
        B2["sloaddb_install_class<br/>→ session::install_class"]
        B3["sloaddb_load_batch<br/>→ session::load_batch<br/>→ worker_manager_try_task"]
        B4["load_task::execute<br/>→ driver::parse<br/>→ server_object_loader<br/>→ locator_multi_insert_force"]
        B5["sloaddb_update_stats<br/>→ xstats_update_statistics"]
        B6["sloaddb_destroy"]
    end
    CS -.NET_SERVER_LD_INIT.-> B1
    CS -.NET_SERVER_LD_INSTALL_CLASS.-> B2
    CS -.NET_SERVER_LD_LOAD_BATCH.-> B3
    B3 --> B4
    A7 -.NET_SERVER_LD_UPDATE_STATS.-> B5
    A8 -.NET_SERVER_LD_DESTROY.-> B6

The CUBRID load file is a line-oriented text file that mixes three kinds of lines:

%id mytable 1
%class mytable (id, name, salary, dept)
1 'Alice' 50000.00 @dept|2
2 'Bob' 60000.00 NULL

  • A %id <class-name> <numeric-id> line names a class and assigns it a numeric class-id used in the rest of the batch. It is parsed by the id_command rule in load_grammar.yy and dispatched to class_installer::check_class.
  • A %class <class-name> ( <attr-list> ) line is identical in function but additionally lists the attributes-in-order. It maps to the class_command rule and class_installer::install_class.
  • Every other non-blank line is an instance line: optional <int>: OID prefix followed by space-separated typed constants. This maps to the instance_line rule and dispatches to object_loader::start_line then process_line.

The lexer (load_lexer.l) recognises typed constants by literal shape: [+\-]?[0-9]+ is INT_LIT; the larger regex with [Ee] and optional [fFlL] is REAL_LIT; [0-9]+/[0-9]+/[0-9]+ is DATE_LIT2; [0-9]+:[0-9]+(:[0-9]+)? covers six varieties of TIME_LIT; '…' is the SQL string body (state <SQS>); "…" is either a delimited identifier (state <DELIMITED_ID>) or a double-quoted string (state <DQS>) depending on m_semantic_helper.in_instance_line (). Currency symbols get their own tokens — \$ → DOLLAR_SYMBOL, \\KRW → WON_SYMBOL, \\EUR → EURO_SYMBOL, etc. — because monetary is a first-class DB type. The + line-continuation marker is handled both in the lexer ('\+[ \t]*\r?\n[ \t]*\' inside <SQS> glues two SQS strings together) and in the splitter (ends_with (line, "+") in cubload::split keeps the row in one_row_buffer for the next iteration).
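
The shape-based classification can be mimicked with a few regexes. These patterns are simplified stand-ins for the rules in load_lexer.l, not copies of them:

```cpp
#include <cassert>
#include <regex>
#include <string>

// Classify a token by literal shape, in the same spirit as the lexer:
// integers, reals, dates, and times are recognised purely by pattern.
std::string classify (const std::string &tok)
{
  static const std::regex int_lit ("[+-]?[0-9]+");
  static const std::regex real_lit ("[+-]?[0-9]*\\.[0-9]+([Ee][+-]?[0-9]+)?[fFlL]?");
  static const std::regex date_lit ("[0-9]+/[0-9]+/[0-9]+");
  static const std::regex time_lit ("[0-9]+:[0-9]+(:[0-9]+)?");

  if (std::regex_match (tok, int_lit))  return "INT_LIT";
  if (std::regex_match (tok, real_lit)) return "REAL_LIT";
  if (std::regex_match (tok, date_lit)) return "DATE_LIT2";
  if (std::regex_match (tok, time_lit)) return "TIME_LIT";
  return "STRING_OR_OTHER";
}
```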

Command-line flags map onto cubload::load_args in load_common.hpp and are unpacked by get_loaddb_args in load_db.c:

| Flag (long / short) | Field on load_args | Default |
| --- | --- | --- |
| --user / -u | user_name | AU_PUBLIC_USER_NAME |
| --password / -p | password | empty (prompt on ER_AU_INVALID_PASSWORD) |
| --schema-file / -s | schema_file | empty |
| --index-file / -i | index_file | empty |
| --data-file / -d | object_file | empty |
| --check-only / -c | syntax_check | false |
| --load-only / -l | load_only | false |
| --periodic-commit | periodic_commit | 10240 (PERIODIC_COMMIT_DEFAULT_VALUE) |
| --no-statistics | disable_statistics | false |
| --ignore-logging | ignore_logging | false |
| --error-control-file | error_file | empty |
| --ignore-classes | ignore_class_file | empty |
| --table / -t | table_name | empty (load entire file) |
| --CS-mode / -C | cs_mode | false |
| --no-user-specified-name | no_user_specified_name | false |

The struct is cubpacking::packable_object-derived because in CS mode the entire load_args is shipped to the server in the sloaddb_init request body.

The lexer is a Flex C++ scanner generated from load_lexer.l. The grammar is a Bison LALR(1) C++ parser generated from load_grammar.yy (skeleton lalr1.cc, cubload namespace, class parser). The two are stitched together by the cubload::driver class (load_driver.cpp):

// driver::parse — load_driver.cpp
int
driver::parse (std::istream &iss, int line_offset)
{
  m_scanner->switch_streams (&iss);
  m_scanner->set_lineno (line_offset + 1);
  m_semantic_helper.reset_after_batch ();

  assert (m_class_installer != NULL && m_object_loader != NULL);

  parser parser (*this);
  return parser.parse ();
}

The driver owns a cubload::scanner, a cubload::class_installer, a cubload::object_loader, an error_handler, and a semantic_helper. The parser’s actions (in load_grammar.yy) call into m_driver.get_class_installer () for %class and %id directives, into m_driver.get_object_loader () for instance lines, and into m_driver.get_semantic_helper () for constructing the typed constant_type values out of raw lexer strings. The scanner is generated in-place (%option yyclass="cubload::scanner", load_scanner.hpp) so that lexer states see the live driver and its semantic_helper.

The semantic_helper (load_semantic_helper.hpp) is the only component that allocates significant memory inside the parser: it maintains a string_pool of 1024 reusable string_type slots, a constant_pool of 1024 constant_type slots, a qstr_buf_pool of 512 × 32 KiB string-body buffers, and a fallback cubmem::extensible_block for strings that overflow the pool. The pool is reset between batches (reset_after_batch) and between rows (reset_after_line), so a single parse generates effectively zero malloc traffic on the hot path.
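
The pool discipline is the classic claim/reset pattern. A miniature version with int slots standing in for the string_type / constant_type slots:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bump-pool sketch of the semantic_helper idea: slots are claimed per
// row and recycled by resetting an index, so steady-state parsing does
// no allocation. (Illustrative; the real pools hold parser objects.)
class slot_pool
{
  std::vector<int> m_slots;
  std::size_t m_used = 0;

public:
  explicit slot_pool (std::size_t n) : m_slots (n) {}

  // Hand out the next free slot, or nullptr when the pool is exhausted
  // (the real helper falls back to an extensible_block at that point).
  int *claim ()
  {
    return m_used < m_slots.size () ? &m_slots[m_used++] : nullptr;
  }

  void reset_after_line () { m_used = 0; }   // recycle: no free/malloc
  std::size_t used () const { return m_used; }
};
```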

cubload::split in load_common.cpp is the I/O front-end: it opens the object file, walks it line by line, accumulates lines into a batch_buffer, and flushes the buffer when one of three conditions trips:

  1. Class boundary. A line beginning with %class or %CLASS (the splitter checks the prefix textually before the lexer ever sees it). The current batch is flushed to b_handler, the class-id counter is incremented, and the new %class or %id line is sent alone to c_handler.
  2. Row count. The accumulated row count reaches args.periodic_commit (default 10240). The buffer is flushed to b_handler and a new batch starts.
  3. Buffer size. The buffer would exceed LOADDB_BUFFER_SIZE_LIMIT (2 GiB - 1 KiB). The splitter trims back to the last complete row, flushes, and continues with the leftover row in one_row_buffer.

Two cross-line concerns complicate the splitter:

  • Line-continuation. A line ending in + means “row continues on the next line”. The splitter holds such lines in one_row_buffer until it sees a line that does not end in +, then concatenates them and counts the result as one row.
  • Open string literals. Single quotes inside a row may span multiple physical lines. The splitter tracks single_quote_checker (a 0/1 toggle XOR-ed on every ') and refuses to flush while a quote is still open — even if the row count condition would otherwise have flushed.
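
The two concerns combine into a single "is this row complete?" decision per physical line. A simplified model of the splitter's state — quote-parity XOR plus the trailing-+ check — that ignores escapes and other details of the real code:

```cpp
#include <cassert>
#include <string>

// Splitter-state sketch: decide whether a physical line completes a row.
// quote toggles 0/1 on every single quote, like single_quote_checker;
// a row never flushes while a literal is open or the line ends in '+'.
struct split_state
{
  int quote = 0;   // 0 = quotes balanced, 1 = a string literal is open

  bool row_complete (const std::string &line)
  {
    for (char c : line)
      if (c == '\'')
        quote ^= 1;

    if (quote != 0)
      return false;                    // open literal: hold the line
    return line.empty () || line.back () != '+';   // '+' = continuation
  }
};
```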

Once the batch is full, the splitter calls handle_batch, which constructs a cubload::batch (with auto-incremented batch_id, the current class_id, the contents, the starting line offset, and the row count) and invokes b_handler (batch). The two handlers are lambdas in load_db.c::load_object_file:

// load_object_file — load_db.c:1510
batch_handler b_handler = [&] (const batch &batch) -> int
{
  bool use_temp_batch = false;
  bool is_batch_accepted = false;
  int error_code;
  do
    {
      load_status status;
      error_code = loaddb_load_batch (batch, use_temp_batch, is_batch_accepted, status);
      if (error_code != NO_ERROR)
	{
	  return error_code;
	}
      use_temp_batch = true;   // don't re-upload while retrying
      print_stats (status.get_load_stats (), *args, exit_status);
    }
  while (!is_batch_accepted);
  return error_code;
};

class_handler c_handler = [] (const batch &batch, bool &is_ignored) -> int
{
  std::string class_name;
  int error_code = loaddb_install_class (batch, is_ignored, class_name);
  if (error_code == NO_ERROR && !is_ignored && !class_name.empty ())
    {
      error_code = load_has_authorization (class_name, AU_INSERT);
    }
  return error_code;
};

The retry loop is essential. loaddb_load_batch may return with is_batch_accepted == false if the server’s worker pool is full; the client side keeps the already-shipped batch buffer in m_temp_task on the server side and the next call passes use_temp_batch = true so the network does not re-ship the bytes.

The client-side stub is loaddb_load_batch in network_interface_cl.c. It sends a NET_SERVER_LD_LOAD_BATCH request whose body is a packed cubload::batch. The server demultiplexes via network_sr.c:

// network_sr.c — request-table excerpt
req_p->processing_function = sloaddb_init; // NET_SERVER_LD_INIT
req_p->processing_function = sloaddb_install_class; // NET_SERVER_LD_INSTALL_CLASS
req_p->processing_function = sloaddb_load_batch; // NET_SERVER_LD_LOAD_BATCH
req_p->processing_function = sloaddb_fetch_status; // NET_SERVER_LD_FETCH_STATUS
req_p->processing_function = sloaddb_destroy; // NET_SERVER_LD_DESTROY
req_p->processing_function = sloaddb_interrupt; // NET_SERVER_LD_INTERRUPT
req_p->processing_function = sloaddb_update_stats; // NET_SERVER_LD_UPDATE_STATS

The server-side handlers (network_interface_sr.cpp) are thin: they unpack the request, look up the per-connection cubload::session via session_get_load_session, call the matching method (session::install_class, session::load_batch, session::fetch_status, etc.), pack the reply, and send it. The session is stored on the client’s session_state struct so that a single TCP connection sees a single cubload::session for the lifetime of the load.

sequenceDiagram
    participant L as loaddb client
    participant S as cub_server
    participant W as worker pool
    L->>S: NET_SERVER_LD_INIT (load_args)
    S->>S: new cubload::session(args)<br/>worker_manager_register_session
    S-->>L: NO_ERROR
    L->>S: NET_SERVER_LD_INSTALL_CLASS (%class line)
    S->>S: server_class_installer<br/>locate_class + BU_LOCK<br/>register_class_with_attributes
    S-->>L: class_name + is_ignored
    loop per batch (size = periodic_commit)
        L->>S: NET_SERVER_LD_LOAD_BATCH (batch bytes)
        S->>W: worker_manager_try_task(load_task)
        W->>W: logtb_assign_tran_index<br/>driver::parse<br/>server_object_loader::flush_records<br/>locator_multi_insert_force<br/>wait_for_previous_batch<br/>xtran_server_commit
        S-->>L: status + is_batch_accepted
    end
    L->>S: NET_SERVER_LD_UPDATE_STATS
    S->>S: enumerate class_registry<br/>pack OIDs back
    S-->>L: vector<OID>
    L->>L: stats_update_statistics(STATS_WITH_SAMPLING) per class
    L->>S: NET_SERVER_LD_DESTROY
    S->>S: session.wait_for_completion<br/>worker_manager_unregister_session
    S-->>L: NO_ERROR

When the splitter encounters a %class or %id line, it calls c_handler, which on CS mode lands in session::install_class → invoke_parser → server_class_installer::install_class. The installer’s job is to:

  1. Lower-case the class name (to_lowercase_identifier → intl_identifier_lower).
  2. Check whether the class is in the user’s --ignore-classes set (is_class_ignored), or is one of the legacy GLO classes (IS_OLD_GLO_CLASS); if so, register an is_ignored = true class_entry and return early.
  3. Look up the class by name with xlocator_find_class_oid requesting BU_LOCK (server_class_installer::locate_class).
  4. If found, fetch the class record, walk its attribute representation, optionally filter / reorder by the explicit attribute list from the %class line, and register a class_entry in the session’s class_registry.

The BU_LOCK request is the load’s most consequential decision. BU_LOCK is defined in lock_table.h as the Bulk-Update Lock, sitting between IX_LOCK and X_LOCK in the hierarchy. It is compatible only with itself (so two concurrent loaders can run against the same class), but incompatible with every other mode on the same resource (X_LOCK, S_LOCK, IS_LOCK, SIX_LOCK — see the compatibility matrix in lock_table.c), so it blocks every ordinary reader and writer. Crucially, holding BU_LOCK is what allows the heap manager to take the page-level log shortcut in locator_multi_insert_force:

// locator_multi_insert_force — locator_sr.c:13779
bool has_BU_lock = lock_has_lock_on_object (class_oid, oid_Root_class_oid, BU_LOCK);
// ...
heap_max_page_size = heap_nonheader_page_capacity () * (1.0f - PRM_HF_UNFILL_FACTOR);
// fill a fresh page with as many records as fit, then:
pgbuf_log_redo_new_page (thread_p, home_hint_p.pgptr, DB_PAGESIZE, PAGE_HEAP);

That pgbuf_log_redo_new_page is the bulk-load WAL shortcut: one log record covers the entire freshly stamped page regardless of how many heap records it holds. Without BU_LOCK, the engine would have to fall back to per-record physical-undo logging because some other transaction might have a partial view of the same page.

The CS-mode session (load_session.cpp) does not parse batches in the network handler. Instead, session::load_batch constructs a cubload::load_task, hands it to worker_manager_try_task, and returns to the client immediately:

// session::load_batch — load_session.cpp
task = new load_task (*batch, *this, *thread_ref.conn_entry);
auto pred = [&] () -> bool
{
  is_batch_accepted = worker_manager_try_task (task);
  if (is_batch_accepted)
    {
      ++m_active_task_count;
    }
  else if (!use_temp_batch)
    {
      m_temp_task = task;        // server keeps the batch
      use_temp_batch = true;     // client should skip re-shipping
    }
  return !m_collected_stats.empty () || is_batch_accepted;
};

The worker pool itself (load_worker_manager.cpp) is a single process-global pool sized by PRM_ID_LOADDB_WORKER_COUNT:

// REGISTER_WORKERPOOL — load_worker_manager.cpp:106
REGISTER_WORKERPOOL (loaddb, [] ()
{
  return prm_get_integer_value (PRM_ID_LOADDB_WORKER_COUNT);
});

The pool is shared across all concurrent load sessions in the server. Each worker thread has a cubload::driver lazily attached via worker_entry_manager::on_create (claimed from a resource_shared_pool<driver>); the driver is reset (driver::clear) in on_retire and given back to the pool. The cubthread::worker_pool_task_capper rate-limits how many tasks one session can have in flight.

A load_task::execute (load_session.cpp:120) is the body of the worker:

// load_task::execute — load_session.cpp
void execute (cubthread::entry &thread_ref) final
{
  if (m_session.is_failed ())
    {
      return;
    }
  thread_ref.conn_entry = &m_conn_entry;
  driver *driver = thread_ref.m_loaddb_driver;
  init_driver (driver, m_session);

  const class_entry *cls_entry =
    m_session.get_class_registry ().get_class_entry (m_batch.get_class_id ());
  // ...
  logtb_assign_tran_index (&thread_ref, NULL_TRANID, TRAN_ACTIVE, NULL, NULL,
			   TRAN_LOCK_INFINITE_WAIT, TRAN_DEFAULT_ISOLATION_LEVEL ());
  int tran_index = thread_ref.tran_index;
  m_session.register_tran_start (tran_index);

  // copy session client ids onto worker tdes
  LOG_TDES *session_tdes = log_Gl.trantable.all_tdes[m_conn_entry.get_tran_index ()];
  LOG_TDES *worker_tdes = log_Gl.trantable.all_tdes[tran_index];
  worker_tdes->client.set_ids (session_tdes->client);

  bool parser_result = invoke_parser (driver, m_batch);
  std::size_t rows_number = driver->get_object_loader ().get_rows_number ();
  driver->clear ();

  if (m_session.is_failed ()
      || (!is_syntax_check_only && (!parser_result || er_has_error ())))
    {
      m_session.fail ();
      xtran_server_abort (&thread_ref);
    }
  else
    {
      m_session.wait_for_previous_batch (m_batch.get_id ());
      xtran_server_commit (&thread_ref, false);
      m_session.stats_update_rows_committed (rows_number);
      m_session.stats_update_last_committed_line (line_no + 1);
    }
  notify_done_and_tran_end (tran_index);
}

The two important invariants are:

  • Each worker runs in its own transaction (logtb_assign_tran_index → xtran_server_commit per batch). The BU_LOCK on the class is inherited from the outer session because all workers register their transaction with the same session, and the session’s transaction acquired the lock at install_class time.
  • Commits are batch-id-ordered via wait_for_previous_batch (m_batch.get_id ()). Batch N cannot commit until batch N-1 has committed. This ensures that the WAL’s apparent commit order matches the file’s line order, which is important for HA replication and crash recovery.
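
The ordering invariant can be modelled without threads: a sequencer buffers out-of-order completions and releases commits strictly by batch id. A single-threaded sketch of the guarantee wait_for_previous_batch provides (the real code uses a condition variable):

```cpp
#include <cassert>
#include <map>
#include <vector>

// Ordered-commit sketch: workers finish in any order, but commits are
// released strictly in batch-id order.
class commit_sequencer
{
  int m_next = 1;               // next batch id allowed to commit
  std::map<int, bool> m_done;   // finished but not yet committed

public:
  // Report batch `id` finished; return the ids that may now commit,
  // in commit order.
  std::vector<int> finish (int id)
  {
    m_done[id] = true;
    std::vector<int> committed;
    while (!m_done.empty () && m_done.begin ()->first == m_next)
      {
	committed.push_back (m_next);
	m_done.erase (m_done.begin ());
	m_next++;
      }
    return committed;
  }
};
```

Batch 2 finishing first commits nothing; when batch 1 arrives, both are released together, in order.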

Inside the worker, invoke_parser runs the Bison parser against the batch’s content string. The parser reduces each instance line to a constant_type * linked list and calls server_object_loader::process_line(cons):

// server_object_loader::process_line — load_server_loader.cpp:632
for (constant_type *c = cons; c != NULL; c = c->next, attr_index++)
  {
    const attribute &attr = m_class_entry->get_attribute (attr_index);
    int error_code = process_constant (c, attr);
    if (error_code != NO_ERROR)
      {
	m_error_handler.on_syntax_failure ();
	return;
      }
    db_value &db_val = get_attribute_db_value (attr_index);
    error_code = heap_attrinfo_set (&m_class_entry->get_class_oid (),
				    attr.get_repr ().id, &db_val, &m_attrinfo);
    // ...
  }

process_constant dispatches by LDR_* type code from the lexer (process_generic_constant for everything that is not a collection or monetary, process_collection_constant for {…} set literals, process_monetary_constant for currency values). Generic conversion goes through the conv_func matrix in load_db_value_converter.cpp, indexed by [DB_TYPE][LDR_TYPE]. The matrix is initialised once at process start (init_setters) so the per-row dispatch is a single 2-D array lookup.
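
The dispatch pattern is a function-pointer matrix initialised once at startup. A two-by-two miniature with hypothetical type codes and converters (the real matrix spans all DB_TYPE × LDR_TYPE pairs):

```cpp
#include <cassert>
#include <string>

// Hypothetical mini-version of the conv_func matrix, indexed by
// [db_type][ldr_type] so per-row conversion is one 2-D array lookup.
enum db_type  { DB_INT, DB_STRING, DB_TYPE_COUNT };
enum ldr_type { LDR_INT, LDR_STR, LDR_TYPE_COUNT };

using conv_func = int (*) (const std::string &);

static int to_int (const std::string &s)  { return std::stoi (s); }
static int str_len (const std::string &s) { return (int) s.size (); }

static conv_func conv_matrix[DB_TYPE_COUNT][LDR_TYPE_COUNT];

// Filled once at process start, like init_setters.
void init_setters ()
{
  conv_matrix[DB_INT][LDR_INT] = to_int;
  conv_matrix[DB_INT][LDR_STR] = to_int;       // parse digits from a string
  conv_matrix[DB_STRING][LDR_STR] = str_len;   // stand-in for a checked copy
}

int convert (db_type d, ldr_type l, const std::string &raw)
{
  return conv_matrix[d][l] (raw);   // single lookup on the hot path
}
```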

After process_line succeeds, the grammar action calls finish_line, which serialises the in-memory db_value array into a record_descriptor via heap_attrinfo_transform_to_disk_except_lob and pushes it onto the worker’s m_recdes_collected vector:

// server_object_loader::finish_line — load_server_loader.cpp:689
record_descriptor new_recdes (cubmem::STANDARD_BLOCK_ALLOCATOR);
RECDES *old_recdes = NULL;
if (heap_attrinfo_transform_to_disk_except_lob (m_thread_ref, &m_attrinfo,
						old_recdes, &new_recdes) != S_SUCCESS)
  {
    m_error_handler.on_failure ();
    return;
  }
if (!m_error_handler.current_line_has_error ())
  {
    m_recdes_collected.push_back (std::move (new_recdes));
  }

The crucial point is that records accumulate. They are not inserted one-by-one. At end-of-batch, the grammar’s start rule (loader_start in load_grammar.yy) calls m_driver.get_object_loader ().flush_records (), which dispatches the entire vector to the storage layer:

// server_object_loader::flush_records — load_server_loader.cpp:728
log_sysop_start (m_thread_ref);
int error_code = locator_multi_insert_force (
    m_thread_ref, &m_scancache.node.hfid, &m_scancache.node.class_oid,
    m_recdes_collected, /*has_index=*/true, /*op_type=*/MULTI_ROW_INSERT,
    &m_scancache, &force_count, /*pruning_type=*/0, NULL, NULL,
    UPDATE_INPLACE_NONE, /*dont_check_fk=*/true);
if (error_code == NO_ERROR)
  {
    log_sysop_attach_to_outer (m_thread_ref);
    m_rows += m_recdes_collected.size ();
  }

locator_multi_insert_force (locator_sr.c:13779) packs the records onto fresh heap pages using heap_alloc_new_page → heap_insert_logical(in_place) and emits one pgbuf_log_redo_new_page log record per filled page — the bulk-write shortcut described above. Indexes are still maintained inline (has_index=true) because CUBRID does not yet bulk-build secondary indexes for primary-key tables; instead, the heap insert path uses the same btree_insert it would for a per-row INSERT, just amortised across the page-fix lifetime of one heap page.
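
The page-packing arithmetic behind "one redo per filled page" is straightforward. A sketch with assumed page, record, and slot sizes (the real capacity comes from heap_nonheader_page_capacity and PRM_HF_UNFILL_FACTOR):

```cpp
#include <cassert>

// How many fixed-size records fit on one heap page after the unfill
// factor is reserved — which is also how many rows share a single
// page-image redo record. Assumed sizes, illustrative only.
int records_per_page (int page_size, double unfill_factor,
                      int record_size, int slot_size)
{
  int usable = (int) (page_size * (1.0 - unfill_factor));
  return usable / (record_size + slot_size);   // slot directory overhead
}
```

With a 16 KiB page, a 10% unfill factor, 128-byte records, and 4-byte slot entries, 111 records share one redo record instead of emitting 111 of them.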

There is one fallback path: when HA is not disabled, the “page-image-redo” log record cannot be replicated (the slave needs record-level LSAs per row). The loader detects this with HA_DISABLED () and falls back to a per-record loop calling locator_insert_force with log_sysop_start / log_sysop_attach_to_outer per row:

// flush_records HA path — load_server_loader.cpp:766
if (insert_errors_filtered || !HA_DISABLED ())
  {
    for (size_t i = 0; i < m_recdes_collected.size (); i++)
      {
	log_sysop_start (m_thread_ref);
	RECDES local_record = m_recdes_collected[i].get_recdes ();
	int error_code = locator_insert_force (m_thread_ref, &hfid, &class_oid,
					       &dummy_oid, &local_record,
					       /*has_index=*/true, op_type,
					       &m_scancache, &force_count,
					       /*pruning=*/0, NULL, NULL,
					       UPDATE_INPLACE_NONE, NULL,
					       has_BU_lock, /*dont_check_fk=*/true,
					       /*ignore_serializable=*/false);
	// ... per-record commit-or-skip with log_sysop_attach/abort
      }
  }

So CUBRID’s “direct path” is really page-batched, not log-free: it batches log records at the page granularity rather than the row granularity, but it still emits one redo per page. That choice preserves crash recovery and (in non-HA mode) replication correctness while still beating per-row WAL by one to two orders of magnitude.

```mermaid
flowchart LR
    subgraph WORKER["load_task::execute (worker thread)"]
      P1[bison parse batch] --> P2[per-row<br/>process_line]
      P2 --> P3[heap_attrinfo_set]
      P3 --> P4[finish_line]
      P4 --> P5[heap_attrinfo_transform_to_disk_except_lob]
      P5 --> P6[m_recdes_collected.push_back]
      P6 -.next row.-> P2
      P6 --> P7[end-of-batch:<br/>flush_records]
    end
    subgraph FLUSH["flush_records branches"]
      P7 --> Q1{HA disabled<br/>and no error filter?}
      Q1 -- yes --> Q2[locator_multi_insert_force<br/>page-image redo log]
      Q1 -- no --> Q3[per-record<br/>locator_insert_force loop]
    end
    Q2 --> R1[xtran_server_commit]
    Q3 --> R1
    R1 --> R2[wait_for_previous_batch<br/>then commit]
```

Foreign-key checks are explicitly disabled in two ways: SA mode sets locator_Dont_check_foreign_key = true directly, and CS mode passes dont_check_fk = true on every call to locator_(multi_)insert_force. FK violations introduced by the load are therefore not caught at load time; they surface only when the table is next referenced.

Unique-constraint checks survive the load but are deferred. As the worker inserts rows, the heap manager calls btree_insert per index entry; on an existing leaf the unique optimisation simply records a per-class index_stats count (m_scancache.m_index_stats). At end-of-load, when server_object_loader::stop_scancache runs, it walks m_scancache.m_index_stats->get_map() and asserts uniqueness; if any index reports !is_unique(), it raises BTREE_SET_UNIQUE_VIOLATION_ERROR and fails the session:

```cpp
// stop_scancache — load_server_loader.cpp:1097
if (m_scancache.m_index_stats != NULL) {
  for (const auto &it : m_scancache.m_index_stats->get_map ()) {
    if (!it.second.is_unique ()) {
      BTREE_SET_UNIQUE_VIOLATION_ERROR (
          thread_get_thread_entry_info (), NULL, NULL,
          &m_class_entry->get_class_oid (), &it.first, NULL);
      m_error_handler.on_failure ();
      break;
    }
    int error = logtb_tran_update_unique_stats (
        thread_get_thread_entry_info (),
        it.first, it.second, true);
  }
}
```

Triggers are unconditionally disabled by db_disable_trigger () in loaddb_internal before any data file is opened.

NOT NULL is enforced inline: the conversion functions in load_db_value_converter.cpp raise ER_OBJ_ATTRIBUTE_CANT_BE_NULL if a LDR_NULL constant lands on a not-null attribute, and the class-installer (register_class_with_attributes) walks the attributes that the %class line omits and refuses to install if any of them are is_notnull.

Indexes are not built by the data-file loader. Instead, the unloader emits each index as a SQL CREATE INDEX statement into a separate index file, and loaddb_internal runs it after the data file:

```cpp
// loaddb_internal — load_db.c
if (index_file != NULL) {
  print_log_msg (1, "\nStart index loading.\n");
  // ldr_exec_query_from_file parses & executes statements
  if (ldr_exec_query_from_file (args.index_file.c_str (),
                                index_file,
                                &index_file_start_line, &args) != NO_ERROR) {
    // print error, restart hint, abort
  }
  sm_update_catalog_statistics (CT_INDEX_NAME, STATS_WITH_FULLSCAN);
  sm_update_catalog_statistics (CT_INDEXKEY_NAME, STATS_WITH_FULLSCAN);
  db_commit_transaction ();
}
```

Each CREATE INDEX runs through the engine’s normal index-create path: btree_load_index plus external_sort for sorted bottom-up B+Tree construction. Because the heap is already populated when the index file runs, the build sees the entire row set in one pass, exactly the drop-and-rebuild schedule described under Common DBMS Design. The loaddb-specific tuning is LOAD_INDEX_MIN_SORT_BUFFER_PAGES (8192): if PRM_ID_SR_NBUFFERS is below that, loaddb force-overrides it via sysprm_set_force so the external sort has at least 64 MiB of working memory.

```mermaid
flowchart LR
    A[unloaddb produces<br/>schema.sql / data.obj / index.sql] --> B
    B[loaddb -s schema.sql] --> C
    C[loaddb -d data.obj<br/>direct-path heap insert<br/>w/ BU_LOCK + page-image redo] --> D
    D[loaddb -i index.sql<br/>CREATE INDEX per line<br/>btree_load_index + external_sort] --> E
    E[loaddb_update_stats /<br/>sm_update_statistics<br/>per loaded class] --> F
    F[db_commit + db_shutdown]
```

Triggers and stored procedures live in their own file (-t / --trigger-file). Like the index file, it is executed via ldr_exec_query_from_file after the data load. Unlike the index file, it has no special sort-buffer override; the statements are just compiled and executed one at a time, with periodic_commit applied to the executed-statement count.

In CS mode, post-load statistics live entirely outside the bulk-load worker pool. After wait_for_completion reports the load finished, load_db.c::ldr_server_load decides whether to refresh:

```cpp
// ldr_server_load — load_db.c
if (!load_interrupted && !status.is_load_failed ()
    && !args->syntax_check && error_code == NO_ERROR
    && !args->disable_statistics)
  {
    error_code = loaddb_update_stats (args->verbose);
    // print stats and a final fetch
  }
```

loaddb_update_stats (network_interface_cl.c:10717) is unusual: it sends a NET_SERVER_LD_UPDATE_STATS request, the server responds with the list of class OIDs that the load touched (drawn from session.get_class_registry ()), and the client iterates those OIDs calling stats_update_statistics(STATS_WITH_SAMPLING) for each. The actual histogram build runs server-side under xstats_update_statistics in statistics_sr.c. Doing it as a per-class iteration on the client side rather than a single bulk server call is deliberate: it lets the client print per-class progress (LOADDB_MSG_CLASS_TITLE) and respect the user’s verbose preference.

In SA mode, ldr_update_statistics (load_sa_loader.cpp:6627) walks the Classes linked list (the SA loader’s own class registry) and calls sm_update_statistics(class_, STATS_WITH_SAMPLING) directly.

Both paths default to sampling (STATS_WITH_SAMPLING), not a full scan. A full scan is reserved for the post-index step, where the catalog tables db_index and db_index_key have their own statistics rebuilt with STATS_WITH_FULLSCAN.

Errors at three layers:

  1. Lexer / parser errors raise LOADDB_MSG_SYNTAX_ERR via error_handler::on_error, which in CS mode appends to the session’s m_stats.error_message. The client polls loaddb_fetch_status every 100 ms (in ldr_server_load’s do-while loop) and prints any new error_message lines as they arrive.
  2. Per-row insertion errors (type conversion, NOT NULL, FK violation, …) are reported by error_handler::on_failure. If the error code is in args.m_ignored_errors, the row is skipped and the batch continues; otherwise the batch’s transaction is aborted and the session is marked failed.
  3. Session-level interrupt (Ctrl-C, signal). The client’s register_signal_handlers installs a handler that sets load_interrupted = true and calls loaddb_interrupt, which ships a NET_SERVER_LD_INTERRUPT to the server. The server’s session::interrupt walks the session’s set of active transaction indexes and sets each one’s interrupt flag via logtb_set_tran_index_interrupt, then marks the session failed.

Restart is line-based. Every committed batch updates stats.last_committed_line with the file offset of the last successfully committed row. On interrupt, the client prints LOADDB_MSG_LAST_COMMITTED_LINE so the user can rerun loaddb with a :line suffix on the data file. The line-skip is performed by ldr_get_start_line_no parsing the :N suffix off the file path and ldr_exec_query_from_file (or cubload::split after a fresh parse) jumping ahead.

Syntax check (--check-only) runs the entire pipeline up to but not including the heap insert. server_object_loader::flush_records short-circuits on args.syntax_check and asserts the recdes vector is empty; process_line only counts rows. All errors are still collected and printed, but the user gets a clean DB at the end.

Parallelism is batch-level, not row-level. Each batch runs end-to-end on one worker thread. With PRM_ID_LOADDB_WORKER_COUNT = 8 and periodic_commit = 10240, eight workers can be running batches of 10K rows each concurrently, all under the same BU_LOCK. The client’s batch retry loop provides back-pressure: if the worker pool is saturated, worker_manager_try_task returns false, the server returns is_batch_accepted = false, and the client polls again.

CUBRID does not have an explicit “parallel partition load” mode in loaddb: a partitioned class’s load file contains rows for all partitions intermixed, and the inserter relies on the engine’s normal partition-pruning machinery. The pruning_type parameter to locator_multi_insert_force is set to 0 (DB_NOT_PARTITIONED_CLASS) in flush_records, which means each row is dispatched to its correct partition lazily by the heap manager.

```mermaid
flowchart TD
    SPLIT["cubload::split (client side)"]
    SPLIT --> B1["batch 1: rows 1..10240<br/>class_id 1"]
    SPLIT --> B2["batch 2: rows 10241..20480<br/>class_id 1"]
    SPLIT --> B3["batch 3: rows 20481..30720<br/>class_id 1"]
    SPLIT --> B4["batch 4: rows 30721..40960<br/>class_id 1"]
    B1 --> WP{worker_pool size<br/>= PRM_LOADDB_WORKER_COUNT}
    B2 --> WP
    B3 --> WP
    B4 --> WP
    WP --> W1[worker 1<br/>tran_index = T1<br/>commit ordered]
    WP --> W2[worker 2<br/>tran_index = T2<br/>commit ordered]
    WP --> W3[worker 3<br/>tran_index = T3<br/>commit ordered]
    W1 -. wait_for_previous_batch .-> W2
    W2 -. wait_for_previous_batch .-> W3
    W3 --> COMMIT[xtran_server_commit<br/>ordered by batch_id]
```

This section lists the stable symbol names used in the prose above. The file/line references are valid as of the document’s updated: date.

| Symbol | Role |
| --- | --- |
| loaddb_user | Public utility entry: forwards to loaddb_internal(arg, 0) |
| loaddb_internal | Argument validation, login, schema/data/index/trigger driver |
| get_loaddb_args | Maps UTIL_ARG_MAP to cubload::load_args |
| ldr_validate_object_file | Sanity-checks the args (volume, files, HA-mode constraints) |
| ldr_check_file | Opens a file and returns FILE * with error code |
| ldr_get_start_line_no | Parses :N suffix from file path |
| register_signal_handlers | Installs SIGINT/SIGQUIT handler that sets load_interrupted = true |
| Symbol | Role |
| --- | --- |
| ldr_server_load | CS-mode driver: loaddb_init, load_object_file, loaddb_update_stats, loaddb_destroy |
| ldr_sa_load | SA-mode driver: ldr_init, ldr_Driver->parse, ldr_update_statistics |
| load_object_file | Constructs b_handler/c_handler lambdas, calls cubload::split |
| cubload::split | Line-by-line splitter: detects %class/%id, batches, line-continuation, single-quote tracking |
| cubload::handle_batch | Wraps a batch_buffer into a cubload::batch and calls b_handler |
| append_incomplete_row | Buffer-overflow handler that flushes and continues |
| Symbol | Role |
| --- | --- |
| cubload::driver | Mediator: scanner + class_installer + object_loader + error_handler + semantic_helper |
| driver::initialize | Hands ownership of the four components to the driver |
| driver::parse | Switches the scanner stream and runs the Bison parser |
| cubload::scanner | Flex C++ scanner generated from load_lexer.l |
| cubload::parser | Bison C++ parser generated from load_grammar.yy |
| cubload::semantic_helper | Pool allocator for string_type / constant_type / qstr_buf |
| make_string_by_buffer / make_string_by_yytext | Allocate string_type from pool |
| make_constant / make_real / make_monetary_constant | Allocate constant_type from pool |
| reset_after_line / reset_after_batch | Reset pool indices |
| Symbol | Role |
| --- | --- |
| cubload::load_args | Packable struct of CLI flags |
| cubload::batch | Packable struct: batch_id, class_id, content, line offset, row count |
| cubload::stats | Packable struct: rows_committed, current_line, last_committed_line, rows_failed, error/log messages |
| cubload::load_status | Packable struct: client_type, completed, failed, stats vector |
| cubload::data_type | LDR_NULL / INT / STR / NUMERIC / DOUBLE / FLOAT / OID / DATE / TIME / TIMESTAMP / DATETIME / COLLECTION / MONETARY / BSTR / XSTR / JSON … |
| cubload::class_installer | Pure-virtual interface for class registration |
| cubload::object_loader | Pure-virtual interface for row insertion |
| Symbol | Role |
| --- | --- |
| cubload::session | Per-connection load state on the server |
| session::install_class | Register a class via the parser |
| session::load_batch | Enqueue a load_task on the worker pool, with retry semantics |
| session::wait_for_previous_batch | Order commits by batch_id |
| session::wait_for_completion | Used by sloaddb_destroy |
| session::fail / session::interrupt | Mark session failed, set tran-index interrupt flags |
| session::stats_update_* | Atomically update the live stats |
| session::fetch_status | Snapshot the stats and recently collected per-batch stats |
| init_driver | Lazy-init the worker’s driver with server_class_installer + server_object_loader |
| invoke_parser | Run the Bison parser against the batch content |
| cubload::class_registry | class_id → class_entry map; mutex-protected |
| cubload::class_entry | Resolved class: OID + name + ordered attributes + is_ignored |
| cubload::attribute | Resolved attribute: name + index + or_attribute * repr |
| Symbol | Role |
| --- | --- |
| server_class_installer | Implements class_installer for the server |
| server_class_installer::locate_class | xlocator_find_class_oid with BU_LOCK |
| server_class_installer::locate_class_for_all_users | Schema lookup across all users (legacy 11.2 compat) |
| server_class_installer::register_class_with_attributes | Build the class_entry from the class’s or_attribute[] |
| server_object_loader | Implements object_loader for direct-path heap insert |
| server_object_loader::init | start_scancache + start_attrinfo; assert BU_LOCK held |
| server_object_loader::process_line | Per-row: type-convert constants + heap_attrinfo_set |
| server_object_loader::finish_line | heap_attrinfo_transform_to_disk_except_lob → m_recdes_collected.push_back |
| server_object_loader::flush_records | End-of-batch: locator_multi_insert_force (or per-record loop in HA) |
| server_object_loader::stop_scancache | Validate unique index stats, raise BTREE_SET_UNIQUE_VIOLATION_ERROR if any |
| server_object_loader::process_constant / process_generic_constant / process_monetary_constant / process_collection_constant | Type-dispatched constant conversion |
| server_object_loader::clear_db_values | Reset m_db_values and m_attrinfo between rows |
| Symbol | Role |
| --- | --- |
| cubload::load_task | cubthread::entry_task running invoke_parser + commit |
| load_task::execute | Worker body: assign tran, parse, flush, ordered commit, notify |
| worker_entry_manager | cubthread::entry_manager that claims/retires cubload::driver instances |
| worker_manager_register_session / worker_manager_unregister_session | Add/remove session from the global active set |
| worker_manager_try_task | Hands a task to the pool via worker_pool_task_capper |
| worker_manager_stop_all | Used at shutdown to interrupt every active session |
| Symbol | Role |
| --- | --- |
| cubload::conv_func | int (*) (const char *, size_t, const attribute *, db_value *) |
| cubload::get_conv_func | 2-D lookup setters[db_type][ldr_type] |
| cubload::init_setters | One-shot population of the matrix |
| to_db_int / to_db_bigint / to_db_string / to_db_date / to_db_monetary / to_db_json | Per-(db_type,ldr_type) conversion |
| mismatch | Default action: raise ER_LDR_DOMAIN_MISMATCH |
| Symbol | Role |
| --- | --- |
| loaddb_init (client) → sloaddb_init (server) | New cubload::session |
| loaddb_install_class → sloaddb_install_class | Parse %class / %id line |
| loaddb_load_batch → sloaddb_load_batch | Submit a batch (with use_temp_batch re-shipping flag) |
| loaddb_fetch_status → sloaddb_fetch_status | Poll session stats |
| loaddb_update_stats → sloaddb_update_stats | Get per-class OID list, call stats_update_statistics per class |
| loaddb_destroy → sloaddb_destroy | wait_for_completion and free the session |
| loaddb_interrupt → sloaddb_interrupt | Set tran-index interrupts on all active workers |
| Symbol | Role |
| --- | --- |
| sa_class_installer / sa_object_loader | SA mode adapters wrapping the legacy ldr_* callbacks |
| ldr_init_driver | Lazily creates ldr_Driver and the SA installer/loader pair |
| ldr_init / ldr_start / ldr_final | Per-load lifecycle in load_sa_loader.cpp |
| ldr_register_post_commit_handler / ldr_register_post_interrupt_handler | setjmp/longjmp commit-or-abort plumbing |
| ldr_update_statistics | Walks Classes and calls sm_update_statistics(STATS_WITH_SAMPLING) |
| ldr_signal_handler | Sets ldr_Load_interrupted |
| ldr_stats | Snapshots Total_objects, Total_fails, Last_committed_line |
| Symbol | File | Line |
| --- | --- | --- |
| loaddb_user | src/loaddb/load_db.c | 957 |
| loaddb_internal | src/loaddb/load_db.c | 530 |
| ldr_server_load | src/loaddb/load_db.c | 1305 |
| load_object_file | src/loaddb/load_db.c | 1510 |
| get_loaddb_args | src/loaddb/load_db.c | 1251 |
| ldr_exec_query_from_file | src/loaddb/load_db.c | 1018 |
| cubload::split | src/loaddb/load_common.cpp | 678 |
| cubload::handle_batch | src/loaddb/load_common.cpp | 882 |
| append_incomplete_row | src/loaddb/load_common.cpp | 649 |
| cubload::load_args | src/loaddb/load_common.hpp | 83 |
| cubload::batch | src/loaddb/load_common.hpp | 47 |
| cubload::stats | src/loaddb/load_common.hpp | 255 |
| cubload::data_type enum | src/loaddb/load_common.hpp | 132 |
| cubload::class_installer | src/loaddb/load_common.hpp | 313 |
| cubload::object_loader | src/loaddb/load_common.hpp | 372 |
| cubload::session | src/loaddb/load_session.hpp | 71 |
| cubload::session::load_batch | src/loaddb/load_session.cpp | 582 |
| cubload::load_task::execute | src/loaddb/load_session.cpp | 120 |
| init_driver (server) | src/loaddb/load_session.cpp | 47 |
| invoke_parser | src/loaddb/load_session.cpp | 70 |
| server_class_installer::locate_class | src/loaddb/load_server_loader.cpp | 118 |
| server_class_installer::register_class_with_attributes | src/loaddb/load_server_loader.cpp | 329 |
| server_object_loader::init | src/loaddb/load_server_loader.cpp | 591 |
| server_object_loader::process_line | src/loaddb/load_server_loader.cpp | 632 |
| server_object_loader::finish_line | src/loaddb/load_server_loader.cpp | 689 |
| server_object_loader::flush_records | src/loaddb/load_server_loader.cpp | 728 |
| server_object_loader::stop_scancache | src/loaddb/load_server_loader.cpp | 1097 |
| worker_entry_manager | src/loaddb/load_worker_manager.cpp | 50 |
| REGISTER_WORKERPOOL (loaddb, …) | src/loaddb/load_worker_manager.cpp | 106 |
| worker_manager_register_session | src/loaddb/load_worker_manager.cpp | 112 |
| worker_manager_try_task | src/loaddb/load_worker_manager.cpp | 100 |
| cubload::class_registry | src/loaddb/load_class_registry.hpp | 87 |
| class_registry::register_class | src/loaddb/load_class_registry.cpp | 146 |
| init_setters (conv matrix) | src/loaddb/load_db_value_converter.cpp | 86 |
| get_conv_func | src/loaddb/load_db_value_converter.cpp | 195 |
| driver::parse | src/loaddb/load_driver.cpp | 85 |
| loader_start (Bison action) | src/loaddb/load_grammar.yy | 205 |
| class_command (Bison rule) | src/loaddb/load_grammar.yy | 279 |
| instance_line (Bison rule) | src/loaddb/load_grammar.yy | 419 |
| ldr_sa_load | src/loaddb/load_sa_loader.cpp | 6330 |
| ldr_update_statistics (SA) | src/loaddb/load_sa_loader.cpp | 6627 |
| sloaddb_init | src/communication/network_interface_sr.cpp | 10559 |
| sloaddb_install_class | src/communication/network_interface_sr.cpp | 10583 |
| sloaddb_load_batch | src/communication/network_interface_sr.cpp | 10639 |
| sloaddb_fetch_status | src/communication/network_interface_sr.cpp | 10710 |
| sloaddb_destroy | src/communication/network_interface_sr.cpp | 10763 |
| sloaddb_update_stats | src/communication/network_interface_sr.cpp | 10806 |
| loaddb_update_stats (client) | src/communication/network_interface_cl.c | 10717 |
| locator_multi_insert_force | src/transaction/locator_sr.c | 13779 |
| locator_insert_force | src/transaction/locator_sr.c | 4938 |
| BU_LOCK enum value | src/transaction/lock_table.h | 46 |
| xstats_update_statistics | src/storage/statistics_sr.c | 109 |

This document was written while reading the live source under src/loaddb/, src/transaction/locator_sr.c, src/storage/statistics_sr.c, and the network-interface files. There is no PDF “raw” source for it; the points below are reconciliations against companion analyses in this knowledge tree.

  • cubrid-heap-manager.md describes the slotted-page layout and the per-record header. The bulk loader does not introduce any new page layout; it relies on heap_attrinfo_transform_to_disk_except_lob to produce the same on-disk record format used by ordinary INSERT, and on the page-image-redo path for WAL economy. The MVCC header on each inserted record is stamped by heap_insert_logical in the normal way (the loader’s transaction id becomes the record’s insert-MVCCID).
  • cubrid-btree.md documents btree_load_index as the bulk-build path for a freshly created index. Loaddb invokes that path not via the data-file loader, but via CREATE INDEX statements in the index file (-i). So the loader’s interaction with btree.c during the data load is per-row btree_insert amortised across the heap-page lifetime; the bottom-up bulk build only happens during the post-data index file step.
  • cubrid-locator.md documents the locator’s locator_(multi_)insert_force. The loader’s use of it is pure direct-path: dont_check_fk = true, op_type = MULTI_ROW_INSERT, pruning_type = 0. The BU_LOCK requirement is checked inside locator_multi_insert_force itself (lock_has_lock_on_object (class_oid, oid_Root_class_oid, BU_LOCK)) and is what unlocks the page-image-redo log shortcut.
  • cubrid-statistics.md documents xstats_update_statistics. The loader calls the same entry-point as user-issued UPDATE STATISTICS; nothing is unique to loaddb except that it iterates by per-class OID and prefers STATS_WITH_SAMPLING for the user data and STATS_WITH_FULLSCAN for the catalog tables it refreshes after the index file.
  • cubrid-external-sort.md is invoked transitively by the -i index file path. The loader-specific knob is LOAD_INDEX_MIN_SORT_BUFFER_PAGES = 8192, which loaddb_internal forces via sysprm_set_force when an index file is present.
  • The AGENTS.md in src/loaddb/ lists load_object.c and load_object_table.c among the load-time files. Reading the current source confirms these are only used by the SA-mode loader (load_sa_loader.cpp); CS mode does not include them. The SA loader retains a “fast loaddb prototype” lineage and is considerably larger than the CS-mode server_object_loader — 6.8 K lines vs 1.2 K — but functionally narrower because it does not have to coordinate with a worker pool.
  • Online schema change during load. The BU_LOCK held by the load session blocks X_LOCK on the same class, so DDL against the loading class is excluded for the duration of the load. What about DDL on other classes that the loaded class references via FK? The loader skips FK checks (dont_check_fk), so a parallel ALTER TABLE on a referenced class is silently safe; whether it should be is a policy question.
  • Resumable load mid-batch. Restart at line N is supported, but N is a committed-batch boundary (default 10240 rows). A load interrupted mid-batch loses the entire batch. There is no per-row WAL marker that would let a resume start in the middle of a batch. Whether to add one (and at what cost to the page-image redo shortcut) is open.
  • Compression of the on-wire batch. A batch is shipped as the raw text content of the rows it contains, packed by cubpacking::packer::pack_string. There is no transport-level compression (zstd / lz4), and the batch can be up to LOADDB_BUFFER_SIZE_LIMIT ≈ 2 GiB. For large WAN loads this is a noticeable cost. Whether to add an opt-in compressor on the batch::pack / batch::unpack path is an open optimisation.
  • Bulk B+Tree build at load time. Today, indexes are either maintained per-row during the heap load (if they exist on the table at load start) or built bottom-up by CREATE INDEX in a separate file. There is no path that drops the indexes automatically before the data file and rebuilds them after — that is the unloader’s responsibility. A "loaddb --rebuild-indexes" flag would simplify operator workflows but require coordination with the schema-loader step.
  • Parallel partition-aware load. Partitioned tables today see their rows interleaved in the load file and dispatched per-row by the heap manager’s pruning logic. A load file pre-partitioned by the unloader would let each worker write to its own partition’s HFID with no cross-worker coordination — closer to Oracle’s parallel-direct path. Whether the throughput gain warrants the unloader and grammar changes is an open design question.
  • HA mode page-image redo. Today HA forces the per-record loop in flush_records. The replication log requires record-level LSAs per row to be replayable on the slave. A redesigned slave replay that understands page-image redo records would unlock the fast path on HA-enabled clusters; this is open R&D, not a current loader feature.

The document is code-only; no raw/ PDFs or PPTX inputs are attached. The reading was anchored on the following CUBRID source files (paths relative to references/cubrid/):

  • src/loaddb/load_db.c — utility entry, schema/index/trigger drivers, ldr_server_load, ldr_exec_query_from_file.
  • src/loaddb/load_common.{hpp,cpp}load_args, batch, stats, load_status, the class_installer / object_loader interfaces, and the file splitter cubload::split.
  • src/loaddb/load_session.{hpp,cpp}cubload::session, load_task, init_driver, invoke_parser, wait_for_previous_batch.
  • src/loaddb/load_server_loader.{hpp,cpp}server_class_installer, server_object_loader, the flush_records page-image-redo path and the per-record HA path.
  • src/loaddb/load_sa_loader.{hpp,cpp}ldr_sa_load, ldr_update_statistics, the legacy ldr_* callback chain.
  • src/loaddb/load_worker_manager.{hpp,cpp} — the global loaddb worker pool, REGISTER_WORKERPOOL, worker_manager_try_task.
  • src/loaddb/load_class_registry.{hpp,cpp} — class_id-keyed registry of resolved classes and their attribute lists.
  • src/loaddb/load_db_value_converter.{hpp,cpp} — the [DB_TYPE][LDR_TYPE] conversion matrix.
  • src/loaddb/load_driver.{hpp,cpp} — driver mediator class.
  • src/loaddb/load_grammar.yy, src/loaddb/load_lexer.l, src/loaddb/load_semantic_helper.hpp — Bison/Flex grammar and pool-allocating semantic helper.
  • src/loaddb/load_error_handler.hpp — templated error formatting and per-line accounting.
  • src/communication/network_interface_cl.c, src/communication/network_interface_sr.cpp, src/communication/network_sr.c — the seven-message loaddb_*/sloaddb_* protocol.
  • src/transaction/locator_sr.clocator_multi_insert_force, locator_insert_force.
  • src/transaction/lock_table.h, src/transaction/lock_table.c, src/transaction/lock_manager.cBU_LOCK value and compatibility matrix.
  • src/storage/heap_file.cheap_attrinfo_*, heap_alloc_new_page, heap_insert_logical, heap_get_class_info, heap_scancache_*.
  • src/storage/btree_load.c, src/storage/external_sort.c — invoked transitively from the index-file path.
  • src/storage/statistics_sr.cxstats_update_statistics.

Theoretical references:

  • Petrov, Database Internals, Ch. 4 “Implementing B-Trees” (bottom-up B+Tree build) and Ch. 5 “Transaction Processing” (write amplification, deferred constraint check).
  • Garcia-Molina / Ullman / Widom, Database Systems: The Complete Book, §16 “Logging and Recovery” (page-image vs record-level redo) and §13.7 “Variable-Length Data and Records”.
  • PostgreSQL documentation, COPY FROM and pg_bulkload extension manuals.
  • Oracle SQL*Loader, Direct Path Load chapter — the canonical reference for the direct-path technique.
  • MySQL Reference Manual, LOAD DATA INFILE and bulk-insert optimisation notes.