CUBRID loaddb — Bulk Loader, Direct-Path Heap+B+Tree Insert, and Post-Load Statistics Rebuild
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
Theoretical Background
A bulk loader is the part of a database engine whose job is to take
N rows that originate outside the system — a flat file dumped from
another database, an export from a previous version of the same
database, an ETL pipeline output — and make them visible to readers as
fast as the storage layer physically allows. The defining choice that
separates a “bulk loader” from “INSERT in a loop” is whether the
loader is willing to bypass the per-row machinery that the engine
runs for ordinary INSERT … VALUES (…). Petrov’s Database Internals
(Ch. 4 “Implementing B-Trees” and Ch. 5 “Transaction Processing”)
frames this in three terms — write amplification, index
maintenance, and constraint checking — and argues that an
INSERT-loop loader pays the full per-row tax for all three, whereas a
direct-path loader is allowed to amortize them across the entire
batch.
The write-amplification axis is the most direct. An ordinary
INSERT that lands in a transactionally logged engine produces, at
minimum, one heap-record write, one WAL physiological-log record per
modified page, one index-leaf write per index, and one WAL
physiological-log record per index leaf. With k indexes, the
amplification factor is roughly 2 + 2k page and log writes per row. A bulk
loader can drop this in two complementary ways:
- Page-level logging instead of record-level logging. If the loader is the only writer to a page (which is true if the class is exclusively locked for the duration of the load), a single “page-image redo” record can cover an entire freshly built page regardless of how many records it holds. Hundreds of rows now share one log record.
- Index-build deferral. Instead of inserting into each B+Tree as each row arrives — which generates near-random page touches across k trees — the loader can collect all (key, OID) pairs, sort them externally, and bulk-build each B+Tree bottom-up. The sort bounds memory; the bottom-up build touches each leaf page exactly once and writes it sequentially.
The index-maintenance axis interacts with the constraint-checking axis. A unique B+Tree must reject duplicate keys; if the index is incrementally maintained as the loader runs, every row pays the duplicate-check cost online and the engine still has to acquire key-range locks to be safe against concurrent inserters. If the index is built after the load, the duplicate check happens during the sort (adjacent equal keys signal a violation) and no key-range locks are needed because no other transaction can be touching the half-loaded table — provided the load held an exclusive class-level lock.
The constraint-checking axis is broader: foreign-key, NOT NULL,
CHECK, trigger firing, etc. A bulk loader almost always defers or
disables foreign-key checks (CUBRID sets
locator_Dont_check_foreign_key = true in SA mode) and always
disables trigger firing (CUBRID calls db_disable_trigger () before
loading). The cost model is: a deferred foreign-key check is an
anti-join executed once over the loaded data versus a per-row probe;
deferred is cheaper above a few thousand rows.
The next textbook concern is parallel partitioning. If the input
file is large but the table is unpartitioned, the loader can still
parallelise by batching the input file and assigning batches to
worker threads. The catch is that batches share the same heap file
and the same indexes, so the workers must coordinate on
(end-of-batch → commit, start-of-next-batch → continue) ordering to
preserve the OID-monotonicity guarantee that some loaders advertise
and to keep WAL recovery deterministic. CUBRID’s worker pool model
(see load_worker_manager.cpp) takes exactly this batching approach,
ordered commits and all.
The post-load statistics rebuild is the last textbook ingredient.
Cost-based optimisers depend on column histograms, distinct-value
counts, and per-class row counts; a fresh load invalidates all three.
Running ordinary UPDATE STATISTICS after the load is therefore not
optional. CUBRID’s loaddb makes this a built-in step: the SA loader
calls sm_update_statistics per class and the CS loader calls
loaddb_update_stats, which in turn drives xstats_update_statistics
on the server. Without this step, the optimiser would plan against
empty statistics and pick catastrophic plans for the first several
queries against the freshly loaded table.
This document tracks how CUBRID’s loaddb realises all four pieces —
direct-path heap insert under a Bulk-Update class lock, deferred /
batched commit, deferred index loading via a separate -i index
file, and post-load statistics rebuild — across the files in
src/loaddb/ and the storage primitives they call.
Common DBMS Design
The textbook gives the model; this section names the engineering
conventions that almost every row-oriented engine adopts when
shipping a production bulk loader. CUBRID’s specific choices in
“CUBRID’s Approach” below are best read as one set of dials within this
shared design space.
Bulk-load file format
Every loader needs an unambiguous on-disk syntax. The conventions in the wild are:
- CSV with header line, optionally with explicit column-list and delimiter override (PostgreSQL COPY FROM, MySQL LOAD DATA INFILE, SQL Server BCP). Cheap to produce, no schema in the file itself.
- Binary table-dump format, table-aware and portable across versions of the same engine (Oracle SQL*Loader’s direct-path, PostgreSQL pg_dump custom format, MySQL mysqldump --tab’s .txt/.sql pair). Faster to parse than text but engine-specific.
- CUBRID-format object file, which is a hybrid: SQL-style schema statements followed by %class/%id directives that name the target table and assign it a numeric class-id, then per-instance lines whose tokens are typed via a small lexer (string, integer, timestamp, \KRW for monetary, {…} for sets, @oid for object references, + line continuation, etc.). The format originates in the unloader (unloaddb) and is symmetric to it.
Direct-path vs SQL-INSERT path
An “INSERT-style” loader threads each row through:
parser → name-resolution → semantic-check → XASL-gen → executor → locator_insert_force per row → trigger fire → constraint check

A direct-path loader collapses that to:

load-file lexer → load-file parser → DB_VALUE per attribute → record_descriptor → locator_multi_insert_force on a vector → (optional) deferred index entry → batch commit

The savings come from never invoking the SQL parser, never building
an XASL, never going through the per-row optimiser, and from passing
a vector of records to the storage layer so it can pack multiple
records onto one heap page and emit one WAL record for the whole
page. Postgres’s pg_bulkload extension, Oracle’s SQL*Loader
direct path, and CUBRID’s loaddb all sit on this side of the line;
PostgreSQL’s built-in COPY FROM sits in between (it bypasses the
optimiser but goes through the executor’s ExecInsert).
Drop-and-rebuild vs incremental index maintenance
When the loader is given a table with k secondary indexes, it has two schedules:
- Incremental maintenance. Every inserted heap record produces k index-leaf writes inline. Simple to implement (the heap insert already drives locator_attribute_info_force, which calls btree_insert), but the index pages are touched in random key order, so the buffer pool churns and the WAL grows by k records per row.
- Drop-and-rebuild. The loader drops (or never creates) the secondary indexes, loads the heap, then rebuilds each index bottom-up via external sort + sequential leaf-page write. The cost is one external sort per index plus one sequential scan of the heap; the benefit is dense, well-ordered leaf pages and a B+Tree whose root-to-leaf paths are minimal.
CUBRID’s loaddb supports the second schedule by separating the
input into a schema file (-s / --schema-file), an object file
(-d / --data-file), and an index file (-i / --index-file).
The unloader emits the indexes into the index file as ordinary
CREATE INDEX statements, and the loader executes them after the
heap load via ldr_exec_query_from_file. The build itself is then
the engine’s normal index-create path, which uses
btree_load_index + external_sort for sorted bottom-up
construction.
Parallel partition loading
Three patterns are visible across engines:
- Single-process, multi-threaded batches (CUBRID CS-mode loaddb, pg_bulkload). One reader splits the file into batches; a worker pool consumes batches; commits are ordered.
- Multi-process, partition-aware (Oracle SQL*Loader’s parallel-direct option, Greenplum gpload). Each worker writes to its own segment / partition. Cheap parallelism but only available when the table is partitioned.
- External sharding then per-shard load (Vertica, ClickHouse). The work is offloaded to whatever sharded the data; the engine itself runs N sequential bulk loads.
CUBRID is in the first camp. Workers share one heap file under a
single transaction-per-batch model: the session holds a Bulk-Update
(BU_LOCK) lock on the class for the entire load, each worker
opens a sub-transaction (logtb_assign_tran_index), inserts its
batch, and commits in batch-id order
(session::wait_for_previous_batch).
Post-load statistics
Three patterns:
- Run as a separate command afterward (PostgreSQL ANALYZE, MySQL ANALYZE TABLE, Oracle DBMS_STATS.GATHER_TABLE_STATS). Cheap to implement, easy to forget.
- Implicit on commit (some auto-analyze daemons trigger when the row-count delta crosses a threshold).
- Built into the loader (CUBRID loaddb, Oracle SQL*Loader’s STATISTICS clause). The loader knows exactly which classes were touched and runs the analysis itself.
CUBRID picks the third: SA mode calls sm_update_statistics per
class in ldr_update_statistics; CS mode calls
loaddb_update_stats which fetches the class-OID list from the
load session and the client iterates with
stats_update_statistics(STATS_WITH_SAMPLING) per class.
Error handling and restart
Loaders measured in millions of rows must survive partial failure. The conventions are:
- Periodic commit. Every N rows (--periodic-commit N / commit-period) the loader commits, recording the line number reached. Restart skips ahead to that line. CUBRID defaults PERIODIC_COMMIT_DEFAULT_VALUE = 10240.
- Error-control file. A list of error codes the loader is allowed to ignore (CUBRID --error-control-file); a row that triggers one of them is logged and skipped instead of aborting the batch.
- Syntax-only mode. A dry run that parses the file but does no inserts (CUBRID --check-only / args.syntax_check); used to validate a freshly produced unload file before committing the database to it.
CUBRID supports all three. The combination of periodic-commit and syntax-check is precisely what production ETL operators want: a dry run first, then the real load with a checkpoint every 10K rows.
CUBRID’s Approach
CUBRID realises bulk-load as a two-process pipeline in CS mode
and a single-process direct-path in SA mode. The two share the
same load-file grammar (load_grammar.yy + load_lexer.l), the same
batching algorithm (cubload::split in load_common.cpp), and the
same cubload::driver mediator class. They diverge at the
object_loader / class_installer interfaces (load_common.hpp):
SA mode uses sa_object_loader (a wrapper around the legacy
ldr_* callbacks in load_sa_loader.cpp) while CS mode uses
server_object_loader (the direct-path implementation in
load_server_loader.cpp).
flowchart TD
subgraph CLIENT["loaddb client process (load_db.c)"]
A1["main: loaddb_user → loaddb_internal"] --> A2["ldr_validate_object_file<br/>· get_loaddb_args"]
A2 --> A3["db_login + db_restart<br/>(connect to server)"]
A3 --> A4["ldr_load_schema_file<br/>(if -s)"]
A4 --> A5{SA_MODE?}
A5 -- yes --> SA["ldr_sa_load<br/>load_sa_loader.cpp"]
A5 -- no --> CS["ldr_server_load<br/>· split + loaddb_load_batch"]
SA --> A6["ldr_exec_query_from_file<br/>(if -i index file)"]
CS --> A6
A6 --> A7["sm_update_catalog_statistics"]
A7 --> A8["db_shutdown"]
end
subgraph SERVER["cub_server process (CS mode only)"]
B1["sloaddb_init<br/>→ new cubload::session"]
B2["sloaddb_install_class<br/>→ session::install_class"]
B3["sloaddb_load_batch<br/>→ session::load_batch<br/>→ worker_manager_try_task"]
B4["load_task::execute<br/>→ driver::parse<br/>→ server_object_loader<br/>→ locator_multi_insert_force"]
B5["sloaddb_update_stats<br/>→ xstats_update_statistics"]
B6["sloaddb_destroy"]
end
CS -.NET_SERVER_LD_INIT.-> B1
CS -.NET_SERVER_LD_INSTALL_CLASS.-> B2
CS -.NET_SERVER_LD_LOAD_BATCH.-> B3
B3 --> B4
A7 -.NET_SERVER_LD_UPDATE_STATS.-> B5
A8 -.NET_SERVER_LD_DESTROY.-> B6
File format and command-line surface
The CUBRID load file is a line-oriented text file that mixes three kinds of lines:
```
%id mytable 1
%class mytable (id, name, salary, dept)
1 'Alice' 50000.00 @dept|2
2 'Bob' 60000.00 NULL
```

- A %id <class-name> <numeric-id> line names a class and assigns it a numeric class-id used in the rest of the batch. It is parsed by the id_command rule in load_grammar.yy and dispatched to class_installer::check_class.
- A %class <class-name> ( <attr-list> ) line is identical in function but additionally lists the attributes in order. It maps to the class_command rule and class_installer::install_class.
- Every other non-blank line is an instance line: an optional <int>: OID prefix followed by space-separated typed constants. This maps to the instance_line rule and dispatches to object_loader::start_line then process_line.
The lexer (load_lexer.l) recognises typed constants by literal
shape: [+\-]?[0-9]+ is INT_LIT; the larger regex with [Ee] and
optional [fFlL] is REAL_LIT; [0-9]+/[0-9]+/[0-9]+ is
DATE_LIT2; [0-9]+:[0-9]+(:[0-9]+)? covers six varieties of
TIME_LIT; '…' is the SQL string body (state <SQS>); "…" is
either a delimited identifier (state <DELIMITED_ID>) or a
double-quoted string (state <DQS>) depending on
m_semantic_helper.in_instance_line (). Currency symbols get their
own tokens — \$ → DOLLAR_SYMBOL, \\KRW → WON_SYMBOL,
\\EUR → EURO_SYMBOL, etc. — because monetary is a first-class
DB type. The + line-continuation marker is handled both in the
lexer ('\+[ \t]*\r?\n[ \t]*\' inside <SQS> glues two SQS
strings together) and in the splitter (ends_with (line, "+") in
cubload::split keeps the row in one_row_buffer for the next
iteration).
Command-line flags map onto cubload::load_args in
load_common.hpp and are unpacked by get_loaddb_args in
load_db.c:
| Flag (long / short) | Field on load_args | Default |
|---|---|---|
| --user / -u | user_name | AU_PUBLIC_USER_NAME |
| --password / -p | password | empty (prompt on ER_AU_INVALID_PASSWORD) |
| --schema-file / -s | schema_file | empty |
| --index-file / -i | index_file | empty |
| --data-file / -d | object_file | empty |
| --check-only / -c | syntax_check | false |
| --load-only / -l | load_only | false |
| --periodic-commit | periodic_commit | 10240 (PERIODIC_COMMIT_DEFAULT_VALUE) |
| --no-statistics | disable_statistics | false |
| --ignore-logging | ignore_logging | false |
| --error-control-file | error_file | empty |
| --ignore-classes | ignore_class_file | empty |
| --table / -t | table_name | empty (load entire file) |
| --CS-mode / -C | cs_mode | false |
| --no-user-specified-name | no_user_specified_name | false |
The struct is cubpacking::packable_object-derived because in CS
mode the entire load_args is shipped to the server in the
sloaddb_init request body.
Lexer / parser pipeline
The lexer is a Flex C++ scanner generated from load_lexer.l. The
grammar is a Bison LALR(1) C++ parser generated from
load_grammar.yy (skeleton lalr1.cc, cubload namespace, class
parser). The two are stitched together by the cubload::driver
class (load_driver.cpp):
```cpp
// driver::parse — load_driver.cpp
int
driver::parse (std::istream &iss, int line_offset)
{
  m_scanner->switch_streams (&iss);
  m_scanner->set_lineno (line_offset + 1);
  m_semantic_helper.reset_after_batch ();

  assert (m_class_installer != NULL && m_object_loader != NULL);
  parser parser (*this);

  return parser.parse ();
}
```

The driver owns a cubload::scanner, a cubload::class_installer,
a cubload::object_loader, an error_handler, and a
semantic_helper. The parser’s actions (in load_grammar.yy) call
into m_driver.get_class_installer () for %class and %id
directives, into m_driver.get_object_loader () for instance
lines, and into m_driver.get_semantic_helper () for constructing
the typed constant_type values out of raw lexer strings. The
scanner is generated in-place (%option yyclass="cubload::scanner",
load_scanner.hpp) so that lexer states see the live driver and
its semantic_helper.
The semantic_helper (load_semantic_helper.hpp) is the only
component that allocates significant memory inside the parser: it
maintains a string_pool of 1024 reusable string_type slots, a
constant_pool of 1024 constant_type slots, a qstr_buf_pool of
512 × 32 KiB string-body buffers, and a fallback
cubmem::extensible_block for strings that overflow the pool. The
pool is reset between batches (reset_after_batch) and between rows
(reset_after_line), so a single parse generates effectively zero
malloc traffic on the hot path.
Splitter and batching
cubload::split in load_common.cpp is the I/O front-end: it
opens the object file, walks it line by line, accumulates lines into
a batch_buffer, and flushes the buffer when one of three
conditions trips:
- Class boundary. A line beginning with %class or %CLASS (the splitter checks the prefix textually before the lexer ever sees it). The current batch is flushed to b_handler, the class-id counter is incremented, and the new %class or %id line is sent alone to c_handler.
- Row count. The accumulated row count reaches args.periodic_commit (default 10240). The buffer is flushed to b_handler and a new batch starts.
- Buffer size. The buffer would exceed LOADDB_BUFFER_SIZE_LIMIT (2 GiB − 1 KiB). The splitter trims back to the last complete row, flushes, and continues with the leftover row in one_row_buffer.
Two cross-line concerns complicate the splitter:
- Line-continuation. A line ending in + means “row continues on the next line”. The splitter holds such lines in one_row_buffer until it sees a line that does not end in +, then concatenates them and counts the result as one row.
- Open string literals. Single quotes inside a row may span multiple physical lines. The splitter tracks single_quote_checker (a 0/1 toggle XOR-ed on every ') and refuses to flush while a quote is still open — even if the row-count condition would otherwise have flushed.
Once the batch is full, the splitter calls handle_batch, which
constructs a cubload::batch (with auto-incremented batch_id,
the current class_id, the contents, the starting line offset, and
the row count) and invokes b_handler (batch). The two handlers are
lambdas in load_db.c::load_object_file:
```c
// load_object_file — load_db.c:1510
batch_handler b_handler = [&] (const batch &batch) -> int
{
  bool use_temp_batch = false;
  bool is_batch_accepted = false;
  int error_code;
  do
    {
      load_status status;
      error_code = loaddb_load_batch (batch, use_temp_batch, is_batch_accepted, status);
      if (error_code != NO_ERROR)
        {
          return error_code;
        }
      use_temp_batch = true;   // don't re-upload while retrying
      print_stats (status.get_load_stats (), *args, exit_status);
    }
  while (!is_batch_accepted);

  return error_code;
};

class_handler c_handler = [] (const batch &batch, bool &is_ignored) -> int
{
  std::string class_name;

  int error_code = loaddb_install_class (batch, is_ignored, class_name);
  if (error_code == NO_ERROR && !is_ignored && !class_name.empty ())
    {
      error_code = load_has_authorization (class_name, AU_INSERT);
    }

  return error_code;
};
```

The retry loop is essential. loaddb_load_batch may return with
is_batch_accepted == false if the server’s worker pool is full;
the client side keeps the already-shipped batch buffer in
m_temp_task on the server side and the next call passes
use_temp_batch = true so the network does not re-ship the bytes.
Network plumbing
The client-side stub is loaddb_load_batch in
network_interface_cl.c. It sends a NET_SERVER_LD_LOAD_BATCH
request whose body is a packed cubload::batch. The server
demultiplexes via network_sr.c:
```c
// network_sr.c — request-table excerpt
req_p->processing_function = sloaddb_init;            // NET_SERVER_LD_INIT
req_p->processing_function = sloaddb_install_class;   // NET_SERVER_LD_INSTALL_CLASS
req_p->processing_function = sloaddb_load_batch;      // NET_SERVER_LD_LOAD_BATCH
req_p->processing_function = sloaddb_fetch_status;    // NET_SERVER_LD_FETCH_STATUS
req_p->processing_function = sloaddb_destroy;         // NET_SERVER_LD_DESTROY
req_p->processing_function = sloaddb_interrupt;       // NET_SERVER_LD_INTERRUPT
req_p->processing_function = sloaddb_update_stats;    // NET_SERVER_LD_UPDATE_STATS
```

The server-side handlers (network_interface_sr.cpp) are thin: they
unpack the request, look up the per-connection
cubload::session via session_get_load_session, call the
matching method (session::install_class, session::load_batch,
session::fetch_status, etc.), pack the reply, and send it. The
session is stored on the client’s session_state struct so that a
single TCP connection sees a single cubload::session for the
lifetime of the load.
sequenceDiagram
participant L as loaddb client
participant S as cub_server
participant W as worker pool
L->>S: NET_SERVER_LD_INIT (load_args)
S->>S: new cubload::session(args)<br/>worker_manager_register_session
S-->>L: NO_ERROR
L->>S: NET_SERVER_LD_INSTALL_CLASS (%class line)
S->>S: server_class_installer<br/>locate_class + BU_LOCK<br/>register_class_with_attributes
S-->>L: class_name + is_ignored
loop per batch (size = periodic_commit)
L->>S: NET_SERVER_LD_LOAD_BATCH (batch bytes)
S->>W: worker_manager_try_task(load_task)
W->>W: logtb_assign_tran_index<br/>driver::parse<br/>server_object_loader::flush_records<br/>locator_multi_insert_force<br/>wait_for_previous_batch<br/>xtran_server_commit
S-->>L: status + is_batch_accepted
end
L->>S: NET_SERVER_LD_UPDATE_STATS
S->>S: enumerate class_registry<br/>pack OIDs back
S-->>L: vector<OID>
L->>L: stats_update_statistics(STATS_WITH_SAMPLING) per class
L->>S: NET_SERVER_LD_DESTROY
S->>S: session.wait_for_completion<br/>worker_manager_unregister_session
S-->>L: NO_ERROR
Class installer and the Bulk-Update lock
When the splitter encounters a %class or %id line, it calls
c_handler, which on CS mode lands in
session::install_class → invoke_parser → server_class_installer::install_class.
The installer’s job is to:
- Lower-case the class name (to_lowercase_identifier → intl_identifier_lower).
- Check whether the class is in the user’s --ignore-classes set (is_class_ignored), or is one of the legacy GLO classes (IS_OLD_GLO_CLASS); if so, register an is_ignored = true class_entry and return early.
- Look up the class by name with xlocator_find_class_oid requesting BU_LOCK (server_class_installer::locate_class).
- If found, fetch the class record, walk its attribute representation, optionally filter / reorder by the explicit attribute list from the %class line, and register a class_entry in the session’s class_registry.
The BU_LOCK request is the load’s most consequential decision.
BU_LOCK is defined in lock_table.h as the Bulk-Update Lock,
sitting between IX_LOCK and X_LOCK in the hierarchy. It is
compatible with itself and with BU_LOCK for the same class (so two
concurrent loaders can run on the same class), but blocks every
ordinary writer (X_LOCK, S_LOCK, IS_LOCK, SIX_LOCK are all
incompatible with BU_LOCK for the same resource — see the
compatibility matrix in lock_table.c). Crucially, holding
BU_LOCK is what allows the heap manager to take the page-level
log shortcut in locator_multi_insert_force:
```c
// locator_multi_insert_force — locator_sr.c:13779
bool has_BU_lock = lock_has_lock_on_object (class_oid, oid_Root_class_oid, BU_LOCK);
// ...
heap_max_page_size = heap_nonheader_page_capacity () * (1.0f - PRM_HF_UNFILL_FACTOR);
// fill a fresh page with as many records as fit, then:
pgbuf_log_redo_new_page (thread_p, home_hint_p.pgptr, DB_PAGESIZE, PAGE_HEAP);
```

That pgbuf_log_redo_new_page is the bulk-load WAL shortcut: one log
record covers the entire freshly stamped page regardless of how many
heap records it holds. Without BU_LOCK, the engine would have to
fall back to per-record physical-undo logging because some other
transaction might have a partial view of the same page.
Server-side worker pool
The CS-mode session (load_session.cpp) does not parse batches in
the network handler. Instead, session::load_batch constructs a
cubload::load_task, hands it to worker_manager_try_task, and
returns to the client immediately:
```cpp
// session::load_batch — load_session.cpp
task = new load_task (*batch, *this, *thread_ref.conn_entry);
auto pred = [&] () -> bool
{
  is_batch_accepted = worker_manager_try_task (task);
  if (is_batch_accepted)
    {
      ++m_active_task_count;
    }
  else if (!use_temp_batch)
    {
      m_temp_task = task;      // server keeps the batch
      use_temp_batch = true;   // client should skip re-shipping
    }
  return !m_collected_stats.empty () || is_batch_accepted;
};
```

The worker pool itself (load_worker_manager.cpp) is a single
process-global pool sized by PRM_ID_LOADDB_WORKER_COUNT:
```cpp
// REGISTER_WORKERPOOL — load_worker_manager.cpp:106
REGISTER_WORKERPOOL (loaddb, [] ()
{
  return prm_get_integer_value (PRM_ID_LOADDB_WORKER_COUNT);
});
```

The pool is shared across all concurrent load sessions in the
server. Each worker thread has a cubload::driver lazily attached
via worker_entry_manager::on_create (claimed from a
resource_shared_pool<driver>); the driver is reset
(driver::clear) in on_retire and given back to the pool. The
cubthread::worker_pool_task_capper rate-limits how many tasks one
session can have in flight.
A load_task::execute (load_session.cpp:120) is the body of the
worker:
```cpp
// load_task::execute — load_session.cpp
void
execute (cubthread::entry &thread_ref) final
{
  if (m_session.is_failed ())
    {
      return;
    }
  thread_ref.conn_entry = &m_conn_entry;
  driver *driver = thread_ref.m_loaddb_driver;
  init_driver (driver, m_session);

  const class_entry *cls_entry =
    m_session.get_class_registry ().get_class_entry (m_batch.get_class_id ());
  // ...
  logtb_assign_tran_index (&thread_ref, NULL_TRANID, TRAN_ACTIVE, NULL, NULL,
                           TRAN_LOCK_INFINITE_WAIT, TRAN_DEFAULT_ISOLATION_LEVEL ());
  int tran_index = thread_ref.tran_index;
  m_session.register_tran_start (tran_index);

  // copy session client ids onto worker tdes
  LOG_TDES *session_tdes = log_Gl.trantable.all_tdes[m_conn_entry.get_tran_index ()];
  LOG_TDES *worker_tdes = log_Gl.trantable.all_tdes[tran_index];
  worker_tdes->client.set_ids (session_tdes->client);

  bool parser_result = invoke_parser (driver, m_batch);
  std::size_t rows_number = driver->get_object_loader ().get_rows_number ();
  driver->clear ();

  if (m_session.is_failed ()
      || (!is_syntax_check_only && (!parser_result || er_has_error ())))
    {
      m_session.fail ();
      xtran_server_abort (&thread_ref);
    }
  else
    {
      m_session.wait_for_previous_batch (m_batch.get_id ());
      xtran_server_commit (&thread_ref, false);
      m_session.stats_update_rows_committed (rows_number);
      m_session.stats_update_last_committed_line (line_no + 1);
    }
  notify_done_and_tran_end (tran_index);
}
```

The two important invariants are:
- Each worker runs in its own transaction (logtb_assign_tran_index → xtran_server_commit per batch). The BU_LOCK on the class is inherited from the outer session because all workers register their transaction with the same session, and the session’s transaction acquired the lock at install_class time.
- Commits are batch-id-ordered via wait_for_previous_batch (m_batch.get_id ()). Batch N cannot commit until batch N−1 has committed. This ensures that the WAL’s apparent commit order matches the file’s line order, which is important for HA replication and crash recovery.
Direct-path heap insert
Inside the worker, invoke_parser runs the Bison parser against the
batch’s content string. The parser reduces each instance line to a
constant_type * linked list and calls
server_object_loader::process_line(cons):
```cpp
// server_object_loader::process_line — load_server_loader.cpp:632
for (constant_type *c = cons; c != NULL; c = c->next, attr_index++)
  {
    const attribute &attr = m_class_entry->get_attribute (attr_index);
    int error_code = process_constant (c, attr);
    if (error_code != NO_ERROR)
      {
        m_error_handler.on_syntax_failure ();
        return;
      }
    db_value &db_val = get_attribute_db_value (attr_index);
    error_code = heap_attrinfo_set (&m_class_entry->get_class_oid (),
                                    attr.get_repr ().id, &db_val, &m_attrinfo);
    // ...
  }
```

process_constant dispatches by LDR_* type code from the lexer
(process_generic_constant for everything that is not a collection
or monetary, process_collection_constant for {…} set literals,
process_monetary_constant for currency values). Generic conversion
goes through the conv_func matrix in load_db_value_converter.cpp,
indexed by [DB_TYPE][LDR_TYPE]. The matrix is initialised once at
process start (init_setters) so the per-row dispatch is a single
2-D array lookup.
After process_line succeeds, the grammar action calls
finish_line, which serialises the in-memory db_value array into
a record_descriptor via heap_attrinfo_transform_to_disk_except_lob
and pushes it onto the worker’s m_recdes_collected vector:
```cpp
// server_object_loader::finish_line — load_server_loader.cpp:689
record_descriptor new_recdes (cubmem::STANDARD_BLOCK_ALLOCATOR);
RECDES *old_recdes = NULL;
if (heap_attrinfo_transform_to_disk_except_lob (m_thread_ref, &m_attrinfo,
                                                old_recdes, &new_recdes) != S_SUCCESS)
  {
    m_error_handler.on_failure ();
    return;
  }
if (!m_error_handler.current_line_has_error ())
  {
    m_recdes_collected.push_back (std::move (new_recdes));
  }
```

The crucial point is that records accumulate. They are not
inserted one-by-one. At end-of-batch, the grammar’s start rule
(loader_start in load_grammar.yy) calls
m_driver.get_object_loader ().flush_records (), which dispatches
the entire vector to the storage layer:
```cpp
// server_object_loader::flush_records — load_server_loader.cpp:728
log_sysop_start (m_thread_ref);
int error_code = locator_multi_insert_force (m_thread_ref,
                                             &m_scancache.node.hfid,
                                             &m_scancache.node.class_oid,
                                             m_recdes_collected,
                                             /*has_index=*/true,
                                             /*op_type=*/MULTI_ROW_INSERT,
                                             &m_scancache, &force_count,
                                             /*pruning_type=*/0, NULL, NULL,
                                             UPDATE_INPLACE_NONE,
                                             /*dont_check_fk=*/true);
if (error_code == NO_ERROR)
  {
    log_sysop_attach_to_outer (m_thread_ref);
    m_rows += m_recdes_collected.size ();
  }
```

locator_multi_insert_force (locator_sr.c:13779) packs the
records onto fresh heap pages using
heap_alloc_new_page → heap_insert_logical(in_place) and emits one
pgbuf_log_redo_new_page log record per filled page — the
bulk-write shortcut described above. Indexes are still maintained
inline (has_index=true) because CUBRID does not yet bulk-build
secondary indexes for primary-key tables; instead, the heap insert
path uses the same btree_insert it would for a per-row INSERT,
just amortised across the page-fix lifetime of one heap page.
There is one fallback path: when HA is not disabled, the
“page-image-redo” log record cannot be replicated (the slave needs
record-level LSAs per row). The loader detects this with
HA_DISABLED () and falls back to a per-record loop calling
locator_insert_force with log_sysop_start /
log_sysop_attach_to_outer per row:
```cpp
// flush_records HA path — load_server_loader.cpp:766
if (insert_errors_filtered || !HA_DISABLED ())
  {
    for (size_t i = 0; i < m_recdes_collected.size (); i++)
      {
        log_sysop_start (m_thread_ref);
        RECDES local_record = m_recdes_collected[i].get_recdes ();
        int error_code = locator_insert_force (m_thread_ref, &hfid, &class_oid,
                                               &dummy_oid, &local_record,
                                               /*has_index=*/true, op_type,
                                               &m_scancache, &force_count,
                                               /*pruning=*/0, NULL, NULL,
                                               UPDATE_INPLACE_NONE, NULL,
                                               has_BU_lock,
                                               /*dont_check_fk=*/true,
                                               /*ignore_serializable=*/false);
        // ... per-record commit-or-skip with log_sysop_attach/abort
      }
  }
```

So CUBRID’s “direct path” is really page-batched, not log-free: it batches log records at the page granularity rather than the row granularity, but it still emits one redo per page. That choice preserves crash recovery and (in non-HA mode) replication correctness while still beating per-row WAL by one to two orders of magnitude.
```mermaid
flowchart LR
  subgraph WORKER["load_task::execute (worker thread)"]
    P1[bison parse batch] --> P2[per-row<br/>process_line]
    P2 --> P3[heap_attrinfo_set]
    P3 --> P4[finish_line]
    P4 --> P5[heap_attrinfo_transform_to_disk_except_lob]
    P5 --> P6[m_recdes_collected.push_back]
    P6 -.next row.-> P2
    P6 --> P7[end-of-batch:<br/>flush_records]
  end
  subgraph FLUSH["flush_records branches"]
    P7 --> Q1{HA disabled<br/>and no error filter?}
    Q1 -- yes --> Q2[locator_multi_insert_force<br/>page-image redo log]
    Q1 -- no --> Q3[per-record<br/>locator_insert_force loop]
  end
  Q2 --> R1[xtran_server_commit]
  Q3 --> R1
  R1 --> R2[wait_for_previous_batch<br/>then commit]
```
Constraint and unique handling
Foreign keys are explicitly disabled in two ways. SA mode sets
locator_Dont_check_foreign_key = true directly. CS mode sets
dont_check_fk = true on every call to
locator_(multi_)insert_force. So FK violations introduced by the
load are not caught at load time; they will surface only when the
table is next referenced.
Unique-constraint checks survive the load but are deferred. As the
worker inserts rows, the heap manager calls
btree_insert per index entry; on an existing leaf the unique
optimisation simply records a per-class index_stats count
(m_scancache.m_index_stats). At end-of-load, when
server_object_loader::stop_scancache runs, it walks
m_scancache.m_index_stats->get_map() and asserts uniqueness; if
any index reports !is_unique(), it raises
BTREE_SET_UNIQUE_VIOLATION_ERROR and fails the session:
```cpp
// stop_scancache — load_server_loader.cpp:1097
if (m_scancache.m_index_stats != NULL)
  {
    for (const auto &it : m_scancache.m_index_stats->get_map ())
      {
	if (!it.second.is_unique ())
	  {
	    BTREE_SET_UNIQUE_VIOLATION_ERROR (
		thread_get_thread_entry_info (), NULL, NULL,
		&m_class_entry->get_class_oid (), &it.first, NULL);
	    m_error_handler.on_failure ();
	    break;
	  }
	int error = logtb_tran_update_unique_stats (
	    thread_get_thread_entry_info (), it.first, it.second, true);
      }
  }
```

Triggers are unconditionally disabled by db_disable_trigger () in
loaddb_internal before any data file is opened.
NOT NULL is enforced inline: the conversion functions in
load_db_value_converter.cpp raise ER_OBJ_ATTRIBUTE_CANT_BE_NULL
if a LDR_NULL constant lands on a not-null attribute, and the
class-installer (register_class_with_attributes) walks the
attributes that the %class line omits and refuses to install if
any of them are is_notnull.
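The deferred unique check can be modelled with a few counters. This is a toy model: the “unique iff oids == keys + nulls” rule is a simplification assumed for illustration, not CUBRID’s actual unique-stats code:

```python
# Toy model of the deferred unique check: during the load each index only
# accumulates counters; uniqueness is asserted once, at end-of-load.
# The consistency rule below is an assumption of this sketch.
class IndexStats:
    def __init__(self):
        self.keys = 0    # distinct non-null keys seen
        self.oids = 0    # rows inserted under this index
        self.nulls = 0

    def record_insert(self, key, seen: set):
        self.oids += 1
        if key is None:
            self.nulls += 1
        elif key not in seen:
            seen.add(key)
            self.keys += 1

    def is_unique(self) -> bool:
        # a unique index must have exactly one row per non-null key
        return self.oids == self.keys + self.nulls

stats, seen = IndexStats(), set()
for key in [1, 2, 3, 3]:      # duplicate key 3 slips in during the load
    stats.record_insert(key, seen)
print(stats.is_unique())      # False — end-of-load check fails the session
```

The pay-off of deferring is that the hot insert path never does a full duplicate probe-and-report; the cost is that the violation is only reported at the end, without naming the offending row.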
Index loading (the -i file)
Indexes are not built by the data-file loader. Instead, the unloader
emits each index as a SQL CREATE INDEX statement into a separate
index file, and loaddb_internal runs it after the data file:
```cpp
// loaddb_internal — load_db.c
if (index_file != NULL)
  {
    print_log_msg (1, "\nStart index loading.\n");
    // ldr_exec_query_from_file parses & executes statements
    if (ldr_exec_query_from_file (args.index_file.c_str (), index_file,
				  &index_file_start_line, &args) != NO_ERROR)
      {
	// print error, restart hint, abort
      }
    sm_update_catalog_statistics (CT_INDEX_NAME, STATS_WITH_FULLSCAN);
    sm_update_catalog_statistics (CT_INDEXKEY_NAME, STATS_WITH_FULLSCAN);
    db_commit_transaction ();
  }
```

Each CREATE INDEX runs through the engine’s normal index-create
path, which is btree_load_index + external_sort for sorted
bottom-up B+Tree construction. Because the heap is already populated
when the index file runs, the build sees the entire row set in one
pass — exactly the drop-and-rebuild schedule from the Common DBMS Design section above. The loaddb-specific tuning is LOAD_INDEX_MIN_SORT_BUFFER_PAGES
(8192): if PRM_ID_SR_NBUFFERS is below that, loaddb force-overrides
it via sysprm_set_force so the external sort has at least 64 MiB of
working memory.
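The bottom-up construction that btree_load_index performs over the sorted key stream can be sketched in miniature; the node layout and fanout below are illustrative, not CUBRID’s on-page format:

```python
# Sketch of a bottom-up B+Tree build over already-sorted keys: pack leaves
# left to right (no splits ever needed), then build each internal level
# from the first key of every child, up to a single root.
def bulk_build(sorted_keys, fanout):
    # level 0: leaves, filled sequentially
    levels = [[sorted_keys[i:i + fanout]
               for i in range(0, len(sorted_keys), fanout)]]
    # each parent keeps the first key of every child as a separator
    while len(levels[-1]) > 1:
        children = levels[-1]
        levels.append([[c[0] for c in children[i:i + fanout]]
                       for i in range(0, len(children), fanout)])
    return levels

levels = bulk_build(list(range(100)), fanout=4)
print([len(l) for l in levels])  # [25, 7, 2, 1] — leaves up to a single root
```

Because the input is sorted, every page is written exactly once and fills to the chosen fanout — the property that makes the post-data index build so much cheaper than per-row `btree_insert`.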
```mermaid
flowchart LR
  A[unloaddb produces<br/>schema.sql / data.obj / index.sql] --> B
  B[loaddb -s schema.sql] --> C
  C[loaddb -d data.obj<br/>direct-path heap insert<br/>w/ BU_LOCK + page-image redo] --> D
  D[loaddb -i index.sql<br/>CREATE INDEX per line<br/>btree_load_index + external_sort] --> E
  E[loaddb_update_stats /<br/>sm_update_statistics<br/>per loaded class] --> F
  F[db_commit + db_shutdown]
```
Triggers (the --trigger-file file)
Triggers and stored procedures live in their own file (-t /
--trigger-file). Like the index file, it is executed via
ldr_exec_query_from_file after the data load. Unlike the index
file, it has no special sort-buffer override; the statements are
just compiled and executed one at a time, with periodic_commit
applied to the executed-statement count.
Post-load statistics rebuild
In CS mode, post-load statistics live entirely outside the bulk-load
worker pool. After wait_for_completion reports the load finished,
load_db.c::ldr_server_load decides whether to refresh:
```cpp
// ldr_server_load — load_db.c
if (!load_interrupted && !status.is_load_failed () && !args->syntax_check
    && error_code == NO_ERROR && !args->disable_statistics)
  {
    error_code = loaddb_update_stats (args->verbose);
    // print stats and a final fetch
  }
```

loaddb_update_stats (network_interface_cl.c:10717) is unusual: it
sends a NET_SERVER_LD_UPDATE_STATS request, the server responds
with the list of class OIDs that the load touched (drawn from
session.get_class_registry ()), and the client iterates those OIDs
calling stats_update_statistics(STATS_WITH_SAMPLING) for each. The
actual histogram build runs server-side under xstats_update_statistics
in statistics_sr.c. Doing it as a per-class iteration on the client
side rather than a single bulk server call is deliberate: it lets the
client print per-class progress (LOADDB_MSG_CLASS_TITLE) and
respect the user’s verbose preference.
In SA mode, ldr_update_statistics (load_sa_loader.cpp:6627) walks
the Classes linked list (the SA loader’s own class registry) and
calls sm_update_statistics(class_, STATS_WITH_SAMPLING) directly.
Both paths default to sampling (STATS_WITH_SAMPLING), not full
scan. A full scan is reserved for the post-index step, where the
catalog tables db_index and db_index_key are themselves
re-statistic’d with STATS_WITH_FULLSCAN.
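The cost asymmetry behind the sampling default can be sketched quickly; the page layout and the naive scale-up estimator are assumptions of this sketch, not CUBRID’s estimator:

```python
# Why sampling is the default after a big load: a sampled scan touches a
# fraction of the heap pages. Estimator and layout are illustrative only.
def ndv_fullscan(pages):
    # exact number of distinct values, after reading every page
    return len({v for page in pages for v in page})

def ndv_sampled(pages, step):
    # read every step-th page, then naively scale up the distinct count;
    # real estimators are smarter about skew and duplicates
    sample = {v for page in pages[::step] for v in page}
    return len(sample) * step

heap = [[p * 10 + r for r in range(10)] for p in range(1000)]  # all distinct
print(ndv_fullscan(heap))     # 10000, after reading 1000 pages
print(ndv_sampled(heap, 10))  # 10000, after reading only 100 pages
```

For freshly loaded, fairly uniform data the sampled estimate is usually good enough, which is why only the small catalog tables get the full-scan treatment.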
Error handling, restart, syntax-check
Errors surface at three layers:
- Lexer / parser errors raise `LOADDB_MSG_SYNTAX_ERR` via `error_handler::on_error`, which in CS mode appends to the session’s `m_stats.error_message`. The client polls `loaddb_fetch_status` every 100 ms (in `ldr_server_load`’s do-while loop) and prints any new error_message lines as they arrive.
- Per-row insertion errors (type conversion, NOT NULL, FK violation, …) are reported by `error_handler::on_failure`. If the error code is in `args.m_ignored_errors`, the row is skipped and the batch continues; otherwise the batch’s transaction is aborted and the session is marked failed.
- Session-level interrupt (Ctrl-C, signal). The client’s `register_signal_handlers` installs a handler that sets `load_interrupted = true` and calls `loaddb_interrupt`, which ships a `NET_SERVER_LD_INTERRUPT` to the server. The server’s `session::interrupt` walks the session’s set of active transaction indexes and sets each one’s interrupt flag via `logtb_set_tran_index_interrupt`, then marks the session failed.
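The per-row skip-or-abort decision in the middle layer can be sketched as follows; the numeric error codes and the insert stub are hypothetical, invented for illustration:

```python
# Sketch of the decision driven by args.m_ignored_errors: a listed error
# skips the row and keeps the batch alive; anything else aborts the batch.
ER_NOT_NULL, ER_DOMAIN = -494, -181   # hypothetical numeric codes

def load_batch(rows, ignored_errors):
    committed, skipped = 0, 0
    for row in rows:
        error = row.get("error")       # stand-in for the real insert call
        if error is None:
            committed += 1
        elif error in ignored_errors:
            skipped += 1               # row skipped, batch continues
        else:
            return None                # batch aborted, session marked failed
    return committed, skipped

rows = [{}, {"error": ER_NOT_NULL}, {}]
print(load_batch(rows, ignored_errors={ER_NOT_NULL}))  # (2, 1)
print(load_batch(rows, ignored_errors=set()))          # None — aborted
```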
Restart is line-based. Every committed batch updates
stats.last_committed_line with the file offset of the last
successfully committed row. On interrupt, the client prints
LOADDB_MSG_LAST_COMMITTED_LINE so the user can rerun loaddb with
a :line suffix on the data file. The line-skip is performed by
ldr_get_start_line_no parsing the :N suffix off the file path
and ldr_exec_query_from_file (or cubload::split after a fresh
parse) jumping ahead.
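The suffix handling can be sketched in a few lines; the parsing details (including the lack of Windows drive-letter handling) are assumptions of this sketch, not ldr_get_start_line_no verbatim:

```python
# Sketch of the ":N" restart suffix: "data.obj:5000" means "resume at
# line 5000". Parsing details are an assumption of this sketch.
def parse_start_line(arg):
    path, sep, suffix = arg.rpartition(":")
    if sep and suffix.isdigit():
        return path, int(suffix)
    return arg, 0                      # no suffix: start from the first line

print(parse_start_line("data.obj:5000"))  # ('data.obj', 5000)
print(parse_start_line("data.obj"))       # ('data.obj', 0)
```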
Syntax check (--check-only) runs the entire pipeline up to but
not including the heap insert. server_object_loader::flush_records
short-circuits on args.syntax_check and asserts the recdes vector
is empty; process_line only counts rows. All errors are still
collected and printed, but the user gets a clean DB at the end.
Parallelism and partitioning
Parallelism is batch-level, not row-level. Each batch runs end-to-end on one worker thread. With PRM_ID_LOADDB_WORKER_COUNT = 8 and
periodic_commit = 10240, eight workers can be running batches of
10K rows each concurrently, all under the same BU_LOCK. The
client’s batch retry loop ensures back-pressure: if the worker pool
is saturated, worker_manager_try_task returns false, the server
returns is_batch_accepted = false, and the client polls again.
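That back-pressure loop can be modelled deterministically; the capacity and the “one worker finishes per failed poll” rule are simplifying assumptions of this sketch:

```python
# Model of the batch back-pressure loop: the capped pool refuses a task
# when saturated and the client simply polls again.
class TaskCapper:
    def __init__(self, capacity):
        self.capacity, self.in_flight = capacity, 0

    def try_task(self):                  # worker_manager_try_task analogue
        if self.in_flight >= self.capacity:
            return False                 # pool saturated: reject the batch
        self.in_flight += 1
        return True

    def task_done(self):
        self.in_flight -= 1

pool, submitted, retries = TaskCapper(8), 0, 0
while submitted < 100:
    if pool.try_task():
        submitted += 1
    else:                                # is_batch_accepted == false
        retries += 1
        pool.task_done()                 # meanwhile a worker finished a batch

print(submitted, retries)                # 100 92
```

After the first eight batches fill the pool, every further submission waits for one completion — the client never buffers more than the pool can hold.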
CUBRID does not have an explicit “parallel partition load” mode in
loaddb: a partitioned class’s load file contains rows for all
partitions intermixed, and the inserter relies on the engine’s
normal partition-pruning machinery (the pruning_type parameter to
locator_multi_insert_force, set to 0 — DB_NOT_PARTITIONED_CLASS
— in flush_records, which means each row is dispatched to its
correct partition lazily by the heap manager).
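The lazy dispatch can be sketched as a routing loop; the hash pruning function here is an illustrative stand-in for the engine’s partition-pruning machinery:

```python
# Sketch of lazy per-row partition dispatch: rows for all partitions
# arrive interleaved and each is routed to its partition's heap as it
# is inserted. The hash router is an illustrative stand-in.
def load_interleaved(rows, n_partitions):
    heaps = [[] for _ in range(n_partitions)]   # one heap file per partition
    for key, payload in rows:
        heaps[hash(key) % n_partitions].append(payload)
    return heaps

rows = [(k, f"row-{k}") for k in range(10)]
heaps = load_interleaved(rows, n_partitions=4)
print(sum(len(h) for h in heaps))   # 10 — every row landed in some partition
```

A pre-partitioned load file would let this loop collapse to a straight append per worker, which is exactly the open design question raised later.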
```mermaid
flowchart TD
  SPLIT["cubload::split (client side)"]
  SPLIT --> B1["batch 1: rows 1..10240<br/>class_id 1"]
  SPLIT --> B2["batch 2: rows 10241..20480<br/>class_id 1"]
  SPLIT --> B3["batch 3: rows 20481..30720<br/>class_id 1"]
  SPLIT --> B4["batch 4: rows 30721..40960<br/>class_id 1"]
  B1 --> WP{worker_pool size<br/>= PRM_LOADDB_WORKER_COUNT}
  B2 --> WP
  B3 --> WP
  B4 --> WP
  WP --> W1[worker 1<br/>tran_index = T1<br/>commit ordered]
  WP --> W2[worker 2<br/>tran_index = T2<br/>commit ordered]
  WP --> W3[worker 3<br/>tran_index = T3<br/>commit ordered]
  W1 -. wait_for_previous_batch .-> W2
  W2 -. wait_for_previous_batch .-> W3
  W3 --> COMMIT[xtran_server_commit<br/>ordered by batch_id]
```
Source Walkthrough
This section lists the stable symbol names used in the prose above.
Line numbers are scoped to the updated: date.
Entry points and CLI
| Symbol | Role |
|---|---|
loaddb_user | Public utility entry: forwards to loaddb_internal(arg, 0) |
loaddb_internal | Argument validation, login, schema/data/index/trigger driver |
get_loaddb_args | Maps UTIL_ARG_MAP to cubload::load_args |
ldr_validate_object_file | Sanity-checks the args (volume, files, HA-mode constraints) |
ldr_check_file | Opens a file and returns FILE * with error code |
ldr_get_start_line_no | Parses :N suffix from file path |
register_signal_handlers | Installs SIGINT/SIGQUIT → load_interrupted = true |
File-driver and lambda handlers
| Symbol | Role |
|---|---|
ldr_server_load | CS-mode driver: loaddb_init, load_object_file, loaddb_update_stats, loaddb_destroy |
ldr_sa_load | SA-mode driver: ldr_init, ldr_Driver->parse, ldr_update_statistics |
load_object_file | Constructs b_handler/c_handler lambdas, calls cubload::split |
cubload::split | Line-by-line splitter: detects %class/%id, batches, line-continuation, single-quote tracking |
cubload::handle_batch | Wraps a batch_buffer into a cubload::batch and calls b_handler |
append_incomplete_row | Buffer-overflow handler that flushes and continues |
Lexer / parser / driver
| Symbol | Role |
|---|---|
cubload::driver | Mediator: scanner + class_installer + object_loader + error_handler + semantic_helper |
driver::initialize | Hands ownership of the four components to the driver |
driver::parse | Switches the scanner stream and runs the Bison parser |
cubload::scanner | Flex C++ scanner generated from load_lexer.l |
cubload::parser | Bison C++ parser generated from load_grammar.yy |
cubload::semantic_helper | Pool allocator for string_type / constant_type / qstr_buf |
make_string_by_buffer / make_string_by_yytext | Allocate string_type from pool |
make_constant / make_real / make_monetary_constant | Allocate constant_type from pool |
reset_after_line / reset_after_batch | Reset pool indices |
Common types
| Symbol | Role |
|---|---|
cubload::load_args | Packable struct of CLI flags |
cubload::batch | Packable struct: batch_id, class_id, content, line offset, row count |
cubload::stats | Packable struct: rows_committed, current_line, last_committed_line, rows_failed, error/log messages |
cubload::load_status | Packable struct: client_type, completed, failed, vector |
cubload::data_type | LDR_NULL / INT / STR / NUMERIC / DOUBLE / FLOAT / OID / DATE / TIME / TIMESTAMP / DATETIME / COLLECTION / MONETARY / BSTR / XSTR / JSON … |
cubload::class_installer | Pure-virtual interface for class registration |
cubload::object_loader | Pure-virtual interface for row insertion |
Server-side session
| Symbol | Role |
|---|---|
cubload::session | Per-connection load state on the server |
session::install_class | Register a class via the parser |
session::load_batch | Enqueue a load_task on the worker pool, with retry semantics |
session::wait_for_previous_batch | Order commits by batch_id |
session::wait_for_completion | Used by sloaddb_destroy |
session::fail / session::interrupt | Mark session failed, set tran-index interrupt flags |
session::stats_update_* | Atomically update the live stats |
session::fetch_status | Snapshot the stats and recently collected per-batch stats |
init_driver | Lazy-init the worker’s driver with server_class_installer + server_object_loader |
invoke_parser | Run the Bison parser against the batch content |
cubload::class_registry | class_id → class_entry map; mutex-protected |
cubload::class_entry | Resolved class: OID + name + ordered attributes + is_ignored |
cubload::attribute | Resolved attribute: name + index + or_attribute * repr |
Server-side direct-path implementation
| Symbol | Role |
|---|---|
server_class_installer | Implements class_installer for the server |
server_class_installer::locate_class | xlocator_find_class_oid with BU_LOCK |
server_class_installer::locate_class_for_all_users | Schema lookup across all users (legacy 11.2 compat) |
server_class_installer::register_class_with_attributes | Build the class_entry from the class’s or_attribute[] |
server_object_loader | Implements object_loader for direct-path heap insert |
server_object_loader::init | start_scancache + start_attrinfo; assert BU_LOCK held |
server_object_loader::process_line | Per-row: type-convert constants + heap_attrinfo_set |
server_object_loader::finish_line | heap_attrinfo_transform_to_disk_except_lob → m_recdes_collected.push_back |
server_object_loader::flush_records | End-of-batch: locator_multi_insert_force (or per-record loop in HA) |
server_object_loader::stop_scancache | Validate unique index stats, raise BTREE_SET_UNIQUE_VIOLATION_ERROR if any |
server_object_loader::process_constant / process_generic_constant / process_monetary_constant / process_collection_constant | Type-dispatched constant conversion |
server_object_loader::clear_db_values | Reset m_db_values and m_attrinfo between rows |
Worker pool
| Symbol | Role |
|---|---|
cubload::load_task | cubthread::entry_task running invoke_parser + commit |
load_task::execute | Worker body: assign tran, parse, flush, ordered commit, notify |
worker_entry_manager | cubthread::entry_manager that claims/retires cubload::driver instances |
worker_manager_register_session / worker_manager_unregister_session | Add/remove session from the global active set |
worker_manager_try_task | Hands a task to the pool via worker_pool_task_capper |
worker_manager_stop_all | Used at shutdown to interrupt every active session |
Conversion matrix
| Symbol | Role |
|---|---|
cubload::conv_func | int (*) (const char *, size_t, const attribute *, db_value *) |
cubload::get_conv_func | 2-D lookup setters[db_type][ldr_type] |
cubload::init_setters | One-shot population of the matrix |
to_db_int / to_db_bigint / to_db_string / to_db_date / to_db_monetary / to_db_json … | Per-(db_type,ldr_type) conversion |
mismatch | Default action: raise ER_LDR_DOMAIN_MISMATCH |
Network plumbing
| Symbol | Role |
|---|---|
loaddb_init (client) → sloaddb_init (server) | New cubload::session |
loaddb_install_class → sloaddb_install_class | Parse %class / %id line |
loaddb_load_batch → sloaddb_load_batch | Submit a batch (with use_temp_batch re-shipping flag) |
loaddb_fetch_status → sloaddb_fetch_status | Poll session stats |
loaddb_update_stats → sloaddb_update_stats | Get per-class OID list, call stats_update_statistics per class |
loaddb_destroy → sloaddb_destroy | wait_for_completion and free the session |
loaddb_interrupt → sloaddb_interrupt | Set tran-index interrupts on all active workers |
SA-mode driver
| Symbol | Role |
|---|---|
sa_class_installer / sa_object_loader | SA mode adapters wrapping the legacy ldr_* callbacks |
ldr_init_driver | Lazily creates ldr_Driver and the SA installer/loader pair |
ldr_init / ldr_start / ldr_final | Per-load lifecycle in load_sa_loader.cpp |
ldr_register_post_commit_handler / ldr_register_post_interrupt_handler | setjmp/longjmp commit-or-abort plumbing |
ldr_update_statistics | Walks Classes and calls sm_update_statistics(STATS_WITH_SAMPLING) |
ldr_signal_handler | Sets ldr_Load_interrupted |
ldr_stats | Snapshots Total_objects, Total_fails, Last_committed_line |
Position hints (as of updated: date)
| Symbol | File | Line |
|---|---|---|
loaddb_user | src/loaddb/load_db.c | 957 |
loaddb_internal | src/loaddb/load_db.c | 530 |
ldr_server_load | src/loaddb/load_db.c | 1305 |
load_object_file | src/loaddb/load_db.c | 1510 |
get_loaddb_args | src/loaddb/load_db.c | 1251 |
ldr_exec_query_from_file | src/loaddb/load_db.c | 1018 |
cubload::split | src/loaddb/load_common.cpp | 678 |
cubload::handle_batch | src/loaddb/load_common.cpp | 882 |
append_incomplete_row | src/loaddb/load_common.cpp | 649 |
cubload::load_args | src/loaddb/load_common.hpp | 83 |
cubload::batch | src/loaddb/load_common.hpp | 47 |
cubload::stats | src/loaddb/load_common.hpp | 255 |
cubload::data_type enum | src/loaddb/load_common.hpp | 132 |
cubload::class_installer | src/loaddb/load_common.hpp | 313 |
cubload::object_loader | src/loaddb/load_common.hpp | 372 |
cubload::session | src/loaddb/load_session.hpp | 71 |
cubload::session::load_batch | src/loaddb/load_session.cpp | 582 |
cubload::load_task::execute | src/loaddb/load_session.cpp | 120 |
init_driver (server) | src/loaddb/load_session.cpp | 47 |
invoke_parser | src/loaddb/load_session.cpp | 70 |
server_class_installer::locate_class | src/loaddb/load_server_loader.cpp | 118 |
server_class_installer::register_class_with_attributes | src/loaddb/load_server_loader.cpp | 329 |
server_object_loader::init | src/loaddb/load_server_loader.cpp | 591 |
server_object_loader::process_line | src/loaddb/load_server_loader.cpp | 632 |
server_object_loader::finish_line | src/loaddb/load_server_loader.cpp | 689 |
server_object_loader::flush_records | src/loaddb/load_server_loader.cpp | 728 |
server_object_loader::stop_scancache | src/loaddb/load_server_loader.cpp | 1097 |
worker_entry_manager | src/loaddb/load_worker_manager.cpp | 50 |
REGISTER_WORKERPOOL (loaddb, …) | src/loaddb/load_worker_manager.cpp | 106 |
worker_manager_register_session | src/loaddb/load_worker_manager.cpp | 112 |
worker_manager_try_task | src/loaddb/load_worker_manager.cpp | 100 |
cubload::class_registry | src/loaddb/load_class_registry.hpp | 87 |
class_registry::register_class | src/loaddb/load_class_registry.cpp | 146 |
init_setters (conv matrix) | src/loaddb/load_db_value_converter.cpp | 86 |
get_conv_func | src/loaddb/load_db_value_converter.cpp | 195 |
driver::parse | src/loaddb/load_driver.cpp | 85 |
loader_start (Bison action) | src/loaddb/load_grammar.yy | 205 |
class_command (Bison rule) | src/loaddb/load_grammar.yy | 279 |
instance_line (Bison rule) | src/loaddb/load_grammar.yy | 419 |
ldr_sa_load | src/loaddb/load_sa_loader.cpp | 6330 |
ldr_update_statistics (SA) | src/loaddb/load_sa_loader.cpp | 6627 |
sloaddb_init | src/communication/network_interface_sr.cpp | 10559 |
sloaddb_install_class | src/communication/network_interface_sr.cpp | 10583 |
sloaddb_load_batch | src/communication/network_interface_sr.cpp | 10639 |
sloaddb_fetch_status | src/communication/network_interface_sr.cpp | 10710 |
sloaddb_destroy | src/communication/network_interface_sr.cpp | 10763 |
sloaddb_update_stats | src/communication/network_interface_sr.cpp | 10806 |
loaddb_update_stats (client) | src/communication/network_interface_cl.c | 10717 |
locator_multi_insert_force | src/transaction/locator_sr.c | 13779 |
locator_insert_force | src/transaction/locator_sr.c | 4938 |
BU_LOCK enum value | src/transaction/lock_table.h | 46 |
xstats_update_statistics | src/storage/statistics_sr.c | 109 |
Cross-check Notes
This document was written while reading the live source under
src/loaddb/, src/transaction/locator_sr.c,
src/storage/statistics_sr.c, and the network-interface files.
There is no PDF “raw” source for it; the points below are
reconciliations against companion analyses in this knowledge tree.
- `cubrid-heap-manager.md` describes the slotted-page layout and the per-record header. The bulk loader does not introduce any new page layout; it relies on `heap_attrinfo_transform_to_disk_except_lob` to produce the same on-disk record format used by ordinary INSERT, and on the page-image-redo path for WAL economy. The MVCC header on each inserted record is stamped by `heap_insert_logical` in the normal way (the loader’s transaction id becomes the record’s insert-MVCCID).
- `cubrid-btree.md` documents `btree_load_index` as the bulk-build path for a freshly created index. Loaddb invokes that path not via the data-file loader, but via `CREATE INDEX` statements in the index file (`-i`). So the loader’s interaction with btree.c during the data load is per-row `btree_insert` amortised across the heap-page lifetime; the bottom-up bulk build only happens during the post-data index-file step.
- `cubrid-locator.md` documents the locator’s `locator_(multi_)insert_force`. The loader’s use of it is pure direct-path: `dont_check_fk = true`, `op_type = MULTI_ROW_INSERT`, `pruning_type = 0`. The `BU_LOCK` requirement is checked inside `locator_multi_insert_force` itself (`lock_has_lock_on_object (class_oid, oid_Root_class_oid, BU_LOCK)`) and is what unlocks the page-image-redo log shortcut.
- `cubrid-statistics.md` documents `xstats_update_statistics`. The loader calls the same entry point as a user-issued `UPDATE STATISTICS`; nothing is unique to loaddb except that it iterates by per-class OID and prefers `STATS_WITH_SAMPLING` for the user data and `STATS_WITH_FULLSCAN` for the catalog tables it refreshes after the index file.
- `cubrid-external-sort.md` is invoked transitively by the `-i` index-file path. The loader-specific knob is `LOAD_INDEX_MIN_SORT_BUFFER_PAGES = 8192`, which `loaddb_internal` forces via `sysprm_set_force` when an index file is present.
- The `AGENTS.md` in `src/loaddb/` lists `load_object.c` and `load_object_table.c` among the load-time files. Reading the current source confirms these are only used by the SA-mode loader (`load_sa_loader.cpp`); CS mode does not include them. The SA loader retains a “fast loaddb prototype” lineage and is considerably larger than the CS-mode `server_object_loader` — 6.8 K lines vs 1.2 K — but functionally narrower because it does not have to coordinate with a worker pool.
Open Questions
- Online schema change during load. The `BU_LOCK` held by the load session blocks `X_LOCK` on the same class, so DDL against the loading class is excluded for the duration of the load. What about DDL on other classes that the loaded class references via FK? The loader skips FK checks (`dont_check_fk`), so a parallel `ALTER TABLE` on a referenced class is silently safe; whether it should be is a policy question.
- Resumable load mid-batch. Restart at line `N` is supported, but `N` is a committed-batch boundary (default 10240 rows). A load interrupted mid-batch loses the entire batch. There is no per-row WAL marker that would let a resume start in the middle of a batch. Whether to add one (and at what cost to the page-image redo shortcut) is open.
- Compression of the on-wire batch. A batch is shipped as the raw text content of the rows it contains, packed by `cubpacking::packer::pack_string`. There is no transport-level compression (zstd / lz4), and the batch can be up to `LOADDB_BUFFER_SIZE_LIMIT` ≈ 2 GiB. For large WAN loads this is a noticeable cost. Whether to add an opt-in compressor on the `batch::pack` / `batch::unpack` path is an open optimisation.
- Bulk B+Tree build at load time. Today, indexes are either maintained per-row during the heap load (if they exist on the table at load start) or built bottom-up by `CREATE INDEX` in a separate file. There is no path that drops the indexes automatically before the data file and rebuilds them after — that is the unloader’s responsibility. A `loaddb --rebuild-indexes` flag would simplify operator workflows but require coordination with the schema-loader step.
- Parallel partition-aware load. Partitioned tables today see their rows interleaved in the load file and dispatched per-row by the heap manager’s pruning logic. A load file pre-partitioned by the unloader would let each worker write to its own partition’s HFID with no cross-worker coordination — closer to Oracle’s parallel direct path. Whether the throughput gain warrants the unloader and grammar changes is an open design question.
- HA mode page-image redo. Today HA forces the per-record loop in `flush_records`. The replication log requires record-level LSAs per row to be replayable on the slave. A redesigned slave replay that understands page-image redo records would unlock the fast path on HA-enabled clusters; this is open R&D, not a current loader feature.
Sources
The document is code-only; no raw/ PDFs or PPTX inputs are
attached. The reading was anchored on the following CUBRID source
files (paths relative to references/cubrid/):
- `src/loaddb/load_db.c` — utility entry, schema/index/trigger drivers, `ldr_server_load`, `ldr_exec_query_from_file`.
- `src/loaddb/load_common.{hpp,cpp}` — `load_args`, `batch`, `stats`, `load_status`, the `class_installer` / `object_loader` interfaces, and the file splitter `cubload::split`.
- `src/loaddb/load_session.{hpp,cpp}` — `cubload::session`, `load_task`, `init_driver`, `invoke_parser`, `wait_for_previous_batch`.
- `src/loaddb/load_server_loader.{hpp,cpp}` — `server_class_installer`, `server_object_loader`, the `flush_records` page-image-redo path and the per-record HA path.
- `src/loaddb/load_sa_loader.{hpp,cpp}` — `ldr_sa_load`, `ldr_update_statistics`, the legacy `ldr_*` callback chain.
- `src/loaddb/load_worker_manager.{hpp,cpp}` — the global loaddb worker pool, `REGISTER_WORKERPOOL`, `worker_manager_try_task`.
- `src/loaddb/load_class_registry.{hpp,cpp}` — class_id-keyed registry of resolved classes and their attribute lists.
- `src/loaddb/load_db_value_converter.{hpp,cpp}` — the `[DB_TYPE][LDR_TYPE]` conversion matrix.
- `src/loaddb/load_driver.{hpp,cpp}` — driver mediator class.
- `src/loaddb/load_grammar.yy`, `src/loaddb/load_lexer.l`, `src/loaddb/load_semantic_helper.hpp` — Bison/Flex grammar and pool-allocating semantic helper.
- `src/loaddb/load_error_handler.hpp` — templated error formatting and per-line accounting.
- `src/communication/network_interface_cl.c`, `src/communication/network_interface_sr.cpp`, `src/communication/network_sr.c` — the seven-message `loaddb_*` / `sloaddb_*` protocol.
- `src/transaction/locator_sr.c` — `locator_multi_insert_force`, `locator_insert_force`.
- `src/transaction/lock_table.h`, `src/transaction/lock_table.c`, `src/transaction/lock_manager.c` — `BU_LOCK` value and compatibility matrix.
- `src/storage/heap_file.c` — `heap_attrinfo_*`, `heap_alloc_new_page`, `heap_insert_logical`, `heap_get_class_info`, `heap_scancache_*`.
- `src/storage/btree_load.c`, `src/storage/external_sort.c` — invoked transitively from the index-file path.
- `src/storage/statistics_sr.c` — `xstats_update_statistics`.
Theoretical references:
- Petrov, Database Internals, Ch. 4 “Implementing B-Trees” (bottom-up B+Tree build) and Ch. 5 “Transaction Processing” (write amplification, deferred constraint check).
- Garcia-Molina / Ullman / Widom, Database Systems: The Complete Book, §16 “Logging and Recovery” (page-image vs record-level redo) and §13.7 “Variable-Length Data and Records”.
- PostgreSQL documentation, `COPY FROM` and the `pg_bulkload` extension manuals.
- Oracle SQL*Loader, Direct Path Load chapter — the canonical reference for the direct-path technique.
- MySQL Reference Manual, `LOAD DATA INFILE` and bulk-insert optimisation notes.