CUBRID CDC — Streaming DML and DDL Through the WAL
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Source verification (as of 2026-04-30)
- Beyond CUBRID — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Change Data Capture (CDC) is the practice of turning a database’s internal write log into a downstream event stream consumers can react to: stream into Kafka, mirror into a search index, materialise a denormalised view, audit changes for compliance. Database Internals (Petrov) does not have a dedicated CDC chapter, but the topic sits at the intersection of ch. 5 (Recovery, WAL) and ch. 13 (Distribution, replication).
Two implementation choices the model leaves open shape every CDC implementation and frame the rest of this document:

- Where do logical events come from? Two paths:
  (a) From the physical WAL — walk LOG_*UNDOREDO_DATA / LOG_MVCC_* records and reconstruct logical row images by correlating with the catalog. PostgreSQL’s pg_logical does this; older CUBRID HA replication (log_applier.c::la_apply_*) does this.
  (b) From explicit logical records the engine emits at DML time. The records are intentionally rich — table OID, before image, after image, transaction user, statement text. Consumers parse them with no catalog lookup. Modern CUBRID CDC takes this path: every DML emits a LOG_SUPPLEMENTAL_INFO record alongside the regular LOG_*UNDOREDO_DATA.
- Push or pull? Push (replication-style: a daemon on the producing server tails the log and ships records to consumers) or pull (CDC-API style: a consumer asks the server “give me the next batch from LSA X”). CUBRID supports both: HA replication is push (la_apply_log_file is a long-running daemon), the CDC API is pull (cdc_make_loginfo is request/response).

Once the choices are named, every CUBRID-specific structure in this document either implements one of them or makes the access faster.
Common DBMS Design
Every engine that ships CDC reaches for the same pattern set.
Forward log walking with a position cursor
The consumer carries an LSA cursor; the server returns records
“from this LSA forward, up to N records or N bytes”. On each
batch the consumer commits its cursor downstream and on
reconnect resumes from there. PostgreSQL’s pg_logical slot,
Debezium’s offset, MySQL’s binlog position are all the same idea.
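The cursor contract above can be sketched in a few lines. This is an illustrative model only — Lsa, Event, fetch_batch, and consume_all are invented names, and the LSA is collapsed to one integer; the point is the resume semantics: commit the last handled LSA, and a reconnect never re-reads a committed event.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Lsa = std::uint64_t;

struct Event { Lsa lsa; char kind; };   // kind: 'I'nsert, 'U'pdate, 'D'elete

// Stand-in for the server side of the pull API:
// "from this LSA forward, up to max_items records".
static std::vector<Event> fetch_batch (const std::vector<Event> &log,
                                       Lsa after, std::size_t max_items)
{
  std::vector<Event> out;
  for (const Event &e : log)
    if (e.lsa > after && out.size () < max_items)
      out.push_back (e);
  return out;
}

// Consumer loop: handle a batch, then commit the last event's LSA as the
// durable cursor. On crash/reconnect, the loop restarts from that cursor.
static std::size_t consume_all (const std::vector<Event> &log, Lsa &cursor)
{
  std::size_t handled = 0;
  for (;;)
    {
      std::vector<Event> batch = fetch_batch (log, cursor, 2);
      if (batch.empty ())
        break;
      handled += batch.size ();   // "handle" each event downstream
      cursor = batch.back ().lsa; // commit cursor after the batch succeeds
    }
  return handled;
}
```

A consumer that committed cursor 3 and reconnects sees only events with LSA 4 and above — the same behaviour as a pg_logical slot or a Debezium offset.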
Logical event types
The shared event vocabulary: INSERT, UPDATE, DELETE, BEGIN,
COMMIT, ABORT, plus DDL (CREATE / ALTER / DROP). Most engines
keep this small (5-7 types) so consumers don’t need an
ever-growing parser. CUBRID’s SUPPLEMENT_REC_TYPE enum
(cubrid-log-manager.md §“Supplemental records”) has 11 types,
including trigger-driven INSERT/UPDATE/DELETE for completeness.
Catalog snapshot or per-event schema
DDL is the hard case. The consumer must know table X’s schema
when decoding X’s row events. Two approaches: (a) pull a catalog
snapshot before the first row event, (b) emit DDL events
inline so the consumer maintains its own schema cache. CUBRID
emits DDL inline as LOG_SUPPLEMENT_DDL records carrying the
SQL text.
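Approach (b) can be sketched as a consumer-side schema cache that is updated by inline DDL events and used to decode subsequent row events positionally. All names here (StreamEvent, SchemaCache, …) are hypothetical; the sketch only shows why in-order inline DDL makes the cache sufficient.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

struct StreamEvent
{
  bool is_ddl;
  std::string table;
  std::vector<std::string> new_columns;  // set when is_ddl
  std::vector<int> row;                  // set when !is_ddl (values, in column order)
};

struct SchemaCache
{
  std::unordered_map<std::string, std::vector<std::string>> columns;

  // DDL events arrive inline and in log order, so applying them as they
  // appear keeps the cache consistent with every following row event.
  void apply (const StreamEvent &ev)
  {
    if (ev.is_ddl)
      columns[ev.table] = ev.new_columns;
  }

  // Row events are decoded positionally against the cached column list.
  std::unordered_map<std::string, int> decode (const StreamEvent &ev) const
  {
    std::unordered_map<std::string, int> out;
    const auto &cols = columns.at (ev.table);
    for (std::size_t i = 0; i < cols.size () && i < ev.row.size (); i++)
      out[cols[i]] = ev.row[i];
    return out;
  }
};
```

The snapshot alternative (a) trades this per-consumer cache for a catalog round-trip at connect time, plus re-snapshots on every DDL.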
Transaction grouping at commit boundaries
Consumers want to see “all rows of transaction T in one batch,
sorted between the BEGIN and COMMIT events”. The producer must
buffer until COMMIT and then flush in order. The cost: latency
proportional to transaction duration. CUBRID’s CDC producer
uses a per-tran user-info map (tran_user) and emits events
keyed by trid.
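The buffering rule can be sketched with a trid-keyed pending map — a minimal model of the commit-boundary grouping described above, not CUBRID’s actual structures (TxnGrouper and its members are invented names):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

using Trid = int;

struct TxnGrouper
{
  std::unordered_map<Trid, std::vector<std::string>> pending;  // uncommitted
  std::vector<std::string> emitted;  // what the consumer eventually sees

  // DML is buffered per transaction until its outcome is known.
  void on_dml (Trid t, const std::string &ev) { pending[t].push_back (ev); }

  // COMMIT flushes the whole group, in order, between BEGIN and COMMIT markers.
  void on_commit (Trid t)
  {
    auto it = pending.find (t);
    if (it == pending.end ())
      return;
    emitted.push_back ("BEGIN " + std::to_string (t));
    for (auto &ev : it->second)
      emitted.push_back (ev);
    emitted.push_back ("COMMIT " + std::to_string (t));
    pending.erase (it);
  }

  // ABORT drops the group: nothing reaches consumers.
  void on_abort (Trid t) { pending.erase (t); }
};
```

The latency cost is visible in the model: nothing in `pending` is observable until the transaction’s COMMIT record arrives.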
Active-LSA-keep-alive
The log archive remover must not delete archives the consumer
still needs. Each engine has a watermark: PostgreSQL’s
replication_slot.confirmed_flush_lsn, MySQL’s binlog retention
days, CUBRID’s cdc_min_log_pageid_to_keep (declared in
log_manager.h:235).
Theory ↔ CUBRID mapping
| Theoretical concept | CUBRID name |
|---|---|
| Logical event record | LOG_SUPPLEMENTAL_INFO log record (cubrid-log-manager.md) |
| Event type enum | SUPPLEMENT_REC_TYPE (log_record.hpp) — 11 values |
| Forward log walker | log_reader class (log_reader.hpp) |
| Pull-style consumer entry | cdc_make_loginfo (log_manager.c:14835) |
| LSA validation | cdc_validate_lsa (log_manager.c:14402) |
| Time → LSA lookup | cdc_find_lsa (log_manager.c:14137) |
| DML event reconstruction | cdc_make_dml_loginfo (log_manager.c:12818) |
| Before-image fetch | cdc_get_undo_record (log_manager.c:11244) |
| Before+after image fetch | cdc_get_recdes (log_manager.c:11330) |
| Per-event metadata | CDC_LOGINFO_ENTRY { next_lsa, length, log_info } (log_impl.h) |
| Producer-side state machine | CDC_PRODUCER_STATE { WAIT, RUN, DEAD } + CDC_PRODUCER_REQUEST |
| Producer struct | CDC_PRODUCER (log_impl.h) with next_extraction_lsa, filters, queue |
| Consumer state machine | CDC_CONSUMER_REQUEST (log_impl.h) |
| Public C API | src/api/cubrid_log.c — DLL surface for external consumers |
| Archive-keep watermark | cdc_min_log_pageid_to_keep (log_manager.h:235) |
| Legacy HA replication daemon | la_apply_log_file (log_applier.c:8074) |
| Legacy commit replay | la_log_commit (log_applier.c:6531) |
| Legacy filter type | REPL_FILTER_TYPE { NONE, INCLUDE_TBL, EXCLUDE_TBL } (log_applier.h:48) |
| Retry-eligible error mask | LA_RETRY_ON_ERROR macro (log_applier.h:34) |
CUBRID’s Approach
The CDC-related code lives in two places: a modern CDC API
that the server exposes for pull-style external consumers
(cdc_* functions in log_manager.c, ~3000 lines), and a
legacy HA replication daemon that pushes log archives onto a
slave (la_* functions in log_applier.c, ~233KB / ~8000 lines).
Both walk the log forward via log_reader. We walk the modern
API first, then the legacy daemon.
Overall structure
```mermaid
flowchart TB
    subgraph SRV["Producer server"]
        DML["DML transaction"]
        LM["log_manager:\nLOG_∗UNDOREDO_DATA +\nLOG_SUPPLEMENTAL_INFO"]
        LRD["log_reader\n(forward walker)"]
        CDCP["cdc_∗ functions\n(producer side)"]
        CDCQ["produced queue"]
        DML --> LM
        LM --> LRD --> CDCP --> CDCQ
    end
    subgraph CON["Consumer (external)"]
        API["cubrid_log API\n(DLL surface)"]
        APP["consumer app\n(Kafka publisher,\nDebezium-like)"]
        CDCQ --> API --> APP
    end
    subgraph HA["Legacy HA replication"]
        LASRC["master log archive"]
        LAS["la_apply_log_file\n(client-mode daemon)"]
        LAREPL["la_apply_repl_log"]
        LACMT["la_log_commit"]
        SLAVE["slave server"]
        LASRC --> LAS --> LAREPL --> LACMT --> SLAVE
    end
    LM -.same WAL.-> LASRC
```
The figure shows two parallel forward-walking pipelines that
share the WAL. Modern CDC: the server hosts the producer
inside log_manager.c; consumers connect via the
cubrid_log DLL and pull batches. Legacy HA: a separate
client-mode process (cubrid_replication) tails archive
volumes from the master and replays records onto a connected
slave server.
LOG_SUPPLEMENTAL_INFO — the modern event format
Modern CDC does not reconstruct logical events from physical
log records. Instead, every DML at the producer emits an
auxiliary LOG_SUPPLEMENTAL_INFO record alongside the regular
LOG_*UNDOREDO_DATA. The supplemental record’s payload is
self-describing — its first field names one of 11 record kinds:
```cpp
// SUPPLEMENT_REC_TYPE — src/transaction/log_record.hpp:418
typedef enum supplement_rec_type
{
  LOG_SUPPLEMENT_TRAN_USER,       /* who: client user name */
  LOG_SUPPLEMENT_UNDO_RECORD,     /* raw undo image */
  LOG_SUPPLEMENT_DDL,             /* DDL statement text */

  /* DML records:
   * | LOG_REC_HEADER | TYPE | LENGTH | CLASS OID | UNDO LSA | REDO LSA | */
  LOG_SUPPLEMENT_INSERT,
  LOG_SUPPLEMENT_UPDATE,
  LOG_SUPPLEMENT_DELETE,

  /* Same shape, but emitted from a trigger action: */
  LOG_SUPPLEMENT_TRIGGER_INSERT,
  LOG_SUPPLEMENT_TRIGGER_UPDATE,
  LOG_SUPPLEMENT_TRIGGER_DELETE,

  LOG_SUPPLEMENT_LARGER_REC_TYPE,
} SUPPLEMENT_REC_TYPE;

struct log_rec_supplement
{
  SUPPLEMENT_REC_TYPE rec_type;
  int length;
};
```

DML records are intentionally indirect: they carry the LSA of
the underlying LOG_UNDOREDO_DATA rather than the row image
itself. The CDC producer follows the LSA back to the data record
and decodes it with cdc_get_recdes, materializing the
before/after row images on demand. The reason: the supplemental
record stays small (~50 bytes) regardless of row size, so its
log-bandwidth cost is bounded.
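The payoff of the indirection can be shown with a toy serializer. The field widths below are made up (they follow the layout comment, not CUBRID’s actual wire format); the point is that the serialized supplemental record is the same size whether the row it describes is 10 bytes or 10 MB:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch of a DML supplemental record per the layout
// | TYPE | LENGTH | CLASS OID | UNDO LSA | REDO LSA | — fixed-width fields,
// no row image. The LSAs are pointers back into the log.
struct SuppDml
{
  std::int32_t rec_type;    // e.g. LOG_SUPPLEMENT_UPDATE
  std::int32_t length;
  std::uint64_t class_oid;
  std::uint64_t undo_lsa;   // chase this for the before image
  std::uint64_t redo_lsa;   // chase this for the after image
};

static std::vector<unsigned char> serialize (const SuppDml &s)
{
  std::vector<unsigned char> buf (sizeof (SuppDml));
  std::memcpy (buf.data (), &s, sizeof (SuppDml));
  return buf;
}
```

Carrying the images inline would instead make the record O(row size), doubling the log bandwidth of every large write.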
DDL records carry the SQL text inline so consumers can replay ALTER/DROP without parsing data records.
CDC producer — pull-driven log walker
The producer’s state machine and configuration:

```cpp
// CDC_PRODUCER — src/transaction/log_impl.h
typedef enum cdc_producer_state
{
  CDC_PRODUCER_STATE_WAIT,
  CDC_PRODUCER_STATE_RUN,
  CDC_PRODUCER_STATE_DEAD
} CDC_PRODUCER_STATE;

typedef struct cdc_producer
{
  LOG_LSA next_extraction_lsa;          /* cursor */

  /* Filter configuration */
  int all_in_cond;                      /* match-all flag */
  int num_extraction_user;
  char **extraction_user;               /* whitelist of users */
  int num_extraction_class;
  UINT64 *extraction_classoids;         /* whitelist of class OIDs */

  volatile CDC_PRODUCER_STATE state;
  volatile CDC_PRODUCER_REQUEST request;

  int produced_queue_size;

  pthread_mutex_t lock;
  pthread_cond_t wait_cond;

  CDC_TEMP_LOGBUF temp_logbuf[2];       /* double-buffered log pages */

  std::unordered_map<TRANID, char *> tran_user;
  std::unordered_map<TRANID, int> tran_ignore;
} CDC_PRODUCER;
```

Each consumer’s request flows through:
- Initialization. cdc_initialize (log_manager.c:14957) sets up the producer instance, locks, condition variable, and double-buffered log page slots.
- Configuration. cdc_set_configuration (declared in log_manager.h:239) installs filters: which users to include, which class OIDs to extract, the timeout, the max item count.
- LSA seeding. Either (a) cdc_set_extraction_lsa for an explicit LSA, or (b) cdc_find_lsa (log_manager.c:14137) for “give me the LSA closest to this wall-clock time”. The second is what consumers use on first connect.
- LSA validation. cdc_validate_lsa (log_manager.c:14402) checks that the LSA is in a still-archived range. Returns an error if archives have been removed past it.
- Pull loop. cdc_get_logitem_info (declared in log_manager.h:241) returns the next batch of events. The producer thread runs cdc_make_loginfo (log_manager.c:14835) under the producer’s mutex, walking forward from next_extraction_lsa until the batch is full.
- Cleanup. cdc_finalize (log_manager.c:15087) tears down on disconnect.
The producer state transitions:
```mermaid
stateDiagram-v2
    [*] --> WAIT: cdc_initialize
    WAIT --> RUN: consumer wakeup
    RUN --> WAIT: consumer pause / queue full
    RUN --> DEAD: cdc_kill_producer
    WAIT --> DEAD: shutdown
    DEAD --> [*]: cdc_finalize
```
The CDC_PRODUCER_REQUEST enum is the request-from-consumer
signal: the consumer-side thread sets it, and the producer-side
thread reads it on its next tick. It implements
cdc_pause_producer, cdc_wakeup_producer, and cdc_kill_producer
without a tight ping-pong on the mutex.
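The set-then-read-on-next-tick pattern can be sketched standalone. The enum names mirror the document’s, but this is an illustration of the pattern, not CUBRID’s implementation; the condition variable lets a waiting producer be woken without spinning:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

enum class Request { NONE, PAUSE, WAKEUP, KILL };
enum class State { WAIT, RUN, DEAD };

struct Producer
{
  std::mutex lock;
  std::condition_variable cv;
  Request request = Request::NONE;
  State state = State::WAIT;

  // Consumer side: store the request and signal; no busy polling needed.
  void post (Request r)
  {
    std::lock_guard<std::mutex> g (lock);
    request = r;
    cv.notify_one ();
  }

  // One producer "tick": consume the pending request, transition state.
  void tick ()
  {
    std::lock_guard<std::mutex> g (lock);
    switch (request)
      {
      case Request::WAKEUP:
        if (state == State::WAIT) state = State::RUN;
        break;
      case Request::PAUSE:
        if (state == State::RUN) state = State::WAIT;
        break;
      case Request::KILL:
        state = State::DEAD;
        break;
      case Request::NONE:
        break;
      }
    request = Request::NONE;
  }
};
```

Because the request is latched under the mutex and consumed on the next tick, the two threads never need to rendezvous — the same decoupling the CDC_PRODUCER_REQUEST / CDC_CONSUMER_REQUEST pair provides.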
cdc_make_loginfo — the producer hot loop
```c
// cdc_make_loginfo — src/transaction/log_manager.c:14835 (sketch)
int
cdc_make_loginfo (THREAD_ENTRY *thread_p, LOG_LSA *start_lsa)
{
  /* Walk the log forward starting at start_lsa. */
  while (more_to_read && batch_not_full)
    {
      record_header = read_log_record_header (start_lsa);

      switch (record_header->type)
        {
        case LOG_COMMIT:
          flush_pending_events_for_tran (record_header->trid);
          /* emit COMMIT event */
          break;

        case LOG_ABORT:
          drop_pending_events_for_tran (record_header->trid);
          break;

        case LOG_SUPPLEMENTAL_INFO:
          supp = read_supplemental_record_payload (start_lsa);
          switch (supp.rec_type)
            {
            case LOG_SUPPLEMENT_INSERT:
            case LOG_SUPPLEMENT_UPDATE:
            case LOG_SUPPLEMENT_DELETE:
              /* indirect: chase to underlying LOG_UNDOREDO_DATA */
              cdc_get_recdes (..., supp.undo_lsa, supp.redo_lsa,
                              &undo_recdes, &redo_recdes);
              if (passes_filter (classoid, user))
                cdc_make_dml_loginfo (..., trid, user, dml_type, classoid,
                                      &undo_recdes, &redo_recdes,
                                      &log_info_entry, /*is_flashback=*/false);
              break;

            case LOG_SUPPLEMENT_DDL:
              /* DDL statement text is inline. */
              if (passes_filter)
                emit_ddl_event (statement_text);
              break;

            case LOG_SUPPLEMENT_TRAN_USER:
              tran_user_map[trid] = user_name;
              break;
            }
          break;

        case LOG_END_OF_LOG:
          break_out;
        }

      advance_to_next_record (&start_lsa);
    }
}
```

Two properties matter. (a) Events are buffered per-trid
until COMMIT — an abort drops them. The producer’s
std::unordered_map<TRANID, char *> tran_user keys per-trid
metadata; a parallel per-trid event list (not shown in the
header excerpt but referenced in the producer body) holds the
DML events themselves. (b) Filtering happens at the producer,
not the consumer — both class OIDs and user names are checked
before an event is queued. The cost: filtered-out events still
require a cdc_get_recdes call. The benefit: queue size stays
proportional to filtered events, not raw events.
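The producer-side filter check can be sketched as below. Field names echo CDC_PRODUCER, but the struct and its matching rule are illustrative — a guess at the semantics (empty whitelist = match anything, all_in_cond short-circuits), not a reading of the actual check:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct Filter
{
  bool all_in_cond = false;                        // match-all flag
  std::vector<std::string> extraction_user;        // empty = any user
  std::vector<std::uint64_t> extraction_classoids; // empty = any class

  // An event is queued only if both its class OID and its user pass.
  bool passes (std::uint64_t classoid, const std::string &user) const
  {
    if (all_in_cond)
      return true;
    bool user_ok = extraction_user.empty ();
    for (const auto &u : extraction_user)
      user_ok = user_ok || (u == user);
    bool class_ok = extraction_classoids.empty ();
    for (auto c : extraction_classoids)
      class_ok = class_ok || (c == classoid);
    return user_ok && class_ok;
  }
};
```

Running the check before enqueueing is what keeps the produced queue proportional to filtered events — at the price of having already materialised the row images for events that are then dropped.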
cdc_get_recdes and cdc_get_undo_record — backward chase
The DML supplemental records carry only the LSAs of the underlying data records. To materialize the row image, the producer chases:
```c
// cdc_get_undo_record — src/transaction/log_manager.c:11244 (signature)
SCAN_CODE cdc_get_undo_record (THREAD_ENTRY *thread_p, LOG_PAGE *log_page_p,
                               LOG_LSA lsa, RECDES *undo_recdes);

// cdc_get_recdes — src/transaction/log_manager.c:11330 (signature)
int cdc_get_recdes (THREAD_ENTRY *thread_p, LOG_LSA *undo_lsa,
                    RECDES *undo_recdes, LOG_LSA *redo_lsa,
                    RECDES *redo_recdes, bool is_flashback);
```

cdc_get_undo_record reads a LOG_*UNDOREDO_DATA record at the
given LSA, decompresses it (using a per-call LOG_ZIP context),
and returns the undo image as a RECDES. cdc_get_recdes is
the wrapper that fetches both undo and redo images in one
call — used for UPDATE events where the consumer wants
before/after pairs.
The is_flashback parameter switches behaviour: in flashback
mode (cubrid-flashback.md), the function tolerates broken
chains and missing pages; in CDC mode, those are errors.
cdc_make_dml_loginfo — pack a DML event
```c
// cdc_make_dml_loginfo — src/transaction/log_manager.c:12818 (signature)
int cdc_make_dml_loginfo (THREAD_ENTRY *thread_p, int trid, char *user,
                          CDC_DML_TYPE dml_type, OID classoid,
                          RECDES *undo_recdes, RECDES *redo_recdes,
                          CDC_LOGINFO_ENTRY *dml_entry, bool is_flashback);
```

The function takes the materialised undo/redo RECDES plus the
trid/user/dml_type metadata and packs them into a
CDC_LOGINFO_ENTRY. The entry’s wire format is what the
consumer-side library decodes:

```cpp
// CDC_LOGINFO_ENTRY — src/transaction/log_impl.h
typedef struct cdc_loginfo_entry
{
  LOG_LSA next_lsa;   /* LSA after this event — consumer cursor advance */
  int length;
  char *log_info;     /* serialised event payload */
} CDC_LOGINFO_ENTRY;
```

The consumer commits next_lsa downstream once its handler
succeeds; on reconnect the next pull starts at that LSA.
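An entry-shaped record is easy to round-trip: a cursor-advance LSA, a length, and an opaque payload. The layout below is invented for illustration (CUBRID’s actual serialisation is not documented here); it shows why the consumer needs nothing beyond next_lsa and the payload bytes:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct Entry { std::uint64_t next_lsa; std::string payload; };

// Pack as | next_lsa (8) | length (4) | payload (length) | — hypothetical.
static std::vector<unsigned char> pack (const Entry &e)
{
  std::uint32_t len = (std::uint32_t) e.payload.size ();
  std::vector<unsigned char> buf (sizeof e.next_lsa + sizeof len + len);
  std::memcpy (buf.data (), &e.next_lsa, sizeof e.next_lsa);
  std::memcpy (buf.data () + sizeof e.next_lsa, &len, sizeof len);
  std::memcpy (buf.data () + sizeof e.next_lsa + sizeof len,
               e.payload.data (), len);
  return buf;
}

static Entry unpack (const std::vector<unsigned char> &buf)
{
  Entry e;
  std::uint32_t len = 0;
  std::memcpy (&e.next_lsa, buf.data (), sizeof e.next_lsa);
  std::memcpy (&len, buf.data () + sizeof e.next_lsa, sizeof len);
  e.payload.assign ((const char *) buf.data () + sizeof e.next_lsa + sizeof len,
                    len);
  return e;
}
```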
Public API surface
External consumers don’t link against log_manager.c; they go
through src/api/cubrid_log.c, which wraps the cdc_*
functions as a stable C ABI. The DLL surface is documented in
raw/code-analysis/cubrid/storage/cdc/CUBRID API 문서_v1.6.docx.
A typical session:
```
cubrid_log_connect (...)
cubrid_log_set_extraction_filter (...)   /* tables, users */
cubrid_log_set_lsa_by_time (time, &lsa)  /* or cubrid_log_set_lsa (lsa) */
loop {
  n = cubrid_log_extract (&entries, &num)
  for each entry:
    handle_event (entry)
  cubrid_log_commit (entries[n - 1].next_lsa)
}
cubrid_log_disconnect ()
```

The consumer is responsible for downstream durability of
next_lsa; CUBRID does not track per-consumer offsets server-side.
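Since the server keeps no per-consumer offsets, the consumer must persist next_lsa itself. A minimal file-backed sketch (save_cursor / load_cursor are hypothetical helpers; a production consumer would write-temp-then-rename with fsync, or store the offset transactionally next to its side effects):

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Persist the committed cursor so a restart resumes after the last
// handled event. Illustration only — no atomic rename, no fsync.
static void save_cursor (const std::string &path, std::uint64_t lsa)
{
  std::ofstream out (path, std::ios::trunc);
  out << lsa;
}

// Returns 0 when no cursor has been saved yet; in that case the consumer
// seeds its position via the time-to-LSA lookup instead.
static std::uint64_t load_cursor (const std::string &path)
{
  std::ifstream in (path);
  std::uint64_t lsa = 0;
  in >> lsa;
  return lsa;
}
```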
Archive retention — cdc_min_log_pageid_to_keep
The log-archive remove daemon
(log_wakeup_remove_log_archive_daemon in cubrid-log-manager.md)
gates its deletion on the smallest LSA any active CDC consumer
or HA replication slave still depends on:
```c
// cdc_min_log_pageid_to_keep — src/transaction/log_manager.h:235
extern LOG_PAGEID cdc_min_log_pageid_to_keep ();
```

The function returns MAX_LOG_PAGEID when no consumer is
attached (so any archive can be removed) and the smallest
consumer cursor’s pageid when consumers are attached.
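That rule is a straight min-with-sentinel. A sketch (MAX_LOG_PAGEID is stood in by the numeric maximum; can_remove_archive is a hypothetical helper showing how the remove daemon would consult the watermark):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

using LogPageId = std::int64_t;
static const LogPageId MAX_LOG_PAGEID_SKETCH =
    std::numeric_limits<LogPageId>::max ();

// No consumers: return the maximum (everything removable). Otherwise:
// the smallest cursor pageid across attached consumers.
static LogPageId min_pageid_to_keep (const std::vector<LogPageId> &cursors)
{
  LogPageId keep = MAX_LOG_PAGEID_SKETCH;
  for (LogPageId p : cursors)
    if (p < keep)
      keep = p;
  return keep;
}

// The remove daemon may delete an archive only when every page it holds
// sits strictly below the watermark.
static bool can_remove_archive (LogPageId archive_last_pageid,
                                const std::vector<LogPageId> &cursors)
{
  return archive_last_pageid < min_pageid_to_keep (cursors);
}
```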
Legacy HA replication — la_* family
Before the CDC API, CUBRID supported HA replication via a client-mode daemon that ran on a slave host, fetched log archives from the master, and replayed them onto a local slave server. The daemon’s entry point:
```c
// la_apply_log_file — src/transaction/log_applier.c:8074 (signature)
int la_apply_log_file (const char *database_name, const char *log_path,
                       const int max_mem_size);
```

Internally it loops:
- Fetch the next log archive (via remote read or shared filesystem — log_path).
- Walk forward, calling la_apply_repl_log (log_applier.c:5739) per record.
- On LOG_COMMIT, call la_log_commit (log_applier.c:6531) — it records the applied-to LSA in the slave’s LA_HA_APPLY_INFO row so a daemon restart resumes from the right place.
- On retryable errors (deadlock, lock timeout, page latch abort — the LA_RETRY_ON_ERROR macro), the record is retried.
The daemon supports table-level filtering via
REPL_FILTER_TYPE:
```c
// REPL_FILTER_TYPE — src/transaction/log_applier.h:48
typedef enum
{
  REPL_FILTER_NONE,
  REPL_FILTER_INCLUDE_TBL,
  REPL_FILTER_EXCLUDE_TBL
} REPL_FILTER_TYPE;
```

The legacy path does not rely on supplemental log records;
it walks the regular LOG_*UNDOREDO_DATA records and
reconstructs row images by following them through the
catalog. This is why the daemon is more fragile across schema
changes than the modern CDC API: a DDL that changes a class’s
representation between master and slave can trip up the
replay logic.
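The retry-on-whitelisted-error shape of the replay loop can be sketched independently of CUBRID. The error codes below are made up; the real whitelist is the LA_RETRY_ON_ERROR macro:

```cpp
#include <cassert>
#include <functional>
#include <set>

enum ApplyError
{
  APPLY_OK = 0,
  ERR_LOCK_TIMEOUT = 1,   // retryable: transient lock contention
  ERR_DEADLOCK = 2,       // retryable: victim of a deadlock cycle
  ERR_CORRUPT = 3         // fatal: must propagate up
};

static const std::set<int> retryable = { ERR_LOCK_TIMEOUT, ERR_DEADLOCK };

// Apply one replicated record: re-attempt only whitelisted errors,
// up to max_retries times; anything else propagates immediately.
static int apply_with_retry (const std::function<int ()> &apply_once,
                             int max_retries)
{
  int err = apply_once ();
  while (err != APPLY_OK && retryable.count (err) && max_retries-- > 0)
    err = apply_once ();
  return err;
}
```

The whitelist design means a new error code fails hard by default — a conservative choice for a replayer whose silent skips would desynchronise master and slave.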
A modern CDC pull, end to end
```mermaid
sequenceDiagram
    participant App as Consumer app
    participant API as cubrid_log API
    participant Q as producer queue
    participant Prod as cdc producer thread
    participant LR as log_reader
    participant LM as log_manager (WAL)
    App->>API: "connect + set filter + set LSA"
    API->>Prod: "cdc_set_extraction_lsa (X)"
    loop each batch
        App->>API: "extract(N)"
        API->>Q: "pop N entries"
        alt queue starved
            Q->>Prod: "wake"
            Prod->>LR: "walk forward from next_extraction_lsa"
            LR->>LM: "fetch_page (LOG_CS read mode)"
            LR->>LM: "parse record header"
            alt "LOG_SUPPLEMENT_INSERT/UPDATE/DELETE"
                Prod->>LM: "cdc_get_recdes (chase undo+redo LSA)"
                Prod->>Prod: "cdc_make_dml_loginfo"
                Prod->>Q: "push CDC_LOGINFO_ENTRY"
            else LOG_SUPPLEMENT_DDL
                Prod->>Q: "push DDL entry"
            else LOG_COMMIT
                Prod->>Q: "flush per-trid buffer + push COMMIT"
            end
        end
        Q-->>API: "N entries"
        API-->>App: "entries"
        App->>API: "commit(entries[N-1].next_lsa)"
    end
```
Source Walkthrough
Anchor on symbol names, not line numbers.
Modern CDC API
- cdc_initialize (log_manager.c).
- cdc_finalize (log_manager.c).
- cdc_set_configuration (log_manager.h) — filters, timeout.
- cdc_set_extraction_lsa (log_manager.c) — explicit LSA seed.
- cdc_find_lsa (log_manager.c) — time → LSA lookup.
- cdc_validate_lsa (log_manager.c) — archive range check.
- cdc_make_loginfo (log_manager.c) — producer hot loop.
- cdc_get_logitem_info (log_manager.h) — consumer batch fetch.
- cdc_get_loginfo_metadata (log_manager.h) — peek without consuming.
- cdc_get_recdes (log_manager.c) — materialize undo+redo RECDES from indirect LSAs.
- cdc_get_undo_record (log_manager.c) — single-image variant.
- cdc_make_dml_loginfo (log_manager.c) — pack a DML event.
- cdc_min_log_pageid_to_keep (log_manager.h) — archive retention watermark.
- cdc_pause_producer / cdc_wakeup_producer / cdc_kill_producer (log_manager.h).
- cdc_pause_consumer / cdc_wakeup_consumer (log_manager.h).
- cdc_reinitialize_queue (log_manager.h).
- cdc_free_extraction_filter / cdc_cleanup / cdc_cleanup_consumer (log_manager.h).
- cdc_daemons_init / cdc_daemons_destroy (log_manager.h) — registration of the producer/consumer threads with the cubthread manager.
Producer state and types
- CDC_PRODUCER_STATE enum (log_impl.h).
- CDC_PRODUCER_REQUEST enum (log_impl.h).
- CDC_CONSUMER_REQUEST enum (log_impl.h).
- CDC_LOGINFO_ENTRY (log_impl.h).
- CDC_TEMP_LOGBUF (log_impl.h) — double-buffered log pages.
- CDC_PRODUCER (log_impl.h) — global producer state.
Log walker
- log_reader class (log_reader.hpp) — forward-walking log fetcher; shared with recovery and flashback.
- log_reader::set_lsa_and_fetch_page (log_reader.hpp).
- LOG_READ_ALIGN, LOG_READ_ADD_ALIGN, LOG_READ_ADVANCE_WHEN_DOESNT_FIT (log_reader.hpp) — the inline helpers that handle log-page boundary crossing.
Public API
- src/api/cubrid_log.c — DLL entry surface (cubrid_log_* functions).
Legacy HA replication
- la_apply_log_file (log_applier.c) — daemon entry.
- la_apply_repl_log (log_applier.c) — per-record dispatch.
- la_log_commit (log_applier.c) — commit-side bookkeeping on the slave.
- la_init (log_applier.c) — daemon init.
- la_init_recdes_pool / la_init_cache_pb / la_init_cache_log_buffer / la_init_repl_lists (log_applier.c) — internal init.
- la_init_ha_apply_info (log_applier.c) — initialise the per-slave applied-to bookkeeping row.
- la_get_applied_log_info / la_get_copied_log_info (log_applier.h) — diagnostic.
- LA_RETRY_ON_ERROR macro (log_applier.h) — retryable error mask.
- REPL_FILTER_TYPE (log_applier.h) — table-level filter.
- log_applier_sql_log.{c,h} — SQL-log emission (slave-side textual replay log for audit).
Position hints as of 2026-04-30
| Symbol | File | Line |
|---|---|---|
| SUPPLEMENT_REC_TYPE enum | log_record.hpp | 418 |
| log_rec_supplement (struct) | log_record.hpp | 434 |
| CDC_LOGINFO_ENTRY (struct) | log_impl.h | 808 |
| CDC_TEMP_LOGBUF (struct) | log_impl.h | 815 |
| CDC_PRODUCER (struct) | log_impl.h | 821 |
| cdc_get_undo_record | log_manager.c | 11244 |
| cdc_get_recdes | log_manager.c | 11330 |
| cdc_make_dml_loginfo | log_manager.c | 12818 |
| cdc_find_lsa | log_manager.c | 14137 |
| cdc_validate_lsa | log_manager.c | 14402 |
| cdc_set_extraction_lsa | log_manager.c | 14465 |
| cdc_make_loginfo | log_manager.c | 14835 |
| cdc_initialize | log_manager.c | 14957 |
| cdc_finalize | log_manager.c | 15087 |
| log_reader (class) | log_reader.hpp | 36 |
| la_init | log_applier.c | 6917 |
| la_apply_log_file | log_applier.c | 8074 |
| la_apply_repl_log | log_applier.c | 5739 |
| la_log_commit | log_applier.c | 6531 |
| LA_RETRY_ON_ERROR (macro) | log_applier.h | 34 |
| REPL_FILTER_TYPE (enum) | log_applier.h | 48 |
Source verification (as of 2026-04-30)
Verified facts

- Modern CDC and legacy HA are separate code paths sharing the same WAL. Verified by reading both log_applier.c (CS_MODE daemon) and the cdc_* functions in log_manager.c (server-side). No call from one to the other; the two pipelines walk the log independently.
- DML supplemental records are indirect — they carry the LSA of the underlying LOG_UNDOREDO_DATA, not the row image. Verified by reading the SUPPLEMENT_REC_TYPE enum at log_record.hpp:418 and the comment block at lines 423-424: “Contains lsa of logs which contain undo, redo raw record (UPDATE, DELETE, INSERT) | LOG_REC_HEADER | SUPPLEMENT_REC_TYPE | LENGTH | CLASS OID | UNDO LSA (sizeof LOG_LSA) | REDO LSA |”.
- The supplemental record set has 11 entries: TRAN_USER, UNDO_RECORD, DDL, INSERT, UPDATE, DELETE, TRIGGER_INSERT, TRIGGER_UPDATE, TRIGGER_DELETE, plus the bookend sentinels. Verified at log_record.hpp:418-432. LOG_SUPPLEMENT_LARGER_REC_TYPE is the upper bound for range checks; new types are appended before it.
- DDL supplemental records carry the SQL text inline. Verified at log_record.hpp:421 (“LOG_SUPPLEMENT_DDL”) plus LOG_TDES::ddl_sql_user_text at log_impl.h:564. The text is captured at DDL-execute time and emitted into the supplemental record so consumers don’t need to round-trip the catalog.
- The CDC producer maintains a per-trid tran_user map. Verified at log_impl.h:845 (std::unordered_map<TRANID, char *> tran_user). The map is keyed on transaction id; the value is the client user name recorded from LOG_SUPPLEMENT_TRAN_USER.
- The producer has a 3-state state machine. Verified at log_impl.h:787-791 (CDC_PRODUCER_STATE — WAIT, RUN, DEAD) plus the CDC_PRODUCER_REQUEST and CDC_CONSUMER_REQUEST enums for inter-thread signalling.
- Log fetch is a separate class (log_reader) reused across recovery, CDC, and flashback. Verified at log_reader.hpp:36. The class encapsulates LSA → page-fetch + alignment + skip semantics. The header carries a “remaining member after porting features from the LETS structure” comment indicating the class was originally extracted to be reusable.
- log_reader is not thread-safe. Verified at log_reader.hpp:31 (“NOTE: not thread safe”). Each producer thread / recovery worker / flashback session creates its own log_reader instance.
- HA replication’s retryable errors are an explicit whitelist macro. Verified at log_applier.h:34 (LA_RETRY_ON_ERROR). The list includes lock timeouts, unilateral aborts, page latch timeouts, deadlock cycles, TDE cipher errors. Non-listed errors propagate up.
- Archive retention defers to the smallest active consumer cursor. Verified at log_manager.h:235 (cdc_min_log_pageid_to_keep). The archive remove daemon (cubrid-log-manager.md) gates its deletion on this watermark.
- The producer uses double-buffered log pages. Verified at log_impl.h:842 (CDC_TEMP_LOGBUF temp_logbuf[2]). One buffer is the page being parsed; the other is prefetched for the next read. Halves the wait-for-IO latency on sequential walks.
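The two-slot buffering fact can be illustrated with a toy sequential walk. This sketch uses a synchronous fetch stand-in and invented names (PageSource, DoubleBuffer); the real benefit appears when the prefetch overlaps parsing, which the alternation below makes possible:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct PageSource                 // stand-in for a log volume
{
  int fetches = 0;
  std::vector<int> fetch (std::int64_t pageid)
  {
    fetches++;
    return std::vector<int> (4, (int) pageid);   // page content stub
  }
};

struct DoubleBuffer
{
  std::vector<int> slot[2];
  int cur = 0;

  // Load the current page and immediately prefetch the next one.
  void prime (PageSource &src, std::int64_t pageid)
  {
    slot[cur] = src.fetch (pageid);
    slot[1 - cur] = src.fetch (pageid + 1);
  }

  const std::vector<int> &current () const { return slot[cur]; }

  // Flip to the already-prefetched slot, then refill the vacated slot.
  void advance (PageSource &src, std::int64_t next_prefetch_pageid)
  {
    cur = 1 - cur;
    slot[1 - cur] = src.fetch (next_prefetch_pageid);
  }
};
```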
Open questions
- Per-trid event buffer location. The producer must buffer uncommitted events. The tran_user map is visible in the header, but the matching per-trid event-list structure is not surfaced — it lives in log_manager.c internals. Investigation path: read the cdc_make_loginfo body around line 14835.
- Behaviour on LOG_DUMMY_HA_SERVER_STATE records. HA server-state changes are logged. Does the modern CDC producer surface them as events, or skip them? Investigation path: grep for LOG_DUMMY_HA_SERVER_STATE switch arms in cdc_make_loginfo.
- Multi-statement DDL handling. A single ALTER TABLE can touch many catalog rows. Does the producer emit one DDL event per ALTER, or one per catalog mutation? Investigation path: search for LOG_SUPPLEMENT_DDL emit sites.
- Filter race on consumer reconfigure. If a consumer calls cdc_set_configuration mid-stream to add a class, what happens to events already in the queue from before the filter applied? Investigation path: read cdc_set_configuration and the producer-side filter check.
- HA replication slave failover. When a slave is promoted to master, the LA daemon must stop. Where is the stop signal? la_force_shutdown exists; who calls it? Investigation path: grep callers in src/connection/heartbeat*.
- log_applier_sql_log.{c,h} purpose. A separate SQL log the slave writes? An audit log? Replay verification? Investigation path: read the file pair.
Beyond CUBRID — Comparative Designs & Research Frontiers
Pointers, not analysis.
- Debezium (Kafka Connect) — pluggable CDC connectors for PostgreSQL, MySQL, MongoDB, Oracle. Wire format: Avro / JSON over Kafka. Comparing CUBRID’s CDC_LOGINFO_ENTRY wire format against Debezium’s would document the round-trip cost.
- PostgreSQL logical replication — output plugins (pgoutput, wal2json) emit logical events from a physical WAL walk. Conceptually closer to CUBRID’s legacy HA approach; the modern CDC API’s supplemental records are closer to Maxwell’s binlog row events.
- MySQL binlog (statement / row / mixed) — three modes for emitting events. Row mode is closest to CUBRID’s modern CDC; statement mode is closer to legacy HA replication.
- Debezium-style outbox pattern — applications write events into a dedicated outbox table; CDC publishes them. CUBRID’s LOG_SUPPLEMENT_DDL is a primitive form: the engine itself is the outbox writer.
- Kafka Connect as the consumer — CUBRID’s pull-style API fits a Kafka Connect source connector well; the consumer thread becomes the connect worker.
- Structured Streaming exactly-once semantics — the consumer’s commitment of next_lsa is at-least-once unless the consumer integrates 2PC. CUBRID’s 2PC (cubrid-2pc.md) could in principle be wired into the consumer commit; not currently surfaced in the API.
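The at-least-once consequence is concrete: if the consumer crashes after applying events but before committing next_lsa, the next pull redelivers them. A common mitigation short of 2PC is an idempotent sink keyed on the last-applied LSA — a sketch with invented names:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Lsa = std::uint64_t;

struct IdempotentSink
{
  Lsa last_applied = 0;
  std::vector<Lsa> applied;   // side effects actually performed

  // Events at or below the high-water mark are redeliveries: drop them.
  void apply (Lsa event_lsa)
  {
    if (event_lsa <= last_applied)
      return;
    applied.push_back (event_lsa);
    last_applied = event_lsa;
  }
};
```

This gives effectively-once delivery as long as last_applied is stored atomically with the side effects themselves (e.g. in the same downstream transaction).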
Sources
Raw analyses (raw/code-analysis/cubrid/storage/cdc/)
- CDC 진행상황 공유_v2.pptx (CDC progress sharing, v2)
- CDC 인수인계.pptx (CDC handover)
- ALTER, DROP.pptx
- DML Log sequence.pdf
- CUBRID API 문서_v1.6.docx (CUBRID API document, v1.6)
Sibling docs
- knowledge/code-analysis/cubrid/cubrid-log-manager.md — LOG_SUPPLEMENTAL_INFO + SUPPLEMENT_REC_TYPE source.
- knowledge/code-analysis/cubrid/cubrid-flashback.md — opposite direction (backward); shares log_reader. In progress in the same batch.
- knowledge/code-analysis/cubrid/cubrid-recovery-manager.md — shares the log_reader class with the redo path.
- knowledge/code-analysis/cubrid/cubrid-catalog-manager.md — _db_class etc. updates that emit LOG_SUPPLEMENT_DDL.
- knowledge/code-analysis/cubrid/cubrid-2pc.md — distributed commit semantics relevant to exactly-once consumption.
Textbook chapters (under knowledge/research/dbms-general/)
- Database Internals (Petrov), Ch. 5 §“Logging”, Ch. 13 §“Replication”.
- Designing Data-Intensive Applications (Kleppmann), Ch. 5 “Replication”, Ch. 11 “Stream Processing” — CDC framing.
CUBRID source (/data/hgryoo/references/cubrid/)
- src/transaction/log_manager.{c,h} — cdc_* modern API.
- src/transaction/log_applier.{c,h} — legacy HA daemon.
- src/transaction/log_applier_sql_log.{c,h} — slave SQL log.
- src/transaction/log_reader.{cpp,hpp} — forward walker.
- src/api/cubrid_log.c — DLL entry surface.