
CUBRID CDC — Streaming DML and DDL Through the WAL

Change Data Capture (CDC) is the practice of turning a database’s internal write log into a downstream event stream consumers can react to: stream into Kafka, mirror into a search index, materialise a denormalised view, audit changes for compliance. Database Internals (Petrov) does not have a dedicated CDC chapter, but the topic sits at the intersection of ch. 5 (Recovery, WAL) and ch. 13 (Distribution, replication).

Two implementation choices that the textbook model leaves open shape every CDC implementation and frame the rest of this document:

  1. Where do logical events come from? Two paths: (a) From the physical WAL — walk LOG_*UNDOREDO_DATA / LOG_MVCC_* records and reconstruct logical row images by correlating with the catalog. PostgreSQL’s pg_logical does this; older CUBRID HA replication (log_applier.c::la_apply_*) does this. (b) From explicit logical records the engine emits at DML time. The records are intentionally rich — table OID, before image, after image, transaction user, statement text. Consumers parse them with no catalog lookup. Modern CUBRID CDC takes this path: every DML emits a LOG_SUPPLEMENTAL_INFO record alongside the regular LOG_*UNDOREDO_DATA.
  2. Push or pull? Push (replication-style: a daemon on the producing server tails the log and ships records to consumers) or pull (CDC-API style: a consumer asks the server “give me the next batch from LSA X”). CUBRID supports both: HA replication is push (la_apply_log_file is a long-running daemon), CDC API is pull (cdc_make_loginfo is request/response).

Once the choices are named, every CUBRID-specific structure in this document either implements one of them or makes the access faster.

Every engine that ships CDC reaches for the same pattern set.

Forward log walking with a position cursor

The consumer carries an LSA cursor; the server returns records “from this LSA forward, up to N records or N bytes”. On each batch the consumer commits its cursor downstream, and on reconnect it resumes from there. PostgreSQL’s logical replication slot, Debezium’s offset, and MySQL’s binlog position are all the same idea.

The shared event vocabulary: INSERT, UPDATE, DELETE, BEGIN, COMMIT, ABORT, plus DDL (CREATE / ALTER / DROP). Most engines keep this small (5-7 types) so consumers don’t need an ever-growing parser. CUBRID’s SUPPLEMENT_REC_TYPE enum (cubrid-log-manager.md §“Supplemental records”) has 11 types, including trigger-driven INSERT/UPDATE/DELETE for completeness.

DDL is the hard case. The consumer must decode each row event against the schema the table had at that point in the log. Two approaches: (a) pull a catalog snapshot before the first row event, (b) emit DDL events inline so the consumer maintains its own schema cache. CUBRID emits DDL inline as LOG_SUPPLEMENT_DDL records carrying the SQL text.
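
A consumer following approach (b) typically keeps a small schema cache keyed by class OID and updates it from each inline DDL event before decoding later row events. A minimal consumer-side sketch — the cache shape and field names here are illustrative, not the CUBRID wire format:

// Hypothetical consumer-side schema cache fed by inline DDL events.
// The structure is illustrative; it is not a CUBRID type.
#include <cstdint>
#include <string>
#include <unordered_map>

struct SchemaCache
{
  struct Entry { std::string last_ddl; int version = 0; };

  /* Called for every DDL event, in log order, before any later row event. */
  void on_ddl_event (uint64_t classoid, const std::string &sql_text)
  {
    Entry &e = tables_[classoid];
    e.last_ddl = sql_text;      /* re-parse or replay downstream as needed */
    ++e.version;
  }

  int version_for (uint64_t classoid) const
  {
    auto it = tables_.find (classoid);
    return it == tables_.end () ? 0 : it->second.version;
  }

  std::unordered_map<uint64_t, Entry> tables_;
};

Because DDL and row events arrive in one ordered stream, the cache is always consistent with the row event being decoded — which is the point of inlining DDL.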

Consumers want to see “all rows of transaction T in one batch, sorted between the BEGIN and COMMIT events”. The producer must buffer until COMMIT and then flush in order. The cost: latency proportional to transaction duration. CUBRID’s CDC producer uses a per-tran user-info map (tran_user) and emits events keyed by trid.
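
A sketch of that buffering: events accumulate per transaction id and are released in log order at COMMIT, or dropped at ABORT. Illustrative only — CUBRID keeps the analogous per-trid state inside the producer in log_manager.c:

// Per-transaction event buffering: flush in order on COMMIT, drop on ABORT.
// Event is a stand-in for CDC_LOGINFO_ENTRY; the structure is illustrative.
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using Event = std::string;
using Emit  = std::function<void (const Event &)>;

struct TranBuffer
{
  void on_dml (int trid, Event ev)
  {
    pending_[trid].push_back (std::move (ev));
  }

  void on_commit (int trid, const Emit &emit)
  {
    emit (Event ("BEGIN"));
    for (const Event &ev : pending_[trid])
      emit (ev);                        /* log order preserved */
    emit (Event ("COMMIT"));
    pending_.erase (trid);
  }

  void on_abort (int trid)
  {
    pending_.erase (trid);              /* aborted work never reaches consumers */
  }

  std::unordered_map<int, std::vector<Event>> pending_;
};

The latency cost named above falls out directly: nothing for transaction T leaves the buffer until T’s COMMIT record is read.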

The log archive remover must not delete archives a consumer still needs. Each engine has a watermark: PostgreSQL’s pg_replication_slots.confirmed_flush_lsn, MySQL’s binlog retention period, CUBRID’s cdc_min_log_pageid_to_keep (declared in log_manager.h:235).

| Theoretical concept | CUBRID name |
| --- | --- |
| Logical event record | LOG_SUPPLEMENTAL_INFO log record (cubrid-log-manager.md) |
| Event type enum | SUPPLEMENT_REC_TYPE (log_record.hpp) — 11 values |
| Forward log walker | log_reader class (log_reader.hpp) |
| Pull-style consumer entry | cdc_make_loginfo (log_manager.c:14835) |
| LSA validation | cdc_validate_lsa (log_manager.c:14402) |
| Time → LSA lookup | cdc_find_lsa (log_manager.c:14137) |
| DML event reconstruction | cdc_make_dml_loginfo (log_manager.c:12818) |
| Before-image fetch | cdc_get_undo_record (log_manager.c:11244) |
| Before+after image fetch | cdc_get_recdes (log_manager.c:11330) |
| Per-event metadata | CDC_LOGINFO_ENTRY { next_lsa, length, log_info } (log_impl.h) |
| Producer-side state machine | CDC_PRODUCER_STATE { WAIT, RUN, DEAD } + CDC_PRODUCER_REQUEST |
| Producer struct | CDC_PRODUCER (log_impl.h) with next_extraction_lsa, filters, queue |
| Consumer state machine | CDC_CONSUMER_REQUEST (log_impl.h) |
| Public C API | src/api/cubrid_log.c — DLL surface for external consumers |
| Archive-keep watermark | cdc_min_log_pageid_to_keep (log_manager.h:235) |
| Legacy HA replication daemon | la_apply_log_file (log_applier.c:8074) |
| Legacy commit replay | la_log_commit (log_applier.c:6531) |
| Legacy filter type | REPL_FILTER_TYPE { NONE, INCLUDE_TBL, EXCLUDE_TBL } (log_applier.h:48) |
| Retry-eligible error mask | LA_RETRY_ON_ERROR macro (log_applier.h:34) |

The CDC-related code lives in two places: a modern CDC API that the server exposes for pull-style external consumers (cdc_* functions in log_manager.c, ~3000 lines), and a legacy HA replication daemon that pushes log archives onto a slave (la_* functions in log_applier.c, ~233KB / ~8000 lines). Both walk the log forward via log_reader. We walk the modern API first, then the legacy daemon.

flowchart TB
  subgraph SRV["Producer server"]
    DML["DML transaction"]
    LM["log_manager:\nLOG_∗UNDOREDO_DATA +\nLOG_SUPPLEMENTAL_INFO"]
    LRD["log_reader\n(forward walker)"]
    CDCP["cdc_∗ functions\n(producer side)"]
    CDCQ["produced queue"]
    DML --> LM
    LM --> LRD --> CDCP --> CDCQ
  end
  subgraph CON["Consumer (external)"]
    API["cubrid_log API\n(DLL surface)"]
    APP["consumer app\n(Kafka publisher,\nDebezium-like)"]
    CDCQ --> API --> APP
  end
  subgraph HA["Legacy HA replication"]
    LASRC["master log archive"]
    LAS["la_apply_log_file\n(client-mode daemon)"]
    LAREPL["la_apply_repl_log"]
    LACMT["la_log_commit"]
    SLAVE["slave server"]
    LASRC --> LAS --> LAREPL --> LACMT --> SLAVE
  end
  LM -.same WAL.-> LASRC

The figure shows two parallel forward-walking pipelines that share the WAL. In the modern CDC path, the server hosts the producer inside log_manager.c; consumers connect via the cubrid_log DLL and pull batches. In the legacy HA path, a separate client-mode process (cubrid_replication) tails archive volumes from the master and replays records onto a connected slave server.

LOG_SUPPLEMENTAL_INFO — the modern event format

Modern CDC does not reconstruct logical events from physical log records. Instead, every DML at the producer emits an auxiliary LOG_SUPPLEMENTAL_INFO record alongside the regular LOG_*UNDOREDO_DATA. The supplemental record’s payload is self-describing — its first field is one of 11 record kinds:

// SUPPLEMENT_REC_TYPE — src/transaction/log_record.hpp:418
typedef enum supplement_rec_type
{
  LOG_SUPPLEMENT_TRAN_USER,     /* who: client user name */
  LOG_SUPPLEMENT_UNDO_RECORD,   /* raw undo image */
  LOG_SUPPLEMENT_DDL,           /* DDL statement text */
  /* DML records:
   * | LOG_REC_HEADER | TYPE | LENGTH | CLASS OID | UNDO LSA | REDO LSA | */
  LOG_SUPPLEMENT_INSERT,
  LOG_SUPPLEMENT_UPDATE,
  LOG_SUPPLEMENT_DELETE,
  /* Same shape, but emitted from a trigger action: */
  LOG_SUPPLEMENT_TRIGGER_INSERT,
  LOG_SUPPLEMENT_TRIGGER_UPDATE,
  LOG_SUPPLEMENT_TRIGGER_DELETE,
  LOG_SUPPLEMENT_LARGER_REC_TYPE,
} SUPPLEMENT_REC_TYPE;

struct log_rec_supplement
{
  SUPPLEMENT_REC_TYPE rec_type;
  int length;
};

DML records are intentionally indirect: they carry the LSA of the underlying LOG_UNDOREDO_DATA rather than the row image itself. The CDC producer follows the LSA back to the data record and decodes it with cdc_get_recdes, materializing the before/after row images on-demand. The reason: the supplemental record stays small (~50 bytes) regardless of row size, so its log-bandwidth cost is bounded.
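
The bounded-size claim follows from the layout: the DML payload is a class OID plus two LSAs, independent of row width. A back-of-the-envelope check with stand-in types — the struct sizes below are assumptions for illustration, not the CUBRID definitions:

// Rough footprint of an indirect DML payload: CLASS OID + UNDO LSA + REDO LSA.
// FakeOid/FakeLsa are illustrative stand-ins, not CUBRID's OID/LOG_LSA.
#include <cstdint>
#include <cstdio>

struct FakeOid { int32_t pageid; int16_t slotid; int16_t volid; };
struct FakeLsa { int64_t pageid; int16_t offset; };

int main ()
{
  size_t payload = sizeof (FakeOid) + 2 * sizeof (FakeLsa);
  /* Plus the generic log record header and the supplement type/length fields;
     the total stays a few tens of bytes whether the row is 40 bytes or 4 MB. */
  std::printf ("indirect DML payload ~ %zu bytes\n", payload);
  return 0;
}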

DDL records carry the SQL text inline so consumers can replay ALTER/DROP without parsing data records.

The producer’s state machine and configuration:

// CDC_PRODUCER — src/transaction/log_impl.h
typedef enum cdc_producer_state
{
  CDC_PRODUCER_STATE_WAIT,
  CDC_PRODUCER_STATE_RUN,
  CDC_PRODUCER_STATE_DEAD
} CDC_PRODUCER_STATE;

typedef struct cdc_producer
{
  LOG_LSA next_extraction_lsa;          /* cursor */

  /* Filter configuration */
  int all_in_cond;                      /* match-all flag */
  int num_extraction_user;
  char **extraction_user;               /* whitelist of users */
  int num_extraction_class;
  UINT64 *extraction_classoids;         /* whitelist of class OIDs */

  volatile CDC_PRODUCER_STATE state;
  volatile CDC_PRODUCER_REQUEST request;

  int produced_queue_size;
  pthread_mutex_t lock;
  pthread_cond_t wait_cond;

  CDC_TEMP_LOGBUF temp_logbuf[2];       /* double-buffered log pages */

  std::unordered_map<TRANID, char *> tran_user;
  std::unordered_map<TRANID, int> tran_ignore;
} CDC_PRODUCER;

Each consumer’s request flows through:

  1. Initialization. cdc_initialize (log_manager.c:14957) sets up the producer instance, locks, condition variable, and double-buffered log page slots.
  2. Configuration. cdc_set_configuration (declared in log_manager.h:239) installs filters: which users to include, which class OIDs to extract, the timeout, the max item count.
  3. LSA seeding. Either (a) cdc_set_extraction_lsa for an explicit LSA, or (b) cdc_find_lsa (log_manager.c:14137) for “give me the LSA closest to this wall-clock time”. The second is what consumers use on first connect.
  4. LSA validation. cdc_validate_lsa (log_manager.c:14402) checks that the LSA is in a still-archived range. Returns error if archives have been removed past it.
  5. Pull loop. cdc_get_logitem_info (declared in log_manager.h:241) returns the next batch of events. The producer thread runs cdc_make_loginfo (log_manager.c:14835) under the producer’s mutex, walking forward from next_extraction_lsa until the batch is full.
  6. Cleanup. cdc_finalize (log_manager.c:15087) tears down on disconnect.

The producer state transitions:

stateDiagram-v2
  [*] --> WAIT: cdc_initialize
  WAIT --> RUN: consumer wakeup
  RUN --> WAIT: consumer pause / queue full
  RUN --> DEAD: cdc_kill_producer
  WAIT --> DEAD: shutdown
  DEAD --> [*]: cdc_finalize

The companion CDC_PRODUCER_REQUEST enum is the request-from-consumer signal: the consumer-side thread sets it, and the producer-side thread reads it on its next tick. It is how cdc_pause_producer, cdc_wakeup_producer, and cdc_kill_producer are implemented without a tight ping-pong on the mutex.
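
The hand-off is the standard request-word-plus-condition-variable pattern: the consumer side writes a request under the producer’s mutex and signals; the producer re-checks the request whenever it wakes. A minimal sketch under that assumption — only the member names lock, wait_cond, and request mirror CDC_PRODUCER; the logic is illustrative:

// Consumer-to-producer signalling via a request word plus condition variable.
// Simplified sketch; the real enums and handling live in log_impl.h / log_manager.c.
#include <pthread.h>

enum request_t { REQ_NONE, REQ_PAUSE, REQ_WAKEUP, REQ_KILL };

struct producer_ctl
{
  pthread_mutex_t lock;
  pthread_cond_t wait_cond;
  volatile request_t request;
};

/* Consumer side: post a request and wake the producer if it is sleeping. */
static void post_request (producer_ctl *p, request_t req)
{
  pthread_mutex_lock (&p->lock);
  p->request = req;
  pthread_cond_signal (&p->wait_cond);
  pthread_mutex_unlock (&p->lock);
}

/* Producer side: sleep until a request arrives, consume it, act on it. */
static request_t wait_for_request (producer_ctl *p)
{
  pthread_mutex_lock (&p->lock);
  while (p->request == REQ_NONE)
    pthread_cond_wait (&p->wait_cond, &p->lock);
  request_t req = p->request;
  p->request = REQ_NONE;
  pthread_mutex_unlock (&p->lock);
  return req;
}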

cdc_make_loginfo — the producer hot loop

// cdc_make_loginfo — src/transaction/log_manager.c:14835 (sketch)
int
cdc_make_loginfo (THREAD_ENTRY *thread_p, LOG_LSA *start_lsa)
{
  /* Walk the log forward starting at start_lsa. */
  while (more_to_read && batch_not_full)
    {
      record_header = read_log_record_header (start_lsa);
      switch (record_header->type)
        {
        case LOG_COMMIT:
          flush_pending_events_for_tran (record_header->trid);
          /* emit COMMIT event */
          break;
        case LOG_ABORT:
          drop_pending_events_for_tran (record_header->trid);
          break;
        case LOG_SUPPLEMENTAL_INFO:
          supp = read_supplemental_record_payload (start_lsa);
          switch (supp.rec_type)
            {
            case LOG_SUPPLEMENT_INSERT:
            case LOG_SUPPLEMENT_UPDATE:
            case LOG_SUPPLEMENT_DELETE:
              /* indirect: chase to underlying LOG_UNDOREDO_DATA */
              cdc_get_recdes (..., supp.undo_lsa, supp.redo_lsa,
                              &undo_recdes, &redo_recdes);
              if (passes_filter (classoid, user))
                cdc_make_dml_loginfo (..., trid, user, dml_type, classoid,
                                      &undo_recdes, &redo_recdes,
                                      &log_info_entry, /*is_flashback=*/false);
              break;
            case LOG_SUPPLEMENT_DDL:
              /* DDL statement text is inline. */
              if (passes_filter)
                emit_ddl_event (statement_text);
              break;
            case LOG_SUPPLEMENT_TRAN_USER:
              tran_user_map[trid] = user_name;
              break;
            }
          break;
        case LOG_END_OF_LOG:
          more_to_read = false;         /* stop walking */
          break;
        }
      advance_to_next_record (&start_lsa);
    }
  return NO_ERROR;
}

Two properties matter here. (a) Events are buffered per trid until COMMIT — an abort drops them. The producer’s std::unordered_map<TRANID, char *> tran_user keys per-trid metadata; a parallel per-trid event list (not shown in the header excerpt but referenced in the producer body) holds the DML events themselves. (b) Filtering happens at the producer, not the consumer — both class OIDs and user names are checked before an event is queued. The cost: filtered-out events still require a cdc_get_recdes call. The benefit: queue size stays proportional to filtered events, not raw events.
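
A plausible shape for that producer-side check, using the whitelist fields shown in CDC_PRODUCER above (all_in_cond, extraction_user, extraction_classoids); the matching rules here are assumed and may differ from log_manager.c:

// Producer-side filtering sketch driven by the CDC_PRODUCER whitelist fields.
// Only the field names come from log_impl.h; the matching rules are assumed.
#include <cstdint>
#include <cstring>

struct filter_cfg
{
  int all_in_cond;                /* match-all flag */
  int num_extraction_user;
  char **extraction_user;         /* user whitelist */
  int num_extraction_class;
  uint64_t *extraction_classoids; /* class OID whitelist */
};

static bool passes_filter (const filter_cfg *f, uint64_t classoid, const char *user)
{
  if (f->all_in_cond)
    return true;

  bool user_ok = (f->num_extraction_user == 0);   /* empty list = no user filter (assumption) */
  for (int i = 0; !user_ok && i < f->num_extraction_user; i++)
    user_ok = (strcmp (f->extraction_user[i], user) == 0);

  bool class_ok = (f->num_extraction_class == 0);
  for (int i = 0; !class_ok && i < f->num_extraction_class; i++)
    class_ok = (f->extraction_classoids[i] == classoid);

  return user_ok && class_ok;
}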

cdc_get_recdes and cdc_get_undo_record — backward chase

The DML supplemental records carry only the LSAs of the underlying data records. To materialize the row image, the producer chases:

// cdc_get_undo_record — src/transaction/log_manager.c:11244 (signature)
SCAN_CODE cdc_get_undo_record (THREAD_ENTRY *thread_p,
                               LOG_PAGE *log_page_p,
                               LOG_LSA lsa,
                               RECDES *undo_recdes);

// cdc_get_recdes — src/transaction/log_manager.c:11330 (signature)
int cdc_get_recdes (THREAD_ENTRY *thread_p,
                    LOG_LSA *undo_lsa, RECDES *undo_recdes,
                    LOG_LSA *redo_lsa, RECDES *redo_recdes,
                    bool is_flashback);

cdc_get_undo_record reads a LOG_*UNDOREDO_DATA record at the given LSA, decompresses it (using a per-call LOG_ZIP context), and returns the undo image as a RECDES. cdc_get_recdes is the wrapper that fetches both undo and redo images in one call — used for UPDATE events where the consumer wants before/after pairs.

The is_flashback parameter switches behaviour: in flashback mode (cubrid-flashback.md), the function tolerates broken chains and missing pages; in CDC mode, those are errors.

// cdc_make_dml_loginfo — src/transaction/log_manager.c:12818 (signature)
int cdc_make_dml_loginfo (THREAD_ENTRY *thread_p,
                          int trid, char *user, CDC_DML_TYPE dml_type,
                          OID classoid,
                          RECDES *undo_recdes, RECDES *redo_recdes,
                          CDC_LOGINFO_ENTRY *dml_entry,
                          bool is_flashback);

The function takes the materialised undo/redo RECDES plus the trid/user/dml_type metadata and packs them into a CDC_LOGINFO_ENTRY. The entry’s wire format is what the consumer-side library decodes:

// CDC_LOGINFO_ENTRY — src/transaction/log_impl.h
typedef struct cdc_loginfo_entry
{
  LOG_LSA next_lsa;   /* LSA after this event — consumer cursor advance */
  int length;
  char *log_info;     /* serialised event payload */
} CDC_LOGINFO_ENTRY;

The consumer commits next_lsa downstream once its handler succeeds; on reconnect the next pull starts at that LSA.

External consumers don’t link against log_manager.c; they go through src/api/cubrid_log.c, which wraps the cdc_* functions as a stable C ABI. The DLL surface is documented in raw/code-analysis/cubrid/storage/cdc/CUBRID API 문서_v1.6.docx. A typical session:

cubrid_log_connect (...)
cubrid_log_set_extraction_filter (...)        /* tables, users */
cubrid_log_set_lsa_by_time (time, &lsa)       /* or cubrid_log_set_lsa (lsa) */
loop {
  n = cubrid_log_extract (&entries, &num)
  for each entry:
    handle_event (entry)
  cubrid_log_commit (entries[n-1].next_lsa)
}
cubrid_log_disconnect ()

The consumer is responsible for downstream durability of next_lsa; CUBRID does not track per-consumer offsets server-side.
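
Because the server keeps no per-consumer offset, a crash-safe consumer persists next_lsa itself before acknowledging downstream and reseeds from that value on reconnect. A minimal file-based sketch; the path, text encoding, and LSA stand-in are illustrative:

// Durable consumer-side cursor: write-then-rename so a crash leaves either the
// old or the new LSA, never a torn file. Encoding and types are illustrative.
#include <cstdint>
#include <cstdio>
#include <string>

struct cursor { int64_t pageid; int16_t offset; };   /* stand-in for LOG_LSA */

static bool save_cursor (const std::string &path, const cursor &c)
{
  std::string tmp = path + ".tmp";
  FILE *fp = std::fopen (tmp.c_str (), "w");
  if (fp == nullptr)
    return false;
  std::fprintf (fp, "%lld|%d\n", (long long) c.pageid, (int) c.offset);
  std::fflush (fp);                 /* a real consumer would also fsync here */
  std::fclose (fp);
  return std::rename (tmp.c_str (), path.c_str ()) == 0;
}

static bool load_cursor (const std::string &path, cursor *c)
{
  FILE *fp = std::fopen (path.c_str (), "r");
  if (fp == nullptr)
    return false;                   /* first run: seed via cubrid_log_set_lsa_by_time */
  long long pageid = 0;
  int offset = 0;
  bool ok = std::fscanf (fp, "%lld|%d", &pageid, &offset) == 2;
  std::fclose (fp);
  c->pageid = pageid;
  c->offset = (int16_t) offset;
  return ok;
}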

Archive retention — cdc_min_log_pageid_to_keep

The log-archive remove daemon (log_wakeup_remove_log_archive_daemon in cubrid-log-manager.md) gates its deletion on the smallest LSA any active CDC consumer or HA replication slave still depends on:

// cdc_min_log_pageid_to_keep — src/transaction/log_manager.h:235
extern LOG_PAGEID cdc_min_log_pageid_to_keep ();

The function returns MAX_LOG_PAGEID when no consumer is attached (so any archive can be removed) and the smallest consumer cursor’s pageid when consumers are attached.
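
That gating reduces to one comparison per archive volume. A self-contained sketch — both helpers below are stand-ins: the real cdc_min_log_pageid_to_keep lives in log_manager.c/h, and archive_last_pageid is hypothetical:

// Archive-retention gating sketch: never remove an archive whose pages a CDC
// consumer (or HA slave) may still need. All symbols here are stand-ins.
#include <cstdint>
#include <limits>

typedef int64_t LOG_PAGEID;

static LOG_PAGEID cdc_min_log_pageid_to_keep_stub ()
{
  /* The real function returns MAX_LOG_PAGEID when no consumer is attached,
     otherwise the smallest attached consumer's cursor page id. */
  return std::numeric_limits<LOG_PAGEID>::max ();
}

static LOG_PAGEID archive_last_pageid (int arv_num)    /* hypothetical helper */
{
  return (LOG_PAGEID) arv_num * 1000;                  /* dummy value */
}

static bool archive_is_removable (int arv_num)
{
  return archive_last_pageid (arv_num) < cdc_min_log_pageid_to_keep_stub ();
}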

Before the CDC API, CUBRID supported HA replication via a client-mode daemon that ran on a slave host, fetched log archives from the master, and replayed them onto a local slave server. The daemon’s entry point:

// la_apply_log_file — src/transaction/log_applier.c:8074 (signature)
int la_apply_log_file (const char *database_name,
                       const char *log_path,
                       const int max_mem_size);

Internally it loops:

  1. Fetch the next log archive (via remote read or shared filesystem — log_path).
  2. Walk forward, calling la_apply_repl_log (log_applier.c:5739) per record.
  3. On LOG_COMMIT, call la_log_commit (log_applier.c:6531) — records the applied-to LSA in the slave’s LA_HA_APPLY_INFO row so a daemon restart resumes from the right place.
  4. On retryable errors (deadlock, lock timeout, page latch abort — LA_RETRY_ON_ERROR macro), the record is retried.
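
The retry behaviour in step 4 is an apply-until-success loop gated by a transient-error whitelist. A sketch of its shape — the error codes, their values, and apply_one_repl_record are illustrative stand-ins; only the idea of the LA_RETRY_ON_ERROR whitelist comes from log_applier.h:

// Retry loop gated by a transient-error whitelist, in the spirit of LA_RETRY_ON_ERROR.
// Error codes, values, and apply_one_repl_record() are illustrative stand-ins.
#include <unistd.h>

#define NO_ERROR                      0
#define ER_LK_UNILATERALLY_ABORTED (-100)   /* illustrative values */
#define ER_LK_OBJECT_TIMEOUT       (-101)
#define ER_DEADLOCK_CYCLE          (-102)

#define RETRY_ON_ERROR(e) \
  ((e) == ER_LK_UNILATERALLY_ABORTED || (e) == ER_LK_OBJECT_TIMEOUT \
   || (e) == ER_DEADLOCK_CYCLE)

static int apply_one_repl_record (void *record)   /* stand-in for la_apply_repl_log */
{
  (void) record;
  return NO_ERROR;
}

static int apply_with_retry (void *record)
{
  int error;
  do
    {
      error = apply_one_repl_record (record);
      if (RETRY_ON_ERROR (error))
        usleep (100 * 1000);        /* back off, then retry the same record */
    }
  while (RETRY_ON_ERROR (error));
  return error;                     /* non-retryable errors propagate up */
}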

The daemon supports table-level filtering via REPL_FILTER_TYPE:

// REPL_FILTER_TYPE — src/transaction/log_applier.h:48
typedef enum
{
  REPL_FILTER_NONE,
  REPL_FILTER_INCLUDE_TBL,
  REPL_FILTER_EXCLUDE_TBL
} REPL_FILTER_TYPE;

The legacy path does not rely on supplemental log records; it walks the regular LOG_*UNDOREDO_DATA records and reconstructs row images by following them through the catalog. This is why the daemon is more fragile across schema changes than the modern CDC API: a DDL that changes a class’s representation between master and slave can trip up the replay logic.

sequenceDiagram
  participant App as Consumer app
  participant API as cubrid_log API
  participant Q as producer queue
  participant Prod as cdc producer thread
  participant LR as log_reader
  participant LM as log_manager (WAL)

  App->>API: "connect + set filter + set LSA"
  API->>Prod: "cdc_set_extraction_lsa (X)"
  loop each batch
    App->>API: "extract(N)"
    API->>Q: "pop N entries"
    alt queue starved
      Q->>Prod: "wake"
      Prod->>LR: "walk forward from next_extraction_lsa"
      LR->>LM: "fetch_page (LOG_CS read mode)"
      LR->>LM: "parse record header"
      alt "LOG_SUPPLEMENT_INSERT/UPDATE/DELETE"
        Prod->>LM: "cdc_get_recdes (chase undo+redo LSA)"
        Prod->>Prod: "cdc_make_dml_loginfo"
        Prod->>Q: "push CDC_LOGINFO_ENTRY"
      else LOG_SUPPLEMENT_DDL
        Prod->>Q: "push DDL entry"
      else LOG_COMMIT
        Prod->>Q: "flush per-trid buffer + push COMMIT"
      end
    end
    Q-->>API: "N entries"
    API-->>App: "entries"
    App->>API: "commit(entries[N-1].next_lsa)"
  end

Anchor on symbol names, not line numbers.

  • cdc_initialize (log_manager.c).
  • cdc_finalize (log_manager.c).
  • cdc_set_configuration (log_manager.h) — filters, timeout.
  • cdc_set_extraction_lsa (log_manager.c) — explicit LSA seed.
  • cdc_find_lsa (log_manager.c) — time → LSA lookup.
  • cdc_validate_lsa (log_manager.c) — archive range check.
  • cdc_make_loginfo (log_manager.c) — producer hot loop.
  • cdc_get_logitem_info (log_manager.h) — consumer batch fetch.
  • cdc_get_loginfo_metadata (log_manager.h) — peek without consuming.
  • cdc_get_recdes (log_manager.c) — materialize undo+redo RECDES from indirect LSAs.
  • cdc_get_undo_record (log_manager.c) — single-image variant.
  • cdc_make_dml_loginfo (log_manager.c) — pack a DML event.
  • cdc_min_log_pageid_to_keep (log_manager.h) — archive retention watermark.
  • cdc_pause_producer / cdc_wakeup_producer / cdc_kill_producer (log_manager.h).
  • cdc_pause_consumer / cdc_wakeup_consumer (log_manager.h).
  • cdc_reinitialize_queue (log_manager.h).
  • cdc_free_extraction_filter / cdc_cleanup / cdc_cleanup_consumer (log_manager.h).
  • cdc_daemons_init / cdc_daemons_destroy (log_manager.h) — registration of the producer/consumer threads with the cubthread manager.
  • CDC_PRODUCER_STATE enum (log_impl.h).
  • CDC_PRODUCER_REQUEST enum (log_impl.h).
  • CDC_CONSUMER_REQUEST enum (log_impl.h).
  • CDC_LOGINFO_ENTRY (log_impl.h).
  • CDC_TEMP_LOGBUF (log_impl.h) — double-buffered log pages.
  • CDC_PRODUCER (log_impl.h) — global producer state.
  • log_reader class (log_reader.hpp) — forward-walking log fetcher; shared with recovery and flashback.
  • log_reader::set_lsa_and_fetch_page (log_reader.hpp).
  • LOG_READ_ALIGN, LOG_READ_ADD_ALIGN, LOG_READ_ADVANCE_WHEN_DOESNT_FIT (log_reader.hpp) — the inline helpers that handle log-page boundary crossing.
  • src/api/cubrid_log.c — DLL entry surface (cubrid_log_* functions).
  • la_apply_log_file (log_applier.c) — daemon entry.
  • la_apply_repl_log (log_applier.c) — per-record dispatch.
  • la_log_commit (log_applier.c) — commit-side bookkeeping on the slave.
  • la_init (log_applier.c) — daemon init.
  • la_init_recdes_pool / la_init_cache_pb / la_init_cache_log_buffer / la_init_repl_lists (log_applier.c) — internal init.
  • la_init_ha_apply_info (log_applier.c) — initialise the per-slave applied-to bookkeeping row.
  • la_get_applied_log_info / la_get_copied_log_info (log_applier.h) — diagnostic.
  • LA_RETRY_ON_ERROR macro (log_applier.h) — retryable error mask.
  • REPL_FILTER_TYPE (log_applier.h) — table-level filter.
  • log_applier_sql_log.{c,h} — SQL-log emission (slave-side textual replay log for audit).

| Symbol | File | Line |
| --- | --- | --- |
| SUPPLEMENT_REC_TYPE enum | log_record.hpp | 418 |
| log_rec_supplement (struct) | log_record.hpp | 434 |
| CDC_LOGINFO_ENTRY (struct) | log_impl.h | 808 |
| CDC_TEMP_LOGBUF (struct) | log_impl.h | 815 |
| CDC_PRODUCER (struct) | log_impl.h | 821 |
| cdc_get_undo_record | log_manager.c | 11244 |
| cdc_get_recdes | log_manager.c | 11330 |
| cdc_make_dml_loginfo | log_manager.c | 12818 |
| cdc_find_lsa | log_manager.c | 14137 |
| cdc_validate_lsa | log_manager.c | 14402 |
| cdc_set_extraction_lsa | log_manager.c | 14465 |
| cdc_make_loginfo | log_manager.c | 14835 |
| cdc_initialize | log_manager.c | 14957 |
| cdc_finalize | log_manager.c | 15087 |
| log_reader (class) | log_reader.hpp | 36 |
| la_init | log_applier.c | 6917 |
| la_apply_log_file | log_applier.c | 8074 |
| la_apply_repl_log | log_applier.c | 5739 |
| la_log_commit | log_applier.c | 6531 |
| LA_RETRY_ON_ERROR (macro) | log_applier.h | 34 |
| REPL_FILTER_TYPE (enum) | log_applier.h | 48 |

  • Modern CDC and legacy HA are separate code paths sharing the same WAL. Verified by reading both log_applier.c (CS_MODE daemon) and cdc_* functions in log_manager.c (server-side). No call from one to the other; the two pipelines walk the log independently.

  • DML supplemental records are indirect — they carry the LSA of the underlying LOG_UNDOREDO_DATA, not the row image. Verified by reading the SUPPLEMENT_REC_TYPE enum at log_record.hpp:418 and the comment block at lines 423-424: “Contains lsa of logs which contain undo, redo raw record (UPDATE, DELETE, INSERT) | LOG_REC_HEADER | SUPPLEMENT_REC_TYPE | LENGTH | CLASS OID | UNDO LSA (sizeof LOG_LSA) | REDO LSA |”.

  • The supplemental record set has 11 entries: TRAN_USER, UNDO_RECORD, DDL, INSERT, UPDATE, DELETE, TRIGGER_INSERT, TRIGGER_UPDATE, TRIGGER_DELETE, plus the bookend sentinels. Verified at log_record.hpp:418-432. LOG_SUPPLEMENT_LARGER_REC_TYPE is the upper bound for range checks; new types are appended before it.

  • DDL supplemental records carry the SQL text inline. Verified at log_record.hpp:421 (“LOG_SUPPLEMENT_DDL”) plus LOG_TDES::ddl_sql_user_text at log_impl.h:564. The text is captured at DDL-execute time and emitted into the supplemental record so consumers don’t need to round-trip the catalog.

  • The CDC producer maintains a per-trid tran_user map. Verified at log_impl.h:845 (std::unordered_map<TRANID, char *> tran_user). The map is keyed on transaction id, value is the client user name recorded from LOG_SUPPLEMENT_TRAN_USER.

  • The producer has a 3-state state machine. Verified at log_impl.h:787-791 (CDC_PRODUCER_STATE — WAIT, RUN, DEAD) plus the CDC_PRODUCER_REQUEST and CDC_CONSUMER_REQUEST enums for inter-thread signalling.

  • Log fetch is a separate class (log_reader) reused across recovery, CDC, and flashback. Verified at log_reader.hpp:36. The class encapsulates LSA → page-fetch + alignment + skip semantics. The header carries a “remaining member after porting features from the LETS structure” comment indicating the class was originally extracted to be reusable.

  • log_reader is not thread-safe. Verified at log_reader.hpp:31 (“NOTE: not thread safe”). Each producer thread / recovery worker / flashback session creates its own log_reader instance.

  • HA replication’s retryable errors are an explicit whitelist macro. Verified at log_applier.h:34 (LA_RETRY_ON_ERROR). The list includes lock timeouts, unilateral aborts, page latch timeouts, deadlock cycles, TDE cipher errors. Non-listed errors propagate up.

  • Archive retention defers to the smallest active consumer cursor. Verified at log_manager.h:235 (cdc_min_log_pageid_to_keep). The archive remove daemon (cubrid-log-manager.md) gates its deletion on this watermark.

  • The producer uses double-buffered log pages. Verified at log_impl.h:842 (CDC_TEMP_LOGBUF temp_logbuf[2]). One buffer is the page being parsed; the other is prefetched for the next read. Halves the wait-for-IO latency on sequential walks.

  1. Per-trid event buffer location. The producer must buffer uncommitted events. The tran_user map is visible in the header, but the matching per-trid event-list structure is not surfaced — it lives in log_manager.c internals. Investigation path: read cdc_make_loginfo body around line 14835.

  2. Behaviour on LOG_DUMMY_HA_SERVER_STATE records. HA server-state changes are logged. Does the modern CDC producer surface them as events, or skip? Investigation path: grep for LOG_DUMMY_HA_SERVER_STATE switch arms in cdc_make_loginfo.

  3. Multi-statement DDL handling. A single ALTER TABLE can touch many catalog rows. Does the producer emit one DDL event per ALTER, or one per catalog mutation? Investigation path: search for LOG_SUPPLEMENT_DDL emit sites.

  4. Filter race on consumer reconfigure. If a consumer calls cdc_set_configuration mid-stream to add a class, what happens to events already in the queue from before the filter applied? Investigation path: read cdc_set_configuration and the producer-side filter check.

  5. HA replication slave failover. When a slave is promoted to master, the LA daemon must stop. Where is the stop signal? la_force_shutdown exists; who calls it? Investigation path: grep callers in src/connection/heartbeat*.

  6. log_applier_sql_log.{c,h} purpose. A separate SQL log the slave writes? Audit log? Replay verification? Investigation path: read the file pair.

Beyond CUBRID — Comparative Designs & Research Frontiers

Pointers, not analysis.

  • Debezium (Kafka Connect) — pluggable CDC connectors for PG, MySQL, MongoDB, Oracle. Wire format: Avro / JSON over Kafka. Comparing CUBRID’s CDC_LOGINFO_ENTRY wire format against Debezium’s would document the round-trip cost.

  • PostgreSQL logical replication — output plugins (pgoutput, wal2json) emit logical events from a physical WAL walk. Closer to CUBRID’s legacy HA approach conceptually; the modern CDC API’s supplemental records are closer to Maxwell’s binlog row events.

  • MySQL binlog (statement / row / mixed) — three modes for emitting events. Row mode is closest to CUBRID’s modern CDC. Statement mode is closer to legacy HA replication.

  • Debezium-style outbox pattern — applications write events into a dedicated outbox table; CDC publishes them. CUBRID’s LOG_SUPPLEMENT_DDL is a primitive form: the engine itself is the outbox writer.

  • Kafka Connect as the consumer — CUBRID’s pull-style API fits a Kafka Connect source connector well; the consumer thread becomes the connect worker.

  • Structured Streaming exactly-once semantics — the consumer’s commitment of next_lsa is at-least-once unless the consumer integrates 2PC. CUBRID’s 2PC (cubrid-2pc.md) could in principle be wired into the consumer commit; not currently surfaced in the API.

Raw analyses (raw/code-analysis/cubrid/storage/cdc/)

  • CDC 진행상황 공유_v2.pptx — “CDC progress status sharing”, v2
  • CDC 인수인계.pptx — “CDC handover”
  • ALTER, DROP.pptx
  • DML Log sequence.pdf
  • CUBRID API 문서_v1.6.docx — “CUBRID API document”, v1.6
  • knowledge/code-analysis/cubrid/cubrid-log-manager.md — LOG_SUPPLEMENTAL_INFO + SUPPLEMENT_REC_TYPE source.
  • knowledge/code-analysis/cubrid/cubrid-flashback.md — opposite direction (backward); shares log_reader. In-progress in the same batch.
  • knowledge/code-analysis/cubrid/cubrid-recovery-manager.md — shares the log_reader class with the redo path.
  • knowledge/code-analysis/cubrid/cubrid-catalog-manager.md — _db_class etc. updates that emit LOG_SUPPLEMENT_DDL.
  • knowledge/code-analysis/cubrid/cubrid-2pc.md — distributed commit semantics relevant to exactly-once consumption.

Textbook chapters (under knowledge/research/dbms-general/)

  • Database Internals (Petrov), Ch. 5 §“Logging”, Ch. 13 §“Replication”.
  • Designing Data-Intensive Applications (Kleppmann), Ch. 5 “Replication”, Ch. 11 “Stream Processing” — CDC framing.

CUBRID source (/data/hgryoo/references/cubrid/)

  • src/transaction/log_manager.{c,h} — cdc_* modern API.
  • src/transaction/log_applier.{c,h} — legacy HA daemon.
  • src/transaction/log_applier_sql_log.{c,h} — slave SQL log.
  • src/transaction/log_reader.{cpp,hpp} — forward walker.
  • src/api/cubrid_log.c — DLL entry surface.