CUBRID Log Manager — WAL, LSN, and Append Discipline

Contents:

Theoretical Background
Common DBMS Design
CUBRID’s Approach
Source Walkthrough
Source verification (as of 2026-04-30)
Beyond CUBRID — Comparative Designs & Research Frontiers
Sources

Theoretical Background

The log manager is CUBRID’s implementation of the write-ahead log (WAL) protocol, the contract every disk-resident relational engine signs with the recovery manager: no dirty data page may reach disk before the log records describing the modification reach stable storage. Database Internals (Petrov, ch. 5 §“Recovery”) frames the two halves of that contract — the log itself, and the discipline that keeps it ahead of data — as the foundation that ARIES (Mohan et al., 1992) builds on. Without WAL, redo and undo at recovery time have no authority over what may have leaked to disk; with WAL, the engine replays only what it logged and undoes only what it must.

The log is a logical, append-only sequence of typed records. Every record carries enough information for two questions to be answered later: “given the page state on disk, what should I do to reapply this change?” (redo) and “given the page state on disk, what should I do to undo this change?” (undo). Around that central record type the engine layers transaction boundaries (commit / abort), system boundaries (start, checkpoint, end-of-log), structural boundaries (savepoint, system-op start/end), and out-of-band channels (HA state, replication, supplemental info for CDC). All of them share one infinite stream identified by LSN — the log sequence number that is the engine’s universal “before / after / same as” comparator on state.

Two implementation choices the WAL model leaves open shape every real engine and frame the rest of this document:

How the LSN is named and how records are physically arranged. A monotonic 64-bit counter is the textbook answer; in practice the address is decomposed into (page_id, in-page offset) so the reader can resolve a record without scanning. The trade-off is between number space (how long until you wrap) and locality (how cheaply you can fetch a record).
How the in-memory side of the log is disciplined. The naive answer is “every appender writes its own record into the page buffer under a global lock.” Real engines instead construct records into a staging list (prior list, log batch, group buffer) under a smaller lock, then a single thread drains that list into the page buffer and flushes. The discipline shows up as different latency profiles at commit time: do we force on every commit, or do we wait for a small window of peers and force them together (group commit)?

CUBRID answers both questions concretely. The next section names the shared engineering vocabulary every WAL engine uses to answer them.

Common DBMS Design

Almost every WAL-based engine — PostgreSQL, InnoDB, Oracle, SQL Server, and CUBRID — composes the log out of the same handful of patterns. The patterns are not in Database Internals’s ARIES chapter; they are the engineering vocabulary that lives between the textbook and the source.

LSN as identity

The LSN names a position in the log, but engines also use it as the equality predicate on page state. Every dirty data page carries the LSN (LSA in CUBRID’s vocabulary) of the most recent log record that modified it. The flush rule on the buffer pool side then becomes simple: a data page may only be written to disk after the log record with that LSA has been forced. The same LSA serves the recovery manager during the redo pass, where comparing it with a record’s LSA decides whether that record’s change still has to be reapplied to the page. PostgreSQL calls it XLogRecPtr, InnoDB calls it lsn_t, CUBRID calls it LOG_LSA; the structure differs but the role is the same.
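The two comparisons a page LSA supports can be sketched as follows. This is a minimal illustration rather than CUBRID code: `Lsa`, `may_flush_data_page`, and `needs_redo` are hypothetical names, and `durable_end` is assumed to be the first log address not yet on stable storage.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Simplified log address for illustration; not CUBRID's LOG_LSA.
struct Lsa
{
  std::int64_t pageid;
  std::int16_t offset;
};

// Lexicographic order on (pageid, offset).
inline bool operator< (const Lsa &a, const Lsa &b)
{
  return std::tie (a.pageid, a.offset) < std::tie (b.pageid, b.offset);
}

// WAL rule: the data page may go to disk only if the record that last
// modified it lies strictly below the durable end of the log.
inline bool may_flush_data_page (const Lsa &page_lsa, const Lsa &durable_end)
{
  return page_lsa < durable_end;
}

// Redo test: a logged change must be reapplied only if the page on disk
// is older than the record that describes the change.
inline bool needs_redo (const Lsa &page_lsa, const Lsa &record_lsa)
{
  return page_lsa < record_lsa;
}
```

With a page stamped `{10, 128}`, a durable end of `{10, 256}` permits the flush, while a durable end equal to the page LSA does not, because the record starting there has not been written yet.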

Log records as a typed union

A log record’s first bytes are a fixed header (back-LSA, forward-LSA, trid, type) followed by a payload whose shape depends on the type. The type set is small (~30 in CUBRID, ~70 in InnoDB) and is treated as a closed enum so a recovery dispatch table can map it to a function pointer per-record. The payload itself is rarely self-describing; the type tells the reader how to parse the rest. Engines that started physical (InnoDB) or logical (PostgreSQL via output plugins) and added the other later end up with hybrid record-type sets — CUBRID’s LOG_REDO_DATA is physiological, but LOG_SYSOP_END_LOGICAL_* carries logical undo for index operations that touch many pages.
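The "fixed header first, then a payload whose shape the type dictates" pattern might look like the sketch below. The record kinds, field names, and byte layout here are invented for illustration and do not match CUBRID's on-disk format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical closed enum of record kinds.
enum class RecType : std::int32_t { Redo = 1, Commit = 2 };

struct RecHeader       { std::int32_t trid; RecType type; };          // fixed part, read first
struct RedoPayload     { std::int16_t page_offset; std::int16_t length; };
struct CommitPayload   { std::int64_t at_time; };

// Append the raw bytes of a trivially copyable value to the buffer.
template <typename T>
void put (std::vector<char> &buf, const T &v)
{
  const char *p = reinterpret_cast<const char *> (&v);
  buf.insert (buf.end (), p, p + sizeof (T));
}

// Read a value back and advance the cursor.
template <typename T>
T get (const std::vector<char> &buf, std::size_t &pos)
{
  assert (pos + sizeof (T) <= buf.size ());
  T v;
  std::memcpy (&v, buf.data () + pos, sizeof (T));
  pos += sizeof (T);
  return v;
}

// The payload is not self-describing: the header's type selects the parser.
std::string describe_record (const std::vector<char> &buf)
{
  std::size_t pos = 0;
  RecHeader hdr = get<RecHeader> (buf, pos);
  switch (hdr.type)
    {
    case RecType::Redo:
      {
        RedoPayload p = get<RedoPayload> (buf, pos);
        return "redo@" + std::to_string (p.page_offset) + "+" + std::to_string (p.length);
      }
    case RecType::Commit:
      {
        CommitPayload p = get<CommitPayload> (buf, pos);
        return "commit@" + std::to_string (p.at_time);
      }
    }
  return "unknown";
}
```

Serialising a `RecHeader{7, RecType::Redo}` followed by a `RedoPayload{128, 16}` and passing the buffer to `describe_record` yields `redo@128+16`; the same bytes read under the wrong type would be misparsed, which is why the enum is kept closed.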

Append buffer as a single-writer pipeline

A multi-writer log buffer is a contention bottleneck. The standard escape is a two-stage pipeline: appenders construct records into a prior list (linked list of staged records under a small mutex), and a single drain thread (or the log critical section under write-mode) walks the prior list and copies records into the log page buffer in LSN order. The prior list is the engine’s unit of “I am going to commit a batch”, and its head/tail pointers are the only structures that need to be touched under the prior-LSA mutex.

Force-at-commit and group commit

The commit log record must be on stable storage before the client sees a successful commit. The naive policy — fsync per commit — is prohibitive on rotating media and still expensive on SSDs. Group commit is universal: a committing transaction places its commit record into the prior list, requests “force everything up to my LSA”, and parks. The drainer fsyncs once for many waiters. The trade-off shows up as commit latency vs. throughput; engines tune the window.

Active log and archives

The on-disk log is two-tier: a fixed-size active log (a circular file or a small set of pages) plus an arbitrary chain of archive log volumes that are read-only after creation. Archives exist for media recovery (replay from a backup) and historical reads (CDC tail, flashback). The boundary between active and archive moves forward as checkpoints establish that all earlier records’ redo no longer matters.
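A circular active log can be pictured with the small sketch below. It assumes the simplest possible mapping (logical page id modulo the number of slots) and a hypothetical `next_to_archive` counter; CUBRID's real mapping and bookkeeping are more involved.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a fixed-size circular active log.
struct ActiveLog
{
  std::int64_t npages;            // slots in the circular active file
  std::int64_t next_to_archive;   // lowest logical page not yet copied to an archive

  // Physical slot that holds a given logical page.
  std::int64_t slot_of (std::int64_t logical_pageid) const
  {
    assert (logical_pageid >= 0 && npages > 0);
    return logical_pageid % npages;
  }

  // Writing logical page P overwrites logical page P - npages, so that
  // older page must already be safe in an archive.
  bool can_append_page (std::int64_t logical_pageid) const
  {
    std::int64_t evicted = logical_pageid - npages;
    return evicted < next_to_archive;
  }
};
```

With four slots and pages 0 and 1 archived (`ActiveLog{4, 2}`), logical page 5 may be appended because it overwrites page 1, while page 6 must wait until page 2 has been archived.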

Theory ↔ CUBRID mapping

Theoretical concept	CUBRID name
LSN — log sequence number	`LOG_LSA` — packed `pageid:48 / offset:16` bit-field (`log_lsa.hpp`)
Log record header (back/forward/trid/type)	`LOG_RECORD_HEADER` (`log_record.hpp`)
Log record type enum	`LOG_RECTYPE` — 35+ entries from `LOG_UNDOREDO_DATA` to `LOG_SUPPLEMENTAL_INFO`
Per-record recovery function table	`LOG_RCVINDEX` (`recovery.h`) → `(undofun, redofun)` in `RV_fun[]`
Staged record before page-buffer copy	`LOG_PRIOR_NODE` linked into `LOG_PRIOR_LSA_INFO` (`log_append.hpp`)
Prior-list mutex	`log_prior_lsa_info::prior_lsa_mutex`
Drain into page buffer (single writer)	`logpb_prior_lsa_append_all_list` (`log_page_buffer.c`)
Append cursor & last-flushed LSA	`log_append_info::nxio_lsa` (atomic) + `prev_lsa`
Force-flush of pages	`logpb_flush_all_append_pages` / `logpb_force_flush_pages`
Group-commit waiter	`LOG_FLUSH_INFO` + condition variable in `log_page_buffer.c` flush daemon
Active log header	`LOG_HEADER` in page `LOGPB_HEADER_PAGE_ID = -9` (`log_storage.hpp`)
Per-page header	`LOG_HDRPAGE` with checksum + flags
Archive log header	`LOG_ARV_HEADER`
Compensation log record (CLR)	`LOG_COMPENSATE` + `log_append_compensate`
Postpone (deferred actions on commit)	`LOG_POSTPONE` + `LOG_RUN_POSTPONE`
MVCC-flavoured records	`LOG_MVCC_UNDOREDO_DATA`, `LOG_MVCC_UNDO_DATA`, `LOG_MVCC_REDO_DATA`
Supplemental records for CDC	`LOG_SUPPLEMENTAL_INFO` + `SUPPLEMENT_REC_TYPE` (DDL, INSERT/UPDATE/DELETE LSA pointers)

By the time we name a CUBRID symbol in the next section, the reader already knows what kind of structure to expect.

CUBRID’s Approach

CUBRID has four moving parts on the log path: the LSA scheme that names every record, the typed log record that lives at that address, the prior list where records are staged before they reach a page, and the log page buffer + flush daemon that drives records to disk. We walk them in that order.

Overall structure

flowchart LR
  subgraph S["Server side appenders"]
    T1["transaction\nthread A"]
    T2["transaction\nthread B"]
    T3["transaction\nthread C"]
  end
  subgraph PL["Prior list (under prior_lsa_mutex)"]
    N1["LOG_PRIOR_NODE\nrecord 1"]
    N2["LOG_PRIOR_NODE\nrecord 2"]
    N3["LOG_PRIOR_NODE\nrecord 3"]
    N1 --> N2 --> N3
  end
  subgraph PB["Log page buffer (LOG_PAGE ring)"]
    P1["page i"]
    P2["page i+1"]
  end
  subgraph DISK["Stable storage"]
    AL["active log file"]
    AR["archive volumes"]
  end
  T1 -->|prior_lsa_alloc_and_copy_data| N1
  T2 -->|prior_lsa_alloc_and_copy_data| N2
  T3 -->|prior_lsa_alloc_and_copy_data| N3
  PL -->|logpb_prior_lsa_append_all_list| PB
  PB -->|logpb_flush_all_append_pages| AL
  AL -->|archive| AR

The figure encodes the three boundaries that matter for correctness: (prior-list boundary) appenders see only their own node and the list tail under one mutex; (page-buffer boundary) the drain thread is the single writer of records into pages; (disk boundary) the flush daemon is the single writer of pages to the active log file. Each boundary serializes one concern (LSA assignment and record ordering, page layout, durability) without serializing the others.

LSA — naming a position in the log

The LSA is a 64-bit packed value: 48 bits of logical page id, 16 bits of in-page offset. CUBRID exposes it both as a struct with operators and as a set of macro shims (LSA_COPY, LSA_LT, …) for the C side.

// struct log_lsa — src/transaction/log_lsa.hpp
struct log_lsa
{
  std::int64_t pageid:48;   /* Log page identifier : 6 bytes */
  std::int64_t offset:16;   /* Offset in page : 2 bytes (defined as 16-bit INT64
                               for alignment) */

  inline log_lsa () = default;
  inline constexpr log_lsa (std::int64_t log_pageid, std::int16_t log_offset);

  constexpr inline bool is_null () const;
  constexpr inline bool is_max () const;

  constexpr inline bool operator== (const log_lsa &olsa) const;
  inline bool operator< (const log_lsa &olsa) const;
  // ... condensed ...
};

constexpr std::int64_t NULL_LOG_PAGEID = -1;
constexpr std::int16_t NULL_LOG_OFFSET = -1;
constexpr log_lsa NULL_LSA { NULL_LOG_PAGEID, NULL_LOG_OFFSET };
constexpr log_lsa MAX_LSA = { 0x7FFFFFFFFFFF /* 47-bit max */, 0x7FFF /* 15-bit max */ };

Three points are worth marking up. (a) The bit-field decomposition means pageid saturates at 2^47 − 1 ≈ 1.4 × 10^14 logical pages. With db_logpagesize typically 4 KiB, that is a log address space of 2^59 ≈ 5.8 × 10^17 bytes (roughly 576 petabytes, or 512 PiB) before exhaustion. The offset is declared as a 16-bit field of int64_t rather than as int16_t for alignment; being signed, it could hold values up to 2^15 − 1, but its effective maximum is bounded by LOG_PAGESIZE, not INT16_MAX. (b) < is lexicographic on (pageid, offset), which is the comparator the prior-list and page-buffer drain rely on to enforce monotonicity. (c) NULL_LSA is (-1, -1). is_null() checks only pageid because that is the field engines mostly read; set_null() writes both to silence valgrind on partially-initialised structs (the inline comment in log_lsa.hpp calls this out explicitly).
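The packing, the comparator, and the address-space arithmetic can be checked with a small stand-alone sketch. `SketchLsa` and the helper names are hypothetical; the `sizeof` assertion reflects how mainstream compilers lay out adjacent bit-fields of the same type and is not guaranteed by the C++ standard.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the packed layout described above; not the real log_lsa.
struct SketchLsa
{
  std::int64_t pageid : 48;
  std::int64_t offset : 16;
};

// Holds on GCC, Clang, and MSVC; bit-field layout is implementation-defined.
static_assert (sizeof (SketchLsa) == 8, "both fields share one 64-bit word");

constexpr std::int64_t MAX_PAGEID = (std::int64_t { 1 } << 47) - 1;   // signed 48-bit max
constexpr std::int64_t MAX_OFFSET = (std::int64_t { 1 } << 15) - 1;   // signed 16-bit max

// Lexicographic order on (pageid, offset).
inline bool lsa_lt (const SketchLsa &a, const SketchLsa &b)
{
  if (a.pageid != b.pageid)
    {
      return a.pageid < b.pageid;
    }
  return a.offset < b.offset;
}

// Mirrors the "check only pageid" convention described above.
inline bool lsa_is_null (const SketchLsa &a)
{
  return a.pageid == -1;
}

// Bytes addressable before pageid saturates, for a given log page size.
constexpr double address_space_bytes (std::int64_t page_size)
{
  return static_cast<double> (MAX_PAGEID + 1) * static_cast<double> (page_size);
}
```

For a 4 KiB page, `address_space_bytes (4096)` evaluates to 2^47 × 2^12 = 2^59 bytes, which is the figure quoted in point (a).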

Log record — header plus typed payload

Every record begins with LOG_RECORD_HEADER, immediately followed by a type-specific header struct, which is followed by zero or more data buffers (undo image, redo image, key, OID, …).

// LOG_RECORD_HEADER and the record-type enum — src/transaction/log_record.hpp
struct log_rec_header
{
  LOG_LSA prev_tranlsa;     /* Address of previous log record for the same transaction */
  LOG_LSA back_lsa;         /* Backward log address */
  LOG_LSA forw_lsa;         /* Forward log address */
  TRANID  trid;             /* Transaction identifier */
  LOG_RECTYPE type;         /* Log record type */
};

enum log_rectype
{
  LOG_SMALLER_LOGREC_TYPE = 0,         /* lower bound check */
  LOG_UNDOREDO_DATA       = 2,         /* an undo and redo data record */
  LOG_UNDO_DATA           = 3,
  LOG_REDO_DATA           = 4,
  LOG_DBEXTERN_REDO_DATA  = 5,
  LOG_POSTPONE            = 6,
  LOG_RUN_POSTPONE        = 7,
  LOG_COMPENSATE          = 8,         /* CLR: compensation record written when an action is undone */
  LOG_COMMIT_WITH_POSTPONE = 14,
  LOG_COMMIT              = 17,
  LOG_SYSOP_START_POSTPONE = 18,
  LOG_SYSOP_END           = 20,        /* nested top-op end (commit / abort / logical) */
  LOG_ABORT               = 22,
  LOG_START_CHKPT         = 25,
  LOG_END_CHKPT           = 26,
  LOG_SAVEPOINT           = 27,
  LOG_2PC_PREPARE         = 28,        /* 2PC voted yes */
  /* LOG_2PC_START / COMMIT_DECISION / ABORT_DECISION / INFORM / RECV_ACK ... */
  LOG_END_OF_LOG          = 35,
  LOG_DUMMY_HEAD_POSTPONE = 36,        /* no-op markers */
  LOG_DUMMY_CRASH_RECOVERY = 37,
  LOG_REPLICATION_DATA    = 39,
  LOG_REPLICATION_STATEMENT = 40,
  LOG_DIFF_UNDOREDO_DATA  = 43,        /* diff undo+redo to save space */
  LOG_DUMMY_HA_SERVER_STATE = 44,
  LOG_DUMMY_OVF_RECORD    = 45,        /* overflow-record marker */
  LOG_MVCC_UNDOREDO_DATA  = 46,        /* MVCC variant: carries MVCCID + vacuum info */
  LOG_MVCC_UNDO_DATA      = 47,
  LOG_MVCC_REDO_DATA      = 48,
  LOG_MVCC_DIFF_UNDOREDO_DATA = 49,
  LOG_SYSOP_ATOMIC_START  = 50,
  LOG_DUMMY_GENERIC       = 51,        /* "ridiculous, but flush needs it" — comment in header */
  LOG_SUPPLEMENTAL_INFO   = 52,        /* CDC supplemental: tran user, DDL, undo/redo LSA, raw image */
  LOG_LARGER_LOGREC_TYPE,
};

The triple-LSA in the header (prev_tranlsa, back_lsa, forw_lsa) is the structure ARIES needs at recovery time. prev_tranlsa chains records belonging to the same transaction and is the entry point for rollback (the undo pass walks the chain backward from the transaction’s tail). back_lsa and forw_lsa chain records in physical order on the log so the analysis and redo passes can scan forward without re-reading every page header. The MVCC-flavoured records add an MVCCID and a LOG_VACUUM_INFO (prev_mvcc_op_log_lsa + vfid) so the vacuum subsystem can walk MVCC operations without re-parsing the entire log. The type set is intentionally append-only: old code (commented-out #if 0 blocks for LOG_CLIENT_NAME, LOG_LCOMPENSATE, LOG_UNLOCK_COMMIT) is preserved as numbered holes so a binary log compatible with old releases still parses.
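The backward walk along prev_tranlsa during rollback can be sketched as below. The log is modelled as a map from a flattened integer LSN to a record, which is an assumption made for brevity; `Rec`, `Lsn`, and `undo_order` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using Lsn = std::int64_t;            // flattened stand-in for LOG_LSA
constexpr Lsn NULL_LSN = -1;

struct Rec
{
  int trid;           // owning transaction
  Lsn prev_tranlsa;   // previous record of the same transaction, or NULL_LSN
};

// Returns the LSNs of one transaction's records, newest first, by following
// prev_tranlsa from the transaction's tail. Records of other transactions
// interleaved in the log are never visited.
std::vector<Lsn> undo_order (const std::map<Lsn, Rec> &log, Lsn tail)
{
  std::vector<Lsn> out;
  for (Lsn cur = tail; cur != NULL_LSN; cur = log.at (cur).prev_tranlsa)
    {
      assert (log.at (cur).trid == log.at (tail).trid);
      out.push_back (cur);
    }
  return out;
}
```

Given two interleaved transactions, starting from the tail of transaction 1 visits only that transaction's records, in the order rollback would undo them.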

A specific union worth marking up is LOG_REC_SYSOP_END, the record written when a system op (a sub-transactional unit of recovery) finishes:

// LOG_REC_SYSOP_END — src/transaction/log_record.hpp
struct log_rec_sysop_end
{
  LOG_LSA lastparent_lsa;    /* last address before the top action */
  LOG_LSA prv_topresult_lsa; /* previous top action's end LSA */
  LOG_SYSOP_END_TYPE type;   /* COMMIT | ABORT | LOGICAL_UNDO |
                                LOGICAL_MVCC_UNDO | LOGICAL_COMPENSATE |
                                LOGICAL_RUN_POSTPONE */
  const VFID *vfid;
  union
  {
    LOG_REC_UNDO undo;            /* logical undo */
    LOG_REC_MVCC_UNDO mvcc_undo;  /* logical MVCC undo */
    LOG_LSA compensate_lsa;       /* logical compensate */
    struct {
      LOG_LSA postpone_lsa;
      bool is_sysop_postpone;
    } run_postpone;
  };
};

A “system op” in CUBRID is the engine’s private nested transaction — used by index splits, heap-overflow allocations, and other operations that need to be atomic as a group even though they touch many pages. The union carries either the logical-undo image (so recovery can reverse the system op without replaying its physical changes backward), or the LSA of the postpone record (so a postpone interrupted by crash can resume), or the LSA of the compensate target (so an undone system op leaves a clean trail). The type constant on the record selects which arm is live.
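How the type constant selects the live arm of the union can be sketched with a reduced tagged union. The payload fields below are flattened stand-ins for the real structs, and treating commit and abort as carrying no arm is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

enum class SysopEndType
{
  Commit, Abort, LogicalUndo, LogicalMvccUndo, LogicalCompensate, LogicalRunPostpone
};

// Reduced model of a tagged union; field types are simplified placeholders.
struct SysopEnd
{
  SysopEndType type;
  union
  {
    std::int32_t undo_rcvindex;     // stand-in for LOG_REC_UNDO
    std::int64_t undo_mvccid;       // stand-in for LOG_REC_MVCC_UNDO
    std::int64_t compensate_lsa;    // flattened stand-in for LOG_LSA
    std::int64_t postpone_lsa;      // flattened stand-in for run_postpone
  };
};

// Names the union arm a reader is allowed to touch for this record.
inline std::string live_arm (const SysopEnd &rec)
{
  switch (rec.type)
    {
    case SysopEndType::LogicalUndo:        return "undo";
    case SysopEndType::LogicalMvccUndo:    return "mvcc_undo";
    case SysopEndType::LogicalCompensate:  return "compensate_lsa";
    case SysopEndType::LogicalRunPostpone: return "run_postpone";
    case SysopEndType::Commit:
    case SysopEndType::Abort:
      break;                               // assumed: no arm is meaningful
    }
  return "none";
}
```

Recovery code would perform the same switch before reading any payload field, since reading an arm other than the one the type names yields bytes that belong to a different interpretation.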

The prior list — staging before the page buffer

Figure 1 — Prior list, page buffer, log volume

Figure 1 — The three storage tiers a log record passes through. At the top, transactions chain LOG_PRIOR_NODE values onto the prior list at log_Gl.prior_info.prior_list_header. The drain thread copies nodes into log_Pb.buffers — an in-memory ring of LOG_PAGE frames — and the flush daemon writes those pages to the active log volume on disk. The dashed boundaries are the three serialization points the prose below names: prior-list mutex, single-writer drain, single-writer flush. (Source: log manager_v0.5.docx, prior-list overview figure.)

When a transaction wants to log an undo+redo data record, it calls prior_lsa_alloc_and_copy_data (log_append.cpp:273). The function allocates a LOG_PRIOR_NODE, fills the record header and the type-specific header fields, copies the undo and redo payloads from the caller-supplied buffers, and returns a node that does not yet have an LSA.

// LOG_PRIOR_NODE / LOG_PRIOR_LSA_INFO — src/transaction/log_append.hpp
struct log_prior_node
{
  LOG_RECORD_HEADER log_header;
  LOG_LSA           start_lsa;       /* assertion-only — assigned on attach */
  bool              tde_encrypted;

  int   data_header_length;
  char *data_header;

  int   ulength;                     /* undo length */
  char *udata;
  int   rlength;                     /* redo length */
  char *rdata;

  LOG_PRIOR_NODE *next;
};

struct log_prior_lsa_info
{
  LOG_LSA prior_lsa;                 /* next LSA to assign */
  LOG_LSA prev_lsa;                  /* last attached node's LSA */

  LOG_PRIOR_NODE *prior_list_header;
  LOG_PRIOR_NODE *prior_list_tail;
  INT64           list_size;         /* bytes */
  LOG_PRIOR_NODE *prior_flush_list_header;

  std::mutex prior_lsa_mutex;
};

The transaction then calls prior_lsa_next_record (1553) — or prior_lsa_next_record_with_lock (1559) when it already holds the prior mutex — to attach the node. Attaching is two coordinated operations under the mutex: assign prior_lsa to the node’s start_lsa, and append the node to the tail of prior_list. The returned LSA is the transaction’s “I have logged this” promise.

// prior_lsa_next_record_internal — src/transaction/log_append.cpp
LOG_LSA
prior_lsa_next_record_internal (THREAD_ENTRY *thread_p,
                                LOG_PRIOR_NODE *node,
                                LOG_TDES *tdes, int with_lock)
{
  // ... condensed ...
  if (with_lock == LOG_PRIOR_LSA_WITHOUT_LOCK)
    log_Gl.prior_info.prior_lsa_mutex.lock ();

  start_lsa = log_Gl.prior_info.prior_lsa;
  prior_lsa_start_append (thread_p, node, tdes);     /* assigns node->start_lsa,
                                                        bumps prior_lsa by record size */
  prior_lsa_end_append (thread_p, node);             /* attaches to prior_list tail */

  if (with_lock == LOG_PRIOR_LSA_WITHOUT_LOCK)
    log_Gl.prior_info.prior_lsa_mutex.unlock ();

  return start_lsa;
}

Two properties fall out of this discipline. (a) The mutex is held for the duration of “decide where the record lives” plus “attach” — both O(1). The actual record copy from caller buffers happened earlier in prior_lsa_alloc_and_copy_data, outside the mutex, so the critical section is short. (b) Because attaching also assigns the LSA, a transaction’s LSA totally orders against every other transaction’s LSA: there is no “I appended at LSA X but my neighbour at X+δ committed first” race.
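The attach discipline can be reduced to a few lines. In this sketch the LSA is flattened to a byte position and the names (`Node`, `PriorList`, `attach`) are hypothetical; what it keeps from the description above is that LSA assignment and tail linking happen together under one mutex, after the payload has already been built.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

struct Node
{
  std::int64_t start_lsa = -1;   // assigned on attach
  std::int64_t size = 0;         // record length, known before attach
  Node *next = nullptr;
};

class PriorList
{
public:
  // Assign the next LSA and link the node at the tail. Both steps are O(1),
  // so the mutex is held only briefly.
  std::int64_t attach (Node *node)
  {
    assert (node != nullptr && node->size > 0);
    std::lock_guard<std::mutex> guard (mutex_);
    node->start_lsa = next_lsa_;
    next_lsa_ += node->size;
    if (tail_ == nullptr)
      {
        head_ = node;
      }
    else
      {
        tail_->next = node;
      }
    tail_ = node;
    return node->start_lsa;
  }

  Node *head () const { return head_; }
  std::int64_t next_lsa () const { return next_lsa_; }

private:
  std::mutex mutex_;
  std::int64_t next_lsa_ = 0;
  Node *head_ = nullptr;
  Node *tail_ = nullptr;
};
```

Because the LSA is taken and the node is linked inside the same critical section, list order and LSA order cannot diverge, whichever thread calls `attach` first.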

The page buffer drain — single-writer LSN order

The next stage is the drain. logpb_prior_lsa_append_all_list (log_page_buffer.c:3106) is called by holders of the log critical section, by the flush daemon, and by any path that needs to ensure records are in the page buffer. Its job is to walk the prior list and copy each node into the appropriate LOG_PAGE. Each copy looks up pages from the page buffer pool initialised by logpb_initialize_pool (553), allocating them if needed.

// logpb_prior_lsa_append_all_list — src/transaction/log_page_buffer.c
int
logpb_prior_lsa_append_all_list (THREAD_ENTRY *thread_p)
{
  // ... condensed ...
  /* Detach the prior list under the mutex; release it; copy outside. */
  std::unique_lock<std::mutex> lk (log_Gl.prior_info.prior_lsa_mutex);
  LOG_PRIOR_NODE *first = log_Gl.prior_info.prior_list_header;
  log_Gl.prior_info.prior_list_header = NULL;
  log_Gl.prior_info.prior_list_tail   = NULL;
  log_Gl.prior_info.list_size         = 0;
  lk.unlock ();

  LOG_PRIOR_NODE *next = NULL;
  for (LOG_PRIOR_NODE *node = first; node != NULL; node = next)
    {
      next = node->next;                             /* save the link before the node is freed */
      logpb_append_next_record (thread_p, node);
      if (logpb_is_page_in_archive (...))            /* page boundary crossing */
        logpb_next_append_page (thread_p, ...);
      logpb_free_node (node);
    }
  return NO_ERROR;
}

Two properties matter for correctness. (a) The drain is a single writer into the log page buffer: the prior-list mutex serialises appenders against each other and against the drain’s detach step, while the copy into pages runs outside that mutex, inside the log critical section. (b) Page boundaries are crossed by logpb_next_append_page (2630), which marks the current page dirty in the flush info, allocates a new logical page, and threads its physical descriptor.
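The detach-then-copy shape of the drain, including page-boundary crossing, can be sketched as follows. `LogBuffer` and its members are hypothetical, pages are modelled as strings of at most `page_size` bytes, and records are allowed to span pages; none of this reproduces CUBRID's page format.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

class LogBuffer
{
public:
  explicit LogBuffer (std::size_t page_size) : page_size_ (page_size)
  {
    assert (page_size > 0);
  }

  // Appender side: stage a finished record under the mutex.
  void stage (std::string bytes)
  {
    std::lock_guard<std::mutex> guard (mutex_);
    staged_.push_back (std::move (bytes));
  }

  // Drain side: detach the whole staged list under the mutex, then copy
  // into pages without holding it. Returns the number of records drained.
  std::size_t drain ()
  {
    std::vector<std::string> batch;
    {
      std::lock_guard<std::mutex> guard (mutex_);
      batch.swap (staged_);
    }
    for (const std::string &rec : batch)
      {
        std::size_t copied = 0;
        while (copied < rec.size ())
          {
            if (pages_.empty () || pages_.back ().size () == page_size_)
              {
                pages_.emplace_back ();          // cross a page boundary
              }
            std::size_t room = page_size_ - pages_.back ().size ();
            std::size_t n = std::min (room, rec.size () - copied);
            pages_.back ().append (rec, copied, n);
            copied += n;
          }
      }
    return batch.size ();
  }

  const std::vector<std::string> &pages () const { return pages_; }

private:
  std::size_t page_size_;
  std::mutex mutex_;
  std::vector<std::string> staged_;   // stand-in for the prior list
  std::vector<std::string> pages_;    // stand-in for the log page buffer
};
```

With 8-byte pages, staging records of 5, 6, and 2 bytes and draining once produces two pages: the second record straddles the boundary, and the records appear in the pages in exactly the order they were staged.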

Flush — making it durable

Figure 2 — Daemons that drain the prior list and the active log

Figure 2 — Same three tiers as Figure 1, with the three log-manager daemons attached. log_Flush_daemon is the actual writer to the active log volume; log_Checkpoint_daemon periodically calls logpb_checkpoint so the next analysis pass has a closer starting LSA; log_Remove_log_archive_daemon reclaims archive volumes whose records are no longer needed by any HA replica or backup. The daemons are the only writers to disk — appenders never block on I/O themselves. (Source: log manager_v0.5.docx, log-manager + daemons figure.)

Flushing is the last stage. logpb_flush_all_append_pages (3232) is the workhorse. It walks the dirty list of the page buffer, writes each page to the active log file via the file_io subsystem, and updates log_append_info::nxio_lsa — the next-IO LSA, the watermark that says “everything below me is on stable storage”.

// log_append_info — src/transaction/log_append.hpp
struct log_append_info
{
  int                       vdes;        /* volume descriptor of active log */
  std::atomic<LOG_LSA>      nxio_lsa;    /* lowest LSA not yet on disk */
  LOG_LSA                   prev_lsa;    /* last appended record */
  LOG_PAGE                 *log_pgptr;   /* the currently fixed log page */
  bool                      appending_page_tde_encrypted;
};

The forced-flush path is logpb_force_flush_pages (4096), invoked at commit (log_commit → logpb_force_flush_pages directly when the HA configuration demands it, otherwise indirectly via the flush daemon). The flush daemon (woken through log_wakeup_log_flush_daemon, declared in log_manager.h) sleeps on a condition variable and is woken by appenders that need to be sure their record is on disk before they return. Because nxio_lsa is the lowest LSA not yet on disk, the waiter checks nxio_lsa > my_commit_lsa on wake; if the check fails, it sleeps again. This is the group-commit implementation: many waiters, one fsync.
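The "many waiters, one fsync" effect can be shown with a single-threaded simulation. It takes nxio_lsa to be the lowest LSA not yet on disk, as the struct comment above states, so a record starting at a given LSA counts as durable only once that LSA is strictly below nxio_lsa. The class and method names are hypothetical, LSAs are flattened to byte positions, and the fsync is only counted, not performed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class GroupCommitSim
{
public:
  // A committer appends its commit record and parks; returns the record's LSA.
  std::int64_t append_commit (std::int64_t size)
  {
    assert (size > 0);
    std::int64_t lsa = append_lsa_;
    append_lsa_ += size;
    waiters_.push_back (lsa);
    return lsa;
  }

  // Durable means the record starts strictly below the flush watermark.
  bool is_durable (std::int64_t lsa) const { return lsa < nxio_lsa_; }

  // One flush covers everything appended so far with a single simulated
  // fsync, then releases every parked committer. Returns how many it woke.
  std::size_t flush ()
  {
    if (nxio_lsa_ == append_lsa_)
      {
        return 0;                       // nothing new to write
      }
    fsync_count_++;
    nxio_lsa_ = append_lsa_;
    for (std::int64_t lsa : waiters_)
      {
        assert (is_durable (lsa));
      }
    std::size_t released = waiters_.size ();
    waiters_.clear ();
    return released;
  }

  int fsync_count () const { return fsync_count_; }

private:
  std::int64_t append_lsa_ = 0;   // next LSA to hand out
  std::int64_t nxio_lsa_ = 0;     // lowest LSA not yet on disk
  int fsync_count_ = 0;
  std::vector<std::int64_t> waiters_;
};
```

Three commits followed by one `flush ()` release all three waiters with `fsync_count ()` equal to 1; a real implementation would replace the waiter vector with threads blocked on a condition variable.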

The corresponding guarantee on the data-page side is the WAL invariant: a data page flush in the buffer manager (pgbuf_flush_page_*) must ensure nxio_lsa > page->lsa before issuing the page write, so that the log record that last modified the page is already durable. CUBRID enforces it in pgbuf_flush_check_log_lsa, which calls back into the log manager to force as needed. (The buffer manager side is covered in cubrid-page-buffer-manager.md.)

Compression, archiving, and the active-log header

Beyond the hot append path, the log manager owns three pieces of machinery that are not on the per-record path but matter for operability.

Compression. Records over log_Zip_min_size_to_compress are compressed in place during prior_lsa_alloc_and_copy_data using LZ4 through a per-thread LOG_ZIP context (log_compress.c, log_append_get_zip_undo / log_append_get_zip_redo). The flag log_Zip_support is a global toggle. The compression boundary is the prior-list node, not the page; this means the record’s compressed shape is stable across page boundaries and the recovery side decompresses with the same context.

Archiving. The active log is fixed-size; once it fills, the oldest pages roll over to an archive volume. LOG_ARV_HEADER names each archive file (log_storage.hpp); nxarv_pageid and nxarv_phy_pageid in LOG_HEADER track the next page to archive. The archive remove daemon (log_wakeup_remove_log_archive_daemon) deletes archives once the checkpoint LSA has moved past them and no replication / CDC reader still depends on them.

Active-log header. LOG_HEADER, stored at logical page LOGPB_HEADER_PAGE_ID = -9, is the bootstrap structure: it carries the database creation time, page size, next transaction ID, next MVCC ID, append LSA, checkpoint LSA, archive bookkeeping, HA state, backup LSAs, and vacuum bookkeeping. The header is the single source of truth for “where is the log up to” across crash boundaries. Its eof_lsa is the LSN of LOG_END_OF_LOG, which the recovery analysis pass uses to find the end of the legitimate log.

The append APIs the rest of the engine sees

From outside the log manager, three families of entry points cover nearly all uses:

// Append API surface — log_manager.h (excerpt)
extern void log_append_undoredo_data (THREAD_ENTRY *, LOG_RCVINDEX,
                                       LOG_DATA_ADDR *,
                                       int undo_length, int redo_length,
                                       const void *undo_data,
                                       const void *redo_data);
extern void log_append_undo_data    (THREAD_ENTRY *, LOG_RCVINDEX,
                                       LOG_DATA_ADDR *, int length,
                                       const void *data);
extern void log_append_redo_data    (THREAD_ENTRY *, LOG_RCVINDEX,
                                       LOG_DATA_ADDR *, int length,
                                       const void *data);
extern void log_append_compensate   (THREAD_ENTRY *, LOG_RCVINDEX,
                                       const VPID *, PGLENGTH,
                                       PAGE_PTR, int length,
                                       const void *data, LOG_TDES *);
extern void log_append_postpone     (THREAD_ENTRY *, LOG_RCVINDEX,
                                       LOG_DATA_ADDR *, int length,
                                       const void *data);
extern LOG_LSA *log_append_savepoint (THREAD_ENTRY *, const char *savept_name);

extern TRAN_STATE log_commit (THREAD_ENTRY *, int tran_index, bool retain_lock);
extern TRAN_STATE log_abort  (THREAD_ENTRY *, int tran_index);

extern void log_sysop_start (THREAD_ENTRY *);
extern void log_sysop_commit (THREAD_ENTRY *);
extern void log_sysop_abort (THREAD_ENTRY *);
extern void log_sysop_end_logical_undo (THREAD_ENTRY *, LOG_RCVINDEX,
                                         const VFID *, int undo_size,
                                         const char *undo_data);

The LOG_DATA_ADDR struct (log_append.hpp) bundles a page pointer, a file id, and an in-page offset — the addressing the recovery side will need to redo or undo. LOG_RCVINDEX is the dispatch key into the global RV_fun[] array (recovery.h) that pairs a redo function with an undo function for each record kind. The append API’s job is to turn the caller’s (rcvindex, addr, payload) tuple into a LOG_PRIOR_NODE of the right LOG_RECTYPE, attach it to the prior list, and return.
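The role of a recovery index as a key into a table of (undo, redo) function pairs can be illustrated with a toy table. The indices, the argument struct, and the handlers below are invented for this sketch and are unrelated to the real entries in RV_fun[].

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical recovery indices.
enum RcvIndex { RV_OVERWRITE = 0, RV_APPEND = 1, RV_LAST };

// What a handler needs: the page, where to act, and the logged image.
struct RcvArgs
{
  std::string *page;
  std::size_t offset;
  const std::string *data;
};

using RcvFn = void (*) (const RcvArgs &);

struct RcvEntry
{
  RcvIndex index;
  RcvFn undofun;
  RcvFn redofun;
};

static void rv_overwrite (const RcvArgs &a) { a.page->replace (a.offset, a.data->size (), *a.data); }
static void rv_append (const RcvArgs &a) { a.page->append (*a.data); }
static void rv_truncate (const RcvArgs &a) { a.page->resize (a.offset); }

// One row per index; the row position must equal the index value.
static const RcvEntry RV_TABLE[] = {
  { RV_OVERWRITE, rv_overwrite, rv_overwrite },   // same code, different image
  { RV_APPEND, rv_truncate, rv_append },          // undo of append is truncate
};

inline void apply_redo (RcvIndex idx, const RcvArgs &a)
{
  assert (idx < RV_LAST && RV_TABLE[idx].index == idx);
  RV_TABLE[idx].redofun (a);
}

inline void apply_undo (RcvIndex idx, const RcvArgs &a)
{
  assert (idx < RV_LAST && RV_TABLE[idx].index == idx);
  RV_TABLE[idx].undofun (a);
}
```

Redoing `RV_OVERWRITE` with the after-image and then undoing it with the before-image returns the page to its original bytes. The appender only has to log the index and the two images; recovery finds the code through the table.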

Record lifecycle, end to end

Figure 3 — One INSERT's commit walked through the log path

Figure 3 — A concrete INSERT INTO tbl VALUES (10) followed by COMMIT, walked through the four numbered steps the deck used: ① the transaction calls prior_lsa_alloc_and_copy_data and attaches the new prior node; ② the drain copies the node into a LOG_PAGE and the flush daemon writes it; ③ the active log volume eventually rolls into an archive; ④ archives that no past replica still reads are removed. Steps ① and ② happen inline with the commit (the caller blocks on nxio_lsa); ③ and ④ are background. The Mermaid sequenceDiagram below adds the symbol-level call chain. (Source: log manager_v0.5.docx, commit walkthrough figure.)

sequenceDiagram
  participant TX as Transaction thread
  participant API as log_append_*
  participant PA as prior_lsa_alloc_and_copy_data
  participant PN as prior_lsa_next_record
  participant DR as logpb_prior_lsa_append_all_list
  participant FL as logpb_flush_all_append_pages
  participant FD as log flush daemon
  participant DK as Disk (active log)

  TX->>API: log_append_undoredo_data(rcvindex, addr, undo, redo)
  API->>PA: build LOG_PRIOR_NODE (zip if eligible)
  PA-->>API: node with header + payload
  API->>PN: assign LSA + attach (mutex)
  PN-->>API: LSA = X
  Note over TX,API: caller continues without disk I/O
  loop drainer wake
    DR->>DR: detach prior list
    DR->>DR: copy nodes into LOG_PAGE buffer
  end
  TX->>FD: log_commit → demand force up to commit LSA
  FD->>FL: flush all pages up to commit LSA
  FL->>DK: write LOG_PAGE → active log
  FL-->>FD: nxio_lsa advanced
  FD-->>TX: wake committers whose LSA < nxio_lsa

The diagram shows the two interleaved orderings the design holds. LSA order is enforced at attach time by the prior-list mutex: every LSA the engine hands out is strictly greater than the previous one. Disk order is enforced at flush time by the single writer to the active log volume. The two orderings agree because the drain copies nodes in prior-list order and the flush writes pages in page-id order.

Source Walkthrough

Anchor on symbol names, not line numbers. Lines drift.

log_lsa (log_lsa.hpp) — packed pageid:48 / offset:16 LSN.
LOG_RECORD_HEADER (log_record.hpp) — header at every record’s front, containing prev_tranlsa, back_lsa, forw_lsa, trid, type.
LOG_RECTYPE (log_record.hpp) — closed enum of record kinds.
LOG_PAGE / LOG_HDRPAGE / LOG_HEADER (log_storage.hpp) — the on-disk layout.
LOG_PRIOR_NODE / LOG_PRIOR_LSA_INFO (log_append.hpp) — staged records between caller and page buffer.
log_append_info (log_append.hpp) — global append cursor and nxio_lsa.

Prior-list (staging)

prior_lsa_alloc_and_copy_data (log_append.cpp) — build a node from (rcvindex, addr, undo, redo), optionally zip.
prior_lsa_alloc_and_copy_crumbs (log_append.cpp) — variant that takes a list of LOG_CRUMB data fragments.
prior_lsa_next_record_internal (log_append.cpp) — assign LSA and attach to tail under prior mutex.
prior_lsa_start_append / prior_lsa_end_append (log_append.cpp) — split halves of attach.
log_append_undoredo_data and friends (log_manager.c) — the surface the rest of the engine calls.

Page-buffer drain

logpb_initialize_pool (log_page_buffer.c) — set up the log page buffer pool of LOGPB_BUFFER_NPAGES_LOWER+ pages.
logpb_prior_lsa_append_all_list (log_page_buffer.c) — drain the prior list into the page buffer.
logpb_next_append_page (log_page_buffer.c) — cross page boundary; mark current page dirty.

Flush

logpb_flush_all_append_pages (log_page_buffer.c) — write dirty pages to active log; advance nxio_lsa.
logpb_force_flush_pages (log_page_buffer.c) — force-flush variant for commit demand and HA.
log_wakeup_log_flush_daemon (log_manager.h) — wake the daemon that drives group-commit batching.

Lifecycle

log_create (log_manager.c) — create active log + header at database init.
log_initialize (log_manager.c) — open / recover the log on server start.
log_commit (log_manager.c) — append LOG_COMMIT, force, release locks.
log_abort (log_manager.c) — drive undo, append LOG_ABORT.
log_complete (log_manager.c) — final state transition; emits end-of-transaction record.
log_final (log_manager.c) — graceful shutdown, force, write LOG_HEADER::is_shutdown = true.

Compression & archive

log_append_init_zip / log_append_final_zip (log_append.cpp).
log_append_get_zip_undo / log_append_get_zip_redo (log_append.cpp).
log_wakeup_remove_log_archive_daemon (log_manager.h) — daemon that deletes obsolete archive volumes.

CDC integration (forward boundary)

cdc_* family (log_manager.h) — cdc_find_lsa, cdc_get_logitem_info, cdc_make_dml_loginfo, cdc_validate_lsa, cdc_min_log_pageid_to_keep. Exposed here because CDC walks the log forward through log_reader and depends on the log manager to keep the relevant range alive.

Position hints as of 2026-04-30

Symbol	File	Line
`log_lsa` (struct)	`log_lsa.hpp`	35
`log_rec_header`	`log_record.hpp`	146
`enum log_rectype`	`log_record.hpp`	35
`log_rec_sysop_end`	`log_record.hpp`	305
`log_prior_node`	`log_append.hpp`	91
`log_prior_lsa_info`	`log_append.hpp`	112
`log_append_info`	`log_append.hpp`	73
`log_header`	`log_storage.hpp`	113
`log_create`	`log_manager.c`	791
`log_initialize`	`log_manager.c`	1059
`log_append_undoredo_data`	`log_manager.c`	1893
`log_append_redo_data`	`log_manager.c`	2035
`log_commit`	`log_manager.c`	5352
`log_abort`	`log_manager.c`	5461
`log_complete`	`log_manager.c`	5653
`prior_lsa_alloc_and_copy_data`	`log_append.cpp`	273
`prior_lsa_alloc_and_copy_crumbs`	`log_append.cpp`	410
`prior_lsa_next_record_internal`	`log_append.cpp`	1357
`prior_lsa_next_record`	`log_append.cpp`	1553
`prior_lsa_start_append`	`log_append.cpp`	1593
`prior_lsa_end_append`	`log_append.cpp`	1652
`logpb_initialize_pool`	`log_page_buffer.c`	553
`logpb_next_append_page`	`log_page_buffer.c`	2630
`logpb_prior_lsa_append_all_list`	`log_page_buffer.c`	3106
`logpb_flush_all_append_pages`	`log_page_buffer.c`	3232
`logpb_force_flush_pages`	`log_page_buffer.c`	4096

Source verification (as of 2026-04-30)

Each entry is a fact about the current source — readable without the original analysis materials. Trailing notes show the verification trail.

Verified facts

LOG_LSA is a 64-bit packed struct: pageid:48 / offset:16, defined as bit-fields on int64_t. Verified in log_lsa.hpp:35. The offset field is intentionally typed as int64_t:16 rather than int16_t for alignment; the comment in the source calls this out. Implication: because the offset is a signed 16-bit field, the largest addressable in-page offset is 32,767, so a log page cannot exceed 32 KiB; CUBRID’s default db_logpagesize is 16 KiB.
Null LSA is (-1, -1), but log_lsa::is_null() checks only pageid. Verified in log_lsa.hpp:97-101 and matching set_null() at line 109. The set_null function explicitly writes both fields, with the inline comment “this is how LOG_LSA is initialized many times; we need to initialize both fields or we’ll have ‘conditional jump or move on uninitialized value’”.
The append pipeline is two-stage: prior list → page buffer. Verified by reading log_append.hpp (struct definitions), log_append.cpp (prior_lsa_* functions), and log_page_buffer.c (logpb_prior_lsa_append_all_list). The prior-list mutex (prior_lsa_mutex) is the only lock appenders contend on; the drain is single-writer.
LSA assignment happens inside the prior-list mutex, in prior_lsa_start_append. Verified at log_append.cpp:1593. This means LSAs are totally ordered by the order in which appenders take the mutex, not by when they copied their record bytes. Implication: the LSA returned from log_append_undoredo_data is a stable handle even for concurrent appenders.
Group commit is implemented via a flush daemon woken by waiters on nxio_lsa. Verified by log_wakeup_log_flush_daemon in log_manager.h:221 plus log_append_info::nxio_lsa (atomic) in log_append.hpp:76. The daemon’s main loop is in log_page_buffer.c (search log_flush_daemon); the waiter side is logpb_flush_all_append_pages.
Log records are zlib-compressed in place when over a configured size threshold. Verified by log_append_init_zip, log_append_get_zip_undo / _redo in log_append.hpp:162-165 and log_compress.c. The toggle is log_Zip_support; the threshold is log_Zip_min_size_to_compress (both globals).
The active log header lives at logical page id -9, hard-coded as LOGPB_HEADER_PAGE_ID. Verified in log_storage.hpp:51. It is kept on the active portion of the log and is backed up in all archive logs — this is what makes emergency recovery (log_restart_emergency) possible from any archive.
Record types are append-only across releases. Verified in log_record.hpp:35-141 by counting the #if 0 blocks for obsolete types (LOG_CLIENT_NAME = 1, LOG_LCOMPENSATE = 9, LOG_UNLOCK_COMMIT = 41, etc.). Because retired values are left as numbered holes rather than reassigned, every surviving type keeps its numeric value, so a current build does not misread the type field of records written by an older one.
MVCC log records carry both MVCCID and LOG_VACUUM_INFO. Verified in log_record.hpp:202-217 (LOG_REC_MVCC_UNDOREDO, LOG_REC_MVCC_UNDO). The vacuum_info.prev_mvcc_op_log_lsa links MVCC operations into a chain the vacuum subsystem can walk without re-reading every other record.
LOG_SUPPLEMENTAL_INFO is the channel CDC uses for catalog-visible events. Verified in log_record.hpp:418-439 (SUPPLEMENT_REC_TYPE enum) and the log_append_supplemental_* declarations in log_manager.h:171-179. The supplemental record type set includes TRAN_USER, DDL, INSERT/UPDATE/DELETE, and TRIGGER_INSERT/UPDATE/DELETE — the same set CDC consumers de-serialise on the downstream side (cubrid-cdc.md).
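
The first two facts above (the 48/16 bit-field packing and the pageid-only null check) can be sketched as follows. This is an illustrative C++ mirror, not a copy of log_lsa.hpp: the type SketchLsa, the helpers make_lsa and sketch_lsa_compare, and the constant kSketchMaxOffset are hypothetical names. The 8-byte size check assumes a mainstream compiler (GCC, Clang, MSVC), since bit-field layout is implementation-defined.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the layout described above; not the real LOG_LSA.
struct SketchLsa {
  std::int64_t pageid : 48;  // log page identifier
  std::int64_t offset : 16;  // signed byte offset inside that page

  // Writes both fields so no bits are left uninitialized.
  void set_null() {
    pageid = -1;
    offset = -1;
  }

  // Only pageid is inspected, as the text above describes.
  bool is_null() const { return pageid == -1; }
};

// Holds on mainstream compilers; the standard leaves bit-field layout open.
static_assert(sizeof(SketchLsa) == 8, "expected a packed 64-bit address");

// Largest value a signed 16-bit offset can hold.
constexpr std::int64_t kSketchMaxOffset = (std::int64_t{1} << 15) - 1;

inline SketchLsa make_lsa(std::int64_t pageid, std::int64_t offset) {
  assert(offset <= kSketchMaxOffset);
  SketchLsa lsa;
  lsa.pageid = pageid;
  lsa.offset = offset;
  return lsa;
}

// Orders by page first, then by offset within the page.
inline int sketch_lsa_compare(const SketchLsa &a, const SketchLsa &b) {
  if (a.pageid != b.pageid) return a.pageid < b.pageid ? -1 : 1;
  if (a.offset != b.offset) return a.offset < b.offset ? -1 : 1;
  return 0;
}
```

One consequence of the pageid-only check is visible in the sketch: make_lsa(-1, 7) reports is_null() as true even though its offset is not -1, which is why writing both fields in set_null() matters for initialization rather than for the null test itself.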

Open questions

What is the exact group-commit window policy? Is it a fixed timeout, a count-of-waiters threshold, or a hybrid? The flush daemon’s wake condition would tell us. Investigation path: read the daemon body in log_page_buffer.c (search for the log_flush_daemon function or thread entry) and correlate it with log_writer.c (76 KB), which appears to drive the flush coordination.
How is the prior-list size bounded? LOG_PRIOR_LSA_INFO::list_size is tracked but its consumers / triggers were not located in this pass. Is there a soft cap that forces appenders to drain before continuing? Investigation path: grep for list_size accesses; look in log_append.cpp and log_page_buffer.c.
TDE-encrypted log pages: exact placement of encryption. log_prior_node::tde_encrypted and log_append_info::appending_page_tde_encrypted are present, but whether encryption happens at attach, drain, or flush time is unverified. Investigation path: read prior_set_tde_encrypted and prior_is_tde_encrypted in log_append.cpp; correlate with tde.h.
LOG_DUMMY_GENERIC (record type 51): flush-only marker. The header comment calls it “ridiculous, but flush needs it”. What invariant does it preserve? Does a no-op record at a specific LSA force a page-boundary flush? Investigation path: grep for producers of LOG_DUMMY_GENERIC; check whether it is ever consumed by recovery dispatch.
Active-log size policy. LOG_HEADER::npages is set at create time but the resize / rotation policy is not surfaced in the header. Is there a configurable knob? Investigation path: search for npages writes outside of log_create_internal.
CDC-keep-alive interaction with archive deletion. cdc_min_log_pageid_to_keep (declared in log_manager.h:235) gates the archive-remove daemon, but the synchronisation between CDC’s progress and the daemon was not traced. Investigation path: read log_wakeup_remove_log_archive_daemon and follow cdc_min_log_pageid_to_keep consumers.

Beyond CUBRID — Comparative Designs & Research Frontiers

Pointers, not analysis. Each bullet is a starting handle for a follow-up doc.

PostgreSQL XLOG — XLogRecPtr is a flat 64-bit LSN, decomposed into (timeline, segment, offset) outside the integer. The record format uses a separate “block-data” array per record rather than a packed payload. CUBRID’s prior-list staging maps cleanly to PG’s WAL insertion locks (multiple, hashed) — a comparison would tell us how CUBRID’s single prior mutex scales vs. PG’s striped locks.
InnoDB redo log (mtr / log_t) — uses mini-transactions (mtr_t) as the staging unit, then a fixed-size circular log file with a separate log_buf ring. Group commit is via the flush_loop thread on an os_event. Comparing LOG_RUN_POSTPONE against InnoDB’s deferred-buffer-pool work would isolate where the two engines disagree on “must be on disk before commit”.
ARIES original (Mohan et al., 1992) — the canonical model for every WAL engine here. CUBRID’s three-LSA header (prev_tranlsa, back_lsa, forw_lsa) is the ARIES record layout; the CLR via LOG_COMPENSATE is the ARIES CLR. A side-by-side of CUBRID’s LOG_SYSOP_END against ARIES’s nested top-actions would surface where CUBRID extends the model (logical undo, MVCC undo).
Aurora’s offload-WAL (Verbitski et al., SIGMOD 2017) — moves WAL into the storage layer so each compute node writes log records, not data pages. CUBRID’s WAL is process-local, so this is more of a structural contrast than a feature gap; comparing the protocols highlights what changes when “WAL is the database” rather than “WAL describes the database”.
Silo (Tu et al., SOSP 2013) and epoch-based recovery — alternative to ARIES that batches commits per-epoch. CUBRID’s group-commit window is a softer version of the same idea (force a small batch together) without the strict epoch boundary.
TimescaleDB / Hyper logging for time-series workloads — log compression strategies that exploit the columnar nature of redo data. CUBRID’s LOG_DIFF_UNDOREDO_DATA (record type 43) is a primitive form of the same idea.

Sources

Raw analyses (`raw/code-analysis/cubrid/storage/`)

log_manager/log manager_v0.5.docx
recovery_manager/log_manager_v0.3.pptx — earlier deck filed under recovery_manager but covering the log manager surface.

Textbook chapters (under `knowledge/research/dbms-general/`)

Database Internals (Petrov), Ch. 5 “Transactions and Recovery”, §“Write-Ahead Logging” and §“Log Sequence Numbers”.
Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging (TODS 17.1, 1992). Also cited in the recovery doc; listed here because the LSN scheme and CLR design are first stated in this paper.

CUBRID source (`/data/hgryoo/references/cubrid/`)

src/transaction/log_manager.{c,h}
src/transaction/log_append.{cpp,hpp}
src/transaction/log_record.hpp
src/transaction/log_lsa.{hpp,cpp}
src/transaction/log_storage.hpp
src/transaction/log_page_buffer.c
src/transaction/log_compress.{c,h}
src/transaction/log_writer.c

Sibling docs in this knowledge base

knowledge/code-analysis/cubrid/cubrid-page-buffer-manager.md — data-page side of the WAL invariant.
knowledge/code-analysis/cubrid/cubrid-mvcc.md — consumers of LOG_MVCC_* records.
knowledge/code-analysis/cubrid/cubrid-recovery-manager.md — reader of this log; in-progress in the same 2026-04-30 batch.
knowledge/code-analysis/cubrid/cubrid-cdc.md — consumer of LOG_SUPPLEMENTAL_INFO; in-progress in the same batch.
knowledge/code-analysis/cubrid/cubrid-2pc.md — owner of LOG_2PC_* records; in-progress in the same batch.