CUBRID Log Manager — Code-Level Deep Dive

Where this document fits: The high-level analysis cubrid-log-manager.md covers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full lifecycle of a single log record inside the kernel.

Contents:

| Ch | Title |
| --- | --- |
| 1 | Data Structure Map |
| 2 | Initialization and Memory |
| 3 | Building a Prior Node from a Caller Request |
| 4 | LSA Assignment and Attach to the Prior List |
| 5 | Draining the Prior List into the Page Buffer |
| 6 | Crossing a Log Page Boundary |
| 7 | Flush Durability and the WAL Rule |
| 8 | Commit and Abort Lifecycle |
| 9 | System Operations Postpone and Compensation |
| 10 | Archiving Header Maintenance and Edge Paths |

A log record is re-encoded across three tiers from caller to disk: caller inputs (log_data_addr, log_crumb), the staging tier (log_prior_node, log_prior_lsa_info, log_append_info), and the on-disk tier (log_hdrpage, log_page, log_header, log_arv_header). The body — the log_rec_* family off log_rec_header — flows through all three unchanged. This is the field-level map; later chapters trace the motion between tiers.

Cross-link: WAL theory (why the log LSA orders everything, why the log must reach disk before the data page) lives in the high-level companion cubrid-log-manager.md. This chapter documents the structures the rule operates over, not the rule.

A log sequence address (LSA) packs a logical page id and an in-page byte offset into a 64-bit bit-field.

// log_lsa -- src/transaction/log_lsa.hpp
struct log_lsa
{
  std::int64_t pageid:48; /* Log page identifier : 6 bytes length */
  std::int64_t offset:16; /* Offset in page. :16 of int64 (not short) for alignment */
  // ... condensed: is_null(), is_max(), set_null(); ordering compares pageid then offset ...
};

| Field | Role | Why |
| --- | --- | --- |
| pageid (48b) | Logical page id in the infinite log | Unbounded append-only page sequence |
| offset (16b) | Byte offset within area[] | :16 of an int64 packs to 8 bytes |

INVARIANT — LSA total ordering. operator< compares pageid then offset, making LSAs a monotone WAL clock; every before/after and durability decision is an LSA comparison. Lose it and the WAL rule (Ch 7) and recovery replay cannot decide what to redo.

Sentinels/shims: NULL_LSA = {-1,-1} (set_null() writes both fields), MAX_LSA = {(1<<47)-1,(1<<15)-1}, and the LSA_* macros (LSA_COPY, LSA_SET_NULL, LSA_ISNULL, LSA_EQ/LE/LT/GE/GT, LSA_AS_ARGS) — inline wrappers over the operators so legacy C compiles.
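
The ordering rule and the sentinels can be illustrated with a minimal stand-in. This is a sketch under stated assumptions, not the CUBRID source: `sketch_lsa`, `make_lsa`, and `lsa_is_null` are hypothetical names invented here, and only the 48:16 split and the pageid-then-offset comparison come from the text above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for log_lsa: same 48:16 bit-field split, nothing else.
struct sketch_lsa
{
  std::int64_t pageid : 48;
  std::int64_t offset : 16;
};

// Build an LSA from plain integers (helper invented for this sketch).
inline sketch_lsa make_lsa (std::int64_t pageid, std::int64_t offset)
{
  sketch_lsa lsa;
  lsa.pageid = pageid;
  lsa.offset = offset;
  return lsa;
}

inline bool operator== (const sketch_lsa &a, const sketch_lsa &b)
{
  return a.pageid == b.pageid && a.offset == b.offset;
}

// Total order: page id first, in-page offset as the tie-breaker.
inline bool operator< (const sketch_lsa &a, const sketch_lsa &b)
{
  return a.pageid != b.pageid ? a.pageid < b.pageid : a.offset < b.offset;
}

// The null sentinel is {-1, -1}, as stated for NULL_LSA above.
inline bool lsa_is_null (const sketch_lsa &lsa)
{
  return lsa == make_lsa (-1, -1);
}

// Holds on mainstream compilers; bit-field packing is implementation-defined.
static_assert (sizeof (sketch_lsa) == 8, "48 + 16 bits are expected to share one 64-bit unit");
```

Because both fields are signed, the largest representable values are 2^47 - 1 and 2^15 - 1, which matches the MAX_LSA pair quoted above.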

Every on-disk record begins with a fixed log_rec_header threading it into a physical chain and a per-transaction chain:

// log_rec_header -- src/transaction/log_record.hpp
struct log_rec_header
{
LOG_LSA prev_tranlsa; /* prev record of SAME transaction */
LOG_LSA back_lsa, forw_lsa; /* physically prev / next record */
TRANID trid; LOG_RECTYPE type;
};

| Field | Role | Why |
| --- | --- | --- |
| prev_tranlsa | Prior record of the same transaction | Undo walks one transaction backward |
| back_lsa | Physically previous record | Reverse log scan |
| forw_lsa | Physically next record | Redo forward scan; NULL_LSA until successor known (Ch 4) |
| trid | Owning transaction id | Demultiplexes the interleaved stream |
| type | LOG_RECTYPE discriminator | Tagged-union tag; selects the payload struct |

INVARIANT — header forms the doubly linked physical chain. For adjacent A then B: B.back_lsa == addr(A) and A.forw_lsa == addr(B). back_lsa is set at build, forw_lsa only once the successor’s LSA is known — so the chain is briefly half-open at the tail. Disagreement makes undo and redo scans visit different record sets, breaking recovery.
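
A small model of the physical chain may make the half-open tail concrete. The sketch below assumes a simplified LSA (one integer per record) and uses hypothetical names (`sketch_record`, `chain_append`, `chain_is_consistent`); it only mirrors the rule stated above and is not taken from the CUBRID code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using addr_t = std::int64_t;       // simplified LSA: one integer per record
constexpr addr_t NO_ADDR = -1;     // plays the role of NULL_LSA

struct sketch_record               // hypothetical; only the two physical links
{
  addr_t self;
  addr_t back;
  addr_t forw;
};

// Append a record: back link is known at build time, while the predecessor's
// forward link is patched only now that the successor's address exists.
inline void chain_append (std::vector<sketch_record> &log, addr_t self)
{
  addr_t back = log.empty () ? NO_ADDR : log.back ().self;
  if (!log.empty ())
    {
      log.back ().forw = self;
    }
  log.push_back ({self, back, NO_ADDR});
}

// Check B.back == addr(A) and A.forw == addr(B) for every adjacent pair,
// and that the tail is still half-open (forw unset).
inline bool chain_is_consistent (const std::vector<sketch_record> &log)
{
  for (std::size_t i = 1; i < log.size (); i++)
    {
      if (log[i].back != log[i - 1].self || log[i - 1].forw != log[i].self)
        {
          return false;
        }
    }
  return log.empty () || log.back ().forw == NO_ADDR;
}
```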

The enum that type ranges over is explicitly numbered and append-only: obsolete values are wrapped in #if 0 rather than deleted, so the on-disk integer meaning never shifts.

// log_rectype -- src/transaction/log_record.hpp (condensed)
enum log_rectype
{
LOG_SMALLER_LOGREC_TYPE = 0, /* lower-bound check */
#if 0
LOG_CLIENT_NAME = 1, /* Obsolete -- hole preserved */
#endif
LOG_UNDOREDO_DATA = 2, LOG_UNDO_DATA = 3, LOG_REDO_DATA = 4,
// ... LOG_COMMIT=17, LOG_SYSOP_END=20, LOG_ABORT=22, ...
LOG_MVCC_UNDOREDO_DATA = 46, LOG_MVCC_UNDO_DATA = 47, LOG_MVCC_REDO_DATA = 48,
LOG_MVCC_DIFF_UNDOREDO_DATA = 49, LOG_SYSOP_ATOMIC_START = 50,
LOG_DUMMY_GENERIC = 51, /* dummy used for flush */
LOG_SUPPLEMENTAL_INFO = 52,
LOG_LARGER_LOGREC_TYPE /* upper-bound check */
};

INVARIANT — sentinel bounds and stable wire values. LOG_SMALLER_LOGREC_TYPE (0) and LOG_LARGER_LOGREC_TYPE bracket the valid range; because the integer is persisted a number is never reused (the #if 0 holes guarantee it). Classification macros (LOG_IS_UNDO_RECORD_TYPE, LOG_IS_REDO_RECORD_TYPE, LOG_IS_UNDOREDO_RECORD_TYPE, LOG_IS_MVCC_OP_RECORD_TYPE) read type.
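
As a rough illustration of how the two sentinels bracket the valid range, the sketch below uses a condensed enum whose numeric values are copied from the listing above. The `SK_` names and `rectype_in_bounds` are hypothetical and exist only for this example.

```cpp
#include <cassert>

// Condensed subset; numeric values copied from the listing above.
enum sketch_rectype
{
  SK_SMALLER_LOGREC_TYPE = 0,   // lower-bound sentinel
  SK_UNDOREDO_DATA = 2,
  SK_UNDO_DATA = 3,
  SK_REDO_DATA = 4,
  SK_SUPPLEMENTAL_INFO = 52,
  SK_LARGER_LOGREC_TYPE         // upper-bound sentinel, implicitly 53
};

// A raw on-disk integer is plausible only strictly between the sentinels.
inline bool rectype_in_bounds (int raw)
{
  return raw > SK_SMALLER_LOGREC_TYPE && raw < SK_LARGER_LOGREC_TYPE;
}
```

In this sketch a retired value such as 1 still passes the range test, so a bounds check of this kind can only reject values outside the bracket; it does not by itself identify holes.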

1.4 The recovery-data locator — log_data

Undo/redo payloads embed a log_data naming where on a data volume the change applies — a recovery coordinate, not the log’s address:

// log_data -- src/transaction/log_record.hpp
struct log_data { LOG_RCVINDEX rcvindex; PAGEID pageid; PGLENGTH offset; VOLID volid; };

| Field | Role | Why |
| --- | --- | --- |
| rcvindex | Index into the recovery dispatch table | Picks the rv* function for the bytes |
| pageid | Target data page id | Page to refix |
| offset | Offset/slot within that page | Where the change lands |
| volid | Volume id of the target page | Disambiguates pageid across volumes |

1.5 The payload family — undo/redo and MVCC variants

The type tag selects one payload, following the header on the page; all build on log_data:

// log_rec_undoredo / undo / redo -- src/transaction/log_record.hpp
struct log_rec_undoredo { LOG_DATA data; int ulength, rlength; };
struct log_rec_undo { LOG_DATA data; int length; };
struct log_rec_redo { LOG_DATA data; int length; };

log_rec_undoredo carries ulength+rlength (lengths frame the two blobs); log_rec_undo carries one length (undo image only, logical undo); log_rec_redo carries one length (redo image only, page-physical redo). MVCC variants wrap these and attach an MVCC id plus vacuum bookkeeping:

// MVCC payload wrappers -- src/transaction/log_record.hpp
struct log_rec_mvcc_undoredo { LOG_REC_UNDOREDO undoredo; MVCCID mvccid; LOG_VACUUM_INFO vacuum_info; };
struct log_rec_mvcc_undo { LOG_REC_UNDO undo; MVCCID mvccid; LOG_VACUUM_INFO vacuum_info; };
struct log_rec_mvcc_redo { LOG_REC_REDO redo; MVCCID mvccid; }; /* no vacuum_info */

| Struct | Wraps | Adds | Why |
| --- | --- | --- | --- |
| log_rec_mvcc_undoredo | log_rec_undoredo | mvccid, vacuum_info | MVCC ops that vacuum tracks |
| log_rec_mvcc_undo | log_rec_undo | mvccid, vacuum_info | MVCC delete-style ops |
| log_rec_mvcc_redo | log_rec_redo | mvccid only | Pure redo creates no version to vacuum |

log_vacuum_info is the back-pointer carried by the MVCC records that hold an undo image (log_rec_mvcc_undoredo and log_rec_mvcc_undo):

// log_vacuum_info -- src/transaction/log_record.hpp
struct log_vacuum_info { LOG_LSA prev_mvcc_op_log_lsa; VFID vfid; };

| Field | Role | Why |
| --- | --- | --- |
| prev_mvcc_op_log_lsa | LSA of the previous MVCC-op record | Vacuum walks this chain in log order |
| vfid | File the change belongs to | Detect dropped/reused file; decide object kind |

The append path materializes a record as a log_prior_node linked into the prior list — the central staging structure (Ch 3–5):

// log_prior_node -- src/transaction/log_append.hpp
struct log_prior_node
{
LOG_RECORD_HEADER log_header;
LOG_LSA start_lsa; bool tde_encrypted;
int data_header_length; char *data_header;
int ulength; char *udata; int rlength; char *rdata;
LOG_PRIOR_NODE *next;
};

| Field | Role | Why |
| --- | --- | --- |
| log_header | Embedded log_rec_header | Copied onto the page; back_lsa/forw_lsa filled at linking |
| start_lsa / tde_encrypted | Assigned LSA; encryption flag | LSA asserted vs page offset; flag drives hdr.flags at drain |
| data_header_length / data_header | Length + buffer of the log_rec_* struct | Serialized apart from variable data |
| ulength/udata, rlength/rdata | Length + buffer of undo / redo bytes | The two images, possibly compressed |
| next | Next node | Orders nodes awaiting drain |

INVARIANT — a prior node owns its heap buffers. data_header, udata, rdata are independently malloc-ed; length is zero exactly when the pointer is unused. Drain frees them after copying into the page buffer; leak or double-free corrupts the heap.

1.7 The prior-list anchor — log_prior_lsa_info

The in-memory anchor for the whole prior list — LSA cursor, list head/tail, and the serializing mutex:

// log_prior_lsa_info -- src/transaction/log_append.hpp
struct log_prior_lsa_info
{
LOG_LSA prior_lsa; LOG_LSA prev_lsa;
LOG_PRIOR_NODE *prior_list_header; LOG_PRIOR_NODE *prior_list_tail;
INT64 list_size; /* bytes */
LOG_PRIOR_NODE *prior_flush_list_header;
std::mutex prior_lsa_mutex;
};

| Field | Role | Why |
| --- | --- | --- |
| prior_lsa | Next LSA to assign | Monotone allocator cursor; advanced by record size (Ch 4) |
| prev_lsa | LSA of the last appended record | Fills the next node’s back_lsa |
| prior_list_header / prior_list_tail | Head / tail of the awaiting-drain list | Drain start; O(1) append |
| list_size | Total bytes staged | Flusher decides when to drain |
| prior_flush_list_header | Head of the detached flush sublist | Drain steals here so producers keep appending |
| prior_lsa_mutex | Mutex over all the above | LSA assignment + linkage atomic |

INVARIANT — prior_lsa_mutex serializes LSA assignment. prior_lsa is advanced and the node linked under one acquisition, so no two records share an LSA and list order matches LSA order; splitting the two mis-orders the drained page.
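
The single-acquisition rule can be sketched as follows. This is an illustrative model only: it assumes a flat integer address space with no page boundaries or alignment, and `sketch_node`, `sketch_prior_info`, and `prior_attach` are hypothetical names rather than CUBRID functions.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

using addr_t = std::int64_t;       // simplified LSA (assumption of this sketch)
constexpr addr_t NO_ADDR = -1;     // plays the role of NULL_LSA

struct sketch_node
{
  addr_t start = NO_ADDR;          // assigned under the mutex
  addr_t back = NO_ADDR;
  addr_t forw = NO_ADDR;
  std::int64_t size = 0;           // bytes this record will occupy
  sketch_node *next = nullptr;
};

struct sketch_prior_info
{
  addr_t prior_lsa = 0;            // next address to hand out
  addr_t prev_lsa = NO_ADDR;       // address of the last attached record
  sketch_node *head = nullptr;
  sketch_node *tail = nullptr;
  std::int64_t list_size = 0;
  std::mutex mtx;
};

// Assign the address and link the node under ONE lock acquisition, so that
// list order and address order cannot disagree.
inline addr_t prior_attach (sketch_prior_info &info, sketch_node &node)
{
  std::lock_guard<std::mutex> guard (info.mtx);
  node.start = info.prior_lsa;
  node.back = info.prev_lsa;
  if (info.tail != nullptr)
    {
      info.tail->forw = node.start;   // successor now known: close the chain
      info.tail->next = &node;
    }
  else
    {
      info.head = &node;
    }
  info.tail = &node;
  info.prev_lsa = node.start;
  info.prior_lsa += node.size;        // advance the allocator cursor
  info.list_size += node.size;
  return node.start;
}
```

If the address assignment and the tail splice were done under two separate acquisitions, a second producer could slip in between them and the list would no longer be sorted by address, which is the failure the invariant describes.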

1.8 The on-disk append cursor — log_append_info

The disk-facing append point — open log file, fixed page, and the lowest LSA not yet on disk:

// log_append_info -- src/transaction/log_append.hpp
struct log_append_info
{
int vdes;
std::atomic<LOG_LSA> nxio_lsa; /* Lowest LSA NOT yet written to disk (WAL) */
LOG_LSA prev_lsa; LOG_PAGE *log_pgptr; bool appending_page_tde_encrypted;
// ... condensed: get_nxio_lsa(), set_nxio_lsa() ...
};

| Field | Role | Why |
| --- | --- | --- |
| vdes | OS fd of the active log volume | Target of page writes |
| nxio_lsa | Atomic lowest LSA not yet flushed | WAL watermark; readers/flusher race without the prior mutex |
| prev_lsa | Last record appended to the buffer | Drain-side mirror of staging prev_lsa |
| log_pgptr | Currently fixed log page | Drain target; replaced on page boundary (Ch 6) |
| appending_page_tde_encrypted | Live page must be encrypted | Carries the node’s tde_encrypted onto the page |

INVARIANT — nxio_lsa is the WAL durability watermark. Records with LSA < nxio_lsa are on disk; >= nxio_lsa are not. Flusher and WAL checks touch it concurrently, so it is std::atomic, reached only via get_nxio_lsa()/set_nxio_lsa(); a torn read lets a data page flush ahead of its log (Ch 7).
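
The watermark comparison can be sketched with a plain atomic integer. The example assumes an LSA simplified to a single 64-bit value, and `sketch_append_info` and `log_is_durable_for` are hypothetical names; only the accessor names and the `<` / `>=` split come from the invariant above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

using addr_t = std::int64_t;   // simplified LSA (assumption of this sketch)

struct sketch_append_info
{
  std::atomic<addr_t> nxio_lsa {0};   // lowest address NOT yet on disk

  addr_t get_nxio_lsa () const
  {
    return nxio_lsa.load ();
  }

  void set_nxio_lsa (addr_t lsa)
  {
    nxio_lsa.store (lsa);
  }
};

// A data page whose newest change was logged at page_lsa is safe to write
// only if that log record is already below the watermark.
inline bool log_is_durable_for (const sketch_append_info &append, addr_t page_lsa)
{
  return page_lsa < append.get_nxio_lsa ();
}
```

A page whose LSA equals the watermark is not yet covered, so a caller in this model would have to flush the log first and re-check.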

1.9 The caller inputs — log_data_addr and log_crumb

What a caller (heap/btree op) hands the append API; everything above derives from these:

// log_data_addr / log_crumb -- src/transaction/log_append.hpp
struct log_crumb { int length; const void *data; };
struct log_data_addr { const VFID *vfid; PAGE_PTR pgptr; PGLENGTH offset; };

| Struct / Field | Role | Why |
| --- | --- | --- |
| log_crumb.length / .data | One contiguous piece of caller data | Callers pass an array to gather scattered buffers |
| log_data_addr.vfid | File the page belongs to, or NULL | File/TDE context; log_data.volid/pageid come from pgptr, not vfid |
| log_data_addr.pgptr | Pointer to the fixed data page | Its volid/pageid extracted into log_data |
| log_data_addr.offset | Offset/slot of the change | Becomes log_data.offset; high bits hold LOG_RV_RECORD_* flags |
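
Gathering an array of crumbs into one contiguous image might look like the sketch below. `sketch_crumb` mirrors the two fields of log_crumb shown above; `gather_crumbs` is a hypothetical helper written for this example and is not a CUBRID function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Same two fields as log_crumb above; the type name is a stand-in.
struct sketch_crumb
{
  int length;
  const void *data;
};

// Concatenate scattered caller buffers into one contiguous byte image.
inline std::vector<char> gather_crumbs (const sketch_crumb *crumbs, int num_crumbs)
{
  std::size_t total = 0;
  for (int i = 0; i < num_crumbs; i++)
    {
      total += static_cast<std::size_t> (crumbs[i].length);
    }

  std::vector<char> out (total);
  std::size_t at = 0;
  for (int i = 0; i < num_crumbs; i++)
    {
      if (crumbs[i].length > 0)
        {
          std::memcpy (out.data () + at, crumbs[i].data, static_cast<std::size_t> (crumbs[i].length));
          at += static_cast<std::size_t> (crumbs[i].length);
        }
    }
  return out;
}
```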

1.10 On-disk page structures — log_hdrpage and log_page

A physical log page is a log_hdrpage plus a flexible area[]. The area[1] is the struct-hack — never sizeof it; use LOG_PAGESIZE:

// log_hdrpage / log_page -- src/transaction/log_storage.hpp
struct log_hdrpage { LOG_PAGEID logical_pageid; PGLENGTH offset; short flags; int checksum; };
struct log_page { LOG_HDRPAGE hdr; char area[1]; }; /* area is flexible */

| Field | Role | Why |
| --- | --- | --- |
| logical_pageid | Page id in the infinite sequence | Matches log_lsa.pageid; identity check on read |
| offset | Offset of the first record starting here | Salvage anchor if the prior page is corrupt |
| flags | TDE bits (..._ENCRYPTED_AES/ARIA) | LOG_IS_PAGE_TDE_ENCRYPTED tests the mask |
| checksum | CRC32 over the page | Detects torn pages |
| log_page.hdr | The header above | Fixed page prefix |
| log_page.area[] | Record bytes | Sized by LOG_PAGESIZE |

INVARIANT — LOG_PAGEID -9 is the header page. LOGPB_HEADER_PAGE_ID = -9 holds the log_header, carries no log records, and is duplicated into every archive. Code must never write a normal record onto pageid -9.
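
Why `sizeof` is the wrong size for a struct-hack page can be shown with a stand-in layout. The field widths below (64-bit page id, 16-bit offset) are assumptions made for the sketch, and `sketch_hdrpage`, `sketch_page`, `usable_area`, and `alloc_log_page` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Stand-ins for log_hdrpage / log_page; field widths are assumed.
struct sketch_hdrpage
{
  std::int64_t logical_pageid;
  std::int16_t offset;
  short flags;
  int checksum;
};

struct sketch_page
{
  sketch_hdrpage hdr;
  char area[1];   // struct hack: the real extent is set by the page size
};

// Bytes available for records on a page of the configured size.
inline std::size_t usable_area (std::size_t log_pagesize)
{
  return log_pagesize - offsetof (sketch_page, area);
}

// Allocate by the configured page size, never by sizeof (sketch_page).
// The caller releases the page with std::free.
inline sketch_page *alloc_log_page (std::size_t log_pagesize)
{
  return static_cast<sketch_page *> (std::calloc (1, log_pagesize));
}
```

Here `sizeof (sketch_page)` covers only the header plus padding around the one-byte array, so an allocation sized that way would leave almost no room for records.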

1.11 The volume headers — log_header and log_arv_header

log_header is the master control block on page -9. Every member, grouped by role:

| Field group | Fields | Role |
| --- | --- | --- |
| Identity / safety | magic, db_creation, db_release, db_compatibility, db_iopagesize, db_logpagesize, db_charset | Refuse a log from an incompatible build/page size |
| Append cursor | append_lsa, fpageid, eof_lsa | Persisted append loc, pageid at slot 1, end of log |
| Recovery | chkpt_lsa, smallest_lsa_at_last_chkpt | Lowest LSA recovery starts from |
| Transaction / MVCC | next_trid, mvcc_next_id, mvcc_op_log_lsa, oldest_visible_mvccid, newest_block_mvccid, vacuum_last_blockid, does_block_need_vacuum | Next ids to assign; vacuum’s progress |
| Archive | nxarv_pageid, nxarv_phy_pageid, nxarv_num, last_arv_num_for_syscrashes, last_deleted_arv_num, npages | Drives Ch 10’s archiving |
| Backup | bkup_level0_lsa/1/2, bkinfo[] | Per-level incremental backup anchors |
| HA / lifecycle | ha_server_state, ha_file_status, ha_promotion_time, is_shutdown, was_active_log_reset, has_logging_been_skipped, db_restore_time, mark_will_del | Replication state; clean-shutdown flag |
| Alignment / misc | dummy, dummy3, dummy4, vol_creation, avg_ntrans, avg_nlocks, was_copied, prefix_name, perm_status_obsolete | dummy* pads; vol_creation time; avg_* sizing hints; was_copied resets a copied DB; prefix_name log prefix; perm_status_obsolete legacy |

log_arv_header is the smaller header stamped on each archive file:

// log_arv_header -- src/transaction/log_storage.hpp
struct log_arv_header
{
char magic[CUBRID_MAGIC_MAX_LENGTH];
INT32 dummy; INT64 db_creation; INT64 vol_creation;
TRANID next_trid; DKNPAGES npages; LOG_PAGEID fpageid;
int arv_num; INT32 dummy2;
};

| Field | Role | Why |
| --- | --- | --- |
| magic | File-type magic | file/magic recognition + sanity |
| db_creation / vol_creation | Creation timestamps | Match archive to its database/volume |
| next_trid | Next trid at archive time | Recovery context |
| npages | Page count in this archive | Bounds the page range |
| fpageid | Logical pageid at physical slot 1 | Maps physical to logical pages |
| arv_num | Archive sequence number | Matches log_header.nxarv_num chain |
| dummy, dummy2 | Alignment pads | Keep the on-disk layout stable |

flowchart TB
  subgraph CALLER["Caller inputs"]
    DADDR["log_data_addr"]
    CRUMB["log_crumb[]"]
  end
  subgraph STAGE["Staging tier (memory)"]
    PLINFO["log_prior_lsa_info"]
    NODE["log_prior_node"]
    AINFO["log_append_info"]
  end
  subgraph REC["Record body (all tiers)"]
    HDR["log_rec_header"]
    PAY["log_rec_*\n+ MVCC wrappers"]
  end
  subgraph DISK["On-disk tier"]
    PAGE["log_page"]
    HPAGE["log_hdrpage"]
    LHDR["log_header (page -9)"]
    AHDR["log_arv_header"]
  end

  DADDR --> NODE
  CRUMB --> NODE
  PLINFO -->|owns list of| NODE
  NODE -->|embeds| HDR
  NODE -->|serializes| PAY
  PAY -->|embeds| LDATA["log_data"]
  PLINFO -->|drains to| AINFO
  AINFO -->|fixes / writes| PAGE
  PAGE -->|hdr is| HPAGE
  LHDR -->|append_lsa to| PAGE
  LHDR -->|nxarv_* feed| AHDR

Figure 1-1. How a record’s structures connect across the three tiers.

The LSA/pointer edges a modifier must keep consistent:

  • Physical chain: log_rec_header.forw_lsa/back_lsa; prev_tranlsa.
  • Staging allocator: log_prior_lsa_info.prior_lsa/prev_lsa.
  • Durability watermark: log_append_info.nxio_lsa.
  • Vacuum chain: log_vacuum_info.prev_mvcc_op_log_lsa.
  1. A log record crosses three tiers — caller inputs, staging (log_prior_node anchored by log_prior_lsa_info, drained via log_append_info), and on-disk (log_page under log_header); the body (log_rec_header + a log_rec_* payload) is the constant.
  2. log_lsa is the 48:16 bit-field clock; its total ordering founds every durability decision. log_rec_header threads each record into a physical doubly linked chain and a per-transaction chain, with type discriminating the append-only, hole-preserving log_rectype.
  3. MVCC wrappers add mvccid and, for the undoredo and undo variants, log_vacuum_info; the redo wrapper omits vacuum_info because a pure redo creates no version to vacuum.
  4. prior_lsa_mutex makes LSA assignment plus linkage atomic and nxio_lsa is the atomic WAL watermark — the two concurrency invariants the append path rests on; the on-disk page reserves pageid -9 for log_header, whose nxarv_* feed log_arv_header.

The reader question: before any record can be appended, how are the prior-list, the page-buffer pool, the flush bookkeeping, and the global log state bootstrapped and allocated? For conceptual roles — what the prior list is for, why WAL demands a ring — see the companion cubrid-log-manager.md (“The append pipeline”, “Durability”). This chapter is bring-up mechanics: who mallocs, what each field starts at, which teardown frees it. Two entry points, both under LOG_CS, both calling log_final first if a prior instance is mounted:

  • log_create_internal — runs once at DB creation: formats the active-log volume, writes the first LOG_HEADER to page -9, flushes one empty append page, then tears the pool back down. No live state survives.
  • log_initialize_internal — runs at every restart / SA boot: mounts the existing active log, reads page -9, keeps the pool alive, hands control to recovery.

2.1 The global singleton: log_global / log_Gl

Everything hangs off one process-wide singleton, log_Gl (struct log_global), default-constructed at static init by log_global::log_global (log_global.c); bring-up populates its members rather than allocating it.

// log_global -- src/transaction/log_impl.h (condensed; #if SERVER_MODE members noted in the table)
struct log_global {
TRANTABLE trantable; LOG_APPEND_INFO append; LOG_PRIOR_LSA_INFO prior_info;
LOG_HEADER hdr; LOG_ARCHIVES archive; LOG_PAGEID run_nxchkpt_atpageid;
LOG_LSA chkpt_redo_lsa; DKNPAGES chkpt_every_npages; LOG_RECVPHASE rcv_phase; LOG_LSA rcv_phase_lsa;
LOG_PAGE *loghdr_pgptr; LOG_FLUSH_INFO flush_info; LOG_GROUP_COMMIT_INFO group_commit_info;
logwr_info *writer_info; /* the ONLY heap member of the ctor: new logwr_info() */
BACKGROUND_ARCHIVING_INFO bg_archive_info; mvcctable mvcc_table; GLOBAL_UNIQUE_STATS_TABLE unique_stats_table;
// #if SERVER_MODE: flushed_lsa_lower_bound, chkpt_lsa_lock, backup_in_progress; #else: final_restored_lsa
};

The ctor nulls every LSA-valued field to NULL_LSA, seeds flush_info to {0, 0, NULL, PTHREAD_MUTEX_INITIALIZER}, runs prior_info’s ctor (§2.5), and news writer_info (its only heap allocation). Every field:

| Field | Role | Why it exists / ctor seed |
| --- | --- | --- |
| trantable | Per-transaction LOG_TDES table | area == NULL is the “not initialized” sentinel; sized by logtb_define_trantable_log_latch. |
| append | Live append cursor (vdes, log_pgptr, prev_lsa, atomic nxio_lsa) | Where prior nodes drain into a page; Ch 4-5. |
| prior_info | In-memory prior-list head/tail + LSA cursors | Decouples LSA assignment from disk layout; Ch 3-5. |
| hdr | In-RAM copy of on-disk LOG_HEADER (append_lsa/eof_lsa live here) | Avoids re-reading page -9. |
| archive | Current archive descriptor cache | Used when a wanted page rolled into an archive. |
| run_nxchkpt_atpageid | Page id where next checkpoint fires | NULL_PAGEID during create/init; recomputed at end of init. |
| flushed_lsa_lower_bound / chkpt_lsa_lock | SERVER_MODE flush-coord LSA + chkpt-LSA mutex | NULL_LSA / PTHREAD_MUTEX_INITIALIZER. |
| chkpt_redo_lsa / chkpt_every_npages | Redo-start LSA + checkpoint frequency | NULL_LSA / INT_MAX (latter from PRM_ID_LOG_CHECKPOINT_NPAGES). |
| rcv_phase / rcv_phase_lsa | Recovery phase + its LSA | LOG_RECOVERY_ANALYSIS_PHASE / NULL_LSA; log_final resets phase. |
| backup_in_progress / final_restored_lsa | #if pair: SERVER backup flag vs SA last-restored LSA | One per build; false / NULL_LSA. |
| loghdr_pgptr | One LOG_PAGESIZE scratch page for header I/O | Global buffer malloc’d in log_initialize_internal, freed in log_final; distinct from the create-path local of the same name (§2.2). |
| flush_info | toflush[] + counters + mutex | Dirty append pages to push on a flush; §2.4. |
| group_commit_info | Mutex+cond for group commit | Lets committers coalesce fsyncs. |
| writer_info | HA log-writer state | Only ctor new; deleted in ~log_global. |
| bg_archive_info | Background archiving descriptor | Init’d at tail of init if PRM_ID_LOG_BACKGROUND_ARCHIVING is on. |
| mvcc_table / unique_stats_table | MVCC snapshot table / global unique-index stats | Default-ctor / GLOBAL_UNIQUE_STATS_TABLE_INITIALIZER. |

graph TD
  subgraph logGl["log_Gl (LOG_GLOBAL singleton)"]
    A["append : LOG_APPEND_INFO<br/>vdes, log_pgptr, prev_lsa, nxio_lsa"]
    P["prior_info : LOG_PRIOR_LSA_INFO<br/>prior_lsa, prev_lsa, list head/tail"]
    F["flush_info : LOG_FLUSH_INFO<br/>toflush[], max/num_toflush, mutex"]
  end
  PB["log_Pb (LOG_PB_GLOBAL_DATA)<br/>buffers[], pages_area, header_page"]
  F -- "toflush[] points into" --> PB
  A -- "log_pgptr points into" --> PB

Figure 2-1. The global singleton and the separately-declared page-buffer global log_Pb.

2.2 log_create_internal — first-ever bring-up

Runs under LOG_CS_ENTER. Every branch:

  1. Stale-state guard: trantable.area != NULL → log_final (§2.7).
  2. umask; logpb_initialize_pool (§2.3) allocates the ring. Error → goto error.
  3. logpb_initialize_log_names builds log_Name_active etc. Error → goto error.
  4. logpb_initialize_header (&log_Gl.hdr, ...) fills the in-RAM header (page count, db_logpagesize = LOG_PAGESIZE). Error → goto error.
  5. logpb_create_header_page carves the page -9 buffer into a stack local loghdr_pgptr, declared in log_create_internal and distinct from the global log_Gl.loghdr_pgptr of §2.1; the create path uses its own scratch page rather than the restart-path I/O buffer.
  6. fileio_format creates the active-log file; the compound if goto errors on any of vdes == NULL_VOLDES, logpb_fetch_start_append_page failing, or the local loghdr_pgptr == NULL:
// log_create_internal -- src/transaction/log_manager.c
log_Gl.append.vdes = fileio_format (thread_p, db_fullname, log_Name_active, ...);
if (log_Gl.append.vdes == NULL_VOLDES
    || logpb_fetch_start_append_page (thread_p) != NO_ERROR || loghdr_pgptr == NULL)
  goto error; /* <- any one failure unwinds the whole pool */
  7. Mark the empty append page dirty; logpb_flush_pages_direct writes the end-of-log mark.
  8. memcpy the in-RAM hdr into the local loghdr_pgptr->area; logpb_flush_page writes page -9 (error → goto error; under CUBRID_DEBUG it reads back and asserts).
  9. Clear log_pgptr, dismount, create volume-info/log-info files, register active + backup-info volumes via logpb_add_volume.
  10. Normal exit: logpb_finalize_pool, LOG_CS_EXIT, NO_ERROR.

The error: label runs the same logpb_finalize_pool + LOG_CS_EXIT (returning ER_FAILED if unset). Create never leaves a live pool.

INVARIANT — page -9 is the single source of truth for log geometry. The only place a fresh LOG_HEADER is written from scratch; every later boot reads it back. The step-8 memcpy + synchronous logpb_flush_page enforces it. If that flush fails silently, restart reads garbage geometry (db_logpagesize, fpageid) and re-formats or refuses to mount.

2.2b log_initialize_internal — restart bring-up

Shares the early scaffolding but diverges at the mount: it reads page -9, keeps the pool, dispatches to recovery. Every branch in order:

  1. Clean-state guard: trantable.area != NULL → log_final.
  2. Log-names init: logpb_initialize_log_names failure is fatal (logpb_fatal_error then goto error), not a plain propagate.
  3. loghdr_pgptr malloc: the global log_Gl.loghdr_pgptr (page -9 I/O buffer for logpb_fetch_header/logpb_flush_header); NULL → fatal + goto error. Freed in log_final (§2.7) and on error:.
  4. Pool init: logpb_initialize_pool (§2.3); error → goto error.
  5. fileio_mount returning NULL_VOLDES splits two ways — media-crash (ismedia_crash != false) synthesizes an approximate header (logpb_initialize_header for geometry, then the forced fields below mark everything un-checkpointed, LOG_RESET_APPEND_LSA syncs into prior_info, chkpt_lsa nulled, nxarv_* maxed); else error_code = ER_IO_MOUNT_FAIL; goto error:
    // log_initialize_internal -- src/transaction/log_manager.c
    log_Gl.hdr.fpageid = LOGPAGEID_MAX; log_Gl.hdr.append_lsa.pageid = LOGPAGEID_MAX;
    log_Gl.hdr.append_lsa.offset = 0; LOG_RESET_APPEND_LSA (&log_Gl.hdr.append_lsa);
  6. Otherwise (vdes mounted): logpb_fetch_header (&log_Gl.hdr) reads the real page -9 into the mirror.
  7. Copy hdr.chkpt_lsa → chkpt_redo_lsa. restore_slave branch (ismedia_crash && r_args && r_args->restore_slave): copy db_creation, smallest_lsa_at_last_chkpt, append_lsa out into r_args for HA slave restore.
  8. Prefix-name mismatch: strcmp(hdr.prefix_name, prefix_logname) != 0 → ER_LOG_INCOMPATIBLE_PREFIX_NAME (notification) and continue anyhow.
  9. Page-size mismatch → recursive re-init: hdr.db_iopagesize != IO_PAGESIZE || hdr.db_logpagesize != LOG_PAGESIZE → db_set_page_size, logpb_finalize_pool, dismount, LOG_CS_EXIT, re-logtb_define_trantable_log_latch, then call log_initialize_internal again and return, so the buffers are rebuilt at the right size (cross-ref §2.8).
  10. Compatibility checks (rel_get_disk_compatible, rel_is_log_compatible) goto error on incompatible versions; logtb_define_trantable_log_latch(-1) builds the live trantable; fileio_map_mounted verifies the log belongs to this DB (else undefine trantable + goto error).
  11. Recovery dispatch: init_emergency == false && (hdr.is_shutdown == false || ismedia_crash) → prior run crashed → log_recovery. Else clean/emergency boot → logpb_fetch_start_append_page, read EOF record to seed prev_lsa via LOG_RESET_PREV_LSA(&eof->back_lsa), set is_shutdown = false, logpb_flush_header.
  12. Prior/append LSA assert + reset (cross-ref §2.5): set rcv_phase = LOG_RESTARTED, then the defensive assert(0) + re-reset if append.prev_lsa/hdr.append_lsa diverge from prior_info; recompute chkpt_every_npages, run_nxchkpt_atpageid, bring up bg-archiving, LOG_CS_EXIT, return.
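
The recovery-dispatch condition in step 11 reduces to a three-input predicate. The sketch below restates it as a free function; `needs_recovery` is a hypothetical name, and the parameters simply mirror the flags named in the step.

```cpp
#include <cassert>

// True when the previous run must be treated as crashed and log_recovery
// should run; false for a clean or emergency boot.
inline bool needs_recovery (bool init_emergency, bool is_shutdown, bool ismedia_crash)
{
  return !init_emergency && (!is_shutdown || ismedia_crash);
}
```

Read this way, an emergency boot always skips recovery, while a media crash forces it even when the header recorded a clean shutdown.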

The error: label dismounts vdes if mounted, free_and_inits loghdr_pgptr, LOG_CS_EXIT, logpb_fatal_error — a failed restart aborts.

2.3 logpb_initialize_pool — the page-buffer ring

The ring lives in a separate global, log_Pb of type LOG_PB_GLOBAL_DATA, not inside log_Gl.

// log_pb_global_data / log_buffer -- src/transaction/log_page_buffer.c
struct log_pb_global_data {
LOG_BUFFER *buffers; LOG_PAGE *pages_area; LOG_BUFFER header_buffer; LOG_PAGE *header_page;
int num_buffers; LOGPB_PARTIAL_APPEND partial_append; };
struct log_buffer {
volatile LOG_PAGEID pageid; volatile LOG_PHY_PAGEID phy_pageid; bool dirty; LOG_PAGE *logpage; };

LOG_PB_GLOBAL_DATA: buffers (descriptor array), pages_area (one slab of num_buffers * LOG_PAGESIZE), header_buffer/header_page (the page -9 descriptor + backing page), num_buffers, partial_append (record-split-across-flush state, Ch 6). The per-page descriptor LOG_BUFFER:

| Field | Role | Why it exists |
| --- | --- | --- |
| pageid | Logical id of the resident log-sequence page | NULL_PAGEID = free; lookups key on this. volatile: read without the lock. |
| phy_pageid | Physical offset in the active-log file | Translation cache so each flush skips logpb_to_physical_pageid. |
| dirty | Page differs from disk | Drives whether a slot is added to toflush[]. |
| logpage | Pointer into the shared pages_area slab | Decouples the small descriptor from the LOG_PAGESIZE payload. |

Branch-complete (asserts LOG_CS_OWN_WRITE_MODE):

  1. log_append_init_zip (§2.6) — compression contexts come up before the ring.
  2. If logpb_Initialized, logpb_finalize_pool (re-entrant safety), then assert pages_area == NULL.
  3. num_buffers = prm_get_integer_value (PRM_ID_LOG_NBUFFERS).
  4. malloc buffers. NULL → er_set + return ER_OUT_OF_VIRTUAL_MEMORY (no pool to unwind).
  5. malloc pages_area (num_buffers * LOG_PAGESIZE). NULL → free_and_init(buffers), return.
  6. memset slab to LOG_PAGE_INIT_VALUE; loop logpb_initialize_log_buffer (&buffers[i], pages_area + i*LOG_PAGESIZE) wires descriptor i to slab slot i, setting pageid = phy_pageid = NULL_PAGEID, dirty = false, and stamping the page header (logical_pageid = NULL_PAGEID, offset = NULL_OFFSET, flags = 0).
  7. malloc header_page (one LOG_PAGESIZE); NULL → free both prior allocations, return. Wired into header_buffer — resident slot for page -9 (LOGPB_HEADER_PAGE_ID == -9).
  8. logpb_initialize_flush_info (§2.4). Error → goto error.
  9. partial_append.status = LOGPB_APPENDREC_SUCCESS; its aligned scratch page pointer is set.
  10. logpb_Initialized = true; pthread_*_init the chkpt-lsa lock, group-commit cond/mutex, writer_info conds/mutexes; writer_info->is_init = true. Return NO_ERROR.

The error: label runs logpb_finalize_pool then logpb_fatal_error (aborts) — a pool-init failure is fatal, unlike the early malloc returns which merely propagate.

INVARIANT — the descriptor array and the page slab are the same length and freed together. buffers[i].logpage always points at pages_area + i*LOG_PAGESIZE; logpb_locate_page recovers the index by (log_pg - pages_area) / LOG_PAGESIZE and asserts the round-trip. If num_buffers diverged between the two mallocs, that arithmetic indexes out of bounds.
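
The descriptor-to-slab wiring and the index round-trip can be modelled in a few lines. This is a sketch under assumptions: it uses byte pointers and `std::vector` in place of the real malloc'd slab, and `sketch_buffer`, `sketch_pool`, and `locate` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct sketch_buffer        // hypothetical descriptor, condensed
{
  long pageid;              // -1 plays the role of NULL_PAGEID (free slot)
  bool dirty;
  char *logpage;            // points into the shared slab
};

struct sketch_pool
{
  std::size_t page_size;
  std::vector<char> pages_area;          // one slab: num_buffers * page_size bytes
  std::vector<sketch_buffer> buffers;    // same length as the slab has slots

  sketch_pool (std::size_t num_buffers, std::size_t page_size_arg)
    : page_size (page_size_arg),
      pages_area (num_buffers * page_size_arg, 0),
      buffers (num_buffers)
  {
    // Wire descriptor i to slab slot i, using byte arithmetic on the slab.
    for (std::size_t i = 0; i < num_buffers; i++)
      {
        buffers[i] = sketch_buffer { -1, false, pages_area.data () + i * page_size };
      }
  }

  // Recover the descriptor index from a page pointer and check the round trip.
  std::size_t locate (const char *log_pg) const
  {
    std::size_t index = static_cast<std::size_t> (log_pg - pages_area.data ()) / page_size;
    assert (index < buffers.size () && buffers[index].logpage == log_pg);
    return index;
  }
};
```

Because both containers are sized from the same `num_buffers` argument in one constructor, the two lengths cannot diverge in this model, which is the property the invariant relies on.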

flowchart TD
  S["init_zip; finalize_pool if re-entrant"] --> N["num_buffers = PRM_ID_LOG_NBUFFERS"]
  N --> B{"malloc buffers?"}
  B -- no --> E1["return ER_OUT_OF_VIRTUAL_MEMORY"]
  B -- yes --> P{"malloc pages_area?"}
  P -- no --> E2["free buffers; return"]
  P -- yes --> Hp{"malloc header_page?"}
  Hp -- no --> E3["free buffers+pages; return"]
  Hp -- yes --> Fi{"init_flush_info?"}
  Fi -- no --> Err["goto error: finalize_pool; fatal_error"]
  Fi -- yes --> Done["init mutexes; Initialized=true; NO_ERROR"]

Figure 2-2. Branch map of logpb_initialize_pool, every allocation-failure path.

2.4 logpb_initialize_flush_info — the dirty-page roster

LOG_FLUSH_INFO (embedded as log_Gl.flush_info) is the list of append pages a flush must push to disk.

// log_flush_info -- src/transaction/log_impl.h
struct log_flush_info {
int max_toflush; int num_toflush; LOG_PAGE **toflush;
#if defined(SERVER_MODE)
pthread_mutex_t flush_mutex;
#endif
};

| Field | Role | Why it exists |
| --- | --- | --- |
| max_toflush | Capacity, set to num_buffers - 1 | One slot reserved (header flushes separately), so the roster never exceeds num_buffers - 1. |
| num_toflush | Live count of staged pages | Reset to 0 here and after each flush. |
| toflush | Array of LOG_PAGE* in ascending page-id order | calloc’d to num_buffers pointers; sorted so the writev issues contiguous I/O. |
| flush_mutex | (SERVER_MODE) serializes roster mutation | Log-flush thread and committers both touch it. |

logpb_initialize_flush_info: if toflush != NULL it calls logpb_finalize_flush_info first (re-entrant) then asserts toflush == NULL; sets max_toflush = num_buffers - 1, num_toflush = 0, calloc’s toflush to num_buffers pointers (extra slot is harmless slack), er_sets ER_OUT_OF_VIRTUAL_MEMORY on NULL, and pthread_mutex_inits — even on allocation failure, returning the error code the caller treats as goto error. logpb_finalize_flush_info reverses it: if toflush != NULL, lock, free_and_init(toflush), zero counters, unlock, pthread_mutex_destroy; no-op (double-call safe) when already NULL.

2.5 prior_lsa_info constructor — seeding the prior list

LOG_PRIOR_LSA_INFO heads the in-memory prior list (the staging area between a caller’s append request and the page buffer; Ch 3-5).

// log_prior_lsa_info -- src/transaction/log_append.hpp
struct log_prior_lsa_info {
LOG_LSA prior_lsa; LOG_LSA prev_lsa; LOG_PRIOR_NODE *prior_list_header; LOG_PRIOR_NODE *prior_list_tail;
INT64 list_size; LOG_PRIOR_NODE *prior_flush_list_header; std::mutex prior_lsa_mutex; log_prior_lsa_info (); };

| Field | Role | Why it exists |
|---|---|---|
| prior_lsa | LSA the next appended node will receive | Advancing it under the mutex issues LSAs in monotonic order without touching disk. |
| prev_lsa | LSA of the previously appended node | Lets each new node store a back_lsa for backward chaining / undo. |
| prior_list_header / prior_list_tail | FIFO head (drain consumes) / tail (O(1) append) | Drain (Ch 5) reads head; new nodes splice at tail. |
| list_size | Queued byte count | Lets the drainer/flusher decide when to push. |
| prior_flush_list_header | Sub-list already promoted toward flush | Separates “appended” from “being flushed”. |
| prior_lsa_mutex | The hot lock of the whole subsystem | Every LSA assignment serializes here. |

The ctor seeds everything empty; the real LSA seed is deferred to log_initialize_internal, which copies recovered header LSAs into both append and prior_info:

// log_prior_lsa_info ctor / LOG_RESET_*_LSA -- src/transaction/log_append.cpp
log_prior_lsa_info::log_prior_lsa_info () // every member: NULL_LSA / NULL / 0 / default mutex
: prior_lsa (NULL_LSA), prev_lsa (NULL_LSA), prior_list_header (NULL), prior_list_tail (NULL)
, list_size (0), prior_flush_list_header (NULL), prior_lsa_mutex () { }
void LOG_RESET_APPEND_LSA (const LOG_LSA *lsa) // header drives prior_lsa
{ log_Gl.hdr.append_lsa = *lsa; log_Gl.prior_info.prior_lsa = *lsa; }
void LOG_RESET_PREV_LSA (const LOG_LSA *lsa)
{ log_Gl.append.prev_lsa = *lsa; log_Gl.prior_info.prev_lsa = *lsa; }

INVARIANT — prior_info.prior_lsa == hdr.append_lsa and prior_info.prev_lsa == append.prev_lsa at end of init. log_initialize_internal assert(0)s and re-resets on divergence (if (!LSA_EQ (&log_Gl.hdr.append_lsa, &log_Gl.prior_info.prior_lsa)) { assert (0); LOG_RESET_APPEND_LSA (...); } and the symmetric prev_lsa check). If it drifted, the first appended record would get an LSA disagreeing with where the cursor writes, corrupting the back-chain.
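The paired-write pattern behind this invariant can be sketched as follows. This is an illustrative model only: Lsa, LogGlobals, reset_append_lsa, reset_prev_lsa, and cursors_agree are hypothetical names standing in for LOG_LSA, log_Gl, the two LOG_RESET_* helpers, and the init-time check.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplified LSA (the real one is a 48/16-bit bit-field).
struct Lsa
{
  std::int64_t pageid = -1;
  std::int64_t offset = -1;
};

inline bool operator== (const Lsa &a, const Lsa &b)
{
  return a.pageid == b.pageid && a.offset == b.offset;
}

// Hypothetical stand-in for the four cursors kept in log_Gl.
struct LogGlobals
{
  Lsa hdr_append_lsa;   // models hdr.append_lsa
  Lsa append_prev_lsa;  // models append.prev_lsa
  Lsa prior_lsa;        // models prior_info.prior_lsa
  Lsa prior_prev_lsa;   // models prior_info.prev_lsa
};

// Each reset writes both copies, so they cannot diverge through this path.
inline void reset_append_lsa (LogGlobals &g, const Lsa &lsa)
{
  g.hdr_append_lsa = lsa;
  g.prior_lsa = lsa;
}

inline void reset_prev_lsa (LogGlobals &g, const Lsa &lsa)
{
  g.append_prev_lsa = lsa;
  g.prior_prev_lsa = lsa;
}

// The end-of-init check: both pairs must match.
inline bool cursors_agree (const LogGlobals &g)
{
  return g.hdr_append_lsa == g.prior_lsa && g.append_prev_lsa == g.prior_prev_lsa;
}
```

Any write that bypasses the paired helpers (for example, assigning prior_lsa alone) makes cursors_agree return false, which is the drift the init-time assert is there to catch.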

2.6 log_append_init_zip / log_append_final_zip — compression contexts

LOG_ZIP is the (de)compression scratch buffer: struct log_zip { LOG_ZIP_SIZE_T data_length = 0; LOG_ZIP_SIZE_T buf_size = 0; char *log_data = nullptr; }; (log_compress.h).

| Field | Role | Why it exists |
|---|---|---|
| data_length | Bytes currently held | Result length after log_zip/log_unzip. |
| buf_size | Capacity of log_data | log_zip_realloc_if_needed grows it; avoids re-malloc per record. |
| log_data | The (de)compression buffer | Holds LZ4 output; log_zip_alloc(IO_PAGESIZE) sizes it. |

log_append_init_zip branches on mode and PRM_ID_LOG_COMPRESS:

  1. Compression disabled → log_Zip_support = false, return.
  2. SERVER_MODE: log_Zip_support = true; the buffers are per-thread, allocated lazily on first use — log_append_get_zip_undo/_redo do if (thread_p->log_zip_undo == NULL) thread_p->log_zip_undo = log_zip_alloc (IO_PAGESIZE);.
  3. SA-mode: allocate two process-global statics log_zip_undo/log_zip_redo plus a log_data_ptr scratch of IO_PAGESIZE * 2. If any is NULL → log_Zip_support = false and free whichever allocated (each under its own if). Else log_Zip_support = true.

log_append_final_zip mirrors it: if !log_Zip_support return; under SERVER_MODE nothing (per-thread buffers die with the thread entry); in SA-mode frees log_zip_undo/log_zip_redo/log_data_ptr. It runs from logpb_finalize_pool (§2.7), so zip teardown is tied to pool teardown.

INVARIANT — log_Zip_support is the single gate. All callers gate on it, never on the individual buffer pointers; init sets it false on any partial allocation failure so a half-allocated context is never used.
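The all-or-nothing rule can be sketched like this. It is a simplified model under stated assumptions: ZipContext, zip_init, zip_release, and the injectable AllocFn are hypothetical, and plain char buffers stand in for the LOG_ZIP statics. The injected allocator exists only so the partial-failure path can be exercised.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

using AllocFn = void *(*) (std::size_t);  // hypothetical allocator hook

// Hypothetical model of the SA-mode compression context.
struct ZipContext
{
  bool support = false;     // models log_Zip_support, the single gate
  char *undo = nullptr;     // models log_zip_undo
  char *redo = nullptr;     // models log_zip_redo
  char *scratch = nullptr;  // models log_data_ptr
};

inline void *plain_alloc (std::size_t n) { return std::malloc (n); }

// Test double: the first call succeeds, every later call fails.
inline void *fail_after_first (std::size_t n)
{
  static int calls = 0;
  return (calls++ == 0) ? std::malloc (n) : nullptr;
}

// Frees whichever buffers exist (each under its own if) and closes the gate.
inline void zip_release (ZipContext &z)
{
  if (z.undo != nullptr) { std::free (z.undo); z.undo = nullptr; }
  if (z.redo != nullptr) { std::free (z.redo); z.redo = nullptr; }
  if (z.scratch != nullptr) { std::free (z.scratch); z.scratch = nullptr; }
  z.support = false;
}

// The gate opens only if all three allocations succeeded.
inline void zip_init (ZipContext &z, std::size_t page_size, AllocFn alloc)
{
  z.undo = static_cast<char *> (alloc (page_size));
  z.redo = static_cast<char *> (alloc (page_size));
  z.scratch = static_cast<char *> (alloc (page_size * 2));
  if (z.undo == nullptr || z.redo == nullptr || z.scratch == nullptr)
    {
      zip_release (z);  // a half-allocated context is never left behind
      return;
    }
  z.support = true;
}
```

Callers in this model test only z.support, never the individual pointers, which is the point of the invariant.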

2.7 Teardown: log_final and logpb_finalize_pool

log_final is the orderly shutdown path; create/init also call it up front as a re-entrancy guard. Branch-complete:

  1. Destroy server daemons and system transactions; LOG_CS_ENTER; reset rcv_phase.
  2. trantable.area == NULL → nothing initialized; exit.
  3. Else !logpb_is_pool_initialized() → only trantable; logtb_undefine_trantable, exit.
  4. Else append.vdes == NULL_VOLDES → pool but no volume; logpb_finalize_pool + logtb_undefine_trantable, exit.
  5. Else abort every active transaction (log_abort), tracking anyloose_ends; flush to disk (logpb_flush_pages_direct + pgbuf_flush_all + fileio_synchronize_all).
  6. Header branch: if !anyloose_ends && error_code == NO_ERROR, set hdr.is_shutdown = true and snap chkpt_lsa = append_lsa (clean — restart skips recovery). Else logpb_checkpoint.
  7. logpb_flush_header, logpb_finalize_pool, logtb_undefine_trantable, dismount bg-archive + active volumes, free_and_init(loghdr_pgptr), LOG_CS_EXIT.

logpb_finalize_pool (from log_final and the create/init error paths) is idempotent — returns if !logpb_Initialized. Otherwise it reverses bring-up exactly: clear the append cursor (log_pgptr = NULL, nxio_lsa/prev_lsa = NULL_LSA, mirrored into prior_info), free_and_init buffers/pages_area/header_page, num_buffers = 0, logpb_Initialized = false, logpb_finalize_flush_info (§2.4), destroy chkpt + group-commit locks, finalize writer info, and finally log_append_final_zip (§2.6) — zip freed last, mirroring init’s zip-first, so no in-flight append (touching a per-thread LOG_ZIP) outlives its buffers.

2.8 The LOGAREA_SIZE / LOG_PAGESIZE relationship

A LOG_PAGE is LOG_PAGESIZE bytes (db_Log_page_size in storage_common.h). The first SSIZEOF(LOG_HDRPAGE) bytes are the page header; the rest is the record area: #define LOGAREA_SIZE (LOG_PAGESIZE - SSIZEOF(LOG_HDRPAGE)) (log_impl.h).

This constant constrains all record placement. The append macros (LOG_APPEND_ALIGN, LOG_APPEND_ADVANCE_WHEN_DOESNOT_FIT) compare append_lsa.offset against LOGAREA_SIZE and call logpb_next_append_page on overflow; LOG_PRIOR_LSA_LAST_APPEND_OFFSET() likewise returns LOGAREA_SIZE, so the prior-list and page-buffer sides agree on where a page ends (page-crossing is Chapter 6). At init the point is: geometry is fixed from the header’s db_logpagesize, validated against the running LOG_PAGESIZE.

INVARIANT — db_logpagesize must equal the running LOG_PAGESIZE. As traced in §2.2b step 9, log_initialize_internal checks db_iopagesize != IO_PAGESIZE || db_logpagesize != LOG_PAGESIZE; on mismatch it db_set_page_sizes, finalizes the pool, dismounts, and recursively re-enters itself so buffers are reallocated at the correct size. Otherwise LOGAREA_SIZE is computed against the wrong page size and records straddle physical page boundaries.
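The geometry can be checked with a few lines of arithmetic. This is a sketch under assumptions: the 16384-byte page and 16-byte page header used in the usage note are illustrative values, not quoted from the source, and the helper names are hypothetical. The sketch treats a record as fitting when offset + length does not exceed the area; the exact comparison used by the real append macros is not quoted in this document.

```cpp
#include <cassert>

// LOGAREA_SIZE = LOG_PAGESIZE - SSIZEOF(LOG_HDRPAGE), as a function.
constexpr int log_area_size (int log_pagesize, int hdrpage_size)
{
  return log_pagesize - hdrpage_size;
}

// Assumed fit rule: the record must end at or before the area boundary.
constexpr bool fits_in_area (int offset, int length, int area)
{
  return offset + length <= area;
}

// The init-time validation: either size differing forces a re-init.
constexpr bool geometry_mismatch (int hdr_iopagesize, int hdr_logpagesize,
                                  int run_iopagesize, int run_logpagesize)
{
  return hdr_iopagesize != run_iopagesize || hdr_logpagesize != run_logpagesize;
}

static_assert (log_area_size (16384, 16) == 16368, "area = page - header");
```

For example, with a 16368-byte area a 368-byte record at offset 16000 fits exactly, while a 369-byte record at the same offset has to start on the next page.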

  1. Two entry points, different lifetimes. log_create_internal formats, writes page -9, finalizes the pool (no live state); log_initialize_internal mounts, reads page -9, keeps the pool live, runs recovery.
  2. Restart has a richer branch tree (§2.2b): fatal log-names path, global loghdr_pgptr malloc, fileio_mount NULL_VOLDES split (media-crash header synthesis with LOGPAGEID_MAX vs ER_IO_MOUNT_FAIL), logpb_fetch_header, restore_slave copy-out, tolerated prefix mismatch, recursive page-size re-init, recovery-vs-clean dispatch.
  3. Two globals: log_Gl (append/prior/header/flush) vs the separate ring log_Pb; flush_info.toflush[] and append.log_pgptr point into log_Pb.
  4. The ring is two parallel allocations: LOG_BUFFER[] + one pages_area slab; descriptor i ↔ slab slot i, recovered by pointer arithmetic. Flush capacity is num_buffers - 1.
  5. The prior list starts empty, LSA-seeded from the header via LOG_RESET_APPEND_LSA/LOG_RESET_PREV_LSA into both append and prior_info; init asserts they agree.
  6. Compression is mode-split: SA-mode process-global LOG_ZIP statics, server-mode per-thread lazy; log_Zip_support is the single gate, false on any partial failure.
  7. Teardown reverses bring-up, freeing flush-info and zip last; log_final’s is_shutdown = true branch lets the next boot skip recovery.

Chapter 3: Building a Prior Node from a Caller Request

When the engine modifies a page it calls a log_append_* API. Before the change can reach disk it must become a prior node — a heap-allocated LOG_PRIOR_NODE carrying a fully formed log record. This chapter answers: given a caller tuple (rcvindex, addr, undo_data, redo_data), how is a complete LOG_PRIOR_NODE built before it is handed an LSA?

The defining property of this phase: it runs entirely outside the prior-list mutex — allocation, header sizing, payload copying, and compression all happen on the caller’s memory. Only once the node is finished does Chapter 4’s prior_lsa_next_record take prior_lsa_mutex, stamp the LSA, and splice it in (the companion’s single-writer pipeline).

3.1 The append API surface — thin wrappers over crumbs

The entry points (log_append_undoredo_data, log_append_undo_data, log_append_redo_data, plus the *2 and *_recdes variants) package the caller’s contiguous buffer into one LOG_CRUMB and delegate to the crumbs API.

// log_append_undoredo_data -- src/transaction/log_manager.c
LOG_CRUMB undo_crumb, redo_crumb;
assert (0 == undo_length || undo_data != NULL); /* <- nonzero length requires non-NULL data */
undo_crumb.data = undo_data; undo_crumb.length = undo_length; // ... redo_crumb the same ...
log_append_undoredo_crumbs (thread_p, rcvindex, addr, 1, 1, &undo_crumb, &redo_crumb);
// inside log_append_undoredo_crumbs: type from rcvindex alone:
LOG_RECTYPE rectype = LOG_IS_MVCC_OPERATION (rcvindex) ? LOG_MVCC_UNDOREDO_DATA : LOG_UNDOREDO_DATA;

A LOG_CRUMB is a (length, data) pair. The *2 variants synthesize a LOG_DATA_ADDR from (vfid, pgptr, offset); _recdes variants wrap a RECDES. LOG_IS_MVCC_OPERATION is true for MVCC heap/btree ops and RVES_NOTIFY_VACUUM; the undo-only path picks LOG_(MVCC_)UNDO_DATA, the redo-only path *REDO*. This rectype is the switch key for every sizing decision downstream. Before construction, log_append_*_crumbs runs a guard chain (Figure 3-1); each guard is a distinct early return. Four guards precede construction and the fifth (node == NULL) checks its result, so a record reaches Ch 4 only after all five pass:

flowchart TB
  B{"log_No_logging?"} -- yes --> B1["log_skip_logging; return"]
  B -- no --> D{"LOG_FIND_TDES == NULL?"}
  D -- yes --> D1["ER_LOG_UNKNOWN_TRANINDEX; return"]
  D -- no --> E{"not sysop AND not active AND not aborted?"}
  E -- yes --> E1["return, log nothing"]
  E -- no --> F{"log_can_skip_undo_logging?"}
  F -- yes --> F1["append redo crumbs only; return"]
  F -- no --> G["prior_lsa_alloc_and_copy_crumbs"]
  G --> H{"node == NULL?"} -- yes --> H1["return"]
  H -- no --> I["TDE encrypt; prior_lsa_next_record (Ch 4)"]

Figure 3-1 — Guard chain of log_append_undoredo_crumbs. When undo is skippable it degenerates to a redo-only append. log_append_undo_crumbs skips silently (no redo fallback); log_append_redo_crumbs uses log_can_skip_redo_logging.
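The guard chain in Figure 3-1 can be written as a straight-line classifier. This is a sketch, not the real function: AppendContext and AppendOutcome are hypothetical, and each boolean stands in for the corresponding check in the figure.

```cpp
#include <cassert>

// Hypothetical outcomes, one per exit in Figure 3-1.
enum class AppendOutcome
{
  SkippedNoLogging,     // log_No_logging was set
  UnknownTranIndex,     // no transaction descriptor found
  InactiveTransaction,  // not sysop, not active, not aborted
  RedoOnly,             // undo is skippable: redo-only append
  AllocFailed,          // node construction returned NULL
  Appended              // node built and handed to Ch 4
};

// Hypothetical inputs; each flag models one guard's condition.
struct AppendContext
{
  bool no_logging = false;
  bool has_tdes = true;
  bool is_sysop = false;
  bool is_active = true;
  bool is_aborted = false;
  bool can_skip_undo = false;
  bool alloc_ok = true;
};

// Guards are evaluated in figure order; the first one that fires returns.
inline AppendOutcome classify_undoredo_append (const AppendContext &c)
{
  if (c.no_logging) return AppendOutcome::SkippedNoLogging;
  if (!c.has_tdes) return AppendOutcome::UnknownTranIndex;
  if (!c.is_sysop && !c.is_active && !c.is_aborted) return AppendOutcome::InactiveTransaction;
  if (c.can_skip_undo) return AppendOutcome::RedoOnly;
  if (!c.alloc_ok) return AppendOutcome::AllocFailed;
  return AppendOutcome::Appended;
}
```

Because the guards are ordered, a context that trips several of them reports only the earliest: with no_logging set, a missing descriptor is never examined.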

3.2 LOG_PRIOR_NODE — the construction target

// struct log_prior_node -- src/transaction/log_append.hpp
struct log_prior_node {
LOG_RECORD_HEADER log_header;
LOG_LSA start_lsa; /* for assertion */
bool tde_encrypted;
int data_header_length; char *data_header;
int ulength; char *udata;
int rlength; char *rdata;
LOG_PRIOR_NODE *next;
};

| Field | Role | Why it exists |
|---|---|---|
| log_header | Only .type set here. | Record identity / switch key. LSA links filled in Ch 4 under the mutex. |
| start_lsa | Eventual LSA. Unset here (/* for assertion */). | Assigned by prior_lsa_next_record (Ch 4); read only by MVCC vacuum-header assertions. The node has no log position yet, so reading it during construction is a bug. |
| tde_encrypted | Whether the log page must be TDE-encrypted. | false at alloc, raised by prior_set_tde_encrypted; drives page-boundary encryption (Ch 6). |
| data_header_length / data_header | Byte size + separate malloc holding the filled LOG_REC_*. | From rectype via sizeof(LOG_REC_*); separate buffer lets the drain (Ch 5) copy header then data independently. |
| ulength / udata | Stored undo byte count + heap copy of undo bytes. The zipped flag lives in the data_header length field, not here. | Node must own its payload; caller’s buffer may be freed after return. Drain copies exactly ulength bytes. |
| rlength / rdata | As above, for redo. | Redo payload ownership and length. |
| next | List pointer. | NULL here; Ch 4 sets it on append to prior_list_tail. |

A finished node is built from three kinds of malloc, up to four allocations in total: the node itself, data_header (for example a LOG_REC_UNDOREDO or LOG_REC_MVCC_UNDOREDO), and one copy per payload side (udata, rdata). That makes it self-owned; next, start_lsa, and the log_header LSA links stay blank until Ch 4.

Invariant — the node owns its payload by value. udata/rdata are always freshly malloc’d copies (the copiers always memcpy), never aliases of the caller’s buffers. If violated, the asynchronous drain in Ch 5 could read freed memory.
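Ownership by value can be illustrated with a crumb copier that allocates once and copies each crumb in turn. This is a minimal sketch: Crumb, OwnedNode, copy_undo_crumbs, and release_node are hypothetical names, and the real copiers also report ER_OUT_OF_VIRTUAL_MEMORY rather than a bool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical mirror of the (length, data) pair.
struct Crumb
{
  int length;
  const void *data;
};

// Hypothetical node holding only the undo side.
struct OwnedNode
{
  int ulength = 0;
  char *udata = nullptr;
};

// One malloc for the total, then one memcpy per crumb.
// Returns false only when the allocation fails.
inline bool copy_undo_crumbs (OwnedNode &node, int num_crumbs, const Crumb *crumbs)
{
  int total = 0;
  for (int i = 0; i < num_crumbs; i++)
    {
      total += crumbs[i].length;
    }
  if (total <= 0)
    {
      return true;  // nothing to copy is not an error
    }
  node.udata = static_cast<char *> (std::malloc (static_cast<std::size_t> (total)));
  if (node.udata == nullptr)
    {
      return false;
    }
  char *cursor = node.udata;
  for (int i = 0; i < num_crumbs; i++)
    {
      if (crumbs[i].length > 0)
        {
          std::memcpy (cursor, crumbs[i].data, static_cast<std::size_t> (crumbs[i].length));
          cursor += crumbs[i].length;
        }
    }
  node.ulength = total;
  return true;
}

inline void release_node (OwnedNode &node)
{
  std::free (node.udata);
  node.udata = nullptr;
  node.ulength = 0;
}
```

After the copy, overwriting or freeing the caller's buffers has no effect on the node, which is what lets an asynchronous consumer read udata safely.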

3.3 Allocation dispatch — prior_lsa_alloc_and_copy_crumbs

prior_lsa_alloc_and_copy_crumbs mallocs the node, zeroes every construction field, sets log_header.type, then dispatches:

// prior_lsa_alloc_and_copy_crumbs -- src/transaction/log_append.cpp
node->log_header.type = rec_type; node->tde_encrypted = false; /* ... all payload fields zeroed ... */
switch (rec_type) {
case LOG_UNDOREDO_DATA: ... case LOG_MVCC_REDO_DATA: /* all 8 undo/redo families */
error = prior_lsa_gen_undoredo_record_from_crumbs (thread_p, node, rcvindex, addr, ...); break;
default: assert_release (false); error = ER_FAILED; break; /* <- crumbs path is undo/redo only */
}

On error it frees data_header, udata, rdata, then the node, and returns NULL — the caller (§3.1) treats NULL as “give up silently.”

The sibling prior_lsa_alloc_and_copy_data handles non-crumb families (postpone, compensate, commit, sysop, 2PC): its switch routes undo/redo cases to assert_release(false) and the rest to prior_lsa_gen_record, prior_lsa_gen_postpone_record, etc. — so the two allocators partition the type space: crumbs for undo/redo data, plain copy for control records.

prior_lsa_gen_record is the plain-copy builder Chapters 8–10 lean on for commit/abort/sysop nodes. It does no compression and no MVCC stamping — only sizes, allocates, and copies an optional undo blob; the header contents are filled by the caller. Its three branches:

| Branch | Effect |
|---|---|
| switch (rec_type) | Dummy/decision records (LOG_DUMMY_HEAD_POSTPONE, LOG_2PC_*_DECISION, LOG_START_CHKPT, LOG_SYSOP_ATOMIC_START) assert length==0 && data==NULL and leave data_header_length == 0; control records set data_header_length = sizeof(LOG_REC_*) (e.g. LOG_COMMIT/LOG_ABORT → LOG_REC_DONETIME, LOG_SYSOP_END → LOG_REC_SYSOP_END); default leaves it 0. |
| if (data_header_length > 0) | Mallocs the header (memset in debug builds); on failure raises ER_OUT_OF_VIRTUAL_MEMORY and returns immediately, with no udata copy attempted. |
| if (length > 0) | Copies the optional undo blob via prior_lsa_copy_undo_data_to_node, propagating its error code; otherwise returns NO_ERROR. |

3.4 prior_lsa_gen_undoredo_record_from_crumbs — the core builder

The builder runs four phases (Figure 3-2). It sums the crumb lengths, fetches the per-side zip scratch (log_append_get_zip_undo/_redo), and sets type-shaped flags: a LOG_IS_UNDOREDO_RECORD_TYPE sets has_undo + has_redo and needs both scratches (or a zero-length side); a LOG_IS_REDO_RECORD_TYPE sets has_redo, needs zip_redo; otherwise UNDO needs zip_undo — all &&-gated by log_Zip_support into can_zip.

It then (optionally) compresses (§3.5), sizes and mallocs the typed header, aims local pointers at its sub-fields, fills the shared LOG_DATA, and copies the payloads. Pointer aiming uses a fall-through switch: each MVCC arm grabs its extra mvccid_p/vacuum_info_p, then [[fallthrough]] into the non-MVCC arm for the shared length/data pointers — UNDO sets ulength_p only, REDO rlength_p only, UNDOREDO both:

// prior_lsa_gen_undoredo_record_from_crumbs -- src/transaction/log_append.cpp
case LOG_MVCC_UNDOREDO_DATA: case LOG_MVCC_DIFF_UNDOREDO_DATA: /* MVCC arm: extra ptrs, then fall through */
vacuum_info_p = &mvcc_undoredo_p->vacuum_info; mvccid_p = &mvcc_undoredo_p->mvccid; [[fallthrough]];
case LOG_UNDOREDO_DATA: case LOG_DIFF_UNDOREDO_DATA: /* shared: aim both length ptrs + log_data_p */
data_header_ulength_p = &undoredo_p->ulength; ... log_data_p = &undoredo_p->data; break;

The shared LOG_DATA is filled from addr: rcvindex, offset, and (pageid, volid) via pgbuf_get_vpid_ptr — or NULL_PAGEID/NULL_VOLID when addr->pgptr == NULL (logical logging).
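The pointer-aiming switch relies on the MVCC layout embedding the plain layout as its first member. A reduced sketch of that pattern follows; RecUndoRedo, RecMvccUndoRedo, AimedPointers, and aim_pointers are hypothetical stand-ins that keep only the length and MVCCID fields.

```cpp
#include <cassert>
#include <cstdint>

enum class RecType { UndoRedo, MvccUndoRedo };  // hypothetical subset

// Reduced layouts: the MVCC record embeds the plain one first.
struct RecUndoRedo { int ulength; int rlength; };
struct RecMvccUndoRedo { RecUndoRedo undoredo; std::uint64_t mvccid; };

// The local pointers the builder aims at header sub-fields.
struct AimedPointers
{
  int *ulength_p = nullptr;
  int *rlength_p = nullptr;
  std::uint64_t *mvccid_p = nullptr;  // stays null for non-MVCC types
};

inline AimedPointers aim_pointers (RecType type, void *data_header)
{
  AimedPointers aimed;
  RecUndoRedo *undoredo = nullptr;
  switch (type)
    {
    case RecType::MvccUndoRedo:
      // MVCC arm: grab the extra pointer, then share the plain arm.
      aimed.mvccid_p = &static_cast<RecMvccUndoRedo *> (data_header)->mvccid;
      [[fallthrough]];
    case RecType::UndoRedo:
      // Valid for both: the embedded RecUndoRedo sits at offset 0.
      undoredo = static_cast<RecUndoRedo *> (data_header);
      aimed.ulength_p = &undoredo->ulength;
      aimed.rlength_p = &undoredo->rlength;
      break;
    }
  return aimed;
}
```

Later code can then test mvccid_p != nullptr to decide whether the MVCC-only fills apply, without switching on the type again.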

flowchart TB
  M["Phase 1: sum lengths, get zip scratch, compute has_undo/has_redo/can_zip"] --> Z{"can_zip AND\nsome side >= thr?"}
  Z -- yes --> ZB["Phase 2: log_diff + log_zip; if both sides large and redo zipped, rewrite type to *_DIFF_*"]
  Z -- no --> HSZ["Phase 3a: size header by type"]
  ZB --> HSZ
  HSZ --> MAL{"malloc data_header OK?"}
  MAL -- no --> ERR["ER_OUT_OF_VIRTUAL_MEMORY; goto error"]
  MAL -- yes --> PTR["Phase 3b: aim ptrs via fall-through switch, fill LOG_DATA, stamp MVCCID/vacuum if set"]
  PTR --> CP["Phase 4: copy udata/rdata (zipped or raw)"]
  CP --> RET["return NO_ERROR"]
  ERR --> RETE["return error_code"]

Figure 3-2 — Control flow of prior_lsa_gen_undoredo_record_from_crumbs. Every branch reaches return NO_ERROR or the error: label, which frees data_header/udata/rdata.

3.5 The compression branch — boundary is the node, not the page

CUBRID compresses per record (per prior node), never per log page — which is why it lives in construction, before any LSA or page is assigned: the compressed bytes are sized into ulength/rlength and copied into the node, so Ch 6’s page-boundary logic never sees uncompressed data. Two globals gate it; scratch is a per-side LOG_ZIP:

// src/transaction/log_append.cpp ; src/transaction/log_compress.h
bool log_Zip_support = false; /* <- master toggle, from prm */
int log_Zip_min_size_to_compress = 255; /* <- per-side threshold (bytes) */
struct log_zip { LOG_ZIP_SIZE_T data_length = 0; LOG_ZIP_SIZE_T buf_size = 0; char *log_data = nullptr; };

LOG_ZIP holds one result, all three fields: log_data is the output buffer (prior_lsa_copy_*_data_to_node memcpys from it), data_length its produced length (what MAKE_ZIP_LEN wraps into the header; raw if it did not shrink), buf_size its log_zip_alloc-set capacity (IO_PAGESIZE + LZ4 bound) so it is not reallocated per record. Scratch comes from log_append_get_zip_undo/_redo: per-thread in SERVER_MODE (thread_p->log_zip_undo, lazily log_zip_alloc’d), file-static singletons stand-alone. If thread_p is NULL and unresolvable via thread_get_thread_entry_info, the getter returns NULL, forcing can_zip false for that side via the zip_* != NULL clause.

The compression block and the length-stamping copy run as one unit:

// prior_lsa_gen_undoredo_record_from_crumbs -- src/transaction/log_append.cpp
if (can_zip && (ulength >= log_Zip_min_size_to_compress || rlength >= log_Zip_min_size_to_compress)) {
if (ulength >= thr && rlength >= thr) {
(void) log_diff (ulength, undo_data, rlength, redo_data); /* <- redo diffed against undo */
is_undo_zip = log_zip (zip_undo, ulength, undo_data);
is_redo_zip = log_zip (zip_redo, rlength, redo_data);
if (is_redo_zip) is_diff = true;
} else { if (ulength >= thr) is_undo_zip = log_zip (zip_undo, ulength, undo_data);
if (rlength >= thr) is_redo_zip = log_zip (zip_redo, rlength, redo_data); }
}
if (is_diff) node->log_header.type = is_mvcc_op ? LOG_MVCC_DIFF_UNDOREDO_DATA : LOG_DIFF_UNDOREDO_DATA;
// ... after header sized/aimed, undo arm (redo symmetric): ...
if (is_undo_zip) { *data_header_ulength_p = MAKE_ZIP_LEN (zip_undo->data_length); /* <- sets 0x80000000 */
error_code = prior_lsa_copy_undo_data_to_node (node, zip_undo->data_length, (char *) zip_undo->log_data);
} else if (has_undo) { *data_header_ulength_p = ulength;
error_code = prior_lsa_copy_undo_crumbs_to_node (node, num_ucrumbs, ucrumbs); }

Four outcomes: neither side over threshold (skipped); both large (log_diff rewrites redo as its difference from undo, then both zip, flipping the type to *_DIFF_* if redo zipped); only one large (that side zips, no diff); log_zip returns false (copied raw). MAKE_ZIP_LEN(len) is len | 0x80000000; recovery strips it via GET_ZIP_LEN/ZIP_CHECK.

Invariant — header length encodes compression state. Whether a side is zipped is recorded only in the sign bit of the header length field; a zipped payload written without MAKE_ZIP_LEN would feed compressed bytes straight to recovery and corrupt the page. Pairing is_*_zip with MAKE_ZIP_LEN in the same arm guarantees they never diverge.
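The flag encoding can be exercised on its own. In this sketch only the definition of MAKE_ZIP_LEN (len | 0x80000000) comes from the text; the lower-case helper names are hypothetical, and the bodies given for the GET_ZIP_LEN and ZIP_CHECK counterparts are inferred from their described roles rather than quoted.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t ZIP_FLAG = 0x80000000u;  // top bit of a 32-bit length

// Counterpart of MAKE_ZIP_LEN: set the flag bit on a stored length.
constexpr std::int32_t make_zip_len (std::int32_t len)
{
  return static_cast<std::int32_t> (static_cast<std::uint32_t> (len) | ZIP_FLAG);
}

// Assumed counterpart of ZIP_CHECK: is the flag bit set?
constexpr bool zip_check (std::int32_t len)
{
  return (static_cast<std::uint32_t> (len) & ZIP_FLAG) != 0;
}

// Assumed counterpart of GET_ZIP_LEN: strip the flag bit.
constexpr std::int32_t get_zip_len (std::int32_t len)
{
  return static_cast<std::int32_t> (static_cast<std::uint32_t> (len) & ~ZIP_FLAG);
}
```

A flagged length reads as negative when interpreted as a signed int, so any consumer that uses the header length as a byte count has to strip the flag first.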

The copier prior_lsa_copy_undo_data_to_node (_redo_ mirrors it) mallocs length bytes (returns NO_ERROR for length <= 0 || data == NULL, ER_OUT_OF_VIRTUAL_MEMORY on failure), memcpys, and sets node->ulength; the crumb copiers malloc once then memcpy each crumb. Either way node->ulength/rlength holds the stored length.

For MVCC types the pointer switch left mvccid_p/vacuum_info_p non-NULL, so two extra fills run. The MVCCID comes from the current TDES, preferring the innermost sub-transaction id:

// prior_lsa_gen_undoredo_record_from_crumbs -- src/transaction/log_append.cpp
if (mvccid_p != NULL) {
tdes = LOG_FIND_CURRENT_TDES (thread_p);
if (tdes == NULL || !MVCCID_IS_VALID (tdes->mvccinfo.id)) {
assert_release (false); error_code = ER_FAILED; goto error; /* <- MVCC op needs an MVCCID */
} else if (!tdes->mvccinfo.sub_ids.empty ()) *mvccid_p = tdes->mvccinfo.sub_ids.back (); /* nested sysop */
else *mvccid_p = tdes->mvccinfo.id;
}
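The selection rule in the excerpt above reduces to a small pure function. This sketch uses hypothetical types (MvccInfo, pick_mvccid) and assumes that an id of 0 is the invalid value, which is not stated in this document; std::nullopt models the assert_release/ER_FAILED exit.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

using Mvccid = std::uint64_t;
constexpr Mvccid MVCCID_INVALID = 0;  // assumption: 0 marks "no MVCCID"

// Hypothetical reduction of tdes->mvccinfo.
struct MvccInfo
{
  Mvccid id = MVCCID_INVALID;
  std::vector<Mvccid> sub_ids;  // innermost sub-transaction id is last
};

// Returns the id to stamp into the record, or nullopt on the error path.
inline std::optional<Mvccid> pick_mvccid (const MvccInfo *info)
{
  if (info == nullptr || info->id == MVCCID_INVALID)
    {
      return std::nullopt;  // an MVCC operation needs a valid MVCCID
    }
  if (!info->sub_ids.empty ())
    {
      return info->sub_ids.back ();  // nested sysop: innermost sub-id wins
    }
  return info->id;
}
```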

vacuum_info_p gets the file id (addr->vfid, or NULL for RVES_NOTIFY_VACUUM, else assert_release(false)), and prev_mvcc_op_log_lsa is set NULL — completed later in Ch 4’s prior_lsa_next_record_internal, which links the record into the vacuum chain once the LSA is known. These two fields, plus start_lsa, are the only ones here depending on transaction/log state, not the caller tuple. The two record layouts:

// struct log_rec_undoredo / log_rec_mvcc_undoredo -- src/transaction/log_record.hpp
struct log_rec_undoredo { LOG_DATA data; int ulength; int rlength; };
struct log_rec_mvcc_undoredo { LOG_REC_UNDOREDO undoredo; MVCCID mvccid; LOG_VACUUM_INFO vacuum_info; };

Every field: data (the LOG_DATA triple rcvindex/pageid/offset plus volid) is where recovery dispatches and locates the bytes; ulength/rlength are the stored lengths (high bit = zipped). The MVCC variant embeds undoredo so non-MVCC readers share code, then adds mvccid (the writer’s id, for vacuum and visibility) and vacuum_info (prev_mvcc_op_log_lsa back-link + owning vfid; back-link filled Ch 4).

  1. The public log_append_* APIs are thin — wrap the buffer in a LOG_CRUMB, delegate to log_append_*_crumbs, which pick the LOG_RECTYPE from rcvindex and run a five-guard chain first.
  2. Construction is lock-free and self-owning — all work outside prior_lsa_mutex; the node owns three mallocs (node, data_header, payload copies) so the async drain never touches caller memory.
  3. Two allocators partition the type space: ..._crumbs → prior_lsa_gen_undoredo_record_from_crumbs for undo/redo data; ..._data → prior_lsa_gen_record for control records, whose three branches size the header (0 for dummy/decision types), malloc with an ER_OUT_OF_VIRTUAL_MEMORY bail, and copy an optional undo blob.
  4. The core builder runs measure → compress → size+fill the typed header → copy payloads, a [[fallthrough]] switch sharing the non-MVCC layout across the UNDO/REDO/UNDOREDO shapes.
  5. Compression is per-node, not per-page — gated by log_Zip_support and the 255-byte threshold with per-thread LOG_ZIP scratch (NULL thread_p ⇒ no compression); both-sides-large triggers log_diff and may rewrite the type to *_DIFF_*. The zipped/raw choice is recorded only in the length’s high bit via MAKE_ZIP_LEN.
  6. MVCC records get MVCCID + vacuum info from the TDES (sub-id preferred); prev_mvcc_op_log_lsa and start_lsa stay NULL/blank until the LSA is assigned in Chapter 4 — reading start_lsa during construction is a bug.

Chapter 4: LSA Assignment and Attach to the Prior List

Chapter 3 left us holding a fully formed LOG_PRIOR_NODE whose payload is populated but whose position in the log is unknown. This chapter assigns the node its LSA and splices it onto the prior-list tail inside one short mutex-guarded critical section. For why CUBRID stages records in an in-memory prior list, see the “prior list” section of cubrid-log-manager.md. The payoff:

Invariant 4-A (LSA order = mutex-acquisition order). Every LSA the engine hands out is monotonically increasing, and the order in which two threads receive their LSAs is exactly the order in which they acquired prior_info.prior_lsa_mutex. The mutex is held for only an O(1) sequence of pointer/offset updates — no I/O, no allocation.

Three structs meet here: the node (Ch 1; only the fields this chapter writes), the global cursor, and the embedded on-disk record header.

log_prior_node (fields written in this chapter)

| Field | Role | Why it exists |
|---|---|---|
| log_header | LOG_RECORD_HEADER: bytes that will physically precede the record in the page | Carries the three linkage LSAs + trid + type that recovery walks |
| start_lsa | The LSA this node is assigned | Returned to the caller as the record’s identity; cross-checked when the node is drained |
| tde_encrypted | Whether the holding page must be TDE-encrypted | Set by prior_set_tde_encrypted; read when the page is allocated/flushed |
| data_header_length | Byte length of data_header | Drives the offset advance for the data-header region |
| data_header | The typed record header (e.g. LOG_REC_SYSOP_END) | Cast on the matched type arm to read MVCC/sysop sub-fields |
| ulength / udata | Undo payload length / buffer | ulength>0 triggers an offset advance for undo data |
| rlength / rdata | Redo payload length / buffer | rlength>0 triggers an offset advance for redo data |
| next | Singly-linked pointer to the next prior node | Written on the current tail when a new node is spliced in behind it |

log_rec_header (LOG_RECORD_HEADER) — every field

The physical record header; prior_lsa_start_append/prior_lsa_end_append exist almost entirely to fill it.

| Field | Role | Why it exists |
|---|---|---|
| prev_tranlsa | Previous log record of the same transaction | Lets undo/rollback walk one transaction’s records backward without scanning the whole log |
| back_lsa | Previous physical record (any transaction) | Lets recovery walk the global log backward |
| forw_lsa | Next physical record | Lets analysis/redo walk forward; known only after this record’s size is fixed, so filled in prior_lsa_end_append |
| trid | Transaction id owning this record | Recovery groups records by transaction; set from tdes->trid |
| type | LOG_RECTYPE (e.g. LOG_COMMIT, LOG_SYSOP_END) | Dispatch key for every type-specific branch in prior_lsa_next_record_internal |

log_prior_lsa_info (the global cursor; log_Gl.prior_info) — every field

| Field | Role | Why it exists |
|---|---|---|
| prior_lsa | The next LSA to assign (the moving cursor) | Every node copies this into start_lsa; advanced by the offset helpers as the node’s bytes are accounted for |
| prev_lsa | LSA of the last record appended to the prior stream | Becomes the new node’s back_lsa, then is updated to point at the new node |
| prior_list_header | Head of the singly-linked prior list | The drain side (Chapter 5) consumes from the head |
| prior_list_tail | Tail of the prior list | New nodes attach here in O(1) |
| list_size | Bytes staged but not yet flushed | Compared against logpb_get_memsize() to decide when to force a flush |
| prior_flush_list_header | Head of the detached list being flushed | Set when the list is unhooked for draining (Chapter 5) |
| prior_lsa_mutex | std::mutex serializing the whole assignment | The single lock whose acquisition order defines LSA order (Invariant 4-A) |

flowchart LR
  subgraph G["log_Gl.prior_info (LOG_PRIOR_LSA_INFO)"]
    PL["prior_lsa<br/>(next LSA cursor)"]
    PV["prev_lsa<br/>(last record)"]
    H["prior_list_header"]
    T["prior_list_tail"]
    M["prior_lsa_mutex"]
  end
  N["new LOG_PRIOR_NODE<br/>start_lsa, log_header, next"]
  PL -- "copied into" --> N
  PV -- "copied into log_header.back_lsa" --> N
  T -- "->next = node, then tail = node" --> N
  M -. "guards all of the above" .- G

Figure 4-1. The cursor feeds the node its identity and linkage, then adopts the node as its new tail.

4.2 The entry points: with_lock and the LOG_PRIOR_LSA_LOCK enum

Two public entry points, one shared body; the only difference is whether the caller already holds prior_lsa_mutex.

// prior_lsa_next_record / _with_lock -- src/transaction/log_append.cpp
prior_lsa_next_record (THREAD_ENTRY *thread_p, LOG_PRIOR_NODE *node, log_tdes *tdes)
{ return prior_lsa_next_record_internal (thread_p, node, tdes, LOG_PRIOR_LSA_WITHOUT_LOCK); }
prior_lsa_next_record_with_lock (THREAD_ENTRY *thread_p, LOG_PRIOR_NODE *node, log_tdes *tdes)
{ return prior_lsa_next_record_internal (thread_p, node, tdes, LOG_PRIOR_LSA_WITH_LOCK); }

The with_lock argument is one of the two values below (the enum has no comments in the source; the annotations here are editorial):

// LOG_PRIOR_LSA_LOCK -- src/transaction/log_append.hpp
enum LOG_PRIOR_LSA_LOCK
{
LOG_PRIOR_LSA_WITHOUT_LOCK = 0, // internal locks/unlocks the mutex itself
LOG_PRIOR_LSA_WITH_LOCK = 1 // caller already holds the mutex
};

The _with_lock variant lets a caller emit several records with no interleaving: take the mutex once, call _with_lock repeatedly. The plain variant is the common single-record path.
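The two-entry-point pattern can be sketched with a toy cursor. Everything here is hypothetical (Cursor, assign_lsa, assign_lsa_with_lock, and a single long in place of an LSA); the only point is that both variants share one body and differ solely in who locks.

```cpp
#include <cassert>
#include <mutex>

enum class PriorLock { WithoutLock, WithLock };  // mirrors the two enum values

// Hypothetical cursor: a byte position guarded by one mutex.
struct Cursor
{
  std::mutex mutex;
  long next = 0;
};

// Shared body: locks only when the caller has not already done so.
inline long assign_internal (Cursor &c, long size, PriorLock mode)
{
  if (mode == PriorLock::WithoutLock)
    {
      c.mutex.lock ();
    }
  long start = c.next;  // the position handed to this record
  c.next += size;       // advance past the record's bytes
  if (mode == PriorLock::WithoutLock)
    {
      c.mutex.unlock ();
    }
  return start;
}

inline long assign_lsa (Cursor &c, long size)
{
  return assign_internal (c, size, PriorLock::WithoutLock);
}

// Precondition: the caller holds c.mutex.
inline long assign_lsa_with_lock (Cursor &c, long size)
{
  return assign_internal (c, size, PriorLock::WithLock);
}
```

A caller that holds the mutex across two assign_lsa_with_lock calls receives adjacent positions, since no other thread can take the lock in between.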

4.3 prior_lsa_next_record_internal — branch-complete walkthrough

The body has three phases: lock + prior_lsa_start_append (4.4); a 6-arm type-dispatch ladder (table below); then the offset-walk + prior_lsa_end_append (4.5) + tail splice + conditional unlock-and-flush. The frame and the tail splice, quoted verbatim (note both splice arms are two statements, not a chained assignment):

// prior_lsa_next_record_internal -- src/transaction/log_append.cpp
if (with_lock == LOG_PRIOR_LSA_WITHOUT_LOCK) { log_Gl.prior_info.prior_lsa_mutex.lock (); }
prior_lsa_start_append (thread_p, node, tdes); // <- assigns start_lsa + header linkage (4.4)
LSA_COPY (&start_lsa, &node->start_lsa); // <- snapshot before any advance
// ... vacuum-produce guard + 6-arm type dispatch ladder (tables below) ...
log_prior_lsa_append_advance_when_doesnot_fit (node->data_header_length);
log_prior_lsa_append_add_align (node->data_header_length);
if (node->ulength > 0) { prior_lsa_append_data (node->ulength); }
if (node->rlength > 0) { prior_lsa_append_data (node->rlength); }
prior_lsa_end_append (thread_p, node); // <- fixes forw_lsa (4.5)
if (log_Gl.prior_info.prior_list_tail == NULL)
{
log_Gl.prior_info.prior_list_header = node; // <- empty list: node is head ...
log_Gl.prior_info.prior_list_tail = node; // <- ... and tail
}
else
{
log_Gl.prior_info.prior_list_tail->next = node; // <- O(1) tail splice (two statements)
log_Gl.prior_info.prior_list_tail = node;
}
log_Gl.prior_info.list_size += (sizeof (LOG_PRIOR_NODE) + node->data_header_length
+ node->ulength + node->rlength);
if (with_lock == LOG_PRIOR_LSA_WITHOUT_LOCK)
{
log_Gl.prior_info.prior_lsa_mutex.unlock (); // <- release BEFORE the flush decision
// ... condensed: if list_size >= logpb_get_memsize() -> force-flush fork (see callout) ...
}
tdes->num_log_records_written++;
return start_lsa;

Before the ladder, a vacuum-produce guard fires: under LOG_ISRESTARTED () and log_Gl.hdr.does_block_need_vacuum, if start_lsa crossed into a new vacuum block id versus mvcc_op_log_lsa, it calls vacuum_produce_log_block_data (asserting the prior block id is exactly one behind). Skipped entirely during crash recovery.

The 6-arm type-dispatch ladder. Mutually exclusive if/else if on node->log_header.type. Every assignment happens under the mutex with the just-snapshotted start_lsa — the reason the captured LSAs are coherent (Chapters 8–9).

| # | Matched type(s) | Guard | Action |
|---|---|---|---|
| 1 | LOG_MVCC_UNDO_DATA, LOG_MVCC_UNDOREDO_DATA, LOG_MVCC_DIFF_UNDOREDO_DATA, or (LOG_SYSOP_END && ((LOG_REC_SYSOP_END *)data_header)->type == LOG_SYSOP_END_LOGICAL_MVCC_UNDO) | | Resolve vacuum_info/mvccid via nested sub-branch; vacuum_info->prev_mvcc_op_log_lsa = log_Gl.hdr.mvcc_op_log_lsa; prior_update_header_mvcc_info (start_lsa, mvccid) (4.6) |
| 2 | LOG_SYSOP_START_POSTPONE | assert (LSA_ISNULL (rcv.sysop_start_postpone_lsa)) | rcv.sysop_start_postpone_lsa = start_lsa; if lastparent_lsa < rcv.atomic_sysop_start_lsa, null it; tdes->state = TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE (under mutex, for checkpoint correctness) |
| 3 | LOG_SYSOP_END | | If atomic_sysop_start_lsa non-null && lastparent_lsa < it → null; same test/null for sysop_start_postpone_lsa |
| 4 | LOG_COMMIT_WITH_POSTPONE or LOG_COMMIT_WITH_POSTPONE_OBSOLETE | | rcv.tran_start_postpone_lsa = start_lsa |
| 5 | LOG_SYSOP_ATOMIC_START | assert (LSA_ISNULL (rcv.atomic_sysop_start_lsa)) | rcv.atomic_sysop_start_lsa = start_lsa |
| 6 | LOG_COMMIT or LOG_ABORT | assert (commit_abort_lsa.is_null ()) | commit_abort_lsa = start_lsa |

Nested 3-way MVCC sub-branch (inside arm 1) — selects which struct holds vacuum_info/mvccid:

| Sub-arm | Condition | vacuum_info / mvccid source |
|---|---|---|
| a | type == LOG_MVCC_UNDO_DATA | &mvcc_undo->vacuum_info, mvcc_undo->mvccid, with mvcc_undo = (LOG_REC_MVCC_UNDO *) node->data_header |
| b | type == LOG_SYSOP_END | &mvcc_undo->vacuum_info, mvcc_undo->mvccid, with mvcc_undo = &((LOG_REC_SYSOP_END *) node->data_header)->mvcc_undo |
| c | else (LOG_MVCC_UNDOREDO_DATA / LOG_MVCC_DIFF_UNDOREDO_DATA, asserted) | &mvcc_undoredo->vacuum_info, mvcc_undoredo->mvccid, with mvcc_undoredo = (LOG_REC_MVCC_UNDOREDO *) node->data_header |

If none of arms 1–6 match (the common data-record case), the ladder is a no-op and control falls straight to the offset walk.

Unlock-then-flush fork. Only WITHOUT_LOCK unlocks here, and the list_size >= logpb_get_memsize() check runs after the unlock, outside the mutex. When the check fires, SERVER_MODE wakes the flush daemon and sleeps 1 ms when not in crash recovery; during recovery it instead runs a synchronous logpb_prior_lsa_append_all_list under LOG_CS. SA mode (#else) is always synchronous. Chapter 5 covers the drain.

flowchart TD
  A["enter internal"] --> B{"WITHOUT_LOCK?"}
  B -- yes --> C["lock prior_lsa_mutex"]
  B -- no --> D["prior_lsa_start_append"]
  C --> D
  D --> E["snapshot start_lsa"]
  E --> VG{"vacuum-produce guard"}
  VG --> F["6-arm type dispatch ladder<br/>see arms 1-6 table; no-op if no match"]
  F --> G["advance + add_align data_header"]
  G --> H{"ulength>0?"}
  H -- yes --> I["append_data ulength"]
  H -- no --> J{"rlength>0?"}
  I --> J
  J -- yes --> K["append_data rlength"]
  J -- no --> L["end_append: set forw_lsa"]
  K --> L
  L --> M{"tail == NULL?"}
  M -- yes --> N["header = tail = node"]
  M -- no --> O["tail->next = node; tail = node"]
  N --> P["list_size += footprint"]
  O --> P
  P --> Q{"WITHOUT_LOCK?"}
  Q -- yes --> R["unlock; maybe force flush"]
  Q -- no --> S["num_log_records_written++; return"]
  R --> S

Figure 4-2. Branch-complete control flow of prior_lsa_next_record_internal, including all six dispatch arms.
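
The tail splice and the footprint accounting can be modelled in isolation. The following is a minimal sketch, not the CUBRID code: ToyNode, ToyPriorInfo, toy_attach and the memsize_limit parameter are hypothetical stand-ins, and the list size is snapshotted under the lock (a simplification) so the threshold test can run after the unlock without a data race.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// Toy stand-ins: field names mirror the text, types and sizes are illustrative.
struct ToyNode
{
  std::size_t data_header_length = 0;
  std::size_t ulength = 0;
  std::size_t rlength = 0;
  ToyNode *next = nullptr;
};

struct ToyPriorInfo
{
  std::mutex prior_lsa_mutex;
  ToyNode *prior_list_header = nullptr;
  ToyNode *prior_list_tail = nullptr;
  std::size_t list_size = 0;
};

// Splice `node` at the tail in O(1) under the mutex and account its footprint.
// Returns true when the caller should consider forcing a flush; that decision
// is evaluated only after the mutex has been released.
inline bool
toy_attach (ToyPriorInfo &info, ToyNode *node, std::size_t memsize_limit)
{
  assert (node != nullptr && node->next == nullptr);
  std::size_t size_after;
  {
    std::lock_guard<std::mutex> guard (info.prior_lsa_mutex);
    if (info.prior_list_tail == nullptr)
      {
        info.prior_list_header = node;   // empty list: node is head ...
        info.prior_list_tail = node;     // ... and tail
      }
    else
      {
        info.prior_list_tail->next = node;   // link behind the old tail
        info.prior_list_tail = node;         // then move the tail
      }
    info.list_size += sizeof (ToyNode) + node->data_header_length
                      + node->ulength + node->rlength;
    size_after = info.list_size;   // snapshot (simplification of the real check)
  }
  return size_after >= memsize_limit;   // evaluated outside the critical section
}
```

The critical section contains only pointer writes and one addition, which is the property the text relies on: any flush work triggered by the return value happens with the mutex already dropped.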

4.4 prior_lsa_start_append — assigning the LSA and the backward chain


This is where the node’s identity is born.

// prior_lsa_start_append -- src/transaction/log_append.cpp
log_prior_lsa_append_advance_when_doesnot_fit (sizeof (LOG_RECORD_HEADER)); // <- header must not straddle a page
node->log_header.trid = tdes->trid;
LSA_COPY (&node->start_lsa, &log_Gl.prior_info.prior_lsa); // <- THE LSA assignment, before any advance (Inv 4-C)
if (tdes->is_system_worker_transaction () && !tdes->is_under_sysop ())
{
LSA_SET_NULL (&node->log_header.prev_tranlsa); // <- worker, no sysop: lose the per-tran chain
LSA_SET_NULL (&tdes->head_lsa);
LSA_SET_NULL (&tdes->tail_lsa);
}
else
{
LSA_COPY (&node->log_header.prev_tranlsa, &tdes->tail_lsa); // chain to this tran's last record
LSA_COPY (&tdes->tail_lsa, &log_Gl.prior_info.prior_lsa); // this record is now the tran tail
if (LSA_ISNULL (&tdes->head_lsa))
{ LSA_COPY (&tdes->head_lsa, &tdes->tail_lsa); } // first record of the tran
LSA_COPY (&tdes->undo_nxlsa, &log_Gl.prior_info.prior_lsa); // next to undo on rollback
}
LSA_COPY (&node->log_header.back_lsa, &log_Gl.prior_info.prev_lsa); // <- physical backward link (any tran)
LSA_SET_NULL (&node->log_header.forw_lsa); // <- not known yet (end_append)
LSA_COPY (&log_Gl.prior_info.prev_lsa, &log_Gl.prior_info.prior_lsa); // <- prev_lsa now names THIS record
log_prior_lsa_append_add_align (sizeof (LOG_RECORD_HEADER)); // <- account the header bytes

The transaction-chain fork: system workers (e.g. vacuum) do not own a rollback chain, so a worker record not under a sysop nulls prev_tranlsa/head_lsa/tail_lsa. Everyone else chains to the prior tail and updates it. The physical back_lsa = prev_lsa link is transaction-independent; prev_lsa then advances to name this record. forw_lsa is nulled here, fixed in end_append.

Invariant 4-B. A system-worker record NOT under a sysop carries a null prev_tranlsa. Violating it would make recovery walk a non-existent transaction chain.

Invariant 4-C (start before advance). start_lsa is read before any add_align advances the cursor, so it names the first byte of the record. The header-fit guard runs first so that first byte is on a page that can hold the header.
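
To make the chain fork concrete, here is a small model of the linkage rules only. It is a sketch under stated assumptions: ToyLsa, ToyTdes, ToyHeader, ToyCursor and toy_start_append are hypothetical names, a null LSA is encoded as pageid -1, and the header-fit guard and cursor advance are left out.

```cpp
#include <cassert>
#include <cstdint>

struct ToyLsa { std::int64_t pageid; std::int64_t offset; };
const ToyLsa TOY_NULL_LSA = { -1, -1 };

inline bool toy_is_null (const ToyLsa &l) { return l.pageid == -1; }
inline bool toy_eq (const ToyLsa &a, const ToyLsa &b)
{ return a.pageid == b.pageid && a.offset == b.offset; }

struct ToyTdes
{
  bool system_worker;
  bool under_sysop;
  ToyLsa head_lsa, tail_lsa, undo_nxlsa;
};
struct ToyHeader { ToyLsa prev_tranlsa, back_lsa, forw_lsa; };
struct ToyCursor { ToyLsa prior_lsa, prev_lsa; };

// Models only the linkage performed at record birth; returns start_lsa.
inline ToyLsa
toy_start_append (ToyCursor &cur, ToyTdes &tdes, ToyHeader &hdr)
{
  const ToyLsa start_lsa = cur.prior_lsa;     // read before any advance (Inv 4-C)
  if (tdes.system_worker && !tdes.under_sysop)
    {
      hdr.prev_tranlsa = TOY_NULL_LSA;        // worker, no sysop: no per-tran chain
      tdes.head_lsa = TOY_NULL_LSA;
      tdes.tail_lsa = TOY_NULL_LSA;
    }
  else
    {
      hdr.prev_tranlsa = tdes.tail_lsa;       // chain to this tran's last record
      tdes.tail_lsa = start_lsa;              // this record is now the tran tail
      if (toy_is_null (tdes.head_lsa))
        tdes.head_lsa = tdes.tail_lsa;        // first record of the tran
      tdes.undo_nxlsa = start_lsa;            // next to undo on rollback
    }
  hdr.back_lsa = cur.prev_lsa;                // physical backward link (any tran)
  hdr.forw_lsa = TOY_NULL_LSA;                // fixed later by end_append
  cur.prev_lsa = start_lsa;                   // prev_lsa now names this record
  return start_lsa;
}
```

Appending two records for an ordinary transaction and then one for a worker shows the two chains diverging: prev_tranlsa follows the transaction, while back_lsa follows physical order regardless of who wrote the previous record.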

4.5 prior_lsa_end_append — fixing forw_lsa


Once the data-header, undo, and redo regions are accounted for, the cursor sits just past this record. Two helpers then run before forw_lsa is written: align, then bump to the next page if the next header would not fit. So forw_lsa always names a position where the following header can legally live, and every forw_lsa equals the next record’s start_lsa with no straddling header between.

// prior_lsa_end_append -- src/transaction/log_append.cpp
static void
prior_lsa_end_append (THREAD_ENTRY *thread_p, LOG_PRIOR_NODE *node)
{
log_prior_lsa_append_align (); // <- align to next record start
log_prior_lsa_append_advance_when_doesnot_fit (sizeof (LOG_RECORD_HEADER)); // <- next header must fit too
LSA_COPY (&node->log_header.forw_lsa, &log_Gl.prior_info.prior_lsa);
}

4.6 prior_update_header_mvcc_info — vacuum block bookkeeping


Invoked from arm 1 of the ladder. It maintains the running MVCC-block summary in the global log header so vacuum knows which blocks have MVCC work.

// prior_update_header_mvcc_info -- src/transaction/log_append.cpp
assert (MVCCID_IS_VALID (mvccid));
if (!log_Gl.hdr.does_block_need_vacuum) // <- FIRST MVCC record of this block
{
log_Gl.hdr.oldest_visible_mvccid = log_Gl.mvcc_table.get_global_oldest_visible ();
log_Gl.hdr.newest_block_mvccid = mvccid;
}
else
{
// ... condensed: sanity asserts on oldest/newest/block id ...
if (log_Gl.hdr.newest_block_mvccid < mvccid) // <- subsequent record: raise high-water only
{ log_Gl.hdr.newest_block_mvccid = mvccid; }
}
log_Gl.hdr.mvcc_op_log_lsa = record_lsa; // <- both branches: latest MVCC op position
log_Gl.hdr.does_block_need_vacuum = true;

The first MVCC record of a block seeds oldest_visible_mvccid from the MVCC table; subsequent records only raise newest_block_mvccid (the elided else also asserts the block id matches mvcc_op_log_lsa). Both arms set mvcc_op_log_lsa = record_lsa and mark the block dirty — consistent with the totally-ordered LSA stream because it runs under the mutex.
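
The seed-then-raise behaviour is easy to check with a toy model. This is a sketch only: ToyLogHeader and toy_update_header_mvcc_info are hypothetical, the LSA is flattened to one integer, the global oldest-visible MVCCID is passed in as a parameter instead of being read from an MVCC table, and the sanity asserts and the block-flag reset done by the vacuum-produce path are not modelled.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t ToyMvccid;

struct ToyLogHeader
{
  bool does_block_need_vacuum = false;
  ToyMvccid oldest_visible_mvccid = 0;
  ToyMvccid newest_block_mvccid = 0;
  std::int64_t mvcc_op_log_lsa = -1;   // flattened LSA, illustrative only
};

inline void
toy_update_header_mvcc_info (ToyLogHeader &hdr, std::int64_t record_lsa,
                             ToyMvccid mvccid, ToyMvccid global_oldest_visible)
{
  if (!hdr.does_block_need_vacuum)
    {
      // First MVCC record of the block: seed both bounds.
      hdr.oldest_visible_mvccid = global_oldest_visible;
      hdr.newest_block_mvccid = mvccid;
    }
  else if (hdr.newest_block_mvccid < mvccid)
    {
      // Later record: only the high-water mark may move, and only upward.
      hdr.newest_block_mvccid = mvccid;
    }
  hdr.mvcc_op_log_lsa = record_lsa;    // both branches: latest MVCC op position
  hdr.does_block_need_vacuum = true;
}
```

Feeding it three records with MVCCIDs 50, 45 and 60 leaves oldest_visible_mvccid at the value seen by the first call and newest_block_mvccid at 60, while mvcc_op_log_lsa follows every call.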

4.7 The offset helpers — how prior_lsa walks the record footprint


Three statics advance log_Gl.prior_info.prior_lsa, consuming each region. All operate on a 0-based offset within a LOGAREA_SIZE-byte page area (leading assert (... offset >= 0) lines elided).

// offset helpers -- src/transaction/log_append.cpp
static void log_prior_lsa_append_align ()
{
log_Gl.prior_info.prior_lsa.offset = DB_ALIGN (log_Gl.prior_info.prior_lsa.offset, DOUBLE_ALIGNMENT);
if ((size_t) log_Gl.prior_info.prior_lsa.offset >= (size_t) LOGAREA_SIZE) // <- align rolled off page
{ log_Gl.prior_info.prior_lsa.pageid++; log_Gl.prior_info.prior_lsa.offset = 0; }
}
static void log_prior_lsa_append_advance_when_doesnot_fit (size_t length)
{
if ((size_t) log_Gl.prior_info.prior_lsa.offset + length >= (size_t) LOGAREA_SIZE) // <- region won't fit
{ log_Gl.prior_info.prior_lsa.pageid++; log_Gl.prior_info.prior_lsa.offset = 0; }
}
static void log_prior_lsa_append_add_align (size_t add)
{
log_Gl.prior_info.prior_lsa.offset += (add); // <- consume the region's bytes
log_prior_lsa_append_align (); // <- then align (may roll to next page)
}

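advance_when_doesnot_fit is the only pure pre-check: it tests fit before any bytes are consumed, so a header never straddles a boundary. align also branches, but only after rounding, to roll an offset that reached LOGAREA_SIZE onto the next page; add_align has no branch of its own. The pairing advance_when_doesnot_fit(N) then add_align(N) first ensures region N fits, then consumes it. Payloads that span pages (prior_lsa_append_data) are Chapter 6’s subject.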
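
A worked example with small numbers shows how the helpers interact. The sketch below assumes a 64-byte page area, 8-byte alignment and a 16-byte record header; those constants and all Toy/toy_ names are illustrative, and undo/redo payloads are omitted so that only the header and data-header regions are reserved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const std::size_t TOY_AREA = 64;     // stands in for LOGAREA_SIZE
const std::size_t TOY_ALIGN = 8;     // stands in for DOUBLE_ALIGNMENT
const std::size_t TOY_HEADER = 16;   // stands in for sizeof (LOG_RECORD_HEADER)

struct ToyCursor { std::int64_t pageid; std::size_t offset; };

inline void toy_align (ToyCursor &c)
{
  c.offset = (c.offset + TOY_ALIGN - 1) / TOY_ALIGN * TOY_ALIGN;
  if (c.offset >= TOY_AREA) { c.pageid++; c.offset = 0; }   // rounding rolled off the page
}
inline void toy_advance_when_doesnot_fit (ToyCursor &c, std::size_t length)
{
  if (c.offset + length >= TOY_AREA) { c.pageid++; c.offset = 0; }   // region would not fit
}
inline void toy_add_align (ToyCursor &c, std::size_t add)
{
  c.offset += add;   // consume the region
  toy_align (c);     // then align (may roll to the next page)
}

struct ToySpan { ToyCursor start_lsa; ToyCursor forw_lsa; };

// Reserve address space for one record: header, then data header.
inline ToySpan toy_reserve (ToyCursor &c, std::size_t data_header_length)
{
  ToySpan s;
  toy_advance_when_doesnot_fit (c, TOY_HEADER);           // header must not straddle
  s.start_lsa = c;                                        // read before any advance
  toy_add_align (c, TOY_HEADER);
  toy_advance_when_doesnot_fit (c, data_header_length);   // data header stays contiguous
  toy_add_align (c, data_header_length);
  toy_align (c);                                          // end_append: align ...
  toy_advance_when_doesnot_fit (c, TOY_HEADER);           // ... and guard the next header
  s.forw_lsa = c;
  return s;
}
```

Starting from page 0, offset 0, a record with a 20-byte data header ends with forw_lsa at (0, 40). The next record, with a 12-byte data header, starts at (0, 40); its header ends at offset 56, the data header no longer fits and is pushed to page 1, and its forw_lsa is (1, 16).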

4.8 prior_set_tde_encrypted — marking the node for encryption


Separate from the LSA path, called on the node for sensitive records.

// prior_set_tde_encrypted -- src/transaction/log_append.cpp
if (!tde_is_loaded()) // <- cipher must be available
{
er_set (ER_ERROR_SEVERITY, ARG_FILE_LINE, ER_TDE_CIPHER_IS_NOT_LOADED, 0);
return ER_TDE_CIPHER_IS_NOT_LOADED; // <- error branch
}
tde_er_log ("prior_set_tde_encrypted(): rcvindex = %s\n", rv_rcvindex_string (recvindex));
node->tde_encrypted = true; // <- the only state change
return NO_ERROR;

Two branches: cipher not loaded → log error, return ER_TDE_CIPHER_IS_NOT_LOADED, node untouched; otherwise flip node->tde_encrypted = true. The flag is read later when the holding page is allocated/flushed — it does not participate in LSA assignment, which is why it is a standalone setter, not part of prior_lsa_start_append. (Query side: the trivial prior_is_tde_encrypted.)

  1. One mutex defines the order. prior_lsa_mutex is acquired once per record (or held across several via _with_lock); acquisition order is LSA order (Invariant 4-A). Read-then-advance under one lock means no shared or out-of-order LSAs; no separate counter.
  2. prior_lsa_start_append is the moment of birth. Copies prior_lsa into start_lsa before advancing (Invariant 4-C), sets trid, builds prev_tranlsa/back_lsa, nulls forw_lsa.
  3. The transaction chain forks on worker status. Worker records not under a sysop get null prev_tranlsa/head_lsa/tail_lsa (Invariant 4-B); everyone else chains and updates tail_lsa/head_lsa/undo_nxlsa.
  4. A 6-arm type ladder captures start_lsa under the lock. MVCC-undo (nested 3-way select → prior_update_header_mvcc_info), SYSOP_START_POSTPONE (also flips tdes->state), SYSOP_END, COMMIT_WITH_POSTPONE(_OBSOLETE), SYSOP_ATOMIC_START, COMMIT/ABORT each stash the LSA into tdes->rcv.*/commit_abort_lsa.
  5. forw_lsa is fixed last. prior_lsa_end_append aligns past the record and guards the next header’s fit, so forw_lsa equals the next record’s start_lsa.
  6. Offset helpers consume the footprint. advance_when_doesnot_fit pre-checks fit, add_align consumes-then-aligns, align rounds to DOUBLE_ALIGNMENT and rolls to the next page when the offset reaches LOGAREA_SIZE.
  7. Expensive work is outside the lock. The O(1) splice and list_size bump end the critical section; the flush check and any flush run after unlock.

Chapter 5: Draining the Prior List into the Page Buffer


Chapter 4 left a chain of log_prior_nodes with LSAs wired up by prior_lsa_next_record: promised, but not yet copied into any LOG_PAGE frame. This chapter traces the single-writer drain that detaches the list, walks it in LSN order, and serializes each node into the page buffer. We stop at the page boundary (Chapter 6 owns logpb_next_append_page); the WAL rule is Chapter 7 (companion cubrid-log-manager.md).

Two locks, two jobs: prior_lsa_mutex serializes appenders against each other (held only for the LSA-stamp-and-link, Chapter 4; does not protect the page buffer); LOG_CS write mode serializes appenders against the page-buffer writer — every drain function opens with assert (LOG_CS_OWN_WRITE_MODE (thread_p)).

INVARIANT (single-writer drain). The drain runs while the caller owns LOG_CS write mode; with two drainers, append_lsa.offset and log_pgptr would update non-atomically and records would interleave. The LOG_CS_OWN_WRITE_MODE assert makes a violation fatal. Every struct table below assumes this.

The hand-off is the detach: under prior_lsa_mutex the writer snips off the list and nulls the header, so later appenders build a fresh list while the detached one is drained without prior_lsa_mutex (the drainer still holds LOG_CS).

flowchart TD
  A1["prior_lsa_next_record\nholds prior_lsa_mutex briefly"]
  D1["logpb_prior_lsa_append_all_list"]
  D2["detach list under prior_lsa_mutex\nreset header/tail/list_size"]
  D3["logpb_append_prior_lsa_list\nwalk nodes in LSN order"]
  D4["logpb_append_next_record per node"]
  D5["copy bytes into LOG_PAGE frames\nset dirty, free node"]
  A1 -->|"attach to prior_list"| D2
  D1 --> D2 --> D3 --> D4 --> D5

Figure 5-1. The two serialization layers and the detach hand-off.

log_prior_node — the unit being drained (log_append.hpp)

| Field | Role | Why it exists |
|---|---|---|
| log_header | LOG_RECORD_HEADER copied verbatim by logpb_start_append | On-disk record header |
| start_lsa | Must equal append_lsa when appended | Catches LSN-order corruption |
| tde_encrypted | Destination page is TDE-encrypted | Drives appending_page_tde_encrypted |
| data_header_length | Byte length of data_header | Sizes the header copy |
| data_header | Fixed per-record-type header payload | Part after LOG_RECORD_HEADER |
| ulength / udata | Length/pointer of the undo segment | Rollback image |
| rlength / rdata | Length/pointer of the redo segment | Recovery image |
| next | Link to the next node | Walked in LSN order |

INVARIANT (node order = LSN order). Tail-append under prior_lsa_mutex makes next traversal exactly ascending LSN; logpb_append_next_record re-checks each node via LSA_EQ (&node->start_lsa, &log_Gl.hdr.append_lsa) and a mismatch is a logpb_fatal_error.

LOG_PAGE / log_hdrpage — the destination frame (log_storage.hpp)


log_page is { LOG_HDRPAGE hdr; char area[1]; }; log_hdrpage is the per-frame header.

| Field | Role | Why it exists |
|---|---|---|
| hdr.logical_pageid | Identity of this frame in the log | Maps page to physical slot |
| hdr.offset | Offset of the first record on this page | Set once by logpb_start_append; enables salvage |
| hdr.flags | TDE encryption flags | Stamped by logpb_set_tde_algorithm |
| hdr.checksum | CRC32 of the page | Computed at flush (Chapter 7) |
| area | Buffer header+payload are memcpy’d into | LOG_APPEND_PTR() = area + append_lsa.offset |

LOG_BUFFER — frame wrapper carrying the dirty bit (log_page_buffer.c)

| Field | Role | Why it exists |
|---|---|---|
| pageid (volatile) | Logical page id of the wrapped frame | Validates flush targets |
| phy_pageid (volatile) | Physical page id in the active log | Maps logical page to disk slot |
| dirty (bool) | “Has unflushed changes” | Raised by logpb_set_dirty, cleared by flusher (Chapter 7) |
| logpage (LOG_PAGE*) | Back-pointer to buffered payload | logpb_get_log_buffer recovers the wrapper from a LOG_PAGE* |

log_append_info — the single writer’s cursor state (log_append.hpp)

| Field | Role | Why it exists |
|---|---|---|
| vdes | Active-log volume descriptor | Flush target; untouched by the drain |
| nxio_lsa (atomic) | Lowest LSN not yet on disk | The WAL frontier (Chapter 7) |
| prev_lsa | Address of the last fully appended record | logpb_start_append checks back_lsa == prev_lsa, then advances it |
| log_pgptr | The currently fixed append page frame | LOG_APPEND_PTR() writes into log_pgptr->area |
| appending_page_tde_encrypted | Page being filled needs TDE | Set per node from node->tde_encrypted |

INVARIANT (back_lsa chaining). logpb_start_append asserts back_lsa == prev_lsa before each header; the on-disk backward chain must stay unbroken or the process fatals out.

5.3 logpb_prior_lsa_append_all_list — detach then drain

// logpb_prior_lsa_append_all_list -- src/transaction/log_page_buffer.c
int
logpb_prior_lsa_append_all_list (THREAD_ENTRY * thread_p)
{
LOG_PRIOR_NODE *prior_list;
assert (LOG_CS_OWN_WRITE_MODE (thread_p)); /* <- single-writer invariant */
log_Gl.prior_info.prior_lsa_mutex.lock ();
prior_list = prior_lsa_remove_prior_list (thread_p); /* <- detach */
log_Gl.prior_info.prior_lsa_mutex.unlock (); /* <- mutex dropped early */
if (prior_list != NULL)
{
// ... condensed: perfmon stats ...
logpb_append_prior_lsa_list (thread_p, prior_list); /* <- drain, no mutex held */
}
return NO_ERROR;
}

prior_lsa_remove_prior_list is the detach — the only mutation of the prior-list header during the drain:

// prior_lsa_remove_prior_list -- src/transaction/log_page_buffer.c
static LOG_PRIOR_NODE *
prior_lsa_remove_prior_list (THREAD_ENTRY * thread_p)
{
LOG_PRIOR_NODE *prior_list;
assert (LOG_CS_OWN_WRITE_MODE (thread_p));
prior_list = log_Gl.prior_info.prior_list_header;
log_Gl.prior_info.prior_list_header = NULL; /* <- reset header/tail/size: */
log_Gl.prior_info.prior_list_tail = NULL; /* new appenders start fresh */
log_Gl.prior_info.list_size = 0;
return prior_list;
}

Branch: the mutex is released before the NULL test either way; if prior_list == NULL the drain is skipped, otherwise the copy runs with no mutex held. The appender-blocking window shrinks to the three field resets (header, tail, list_size).

5.4 logpb_append_prior_lsa_list — walk and free


The detached list is parked on prior_flush_list_header (a separate slot, so a rebuilt prior_list_header stays untouched), then drained node-by-node.

// logpb_append_prior_lsa_list -- src/transaction/log_page_buffer.c
static int
logpb_append_prior_lsa_list (THREAD_ENTRY * thread_p, LOG_PRIOR_NODE * list)
{
LOG_PRIOR_NODE *node;
assert (log_Gl.prior_info.prior_flush_list_header == NULL); /* <- no concurrent drain */
log_Gl.prior_info.prior_flush_list_header = list;
while (log_Gl.prior_info.prior_flush_list_header != NULL)
{
node = log_Gl.prior_info.prior_flush_list_header;
log_Gl.prior_info.prior_flush_list_header = node->next; /* <- advance before copy */
logpb_append_next_record (thread_p, node); /* <- the copy */
if (node->data_header != NULL) free_and_init (node->data_header);
if (node->udata != NULL) free_and_init (node->udata);
if (node->rdata != NULL) free_and_init (node->rdata);
free_and_init (node); /* <- node lifetime ends */
}
return NO_ERROR;
}

Each segment is freed only if non-NULL (a node may carry any subset); the head advances before the copy so the loop ends on the NULL next. The assert (prior_flush_list_header == NULL) enforces no overlapping flush list — a corollary of the single-writer invariant, holding under LOG_CS.
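
The detach of 5.3 and the walk-and-free loop of 5.4 can be sketched together. This is a simplified model, not the CUBRID code: ToyNode carries a single integer in place of the record, the "copy" is a push into a vector, LOG_CS is not modelled at all, and every Toy/toy_ name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

struct ToyNode { int lsa; ToyNode *next; };

struct ToyPriorInfo
{
  std::mutex prior_lsa_mutex;
  ToyNode *prior_list_header = nullptr;
  ToyNode *prior_list_tail = nullptr;
  std::size_t list_size = 0;
};

// Appender side: tail-append under the mutex.
inline void toy_attach (ToyPriorInfo &info, int lsa)
{
  ToyNode *node = new ToyNode { lsa, nullptr };
  std::lock_guard<std::mutex> guard (info.prior_lsa_mutex);
  if (info.prior_list_tail == nullptr) info.prior_list_header = node;
  else info.prior_list_tail->next = node;
  info.prior_list_tail = node;
  info.list_size += sizeof (ToyNode);
}

// Caller must hold prior_lsa_mutex: snip the list and reset the three fields.
inline ToyNode *toy_detach (ToyPriorInfo &info)
{
  ToyNode *list = info.prior_list_header;
  info.prior_list_header = nullptr;
  info.prior_list_tail = nullptr;
  info.list_size = 0;
  return list;
}

// Drainer side: detach under the mutex, then walk and free without it.
inline std::vector<int> toy_drain_all (ToyPriorInfo &info)
{
  ToyNode *list;
  {
    std::lock_guard<std::mutex> guard (info.prior_lsa_mutex);
    list = toy_detach (info);
  }                                    // mutex dropped before any copying
  std::vector<int> drained;
  while (list != nullptr)
    {
      ToyNode *node = list;
      list = node->next;               // advance before the copy
      drained.push_back (node->lsa);   // stands in for the page-buffer copy
      delete node;                     // node lifetime ends here
    }
  return drained;
}
```

Because appenders only ever touch the header and tail fields, a node attached while a drain is running lands on a fresh list and is picked up by the next drain.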

5.5 logpb_append_next_record — one node, header + payload

// logpb_append_next_record -- src/transaction/log_page_buffer.c
static int
logpb_append_next_record (THREAD_ENTRY * thread_p, LOG_PRIOR_NODE * node)
{
if (!LSA_EQ (&node->start_lsa, &log_Gl.hdr.append_lsa))
logpb_fatal_error (thread_p, true, ARG_FILE_LINE, "logpb_append_next_record"); /* <- LSN-order */
if (log_Gl.flush_info.num_toflush + 1 >= log_Gl.flush_info.max_toflush)
logpb_flush_all_append_pages (thread_p); /* <- flush early, before this record */
log_Gl.append.appending_page_tde_encrypted = prior_is_tde_encrypted (node);
logpb_start_append (thread_p, &node->log_header); /* writes LOG_RECORD_HEADER */
if (node->data_header != NULL)
{
LOG_APPEND_ADVANCE_WHEN_DOESNOT_FIT (thread_p, node->data_header_length); /* keep header contiguous */
logpb_append_data (thread_p, node->data_header_length, node->data_header);
}
if (node->udata != NULL)
logpb_append_data (thread_p, node->ulength, node->udata);
if (node->rdata != NULL)
logpb_append_data (thread_p, node->rlength, node->rdata);
logpb_end_append (thread_p, &node->log_header);
log_Gl.append.appending_page_tde_encrypted = false; /* reset for next node */
return NO_ERROR;
}

The non-obvious branch is the early flush (num_toflush + 1 >= max_toflush): flushing now, with no record in progress, keeps the partial-append state machine (LOGPB_APPENDREC_*, Chapter 7) from triggering mid-record. data_header is pre-advanced to stay on one page; udata/rdata wrap. Figure 5-2 covers every branch.

flowchart TD
  S["enter logpb_append_next_record"] --> C1{"start_lsa == append_lsa ?"}
  C1 -->|no| F["logpb_fatal_error"]
  C1 -->|yes| C2{"flush list nearly full ?"}
  C2 -->|yes| FL["logpb_flush_all_append_pages"]
  C2 -->|no| H
  FL --> H["set tde flag\nlogpb_start_append: write header"]
  H --> C3{"data_header ?"}
  C3 -->|yes| DH["ADVANCE_WHEN_DOESNOT_FIT\nappend_data header"]
  C3 -->|no| C4
  DH --> C4{"udata ?"}
  C4 -->|yes| UD["append_data udata"]
  C4 -->|no| C5
  UD --> C5{"rdata ?"}
  C5 -->|yes| RD["append_data rdata"]
  C5 -->|no| E
  RD --> E["logpb_end_append\nreset tde flag"]

Figure 5-2. Branch-complete flow of logpb_append_next_record.

5.6 logpb_start_append — stamp the record header

// logpb_start_append -- src/transaction/log_page_buffer.c
static void
logpb_start_append (THREAD_ENTRY * thread_p, LOG_RECORD_HEADER * header)
{
LOG_RECORD_HEADER *log_rec;
// ... condensed: assert, perfmon, ADVANCE_WHEN_DOESNOT_FIT (header contiguous) ...
if (!LSA_EQ (&header->back_lsa, &log_Gl.append.prev_lsa))
logpb_fatal_error (thread_p, true, ARG_FILE_LINE, "logpb_start_append"); /* <- back-chain check */
if (log_Gl.append.appending_page_tde_encrypted
&& !LOG_IS_PAGE_TDE_ENCRYPTED (log_Gl.append.log_pgptr))
{
// ... condensed: stamp TDE algorithm on the page ...
logpb_set_dirty (thread_p, log_Gl.append.log_pgptr);
}
log_rec = (LOG_RECORD_HEADER *) LOG_APPEND_PTR ();
*log_rec = *header; /* <- the header copy */
// ... condensed: if hdr.offset == NULL_OFFSET, set first-record offset on this page ...
if (log_rec->type == LOG_END_OF_LOG)
{
LSA_COPY (&log_Gl.hdr.eof_lsa, &log_Gl.hdr.append_lsa);
logpb_set_dirty (thread_p, log_Gl.append.log_pgptr);
}
else
{
LSA_COPY (&log_Gl.append.prev_lsa, &log_Gl.hdr.append_lsa); /* advance prev_lsa */
LOG_APPEND_SETDIRTY_ADD_ALIGN (thread_p, sizeof (LOG_RECORD_HEADER)); /* dirty + bump + align */
log_Pb.partial_append.status = LOGPB_APPENDREC_IN_PROGRESS;
}
}

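After the back-chain check and the optional TDE stamp, two branches remain. hdr.offset == NULL_OFFSET sets the page’s first-record offset once. The type split routes LOG_END_OF_LOG (the EOF sentinel, Chapter 7) down a placeholder path that records eof_lsa and leaves prev_lsa and the partial-append status untouched; every other type advances prev_lsa, consumes the header bytes, and enters IN_PROGRESS.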

5.7 logpb_append_data — the aligned byte copy

// logpb_append_data -- src/transaction/log_page_buffer.c
static void
logpb_append_data (THREAD_ENTRY * thread_p, int length, const char *data)
{
int copy_length; char *ptr, *last_ptr;
if (length == 0 || data == NULL)
return; /* <- empty segment: nothing to do */
LOG_APPEND_ALIGN (thread_p, LOG_DONT_SET_DIRTY); /* align, don't dirty yet */
ptr = LOG_APPEND_PTR ();
last_ptr = LOG_LAST_APPEND_PTR (); /* = area + LOGAREA_SIZE */
if ((ptr + length) >= last_ptr) /* <- does NOT fit in this page */
{
while (length > 0)
{
if (ptr >= last_ptr)
{
logpb_next_append_page (thread_p, LOG_SET_DIRTY); /* Chapter 6 */
ptr = LOG_APPEND_PTR (); last_ptr = LOG_LAST_APPEND_PTR ();
}
copy_length = (ptr + length >= last_ptr) ? CAST_BUFLEN (last_ptr - ptr) : length;
memcpy (ptr, data, copy_length);
ptr += copy_length; data += copy_length; length -= copy_length;
log_Gl.hdr.append_lsa.offset += copy_length; /* advance by bytes copied */
}
}
else /* <- fits entirely */
{
memcpy (ptr, data, length);
log_Gl.hdr.append_lsa.offset += length;
}
LOG_APPEND_ALIGN (thread_p, LOG_SET_DIRTY); /* align for next append AND mark dirty */
}

The boundary-span path copies to page end, calls logpb_next_append_page (Chapter 6), and repeats until length == 0. logpb_append_crumbs is the scatter-gather sibling (same fit/span logic), not on the drain path.

INVARIANT (cursor tracks bytes copied). append_lsa.offset advances by exactly the bytes memcpy’d on every path; drift would make the next LOG_APPEND_PTR() point at the wrong byte and records overlap. Both LOG_APPEND_ALIGN calls only round up.
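
The span loop and the cursor invariant can be exercised on a toy log with 8-byte pages. This is a sketch under stated assumptions: ToyLog, toy_next_page and toy_append_data are hypothetical, alignment and dirty marking are left out, and a page that is filled exactly keeps its offset at the page limit until the next byte arrives (the real code rolls over in its trailing align).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

const std::size_t TOY_AREA = 8;   // stands in for LOGAREA_SIZE

struct ToyLog
{
  std::vector<std::vector<char> > pages;
  std::size_t pageid;
  std::size_t offset;
  ToyLog () : pages (1, std::vector<char> (TOY_AREA, '.')), pageid (0), offset (0) {}
};

// Stands in for the page crossing: obtain a fresh page and reset the offset.
inline void toy_next_page (ToyLog &log)
{
  log.pages.push_back (std::vector<char> (TOY_AREA, '.'));
  log.pageid++;
  log.offset = 0;
}

// Copy `length` bytes, spilling across pages as needed. The cursor advances by
// exactly the number of bytes copied on every iteration.
inline void toy_append_data (ToyLog &log, const char *data, std::size_t length)
{
  if (length == 0 || data == nullptr)
    return;                                   // empty segment: nothing to do
  while (length > 0)
    {
      if (log.offset >= TOY_AREA)
        toy_next_page (log);                  // current page is full
      const std::size_t room = TOY_AREA - log.offset;
      const std::size_t copy_length = length < room ? length : room;
      std::memcpy (&log.pages[log.pageid][log.offset], data, copy_length);
      data += copy_length;
      length -= copy_length;
      log.offset += copy_length;              // advance by bytes copied
    }
}
```

Appending 5 bytes and then 12 bytes fills page 0, fills page 1, and leaves one byte on page 2 with the offset at 1, so the next write lands immediately after the last copied byte.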

5.8 logpb_end_append — close the record, point forward

// logpb_end_append -- src/transaction/log_page_buffer.c
static void
logpb_end_append (THREAD_ENTRY * thread_p, LOG_RECORD_HEADER * header)
{
// ... condensed: align + ADVANCE_WHEN_DOESNOT_FIT position the cursor at next slot ...
assert (LSA_EQ (&header->forw_lsa, &log_Gl.hdr.append_lsa)); /* <- forw_lsa = next slot */
if (!LSA_EQ (&log_Gl.append.prev_lsa, &log_Gl.hdr.append_lsa))
logpb_set_dirty (thread_p, log_Gl.append.log_pgptr); /* dirty if cursor moved off prev */
if (log_Pb.partial_append.status == LOGPB_APPENDREC_IN_PROGRESS)
; /* normal: fall through */
else if (log_Pb.partial_append.status == LOGPB_APPENDREC_PARTIAL_FLUSHED_END_OF_LOG)
{
log_Pb.partial_append.status = LOGPB_APPENDREC_PARTIAL_ENDED;
logpb_flush_all_append_pages (thread_p); /* re-flush correct version */
}
else
assert_release (false); /* invalid state */
log_Pb.partial_append.status = LOGPB_APPENDREC_SUCCESS; /* record now stable */
}

After the cursor is repositioned and the forw_lsa assert (partner to back_lsa) confirms it, the state machine branches: IN_PROGRESS falls through; PARTIAL_FLUSHED_END_OF_LOG (a forced flush swapped in an EOF sentinel) re-flushes the real record (Chapter 7); anything else hits assert_release (false). All paths end at SUCCESS.

INVARIANT (record bracketing). Between logpb_start_append (IN_PROGRESS) and logpb_end_append (SUCCESS), exactly one record is mid-write. A forced flush seeing IN_PROGRESS knows it caught a partial record; SUCCESS means the page is safe to flush. Breaking the bracket lets a half-written record reach disk unmarked.
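
The bracketing can be modelled as a four-state machine. The sketch below is illustrative only: the enum and function names are hypothetical, the re-flush is reduced to a counter, and the flush-side transition in toy_forced_flush is an assumption inferred from the description above (Chapter 7 owns the real logic).

```cpp
#include <cassert>

enum ToyAppendStatus
{
  TOY_SUCCESS,
  TOY_IN_PROGRESS,
  TOY_PARTIAL_FLUSHED_END_OF_LOG,
  TOY_PARTIAL_ENDED
};

struct ToyAppender
{
  ToyAppendStatus status = TOY_SUCCESS;
  int reflushes = 0;   // counts re-flushes of a record caught mid-write
};

// Opening bracket: a record is now mid-write.
inline void toy_start_append (ToyAppender &a) { a.status = TOY_IN_PROGRESS; }

// Assumed flush-side behaviour: a forced flush that finds a record mid-write
// marks it as partially flushed behind an EOF sentinel.
inline void toy_forced_flush (ToyAppender &a)
{
  if (a.status == TOY_IN_PROGRESS)
    a.status = TOY_PARTIAL_FLUSHED_END_OF_LOG;
}

// Closing bracket. Returns false for a state that should be unreachable.
inline bool toy_end_append (ToyAppender &a)
{
  if (a.status == TOY_IN_PROGRESS)
    {
      // normal path: nothing extra to do
    }
  else if (a.status == TOY_PARTIAL_FLUSHED_END_OF_LOG)
    {
      a.status = TOY_PARTIAL_ENDED;
      a.reflushes++;   // stands in for re-flushing the completed record
    }
  else
    {
      return false;    // invalid state; the real code asserts here
    }
  a.status = TOY_SUCCESS;   // record is now stable
  return true;
}
```

A start followed directly by an end costs no extra flush; a forced flush in between costs exactly one re-flush; an end without a matching start is rejected.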

  1. Two locks, two jobs. prior_lsa_mutex serializes appenders; LOG_CS serializes them against the single writer.
  2. Detach, then drain. Reset header/tail/list_size under the mutex, release, then copy.
  3. LSN order, freed immediately. Each node copied via logpb_append_next_record, then its segments and itself freed.
  4. Three checks prove the chain. back_lsa==prev_lsa and start_lsa==append_lsa are enforced by logpb_fatal_error; forw_lsa==append_lsa is an assert.
  5. Cursor stays honest. logpb_append_data advances append_lsa.offset by exactly the bytes copied.
  6. Dirty, not flushed. logpb_set_dirty only flips LOG_BUFFER::dirty; flush/checksum/WAL are Chapter 7.
  7. Boundary crossing deferred. Every logpb_next_append_page hands off to Chapter 6.

Chapter 6: Crossing a Log Page Boundary

The drain loop of Chapter 5 streams a prior node’s bytes into log_Gl.append.log_pgptr one fragment at a time. When the running offset log_Gl.hdr.append_lsa.offset reaches the page’s usable limit (LOGAREA_SIZE), the appender must seal the full page, register it for flush, and obtain a fresh logical page.

The reader question: what happens when a record does not fit, and how is a fresh page fetched while the full one is queued for flush? The mid-stream answer is logpb_next_append_page; the first-page bootstrap is logpb_fetch_start_append_page and its stripped twin logpb_fetch_start_append_page_new. All obtain a buffer frame and initialize its header through logpb_create_page / logpb_locate_page. WAL and the append/flush split are in the high-level companion; flush durability is Chapter 7. The prior-side mirror log_prior_lsa_append_advance_when_doesnot_fit (Chapter 4) reserves the address space across the page tail before any bytes exist; this chapter fetches the physical frame for that address.

Record-assembly code rarely calls logpb_next_append_page itself; the one direct call on the drain path is the page-spanning loop in logpb_append_data (5.7). Otherwise two macros own that decision, both comparing log_Gl.hdr.append_lsa.offset against LOGAREA_SIZE (the alignment/advance arithmetic is Chapter 4 / Chapter 5 material): LOG_APPEND_ALIGN crosses after a fragment when the DOUBLE_ALIGNMENT-rounded offset reaches the limit; LOG_APPEND_ADVANCE_WHEN_DOESNOT_FIT(length) crosses before writing when offset + length would overrun, so the fragment lands whole on the next page.

That after-vs-before split is why logpb_next_append_page takes current_setdirty. LOG_APPEND_ALIGN (reached via LOG_APPEND_SETDIRTY_ADD_ALIGN with LOG_SET_DIRTY) has already dirtied the page it is leaving; the ADVANCE macro crosses before any byte is written, so nothing is dirty yet. Both therefore pass LOG_DONT_SET_DIRTY, and the seal branch inside logpb_next_append_page is left to direct callers that did not pre-seal, such as the span loop in logpb_append_data, which passes LOG_SET_DIRTY.

logpb_set_dirty flips one boolean on the page’s buffer frame:

// logpb_set_dirty -- src/transaction/log_page_buffer.c
void
logpb_set_dirty (THREAD_ENTRY * thread_p, LOG_PAGE * log_pgptr)
{
LOG_BUFFER *bufptr;
bufptr = logpb_get_log_buffer (log_pgptr); /* <- recovers frame from page address */
// ... condensed ...
bufptr->dirty = true;
}

Invariant (dirty-before-detach): a page that received append bytes must be bufptr->dirty before log_Gl.append.log_pgptr is repointed away. Every write path either runs LOG_APPEND_SETDIRTY_ADD_ALIGN (calling LOG_APPEND_ALIGN with LOG_SET_DIRTY) before the offset reaches LOGAREA_SIZE, or passes LOG_SET_DIRTY to logpb_next_append_page as the span loop in logpb_append_data does. If violated, the full page sits in toflush[] un-dirtied and the flusher skips it, losing committed records.

log_append_info is the appender’s fixed cursor; log_page / log_hdrpage are the physical page layout; log_flush_info is the hand-off queue to the flusher.

log_append_info (log_append.hpp) — one global, log_Gl.append.

| Field | Role | Why it exists |
|---|---|---|
| vdes | Volume descriptor (fd) of the active log | The eventual fileio_write target |
| nxio_lsa | `atomic<LOG_LSA>`: lowest LSA not yet on disk | WAL boundary; re-pointed by the fetch helpers (6.5), read by Chapter 7 |
| prev_lsa | LSA of the last appended record | logpb_start_append checks back_lsa == prev_lsa; logical, so survives a crossing unchanged (6.7) |
| log_pgptr | The currently fixed append page | The pointer the crossing nulls then re-points |
| appending_page_tde_encrypted | Pages created mid-append must be TDE-encrypted | Propagates the record’s encryption decision onto new mid-record pages (6.6) |

log_hdrpage (log_storage.hpp) — header at the front of every log page.

| Field | Role | Why it exists |
|---|---|---|
| logical_pageid | LOG_PAGEID: page’s address in the infinite log | So readers/flushers know which logical page a frame holds |
| offset | PGLENGTH: byte offset of the first full record here | Salvage anchor for recovery if the prior page is corrupt |
| flags | short bitfield; today only TDE bits | Carries LOG_HDRPAGE_FLAG_ENCRYPTED_AES/_ARIA; set via logpb_set_tde_algorithm |
| checksum | int: CRC32 over the page | Consistency check; memset garbage at create, computed at write time (6.4) |

log_page (log_storage.hpp): LOG_HDRPAGE hdr followed by char area[1] (record region, sized LOGAREA_SIZE). Never sizeof it; use LOG_PAGESIZE.

log_flush_info (log_impl.h) — the queue the crossing pushes into; one global, log_Gl.flush_info.

| Field | Role | Why it exists |
|---|---|---|
| max_toflush | Capacity of toflush | Threshold that forces a flush when the queue fills |
| num_toflush | Count of queued pages | Incremented under flush_mutex per crossing |
| toflush | LOG_PAGE **: ordered pages awaiting flush | Hand-off list; array order = flush order |
| flush_mutex | Mutex (SERVER_MODE) over the three above | Lets the Log Flush Thread and appender share the queue safely |

graph TD
  subgraph append_cursor
    AI["log_append_info<br/>log_Gl.append"]
    AI -->|log_pgptr| PG["LOG_PAGE (current)"]
    AI -->|appending_page_tde_encrypted| TDE["TDE decision"]
  end
  PG -->|hdr| HDR["log_hdrpage<br/>logical_pageid / offset / flags / checksum"]
  subgraph flush_queue
    FI["log_flush_info<br/>log_Gl.flush_info"]
    FI -->|toflush num_toflush| Q["LOG_PAGE *[] ordered"]
  end
  PG -.->|enqueued on crossing| Q

Figure 6-1. Struct relationships: the crossing repoints log_pgptr to a new LOG_PAGE and pushes a page into toflush[].

6.3 logpb_next_append_page: branch-complete walkthrough

// logpb_next_append_page -- src/transaction/log_page_buffer.c
assert (LOG_CS_OWN_WRITE_MODE (thread_p)); /* (entry) LOG CS held write-exclusive */
if (current_setdirty == LOG_SET_DIRTY)
{ logpb_set_dirty (thread_p, log_Gl.append.log_pgptr); } /* (A) seal old page */
log_Gl.append.log_pgptr = NULL; /* (B) detach; (C) pageid++, offset=0 */
log_Gl.hdr.append_lsa.pageid++; log_Gl.hdr.append_lsa.offset = 0;
if (LOGPB_AT_NEXT_ARCHIVE_PAGE_ID (log_Gl.hdr.append_lsa.pageid))
{ logpb_archive_active_log (thread_p); } /* (D) wrap onto unarchived slot */
if (LOGPB_IS_FIRST_PHYSICAL_PAGE (log_Gl.hdr.append_lsa.pageid))
{ log_Gl.hdr.fpageid += LOGPB_ACTIVE_NPAGES; logpb_flush_header (thread_p); } /* (E) cycled */
log_Gl.append.log_pgptr = logpb_create_page (thread_p, log_Gl.hdr.append_lsa.pageid); /* (F) */
if (log_Gl.append.log_pgptr == NULL)
{ logpb_fatal_error (thread_p, true, ARG_FILE_LINE, "log_next_append_page"); return; } /* (G) */
if (log_Gl.append.appending_page_tde_encrypted) /* (H) propagate TDE — see 6.6 */
{ /* ... logpb_set_tde_algorithm + logpb_set_dirty ... */ }
rv = pthread_mutex_lock (&flush_info->flush_mutex);
flush_info->toflush[flush_info->num_toflush++] = log_Gl.append.log_pgptr; /* (I) enqueue NEW page */
need_flush = (flush_info->num_toflush >= flush_info->max_toflush); /* (J) queue full? */
pthread_mutex_unlock (&flush_info->flush_mutex);
if (need_flush)
{ logpb_flush_all_append_pages (thread_p); } /* (K) forced flush, outside the mutex */

Figure 6-2 traces every branch; two are non-obvious. (B): between detach and (F) there is no current append page, but the write-exclusive LOG CS (entry assert) means no other appender ever observes the gap. (I): the page enqueued is the fresh empty page, not the one just filled — that one was queued at its own birth, so every page enters toflush[] exactly once. The rest (D archive-wrap → Chapter 10, E ring-wrap header bump, G fatal-NULL) are labelled in the excerpt and flowchart.

flowchart TD
  S["enter, LOG CS write-held"] --> A{"current_setdirty == LOG_SET_DIRTY?"}
  A -->|yes| A1["logpb_set_dirty(old page)"]
  A -->|no| B
  A1 --> B["log_pgptr = NULL; pageid++; offset = 0"]
  B --> D{"LOGPB_AT_NEXT_ARCHIVE_PAGE_ID?"}
  D -->|yes| D1["logpb_archive_active_log"]
  D -->|no| E
  D1 --> E{"LOGPB_IS_FIRST_PHYSICAL_PAGE?"}
  E -->|yes| E1["fpageid += ACTIVE_NPAGES; logpb_flush_header"]
  E -->|no| F
  E1 --> F["log_pgptr = logpb_create_page(pageid)"]
  F --> G{"log_pgptr == NULL?"}
  G -->|yes| G1["logpb_fatal_error -> return"]
  G -->|no| H{"appending_page_tde_encrypted?"}
  H -->|yes| H1["set_tde_algorithm; set_dirty"]
  H -->|no| I
  H1 --> I["lock flush_mutex; toflush[num++] = new page"]
  I --> J{"num_toflush >= max_toflush?"}
  J -->|yes| J1["need_flush = true"]
  J -->|no| K
  J1 --> K["unlock flush_mutex"]
  K --> L{"need_flush?"}
  L -->|yes| L1["logpb_flush_all_append_pages"]
  L -->|no| Z["return"]
  L1 --> Z

Figure 6-2. logpb_next_append_page control flow, every branch.
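The enqueue-and-threshold tail of the walkthrough (steps I through K) can be modeled in a few lines. This is a toy sketch, not CUBRID code: `sketch_flush_queue`, `sketch_appender`, and `sketch_flush_all` are hypothetical names, page ids stand in for `LOG_PAGE *`, and the stand-in flush simply empties the queue and re-seeds the live page. It shows the two properties discussed above: the page pushed is the new one, and the flush decision is taken under the mutex while the flush itself runs after release.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Toy flush queue; page ids stand in for LOG_PAGE pointers.
struct sketch_flush_queue
{
  std::mutex flush_mutex;
  std::vector<long> toflush;	// array order = flush order
  std::size_t max_toflush = 4;
  int forced_flushes = 0;
};

struct sketch_appender
{
  long append_pageid = 0;
  sketch_flush_queue *queue = nullptr;
};

// Stand-in for the flush engine: drain the queue, then re-seed the live page.
inline void
sketch_flush_all (sketch_appender &ap)
{
  std::lock_guard<std::mutex> guard (ap.queue->flush_mutex);
  ap.queue->toflush.clear ();
  ap.queue->toflush.push_back (ap.append_pageid);
  ap.queue->forced_flushes++;
}

// One page crossing: advance the cursor, enqueue the NEW page, decide the
// threshold under the mutex, and run the flush only after releasing it.
inline void
sketch_next_append_page (sketch_appender &ap)
{
  ap.append_pageid++;
  bool need_flush = false;
  {
    std::lock_guard<std::mutex> guard (ap.queue->flush_mutex);
    ap.queue->toflush.push_back (ap.append_pageid);
    need_flush = ap.queue->toflush.size () >= ap.queue->max_toflush;
  }
  if (need_flush)
    {
      sketch_flush_all (ap);	// outside the mutex scope above
    }
}
```

Deciding under the lock and acting outside it keeps the enqueue critical section short, and it avoids re-entering the mutex from a flush routine that takes the same lock.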

6.4 Obtaining the frame: logpb_locate_page for NEW_PAGE


logpb_create_page(thread_p, pageid) is return logpb_locate_page (thread_p, pageid, NEW_PAGE);. logpb_locate_page maps a logical pageid to a buffer frame and, for NEW_PAGE, initializes the header in place — never touching disk. The branches that matter:

// logpb_locate_page -- src/transaction/log_page_buffer.c
index = logpb_get_log_buffer_index (pageid); /* ring hash -> frame slot; bad index -> NULL */
log_bufptr = &log_Pb.buffers[index];
if (log_bufptr->pageid != NULL_PAGEID && log_bufptr->pageid != pageid)
{ /* frame holds a DIFFERENT page */
if (log_bufptr->dirty == true)
{ assert_release (false); /* must not victimize dirty */ ... }
log_bufptr->pageid = NULL_PAGEID; /* invalidate */
}
if (log_bufptr->pageid == NULL_PAGEID)
{
if (fetch_mode == NEW_PAGE)
{
memset (log_bufptr->logpage, LOG_PAGE_INIT_VALUE, LOG_PAGESIZE); /* 0xff fill */
log_bufptr->logpage->hdr.logical_pageid = pageid; /* (1) */
log_bufptr->logpage->hdr.offset = NULL_OFFSET; /* (2) */
log_bufptr->logpage->hdr.flags = 0; /* (3) clears any TDE bits */
}
else /* OLD_PAGE */
{ if (logpb_read_page_from_file (...) != NO_ERROR) { return NULL; } }
}
else
{ assert (fetch_mode == OLD_PAGE); /* frame already holds exactly this page */ }

The three header writes initialize the new page: (1) logical_pageid = pageid binds the frame to this logical page; (2) offset = NULL_OFFSET records that no record starts here yet, and logpb_start_append later overwrites it with the first-record offset; (3) flags = 0 clears stale TDE bits, which is why (H) must re-apply the algorithm. checksum is not set here: it keeps the 0xff fill from the memset until logpb_set_page_checksum (called by logpb_writev_append_pages per page before fileio_write) runs log_pgptr->hdr.checksum = checksum_crc32;.

Invariant (one frame per ring slot): the assert_release (false) encodes that the appender must never evict a dirty frame for a new append page. The ring is sized so a slot’s prior occupant is flushed before reuse. Violating it overwrites a page the flusher believed safe — silent log corruption.
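A compact model of the NEW_PAGE path makes the slot-reuse rule concrete. Everything here is a sketch under assumptions: `sketch_frame` and `sketch_locate_new_page` are hypothetical, the slot is computed as pageid modulo ring size (the excerpt only says "ring hash"), binding the slot's pageid after initialization is assumed, and returning `nullptr` for a dirty occupant replaces the real `assert_release (false)` handling, whose continuation is elided in the excerpt.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::int64_t SKETCH_NULL_PAGEID = -1;
constexpr std::int16_t SKETCH_NULL_OFFSET = -1;
constexpr std::size_t SKETCH_PAGESIZE = 4096;	// assumed size for the sketch

struct sketch_frame
{
  std::int64_t pageid = SKETCH_NULL_PAGEID;	// which logical page the slot holds
  bool dirty = false;
  std::int64_t hdr_logical_pageid = SKETCH_NULL_PAGEID;
  std::int16_t hdr_offset = SKETCH_NULL_OFFSET;
  std::int16_t hdr_flags = 0;
  std::vector<unsigned char> body = std::vector<unsigned char> (SKETCH_PAGESIZE, 0);
};

// Returns the frame for pageid (assumed non-negative), or nullptr if the slot
// still holds a different page with unflushed changes.
inline sketch_frame *
sketch_locate_new_page (std::vector<sketch_frame> &ring, std::int64_t pageid)
{
  // Assumption: slot index is pageid modulo the ring size.
  sketch_frame &f = ring[static_cast<std::size_t> (pageid) % ring.size ()];
  if (f.pageid != SKETCH_NULL_PAGEID && f.pageid != pageid)
    {
      if (f.dirty)
	{
	  return nullptr;	// never victimize a dirty frame
	}
      f.pageid = SKETCH_NULL_PAGEID;	// invalidate the clean occupant
    }
  if (f.pageid == SKETCH_NULL_PAGEID)
    {
      std::fill (f.body.begin (), f.body.end (), static_cast<unsigned char> (0xff));
      f.hdr_logical_pageid = pageid;	// (1)
      f.hdr_offset = SKETCH_NULL_OFFSET;	// (2)
      f.hdr_flags = 0;			// (3) clears any TDE bits
      f.pageid = pageid;		// assumed: bind the slot to the page
    }
  return &f;
}
```

With a ring of four frames, pages 5 and 9 share slot 1, so page 9 can only take the slot once page 5 has been flushed and its dirty bit cleared.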

logpb_next_append_page handles mid-stream crossings; two public functions handle the first page of an append session, bootstrapping log_pgptr and re-anchoring WAL.

// logpb_fetch_start_append_page -- src/transaction/log_page_buffer.c
PAGE_FETCH_MODE flag = OLD_PAGE;
if ((log_Gl.hdr.append_lsa.pageid == FIRST_LOG_PAGEID) /* NDEBUG: ==0; else PRM_ID_FIRST_LOG_PAGEID */
&& (log_Gl.hdr.append_lsa.offset == 0))
{ flag = NEW_PAGE; } /* empty log: skip the read */
if (log_Gl.append.log_pgptr != NULL)
{ logpb_invalid_all_append_pages (thread_p); } /* stale append page: discard */
log_Gl.append.log_pgptr = logpb_locate_page (thread_p, log_Gl.hdr.append_lsa.pageid, flag);
if (log_Gl.append.log_pgptr == NULL) { return ER_FAILED; }
log_Gl.append.set_nxio_lsa (log_Gl.hdr.append_lsa); /* (*) re-anchor WAL boundary */
// ... same flush_mutex enqueue as 6.3: toflush[num_toflush++] = log_pgptr; threshold -> need_flush ...
if (need_flush) { logpb_flush_pages_direct (thread_p); } /* note: direct, not _all_append */

Two branches distinguish it from the mid-stream path. Flag selection: an empty log (pageid equal to FIRST_LOG_PAGEID, offset 0) is fetched NEW_PAGE with no read; otherwise OLD_PAGE reads back the on-disk page (e.g. restart resuming a half-full last page) so the appender continues after the tail. Stale-page discard: a non-NULL log_pgptr on entry triggers logpb_invalid_all_append_pages. The (*) line re-anchors WAL: nxio_lsa jumps to the current append position (“everything below here is durable”). Mid-stream logpb_next_append_page never touches nxio_lsa, since within a session the boundary moves only when pages actually flush (Chapter 7).

// logpb_fetch_start_append_page_new -- src/transaction/log_page_buffer.c
log_Gl.append.log_pgptr = logpb_locate_page (thread_p, log_Gl.hdr.append_lsa.pageid, NEW_PAGE);
if (log_Gl.append.log_pgptr == NULL) { return NULL; } /* caller handles NULL */
log_Gl.append.set_nxio_lsa (log_Gl.hdr.append_lsa);
return log_Gl.append.log_pgptr;

_new is the stripped variant: always NEW_PAGE, no stale-page check, no enqueue and no flush threshold. It serves callers (log creation / format) that want a fresh first page but manage flushing themselves. Before reusing it, note that it skips the toflush[] bookkeeping the other two perform.

6.6 TDE flag propagation across the boundary


The encryption decision belongs to the record but is enforced per page via log_Gl.append.appending_page_tde_encrypted. logpb_append_next_record (Chapter 5) owns the flag’s lifetime: it sets the flag from prior_is_tde_encrypted (node) before assembly and resets it to false after logpb_end_append. While true, two re-stamp sites carry the TDE bits onto pages entered during assembly. Site (H) inside logpb_next_append_page (6.3) calls logpb_set_tde_algorithm then logpb_set_dirty on the just-created page. The second site, in logpb_start_append, guards against re-stamping:

// logpb_start_append -- src/transaction/log_page_buffer.c
if (log_Gl.append.appending_page_tde_encrypted)
{
if (!LOG_IS_PAGE_TDE_ENCRYPTED (log_Gl.append.log_pgptr)) /* idempotent guard */
{
TDE_ALGORITHM tde_algo = (TDE_ALGORITHM) prm_get_integer_value (PRM_ID_TDE_DEFAULT_ALGORITHM);
logpb_set_tde_algorithm (thread_p, log_Gl.append.log_pgptr, tde_algo);
logpb_set_dirty (thread_p, log_Gl.append.log_pgptr);
}
}

logpb_set_tde_algorithm writes hdr.flags (clear the encrypted mask, OR the algorithm bit). Because logpb_locate_page zeroed hdr.flags at create (6.4 step 3), the new page starts un-encrypted and (H) re-stamps it. The trailing logpb_set_dirty is essential: the flusher selects pages by their dirty bit, so a flag written to a page that is not marked dirty is not guaranteed to reach disk. The LOG_IS_PAGE_TDE_ENCRYPTED guard in logpb_start_append keeps it from re-stamping a page (H) already handled.

Invariant (encryption follows the record across pages): if record R is TDE-encrypted, every page R touches — including pages allocated mid-R — has a non-zero TDE flag. Enforced by appending_page_tde_encrypted staying true for R’s whole assembly plus the (H) re-stamp on each new page. If broken, part of an encrypted record is written in clear text: logpb_writev_append_pages checks LOG_IS_PAGE_TDE_ENCRYPTED per page at write time, so a missing flag means that page silently skips encryption.
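The flag manipulation behind this invariant is a clear-mask-then-OR on `hdr.flags` plus an idempotent guard. The sketch below illustrates that shape only: the bit values, the `sketch_` names, and the `sketch_tde_algorithm` enum are assumptions for illustration, not the real `LOG_HDRPAGE_FLAG_*` definitions.

```cpp
#include <cassert>
#include <cstdint>

// Bit values are assumptions for this sketch, not the real header constants.
constexpr std::int16_t SKETCH_FLAG_ENCRYPTED_AES = 0x1;
constexpr std::int16_t SKETCH_FLAG_ENCRYPTED_ARIA = 0x2;
constexpr std::int16_t SKETCH_FLAG_ENCRYPTED_MASK = 0x3;

enum class sketch_tde_algorithm { NONE, AES, ARIA };

struct sketch_page_hdr
{
  std::int16_t flags = 0;	// a freshly created page starts with flags = 0
};

// Clear the encrypted mask, then OR in the bit for the chosen algorithm.
inline void
sketch_set_tde_algorithm (sketch_page_hdr &hdr, sketch_tde_algorithm algo)
{
  hdr.flags = static_cast<std::int16_t> (hdr.flags & ~SKETCH_FLAG_ENCRYPTED_MASK);
  if (algo == sketch_tde_algorithm::AES)
    {
      hdr.flags = static_cast<std::int16_t> (hdr.flags | SKETCH_FLAG_ENCRYPTED_AES);
    }
  else if (algo == sketch_tde_algorithm::ARIA)
    {
      hdr.flags = static_cast<std::int16_t> (hdr.flags | SKETCH_FLAG_ENCRYPTED_ARIA);
    }
}

inline bool
sketch_is_page_tde_encrypted (const sketch_page_hdr &hdr)
{
  return (hdr.flags & SKETCH_FLAG_ENCRYPTED_MASK) != 0;
}

// Re-stamp a page entered while an encrypted record is being assembled.
// Returns true when the header changed, i.e. the caller must mark the page dirty.
inline bool
sketch_restamp_if_needed (sketch_page_hdr &hdr, bool appending_page_tde_encrypted,
			  sketch_tde_algorithm algo)
{
  if (!appending_page_tde_encrypted || sketch_is_page_tde_encrypted (hdr))
    {
      return false;		// nothing to do: not encrypted, or already stamped
    }
  sketch_set_tde_algorithm (hdr, algo);
  return true;
}
```

Calling the re-stamp twice on the same page changes the header only once, which mirrors the guard in `logpb_start_append`.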

log_Gl.append.prev_lsa is logical, not page-relative, so logpb_next_append_page leaves it untouched — it keeps pointing at the last appended record regardless of page, letting logpb_start_append validate header->back_lsa == prev_lsa even when the new record begins on a freshly crossed page. prev_lsa advances only in logpb_start_append (LSA_COPY (&log_Gl.append.prev_lsa, &log_Gl.hdr.append_lsa)), never in the page fetch. This buffer-side seam pairs with Chapter 4’s prior-side log_prior_lsa_append_advance_when_doesnot_fit: both compute the break from the same LOGAREA_SIZE threshold, so reserved address and materialized frame never disagree.

  1. One function owns the mid-stream crossing. logpb_next_append_page optionally seals the full page, nulls log_pgptr, advances append_lsa.pageid, creates a fresh frame, and queues a page — all under the write-held LOG CS.
  2. The page enqueued on a crossing is the new page, not the full one — every page enters toflush[] exactly once, at its own birth.
  3. NEW_PAGE create writes three header fields, not the checksum. logpb_locate_page sets logical_pageid, offset = NULL_OFFSET, flags = 0 and memsets the body 0xff; checksum is computed by logpb_set_page_checksum just before fileio_write.
  4. flags = 0 at create is why TDE must be re-stamped. Three sites touch appending_page_tde_encrypted: set/reset in logpb_append_next_record plus the re-stamps in logpb_next_append_page (H) and logpb_start_append, keeping an encrypted record encrypted across page breaks.
  5. Threshold flush is decided under flush_mutex, executed outside it: num_toflush >= max_toflush sets need_flush, then logpb_flush_all_append_pages runs after release.
  6. logpb_fetch_start_append_page vs _new. The former chooses NEW_PAGE/OLD_PAGE, discards stale pages, enqueues, and re-anchors nxio_lsa; _new is always NEW_PAGE and skips the enqueue/threshold work for callers managing their own flush.
  7. The crossing is the buffer-side mirror of Chapter 4 — the prior side reserves address space across the tail before bytes exist, this side fetches the frame for that address, both keyed off LOGAREA_SIZE.

Chapter 7: Flush Durability and the WAL Rule


This chapter answers three linked questions: how do dirty log pages reach stable storage, how is nxio_lsa advanced, and how do group commit and the WAL invariant cooperate to keep recovery correct? Chapters 3–6 built a prior list, drained it into the page buffer (Ch 5), and crossed page boundaries (Ch 6), all in volatile memory. Here the bytes hit the disk.

For the framing of WAL and why the log must be durable before data pages, see the companion cubrid-log-manager.md §“Write-Ahead Logging”. We trace the enforcing code, not the theory.

7.1 The three structures that hold the durability state


Durability is coordinated across three structs: log_append_info (log_append.hpp) owns the watermark; log_flush_info (log_impl.h) is the work list of pages to scan; log_buffer (log_page_buffer.c) is the per-page slot whose dirty bit the flusher clears.

log_append_info holds { int vdes; std::atomic<LOG_LSA> nxio_lsa; LOG_LSA prev_lsa; LOG_PAGE *log_pgptr; bool appending_page_tde_encrypted; }:

| Field | Role | Why it exists |
| --- | --- | --- |
| vdes | Active-log fd passed to every fileio_write/fileio_synchronize. | The flusher must know which fd to write and fsync. |
| nxio_lsa | The durability watermark: lowest LSA whose page is not yet forced to disk. Atomic for lock-free reads. | Answers both “is record X durable?” (group-commit) and “must I flush before this data page?” (WAL). |
| prev_lsa | Last appended (in-buffer) record, vs nxio_lsa (last flushed). | Lets the flusher detect a partial record (nxio_lsa.pageid == prev_lsa.pageid) and not validate it early. |
| log_pgptr | Append page currently fixed for new records. | After a flush resets num_toflush, it is re-seeded as toflush[0]. |
| appending_page_tde_encrypted | Whether the next page must be TDE-encrypted. | Carries the encryption decision from append time to write time. |

The accessors are trivial (get_nxio_lsa = nxio_lsa.load (), set_nxio_lsa = nxio_lsa.store ()); the atomicity is the contract.

log_flush_info holds { int max_toflush; int num_toflush; LOG_PAGE **toflush; pthread_mutex_t flush_mutex; } (the mutex is SERVER_MODE only):

| Field | Role | Why it exists |
| --- | --- | --- |
| max_toflush | Array capacity; at num_toflush == max_toflush the buffer is full (log_buffer_full_count ticks). | Bounds the batch; full-list events drive Ch 6’s partial-append path. |
| num_toflush | Count of queued pages; < 1 means nothing to flush. | The loop bound; reset to 0 then re-seeded with the live append page. |
| toflush | Array of LOG_PAGE*, ascending by pageid. | The contiguity scan walks it to coalesce pages into one writev. |
| flush_mutex | Guards the array vs concurrent producers (Ch 5 drain) and the flusher. | Held across the entire flush body (scan, nxio-page write, and the fileio_synchronize): acquired in Phase 2’s scan setup and released only after the Phase 6 nxio_lsa advance; i.e. the fsync runs while flush_mutex is held. A separate short-lived acquisition guards only the Phase-1 num_toflush check. The group-commit wait is on gc_cond/gc_mutex (§7.5), a different lock entirely. |

log_buffer holds { volatile LOG_PAGEID pageid; volatile LOG_PHY_PAGEID phy_pageid; bool dirty; LOG_PAGE *logpage; }:

| Field | Role | Why it exists |
| --- | --- | --- |
| pageid | Logical page id; volatile as slots recycle. | Flusher asserts bufptr->pageid == toflush[i]->hdr.logical_pageid to confirm the slot still holds its page. |
| phy_pageid | Physical offset in the active log volume. | Write target phy_pageid + i; contiguity needs phy_pageid+1, not just pageid+1. |
| dirty | Has un-flushed changes. | The scan’s primary filter; cleared exactly when the write succeeds. |
| logpage | The page bytes (header + area). | What is handed to fileio_write/encryption. |

flowchart LR
  FI["LOG_FLUSH_INFO<br/>toflush[ ], num_toflush"]
  LB["LOG_BUFFER<br/>pageid, phy_pageid, dirty, logpage"]
  AI["log_append_info<br/>nxio_lsa, prev_lsa, log_pgptr"]
  FI -->|"toflush[i] resolves to"| LB
  LB -->|"dirty pages written, then"| AI
  AI -->|"nxio_lsa.pageid flushed LAST"| FI

Figure 7-1. The three durability structures and how the flusher pivots between them.

Durability invariant. A commit record at LSA L is durable iff get_nxio_lsa () > L (the watermark moved past L’s page). logpb_flush_all_append_pages advances nxio_lsa only after fileio_synchronize returns and the nxio_lsa page is written last. If it advanced before the fsync, a crash could leave a committed transaction whose log is not on disk, and recovery would lose it.

7.2 logpb_flush_all_append_pages — the durability engine


The only function that writes append pages and moves nxio_lsa. It runs under LOG_CS write mode (assert (LOG_CS_OWN_WRITE_MODE)) and returns 1 = flushed, 0 = nothing to do, < 0 = error.

Phase 1 — decide whether to flush at all, under a short-lived flush_mutex acquisition that is released before the body. Two early returns set need_flush = false and return 0: when num_toflush < 1 (empty list), and when num_toflush == 1 && !logpb_is_dirty (toflush[0]). The single-clean-page short-circuit keeps idle timer-driven flushes from rewriting an unchanged end-of-log marker.

Phase 2 — place the end-of-log marker, branching on log_Pb.partial_append.status (Ch 6’s LOGPB_APPENDREC_* enum), then re-acquire flush_mutex for the rest of the body:

  • IN_PROGRESS — a record is half-appended. Copy the header page aside, clear the slot’s dirty, overwrite the in-progress header with LOG_END_OF_LOG, write that copy via logpb_write_page_to_disk; status → PARTIAL_FLUSHED_END_OF_LOG. If the page already left the buffer it is fatal → goto error.
  • PARTIAL_FLUSHED_END_OF_LOG — re-entry continuing the flush; log and fall through.
  • PARTIAL_ENDED / SUCCESS — normal case: build an eof record and logpb_start_append it without advancing append_lsa (overwritten later).
  • anything else: assert_release (false) → goto error.

Phase 3 — the two-step contiguous-run scan (under flush_mutex), whose rule is the nxio_lsa page is flushed last. A while (true) alternates step 1 (skip until a dirty non-nxio page; exit if none remain) and step 2 (extend the run). Step 2 has four break conditions — each a real branch that ends the run and re-enters the skip phase:

// logpb_flush_all_append_pages (step-2 run conditions) -- src/transaction/log_page_buffer.c
if (!bufptr->dirty) break; /* <- clean stops run */
if (bufptr->pageid == log_Gl.append.get_nxio_lsa ().pageid) break; /* <- nxio last */
if (prv_bufptr->pageid + 1 != bufptr->pageid) break; /* <- logical gap */
if (prv_bufptr->phy_pageid + 1 != bufptr->phy_pageid) break;/* <- physical gap */

The run [idxflush, i) then goes to logpb_writev_append_pages; a NULL return is fatal (goto error), otherwise need_sync = true and each page’s dirty is cleared — only after the write returns non-NULL.
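The two-step scan can be restated as a small function that returns the runs it would hand to the batched write. This is a sketch under assumptions: `sketch_buf` and `sketch_collect_runs` are hypothetical, the buffers are plain structs, and the function only reports index ranges instead of writing anything. The four break conditions are the ones listed in the excerpt.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct sketch_buf
{
  long pageid;			// logical page id
  long phy_pageid;		// physical slot in the active log volume
  bool dirty;
};

// Returns half-open index ranges [first, last) over toflush, one per batched
// write. The page whose id equals nxio_pageid is never part of any run.
inline std::vector<std::pair<std::size_t, std::size_t>>
sketch_collect_runs (const std::vector<sketch_buf> &toflush, long nxio_pageid)
{
  std::vector<std::pair<std::size_t, std::size_t>> runs;
  const std::size_t n = toflush.size ();
  std::size_t i = 0;
  while (true)
    {
      // Step 1: skip until a dirty page that is not the nxio page.
      while (i < n && (!toflush[i].dirty || toflush[i].pageid == nxio_pageid))
	{
	  i++;
	}
      if (i >= n)
	{
	  break;
	}
      const std::size_t first = i++;
      // Step 2: extend the run while pages stay dirty and contiguous.
      while (i < n)
	{
	  const sketch_buf &prv = toflush[i - 1];
	  const sketch_buf &cur = toflush[i];
	  if (!cur.dirty) break;				// clean stops run
	  if (cur.pageid == nxio_pageid) break;			// nxio last
	  if (prv.pageid + 1 != cur.pageid) break;		// logical gap
	  if (prv.phy_pageid + 1 != cur.phy_pageid) break;	// physical gap
	  i++;
	}
      runs.emplace_back (first, i);
    }
  return runs;
}
```

For a queue holding pages 10 to 15 where page 12 is clean, the physical id wraps between 13 and 14, and 15 is the nxio page, the scan yields three runs: pages 10 to 11, page 13 alone, and page 14 alone.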

Phase 4 — flush the nxio_lsa page last, branching on whether it holds a complete record:

// logpb_flush_all_append_pages (nxio page) -- src/transaction/log_page_buffer.c
if (log_Pb.partial_append.status == LOGPB_APPENDREC_SUCCESS
|| nxio_lsa.pageid != log_Gl.append.prev_lsa.pageid) /* complete -> write it */
{ /* assert_release pageid match and dirty, else goto error */
logpb_write_page_to_disk (thread_p, bufptr->logpage, bufptr->pageid);
need_sync = true; bufptr->dirty = false; }
else { /* skip: nxio page holds an incomplete record, defer until complete */ }

Phase 5 — the fsync, still under flush_mutex. When need_sync is set and the PRM_ID_SUPPRESS_FSYNC sampling escape allows it (escape == 0, or total_sync_count % escape == 0), it calls fileio_synchronize (thread_p, log_Gl.append.vdes, log_Name_active, false); a NULL_VOLDES return is fatal → goto error.

Phase 6 — advance nxio_lsa, again branching on partial_append.status:

  • LOGPB_APPENDREC_PARTIAL_ENDED — restore the original record header, rewrite + fsync again, then set_nxio_lsa (log_Gl.hdr.append_lsa); status → PARTIAL_FLUSHED_ORIGINAL.
  • LOGPB_APPENDREC_PARTIAL_FLUSHED_END_OF_LOG — cannot validate yet; set_nxio_lsa (log_Gl.append.prev_lsa) (one record short).
  • LOGPB_APPENDREC_SUCCESS: set_nxio_lsa (log_Gl.hdr.append_lsa).
  • else: assert_release (false) → goto error.

The list is then reset (num_toflush = 0) and, if log_Gl.append.log_pgptr != NULL, that live page is re-seeded as toflush[0]; flush_mutex is released and the function returns 1. The error: label releases flush_mutex if still held and calls logpb_fatal_error — every error path is unrecoverable.
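The decisions in Phases 4 to 6 reduce to three small predicates, sketched below. The names (`sketch_status`, `sketch_lsa`, and the three functions) are hypothetical, the header restore and second write/fsync of the PARTIAL_ENDED case are omitted, and whether the sync counter is bumped before or after the sampling check is not stated in the text, so the counter is taken as an input.

```cpp
#include <cassert>

enum class sketch_status
{
  IN_PROGRESS, PARTIAL_ENDED, PARTIAL_FLUSHED_END_OF_LOG,
  PARTIAL_FLUSHED_ORIGINAL, SUCCESS
};

struct sketch_lsa
{
  long pageid;
  int offset;
};

// Phase 4: the nxio page is written now only if it holds a complete record.
inline bool
sketch_write_nxio_page_now (sketch_status status, const sketch_lsa &nxio,
			    const sketch_lsa &prev_lsa)
{
  return status == sketch_status::SUCCESS || nxio.pageid != prev_lsa.pageid;
}

// Phase 5: fsync when something was written and the sampling escape allows it.
inline bool
sketch_should_fsync (bool need_sync, int escape, long total_sync_count)
{
  return need_sync && (escape == 0 || total_sync_count % escape == 0);
}

// Phase 6: move the watermark according to the partial-append status.
// Returns false for the "else" branch, which the real code treats as fatal.
inline bool
sketch_advance_nxio (sketch_status &status, const sketch_lsa &append_lsa,
		     const sketch_lsa &prev_lsa, sketch_lsa &nxio)
{
  switch (status)
    {
    case sketch_status::PARTIAL_ENDED:
      nxio = append_lsa;	// after restoring the header and rewriting
      status = sketch_status::PARTIAL_FLUSHED_ORIGINAL;
      return true;
    case sketch_status::PARTIAL_FLUSHED_END_OF_LOG:
      nxio = prev_lsa;		// one record short: cannot validate yet
      return true;
    case sketch_status::SUCCESS:
      nxio = append_lsa;
      return true;
    default:
      return false;
    }
}
```

Reading the three together shows why a half-appended record never becomes visible as durable: its page is skipped in Phase 4, and Phase 6 parks the watermark at `prev_lsa` until the record completes.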

flowchart TD
  A["enter (LOG_CS write)"] --> B{num_toflush?}
  B -->|"< 1, or 1+clean"| Z0["return 0"]
  B -->|flushable| C{partial_append.status}
  C -->|IN_PROGRESS| D["overwrite EOL, write copy"]
  C -->|SUCCESS/PARTIAL_ENDED| E["start_append EOL marker"]
  C -->|PARTIAL_FLUSHED_END_OF_LOG| F
  D --> F["scan toflush: skip clean,<br/>collect dirty run, writev, clear dirty"]
  E --> F
  F --> H{nxio page holds<br/>partial record?}
  H -->|no| I["write nxio page LAST"]
  H -->|yes| K
  I --> K{need_sync?}
  K -->|yes| L["fileio_synchronize"]
  K -->|no| M
  L --> M["advance nxio_lsa per status,<br/>num_toflush=0, reseed log_pgptr"]
  M --> Z1["return 1"]
  D -.->|fatal| ERR["logpb_fatal_error"]
  L -.->|fail| ERR

Figure 7-2. Branch-complete flow of logpb_flush_all_append_pages.

7.3 logpb_writev_append_pages — the actual write


The lowest-level write helper. It CRC-stamps every page (logpb_set_page_checksum, NULL on failure), then loops over npages with two per-page branches:

// logpb_writev_append_pages -- src/transaction/log_page_buffer.c
if (LOG_IS_PAGE_TDE_ENCRYPTED (log_pgptr)) /* branch 1: encrypt into enc_pgptr; */
// ... on encrypt failure, turn TDE off for this page ...
if (fileio_write (..., log_pgptr, phy_pageid + i, LOG_PAGESIZE, write_mode) == NULL)
{ /* branch 2: ER_LOG_WRITE_OUT_OF_SPACE / ER_LOG_WRITE */ to_flush = NULL; break; }

Despite the name it loops fileio_write page by page at phy_pageid + i; Phase-3 contiguity makes the batch one sequential extent. write_mode is FILEIO_WRITE_NO_COMPENSATE_WRITE under DWB. A to_flush == NULL return on any write failure is fatal to the caller.
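The stamp-then-write shape can be illustrated with a toy volume. This is a sketch, not the CUBRID routine: `sketch_page`, `sketch_writev`, and the map-backed volume are hypothetical, TDE encryption is left out, the capacity limit merely simulates a failed write, and the choice of which bytes the CRC covers is an assumption. The CRC itself is the standard reflected CRC-32 (polynomial 0xEDB88320).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Bitwise CRC-32 (reflected, polynomial 0xEDB88320).
inline std::uint32_t
sketch_crc32 (const unsigned char *data, std::size_t len)
{
  std::uint32_t crc = 0xFFFFFFFFu;
  for (std::size_t i = 0; i < len; i++)
    {
      crc ^= data[i];
      for (int k = 0; k < 8; k++)
	{
	  crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
    }
  return ~crc;
}

struct sketch_page
{
  std::uint32_t checksum = 0;
  std::vector<unsigned char> bytes;	// assumed: the bytes the checksum covers
};

inline bool
sketch_verify (const sketch_page &p)
{
  return p.checksum == sketch_crc32 (p.bytes.data (), p.bytes.size ());
}

// Stamp every page first, then write page by page at phy_pageid + i.
// Returns false on the first failed write; the caller treats that as fatal.
inline bool
sketch_writev (std::map<long, sketch_page> &volume, std::size_t capacity,
	       std::vector<sketch_page> &run, long phy_pageid)
{
  for (sketch_page &p : run)
    {
      p.checksum = sketch_crc32 (p.bytes.data (), p.bytes.size ());
    }
  for (std::size_t i = 0; i < run.size (); i++)
    {
      const long target = phy_pageid + static_cast<long> (i);
      if (volume.size () >= capacity && volume.count (target) == 0)
	{
	  return false;		// simulated out-of-space
	}
      volume[target] = run[i];
    }
  return true;
}
```

A reader of the volume can then call `sketch_verify` on each page; a single flipped byte makes the stored and recomputed checksums disagree.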

logpb_flush_pages_direct is the core: under assert (LOG_CS_OWN_WRITE_MODE) it calls logpb_prior_lsa_append_all_list (the Ch 5 drain) then logpb_flush_all_append_pages (the engine). Two thin wrappers add the critical section: logpb_force_flush_pages is just LOG_CS_ENTER; logpb_flush_pages_direct; LOG_CS_EXIT, and logpb_force_flush_header_and_pages adds logpb_flush_header (Ch 10) before the exit — used at checkpoint and wherever the header’s eof_lsa/append fields must match disk.

7.5 logpb_flush_pages — the four commit modes


logpb_flush_pages (thread_p, flush_lsa) is the entry every committing transaction calls. !SERVER_MODE is direct flush under LOG_CS. In SERVER_MODE two fall-backs go direct:

  • not restarted, or flush_lsa NULL/ISNULL → direct flush, return.
  • daemon unavailable (!log_is_log_flush_daemon_available ()) → direct flush.

Otherwise it derives a 2×2 policy from async_commit (PRM_ID_LOG_ASYNC_COMMIT) × group_commit (LOG_IS_GROUP_COMMIT_ACTIVE ()):

// logpb_flush_pages (mode matrix) -- src/transaction/log_page_buffer.c
// async group | need_wait need_wakeup_LFT
// X X | true true (sync, non-group: wake daemon, wait)
// X O | true false (sync, group: just wait)
// O X | false true (async: wake daemon, return)
// O O | false false (async+group: just return)

The wait loop is the waiter side of group commit — it sleeps on gc_cond (holding gc_mutex) until nxio_lsa passes its flush_lsa:

// logpb_flush_pages (group-commit wait) -- src/transaction/log_page_buffer.c
if (need_wakeup_LFT == false && pgbuf_has_perm_pages_fixed (thread_p))
need_wakeup_LFT = true; /* <- holding data pages: push daemon to avoid a stall */
while (LSA_LT (&nxio_lsa, flush_lsa)) { // re-read nxio_lsa each iteration
// ... lock gc_mutex ...
if (LSA_GE (&nxio_lsa, flush_lsa)) break; /* <- re-check under lock: already durable */
if (need_wakeup_LFT == true) log_wakeup_log_flush_daemon ();
pthread_cond_timedwait (&gc_cond, &gc_mutex, &to); /* 1000ms deadline */
need_wakeup_LFT = true; /* <- after first wait, always nudge daemon */
}

The in-lock re-check prevents a lost-wakeup race; the 1000ms timeout bounds latency. (The shared-fsync mechanics are in §7.6 / takeaway 4 — not repeated here.)
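The mode matrix and the pre-wait override collapse into two boolean expressions. The sketch below uses hypothetical names (`sketch_flush_policy`, `sketch_commit_policy`); it assumes the fixed-pages override applies only on the waiting path, since the excerpt places it in the wait section, and it covers only the policy derivation, not the condition-variable loop.

```cpp
#include <cassert>

struct sketch_flush_policy
{
  bool need_wait;		// block until nxio_lsa passes flush_lsa
  bool need_wakeup_LFT;		// nudge the log flush daemon
};

// Derive the 2x2 policy: synchronous commits wait, non-group commits wake the
// daemon. A waiter holding fixed data pages always wakes the daemon (assumed
// to matter only when it is going to wait).
inline sketch_flush_policy
sketch_commit_policy (bool async_commit, bool group_commit, bool has_perm_pages_fixed)
{
  sketch_flush_policy p;
  p.need_wait = !async_commit;
  p.need_wakeup_LFT = !group_commit;
  if (p.need_wait && !p.need_wakeup_LFT && has_perm_pages_fixed)
    {
      p.need_wakeup_LFT = true;
    }
  return p;
}
```

The override exists for the sync plus group case: a committer that would otherwise sleep until the next timer tick while pinning data pages asks for an immediate flush instead.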

Open question (carried from the companion). The exact group-commit window policy — how long the daemon coalesces before syncing — is the log_get_log_group_commit_interval looper period plus on-demand wakeup () calls. The companion flags the batching/latency trade-off as unresolved; this chapter documents the mechanism, not the tuning.

7.6 The flush daemon and group-commit producer side


The daemon is a cubthread::daemon with looper period log_get_log_group_commit_interval. Its task body log_flush_execute (in log_manager.c) returns early unless both BO_IS_SERVER_RESTARTED () and log_Flush_has_been_requested hold. Otherwise it performs a single LOG_CS_ENTER; logpb_flush_pages_direct; LOG_CS_EXIT on behalf of all waiters, then under gc_mutex runs pthread_cond_broadcast (&gc_cond) and clears log_Flush_has_been_requested. The broadcast (not signal) lets one flush satisfy every waiter whose flush_lsa <= nxio_lsa. The producer side, log_wakeup_log_flush_daemon, is called by committers and the WAL path and does only log_Flush_has_been_requested = true; log_Flush_daemon->wakeup (); (SERVER_MODE only). Setting the flag before wakeup () guarantees that a daemon already mid-iteration sees the request on its next loop.

stateDiagram-v2
  [*] --> Sleeping
  Sleeping --> Checking : timer tick or wakeup
  Checking --> Sleeping : not requested, return
  Checking --> Flushing : requested\n LOG_CS enter
  Flushing --> Broadcasting : flush_pages_direct done\n one fsync
  Broadcasting --> Sleeping : gc_cond broadcast\n clear request

Figure 7-3. log_Flush_daemon state cycle. Waiters in logpb_flush_pages observe nxio_lsa advance after Broadcasting.

7.7 logpb_flush_log_for_wal — the read-side WAL invariant


Called by the page buffer manager before writing any data page, passing its last-modifying LSA. It enforces WAL with double-checked locking on logpb_need_wal:

// logpb_flush_log_for_wal -- src/transaction/log_page_buffer.c
if (logpb_need_wal (lsa_ptr)) /* <- cheap atomic check, no lock */
{
LOG_CS_ENTER (thread_p);
if (logpb_need_wal (lsa_ptr)) /* <- re-check under LOG_CS, else someone flushed it */
logpb_flush_pages_direct (thread_p);
LOG_CS_EXIT (thread_p);
assert (LSA_ISNULL (lsa_ptr) || !logpb_need_wal (lsa_ptr)); /* <- post-condition */
}

The predicate logpb_need_wal (lsa) is just LSA_LE (&get_nxio_lsa (), lsa) — true when the log up to *lsa is not yet durable — making the invariant directly testable.

WAL invariant. No data page modified at LSA L may be written while logpb_need_wal (L) holds (nxio_lsa <= L). The buffer manager calls logpb_flush_log_for_wal first; its post-condition assert (!logpb_need_wal (lsa_ptr)) guarantees the log is durable to L before the write. Violating it lets a redo record’s effect reach disk without the record, so recovery cannot reconstruct or undo the change. The two logpb_need_wal calls (outside/inside LOG_CS) avoid both a needless critical section when already durable and a redundant flush when a concurrent committer advanced nxio_lsa.
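The double-checked pattern can be exercised in isolation. In this sketch an LSA is flattened to a single integer, `sketch_log` and its functions are hypothetical, a `std::mutex` stands in for LOG_CS, and the stand-in flush just moves the watermark to the append position. It assumes, as the real caller does, that a data page's LSA is below the current append position.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

struct sketch_log
{
  std::atomic<std::int64_t> nxio_lsa {0};	// lowest LSA not yet on disk
  std::int64_t append_lsa = 0;			// current append position
  std::mutex log_cs;				// stand-in for LOG_CS
  int flush_count = 0;
};

// True while the log up to lsa is not yet durable.
inline bool
sketch_need_wal (const sketch_log &log, std::int64_t lsa)
{
  return log.nxio_lsa.load () <= lsa;
}

// Stand-in for the flush engine; the caller must hold log_cs.
inline void
sketch_flush_pages_direct (sketch_log &log)
{
  log.flush_count++;
  log.nxio_lsa.store (log.append_lsa);
}

// Called before a data page with last-modifying LSA page_lsa is written.
inline void
sketch_flush_log_for_wal (sketch_log &log, std::int64_t page_lsa)
{
  if (!sketch_need_wal (log, page_lsa))
    {
      return;			// cheap check: already durable, no lock taken
    }
  std::lock_guard<std::mutex> cs (log.log_cs);
  if (sketch_need_wal (log, page_lsa))	// re-check: someone may have flushed
    {
      sketch_flush_pages_direct (log);
    }
}
```

The first call for a page at LSA 50 triggers one flush; a later call for a page at LSA 60 finds the watermark already past it and returns without entering the critical section.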

  1. nxio_lsa is the one durability watermark — the lowest not-yet-written LSA, atomic in log_append_info, answering both “is this commit durable?” and “must I flush before this data write?”. It advances only after fileio_synchronize succeeds.
  2. logpb_flush_all_append_pages flushes the nxio_lsa page last. The two-step scan (skip clean, collect contiguous dirty) batches adjacent pages, then writes the watermark page alone, so the new end-of-log is never validated before its predecessors are on disk.
  3. The main flush_mutex spans the whole flush body — scan, nxio-page write, and the fileio_synchronize all run while it is held; only the Phase-1 num_toflush peek takes a separate short lock. The group-commit wait uses gc_cond/gc_mutex, a distinct lock.
  4. Group commit amortizes one fsync over many committers. Waiters block on gc_cond and re-check nxio_lsa under gc_mutex; the daemon does one logpb_flush_pages_direct and broadcasts, releasing everyone whose flush_lsa is now covered.
  5. The 2×2 commit matrix (async_commit × group_commit) picks wake-and-wait, just-wait, wake-and-return, or just-return; non-SERVER and no-daemon paths fall back to direct flush.
  6. WAL is enforced read-side by logpb_flush_log_for_wal via double-checked logpb_need_wal around LOG_CS; its post-condition asserts the log is durable to the requested LSA before any data page is written.
  7. The exact group-commit window policy is an open question (from the companion): the mechanism is the daemon’s looper interval plus on-demand wakeup (), but the batching/latency tuning is not pinned down here.

Chapter 8: Commit and Abort Lifecycle

This chapter answers one reader question: how does a transaction-boundary record ride the same prior-list / page-buffer / flush pipeline that Chapters 3-7 built, force its own durability, and drive the final state transitions and lock release? Boundary records are special only in carrying a LOG_REC_DONETIME payload (no undo/redo data) and in being wrapped by durability and state-machine discipline. The append mechanics are unchanged — log_commit reuses prior_lsa_alloc_and_copy_data / prior_lsa_next_record (Chapters 3-4) and the logpb_flush_pages force path (Chapter 7). We focus on the wrapping; recovery-side replay is out of scope (cubrid-recovery-manager.md).

8.1 The three structs at the transaction boundary


Three structs meet at commit/abort time: the descriptor log_tdes, the per-record header log_rec_header (Chapters 1, 3), and the boundary payload log_rec_donetime.

log_rec_donetime — the commit/abort payload


The entire type-specific payload of a LOG_COMMIT/LOG_ABORT record; its existence at a known LSA is the information.

// log_rec_donetime — src/transaction/log_record.hpp
struct log_rec_donetime
{
INT64 at_time; /* Database creation time. For safety reasons */
};

| Field | Role | Why it exists |
| --- | --- | --- |
| at_time | Wall-clock time(NULL) captured in log_append_donetime_internal. | Timestamps completion for forensics. The “Database creation time” comment is stale: it holds the termination time; the commit protocol never reads it back. |

Invariant — the donetime record’s LSA is the commit point. The record carries no other state, so durability reduces to “the page holding this LSA is on disk”. Everything in §8.4 makes that true before the client is told the commit succeeded.

The generic header (full coverage in Chapter 1) is reused verbatim; at the boundary only type and prev_tranlsa carry special meaning.

// log_rec_header — src/transaction/log_record.hpp
struct log_rec_header
{
LOG_LSA prev_tranlsa; /* previous log record for the same transaction */
LOG_LSA back_lsa, forw_lsa; /* physical backward/forward links */
TRANID trid; /* transaction identifier */
LOG_RECTYPE type; /* e.g. LOG_COMMIT, LOG_ABORT */
};

| Field | Role at a COMMIT/ABORT record | Why it matters here |
| --- | --- | --- |
| prev_tranlsa | Closes the per-transaction chain. Recovery never undoes a committed chain, but the link is still written. | Chapter 4 assigns it at attach time from tdes->tail_lsa. |
| type | LOG_COMMIT, LOG_ABORT, or LOG_COMMIT_WITH_POSTPONE when postpone work remains. | Recovery dispatch keys on this to decide “this trid is done; do not undo it”. |
| back_lsa / forw_lsa | Physical-order links, assigned by the prior-list machinery as for a data record. | Lets the analysis pass scan past the boundary record. |
| trid | The committing/aborting transaction’s id. | Recovery groups records by trid. |

log_tdes — the transaction descriptor (boundary-relevant fields)


log_tdes is large; only the fields the commit/abort path reads or writes are covered. The full struct lives in log_impl.h.

// log_tdes (excerpt) — src/transaction/log_impl.h
struct log_tdes
{
int tran_index; TRANID trid; TRAN_STATE state;
LOG_LSA head_lsa; LOG_LSA tail_lsa; LOG_LSA undo_nxlsa;
LOG_LSA posp_nxlsa; LOG_LSA commit_abort_lsa;
LOG_TOPOPS_STACK topops; /* topops.last must be < 0 at the boundary */
void *first_save_entry; bool has_supplemental_log;
// ... condensed ...
};

Five fields behave identically on both paths, so they get one note rather than a two-column matrix: tran_index (table index resolved by LOG_FIND_THREAD_TRAN_INDEX; log_abort_by_tdes rebinds it onto the executing thread, §8.7), trid (stamped into log_rec_header.trid, recycled by logtb_get_new_tran_id), head_lsa (informational, never read by the protocol), topops.last (must be < 0; a live sysop is a bug — assert(false) + force-attach to outer), and first_save_entry (freed via spage_free_saved_spaces). The fields whose role diverges between commit and abort:

| Field | Role at commit | Role at abort |
| --- | --- | --- |
| state | ACTIVE -> WILL_COMMIT -> (…_WITH_POSTPONE) -> COMMITTED. | Straight to ABORTED before any rollback. |
| tail_lsa | NULL = touched nothing -> skip donetime (§8.3, §8.5); else chain tail the record links to. | Same gate. |
| undo_nxlsa | Reset to NULL so a checkpoint during WILL_COMMIT sees no stale cursor. | The rollback cursor log_rollback walks prev_tranlsa from. |
| posp_nxlsa | Non-NULL -> postpone pending -> LOG_COMMIT_WITH_POSTPONE (§8.3.1). | Unused. |
| has_supplemental_log | If set, a LOG_SUPPLEMENT_TRAN_USER record precedes the commit record (CDC), then cleared. | Just cleared. |
| commit_abort_lsa | Stamped with the boundary LSA so checkpoint distinguishes concluded from live. The stamp happens not in the commit path itself but in the prior-list append in log_append.cpp, when the donetime node is materialized. | Same. |

flowchart TB
  TDES["log_tdes\nstate, tail_lsa,\nundo_nxlsa, posp_nxlsa"]
  HDR["log_rec_header\ntype = LOG_COMMIT/ABORT\nprev_tranlsa = tail_lsa"]
  DT["log_rec_donetime\nat_time"]
  NODE["LOG_PRIOR_NODE\n(Chapter 3)"]
  TDES -->|prev_tranlsa = tail_lsa| HDR
  HDR --> NODE
  DT -->|node->data_header| NODE
  NODE -->|prior_lsa_next_record| PL["prior list -> page buffer -> disk"]

Figure 8-1 — log_tdes supplies the chain tail, a log_rec_header of type LOG_COMMIT/LOG_ABORT is built, and a log_rec_donetime becomes the node’s data header. From there it is an ordinary prior-list node.

8.2 log_commit — the entry point and its branch fan-out


log_commit resolves the descriptor, validates state, and routes to the 2PC or local path — every branch:

// log_commit — src/transaction/log_manager.c
if (tdes == NULL) return TRAN_UNACTIVE_UNKNOWN; /* <- fatal: unknown index */
if (!LOG_ISTRAN_ACTIVE (tdes) && !LOG_ISTRAN_2PC_PREPARE (tdes) && LOG_ISRESTARTED ())
return tdes->state; /* <- not commitable; no-op */
if (tdes->topops.last >= 0) /* <- impossible-but-handled */
{ assert (false); while (tdes->topops.last >= 0) log_sysop_attach_to_outer (thread_p); }
if (log_2pc_clear_and_is_tran_distributed (tdes))
state = log_2pc_commit (...); /* <- 2PC arm (cubrid-2pc.md) */
else /* <- local arm */
{ state = log_commit_local (thread_p, tdes, retain_lock, true);
state = log_complete (thread_p, tdes, LOG_COMMIT, LOG_NEED_NEWTRID, LOG_ALREADY_WROTE_EOT_LOG); }
if (log_No_logging) { /* force pages + data, flush header */ }
perfmon_inc_stat (thread_p, PSTAT_TRAN_NUM_COMMITS); /* return state */

Invariant — topops.last < 0 at the transaction boundary. Commit and abort require that no system operation is open. The code warns, asserts in debug builds, and force-folds any open sysop into the outer transaction with log_sysop_attach_to_outer; without that fold, the sysop’s records would be left out of the prev_tranlsa chain that the boundary record closes.

8.3 log_commit_local — postpone, append, release, flush


log_commit_local does the real work in a strict order dictated by one rule.

Invariant — nothing may be logged after the transaction enters an unactive state. If a checkpoint snapshots the transaction as TRAN_UNACTIVE_WILL_COMMIT and a crash precedes still-pending logging (e.g. unique statistics), recovery commits it without those changes — silent data loss. So tx_lob_locator_clear and logtb_complete_mvcc (both of which log) run before tdes->state = TRAN_UNACTIVE_WILL_COMMIT.

// log_commit_local — src/transaction/log_manager.c
tx_lob_locator_clear (...); logtb_complete_mvcc (thread_p, tdes, true); /* both log -> must precede WILL_COMMIT */
tdes->state = TRAN_UNACTIVE_WILL_COMMIT;
LSA_SET_NULL (&tdes->undo_nxlsa); /* checkpoint must not see a stale undo cursor */
if (!LSA_ISNULL (&tdes->tail_lsa)) /* <- transaction touched data */
{
log_tran_do_postpone (thread_p, tdes); /* §8.3.1 — run postpone if any */
if (is_local_tran) {
if (... log_does_allow_replication () ...)
log_append_repl_info_and_commit_log (...); /* repl+commit, one mutex */
else log_append_commit_log (thread_p, tdes, &commit_lsa); /* plain LOG_COMMIT */
if (retain_lock != true) lock_unlock_all (thread_p); /* <- retain_lock gate */
log_change_tran_as_completed (thread_p, tdes, LOG_COMMIT, &commit_lsa); /* state + force */
} else { /* participant: commit log + unlock deferred to log_complete_for_2pc */ }
}
else { if (retain_lock != true) lock_unlock_all (thread_p); /* <- read-only: no donetime record */
tdes->state = TRAN_UNACTIVE_COMMITTED; }
return tdes->state;

The replication route takes prior_lsa_mutex once so the replication and commit records get adjacent LSAs; the plain route just appends the donetime record. The participant branch (is_local_tran == false) defers both the commit record and lock release to log_complete_for_2pc (cubrid-2pc.md).
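
The ordering above can be traced with a small stand-alone model. This is only a sketch under stated assumptions: `ToyTdes`, `commit_local_steps`, the `replication` flag, and the step strings are hypothetical names invented for illustration, only the local (non-participant) arm is modeled, and the real condition behind the replication route is elided in the excerpt and not reproduced here.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the two fields the local commit path branches on.
struct ToyTdes {
    bool tail_lsa_null;    // true: the transaction appended no log record
    bool posp_nxlsa_null;  // true: no transaction-level postpone is pending
};

// Returns the ordered steps the local commit path would take, as plain strings.
// The 'replication' flag stands in for the elided replication check.
std::vector<std::string> commit_local_steps(const ToyTdes& t, bool retain_lock, bool replication) {
    std::vector<std::string> steps;
    // Work that still logs must come before the state change.
    steps.push_back("log: lob cleanup + mvcc completion");
    steps.push_back("state = WILL_COMMIT");
    steps.push_back("undo_nxlsa = NULL");

    if (t.tail_lsa_null) {
        // Read-only transaction: no donetime record at all.
        if (!retain_lock) steps.push_back("unlock all");
        steps.push_back("state = COMMITTED");
        return steps;
    }
    if (!t.posp_nxlsa_null) {
        steps.push_back("append+force COMMIT_WITH_POSTPONE");
        steps.push_back("run postpones");
    }
    steps.push_back(replication ? "append repl info + LOG_COMMIT" : "append LOG_COMMIT");
    if (!retain_lock) steps.push_back("unlock all");
    steps.push_back("state = COMMITTED; force log up to commit LSA");
    return steps;
}
```

Calling `commit_local_steps({true, true}, false, false)` yields the short read-only trace, while a descriptor with a pending postpone adds the two postpone steps ahead of the commit record. The model makes it easy to see that lock release sits between the append and the force on the read-write path.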

8.3.1 log_tran_do_postpone — LOG_COMMIT_WITH_POSTPONE


If posp_nxlsa is non-NULL the transaction has deferred actions; log_tran_do_postpone writes and forces a LOG_COMMIT_WITH_POSTPONE record before running the postpones (Chapter 9).

// log_tran_do_postpone — src/transaction/log_manager.c
if (LSA_ISNULL (&tdes->posp_nxlsa)) return; /* <- nothing to postpone */
assert (tdes->topops.last < 0);
log_append_commit_postpone (thread_p, tdes, &tdes->posp_nxlsa); /* COMMIT_WITH_POSTPONE + flush */
if (tdes->m_log_postpone_cache.do_postpone (*thread_p, tdes->posp_nxlsa))
{ perfmon_inc_stat (..., PSTAT_TRAN_NUM_PPCACHE_HITS); return; } /* cache fast-path */
log_do_postpone (thread_p, tdes, &tdes->posp_nxlsa); /* scan forward, run LOG_POSTPONE records */

log_append_commit_postpone sets state = TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE and forces immediately, so the marker is durable before postpones run (recovery can resume after a crash). The plain LOG_COMMIT written later closes the transaction.

8.4 Forcing durability — log_append_commit_log and the WAL force


log_append_commit_log is a thin shell over log_append_donetime_internal, the single place both commit and abort build their donetime record:

// log_append_donetime_internal — src/transaction/log_manager.c
node = prior_lsa_alloc_and_copy_data (thread_p, iscommitted, RV_NOT_DEFINED, ...); /* type = LOG_COMMIT/ABORT */
if (node == NULL) return; /* <- alloc failed: eot_lsa stays NULL */
((LOG_REC_DONETIME *) node->data_header)->at_time = time (NULL); /* the only payload field */
lsa = (with_lock == LOG_PRIOR_LSA_WITH_LOCK) /* caller holds prior mutex, else take it */
? prior_lsa_next_record_with_lock (...) : prior_lsa_next_record (thread_p, node, tdes);
LSA_COPY (eot_lsa, &lsa); /* hand the commit LSA back to the caller */

iscommitted doubles as the record type; with_lock lets the replication route reuse the mutex it already holds. Then log_change_tran_as_completed performs the durability force:

// log_change_tran_as_completed — src/transaction/log_manager.c
if (iscommitted == LOG_COMMIT)
{ tdes->state = TRAN_UNACTIVE_COMMITTED;
logpb_flush_pages (thread_p, lsa); } /* <- COMMIT: always force up to commit LSA */
else {
tdes->state = TRAN_UNACTIVE_ABORTED; /* SERVER_MODE only: */
if (BO_IS_SERVER_RESTARTED () && VOLATILE_ACCESS (log_Gl.run_nxchkpt_atpageid, INT64) == NULL_PAGEID)
logpb_flush_pages (thread_p, lsa); /* <- ABORT: force only if checkpoint in flight */
}

Invariant — a committed transaction’s commit record is on stable storage before the client is told “committed”. logpb_flush_pages (thread_p, lsa) is the group-commit demand from Chapter 7: it pushes the committer onto the flush daemon’s waiter set and blocks until nxio_lsa >= commit_lsa (many committers share one fsync). This is the only point in the commit path that can block on I/O. The abort branch is asymmetric by design: a lost un-flushed LOG_ABORT is harmless (recovery re-undoes anyway), so it forces only when a checkpoint is in flight on a restarted server, lest that checkpoint reclaim an archive recovery still needs.
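
The "many committers share one fsync" behavior can be illustrated with a deterministic, single-threaded toy. It is a sketch, not the CUBRID mechanism: `ToyGroupCommit` and its members are hypothetical, LSAs are plain integers, the flush target of a record is taken to be the log position just past it, and the real path blocks threads on a flush daemon rather than calling a flush function directly.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Toy model of a group-commit force. All names are illustrative only.
struct ToyGroupCommit {
    std::int64_t nxio_lsa = 0;          // everything below this position is on disk
    std::int64_t append_lsa = 0;        // end of the in-memory log
    int fsync_count = 0;                // how many physical syncs were issued
    std::vector<std::int64_t> waiters;  // flush targets of committers still waiting

    // A committer appends its commit record and registers as a waiter.
    // Returns the position the log must be flushed up to.
    std::int64_t append_commit(std::int64_t record_size) {
        append_lsa += record_size;
        waiters.push_back(append_lsa);
        return append_lsa;
    }

    // One flush pass: a single sync covers every record appended so far.
    // Returns how many waiters were released by this pass.
    int flush_once() {
        if (nxio_lsa == append_lsa) return 0;  // nothing new to write
        nxio_lsa = append_lsa;
        ++fsync_count;
        auto released = std::remove_if(waiters.begin(), waiters.end(),
                                       [this](std::int64_t lsa) { return nxio_lsa >= lsa; });
        int n = static_cast<int>(waiters.end() - released);
        waiters.erase(released, waiters.end());
        return n;
    }

    // The durability test a committer waits on.
    bool is_durable(std::int64_t commit_lsa) const { return nxio_lsa >= commit_lsa; }
};
```

Three calls to `append_commit` followed by one `flush_once` release all three waiters while `fsync_count` stays at one, which is the whole point of batching the force.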

8.5 log_complete — final state transition and next-trid


Both log_commit and log_abort finish via log_complete, passing two enum flags: whether the EOT record has already been written and whether to recycle the trid. Commit passes LOG_ALREADY_WROTE_EOT_LOG (the record is already forced, so the else arm only asserts); abort passes LOG_NEED_TO_WRITE_EOT_LOG (the if arm appends LOG_ABORT).

// log_complete — src/transaction/log_manager.c
if (LSA_ISNULL (&tdes->tail_lsa)) { /* read-only: set COMMITTED/ABORTED; recycle or clear tdes */ }
else {
if (wrote_eot_log == LOG_NEED_TO_WRITE_EOT_LOG) /* <- abort: write LOG_ABORT now */
{ log_append_abort_log (...); log_change_tran_as_completed (..., LOG_ABORT, &abort_lsa); }
else assert (iscommitted == LOG_COMMIT && state == TRAN_UNACTIVE_COMMITTED); /* commit already wrote it */
tdes->unlock_global_oldest_visible_mvccid (); /* always */
if (iscommitted == LOG_COMMIT) log_Gl.mvcc_table.reset_transaction_lowest_active (...); /* commit only */
if (get_newtrid == LOG_NEED_NEWTRID) logtb_get_new_tran_id (thread_p, tdes);
}
if (LOG_ISCHECKPOINT_TIME ()) log_wakeup_checkpoint_daemon (); /* or logpb_checkpoint in SA mode */

Branch fan-out:

  1. tail_lsa NULL. No EOT record; set state, then recycle the trid (LOG_NEED_NEWTRID) or hard-clear (logtb_clear_tdes).
  2. tail_lsa non-NULL, abort. Emit the abort record, force per §8.4, set state.
  3. tail_lsa non-NULL, commit. Assert the record was already written and state is TRAN_UNACTIVE_COMMITTED.
  4. MVCC unblocking (data path). unlock_global_oldest_visible_mvccid always runs; reset_transaction_lowest_active only on commit (cubrid-mvcc.md).
  5. next-trid. logtb_get_new_tran_id recycles the index with a fresh trid — the donetime record is CUBRID’s EOT marker, not a distinct type.
  6. checkpoint kick. If the append crossed the threshold, wake the checkpoint daemon or run logpb_checkpoint inline (SA mode).
flowchart TD
  A["log_commit(tran_index, retain_lock)"] --> B{"tdes NULL?"}
  B -->|yes| Z0["return TRAN_UNACTIVE_UNKNOWN"]
  B -->|no| C{"active or 2PC-prepared?"}
  C -->|no, restarted| Z1["no-op, return tdes.state"]
  C -->|yes| D{"topops.last >= 0?"}
  D -->|yes| D1["assert false\nattach_to_outer until < 0"]
  D -->|no| E{"distributed 2PC?"}
  D1 --> E
  E -->|yes| E1["log_2pc_commit\nsee cubrid-2pc.md"]
  E -->|no| F["log_commit_local"]
  subgraph LOCAL["log_commit_local — strict order"]
    direction TB
    F --> G["tx_lob_locator_clear\nlogtb_complete_mvcc\nboth LOG before state change"]
    G --> H["state = WILL_COMMIT\nundo_nxlsa = NULL"]
    H --> I{"tail_lsa NULL?"}
    I -->|yes, read-only| J["unlock unless retained\nstate = COMMITTED"]
    I -->|no| K["log_tran_do_postpone"]
    K --> K1{"posp_nxlsa set?"}
    K1 -->|yes| K2["append COMMIT_WITH_POSTPONE\nforce, then run LOG_POSTPONE"]
    K1 -->|no| L
    K2 --> L["log_append_commit_log\n+repl info if HA"]
    L --> M["lock_unlock_all\nunless retain_lock"]
    M --> N["log_change_tran_as_completed\nstate = COMMITTED\nlogpb_flush_pages = group-commit force"]
  end
  E1 --> O["log_complete"]
  J --> O
  N --> O
  O --> P["MVCC unblock · recycle trid\nkick checkpoint if due"]
  P --> Z2["return state"]

Figure 8-2 — Commit control flow. The only point that blocks on I/O is logpb_flush_pages inside log_change_tran_as_completed (the group-commit force of §8.4); everything before it is bookkeeping. Note the two ordering invariants the diagram encodes: the records that tx_lob_locator_clear and logtb_complete_mvcc emit are written before the state moves to WILL_COMMIT, and an empty tail_lsa short-circuits to a no-record read-only commit.

8.6 log_abort and log_abort_local — undo before the boundary


log_abort mirrors log_commit’s entry validation with two extra guards, then routes to log_abort_local -> log_complete.

// log_abort (excerpt) — src/transaction/log_manager.c
if (LOG_HAS_LOGGING_BEEN_IGNORED ())
{ er_set (... ER_LOG_CORRUPTED_DB_DUE_NOLOGGING ...); return tdes->state; } /* <- no log to undo */
if (!LOG_ISTRAN_ACTIVE (tdes) && !LOG_ISTRAN_2PC_PREPARE (tdes))
return tdes->state; /* <- nothing to abort */
// topops.last >= 0 -> same assert+attach salvage as commit
state = log_abort_local (thread_p, tdes, true);
state = log_complete (thread_p, tdes, LOG_ABORT, LOG_NEED_NEWTRID, LOG_NEED_TO_WRITE_EOT_LOG);

The extra LOG_HAS_LOGGING_BEEN_IGNORED guard is the key difference from commit: with no undo records, rollback is impossible and the database is declared corrupted. log_abort_local differs from log_commit_local in ordering: it sets TRAN_UNACTIVE_ABORTED first, then does the work.

// log_abort_local — src/transaction/log_manager.c
tdes->state = TRAN_UNACTIVE_ABORTED; /* <- set early; rollback logs CLRs, allowed */
if (!LSA_ISNULL (&tdes->tail_lsa)) /* <- transaction touched data */
{ log_rollback (thread_p, tdes, NULL); /* §8.6.1 — the undo pass */
log_cleanup_modified_class_list (thread_p, tdes, NULL, true, true); /* + free first_save_entry */ }
/* both branches: */ logtb_complete_mvcc (thread_p, tdes, false); /* committed=false -> discard mvccid */
lock_unlock_all (thread_p); /* <- always release; abort never retains */
tx_lob_locator_clear (thread_p, tdes, false, NULL);
return tdes->state;
flowchart TD
  A["log_abort(tran_index)"] --> B{"logging been ignored?"}
  B -->|yes| Z0["ER_LOG_CORRUPTED_DB_DUE_NOLOGGING\nreturn — no undo records exist"]
  B -->|no| C{"active or 2PC-prepared?"}
  C -->|no| Z1["nothing to abort, return"]
  C -->|yes| D["topops salvage\nassert + attach_to_outer"]
  D --> E["log_abort_local"]
  subgraph LOCAL["log_abort_local — state set FIRST"]
    direction TB
    E --> F["state = ABORTED\nset early: rollback may log CLRs"]
    F --> G{"tail_lsa NULL?"}
    G -->|no| H["log_rollback\nwalk prev_tranlsa backward,\nappend compensating CLRs"]
    H --> I["log_cleanup_modified_class_list"]
    G -->|yes| J
    I --> J["logtb_complete_mvcc false\ndiscard mvccid"]
    J --> K["lock_unlock_all\nalways — abort never retains"]
  end
  K --> L["log_complete LOG_ABORT"]
  L --> M{"tail_lsa non-NULL?"}
  M -->|yes| N["log_append_abort_log\nlog_change_tran_as_completed\nforce only if checkpoint in flight"]
  M -->|no| O["set ABORTED, no EOT record"]
  N --> P["recycle trid"]
  O --> P
  P --> Z2["return ABORTED"]

Figure 8-3 — Abort control flow. The mirror image of Figure 8-2 with two deliberate asymmetries. (1) State first: log_abort_local sets ABORTED before doing the work, because the rollback pass itself logs compensating records (CLRs) and those must be allowed after the state flips — the opposite of commit, where logging after WILL_COMMIT is forbidden. (2) Lazy force: a lost un-flushed LOG_ABORT is harmless (recovery re-undoes anyway), so the durability force fires only when a checkpoint is in flight on a restarted server.

Setting the state early is safe in abort, though forbidden in commit, because rollback logs compensation log records (CLRs), which are redo-only records (Chapter 9) expected while the transaction is already aborted. logtb_complete_mvcc(..., false) discards the MVCCID, and log_abort_local ignores retain_lock; an abort always calls lock_unlock_all.

8.6.1 log_rollback — walking prev_tranlsa backward


log_rollback walks the chain backward from tdes->undo_nxlsa, re-applying each undo image. Per-record-type CLR generation is Chapter 9; the branch that matters here is the cursor discipline.

// log_rollback (control skeleton) — src/transaction/log_manager.c
LSA_COPY (&prev_tranlsa, &tdes->undo_nxlsa); /* start cursor */
while (!LSA_ISNULL (&prev_tranlsa) && !isdone)
{
logpb_fetch_page (...); /* fatal on error */
log_rec = LOG_GET_LOG_RECORD_HEADER (log_pgptr, &log_lsa);
LSA_COPY (&prev_tranlsa, &log_rec->prev_tranlsa); /* advance cursor BEFORE undo */
LSA_COPY (&tdes->undo_nxlsa, &prev_tranlsa); /* persist cursor (CLR may move it) */
switch (log_rec->type) { /* ... see Chapter 9 ... */ }
}

Invariant — the undo cursor is advanced before the undo is applied. Both prev_tranlsa and tdes->undo_nxlsa move to log_rec->prev_tranlsa before the undo runs, because applying an undo logs a chained CLR — a not-yet-advanced cursor could re-undo the record or follow the CLR’s own link. upto_lsa (NULL here, non-NULL from log_rollback_to_savepoint) stops a partial rollback early; xlogtb_reset_wait_msecs(INFINITE_WAIT) blocks lock timeouts. Recovery-time replay is in cubrid-recovery-manager.md.
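
The cursor discipline can be sketched with a toy single-transaction log. Everything here is an assumption made for illustration: `ToyRecord`, `ToyTran`, and `toy_rollback` are hypothetical, an LSA is just a vector index with -1 as NULL, and the toy covers only a full rollback of a chain that contains no earlier CLRs (the skip logic for those is Chapter 9 territory).

```cpp
#include <vector>

// Toy log record: only the per-transaction back link matters here.
struct ToyRecord {
    int prev_tranlsa;  // index of the previous record of this transaction, -1 = NULL
    bool is_clr;       // true for a compensation record appended by rollback
};

struct ToyTran {
    std::vector<ToyRecord> log;  // toy: holds one transaction's records; index = LSA
    int tail_lsa = -1;
    int undo_nxlsa = -1;

    // Appends a record chained to the current tail and returns its LSA.
    int append(bool is_clr) {
        log.push_back({tail_lsa, is_clr});
        tail_lsa = static_cast<int>(log.size()) - 1;
        if (!is_clr) undo_nxlsa = tail_lsa;  // ordinary records become the next undo target
        return tail_lsa;
    }
};

// Walks the chain backward from undo_nxlsa and returns the LSAs undone, in order.
std::vector<int> toy_rollback(ToyTran& t) {
    std::vector<int> undone;
    int cursor = t.undo_nxlsa;
    while (cursor != -1) {
        int current = cursor;
        cursor = t.log[current].prev_tranlsa;  // advance the cursor BEFORE the undo
        t.undo_nxlsa = cursor;                 // persist the cursor in the descriptor
        undone.push_back(current);             // "apply" the undo image
        t.append(true);                        // the undo itself logs a CLR at the tail
    }
    return undone;
}
```

Because the cursor already points at the older record when the CLR is appended, the loop never visits the CLRs it creates: three data records are undone newest-first and three CLRs land at the tail.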

8.7 Restart-driven variants — log_abort_by_tdes and log_abort_all_active_transaction


At shutdown or crash recovery, transactions must be aborted by a thread other than their owner. log_abort_by_tdes rebinds the executing thread to the victim’s tran_index so every LOG_FIND_THREAD_TRAN_INDEX lookup inside log_abort resolves correctly, then reuses the ordinary path:

// log_abort_by_tdes — src/transaction/log_manager.c (SERVER_MODE)
thread_p->tran_index = tdes->tran_index; /* impersonate the victim's index */
pthread_mutex_unlock (&thread_p->tran_index_lock);
(void) log_abort (thread_p, tdes->tran_index); /* reuse the normal abort path */

log_abort_all_active_transaction is the shutdown sweep: in server mode it loops over every index, dispatching an async abort onto each active transaction and re-looping until no worker threads remain. The dispatch is not direct — css_push_external_task queues log_abort_task_execute, a thin wrapper that calls log_abort_by_tdes(&thread_ref, &tdes):

// log_abort_all_active_transaction (server-mode essence) — src/transaction/log_manager.c
if (already_called) return; already_called = 1; /* <- idempotent static guard */
loop: repeat_loop = false;
for (i = 0; i < log_Gl.trantable.num_total_indices; i++)
if (i != LOG_SYSTEM_TRAN_INDEX && (tdes = LOG_FIND_TDES (i)) && tdes->trid != NULL_TRANID)
{ if (css_count_transaction_worker_threads (...) > 0) repeat_loop = true; /* still busy */
else if (LOG_ISTRAN_ACTIVE (tdes) && !abort_thread_running[i])
{ /* exec_f = std::bind (log_abort_task_execute, _1, std::ref (*tdes)); */
css_push_external_task (...); /* -> log_abort_task_execute -> log_abort_by_tdes */
abort_thread_running[i] = 1; repeat_loop = true; } }
if (repeat_loop) { thread_sleep (50);
if (css_is_shutdown_timeout_expired ()) _exit (0); goto loop; } /* <- give up: hard exit */

already_called ensures the sweep runs only once; LOG_SYSTEM_TRAN_INDEX is skipped; a transaction with live workers forces another pass; an expired shutdown timeout ends the process with _exit(0). The SA_MODE branch walks the table and calls log_abort synchronously.

  1. Boundary records reuse the whole pipeline: built/attached like data records (Chapters 3-4); the one-field log_rec_donetime’s LSA is the durable commit point.
  2. log_commit routes; log_commit_local works: postpone, append, unlock, force; log_complete only finalizes state, since the record was already written (LOG_ALREADY_WROTE_EOT_LOG).
  3. Ordering protects against checkpoint-during-commit: anything that logs runs before TRAN_UNACTIVE_WILL_COMMIT.
  4. Commit forces, abort usually does not: logpb_flush_pages always for commit (group-commit, Chapter 7), for abort only with a checkpoint in flight on a restarted server.
  5. Abort sets state first, then undoes: log_rollback advances the cursor before each undo so CLRs do not re-enter the chain.
  6. retain_lock is a commit-only knob: abort always unlocks.
  7. Restart variants re-bind, not re-implement: log_abort_by_tdes impersonates tran_index and calls log_abort; log_abort_all_active_transaction dispatches log_abort_task_execute idempotently until workers drain or the timeout forces _exit(0).

Chapter 9: System Operations Postpone and Compensation


A system operation (sysop, or “top operation”) is a sub-transaction the server commits or aborts independently of the enclosing user transaction — index splits, file allocation, overflow-record management. This chapter traces how sysops, postponed actions, and compensation records reuse the prior-list pipeline (Chapters 3–5) while carrying their own logical-undo payloads. For WAL/postpone/ARIES theory see the companion cubrid-log-manager.md (“System operations”, “Postpone & compensation”). Every family calls prior_lsa_alloc_and_copy_data + prior_lsa_next_record; the novelty is which header is stamped and how the tdes sysop stack and the LSA chains (undo_nxlsa, posp_nxlsa, per-level posp_lsa) mutate around it.

log_sysop_start appends nothing; it pushes a frame onto an in-memory stack in the transaction descriptor. The table lists only the sysop/postpone-relevant log_tdes fields — the full ~82-field struct is in Chapter 2.

| Field | Role | Why it exists |
| --- | --- | --- |
| topops (LOG_TOPOPS_STACK) | Nesting stack of active sysops | last is current depth, max the allocated size |
| topop_lsa | LSA of the most-recent sysop’s parent | Fast “are we in a sysop” probe for appenders |
| tail_lsa | LSA of this tran’s last appended record | High-water mark a sysop end compares to detect “no change” |
| undo_nxlsa | Next record to undo | Rewound by a CLR so undo skips the already-undone record |
| posp_nxlsa | First transaction-level postpone record | Seeded by a LOG_POSTPONE appended outside any sysop |
| savept_lsa | LSA of last savepoint | Chains savepoints; target of log_abort_partial |
| tail_topresult_lsa | LSA of last partial commit/abort | Stamped into every sysop-end as prv_topresult_lsa |
| state (TRAN_STATE) | Transaction state | Gates which sysop-end arms are legal |
| m_log_postpone_cache | Cached postpone redo + LSAs | Lets do_postpone replay from memory |
| rcv.sysop_start_postpone_lsa | Recovery anchor for an in-flight sysop postpone | Resume a crashed sysop’s postpone phase |
| rcv.tran_start_postpone_lsa | Recovery anchor for a tran-level postpone | Resume the transaction postpone phase |
| rcv.atomic_sysop_start_lsa | Recovery anchor for an atomic sysop | Roll an interrupted atomic sysop back as a unit |

(The rcv.* members live in the embedded log_rcv_tdes.) Each stack frame is a log_topops_addresses carrying two LSAs, read through three accessor macros:

// log_topops_addresses -- src/transaction/log_impl.h
struct log_topops_addresses
{
LOG_LSA lastparent_lsa; /* The last address of the parent transaction. This is needed for undo of the top
* system action */
LOG_LSA posp_lsa; /* The first address of a postpone log record for top system operation. We add this
* since it is reset during recovery to the last reference postpone address. */
};
// LOG_TDES_LAST_SYSOP* -- src/transaction/log_manager.c
#define LOG_TDES_LAST_SYSOP(tdes) (&(tdes)->topops.stack[(tdes)->topops.last])
#define LOG_TDES_LAST_SYSOP_PARENT_LSA(tdes) (&LOG_TDES_LAST_SYSOP(tdes)->lastparent_lsa)
#define LOG_TDES_LAST_SYSOP_POSP_LSA(tdes) (&LOG_TDES_LAST_SYSOP(tdes)->posp_lsa)
flowchart LR
  subgraph tdes["log_tdes"]
    tail["tail_lsa"]
    posp["posp_nxlsa (tran level)"]
    undo["undo_nxlsa"]
    stk["topops.stack[]"]
  end
  stk --> f0["[0] lastparent_lsa, posp_lsa"]
  stk --> fl["[last] lastparent_lsa, posp_lsa"]

Figure 9-1: the sysop stack and the LSA anchors it threads through log_tdes.

Invariant — the parent LSA bounds the sysop body. Every record a sysop appends has tail_lsa > LOG_TDES_LAST_SYSOP_PARENT_LSA(tdes); end functions detect an empty sysop by LSA_LE(&tdes->tail_lsa, parent_lsa). If violated, an end would log a phantom record or skip a needed end marker, desyncing log nesting from the stack. Enforced in log_sysop_commit_internal and log_sysop_abort.
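
A minimal sketch of the empty-sysop test follows. It assumes simplified types: `ToyLsa`, `ToyFrame`, `ToyTdes`, and `lsa_le` are hypothetical stand-ins (a pageid of -1 plays the NULL LSA, and a `std::vector` replaces the `topops.stack[]` array with its `last` index), not the CUBRID definitions.

```cpp
#include <cstdint>
#include <vector>

// Toy LSA: pageid -1 stands in for the NULL LSA.
struct ToyLsa {
    std::int64_t pageid;
    std::int64_t offset;
    bool is_null() const { return pageid == -1; }
};
const ToyLsa kNullLsa = {-1, -1};

// a <= b, comparing pageid first and offset second.
bool lsa_le(const ToyLsa& a, const ToyLsa& b) {
    return a.pageid < b.pageid || (a.pageid == b.pageid && a.offset <= b.offset);
}

// One stack frame: the parent's tail at sysop start, plus the postpone anchor.
struct ToyFrame {
    ToyLsa lastparent_lsa;
    ToyLsa posp_lsa;
};

struct ToyTdes {
    ToyLsa tail_lsa = kNullLsa;
    std::vector<ToyFrame> topops;  // back() plays the role of stack[topops.last]

    // Push a frame: snapshot tail_lsa as the parent LSA, no postpone yet.
    void sysop_start() { topops.push_back({tail_lsa, kNullLsa}); }

    // Any appended record moves the transaction tail forward.
    void append_record(ToyLsa lsa) { tail_lsa = lsa; }

    // Empty when nothing was ever logged, or nothing was logged past the parent LSA.
    bool sysop_is_empty() const {
        return tail_lsa.is_null() || lsa_le(tail_lsa, topops.back().lastparent_lsa);
    }

    void sysop_end() { topops.pop_back(); }
};
```

A freshly started sysop reports empty until `append_record` moves `tail_lsa` past the snapshot taken in `sysop_start`, which mirrors how the end functions decide whether an end marker is needed.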

9.2 log_sysop_start and log_sysop_start_atomic

// log_sysop_start -- src/transaction/log_manager.c
if (tdes->topops.max == 0 || (tdes->topops.last + 1) >= tdes->topops.max) /* first-alloc OR full */
if (logtb_realloc_topops_stack (tdes, 1) == NULL) /* OOM: bail, stack unchanged */
{ assert (false); tdes->unlock_topop (); return; }
// ... condensed: VACUUM_IS_THREAD_VACUUM diagnostic logging only ...
tdes->topops.last++; /* <- push */
LSA_COPY (&tdes->topops.stack[tdes->topops.last].lastparent_lsa, &tdes->tail_lsa);
LSA_COPY (&tdes->topop_lsa, &tdes->tail_lsa);
LSA_SET_NULL (&tdes->topops.stack[tdes->topops.last].posp_lsa); /* <- no postpone yet */

The topops.max == 0 clause handles a transaction’s first sysop (no stack yet) and the second clause is grow-when-full. Branches: (1) tdes == NULL -> ER_LOG_UNKNOWN_TRANINDEX, fatal early-return; (2) realloc, on OOM unlock+return without pushing; (3) VACUUM diagnostics only; (4) happy path snapshots tail_lsa into lastparent_lsa and nulls posp_lsa.

log_sysop_start_atomic wraps it, then ensures one LOG_SYSOP_ATOMIC_START marker exists so recovery can roll the whole atomic sysop back as a unit:

// log_sysop_start_atomic -- src/transaction/log_manager.c
log_sysop_start (thread_p); /* ... re-fetch tdes, guard ... */
if (LSA_ISNULL (&tdes->rcv.atomic_sysop_start_lsa)) /* first atomic level: emit marker */
{ node = prior_lsa_alloc_and_copy_data (thread_p, LOG_SYSOP_ATOMIC_START, ...);
(void) prior_lsa_next_record (thread_p, node, tdes); }
else
{ assert (tdes->topops.last > 0); /* nested: parent already marked */
assert (LSA_ISNULL (&tdes->rcv.sysop_start_postpone_lsa)); }

The else arm fires for a nested atomic sysop: the outer level owns atomic_sysop_start_lsa, so the inner sysop inherits atomicity with no second marker. The asserts encode “no atomic start while a sysop-postpone runs.”

All non-abort ends route through log_sysop_commit_internal, which stamps a log_rec_sysop_end (comments verbatim from source):

// log_rec_sysop_end -- src/transaction/log_record.hpp
struct log_rec_sysop_end
{
LOG_LSA lastparent_lsa; /* last address before the top action */
LOG_LSA prv_topresult_lsa; /* previous top action (either, partial abort or partial commit) address */
LOG_SYSOP_END_TYPE type; /* end system op type */
const VFID *vfid; /* File where the page belong. ... used to get TDE information. */
union /* other info based on type */
{
LOG_REC_UNDO undo; /* undo data for logical undo */
LOG_REC_MVCC_UNDO mvcc_undo; /* undo data for logical undo of MVCC operation */
LOG_LSA compensate_lsa; /* compensate lsa for logical compensate */
struct { LOG_LSA postpone_lsa; bool is_sysop_postpone; } run_postpone; /* run postpone info */
};
};

| Field | Role | Why it exists |
| --- | --- | --- |
| lastparent_lsa | Where the sysop body starts | Recovery undo of the sysop stops here; copied from the frame |
| prv_topresult_lsa | Previous partial commit/abort LSA | Chains top results so recovery walks them backward |
| type | Which union arm is valid | Dispatch key for append and recovery |
| vfid | File of the affected page | TDE key lookup; doubles as MVCC vacuum info |
| undo | LOG_REC_UNDO payload for LOGICAL_UNDO | rcvindex + length for logical undo replay |
| mvcc_undo | LOG_REC_MVCC_UNDO payload for LOGICAL_MVCC_UNDO | Adds mvccid and vacuum_info |
| compensate_lsa | Undo-skip target for LOGICAL_COMPENSATE | Next-undo LSA after this logical compensation |
| run_postpone.postpone_lsa | Original LOG_POSTPONE LSA | Links the run-postpone to its source |
| run_postpone.is_sysop_postpone | Sysop vs. tran postpone flag | Recovery must know which postpone phase produced this |

Same bytes, six interpretations. Each wrapper (§9.4) fills exactly the union member listed for its type:

| type | Active member | Produced by |
| --- | --- | --- |
| LOG_SYSOP_END_COMMIT | (none) | log_sysop_commit |
| LOG_SYSOP_END_ABORT | (none) | log_sysop_abort |
| LOG_SYSOP_END_LOGICAL_UNDO | undo | log_sysop_end_logical_undo (non-MVCC) |
| LOG_SYSOP_END_LOGICAL_MVCC_UNDO | mvcc_undo | log_sysop_end_logical_undo (MVCC) |
| LOG_SYSOP_END_LOGICAL_COMPENSATE | compensate_lsa | log_sysop_end_logical_compensate |
| LOG_SYSOP_END_LOGICAL_RUN_POSTPONE | run_postpone | log_sysop_end_logical_run_postpone |

9.4 log_sysop_commit_internal — branch-complete


The caller sets log_record->type; commit_internal validates state-vs-type, runs pending postpone, appends the end (Fig 9-2).

flowchart TD
  A["commit_internal(log_record)"] --> B{"tdes == NULL?"}
  B -->|yes| Z["assert_release; return"]
  B -->|no| C{"empty sysop\nAND COMMIT or no_logging?"}
  C -->|yes| D["assert posp_lsa NULL\nno-op end"]
  C -->|no| F{"switch type"}
  F -->|RUN_POSTPONE| G["assert *_COMMITTED_WITH_POSTPONE\nset is_sysop_postpone"]
  F -->|COMPENSATE| H["assert aborting OR rv-finish"]
  F -->|UNDO / MVCC_UNDO| I["no state restriction"]
  F -->|COMMIT| J["assert not in postpone phase\nunless rv-finish"]
  G --> K
  H --> K
  I --> K
  J --> K["fill lastparent_lsa, prv_topresult_lsa"]
  K --> M["do_postpone -> append_sysop_end -> tail_topresult_lsa = tail_lsa"]
  D --> P["log_sysop_end_final"]
  M --> P

Figure 9-2: every branch of log_sysop_commit_internal.

// log_sysop_commit_internal -- src/transaction/log_manager.c
assert (log_record->type != LOG_SYSOP_END_ABORT); /* aborts never come here */
if ((LSA_ISNULL (&tdes->tail_lsa) || LSA_LE (&tdes->tail_lsa, LOG_TDES_LAST_SYSOP_PARENT_LSA (tdes)))
&& (log_record->type == LOG_SYSOP_END_COMMIT || log_No_logging))
assert (LSA_ISNULL (&LOG_TDES_LAST_SYSOP (tdes)->posp_lsa)); /* empty COMMIT: nothing to log */
else
{ if (log_record->type == LOG_SYSOP_END_LOGICAL_RUN_POSTPONE)
{ assert (tdes->state == TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE
|| tdes->state == TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE);
log_record->run_postpone.is_sysop_postpone = /* recovery needs which phase */
(tdes->state == TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE && !is_rv_finish_postpone); }
// ... condensed: COMPENSATE / LOGICAL_UNDO / COMMIT state asserts (see Fig 9-2) ...
log_record->lastparent_lsa = *LOG_TDES_LAST_SYSOP_PARENT_LSA (tdes);
log_record->prv_topresult_lsa = tdes->tail_topresult_lsa;
log_sysop_do_postpone (thread_p, tdes, log_record, data_size, data); /* run postpones */
log_append_sysop_end (thread_p, tdes, log_record, data_size, data); /* emit end */
LSA_COPY (&tdes->tail_topresult_lsa, &tdes->tail_lsa); }
log_sysop_end_final (thread_p, tdes); /* always pops the stack */

Invariant — the end-record type must agree with the transaction state. LOGICAL_RUN_POSTPONE is legal only in a *_COMMITTED_WITH_POSTPONE state; LOGICAL_COMPENSATE only while aborting (or during recovery’s postpone-finish); a plain COMMIT never during a postpone phase, except on recovery’s is_rv_finish_postpone re-entry. If the invariant is violated, recovery could run a postpone twice or skip an undo, corrupting the page.
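
The type-versus-state rule can be written as a single predicate. This is a sketch under assumptions: `ToyState`, `ToyEndType`, and `end_type_is_legal` are hypothetical names, "while aborting" is modeled as a single ABORTED state, and the real code expresses these checks as asserts rather than a boolean function.

```cpp
// Reduced state set: only the states the rule distinguishes.
enum class ToyState { ACTIVE, ABORTED, COMMITTED_WITH_POSTPONE, TOPOPE_COMMITTED_WITH_POSTPONE };

// The non-abort end types that reach the commit-internal path.
enum class ToyEndType { COMMIT, LOGICAL_UNDO, LOGICAL_MVCC_UNDO, LOGICAL_COMPENSATE, LOGICAL_RUN_POSTPONE };

// True while either the transaction or a sysop is running its postpones.
bool in_postpone_phase(ToyState s) {
    return s == ToyState::COMMITTED_WITH_POSTPONE || s == ToyState::TOPOPE_COMMITTED_WITH_POSTPONE;
}

// Returns whether an end record of 'type' is legal in 'state'.
bool end_type_is_legal(ToyEndType type, ToyState state, bool is_rv_finish_postpone) {
    switch (type) {
    case ToyEndType::LOGICAL_RUN_POSTPONE:
        return in_postpone_phase(state);                             // only during a postpone phase
    case ToyEndType::LOGICAL_COMPENSATE:
        return state == ToyState::ABORTED || is_rv_finish_postpone;  // aborting, or recovery finish
    case ToyEndType::COMMIT:
        return !in_postpone_phase(state) || is_rv_finish_postpone;   // never mid-postpone otherwise
    case ToyEndType::LOGICAL_UNDO:
    case ToyEndType::LOGICAL_MVCC_UNDO:
        return true;                                                 // no state restriction
    }
    return false;
}
```

For example, `end_type_is_legal(ToyEndType::COMMIT, ToyState::TOPOPE_COMMITTED_WITH_POSTPONE, false)` is false, while the same call with the recovery flag set is true.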

log_sysop_end_final runs on every path, so even empty/error paths decrement topops.last. The four logical wrappers pre-fill the union member from §9.3; only log_sysop_end_logical_run_postpone leaves is_sysop_postpone for commit_internal to derive from state.

9.5 log_sysop_abort — rollback then mark

Abort skips commit_internal; it rolls back and stamps an ABORT end directly:

// log_sysop_abort -- src/transaction/log_manager.c
if (LSA_ISNULL (&tdes->tail_lsa) || LSA_LE (&tdes->tail_lsa, &LOG_TDES_LAST_SYSOP (tdes)->lastparent_lsa))
{ /* No change: empty sysop, nothing to undo or log */ }
else
{ save_state = tdes->state;
tdes->state = TRAN_UNACTIVE_ABORTED; /* <- so compensation appends are legal */
log_rollback (thread_p, tdes, LOG_TDES_LAST_SYSOP_PARENT_LSA (tdes)); /* undo body, emits CLRs */
sysop_end.type = LOG_SYSOP_END_ABORT;
sysop_end.lastparent_lsa = *LOG_TDES_LAST_SYSOP_PARENT_LSA (tdes);
sysop_end.prv_topresult_lsa = tdes->tail_topresult_lsa;
log_append_sysop_end (thread_p, tdes, &sysop_end, 0, NULL);
LSA_COPY (&tdes->tail_topresult_lsa, &tdes->tail_lsa);
tdes->state = save_state; } /* <- restore: sysop abort != tran abort */
log_sysop_end_final (thread_p, tdes);

The temporary state = TRAN_UNACTIVE_ABORTED is load-bearing: it lets log_rollback append CLRs (§9.8); the original state is then restored so the outer transaction is unaffected.

9.6 log_append_postpone — deferring an action to commit

A LOG_POSTPONE record captures a redo-only action that is deferred: it is not applied when it is logged, but replayed after commit.

flowchart TD
  A["log_append_postpone"] --> B{"log_No_logging?"}
  B -->|yes| C["run redofun NOW; return"]
  B -->|no| E{"skipredo OR\nno sysop AND not active/aborted?"}
  E -->|yes| F["run redofun NOW;\nif !skipredo append_redo_data; return"]
  E -->|no| G{"tail_lsa NULL or\nbefore crash point?"}
  G -->|yes| H["append LOG_DUMMY_HEAD_POSTPONE"]
  G -->|no| J
  H --> J["alloc LOG_POSTPONE; cache redo + start_lsa"]
  J --> N{"in sysop?"}
  N -->|yes, posp_lsa NULL| O["frame.posp_lsa = tail_lsa"]
  N -->|no, posp_nxlsa NULL| P["posp_nxlsa = tail_lsa"]

Figure 9-3: log_append_postpone branches.

Two escape hatches run the redo synchronously: log_No_logging, or a postpone that cannot be deferred (Fig 9-3). Otherwise the record is appended, its redo data and start LSA are pushed into m_log_postpone_cache, and the matching anchor is seeded:

// log_append_postpone -- src/transaction/log_manager.c
node = prior_lsa_alloc_and_copy_data (thread_p, LOG_POSTPONE, rcvindex, addr, 0, NULL, length, (char *) data);
tdes->m_log_postpone_cache.add_redo_data (*node); /* save before node may be freed */
start_lsa = prior_lsa_next_record (thread_p, node, tdes);
tdes->m_log_postpone_cache.add_lsa (start_lsa);
if (tdes->topops.last >= 0) /* in sysop: seed frame anchor */
{ if (LSA_ISNULL (&tdes->topops.stack[tdes->topops.last].posp_lsa))
LSA_COPY (&tdes->topops.stack[tdes->topops.last].posp_lsa, &tdes->tail_lsa); }
else if (LSA_ISNULL (&tdes->posp_nxlsa)) /* tran level: seed tran anchor */
LSA_COPY (&tdes->posp_nxlsa, &tdes->tail_lsa);

Invariant — the first postpone seeds exactly one anchor. The earliest LOG_POSTPONE in a sysop level sets that frame’s posp_lsa; the earliest at transaction level sets posp_nxlsa. Later postpones leave it untouched (LSA_ISNULL guard). This anchor is the start of the postpone-replay scan; overwriting it would orphan earlier postpones.
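
A minimal sketch of the seed-once rule, assuming a simplified descriptor. The `toy_` types and functions are hypothetical and only mirror the field names in the text (`posp_nxlsa`, per-frame `posp_lsa`, `topops.last`); the real descriptor carries far more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NULL_PAGEID (-1)
#define TOY_MAX_SYSOPS 8

typedef struct { int64_t pageid; int offset; } toy_lsa;

typedef struct {
  toy_lsa posp_nxlsa;                    /* transaction-level anchor */
  toy_lsa sysop_posp_lsa[TOY_MAX_SYSOPS]; /* one anchor per sysop frame */
  int last;                              /* -1 when no sysop is open */
} toy_tdes;

static bool toy_lsa_is_null(const toy_lsa *lsa)
{
  return lsa->pageid == TOY_NULL_PAGEID;
}

void toy_tdes_init(toy_tdes *tdes)
{
  tdes->posp_nxlsa.pageid = TOY_NULL_PAGEID;
  tdes->posp_nxlsa.offset = 0;
  tdes->last = -1;
}

void toy_sysop_push(toy_tdes *tdes)
{
  assert(tdes->last + 1 < TOY_MAX_SYSOPS);
  tdes->last++;
  tdes->sysop_posp_lsa[tdes->last].pageid = TOY_NULL_PAGEID;
  tdes->sysop_posp_lsa[tdes->last].offset = 0;
}

/* Called after a postpone record was appended at record_lsa. */
void toy_note_postpone(toy_tdes *tdes, toy_lsa record_lsa)
{
  toy_lsa *anchor = (tdes->last >= 0) ? &tdes->sysop_posp_lsa[tdes->last]
                                      : &tdes->posp_nxlsa;
  if (toy_lsa_is_null(anchor)) {
    *anchor = record_lsa;   /* only the earliest postpone seeds the anchor */
  }
}
```

Later calls at the same level leave the anchor alone, so a replay scan that starts there still sees every postpone logged after it.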

9.7 The postpone phase: log_sysop_do_postpone, log_do_postpone, log_run_postpone_op

When a sysop with pending postpones ends, log_sysop_do_postpone writes a LOG_SYSOP_START_POSTPONE marker, then replays. Its header log_rec_sysop_start_postpone (log_record.hpp) stashes the entire end record so recovery can finish after a crash:

| Field | Role | Why it exists |
| --- | --- | --- |
| sysop_end (LOG_REC_SYSOP_END) | “log record used for end of system operation” | Lets log_sysop_end_recovery_postpone re-emit the correct end after a crash |
| posp_lsa | “address where the first postpone operation start” | Where the post-crash replay scan resumes |

// log_sysop_do_postpone -- src/transaction/log_manager.c
// ... condensed: save_state = tdes->state is captured on entry ...
if (LSA_ISNULL (LOG_TDES_LAST_SYSOP_POSP_LSA (tdes))) { return; } /* nothing to postpone */
sysop_start_postpone.sysop_end = *sysop_end;
sysop_start_postpone.posp_lsa = *LOG_TDES_LAST_SYSOP_POSP_LSA (tdes);
log_append_sysop_start_postpone (thread_p, tdes, &sysop_start_postpone, data_size, data);
if (tdes->m_log_postpone_cache.do_postpone (*thread_p, *(LOG_TDES_LAST_SYSOP_POSP_LSA (tdes))))
{ tdes->state = save_state; return; } /* fast path: replay from memory */
log_do_postpone (thread_p, tdes, LOG_TDES_LAST_SYSOP_POSP_LSA (tdes)); /* slow path: scan the log */

The transaction-level counterpart, log_append_commit_postpone, writes a LOG_COMMIT_WITH_POSTPONE whose header is log_rec_start_postpone (log_record.hpp), flips the state to TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE, and flushes so the commit is durable before postpones run:

| Field | Role | Why it exists |
| --- | --- | --- |
| posp_lsa | Address where the first transaction postpone op starts | Anchor the post-commit replay scan; reset during recovery to the last reference |
| at_time | “donetime. For the time-specific recovery” | Stamp the commit-postpone moment for point-in-time / time-specific recovery |

log_do_postpone is the slow-path forward scan; it skips nested-top bodies via log_get_next_nested_top and dispatches on log_rec->type. Only LOG_POSTPONE triggers replay; the start-marker group (LOG_COMMIT_WITH_POSTPONE[_OBSOLETE], LOG_SYSOP_START_POSTPONE, LOG_2PC_*) does LSA_SET_NULL(&forward_lsa) to end the loop; data/redo/CLR/savepoint arms are inert (already applied when logged); default is “bad log_rectype”. log_run_postpone_op reads the redo and runs it — and a page-spanning malloc failure is fatal, not a graceful return:

// log_run_postpone_op -- src/transaction/log_manager.c
LSA_COPY (&ref_lsa, log_lsa); /* remember the original postpone LSA */
// ... condensed: advance past LOG_RECORD_HEADER + LOG_REC_REDO header ...
redo = *((LOG_REC_REDO *) ((char *) log_pgptr->area + log_lsa->offset));
if (log_lsa->offset + redo.length < (int) LOGAREA_SIZE)
rcv_data = (char *) log_pgptr->area + log_lsa->offset; /* contiguous: point in place */
else
{ area = (char *) malloc (redo.length); /* spans pages: need contiguous copy */
if (area == NULL)
{ logpb_fatal_error (thread_p, true, ARG_FILE_LINE, "log_run_postpone_op"); return ER_FAILED; }
logpb_copy_from_log (thread_p, area, redo.length, log_lsa, log_pgptr); rcv_data = area; }
(void) log_execute_run_postpone (thread_p, &ref_lsa, &redo, rcv_data);
if (area != NULL) free_and_init (area);

ref_lsa lands in the run-postpone record (log_rec_run_postpone, log_record.hpp) so recovery knows which postpone already executed:

| Field | Role | Why it exists |
| --- | --- | --- |
| data (LOG_DATA) | “Location of recovery data” (rcvindex, vpid, offset) | Which page + redo function the postpone touched |
| ref_lsa | “Address of the original postpone record” | A second recovery pass matches this and skips the already-run postpone |
| length | “Length of redo data” | Bounds the redo copy |

The producer log_append_run_postpone asserts WILL_COMMIT or a *_COMMITTED_WITH_POSTPONE state, stamps the three fields, appends, and sets the page LSA — making the action idempotent on a second recovery pass.
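
The role of ref_lsa can be illustrated with a toy log in which LSAs are plain integers. This is only a sketch of the matching idea: the `toy_` names are hypothetical, and the quadratic scan here is for clarity, not a claim about how recovery actually searches the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_POSTPONE, TOY_RUN_POSTPONE } toy_type;

typedef struct {
  toy_type type;
  int64_t lsa;      /* where this record sits in the toy log */
  int64_t ref_lsa;  /* RUN_POSTPONE only: LSA of the postpone it executed */
} toy_rec;

/* True when some run-postpone record points back at postpone_lsa. */
bool toy_postpone_already_run(const toy_rec *log, size_t n, int64_t postpone_lsa)
{
  assert(log != NULL || n == 0);
  for (size_t i = 0; i < n; i++) {
    if (log[i].type == TOY_RUN_POSTPONE && log[i].ref_lsa == postpone_lsa) {
      return true;
    }
  }
  return false;
}

/* Counts postpones that still need to be executed. */
size_t toy_count_pending(const toy_rec *log, size_t n)
{
  size_t pending = 0;
  for (size_t i = 0; i < n; i++) {
    if (log[i].type == TOY_POSTPONE
        && !toy_postpone_already_run(log, n, log[i].lsa)) {
      pending++;
    }
  }
  return pending;
}
```

With two postpones and one run-postpone that references the first, only the second postpone remains pending.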

9.8 Compensation: log_append_compensate_internal and rewinding undo_nxlsa

A Compensation Log Record (CLR, LOG_COMPENSATE) records the redo of an undo, so an undo that was already applied is never applied a second time after a crash. Its header is log_rec_compensate (log_record.hpp):

| Field | Role | Why it exists |
| --- | --- | --- |
| data (LOG_DATA) | “Location of recovery data” (rcvindex, pageid, offset, volid) | Locates the page the compensating redo re-applies |
| undo_nxlsa | “Address of next log record to undo” | Recovery undo jumps here, skipping the compensated record |
| length | “Length of compensating data” | Bounds the redo payload |

// log_append_compensate_internal -- src/transaction/log_manager.c
node = prior_lsa_alloc_and_copy_data (thread_p, LOG_COMPENSATE, rcvindex, NULL, length, (char *) data, 0, NULL);
LSA_COPY (&prev_lsa, &tdes->undo_nxlsa); /* remember where we were */
compensate = (LOG_REC_COMPENSATE *) node->data_header;
// ... condensed: fill compensate->data {rcvindex, pageid, offset, volid}, length ...
if (undo_nxlsa != NULL) LSA_COPY (&compensate->undo_nxlsa, undo_nxlsa); /* explicit skip target */
else LSA_COPY (&compensate->undo_nxlsa, &prev_lsa); /* default: current link */
start_lsa = prior_lsa_next_record (thread_p, node, tdes);
if (pgptr != NULL) pgbuf_set_lsa (thread_p, pgptr, &start_lsa); /* TDE/page-LSA only when fixed */
LSA_COPY (&tdes->undo_nxlsa, &prev_lsa); /* <- rewind: undo continues from BEFORE this CLR */

Invariant — a CLR is redo-only and rewinds the undo cursor past itself. After the append, tdes->undo_nxlsa is reset to prev_lsa and the CLR’s own undo_nxlsa points at the next record needing undo. During recovery undo, reaching a CLR jumps to undo_nxlsa and never re-applies the undo it represents; skipping the rewind would double-undo the page. The pgptr != NULL guard handles a CLR logged in recovery when the page could not be fixed — TDE and pgbuf_set_lsa are skipped.
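
The skip behaviour can be shown with a toy undo walk in which LSAs are array indexes. The record layout and the `toy_undo_walk` helper are assumptions made for illustration; only the idea that a CLR redirects the walk through its undo_nxlsa comes from the text.

```c
#include <assert.h>

typedef enum { TOY_UNDOABLE, TOY_CLR } toy_kind;

typedef struct {
  toy_kind kind;
  int prev;     /* previous record of the same transaction, -1 at the start */
  int undo_nx;  /* CLR only: next record that still needs undo, -1 if none */
} toy_rec;

/* Walks backwards from tail and records which entries get undone.
   Returns the number of undone records written to undone[]. */
int toy_undo_walk(const toy_rec *log, int tail, int *undone, int cap)
{
  int count = 0;
  int cur = tail;

  while (cur >= 0) {
    if (log[cur].kind == TOY_CLR) {
      cur = log[cur].undo_nx;   /* jump over everything already compensated */
      continue;
    }
    assert(count < cap);
    undone[count++] = cur;      /* a real system would apply the undo here */
    cur = log[cur].prev;
  }
  return count;
}
```

In a log where record 2 was already undone and compensated by a CLR at index 3, a walk starting at the CLR undoes records 1 and 0 only; record 2 is never touched again.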

The sibling log_sysop_end_logical_compensate (§9.3) achieves the same skip at sysop granularity via the sysop-end record’s compensate_lsa.

9.9 Savepoints and partial abort: log_append_savepoint and log_abort_partial

log_append_savepoint chains a LOG_SAVEPOINT with header log_rec_savept (log_record.hpp):

| Field | Role | Why it exists |
| --- | --- | --- |
| prv_savept | “Previous savepoint record” LSA | Singly-linked chain so log_get_savepoint_lsa walks back by name |
| length | “Savepoint name” length (name follows the record) | Bounds the variable-length name copy |

// log_append_savepoint -- src/transaction/log_manager.c
if (!LOG_ISTRAN_ACTIVE (tdes)) { er_set (... ER_LOG_CANNOT_ADD_SAVEPOINT ...); return NULL; }
if (savept_name == NULL) { er_set (... ER_LOG_NONAME_SAVEPOINT ...); return NULL; }
node = prior_lsa_alloc_and_copy_data (thread_p, LOG_SAVEPOINT, ..., savept_name, ...);
savept = (LOG_REC_SAVEPT *) node->data_header;
LSA_COPY (&savept->prv_savept, &tdes->savept_lsa); /* <- link to previous savepoint */
(void) prior_lsa_next_record (thread_p, node, tdes);
LSA_COPY (&tdes->savept_lsa, &tdes->tail_lsa); /* <- this is now the latest savepoint */

Branches: NULL tdes (fatal), non-active tran (ER_LOG_CANNOT_ADD_SAVEPOINT), NULL name (ER_LOG_NONAME_SAVEPOINT), else append.
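
The prv_savept chain behaves like a singly-linked list threaded through the log. The sketch below assumes integer LSAs that index a toy array and a hypothetical `toy_get_savepoint_lsa`; the real lookup reads log pages and returns an LSA pointer rather than an index.

```c
#include <assert.h>
#include <string.h>

#define TOY_NULL_LSA (-1)

typedef struct {
  const char *name;   /* savepoint name, NULL for non-savepoint records */
  int prv_savept;     /* LSA of the previous savepoint, TOY_NULL_LSA if none */
} toy_savept;

/* Walks back from the latest savepoint until a name matches. */
int toy_get_savepoint_lsa(const toy_savept *log, int savept_lsa, const char *name)
{
  assert(name != NULL);
  for (int cur = savept_lsa; cur != TOY_NULL_LSA; cur = log[cur].prv_savept) {
    if (strcmp(log[cur].name, name) == 0) {
      return cur;
    }
  }
  return TOY_NULL_LSA;   /* unknown savepoint */
}
```

Because the walk starts at the latest savepoint, a reused name resolves to its most recent occurrence in this toy model.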

log_abort_partial rolls back to a named savepoint by reusing the sysop machinery — it forges a sysop whose parent LSA is the savepoint. Five guard branches run before the synthetic body:

  1. tdes == NULL → ER_LOG_UNKNOWN_TRANINDEX, return TRAN_UNACTIVE_UNKNOWN.
  2. LOG_HAS_LOGGING_BEEN_IGNORED () → ER_LOG_CORRUPTED_DB_DUE_NOLOGGING, return current state.
  3. !LOG_ISTRAN_ACTIVE → silently return current state.
  4. NULL name or unknown savepoint → ER_LOG_UNKNOWN_SAVEPOINT, return TRAN_UNACTIVE_UNKNOWN.
  5. Dangling sysops (topops.last >= 0) → warn + assert(false), drain via log_sysop_attach_to_outer.

// log_abort_partial -- src/transaction/log_manager.c
if (tdes == NULL) { er_set (... ER_LOG_UNKNOWN_TRANINDEX ...); return TRAN_UNACTIVE_UNKNOWN; }
if (LOG_HAS_LOGGING_BEEN_IGNORED ()) { er_set (...); return tdes->state; }
if (!LOG_ISTRAN_ACTIVE (tdes)) { return tdes->state; }
if (savepoint_name == NULL || log_get_savepoint_lsa (...) == NULL)
{ er_set (... ER_LOG_UNKNOWN_SAVEPOINT ...); return TRAN_UNACTIVE_UNKNOWN; }
if (tdes->topops.last >= 0) /* dangling sysops: drain them first */
{ er_set (... ER_LOG_HAS_TOPOPS_DURING_COMMIT_ABORT ...); assert (false);
while (tdes->topops.last >= 0) log_sysop_attach_to_outer (thread_p); }
log_sysop_start (thread_p);
LSA_COPY (&tdes->topops.stack[tdes->topops.last].lastparent_lsa, savept_lsa); /* stop at savepoint */
// ... condensed: if posp_nxlsa not null, transfer/clamp it into the frame's posp_lsa ...
log_sysop_abort (thread_p); /* the actual rollback + CLRs */
LSA_COPY (&tdes->savept_lsa, savept_lsa); /* discard newer savepoints */
return TRAN_UNACTIVE_ABORTED;

Partial abort is “abort a synthetic sysop spanning savepoint→now.” The elided postpone-anchor transfer moves posp_nxlsa into the frame’s posp_lsa (clamped to savept_lsa) so postpones whose source predates the savepoint are not lost.

9.10 log_sysop_attach_to_outer — committing a sysop into its parent

A sysop may merge into its enclosing scope, transferring only its postpone anchor:

// log_sysop_attach_to_outer -- src/transaction/log_manager.c
if (tdes->topops.last == 0 && (!LOG_ISTRAN_ACTIVE (tdes) || tdes->is_system_transaction ()))
{ assert_release (false); log_sysop_commit (thread_p); return; } /* nothing to attach to */
if (tdes->topops.last - 1 >= 0) /* attach to parent sysop frame */
{ if (LSA_ISNULL (&tdes->topops.stack[tdes->topops.last - 1].posp_lsa))
LSA_COPY (&tdes->topops.stack[tdes->topops.last - 1].posp_lsa,
&tdes->topops.stack[tdes->topops.last].posp_lsa); }
else /* attach to transaction level */
{ if (LSA_ISNULL (&tdes->posp_nxlsa))
LSA_COPY (&tdes->posp_nxlsa, &tdes->topops.stack[tdes->topops.last].posp_lsa); }
log_sysop_end_final (thread_p, tdes); /* pop, no end record appended */

Three branches: (1) nothing to attach to → fall back to a real commit; (2) parent sysop → push this level’s posp_lsa up if the parent has none; (3) top-level → push into posp_nxlsa. No LOG_SYSOP_END is written — the sysop’s effects become the parent’s.
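
Branches (2) and (3) reduce to "copy my anchor outward if the outer level has none, then pop". A minimal sketch under the assumption that LSAs are plain integers; the `toy_` names are hypothetical and branch (1), the fallback to a real commit, is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NULL_LSA INT64_C(-1)
#define TOY_MAX_SYSOPS 8

typedef struct {
  int64_t posp_nxlsa;                /* transaction-level postpone anchor */
  int64_t posp_lsa[TOY_MAX_SYSOPS];  /* per-frame postpone anchors */
  int last;                          /* index of the innermost open sysop */
} toy_tdes;

/* Merges the innermost sysop into its enclosing scope without an end record. */
void toy_attach_to_outer(toy_tdes *tdes)
{
  assert(tdes->last >= 0);

  int64_t mine = tdes->posp_lsa[tdes->last];
  int64_t *outer = (tdes->last >= 1) ? &tdes->posp_lsa[tdes->last - 1]
                                     : &tdes->posp_nxlsa;

  if (*outer == TOY_NULL_LSA) {
    *outer = mine;                   /* outer level inherits the anchor */
  }
  tdes->posp_lsa[tdes->last] = TOY_NULL_LSA;
  tdes->last--;                      /* pop the frame */
}
```

When the outer level already has an anchor it is kept, since it points at an earlier postpone and the replay scan from there also reaches the inner sysop's postpones.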

  1. Sysops start as stack frames, not log records. log_sysop_start pushes topops and snapshots tail_lsa into lastparent_lsa; the first on-disk evidence is the end record (or atomic-start marker). The parent-LSA invariant (§9.1) is how end functions detect an empty sysop.
  2. log_sysop_commit_internal is one hub with six type-driven arms. The log_rec_sysop_end union is reinterpreted per type; the function validates arm-vs-state, runs postpones, appends the end, chains tail_topresult_lsa.
  3. Abort is rollback-then-mark with a state swap. log_sysop_abort sets TRAN_UNACTIVE_ABORTED so log_rollback emits CLRs, appends LOG_SYSOP_END_ABORT, then restores the outer state.
  4. Postpone defers redo to post-commit replay, anchored once. The first LOG_POSTPONE seeds posp_lsa (sysop) or posp_nxlsa (tran); the cache replays from memory, log_do_postpone from the log, stopping at any start marker and replaying only LOG_POSTPONE.
  5. log_run_postpone_op makes postpone idempotent via the LOG_RUN_POSTPONE ref_lsa back-pointer; a page-spanning malloc failure is fatal.
  6. Compensation is redo-only and rewinds the undo cursor. log_append_compensate_internal stamps a CLR whose undo_nxlsa skips the compensated record and resets tdes->undo_nxlsa.
  7. Savepoints and partial abort piggy-back on sysops. log_abort_partial clears five guards, forges a sysop spanning savepoint→now, and calls log_sysop_abort; log_sysop_attach_to_outer merges a sysop into its parent with no end record, transferring only the postpone anchor.

Chapter 10: Archiving Header Maintenance and Edge Paths

Chapters 3-9 traced the hot per-record path. This chapter covers everything off it: the background machinery that recycles the active log into archive volumes, the on-disk durability of the log header, and the edge records and corruption checks. For the “active log vs. archives” and “force-at-commit” theory, see the companion.

10.1 Three header structs: page header, log header, archive header

Every log page begins with a LOG_HDRPAGE; logical page -9 (LOGPB_HEADER_PAGE_ID) carries a LOG_HEADER; every archive’s physical page 0 carries a LOG_ARV_HEADER.

flowchart LR
  hp["active page -9<br/>LOG_HDRPAGE + LOG_HEADER"] -->|"nxarv_pageid"| p0["active data page<br/>LOG_HDRPAGE + records"]
  p0 -.archived into.-> ap0["archive phy 0<br/>LOG_HDRPAGE + LOG_ARV_HEADER"]

Figure 10-1: the three header structs and the active-to-archive copy relationship.

LOG_HDRPAGE — per-page header prefix. A LOG_PAGE is LOG_HDRPAGE hdr + char area[1].

| Field | Role | Why it exists |
| --- | --- | --- |
| logical_pageid | Logical page id in the infinite log | Identity independent of physical slot; header page is always -9 |
| offset | Offset of first record on this page | Salvage anchor when a prior page is corrupt and unarchived |
| flags | TDE flags (..._AES/_ARIA) | Marks a page whose records must be encrypted before leaving memory; header page is 0 |
| checksum | CRC32 over sampled bytes | Corruption detection on read (10.6) |

LOG_HEADER — active-log master record. In the data area of page -9, mirrored as log_Gl.hdr. Every member is listed.

| Field | Role | Why it exists |
| --- | --- | --- |
| magic | file(1) magic | Guard vs. non-log files |
| dummy / dummy3 / dummy4 | Alignment pads | 8-byte align |
| db_creation | DB creation time | Ties log to DB; copied to LOG_ARV_HEADER |
| vol_creation | Active-vol creation time | Diagnostics / ordering |
| db_release / db_compatibility | Release string, compat float | Reject incompatible build/version |
| db_iopagesize / db_logpagesize | Page sizes at creation | Run DB at the size the log expects |
| is_shutdown | Clean-shutdown flag | Recovery: was dismount clean |
| next_trid | Next txn id | Copied to LOG_ARV_HEADER for replay |
| mvcc_next_id | Next MVCC id | MVCC allocation high-water |
| avg_ntrans / avg_nlocks | Sizing estimates | Pre-size txn/lock tables |
| npages | Active pages, excl. header | Sizes active vol / archive range |
| db_charset | DB charset id | Charset guard at mount |
| was_copied | Copied-DB flag | Copy vs. original |
| fpageid | Logical pageid at active physical slot 1 | Active analogue of LOG_ARV_HEADER.fpageid |
| append_lsa | Current append position | High-water of real log |
| chkpt_lsa | Lowest LSA recovery replays from | Recovery start; durable here (10.5) |
| nxarv_pageid | Next logical page to archive | Active/archive boundary (10.2) |
| nxarv_phy_pageid | Physical slot of nxarv_pageid | Skips recomputing logpb_to_physical_pageid |
| nxarv_num | Next archive number | Names next _lgarNNN |
| last_arv_num_for_syscrashes | Oldest archive for crash recovery | Deletion floor; -1 = unpinned |
| last_deleted_arv_num | Highest archive removed | Remove resumes without re-scan |
| bkup_level0/1/2_lsa / bkinfo[] | Per-level backup LSAs + info | Backup; see backup chapter |
| prefix_name | Log prefix name | Names volume family |
| has_logging_been_skipped | Logging-skipped flag | Marks a WAL-bypass window |
| vacuum_last_blockid | Last vacuum block id | Gates deletion (10.4) |
| perm_status_obsolete | Obsolete status | Layout compat |
| ha_server_state / ha_file_status / ha_promotion_time | HA state, copy status, promotion time | Replication / HA |
| eof_lsa | LSA of LOG_END_OF_LOG | Durable log end; durable here (10.5) |
| smallest_lsa_at_last_chkpt | Oldest dirty LSA at last chkpt | Bounds recovery/vacuum lookback |
| mvcc_op_log_lsa | LSA of last MVCC op | Vacuum MVCC anchor |
| oldest_visible_mvccid / newest_block_mvccid | Oldest visible, newest block MVCCID | Vacuum visibility / block bounds |
| db_restore_time | Last restore time | Restore bookkeeping |
| mark_will_del | Marked for deletion | DB-drop bookkeeping |
| does_block_need_vacuum | Block needs vacuum | Vacuum scheduling |
| was_active_log_reset | Active log was reset | Cleared in logpb_archive_active_log |

State invariant: the archive boundary. nxarv_pageid is the single source of truth for what is archived: a page whose logical id is below it lives only in an archive, and a page at or above it is still in the active log. nxarv_phy_pageid must equal logpb_to_physical_pageid(nxarv_pageid). Both advance as a unit at the end of logpb_archive_active_log, and logpb_flush_header then makes them durable together. If they disagree, the next archive reads the wrong physical slot and corrupts the sequence.

LOG_ARV_HEADER — one per archive volume.

| Field | Role | Why it exists |
| --- | --- | --- |
| magic | CUBRID_MAGIC_LOG_ARCHIVE | Guard vs. mounting wrong file as archive |
| dummy | Pad | align |
| db_creation | From log_Gl.hdr.db_creation | Ties archive to DB |
| vol_creation | time(NULL) when written | Diagnostics / ordering |
| next_trid | From log_Gl.hdr.next_trid | Replay context |
| npages | Data pages, excl. previous-lsa page | Bounds reads |
| fpageid | Logical pageid at physical slot 1 | Logical-to-physical map in this archive |
| arv_num | This archive’s number | Self-identifying |
| dummy2 | Pad | align |

10.2 logpb_archive_active_log — rolling the active log into an archive

Called under LOG_CS write when the active log fills, it copies [nxarv_pageid .. prev_lsa.pageid-1] into a fresh archive, then advances the boundary. Figure 10-2 traces every branch.

flowchart TB
  start["enter LOG_CS write"] --> wake["wake remove daemon SERVER, or remove-exceed-limit SA"]
  wake --> guard{"nxarv_pageid >= append_lsa.pageid ?"}
  guard -->|yes, only incomplete page| ret["er_log_debug + return"]
  guard -->|no| dis{"archive.vdes open ?"}
  dis -->|yes| dismount["dismount old archive"]
  dis -->|no| mal["malloc arv hdr page"]
  dismount --> mal
  mal -->|NULL| err["goto error"]
  mal -->|ok| flush["flush_all_append_pages, build hdr"]
  flush --> bg{"bg archiving and vdes open ?"}
  bg -->|yes| chk["set hdr checksum"]
  bg -->|no| fmt["fileio_format new vol"]
  fmt -->|NULL_VOLDES| err
  fmt --> chk
  chk -->|error| err
  chk --> wrhdr["write header page phy 0"]
  wrhdr -->|NULL| err
  wrhdr --> loop["copy loop: read LOGPB_IO_NPAGES, write"]
  loop -->|read/write fails| err
  loop --> fin{"background archiving ?"}
  fin -->|yes| rename["dismount, rename _lgar_t, remount"]
  fin -->|no| sync["fileio_synchronize"]
  rename -->|fail| err
  sync -->|fail| err
  rename --> adv["advance nxarv_num/pageid/phy_pageid"]
  sync --> adv
  adv --> fh["logpb_flush_header"]
  fh --> done["cache hdr, log, return"]
  err --> fatal["logpb_fatal_error -> exit"]

Figure 10-2: branch-complete flow of logpb_archive_active_log.

The early guard (nxarv_pageid >= append_lsa.pageid -> er_log_debug + return) refuses an empty range; logpb_flush_all_append_pages is then forced. The header is built self-describing (db_creation/next_trid/fpageid copied from log_Gl.hdr), with last_pageid clamped so a degenerate range never yields negative npages:

// logpb_archive_active_log -- src/transaction/log_page_buffer.c
last_pageid = log_Gl.append.prev_lsa.pageid - 1; /* <- never the live append page */
if (last_pageid < arvhdr->fpageid) last_pageid = arvhdr->fpageid; /* <- clamp >= 1 page */
arvhdr->npages = (DKNPAGES) (last_pageid - arvhdr->fpageid + 1);

The copy loop reads up to LOGPB_IO_NPAGES (4) pages via logpb_read_page_from_active_log and writes them FILEIO_WRITE_NO_COMPENSATE_WRITE as-stored (still TDE-encrypted); any read <= 0 or write NULL jumps to error. The boundary advance is the durable commit: last_arv_num_for_syscrashes is pinned to nxarv_num if still -1 (recovery floor); nxarv_num++; nxarv_pageid/nxarv_phy_pageid advance as a unit; was_active_log_reset = false; then logpb_flush_header. The error label calls logpb_fatal_error(..., true, ...) — a failed archive is unrecoverable, so the server exits (10.7).
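
The npages clamp from the snippet above is plain arithmetic and can be checked in isolation. `toy_archive_npages` is a hypothetical helper that takes the two inputs as integers; it mirrors the three lines shown and nothing else.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages an archive would hold, given the first page to archive
   and the page that currently holds prev_lsa (the live append page). */
int64_t toy_archive_npages(int64_t fpageid, int64_t prev_lsa_pageid)
{
  assert(fpageid >= 0);

  int64_t last_pageid = prev_lsa_pageid - 1;  /* never the live append page */
  if (last_pageid < fpageid) {
    last_pageid = fpageid;                    /* clamp: at least one page */
  }
  return last_pageid - fpageid + 1;
}
```

For fpageid = 100 and an append page of 164 the archive covers pages 100 through 163, that is 64 pages; a degenerate range where the append page is at or before fpageid still yields 1 rather than zero or a negative count.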

10.3 logpb_write_toflush_pages_to_archive — background archiving

When PRM_ID_LOG_BACKGROUND_ARCHIVING is on, full pages stream to a temp volume (_lgar_t) as they flush, so the eventual archive only renames it. It returns early when bg_archive_info.vdes == NULL_VOLDES || num_toflush <= 1, then copies every toflush[] page below prev_lsa.pageid, reconciling cursor pageid against the next bufptr->pageid in three branches:

// logpb_write_toflush_pages_to_archive -- src/transaction/log_page_buffer.c
if (pageid > bufptr->pageid) { assert_release (...); dismount; return; } /* backwards: never */
else if (pageid < bufptr->pageid) { if (logpb_fetch_page (...)) { dismount; return; } } /* gap: fetch */
else { log_pgptr = flush_info->toflush[i]; i++; } /* match: use in hand */

Each page is TDE-encrypted if LOG_IS_PAGE_TDE_ENCRYPTED; on encryption failure it is written plaintext with the TDE flag cleared (a logged data-leak tradeoff). fileio_synchronize runs once every PRM_ID_PB_SYNC_ON_NFLUSH pages. Any write failure dismounts the temp volume and abandons bg archiving (10.2 falls back to fileio_format).

10.4 The remove daemon — gated deletion of old archives

Deletion never happens on the hot path. On a server, logpb_archive_active_log only wakes the daemon (log_wakeup_remove_log_archive_daemon calls wakeup(), async). log_remove_log_archive_daemon_task also fires periodically (compute_period reads PRM_ID_REMOVE_LOG_ARCHIVES_INTERVAL: non-zero = timed wait, zero = wake-only). Its body and the SA path both call logpb_remove_archive_logs_exceed_limit, which early-exits with 0 if log_max_archives == INT_MAX (unlimited) or !vacuum_is_safe_to_remove_archives() (vacuum data not loaded). The window [last_deleted_arv_num + 1, nxarv_num - num_remove_arv_num] then has its high end clamped by each gate with MIN:

// logpb_remove_archive_logs_exceed_limit -- src/transaction/log_page_buffer.c
if (log_Gl.hdr.last_arv_num_for_syscrashes != -1) /* crash-recovery floor */
last_arv_num_to_delete = MIN (last_arv_num_to_delete, log_Gl.hdr.last_arv_num_for_syscrashes);
if (vacuum_first_pageid != NULL_PAGEID && logpb_is_page_in_archive (vacuum_first_pageid))
last_arv_num_to_delete = MIN (last_arv_num_to_delete, min_arv_required_for_vacuum);
if (prm_get_integer_value (PRM_ID_SUPPLEMENTAL_LOG)) { /* CDC + flashback gates */
if (logpb_is_page_in_archive (cdc_min_log_pageid_to_keep ())) /* CDC progress */
last_arv_num_to_delete = MIN (last_arv_num_to_delete, min_arv_required_for_cdc);
if (flashback_is_needed_to_keep_archive ())
last_arv_num_to_delete = MIN (last_arv_num_to_delete, min_arv_required_for_flashback); }

State invariant: no consumer loses an archive it still needs. An archive is deletable only if its number is below every live consumer’s minimum: last_arv_num_for_syscrashes, vacuum, CDC (cdc_min_log_pageid_to_keep, the oldest page CDC has not consumed, checked only under PRM_ID_SUPPLEMENTAL_LOG), flashback, and, on a server, HA copy progress (logwr_get_min_copied_fpageid, unless PRM_ID_FORCE_REMOVE_LOG_ARCHIVES is set). The MIN() chain enforces this; drop any clamp and that consumer finds its archive deleted.

Then max_count caps batch size, last_arv_num_to_delete-- (window is exclusive of the last needed archive), and only if >= first_arv_num_to_delete does it persist last_deleted_arv_num and logpb_flush_header. The unlink runs after LOG_CS_EXIT via logpb_remove_archive_logs_internal.
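
The clamp-then-decrement logic can be sketched with the floors passed as an array. This is a simplification: the real gates are enabled by different conditions (not one shared sentinel), the max_count cap is omitted, and all `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NO_FLOOR (-1)   /* toy sentinel: this consumer imposes no limit */

/* Applies every active floor with MIN, then steps back by one because the
   window excludes the last archive that is still needed. */
int toy_last_arv_to_delete(int candidate, const int *floors, size_t n)
{
  assert(floors != NULL || n == 0);

  int last = candidate;
  for (size_t i = 0; i < n; i++) {
    if (floors[i] != TOY_NO_FLOOR && floors[i] < last) {
      last = floors[i];
    }
  }
  return last - 1;
}

/* Deletion proceeds only when the window [first, last] is non-empty. */
bool toy_window_is_nonempty(int first, int last)
{
  return last >= first;
}
```

With a candidate of 8 and floors {7, none, 5, 9}, the tightest floor is 5, so archives up to and including 4 are eligible and archive 5 stays on disk.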

10.5 logpb_flush_header — making the active-log header durable

Every boundary change above ends here. It asserts LOG_CS_OWN_WRITE_MODE, lazily allocates loghdr_pgptr if NULL (OOM -> logpb_fatal_error), then snapshots and writes to page -9:

// logpb_flush_header -- src/transaction/log_page_buffer.c
log_hdr = (LOG_HEADER *) (log_Gl.loghdr_pgptr->area);
*log_hdr = log_Gl.hdr; /* <- snapshot in-memory header */
log_Gl.loghdr_pgptr->hdr.flags = 0; /* <- never TDE-encrypted */
logpb_write_page_to_disk (thread_p, log_Gl.loghdr_pgptr, LOGPB_HEADER_PAGE_ID);

This is the single point where chkpt_lsa (recovery start) and eof_lsa (durable log end) become durable. It does not flush append pages; those use Ch 7’s WAL flush.

10.6 Edge records and corruption: EOF marker, dummies, checksum

LOG_END_OF_LOG placement. In logpb_flush_all_append_pages, an EOF marker (eof.type = LOG_END_OF_LOG, null forw_lsa) is appended in place via logpb_start_append so recovery finds where the log stops, but append_lsa is not advanced — the next real record overwrites it.

LOG_DUMMY_GENERIC and other dummies. Several log types carry no payload; the enum comment is literally "ridiculous, but flush needs it". A dummy gives flush a record to terminate/pad a page when a real record would straddle a boundary awkwardly, so the page closes without a partial record header.

Checksum. logpb_compute_page_checksum samples 16 bytes from the head and tail of each 4096-byte block and CRC32s the concatenation, zeroing hdr.checksum during the computation and restoring it after so the stored checksum never checks itself. logpb_set_page_checksum stores it; logpb_page_has_valid_checksum recomputes and compares; logpb_page_check_corruption sets *is_page_corrupted = !has_valid_checksum. Any change must be mirrored in logwr_check_page_checksum so replication agrees.
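
A self-contained sketch of the sampling scheme, under several assumptions: the page is a fixed 16 KiB toy struct with the checksum field placed first (the real header layout differs), and `toy_crc32` is a textbook bitwise CRC-32 that is not claimed to produce the same values as CUBRID's own CRC routine. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCK_SIZE 4096
#define TOY_SAMPLE_NBYTES 16
#define TOY_PAGE_SIZE (4 * TOY_BLOCK_SIZE)

typedef struct {
  uint32_t checksum;
  unsigned char rest[TOY_PAGE_SIZE - sizeof(uint32_t)];
} toy_page;

/* Bitwise CRC-32 with the reflected polynomial 0xEDB88320. */
uint32_t toy_crc32(const unsigned char *buf, size_t len)
{
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; i++) {
    crc ^= buf[i];
    for (int bit = 0; bit < 8; bit++) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

/* CRC over 16 bytes from the head and tail of every 4096-byte block. */
uint32_t toy_page_checksum(toy_page *page)
{
  unsigned char sample[2 * TOY_SAMPLE_NBYTES * (TOY_PAGE_SIZE / TOY_BLOCK_SIZE)];
  unsigned char *raw = (unsigned char *) page;
  uint32_t saved = page->checksum;
  size_t n = 0;

  page->checksum = 0;   /* the stored checksum must not feed into itself */
  for (size_t off = 0; off < TOY_PAGE_SIZE; off += TOY_BLOCK_SIZE) {
    memcpy(sample + n, raw + off, TOY_SAMPLE_NBYTES);
    n += TOY_SAMPLE_NBYTES;
    memcpy(sample + n, raw + off + TOY_BLOCK_SIZE - TOY_SAMPLE_NBYTES,
           TOY_SAMPLE_NBYTES);
    n += TOY_SAMPLE_NBYTES;
  }
  assert(n == sizeof sample);
  page->checksum = saved;   /* restore the field after sampling */
  return toy_crc32(sample, n);
}
```

Sampling keeps the cost per page small and fixed, at the price that a change confined to unsampled bytes leaves the checksum unchanged.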

logpb_invalid_all_append_pages. When append state must be reset (e.g. after a partial-append failure), the one branch (if log_Gl.append.log_pgptr != NULL) flushes the dirty append page via logpb_flush_pages_direct first so committed work is not lost and nulls log_pgptr; it then zeroes flush_info->num_toflush and sets toflush[0] = NULL under flush_mutex.

10.7 logpb_fatal_error_internal — last-resort flush and exit

Unrecoverable errors call logpb_fatal_error -> logpb_fatal_error_internal with need_flush = true (logpb_fatal_error_exit_immediately_wo_flush passes false when flushing is itself unsafe):

// logpb_fatal_error_internal -- src/transaction/log_page_buffer.c
if (log_exit == true && need_flush == true && log_Gl.append.log_pgptr != NULL) {
static int in_fatal = false; /* <- reentrancy guard */
if (in_fatal == false) {
in_fatal = true;
pgbuf_flush_checkpoint (...); /* flush only up to prev_lsa */
in_fatal = false; } }
fileio_synchronize_all (thread_p); /* <- force everything to stable storage */
/* then boot_server_status(DOWN); NDEBUG -> exit, debug -> abort core dump */

Branches: the flush block runs only when all three of log_exit, need_flush, and a live append page hold; the in_fatal guard blocks recursive entry if the flush itself faults. It flushes “as much as you can without forcing the current unfinished log record” (committed work below prev_lsa becomes durable, the partial record is left for recovery), then calls fileio_synchronize_all and exits (NDEBUG) or aborts (debug).
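
The reentrancy guard can be exercised with a toy fatal path whose flush deliberately re-enters it. Everything here is a stand-in: `toy_flush` simulates a fault inside the flush, the counters replace real I/O, and the function returns instead of terminating the process.

```c
#include <assert.h>
#include <stdbool.h>

int toy_flush_calls = 0;
int toy_sync_calls = 0;

void toy_fatal(bool log_exit, bool need_flush, bool has_append_page);

/* Stand-in for the last-resort flush; it "faults" by re-entering the fatal path. */
void toy_flush(void)
{
  toy_flush_calls++;
  toy_fatal(true, true, true);
}

void toy_fatal(bool log_exit, bool need_flush, bool has_append_page)
{
  static bool in_fatal = false;   /* survives across nested calls */

  if (log_exit && need_flush && has_append_page && !in_fatal) {
    in_fatal = true;
    toy_flush();                  /* nested entry skips this block */
    in_fatal = false;
  }
  /* in_fatal is still set here only inside the nested call made by toy_flush */
  assert(!in_fatal || toy_flush_calls > 0);
  toy_sync_calls++;               /* stand-in for the final synchronize step */
}
```

One outer call produces exactly one flush attempt even though the flush re-enters the fatal path; without the guard the recursion would not terminate.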

10.8 Open questions carried from the high-level doc

Four items from the companion remain open: the group-commit window (flush daemon’s wake timing and its interaction with PRM_ID_PB_SYNC_ON_NFLUSH, 10.3); whether the prior-list list_size cap throttles archive/flush; TDE placement (encryption is lazy in 10.3 and skipped in logpb_archive_active_log’s direct copy — the single authoritative encrypt-on-disk point is untraced); and the LOG_DUMMY_GENERIC invariant (the condition under which flush requires a dummy is documented only by the source comment).

  1. Three nested header structs: LOG_HDRPAGE per page; LOG_HEADER (page -9) the master record; LOG_ARV_HEADER per archive, with db_creation/next_trid copied from the former.
  2. nxarv_pageid/nxarv_phy_pageid are the archive boundary, advanced as a unit and flushed by logpb_archive_active_log (which also clears was_active_log_reset).
  3. Archiving forces a full append flush, copies [nxarv_pageid .. prev_lsa.pageid-1] as-stored, and treats any I/O failure as fatal.
  4. Deletion is gated: logpb_remove_archive_logs_exceed_limit clamps the window with a MIN() chain against crash-recovery, vacuum, CDC, flashback, and HA floors.
  5. logpb_flush_header is the single durability point for chkpt_lsa/eof_lsa/archive bookkeeping, under LOG_CS write, flags = 0.
  6. Edge records: LOG_END_OF_LOG appended in place without advancing append_lsa; dummies pad pages; a sampled-CRC32 checksum drives logpb_page_check_corruption.
  7. Fatal path: logpb_fatal_error_internal uses an in_fatal guard, flushes only up to prev_lsa, then exits/aborts.

The following are line numbers as observed on 2026-06-08; symbols are the canonical anchor and line numbers are hints that decay.

| Symbol | File | Line |
| --- | --- | --- |
| LOG_PAGESIZE | src/storage/storage_common.h | 99 |
| log_Zip_support | src/transaction/log_append.cpp | 40 |
| log_Zip_min_size_to_compress | src/transaction/log_append.cpp | 41 |
| log_append_info::get_nxio_lsa | src/transaction/log_append.cpp | 106 |
| log_append_info::set_nxio_lsa | src/transaction/log_append.cpp | 112 |
| log_prior_lsa_info::log_prior_lsa_info | src/transaction/log_append.cpp | 117 |
| LOG_RESET_APPEND_LSA | src/transaction/log_append.cpp | 128 |
| LOG_RESET_PREV_LSA | src/transaction/log_append.cpp | 136 |
| LOG_APPEND_PTR | src/transaction/log_append.cpp | 145 |
| log_append_init_zip | src/transaction/log_append.cpp | 185 |
| log_append_final_zip | src/transaction/log_append.cpp | 232 |
| prior_lsa_alloc_and_copy_data | src/transaction/log_append.cpp | 273 |
| prior_lsa_alloc_and_copy_crumbs | src/transaction/log_append.cpp | 410 |
| prior_lsa_copy_undo_data_to_node | src/transaction/log_append.cpp | 493 |
| prior_lsa_copy_redo_data_to_node | src/transaction/log_append.cpp | 524 |
| prior_lsa_gen_undoredo_record_from_crumbs | src/transaction/log_append.cpp | 651 |
| prior_lsa_gen_record | src/transaction/log_append.cpp | 1217 |
| prior_update_header_mvcc_info | src/transaction/log_append.cpp | 1320 |
| prior_lsa_next_record_internal | src/transaction/log_append.cpp | 1357 |
| commit_abort_lsa | src/transaction/log_append.cpp | 1485 |
| prior_lsa_next_record | src/transaction/log_append.cpp | 1553 |
| prior_lsa_next_record_with_lock | src/transaction/log_append.cpp | 1559 |
| prior_set_tde_encrypted | src/transaction/log_append.cpp | 1565 |
| prior_is_tde_encrypted | src/transaction/log_append.cpp | 1581 |
| prior_lsa_start_append | src/transaction/log_append.cpp | 1593 |
| prior_lsa_end_append | src/transaction/log_append.cpp | 1652 |
| prior_lsa_append_data | src/transaction/log_append.cpp | 1661 |
| log_append_get_zip_undo | src/transaction/log_append.cpp | 1725 |
| log_append_get_zip_redo | src/transaction/log_append.cpp | 1751 |
| log_prior_lsa_append_align | src/transaction/log_append.cpp | 1892 |
| log_prior_lsa_append_advance_when_doesnot_fit | src/transaction/log_append.cpp | 1905 |
| log_prior_lsa_append_add_align | src/transaction/log_append.cpp | 1917 |
| log_crumb | src/transaction/log_append.hpp | 46 |
| log_data_addr | src/transaction/log_append.hpp | 53 |
| LOG_PRIOR_LSA_LOCK | src/transaction/log_append.hpp | 66 |
| log_append_info | src/transaction/log_append.hpp | 73 |
| log_prior_node | src/transaction/log_append.hpp | 91 |
| log_prior_lsa_info | src/transaction/log_append.hpp | 112 |
| log_zip_alloc | src/transaction/log_compress.c | 237 |
| log_zip | src/transaction/log_compress.h | 53 |
| log_global::log_global | src/transaction/log_global.c | 49 |
| LOGAREA_SIZE | src/transaction/log_impl.h | 121 |
| log_setdirty | src/transaction/log_impl.h | 305 |
| log_flush_info | src/transaction/log_impl.h | 322 |
| log_topops_addresses | src/transaction/log_impl.h | 353 |
| log_topops_stack | src/transaction/log_impl.h | 362 |
| log_rcv_tdes | src/transaction/log_impl.h | 458 |
| log_tdes | src/transaction/log_impl.h | 475 |
| log_global | src/transaction/log_impl.h | 671 |
| log_lsa | src/transaction/log_lsa.hpp | 35 |
| NULL_LSA | src/transaction/log_lsa.hpp | 67 |
| MAX_LSA | src/transaction/log_lsa.hpp | 72 |
| LSA_COPY | src/transaction/log_lsa.hpp | 80 |
| LSA_AS_ARGS | src/transaction/log_lsa.hpp | 91 |
| LOG_TDES_LAST_SYSOP | src/transaction/log_manager.c | 199 |
| LOG_TDES_LAST_SYSOP_PARENT_LSA | src/transaction/log_manager.c | 200 |
| LOG_TDES_LAST_SYSOP_POSP_LSA | src/transaction/log_manager.c | 201 |
| log_Flush_daemon | src/transaction/log_manager.c | 363 |
| log_create_internal | src/transaction/log_manager.c | 827 |
| log_initialize_internal | src/transaction/log_manager.c | 1100 |
| log_abort_by_tdes | src/transaction/log_manager.c | 1583 |
| log_abort_all_active_transaction | src/transaction/log_manager.c | 1608 |
| log_final | src/transaction/log_manager.c | 1720 |
| log_append_undoredo_data | src/transaction/log_manager.c | 1893 |
| log_append_undo_data | src/transaction/log_manager.c | 1973 |
| log_append_redo_data | src/transaction/log_manager.c | 2035 |
| log_append_undoredo_crumbs | src/transaction/log_manager.c | 2086 |
| log_append_postpone | src/transaction/log_manager.c | 2719 |
| log_append_run_postpone | src/transaction/log_manager.c | 2881 |
| log_append_compensate_internal | src/transaction/log_manager.c | 3047 |
| log_append_savepoint | src/transaction/log_manager.c | 3365 |
| log_sysop_start | src/transaction/log_manager.c | 3599 |
| log_sysop_start_atomic | src/transaction/log_manager.c | 3665 |
| log_sysop_commit_internal | src/transaction/log_manager.c | 3825 |
| log_sysop_commit | src/transaction/log_manager.c | 3916 |
| log_sysop_end_logical_undo | src/transaction/log_manager.c | 3941 |
| log_sysop_end_logical_compensate | src/transaction/log_manager.c | 3984 |
| log_sysop_end_logical_run_postpone | src/transaction/log_manager.c | 4003 |
| log_sysop_end_recovery_postpone | src/transaction/log_manager.c | 4024 |
| log_sysop_abort | src/transaction/log_manager.c | 4038 |
| log_sysop_attach_to_outer | src/transaction/log_manager.c | 4097 |
| log_append_commit_postpone | src/transaction/log_manager.c | 4384 |
| log_append_sysop_start_postpone | src/transaction/log_manager.c | 4455 |
| log_append_repl_info_and_commit_log | src/transaction/log_manager.c | 4647 |
| log_append_donetime_internal | src/transaction/log_manager.c | 4679 |
| log_change_tran_as_completed | src/transaction/log_manager.c | 4722 |
| log_append_commit_log | src/transaction/log_manager.c | 4779 |
| log_append_commit_log_with_lock | src/transaction/log_manager.c | 4802 |
| log_append_abort_log | src/transaction/log_manager.c | 4816 |
| log_commit_local | src/transaction/log_manager.c | 5159 |
| log_abort_local | src/transaction/log_manager.c | 5277 |
| log_commit | src/transaction/log_manager.c | 5352 |
| log_abort | src/transaction/log_manager.c | 5461 |
| log_abort_partial | src/transaction/log_manager.c | 5558 |
| log_complete | src/transaction/log_manager.c | 5653 |
| log_rollback | src/transaction/log_manager.c | 7664 |
| log_tran_do_postpone | src/transaction/log_manager.c | 8156 |
| log_sysop_do_postpone | src/transaction/log_manager.c | 8190 |
| log_do_postpone | src/transaction/log_manager.c | 8237 |
| log_run_postpone_op | src/transaction/log_manager.c | 8481 |
| log_wakeup_remove_log_archive_daemon | src/transaction/log_manager.c | 10099 |
| log_wakeup_log_flush_daemon | src/transaction/log_manager.c | 10126 |
| log_is_log_flush_daemon_available | src/transaction/log_manager.c | 10141 |
| log_remove_log_archive_daemon_task | src/transaction/log_manager.c | 10185 |
| log_flush_execute | src/transaction/log_manager.c | 10377 |
| log_flush_daemon_init | src/transaction/log_manager.c | 10493 |
| log_abort_task_execute | src/transaction/log_manager.c | 10558 |
| cdc_min_log_pageid_to_keep | src/transaction/log_manager.c | 14021 |
| LOG_IS_SYSTEM_OP_STARTED | src/transaction/log_manager.h | 59 |
| LOGPB_HEADER_PAGE_ID | src/transaction/log_page_buffer.c | 138 |
| LOG_APPEND_ALIGN | src/transaction/log_page_buffer.c | 164 |
| LOG_APPEND_ADVANCE_WHEN_DOESNOT_FIT | src/transaction/log_page_buffer.c | 176 |
| LOG_APPEND_ADVANCE_WHEN_DOESNOT_FIT | src/transaction/log_page_buffer.c | 177 |
| LOG_APPEND_SETDIRTY_ADD_ALIGN | src/transaction/log_page_buffer.c | 184 |
| log_buffer | src/transaction/log_page_buffer.c | 192 |
| log_buffer | src/transaction/log_page_buffer.c | 194 |
| log_pb_global_data | src/transaction/log_page_buffer.c | 244 |
| logpb_get_log_buffer | src/transaction/log_page_buffer.c | 394 |
| logpb_initialize_log_buffer | src/transaction/log_page_buffer.c | 425 |
| logpb_compute_page_checksum | src/transaction/log_page_buffer.c | 446 |
| logpb_set_page_checksum | src/transaction/log_page_buffer.c | 495 |
| logpb_page_has_valid_checksum | src/transaction/log_page_buffer.c | 523 |
| logpb_initialize_pool | src/transaction/log_page_buffer.c | 553 |
| logpb_finalize_pool | src/transaction/log_page_buffer.c | 672 |
| logpb_create_page | src/transaction/log_page_buffer.c | 783 |
| logpb_locate_page | src/transaction/log_page_buffer.c | 807 |
| logpb_set_dirty | src/transaction/log_page_buffer.c | 929 |
| logpb_flush_header | src/transaction/log_page_buffer.c | 1676 |
| logpb_fetch_start_append_page | src/transaction/log_page_buffer.c | 2504 |
| logpb_fetch_start_append_page_new | src/transaction/log_page_buffer.c | 2586 |
| logpb_next_append_page | src/transaction/log_page_buffer.c | 2630 |
| logpb_writev_append_pages | src/transaction/log_page_buffer.c | 2780 |
| logpb_write_toflush_pages_to_archive | src/transaction/log_page_buffer.c | 2868 |
| logpb_append_next_record | src/transaction/log_page_buffer.c | 2981 |
| logpb_append_prior_lsa_list | src/transaction/log_page_buffer.c | 3040 |
| prior_lsa_remove_prior_list | src/transaction/log_page_buffer.c | 3084 |
| logpb_prior_lsa_append_all_list | src/transaction/log_page_buffer.c | 3106 |
| logpb_flush_all_append_pages | src/transaction/log_page_buffer.c | 3232 |
| logpb_flush_pages_direct | src/transaction/log_page_buffer.c | 3952 |
| logpb_flush_pages | src/transaction/log_page_buffer.c | 3980 |
| logpb_force_flush_pages | src/transaction/log_page_buffer.c | 4096 |
| logpb_force_flush_header_and_pages | src/transaction/log_page_buffer.c | 4104 |
| logpb_invalid_all_append_pages | src/transaction/log_page_buffer.c | 4121 |
| logpb_flush_log_for_wal | src/transaction/log_page_buffer.c | 4162 |
| logpb_start_append | src/transaction/log_page_buffer.c | 4207 |
| logpb_append_data | src/transaction/log_page_buffer.c | 4290 |
| logpb_append_crumbs | src/transaction/log_page_buffer.c | 4366 |
| logpb_end_append | src/transaction/log_page_buffer.c | 4455 |
| logpb_archive_active_log | src/transaction/log_page_buffer.c | 5649 |
| logpb_remove_archive_logs_exceed_limit | src/transaction/log_page_buffer.c | 5991 |
| logpb_fatal_error | src/transaction/log_page_buffer.c | 10607 |
| logpb_fatal_error_exit_immediately_wo_flush | src/transaction/log_page_buffer.c | 10618 |
| logpb_fatal_error_internal | src/transaction/log_page_buffer.c | 10629 |
| logpb_initialize_flush_info | src/transaction/log_page_buffer.c | 10878 |
| logpb_finalize_flush_info | src/transaction/log_page_buffer.c | 10912 |
| logpb_need_wal | src/transaction/log_page_buffer.c | 11229 |
| logpb_page_check_corruption | src/transaction/log_page_buffer.c | 11508 |
| logpb_get_tde_algorithm | src/transaction/log_page_buffer.c | 11565 |
| logpb_set_tde_algorithm | src/transaction/log_page_buffer.c | 11593 |
| log_rectype | src/transaction/log_record.hpp | 35 |
| log_rec_header | src/transaction/log_record.hpp | 146 |
| log_data | src/transaction/log_record.hpp | 157 |
| log_rec_undoredo | src/transaction/log_record.hpp | 167 |
| log_rec_undo | src/transaction/log_record.hpp | 176 |
| log_rec_redo | src/transaction/log_record.hpp | 184 |
| log_vacuum_info | src/transaction/log_record.hpp | 192 |
| log_rec_mvcc_undoredo | src/transaction/log_record.hpp | 202 |
| log_rec_mvcc_undo | src/transaction/log_record.hpp | 211 |
| log_rec_mvcc_redo | src/transaction/log_record.hpp | 220 |
| log_rec_donetime | src/transaction/log_record.hpp | 237 |
| log_rec_compensate | src/transaction/log_record.hpp | 262 |
| log_rec_start_postpone | src/transaction/log_record.hpp | 271 |
| log_sysop_end_type | src/transaction/log_record.hpp | 285 |
| log_rec_sysop_end | src/transaction/log_record.hpp | 305 |
| log_rec_sysop_start_postpone | src/transaction/log_record.hpp | 328 |
| log_rec_run_postpone | src/transaction/log_record.hpp | 336 |
| log_rec_savept | src/transaction/log_record.hpp | 380 |
| LOG_GET_LOG_RECORD_HEADER | src/transaction/log_record.hpp | 441 |
| LOG_IS_MVCC_OP_RECORD_TYPE | src/transaction/log_record.hpp | 463 |
| LOG_HDRPAGE_FLAG_ENCRYPTED_MASK | src/transaction/log_storage.hpp | 45 |
| LOG_IS_PAGE_TDE_ENCRYPTED | src/transaction/log_storage.hpp | 47 |
| LOGPB_HEADER_PAGE_ID | src/transaction/log_storage.hpp | 51 |
| log_hdrpage | src/transaction/log_storage.hpp | 63 |
| log_page | src/transaction/log_storage.hpp | 80 |
| log_page | src/transaction/log_storage.hpp | 81 |
| log_header | src/transaction/log_storage.hpp | 113 |
| log_arv_header | src/transaction/log_storage.hpp | 231 |
| logtb_get_new_tran_id | src/transaction/log_tran_table.c | 1741 |
| LOG_IS_MVCC_OPERATION | src/transaction/mvcc.h | 261 |

  • cubrid-log-manager.md — the high-level companion. See also cubrid-prior-list.md (the prior-list mechanism) and cubrid-recovery-manager.md (how these records are replayed).
  • Raw analyses under raw/code-analysis/cubrid/storage/log_manager/.
  • Code: src/transaction/log_manager.{c,h}, log_append.{cpp,hpp}, log_record.hpp, log_lsa.{cpp,hpp}, log_storage.hpp, log_page_buffer.c.
  • Methodology: knowledge/methodology/code-analysis-detail-doc.md.