CUBRID Log Manager — Code-Level Deep Dive

Where this document fits: The high-level analysis cubrid-log-manager.md covers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full lifecycle of a single log record inside the kernel.

Contents:

Ch	Title	Status
1	Data Structure Map	✅
2	Initialization and Memory	✅
3	Building a Prior Node from a Caller Request	✅
4	LSA Assignment and Attach to the Prior List	✅
5	Draining the Prior List into the Page Buffer	✅
6	Crossing a Log Page Boundary	✅
7	Flush Durability and the WAL Rule	✅
8	Commit and Abort Lifecycle	✅
9	System Operations Postpone and Compensation	✅
10	Archiving Header Maintenance and Edge Paths	✅

Chapter 1: Data Structure Map

A log record is re-encoded across three tiers from caller to disk: caller inputs (log_data_addr, log_crumb), the staging tier (log_prior_node, log_prior_lsa_info, log_append_info), and the on-disk tier (log_hdrpage, log_page, log_header, log_arv_header). The record body, a log_rec_header followed by one log_rec_* payload, flows through all three unchanged. This chapter is the field-level map; later chapters trace the motion between tiers.

Cross-link: WAL theory (why the log LSA orders everything, why the log must reach disk before the data page) lives in the high-level companion cubrid-log-manager.md. This chapter documents the structures the rule operates over, not the rule.

1.1 The addressing primitive — `log_lsa`

A log sequence address (LSA) packs a logical page id and an in-page byte offset into a 64-bit bit-field.

// log_lsa -- src/transaction/log_lsa.hpp
struct log_lsa
{
  std::int64_t pageid:48;  /* Log page identifier : 6 bytes length */
  std::int64_t offset:16;  /* Offset in page. :16 of int64 (not short) for alignment */
  // ... condensed: is_null(), is_max(), set_null(); ordering compares pageid then offset ...
};

Field	Role	Why
`pageid` (48b)	Logical page id in the infinite log	Unbounded append-only page sequence
`offset` (16b)	Byte offset within `area[]`	`:16` of an `int64` packs to 8 bytes

INVARIANT — LSA total ordering. operator< compares pageid then offset, making LSAs a monotone WAL clock; every before/after and durability decision is an LSA comparison. Lose it and the WAL rule (Ch 7) and recovery replay cannot decide what to redo.

Sentinels/shims: NULL_LSA = {-1,-1} (set_null() writes both fields), MAX_LSA = {(1<<47)-1,(1<<15)-1}, and the LSA_* macros (LSA_COPY, LSA_SET_NULL, LSA_ISNULL, LSA_EQ/LE/LT/GE/GT, LSA_AS_ARGS) — inline wrappers over the operators so legacy C compiles.
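
The ordering invariant can be illustrated with a minimal C++ sketch. The type sketch_lsa and the helpers make_lsa and lsa_less are hypothetical stand-ins, not the real log_lsa API; only the 48:16 split and the pageid-then-offset comparison come from the text above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for log_lsa (illustration only).
struct sketch_lsa
{
  std::int64_t pageid : 48;   // logical page id
  std::int64_t offset : 16;   // byte offset inside the page
};

// Bit-field layout is implementation-defined; mainstream compilers pack this into 8 bytes.
static_assert (sizeof (sketch_lsa) == 8, "48 + 16 bits are expected to share one int64");

// Hypothetical helper: builds an LSA from two plain integers.
inline sketch_lsa make_lsa (std::int64_t pageid, std::int64_t offset)
{
  sketch_lsa lsa;
  lsa.pageid = pageid;
  lsa.offset = offset;
  return lsa;
}

// Total order: pageid first, offset as the tie-breaker.
inline bool lsa_less (const sketch_lsa &a, const sketch_lsa &b)
{
  if (a.pageid != b.pageid)
    {
      return a.pageid < b.pageid;
    }
  return a.offset < b.offset;
}
```

Under this ordering the {-1,-1} sentinel sorts before every real address, which is one reason a null check has to come before any "is this record older" comparison.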

1.2 The record header — `log_rec_header`

Every on-disk record begins with a fixed log_rec_header threading it into a physical chain and a per-transaction chain:

// log_rec_header -- src/transaction/log_record.hpp
struct log_rec_header
{
  LOG_LSA prev_tranlsa;  /* prev record of SAME transaction */
  LOG_LSA back_lsa, forw_lsa;  /* physically prev / next record */
  TRANID trid;  LOG_RECTYPE type;
};

Field	Role	Why
`prev_tranlsa`	Prior record of the same transaction	Undo walks one transaction backward
`back_lsa`	Physically previous record	Reverse log scan
`forw_lsa`	Physically next record	Redo forward scan; `NULL_LSA` until successor known (Ch 4)
`trid`	Owning transaction id	Demultiplexes the interleaved stream
`type`	`LOG_RECTYPE` discriminator	Tagged-union tag; selects the payload struct

INVARIANT — header forms the doubly linked physical chain. For adjacent A then B: B.back_lsa == addr(A) and A.forw_lsa == addr(B). back_lsa is filled when the node is linked (from prior_info.prev_lsa), forw_lsa only once the successor’s LSA is known, so the chain is briefly half-open at the tail. If the two links disagree, undo and redo scans visit different record sets and recovery breaks.
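
As a rough illustration of the chain invariant, the sketch below checks that every adjacent pair of records agrees on its links. All names (sketch_rec, chain_is_consistent) are hypothetical, and an LSA is simplified to a single integer, which is an assumption made only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using sketch_addr = std::int64_t;          // simplified stand-in for an LSA
constexpr sketch_addr SKETCH_NULL = -1;    // stand-in for NULL_LSA

// Hypothetical record: its own address plus the two physical links.
struct sketch_rec
{
  sketch_addr addr;
  sketch_addr back;
  sketch_addr forw;
};

// Returns true when each adjacent pair (A, B) satisfies
// A.forw == addr(B) and B.back == addr(A).
// The forw link of the last record is not checked: the tail may be half-open.
inline bool chain_is_consistent (const std::vector<sketch_rec> &log)
{
  for (std::size_t i = 0; i + 1 < log.size (); ++i)
    {
      if (log[i].forw != log[i + 1].addr || log[i + 1].back != log[i].addr)
        {
          return false;
        }
    }
  return true;
}
```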

1.3 The type tag — `log_rectype`

The enum that the type field ranges over is explicitly numbered and append-only: obsolete values are wrapped in #if 0 rather than deleted, so the on-disk meaning of each integer never shifts.

// log_rectype -- src/transaction/log_record.hpp (condensed)
enum log_rectype
{
  LOG_SMALLER_LOGREC_TYPE = 0,      /* lower-bound check */
#if 0
  LOG_CLIENT_NAME = 1,              /* Obsolete -- hole preserved */
#endif
  LOG_UNDOREDO_DATA = 2, LOG_UNDO_DATA = 3, LOG_REDO_DATA = 4,
  // ... LOG_COMMIT=17, LOG_SYSOP_END=20, LOG_ABORT=22, ...
  LOG_MVCC_UNDOREDO_DATA = 46, LOG_MVCC_UNDO_DATA = 47, LOG_MVCC_REDO_DATA = 48,
  LOG_MVCC_DIFF_UNDOREDO_DATA = 49, LOG_SYSOP_ATOMIC_START = 50,
  LOG_DUMMY_GENERIC = 51,           /* dummy used for flush */
  LOG_SUPPLEMENTAL_INFO = 52,
  LOG_LARGER_LOGREC_TYPE            /* upper-bound check */
};

INVARIANT — sentinel bounds and stable wire values. LOG_SMALLER_LOGREC_TYPE (0) and LOG_LARGER_LOGREC_TYPE bracket the valid range; because the integer is persisted, a number is never reused (the #if 0 holes guarantee it). The classification macros (LOG_IS_UNDO_RECORD_TYPE, LOG_IS_REDO_RECORD_TYPE, LOG_IS_UNDOREDO_RECORD_TYPE, LOG_IS_MVCC_OP_RECORD_TYPE) all dispatch on type.
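
A sentinel-bounded range check might look like the sketch below. The enum is a hypothetical subset that reuses only the numeric values quoted above, and rectype_in_range is an illustrative helper, not a function from the source.

```cpp
#include <cassert>

// Hypothetical subset of log_rectype; the real enum has many more members.
enum sketch_rectype
{
  SKETCH_SMALLER_LOGREC_TYPE = 0,    // lower sentinel
  /* value 1 is an obsolete hole and stays unassigned */
  SKETCH_UNDOREDO_DATA = 2,
  SKETCH_SUPPLEMENTAL_INFO = 52,
  SKETCH_LARGER_LOGREC_TYPE          // upper sentinel, one past the last real value
};

// True when the persisted integer lies strictly between the two sentinels.
inline bool rectype_in_range (int raw_type)
{
  return raw_type > SKETCH_SMALLER_LOGREC_TYPE && raw_type < SKETCH_LARGER_LOGREC_TYPE;
}
```

In this sketch a range check alone still admits hole values such as 1; rejecting those would need a per-value check, which is omitted here.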

1.4 The recovery-data locator — `log_data`

Undo/redo payloads embed a log_data naming where on a data volume the change applies — a recovery coordinate, not the log’s address:

// log_data -- src/transaction/log_record.hpp
struct log_data { LOG_RCVINDEX rcvindex; PAGEID pageid; PGLENGTH offset; VOLID volid; };

Field	Role	Why
`rcvindex`	Index into the recovery dispatch table	Picks the `rv*` function for the bytes
`pageid`	Target data page id	Page to refix
`offset`	Offset/slot within that page	Where the change lands
`volid`	Volume id of the target page	Disambiguates `pageid` across volumes

1.5 The payload family — undo/redo and MVCC variants

The type tag selects one payload, following the header on the page; all build on log_data:

// log_rec_undoredo / undo / redo -- src/transaction/log_record.hpp
struct log_rec_undoredo { LOG_DATA data; int ulength, rlength; };
struct log_rec_undo { LOG_DATA data; int length; };
struct log_rec_redo { LOG_DATA data; int length; };

log_rec_undoredo carries ulength+rlength (lengths frame the two blobs); log_rec_undo carries one length (undo image only, logical undo); log_rec_redo carries one length (redo image only, page-physical redo). MVCC variants wrap these and attach an MVCC id plus vacuum bookkeeping:

// MVCC payload wrappers -- src/transaction/log_record.hpp
struct log_rec_mvcc_undoredo { LOG_REC_UNDOREDO undoredo; MVCCID mvccid; LOG_VACUUM_INFO vacuum_info; };
struct log_rec_mvcc_undo     { LOG_REC_UNDO    undo;     MVCCID mvccid; LOG_VACUUM_INFO vacuum_info; };
struct log_rec_mvcc_redo     { LOG_REC_REDO    redo;     MVCCID mvccid; };  /* no vacuum_info */

Struct	Wraps	Adds	Why
`log_rec_mvcc_undoredo`	`log_rec_undoredo`	`mvccid`, `vacuum_info`	MVCC ops vacuum tracks
`log_rec_mvcc_undo`	`log_rec_undo`	`mvccid`, `vacuum_info`	MVCC delete-style ops
`log_rec_mvcc_redo`	`log_rec_redo`	`mvccid` only	Pure redo creates no version to vacuum

log_vacuum_info is the back-pointer carried by the undo-bearing MVCC records (log_rec_mvcc_undoredo and log_rec_mvcc_undo):

// log_vacuum_info -- src/transaction/log_record.hpp
struct log_vacuum_info { LOG_LSA prev_mvcc_op_log_lsa; VFID vfid; };

Field	Role	Why
`prev_mvcc_op_log_lsa`	LSA of the previous MVCC-op record	Vacuum walks this chain in log order
`vfid`	File the change belongs to	Detect dropped/reused file; decide object kind

1.6 The staging node — `log_prior_node`

The append path materializes a record as a log_prior_node linked into the prior list — the central staging structure (Ch 3–5):

// log_prior_node -- src/transaction/log_append.hpp
struct log_prior_node
{
  LOG_RECORD_HEADER log_header;
  LOG_LSA start_lsa;  bool tde_encrypted;
  int data_header_length;  char *data_header;
  int ulength;  char *udata;   int rlength;  char *rdata;
  LOG_PRIOR_NODE *next;
};

Field	Role	Why
`log_header`	Embedded `log_rec_header`	Copied onto the page; `back_lsa`/`forw_lsa` filled at linking
`start_lsa` / `tde_encrypted`	Assigned LSA; encryption flag	LSA asserted vs page offset; flag drives `hdr.flags` at drain
`data_header_length` / `data_header`	Length + buffer of the `log_rec_*` struct	Serialized apart from variable data
`ulength`/`udata`, `rlength`/`rdata`	Length + buffer of undo / redo bytes	The two images, possibly compressed
`next`	Next node	Orders nodes awaiting drain

INVARIANT — a prior node owns its heap buffers. data_header, udata, rdata are independently malloc-ed; a length is zero exactly when its pointer is unused. Drain frees them after copying into the page buffer; a missed free leaks memory, and a double free corrupts the heap.

1.7 The prior-list anchor — `log_prior_lsa_info`

The in-memory anchor for the whole prior list — LSA cursor, list head/tail, and the serializing mutex:

// log_prior_lsa_info -- src/transaction/log_append.hpp
struct log_prior_lsa_info
{
  LOG_LSA prior_lsa; LOG_LSA prev_lsa;
  LOG_PRIOR_NODE *prior_list_header; LOG_PRIOR_NODE *prior_list_tail;
  INT64 list_size;                          /* bytes */
  LOG_PRIOR_NODE *prior_flush_list_header;
  std::mutex prior_lsa_mutex;
};

Field	Role	Why
`prior_lsa`	Next LSA to assign	Monotone allocator cursor; advanced by record size (Ch 4)
`prev_lsa`	LSA of the last appended record	Fills the next node’s `back_lsa`
`prior_list_header` / `prior_list_tail`	Head / tail of the awaiting-drain list	Drain start; O(1) append
`list_size`	Total bytes staged	Flusher decides when to drain
`prior_flush_list_header`	Head of the detached flush sublist	Drain steals here so producers keep appending
`prior_lsa_mutex`	Mutex over all the above	LSA assignment + linkage atomic

INVARIANT — prior_lsa_mutex serializes LSA assignment. prior_lsa is advanced and the node linked under one acquisition, so no two records share an LSA and list order matches LSA order; splitting the two mis-orders the drained page.
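
The "assign and link under one acquisition" rule can be sketched as follows. Everything here is hypothetical (sketch_node, sketch_prior_info, sketch_attach), and an LSA is flattened to a plain byte position that ignores page boundaries, which the real code does not do (Ch 6).

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical node: only the fields the sketch needs.
struct sketch_node
{
  std::int64_t size = 0;         // bytes this record will occupy
  std::int64_t start_lsa = -1;   // assigned at attach time
  std::int64_t back_lsa = -1;    // LSA of the physically previous record
  sketch_node *next = nullptr;
};

// Hypothetical anchor mirroring the roles of log_prior_lsa_info.
struct sketch_prior_info
{
  std::int64_t prior_lsa = 0;    // next LSA to hand out
  std::int64_t prev_lsa = -1;    // LSA of the last attached record
  sketch_node *head = nullptr;
  sketch_node *tail = nullptr;
  std::int64_t list_size = 0;    // staged bytes
  std::mutex prior_lsa_mutex;
};

// Assigns an LSA and links the node while holding the mutex once,
// so list order and LSA order cannot diverge.
inline std::int64_t sketch_attach (sketch_prior_info &info, sketch_node &node)
{
  std::lock_guard<std::mutex> guard (info.prior_lsa_mutex);
  node.start_lsa = info.prior_lsa;
  node.back_lsa = info.prev_lsa;
  info.prev_lsa = node.start_lsa;
  info.prior_lsa += node.size;
  node.next = nullptr;
  if (info.tail == nullptr)
    {
      info.head = &node;
    }
  else
    {
      info.tail->next = &node;
    }
  info.tail = &node;
  info.list_size += node.size;
  return node.start_lsa;
}
```

If the cursor advance and the tail splice were split across two lock acquisitions, a second producer could take a higher LSA yet link first, which is exactly the mis-ordering the invariant rules out.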

1.8 The on-disk append cursor — `log_append_info`

The disk-facing append point — open log file, fixed page, and the lowest LSA not yet on disk:

// log_append_info -- src/transaction/log_append.hpp
struct log_append_info
{
  int vdes;
  std::atomic<LOG_LSA> nxio_lsa;  /* Lowest LSA NOT yet written to disk (WAL) */
  LOG_LSA prev_lsa;  LOG_PAGE *log_pgptr;  bool appending_page_tde_encrypted;
  // ... condensed: get_nxio_lsa(), set_nxio_lsa() ...
};

Field	Role	Why
`vdes`	OS fd of the active log volume	Target of page writes
`nxio_lsa`	Atomic lowest LSA not yet flushed	WAL watermark; readers/flusher race without the prior mutex
`prev_lsa`	Last record appended to the buffer	Drain-side mirror of staging `prev_lsa`
`log_pgptr`	Currently fixed log page	Drain target; replaced on page boundary (Ch 6)
`appending_page_tde_encrypted`	Live page must be encrypted	Carries the node’s `tde_encrypted` onto the page

INVARIANT — nxio_lsa is the WAL durability watermark. Records with LSA < nxio_lsa are on disk; >= nxio_lsa are not. Flusher and WAL checks touch it concurrently, so it is std::atomic, reached only via get_nxio_lsa()/set_nxio_lsa(); a torn read lets a data page flush ahead of its log (Ch 7).
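
A minimal sketch of the watermark check, assuming an 8-byte LSA so that std::atomic can load and store it as one unit. The types sketch_lsa and sketch_append_info and the helper log_is_on_disk are hypothetical; only the get_nxio_lsa/set_nxio_lsa accessor names and the "LSA < nxio_lsa means on disk" rule come from the text.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical 8-byte LSA, small enough for a single atomic load/store.
struct sketch_lsa
{
  std::int64_t pageid : 48;
  std::int64_t offset : 16;
};

inline bool lsa_less (const sketch_lsa &a, const sketch_lsa &b)
{
  return a.pageid != b.pageid ? a.pageid < b.pageid : a.offset < b.offset;
}

// Hypothetical append cursor holding only the watermark.
struct sketch_append_info
{
  std::atomic<sketch_lsa> nxio_lsa { sketch_lsa { 0, 0 } };

  sketch_lsa get_nxio_lsa () const { return nxio_lsa.load (); }
  void set_nxio_lsa (const sketch_lsa &lsa) { nxio_lsa.store (lsa); }
};

// A record is durable when its LSA is strictly below the watermark.
inline bool log_is_on_disk (const sketch_append_info &append, const sketch_lsa &record_lsa)
{
  return lsa_less (record_lsa, append.get_nxio_lsa ());
}
```

A data-page writer would consult a check like log_is_on_disk with the page's last-modifying LSA and request a log flush first when it returns false.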

1.9 The caller inputs — `log_data_addr` and `log_crumb`

What a caller (heap/btree op) hands the append API; everything above derives from these:

// log_data_addr / log_crumb -- src/transaction/log_append.hpp
struct log_crumb { int length; const void *data; };
struct log_data_addr { const VFID *vfid; PAGE_PTR pgptr; PGLENGTH offset; };

Struct / Field	Role	Why
`log_crumb.length` / `.data`	One contiguous piece of caller data	Callers pass an array to gather scattered buffers
`log_data_addr.vfid`	File the page belongs to, or NULL	File/TDE context; `log_data.volid`/`pageid` come from `pgptr`, not `vfid`
`log_data_addr.pgptr`	Pointer to the fixed data page	Its volid/pageid extracted into `log_data`
`log_data_addr.offset`	Offset/slot of the change	Becomes `log_data.offset`; high bits hold `LOG_RV_RECORD_*` flags
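
To show why callers pass an array of crumbs, the sketch below gathers scattered buffers into one contiguous image. sketch_crumb mirrors the two fields of log_crumb; gather_crumbs is a hypothetical helper, and the real append path may copy the pieces differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Mirrors the shape of log_crumb: one contiguous piece of caller data.
struct sketch_crumb
{
  int length;
  const void *data;
};

// Hypothetical helper: concatenates the crumbs in array order.
inline std::vector<char> gather_crumbs (const sketch_crumb *crumbs, int num_crumbs)
{
  std::size_t total = 0;
  for (int i = 0; i < num_crumbs; ++i)
    {
      total += static_cast<std::size_t> (crumbs[i].length);
    }

  std::vector<char> out (total);
  std::size_t pos = 0;
  for (int i = 0; i < num_crumbs; ++i)
    {
      if (crumbs[i].length > 0)
        {
          std::memcpy (out.data () + pos, crumbs[i].data, static_cast<std::size_t> (crumbs[i].length));
          pos += static_cast<std::size_t> (crumbs[i].length);
        }
    }
  return out;
}
```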

1.10 On-disk page structures — `log_hdrpage` and `log_page`

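A physical log page is a log_hdrpage followed by the record area. area[1] is the struct hack: the array is declared with one element but actually runs to the end of the page, so never use sizeof (log_page) as the page size; use LOG_PAGESIZE: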

// log_hdrpage / log_page -- src/transaction/log_storage.hpp
struct log_hdrpage { LOG_PAGEID logical_pageid; PGLENGTH offset; short flags; int checksum; };
struct log_page { LOG_HDRPAGE hdr; char area[1]; };  /* area is flexible */

Field	Role	Why
`logical_pageid`	Page id in the infinite sequence	Matches `log_lsa.pageid`; identity check on read
`offset`	Offset of the first record starting here	Salvage anchor if the prior page is corrupt
`flags`	TDE bits (`..._ENCRYPTED_AES/ARIA`)	`LOG_IS_PAGE_TDE_ENCRYPTED` tests the mask
`checksum`	CRC32 over the page	Detects torn pages
`log_page.hdr`	The header above	Fixed page prefix
`log_page.area[]`	Record bytes	Sized by `LOG_PAGESIZE`

INVARIANT — LOG_PAGEID -9 is the header page. LOGPB_HEADER_PAGE_ID = -9 holds the log_header, carries no log records, and is duplicated into every archive. Code must never write a normal record onto pageid -9.
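
The struct-hack sizing rule from the start of this section can be made concrete with a small sketch. The field widths, the 16384-byte page size, and every sketch_ name are assumptions chosen for illustration; the real values come from the build and from db_logpagesize.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical page header with assumed field widths.
struct sketch_hdrpage
{
  std::int64_t logical_pageid;
  std::int16_t offset;
  std::int16_t flags;
  std::int32_t checksum;
};

// Struct hack: area is declared with one element but spans the rest of the page.
struct sketch_page
{
  sketch_hdrpage hdr;
  char area[1];
};

constexpr std::size_t SKETCH_LOG_PAGESIZE = 16384;   // assumed page size

// Bytes available for records: page size minus the header prefix.
constexpr std::size_t sketch_area_size ()
{
  return SKETCH_LOG_PAGESIZE - offsetof (sketch_page, area);
}

// Pages are allocated by page size, never by sizeof (sketch_page).
inline sketch_page *sketch_alloc_page ()
{
  return static_cast<sketch_page *> (std::calloc (1, SKETCH_LOG_PAGESIZE));
}
```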

1.11 The volume headers — `log_header` and `log_arv_header`

log_header is the master control block on page -9. Every member, grouped by role:

Field group	Fields	Role
Identity / safety	`magic`, `db_creation`, `db_release`, `db_compatibility`, `db_iopagesize`, `db_logpagesize`, `db_charset`	Refuse a log from an incompatible build/page size
Append cursor	`append_lsa`, `fpageid`, `eof_lsa`	Persisted append loc, pageid at slot 1, end of log
Recovery	`chkpt_lsa`, `smallest_lsa_at_last_chkpt`	Lowest LSA recovery starts from
Transaction / MVCC	`next_trid`, `mvcc_next_id`, `mvcc_op_log_lsa`, `oldest_visible_mvccid`, `newest_block_mvccid`, `vacuum_last_blockid`, `does_block_need_vacuum`	Next ids to assign; vacuum’s progress
Archive	`nxarv_pageid`, `nxarv_phy_pageid`, `nxarv_num`, `last_arv_num_for_syscrashes`, `last_deleted_arv_num`, `npages`	Drives Ch 10’s archiving
Backup	`bkup_level0_lsa`/`1`/`2`, `bkinfo[]`	Per-level incremental backup anchors
HA / lifecycle	`ha_server_state`, `ha_file_status`, `ha_promotion_time`, `is_shutdown`, `was_active_log_reset`, `has_logging_been_skipped`, `db_restore_time`, `mark_will_del`	Replication state; clean-shutdown flag
Alignment / misc	`dummy`, `dummy3`, `dummy4`, `vol_creation`, `avg_ntrans`, `avg_nlocks`, `was_copied`, `prefix_name`, `perm_status_obsolete`	`dummy` pads; `vol_creation` time; `avg_` sizing hints; `was_copied` resets a copied DB; `prefix_name` log prefix; `perm_status_obsolete` legacy

log_arv_header is the smaller header stamped on each archive file:

// log_arv_header -- src/transaction/log_storage.hpp
struct log_arv_header
{
  char magic[CUBRID_MAGIC_MAX_LENGTH];
  INT32 dummy; INT64 db_creation; INT64 vol_creation;
  TRANID next_trid; DKNPAGES npages; LOG_PAGEID fpageid;
  int arv_num; INT32 dummy2;
};

Field	Role	Why
`magic`	File-type magic	`file`/magic recognition + sanity
`db_creation` / `vol_creation`	Creation timestamps	Match archive to its database/volume
`next_trid`	Next trid at archive time	Recovery context
`npages`	Page count in this archive	Bounds the page range
`fpageid`	Logical pageid at physical slot 1	Maps physical to logical pages
`arv_num`	Archive sequence number	Matches `log_header.nxarv_num` chain
`dummy`, `dummy2`	Alignment pads	Keep the on-disk layout stable

1.12 Struct relationships

flowchart TB
  subgraph CALLER["Caller inputs"]
    DADDR["log_data_addr"]
    CRUMB["log_crumb[]"]
  end
  subgraph STAGE["Staging tier (memory)"]
    PLINFO["log_prior_lsa_info"]
    NODE["log_prior_node"]
    AINFO["log_append_info"]
  end
  subgraph REC["Record body (all tiers)"]
    HDR["log_rec_header"]
    PAY["log_rec_*\n+ MVCC wrappers"]
  end
  subgraph DISK["On-disk tier"]
    PAGE["log_page"]
    HPAGE["log_hdrpage"]
    LHDR["log_header (page -9)"]
    AHDR["log_arv_header"]
  end

  DADDR --> NODE
  CRUMB --> NODE
  PLINFO -->|owns list of| NODE
  NODE -->|embeds| HDR
  NODE -->|serializes| PAY
  PAY -->|embeds| LDATA["log_data"]
  PLINFO -->|drains to| AINFO
  AINFO -->|fixes / writes| PAGE
  PAGE -->|hdr is| HPAGE
  LHDR -->|append_lsa to| PAGE
  LHDR -->|nxarv_* feed| AHDR

Figure 1-1. How a record’s structures connect across the three tiers.

1.13 Pointer-relationship summary

The LSA/pointer edges a modifier must keep consistent:

Physical chain — log_rec_header.forw_lsa/back_lsa; prev_tranlsa.
Staging allocator — log_prior_lsa_info.prior_lsa/prev_lsa.
Durability watermark — log_append_info.nxio_lsa.
Vacuum chain — log_vacuum_info.prev_mvcc_op_log_lsa.

1.14 Chapter summary — key takeaways

A log record crosses three tiers — caller inputs, staging (log_prior_node anchored by log_prior_lsa_info, drained via log_append_info), and on-disk (log_page under log_header); the body (log_rec_header + a log_rec_* payload) is the constant.
log_lsa is the 48:16 bit-field clock; its total ordering founds every durability decision. log_rec_header threads each record into a physical doubly linked chain and a per-transaction chain, with type discriminating the append-only, hole-preserving log_rectype.
MVCC wrappers add mvccid and (undo only) log_vacuum_info; the redo wrapper omits vacuum_info — a pure redo creates no version to vacuum.
prior_lsa_mutex makes LSA assignment plus linkage atomic and nxio_lsa is the atomic WAL watermark — the two concurrency invariants the append path rests on; the on-disk page reserves pageid -9 for log_header, whose nxarv_* feed log_arv_header.

Chapter 2: Initialization and Memory

The reader question: before any record can be appended, how are the prior-list, the page-buffer pool, the flush bookkeeping, and the global log state bootstrapped and allocated? For conceptual roles — what the prior list is for, why WAL demands a ring — see the companion cubrid-log-manager.md (“The append pipeline”, “Durability”). This chapter is bring-up mechanics: who mallocs, what each field starts at, which teardown frees it. Two entry points, both under LOG_CS, both calling log_final first if a prior instance is mounted:

log_create_internal — runs once at DB creation: formats the active-log volume, writes the first LOG_HEADER to page -9, flushes one empty append page, then tears the pool back down. No live state survives.
log_initialize_internal — runs at every restart / SA boot: mounts the existing active log, reads page -9, keeps the pool alive, hands control to recovery.

2.1 The global singleton: `log_global` / `log_Gl`

Everything hangs off one process-wide singleton, log_Gl (struct log_global), default-constructed at static init by log_global::log_global (log_global.c); bring-up populates its members rather than allocating it.

// log_global -- src/transaction/log_impl.h  (condensed; #if SERVER_MODE members noted in the table)
struct log_global {
  TRANTABLE trantable;     LOG_APPEND_INFO append;      LOG_PRIOR_LSA_INFO prior_info;
  LOG_HEADER hdr;          LOG_ARCHIVES archive;        LOG_PAGEID run_nxchkpt_atpageid;
  LOG_LSA chkpt_redo_lsa;  DKNPAGES chkpt_every_npages; LOG_RECVPHASE rcv_phase; LOG_LSA rcv_phase_lsa;
  LOG_PAGE *loghdr_pgptr;  LOG_FLUSH_INFO flush_info;   LOG_GROUP_COMMIT_INFO group_commit_info;
  logwr_info *writer_info; /* the ONLY heap member of the ctor: new logwr_info() */
  BACKGROUND_ARCHIVING_INFO bg_archive_info; mvcctable mvcc_table; GLOBAL_UNIQUE_STATS_TABLE unique_stats_table;
  // #if SERVER_MODE: flushed_lsa_lower_bound, chkpt_lsa_lock, backup_in_progress; #else: final_restored_lsa
};

The ctor nulls every LSA-valued field to NULL_LSA, seeds flush_info to {0, 0, NULL, PTHREAD_MUTEX_INITIALIZER}, runs prior_info’s ctor (§2.5), and news writer_info (its only heap allocation). Every field:

Field	Role	Why it exists / ctor seed
`trantable`	Per-transaction `LOG_TDES` table	`area == NULL` is the “not initialized” sentinel; sized by `logtb_define_trantable_log_latch`.
`append`	Live append cursor (`vdes`, `log_pgptr`, `prev_lsa`, atomic `nxio_lsa`)	Where prior nodes drain into a page; Ch 4-5.
`prior_info`	In-memory prior-list head/tail + LSA cursors	Decouples LSA assignment from disk layout; Ch 3-5.
`hdr`	In-RAM copy of on-disk `LOG_HEADER` (`append_lsa`/`eof_lsa` live here)	Avoids re-reading page `-9`.
`archive`	Current archive descriptor cache	Used when a wanted page rolled into an archive.
`run_nxchkpt_atpageid`	Page id where next checkpoint fires	`NULL_PAGEID` during create/init; recomputed at end of init.
`flushed_lsa_lower_bound` / `chkpt_lsa_lock`	SERVER_MODE flush-coord LSA + chkpt-LSA mutex	`NULL_LSA` / `PTHREAD_MUTEX_INITIALIZER`.
`chkpt_redo_lsa` / `chkpt_every_npages`	Redo-start LSA + checkpoint frequency	Ctor seeds `NULL_LSA` / `INT_MAX`; the latter is recomputed from `PRM_ID_LOG_CHECKPOINT_NPAGES` at the end of init.
`rcv_phase` / `rcv_phase_lsa`	Recovery phase + its LSA	`LOG_RECOVERY_ANALYSIS_PHASE` / `NULL_LSA`; `log_final` resets phase.
`backup_in_progress` / `final_restored_lsa`	`#if` pair: SERVER backup flag vs SA last-restored LSA	One per build; `false` / `NULL_LSA`.
`loghdr_pgptr`	One `LOG_PAGESIZE` scratch page for header I/O	Global buffer `malloc`’d in `log_initialize_internal`, freed in `log_final` — distinct from the create-path local of the same name (§2.2).
`flush_info`	`toflush[]` + counters + mutex	Dirty append pages to push on a flush; §2.4.
`group_commit_info`	Mutex+cond for group commit	Lets committers coalesce fsyncs.
`writer_info`	HA log-writer state	Only ctor `new`; deleted in `~log_global`.
`bg_archive_info`	Background archiving descriptor	Init’d at tail of init if `PRM_ID_LOG_BACKGROUND_ARCHIVING` is on.
`mvcc_table` / `unique_stats_table`	MVCC snapshot table / global unique-index stats	Default-ctor / `GLOBAL_UNIQUE_STATS_TABLE_INITIALIZER`.

graph TD
  subgraph logGl["log_Gl (LOG_GLOBAL singleton)"]
    A["append : LOG_APPEND_INFO<br/>vdes, log_pgptr, prev_lsa, nxio_lsa"]
    P["prior_info : LOG_PRIOR_LSA_INFO<br/>prior_lsa, prev_lsa, list head/tail"]
    F["flush_info : LOG_FLUSH_INFO<br/>toflush[], max/num_toflush, mutex"]
  end
  PB["log_Pb (LOG_PB_GLOBAL_DATA)<br/>buffers[], pages_area, header_page"]
  F -- "toflush[] points into" --> PB
  A -- "log_pgptr points into" --> PB

Figure 2-1. The global singleton and the separately-declared page-buffer global log_Pb.

2.2 `log_create_internal` — first-ever bring-up

Runs under LOG_CS_ENTER. Every branch:

Stale-state guard: trantable.area != NULL → log_final (§2.7).
umask; logpb_initialize_pool (§2.3) allocates the ring. Error → goto error.
logpb_initialize_log_names builds log_Name_active etc. Error → goto error.
logpb_initialize_header (&log_Gl.hdr, ...) fills the in-RAM header (page count, db_logpagesize = LOG_PAGESIZE). Error → goto error.
logpb_create_header_page carves the page -9 buffer into a stack local loghdr_pgptr, declared in log_create_internal. This is not the global log_Gl.loghdr_pgptr of §2.1: the create path uses its own scratch page, separate from the restart-path I/O buffer.
fileio_format creates the active-log file; the compound if goto errors on any of vdes == NULL_VOLDES, logpb_fetch_start_append_page failing, or the local loghdr_pgptr == NULL:

// log_create_internal -- src/transaction/log_manager.c
  log_Gl.append.vdes = fileio_format (thread_p, db_fullname, log_Name_active, ...);
  if (log_Gl.append.vdes == NULL_VOLDES
      || logpb_fetch_start_append_page (thread_p) != NO_ERROR || loghdr_pgptr == NULL)
    goto error;     /* <- any one failure unwinds the whole pool */

Mark the empty append page dirty; logpb_flush_pages_direct writes the end-of-log mark.
memcpy the in-RAM hdr into the local loghdr_pgptr->area; logpb_flush_page writes page -9 (error → goto error; under CUBRID_DEBUG it reads back and asserts).
Clear log_pgptr, dismount, create volume-info/log-info files, register active + backup-info volumes via logpb_add_volume.
Normal exit: logpb_finalize_pool, LOG_CS_EXIT, NO_ERROR.

The error: label runs the same logpb_finalize_pool + LOG_CS_EXIT (returning ER_FAILED if unset). Create never leaves a live pool.

INVARIANT — page -9 is the single source of truth for log geometry. The only place a fresh LOG_HEADER is written from scratch; every later boot reads it back. The step-8 memcpy + synchronous logpb_flush_page enforces it. If that flush fails silently, restart reads garbage geometry (db_logpagesize, fpageid) and re-formats or refuses to mount.

2.2b `log_initialize_internal` — restart bring-up

Shares the early scaffolding but diverges at the mount: it reads page -9, keeps the pool, dispatches to recovery. Every branch in order:

Clean-state guard: trantable.area != NULL → log_final.
Log-names init: logpb_initialize_log_names failure is fatal (logpb_fatal_error then goto error), not a plain propagate.
loghdr_pgptr malloc: the global log_Gl.loghdr_pgptr (the page -9 I/O buffer for logpb_fetch_header/logpb_flush_header); NULL → fatal + goto error. Freed in log_final (§2.7) and on error:.
Pool init: logpb_initialize_pool (§2.3); error → goto error.
fileio_mount returning NULL_VOLDES splits two ways. Without a media crash (ismedia_crash == false): error_code = ER_IO_MOUNT_FAIL; goto error. With a media crash (ismedia_crash != false) an approximate header is synthesized: logpb_initialize_header supplies the geometry, chkpt_lsa is nulled, nxarv_* are maxed, and the forced fields below mark everything un-checkpointed, with LOG_RESET_APPEND_LSA syncing the result into prior_info:
```
// log_initialize_internal -- src/transaction/log_manager.c
  log_Gl.hdr.fpageid = LOGPAGEID_MAX;  log_Gl.hdr.append_lsa.pageid = LOGPAGEID_MAX;
  log_Gl.hdr.append_lsa.offset = 0;    LOG_RESET_APPEND_LSA (&log_Gl.hdr.append_lsa);
```
Non-NULL vdes else: logpb_fetch_header (&log_Gl.hdr) reads the real page -9 into the mirror.
Copy hdr.chkpt_lsa → chkpt_redo_lsa. restore_slave branch (ismedia_crash && r_args && r_args->restore_slave): copy db_creation, smallest_lsa_at_last_chkpt, append_lsa out into r_args for HA slave restore.
Prefix-name mismatch: strcmp(hdr.prefix_name, prefix_logname) != 0 → ER_LOG_INCOMPATIBLE_PREFIX_NAME (notification) and continue anyhow.
Page-size mismatch → recursive re-init: hdr.db_iopagesize != IO_PAGESIZE || hdr.db_logpagesize != LOG_PAGESIZE → db_set_page_size, logpb_finalize_pool, dismount, LOG_CS_EXIT, a second logtb_define_trantable_log_latch, then call log_initialize_internal again and return, so the buffers are rebuilt at the right size (cross-ref §2.8).
Compatibility checks (rel_get_disk_compatible, rel_is_log_compatible) goto error on incompatible versions; logtb_define_trantable_log_latch(-1) builds the live trantable; fileio_map_mounted verifies the log belongs to this DB (else undefine trantable + goto error).
Recovery dispatch: init_emergency == false && (hdr.is_shutdown == false || ismedia_crash) → prior run crashed → log_recovery. Else clean/emergency boot → logpb_fetch_start_append_page, read EOF record to seed prev_lsa via LOG_RESET_PREV_LSA(&eof->back_lsa), set is_shutdown = false, logpb_flush_header.
Prior/append LSA assert + reset (cross-ref §2.5): set rcv_phase = LOG_RESTARTED, then the defensive assert(0) + re-reset if append.prev_lsa/hdr.append_lsa diverge from prior_info; recompute chkpt_every_npages, run_nxchkpt_atpageid, bring up bg-archiving, LOG_CS_EXIT, return.

The error: label dismounts vdes if mounted, free_and_inits loghdr_pgptr, LOG_CS_EXIT, logpb_fatal_error — a failed restart aborts.
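
The recovery-dispatch condition in the list above reduces to a three-input predicate. The sketch below restates it; the enum sketch_boot_path and the function sketch_choose_boot_path are hypothetical names, while the boolean expression is the one quoted in the recovery-dispatch step.

```cpp
#include <cassert>

// Hypothetical labels for the two outcomes of the dispatch.
enum class sketch_boot_path
{
  run_recovery,   // prior run crashed (or media crash): log_recovery
  plain_boot      // clean shutdown or emergency boot: fetch append page, seed prev_lsa
};

// Restates: init_emergency == false && (is_shutdown == false || ismedia_crash)
inline sketch_boot_path sketch_choose_boot_path (bool init_emergency, bool is_shutdown, bool ismedia_crash)
{
  if (!init_emergency && (!is_shutdown || ismedia_crash))
    {
      return sketch_boot_path::run_recovery;
    }
  return sketch_boot_path::plain_boot;
}
```

Note that an emergency boot skips recovery even when the header says the previous run did not shut down cleanly.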

2.3 `logpb_initialize_pool` — the page-buffer ring

The ring lives in a separate global, log_Pb of type LOG_PB_GLOBAL_DATA, not inside log_Gl.

// log_pb_global_data / log_buffer -- src/transaction/log_page_buffer.c
struct log_pb_global_data {
  LOG_BUFFER *buffers;  LOG_PAGE *pages_area;  LOG_BUFFER header_buffer;  LOG_PAGE *header_page;
  int num_buffers;      LOGPB_PARTIAL_APPEND partial_append; };
struct log_buffer {
  volatile LOG_PAGEID pageid;  volatile LOG_PHY_PAGEID phy_pageid;  bool dirty;  LOG_PAGE *logpage; };

LOG_PB_GLOBAL_DATA holds buffers (the descriptor array), pages_area (one slab of num_buffers * LOG_PAGESIZE), header_buffer/header_page (the page -9 descriptor and its backing page), num_buffers, and partial_append (record-split-across-flush state, Ch 6). The per-page descriptor is LOG_BUFFER:

Field	Role	Why it exists
`pageid`	Logical id of the resident log-sequence page	`NULL_PAGEID` = free; lookups key on this. `volatile` — read without the lock.
`phy_pageid`	Physical offset in the active-log file	Translation cache so each flush skips `logpb_to_physical_pageid`.
`dirty`	Page differs from disk	Drives whether a slot is added to `toflush[]`.
`logpage`	Pointer into the shared `pages_area` slab	Decouples the small descriptor from the `LOG_PAGESIZE` payload.

Every branch, in order (the function asserts LOG_CS_OWN_WRITE_MODE):

log_append_init_zip (§2.6) — compression contexts come up before the ring.
If logpb_Initialized, logpb_finalize_pool (re-entrant safety), then assert pages_area == NULL.
num_buffers = prm_get_integer_value (PRM_ID_LOG_NBUFFERS).
malloc buffers. NULL → er_set + return ER_OUT_OF_VIRTUAL_MEMORY (no pool to unwind).
malloc pages_area (num_buffers * LOG_PAGESIZE). NULL → free_and_init(buffers), return.
memset slab to LOG_PAGE_INIT_VALUE; loop logpb_initialize_log_buffer (&buffers[i], pages_area + i*LOG_PAGESIZE) wires descriptor i to slab slot i, setting pageid = phy_pageid = NULL_PAGEID, dirty = false, and stamping the page header (logical_pageid = NULL_PAGEID, offset = NULL_OFFSET, flags = 0).
malloc header_page (one LOG_PAGESIZE); NULL → free both prior allocations, return. Wired into header_buffer — resident slot for page -9 (LOGPB_HEADER_PAGE_ID == -9).
logpb_initialize_flush_info (§2.4). Error → goto error.
partial_append.status = LOGPB_APPENDREC_SUCCESS; its aligned scratch page pointer is set.
logpb_Initialized = true; pthread_*_init the chkpt-lsa lock, group-commit cond/mutex, writer_info conds/mutexes; writer_info->is_init = true. Return NO_ERROR.

The error: label runs logpb_finalize_pool then logpb_fatal_error (aborts). A failure that reaches goto error is therefore fatal, unlike the early malloc failures, which merely return the error code.

INVARIANT — the descriptor array and the page slab are the same length and freed together. buffers[i].logpage always points at pages_area + i*LOG_PAGESIZE; logpb_locate_page recovers the index by (log_pg - pages_area) / LOG_PAGESIZE and asserts the round-trip. If num_buffers diverged between the two mallocs, that arithmetic indexes out of bounds.
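
The descriptor-to-slab wiring and the index round-trip can be sketched as below. The sketch uses std::vector instead of raw malloc, a caller-supplied page size, and hypothetical names throughout (sketch_buffer, sketch_pool, sketch_initialize_pool, sketch_locate_page); only the pointer arithmetic follows the invariant above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical descriptor: small bookkeeping plus a pointer into the slab.
struct sketch_buffer
{
  std::int64_t pageid;
  bool dirty;
  char *logpage;
};

struct sketch_pool
{
  std::vector<sketch_buffer> buffers;   // descriptor array
  std::vector<char> pages_area;         // one contiguous slab
  std::size_t page_size = 0;
};

// Wires descriptor i to slab slot i; both containers share num_buffers.
inline void sketch_initialize_pool (sketch_pool &pool, std::size_t num_buffers, std::size_t page_size)
{
  pool.page_size = page_size;
  pool.pages_area.assign (num_buffers * page_size, 0);
  pool.buffers.resize (num_buffers);
  for (std::size_t i = 0; i < num_buffers; ++i)
    {
      pool.buffers[i] = sketch_buffer { -1, false, pool.pages_area.data () + i * page_size };
    }
}

// Recovers the descriptor index from a page pointer and checks the round-trip.
inline std::size_t sketch_locate_page (const sketch_pool &pool, const char *log_pg)
{
  std::size_t index = static_cast<std::size_t> (log_pg - pool.pages_area.data ()) / pool.page_size;
  assert (index < pool.buffers.size ());
  assert (pool.buffers[index].logpage == log_pg);
  return index;
}
```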

flowchart TD
  S["init_zip; finalize_pool if re-entrant"] --> N["num_buffers = PRM_ID_LOG_NBUFFERS"]
  N --> B{"malloc buffers?"}
  B -- no --> E1["return ER_OUT_OF_VIRTUAL_MEMORY"]
  B -- yes --> P{"malloc pages_area?"}
  P -- no --> E2["free buffers; return"]
  P -- yes --> Hp{"malloc header_page?"}
  Hp -- no --> E3["free buffers+pages; return"]
  Hp -- yes --> Fi{"init_flush_info?"}
  Fi -- no --> Err["goto error: finalize_pool; fatal_error"]
  Fi -- yes --> Done["init mutexes; Initialized=true; NO_ERROR"]

Figure 2-2. Branch map of logpb_initialize_pool, every allocation-failure path.

2.4 `logpb_initialize_flush_info` — the dirty-page roster

LOG_FLUSH_INFO (embedded as log_Gl.flush_info) is the list of append pages a flush must push to disk.

// log_flush_info -- src/transaction/log_impl.h
struct log_flush_info {
  int max_toflush;  int num_toflush;  LOG_PAGE **toflush;
#if defined(SERVER_MODE)
  pthread_mutex_t flush_mutex;
#endif
};

Field	Role	Why it exists
`max_toflush`	Capacity, set to `num_buffers - 1`	One slot reserved (header flushes separately), so the roster never exceeds `num_buffers - 1`.
`num_toflush`	Live count of staged pages	Reset to 0 here and after each flush.
`toflush`	Array of `LOG_PAGE*` in ascending page-id order	`calloc`’d to `num_buffers` pointers; sorted so the writev issues contiguous I/O.
`flush_mutex`	(SERVER_MODE) serializes roster mutation	Log-flush thread and committers both touch it.

logpb_initialize_flush_info: if toflush != NULL it first calls logpb_finalize_flush_info (re-entrant) and asserts toflush == NULL. It then sets max_toflush = num_buffers - 1 and num_toflush = 0, and calloc’s toflush to num_buffers pointers (the extra slot is harmless slack). On NULL it er_sets ER_OUT_OF_VIRTUAL_MEMORY. The pthread_mutex_init runs either way, even after an allocation failure, and the function returns the error code, which the caller turns into goto error. logpb_finalize_flush_info reverses it: if toflush != NULL, lock, free_and_init(toflush), zero counters, unlock, pthread_mutex_destroy; when toflush is already NULL it is a no-op, so a double call is safe.
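
The re-entrant init / double-call-safe finalize pairing can be sketched without the mutex. The sketch_ names are hypothetical, the mutex handling is deliberately omitted, and the -1 error value is a placeholder for the real error code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical roster: capacity, live count, and the pointer array.
struct sketch_flush_info
{
  int max_toflush;
  int num_toflush;
  void **toflush;
};

// Safe to call twice: the second call sees toflush == nullptr and does nothing.
inline void sketch_finalize_flush_info (sketch_flush_info &fi)
{
  if (fi.toflush != nullptr)
    {
      std::free (fi.toflush);
      fi.toflush = nullptr;
      fi.max_toflush = 0;
      fi.num_toflush = 0;
    }
}

// Re-entrant: an existing roster is torn down before a new one is built.
inline int sketch_initialize_flush_info (sketch_flush_info &fi, int num_buffers)
{
  if (fi.toflush != nullptr)
    {
      sketch_finalize_flush_info (fi);
    }
  assert (fi.toflush == nullptr);

  fi.max_toflush = num_buffers - 1;   // one slot stays unused as slack
  fi.num_toflush = 0;
  fi.toflush = static_cast<void **> (std::calloc (static_cast<std::size_t> (num_buffers), sizeof (void *)));
  return fi.toflush != nullptr ? 0 : -1;   // -1 is a placeholder error value
}
```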

2.5 `prior_lsa_info` constructor — seeding the prior list

LOG_PRIOR_LSA_INFO heads the in-memory prior list (the staging area between a caller’s append request and the page buffer; Ch 3-5).

// log_prior_lsa_info -- src/transaction/log_append.hpp
struct log_prior_lsa_info {
  LOG_LSA prior_lsa;  LOG_LSA prev_lsa;  LOG_PRIOR_NODE *prior_list_header;  LOG_PRIOR_NODE *prior_list_tail;
  INT64 list_size;    LOG_PRIOR_NODE *prior_flush_list_header;  std::mutex prior_lsa_mutex;  log_prior_lsa_info (); };

Field	Role	Why it exists
`prior_lsa`	LSA the next appended node will receive	Advancing it under the mutex issues LSAs in monotonic order without touching disk.
`prev_lsa`	LSA of the previously appended node	Lets each new node store a `back_lsa` for backward chaining / undo.
`prior_list_header` / `prior_list_tail`	FIFO head (drain consumes) / tail (O(1) append)	Drain (Ch 5) reads head; new nodes splice at tail.
`list_size`	Queued byte count	Lets the drainer/flusher decide when to push.
`prior_flush_list_header`	Sub-list already promoted toward flush	Separates “appended” from “being flushed”.
`prior_lsa_mutex`	The hot lock of the whole subsystem	Every LSA assignment serializes here.

The ctor seeds everything empty; the real LSA seed is deferred to log_initialize_internal, which copies recovered header LSAs into both append and prior_info:

// log_prior_lsa_info ctor / LOG_RESET_*_LSA -- src/transaction/log_append.cpp
log_prior_lsa_info::log_prior_lsa_info ()   // every member: NULL_LSA / NULL / 0 / default mutex
  : prior_lsa (NULL_LSA), prev_lsa (NULL_LSA), prior_list_header (NULL), prior_list_tail (NULL)
  , list_size (0), prior_flush_list_header (NULL), prior_lsa_mutex () { }
void LOG_RESET_APPEND_LSA (const LOG_LSA *lsa)   // header drives prior_lsa
{ log_Gl.hdr.append_lsa = *lsa; log_Gl.prior_info.prior_lsa = *lsa; }
void LOG_RESET_PREV_LSA (const LOG_LSA *lsa)
{ log_Gl.append.prev_lsa = *lsa; log_Gl.prior_info.prev_lsa = *lsa; }

INVARIANT — prior_info.prior_lsa == hdr.append_lsa and prior_info.prev_lsa == append.prev_lsa at end of init. log_initialize_internal assert(0)s and re-resets on divergence (if (!LSA_EQ (&log_Gl.hdr.append_lsa, &log_Gl.prior_info.prior_lsa)) { assert (0); LOG_RESET_APPEND_LSA (...); } and the symmetric prev_lsa check). If it drifted, the first appended record would get an LSA disagreeing with where the cursor writes, corrupting the back-chain.

2.6 `log_append_init_zip` / `log_append_final_zip` — compression contexts

LOG_ZIP is the (de)compression scratch buffer: struct log_zip { LOG_ZIP_SIZE_T data_length = 0; LOG_ZIP_SIZE_T buf_size = 0; char *log_data = nullptr; }; (log_compress.h).

Field	Role	Why it exists
`data_length`	Bytes currently held	Result length after `log_zip`/`log_unzip`.
`buf_size`	Capacity of `log_data`	`log_zip_realloc_if_needed` grows it; avoids re-malloc per record.
`log_data`	The (de)compression buffer	Holds LZ4 output; `log_zip_alloc(IO_PAGESIZE)` sizes it.

log_append_init_zip branches on mode and PRM_ID_LOG_COMPRESS:

Compression disabled → log_Zip_support = false, return.
SERVER_MODE: log_Zip_support = true; the buffers are per-thread, allocated lazily on first use — log_append_get_zip_undo/_redo do if (thread_p->log_zip_undo == NULL) thread_p->log_zip_undo = log_zip_alloc (IO_PAGESIZE);.
SA-mode: allocate two process-global statics log_zip_undo/log_zip_redo plus a log_data_ptr scratch of IO_PAGESIZE * 2. If any is NULL → log_Zip_support = false and free whichever allocated (each under its own if). Else log_Zip_support = true.

log_append_final_zip mirrors it: if !log_Zip_support return; under SERVER_MODE nothing (per-thread buffers die with the thread entry); in SA-mode frees log_zip_undo/log_zip_redo/log_data_ptr. It runs from logpb_finalize_pool (§2.7), so zip teardown is tied to pool teardown.

INVARIANT — log_Zip_support is the single gate. All callers gate on it, never on the individual buffer pointers; init sets it false on any partial allocation failure so a half-allocated context is never used.
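A minimal sketch of the single-gate pattern for the stand-alone case, assuming an injectable allocator so a partial failure can be simulated. ZipBuf, ZipContext, Allocator, and zip_context_init are hypothetical names for illustration; the real code uses log_zip_alloc and file-static buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>

// Simplified stand-in for LOG_ZIP.
struct ZipBuf
{
  std::size_t data_length = 0;
  std::size_t buf_size = 0;
  char *log_data = nullptr;
};

// Hypothetical stand-alone-mode context: two zip buffers, one scratch, one gate.
struct ZipContext
{
  bool zip_support = false;
  ZipBuf *zip_undo = nullptr;
  ZipBuf *zip_redo = nullptr;
  char *data_ptr = nullptr;
};

// The allocator must hand out malloc-compatible memory (it is released with free).
using Allocator = std::function<void *(std::size_t)>;

static ZipBuf *zip_alloc (std::size_t size, const Allocator &alloc)
{
  void *mem = alloc (size);
  if (mem == nullptr)
    {
      return nullptr;
    }
  ZipBuf *z = new ZipBuf;
  z->buf_size = size;
  z->log_data = static_cast<char *> (mem);
  return z;
}

static void zip_free (ZipBuf *&z)
{
  if (z == nullptr)
    {
      return;
    }
  std::free (z->log_data);
  delete z;
  z = nullptr;
}

// Sets the gate to true only when every allocation succeeded; on any partial
// failure it frees whatever was allocated and leaves the gate false.
void zip_context_init (ZipContext &ctx, bool compress_enabled, std::size_t page_size,
                       const Allocator &alloc)
{
  assert (page_size > 0);
  ctx.zip_support = false;
  if (!compress_enabled)
    {
      return;
    }
  ctx.zip_undo = zip_alloc (page_size, alloc);
  ctx.zip_redo = zip_alloc (page_size, alloc);
  ctx.data_ptr = static_cast<char *> (alloc (page_size * 2));
  if (ctx.zip_undo == nullptr || ctx.zip_redo == nullptr || ctx.data_ptr == nullptr)
    {
      zip_free (ctx.zip_undo);
      zip_free (ctx.zip_redo);
      std::free (ctx.data_ptr);
      ctx.data_ptr = nullptr;
      return;
    }
  ctx.zip_support = true;
}
```

Callers of such a context would test only zip_support, never the individual pointers, which mirrors the invariant above.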

2.7 Teardown: `log_final` and `logpb_finalize_pool`

log_final is both the orderly shutdown path and the re-entrancy guard that create/init call up front. Branch-complete:

Destroy server daemons and system transactions; LOG_CS_ENTER; reset rcv_phase.
trantable.area == NULL → nothing initialized; exit.
Else !logpb_is_pool_initialized() → only trantable; logtb_undefine_trantable, exit.
Else append.vdes == NULL_VOLDES → pool but no volume; logpb_finalize_pool + logtb_undefine_trantable, exit.
Else abort every active transaction (log_abort), tracking anyloose_ends; flush to disk (logpb_flush_pages_direct + pgbuf_flush_all + fileio_synchronize_all).
Header branch: if !anyloose_ends && error_code == NO_ERROR, set hdr.is_shutdown = true and snap chkpt_lsa = append_lsa (clean — restart skips recovery). Else logpb_checkpoint.
logpb_flush_header, logpb_finalize_pool, logtb_undefine_trantable, dismount bg-archive + active volumes, free_and_init(loghdr_pgptr), LOG_CS_EXIT.

logpb_finalize_pool (from log_final and the create/init error paths) is idempotent — returns if !logpb_Initialized. Otherwise it reverses bring-up exactly: clear the append cursor (log_pgptr = NULL, nxio_lsa/prev_lsa = NULL_LSA, mirrored into prior_info), free_and_init buffers/pages_area/header_page, num_buffers = 0, logpb_Initialized = false, logpb_finalize_flush_info (§2.4), destroy chkpt + group-commit locks, finalize writer info, and finally log_append_final_zip (§2.6) — zip freed last, mirroring init’s zip-first, so no in-flight append (touching a per-thread LOG_ZIP) outlives its buffers.

2.8 The `LOGAREA_SIZE` / `LOG_PAGESIZE` relationship

A LOG_PAGE is LOG_PAGESIZE bytes (db_Log_page_size in storage_common.h). The first SSIZEOF(LOG_HDRPAGE) bytes are the page header; the rest is the record area: #define LOGAREA_SIZE (LOG_PAGESIZE - SSIZEOF(LOG_HDRPAGE)) (log_impl.h).

This constant constrains all record placement. The append macros (LOG_APPEND_ALIGN, LOG_APPEND_ADVANCE_WHEN_DOESNOT_FIT) compare append_lsa.offset against LOGAREA_SIZE and call logpb_next_append_page on overflow; LOG_PRIOR_LSA_LAST_APPEND_OFFSET() likewise returns LOGAREA_SIZE, so the prior-list and page-buffer sides agree on where a page ends (page-crossing is Chapter 6). For initialization, the relevant point is that the page geometry is fixed from the header’s db_logpagesize and validated against the running LOG_PAGESIZE.
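The geometry arithmetic can be illustrated with a small model. The names Lsa, logarea_size, and advance_when_does_not_fit are hypothetical, the page and header sizes used in the usage note are illustrative values only, and the use of >= in the fit test is an assumption about the macro rather than something this chapter quotes.

```cpp
#include <cassert>
#include <cstdint>

// Flat stand-in for log_lsa (no bit-fields).
struct Lsa
{
  std::int64_t pageid;
  std::int64_t offset;
};

// Record area = page size minus the page header, as in LOGAREA_SIZE.
constexpr std::int64_t logarea_size (std::int64_t log_pagesize, std::int64_t hdrpage_size)
{
  return log_pagesize - hdrpage_size;
}

// If `length` more bytes would reach the end of the record area, move the
// cursor to offset 0 of the next page; otherwise leave it where it is.
// ASSUMPTION: the boundary test uses >=.
inline Lsa advance_when_does_not_fit (Lsa cursor, std::int64_t length, std::int64_t area)
{
  assert (length >= 0 && cursor.offset >= 0 && cursor.offset <= area);
  if (cursor.offset + length >= area)
    {
      return Lsa { cursor.pageid + 1, 0 };
    }
  return cursor;
}
```

With an illustrative 16384-byte page and a 16-byte page header the record area is 16368 bytes, so a 400-byte item at offset 16000 moves to the next page while a 50-byte item at offset 100 stays put. If the area were computed from the wrong page size, both sides of the pipeline would disagree on where that boundary falls.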

INVARIANT — db_logpagesize must equal the running LOG_PAGESIZE. As traced in §2.2b step 9, log_initialize_internal checks db_iopagesize != IO_PAGESIZE || db_logpagesize != LOG_PAGESIZE; on mismatch it db_set_page_sizes, finalizes the pool, dismounts, and recursively re-enters itself so buffers are reallocated at the correct size. Otherwise LOGAREA_SIZE is computed against the wrong page size and records straddle physical page boundaries.

2.9 Chapter summary — key takeaways

Two entry points, different lifetimes. log_create_internal formats, writes page -9, finalizes the pool (no live state); log_initialize_internal mounts, reads page -9, keeps the pool live, runs recovery.
Restart has a richer branch tree (§2.2b): fatal log-names path, global loghdr_pgptr malloc, fileio_mount NULL_VOLDES split (media-crash header synthesis with LOGPAGEID_MAX vs ER_IO_MOUNT_FAIL), logpb_fetch_header, restore_slave copy-out, tolerated prefix mismatch, recursive page-size re-init, recovery-vs-clean dispatch.
Two globals: log_Gl (append/prior/header/flush) vs the separate ring log_Pb; flush_info.toflush[] and append.log_pgptr point into log_Pb.
The ring is two parallel allocations — LOG_BUFFER[] + one pages_area slab; descriptor i ↔ slab slot i, recovered by pointer arithmetic. Flush capacity is num_buffers - 1.
The prior list starts empty, LSA-seeded from the header via LOG_RESET_APPEND_LSA/LOG_RESET_PREV_LSA into both append and prior_info; init asserts they agree.
Compression is mode-split: SA-mode process-global LOG_ZIP statics, server-mode per-thread lazy; log_Zip_support is the single gate, false on any partial failure.
Teardown reverses bring-up, freeing flush-info and zip last; log_final’s is_shutdown = true branch lets the next boot skip recovery.

Chapter 3: Building a Prior Node from a Caller Request

When the engine modifies a page it calls a log_append_* API. Before the change can reach disk it must become a prior node — a heap-allocated LOG_PRIOR_NODE carrying a fully formed log record. This chapter answers: given a caller tuple (rcvindex, addr, undo_data, redo_data), how is a complete LOG_PRIOR_NODE built before it is handed an LSA?

The defining property of this phase is that it runs entirely outside the prior-list mutex: allocation, header sizing, payload copying, and compression all happen in the calling thread, on memory no other thread can reach yet. Only once the node is finished does Chapter 4’s prior_lsa_next_record take prior_lsa_mutex, stamp the LSA, and splice it in (the companion’s single-writer pipeline).

3.1 The append API surface — thin wrappers over crumbs

The entry points (log_append_undoredo_data, log_append_undo_data, log_append_redo_data, plus the *2 and *_recdes variants) package the caller’s contiguous buffer into one LOG_CRUMB and delegate to the crumbs API.

// log_append_undoredo_data -- src/transaction/log_manager.c
LOG_CRUMB undo_crumb, redo_crumb;
assert (0 == undo_length || undo_data != NULL);   /* <- zero length must mean NULL data */
undo_crumb.data = undo_data; undo_crumb.length = undo_length;   // ... redo_crumb the same ...
log_append_undoredo_crumbs (thread_p, rcvindex, addr, 1, 1, &undo_crumb, &redo_crumb);
// inside log_append_undoredo_crumbs: type from rcvindex alone:
LOG_RECTYPE rectype = LOG_IS_MVCC_OPERATION (rcvindex) ? LOG_MVCC_UNDOREDO_DATA : LOG_UNDOREDO_DATA;

A LOG_CRUMB is a (length, data) pair. The *2 variants synthesize a LOG_DATA_ADDR from (vfid, pgptr, offset); the _recdes variants wrap a RECDES. LOG_IS_MVCC_OPERATION is true for MVCC heap/btree ops and RVES_NOTIFY_VACUUM; the undo-only path picks LOG_(MVCC_)UNDO_DATA and the redo-only path LOG_(MVCC_)REDO_DATA. This rectype is the switch key for every sizing decision downstream. Before construction, log_append_*_crumbs runs a guard chain (Figure 3-1); each guard is a distinct early return, so the node is built only after all five pass:

flowchart TB
  B{"log_No_logging?"} -- yes --> B1["log_skip_logging; return"]
  B -- no --> D{"LOG_FIND_TDES == NULL?"}
  D -- yes --> D1["ER_LOG_UNKNOWN_TRANINDEX; return"]
  D -- no --> E{"not sysop AND not active AND not aborted?"}
  E -- yes --> E1["return, log nothing"]
  E -- no --> F{"log_can_skip_undo_logging?"}
  F -- yes --> F1["append redo crumbs only; return"]
  F -- no --> G["prior_lsa_alloc_and_copy_crumbs"]
  G --> H{"node == NULL?"} -- yes --> H1["return"]
  H -- no --> I["TDE encrypt; prior_lsa_next_record (Ch 4)"]

Figure 3-1 — Guard chain of log_append_undoredo_crumbs. When undo is skippable it degenerates to a redo-only append. log_append_undo_crumbs skips silently (no redo fallback); log_append_redo_crumbs uses log_can_skip_redo_logging.
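The guard chain of Figure 3-1 reduces to a sequence of early returns. The sketch below models it as a pure function over booleans; AppendContext, AppendOutcome, and undoredo_guard_chain are hypothetical names, and the real function performs the work (skip-logging bookkeeping, error setting, the redo-only append) instead of returning a label.

```cpp
#include <cassert>

// One label per exit of the guard chain in Figure 3-1.
enum class AppendOutcome
{
  SkipLogging,        // log_No_logging was set
  UnknownTranIndex,   // no transaction descriptor found
  NothingLogged,      // not in a sysop, not active, not aborted
  RedoOnly,           // undo is skippable: degenerate to a redo-only append
  AllocFailed,        // node construction returned NULL
  Appended            // node built and handed to Chapter 4
};

// Hypothetical snapshot of everything the guards look at.
struct AppendContext
{
  bool no_logging;
  bool tdes_found;
  bool in_sysop;
  bool is_active;
  bool is_aborted;
  bool can_skip_undo;
  bool node_allocated;
};

// Evaluates the guards in the order shown in the figure.
AppendOutcome undoredo_guard_chain (const AppendContext &c)
{
  // Sanity check of this sketch only: a transaction is not both active and aborted.
  assert (!(c.is_active && c.is_aborted));
  if (c.no_logging)
    {
      return AppendOutcome::SkipLogging;
    }
  if (!c.tdes_found)
    {
      return AppendOutcome::UnknownTranIndex;
    }
  if (!c.in_sysop && !c.is_active && !c.is_aborted)
    {
      return AppendOutcome::NothingLogged;
    }
  if (c.can_skip_undo)
    {
      return AppendOutcome::RedoOnly;
    }
  if (!c.node_allocated)
    {
      return AppendOutcome::AllocFailed;
    }
  return AppendOutcome::Appended;
}
```

The ordering matters: the cheap global checks come first, so a node is never allocated for a request that would be dropped anyway.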

3.2 `LOG_PRIOR_NODE` — the construction target

// struct log_prior_node -- src/transaction/log_append.hpp
struct log_prior_node {
  LOG_RECORD_HEADER log_header;
  LOG_LSA start_lsa;        /* for assertion */
  bool tde_encrypted;
  int data_header_length;   char *data_header;
  int ulength;              char *udata;
  int rlength;              char *rdata;
  LOG_PRIOR_NODE *next;
};

Field	Role	Why it exists
`log_header`	Only `.type` set here.	Record identity / switch key. LSA links filled in Ch 4 under the mutex.
`start_lsa`	Eventual LSA. Unset here — `/* for assertion */`.	Assigned by `prior_lsa_next_record` (Ch 4); read only by MVCC vacuum-header assertions. The node has no log position yet, so reading it during construction is a bug.
`tde_encrypted`	Whether the log page must be TDE-encrypted.	`false` at alloc, raised by `prior_set_tde_encrypted`; drives page-boundary encryption (Ch 6).
`data_header_length` / `data_header`	Byte size + separate malloc holding the filled `LOG_REC_*`.	From `rectype` via `sizeof(LOG_REC_*)`; separate buffer lets the drain (Ch 5) copy header then data independently.
`ulength` / `udata`	Stored undo length (high bit = zipped) + heap copy of undo bytes.	Node must own its payload; caller’s buffer may be freed after return. Drain copies exactly `ulength` bytes.
`rlength` / `rdata`	As above, for redo.	Redo payload ownership and length.
`next`	List pointer.	NULL here; Ch 4 sets it on append to `prior_list_tail`.

A finished node is assembled from separate mallocs: the node itself, data_header (for example a LOG_REC_UNDOREDO or LOG_REC_MVCC_UNDOREDO), and one copy per payload side (udata, rdata). That makes it self-owned; next, start_lsa, and the log_header LSA links stay blank until Ch 4.

Invariant — the node owns its payload by value. udata/rdata are always freshly malloc’d copies (the copiers always memcpy), never aliases of the caller’s buffers. If violated, the asynchronous drain in Ch 5 could read freed memory.
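The ownership rule can be shown with a small crumb copier: one allocation sized to the sum of the crumb lengths, then one memcpy per crumb. Crumb, OwnedPayload, copy_crumbs_to_payload, and free_payload are hypothetical names; this is a sketch of the idea, not the CUBRID copier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Stand-in for LOG_CRUMB: a (length, data) pair borrowed from the caller.
struct Crumb
{
  int length;
  const void *data;
};

// Stand-in for one payload side of the node (ulength/udata or rlength/rdata).
struct OwnedPayload
{
  int length = 0;
  char *data = nullptr;
};

// Copies every crumb into one freshly allocated buffer owned by `out`.
// Returns false only when the allocation fails; an empty input is not an error.
bool copy_crumbs_to_payload (OwnedPayload &out, int num_crumbs, const Crumb *crumbs)
{
  assert (out.data == nullptr);
  int total = 0;
  for (int i = 0; i < num_crumbs; i++)
    {
      total += crumbs[i].length;
    }
  if (total <= 0)
    {
      return true;
    }
  char *buf = static_cast<char *> (std::malloc (static_cast<std::size_t> (total)));
  if (buf == nullptr)
    {
      return false;
    }
  char *cursor = buf;
  for (int i = 0; i < num_crumbs; i++)
    {
      if (crumbs[i].length > 0)
        {
          std::memcpy (cursor, crumbs[i].data, static_cast<std::size_t> (crumbs[i].length));
          cursor += crumbs[i].length;
        }
    }
  out.length = total;
  out.data = buf;
  return true;
}

// Releases the owned copy.
void free_payload (OwnedPayload &p)
{
  std::free (p.data);
  p.data = nullptr;
  p.length = 0;
}
```

Because the payload is a copy, the caller may overwrite or free its own buffers immediately after the append call returns without affecting what is later drained.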

3.3 Allocation dispatch — `prior_lsa_alloc_and_copy_crumbs`

prior_lsa_alloc_and_copy_crumbs mallocs the node, zeroes every construction field, sets log_header.type, then dispatches:

// prior_lsa_alloc_and_copy_crumbs -- src/transaction/log_append.cpp
node->log_header.type = rec_type; node->tde_encrypted = false; /* ... all payload fields zeroed ... */
switch (rec_type) {
  case LOG_UNDOREDO_DATA: ... case LOG_MVCC_REDO_DATA:         /* all 8 undo/redo families */
    error = prior_lsa_gen_undoredo_record_from_crumbs (thread_p, node, rcvindex, addr, ...); break;
  default: assert_release (false); error = ER_FAILED; break;   /* <- crumbs path is undo/redo only */
}

On error it frees data_header, udata, rdata, then the node, and returns NULL — the caller (§3.1) treats NULL as “give up silently.”

The sibling prior_lsa_alloc_and_copy_data handles non-crumb families (postpone, compensate, commit, sysop, 2PC): its switch routes undo/redo cases to assert_release(false) and the rest to prior_lsa_gen_record, prior_lsa_gen_postpone_record, etc. — so the two allocators partition the type space: crumbs for undo/redo data, plain copy for control records.

prior_lsa_gen_record is the plain-copy builder Chapters 8–10 lean on for commit/abort/sysop nodes. It does no compression and no MVCC stamping — only sizes, allocates, and copies an optional undo blob; the header contents are filled by the caller. Its three branches:

Branch	Effect
`switch (rec_type)`	Dummy/decision records (`LOG_DUMMY_HEAD_POSTPONE`, `LOG_2PC_*_DECISION`, `LOG_START_CHKPT`, `LOG_SYSOP_ATOMIC_START`) assert `length==0 && data==NULL` and leave `data_header_length == 0`; control records set `data_header_length = sizeof(LOG_REC_*)` (e.g. `LOG_COMMIT`/`LOG_ABORT` → `LOG_REC_DONETIME`, `LOG_SYSOP_END` → `LOG_REC_SYSOP_END`); `default` leaves it 0.
`if (data_header_length > 0)`	Mallocs the header (memset in debug builds); on failure raises `ER_OUT_OF_VIRTUAL_MEMORY` and returns immediately — no `udata` copy attempted.
`if (length > 0)`	Copies the optional undo blob via `prior_lsa_copy_undo_data_to_node`, propagating its error code; otherwise returns `NO_ERROR`.

3.4 `prior_lsa_gen_undoredo_record_from_crumbs` — the core builder

The builder runs four phases (Figure 3-2). It sums the crumb lengths, fetches the per-side zip scratch (log_append_get_zip_undo/_redo), and derives its flags from the record type. An undo/redo type (LOG_IS_UNDOREDO_RECORD_TYPE) sets has_undo and has_redo and can zip only if each side either has its scratch buffer or is zero-length; a redo type (LOG_IS_REDO_RECORD_TYPE) sets has_redo and needs zip_redo; any other type is undo-only and needs zip_undo. In every case the result is &&-gated by log_Zip_support into can_zip.

It then (optionally) compresses (§3.5), sizes and mallocs the typed header, aims local pointers at its sub-fields, fills the shared LOG_DATA, and copies the payloads. Pointer aiming uses a fall-through switch: each MVCC arm grabs its extra mvccid_p/vacuum_info_p, then [[fallthrough]] into the non-MVCC arm for the shared length/data pointers — UNDO sets ulength_p only, REDO rlength_p only, UNDOREDO both:

// prior_lsa_gen_undoredo_record_from_crumbs -- src/transaction/log_append.cpp
case LOG_MVCC_UNDOREDO_DATA: case LOG_MVCC_DIFF_UNDOREDO_DATA:  /* MVCC arm: extra ptrs, then fall through */
  vacuum_info_p = &mvcc_undoredo_p->vacuum_info; mvccid_p = &mvcc_undoredo_p->mvccid; [[fallthrough]];
case LOG_UNDOREDO_DATA: case LOG_DIFF_UNDOREDO_DATA:            /* shared: aim both length ptrs + log_data_p */
  data_header_ulength_p = &undoredo_p->ulength; ... log_data_p = &undoredo_p->data; break;

The shared LOG_DATA is filled from addr: rcvindex, offset, and (pageid, volid) via pgbuf_get_vpid_ptr — or NULL_PAGEID/NULL_VOLID when addr->pgptr == NULL (logical logging).

flowchart TB
  M["Phase 1: sum lengths, get zip scratch, compute has_undo/has_redo/can_zip"] --> Z{"can_zip AND\nsome side >= thr?"}
  Z -- yes --> ZB["Phase 2: log_diff + log_zip; if both zipped rewrite type to *_DIFF_*"]
  Z -- no --> HSZ["Phase 3a: size header by type"]
  ZB --> HSZ
  HSZ --> MAL{"malloc data_header OK?"}
  MAL -- no --> ERR["ER_OUT_OF_VIRTUAL_MEMORY; goto error"]
  MAL -- yes --> PTR["Phase 3b: aim ptrs via fall-through switch, fill LOG_DATA, stamp MVCCID/vacuum if set"]
  PTR --> CP["Phase 4: copy udata/rdata (zipped or raw)"]
  CP --> RET["return NO_ERROR"]
  ERR --> RETE["return error_code"]

Figure 3-2 — Control flow of prior_lsa_gen_undoredo_record_from_crumbs. Every branch reaches return NO_ERROR or the error: label, which frees data_header/udata/rdata.
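The fall-through pointer aiming in Phase 3b can be modelled without any real headers by recording which pointers each record type ends up with. RecType, Aimed, and aim_pointers are hypothetical names; that the MVCC redo arm sets only the MVCCID pointer (and no vacuum info) is an assumption of this sketch.

```cpp
#include <cassert>

// The eight undo/redo record families handled by the crumbs path.
enum class RecType
{
  Undo, Redo, UndoRedo, DiffUndoRedo,
  MvccUndo, MvccRedo, MvccUndoRedo, MvccDiffUndoRedo
};

// Which of the builder's local pointers are non-NULL after the switch.
struct Aimed
{
  bool ulength = false;
  bool rlength = false;
  bool mvccid = false;
  bool vacuum_info = false;
};

// Each MVCC arm sets its extra pointers, then falls through into the
// non-MVCC arm that sets the shared length pointers.
Aimed aim_pointers (RecType type)
{
  Aimed a;
  switch (type)
    {
    case RecType::MvccUndoRedo:
    case RecType::MvccDiffUndoRedo:
      a.mvccid = a.vacuum_info = true;
      [[fallthrough]];
    case RecType::UndoRedo:
    case RecType::DiffUndoRedo:
      a.ulength = a.rlength = true;
      break;
    case RecType::MvccUndo:
      a.mvccid = a.vacuum_info = true;
      [[fallthrough]];
    case RecType::Undo:
      a.ulength = true;
      break;
    case RecType::MvccRedo:
      a.mvccid = true;   // ASSUMPTION: no vacuum info on the redo-only MVCC layout
      [[fallthrough]];
    case RecType::Redo:
      a.rlength = true;
      break;
    default:
      assert (false);    // the crumbs path handles undo/redo families only
      break;
    }
  return a;
}
```

The design choice is that the non-MVCC layout is written once per shape, and each MVCC variant only adds what differs.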

3.5 The compression branch — boundary is the node, not the page

CUBRID compresses per record (per prior node), never per log page — which is why it lives in construction, before any LSA or page is assigned: the compressed bytes are sized into ulength/rlength and copied into the node, so Ch 6’s page-boundary logic never sees uncompressed data. Two globals gate it; scratch is a per-side LOG_ZIP:

// src/transaction/log_append.cpp ; src/transaction/log_compress.h
bool log_Zip_support = false;            /* <- master toggle, from prm */
int log_Zip_min_size_to_compress = 255;  /* <- per-side threshold (bytes) */
struct log_zip { LOG_ZIP_SIZE_T data_length = 0; LOG_ZIP_SIZE_T buf_size = 0; char *log_data = nullptr; };

LOG_ZIP holds one result, all three fields: log_data is the output buffer (prior_lsa_copy_*_data_to_node memcpys from it), data_length its produced length (what MAKE_ZIP_LEN wraps into the header; raw if it did not shrink), buf_size its log_zip_alloc-set capacity (IO_PAGESIZE + LZ4 bound) so it is not reallocated per record. Scratch comes from log_append_get_zip_undo/_redo: per-thread in SERVER_MODE (thread_p->log_zip_undo, lazily log_zip_alloc’d), file-static singletons stand-alone. If thread_p is NULL and unresolvable via thread_get_thread_entry_info, the getter returns NULL, which forces can_zip false for that side via the zip_* != NULL clause.

The compression block and the length-stamping copy run as one unit:

// prior_lsa_gen_undoredo_record_from_crumbs -- src/transaction/log_append.cpp
// (condensed: thr abbreviates log_Zip_min_size_to_compress)
if (can_zip && (ulength >= log_Zip_min_size_to_compress || rlength >= log_Zip_min_size_to_compress)) {
    if (ulength >= thr && rlength >= thr) {
        (void) log_diff (ulength, undo_data, rlength, redo_data);   /* <- redo diffed against undo */
        is_undo_zip = log_zip (zip_undo, ulength, undo_data);
        is_redo_zip = log_zip (zip_redo, rlength, redo_data);
        if (is_redo_zip) is_diff = true;
    } else { if (ulength >= thr) is_undo_zip = log_zip (zip_undo, ulength, undo_data);
             if (rlength >= thr) is_redo_zip = log_zip (zip_redo, rlength, redo_data); }
}
if (is_diff) node->log_header.type = is_mvcc_op ? LOG_MVCC_DIFF_UNDOREDO_DATA : LOG_DIFF_UNDOREDO_DATA;
// ... after header sized/aimed, undo arm (redo symmetric): ...
if (is_undo_zip) { *data_header_ulength_p = MAKE_ZIP_LEN (zip_undo->data_length);   /* <- sets 0x80000000 */
    error_code = prior_lsa_copy_undo_data_to_node (node, zip_undo->data_length, (char *) zip_undo->log_data);
} else if (has_undo) { *data_header_ulength_p = ulength;
    error_code = prior_lsa_copy_undo_crumbs_to_node (node, num_ucrumbs, ucrumbs); }

Four outcomes: neither side over threshold (skipped); both large (log_diff rewrites redo as its difference from undo, then both zip, flipping the type to *_DIFF_* if redo zipped); only one large (that side zips, no diff); log_zip returns false (copied raw). MAKE_ZIP_LEN(len) is len | 0x80000000; recovery strips it via GET_ZIP_LEN/ZIP_CHECK.

Invariant — header length encodes compression state. Whether a side is zipped is recorded only in the sign bit of the header length field; a zipped payload written without MAKE_ZIP_LEN would feed compressed bytes straight to recovery and corrupt the page. Pairing is_*_zip with MAKE_ZIP_LEN in the same arm guarantees they never diverge.
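The length encoding round-trips as follows. This sketch uses unsigned 32-bit arithmetic to keep the bit operations well defined; make_zip_len, get_zip_len, zip_check, and encode_length are lower-case stand-ins for the macros named above, not the macros themselves.

```cpp
#include <cassert>
#include <cstdint>

// High bit of the stored length marks a zipped payload.
constexpr std::uint32_t ZIP_FLAG = 0x80000000u;

constexpr std::uint32_t make_zip_len (std::uint32_t len)
{
  return len | ZIP_FLAG;
}

constexpr std::uint32_t get_zip_len (std::uint32_t stored)
{
  return stored & ~ZIP_FLAG;
}

constexpr bool zip_check (std::uint32_t stored)
{
  return (stored & ZIP_FLAG) != 0;
}

// Pairs the "was it zipped" decision with the length stamp in one place, so
// the flag and the payload format cannot drift apart.
inline std::uint32_t encode_length (std::uint32_t len, bool zipped)
{
  assert (len < ZIP_FLAG);   // the real length must not already use the flag bit
  return zipped ? make_zip_len (len) : len;
}
```

A reader first calls zip_check on the stored value to decide whether to decompress, then get_zip_len to learn how many bytes to read; a 300-byte zipped payload is stored as 0x8000012C.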

The copier prior_lsa_copy_undo_data_to_node (_redo_ mirrors it) mallocs length bytes (returns NO_ERROR for length <= 0 || data == NULL, ER_OUT_OF_VIRTUAL_MEMORY on failure), memcpys, and sets node->ulength; the crumb copiers malloc once then memcpy each crumb. Either way node->ulength/rlength holds the stored length.

3.6 Stamping MVCC identity

For MVCC types the pointer switch left mvccid_p/vacuum_info_p non-NULL, so two extra fills run. The MVCCID comes from the current TDES, preferring the innermost sub-transaction id:

// prior_lsa_gen_undoredo_record_from_crumbs -- src/transaction/log_append.cpp
if (mvccid_p != NULL) {
    tdes = LOG_FIND_CURRENT_TDES (thread_p);
    if (tdes == NULL || !MVCCID_IS_VALID (tdes->mvccinfo.id)) {
        assert_release (false); error_code = ER_FAILED; goto error;   /* <- MVCC op needs an MVCCID */
    } else if (!tdes->mvccinfo.sub_ids.empty ()) *mvccid_p = tdes->mvccinfo.sub_ids.back ();  /* nested sysop */
    else *mvccid_p = tdes->mvccinfo.id;
}

vacuum_info_p gets the file id: it is copied from addr->vfid when that is set; when addr->vfid is NULL, a null VFID is stored for RVES_NOTIFY_VACUUM and any other rcvindex trips assert_release(false). prev_mvcc_op_log_lsa is set to NULL and completed later in Ch 4’s prior_lsa_next_record_internal, which links the record into the vacuum chain once the LSA is known. These two fields, plus start_lsa, are the only ones here that depend on transaction/log state rather than the caller tuple. The two record layouts:

// struct log_rec_undoredo / log_rec_mvcc_undoredo -- src/transaction/log_record.hpp
struct log_rec_undoredo { LOG_DATA data; int ulength; int rlength; };
struct log_rec_mvcc_undoredo { LOG_REC_UNDOREDO undoredo; MVCCID mvccid; LOG_VACUUM_INFO vacuum_info; };

Every field: data (the LOG_DATA triple rcvindex/pageid/offset plus volid) is where recovery dispatches and locates the bytes; ulength/rlength are the stored lengths (high bit = zipped). The MVCC variant embeds undoredo so non-MVCC readers share code, then adds mvccid (the writer’s id, for vacuum and visibility) and vacuum_info (prev_mvcc_op_log_lsa back-link + owning vfid; back-link filled Ch 4).
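The MVCCID selection rule from this section (innermost sub-transaction id if any, otherwise the transaction's own id, error if none is valid) can be sketched as a pure function. Mvccid, MvccInfo, and select_record_mvccid are hypothetical names, and treating 0 as the invalid id is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

using Mvccid = std::uint64_t;

// ASSUMPTION: 0 plays the role of the invalid/null MVCCID in this sketch.
constexpr Mvccid MVCCID_NULL_SKETCH = 0;

// Stand-in for the relevant part of tdes->mvccinfo.
struct MvccInfo
{
  Mvccid id = MVCCID_NULL_SKETCH;
  std::vector<Mvccid> sub_ids;   // innermost sub-transaction id is at the back
};

// Returns the id to stamp into the record, or nullopt where the real builder
// would assert and fail (no descriptor, or no valid transaction MVCCID).
std::optional<Mvccid> select_record_mvccid (const MvccInfo *info)
{
  if (info == nullptr || info->id == MVCCID_NULL_SKETCH)
    {
      return std::nullopt;
    }
  if (!info->sub_ids.empty ())
    {
      assert (info->sub_ids.back () != MVCCID_NULL_SKETCH);
      return info->sub_ids.back ();
    }
  return info->id;
}
```

Note that the validity check is made on the top-level id even when a sub-id is present, matching the order of the branches in the excerpt above.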

3.7 Chapter summary — key takeaways

The public log_append_* APIs are thin — wrap the buffer in a LOG_CRUMB, delegate to log_append_*_crumbs, which pick the LOG_RECTYPE from rcvindex and run a five-guard chain first.
Construction is lock-free and self-owning — all work outside prior_lsa_mutex; the node owns three mallocs (node, data_header, payload copies) so the async drain never touches caller memory.
Two allocators partition the type space — ..._crumbs → prior_lsa_gen_undoredo_record_from_crumbs for undo/redo data; ..._data → prior_lsa_gen_record for control records, whose three branches size the header (0 for dummy/decision types), malloc with an ER_OUT_OF_VIRTUAL_MEMORY bail, and copy an optional undo blob.
The core builder runs measure → compress → size+fill the typed header → copy payloads, a [[fallthrough]] switch sharing the non-MVCC layout across the UNDO/REDO/UNDOREDO shapes.
Compression is per-node, not per-page — gated by log_Zip_support and the 255-byte threshold with per-thread LOG_ZIP scratch (NULL thread_p ⇒ no compression); both-sides-large triggers log_diff and may rewrite the type to *_DIFF_*. The zipped/raw choice is recorded only in the length’s high bit via MAKE_ZIP_LEN.
MVCC records get MVCCID + vacuum info from the TDES (sub-id preferred); prev_mvcc_op_log_lsa and start_lsa stay NULL/blank until the LSA is assigned in Chapter 4 — reading start_lsa during construction is a bug.

Chapter 4: LSA Assignment and Attach to the Prior List

Chapter 3 left us holding a fully formed LOG_PRIOR_NODE whose payload is populated but whose position in the log is unknown. This chapter assigns the node its LSA and splices it onto the prior-list tail inside one short mutex-guarded critical section. For why CUBRID stages records in an in-memory prior list, see the “prior list” section of cubrid-log-manager.md. The payoff:

Invariant 4-A (LSA order = mutex-acquisition order). Every LSA the engine hands out is monotonically increasing, and the order in which two threads receive their LSAs is exactly the order in which they acquired prior_info.prior_lsa_mutex. The mutex is held for only an O(1) sequence of pointer/offset updates — no I/O, no allocation.

4.1 The structs in play

Three structs meet here: the node (Ch 1; only the fields this chapter writes), the global cursor, and the embedded on-disk record header.

log_prior_node (fields written in this chapter)

Field	Role	Why it exists
`log_header`	`LOG_RECORD_HEADER` — bytes that will physically precede the record in the page	Carries the four linkage LSAs + trid + type that recovery walks
`start_lsa`	The LSA this node is assigned	Returned to the caller as the record’s identity; cross-checked when the node is drained
`tde_encrypted`	Whether the holding page must be TDE-encrypted	Set by `prior_set_tde_encrypted`; read when the page is allocated/flushed
`data_header_length`	Byte length of `data_header`	Drives the offset advance for the data-header region
`data_header`	The typed record header (e.g. `LOG_REC_SYSOP_END`)	Cast on the matched `type` arm to read MVCC/sysop sub-fields
`ulength` / `udata`	Undo payload length / buffer	`ulength>0` triggers an offset advance for undo data
`rlength` / `rdata`	Redo payload length / buffer	`rlength>0` triggers an offset advance for redo data
`next`	Singly-linked pointer to the next prior node	Set when this node becomes the new tail

log_rec_header (LOG_RECORD_HEADER) — every field

The physical record header; prior_lsa_start_append/prior_lsa_end_append exist almost entirely to fill it.

Field	Role	Why it exists
`prev_tranlsa`	Previous log record of the same transaction	Lets undo/rollback walk one transaction’s records backward without scanning the whole log
`back_lsa`	Previous physical record (any transaction)	Lets recovery walk the global log backward
`forw_lsa`	Next physical record	Lets analysis/redo walk forward; known only after this record’s size is fixed, so filled in `prior_lsa_end_append`
`trid`	Transaction id owning this record	Recovery groups records by transaction; set from `tdes->trid`
`type`	`LOG_RECTYPE` (e.g. `LOG_COMMIT`, `LOG_SYSOP_END`)	Dispatch key for every `type`-specific branch in `prior_lsa_next_record_internal`

log_prior_lsa_info (the global cursor; `log_Gl.prior_info`) — every field

Field	Role	Why it exists
`prior_lsa`	The next LSA to assign — the moving cursor	Every node copies this into `start_lsa`; advanced by the offset helpers as the node’s bytes are accounted for
`prev_lsa`	LSA of the last record appended to the prior stream	Becomes the new node’s `back_lsa`, then is updated to point at the new node
`prior_list_header`	Head of the singly-linked prior list	The drain side (Chapter 5) consumes from the head
`prior_list_tail`	Tail of the prior list	New nodes attach here in O(1)
`list_size`	Bytes staged but not yet flushed	Compared against `logpb_get_memsize()` to decide when to force a flush
`prior_flush_list_header`	Head of the detached list being flushed	Set when the list is unhooked for draining (Chapter 5)
`prior_lsa_mutex`	`std::mutex` serializing the whole assignment	The single lock whose acquisition order defines LSA order (Invariant 4-A)

flowchart LR
  subgraph G["log_Gl.prior_info (LOG_PRIOR_LSA_INFO)"]
    PL["prior_lsa<br/>(next LSA cursor)"]
    PV["prev_lsa<br/>(last record)"]
    H["prior_list_header"]
    T["prior_list_tail"]
    M["prior_lsa_mutex"]
  end
  N["new LOG_PRIOR_NODE<br/>start_lsa, log_header, next"]
  PL -- "copied into" --> N
  PV -- "copied into log_header.back_lsa" --> N
  T -- "->next = node, then tail = node" --> N
  M -. "guards all of the above" .- G

Figure 4-1. The cursor feeds the node its identity and linkage, then adopts the node as its new tail.

4.2 The entry points: with_lock and the LOG_PRIOR_LSA_LOCK enum

Two public entry points, one shared body; the only difference is whether the caller already holds prior_lsa_mutex.

// prior_lsa_next_record / _with_lock -- src/transaction/log_append.cpp
LOG_LSA
prior_lsa_next_record (THREAD_ENTRY *thread_p, LOG_PRIOR_NODE *node, log_tdes *tdes)
{ return prior_lsa_next_record_internal (thread_p, node, tdes, LOG_PRIOR_LSA_WITHOUT_LOCK); }

LOG_LSA
prior_lsa_next_record_with_lock (THREAD_ENTRY *thread_p, LOG_PRIOR_NODE *node, log_tdes *tdes)
{ return prior_lsa_next_record_internal (thread_p, node, tdes, LOG_PRIOR_LSA_WITH_LOCK); }

The with_lock argument is one of the two values below (the enum has no comments in the source; the annotations here are editorial):

// LOG_PRIOR_LSA_LOCK -- src/transaction/log_append.hpp
enum LOG_PRIOR_LSA_LOCK
{
  LOG_PRIOR_LSA_WITHOUT_LOCK = 0,   // internal locks/unlocks the mutex itself
  LOG_PRIOR_LSA_WITH_LOCK = 1       // caller already holds the mutex
};

The _with_lock variant lets a caller emit several records with no interleaving: take the mutex once, call _with_lock repeatedly. The plain variant is the common single-record path.
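The two calling conventions can be sketched with a toy cursor. PriorCursor, PriorLock, next_record, and append_batch are hypothetical names, and the cursor is a flat byte offset with no page geometry; the point is only who takes the mutex.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

enum class PriorLock { WithoutLock, WithLock };

// Toy stand-in for log_Gl.prior_info: one mutex, one moving cursor.
struct PriorCursor
{
  std::mutex mtx;
  std::int64_t next_offset = 0;
};

// Assigns the next position. Locks internally unless the caller says it
// already holds the mutex.
std::int64_t next_record (PriorCursor &c, std::int64_t length, PriorLock with_lock)
{
  assert (length > 0);
  if (with_lock == PriorLock::WithoutLock)
    {
      c.mtx.lock ();
    }
  std::int64_t start = c.next_offset;
  c.next_offset += length;
  if (with_lock == PriorLock::WithoutLock)
    {
      c.mtx.unlock ();
    }
  return start;
}

// Emits several records back to back: the mutex is taken once, so no other
// thread's record can land between them.
std::vector<std::int64_t> append_batch (PriorCursor &c, const std::vector<std::int64_t> &lengths)
{
  std::vector<std::int64_t> starts;
  std::lock_guard<std::mutex> guard (c.mtx);
  for (std::int64_t len : lengths)
    {
      starts.push_back (next_record (c, len, PriorLock::WithLock));
    }
  return starts;
}
```

Passing WithLock without actually holding the mutex would break the ordering guarantee, so the flag is a promise made by the caller rather than something the callee can verify.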

4.3 prior_lsa_next_record_internal — branch-complete walkthrough

The body has three phases: lock + prior_lsa_start_append (4.4); a 6-arm type-dispatch ladder (table below); then the offset-walk + prior_lsa_end_append (4.5) + tail splice + conditional unlock-and-flush. The frame and the tail splice, quoted verbatim (note both splice arms are two statements, not a chained assignment):

// prior_lsa_next_record_internal -- src/transaction/log_append.cpp
  if (with_lock == LOG_PRIOR_LSA_WITHOUT_LOCK) { log_Gl.prior_info.prior_lsa_mutex.lock (); }
  prior_lsa_start_append (thread_p, node, tdes);     // <- assigns start_lsa + header linkage (4.4)
  LSA_COPY (&start_lsa, &node->start_lsa);            // <- snapshot before any advance
  // ... vacuum-produce guard + 6-arm type dispatch ladder (tables below) ...
  log_prior_lsa_append_advance_when_doesnot_fit (node->data_header_length);
  log_prior_lsa_append_add_align (node->data_header_length);
  if (node->ulength > 0) { prior_lsa_append_data (node->ulength); }
  if (node->rlength > 0) { prior_lsa_append_data (node->rlength); }
  prior_lsa_end_append (thread_p, node);              // <- fixes forw_lsa (4.5)

  if (log_Gl.prior_info.prior_list_tail == NULL)
    {
      log_Gl.prior_info.prior_list_header = node;     // <- empty list: node is head ...
      log_Gl.prior_info.prior_list_tail = node;       // <- ... and tail
    }
  else
    {
      log_Gl.prior_info.prior_list_tail->next = node; // <- O(1) tail splice (two statements)
      log_Gl.prior_info.prior_list_tail = node;
    }
  log_Gl.prior_info.list_size += (sizeof (LOG_PRIOR_NODE) + node->data_header_length
                                  + node->ulength + node->rlength);
  if (with_lock == LOG_PRIOR_LSA_WITHOUT_LOCK)
    {
      log_Gl.prior_info.prior_lsa_mutex.unlock ();    // <- release BEFORE the flush decision
      // ... condensed: if list_size >= logpb_get_memsize() -> force-flush fork (see callout) ...
    }
  tdes->num_log_records_written++;
  return start_lsa;

Before the ladder, a vacuum-produce guard fires: under LOG_ISRESTARTED () and log_Gl.hdr.does_block_need_vacuum, if start_lsa crossed into a new vacuum block id versus mvcc_op_log_lsa, it calls vacuum_produce_log_block_data (asserting the prior block id is exactly one behind). Skipped entirely during crash recovery.

The 6-arm type-dispatch ladder. Mutually exclusive if/else if on node->log_header.type. Every assignment happens under the mutex with the just-snapshotted start_lsa — the reason the captured LSAs are coherent (Chapters 8–9).

#	Matched `type`(s)	Guard	Action
1	`LOG_MVCC_UNDO_DATA`, `LOG_MVCC_UNDOREDO_DATA`, `LOG_MVCC_DIFF_UNDOREDO_DATA`, or (`LOG_SYSOP_END` && `((LOG_REC_SYSOP_END *)data_header)->type == LOG_SYSOP_END_LOGICAL_MVCC_UNDO`)	—	Resolve `vacuum_info`/`mvccid` via nested sub-branch; `vacuum_info->prev_mvcc_op_log_lsa = log_Gl.hdr.mvcc_op_log_lsa`; `prior_update_header_mvcc_info (start_lsa, mvccid)` (4.6)
2	`LOG_SYSOP_START_POSTPONE`	`assert (LSA_ISNULL (rcv.sysop_start_postpone_lsa))`	`rcv.sysop_start_postpone_lsa = start_lsa`; if `lastparent_lsa < rcv.atomic_sysop_start_lsa` null it; `tdes->state = TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE` (under mutex, for checkpoint correctness)
3	`LOG_SYSOP_END`	—	If `atomic_sysop_start_lsa` non-null && `lastparent_lsa <` it → null; same test/null for `sysop_start_postpone_lsa`
4	`LOG_COMMIT_WITH_POSTPONE` or `LOG_COMMIT_WITH_POSTPONE_OBSOLETE`	—	`rcv.tran_start_postpone_lsa = start_lsa`
5	`LOG_SYSOP_ATOMIC_START`	`assert (LSA_ISNULL (rcv.atomic_sysop_start_lsa))`	`rcv.atomic_sysop_start_lsa = start_lsa`
6	`LOG_COMMIT` or `LOG_ABORT`	`assert (commit_abort_lsa.is_null ())`	`commit_abort_lsa = start_lsa`

Nested 3-way MVCC sub-branch (inside arm 1) — selects which struct holds vacuum_info/mvccid:

Sub-arm	Condition	`vacuum_info` / `mvccid` source
a	`type == LOG_MVCC_UNDO_DATA`	`(LOG_REC_MVCC_UNDO *) node->data_header` → `&mvcc_undo->vacuum_info`, `mvcc_undo->mvccid`
b	`type == LOG_SYSOP_END`	`&((LOG_REC_SYSOP_END *) node->data_header)->mvcc_undo` → `&mvcc_undo->vacuum_info`, `mvcc_undo->mvccid`
c	else (`LOG_MVCC_UNDOREDO_DATA` / `LOG_MVCC_DIFF_UNDOREDO_DATA`, asserted)	`(LOG_REC_MVCC_UNDOREDO *) node->data_header` → `&mvcc_undoredo->vacuum_info`, `mvcc_undoredo->mvccid`

If none of arms 1–6 match (the common data-record case), the ladder is a no-op and control falls straight to the offset walk.
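The ladder can be modelled as a plain if/else-if chain. The following is a minimal sketch of arms 2 to 6 only (the MVCC arm is left out), assuming an LSA flattened to one integer with -1 as null. `ToyLsa`, `ToyRecType`, `ToyTdes`, and `toy_capture_lsa` are hypothetical names for illustration, not CUBRID identifiers, and the boolean stands in for the `tdes->state` flip.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model: an LSA is one integer and -1 means null.
using ToyLsa = std::int64_t;
constexpr ToyLsa TOY_NULL_LSA = -1;

enum class ToyRecType
{
  DATA, SYSOP_START_POSTPONE, SYSOP_END, COMMIT_WITH_POSTPONE,
  SYSOP_ATOMIC_START, COMMIT, ABORT
};

struct ToyTdes
{
  ToyLsa lastparent_lsa = TOY_NULL_LSA;
  ToyLsa sysop_start_postpone_lsa = TOY_NULL_LSA;
  ToyLsa atomic_sysop_start_lsa = TOY_NULL_LSA;
  ToyLsa tran_start_postpone_lsa = TOY_NULL_LSA;
  ToyLsa commit_abort_lsa = TOY_NULL_LSA;
  bool topope_committed_with_postpone = false;  // stands in for the tdes->state flip
};

// Arms 2-6 of the ladder; a DATA record matches no arm and changes nothing.
inline void
toy_capture_lsa (ToyTdes &tdes, ToyRecType type, ToyLsa start_lsa)
{
  if (type == ToyRecType::SYSOP_START_POSTPONE)
    {
      assert (tdes.sysop_start_postpone_lsa == TOY_NULL_LSA);
      tdes.sysop_start_postpone_lsa = start_lsa;
      if (tdes.lastparent_lsa < tdes.atomic_sysop_start_lsa)
        tdes.atomic_sysop_start_lsa = TOY_NULL_LSA;
      tdes.topope_committed_with_postpone = true;
    }
  else if (type == ToyRecType::SYSOP_END)
    {
      if (tdes.atomic_sysop_start_lsa != TOY_NULL_LSA
          && tdes.lastparent_lsa < tdes.atomic_sysop_start_lsa)
        tdes.atomic_sysop_start_lsa = TOY_NULL_LSA;
      if (tdes.sysop_start_postpone_lsa != TOY_NULL_LSA
          && tdes.lastparent_lsa < tdes.sysop_start_postpone_lsa)
        tdes.sysop_start_postpone_lsa = TOY_NULL_LSA;
    }
  else if (type == ToyRecType::COMMIT_WITH_POSTPONE)
    {
      tdes.tran_start_postpone_lsa = start_lsa;
    }
  else if (type == ToyRecType::SYSOP_ATOMIC_START)
    {
      assert (tdes.atomic_sysop_start_lsa == TOY_NULL_LSA);
      tdes.atomic_sysop_start_lsa = start_lsa;
    }
  else if (type == ToyRecType::COMMIT || type == ToyRecType::ABORT)
    {
      assert (tdes.commit_abort_lsa == TOY_NULL_LSA);
      tdes.commit_abort_lsa = start_lsa;
    }
}
```

In the quoted code every one of these assignments runs while prior_lsa_mutex is held; the sketch has no lock because it models a single caller.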

Unlock-then-flush fork. Only the WITHOUT_LOCK path unlocks here, and the list_size >= logpb_get_memsize() check runs after the unlock, outside the mutex. When the check fires, SERVER_MODE outside crash recovery wakes the flush daemon and sleeps 1 ms; during recovery it instead runs a synchronous logpb_prior_lsa_append_all_list under LOG_CS. SA mode (#else) is always synchronous. Chapter 5 covers the drain.
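The splice-then-unlock shape can be sketched with a toy list. This is an illustration under stated assumptions, not the CUBRID implementation: `ToyNode`, `ToyPriorList`, `toy_attach`, and `memsize_limit` are hypothetical names, and the sketch snapshots the size inside the lock to stay race-free, whereas the quoted code reads list_size after unlocking.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical stand-ins for LOG_PRIOR_NODE and log_Gl.prior_info.
struct ToyNode
{
  std::size_t footprint = 0;   // data header + undo + redo bytes, assumed precomputed
  ToyNode *next = nullptr;
};

struct ToyPriorList
{
  std::mutex mtx;
  ToyNode *head = nullptr;
  ToyNode *tail = nullptr;
  std::size_t list_size = 0;
};

// Splices node onto the tail in O(1) and reports whether the caller should
// request a flush. The comparison happens after the mutex is released, so it
// only decides whether to nudge the flusher; it never blocks other appenders.
inline bool
toy_attach (ToyPriorList &pl, ToyNode *node, std::size_t memsize_limit)
{
  std::size_t size_after;
  {
    std::lock_guard<std::mutex> guard (pl.mtx);
    if (pl.tail == nullptr)
      {
        pl.head = node;          // empty list: node is head and tail
        pl.tail = node;
      }
    else
      {
        pl.tail->next = node;    // two statements, as in the quoted splice
        pl.tail = node;
      }
    pl.list_size += sizeof (ToyNode) + node->footprint;
    size_after = pl.list_size;
  }                              // mutex released here
  return size_after >= memsize_limit;
}
```

A caller that gets `true` back would wake the flusher or drain synchronously, depending on mode; that fork is deliberately outside the sketch.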

flowchart TD
  A["enter internal"] --> B{"WITHOUT_LOCK?"}
  B -- yes --> C["lock prior_lsa_mutex"]
  B -- no --> D["prior_lsa_start_append"]
  C --> D
  D --> E["snapshot start_lsa"]
  E --> VG{"vacuum-produce guard"}
  VG --> F["6-arm type dispatch ladder<br/>see arms 1-6 table; no-op if no match"]
  F --> G["advance + add_align data_header"]
  G --> H{"ulength>0?"}
  H -- yes --> I["append_data ulength"]
  H -- no --> J{"rlength>0?"}
  I --> J
  J -- yes --> K["append_data rlength"]
  J -- no --> L["end_append: set forw_lsa"]
  K --> L
  L --> M{"tail == NULL?"}
  M -- yes --> N["header = tail = node"]
  M -- no --> O["tail->next = node; tail = node"]
  N --> P["list_size += footprint"]
  O --> P
  P --> Q{"WITHOUT_LOCK?"}
  Q -- yes --> R["unlock; maybe force flush"]
  Q -- no --> S["num_log_records_written++; return"]
  R --> S

Figure 4-2. Branch-complete control flow of prior_lsa_next_record_internal; the six dispatch arms are collapsed into one node and detailed in the arms table above.

4.4 prior_lsa_start_append — assigning the LSA and the backward chain

This is where the node’s identity is born.

// prior_lsa_start_append -- src/transaction/log_append.cpp
  log_prior_lsa_append_advance_when_doesnot_fit (sizeof (LOG_RECORD_HEADER)); // <- header must not straddle a page
  node->log_header.trid = tdes->trid;
  LSA_COPY (&node->start_lsa, &log_Gl.prior_info.prior_lsa);   // <- THE LSA assignment, before any advance (Inv 4-C)
  if (tdes->is_system_worker_transaction () && !tdes->is_under_sysop ())
    {
      LSA_SET_NULL (&node->log_header.prev_tranlsa);  // <- worker, no sysop: lose the per-tran chain
      LSA_SET_NULL (&tdes->head_lsa);
      LSA_SET_NULL (&tdes->tail_lsa);
    }
  else
    {
      LSA_COPY (&node->log_header.prev_tranlsa, &tdes->tail_lsa);     // chain to this tran's last record
      LSA_COPY (&tdes->tail_lsa, &log_Gl.prior_info.prior_lsa);       // this record is now the tran tail
      if (LSA_ISNULL (&tdes->head_lsa))
        { LSA_COPY (&tdes->head_lsa, &tdes->tail_lsa); }              // first record of the tran
      LSA_COPY (&tdes->undo_nxlsa, &log_Gl.prior_info.prior_lsa);     // next to undo on rollback
    }
  LSA_COPY (&node->log_header.back_lsa, &log_Gl.prior_info.prev_lsa); // <- physical backward link (any tran)
  LSA_SET_NULL (&node->log_header.forw_lsa);                          // <- not known yet (end_append)
  LSA_COPY (&log_Gl.prior_info.prev_lsa, &log_Gl.prior_info.prior_lsa); // <- prev_lsa now names THIS record
  log_prior_lsa_append_add_align (sizeof (LOG_RECORD_HEADER));        // <- account the header bytes

The transaction-chain fork: system workers (e.g. vacuum) do not own a rollback chain, so a worker record not under a sysop nulls prev_tranlsa/head_lsa/tail_lsa. Everyone else chains to the prior tail and updates it. The physical back_lsa = prev_lsa link is transaction-independent; prev_lsa then advances to name this record. forw_lsa is nulled here, fixed in end_append.

Invariant 4-B. A system-worker record NOT under a sysop carries a null prev_tranlsa. Violating it would make recovery walk a non-existent transaction chain.

Invariant 4-C (start before advance). start_lsa is read before any add_align advances the cursor, so it names the first byte of the record. The header-fit guard runs first so that first byte is on a page that can hold the header.
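The chaining rules and both invariants can be checked on a toy model. The following is a minimal sketch, assuming a null LSA is encoded as {-1, -1} and ignoring alignment and page rollover; `ToyLsa`, `ToyTdes`, `ToyHeader`, `ToyPriorInfo`, and `toy_start_append` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct ToyLsa
{
  std::int64_t pageid;
  std::int64_t offset;
  bool is_null () const { return pageid == -1; }
};
const ToyLsa TOY_NULL_LSA = { -1, -1 };   // assumed null encoding for this sketch

struct ToyTdes
{
  bool system_worker = false;
  bool under_sysop = false;
  ToyLsa head_lsa = TOY_NULL_LSA;
  ToyLsa tail_lsa = TOY_NULL_LSA;
  ToyLsa undo_nxlsa = TOY_NULL_LSA;
};

struct ToyHeader
{
  ToyLsa prev_tranlsa = TOY_NULL_LSA;
  ToyLsa back_lsa = TOY_NULL_LSA;
  ToyLsa forw_lsa = TOY_NULL_LSA;
};

struct ToyPriorInfo
{
  ToyLsa prior_lsa = { 0, 0 };
  ToyLsa prev_lsa = TOY_NULL_LSA;
};

// Returns the record's start LSA, read before the cursor moves (Invariant 4-C).
inline ToyLsa
toy_start_append (ToyPriorInfo &pi, ToyTdes &tdes, ToyHeader &hdr, std::int64_t header_size)
{
  ToyLsa start = pi.prior_lsa;
  if (tdes.system_worker && !tdes.under_sysop)
    {
      hdr.prev_tranlsa = TOY_NULL_LSA;     // Invariant 4-B: no per-transaction chain
      tdes.head_lsa = TOY_NULL_LSA;
      tdes.tail_lsa = TOY_NULL_LSA;
    }
  else
    {
      hdr.prev_tranlsa = tdes.tail_lsa;    // chain to the transaction's last record
      tdes.tail_lsa = pi.prior_lsa;
      if (tdes.head_lsa.is_null ())
        tdes.head_lsa = tdes.tail_lsa;
      tdes.undo_nxlsa = pi.prior_lsa;
    }
  hdr.back_lsa = pi.prev_lsa;              // physical link, independent of transaction
  hdr.forw_lsa = TOY_NULL_LSA;             // fixed later by end_append
  pi.prev_lsa = pi.prior_lsa;
  pi.prior_lsa.offset += header_size;      // alignment and page rollover omitted
  return start;
}
```

Running two records of an ordinary transaction followed by one worker record shows the split: the worker record carries a null prev_tranlsa while its back_lsa still points at the previous physical record.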

4.5 prior_lsa_end_append — fixing forw_lsa

Once the data-header, undo, and redo regions are accounted for, the cursor sits at the first byte past this record, which is the next record’s start and therefore this record’s forw_lsa. Both helpers run before forw_lsa is assigned: align, then bump to the next page if the next header would not fit. So forw_lsa always names a position where the following header can legally live, and every forw_lsa equals the next record’s start_lsa with no straddling header between.

// prior_lsa_end_append -- src/transaction/log_append.cpp
static void
prior_lsa_end_append (THREAD_ENTRY *thread_p, LOG_PRIOR_NODE *node)
{
  log_prior_lsa_append_align ();                                  // <- align to next record start
  log_prior_lsa_append_advance_when_doesnot_fit (sizeof (LOG_RECORD_HEADER)); // <- next header must fit too
  LSA_COPY (&node->log_header.forw_lsa, &log_Gl.prior_info.prior_lsa);
}

4.6 prior_update_header_mvcc_info — vacuum block bookkeeping

Invoked from arm 1 of the ladder. It maintains the running MVCC-block summary in the global log header so vacuum knows which blocks have MVCC work.

// prior_update_header_mvcc_info -- src/transaction/log_append.cpp
  assert (MVCCID_IS_VALID (mvccid));
  if (!log_Gl.hdr.does_block_need_vacuum)            // <- FIRST MVCC record of this block
    {
      log_Gl.hdr.oldest_visible_mvccid = log_Gl.mvcc_table.get_global_oldest_visible ();
      log_Gl.hdr.newest_block_mvccid = mvccid;
    }
  else
    {
      // ... condensed: sanity asserts on oldest/newest/block id ...
      if (log_Gl.hdr.newest_block_mvccid < mvccid)   // <- subsequent record: raise high-water only
        { log_Gl.hdr.newest_block_mvccid = mvccid; }
    }
  log_Gl.hdr.mvcc_op_log_lsa = record_lsa;           // <- both branches: latest MVCC op position
  log_Gl.hdr.does_block_need_vacuum = true;

The first MVCC record of a block seeds oldest_visible_mvccid from the MVCC table; subsequent records only raise newest_block_mvccid (the elided else also asserts the block id matches mvcc_op_log_lsa). Both arms set mvcc_op_log_lsa = record_lsa and mark the block dirty — consistent with the totally-ordered LSA stream because it runs under the mutex.
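The seed-then-raise behaviour is easy to trace in isolation. Below is a minimal sketch assuming the LSA is flattened to one integer and the MVCC table lookup is passed in as a parameter; `ToyLogHeader`, `toy_update_header_mvcc_info`, and `global_oldest_visible` are hypothetical names, and the sanity asserts of the real function are omitted.

```cpp
#include <cassert>
#include <cstdint>

struct ToyLogHeader
{
  bool does_block_need_vacuum = false;
  std::uint64_t oldest_visible_mvccid = 0;
  std::uint64_t newest_block_mvccid = 0;
  std::int64_t mvcc_op_log_lsa = -1;   // LSA flattened to one integer for the sketch
};

inline void
toy_update_header_mvcc_info (ToyLogHeader &hdr, std::int64_t record_lsa,
                             std::uint64_t mvccid, std::uint64_t global_oldest_visible)
{
  if (!hdr.does_block_need_vacuum)
    {
      // First MVCC record of the block: seed both bounds.
      hdr.oldest_visible_mvccid = global_oldest_visible;
      hdr.newest_block_mvccid = mvccid;
    }
  else if (hdr.newest_block_mvccid < mvccid)
    {
      // Later record: only the high-water mark may move.
      hdr.newest_block_mvccid = mvccid;
    }
  hdr.mvcc_op_log_lsa = record_lsa;    // both paths: latest MVCC op position
  hdr.does_block_need_vacuum = true;
}
```

Feeding it records with mvccid 7, then 6, then 9 leaves oldest_visible_mvccid at its seeded value and newest_block_mvccid at 9, while mvcc_op_log_lsa follows every call.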

4.7 The offset helpers — how prior_lsa walks the record footprint

Three statics advance log_Gl.prior_info.prior_lsa, consuming each region. All operate on a 0-based offset within a LOGAREA_SIZE-byte page area (leading assert (... offset >= 0) lines elided).

// offset helpers -- src/transaction/log_append.cpp
static void log_prior_lsa_append_align ()
{
  log_Gl.prior_info.prior_lsa.offset = DB_ALIGN (log_Gl.prior_info.prior_lsa.offset, DOUBLE_ALIGNMENT);
  if ((size_t) log_Gl.prior_info.prior_lsa.offset >= (size_t) LOGAREA_SIZE)  // <- align rolled off page
    { log_Gl.prior_info.prior_lsa.pageid++; log_Gl.prior_info.prior_lsa.offset = 0; }
}
static void log_prior_lsa_append_advance_when_doesnot_fit (size_t length)
{
  if ((size_t) log_Gl.prior_info.prior_lsa.offset + length >= (size_t) LOGAREA_SIZE)  // <- region won't fit
    { log_Gl.prior_info.prior_lsa.pageid++; log_Gl.prior_info.prior_lsa.offset = 0; }
}
static void log_prior_lsa_append_add_align (size_t add)
{
  log_Gl.prior_info.prior_lsa.offset += (add);   // <- consume the region's bytes
  log_prior_lsa_append_align ();                  // <- then align (may roll to next page)
}

advance_when_doesnot_fit is the pre-check: it rolls to the next page before any bytes are consumed, so a header never straddles a boundary. align carries the only other branch, rolling when the rounded offset lands on or past LOGAREA_SIZE. The pairing advance_when_doesnot_fit(N) then add_align(N) first ensures region N fits, then consumes it. Payloads that span pages (prior_lsa_append_data) are Chapter 6’s subject.
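A worked trace makes the arithmetic concrete. The sketch below re-implements the three helpers on a toy cursor and walks a record made of a record header plus a data header (undo/redo payloads are left out because they may span pages). The constants are toy values chosen for readability, not CUBRID's real sizes, and `ToyCursor`, `ToyFootprint`, and `toy_walk_record` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy values, not CUBRID's real constants.
constexpr std::size_t TOY_LOGAREA_SIZE = 128;
constexpr std::size_t TOY_ALIGNMENT = 8;
constexpr std::size_t TOY_RECORD_HEADER_SIZE = 40;

struct ToyCursor
{
  std::int64_t pageid = 0;
  std::size_t offset = 0;

  void next_page () { pageid++; offset = 0; }

  void align ()
  {
    offset = (offset + TOY_ALIGNMENT - 1) / TOY_ALIGNMENT * TOY_ALIGNMENT;
    if (offset >= TOY_LOGAREA_SIZE)
      next_page ();                      // rounding rolled off the page
  }
  void advance_when_doesnot_fit (std::size_t length)
  {
    if (offset + length >= TOY_LOGAREA_SIZE)
      next_page ();                      // region would not fit: skip the page tail
  }
  void add_align (std::size_t add)
  {
    offset += add;                       // consume the region
    align ();                            // then align
  }
};

struct ToyFootprint
{
  std::int64_t start_page;
  std::size_t start_off;
  std::int64_t forw_page;
  std::size_t forw_off;
};

// Header fit check, start capture, header, data header, then the end_append steps.
inline ToyFootprint
toy_walk_record (ToyCursor &c, std::size_t data_header_len)
{
  c.advance_when_doesnot_fit (TOY_RECORD_HEADER_SIZE);
  ToyFootprint fp = { c.pageid, c.offset, 0, 0 };   // start read before any advance
  c.add_align (TOY_RECORD_HEADER_SIZE);
  c.advance_when_doesnot_fit (data_header_len);
  c.add_align (data_header_len);
  c.align ();                                        // end_append: align ...
  c.advance_when_doesnot_fit (TOY_RECORD_HEADER_SIZE); // ... and make room for the next header
  fp.forw_page = c.pageid;
  fp.forw_off = c.offset;
  return fp;
}
```

With these toy sizes, a first record with a 20-byte data header starts at (0, 0) and ends with forw at (0, 64); a second identical record starts at (0, 64), and its data header ends at offset 124, which aligns to 128 and rolls, so its forw is (1, 0). A cursor parked at offset 96 cannot hold a 40-byte header, so the record starts on the next page.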

4.8 prior_set_tde_encrypted — marking the node for encryption

Separate from the LSA path, called on the node for sensitive records.

// prior_set_tde_encrypted -- src/transaction/log_append.cpp
  if (!tde_is_loaded())                                  // <- cipher must be available
    {
      er_set (ER_ERROR_SEVERITY, ARG_FILE_LINE, ER_TDE_CIPHER_IS_NOT_LOADED, 0);
      return ER_TDE_CIPHER_IS_NOT_LOADED;                // <- error branch
    }
  tde_er_log ("prior_set_tde_encrypted(): rcvindex = %s\n", rv_rcvindex_string (recvindex));
  node->tde_encrypted = true;                            // <- the only state change
  return NO_ERROR;

Two branches: cipher not loaded → log error, return ER_TDE_CIPHER_IS_NOT_LOADED, node untouched; otherwise flip node->tde_encrypted = true. The flag is read later when the holding page is allocated/flushed — it does not participate in LSA assignment, which is why it is a standalone setter, not part of prior_lsa_start_append. (Query side: the trivial prior_is_tde_encrypted.)

4.9 Chapter summary — key takeaways

One mutex defines the order. prior_lsa_mutex is acquired once per record (or held across several via _with_lock); acquisition order is LSA order (Invariant 4-A). Read-then-advance under one lock means no shared or out-of-order LSAs; no separate counter.
prior_lsa_start_append is the moment of birth. Copies prior_lsa into start_lsa before advancing (Invariant 4-C), sets trid, builds prev_tranlsa/back_lsa, nulls forw_lsa.
The transaction chain forks on worker status. Worker records not under a sysop get null prev_tranlsa/head_lsa/tail_lsa (Invariant 4-B); everyone else chains and updates tail_lsa/head_lsa/undo_nxlsa.
A 6-arm type ladder captures start_lsa under the lock. MVCC-undo (nested 3-way select → prior_update_header_mvcc_info), SYSOP_START_POSTPONE (also flips tdes->state), SYSOP_END, COMMIT_WITH_POSTPONE(_OBSOLETE), SYSOP_ATOMIC_START, COMMIT/ABORT each stash the LSA into tdes->rcv.*/commit_abort_lsa.
forw_lsa is fixed last. prior_lsa_end_append aligns past the record and guards the next header’s fit, so forw_lsa equals the next record’s start_lsa.
Offset helpers consume the footprint. advance_when_doesnot_fit pre-checks fit before any bytes are consumed, add_align consumes then aligns, and align rounds to DOUBLE_ALIGNMENT and rolls to the next page when the rounded offset reaches LOGAREA_SIZE.
Expensive work is outside the lock. The O(1) splice and list_size bump end the critical section; the flush check and any flush run after unlock.

Chapter 5: Draining the Prior List into the Page Buffer

Chapter 4 left a chain of log_prior_nodes with LSAs wired up by prior_lsa_next_record — promised but not yet copied into any LOG_PAGE frame. This chapter traces the single-writer drain that detaches the list, walks it in LSN order, and serializes each node into the page buffer. We stop at the page boundary (Chapter 6 owns logpb_next_append_page); the WAL rule is Chapter 7 (companion cubrid-log-manager.md).

5.1 The two serialization layers

Two locks, two jobs: prior_lsa_mutex serializes appenders against each other (held only for the LSA-stamp-and-link, Chapter 4; does not protect the page buffer); LOG_CS write mode serializes appenders against the page-buffer writer — every drain function opens with assert (LOG_CS_OWN_WRITE_MODE (thread_p)).

INVARIANT (single-writer drain). The drain runs while the caller owns LOG_CS write mode; with two drainers, append_lsa.offset and log_pgptr would update non-atomically and records would interleave. The LOG_CS_OWN_WRITE_MODE assert makes a violation fatal. Every struct table below assumes this.

The hand-off is the detach: under prior_lsa_mutex the writer snips off the list and nulls the header, so later appenders build a fresh list while the detached one is drained without holding prior_lsa_mutex (the drainer still owns LOG_CS).

flowchart TD
  A1["prior_lsa_next_record<br/>holds prior_lsa_mutex briefly"]
  D1["logpb_prior_lsa_append_all_list"]
  D2["detach list under prior_lsa_mutex<br/>reset header/tail/list_size"]
  D3["logpb_append_prior_lsa_list<br/>walk nodes in LSN order"]
  D4["logpb_append_next_record per node"]
  D5["copy bytes into LOG_PAGE frames<br/>set dirty, free node"]
  A1 -->|"attach to prior_list"| D2
  D1 --> D2 --> D3 --> D4 --> D5

Figure 5-1. The two serialization layers and the detach hand-off.

5.2 Structs the drain reads and writes

`log_prior_node` — the unit being drained (`log_append.hpp`)

Field	Role	Why it exists
`log_header`	`LOG_RECORD_HEADER` copied verbatim by `logpb_start_append`	On-disk record header
`start_lsa`	Must equal `append_lsa` when appended	Catches LSN-order corruption
`tde_encrypted`	Destination page is TDE-encrypted	Drives `appending_page_tde_encrypted`
`data_header_length`	Byte length of `data_header`	Sizes the header copy
`data_header`	Fixed per-record-type header payload	Part after `LOG_RECORD_HEADER`
`ulength` / `udata`	Length/pointer of the undo segment	Rollback image
`rlength` / `rdata`	Length/pointer of the redo segment	Recovery image
`next`	Link to the next node	Walked in LSN order

INVARIANT (node order = LSN order). Tail-append under prior_lsa_mutex makes next traversal exactly ascending LSN; logpb_append_next_record re-checks each node via LSA_EQ (&node->start_lsa, &log_Gl.hdr.append_lsa) and a mismatch is a logpb_fatal_error.

`LOG_PAGE` / `log_hdrpage` — the destination frame (`log_storage.hpp`)

log_page is { LOG_HDRPAGE hdr; char area[1]; }; log_hdrpage is the per-frame header.

Field	Role	Why it exists
`hdr.logical_pageid`	Identity of this frame in the log	Maps page to physical slot
`hdr.offset`	Offset of the first record on this page	Set once by `logpb_start_append`; enables salvage
`hdr.flags`	TDE encryption flags	Stamped by `logpb_set_tde_algorithm`
`hdr.checksum`	CRC32 of the page	Computed at flush (Chapter 7)
`area`	Buffer header+payload are `memcpy`’d into	`LOG_APPEND_PTR()` = `area + append_lsa.offset`

`LOG_BUFFER` — frame wrapper carrying the dirty bit (`log_page_buffer.c`)

Field	Role	Why it exists
`pageid` (volatile)	Logical page id of the wrapped frame	Validates flush targets
`phy_pageid` (volatile)	Physical page id in the active log	Maps logical page to disk slot
`dirty` (bool)	“Has unflushed changes”	Raised by `logpb_set_dirty`, cleared by flusher (Chapter 7)
`logpage` (`LOG_PAGE*`)	Back-pointer to buffered payload	`logpb_get_log_buffer` recovers the wrapper from a `LOG_PAGE*`

`log_append_info` — the single writer’s cursor state (`log_append.hpp`)

Field	Role	Why it exists
`vdes`	Active-log volume descriptor	Flush target; untouched by the drain
`nxio_lsa` (atomic)	Lowest LSN not yet on disk	The WAL frontier (Chapter 7)
`prev_lsa`	Address of the last fully appended record	`logpb_start_append` checks `back_lsa == prev_lsa`, then advances it
`log_pgptr`	The currently fixed append page frame	`LOG_APPEND_PTR()` writes into `log_pgptr->area`
`appending_page_tde_encrypted`	Page being filled needs TDE	Set per node from `node->tde_encrypted`

INVARIANT (back_lsa chaining). logpb_start_append checks back_lsa == prev_lsa before copying each header; a mismatch ends in logpb_fatal_error, so a broken backward chain is never written to the page buffer.

5.3 `logpb_prior_lsa_append_all_list` — detach then drain

// logpb_prior_lsa_append_all_list -- src/transaction/log_page_buffer.c
int
logpb_prior_lsa_append_all_list (THREAD_ENTRY * thread_p)
{
  LOG_PRIOR_NODE *prior_list;
  assert (LOG_CS_OWN_WRITE_MODE (thread_p));     /* <- single-writer invariant */

  log_Gl.prior_info.prior_lsa_mutex.lock ();
  prior_list = prior_lsa_remove_prior_list (thread_p);  /* <- detach */
  log_Gl.prior_info.prior_lsa_mutex.unlock ();          /* <- mutex dropped early */

  if (prior_list != NULL)
    {
      // ... condensed: perfmon stats ...
      logpb_append_prior_lsa_list (thread_p, prior_list);  /* <- drain, no mutex held */
    }
  return NO_ERROR;
}

prior_lsa_remove_prior_list is the detach — the only mutation of the prior-list header during the drain:

// prior_lsa_remove_prior_list -- src/transaction/log_page_buffer.c
static LOG_PRIOR_NODE *
prior_lsa_remove_prior_list (THREAD_ENTRY * thread_p)
{
  LOG_PRIOR_NODE *prior_list;
  assert (LOG_CS_OWN_WRITE_MODE (thread_p));
  prior_list = log_Gl.prior_info.prior_list_header;
  log_Gl.prior_info.prior_list_header = NULL;    /* <- reset header/tail/size: */
  log_Gl.prior_info.prior_list_tail = NULL;      /*    new appenders start fresh */
  log_Gl.prior_info.list_size = 0;
  return prior_list;
}

Branch: if prior_list == NULL the drain is skipped; otherwise the mutex is released before the copy, shrinking the appender-blocking window to one pointer read and three field resets (header, tail, list_size).

5.4 `logpb_append_prior_lsa_list` — walk and free

The detached list is parked on prior_flush_list_header (a separate slot, so a rebuilt prior_list_header stays untouched), then drained node-by-node.

// logpb_append_prior_lsa_list -- src/transaction/log_page_buffer.c
static int
logpb_append_prior_lsa_list (THREAD_ENTRY * thread_p, LOG_PRIOR_NODE * list)
{
  LOG_PRIOR_NODE *node;
  assert (log_Gl.prior_info.prior_flush_list_header == NULL);  /* <- no concurrent drain */
  log_Gl.prior_info.prior_flush_list_header = list;

  while (log_Gl.prior_info.prior_flush_list_header != NULL)
    {
      node = log_Gl.prior_info.prior_flush_list_header;
      log_Gl.prior_info.prior_flush_list_header = node->next;  /* <- advance before copy */
      logpb_append_next_record (thread_p, node);               /* <- the copy */

      if (node->data_header != NULL) free_and_init (node->data_header);
      if (node->udata != NULL)       free_and_init (node->udata);
      if (node->rdata != NULL)       free_and_init (node->rdata);
      free_and_init (node);                                     /* <- node lifetime ends */
    }
  return NO_ERROR;
}

Each segment is freed only if non-NULL (a node may carry any subset); the head advances before the copy so the loop ends on the NULL next. The assert (prior_flush_list_header == NULL) enforces no overlapping flush list — a corollary of the single-writer invariant, holding under LOG_CS.
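The detach and the walk-and-free loop fit together as follows. This is a minimal sketch with a toy list, assuming a node carries only an id; `DrainNode`, `DrainList`, `toy_push`, `toy_detach`, and `toy_drain` are hypothetical names, and the LOG_CS ownership that the real drain asserts is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

struct DrainNode
{
  int id;
  DrainNode *next;
};

struct DrainList
{
  std::mutex mtx;
  DrainNode *head = nullptr;
  DrainNode *tail = nullptr;
  std::size_t list_size = 0;
};

// Appender side: tail splice under the mutex.
inline void
toy_push (DrainList &pl, int id)
{
  DrainNode *n = new DrainNode{ id, nullptr };
  std::lock_guard<std::mutex> guard (pl.mtx);
  if (pl.tail == nullptr)
    pl.head = n;
  else
    pl.tail->next = n;
  pl.tail = n;
  pl.list_size += sizeof (DrainNode);
}

// Writer side, step 1: take the whole list and reset the shared fields.
inline DrainNode *
toy_detach (DrainList &pl)
{
  std::lock_guard<std::mutex> guard (pl.mtx);
  DrainNode *list = pl.head;
  pl.head = nullptr;
  pl.tail = nullptr;
  pl.list_size = 0;
  return list;
}

// Writer side, step 2: walk the detached list with the mutex released,
// advancing before the node is consumed and freeing each node afterwards.
inline std::vector<int>
toy_drain (DrainList &pl)
{
  std::vector<int> order;
  DrainNode *node = toy_detach (pl);
  while (node != nullptr)
    {
      DrainNode *next = node->next;
      order.push_back (node->id);   // stands in for the copy into the page buffer
      delete node;
      node = next;
    }
  return order;
}
```

Because appenders only ever touch the shared head and tail, they can start a fresh list the moment `toy_detach` returns, while the writer is still walking the old one.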

5.5 `logpb_append_next_record` — one node, header + payload

// logpb_append_next_record -- src/transaction/log_page_buffer.c
static int
logpb_append_next_record (THREAD_ENTRY * thread_p, LOG_PRIOR_NODE * node)
{
  if (!LSA_EQ (&node->start_lsa, &log_Gl.hdr.append_lsa))
    logpb_fatal_error (thread_p, true, ARG_FILE_LINE, "logpb_append_next_record");  /* <- LSN-order */

  if (log_Gl.flush_info.num_toflush + 1 >= log_Gl.flush_info.max_toflush)
    logpb_flush_all_append_pages (thread_p);   /* <- flush early, before this record */

  log_Gl.append.appending_page_tde_encrypted = prior_is_tde_encrypted (node);
  logpb_start_append (thread_p, &node->log_header);   /* writes LOG_RECORD_HEADER */

  if (node->data_header != NULL)
    {
      LOG_APPEND_ADVANCE_WHEN_DOESNOT_FIT (thread_p, node->data_header_length);  /* keep header contiguous */
      logpb_append_data (thread_p, node->data_header_length, node->data_header);
    }
  if (node->udata != NULL)
    logpb_append_data (thread_p, node->ulength, node->udata);
  if (node->rdata != NULL)
    logpb_append_data (thread_p, node->rlength, node->rdata);

  logpb_end_append (thread_p, &node->log_header);
  log_Gl.append.appending_page_tde_encrypted = false;   /* reset for next node */
  return NO_ERROR;
}

The non-obvious branch is the early flush (num_toflush + 1 >= max_toflush): flushing now, with no record in progress, keeps the partial-append state machine (LOGPB_APPENDREC_*, Chapter 7) from triggering mid-record. data_header is pre-advanced so it stays on one page; udata and rdata may span pages. Figure 5-2 covers every branch.

flowchart TD
  S["enter logpb_append_next_record"] --> C1{"start_lsa == append_lsa ?"}
  C1 -->|no| F["logpb_fatal_error"]
  C1 -->|yes| C2{"flush list nearly full ?"}
  C2 -->|yes| FL["logpb_flush_all_append_pages"]
  C2 -->|no| H
  FL --> H["set tde flag<br/>logpb_start_append: write header"]
  H --> C3{"data_header ?"}
  C3 -->|yes| DH["ADVANCE_WHEN_DOESNOT_FIT<br/>append_data header"]
  C3 -->|no| C4
  DH --> C4{"udata ?"}
  C4 -->|yes| UD["append_data udata"]
  C4 -->|no| C5
  UD --> C5{"rdata ?"}
  C5 -->|yes| RD["append_data rdata"]
  C5 -->|no| E
  RD --> E["logpb_end_append<br/>reset tde flag"]

Figure 5-2. Branch-complete flow of logpb_append_next_record.

5.6 `logpb_start_append` — stamp the record header

// logpb_start_append -- src/transaction/log_page_buffer.c
static void
logpb_start_append (THREAD_ENTRY * thread_p, LOG_RECORD_HEADER * header)
{
  LOG_RECORD_HEADER *log_rec;
  // ... condensed: assert, perfmon, ADVANCE_WHEN_DOESNOT_FIT (header contiguous) ...
  if (!LSA_EQ (&header->back_lsa, &log_Gl.append.prev_lsa))
    logpb_fatal_error (thread_p, true, ARG_FILE_LINE, "logpb_start_append");   /* <- back-chain check */

  if (log_Gl.append.appending_page_tde_encrypted
      && !LOG_IS_PAGE_TDE_ENCRYPTED (log_Gl.append.log_pgptr))
    {
      // ... condensed: stamp TDE algorithm on the page ...
      logpb_set_dirty (thread_p, log_Gl.append.log_pgptr);
    }

  log_rec = (LOG_RECORD_HEADER *) LOG_APPEND_PTR ();
  *log_rec = *header;                       /* <- the header copy */
  // ... condensed: if hdr.offset == NULL_OFFSET, set first-record offset on this page ...

  if (log_rec->type == LOG_END_OF_LOG)
    {
      LSA_COPY (&log_Gl.hdr.eof_lsa, &log_Gl.hdr.append_lsa);
      logpb_set_dirty (thread_p, log_Gl.append.log_pgptr);
    }
  else
    {
      LSA_COPY (&log_Gl.append.prev_lsa, &log_Gl.hdr.append_lsa);             /* advance prev_lsa */
      LOG_APPEND_SETDIRTY_ADD_ALIGN (thread_p, sizeof (LOG_RECORD_HEADER));   /* dirty + bump + align */
      log_Pb.partial_append.status = LOGPB_APPENDREC_IN_PROGRESS;
    }
}

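After the back-chain check and the conditional TDE stamp, two branches remain: hdr.offset == NULL_OFFSET sets the page’s first-record offset once; the type split routes LOG_END_OF_LOG (EOF sentinel, Chapter 7) down a placeholder path that records eof_lsa and leaves prev_lsa and the partial-append status untouched, while every other type advances prev_lsa, consumes the header bytes, and enters IN_PROGRESS.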

5.7 `logpb_append_data` — the aligned byte copy

// logpb_append_data -- src/transaction/log_page_buffer.c
static void
logpb_append_data (THREAD_ENTRY * thread_p, int length, const char *data)
{
  int copy_length; char *ptr, *last_ptr;
  if (length == 0 || data == NULL)
    return;                                   /* <- empty segment: nothing to do */

  LOG_APPEND_ALIGN (thread_p, LOG_DONT_SET_DIRTY);   /* align, don't dirty yet */
  ptr = LOG_APPEND_PTR ();
  last_ptr = LOG_LAST_APPEND_PTR ();          /* = area + LOGAREA_SIZE */

  if ((ptr + length) >= last_ptr)             /* <- does NOT fit in this page */
    {
      while (length > 0)
        {
          if (ptr >= last_ptr)
            {
              logpb_next_append_page (thread_p, LOG_SET_DIRTY);  /* Chapter 6 */
              ptr = LOG_APPEND_PTR ();  last_ptr = LOG_LAST_APPEND_PTR ();
            }
          copy_length = (ptr + length >= last_ptr) ? CAST_BUFLEN (last_ptr - ptr) : length;
          memcpy (ptr, data, copy_length);
          ptr += copy_length; data += copy_length; length -= copy_length;
          log_Gl.hdr.append_lsa.offset += copy_length;   /* advance by bytes copied */
        }
    }
  else                                        /* <- fits entirely */
    {
      memcpy (ptr, data, length);
      log_Gl.hdr.append_lsa.offset += length;
    }
  LOG_APPEND_ALIGN (thread_p, LOG_SET_DIRTY); /* align for next append AND mark dirty */
}

The boundary-span path copies to page end, calls logpb_next_append_page (Chapter 6), and repeats until length == 0. logpb_append_crumbs is the scatter-gather sibling (same fit/span logic), not on the drain path.

INVARIANT (cursor tracks bytes copied). append_lsa.offset advances by exactly the bytes memcpy’d on every path; drift would make the next LOG_APPEND_PTR() point at the wrong byte and records overlap. Both LOG_APPEND_ALIGN calls only round up.
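The spanning copy and its cursor invariant can be reproduced on toy pages. The sketch below assumes a 16-byte page area and omits alignment, the dirty bit, and the flush queue; `ToyLog`, `toy_append_data`, and `TOY_AREA` are hypothetical names. It also folds the fits/does-not-fit split into a single loop, which is a simplification of the quoted function, not its structure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

constexpr std::size_t TOY_AREA = 16;   // toy page area, not LOGAREA_SIZE

struct ToyLog
{
  std::vector<std::vector<char>> pages;   // pages[pageid] is that page's area
  std::size_t pageid = 0;
  std::size_t offset = 0;

  ToyLog () : pages (1, std::vector<char> (TOY_AREA, '.')) {}

  void next_page ()
  {
    pages.push_back (std::vector<char> (TOY_AREA, '.'));
    pageid++;
    offset = 0;
  }
};

// Copies length bytes, crossing to a fresh page whenever the current one is full.
// The cursor moves by exactly the number of bytes copied in each step.
inline void
toy_append_data (ToyLog &log, const char *data, std::size_t length)
{
  while (length > 0)
    {
      if (log.offset >= TOY_AREA)
        log.next_page ();
      std::size_t room = TOY_AREA - log.offset;
      std::size_t copy_length = std::min (room, length);
      std::memcpy (&log.pages[log.pageid][log.offset], data, copy_length);
      data += copy_length;
      length -= copy_length;
      log.offset += copy_length;
    }
}
```

Appending a 40-byte payload to an empty toy log fills two pages and leaves the cursor at offset 8 on the third. Unlike the real function, the toy leaves the cursor parked at the end of a page until the next call instead of aligning and rolling immediately.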

5.8 `logpb_end_append` — close the record, point forward

// logpb_end_append -- src/transaction/log_page_buffer.c
static void
logpb_end_append (THREAD_ENTRY * thread_p, LOG_RECORD_HEADER * header)
{
  // ... condensed: align + ADVANCE_WHEN_DOESNOT_FIT position the cursor at next slot ...
  assert (LSA_EQ (&header->forw_lsa, &log_Gl.hdr.append_lsa));   /* <- forw_lsa = next slot */

  if (!LSA_EQ (&log_Gl.append.prev_lsa, &log_Gl.hdr.append_lsa))
    logpb_set_dirty (thread_p, log_Gl.append.log_pgptr);         /* dirty if cursor moved off prev */

  if (log_Pb.partial_append.status == LOGPB_APPENDREC_IN_PROGRESS)
    ;                                                            /* normal: fall through */
  else if (log_Pb.partial_append.status == LOGPB_APPENDREC_PARTIAL_FLUSHED_END_OF_LOG)
    {
      log_Pb.partial_append.status = LOGPB_APPENDREC_PARTIAL_ENDED;
      logpb_flush_all_append_pages (thread_p);                   /* re-flush correct version */
    }
  else
    assert_release (false);                                      /* invalid state */

  log_Pb.partial_append.status = LOGPB_APPENDREC_SUCCESS;        /* record now stable */
}

After the cursor is repositioned and the forw_lsa assert (partner to back_lsa) confirms it, the state machine branches: IN_PROGRESS falls through; PARTIAL_FLUSHED_END_OF_LOG (a forced flush swapped in an EOF sentinel) re-flushes the real record (Chapter 7); else → assert_release(false). All end at SUCCESS.

INVARIANT (record bracketing). Between logpb_start_append (IN_PROGRESS) and logpb_end_append (SUCCESS), exactly one record is mid-write. A forced flush seeing IN_PROGRESS knows it caught a partial record; SUCCESS means the page is safe to flush. Breaking the bracket lets a half-written record reach disk unmarked.
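The bracket can be modelled as a four-state machine. This sketch is an assumption-laden simplification: the transition taken by a forced flush (IN_PROGRESS to PARTIAL_FLUSHED_END_OF_LOG) is inferred from the description above and belongs to Chapter 7, and any further states of the real enum are not modelled. `ToyAppendStatus`, `ToyPartialAppend`, and the `toy_*` functions are hypothetical names.

```cpp
#include <cassert>

enum class ToyAppendStatus
{
  SUCCESS,
  IN_PROGRESS,
  PARTIAL_FLUSHED_END_OF_LOG,
  PARTIAL_ENDED
};

struct ToyPartialAppend
{
  ToyAppendStatus status = ToyAppendStatus::SUCCESS;
  int reflush_count = 0;   // counts the corrective flushes issued by toy_end_append
};

inline void
toy_start_append (ToyPartialAppend &pa)
{
  pa.status = ToyAppendStatus::IN_PROGRESS;
}

// Assumed behaviour: a forced flush that catches a record mid-write marks it.
inline void
toy_forced_flush (ToyPartialAppend &pa)
{
  if (pa.status == ToyAppendStatus::IN_PROGRESS)
    pa.status = ToyAppendStatus::PARTIAL_FLUSHED_END_OF_LOG;
}

// Returns false for a state the real code treats as assert_release (false).
inline bool
toy_end_append (ToyPartialAppend &pa)
{
  if (pa.status == ToyAppendStatus::IN_PROGRESS)
    {
      // normal path: nothing to repair
    }
  else if (pa.status == ToyAppendStatus::PARTIAL_FLUSHED_END_OF_LOG)
    {
      pa.status = ToyAppendStatus::PARTIAL_ENDED;
      pa.reflush_count++;   // stands in for logpb_flush_all_append_pages
    }
  else
    {
      return false;
    }
  pa.status = ToyAppendStatus::SUCCESS;
  return true;
}
```

A start followed by an end goes straight to SUCCESS with no corrective flush; inserting a forced flush between them costs exactly one re-flush; an end without a start is rejected.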

5.9 Chapter summary — key takeaways

Two locks, two jobs. prior_lsa_mutex serializes appenders; LOG_CS serializes them against the single writer.
Detach, then drain. Reset header/tail/list_size under the mutex, release, then copy.
LSN order, freed immediately. Each node copied via logpb_append_next_record, then its segments and itself freed.
Three checks guard the chain. back_lsa==prev_lsa and start_lsa==append_lsa are runtime checks that end in logpb_fatal_error; forw_lsa==append_lsa is an assert in logpb_end_append.
Cursor stays honest. logpb_append_data advances append_lsa.offset by exactly the bytes copied.
Dirty, not flushed. logpb_set_dirty only flips LOG_BUFFER::dirty; flush/checksum/WAL are Chapter 7.
Boundary crossing deferred. Every logpb_next_append_page hands off to Chapter 6.

Chapter 6: Crossing a Log Page Boundary

The drain loop of Chapter 5 streams a prior node’s bytes into log_Gl.append.log_pgptr one fragment at a time. When the running offset log_Gl.hdr.append_lsa.offset reaches the page’s usable limit (LOGAREA_SIZE), the appender must seal the full page, register it for flush, and obtain a fresh logical page.

The reader question: what happens when a record does not fit, and how is a fresh page fetched while the full one is queued for flush? The mid-stream answer is logpb_next_append_page; the first-page bootstrap is logpb_fetch_start_append_page and its stripped twin logpb_fetch_start_append_page_new. All obtain a buffer frame and initialize its header through logpb_create_page / logpb_locate_page. WAL and the append/flush split are in the high-level companion; flush durability is Chapter 7. The prior-side mirror log_prior_lsa_append_advance_when_doesnot_fit (Chapter 4) reserves the address space across the page tail before any bytes exist; this chapter fetches the physical frame for that address.

6.1 Who triggers the crossing

Record-assembly code rarely calls logpb_next_append_page itself; the one direct call on the drain path is the spanning loop of logpb_append_data (5.7), which passes LOG_SET_DIRTY. Everywhere else two macros own the decision, both comparing log_Gl.hdr.append_lsa.offset against LOGAREA_SIZE (the alignment/advance arithmetic is Chapter 4 / Chapter 5 material): LOG_APPEND_ALIGN crosses after a fragment when the DOUBLE_ALIGNMENT-rounded offset reaches the limit; LOG_APPEND_ADVANCE_WHEN_DOESNOT_FIT(length) crosses before writing when offset + length would overrun, so the fragment lands whole on the next page.

That after-vs-before split is why logpb_next_append_page takes current_setdirty. LOG_APPEND_ALIGN (reached via LOG_APPEND_SETDIRTY_ADD_ALIGN with LOG_SET_DIRTY) has already dirtied the page it is leaving; the ADVANCE macro crosses before any byte is written, so nothing is dirty yet. Both therefore pass LOG_DONT_SET_DIRTY, which leaves the seal branch inside logpb_next_append_page dead on the macro path; it fires only for direct callers that did not pre-seal, such as the spanning loop in logpb_append_data.

logpb_set_dirty flips one boolean on the page’s buffer frame:

// logpb_set_dirty -- src/transaction/log_page_buffer.c
void
logpb_set_dirty (THREAD_ENTRY * thread_p, LOG_PAGE * log_pgptr)
{
  LOG_BUFFER *bufptr;
  bufptr = logpb_get_log_buffer (log_pgptr);  /* <- recovers frame from page address */
  // ... condensed ...
  bufptr->dirty = true;
}

Invariant (dirty-before-detach): a page that received append bytes must be bufptr->dirty before log_Gl.append.log_pgptr is repointed away. The macro paths run LOG_APPEND_SETDIRTY_ADD_ALIGN (calling LOG_APPEND_ALIGN with LOG_SET_DIRTY) before the offset reaches LOGAREA_SIZE, and the spanning loop in logpb_append_data (5.7) passes LOG_SET_DIRTY to logpb_next_append_page directly. If violated, the full page sits in toflush[] un-dirtied and the flusher skips it, losing committed records.

6.2 The structs at the seam

log_append_info is the appender’s fixed cursor; log_page / log_hdrpage are the physical page layout; log_flush_info is the hand-off queue to the flusher.

log_append_info (log_append.hpp) — one global, log_Gl.append.

Field	Role	Why it exists
`vdes`	Volume descriptor (fd) of the active log	The eventual `fileio_write` target
`nxio_lsa`	`atomic<LOG_LSA>`: lowest LSA not yet on disk	WAL boundary; re-pointed by the fetch helpers (6.5), read by Chapter 7
`prev_lsa`	LSA of the last appended record	`logpb_start_append` checks `back_lsa == prev_lsa`; logical, so survives a crossing unchanged (6.7)
`log_pgptr`	The currently fixed append page	The pointer the crossing nulls then re-points
`appending_page_tde_encrypted`	Pages created mid-append must be TDE-encrypted	Propagates the record’s encryption decision onto new mid-record pages (6.6)

log_hdrpage (log_storage.hpp) — header at the front of every log page.

Field	Role	Why it exists
`logical_pageid`	`LOG_PAGEID`: page’s address in the infinite log	So readers/flushers know which logical page a frame holds
`offset`	`PGLENGTH`: byte offset of the first full record here	Salvage anchor for recovery if the prior page is corrupt
`flags`	`short` bitfield; today only TDE bits	Carries `LOG_HDRPAGE_FLAG_ENCRYPTED_AES`/`_ARIA`; set via `logpb_set_tde_algorithm`
`checksum`	`int`: CRC32 over the page	Consistency check; memset garbage at create, computed at write time (6.4)

log_page (log_storage.hpp): LOG_HDRPAGE hdr followed by char area[1] (record region, sized LOGAREA_SIZE). Never sizeof it; use LOG_PAGESIZE.

log_flush_info (log_impl.h) — the queue the crossing pushes into; one global, log_Gl.flush_info.

Field	Role	Why it exists
`max_toflush`	Capacity of `toflush`	Threshold that forces a flush when the queue fills
`num_toflush`	Count of queued pages	Incremented under `flush_mutex` per crossing
`toflush`	`LOG_PAGE **`: ordered pages awaiting flush	Hand-off list; array order = flush order
`flush_mutex`	Mutex (SERVER_MODE) over the three above	Lets the Log Flush Thread and appender share the queue safely

graph TD
  subgraph append_cursor
    AI["log_append_info<br/>log_Gl.append"]
    AI -->|log_pgptr| PG["LOG_PAGE (current)"]
    AI -->|appending_page_tde_encrypted| TDE["TDE decision"]
  end
  PG -->|hdr| HDR["log_hdrpage<br/>logical_pageid / offset / flags / checksum"]
  subgraph flush_queue
    FI["log_flush_info<br/>log_Gl.flush_info"]
    FI -->|toflush num_toflush| Q["LOG_PAGE *[] ordered"]
  end
  PG -.->|enqueued on crossing| Q

Figure 6-1. Struct relationships: the crossing repoints log_pgptr to a new LOG_PAGE and pushes that new page into toflush[].

6.3 `logpb_next_append_page`: branch-complete walkthrough

// logpb_next_append_page -- src/transaction/log_page_buffer.c
  assert (LOG_CS_OWN_WRITE_MODE (thread_p));          /* (entry) LOG CS held write-exclusive */
  if (current_setdirty == LOG_SET_DIRTY)
    { logpb_set_dirty (thread_p, log_Gl.append.log_pgptr); }   /* (A) seal old page */
  log_Gl.append.log_pgptr = NULL;                     /* (B) detach */
  log_Gl.hdr.append_lsa.pageid++;  log_Gl.hdr.append_lsa.offset = 0;   /* (C) next page, offset 0 */
  if (LOGPB_AT_NEXT_ARCHIVE_PAGE_ID (log_Gl.hdr.append_lsa.pageid))
    { logpb_archive_active_log (thread_p); }          /* (D) wrap onto unarchived slot */
  if (LOGPB_IS_FIRST_PHYSICAL_PAGE (log_Gl.hdr.append_lsa.pageid))
    { log_Gl.hdr.fpageid += LOGPB_ACTIVE_NPAGES; logpb_flush_header (thread_p); }  /* (E) cycled */
  log_Gl.append.log_pgptr = logpb_create_page (thread_p, log_Gl.hdr.append_lsa.pageid);  /* (F) */
  if (log_Gl.append.log_pgptr == NULL)
    { logpb_fatal_error (thread_p, true, ARG_FILE_LINE, "log_next_append_page"); return; }  /* (G) */
  if (log_Gl.append.appending_page_tde_encrypted)    /* (H) propagate TDE — see 6.6 */
    { /* ... logpb_set_tde_algorithm + logpb_set_dirty ... */ }
  rv = pthread_mutex_lock (&flush_info->flush_mutex);
  flush_info->toflush[flush_info->num_toflush++] = log_Gl.append.log_pgptr;  /* (I) enqueue NEW page */
  need_flush = (flush_info->num_toflush >= flush_info->max_toflush);  /* (J) queue full? */
  pthread_mutex_unlock (&flush_info->flush_mutex);
  if (need_flush)
    { logpb_flush_all_append_pages (thread_p); }      /* (K) forced flush, outside the mutex */

Figure 6-2 traces every branch; two are non-obvious. (B): between detach and (F) there is no current append page, but the write-exclusive LOG CS (entry assert) means no other appender ever observes the gap. (I): the page enqueued is the fresh empty page, not the one just filled — that one was queued at its own birth, so every page enters toflush[] exactly once. The rest (D archive-wrap → Chapter 10, E ring-wrap header bump, G fatal-NULL) are labelled in the excerpt and flowchart.

flowchart TD
  S["enter, LOG CS write-held"] --> A{"current_setdirty == LOG_SET_DIRTY?"}
  A -->|yes| A1["logpb_set_dirty(old page)"]
  A -->|no| B
  A1 --> B["log_pgptr = NULL; pageid++; offset = 0"]
  B --> D{"LOGPB_AT_NEXT_ARCHIVE_PAGE_ID?"}
  D -->|yes| D1["logpb_archive_active_log"]
  D -->|no| E
  D1 --> E{"LOGPB_IS_FIRST_PHYSICAL_PAGE?"}
  E -->|yes| E1["fpageid += ACTIVE_NPAGES; logpb_flush_header"]
  E -->|no| F
  E1 --> F["log_pgptr = logpb_create_page(pageid)"]
  F --> G{"log_pgptr == NULL?"}
  G -->|yes| G1["logpb_fatal_error -> return"]
  G -->|no| H{"appending_page_tde_encrypted?"}
  H -->|yes| H1["set_tde_algorithm; set_dirty"]
  H -->|no| I
  H1 --> I["lock flush_mutex; toflush[num++] = new page"]
  I --> J{"num_toflush >= max_toflush?"}
  J -->|yes| J1["need_flush = true"]
  J -->|no| K
  J1 --> K["unlock flush_mutex"]
  K --> L{"need_flush?"}
  L -->|yes| L1["logpb_flush_all_append_pages"]
  L -->|no| Z["return"]
  L1 --> Z

Figure 6-2. logpb_next_append_page control flow, every branch.
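Steps (C), (I), and (J) can be modeled without a real page buffer. This is a toy sketch, assuming page ids stand in for LOG_PAGE pointers; SketchFlushInfo and sketch_cross_page are hypothetical names, and the locking that the real function performs around the enqueue is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SketchFlushInfo
{
  std::size_t max_toflush;
  std::vector<long> toflush;   // page ids stand in for LOG_PAGE pointers
};

// Models (C), (I), (J): advance to the next page, enqueue that NEW page,
// and report whether the queue reached its threshold (need_flush).
// The real code does the enqueue under flush_mutex and flushes after unlocking.
inline bool sketch_cross_page (SketchFlushInfo &fi, long &append_pageid)
{
  append_pageid++;                        // (C) next logical page
  fi.toflush.push_back (append_pageid);   // (I) the fresh page, not the full one
  return fi.toflush.size () >= fi.max_toflush;   // (J)
}
```

Starting with the first page already queued, two crossings with max_toflush = 3 leave three distinct pages in the queue and request a flush on the second crossing.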

6.4 Obtaining the frame: `logpb_locate_page` for `NEW_PAGE`

logpb_create_page (thread_p, pageid) is a one-line wrapper: return logpb_locate_page (thread_p, pageid, NEW_PAGE);. logpb_locate_page maps a logical pageid to a buffer frame and, for NEW_PAGE, initializes the header in place without touching disk. The branches that matter:

// logpb_locate_page -- src/transaction/log_page_buffer.c
  index = logpb_get_log_buffer_index (pageid);         /* ring hash -> frame slot; bad index -> NULL */
  log_bufptr = &log_Pb.buffers[index];
  if (log_bufptr->pageid != NULL_PAGEID && log_bufptr->pageid != pageid)
    {                                                  /* frame holds a DIFFERENT page */
      if (log_bufptr->dirty == true)
        { assert_release (false); /* must not victimize dirty */ ... }
      log_bufptr->pageid = NULL_PAGEID;                /* invalidate */
    }
  if (log_bufptr->pageid == NULL_PAGEID)
    {
      if (fetch_mode == NEW_PAGE)
        {
          memset (log_bufptr->logpage, LOG_PAGE_INIT_VALUE, LOG_PAGESIZE);  /* 0xff fill */
          log_bufptr->logpage->hdr.logical_pageid = pageid;   /* (1) */
          log_bufptr->logpage->hdr.offset = NULL_OFFSET;      /* (2) */
          log_bufptr->logpage->hdr.flags = 0;                 /* (3) clears any TDE bits */
        }
      else /* OLD_PAGE */
        { if (logpb_read_page_from_file (...) != NO_ERROR) { return NULL; } }
    }
  else
    { assert (fetch_mode == OLD_PAGE); /* frame already holds exactly this page */ }

The three header writes initialize the new page: (1) logical_pageid = pageid binds the frame to this logical page; (2) offset = NULL_OFFSET records that no record starts here yet, and logpb_start_append later overwrites it with the first-record offset; (3) flags = 0 clears stale TDE bits, which is why (H) must re-apply the algorithm. checksum is not set here: it keeps the LOG_PAGE_INIT_VALUE fill from the memset until logpb_set_page_checksum (called by logpb_writev_append_pages per page before fileio_write) runs log_pgptr->hdr.checksum = checksum_crc32;.

Invariant (one frame per ring slot): the assert_release (false) encodes that the appender must never evict a dirty frame for a new append page. The ring is sized so a slot’s prior occupant is flushed before reuse. Violating it overwrites log bytes that have not yet reached disk, which is silent log corruption.
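The NEW_PAGE path and its dirty-frame guard can be sketched as a small model. The slot computation (pageid modulo ring size) is an assumption made for illustration, since the body of logpb_get_log_buffer_index is not shown here; SketchFrame and sketch_locate_new_page are hypothetical names, and the model returns nullptr where the real code asserts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::int64_t SKETCH_NULL_PAGEID = -1;

struct SketchFrame
{
  std::int64_t pageid = SKETCH_NULL_PAGEID;
  bool dirty = false;
  std::int64_t hdr_logical_pageid = SKETCH_NULL_PAGEID;
  int hdr_flags = 0;
};

// Returns the frame for a NEW page, or nullptr if that would evict a dirty frame.
// Assumption: slot = pageid % ring size; the real index function may differ.
inline SketchFrame *sketch_locate_new_page (std::vector<SketchFrame> &ring, std::int64_t pageid)
{
  if (ring.empty () || pageid < 0)
    {
      return nullptr;
    }
  SketchFrame &f = ring[static_cast<std::size_t> (pageid) % ring.size ()];
  if (f.pageid != SKETCH_NULL_PAGEID && f.pageid != pageid)
    {
      if (f.dirty)
        {
          return nullptr;              // must not victimize an unflushed page
        }
      f.pageid = SKETCH_NULL_PAGEID;   // invalidate the previous occupant
    }
  if (f.pageid == SKETCH_NULL_PAGEID)
    {
      f.pageid = pageid;
      f.hdr_logical_pageid = pageid;   // header write (1)
      f.hdr_flags = 0;                 // header write (3): stale TDE bits cleared
    }
  return &f;
}
```

In a four-slot ring, pages 5 and 9 share a slot, so page 9 can only be created once page 5 is no longer dirty.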

6.5 The two public entry points

logpb_next_append_page handles mid-stream crossings; two public functions handle the first page of an append session, bootstrapping log_pgptr and re-anchoring WAL.

// logpb_fetch_start_append_page -- src/transaction/log_page_buffer.c
  PAGE_FETCH_MODE flag = OLD_PAGE;
  if ((log_Gl.hdr.append_lsa.pageid == FIRST_LOG_PAGEID)  /* NDEBUG: ==0; else PRM_ID_FIRST_LOG_PAGEID */
      && (log_Gl.hdr.append_lsa.offset == 0))
    { flag = NEW_PAGE; }                               /* empty log: skip the read */
  if (log_Gl.append.log_pgptr != NULL)
    { logpb_invalid_all_append_pages (thread_p); }     /* stale append page: discard */
  log_Gl.append.log_pgptr = logpb_locate_page (thread_p, log_Gl.hdr.append_lsa.pageid, flag);
  if (log_Gl.append.log_pgptr == NULL) { return ER_FAILED; }
  log_Gl.append.set_nxio_lsa (log_Gl.hdr.append_lsa);  /* (*) re-anchor WAL boundary */
  // ... same flush_mutex enqueue as 6.3: toflush[num_toflush++] = log_pgptr; threshold -> need_flush ...
  if (need_flush) { logpb_flush_pages_direct (thread_p); }  /* note: direct, not _all_append */

Two branches distinguish it from the mid-stream path. Flag selection: an empty log (pageid equal to FIRST_LOG_PAGEID, offset 0) is fetched NEW_PAGE with no read; otherwise OLD_PAGE reads back the on-disk page (e.g. restart resuming a half-full last page) so the appender continues after the tail. Stale-page discard: a non-NULL log_pgptr on entry triggers logpb_invalid_all_append_pages. The (*) line re-anchors WAL: nxio_lsa jumps to the current append position (“everything below here is durable”). Mid-stream logpb_next_append_page never touches nxio_lsa, since within a session the boundary moves only when pages actually flush (Chapter 7).
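The flag selection reduces to a two-input predicate. A minimal sketch, with the first log page id passed as a parameter because FIRST_LOG_PAGEID depends on the build; SketchFetchMode and sketch_start_fetch_mode are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

enum class SketchFetchMode { OLD_PAGE, NEW_PAGE };

// Chooses how the first append page of a session is obtained.
inline SketchFetchMode sketch_start_fetch_mode (std::int64_t append_pageid, std::int64_t append_offset,
                                                std::int64_t first_log_pageid)
{
  if (append_pageid == first_log_pageid && append_offset == 0)
    {
      return SketchFetchMode::NEW_PAGE;   // empty log: nothing on disk to read back
    }
  return SketchFetchMode::OLD_PAGE;       // resume after the existing tail
}
```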

// logpb_fetch_start_append_page_new -- src/transaction/log_page_buffer.c
  log_Gl.append.log_pgptr = logpb_locate_page (thread_p, log_Gl.hdr.append_lsa.pageid, NEW_PAGE);
  if (log_Gl.append.log_pgptr == NULL) { return NULL; }   /* caller handles NULL */
  log_Gl.append.set_nxio_lsa (log_Gl.hdr.append_lsa);
  return log_Gl.append.log_pgptr;

_new is the stripped variant: always NEW_PAGE, no stale-page check, no enqueue and no flush threshold. It serves callers (log creation / format) that want a fresh first page but manage flushing themselves — the thing to know before reusing it is that it skips the toflush[] bookkeeping the other two perform.

6.6 TDE flag propagation across the boundary

The encryption decision belongs to the record but is enforced per page via log_Gl.append.appending_page_tde_encrypted. logpb_append_next_record (Chapter 5) owns the flag’s lifetime: it sets the flag from prior_is_tde_encrypted (node) before assembly and resets it to false after logpb_end_append. While true, two re-stamp sites carry the TDE bits onto pages entered during assembly. Site (H) inside logpb_next_append_page (6.3) calls logpb_set_tde_algorithm then logpb_set_dirty on the just-created page. The second site, in logpb_start_append, guards against re-stamping:

// logpb_start_append -- src/transaction/log_page_buffer.c
  if (log_Gl.append.appending_page_tde_encrypted)
    {
      if (!LOG_IS_PAGE_TDE_ENCRYPTED (log_Gl.append.log_pgptr))   /* idempotent guard */
        {
          TDE_ALGORITHM tde_algo = (TDE_ALGORITHM) prm_get_integer_value (PRM_ID_TDE_DEFAULT_ALGORITHM);
          logpb_set_tde_algorithm (thread_p, log_Gl.append.log_pgptr, tde_algo);
          logpb_set_dirty (thread_p, log_Gl.append.log_pgptr);
        }
    }

logpb_set_tde_algorithm writes hdr.flags (clear the encrypted mask, OR the algorithm bit). Because logpb_locate_page zeroed hdr.flags at create (6.4 step 3), the new page starts un-encrypted and (H) re-stamps it. The trailing logpb_set_dirty is essential: the flusher skips pages that are not dirty, so without it the stamped header could fail to reach disk. The LOG_IS_PAGE_TDE_ENCRYPTED guard in logpb_start_append keeps it from re-stamping a page (H) already handled.

Invariant (encryption follows the record across pages): if record R is TDE-encrypted, every page R touches — including pages allocated mid-R — has a non-zero TDE flag. Enforced by appending_page_tde_encrypted staying true for R’s whole assembly plus the (H) re-stamp on each new page. If broken, part of an encrypted record is written in clear text: logpb_writev_append_pages checks LOG_IS_PAGE_TDE_ENCRYPTED per page at write time, so a missing flag means that page silently skips encryption.
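The invariant can be exercised with a toy model of a record that spans several pages. This is a sketch under stated assumptions: area_size plays the role of LOGAREA_SIZE, a boolean stands in for the TDE bits in hdr.flags, and SketchPage and sketch_append_record are hypothetical names rather than CUBRID functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SketchPage { bool tde_flag = false; bool dirty = false; };

// Appends one record of record_len bytes, starting at offset inside pages.back ().
inline void sketch_append_record (std::vector<SketchPage> &pages, std::size_t &offset,
                                  std::size_t record_len, std::size_t area_size, bool record_is_tde)
{
  assert (!pages.empty () && area_size > 0 && offset <= area_size);
  const bool appending_page_tde_encrypted = record_is_tde;   // set before assembly
  if (appending_page_tde_encrypted && !pages.back ().tde_flag)
    {
      pages.back ().tde_flag = true;    // start-append site, guarded against re-stamping
      pages.back ().dirty = true;
    }
  while (record_len > 0)
    {
      if (offset == area_size)           // boundary crossing
        {
          pages.push_back (SketchPage ());   // a new page starts with cleared flags
          offset = 0;
          if (appending_page_tde_encrypted)
            {
              pages.back ().tde_flag = true;   // site (H)
              pages.back ().dirty = true;
            }
        }
      const std::size_t room = area_size - offset;
      const std::size_t n = (record_len < room) ? record_len : room;
      offset += n;
      record_len -= n;
      pages.back ().dirty = true;
    }
  // In the real code the flag is reset to false here, after logpb_end_append.
}
```

An encrypted 250-byte record that starts at offset 90 of a 100-byte area touches four pages and all four end up flagged; a following unencrypted record that crosses onto a fifth page leaves that new page unflagged.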

6.7 What survives the crossing

log_Gl.append.prev_lsa is logical, not page-relative, so logpb_next_append_page leaves it untouched — it keeps pointing at the last appended record regardless of page, letting logpb_start_append validate header->back_lsa == prev_lsa even when the new record begins on a freshly crossed page. prev_lsa advances only in logpb_start_append (LSA_COPY (&log_Gl.append.prev_lsa, &log_Gl.hdr.append_lsa)), never in the page fetch. This buffer-side seam pairs with Chapter 4’s prior-side log_prior_lsa_append_advance_when_doesnot_fit: both compute the break from the same LOGAREA_SIZE threshold, so reserved address and materialized frame never disagree.

6.8 Chapter summary — key takeaways

One function owns the mid-stream crossing. logpb_next_append_page optionally seals the full page, nulls log_pgptr, advances append_lsa.pageid, creates a fresh frame, and queues a page — all under the write-held LOG CS.
The page enqueued on a crossing is the new page, not the full one — every page enters toflush[] exactly once, at its own birth.
NEW_PAGE create writes three header fields, not the checksum. logpb_locate_page sets logical_pageid, offset = NULL_OFFSET, flags = 0 and memsets the body 0xff; checksum is computed by logpb_set_page_checksum just before fileio_write.
flags = 0 at create is why TDE must be re-stamped. Three sites touch appending_page_tde_encrypted: set/reset in logpb_append_next_record plus the re-stamps in logpb_next_append_page (H) and logpb_start_append, keeping an encrypted record encrypted across page breaks.
Threshold flush is decided under flush_mutex, executed outside it — num_toflush >= max_toflush sets need_flush, then logpb_flush_all_append_pages runs after release.
logpb_fetch_start_append_page vs _new. The former chooses NEW_PAGE/OLD_PAGE, discards stale pages, enqueues, and re-anchors nxio_lsa; _new is always NEW_PAGE and skips the enqueue/threshold work for callers managing their own flush.
The crossing is the buffer-side mirror of Chapter 4 — the prior side reserves address space across the tail before bytes exist, this side fetches the frame for that address, both keyed off LOGAREA_SIZE.

Chapter 7: Flush Durability and the WAL Rule

This chapter answers three linked questions: how do dirty log pages reach stable storage, how is nxio_lsa advanced, and how do group commit and the WAL invariant cooperate to keep recovery correct? Chapters 3–6 built a prior list, drained it into the page buffer (Ch 5), and crossed page boundaries (Ch 6), all in volatile memory. Here the bytes hit the disk.

For the framing of WAL and why the log must be durable before data pages, see the companion cubrid-log-manager.md §“Write-Ahead Logging”. We trace the enforcing code, not the theory.

7.1 The three structures that hold the durability state

Durability is coordinated across three structs: log_append_info (log_append.hpp) owns the watermark; log_flush_info (log_impl.h) is the work list of pages to scan; log_buffer (log_page_buffer.c) is the per-page slot whose dirty bit the flusher clears.

log_append_info holds { int vdes; std::atomic<LOG_LSA> nxio_lsa; LOG_LSA prev_lsa; LOG_PAGE *log_pgptr; bool appending_page_tde_encrypted; }:

Field	Role	Why it exists
`vdes`	Active-log fd passed to every `fileio_write`/`fileio_synchronize`.	The flusher must know which fd to write and fsync.
`nxio_lsa`	The durability watermark — lowest LSA whose page is not yet forced to disk. Atomic for lock-free reads.	Answers both “is record X durable?” (group-commit) and “must I flush before this data page?” (WAL).
`prev_lsa`	Last appended (in-buffer) record, vs `nxio_lsa` (first LSA not yet flushed).	Lets the flusher detect a partial record (`nxio_lsa.pageid == prev_lsa.pageid`) and not validate it early.
`log_pgptr`	Append page currently fixed for new records.	After a flush resets `num_toflush`, it is re-seeded as `toflush[0]`.
`appending_page_tde_encrypted`	Whether the next page must be TDE-encrypted.	Carries the encryption decision from append time to write time.

The accessors are trivial (get_nxio_lsa = nxio_lsa.load (), set_nxio_lsa = nxio_lsa.store ()); the atomicity is the contract.

log_flush_info holds { int max_toflush; int num_toflush; LOG_PAGE **toflush; pthread_mutex_t flush_mutex; } (the mutex is SERVER_MODE only):

Field	Role	Why it exists
`max_toflush`	Array capacity; at `num_toflush == max_toflush` the buffer is full (`log_buffer_full_count` ticks).	Bounds the batch; full-list events drive Ch 6’s partial-append path.
`num_toflush`	Count of queued pages; `< 1` means nothing to flush.	The loop bound; reset to 0 then re-seeded with the live append page.
`toflush`	Array of `LOG_PAGE*`, ascending by `pageid`.	The contiguity scan walks it to coalesce pages into one `writev`.
`flush_mutex`	Guards the array vs concurrent producers (Ch 5 drain) and the flusher.	Held across the entire flush body — scan, nxio-page write, and the `fileio_synchronize` — acquired in Phase 2’s scan setup and released only after the Phase 6 `nxio_lsa` advance; i.e. the fsync runs while `flush_mutex` is held. A separate short-lived acquisition guards only the Phase-1 `num_toflush` check. The group-commit wait is on `gc_cond`/`gc_mutex` (§7.5), a different lock entirely.

log_buffer holds { volatile LOG_PAGEID pageid; volatile LOG_PHY_PAGEID phy_pageid; bool dirty; LOG_PAGE *logpage; }:

Field	Role	Why it exists
`pageid`	Logical page id; `volatile` as slots recycle.	Flusher asserts `bufptr->pageid == toflush[i]->hdr.logical_pageid` to confirm the slot still holds its page.
`phy_pageid`	Physical offset in the active log volume.	Write target `phy_pageid + i`; contiguity needs `phy_pageid+1`, not just `pageid+1`.
`dirty`	Has un-flushed changes.	The scan’s primary filter; cleared exactly when the write succeeds.
`logpage`	The page bytes (header + area).	What is handed to `fileio_write`/encryption.

flowchart LR
  FI["LOG_FLUSH_INFO<br/>toflush[ ], num_toflush"]
  LB["LOG_BUFFER<br/>pageid, phy_pageid, dirty, logpage"]
  AI["log_append_info<br/>nxio_lsa, prev_lsa, log_pgptr"]
  FI -->|"toflush[i] resolves to"| LB
  LB -->|"dirty pages written, then"| AI
  AI -->|"nxio_lsa.pageid flushed LAST"| FI

Figure 7-1. The three durability structures and how the flusher pivots between them.

Durability invariant. A commit record at LSA L is durable iff get_nxio_lsa () > L (the watermark has moved past L). logpb_flush_all_append_pages writes the nxio_lsa page last and advances nxio_lsa only after fileio_synchronize returns. If it advanced before the fsync, a crash could leave a transaction reported as committed whose log is not on disk, and recovery would lose it.
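The invariant is a strict comparison on (pageid, offset). A minimal sketch, assuming a simplified two-field LSA rather than the packed bit-field of Chapter 1; SketchLsa, sketch_lsa_lt, and sketch_is_durable are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct SketchLsa
{
  std::int64_t pageid;
  std::int64_t offset;
};

// Ordering compares pageid first, then offset.
inline bool sketch_lsa_lt (const SketchLsa &a, const SketchLsa &b)
{
  return a.pageid < b.pageid || (a.pageid == b.pageid && a.offset < b.offset);
}

// A record at record_lsa is durable iff the watermark is strictly greater.
inline bool sketch_is_durable (const SketchLsa &nxio_lsa, const SketchLsa &record_lsa)
{
  return sketch_lsa_lt (record_lsa, nxio_lsa);
}
```

A record sitting exactly at the watermark is not durable, which is why the comparison is strict.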

7.2 `logpb_flush_all_append_pages`: the durability engine

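This is the only function that writes append pages and advances nxio_lsa as the result of a flush (the fetch helpers in 6.5 only re-anchor it at the start of an append session). It runs under LOG_CS write mode (assert (LOG_CS_OWN_WRITE_MODE)) and returns 1 = flushed, 0 = nothing to do, < 0 = error.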

Phase 1 — decide whether to flush at all, under a short-lived flush_mutex acquisition that is released before the body. Two early returns set need_flush = false and return 0: when num_toflush < 1 (empty list), and when num_toflush == 1 && !logpb_is_dirty (toflush[0]). The single-clean-page short-circuit keeps idle timer-driven flushes from rewriting an unchanged end-of-log marker.

Phase 2 — place the end-of-log marker, branching on log_Pb.partial_append.status (Ch 6’s LOGPB_APPENDREC_* enum), then re-acquire flush_mutex for the rest of the body:

IN_PROGRESS — a record is half-appended. Copy the header page aside, clear the slot’s dirty, overwrite the in-progress header with LOG_END_OF_LOG, write that copy via logpb_write_page_to_disk; status → PARTIAL_FLUSHED_END_OF_LOG. If the page already left the buffer it is fatal → goto error.
PARTIAL_FLUSHED_END_OF_LOG — re-entry continuing the flush; log and fall through.
PARTIAL_ENDED / SUCCESS — normal case: build an eof record and logpb_start_append it without advancing append_lsa (overwritten later).
anything else — assert_release (false) → goto error.

Phase 3 — the two-step contiguous-run scan (under flush_mutex), whose rule is the nxio_lsa page is flushed last. A while (true) alternates step 1 (skip until a dirty non-nxio page; exit if none remain) and step 2 (extend the run). Step 2 has four break conditions — each a real branch that ends the run and re-enters the skip phase:

// logpb_flush_all_append_pages (step-2 run conditions) -- src/transaction/log_page_buffer.c
if (!bufptr->dirty) break;                                  /* <- clean stops run */
if (bufptr->pageid == log_Gl.append.get_nxio_lsa ().pageid) break;  /* <- nxio last */
if (prv_bufptr->pageid + 1 != bufptr->pageid) break;        /* <- logical gap */
if (prv_bufptr->phy_pageid + 1 != bufptr->phy_pageid) break;/* <- physical gap */

The run [idxflush, i) then goes to logpb_writev_append_pages; a NULL return is fatal (goto error), otherwise need_sync = true and each page’s dirty is cleared — only after the write returns non-NULL.
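The two-step scan can be written as a standalone function over a simplified queue. This is a sketch, not the CUBRID loop: SketchBuf and sketch_collect_runs are hypothetical names, each returned pair is a half-open index range that would go to one write call, and the nxio page is excluded from every run.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct SketchBuf
{
  std::int64_t pageid;
  std::int64_t phy_pageid;
  bool dirty;
};

inline std::vector<std::pair<std::size_t, std::size_t>>
sketch_collect_runs (const std::vector<SketchBuf> &toflush, std::int64_t nxio_pageid)
{
  std::vector<std::pair<std::size_t, std::size_t>> runs;
  std::size_t i = 0;
  while (true)
    {
      // Step 1: skip clean pages and the nxio page.
      while (i < toflush.size () && (!toflush[i].dirty || toflush[i].pageid == nxio_pageid))
        {
          i++;
        }
      if (i == toflush.size ())
        {
          break;
        }
      const std::size_t begin = i++;
      // Step 2: extend the run until one of the four break conditions fires.
      while (i < toflush.size ()
             && toflush[i].dirty
             && toflush[i].pageid != nxio_pageid
             && toflush[i - 1].pageid + 1 == toflush[i].pageid
             && toflush[i - 1].phy_pageid + 1 == toflush[i].phy_pageid)
        {
          i++;
        }
      runs.emplace_back (begin, i);
    }
  return runs;
}
```

In the test below, a clean page splits one run and a physical wrap (phy_pageid dropping back to 0) splits another, even though the logical page ids stay consecutive.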

Phase 4 — flush the nxio_lsa page last, branching on whether it holds a complete record:

// logpb_flush_all_append_pages (nxio page) -- src/transaction/log_page_buffer.c
if (log_Pb.partial_append.status == LOGPB_APPENDREC_SUCCESS
    || nxio_lsa.pageid != log_Gl.append.prev_lsa.pageid)   /* complete -> write it */
  { /* assert_release pageid match and dirty, else goto error */
    logpb_write_page_to_disk (thread_p, bufptr->logpage, bufptr->pageid);
    need_sync = true; bufptr->dirty = false; }
else { /* skip: nxio page holds an incomplete record, defer until complete */ }

Phase 5 — the fsync, still under flush_mutex. When need_sync is set and the PRM_ID_SUPPRESS_FSYNC sampling escape allows it (escape == 0, or total_sync_count % escape == 0), it calls fileio_synchronize (thread_p, log_Gl.append.vdes, log_Name_active, false); a NULL_VOLDES return is fatal → goto error.
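The sampling condition in Phase 5 is small enough to state as a predicate. A sketch, assuming the parameter value is non-negative; sketch_should_fsync is a hypothetical name.

```cpp
#include <cassert>

// True when Phase 5 would call fileio_synchronize.
// suppress_fsync models the PRM_ID_SUPPRESS_FSYNC value (0 = never suppress).
inline bool sketch_should_fsync (bool need_sync, unsigned suppress_fsync, unsigned long total_sync_count)
{
  if (!need_sync)
    {
      return false;
    }
  // The short-circuit keeps the modulo from dividing by zero.
  return suppress_fsync == 0 || total_sync_count % suppress_fsync == 0;
}
```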

Phase 6 — advance nxio_lsa, again branching on partial_append.status:

LOGPB_APPENDREC_PARTIAL_ENDED — restore the original record header, rewrite + fsync again, then set_nxio_lsa (log_Gl.hdr.append_lsa); status → PARTIAL_FLUSHED_ORIGINAL.
LOGPB_APPENDREC_PARTIAL_FLUSHED_END_OF_LOG — cannot validate yet; set_nxio_lsa (log_Gl.append.prev_lsa) (one record short).
LOGPB_APPENDREC_SUCCESS — set_nxio_lsa (log_Gl.hdr.append_lsa).
else — assert_release (false) → goto error.

The list is then reset (num_toflush = 0) and, if log_Gl.append.log_pgptr != NULL, that live page is re-seeded as toflush[0]; flush_mutex is released and the function returns 1. The error: label releases flush_mutex if still held and calls logpb_fatal_error — every error path is unrecoverable.
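The Phase 6 branches map a status to the new watermark. A minimal sketch of that mapping only: the header restore, rewrite, and second fsync of the PARTIAL_ENDED branch are omitted, SketchStatus mirrors the status names used in this chapter, and sketch_advance_nxio is a hypothetical name that returns false where the real code takes the error path.

```cpp
#include <cassert>
#include <cstdint>

enum class SketchStatus
{
  IN_PROGRESS, PARTIAL_ENDED, PARTIAL_FLUSHED_END_OF_LOG, PARTIAL_FLUSHED_ORIGINAL, SUCCESS
};

struct SketchLsa
{
  std::int64_t pageid;
  std::int64_t offset;
};

inline bool sketch_advance_nxio (SketchStatus &status, const SketchLsa &append_lsa,
                                 const SketchLsa &prev_lsa, SketchLsa &nxio_lsa)
{
  switch (status)
    {
    case SketchStatus::PARTIAL_ENDED:
      nxio_lsa = append_lsa;                        // after the original header is rewritten
      status = SketchStatus::PARTIAL_FLUSHED_ORIGINAL;
      return true;
    case SketchStatus::PARTIAL_FLUSHED_END_OF_LOG:
      nxio_lsa = prev_lsa;                          // one record short: not yet validated
      return true;
    case SketchStatus::SUCCESS:
      nxio_lsa = append_lsa;
      return true;
    default:
      return false;                                 // the real code asserts and goes to error
    }
}
```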

flowchart TD
  A["enter (LOG_CS write)"] --> B{num_toflush?}
  B -->|"< 1, or 1+clean"| Z0["return 0"]
  B -->|flushable| C{partial_append.status}
  C -->|IN_PROGRESS| D["overwrite EOL, write copy"]
  C -->|SUCCESS/PARTIAL_ENDED| E["start_append EOL marker"]
  D --> F["scan toflush: skip clean,<br/>collect dirty run, writev, clear dirty"]
  E --> F
  C -->|PARTIAL_FLUSHED_END_OF_LOG| F
  F --> H{nxio page holds<br/>partial record?}
  H -->|no| I["write nxio page LAST"]
  H -->|yes| K
  I --> K{need_sync?}
  K -->|yes| L["fileio_synchronize"]
  K -->|no| M
  L --> M["advance nxio_lsa per status,<br/>num_toflush=0, reseed log_pgptr"]
  M --> Z1["return 1"]
  C -.->|anything else| ERR["logpb_fatal_error"]
  D -.->|fatal| ERR
  F -.->|writev NULL| ERR
  L -.->|fail| ERR
  M -.->|unknown status| ERR

Figure 7-2. Branch-complete flow of logpb_flush_all_append_pages.

7.3 `logpb_writev_append_pages`: the actual write

The lowest-level write helper. It CRC-stamps every page (logpb_set_page_checksum, NULL on failure), then loops over npages with two per-page branches:

// logpb_writev_append_pages -- src/transaction/log_page_buffer.c
if (LOG_IS_PAGE_TDE_ENCRYPTED (log_pgptr))        /* branch 1: encrypt into enc_pgptr; */
  // ... on encrypt failure, turn TDE off for this page ...
if (fileio_write (..., log_pgptr, phy_pageid + i, LOG_PAGESIZE, write_mode) == NULL)
  { /* branch 2: ER_LOG_WRITE_OUT_OF_SPACE / ER_LOG_WRITE */ to_flush = NULL; break; }

Despite the name, it loops fileio_write page by page at phy_pageid + i; Phase-3 contiguity makes the batch one sequential extent. write_mode is FILEIO_WRITE_NO_COMPENSATE_WRITE when the double write buffer (DWB) is in use. Any write failure makes the function return to_flush == NULL, which the caller treats as fatal.

7.4 The synchronous demand paths

logpb_flush_pages_direct is the core: under assert (LOG_CS_OWN_WRITE_MODE) it calls logpb_prior_lsa_append_all_list (the Ch 5 drain) then logpb_flush_all_append_pages (the engine). Two thin wrappers add the critical section: logpb_force_flush_pages is just LOG_CS_ENTER; logpb_flush_pages_direct; LOG_CS_EXIT, and logpb_force_flush_header_and_pages adds logpb_flush_header (Ch 10) before the exit — used at checkpoint and wherever the header’s eof_lsa/append fields must match disk.

7.5 `logpb_flush_pages`: the four commit modes

logpb_flush_pages (thread_p, flush_lsa) is the entry every committing transaction calls. !SERVER_MODE is direct flush under LOG_CS. In SERVER_MODE two fall-backs go direct:

not restarted, or flush_lsa NULL/ISNULL → direct flush, return.
daemon unavailable (!log_is_log_flush_daemon_available ()) → direct flush.

Otherwise it derives a 2×2 policy from async_commit (PRM_ID_LOG_ASYNC_COMMIT) × group_commit (LOG_IS_GROUP_COMMIT_ACTIVE ()):

// logpb_flush_pages (mode matrix) -- src/transaction/log_page_buffer.c
//  async  group | need_wait  need_wakeup_LFT
//    X      X    |   true        true     (sync, non-group: wake daemon, wait)
//    X      O    |   true        false    (sync, group: just wait)
//    O      X    |   false       true     (async: wake daemon, return)
//    O      O    |   false       false    (async+group: just return)
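Read row by row, the matrix says that each output depends on one input only. A sketch of that reading, which covers only the four rows above and not the later pgbuf_has_perm_pages_fixed override; SketchPolicy and sketch_commit_policy are hypothetical names.

```cpp
#include <cassert>

struct SketchPolicy
{
  bool need_wait;
  bool need_wakeup_lft;
};

// need_wait follows async_commit, need_wakeup_lft follows group_commit.
inline SketchPolicy sketch_commit_policy (bool async_commit, bool group_commit)
{
  SketchPolicy p;
  p.need_wait = !async_commit;        // synchronous commits wait for durability
  p.need_wakeup_lft = !group_commit;  // without group commit, wake the daemon now
  return p;
}
```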

The wait loop is the waiter side of group commit: it blocks on gc_cond (pthread_cond_timedwait releases gc_mutex while the thread sleeps and re-acquires it on wakeup) until nxio_lsa reaches its flush_lsa:

// logpb_flush_pages (group-commit wait) -- src/transaction/log_page_buffer.c
if (need_wakeup_LFT == false && pgbuf_has_perm_pages_fixed (thread_p))
  need_wakeup_LFT = true;          /* <- holding data pages: push daemon to avoid a stall */
while (LSA_LT (&nxio_lsa, flush_lsa)) {        // re-read nxio_lsa each iteration
    // ... lock gc_mutex ...
    if (LSA_GE (&nxio_lsa, flush_lsa)) break;  /* <- re-check under lock: already durable */
    if (need_wakeup_LFT == true) log_wakeup_log_flush_daemon ();
    pthread_cond_timedwait (&gc_cond, &gc_mutex, &to);   /* 1000ms deadline */
    need_wakeup_LFT = true;        /* <- after first wait, always nudge daemon */
}

The in-lock re-check prevents a lost-wakeup race; the 1000ms timeout bounds latency. (The shared-fsync mechanics are in §7.6 / takeaway 4 — not repeated here.)

Open question (carried from the companion). The exact group-commit window policy — how long the daemon coalesces before syncing — is the log_get_log_group_commit_interval looper period plus on-demand wakeup () calls. The companion flags the batching/latency trade-off as unresolved; this chapter documents the mechanism, not the tuning.

7.6 The flush daemon and group-commit producer side

The daemon is a cubthread::daemon with looper period log_get_log_group_commit_interval. Its task body log_flush_execute (in log_manager.c) guards on BO_IS_SERVER_RESTARTED () and log_Flush_has_been_requested (returning early if either is false), performs a single LOG_CS_ENTER; logpb_flush_pages_direct; LOG_CS_EXIT whose result all waiters share, then under gc_mutex runs pthread_cond_broadcast (&gc_cond) and clears log_Flush_has_been_requested. The broadcast (not signal) lets one flush satisfy every waiter whose flush_lsa <= nxio_lsa. The producer side, log_wakeup_log_flush_daemon, is called by committers and the WAL path and does only log_Flush_has_been_requested = true; log_Flush_daemon->wakeup (); (SERVER_MODE only). Setting the flag before wakeup () guarantees a daemon already mid-iteration sees the request next loop.

stateDiagram-v2
  [*] --> Sleeping
  Sleeping --> Checking : timer tick or wakeup
  Checking --> Sleeping : not requested, return
  Checking --> Flushing : requested\n LOG_CS enter
  Flushing --> Broadcasting : flush_pages_direct done\n one fsync
  Broadcasting --> Sleeping : gc_cond broadcast\n clear request

Figure 7-3. log_Flush_daemon state cycle. Waiters in logpb_flush_pages observe nxio_lsa advance after Broadcasting.
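One daemon iteration and its effect on waiters can be modeled single-threaded. This sketch deliberately leaves out the threads, gc_mutex, and gc_cond; LSAs are collapsed to integers, and SketchDaemon, sketch_flush_execute, and sketch_released are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct SketchDaemon
{
  bool restarted;
  bool flush_requested;
  std::int64_t nxio;     // durability watermark
  std::int64_t append;   // current end of the in-memory log
  int fsync_count;
};

// One iteration of the task body: guard, one flush, clear the request.
inline bool sketch_flush_execute (SketchDaemon &d)
{
  if (!d.restarted || !d.flush_requested)
    {
      return false;             // Checking -> Sleeping
    }
  d.nxio = d.append;            // stands in for logpb_flush_pages_direct
  d.fsync_count++;
  d.flush_requested = false;    // cleared after the broadcast
  return true;
}

// Waiters whose loop condition (nxio < flush_lsa) is now false.
inline std::size_t sketch_released (const SketchDaemon &d, const std::vector<std::int64_t> &waiters)
{
  std::size_t n = 0;
  for (std::int64_t flush_lsa : waiters)
    {
      if (!(d.nxio < flush_lsa))
        {
          n++;
        }
    }
  return n;
}
```

A single flush that moves the watermark to 20 releases the waiters at 7, 12, and 20 with one fsync, while a waiter at 25 stays blocked until a later iteration.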

7.7 `logpb_flush_log_for_wal`: the read-side WAL invariant

Called by the page buffer manager before writing any data page, passing its last-modifying LSA. It enforces WAL with double-checked locking on logpb_need_wal:

// logpb_flush_log_for_wal -- src/transaction/log_page_buffer.c
if (logpb_need_wal (lsa_ptr))                 /* <- cheap atomic check, no lock */
  {
    LOG_CS_ENTER (thread_p);
    if (logpb_need_wal (lsa_ptr))             /* <- re-check under LOG_CS, else someone flushed it */
      logpb_flush_pages_direct (thread_p);
    LOG_CS_EXIT (thread_p);
    assert (LSA_ISNULL (lsa_ptr) || !logpb_need_wal (lsa_ptr));  /* <- post-condition */
  }

The predicate logpb_need_wal (lsa) is just LSA_LE (&get_nxio_lsa (), lsa) — true when the log up to *lsa is not yet durable — making the invariant directly testable.

WAL invariant. No data page modified at LSA L may be written while logpb_need_wal (L) holds (nxio_lsa <= L). The buffer manager calls logpb_flush_log_for_wal first; its post-condition assert (!logpb_need_wal (lsa_ptr)) guarantees the log is durable to L before the write. Violating it lets a redo record’s effect reach disk without the record, so recovery cannot reconstruct or undo the change. The two logpb_need_wal calls (outside/inside LOG_CS) avoid both a needless critical section when already durable and a redundant flush when a concurrent committer advanced nxio_lsa.
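The double-checked shape can be reproduced with standard C++ primitives. This is a sketch under stated assumptions: LSAs are collapsed to integers, a std::mutex stands in for LOG_CS, the flush is modeled as moving the watermark to the append position, and the caller passes an LSA below that position. SketchLog, sketch_need_wal, and sketch_flush_log_for_wal are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

struct SketchLog
{
  std::atomic<std::int64_t> nxio_lsa { 0 };
  std::int64_t append_lsa = 0;
  std::mutex log_cs;          // stands in for LOG_CS
  int flush_count = 0;
};

// True while the log up to lsa is not yet durable (nxio_lsa <= lsa).
inline bool sketch_need_wal (const SketchLog &log, std::int64_t lsa)
{
  return log.nxio_lsa.load () <= lsa;
}

// Returns true if this call had to flush.
inline bool sketch_flush_log_for_wal (SketchLog &log, std::int64_t lsa)
{
  bool flushed = false;
  if (sketch_need_wal (log, lsa))                 // cheap check, no lock
    {
      std::lock_guard<std::mutex> guard (log.log_cs);
      if (sketch_need_wal (log, lsa))             // re-check under the lock
        {
          log.nxio_lsa.store (log.append_lsa);    // stands in for logpb_flush_pages_direct
          log.flush_count++;
          flushed = true;
        }
    }
  assert (!sketch_need_wal (log, lsa));           // post-condition
  return flushed;
}
```

The first call for a not-yet-durable LSA flushes once; a second call for an LSA already below the watermark returns without taking the lock.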

7.8 Chapter summary — key takeaways

nxio_lsa is the one durability watermark — the lowest not-yet-written LSA, atomic in log_append_info, answering both “is this commit durable?” and “must I flush before this data write?”. It advances only after fileio_synchronize succeeds.
logpb_flush_all_append_pages flushes the nxio_lsa page last. The two-step scan (skip clean, collect contiguous dirty) batches adjacent pages, then writes the watermark page alone, so the new end-of-log is never validated before its predecessors are on disk.
The main flush_mutex spans the whole flush body — scan, nxio-page write, and the fileio_synchronize all run while it is held; only the Phase-1 num_toflush peek takes a separate short lock. The group-commit wait uses gc_cond/gc_mutex, a distinct lock.
Group commit amortizes one fsync over many committers. Waiters block on gc_cond and re-check nxio_lsa under gc_mutex; the daemon does one logpb_flush_pages_direct and broadcasts, releasing everyone whose flush_lsa is now covered.
The 2×2 commit matrix (async_commit × group_commit) picks wake-and-wait, just-wait, wake-and-return, or just-return; non-SERVER and no-daemon paths fall back to direct flush.
WAL is enforced read-side by logpb_flush_log_for_wal via double-checked logpb_need_wal around LOG_CS; its post-condition asserts the log is durable to the requested LSA before any data page is written.
The exact group-commit window policy is an open question (from the companion): the mechanism is the daemon’s looper interval plus on-demand wakeup (), but the batching/latency tuning is not pinned down here.

Chapter 8: Commit and Abort Lifecycle

This chapter answers one reader question: how does a transaction-boundary record ride the same prior-list / page-buffer / flush pipeline that Chapters 3-7 built, force its own durability, and drive the final state transitions and lock release? Boundary records are special only in carrying a LOG_REC_DONETIME payload (no undo/redo data) and in being wrapped by durability and state-machine discipline. The append mechanics are unchanged — log_commit reuses prior_lsa_alloc_and_copy_data / prior_lsa_next_record (Chapters 3-4) and the logpb_flush_pages force path (Chapter 7). We focus on the wrapping; recovery-side replay is out of scope (cubrid-recovery-manager.md).

8.1 The three structs at the transaction boundary

Three structs meet at commit/abort time: the descriptor log_tdes, the per-record header log_rec_header (Chapters 1, 3), and the boundary payload log_rec_donetime.

`log_rec_donetime` — the commit/abort payload

The entire type-specific payload of a LOG_COMMIT/LOG_ABORT record; its existence at a known LSA is the information.

// log_rec_donetime — src/transaction/log_record.hpp
struct log_rec_donetime
{
  INT64 at_time;    /* Database creation time. For safety reasons */
};

Field	Role	Why it exists
`at_time`	Wall-clock `time(NULL)` captured in `log_append_donetime_internal`.	Timestamps completion for forensics. The “Database creation time” comment is stale — it holds the termination time; the commit protocol never reads it back.

Invariant — the donetime record’s LSA is the commit point. The record carries no other state, so durability reduces to “the page holding this LSA is on disk”. Everything in §8.4 makes that true before the client is told the commit succeeded.

`log_rec_header` — role at the boundary

The generic header (full coverage in Chapter 1) is reused verbatim; at the boundary only type and prev_tranlsa carry special meaning.

// log_rec_header — src/transaction/log_record.hpp
struct log_rec_header
{
  LOG_LSA prev_tranlsa;   /* previous log record for the same transaction */
  LOG_LSA back_lsa, forw_lsa;  /* physical backward/forward links */
  TRANID  trid;           /* transaction identifier */
  LOG_RECTYPE type;       /* e.g. LOG_COMMIT, LOG_ABORT */
};

Field	Role at a COMMIT/ABORT record	Why it matters here
`prev_tranlsa`	Closes the per-transaction chain. Recovery never undoes a committed chain, but the link is still written.	Chapter 4 assigns it at attach time from `tdes->tail_lsa`.
`type`	`LOG_COMMIT`, `LOG_ABORT`, or `LOG_COMMIT_WITH_POSTPONE` when postpone work remains.	Recovery dispatch keys on this to decide “this trid is done — do not undo it”.
`back_lsa` / `forw_lsa`	Physical-order links, assigned by the prior-list machinery as for a data record.	Lets the analysis pass scan past the boundary record.
`trid`	The committing/aborting transaction’s id.	Recovery groups records by `trid`.

`log_tdes` — the transaction descriptor (boundary-relevant fields)

log_tdes is large; only the fields the commit/abort path reads or writes are covered. The full struct lives in log_impl.h.

// log_tdes (excerpt) — src/transaction/log_impl.h
struct log_tdes
{
  int        tran_index;          TRANID trid;       TRAN_STATE state;
  LOG_LSA    head_lsa;            LOG_LSA tail_lsa;  LOG_LSA undo_nxlsa;
  LOG_LSA    posp_nxlsa;          LOG_LSA commit_abort_lsa;
  LOG_TOPOPS_STACK topops;        /* topops.last must be < 0 at the boundary */
  void      *first_save_entry;    bool has_supplemental_log;
  // ... condensed ...
};

Five fields behave identically on both paths, so they get one note rather than a two-column matrix: tran_index (table index resolved by LOG_FIND_THREAD_TRAN_INDEX; log_abort_by_tdes rebinds it onto the executing thread, §8.7), trid (stamped into log_rec_header.trid, recycled by logtb_get_new_tran_id), head_lsa (informational, never read by the protocol), topops.last (must be < 0; a live sysop is a bug — assert(false) + force-attach to outer), and first_save_entry (freed via spage_free_saved_spaces). The fields whose role diverges between commit and abort:

Field	Role at commit	Role at abort
`state`	`ACTIVE` -> `WILL_COMMIT` -> (`…_WITH_POSTPONE`) -> `COMMITTED`.	Straight to `ABORTED` before any rollback.
`tail_lsa`	NULL = touched nothing -> skip donetime (§8.3, §8.5); else chain tail the record links to.	Same gate.
`undo_nxlsa`	Reset to NULL so a checkpoint during `WILL_COMMIT` sees no stale cursor.	The rollback cursor `log_rollback` walks `prev_tranlsa` from.
`posp_nxlsa`	Non-NULL -> postpone pending -> `LOG_COMMIT_WITH_POSTPONE` (§8.3.1).	Unused.
`has_supplemental_log`	If set, a `LOG_SUPPLEMENT_TRAN_USER` record precedes the commit record (CDC), then cleared.	Just cleared.
`commit_abort_lsa`	Stamped with the boundary LSA so a checkpoint can distinguish concluded transactions from live ones. The stamp is applied by the prior-list append in `log_append.cpp` when the donetime node is materialized, not by the commit path itself.	Same.

flowchart TB
  TDES["log_tdes\nstate, tail_lsa,\nundo_nxlsa, posp_nxlsa"]
  HDR["log_rec_header\ntype = LOG_COMMIT/ABORT\nprev_tranlsa = tail_lsa"]
  DT["log_rec_donetime\nat_time"]
  NODE["LOG_PRIOR_NODE\n(Chapter 3)"]
  TDES -->|prev_tranlsa = tail_lsa| HDR
  HDR --> NODE
  DT -->|node->data_header| NODE
  NODE -->|prior_lsa_next_record| PL["prior list -> page buffer -> disk"]

Figure 8-1 — log_tdes supplies the chain tail, a log_rec_header of type LOG_COMMIT/LOG_ABORT is built, and a log_rec_donetime becomes the node’s data header. From there it is an ordinary prior-list node.

8.2 `log_commit` — the entry point and its branch fan-out

log_commit resolves the descriptor, validates state, and routes to the 2PC or local path — every branch:

// log_commit — src/transaction/log_manager.c
if (tdes == NULL) return TRAN_UNACTIVE_UNKNOWN;        /* <- fatal: unknown index */
if (!LOG_ISTRAN_ACTIVE (tdes) && !LOG_ISTRAN_2PC_PREPARE (tdes) && LOG_ISRESTARTED ())
  return tdes->state;                                  /* <- not commitable; no-op */
if (tdes->topops.last >= 0)                            /* <- impossible-but-handled */
  { assert (false); while (tdes->topops.last >= 0) log_sysop_attach_to_outer (thread_p); }
if (log_2pc_clear_and_is_tran_distributed (tdes))
  state = log_2pc_commit (...);                        /* <- 2PC arm (cubrid-2pc.md) */
else                                                   /* <- local arm */
  { state = log_commit_local (thread_p, tdes, retain_lock, true);
    state = log_complete (thread_p, tdes, LOG_COMMIT, LOG_NEED_NEWTRID, LOG_ALREADY_WROTE_EOT_LOG); }
if (log_No_logging) { /* force pages + data, flush header */ }
perfmon_inc_stat (thread_p, PSTAT_TRAN_NUM_COMMITS);   /* return state */

Invariant — topops.last < 0 at the transaction boundary. Commit and abort require that no system operation is open. If one is, the code warns, asserts in debug builds, and force-folds every open sysop into the outer transaction with log_sysop_attach_to_outer; without that salvage, the sysop’s records would be left out of the boundary record’s prev_tranlsa chain.
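The salvage loop can be pictured with a toy model. This is a minimal sketch, not CUBRID code: ToyTopops, toy_attach_to_outer, and toy_salvage_open_sysops are hypothetical names, and the only behavior assumed is that each attach-to-outer call closes exactly one open level.

```cpp
#include <cassert>

// Hypothetical miniature of the sysop stack: only the 'last' index matters here.
struct ToyTopops
{
  int last = -1;   // index of the innermost open sysop; -1 means none is open
};

// Stand-in for log_sysop_attach_to_outer: fold the innermost sysop into its parent.
inline void toy_attach_to_outer (ToyTopops &topops)
{
  assert (topops.last >= 0);
  topops.last--;
}

// Salvage loop run at the boundary; returns how many sysops had to be folded.
inline int toy_salvage_open_sysops (ToyTopops &topops)
{
  int folded = 0;
  while (topops.last >= 0)
    {
      toy_attach_to_outer (topops);
      folded++;
    }
  return folded;
}
```

A non-zero return value corresponds to the "impossible-but-handled" branch: the boundary still proceeds, but only after the stack is empty.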

8.3 `log_commit_local` — postpone, append, release, flush

log_commit_local does the real work in a strict order dictated by one rule.

Invariant — nothing may be logged after the transaction enters an unactive state. If a checkpoint snapshots the transaction as TRAN_UNACTIVE_WILL_COMMIT and a crash precedes still-pending logging (e.g. unique statistics), recovery commits it without those changes — silent data loss. So tx_lob_locator_clear and logtb_complete_mvcc (both of which log) run before tdes->state = TRAN_UNACTIVE_WILL_COMMIT.

// log_commit_local — src/transaction/log_manager.c
tx_lob_locator_clear (...); logtb_complete_mvcc (thread_p, tdes, true);  /* both log -> must precede WILL_COMMIT */
tdes->state = TRAN_UNACTIVE_WILL_COMMIT;
LSA_SET_NULL (&tdes->undo_nxlsa);                    /* checkpoint must not see a stale undo cursor */
if (!LSA_ISNULL (&tdes->tail_lsa))                   /* <- transaction touched data */
  {
    log_tran_do_postpone (thread_p, tdes);           /* §8.3.1 — run postpone if any */
    if (is_local_tran) {
      if (... log_does_allow_replication () ...)
        log_append_repl_info_and_commit_log (...);   /* repl+commit, one mutex */
      else log_append_commit_log (thread_p, tdes, &commit_lsa);            /* plain LOG_COMMIT */
      if (retain_lock != true) lock_unlock_all (thread_p);                 /* <- retain_lock gate */
      log_change_tran_as_completed (thread_p, tdes, LOG_COMMIT, &commit_lsa);  /* state + force */
    } else { /* participant: commit log + unlock deferred to log_complete_for_2pc */ }
  }
else { if (retain_lock != true) lock_unlock_all (thread_p);  /* <- read-only: no donetime record */
       tdes->state = TRAN_UNACTIVE_COMMITTED; }
return tdes->state;

The replication route takes prior_lsa_mutex once so the replication and commit records get adjacent LSAs; the plain route just appends the donetime record. The participant branch (is_local_tran == false) defers both the commit record and lock release to log_complete_for_2pc (cubrid-2pc.md).
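The ordering of the local arm can be checked with a small trace model. This is a sketch under stated assumptions: ToyState, ToyTdes, toy_commit_local, and the step labels are hypothetical, the replication and 2PC participant routes are left out, and each real call is reduced to a string pushed onto a trace.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical three-state subset of TRAN_STATE.
enum class ToyState { Active, WillCommit, Committed };

struct ToyTdes
{
  ToyState state = ToyState::Active;
  bool tail_lsa_is_null = true;      // true when the transaction appended nothing
  std::vector<std::string> trace;    // steps in the order they ran
};

// Local, non-replicated commit only; every step is recorded instead of executed.
inline ToyState toy_commit_local (ToyTdes &tdes, bool retain_lock)
{
  assert (tdes.state == ToyState::Active);
  tdes.trace.push_back ("log_lob_and_mvcc");   // everything that logs goes first
  tdes.state = ToyState::WillCommit;
  tdes.trace.push_back ("will_commit");
  if (!tdes.tail_lsa_is_null)
    {
      tdes.trace.push_back ("postpone");
      tdes.trace.push_back ("append_commit");
      if (!retain_lock)
        tdes.trace.push_back ("unlock");
      tdes.trace.push_back ("force");
    }
  else if (!retain_lock)
    {
      tdes.trace.push_back ("unlock");         // read-only: no donetime record
    }
  tdes.state = ToyState::Committed;
  return tdes.state;
}
```

The trace makes the two properties visible: the logging step always precedes the state change, and a NULL tail_lsa never reaches the append or the force.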

8.3.1 `log_tran_do_postpone` — `LOG_COMMIT_WITH_POSTPONE`

If posp_nxlsa is non-NULL the transaction has deferred actions; log_tran_do_postpone writes and forces a LOG_COMMIT_WITH_POSTPONE record before running the postpones (Chapter 9).

// log_tran_do_postpone — src/transaction/log_manager.c
if (LSA_ISNULL (&tdes->posp_nxlsa)) return;          /* <- nothing to postpone */
assert (tdes->topops.last < 0);
log_append_commit_postpone (thread_p, tdes, &tdes->posp_nxlsa);  /* COMMIT_WITH_POSTPONE + flush */
if (tdes->m_log_postpone_cache.do_postpone (*thread_p, tdes->posp_nxlsa))
  { perfmon_inc_stat (..., PSTAT_TRAN_NUM_PPCACHE_HITS); return; }  /* cache fast-path */
log_do_postpone (thread_p, tdes, &tdes->posp_nxlsa);  /* scan forward, run LOG_POSTPONE records */

log_append_commit_postpone sets state = TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE and forces immediately, so the marker is durable before postpones run (recovery can resume after a crash). The plain LOG_COMMIT written later closes the transaction.
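The three outcomes of log_tran_do_postpone can be sketched as a routing function. The names below (ToyPostponeRoute, ToyPostponeTdes, toy_tran_do_postpone) are hypothetical, and the cache is reduced to a single boolean that says whether it can replay the postpones from memory.

```cpp
#include <cassert>

// Which path the postpone phase took (illustrative names only).
enum class ToyPostponeRoute { Nothing, CacheReplay, LogScan };

struct ToyPostponeTdes
{
  bool posp_nxlsa_is_null;     // no deferred actions recorded
  bool cache_can_replay;       // stand-in for m_log_postpone_cache.do_postpone succeeding
  bool marker_forced = false;  // COMMIT_WITH_POSTPONE appended and flushed
};

inline ToyPostponeRoute toy_tran_do_postpone (ToyPostponeTdes &tdes)
{
  if (tdes.posp_nxlsa_is_null)
    return ToyPostponeRoute::Nothing;          // nothing to postpone, no marker
  tdes.marker_forced = true;                   // marker is durable before any postpone runs
  if (tdes.cache_can_replay)
    return ToyPostponeRoute::CacheReplay;      // fast path
  return ToyPostponeRoute::LogScan;            // fall back to scanning the log
}
```

In this model the marker is forced on both replay routes and never on the empty route, which mirrors the early return in the excerpt.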

8.4 Forcing durability — `log_append_commit_log` and the WAL force

log_append_commit_log is a thin shell over log_append_donetime_internal, the single place both commit and abort build their donetime record:

// log_append_donetime_internal — src/transaction/log_manager.c
node = prior_lsa_alloc_and_copy_data (thread_p, iscommitted, RV_NOT_DEFINED, ...);  /* type = LOG_COMMIT/ABORT */
if (node == NULL) return;                            /* <- alloc failed: eot_lsa stays NULL */
((LOG_REC_DONETIME *) node->data_header)->at_time = time (NULL);  /* the only payload field */
lsa = (with_lock == LOG_PRIOR_LSA_WITH_LOCK)         /* caller holds prior mutex, else take it */
      ? prior_lsa_next_record_with_lock (...) : prior_lsa_next_record (thread_p, node, tdes);
LSA_COPY (eot_lsa, &lsa);                             /* hand the commit LSA back to the caller */

iscommitted doubles as the record type; with_lock lets the replication route reuse the mutex it already holds. Then log_change_tran_as_completed performs the durability force:

// log_change_tran_as_completed — src/transaction/log_manager.c
if (iscommitted == LOG_COMMIT)
  { tdes->state = TRAN_UNACTIVE_COMMITTED;
    logpb_flush_pages (thread_p, lsa); }             /* <- COMMIT: always force up to commit LSA */
else {
  tdes->state = TRAN_UNACTIVE_ABORTED;               /* SERVER_MODE only: */
  if (BO_IS_SERVER_RESTARTED () && VOLATILE_ACCESS (log_Gl.run_nxchkpt_atpageid, INT64) == NULL_PAGEID)
    logpb_flush_pages (thread_p, lsa);               /* <- ABORT: force only if checkpoint in flight */
}

Invariant — a committed transaction’s commit record is on stable storage before the client is told “committed”. logpb_flush_pages (thread_p, lsa) is the group-commit demand from Chapter 7: it pushes the committer onto the flush daemon’s waiter set and blocks until nxio_lsa >= commit_lsa (many committers share one fsync). This is the only point in the commit path that can block on I/O. The abort branch is asymmetric by design: a lost un-flushed LOG_ABORT is harmless (recovery re-undoes anyway), so it forces only when a checkpoint is in flight on a restarted server, lest that checkpoint reclaim an archive recovery still needs.
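A single-threaded model is enough to show why many committers can share one flush, and how the force decision differs between commit and abort. Everything here is an illustrative assumption: LSAs are flattened to one integer, ToyGroupCommit and toy_must_force are hypothetical names, and the real daemon, mutexes, and condition variables are not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using ToyFlatLsa = std::int64_t;   // flattened LSA, assumed monotonically increasing

// Committers park their commit LSA; one flush releases everyone at or below the frontier.
struct ToyGroupCommit
{
  ToyFlatLsa nxio_lsa = 0;              // flush frontier
  std::vector<ToyFlatLsa> waiters;      // commit LSAs still waiting
  int fsync_calls = 0;

  void request_flush (ToyFlatLsa commit_lsa)
  {
    if (commit_lsa > nxio_lsa)
      waiters.push_back (commit_lsa);   // not yet durable: wait
  }

  // One simulated fsync up to 'upto'; returns the number of committers released.
  int flush (ToyFlatLsa upto)
  {
    fsync_calls++;
    if (upto > nxio_lsa)
      nxio_lsa = upto;
    int released = 0;
    std::vector<ToyFlatLsa> still_waiting;
    for (ToyFlatLsa lsa : waiters)
      {
        if (lsa <= nxio_lsa)
          released++;
        else
          still_waiting.push_back (lsa);
      }
    waiters.swap (still_waiting);
    return released;
  }
};

// Commit always forces; abort forces only with a checkpoint in flight on a restarted server.
inline bool toy_must_force (bool is_commit, bool server_restarted, bool checkpoint_in_flight)
{
  return is_commit || (server_restarted && checkpoint_in_flight);
}
```

The point of the model is the ratio: the number of released committers per simulated fsync can exceed one, which is the whole benefit of group commit.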

8.5 `log_complete` — final state transition and next-trid

Both log_commit and log_abort finish via log_complete, passing two enum flags: whether the EOT record has already been written and whether to recycle the trid. Commit passes LOG_ALREADY_WROTE_EOT_LOG (the record is already forced, so the else arm only asserts); abort passes LOG_NEED_TO_WRITE_EOT_LOG (the if arm appends LOG_ABORT).

// log_complete — src/transaction/log_manager.c
if (LSA_ISNULL (&tdes->tail_lsa)) { /* read-only: set COMMITTED/ABORTED; recycle or clear tdes */ }
else {
  if (wrote_eot_log == LOG_NEED_TO_WRITE_EOT_LOG)    /* <- abort: write LOG_ABORT now */
    { log_append_abort_log (...); log_change_tran_as_completed (..., LOG_ABORT, &abort_lsa); }
  else assert (iscommitted == LOG_COMMIT && state == TRAN_UNACTIVE_COMMITTED);  /* commit already wrote it */
  tdes->unlock_global_oldest_visible_mvccid ();      /* always */
  if (iscommitted == LOG_COMMIT) log_Gl.mvcc_table.reset_transaction_lowest_active (...);  /* commit only */
  if (get_newtrid == LOG_NEED_NEWTRID) logtb_get_new_tran_id (thread_p, tdes);
}
if (LOG_ISCHECKPOINT_TIME ()) log_wakeup_checkpoint_daemon ();  /* or logpb_checkpoint in SA mode */

Branch fan-out:

tail_lsa NULL. No EOT record; set state, then recycle the trid (LOG_NEED_NEWTRID) or hard-clear (logtb_clear_tdes).
tail_lsa non-NULL, abort. Emit the abort record, force per §8.4, set state.
tail_lsa non-NULL, commit. Assert the record was already written and state is TRAN_UNACTIVE_COMMITTED.
MVCC unblocking (data path). unlock_global_oldest_visible_mvccid always runs; reset_transaction_lowest_active only on commit (cubrid-mvcc.md).
next-trid. logtb_get_new_tran_id recycles the index with a fresh trid — the donetime record is CUBRID’s EOT marker, not a distinct type.
checkpoint kick. If the append crossed the threshold, wake the checkpoint daemon or run logpb_checkpoint inline (SA mode).
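The fan-out above can be condensed into a pure decision function. This is a sketch, not the real signature: ToyCompleteActions and toy_log_complete are hypothetical, the flags are plain booleans instead of the LOG_* enums, and the checkpoint kick is omitted because it depends on global state.

```cpp
#include <cassert>

// Which side effects one log_complete call would perform (illustrative only).
struct ToyCompleteActions
{
  bool append_abort_record;
  bool unlock_oldest_visible;
  bool reset_lowest_active;
  bool new_trid;
  bool clear_tdes;
};

inline ToyCompleteActions toy_log_complete (bool tail_lsa_is_null, bool is_commit,
                                            bool need_to_write_eot, bool need_new_trid)
{
  ToyCompleteActions actions = { false, false, false, false, false };
  if (tail_lsa_is_null)
    {
      // Read-only: no EOT record, only recycle the trid or hard-clear the descriptor.
      actions.new_trid = need_new_trid;
      actions.clear_tdes = !need_new_trid;
      return actions;
    }
  if (need_to_write_eot)
    actions.append_abort_record = true;   // abort writes its record here
  else
    assert (is_commit);                   // commit must have written it already
  actions.unlock_oldest_visible = true;   // always on the data path
  actions.reset_lowest_active = is_commit;
  actions.new_trid = need_new_trid;
  return actions;
}
```

Reading the struct for a given input shows at a glance which arms are commit-only and which run on both paths.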

flowchart TD
  A["log_commit(tran_index, retain_lock)"] --> B{"tdes NULL?"}
  B -->|yes| Z0["return TRAN_UNACTIVE_UNKNOWN"]
  B -->|no| C{"active or 2PC-prepared?"}
  C -->|no, restarted| Z1["no-op, return tdes.state"]
  C -->|yes| D{"topops.last >= 0?"}
  D -->|yes| D1["assert false\nattach_to_outer until < 0"]
  D -->|no| E{"distributed 2PC?"}
  D1 --> E
  E -->|yes| E1["log_2pc_commit\nsee cubrid-2pc.md"]
  E -->|no| F["log_commit_local"]
  subgraph LOCAL["log_commit_local — strict order"]
    direction TB
    F --> G["tx_lob_locator_clear\nlogtb_complete_mvcc\nboth LOG before state change"]
    G --> H["state = WILL_COMMIT\nundo_nxlsa = NULL"]
    H --> I{"tail_lsa NULL?"}
    I -->|yes, read-only| J["unlock unless retained\nstate = COMMITTED"]
    I -->|no| K["log_tran_do_postpone"]
    K --> K1{"posp_nxlsa set?"}
    K1 -->|yes| K2["append COMMIT_WITH_POSTPONE\nforce, then run LOG_POSTPONE"]
    K1 -->|no| L
    K2 --> L["log_append_commit_log\n+repl info if HA"]
    L --> M["lock_unlock_all\nunless retain_lock"]
    M --> N["log_change_tran_as_completed\nstate = COMMITTED\nlogpb_flush_pages = group-commit force"]
  end
  E1 --> O["log_complete"]
  J --> O
  N --> O
  O --> P["MVCC unblock · recycle trid\nkick checkpoint if due"]
  P --> Z2["return state"]

Figure 8-2 — Commit control flow. The only point that blocks on I/O is logpb_flush_pages inside log_change_tran_as_completed (the group-commit force of §8.4); everything before it is bookkeeping. Note the two ordering invariants the diagram encodes: the records that tx_lob_locator_clear and logtb_complete_mvcc emit are written before the state moves to WILL_COMMIT, and an empty tail_lsa short-circuits to a no-record read-only commit.

8.6 `log_abort` and `log_abort_local` — undo before the boundary

log_abort mirrors log_commit’s entry validation with two extra guards, then routes to log_abort_local -> log_complete.

// log_abort (excerpt) — src/transaction/log_manager.c
if (LOG_HAS_LOGGING_BEEN_IGNORED ())
  { er_set (... ER_LOG_CORRUPTED_DB_DUE_NOLOGGING ...); return tdes->state; }  /* <- no log to undo */
if (!LOG_ISTRAN_ACTIVE (tdes) && !LOG_ISTRAN_2PC_PREPARE (tdes))
  return tdes->state;                                  /* <- nothing to abort */
// topops.last >= 0 -> same assert+attach salvage as commit
state = log_abort_local (thread_p, tdes, true);
state = log_complete (thread_p, tdes, LOG_ABORT, LOG_NEED_NEWTRID, LOG_NEED_TO_WRITE_EOT_LOG);

The extra LOG_HAS_LOGGING_BEEN_IGNORED guard is the key difference from commit: with no undo records, rollback is impossible and the database is declared corrupted. log_abort_local differs from log_commit_local in ordering: it sets TRAN_UNACTIVE_ABORTED first, then does the work.

// log_abort_local — src/transaction/log_manager.c
tdes->state = TRAN_UNACTIVE_ABORTED;                 /* <- set early; rollback logs CLRs, allowed */
if (!LSA_ISNULL (&tdes->tail_lsa))                   /* <- transaction touched data */
  { log_rollback (thread_p, tdes, NULL);             /* §8.6.1 — the undo pass */
    log_cleanup_modified_class_list (thread_p, tdes, NULL, true, true); /* + free first_save_entry */ }
/* both branches: */ logtb_complete_mvcc (thread_p, tdes, false);  /* committed=false -> discard mvccid */
lock_unlock_all (thread_p);                          /* <- always release; abort never retains */
tx_lob_locator_clear (thread_p, tdes, false, NULL);
return tdes->state;

flowchart TD
  A["log_abort(tran_index)"] --> B{"logging been ignored?"}
  B -->|yes| Z0["ER_LOG_CORRUPTED_DB_DUE_NOLOGGING\nreturn — no undo records exist"]
  B -->|no| C{"active or 2PC-prepared?"}
  C -->|no| Z1["nothing to abort, return"]
  C -->|yes| D["topops salvage\nassert + attach_to_outer"]
  D --> E["log_abort_local"]
  subgraph LOCAL["log_abort_local — state set FIRST"]
    direction TB
    E --> F["state = ABORTED\nset early: rollback may log CLRs"]
    F --> G{"tail_lsa NULL?"}
    G -->|no| H["log_rollback\nwalk prev_tranlsa backward,\nappend compensating CLRs"]
    H --> I["log_cleanup_modified_class_list"]
    G -->|yes| J
    I --> J["logtb_complete_mvcc false\ndiscard mvccid"]
    J --> K["lock_unlock_all\nalways — abort never retains"]
  end
  K --> L["log_complete LOG_ABORT"]
  L --> M{"tail_lsa non-NULL?"}
  M -->|yes| N["log_append_abort_log\nlog_change_tran_as_completed\nforce only if checkpoint in flight"]
  M -->|no| O["set ABORTED, no EOT record"]
  N --> P["recycle trid"]
  O --> P
  P --> Z2["return ABORTED"]

Figure 8-3 — Abort control flow. The mirror image of Figure 8-2 with two deliberate asymmetries. (1) State first: log_abort_local sets ABORTED before doing the work, because the rollback pass itself logs compensating records (CLRs) and those must be allowed after the state flips — the opposite of commit, where logging after WILL_COMMIT is forbidden. (2) Lazy force: a lost un-flushed LOG_ABORT is harmless (recovery re-undoes anyway), so the durability force fires only when a checkpoint is in flight on a restarted server.

Setting the state early is safe on the abort path, though forbidden on the commit path, because the records that rollback writes are compensation log records (CLRs): redo-only records (Chapter 9) that are expected from a transaction already marked aborted. logtb_complete_mvcc(..., false) discards the MVCCID, and log_abort_local ignores retain_lock: an abort always calls lock_unlock_all.

8.6.1 `log_rollback` — walking `prev_tranlsa` backward

log_rollback walks the chain backward from tdes->undo_nxlsa, re-applying each undo image. Per-record-type CLR generation is Chapter 9; the branch that matters here is the cursor discipline.

// log_rollback (control skeleton) — src/transaction/log_manager.c
LSA_COPY (&prev_tranlsa, &tdes->undo_nxlsa);           /* start cursor */
while (!LSA_ISNULL (&prev_tranlsa) && !isdone)
  {
    logpb_fetch_page (...);                            /* fatal on error */
    log_rec = LOG_GET_LOG_RECORD_HEADER (log_pgptr, &log_lsa);
    LSA_COPY (&prev_tranlsa, &log_rec->prev_tranlsa);  /* advance cursor BEFORE undo */
    LSA_COPY (&tdes->undo_nxlsa, &prev_tranlsa);       /* persist cursor (CLR may move it) */
    switch (log_rec->type) { /* ... see Chapter 9 ... */ }
  }

Invariant — the undo cursor is advanced before the undo is applied. Both prev_tranlsa and tdes->undo_nxlsa move to log_rec->prev_tranlsa before the undo runs, because applying an undo logs a chained CLR; a cursor that had not yet advanced could re-undo the record or follow the CLR’s own link. upto_lsa (NULL here, non-NULL from log_rollback_to_savepoint) stops a partial rollback early; xlogtb_reset_wait_msecs(INFINITE_WAIT) sets an infinite lock wait so the rollback is not interrupted by a lock timeout. Recovery-time replay is in cubrid-recovery-manager.md.
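The cursor discipline is easy to exercise on a toy log. This is a minimal sketch under loud assumptions: the vector index doubles as the LSA, a single transaction owns the whole log, and ToyRecord, ToyTran, and toy_rollback are hypothetical names. Undo itself is reduced to recording the LSA and appending one CLR.

```cpp
#include <cassert>
#include <vector>

constexpr int TOY_NULL_LSA = -1;

struct ToyRecord
{
  int prev_tranlsa;   // previous record of the same transaction
  bool is_clr;        // compensation records are never undone
};

struct ToyTran
{
  std::vector<ToyRecord> log;        // index doubles as LSA
  int tail_lsa = TOY_NULL_LSA;
  int undo_nxlsa = TOY_NULL_LSA;
  std::vector<int> undone;           // LSAs undone, in order

  int append (bool is_clr)
  {
    log.push_back (ToyRecord{ tail_lsa, is_clr });
    tail_lsa = static_cast<int> (log.size ()) - 1;
    if (!is_clr)
      undo_nxlsa = tail_lsa;         // a CLR must not pull the undo cursor forward
    return tail_lsa;
  }
};

// Undo everything after upto_lsa (TOY_NULL_LSA means undo the whole chain).
inline void toy_rollback (ToyTran &tran, int upto_lsa)
{
  int cursor = tran.undo_nxlsa;
  while (cursor != TOY_NULL_LSA && cursor > upto_lsa)
    {
      int current = cursor;
      cursor = tran.log[current].prev_tranlsa;   // advance BEFORE applying the undo
      tran.undo_nxlsa = cursor;                  // persist the advanced cursor
      if (!tran.log[current].is_clr)
        {
          tran.undone.push_back (current);
          tran.append (true);                    // the undo logs a CLR
        }
    }
}
```

Because the cursor has already moved when the CLR is appended, the growing tail never feeds back into the walk, and each original record is visited once.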

8.7 Restart-driven variants — `log_abort_by_tdes` and `log_abort_all_active_transaction`

At shutdown or crash recovery, transactions must be aborted by a thread other than their owner. log_abort_by_tdes rebinds the executing thread to the victim’s tran_index so every LOG_FIND_THREAD_TRAN_INDEX lookup inside log_abort resolves correctly, then reuses the ordinary path:

// log_abort_by_tdes — src/transaction/log_manager.c  (SERVER_MODE)
thread_p->tran_index = tdes->tran_index;   /* impersonate the victim's index */
pthread_mutex_unlock (&thread_p->tran_index_lock);
(void) log_abort (thread_p, tdes->tran_index);  /* reuse the normal abort path */

log_abort_all_active_transaction is the shutdown sweep: in server mode it loops over every index, dispatching an async abort onto each active transaction and re-looping until no worker threads remain. The dispatch is not direct — css_push_external_task queues log_abort_task_execute, a thin wrapper that calls log_abort_by_tdes(&thread_ref, &tdes):

// log_abort_all_active_transaction (server-mode essence) — src/transaction/log_manager.c
if (already_called) return; already_called = 1;        /* <- idempotent static guard */
loop: repeat_loop = false;
  for (i = 0; i < log_Gl.trantable.num_total_indices; i++)
    if (i != LOG_SYSTEM_TRAN_INDEX && (tdes = LOG_FIND_TDES (i)) && tdes->trid != NULL_TRANID)
      { if (css_count_transaction_worker_threads (...) > 0) repeat_loop = true;  /* still busy */
        else if (LOG_ISTRAN_ACTIVE (tdes) && !abort_thread_running[i])
          { /* exec_f = std::bind (log_abort_task_execute, _1, std::ref (*tdes)); */
            css_push_external_task (...);              /* -> log_abort_task_execute -> log_abort_by_tdes */
            abort_thread_running[i] = 1; repeat_loop = true; } }
  if (repeat_loop) { thread_sleep (50);
    if (css_is_shutdown_timeout_expired ()) _exit (0); goto loop; }  /* <- give up: hard exit */

already_called makes the sweep run only once; LOG_SYSTEM_TRAN_INDEX is skipped; a transaction with live workers forces another pass; and once the shutdown timeout expires the process gives up with _exit(0). The SA_MODE branch walks the table and calls log_abort synchronously.

8.8 Chapter summary — key takeaways

Boundary records reuse the whole pipeline — built/attached like data records (Chapters 3-4); the one-field log_rec_donetime’s LSA is the durable commit point.
log_commit routes; log_commit_local works — postpone, append, unlock, force; log_complete only finalizes state, since the record was already written (LOG_ALREADY_WROTE_EOT_LOG).
Ordering protects against checkpoint-during-commit — anything that logs runs before TRAN_UNACTIVE_WILL_COMMIT.
Commit forces, abort usually does not — logpb_flush_pages always for commit (group-commit, Chapter 7), for abort only with a checkpoint in flight on a restarted server.
Abort sets state first, then undoes — log_rollback advances the cursor before each undo so CLRs do not re-enter the chain.
retain_lock is a commit-only knob — abort always unlocks.
Restart variants re-bind, not re-implement — log_abort_by_tdes impersonates tran_index and calls log_abort; log_abort_all_active_transaction dispatches log_abort_task_execute idempotently until workers drain or the timeout forces _exit(0).

Chapter 9: System Operations Postpone and Compensation

A system operation (sysop, or “top operation”) is a sub-transaction the server commits or aborts independently of the enclosing user transaction — index splits, file allocation, overflow-record management. This chapter traces how sysops, postponed actions, and compensation records reuse the prior-list pipeline (Chapters 3–5) while carrying their own logical-undo payloads. For WAL/postpone/ARIES theory see the companion cubrid-log-manager.md (“System operations”, “Postpone & compensation”). Every family calls prior_lsa_alloc_and_copy_data + prior_lsa_next_record; the novelty is which header is stamped and how the tdes sysop stack and the LSA chains (undo_nxlsa, posp_nxlsa, per-level posp_lsa) mutate around it.

9.1 The sysop stack on `log_tdes`

log_sysop_start appends nothing; it pushes a frame onto an in-memory stack in the transaction descriptor. The table lists only the sysop/postpone-relevant log_tdes fields — the full ~82-field struct is in Chapter 2.

Field	Role	Why it exists
`topops` (`LOG_TOPOPS_STACK`)	Nesting stack of active sysops	`last` indexes the innermost open frame (< 0 when none is open), `max` is the allocated size
`topop_lsa`	LSA of the most-recent sysop’s parent	Fast “are we in a sysop” probe for appenders
`tail_lsa`	LSA of this tran’s last appended record	High-water mark a sysop end compares to detect “no change”
`undo_nxlsa`	Next record to undo	Rewound by a CLR so undo skips the already-undone record
`posp_nxlsa`	First transaction-level postpone record	Seeded by a `LOG_POSTPONE` appended outside any sysop
`savept_lsa`	LSA of last savepoint	Chains savepoints; target of `log_abort_partial`
`tail_topresult_lsa`	LSA of last partial commit/abort	Stamped into every sysop-end as `prv_topresult_lsa`
`state` (`TRAN_STATE`)	Transaction state	Gates which sysop-end arms are legal
`m_log_postpone_cache`	Cached postpone redo + LSAs	Lets `do_postpone` replay from memory
`rcv.sysop_start_postpone_lsa`	Recovery anchor for an in-flight sysop postpone	Resume a crashed sysop’s postpone phase
`rcv.tran_start_postpone_lsa`	Recovery anchor for a tran-level postpone	Resume the transaction postpone phase
`rcv.atomic_sysop_start_lsa`	Recovery anchor for an atomic sysop	Roll an interrupted atomic sysop back as a unit

(The rcv.* members live in the embedded log_rcv_tdes.) Each stack frame is a log_topops_addresses carrying two LSAs, read through three accessor macros:

// log_topops_addresses -- src/transaction/log_impl.h
struct log_topops_addresses
{
  LOG_LSA lastparent_lsa;  /* The last address of the parent transaction. This is needed for undo of the top
         * system action */
  LOG_LSA posp_lsa;    /* The first address of a postpone log record for top system operation. We add this
         * since it is reset during recovery to the last reference postpone address. */
};
// LOG_TDES_LAST_SYSOP* -- src/transaction/log_manager.c
#define LOG_TDES_LAST_SYSOP(tdes) (&(tdes)->topops.stack[(tdes)->topops.last])
#define LOG_TDES_LAST_SYSOP_PARENT_LSA(tdes) (&LOG_TDES_LAST_SYSOP(tdes)->lastparent_lsa)
#define LOG_TDES_LAST_SYSOP_POSP_LSA(tdes) (&LOG_TDES_LAST_SYSOP(tdes)->posp_lsa)

flowchart LR
  subgraph tdes["log_tdes"]
    tail["tail_lsa"]
    posp["posp_nxlsa (tran level)"]
    undo["undo_nxlsa"]
    stk["topops.stack[]"]
  end
  stk --> f0["[0] lastparent_lsa, posp_lsa"]
  stk --> fl["[last] lastparent_lsa, posp_lsa"]

Figure 9-1: the sysop stack and the LSA anchors it threads through log_tdes.

Invariant — the parent LSA bounds the sysop body. Every record a sysop appends moves tail_lsa past LOG_TDES_LAST_SYSOP_PARENT_LSA(tdes), so the end functions detect an empty sysop by LSA_LE(&tdes->tail_lsa, parent_lsa). If the invariant were violated, an end would log a phantom record or skip a needed end marker, desyncing log nesting from the stack. The check is made in log_sysop_commit_internal and log_sysop_abort.
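The empty-sysop test is a two-part comparison, sketched below. ToyLsa, toy_lsa_le, and toy_sysop_is_empty are hypothetical stand-ins for log_lsa, LSA_LE, and the condition in the end functions; a NULL LSA is modeled as pageid -1, which is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for log_lsa; pageid == -1 plays the role of a NULL LSA here.
struct ToyLsa
{
  std::int64_t pageid;
  std::int64_t offset;
  bool is_null () const { return pageid == -1; }
};

// Ordering compares pageid first, then offset.
inline bool toy_lsa_le (const ToyLsa &a, const ToyLsa &b)
{
  return a.pageid < b.pageid || (a.pageid == b.pageid && a.offset <= b.offset);
}

// A sysop is empty when the transaction tail has not moved past the parent LSA.
inline bool toy_sysop_is_empty (const ToyLsa &tail_lsa, const ToyLsa &lastparent_lsa)
{
  return tail_lsa.is_null () || toy_lsa_le (tail_lsa, lastparent_lsa);
}
```

The NULL check comes first because a transaction that has logged nothing at all has no tail to compare.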

9.2 `log_sysop_start` and `log_sysop_start_atomic`

// log_sysop_start -- src/transaction/log_manager.c
if (tdes->topops.max == 0 || (tdes->topops.last + 1) >= tdes->topops.max)   /* first-alloc OR full */
  if (logtb_realloc_topops_stack (tdes, 1) == NULL)        /* OOM: bail, stack unchanged */
    { assert (false); tdes->unlock_topop (); return; }
// ... condensed: VACUUM_IS_THREAD_VACUUM diagnostic logging only ...
tdes->topops.last++;                                       /* <- push */
LSA_COPY (&tdes->topops.stack[tdes->topops.last].lastparent_lsa, &tdes->tail_lsa);
LSA_COPY (&tdes->topop_lsa, &tdes->tail_lsa);
LSA_SET_NULL (&tdes->topops.stack[tdes->topops.last].posp_lsa);  /* <- no postpone yet */

The topops.max == 0 clause handles a transaction’s first sysop (no stack yet) and the second clause is grow-when-full. Branches: (1) tdes == NULL → ER_LOG_UNKNOWN_TRANINDEX fatal early-return; (2) realloc, on OOM unlock+return without pushing; (3) VACUUM diagnostics only; (4) happy path snapshots tail_lsa into lastparent_lsa and nulls posp_lsa.
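The push can be modeled in a few lines. This sketch uses hypothetical names (ToyFrame, ToySysopTdes, toy_sysop_start) and flattened integer LSAs; the doubling growth policy is an assumption made for illustration and is not taken from logtb_realloc_topops_stack, and the OOM branch is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::int64_t TOY_NULL_FLAT_LSA = -1;

struct ToyFrame
{
  std::int64_t lastparent_lsa;
  std::int64_t posp_lsa;
};

struct ToySysopTdes
{
  std::vector<ToyFrame> stack;
  int last = -1;                                  // index of the innermost open frame
  int max = 0;                                    // allocated frames
  std::int64_t tail_lsa = TOY_NULL_FLAT_LSA;
  std::int64_t topop_lsa = TOY_NULL_FLAT_LSA;
};

inline void toy_sysop_start (ToySysopTdes &tdes)
{
  if (tdes.max == 0 || tdes.last + 1 >= tdes.max)   // first allocation OR stack full
    {
      tdes.max = (tdes.max == 0) ? 1 : tdes.max * 2;   // growth policy is an assumption
      tdes.stack.resize (static_cast<std::size_t> (tdes.max),
                         ToyFrame{ TOY_NULL_FLAT_LSA, TOY_NULL_FLAT_LSA });
    }
  tdes.last++;                                              // push
  tdes.stack[tdes.last].lastparent_lsa = tdes.tail_lsa;     // snapshot the parent tail
  tdes.topop_lsa = tdes.tail_lsa;
  tdes.stack[tdes.last].posp_lsa = TOY_NULL_FLAT_LSA;       // no postpone yet
}
```

Nothing is appended to the log in this model, which matches the point made above: starting a sysop is purely an in-memory operation.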

log_sysop_start_atomic wraps it, then ensures one LOG_SYSOP_ATOMIC_START marker exists so recovery can roll the whole atomic sysop back as a unit:

// log_sysop_start_atomic -- src/transaction/log_manager.c
log_sysop_start (thread_p);                                /* ... re-fetch tdes, guard ... */
if (LSA_ISNULL (&tdes->rcv.atomic_sysop_start_lsa))        /* first atomic level: emit marker */
  { node = prior_lsa_alloc_and_copy_data (thread_p, LOG_SYSOP_ATOMIC_START, ...);
    (void) prior_lsa_next_record (thread_p, node, tdes); }
else
  { assert (tdes->topops.last > 0);                        /* nested: parent already marked */
    assert (LSA_ISNULL (&tdes->rcv.sysop_start_postpone_lsa)); }

The else arm fires for a nested atomic sysop: the outer level owns atomic_sysop_start_lsa, so the inner sysop inherits atomicity with no second marker. The asserts encode “no atomic start while a sysop-postpone runs.”
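The "one marker per atomic nest" rule can be sketched as follows. ToyAtomicTdes and toy_sysop_start_atomic are hypothetical names, and the model assumes that appending the marker is what records rcv.atomic_sysop_start_lsa; the excerpt does not show where that anchor is set, so treat that step as an assumption.

```cpp
#include <cassert>

struct ToyAtomicTdes
{
  int last = -1;                      // sysop stack index
  bool atomic_start_marked = false;   // stand-in for a non-NULL rcv.atomic_sysop_start_lsa
  int markers_emitted = 0;            // LOG_SYSOP_ATOMIC_START records appended
};

inline void toy_sysop_start_atomic (ToyAtomicTdes &tdes)
{
  tdes.last++;                              // stand-in for log_sysop_start
  if (!tdes.atomic_start_marked)
    {
      tdes.markers_emitted++;               // first atomic level: emit the marker
      tdes.atomic_start_marked = true;      // assumed to be set by the append
    }
  else
    {
      assert (tdes.last > 0);               // nested: the outer level owns the marker
    }
}
```

Two nested calls leave the stack two levels deep but the marker count at one, which is the inheritance described above.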

9.3 The sysop-end union and its six arms

All non-abort ends route through log_sysop_commit_internal, which stamps a log_rec_sysop_end (comments verbatim from source):

// log_rec_sysop_end -- src/transaction/log_record.hpp
struct log_rec_sysop_end
{
  LOG_LSA lastparent_lsa;  /* last address before the top action */
  LOG_LSA prv_topresult_lsa;  /* previous top action (either, partial abort or partial commit) address */
  LOG_SYSOP_END_TYPE type;  /* end system op type */
  const VFID *vfid;    /* File where the page belong. ... used to get TDE information. */
  union        /* other info based on type */
  {
    LOG_REC_UNDO undo;      /* undo data for logical undo */
    LOG_REC_MVCC_UNDO mvcc_undo;  /* undo data for logical undo of MVCC operation */
    LOG_LSA compensate_lsa;    /* compensate lsa for logical compensate */
    struct { LOG_LSA postpone_lsa; bool is_sysop_postpone; } run_postpone;  /* run postpone info */
  };
};

Field	Role	Why it exists
`lastparent_lsa`	Where the sysop body starts	Recovery undo of the sysop stops here; copied from the frame
`prv_topresult_lsa`	Previous partial commit/abort LSA	Chains top results so recovery walks them backward
`type`	Which union arm is valid	Dispatch key for append and recovery
`vfid`	File of the affected page	TDE key lookup; doubles as MVCC vacuum info
`undo`	`LOG_REC_UNDO` payload for `LOGICAL_UNDO`	`rcvindex` + length for logical undo replay
`mvcc_undo`	`LOG_REC_MVCC_UNDO` payload for `LOGICAL_MVCC_UNDO`	Adds `mvccid` and `vacuum_info`
`compensate_lsa`	Undo-skip target for `LOGICAL_COMPENSATE`	Next-undo LSA after this logical compensation
`run_postpone.postpone_lsa`	Original `LOG_POSTPONE` LSA	Links the run-postpone to its source
`run_postpone.is_sysop_postpone`	Sysop vs. tran postpone flag	Recovery must know which postpone phase produced this

Same bytes, six interpretations: four types select a union member and two use none. Each wrapper (§9.4) fills exactly the member listed for its type:

`type`	Active member	Produced by
`LOG_SYSOP_END_COMMIT`	(none)	`log_sysop_commit`
`LOG_SYSOP_END_ABORT`	(none)	`log_sysop_abort`
`LOG_SYSOP_END_LOGICAL_UNDO`	`undo`	`log_sysop_end_logical_undo` (non-MVCC)
`LOG_SYSOP_END_LOGICAL_MVCC_UNDO`	`mvcc_undo`	`log_sysop_end_logical_undo` (MVCC)
`LOG_SYSOP_END_LOGICAL_COMPENSATE`	`compensate_lsa`	`log_sysop_end_logical_compensate`
`LOG_SYSOP_END_LOGICAL_RUN_POSTPONE`	`run_postpone`	`log_sysop_end_logical_run_postpone`
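The table can be read as a dispatch function from type to active union member. The enum and function names below are hypothetical mirrors of the LOG_SYSOP_END_* values; the sketch only encodes the mapping in the table, not the record layout.

```cpp
#include <cassert>

// Illustrative mirror of LOG_SYSOP_END_TYPE.
enum class ToySysopEndType
{
  Commit, Abort, LogicalUndo, LogicalMvccUndo, LogicalCompensate, LogicalRunPostpone
};

// Which union member a reader may touch for a given type.
enum class ToyActiveMember { None, Undo, MvccUndo, CompensateLsa, RunPostpone };

inline ToyActiveMember toy_active_member (ToySysopEndType type)
{
  switch (type)
    {
    case ToySysopEndType::Commit:
    case ToySysopEndType::Abort:
      return ToyActiveMember::None;            // header fields only
    case ToySysopEndType::LogicalUndo:
      return ToyActiveMember::Undo;
    case ToySysopEndType::LogicalMvccUndo:
      return ToyActiveMember::MvccUndo;
    case ToySysopEndType::LogicalCompensate:
      return ToyActiveMember::CompensateLsa;
    case ToySysopEndType::LogicalRunPostpone:
      return ToyActiveMember::RunPostpone;
    }
  return ToyActiveMember::None;                // unreachable for valid input
}
```

Reading any member other than the one returned here would interpret bytes written for a different arm.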

9.4 `log_sysop_commit_internal` — branch-complete

The caller sets log_record->type; commit_internal validates state against type, runs any pending postpone, then appends the end record (Figure 9-2).

flowchart TD
  A["commit_internal(log_record)"] --> B{"tdes == NULL?"}
  B -->|yes| Z["assert_release; return"]
  B -->|no| C{"empty sysop\nAND COMMIT or no_logging?"}
  C -->|yes| D["assert posp_lsa NULL\nno-op end"]
  C -->|no| F{"switch type"}
  F -->|RUN_POSTPONE| G["assert *_COMMITTED_WITH_POSTPONE\nset is_sysop_postpone"]
  F -->|COMPENSATE| H["assert aborting OR rv-finish"]
  F -->|UNDO / MVCC_UNDO| I["no state restriction"]
  F -->|COMMIT| J["assert not in postpone phase\nunless rv-finish"]
  G --> K
  H --> K
  I --> K
  J --> K["fill lastparent_lsa, prv_topresult_lsa"]
  K --> M["do_postpone -> append_sysop_end -> tail_topresult_lsa = tail_lsa"]
  D --> P["log_sysop_end_final"]
  M --> P

Figure 9-2: every branch of log_sysop_commit_internal.

// log_sysop_commit_internal -- src/transaction/log_manager.c
assert (log_record->type != LOG_SYSOP_END_ABORT);     /* aborts never come here */
if ((LSA_ISNULL (&tdes->tail_lsa) || LSA_LE (&tdes->tail_lsa, LOG_TDES_LAST_SYSOP_PARENT_LSA (tdes)))
    && (log_record->type == LOG_SYSOP_END_COMMIT || log_No_logging))
  assert (LSA_ISNULL (&LOG_TDES_LAST_SYSOP (tdes)->posp_lsa));   /* empty COMMIT: nothing to log */
else
  { if (log_record->type == LOG_SYSOP_END_LOGICAL_RUN_POSTPONE)
      { assert (tdes->state == TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE
                || tdes->state == TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE);
        log_record->run_postpone.is_sysop_postpone =          /* recovery needs which phase */
          (tdes->state == TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE && !is_rv_finish_postpone); }
    // ... condensed: COMPENSATE / LOGICAL_UNDO / COMMIT state asserts (see Fig 9-2) ...
    log_record->lastparent_lsa = *LOG_TDES_LAST_SYSOP_PARENT_LSA (tdes);
    log_record->prv_topresult_lsa = tdes->tail_topresult_lsa;
    log_sysop_do_postpone (thread_p, tdes, log_record, data_size, data);  /* run postpones */
    log_append_sysop_end (thread_p, tdes, log_record, data_size, data);   /* emit end */
    LSA_COPY (&tdes->tail_topresult_lsa, &tdes->tail_lsa); }
log_sysop_end_final (thread_p, tdes);                  /* always pops the stack */

Invariant: the end-record type must agree with the transaction state. LOGICAL_RUN_POSTPONE is legal only in *_COMMITTED_WITH_POSTPONE; LOGICAL_COMPENSATE only while aborting (or during recovery postpone-finish); a plain COMMIT never during a postpone phase, unless recovery re-enters with is_rv_finish_postpone set. If the invariant is violated, recovery runs a postpone twice or skips an undo, corrupting the page.

log_sysop_end_final runs on every path, so even empty/error paths decrement topops.last. The four logical wrappers pre-fill the union member from §9.3; only log_sysop_end_logical_run_postpone leaves is_sysop_postpone for commit_internal to derive from state.
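
The state-vs-type rule can be restated as a small predicate. The following is a minimal sketch, not the CUBRID code: the enums `EndType` and `TranState` and both function names are hypothetical stand-ins, and "aborting" is modelled as a single `Aborted` state, which is an assumption.

```cpp
#include <cassert>

// Hypothetical, reduced mirrors of the end-record types and transaction
// states; only the members needed for the invariant are listed.
enum class EndType { Commit, LogicalUndo, LogicalMvccUndo, LogicalCompensate, LogicalRunPostpone };
enum class TranState { Active, Aborted, CommittedWithPostpone, TopopeCommittedWithPostpone };

static bool in_postpone_phase (TranState s)
{
  return s == TranState::CommittedWithPostpone || s == TranState::TopopeCommittedWithPostpone;
}

// True when the (type, state) pair is one the invariant allows.
bool end_type_allowed (EndType type, TranState state, bool is_rv_finish_postpone)
{
  switch (type)
    {
    case EndType::LogicalRunPostpone:
      return in_postpone_phase (state);
    case EndType::LogicalCompensate:
      return state == TranState::Aborted || is_rv_finish_postpone;
    case EndType::LogicalUndo:
    case EndType::LogicalMvccUndo:
      return true;                      // no state restriction
    case EndType::Commit:
      return !in_postpone_phase (state) || is_rv_finish_postpone;
    }
  return false;
}

// Same derivation as the is_sysop_postpone assignment in the excerpt.
bool derive_is_sysop_postpone (TranState state, bool is_rv_finish_postpone)
{
  return state == TranState::TopopeCommittedWithPostpone && !is_rv_finish_postpone;
}
```

For example, `end_type_allowed (EndType::Commit, TranState::TopopeCommittedWithPostpone, false)` is false, while the same call with `is_rv_finish_postpone = true` is true.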

9.5 `log_sysop_abort` — rollback then mark

Abort skips commit_internal; it rolls back and stamps an ABORT end directly:

// log_sysop_abort -- src/transaction/log_manager.c
if (LSA_ISNULL (&tdes->tail_lsa) || LSA_LE (&tdes->tail_lsa, &LOG_TDES_LAST_SYSOP (tdes)->lastparent_lsa))
  { /* No change: empty sysop, nothing to undo or log */ }
else
  { save_state = tdes->state;
    tdes->state = TRAN_UNACTIVE_ABORTED;               /* <- so compensation appends are legal */
    log_rollback (thread_p, tdes, LOG_TDES_LAST_SYSOP_PARENT_LSA (tdes));  /* undo body, emits CLRs */
    sysop_end.type = LOG_SYSOP_END_ABORT;
    sysop_end.lastparent_lsa = *LOG_TDES_LAST_SYSOP_PARENT_LSA (tdes);
    sysop_end.prv_topresult_lsa = tdes->tail_topresult_lsa;
    log_append_sysop_end (thread_p, tdes, &sysop_end, 0, NULL);
    LSA_COPY (&tdes->tail_topresult_lsa, &tdes->tail_lsa);
    tdes->state = save_state; }                         /* <- restore: sysop abort != tran abort */
log_sysop_end_final (thread_p, tdes);

The temporary state = TRAN_UNACTIVE_ABORTED is load-bearing: it lets log_rollback append CLRs (§9.8); the original state is then restored so the outer transaction is unaffected.

9.6 `log_append_postpone` — deferring an action to commit

A LOG_POSTPONE records a redo-only action not applied now but replayed after commit.

flowchart TD
  A["log_append_postpone"] --> B{"log_No_logging?"}
  B -->|yes| C["run redofun NOW; return"]
  B -->|no| E{"skipredo OR\nno sysop AND not active/aborted?"}
  E -->|yes| F["run redofun NOW;\nif !skipredo append_redo_data; return"]
  E -->|no| G{"tail_lsa NULL or\nbefore crash point?"}
  G -->|yes| H["append LOG_DUMMY_HEAD_POSTPONE"]
  G -->|no| J
  H --> J["alloc LOG_POSTPONE; cache redo + start_lsa"]
  J --> N{"in sysop?"}
  N -->|yes, posp_lsa NULL| O["frame.posp_lsa = tail_lsa"]
  N -->|no, posp_nxlsa NULL| P["posp_nxlsa = tail_lsa"]

Figure 9-3: log_append_postpone branches.

Two escape hatches run the redo synchronously: log_No_logging is set, or the action cannot be deferred (Fig 9-3). Otherwise the record is appended, its redo data and start LSA are pushed into m_log_postpone_cache, and the appropriate anchor is seeded:

// log_append_postpone -- src/transaction/log_manager.c
node = prior_lsa_alloc_and_copy_data (thread_p, LOG_POSTPONE, rcvindex, addr, 0, NULL, length, (char *) data);
tdes->m_log_postpone_cache.add_redo_data (*node);      /* save before node may be freed */
start_lsa = prior_lsa_next_record (thread_p, node, tdes);
tdes->m_log_postpone_cache.add_lsa (start_lsa);
if (tdes->topops.last >= 0)                            /* in sysop: seed frame anchor */
  { if (LSA_ISNULL (&tdes->topops.stack[tdes->topops.last].posp_lsa))
      LSA_COPY (&tdes->topops.stack[tdes->topops.last].posp_lsa, &tdes->tail_lsa); }
else if (LSA_ISNULL (&tdes->posp_nxlsa))               /* tran level: seed tran anchor */
  LSA_COPY (&tdes->posp_nxlsa, &tdes->tail_lsa);

Invariant — the first postpone seeds exactly one anchor. The earliest LOG_POSTPONE in a sysop level sets that frame’s posp_lsa; the earliest at transaction level sets posp_nxlsa. Later postpones leave it untouched (LSA_ISNULL guard). This anchor is the start of the postpone-replay scan; overwriting it would orphan earlier postpones.
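
The seed-once behaviour is easy to check on a toy model. This sketch assumes a reduced descriptor (`ToyTdes`, `Lsa`, and `append_postpone` are hypothetical names) and treats pageid -1 as the null LSA; it ignores the cache and the escape hatches entirely.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Lsa
{
  std::int64_t pageid = -1;   // -1 marks a null LSA in this sketch
  std::int64_t offset = -1;
  bool is_null () const { return pageid == -1; }
};

struct ToyTdes
{
  Lsa tail_lsa;                 // last record appended by the transaction
  Lsa posp_nxlsa;               // transaction-level postpone anchor
  std::vector<Lsa> sysop_posp;  // one posp_lsa per open sysop frame
};

// Models only the anchor logic that follows the append.
void append_postpone (ToyTdes &t, Lsa record_lsa)
{
  t.tail_lsa = record_lsa;                    // the append moves the tail
  if (!t.sysop_posp.empty ())
    {
      if (t.sysop_posp.back ().is_null ())    // first postpone in this frame
        t.sysop_posp.back () = t.tail_lsa;
    }
  else if (t.posp_nxlsa.is_null ())           // first postpone at tran level
    {
      t.posp_nxlsa = t.tail_lsa;
    }
}
```

Appending two postpones at transaction level leaves `posp_nxlsa` at the first one; opening a frame and appending two more seeds only that frame's anchor.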

9.7 The postpone phase: `log_sysop_do_postpone`, `log_do_postpone`, `log_run_postpone_op`

When a sysop with pending postpones ends, log_sysop_do_postpone writes a LOG_SYSOP_START_POSTPONE marker, then replays. Its header log_rec_sysop_start_postpone (log_record.hpp) stashes the entire end record so recovery can finish after a crash:

Field	Role	Why it exists
`sysop_end` (`LOG_REC_SYSOP_END`)	“log record used for end of system operation”	Lets `log_sysop_end_recovery_postpone` re-emit the correct end after a crash
`posp_lsa`	“address where the first postpone operation start”	Where the post-crash replay scan resumes

// log_sysop_do_postpone -- src/transaction/log_manager.c
save_state = tdes->state;                                           /* restored once replay finishes */
if (LSA_ISNULL (LOG_TDES_LAST_SYSOP_POSP_LSA (tdes))) { return; }   /* nothing to postpone */
sysop_start_postpone.sysop_end = *sysop_end;
sysop_start_postpone.posp_lsa = *LOG_TDES_LAST_SYSOP_POSP_LSA (tdes);
log_append_sysop_start_postpone (thread_p, tdes, &sysop_start_postpone, data_size, data);
if (tdes->m_log_postpone_cache.do_postpone (*thread_p, *(LOG_TDES_LAST_SYSOP_POSP_LSA (tdes))))
  { tdes->state = save_state; return; }                /* fast path: replay from memory */
log_do_postpone (thread_p, tdes, LOG_TDES_LAST_SYSOP_POSP_LSA (tdes));  /* slow path: scan the log */

The transaction-level parallel log_append_commit_postpone writes a LOG_COMMIT_WITH_POSTPONE whose header is log_rec_start_postpone (log_record.hpp), flips to TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE, and flushes so the commit is durable before postpones run:

Field	Role	Why it exists
`posp_lsa`	Address where the first transaction postpone op starts	Anchor the post-commit replay scan; reset during recovery to the last reference
`at_time`	“donetime. For the time-specific recovery”	Stamp the commit-postpone moment for point-in-time / time-specific recovery

log_do_postpone is the slow-path forward scan; it skips nested-top bodies via log_get_next_nested_top and dispatches on log_rec->type. Only LOG_POSTPONE triggers replay; the start-marker group (LOG_COMMIT_WITH_POSTPONE[_OBSOLETE], LOG_SYSOP_START_POSTPONE, LOG_2PC_*) does LSA_SET_NULL(&forward_lsa) to end the loop; data/redo/CLR/savepoint arms are inert (already applied when logged); default is “bad log_rectype”. log_run_postpone_op reads the redo and runs it — and a page-spanning malloc failure is fatal, not a graceful return:

// log_run_postpone_op -- src/transaction/log_manager.c
LSA_COPY (&ref_lsa, log_lsa);          /* remember the original postpone LSA */
// ... condensed: advance past LOG_RECORD_HEADER + LOG_REC_REDO header ...
redo = *((LOG_REC_REDO *) ((char *) log_pgptr->area + log_lsa->offset));
if (log_lsa->offset + redo.length < (int) LOGAREA_SIZE)
  rcv_data = (char *) log_pgptr->area + log_lsa->offset;   /* contiguous: point in place */
else
  { area = (char *) malloc (redo.length);                  /* spans pages: need contiguous copy */
    if (area == NULL)
      { logpb_fatal_error (thread_p, true, ARG_FILE_LINE, "log_run_postpone_op"); return ER_FAILED; }
    logpb_copy_from_log (thread_p, area, redo.length, log_lsa, log_pgptr); rcv_data = area; }
(void) log_execute_run_postpone (thread_p, &ref_lsa, &redo, rcv_data);
if (area != NULL) free_and_init (area);

ref_lsa lands in the run-postpone record (log_rec_run_postpone, log_record.hpp) so recovery knows which postpone already executed:

Field	Role	Why it exists
`data` (`LOG_DATA`)	“Location of recovery data” (rcvindex, vpid, offset)	Which page + redo function the postpone touched
`ref_lsa`	“Address of the original postpone record”	A second recovery pass matches this and skips the already-run postpone
`length`	“Length of redo data”	Bounds the redo copy

The producer log_append_run_postpone asserts WILL_COMMIT or a *_COMMITTED_WITH_POSTPONE state, stamps the three fields, appends, and sets the page LSA — making the action idempotent on a second recovery pass.
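
The dispatch rule of the slow-path scan (replay only LOG_POSTPONE, stop at any start marker, ignore the rest) can be sketched over a flat record list. This is an illustration under simplifying assumptions: `RecType` is a hypothetical reduced enum, records are addressed by index instead of LSA, and the nested-top skipping done by log_get_next_nested_top is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reduced record kinds; the real log has many more types.
enum class RecType { Postpone, UndoRedo, Compensate, Savepoint, SysopStartPostpone, CommitWithPostpone };

// Returns the indexes of the records a forward scan from `start` would replay.
std::vector<std::size_t> scan_postpones (const std::vector<RecType> &log, std::size_t start)
{
  std::vector<std::size_t> replayed;
  for (std::size_t i = start; i < log.size (); ++i)
    {
      switch (log[i])
        {
        case RecType::Postpone:
          replayed.push_back (i);        // the only arm that triggers replay
          break;
        case RecType::SysopStartPostpone:
        case RecType::CommitWithPostpone:
          return replayed;               // start marker: end the scan
        default:
          break;                         // inert: already applied when logged
        }
    }
  return replayed;
}
```

Scanning `{Postpone, UndoRedo, Postpone, Compensate, CommitWithPostpone, Postpone}` from index 0 replays indexes 0 and 2 only; the postpone after the marker is never reached.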

9.8 Compensation: `log_append_compensate_internal` and rewinding `undo_nxlsa`

A Compensation Log Record (CLR, LOG_COMPENSATE) records the redo of an undo, so an undo that already ran is never applied a second time after a crash. Its header is log_rec_compensate (log_record.hpp):

Field	Role	Why it exists
`data` (`LOG_DATA`)	“Location of recovery data” (rcvindex, pageid, offset, volid)	Locates the page the compensating redo re-applies
`undo_nxlsa`	“Address of next log record to undo”	Recovery undo jumps here, skipping the compensated record
`length`	“Length of compensating data”	Bounds the redo payload

// log_append_compensate_internal -- src/transaction/log_manager.c
node = prior_lsa_alloc_and_copy_data (thread_p, LOG_COMPENSATE, rcvindex, NULL, length, (char *) data, 0, NULL);
LSA_COPY (&prev_lsa, &tdes->undo_nxlsa);              /* remember where we were */
compensate = (LOG_REC_COMPENSATE *) node->data_header;
// ... condensed: fill compensate->data {rcvindex, pageid, offset, volid}, length ...
if (undo_nxlsa != NULL) LSA_COPY (&compensate->undo_nxlsa, undo_nxlsa);  /* explicit skip target */
else                    LSA_COPY (&compensate->undo_nxlsa, &prev_lsa);   /* default: current link */
start_lsa = prior_lsa_next_record (thread_p, node, tdes);
if (pgptr != NULL) pgbuf_set_lsa (thread_p, pgptr, &start_lsa);  /* TDE/page-LSA only when fixed */
LSA_COPY (&tdes->undo_nxlsa, &prev_lsa);             /* <- rewind: undo continues from BEFORE this CLR */

Invariant — a CLR is redo-only and rewinds the undo cursor past itself. After the append, tdes->undo_nxlsa is reset to prev_lsa and the CLR’s own undo_nxlsa points at the next record needing undo. During recovery undo, reaching a CLR jumps to undo_nxlsa and never re-applies the undo it represents; skipping the rewind would double-undo the page. The pgptr != NULL guard handles a CLR logged in recovery when the page could not be fixed — TDE and pgbuf_set_lsa are skipped.

The sibling log_sysop_end_logical_compensate (§9.3) achieves the same skip at sysop granularity via the sysop-end record’s compensate_lsa.
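
A toy undo pass shows why the CLR's undo_nxlsa prevents double undo. The sketch below is an assumption-laden model, not the recovery code: `ToyRec` and `recovery_undo` are hypothetical, and a record's index in the vector stands in for its LSA.

```cpp
#include <cassert>
#include <vector>

struct ToyRec
{
  bool is_clr;       // true for a compensation record
  int prev;          // previous record of the same transaction, -1 = none
  int undo_nxlsa;    // CLR only: next record that still needs undo, -1 = none
};

// Walks backwards from `tail` and returns the records that get undone.
std::vector<int> recovery_undo (const std::vector<ToyRec> &log, int tail)
{
  std::vector<int> undone;
  int cur = tail;
  while (cur >= 0)
    {
      const ToyRec &r = log[cur];
      if (r.is_clr)
        {
          cur = r.undo_nxlsa;      // jump over everything already compensated
        }
      else
        {
          undone.push_back (cur);  // a real pass would also append a CLR here
          cur = r.prev;
        }
    }
  return undone;
}
```

With three updates at indexes 0 to 2 and a CLR at index 3 that compensated update 2 (so its `undo_nxlsa` is 1), an undo pass starting at the CLR undoes 1 and 0 and never touches 2 again.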

9.9 Savepoints and partial abort

log_append_savepoint chains a LOG_SAVEPOINT with header log_rec_savept (log_record.hpp):

Field	Role	Why it exists
`prv_savept`	“Previous savepoint record” LSA	Singly-linked chain so `log_get_savepoint_lsa` walks back by name
`length`	“Savepoint name” length (name follows the record)	Bounds the variable-length name copy

// log_append_savepoint -- src/transaction/log_manager.c
if (!LOG_ISTRAN_ACTIVE (tdes)) { er_set (... ER_LOG_CANNOT_ADD_SAVEPOINT ...); return NULL; }
if (savept_name == NULL)      { er_set (... ER_LOG_NONAME_SAVEPOINT ...);     return NULL; }
node = prior_lsa_alloc_and_copy_data (thread_p, LOG_SAVEPOINT, ..., savept_name, ...);
savept = (LOG_REC_SAVEPT *) node->data_header;
LSA_COPY (&savept->prv_savept, &tdes->savept_lsa);   /* <- link to previous savepoint */
(void) prior_lsa_next_record (thread_p, node, tdes);
LSA_COPY (&tdes->savept_lsa, &tdes->tail_lsa);       /* <- this is now the latest savepoint */

Branches: NULL tdes (fatal), non-active tran (ER_LOG_CANNOT_ADD_SAVEPOINT), NULL name (ER_LOG_NONAME_SAVEPOINT), else append.
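
The prv_savept chain supports a backward lookup by name. The sketch below illustrates that walk under stated assumptions: `ToySavept` and `find_savepoint` are hypothetical, integers stand in for LSAs, and the real log_get_savepoint_lsa reads log pages rather than a map.

```cpp
#include <cassert>
#include <map>
#include <string>

struct ToySavept
{
  std::string name;
  int prv_savept;    // toy LSA of the previous savepoint, -1 = none
};

// Walks the chain from the latest savepoint; returns the toy LSA of the
// newest savepoint with the given name, or -1 when no such name exists.
int find_savepoint (const std::map<int, ToySavept> &log, int latest, const std::string &name)
{
  int cur = latest;
  while (cur != -1)
    {
      auto it = log.find (cur);
      if (it == log.end ())
        return -1;                       // broken chain: treat as unknown
      if (it->second.name == name)
        return cur;
      cur = it->second.prv_savept;
    }
  return -1;
}
```

Because the walk starts at the newest record, a reused name resolves to its most recent savepoint in this model.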

log_abort_partial rolls back to a named savepoint by reusing the sysop machinery — it forges a sysop whose parent LSA is the savepoint. Five guard branches run before the synthetic body:

tdes == NULL → ER_LOG_UNKNOWN_TRANINDEX, return TRAN_UNACTIVE_UNKNOWN.
LOG_HAS_LOGGING_BEEN_IGNORED () → ER_LOG_CORRUPTED_DB_DUE_NOLOGGING, return current state.
!LOG_ISTRAN_ACTIVE → silently return current state.
NULL name or unknown savepoint → ER_LOG_UNKNOWN_SAVEPOINT, return TRAN_UNACTIVE_UNKNOWN.
Dangling sysops (topops.last >= 0) → warn + assert(false), drain via log_sysop_attach_to_outer.

// log_abort_partial -- src/transaction/log_manager.c
if (tdes == NULL) { er_set (... ER_LOG_UNKNOWN_TRANINDEX ...); return TRAN_UNACTIVE_UNKNOWN; }
if (LOG_HAS_LOGGING_BEEN_IGNORED ()) { er_set (...); return tdes->state; }
if (!LOG_ISTRAN_ACTIVE (tdes))       { return tdes->state; }
if (savepoint_name == NULL || log_get_savepoint_lsa (...) == NULL)
  { er_set (... ER_LOG_UNKNOWN_SAVEPOINT ...); return TRAN_UNACTIVE_UNKNOWN; }
if (tdes->topops.last >= 0)                          /* dangling sysops: drain them first */
  { er_set (... ER_LOG_HAS_TOPOPS_DURING_COMMIT_ABORT ...); assert (false);
    while (tdes->topops.last >= 0) log_sysop_attach_to_outer (thread_p); }
log_sysop_start (thread_p);
LSA_COPY (&tdes->topops.stack[tdes->topops.last].lastparent_lsa, savept_lsa);  /* stop at savepoint */
// ... condensed: if posp_nxlsa not null, transfer/clamp it into the frame's posp_lsa ...
log_sysop_abort (thread_p);                          /* the actual rollback + CLRs */
LSA_COPY (&tdes->savept_lsa, savept_lsa);            /* discard newer savepoints */
return TRAN_UNACTIVE_ABORTED;

Partial abort is “abort a synthetic sysop spanning savepoint→now.” The elided postpone-anchor transfer moves posp_nxlsa into the frame’s posp_lsa (clamped to savept_lsa) so postpones whose source predates the savepoint are not lost.

9.10 `log_sysop_attach_to_outer` — committing a sysop into its parent

A sysop may merge into its enclosing scope, transferring only its postpone anchor:

// log_sysop_attach_to_outer -- src/transaction/log_manager.c
if (tdes->topops.last == 0 && (!LOG_ISTRAN_ACTIVE (tdes) || tdes->is_system_transaction ()))
  { assert_release (false); log_sysop_commit (thread_p); return; }   /* nothing to attach to */
if (tdes->topops.last - 1 >= 0)                       /* attach to parent sysop frame */
  { if (LSA_ISNULL (&tdes->topops.stack[tdes->topops.last - 1].posp_lsa))
      LSA_COPY (&tdes->topops.stack[tdes->topops.last - 1].posp_lsa,
                &tdes->topops.stack[tdes->topops.last].posp_lsa); }
else                                                  /* attach to transaction level */
  { if (LSA_ISNULL (&tdes->posp_nxlsa))
      LSA_COPY (&tdes->posp_nxlsa, &tdes->topops.stack[tdes->topops.last].posp_lsa); }
log_sysop_end_final (thread_p, tdes);                 /* pop, no end record appended */

Three branches: (1) nothing to attach to → fall back to a real commit; (2) parent sysop → push this level’s posp_lsa up if the parent has none; (3) top-level → push into posp_nxlsa. No LOG_SYSOP_END is written — the sysop’s effects become the parent’s.
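
Branches (2) and (3) reduce to "hand the anchor to whoever encloses you, unless they already have one". A minimal sketch, assuming a hypothetical `ToyStack` with integer LSAs and -1 as null; branch (1), the fallback to a real commit, is not modelled.

```cpp
#include <cassert>
#include <vector>

struct ToyStack
{
  int posp_nxlsa = -1;          // transaction-level anchor, -1 = null
  std::vector<int> frame_posp;  // posp_lsa of each open sysop, innermost last
};

void attach_to_outer (ToyStack &s)
{
  assert (!s.frame_posp.empty ());
  const int inner = s.frame_posp.back ();
  if (s.frame_posp.size () >= 2)
    {
      int &parent = s.frame_posp[s.frame_posp.size () - 2];
      if (parent == -1)
        parent = inner;           // parent sysop inherits the anchor
    }
  else if (s.posp_nxlsa == -1)
    {
      s.posp_nxlsa = inner;       // transaction level inherits the anchor
    }
  s.frame_posp.pop_back ();       // pop the frame; no end record is written
}
```

An existing outer anchor always wins, since it points at an earlier postpone and the replay scan must start there.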

9.11 Chapter summary — key takeaways

Sysops start as stack frames, not log records. log_sysop_start pushes topops and snapshots tail_lsa into lastparent_lsa; the first on-disk evidence is the end record (or atomic-start marker). The parent-LSA invariant (§9.1) is how end functions detect an empty sysop.
log_sysop_commit_internal is one hub for the five non-abort end types (four switch arms in Fig 9-2). The log_rec_sysop_end union is reinterpreted per type; the function validates arm-vs-state, runs postpones, appends the end, chains tail_topresult_lsa.
Abort is rollback-then-mark with a state swap. log_sysop_abort sets TRAN_UNACTIVE_ABORTED so log_rollback emits CLRs, appends LOG_SYSOP_END_ABORT, then restores the outer state.
Postpone defers redo to post-commit replay, anchored once. The first LOG_POSTPONE seeds posp_lsa (sysop) or posp_nxlsa (tran); the cache replays from memory, log_do_postpone from the log, stopping at any start marker and replaying only LOG_POSTPONE.
log_run_postpone_op makes postpone idempotent via the LOG_RUN_POSTPONE ref_lsa back-pointer; a page-spanning malloc failure is fatal.
Compensation is redo-only and rewinds the undo cursor. log_append_compensate_internal stamps a CLR whose undo_nxlsa skips the compensated record and resets tdes->undo_nxlsa.
Savepoints and partial abort piggy-back on sysops. log_abort_partial clears five guards, forges a sysop spanning savepoint→now, and calls log_sysop_abort; log_sysop_attach_to_outer merges a sysop into its parent with no end record, transferring only the postpone anchor.

Chapter 10: Archiving Header Maintenance and Edge Paths

Chapters 3-9 traced the hot per-record path. This chapter covers everything off it: the background machinery that recycles the active log into archive volumes, the on-disk durability of the log header, and the edge records and corruption checks. For the “active log vs. archives” and “force-at-commit” theory, see the companion.

10.1 Three header structs: page header, log header, archive header

Every log page begins with a LOG_HDRPAGE; logical page -9 (LOGPB_HEADER_PAGE_ID) carries a LOG_HEADER; every archive’s physical page 0 carries a LOG_ARV_HEADER.

flowchart LR
  hp["active page -9<br/>LOG_HDRPAGE + LOG_HEADER"] -->|"nxarv_pageid"| p0["active data page<br/>LOG_HDRPAGE + records"]
  p0 -.archived into.-> ap0["archive phy 0<br/>LOG_HDRPAGE + LOG_ARV_HEADER"]

Figure 10-1: the three header structs and the active-to-archive copy relationship.

LOG_HDRPAGE — per-page header prefix. A LOG_PAGE is LOG_HDRPAGE hdr + char area[1].

Field	Role	Why it exists
`logical_pageid`	Logical page id in the infinite log	Identity independent of physical slot; header page is always `-9`
`offset`	Offset of first record on this page	Salvage anchor when a prior page is corrupt and unarchived
`flags`	TDE flags (`..._AES`/`_ARIA`)	Marks a page whose records must be encrypted before leaving memory; header page is `0`
`checksum`	CRC32 over sampled bytes	Corruption detection on read (10.6)

LOG_HEADER — active-log master record. In the data area of page -9, mirrored as log_Gl.hdr. Every member is listed.

Field	Role	Why it exists
`magic`	`file(1)` magic	Guard vs. non-log files
`dummy` / `dummy3` / `dummy4`	Alignment pads	8-byte align
`db_creation`	DB creation time	Ties log to DB; copied to `LOG_ARV_HEADER`
`vol_creation`	Active-vol creation time	Diagnostics / ordering
`db_release` / `db_compatibility`	Release string, compat float	Reject incompatible build/version
`db_iopagesize` / `db_logpagesize`	Page sizes at creation	Run DB at the size the log expects
`is_shutdown`	Clean-shutdown flag	Recovery: was dismount clean
`next_trid`	Next txn id	Copied to `LOG_ARV_HEADER` for replay
`mvcc_next_id`	Next MVCC id	MVCC allocation high-water
`avg_ntrans` / `avg_nlocks`	Sizing estimates	Pre-size txn/lock tables
`npages`	Active pages, excl. header	Sizes active vol / archive range
`db_charset`	DB charset id	Charset guard at mount
`was_copied`	Copied-DB flag	Copy vs. original
`fpageid`	Logical pageid at active physical slot 1	Active analogue of `LOG_ARV_HEADER.fpageid`
`append_lsa`	Current append position	High-water of real log
`chkpt_lsa`	Lowest LSA recovery replays from	Recovery start; durable here (10.5)
`nxarv_pageid`	Next logical page to archive	Active/archive boundary (10.2)
`nxarv_phy_pageid`	Physical slot of `nxarv_pageid`	Skips recomputing `logpb_to_physical_pageid`
`nxarv_num`	Next archive number	Names next `_lgarNNN`
`last_arv_num_for_syscrashes`	Oldest archive for crash recovery	Deletion floor; `-1` = unpinned
`last_deleted_arv_num`	Highest archive removed	Remove resumes without re-scan
`bkup_level0/1/2_lsa` / `bkinfo[]`	Per-level backup LSAs + info	Backup — see backup chapter
`prefix_name`	Log prefix name	Names volume family
`has_logging_been_skipped`	Logging-skipped flag	Marks a WAL-bypass window
`vacuum_last_blockid`	Last vacuum block id	Gates deletion (10.4)
`perm_status_obsolete`	Obsolete status	Layout compat
`ha_server_state` / `ha_file_status` / `ha_promotion_time`	HA state, copy status, promotion time	Replication / HA
`eof_lsa`	LSA of `LOG_END_OF_LOG`	Durable log end; durable here (10.5)
`smallest_lsa_at_last_chkpt`	Oldest dirty LSA at last chkpt	Bounds recovery/vacuum lookback
`mvcc_op_log_lsa`	LSA of last MVCC op	Vacuum MVCC anchor
`oldest_visible_mvccid` / `newest_block_mvccid`	Oldest visible, newest block MVCCID	Vacuum visibility / block bounds
`db_restore_time`	Last restore time	Restore bookkeeping
`mark_will_del`	Marked for deletion	DB-drop bookkeeping
`does_block_need_vacuum`	Block needs vacuum	Vacuum scheduling
`was_active_log_reset`	Active log was reset	Cleared in `logpb_archive_active_log`

State invariant: the archive boundary. nxarv_pageid is the single source of truth for what is archived: pages below it live only in an archive, pages at or above it are still active. nxarv_phy_pageid must equal logpb_to_physical_pageid(nxarv_pageid). Both advance as a unit at the end of logpb_archive_active_log, then logpb_flush_header makes them durable atomically. If they disagree, the next archive reads the wrong physical slot and corrupts the sequence.

LOG_ARV_HEADER — one per archive volume.

Field	Role	Why it exists
`magic`	`CUBRID_MAGIC_LOG_ARCHIVE`	Guard vs. mounting wrong file as archive
`dummy`	Pad	align
`db_creation`	From `log_Gl.hdr.db_creation`	Ties archive to DB
`vol_creation`	`time(NULL)` when written	Diagnostics / ordering
`next_trid`	From `log_Gl.hdr.next_trid`	Replay context
`npages`	Data pages, excl. previous-lsa page	Bounds reads
`fpageid`	Logical pageid at physical slot 1	Logical-to-physical map in this archive
`arv_num`	This archive’s number	Self-identifying
`dummy2`	Pad	align

10.2 logpb_archive_active_log — rolling the active log into an archive

Called under LOG_CS write when the active log fills, it copies [nxarv_pageid .. prev_lsa.pageid-1] into a fresh archive, then advances the boundary. Figure 10-2 traces every branch.

flowchart TB
  start["enter LOG_CS write"] --> wake["wake remove daemon SERVER, or remove-exceed-limit SA"]
  wake --> guard{"nxarv_pageid >= append_lsa.pageid ?"}
  guard -->|yes, only incomplete page| ret["er_log_debug + return"]
  guard -->|no| dis{"archive.vdes open ?"}
  dis -->|yes| dismount["dismount old archive"]
  dis -->|no| mal["malloc arv hdr page"]
  dismount --> mal
  mal -->|NULL| err["goto error"]
  mal -->|ok| flush["flush_all_append_pages, build hdr"]
  flush --> bg{"bg archiving and vdes open ?"}
  bg -->|yes| chk["set hdr checksum"]
  bg -->|no| fmt["fileio_format new vol"]
  fmt -->|NULL_VOLDES| err
  fmt --> chk
  chk -->|error| err
  chk --> wrhdr["write header page phy 0"]
  wrhdr -->|NULL| err
  wrhdr --> loop["copy loop: read LOGPB_IO_NPAGES, write"]
  loop -->|read/write fails| err
  loop --> fin{"background archiving ?"}
  fin -->|yes| rename["dismount, rename _lgar_t, remount"]
  fin -->|no| sync["fileio_synchronize"]
  rename -->|fail| err
  sync -->|fail| err
  rename --> adv["advance nxarv_num/pageid/phy_pageid"]
  sync --> adv
  adv --> fh["logpb_flush_header"]
  fh --> done["cache hdr, log, return"]
  err --> fatal["logpb_fatal_error -> exit"]

Figure 10-2: branch-complete flow of logpb_archive_active_log.

The early guard (nxarv_pageid >= append_lsa.pageid -> er_log_debug + return) refuses an empty range; logpb_flush_all_append_pages is then forced. The header is built self-describing (db_creation/next_trid/fpageid copied from log_Gl.hdr), with last_pageid clamped so a degenerate range never yields negative npages:

// logpb_archive_active_log -- src/transaction/log_page_buffer.c
last_pageid = log_Gl.append.prev_lsa.pageid - 1;                    /* <- never the live append page */
if (last_pageid < arvhdr->fpageid) last_pageid = arvhdr->fpageid;   /* <- clamp >= 1 page */
arvhdr->npages = (DKNPAGES) (last_pageid - arvhdr->fpageid + 1);

The copy loop reads up to LOGPB_IO_NPAGES (4) pages via logpb_read_page_from_active_log and writes them FILEIO_WRITE_NO_COMPENSATE_WRITE as-stored (still TDE-encrypted); any read <= 0 or write NULL jumps to error. The boundary advance is the durable commit: last_arv_num_for_syscrashes is pinned to nxarv_num if still -1 (recovery floor); nxarv_num++; nxarv_pageid/nxarv_phy_pageid advance as a unit; was_active_log_reset = false; then logpb_flush_header. The error label calls logpb_fatal_error(..., true, ...) — a failed archive is unrecoverable, so the server exits (10.7).
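
The range arithmetic is worth working through with numbers. The sketch below assumes the archive's fpageid is taken from nxarv_pageid (an assumption of this example) and uses hypothetical names `ArchiveRange`, `compute_archive_range`, and `io_batches`; the batch size of 4 is the LOGPB_IO_NPAGES value quoted above.

```cpp
#include <cassert>
#include <cstdint>

struct ArchiveRange
{
  std::int64_t fpageid;      // first logical page copied
  std::int64_t last_pageid;  // last logical page copied
  std::int64_t npages;       // pages in the archive
};

ArchiveRange compute_archive_range (std::int64_t nxarv_pageid, std::int64_t prev_lsa_pageid)
{
  ArchiveRange r;
  r.fpageid = nxarv_pageid;
  r.last_pageid = prev_lsa_pageid - 1;   // never the live append page
  if (r.last_pageid < r.fpageid)
    r.last_pageid = r.fpageid;           // clamp: at least one page
  r.npages = r.last_pageid - r.fpageid + 1;
  return r;
}

constexpr std::int64_t kIoNpages = 4;    // pages per read/write batch

// Number of copy-loop iterations needed for npages pages.
std::int64_t io_batches (std::int64_t npages)
{
  return (npages + kIoNpages - 1) / kIoNpages;
}
```

With nxarv_pageid = 100 and prev_lsa.pageid = 105 the archive covers pages 100 to 104 (5 pages, 2 batches); with both at 100 the clamp yields a single page instead of a zero or negative count.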

10.3 logpb_write_toflush_pages_to_archive — background archiving

When PRM_ID_LOG_BACKGROUND_ARCHIVING is on, full pages stream to a temp volume (_lgar_t) as they flush, so the eventual archive only renames it. It returns early when bg_archive_info.vdes == NULL_VOLDES || num_toflush <= 1, then copies every toflush[] page below prev_lsa.pageid, reconciling cursor pageid against the next bufptr->pageid in three branches:

// logpb_write_toflush_pages_to_archive -- src/transaction/log_page_buffer.c
if (pageid > bufptr->pageid)      { assert_release (...); dismount; return; }  /* backwards: never */
else if (pageid < bufptr->pageid) { if (logpb_fetch_page (...)) { dismount; return; } }  /* gap: fetch */
else                              { log_pgptr = flush_info->toflush[i]; i++; }  /* match: use in hand */

Each page is TDE-encrypted if LOG_IS_PAGE_TDE_ENCRYPTED; on encryption failure it is written plaintext with the TDE flag cleared (a logged data-leak tradeoff). fileio_synchronize runs once every PRM_ID_PB_SYNC_ON_NFLUSH pages. Any write failure dismounts the temp volume and abandons bg archiving (10.2 falls back to fileio_format).

10.4 The remove daemon — gated deletion of old archives

Deletion never happens on the hot path. On a server, logpb_archive_active_log only wakes the daemon (log_wakeup_remove_log_archive_daemon calls wakeup(), async). log_remove_log_archive_daemon_task also fires periodically (compute_period reads PRM_ID_REMOVE_LOG_ARCHIVES_INTERVAL: non-zero = timed wait, zero = wake-only). Its body and the SA path both call logpb_remove_archive_logs_exceed_limit, which early-exits with 0 if log_max_archives == INT_MAX (unlimited) or !vacuum_is_safe_to_remove_archives() (vacuum data not loaded). The window [last_deleted_arv_num + 1, nxarv_num - num_remove_arv_num] then has its high end clamped by each gate with MIN:

// logpb_remove_archive_logs_exceed_limit -- src/transaction/log_page_buffer.c
if (log_Gl.hdr.last_arv_num_for_syscrashes != -1)              /* crash-recovery floor */
  last_arv_num_to_delete = MIN (last_arv_num_to_delete, log_Gl.hdr.last_arv_num_for_syscrashes);
if (vacuum_first_pageid != NULL_PAGEID && logpb_is_page_in_archive (vacuum_first_pageid))
  last_arv_num_to_delete = MIN (last_arv_num_to_delete, min_arv_required_for_vacuum);
if (prm_get_integer_value (PRM_ID_SUPPLEMENTAL_LOG)) {          /* CDC + flashback gates */
  if (logpb_is_page_in_archive (cdc_min_log_pageid_to_keep ()))   /* CDC progress */
    last_arv_num_to_delete = MIN (last_arv_num_to_delete, min_arv_required_for_cdc);
  if (flashback_is_needed_to_keep_archive ())
    last_arv_num_to_delete = MIN (last_arv_num_to_delete, min_arv_required_for_flashback); }

State invariant: no archive is deleted while a consumer still needs it. An archive is deletable only if its number is below every live consumer’s minimum: last_arv_num_for_syscrashes, vacuum, CDC (cdc_min_log_pageid_to_keep, the oldest page CDC has not consumed, checked only under PRM_ID_SUPPLEMENTAL_LOG), flashback, and (server) HA copy progress (logwr_get_min_copied_fpageid, unless PRM_ID_FORCE_REMOVE_LOG_ARCHIVES). The MIN() chain enforces this; drop any clamp and that consumer finds its archive deleted.

Then max_count caps batch size, last_arv_num_to_delete-- (window is exclusive of the last needed archive), and only if >= first_arv_num_to_delete does it persist last_deleted_arv_num and logpb_flush_header. The unlink runs after LOG_CS_EXIT via logpb_remove_archive_logs_internal.
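
The clamp-then-decrement logic can be modelled in a few lines. This is a sketch under assumptions: `clamp_last_to_delete` and `window_is_nonempty` are hypothetical names, every gate is reduced to "an archive number or no floor", and the max_count cap is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int kNoFloor = -1;   // consumer inactive or unpinned

// `candidate` is the initial high end of the window; each floor is the lowest
// archive number some consumer still needs.
int clamp_last_to_delete (int candidate, const std::vector<int> &floors)
{
  for (int f : floors)
    {
      if (f != kNoFloor)
        candidate = std::min (candidate, f);
    }
  return candidate - 1;        // the clamped archive itself is still needed
}

// Deletion (and the header flush) happens only for a non-empty window.
bool window_is_nonempty (int first_to_delete, int last_to_delete)
{
  return last_to_delete >= first_to_delete;
}
```

With a candidate of 10 and floors {7, none, 9}, the result is 6, so archives up to 6 may go and archive 7 survives; with a single floor of 3 and a first candidate of 3, the window is empty and nothing is deleted.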

10.5 logpb_flush_header — making the active-log header durable

Every boundary change above ends here. It asserts LOG_CS_OWN_WRITE_MODE, lazily allocates loghdr_pgptr if NULL (OOM -> logpb_fatal_error), then snapshots and writes to page -9:

// logpb_flush_header -- src/transaction/log_page_buffer.c
log_hdr = (LOG_HEADER *) (log_Gl.loghdr_pgptr->area);
*log_hdr = log_Gl.hdr;                          /* <- snapshot in-memory header */
log_Gl.loghdr_pgptr->hdr.flags = 0;             /* <- never TDE-encrypted */
logpb_write_page_to_disk (thread_p, log_Gl.loghdr_pgptr, LOGPB_HEADER_PAGE_ID);

This is the single point where chkpt_lsa (recovery start) and eof_lsa (durable log end) become durable. It does not flush append pages; those use Ch 7’s WAL flush.

10.6 Edge records and corruption: EOF marker, dummies, checksum

LOG_END_OF_LOG placement. In logpb_flush_all_append_pages, an EOF marker (eof.type = LOG_END_OF_LOG, null forw_lsa) is appended in place via logpb_start_append so recovery finds where the log stops, but append_lsa is not advanced — the next real record overwrites it.

LOG_DUMMY_GENERIC and other dummies. Several log types carry no payload; the enum comment is literally "ridiculous, but flush needs it". A dummy gives flush a record to terminate/pad a page when a real record would straddle a boundary awkwardly, so the page closes without a partial record header.

Checksum. logpb_compute_page_checksum samples 16 bytes from the head and tail of each 4096-byte block and CRC32s the concatenation, zeroing hdr.checksum during the computation and restoring it after so the stored checksum never checks itself. logpb_set_page_checksum stores it; logpb_page_has_valid_checksum recomputes and compares; logpb_page_check_corruption sets *is_page_corrupted = !has_valid_checksum. Any change must be mirrored in logwr_check_page_checksum so replication agrees.
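
The sampling scheme can be sketched as follows. This is an illustration, not the CUBRID routine: the checksum field position (`kChecksumOffset`) is hypothetical, the CRC-32 is a plain bitwise implementation chosen to keep the example dependency-free, and the exact byte ranges sampled by the real code may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kBlock = 4096;        // sampling block size
constexpr std::size_t kSample = 16;         // bytes taken from each end of a block
constexpr std::size_t kChecksumOffset = 0;  // hypothetical position of hdr.checksum

// Bitwise CRC-32 (reflected polynomial 0xEDB88320).
std::uint32_t crc32_bitwise (const unsigned char *p, std::size_t n)
{
  std::uint32_t crc = 0xFFFFFFFFu;
  for (std::size_t i = 0; i < n; ++i)
    {
      crc ^= p[i];
      for (int k = 0; k < 8; ++k)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  return ~crc;
}

std::uint32_t compute_page_checksum (std::vector<unsigned char> &page)
{
  assert (page.size () >= kBlock && page.size () % kBlock == 0);
  unsigned char saved[4];
  std::memcpy (saved, page.data () + kChecksumOffset, 4);
  std::memset (page.data () + kChecksumOffset, 0, 4);   // checksum must not check itself
  std::vector<unsigned char> sample;
  for (std::size_t b = 0; b < page.size (); b += kBlock)
    {
      const unsigned char *blk = page.data () + b;
      sample.insert (sample.end (), blk, blk + kSample);                     // head
      sample.insert (sample.end (), blk + kBlock - kSample, blk + kBlock);   // tail
    }
  const std::uint32_t sum = crc32_bitwise (sample.data (), sample.size ());
  std::memcpy (page.data () + kChecksumOffset, saved, 4);   // restore stored value
  return sum;
}

void set_page_checksum (std::vector<unsigned char> &page)
{
  const std::uint32_t sum = compute_page_checksum (page);
  std::memcpy (page.data () + kChecksumOffset, &sum, 4);
}

bool page_has_valid_checksum (std::vector<unsigned char> &page)
{
  std::uint32_t stored;
  std::memcpy (&stored, page.data () + kChecksumOffset, 4);
  return stored == compute_page_checksum (page);
}
```

The design trades coverage for speed: a flipped bit inside a sampled head or tail is caught, while a flipped bit in the middle of a block goes unnoticed by this check.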

logpb_invalid_all_append_pages. When append state must be reset (e.g. after a partial-append failure), the function has one branch: if log_Gl.append.log_pgptr != NULL, it first flushes the dirty append page via logpb_flush_pages_direct, so committed work is not lost, and then nulls log_pgptr. Afterwards, under flush_mutex, it zeroes flush_info->num_toflush and sets toflush[0] = NULL.

10.7 logpb_fatal_error_internal — last-resort flush and exit

Unrecoverable errors call logpb_fatal_error -> logpb_fatal_error_internal with need_flush = true (logpb_fatal_error_exit_immediately_wo_flush passes false when flushing is itself unsafe):

// logpb_fatal_error_internal -- src/transaction/log_page_buffer.c
if (log_exit == true && need_flush == true && log_Gl.append.log_pgptr != NULL) {
  static int in_fatal = false;                  /* <- reentrancy guard */
  if (in_fatal == false) {
    in_fatal = true;
    pgbuf_flush_checkpoint (...);                /* flush only up to prev_lsa */
    in_fatal = false; } }
fileio_synchronize_all (thread_p);              /* <- force everything to stable storage */
/* then boot_server_status(DOWN); NDEBUG -> exit, debug -> abort core dump */

Branches: the flush block runs only when all three of log_exit, need_flush, and a live append page hold; the in_fatal guard blocks recursive entry if the flush itself faults. It flushes “as much as you can without forcing the current unfinished log record” (committed work below prev_lsa becomes durable; the partial record is left for recovery), then calls fileio_synchronize_all and exits (NDEBUG) or aborts (debug).
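
The reentrancy guard is the part most easily gotten wrong, so here is a minimal single-threaded sketch of it. The names `fatal_path` and `g_flush_calls` are hypothetical, the flush is passed in as a callback so a faulting flush can be simulated, and the final sync and process exit are replaced by a plain return.

```cpp
#include <cassert>
#include <functional>

int g_flush_calls = 0;   // counts how often the last-resort flush really ran

void fatal_path (bool log_exit, bool need_flush, bool has_append_page,
                 const std::function<void ()> &flush)
{
  static bool in_fatal = false;          // reentrancy guard
  if (log_exit && need_flush && has_append_page)
    {
      if (!in_fatal)
        {
          in_fatal = true;
          ++g_flush_calls;
          flush ();                      // may itself hit a fatal error
          in_fatal = false;
        }
    }
  // A real server would now sync all volumes and terminate.
}
```

If the flush callback re-enters `fatal_path`, the inner call sees `in_fatal` set and skips the flush, so the flush runs exactly once instead of recursing without bound.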

10.8 Open questions carried from the high-level doc

Four items from the companion remain open: the group-commit window (flush daemon’s wake timing and its interaction with PRM_ID_PB_SYNC_ON_NFLUSH, 10.3); whether the prior-list list_size cap throttles archive/flush; TDE placement (encryption is lazy in 10.3 and skipped in logpb_archive_active_log’s direct copy — the single authoritative encrypt-on-disk point is untraced); and the LOG_DUMMY_GENERIC invariant (the condition under which flush requires a dummy is documented only by the source comment).

10.9 Chapter summary — key takeaways

Three nested header structs: LOG_HDRPAGE per page; LOG_HEADER (page -9) the master record; LOG_ARV_HEADER per archive, with db_creation/next_trid copied from LOG_HEADER.
nxarv_pageid/nxarv_phy_pageid are the archive boundary, advanced as a unit and flushed by logpb_archive_active_log (which also clears was_active_log_reset).
Archiving forces a full append flush, copies [nxarv_pageid .. prev_lsa.pageid-1] as-stored, and treats any I/O failure as fatal.
Deletion is gated: logpb_remove_archive_logs_exceed_limit clamps the window with a MIN() chain against crash-recovery, vacuum, CDC, flashback, and HA floors.
logpb_flush_header is the single durability point for chkpt_lsa/eof_lsa/archive bookkeeping, under LOG_CS write, flags = 0.
Edge records: LOG_END_OF_LOG appended in place without advancing append_lsa; dummies pad pages; a sampled-CRC32 checksum drives logpb_page_check_corruption.
Fatal path: logpb_fatal_error_internal uses an in_fatal guard, flushes only up to prev_lsa, then exits/aborts.
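The MIN() chain in the deletion takeaway can be pictured as a running minimum over per-consumer floors. This is a hedged sketch, not the body of logpb_remove_archive_logs_exceed_limit: the struct, the helper names, and the use of -1 as a "no floor" sentinel are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))
#define SKETCH_NULL_PAGEID ((int64_t) -1)   /* assumed "consumer has no floor" */

/* Hypothetical floors: the lowest log page id each consumer still needs. */
struct keep_floors_sketch
{
  int64_t recovery;    /* crash-recovery floor */
  int64_t vacuum;
  int64_t cdc;
  int64_t flashback;
  int64_t ha;
};

/* Lower the limit to the floor, unless the consumer reports no floor. */
static int64_t
clamp_one_sketch (int64_t limit, int64_t floor)
{
  return (floor == SKETCH_NULL_PAGEID) ? limit : SKETCH_MIN (limit, floor);
}

/* Returns the first page id that must be kept. Only archives lying wholly
 * below the returned page id would be candidates for removal. */
static int64_t
first_pageid_to_keep_sketch (int64_t candidate,
                             const struct keep_floors_sketch *floors)
{
  int64_t keep = candidate;

  assert (floors != NULL);
  keep = clamp_one_sketch (keep, floors->recovery);
  keep = clamp_one_sketch (keep, floors->vacuum);
  keep = clamp_one_sketch (keep, floors->cdc);
  keep = clamp_one_sketch (keep, floors->flashback);
  keep = clamp_one_sketch (keep, floors->ha);
  return keep;
}
```

With floors { 500, 300, none, 900, 420 } and a candidate of 1000, the sketch keeps everything from page 300 upward, because the lowest active floor wins regardless of the order in which the consumers are consulted.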

Position hints as of this revision

The following are line numbers as observed on 2026-06-08; symbols are the canonical anchor and line numbers are hints that decay.

Symbol	File	Line
`LOG_PAGESIZE`	`src/storage/storage_common.h`	99
`log_Zip_support`	`src/transaction/log_append.cpp`	40
`log_Zip_min_size_to_compress`	`src/transaction/log_append.cpp`	41
`log_append_info::get_nxio_lsa`	`src/transaction/log_append.cpp`	106
`log_append_info::set_nxio_lsa`	`src/transaction/log_append.cpp`	112
`log_prior_lsa_info::log_prior_lsa_info`	`src/transaction/log_append.cpp`	117
`LOG_RESET_APPEND_LSA`	`src/transaction/log_append.cpp`	128
`LOG_RESET_PREV_LSA`	`src/transaction/log_append.cpp`	136
`LOG_APPEND_PTR`	`src/transaction/log_append.cpp`	145
`log_append_init_zip`	`src/transaction/log_append.cpp`	185
`log_append_final_zip`	`src/transaction/log_append.cpp`	232
`prior_lsa_alloc_and_copy_data`	`src/transaction/log_append.cpp`	273
`prior_lsa_alloc_and_copy_crumbs`	`src/transaction/log_append.cpp`	410
`prior_lsa_copy_undo_data_to_node`	`src/transaction/log_append.cpp`	493
`prior_lsa_copy_redo_data_to_node`	`src/transaction/log_append.cpp`	524
`prior_lsa_gen_undoredo_record_from_crumbs`	`src/transaction/log_append.cpp`	651
`prior_lsa_gen_record`	`src/transaction/log_append.cpp`	1217
`prior_update_header_mvcc_info`	`src/transaction/log_append.cpp`	1320
`prior_lsa_next_record_internal`	`src/transaction/log_append.cpp`	1357
`commit_abort_lsa`	`src/transaction/log_append.cpp`	1485
`prior_lsa_next_record`	`src/transaction/log_append.cpp`	1553
`prior_lsa_next_record_with_lock`	`src/transaction/log_append.cpp`	1559
`prior_set_tde_encrypted`	`src/transaction/log_append.cpp`	1565
`prior_is_tde_encrypted`	`src/transaction/log_append.cpp`	1581
`prior_lsa_start_append`	`src/transaction/log_append.cpp`	1593
`prior_lsa_end_append`	`src/transaction/log_append.cpp`	1652
`prior_lsa_append_data`	`src/transaction/log_append.cpp`	1661
`log_append_get_zip_undo`	`src/transaction/log_append.cpp`	1725
`log_append_get_zip_redo`	`src/transaction/log_append.cpp`	1751
`log_prior_lsa_append_align`	`src/transaction/log_append.cpp`	1892
`log_prior_lsa_append_advance_when_doesnot_fit`	`src/transaction/log_append.cpp`	1905
`log_prior_lsa_append_add_align`	`src/transaction/log_append.cpp`	1917
`log_crumb`	`src/transaction/log_append.hpp`	46
`log_data_addr`	`src/transaction/log_append.hpp`	53
`LOG_PRIOR_LSA_LOCK`	`src/transaction/log_append.hpp`	66
`log_append_info`	`src/transaction/log_append.hpp`	73
`log_prior_node`	`src/transaction/log_append.hpp`	91
`log_prior_lsa_info`	`src/transaction/log_append.hpp`	112
`log_zip_alloc`	`src/transaction/log_compress.c`	237
`log_zip`	`src/transaction/log_compress.h`	53
`log_global::log_global`	`src/transaction/log_global.c`	49
`LOGAREA_SIZE`	`src/transaction/log_impl.h`	121
`log_setdirty`	`src/transaction/log_impl.h`	305
`log_flush_info`	`src/transaction/log_impl.h`	322
`log_topops_addresses`	`src/transaction/log_impl.h`	353
`log_topops_stack`	`src/transaction/log_impl.h`	362
`log_rcv_tdes`	`src/transaction/log_impl.h`	458
`log_tdes`	`src/transaction/log_impl.h`	475
`log_global`	`src/transaction/log_impl.h`	671
`log_lsa`	`src/transaction/log_lsa.hpp`	35
`NULL_LSA`	`src/transaction/log_lsa.hpp`	67
`MAX_LSA`	`src/transaction/log_lsa.hpp`	72
`LSA_COPY`	`src/transaction/log_lsa.hpp`	80
`LSA_AS_ARGS`	`src/transaction/log_lsa.hpp`	91
`LOG_TDES_LAST_SYSOP`	`src/transaction/log_manager.c`	199
`LOG_TDES_LAST_SYSOP_PARENT_LSA`	`src/transaction/log_manager.c`	200
`LOG_TDES_LAST_SYSOP_POSP_LSA`	`src/transaction/log_manager.c`	201
`log_Flush_daemon`	`src/transaction/log_manager.c`	363
`log_create_internal`	`src/transaction/log_manager.c`	827
`log_initialize_internal`	`src/transaction/log_manager.c`	1100
`log_abort_by_tdes`	`src/transaction/log_manager.c`	1583
`log_abort_all_active_transaction`	`src/transaction/log_manager.c`	1608
`log_final`	`src/transaction/log_manager.c`	1720
`log_append_undoredo_data`	`src/transaction/log_manager.c`	1893
`log_append_undo_data`	`src/transaction/log_manager.c`	1973
`log_append_redo_data`	`src/transaction/log_manager.c`	2035
`log_append_undoredo_crumbs`	`src/transaction/log_manager.c`	2086
`log_append_postpone`	`src/transaction/log_manager.c`	2719
`log_append_run_postpone`	`src/transaction/log_manager.c`	2881
`log_append_compensate_internal`	`src/transaction/log_manager.c`	3047
`log_append_savepoint`	`src/transaction/log_manager.c`	3365
`log_sysop_start`	`src/transaction/log_manager.c`	3599
`log_sysop_start_atomic`	`src/transaction/log_manager.c`	3665
`log_sysop_commit_internal`	`src/transaction/log_manager.c`	3825
`log_sysop_commit`	`src/transaction/log_manager.c`	3916
`log_sysop_end_logical_undo`	`src/transaction/log_manager.c`	3941
`log_sysop_end_logical_compensate`	`src/transaction/log_manager.c`	3984
`log_sysop_end_logical_run_postpone`	`src/transaction/log_manager.c`	4003
`log_sysop_end_recovery_postpone`	`src/transaction/log_manager.c`	4024
`log_sysop_abort`	`src/transaction/log_manager.c`	4038
`log_sysop_attach_to_outer`	`src/transaction/log_manager.c`	4097
`log_append_commit_postpone`	`src/transaction/log_manager.c`	4384
`log_append_sysop_start_postpone`	`src/transaction/log_manager.c`	4455
`log_append_repl_info_and_commit_log`	`src/transaction/log_manager.c`	4647
`log_append_donetime_internal`	`src/transaction/log_manager.c`	4679
`log_change_tran_as_completed`	`src/transaction/log_manager.c`	4722
`log_append_commit_log`	`src/transaction/log_manager.c`	4779
`log_append_commit_log_with_lock`	`src/transaction/log_manager.c`	4802
`log_append_abort_log`	`src/transaction/log_manager.c`	4816
`log_commit_local`	`src/transaction/log_manager.c`	5159
`log_abort_local`	`src/transaction/log_manager.c`	5277
`log_commit`	`src/transaction/log_manager.c`	5352
`log_abort`	`src/transaction/log_manager.c`	5461
`log_abort_partial`	`src/transaction/log_manager.c`	5558
`log_complete`	`src/transaction/log_manager.c`	5653
`log_rollback`	`src/transaction/log_manager.c`	7664
`log_tran_do_postpone`	`src/transaction/log_manager.c`	8156
`log_sysop_do_postpone`	`src/transaction/log_manager.c`	8190
`log_do_postpone`	`src/transaction/log_manager.c`	8237
`log_run_postpone_op`	`src/transaction/log_manager.c`	8481
`log_wakeup_remove_log_archive_daemon`	`src/transaction/log_manager.c`	10099
`log_wakeup_log_flush_daemon`	`src/transaction/log_manager.c`	10126
`log_is_log_flush_daemon_available`	`src/transaction/log_manager.c`	10141
`log_remove_log_archive_daemon_task`	`src/transaction/log_manager.c`	10185
`log_flush_execute`	`src/transaction/log_manager.c`	10377
`log_flush_daemon_init`	`src/transaction/log_manager.c`	10493
`log_abort_task_execute`	`src/transaction/log_manager.c`	10558
`cdc_min_log_pageid_to_keep`	`src/transaction/log_manager.c`	14021
`LOG_IS_SYSTEM_OP_STARTED`	`src/transaction/log_manager.h`	59
`LOGPB_HEADER_PAGE_ID`	`src/transaction/log_page_buffer.c`	138
`LOG_APPEND_ALIGN`	`src/transaction/log_page_buffer.c`	164
`LOG_APPEND_ADVANCE_WHEN_DOESNOT_FIT`	`src/transaction/log_page_buffer.c`	176
`LOG_APPEND_ADVANCE_WHEN_DOESNOT_FIT`	`src/transaction/log_page_buffer.c`	177
`LOG_APPEND_SETDIRTY_ADD_ALIGN`	`src/transaction/log_page_buffer.c`	184
`log_buffer`	`src/transaction/log_page_buffer.c`	192
`log_buffer`	`src/transaction/log_page_buffer.c`	194
`log_pb_global_data`	`src/transaction/log_page_buffer.c`	244
`logpb_get_log_buffer`	`src/transaction/log_page_buffer.c`	394
`logpb_initialize_log_buffer`	`src/transaction/log_page_buffer.c`	425
`logpb_compute_page_checksum`	`src/transaction/log_page_buffer.c`	446
`logpb_set_page_checksum`	`src/transaction/log_page_buffer.c`	495
`logpb_page_has_valid_checksum`	`src/transaction/log_page_buffer.c`	523
`logpb_initialize_pool`	`src/transaction/log_page_buffer.c`	553
`logpb_finalize_pool`	`src/transaction/log_page_buffer.c`	672
`logpb_create_page`	`src/transaction/log_page_buffer.c`	783
`logpb_locate_page`	`src/transaction/log_page_buffer.c`	807
`logpb_set_dirty`	`src/transaction/log_page_buffer.c`	929
`logpb_flush_header`	`src/transaction/log_page_buffer.c`	1676
`logpb_fetch_start_append_page`	`src/transaction/log_page_buffer.c`	2504
`logpb_fetch_start_append_page_new`	`src/transaction/log_page_buffer.c`	2586
`logpb_next_append_page`	`src/transaction/log_page_buffer.c`	2630
`logpb_writev_append_pages`	`src/transaction/log_page_buffer.c`	2780
`logpb_write_toflush_pages_to_archive`	`src/transaction/log_page_buffer.c`	2868
`logpb_append_next_record`	`src/transaction/log_page_buffer.c`	2981
`logpb_append_prior_lsa_list`	`src/transaction/log_page_buffer.c`	3040
`prior_lsa_remove_prior_list`	`src/transaction/log_page_buffer.c`	3084
`logpb_prior_lsa_append_all_list`	`src/transaction/log_page_buffer.c`	3106
`logpb_flush_all_append_pages`	`src/transaction/log_page_buffer.c`	3232
`logpb_flush_pages_direct`	`src/transaction/log_page_buffer.c`	3952
`logpb_flush_pages`	`src/transaction/log_page_buffer.c`	3980
`logpb_force_flush_pages`	`src/transaction/log_page_buffer.c`	4096
`logpb_force_flush_header_and_pages`	`src/transaction/log_page_buffer.c`	4104
`logpb_invalid_all_append_pages`	`src/transaction/log_page_buffer.c`	4121
`logpb_flush_log_for_wal`	`src/transaction/log_page_buffer.c`	4162
`logpb_start_append`	`src/transaction/log_page_buffer.c`	4207
`logpb_append_data`	`src/transaction/log_page_buffer.c`	4290
`logpb_append_crumbs`	`src/transaction/log_page_buffer.c`	4366
`logpb_end_append`	`src/transaction/log_page_buffer.c`	4455
`logpb_archive_active_log`	`src/transaction/log_page_buffer.c`	5649
`logpb_remove_archive_logs_exceed_limit`	`src/transaction/log_page_buffer.c`	5991
`logpb_fatal_error`	`src/transaction/log_page_buffer.c`	10607
`logpb_fatal_error_exit_immediately_wo_flush`	`src/transaction/log_page_buffer.c`	10618
`logpb_fatal_error_internal`	`src/transaction/log_page_buffer.c`	10629
`logpb_initialize_flush_info`	`src/transaction/log_page_buffer.c`	10878
`logpb_finalize_flush_info`	`src/transaction/log_page_buffer.c`	10912
`logpb_need_wal`	`src/transaction/log_page_buffer.c`	11229
`logpb_page_check_corruption`	`src/transaction/log_page_buffer.c`	11508
`logpb_get_tde_algorithm`	`src/transaction/log_page_buffer.c`	11565
`logpb_set_tde_algorithm`	`src/transaction/log_page_buffer.c`	11593
`log_rectype`	`src/transaction/log_record.hpp`	35
`log_rec_header`	`src/transaction/log_record.hpp`	146
`log_data`	`src/transaction/log_record.hpp`	157
`log_rec_undoredo`	`src/transaction/log_record.hpp`	167
`log_rec_undo`	`src/transaction/log_record.hpp`	176
`log_rec_redo`	`src/transaction/log_record.hpp`	184
`log_vacuum_info`	`src/transaction/log_record.hpp`	192
`log_rec_mvcc_undoredo`	`src/transaction/log_record.hpp`	202
`log_rec_mvcc_undo`	`src/transaction/log_record.hpp`	211
`log_rec_mvcc_redo`	`src/transaction/log_record.hpp`	220
`log_rec_donetime`	`src/transaction/log_record.hpp`	237
`log_rec_compensate`	`src/transaction/log_record.hpp`	262
`log_rec_start_postpone`	`src/transaction/log_record.hpp`	271
`log_sysop_end_type`	`src/transaction/log_record.hpp`	285
`log_rec_sysop_end`	`src/transaction/log_record.hpp`	305
`log_rec_sysop_start_postpone`	`src/transaction/log_record.hpp`	328
`log_rec_run_postpone`	`src/transaction/log_record.hpp`	336
`log_rec_savept`	`src/transaction/log_record.hpp`	380
`LOG_GET_LOG_RECORD_HEADER`	`src/transaction/log_record.hpp`	441
`LOG_IS_MVCC_OP_RECORD_TYPE`	`src/transaction/log_record.hpp`	463
`LOG_HDRPAGE_FLAG_ENCRYPTED_MASK`	`src/transaction/log_storage.hpp`	45
`LOG_IS_PAGE_TDE_ENCRYPTED`	`src/transaction/log_storage.hpp`	47
`LOGPB_HEADER_PAGE_ID`	`src/transaction/log_storage.hpp`	51
`log_hdrpage`	`src/transaction/log_storage.hpp`	63
`log_page`	`src/transaction/log_storage.hpp`	80
`log_page`	`src/transaction/log_storage.hpp`	81
`log_header`	`src/transaction/log_storage.hpp`	113
`log_arv_header`	`src/transaction/log_storage.hpp`	231
`logtb_get_new_tran_id`	`src/transaction/log_tran_table.c`	1741
`LOG_IS_MVCC_OPERATION`	`src/transaction/mvcc.h`	261

Sources

cubrid-log-manager.md — the high-level companion. See also cubrid-prior-list.md (the prior-list mechanism) and cubrid-recovery-manager.md (how these records are replayed).
Raw analyses under raw/code-analysis/cubrid/storage/log_manager/.
Code: src/transaction/log_manager.{c,h}, log_append.{cpp,hpp}, log_record.hpp, log_lsa.{cpp,hpp}, log_storage.hpp, log_page_buffer.c.
Methodology: knowledge/methodology/code-analysis-detail-doc.md.