CUBRID Log Manager — Code-Level Deep Dive
Where this document fits: The high-level analysis
cubrid-log-manager.md covers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full lifecycle of a single log record inside the kernel.
Chapter 1: Data Structure Map
A log record is re-encoded across three tiers from caller to disk:
caller inputs (log_data_addr, log_crumb), the staging tier
(log_prior_node, log_prior_lsa_info, log_append_info), and the
on-disk tier (log_hdrpage, log_page, log_header,
log_arv_header). The body — the log_rec_* family off
log_rec_header — flows through all three unchanged. This is the
field-level map; later chapters trace the motion between tiers.
Cross-link: WAL theory (why the log LSA orders everything, why the log must reach disk before the data page) lives in the high-level companion
cubrid-log-manager.md. This chapter documents the structures the rule operates over, not the rule.
1.1 The addressing primitive — log_lsa
A log sequence address (LSA) packs a logical page id and an in-page byte offset into a 64-bit bit-field.

    // log_lsa -- src/transaction/log_lsa.hpp
    struct log_lsa
    {
      std::int64_t pageid:48;   /* Log page identifier : 6 bytes length */
      std::int64_t offset:16;   /* Offset in page. :16 of int64 (not short) for alignment */
      // ... condensed: is_null(), is_max(), set_null(); ordering compares pageid then offset ...
    };

| Field | Role | Why |
|---|---|---|
pageid (48b) | Logical page id in the infinite log | Unbounded append-only page sequence |
offset (16b) | Byte offset within area[] | :16 of an int64 packs to 8 bytes |
INVARIANT — LSA total ordering. operator< compares pageid then
offset, making LSAs a monotone WAL clock; every before/after and
durability decision is an LSA comparison. Lose it and the WAL rule (Ch 7)
and recovery replay cannot decide what to redo.
Sentinels/shims: NULL_LSA = {-1,-1} (set_null() writes both
fields), MAX_LSA = {(1<<47)-1,(1<<15)-1}, and the LSA_* macros
(LSA_COPY, LSA_SET_NULL, LSA_ISNULL, LSA_EQ/LE/LT/GE/GT,
LSA_AS_ARGS) — inline wrappers over the operators so legacy C compiles.
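
The packing and ordering rule can be sketched in isolation. This is a minimal illustration, not the CUBRID definition: `lsa_sketch`, `make_lsa`, and the two `*_SKETCH` constants are hypothetical names, and only the 48:16 split, the pageid-then-offset comparison, and the `{-1,-1}` null sentinel are taken from the text above. The `static_assert` assumes an ABI that packs both bit-fields into one 64-bit word.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for log_lsa: same 48:16 split, nothing else copied.
struct lsa_sketch
{
  std::int64_t pageid : 48;
  std::int64_t offset : 16;

  // Null sentinel as described in the text: both fields are -1.
  bool is_null () const { return pageid == -1 && offset == -1; }
};

// Assumption: the two bit-fields share one 64-bit storage unit.
static_assert (sizeof (lsa_sketch) == 8, "48 + 16 bits should pack into 8 bytes");

// Largest values the signed bit-fields can hold.
constexpr std::int64_t MAX_PAGEID_SKETCH = (std::int64_t {1} << 47) - 1;
constexpr std::int64_t MAX_OFFSET_SKETCH = (std::int64_t {1} << 15) - 1;

inline lsa_sketch make_lsa (std::int64_t pageid, std::int64_t offset)
{
  lsa_sketch lsa;
  lsa.pageid = pageid;   // caller keeps values inside the 48-bit range
  lsa.offset = offset;   // caller keeps values inside the 16-bit range
  return lsa;
}

// Total order: compare pageid first, then offset.
inline bool operator< (const lsa_sketch &a, const lsa_sketch &b)
{
  if (a.pageid != b.pageid)
    {
      return a.pageid < b.pageid;
    }
  return a.offset < b.offset;
}
```

With this ordering, `make_lsa (3, 100) < make_lsa (4, 0)` holds even though the offset is larger, which is the property every before/after decision relies on.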
1.2 The record header — log_rec_header
Every on-disk record begins with a fixed log_rec_header threading it
into a physical chain and a per-transaction chain:

    // log_rec_header -- src/transaction/log_record.hpp
    struct log_rec_header
    {
      LOG_LSA prev_tranlsa;         /* prev record of SAME transaction */
      LOG_LSA back_lsa, forw_lsa;   /* physically prev / next record */
      TRANID trid;
      LOG_RECTYPE type;
    };

| Field | Role | Why |
|---|---|---|
prev_tranlsa | Prior record of the same transaction | Undo walks one transaction backward |
back_lsa | Physically previous record | Reverse log scan |
forw_lsa | Physically next record | Redo forward scan; NULL_LSA until successor known (Ch 4) |
trid | Owning transaction id | Demultiplexes the interleaved stream |
type | LOG_RECTYPE discriminator | Tagged-union tag; selects the payload struct |
INVARIANT — header forms the doubly linked physical chain. For
adjacent A then B: B.back_lsa == addr(A) and A.forw_lsa == addr(B).
back_lsa is set at build, forw_lsa only once the successor’s LSA is
known — so the chain is briefly half-open at the tail. Disagreement makes
undo and redo scans visit different record sets, breaking recovery.
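
The chain invariant can be written as a small consistency check. The sketch below is illustrative only: it flattens an LSA to a single integer, keys records by their own address in a `std::map`, and assumes the first record carries a null `back_lsa`. `rec_header_sketch` and `chain_is_consistent` are hypothetical names, not CUBRID functions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using lsa_t = std::int64_t;              // hypothetical flat LSA, for illustration only
constexpr lsa_t NULL_LSA_SKETCH = -1;

struct rec_header_sketch
{
  lsa_t back_lsa;   // physically previous record
  lsa_t forw_lsa;   // physically next record, null until the successor is known
};

// Returns true when every adjacent pair agrees in both directions.
// The tail record is expected to still have a null forw_lsa (half-open chain).
inline bool chain_is_consistent (const std::map<lsa_t, rec_header_sketch> &records)
{
  lsa_t prev_addr = NULL_LSA_SKETCH;
  const rec_header_sketch *prev = nullptr;
  for (const auto &entry : records)
    {
      if (entry.second.back_lsa != prev_addr)
        {
          return false;   // B.back_lsa != addr(A)
        }
      if (prev != nullptr && prev->forw_lsa != entry.first)
        {
          return false;   // A.forw_lsa != addr(B)
        }
      prev_addr = entry.first;
      prev = &entry.second;
    }
  return prev == nullptr || prev->forw_lsa == NULL_LSA_SKETCH;
}
```

A forward scan and a backward scan visit the same records exactly when this check passes.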
1.3 The type tag — log_rectype
The enum that the type field ranges over is explicitly numbered and append-only:
obsolete values are wrapped in #if 0 rather than deleted, so the
on-disk integer meaning never shifts.

    // log_rectype -- src/transaction/log_record.hpp (condensed)
    enum log_rectype
    {
      LOG_SMALLER_LOGREC_TYPE = 0,   /* lower-bound check */
    #if 0
      LOG_CLIENT_NAME = 1,           /* Obsolete -- hole preserved */
    #endif
      LOG_UNDOREDO_DATA = 2,
      LOG_UNDO_DATA = 3,
      LOG_REDO_DATA = 4,
      // ... LOG_COMMIT=17, LOG_SYSOP_END=20, LOG_ABORT=22, ...
      LOG_MVCC_UNDOREDO_DATA = 46,
      LOG_MVCC_UNDO_DATA = 47,
      LOG_MVCC_REDO_DATA = 48,
      LOG_MVCC_DIFF_UNDOREDO_DATA = 49,
      LOG_SYSOP_ATOMIC_START = 50,
      LOG_DUMMY_GENERIC = 51,        /* dummy used for flush */
      LOG_SUPPLEMENTAL_INFO = 52,
      LOG_LARGER_LOGREC_TYPE         /* upper-bound check */
    };

INVARIANT — sentinel bounds and stable wire values.
LOG_SMALLER_LOGREC_TYPE (0) and LOG_LARGER_LOGREC_TYPE bracket the
valid range; because the integer is persisted a number is never reused
(the #if 0 holes guarantee it). Classification macros
(LOG_IS_UNDO_RECORD_TYPE, LOG_IS_REDO_RECORD_TYPE,
LOG_IS_UNDOREDO_RECORD_TYPE, LOG_IS_MVCC_OP_RECORD_TYPE) read type.
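
A sketch of how the two sentinels bracket a persisted type value read back from disk. The enum below is a hypothetical subset that reuses only the numbers quoted above, and `is_plausible_rectype` is an illustrative helper rather than a CUBRID macro. It performs the range check only and does not reject values that fall in obsolete holes.

```cpp
#include <cassert>

// Hypothetical subset of the record-type enum; values mirror those quoted in the text.
enum rectype_sketch
{
  SMALLER_TYPE_SKETCH = 0,         // lower sentinel, never a real record
  // 1 is an obsolete hole and stays unassigned so later values never shift
  UNDOREDO_DATA_SKETCH = 2,
  UNDO_DATA_SKETCH = 3,
  REDO_DATA_SKETCH = 4,
  SUPPLEMENTAL_INFO_SKETCH = 52,
  LARGER_TYPE_SKETCH               // upper sentinel, implicitly 53 here
};

// Range check only: true when the raw on-disk integer lies strictly between
// the two sentinels. Holes inside the range are not detected by this helper.
inline bool is_plausible_rectype (int raw)
{
  return raw > SMALLER_TYPE_SKETCH && raw < LARGER_TYPE_SKETCH;
}
```

Because the upper sentinel is left unnumbered, appending a new type before it moves the bound automatically while every existing value keeps its meaning.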
1.4 The recovery-data locator — log_data
Undo/redo payloads embed a log_data naming where on a data volume the
change applies — a recovery coordinate, not the log’s address:

    // log_data -- src/transaction/log_record.hpp
    struct log_data
    {
      LOG_RCVINDEX rcvindex;
      PAGEID pageid;
      PGLENGTH offset;
      VOLID volid;
    };

| Field | Role | Why |
|---|---|---|
rcvindex | Index into the recovery dispatch table | Picks the rv* function for the bytes |
pageid | Target data page id | Page to refix |
offset | Offset/slot within that page | Where the change lands |
volid | Volume id of the target page | Disambiguates pageid across volumes |
1.5 The payload family — undo/redo and MVCC variants
The type tag selects one payload, following the header on the page; all
build on log_data:

    // log_rec_undoredo / undo / redo -- src/transaction/log_record.hpp
    struct log_rec_undoredo { LOG_DATA data; int ulength, rlength; };
    struct log_rec_undo { LOG_DATA data; int length; };
    struct log_rec_redo { LOG_DATA data; int length; };

log_rec_undoredo carries ulength+rlength (lengths frame the two
blobs); log_rec_undo carries one length (undo image only, logical
undo); log_rec_redo carries one length (redo image only, page-physical
redo). MVCC variants wrap these and attach an MVCC id plus vacuum
bookkeeping:

    // MVCC payload wrappers -- src/transaction/log_record.hpp
    struct log_rec_mvcc_undoredo { LOG_REC_UNDOREDO undoredo; MVCCID mvccid; LOG_VACUUM_INFO vacuum_info; };
    struct log_rec_mvcc_undo { LOG_REC_UNDO undo; MVCCID mvccid; LOG_VACUUM_INFO vacuum_info; };
    struct log_rec_mvcc_redo { LOG_REC_REDO redo; MVCCID mvccid; };   /* no vacuum_info */

| Struct | Wraps | Adds | Why |
|---|---|---|---|
log_rec_mvcc_undoredo | log_rec_undoredo | mvccid, vacuum_info | MVCC ops vacuum tracks |
log_rec_mvcc_undo | log_rec_undo | mvccid, vacuum_info | MVCC delete-style ops |
log_rec_mvcc_redo | log_rec_redo | mvccid only | Pure redo creates no version to vacuum |
log_vacuum_info is the back-pointer carried by undo MVCC records:

    // log_vacuum_info -- src/transaction/log_record.hpp
    struct log_vacuum_info
    {
      LOG_LSA prev_mvcc_op_log_lsa;
      VFID vfid;
    };

| Field | Role | Why |
|---|---|---|
prev_mvcc_op_log_lsa | LSA of the previous MVCC-op record | Vacuum walks this chain in log order |
vfid | File the change belongs to | Detect dropped/reused file; decide object kind |
1.6 The staging node — log_prior_node
The append path materializes a record as a log_prior_node linked into
the prior list — the central staging structure (Ch 3–5):

    // log_prior_node -- src/transaction/log_append.hpp
    struct log_prior_node
    {
      LOG_RECORD_HEADER log_header;
      LOG_LSA start_lsa;
      bool tde_encrypted;
      int data_header_length;
      char *data_header;
      int ulength;
      char *udata;
      int rlength;
      char *rdata;
      LOG_PRIOR_NODE *next;
    };

| Field | Role | Why |
|---|---|---|
log_header | Embedded log_rec_header | Copied onto the page; back_lsa/forw_lsa filled at linking |
start_lsa / tde_encrypted | Assigned LSA; encryption flag | LSA asserted vs page offset; flag drives hdr.flags at drain |
data_header_length / data_header | Length + buffer of the log_rec_* struct | Serialized apart from variable data |
ulength/udata, rlength/rdata | Length + buffer of undo / redo bytes | The two images, possibly compressed |
next | Next node | Orders nodes awaiting drain |
INVARIANT — a prior node owns its heap buffers. data_header,
udata, rdata are independently malloc-ed; length is zero exactly
when the pointer is unused. Drain frees them after copying into the page
buffer; leak or double-free corrupts the heap.
1.7 The prior-list anchor — log_prior_lsa_info
The in-memory anchor for the whole prior list, holding the LSA cursor, the list head/tail, and the serializing mutex:

    // log_prior_lsa_info -- src/transaction/log_append.hpp
    struct log_prior_lsa_info
    {
      LOG_LSA prior_lsa;
      LOG_LSA prev_lsa;
      LOG_PRIOR_NODE *prior_list_header;
      LOG_PRIOR_NODE *prior_list_tail;
      INT64 list_size;   /* bytes */
      LOG_PRIOR_NODE *prior_flush_list_header;
      std::mutex prior_lsa_mutex;
    };

| Field | Role | Why |
|---|---|---|
prior_lsa | Next LSA to assign | Monotone allocator cursor; advanced by record size (Ch 4) |
prev_lsa | LSA of the last appended record | Fills the next node’s back_lsa |
prior_list_header / prior_list_tail | Head / tail of the awaiting-drain list | Drain start; O(1) append |
list_size | Total bytes staged | Flusher decides when to drain |
prior_flush_list_header | Head of the detached flush sublist | Drain steals here so producers keep appending |
prior_lsa_mutex | Mutex over all the above | LSA assignment + linkage atomic |
INVARIANT — prior_lsa_mutex serializes LSA assignment. prior_lsa
is advanced and the node linked under one acquisition, so no two records
share an LSA and list order matches LSA order; splitting the two
mis-orders the drained page.
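
The single-acquisition rule can be sketched as follows. This is a simplified model, not the CUBRID append routine: LSAs are flattened to a byte counter that ignores page boundaries, and `node_sketch`, `prior_info_sketch`, and `append_node` are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

struct node_sketch
{
  std::int64_t start_lsa = -1;   // assigned under the mutex
  std::int64_t back_lsa = -1;    // LSA of the physically previous record
  std::int64_t size = 0;         // bytes this record occupies
  node_sketch *next = nullptr;
};

struct prior_info_sketch
{
  std::int64_t prior_lsa = 0;    // next LSA to hand out
  std::int64_t prev_lsa = -1;    // LSA of the last appended node
  node_sketch *head = nullptr;
  node_sketch *tail = nullptr;
  std::int64_t list_size = 0;    // staged bytes
  std::mutex mtx;
};

// Assigns the LSA and links the node under ONE lock acquisition, so list
// order and LSA order cannot diverge between concurrent producers.
inline std::int64_t append_node (prior_info_sketch &info, node_sketch &node)
{
  std::lock_guard<std::mutex> guard (info.mtx);

  node.start_lsa = info.prior_lsa;
  node.back_lsa = info.prev_lsa;
  info.prev_lsa = node.start_lsa;
  info.prior_lsa += node.size;   // advance the cursor by the record size

  node.next = nullptr;
  if (info.tail == nullptr)
    {
      info.head = &node;
    }
  else
    {
      info.tail->next = &node;
    }
  info.tail = &node;
  info.list_size += node.size;
  return node.start_lsa;
}
```

If the LSA were taken in one critical section and the link made in another, two producers could interleave so that the node with the larger LSA sits earlier in the list, which is the mis-ordering the invariant rules out.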
1.8 The on-disk append cursor — log_append_info
The disk-facing append point, holding the open log file, the fixed page, and the lowest LSA not yet on disk:

    // log_append_info -- src/transaction/log_append.hpp
    struct log_append_info
    {
      int vdes;
      std::atomic<LOG_LSA> nxio_lsa;   /* Lowest LSA NOT yet written to disk (WAL) */
      LOG_LSA prev_lsa;
      LOG_PAGE *log_pgptr;
      bool appending_page_tde_encrypted;
      // ... condensed: get_nxio_lsa(), set_nxio_lsa() ...
    };

| Field | Role | Why |
|---|---|---|
vdes | OS fd of the active log volume | Target of page writes |
nxio_lsa | Atomic lowest LSA not yet flushed | WAL watermark; readers/flusher race without the prior mutex |
prev_lsa | Last record appended to the buffer | Drain-side mirror of staging prev_lsa |
log_pgptr | Currently fixed log page | Drain target; replaced on page boundary (Ch 6) |
appending_page_tde_encrypted | Live page must be encrypted | Carries the node’s tde_encrypted onto the page |
INVARIANT — nxio_lsa is the WAL durability watermark. Records with
LSA < nxio_lsa are on disk; >= nxio_lsa are not. Flusher and WAL
checks touch it concurrently, so it is std::atomic, reached only via
get_nxio_lsa()/set_nxio_lsa(); a torn read lets a data page flush
ahead of its log (Ch 7).
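
A minimal sketch of the watermark test, assuming an 8-byte LSA that `std::atomic` can load and store as a unit. `lsa_sketch`, `append_info_sketch`, and `can_flush_data_page` are hypothetical names for illustration; the real WAL check lives elsewhere in the code base and is covered in Ch 7.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical 48:16 LSA, trivially copyable so it can live in std::atomic.
struct lsa_sketch
{
  std::int64_t pageid : 48;
  std::int64_t offset : 16;
};

inline bool lsa_lt (const lsa_sketch &a, const lsa_sketch &b)
{
  return a.pageid != b.pageid ? a.pageid < b.pageid : a.offset < b.offset;
}

struct append_info_sketch
{
  std::atomic<lsa_sketch> nxio_lsa;   // lowest LSA not yet on disk

  // All access goes through these two accessors, so every reader sees a
  // whole LSA and never half of an old value and half of a new one.
  lsa_sketch get_nxio_lsa () const { return nxio_lsa.load (); }
  void set_nxio_lsa (const lsa_sketch &lsa) { nxio_lsa.store (lsa); }
};

// Illustrative predicate: a data page whose latest change was logged at
// page_lsa may be written only if that record is already below the watermark.
inline bool can_flush_data_page (const append_info_sketch &append, const lsa_sketch &page_lsa)
{
  return lsa_lt (page_lsa, append.get_nxio_lsa ());
}
```

The predicate is strict: a record exactly at the watermark is not yet durable, matching the `< nxio_lsa` versus `>= nxio_lsa` split stated in the invariant.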
1.9 The caller inputs — log_data_addr and log_crumb
What a caller (heap/btree op) hands the append API; everything above derives from these:

    // log_data_addr / log_crumb -- src/transaction/log_append.hpp
    struct log_crumb { int length; const void *data; };
    struct log_data_addr { const VFID *vfid; PAGE_PTR pgptr; PGLENGTH offset; };

| Struct / Field | Role | Why |
|---|---|---|
log_crumb.length / .data | One contiguous piece of caller data | Callers pass an array to gather scattered buffers |
log_data_addr.vfid | File the page belongs to, or NULL | File/TDE context; log_data.volid/pageid come from pgptr, not vfid |
log_data_addr.pgptr | Pointer to the fixed data page | Its volid/pageid extracted into log_data |
log_data_addr.offset | Offset/slot of the change | Becomes log_data.offset; high bits hold LOG_RV_RECORD_* flags |
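
The gather role of a crumb array can be sketched like this. `crumb_sketch` mirrors the two fields shown above, while `gather_crumbs` is a hypothetical helper written for this example; it is not the routine the append path uses to copy caller data into a prior node.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Same shape as log_crumb: one contiguous piece of caller data.
struct crumb_sketch
{
  int length;
  const void *data;
};

// Concatenates scattered caller buffers into one contiguous image, in array order.
inline std::vector<char> gather_crumbs (const crumb_sketch *crumbs, int num_crumbs)
{
  std::size_t total = 0;
  for (int i = 0; i < num_crumbs; i++)
    {
      total += static_cast<std::size_t> (crumbs[i].length);
    }

  std::vector<char> image (total);
  std::size_t pos = 0;
  for (int i = 0; i < num_crumbs; i++)
    {
      if (crumbs[i].length > 0)   // skip empty crumbs, their data may be null
        {
          std::memcpy (image.data () + pos, crumbs[i].data,
                       static_cast<std::size_t> (crumbs[i].length));
          pos += static_cast<std::size_t> (crumbs[i].length);
        }
    }
  return image;
}
```

Passing an array lets a caller log, for example, a record header and a record body that live in different buffers without first copying them together itself.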
1.10 On-disk page structures — log_hdrpage and log_page
A physical log page is a log_hdrpage plus a flexible area[]. The
area[1] is the struct-hack — never sizeof it; use LOG_PAGESIZE:

    // log_hdrpage / log_page -- src/transaction/log_storage.hpp
    struct log_hdrpage
    {
      LOG_PAGEID logical_pageid;
      PGLENGTH offset;
      short flags;
      int checksum;
    };

    struct log_page
    {
      LOG_HDRPAGE hdr;
      char area[1];   /* area is flexible */
    };

| Field | Role | Why |
|---|---|---|
logical_pageid | Page id in the infinite sequence | Matches log_lsa.pageid; identity check on read |
offset | Offset of the first record starting here | Salvage anchor if the prior page is corrupt |
flags | TDE bits (..._ENCRYPTED_AES/ARIA) | LOG_IS_PAGE_TDE_ENCRYPTED tests the mask |
checksum | CRC32 over the page | Detects torn pages |
log_page.hdr | The header above | Fixed page prefix |
log_page.area[] | Record bytes | Sized by LOG_PAGESIZE |
INVARIANT — LOG_PAGEID -9 is the header page.
LOGPB_HEADER_PAGE_ID = -9 holds the log_header, carries no log
records, and is duplicated into every archive. Code must never write a
normal record onto pageid -9.
1.11 The volume headers — log_header and log_arv_header
log_header is the master control block on page -9. Every member,
grouped by role:
| Field group | Fields | Role |
|---|---|---|
| Identity / safety | magic, db_creation, db_release, db_compatibility, db_iopagesize, db_logpagesize, db_charset | Refuse a log from an incompatible build/page size |
| Append cursor | append_lsa, fpageid, eof_lsa | Persisted append loc, pageid at slot 1, end of log |
| Recovery | chkpt_lsa, smallest_lsa_at_last_chkpt | Lowest LSA recovery starts from |
| Transaction / MVCC | next_trid, mvcc_next_id, mvcc_op_log_lsa, oldest_visible_mvccid, newest_block_mvccid, vacuum_last_blockid, does_block_need_vacuum | Next ids to assign; vacuum’s progress |
| Archive | nxarv_pageid, nxarv_phy_pageid, nxarv_num, last_arv_num_for_syscrashes, last_deleted_arv_num, npages | Drives Ch 10’s archiving |
| Backup | bkup_level0_lsa/1/2, bkinfo[] | Per-level incremental backup anchors |
| HA / lifecycle | ha_server_state, ha_file_status, ha_promotion_time, is_shutdown, was_active_log_reset, has_logging_been_skipped, db_restore_time, mark_will_del | Replication state; clean-shutdown flag |
| Alignment / misc | dummy, dummy3, dummy4, vol_creation, avg_ntrans, avg_nlocks, was_copied, prefix_name, perm_status_obsolete | dummy* pads; vol_creation time; avg_* sizing hints; was_copied resets a copied DB; prefix_name log prefix; perm_status_obsolete legacy |
log_arv_header is the smaller header stamped on each archive file:

    // log_arv_header -- src/transaction/log_storage.hpp
    struct log_arv_header
    {
      char magic[CUBRID_MAGIC_MAX_LENGTH];
      INT32 dummy;
      INT64 db_creation;
      INT64 vol_creation;
      TRANID next_trid;
      DKNPAGES npages;
      LOG_PAGEID fpageid;
      int arv_num;
      INT32 dummy2;
    };

| Field | Role | Why |
|---|---|---|
magic | File-type magic | file/magic recognition + sanity |
db_creation / vol_creation | Creation timestamps | Match archive to its database/volume |
next_trid | Next trid at archive time | Recovery context |
npages | Page count in this archive | Bounds the page range |
fpageid | Logical pageid at physical slot 1 | Maps physical to logical pages |
arv_num | Archive sequence number | Matches log_header.nxarv_num chain |
dummy, dummy2 | Alignment pads | Keep the on-disk layout stable |
1.12 Struct relationships
flowchart TB
subgraph CALLER["Caller inputs"]
DADDR["log_data_addr"]
CRUMB["log_crumb[]"]
end
subgraph STAGE["Staging tier (memory)"]
PLINFO["log_prior_lsa_info"]
NODE["log_prior_node"]
AINFO["log_append_info"]
end
subgraph REC["Record body (all tiers)"]
HDR["log_rec_header"]
PAY["log_rec_*\n+ MVCC wrappers"]
end
subgraph DISK["On-disk tier"]
PAGE["log_page"]
HPAGE["log_hdrpage"]
LHDR["log_header (page -9)"]
AHDR["log_arv_header"]
end
DADDR --> NODE
CRUMB --> NODE
PLINFO -->|owns list of| NODE
NODE -->|embeds| HDR
NODE -->|serializes| PAY
PAY -->|embeds| LDATA["log_data"]
PLINFO -->|drains to| AINFO
AINFO -->|fixes / writes| PAGE
PAGE -->|hdr is| HPAGE
LHDR -->|append_lsa to| PAGE
LHDR -->|nxarv_* feed| AHDR
Figure 1-1. How a record’s structures connect across the three tiers.
1.13 Pointer-relationship summary
The LSA/pointer edges a modifier must keep consistent:

- Physical chain: log_rec_header.forw_lsa / back_lsa; prev_tranlsa.
- Staging allocator: log_prior_lsa_info.prior_lsa / prev_lsa.
- Durability watermark: log_append_info.nxio_lsa.
- Vacuum chain: log_vacuum_info.prev_mvcc_op_log_lsa.
1.14 Chapter summary — key takeaways
- A log record crosses three tiers: caller inputs, staging (log_prior_node anchored by log_prior_lsa_info, drained via log_append_info), and on-disk (log_page under log_header); the body (log_rec_header + a log_rec_* payload) is the constant.
- log_lsa is the 48:16 bit-field clock; its total ordering underlies every durability decision.
- log_rec_header threads each record into a physical doubly linked chain and a per-transaction chain, with type discriminating the append-only, hole-preserving log_rectype.
- MVCC wrappers add mvccid and (undo only) log_vacuum_info; the redo wrapper omits vacuum_info because a pure redo creates no version to vacuum.
- prior_lsa_mutex makes LSA assignment plus linkage atomic, and nxio_lsa is the atomic WAL watermark; these are the two concurrency invariants the append path rests on. The on-disk layout reserves pageid -9 for log_header, whose nxarv_* fields feed log_arv_header.
Chapter 2: Initialization and Memory
The reader question: before any record can be appended, how are the prior list, the page-buffer pool, the flush bookkeeping, and the global log state bootstrapped and allocated? For conceptual roles (what the prior list is for, why WAL demands a ring), see the companion cubrid-log-manager.md (“The append pipeline”, “Durability”). This chapter is bring-up mechanics: who mallocs, what each field starts at, which teardown frees it. Two entry points, both under LOG_CS, both calling log_final first if a prior instance is mounted:

- log_create_internal: runs once at DB creation. It formats the active-log volume, writes the first LOG_HEADER to page -9, flushes one empty append page, then tears the pool back down. No live state survives.
- log_initialize_internal: runs at every restart / SA boot. It mounts the existing active log, reads page -9, keeps the pool alive, and hands control to recovery.
2.1 The global singleton: log_global / log_Gl
Everything hangs off one process-wide singleton, log_Gl (struct log_global), default-constructed at static init by log_global::log_global (log_global.c); bring-up populates its members rather than allocating it.

    // log_global -- src/transaction/log_impl.h (condensed; #if SERVER_MODE members noted in the table)
    struct log_global
    {
      TRANTABLE trantable;
      LOG_APPEND_INFO append;
      LOG_PRIOR_LSA_INFO prior_info;
      LOG_HEADER hdr;
      LOG_ARCHIVES archive;
      LOG_PAGEID run_nxchkpt_atpageid;
      LOG_LSA chkpt_redo_lsa;
      DKNPAGES chkpt_every_npages;
      LOG_RECVPHASE rcv_phase;
      LOG_LSA rcv_phase_lsa;
      LOG_PAGE *loghdr_pgptr;
      LOG_FLUSH_INFO flush_info;
      LOG_GROUP_COMMIT_INFO group_commit_info;
      logwr_info *writer_info;   /* the ONLY heap member of the ctor: new logwr_info() */
      BACKGROUND_ARCHIVING_INFO bg_archive_info;
      mvcctable mvcc_table;
      GLOBAL_UNIQUE_STATS_TABLE unique_stats_table;
      // #if SERVER_MODE: flushed_lsa_lower_bound, chkpt_lsa_lock, backup_in_progress; #else: final_restored_lsa
    };

The ctor nulls every LSA-valued field to NULL_LSA, seeds flush_info to {0, 0, NULL, PTHREAD_MUTEX_INITIALIZER}, runs prior_info’s ctor (§2.5), and news writer_info (its only heap allocation). Every field:
| Field | Role | Why it exists / ctor seed |
|---|---|---|
trantable | Per-transaction LOG_TDES table | area == NULL is the “not initialized” sentinel; sized by logtb_define_trantable_log_latch. |
append | Live append cursor (vdes, log_pgptr, prev_lsa, atomic nxio_lsa) | Where prior nodes drain into a page; Ch 4-5. |
prior_info | In-memory prior-list head/tail + LSA cursors | Decouples LSA assignment from disk layout; Ch 3-5. |
hdr | In-RAM copy of on-disk LOG_HEADER (append_lsa/eof_lsa live here) | Avoids re-reading page -9. |
archive | Current archive descriptor cache | Used when a wanted page rolled into an archive. |
run_nxchkpt_atpageid | Page id where next checkpoint fires | NULL_PAGEID during create/init; recomputed at end of init. |
flushed_lsa_lower_bound / chkpt_lsa_lock | SERVER_MODE flush-coord LSA + chkpt-LSA mutex | NULL_LSA / PTHREAD_MUTEX_INITIALIZER. |
chkpt_redo_lsa / chkpt_every_npages | Redo-start LSA + checkpoint frequency | NULL_LSA / INT_MAX (latter from PRM_ID_LOG_CHECKPOINT_NPAGES). |
rcv_phase / rcv_phase_lsa | Recovery phase + its LSA | LOG_RECOVERY_ANALYSIS_PHASE / NULL_LSA; log_final resets phase. |
backup_in_progress / final_restored_lsa | #if pair: SERVER backup flag vs SA last-restored LSA | One per build; false / NULL_LSA. |
loghdr_pgptr | One LOG_PAGESIZE scratch page for header I/O | Global buffer malloc’d in log_initialize_internal, freed in log_final — distinct from the create-path local of the same name (§2.2). |
flush_info | toflush[] + counters + mutex | Dirty append pages to push on a flush; §2.4. |
group_commit_info | Mutex+cond for group commit | Lets committers coalesce fsyncs. |
writer_info | HA log-writer state | Only ctor new; deleted in ~log_global. |
bg_archive_info | Background archiving descriptor | Init’d at tail of init if PRM_ID_LOG_BACKGROUND_ARCHIVING is on. |
mvcc_table / unique_stats_table | MVCC snapshot table / global unique-index stats | Default-ctor / GLOBAL_UNIQUE_STATS_TABLE_INITIALIZER. |
graph TD
subgraph logGl["log_Gl (LOG_GLOBAL singleton)"]
A["append : LOG_APPEND_INFO<br/>vdes, log_pgptr, prev_lsa, nxio_lsa"]
P["prior_info : LOG_PRIOR_LSA_INFO<br/>prior_lsa, prev_lsa, list head/tail"]
F["flush_info : LOG_FLUSH_INFO<br/>toflush[], max/num_toflush, mutex"]
end
PB["log_Pb (LOG_PB_GLOBAL_DATA)<br/>buffers[], pages_area, header_page"]
F -- "toflush[] points into" --> PB
A -- "log_pgptr points into" --> PB
Figure 2-1. The global singleton and the separately-declared page-buffer global log_Pb.
2.2 log_create_internal — first-ever bring-up
Runs under LOG_CS_ENTER. Every branch:
1. Stale-state guard: trantable.area != NULL → log_final (§2.7).
2. umask; logpb_initialize_pool (§2.3) allocates the ring. Error → goto error.
3. logpb_initialize_log_names builds log_Name_active etc. Error → goto error.
4. logpb_initialize_header (&log_Gl.hdr, ...) fills the in-RAM header (page count, db_logpagesize = LOG_PAGESIZE). Error → goto error.
5. logpb_create_header_page carves the page -9 buffer into a stack local loghdr_pgptr, declared in log_create_internal, not the global log_Gl.loghdr_pgptr of §2.1; the create path scratches a separate page from the restart-path I/O buffer.
6. fileio_format creates the active-log file; the compound if jumps to error on any of vdes == NULL_VOLDES, logpb_fetch_start_append_page failing, or the local loghdr_pgptr == NULL:

        // log_create_internal -- src/transaction/log_manager.c
        log_Gl.append.vdes = fileio_format (thread_p, db_fullname, log_Name_active, ...);
        if (log_Gl.append.vdes == NULL_VOLDES
            || logpb_fetch_start_append_page (thread_p) != NO_ERROR
            || loghdr_pgptr == NULL)
          goto error;   /* <- any one failure unwinds the whole pool */

7. Mark the empty append page dirty; logpb_flush_pages_direct writes the end-of-log mark.
8. memcpy the in-RAM hdr into the local loghdr_pgptr->area; logpb_flush_page writes page -9 (error → goto error; under CUBRID_DEBUG it reads back and asserts).
9. Clear log_pgptr, dismount, create volume-info/log-info files, register active + backup-info volumes via logpb_add_volume.
10. Normal exit: logpb_finalize_pool, LOG_CS_EXIT, NO_ERROR.
The error: label runs the same logpb_finalize_pool + LOG_CS_EXIT (returning ER_FAILED if unset). Create never leaves a live pool.
INVARIANT — page -9 is the single source of truth for log geometry. It is the only place a fresh LOG_HEADER is written from scratch; every later boot reads it back. The step-8 memcpy plus the synchronous logpb_flush_page enforces it. If that flush fails silently, restart reads garbage geometry (db_logpagesize, fpageid) and re-formats or refuses to mount.
2.2b log_initialize_internal — restart bring-up
Shares the early scaffolding but diverges at the mount: it reads page -9, keeps the pool, dispatches to recovery. Every branch in order:
1. Clean-state guard: trantable.area != NULL → log_final.
2. Log-names init: logpb_initialize_log_names failure is fatal (logpb_fatal_error then goto error), not a plain propagate.
3. loghdr_pgptr malloc: the global log_Gl.loghdr_pgptr (page -9 I/O buffer for logpb_fetch_header / logpb_flush_header); NULL → fatal + goto error. Freed in log_final (§2.7) and on error:.
4. Pool init: logpb_initialize_pool (§2.3); error → goto error.
5. fileio_mount returning NULL_VOLDES splits two ways. A media crash (ismedia_crash != false) synthesizes an approximate header (logpb_initialize_header for geometry, then the forced fields below mark everything un-checkpointed, LOG_RESET_APPEND_LSA syncs into prior_info, chkpt_lsa nulled, nxarv_* maxed); else error_code = ER_IO_MOUNT_FAIL; goto error:

        // log_initialize_internal -- src/transaction/log_manager.c
        log_Gl.hdr.fpageid = LOGPAGEID_MAX;
        log_Gl.hdr.append_lsa.pageid = LOGPAGEID_MAX;
        log_Gl.hdr.append_lsa.offset = 0;
        LOG_RESET_APPEND_LSA (&log_Gl.hdr.append_lsa);

6. Non-NULL vdes else: logpb_fetch_header (&log_Gl.hdr) reads the real page -9 into the mirror.
7. Copy hdr.chkpt_lsa → chkpt_redo_lsa. restore_slave branch (ismedia_crash && r_args && r_args->restore_slave): copy db_creation, smallest_lsa_at_last_chkpt, append_lsa out into r_args for HA slave restore.
8. Prefix-name mismatch: strcmp (hdr.prefix_name, prefix_logname) != 0 → ER_LOG_INCOMPATIBLE_PREFIX_NAME (notification) and continue anyhow.
9. Page-size mismatch → recursive re-init: hdr.db_iopagesize != IO_PAGESIZE || hdr.db_logpagesize != LOG_PAGESIZE → db_set_page_size, logpb_finalize_pool, dismount, LOG_CS_EXIT, re-logtb_define_trantable_log_latch, then call log_initialize_internal again and return; buffers are rebuilt at the right size (cross-ref §2.8).
10. Compatibility checks (rel_get_disk_compatible, rel_is_log_compatible) goto error on incompatible versions; logtb_define_trantable_log_latch (-1) builds the live trantable; fileio_map_mounted verifies the log belongs to this DB (else undefine trantable + goto error).
11. Recovery dispatch: init_emergency == false && (hdr.is_shutdown == false || ismedia_crash) → prior run crashed → log_recovery. Else clean/emergency boot → logpb_fetch_start_append_page, read EOF record to seed prev_lsa via LOG_RESET_PREV_LSA (&eof->back_lsa), set is_shutdown = false, logpb_flush_header.
12. Prior/append LSA assert + reset (cross-ref §2.5): set rcv_phase = LOG_RESTARTED, then the defensive assert(0) + re-reset if append.prev_lsa / hdr.append_lsa diverge from prior_info; recompute chkpt_every_npages, run_nxchkpt_atpageid, bring up bg-archiving, LOG_CS_EXIT, return.
The error: label dismounts vdes if mounted, free_and_inits loghdr_pgptr, LOG_CS_EXIT, logpb_fatal_error — a failed restart aborts.
2.3 logpb_initialize_pool — the page-buffer ring
The ring lives in a separate global, log_Pb of type LOG_PB_GLOBAL_DATA, not inside log_Gl.

    // log_pb_global_data / log_buffer -- src/transaction/log_page_buffer.c
    struct log_pb_global_data
    {
      LOG_BUFFER *buffers;
      LOG_PAGE *pages_area;
      LOG_BUFFER header_buffer;
      LOG_PAGE *header_page;
      int num_buffers;
      LOGPB_PARTIAL_APPEND partial_append;
    };

    struct log_buffer
    {
      volatile LOG_PAGEID pageid;
      volatile LOG_PHY_PAGEID phy_pageid;
      bool dirty;
      LOG_PAGE *logpage;
    };

LOG_PB_GLOBAL_DATA: buffers (descriptor array), pages_area (one slab of num_buffers * LOG_PAGESIZE), header_buffer/header_page (the page -9 descriptor + backing page), num_buffers, partial_append (record-split-across-flush state, Ch 6). The per-page descriptor LOG_BUFFER:

| Field | Role | Why it exists |
| Field | Role | Why it exists |
|---|---|---|
pageid | Logical id of the resident log-sequence page | NULL_PAGEID = free; lookups key on this. volatile — read without the lock. |
phy_pageid | Physical offset in the active-log file | Translation cache so each flush skips logpb_to_physical_pageid. |
dirty | Page differs from disk | Drives whether a slot is added to toflush[]. |
logpage | Pointer into the shared pages_area slab | Decouples the small descriptor from the LOG_PAGESIZE payload. |
Branch-complete (asserts LOG_CS_OWN_WRITE_MODE):
1. log_append_init_zip (§2.6): compression contexts come up before the ring.
2. If logpb_Initialized, logpb_finalize_pool (re-entrant safety), then assert pages_area == NULL.
3. num_buffers = prm_get_integer_value (PRM_ID_LOG_NBUFFERS).
4. malloc buffers. NULL → er_set + return ER_OUT_OF_VIRTUAL_MEMORY (no pool to unwind).
5. malloc pages_area (num_buffers * LOG_PAGESIZE). NULL → free_and_init (buffers), return.
6. memset slab to LOG_PAGE_INIT_VALUE; loop logpb_initialize_log_buffer (&buffers[i], pages_area + i*LOG_PAGESIZE) wires descriptor i to slab slot i, setting pageid = phy_pageid = NULL_PAGEID, dirty = false, and stamping the page header (logical_pageid = NULL_PAGEID, offset = NULL_OFFSET, flags = 0).
7. malloc header_page (one LOG_PAGESIZE); NULL → free both prior allocations, return. Wired into header_buffer, the resident slot for page -9 (LOGPB_HEADER_PAGE_ID == -9).
8. logpb_initialize_flush_info (§2.4). Error → goto error.
9. partial_append.status = LOGPB_APPENDREC_SUCCESS; its aligned scratch page pointer is set.
10. logpb_Initialized = true; pthread_*_init the chkpt-lsa lock, group-commit cond/mutex, writer_info conds/mutexes; writer_info->is_init = true. Return NO_ERROR.
The error: label runs logpb_finalize_pool then logpb_fatal_error (aborts) — a pool-init failure is fatal, unlike the early malloc returns which merely propagate.
INVARIANT — the descriptor array and the page slab are the same length and freed together. buffers[i].logpage always points at pages_area + i*LOG_PAGESIZE; logpb_locate_page recovers the index by (log_pg - pages_area) / LOG_PAGESIZE and asserts the round-trip. If num_buffers diverged between the two mallocs, that arithmetic indexes out of bounds.
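
The descriptor-to-slab wiring and the reverse index arithmetic can be sketched with standard containers. This is an illustration under stated assumptions: `PAGE_SIZE_SKETCH` is a made-up page size, `pool_sketch` and `locate` are hypothetical names, and the real pool uses raw `malloc` slabs rather than `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t PAGE_SIZE_SKETCH = 4096;   // assumed page size for the example

struct buffer_sketch
{
  long pageid;      // -1 means the slot is free
  bool dirty;
  char *logpage;    // points into the shared slab
};

struct pool_sketch
{
  std::vector<buffer_sketch> buffers;   // descriptor array
  std::vector<char> pages_area;         // one slab of num_buffers pages

  // Both containers are sized from the same num_buffers, so they cannot diverge.
  explicit pool_sketch (std::size_t num_buffers)
    : buffers (num_buffers), pages_area (num_buffers * PAGE_SIZE_SKETCH)
  {
    for (std::size_t i = 0; i < num_buffers; i++)
      {
        buffers[i] = buffer_sketch {-1, false, pages_area.data () + i * PAGE_SIZE_SKETCH};
      }
  }

  // Recovers the descriptor index from a page pointer and checks the round trip.
  std::size_t locate (const char *page) const
  {
    std::size_t index = static_cast<std::size_t> (page - pages_area.data ()) / PAGE_SIZE_SKETCH;
    assert (index < buffers.size () && buffers[index].logpage == page);
    return index;
  }
};
```

The sketch is not safe to copy, because a copied pool would keep pointers into the original slab; it exists only to show why a single `num_buffers` must size both allocations.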
flowchart TD
S["init_zip; finalize_pool if re-entrant"] --> N["num_buffers = PRM_ID_LOG_NBUFFERS"]
N --> B{"malloc buffers?"}
B -- no --> E1["return ER_OUT_OF_VIRTUAL_MEMORY"]
B -- yes --> P{"malloc pages_area?"}
P -- no --> E2["free buffers; return"]
P -- yes --> Hp{"malloc header_page?"}
Hp -- no --> E3["free buffers+pages; return"]
Hp -- yes --> Fi{"init_flush_info?"}
Fi -- no --> Err["goto error: finalize_pool; fatal_error"]
Fi -- yes --> Done["init mutexes; Initialized=true; NO_ERROR"]
Figure 2-2. Branch map of logpb_initialize_pool, every allocation-failure path.
2.4 logpb_initialize_flush_info — the dirty-page roster
LOG_FLUSH_INFO (embedded as log_Gl.flush_info) is the list of append pages a flush must push to disk.

    // log_flush_info -- src/transaction/log_impl.h
    struct log_flush_info
    {
      int max_toflush;
      int num_toflush;
      LOG_PAGE **toflush;
    #if defined(SERVER_MODE)
      pthread_mutex_t flush_mutex;
    #endif
    };

| Field | Role | Why it exists |
|---|---|---|
max_toflush | Capacity, set to num_buffers - 1 | One slot reserved (header flushes separately), so the roster never exceeds num_buffers - 1. |
num_toflush | Live count of staged pages | Reset to 0 here and after each flush. |
toflush | Array of LOG_PAGE* in ascending page-id order | calloc’d to num_buffers pointers; sorted so the writev issues contiguous I/O. |
flush_mutex | (SERVER_MODE) serializes roster mutation | Log-flush thread and committers both touch it. |
logpb_initialize_flush_info: if toflush != NULL it calls logpb_finalize_flush_info first (re-entrant) then asserts toflush == NULL; sets max_toflush = num_buffers - 1, num_toflush = 0, calloc’s toflush to num_buffers pointers (extra slot is harmless slack), er_sets ER_OUT_OF_VIRTUAL_MEMORY on NULL, and pthread_mutex_inits — even on allocation failure, returning the error code the caller treats as goto error. logpb_finalize_flush_info reverses it: if toflush != NULL, lock, free_and_init(toflush), zero counters, unlock, pthread_mutex_destroy; no-op (double-call safe) when already NULL.
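
The re-entrant init and double-call-safe finalize can be sketched without the mutex. This is a simplified model: `flush_info_sketch` and both functions are hypothetical, the roster holds `void *` instead of `LOG_PAGE *`, and the SERVER_MODE locking described above is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

struct flush_info_sketch
{
  int max_toflush = 0;
  int num_toflush = 0;
  void **toflush = nullptr;   // stands in for LOG_PAGE **
};

// Safe to call twice: the second call sees a null roster and does nothing.
inline void finalize_flush_info (flush_info_sketch &fi)
{
  if (fi.toflush == nullptr)
    {
      return;
    }
  std::free (fi.toflush);
  fi.toflush = nullptr;
  fi.max_toflush = 0;
  fi.num_toflush = 0;
}

// Re-entrant: an existing roster is torn down before a new one is allocated.
// Returns false when the allocation fails.
inline bool initialize_flush_info (flush_info_sketch &fi, int num_buffers)
{
  if (fi.toflush != nullptr)
    {
      finalize_flush_info (fi);
    }
  fi.max_toflush = num_buffers - 1;   // one slot fewer than the pool size
  fi.num_toflush = 0;
  // Allocate num_buffers pointers; the extra slot is slack, as in the text.
  fi.toflush = static_cast<void **> (std::calloc (static_cast<std::size_t> (num_buffers),
                                                   sizeof (void *)));
  return fi.toflush != nullptr;
}
```

Calling `initialize_flush_info` on an already initialized roster therefore frees the old array first instead of leaking it.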
2.5 prior_lsa_info constructor — seeding the prior list
LOG_PRIOR_LSA_INFO heads the in-memory prior list (the staging area between a caller’s append request and the page buffer; Ch 3-5).

    // log_prior_lsa_info -- src/transaction/log_append.hpp
    struct log_prior_lsa_info
    {
      LOG_LSA prior_lsa;
      LOG_LSA prev_lsa;
      LOG_PRIOR_NODE *prior_list_header;
      LOG_PRIOR_NODE *prior_list_tail;
      INT64 list_size;
      LOG_PRIOR_NODE *prior_flush_list_header;
      std::mutex prior_lsa_mutex;
      log_prior_lsa_info ();
    };

| Field | Role | Why it exists |
|---|---|---|
| prior_lsa | LSA the next appended node will receive | Advancing it under the mutex issues LSAs in monotonic order without touching disk. |
| prev_lsa | LSA of the previously appended node | Lets each new node store a back_lsa for backward chaining / undo. |
| prior_list_header / prior_list_tail | FIFO head (drain consumes) / tail (O(1) append) | Drain (Ch 5) reads head; new nodes splice at tail. |
| list_size | Queued byte count | Lets the drainer/flusher decide when to push. |
| prior_flush_list_header | Sub-list already promoted toward flush | Separates “appended” from “being flushed”. |
| prior_lsa_mutex | The hot lock of the whole subsystem | Every LSA assignment serializes here. |
The ctor seeds everything empty; the real LSA seed is deferred to log_initialize_internal, which copies recovered header LSAs into both append and prior_info:

    // log_prior_lsa_info ctor / LOG_RESET_*_LSA -- src/transaction/log_append.cpp
    log_prior_lsa_info::log_prior_lsa_info ()   // every member: NULL_LSA / NULL / 0 / default mutex
      : prior_lsa (NULL_LSA), prev_lsa (NULL_LSA), prior_list_header (NULL), prior_list_tail (NULL)
      , list_size (0), prior_flush_list_header (NULL), prior_lsa_mutex ()
    { }

    void LOG_RESET_APPEND_LSA (const LOG_LSA *lsa)   // header drives prior_lsa
    { log_Gl.hdr.append_lsa = *lsa; log_Gl.prior_info.prior_lsa = *lsa; }

    void LOG_RESET_PREV_LSA (const LOG_LSA *lsa)
    { log_Gl.append.prev_lsa = *lsa; log_Gl.prior_info.prev_lsa = *lsa; }

INVARIANT — prior_info.prior_lsa == hdr.append_lsa and prior_info.prev_lsa == append.prev_lsa at end of init. log_initialize_internal assert(0)s and re-resets on divergence (if (!LSA_EQ (&log_Gl.hdr.append_lsa, &log_Gl.prior_info.prior_lsa)) { assert (0); LOG_RESET_APPEND_LSA (...); } and the symmetric prev_lsa check). If it drifted, the first appended record would get an LSA disagreeing with where the cursor writes, corrupting the back-chain.
2.6 log_append_init_zip / log_append_final_zip — compression contexts
LOG_ZIP is the (de)compression scratch buffer: struct log_zip { LOG_ZIP_SIZE_T data_length = 0; LOG_ZIP_SIZE_T buf_size = 0; char *log_data = nullptr; }; (log_compress.h).

| Field | Role | Why it exists |
|---|---|---|
| data_length | Bytes currently held | Result length after log_zip/log_unzip. |
| buf_size | Capacity of log_data | log_zip_realloc_if_needed grows it; avoids re-malloc per record. |
| log_data | The (de)compression buffer | Holds LZ4 output; log_zip_alloc(IO_PAGESIZE) sizes it. |
log_append_init_zip branches on mode and PRM_ID_LOG_COMPRESS:
- Compression disabled → log_Zip_support = false, return.
- SERVER_MODE: log_Zip_support = true; the buffers are per-thread, allocated lazily on first use — log_append_get_zip_undo/_redo do if (thread_p->log_zip_undo == NULL) thread_p->log_zip_undo = log_zip_alloc (IO_PAGESIZE);.
- SA-mode: allocate two process-global statics log_zip_undo/log_zip_redo plus a log_data_ptr scratch of IO_PAGESIZE * 2. If any is NULL → log_Zip_support = false and free whichever allocated (each under its own if). Else log_Zip_support = true.

log_append_final_zip mirrors it: if !log_Zip_support return; under SERVER_MODE nothing (per-thread buffers die with the thread entry); in SA-mode frees log_zip_undo/log_zip_redo/log_data_ptr. It runs from logpb_finalize_pool (§2.7), so zip teardown is tied to pool teardown.
INVARIANT — log_Zip_support is the single gate. All callers gate on it, never on the individual buffer pointers; init sets it false on any partial allocation failure so a half-allocated context is never used.
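The all-or-nothing rule behind this invariant can be sketched for the stand-alone case as follows. Everything here is a hypothetical simplification (`ZipCtx`, `ZipState`, `zip_alloc`, `zip_free`, `zip_init`, `zip_final` are illustrative names, and the log_data_ptr scratch is left out); it only shows how a single flag can stand in for the state of several allocations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical scratch context; field names follow the table above.
struct ZipCtx
{
  std::size_t data_length = 0;
  std::size_t buf_size = 0;
  char *log_data = nullptr;
};

struct ZipState
{
  bool support = false;   // the single gate callers consult
  ZipCtx *undo = nullptr;
  ZipCtx *redo = nullptr;
};

ZipCtx *zip_alloc (std::size_t size)
{
  ZipCtx *z = new (std::nothrow) ZipCtx ();
  if (z == nullptr)
    {
      return nullptr;
    }
  z->log_data = static_cast<char *> (std::malloc (size));
  if (z->log_data == nullptr)
    {
      delete z;
      return nullptr;
    }
  z->buf_size = size;
  return z;
}

void zip_free (ZipCtx *&z)
{
  if (z != nullptr)
    {
      std::free (z->log_data);
      delete z;
      z = nullptr;
    }
}

// Either both contexts exist and support is true, or neither exists and it is false.
void zip_init (ZipState &s, bool compress_enabled, std::size_t page_size)
{
  assert (s.undo == nullptr && s.redo == nullptr);   // sketch assumes a fresh state
  s.support = false;
  if (!compress_enabled)
    {
      return;
    }
  s.undo = zip_alloc (page_size);
  s.redo = zip_alloc (page_size);
  if (s.undo == nullptr || s.redo == nullptr)
    {
      zip_free (s.undo);   // free whichever one did allocate
      zip_free (s.redo);
      return;
    }
  s.support = true;
}

void zip_final (ZipState &s)
{
  if (!s.support)
    {
      return;   // nothing was kept, so nothing to free
    }
  zip_free (s.undo);
  zip_free (s.redo);
  s.support = false;
}
```

Because teardown checks only the flag, a partially failed init must leave no buffers behind, which is why the failure branch frees both sides before returning.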
2.7 Teardown: log_final and logpb_finalize_pool
log_final is both the orderly shutdown and the re-entrancy guard that create/init call up front. Branch-complete:
- Destroy server daemons and system transactions; LOG_CS_ENTER; reset rcv_phase.
- trantable.area == NULL → nothing initialized; exit.
- Else !logpb_is_pool_initialized() → only trantable; logtb_undefine_trantable, exit.
- Else append.vdes == NULL_VOLDES → pool but no volume; logpb_finalize_pool + logtb_undefine_trantable, exit.
- Else abort every active transaction (log_abort), tracking anyloose_ends; flush to disk (logpb_flush_pages_direct + pgbuf_flush_all + fileio_synchronize_all).
- Header branch: if !anyloose_ends && error_code == NO_ERROR, set hdr.is_shutdown = true and snap chkpt_lsa = append_lsa (clean, so restart skips recovery). Else logpb_checkpoint.
- logpb_flush_header, logpb_finalize_pool, logtb_undefine_trantable, dismount bg-archive + active volumes, free_and_init(loghdr_pgptr), LOG_CS_EXIT.

logpb_finalize_pool (from log_final and the create/init error paths) is idempotent — returns if !logpb_Initialized. Otherwise it reverses bring-up exactly: clear the append cursor (log_pgptr = NULL, nxio_lsa/prev_lsa = NULL_LSA, mirrored into prior_info), free_and_init buffers/pages_area/header_page, num_buffers = 0, logpb_Initialized = false, logpb_finalize_flush_info (§2.4), destroy chkpt + group-commit locks, finalize writer info, and finally log_append_final_zip (§2.6) — zip freed last, mirroring init’s zip-first, so no in-flight append (touching a per-thread LOG_ZIP) outlives its buffers.
2.8 The LOGAREA_SIZE / LOG_PAGESIZE relationship
A LOG_PAGE is LOG_PAGESIZE bytes (db_Log_page_size in storage_common.h). The first SSIZEOF(LOG_HDRPAGE) bytes are the page header; the rest is the record area: #define LOGAREA_SIZE (LOG_PAGESIZE - SSIZEOF(LOG_HDRPAGE)) (log_impl.h).
This constant constrains all record placement. The append macros (LOG_APPEND_ALIGN, LOG_APPEND_ADVANCE_WHEN_DOESNOT_FIT) compare append_lsa.offset against LOGAREA_SIZE and call logpb_next_append_page on overflow; LOG_PRIOR_LSA_LAST_APPEND_OFFSET() likewise returns LOGAREA_SIZE, so the prior-list and page-buffer sides agree on where a page ends (page-crossing is Chapter 6). At init the point is: geometry is fixed from the header’s db_logpagesize, validated against the running LOG_PAGESIZE.
INVARIANT — db_logpagesize must equal the running LOG_PAGESIZE. As traced in §2.2b step 9, log_initialize_internal checks db_iopagesize != IO_PAGESIZE || db_logpagesize != LOG_PAGESIZE; on mismatch it db_set_page_sizes, finalizes the pool, dismounts, and recursively re-enters itself so buffers are reallocated at the correct size. Otherwise LOGAREA_SIZE is computed against the wrong page size and records straddle physical page boundaries.
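A small worked example of the geometry may help. The sizes below are made up for illustration (the real values come from db_Log_page_size and sizeof (LOG_HDRPAGE)), `Cursor` and `advance_when_doesnot_fit` are hypothetical names, and the `>=` comparison is an assumption of this sketch about how the fit test treats a record that would end exactly at the area boundary.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sizes only; not taken from a real CUBRID build.
constexpr std::int64_t kLogPageSize = 16384;
constexpr std::int64_t kHdrPageSize = 16;
constexpr std::int64_t kLogAreaSize = kLogPageSize - kHdrPageSize;   // record area per page

struct Cursor
{
  std::int64_t pageid;
  std::int64_t offset;   // byte offset inside the record area
};

// If `length` bytes do not fit in the rest of the record area,
// move the cursor to offset 0 of the next logical page.
void advance_when_doesnot_fit (Cursor &c, std::int64_t length)
{
  assert (length >= 0);
  if (c.offset + length >= kLogAreaSize)
    {
      c.pageid++;
      c.offset = 0;
    }
}
```

With these numbers the record area is 16368 bytes: a 100-byte item at offset 16000 stays on the page, while a 368-byte item at the same offset moves the cursor to the next page. If the cursor side and the page-buffer side computed the area from different page sizes, they would disagree on exactly this decision.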
2.9 Chapter summary — key takeaways
- Two entry points, different lifetimes. log_create_internal formats, writes page-9, finalizes the pool (no live state); log_initialize_internal mounts, reads page-9, keeps the pool live, runs recovery.
- Restart has a richer branch tree (§2.2b): fatal log-names path, global loghdr_pgptr malloc, fileio_mount NULL_VOLDES split (media-crash header synthesis with LOGPAGEID_MAX vs ER_IO_MOUNT_FAIL), logpb_fetch_header, restore_slave copy-out, tolerated prefix mismatch, recursive page-size re-init, recovery-vs-clean dispatch.
- Two globals: log_Gl (append/prior/header/flush) vs the separate ring log_Pb; flush_info.toflush[] and append.log_pgptr point into log_Pb.
- The ring is two parallel allocations — LOG_BUFFER[] + one pages_area slab; descriptor i ↔ slab slot i, recovered by pointer arithmetic. Flush capacity is num_buffers - 1.
- The prior list starts empty, LSA-seeded from the header via LOG_RESET_APPEND_LSA/LOG_RESET_PREV_LSA into both append and prior_info; init asserts they agree.
- Compression is mode-split: SA-mode process-global LOG_ZIP statics, server-mode per-thread lazy; log_Zip_support is the single gate, false on any partial failure.
- Teardown reverses bring-up, freeing flush-info and zip last; log_final’s is_shutdown = true branch lets the next boot skip recovery.

Chapter 3: Building a Prior Node from a Caller Request
When the engine modifies a page it calls a log_append_* API. Before
the change can reach disk it must become a prior node — a
heap-allocated LOG_PRIOR_NODE carrying a fully formed log record. This
chapter answers: given a caller tuple (rcvindex, addr, undo_data, redo_data), how is a complete LOG_PRIOR_NODE built before it is handed
an LSA?
The defining property of this phase: it runs entirely outside the
prior-list mutex — allocation, header sizing, payload copying, and
compression all happen on the caller’s memory. Only once the node is
finished does Chapter 4’s prior_lsa_next_record take prior_lsa_mutex,
stamp the LSA, and splice it in (the companion’s
single-writer pipeline).
3.1 The append API surface — thin wrappers over crumbs
The entry points (log_append_undoredo_data, log_append_undo_data,
log_append_redo_data, plus the *2 and *_recdes variants) package
the caller’s contiguous buffer into one LOG_CRUMB and delegate to the
crumbs API.

    // log_append_undoredo_data -- src/transaction/log_manager.c
    LOG_CRUMB undo_crumb, redo_crumb;
    assert (0 == undo_length || undo_data != NULL);   /* <- non-zero length requires non-NULL data */
    undo_crumb.data = undo_data;
    undo_crumb.length = undo_length;
    // ... redo_crumb the same ...
    log_append_undoredo_crumbs (thread_p, rcvindex, addr, 1, 1, &undo_crumb, &redo_crumb);

    // inside log_append_undoredo_crumbs: type from rcvindex alone:
    LOG_RECTYPE rectype = LOG_IS_MVCC_OPERATION (rcvindex) ? LOG_MVCC_UNDOREDO_DATA : LOG_UNDOREDO_DATA;

A LOG_CRUMB is a (length, data) pair. The *2 variants synthesize
a LOG_DATA_ADDR from (vfid, pgptr, offset); _recdes variants wrap
a RECDES. LOG_IS_MVCC_OPERATION is true for MVCC heap/btree ops and
RVES_NOTIFY_VACUUM; the undo-only path picks LOG_(MVCC_)UNDO_DATA,
the redo-only path *REDO*. This rectype is the switch key for every
sizing decision downstream. Before construction, log_append_*_crumbs
runs a guard chain (Figure 3-1); each guard is a distinct early return,
so the node is built only after all five pass:
flowchart TB
B{"log_No_logging?"} -- yes --> B1["log_skip_logging; return"]
B -- no --> D{"LOG_FIND_TDES == NULL?"}
D -- yes --> D1["ER_LOG_UNKNOWN_TRANINDEX; return"]
D -- no --> E{"not sysop AND not active AND not aborted?"}
E -- yes --> E1["return, log nothing"]
E -- no --> F{"log_can_skip_undo_logging?"}
F -- yes --> F1["append redo crumbs only; return"]
F -- no --> G["prior_lsa_alloc_and_copy_crumbs"]
G --> H{"node == NULL?"} -- yes --> H1["return"]
H -- no --> I["TDE encrypt; prior_lsa_next_record (Ch 4)"]
Figure 3-1 — Guard chain of log_append_undoredo_crumbs. When undo is
skippable it degenerates to a redo-only append. log_append_undo_crumbs
skips silently (no redo fallback); log_append_redo_crumbs uses
log_can_skip_redo_logging.
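The guard chain in Figure 3-1 amounts to a sequence of early returns. The sketch below models it with plain booleans so the ordering is easy to see; `Request`, `Outcome`, and `append_undoredo` are hypothetical names, and each flag stands in for the real check named in the figure rather than reproducing it.

```cpp
#include <cassert>

// What happened to the append request (illustrative labels only).
enum class Outcome
{
  SkippedNoLogging,
  UnknownTransaction,
  InactiveTransaction,
  RedoOnly,
  AllocFailed,
  Appended
};

// Each flag stands in for one guard in Figure 3-1.
struct Request
{
  bool no_logging;      // log_No_logging
  bool tdes_found;      // LOG_FIND_TDES returned non-NULL
  bool is_sysop;
  bool is_active;
  bool is_aborted;
  bool can_skip_undo;   // log_can_skip_undo_logging
  bool alloc_ok;        // node allocation succeeded
};

Outcome append_undoredo (const Request &r)
{
  if (r.no_logging)
    {
      return Outcome::SkippedNoLogging;
    }
  if (!r.tdes_found)
    {
      return Outcome::UnknownTransaction;
    }
  if (!r.is_sysop && !r.is_active && !r.is_aborted)
    {
      return Outcome::InactiveTransaction;   // log nothing
    }
  if (r.can_skip_undo)
    {
      return Outcome::RedoOnly;   // degenerates to a redo-only append
    }
  if (!r.alloc_ok)
    {
      return Outcome::AllocFailed;
    }
  return Outcome::Appended;   // node goes on to TDE handling and LSA assignment
}
```

The order matters: the transaction-state guard runs before the undo-skip test, so an inactive, non-sysop transaction logs nothing even when undo would have been skippable.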
3.2 LOG_PRIOR_NODE — the construction target
    // struct log_prior_node -- src/transaction/log_append.hpp
    struct log_prior_node
    {
      LOG_RECORD_HEADER log_header;
      LOG_LSA start_lsa;   /* for assertion */
      bool tde_encrypted;
      int data_header_length;
      char *data_header;
      int ulength;
      char *udata;
      int rlength;
      char *rdata;
      LOG_PRIOR_NODE *next;
    };

| Field | Role | Why it exists |
|---|---|---|
| log_header | Only .type set here. | Record identity / switch key. LSA links filled in Ch 4 under the mutex. |
| start_lsa | Eventual LSA. Unset here — /* for assertion */. | Assigned by prior_lsa_next_record (Ch 4); read only by MVCC vacuum-header assertions. The node has no log position yet, so reading it during construction is a bug. |
| tde_encrypted | Whether the log page must be TDE-encrypted. | false at alloc, raised by prior_set_tde_encrypted; drives page-boundary encryption (Ch 6). |
| data_header_length / data_header | Byte size + separate malloc holding the filled LOG_REC_*. | From rectype via sizeof(LOG_REC_*); separate buffer lets the drain (Ch 5) copy header then data independently. |
| ulength / udata | Stored undo byte count (the zipped flag lives in the data header's ulength, not here) + heap copy of undo bytes. | Node must own its payload; caller’s buffer may be freed after return. Drain copies exactly ulength bytes. |
| rlength / rdata | As above, for redo. | Redo payload ownership and length. |
| next | List pointer. | NULL here; Ch 4 sets it on append to prior_list_tail. |
A finished node is built from independent mallocs in three groups: the node
itself, data_header (a LOG_REC_UNDOREDO or LOG_REC_MVCC_UNDOREDO), and one
copy per payload side. That makes it self-owned; next, start_lsa, and the
log_header LSA links stay blank until Ch 4.
Invariant — the node owns its payload by value. udata/rdata are
always freshly malloc’d copies (the copiers always memcpy), never
aliases of the caller’s buffers. If violated, the asynchronous drain in
Ch 5 could read freed memory.
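The copy-by-value rule can be illustrated with a crumb copier along these lines. It is a sketch, not the CUBRID copier: `Crumb` and `copy_crumbs` are hypothetical names, and error reporting is reduced to a boolean.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// A (length, data) pair, in the spirit of LOG_CRUMB.
struct Crumb
{
  int length;
  const void *data;
};

// Concatenate all crumbs into one freshly malloc'd buffer.
// On success *out owns the bytes (release with std::free) and *out_len is the total.
// Returns false only if the allocation fails.
bool copy_crumbs (int num_crumbs, const Crumb *crumbs, char **out, int *out_len)
{
  assert (out != nullptr && out_len != nullptr);
  assert (crumbs != nullptr || num_crumbs == 0);
  *out = nullptr;
  *out_len = 0;

  int total = 0;
  for (int i = 0; i < num_crumbs; i++)
    {
      total += crumbs[i].length;
    }
  if (total == 0)
    {
      return true;   // nothing to copy; the node keeps a NULL payload
    }

  char *buf = static_cast<char *> (std::malloc (static_cast<std::size_t> (total)));
  if (buf == nullptr)
    {
      return false;
    }
  char *p = buf;
  for (int i = 0; i < num_crumbs; i++)
    {
      if (crumbs[i].length > 0)
        {
          std::memcpy (p, crumbs[i].data, static_cast<std::size_t> (crumbs[i].length));
          p += crumbs[i].length;
        }
    }
  *out = buf;
  *out_len = total;
  return true;
}
```

Once the copy exists, the caller can overwrite or free its own buffers without affecting what a later drain would read.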
3.3 Allocation dispatch — prior_lsa_alloc_and_copy_crumbs
prior_lsa_alloc_and_copy_crumbs mallocs the node, zeroes every
construction field, sets log_header.type, then dispatches:

    // prior_lsa_alloc_and_copy_crumbs -- src/transaction/log_append.cpp
    node->log_header.type = rec_type;
    node->tde_encrypted = false;
    /* ... all payload fields zeroed ... */
    switch (rec_type)
      {
      case LOG_UNDOREDO_DATA: ... case LOG_MVCC_REDO_DATA:   /* all 8 undo/redo families */
        error = prior_lsa_gen_undoredo_record_from_crumbs (thread_p, node, rcvindex, addr, ...);
        break;
      default:
        assert_release (false);
        error = ER_FAILED;
        break;   /* <- crumbs path is undo/redo only */
      }

On error it frees data_header, udata, rdata, then the node, and
returns NULL — the caller (§3.1) treats NULL as “give up silently.”
The sibling prior_lsa_alloc_and_copy_data handles non-crumb
families (postpone, compensate, commit, sysop, 2PC): its switch routes
undo/redo cases to assert_release(false) and the rest to
prior_lsa_gen_record, prior_lsa_gen_postpone_record, etc. — so the
two allocators partition the type space: crumbs for undo/redo data,
plain copy for control records.
prior_lsa_gen_record is the plain-copy builder Chapters 8–10 lean on
for commit/abort/sysop nodes. It does no compression and no MVCC
stamping — only sizes, allocates, and copies an optional undo blob;
the header contents are filled by the caller. Its three branches:
| Branch | Effect |
|---|---|
| switch (rec_type) | Dummy/decision records (LOG_DUMMY_HEAD_POSTPONE, LOG_2PC_*_DECISION, LOG_START_CHKPT, LOG_SYSOP_ATOMIC_START) assert length==0 && data==NULL and leave data_header_length == 0; control records set data_header_length = sizeof(LOG_REC_*) (e.g. LOG_COMMIT/LOG_ABORT → LOG_REC_DONETIME, LOG_SYSOP_END → LOG_REC_SYSOP_END); default leaves it 0. |
| if (data_header_length > 0) | Mallocs the header (memset in debug builds); on failure raises ER_OUT_OF_VIRTUAL_MEMORY and returns immediately — no udata copy attempted. |
| if (length > 0) | Copies the optional undo blob via prior_lsa_copy_undo_data_to_node, propagating its error code; otherwise returns NO_ERROR. |
3.4 prior_lsa_gen_undoredo_record_from_crumbs — the core builder
The builder runs four phases (Figure 3-2). It sums the crumb lengths,
fetches the per-side zip scratch (log_append_get_zip_undo/_redo), and
sets type-shaped flags: a LOG_IS_UNDOREDO_RECORD_TYPE sets has_undo +
has_redo and needs both scratches (or a zero-length side); a
LOG_IS_REDO_RECORD_TYPE sets has_redo, needs zip_redo; otherwise
UNDO needs zip_undo — all &&-gated by log_Zip_support into can_zip.
It then (optionally) compresses (§3.5), sizes and mallocs the typed
header, aims local pointers at its sub-fields, fills the shared
LOG_DATA, and copies the payloads. Pointer aiming uses a fall-through
switch: each MVCC arm grabs its extra mvccid_p/vacuum_info_p, then
[[fallthrough]] into the non-MVCC arm for the shared length/data
pointers — UNDO sets ulength_p only, REDO rlength_p only, UNDOREDO
both:

    // prior_lsa_gen_undoredo_record_from_crumbs -- src/transaction/log_append.cpp
    case LOG_MVCC_UNDOREDO_DATA:
    case LOG_MVCC_DIFF_UNDOREDO_DATA:   /* MVCC arm: extra ptrs, then fall through */
      vacuum_info_p = &mvcc_undoredo_p->vacuum_info;
      mvccid_p = &mvcc_undoredo_p->mvccid;
      [[fallthrough]];
    case LOG_UNDOREDO_DATA:
    case LOG_DIFF_UNDOREDO_DATA:   /* shared: aim both length ptrs + log_data_p */
      data_header_ulength_p = &undoredo_p->ulength;
      ...
      log_data_p = &undoredo_p->data;
      break;

The shared LOG_DATA is filled from addr: rcvindex, offset, and
(pageid, volid) via pgbuf_get_vpid_ptr — or NULL_PAGEID/NULL_VOLID
when addr->pgptr == NULL (logical logging).
flowchart TB
M["Phase 1: sum lengths, get zip scratch, compute has_undo/has_redo/can_zip"] --> Z{"can_zip AND\nsome side >= thr?"}
Z -- yes --> ZB["Phase 2: log_diff + log_zip; if redo zipped rewrite type to *_DIFF_*"]
Z -- no --> HSZ["Phase 3a: size header by type"]
ZB --> HSZ
HSZ --> MAL{"malloc data_header OK?"}
MAL -- no --> ERR["ER_OUT_OF_VIRTUAL_MEMORY; goto error"]
MAL -- yes --> PTR["Phase 3b: aim ptrs via fall-through switch, fill LOG_DATA, stamp MVCCID/vacuum if set"]
PTR --> CP["Phase 4: copy udata/rdata (zipped or raw)"]
CP --> RET["return NO_ERROR"]
ERR --> RETE["return error_code"]
Figure 3-2 — Control flow of prior_lsa_gen_undoredo_record_from_crumbs.
Every branch reaches return NO_ERROR or the error: label, which
frees data_header/udata/rdata.
3.5 The compression branch — boundary is the node, not the page
CUBRID compresses per record (per prior node), never per log page —
which is why it lives in construction, before any LSA or page is
assigned: the compressed bytes are sized into ulength/rlength and
copied into the node, so Ch 6’s page-boundary logic never sees
uncompressed data. Two globals gate it; scratch is a per-side LOG_ZIP:

    // src/transaction/log_append.cpp ; src/transaction/log_compress.h
    bool log_Zip_support = false;             /* <- master toggle, from prm */
    int log_Zip_min_size_to_compress = 255;   /* <- per-side threshold (bytes) */
    struct log_zip
    {
      LOG_ZIP_SIZE_T data_length = 0;
      LOG_ZIP_SIZE_T buf_size = 0;
      char *log_data = nullptr;
    };

LOG_ZIP holds one result, all three fields: log_data is the output
buffer (prior_lsa_copy_*_data_to_node memcpys from it), data_length
its produced length (what MAKE_ZIP_LEN wraps into the header; raw if it
did not shrink), buf_size its log_zip_alloc-set capacity (IO_PAGESIZE
+ LZ4 bound) so it is not reallocated per record. Scratch comes from
log_append_get_zip_undo/_redo: per-thread in SERVER_MODE (thread_p->log_zip_undo, lazily log_zip_alloc’d), file-static singletons stand-alone. If thread_p is NULL and unresolvable via thread_get_thread_entry_info, the getter returns NULL, forcing can_zip false for that side via the zip_* != NULL clause.
The compression block and the length-stamping copy run as one unit:

    // prior_lsa_gen_undoredo_record_from_crumbs -- src/transaction/log_append.cpp
    // (thr below abbreviates log_Zip_min_size_to_compress)
    if (can_zip && (ulength >= log_Zip_min_size_to_compress || rlength >= log_Zip_min_size_to_compress))
      {
        if (ulength >= thr && rlength >= thr)
          {
            (void) log_diff (ulength, undo_data, rlength, redo_data);   /* <- redo diffed against undo */
            is_undo_zip = log_zip (zip_undo, ulength, undo_data);
            is_redo_zip = log_zip (zip_redo, rlength, redo_data);
            if (is_redo_zip)
              is_diff = true;
          }
        else
          {
            if (ulength >= thr)
              is_undo_zip = log_zip (zip_undo, ulength, undo_data);
            if (rlength >= thr)
              is_redo_zip = log_zip (zip_redo, rlength, redo_data);
          }
      }
    if (is_diff)
      node->log_header.type = is_mvcc_op ? LOG_MVCC_DIFF_UNDOREDO_DATA : LOG_DIFF_UNDOREDO_DATA;
    // ... after header sized/aimed, undo arm (redo symmetric): ...
    if (is_undo_zip)
      {
        *data_header_ulength_p = MAKE_ZIP_LEN (zip_undo->data_length);   /* <- sets 0x80000000 */
        error_code = prior_lsa_copy_undo_data_to_node (node, zip_undo->data_length, (char *) zip_undo->log_data);
      }
    else if (has_undo)
      {
        *data_header_ulength_p = ulength;
        error_code = prior_lsa_copy_undo_crumbs_to_node (node, num_ucrumbs, ucrumbs);
      }

Four outcomes: neither side over threshold (skipped); both large
(log_diff rewrites redo as its difference from undo, then both zip,
flipping the type to *_DIFF_* if redo zipped); only one large (that
side zips, no diff); log_zip returns false (copied raw).
MAKE_ZIP_LEN(len) is len | 0x80000000; recovery strips it via
GET_ZIP_LEN/ZIP_CHECK.
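The threshold logic behind these outcomes can be isolated as a pure decision function. This is a sketch under stated assumptions: `ZipPlan` and `plan_compression` are hypothetical names, and the function only says which sides would be attempted. Whether an attempt actually produces a zipped payload still depends on what log_zip returns.

```cpp
#include <cassert>

// Which compression steps would be attempted for one record.
struct ZipPlan
{
  bool diff_first = false;   // run log_diff before zipping (both sides large)
  bool zip_undo = false;
  bool zip_redo = false;
};

ZipPlan plan_compression (bool can_zip, int ulength, int rlength, int threshold)
{
  ZipPlan plan;
  const bool undo_large = ulength >= threshold;
  const bool redo_large = rlength >= threshold;

  if (!can_zip || (!undo_large && !redo_large))
    {
      return plan;   // compression skipped entirely
    }
  if (undo_large && redo_large)
    {
      plan.diff_first = true;
      plan.zip_undo = true;
      plan.zip_redo = true;
    }
  else
    {
      plan.zip_undo = undo_large;   // only the large side is attempted, no diff
      plan.zip_redo = redo_large;
    }
  return plan;
}
```

With the 255-byte threshold quoted above, a record with 300 bytes of undo and 10 bytes of redo would attempt undo compression only, while 300 bytes on both sides would diff first and then attempt both.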
Invariant — header length encodes compression state. Whether a side
is zipped is recorded only in the sign bit of the header length field;
a zipped payload written without MAKE_ZIP_LEN would feed compressed
bytes straight to recovery and corrupt the page. Pairing is_*_zip with
MAKE_ZIP_LEN in the same arm guarantees they never diverge.
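The length encoding can be sketched with three small helpers. They are written here as hypothetical functions (`make_zip_len`, `get_zip_len`, `zip_check`, `stored_length`) over an unsigned 32-bit value to keep the bit operations well defined; the text above describes the real field as an int whose sign bit carries the flag.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kZipFlag = 0x80000000u;   // high bit marks a zipped payload

// Tag a length as "zipped".
constexpr std::uint32_t make_zip_len (std::uint32_t len)
{
  return len | kZipFlag;
}

// Recover the byte count from a stored length, zipped or not.
constexpr std::uint32_t get_zip_len (std::uint32_t stored)
{
  return stored & ~kZipFlag;
}

// True when the stored length carries the zipped flag.
constexpr bool zip_check (std::uint32_t stored)
{
  return (stored & kZipFlag) != 0;
}

// Pair the flag with the decision in one place so the two cannot diverge.
std::uint32_t stored_length (bool zipped, std::uint32_t len)
{
  assert ((len & kZipFlag) == 0);   // a raw length must not already use the flag bit
  return zipped ? make_zip_len (len) : len;
}
```

For a 300-byte zipped payload the stored value is 0x8000012C; a reader first tests the flag and then masks it off to get 300 back.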
The copier prior_lsa_copy_undo_data_to_node (_redo_ mirrors it)
mallocs length bytes (returns NO_ERROR for length <= 0 || data == NULL, ER_OUT_OF_VIRTUAL_MEMORY on failure), memcpys, and sets
node->ulength; the crumb copiers malloc once then memcpy each crumb.
Either way node->ulength/rlength holds the stored length.
3.6 Stamping MVCC identity
For MVCC types the pointer switch left mvccid_p/vacuum_info_p
non-NULL, so two extra fills run. The MVCCID comes from the current
TDES, preferring the innermost sub-transaction id:

    // prior_lsa_gen_undoredo_record_from_crumbs -- src/transaction/log_append.cpp
    if (mvccid_p != NULL)
      {
        tdes = LOG_FIND_CURRENT_TDES (thread_p);
        if (tdes == NULL || !MVCCID_IS_VALID (tdes->mvccinfo.id))
          {
            assert_release (false);
            error_code = ER_FAILED;
            goto error;   /* <- MVCC op needs an MVCCID */
          }
        else if (!tdes->mvccinfo.sub_ids.empty ())
          *mvccid_p = tdes->mvccinfo.sub_ids.back ();   /* nested sysop */
        else
          *mvccid_p = tdes->mvccinfo.id;
      }

vacuum_info_p gets the file id (addr->vfid, or NULL for
RVES_NOTIFY_VACUUM, else assert_release(false)), and
prev_mvcc_op_log_lsa is set NULL — completed later in Ch 4’s
prior_lsa_next_record_internal, which links the record into the vacuum
chain once the LSA is known. These two fields, plus start_lsa, are the
only ones here depending on transaction/log state, not the caller tuple.
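The MVCCID selection rule is small enough to isolate. In this sketch `MvccInfo`, `kNullMvccId`, and `pick_mvccid` are hypothetical stand-ins: the validity test is reduced to a comparison against a made-up null value, and a null pointer stands for a missing transaction descriptor.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef std::uint64_t MvccId;
const MvccId kNullMvccId = 0;   // hypothetical "invalid" marker for this sketch

// Reduced view of the per-transaction MVCC state.
struct MvccInfo
{
  MvccId id = kNullMvccId;
  std::vector<MvccId> sub_ids;   // ids of nested sub-transactions, innermost last
};

// Choose the id to stamp into the record.
// Returns false when there is no valid id, which the builder treats as an error.
bool pick_mvccid (const MvccInfo *info, MvccId *out)
{
  assert (out != nullptr);
  if (info == nullptr || info->id == kNullMvccId)
    {
      return false;
    }
  *out = info->sub_ids.empty () ? info->id : info->sub_ids.back ();
  return true;
}
```

A transaction with id 100 and sub-ids 101 and 102 would stamp 102, the innermost one; once the sub-id list is empty the record carries 100.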
The two record layouts:

    // struct log_rec_undoredo / log_rec_mvcc_undoredo -- src/transaction/log_record.hpp
    struct log_rec_undoredo
    {
      LOG_DATA data;
      int ulength;
      int rlength;
    };
    struct log_rec_mvcc_undoredo
    {
      LOG_REC_UNDOREDO undoredo;
      MVCCID mvccid;
      LOG_VACUUM_INFO vacuum_info;
    };

Every field: data (the LOG_DATA triple rcvindex/pageid/offset
plus volid) is where recovery dispatches and locates the bytes;
ulength/rlength are the stored lengths (high bit = zipped). The MVCC
variant embeds undoredo so non-MVCC readers share code, then adds
mvccid (the writer’s id, for vacuum and visibility) and vacuum_info
(prev_mvcc_op_log_lsa back-link + owning vfid; back-link filled Ch 4).
3.7 Chapter summary — key takeaways
- The public log_append_* APIs are thin — wrap the buffer in a LOG_CRUMB, delegate to log_append_*_crumbs, which pick the LOG_RECTYPE from rcvindex and run a five-guard chain first.
- Construction is lock-free and self-owning — all work outside prior_lsa_mutex; the node owns three mallocs (node, data_header, payload copies) so the async drain never touches caller memory.
- Two allocators partition the type space — ..._crumbs → prior_lsa_gen_undoredo_record_from_crumbs for undo/redo data; ..._data → prior_lsa_gen_record for control records, whose three branches size the header (0 for dummy/decision types), malloc with an ER_OUT_OF_VIRTUAL_MEMORY bail, and copy an optional undo blob.
- The core builder runs measure → compress → size+fill the typed header → copy payloads, a [[fallthrough]] switch sharing the non-MVCC layout across the UNDO/REDO/UNDOREDO shapes.
- Compression is per-node, not per-page — gated by log_Zip_support and the 255-byte threshold with per-thread LOG_ZIP scratch (NULL thread_p ⇒ no compression); both-sides-large triggers log_diff and may rewrite the type to *_DIFF_*. The zipped/raw choice is recorded only in the length’s high bit via MAKE_ZIP_LEN.
- MVCC records get MVCCID + vacuum info from the TDES (sub-id preferred); prev_mvcc_op_log_lsa and start_lsa stay NULL/blank until the LSA is assigned in Chapter 4 — reading start_lsa during construction is a bug.

Chapter 4: LSA Assignment and Attach to the Prior List
Chapter 3 left us holding a fully formed LOG_PRIOR_NODE whose payload is populated but whose position in the log is unknown. This chapter assigns the node its LSA and splices it onto the prior-list tail inside one short mutex-guarded critical section. For why CUBRID stages records in an in-memory prior list, see the “prior list” section of cubrid-log-manager.md. The payoff:

Invariant 4-A (LSA order = mutex-acquisition order). Every LSA the engine hands out is monotonically increasing, and the order in which two threads receive their LSAs is exactly the order in which they acquired prior_info.prior_lsa_mutex. The mutex is held for only an O(1) sequence of pointer/offset updates — no I/O, no allocation.
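Invariant 4-A can be pictured with a cursor that is only ever read and advanced under one mutex. The sketch below is a deliberately reduced model: `Lsa`, `PriorCursor`, `lsa_less`, and `assign_lsa` are hypothetical names, and page rollover, alignment, and header linkage are all left out.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

struct Lsa
{
  std::int64_t pageid;
  std::int64_t offset;
};

// Total order: compare pageid first, then offset.
bool lsa_less (const Lsa &a, const Lsa &b)
{
  return a.pageid != b.pageid ? a.pageid < b.pageid : a.offset < b.offset;
}

struct PriorCursor
{
  std::mutex mutex;
  Lsa next = {0, 0};   // the next LSA to hand out
};

// Hand out the current cursor value and advance it by the record size.
// Only O(1) arithmetic happens while the mutex is held.
Lsa assign_lsa (PriorCursor &cursor, std::int64_t record_bytes)
{
  assert (record_bytes > 0);
  std::lock_guard<std::mutex> guard (cursor.mutex);
  Lsa start = cursor.next;
  cursor.next.offset += record_bytes;
  return start;
}
```

Because every caller reads and bumps `next` inside the same critical section, two threads can never receive the same LSA, and whichever thread enters first receives the smaller one.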
4.1 The structs in play
Three structs meet here: the node (Ch 1; only the fields this chapter writes), the global cursor, and the embedded on-disk record header.
log_prior_node (fields written in this chapter)
| Field | Role | Why it exists |
|---|---|---|
| log_header | LOG_RECORD_HEADER — bytes that will physically precede the record in the page | Carries the four linkage LSAs + trid + type that recovery walks |
| start_lsa | The LSA this node is assigned | Returned to the caller as the record’s identity; cross-checked when the node is drained |
| tde_encrypted | Whether the holding page must be TDE-encrypted | Set by prior_set_tde_encrypted; read when the page is allocated/flushed |
| data_header_length | Byte length of data_header | Drives the offset advance for the data-header region |
| data_header | The typed record header (e.g. LOG_REC_SYSOP_END) | Cast on the matched type arm to read MVCC/sysop sub-fields |
| ulength / udata | Undo payload length / buffer | ulength>0 triggers an offset advance for undo data |
| rlength / rdata | Redo payload length / buffer | rlength>0 triggers an offset advance for redo data |
| next | Singly-linked pointer to the next prior node | Set when this node becomes the new tail |
log_rec_header (LOG_RECORD_HEADER) — every field
The physical record header; prior_lsa_start_append/prior_lsa_end_append exist almost entirely to fill it.

| Field | Role | Why it exists |
|---|---|---|
| prev_tranlsa | Previous log record of the same transaction | Lets undo/rollback walk one transaction’s records backward without scanning the whole log |
| back_lsa | Previous physical record (any transaction) | Lets recovery walk the global log backward |
| forw_lsa | Next physical record | Lets analysis/redo walk forward; known only after this record’s size is fixed, so filled in prior_lsa_end_append |
| trid | Transaction id owning this record | Recovery groups records by transaction; set from tdes->trid |
| type | LOG_RECTYPE (e.g. LOG_COMMIT, LOG_SYSOP_END) | Dispatch key for every type-specific branch in prior_lsa_next_record_internal |
log_prior_lsa_info (the global cursor; log_Gl.prior_info) — every field
| Field | Role | Why it exists |
|---|---|---|
| prior_lsa | The next LSA to assign — the moving cursor | Every node copies this into start_lsa; advanced by the offset helpers as the node’s bytes are accounted for |
| prev_lsa | LSA of the last record appended to the prior stream | Becomes the new node’s back_lsa, then is updated to point at the new node |
| prior_list_header | Head of the singly-linked prior list | The drain side (Chapter 5) consumes from the head |
| prior_list_tail | Tail of the prior list | New nodes attach here in O(1) |
| list_size | Bytes staged but not yet flushed | Compared against logpb_get_memsize() to decide when to force a flush |
| prior_flush_list_header | Head of the detached list being flushed | Set when the list is unhooked for draining (Chapter 5) |
| prior_lsa_mutex | std::mutex serializing the whole assignment | The single lock whose acquisition order defines LSA order (Invariant 4-A) |
flowchart LR
subgraph G["log_Gl.prior_info (LOG_PRIOR_LSA_INFO)"]
PL["prior_lsa<br/>(next LSA cursor)"]
PV["prev_lsa<br/>(last record)"]
H["prior_list_header"]
T["prior_list_tail"]
M["prior_lsa_mutex"]
end
N["new LOG_PRIOR_NODE<br/>start_lsa, log_header, next"]
PL -- "copied into" --> N
PV -- "copied into log_header.back_lsa" --> N
T -- "->next = node, then tail = node" --> N
M -. "guards all of the above" .- G
Figure 4-1. The cursor feeds the node its identity and linkage, then adopts the node as its new tail.
4.2 The entry points: with_lock and the LOG_PRIOR_LSA_LOCK enum
Two public entry points, one shared body; the only difference is whether the caller already holds prior_lsa_mutex.

    // prior_lsa_next_record / _with_lock -- src/transaction/log_append.cpp
    LOG_LSA
    prior_lsa_next_record (THREAD_ENTRY *thread_p, LOG_PRIOR_NODE *node, log_tdes *tdes)
    {
      return prior_lsa_next_record_internal (thread_p, node, tdes, LOG_PRIOR_LSA_WITHOUT_LOCK);
    }

    LOG_LSA
    prior_lsa_next_record_with_lock (THREAD_ENTRY *thread_p, LOG_PRIOR_NODE *node, log_tdes *tdes)
    {
      return prior_lsa_next_record_internal (thread_p, node, tdes, LOG_PRIOR_LSA_WITH_LOCK);
    }

The with_lock argument is one of the two values below (the enum has no comments in the source; the annotations here are editorial):

    // LOG_PRIOR_LSA_LOCK -- src/transaction/log_append.hpp
    enum LOG_PRIOR_LSA_LOCK
    {
      LOG_PRIOR_LSA_WITHOUT_LOCK = 0,   // internal locks/unlocks the mutex itself
      LOG_PRIOR_LSA_WITH_LOCK = 1       // caller already holds the mutex
    };

The _with_lock variant lets a caller emit several records with no interleaving: take the mutex once, call _with_lock repeatedly. The plain variant is the common single-record path.
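The batching pattern can be sketched with a toy cursor. All names here (`LockMode`, `Cursor`, `next_record`, `next_record_with_lock`, `append_pair`) are hypothetical, and the "record" is reduced to a size that advances a counter; the point is only how one shared body serves both a self-locking caller and a caller that already holds the mutex.

```cpp
#include <cassert>
#include <mutex>

enum LockMode
{
  WITHOUT_LOCK = 0,   // the internal function locks and unlocks itself
  WITH_LOCK = 1       // the caller already holds the mutex
};

struct Cursor
{
  std::mutex mutex;
  long next = 0;   // next position to hand out
};

// Shared body: lock only when the caller has not done so.
long next_record_internal (Cursor &c, long size, LockMode mode)
{
  assert (size > 0);
  if (mode == WITHOUT_LOCK)
    {
      c.mutex.lock ();
    }
  long start = c.next;
  c.next += size;
  if (mode == WITHOUT_LOCK)
    {
      c.mutex.unlock ();
    }
  return start;
}

long next_record (Cursor &c, long size)
{
  return next_record_internal (c, size, WITHOUT_LOCK);
}

long next_record_with_lock (Cursor &c, long size)
{
  return next_record_internal (c, size, WITH_LOCK);
}

// Emit two records back to back; no other thread can slot a record between them.
void append_pair (Cursor &c, long size_a, long size_b, long *pos_a, long *pos_b)
{
  std::lock_guard<std::mutex> guard (c.mutex);
  *pos_a = next_record_with_lock (c, size_a);
  *pos_b = next_record_with_lock (c, size_b);
}
```

Calling the plain `next_record` while already holding the mutex would self-deadlock on a non-recursive std::mutex, which is the practical reason a caller that batches must use the `_with_lock` form.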
4.3 prior_lsa_next_record_internal — branch-complete walkthrough
The body has three phases: lock + prior_lsa_start_append (4.4); a 6-arm type-dispatch ladder (table below); then the offset-walk + prior_lsa_end_append (4.5) + tail splice + conditional unlock-and-flush. The frame and the tail splice, quoted verbatim (note both splice arms are two statements, not a chained assignment):

    // prior_lsa_next_record_internal -- src/transaction/log_append.cpp
      if (with_lock == LOG_PRIOR_LSA_WITHOUT_LOCK)
        {
          log_Gl.prior_info.prior_lsa_mutex.lock ();
        }
      prior_lsa_start_append (thread_p, node, tdes);   // <- assigns start_lsa + header linkage (4.4)
      LSA_COPY (&start_lsa, &node->start_lsa);         // <- snapshot before any advance
      // ... vacuum-produce guard + 6-arm type dispatch ladder (tables below) ...
      log_prior_lsa_append_advance_when_doesnot_fit (node->data_header_length);
      log_prior_lsa_append_add_align (node->data_header_length);
      if (node->ulength > 0)
        {
          prior_lsa_append_data (node->ulength);
        }
      if (node->rlength > 0)
        {
          prior_lsa_append_data (node->rlength);
        }
      prior_lsa_end_append (thread_p, node);           // <- fixes forw_lsa (4.5)

      if (log_Gl.prior_info.prior_list_tail == NULL)
        {
          log_Gl.prior_info.prior_list_header = node;  // <- empty list: node is head ...
          log_Gl.prior_info.prior_list_tail = node;    // <- ... and tail
        }
      else
        {
          log_Gl.prior_info.prior_list_tail->next = node;   // <- O(1) tail splice (two statements)
          log_Gl.prior_info.prior_list_tail = node;
        }
      log_Gl.prior_info.list_size += (sizeof (LOG_PRIOR_NODE) + node->data_header_length + node->ulength + node->rlength);
      if (with_lock == LOG_PRIOR_LSA_WITHOUT_LOCK)
        {
          log_Gl.prior_info.prior_lsa_mutex.unlock ();  // <- release BEFORE the flush decision
          // ... condensed: if list_size >= logpb_get_memsize() -> force-flush fork (see callout) ...
        }
      tdes->num_log_records_written++;
      return start_lsa;

Before the ladder, a vacuum-produce guard fires: under LOG_ISRESTARTED () and log_Gl.hdr.does_block_need_vacuum, if start_lsa crossed into a new vacuum block id versus mvcc_op_log_lsa, it calls vacuum_produce_log_block_data (asserting the prior block id is exactly one behind). Skipped entirely during crash recovery.
The 6-arm type-dispatch ladder. Mutually exclusive if/else if on node->log_header.type. Every assignment happens under the mutex with the just-snapshotted start_lsa — the reason the captured LSAs are coherent (Chapters 8–9).
| # | Matched type(s) | Guard | Action |
|---|---|---|---|
| 1 | LOG_MVCC_UNDO_DATA, LOG_MVCC_UNDOREDO_DATA, LOG_MVCC_DIFF_UNDOREDO_DATA, or (LOG_SYSOP_END && ((LOG_REC_SYSOP_END *)data_header)->type == LOG_SYSOP_END_LOGICAL_MVCC_UNDO) | — | Resolve vacuum_info/mvccid via nested sub-branch; vacuum_info->prev_mvcc_op_log_lsa = log_Gl.hdr.mvcc_op_log_lsa; prior_update_header_mvcc_info (start_lsa, mvccid) (4.6) |
| 2 | LOG_SYSOP_START_POSTPONE | assert (LSA_ISNULL (rcv.sysop_start_postpone_lsa)) | rcv.sysop_start_postpone_lsa = start_lsa; if lastparent_lsa < rcv.atomic_sysop_start_lsa, null rcv.atomic_sysop_start_lsa; tdes->state = TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE (under mutex, for checkpoint correctness) |
| 3 | LOG_SYSOP_END | — | If atomic_sysop_start_lsa non-null && lastparent_lsa < it → null; same test/null for sysop_start_postpone_lsa |
| 4 | LOG_COMMIT_WITH_POSTPONE or LOG_COMMIT_WITH_POSTPONE_OBSOLETE | — | rcv.tran_start_postpone_lsa = start_lsa |
| 5 | LOG_SYSOP_ATOMIC_START | assert (LSA_ISNULL (rcv.atomic_sysop_start_lsa)) | rcv.atomic_sysop_start_lsa = start_lsa |
| 6 | LOG_COMMIT or LOG_ABORT | assert (commit_abort_lsa.is_null ()) | commit_abort_lsa = start_lsa |
Nested 3-way MVCC sub-branch (inside arm 1) — selects which struct holds vacuum_info/mvccid:
| Sub-arm | Condition | vacuum_info / mvccid source |
|---|---|---|
| a | type == LOG_MVCC_UNDO_DATA | (LOG_REC_MVCC_UNDO *) node->data_header → &mvcc_undo->vacuum_info, mvcc_undo->mvccid |
| b | type == LOG_SYSOP_END | &((LOG_REC_SYSOP_END *) node->data_header)->mvcc_undo → &mvcc_undo->vacuum_info, mvcc_undo->mvccid |
| c | else (LOG_MVCC_UNDOREDO_DATA / LOG_MVCC_DIFF_UNDOREDO_DATA, asserted) | (LOG_REC_MVCC_UNDOREDO *) node->data_header → &mvcc_undoredo->vacuum_info, mvcc_undoredo->mvccid |
If none of arms 1–6 match (the common data-record case), the ladder is a no-op and control falls straight to the offset walk.
Unlock-then-flush fork. Only WITHOUT_LOCK unlocks here, and the list_size >= logpb_get_memsize() check sits outside the mutex. Inside that check, SERVER_MODE wakes the flush daemon and sleeps 1 ms when not in crash recovery, versus a synchronous logpb_prior_lsa_append_all_list under LOG_CS during recovery. SA mode (#else) is always synchronous. Chapter 5 covers the drain.
flowchart TD
A["enter internal"] --> B{"WITHOUT_LOCK?"}
B -- yes --> C["lock prior_lsa_mutex"]
B -- no --> D["prior_lsa_start_append"]
C --> D
D --> E["snapshot start_lsa"]
E --> VG{"vacuum-produce guard"}
VG --> F["6-arm type dispatch ladder<br/>see arms 1-6 table; no-op if no match"]
F --> G["advance + add_align data_header"]
G --> H{"ulength>0?"}
H -- yes --> I["append_data ulength"]
H -- no --> J{"rlength>0?"}
I --> J
J -- yes --> K["append_data rlength"]
J -- no --> L["end_append: set forw_lsa"]
K --> L
L --> M{"tail == NULL?"}
M -- yes --> N["header = tail = node"]
M -- no --> O["tail->next = node; tail = node"]
N --> P["list_size += footprint"]
O --> P
P --> Q{"WITHOUT_LOCK?"}
Q -- yes --> R["unlock; maybe force flush"]
Q -- no --> S["num_log_records_written++; return"]
R --> S
Figure 4-2. Branch-complete control flow of prior_lsa_next_record_internal, including all six dispatch arms.
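The tail splice and the unlock-then-check ordering from 4.3 can be modelled in a few lines. This is a minimal sketch, not the CUBRID implementation: PriorNode, PriorList, append_node, and flush_threshold are hypothetical stand-ins (flush_threshold plays the role of logpb_get_memsize()), and unlike the quoted code the sketch snapshots list_size while still holding the lock so the example stays deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical stand-ins for illustration; not the CUBRID types.
struct PriorNode {
  std::size_t payload_bytes = 0;   // models data_header_length + ulength + rlength
  PriorNode *next = nullptr;
};

struct PriorList {
  std::mutex mtx;                  // models prior_lsa_mutex
  PriorNode *header = nullptr;
  PriorNode *tail = nullptr;
  std::size_t list_size = 0;       // bytes currently parked on the list
};

// Splices one node onto the tail. Returns true when the caller should
// trigger a flush; the decision is taken after the mutex is released.
inline bool append_node(PriorList &list, PriorNode *node, std::size_t flush_threshold) {
  std::size_t size_after = 0;
  {
    std::lock_guard<std::mutex> guard(list.mtx);
    if (list.tail == nullptr) {
      list.header = node;          // empty list: node is head ...
      list.tail = node;            // ... and tail
    } else {
      list.tail->next = node;      // O(1) splice, two statements
      list.tail = node;
    }
    list.list_size += sizeof(PriorNode) + node->payload_bytes;
    size_after = list.list_size;   // assumption: snapshot under the lock
  }                                // mutex released before the flush decision
  return size_after >= flush_threshold;
}
```

Keeping the flush decision outside the guard means a slow flush never extends the window in which other appenders are blocked.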
4.4 prior_lsa_start_append — assigning the LSA and the backward chain
This is where the node’s identity is born.

```cpp
// prior_lsa_start_append -- src/transaction/log_append.cpp
  log_prior_lsa_append_advance_when_doesnot_fit (sizeof (LOG_RECORD_HEADER));   // <- header must not straddle a page
  node->log_header.trid = tdes->trid;
  LSA_COPY (&node->start_lsa, &log_Gl.prior_info.prior_lsa);   // <- THE LSA assignment, before any advance (Inv 4-C)
  if (tdes->is_system_worker_transaction () && !tdes->is_under_sysop ())
    {
      LSA_SET_NULL (&node->log_header.prev_tranlsa);   // <- worker, no sysop: lose the per-tran chain
      LSA_SET_NULL (&tdes->head_lsa);
      LSA_SET_NULL (&tdes->tail_lsa);
    }
  else
    {
      LSA_COPY (&node->log_header.prev_tranlsa, &tdes->tail_lsa);   // chain to this tran's last record
      LSA_COPY (&tdes->tail_lsa, &log_Gl.prior_info.prior_lsa);     // this record is now the tran tail
      if (LSA_ISNULL (&tdes->head_lsa))
        {
          LSA_COPY (&tdes->head_lsa, &tdes->tail_lsa);              // first record of the tran
        }
      LSA_COPY (&tdes->undo_nxlsa, &log_Gl.prior_info.prior_lsa);   // next to undo on rollback
    }
  LSA_COPY (&node->log_header.back_lsa, &log_Gl.prior_info.prev_lsa);     // <- physical backward link (any tran)
  LSA_SET_NULL (&node->log_header.forw_lsa);                              // <- not known yet (end_append)
  LSA_COPY (&log_Gl.prior_info.prev_lsa, &log_Gl.prior_info.prior_lsa);   // <- prev_lsa now names THIS record
  log_prior_lsa_append_add_align (sizeof (LOG_RECORD_HEADER));            // <- account the header bytes
```

The transaction-chain fork: system workers (e.g. vacuum) do not own a rollback chain, so a worker record not under a sysop nulls prev_tranlsa/head_lsa/tail_lsa. Everyone else chains to the prior tail and updates it. The physical back_lsa = prev_lsa link is transaction-independent; prev_lsa then advances to name this record. forw_lsa is nulled here, fixed in end_append.
Invariant 4-B. A system-worker record NOT under a sysop carries a null prev_tranlsa. Violating it would make recovery walk a non-existent transaction chain.
Invariant 4-C (start before advance). start_lsa is read before any add_align advances the cursor, so it names the first byte of the record. The header-fit guard runs first so that first byte is on a page that can hold the header.
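The two chains built in 4.4 (the per-transaction prev_tranlsa chain and the physical back_lsa chain) can be seen side by side in a small model. The sketch below is illustrative only: Lsa, Tdes, Node, PriorInfo, and start_append are hypothetical simplifications, a pageid of -1 stands in for the null LSA, and the cursor advance is left to the caller rather than done by the offset helpers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplified LSA; pageid == -1 models the null LSA.
struct Lsa {
  std::int64_t pageid;
  std::int64_t offset;
  Lsa() : pageid(-1), offset(-1) {}
  Lsa(std::int64_t p, std::int64_t o) : pageid(p), offset(o) {}
  bool is_null() const { return pageid == -1; }
  bool operator==(const Lsa &o) const { return pageid == o.pageid && offset == o.offset; }
};

struct Tdes {                       // subset of a transaction descriptor
  bool system_worker = false;
  bool under_sysop = false;
  Lsa head_lsa, tail_lsa, undo_nxlsa;
};

struct Header { Lsa prev_tranlsa, back_lsa, forw_lsa; };
struct Node { Lsa start_lsa; Header hdr; };

struct PriorInfo {
  Lsa prior_lsa{0, 0};              // append cursor
  Lsa prev_lsa;                     // last record handed out (null at first)
};

inline void start_append(PriorInfo &pi, Tdes &t, Node &n) {
  n.start_lsa = pi.prior_lsa;       // read before any advance
  if (t.system_worker && !t.under_sysop) {
    n.hdr.prev_tranlsa = Lsa();     // worker outside a sysop: no per-transaction chain
    t.head_lsa = Lsa();
    t.tail_lsa = Lsa();
  } else {
    n.hdr.prev_tranlsa = t.tail_lsa;
    t.tail_lsa = pi.prior_lsa;
    if (t.head_lsa.is_null()) t.head_lsa = t.tail_lsa;
    t.undo_nxlsa = pi.prior_lsa;
  }
  n.hdr.back_lsa = pi.prev_lsa;     // physical link, independent of the transaction
  n.hdr.forw_lsa = Lsa();           // filled in later by end_append
  pi.prev_lsa = pi.prior_lsa;
}
```

Interleaving a user transaction with a system worker shows the fork: the worker's record still appears in the back_lsa chain, but the user transaction's prev_tranlsa skips over it.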
4.5 prior_lsa_end_append — fixing forw_lsa
Once the data-header, undo, and redo regions are accounted for, the cursor sits at the first byte past this record. Two helpers then run before forw_lsa is read: align, then bump to the next page if the next header would not fit. Only after both does the cursor name the next record’s start, which is what gets stored as this record’s forw_lsa. So forw_lsa always names a position where the following header can legally live, and every forw_lsa equals the next record’s start_lsa with no straddling header between.

```cpp
// prior_lsa_end_append -- src/transaction/log_append.cpp
static void
prior_lsa_end_append (THREAD_ENTRY *thread_p, LOG_PRIOR_NODE *node)
{
  log_prior_lsa_append_align ();   // <- align to next record start
  log_prior_lsa_append_advance_when_doesnot_fit (sizeof (LOG_RECORD_HEADER));   // <- next header must fit too
  LSA_COPY (&node->log_header.forw_lsa, &log_Gl.prior_info.prior_lsa);
}
```

4.6 prior_update_header_mvcc_info — vacuum block bookkeeping
Invoked from arm 1 of the ladder. It maintains the running MVCC-block summary in the global log header so vacuum knows which blocks have MVCC work.

```cpp
// prior_update_header_mvcc_info -- src/transaction/log_append.cpp
  assert (MVCCID_IS_VALID (mvccid));
  if (!log_Gl.hdr.does_block_need_vacuum)   // <- FIRST MVCC record of this block
    {
      log_Gl.hdr.oldest_visible_mvccid = log_Gl.mvcc_table.get_global_oldest_visible ();
      log_Gl.hdr.newest_block_mvccid = mvccid;
    }
  else
    {
      // ... condensed: sanity asserts on oldest/newest/block id ...
      if (log_Gl.hdr.newest_block_mvccid < mvccid)   // <- subsequent record: raise high-water only
        {
          log_Gl.hdr.newest_block_mvccid = mvccid;
        }
    }
  log_Gl.hdr.mvcc_op_log_lsa = record_lsa;   // <- both branches: latest MVCC op position
  log_Gl.hdr.does_block_need_vacuum = true;
```

The first MVCC record of a block seeds oldest_visible_mvccid from the MVCC table and newest_block_mvccid from its own mvccid; subsequent records only raise newest_block_mvccid (the elided else also asserts the block id matches mvcc_op_log_lsa). Both arms set mvcc_op_log_lsa = record_lsa and flag the block as needing vacuum (does_block_need_vacuum = true). The update stays consistent with the totally-ordered LSA stream because it runs under the mutex.
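The seed-then-raise behaviour of the block summary is easy to check in isolation. The following is a minimal sketch under stated assumptions: LogHeaderSummary and update_mvcc_info are hypothetical names, a plain integer stands in for an LSA, and the global oldest-visible MVCCID is passed in as a parameter instead of being read from the MVCC table.

```cpp
#include <cassert>
#include <cstdint>

using Mvccid = std::uint64_t;

// Hypothetical subset of the log header fields discussed in 4.6.
struct LogHeaderSummary {
  bool does_block_need_vacuum = false;
  Mvccid oldest_visible_mvccid = 0;
  Mvccid newest_block_mvccid = 0;
  std::int64_t mvcc_op_log_lsa = -1;   // one integer stands in for an LSA here
};

inline void update_mvcc_info(LogHeaderSummary &hdr, std::int64_t record_lsa,
                             Mvccid mvccid, Mvccid global_oldest_visible) {
  if (!hdr.does_block_need_vacuum) {
    // First MVCC record of the block: seed both bounds.
    hdr.oldest_visible_mvccid = global_oldest_visible;
    hdr.newest_block_mvccid = mvccid;
  } else if (hdr.newest_block_mvccid < mvccid) {
    // Later records only raise the high-water mark.
    hdr.newest_block_mvccid = mvccid;
  }
  hdr.mvcc_op_log_lsa = record_lsa;    // both branches
  hdr.does_block_need_vacuum = true;
}
```

Feeding it three records with MVCCIDs 50, 45, 60 leaves oldest_visible_mvccid at the value seen by the first call and newest_block_mvccid at 60.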
4.7 The offset helpers — how prior_lsa walks the record footprint
Three statics advance log_Gl.prior_info.prior_lsa, consuming each region. All operate on a 0-based offset within a LOGAREA_SIZE-byte page area (leading assert (... offset >= 0) lines elided).

```cpp
// offset helpers -- src/transaction/log_append.cpp
static void
log_prior_lsa_append_align ()
{
  log_Gl.prior_info.prior_lsa.offset = DB_ALIGN (log_Gl.prior_info.prior_lsa.offset, DOUBLE_ALIGNMENT);
  if ((size_t) log_Gl.prior_info.prior_lsa.offset >= (size_t) LOGAREA_SIZE)   // <- align rolled off page
    {
      log_Gl.prior_info.prior_lsa.pageid++;
      log_Gl.prior_info.prior_lsa.offset = 0;
    }
}

static void
log_prior_lsa_append_advance_when_doesnot_fit (size_t length)
{
  if ((size_t) log_Gl.prior_info.prior_lsa.offset + length >= (size_t) LOGAREA_SIZE)   // <- region won't fit
    {
      log_Gl.prior_info.prior_lsa.pageid++;
      log_Gl.prior_info.prior_lsa.offset = 0;
    }
}

static void
log_prior_lsa_append_add_align (size_t add)
{
  log_Gl.prior_info.prior_lsa.offset += (add);   // <- consume the region's bytes
  log_prior_lsa_append_align ();                 // <- then align (may roll to next page)
}
```

advance_when_doesnot_fit is the only pre-check: it runs before any bytes are consumed, so a header never straddles a boundary. align has its own roll-over branch, but that one fires after the offset has already moved. The pairing advance_when_doesnot_fit(N) then add_align(N) first ensures region N fits, then consumes it. Payloads that span pages (prior_lsa_append_data) are Chapter 6’s subject.
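To make the arithmetic concrete, the three helpers can be replayed on a toy page. This is a sketch under stated assumptions, not the real code: kAreaSize (64) and kHeader (16) are made-up values standing in for LOGAREA_SIZE and sizeof (LOG_RECORD_HEADER), kAlign assumes DOUBLE_ALIGNMENT is 8, and reserve_record is a hypothetical wrapper that strings together the start_append, data-header, and end_append steps for a record with no undo or redo payload.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kAreaSize = 64;   // hypothetical; stands in for LOGAREA_SIZE
constexpr std::size_t kAlign = 8;       // assumed value of DOUBLE_ALIGNMENT
constexpr std::size_t kHeader = 16;     // hypothetical sizeof (LOG_RECORD_HEADER)

struct Cursor { std::int64_t pageid = 0; std::size_t offset = 0; };

inline void roll(Cursor &c) { ++c.pageid; c.offset = 0; }

inline void align(Cursor &c) {
  c.offset = (c.offset + kAlign - 1) / kAlign * kAlign;   // round up
  if (c.offset >= kAreaSize) roll(c);                     // align rolled off the page
}

inline void advance_when_doesnot_fit(Cursor &c, std::size_t length) {
  if (c.offset + length >= kAreaSize) roll(c);            // pre-check, note the >=
}

inline void add_align(Cursor &c, std::size_t add) {
  c.offset += add;                                        // consume, then align
  align(c);
}

struct Span { Cursor start, forw; };

// Reserves address space for a header plus a data header of data_len bytes.
inline Span reserve_record(Cursor &c, std::size_t data_len) {
  advance_when_doesnot_fit(c, kHeader);
  Span s;
  s.start = c;                            // start_lsa: read before any advance
  add_align(c, kHeader);
  advance_when_doesnot_fit(c, data_len);
  add_align(c, data_len);
  align(c);                               // end_append: align ...
  advance_when_doesnot_fit(c, kHeader);   // ... and make sure the next header fits
  s.forw = c;                             // forw_lsa
  return s;
}
```

Reserving records with data lengths 20, 5, and 50 from an empty cursor gives forw positions (0,40), (1,0), and (3,0), and each record's start equals the previous record's forw. Because the fit test as quoted uses >=, a region that would end exactly on the page limit is also pushed to the next page.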
4.8 prior_set_tde_encrypted — marking the node for encryption
Separate from the LSA path, called on the node for sensitive records.

```cpp
// prior_set_tde_encrypted -- src/transaction/log_append.cpp
  if (!tde_is_loaded())   // <- cipher must be available
    {
      er_set (ER_ERROR_SEVERITY, ARG_FILE_LINE, ER_TDE_CIPHER_IS_NOT_LOADED, 0);
      return ER_TDE_CIPHER_IS_NOT_LOADED;   // <- error branch
    }
  tde_er_log ("prior_set_tde_encrypted(): rcvindex = %s\n", rv_rcvindex_string (recvindex));
  node->tde_encrypted = true;   // <- the only state change
  return NO_ERROR;
```

Two branches: cipher not loaded → log error, return ER_TDE_CIPHER_IS_NOT_LOADED, node untouched; otherwise flip node->tde_encrypted = true. The flag is read later when the holding page is allocated/flushed; it does not participate in LSA assignment, which is why it is a standalone setter, not part of prior_lsa_start_append. (Query side: the trivial prior_is_tde_encrypted.)
4.9 Chapter summary — key takeaways
- One mutex defines the order. prior_lsa_mutex is acquired once per record (or held across several via _with_lock); acquisition order is LSA order (Invariant 4-A). Read-then-advance under one lock means no shared or out-of-order LSAs; no separate counter.
- prior_lsa_start_append is the moment of birth. Copies prior_lsa into start_lsa before advancing (Invariant 4-C), sets trid, builds prev_tranlsa/back_lsa, nulls forw_lsa.
- The transaction chain forks on worker status. Worker records not under a sysop get null prev_tranlsa/head_lsa/tail_lsa (Invariant 4-B); everyone else chains and updates tail_lsa/head_lsa/undo_nxlsa.
- A 6-arm type ladder captures start_lsa under the lock. MVCC-undo (nested 3-way select → prior_update_header_mvcc_info), SYSOP_START_POSTPONE (also flips tdes->state), COMMIT_WITH_POSTPONE(_OBSOLETE), SYSOP_ATOMIC_START, and COMMIT/ABORT each stash the LSA into tdes->rcv.*/commit_abort_lsa; SYSOP_END only clears stale rcv LSAs.
- forw_lsa is fixed last. prior_lsa_end_append aligns past the record and guards the next header’s fit, so forw_lsa equals the next record’s start_lsa.
- Offset helpers consume the footprint. advance_when_doesnot_fit pre-checks fit (the one pre-check), add_align consumes-then-aligns, align rounds to DOUBLE_ALIGNMENT and rolls pages.
- Expensive work is outside the lock. The O(1) splice and list_size bump end the critical section; the flush check and any flush run after unlock.
Chapter 5: Draining the Prior List into the Page Buffer
Chapter 4 left a chain of log_prior_nodes with LSAs wired up by
prior_lsa_next_record — promised but not yet copied into any LOG_PAGE
frame. This chapter traces the single-writer drain that detaches the list,
walks it in LSN order, and serializes each node into the page buffer. We stop
at the page boundary (Chapter 6 owns logpb_next_append_page); the WAL rule
is Chapter 7 (companion cubrid-log-manager.md).
5.1 The two serialization layers
Two locks, two jobs: prior_lsa_mutex serializes appenders against each
other (held only for the LSA-stamp-and-link, Chapter 4; does not protect
the page buffer); LOG_CS write mode serializes appenders against the
page-buffer writer — every drain function opens with
assert (LOG_CS_OWN_WRITE_MODE (thread_p)).
INVARIANT (single-writer drain). The drain runs while the caller owns LOG_CS write mode; with two drainers, append_lsa.offset and log_pgptr would update non-atomically and records would interleave. The LOG_CS_OWN_WRITE_MODE assert makes a violation fatal. Every struct table below assumes this.
The hand-off is the detach: under prior_lsa_mutex the writer snips off
the list and nulls the header, so later appenders build a fresh list while
the detached one is drained lock-free.
flowchart TD
A1["prior_lsa_next_record\nholds prior_lsa_mutex briefly"]
D1["logpb_prior_lsa_append_all_list"]
D2["detach list under prior_lsa_mutex\nreset header/tail/list_size"]
D3["logpb_append_prior_lsa_list\nwalk nodes in LSN order"]
D4["logpb_append_next_record per node"]
D5["copy bytes into LOG_PAGE frames\nset dirty, free node"]
A1 -->|"attach to prior_list"| D2
D1 --> D2 --> D3 --> D4 --> D5
Figure 5-1. The two serialization layers and the detach hand-off.
5.2 Structs the drain reads and writes
log_prior_node — the unit being drained (log_append.hpp)
| Field | Role | Why it exists |
|---|---|---|
| log_header | LOG_RECORD_HEADER copied verbatim by logpb_start_append | On-disk record header |
| start_lsa | Must equal append_lsa when appended | Catches LSN-order corruption |
| tde_encrypted | Destination page is TDE-encrypted | Drives appending_page_tde_encrypted |
| data_header_length | Byte length of data_header | Sizes the header copy |
| data_header | Fixed per-record-type header payload | Part after LOG_RECORD_HEADER |
| ulength / udata | Length/pointer of the undo segment | Rollback image |
| rlength / rdata | Length/pointer of the redo segment | Recovery image |
| next | Link to the next node | Walked in LSN order |
INVARIANT (node order = LSN order). Tail-append under prior_lsa_mutex makes next traversal exactly ascending LSN; logpb_append_next_record re-checks each node via LSA_EQ (&node->start_lsa, &log_Gl.hdr.append_lsa) and a mismatch is a logpb_fatal_error.
LOG_PAGE / log_hdrpage — the destination frame (log_storage.hpp)
log_page is { LOG_HDRPAGE hdr; char area[1]; }; log_hdrpage is the per-frame header.
| Field | Role | Why it exists |
|---|---|---|
| hdr.logical_pageid | Identity of this frame in the log | Maps page to physical slot |
| hdr.offset | Offset of the first record on this page | Set once by logpb_start_append; enables salvage |
| hdr.flags | TDE encryption flags | Stamped by logpb_set_tde_algorithm |
| hdr.checksum | CRC32 of the page | Computed at flush (Chapter 7) |
| area | Buffer header+payload are memcpy’d into | LOG_APPEND_PTR() = area + append_lsa.offset |
LOG_BUFFER — frame wrapper carrying the dirty bit (log_page_buffer.c)
| Field | Role | Why it exists |
|---|---|---|
| pageid (volatile) | Logical page id of the wrapped frame | Validates flush targets |
| phy_pageid (volatile) | Physical page id in the active log | Maps logical page to disk slot |
| dirty (bool) | “Has unflushed changes” | Raised by logpb_set_dirty, cleared by flusher (Chapter 7) |
| logpage (LOG_PAGE*) | Back-pointer to buffered payload | logpb_get_log_buffer recovers the wrapper from a LOG_PAGE* |
log_append_info — the single writer’s cursor state (log_append.hpp)
| Field | Role | Why it exists |
|---|---|---|
| vdes | Active-log volume descriptor | Flush target; untouched by the drain |
| nxio_lsa (atomic) | Lowest LSN not yet on disk | The WAL frontier (Chapter 7) |
| prev_lsa | Address of the last fully appended record | logpb_start_append checks back_lsa == prev_lsa, then advances it |
| log_pgptr | The currently fixed append page frame | LOG_APPEND_PTR() writes into log_pgptr->area |
| appending_page_tde_encrypted | Page being filled needs TDE | Set per node from node->tde_encrypted |
INVARIANT (back_lsa chaining). logpb_start_append checks back_lsa == prev_lsa before writing each header and calls logpb_fatal_error on a mismatch, so the on-disk backward chain either stays unbroken or the process stops.
5.3 logpb_prior_lsa_append_all_list — detach then drain

```cpp
// logpb_prior_lsa_append_all_list -- src/transaction/log_page_buffer.c
int
logpb_prior_lsa_append_all_list (THREAD_ENTRY * thread_p)
{
  LOG_PRIOR_NODE *prior_list;
  assert (LOG_CS_OWN_WRITE_MODE (thread_p));   /* <- single-writer invariant */

  log_Gl.prior_info.prior_lsa_mutex.lock ();
  prior_list = prior_lsa_remove_prior_list (thread_p);   /* <- detach */
  log_Gl.prior_info.prior_lsa_mutex.unlock ();           /* <- mutex dropped early */

  if (prior_list != NULL)
    {
      // ... condensed: perfmon stats ...
      logpb_append_prior_lsa_list (thread_p, prior_list);   /* <- drain, no mutex held */
    }
  return NO_ERROR;
}
```

prior_lsa_remove_prior_list is the detach, the only mutation of the prior-list header during the drain:

```cpp
// prior_lsa_remove_prior_list -- src/transaction/log_page_buffer.c
static LOG_PRIOR_NODE *
prior_lsa_remove_prior_list (THREAD_ENTRY * thread_p)
{
  LOG_PRIOR_NODE *prior_list;
  assert (LOG_CS_OWN_WRITE_MODE (thread_p));
  prior_list = log_Gl.prior_info.prior_list_header;
  log_Gl.prior_info.prior_list_header = NULL;   /* <- reset header/tail/size: */
  log_Gl.prior_info.prior_list_tail = NULL;     /*    new appenders start fresh */
  log_Gl.prior_info.list_size = 0;
  return prior_list;
}
```

Branch: if prior_list == NULL the drain is skipped. Either way the mutex is released before any copy, shrinking the appender-blocking window to the three field resets.
5.4 logpb_append_prior_lsa_list — walk and free
The detached list is parked on prior_flush_list_header (a separate slot, so a rebuilt prior_list_header stays untouched), then drained node-by-node.

```cpp
// logpb_append_prior_lsa_list -- src/transaction/log_page_buffer.c
static int
logpb_append_prior_lsa_list (THREAD_ENTRY * thread_p, LOG_PRIOR_NODE * list)
{
  LOG_PRIOR_NODE *node;
  assert (log_Gl.prior_info.prior_flush_list_header == NULL);   /* <- no concurrent drain */
  log_Gl.prior_info.prior_flush_list_header = list;

  while (log_Gl.prior_info.prior_flush_list_header != NULL)
    {
      node = log_Gl.prior_info.prior_flush_list_header;
      log_Gl.prior_info.prior_flush_list_header = node->next;   /* <- advance before copy */
      logpb_append_next_record (thread_p, node);                /* <- the copy */

      if (node->data_header != NULL) free_and_init (node->data_header);
      if (node->udata != NULL) free_and_init (node->udata);
      if (node->rdata != NULL) free_and_init (node->rdata);
      free_and_init (node);   /* <- node lifetime ends */
    }
  return NO_ERROR;
}
```

Each segment is freed only if non-NULL (a node may carry any subset); the head advances before the copy so the loop ends on the NULL next. The assert (prior_flush_list_header == NULL) enforces no overlapping flush list, a corollary of the single-writer invariant that holds under LOG_CS.
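The detach in 5.3 and the walk-and-free loop in 5.4 fit together as sketched below. This is a minimal model, assuming hypothetical names (Node, PriorList, push, detach, drain): the copy into the page buffer is replaced by recording node ids in a vector, and LOG_CS is not modelled, so the sketch assumes a single drainer.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

struct Node { int id = 0; Node *next = nullptr; };

struct PriorList {
  std::mutex mtx;                  // models prior_lsa_mutex
  Node *header = nullptr;
  Node *tail = nullptr;
  std::size_t list_size = 0;
};

inline void push(PriorList &list, int id) {
  Node *n = new Node;
  n->id = id;
  std::lock_guard<std::mutex> guard(list.mtx);
  if (list.tail == nullptr) { list.header = n; list.tail = n; }
  else { list.tail->next = n; list.tail = n; }
  list.list_size += sizeof(Node);
}

// Caller must hold list.mtx. Three resets, nothing else.
inline Node *detach(PriorList &list) {
  Node *detached = list.header;
  list.header = nullptr;
  list.tail = nullptr;
  list.list_size = 0;
  return detached;
}

inline std::vector<int> drain(PriorList &list) {
  Node *detached = nullptr;
  {
    std::lock_guard<std::mutex> guard(list.mtx);
    detached = detach(list);
  }                                // appenders can build a fresh list from here on
  std::vector<int> copied;
  while (detached != nullptr) {
    Node *node = detached;
    detached = node->next;         // advance before the copy
    copied.push_back(node->id);    // stands in for logpb_append_next_record
    delete node;                   // node lifetime ends
  }
  return copied;
}
```

Nodes pushed as 1, 2, 3 come back from drain in the same order, and a second drain on the now-empty list returns nothing.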
5.5 logpb_append_next_record — one node, header + payload

```cpp
// logpb_append_next_record -- src/transaction/log_page_buffer.c
static int
logpb_append_next_record (THREAD_ENTRY * thread_p, LOG_PRIOR_NODE * node)
{
  if (!LSA_EQ (&node->start_lsa, &log_Gl.hdr.append_lsa))
    logpb_fatal_error (thread_p, true, ARG_FILE_LINE, "logpb_append_next_record");   /* <- LSN-order */

  if (log_Gl.flush_info.num_toflush + 1 >= log_Gl.flush_info.max_toflush)
    logpb_flush_all_append_pages (thread_p);   /* <- flush early, before this record */

  log_Gl.append.appending_page_tde_encrypted = prior_is_tde_encrypted (node);
  logpb_start_append (thread_p, &node->log_header);   /* writes LOG_RECORD_HEADER */

  if (node->data_header != NULL)
    {
      LOG_APPEND_ADVANCE_WHEN_DOESNOT_FIT (thread_p, node->data_header_length);   /* keep header contiguous */
      logpb_append_data (thread_p, node->data_header_length, node->data_header);
    }
  if (node->udata != NULL) logpb_append_data (thread_p, node->ulength, node->udata);
  if (node->rdata != NULL) logpb_append_data (thread_p, node->rlength, node->rdata);

  logpb_end_append (thread_p, &node->log_header);
  log_Gl.append.appending_page_tde_encrypted = false;   /* reset for next node */
  return NO_ERROR;
}
```

The non-obvious branch is the early flush (num_toflush + 1 >= max_toflush): flushing now, with no record in progress, keeps the partial-append state machine (LOGPB_APPENDREC_*, Chapter 7) from triggering mid-record. data_header is pre-advanced to stay on one page; udata/rdata may wrap across pages. Figure 5-2 covers every branch.
flowchart TD
S["enter logpb_append_next_record"] --> C1{"start_lsa == append_lsa ?"}
C1 -->|no| F["logpb_fatal_error"]
C1 -->|yes| C2{"flush list nearly full ?"}
C2 -->|yes| FL["logpb_flush_all_append_pages"]
C2 -->|no| H
FL --> H["set tde flag\nlogpb_start_append: write header"]
H --> C3{"data_header ?"}
C3 -->|yes| DH["ADVANCE_WHEN_DOESNOT_FIT\nappend_data header"]
C3 -->|no| C4
DH --> C4{"udata ?"}
C4 -->|yes| UD["append_data udata"]
C4 -->|no| C5
UD --> C5{"rdata ?"}
C5 -->|yes| RD["append_data rdata"]
C5 -->|no| E
RD --> E["logpb_end_append\nreset tde flag"]
Figure 5-2. Branch-complete flow of logpb_append_next_record.
5.6 logpb_start_append — stamp the record header

```cpp
// logpb_start_append -- src/transaction/log_page_buffer.c
static void
logpb_start_append (THREAD_ENTRY * thread_p, LOG_RECORD_HEADER * header)
{
  LOG_RECORD_HEADER *log_rec;
  // ... condensed: assert, perfmon, ADVANCE_WHEN_DOESNOT_FIT (header contiguous) ...
  if (!LSA_EQ (&header->back_lsa, &log_Gl.append.prev_lsa))
    logpb_fatal_error (thread_p, true, ARG_FILE_LINE, "logpb_start_append");   /* <- back-chain check */

  if (log_Gl.append.appending_page_tde_encrypted && !LOG_IS_PAGE_TDE_ENCRYPTED (log_Gl.append.log_pgptr))
    {
      // ... condensed: stamp TDE algorithm on the page ...
      logpb_set_dirty (thread_p, log_Gl.append.log_pgptr);
    }

  log_rec = (LOG_RECORD_HEADER *) LOG_APPEND_PTR ();
  *log_rec = *header;   /* <- the header copy */
  // ... condensed: if hdr.offset == NULL_OFFSET, set first-record offset on this page ...

  if (log_rec->type == LOG_END_OF_LOG)
    {
      LSA_COPY (&log_Gl.hdr.eof_lsa, &log_Gl.hdr.append_lsa);
      logpb_set_dirty (thread_p, log_Gl.append.log_pgptr);
    }
  else
    {
      LSA_COPY (&log_Gl.append.prev_lsa, &log_Gl.hdr.append_lsa);   /* advance prev_lsa */
      LOG_APPEND_SETDIRTY_ADD_ALIGN (thread_p, sizeof (LOG_RECORD_HEADER));   /* dirty + bump + align */
      log_Pb.partial_append.status = LOGPB_APPENDREC_IN_PROGRESS;
    }
}
```

Beyond the back-chain check and the TDE stamp, two branches matter: hdr.offset == NULL_OFFSET sets the page’s first-record offset once; the type split routes LOG_END_OF_LOG (EOF sentinel, Chapter 7) down a placeholder path that leaves prev_lsa and the partial-append status untouched, while the else arm advances the chain and enters IN_PROGRESS.
5.7 logpb_append_data — the aligned byte copy

```cpp
// logpb_append_data -- src/transaction/log_page_buffer.c
static void
logpb_append_data (THREAD_ENTRY * thread_p, int length, const char *data)
{
  int copy_length;
  char *ptr, *last_ptr;
  if (length == 0 || data == NULL) return;   /* <- empty segment: nothing to do */

  LOG_APPEND_ALIGN (thread_p, LOG_DONT_SET_DIRTY);   /* align, don't dirty yet */
  ptr = LOG_APPEND_PTR ();
  last_ptr = LOG_LAST_APPEND_PTR ();   /* = area + LOGAREA_SIZE */

  if ((ptr + length) >= last_ptr)   /* <- does NOT fit in this page */
    {
      while (length > 0)
        {
          if (ptr >= last_ptr)
            {
              logpb_next_append_page (thread_p, LOG_SET_DIRTY);   /* Chapter 6 */
              ptr = LOG_APPEND_PTR ();
              last_ptr = LOG_LAST_APPEND_PTR ();
            }
          copy_length = (ptr + length >= last_ptr) ? CAST_BUFLEN (last_ptr - ptr) : length;
          memcpy (ptr, data, copy_length);
          ptr += copy_length;
          data += copy_length;
          length -= copy_length;
          log_Gl.hdr.append_lsa.offset += copy_length;   /* advance by bytes copied */
        }
    }
  else   /* <- fits entirely */
    {
      memcpy (ptr, data, length);
      log_Gl.hdr.append_lsa.offset += length;
    }
  LOG_APPEND_ALIGN (thread_p, LOG_SET_DIRTY);   /* align for next append AND mark dirty */
}
```

The boundary-span path copies to page end, calls logpb_next_append_page (Chapter 6), and repeats until length == 0. logpb_append_crumbs is the scatter-gather sibling (same fit/span logic), not on the drain path.
INVARIANT (cursor tracks bytes copied). append_lsa.offset advances by exactly the bytes memcpy’d on every path; drift would make the next LOG_APPEND_PTR() point at the wrong byte and records overlap. Both LOG_APPEND_ALIGN calls only round up.
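The boundary-span copy and the cursor invariant can be exercised on a toy appender. This is a sketch under stated assumptions rather than the real routine: Appender, next_page, and append are hypothetical names, kAreaSize (16) stands in for LOGAREA_SIZE, pages live in a std::vector instead of the log page buffer, and alignment and dirty flags are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t kAreaSize = 16;   // hypothetical; stands in for LOGAREA_SIZE

struct Appender {
  std::vector<std::vector<char>> pages;
  std::size_t pageid = 0;
  std::size_t offset = 0;               // models append_lsa.offset

  Appender() { pages.push_back(std::vector<char>(kAreaSize, '\0')); }

  void next_page() {                    // stands in for logpb_next_append_page
    pages.push_back(std::vector<char>(kAreaSize, '\0'));
    ++pageid;
    offset = 0;
  }

  void append(const char *data, std::size_t length) {
    while (length > 0) {
      if (offset >= kAreaSize) next_page();        // current page is full
      std::size_t room = kAreaSize - offset;
      std::size_t n = length < room ? length : room;
      std::memcpy(pages[pageid].data() + offset, data, n);
      data += n;
      length -= n;
      offset += n;                                 // cursor moves by bytes copied, nothing else
    }
  }
};
```

Appending 10 bytes and then 26 more fills pages 0 and 1 completely and leaves the cursor at page 2, offset 4; the bytes read back contiguous across the page seams.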
5.8 logpb_end_append — close the record, point forward

```cpp
// logpb_end_append -- src/transaction/log_page_buffer.c
static void
logpb_end_append (THREAD_ENTRY * thread_p, LOG_RECORD_HEADER * header)
{
  // ... condensed: align + ADVANCE_WHEN_DOESNOT_FIT position the cursor at next slot ...
  assert (LSA_EQ (&header->forw_lsa, &log_Gl.hdr.append_lsa));   /* <- forw_lsa = next slot */

  if (!LSA_EQ (&log_Gl.append.prev_lsa, &log_Gl.hdr.append_lsa))
    logpb_set_dirty (thread_p, log_Gl.append.log_pgptr);   /* dirty if cursor moved off prev */

  if (log_Pb.partial_append.status == LOGPB_APPENDREC_IN_PROGRESS)
    ;   /* normal: fall through */
  else if (log_Pb.partial_append.status == LOGPB_APPENDREC_PARTIAL_FLUSHED_END_OF_LOG)
    {
      log_Pb.partial_append.status = LOGPB_APPENDREC_PARTIAL_ENDED;
      logpb_flush_all_append_pages (thread_p);   /* re-flush correct version */
    }
  else
    assert_release (false);   /* invalid state */

  log_Pb.partial_append.status = LOGPB_APPENDREC_SUCCESS;   /* record now stable */
}
```

After the cursor is repositioned and the forw_lsa assert (partner to back_lsa) confirms it, the state machine branches: IN_PROGRESS falls through; PARTIAL_FLUSHED_END_OF_LOG (a forced flush swapped in an EOF sentinel) re-flushes the real record (Chapter 7); else → assert_release(false). All end at SUCCESS.
INVARIANT (record bracketing). Between logpb_start_append (IN_PROGRESS) and logpb_end_append (SUCCESS), exactly one record is mid-write. A forced flush seeing IN_PROGRESS knows it caught a partial record; SUCCESS means the page is safe to flush. Breaking the bracket lets a half-written record reach disk unmarked.
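The bracket can be written down as a small state machine. The sketch below is illustrative only: AppendStatus mirrors the four LOGPB_APPENDREC_* values named in this chapter, PartialAppend and its methods are hypothetical, and forced_flush encodes an assumption about the flush side (Chapter 7), namely that a flush which observes IN_PROGRESS moves the status to PARTIAL_FLUSHED_END_OF_LOG.

```cpp
#include <cassert>

enum class AppendStatus { Success, InProgress, PartialFlushedEndOfLog, PartialEnded };

struct PartialAppend {
  AppendStatus status = AppendStatus::Success;
  int reflushes = 0;                     // counts the re-flush taken in end_append

  void start_append() { status = AppendStatus::InProgress; }

  // Assumption: models a forced flush that lands while a record is mid-write.
  void forced_flush() {
    if (status == AppendStatus::InProgress) {
      status = AppendStatus::PartialFlushedEndOfLog;
    }
  }

  // Returns false where the quoted code would hit assert_release (false).
  bool end_append() {
    if (status == AppendStatus::PartialFlushedEndOfLog) {
      status = AppendStatus::PartialEnded;
      ++reflushes;                       // stands in for logpb_flush_all_append_pages
    } else if (status != AppendStatus::InProgress) {
      return false;                      // end without a matching start
    }
    status = AppendStatus::Success;      // record now stable
    return true;
  }
};
```

A plain start/end pair finishes at Success with no re-flush; inserting a forced flush between the two triggers exactly one re-flush; calling end_append without a start is rejected.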
5.9 Chapter summary — key takeaways
- Two locks, two jobs. prior_lsa_mutex serializes appenders; LOG_CS serializes them against the single writer.
- Detach, then drain. Reset header/tail/list_size under the mutex, release, then copy.
- LSN order, freed immediately. Each node is copied via logpb_append_next_record, then its segments and the node itself are freed.
- Three checks prove the chain. back_lsa == prev_lsa, forw_lsa == append_lsa, start_lsa == append_lsa; divergence is fatal.
- Cursor stays honest. logpb_append_data advances append_lsa.offset by exactly the bytes copied.
- Dirty, not flushed. logpb_set_dirty only flips LOG_BUFFER::dirty; flush/checksum/WAL are Chapter 7.
- Boundary crossing deferred. Every logpb_next_append_page hands off to Chapter 6.
Chapter 6: Crossing a Log Page Boundary
The drain loop of Chapter 5 streams a prior node’s bytes into log_Gl.append.log_pgptr one fragment at a time. When the running offset log_Gl.hdr.append_lsa.offset reaches the page’s usable limit (LOGAREA_SIZE), the appender must seal the full page, register it for flush, and obtain a fresh logical page.
The reader question: what happens when a record does not fit, and how is a fresh page fetched while the full one is queued for flush? The mid-stream answer is logpb_next_append_page; the first-page bootstrap is logpb_fetch_start_append_page and its stripped twin logpb_fetch_start_append_page_new. All obtain a buffer frame and initialize its header through logpb_create_page / logpb_locate_page. WAL and the append/flush split are in the high-level companion; flush durability is Chapter 7. The prior-side mirror log_prior_lsa_append_advance_when_doesnot_fit (Chapter 4) reserves the address space across the page tail before any bytes exist; this chapter fetches the physical frame for that address.
6.1 Who triggers the crossing
Record-assembly code rarely calls logpb_next_append_page directly; the exception on the drain path is the boundary-span loop of logpb_append_data (5.7). Everywhere else two macros own that decision, both comparing log_Gl.hdr.append_lsa.offset against LOGAREA_SIZE (the alignment/advance arithmetic is Chapter 4 / Chapter 5 material): LOG_APPEND_ALIGN crosses after a fragment when the DOUBLE_ALIGNMENT-rounded offset reaches the limit; LOG_APPEND_ADVANCE_WHEN_DOESNOT_FIT(length) crosses before writing when offset + length would overrun, so the fragment lands whole on the next page.
That after-vs-before split is why logpb_next_append_page takes current_setdirty. LOG_APPEND_ALIGN (reached via LOG_APPEND_SETDIRTY_ADD_ALIGN with LOG_SET_DIRTY) has already dirtied the page it is leaving; the ADVANCE macro crosses before any byte is written, so nothing new is dirty yet. Both macros therefore pass LOG_DONT_SET_DIRTY, and the seal branch inside logpb_next_append_page is never taken from either of them. It is taken only by direct callers that did not pre-seal, such as the boundary-span loop of logpb_append_data (5.7), which passes LOG_SET_DIRTY.
logpb_set_dirty flips one boolean on the page’s buffer frame:

```cpp
// logpb_set_dirty -- src/transaction/log_page_buffer.c
void
logpb_set_dirty (THREAD_ENTRY * thread_p, LOG_PAGE * log_pgptr)
{
  LOG_BUFFER *bufptr;
  bufptr = logpb_get_log_buffer (log_pgptr);   /* <- recovers frame from page address */
  // ... condensed ...
  bufptr->dirty = true;
}
```

Invariant (dirty-before-detach): a page that received append bytes must be bufptr->dirty before log_Gl.append.log_pgptr is repointed away. Every write path runs LOG_APPEND_SETDIRTY_ADD_ALIGN (calling LOG_APPEND_ALIGN with LOG_SET_DIRTY) before the offset reaches LOGAREA_SIZE. If violated, the full page sits in toflush[] un-dirtied and the flusher skips it, losing committed records.
6.2 The structs at the seam
log_append_info is the appender’s fixed cursor; log_page / log_hdrpage are the physical page layout; log_flush_info is the hand-off queue to the flusher.
log_append_info (log_append.hpp) — one global, log_Gl.append.
| Field | Role | Why it exists |
|---|---|---|
vdes | Volume descriptor (fd) of the active log | The eventual fileio_write target |
nxio_lsa | atomic<LOG_LSA>: lowest LSA not yet on disk | WAL boundary; re-pointed by the fetch helpers (6.5), read by Chapter 7 |
prev_lsa | LSA of the last appended record | logpb_start_append checks back_lsa == prev_lsa; logical, so survives a crossing unchanged (6.7) |
log_pgptr | The currently fixed append page | The pointer the crossing nulls then re-points |
appending_page_tde_encrypted | Pages created mid-append must be TDE-encrypted | Propagates the record’s encryption decision onto new mid-record pages (6.6) |
log_hdrpage (log_storage.hpp) — header at the front of every log page.
| Field | Role | Why it exists |
|---|---|---|
| `logical_pageid` | `LOG_PAGEID`: page’s address in the infinite log | So readers/flushers know which logical page a frame holds |
| `offset` | `PGLENGTH`: byte offset of the first full record here | Salvage anchor for recovery if the prior page is corrupt |
| `flags` | `short` bitfield; today only TDE bits | Carries `LOG_HDRPAGE_FLAG_ENCRYPTED_AES`/`_ARIA`; set via `logpb_set_tde_algorithm` |
| `checksum` | `int`: CRC32 over the page | Consistency check; memset garbage at create, computed at write time (6.4) |
log_page (log_storage.hpp): LOG_HDRPAGE hdr followed by char area[1] (record region, sized LOGAREA_SIZE). Never sizeof it; use LOG_PAGESIZE.
log_flush_info (log_impl.h) — the queue the crossing pushes into; one global, log_Gl.flush_info.
| Field | Role | Why it exists |
|---|---|---|
| `max_toflush` | Capacity of `toflush` | Threshold that forces a flush when the queue fills |
| `num_toflush` | Count of queued pages | Incremented under `flush_mutex` per crossing |
| `toflush` | `LOG_PAGE **`: ordered pages awaiting flush | Hand-off list; array order = flush order |
| `flush_mutex` | Mutex (SERVER_MODE) over the three above | Lets the Log Flush Thread and appender share the queue safely |
```mermaid
graph TD
    subgraph append_cursor
        AI["log_append_info<br/>log_Gl.append"]
        AI -->|log_pgptr| PG["LOG_PAGE (current)"]
        AI -->|appending_page_tde_encrypted| TDE["TDE decision"]
    end
    PG -->|hdr| HDR["log_hdrpage<br/>logical_pageid / offset / flags / checksum"]
    subgraph flush_queue
        FI["log_flush_info<br/>log_Gl.flush_info"]
        FI -->|toflush num_toflush| Q["LOG_PAGE *[] ordered"]
    end
    PG -.->|enqueued on crossing| Q
```
Figure 6-1. Struct relationships: the crossing repoints log_pgptr to a new LOG_PAGE and pushes a page into toflush[].
6.3 logpb_next_append_page: branch-complete walkthrough
```c
// logpb_next_append_page -- src/transaction/log_page_buffer.c
  assert (LOG_CS_OWN_WRITE_MODE (thread_p));  /* (entry) LOG CS held write-exclusive */

  if (current_setdirty == LOG_SET_DIRTY)
    {
      logpb_set_dirty (thread_p, log_Gl.append.log_pgptr);  /* (A) seal old page */
    }

  log_Gl.append.log_pgptr = NULL;  /* (B) detach */
  log_Gl.hdr.append_lsa.pageid++;  /* (C) pageid++, offset=0 */
  log_Gl.hdr.append_lsa.offset = 0;

  if (LOGPB_AT_NEXT_ARCHIVE_PAGE_ID (log_Gl.hdr.append_lsa.pageid))
    {
      logpb_archive_active_log (thread_p);  /* (D) wrap onto unarchived slot */
    }

  if (LOGPB_IS_FIRST_PHYSICAL_PAGE (log_Gl.hdr.append_lsa.pageid))
    {
      log_Gl.hdr.fpageid += LOGPB_ACTIVE_NPAGES;  /* (E) cycled */
      logpb_flush_header (thread_p);
    }

  log_Gl.append.log_pgptr = logpb_create_page (thread_p, log_Gl.hdr.append_lsa.pageid);  /* (F) */
  if (log_Gl.append.log_pgptr == NULL)
    {
      logpb_fatal_error (thread_p, true, ARG_FILE_LINE, "log_next_append_page");  /* (G) */
      return;
    }

  if (log_Gl.append.appending_page_tde_encrypted)  /* (H) propagate TDE, see 6.6 */
    {
      /* ... logpb_set_tde_algorithm + logpb_set_dirty ... */
    }

  rv = pthread_mutex_lock (&flush_info->flush_mutex);
  flush_info->toflush[flush_info->num_toflush++] = log_Gl.append.log_pgptr;  /* (I) enqueue NEW page */
  need_flush = (flush_info->num_toflush >= flush_info->max_toflush);  /* (J) queue full? */
  pthread_mutex_unlock (&flush_info->flush_mutex);

  if (need_flush)
    {
      logpb_flush_all_append_pages (thread_p);  /* (K) forced flush, outside the mutex */
    }
```

Figure 6-2 traces every branch; two are non-obvious. (B): between detach and (F) there is no current append page, but the write-exclusive LOG CS (entry assert) means no other appender ever observes the gap. (I): the page enqueued is the fresh empty page, not the one just filled. That one was queued at its own birth, so every page enters `toflush[]` exactly once.
The rest (D archive-wrap → Chapter 10, E ring-wrap header bump, G fatal-NULL) are labelled in the excerpt and flowchart.
```mermaid
flowchart TD
    S["enter, LOG CS write-held"] --> A{"current_setdirty == LOG_SET_DIRTY?"}
    A -->|yes| A1["logpb_set_dirty(old page)"]
    A -->|no| B
    A1 --> B["log_pgptr = NULL; pageid++; offset = 0"]
    B --> D{"LOGPB_AT_NEXT_ARCHIVE_PAGE_ID?"}
    D -->|yes| D1["logpb_archive_active_log"]
    D -->|no| E
    D1 --> E{"LOGPB_IS_FIRST_PHYSICAL_PAGE?"}
    E -->|yes| E1["fpageid += ACTIVE_NPAGES; logpb_flush_header"]
    E -->|no| F
    E1 --> F["log_pgptr = logpb_create_page(pageid)"]
    F --> G{"log_pgptr == NULL?"}
    G -->|yes| G1["logpb_fatal_error -> return"]
    G -->|no| H{"appending_page_tde_encrypted?"}
    H -->|yes| H1["set_tde_algorithm; set_dirty"]
    H -->|no| I
    H1 --> I["lock flush_mutex; toflush[num++] = new page"]
    I --> J{"num_toflush >= max_toflush?"}
    J -->|yes| J1["need_flush = true"]
    J -->|no| K
    J1 --> K["unlock flush_mutex"]
    K --> L{"need_flush?"}
    L -->|yes| L1["logpb_flush_all_append_pages"]
    L -->|no| Z["return"]
    L1 --> Z
```
Figure 6-2. logpb_next_append_page control flow, every branch.
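The hand-off in steps (I) through (K) can be modelled in isolation. The following is a minimal sketch, not CUBRID code: `Page`, `FlushQueue`, `enqueue_new_page`, `drain` and `cross_page` are hypothetical names, `std::mutex` stands in for `flush_mutex`, and `drain` simply empties the queue where the real engine writes pages and re-seeds the live append page.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for LOG_PAGE; only the logical page id matters here.
struct Page
{
  long pageid;
};

// Hypothetical model of log_flush_info: a bounded, ordered hand-off queue.
struct FlushQueue
{
  std::size_t max_toflush;
  std::vector<Page *> toflush;
  std::mutex flush_mutex;

  explicit FlushQueue (std::size_t capacity) : max_toflush (capacity) {}

  // Steps (I) and (J): enqueue the NEW page and decide, under the mutex,
  // whether the queue has reached its threshold.
  bool enqueue_new_page (Page *new_page)
  {
    std::lock_guard<std::mutex> guard (flush_mutex);
    toflush.push_back (new_page);
    return toflush.size () >= max_toflush;
  }

  // Step (K): called only after enqueue_new_page has released the mutex.
  // Returns how many pages were handed to the (imaginary) writer.
  std::size_t drain ()
  {
    std::lock_guard<std::mutex> guard (flush_mutex);
    std::size_t n = toflush.size ();
    toflush.clear ();
    return n;
  }
};

// One page crossing: returns true when the crossing forced a flush.
inline bool cross_page (FlushQueue &q, Page *new_page)
{
  bool need_flush = q.enqueue_new_page (new_page);  // decision under the mutex
  if (need_flush)
    {
      q.drain ();  // execution outside the enqueue critical section
    }
  return need_flush;
}
```

With a capacity of three, the first two crossings only enqueue and the third one triggers the drain. Keeping the decision and the execution in separate critical sections is what lets the flush take the mutex itself without self-deadlock.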
6.4 Obtaining the frame: logpb_locate_page for NEW_PAGE
`logpb_create_page (thread_p, pageid)` is a one-line wrapper: `return logpb_locate_page (thread_p, pageid, NEW_PAGE);`. `logpb_locate_page` maps a logical pageid to a buffer frame and, for `NEW_PAGE`, initializes the header in place without touching disk. The branches that matter:
```c
// logpb_locate_page -- src/transaction/log_page_buffer.c
  index = logpb_get_log_buffer_index (pageid);  /* ring hash -> frame slot; bad index -> NULL */
  log_bufptr = &log_Pb.buffers[index];

  if (log_bufptr->pageid != NULL_PAGEID && log_bufptr->pageid != pageid)
    {
      /* frame holds a DIFFERENT page */
      if (log_bufptr->dirty == true)
        {
          assert_release (false);  /* must not victimize dirty */
          ...
        }
      log_bufptr->pageid = NULL_PAGEID;  /* invalidate */
    }

  if (log_bufptr->pageid == NULL_PAGEID)
    {
      if (fetch_mode == NEW_PAGE)
        {
          memset (log_bufptr->logpage, LOG_PAGE_INIT_VALUE, LOG_PAGESIZE);  /* 0xff fill */
          log_bufptr->logpage->hdr.logical_pageid = pageid;  /* (1) */
          log_bufptr->logpage->hdr.offset = NULL_OFFSET;  /* (2) */
          log_bufptr->logpage->hdr.flags = 0;  /* (3) clears any TDE bits */
        }
      else  /* OLD_PAGE */
        {
          if (logpb_read_page_from_file (...) != NO_ERROR)
            {
              return NULL;
            }
        }
    }
  else
    {
      assert (fetch_mode == OLD_PAGE);  /* frame already holds exactly this page */
    }
```

The three header writes initialize the new page: (1) `logical_pageid = pageid` makes the frame be this logical page; (2) `offset = NULL_OFFSET` records that no record starts here yet, and `logpb_start_append` later overwrites it with the first-record offset; (3) `flags = 0` clears stale TDE bits, which is why (H) must re-apply the algorithm. `checksum` is not set here: it keeps the `memset` fill until `logpb_set_page_checksum` (called by `logpb_writev_append_pages` per page before `fileio_write`) runs `log_pgptr->hdr.checksum = checksum_crc32;`.
Invariant (one frame per ring slot): the assert_release (false) encodes that the appender must never evict a dirty frame for a new append page. The ring is sized so a slot’s prior occupant is flushed before reuse. Violating it overwrites a page the flusher believed safe — silent log corruption.
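A small model makes the slot-reuse rule concrete. This is a sketch under stated assumptions: the slot is taken to be `pageid` modulo the frame count (the source only says "ring hash"), `Frame`, `Ring`, `slot_of` and `create_page` are hypothetical names, and the dirty-victim case returns `nullptr` where the real code hits `assert_release (false)`. The real code also asserts that a `NEW_PAGE` request never finds its own pageid already resident; the sketch simply re-initializes in that case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr long NULL_PAGEID_SKETCH = -1;    // hypothetical sentinel value
constexpr unsigned char INIT_BYTE = 0xff;  // mirrors the 0xff fill described above

// Hypothetical frame: a buffer slot plus a toy page with two header fields.
struct Frame
{
  long pageid = NULL_PAGEID_SKETCH;
  bool dirty = false;
  long hdr_logical_pageid = NULL_PAGEID_SKETCH;
  short hdr_flags = 0;
  unsigned char area[16];
};

struct Ring
{
  std::vector<Frame> frames;

  explicit Ring (std::size_t n) : frames (n) {}

  // Assumption: the ring slot is the page id modulo the number of frames.
  std::size_t slot_of (long pageid) const
  {
    return static_cast<std::size_t> (pageid) % frames.size ();
  }

  // NEW_PAGE path only. Refuses to evict a dirty occupant.
  Frame *create_page (long pageid)
  {
    Frame &f = frames[slot_of (pageid)];
    if (f.pageid != NULL_PAGEID_SKETCH && f.pageid != pageid)
      {
        if (f.dirty)
          {
            return nullptr;  // a dirty victim would mean silent log corruption
          }
        f.pageid = NULL_PAGEID_SKETCH;  // invalidate the clean occupant
      }
    std::memset (f.area, INIT_BYTE, sizeof (f.area));  // fill first
    f.pageid = pageid;
    f.hdr_logical_pageid = pageid;  // then stamp the header fields
    f.hdr_flags = 0;                // any earlier TDE bits are gone
    f.dirty = false;
    return &f;
  }
};
```

With four frames, pages 5 and 9 share slot 1, so creating page 9 succeeds only once page 5 has been flushed and its dirty bit cleared.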
6.5 The two public entry points
logpb_next_append_page handles mid-stream crossings; two public functions handle the first page of an append session, bootstrapping log_pgptr and re-anchoring WAL.
```c
// logpb_fetch_start_append_page -- src/transaction/log_page_buffer.c
  PAGE_FETCH_MODE flag = OLD_PAGE;

  if ((log_Gl.hdr.append_lsa.pageid == FIRST_LOG_PAGEID)  /* NDEBUG: ==0; else PRM_ID_FIRST_LOG_PAGEID */
      && (log_Gl.hdr.append_lsa.offset == 0))
    {
      flag = NEW_PAGE;  /* empty log: skip the read */
    }

  if (log_Gl.append.log_pgptr != NULL)
    {
      logpb_invalid_all_append_pages (thread_p);  /* stale append page: discard */
    }

  log_Gl.append.log_pgptr = logpb_locate_page (thread_p, log_Gl.hdr.append_lsa.pageid, flag);
  if (log_Gl.append.log_pgptr == NULL)
    {
      return ER_FAILED;
    }

  log_Gl.append.set_nxio_lsa (log_Gl.hdr.append_lsa);  /* (*) re-anchor WAL boundary */

  // ... same flush_mutex enqueue as 6.3: toflush[num_toflush++] = log_pgptr; threshold -> need_flush ...
  if (need_flush)
    {
      logpb_flush_pages_direct (thread_p);  /* note: direct, not _all_append */
    }
```

Two branches distinguish it from the mid-stream path. Flag selection: an empty log (pageid equal to `FIRST_LOG_PAGEID`, offset 0) is fetched `NEW_PAGE` with no read; otherwise `OLD_PAGE` reads back the on-disk page (e.g. restart resuming a half-full last page) so the appender continues after the tail. Stale-page discard: a non-NULL `log_pgptr` on entry triggers `logpb_invalid_all_append_pages`. The (*) line re-anchors WAL: `nxio_lsa` jumps to the current append position (“everything below here is durable”); mid-stream `logpb_next_append_page` never touches `nxio_lsa`, since within a session the boundary moves only when pages actually flush (Chapter 7).
```c
// logpb_fetch_start_append_page_new -- src/transaction/log_page_buffer.c
  log_Gl.append.log_pgptr = logpb_locate_page (thread_p, log_Gl.hdr.append_lsa.pageid, NEW_PAGE);
  if (log_Gl.append.log_pgptr == NULL)
    {
      return NULL;  /* caller handles NULL */
    }

  log_Gl.append.set_nxio_lsa (log_Gl.hdr.append_lsa);
  return log_Gl.append.log_pgptr;
```

`_new` is the stripped variant: always `NEW_PAGE`, no stale-page check, no enqueue and no flush threshold. It serves callers (log creation / format) that want a fresh first page but manage flushing themselves. Before reusing it, note that it skips the `toflush[]` bookkeeping the other two perform.
6.6 TDE flag propagation across the boundary
The encryption decision belongs to the record but is enforced per page via log_Gl.append.appending_page_tde_encrypted. logpb_append_next_record (Chapter 5) owns the flag’s lifetime: it sets the flag from prior_is_tde_encrypted (node) before assembly and resets it to false after logpb_end_append. While true, two re-stamp sites carry the TDE bits onto pages entered during assembly. Site (H) inside logpb_next_append_page (6.3) calls logpb_set_tde_algorithm then logpb_set_dirty on the just-created page. The second site, in logpb_start_append, guards against re-stamping:
```c
// logpb_start_append -- src/transaction/log_page_buffer.c
  if (log_Gl.append.appending_page_tde_encrypted)
    {
      if (!LOG_IS_PAGE_TDE_ENCRYPTED (log_Gl.append.log_pgptr))  /* idempotent guard */
        {
          TDE_ALGORITHM tde_algo = (TDE_ALGORITHM) prm_get_integer_value (PRM_ID_TDE_DEFAULT_ALGORITHM);
          logpb_set_tde_algorithm (thread_p, log_Gl.append.log_pgptr, tde_algo);
          logpb_set_dirty (thread_p, log_Gl.append.log_pgptr);
        }
    }
```

`logpb_set_tde_algorithm` writes `hdr.flags` (clear the encrypted mask, OR the algorithm bit). Because `logpb_locate_page` zeroed `hdr.flags` at create (6.4 step 3), the new page starts un-encrypted and (H) re-stamps it. The trailing `logpb_set_dirty` is essential: a page that is not dirty is skipped by the flusher, so the bit could be lost. The `logpb_start_append` guard with `LOG_IS_PAGE_TDE_ENCRYPTED` keeps it from re-stamping a page (H) already handled.
Invariant (encryption follows the record across pages): if record R is TDE-encrypted, every page R touches — including pages allocated mid-R — has a non-zero TDE flag. Enforced by appending_page_tde_encrypted staying true for R’s whole assembly plus the (H) re-stamp on each new page. If broken, part of an encrypted record is written in clear text: logpb_writev_append_pages checks LOG_IS_PAGE_TDE_ENCRYPTED per page at write time, so a missing flag means that page silently skips encryption.
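The "clear the mask, OR the algorithm bit" update and the idempotent guard can be sketched as below. The bit values are invented for illustration (the real `LOG_HDRPAGE_FLAG_ENCRYPTED_*` constants may differ), and `PageHeader`, `TdeAlgo`, `is_tde_encrypted`, `set_tde_algorithm` and `restamp_if_needed` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical bit values; the real LOG_HDRPAGE_FLAG_ENCRYPTED_* constants may differ.
constexpr short FLAG_ENCRYPTED_AES = 0x1;
constexpr short FLAG_ENCRYPTED_ARIA = 0x2;
constexpr short FLAG_ENCRYPTED_MASK = FLAG_ENCRYPTED_AES | FLAG_ENCRYPTED_ARIA;

enum class TdeAlgo { none, aes, aria };

// Toy page header carrying only the flags field.
struct PageHeader
{
  short flags = 0;
};

inline bool is_tde_encrypted (const PageHeader &hdr)
{
  return (hdr.flags & FLAG_ENCRYPTED_MASK) != 0;
}

// Clear the encrypted mask, then OR in the bit for the requested algorithm.
inline void set_tde_algorithm (PageHeader &hdr, TdeAlgo algo)
{
  hdr.flags = static_cast<short> (hdr.flags & ~FLAG_ENCRYPTED_MASK);
  if (algo == TdeAlgo::aes)
    {
      hdr.flags |= FLAG_ENCRYPTED_AES;
    }
  else if (algo == TdeAlgo::aria)
    {
      hdr.flags |= FLAG_ENCRYPTED_ARIA;
    }
}

// Guarded re-stamp: touches the header only when the record being appended
// is encrypted and the page is not yet flagged. Returns true when it changed
// the header, in which case the caller must also mark the page dirty.
inline bool restamp_if_needed (PageHeader &hdr, bool appending_encrypted, TdeAlgo algo)
{
  if (appending_encrypted && !is_tde_encrypted (hdr))
    {
      set_tde_algorithm (hdr, algo);
      return true;
    }
  return false;
}
```

Calling `restamp_if_needed` twice on the same header changes it only the first time, which is the property the `LOG_IS_PAGE_TDE_ENCRYPTED` guard provides.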
6.7 What survives the crossing
log_Gl.append.prev_lsa is logical, not page-relative, so logpb_next_append_page leaves it untouched — it keeps pointing at the last appended record regardless of page, letting logpb_start_append validate header->back_lsa == prev_lsa even when the new record begins on a freshly crossed page. prev_lsa advances only in logpb_start_append (LSA_COPY (&log_Gl.append.prev_lsa, &log_Gl.hdr.append_lsa)), never in the page fetch. This buffer-side seam pairs with Chapter 4’s prior-side log_prior_lsa_append_advance_when_doesnot_fit: both compute the break from the same LOGAREA_SIZE threshold, so reserved address and materialized frame never disagree.
6.8 Chapter summary — key takeaways
- One function owns the mid-stream crossing. `logpb_next_append_page` optionally seals the full page, nulls `log_pgptr`, advances `append_lsa.pageid`, creates a fresh frame, and queues a page, all under the write-held LOG CS.
- The page enqueued on a crossing is the new page, not the full one: every page enters `toflush[]` exactly once, at its own birth.
- `NEW_PAGE` create writes three header fields, not the checksum. `logpb_locate_page` memsets the page to `0xff` and sets `logical_pageid`, `offset = NULL_OFFSET`, `flags = 0`; `checksum` is computed by `logpb_set_page_checksum` just before `fileio_write`.
- `flags = 0` at create is why TDE must be re-stamped. Three sites touch `appending_page_tde_encrypted`: set/reset in `logpb_append_next_record` plus the re-stamps in `logpb_next_append_page` (H) and `logpb_start_append`, keeping an encrypted record encrypted across page breaks.
- Threshold flush is decided under `flush_mutex`, executed outside it: `num_toflush >= max_toflush` sets `need_flush`, then `logpb_flush_all_append_pages` runs after release.
- `logpb_fetch_start_append_page` vs `_new`. The former chooses `NEW_PAGE`/`OLD_PAGE`, discards stale pages, enqueues, and re-anchors `nxio_lsa`; `_new` is always `NEW_PAGE` and skips the enqueue/threshold work for callers managing their own flush.
- The crossing is the buffer-side mirror of Chapter 4: the prior side reserves address space across the tail before bytes exist, this side fetches the frame for that address, both keyed off `LOGAREA_SIZE`.
Chapter 7: Flush Durability and the WAL Rule
This chapter answers one question: how do dirty log pages reach stable
storage, how is nxio_lsa advanced, and how do group commit and the WAL
invariant cooperate to keep recovery correct? Chapters 3–6 built a prior
list, drained it into the page buffer (Ch 5), and crossed page boundaries
(Ch 6) — all in volatile memory. Here the bytes hit the disk.
For the framing of WAL and why the log must be durable before data pages,
see the companion cubrid-log-manager.md §“Write-Ahead Logging”. We trace
the enforcing code, not the theory.
7.1 The three structures that hold the durability state
Durability is coordinated across three structs: log_append_info
(log_append.hpp) owns the watermark; log_flush_info (log_impl.h) is the
work list of pages to scan; log_buffer (log_page_buffer.c) is the per-page
slot whose dirty bit the flusher clears.
log_append_info holds { int vdes; std::atomic<LOG_LSA> nxio_lsa; LOG_LSA prev_lsa; LOG_PAGE *log_pgptr; bool appending_page_tde_encrypted; }:
| Field | Role | Why it exists |
|---|---|---|
| `vdes` | Active-log fd passed to every `fileio_write`/`fileio_synchronize`. | The flusher must know which fd to write and fsync. |
| `nxio_lsa` | The durability watermark: lowest LSA whose page is not yet forced to disk. Atomic for lock-free reads. | Answers both “is record X durable?” (group-commit) and “must I flush before this data page?” (WAL). |
| `prev_lsa` | Last appended (in-buffer) record, vs `nxio_lsa` (last flushed). | Lets the flusher detect a partial record (`nxio_lsa.pageid == prev_lsa.pageid`) and not validate it early. |
| `log_pgptr` | Append page currently fixed for new records. | After a flush resets `num_toflush`, it is re-seeded as `toflush[0]`. |
| `appending_page_tde_encrypted` | Whether the next page must be TDE-encrypted. | Carries the encryption decision from append time to write time. |
The accessors are trivial (get_nxio_lsa = nxio_lsa.load (),
set_nxio_lsa = nxio_lsa.store ()); the atomicity is the contract.
log_flush_info holds { int max_toflush; int num_toflush; LOG_PAGE **toflush; pthread_mutex_t flush_mutex; } (the mutex is SERVER_MODE only):
| Field | Role | Why it exists |
|---|---|---|
| `max_toflush` | Array capacity; at `num_toflush == max_toflush` the buffer is full (`log_buffer_full_count` ticks). | Bounds the batch; full-list events drive Ch 6’s partial-append path. |
| `num_toflush` | Count of queued pages; `< 1` means nothing to flush. | The loop bound; reset to 0 then re-seeded with the live append page. |
| `toflush` | Array of `LOG_PAGE*`, ascending by pageid. | The contiguity scan walks it to coalesce pages into one writev. |
| `flush_mutex` | Guards the array vs concurrent producers (Ch 5 drain) and the flusher. | Held across the entire flush body (scan, nxio-page write, and the `fileio_synchronize`): acquired in Phase 2’s scan setup and released only after the Phase 6 `nxio_lsa` advance, so the fsync runs while `flush_mutex` is held. A separate short-lived acquisition guards only the Phase-1 `num_toflush` check. The group-commit wait is on `gc_cond`/`gc_mutex` (§7.5), a different lock entirely. |
log_buffer holds { volatile LOG_PAGEID pageid; volatile LOG_PHY_PAGEID phy_pageid; bool dirty; LOG_PAGE *logpage; }:
| Field | Role | Why it exists |
|---|---|---|
| `pageid` | Logical page id; `volatile` as slots recycle. | Flusher asserts `bufptr->pageid == toflush[i]->hdr.logical_pageid` to confirm the slot still holds its page. |
| `phy_pageid` | Physical offset in the active log volume. | Write target `phy_pageid + i`; contiguity needs `phy_pageid+1`, not just `pageid+1`. |
| `dirty` | Has un-flushed changes. | The scan’s primary filter; cleared exactly when the write succeeds. |
| `logpage` | The page bytes (header + area). | What is handed to `fileio_write`/encryption. |
```mermaid
flowchart LR
    FI["LOG_FLUSH_INFO<br/>toflush[ ], num_toflush"]
    LB["LOG_BUFFER<br/>pageid, phy_pageid, dirty, logpage"]
    AI["log_append_info<br/>nxio_lsa, prev_lsa, log_pgptr"]
    FI -->|"toflush[i] resolves to"| LB
    LB -->|"dirty pages written, then"| AI
    AI -->|"nxio_lsa.pageid flushed LAST"| FI
```
Figure 7-1. The three durability structures and how the flusher pivots between them.
Durability invariant. A commit record at LSA L is durable iff
`get_nxio_lsa () > L` (the watermark moved past L’s page). `logpb_flush_all_append_pages` advances `nxio_lsa` only after `fileio_synchronize` returns and the `nxio_lsa` page is written last. If it advanced before the fsync, a crash could leave a committed transaction whose log is not on disk, and recovery would lose it.
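The watermark contract (trivial atomic accessors plus a strict `>` test) can be sketched as follows. This is an illustration under assumptions: the LSA is packed into one unsigned 64-bit integer so that integer order equals (pageid, offset) order, which holds only for non-negative values and ignores the NULL LSA; `PackedLsa`, `pack_lsa`, `Watermark` and `is_durable` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Simplification: pageid in the high 48 bits, offset in the low 16 bits,
// so comparing the packed integers compares pageid first, then offset.
using PackedLsa = std::uint64_t;

constexpr PackedLsa pack_lsa (std::uint64_t pageid, std::uint16_t offset)
{
  return (pageid << 16) | offset;
}

// Hypothetical model of the nxio_lsa part of log_append_info.
class Watermark
{
public:
  explicit Watermark (PackedLsa start) : nxio_lsa_ (start) {}

  // Accessors are a plain atomic load and store, as described above.
  PackedLsa get_nxio_lsa () const { return nxio_lsa_.load (); }
  void set_nxio_lsa (PackedLsa lsa) { nxio_lsa_.store (lsa); }

  // A record is durable only once the watermark has moved strictly past it.
  bool is_durable (PackedLsa record_lsa) const
  {
    return get_nxio_lsa () > record_lsa;
  }

private:
  std::atomic<PackedLsa> nxio_lsa_;
};
```

A record sitting exactly at the watermark is therefore still not durable, which matches the `nxio_lsa <= L` reading of "not yet on disk".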
7.2 logpb_flush_all_append_pages — the durability engine
The only function that writes append pages and advances `nxio_lsa` as pages
flush (the fetch helpers of 6.5 only re-anchor it at session start). It runs
under LOG_CS write mode (`assert (LOG_CS_OWN_WRITE_MODE)`) and returns
1 = flushed, 0 = nothing to do, < 0 = error.
Phase 1: decide whether to flush at all, under a short-lived
flush_mutex acquisition that is released before the body. Two early returns
set need_flush = false and return 0: when num_toflush < 1 (empty list),
and when num_toflush == 1 && !logpb_is_dirty (toflush[0]). The
single-clean-page short-circuit keeps idle timer-driven flushes from
rewriting an unchanged end-of-log marker.
Phase 2: place the end-of-log marker, branching on
`log_Pb.partial_append.status` (Ch 6’s `LOGPB_APPENDREC_*` enum), then
re-acquire `flush_mutex` for the rest of the body:

- `IN_PROGRESS`: a record is half-appended. Copy the header page aside, clear the slot’s `dirty`, overwrite the in-progress header with `LOG_END_OF_LOG`, write that copy via `logpb_write_page_to_disk`; status → `PARTIAL_FLUSHED_END_OF_LOG`. If the page already left the buffer it is fatal → `goto error`.
- `PARTIAL_FLUSHED_END_OF_LOG`: re-entry continuing the flush; log and fall through.
- `PARTIAL_ENDED` / `SUCCESS`: the normal case. Build an `eof` record and `logpb_start_append` it without advancing `append_lsa` (overwritten later).
- anything else: `assert_release (false)` → `goto error`.
Phase 3: the two-step contiguous-run scan (under `flush_mutex`), whose
rule is that the `nxio_lsa` page is flushed last. A `while (true)` alternates
step 1 (skip until a dirty non-nxio page; exit if none remain) and step 2
(extend the run). Step 2 has four break conditions, each a real branch
that ends the run and re-enters the skip phase:

```c
// logpb_flush_all_append_pages (step-2 run conditions) -- src/transaction/log_page_buffer.c
if (!bufptr->dirty) break;                                           /* <- clean stops run */
if (bufptr->pageid == log_Gl.append.get_nxio_lsa ().pageid) break;   /* <- nxio last */
if (prv_bufptr->pageid + 1 != bufptr->pageid) break;                 /* <- logical gap */
if (prv_bufptr->phy_pageid + 1 != bufptr->phy_pageid) break;         /* <- physical gap */
```

The run `[idxflush, i)` then goes to `logpb_writev_append_pages`; a NULL
return is fatal (`goto error`), otherwise `need_sync = true` and each page’s
`dirty` is cleared, but only after the write returns non-NULL.
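The two-step scan can be sketched as a pure function that only computes the runs. This is a minimal model, not the CUBRID loop: `Buf`, `Run` and `collect_runs` are hypothetical names, the queue is a plain vector, and nothing is written or cleared.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the log_buffer fields the scan looks at.
struct Buf
{
  long pageid;
  long phy_pageid;
  bool dirty;
};

using Run = std::pair<std::size_t, std::size_t>;  // half-open [first, last)

// Returns the contiguous dirty runs, never including the nxio page.
inline std::vector<Run> collect_runs (const std::vector<Buf> &toflush, long nxio_pageid)
{
  std::vector<Run> runs;
  std::size_t i = 0;
  while (true)
    {
      // Step 1: skip clean pages and the nxio page.
      while (i < toflush.size ()
             && (!toflush[i].dirty || toflush[i].pageid == nxio_pageid))
        {
          ++i;
        }
      if (i == toflush.size ())
        {
          break;  // no dirty non-nxio page remains
        }

      // Step 2: extend the run until one of the four break conditions hits.
      std::size_t first = i++;
      while (i < toflush.size ())
        {
          const Buf &prv = toflush[i - 1];
          const Buf &cur = toflush[i];
          if (!cur.dirty) break;                              // clean stops run
          if (cur.pageid == nxio_pageid) break;               // nxio goes last
          if (prv.pageid + 1 != cur.pageid) break;            // logical gap
          if (prv.phy_pageid + 1 != cur.phy_pageid) break;    // physical gap
          ++i;
        }
      runs.emplace_back (first, i);
    }
  return runs;
}
```

For a queue holding pages 10 to 15 where page 12 is clean, page 14 wraps to physical slot 0, and page 15 is the nxio page, the scan yields three runs: `[0, 2)`, `[3, 4)` and `[4, 5)`. Page 15 is left for Phase 4.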
Phase 4: flush the `nxio_lsa` page last, branching on whether it holds a
complete record:

```c
// logpb_flush_all_append_pages (nxio page) -- src/transaction/log_page_buffer.c
if (log_Pb.partial_append.status == LOGPB_APPENDREC_SUCCESS
    || nxio_lsa.pageid != log_Gl.append.prev_lsa.pageid)  /* complete -> write it */
  {
    /* assert_release pageid match and dirty, else goto error */
    logpb_write_page_to_disk (thread_p, bufptr->logpage, bufptr->pageid);
    need_sync = true;
    bufptr->dirty = false;
  }
else
  {
    /* skip: nxio page holds an incomplete record, defer until complete */
  }
```

Phase 5: the fsync, still under `flush_mutex`. When `need_sync` is set
and the `PRM_ID_SUPPRESS_FSYNC` sampling escape allows it (`escape == 0`, or
`total_sync_count % escape == 0`), it calls
`fileio_synchronize (thread_p, log_Gl.append.vdes, log_Name_active, false)`;
a `NULL_VOLDES` return is fatal → `goto error`.
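The Phase 5 gate reduces to a small predicate. The sketch below encodes only the condition quoted above; `should_fsync` is a hypothetical name, and where and when `total_sync_count` is incremented is not shown in the excerpt, so the counter is passed in as a plain argument.

```cpp
#include <cassert>

// Decide whether this flush should call fsync.
// need_sync        : at least one page was written in Phases 3 and 4
// escape           : the sampling value read from PRM_ID_SUPPRESS_FSYNC
// total_sync_count : running counter of sync opportunities (assumed non-negative)
inline bool should_fsync (bool need_sync, unsigned escape, unsigned long total_sync_count)
{
  if (!need_sync)
    {
      return false;  // nothing was written, nothing to force
    }
  // escape == 0 disables suppression; otherwise only every escape-th
  // opportunity reaches the disk.
  return escape == 0 || total_sync_count % escape == 0;
}
```

With `escape == 4`, counts 0, 4, 8 and so on pass the gate and the rest are skipped, which trades durability for throughput when the parameter is set.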
Phase 6: advance `nxio_lsa`, again branching on `partial_append.status`:

- `LOGPB_APPENDREC_PARTIAL_ENDED`: restore the original record header, rewrite + fsync again, then `set_nxio_lsa (log_Gl.hdr.append_lsa)`; status → `PARTIAL_FLUSHED_ORIGINAL`.
- `LOGPB_APPENDREC_PARTIAL_FLUSHED_END_OF_LOG`: cannot validate yet; `set_nxio_lsa (log_Gl.append.prev_lsa)` (one record short).
- `LOGPB_APPENDREC_SUCCESS`: `set_nxio_lsa (log_Gl.hdr.append_lsa)`.
- else: `assert_release (false)` → `goto error`.
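The Phase 6 branches above amount to a small state-to-watermark mapping. The sketch below is illustrative only: LSAs are flattened to a single `long`, the enumerators mirror the `LOGPB_APPENDREC_*` names mentioned in this chapter without claiming to be the full real enum, and `AppendStatus`, `Advance` and `advance_nxio` are hypothetical names. The header restore and second write of the `PARTIAL_ENDED` case are not modelled.

```cpp
#include <cassert>

enum class AppendStatus
{
  success,
  in_progress,
  partial_ended,
  partial_flushed_end_of_log,
  partial_flushed_original
};

// Result of the Phase 6 decision.
struct Advance
{
  bool ok;                   // false models the assert_release / goto error path
  long nxio;                 // value the watermark is set to when ok
  AppendStatus next_status;  // status after the advance
};

inline Advance advance_nxio (AppendStatus status, long append_lsa, long prev_lsa)
{
  switch (status)
    {
    case AppendStatus::partial_ended:
      // Real code first restores the record header, rewrites and fsyncs again.
      return {true, append_lsa, AppendStatus::partial_flushed_original};
    case AppendStatus::partial_flushed_end_of_log:
      // The tail record cannot be validated yet: stop one record short.
      return {true, prev_lsa, status};
    case AppendStatus::success:
      return {true, append_lsa, status};
    default:
      return {false, 0, status};
    }
}
```

Only the `partial_flushed_end_of_log` case leaves the watermark at `prev_lsa`; every other successful branch moves it to the current append position.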
The list is then reset (num_toflush = 0) and, if log_Gl.append.log_pgptr != NULL, that live page is re-seeded as toflush[0]; flush_mutex is
released and the function returns 1. The error: label releases
flush_mutex if still held and calls logpb_fatal_error — every error path
is unrecoverable.
```mermaid
flowchart TD
    A["enter (LOG_CS write)"] --> B{num_toflush?}
    B -->|"< 1, or 1+clean"| Z0["return 0"]
    B -->|flushable| C{partial_append.status}
    C -->|IN_PROGRESS| D["overwrite EOL, write copy"]
    C -->|SUCCESS/PARTIAL_ENDED| E["start_append EOL marker"]
    D --> F["scan toflush: skip clean,<br/>collect dirty run, writev, clear dirty"]
    E --> F
    F --> H{nxio page holds<br/>partial record?}
    H -->|no| I["write nxio page LAST"]
    H -->|yes| K
    I --> K{need_sync?}
    K -->|yes| L["fileio_synchronize"]
    K -->|no| M
    L --> M["advance nxio_lsa per status,<br/>num_toflush=0, reseed log_pgptr"]
    M --> Z1["return 1"]
    D -.->|fatal| ERR["logpb_fatal_error"]
    L -.->|fail| ERR
```
Figure 7-2. Branch-complete flow of logpb_flush_all_append_pages.
7.3 logpb_writev_append_pages — the actual write
The lowest-level write helper. It CRC-stamps every page
(logpb_set_page_checksum, NULL on failure), then loops over npages with
two per-page branches:
```c
// logpb_writev_append_pages -- src/transaction/log_page_buffer.c
if (LOG_IS_PAGE_TDE_ENCRYPTED (log_pgptr))
  {
    /* branch 1: encrypt into enc_pgptr */
    // ... on encrypt failure, turn TDE off for this page ...
  }
if (fileio_write (..., log_pgptr, phy_pageid + i, LOG_PAGESIZE, write_mode) == NULL)
  {
    /* branch 2: ER_LOG_WRITE_OUT_OF_SPACE / ER_LOG_WRITE */
    to_flush = NULL;
    break;
  }
```

Despite the name it loops `fileio_write` page by page at `phy_pageid + i`;
Phase-3 contiguity makes the batch one sequential extent. `write_mode` is
`FILEIO_WRITE_NO_COMPENSATE_WRITE` under DWB. A `to_flush == NULL` return on
any write failure is fatal to the caller.
7.4 The synchronous demand paths
logpb_flush_pages_direct is the core: under assert (LOG_CS_OWN_WRITE_MODE)
it calls logpb_prior_lsa_append_all_list (the Ch 5 drain) then
logpb_flush_all_append_pages (the engine). Two thin wrappers add the
critical section: logpb_force_flush_pages is just LOG_CS_ENTER; logpb_flush_pages_direct; LOG_CS_EXIT, and logpb_force_flush_header_and_pages
adds logpb_flush_header (Ch 10) before the exit — used at checkpoint and
wherever the header’s eof_lsa/append fields must match disk.
7.5 logpb_flush_pages — the four commit modes
`logpb_flush_pages (thread_p, flush_lsa)` is the entry every committing
transaction calls. `!SERVER_MODE` is direct flush under LOG_CS. In
`SERVER_MODE` two fall-backs go direct:

- not restarted, or `flush_lsa` NULL/ISNULL → direct flush, return.
- daemon unavailable (`!log_is_log_flush_daemon_available ()`) → direct flush.

Otherwise it derives a 2×2 policy from async_commit (PRM_ID_LOG_ASYNC_COMMIT)
× group_commit (LOG_IS_GROUP_COMMIT_ACTIVE ()):
```c
// logpb_flush_pages (mode matrix) -- src/transaction/log_page_buffer.c
// async  group | need_wait  need_wakeup_LFT
//   X      X   |   true        true     (sync, non-group: wake daemon, wait)
//   X      O   |   true        false    (sync, group: just wait)
//   O      X   |   false       true     (async: wake daemon, return)
//   O      O   |   false       false    (async+group: just return)
```

The wait loop is the waiter side of group commit: it sleeps on `gc_cond`
(holding `gc_mutex`) until `nxio_lsa` passes its `flush_lsa`:

```c
// logpb_flush_pages (group-commit wait) -- src/transaction/log_page_buffer.c
if (need_wakeup_LFT == false && pgbuf_has_perm_pages_fixed (thread_p))
  need_wakeup_LFT = true;  /* <- holding data pages: push daemon to avoid a stall */

while (LSA_LT (&nxio_lsa, flush_lsa))  // re-read nxio_lsa each iteration
  {
    // ... lock gc_mutex ...
    if (LSA_GE (&nxio_lsa, flush_lsa))
      break;  /* <- re-check under lock: already durable */
    if (need_wakeup_LFT == true)
      log_wakeup_log_flush_daemon ();
    pthread_cond_timedwait (&gc_cond, &gc_mutex, &to);  /* 1000ms deadline */
    need_wakeup_LFT = true;  /* <- after first wait, always nudge daemon */
  }
```

The in-lock re-check prevents a lost-wakeup race; the 1000ms timeout bounds latency. (The shared-fsync mechanics are in §7.6 / takeaway 4 and are not repeated here.)
Open question (carried from the companion). The exact group-commit window policy (how long the daemon coalesces before syncing) is the
`log_get_log_group_commit_interval` `looper` period plus on-demand `wakeup ()` calls. The companion flags the batching/latency trade-off as unresolved; this chapter documents the mechanism, not the tuning.
7.6 The flush daemon and group-commit producer side
The daemon is a `cubthread::daemon` with looper period
`log_get_log_group_commit_interval`; its task body `log_flush_execute`
(in `log_manager.c`) guards on `BO_IS_SERVER_RESTARTED ()` and
`log_Flush_has_been_requested` (returning early if either is false), does one
`LOG_CS_ENTER; logpb_flush_pages_direct; LOG_CS_EXIT` on behalf of all
waiters, then under `gc_mutex` runs `pthread_cond_broadcast (&gc_cond)` and
clears `log_Flush_has_been_requested`. The broadcast (not signal) lets one
flush satisfy every waiter whose `flush_lsa <= nxio_lsa`. The producer side,
`log_wakeup_log_flush_daemon`, called by committers and the WAL path, does
only `log_Flush_has_been_requested = true; log_Flush_daemon->wakeup ();`
(SERVER_MODE only); setting the flag before `wakeup ()` guarantees a daemon
already mid-iteration sees the request next loop.
```mermaid
stateDiagram-v2
    [*] --> Sleeping
    Sleeping --> Checking : timer tick or wakeup
    Checking --> Sleeping : not requested, return
    Checking --> Flushing : requested\n LOG_CS enter
    Flushing --> Broadcasting : flush_pages_direct done\n one fsync
    Broadcasting --> Sleeping : gc_cond broadcast\n clear request
```
Figure 7-3. log_Flush_daemon state cycle. Waiters in logpb_flush_pages observe nxio_lsa advance after Broadcasting.
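The request, flush, broadcast cycle and the waiter's predicate can be sketched with standard C++ primitives. This is an illustration, not the CUBRID daemon: `GroupCommit` and its members are hypothetical, LSAs are flattened to `long`, the flush itself is replaced by a counter, and the waiter neither re-nudges the daemon nor loops on a 1000ms deadline as the real wait does.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

struct GroupCommit
{
  std::atomic<long> nxio_lsa {0};
  std::atomic<bool> flush_requested {false};
  std::mutex gc_mutex;
  std::condition_variable gc_cond;
  int fsync_count = 0;  // touched only by the daemon side in this sketch

  // Producer side: set the flag; the real code then wakes the daemon.
  void request_flush ()
  {
    flush_requested.store (true);
  }

  // One daemon iteration. Returns true when it flushed.
  bool daemon_iteration (long append_lsa)
  {
    if (!flush_requested.load ())
      {
        return false;  // nothing requested: go back to sleep
      }
    ++fsync_count;                // stands in for the single flush + fsync
    nxio_lsa.store (append_lsa);  // watermark moves after the flush
    {
      std::lock_guard<std::mutex> guard (gc_mutex);
      gc_cond.notify_all ();      // broadcast: every covered waiter proceeds
      flush_requested.store (false);
    }
    return true;
  }

  // Waiter side: true if flush_lsa became covered before the timeout.
  bool wait_until_durable (long flush_lsa, std::chrono::milliseconds timeout)
  {
    std::unique_lock<std::mutex> lock (gc_mutex);
    return gc_cond.wait_for (lock, timeout,
                             [&] { return nxio_lsa.load () >= flush_lsa; });
  }
};
```

Because the waiter evaluates its predicate under `gc_mutex`, a broadcast that happens between the check and the sleep cannot be missed. One `daemon_iteration` releases every waiter whose `flush_lsa` is at or below the new watermark, which is the amortization group commit is after.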
7.7 logpb_flush_log_for_wal — the read-side WAL invariant
Called by the page buffer manager before writing any data page, passing
its last-modifying LSA. It enforces WAL with double-checked locking on
logpb_need_wal:
```c
// logpb_flush_log_for_wal -- src/transaction/log_page_buffer.c
if (logpb_need_wal (lsa_ptr))  /* <- cheap atomic check, no lock */
  {
    LOG_CS_ENTER (thread_p);
    if (logpb_need_wal (lsa_ptr))  /* <- re-check under LOG_CS, else someone flushed it */
      logpb_flush_pages_direct (thread_p);
    LOG_CS_EXIT (thread_p);

    assert (LSA_ISNULL (lsa_ptr) || !logpb_need_wal (lsa_ptr));  /* <- post-condition */
  }
```

The predicate `logpb_need_wal (lsa)` is just `LSA_LE (&get_nxio_lsa (), lsa)`:
it is true when the log up to `*lsa` is not yet durable, which makes the
invariant directly testable.
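The double-checked shape can be sketched with a mutex standing in for `LOG_CS`. This is a simplified model, not the CUBRID function: `Wal` and its members are hypothetical, LSAs are flattened to `long`, and the flush is reduced to moving the watermark to the append position.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

struct Wal
{
  std::atomic<long> nxio_lsa {0};
  long append_lsa = 0;  // where the log currently ends
  std::mutex log_cs;    // stand-in for LOG_CS
  int flush_count = 0;

  // True while the log up to page_lsa is not yet durable (nxio_lsa <= page_lsa).
  bool need_wal (long page_lsa) const
  {
    return nxio_lsa.load () <= page_lsa;
  }

  // Stand-in for logpb_flush_pages_direct: everything appended becomes durable.
  void flush_pages_direct ()
  {
    ++flush_count;
    nxio_lsa.store (append_lsa);
  }

  // Called before a data page whose last change is at page_lsa is written.
  void flush_log_for_wal (long page_lsa)
  {
    if (need_wal (page_lsa))  // cheap check, no lock
      {
        std::lock_guard<std::mutex> cs (log_cs);
        if (need_wal (page_lsa))  // re-check: someone may have flushed already
          {
            flush_pages_direct ();
          }
      }
  }
};
```

The first call for a page at LSA 30 flushes once; a later call for a page at LSA 40 finds the watermark already past it and takes no lock at all.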
WAL invariant. No data page modified at LSA L may be written while
`logpb_need_wal (L)` holds (`nxio_lsa <= L`). The buffer manager calls `logpb_flush_log_for_wal` first; its post-condition `assert (!logpb_need_wal (lsa_ptr))` guarantees the log is durable to L before the write. Violating it lets a redo record’s effect reach disk without the record, so recovery cannot reconstruct or undo the change. The two `logpb_need_wal` calls (outside/inside `LOG_CS`) avoid both a needless critical section when already durable and a redundant flush when a concurrent committer advanced `nxio_lsa`.
7.8 Chapter summary — key takeaways
- `nxio_lsa` is the one durability watermark: the lowest not-yet-written LSA, atomic in `log_append_info`, answering both “is this commit durable?” and “must I flush before this data write?”. It advances only after `fileio_synchronize` succeeds.
- `logpb_flush_all_append_pages` flushes the `nxio_lsa` page last. The two-step scan (skip clean, collect contiguous dirty) batches adjacent pages, then writes the watermark page alone, so the new end-of-log is never validated before its predecessors are on disk.
- The main `flush_mutex` spans the whole flush body: scan, nxio-page write, and the `fileio_synchronize` all run while it is held; only the Phase-1 `num_toflush` peek takes a separate short lock. The group-commit wait uses `gc_cond`/`gc_mutex`, a distinct lock.
- Group commit amortizes one fsync over many committers. Waiters block on `gc_cond` and re-check `nxio_lsa` under `gc_mutex`; the daemon does one `logpb_flush_pages_direct` and broadcasts, releasing everyone whose `flush_lsa` is now covered.
- The 2×2 commit matrix (`async_commit` × `group_commit`) picks wake-and-wait, just-wait, wake-and-return, or just-return; non-SERVER and no-daemon paths fall back to direct flush.
- WAL is enforced read-side by `logpb_flush_log_for_wal` via double-checked `logpb_need_wal` around `LOG_CS`; its post-condition asserts the log is durable to the requested LSA before any data page is written.
- The exact group-commit window policy is an open question (from the companion): the mechanism is the daemon’s `looper` interval plus on-demand `wakeup ()`, but the batching/latency tuning is not pinned down here.
Chapter 8: Commit and Abort Lifecycle
This chapter answers one reader question: how does a
transaction-boundary record ride the same prior-list / page-buffer /
flush pipeline that Chapters 3-7 built, force its own durability, and
drive the final state transitions and lock release? Boundary records
are special only in carrying a LOG_REC_DONETIME payload (no undo/redo
data) and in being wrapped by durability and state-machine
discipline. The append mechanics are unchanged — log_commit reuses
prior_lsa_alloc_and_copy_data / prior_lsa_next_record (Chapters 3-4)
and the logpb_flush_pages force path (Chapter 7). We focus on the
wrapping; recovery-side replay is out of scope (cubrid-recovery-manager.md).
8.1 The three structs at the transaction boundary
Three structs meet at commit/abort time: the descriptor log_tdes, the
per-record header log_rec_header (Chapters 1, 3), and the boundary
payload log_rec_donetime.
log_rec_donetime — the commit/abort payload
The entire type-specific payload of a LOG_COMMIT/LOG_ABORT record;
its existence at a known LSA is the information.
```cpp
// log_rec_donetime -- src/transaction/log_record.hpp
struct log_rec_donetime
{
  INT64 at_time;  /* Database creation time. For safety reasons */
};
```

| Field | Role | Why it exists |
|---|---|---|
| `at_time` | Wall-clock `time(NULL)` captured in `log_append_donetime_internal`. | Timestamps completion for forensics. The “Database creation time” comment is stale: the field holds the termination time, and the commit protocol never reads it back. |
Invariant — the donetime record’s LSA is the commit point. The record carries no other state, so durability reduces to “the page holding this LSA is on disk”. Everything in §8.4 makes that true before the client is told the commit succeeded.
log_rec_header — role at the boundary
The generic header (full coverage in Chapter 1) is reused verbatim; at
the boundary only type and prev_tranlsa carry special meaning.

    // log_rec_header -- src/transaction/log_record.hpp
    struct log_rec_header
    {
      LOG_LSA prev_tranlsa;        /* previous log record for the same transaction */
      LOG_LSA back_lsa, forw_lsa;  /* physical backward/forward links */
      TRANID trid;                 /* transaction identifier */
      LOG_RECTYPE type;            /* e.g. LOG_COMMIT, LOG_ABORT */
    };

| Field | Role at a COMMIT/ABORT record | Why it matters here |
|---|---|---|
prev_tranlsa | Closes the per-transaction chain. Recovery never undoes a committed chain, but the link is still written. | Chapter 4 assigns it at attach time from tdes->tail_lsa. |
type | LOG_COMMIT, LOG_ABORT, or LOG_COMMIT_WITH_POSTPONE when postpone work remains. | Recovery dispatch keys on this to decide “this trid is done — do not undo it”. |
back_lsa / forw_lsa | Physical-order links, assigned by the prior-list machinery as for a data record. | Lets the analysis pass scan past the boundary record. |
trid | The committing/aborting transaction’s id. | Recovery groups records by trid. |
log_tdes — the transaction descriptor (boundary-relevant fields)
log_tdes is large; only the fields the commit/abort path reads or
writes are covered. The full struct lives in log_impl.h.

    // log_tdes (excerpt) -- src/transaction/log_impl.h
    struct log_tdes
    {
      int tran_index;
      TRANID trid;
      TRAN_STATE state;
      LOG_LSA head_lsa;
      LOG_LSA tail_lsa;
      LOG_LSA undo_nxlsa;
      LOG_LSA posp_nxlsa;
      LOG_LSA commit_abort_lsa;
      LOG_TOPOPS_STACK topops;  /* topops.last must be < 0 at the boundary */
      void *first_save_entry;
      bool has_supplemental_log;
      // ... condensed ...
    };

Five fields behave identically on both paths, so they get one note
rather than a two-column matrix: tran_index (table index resolved by
LOG_FIND_THREAD_TRAN_INDEX; log_abort_by_tdes rebinds it onto the
executing thread, §8.7), trid (stamped into log_rec_header.trid,
recycled by logtb_get_new_tran_id), head_lsa (informational, never
read by the protocol), topops.last (must be < 0; a live sysop is a
bug — assert(false) + force-attach to outer), and first_save_entry
(freed via spage_free_saved_spaces). The fields whose role diverges
between commit and abort:
| Field | Role at commit | Role at abort |
|---|---|---|
state | ACTIVE -> WILL_COMMIT -> (…_WITH_POSTPONE) -> COMMITTED. | Straight to ABORTED before any rollback. |
tail_lsa | NULL = touched nothing -> skip donetime (§8.3, §8.5); else chain tail the record links to. | Same gate. |
undo_nxlsa | Reset to NULL so a checkpoint during WILL_COMMIT sees no stale cursor. | The rollback cursor log_rollback walks prev_tranlsa from. |
posp_nxlsa | Non-NULL -> postpone pending -> LOG_COMMIT_WITH_POSTPONE (§8.3.1). | Unused. |
has_supplemental_log | If set, a LOG_SUPPLEMENT_TRAN_USER record precedes the commit record (CDC), then cleared. | Just cleared. |
commit_abort_lsa | Stamped with the boundary LSA so a checkpoint can distinguish concluded transactions from live ones. The stamp is applied by the prior-list append in log_append.cpp when the donetime node is materialized, not by the commit/abort functions covered here. | Same. |
flowchart TB
TDES["log_tdes\nstate, tail_lsa,\nundo_nxlsa, posp_nxlsa"]
HDR["log_rec_header\ntype = LOG_COMMIT/ABORT\nprev_tranlsa = tail_lsa"]
DT["log_rec_donetime\nat_time"]
NODE["LOG_PRIOR_NODE\n(Chapter 3)"]
TDES -->|prev_tranlsa = tail_lsa| HDR
HDR --> NODE
DT -->|node->data_header| NODE
NODE -->|prior_lsa_next_record| PL["prior list -> page buffer -> disk"]
Figure 8-1 — log_tdes supplies the chain tail, a log_rec_header of
type LOG_COMMIT/LOG_ABORT is built, and a log_rec_donetime becomes
the node’s data header. From there it is an ordinary prior-list node.
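The assembly in Figure 8-1 can be sketched in isolation. This is a minimal model under stated assumptions, not CUBRID code: ToyLsa, ToyHeader, ToyDonetime, ToyTdes, ToyBoundaryRecord and build_boundary_record are hypothetical stand-ins that keep only the fields the figure names, and the prior-list attach step is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// Hypothetical stand-ins for log_lsa, log_rec_header, log_rec_donetime and log_tdes.
struct ToyLsa { std::int64_t pageid; std::int64_t offset; };

enum class ToyRecType { Commit, Abort };

struct ToyHeader { ToyLsa prev_tranlsa; int trid; ToyRecType type; };
struct ToyDonetime { std::int64_t at_time; };
struct ToyTdes { int trid; ToyLsa tail_lsa; };

struct ToyBoundaryRecord { ToyHeader header; ToyDonetime payload; };

// Builds a boundary record: the header links to the chain tail and the
// payload is nothing but a wall-clock timestamp.
inline ToyBoundaryRecord build_boundary_record (const ToyTdes &tdes, ToyRecType type)
{
  ToyBoundaryRecord rec;
  rec.header.prev_tranlsa = tdes.tail_lsa;   // closes the per-transaction chain
  rec.header.trid = tdes.trid;
  rec.header.type = type;
  rec.payload.at_time = static_cast<std::int64_t> (std::time (nullptr));
  return rec;
}

// Usage: a transaction whose last record sits at (7, 128) commits.
inline void build_boundary_record_example ()
{
  const ToyTdes tdes = { 42, { 7, 128 } };
  const ToyBoundaryRecord rec = build_boundary_record (tdes, ToyRecType::Commit);
  assert (rec.header.prev_tranlsa.pageid == 7);
  assert (rec.header.type == ToyRecType::Commit);
}
```

The sketch makes the point of the figure visible: nothing in the record distinguishes commit from abort except the header type.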
8.2 log_commit — the entry point and its branch fan-out
log_commit resolves the descriptor, validates state, and routes to the
2PC or local path. Every branch:

    // log_commit -- src/transaction/log_manager.c
    if (tdes == NULL)
      return TRAN_UNACTIVE_UNKNOWN;              /* <- fatal: unknown index */
    if (!LOG_ISTRAN_ACTIVE (tdes) && !LOG_ISTRAN_2PC_PREPARE (tdes) && LOG_ISRESTARTED ())
      return tdes->state;                        /* <- not committable; no-op */
    if (tdes->topops.last >= 0)                  /* <- impossible-but-handled */
      {
        assert (false);
        while (tdes->topops.last >= 0)
          log_sysop_attach_to_outer (thread_p);
      }
    if (log_2pc_clear_and_is_tran_distributed (tdes))
      state = log_2pc_commit (...);              /* <- 2PC arm (cubrid-2pc.md) */
    else                                         /* <- local arm */
      {
        state = log_commit_local (thread_p, tdes, retain_lock, true);
        state = log_complete (thread_p, tdes, LOG_COMMIT, LOG_NEED_NEWTRID, LOG_ALREADY_WROTE_EOT_LOG);
      }
    if (log_No_logging)
      { /* force pages + data, flush header */ }
    perfmon_inc_stat (thread_p, PSTAT_TRAN_NUM_COMMITS);
    /* return state */

Invariant: topops.last < 0 at the transaction boundary. Commit
and abort require that no system operation is open. The code warns, asserts in
debug, and force-folds any open sysop into the outer transaction with
log_sysop_attach_to_outer; otherwise that sysop’s records would be left out
of the boundary record’s prev_tranlsa chain.
8.3 log_commit_local — postpone, append, release, flush
log_commit_local does the real work in a strict order dictated by one
rule.
Invariant — nothing may be logged after the transaction enters an
unactive state. If a checkpoint snapshots the transaction as
TRAN_UNACTIVE_WILL_COMMIT and a crash precedes still-pending logging
(e.g. unique statistics), recovery commits it without those changes —
silent data loss. So tx_lob_locator_clear and logtb_complete_mvcc
(both of which log) run before tdes->state = TRAN_UNACTIVE_WILL_COMMIT.
    // log_commit_local -- src/transaction/log_manager.c
    tx_lob_locator_clear (...);
    logtb_complete_mvcc (thread_p, tdes, true);    /* both log -> must precede WILL_COMMIT */
    tdes->state = TRAN_UNACTIVE_WILL_COMMIT;
    LSA_SET_NULL (&tdes->undo_nxlsa);              /* checkpoint must not see a stale undo cursor */
    if (!LSA_ISNULL (&tdes->tail_lsa))             /* <- transaction touched data */
      {
        log_tran_do_postpone (thread_p, tdes);     /* §8.3.1 -- run postpone if any */
        if (is_local_tran)
          {
            if (... log_does_allow_replication () ...)
              log_append_repl_info_and_commit_log (...);             /* repl+commit, one mutex */
            else
              log_append_commit_log (thread_p, tdes, &commit_lsa);   /* plain LOG_COMMIT */
            if (retain_lock != true)
              lock_unlock_all (thread_p);                            /* <- retain_lock gate */
            log_change_tran_as_completed (thread_p, tdes, LOG_COMMIT, &commit_lsa);   /* state + force */
          }
        else
          { /* participant: commit log + unlock deferred to log_complete_for_2pc */ }
      }
    else
      {
        if (retain_lock != true)
          lock_unlock_all (thread_p);              /* <- read-only: no donetime record */
        tdes->state = TRAN_UNACTIVE_COMMITTED;
      }
    return tdes->state;

The replication route takes prior_lsa_mutex once so the replication
and commit records get adjacent LSAs; the plain route just appends the
donetime record. The participant branch (is_local_tran == false) defers
both the commit record and lock release to log_complete_for_2pc
(cubrid-2pc.md).
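The strict order of log_commit_local can be checked with a small trace model. This is a sketch under stated assumptions, not the real function: ToyState, ToyTdes and toy_commit_local are hypothetical, the replication and 2PC participant routes are omitted, and each step is reduced to a label pushed onto a trace.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class ToyState { Active, WillCommit, Committed };

struct ToyTdes
{
  ToyState state;
  bool touched_data;   // models !LSA_ISNULL (tail_lsa)
};

// Returns the ordered trace of steps a local (non-2PC) commit performs in this model.
inline std::vector<std::string> toy_commit_local (ToyTdes &tdes, bool retain_lock)
{
  std::vector<std::string> trace;
  trace.push_back ("lob_clear");        // may log: must run while still Active
  trace.push_back ("complete_mvcc");    // may log: must run while still Active
  tdes.state = ToyState::WillCommit;
  trace.push_back ("WILL_COMMIT");
  if (tdes.touched_data)
    {
      trace.push_back ("postpone");     // returns at once when nothing is postponed
      trace.push_back ("append_commit");
      if (!retain_lock)
        trace.push_back ("unlock");
      trace.push_back ("force");        // the only step that can block on I/O
    }
  else if (!retain_lock)
    {
      trace.push_back ("unlock");       // read-only: no donetime record, no force
    }
  tdes.state = ToyState::Committed;
  return trace;
}

// Usage: both logging steps appear in the trace before the state flips.
inline void toy_commit_local_example ()
{
  ToyTdes tdes = { ToyState::Active, true };
  const std::vector<std::string> trace = toy_commit_local (tdes, false);
  assert (trace[2] == "WILL_COMMIT");
  assert (trace.back () == "force");
}
```

Reading the trace top to bottom gives the same order as the excerpt: logging helpers, state flip, postpone, commit record, unlock, force.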
8.3.1 log_tran_do_postpone — LOG_COMMIT_WITH_POSTPONE
If posp_nxlsa is non-NULL the transaction has deferred actions;
log_tran_do_postpone writes and forces a LOG_COMMIT_WITH_POSTPONE
record before running the postpones (Chapter 9).

    // log_tran_do_postpone -- src/transaction/log_manager.c
    if (LSA_ISNULL (&tdes->posp_nxlsa))
      return;                                    /* <- nothing to postpone */
    assert (tdes->topops.last < 0);
    log_append_commit_postpone (thread_p, tdes, &tdes->posp_nxlsa);   /* COMMIT_WITH_POSTPONE + flush */
    if (tdes->m_log_postpone_cache.do_postpone (*thread_p, tdes->posp_nxlsa))
      {
        perfmon_inc_stat (..., PSTAT_TRAN_NUM_PPCACHE_HITS);
        return;                                  /* cache fast-path */
      }
    log_do_postpone (thread_p, tdes, &tdes->posp_nxlsa);   /* scan forward, run LOG_POSTPONE records */

log_append_commit_postpone sets state = TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE and forces immediately, so the
marker is durable before postpones run (recovery can resume after a
crash). The plain LOG_COMMIT written later closes the transaction.
8.4 Forcing durability — log_append_commit_log and the WAL force
log_append_commit_log is a thin shell over
log_append_donetime_internal, the single place both commit and abort
build their donetime record:

    // log_append_donetime_internal -- src/transaction/log_manager.c
    node = prior_lsa_alloc_and_copy_data (thread_p, iscommitted, RV_NOT_DEFINED, ...);   /* type = LOG_COMMIT/ABORT */
    if (node == NULL)
      return;                                    /* <- alloc failed: eot_lsa stays NULL */
    ((LOG_REC_DONETIME *) node->data_header)->at_time = time (NULL);   /* the only payload field */
    lsa = (with_lock == LOG_PRIOR_LSA_WITH_LOCK)                       /* caller holds prior mutex, else take it */
      ? prior_lsa_next_record_with_lock (...)
      : prior_lsa_next_record (thread_p, node, tdes);
    LSA_COPY (eot_lsa, &lsa);                    /* hand the commit LSA back to the caller */

iscommitted doubles as the record type; with_lock lets the
replication route reuse the mutex it already holds. Then
log_change_tran_as_completed performs the durability force:
    // log_change_tran_as_completed -- src/transaction/log_manager.c
    if (iscommitted == LOG_COMMIT)
      {
        tdes->state = TRAN_UNACTIVE_COMMITTED;
        logpb_flush_pages (thread_p, lsa);       /* <- COMMIT: always force up to commit LSA */
      }
    else
      {
        tdes->state = TRAN_UNACTIVE_ABORTED;
        /* SERVER_MODE only: */
        if (BO_IS_SERVER_RESTARTED ()
            && VOLATILE_ACCESS (log_Gl.run_nxchkpt_atpageid, INT64) == NULL_PAGEID)
          logpb_flush_pages (thread_p, lsa);     /* <- ABORT: force only if checkpoint in flight */
      }

Invariant: a committed transaction’s commit record is on stable
storage before the client is told “committed”. logpb_flush_pages (thread_p, lsa) is the group-commit demand from Chapter 7: it pushes the
committer onto the flush daemon’s waiter set and blocks until
nxio_lsa >= commit_lsa (many committers share one fsync). This is the
only point in the commit path that can block on I/O. The abort branch is
asymmetric by design: a lost un-flushed LOG_ABORT is harmless (recovery
re-undoes anyway), so it forces only when a checkpoint is in flight on a
restarted server, lest that checkpoint reclaim an archive recovery still
needs.
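The force policy reduces to two small predicates. The following is a sketch under stated assumptions, not CUBRID code: ToyLsa, ToyEot, toy_must_force and toy_commit_is_durable are hypothetical names, and the checkpoint_in_flight flag stands in for the run_nxchkpt_atpageid == NULL_PAGEID test in the excerpt.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical (pageid, offset) pair; ordering compares pageid, then offset.
struct ToyLsa { std::int64_t pageid; std::int64_t offset; };

inline bool lsa_ge (const ToyLsa &a, const ToyLsa &b)
{
  return a.pageid > b.pageid || (a.pageid == b.pageid && a.offset >= b.offset);
}

enum class ToyEot { Commit, Abort };

// Does the boundary record need an immediate flush in this model?
inline bool toy_must_force (ToyEot kind, bool server_restarted, bool checkpoint_in_flight)
{
  if (kind == ToyEot::Commit)
    return true;                                     // commit always forces
  return server_restarted && checkpoint_in_flight;   // abort forces only in the narrow case
}

// A committer may be acknowledged once the flushed frontier has reached its commit LSA.
inline bool toy_commit_is_durable (const ToyLsa &nxio_lsa, const ToyLsa &commit_lsa)
{
  return lsa_ge (nxio_lsa, commit_lsa);
}

// Usage: an abort on a restarted server with no checkpoint running does not force.
inline void toy_force_policy_example ()
{
  assert (toy_must_force (ToyEot::Commit, true, false));
  assert (!toy_must_force (ToyEot::Abort, true, false));
}
```

Because the durability test is a plain LSA comparison, many committers waiting on different commit LSAs can all be released by one advance of the flushed frontier.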
8.5 log_complete — final state transition and next-trid
Both log_commit and log_abort finish via log_complete, passing two
enum flags: whether the EOT record has already been written and whether
to recycle the trid. Commit passes LOG_ALREADY_WROTE_EOT_LOG (the record
was already forced, so the else arm only asserts); abort passes
LOG_NEED_TO_WRITE_EOT_LOG (the if arm appends LOG_ABORT).
    // log_complete -- src/transaction/log_manager.c
    if (LSA_ISNULL (&tdes->tail_lsa))
      { /* read-only: set COMMITTED/ABORTED; recycle or clear tdes */ }
    else
      {
        if (wrote_eot_log == LOG_NEED_TO_WRITE_EOT_LOG)     /* <- abort: write LOG_ABORT now */
          {
            log_append_abort_log (...);
            log_change_tran_as_completed (..., LOG_ABORT, &abort_lsa);
          }
        else
          assert (iscommitted == LOG_COMMIT && state == TRAN_UNACTIVE_COMMITTED);   /* commit already wrote it */
        tdes->unlock_global_oldest_visible_mvccid ();       /* always */
        if (iscommitted == LOG_COMMIT)
          log_Gl.mvcc_table.reset_transaction_lowest_active (...);   /* commit only */
        if (get_newtrid == LOG_NEED_NEWTRID)
          logtb_get_new_tran_id (thread_p, tdes);
      }
    if (LOG_ISCHECKPOINT_TIME ())
      log_wakeup_checkpoint_daemon ();           /* or logpb_checkpoint in SA mode */

Branch fan-out:

- tail_lsa NULL. No EOT record; set state, then recycle the trid (LOG_NEED_NEWTRID) or hard-clear (logtb_clear_tdes).
- tail_lsa non-NULL, abort. Emit the abort record, force per §8.4, set state.
- tail_lsa non-NULL, commit. Assert the record was already written and state is TRAN_UNACTIVE_COMMITTED.
- MVCC unblocking (data path). unlock_global_oldest_visible_mvccid always runs; reset_transaction_lowest_active only on commit (cubrid-mvcc.md).
- next-trid. logtb_get_new_tran_id recycles the index with a fresh trid; the donetime record is CUBRID’s EOT marker, not a distinct type.
- checkpoint kick. If the append crossed the threshold, wake the checkpoint daemon or run logpb_checkpoint inline (SA mode).
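The branch fan-out above can be condensed into a decision function. This is a sketch under stated assumptions: ToyEot, ToyWrote, ToyCompletePlan and toy_complete are hypothetical, and the function only reports which actions would run instead of performing them. The checkpoint kick is left out because it does not depend on these inputs.

```cpp
#include <cassert>

enum class ToyEot { Commit, Abort };
enum class ToyWrote { AlreadyWroteEot, NeedToWriteEot };

struct ToyCompletePlan
{
  bool append_abort_record;    // write LOG_ABORT now
  bool unblock_mvcc;           // models unlock_global_oldest_visible_mvccid
  bool reset_lowest_active;    // commit-only MVCC reset
  bool new_trid;               // recycle the index with a fresh trid
};

inline ToyCompletePlan toy_complete (bool tail_lsa_is_null, ToyEot kind, ToyWrote wrote, bool need_newtrid)
{
  ToyCompletePlan plan = { false, false, false, need_newtrid };
  if (tail_lsa_is_null)
    return plan;                                 // read-only: no EOT record at all
  plan.append_abort_record = (wrote == ToyWrote::NeedToWriteEot);
  plan.unblock_mvcc = true;                      // runs for commit and abort alike
  plan.reset_lowest_active = (kind == ToyEot::Commit);
  return plan;
}

// Usage: the commit path arrives with its record already written.
inline void toy_complete_example ()
{
  const ToyCompletePlan plan = toy_complete (false, ToyEot::Commit, ToyWrote::AlreadyWroteEot, true);
  assert (!plan.append_abort_record);
  assert (plan.reset_lowest_active);
}
```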
flowchart TD
A["log_commit(tran_index, retain_lock)"] --> B{"tdes NULL?"}
B -->|yes| Z0["return TRAN_UNACTIVE_UNKNOWN"]
B -->|no| C{"active or 2PC-prepared?"}
C -->|no, restarted| Z1["no-op, return tdes.state"]
C -->|yes| D{"topops.last >= 0?"}
D -->|yes| D1["assert false\nattach_to_outer until < 0"]
D -->|no| E{"distributed 2PC?"}
D1 --> E
E -->|yes| E1["log_2pc_commit\nsee cubrid-2pc.md"]
E -->|no| F["log_commit_local"]
subgraph LOCAL["log_commit_local — strict order"]
direction TB
F --> G["tx_lob_locator_clear\nlogtb_complete_mvcc\nboth LOG before state change"]
G --> H["state = WILL_COMMIT\nundo_nxlsa = NULL"]
H --> I{"tail_lsa NULL?"}
I -->|yes, read-only| J["unlock unless retained\nstate = COMMITTED"]
I -->|no| K["log_tran_do_postpone"]
K --> K1{"posp_nxlsa set?"}
K1 -->|yes| K2["append COMMIT_WITH_POSTPONE\nforce, then run LOG_POSTPONE"]
K1 -->|no| L
K2 --> L["log_append_commit_log\n+repl info if HA"]
L --> M["lock_unlock_all\nunless retain_lock"]
M --> N["log_change_tran_as_completed\nstate = COMMITTED\nlogpb_flush_pages = group-commit force"]
end
E1 --> O["log_complete"]
J --> O
N --> O
O --> P["MVCC unblock · recycle trid\nkick checkpoint if due"]
P --> Z2["return state"]
Figure 8-2 — Commit control flow. The only point that blocks on I/O
is logpb_flush_pages inside log_change_tran_as_completed (the
group-commit force of §8.4); everything before it is bookkeeping. Note
the two ordering invariants the diagram encodes: the records that
tx_lob_locator_clear and logtb_complete_mvcc emit are written
before the state moves to WILL_COMMIT, and an empty tail_lsa
short-circuits to a no-record read-only commit.
8.6 log_abort and log_abort_local — undo before the boundary
log_abort mirrors log_commit’s entry validation with two extra
guards, then routes to log_abort_local -> log_complete.
    // log_abort (excerpt) -- src/transaction/log_manager.c
    if (LOG_HAS_LOGGING_BEEN_IGNORED ())
      {
        er_set (... ER_LOG_CORRUPTED_DB_DUE_NOLOGGING ...);
        return tdes->state;                      /* <- no log to undo */
      }
    if (!LOG_ISTRAN_ACTIVE (tdes) && !LOG_ISTRAN_2PC_PREPARE (tdes))
      return tdes->state;                        /* <- nothing to abort */
    // topops.last >= 0 -> same assert+attach salvage as commit
    state = log_abort_local (thread_p, tdes, true);
    state = log_complete (thread_p, tdes, LOG_ABORT, LOG_NEED_NEWTRID, LOG_NEED_TO_WRITE_EOT_LOG);

The extra LOG_HAS_LOGGING_BEEN_IGNORED guard is the key difference from
commit: with no undo records, rollback is impossible and the database is
declared corrupted. log_abort_local differs from log_commit_local in
ordering: it sets TRAN_UNACTIVE_ABORTED first, then does the work.
    // log_abort_local -- src/transaction/log_manager.c
    tdes->state = TRAN_UNACTIVE_ABORTED;         /* <- set early; rollback logs CLRs, allowed */
    if (!LSA_ISNULL (&tdes->tail_lsa))           /* <- transaction touched data */
      {
        log_rollback (thread_p, tdes, NULL);     /* §8.6.1 -- the undo pass */
        log_cleanup_modified_class_list (thread_p, tdes, NULL, true, true);   /* + free first_save_entry */
      }
    /* both branches: */
    logtb_complete_mvcc (thread_p, tdes, false); /* committed=false -> discard mvccid */
    lock_unlock_all (thread_p);                  /* <- always release; abort never retains */
    tx_lob_locator_clear (thread_p, tdes, false, NULL);
    return tdes->state;

flowchart TD
A["log_abort(tran_index)"] --> B{"logging been ignored?"}
B -->|yes| Z0["ER_LOG_CORRUPTED_DB_DUE_NOLOGGING\nreturn — no undo records exist"]
B -->|no| C{"active or 2PC-prepared?"}
C -->|no| Z1["nothing to abort, return"]
C -->|yes| D["topops salvage\nassert + attach_to_outer"]
D --> E["log_abort_local"]
subgraph LOCAL["log_abort_local — state set FIRST"]
direction TB
E --> F["state = ABORTED\nset early: rollback may log CLRs"]
F --> G{"tail_lsa NULL?"}
G -->|no| H["log_rollback\nwalk prev_tranlsa backward,\nappend compensating CLRs"]
H --> I["log_cleanup_modified_class_list"]
G -->|yes| J
I --> J["logtb_complete_mvcc false\ndiscard mvccid"]
J --> K["lock_unlock_all\nalways — abort never retains"]
end
K --> L["log_complete LOG_ABORT"]
L --> M{"tail_lsa non-NULL?"}
M -->|yes| N["log_append_abort_log\nlog_change_tran_as_completed\nforce only if checkpoint in flight"]
M -->|no| O["set ABORTED, no EOT record"]
N --> P["recycle trid"]
O --> P
P --> Z2["return ABORTED"]
Figure 8-3 — Abort control flow. The mirror image of Figure 8-2 with
two deliberate asymmetries. (1) State first: log_abort_local
sets ABORTED before doing the work, because the rollback pass
itself logs compensating records (CLRs) and those must be allowed
after the state flips — the opposite of commit, where logging after
WILL_COMMIT is forbidden. (2) Lazy force: a lost un-flushed
LOG_ABORT is harmless (recovery re-undoes anyway), so the durability
force fires only when a checkpoint is in flight on a restarted server.
Setting the state early is safe on the abort path, although the commit
path forbids it, because the only records logged afterwards are the
compensation log records (CLRs) that rollback emits: redo-only records
(Chapter 9) that are expected while the transaction is already aborted.
logtb_complete_mvcc(..., false) discards the MVCCID, and log_abort_local
ignores retain_lock: an abort always calls lock_unlock_all.
8.6.1 log_rollback — walking prev_tranlsa backward
log_rollback walks the chain backward from tdes->undo_nxlsa,
re-applying each undo image. Per-record-type CLR generation is Chapter 9;
the branch that matters here is the cursor discipline.
    // log_rollback (control skeleton) -- src/transaction/log_manager.c
    LSA_COPY (&prev_tranlsa, &tdes->undo_nxlsa);               /* start cursor */
    while (!LSA_ISNULL (&prev_tranlsa) && !isdone)
      {
        logpb_fetch_page (...);                                /* fatal on error */
        log_rec = LOG_GET_LOG_RECORD_HEADER (log_pgptr, &log_lsa);
        LSA_COPY (&prev_tranlsa, &log_rec->prev_tranlsa);      /* advance cursor BEFORE undo */
        LSA_COPY (&tdes->undo_nxlsa, &prev_tranlsa);           /* persist cursor (CLR may move it) */
        switch (log_rec->type) { /* ... see Chapter 9 ... */ }
      }

Invariant: the undo cursor is advanced before the undo is applied.
Both prev_tranlsa and tdes->undo_nxlsa move to log_rec->prev_tranlsa
before the undo runs, because applying an undo logs a chained CLR; a
not-yet-advanced cursor could re-undo the record or follow the CLR’s own
link. upto_lsa (NULL here, non-NULL from log_rollback_to_savepoint)
stops a partial rollback early; xlogtb_reset_wait_msecs(INFINITE_WAIT)
sets the lock wait to infinite so the rollback cannot fail on a lock
timeout. Recovery-time replay is in cubrid-recovery-manager.md.
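The cursor discipline is easy to exercise on a toy log. This is a sketch under stated assumptions, not the real log_rollback: ToyRec, ToyTran, toy_append and toy_rollback are hypothetical, a vector index plays the role of an LSA, the log holds a single transaction, and every undo is modeled as "append one CLR at the tail".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ToyRec
{
  int prev;        // index of this transaction's previous record, -1 = NULL
  bool is_clr;     // compensation record: never undone itself
};

struct ToyTran
{
  std::vector<ToyRec> log;   // single-transaction log; the index plays the LSA
  int tail;                  // models tail_lsa
  int undo_nx;               // models undo_nxlsa
};

inline void toy_append (ToyTran &t, bool is_clr)
{
  t.log.push_back (ToyRec{ t.tail, is_clr });
  t.tail = static_cast<int> (t.log.size ()) - 1;
  if (!is_clr)
    t.undo_nx = t.tail;      // a new data record becomes the next one to undo
}

// Returns the indexes undone, in the order they were undone.
inline std::vector<int> toy_rollback (ToyTran &t)
{
  std::vector<int> undone;
  int cursor = t.undo_nx;
  while (cursor != -1)
    {
      const ToyRec rec = t.log[static_cast<std::size_t> (cursor)];   // copy: the log grows below
      const int current = cursor;
      cursor = rec.prev;       // advance BEFORE applying the undo
      t.undo_nx = cursor;      // persist the cursor
      if (!rec.is_clr)
        {
          undone.push_back (current);
          toy_append (t, true);   // the undo logs a CLR at the tail
        }
    }
  return undone;
}
```

Because the cursor already points at the predecessor when the CLR is appended, the loop never visits the CLRs it creates, and each data record is undone exactly once in reverse order.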
8.7 Restart-driven variants — log_abort_by_tdes and log_abort_all_active_transaction
At shutdown or crash recovery, transactions must be aborted by a thread
other than their owner. log_abort_by_tdes rebinds the executing thread
to the victim’s tran_index so every LOG_FIND_THREAD_TRAN_INDEX lookup
inside log_abort resolves correctly, then reuses the ordinary path:
    // log_abort_by_tdes -- src/transaction/log_manager.c (SERVER_MODE)
    thread_p->tran_index = tdes->tran_index;     /* impersonate the victim's index */
    pthread_mutex_unlock (&thread_p->tran_index_lock);
    (void) log_abort (thread_p, tdes->tran_index);   /* reuse the normal abort path */

log_abort_all_active_transaction is the shutdown sweep: in server mode
it loops over every index, dispatching an async abort onto each active
transaction and re-looping until no worker threads remain. The dispatch
is not direct — css_push_external_task queues log_abort_task_execute,
a thin wrapper that calls log_abort_by_tdes(&thread_ref, &tdes):
    // log_abort_all_active_transaction (server-mode essence) -- src/transaction/log_manager.c
    if (already_called)
      return;
    already_called = 1;                          /* <- idempotent static guard */
    loop:
      repeat_loop = false;
      for (i = 0; i < log_Gl.trantable.num_total_indices; i++)
        if (i != LOG_SYSTEM_TRAN_INDEX && (tdes = LOG_FIND_TDES (i)) && tdes->trid != NULL_TRANID)
          {
            if (css_count_transaction_worker_threads (...) > 0)
              repeat_loop = true;                /* still busy */
            else if (LOG_ISTRAN_ACTIVE (tdes) && !abort_thread_running[i])
              {
                /* exec_f = std::bind (log_abort_task_execute, _1, std::ref (*tdes)); */
                css_push_external_task (...);    /* -> log_abort_task_execute -> log_abort_by_tdes */
                abort_thread_running[i] = 1;
                repeat_loop = true;
              }
          }
      if (repeat_loop)
        {
          thread_sleep (50);
          if (css_is_shutdown_timeout_expired ())
            _exit (0);                           /* <- give up: hard exit */
          goto loop;
        }

The already_called guard lets the sweep run only once; LOG_SYSTEM_TRAN_INDEX
is skipped; a transaction with live workers forces another pass; and an
expired shutdown timeout ends the process with _exit(0). The SA_MODE branch
walks the table and calls log_abort synchronously.
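One pass of the sweep can be modeled without threads. This is a sketch under stated assumptions: ToySlot and toy_sweep_pass are hypothetical, queuing the asynchronous abort is reduced to setting a flag, and the sleep and timeout handling around the pass are left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ToySlot
{
  bool in_use;          // models trid != NULL_TRANID
  bool active;          // models LOG_ISTRAN_ACTIVE
  int workers;          // worker threads still running for this transaction
  bool abort_queued;    // models abort_thread_running[i]
};

// One pass over the table; returns true when another pass is needed.
inline bool toy_sweep_pass (std::vector<ToySlot> &table, std::size_t system_index)
{
  bool repeat = false;
  for (std::size_t i = 0; i < table.size (); i++)
    {
      ToySlot &slot = table[i];
      if (i == system_index || !slot.in_use)
        continue;
      if (slot.workers > 0)
        repeat = true;                 // still busy: look again later
      else if (slot.active && !slot.abort_queued)
        {
          slot.abort_queued = true;    // stand-in for queuing the async abort task
          repeat = true;
        }
    }
  return repeat;
}

// Usage: an idle active transaction is queued once and forces one more pass.
inline void toy_sweep_pass_example ()
{
  std::vector<ToySlot> table;
  table.push_back (ToySlot{ true, true, 0, false });   // index 0: system slot, skipped
  table.push_back (ToySlot{ true, true, 0, false });   // index 1: idle, active
  assert (toy_sweep_pass (table, 0));
  assert (table[1].abort_queued && !table[0].abort_queued);
}
```

A caller would repeat the pass, sleeping between rounds, until it returns false or the shutdown timeout expires.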
8.8 Chapter summary — key takeaways
- Boundary records reuse the whole pipeline: built/attached like data records (Chapters 3-4); the one-field log_rec_donetime’s LSA is the durable commit point.
- log_commit routes; log_commit_local works (postpone, append, unlock, force); log_complete only finalizes state, since the record was already written (LOG_ALREADY_WROTE_EOT_LOG).
- Ordering protects against checkpoint-during-commit: anything that logs runs before TRAN_UNACTIVE_WILL_COMMIT.
- Commit forces, abort usually does not: logpb_flush_pages always for commit (group-commit, Chapter 7), for abort only with a checkpoint in flight on a restarted server.
- Abort sets state first, then undoes: log_rollback advances the cursor before each undo so CLRs do not re-enter the chain.
- retain_lock is a commit-only knob: abort always unlocks.
- Restart variants re-bind, not re-implement: log_abort_by_tdes impersonates tran_index and calls log_abort; log_abort_all_active_transaction dispatches log_abort_task_execute idempotently until workers drain or the timeout forces _exit(0).
Chapter 9: System Operations Postpone and Compensation
A system operation (sysop, or “top operation”) is a sub-transaction the server
commits or aborts independently of the enclosing user transaction — index
splits, file allocation, overflow-record management. This chapter traces how
sysops, postponed actions, and compensation records reuse the prior-list
pipeline (Chapters 3–5) while carrying their own logical-undo payloads. For
WAL/postpone/ARIES theory see the companion cubrid-log-manager.md (“System
operations”, “Postpone & compensation”). Every family calls
prior_lsa_alloc_and_copy_data + prior_lsa_next_record; the novelty is which
header is stamped and how the tdes sysop stack and the LSA chains
(undo_nxlsa, posp_nxlsa, per-level posp_lsa) mutate around it.
9.1 The sysop stack on log_tdes
log_sysop_start appends nothing; it pushes a frame onto an in-memory stack in
the transaction descriptor. The table lists only the sysop/postpone-relevant
log_tdes fields — the full ~82-field struct is in Chapter 2.
| Field | Role | Why it exists |
|---|---|---|
topops (LOG_TOPOPS_STACK) | Nesting stack of active sysops | last is the index of the top frame (negative when the stack is empty), max the allocated size |
topop_lsa | LSA of the most-recent sysop’s parent | Fast “are we in a sysop” probe for appenders |
tail_lsa | LSA of this tran’s last appended record | High-water mark a sysop end compares to detect “no change” |
undo_nxlsa | Next record to undo | Rewound by a CLR so undo skips the already-undone record |
posp_nxlsa | First transaction-level postpone record | Seeded by a LOG_POSTPONE appended outside any sysop |
savept_lsa | LSA of last savepoint | Chains savepoints; target of log_abort_partial |
tail_topresult_lsa | LSA of last partial commit/abort | Stamped into every sysop-end as prv_topresult_lsa |
state (TRAN_STATE) | Transaction state | Gates which sysop-end arms are legal |
m_log_postpone_cache | Cached postpone redo + LSAs | Lets do_postpone replay from memory |
rcv.sysop_start_postpone_lsa | Recovery anchor for an in-flight sysop postpone | Resume a crashed sysop’s postpone phase |
rcv.tran_start_postpone_lsa | Recovery anchor for a tran-level postpone | Resume the transaction postpone phase |
rcv.atomic_sysop_start_lsa | Recovery anchor for an atomic sysop | Roll an interrupted atomic sysop back as a unit |
(The rcv.* members live in the embedded log_rcv_tdes.) Each stack frame is a
log_topops_addresses carrying two LSAs, read through three accessor macros:
    // log_topops_addresses -- src/transaction/log_impl.h
    struct log_topops_addresses
    {
      LOG_LSA lastparent_lsa;  /* The last address of the parent transaction. This is needed for undo of the top
                                * system action */
      LOG_LSA posp_lsa;        /* The first address of a postpone log record for top system operation. We add this
                                * since it is reset during recovery to the last reference postpone address. */
    };

    // LOG_TDES_LAST_SYSOP* -- src/transaction/log_manager.c
    #define LOG_TDES_LAST_SYSOP(tdes) (&(tdes)->topops.stack[(tdes)->topops.last])
    #define LOG_TDES_LAST_SYSOP_PARENT_LSA(tdes) (&LOG_TDES_LAST_SYSOP(tdes)->lastparent_lsa)
    #define LOG_TDES_LAST_SYSOP_POSP_LSA(tdes) (&LOG_TDES_LAST_SYSOP(tdes)->posp_lsa)

flowchart LR
subgraph tdes["log_tdes"]
tail["tail_lsa"]
posp["posp_nxlsa (tran level)"]
undo["undo_nxlsa"]
stk["topops.stack[]"]
end
stk --> f0["[0] lastparent_lsa, posp_lsa"]
stk --> fl["[last] lastparent_lsa, posp_lsa"]
Figure 9-1: the sysop stack and the LSA anchors it threads through log_tdes.
Invariant: the parent LSA bounds the sysop body. Every record a sysop appends has
tail_lsa > LOG_TDES_LAST_SYSOP_PARENT_LSA(tdes); end functions detect an empty sysop by LSA_LE(&tdes->tail_lsa, parent_lsa). If violated, an end would log a phantom record or skip a needed end marker, desyncing log nesting from the stack. Enforced in log_sysop_commit_internal and log_sysop_abort.
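The push and the empty-sysop test can be sketched with a scalar LSA. This is a model under stated assumptions: ToyLsa, ToyFrame, ToyTdes and the toy_sysop_* functions are hypothetical, a single integer replaces the (pageid, offset) pair, and a std::vector replaces the preallocated topops array.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using ToyLsa = std::int64_t;          // hypothetical scalar LSA; -1 models NULL
const ToyLsa TOY_NULL_LSA = -1;

struct ToyFrame { ToyLsa lastparent_lsa; ToyLsa posp_lsa; };

struct ToyTdes
{
  ToyLsa tail_lsa;
  std::vector<ToyFrame> stack;        // back () plays topops.stack[topops.last]
};

// Push a frame that remembers where the parent's log ended; nothing is logged.
inline void toy_sysop_start (ToyTdes &t)
{
  t.stack.push_back (ToyFrame{ t.tail_lsa, TOY_NULL_LSA });
}

// Empty when no record was appended since the frame was pushed.
inline bool toy_sysop_is_empty (const ToyTdes &t)
{
  assert (!t.stack.empty ());
  return t.tail_lsa == TOY_NULL_LSA || t.tail_lsa <= t.stack.back ().lastparent_lsa;
}

// Pop the frame; in this model that is all an end does to the stack.
inline void toy_sysop_end (ToyTdes &t)
{
  assert (!t.stack.empty ());
  t.stack.pop_back ();
}
```

A sysop started at tail_lsa 100 reports empty until a record moves tail_lsa past 100, which is the comparison the end functions rely on.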
9.2 log_sysop_start and log_sysop_start_atomic
    // log_sysop_start -- src/transaction/log_manager.c
    if (tdes->topops.max == 0 || (tdes->topops.last + 1) >= tdes->topops.max)   /* first-alloc OR full */
      if (logtb_realloc_topops_stack (tdes, 1) == NULL)                         /* OOM: bail, stack unchanged */
        {
          assert (false);
          tdes->unlock_topop ();
          return;
        }
    // ... condensed: VACUUM_IS_THREAD_VACUUM diagnostic logging only ...
    tdes->topops.last++;                                                        /* <- push */
    LSA_COPY (&tdes->topops.stack[tdes->topops.last].lastparent_lsa, &tdes->tail_lsa);
    LSA_COPY (&tdes->topop_lsa, &tdes->tail_lsa);
    LSA_SET_NULL (&tdes->topops.stack[tdes->topops.last].posp_lsa);             /* <- no postpone yet */

The topops.max == 0 clause handles a transaction’s first sysop (no stack yet)
and the second clause is grow-when-full. Branches: (1) tdes == NULL →
ER_LOG_UNKNOWN_TRANINDEX fatal early-return; (2) realloc, on OOM unlock+return
without pushing; (3) VACUUM diagnostics only; (4) happy path snapshots
tail_lsa into lastparent_lsa and nulls posp_lsa.
log_sysop_start_atomic wraps it, then ensures one LOG_SYSOP_ATOMIC_START
marker exists so recovery can roll the whole atomic sysop back as a unit:
    // log_sysop_start_atomic -- src/transaction/log_manager.c
    log_sysop_start (thread_p);
    /* ... re-fetch tdes, guard ... */
    if (LSA_ISNULL (&tdes->rcv.atomic_sysop_start_lsa))      /* first atomic level: emit marker */
      {
        node = prior_lsa_alloc_and_copy_data (thread_p, LOG_SYSOP_ATOMIC_START, ...);
        (void) prior_lsa_next_record (thread_p, node, tdes);
      }
    else
      {
        assert (tdes->topops.last > 0);                      /* nested: parent already marked */
        assert (LSA_ISNULL (&tdes->rcv.sysop_start_postpone_lsa));
      }

The else arm fires for a nested atomic sysop: the outer level owns
atomic_sysop_start_lsa, so the inner sysop inherits atomicity with no second
marker. The asserts encode “no atomic start while a sysop-postpone runs.”
9.3 The sysop-end union and its six arms
All non-abort ends route through log_sysop_commit_internal, which stamps a
log_rec_sysop_end (comments verbatim from source):

    // log_rec_sysop_end -- src/transaction/log_record.hpp
    struct log_rec_sysop_end
    {
      LOG_LSA lastparent_lsa;      /* last address before the top action */
      LOG_LSA prv_topresult_lsa;   /* previous top action (either, partial abort or partial commit) address */
      LOG_SYSOP_END_TYPE type;     /* end system op type */
      const VFID *vfid;            /* File where the page belong. ... used to get TDE information. */
      union                        /* other info based on type */
      {
        LOG_REC_UNDO undo;            /* undo data for logical undo */
        LOG_REC_MVCC_UNDO mvcc_undo;  /* undo data for logical undo of MVCC operation */
        LOG_LSA compensate_lsa;       /* compensate lsa for logical compensate */
        struct
        {
          LOG_LSA postpone_lsa;
          bool is_sysop_postpone;
        } run_postpone;               /* run postpone info */
      };
    };

| Field | Role | Why it exists |
|---|---|---|
lastparent_lsa | Where the sysop body starts | Recovery undo of the sysop stops here; copied from the frame |
prv_topresult_lsa | Previous partial commit/abort LSA | Chains top results so recovery walks them backward |
type | Which union arm is valid | Dispatch key for append and recovery |
vfid | File of the affected page | TDE key lookup; doubles as MVCC vacuum info |
undo | LOG_REC_UNDO payload for LOGICAL_UNDO | rcvindex + length for logical undo replay |
mvcc_undo | LOG_REC_MVCC_UNDO payload for LOGICAL_MVCC_UNDO | Adds mvccid and vacuum_info |
compensate_lsa | Undo-skip target for LOGICAL_COMPENSATE | Next-undo LSA after this logical compensation |
run_postpone.postpone_lsa | Original LOG_POSTPONE LSA | Links the run-postpone to its source |
run_postpone.is_sysop_postpone | Sysop vs. tran postpone flag | Recovery must know which postpone phase produced this |
The same union bytes are read according to six type tags, two of which use
no member. Each wrapper (§9.4) fills exactly the member listed below for its type:
type | Active member | Produced by |
|---|---|---|
LOG_SYSOP_END_COMMIT | (none) | log_sysop_commit |
LOG_SYSOP_END_ABORT | (none) | log_sysop_abort |
LOG_SYSOP_END_LOGICAL_UNDO | undo | log_sysop_end_logical_undo (non-MVCC) |
LOG_SYSOP_END_LOGICAL_MVCC_UNDO | mvcc_undo | log_sysop_end_logical_undo (MVCC) |
LOG_SYSOP_END_LOGICAL_COMPENSATE | compensate_lsa | log_sysop_end_logical_compensate |
LOG_SYSOP_END_LOGICAL_RUN_POSTPONE | run_postpone | log_sysop_end_logical_run_postpone |
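
The type-to-member mapping in the table can be written as a lookup that a reader of the record would consult before touching the union. This is a sketch under stated assumptions: ToySysopEndType and toy_active_member are hypothetical names, and the member is reported as a string instead of being read from a real union.

```cpp
#include <cassert>
#include <string>

enum class ToySysopEndType
{
  Commit, Abort, LogicalUndo, LogicalMvccUndo, LogicalCompensate, LogicalRunPostpone
};

// Names the union member that is meaningful for each end type ("" = none).
inline std::string toy_active_member (ToySysopEndType type)
{
  switch (type)
    {
    case ToySysopEndType::Commit:
    case ToySysopEndType::Abort:
      return "";
    case ToySysopEndType::LogicalUndo:
      return "undo";
    case ToySysopEndType::LogicalMvccUndo:
      return "mvcc_undo";
    case ToySysopEndType::LogicalCompensate:
      return "compensate_lsa";
    case ToySysopEndType::LogicalRunPostpone:
      return "run_postpone";
    }
  return "";   // not reached for valid enumerators
}

// Usage: plain commit and abort ends carry no union payload.
inline void toy_active_member_example ()
{
  assert (toy_active_member (ToySysopEndType::Commit).empty ());
  assert (toy_active_member (ToySysopEndType::LogicalCompensate) == "compensate_lsa");
}
```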
9.4 log_sysop_commit_internal — branch-complete
The caller sets log_record->type; commit_internal validates state-vs-type,
runs pending postpone, appends the end (Fig 9-2).
flowchart TD
A["commit_internal(log_record)"] --> B{"tdes == NULL?"}
B -->|yes| Z["assert_release; return"]
B -->|no| C{"empty sysop\nAND COMMIT or no_logging?"}
C -->|yes| D["assert posp_lsa NULL\nno-op end"]
C -->|no| F{"switch type"}
F -->|RUN_POSTPONE| G["assert *_COMMITTED_WITH_POSTPONE\nset is_sysop_postpone"]
F -->|COMPENSATE| H["assert aborting OR rv-finish"]
F -->|UNDO / MVCC_UNDO| I["no state restriction"]
F -->|COMMIT| J["assert not in postpone phase\nunless rv-finish"]
G --> K
H --> K
I --> K
J --> K["fill lastparent_lsa, prv_topresult_lsa"]
K --> M["do_postpone -> append_sysop_end -> tail_topresult_lsa = tail_lsa"]
D --> P["log_sysop_end_final"]
M --> P
Figure 9-2: every branch of log_sysop_commit_internal.
    // log_sysop_commit_internal -- src/transaction/log_manager.c
    assert (log_record->type != LOG_SYSOP_END_ABORT);         /* aborts never come here */
    if ((LSA_ISNULL (&tdes->tail_lsa) || LSA_LE (&tdes->tail_lsa, LOG_TDES_LAST_SYSOP_PARENT_LSA (tdes)))
        && (log_record->type == LOG_SYSOP_END_COMMIT || log_No_logging))
      assert (LSA_ISNULL (&LOG_TDES_LAST_SYSOP (tdes)->posp_lsa));   /* empty COMMIT: nothing to log */
    else
      {
        if (log_record->type == LOG_SYSOP_END_LOGICAL_RUN_POSTPONE)
          {
            assert (tdes->state == TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE
                    || tdes->state == TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE);
            log_record->run_postpone.is_sysop_postpone =      /* recovery needs which phase */
              (tdes->state == TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE && !is_rv_finish_postpone);
          }
        // ... condensed: COMPENSATE / LOGICAL_UNDO / COMMIT state asserts (see Fig 9-2) ...
        log_record->lastparent_lsa = *LOG_TDES_LAST_SYSOP_PARENT_LSA (tdes);
        log_record->prv_topresult_lsa = tdes->tail_topresult_lsa;
        log_sysop_do_postpone (thread_p, tdes, log_record, data_size, data);   /* run postpones */
        log_append_sysop_end (thread_p, tdes, log_record, data_size, data);    /* emit end */
        LSA_COPY (&tdes->tail_topresult_lsa, &tdes->tail_lsa);
      }
    log_sysop_end_final (thread_p, tdes);                     /* always pops the stack */

Invariant: the end-record type must agree with the transaction state.
LOGICAL_RUN_POSTPONE only in *_COMMITTED_WITH_POSTPONE; LOGICAL_COMPENSATE only while aborting (or recovery postpone-finish); a plain COMMIT never during a postpone phase, except on recovery’s is_rv_finish_postpone re-entry. If violated, recovery re-runs a postpone twice or skips an undo, corrupting the page.
log_sysop_end_final runs on every path, so even empty/error paths decrement
topops.last. The four logical wrappers pre-fill the union member from §9.3;
only log_sysop_end_logical_run_postpone leaves is_sysop_postpone for
commit_internal to derive from state.
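The arm-versus-state checks of Figure 9-2 can be condensed into a single predicate. The sketch below is illustrative only: the enums `TranState` and `SysopEnd` and both function names are hypothetical simplifications rather than the CUBRID declarations, and "aborting" is reduced to one `ABORTED` state.

```cpp
#include <cassert>

// Hypothetical, reduced stand-ins for the transaction state and the sysop-end type.
enum class TranState { ACTIVE, ABORTED, COMMITTED_WITH_POSTPONE, TOPOPE_COMMITTED_WITH_POSTPONE };
enum class SysopEnd { COMMIT, LOGICAL_UNDO, LOGICAL_MVCC_UNDO, LOGICAL_COMPENSATE, LOGICAL_RUN_POSTPONE };

// True while the transaction is replaying postpones (transaction or sysop level).
bool in_postpone_phase(TranState s)
{
  return s == TranState::COMMITTED_WITH_POSTPONE || s == TranState::TOPOPE_COMMITTED_WITH_POSTPONE;
}

// Mirrors the asserts of Figure 9-2: may this end-record type be logged in this state?
bool end_type_allowed(SysopEnd type, TranState state, bool is_rv_finish_postpone)
{
  switch (type)
    {
    case SysopEnd::LOGICAL_RUN_POSTPONE:
      return in_postpone_phase(state);
    case SysopEnd::LOGICAL_COMPENSATE:
      return state == TranState::ABORTED || is_rv_finish_postpone;
    case SysopEnd::COMMIT:
      return !in_postpone_phase(state) || is_rv_finish_postpone;
    case SysopEnd::LOGICAL_UNDO:
    case SysopEnd::LOGICAL_MVCC_UNDO:
      return true;  // no state restriction
    }
  return false;
}

// Same derivation as the is_sysop_postpone assignment in the excerpt.
bool derive_is_sysop_postpone(TranState state, bool is_rv_finish_postpone)
{
  return state == TranState::TOPOPE_COMMITTED_WITH_POSTPONE && !is_rv_finish_postpone;
}
```

A plain COMMIT during `TOPOPE_COMMITTED_WITH_POSTPONE` is rejected unless `is_rv_finish_postpone` is set, which is the one re-entry the invariant allows.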
9.5 log_sysop_abort — rollback then mark
Abort skips commit_internal; it rolls back and stamps an ABORT end directly:
```cpp
// log_sysop_abort -- src/transaction/log_manager.c
if (LSA_ISNULL (&tdes->tail_lsa) || LSA_LE (&tdes->tail_lsa, &LOG_TDES_LAST_SYSOP (tdes)->lastparent_lsa))
  {
    /* No change: empty sysop, nothing to undo or log */
  }
else
  {
    save_state = tdes->state;
    tdes->state = TRAN_UNACTIVE_ABORTED; /* <- so compensation appends are legal */
    log_rollback (thread_p, tdes, LOG_TDES_LAST_SYSOP_PARENT_LSA (tdes)); /* undo body, emits CLRs */
    sysop_end.type = LOG_SYSOP_END_ABORT;
    sysop_end.lastparent_lsa = *LOG_TDES_LAST_SYSOP_PARENT_LSA (tdes);
    sysop_end.prv_topresult_lsa = tdes->tail_topresult_lsa;
    log_append_sysop_end (thread_p, tdes, &sysop_end, 0, NULL);
    LSA_COPY (&tdes->tail_topresult_lsa, &tdes->tail_lsa);
    tdes->state = save_state; /* <- restore: sysop abort != tran abort */
  }
log_sysop_end_final (thread_p, tdes);
```

The temporary state = TRAN_UNACTIVE_ABORTED is load-bearing: it lets
log_rollback append CLRs (§9.8); the original state is then restored so the
outer transaction is unaffected.
9.6 log_append_postpone — deferring an action to commit
A LOG_POSTPONE records a redo-only action not applied now but replayed after
commit.
flowchart TD
A["log_append_postpone"] --> B{"log_No_logging?"}
B -->|yes| C["run redofun NOW; return"]
B -->|no| E{"skipredo OR\nno sysop AND not active/aborted?"}
E -->|yes| F["run redofun NOW;\nif !skipredo append_redo_data; return"]
E -->|no| G{"tail_lsa NULL or\nbefore crash point?"}
G -->|yes| H["append LOG_DUMMY_HEAD_POSTPONE"]
G -->|no| J
H --> J["alloc LOG_POSTPONE; cache redo + start_lsa"]
J --> N{"in sysop?"}
N -->|yes, posp_lsa NULL| O["frame.posp_lsa = tail_lsa"]
N -->|no, posp_nxlsa NULL| P["posp_nxlsa = tail_lsa"]
Figure 9-3: log_append_postpone branches.
Two escape hatches run the redo synchronously: log_No_logging is set, or the
action cannot be deferred (Fig 9-3). Otherwise the record is appended, its redo
data and start LSA are pushed into m_log_postpone_cache, and the appropriate
anchor is seeded:
```cpp
// log_append_postpone -- src/transaction/log_manager.c
node = prior_lsa_alloc_and_copy_data (thread_p, LOG_POSTPONE, rcvindex, addr, 0, NULL, length, (char *) data);
tdes->m_log_postpone_cache.add_redo_data (*node); /* save before node may be freed */
start_lsa = prior_lsa_next_record (thread_p, node, tdes);
tdes->m_log_postpone_cache.add_lsa (start_lsa);
if (tdes->topops.last >= 0) /* in sysop: seed frame anchor */
  {
    if (LSA_ISNULL (&tdes->topops.stack[tdes->topops.last].posp_lsa))
      LSA_COPY (&tdes->topops.stack[tdes->topops.last].posp_lsa, &tdes->tail_lsa);
  }
else if (LSA_ISNULL (&tdes->posp_nxlsa)) /* tran level: seed tran anchor */
  LSA_COPY (&tdes->posp_nxlsa, &tdes->tail_lsa);
```

Invariant: the first postpone seeds exactly one anchor. The earliest LOG_POSTPONE in a sysop level sets that frame’s posp_lsa; the earliest at transaction level sets posp_nxlsa. Later postpones leave it untouched (LSA_ISNULL guard). This anchor is the start of the postpone-replay scan; overwriting it would orphan earlier postpones.
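The seed-once rule is easy to check in isolation. The following is a minimal sketch, assuming an LSA can be modelled as one integer with -1 as NULL; `Tdes`, `SysopFrame`, and `note_postpone_appended` are hypothetical names for the illustration, not CUBRID types.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical simplified LSA: a single integer, with -1 standing in for NULL.
using Lsa = std::int64_t;
constexpr Lsa kNullLsa = -1;

struct SysopFrame
{
  Lsa posp_lsa = kNullLsa;   // first postpone logged inside this sysop
};

struct Tdes
{
  std::vector<SysopFrame> topops;   // sysop stack; empty means transaction level
  Lsa posp_nxlsa = kNullLsa;        // transaction-level postpone anchor
  Lsa tail_lsa = kNullLsa;          // LSA of the last appended record
};

// Called after a LOG_POSTPONE was appended at `lsa`: seed the matching anchor once.
void note_postpone_appended(Tdes &tdes, Lsa lsa)
{
  tdes.tail_lsa = lsa;
  if (!tdes.topops.empty())
    {
      SysopFrame &frame = tdes.topops.back();
      if (frame.posp_lsa == kNullLsa)
        frame.posp_lsa = tdes.tail_lsa;
    }
  else if (tdes.posp_nxlsa == kNullLsa)
    {
      tdes.posp_nxlsa = tdes.tail_lsa;
    }
}

// Two postpones inside one sysop: the frame anchor must stay at the first.
Lsa demo_frame_anchor()
{
  Tdes tdes;
  tdes.topops.push_back(SysopFrame{});
  note_postpone_appended(tdes, 100);
  note_postpone_appended(tdes, 140);
  return tdes.topops.back().posp_lsa;
}

// Two postpones at transaction level: posp_nxlsa must stay at the first.
Lsa demo_tran_anchor()
{
  Tdes tdes;
  note_postpone_appended(tdes, 200);
  note_postpone_appended(tdes, 260);
  return tdes.posp_nxlsa;
}
```

In both drivers the second postpone leaves the anchor at the first LSA, which is what keeps the replay scan from skipping earlier postpones.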
9.7 The postpone phase: log_sysop_do_postpone, log_do_postpone, log_run_postpone_op
When a sysop with pending postpones ends, log_sysop_do_postpone writes a
LOG_SYSOP_START_POSTPONE marker, then replays. Its header
log_rec_sysop_start_postpone (log_record.hpp) stashes the entire end record
so recovery can finish after a crash:
| Field | Role | Why it exists |
|---|---|---|
sysop_end (LOG_REC_SYSOP_END) | “log record used for end of system operation” | Lets log_sysop_end_recovery_postpone re-emit the correct end after a crash |
posp_lsa | “address where the first postpone operation start” | Where the post-crash replay scan resumes |
```cpp
// log_sysop_do_postpone -- src/transaction/log_manager.c
if (LSA_ISNULL (LOG_TDES_LAST_SYSOP_POSP_LSA (tdes)))
  {
    return; /* nothing to postpone */
  }
sysop_start_postpone.sysop_end = *sysop_end;
sysop_start_postpone.posp_lsa = *LOG_TDES_LAST_SYSOP_POSP_LSA (tdes);
log_append_sysop_start_postpone (thread_p, tdes, &sysop_start_postpone, data_size, data);
if (tdes->m_log_postpone_cache.do_postpone (*thread_p, *(LOG_TDES_LAST_SYSOP_POSP_LSA (tdes))))
  {
    tdes->state = save_state;
    return; /* fast path: replay from memory */
  }
log_do_postpone (thread_p, tdes, LOG_TDES_LAST_SYSOP_POSP_LSA (tdes)); /* slow path: scan the log */
```

The transaction-level parallel log_append_commit_postpone writes a
LOG_COMMIT_WITH_POSTPONE whose header is log_rec_start_postpone
(log_record.hpp), flips to TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE, and flushes
so the commit is durable before postpones run:
| Field | Role | Why it exists |
|---|---|---|
posp_lsa | Address where the first transaction postpone op starts | Anchor the post-commit replay scan; reset during recovery to the last reference |
at_time | “donetime. For the time-specific recovery” | Stamp the commit-postpone moment for point-in-time / time-specific recovery |
log_do_postpone is the slow-path forward scan; it skips nested-top bodies via
log_get_next_nested_top and dispatches on log_rec->type. Only LOG_POSTPONE
triggers replay; the start-marker group (LOG_COMMIT_WITH_POSTPONE[_OBSOLETE],
LOG_SYSOP_START_POSTPONE, LOG_2PC_*) does LSA_SET_NULL(&forward_lsa) to end
the loop; data/redo/CLR/savepoint arms are inert (already applied when logged);
default is “bad log_rectype”. log_run_postpone_op reads the redo and runs it
— and a page-spanning malloc failure is fatal, not a graceful return:
```cpp
// log_run_postpone_op -- src/transaction/log_manager.c
LSA_COPY (&ref_lsa, log_lsa); /* remember the original postpone LSA */
// ... condensed: advance past LOG_RECORD_HEADER + LOG_REC_REDO header ...
redo = *((LOG_REC_REDO *) ((char *) log_pgptr->area + log_lsa->offset));
if (log_lsa->offset + redo.length < (int) LOGAREA_SIZE)
  rcv_data = (char *) log_pgptr->area + log_lsa->offset; /* contiguous: point in place */
else
  {
    area = (char *) malloc (redo.length); /* spans pages: need contiguous copy */
    if (area == NULL)
      {
        logpb_fatal_error (thread_p, true, ARG_FILE_LINE, "log_run_postpone_op");
        return ER_FAILED;
      }
    logpb_copy_from_log (thread_p, area, redo.length, log_lsa, log_pgptr);
    rcv_data = area;
  }
(void) log_execute_run_postpone (thread_p, &ref_lsa, &redo, rcv_data);
if (area != NULL)
  free_and_init (area);
```

ref_lsa lands in the run-postpone record (log_rec_run_postpone,
log_record.hpp) so recovery knows which postpone already executed:
| Field | Role | Why it exists |
|---|---|---|
data (LOG_DATA) | “Location of recovery data” (rcvindex, vpid, offset) | Which page + redo function the postpone touched |
ref_lsa | “Address of the original postpone record” | A second recovery pass matches this and skips the already-run postpone |
length | “Length of redo data” | Bounds the redo copy |
The producer log_append_run_postpone asserts WILL_COMMIT or a
*_COMMITTED_WITH_POSTPONE state, stamps the three fields, appends, and sets the
page LSA — making the action idempotent on a second recovery pass.
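The contiguous-versus-spanning decision in log_run_postpone_op can be shown with a toy model. This is a sketch under stated assumptions: pages are plain strings of a hypothetical fixed `kAreaSize`, there are no page headers, and `payload_is_contiguous` / `copy_payload` are invented names that only imitate the roles of the in-place pointer and of logpb_copy_from_log.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tiny page area so the example stays readable.
constexpr std::size_t kAreaSize = 16;

// Same strict comparison as the excerpt: the payload can be used in place
// only if it ends before the end of the current page area.
bool payload_is_contiguous(std::size_t offset, std::size_t length)
{
  return offset + length < kAreaSize;
}

// Gather `length` bytes starting at (page, offset) into one contiguous buffer,
// continuing at offset 0 of each following page.
std::string copy_payload(const std::vector<std::string> &pages, std::size_t page,
                         std::size_t offset, std::size_t length)
{
  std::string out;
  out.reserve(length);
  while (out.size() < length)
    {
      const std::size_t take = std::min(length - out.size(), kAreaSize - offset);
      out.append(pages[page], offset, take);
      ++page;
      offset = 0;
    }
  return out;
}

// An 8-byte payload that starts 4 bytes before the end of page 0.
std::string demo_spanning_read()
{
  const std::vector<std::string> pages = { "0123456789abcdef", "ghijklmnopqrstuv" };
  return copy_payload(pages, 0, 12, 8);
}
```

The driver reads four bytes from the tail of the first page and four from the head of the second, which is why the real code needs a separately allocated buffer on this path.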
9.8 Compensation: log_append_compensate_internal and rewinding undo_nxlsa
A Compensation Log Record (CLR, LOG_COMPENSATE) records the redo of an undo so
undo is never re-done after a crash. Its header is log_rec_compensate
(log_record.hpp):
| Field | Role | Why it exists |
|---|---|---|
data (LOG_DATA) | “Location of recovery data” (rcvindex, pageid, offset, volid) | Locates the page the compensating redo re-applies |
undo_nxlsa | “Address of next log record to undo” | Recovery undo jumps here, skipping the compensated record |
length | “Length of compensating data” | Bounds the redo payload |
```cpp
// log_append_compensate_internal -- src/transaction/log_manager.c
node = prior_lsa_alloc_and_copy_data (thread_p, LOG_COMPENSATE, rcvindex, NULL, length, (char *) data, 0, NULL);
LSA_COPY (&prev_lsa, &tdes->undo_nxlsa); /* remember where we were */
compensate = (LOG_REC_COMPENSATE *) node->data_header;
// ... condensed: fill compensate->data {rcvindex, pageid, offset, volid}, length ...
if (undo_nxlsa != NULL)
  LSA_COPY (&compensate->undo_nxlsa, undo_nxlsa); /* explicit skip target */
else
  LSA_COPY (&compensate->undo_nxlsa, &prev_lsa); /* default: current link */
start_lsa = prior_lsa_next_record (thread_p, node, tdes);
if (pgptr != NULL)
  pgbuf_set_lsa (thread_p, pgptr, &start_lsa); /* TDE/page-LSA only when fixed */
LSA_COPY (&tdes->undo_nxlsa, &prev_lsa); /* <- rewind: undo continues from BEFORE this CLR */
```

Invariant: a CLR is redo-only and rewinds the undo cursor past itself. After the append, tdes->undo_nxlsa is reset to prev_lsa and the CLR’s own undo_nxlsa points at the next record needing undo. During recovery undo, reaching a CLR jumps to undo_nxlsa and never re-applies the undo it represents; skipping the rewind would double-undo the page. The pgptr != NULL guard handles a CLR logged in recovery when the page could not be fixed; in that case TDE and pgbuf_set_lsa are skipped.
The sibling log_sysop_end_logical_compensate (§9.3) achieves the same skip at
sysop granularity via the sysop-end record’s compensate_lsa.
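How undo_nxlsa prevents a double undo can be seen in a toy backward walk. This is a sketch only: an LSA is modelled as a vector index, `Rec` and `records_to_undo` are hypothetical, and the real undo pass reads log pages rather than an in-memory array.

```cpp
#include <cassert>
#include <vector>

// Hypothetical simplified log: an LSA is an index into a vector, -1 is NULL.
struct Rec
{
  bool is_clr;      // true for a compensation record
  int prev;         // backward link to the transaction's previous record
  int undo_nxlsa;   // CLR only: next record that still needs undo
};

// Walk backward from `tail` and return the LSAs an undo pass would undo.
std::vector<int> records_to_undo(const std::vector<Rec> &log, int tail)
{
  std::vector<int> undone;
  int cur = tail;
  while (cur != -1)
    {
      const Rec &r = log[cur];
      if (r.is_clr)
        {
          cur = r.undo_nxlsa;   // jump over everything already compensated
        }
      else
        {
          undone.push_back(cur);
          cur = r.prev;
        }
    }
  return undone;
}

// Updates 0, 1, 2; update 2 was rolled back and compensated by CLR 3 before a crash.
std::vector<int> demo_undo_after_crash()
{
  const std::vector<Rec> log = {
    { false, -1, -1 },
    { false, 0, -1 },
    { false, 1, -1 },
    { true, 2, 1 },
  };
  return records_to_undo(log, 3);
}
```

Starting at the CLR, the walk jumps straight to record 1 and never revisits record 2, so the already-compensated update is not undone a second time.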
9.9 Savepoints and partial abort
log_append_savepoint chains a LOG_SAVEPOINT with header log_rec_savept
(log_record.hpp):
| Field | Role | Why it exists |
|---|---|---|
prv_savept | “Previous savepoint record” LSA | Singly-linked chain so log_get_savepoint_lsa walks back by name |
length | “Savepoint name” length (name follows the record) | Bounds the variable-length name copy |
```cpp
// log_append_savepoint -- src/transaction/log_manager.c
if (!LOG_ISTRAN_ACTIVE (tdes))
  {
    er_set (... ER_LOG_CANNOT_ADD_SAVEPOINT ...);
    return NULL;
  }
if (savept_name == NULL)
  {
    er_set (... ER_LOG_NONAME_SAVEPOINT ...);
    return NULL;
  }
node = prior_lsa_alloc_and_copy_data (thread_p, LOG_SAVEPOINT, ..., savept_name, ...);
savept = (LOG_REC_SAVEPT *) node->data_header;
LSA_COPY (&savept->prv_savept, &tdes->savept_lsa); /* <- link to previous savepoint */
(void) prior_lsa_next_record (thread_p, node, tdes);
LSA_COPY (&tdes->savept_lsa, &tdes->tail_lsa); /* <- this is now the latest savepoint */
```

Branches: NULL tdes (fatal), non-active tran (ER_LOG_CANNOT_ADD_SAVEPOINT),
NULL name (ER_LOG_NONAME_SAVEPOINT), else append.
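The prv_savept chain makes lookup by name a backward walk from the latest savepoint. The sketch below models that walk over an in-memory map; `SavepointRec`, `find_savepoint`, and the integer LSAs are hypothetical, and the real log_get_savepoint_lsa reads the records from log pages.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical simplified LSA: one integer, -1 as NULL.
using Lsa = long long;
constexpr Lsa kNullLsa = -1;

struct SavepointRec
{
  std::string name;
  Lsa prv_savept;   // link to the previous savepoint, or kNullLsa
};

// Walk the chain from `latest` toward older savepoints until `name` matches.
Lsa find_savepoint(const std::map<Lsa, SavepointRec> &log, Lsa latest, const std::string &name)
{
  for (Lsa cur = latest; cur != kNullLsa;)
    {
      const auto it = log.find(cur);
      if (it == log.end())
        return kNullLsa;   // broken chain: treat as unknown
      if (it->second.name == name)
        return cur;
      cur = it->second.prv_savept;
    }
  return kNullLsa;
}

// Three savepoints chained 40 -> 25 -> 10.
Lsa demo_lookup(const std::string &name)
{
  const std::map<Lsa, SavepointRec> log = {
    { 10, { "sp_a", kNullLsa } },
    { 25, { "sp_b", 10 } },
    { 40, { "sp_c", 25 } },
  };
  return find_savepoint(log, 40, name);
}
```

An unknown name walks off the end of the chain and yields the NULL LSA, the case log_abort_partial reports as ER_LOG_UNKNOWN_SAVEPOINT.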
log_abort_partial rolls back to a named savepoint by reusing the sysop
machinery — it forges a sysop whose parent LSA is the savepoint. Five guard
branches run before the synthetic body:
- tdes == NULL → ER_LOG_UNKNOWN_TRANINDEX, return TRAN_UNACTIVE_UNKNOWN.
- LOG_HAS_LOGGING_BEEN_IGNORED () → ER_LOG_CORRUPTED_DB_DUE_NOLOGGING, return current state.
- !LOG_ISTRAN_ACTIVE → silently return current state.
- NULL name or unknown savepoint → ER_LOG_UNKNOWN_SAVEPOINT, return TRAN_UNACTIVE_UNKNOWN.
- Dangling sysops (topops.last >= 0) → warn + assert(false), drain via log_sysop_attach_to_outer.
```cpp
// log_abort_partial -- src/transaction/log_manager.c
if (tdes == NULL)
  {
    er_set (... ER_LOG_UNKNOWN_TRANINDEX ...);
    return TRAN_UNACTIVE_UNKNOWN;
  }
if (LOG_HAS_LOGGING_BEEN_IGNORED ())
  {
    er_set (...);
    return tdes->state;
  }
if (!LOG_ISTRAN_ACTIVE (tdes))
  {
    return tdes->state;
  }
if (savepoint_name == NULL || log_get_savepoint_lsa (...) == NULL)
  {
    er_set (... ER_LOG_UNKNOWN_SAVEPOINT ...);
    return TRAN_UNACTIVE_UNKNOWN;
  }
if (tdes->topops.last >= 0) /* dangling sysops: drain them first */
  {
    er_set (... ER_LOG_HAS_TOPOPS_DURING_COMMIT_ABORT ...);
    assert (false);
    while (tdes->topops.last >= 0)
      log_sysop_attach_to_outer (thread_p);
  }
log_sysop_start (thread_p);
LSA_COPY (&tdes->topops.stack[tdes->topops.last].lastparent_lsa, savept_lsa); /* stop at savepoint */
// ... condensed: if posp_nxlsa not null, transfer/clamp it into the frame's posp_lsa ...
log_sysop_abort (thread_p); /* the actual rollback + CLRs */
LSA_COPY (&tdes->savept_lsa, savept_lsa); /* discard newer savepoints */
return TRAN_UNACTIVE_ABORTED;
```

Partial abort is “abort a synthetic sysop spanning savepoint→now.” The elided
postpone-anchor transfer moves posp_nxlsa into the frame’s posp_lsa (clamped
to savept_lsa) so postpones whose source predates the savepoint are not lost.
9.10 log_sysop_attach_to_outer — committing a sysop into its parent
A sysop may merge into its enclosing scope, transferring only its postpone anchor:
```cpp
// log_sysop_attach_to_outer -- src/transaction/log_manager.c
if (tdes->topops.last == 0 && (!LOG_ISTRAN_ACTIVE (tdes) || tdes->is_system_transaction ()))
  {
    assert_release (false);
    log_sysop_commit (thread_p);
    return; /* nothing to attach to */
  }
if (tdes->topops.last - 1 >= 0) /* attach to parent sysop frame */
  {
    if (LSA_ISNULL (&tdes->topops.stack[tdes->topops.last - 1].posp_lsa))
      LSA_COPY (&tdes->topops.stack[tdes->topops.last - 1].posp_lsa,
                &tdes->topops.stack[tdes->topops.last].posp_lsa);
  }
else /* attach to transaction level */
  {
    if (LSA_ISNULL (&tdes->posp_nxlsa))
      LSA_COPY (&tdes->posp_nxlsa, &tdes->topops.stack[tdes->topops.last].posp_lsa);
  }
log_sysop_end_final (thread_p, tdes); /* pop, no end record appended */
```

Three branches: (1) nothing to attach to → fall back to a real commit; (2) parent
sysop → push this level’s posp_lsa up if the parent has none; (3) top-level →
push into posp_nxlsa. No LOG_SYSOP_END is written: the sysop’s effects
become the parent’s.
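The anchor hand-off in branches (2) and (3) can be sketched as follows. The model is deliberately reduced: LSAs are integers with -1 as NULL, the sysop stack holds only each frame's postpone anchor, and branch (1), the fall-back to a real commit, is left out. `Tdes` and `attach_to_outer` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical simplified LSA: one integer, -1 as NULL.
using Lsa = std::int64_t;
constexpr Lsa kNullLsa = -1;

struct Tdes
{
  std::vector<Lsa> frame_posp_lsa;   // one postpone anchor per open sysop
  Lsa posp_nxlsa = kNullLsa;         // transaction-level postpone anchor
};

// Pop the innermost sysop and hand its anchor to the enclosing scope,
// but only if that scope has no anchor of its own yet.
void attach_to_outer(Tdes &tdes)
{
  assert(!tdes.frame_posp_lsa.empty());
  const Lsa inner = tdes.frame_posp_lsa.back();
  tdes.frame_posp_lsa.pop_back();   // pop; no end record is written
  Lsa &outer = tdes.frame_posp_lsa.empty() ? tdes.posp_nxlsa : tdes.frame_posp_lsa.back();
  if (outer == kNullLsa)
    outer = inner;
}

// Inner anchor 300 travels to the parent frame, then to the transaction.
Lsa demo_nested_attach()
{
  Tdes tdes;
  tdes.frame_posp_lsa = { kNullLsa, 300 };
  attach_to_outer(tdes);
  attach_to_outer(tdes);
  return tdes.posp_nxlsa;
}

// A parent that already has an earlier anchor keeps it.
Lsa demo_outer_keeps_own_anchor()
{
  Tdes tdes;
  tdes.frame_posp_lsa = { 120, 300 };
  attach_to_outer(tdes);
  return tdes.frame_posp_lsa.back();
}
```

The second driver shows why the NULL check matters: the parent's own anchor is older, so replacing it would start the replay scan too late.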
9.11 Chapter summary — key takeaways
- Sysops start as stack frames, not log records. log_sysop_start pushes topops and snapshots tail_lsa into lastparent_lsa; the first on-disk evidence is the end record (or atomic-start marker). The parent-LSA invariant (§9.1) is how end functions detect an empty sysop.
- log_sysop_commit_internal is one hub with six type-driven arms. The log_rec_sysop_end union is reinterpreted per type; the function validates arm-vs-state, runs postpones, appends the end, chains tail_topresult_lsa.
- Abort is rollback-then-mark with a state swap. log_sysop_abort sets TRAN_UNACTIVE_ABORTED so log_rollback emits CLRs, appends LOG_SYSOP_END_ABORT, then restores the outer state.
- Postpone defers redo to post-commit replay, anchored once. The first LOG_POSTPONE seeds posp_lsa (sysop) or posp_nxlsa (tran); the cache replays from memory, log_do_postpone from the log, stopping at any start marker and replaying only LOG_POSTPONE.
- log_run_postpone_op makes postpone idempotent via the LOG_RUN_POSTPONE ref_lsa back-pointer; a page-spanning malloc failure is fatal.
- Compensation is redo-only and rewinds the undo cursor. log_append_compensate_internal stamps a CLR whose undo_nxlsa skips the compensated record and resets tdes->undo_nxlsa.
- Savepoints and partial abort piggy-back on sysops. log_abort_partial clears five guards, forges a sysop spanning savepoint→now, and calls log_sysop_abort; log_sysop_attach_to_outer merges a sysop into its parent with no end record, transferring only the postpone anchor.
Chapter 10: Archiving Header Maintenance and Edge Paths
Chapters 3-9 traced the hot per-record path. This chapter covers everything off it: the background machinery that recycles the active log into archive volumes, the on-disk durability of the log header, and the edge records and corruption checks. For the “active log vs. archives” and “force-at-commit” theory, see the companion.
10.1 Three header structs: page header, log header, archive header
Every log page begins with a LOG_HDRPAGE; logical page -9 (LOGPB_HEADER_PAGE_ID) carries a LOG_HEADER; every archive’s physical page 0 carries a LOG_ARV_HEADER.
flowchart LR
hp["active page -9<br/>LOG_HDRPAGE + LOG_HEADER"] -->|"nxarv_pageid"| p0["active data page<br/>LOG_HDRPAGE + records"]
p0 -.archived into.-> ap0["archive phy 0<br/>LOG_HDRPAGE + LOG_ARV_HEADER"]
Figure 10-1: the three header structs and the active-to-archive copy relationship.
LOG_HDRPAGE — per-page header prefix. A LOG_PAGE is LOG_HDRPAGE hdr + char area[1].
| Field | Role | Why it exists |
|---|---|---|
logical_pageid | Logical page id in the infinite log | Identity independent of physical slot; header page is always -9 |
offset | Offset of first record on this page | Salvage anchor when a prior page is corrupt and unarchived |
flags | TDE flags (..._AES/_ARIA) | Marks a page whose records must be encrypted before leaving memory; header page is 0 |
checksum | CRC32 over sampled bytes | Corruption detection on read (10.6) |
LOG_HEADER — active-log master record. In the data area of page -9, mirrored as log_Gl.hdr. Every member is listed.
| Field | Role | Why it exists |
|---|---|---|
magic | file(1) magic | Guard vs. non-log files |
dummy / dummy3 / dummy4 | Alignment pads | 8-byte align |
db_creation | DB creation time | Ties log to DB; copied to LOG_ARV_HEADER |
vol_creation | Active-vol creation time | Diagnostics / ordering |
db_release / db_compatibility | Release string, compat float | Reject incompatible build/version |
db_iopagesize / db_logpagesize | Page sizes at creation | Run DB at the size the log expects |
is_shutdown | Clean-shutdown flag | Recovery: was dismount clean |
next_trid | Next txn id | Copied to LOG_ARV_HEADER for replay |
mvcc_next_id | Next MVCC id | MVCC allocation high-water |
avg_ntrans / avg_nlocks | Sizing estimates | Pre-size txn/lock tables |
npages | Active pages, excl. header | Sizes active vol / archive range |
db_charset | DB charset id | Charset guard at mount |
was_copied | Copied-DB flag | Copy vs. original |
fpageid | Logical pageid at active physical slot 1 | Active analogue of LOG_ARV_HEADER.fpageid |
append_lsa | Current append position | High-water of real log |
chkpt_lsa | Lowest LSA recovery replays from | Recovery start; durable here (10.5) |
nxarv_pageid | Next logical page to archive | Active/archive boundary (10.2) |
nxarv_phy_pageid | Physical slot of nxarv_pageid | Skips recomputing logpb_to_physical_pageid |
nxarv_num | Next archive number | Names next _lgarNNN |
last_arv_num_for_syscrashes | Oldest archive for crash recovery | Deletion floor; -1 = unpinned |
last_deleted_arv_num | Highest archive removed | Remove resumes without re-scan |
bkup_level0/1/2_lsa / bkinfo[] | Per-level backup LSAs + info | Backup — see backup chapter |
prefix_name | Log prefix name | Names volume family |
has_logging_been_skipped | Logging-skipped flag | Marks a WAL-bypass window |
vacuum_last_blockid | Last vacuum block id | Gates deletion (10.4) |
perm_status_obsolete | Obsolete status | Layout compat |
ha_server_state / ha_file_status / ha_promotion_time | HA state, copy status, promotion time | Replication / HA |
eof_lsa | LSA of LOG_END_OF_LOG | Durable log end; durable here (10.5) |
smallest_lsa_at_last_chkpt | Oldest dirty LSA at last chkpt | Bounds recovery/vacuum lookback |
mvcc_op_log_lsa | LSA of last MVCC op | Vacuum MVCC anchor |
oldest_visible_mvccid / newest_block_mvccid | Oldest visible, newest block MVCCID | Vacuum visibility / block bounds |
db_restore_time | Last restore time | Restore bookkeeping |
mark_will_del | Marked for deletion | DB-drop bookkeeping |
does_block_need_vacuum | Block needs vacuum | Vacuum scheduling |
was_active_log_reset | Active log was reset | Cleared in logpb_archive_active_log |
State invariant — the archive boundary. nxarv_pageid is the single source of truth for what is archived (< lives only in an archive, >= is still active), and nxarv_phy_pageid must equal logpb_to_physical_pageid(nxarv_pageid). Both advance as a unit at the end of logpb_archive_active_log, then logpb_flush_header makes them durable atomically. If they disagree, the next archive reads the wrong physical slot and corrupts the sequence.
LOG_ARV_HEADER — one per archive volume.
| Field | Role | Why it exists |
|---|---|---|
magic | CUBRID_MAGIC_LOG_ARCHIVE | Guard vs. mounting wrong file as archive |
dummy | Pad | align |
db_creation | From log_Gl.hdr.db_creation | Ties archive to DB |
vol_creation | time(NULL) when written | Diagnostics / ordering |
next_trid | From log_Gl.hdr.next_trid | Replay context |
npages | Data pages, excl. previous-lsa page | Bounds reads |
fpageid | Logical pageid at physical slot 1 | Logical-to-physical map in this archive |
arv_num | This archive’s number | Self-identifying |
dummy2 | Pad | align |
10.2 logpb_archive_active_log — rolling the active log into an archive
Called under LOG_CS write when the active log fills, it copies [nxarv_pageid .. prev_lsa.pageid-1] into a fresh archive, then advances the boundary. Figure 10-2 traces every branch.
flowchart TB
start["enter LOG_CS write"] --> wake["wake remove daemon SERVER, or remove-exceed-limit SA"]
wake --> guard{"nxarv_pageid >= append_lsa.pageid ?"}
guard -->|yes, only incomplete page| ret["er_log_debug + return"]
guard -->|no| dis{"archive.vdes open ?"}
dis -->|yes| dismount["dismount old archive"]
dis -->|no| mal["malloc arv hdr page"]
dismount --> mal
mal -->|NULL| err["goto error"]
mal -->|ok| flush["flush_all_append_pages, build hdr"]
flush --> bg{"bg archiving and vdes open ?"}
bg -->|yes| chk["set hdr checksum"]
bg -->|no| fmt["fileio_format new vol"]
fmt -->|NULL_VOLDES| err
fmt --> chk
chk -->|error| err
chk --> wrhdr["write header page phy 0"]
wrhdr -->|NULL| err
wrhdr --> loop["copy loop: read LOGPB_IO_NPAGES, write"]
loop -->|read/write fails| err
loop --> fin{"background archiving ?"}
fin -->|yes| rename["dismount, rename _lgar_t, remount"]
fin -->|no| sync["fileio_synchronize"]
rename -->|fail| err
sync -->|fail| err
rename --> adv["advance nxarv_num/pageid/phy_pageid"]
sync --> adv
adv --> fh["logpb_flush_header"]
fh --> done["cache hdr, log, return"]
err --> fatal["logpb_fatal_error -> exit"]
Figure 10-2: branch-complete flow of logpb_archive_active_log.
The early guard (nxarv_pageid >= append_lsa.pageid -> er_log_debug + return) refuses an empty range; logpb_flush_all_append_pages is then forced. The header is built self-describing (db_creation/next_trid/fpageid copied from log_Gl.hdr), with last_pageid clamped so a degenerate range never yields negative npages:
```cpp
// logpb_archive_active_log -- src/transaction/log_page_buffer.c
last_pageid = log_Gl.append.prev_lsa.pageid - 1; /* <- never the live append page */
if (last_pageid < arvhdr->fpageid)
  last_pageid = arvhdr->fpageid; /* <- clamp >= 1 page */
arvhdr->npages = (DKNPAGES) (last_pageid - arvhdr->fpageid + 1);
```

The copy loop reads up to LOGPB_IO_NPAGES (4) pages via logpb_read_page_from_active_log and writes them FILEIO_WRITE_NO_COMPENSATE_WRITE as-stored (still TDE-encrypted); any read <= 0 or write NULL jumps to error. The boundary advance is the durable commit: last_arv_num_for_syscrashes is pinned to nxarv_num if still -1 (recovery floor); nxarv_num++; nxarv_pageid/nxarv_phy_pageid advance as a unit; was_active_log_reset = false; then logpb_flush_header. The error label calls logpb_fatal_error(..., true, ...): a failed archive is unrecoverable, so the server exits (10.7).
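The clamp on last_pageid is plain integer arithmetic and can be checked on its own. The function below is a sketch that restates the three lines of the excerpt with hypothetical parameter names; it is not the CUBRID function.

```cpp
#include <cassert>
#include <cstdint>

// fpageid: first logical page of the archive range.
// prev_lsa_pageid: page that holds the most recently appended record.
std::int64_t archive_npages(std::int64_t fpageid, std::int64_t prev_lsa_pageid)
{
  std::int64_t last_pageid = prev_lsa_pageid - 1;   // never the live append page
  if (last_pageid < fpageid)
    last_pageid = fpageid;                          // clamp: at least one page
  return last_pageid - fpageid + 1;
}
```

With fpageid = 100 and the last record on page 110, pages 100 through 109 are archived (10 pages); when the last record sits on page 100 or earlier, the clamp yields 1 rather than zero or a negative count.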
10.3 logpb_write_toflush_pages_to_archive — background archiving
When PRM_ID_LOG_BACKGROUND_ARCHIVING is on, full pages stream to a temp volume (_lgar_t) as they flush, so the eventual archive only renames it. It returns early when bg_archive_info.vdes == NULL_VOLDES || num_toflush <= 1, then copies every toflush[] page below prev_lsa.pageid, reconciling cursor pageid against the next bufptr->pageid in three branches:
```cpp
// logpb_write_toflush_pages_to_archive -- src/transaction/log_page_buffer.c
if (pageid > bufptr->pageid) /* backwards: never */
  {
    assert_release (...);
    dismount;
    return;
  }
else if (pageid < bufptr->pageid) /* gap: fetch */
  {
    if (logpb_fetch_page (...))
      {
        dismount;
        return;
      }
  }
else /* match: use in hand */
  {
    log_pgptr = flush_info->toflush[i];
    i++;
  }
```

Each page is TDE-encrypted if LOG_IS_PAGE_TDE_ENCRYPTED; on encryption failure it is written plaintext with the TDE flag cleared (a logged data-leak tradeoff). fileio_synchronize runs once every PRM_ID_PB_SYNC_ON_NFLUSH pages. Any write failure dismounts the temp volume and abandons bg archiving (10.2 falls back to fileio_format).
10.4 The remove daemon — gated deletion of old archives
Deletion never happens on the hot path. On a server, logpb_archive_active_log only wakes the daemon (log_wakeup_remove_log_archive_daemon calls wakeup(), async). log_remove_log_archive_daemon_task also fires periodically (compute_period reads PRM_ID_REMOVE_LOG_ARCHIVES_INTERVAL: non-zero = timed wait, zero = wake-only). Its body and the SA path both call logpb_remove_archive_logs_exceed_limit, which early-exits with 0 if log_max_archives == INT_MAX (unlimited) or !vacuum_is_safe_to_remove_archives() (vacuum data not loaded). The window [last_deleted_arv_num + 1, nxarv_num - num_remove_arv_num] then has its high end clamped by each gate with MIN:
```cpp
// logpb_remove_archive_logs_exceed_limit -- src/transaction/log_page_buffer.c
if (log_Gl.hdr.last_arv_num_for_syscrashes != -1) /* crash-recovery floor */
  last_arv_num_to_delete = MIN (last_arv_num_to_delete, log_Gl.hdr.last_arv_num_for_syscrashes);
if (vacuum_first_pageid != NULL_PAGEID && logpb_is_page_in_archive (vacuum_first_pageid))
  last_arv_num_to_delete = MIN (last_arv_num_to_delete, min_arv_required_for_vacuum);
if (prm_get_integer_value (PRM_ID_SUPPLEMENTAL_LOG)) /* CDC + flashback gates */
  {
    if (logpb_is_page_in_archive (cdc_min_log_pageid_to_keep ())) /* CDC progress */
      last_arv_num_to_delete = MIN (last_arv_num_to_delete, min_arv_required_for_cdc);
    if (flashback_is_needed_to_keep_archive ())
      last_arv_num_to_delete = MIN (last_arv_num_to_delete, min_arv_required_for_flashback);
  }
```

State invariant: no consumer is read past its floor. An archive is deletable only if its number is below every live consumer’s minimum: last_arv_num_for_syscrashes, vacuum, CDC (cdc_min_log_pageid_to_keep, the oldest page CDC has not consumed, only under PRM_ID_SUPPLEMENTAL_LOG), flashback, and (server) HA copy progress (logwr_get_min_copied_fpageid, unless PRM_ID_FORCE_REMOVE_LOG_ARCHIVES). The MIN() chain enforces this; drop any clamp and that consumer finds its archive deleted.
Then max_count caps batch size, last_arv_num_to_delete-- (window is exclusive of the last needed archive), and only if >= first_arv_num_to_delete does it persist last_deleted_arv_num and logpb_flush_header. The unlink runs after LOG_CS_EXIT via logpb_remove_archive_logs_internal.
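The clamp-then-decrement logic can be sketched as a pure function. This is an illustration under assumptions: each consumer floor is passed in as an already-computed archive number, the max_count batch cap is left out, and `deletable_range` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Returns the inclusive range {first, last} of archive numbers that may be
// deleted, or {0, -1} when nothing is deletable.
std::pair<int, int> deletable_range(int last_deleted_arv_num, int window_high,
                                    const std::vector<int> &consumer_floors)
{
  const int first = last_deleted_arv_num + 1;
  int last = window_high;
  for (int min_needed : consumer_floors)
    last = std::min(last, min_needed);   // one clamp per live consumer
  --last;                                // the lowest needed archive itself must stay
  if (last < first)
    return { 0, -1 };
  return { first, last };
}
```

For example, with last_deleted_arv_num = 2, a window high end of 8, and floors 7 (crash recovery) and 5 (vacuum), the result is {3, 4}: archive 5 is still needed by vacuum, so deletion stops just below it.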
10.5 logpb_flush_header — making the active-log header durable
Every boundary change above ends here. It asserts LOG_CS_OWN_WRITE_MODE, lazily allocates loghdr_pgptr if NULL (OOM -> logpb_fatal_error), then snapshots and writes to page -9:
```cpp
// logpb_flush_header -- src/transaction/log_page_buffer.c
log_hdr = (LOG_HEADER *) (log_Gl.loghdr_pgptr->area);
*log_hdr = log_Gl.hdr; /* <- snapshot in-memory header */
log_Gl.loghdr_pgptr->hdr.flags = 0; /* <- never TDE-encrypted */
logpb_write_page_to_disk (thread_p, log_Gl.loghdr_pgptr, LOGPB_HEADER_PAGE_ID);
```

This is the single point where chkpt_lsa (recovery start) and eof_lsa (durable log end) become durable. It does not flush append pages; those use Ch 7’s WAL flush.
10.6 Edge records and corruption: EOF marker, dummies, checksum
LOG_END_OF_LOG placement. In logpb_flush_all_append_pages, an EOF marker (eof.type = LOG_END_OF_LOG, null forw_lsa) is appended in place via logpb_start_append so recovery finds where the log stops, but append_lsa is not advanced, so the next real record overwrites it.
LOG_DUMMY_GENERIC and other dummies. Several log types carry no payload; the enum comment is literally "ridiculous, but flush needs it". A dummy gives flush a record to terminate/pad a page when a real record would straddle a boundary awkwardly, so the page closes without a partial record header.
Checksum. logpb_compute_page_checksum samples 16 bytes from the head and tail of each 4096-byte block and CRC32s the concatenation, zeroing hdr.checksum during the computation and restoring it after so the stored checksum never checks itself. logpb_set_page_checksum stores it; logpb_page_has_valid_checksum recomputes and compares; logpb_page_check_corruption sets *is_page_corrupted = !has_valid_checksum. Any change must be mirrored in logwr_check_page_checksum so replication agrees.
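A sampled checksum with the zero-and-restore step can be sketched in a few lines. Everything here is an assumption made for illustration: the checksum is stored in the first four bytes of the page, the CRC32 is a plain bitwise implementation (polynomial 0xEDB88320), and none of the function names are the CUBRID ones.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kBlockSize = 4096;
constexpr std::size_t kSampleBytes = 16;
constexpr std::size_t kChecksumOffset = 0;   // hypothetical location of the stored checksum

// Bitwise CRC32 update (reflected polynomial 0xEDB88320).
std::uint32_t crc32_update(std::uint32_t crc, const std::uint8_t *data, std::size_t len)
{
  for (std::size_t i = 0; i < len; ++i)
    {
      crc ^= data[i];
      for (int bit = 0; bit < 8; ++bit)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  return crc;
}

// CRC over the first and last 16 bytes of every 4096-byte block, with the
// stored checksum field zeroed during the computation and restored afterwards.
std::uint32_t compute_checksum(std::vector<std::uint8_t> &page)
{
  std::uint32_t saved;
  std::memcpy(&saved, &page[kChecksumOffset], sizeof saved);
  std::memset(&page[kChecksumOffset], 0, sizeof saved);
  std::uint32_t crc = 0xFFFFFFFFu;
  for (std::size_t off = 0; off + kBlockSize <= page.size(); off += kBlockSize)
    {
      crc = crc32_update(crc, &page[off], kSampleBytes);
      crc = crc32_update(crc, &page[off + kBlockSize - kSampleBytes], kSampleBytes);
    }
  std::memcpy(&page[kChecksumOffset], &saved, sizeof saved);
  return crc ^ 0xFFFFFFFFu;
}

void set_checksum(std::vector<std::uint8_t> &page)
{
  const std::uint32_t crc = compute_checksum(page);
  std::memcpy(&page[kChecksumOffset], &crc, sizeof crc);
}

bool has_valid_checksum(std::vector<std::uint8_t> &page)
{
  std::uint32_t stored;
  std::memcpy(&stored, &page[kChecksumOffset], sizeof stored);
  return stored == compute_checksum(page);
}

// Build a two-block page with a deterministic pattern and a valid checksum.
std::vector<std::uint8_t> make_page()
{
  std::vector<std::uint8_t> page(2 * kBlockSize);
  for (std::size_t i = 0; i < page.size(); ++i)
    page[i] = static_cast<std::uint8_t>(i * 31 + 7);
  set_checksum(page);
  return page;
}

// Flip one bit at `pos` and report whether the page still validates.
bool valid_after_flip(std::size_t pos)
{
  std::vector<std::uint8_t> page = make_page();
  page[pos] ^= 0x01;
  return has_valid_checksum(page);
}
```

In this sketch a flipped bit inside a sampled region is detected, while a flip in the unsampled middle of a block is not; that is the cost of sampling instead of hashing the whole page.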
logpb_invalid_all_append_pages. When append state must be reset (e.g. after a partial-append failure), the one branch (if log_Gl.append.log_pgptr != NULL) flushes the dirty append page via logpb_flush_pages_direct first so committed work is not lost and nulls log_pgptr; it then zeroes flush_info->num_toflush and sets toflush[0] = NULL under flush_mutex.
10.7 logpb_fatal_error_internal — last-resort flush and exit
Unrecoverable errors call logpb_fatal_error -> logpb_fatal_error_internal with need_flush = true (logpb_fatal_error_exit_immediately_wo_flush passes false when flushing is itself unsafe):
```cpp
// logpb_fatal_error_internal -- src/transaction/log_page_buffer.c
if (log_exit == true && need_flush == true && log_Gl.append.log_pgptr != NULL)
  {
    static int in_fatal = false; /* <- reentrancy guard */
    if (in_fatal == false)
      {
        in_fatal = true;
        pgbuf_flush_checkpoint (...); /* flush only up to prev_lsa */
        in_fatal = false;
      }
  }
fileio_synchronize_all (thread_p); /* <- force everything to stable storage */
/* then boot_server_status(DOWN); NDEBUG -> exit, debug -> abort core dump */
```

Branches: the flush block runs only when all three of log_exit, need_flush, and a live append page hold; the in_fatal guard blocks recursive entry if the flush itself faults. It flushes “as much as you can without forcing the current unfinished log record” (committed work below prev_lsa becomes durable, the partial record is left for recovery), then calls fileio_synchronize_all and exits (NDEBUG) or aborts (debug).
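The effect of the in_fatal guard is easiest to see when the flush itself raises a second fatal error. The sketch below uses hypothetical names throughout (`fatal_path`, `flush_committed_work`, `count_flushes`) and does not terminate the process, so the nested call simply returns.

```cpp
#include <cassert>

static int g_flush_calls = 0;   // how many times the last-resort flush ran

void fatal_path(bool need_flush, bool fault_during_flush);

// Hypothetical stand-in for the last-resort flush; it may itself hit a fatal error.
static void flush_committed_work(bool fault)
{
  ++g_flush_calls;
  if (fault)
    fatal_path(true, false);   // nested fatal error raised from inside the flush
}

void fatal_path(bool need_flush, bool fault_during_flush)
{
  static bool in_fatal = false;   // reentrancy guard
  if (need_flush && !in_fatal)
    {
      in_fatal = true;
      flush_committed_work(fault_during_flush);
      in_fatal = false;
    }
  // A real implementation would synchronize files and terminate here.
}

// Reset the counter, run one fatal path, and report how often the flush ran.
int count_flushes(bool need_flush, bool fault_during_flush)
{
  g_flush_calls = 0;
  fatal_path(need_flush, fault_during_flush);
  return g_flush_calls;
}
```

Even when the flush faults and re-enters the fatal path, the flush body runs exactly once; without the guard the nested call would flush again and could recurse indefinitely.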
10.8 Open questions carried from the high-level doc
Four items from the companion remain open: the group-commit window (flush daemon’s wake timing and its interaction with PRM_ID_PB_SYNC_ON_NFLUSH, 10.3); whether the prior-list list_size cap throttles archive/flush; TDE placement (encryption is lazy in 10.3 and skipped in logpb_archive_active_log’s direct copy, so the single authoritative encrypt-on-disk point is untraced); and the LOG_DUMMY_GENERIC invariant (the condition under which flush requires a dummy is documented only by the source comment).
10.9 Chapter summary — key takeaways
- Three nested header structs: LOG_HDRPAGE per page; LOG_HEADER (page -9) the master record; LOG_ARV_HEADER per archive, with db_creation/next_trid copied from the former.
- nxarv_pageid/nxarv_phy_pageid are the archive boundary, advanced as a unit and flushed by logpb_archive_active_log (which also clears was_active_log_reset).
- Archiving forces a full append flush, copies [nxarv_pageid .. prev_lsa.pageid-1] as-stored, and treats any I/O failure as fatal.
- Deletion is gated: logpb_remove_archive_logs_exceed_limit clamps the window with a MIN() chain against crash-recovery, vacuum, CDC, flashback, and HA floors.
- logpb_flush_header is the single durability point for chkpt_lsa/eof_lsa/archive bookkeeping, under LOG_CS write, flags = 0.
- Edge records: LOG_END_OF_LOG appended in place without advancing append_lsa; dummies pad pages; a sampled-CRC32 checksum drives logpb_page_check_corruption.
- Fatal path: logpb_fatal_error_internal uses an in_fatal guard, flushes only up to prev_lsa, then exits/aborts.
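The deletion gate in the summary above reduces to taking the lowest page id that any consumer still needs. The following is a minimal sketch of such a MIN() chain under stated assumptions: the struct, the `_sketch` names, and the `NO_FLOOR_SKETCH` sentinel are hypothetical and do not mirror how logpb_remove_archive_logs_exceed_limit actually stores or reads its floors.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t page_id_sketch;

/* Hypothetical sentinel: a consumer that imposes no floor. */
#define NO_FLOOR_SKETCH INT64_MAX
#define MIN_SKETCH(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical container: the lowest log page each consumer still needs. */
struct archive_floors_sketch
{
  page_id_sketch crash_recovery;
  page_id_sketch vacuum;
  page_id_sketch cdc;
  page_id_sketch flashback;
  page_id_sketch ha;
};

/* The MIN() chain: the lowest page id that must survive deletion. */
static page_id_sketch
lowest_page_to_keep_sketch (const struct archive_floors_sketch *f)
{
  page_id_sketch keep = f->crash_recovery;
  keep = MIN_SKETCH (keep, f->vacuum);
  keep = MIN_SKETCH (keep, f->cdc);
  keep = MIN_SKETCH (keep, f->flashback);
  keep = MIN_SKETCH (keep, f->ha);
  return keep;
}

/* An archive is removable only if its last page lies below every floor. */
static int
archive_is_removable_sketch (page_id_sketch last_page_in_archive,
			     const struct archive_floors_sketch *f)
{
  return last_page_in_archive < lowest_page_to_keep_sketch (f);
}
```

With floors of 100, 80, no floor, 120, and 90, the chain yields 80, so an archive ending at page 79 may go and one ending at page 80 must stay. Adding a new consumer means adding one more floor to the chain, which can only shrink the deletable window.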
Position hints as of this revision
The line numbers below were observed on 2026-06-08. Symbols are the canonical anchor; line numbers are hints that decay.
| Symbol | File | Line |
|---|---|---|
| LOG_PAGESIZE | src/storage/storage_common.h | 99 |
| log_Zip_support | src/transaction/log_append.cpp | 40 |
| log_Zip_min_size_to_compress | src/transaction/log_append.cpp | 41 |
| log_append_info::get_nxio_lsa | src/transaction/log_append.cpp | 106 |
| log_append_info::set_nxio_lsa | src/transaction/log_append.cpp | 112 |
| log_prior_lsa_info::log_prior_lsa_info | src/transaction/log_append.cpp | 117 |
| LOG_RESET_APPEND_LSA | src/transaction/log_append.cpp | 128 |
| LOG_RESET_PREV_LSA | src/transaction/log_append.cpp | 136 |
| LOG_APPEND_PTR | src/transaction/log_append.cpp | 145 |
| log_append_init_zip | src/transaction/log_append.cpp | 185 |
| log_append_final_zip | src/transaction/log_append.cpp | 232 |
| prior_lsa_alloc_and_copy_data | src/transaction/log_append.cpp | 273 |
| prior_lsa_alloc_and_copy_crumbs | src/transaction/log_append.cpp | 410 |
| prior_lsa_copy_undo_data_to_node | src/transaction/log_append.cpp | 493 |
| prior_lsa_copy_redo_data_to_node | src/transaction/log_append.cpp | 524 |
| prior_lsa_gen_undoredo_record_from_crumbs | src/transaction/log_append.cpp | 651 |
| prior_lsa_gen_record | src/transaction/log_append.cpp | 1217 |
| prior_update_header_mvcc_info | src/transaction/log_append.cpp | 1320 |
| prior_lsa_next_record_internal | src/transaction/log_append.cpp | 1357 |
| commit_abort_lsa | src/transaction/log_append.cpp | 1485 |
| prior_lsa_next_record | src/transaction/log_append.cpp | 1553 |
| prior_lsa_next_record_with_lock | src/transaction/log_append.cpp | 1559 |
| prior_set_tde_encrypted | src/transaction/log_append.cpp | 1565 |
| prior_is_tde_encrypted | src/transaction/log_append.cpp | 1581 |
| prior_lsa_start_append | src/transaction/log_append.cpp | 1593 |
| prior_lsa_end_append | src/transaction/log_append.cpp | 1652 |
| prior_lsa_append_data | src/transaction/log_append.cpp | 1661 |
| log_append_get_zip_undo | src/transaction/log_append.cpp | 1725 |
| log_append_get_zip_redo | src/transaction/log_append.cpp | 1751 |
| log_prior_lsa_append_align | src/transaction/log_append.cpp | 1892 |
| log_prior_lsa_append_advance_when_doesnot_fit | src/transaction/log_append.cpp | 1905 |
| log_prior_lsa_append_add_align | src/transaction/log_append.cpp | 1917 |
| log_crumb | src/transaction/log_append.hpp | 46 |
| log_data_addr | src/transaction/log_append.hpp | 53 |
| LOG_PRIOR_LSA_LOCK | src/transaction/log_append.hpp | 66 |
| log_append_info | src/transaction/log_append.hpp | 73 |
| log_prior_node | src/transaction/log_append.hpp | 91 |
| log_prior_lsa_info | src/transaction/log_append.hpp | 112 |
| log_zip_alloc | src/transaction/log_compress.c | 237 |
| log_zip | src/transaction/log_compress.h | 53 |
| log_global::log_global | src/transaction/log_global.c | 49 |
| LOGAREA_SIZE | src/transaction/log_impl.h | 121 |
| log_setdirty | src/transaction/log_impl.h | 305 |
| log_flush_info | src/transaction/log_impl.h | 322 |
| log_topops_addresses | src/transaction/log_impl.h | 353 |
| log_topops_stack | src/transaction/log_impl.h | 362 |
| log_rcv_tdes | src/transaction/log_impl.h | 458 |
| log_tdes | src/transaction/log_impl.h | 475 |
| log_global | src/transaction/log_impl.h | 671 |
| log_lsa | src/transaction/log_lsa.hpp | 35 |
| NULL_LSA | src/transaction/log_lsa.hpp | 67 |
| MAX_LSA | src/transaction/log_lsa.hpp | 72 |
| LSA_COPY | src/transaction/log_lsa.hpp | 80 |
| LSA_AS_ARGS | src/transaction/log_lsa.hpp | 91 |
| LOG_TDES_LAST_SYSOP | src/transaction/log_manager.c | 199 |
| LOG_TDES_LAST_SYSOP_PARENT_LSA | src/transaction/log_manager.c | 200 |
| LOG_TDES_LAST_SYSOP_POSP_LSA | src/transaction/log_manager.c | 201 |
| log_Flush_daemon | src/transaction/log_manager.c | 363 |
| log_create_internal | src/transaction/log_manager.c | 827 |
| log_initialize_internal | src/transaction/log_manager.c | 1100 |
| log_abort_by_tdes | src/transaction/log_manager.c | 1583 |
| log_abort_all_active_transaction | src/transaction/log_manager.c | 1608 |
| log_final | src/transaction/log_manager.c | 1720 |
| log_append_undoredo_data | src/transaction/log_manager.c | 1893 |
| log_append_undo_data | src/transaction/log_manager.c | 1973 |
| log_append_redo_data | src/transaction/log_manager.c | 2035 |
| log_append_undoredo_crumbs | src/transaction/log_manager.c | 2086 |
| log_append_postpone | src/transaction/log_manager.c | 2719 |
| log_append_run_postpone | src/transaction/log_manager.c | 2881 |
| log_append_compensate_internal | src/transaction/log_manager.c | 3047 |
| log_append_savepoint | src/transaction/log_manager.c | 3365 |
| log_sysop_start | src/transaction/log_manager.c | 3599 |
| log_sysop_start_atomic | src/transaction/log_manager.c | 3665 |
| log_sysop_commit_internal | src/transaction/log_manager.c | 3825 |
| log_sysop_commit | src/transaction/log_manager.c | 3916 |
| log_sysop_end_logical_undo | src/transaction/log_manager.c | 3941 |
| log_sysop_end_logical_compensate | src/transaction/log_manager.c | 3984 |
| log_sysop_end_logical_run_postpone | src/transaction/log_manager.c | 4003 |
| log_sysop_end_recovery_postpone | src/transaction/log_manager.c | 4024 |
| log_sysop_abort | src/transaction/log_manager.c | 4038 |
| log_sysop_attach_to_outer | src/transaction/log_manager.c | 4097 |
| log_append_commit_postpone | src/transaction/log_manager.c | 4384 |
| log_append_sysop_start_postpone | src/transaction/log_manager.c | 4455 |
| log_append_repl_info_and_commit_log | src/transaction/log_manager.c | 4647 |
| log_append_donetime_internal | src/transaction/log_manager.c | 4679 |
| log_change_tran_as_completed | src/transaction/log_manager.c | 4722 |
| log_append_commit_log | src/transaction/log_manager.c | 4779 |
| log_append_commit_log_with_lock | src/transaction/log_manager.c | 4802 |
| log_append_abort_log | src/transaction/log_manager.c | 4816 |
| log_commit_local | src/transaction/log_manager.c | 5159 |
| log_abort_local | src/transaction/log_manager.c | 5277 |
| log_commit | src/transaction/log_manager.c | 5352 |
| log_abort | src/transaction/log_manager.c | 5461 |
| log_abort_partial | src/transaction/log_manager.c | 5558 |
| log_complete | src/transaction/log_manager.c | 5653 |
| log_rollback | src/transaction/log_manager.c | 7664 |
| log_tran_do_postpone | src/transaction/log_manager.c | 8156 |
| log_sysop_do_postpone | src/transaction/log_manager.c | 8190 |
| log_do_postpone | src/transaction/log_manager.c | 8237 |
| log_run_postpone_op | src/transaction/log_manager.c | 8481 |
| log_wakeup_remove_log_archive_daemon | src/transaction/log_manager.c | 10099 |
| log_wakeup_log_flush_daemon | src/transaction/log_manager.c | 10126 |
| log_is_log_flush_daemon_available | src/transaction/log_manager.c | 10141 |
| log_remove_log_archive_daemon_task | src/transaction/log_manager.c | 10185 |
| log_flush_execute | src/transaction/log_manager.c | 10377 |
| log_flush_daemon_init | src/transaction/log_manager.c | 10493 |
| log_abort_task_execute | src/transaction/log_manager.c | 10558 |
| cdc_min_log_pageid_to_keep | src/transaction/log_manager.c | 14021 |
| LOG_IS_SYSTEM_OP_STARTED | src/transaction/log_manager.h | 59 |
| LOGPB_HEADER_PAGE_ID | src/transaction/log_page_buffer.c | 138 |
| LOG_APPEND_ALIGN | src/transaction/log_page_buffer.c | 164 |
| LOG_APPEND_ADVANCE_WHEN_DOESNOT_FIT | src/transaction/log_page_buffer.c | 176 |
| LOG_APPEND_ADVANCE_WHEN_DOESNOT_FIT | src/transaction/log_page_buffer.c | 177 |
| LOG_APPEND_SETDIRTY_ADD_ALIGN | src/transaction/log_page_buffer.c | 184 |
| log_buffer | src/transaction/log_page_buffer.c | 192 |
| log_buffer | src/transaction/log_page_buffer.c | 194 |
| log_pb_global_data | src/transaction/log_page_buffer.c | 244 |
| logpb_get_log_buffer | src/transaction/log_page_buffer.c | 394 |
| logpb_initialize_log_buffer | src/transaction/log_page_buffer.c | 425 |
| logpb_compute_page_checksum | src/transaction/log_page_buffer.c | 446 |
| logpb_set_page_checksum | src/transaction/log_page_buffer.c | 495 |
| logpb_page_has_valid_checksum | src/transaction/log_page_buffer.c | 523 |
| logpb_initialize_pool | src/transaction/log_page_buffer.c | 553 |
| logpb_finalize_pool | src/transaction/log_page_buffer.c | 672 |
| logpb_create_page | src/transaction/log_page_buffer.c | 783 |
| logpb_locate_page | src/transaction/log_page_buffer.c | 807 |
| logpb_set_dirty | src/transaction/log_page_buffer.c | 929 |
| logpb_flush_header | src/transaction/log_page_buffer.c | 1676 |
| logpb_fetch_start_append_page | src/transaction/log_page_buffer.c | 2504 |
| logpb_fetch_start_append_page_new | src/transaction/log_page_buffer.c | 2586 |
| logpb_next_append_page | src/transaction/log_page_buffer.c | 2630 |
| logpb_writev_append_pages | src/transaction/log_page_buffer.c | 2780 |
| logpb_write_toflush_pages_to_archive | src/transaction/log_page_buffer.c | 2868 |
| logpb_append_next_record | src/transaction/log_page_buffer.c | 2981 |
| logpb_append_prior_lsa_list | src/transaction/log_page_buffer.c | 3040 |
| prior_lsa_remove_prior_list | src/transaction/log_page_buffer.c | 3084 |
| logpb_prior_lsa_append_all_list | src/transaction/log_page_buffer.c | 3106 |
| logpb_flush_all_append_pages | src/transaction/log_page_buffer.c | 3232 |
| logpb_flush_pages_direct | src/transaction/log_page_buffer.c | 3952 |
| logpb_flush_pages | src/transaction/log_page_buffer.c | 3980 |
| logpb_force_flush_pages | src/transaction/log_page_buffer.c | 4096 |
| logpb_force_flush_header_and_pages | src/transaction/log_page_buffer.c | 4104 |
| logpb_invalid_all_append_pages | src/transaction/log_page_buffer.c | 4121 |
| logpb_flush_log_for_wal | src/transaction/log_page_buffer.c | 4162 |
| logpb_start_append | src/transaction/log_page_buffer.c | 4207 |
| logpb_append_data | src/transaction/log_page_buffer.c | 4290 |
| logpb_append_crumbs | src/transaction/log_page_buffer.c | 4366 |
| logpb_end_append | src/transaction/log_page_buffer.c | 4455 |
| logpb_archive_active_log | src/transaction/log_page_buffer.c | 5649 |
| logpb_remove_archive_logs_exceed_limit | src/transaction/log_page_buffer.c | 5991 |
| logpb_fatal_error | src/transaction/log_page_buffer.c | 10607 |
| logpb_fatal_error_exit_immediately_wo_flush | src/transaction/log_page_buffer.c | 10618 |
| logpb_fatal_error_internal | src/transaction/log_page_buffer.c | 10629 |
| logpb_initialize_flush_info | src/transaction/log_page_buffer.c | 10878 |
| logpb_finalize_flush_info | src/transaction/log_page_buffer.c | 10912 |
| logpb_need_wal | src/transaction/log_page_buffer.c | 11229 |
| logpb_page_check_corruption | src/transaction/log_page_buffer.c | 11508 |
| logpb_get_tde_algorithm | src/transaction/log_page_buffer.c | 11565 |
| logpb_set_tde_algorithm | src/transaction/log_page_buffer.c | 11593 |
| log_rectype | src/transaction/log_record.hpp | 35 |
| log_rec_header | src/transaction/log_record.hpp | 146 |
| log_data | src/transaction/log_record.hpp | 157 |
| log_rec_undoredo | src/transaction/log_record.hpp | 167 |
| log_rec_undo | src/transaction/log_record.hpp | 176 |
| log_rec_redo | src/transaction/log_record.hpp | 184 |
| log_vacuum_info | src/transaction/log_record.hpp | 192 |
| log_rec_mvcc_undoredo | src/transaction/log_record.hpp | 202 |
| log_rec_mvcc_undo | src/transaction/log_record.hpp | 211 |
| log_rec_mvcc_redo | src/transaction/log_record.hpp | 220 |
| log_rec_donetime | src/transaction/log_record.hpp | 237 |
| log_rec_compensate | src/transaction/log_record.hpp | 262 |
| log_rec_start_postpone | src/transaction/log_record.hpp | 271 |
| log_sysop_end_type | src/transaction/log_record.hpp | 285 |
| log_rec_sysop_end | src/transaction/log_record.hpp | 305 |
| log_rec_sysop_start_postpone | src/transaction/log_record.hpp | 328 |
| log_rec_run_postpone | src/transaction/log_record.hpp | 336 |
| log_rec_savept | src/transaction/log_record.hpp | 380 |
| LOG_GET_LOG_RECORD_HEADER | src/transaction/log_record.hpp | 441 |
| LOG_IS_MVCC_OP_RECORD_TYPE | src/transaction/log_record.hpp | 463 |
| LOG_HDRPAGE_FLAG_ENCRYPTED_MASK | src/transaction/log_storage.hpp | 45 |
| LOG_IS_PAGE_TDE_ENCRYPTED | src/transaction/log_storage.hpp | 47 |
| LOGPB_HEADER_PAGE_ID | src/transaction/log_storage.hpp | 51 |
| log_hdrpage | src/transaction/log_storage.hpp | 63 |
| log_page | src/transaction/log_storage.hpp | 80 |
| log_page | src/transaction/log_storage.hpp | 81 |
| log_header | src/transaction/log_storage.hpp | 113 |
| log_arv_header | src/transaction/log_storage.hpp | 231 |
| logtb_get_new_tran_id | src/transaction/log_tran_table.c | 1741 |
| LOG_IS_MVCC_OPERATION | src/transaction/mvcc.h | 261 |
Sources
- cubrid-log-manager.md: the high-level companion. See also cubrid-prior-list.md (the prior-list mechanism) and cubrid-recovery-manager.md (how these records are replayed).
- Raw analyses under raw/code-analysis/cubrid/storage/log_manager/.
- Code: src/transaction/log_manager.{c,h}, log_append.{cpp,hpp}, log_record.hpp, log_lsa.{cpp,hpp}, log_storage.hpp, log_page_buffer.c.
- Methodology: knowledge/methodology/code-analysis-detail-doc.md.