
CUBRID Recovery Manager — Code-Level Deep Dive

Where this document fits: The high-level analysis cubrid-recovery-manager.md covers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full restart of a crashed database inside the kernel.

Contents:

| Ch | Title | Status |
|---|---|---|
| 1 | Data-Structure Map | |
| 2 | Restart Entry and Log Page Access | |
| 3 | Analysis Pass Driver | |
| 4 | Analysis Record Dispatch and Transaction Table Rebuild | |
| 5 | Sysop and Postpone Bookkeeping During Analysis | |
| 6 | Redo Pass Driver and Synchronous Apply | |
| 7 | Parallel Redo Infrastructure | |
| 8 | Atomic Sysop Abort and Postpone Completion | |
| 9 | Undo Pass and Compensation | |
| 10 | The RV_fun Dispatch Table | |
| 11 | Special Paths | |

Chapter 1: Data-Structure Map

Theory lives in the companion cubrid-recovery-manager.md (“The recovery dispatch table”, “Redo pass — modern dispatch via templates”); this chapter pins down every field of every recovery-side structure and the pointers between them.

flowchart TB
  LD["LOG_DATA\nrcvindex / vpid / offset"]
  RVF["RV_fun[rcvindex]\n(struct rvfun)"]
  CTX["log_rv_redo_context"]
  RCV["LOG_RCV"]
  RECINFO["log_rv_redo_rec_info<T>"]
  FUNC["undofun / redofun"]

  LD -->|"selects"| RVF
  RVF --> FUNC
  RECINFO -->|"typed header copy"| RCV
  CTX -->|"unzip buffer feeds rcv.data"| RCV
  FUNC -->|"called with &rcv"| RCV

Figure 1-1 — the rcvindex selects an RV_fun entry; the redo context unzips the payload into the LOG_RCV the chosen function receives.

1.2 LOG_RCV — the universal recovery argument


Every undo, redo, compensate, and run-postpone function has the signature int (*)(THREAD_ENTRY *, LOG_RCV *); LOG_RCV is the narrow waist.

// log_rcv -- src/transaction/recovery.h
struct log_rcv
{                                     /* Recovery information */
  MVCCID mvcc_id = MVCCID_NULL;       /* mvcc id */
  PAGE_PTR pgptr = nullptr;           /* Page to recover. Page should not be free by recovery functions,
                                       * however it should be set dirty whenever is needed */
  // ... condensed: PGLENGTH offset; int length ...
  const char *data = nullptr;         /* Replacement data. Pointer becomes invalid once the recovery
                                       * of the data is finished */ /* <- borrowed, see invariant below */
  LOG_LSA reference_lsa = NULL_LSA;   /* Next LSA used by compensate/postpone. */
  // ... condensed: default ctor; copy/move ctors and both assignments deleted ...
};

| Field | Role | Why it exists |
|---|---|---|
| mvcc_id | MVCCID for MVCC-class records, else MVCCID_NULL | record-header field; set only by the MVCC log_rv_get_log_rec_mvccid specializations |
| pgptr | page to recover, fixed by the driver; nullptr for logical records | fix/unfix is centralized in the driver |
| offset | offset or slot id within pgptr, from LOG_DATA.offset | physical recovery is page+offset addressed |
| length | byte length of data | raw buffer, no terminator |
| data | redo replacement or undo before-image | points into a LOG_ZIP buffer or the log page (lifetime rule below) |
| reference_lsa | compensate: the transaction’s undo_nxlsa at undo time, i.e. the next LSA the undo chain resumes from, handed to log_sysop_end_logical_compensate; both log_rollback_record (runtime rollback) and log_rv_undo_record (restart undo) fill it. Run-postpone: LSA of the original postpone record, filled by log_execute_run_postpone | anchor for the manual logical functions (1.7) that append their own compensation / run-postpone records |

Invariant (borrowed-data lifetime). rcv.data and rcv.pgptr are loans: data aliases the unzip buffer (m_redo_zip.log_data) or the log page; pgptr is unfixed by the caller on return. Enforcement: all four copy/move operations deleted; log_rv_redo_record_sync builds a fresh stack-local LOG_RCV per record, a scope_exit unfixing pgptr. Stashing rcv->data means the next record’s unzip silently corrupts the replay.
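The lifetime rule above can be illustrated with a minimal C++ sketch. All names here (rcv_sketch, unzip_buffer_sketch, apply_one, consume_copy) are hypothetical stand-ins rather than CUBRID identifiers, and a std::string plays the role of the unzip buffer. The point is only the shape: the argument struct is non-copyable, a fresh one is built per record, and a consumer copies what it needs before returning.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical stand-in for LOG_RCV: it holds a borrowed pointer,
// so all four copy/move operations are deleted.
struct rcv_sketch
{
  const char *data = nullptr;   // borrowed from the unzip buffer
  std::size_t length = 0;

  rcv_sketch () = default;
  rcv_sketch (const rcv_sketch &) = delete;
  rcv_sketch (rcv_sketch &&) = delete;
  rcv_sketch &operator= (const rcv_sketch &) = delete;
  rcv_sketch &operator= (rcv_sketch &&) = delete;
};

static_assert (!std::is_copy_constructible<rcv_sketch>::value, "loans must not be copied");

// Hypothetical stand-in for the reusable unzip buffer of the redo context.
struct unzip_buffer_sketch
{
  std::string bytes;
};

// Overwrites the shared buffer with the next payload and lends it to a
// fresh stack-local argument; the loan ends when recovery_fn returns.
template <typename Fn>
void apply_one (unzip_buffer_sketch &buf, const std::string &payload, Fn recovery_fn)
{
  buf.bytes = payload;          // the previous record's bytes are gone after this line
  rcv_sketch rcv;
  rcv.data = buf.bytes.data ();
  rcv.length = buf.bytes.size ();
  recovery_fn (rcv);
}

// A well-behaved consumer copies the bytes while the loan is still valid.
inline std::string consume_copy (const rcv_sketch &rcv)
{
  return std::string (rcv.data, rcv.length);
}
```

A consumer that stored rcv.data itself instead of calling something like consume_copy would be left pointing at whatever the next apply_one call wrote into the buffer.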

rvfun (recovery.h) bundles fun_t = int (*)(THREAD_ENTRY *, LOG_RCV *), dump_fun_t = void (*)(FILE *, int, void *), and six fields; extern struct rvfun RV_fun[] is initialized in recovery.c:

| Field | Role | Why it exists |
|---|---|---|
| recv_index | copy of the entry’s own index (/* For verification */) | rv_check_rvfuns asserts RV_fun[i].recv_index == i at debug startup |
| recv_string | name, e.g. "RVDK_FORMAT" | trace/dump output via rv_rcvindex_string |
| undofun | applied by undo/rollback, and by redo of compensate records: log_rv_get_fun<LOG_REC_COMPENSATE> returns undofun (// yes, undo) | a CLR’s redo is the original undo |
| redofun | applied by redo pass, run-postpone, HA replication apply | the forward image applier |
| dump_undofun / dump_redofun | payload pretty-printers, NULL if none | log-dump tooling only |

rv_rcvindex_string is branch-free (return RV_fun[rcvindex].recv_string;). rv_check_rvfuns only turns initializer misordering into a debug-build startup failure (er_set plus assert (false)); nothing guards an out-of-range argument such as RV_NOT_DEFINED (999) — callers must pass a defined index.
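To make the gap concrete, here is a small sketch of the two ideas side by side: an ordering check in the spirit of rv_check_rvfuns, and a bounds-checked lookup of the kind the text says is absent. The three-entry table, the entry names, and both helper functions are invented for illustration; they are not the real RV_fun contents or CUBRID APIs.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical miniature of a dispatch-table entry: index plus name only.
struct rvfun_sketch
{
  int recv_index;
  const char *recv_string;
};

static const rvfun_sketch table_sketch[] = {
  {0, "RV_A"},
  {1, "RV_B"},
  {2, "RV_C"},
};

constexpr std::size_t table_sketch_size = sizeof (table_sketch) / sizeof (table_sketch[0]);

// Ordering check: every slot must carry its own index.
inline bool table_is_ordered (const rvfun_sketch *table, std::size_t n)
{
  for (std::size_t i = 0; i < n; i++)
    {
      if (table[i].recv_index != static_cast<int> (i))
        {
          return false;
        }
    }
  return true;
}

// Defensive lookup (an addition of this sketch): returns nullptr for any
// index outside the table instead of reading past its end.
inline const char *name_or_null (int rcvindex)
{
  if (rcvindex < 0 || static_cast<std::size_t> (rcvindex) >= table_sketch_size)
    {
      return nullptr;
    }
  return table_sketch[rcvindex].recv_string;
}
```

With a guard like name_or_null, a sentinel such as 999 yields nullptr; with a bare array access it would read out of bounds.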

1.4 LOG_RCVINDEX — the index space, by family


Invariant (append-only numbering). Indices persist inside log records, so renumbering replays the wrong function on old databases. The enum header warns: “NEW ENTRIES SHOULD BE ADDED AT THE BOTTON OF THE FILE … to AVOID OLD DATABASES TO BE RECOVERED UNDER OLD FILE” — hence RVPGBUF_SET_TDE_ALGORITHM (127) far from its siblings (120–123). RV_LAST_LOGID = RVHF_LOB_REMOVE_DIR (129) marks the top; RV_NOT_DEFINED = 999 is the “no rcvindex” sentinel.

| Family | Range | Subsystem |
|---|---|---|
| RVDK_* | 0–9 | disk manager |
| RVFL_* | 10–32, 128 | file manager |
| RVHF_* | 33–53, 126, 129 | heap |
| RVOVF_* | 54–57 | overflow records |
| RVEH_* | 58–65 | extendible hash |
| RVBT_* | 66–91, 124–125 | b-tree, incl. logical-key set (1.7) |
| RVCT_* | 92–96 | catalog pages |
| RVLOG_* | 97 | logical-redo noop marker |
| RVREPL_* | 98–103 | replication, HA appliers only |
| RVVAC_* | 104–117 | vacuum |
| RVES_* | 118 | external storage (LOB) |
| RVLOC_* | 119 | locator classname dummy |
| RVPGBUF_* | 120–123, 127 | page buffer |

The redo pass (Ch 6) and each parallel-redo applier (Ch 7) own one log_rv_redo_context (log_recovery_redo.hpp):

| Field | Role | Why it exists |
|---|---|---|
| m_reader | log_reader cursor, built with LOG_CS_SAFE_READER | independent log position per applier |
| m_redo_zip | unzip target for redo payloads; its log_data becomes rcv.data | output must outlive the recovery-function call |
| m_undo_zip | unzip target for the undo half of diff undoredo records | LOG_DIFF_UNDOREDO_DATA stores redo as an XOR diff against undo |
| m_end_redo_lsa | const upper bound; records at or past it are not redone | freezes the redo horizon before the pass |
| m_reader_fetch_page_mode | const fetch mode for set_lsa_and_fetch_page; NORMAL refetches only when the pageid changes (do_fetch_page is true when the mode is FORCE or m_lsa.pageid != lsa.pageid) | the only constructor call (redo pass, log_recovery.c) passes NORMAL; FORCE is retained unused for future reuse (log_reader.hpp comment) |

Default constructor deleted; the two-argument constructor pre-grows both buffers to LOGAREA_SIZE; move and both assignments deleted. The copy constructor — the only allowed copy — delegates back with (o.m_end_redo_lsa, o.m_reader_fetch_page_mode): only the two const knobs survive, so each parallel-redo worker gets fresh buffers and reader.
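The copy rule described above (only the two const knobs survive a copy) can be sketched as follows. This is an illustration under simplifying assumptions, not the CUBRID class: the LSA is reduced to a long, the zip buffers to std::vector<char>, and initial_capacity is an arbitrary placeholder for LOGAREA_SIZE.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class fetch_mode_sketch { NORMAL, FORCE };

// Hypothetical miniature of a redo context: two const knobs plus two
// scratch buffers that must never be shared between workers.
class redo_context_sketch
{
  public:
    redo_context_sketch () = delete;

    redo_context_sketch (long end_redo_lsa, fetch_mode_sketch mode)
      : m_end_redo_lsa (end_redo_lsa), m_mode (mode), m_redo_buf (), m_undo_buf ()
    {
      m_redo_buf.reserve (initial_capacity);
      m_undo_buf.reserve (initial_capacity);
    }

    // Copy delegates to the main constructor: buffer contents are NOT copied,
    // only the two const knobs are carried over.
    redo_context_sketch (const redo_context_sketch &o)
      : redo_context_sketch (o.m_end_redo_lsa, o.m_mode)
    {
    }

    redo_context_sketch (redo_context_sketch &&) = delete;
    redo_context_sketch &operator= (const redo_context_sketch &) = delete;
    redo_context_sketch &operator= (redo_context_sketch &&) = delete;

    static constexpr std::size_t initial_capacity = 4096;   // placeholder size

    const long m_end_redo_lsa;
    const fetch_mode_sketch m_mode;
    std::vector<char> m_redo_buf;
    std::vector<char> m_undo_buf;
};
```

Copying a context whose buffer already holds data yields a context with the same bound and mode but an empty, freshly reserved buffer, which is the property a per-worker copy relies on.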

Each applied record is a log_rv_redo_rec_info<T>: every special member is deleted except the (log_lsa, LOG_RECTYPE, const T &) constructor — built once, fully initialized, never reseated.

| Field | Role | Why it exists |
|---|---|---|
| m_start_lsa | LSA of the record header | stamped onto the page after apply (pgbuf_set_lsa); key of the check below |
| m_type | the LOG_RECTYPE | drives the LOG_DIFF_UNDOREDO_DATA XOR-diff branch in log_rv_get_log_rec_redo_data |
| m_logrec | by-value copy of the typed body T, one of LOG_REC_{UNDOREDO, MVCC_UNDOREDO, REDO, MVCC_REDO, RUN_POSTPONE, COMPENSATE} | frees the reader to advance; log_rv_get_log_rec_* specializations extract vpid/mvccid/length/offset |

Invariant (per-page LSA ordering, debug builds). Redo for one page must apply in log order even across threads; vpid_lsa_consistency_check (compiled out under NDEBUG) checks a necessary condition of it:

// vpid_lsa_consistency_check::check -- src/transaction/log_recovery_redo.cpp
std::lock_guard<std::mutex> lck (mtx);
const vpid_key_t key {a_vpid.volid, a_vpid.pageid};
const auto map_it = consistency_check_map.find (key);
if (map_it != consistency_check_map.cend ())
  {
    assert ((*map_it).second < a_log_lsa); /* <- later applies must beat the stored LSA */
  }
consistency_check_map.emplace (key, a_log_lsa); /* <- emplace never overwrites an existing key */

| Field | Role | Why it exists |
|---|---|---|
| mtx | serializes check and cleanup | the map (global log_Gl_recovery_redo_consistency_check) is hit by every applier |
| consistency_check_map | per-page baseline: vpid_log_lsa_map_t maps vpid_key_t = (volid, pageid) to the first LSA applied to the page; emplace never overwrites, so the baseline never advances | the assert demands every later apply carry an LSA above the baseline; this is weaker than pairwise monotonicity (a swap between two later applies passes), but an image older than the first apply still trips it |

cleanup() clears the map after the pass; log_rv_redo_record_sync consults it only while log_Gl.rcv_phase != LOG_RESTARTED.
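The baseline behaviour is easy to misread, so here is a single-threaded sketch of it. The class and type names are hypothetical, the mutex is omitted, and check returns false where the original would trip its assert. It relies only on the standard guarantee that std::map::emplace leaves an existing key untouched.

```cpp
#include <cassert>
#include <map>
#include <utility>

using vpid_key_sketch = std::pair<short, int>;   // (volid, pageid)
using lsa_sketch = std::pair<long, int>;         // (pageid, offset), compared lexicographically

// Hypothetical, lock-free miniature of the per-page consistency check.
class consistency_check_sketch
{
  public:
    // Returns false where the original code would fail its assert.
    bool check (const vpid_key_sketch &key, const lsa_sketch &lsa)
    {
      const auto it = m_map.find (key);
      const bool ok = (it == m_map.cend ()) || (it->second < lsa);
      m_map.emplace (key, lsa);   // no effect when the key already exists
      return ok;
    }

    void cleanup ()
    {
      m_map.clear ();
    }

  private:
    std::map<vpid_key_sketch, lsa_sketch> m_map;
};
```

Applying LSAs 10, 30, 20 to one page passes all three checks, since 20 is still above the baseline 10 even though it arrives after 30; applying 5 afterwards fails, and so does re-applying 10 because the comparison is strict.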

LOG_RCV_TDES (field rcv of log_tdes, log_impl.h) carries analysis-pass discoveries into later passes (Ch 4, 5, 8). Five LSAs:

| Field | Role | Why it exists |
|---|---|---|
| sysop_start_postpone_lsa | LSA of the LOG_SYSOP_START_POSTPONE in progress at crash | resume anchor for the sysop postpone phase (Ch 8) |
| tran_start_postpone_lsa | where transaction-level postpone began | splits “committed, postpones pending” from plain active |
| atomic_sysop_start_lsa | start of an interrupted atomic file op (file_perm_alloc / file_perm_dealloc) | must complete or roll back fully before postpones run (Ch 8) |
| analysis_last_aborted_sysop_lsa | end LSA of the last sysop aborted during analysis (“to recover logical redo operation”) | logical redo must not re-enter the aborted range |
| analysis_last_aborted_sysop_start_lsa | matching start LSA of that sysop | the other end of the bracket |

LOG_RECVPHASE (log_impl.h), the global mode switch log_Gl.rcv_phase, is consulted far outside recovery (page-buffer fix rules, the check above): LOG_RESTARTED (recovery done), LOG_RECOVERY_ANALYSIS_PHASE (Ch 3–5), LOG_RECOVERY_REDO_PHASE (Ch 6–7), LOG_RECOVERY_UNDO_PHASE (Ch 9), LOG_RECOVERY_FINISH_2PC_PHASE (Ch 11).

Checkpoint snapshot records (log_record.hpp): a fixed LOG_REC_CHKPT header, then ntrans trans entries, then ntops sysop entries.

| log_rec_chkpt field | Role | Why it exists |
|---|---|---|
| redo_lsa | “Oldest LSA of dirty data page in page buffers” (source comment) | redo-pass lower bound; the fuzzy-checkpoint contract |
| ntrans | count of trans entries following | variable-sized record |
| ntops | count of sysop entries after the trans array | same |

LOG_INFO_CHKPT_TRANS snapshots the same-named live log_tdes fields; analysis re-creates a TDES per entry, then corrects it from the log tail (Ch 4):

| Field | Role | Why it exists |
|---|---|---|
| isloose_end | loose-end flag at checkpoint | marks 2PC/client loose ends |
| trid | transaction identifier | key for re-creating the TDES |
| state | TRAN_STATE at checkpoint | seeds loose-end classification |
| head_lsa | first log record of the transaction | bounds the backward chain |
| tail_lsa | last record at checkpoint | analysis scan resumes here |
| undo_nxlsa | next record to undo, given CLRs logged during undo | rollback skips already-compensated work |
| posp_nxlsa | first postpone record | where postpone execution starts |
| savept_lsa | last savepoint | savepoint chain head, partial rollback |
| tail_topresult_lsa | last partial abort/commit | nested-sysop resolution |
| start_postpone_lsa | start-postpone address, if mid-postpone | such a transaction must finish postpones, not be undone |
| user_name | client name (char[LOG_USERNAME_MAX]) | restored into the TDES |

LOG_INFO_CHKPT_SYSOP snapshots the two persistent sysop anchors of LOG_RCV_TDES. The other three LOG_RCV_TDES LSAs never travel in it: the analysis_last_aborted_* pair are products of the current analysis run, never persisted, and tran_start_postpone_lsa rides in the per-transaction entry instead, as LOG_INFO_CHKPT_TRANS.start_postpone_lsa:

| Field | Role | Why it exists |
|---|---|---|
| trid | which transaction’s TDES the two LSAs are restored into | keyed by transaction, not parallel to the trans array |
| sysop_start_postpone_lsa | saved rcv.sysop_start_postpone_lsa | the sysop state can predate the checkpoint |
| atomic_sysop_start_lsa | saved rcv.atomic_sysop_start_lsa | same, for interrupted atomic file ops |

1.7 LOG_ZIP and the logical-classifier macros


LOG_ZIP (log_compress.h), the compression workspace of the write path and (1.5) the redo context, owns log_data (freed by log_zip_free_data); all four copy/move operations are deleted — a member-wise copy would double-free:

| Field | Role | Why it exists |
|---|---|---|
| data_length | valid bytes currently in log_data | after log_unzip, the length handed to rcv.length |
| buf_size | allocated capacity | log_zip_realloc_if_needed grows it; sticky across records |
| log_data | the owned buffer (“used as data buffer”) | the storage rcv.data borrows (the 1.2 lifetime rule) |

A stored length marks compression in its top bit: MAKE_ZIP_LEN(l) sets 0x80000000, ZIP_CHECK(l) tests, GET_ZIP_LEN(l) strips.
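The top-bit convention can be sketched with three small functions. They are written here as constexpr functions with lowercase, hypothetical names rather than as the actual macros, and they assume a 32-bit unsigned length; only the 0x80000000 flag value comes from the text above.

```cpp
#include <cassert>
#include <cstdint>

// Flag bit that marks a stored length as "payload is compressed".
constexpr std::uint32_t zip_flag_sketch = 0x80000000u;

// Set the flag on a plain length.
constexpr std::uint32_t make_zip_len_sketch (std::uint32_t len)
{
  return len | zip_flag_sketch;
}

// Test whether a stored length carries the flag.
constexpr bool zip_check_sketch (std::uint32_t stored)
{
  return (stored & zip_flag_sketch) != 0;
}

// Strip the flag to recover the byte count.
constexpr std::uint32_t get_zip_len_sketch (std::uint32_t stored)
{
  return stored & ~zip_flag_sketch;
}

static_assert (zip_check_sketch (make_zip_len_sketch (1234u)), "flag must round-trip");
static_assert (get_zip_len_sketch (make_zip_len_sketch (1234u)) == 1234u, "length must round-trip");
```

A consequence of the encoding is that a real payload length can never use the top bit itself, so the usable range is just under 2 GiB.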

The classifier macros. Four pure disjunctions over LOG_RCVINDEX; the branches are the listed indices.

// RCV_IS_BTREE_LOGICAL_LOG -- src/transaction/recovery.h
#define RCV_IS_BTREE_LOGICAL_LOG(idx) \
  ((idx) == RVBT_DELETE_OBJECT_PHYSICAL \
   || (idx) == RVBT_MVCC_DELETE_OBJECT \
   || (idx) == RVBT_MVCC_INSERT_OBJECT \
   || (idx) == RVBT_NON_MVCC_INSERT_OBJECT \
   || (idx) == RVBT_MARK_DELETED \
   || (idx) == RVBT_DELETE_OBJECT_POSTPONE \
   || (idx) == RVBT_MVCC_INSERT_OBJECT_UNQ \
   || (idx) == RVBT_MVCC_NOTIFY_VACUUM \
   || (idx) == RVBT_ONLINE_INDEX_UNDO_TRAN_DELETE \
   || (idx) == RVBT_ONLINE_INDEX_UNDO_TRAN_INSERT)

These ten ops are logged by key value, not page image — undo re-descends the tree, never running against one fixed page.

RCV_IS_LOGICAL_COMPENSATE_MANUAL is the btree set plus exactly six: RVFL_ALLOC, RVFL_USER_PAGE_MARK_DELETE, RVPGBUF_DEALLOC, RVFL_TRACKER_HEAP_REUSE, RVHF_LOB_REMOVE_DIR, RVFL_TRACKER_UNREGISTER; their undofun appends its own compensation via rcv.reference_lsa, so the rollback driver must not auto-append a LOG_COMPENSATE — a re-crash would double-undo. RCV_IS_LOGICAL_RUN_POSTPONE_MANUAL matches exactly four: RVFL_DEALLOC, RVHF_MARK_DELETED, RVHF_LOB_REMOVE_DIR, RVBT_DELETE_OBJECT_POSTPONE; as postpone actions their redofun closes with LOG_SYSOP_END_LOGICAL_RUN_POSTPONE, not a standard LOG_RUN_POSTPONE (Ch 8). RVHF_LOB_REMOVE_DIR and RVBT_DELETE_OBJECT_POSTPONE sit in both sets.

RCV_IS_LOGICAL_LOG (vpid, idx) is the master test and the only one that inspects the address: ((vpid)->volid == NULL_VOLID) || ((vpid)->pageid == NULL_PAGEID) short-circuits to logical regardless of index; then RCV_IS_BTREE_LOGICAL_LOG (idx); then eleven indices: RVBT_MVCC_INCREMENTS_UPD, RVPGBUF_FLUSH_PAGE, RVFL_DESTROY, RVFL_ALLOC, RVFL_DEALLOC, RVVAC_NOTIFY_DROPPED_FILE, RVPGBUF_DEALLOC, RVES_NOTIFY_VACUUM, RVHF_MARK_DELETED, RVFL_TRACKER_HEAP_REUSE, RVFL_TRACKER_UNREGISTER. A new logical index missing here makes recovery try to fix a nonexistent page — a fix error during rollback, far from the bug.

flowchart TD
  A["record vpid + rcvindex"] --> B{"volid or pageid NULL?"}
  B -- yes --> L["logical: undofun gets pgptr = nullptr"]
  B -- no --> C{"RCV_IS_BTREE_LOGICAL_LOG?"}
  C -- yes --> L
  C -- no --> D{"one of the 11 listed indices?"}
  D -- yes --> L
  D -- no --> P["physical: driver fixes page, passes pgptr"]

Figure 1-2 — RCV_IS_LOGICAL_LOG as evaluated by undo/rollback drivers.
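The three-stage test in Figure 1-2 can be sketched as a single predicate. The enum below is a hypothetical four-value stand-in for LOG_RCVINDEX (one member per category), and the -1 sentinels for the null volume and page ids are assumptions of this sketch; the structure of the disjunction is what matters.

```cpp
#include <cassert>

struct vpid_sketch
{
  short volid;
  int pageid;
};

constexpr short null_volid_sketch = -1;   // assumed sentinel value
constexpr int null_pageid_sketch = -1;    // assumed sentinel value

// Hypothetical index space: one representative per category.
enum rcvindex_sketch
{
  IDX_PHYSICAL_A,
  IDX_BTREE_LOGICAL,
  IDX_LISTED_LOGICAL,
  IDX_PHYSICAL_B
};

inline bool is_btree_logical (rcvindex_sketch idx)
{
  return idx == IDX_BTREE_LOGICAL;
}

inline bool is_listed_logical (rcvindex_sketch idx)
{
  return idx == IDX_LISTED_LOGICAL;
}

// Address test first: a null volid or pageid is logical regardless of index.
inline bool is_logical (const vpid_sketch &vpid, rcvindex_sketch idx)
{
  return vpid.volid == null_volid_sketch
         || vpid.pageid == null_pageid_sketch
         || is_btree_logical (idx)
         || is_listed_logical (idx);
}
```

The failure mode described above corresponds to a new logical index that satisfies neither helper: with a non-null address it is classified as physical, and the driver then tries to fix a page for it.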

  1. LOG_RCV is the one calling convention; data/pgptr are borrowed, so all four copy/move operations are deleted.
  2. RV_fun[] is indexed by the append-only LOG_RCVINDEX; debug-startup rv_check_rvfuns catches only misordering — nothing bounds-checks lookups.
  3. Compensate records redo through undofun (log_rv_get_fun<LOG_REC_COMPENSATE>): a CLR’s redo re-does the undo.
  4. log_rv_redo_context (reader + two zip buffers + frozen m_end_redo_lsa; copies rebuild fresh buffers; only NORMAL fetch mode used) feeds immutable log_rv_redo_rec_info<T> snapshots; debug-only vpid_lsa_consistency_check asserts every later apply per page stays above the first-applied LSA — a necessary condition of log order, not full pairwise monotonicity.
  5. Analysis state = LOG_RCV_TDES (five LSAs), seeded from LOG_REC_CHKPT + LOG_INFO_CHKPT_TRANS + LOG_INFO_CHKPT_SYSOP (only the two sysop anchors persist; tran-level postpone travels in the trans entry), gated by LOG_RECVPHASE.
  6. The RCV_IS_* macros split physical vs logical, automatic vs manual; a new logical index missing from RCV_IS_LOGICAL_LOG breaks rollback long after the feature ships.

Chapter 2: Restart Entry and Log Page Access


Who drives recovery at server start, how the checkpoint anchor is found and downgraded for a media crash, and how the passes (Ch 3, 6, 9) physically read log pages. Theory: the companion cubrid-recovery-manager.md.

2.1 log_recovery — the restart orchestrator, branch by branch


log_recovery (in log_recovery.c) has one caller, log_initialize_internal, gated on init_emergency == false && (log_Gl.hdr.is_shutdown == false || ismedia_crash == true) (restoredb passes ismedia_crash; emergency startup skips recovery); the caller holds the log CS in write mode (assert (LOG_CS_OWN_WRITE_MODE)).

// log_recovery -- src/transaction/log_recovery.c
/* ... condensed: branch 1 -- NULL LOG_FIND_TDES is er_set + logpb_fatal_error, return ... */
rcv_tdes->state = TRAN_RECOVERY; /* <- the recovery "transaction" */
if (LOG_HAS_LOGGING_BEEN_IGNORED ())
  {                                   /* <- branch 2: fatal, then clear the flag */
    /* ... condensed ... */
  }
/* ... condensed ... */
LSA_COPY (&rcv_lsa, &log_Gl.hdr.chkpt_lsa);
if (ismedia_crash != false)
  {                                   /* <- branch 3a: downgrade anchor */
    (void) fileio_map_mounted (thread_p,
                               (bool (*)(THREAD_ENTRY *, VOLID, void *)) log_rv_find_checkpoint, &rcv_lsa);
  }
/* ... condensed: else, branch 3b -- if (stopat != NULL) *stopat = -1 ... */
vacuum_notify_server_crashed (&rcv_lsa);

Branches 1 and 2 fatal via logpb_fatal_error; branch 2 fires when LOG_HAS_LOGGING_BEEN_IGNORED() (log_impl.h) sees log_Gl.hdr.has_logging_been_skipped — a crash while logging was skipped is unrepairable (ER_LOG_CORRUPTED_DB_DUE_CRASH_NOLOGGING). Branch 3a: restored volumes may be older than the header checkpoint, so log_rv_find_checkpoint is mapped over every volume, copying its disk_get_checkpoint LSA into rcv_lsa when LSA_ISNULL (rcv_lsa) || LSA_LT (&chkpt_lsa, rcv_lsa) and returning true so all volumes are visited — the oldest checkpoint wins. vacuum_notify_server_crashed copies rcv_lsa into vacuum_Data.recovery_lsa for vacuum’s backward scan when analysis finds no MVCC op record.
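The "oldest checkpoint wins" fold in branch 3a can be sketched as a visitor applied to each volume. Everything here is a simplified illustration: the LSA struct, the -1 null sentinel, and the function names are hypothetical, and the per-volume checkpoint LSAs are passed in as a plain vector instead of being read from volume headers.

```cpp
#include <cassert>
#include <vector>

struct chkpt_lsa_sketch
{
  long pageid;
  int offset;
};

inline bool lsa_is_null (const chkpt_lsa_sketch &l)
{
  return l.pageid == -1;   // assumed null sentinel
}

inline bool lsa_lt (const chkpt_lsa_sketch &a, const chkpt_lsa_sketch &b)
{
  return a.pageid < b.pageid || (a.pageid == b.pageid && a.offset < b.offset);
}

// Visitor: lowers rcv_lsa when this volume's checkpoint is older.
// Always returns true so that the mapping continues over all volumes.
inline bool visit_volume (const chkpt_lsa_sketch &vol_chkpt, chkpt_lsa_sketch &rcv_lsa)
{
  if (lsa_is_null (rcv_lsa) || lsa_lt (vol_chkpt, rcv_lsa))
    {
      rcv_lsa = vol_chkpt;
    }
  return true;
}

// Starts from the header checkpoint and folds every volume into it.
inline chkpt_lsa_sketch find_anchor (chkpt_lsa_sketch header_chkpt,
                                     const std::vector<chkpt_lsa_sketch> &volume_chkpts)
{
  chkpt_lsa_sketch rcv_lsa = header_chkpt;
  for (const chkpt_lsa_sketch &v : volume_chkpts)
    {
      if (!visit_volume (v, rcv_lsa))
        {
          break;
        }
    }
  return rcv_lsa;
}
```

With a header checkpoint at page 100 and volumes at pages 120, 80, and 90, the anchor drops to page 80; if every volume is newer than the header, the header checkpoint is kept.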

Invariant — the analysis start LSA is no newer than the checkpoint recorded in any permanent volume. A volume header stores the checkpoint LSA at its last flush; replay must start at or before it, else redo skips updates the restored volume never received. Figure 2-1 maps the rest.

flowchart TD
  A["ANALYSIS  Ch 3"] --> B["logpb_fetch_start_append_page<br/>error: fatal"]
  B --> C{did_incom_recovery}
  C -->|false| D["LOG_RESET_PREV_LSA from EOF back_lsa"] --> G
  C -->|true| G["append LOG_DUMMY_CRASH_RECOVERY<br/>rcv_phase_lsa = tail_lsa"]
  G --> H["REDO Ch 6, then UNDO Ch 9<br/>log_system_tdes::rv_final"] --> K{did_incom_recovery}
  K -->|true| L["log_recovery_notpartof_volumes"] --> N
  K -->|false| N["TRAN_ACTIVE, logtb_set_num_loose_end_trans"]
  N --> O{2pc loose ends}
  O -->|yes| P["FINISH_2PC: log_2pc_recovery  Ch 11"] --> R
  O -->|no| R["logpb_decache_archive_info<br/>CS exit, logpb_checkpoint, CS enter"]
  R --> S["flush all + header, then locator_initialize,<br/>heap_classrepr_restart_cache -- each error: fatal"]

Figure 2-1: log_recovery after the anchor is fixed.

Section 2.5 covers the append-point re-arm; log_append_empty_record writes LOG_DUMMY_CRASH_RECOVERY, whose LSA becomes log_Gl.rcv_phase_lsa, the crash boundary undo keys on (Ch 9). A stopat cut sets did_incom_recovery (Ch 3); log_recovery_notpartof_volumes then drops volumes created after the restore point. The close exits the log CS around logpb_checkpoint, flushes dirty pages and the header, and re-caches the catalog tracker and class representations — two further fatal branches; the caller then sets LOG_RESTARTED.

log_Gl.rcv_phase (enum log_recvphase, log_impl.h) is the global mode; LOG_ISRESTARTED() tests for LOG_RESTARTED, which the caller sets after the phases above. logpb_copy_page fills its recovery cache only if (!LOG_ISRESTARTED ()), and the physical readers run debug checksum checks only when LOG_RESTARTED — torn tails during recovery are repaired logically (Section 2.7).

2.3 logpb_fetch_page — the single physical-read entry


logpb_fetch_page (in log_page_buffer.c) takes an enum log_cs_access_mode (log_impl.h). The classic analysis and undo scans call it with LOG_CS_FORCE_USE (they run under the log CS held by log_recovery); the redo machinery’s log_reader forwards LOG_CS_SAFE_READER so its positioned fetches skip the CS (Section 2.6).

// logpb_fetch_page -- src/transaction/log_page_buffer.c
if (LSA_LE (&append_lsa, req_lsa)             /* <- case 1: page beyond flushed area */
    || LSA_LE (&append_prev_lsa, req_lsa))    /* <- case 2: page may hold a temp EOL */
  {
    LOG_CS_ENTER (thread_p);
    /* ... condensed ... */
    if (LSA_LE (&log_Gl.hdr.append_lsa, req_lsa))   /* retry with mutex */
      {
        logpb_prior_lsa_append_all_list (thread_p); /* <- drain prior list to buffers */
      }
    LOG_CS_EXIT (thread_p);
  }
rv = logpb_copy_page (thread_p, req_lsa->pageid, access_mode, log_pgptr);
/* ... condensed: rv != NO_ERROR is the only error exit ... */

The front gate folds the in-memory prior-LSA list into the page buffer so a reader near the append point never sees a stale tail. logpb_copy_page then has four arms: a LOGPB_HEADER_PAGE_ID request is served from the cached header_buffer (file read when not cached); an out-of-range buffer index raises ER_LOG_PAGE_CORRUPTED; a buffer hit memcpys and re-checks log_bufptr->pageid — the safe-reader mode skips the read CS, so this re-check is its lock-free validation; everything else falls to logpb_read_page_from_file, caching the page forward-only while !LOG_ISRESTARTED ().

2.4 Active versus archive: logpb_read_page_from_file


A pageid is archived iff LOGPB_IS_ARCHIVE_PAGE (pageid) — not the header page and below LOGPB_NEXT_ARCHIVE_PAGE_ID (log_Gl.hdr.nxarv_pageid); logpb_is_page_in_archive wraps it. LOG_CS_SAFE_READER takes the read CS itself (sets log_csect_entered); other modes assert (LOG_CS_OWN). The CS protects the archive set — an archive created mid-read once left logpb_to_physical_pageid stale (the in-code comment records the bug).

// logpb_read_page_from_file -- src/transaction/log_page_buffer.c
bool fetch_from_archive = logpb_is_page_in_archive (pageid);
if (fetch_from_archive)
  {
    bool is_archive_page_in_active_log = (pageid + LOGPB_ACTIVE_NPAGES) > log_Gl.hdr.append_lsa.pageid;
    bool dont_fetch_archive_from_active = !LOG_ISRESTARTED () || log_Gl.hdr.was_active_log_reset;
    if (is_archive_page_in_active_log && !dont_fetch_archive_from_active)
      {
        fetch_from_archive = false;   /* <- slot not yet lapped in circular active file */
      }
  }

The shortcut: the active file is circular with LOGPB_ACTIVE_NPAGES (= log_Gl.hdr.npages) slots, so an archived page stays readable from active until its slot is re-appended — disabled during recovery and after an active-log reset, when the active tail is exactly what the crash made suspect.

The remaining arms: an archive fetch (logpb_fetch_from_archive) returning NULL is goto error. An active fetch maps the slot via logpb_to_physical_pageid, then fileio_read (ER_LOG_READ, goto error); the self-id check: hdr.logical_pageid == pageid is good, then tde_decrypt_log_page if encrypted (archives decrypt inside logpb_fetch_from_archive); == pageid + LOGPB_ACTIVE_NPAGES means lapped since the check — retry from archive; anything else is ER_LOG_PAGE_CORRUPTED. Both exits release the CS iff log_csect_entered; debug checksum only when LOG_RESTARTED.

Invariant — every log page self-identifies. An active-file read is valid only if hdr.logical_pageid matches; the one benign mismatch is one lap, pageid + LOGPB_ACTIVE_NPAGES — without the check a lapped slot would replay as the old page.
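The self-identification rule and the one-lap alias can be sketched as a three-way classification. The names are hypothetical, and slot_of is a deliberately simplified modulo mapping used only to show why the alias is exactly one lap; the real logpb_to_physical_pageid is not reproduced here.

```cpp
#include <cassert>

enum class read_verdict_sketch
{
  GOOD,
  LAPPED_RETRY_FROM_ARCHIVE,
  CORRUPTED
};

// Simplified slot mapping for illustration only.
inline long slot_of (long logical_pageid, long active_npages)
{
  return logical_pageid % active_npages;
}

// Mirrors the shortcut condition: the archived page's slot has not been
// re-appended yet if one lap ahead of it is still past the append page.
inline bool still_in_active (long pageid, long active_npages, long append_pageid)
{
  return (pageid + active_npages) > append_pageid;
}

// Classify what was found in the slot against what was requested.
inline read_verdict_sketch classify_active_read (long requested_pageid,
                                                 long found_logical_pageid,
                                                 long active_npages)
{
  if (found_logical_pageid == requested_pageid)
    {
      return read_verdict_sketch::GOOD;
    }
  if (found_logical_pageid == requested_pageid + active_npages)
    {
      return read_verdict_sketch::LAPPED_RETRY_FROM_ARCHIVE;
    }
  return read_verdict_sketch::CORRUPTED;
}
```

In this model pages 42 and 142 share a slot when the active file has 100 pages, so finding 142 where 42 was requested means the slot was overwritten after the archive check and the read should be retried from the archive.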

2.5 logpb_fetch_start_append_page — re-arming the append point


Between analysis and redo the log must become writable again. Four branches: an empty log (append_lsa offset 0, pageid 0 — debug builds: PRM_ID_FIRST_LOG_PAGEID) makes logpb_locate_page get NEW_PAGE instead of OLD_PAGE; a leftover log_Gl.append.log_pgptr is discarded (logpb_invalid_all_append_pages); NULL from logpb_locate_page is the only error exit (ER_FAILED, fatal in log_recovery); on success set_nxio_lsa (log_Gl.hdr.append_lsa) is recorded and the page joins flush_info->toflush, flushed (logpb_flush_pages_direct) when the array is full.

2.6 log_reader — the C++ fetch wrapper for the redo machinery


The modern redo path (Ch 6, Ch 7) uses log_reader (log_reader.hpp, final class, header-only; the sibling log_reader.cpp is stale — no CMakeLists builds it).

| Field | Role | Why it exists |
|---|---|---|
| m_thread_entry | Lazily cached THREAD_ENTRY * | Single-thread contract, asserted each use |
| m_lsa | Read position; starts NULL_LSA | Drives fetch pageid, intra-page offset, memoization |
| m_cs_access | Mode passed to logpb_fetch_page; default LOG_CS_FORCE_USE | CS-owning passes vs CS-free readers (LETS-port leftover) |
| m_page | log_page * aligned into m_area_buffer by the constructor | Private fetch destination; no shared-buffer locking |
| m_area_buffer | char [IO_MAX_PAGE_SIZE + DOUBLE_ALIGNMENT] | Inline no-heap storage; copied workers get their own |

set_lsa_and_fetch_page computes do_fetch_page { fetch_page_mode == fetch_mode::FORCE || m_lsa.pageid != lsa.pageid }, assigns m_lsa = lsa, and fetches (logpb_fetch_page (.., m_cs_access, m_page), fatal on failure) only when true: NORMAL memoizes the current page, FORCE always refetches. align, add_align, advance_when_does_not_fit and copy_from_log delegate to the classic LOG_READ_ALIGN family and logpb_copy_from_log (bottom of the same header), refetching on page crossings — but only fetch_page (under set_lsa_and_fetch_page and skip) forwards m_cs_access; the delegating members use the family’s default LOG_CS_FORCE_USE, so even a safe reader briefly takes the read CS at mid-record page crossings.
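The memoization rule of set_lsa_and_fetch_page can be sketched with a counter standing in for the physical fetch. The class below is a hypothetical miniature: it keeps only the position and a fetch count, uses -1 as a stand-in for NULL_LSA, and does not model CS access or error handling.

```cpp
#include <cassert>

enum class fetch_mode_sketch { NORMAL, FORCE };

struct reader_lsa_sketch
{
  long pageid;
  int offset;
};

// Hypothetical miniature of a positioned log reader.
class reader_sketch
{
  public:
    void set_lsa_and_fetch_page (const reader_lsa_sketch &lsa,
                                 fetch_mode_sketch mode = fetch_mode_sketch::NORMAL)
    {
      // Decide before overwriting m_lsa: fetch when forced or when the page changes.
      const bool do_fetch_page = (mode == fetch_mode_sketch::FORCE) || (m_lsa.pageid != lsa.pageid);
      m_lsa = lsa;
      if (do_fetch_page)
        {
          ++m_fetch_count;   // stands in for the physical page fetch
        }
    }

    int fetch_count () const
    {
      return m_fetch_count;
    }

  private:
    reader_lsa_sketch m_lsa { -1, -1 };   // stand-in for NULL_LSA
    int m_fetch_count = 0;
};
```

Positioning twice inside page 5 costs one fetch, moving to page 6 costs a second, and a FORCE call on page 6 costs a third even though the page did not change.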

The owning aggregate log_rv_redo_context (log_recovery_redo.hpp):

| Field | Role | Why it exists |
|---|---|---|
| m_reader | log_reader { LOG_CS_SAFE_READER } | Private reader per context; CS-free positioned fetches |
| m_redo_zip, m_undo_zip | LOG_ZIP scratch buffers | Decompression targets reused across records (Section 2.8) |
| m_end_redo_lsa | const LOG_LSA redo stop bound | Workers compare record LSAs without touching globals |
| m_reader_fetch_page_mode | const log_reader::fetch_mode | NORMAL memoizes pages; FORCE kept for reuse |

The synchronous redo driver constructs it with fetch_mode::NORMAL; the copy constructor re-runs the main one so each parallel worker (Ch 7) gets fresh buffers; the constructor pre-sizes both zips to LOGAREA_SIZE, the destructor frees them (log_zip_free_data).

2.7 The NULL_OFFSET convention for incompletely archived records


NULL_OFFSET is (-1) (storage_common.h). When the archiver copies an active page whose last record continues onto the next page, an LSA into the continuation may carry offset == NULL_OFFSET: the record’s completion postdates archiving. Every forward scan — analysis, redo, walkers like log_startof_nxrec — must repair it before dereferencing:

// log_recovery_analysis (record loop) -- src/transaction/log_recovery.c
if (lsa.offset == NULL_OFFSET)
  {
    lsa.offset = log_page_p->hdr.offset;   /* <- page's first record offset */
    if (lsa.offset == NULL_OFFSET)
      {
        /* Continue with next pageid */
        if (logpb_is_page_in_archive (log_lsa.pageid))
          {
            lsa.pageid = log_lsa.pageid + 1;   /* <- archive: keep walking */
          }
        else
          {
            lsa.pageid = NULL_PAGEID;          /* <- active: stop scan */
          }
        continue;
      }
  }

A page whose own hdr.offset is NULL_OFFSET holds no record start (pure continuation) — in an archive try the next page; in the active log the scan ran off the end. Analysis scratch pages are initialized to hdr.offset = NULL_OFFSET.
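The three outcomes of the repair can be isolated in a small helper. This is a sketch under simplifying assumptions: the names are hypothetical, the cursor and the current-page position are collapsed into one struct, and the archive test is passed in as a boolean instead of being computed.

```cpp
#include <cassert>

constexpr int null_offset_sketch = -1;
constexpr long null_pageid_sketch = -1;   // assumed sentinel value

struct scan_lsa_sketch
{
  long pageid;
  int offset;
};

// Returns true when the cursor points at a record start (possibly after
// being repaired from the page header). Returns false when the page holds
// no record start; the cursor has then been moved to the next page (archive)
// or nulled (active log), and the caller should restart its loop.
inline bool repair_null_offset (scan_lsa_sketch &cursor, int page_hdr_offset, bool page_in_archive)
{
  if (cursor.offset != null_offset_sketch)
    {
      return true;   // nothing to repair
    }
  cursor.offset = page_hdr_offset;
  if (cursor.offset != null_offset_sketch)
    {
      return true;   // repaired from the page's first record offset
    }
  cursor.pageid = page_in_archive ? cursor.pageid + 1 : null_pageid_sketch;
  return false;
}
```

A nulled pageid is what ends a forward scan in the active log, while in an archive the scan simply tries the following page with the offset still unresolved.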

2.8 LOG_ZIP allocation helpers each pass instantiates


LOG_ZIP (struct log_zip, log_compress.h) is a grow-only buffer:

| Field | Role | Why it exists |
|---|---|---|
| data_length | Bytes currently stored | Consumers read exactly this much; capacity may be larger |
| buf_size | Capacity of log_data | Grow-only check; sized to the LZ4 worst case |
| log_data | The buffer (char *) | Reused across records; nullptr until first sizing |

log_zip_realloc_if_needed (log_zip, new_size) (in log_compress.c) grows only when new_size > 0 && new_size > log_zip.buf_size, to LOG_ZIP_BUF_SIZE (LZ4, new_size) (ER_OUT_OF_VIRTUAL_MEMORY on failure); a second check, new_size > 0 && log_zip.log_data == nullptr, zeroes the fields and returns false (the caller treats this as fatal); true covers both success and no-grow. log_zip_alloc mallocs and zeroes the struct and sizes it the same way (nullptr on failure, with the partially built struct freed); log_zip_free runs log_zip_free_data, then frees the struct. The redo context pre-sizes its two zips (Section 2.6); the undo pass allocates undo_unzip_ptr with log_zip_alloc (LOGAREA_SIZE) and frees it on every exit of log_recovery_undo; the shared consumer log_rv_get_unzip_log_data splits compressed from plain via ZIP_CHECK (length): log_unzip versus memcpy after log_zip_realloc_if_needed.
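The grow-only behaviour can be sketched with a small buffer type. This is a simplified illustration, not the CUBRID code: the names are hypothetical, the two checks described above are merged into one failure path, and worst_case_size uses the LZ4 compress-bound formula (n + n/255 + 16) as a placeholder for whatever LOG_ZIP_BUF_SIZE actually computes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

struct zip_sketch
{
  std::size_t data_length = 0;   // bytes currently stored
  std::size_t buf_size = 0;      // capacity of log_data
  char *log_data = nullptr;      // owned buffer
};

// Placeholder capacity rule (LZ4 worst-case bound).
inline std::size_t worst_case_size (std::size_t n)
{
  return n + n / 255 + 16;
}

inline void free_data (zip_sketch &z)
{
  std::free (z.log_data);
  z = zip_sketch {};
}

// Grows only when the request exceeds the current capacity.
// Returns true on success and on no-grow; false after an allocation failure,
// in which case the fields are reset.
inline bool realloc_if_needed (zip_sketch &z, std::size_t new_size)
{
  if (new_size > 0 && new_size > z.buf_size)
    {
      const std::size_t capacity = worst_case_size (new_size);
      char *grown = static_cast<char *> (std::realloc (z.log_data, capacity));
      if (grown == nullptr)
        {
          free_data (z);
          return false;
        }
      z.log_data = grown;
      z.buf_size = capacity;
    }
  return true;
}
```

Because the capacity is rounded up to the worst-case bound, a request slightly larger than an earlier one often fits without a second allocation, which is the "sticky across records" effect noted for buf_size.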

  1. log_recovery runs as the TRAN_RECOVERY system transaction under an already-held write-mode log CS; only log_initialize_internal calls it, and emergency startup skips it.
  2. The analysis anchor is log_Gl.hdr.chkpt_lsa, downgraded on a media crash to the oldest per-volume checkpoint found via log_rv_find_checkpoint.
  3. Between analysis and redo the append point is re-armed and a LOG_DUMMY_CRASH_RECOVERY appended; its LSA (log_Gl.rcv_phase_lsa) is the crash boundary undo keys on.
  4. Classic analysis and undo scans fetch with LOG_CS_FORCE_USE under the held log CS; log_reader forwards LOG_CS_SAFE_READER for positioned fetches, though its page-crossing helpers still default to LOG_CS_FORCE_USE and briefly take the read CS.
  5. logpb_read_page_from_file splits active versus archive on LOGPB_IS_ARCHIVE_PAGE; the only benign self-id mismatch is the one-lap alias pageid + LOGPB_ACTIVE_NPAGES, and the archived-but-still-in-active shortcut is disabled during recovery.
  6. NULL_OFFSET (-1) marks LSAs into incompletely archived records; every forward scan repairs it from hdr.offset, advancing a page in archives and terminating in the active log.

Chapter 3: Analysis Pass Driver

log_recovery_analysis walks forward from the checkpoint anchor through possibly corrupted or truncated log and computes the redo range: a page-fetch outer loop around a record-step inner loop. Record semantics go to log_rv_analysis_record (Ch 4–5); the driver owns cursor advancement, corruption defenses, the truncate-or-fatal decision, and redo-range bookkeeping. ARIES rationale: recovery-phases section of cubrid-recovery-manager.md; page-fetch mechanics: Ch 2.

3.1 Entry point, outputs, and driver state


log_recovery resolves the anchor — log_Gl.hdr.chkpt_lsa, or under media crash the oldest checkpoint among data-volume headers (log_rv_find_checkpoint) — and passes it as start_lsa, with is_media_crash (truncate vs fatal, 3.2) and stop_at (the restoredb -d boundary, 3.7). Outputs: start_redo_lsa (the anchor unless Ch 4 pulls it back), end_redo_lsa (Invariant 3-B), did_incom_recovery (truncated; log_recovery skips the EOF back-link fix-up), num_redo_log_records (3.8).

Key driver locals of log_recovery_analysis:

| Local | Role | Why |
|---|---|---|
| lsa | next record; NULL ends both loops | single termination condition |
| log_lsa | current record, page in log_page_p | lsa advances before dispatch (3.6) |
| prev_lsa | last good record | resetlog target |
| prev_prev_lsa | resetlog’s new_prev_lsa | tracks prev_lsa; NULL only if the first fetched page is broken |
| first_corrupted_rec_lsa | first all-0xff 4 KB block | per-record cut-off (3.5) |
| last_checked_page_id | page already checksummed | probe once per page (3.3) |
| is_log_page_broken | fetch failed / record tail missing | truncate-or-fatal fork (3.2) |
| is_log_page_corrupted | readable but checksum failed | partial flush (3.5); terminal (Invariant 3-C) |
| null_block | 4 KB of LOG_PAGE_INIT_VALUE (0xff, log_common_impl.h) | tear-detection memcmp operand |
| checkpoint_lsa | set by LOG_END_CHKPT dispatch (Ch 4) | 2PC tail re-read (3.8) |
| may_use_checkpoint / may_need_synch_checkpoint_2pc | dispatch flags (Ch 4) | the second arms the 2PC tail |
| last_at_time | stays -1 in the driver | echo to *stop_at is inert (3.7) |

Initialization copies start_lsa into lsa, start_redo_lsa, end_redo_lsa, and prev_lsa — a degenerate redo range until proven otherwise — and nulls or zeroes everything else.

3.2 Outer loop: the is_log_page_broken branch

Each outer iteration calls logpb_fetch_page on the page under the cursor. Failure (past the flushed log, missing archive, TDE decryption error) sets is_log_page_broken, as can the inner loop’s broken-tail break (3.4), so a single branch decides what “broken” means.

Media-crash arm: truncate and accept, since log past the restore point may legitimately not exist. The arm echoes last_at_time via *stop_at, then steps the last record’s owner tdes (tail_lsa/undo_nxlsa) back to log_rec->prev_tranlsa so undo never chases the truncated record. It re-fetches prev_lsa’s page (the failed fetch clobbered the buffer; a second failure is fatal), and log_recovery_resetlog (thread_p, &prev_lsa, &prev_prev_lsa) makes prev_lsa the new append point (Ch 11). Finally it sets *did_incom_recovery, resets the MVCC table, and returns, skipping the 2PC tail (3.8).

Normal-crash arm: fatal, because after a plain crash every page up to eof_lsa must be readable. When er_errid () is ER_TDE_CIPHER_IS_NOT_LOADED the message names TDE: the page is intact but undecryptable.

Invariant 3-B (redo-range honesty). On return, every record in [start_redo_lsa, end_redo_lsa) is readable and structurally complete; the boundary itself is the last fully-probed record or a position re-initialized before redo reads it. Normal end: the last dispatched record. Truncation (3.6 step 8): reverted to prev_lsa. Broken-record probe (3.4): deliberately advanced onto the broken record — equal to prev_lsa — so resetlog makes that position the new append point, overwritten by LOG_DUMMY_CRASH_RECOVERY before redo runs. Violation: redo (Ch 6) applies half-written bodies.

flowchart TD
  A["fetch page at lsa"] --> B{"broken?"}
  B -- no --> C["inner loop 3.3-3.6"] --> D{"lsa null?"}
  D -- no --> A
  D -- yes --> E["2PC tail; reset_start_mvccid"]
  B -- "yes, media crash" --> F["resetlog at prev_lsa; did_incom_recovery; return"]
  B -- "yes, normal crash" --> G["fatal (TDE or generic)"]

Figure 3-1 — outer loop of log_recovery_analysis.

3.3 Inner loop entry: NULL_OFFSET repair and the corruption probe

The inner loop runs while the cursor stays on the fetched page: while (!LSA_ISNULL (&lsa) && lsa.pageid == log_lsa.pageid). Two housekeeping steps precede record access.

NULL_OFFSET repair. A record archived while incomplete leaves the continuation’s offset unknown: the cursor arrives as (pageid, NULL_OFFSET) and is re-anchored on log_page_p->hdr.offset, the first header starting in this page. If that too is NULL_OFFSET (only continuation bytes here): archive page — lsa.pageid = log_lsa.pageid + 1, keep walking the record’s middle; active page — lsa.pageid = NULL_PAGEID, scan over. Either way continue.
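The repair described above can be sketched as a small function. This is a minimal illustration, not the CUBRID code: `Lsa`, `kNullPageId`, `kNullOffset`, and `repair_null_offset` are hypothetical stand-ins, and the page header offset and the archive test are passed in as plain arguments.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-ins for illustration; not the real LOG_LSA definitions.
constexpr std::int64_t kNullPageId = -1;
constexpr int kNullOffset = -1;

struct Lsa
{
  std::int64_t pageid;
  int offset;
};

// Returns true when the page holds no record header and the caller should
// skip to the next outer iteration; false when the cursor is usable as is.
bool repair_null_offset (Lsa &cursor, int page_hdr_offset, bool page_in_archive)
{
  if (cursor.offset != kNullOffset)
    {
      return false;                     // nothing to repair
    }
  cursor.offset = page_hdr_offset;      // first header that starts in this page
  if (cursor.offset != kNullOffset)
    {
      return false;                     // re-anchored on a real record header
    }
  if (page_in_archive)
    {
      cursor.pageid += 1;               // keep walking the record's middle
    }
  else
    {
      cursor.pageid = kNullPageId;      // active log: the scan is over
    }
  return true;
}
```

In the archive case the offset stays at the null value on purpose, so the same repair runs again on the next page.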

Per-page corruption probe. Guarded by last_checked_page_id, so once per page. logpb_page_check_corruption wraps logpb_page_has_valid_checksum (CRC32 vs hdr.checksum); a helper error is fatal. A corrupt archive page is fatal (/* Should not happen. */ — archives are written once); a corrupt active page means a partial page flush. logpb_page_get_first_null_block_lsa locates the tear: the first 4 KB block that memcmps equal to null_block yields (hdr.logical_pageid, i * block_size), minus sizeof (LOG_HDRPAGE) when nonzero — LSA offsets index area[], the raw page starts earlier.

If no block matches (corrupt, but every block holds data), first_corrupted_rec_lsa stays NULL: the 3.5 cut-off and its safety nets (gated on !is_log_page_corrupted) are skipped; only the page-advance ban and EOF stop of Invariant 3-C still apply.
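The tear search can be illustrated with a short sketch. It assumes a raw page held in a byte vector; `kHdrSize` is a hypothetical stand-in for sizeof (LOG_HDRPAGE), and `first_null_block_offset` is an illustrative name, not the CUBRID function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t kBlockSize = 4096;    // tear-detection granularity from the text
constexpr std::size_t kHdrSize = 32;        // hypothetical stand-in for sizeof (LOG_HDRPAGE)
constexpr unsigned char kInitValue = 0xff;  // LOG_PAGE_INIT_VALUE per the text

// Returns the area[]-relative offset of the first all-0xff block, or -1 if
// every block holds data.
long first_null_block_offset (const std::vector<unsigned char> &raw_page)
{
  const std::vector<unsigned char> null_block (kBlockSize, kInitValue);
  for (std::size_t i = 0; (i + 1) * kBlockSize <= raw_page.size (); ++i)
    {
      if (std::memcmp (raw_page.data () + i * kBlockSize, null_block.data (), kBlockSize) == 0)
        {
          std::size_t offset = i * kBlockSize;
          if (offset > 0)
            {
              offset -= kHdrSize;       // LSA offsets index area[], which starts after the header
            }
          return static_cast<long> (offset);
        }
    }
  return -1;                            // corrupt, but no cut-off can be derived
}
```

A return of -1 corresponds to the case where first_corrupted_rec_lsa stays NULL.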

3.4 Multi-page records: log_is_page_of_record_broken

After log_rec = LOG_GET_LOG_RECORD_HEADER (log_page_p, &log_lsa), the media-crash path runs one more probe — a header may sit on the last restored page while its body spills onto pages never restored:

// log_is_page_of_record_broken -- src/transaction/log_recovery.c
LSA_COPY (&fwd_log_lsa, &log_rec_header->forw_lsa);
/* TODO - Do we need to handle NULL fwd_log_lsa? */
if (!LSA_ISNULL (&fwd_log_lsa))
  {
    if (LSA_GE (log_lsa, &fwd_log_lsa)
        || (!LSA_ISNULL (&log_Gl.hdr.eof_lsa) && LSA_GT (&fwd_log_lsa, &log_Gl.hdr.eof_lsa)))
      {
        is_log_page_broken = true; /* <- forw_lsa is nonsense */
      }
    else
      {
        if (fwd_log_lsa.pageid != log_lsa->pageid
            && (fwd_log_lsa.offset != 0 || fwd_log_lsa.pageid > log_lsa->pageid + 1))
          {
            // ... condensed: record spans pages -- probe-fetch fwd_log_lsa page;
            // failure -> broken ...
          }
      }
  }

Branch by branch: (1) forw_lsa NULL — declines; the 3.5 safety nets judge instead (the TODO admits the gap). (2) forw_lsa not after the current record, or beyond eof_lsa — the header itself is garbage: broken (eof_lsa is NULL-guarded: restoring without an active volume recovers it only during analysis). (3) forw_lsa on a later page at nonzero offset, or more than one page ahead — the body provably reaches that page: probe-fetch it; failure means the tail is gone, success means sane. The excluded case — next record at offset 0 of the next page — proves nothing; no fetch is spent.
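The three branches can be restated as a pure classifier that separates the decision from the probe fetch. This is a sketch under simplifying assumptions: `Lsa`, `Verdict`, and `classify_forw_lsa` are hypothetical names, and a null LSA is modelled as pageid -1.

```cpp
#include <cassert>
#include <cstdint>

struct Lsa
{
  std::int64_t pageid;
  int offset;
};

inline bool is_null (const Lsa &l) { return l.pageid == -1; }
inline bool lsa_gt (const Lsa &a, const Lsa &b)
{
  return a.pageid > b.pageid || (a.pageid == b.pageid && a.offset > b.offset);
}
inline bool lsa_ge (const Lsa &a, const Lsa &b) { return !lsa_gt (b, a); }

enum class Verdict { Declined, Broken, NeedsProbeFetch, NothingToProve };

Verdict classify_forw_lsa (const Lsa &rec, const Lsa &forw, const Lsa &eof)
{
  if (is_null (forw))
    {
      return Verdict::Declined;         // branch 1: left to the later safety nets
    }
  if (lsa_ge (rec, forw) || (!is_null (eof) && lsa_gt (forw, eof)))
    {
      return Verdict::Broken;           // branch 2: the header itself is garbage
    }
  if (forw.pageid != rec.pageid && (forw.offset != 0 || forw.pageid > rec.pageid + 1))
    {
      return Verdict::NeedsProbeFetch;  // branch 3: the body provably reaches that page
    }
  return Verdict::NothingToProve;       // same page, or offset 0 of the next page
}
```

Only `NeedsProbeFetch` costs a page read; a failed read would then turn the verdict into broken.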

On a broken verdict the inner loop copies end_redo_lsa = lsa, sets prev_lsa and prev_prev_lsa to it, debug-traces, and breaks — the reset happens in 3.2, where prev_lsa is now the broken record itself: resetlog cuts there, sacrificing it so everything earlier survives.

3.5 Partial page flush: the first_corrupted_rec_lsa cut-off

For pages that failed the checksum, the driver decides per record whether it precedes the torn region. Two safety nets first widen the verdict (only while is_log_page_corrupted is false): (1) Missing end-of-log: forw_lsa NULL on a non-LOG_END_OF_LOG record in the active log is impossible (every chain ends at an EOF record), so the page is declared corrupted and the cut-off comes from the null-block scan. (2) Body crossing a null block: when forw_lsa stays in-page, map the record start and forw_lsa - 1 to block indexes ((offset + sizeof (LOG_HDRPAGE)) / block_size); if they differ and the ending block equals null_block, the body was never fully flushed, and the cut-off becomes the record itself.
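The block-index arithmetic of the second safety net can be worked through in a few lines. This is a sketch only: `kHdrSize` is a hypothetical value for sizeof (LOG_HDRPAGE), the page is a plain byte vector, and the function name is illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t kBlockSize = 4096;
constexpr std::size_t kHdrSize = 32;    // hypothetical stand-in for sizeof (LOG_HDRPAGE)

// Maps an area[]-relative offset to the index of the raw-page block holding it.
std::size_t block_index (std::size_t area_offset)
{
  return (area_offset + kHdrSize) / kBlockSize;
}

// True when the record body [rec_offset, forw_offset) ends in a different
// block than it starts in, and that ending block is still all 0xff.
// Precondition: forw_offset > rec_offset, both on this page.
bool body_crosses_null_block (const std::vector<unsigned char> &raw_page,
                              std::size_t rec_offset, std::size_t forw_offset)
{
  const std::size_t first = block_index (rec_offset);
  const std::size_t last = block_index (forw_offset - 1);
  if (first == last || (last + 1) * kBlockSize > raw_page.size ())
    {
      return false;
    }
  const std::vector<unsigned char> null_block (kBlockSize, 0xff);
  return std::memcmp (raw_page.data () + last * kBlockSize, null_block.data (), kBlockSize) == 0;
}
```

With these constants, a record starting at offset 4000 and ending at 4200 starts in block 0 and ends in block 1, so an unflushed block 1 marks the record itself as the cut-off.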

With a non-NULL cut-off, three outcomes. A record strictly past the tear ends the scan at the previous good record:

// log_recovery_analysis -- src/transaction/log_recovery.c
if (LSA_GT (&log_lsa, &first_corrupted_rec_lsa))
  {
    LOG_RESET_APPEND_LSA (end_redo_lsa); /* <- starts past the tear */
    LSA_SET_NULL (&lsa);
    break;
  }

The else arm flags the record itself as corrupted when log_lsa == first_corrupted_rec_lsa, when forw_lsa points past the tear, or when the DB_ALIGN-ed end of its header overruns LOGAREA_SIZE or lands past the tear. It then calls LOG_RESET_APPEND_LSA (&log_lsa), so the first casualty’s own position becomes the new append point, nulls lsa, and breaks. A record provably before the tear is processed normally.
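The three outcomes can be summarised as one classifier. This is a sketch under stated assumptions: the strictness of each "past" comparison is inferred from the prose rather than taken from the source, the header end is passed in already aligned, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct Lsa
{
  std::int64_t pageid;
  int offset;
};

inline bool lsa_eq (const Lsa &a, const Lsa &b) { return a.pageid == b.pageid && a.offset == b.offset; }
inline bool lsa_gt (const Lsa &a, const Lsa &b)
{
  return a.pageid > b.pageid || (a.pageid == b.pageid && a.offset > b.offset);
}

enum class Outcome { StopAtPreviousRecord, RecordIsFirstCasualty, ProcessNormally };

// rec and cut are assumed to lie on the same page; aligned_hdr_end is the
// aligned end offset of the record header (assumption: strict comparisons).
Outcome classify_against_tear (const Lsa &rec, const Lsa &forw, int aligned_hdr_end,
                               const Lsa &cut, int logarea_size)
{
  if (lsa_gt (rec, cut))
    {
      return Outcome::StopAtPreviousRecord;     // record starts past the tear
    }
  const bool casualty = lsa_eq (rec, cut)       // record sits exactly on the tear
    || lsa_gt (forw, cut)                       // body runs past the tear
    || aligned_hdr_end > logarea_size           // header overruns the page area
    || aligned_hdr_end > cut.offset;            // header lands past the tear
  return casualty ? Outcome::RecordIsFirstCasualty : Outcome::ProcessNormally;
}
```

The first outcome resets the append point to end_redo_lsa, the second to the record's own LSA, and the third leaves the scan running.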

Invariant 3-C (corruption is terminal per page). Once is_log_page_corrupted is true, the cursor never advances to another page. Enforced by the post-advance null-out (page corrupted, record not LOG_END_OF_LOG, lsa.pageid != log_lsa.pageid → LSA_SET_NULL) plus the stop after dispatching LOG_END_OF_LOG. Rationale: recycled pages from earlier log wraps can hold valid-looking stale records, and following them would replay a previous epoch.

3.6 Advancing the cursor: every remaining branch

The rest of the inner-loop body, in order:

  1. end_redo_lsa = lsa; lsa = log_rec->forw_lsa — the range tip moves before dispatch.
  2. Corrupted-page page-advance ban (Invariant 3-C).
  3. Archive null-forward fix: NULL lsa on an archive page → log_lsa.pageid + 1 — incomplete archiving, not end of log.
  4. Loop guard. lsa backward or sideways (lsa.pageid < log_lsa.pageid, or same page and lsa.offset <= log_lsa.offset): “loop in the log” debug-trace, logpb_fatal_error, then LSA_SET_NULL (&lsa); break;.
  5. Missing-EOF repair. NULL lsa, log_rtype != LOG_END_OF_LOG, no truncation yet: the append LSA parks at end_redo_lsa; if log_startof_nxrec finds the next record start, advance there, patch the in-buffer log_rec->forw_lsa, flush the page (logpb_write_page_to_disk) — a physical repair. Either way log_Gl.hdr.next_trid = tran_id.
  6. Redo counting. *num_redo_log_records counts twelve redo-bearing types — LOG_REDO_DATA, LOG_UNDOREDO_DATA, LOG_DIFF_UNDOREDO_DATA, their three LOG_MVCC_* counterparts, LOG_DBEXTERN_REDO_DATA, LOG_RUN_POSTPONE, LOG_COMPENSATE, LOG_2PC_PREPARE, LOG_2PC_START, LOG_2PC_RECV_ACK; everything else hits the silent default.
  7. Dispatch. log_rv_analysis_record rebuilds transaction state (Ch 4); its LOG_END_OF_LOG case is log_rv_analysis_log_end (3.8).
  8. Post-dispatch truncation. *did_incom_recovery raised (3.7): end_redo_lsa = prev_lsa — the trigger is excluded from redo; lsa nulled, break.
  9. Self-loop assert. LSA_EQ (end_redo_lsa, &lsa) — the cursor did not move: assert_release, scan aborts via NULL cursor.
  10. Corrupted page + LOG_END_OF_LOG → stop (second half of Invariant 3-C).
  11. prev_lsa = end_redo_lsa; prev_prev_lsa = prev_lsa; — the resetlog anchors trail the tip by one record.
  12. Page-id back-fill. Forward (pageid, NULL_OFFSET) with a stale smaller pageid → current page (pairs with 3.3’s repair).

Invariant 3-A (monotone cursor). Every iteration strictly increases the cursor (pageid, offset). Enforced by steps 4 and 9, both terminating the scan. Violation: analysis spins forever.
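The loop guard of step 4 can be sketched as a three-way check on the advanced cursor. This is an illustration with hypothetical names; the early exit for a null cursor is an assumption inferred from step 5, which still has to see a NULL lsa.

```cpp
#include <cassert>
#include <cstdint>

struct Lsa
{
  std::int64_t pageid;
  int offset;
};

enum class Step { Advance, EndOfChain, LoopInLog };

// current is the record just read; next is its forw_lsa.
Step check_cursor (const Lsa &next, const Lsa &current)
{
  if (next.pageid == -1)
    {
      return Step::EndOfChain;          // NULL cursor: handled by later steps
    }
  if (next.pageid < current.pageid
      || (next.pageid == current.pageid && next.offset <= current.offset))
    {
      return Step::LoopInLog;           // backward or sideways: the scan must end
    }
  return Step::Advance;                 // strictly increasing (pageid, offset)
}
```

Any result other than `Advance` ends the scan, which is what keeps the cursor strictly monotone.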

3.7 Point-in-time stop: stop_at and LOG_REC_DONETIME

stop_at comes from log_recovery: -1 (no limit) on normal restart, the restoredb -d timestamp on media crash. The driver never reads commit times itself — log_rv_analysis_complete does, for LOG_COMMIT / LOG_ABORT (log_rv_analysis_commit_with_postpone applies the same test to its LOG_REC_START_POSTPONE at_time). It reads the LOG_REC_DONETIME payload behind the header; when *stop_at != (time_t) (-1) and difftime (*stop_at, last_at_time) < 0 — the first done record stamped after the stop point — it nulls the page cursor, calls log_recovery_resetlog (thread_p, &record_header_lsa, prev_lsa) to cut the log before this commit, and raises *did_incom_recovery; 3.6 step 8 then excludes the record and ends the scan. That last_at_time is its local; the driver’s copy, echoed into *stop_at in 3.2, stays -1 — inert today.
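The stop test itself is a single predicate. A minimal sketch, using only the standard difftime and a hypothetical function name:

```cpp
#include <cassert>
#include <ctime>

// True when a completion stamped done_time falls after the requested stop
// point. A null pointer or (time_t) -1 means "no limit".
bool completion_is_past_stop_point (const std::time_t *stop_at, std::time_t done_time)
{
  return stop_at != nullptr
    && *stop_at != static_cast<std::time_t> (-1)
    && std::difftime (*stop_at, done_time) < 0;
}
```

A completion stamped exactly at the stop point is kept, since difftime returns 0 there and the test is strict.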

3.8 log_rv_analysis_log_end, the 2PC re-read tail, and the outputs

The one dispatch case belonging to the driver’s story is the clean end of log, log_rv_analysis_log_end — one branch on logpb_is_page_in_archive. In the active log the EOF’s own position becomes log_Gl.hdr.append_lsa (LOG_RESET_APPEND_LSA (log_lsa) — new appends overwrite the EOF record), next_trid is restored from its owner, and the cursor takes the EOF’s NULL forw_lsa — both loops end (the missing-EOF repair exempts LOG_END_OF_LOG). An EOF in an archive page is a stale leftover from before the archiving cut: the header is untouched, the NULL forward goes through 3.6 step 3, and the scan continues.

The 2PC re-read tail. If any dispatched record set may_need_synch_checkpoint_2pc (a LOG_REC_CHKPT listing transactions in 2PC at checkpoint time — Ch 4), the driver re-reads the checkpoint record after the outer loop: (1) logpb_fetch_page on checkpoint_lsa, failure fatal; (2) the LOG_INFO_CHKPT_TRANS array of chkpt.ntrans entries, read in-page when log_lsa.offset + size < LOGAREA_SIZE, else malloc + logpb_copy_from_log (failed malloc: fatal); (3) each chkpt_trans[i].trid resolves via logtb_find_tran_index; log_2pc_recovery_analysis_info runs only for tdes still LOG_ISTRAN_2PC. The media-crash arm of 3.2 returns before this tail — truncated restores skip 2PC reconstruction.

log_recovery then emits ER_LOG_RECOVERY_REDO_STARTED from the range and the count; log_cnt_pages_containing_lsa returns 0 when *to_lsa == *from_lsa, else the inclusive to_lsa->pageid - from_lsa->pageid + 1. When nothing past the anchor survived, end_redo_lsa still equals start_redo_lsa from initialization — the count is honestly zero.
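The page count described above reduces to the following arithmetic. A sketch with a simplified `Lsa` and a hypothetical function name:

```cpp
#include <cassert>
#include <cstdint>

struct Lsa
{
  std::int64_t pageid;
  int offset;
};

// Number of pages touched by the range [from, to]; 0 for an empty range.
std::int64_t pages_containing (const Lsa &from, const Lsa &to)
{
  if (from.pageid == to.pageid && from.offset == to.offset)
    {
      return 0;
    }
  return to.pageid - from.pageid + 1;   // inclusive of both end pages
}
```

A degenerate range left over from initialization therefore reports zero pages, while two different LSAs on one page report one.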

  1. log_recovery_analysis is a page-fetch outer loop around a record-step inner loop; corruption decisions belong to the driver, record semantics to log_rv_analysis_record (Ch 4–5).
  2. Broken pages fork on is_media_crash: backups truncate via log_recovery_resetlog at prev_lsa and raise did_incom_recovery; normal restarts are fatal (ER_TDE_CIPHER_IS_NOT_LOADED means “load the TDE key”).
  3. Partial page flush is caught by a once-per-page CRC check; the tear is the first all-0xff 4 KB block, and first_corrupted_rec_lsa cuts the scan with three per-record outcomes. A corrupted page is terminal (Invariant 3-C).
  4. log_is_page_of_record_broken (media crash only) validates forw_lsa plausibility and probe-fetches a multi-page record’s last page; a missing tail parks end_redo_lsa and prev_lsa on the broken record so resetlog cuts there.
  5. The redo range is honest (Invariant 3-B): everything strictly before end_redo_lsa is readable and complete; the boundary is fully probed or re-initialized before redo reads it.
  6. Point-in-time restore lives in log_rv_analysis_complete (LOG_REC_DONETIME), not the driver; analysis is also not read-only — a missing LOG_END_OF_LOG is physically repaired via log_startof_nxrec, a patched forw_lsa, and a page flush.

Chapter 4: Analysis Record Dispatch and Transaction Table Rebuild

Chapter 3’s driver feeds every LOG_RECORD_HEADER it reads to log_rv_analysis_record. This chapter traces how each arm rebuilds transaction-table state, plus the global counters that ride along — append point, next TRANID, MVCCID horizon. The postpone/sysop arms belong to Chapter 5; ARIES theory lives in the companion cubrid-recovery-manager.md.

4.1 The dispatch switch in log_rv_analysis_record

log_rv_analysis_record is a pure demultiplexer: one switch (log_type), no logic of its own; its pointer parameters all belong to the driver’s loop state (Chapter 3). Every LOG_RECTYPE lands in exactly one arm:

| Record type(s) | Handler | Effect on the table |
|---|---|---|
| LOG_UNDOREDO_DATA, LOG_DIFF_UNDOREDO_DATA, LOG_UNDO_DATA, LOG_REDO_DATA, their four LOG_MVCC_* twins, LOG_DBEXTERN_REDO_DATA | log_rv_analysis_undo_redo | advance tail_lsa + undo_nxlsa (4.3) |
| LOG_SAVEPOINT | log_rv_analysis_save_point | same, plus savept_lsa (4.3) |
| LOG_COMPENSATE | log_rv_analysis_compensate | redirect undo_nxlsa past undone work (4.3) |
| LOG_COMMIT, LOG_ABORT | log_rv_analysis_complete | free the tran index, or stop analysis early (4.4) |
| the seven LOG_2PC_* types | the seven log_rv_analysis_2pc_* arms | stamp a 2PC tdes->state (4.5) |
| LOG_START_CHKPT / LOG_END_CHKPT | log_rv_analysis_start_checkpoint / _end_checkpoint | arm may_use_checkpoint (4.7) / merge snapshot (4.8) |
| LOG_DUMMY_HEAD_POSTPONE, LOG_POSTPONE, LOG_RUN_POSTPONE, LOG_COMMIT_WITH_POSTPONE (+_OBSOLETE), LOG_SYSOP_START_POSTPONE, LOG_SYSOP_END, LOG_SYSOP_ATOMIC_START | the matching log_rv_analysis_* postpone/sysop arms | Chapter 5; commit-with-postpone’s early-stop branch mirrors 4.4 |
| LOG_END_OF_LOG | log_rv_analysis_log_end | reset append point + next_trid (4.10) |
| LOG_DUMMY_CRASH_RECOVERY, LOG_REPLICATION_DATA, LOG_REPLICATION_STATEMENT, LOG_DUMMY_HA_SERVER_STATE, LOG_DUMMY_OVF_RECORD, LOG_DUMMY_GENERIC, LOG_SUPPLEMENTAL_INFO | none (bare break) | no table effect |
| LOG_SMALLER_LOGREC_TYPE, LOG_LARGER_LOGREC_TYPE, default | none | er_set (ER_LOG_PAGE_CORRUPTED) + assert (false): “probably the log is corrupted” |

Return codes are discarded — most via (void) casts; the sysop-end and checkpoint arms simply ignore them. Almost every failure calls logpb_fatal_error, which terminates recovery; the lone exception is end-checkpoint’s sysop re-read (4.8 step 7) — debug builds assert, release builds swallow the error.

4.2 logtb_rv_find_allocate_tran_index — the lazy TDES allocator

Nearly every arm starts here (log_tran_table.c): map tran_id to a TDES, allocating on first sight. Three branches. B1: logtb_is_system_worker_tranid (trid) short-circuits to log_system_tdes::rv_get_or_alloc_tdes, keeping system workers out of the table. B2: logtb_find_tran_index misses, so logtb_allocate_tran_index (..., TRAN_UNACTIVE_UNILATERALLY_ABORTED, ...) runs, then LSA_COPY (&tdes->head_lsa, log_lsa); allocation failure is logpb_fatal_error + return NULL. B3: on a hit, LOG_FIND_TDES returns the existing TDES.

Invariant — presumed abort. Every TDES created during analysis is born TRAN_UNACTIVE_UNILATERALLY_ABORTED, head_lsa = first sighting. Only a later completion record (removal, 4.4) or 2PC/postpone record (state upgrade) changes the verdict; any other initial state would make the undo pass (Chapter 9) skip a loser and leave its updates on disk.
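The find-or-allocate behaviour and the presumed-abort birth state can be modelled with an ordinary map. This is a sketch, not the CUBRID table: `RecoveryTranTable`, `Tdes`, and the two-value `TranState` are hypothetical simplifications, and the system-worker branch (B1) is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

struct Lsa
{
  std::int64_t pageid;
  int offset;
};

enum class TranState { UnilaterallyAborted, TwoPcPrepare };

struct Tdes
{
  TranState state;
  Lsa head_lsa;
};

class RecoveryTranTable
{
public:
  // B2: a miss allocates a descriptor born as a loser at its first sighting.
  // B3: a hit returns the existing descriptor untouched.
  Tdes &find_or_allocate (int trid, const Lsa &first_sighting)
  {
    auto it = m_table.find (trid);
    if (it == m_table.end ())
      {
        it = m_table.emplace (trid, Tdes { TranState::UnilaterallyAborted, first_sighting }).first;
      }
    return it->second;
  }

  // Completion records remove the entry; returns false if it was never seen.
  bool drop (int trid) { return m_table.erase (trid) > 0; }

  std::size_t size () const { return m_table.size (); }

private:
  std::map<int, Tdes> m_table;
};
```

Whatever is still in the table with the birth state when analysis ends is a loser for the undo pass.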

4.3 The simple arms — undo_redo, save_point, compensate

log_rv_analysis_undo_redo covers all nine data-change types. Its only failure branch: a NULL TDES means logpb_fatal_error, return ER_FAILED. Otherwise LSA_COPY (&tdes->tail_lsa, log_lsa) then LSA_COPY (&tdes->undo_nxlsa, &tdes->tail_lsa): tail_lsa is the latest record, undo_nxlsa where undo starts walking backward; for a plain data record they coincide. log_rv_analysis_save_point adds LSA_COPY (&tdes->savept_lsa, &tdes->tail_lsa) for post-restart partial rollback.

log_rv_analysis_compensate handles LOG_COMPENSATE — a CLR, proof some update was already undone — and is the one arm where undo_nxlsa diverges from tail_lsa. After the allocator + NULL-fatal branch, it advances to the LOG_REC_COMPENSATE body (LOG_READ_ADD_ALIGN, LOG_READ_ADVANCE_WHEN_DOESNT_FIT) and executes one copy — LSA_COPY (&tdes->undo_nxlsa, &compensate->undo_nxlsa) — and does not advance tail_lsa. The copied pointer lands before the compensated update, so undo never restarts from the CLR itself: ARIES’ never-undo-an-undo rule, enforced purely by pointer redirection.

4.4 log_rv_analysis_complete — commit/abort finalization

LOG_COMMIT and LOG_ABORT share log_rv_analysis_complete — the only arm that removes table state, and one of two early-stop arms (the other, log_rv_analysis_commit_with_postpone in Chapter 5, carries the same stop_at/resetlog branch). Four branches:

// log_rv_analysis_complete -- src/transaction/log_recovery.c
tran_index = logtb_find_tran_index (thread_p, tran_id); /* <- find, never allocate */
// ... condensed: B1 -- if not media crash, goto end; else read LOG_REC_DONETIME -> last_at_time ...
if (stop_at != NULL && *stop_at != (time_t) (-1) && difftime (*stop_at, last_at_time) < 0)
  { /* B2: completion is newer than --until-time */
    log_lsa->pageid = NULL_PAGEID;
    log_recovery_resetlog (thread_p, &record_header_lsa, prev_lsa);
    *did_incom_recovery = true;
    return NO_ERROR; /* <- index NOT freed: tran stays a loser */
  }
end:
// ... condensed: B3 -- if tran_index != NULL_TRAN_INDEX, logtb_free_tran_index ...
return NO_ERROR; /* B4: never seen before -> nothing to drop */

Two asymmetries: it finds, never allocates — a completion whose transaction predates the window is a no-op (B4); and B2 keeps the index — truncating the log at the commit record makes the transaction retroactively in-flight, so undo rolls it back: restore-to-timestamp.

4.5 The seven 2PC arms — a state-transition table

Structurally identical: allocate the TDES (NULL: logpb_fatal_error, ER_FAILED), overwrite tdes->state, advance tail_lsa; none touches undo_nxlsa. Only the stamped state differs, and LOG_2PC_RECV_ACK is the one arm that leaves the state as it was:

| Record type | Handler | tdes->state stamped |
|---|---|---|
| LOG_2PC_PREPARE | log_rv_analysis_2pc_prepare | TRAN_UNACTIVE_2PC_PREPARE |
| LOG_2PC_START | log_rv_analysis_2pc_start | TRAN_UNACTIVE_2PC_COLLECTING_PARTICIPANT_VOTES |
| LOG_2PC_COMMIT_DECISION | log_rv_analysis_2pc_commit_decision | TRAN_UNACTIVE_2PC_COMMIT_DECISION |
| LOG_2PC_ABORT_DECISION | log_rv_analysis_2pc_abort_decision | TRAN_UNACTIVE_2PC_ABORT_DECISION |
| LOG_2PC_COMMIT_INFORM_PARTICPS | log_rv_analysis_2pc_commit_inform_particps | TRAN_UNACTIVE_COMMITTED_INFORMING_PARTICIPANTS |
| LOG_2PC_ABORT_INFORM_PARTICPS | log_rv_analysis_2pc_abort_inform_particps | TRAN_UNACTIVE_ABORTED_INFORMING_PARTICIPANTS |
| LOG_2PC_RECV_ACK | log_rv_analysis_2pc_recv_ack | unchanged; only tail_lsa advances |
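
Read as code, the table above is a single mapping from record type to stamped state. The sketch below uses locally declared enums that mirror the names in the table; they are illustrative re-declarations, not the CUBRID definitions, and `stamp_2pc_state` is a hypothetical helper.

```cpp
#include <cassert>

enum class RecType
{
  Prepare, Start, CommitDecision, AbortDecision,
  CommitInformParticps, AbortInformParticps, RecvAck
};

enum class TranState
{
  UnilaterallyAborted,
  TwoPcPrepare,
  TwoPcCollectingParticipantVotes,
  TwoPcCommitDecision,
  TwoPcAbortDecision,
  CommittedInformingParticipants,
  AbortedInformingParticipants
};

// Returns the state a 2PC record leaves behind, given the current one.
TranState stamp_2pc_state (RecType type, TranState current)
{
  switch (type)
    {
    case RecType::Prepare:              return TranState::TwoPcPrepare;
    case RecType::Start:                return TranState::TwoPcCollectingParticipantVotes;
    case RecType::CommitDecision:       return TranState::TwoPcCommitDecision;
    case RecType::AbortDecision:        return TranState::TwoPcAbortDecision;
    case RecType::CommitInformParticps: return TranState::CommittedInformingParticipants;
    case RecType::AbortInformParticps:  return TranState::AbortedInformingParticipants;
    case RecType::RecvAck:              return current;   // only tail_lsa advances
    }
  return current;
}
```

Because each record overwrites the state, the last 2PC record seen for a transaction decides what analysis hands to the later passes.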

LOG_2PC_PREPARE is the participant side; the rest are coordinator records. Prepare and start also plant tdes->gtrid = LOG_2PC_NULL_GTRID: a sentinel that the body (gtrid, participants, locks) was not read — it “needs to be read during either redo phase, or during finish_commit_protocol phase” (source comment); 4.9 consumes it.

4.6 The checkpoint record: LOG_REC_CHKPT, LOG_INFO_CHKPT_TRANS, LOG_INFO_CHKPT_SYSOP

A completed checkpoint is two records: an empty LOG_START_CHKPT anchor and a LOG_END_CHKPT whose body (log_record.hpp) is a LOG_REC_CHKPT header, ntrans LOG_INFO_CHKPT_TRANS entries, then ntops LOG_INFO_CHKPT_SYSOP entries.

LOG_REC_CHKPT (log_rec_chkpt) has three fields: redo_lsa — oldest recovery LSA of any dirty data page, because redo must start at the oldest unflushed change (4.8 step 8); ntrans and ntops — counts of the two arrays that follow, which are not self-delimiting (ntops is commonly zero).

LOG_INFO_CHKPT_TRANS (log_info_chkpt_trans) — one serialized TDES per live transaction:

| Field | Role | Why it exists |
|---|---|---|
| isloose_end | to tdes->isloose_end | Client loose ends |
| trid | Transaction id | Merge key for the allocator |
| state | Snapshot state; TRAN_ACTIVE and TRAN_UNACTIVE_ABORTED remap to TRAN_UNACTIVE_UNILATERALLY_ABORTED, others verbatim | Presumed abort; 2PC/postpone states survive |
| head_lsa | to tdes->head_lsa | May predate the analysis window |
| tail_lsa | to tdes->tail_lsa | Chain resume point; 2PC walk cursor (4.9) |
| undo_nxlsa | to tdes->undo_nxlsa | Pre-checkpoint CLR redirects (4.3) |
| posp_nxlsa | to tdes->posp_nxlsa | Postpone chain start (Chapter 5) |
| savept_lsa | to tdes->savept_lsa | Pre-checkpoint savepoints |
| tail_topresult_lsa | to tdes->tail_topresult_lsa | Skip completed sysops on rollback |
| start_postpone_lsa | to tdes->rcv.tran_start_postpone_lsa | Postpone completion (Chapter 8) |
| user_name | to tdes->client via set_system_internal_with_user | Loose-end owner |

LOG_INFO_CHKPT_SYSOP (log_info_chkpt_sysop) — only sysops committing with postpone are checkpointed; an ordinary in-flight sysop simply dies with its transaction:

| Field | Role | Why it exists |
|---|---|---|
| trid | Owning transaction | The sysop array is flat; entries join by id |
| sysop_start_postpone_lsa | to tdes->rcv.sysop_start_postpone_lsa | Non-null triggers re-reading that record (4.8 step 7) |
| atomic_sysop_start_lsa | to tdes->rcv.atomic_sysop_start_lsa | Drives atomic-sysop abort (Chapter 8) |

4.7 log_rv_analysis_start_checkpoint and the may_use_checkpoint guard

The LOG_START_CHKPT arm is one condition — if (LSA_EQ (log_lsa, start_lsa)) { *may_use_checkpoint = true; } — and that condition is the design. start_lsa is where analysis began: log_Gl.hdr.chkpt_lsa, updated only when a checkpoint completes (Chapter 3). The flag arms only for the anchor start record, never for a LOG_START_CHKPT met mid-scan — such a snapshot “can contain stuff which does not exist any longer” (source comment).

stateDiagram-v2
    [*] --> Unset : analysis starts, flag false
    Unset --> Armed : LOG_START_CHKPT at start_lsa
    Unset --> Unset : LOG_START_CHKPT elsewhere, LSA_EQ fails
    Armed --> Consumed : LOG_END_CHKPT, merge snapshot then clear flag
    Unset --> Unset : LOG_END_CHKPT, guard returns early
    Consumed --> Consumed : any later checkpoint records ignored

Figure 4-1: Lifecycle of may_use_checkpoint. Only the END pairing with the anchor START can merge a snapshot.

This answers the crash-window question. Crash between START and END: the header still names the previous completed checkpoint; the unfinished window’s START fails LSA_EQ, its END was never written. A second complete window inside the range (media recovery): its START fails LSA_EQ, its END dies on the 4.8 guard.
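The lifecycle in Figure 4-1 is small enough to model directly. The sketch below is a toy state holder with hypothetical names (`CheckpointGuard`, `on_start_chkpt`, `on_end_chkpt`); it only tracks whether an END record would be merged.

```cpp
#include <cassert>
#include <cstdint>

struct Lsa
{
  std::int64_t pageid;
  int offset;
};

inline bool lsa_eq (const Lsa &a, const Lsa &b) { return a.pageid == b.pageid && a.offset == b.offset; }

struct CheckpointGuard
{
  bool may_use_checkpoint = false;
  int merges = 0;

  // Arms only for the START record sitting exactly at the analysis anchor.
  void on_start_chkpt (const Lsa &rec_lsa, const Lsa &start_lsa)
  {
    if (lsa_eq (rec_lsa, start_lsa))
      {
        may_use_checkpoint = true;
      }
  }

  // Returns true when this END record's snapshot would be merged.
  bool on_end_chkpt ()
  {
    if (!may_use_checkpoint)
      {
        return false;                   // unpaired END: ignored
      }
    may_use_checkpoint = false;         // single-shot
    ++merges;
    return true;
  }
};
```

However many checkpoint windows the scanned range contains, at most one merge can happen, and only for the window that starts at the anchor.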

4.8 log_rv_analysis_end_checkpoint — merging the snapshot, branch by branch

The longest arm; every branch accounted for:

  1. Guard. if (*may_use_checkpoint == false) return NO_ERROR; — unpaired ENDs die here; otherwise the flag clears at once: single-shot.
  2. Anchor capture. LSA_COPY (check_point, log_lsa) saves the END’s LSA into the driver’s checkpoint_lsa — used by the run-postpone arm (Chapter 5) and 4.9.
  3. Header read. LOG_REC_CHKPT is copied by value (chkpt = *tmp_chkpt) — later page advances may evict its page.
  4. Trans array — two branches. In-page (log_lsa->offset + size < LOGAREA_SIZE): used in place; else malloc + logpb_copy_from_log; malloc failure is fatal.
  5. Merge loop over chkpt.ntrans entries — allocator first (NULL: free area, logpb_fatal_error, ER_FAILED), then:
// log_rv_analysis_end_checkpoint -- src/transaction/log_recovery.c
logtb_clear_tdes (thread_p, tdes); /* <- wipe what the loop built so far */
if (chkpt_one->state == TRAN_ACTIVE || chkpt_one->state == TRAN_UNACTIVE_ABORTED)
  {
    tdes->state = TRAN_UNACTIVE_UNILATERALLY_ABORTED; /* <- presumed-abort remap */
  }
else
  {
    tdes->state = chkpt_one->state; /* <- 2PC / postpone states survive */
  }
// ... condensed: isloose_end, six LSA_COPYs, rcv.tran_start_postpone_lsa, user name ...
if (LOG_ISTRAN_2PC (tdes))
  {
    *may_need_synch_checkpoint_2pc = true; /* <- defer 2PC body reads (4.9) */
  }

Invariant (snapshot atomicity with the END record). logtb_clear_tdes clobbers state already built from records between START and END. Safe only because logpb_checkpoint snapshots the table and appends LOG_END_CHKPT (prior_lsa_next_record_with_lock) under one log_Gl.prior_info.prior_lsa_mutex hold: nothing appends in between, so the snapshot supersedes everything since START. Release the mutex earlier and this merge would silently regress tail_lsa/undo_nxlsa, and undo would skip live changes.

  6. Trans area release. free_and_init (area) nulls area for reuse by the sysop array.
  7. Sysop merge, gated by chkpt.ntops > 0. Same in-page-vs-malloc branches as step 4. Per entry: allocate the TDES by trid; grow the topops stack (logtb_realloc_topops_stack) when tdes->topops.max == 0 || (tdes->topops.last + 1) >= tdes->topops.max (failure: free, fatal); copy both LSAs into tdes->rcv. If sysop_start_postpone_lsa is non-null: bump topops.last from -1 to 0, or else assert (tdes->topops.last == 0), since at most one level exists during recovery; then log_read_sysop_start_postpone re-reads that record on a private page buffer to fill topops.stack[last].lastparent_lsa and .posp_lsa, which the checkpoint entry omits. This is the only place analysis re-reads an older record; its error path is assert (false); return error_code; with no logpb_fatal_error (4.1).
  8. Redo start pull-back. if (LSA_LT (&chkpt.redo_lsa, start_redo_lsa)) LSA_COPY (start_redo_lsa, &chkpt.redo_lsa); so redo (Chapter 6) begins at the oldest dirty page’s recovery LSA.
  9. Final free_and_init (area) for the sysop copy (no-op if in-page), then NO_ERROR.

4.9 may_need_synch_checkpoint_2pc — the deferred 2PC reconstruction

After the main loop, log_recovery_analysis re-fetches LOG_END_CHKPT at the saved checkpoint_lsa and, for every trans entry whose TDES still satisfies LOG_ISTRAN_2PC, calls log_2pc_recovery_analysis_info (thread_p, tdes, &chkpt_trans[i].tail_lsa) (log_2pc.c): a prev_tranlsa back-chain walk from the snapshot-time tail_lsa, reading the LOG_2PC_PREPARE body while tdes->gtrid == LOG_2PC_NULL_GTRID and the LOG_2PC_START body while tdes->coord == NULL, collecting acks. The snapshot omits 2PC bodies “due to the big space overhead (e.g., locks)” (source comment), and they may predate the window — only a backward walk recovers them; the re-check skips transactions that completed after the snapshot.

4.10 LOG_END_OF_LOG, next_trid, and MVCCID restoration

Two pieces of global state ride along with the per-transaction rebuild. First, the EOF arm — log_rv_analysis_log_end is one branch, if (!logpb_is_page_in_archive (log_lsa->pageid)): only an EOF in the active log counts. Inside it, LOG_RESET_APPEND_LSA (log_lsa) re-anchors the append point so post-recovery writes overwrite the EOF, and log_Gl.hdr.next_trid = tran_id restarts the TRANID counter from the EOF record’s own trid — restart never re-issues an id seen in the log. An EOF inside an archive is an artifact of archiving an incomplete log and is skipped; the no-EOF-found repair path is the driver’s (Chapter 3).

Second, MVCCIDs. Deliberately, no analysis arm restores tdes->mvccinfo — rebuilt losers carry no MVCCID out of analysis. Instead the last statement of log_recovery_analysis (and of its incomplete-recovery early return) is log_Gl.mvcc_table.reset_start_mvccid () (mvcc_table.cpp), re-seeding the active-MVCCID bitmap start and m_current_status_lowest_active_mvccid from log_Gl.hdr.mvcc_next_id: every lower MVCCID is treated as no longer active. Redo refines the header value — each replayed MVCC record pushes log_Gl.hdr.mvcc_next_id past its own id — and reset_start_mvccid runs once more after redo (Chapter 6). A loser’s original MVCCID reappears only during undo: logtb_rv_assign_mvccid_for_undo_recovery sets tdes->mvccinfo.id from the undone record’s rcv->mvcc_id (Chapter 9).

  1. log_rv_analysis_record is a logic-free demultiplexer; an unknown LOG_RECTYPE is page corruption; seven dummy/replication types are no-ops. Handler failures end in logpb_fatal_error — except end-checkpoint’s sysop re-read, dropped in release builds.
  2. logtb_rv_find_allocate_tran_index enforces presumed abort: transactions are born TRAN_UNACTIVE_UNILATERALLY_ABORTED at first sighting; system workers live in a separate log_system_tdes map.
  3. Only log_rv_analysis_compensate makes undo_nxlsa diverge from tail_lsa, jumping over already-undone work via the CLR’s stored pointer.
  4. log_rv_analysis_complete finds but never allocates, and is the only arm that removes table state; its stop_at branch truncates the log and keeps the index — point-in-time restore.
  5. The seven 2PC arms differ only in the stamped TRAN_STATE; prepare/start plant the gtrid = LOG_2PC_NULL_GTRID sentinel consumed by the post-loop log_2pc_recovery_analysis_info walk.
  6. A LOG_END_CHKPT merges only when armed by a LOG_START_CHKPT at exactly start_lsa — half-built or extra checkpoint windows are ignored by construction; the logtb_clear_tdes-then-overwrite merge is safe because logpb_checkpoint snapshots the table and appends the END under one prior_lsa_mutex hold.
  7. Global counters ride along: LOG_END_OF_LOG re-anchors the append point and next_trid; MVCCIDs are not rebuilt per transaction — reset_start_mvccid re-seeds the MVCC table from log_Gl.hdr.mvcc_next_id, and undo re-attaches loser MVCCIDs lazily.

Chapter 5: Sysop and Postpone Bookkeeping During Analysis

The messy middles — transactions caught inside system operations, atomic sysops, or commit-time postpones — become five LSA annotations in LOG_RCV_TDES, written by the log_rv_analysis_* arms below (driver: Ch 3, dispatch: Ch 4). Theory: high-level companion (cubrid-recovery-manager.md).

5.1 LOG_RCV_TDES — the recovery annotation block

LOG_RCV_TDES (struct log_rcv_tdes in log_impl.h) is five LOG_LSA fields embedded in every LOG_TDES as field rcv; outside recovery all five stay null.

| Field | Role | Why it exists |
|---|---|---|
| sysop_start_postpone_lsa | Last open LOG_SYSOP_START_POSTPONE; written by log_rv_analysis_sysop_start_postpone, checkpoint-restored (Ch 4), reset by log_rv_analysis_sysop_end | log_recovery_finish_sysop_postpone (Ch 8) re-reads it to resume the sysop’s postpone list; no end record points to it |
| tran_start_postpone_lsa | The transaction’s LOG_COMMIT_WITH_POSTPONE; written by log_rv_analysis_commit_with_postpone + obsolete variant, checkpoint-restored (Ch 4) | Non-null-ness picks the state restored when a sysop postpone ends (5.7); bound for log_recovery_finish_postpone |
| atomic_sysop_start_lsa | Last unmatched LOG_SYSOP_ATOMIC_START; written by log_rv_analysis_atomic_sysop_start, checkpoint-restored (Ch 4), reset by both sysop arms when the atomic op is proven complete | Still set after redo → log_recovery_abort_all_atomic_sysops (Ch 8) rolls back to it before postpones run |
| analysis_last_aborted_sysop_lsa | Most recent ABORT-type LOG_SYSOP_END; written only in that arm of log_rv_analysis_sysop_end | Upper bound of the logical-redo skip range (log_recovery_needs_skip_logical_redo, Ch 6) |
| analysis_last_aborted_sysop_start_lsa | lastparent_lsa of that same aborted sysop end | Lower bound of the same skip range |

flowchart LR
    cwp["commit_with_postpone"] --> f1["tran_start_postpone_lsa"]
    ssp["sysop_start_postpone"] --> f2["sysop_start_postpone_lsa"]
    ats["atomic_sysop_start"] --> f3["atomic_sysop_start_lsa"]
    se["sysop_end"] --> f4["analysis_last_aborted_sysop_lsa<br/>+ _start_lsa"]
    se -. resets .-> f2
    se -. resets .-> f3
    f1 --> fp["finish_postpone (Ch 8)"]
    f1 --> fsp["finish_sysop_postpone (Ch 8)"]
    f2 --> fsp
    f3 --> aas["abort_all_atomic_sysops (Ch 8)"]
    f4 --> skip["needs_skip_logical_redo (Ch 6)"]

Figure 5-1: annotation writers (left) and post-redo consumers (right), prefixes elided.

Invariant — annotations survive only while their phase is open. Each field is nulled once analysis proves its phase concluded pre-crash (reset guards, 5.7). Stale atomic_sysop_start_lsa → Ch 8 rolls back a committed operation; stale sysop_start_postpone_lsa → an already-run postpone list replays.

5.2 LOG_REC_SYSOP_END and LOG_SYSOP_END_TYPE

Every system operation ends with LOG_SYSOP_END, body LOG_REC_SYSOP_END (log_record.hpp) — three fixed fields, a vfid pointer, and a union switched by type:

| Field | Role | Why it exists |
|---|---|---|
| lastparent_lsa | Transaction’s last LSA before the sysop started | Undo jump target over the sysop; compared against the annotations to detect nesting order |
| prv_topresult_lsa | Previous concluded top action’s LSA | Chains sysop results so partial abort can skip them (tail_topresult_lsa) |
| type | One of six LOG_SYSOP_END_TYPE values | Selects union interpretation and recovery behavior |
| vfid | Owning file; equals mvcc_undo’s vacuum-info file for MVCC undo | TDE (encryption) context lookup |
| union undo | Logical undo payload (LOGICAL_UNDO) | Multi-page op undoes via one logical recovery function |
| union mvcc_undo | Undo + MVCCID/vacuum info (LOGICAL_MVCC_UNDO) | Vacuum must see the operation’s MVCCID |
| union compensate_lsa | Next-undo LSA (LOGICAL_COMPENSATE) | The sysop replaces a compensation record; undo resumes here |
| union run_postpone | postpone_lsa + is_sysop_postpone flag (LOGICAL_RUN_POSTPONE) | Replaces a LOG_RUN_POSTPONE; the flag says whose postpone list advances (5.7) |

LOG_SYSOP_END_TYPE (enum log_sysop_end_type, log_record.hpp) has six values: LOG_SYSOP_END_COMMIT (“permanent changes”), LOG_SYSOP_END_ABORT (“aborted system op”), and the four LOG_SYSOP_END_LOGICAL_* flavors UNDO, MVCC_UNDO, COMPENSATE, RUN_POSTPONE. The union is a role matrix switched solely by type (asserted by LOG_SYSOP_END_TYPE_CHECK); 5.7 traces each value’s analysis-time effect.

5.3 Postpone-side arms: LOG_POSTPONE, LOG_DUMMY_HEAD_POSTPONE, LOG_RUN_POSTPONE

log_rv_analysis_postpone (LOG_POSTPONE) and log_rv_analysis_dummy_head_postpone (the no-op LOG_DUMMY_HEAD_POSTPONE marker) each have two branches: a fatal logtb_rv_find_allocate_tran_index == NULL early return (logpb_fatal_error, ER_FAILED) and the first-postpone capture. On LSA_ISNULL (posp_nxlsa) the postpone arm copies the previous tail_lsa into posp_nxlsa before advancing tail_lsa/undo_nxlsa (“set address early”); the dummy-head arm advances first and captures after (“set address late”), landing on the dummy head itself. posp_nxlsa is where log_recovery_find_first_postpone (Ch 8) starts scanning.
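
The difference between the two capture orders is easy to miss in prose. The following is a minimal sketch, not the CUBRID code: LSAs are flattened to a single `long`, `-1` stands in for a null LSA, and `toy_tdes`, `on_postpone` and `on_dummy_head_postpone` are hypothetical names introduced for illustration.

```cpp
#include <cassert>

constexpr long toy_null_lsa = -1;   // assumption: stand-in for a null LOG_LSA

struct toy_tdes
{
  long tail_lsa;
  long posp_nxlsa;
};

// "Set address early": capture the previous tail, then advance the tail.
inline void on_postpone (toy_tdes &tdes, long log_lsa)
{
  if (tdes.posp_nxlsa == toy_null_lsa)
    {
      tdes.posp_nxlsa = tdes.tail_lsa;
    }
  tdes.tail_lsa = log_lsa;
}

// "Set address late": advance the tail, then capture the record's own address.
inline void on_dummy_head_postpone (toy_tdes &tdes, long log_lsa)
{
  tdes.tail_lsa = log_lsa;
  if (tdes.posp_nxlsa == toy_null_lsa)
    {
      tdes.posp_nxlsa = tdes.tail_lsa;
    }
}
```

Starting from a tail of 100 and a null cursor, a postpone record at 120 leaves the cursor at 100, while a dummy head at 120 leaves it at 120; in both cases a cursor that is already set is left untouched.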

log_rv_analysis_run_postpone handles LOG_RUN_POSTPONE (a postpone already executed and redo-logged). Branches:

  1. tdes == NULL → fatal, ER_FAILED.
  2. State not in {WILL_COMMIT, COMMITTED_WITH_POSTPONE, TOPOPE_COMMITTED_WITH_POSTPONE} (TRAN_UNACTIVE_ elided): impossible for a checkpointed tdes (SYSTEM ERROR debug log), normal otherwise; recovery guesses topops.last == -1 → committed-with-postpone, else topope-committed.
  3. State now TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE → LSA_SET_NULL (undo_nxlsa): nothing left to undo.
  4. Body read (Ch 2 macros); run_posp->ref_lsa — the LOG_POSTPONE this record executed — resets the cursor: topops.stack[last].posp_lsa in the topope state, else tdes->posp_nxlsa (other two states asserted).

Invariant — posp_nxlsa always points at the next postpone not yet known to have run. LOG_POSTPONE sets it once; every LOG_RUN_POSTPONE advances it to ref_lsa. Lagging → Chapter 8 runs a postpone twice; overshooting → deferred work silently lost.

log_rv_analysis_commit_with_postpone handles LOG_COMMIT_WITH_POSTPONE: outcome decided, deferred work possibly unfinished. After the fatal-tdes branch it reads LOG_REC_START_POSTPONE (posp_lsa + at_time) and forks on is_media_crash:

// log_rv_analysis_commit_with_postpone -- src/transaction/log_recovery.c
if (is_media_crash)
{
// ... condensed: stop_at test -> resetlog + *did_incom_recovery = true ...
}
else
{
tdes->state = TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE;
LSA_SET_NULL (&tdes->undo_nxlsa); /* Nothing to undo */
LSA_COPY (&tdes->tail_lsa, log_lsa);
tdes->rcv.tran_start_postpone_lsa = tdes->tail_lsa; /* <- annotation write */
LSA_COPY (&tdes->posp_nxlsa, &start_posp->posp_lsa);
}

The media-crash arm is point-in-time recovery: when stop_at != NULL && *stop_at != (time_t) (-1) && difftime (*stop_at, last_at_time) < 0 — commit past the restore target — it releases the page, truncates the log (log_recovery_resetlog, Ch 11), sets *did_incom_recovery, and the transaction is treated as never committed. If the stop_at test fails (or stop_at is NULL/-1), the media-crash arm is a no-op — the annotation and state updates happen only in the non-media-crash arm.
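
The stop test can be read in isolation. This is a minimal sketch under stated assumptions: `commit_is_past_stop_point` is a hypothetical helper name, and only the three-part condition is taken from the text above.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper (not a CUBRID function): true when a restore target was
// requested and the commit timestamp lies after it.
inline bool commit_is_past_stop_point (const std::time_t *stop_at, std::time_t last_at_time)
{
  return stop_at != nullptr
	 && *stop_at != (std::time_t) (-1)
	 && std::difftime (*stop_at, last_at_time) < 0;
}
```

A null pointer or a target of `(time_t) (-1)` both mean "no stop point requested", so the predicate is false for them regardless of the commit time.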

log_rv_analysis_commit_with_postpone_obsolete (LOG_COMMIT_WITH_POSTPONE_OBSOLETE, old layout LOG_REC_START_POSTPONE_OBSOLETE without at_time) performs exactly the non-media-crash arm — no timestamp, no point-in-time stop. Kept only to read old-release logs; slated for removal “maybe 12.0”.

log_rv_analysis_sysop_start_postpone handles LOG_SYSOP_START_POSTPONE, which marks a sysop that finished its main work and began its own postpone list. Its body LOG_REC_SYSOP_START_POSTPONE is an embedded LOG_REC_SYSOP_END sysop_end (what the end record will say) plus posp_lsa (first postpone of the sysop). Branches:

  1. Fatal-tdes → ER_FAILED.
  2. tail_lsa/undo_nxlsa advance; annotation write: tdes->rcv.sysop_start_postpone_lsa = tdes->tail_lsa.
  3. Three-way fork on the embedded end type: state already topope-committed → assert_release (false) (two simultaneous sysop postpones cannot exist); sysop_end.type == LOG_SYSOP_END_LOGICAL_RUN_POSTPONE → nested is_sysop_postpone == true asserted impossible, and the transaction-postpone flavor nulls undo_nxlsa (the transaction is committing regardless of its guessed state); otherwise assert (type != LOG_SYSOP_END_ABORT) — an aborting sysop never starts a postpone phase.
  4. State := TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE.
  5. Topops stack grown via logtb_realloc_topops_stack if needed (ER_OUT_OF_VIRTUAL_MEMORY on failure); topops.last must be -1, bumped to 0 (assert (false) otherwise); lastparent_lsa and posp_lsa copy into topops.stack[0].
  6. LSA_LT (sysop_end.lastparent_lsa, rcv.atomic_sysop_start_lsa) means the atomic marker was logged inside this sysop; reaching start-postpone proves the atomic part completed, so the marker is nulled.

Invariant — at most one live sysop postpone, so topops.last <= 0 throughout recovery. Enforced by the asserts in steps 3 and 5, re-checked in log_rv_analysis_sysop_end (assert (tdes->topops.last == 0)). If violated, the run-postpone arms would advance the wrong stack entry’s posp_lsa.

log_rv_analysis_atomic_sysop_start, the arm for LOG_SYSOP_ATOMIC_START, is the simplest, with two branches: fatal-tdes, and success, which advances tail_lsa/undo_nxlsa then writes tdes->rcv.atomic_sysop_start_lsa = *log_lsa (the record has no body; the LSA is the payload). If nothing clears it (5.5, 5.7), log_recovery_abort_all_atomic_sysops → log_recovery_abort_atomic_sysop (Ch 8) rolls the transaction back to this LSA before postpones resume.

5.7 log_rv_analysis_sysop_end — the intricate one

Closes a sysop of unknown kind for a transaction in an only-guessed state. Prologue: fatal-tdes branch; advance tail_lsa, undo_nxlsa, tail_topresult_lsa; read LOG_REC_SYSOP_END; LOG_SYSOP_END_TYPE_CHECK. Then the switch, where local commit_start_postpone decides whether this end also closes an open sysop-postpone phase:

// log_rv_analysis_sysop_end -- src/transaction/log_recovery.c
case LOG_SYSOP_END_ABORT:
// ... condensed: comment -- abort neither changes state nor finishes a topope postpone ...
if (tdes->state == TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE)
{
LSA_SET_NULL (&tdes->undo_nxlsa); /* no undo */
}
tdes->rcv.analysis_last_aborted_sysop_lsa = *log_lsa; /* <- skip-range upper bound */
tdes->rcv.analysis_last_aborted_sysop_start_lsa = sysop_end->lastparent_lsa; /* <- lower bound */
break;
case LOG_SYSOP_END_COMMIT:
assert (tdes->state != TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE); /* <- falls through to next cases */
case LOG_SYSOP_END_LOGICAL_UNDO:
case LOG_SYSOP_END_LOGICAL_MVCC_UNDO:
// ... condensed: todo comment ...
commit_start_postpone = true;
break;
case LOG_SYSOP_END_LOGICAL_COMPENSATE:
tdes->undo_nxlsa = sysop_end->compensate_lsa; /* <- jump undo over compensated range */
commit_start_postpone = true;
break;

The ABORT arm is the aborted-sysop tracker: a LOG_DBEXTERN_REDO_DATA logical redo inside the aborted range would re-create state the pre-crash rollback destroyed, so log_recovery_needs_skip_logical_redo (Ch 6) skips records with analysis_last_aborted_sysop_start_lsa < lsa < analysis_last_aborted_sysop_lsa. Each ABORT end overwrites the fields — only the last aborted sysop is tracked.
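
The skip window is a strict open interval, which matters at both endpoints. A minimal sketch, assuming a toy `toy_lsa` type ordered by (pageid, offset) and assuming a null LSA is encoded as (-1, -1); `inside_aborted_sysop` is a hypothetical name that mirrors the comparison described above.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Toy stand-in for LOG_LSA: ordered lexicographically by (pageid, offset).
struct toy_lsa
{
  std::int64_t pageid;
  std::int16_t offset;
};

inline bool lsa_lt (const toy_lsa &a, const toy_lsa &b)
{
  return std::tie (a.pageid, a.offset) < std::tie (b.pageid, b.offset);
}

// True only strictly inside (start, end); the endpoints themselves are not skipped.
inline bool inside_aborted_sysop (const toy_lsa &start, const toy_lsa &end, const toy_lsa &lsa)
{
  return lsa_lt (start, lsa) && lsa_lt (lsa, end);
}
```

Under the (-1, -1) assumption, a transaction that never recorded an aborted sysop has an empty window: no valid LSA is smaller than the null upper bound, so nothing is skipped.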

The LOG_SYSOP_END_LOGICAL_RUN_POSTPONE arm: in topope-committed state the run-postpone sysop could belong to either postpone scope; run_postpone.is_sysop_postpone decides:

  • true (sysop’s postpone): if topops.last < 0 or state is not topope-committed, the stack is conjured — realloc if max == 0 (fatal ER_OUT_OF_VIRTUAL_MEMORY), topops.last = 0, state forced to topope-committed; then topops.stack[last].posp_lsa = run_postpone.postpone_lsa. commit_start_postpone stays false — the phase continues.
  • false (transaction’s postpone): posp_nxlsa = run_postpone.postpone_lsa; topops.last != -1 → asserts confirm the topope state, else state := TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE; undo_nxlsa nulled; commit_start_postpone = true.

The epilogue runs for every arm. In topope-committed state (assert (topops.last == 0)) with commit_start_postpone set, the sysop postpone phase is over and tran_start_postpone_lsa picks the restored state: non-null restores TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE (asserted LSA_LE to lastparent_lsa — the sysop ran inside the transaction’s postpone phase); null restores the default recovery state TRAN_UNACTIVE_UNILATERALLY_ABORTED. Either way topops.last = -1. Without commit_start_postpone the phase continues (topops.last stays 0); in any non-topope state it is (re)set to -1.
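
The epilogue's decision table can be modeled in a few lines. This is a sketch of the described behavior only, with hypothetical names (`toy_state`, `toy_tdes`, `sysop_end_epilogue`) and a boolean standing in for "tran_start_postpone_lsa is non-null"; it is not the source function.

```cpp
#include <cassert>

enum class toy_state
{
  unilaterally_aborted,
  committed_with_postpone,
  topope_committed_with_postpone
};

struct toy_tdes
{
  toy_state state;
  int topops_last;                // -1 = empty stack, 0 = one live sysop postpone
  bool tran_start_postpone_set;   // stands in for a non-null tran_start_postpone_lsa
};

inline void sysop_end_epilogue (toy_tdes &tdes, bool commit_start_postpone)
{
  if (tdes.state == toy_state::topope_committed_with_postpone)
    {
      assert (tdes.topops_last == 0);
      if (commit_start_postpone)
	{
	  // The sysop postpone phase is over: restore the enclosing state.
	  tdes.state = tdes.tran_start_postpone_set
	    ? toy_state::committed_with_postpone
	    : toy_state::unilaterally_aborted;
	  tdes.topops_last = -1;
	}
      // Otherwise the phase continues and topops_last stays 0.
    }
  else
    {
      tdes.topops_last = -1;
    }
}
```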

Two symmetric reset guards follow — a postpone phase and an atomic sysop can nest either way, and the end belongs to whichever started later. The atomic guard nulls rcv.atomic_sysop_start_lsa only if three conditions hold: (1) it is non-null; (2) LSA_GT over sysop_start_postpone_lsa — the atomic op is the more recent open phase; (3) LSA_GT (atomic_sysop_start_lsa, sysop_end->lastparent_lsa). Condition 3 is the resurrection guard: if lastparent_lsa >= atomic_sysop_start_lsa, this end closes a sysop that began after the atomic marker — one nested inside the still-open atomic operation — and clearing the annotation on its end would let recovery skip the still-unfinished atomic operation. Only an end whose lastparent_lsa precedes the marker (the sysop containing the marker) proves the atomic op completed and may clear it. The mirror-image guard nulls sysop_start_postpone_lsa identically.
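
The three conditions of the atomic guard combine into one predicate. A minimal sketch, assuming the same toy (pageid, offset) LSA model with (-1, -1) as null; `may_clear_atomic_marker` is a hypothetical name, not a function in the source.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

struct toy_lsa
{
  std::int64_t pageid;
  std::int16_t offset;
};

inline bool lsa_is_null (const toy_lsa &a)
{
  return a.pageid == -1;   // assumption: null LSA is (-1, -1)
}

inline bool lsa_gt (const toy_lsa &a, const toy_lsa &b)
{
  return std::tie (a.pageid, a.offset) > std::tie (b.pageid, b.offset);
}

// Conditions (1) to (3) from the text, in order.
inline bool may_clear_atomic_marker (const toy_lsa &atomic_start,
				     const toy_lsa &sysop_start_postpone,
				     const toy_lsa &lastparent)
{
  return !lsa_is_null (atomic_start)
	 && lsa_gt (atomic_start, sysop_start_postpone)
	 && lsa_gt (atomic_start, lastparent);
}
```

With the marker at (20, 0), an end whose lastparent_lsa is (19, 0) clears it, while an end whose lastparent_lsa is (21, 0) belongs to a sysop nested inside the still-open atomic operation and leaves the marker in place.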

  1. LOG_RCV_TDES is a five-LSA annotation block in every LOG_TDES, written by analysis arms (plus checkpoint restore, Ch 4), consumed by Chapters 6 and 8, nulled once its phase is proven concluded.
  2. log_rv_analysis_commit_with_postpone writes tran_start_postpone_lsa and doubles as the point-in-time stop on media crash; the obsolete variant is the same minus the timestamp.
  3. log_rv_analysis_sysop_start_postpone writes sysop_start_postpone_lsa, forces topops.last from -1 to 0, and clears an atomic_sysop_start_lsa proven nested inside the now-postponing sysop.
  4. log_rv_analysis_sysop_end is a six-arm switch: ABORT records the skip range without changing state; COMMIT and both LOGICAL_UNDO flavors close an open sysop postpone phase; LOGICAL_COMPENSATE also redirects undo_nxlsa; LOGICAL_RUN_POSTPONE disambiguates via is_sysop_postpone.
  5. When a sysop postpone phase closes, the prior state is rebuilt from tran_start_postpone_lsa: non-null → TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE, null → TRAN_UNACTIVE_UNILATERALLY_ABORTED; the reset guards then compare both annotations against lastparent_lsa so an end clears only its own phase’s annotation.
  6. analysis_last_aborted_sysop_start_lsa < lsa < analysis_last_aborted_sysop_lsa is how log_recovery_needs_skip_logical_redo suppresses LOG_DBEXTERN_REDO_DATA replay inside a pre-crash-aborted sysop.

Chapter 6: Redo Pass Driver and Synchronous Apply

log_recovery_redo replays the log forward over the range analysis fixed (Chapter 3) against the rebuilt transaction table (Chapter 4) — the driver loop, then the synchronous apply path down to LZ4/XOR payload assembly. Theory: companion, “Redo pass”; parallel leg: Chapter 7; loose ends: Chapter 8; RV_fun: Chapter 10.

6.1 The redo context and per-record structs

log_rv_redo_context is the whole apply state; its constructor pre-allocates both LOG_ZIP buffers at LOGAREA_SIZE:

| Field | Role |
|---|---|
| m_reader | private log cursor (Chapter 2); parallel workers each need one |
| m_redo_zip | redo payload scratch+output; rcv.data points into it, no per-record malloc |
| m_undo_zip | undo scratch for diff undoredo records; diffed redo XORs against the undo image |
| m_end_redo_lsa | const hard stop; past-end records are torn tail; bounds the §6.4 page-LSA assert |
| m_reader_fetch_page_mode | NORMAL for crash recovery (trusts its snapshot), FORCE for replication re-fetch |

The copy constructor delegates to the main constructor — copies share nothing, making Chapter 7’s per-worker copies safe. Each record travels as a value snapshot, log_rv_redo_rec_info<T>, with exactly three fields: m_start_lsa — the record header’s LSA, stamped onto the page after apply, the idempotence comparand; m_type — the concrete LOG_RECTYPE (one T serves plain and DIFF rectypes; the diff decision needs it); m_logrec — a by-value copy of the typed body taken via reinterpret_copy_and_add_align, so a queued job holds no log-page pointer.

Debug-only vpid_lsa_consistency_check (check / cleanup) has exactly two fields: mtx — parallel redo workers call check concurrently — and consistency_check_map, the first-seen LSA per (volid, pageid) (emplace never overwrites an existing key); cleanup clears it at pass end.

Invariant — per-page LSA ordering. Out-of-order apply loses updates. Enforced (debug, rcv_phase != LOG_RESTARTED) by assert ((*map_it).second < a_log_lsa) — each new LSA compared against the page’s first recorded LSA (weaker than pairwise monotonicity; emplace keeps the original entry).
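
Why the check is weaker than pairwise monotonicity follows directly from `emplace` semantics. The sketch below is a simplified model, not the debug class itself: LSAs are flattened to one integer, the assert is replaced by a boolean result, and `toy_consistency_check` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <utility>

class toy_consistency_check
{
public:
  // True on first sighting of a page, or when lsa exceeds the FIRST lsa seen for it.
  bool check (short volid, int pageid, std::int64_t lsa)
  {
    std::lock_guard<std::mutex> lock (m_mtx);
    auto result = m_first_seen.emplace (std::make_pair (volid, pageid), lsa);
    if (result.second)
      {
	return true;   // key was absent; this lsa becomes the reference
      }
    return result.first->second < lsa;   // existing entry is never overwritten
  }

  void cleanup ()
  {
    std::lock_guard<std::mutex> lock (m_mtx);
    m_first_seen.clear ();
  }

private:
  std::mutex m_mtx;
  std::map<std::pair<short, int>, std::int64_t> m_first_seen;
};
```

For one page, the sequence 100, 300, 200 passes every check even though 200 arrives after 300, because each value is compared only with 100; a later 50 is the first value to fail.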

Invariant — m_redo_zip buffer stability. rcv.data aliases m_redo_zip.log_data until the redofun returns; enforced structurally — one context per thread, sequential assembly; recycle early and the redofun reads garbage.

6.2 log_recovery_redo — setup and the outer loop

The driver drops the log critical section (LOG_CS_EXIT; re-entered at the tail). log_recovery_get_redo_parallel_count returns MAX (16, system_core_count), which sizes reusable_jobs and cublog::redo_parallel under SERVER_MODE (Chapter 7); in SA mode parallel_recovery_redo stays nullptr and all applies are synchronous. Pre-loop defenses: a start_redolsa offset too close to the page end trips assert (false) and resumes at the next page; PRM_ID_RECOVERY_PROGRESS_LOGGING_INTERVAL (5-second floor) periodically emits ER_LOG_RECOVERY_PROGRESS with pages done/total and ETA.

The outer loop fetches the page holding lsa; on fetch failure, lsa > m_end_redo_lsa is the normal past-the-end goto exit, failure inside the promised range is logpb_fatal_error. The inner loop walks records while lsa.pageid == m_reader.get_pageid (); each iteration re-positions the reader at the (possibly repaired) record lsa via set_lsa_and_fetch_page before reading the header:

flowchart TD
    A["record at lsa"] --> B{"past end_redo_lsa?"}
    B -- yes --> Z["null lsa, break"]
    B -- no --> C["offset repair if NULL"]
    C --> H["re-fetch at lsa, read header, lsa = forw_lsa"]
    H --> K{"lsa strictly advances?"}
    K -- no --> L["fatal: loop in log"]
    K -- yes --> M["switch on log_rtype"]
    M --> P["pageid fixup"] --> A

Figure 6-1: skeleton of one inner-loop iteration of log_recovery_redo; the callouts below account for each branch.

Archive page-boundary repair — an incompletely archived record leaves the page-header offset or forw_lsa NULL. A NULL lsa.offset takes the page-header offset; if that too is NULL, archive page -> pageid + 1, active page -> genuine end of log (pageid = NULL_PAGEID); continue. A NULL forw_lsa on an archived page likewise advances to pageid + 1. Loop-in-log defense — a next lsa that does not strictly advance is logpb_fatal_error instead of spinning. Post-switch fixup — after a multi-page body, lsa.pageid jumps to the reader’s page so consumed pages are not re-fetched.

Invariant — the scan strictly advances. Every path moves lsa forward or nulls it and terminates; otherwise recovery replays the same range forever.
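
The loop-in-log defense reduces to one comparison per record. A minimal sketch, assuming a toy log where each record is an index into a vector holding the position of its successor and `-1` marks the end; `scan_forward` is a hypothetical name and the exception stands in for logpb_fatal_error.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// forw[i] is the position of the record after record i; -1 marks end of log.
// Returns the number of records visited, or throws if the chain fails to advance.
inline std::size_t scan_forward (const std::vector<long> &forw, long start)
{
  std::size_t visited = 0;
  long lsa = start;
  while (lsa != -1)
    {
      const long next = forw.at (static_cast<std::size_t> (lsa));
      ++visited;
      if (next != -1 && next <= lsa)
	{
	  throw std::runtime_error ("loop in log");   // stand-in for the fatal error
	}
      lsa = next;
    }
  return visited;
}
```

Because every accepted step strictly increases the position or ends the scan, the loop terminates after at most one visit per record.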

6.3 The dispatch switch — every record-type arm

Past the header, two local macros carry each typed arm: BUILD_RECORD_INFO (TEMPLATE_TYPE) wraps rcv_lsa, log_rtype and the reinterpret_copy_and_add_align<TEMPLATE_TYPE> () body copy into a log_rv_redo_rec_info; INVOKE_REDO_RECORD forwards it to log_rv_redo_record_sync_or_dispatch_async, where log_rv_need_sync_redo forces the sync leg for null-VPID records and the volume/sector RVDK_* rcvindexes (enumerated in Chapter 7). Every arm, branch-complete:

| Arm | Action |
|---|---|
| LOG_UNDOREDO_DATA, LOG_DIFF_UNDOREDO_DATA, LOG_RUN_POSTPONE, LOG_COMPENSATE | plain build+invoke (§6.4) |
| LOG_UNDO_DATA, LOG_POSTPONE, LOG_SAVEPOINT, postpone markers (LOG_DUMMY_HEAD_POSTPONE, LOG_COMMIT_WITH_POSTPONE/_OBSOLETE, LOG_SYSOP_START_POSTPONE), checkpoint, 2PC decision/inform, HA/replication/dummy types, LOG_SUPPLEMENTAL_INFO, LOG_SYSOP_ATOMIC_START, LOG_END_OF_LOG | explicit no-op break |
| LOG_MVCC_UNDOREDO_DATA, LOG_MVCC_DIFF_UNDOREDO_DATA | bump mvcc_next_id past mvccid, set mvcc_op_log_lsa = rcv_lsa (vacuum); invoke |
| LOG_MVCC_REDO_DATA | bump mvcc_next_id only (vacuum reads undo data); invoke |
| LOG_REDO_DATA | RVVAC_COMPLETE -> logpb_vacuum_reset_log_header_cache; invoke |
| LOG_DBEXTERN_REDO_DATA | page-less (pgptr = NULL, offset = -1); gated by the skip check below; applies via log_rv_redo_record |
| LOG_2PC_PREPARE | missing tran/tdes -> break; else log_2pc_read_prepare re-reads the gtrid, with LOG_2PC_OBTAIN_LOCKS only in state TRAN_UNACTIVE_2PC_PREPARE |
| LOG_2PC_START | rebuild coordinator info if tran alive and LOG_ISTRAN_2PC; alloc failure -> fatal + break |
| LOG_COMMIT, LOG_ABORT | assert-only: completed non-system tran must be gone |
| LOG_MVCC_UNDO_DATA | bookkeeping only (mvcc_next_id, mvcc_op_log_lsa); not applied |
| LOG_SYSOP_END | LOG_SYSOP_END_LOGICAL_MVCC_UNDO -> mvcc_op_log_lsa = rcv_lsa |
| default (+LOG_SMALLER/LARGER_LOGREC_TYPE) | er_set (ER_LOG_PAGE_CORRUPTED); null lsa if forw_lsa pointed back at this record |

log_recovery_needs_skip_logical_redo, the repeated-crash defense, has three early false returns — wrong rectype, NULL_TRAN_INDEX, NULL tdes — and one true path:

// log_recovery_needs_skip_logical_redo -- src/transaction/log_recovery.c
if (LSA_LT (&tdes->rcv.analysis_last_aborted_sysop_start_lsa, lsa)
&& LSA_LT (lsa, &tdes->rcv.analysis_last_aborted_sysop_lsa))
{
/* ... condensed: er_log_debug ... */
return true; /* <- strictly inside a sysop a previous recovery already aborted */
}

An LSA outside the window falls through to the trailing return false. Analysis stamped the endpoints (Chapter 5); the record and its compensation already sit in the log from a previous recovery cycle.

Tail sequence. (SERVER_MODE) parallel_recovery_redo->wait_for_termination_and_stop_execution () drains every async job; LOG_CS_ENTER; log_Gl.mvcc_table.reset_start_mvccid () recomputes the MVCC baseline; the Chapter 8 hand-off (log_recovery_abort_all_atomic_sysops, log_recovery_finish_all_postpone); then logpb_flush_pages_direct, logpb_flush_header, pgbuf_flush_all. The exit: label — also the past-the-end target — nulls curr_rcv_rec_lsa, runs the consistency-check cleanup (), reports perf stats.

6.4 log_rv_redo_record_sync — fix, extract, apply

// log_rv_redo_record_sync -- src/transaction/log_recovery_redo.hpp
// ... condensed: debug-only vpid_lsa_consistency_check.check (rcv_vpid, m_start_lsa) ...
const LOG_DATA &log_data = log_rv_get_log_rec_data<T> (record_info.m_logrec);
LOG_RCV rcv;
if (!log_rv_fix_page_and_check_redo_is_needed (thread_p, rcv_vpid, rcv, log_data.rcvindex,
record_info.m_start_lsa, redo_context.m_end_redo_lsa))
{
// ... condensed: assert (rcv.pgptr == nullptr) ...
return; /* <- page gone, or change already on disk */
}
scope_exit unfix_rcv_pgptr { [&thread_p, &rcv] ()
{ pgbuf_unfix_and_init_after_check (thread_p, rcv.pgptr); } }; /* <- unfix on every exit */
// ... condensed: rcv field extractors; payload assembly ...
rvfun::fun_t redofunc = log_rv_get_fun<T> (record_info.m_logrec, log_data.rcvindex);

The condensed tail: payload-assembly error -> logpb_fatal_error + return (the scope_exit still unfixes); non-null redofunc runs under perfmon_counter_timer_raii_tracker (PSTAT_LOG_REDO_FUNC_EXEC), failure -> logpb_fatal_error; null redofunc -> er_log_debug warning only; a non-null rcv.pgptr is then stamped with m_start_lsa via pgbuf_set_lsa.

The gatekeeper log_rv_fix_page_and_check_redo_is_needed has three outcomes: (1) non-null VPID but log_rv_redo_fix_page returns null: assert (log_is_in_crash_recovery ()), return false, the deallocated-page skip; (2) page fixed but rcv_lsa <= *pgbuf_get_lsa (rcv.pgptr) → pgbuf_unfix_and_init, return false, change already on disk (an assert also rejects page LSAs beyond end_redo_lsa); (3) otherwise return true, including the null-VPID fall-through that leaves rcv.pgptr == nullptr for page-less records. log_rv_redo_fix_page fixes in RECOVERY_PAGE mode with no sector-reservation check: sector tables replay in parallel, so a page may transiently look deallocated, and the check costs more than skipping saves; NULL is assert_release material (“this is terrible, because it makes recovery impossible”).

Invariant — redo idempotence via page LSA. Skip when rcv_lsa <= page LSA, stamp m_start_lsa after applying; break the stamp and every later crash double-applies non-idempotent redo.
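
The skip-then-stamp pair is what makes a non-idempotent redo safe to replay. A minimal sketch, assuming a toy page with a single integer LSA and an increment as the redo action; `toy_page` and `redo_increment` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct toy_page
{
  std::int64_t lsa;   // LSA of the last change applied to the page
  int value;
};

// Applies an increment (deliberately non-idempotent) at most once per log record.
inline bool redo_increment (toy_page &page, std::int64_t rcv_lsa, int delta)
{
  if (rcv_lsa <= page.lsa)
    {
      return false;   // change already on the page: skip
    }
  page.value += delta;
  page.lsa = rcv_lsa;   // stamp, so a later replay of this record is skipped
  return true;
}
```

Replaying the same record twice changes the value once; removing the stamp line would let the second replay add the delta again.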

Six extractor template families — primaries uninstantiable via static_assert (sizeof (T) == 0) — flatten six record shapes into one generic routine. The outlier: log_rv_get_fun<LOG_REC_COMPENSATE> returns RV_fun[rcvindex].undofun (“yes, undo” in source) — a CLR’s redo payload is the undo image, so replay runs the undo function: ARIES repeating history (companion, “Compensation log records”).

| T | _data / _vpid / _offset from | _mvccid | _redo_length | log_rv_get_fun |
|---|---|---|---|---|
| LOG_REC_MVCC_UNDOREDO | undoredo.data | mvccid | undoredo.rlength | redofun |
| LOG_REC_UNDOREDO | data | MVCCID_NULL | rlength | redofun |
| LOG_REC_MVCC_REDO | redo.data | mvccid | redo.length | redofun |
| LOG_REC_REDO | data | MVCCID_NULL | length | redofun |
| LOG_REC_RUN_POSTPONE | data | MVCCID_NULL | length | redofun |
| LOG_REC_COMPENSATE | data | MVCCID_NULL | length | undofun |
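
The "uninstantiable primary plus explicit specializations" pattern can be shown on a reduced scale. This is a sketch with hypothetical types (`toy_rvfun`, `toy_rec_redo`, `toy_rec_compensate`) and a hypothetical `toy_get_fun`; it illustrates the technique and the compensate outlier, not the actual extractor signatures.

```cpp
#include <cassert>

using toy_fun_t = int (*) (int);

struct toy_rvfun
{
  toy_fun_t undofun;
  toy_fun_t redofun;
};

struct toy_rec_redo {};
struct toy_rec_compensate {};

inline int toy_undo (int x) { return x - 1; }
inline int toy_redo (int x) { return x + 1; }

// Primary template: the condition depends on T, so it only fires when someone
// instantiates the template for a record type that has no specialization.
template <typename T>
toy_fun_t toy_get_fun (const toy_rvfun &)
{
  static_assert (sizeof (T) == 0, "no extractor for this record type");
  return nullptr;
}

template <>
inline toy_fun_t toy_get_fun<toy_rec_redo> (const toy_rvfun &entry)
{
  return entry.redofun;
}

// The outlier: a compensation record replays through the undo function.
template <>
inline toy_fun_t toy_get_fun<toy_rec_compensate> (const toy_rvfun &entry)
{
  return entry.undofun;
}
```

A call such as `toy_get_fun<int> (entry)` fails at compile time with the static_assert message, so a new record shape cannot silently fall through to a default.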

6.5 Payload assembly — unzip, diff, hand off

log_rv_get_log_rec_redo_data<T> decodes the payload. The four single-image specializations (LOG_REC_MVCC_REDO, LOG_REC_REDO, LOG_REC_RUN_POSTPONE, LOG_REC_COMPENSATE) call log_rv_get_unzip_and_diff_redo_log_data with no undo data; LOG_REC_MVCC_UNDOREDO re-wraps its embedded undoredo member as a log_rv_redo_rec_info<LOG_REC_UNDOREDO> and delegates. Only LOG_REC_UNDOREDO branches — on m_type, not T: for the two DIFF rectypes (need_diff_with_undo) it first unzips the undo image into m_undo_zip via log_rv_get_unzip_log_data (fatal + return on error), aligns, and passes m_undo_zip.data_length / .log_data on; otherwise it skips the unneeded undo image (m_reader.skip (GET_ZIP_LEN (ulength)), fatal + ER_FAILED on error), aligns, and passes (0, nullptr).

log_rv_get_unzip_log_data decodes one image, branch-complete. The length field’s sign bit is the compression flag — MAKE_ZIP_LEN sets 0x80000000 at logging time, ZIP_CHECK tests it, GET_ZIP_LEN strips it; even the skip path above goes through GET_ZIP_LEN. is_zip = ZIP_CHECK (length); an image that does_fit_in_current_page is aliased straight off the page buffer, a spanning one is heap-copied via copy_from_log. Compressed -> log_unzip (failure: fatal + ER_FAILED); uncompressed -> log_zip_realloc_if_needed (failure fatal) + memcpy. Finally add_align in the fits case, bare align () in the copy case since copy_from_log already advanced the reader.
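
The flag-in-the-length encoding works as sketched below. The three functions are hypothetical stand-ins for the roles the text gives to MAKE_ZIP_LEN, ZIP_CHECK and GET_ZIP_LEN; the sketch uses an unsigned 32-bit length as a simplifying assumption so the top bit can be manipulated without signed-arithmetic concerns.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t toy_zip_flag = 0x80000000u;   // top bit marks "compressed"

constexpr std::uint32_t toy_make_zip_len (std::uint32_t length)
{
  return length | toy_zip_flag;
}

constexpr bool toy_zip_check (std::uint32_t length)
{
  return (length & toy_zip_flag) != 0;
}

constexpr std::uint32_t toy_get_zip_len (std::uint32_t length)
{
  return length & ~toy_zip_flag;
}
```

Stripping the flag is harmless on an uncompressed length, which is why a reader that only needs to skip an image can always go through the strip step without first testing the flag.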

log_rv_get_unzip_and_diff_redo_log_data layers the diff on top: after log_rv_get_unzip_log_data into the caller’s redo_unzip (failure: fatal + ER_FAILED), it un-diffs only if (is_zip) and only when undo_length > 0 && undo_data != nullptr, calling log_diff (undo_length, undo_data, redo_unzip.data_length, redo_unzip.log_data), then hands off rcv->length / rcv->data, borrowing m_redo_zip storage. The is_zip gate works because diffed redo exists only compressed: at append time the XOR runs on a scratch copy and the DIFF rectype is set only when is_redo_zip; a failed compression writes the original un-diffed crumbs with the bit clear. log_unzip reads the original-length prefix log_zip stored, rejects buf_size <= 0, fails if log_zip_realloc_if_needed cannot grow the destination, LZ4-decompresses, and succeeds only when unzip_len == buf_size; a short or negative result means corruption, not truncation. log_diff is *(p++) ^= *(q++) over MIN (undo_length, redo_length) bytes. XOR is its own inverse, so one routine serves both directions.
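
That a single XOR routine serves both the logging and the recovery direction can be checked on a few bytes. A minimal sketch using `std::vector` buffers; `toy_diff` is a hypothetical name and is not the signature of log_diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// XORs the first min(undo, redo) bytes of redo with undo, in place.
// Bytes of redo beyond the undo length are left untouched.
inline void toy_diff (const std::vector<unsigned char> &undo, std::vector<unsigned char> &redo)
{
  const std::size_t n = std::min (undo.size (), redo.size ());
  for (std::size_t i = 0; i < n; ++i)
    {
      redo[i] ^= undo[i];
    }
}
```

Applying the routine once turns bytes that are equal in both images into zeros, which is what makes the diffed image compress well; applying it a second time with the same undo image restores the original redo bytes.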

The page-less twin log_rv_redo_record (the LOG_DBEXTERN_REDO_DATA arm) runs the same assemble-then-call sequence without the fix/skip gate: payload failure -> fatal + return; redofun failure -> fatal too; redofun == NULL -> debug warning; rcv->pgptr != NULL -> pgbuf_set_lsa (vacuous here — pgptr is NULL).

  1. log_rv_redo_context is the whole redo state — one log cursor plus two pre-allocated LOG_ZIP buffers whose storage rcv.data borrows; share-nothing copies enable Chapter 7’s workers.
  2. The switch also bookkeeps mvcc_next_id, mvcc_op_log_lsa (undo-bearing records only), the RVVAC_COMPLETE reset, and the logical-redo skip window.
  3. Idempotence: skip when rcv_lsa <= page LSA, stamp m_start_lsa after apply; log_rv_redo_fix_page deliberately accepts deallocated pages.
  4. Six extractor families flatten six record shapes into one apply routine; log_rv_get_fun<LOG_REC_COMPENSATE> returns the undofun — a CLR replays as an undo.
  5. One sign bit encodes compression (MAKE_ZIP_LEN/ZIP_CHECK/GET_ZIP_LEN); diffed redo exists only compressed, so log_diff (XOR) runs only when is_zip.
  6. Nothing after the loop runs until wait_for_termination_and_stop_execution drains parallel redo; only then reset_start_mvccid, Chapter 8 finishing, and the flushes.

Chapter 7: Parallel Redo Infrastructure

Chapter 6’s driver hands each redoable record to log_rv_redo_record_sync_or_dispatch_async; only concrete-page, non-volume records go async, and per-page LSA order is inherited from push order because every VPID hashes to a fixed task.

7.1 Dispatch — log_rv_redo_record_sync_or_dispatch_async

Instantiated per record type by INVOKE_REDO_RECORD:

// log_rv_redo_record_sync_or_dispatch_async -- src/transaction/log_recovery_redo_parallel.hpp
const VPID rcv_vpid = log_rv_get_log_rec_vpid<T> (record_info.m_logrec);
#if defined (SERVER_MODE)
// ... condensed: log_data ref ...
const bool need_sync_redo = log_rv_need_sync_redo (rcv_vpid, log_data.rcvindex);
// ... condensed: PREP perf tick ...
if (parallel_recovery_redo == nullptr || need_sync_redo)
{
log_rv_redo_record_sync<T> (thread_p, redo_context, record_info, rcv_vpid);
// ... condensed: DO_SYNC perf tick ...
}
else
{
cublog::redo_job_impl *const job = a_reusable_jobs.blocking_pop (a_rcv_redo_perf_stat);
assert (job != nullptr);
job->set_record_info (rcv_vpid, record_info.m_start_lsa, record_info.m_type);
parallel_recovery_redo->add (job);
// ... condensed: DO_ASYNC perf tick ...
}
#else // !SERVER_MODE = SA_MODE
log_rv_redo_record_sync<T> (thread_p, redo_context, record_info, rcv_vpid);
#endif

SA_MODE compiles the cublog classes to empty dummies; Figure 7-1 covers every exit. The predicate:

// log_rv_need_sync_redo -- src/transaction/log_recovery.c
if (VPID_ISNULL (&a_rcv_vpid))
{
return true; /* <- no target page to hash */
}
switch (a_rcvindex)
{
case RVDK_NEWVOL: // ... condensed: RVDK_FORMAT, RVDK_INITMAP, RVDK_EXPAND_VOLUME, RVDK_VOLHEAD_EXPAND ...
return true; /* <- see Inv 7-A */
case RVDK_RESERVE_SECTORS: // ... condensed: RVDK_UNRESERVE_SECTORS ...
return true; /* <- "may be changed to async" */
default:
return false;
}

Invariant 7-A (sync record as happens-before barrier). The main thread applies a sync record before pushing any later job; new-volume pages appear only in later records, so no worker can fix a page of a volume whose creation is still unexecuted.

flowchart TD
  A["record"] --> B{"SERVER_MODE?"}
  B -- "no (SA)" --> S1["sync apply"]
  B -- yes --> C{"infra null?"}
  C -- yes --> S1
  C -- no --> D{"log_rv_need_sync_redo"}
  D -- "null VPID or RVDK volume, sector" --> S1
  D -- false --> E["blocking_pop + set_record_info"]
  E --> G["add: hash vpid to fixed task"]

Figure 7-1: dispatch exits.

7.2 Sizing and construction — redo_parallel

log_recovery_redo registers pool demand via REGISTER_WORKERPOOL and builds once before the forward scan: reusable_jobs.initialize (count) plus new cublog::redo_parallel (count, false, MAX_LSA, redo_context); false/MAX_LSA disables monitoring (7.8). The count:

// log_recovery_get_redo_parallel_count -- src/transaction/log_recovery.c
const int num_cpus = cubthread::system_core_count ();
const int minimum_threads_to_redo = 16; /* <- "determined experimentally" */
return MAX (minimum_threads_to_redo, num_cpus);

The floor of 16 deliberately oversubscribes small machines, which is acceptable because redo workers spend most of their time blocked on I/O. The constructor runs do_init_worker_pool (workers = slots = a_task_count), then do_init_tasks, then sets up the monitor.

| Field | Role | Why it exists |
|---|---|---|
| `m_task_count` | VPID-binning modulus | Fixed at construction (Inv 7-B) |
| `m_pool_entry_manager` | TT_RECOVERY entry factory | Workers need real entries |
| `m_task_state_bookkeeping` | Bitset of active tasks | Unbounded wait (7.7) |
| `m_worker_pool` | Worker pool pointer | Owns OS threads |
| `m_redo_tasks` | `vector<unique_ptr<redo_task>>` | Owner-managed; perf stats survive |
| `m_vpid_hash` | `std::hash<VPID>` | Binning function of add |
| `m_min_unapplied_log_lsa_calculation` | Progress monitor (7.8) | Replication only |

// redo_parallel::add -- src/transaction/log_recovery_redo_parallel.cpp
const std::size_t task_index = m_vpid_hash (a_job->get_vpid ()) % m_task_count;
redo_task *const task = m_redo_tasks[task_index].get ();
task->push_job (a_job);

Invariant 7-B (per-page order from push order). The main thread pushes in increasing LSA order, a VPID always hashes to the same task (m_task_count is immutable), and each task drains FIFO, so per-page apply order equals log order without any page-level locking. Break any leg and two workers can race on one page; the rcv_lsa <= page_lsa skip would then mask the race by silently dropping whichever record arrives late. Cross-page order is not preserved.
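The three legs of Invariant 7-B can be illustrated with a toy binner. This is a minimal sketch, not the CUBRID implementation: `toy_vpid`, `toy_job` and `toy_binner` are hypothetical names, the key-packing hash is an assumption standing in for `std::hash<VPID>`, and plain deques replace the real task queues.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical stand-ins for VPID and a queued redo job.
struct toy_vpid { std::int16_t volid; std::int32_t pageid; };
struct toy_job { toy_vpid vpid; std::int64_t lsa; };

class toy_binner
{
public:
  explicit toy_binner (std::size_t task_count) : m_queues (task_count)
  {
    assert (task_count > 0);	// the modulus is fixed here and never changes
  }

  // Same page -> same index, because both the hash and the modulus are stable.
  std::size_t task_index (const toy_vpid &v) const
  {
    const std::uint64_t key =
      (static_cast<std::uint64_t> (static_cast<std::uint16_t> (v.volid)) << 32)
      | static_cast<std::uint32_t> (v.pageid);
    return std::hash<std::uint64_t> {} (key) % m_queues.size ();
  }

  // Caller is assumed to push in increasing LSA order (the dispatcher's job).
  void add (const toy_job &job) { m_queues[task_index (job.vpid)].push_back (job); }

  // Each queue is drained front to back, so per-page order equals push order.
  const std::deque<toy_job> &queue (std::size_t idx) const { return m_queues[idx]; }

private:
  std::vector<std::deque<toy_job>> m_queues;
};
```

Two pages may or may not share a queue, which is why only per-page order is guaranteed and cross-page order is not.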

redo_job_base — the queueable unit:

| Field | Role | Why it exists |
|---|---|---|
| `m_vpid` | Target page; null when defaulted | Binning key; get_vpid asserts non-null |
| `m_log_lsa` | LSA of the record | Where to re-read from (7.5); progress marker (7.8) |

redo_task::push_job sets the unapplied marker only when monitoring is armed and the queue was empty (crash recovery passes false, so never), and notifies the worker only once the queue grows past PRM_ID_RECOVERY_REDO_MINIMUM_JOB_COUNT (hidden, default 100).

redo_task (.cpp-private cubthread::task):

| Field | Role | Why it exists |
|---|---|---|
| `m_task_idx` | Identity 0..N-1 | Index into bitset and push vectors |
| `m_do_monitor_unapplied_log_lsa` | Maintain marker or not | Recovery passes false |
| `m_task_state_bookkeeping` | Ref to owner’s bitset | Set in ctor, cleared after drain |
| `m_perf_stats_definition` / `m_perf_stats` | Per-task counters | Timings in 7.9 |
| `m_redo_context` | Private context copy | Own reader + zip buffers (7.5) |
| `m_produce_vec` (+`_mtx`, `_cv`) | Job queue; reserves ONE_M | Swap: one lock per batch |
| `m_adding_finished` | End-of-stream flag, set under mutex | Checked only when queue empty |
| `m_unapplied_log_lsa` | `atomic<log_lsa>`, MAX_LSA idle | Feeds global minimum (7.8) |

// redo_task::execute -- src/transaction/log_recovery_redo_parallel.cpp
for ( ; ; )
{
bool adding_finished { false };
pop_jobs (jobs_vec, adding_finished);
if (jobs_vec.empty () && adding_finished)
{
break; /* <- only exit */
}
else
{
assert (!jobs_vec.empty ());
THREAD_ENTRY *const thread_entry = &context;
for (auto &job : jobs_vec)
{
// ... condensed: marker update ...
job->execute (thread_entry, m_redo_context);
job->retire (m_task_idx);
}
jobs_vec.clear (); /* <- jobs already recycled */
}
}
m_task_state_bookkeeping.set_inactive (m_task_idx);

pop_jobs asserts its post-condition as an exact xor of non-empty and finished: it returns either a non-empty batch with the flag clear, or an empty vector with the flag set. Its 1 s wait_for period (PRM_ID_RECOVERY_REDO_JOB_PERIOD_IN_SECS, hidden) drains the un-notified trickle; notify_adding_finished flips the flag under the same mutex, so no wakeup is lost.

flowchart TD
  W["wait_for, 1 s period"] --> P{"queue empty?"}
  P -- no --> SW["swap into local jobs_vec"]
  P -- yes --> MK["park marker, monitored only"]
  MK --> F{"m_adding_finished?"}
  F -- no --> W
  F -- yes --> Z["return empty + finished"]
  SW --> EX["per job: execute, retire"]
  EX --> W
  Z --> IN["set_inactive, cv notify"]

Figure 7-2: pop_jobs and drain loop exits.
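The swap-under-one-lock pattern and the two return states of pop_jobs can be sketched as follows. This is a simplified illustration under stated assumptions: `toy_task_queue` is a hypothetical class, jobs are plain integers, the notify threshold and wait period are invented constants, and the queue is checked before waiting rather than after (the figure shows the wait first).

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for one task's job queue; jobs are plain ints.
class toy_task_queue
{
public:
  // Producer side: only notifies once enough jobs have piled up.
  void push_job (int job)
  {
    std::lock_guard<std::mutex> lock (m_mtx);
    assert (!m_adding_finished);	// pushing after end-of-stream is a bug
    m_produce_vec.push_back (job);
    if (m_produce_vec.size () >= NOTIFY_THRESHOLD)
      {
        m_cv.notify_one ();
      }
  }

  // Flag is flipped under the same mutex the consumer waits on.
  void notify_adding_finished ()
  {
    {
      std::lock_guard<std::mutex> lock (m_mtx);
      m_adding_finished = true;
    }
    m_cv.notify_one ();
  }

  // Returns either a non-empty batch (finished == false)
  // or an empty batch (finished == true).
  void pop_jobs (std::vector<int> &out, bool &finished)
  {
    assert (out.empty ());
    finished = false;
    std::unique_lock<std::mutex> lock (m_mtx);
    for (;;)
      {
        if (!m_produce_vec.empty ())
          {
            out.swap (m_produce_vec);	// O(1): one lock per batch
            break;
          }
        if (m_adding_finished)
          {
            finished = true;
            break;
          }
        m_cv.wait_for (lock, WAIT_PERIOD);	// periodic wake-up drains small trickles
      }
    assert (out.empty () == finished);
  }

private:
  static constexpr std::size_t NOTIFY_THRESHOLD = 100;	// hypothetical value
  const std::chrono::milliseconds WAIT_PERIOD { 10 };	// hypothetical value

  std::mutex m_mtx;
  std::condition_variable m_cv;
  std::vector<int> m_produce_vec;
  bool m_adding_finished = false;
};
```

Because the end-of-stream flag is only consulted when the queue is empty, jobs pushed just before notify_adding_finished are still drained.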

7.5 redo_job_impl::execute — the re-fetch


| redo_job_impl field | Role | Why it exists |
|---|---|---|
| `m_reusable_job_stack` | Pool back-pointer, “guaranteed to outlive this instance” | retire = push (a_task_idx, this) |
| `m_log_rtype` | LOG_RECTYPE stamped by set_record_info | Selects the `log_rec_*` layout to re-read |

// redo_job_impl::execute -- src/transaction/log_recovery_redo_parallel.cpp
const int err_fetch =
redo_context.m_reader.set_lsa_and_fetch_page (get_log_lsa (), redo_context.m_reader_fetch_page_mode);
if (err_fetch != NO_ERROR)
{
return err_fetch; /* <- sole error exit */
}
redo_context.m_reader.add_align (sizeof (LOG_RECORD_HEADER));
switch (m_log_rtype)
{
case LOG_REDO_DATA:
read_record_and_redo<log_rec_redo> (thread_p, redo_context);
break;
// ... condensed: 7 more labels (MVCC/diff undoredo, RUN_POSTPONE, COMPENSATE) ...
default:
assert (false); /* <- unreachable */
}

The eight labels are Chapter 6’s page-bound redoable types; read_record_and_redo<T> re-parses the typed header and funnels into log_rv_redo_record_sync<T>, the sync path’s sink. That error exit is swallowed: redo_task::execute ignores the return, so a failed fetch silently skips the record’s redo. log_rv_redo_context is copy-constructible, not assignable: each task owns a private reader and zip buffers.

| reusable_jobs_stack field | Role | Why it exists |
|---|---|---|
| `m_flush_push_at_count` | PARALLEL_REDO_REUSABLE_JOBS_FLUSH_BACK_COUNT (ONE_K) | One mutex touch per ~1024 retires |
| `m_job_pool` | `vector<redo_job_impl>` of PARALLEL_REDO_REUSABLE_JOBS_COUNT (ONE_M) | The only allocation |
| `m_pop_jobs` | Consumer stack, popped unsynchronized | Single consumer (Inv 7-C) |
| `m_push_jobs` (+`m_push_mtx`, `m_push_jobs_available_cv`) | Shared return bin | Sole synchronized hand-off |
| `m_per_task_push_jobs_vec` | One private vector per task | Lock-free retire fast path |

// reusable_jobs_stack::blocking_pop -- src/transaction/log_recovery_redo_parallel.cpp
if (!m_pop_jobs.empty ())
{
redo_job_impl *const pop_job = m_pop_jobs.back ();
m_pop_jobs.pop_back (); /* <- no lock */
return pop_job;
}
else
{
{
std::unique_lock<std::mutex> locku { m_push_mtx };
// ... condensed: cv wait until !m_push_jobs.empty () ...
m_pop_jobs.swap (m_push_jobs); /* <- O(1) refill */
}
// ... condensed: pop_back ...
}

push (a_task_idx, a_job) mirrors it: append to the caller’s private vector; only once that vector grows past m_flush_push_at_count does it lock, bulk-insert into m_push_jobs, clear, and notify_one. The slow path of blocking_pop is the backpressure valve: when all ONE_M jobs are in flight the main thread blocks until a batch returns.

Invariant 7-C (single consumer, conservation of jobs). m_pop_jobs is popped unsynchronized because only the recovery main thread calls blocking_pop. The destructor asserts pop + push + sum(per_task) == m_job_pool.size (): a job executed but never retired trips it.
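A scaled-down sketch of the recycling scheme follows. It assumes a single dispatcher calling `blocking_pop` and one caller per `task_idx` calling `push`; the class name `toy_reusable_jobs`, the `accounted` helper and the tiny sizes are hypothetical, chosen only to make the conservation property visible.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

struct toy_job { int payload = 0; };

class toy_reusable_jobs
{
public:
  toy_reusable_jobs (std::size_t job_count, std::size_t task_count, std::size_t flush_at)
    : m_flush_at (flush_at), m_pool (job_count), m_per_task (task_count)
  {
    for (toy_job &job : m_pool)
      {
        m_pop.push_back (&job);	// the pool is never resized, so pointers stay valid
      }
  }

  // Single consumer only: the fast path touches no lock.
  toy_job *blocking_pop ()
  {
    if (m_pop.empty ())
      {
        std::unique_lock<std::mutex> lock (m_push_mtx);
        m_push_cv.wait (lock, [this] { return !m_push.empty (); });
        m_pop.swap (m_push);	// O(1) refill; this wait is the backpressure point
      }
    toy_job *const job = m_pop.back ();
    m_pop.pop_back ();
    return job;
  }

  // Called by the worker owning task_idx; locks only when flushing a batch.
  void push (std::size_t task_idx, toy_job *job)
  {
    assert (task_idx < m_per_task.size ());
    std::vector<toy_job *> &mine = m_per_task[task_idx];
    mine.push_back (job);
    if (mine.size () > m_flush_at)
      {
        {
          std::lock_guard<std::mutex> lock (m_push_mtx);
          m_push.insert (m_push.end (), mine.begin (), mine.end ());
        }
        mine.clear ();
        m_push_cv.notify_one ();
      }
  }

  // Conservation check; only meaningful while no other thread is active.
  std::size_t accounted () const
  {
    std::size_t n = m_pop.size () + m_push.size ();
    for (const std::vector<toy_job *> &v : m_per_task)
      {
        n += v.size ();
      }
    return n;
  }

  std::size_t pool_size () const { return m_pool.size (); }

private:
  const std::size_t m_flush_at;
  std::vector<toy_job> m_pool;
  std::vector<toy_job *> m_pop;
  std::vector<toy_job *> m_push;
  std::vector<std::vector<toy_job *>> m_per_task;
  std::mutex m_push_mtx;
  std::condition_variable m_push_cv;
};
```

With sizes this small, jobs parked below the flush threshold could starve the dispatcher; the sketch ignores that, on the assumption that the real pool is far larger than the combined per-task backlog.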

7.7 task_active_state_bookkeeping and termination


| Field | Role | Why it exists |
|---|---|---|
| `m_size` | Task count, asserted < BITSET_MAX_SIZE (256) | Bounds-checks indices |
| `m_values` | `std::bitset<256>`, bit per task | set_active/set_inactive assert prior state |
| `m_values_mtx` / `m_values_cv` | Guard + wakeup | wait_for_termination sleeps until m_values.none () |

The pool’s own wait asserts after “a hardcoded maximum wait time (60 seconds)”; this private bookkeeping waits unbounded. Tasks set their bit in the constructor, so an early wait cannot miss not-yet-started tasks. Shutdown:

// redo_parallel::wait_for_termination_and_stop_execution -- src/transaction/log_recovery_redo_parallel.cpp
for (auto &redo_task: m_redo_tasks)
{
redo_task->notify_adding_finished ();
}
m_task_state_bookkeeping.wait_for_termination ();
// ... condensed: assert every task is_idle ...
m_worker_pool->stop_execution ();
// ... condensed: get_manager ()->destroy_worker_pool ...

redo_task::retire is a no-op (“avoid self destruct”) so per-task perf stats stay readable; WAIT_FOR_PARALLEL times the straggler wait. Both ends assert the ordering: push_job asserts !m_adding_finished, and ~redo_parallel asserts no active task and a null pool — this blocking call is mandatory before destruction.
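The bitset bookkeeping behind the unbounded wait can be sketched as below. `toy_active_tasks` is a hypothetical name, and the exact notification policy (notify only when the last bit clears) is an assumption made for the illustration.

```cpp
#include <bitset>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical stand-in for the active-task bookkeeping.
class toy_active_tasks
{
public:
  static constexpr std::size_t MAX_SIZE = 256;

  explicit toy_active_tasks (std::size_t size) : m_size (size)
  {
    assert (size < MAX_SIZE);
  }

  // Meant to be called when a task is constructed, before it is scheduled,
  // so a waiter can never observe "none active" too early.
  void set_active (std::size_t idx)
  {
    std::lock_guard<std::mutex> lock (m_mtx);
    assert (idx < m_size && !m_values.test (idx));
    m_values.set (idx);
  }

  // Called by a task after its drain loop exits.
  void set_inactive (std::size_t idx)
  {
    bool none_left = false;
    {
      std::lock_guard<std::mutex> lock (m_mtx);
      assert (idx < m_size && m_values.test (idx));
      m_values.reset (idx);
      none_left = m_values.none ();
    }
    if (none_left)
      {
        m_cv.notify_all ();
      }
  }

  bool is_any_active () const
  {
    std::lock_guard<std::mutex> lock (m_mtx);
    return m_values.any ();
  }

  // No timeout: sleeps until every bit is cleared.
  void wait_for_termination ()
  {
    std::unique_lock<std::mutex> lock (m_mtx);
    m_cv.wait (lock, [this] { return m_values.none (); });
  }

private:
  const std::size_t m_size;
  std::bitset<MAX_SIZE> m_values;
  mutable std::mutex m_mtx;
  std::condition_variable m_cv;
};
```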

The minimum-unapplied-LSA monitor (min_unapplied_log_lsa_monitoring) is dormant in crash recovery (false, MAX_LSA) and armed when the same infrastructure replicates on a page server. The constructor asserts the pairing: monitoring needs a valid starting LSA; no monitoring means MAX_LSA.

| Field | Role | Why it exists |
|---|---|---|
| `m_do_monitor` | Master switch | Asserted by every method |
| `m_main_thread_unapplied_log_lsa` | `atomic<log_lsa>` advanced by dispatcher | Sync records bypass task queues |
| `m_redo_tasks` | Const ref to task vector | calculate reads each task’s marker |
| `m_calculated_log_lsa` | Last global minimum, under m_calculate_mtx | What waiters compare against |
| `m_calculate_mtx` / `m_calculate_cv` / `m_terminate_calculation` / `m_calculate_thread` | Calculation-thread plumbing | Guard, cv, stop flag, thread |

calculate minimizes the main-thread LSA against task markers, skipping idle MAX_LSA ones. wait_past_target_log_lsa has two exits: an unlocked fast path when a_target_lsa < m_calculated_log_lsa; otherwise it calls notify_all (cutting short the calculation thread’s 10 ms nap) and blocks until the minimum passes the target. redo_parallel::wait_past_target_lsa and set_main_thread_unapplied_log_lsa are forwarders.
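The minimum computation and the fast-path test reduce to a few lines. In this sketch an LSA is modelled as a single 64-bit integer (the real log_lsa is a pageid/offset pair), and the names `toy_lsa`, `calculate_min_unapplied` and `is_past_target` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

using toy_lsa = std::int64_t;	// assumption: one integer instead of pageid + offset
constexpr toy_lsa TOY_MAX_LSA = std::numeric_limits<toy_lsa>::max ();

// Global minimum over the dispatcher's marker and every busy task's marker.
inline toy_lsa calculate_min_unapplied (toy_lsa main_thread_lsa, const std::vector<toy_lsa> &task_markers)
{
  assert (main_thread_lsa >= 0);
  toy_lsa min_lsa = main_thread_lsa;
  for (const toy_lsa marker : task_markers)
    {
      if (marker == TOY_MAX_LSA)
        {
          continue;	// idle task: nothing pending, so it cannot hold the minimum back
        }
      min_lsa = std::min (min_lsa, marker);
    }
  return min_lsa;
}

// Fast-path test: everything up to and including target has been applied.
inline bool is_past_target (toy_lsa target, toy_lsa calculated_min)
{
  return target < calculated_min;
}
```

The main-thread marker has to take part in the minimum because synchronously applied records never pass through a task queue.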

perf_stats (log_recovery_redo_perf.hpp) is a nullable wrapper over cubperf:

| Field | Role | Why it exists |
|---|---|---|
| `m_definition` | Const ref to cubperf::statset_definition | Slot names/types (all COUNTER_AND_TIMER) |
| `m_stats_set` | `cubperf::statset *`, nullptr when disabled | One-point fork: every method checks it |

Activation is per-side (perf_stats_is_active_for_main / ..._for_async); do_not_record_t builds a disabled instance. time_and_increment (id) adds the time since the previous call:

  • Main: FETCH_PAGE, READ_LOG, REDO_OR_PUSH_{PREP, DO_SYNC, POP_REUSABLE_DIRECT/_WAIT, DO_ASYNC}, COMMIT_ABORT, WAIT_FOR_PARALLEL, FINALIZE.
  • Workers: PARALLEL_POP, PARALLEL_SLEEP (never incremented), PARALLEL_EXECUTE, PARALLEL_RETIRE.

redo_parallel::log_perf_stats logs each worker’s set plus an element-wise average — EXECUTE vs POP shows saturation; DIRECT vs WAIT shows pool throttling.
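The "time since the previous call" accounting can be sketched with a steady clock. This is an illustration only: `toy_perf_stats` is a hypothetical class that does not use cubperf, and a zero slot count stands in for the disabled (nullptr) instance.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical counter-and-timer set; slot_count == 0 models a disabled instance.
class toy_perf_stats
{
public:
  using clock = std::chrono::steady_clock;

  explicit toy_perf_stats (std::size_t slot_count)
    : m_counts (slot_count, 0)
    , m_elapsed (slot_count, clock::duration::zero ())
    , m_last (clock::now ())
  {
  }

  bool is_active () const { return !m_counts.empty (); }

  // Charges the time elapsed since the previous call to slot id.
  void time_and_increment (std::size_t id)
  {
    if (!is_active ())
      {
        return;	// single check: a disabled instance costs almost nothing
      }
    assert (id < m_counts.size ());
    const clock::time_point now = clock::now ();
    m_elapsed[id] += now - m_last;
    m_counts[id] += 1;
    m_last = now;
  }

  std::size_t count (std::size_t id) const
  {
    return is_active () ? m_counts[id] : 0;
  }

  clock::duration elapsed (std::size_t id) const
  {
    return is_active () ? m_elapsed[id] : clock::duration::zero ();
  }

private:
  std::vector<std::size_t> m_counts;
  std::vector<clock::duration> m_elapsed;
  clock::time_point m_last;
};
```

Because each call resets the reference point, consecutive slots partition wall-clock time, which is what makes ratios such as EXECUTE versus POP meaningful.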

  1. Three dispatcher exits: SA_MODE always-sync; forced sync via null infra or log_rv_need_sync_redo (null VPID, volume ops, sector reserve/unreserve); async dispatch of a recycled job.
  2. Invariant 7-A makes forced-sync records happens-before barriers; Invariant 7-B (fixed hash(VPID) % m_task_count, in-order push, FIFO drain) gives per-page LSA order without page locks.
  3. Worker count is MAX (16, cores), an experimental floor for I/O-bound workers; teardown’s private bitset dodges the pool’s 60-second assert.
  4. Jobs carry (vpid, lsa, rectype); redo_job_impl::execute re-fetches the log page via the task’s private log_rv_redo_context and converges on the sync apply path; the worker loop discards its error return.
  5. reusable_jobs_stack recycles ONE_M jobs — lock-free pop, ONE_K flush-back, conservation asserted (Inv 7-C); slow path = backpressure; min_unapplied_log_lsa_monitoring and perf_stats serve replication and diagnostics.

Chapter 8: Atomic Sysop Abort and Postpone Completion

Redo (Ch 6-7) reproduced the crash state exactly, leaving two loose ends that must not reach undo: open atomic system operations, and transactions/sysops committed with postpone whose postpones never finished. The tail of log_recovery_redo closes both; see the high-level companion (cubrid-recovery-manager.md) for the postpone/sysop concept.

Both phases run on the recovery main thread after the parallel redo pool drains (Ch 7). They append new log records through the runtime logging path (log_sysop_start / log_sysop_abort / log_run_postpone_op), which is only safe once every queued redo job has been applied.

// log_recovery_redo (tail) -- src/transaction/log_recovery.c
LOG_CS_ENTER (thread_p);
log_Gl.mvcc_table.reset_start_mvccid ();
/* ... er_set: "REDO" finishing-up notification ... */
log_recovery_abort_all_atomic_sysops (thread_p); /* <- must run FIRST */
log_recovery_finish_all_postpone (thread_p);
/* ... flush data pages, log pages, log header ... */

Invariant 8-A — atomic-before-postpone. Stated in the log_rcv_tdes comment: interrupted file_perm_alloc/file_perm_dealloc “must be executed atomically … before executing finish all postpones”. Postpone actions (typically permanent-file destruction) would otherwise hit half-modified file headers and sector tables — crash or file-tracker corruption.

8.2 LOG_RCV_TDES — the recovery scratchpad

Analysis (Ch 4-5) recorded everything this chapter consumes into tdes->rcv (struct log_rcv_tdes, log_impl.h) — five LOG_LSA fields, NULL_LSA meaning no such loose end:

| Field | Role | Why it exists |
|---|---|---|
| `sysop_start_postpone_lsa` | LOG_SYSOP_START_POSTPONE of a sysop committed-with-postpone whose LOG_SYSOP_END never landed (8.6). | That record embeds the LOG_REC_SYSOP_END to replay (8.3). |
| `tran_start_postpone_lsa` | The transaction’s LOG_COMMIT_WITH_POSTPONE. | Separates branches (c)/(d) in 8.6; abort boundary in 8.7. |
| `atomic_sysop_start_lsa` | Last unmatched LOG_SYSOP_ATOMIC_START; non-NULL means crashed mid-atomic-op. | Rollback target for 8.4: the log suffix to undo as one unit. |
| `analysis_last_aborted_sysop_lsa` | End LSA of the last sysop analysis saw aborted. | Upper bound of the Ch 6 skip window for a rolled-back sysop’s logical redo. |
| `analysis_last_aborted_sysop_start_lsa` | That sysop’s lastparent_lsa. | Lower bound of the skip window; unused here. |

The last input is the LOG_RUN_POSTPONE trail in the log itself, consumed by 8.8.

8.3 LOG_REC_SYSOP_START_POSTPONE — a deferred sysop end

A sysop committing with postpone logs its future end record up front, so recovery can finish the commit even if the real LOG_SYSOP_END never reached disk:

// log_rec_sysop_start_postpone -- src/transaction/log_record.hpp
struct log_rec_sysop_start_postpone
{
LOG_REC_SYSOP_END sysop_end; /* log record used for end of system operation */
LOG_LSA posp_lsa; /* address where the first postpone operation start */
};

| Field | Role | Why it exists |
|---|---|---|
| `sysop_end` | Pre-built end record; re-read via log_read_sysop_start_postpone, appended via log_sysop_end_recovery_postpone. | Persists the commit decision before postpones run; its type decides the post-finish TDES state (8.6). |
| `posp_lsa` | First LOG_POSTPONE of this sysop. | Seed for the forward scan; analysis copies it to `tdes->topops.stack[last].posp_lsa` (Ch 5). |

8.6 reads four fields of the embedded LOG_REC_SYSOP_END (full table in Ch 5): type (discriminator), lastparent_lsa (transaction LSA just before the sysop — the rollback boundary), run_postpone.postpone_lsa (the LOG_POSTPONE this sysop ran — the parent’s resume point), run_postpone.is_sysop_postpone (sysop parent — asserted impossible — vs transaction).

Both drivers share one skeleton: walk regular TDES slots 1..num_total_indices, skipping tdes == NULL || trid == NULL_TRANID, then the system TDESes rebuilt by analysis via log_system_tdes::map_all_tdes (locks systb_Mutex). Each call is bracketed by log_rv_simulate_runtime_worker / log_rv_end_simulation, so runtime logging primitives — which resolve the current transaction from the thread — act on the impersonated TDES (log_system_tdes::rv_simulate_system_tdes for system ones). log_recovery_abort_atomic_sysop handles one TDES:

flowchart TD
    G1{"tdes NULL or<br/>trid NULL?"} -- yes --> R1["return"]
    G1 -- no --> G2{"atomic_sysop_start_lsa<br/>NULL?"}
    G2 -- yes --> R1
    G2 -- no --> G3{"start &gt;= undo_nxlsa?"}
    G3 -- yes --> R3["reset LSA, return"]
    G3 -- no --> G4{"TOPOPE and start postpone<br/>&gt; atomic start?"}
    G4 -- yes --> N1["nested postpone in atomic op:<br/>finish it first"]
    G4 -- no --> G5{"TOPOPE?"}
    G5 -- yes --> N2["atomic op in sysop postpone:<br/>abort now"]
    G5 -- no --> N3["standalone"]
    N1 --> RB["fetch start page,<br/>prev = prev_tranlsa"]
    N2 --> RB
    N3 --> RB
    RB --> ERR{"fetch failed?"}
    ERR -- yes --> F["logpb_fatal_error"]
    ERR -- no --> SIM["log_sysop_start,<br/>lastparent_lsa = prev,<br/>log_sysop_abort"]
    SIM --> DONE["clear atomic_sysop_start_lsa"]

Figure 8-1: every branch of log_recovery_abort_atomic_sysop.

The nested cases fix the order relative to 8.5: sysop_start_postpone_lsa > atomic_sysop_start_lsa means a sysop committed-with-postpone inside the atomic op, so finish its postpone first, then abort. The opposite TOPOPE case is an atomic op started during a sysop’s postpone: abort now, finish the postpone in 8.5. The source comments spell out both numbered crash scenarios verbatim.

The rollback simulates a runtime sysop instead of calling undo — the in-source comment calls the lastparent_lsa overwrite “hack last parent”: the new sysop’s rollback boundary becomes the prev_tranlsa of the LOG_SYSOP_ATOMIC_START, so log_sysop_abort compensates everything after it and logs an abort LOG_SYSOP_END.

Invariant 8-B — no atomic residue. On return, atomic_sysop_start_lsa is NULL_LSA on every TDES — each exit path finds it NULL, resets it, or dies in logpb_fatal_error. Later phases can assume no half-open atomic file operation exists; plain record-by-record undo would recreate the partial state the marker prevents.
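The guard chain of Figure 8-1 can be restated as a pure classification function. This is a sketch under simplifying assumptions: LSAs are single integers with -1 as NULL, the TDES is reduced to the five inputs the guards read, and `toy_tdes`, `atomic_abort_action` and `classify_atomic_abort` are hypothetical names. It only decides which branch applies; it performs no rollback.

```cpp
#include <cassert>
#include <cstdint>

using toy_lsa = std::int64_t;
constexpr toy_lsa TOY_NULL_LSA = -1;	// assumption: NULL LSA sorts below every real LSA

enum class atomic_abort_action
{
  NOTHING,				// unused slot or no atomic sysop open
  RESET_ONLY,				// marker is stale: nothing logged after it
  FINISH_NESTED_POSTPONE_THEN_ABORT,	// sysop postpone started inside the atomic op
  ABORT_INSIDE_SYSOP_POSTPONE,		// atomic op started during a sysop postpone
  ABORT_STANDALONE
};

// Hypothetical reduction of a TDES to the fields the guards inspect.
struct toy_tdes
{
  bool in_use;
  bool topope_committed_with_postpone;
  toy_lsa atomic_sysop_start_lsa;
  toy_lsa sysop_start_postpone_lsa;
  toy_lsa undo_nxlsa;
};

inline atomic_abort_action classify_atomic_abort (const toy_tdes *tdes)
{
  if (tdes == nullptr || !tdes->in_use)
    {
      return atomic_abort_action::NOTHING;
    }
  if (tdes->atomic_sysop_start_lsa == TOY_NULL_LSA)
    {
      return atomic_abort_action::NOTHING;
    }
  if (tdes->atomic_sysop_start_lsa >= tdes->undo_nxlsa)
    {
      return atomic_abort_action::RESET_ONLY;
    }
  assert (tdes->undo_nxlsa != TOY_NULL_LSA);	// follows from the guard above
  if (tdes->topope_committed_with_postpone)
    {
      return (tdes->sysop_start_postpone_lsa > tdes->atomic_sysop_start_lsa)
	? atomic_abort_action::FINISH_NESTED_POSTPONE_THEN_ABORT
	: atomic_abort_action::ABORT_INSIDE_SYSOP_POSTPONE;
    }
  return atomic_abort_action::ABORT_STANDALONE;
}
```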

Per TDES, log_recovery_finish_postpone: (1) return on the guard tdes == NULL || trid == NULL_TRANID; (2) always call log_recovery_finish_sysop_postpone (8.6), which resolves a TOPOPE_COMMITTED_WITH_POSTPONE state — possibly promoting it to COMMITTED_WITH_POSTPONE; (3) branch on state:

// log_recovery_finish_postpone -- src/transaction/log_recovery.c
if (tdes->state == TRAN_UNACTIVE_WILL_COMMIT || tdes->state == TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE)
{
if (tdes->state == TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE)
{ /* make sure to abort interrupted logical postpone. */
log_recovery_abort_interrupted_sysop (thread_p, tdes, &tdes->rcv.tran_start_postpone_lsa);
LSA_SET_NULL (&tdes->undo_nxlsa); } /* <- committed: nothing left to undo */
/* ... find_first_postpone -> log_do_postpone -> log_complete ... */
}
else if (tdes->state == TRAN_UNACTIVE_COMMITTED)
{ /* log_complete + free index only; postpones already done */ }

TRAN_UNACTIVE_WILL_COMMIT means the commit was logged but the postpone start was not; COMMITTED_WITH_POSTPONE first aborts a possibly interrupted logical run postpone (8.7). The elided body: log_recovery_find_first_postpone (8.8), log_do_postpone (8.9) on a non-NULL result, then, for local transactions only (tdes->coord == NULL), log_complete appends the LOG_COMMIT EOT, sets TRAN_UNACTIVE_COMMITTED, and logtb_free_tran_index frees the slot (2PC: Ch 11). System TDESes pass through step (2) only; an unfinishable interrupted sysop leaves them TRAN_UNACTIVE_UNILATERALLY_ABORTED, branch (d) of 8.6, for undo (Ch 9).

log_recovery_finish_sysop_postpone runs only for TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE; analysis pushed exactly one topops entry (assert (tdes->topops.last == 0)). Sequence: abort an interrupted logical run postpone (8.7) relative to rcv.sysop_start_postpone_lsa; find the first unexecuted postpone (8.8) seeded from topops.stack[last].posp_lsa; log_do_postpone (8.9); re-read the start-postpone record via log_read_sysop_start_postpone (failure: assert_release, give up); append the pre-built end via log_sysop_end_recovery_postpone. Four outcomes:

// log_recovery_finish_sysop_postpone (outcomes) -- src/transaction/log_recovery.c
if (sysop_start_postpone.sysop_end.type == LOG_SYSOP_END_LOGICAL_RUN_POSTPONE)
{
if (sysop_start_postpone.sysop_end.run_postpone.is_sysop_postpone)
{ /* (a) sysop postpone during sysop postpone? should not happen! */
assert (false);
tdes->state = TRAN_UNACTIVE_UNILATERALLY_ABORTED; tdes->undo_nxlsa = tdes->tail_lsa; }
else
{ /* (b) logical run postpone during transaction postpone */
tdes->state = TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE;
LSA_SET_NULL (&tdes->undo_nxlsa);
tdes->posp_nxlsa = sysop_start_postpone.sysop_end.run_postpone.postpone_lsa; }
}
else if (!LSA_ISNULL (&tdes->rcv.tran_start_postpone_lsa))
{ /* (c) sysop nested in transaction postpone phase */
tdes->state = TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE; }
else
{ /* (d) standalone: hand the rest to undo (Ch 9) */
tdes->state = TRAN_UNACTIVE_UNILATERALLY_ABORTED; tdes->undo_nxlsa = tdes->tail_lsa; }

(b) resumes the parent’s postpone after the one this sysop ran (8.3); (b)/(c) fall through into 8.5’s branch in the same invocation; (d) parks the TDES for undo. A defensive clamp resets topops.last to -1 under assert_release.

8.7 log_recovery_abort_interrupted_sysop — the backward scan

Postpone execution can itself use logical run postpone sysops (file destroy/deallocate); a crash mid-sysop leaves a fragment to abort before resuming. Walk the undo chain backwards from tdes->undo_nxlsa down to postpone_start_lsa:

  • Early return if undo_nxlsa is NULL or <= postpone_start_lsa — nothing to abort.
  • Per record (page fetch failure: logpb_fatal_error, return):
    • LOG_RUN_POSTPONE — physical run postpone completed: stop, last_parent_lsa = iter_lsa.
    • LOG_SYSOP_END — stop likewise if type == LOG_SYSOP_END_LOGICAL_RUN_POSTPONE, else hop to sysop_end->lastparent_lsa, skipping the nested sysop whole.
    • anything else — prev_lsa = logrec_head.prev_tranlsa; asserts forbid postpone-start types.
  • Loop drained — assert (LSA_EQ (&iter_lsa, postpone_start_lsa)); the interrupted sysop was the first postpone action: last_parent_lsa = *postpone_start_lsa.

Then the 8.4 simulated-sysop trick with stack[last].lastparent_lsa = last_parent_lsa: everything after the last completed run postpone is compensated; completed ones stay.
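The backward scan described above can be traced on a toy log. The sketch assumes integer LSAs with -1 as NULL and a `std::map` in place of log pages; `toy_record`, `toy_log` and `find_interrupted_sysop_boundary` are hypothetical names, and page-fetch errors are not modelled.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using toy_lsa = std::int64_t;
constexpr toy_lsa TOY_NULL_LSA = -1;

enum class toy_rectype { OTHER, RUN_POSTPONE, SYSOP_END_OTHER, SYSOP_END_LOGICAL_RUN_POSTPONE };

struct toy_record
{
  toy_rectype type;
  toy_lsa prev_tranlsa;		// previous record of the same transaction
  toy_lsa lastparent_lsa;	// only meaningful for sysop end records
};

using toy_log = std::map<toy_lsa, toy_record>;

// Returns the rollback boundary (last_parent_lsa), or TOY_NULL_LSA when nothing needs aborting.
inline toy_lsa find_interrupted_sysop_boundary (const toy_log &log, toy_lsa undo_nxlsa,
						 toy_lsa postpone_start_lsa)
{
  if (undo_nxlsa == TOY_NULL_LSA || undo_nxlsa <= postpone_start_lsa)
    {
      return TOY_NULL_LSA;
    }
  toy_lsa iter_lsa = undo_nxlsa;
  while (iter_lsa > postpone_start_lsa)
    {
      const toy_record &rec = log.at (iter_lsa);
      switch (rec.type)
	{
	case toy_rectype::RUN_POSTPONE:
	case toy_rectype::SYSOP_END_LOGICAL_RUN_POSTPONE:
	  return iter_lsa;			// a completed run postpone: keep it and everything before
	case toy_rectype::SYSOP_END_OTHER:
	  iter_lsa = rec.lastparent_lsa;	// hop over the whole nested sysop
	  break;
	case toy_rectype::OTHER:
	  iter_lsa = rec.prev_tranlsa;
	  break;
	}
    }
  // Chain exhausted: the interrupted sysop was the first postpone action.
  assert (iter_lsa == postpone_start_lsa);
  return postpone_start_lsa;
}
```

Everything logged after the returned boundary is what the simulated sysop abort would compensate.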

8.8 log_recovery_find_first_postpone — the run-postpone trail

tdes->posp_nxlsa after analysis is ambiguous: analysis advances it to run_posp->ref_lsa of every LOG_RUN_POSTPONE scanned (the last confirmed postpone) or, if none ran, to the first postpone of LOG_COMMIT_WITH_POSTPONE (Ch 4). One forward scan disambiguates. Guards: outside crash recovery or the three postpone states, assert (0) and ER_FAILED; a NULL start_postpone_lsa yields NO_ERROR with a NULL result. The scan reuses log_do_postpone’s nested-top range walk and page-fetch error path (8.9), inspecting only this trid:

  • LOG_RUN_POSTPONE with ref_lsa == start_postpone_lsa — candidate ran: set start_postpone_lsa_wasapplied, done.
  • LOG_SYSOP_END of type LOG_SYSOP_END_LOGICAL_RUN_POSTPONE — same test on run_postpone.postpone_lsa (logical run postpones log no LOG_RUN_POSTPONE).
  • LOG_POSTPONE — the first non-candidate goes to next_postpone_lsa.
  • LOG_END_OF_LOG / NULL-offset — archive-boundary page advance as in 8.9.

Tail: candidate never ran — ret_lsa = start_postpone_lsa; else ret_lsa = next_postpone_lsa, the next LOG_POSTPONE, NULL if none remain.
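The disambiguating scan can be sketched over a flat, single-transaction log. Assumptions: integer LSAs with -1 as NULL, records already filtered to one trid and sorted by LSA, and no nested-top ranges or page boundaries. The names `toy_record` and `find_first_postpone` are hypothetical, and `ref_lsa` doubles as `run_postpone.postpone_lsa` for the logical end record.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using toy_lsa = std::int64_t;
constexpr toy_lsa TOY_NULL_LSA = -1;

enum class toy_rectype { POSTPONE, RUN_POSTPONE, SYSOP_END_LOGICAL_RUN_POSTPONE, OTHER };

struct toy_record
{
  toy_lsa lsa;
  toy_rectype type;
  toy_lsa ref_lsa;	// postpone this record executed; TOY_NULL_LSA otherwise
};

// Returns the first postpone that still has to run, or TOY_NULL_LSA if none remain.
inline toy_lsa find_first_postpone (const std::vector<toy_record> &log, toy_lsa start_postpone_lsa)
{
  assert (std::is_sorted (log.begin (), log.end (),
			  [] (const toy_record &a, const toy_record &b) { return a.lsa < b.lsa; }));
  if (start_postpone_lsa == TOY_NULL_LSA)
    {
      return TOY_NULL_LSA;
    }
  bool start_was_applied = false;
  toy_lsa next_postpone_lsa = TOY_NULL_LSA;
  for (const toy_record &rec : log)
    {
      if (rec.lsa < start_postpone_lsa)
	{
	  continue;	// the scan starts at the candidate
	}
      if (rec.type == toy_rectype::RUN_POSTPONE || rec.type == toy_rectype::SYSOP_END_LOGICAL_RUN_POSTPONE)
	{
	  if (rec.ref_lsa == start_postpone_lsa)
	    {
	      start_was_applied = true;	// the candidate already ran
	      break;
	    }
	}
      else if (rec.type == toy_rectype::POSTPONE)
	{
	  if (next_postpone_lsa == TOY_NULL_LSA && rec.lsa != start_postpone_lsa)
	    {
	      next_postpone_lsa = rec.lsa;	// first postpone after the candidate
	    }
	}
    }
  return start_was_applied ? next_postpone_lsa : start_postpone_lsa;
}
```

Returning the candidate only when no matching run record exists is what keeps an already-executed postpone from being seeded into the executor again.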

8.9 log_do_postpone — the shared forward executor

The routine that runs postpones at runtime commit re-runs them here. log_get_next_nested_top builds a stack of nested-sysop ranges; the outer loop seeks each range three ways — up to a range’s start, restarting after its end, or (when start_seek_lsa == nxtop_range->end_lsa) running to tdes->tail_lsa and stopping. This skips the interior of every completed nested sysop — committed or aborted — since their LOG_POSTPONE records belong to the sysop, not the enclosing postpone phase; only a LOG_SYSOP_END_LOGICAL_RUN_POSTPONE end record stays inside the scanned range (log_get_next_nested_top ends that range one record earlier so the end record itself is processed). Both forward scanners share the page-fetch error path: logpb_fetch_page failure raises logpb_fatal_error and jumps to the end label, which frees a heap-grown nxtop_stack.

Dispatch inside a range: ordinary data/dummy/replication types are ignored; LOG_POSTPONE executes now via log_run_postpone_op, with goto end on failure; LOG_COMMIT_WITH_POSTPONE (plus _OBSOLETE, LOG_SYSOP_START_POSTPONE, the 2PC starts) nulls forward_lsa because the postpone region is over; LOG_SYSOP_END is tolerated only at start_seek_lsa, else debug-logged as a bad range. log_run_postpone_op reads the LOG_REC_REDO payload (copying across page boundaries; logpb_fatal_error on OOM) and calls log_execute_run_postpone: apply the redo function, then log a new LOG_RUN_POSTPONE, so a second crash just extends the trail 8.8 consumes.

Invariant 8-C — postpones execute exactly once. The posp_nxlsa trail, 8.8’s applied-check, and the fresh LOG_RUN_POSTPONE each execution logs guarantee each LOG_POSTPONE runs exactly once across any number of crashes. Seeding log_do_postpone with an already-run LSA would double-apply non-idempotent redo such as page deallocation.

  1. The redo tail runs two cleanups after the parallel pool drains: abort open atomic sysops, then finish pending postpones (Invariant 8-A), fed by tdes->rcv (LOG_RCV_TDES) plus the LOG_RUN_POSTPONE trail.
  2. Both drivers walk regular TDES slots then log_system_tdes::map_all_tdes, impersonating each transaction via log_rv_simulate_runtime_worker.
  3. Rollback simulates a sysop — log_sysop_start, overwrite lastparent_lsa, log_sysop_abort — ordered by 8.4’s nested-case branches.
  4. log_recovery_finish_sysop_postpone replays the embedded LOG_REC_SYSOP_END, landing in the transaction-postpone path or TRAN_UNACTIVE_UNILATERALLY_ABORTED for undo.
  5. Each finished worker TDES exits via log_complete (LOG_COMMIT) to TRAN_UNACTIVE_COMMITTED — except 2PC participants (Ch 11).

Chapter 9: Undo Pass and Compensation

Redo (Chapter 6) left even the losers’ effects in place; undo rolls them back. CLR theory is in the companion (cubrid-recovery-manager.md); here: log_recovery_undo and log_rv_undo_record branch by branch, plus the sysop bracket that makes rollback crash-restartable.

The four undo-side record bodies covered here (log_rec_undo, log_rec_mvcc_undo, log_rec_compensate, log_rec_sysop_end) all live in log_record.hpp and are read in place from the log page.

LOG_REC_UNDO — body of LOG_UNDO_DATA:

| Field | Role |
|---|---|
| `data` (LOG_DATA) | rcvindex + volid/pageid/offset: one locator for RV_fun dispatch and page fix; NULL vpid triggers RCV_IS_LOGICAL_LOG |
| `length` | undo-image byte count; carries the ZIP_CHECK flag |

LOG_REC_MVCC_UNDO — body of LOG_MVCC_UNDO_DATA:

| Field | Role |
|---|---|
| `undo` (LOG_REC_UNDO) | embedded plain undo (the MVCC record is a strict superset); arms extract `&mvcc_undo->undo` |
| `mvccid` | writer’s MVCCID, re-activated during undo so the version stays invisible |
| `vacuum_info` (LOG_VACUUM_INFO) | prev_mvcc_op_log_lsa chain + vfid: vacuum’s list through MVCC op records; undo skips it |

LOG_REC_COMPENSATE — body of LOG_COMPENSATE, the CLR:

| Field | Role |
|---|---|
| `data` (LOG_DATA) | locator + rcvindex of the compensation’s redo; CLRs are redo-only, replayed via redofun |
| `undo_nxlsa` | next record to undo, captured before the compensated one (ARIES UndoNxtLSA); restarted undo skips done work (9.3 arm 4) |
| `length` | after-image length |

LOG_REC_SYSOP_END — body of LOG_SYSOP_END; union keyed by type:

| Field | Role |
|---|---|
| `lastparent_lsa` | last LSA before the sysop; undo jumps here, so committed sysops are never re-undone |
| `prv_topresult_lsa` | previous completed top action, for nested-sysop chaining (Chapter 5) |
| `type` (LOG_SYSOP_END_TYPE) | union discriminator: six end flavors, one record |
| `vfid` | file of affected pages; TDE decision for trailing undo data |
| `undo` (union) | LOG_REC_UNDO for LOGICAL_UNDO: the sysop’s own undo recipe if its owner aborts |
| `mvcc_undo` (union) | LOG_REC_MVCC_UNDO for LOGICAL_MVCC_UNDO: same, plus MVCCID |
| `compensate_lsa` (union) | resume point for LOGICAL_COMPENSATE: the bracket was itself a compensation |
| `run_postpone` (union) | postpone_lsa + is_sysop_postpone for LOGICAL_RUN_POSTPONE; analysis-side twin (Chapter 5); undo asserts it never sees one |

9.2 log_recovery_undo — pre-pass and loser selection

Called from log_recovery under LOG_RECOVERY_UNDO_PHASE. The pre-pass retires losers with nothing left to undo: a TDES in state TRAN_UNACTIVE_UNILATERALLY_ABORTED / TRAN_UNACTIVE_ABORTED with a NULL undo_nxlsa finished its rollback pre-crash but its LOG_ABORT never hit disk — log_complete (… LOG_ABORT, LOG_DONT_NEED_NEWTRID, LOG_NEED_TO_WRITE_EOT_LOG) writes it now, logtb_free_tran_index frees the slot. System TDESes need no EOT: log_system_tdes::rv_delete_all_tdes_if erases every system entry with NULL undo_nxlsa.

Selection uses logtb_rv_read_only_map_undo_tdes (log_tran_table.c): under a read-mode TR_TABLE_CS it maps a functor over every non-system slot in those two states, then over system workers via log_system_tdes::map_all_tdes — a max-scan lambda yields max_undo_lsa, two more feed the start notice (log_find_unilaterally_largest_undo_lsa duplicates the max-scan; nothing calls it today). The driver allocates undo_unzip_ptr = log_zip_alloc (LOGAREA_SIZE), arms an optional progress timer, exits LOG_CS (fetches use LOG_CS_FORCE_USE; alloc and fetch failures are fatal), then loops per Figure 9-1.

Invariant (globally descending undo order). Each iteration undoes max_undo_lsa — the largest undo_nxlsa over all losers, recomputed after every record — and every arm moves a cursor strictly backward (prev_tranlsa, a CLR’s undo_nxlsa, or a sysop’s lastparent_lsa). The inner while (max_undo_lsa.pageid == log_lsa.pageid) drains a page before fetching an earlier one; a forward-moving arm would live-lock.

flowchart TD
    A["prune finished losers"] --> B["max_undo_lsa = max undo_nxlsa"]
    B --> C{NULL?}
    C -- yes --> Z["free unzip buffer, LOG_CS_ENTER,<br/>flush log + header + data pages"]
    C -- no --> D["fetch page; while same pageid:<br/>resolve tdes, switch on log_rtype"]
    D --> G{prev_tranlsa NULL?}
    G -- yes --> H["chain done: log_complete +<br/>logtb_free_tran_index or rv_delete_tdes"]
    G -- no --> I["undo_nxlsa = prev_tranlsa"]
    H --> B
    I --> B

Figure 9-1: driver loop.
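The driver loop of Figure 9-1 can be traced on a toy log with two losers. This sketch models only cursor movement: LSAs are integers with -1 as NULL, a `std::map` replaces log pages, and only three arms (undoable, redo-only, CLR) are represented. `toy_rec` and `run_undo` are hypothetical names, and no compensation records are written.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using toy_lsa = std::int64_t;
constexpr toy_lsa TOY_NULL_LSA = -1;	// assumption: NULL sorts below every real LSA

enum class rec_type { UNDO, REDO_ONLY, COMPENSATE };

struct toy_rec
{
  int trid;
  rec_type type;
  toy_lsa prev_tranlsa;		// previous record of the same transaction
  toy_lsa clr_undo_nxlsa;	// only meaningful for COMPENSATE
};

// losers maps trid -> undo_nxlsa. Returns the LSAs undone, in execution order.
inline std::vector<toy_lsa> run_undo (const std::map<toy_lsa, toy_rec> &log, std::map<int, toy_lsa> losers)
{
  std::vector<toy_lsa> undone;
  for (;;)
    {
      // max-scan over all losers, recomputed after every record
      toy_lsa max_undo_lsa = TOY_NULL_LSA;
      for (const auto &loser : losers)
	{
	  max_undo_lsa = std::max (max_undo_lsa, loser.second);
	}
      if (max_undo_lsa == TOY_NULL_LSA)
	{
	  break;	// every chain is exhausted
	}
      const toy_rec &rec = log.at (max_undo_lsa);
      toy_lsa prev = rec.prev_tranlsa;
      switch (rec.type)
	{
	case rec_type::UNDO:
	  undone.push_back (max_undo_lsa);
	  break;
	case rec_type::COMPENSATE:
	  prev = rec.clr_undo_nxlsa;	// leapfrog work already undone before the crash
	  break;
	case rec_type::REDO_ONLY:
	  break;			// nothing to undo, just step back
	}
      assert (prev < max_undo_lsa);	// every arm must move strictly backward
      if (prev == TOY_NULL_LSA)
	{
	  losers.erase (rec.trid);	// chain done: the real driver completes the transaction here
	}
      else
	{
	  losers[rec.trid] = prev;
	}
    }
  return undone;
}
```

In the test scenario used for this sketch, transaction 1 ends in a CLR that already compensated LSA 6, so 6 is never undone again and the undone sequence is strictly descending across both transactions.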

TDES resolution forks on logtb_is_system_worker_tranid: workers via log_system_tdes::rv_get_tdes (NULL asserts); regular transactions via logtb_find_tran_index + LOG_FIND_TDES — on lookup failure (a trid analysis never registered) logtb_free_tran_index_with_undo_lsa scrubs any slot holding that undo_nxlsa and the record is skipped. if (tran_index != NULL_TRAN_INDEX && tdes != NULL) gates the switch; on the worker path tran_index is stale — only tdes matters.

Every arm is preceded unconditionally by LSA_COPY (&tdes->undo_nxlsa, &prev_tranlsa) — the order is the point:

Invariant (cursor advances before the undo executes). log_append_compensate copies tdes->undo_nxlsa into the CLR it writes; the driver advanced it to prev_tranlsa first, so the CLR points at the next record to undo. Reverse the order and a crash mid-rollback replays the same undo twice.

  1. UNDOREDO family (LOG_UNDOREDO_DATA, LOG_DIFF_UNDOREDO_DATA, LOG_MVCC_* twins) — MVCC flavors read LOG_REC_MVCC_UNDOREDO and set rcv.mvcc_id, plain ones LOG_REC_UNDOREDO with MVCCID_NULL; fill rcv from the embedded LOG_DATA + ulength, call log_rv_undo_record. DIFF matters only to redo.
  2. LOG_MVCC_UNDO_DATA / LOG_UNDO_DATA — same shape with LOG_REC_MVCC_UNDO / LOG_REC_UNDO and undo->length.
  3. Redo-only / bookkeeping types (LOG_REDO_DATA, LOG_MVCC_REDO_DATA, LOG_DBEXTERN_REDO_DATA, LOG_DUMMY_HEAD_POSTPONE, LOG_POSTPONE, LOG_SAVEPOINT, LOG_REPLICATION_DATA, LOG_REPLICATION_STATEMENT, LOG_DUMMY_HA_SERVER_STATE, LOG_DUMMY_OVF_RECORD, LOG_DUMMY_GENERIC, LOG_SUPPLEMENTAL_INFO, LOG_SYSOP_ATOMIC_START): /* Not for UNDO ... */, fall through to the previous record.
  4. LOG_COMPENSATE: LSA_COPY (&prev_tranlsa, &compensate->undo_nxlsa). No work; the cursor leapfrogs everything already undone pre-crash.
  5. LOG_SYSOP_END — on sysop_end->type:
    • LOGICAL_UNDO / LOGICAL_MVCC_UNDO: the committed bracket carries its own undo recipe. rcv is filled from sysop_end->undo (or mvcc_undo.undo plus rcv.mvcc_id); both prev_tranlsa and tdes->undo_nxlsa move to lastparent_lsa before log_rv_undo_record runs, so its compensation skips the whole sysop. (rcv_lsa is not refreshed; diagnostics may print a stale LSA.)
    • LOGICAL_COMPENSATE: prev_tranlsa = sysop_end->compensate_lsa — resume before the record the bracket compensated.
    • default (COMMIT, ABORT): prev_tranlsa = sysop_end->lastparent_lsa; an assert documents that LOGICAL_RUN_POSTPONE never reaches undo (Chapter 8).
  6. Terminal/illegal types (LOG_RUN_POSTPONE, the LOG_COMMIT* trio, LOG_SYSOP_START_POSTPONE, LOG_ABORT, checkpoint and 2PC records, LOG_DUMMY_CRASH_RECOVERY, LOG_END_OF_LOG) and the default arm (corrupted type → ER_LOG_PAGE_CORRUPTED) — analysis went wrong: after assert (false), release builds amputate — clear tdes->mvccinfo.id, log_system_tdes::rv_delete_tdes (workers) or log_complete (… LOG_ABORT …) + logtb_free_tran_index, tdes = NULL so the epilogue skips it.

Epilogue (if (tdes != NULL)): a NULL prev_tranlsa ends the chain — clear tdes->mvccinfo.id, then rv_delete_tdes (workers) or log_complete + logtb_free_tran_index as in the pre-pass (#ifdef CCI_XA builds skip completion for TRAN_UNACTIVE_2PC_PREPARE). Otherwise prev_tranlsa goes back into tdes->undo_nxlsa, re-asserting the copy arms 4-5 may have redirected. After the loop: free the unzip buffer, re-enter LOG_CS, force-flush log, header and data pages.

Inside log_complete, updaters get log_append_abort_log + log_change_tran_as_completed and unlock_global_oldest_visible_mvccid; no-update losers (LSA_ISNULL (&tdes->tail_lsa)) just flip state.

9.4 log_rv_undo_record — one undo step, every branch


The recovery twin of run-time log_rollback_rec; identity simulated via log_rv_simulate_runtime_worker / log_rv_end_simulation, no page locks. Pre-dispatch: (1) a valid rcv->mvcc_id is re-activated via logtb_rv_assign_mvccid_for_undo_recovery; (2) RCV_IS_LOGICAL_LOG (rcv_vpid, rcvindex) — NULL vpid or a logical rcvindex — leaves rcv->pgptr = NULL, else pgbuf_fix takes an unconditional write latch (failure asserted, tolerated); (3) ZIP_CHECK (rcv->length) strips the compression flag; the image is aliased from the log page if it fits, else malloced + logpb_copy_from_log, zipped images inflated by log_unzip into undo_unzip_ptr (alloc/unzip failures fatal — as in the reader-based redo-side twins log_rv_get_unzip_log_data / log_rv_get_unzip_and_diff_redo_log_data, Chapter 6). Then, under if (rcv->pgptr != NULL || RCV_IS_LOGICAL_LOG (…)):

// log_rv_undo_record -- src/transaction/log_recovery.c
if (rcvindex == RVBT_MVCC_INCREMENTS_UPD)
{ /* nothing to do during recovery */ }
else if (rcvindex == RVBT_MVCC_NOTIFY_VACUUM || rcvindex == RVES_NOTIFY_VACUUM)
{ /* nothing to do */ }
else if (rcvindex == RVBT_LOG_GLOBAL_UNIQUE_STATS_COMMIT)
{ /* <- in-memory only: undo on every restart, cannot compensate */
error_code = (*RV_fun[rcvindex].undofun) (thread_p, rcv);
assert (error_code == NO_ERROR);
}
else if (RCV_IS_LOGICAL_COMPENSATE_MANUAL (rcvindex))
{ /* <- undofun logs its own compensation */
LSA_COPY (&rcv->reference_lsa, &tdes->undo_nxlsa);
error_code = (*RV_fun[rcvindex].undofun) (thread_p, rcv);
// ... condensed ... logpb_fatal_error on failure; optional b-tree trace
}
else if (!RCV_IS_LOGICAL_LOG (rcv_vpid, rcvindex))
{ /* <- PHYSICAL undo: CLR first, then apply before-image */
log_append_compensate (thread_p, rcvindex, rcv_vpid, rcv->offset, rcv->pgptr,
rcv->length, rcv->data, tdes);
error_code = (*RV_fun[rcvindex].undofun) (thread_p, rcv);
// ... condensed ... logpb_fatal_error on failure
}
else
{ /* <- LOGICAL undo: bracket in a system operation */
save_state = tdes->state;
LSA_COPY (&rcv->reference_lsa, &tdes->undo_nxlsa);
log_sysop_start (thread_p);
(void) (*RV_fun[rcvindex].undofun) (thread_p, rcv);
log_sysop_end_logical_compensate (thread_p, &rcv->reference_lsa);
tdes->state = save_state;
}

A physical record whose page could not be fixed (the guard’s else) still gets log_append_compensate with pgptr = NULL — the chain stays restartable — plus ER_LOG_MAYNEED_MEDIA_RECOVERY naming the volume; the undofun is skipped and recovery continues (log-and-skip). end: frees the area, unfixes the page, log_rv_end_simulation.

Invariant (every undo step is logged before or while it happens). Physical undo writes the CLR before undofun; logical undo opens log_sysop_start first so all page changes land inside the bracket, sealed by log_sysop_end_logical_compensate with compensate_lsa = rcv->reference_lsa. Crash inside the bracket: analysis aborts the sysop (Chapter 8), undo resumes at the original record. Crash after the seal: the LOGICAL_COMPENSATE arm jumps to compensate_lsa. Either way the logical undo runs exactly once.

recovery.h defines the manual sets: RCV_IS_BTREE_LOGICAL_LOG (ten RVBT_* object-level ops) inside the wider RCV_IS_LOGICAL_COMPENSATE_MANUAL (plus RVFL_ALLOC, RVFL_USER_PAGE_MARK_DELETE, RVPGBUF_DEALLOC, RVFL_TRACKER_HEAP_REUSE, RVHF_LOB_REMOVE_DIR, RVFL_TRACKER_UNREGISTER). Their undofuns append page-level compensations themselves via log_append_compensate_with_undo_nxlsa with the saved rcv->reference_lsa — a b-tree undo may split or merge pages before compensating — so an extra bracket would be redundant.

9.5 log_append_compensate — the CLR writer


log_append_compensate and log_append_compensate_with_undo_nxlsa wrap log_append_compensate_internal (log_manager.c); the latter passes an explicit undo_nxlsa for the b-tree case, the former NULL:

// log_append_compensate_internal -- src/transaction/log_manager.c
// ... condensed ... node = prior_lsa_alloc_and_copy_data (.., LOG_COMPENSATE, ..); NULL -> silent return
LSA_COPY (&prev_lsa, &tdes->undo_nxlsa); /* <- next record to undo, saved */
compensate = (LOG_REC_COMPENSATE *) node->data_header;
// ... condensed ... fill compensate->data; store the undo_nxlsa parameter
// into compensate->undo_nxlsa if non-NULL (b-tree override), else prev_lsa
start_lsa = prior_lsa_next_record (thread_p, node, tdes);
// ... condensed ... pgbuf_set_lsa (pgptr, start_lsa) when pgptr != NULL
/* Go back to our undo link */
LSA_COPY (&tdes->undo_nxlsa, &prev_lsa); /* <- CLR must not become next undo target */

Branches: prior_lsa_alloc_and_copy_data failure returns silently — the undo proceeds unlogged; since undo_nxlsa never advanced past the record, a re-crash simply undoes it again (re-applying a before-image is harmless). NULL pgptr (media path, 9.4) skips TDE marking and pgbuf_set_lsa; a failed pgbuf_set_lsa asserts and returns. The last line is load-bearing: prior_lsa_next_record drags undo_nxlsa forward with tail_lsa; restoring prev_lsa keeps the rollback cursor behind the CLR — per the header comment, CLRs “are never undone.”
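
The cursor discipline described above can be modelled in a few lines. This is a toy sketch, not CUBRID code: LSAs are plain vector indices, -1 stands in for a NULL LSA, and every type and function name (`Tran`, `Rec`, `undo_step`, `crash_and_restart`) is hypothetical. It assumes a restart simply re-enters the chain at the transaction's tail record.

```cpp
#include <cassert>
#include <vector>

// Toy model (hypothetical types): an LSA is an index into `log`, -1 is NULL.
enum class Kind { Undoable, Compensate };

struct Rec {
    Kind kind;
    int prev_tranlsa;  // previous record of the same transaction
    int undo_nxlsa;    // CLR only: where rollback resumes
};

struct Tran {
    std::vector<Rec> log;
    std::vector<int> undo_count;  // how often each record's undo ran
    int undo_nxlsa = -1;          // rollback cursor
    int tail_lsa = -1;            // last record appended
};

void append_update(Tran &t) {
    t.log.push_back({Kind::Undoable, t.tail_lsa, -1});
    t.undo_count.push_back(0);
    t.tail_lsa = t.undo_nxlsa = static_cast<int>(t.log.size()) - 1;
}

// Same save / append / restore shape as the CLR writer.
void append_compensate(Tran &t) {
    int prev_lsa = t.undo_nxlsa;  // next record to undo, saved
    t.log.push_back({Kind::Compensate, t.tail_lsa, prev_lsa});
    t.undo_count.push_back(0);
    // Appending drags the cursor forward together with the tail...
    t.tail_lsa = t.undo_nxlsa = static_cast<int>(t.log.size()) - 1;
    // ...so it is restored: the CLR must not become an undo target.
    t.undo_nxlsa = prev_lsa;
}

// One driver step; returns false once the chain is exhausted.
bool undo_step(Tran &t) {
    int lsa = t.undo_nxlsa;
    if (lsa < 0) return false;
    const Rec rec = t.log[lsa];  // copy: the append below may reallocate
    if (rec.kind == Kind::Compensate) {
        t.undo_nxlsa = rec.undo_nxlsa;  // pure cursor redirection
        return true;
    }
    t.undo_nxlsa = rec.prev_tranlsa;  // advance the cursor BEFORE undoing
    append_compensate(t);             // CLR first
    t.undo_count[lsa]++;              // then "apply the before-image"
    return true;
}

// Assumed restart behaviour: re-enter the chain at the tail record.
void crash_and_restart(Tran &t) { t.undo_nxlsa = t.tail_lsa; }
```

Undoing one of three updates, crashing, and then running `undo_step` to exhaustion leaves every update undone exactly once: the CLR found at the tail redirects the cursor past the record it already compensated.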

  1. The pre-pass retires losers whose undo_nxlsa is NULL — log_complete writes the missing LOG_ABORT, logtb_free_tran_index frees the slot — and rv_delete_all_tdes_if prunes finished system TDESes.
  2. The driver always undoes the globally largest undo_nxlsa (recomputed each record via logtb_rv_read_only_map_undo_tdes): a strictly backward, page-at-a-time sweep.
  3. tdes->undo_nxlsa advances before the undo executes, so every CLR carries the correct resume point; log_append_compensate_internal restores it after appending so the CLR is never undone — undo never undoes an undo, making the pass idempotent across repeated crashes.
  4. LOG_COMPENSATE and LOG_SYSOP_END are pure cursor redirections during undo (compensate->undo_nxlsa, lastparent_lsa / compensate_lsa) — a crashed rollback resumes without repetition.
  5. log_rv_undo_record forks on RCV_IS_LOGICAL_LOG: physical undo = CLR then undofun; logical undo = sysop bracket (log_sysop_start to log_sysop_end_logical_compensate) that analysis aborts if half-done and undo skips if sealed; RCV_IS_LOGICAL_COMPENSATE_MANUAL undofuns compensate manually.
  6. The only tolerated failure is an unfixable data page — CLR still written (NULL pgptr) plus ER_LOG_MAYNEED_MEDIA_RECOVERY; everything else is logpb_fatal_error, because a half-applied undo with its CLR on disk would lie to the next restart.

Every redo, undo, compensation replay, and logdump print indexes one global array: RV_fun[] in recovery.c. The drivers of Ch 6, 7, and 9 know nothing about heap or b-tree semantics — only how to find the right function pointer. This chapter covers the entry layout, the index-equals-position invariant, NULL arms, and the shared packed-change machinery; theory lives in the high-level companion (“ARIES in CUBRID”, “Recovery Function Dispatch”).

10.1 The rvfun entry and the table it forms


Each slot is a struct rvfun (recovery.h):

// rvfun -- src/transaction/recovery.h
struct rvfun
{
using fun_t = int (*)(THREAD_ENTRY * thread_p, LOG_RCV * logrcv);
using dump_fun_t = void (*)(FILE * fp, int length, void *data);
LOG_RCVINDEX recv_index; /* For verification */
const char *recv_string;
fun_t undofun;
fun_t redofun;
dump_fun_t dump_undofun;
dump_fun_t dump_redofun;
};

| Field | Role | Why it exists |
| --- | --- | --- |
| recv_index | LOG_RCVINDEX this slot claims | Compared to slot position by rv_check_rvfuns |
| recv_string | Printable name ("RVHF_INSERT") | logdump and fatal errors, via rv_rcvindex_string |
| undofun | Rollback, undo pass (Ch 9), redo of LOG_COMPENSATE | CLR payloads are undo-direction; NULL = never logs undo data |
| redofun | Redo pass (Ch 6/7), run-postpone (Ch 8) | NULL = never logs redo data (undo-only logical records) |
| dump_undofun | Debug printer, undo payload | logdump only, via log_dump_data |
| dump_redofun | Debug printer, redo payload | NULL = payload not formatted |

RV_fun[] is an aggregate initializer, one literal per LOG_RCVINDEX, from RVDK_NEWVOL (NULL undo arm — volume creation is redo-only) to RVHF_LOB_REMOVE_DIR. Arms are often mirror pairs (RVDK_UNRESERVE_SECTORS: undo disk_rv_reserve_sectors, redo disk_rv_unreserve_sectors).

flowchart LR
    REDO["redo + run-postpone Ch 6-8"] --> R["redofun"]
    CLR["LOG_COMPENSATE replay"] --> U["undofun"]
    UNDO["undo + rollback Ch 9"] --> U
    DUMP["logdump"] --> D["dump arms"]

Figure 10-1: consumers of each rvfun arm; compensate replay crosses to undofun.

The crossed wire is explicit in log_rv_get_fun<LOG_REC_COMPENSATE> (log_recovery_redo.hpp): its body is return RV_fun[rcvindex].undofun; — comment // yes, undo. Hence RVBT_RECORD_MODIFY_COMPENSATE registers btree_rv_redo_record_modify as undofun with NULL redo: the CLR payload is redo-format, replayed only through undofun.

LOG_RCVINDEX (recovery.h) is an explicitly numbered enum, RVDK_NEWVOL = 0 through RVHF_LOB_REMOVE_DIR = 129, closed by two specials: RV_LAST_LOGID = RVHF_LOB_REMOVE_DIR (an alias, not a slot) and RV_NOT_DEFINED = 999 (sentinel; must never index RV_fun). Its head comment mandates new entries at the bottom, “to AVOID OLD DATABASES TO BE RECOVERED UNDER OLD FILE”.

Invariant (table ordering): for every i in [0, DIM(RV_fun)), RV_fun[i].recv_index == i. The rcvindex in each on-disk log record is the array subscript — dispatch is an unchecked array load. Enforcement runs once at startup, debug builds only:

// rv_check_rvfuns -- src/transaction/recovery.c
for (i = 0; i < num_indices; i++) /* num_indices = DIM (RV_fun) */
if (RV_fun[i].recv_index != i)
{
// ... condensed: er_log_debug "out of sequence" ...
er_set (ER_FATAL_ERROR_SEVERITY, ARG_FILE_LINE, ER_GENERIC_ERROR, 0);
assert (false);
break; /* <- first mismatch only; one insertion shifts all later slots */
}

Branch accounting: one loop, one conditional — a match falls through; a mismatch logs, raises a fatal-severity error, asserts, breaks. Function and call site vanish under NDEBUG (the call opens log_initialize_internal, log_manager.c); a misordered release-build table is caught by nothing — recovery applies some other index’s function to each payload.
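
A minimal sketch of the index-equals-position check, using a hypothetical three-entry table (`ToyIndex`, `ToyEntry`, `good_table`, `bad_table` are illustrative names, not CUBRID's). It also shows what an unchecked lookup returns when two slots are transposed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical miniature of the dispatch table.
enum ToyIndex { TOY_A = 0, TOY_B = 1, TOY_C = 2 };

struct ToyEntry {
    ToyIndex recv_index;      // for verification
    const char *recv_string;  // printable name
};

// Returns the position of the first slot whose recv_index differs from
// its position, or -1 when the table is coherent. Like the real check,
// it stops at the first mismatch.
int first_out_of_sequence(const ToyEntry *table, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        if (static_cast<std::size_t>(table[i].recv_index) != i) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Unchecked lookup: the index is used directly as the subscript.
const char *name_of(const ToyEntry *table, ToyIndex idx) {
    return table[idx].recv_string;
}

const ToyEntry good_table[] = {{TOY_A, "TOY_A"}, {TOY_B, "TOY_B"}, {TOY_C, "TOY_C"}};
// Slots 1 and 2 transposed: the lookup for TOY_B now lands on TOY_C's entry.
const ToyEntry bad_table[] = {{TOY_A, "TOY_A"}, {TOY_C, "TOY_C"}, {TOY_B, "TOY_B"}};
```

With `bad_table`, `name_of (bad_table, TOY_B)` yields "TOY_C": the same silent mis-dispatch a misordered release-build table would produce for function pointers.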

rv_rcvindex_string trusts the invariant: its whole body is return RV_fun[rcvindex].recv_string; — no bounds check, so RV_NOT_DEFINED must never reach it. (A stale recovery.c header comment still directs authors to rv_rcvindex_string() for new names.)

A NULL arm is a contract about logging, not a recovery-time fallback: no record with this rcvindex ever carries data for that direction. Enforcement lives at append time, in CUBRID_DEBUG blocks (log_manager.c): log_append_undoredo_crumbs asserts both arms non-NULL, log_append_undo_crumbs only undofun, log_append_redo_crumbs only redofun; rollback adds assert (RV_fun[rcvindex].undofun != NULL). At recovery only log_rv_redo_record is defensive — a NULL redofun merely logs a warning — while log_rv_undo_record calls the arm with no NULL test: the append-time contract is its only safety net.

Dump arms take (FILE *, int length, void *data) — payload only, no page — since logdump runs offline; printers are generic (log_rv_dump_char, log_rv_dump_hexa) or subsystem decoders (disk_rv_dump_hdr).

10.4 Index families and the RCV_IS_* macro overlay


The prefix encodes the owning subsystem; the append-only rule scatters late additions to 124–129 regardless of family: RVDK_* 0–9 (disk) · RVFL_* 10–32, 128 (file mgr; 128 = TDE) · RVHF_* 33–53, 126, 129 (heap) · RVOVF_* 54–57 (overflow) · RVEH_* 58–65 (ext hash) · RVBT_* 66–91, 124–125 (b-tree; 124–125 = online index) · RVCT_* 92–96 (catalog) · RVLOG_* 97 (log no-op) · RVREPL_* 98–103 (replication; HA shipping, not page recovery) · RVVAC_* 104–117 (vacuum) · RVES_* 118 (external storage) · RVLOC_* 119 (locator dummy) · RVPGBUF_* 120–123, 127 (page buffer; 127 = TDE).

The RCV_IS_* macros (Ch 1) are a second axis: the index value selects the function; its macro membership selects the protocol around the call in log_rv_undo_record’s six-way ladder (Ch 9). Indices in RCV_IS_LOGICAL_COMPENSATE_MANUAL (fed by RCV_IS_BTREE_LOGICAL_LOG) get rcv->reference_lsa preloaded from tdes->undo_nxlsa and their undofun logs its own CLR; indices failing RCV_IS_LOGICAL_LOG get a driver-side log_append_compensate first. RCV_IS_LOGICAL_RUN_POSTPONE_MANUAL is the postpone analogue (Ch 8).

Invariant (macro/table coherence): every index named in a RCV_IS_* macro must keep an arm whose internal logging matches the protocol the macro routes it to. Nothing checks this mechanically; a RCV_IS_LOGICAL_COMPENSATE_MANUAL index whose undofun logs no CLR leaves undo_nxlsa pointing at the same record: infinite rollback loop or missing-CLR crash at next restart.

10.5 The packed partial-change mini-format


Many slotted-page entries log a sequence of splices instead of whole records. One splice unit is short offset_to_data | byte A | byte B | payload, padded to INT_ALIGNMENT, as produced by the packers in log_recovery.c:

// log_rv_pack_redo_record_changes -- src/transaction/log_recovery.c
assert (offset_to_data >= 0 && offset_to_data <= 0x8FFF);
/* <- intends flag bits clear; mask 0xC000 needs <= 0x3FFF, so 0x8FFF looks like a source typo */
// ... condensed: asserts both sizes <= 255 (single wire bytes); PTR_ALIGN to INT_ALIGNMENT ...
OR_PUT_SHORT (ptr, (short) offset_to_data); ptr += OR_SHORT_SIZE;
OR_PUT_BYTE (ptr, (INT16) old_data_size); ptr += OR_BYTE_SIZE;
OR_PUT_BYTE (ptr, (INT16) new_data_size); ptr += OR_BYTE_SIZE;
if (new_data_size > 0)
{ memcpy (ptr, new_data, new_data_size); ptr += new_data_size; }
// ... condensed: trailing PTR_ALIGN ...

log_rv_pack_undo_record_changes differs in exactly two ways: the two OR_PUT_BYTE lines are swapped — new_data_size first — and the memcpy payload is old_data, guarded by old_data_size > 0. That asymmetry is the whole trick:

| Wire field | In redo data | In undo data |
| --- | --- | --- |
| offset_to_data | splice position | splice position (same) |
| byte A (“remove size”) | runtime old size | runtime new size (the bytes undo strips) |
| byte B (“insert size”) | runtime new size | runtime old size (the bytes undo restores) |
| payload | new data | old data |

The packers pre-swap, so one interpreter serves both directions with no direction flag.
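
The pre-swap can be sketched with a byte vector standing in for the log buffer. Assumptions, all labelled in the code: the two-byte offset is written high byte first, alignment is 4 bytes, and the offset bound is the flag-safe 0x3FFF rather than the source's 0x8FFF. `pack_unit`, `pack_redo` and `pack_undo` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Assumption: 4-byte alignment stands in for INT_ALIGNMENT.
static void pad_to_4(Bytes &out) {
    while (out.size() % 4 != 0) out.push_back(0);
}

// One wire unit: offset | byte A (remove size) | byte B (insert size) | payload.
static void pack_unit(Bytes &out, int offset, std::size_t remove_size,
                      const std::string &insert) {
    assert(offset >= 0 && offset <= 0x3FFF);  // assumed flag-safe bound
    assert(remove_size <= 255 && insert.size() <= 255);
    pad_to_4(out);
    out.push_back(static_cast<std::uint8_t>(offset >> 8));  // assumed byte order
    out.push_back(static_cast<std::uint8_t>(offset & 0xFF));
    out.push_back(static_cast<std::uint8_t>(remove_size));    // byte A
    out.push_back(static_cast<std::uint8_t>(insert.size()));  // byte B
    out.insert(out.end(), insert.begin(), insert.end());
    pad_to_4(out);
}

// Redo: strip the old bytes, insert the new ones.
void pack_redo(Bytes &out, int offset, const std::string &old_data,
               const std::string &new_data) {
    pack_unit(out, offset, old_data.size(), new_data);
}

// Undo: sizes pre-swapped, payload is the old data.
void pack_undo(Bytes &out, int offset, const std::string &old_data,
               const std::string &new_data) {
    pack_unit(out, offset, new_data.size(), old_data);
}
```

Packing the same runtime change (`"ab"` replaced by `"xyz"` at offset 5) both ways yields two units with identical offsets, mirrored size bytes, and different payloads.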

10.6 The interpreter: ordered replay, reversed unreplay


log_rv_undoredo_record_partial_changes is a three-assert wrapper that wraps the payload in an OR_BUF and calls the recursive core. The core is recursive because undo must apply splices in reverse log order: each offset_to_data was computed against the record as that splice saw it.

// log_rv_undoredo_partial_changes_recursive -- src/transaction/log_recovery.c
if (rcv_buf->ptr == rcv_buf->endptr)
return NO_ERROR; /* (1) clean termination */
if (rcv_buf->ptr + OR_SHORT_SIZE + 2 * OR_BYTE_SIZE > rcv_buf->endptr)
{ assert_release (false); return ER_TF_BUFFER_OVERFLOW; } /* (2) truncated unit */
offset_to_data = (int) or_get_short (rcv_buf, &error_code); /* (3,4,5) per-field errors */
// ... condensed: old_data_size, new_data_size; each returns error_code on failure ...
if (new_data_size > 0)
{ new_data = rcv_buf->ptr;
error_code = or_advance (rcv_buf, new_data_size); /* (6) payload overruns buffer */ }
else
new_data = NULL; /* (7) pure deletion splice */
or_align (rcv_buf, INT_ALIGNMENT); /* <- mirrors packer's PTR_ALIGN */
if (!is_undo)
RECORD_REPLACE_DATA (record, offset_to_data, old_data_size, new_data_size, new_data);
error_code = log_rv_undoredo_partial_changes_recursive (thread_p, rcv_buf, record, is_undo);
if (error_code != NO_ERROR)
{ assert_release (false); return error_code; } /* (8) deeper error skips this splice */
if (is_undo)
RECORD_REPLACE_DATA (record, offset_to_data, old_data_size, new_data_size, new_data);
return NO_ERROR;

(7) is legal because RECORD_REPLACE_DATA (storage_common.h) skips its memcpy when insert size is 0.

flowchart TD
    A["parse unit i"] --> B{"is_undo?"}
    B -- "no" --> C["apply splice i, then recurse into i+1"]
    B -- "yes" --> D["recurse into i+1, apply splice i on unwind"]

Figure 10-2: redo applies before recursing; undo applies on unwind, reversing order for free.
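
The ordering argument can be checked on a plain string. The sketch below assumes the units have already been parsed into a hypothetical `Splice` struct and uses `std::string::replace` as a stand-in for RECORD_REPLACE_DATA; none of these names come from the source.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One parsed unit, already in wire orientation (hypothetical struct).
struct Splice {
    std::size_t offset;
    std::size_t remove_size;  // byte A
    std::string insert;       // byte B bytes of payload
};

// Stand-in for the splice primitive: remove, then insert at the same spot.
static void replace_data(std::string &record, const Splice &s) {
    record.replace(s.offset, s.remove_size, s.insert);
}

// Redo applies unit i before recursing; undo applies it on the unwind,
// so undo units run in reverse log order without any explicit reversal.
void apply_recursive(const std::vector<Splice> &units, std::size_t i,
                     std::string &record, bool is_undo) {
    if (i == units.size()) return;  // clean termination
    if (!is_undo) replace_data(record, units[i]);
    apply_recursive(units, i + 1, record, is_undo);
    if (is_undo) replace_data(record, units[i]);
}
```

Take the record `"hello world"` changed in two steps: `"hello"` becomes `"HI"` at offset 0, then `"world"` becomes `"there!"` at offset 3. Redo units `{0, 5, "HI"}, {3, 5, "there!"}` replayed forward give `"HI there!"`; the pre-swapped undo units `{0, 2, "hello"}, {3, 6, "world"}`, stored in the same log order, restore the original only when applied on the unwind. Applying them front to back corrupts the record, because offset 3 is valid only against the intermediate state.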

10.7 log_rv_record_modify_internal and the thin wrappers


The generic record modifier reads two flag bits smuggled into rcv->offset (LOG_RV_RECORD_SET_MODIFY_MODE, mask LOG_RV_RECORD_MODIFY_MASK = 0xC000, log_append.hpp; §10.5’s 0x8FFF assert intends to protect these bits, though a flag-safe bound would be 0x3FFF):

| flags | Meaning | Redo action | Undo action |
| --- | --- | --- | --- |
| LOG_RV_RECORD_INSERT (0x8000) | record inserted | spage_insert_at | spage_delete |
| LOG_RV_RECORD_DELETE (0x4000) | record deleted | spage_delete | spage_insert_at |
| LOG_RV_RECORD_UPDATE_ALL (0xC000) | full replacement | spage_update | spage_update (per-arm payload) |
| LOG_RV_RECORD_UPDATE_PARTIAL (0x0000) | splice chain | splice forward, spage_update | splice reversed, spage_update |

// log_rv_record_modify_internal -- src/transaction/log_recovery.c
INT16 flags = rcv->offset & LOG_RV_RECORD_MODIFY_MASK;
PGSLOTID slotid = rcv->offset & (~LOG_RV_RECORD_MODIFY_MASK);
if ((!is_undo && LOG_RV_RECORD_IS_INSERT (flags)) || (is_undo && LOG_RV_RECORD_IS_DELETE (flags)))
{ /* ... condensed: unpack type byte + body; spage_insert_at ... */ }
else if ((!is_undo && LOG_RV_RECORD_IS_DELETE (flags)) || (is_undo && LOG_RV_RECORD_IS_INSERT (flags)))
{ /* ... condensed: spage_delete ... */ }
else if (LOG_RV_RECORD_IS_UPDATE_ALL (flags))
{ /* ... condensed: unpack type + body; spage_update ... */ }
else
{
assert (LOG_RV_RECORD_IS_UPDATE_PARTIAL (flags));
// ... condensed: spage_get_record (..., COPY); /* <- splice on a private copy */
// log_rv_undoredo_record_partial_changes (..., is_undo); spage_update ...
}
pgbuf_set_dirty (thread_p, rcv->pgptr, DONT_FREE); /* <- every success path lands here */
return NO_ERROR;

The four arms are mutually exclusive and exhaustive; every failure path is assert_release (false) plus ER_FAILED before the dirty mark, so a failed arm never advertises a half-applied page. (One blemish: the UPDATE_PARTIAL arm’s failed spage_update returns error_code — still NO_ERROR there — asserted but not propagated.) log_rv_redo_record_modify / log_rv_undo_record_modify are one-line wrappers binding is_undo to false/true, giving the table two distinct pointers.
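
The flag arithmetic is small enough to verify directly. In this sketch the constants repeat the values from the table above, while `Decoded`, `encode`, `decode` and `flag_safe` are hypothetical helpers working on unsigned 16-bit values for simplicity.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint16_t MODIFY_MASK = 0xC000;
constexpr std::uint16_t FLAG_INSERT = 0x8000;
constexpr std::uint16_t FLAG_DELETE = 0x4000;
constexpr std::uint16_t FLAG_UPDATE_ALL = 0xC000;
constexpr std::uint16_t FLAG_UPDATE_PARTIAL = 0x0000;

struct Decoded {
    std::uint16_t flags;
    std::uint16_t slotid;
};

// Pack a slot id and a mode into one 16-bit offset field.
constexpr std::uint16_t encode(std::uint16_t slotid, std::uint16_t flags) {
    return static_cast<std::uint16_t>((slotid & ~MODIFY_MASK) | flags);
}

// Split the field back into its two parts.
constexpr Decoded decode(std::uint16_t offset) {
    return Decoded{static_cast<std::uint16_t>(offset & MODIFY_MASK),
                   static_cast<std::uint16_t>(offset & ~MODIFY_MASK)};
}

// A value leaves the flag bits clear only if it does not touch the mask.
constexpr bool flag_safe(int value) {
    return value >= 0 && (value & MODIFY_MASK) == 0;
}
```

`flag_safe (0x3FFF)` holds while `flag_safe (0x8FFF)` does not: 0x8FFF overlaps the top mask bit, so a value at that bound decodes as carrying the INSERT flag.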

The b-tree indices RVBT_RECORD_MODIFY_UNDOREDO / _NO_UNDO / _COMPENSATE register btree_rv_redo_record_modify / btree_rv_undo_record_modify (btree.c) instead; their core btree_rv_record_modify_internal clones this ladder (wider BTREE_RV_FLAGS_MASK, same call into log_rv_undoredo_record_partial_changes) plus node-header upkeep.

10.8 The registration ritual — adding a new LOG_RCVINDEX

  1. Append the enumerator at the bottom of LOG_RCVINDEX; retarget RV_LAST_LOGID. Never renumber — values persist in on-disk logs.
  2. Append the matching rvfun literal at the last slot of RV_fun[], string = enumerator spelling.
  3. Pick arms by logging discipline: both for undoredo, redo-only for redo/postpone, undo-only for logical-undo or compensate-replay (§10.1’s crossed wire).
  4. If the index needs a manual compensation/postpone protocol, add it to the right RCV_IS_* macro and implement that protocol (§10.4 invariant).
  5. Build with asserts and boot: rv_check_rvfuns is the only mechanical check. A skipped or transposed slot dies fatally there; a release build applies the wrong function to every record from the bad slot on.

  1. RV_fun[] maps the on-disk rcvindex to function pointers via an unchecked array load; recv_index == position, checked only by rv_check_rvfuns at debug startup, is load-bearing for every pass; the enum is append-only because the numbers persist in logs.
  2. NULL arms encode logging contracts policed at append time; recovery mostly trusts them.
  3. The redo pass replays LOG_COMPENSATE through the undofun slot, so compensate-only indices register their redo-direction function as undofun.
  4. The RCV_IS_* macros pick the compensation/postpone protocol wrapped around the undofun call in log_rv_undo_record (Ch 9).
  5. The packed splice format is direction-agnostic — the undo packer pre-swaps size bytes and stores old data — so one interpreter replays redo forward, undo reversed on unwind; log_rv_record_modify_internal layers a four-way ladder over it, insert/delete arms swapping under undo, and the b-tree clones the machinery rather than registering the generic wrappers.

Off the main crash-restart lifecycle: point-in-time restore, log truncation, post-restore archive/volume discard, append-point repair, execution-context shims, and the 2PC handoff.

Two log_recovery parameters gate everything: ismedia_crash and stopat (high-level doc, “Restart orchestrator”):

// log_recovery -- src/transaction/log_recovery.c
if (ismedia_crash != false)
{
/* Media crash, we may have to start from an older checkpoint... check disk headers */
(void) fileio_map_mounted (thread_p, (bool (*)(THREAD_ENTRY *, VOLID, void *)) log_rv_find_checkpoint, &rcv_lsa);
}
else
{
// ... condensed ...
if (stopat != NULL)
{
*stopat = -1; /* <- normal restart: point-in-time target forcibly disabled */
}
}

Invariant — incomplete recovery happens only on the media-crash path. A normal restart neutralizes stopat; a broken page outside media recovery is fatal, not truncated — otherwise analysis could silently destroy committed transactions. log_rv_find_checkpoint picks the oldest checkpoint among restored volumes.

Three analysis-time triggers cut the log, all converging on log_recovery_resetlog with *did_incom_recovery = true:

  1. Commit/abort newer than the target. Without a media crash, log_rv_analysis_complete jumps goto end, freeing the tran index (Chapter 4). Otherwise it reads LOG_REC_DONETIME; when *stop_at is set and difftime (*stop_at, last_at_time) < 0, it releases the held page (log_lsa->pageid = NULL_PAGEID), calls log_recovery_resetlog with the record’s header LSA — the new log ends before the too-new record — and returns NO_ERROR.

  2. Commit-with-postpone newer than the target. log_rv_analysis_commit_with_postpone runs the same difftime test on LOG_REC_START_POSTPONE.at_time inside its if (is_media_crash) arm; the else arm is normal Chapter 5 bookkeeping. A passing time check on the media branch does nothing.

  3. Physically broken page. When logpb_fetch_page fails in the log_recovery_analysis loop, the media branch stores last_at_time into *stop_at, rewinds the last transaction’s tail_lsa/undo_nxlsa to log_rec->prev_tranlsa (the half-written record is never undone), re-fetches the previous page (failure fatal: “reset log is impossible”), calls log_recovery_resetlog with prev_lsa/prev_prev_lsa, resets log_Gl.mvcc_table.reset_start_mvccid (), and returns. The non-media branch is fatal — TDE-specific message when er_errid () is ER_TDE_CIPHER_IS_NOT_LOADED.
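
The time comparison shared by the first two triggers reduces to a single `difftime` call. A minimal sketch, assuming a target of -1 (or a NULL pointer) means "no point-in-time target", as the normal-restart branch above suggests; `record_is_past_target` is a hypothetical helper, not a CUBRID function.

```cpp
#include <cassert>
#include <ctime>

// Returns true when the record's timestamp is newer than the stop-at
// target, i.e. when the log would be cut just before this record.
bool record_is_past_target(const std::time_t *stop_at, std::time_t record_time) {
    if (stop_at == nullptr || *stop_at == static_cast<std::time_t>(-1)) {
        return false;  // assumed meaning: target disabled
    }
    // difftime(a, b) is a - b in seconds; negative means the target
    // lies before the record.
    return std::difftime(*stop_at, record_time) < 0;
}
```

A record stamped exactly at the target is kept; only strictly newer records trigger truncation.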

A fourth, milder repair: when a record’s forward LSA is NULL while log_rtype != LOG_END_OF_LOG and *did_incom_recovery == false, the analysis loop calls log_startof_nxrec (§11.5) on log_Gl.hdr.append_lsa, patches log_rec->forw_lsa, rewrites the page (logpb_write_page_to_disk), and sets log_Gl.hdr.next_trid = tran_id in the same block; on failure the append point falls back to end_redo_lsa (“we may destroy a record”).

11.2 log_recovery_resetlog — truncate and re-arm


log_recovery_resetlog is the one function that rewrites history. It asserts LOG_CS_OWN_WRITE_MODE and non-NULL new_prev_lsa, then runs six steps:

  1. Flush what exists. If log_Gl.append.vdes != NULL_VOLDES with an append page held: logpb_flush_pages_direct + logpb_invalid_all_append_pages.
  2. Pick the new append LSA. NULL new_append_lsa → header restarts at 0|0. Otherwise, with no active log or a to-the-past reset at a mid-page offset, the append page is saved so its surviving prefix carries into the recreated log:

// log_recovery_resetlog -- src/transaction/log_recovery.c
if (log_Gl.append.vdes == NULL_VOLDES
|| (log_Gl.hdr.fpageid > new_append_lsa->pageid && new_append_lsa->offset > 0))
{
// ... condensed ... (rationale comment)
newappend_pgptr = (LOG_PAGE *) aligned_newappend_pgbuf;
if ((logpb_fetch_page (thread_p, new_append_lsa, LOG_CS_FORCE_USE, newappend_pgptr)) != NO_ERROR)
{
newappend_pgptr = NULL; /* <- tolerated: the page copy is best-effort */
}
}
LOG_RESET_APPEND_LSA (new_append_lsa);

  3. Reset header state. chkpt_lsa = append_lsa (the truncated tail is the new checkpoint), is_shutdown = false, logpb_invalidate_pool.
  4. Two regimes. If log_Gl.append.vdes == NULL_VOLDES || log_Gl.hdr.fpageid > log_Gl.hdr.append_lsa.pageid (no active log, or the append point moved before the active range), the log is rebuilt: arv_num = logpb_get_archive_number (append page - 1) + 1 names the first unneeded archive (-1 fatal); log_recovery_notpartof_archives (§11.3) removes from there up (reason strdup-ed, raw fallback); the header is rewritten as if the log began here: fpageid = nxarv_pageid = append_lsa.pageid, nxarv_num = arv_num, last_arv_num_for_syscrashes = last_deleted_arv_num = -1. A missing active log file is recreated (disk_get_db_creation, fileio_format, logpb_create_header_page, logpb_flush_page; failures fatal), and either way a fresh first append page is created and flushed. Otherwise only nxarv_pageid is clamped down if it lies past the new append page.
  5. Re-seed the append page. logpb_fetch_start_append_page; on success a step-2 saved image is memcpy-ed over the fetched buffer, marked dirty, flushed direct. If logpb_fetch_start_append_page fails, the restore-and-flush is skipped silently (no error is raised) and finalization proceeds regardless.
  6. Finalize. LOG_RESET_PREV_LSA (new_prev_lsa); mvcc_op_log_lsa.set_null () and vacuum_last_blockid = 0 disconnect vacuum from truncated ranges; was_active_log_reset = true; logpb_flush_header; logpb_decache_archive_info.

Invariant — after resetlog, every position-bearing header field points at or before the new append LSA. chkpt_lsa, fpageid, nxarv_pageid, prev_lsa, and the vacuum anchors are rewritten in one LOG_CS critical section; a missed field would send vacuum or the archiver chasing truncated pages.

Archives start_arv_num and up describe truncated pages. Two scan modes, keyed on whether the active log (a trustworthy header) is mounted:

// log_recovery_notpartof_archives -- src/transaction/log_recovery.c
if (log_Gl.append.vdes != NULL_VOLDES)
{
/* Trust the current log header */
// ... condensed ... (unformat archives start_arv_num .. nxarv_num - 1)
}
else
{
/* We don't know where to stop. Stop when an archive is not in the OS */
for (i = start_arv_num; i <= INT_MAX; i++)
{
fileio_make_log_archive_name (logarv_name, log_Archive_path, log_Prefix, i);
if (fileio_is_volume_exist (logarv_name) == false)
{
// ... condensed ... /* <- rebuild name of archive i-1, the LAST removed */
break;
}
fileio_unformat (thread_p, logarv_name);
}
}

With info_reason non-NULL and at least one archive removed (start_arv_num != i), a REMOVE ... REASON line goes to the log-info file via log_dump_log_info (single-vs-range format branch); errors other than ER_LOG_MOUNT_FAIL return early, before the header update (the files are already gone). Finally log_Gl.hdr.last_deleted_arv_num = (start_arv_num == i) ? i : i - 1 (set even on a no-op call — a quirk); the header is flushed only when the active log is mounted; logpb_decache_archive_info is left to callers.
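
The unknown-end scan and the closing `last_deleted_arv_num` expression can be sketched with a set of archive numbers standing in for files on disk. `ScanResult` and `remove_archives_from` are hypothetical; only the stop-at-first-missing loop and the ternary mirror the description above.

```cpp
#include <cassert>
#include <set>

struct ScanResult {
    int removed;               // how many archives were deleted
    int last_deleted_arv_num;  // value the header field would receive
};

// Removes consecutive archives starting at start_arv_num and stops at
// the first number that is not on "disk".
ScanResult remove_archives_from(std::set<int> &on_disk, int start_arv_num) {
    int i = start_arv_num;
    while (on_disk.count(i) != 0) {
        on_disk.erase(i);
        ++i;
    }
    // Same expression as in the text: on a no-op call (i never moved)
    // the field is still overwritten, with start_arv_num itself.
    int last = (start_arv_num == i) ? i : i - 1;
    return ScanResult{i - start_arv_num, last};
}
```

With archives {3, 4, 5, 7} and a start of 4, archives 4 and 5 are removed and the field becomes 5; archive 7 survives because the scan stops at the gap. A start of 10 removes nothing yet still sets the field to 10, which is the quirk noted above.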

When did_incom_recovery is set, the driver calls log_recovery_notpartof_volumes after the undo pass. The boundary: start_volid = boot_find_next_permanent_volid (thread_p), the first volid the restored catalog does not know about. Two sweeps:

Sweep 1 — already-mounted volumes. fileio_map_mounted runs log_unformat_ahead_volumes over every mounted volume: if volid != NULL_VOLID && volid >= *start_volid, buffer-pool pages are dropped first (pgbuf_invalidate_all, so no stale dirty page is later flushed into it), then the volume is fileio_unformat-ed and its label freed. If invalidation fails the callback returns false, stopping the map early; stragglers fall to sweep 2.

Sweep 2 — volumes laying around on disk. Extension-named candidates are probed from start_volid to LOG_MAX_DBVOLID, breaking at the first missing name. Each candidate is mounted, its creation time read via disk_get_db_creation, and dismounted; only if difftime (vol_dbcreation, log_Gl.hdr.db_creation) == 0 is it unformatted. The db_creation timestamp is the identity test — an unrelated database’s same-named volume is a deliberate NO-OP. A candidate that exists but fails to mount (vdes == NULL_VOLDES) is skipped silently — never unformatted. The extension directory derives from log_Db_fullname (empty-string fallback on malloc or fileio_get_directory_path failure). logpb_recreate_volume_info then rebuilds the volume-info file.

log_startof_nxrec answers: where does the next record start? Analysis uses it (§11.1, repair 4) when the last record’s forw_lsa is NULL but the record is complete. Branches:

  • NULL input LSA → return NULL; logpb_fetch_page failure → goto error. lsa->offset == NULL_OFFSET (page from an archive cut mid-record) → adopt log_pgptr->hdr.offset, the first record the page knows; still NULL → error.
  • canuse_forwaddr == true → take log_rec->forw_lsa; if NULL but the page lives in an archive, the next record can only be at pageid + 1 (incomplete record archived, completed later). Only if still NULL does it fall to manual scan.
  • Manual scan: advance past LOG_RECORD_HEADER, then a switch (type) steps over the type-specific header plus every variable payload — undo/redo images (GET_ZIP_LEN-decoded), postpone/compensate lengths, checkpoint arrays, savepoint names, 2PC and replication payloads, the sysop family’s conditional embedded undo image (Chapter 5); fixed-size markers just break. The epilogue LOG_READ_ADVANCE_WHEN_DOESNT_FIT rounds up to the next page when another record header cannot fit.
  • Two quirks: LOG_SUPPLEMENTAL_INFO lacks a break and falls into the marker group — currently harmless since those cases do nothing; LOG_END_OF_LOG is assert (false) — no caller asks for the record after end-of-log.

Undo, postpone, and sysop-abort code expects the current transaction in thread_p->tran_index or an attached system tdes. Recovery’s thread owns LOG_SYSTEM_TRAN_INDEX and walks other transactions’ chains, so each per-tdes operation is bracketed by a shim pair:

// log_rv_simulate_runtime_worker -- src/transaction/log_recovery.c
if (tdes->is_active_worker_transaction ())
  {
    thread_p->tran_index = tdes->tran_index; /* <- runtime code now sees this tdes as "mine" */
    // ... condensed ... (SA_MODE: mirror via LOG_SET_CURRENT_TRAN_INDEX)
  }
else if (tdes->is_system_worker_transaction ())
  {
    log_system_tdes::rv_simulate_system_tdes (tdes->trid); /* <- attach system tdes to thread */
  }
else
  {
    assert (false);
  }

// log_rv_end_simulation -- src/transaction/log_recovery.c
thread_p->reset_system_tdes ();
thread_p->tran_index = LOG_SYSTEM_TRAN_INDEX; /* <- unconditional restore */
// ... condensed ... (SA_MODE: mirror restore)

Both shims keep the SA-mode global mirror (LOG_SET_CURRENT_TRAN_INDEX under #if defined (SA_MODE)). For a system worker transaction (Chapter 4’s rebuilt log_system_tdes population) rv_simulate_system_tdes looks the trid up in systb_System_tdes (asserting on a miss) and installs it via set_system_tdes.

Invariant — every simulate is paired with an end; the thread is back on LOG_SYSTEM_TRAN_INDEX between transactions. log_rv_undo_record (Chapter 9) closes the pair after its end: label, so error paths restore the thread too; log_recovery_finish_all_postpone and log_recovery_abort_all_atomic_sysops (Chapter 8) wrap it in a per-tdes lambda and assert tran_index == LOG_SYSTEM_TRAN_INDEX on entry. A missing end would leave a stale system tdes attached, logging for the wrong transaction.
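The pairing invariant can be illustrated with a small stand-alone model of the per-tdes wrapper. Everything below is hypothetical (`toy_thread`, `toy_tdes`, `toy_simulate`, `toy_end_simulation`, `toy_for_each_tdes`); it only mirrors the shape described above: assert the thread is on the system index on entry, bind, run the operation, unbind unconditionally.

```cpp
#include <cassert>
#include <functional>
#include <vector>

constexpr int TOY_SYSTEM_TRAN_INDEX = 0;  // stands in for LOG_SYSTEM_TRAN_INDEX

// Hypothetical, heavily reduced transaction descriptor.
struct toy_tdes
{
  int tran_index;
  bool is_system_worker;
  int trid;
};

// Hypothetical, heavily reduced thread entry.
struct toy_thread
{
  int tran_index = TOY_SYSTEM_TRAN_INDEX;
  const toy_tdes *system_tdes = nullptr;
};

void
toy_simulate (toy_thread &thread, const toy_tdes &tdes)
{
  if (tdes.is_system_worker)
    {
      thread.system_tdes = &tdes;           // attach the system tdes; tran_index is untouched
    }
  else
    {
      thread.tran_index = tdes.tran_index;  // runtime code now sees this tdes as its own
    }
}

void
toy_end_simulation (toy_thread &thread)
{
  thread.system_tdes = nullptr;
  thread.tran_index = TOY_SYSTEM_TRAN_INDEX;  // unconditional restore
}

// Runs 'op' once per descriptor, bracketing each call with the shim pair.
void
toy_for_each_tdes (toy_thread &thread, const std::vector<toy_tdes> &table,
                   const std::function<void (toy_thread &, const toy_tdes &)> &op)
{
  for (const toy_tdes &tdes : table)
    {
      // Entry check: the previous iteration must have ended its simulation.
      assert (thread.tran_index == TOY_SYSTEM_TRAN_INDEX && thread.system_tdes == nullptr);
      toy_simulate (thread, tdes);
      op (thread, tdes);
      toy_end_simulation (thread);
    }
}
```

Because the restore is unconditional, the wrapper does not need to remember which of the two branches the simulate step took.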

After undo and (if needed) log_recovery_notpartof_volumes, the driver counts distributed loose ends:

// log_recovery -- src/transaction/log_recovery.c
(void) logtb_set_num_loose_end_trans (thread_p);
/* Try to finish any 2PC blocked transactions */
if (log_Gl.trantable.num_coord_loose_end_indices > 0 || log_Gl.trantable.num_prepared_loose_end_indices > 0)
  {
    log_Gl.rcv_phase = LOG_RECOVERY_FINISH_2PC_PHASE;
    // ... condensed ...
    log_2pc_recovery (thread_p);
    /* Check number of loose end transactions again.. */
    // ... condensed ... (reset rcv_tdes, re-bind tran index)
    (void) logtb_set_num_loose_end_trans (thread_p);
  }

logtb_set_num_loose_end_trans zeroes both counters under TR_TABLE_CS_ENTER and walks every non-system tdes with a valid trid through logtb_set_loose_end_tdes: LOG_ISTRAN_2PC_PREPARE sets isloose_end and bumps num_prepared_loose_end_indices (in-doubt participant; keeps locks); LOG_ISTRAN_2PC_IN_SECOND_PHASE or TRAN_UNACTIVE_2PC_COLLECTING_PARTICIPANT_VOTES bumps num_coord_loose_end_indices (coordinator re-drives its decision). The driver keys off the two globals, not the returned sum.
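A reduced model of the counting pass may help. The names below (`loose_tdes`, `loose_state`, `loose_counters`, `toy_set_num_loose_end_trans`) are hypothetical, the critical section is omitted because the sketch is single-threaded, and the flag is set only in the prepared case because that is the only case the text above states.

```cpp
#include <cassert>
#include <vector>

constexpr int TOY_NULL_TRANID = -1;  // stands in for NULL_TRANID

// Hypothetical reduction of the 2PC-relevant transaction states.
enum class loose_state { ACTIVE, PREPARED_2PC, SECOND_PHASE_2PC, COLLECTING_VOTES };

struct loose_tdes
{
  int trid;
  bool is_system;
  loose_state state;
  bool isloose_end;
};

// Stands in for the two transaction-table globals the driver inspects.
struct loose_counters
{
  int num_coord_loose_end = 0;
  int num_prepared_loose_end = 0;
};

// Recounts loose ends from scratch and returns their sum.
int
toy_set_num_loose_end_trans (std::vector<loose_tdes> &table, loose_counters &counters)
{
  counters = loose_counters ();  // both counters are zeroed before the walk
  for (loose_tdes &tdes : table)
    {
      if (tdes.is_system || tdes.trid == TOY_NULL_TRANID)
        {
          continue;  // only non-system descriptors with a valid trid are considered
        }
      switch (tdes.state)
        {
        case loose_state::PREPARED_2PC:
          tdes.isloose_end = true;  // in-doubt participant
          counters.num_prepared_loose_end++;
          break;
        case loose_state::SECOND_PHASE_2PC:
        case loose_state::COLLECTING_VOTES:
          counters.num_coord_loose_end++;  // coordinator must re-drive its decision
          break;
        case loose_state::ACTIVE:
          break;
        }
    }
  return counters.num_coord_loose_end + counters.num_prepared_loose_end;
}
```

Zeroing first is what makes the function safe to call twice, which the driver relies on when it recounts after log_2pc_recovery.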

log_2pc_recovery sweeps the table — skipping tdes == NULL, NULL_TRANID, and !LOG_ISTRAN_2PC (tdes) — and switches on tdes->state: collecting-votes aborts the undecided coordinator, abort/commit-decision re-executes the decision, and TRAN_UNACTIVE_WILL_COMMIT / TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE [[fallthrough]] into informing participants — local postpones are already done (Chapter 8). Vote mechanics belong to the 2PC document and the high-level companion’s “Transaction table with loose-end annotations”; here only the handoff condition matters: the fourth phase runs iff a coordinator or prepared loose end survived analysis. Prepared participants without a verdict legitimately remain in-doubt — hence a recount, not an assert-zero.
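The state dispatch and the handoff condition can be summarised as two small pure functions. This is a sketch under assumptions: `coord_state`, `recovery_action`, `toy_2pc_action`, and `toy_needs_finish_2pc_phase` are invented names, the enumerators only approximate the real TRAN_* states, and the action labels paraphrase the prose above rather than name real functions.

```cpp
#include <cassert>

// Hypothetical reduction of the states log_2pc_recovery is described as switching on.
enum class coord_state
{
  COLLECTING_VOTES,
  ABORT_DECISION,
  COMMIT_DECISION,
  WILL_COMMIT,
  COMMITTED_WITH_POSTPONE,
  PREPARED,
  OTHER
};

// Hypothetical labels for what recovery does with a transaction in each state.
enum class recovery_action
{
  ABORT_UNDECIDED_COORDINATOR,
  REEXECUTE_DECISION,
  INFORM_PARTICIPANTS,
  LEAVE_IN_DOUBT,
  NONE
};

recovery_action
toy_2pc_action (coord_state state)
{
  switch (state)
    {
    case coord_state::COLLECTING_VOTES:
      return recovery_action::ABORT_UNDECIDED_COORDINATOR;
    case coord_state::ABORT_DECISION:
    case coord_state::COMMIT_DECISION:
      return recovery_action::REEXECUTE_DECISION;
    case coord_state::WILL_COMMIT:
    case coord_state::COMMITTED_WITH_POSTPONE:
      // Local postpones were finished earlier, so both fall into informing participants.
      return recovery_action::INFORM_PARTICIPANTS;
    case coord_state::PREPARED:
      return recovery_action::LEAVE_IN_DOUBT;  // no verdict yet: keep waiting
    case coord_state::OTHER:
      return recovery_action::NONE;
    }
  return recovery_action::NONE;  // unreachable; keeps compilers quiet
}

// Handoff condition: the 2PC phase runs iff either counter is positive.
bool
toy_needs_finish_2pc_phase (int num_coord_loose_end, int num_prepared_loose_end)
{
  return num_coord_loose_end > 0 || num_prepared_loose_end > 0;
}
```

A transaction mapped to LEAVE_IN_DOUBT still counts as a loose end after the sweep, which is why the driver recounts instead of asserting that the counters reached zero.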

  1. Incomplete recovery is media-crash-only: a normal restart forces *stopat = -1; a broken page outside media recovery is fatal, never truncated.
  2. Three triggers cut the log — completion or commit-with-postpone newer than stopat, or an unreadable page — all via log_recovery_resetlog; a missing end-of-log is patched via log_startof_nxrec.
  3. log_recovery_resetlog rewrites every position-bearing header field in one LOG_CS section and delegates archive removal to log_recovery_notpartof_archives.
  4. Volume discard is two-phase and identity-checked by db_creation; buffer pages are invalidated before unformat; mount failures are skipped.
  5. log_startof_nxrec walks record lengths type by type — a new payload layout means a new switch arm; LOG_SUPPLEMENTAL_INFO’s missing break is only accidentally harmless.
  6. The simulate/end shims bind the thread to a worker or system tdes so runtime code runs unmodified; pairing is structural; both keep the SA-mode mirror.
  7. LOG_RECOVERY_FINISH_2PC_PHASE runs only when coordinator or prepared loose ends survive; prepared participants may stay in-doubt.

The following are line numbers as observed on 2026-06-11; symbols are the canonical anchor and line numbers are hints that decay.

| Symbol | File | Line |
|---|---|---|
| vacuum_notify_server_crashed | src/query/vacuum.c | 7570 |
| btree_rv_record_modify_internal | src/storage/btree.c | 29757 |
| NULL_OFFSET | src/storage/storage_common.h | 49 |
| RECORD_REPLACE_DATA | src/storage/storage_common.h | 231 |
| log_2pc_recovery_analysis_info | src/transaction/log_2pc.c | 2029 |
| log_2pc_recovery | src/transaction/log_2pc.c | 2303 |
| LOG_RV_RECORD_MODIFY_MASK | src/transaction/log_append.hpp | 139 |
| LOG_PAGE_INIT_VALUE | src/transaction/log_common_impl.h | 46 |
| log_zip | src/transaction/log_compress.c | 45 |
| log_unzip | src/transaction/log_compress.c | 112 |
| log_diff | src/transaction/log_compress.c | 176 |
| log_zip_realloc_if_needed | src/transaction/log_compress.c | 203 |
| log_zip_alloc | src/transaction/log_compress.c | 238 |
| log_zip_free | src/transaction/log_compress.c | 279 |
| GET_ZIP_LEN | src/transaction/log_compress.h | 36 |
| ZIP_CHECK | src/transaction/log_compress.h | 39 |
| log_zip | src/transaction/log_compress.h | 53 |
| LOG_ISTRAN_2PC | src/transaction/log_impl.h | 173 |
| LOG_HAS_LOGGING_BEEN_IGNORED | src/transaction/log_impl.h | 190 |
| log_rcv_tdes | src/transaction/log_impl.h | 458 |
| log_recvphase | src/transaction/log_impl.h | 625 |
| log_cs_access_mode | src/transaction/log_impl.h | 923 |
| log_initialize_internal | src/transaction/log_manager.c | 1100 |
| log_append_compensate | src/transaction/log_manager.c | 2985 |
| log_append_compensate_with_undo_nxlsa | src/transaction/log_manager.c | 3011 |
| log_append_compensate_internal | src/transaction/log_manager.c | 3047 |
| log_sysop_end_recovery_postpone | src/transaction/log_manager.c | 4024 |
| log_complete | src/transaction/log_manager.c | 5653 |
| log_rollback_record | src/transaction/log_manager.c | 7349 |
| log_get_next_nested_top | src/transaction/log_manager.c | 8023 |
| log_do_postpone | src/transaction/log_manager.c | 8237 |
| log_run_postpone_op | src/transaction/log_manager.c | 8481 |
| log_execute_run_postpone | src/transaction/log_manager.c | 8543 |
| log_read_sysop_start_postpone | src/transaction/log_manager.c | 9962 |
| LOGPB_IS_ARCHIVE_PAGE | src/transaction/log_page_buffer.c | 155 |
| logpb_page_has_valid_checksum | src/transaction/log_page_buffer.c | 523 |
| logpb_fetch_page | src/transaction/log_page_buffer.c | 1739 |
| logpb_copy_page | src/transaction/log_page_buffer.c | 1871 |
| logpb_read_page_from_file | src/transaction/log_page_buffer.c | 2003 |
| logpb_fetch_start_append_page | src/transaction/log_page_buffer.c | 2504 |
| logpb_page_get_first_null_block_lsa | src/transaction/log_page_buffer.c | 3190 |
| logpb_is_page_in_archive | src/transaction/log_page_buffer.c | 4994 |
| logpb_copy_from_log | src/transaction/log_page_buffer.c | 6532 |
| logpb_checkpoint | src/transaction/log_page_buffer.c | 6877 |
| logpb_page_check_corruption | src/transaction/log_page_buffer.c | 11508 |
| log_reader | src/transaction/log_reader.hpp | 36 |
| log_reader::set_lsa_and_fetch_page | src/transaction/log_reader.hpp | 162 |
| LOG_READ_ALIGN | src/transaction/log_reader.hpp | 315 |
| log_rec_undo | src/transaction/log_record.hpp | 176 |
| log_vacuum_info | src/transaction/log_record.hpp | 192 |
| log_rec_mvcc_undo | src/transaction/log_record.hpp | 211 |
| log_rec_compensate | src/transaction/log_record.hpp | 262 |
| log_sysop_end_type | src/transaction/log_record.hpp | 285 |
| log_rec_sysop_end | src/transaction/log_record.hpp | 305 |
| log_rec_sysop_start_postpone | src/transaction/log_record.hpp | 328 |
| log_rec_chkpt | src/transaction/log_record.hpp | 345 |
| log_info_chkpt_trans | src/transaction/log_record.hpp | 354 |
| log_info_chkpt_sysop | src/transaction/log_record.hpp | 372 |
| log_rv_undo_record | src/transaction/log_recovery.c | 163 |
| log_rv_redo_record | src/transaction/log_recovery.c | 430 |
| log_rv_fix_page_and_check_redo_is_needed | src/transaction/log_recovery.c | 494 |
| log_rv_need_sync_redo | src/transaction/log_recovery.c | 541 |
| log_rv_find_checkpoint | src/transaction/log_recovery.c | 579 |
| log_rv_get_unzip_log_data | src/transaction/log_recovery.c | 609 |
| log_rv_get_unzip_and_diff_redo_log_data | src/transaction/log_recovery.c | 699 |
| log_recovery | src/transaction/log_recovery.c | 736 |
| log_rv_analysis_undo_redo | src/transaction/log_recovery.c | 965 |
| log_rv_analysis_dummy_head_postpone | src/transaction/log_recovery.c | 1000 |
| log_rv_analysis_postpone | src/transaction/log_recovery.c | 1042 |
| log_rv_analysis_run_postpone | src/transaction/log_recovery.c | 1086 |
| log_rv_analysis_compensate | src/transaction/log_recovery.c | 1181 |
| log_rv_analysis_commit_with_postpone | src/transaction/log_recovery.c | 1230 |
| log_rv_analysis_commit_with_postpone_obsolete | src/transaction/log_recovery.c | 1315 |
| log_rv_analysis_sysop_start_postpone | src/transaction/log_recovery.c | 1365 |
| log_rv_analysis_atomic_sysop_start | src/transaction/log_recovery.c | 1472 |
| log_rv_analysis_complete | src/transaction/log_recovery.c | 1509 |
| log_rv_analysis_sysop_end | src/transaction/log_recovery.c | 1612 |
| log_rv_analysis_start_checkpoint | src/transaction/log_recovery.c | 1797 |
| log_rv_analysis_end_checkpoint | src/transaction/log_recovery.c | 1830 |
| log_rv_analysis_save_point | src/transaction/log_recovery.c | 2077 |
| log_rv_analysis_2pc_prepare | src/transaction/log_recovery.c | 2114 |
| log_rv_analysis_2pc_start | src/transaction/log_recovery.c | 2153 |
| log_rv_analysis_2pc_commit_decision | src/transaction/log_recovery.c | 2190 |
| log_rv_analysis_2pc_abort_decision | src/transaction/log_recovery.c | 2224 |
| log_rv_analysis_2pc_commit_inform_particps | src/transaction/log_recovery.c | 2258 |
| log_rv_analysis_2pc_abort_inform_particps | src/transaction/log_recovery.c | 2293 |
| log_rv_analysis_2pc_recv_ack | src/transaction/log_recovery.c | 2328 |
| log_rv_analysis_log_end | src/transaction/log_recovery.c | 2355 |
| log_rv_analysis_record | src/transaction/log_recovery.c | 2378 |
| log_is_page_of_record_broken | src/transaction/log_recovery.c | 2518 |
| log_recovery_analysis | src/transaction/log_recovery.c | 2587 |
| log_recovery_needs_skip_logical_redo | src/transaction/log_recovery.c | 3153 |
| log_recovery_get_redo_parallel_count | src/transaction/log_recovery.c | 3197 |
| log_recovery_redo | src/transaction/log_recovery.c | 3251 |
| BUILD_RECORD_INFO | src/transaction/log_recovery.c | 3468 |
| INVOKE_REDO_RECORD | src/transaction/log_recovery.c | 3471 |
| log_recovery_abort_interrupted_sysop | src/transaction/log_recovery.c | 3960 |
| log_recovery_finish_sysop_postpone | src/transaction/log_recovery.c | 4064 |
| log_recovery_finish_postpone | src/transaction/log_recovery.c | 4174 |
| log_recovery_finish_all_postpone | src/transaction/log_recovery.c | 4243 |
| log_recovery_abort_all_atomic_sysops | src/transaction/log_recovery.c | 4280 |
| log_recovery_abort_atomic_sysop | src/transaction/log_recovery.c | 4317 |
| log_recovery_undo | src/transaction/log_recovery.c | 4418 |
| log_recovery_notpartof_archives | src/transaction/log_recovery.c | 4997 |
| log_unformat_ahead_volumes | src/transaction/log_recovery.c | 5100 |
| log_recovery_notpartof_volumes | src/transaction/log_recovery.c | 5132 |
| log_recovery_resetlog | src/transaction/log_recovery.c | 5221 |
| log_startof_nxrec | src/transaction/log_recovery.c | 5414 |
| log_recovery_find_first_postpone | src/transaction/log_recovery.c | 5793 |
| log_rv_undoredo_partial_changes_recursive | src/transaction/log_recovery.c | 6048 |
| log_rv_undoredo_record_partial_changes | src/transaction/log_recovery.c | 6144 |
| log_rv_redo_record_modify | src/transaction/log_recovery.c | 6173 |
| log_rv_undo_record_modify | src/transaction/log_recovery.c | 6191 |
| log_rv_record_modify_internal | src/transaction/log_recovery.c | 6210 |
| log_rv_pack_redo_record_changes | src/transaction/log_recovery.c | 6310 |
| log_rv_pack_undo_record_changes | src/transaction/log_recovery.c | 6352 |
| log_rv_redo_fix_page | src/transaction/log_recovery.c | 6390 |
| log_rv_simulate_runtime_worker | src/transaction/log_recovery.c | 6417 |
| log_rv_end_simulation | src/transaction/log_recovery.c | 6438 |
| log_cnt_pages_containing_lsa | src/transaction/log_recovery.c | 6449 |
| log_find_unilaterally_largest_undo_lsa | src/transaction/log_recovery.c | 6470 |
| vpid_lsa_consistency_check::check | src/transaction/log_recovery_redo.cpp | 28 |
| log_rv_redo_context::log_rv_redo_context | src/transaction/log_recovery_redo.cpp | 52 |
| log_rv_redo_context | src/transaction/log_recovery_redo.hpp | 33 |
| log_rv_redo_rec_info | src/transaction/log_recovery_redo.hpp | 53 |
| log_rv_get_log_rec_data | src/transaction/log_recovery_redo.hpp | 112 |
| log_rv_get_log_rec_mvccid | src/transaction/log_recovery_redo.hpp | 163 |
| log_rv_get_log_rec_vpid | src/transaction/log_recovery_redo.hpp | 206 |
| log_rv_get_log_rec_redo_length | src/transaction/log_recovery_redo.hpp | 273 |
| log_rv_get_log_rec_offset | src/transaction/log_recovery_redo.hpp | 316 |
| log_rv_get_fun | src/transaction/log_recovery_redo.hpp | 359 |
| log_rv_get_fun<LOG_REC_COMPENSATE> | src/transaction/log_recovery_redo.hpp | 396 |
| log_rv_get_fun | src/transaction/log_recovery_redo.hpp | 396 |
| log_rv_get_log_rec_redo_data | src/transaction/log_recovery_redo.hpp | 457 |
| vpid_lsa_consistency_check | src/transaction/log_recovery_redo.hpp | 558 |
| log_rv_redo_record_sync | src/transaction/log_recovery_redo.hpp | 587 |
| redo_task | src/transaction/log_recovery_redo_parallel.cpp | 99 |
| redo_task::execute | src/transaction/log_recovery_redo_parallel.cpp | 221 |
| redo_parallel::add | src/transaction/log_recovery_redo_parallel.cpp | 626 |
| redo_parallel::wait_for_termination_and_stop_execution | src/transaction/log_recovery_redo_parallel.cpp | 635 |
| redo_parallel::wait_past_target_lsa | src/transaction/log_recovery_redo_parallel.cpp | 728 |
| redo_job_impl::execute | src/transaction/log_recovery_redo_parallel.cpp | 752 |
| reusable_jobs_stack::blocking_pop | src/transaction/log_recovery_redo_parallel.cpp | 868 |
| redo_parallel | src/transaction/log_recovery_redo_parallel.hpp | 55 |
| task_active_state_bookkeeping | src/transaction/log_recovery_redo_parallel.hpp | 100 |
| min_unapplied_log_lsa_monitoring | src/transaction/log_recovery_redo_parallel.hpp | 131 |
| redo_job_base | src/transaction/log_recovery_redo_parallel.hpp | 215 |
| redo_job_impl | src/transaction/log_recovery_redo_parallel.hpp | 269 |
| reusable_jobs_stack | src/transaction/log_recovery_redo_parallel.hpp | 306 |
| log_rv_redo_record_sync_or_dispatch_async | src/transaction/log_recovery_redo_parallel.hpp | 382 |
| perf_stats | src/transaction/log_recovery_redo_perf.hpp | 105 |
| log_system_tdes::rv_simulate_system_tdes | src/transaction/log_system_tran.cpp | 174 |
| log_system_tdes::map_all_tdes | src/transaction/log_system_tran.cpp | 253 |
| log_system_tdes::rv_delete_all_tdes_if | src/transaction/log_system_tran.cpp | 265 |
| log_system_tdes::rv_delete_tdes | src/transaction/log_system_tran.cpp | 281 |
| logtb_rv_find_allocate_tran_index | src/transaction/log_tran_table.c | 1056 |
| logtb_rv_assign_mvccid_for_undo_recovery | src/transaction/log_tran_table.c | 1115 |
| logtb_free_tran_index | src/transaction/log_tran_table.c | 1202 |
| logtb_free_tran_index_with_undo_lsa | src/transaction/log_tran_table.c | 1281 |
| logtb_set_loose_end_tdes | src/transaction/log_tran_table.c | 4124 |
| logtb_set_num_loose_end_trans | src/transaction/log_tran_table.c | 4170 |
| logtb_rv_read_only_map_undo_tdes | src/transaction/log_tran_table.c | 4204 |
| mvcctable::reset_start_mvccid | src/transaction/mvcc_table.cpp | 600 |
| RV_fun | src/transaction/recovery.c | 54 |
| rv_rcvindex_string | src/transaction/recovery.c | 857 |
| rv_check_rvfuns | src/transaction/recovery.c | 872 |
| LOG_RCVINDEX | src/transaction/recovery.h | 36 |
| log_rcv | src/transaction/recovery.h | 197 |
| rvfun | src/transaction/recovery.h | 221 |
| RCV_IS_BTREE_LOGICAL_LOG | src/transaction/recovery.h | 241 |
| RCV_IS_LOGICAL_COMPENSATE_MANUAL | src/transaction/recovery.h | 253 |
| RCV_IS_LOGICAL_RUN_POSTPONE_MANUAL | src/transaction/recovery.h | 261 |
| RCV_IS_LOGICAL_LOG | src/transaction/recovery.h | 267 |

  • cubrid-recovery-manager.md — the high-level companion. See also cubrid-log-manager-detail.md (how the replayed records were appended) and cubrid-checkpoint.md (the restart anchor).
  • Raw analyses under raw/code-analysis/cubrid/storage/recovery_manager/.
  • Code: src/transaction/log_recovery.{c,h}, log_recovery_redo.{cpp,hpp}, log_recovery_redo_parallel.{cpp,hpp}, recovery.{c,h}.
  • Methodology: knowledge/methodology/code-analysis-detail-doc.md.