CUBRID Recovery Manager — Code-Level Deep Dive
Where this document fits: The high-level analysis cubrid-recovery-manager.md covers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full restart of a crashed database inside the kernel.
Chapter 1: Data-Structure Map
Theory lives in the companion cubrid-recovery-manager.md (“The recovery
dispatch table”, “Redo pass — modern dispatch via templates”); this chapter
pins down every field of every recovery-side structure and the pointers
between them.
1.1 Overview — who points at whom
flowchart TB
LD["LOG_DATA\nrcvindex / vpid / offset"]
RVF["RV_fun[rcvindex]\n(struct rvfun)"]
CTX["log_rv_redo_context"]
RCV["LOG_RCV"]
RECINFO["log_rv_redo_rec_info<T>"]
FUNC["undofun / redofun"]
LD -->|"selects"| RVF
RVF --> FUNC
RECINFO -->|"typed header copy"| RCV
CTX -->|"unzip buffer feeds rcv.data"| RCV
FUNC -->|"called with &rcv"| RCV
Figure 1-1 — the rcvindex selects an RV_fun entry; the redo context
unzips the payload into the LOG_RCV the chosen function receives.
1.2 LOG_RCV — the universal recovery argument
Every undo, redo, compensate, and run-postpone function has signature
int (*)(THREAD_ENTRY *, LOG_RCV *) — LOG_RCV is the narrow waist.
```cpp
// log_rcv -- src/transaction/recovery.h
struct log_rcv
{
  /* Recovery information */
  MVCCID mvcc_id = MVCCID_NULL;       /* mvcc id */
  PAGE_PTR pgptr = nullptr;           /* Page to recover. Page should not be free by recovery functions,
                                       * however it should be set dirty whenever is needed */
  // ... condensed: PGLENGTH offset; int length ...
  const char *data = nullptr;         /* Replacement data. Pointer becomes invalid once the recovery
                                       * of the data is finished */
                                      /* <- borrowed, see invariant below */
  LOG_LSA reference_lsa = NULL_LSA;   /* Next LSA used by compensate/postpone. */

  // ... condensed: default ctor; copy/move ctors and both assignments deleted ...
};
```

| Field | Role | Why it exists |
|---|---|---|
mvcc_id | MVCCID for MVCC-class records, else MVCCID_NULL | record-header field; set only by the MVCC log_rv_get_log_rec_mvccid specializations |
pgptr | page to recover, fixed by the driver; nullptr for logical records | fix/unfix is centralized in the driver |
offset | offset or slot id within pgptr, from LOG_DATA.offset | physical recovery is page+offset addressed |
length | byte length of data | raw buffer, no terminator |
data | redo replacement or undo before-image | points into a LOG_ZIP buffer or the log page — lifetime rule below |
reference_lsa | compensate: the transaction’s undo_nxlsa at undo time — the next LSA the undo chain resumes from, handed to log_sysop_end_logical_compensate; both log_rollback_record (runtime rollback) and log_rv_undo_record (restart undo) fill it. Run-postpone: LSA of the original postpone record, filled by log_execute_run_postpone | anchor for the manual logical functions (1.7) that append their own compensation / run-postpone records |
Invariant (borrowed-data lifetime). rcv.data and rcv.pgptr are
loans: data aliases the unzip buffer (m_redo_zip.log_data) or the log
page; pgptr is unfixed by the caller on return. Enforcement: all four
copy/move operations are deleted, and log_rv_redo_record_sync builds a fresh
stack-local LOG_RCV per record, with a scope_exit unfixing pgptr. A function
that stashes rcv->data sees the next record’s unzip silently overwrite it,
corrupting the replay.
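The lifetime rule above can be illustrated with a minimal sketch. Everything here is hypothetical: rcv_sketch stands in for LOG_RCV with only two fields, and the stack array stands in for the reused unzip buffer. It shows why a function that needs the payload later has to copy the bytes rather than keep the pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <type_traits>

// Hypothetical stand-in for LOG_RCV: the data pointer is a loan from the caller.
struct rcv_sketch
{
  const char *data = nullptr;   // borrowed: aliases the caller's unzip buffer
  int length = 0;

  rcv_sketch () = default;
  rcv_sketch (const rcv_sketch &) = delete;
  rcv_sketch (rcv_sketch &&) = delete;
  rcv_sketch &operator= (const rcv_sketch &) = delete;
  rcv_sketch &operator= (rcv_sketch &&) = delete;
};

struct replay_result_sketch
{
  std::string kept_copy;        // bytes copied out while the loan was valid
  bool borrowed_view_changed;   // true if the old pointer now reads other bytes
};

// Replays two "records" through one reused buffer (assumed shape, not CUBRID code).
inline replay_result_sketch
replay_two_records_sketch ()
{
  char unzip_buffer[8];
  std::memcpy (unzip_buffer, "first", 6);

  rcv_sketch rcv;               // fresh stack-local argument for this record
  rcv.data = unzip_buffer;
  rcv.length = 5;

  // Safe: copy the payload while the loan is still valid.
  std::string kept (rcv.data, static_cast<std::size_t> (rcv.length));

  // The next record's unzip reuses the same buffer.
  std::memcpy (unzip_buffer, "next!", 6);

  const bool changed = std::memcmp (rcv.data, "first", 5) != 0;
  return replay_result_sketch { kept, changed };
}
```

The copied string survives the second unzip, while the borrowed pointer now reads the next record's bytes, which is the corruption the invariant warns about.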
1.3 rvfun and the RV_fun[] dispatch table
rvfun (recovery.h) bundles fun_t = int (*)(THREAD_ENTRY *, LOG_RCV *),
dump_fun_t = void (*)(FILE *, int, void *), and six fields;
extern struct rvfun RV_fun[] is initialized in recovery.c:
| Field | Role | Why it exists |
|---|---|---|
recv_index | copy of the entry’s own index (/* For verification */) | rv_check_rvfuns asserts RV_fun[i].recv_index == i at debug startup |
recv_string | name, e.g. "RVDK_FORMAT" | trace/dump output via rv_rcvindex_string |
undofun | applied by undo/rollback — and by redo of compensate records: log_rv_get_fun<LOG_REC_COMPENSATE> returns undofun (// yes, undo) | a CLR’s redo is the original undo |
redofun | applied by redo pass, run-postpone, HA replication apply | the forward image applier |
dump_undofun / dump_redofun | payload pretty-printers, NULL if none | log-dump tooling only |
rv_rcvindex_string is branch-free
(return RV_fun[rcvindex].recv_string;). rv_check_rvfuns only turns
initializer misordering into a debug-build startup failure (er_set plus
assert (false)); nothing guards an out-of-range argument such as
RV_NOT_DEFINED (999) — callers must pass a defined index.
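A small sketch of the two properties described above: the self-index check that catches initializer misordering, and the missing bounds check. The three-entry table, its enumerators, and rcvindex_string_checked_sketch are all hypothetical; the guarded lookup in particular is not something the source provides, it only shows what a caller would have to do itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical three-entry index space; values are illustrative only.
enum rcvindex_sketch
{
  RV_A_SKETCH = 0,
  RV_B_SKETCH = 1,
  RV_C_SKETCH = 2,
  RV_LAST_SKETCH = RV_C_SKETCH,
  RV_NOT_DEFINED_SKETCH = 999   // sentinel, far outside the table
};

struct rvfun_sketch
{
  int recv_index;               // copy of the entry's own index, for verification
  const char *recv_string;
};

static const rvfun_sketch RV_fun_sketch[] = {
  { RV_A_SKETCH, "RV_A" },
  { RV_B_SKETCH, "RV_B" },
  { RV_C_SKETCH, "RV_C" },
};

// Same idea as rv_check_rvfuns: every entry must sit at its own index.
inline bool
table_is_ordered_sketch (const rvfun_sketch *table, std::size_t count)
{
  for (std::size_t i = 0; i < count; i++)
    {
      if (table[i].recv_index != static_cast<int> (i))
        {
          return false;
        }
    }
  return true;
}

// Hypothetical guarded lookup; the real rv_rcvindex_string is branch-free.
inline const char *
rcvindex_string_checked_sketch (int rcvindex)
{
  if (rcvindex < 0 || rcvindex > RV_LAST_SKETCH)
    {
      return nullptr;           // the sentinel lands here instead of reading past the table
    }
  return RV_fun_sketch[rcvindex].recv_string;
}
```

The ordering check says nothing about the argument passed at lookup time, so the range test has to live at the call site.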
1.4 LOG_RCVINDEX — the index space, by family
Invariant (append-only numbering). Indices persist inside log records,
so renumbering replays the wrong function on old databases. The enum
header warns: “NEW ENTRIES SHOULD BE ADDED AT THE BOTTON OF THE FILE … to
AVOID OLD DATABASES TO BE RECOVERED UNDER OLD FILE” — hence
RVPGBUF_SET_TDE_ALGORITHM (127) far from its siblings (120–123).
RV_LAST_LOGID = RVHF_LOB_REMOVE_DIR (129) marks the top;
RV_NOT_DEFINED = 999 is the “no rcvindex” sentinel.
| Family | Range | Subsystem |
|---|---|---|
RVDK_* | 0–9 | disk manager |
RVFL_* | 10–32, 128 | file manager |
RVHF_* | 33–53, 126, 129 | heap |
RVOVF_* | 54–57 | overflow records |
RVEH_* | 58–65 | extendible hash |
RVBT_* | 66–91, 124–125 | b-tree, incl. logical-key set (1.7) |
RVCT_* | 92–96 | catalog pages |
RVLOG_* | 97 | logical-redo noop marker |
RVREPL_* | 98–103 | replication, HA appliers only |
RVVAC_* | 104–117 | vacuum |
RVES_* | 118 | external storage (LOB) |
RVLOC_* | 119 | locator classname dummy |
RVPGBUF_* | 120–123, 127 | page buffer |
1.5 Modern redo-side types
The redo pass (Ch 6) and each parallel-redo applier (Ch 7) own one
log_rv_redo_context (log_recovery_redo.hpp):
| Field | Role | Why it exists |
|---|---|---|
m_reader | log_reader cursor, built with LOG_CS_SAFE_READER | independent log position per applier |
m_redo_zip | unzip target for redo payloads; its log_data becomes rcv.data | output must outlive the recovery-function call |
m_undo_zip | unzip target for the undo half of diff undoredo records | LOG_DIFF_UNDOREDO_DATA stores redo as an XOR diff against undo |
m_end_redo_lsa | const upper bound; records at or past it are not redone | freezes the redo horizon before the pass |
m_reader_fetch_page_mode | const fetch mode for set_lsa_and_fetch_page; NORMAL refetches only when the pageid changes (do_fetch_page = FORCE || m_lsa.pageid != lsa.pageid) | the only constructor call (redo pass, log_recovery.c) passes NORMAL; FORCE is retained unused for future reuse (log_reader.hpp comment) |
Default constructor deleted; the two-argument constructor pre-grows both
buffers to LOGAREA_SIZE; move and both assignments deleted. The copy
constructor — the only allowed copy — delegates back with
(o.m_end_redo_lsa, o.m_reader_fetch_page_mode): only the two const
knobs survive, so each parallel-redo worker gets fresh buffers and reader.
Each applied record is a log_rv_redo_rec_info<T>: every special member is
deleted except the (log_lsa, LOG_RECTYPE, const T &) constructor — built
once, fully initialized, never reseated.
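The copy rule for the redo context can be sketched as follows. This is a simplified model under stated assumptions: redo_context_sketch, its long long LSA, and the std::vector buffer are hypothetical stand-ins, and only the delegation pattern (copy constructor forwarding the two const knobs to the main constructor) mirrors the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class fetch_mode_sketch { NORMAL, FORCE };

// Hypothetical model of a context whose copies share settings but not state.
class redo_context_sketch
{
  public:
    redo_context_sketch () = delete;

    redo_context_sketch (long long end_redo_lsa, fetch_mode_sketch mode)
      : m_end_redo_lsa (end_redo_lsa)
      , m_mode (mode)
    {
      m_redo_buffer.reserve (k_initial_capacity);   // pre-grow, like LOGAREA_SIZE
    }

    // The only allowed copy: delegate back with the two const knobs,
    // so the new object starts with its own empty buffer.
    redo_context_sketch (const redo_context_sketch &o)
      : redo_context_sketch (o.m_end_redo_lsa, o.m_mode)
    {
    }

    redo_context_sketch (redo_context_sketch &&) = delete;
    redo_context_sketch &operator= (const redo_context_sketch &) = delete;
    redo_context_sketch &operator= (redo_context_sketch &&) = delete;

    void unzip_into_buffer (const char *bytes, std::size_t n)
    {
      m_redo_buffer.assign (bytes, bytes + n);
    }

    std::size_t buffered_bytes () const { return m_redo_buffer.size (); }
    long long end_redo_lsa () const { return m_end_redo_lsa; }
    fetch_mode_sketch mode () const { return m_mode; }

  private:
    static constexpr std::size_t k_initial_capacity = 64;   // illustrative size

    const long long m_end_redo_lsa;
    const fetch_mode_sketch m_mode;
    std::vector<char> m_redo_buffer;
};
```

A worker built as a copy of the main context sees the same redo horizon and fetch mode, but none of the bytes the original has buffered.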
| Field | Role | Why it exists |
|---|---|---|
m_start_lsa | LSA of the record header | stamped onto the page after apply (pgbuf_set_lsa); key of the check below |
m_type | the LOG_RECTYPE | drives the LOG_DIFF_UNDOREDO_DATA XOR-diff branch in log_rv_get_log_rec_redo_data |
m_logrec | by-value copy of the typed body T — one of LOG_REC_{UNDOREDO, MVCC_UNDOREDO, REDO, MVCC_REDO, RUN_POSTPONE, COMPENSATE} | frees the reader to advance; log_rv_get_log_rec_* specializations extract vpid/mvccid/length/offset |
Invariant (per-page LSA ordering, debug builds). Redo for one page must
apply in log order even across threads; vpid_lsa_consistency_check
(compiled out under NDEBUG) checks a necessary condition of it:
```cpp
// vpid_lsa_consistency_check::check -- src/transaction/log_recovery_redo.cpp
  std::lock_guard<std::mutex> lck (mtx);
  const vpid_key_t key {a_vpid.volid, a_vpid.pageid};
  const auto map_it = consistency_check_map.find (key);
  if (map_it != consistency_check_map.cend ())
    {
      assert ((*map_it).second < a_log_lsa);           /* <- later applies must beat the stored LSA */
    }
  consistency_check_map.emplace (key, a_log_lsa);      /* <- emplace never overwrites an existing key */
```

| Field | Role | Why it exists |
|---|---|---|
mtx | serializes check and cleanup | the map (global log_Gl_recovery_redo_consistency_check) is hit by every applier |
consistency_check_map | per-page baseline — vpid_log_lsa_map_t maps vpid_key_t = (volid, pageid) to the first LSA applied to the page; emplace never overwrites, so the baseline never advances | the assert demands every later apply carry an LSA above the baseline — weaker than pairwise monotonicity (a swap between two later applies passes), but an image older than the first apply still trips it |
cleanup() clears the map after the pass; log_rv_redo_record_sync
consults it only while log_Gl.rcv_phase != LOG_RESTARTED.
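To make the "necessary but not sufficient" point concrete, here is a hedged model of the check. The class name, the use of std::map, and the integer LSA are assumptions for illustration; check returns false where the original would trip its assert.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <utility>

using vpid_key_sketch = std::pair<short, int>;   // (volid, pageid)
using lsa_sketch = long long;                    // simplified LSA

class consistency_check_sketch
{
  public:
    // Returns false where the real check would assert.
    bool check (short volid, int pageid, lsa_sketch lsa)
    {
      std::lock_guard<std::mutex> lck (m_mtx);
      const vpid_key_sketch key { volid, pageid };
      const auto it = m_map.find (key);
      bool ok = true;
      if (it != m_map.cend ())
        {
          ok = it->second < lsa;   // compare against the stored baseline
        }
      m_map.emplace (key, lsa);    // no effect when the key already exists
      return ok;
    }

    void cleanup ()
    {
      std::lock_guard<std::mutex> lck (m_mtx);
      m_map.clear ();
    }

  private:
    std::mutex m_mtx;
    std::map<vpid_key_sketch, lsa_sketch> m_map;
};
```

Because the baseline stays at the first LSA seen for a page, applying 300 and then 200 passes, while anything at or below the first apply is rejected.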
1.6 Analysis-side state
LOG_RCV_TDES (field rcv of log_tdes, log_impl.h) carries
analysis-pass discoveries into later passes (Ch 4, 5, 8). Five LSAs:
| Field | Role | Why it exists |
|---|---|---|
sysop_start_postpone_lsa | LSA of the LOG_SYSOP_START_POSTPONE in progress at crash | resume anchor for the sysop postpone phase (Ch 8) |
tran_start_postpone_lsa | where transaction-level postpone began | splits “committed, postpones pending” from plain active |
atomic_sysop_start_lsa | start of an interrupted atomic file op (file_perm_alloc / file_perm_dealloc) | must complete or roll back fully before postpones run (Ch 8) |
analysis_last_aborted_sysop_lsa | end LSA of the last sysop aborted during analysis (“to recover logical redo operation”) | logical redo must not re-enter the aborted range |
analysis_last_aborted_sysop_start_lsa | matching start LSA of that sysop | the other end of the bracket |
LOG_RECVPHASE (log_impl.h), the global mode switch
log_Gl.rcv_phase, is consulted far outside recovery (page-buffer fix
rules, the check above): LOG_RESTARTED (recovery done),
LOG_RECOVERY_ANALYSIS_PHASE (Ch 3–5), LOG_RECOVERY_REDO_PHASE (Ch 6–7),
LOG_RECOVERY_UNDO_PHASE (Ch 9), LOG_RECOVERY_FINISH_2PC_PHASE (Ch 11).
Checkpoint snapshot records (log_record.hpp): a fixed LOG_REC_CHKPT
header, then ntrans trans entries, then ntops sysop entries.
| log_rec_chkpt field | Role | Why it exists |
|---|---|---|
redo_lsa | “Oldest LSA of dirty data page in page buffers” (source comment) | redo-pass lower bound: the fuzzy-checkpoint contract |
ntrans | count of trans entries following | variable-sized record |
ntops | count of sysop entries after the trans array | same |
LOG_INFO_CHKPT_TRANS snapshots the same-named live log_tdes fields;
analysis re-creates a TDES per entry, then corrects it from the log tail
(Ch 4):
| Field | Role | Why it exists |
|---|---|---|
isloose_end | loose-end flag at checkpoint | marks 2PC/client loose ends |
trid | transaction identifier | key for re-creating the TDES |
state | TRAN_STATE at checkpoint | seeds loose-end classification |
head_lsa | first log record of the transaction | bounds the backward chain |
tail_lsa | last record at checkpoint | analysis scan resumes here |
undo_nxlsa | next record to undo, given CLRs logged during undo | rollback skips already-compensated work |
posp_nxlsa | first postpone record | where postpone execution starts |
savept_lsa | last savepoint | savepoint chain head, partial rollback |
tail_topresult_lsa | last partial abort/commit | nested-sysop resolution |
start_postpone_lsa | start-postpone address, if mid-postpone | such a transaction must finish postpones, not be undone |
user_name | client name (char[LOG_USERNAME_MAX]) | restored into the TDES |
LOG_INFO_CHKPT_SYSOP snapshots the two persistent sysop anchors of
LOG_RCV_TDES. The other three LOG_RCV_TDES LSAs never travel in it: the
analysis_last_aborted_* pair are products of the current analysis run,
never persisted, and tran_start_postpone_lsa rides in the per-transaction
entry instead, as LOG_INFO_CHKPT_TRANS.start_postpone_lsa:
| Field | Role | Why it exists |
|---|---|---|
trid | which transaction’s TDES the two LSAs are restored into | keyed by transaction, not parallel to the trans array |
sysop_start_postpone_lsa | saved rcv.sysop_start_postpone_lsa | the sysop state can predate the checkpoint |
atomic_sysop_start_lsa | saved rcv.atomic_sysop_start_lsa | same, for interrupted atomic file ops |
1.7 LOG_ZIP and the logical-classifier macros
LOG_ZIP (log_compress.h), the compression workspace of the write
path and (1.5) the redo context, owns log_data (freed by
log_zip_free_data); all four copy/move operations are deleted — a
member-wise copy would double-free:
| Field | Role | Why it exists |
|---|---|---|
data_length | valid bytes currently in log_data | after log_unzip, the length handed to rcv.length |
buf_size | allocated capacity | log_zip_realloc_if_needed grows it; sticky across records |
log_data | the owned buffer (“used as data buffer”) | the storage rcv.data borrows — the 1.2 lifetime rule |
A stored length marks compression in its top bit: MAKE_ZIP_LEN(l) sets
0x80000000, ZIP_CHECK(l) tests, GET_ZIP_LEN(l) strips.
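The top-bit convention can be sketched with three small functions. They are hypothetical equivalents of the macros named above, written over an unsigned 32-bit type as an assumption; the real macros may operate on a signed length.

```cpp
#include <cassert>
#include <cstdint>

// Flag value taken from the text; everything else is an illustrative equivalent.
constexpr std::uint32_t ZIP_FLAG_SKETCH = 0x80000000u;

// Marks a stored length as "payload is compressed".
constexpr std::uint32_t
make_zip_len_sketch (std::uint32_t length)
{
  return length | ZIP_FLAG_SKETCH;
}

// Tests whether the stored length carries the compression mark.
constexpr bool
zip_check_sketch (std::uint32_t stored_length)
{
  return (stored_length & ZIP_FLAG_SKETCH) != 0;
}

// Strips the mark and returns the byte count.
constexpr std::uint32_t
get_zip_len_sketch (std::uint32_t stored_length)
{
  return stored_length & ~ZIP_FLAG_SKETCH;
}
```

A consumer tests the mark first to choose between unzip and plain copy, then strips it to obtain the number of bytes to read.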
The classifier macros. Four pure disjunctions over LOG_RCVINDEX; the
branches are the listed indices.
```cpp
// RCV_IS_BTREE_LOGICAL_LOG -- src/transaction/recovery.h
#define RCV_IS_BTREE_LOGICAL_LOG(idx) \
  ((idx) == RVBT_DELETE_OBJECT_PHYSICAL \
   || (idx) == RVBT_MVCC_DELETE_OBJECT \
   || (idx) == RVBT_MVCC_INSERT_OBJECT \
   || (idx) == RVBT_NON_MVCC_INSERT_OBJECT \
   || (idx) == RVBT_MARK_DELETED \
   || (idx) == RVBT_DELETE_OBJECT_POSTPONE \
   || (idx) == RVBT_MVCC_INSERT_OBJECT_UNQ \
   || (idx) == RVBT_MVCC_NOTIFY_VACUUM \
   || (idx) == RVBT_ONLINE_INDEX_UNDO_TRAN_DELETE \
   || (idx) == RVBT_ONLINE_INDEX_UNDO_TRAN_INSERT)
```

These ten ops are logged by key value, not page image: undo re-descends the tree, never running against one fixed page.
RCV_IS_LOGICAL_COMPENSATE_MANUAL is the btree set plus exactly six:
RVFL_ALLOC, RVFL_USER_PAGE_MARK_DELETE, RVPGBUF_DEALLOC,
RVFL_TRACKER_HEAP_REUSE, RVHF_LOB_REMOVE_DIR, RVFL_TRACKER_UNREGISTER;
their undofun appends its own compensation via rcv.reference_lsa, so the
rollback driver must not auto-append a LOG_COMPENSATE — a re-crash would
double-undo. RCV_IS_LOGICAL_RUN_POSTPONE_MANUAL matches exactly four:
RVFL_DEALLOC, RVHF_MARK_DELETED, RVHF_LOB_REMOVE_DIR,
RVBT_DELETE_OBJECT_POSTPONE; as postpone actions their redofun closes
with LOG_SYSOP_END_LOGICAL_RUN_POSTPONE, not a standard LOG_RUN_POSTPONE
(Ch 8). RVHF_LOB_REMOVE_DIR and RVBT_DELETE_OBJECT_POSTPONE sit in both
sets.
RCV_IS_LOGICAL_LOG (vpid, idx) is the master test and the only one that
inspects the address:
((vpid)->volid == NULL_VOLID) || ((vpid)->pageid == NULL_PAGEID)
short-circuits to logical regardless of index; then
RCV_IS_BTREE_LOGICAL_LOG (idx); then eleven indices:
RVBT_MVCC_INCREMENTS_UPD, RVPGBUF_FLUSH_PAGE, RVFL_DESTROY,
RVFL_ALLOC, RVFL_DEALLOC, RVVAC_NOTIFY_DROPPED_FILE,
RVPGBUF_DEALLOC, RVES_NOTIFY_VACUUM, RVHF_MARK_DELETED,
RVFL_TRACKER_HEAP_REUSE, RVFL_TRACKER_UNREGISTER. A new logical index
missing here makes recovery try to fix a nonexistent page — a fix error
during rollback, far from the bug.
flowchart TD
A["record vpid + rcvindex"] --> B{"volid or pageid NULL?"}
B -- yes --> L["logical: undofun gets pgptr = nullptr"]
B -- no --> C{"RCV_IS_BTREE_LOGICAL_LOG?"}
C -- yes --> L
C -- no --> D{"one of the 11 listed indices?"}
D -- yes --> L
D -- no --> P["physical: driver fixes page, passes pgptr"]
Figure 1-2 — RCV_IS_LOGICAL_LOG as evaluated by undo/rollback drivers.
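The three-stage test in Figure 1-2 can be written out as a short sketch. The enumerators below are placeholders with made-up values, each index set is represented by a single member, and NULL_VOLID / NULL_PAGEID are assumed to be -1; only the evaluation order follows the description.

```cpp
#include <cassert>

// Placeholder enumerators: the VALUES are not the persisted LOG_RCVINDEX numbers.
enum rcvindex_sketch
{
  SK_RVDK_FORMAT,         // ordinary page-addressed index
  SK_RVBT_MARK_DELETED,   // one member of the ten-entry b-tree logical set
  SK_RVFL_ALLOC           // one member of the eleven extra logical indices
};

constexpr short NULL_VOLID_SKETCH = -1;   // assumed sentinel value
constexpr int NULL_PAGEID_SKETCH = -1;    // assumed sentinel value

struct vpid_sketch
{
  short volid;
  int pageid;
};

// Stand-in for RCV_IS_BTREE_LOGICAL_LOG, reduced to one representative.
inline bool
is_btree_logical_sketch (rcvindex_sketch idx)
{
  return idx == SK_RVBT_MARK_DELETED;
}

// Stand-in for the eleven listed indices, reduced to one representative.
inline bool
is_listed_logical_sketch (rcvindex_sketch idx)
{
  return idx == SK_RVFL_ALLOC;
}

// Evaluation order of the master test: address first, then the two index sets.
inline bool
is_logical_log_sketch (const vpid_sketch &vpid, rcvindex_sketch idx)
{
  return vpid.volid == NULL_VOLID_SKETCH
         || vpid.pageid == NULL_PAGEID_SKETCH
         || is_btree_logical_sketch (idx)
         || is_listed_logical_sketch (idx);
}
```

A record with a null address is logical whatever its index, and a listed index is logical even when the record carries a real page address.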
1.8 Chapter summary — key takeaways
- LOG_RCV is the one calling convention; data/pgptr are borrowed, so all four copy/move operations are deleted.
- RV_fun[] is indexed by the append-only LOG_RCVINDEX; debug-startup rv_check_rvfuns catches only misordering, and nothing bounds-checks lookups.
- Compensate records redo through undofun (log_rv_get_fun<LOG_REC_COMPENSATE>): a CLR’s redo re-does the undo.
- log_rv_redo_context (reader + two zip buffers + frozen m_end_redo_lsa; copies rebuild fresh buffers; only NORMAL fetch mode used) feeds immutable log_rv_redo_rec_info<T> snapshots; debug-only vpid_lsa_consistency_check asserts every later apply per page stays above the first-applied LSA, which is a necessary condition of log order, not full pairwise monotonicity.
- Analysis state = LOG_RCV_TDES (five LSAs), seeded from LOG_REC_CHKPT plus its LOG_INFO_CHKPT_TRANS + LOG_INFO_CHKPT_SYSOP entries (only the two sysop anchors persist; tran-level postpone travels in the trans entry), gated by LOG_RECVPHASE.
- The RCV_IS_* macros split physical vs logical, automatic vs manual; a new logical index missing from RCV_IS_LOGICAL_LOG breaks rollback long after the feature ships.
Chapter 2: Restart Entry and Log Page Access
Who drives recovery at server start, how the checkpoint anchor is found and
downgraded for a media crash, and how the passes (Ch 3, 6, 9) physically
read log pages. Theory: the companion cubrid-recovery-manager.md.
2.1 log_recovery — the restart orchestrator, branch by branch
log_recovery (in log_recovery.c) has one caller,
log_initialize_internal, gated on
init_emergency == false && (log_Gl.hdr.is_shutdown == false || ismedia_crash == true)
(restoredb passes ismedia_crash; emergency startup skips recovery); the
caller holds the log CS in write mode (assert (LOG_CS_OWN_WRITE_MODE)).
```cpp
// log_recovery -- src/transaction/log_recovery.c
  /* ... condensed: branch 1 -- NULL LOG_FIND_TDES is er_set + logpb_fatal_error, return ... */
  rcv_tdes->state = TRAN_RECOVERY;      /* <- the recovery "transaction" */
  if (LOG_HAS_LOGGING_BEEN_IGNORED ())
    {                                   /* <- branch 2: fatal, then clear the flag */
      /* ... condensed ... */
    }
  /* ... condensed ... */
  LSA_COPY (&rcv_lsa, &log_Gl.hdr.chkpt_lsa);
  if (ismedia_crash != false)
    {                                   /* <- branch 3a: downgrade anchor */
      (void) fileio_map_mounted (thread_p, (bool (*)(THREAD_ENTRY *, VOLID, void *)) log_rv_find_checkpoint,
                                 &rcv_lsa);
    }
  /* ... condensed: else, branch 3b -- if (stopat != NULL) *stopat = -1 ... */
  vacuum_notify_server_crashed (&rcv_lsa);
```

Branches 1 and 2 end in logpb_fatal_error; branch 2 fires when
LOG_HAS_LOGGING_BEEN_IGNORED() (log_impl.h) sees
log_Gl.hdr.has_logging_been_skipped — a crash while logging was skipped
is unrepairable (ER_LOG_CORRUPTED_DB_DUE_CRASH_NOLOGGING). Branch 3a:
restored volumes may be older than the header checkpoint, so
log_rv_find_checkpoint is mapped over every volume, copying its
disk_get_checkpoint LSA into rcv_lsa when
LSA_ISNULL (rcv_lsa) || LSA_LT (&chkpt_lsa, rcv_lsa) and returning true
so all volumes are visited — the oldest checkpoint wins.
vacuum_notify_server_crashed copies rcv_lsa into
vacuum_Data.recovery_lsa for vacuum’s backward scan when analysis finds
no MVCC op record.
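The "oldest checkpoint wins" fold in branch 3a can be sketched as below. The LSA struct, its null encoding, and the std::vector of per-volume checkpoints are hypothetical simplifications; the comparison and the always-true return follow the description of log_rv_find_checkpoint above.

```cpp
#include <cassert>
#include <vector>

// Simplified LSA: ordering is by pageid, then offset. Null is assumed to be pageid -1.
struct lsa_sketch
{
  long long pageid;
  int offset;
};

inline bool
lsa_is_null_sketch (const lsa_sketch &lsa)
{
  return lsa.pageid == -1;
}

inline bool
lsa_lt_sketch (const lsa_sketch &a, const lsa_sketch &b)
{
  return a.pageid < b.pageid || (a.pageid == b.pageid && a.offset < b.offset);
}

// Per-volume callback shape: lower rcv_lsa if this volume's checkpoint is older.
// Always returns true so that the mapping visits every volume.
inline bool
find_checkpoint_sketch (const lsa_sketch &volume_chkpt, lsa_sketch &rcv_lsa)
{
  if (lsa_is_null_sketch (rcv_lsa) || lsa_lt_sketch (volume_chkpt, rcv_lsa))
    {
      rcv_lsa = volume_chkpt;
    }
  return true;
}

// Starts from the header checkpoint and folds every volume checkpoint into it.
inline lsa_sketch
downgrade_anchor_sketch (lsa_sketch header_chkpt, const std::vector<lsa_sketch> &volume_chkpts)
{
  lsa_sketch rcv_lsa = header_chkpt;
  for (const lsa_sketch &volume_chkpt : volume_chkpts)
    {
      (void) find_checkpoint_sketch (volume_chkpt, rcv_lsa);
    }
  return rcv_lsa;
}
```

With a header checkpoint at page 100 and restored volumes at pages 120, 80, and 90, the anchor drops to the page-80 checkpoint; if every volume is newer than the header, the header value stands.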
Invariant — the analysis start LSA is no newer than the checkpoint recorded in any permanent volume. A volume header stores the checkpoint LSA at its last flush; replay must start at or before it, else redo skips updates the restored volume never received. Figure 2-1 maps the rest.
flowchart TD
A["ANALYSIS Ch 3"] --> B["logpb_fetch_start_append_page<br/>error: fatal"]
B --> C{did_incom_recovery}
C -->|false| D["LOG_RESET_PREV_LSA from EOF back_lsa"] --> G
C -->|true| G["append LOG_DUMMY_CRASH_RECOVERY<br/>rcv_phase_lsa = tail_lsa"]
G --> H["REDO Ch 6, then UNDO Ch 9<br/>log_system_tdes::rv_final"] --> K{did_incom_recovery}
K -->|true| L["log_recovery_notpartof_volumes"] --> N
K -->|false| N["TRAN_ACTIVE, logtb_set_num_loose_end_trans"]
N --> O{2pc loose ends}
O -->|yes| P["FINISH_2PC: log_2pc_recovery Ch 11"] --> R
O -->|no| R["logpb_decache_archive_info<br/>CS exit, logpb_checkpoint, CS enter"]
R --> S["flush all + header, then locator_initialize,<br/>heap_classrepr_restart_cache -- each error: fatal"]
Figure 2-1: log_recovery after the anchor is fixed.
Section 2.5 covers the append-point re-arm; log_append_empty_record
writes LOG_DUMMY_CRASH_RECOVERY, whose LSA becomes log_Gl.rcv_phase_lsa,
the crash boundary undo keys on (Ch 9). A stopat cut sets
did_incom_recovery (Ch 3); log_recovery_notpartof_volumes then drops
volumes created after the restore point. The close exits the log CS around
logpb_checkpoint, flushes dirty pages and the header, and re-caches the
catalog tracker and class representations — two further fatal branches; the
caller then sets LOG_RESTARTED.
2.2 The rcv_phase transitions
log_Gl.rcv_phase (enum log_recvphase, log_impl.h) is the global mode;
LOG_ISRESTARTED() tests for LOG_RESTARTED, which the caller sets after
the phases above. logpb_copy_page fills its recovery cache only
if (!LOG_ISRESTARTED ()), and the physical readers run debug checksum
checks only when LOG_RESTARTED — torn tails during recovery are repaired
logically (Section 2.7).
2.3 logpb_fetch_page — the single physical-read entry
logpb_fetch_page (in log_page_buffer.c) takes an
enum log_cs_access_mode (log_impl.h). The classic analysis and undo
scans call it with LOG_CS_FORCE_USE (they run under the log CS held by
log_recovery); the redo machinery’s log_reader forwards
LOG_CS_SAFE_READER so its positioned fetches skip the CS (Section 2.6).
```cpp
// logpb_fetch_page -- src/transaction/log_page_buffer.c
  if (LSA_LE (&append_lsa, req_lsa)                     /* <- case 1: page beyond flushed area */
      || LSA_LE (&append_prev_lsa, req_lsa))            /* <- case 2: page may hold a temp EOL */
    {
      LOG_CS_ENTER (thread_p);
      /* ... condensed ... */
      if (LSA_LE (&log_Gl.hdr.append_lsa, req_lsa))     /* retry with mutex */
        {
          logpb_prior_lsa_append_all_list (thread_p);   /* <- drain prior list to buffers */
        }
      LOG_CS_EXIT (thread_p);
    }
  rv = logpb_copy_page (thread_p, req_lsa->pageid, access_mode, log_pgptr);
  /* ... condensed: rv != NO_ERROR is the only error exit ... */
```

The front gate folds the in-memory prior-LSA list into the page buffer so
a reader near the append point never sees a stale tail. logpb_copy_page
then has four arms: a LOGPB_HEADER_PAGE_ID request is served from the
cached header_buffer (file read when not cached); an out-of-range buffer
index raises ER_LOG_PAGE_CORRUPTED; a buffer hit memcpys and re-checks
log_bufptr->pageid — the safe-reader mode skips the read CS, so this
re-check is its lock-free validation; everything else falls to
logpb_read_page_from_file, caching the page forward-only while
!LOG_ISRESTARTED ().
2.4 Active versus archive: logpb_read_page_from_file
A pageid is archived iff LOGPB_IS_ARCHIVE_PAGE (pageid), that is, not the header
page and below LOGPB_NEXT_ARCHIVE_PAGE_ID (log_Gl.hdr.nxarv_pageid);
logpb_is_page_in_archive wraps it. LOG_CS_SAFE_READER takes the read CS
itself (sets log_csect_entered); other modes assert (LOG_CS_OWN). The
CS protects the archive set — an archive created mid-read once left
logpb_to_physical_pageid stale (the in-code comment records the bug).
```cpp
// logpb_read_page_from_file -- src/transaction/log_page_buffer.c
  bool fetch_from_archive = logpb_is_page_in_archive (pageid);
  if (fetch_from_archive)
    {
      bool is_archive_page_in_active_log = (pageid + LOGPB_ACTIVE_NPAGES) > log_Gl.hdr.append_lsa.pageid;
      bool dont_fetch_archive_from_active = !LOG_ISRESTARTED () || log_Gl.hdr.was_active_log_reset;
      if (is_archive_page_in_active_log && !dont_fetch_archive_from_active)
        {
          fetch_from_archive = false;   /* <- slot not yet lapped in circular active file */
        }
    }
```

The shortcut: the active file is circular with LOGPB_ACTIVE_NPAGES
(= log_Gl.hdr.npages) slots, so an archived page stays readable from
active until its slot is re-appended — disabled during recovery and after
an active-log reset, when the active tail is exactly what the crash made
suspect.
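The archive-versus-active decision can be exercised in isolation with a sketch like the following. The log_state_sketch struct replaces the log_Gl globals, and the header page id is a placeholder value; the boolean logic is transcribed from the excerpt above.

```cpp
#include <cassert>

// Placeholder for LOGPB_HEADER_PAGE_ID; the real value is not taken from the text.
constexpr long long HEADER_PAGE_ID_SKETCH = -9;

// Hypothetical bundle of the globals the decision reads.
struct log_state_sketch
{
  long long nxarv_pageid;        // first page not yet archived
  long long append_pageid;       // page of the current append point
  long long active_npages;       // slots in the circular active file
  bool restarted;                // true once recovery has finished
  bool was_active_log_reset;
};

inline bool
is_page_in_archive_sketch (const log_state_sketch &log, long long pageid)
{
  return pageid != HEADER_PAGE_ID_SKETCH && pageid < log.nxarv_pageid;
}

// Returns true when the page must be read from an archive volume.
inline bool
fetch_from_archive_sketch (const log_state_sketch &log, long long pageid)
{
  bool fetch_from_archive = is_page_in_archive_sketch (log, pageid);
  if (fetch_from_archive)
    {
      const bool still_in_active = (pageid + log.active_npages) > log.append_pageid;
      const bool shortcut_disabled = !log.restarted || log.was_active_log_reset;
      if (still_in_active && !shortcut_disabled)
        {
          fetch_from_archive = false;   // slot not yet overwritten: read it from active
        }
    }
  return fetch_from_archive;
}
```

With 80 active slots, an append point at page 150, and archiving done up to page 100, page 90 is served from the active file at runtime but from the archive during recovery, while page 50 always comes from the archive.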
The remaining arms: an archive fetch (logpb_fetch_from_archive) returning
NULL is goto error. An active fetch maps the slot via
logpb_to_physical_pageid, then fileio_read (ER_LOG_READ, goto error); the self-id check: hdr.logical_pageid == pageid is good, then
tde_decrypt_log_page if encrypted (archives decrypt inside
logpb_fetch_from_archive); == pageid + LOGPB_ACTIVE_NPAGES means lapped
since the check — retry from archive; anything else is
ER_LOG_PAGE_CORRUPTED. Both exits release the CS iff log_csect_entered;
debug checksum only when LOG_RESTARTED.
Invariant — every log page self-identifies. An active-file read is
valid only if hdr.logical_pageid matches; the one benign mismatch is one
lap, pageid + LOGPB_ACTIVE_NPAGES — without the check a lapped slot would
replay as the old page.
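The self-identification rule reduces to a three-way classification, sketched here with hypothetical names. The verdict enum and function are illustrative; the three outcomes are the ones listed in the invariant.

```cpp
#include <cassert>

enum class read_verdict_sketch
{
  GOOD,                   // page identifies itself as the requested pageid
  LAPPED_RETRY_ARCHIVE,   // slot was reused exactly one lap later
  CORRUPTED               // any other identity
};

// Classifies an active-file read by the logical pageid found in the page header.
inline read_verdict_sketch
classify_active_read_sketch (long long hdr_logical_pageid, long long requested_pageid,
                             long long active_npages)
{
  if (hdr_logical_pageid == requested_pageid)
    {
      return read_verdict_sketch::GOOD;
    }
  if (hdr_logical_pageid == requested_pageid + active_npages)
    {
      return read_verdict_sketch::LAPPED_RETRY_ARCHIVE;
    }
  return read_verdict_sketch::CORRUPTED;
}
```

Only the exact one-lap alias is treated as benign; a header that is two laps ahead, or anything else, counts as corruption.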
2.5 logpb_fetch_start_append_page — re-arming the append point
Between analysis and redo the log must become writable again. Four
branches: an empty log (append_lsa offset 0, pageid 0 — debug builds:
PRM_ID_FIRST_LOG_PAGEID) makes logpb_locate_page get NEW_PAGE
instead of OLD_PAGE; a leftover
log_Gl.append.log_pgptr is discarded (logpb_invalid_all_append_pages);
NULL from logpb_locate_page is the only error exit (ER_FAILED, fatal in
log_recovery); on success set_nxio_lsa (log_Gl.hdr.append_lsa) is
recorded and the page joins flush_info->toflush, flushed
(logpb_flush_pages_direct) when the array is full.
2.6 log_reader — the C++ fetch wrapper for the redo machinery
The modern redo path (Ch 6, Ch 7) uses log_reader (log_reader.hpp,
final class, header-only; the sibling log_reader.cpp is stale — no
CMakeLists builds it).
| Field | Role | Why it exists |
|---|---|---|
m_thread_entry | Lazily cached THREAD_ENTRY * | Single-thread contract, asserted each use |
m_lsa | Read position; starts NULL_LSA | Drives fetch pageid, intra-page offset, memoization |
m_cs_access | Mode passed to logpb_fetch_page; default LOG_CS_FORCE_USE | CS-owning passes vs CS-free readers (LETS-port leftover) |
m_page | log_page * aligned into m_area_buffer by the constructor | Private fetch destination — no shared-buffer locking |
m_area_buffer | char [IO_MAX_PAGE_SIZE + DOUBLE_ALIGNMENT] | Inline no-heap storage; copied workers get their own |
set_lsa_and_fetch_page computes
do_fetch_page { fetch_page_mode == fetch_mode::FORCE || m_lsa.pageid != lsa.pageid },
assigns m_lsa = lsa, and fetches (logpb_fetch_page (.., m_cs_access, m_page), fatal on failure) only when true: NORMAL memoizes the current
page, FORCE always refetches. align, add_align,
advance_when_does_not_fit and copy_from_log delegate to the classic
LOG_READ_ALIGN family and logpb_copy_from_log (bottom of the same
header), refetching on page crossings — but only fetch_page (under
set_lsa_and_fetch_page and skip) forwards m_cs_access; the delegating
members use the family’s default LOG_CS_FORCE_USE, so even a safe
reader briefly takes the read CS at mid-record page crossings.
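The NORMAL-versus-FORCE memoization in set_lsa_and_fetch_page can be modeled with a counter in place of the real fetch. The reader_sketch class is hypothetical and tracks only the page id and offset; the do_fetch_page expression follows the one quoted above.

```cpp
#include <cassert>

// Hypothetical model: counts fetches instead of calling logpb_fetch_page.
class reader_sketch
{
  public:
    enum class fetch_mode { NORMAL, FORCE };

    void set_lsa_and_fetch_page (long long pageid, int offset,
                                 fetch_mode mode = fetch_mode::NORMAL)
    {
      // Decide before overwriting the position, as the original expression does.
      const bool do_fetch_page = (mode == fetch_mode::FORCE) || (m_pageid != pageid);
      m_pageid = pageid;
      m_offset = offset;
      if (do_fetch_page)
        {
          ++m_fetch_count;   // stand-in for the physical page fetch
        }
    }

    int fetch_count () const { return m_fetch_count; }
    long long pageid () const { return m_pageid; }
    int offset () const { return m_offset; }

  private:
    long long m_pageid = -1;   // starts at a null position
    int m_offset = -1;
    int m_fetch_count = 0;
};
```

Repositioning within the same page costs nothing in NORMAL mode, which is why the redo pass can step record by record without refetching.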
The owning aggregate log_rv_redo_context (log_recovery_redo.hpp):
| Field | Role | Why it exists |
|---|---|---|
m_reader | log_reader { LOG_CS_SAFE_READER } | Private reader per context; CS-free positioned fetches |
m_redo_zip, m_undo_zip | LOG_ZIP scratch buffers | Decompression targets reused across records (Section 2.8) |
m_end_redo_lsa | const LOG_LSA redo stop bound | Workers compare record LSAs without touching globals |
m_reader_fetch_page_mode | const log_reader::fetch_mode | NORMAL memoizes pages; FORCE kept for reuse |
The synchronous redo driver constructs it with fetch_mode::NORMAL; the
copy constructor re-runs the main one so each parallel worker (Ch 7) gets
fresh buffers; the constructor pre-sizes both zips to LOGAREA_SIZE, the
destructor frees them (log_zip_free_data).
2.7 The NULL_OFFSET convention for incompletely archived records
NULL_OFFSET is (-1) (storage_common.h). When the archiver copies an
active page whose last record continues onto the next page, an LSA into the
continuation may carry offset == NULL_OFFSET: the record’s completion
postdates archiving. Every forward scan — analysis, redo, walkers like
log_startof_nxrec — must repair it before dereferencing:
```cpp
// log_recovery_analysis (record loop) -- src/transaction/log_recovery.c
  if (lsa.offset == NULL_OFFSET)
    {
      lsa.offset = log_page_p->hdr.offset;          /* <- page's first record offset */
      if (lsa.offset == NULL_OFFSET)
        {
          /* Continue with next pageid */
          if (logpb_is_page_in_archive (log_lsa.pageid))
            {
              lsa.pageid = log_lsa.pageid + 1;      /* <- archive: keep walking */
            }
          else
            {
              lsa.pageid = NULL_PAGEID;             /* <- active: stop scan */
            }
          continue;
        }
    }
```

A page whose own hdr.offset is NULL_OFFSET holds no record start (pure
continuation) — in an archive try the next page; in the active log the
scan ran off the end. Analysis scratch pages are initialized to
hdr.offset = NULL_OFFSET.
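The repair step can be pulled out into a small function for illustration. This is a simplification under stated assumptions: the sketch uses a single cursor where the original distinguishes lsa from log_lsa, and NULL_PAGEID is assumed to be -1. The return value tells the caller whether the cursor now points at a record start.

```cpp
#include <cassert>

constexpr int NULL_OFFSET_SKETCH = -1;          // value stated in the text
constexpr long long NULL_PAGEID_SKETCH = -1;    // assumed sentinel value

struct lsa_sketch
{
  long long pageid;
  int offset;
};

// Returns true when lsa points at a record start on the current page.
// Returns false when the caller should restart its loop: lsa.pageid has either
// advanced to the next archive page or become null (scan finished).
inline bool
repair_null_offset_sketch (lsa_sketch &lsa, int page_hdr_offset, bool page_in_archive)
{
  if (lsa.offset != NULL_OFFSET_SKETCH)
    {
      return true;                               // nothing to repair
    }
  lsa.offset = page_hdr_offset;                  // first record start on this page
  if (lsa.offset != NULL_OFFSET_SKETCH)
    {
      return true;
    }
  // Pure continuation page: no record starts here.
  lsa.pageid = page_in_archive ? lsa.pageid + 1 : NULL_PAGEID_SKETCH;
  return false;
}
```

The three outcomes match the prose: repaired in place, skipped forward within an archive, or terminated at the end of the active log.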
2.8 LOG_ZIP allocation helpers each pass instantiates
LOG_ZIP (struct log_zip, log_compress.h) is a grow-only buffer:
| Field | Role | Why it exists |
|---|---|---|
data_length | Bytes currently stored | Consumers read exactly this much; capacity may be larger |
buf_size | Capacity of log_data | Grow-only check; sized to the LZ4 worst case |
log_data | The buffer (char *) | Reused across records; nullptr until first sizing |
log_zip_realloc_if_needed (log_zip, new_size) (in log_compress.c) grows
only when new_size > 0 && new_size > log_zip.buf_size, to
LOG_ZIP_BUF_SIZE (LZ4, new_size) (ER_OUT_OF_VIRTUAL_MEMORY on failure);
a second check, new_size > 0 && log_zip.log_data == nullptr, zeroes the
fields and returns false (the caller treats this as fatal); true covers both
success and no-grow. log_zip_alloc allocates the struct with malloc, zeroes
it, and sizes it the same way (nullptr on failure, with the partially built
struct freed); log_zip_free runs log_zip_free_data, then frees the struct.
The redo context pre-sizes its two zips (Section 2.6); the undo pass
allocates undo_unzip_ptr with log_zip_alloc (LOGAREA_SIZE) and frees it on
every exit of log_recovery_undo; the shared consumer
log_rv_get_unzip_log_data splits compressed from plain via
ZIP_CHECK (length): log_unzip for compressed data, memcpy after
log_zip_realloc_if_needed for plain data.
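A grow-only buffer with the two checks described above might look like the sketch below. Several parts are assumptions: the struct and function names are hypothetical, the worst-case formula is LZ4's published compress-bound shape used as a stand-in for LOG_ZIP_BUF_SIZE, and dropping the old buffer when realloc fails is a sketch-only choice, not a statement about the real failure path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical mirror of the three LOG_ZIP fields.
struct zip_sketch
{
  int data_length = 0;
  int buf_size = 0;
  char *log_data = nullptr;
};

// Stand-in for LOG_ZIP_BUF_SIZE (LZ4, n): assumed compress-bound shape.
inline int
worst_case_size_sketch (int new_size)
{
  return new_size + new_size / 255 + 16;
}

// Grows only when the request exceeds the current capacity.
// Returns false only when data was requested and no buffer exists afterwards.
inline bool
zip_realloc_if_needed_sketch (zip_sketch &zip, int new_size)
{
  if (new_size > 0 && new_size > zip.buf_size)
    {
      const int capacity = worst_case_size_sketch (new_size);
      char *grown = static_cast<char *> (std::realloc (zip.log_data, static_cast<std::size_t> (capacity)));
      if (grown == nullptr)
        {
          std::free (zip.log_data);   // sketch-only choice on allocation failure
          zip.log_data = nullptr;
        }
      else
        {
          zip.log_data = grown;
          zip.buf_size = capacity;
        }
    }
  if (new_size > 0 && zip.log_data == nullptr)
    {
      zip.data_length = 0;            // second check: zero the fields, report failure
      zip.buf_size = 0;
      return false;
    }
  return true;
}

inline void
zip_free_data_sketch (zip_sketch &zip)
{
  std::free (zip.log_data);
  zip = zip_sketch {};
}
```

Because capacity never shrinks, a context that has handled one large record pays no further allocations for smaller ones.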
2.9 Chapter summary — key takeaways
- log_recovery runs as the TRAN_RECOVERY system transaction under an already-held write-mode log CS; only log_initialize_internal calls it, and emergency startup skips it.
- The analysis anchor is log_Gl.hdr.chkpt_lsa, downgraded on a media crash to the oldest per-volume checkpoint found via log_rv_find_checkpoint.
- Between analysis and redo the append point is re-armed and a LOG_DUMMY_CRASH_RECOVERY appended; its LSA (log_Gl.rcv_phase_lsa) is the crash boundary undo keys on.
- Classic analysis and undo scans fetch with LOG_CS_FORCE_USE under the held log CS; log_reader forwards LOG_CS_SAFE_READER for positioned fetches, though its page-crossing helpers still default to LOG_CS_FORCE_USE and briefly take the read CS.
- logpb_read_page_from_file splits active versus archive on LOGPB_IS_ARCHIVE_PAGE; the only benign self-id mismatch is the one-lap alias pageid + LOGPB_ACTIVE_NPAGES, and the archived-but-still-in-active shortcut is disabled during recovery.
- NULL_OFFSET (-1) marks LSAs into incompletely archived records; every forward scan repairs it from hdr.offset, advancing a page in archives and terminating in the active log.
Chapter 3: Analysis Pass Driver
log_recovery_analysis walks forward from the checkpoint anchor through possibly corrupted or truncated log and computes the redo range: a page-fetch outer loop around a record-step inner loop. Record semantics go to log_rv_analysis_record (Ch 4–5); the driver owns cursor advancement, corruption defenses, the truncate-or-fatal decision, and redo-range bookkeeping. ARIES rationale: recovery-phases section of cubrid-recovery-manager.md; page-fetch mechanics: Ch 2.
3.1 Entry point, outputs, and driver state
log_recovery resolves the anchor (log_Gl.hdr.chkpt_lsa or, under media crash, the oldest checkpoint among data-volume headers found by log_rv_find_checkpoint) and passes it as start_lsa, together with is_media_crash (truncate vs fatal, 3.2) and stop_at (the restoredb -d boundary, 3.7). Outputs: start_redo_lsa (the anchor unless Ch 4 pulls it back), end_redo_lsa (Invariant 3-B), did_incom_recovery (set when the log was truncated, in which case log_recovery skips the EOF back-link fix-up), and num_redo_log_records (3.8).
Key driver locals of log_recovery_analysis:
| Local | Role | Why |
|---|---|---|
| lsa | next record; NULL ends both loops | single termination condition |
| log_lsa | current record, page in log_page_p | lsa advances before dispatch (3.6) |
| prev_lsa | last good record | resetlog target |
| prev_prev_lsa | resetlog’s new_prev_lsa | tracks prev_lsa; NULL only if the first fetched page is broken |
| first_corrupted_rec_lsa | first all-0xff 4 KB block | per-record cut-off (3.5) |
| last_checked_page_id | page already checksummed | probe once per page (3.3) |
| is_log_page_broken | fetch failed / record tail missing | truncate-or-fatal fork (3.2) |
| is_log_page_corrupted | readable but checksum failed | partial flush (3.5); terminal (Invariant 3-C) |
| null_block | 4 KB of LOG_PAGE_INIT_VALUE (0xff, log_common_impl.h) | tear-detection memcmp operand |
| checkpoint_lsa | set by LOG_END_CHKPT dispatch (Ch 4) | 2PC tail re-read (3.8) |
| may_use_checkpoint / may_need_synch_checkpoint_2pc | dispatch flags (Ch 4) | the second arms the 2PC tail |
| last_at_time | stays -1 in the driver | echo to *stop_at is inert (3.7) |
Initialization copies start_lsa into lsa, start_redo_lsa, end_redo_lsa, and prev_lsa — a degenerate redo range until proven otherwise — and nulls or zeroes everything else.
3.2 Outer loop: the is_log_page_broken branch
Each outer iteration calls logpb_fetch_page on the page under the cursor. A failed fetch (past the flushed log, missing archive, TDE decryption error) sets is_log_page_broken, as can the inner loop’s broken-tail break (3.4), so a single branch decides what broken means.
Media-crash arm: truncate and accept, because log past the restore point may legitimately not exist. The arm echoes last_at_time via *stop_at, steps the tail_lsa/undo_nxlsa of the tdes that owns the last record back to log_rec->prev_tranlsa so undo never chases the truncated record, and re-fetches prev_lsa’s page (clobbered by the failed fetch; fatal on failure). Then log_recovery_resetlog (thread_p, &prev_lsa, &prev_prev_lsa) makes prev_lsa the new append point (Ch 11), and the arm sets *did_incom_recovery, resets the MVCC table, and returns, skipping the 2PC tail (3.8).

Normal-crash arm: fatal. After a plain crash every page up to eof_lsa must be readable. When er_errid () is ER_TDE_CIPHER_IS_NOT_LOADED the message names TDE: the page is intact but undecryptable.
Invariant 3-B (redo-range honesty). On return, every record in [start_redo_lsa, end_redo_lsa) is readable and structurally complete; the boundary itself is the last fully-probed record or a position re-initialized before redo reads it. Normal end: the last dispatched record. Truncation (3.6 step 8): reverted to prev_lsa. Broken-record probe (3.4): deliberately advanced onto the broken record — equal to prev_lsa — so resetlog makes that position the new append point, overwritten by LOG_DUMMY_CRASH_RECOVERY before redo runs. Violation: redo (Ch 6) applies half-written bodies.
flowchart TD
A["fetch page at lsa"] --> B{"broken?"}
B -- no --> C["inner loop 3.3-3.6"] --> D{"lsa null?"}
D -- no --> A
D -- yes --> E["2PC tail; reset_start_mvccid"]
B -- "yes, media crash" --> F["resetlog at prev_lsa; did_incom_recovery; return"]
B -- "yes, normal crash" --> G["fatal (TDE or generic)"]
Figure 3-1 — outer loop of log_recovery_analysis.
3.3 Inner loop entry: NULL_OFFSET repair and the corruption probe
The inner loop runs while the cursor stays on the fetched page: while (!LSA_ISNULL (&lsa) && lsa.pageid == log_lsa.pageid). Two housekeeping steps precede record access.
NULL_OFFSET repair. A record archived while incomplete leaves the continuation’s offset unknown: the cursor arrives as (pageid, NULL_OFFSET) and is re-anchored on log_page_p->hdr.offset, the first header starting in this page. If that too is NULL_OFFSET (only continuation bytes here): archive page — lsa.pageid = log_lsa.pageid + 1, keep walking the record’s middle; active page — lsa.pageid = NULL_PAGEID, scan over. Either way continue.
Per-page corruption probe. Guarded by last_checked_page_id, so once per page. logpb_page_check_corruption wraps logpb_page_has_valid_checksum (CRC32 vs hdr.checksum); a helper error is fatal. A corrupt archive page is fatal (/* Should not happen. */ — archives are written once); a corrupt active page means a partial page flush. logpb_page_get_first_null_block_lsa locates the tear: the first 4 KB block that memcmps equal to null_block yields (hdr.logical_pageid, i * block_size), minus sizeof (LOG_HDRPAGE) when nonzero — LSA offsets index area[], the raw page starts earlier.
If no block matches (corrupt, but every block holds data), first_corrupted_rec_lsa stays NULL: the 3.5 cut-off and its safety nets (gated on !is_log_page_corrupted) are skipped; only the page-advance ban and EOF stop of Invariant 3-C still apply.
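The tear search can be sketched as a block-by-block comparison against an all-0xff block. This is an illustration only: the constants, the 32-byte header size, and the name first_null_block_offset are assumptions, not CUBRID's definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <optional>
#include <vector>

// All three constants are illustrative stand-ins, not CUBRID's real values.
constexpr std::size_t kBlockSize = 4096;      // size of one flush block
constexpr std::size_t kPageHeaderSize = 32;   // stand-in for sizeof (LOG_HDRPAGE)
constexpr unsigned char kInitValue = 0xff;    // fill byte of a never-written block

// Scan a raw log page block by block. Return the area-relative offset of the
// first block made entirely of kInitValue, or nullopt if every block holds data.
std::optional<std::size_t> first_null_block_offset (const std::vector<unsigned char> &raw_page)
{
  const std::vector<unsigned char> null_block (kBlockSize, kInitValue);
  for (std::size_t i = 0; (i + 1) * kBlockSize <= raw_page.size (); ++i)
    {
      const std::size_t raw_offset = i * kBlockSize;
      if (std::memcmp (raw_page.data () + raw_offset, null_block.data (), kBlockSize) == 0)
        {
          // LSA offsets index the record area, which starts after the page header,
          // so a nonzero raw offset is shifted back by the header size.
          return raw_offset == 0 ? raw_offset : raw_offset - kPageHeaderSize;
        }
    }
  return std::nullopt;
}
```

A page whose second half was never flushed yields an offset just short of the half-way mark; a page with data in every block yields no cut-off at all, which is the "corrupt, but every block holds data" case above.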
3.4 Multi-page records: log_is_page_of_record_broken
After log_rec = LOG_GET_LOG_RECORD_HEADER (log_page_p, &log_lsa), the media-crash path runs one more probe, because a header may sit on the last restored page while its body spills onto pages never restored:
```cpp
// log_is_page_of_record_broken -- src/transaction/log_recovery.c
  LSA_COPY (&fwd_log_lsa, &log_rec_header->forw_lsa);

  /* TODO - Do we need to handle NULL fwd_log_lsa? */
  if (!LSA_ISNULL (&fwd_log_lsa))
    {
      if (LSA_GE (log_lsa, &fwd_log_lsa)
          || (!LSA_ISNULL (&log_Gl.hdr.eof_lsa) && LSA_GT (&fwd_log_lsa, &log_Gl.hdr.eof_lsa)))
        {
          is_log_page_broken = true;    /* <- forw_lsa is nonsense */
        }
      else
        {
          if (fwd_log_lsa.pageid != log_lsa->pageid
              && (fwd_log_lsa.offset != 0 || fwd_log_lsa.pageid > log_lsa->pageid + 1))
            {
              // ... condensed: record spans pages -- probe-fetch fwd_log_lsa page;
              // failure -> broken ...
            }
        }
    }
```

Branch by branch:

1. forw_lsa is NULL: the probe declines and the 3.5 safety nets judge instead (the TODO admits the gap).
2. forw_lsa is not after the current record, or lies beyond eof_lsa: the header itself is garbage, so the record is broken. The eof_lsa test is NULL-guarded because restoring without an active volume recovers eof_lsa only during analysis.
3. forw_lsa is on a later page at a nonzero offset, or more than one page ahead: the body provably reaches that page, so probe-fetch it; failure means the tail is gone, success means the record is sane.

The excluded case, the next record at offset 0 of the next page, proves nothing, so no fetch is spent.
On a broken verdict the inner loop copies end_redo_lsa = lsa, sets prev_lsa and prev_prev_lsa to it, debug-traces, and breaks — the reset happens in 3.2, where prev_lsa is now the broken record itself: resetlog cuts there, sacrificing it so everything earlier survives.
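The plausibility test on forw_lsa can be modelled as a pure classifier over three LSAs. The sketch below is a simplified stand-in: Lsa, Verdict, and classify_forward are hypothetical names, and a NULL LSA is modelled as pageid -1 for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for LOG_LSA; a NULL LSA is modelled as pageid == -1.
struct Lsa { std::int64_t pageid; std::int32_t offset; };
constexpr Lsa kNullLsa { -1, -1 };

bool is_null (const Lsa &a) { return a.pageid == -1; }
bool lsa_lt (const Lsa &a, const Lsa &b)
{
  return a.pageid < b.pageid || (a.pageid == b.pageid && a.offset < b.offset);
}

enum class Verdict { Declined, Broken, NeedsProbe, NoProbe };

// Classify a record at `rec` whose header claims the next record is at `forw`.
Verdict classify_forward (const Lsa &rec, const Lsa &forw, const Lsa &eof)
{
  if (is_null (forw))
    return Verdict::Declined;          // nothing to judge here
  if (!lsa_lt (rec, forw) || (!is_null (eof) && lsa_lt (eof, forw)))
    return Verdict::Broken;            // forw is not after rec, or is past a known eof
  if (forw.pageid != rec.pageid && (forw.offset != 0 || forw.pageid > rec.pageid + 1))
    return Verdict::NeedsProbe;        // body provably reaches a later page
  return Verdict::NoProbe;             // same page, or offset 0 of the next page
}
```

Only the NeedsProbe verdict costs a page fetch; the caller decides broken versus sane from that fetch's result.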
3.5 The first_corrupted_rec_lsa cut-off
Section titled “3.5 The first_corrupted_rec_lsa cut-off”For pages that failed the checksum, the driver decides per record whether it precedes the torn region. Two safety nets first widen the verdict (only while is_log_page_corrupted is false): (1) missing end-of-log — forw_lsa NULL on a non-LOG_END_OF_LOG record in the active log is impossible (every chain ends at an EOF record): page declared corrupted, cut-off from the null-block scan. (2) Body crossing a null block — when forw_lsa stays in-page, map the record start and forw_lsa - 1 to block indexes ((offset + sizeof (LOG_HDRPAGE)) / block_size); if they differ and the ending block equals null_block, the body was never fully flushed: the cut-off becomes the record itself.
With a non-NULL cut-off, three outcomes. A record strictly past the tear ends the scan at the previous good record:
```cpp
// log_recovery_analysis -- src/transaction/log_recovery.c
  if (LSA_GT (&log_lsa, &first_corrupted_rec_lsa))
    {
      LOG_RESET_APPEND_LSA (end_redo_lsa);      /* <- starts past the tear */
      LSA_SET_NULL (&lsa);
      break;
    }
```

The else arm flags the record itself as corrupted when log_lsa == first_corrupted_rec_lsa, when forw_lsa points past the tear, or when the DB_ALIGN-ed end of its header overruns LOGAREA_SIZE or lands past the tear. It then calls LOG_RESET_APPEND_LSA (&log_lsa), so the first casualty’s own position becomes the new append point, nulls lsa, and breaks. A record provably before the tear is processed normally.
Invariant 3-C (corruption is terminal per page). Once is_log_page_corrupted is true, the cursor never advances to another page. Enforced by the post-advance null-out (corrupted, not LOG_END_OF_LOG, lsa.pageid != log_lsa.pageid → LSA_SET_NULL) plus the stop after dispatching LOG_END_OF_LOG. Recycled pages from earlier log wraps can hold valid-looking stale records; following them replays a previous epoch.
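The three per-record outcomes of the cut-off can be summarized as a small decision function. This is a sketch under stated assumptions: the names are hypothetical, the record and the tear are taken to sit on the same page, the caller supplies the already-aligned header end, and the exact comparison used for "lands past the tear" (strictly greater) is a guess rather than the source's code.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for LOG_LSA.
struct Lsa { std::int64_t pageid; std::int32_t offset; };

bool lsa_eq (const Lsa &a, const Lsa &b) { return a.pageid == b.pageid && a.offset == b.offset; }
bool lsa_gt (const Lsa &a, const Lsa &b)
{
  return a.pageid > b.pageid || (a.pageid == b.pageid && a.offset > b.offset);
}

enum class Outcome { StopAtPrevious, RecordIsCasualty, Process };

// header_end: offset just past the aligned record header (computed by the caller).
// Assumes rec and tear are on the same page, so offsets compare directly.
Outcome classify_against_tear (const Lsa &rec, const Lsa &forw, const Lsa &tear,
                               std::int32_t header_end, std::int32_t logarea_size)
{
  if (lsa_gt (rec, tear))
    return Outcome::StopAtPrevious;       // record starts past the tear
  const bool header_bad = header_end > logarea_size || header_end > tear.offset;
  if (lsa_eq (rec, tear) || lsa_gt (forw, tear) || header_bad)
    return Outcome::RecordIsCasualty;     // record itself touches the tear
  return Outcome::Process;                // provably before the tear
}
```

The first outcome parks the append point at the previous good record, the second at the casualty itself, and only the third reaches dispatch.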
3.6 Advancing the cursor: every remaining branch
The rest of the inner-loop body, in order:

1. end_redo_lsa = lsa; lsa = log_rec->forw_lsa (the range tip moves before dispatch).
2. Corrupted-page page-advance ban (Invariant 3-C).
3. Archive null-forward fix. A NULL lsa on an archive page becomes log_lsa.pageid + 1: incomplete archiving, not end of log.
4. Loop guard. lsa backward or sideways (lsa.pageid < log_lsa.pageid, or same page and lsa.offset <= log_lsa.offset): “loop in the log” debug-trace, logpb_fatal_error, then LSA_SET_NULL (&lsa); break;.
5. Missing-EOF repair. NULL lsa, log_rtype != LOG_END_OF_LOG, no truncation yet: the append LSA parks at end_redo_lsa; if log_startof_nxrec finds the next record start, advance there, patch the in-buffer log_rec->forw_lsa, and flush the page (logpb_write_page_to_disk), which is a physical repair. Either way log_Gl.hdr.next_trid = tran_id.
6. Redo counting. *num_redo_log_records counts twelve redo-bearing types: LOG_REDO_DATA, LOG_UNDOREDO_DATA, LOG_DIFF_UNDOREDO_DATA, their three LOG_MVCC_* counterparts, LOG_DBEXTERN_REDO_DATA, LOG_RUN_POSTPONE, LOG_COMPENSATE, LOG_2PC_PREPARE, LOG_2PC_START, LOG_2PC_RECV_ACK. Everything else hits the silent default.
7. Dispatch. log_rv_analysis_record rebuilds transaction state (Ch 4); its LOG_END_OF_LOG case is log_rv_analysis_log_end (3.8).
8. Post-dispatch truncation. *did_incom_recovery raised (3.7): end_redo_lsa = prev_lsa, so the trigger is excluded from redo; lsa nulled, break.
9. Self-loop assert. LSA_EQ (end_redo_lsa, &lsa) means the cursor did not move: assert_release, scan aborts via NULL cursor.
10. Corrupted page + LOG_END_OF_LOG → stop (second half of Invariant 3-C).
11. prev_lsa = end_redo_lsa; prev_prev_lsa = prev_lsa; (the resetlog anchors trail the tip by one record).
12. Page-id back-fill. Forward (pageid, NULL_OFFSET) with a stale smaller pageid → current page (pairs with 3.3’s repair).
Invariant 3-A (monotone cursor). Every iteration strictly increases the cursor (pageid, offset). Enforced by steps 4 and 9, both terminating the scan. Violation: analysis spins forever.
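The loop guard behind Invariant 3-A reduces to one ordering test on (pageid, offset). A minimal sketch with hypothetical names; treating a NULL cursor as end of scan rather than a loop is an assumption inferred from the step order above, where a NULL lsa is still handled by later steps.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for LOG_LSA; NULL is modelled as pageid == -1.
struct Lsa { std::int64_t pageid; std::int32_t offset; };
constexpr std::int64_t kNullPageId = -1;

// True when `next` does not lie strictly after `current`, i.e. the scan
// would go backward or stay in place.
bool cursor_failed_to_advance (const Lsa &current, const Lsa &next)
{
  if (next.pageid == kNullPageId)
    return false;   // assumption: a NULL cursor ends the scan, it is not a loop
  return next.pageid < current.pageid
         || (next.pageid == current.pageid && next.offset <= current.offset);
}
```

Because every accepted step strictly increases the pair, a scan over a finite log cannot revisit a record.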
3.7 Point-in-time stop: stop_at and LOG_REC_DONETIME
stop_at comes from log_recovery: -1 (no limit) on normal restart, the restoredb -d timestamp on media crash. The driver never reads commit times itself; log_rv_analysis_complete does, for LOG_COMMIT / LOG_ABORT (log_rv_analysis_commit_with_postpone applies the same test to its LOG_REC_START_POSTPONE at_time). It reads the LOG_REC_DONETIME payload behind the header; when *stop_at != (time_t) (-1) and difftime (*stop_at, last_at_time) < 0, meaning this is the first done record stamped after the stop point, it nulls the page cursor, calls log_recovery_resetlog (thread_p, &record_header_lsa, prev_lsa) to cut the log before this commit, and raises *did_incom_recovery; 3.6 step 8 then excludes the record and ends the scan. That last_at_time is local to log_rv_analysis_complete; the driver’s copy, echoed into *stop_at in 3.2, stays -1 and is inert today.
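The stop test itself is a two-part predicate. A minimal sketch using the standard difftime; the function name completion_is_past_stop_point is hypothetical.

```cpp
#include <cassert>
#include <ctime>

// True when a completion stamped `at_time` is newer than the requested stop
// point. A stop point of (time_t) -1 means "no limit" and never triggers.
bool completion_is_past_stop_point (std::time_t stop_at, std::time_t at_time)
{
  if (stop_at == static_cast<std::time_t> (-1))
    return false;
  return std::difftime (stop_at, at_time) < 0;   // stop_at - at_time, in seconds
}
```

A completion stamped exactly at the stop point is kept; only strictly later ones cut the log.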
3.8 log_rv_analysis_log_end, the 2PC re-read tail, and the outputs
The one dispatch case belonging to the driver’s story is the clean end of log, log_rv_analysis_log_end, which is a single branch on logpb_is_page_in_archive. In the active log the EOF’s own position becomes log_Gl.hdr.append_lsa (LOG_RESET_APPEND_LSA (log_lsa), so new appends overwrite the EOF record), next_trid is restored from its owner, and the cursor takes the EOF’s NULL forw_lsa, which ends both loops (the missing-EOF repair exempts LOG_END_OF_LOG). An EOF in an archive page is a stale leftover from before the archiving cut: the header is untouched, the NULL forward goes through 3.6 step 3, and the scan continues.
The 2PC re-read tail. If any dispatched record set may_need_synch_checkpoint_2pc (a LOG_REC_CHKPT listing transactions in 2PC at checkpoint time — Ch 4), the driver re-reads the checkpoint record after the outer loop: (1) logpb_fetch_page on checkpoint_lsa, failure fatal; (2) the LOG_INFO_CHKPT_TRANS array of chkpt.ntrans entries, read in-page when log_lsa.offset + size < LOGAREA_SIZE, else malloc + logpb_copy_from_log (failed malloc: fatal); (3) each chkpt_trans[i].trid resolves via logtb_find_tran_index; log_2pc_recovery_analysis_info runs only for tdes still LOG_ISTRAN_2PC. The media-crash arm of 3.2 returns before this tail — truncated restores skip 2PC reconstruction.
log_recovery then emits ER_LOG_RECOVERY_REDO_STARTED from the range and the count; log_cnt_pages_containing_lsa returns 0 when *to_lsa == *from_lsa, else the inclusive to_lsa->pageid - from_lsa->pageid + 1. When nothing past the anchor survived, end_redo_lsa still equals start_redo_lsa from initialization — the count is honestly zero.
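The page count described above is small enough to restate directly. The sketch mirrors the stated rule only; Lsa and count_pages_containing are stand-in names, not the CUBRID function.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for LOG_LSA.
struct Lsa { std::int64_t pageid; std::int32_t offset; };

// Zero for an empty range, otherwise the inclusive number of pages spanned.
std::int64_t count_pages_containing (const Lsa &from, const Lsa &to)
{
  if (from.pageid == to.pageid && from.offset == to.offset)
    return 0;
  return to.pageid - from.pageid + 1;
}
```

A degenerate range, where end_redo_lsa never moved off start_redo_lsa, reports zero pages; any non-empty range on a single page reports one.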
3.9 Chapter summary — key takeaways
- log_recovery_analysis is a page-fetch outer loop around a record-step inner loop; corruption decisions belong to the driver, record semantics to log_rv_analysis_record (Ch 4–5).
- Broken pages fork on is_media_crash: backups truncate via log_recovery_resetlog at prev_lsa and raise did_incom_recovery; normal restarts are fatal (ER_TDE_CIPHER_IS_NOT_LOADED means “load the TDE key”).
- Partial page flush is caught by a once-per-page CRC check; the tear is the first all-0xff 4 KB block, and first_corrupted_rec_lsa cuts the scan with three per-record outcomes. A corrupted page is terminal (Invariant 3-C).
- log_is_page_of_record_broken (media crash only) validates forw_lsa plausibility and probe-fetches a multi-page record’s last page; a missing tail parks end_redo_lsa and prev_lsa on the broken record so resetlog cuts there.
- The redo range is honest (Invariant 3-B): everything strictly before end_redo_lsa is readable and complete; the boundary is fully probed or re-initialized before redo reads it.
- Point-in-time restore lives in log_rv_analysis_complete (LOG_REC_DONETIME), not the driver. Analysis is also not read-only: a missing LOG_END_OF_LOG is physically repaired via log_startof_nxrec, a patched forw_lsa, and a page flush.
Chapter 4: Analysis Record Dispatch and Transaction Table Rebuild
Chapter 3’s driver feeds every LOG_RECORD_HEADER it reads to log_rv_analysis_record. This chapter traces how each arm rebuilds transaction-table state, plus the global counters that ride along: append point, next TRANID, MVCCID horizon. The postpone/sysop arms belong to Chapter 5; ARIES theory lives in the companion cubrid-recovery-manager.md.
4.1 The dispatch switch in log_rv_analysis_record
log_rv_analysis_record is a pure demultiplexer: one switch (log_type), no logic of its own; its pointer parameters all belong to the driver’s loop state (Chapter 3). Every LOG_RECTYPE lands in exactly one arm:
| Record type(s) | Handler | Effect on the table |
|---|---|---|
| LOG_UNDOREDO_DATA, LOG_DIFF_UNDOREDO_DATA, LOG_UNDO_DATA, LOG_REDO_DATA, their four LOG_MVCC_* twins, LOG_DBEXTERN_REDO_DATA | log_rv_analysis_undo_redo | advance tail_lsa + undo_nxlsa (4.3) |
| LOG_SAVEPOINT | log_rv_analysis_save_point | same, plus savept_lsa (4.3) |
| LOG_COMPENSATE | log_rv_analysis_compensate | redirect undo_nxlsa past undone work (4.3) |
| LOG_COMMIT, LOG_ABORT | log_rv_analysis_complete | free the tran index, or stop analysis early (4.4) |
| the seven LOG_2PC_* types | the seven log_rv_analysis_2pc_* arms | stamp a 2PC tdes->state (4.5) |
| LOG_START_CHKPT / LOG_END_CHKPT | log_rv_analysis_start_checkpoint / _end_checkpoint | arm may_use_checkpoint (4.7) / merge snapshot (4.8) |
| LOG_DUMMY_HEAD_POSTPONE, LOG_POSTPONE, LOG_RUN_POSTPONE, LOG_COMMIT_WITH_POSTPONE (+_OBSOLETE), LOG_SYSOP_START_POSTPONE, LOG_SYSOP_END, LOG_SYSOP_ATOMIC_START | the matching log_rv_analysis_* postpone/sysop arms | Chapter 5; commit-with-postpone’s early-stop branch mirrors 4.4 |
| LOG_END_OF_LOG | log_rv_analysis_log_end | reset append point + next_trid (4.10) |
| LOG_DUMMY_CRASH_RECOVERY, LOG_REPLICATION_DATA, LOG_REPLICATION_STATEMENT, LOG_DUMMY_HA_SERVER_STATE, LOG_DUMMY_OVF_RECORD, LOG_DUMMY_GENERIC, LOG_SUPPLEMENTAL_INFO | none (bare break) | no table effect |
| LOG_SMALLER_LOGREC_TYPE, LOG_LARGER_LOGREC_TYPE, default | none | er_set (ER_LOG_PAGE_CORRUPTED) + assert (false): “probably the log is corrupted” |
Return codes are discarded — most via (void) casts; the sysop-end and checkpoint arms simply ignore them. Almost every failure calls logpb_fatal_error, which terminates recovery; the lone exception is end-checkpoint’s sysop re-read (4.8 step 7) — debug builds assert, release builds swallow the error.
4.2 logtb_rv_find_allocate_tran_index — the lazy TDES allocator
Nearly every arm starts here (log_tran_table.c): map tran_id to a TDES, allocating on first sight. Three branches:

- B1: logtb_is_system_worker_tranid (trid) short-circuits to log_system_tdes::rv_get_or_alloc_tdes, keeping system workers out of the table.
- B2: logtb_find_tran_index misses, so logtb_allocate_tran_index (..., TRAN_UNACTIVE_UNILATERALLY_ABORTED, ...) runs, then LSA_COPY (&tdes->head_lsa, log_lsa); allocation failure is logpb_fatal_error + return NULL.
- B3: hit, resolved via LOG_FIND_TDES.
Invariant — presumed abort. Every TDES created during analysis is born TRAN_UNACTIVE_UNILATERALLY_ABORTED, head_lsa = first sighting. Only a later completion record (removal, 4.4) or 2PC/postpone record (state upgrade) changes the verdict; any other initial state would make the undo pass (Chapter 9) skip a loser and leave its updates on disk.
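The find-or-allocate behaviour and its presumed-abort birth state can be modelled with an ordinary map. This is a toy model, not the CUBRID table: RecoveryTranTable, Tdes, and TranState are hypothetical names, the system-worker branch (B1) is omitted, and complete () only mirrors the "find, never allocate" removal described in 4.4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Simplified stand-ins for LOG_LSA, TRAN_STATE and LOG_TDES.
struct Lsa { std::int64_t pageid; std::int32_t offset; };
enum class TranState { UnilaterallyAborted, TwoPcPrepare };
struct Tdes { TranState state; Lsa head_lsa; };

class RecoveryTranTable
{
public:
  // Return the descriptor for trid, creating it on first sight. A new
  // descriptor is born aborted with head_lsa set to the first record seen.
  Tdes &find_or_allocate (int trid, const Lsa &log_lsa)
  {
    auto it = m_table.find (trid);
    if (it == m_table.end ())
      {
        it = m_table.emplace (trid, Tdes { TranState::UnilaterallyAborted, log_lsa }).first;
      }
    return it->second;
  }

  // Completion record: drop the entry if present, never create one.
  bool complete (int trid) { return m_table.erase (trid) > 0; }

  std::size_t size () const { return m_table.size (); }

private:
  std::unordered_map<int, Tdes> m_table;
};
```

Whatever is still in the table at the end of the scan, and still in the aborted state, is a loser for the undo pass.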
4.3 The simple arms — undo_redo, save_point, compensate
log_rv_analysis_undo_redo covers all nine data-change types. Only non-happy branch: NULL TDES means logpb_fatal_error, return ER_FAILED. Otherwise LSA_COPY (&tdes->tail_lsa, log_lsa) then LSA_COPY (&tdes->undo_nxlsa, &tdes->tail_lsa): tail_lsa is the latest record, undo_nxlsa where undo starts walking backward; for a plain data record they coincide. log_rv_analysis_save_point adds LSA_COPY (&tdes->savept_lsa, &tdes->tail_lsa) for post-restart partial rollback.
log_rv_analysis_compensate handles LOG_COMPENSATE — a CLR, proof some update was already undone — and is the one arm where undo_nxlsa diverges from tail_lsa. After the allocator + NULL-fatal branch, it advances to the LOG_REC_COMPENSATE body (LOG_READ_ADD_ALIGN, LOG_READ_ADVANCE_WHEN_DOESNT_FIT) and executes one copy — LSA_COPY (&tdes->undo_nxlsa, &compensate->undo_nxlsa) — and does not advance tail_lsa. The copied pointer lands before the compensated update, so undo never restarts from the CLR itself: ARIES’ never-undo-an-undo rule, enforced purely by pointer redirection.
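The divergence of undo_nxlsa from tail_lsa is easiest to see on a three-record history. The handlers below are hypothetical stand-ins that perform only the copies named above.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-ins for LOG_LSA and the two LOG_TDES fields involved.
struct Lsa { std::int64_t pageid; std::int32_t offset; };
struct Tdes { Lsa tail_lsa; Lsa undo_nxlsa; };

bool lsa_eq (const Lsa &a, const Lsa &b) { return a.pageid == b.pageid && a.offset == b.offset; }

// Plain data record: both pointers land on the record itself.
void on_data_record (Tdes &tdes, const Lsa &rec_lsa)
{
  tdes.tail_lsa = rec_lsa;
  tdes.undo_nxlsa = tdes.tail_lsa;
}

// CLR: only undo_nxlsa moves, to the pointer stored in the CLR body.
void on_compensate (Tdes &tdes, const Lsa &clr_undo_nxlsa)
{
  tdes.undo_nxlsa = clr_undo_nxlsa;
}
```

After two updates and one CLR that compensates the second, undo resumes from the first update and never revisits the one already undone.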
4.4 log_rv_analysis_complete — commit/abort finalization
LOG_COMMIT and LOG_ABORT share log_rv_analysis_complete, the only arm that removes table state and one of two early-stop arms (the other, log_rv_analysis_commit_with_postpone in Chapter 5, carries the same stop_at/resetlog branch). Four branches:
```cpp
// log_rv_analysis_complete -- src/transaction/log_recovery.c
  tran_index = logtb_find_tran_index (thread_p, tran_id);      /* <- find, never allocate */
  // ... condensed: B1 -- if not media crash, goto end; else read LOG_REC_DONETIME -> last_at_time ...
  if (stop_at != NULL && *stop_at != (time_t) (-1) && difftime (*stop_at, last_at_time) < 0)
    {
      /* B2: completion is newer than --until-time */
      log_lsa->pageid = NULL_PAGEID;
      log_recovery_resetlog (thread_p, &record_header_lsa, prev_lsa);
      *did_incom_recovery = true;
      return NO_ERROR;          /* <- index NOT freed: tran stays a loser */
    }

end:
  // ... condensed: B3 -- if tran_index != NULL_TRAN_INDEX, logtb_free_tran_index ...
  return NO_ERROR;              /* B4: never seen before -> nothing to drop */
```

Two asymmetries. First, it finds and never allocates: a completion whose transaction predates the window is a no-op (B4). Second, B2 keeps the index: truncating the log at the commit record makes the transaction retroactively in-flight, so undo rolls it back, which is exactly restore-to-timestamp.
4.5 The seven 2PC arms — a state-transition table
Structurally identical: allocate the TDES (NULL: logpb_fatal_error, ER_FAILED), overwrite tdes->state, advance tail_lsa; none touches undo_nxlsa. Only the stamped state differs:
| Record type | Handler | tdes->state stamped |
|---|---|---|
| LOG_2PC_PREPARE | log_rv_analysis_2pc_prepare | TRAN_UNACTIVE_2PC_PREPARE |
| LOG_2PC_START | log_rv_analysis_2pc_start | TRAN_UNACTIVE_2PC_COLLECTING_PARTICIPANT_VOTES |
| LOG_2PC_COMMIT_DECISION | log_rv_analysis_2pc_commit_decision | TRAN_UNACTIVE_2PC_COMMIT_DECISION |
| LOG_2PC_ABORT_DECISION | log_rv_analysis_2pc_abort_decision | TRAN_UNACTIVE_2PC_ABORT_DECISION |
| LOG_2PC_COMMIT_INFORM_PARTICPS | log_rv_analysis_2pc_commit_inform_particps | TRAN_UNACTIVE_COMMITTED_INFORMING_PARTICIPANTS |
| LOG_2PC_ABORT_INFORM_PARTICPS | log_rv_analysis_2pc_abort_inform_particps | TRAN_UNACTIVE_ABORTED_INFORMING_PARTICIPANTS |
| LOG_2PC_RECV_ACK | log_rv_analysis_2pc_recv_ack | unchanged (only tail_lsa advances) |
LOG_2PC_PREPARE is the participant side; the rest are coordinator records. Prepare and start also plant tdes->gtrid = LOG_2PC_NULL_GTRID: a sentinel that the body (gtrid, participants, locks) was not read — it “needs to be read during either redo phase, or during finish_commit_protocol phase” (source comment); 4.9 consumes it.
4.6 The checkpoint payload structs
A completed checkpoint is two records: an empty LOG_START_CHKPT anchor and a LOG_END_CHKPT whose body (log_record.hpp) is a LOG_REC_CHKPT header, ntrans LOG_INFO_CHKPT_TRANS entries, then ntops LOG_INFO_CHKPT_SYSOP entries.
LOG_REC_CHKPT (log_rec_chkpt) has three fields: redo_lsa — oldest recovery LSA of any dirty data page, because redo must start at the oldest unflushed change (4.8 step 8); ntrans and ntops — counts of the two arrays that follow, which are not self-delimiting (ntops is commonly zero).
LOG_INFO_CHKPT_TRANS (log_info_chkpt_trans) — one serialized TDES per live transaction:
| Field | Role | Why it exists |
|---|---|---|
| isloose_end | to tdes->isloose_end | Client loose ends |
| trid | Transaction id | Merge key for the allocator |
| state | Snapshot state; TRAN_ACTIVE and TRAN_UNACTIVE_ABORTED remap to TRAN_UNACTIVE_UNILATERALLY_ABORTED, others verbatim | Presumed abort; 2PC/postpone states survive |
| head_lsa | to tdes->head_lsa | May predate the analysis window |
| tail_lsa | to tdes->tail_lsa | Chain resume point; 2PC walk cursor (4.9) |
| undo_nxlsa | to tdes->undo_nxlsa | Pre-checkpoint CLR redirects (4.3) |
| posp_nxlsa | to tdes->posp_nxlsa | Postpone chain start (Chapter 5) |
| savept_lsa | to tdes->savept_lsa | Pre-checkpoint savepoints |
| tail_topresult_lsa | to tdes->tail_topresult_lsa | Skip completed sysops on rollback |
| start_postpone_lsa | to tdes->rcv.tran_start_postpone_lsa | Postpone completion (Chapter 8) |
| user_name | to tdes->client via set_system_internal_with_user | Loose-end owner |
LOG_INFO_CHKPT_SYSOP (log_info_chkpt_sysop) — only sysops committing with postpone are checkpointed; an ordinary in-flight sysop simply dies with its transaction:
| Field | Role | Why it exists |
|---|---|---|
| trid | Owning transaction | The sysop array is flat; entries join by id |
| sysop_start_postpone_lsa | to tdes->rcv.sysop_start_postpone_lsa | Non-null triggers re-reading that record (4.8 step 7) |
| atomic_sysop_start_lsa | to tdes->rcv.atomic_sysop_start_lsa | Drives atomic-sysop abort (Chapter 8) |
4.7 log_rv_analysis_start_checkpoint and the may_use_checkpoint guard
The LOG_START_CHKPT arm is a single condition, if (LSA_EQ (log_lsa, start_lsa)) { *may_use_checkpoint = true; }, and that condition is the design. start_lsa is where analysis began: log_Gl.hdr.chkpt_lsa, updated only when a checkpoint completes (Chapter 3). The flag arms only for the anchor start record, never for a LOG_START_CHKPT met mid-scan, because such a snapshot “can contain stuff which does not exist any longer” (source comment).
stateDiagram-v2
[*] --> Unset : analysis starts, flag false
Unset --> Armed : LOG_START_CHKPT at start_lsa
Unset --> Unset : LOG_START_CHKPT elsewhere, LSA_EQ fails
Armed --> Consumed : LOG_END_CHKPT, merge snapshot then clear flag
Unset --> Unset : LOG_END_CHKPT, guard returns early
Consumed --> Consumed : any later checkpoint records ignored
Figure 4-1: Lifecycle of may_use_checkpoint. Only the END pairing with the anchor START can merge a snapshot.
This answers the crash-window question. Crash between START and END: the header still names the previous completed checkpoint; the unfinished window’s START fails LSA_EQ, its END was never written. A second complete window inside the range (media recovery): its START fails LSA_EQ, its END dies on the 4.8 guard.
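The flag's lifecycle fits in a few lines. The struct below is a hypothetical model of the two arms: it records only whether a given END would be allowed to merge, not the merge itself.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for LOG_LSA.
struct Lsa { std::int64_t pageid; std::int32_t offset; };
bool lsa_eq (const Lsa &a, const Lsa &b) { return a.pageid == b.pageid && a.offset == b.offset; }

struct CheckpointGuard
{
  Lsa start_lsa;                    // where analysis began
  bool may_use_checkpoint = false;  // armed only by the anchor START
  int merges = 0;                   // how many END records were allowed to merge

  void on_start_chkpt (const Lsa &rec_lsa)
  {
    if (lsa_eq (rec_lsa, start_lsa))
      may_use_checkpoint = true;
  }

  // Returns true when this END record's snapshot would be merged.
  bool on_end_chkpt ()
  {
    if (!may_use_checkpoint)
      return false;                 // unpaired END: ignored
    may_use_checkpoint = false;     // single-shot
    ++merges;
    return true;
  }
};
```

Feeding it an anchor window followed by a second complete window merges exactly once; an END that follows a mid-scan START is ignored.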
4.8 log_rv_analysis_end_checkpoint — merging the snapshot, branch by branch
The longest arm; every branch accounted for:

1. Guard. if (*may_use_checkpoint == false) return NO_ERROR; unpaired ENDs die here. Otherwise the flag clears at once: single-shot.
2. Anchor capture. LSA_COPY (check_point, log_lsa) saves the END’s LSA into the driver’s checkpoint_lsa, used by the run-postpone arm (Chapter 5) and 4.9.
3. Header read. LOG_REC_CHKPT is copied by value (chkpt = *tmp_chkpt), since later page advances may evict its page.
4. Trans array, two branches. In-page (log_lsa->offset + size < LOGAREA_SIZE): used in place; else malloc + logpb_copy_from_log; malloc failure is fatal.
5. Merge loop over chkpt.ntrans entries: allocator first (NULL: free area, logpb_fatal_error, ER_FAILED), then:

   ```cpp
   // log_rv_analysis_end_checkpoint -- src/transaction/log_recovery.c
   logtb_clear_tdes (thread_p, tdes);        /* <- wipe what the loop built so far */
   if (chkpt_one->state == TRAN_ACTIVE || chkpt_one->state == TRAN_UNACTIVE_ABORTED)
     {
       tdes->state = TRAN_UNACTIVE_UNILATERALLY_ABORTED;     /* <- presumed-abort remap */
     }
   else
     {
       tdes->state = chkpt_one->state;       /* <- 2PC / postpone states survive */
     }
   // ... condensed: isloose_end, six LSA_COPYs, rcv.tran_start_postpone_lsa, user name ...
   if (LOG_ISTRAN_2PC (tdes))
     {
       *may_need_synch_checkpoint_2pc = true;        /* <- defer 2PC body reads (4.9) */
     }
   ```

   Invariant: snapshot atomicity with the END record. logtb_clear_tdes clobbers state already built from records between START and END. This is safe only because logpb_checkpoint snapshots the table and appends LOG_END_CHKPT (prior_lsa_next_record_with_lock) under one log_Gl.prior_info.prior_lsa_mutex hold: nothing appends in between, so the snapshot supersedes everything since START. Release the mutex earlier and this merge would silently regress tail_lsa/undo_nxlsa, and undo would skip live changes.
6. Trans area release. free_and_init (area) — nulls area for reuse by the sysop array.
7. Sysop merge, gated by chkpt.ntops > 0. Same in-page-vs-malloc branches as step 4. Per entry: allocate the TDES by trid; grow the topops stack (logtb_realloc_topops_stack) when tdes->topops.max == 0 || (tdes->topops.last + 1) >= tdes->topops.max (failure: free, fatal); copy both LSAs into tdes->rcv. If sysop_start_postpone_lsa is non-null: bump topops.last from -1 to 0 — else assert (tdes->topops.last == 0), at most one level during recovery — and log_read_sysop_start_postpone re-reads that record on a private page buffer to fill topops.stack[last].lastparent_lsa and .posp_lsa, which the checkpoint entry omits. The only place analysis re-reads an older record; its error path is assert (false); return error_code; — no logpb_fatal_error (4.1).
8. Redo start pull-back. if (LSA_LT (&chkpt.redo_lsa, start_redo_lsa)) LSA_COPY (start_redo_lsa, &chkpt.redo_lsa); — redo (Chapter 6) begins at the oldest dirty page’s recovery LSA.
9. Final free_and_init (area) for the sysop copy (no-op if in-page), then NO_ERROR.
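Two of the smaller decisions in this arm, the in-page test of steps 4 and 7 and the redo-start pull-back of step 8, can be written as standalone helpers. Both function names are hypothetical, and the 16352-byte area size used in the example is an arbitrary illustration value.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for LOG_LSA.
struct Lsa { std::int64_t pageid; std::int32_t offset; };
bool lsa_lt (const Lsa &a, const Lsa &b)
{
  return a.pageid < b.pageid || (a.pageid == b.pageid && a.offset < b.offset);
}

// True when an array of `size` bytes starting at `offset` can be used in
// place; otherwise the caller has to copy it out across pages.
bool fits_in_page (std::int32_t offset, std::int32_t size, std::int32_t logarea_size)
{
  return offset + size < logarea_size;
}

// Move the redo start backward if the checkpoint knows an older dirty page.
void pull_back_redo_start (Lsa &start_redo_lsa, const Lsa &chkpt_redo_lsa)
{
  if (lsa_lt (chkpt_redo_lsa, start_redo_lsa))
    start_redo_lsa = chkpt_redo_lsa;
}
```

The pull-back only ever moves the redo start earlier; a checkpoint redo_lsa at or after the current start leaves it untouched.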
4.9 may_need_synch_checkpoint_2pc — the deferred 2PC reconstruction
After the main loop, log_recovery_analysis re-fetches LOG_END_CHKPT at the saved checkpoint_lsa and, for every trans entry whose TDES still satisfies LOG_ISTRAN_2PC, calls log_2pc_recovery_analysis_info (thread_p, tdes, &chkpt_trans[i].tail_lsa) (log_2pc.c): a prev_tranlsa back-chain walk from the snapshot-time tail_lsa, reading the LOG_2PC_PREPARE body while tdes->gtrid == LOG_2PC_NULL_GTRID and the LOG_2PC_START body while tdes->coord == NULL, collecting acks. The snapshot omits 2PC bodies “due to the big space overhead (e.g., locks)” (source comment), and they may predate the window, so only a backward walk recovers them; the re-check skips transactions that completed after the snapshot.
4.10 LOG_END_OF_LOG, next_trid, and MVCCID restoration
Two pieces of global state ride along with the per-transaction rebuild. First, the EOF arm: log_rv_analysis_log_end is one branch, if (!logpb_is_page_in_archive (log_lsa->pageid)), so only an EOF in the active log counts. Inside it, LOG_RESET_APPEND_LSA (log_lsa) re-anchors the append point so post-recovery writes overwrite the EOF, and log_Gl.hdr.next_trid = tran_id restarts the TRANID counter from the EOF record’s own trid, so restart never re-issues an id seen in the log. An EOF inside an archive is an artifact of archiving an incomplete log and is skipped; the no-EOF-found repair path is the driver’s (Chapter 3).
Second, MVCCIDs. Deliberately, no analysis arm restores tdes->mvccinfo — rebuilt losers carry no MVCCID out of analysis. Instead the last statement of log_recovery_analysis (and of its incomplete-recovery early return) is log_Gl.mvcc_table.reset_start_mvccid () (mvcc_table.cpp), re-seeding the active-MVCCID bitmap start and m_current_status_lowest_active_mvccid from log_Gl.hdr.mvcc_next_id: every lower MVCCID is treated as no longer active. Redo refines the header value — each replayed MVCC record pushes log_Gl.hdr.mvcc_next_id past its own id — and reset_start_mvccid runs once more after redo (Chapter 6). A loser’s original MVCCID reappears only during undo: logtb_rv_assign_mvccid_for_undo_recovery sets tdes->mvccinfo.id from the undone record’s rcv->mvcc_id (Chapter 9).
4.11 Chapter summary — key takeaways
- log_rv_analysis_record is a logic-free demultiplexer; an unknown LOG_RECTYPE is page corruption; seven dummy/replication types are no-ops. Handler failures end in logpb_fatal_error, except end-checkpoint’s sysop re-read, dropped in release builds.
- logtb_rv_find_allocate_tran_index enforces presumed abort: transactions are born TRAN_UNACTIVE_UNILATERALLY_ABORTED at first sighting; system workers live in a separate log_system_tdes map.
- Only log_rv_analysis_compensate makes undo_nxlsa diverge from tail_lsa, jumping over already-undone work via the CLR’s stored pointer.
- log_rv_analysis_complete finds but never allocates, and is the only arm that removes table state; its stop_at branch truncates the log and keeps the index (point-in-time restore).
- The seven 2PC arms differ only in the stamped TRAN_STATE; prepare/start plant the gtrid = LOG_2PC_NULL_GTRID sentinel consumed by the post-loop log_2pc_recovery_analysis_info walk.
- A LOG_END_CHKPT merges only when armed by a LOG_START_CHKPT at exactly start_lsa, so half-built or extra checkpoint windows are ignored by construction; the logtb_clear_tdes-then-overwrite merge is safe because logpb_checkpoint snapshots the table and appends the END under one prior_lsa_mutex hold.
- Global counters ride along: LOG_END_OF_LOG re-anchors the append point and next_trid. MVCCIDs are not rebuilt per transaction; reset_start_mvccid re-seeds the MVCC table from log_Gl.hdr.mvcc_next_id, and undo re-attaches loser MVCCIDs lazily.
Chapter 5: Sysop and Postpone Bookkeeping During Analysis
The messy middles (transactions caught inside system operations, atomic sysops, or commit-time postpones) become five LSA annotations in LOG_RCV_TDES, written by the log_rv_analysis_* arms below (driver: Ch 3, dispatch: Ch 4). Theory: high-level companion (cubrid-recovery-manager.md).
5.1 LOG_RCV_TDES — the recovery annotation block
LOG_RCV_TDES (struct log_rcv_tdes in log_impl.h) is five LOG_LSA fields embedded in every LOG_TDES as field rcv; outside recovery all five stay null.
| Field | Role | Why it exists |
|---|---|---|
| sysop_start_postpone_lsa | Last open LOG_SYSOP_START_POSTPONE; written by log_rv_analysis_sysop_start_postpone, checkpoint-restored (Ch 4), reset by log_rv_analysis_sysop_end | log_recovery_finish_sysop_postpone (Ch 8) re-reads it to resume the sysop’s postpone list; no end record points to it |
| tran_start_postpone_lsa | The transaction’s LOG_COMMIT_WITH_POSTPONE; written by log_rv_analysis_commit_with_postpone + obsolete variant, checkpoint-restored (Ch 4) | Non-null-ness picks the state restored when a sysop postpone ends (5.7); bound for log_recovery_finish_postpone |
| atomic_sysop_start_lsa | Last unmatched LOG_SYSOP_ATOMIC_START; written by log_rv_analysis_atomic_sysop_start, checkpoint-restored (Ch 4), reset by both sysop arms when the atomic op is proven complete | Still set after redo → log_recovery_abort_all_atomic_sysops (Ch 8) rolls back to it before postpones run |
| analysis_last_aborted_sysop_lsa | Most recent ABORT-type LOG_SYSOP_END; written only in that arm of log_rv_analysis_sysop_end | Upper bound of the logical-redo skip range (log_recovery_needs_skip_logical_redo, Ch 6) |
| analysis_last_aborted_sysop_start_lsa | lastparent_lsa of that same aborted sysop end | Lower bound of the same skip range |
flowchart LR
cwp["commit_with_postpone"] --> f1["tran_start_postpone_lsa"]
ssp["sysop_start_postpone"] --> f2["sysop_start_postpone_lsa"]
ats["atomic_sysop_start"] --> f3["atomic_sysop_start_lsa"]
se["sysop_end"] --> f4["analysis_last_aborted_sysop_lsa<br/>+ _start_lsa"]
se -. resets .-> f2
se -. resets .-> f3
f1 --> fp["finish_postpone (Ch 8)"]
f1 --> fsp["finish_sysop_postpone (Ch 8)"]
f2 --> fsp
f3 --> aas["abort_all_atomic_sysops (Ch 8)"]
f4 --> skip["needs_skip_logical_redo (Ch 6)"]
Figure 5-1: annotation writers (left) and post-redo consumers (right), prefixes elided.
Invariant: annotations survive only while their phase is open. Each field is nulled once analysis proves its phase concluded pre-crash (reset guards, 5.7). Stale atomic_sysop_start_lsa → Ch 8 rolls back a committed operation; stale sysop_start_postpone_lsa → an already-run postpone list replays.
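As a rough illustration of the annotation block described in 5.1, the sketch below models the five fields as a plain C++ struct. Only the five field names come from the text above; the lsa_sketch type, its -1 null encoding, and the all_null helper are assumptions made for this example and are not CUBRID's definitions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for LOG_LSA; pageid == -1 models the null LSA.
struct lsa_sketch
{
  std::int64_t pageid = -1;
  std::int16_t offset = -1;

  bool is_null () const { return pageid == -1; }
  void set_null () { pageid = -1; offset = -1; }
};

// Five annotations named after the LOG_RCV_TDES fields in the table of 5.1.
struct rcv_tdes_sketch
{
  lsa_sketch sysop_start_postpone_lsa;
  lsa_sketch tran_start_postpone_lsa;
  lsa_sketch atomic_sysop_start_lsa;
  lsa_sketch analysis_last_aborted_sysop_lsa;
  lsa_sketch analysis_last_aborted_sysop_start_lsa;

  // True when no annotation is set, the state expected outside recovery.
  bool all_null () const
  {
    return sysop_start_postpone_lsa.is_null ()
           && tran_start_postpone_lsa.is_null ()
           && atomic_sysop_start_lsa.is_null ()
           && analysis_last_aborted_sysop_lsa.is_null ()
           && analysis_last_aborted_sysop_start_lsa.is_null ();
  }
};
```

A default-constructed rcv_tdes_sketch reports all_null (); setting any one field and nulling it again returns the block to that state, which is the life cycle the invariant describes.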
5.2 LOG_REC_SYSOP_END and LOG_SYSOP_END_TYPE
Every system operation ends with LOG_SYSOP_END, body LOG_REC_SYSOP_END (log_record.hpp): three fixed fields, a vfid pointer, and a union switched by type:
| Field | Role | Why it exists |
|---|---|---|
| lastparent_lsa | Transaction’s last LSA before the sysop started | Undo jump target over the sysop; compared against the annotations to detect nesting order |
| prv_topresult_lsa | Previous concluded top action’s LSA | Chains sysop results so partial abort can skip them (tail_topresult_lsa) |
| type | One of six LOG_SYSOP_END_TYPE values | Selects union interpretation and recovery behavior |
| vfid | Owning file; equals mvcc_undo’s vacuum-info file for MVCC undo | TDE (encryption) context lookup |
| union undo | Logical undo payload (LOGICAL_UNDO) | Multi-page op undoes via one logical recovery function |
| union mvcc_undo | Undo + MVCCID/vacuum info (LOGICAL_MVCC_UNDO) | Vacuum must see the operation’s MVCCID |
| union compensate_lsa | Next-undo LSA (LOGICAL_COMPENSATE) | The sysop replaces a compensation record; undo resumes here |
| union run_postpone | postpone_lsa + is_sysop_postpone flag (LOGICAL_RUN_POSTPONE) | Replaces a LOG_RUN_POSTPONE; the flag says whose postpone list advances (5.7) |
LOG_SYSOP_END_TYPE (enum log_sysop_end_type, log_record.hpp) has six values: LOG_SYSOP_END_COMMIT (“permanent changes”), LOG_SYSOP_END_ABORT (“aborted system op”), and the four LOG_SYSOP_END_LOGICAL_* flavors UNDO, MVCC_UNDO, COMPENSATE, RUN_POSTPONE. The union is a role matrix switched solely by type (asserted by LOG_SYSOP_END_TYPE_CHECK); 5.7 traces each value’s analysis-time effect.
5.3 Postpone-side arms: LOG_POSTPONE, LOG_DUMMY_HEAD_POSTPONE, LOG_RUN_POSTPONE
log_rv_analysis_postpone (LOG_POSTPONE) and log_rv_analysis_dummy_head_postpone (the no-op LOG_DUMMY_HEAD_POSTPONE marker) each have two branches: a fatal logtb_rv_find_allocate_tran_index == NULL early return (logpb_fatal_error, ER_FAILED) and the first-postpone capture. On LSA_ISNULL (posp_nxlsa) the postpone arm copies the previous tail_lsa into posp_nxlsa before advancing tail_lsa/undo_nxlsa (“set address early”); the dummy-head arm advances first and captures after (“set address late”), landing on the dummy head itself. posp_nxlsa is where log_recovery_find_first_postpone (Ch 8) starts scanning.
log_rv_analysis_run_postpone handles LOG_RUN_POSTPONE (a postpone already executed and redo-logged). Branches:
- tdes == NULL → fatal, ER_FAILED.
- State not in {WILL_COMMIT, COMMITTED_WITH_POSTPONE, TOPOPE_COMMITTED_WITH_POSTPONE} (TRAN_UNACTIVE_ elided): impossible for a checkpointed tdes (SYSTEM ERROR debug log), normal otherwise; recovery guesses topops.last == -1 → committed-with-postpone, else topope-committed.
- State now TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE → LSA_SET_NULL (undo_nxlsa): nothing left to undo.
- Body read (Ch 2 macros); run_posp->ref_lsa (the LOG_POSTPONE this record executed) resets the cursor: topops.stack[last].posp_lsa in the topope state, else tdes->posp_nxlsa (other two states asserted).
Invariant: posp_nxlsa always points at the next postpone not yet known to have run. LOG_POSTPONE sets it once; every LOG_RUN_POSTPONE advances it to ref_lsa. Lagging → Chapter 8 runs a postpone twice; overshooting → deferred work silently lost.
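The cursor movement in 5.3 can be sketched as three small functions. This is a simplified model rather than the CUBRID code: LSAs are reduced to a single integer with -1 as null, tdes_sketch keeps only the three fields the text mentions, and the function names are hypothetical. It covers the transaction-postpone case only (no topops stack).

```cpp
#include <cassert>
#include <cstdint>

using lsa_t = std::int64_t;               // assumption: one integer stands in for LOG_LSA
constexpr lsa_t NULL_LSA_SKETCH = -1;

struct tdes_sketch
{
  lsa_t tail_lsa = NULL_LSA_SKETCH;
  lsa_t undo_nxlsa = NULL_LSA_SKETCH;
  lsa_t posp_nxlsa = NULL_LSA_SKETCH;
};

// LOG_POSTPONE arm: capture early (the previous tail), then advance.
inline void on_postpone (tdes_sketch &t, lsa_t rec_lsa)
{
  if (t.posp_nxlsa == NULL_LSA_SKETCH)
    {
      t.posp_nxlsa = t.tail_lsa;
    }
  t.tail_lsa = rec_lsa;
  t.undo_nxlsa = rec_lsa;
}

// LOG_DUMMY_HEAD_POSTPONE arm: advance first, capture late (the dummy head itself).
inline void on_dummy_head_postpone (tdes_sketch &t, lsa_t rec_lsa)
{
  t.tail_lsa = rec_lsa;
  t.undo_nxlsa = rec_lsa;
  if (t.posp_nxlsa == NULL_LSA_SKETCH)
    {
      t.posp_nxlsa = t.tail_lsa;
    }
}

// LOG_RUN_POSTPONE arm, transaction-postpone case: the cursor is reset to ref_lsa.
inline void on_run_postpone (tdes_sketch &t, lsa_t ref_lsa)
{
  t.posp_nxlsa = ref_lsa;
}
```

With a tail at 100, on_postpone (t, 200) leaves the cursor at 100, while on_dummy_head_postpone (t, 200) leaves it at 200; a second postpone record does not move a cursor that is already set.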
5.4 Transaction commit with postpone
log_rv_analysis_commit_with_postpone handles LOG_COMMIT_WITH_POSTPONE: outcome decided, deferred work possibly unfinished. After the fatal-tdes branch it reads LOG_REC_START_POSTPONE (posp_lsa + at_time) and forks on is_media_crash:

    // log_rv_analysis_commit_with_postpone -- src/transaction/log_recovery.c
    if (is_media_crash)
      {
        // ... condensed: stop_at test -> resetlog + *did_incom_recovery = true ...
      }
    else
      {
        tdes->state = TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE;
        LSA_SET_NULL (&tdes->undo_nxlsa);       /* Nothing to undo */
        LSA_COPY (&tdes->tail_lsa, log_lsa);
        tdes->rcv.tran_start_postpone_lsa = tdes->tail_lsa;     /* <- annotation write */
        LSA_COPY (&tdes->posp_nxlsa, &start_posp->posp_lsa);
      }

The media-crash arm is point-in-time recovery: when stop_at != NULL && *stop_at != (time_t) (-1) && difftime (*stop_at, last_at_time) < 0 (a commit past the restore target) it releases the page, truncates the log (log_recovery_resetlog, Ch 11), sets *did_incom_recovery, and the transaction is treated as never committed. If the stop_at test fails (or stop_at is NULL/-1), the media-crash arm is a no-op: the annotation and state updates happen only in the non-media-crash arm.
log_rv_analysis_commit_with_postpone_obsolete (LOG_COMMIT_WITH_POSTPONE_OBSOLETE, old layout LOG_REC_START_POSTPONE_OBSOLETE without at_time) performs exactly the non-media-crash arm — no timestamp, no point-in-time stop. Kept only to read old-release logs; slated for removal “maybe 12.0”.
5.5 log_rv_analysis_sysop_start_postpone
LOG_SYSOP_START_POSTPONE marks a sysop that finished its main work and began its own postpone list. Its body LOG_REC_SYSOP_START_POSTPONE is an embedded LOG_REC_SYSOP_END sysop_end (what the end record will say) plus posp_lsa (first postpone of the sysop). Branches:
1. Fatal-tdes → ER_FAILED.
2. tail_lsa/undo_nxlsa advance; annotation write: tdes->rcv.sysop_start_postpone_lsa = tdes->tail_lsa.
3. Three-way fork on the embedded end type: state already topope-committed → assert_release (false) (two simultaneous sysop postpones cannot exist); sysop_end.type == LOG_SYSOP_END_LOGICAL_RUN_POSTPONE → nested is_sysop_postpone == true asserted impossible, and the transaction-postpone flavor nulls undo_nxlsa (the transaction is committing regardless of its guessed state); otherwise assert (type != LOG_SYSOP_END_ABORT), because an aborting sysop never starts a postpone phase.
4. State := TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE.
5. Topops stack grown via logtb_realloc_topops_stack if needed (ER_OUT_OF_VIRTUAL_MEMORY on failure); topops.last must be -1, bumped to 0 (assert (false) otherwise); lastparent_lsa and posp_lsa copy into topops.stack[0].
6. LSA_LT (sysop_end.lastparent_lsa, rcv.atomic_sysop_start_lsa) means the atomic marker was logged inside this sysop; reaching start-postpone proves the atomic part completed, so the marker is nulled.
Invariant: at most one live sysop postpone, so topops.last <= 0 throughout recovery. Enforced by the asserts in steps 3 and 5, re-checked in log_rv_analysis_sysop_end (assert (tdes->topops.last == 0)). If violated, the run-postpone arms would advance the wrong stack entry’s posp_lsa.
5.6 log_rv_analysis_atomic_sysop_start
The simplest arm, for LOG_SYSOP_ATOMIC_START, has two branches: fatal-tdes, and success, which advances tail_lsa/undo_nxlsa then writes tdes->rcv.atomic_sysop_start_lsa = *log_lsa (the record has no body; the LSA is the payload). If nothing clears it (5.5, 5.7), log_recovery_abort_all_atomic_sysops → log_recovery_abort_atomic_sysop (Ch 8) rolls the transaction back to this LSA before postpones resume.
5.7 log_rv_analysis_sysop_end — the intricate one
Closes a sysop of unknown kind for a transaction in an only-guessed state. Prologue: fatal-tdes branch; advance tail_lsa, undo_nxlsa, tail_topresult_lsa; read LOG_REC_SYSOP_END; LOG_SYSOP_END_TYPE_CHECK. Then the switch, where local commit_start_postpone decides whether this end also closes an open sysop-postpone phase:

    // log_rv_analysis_sysop_end -- src/transaction/log_recovery.c
    case LOG_SYSOP_END_ABORT:
      // ... condensed: comment -- abort neither changes state nor finishes a topope postpone ...
      if (tdes->state == TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE)
        {
          LSA_SET_NULL (&tdes->undo_nxlsa);     /* no undo */
        }
      tdes->rcv.analysis_last_aborted_sysop_lsa = *log_lsa;     /* <- skip-range upper bound */
      tdes->rcv.analysis_last_aborted_sysop_start_lsa = sysop_end->lastparent_lsa;      /* <- lower bound */
      break;
    case LOG_SYSOP_END_COMMIT:
      assert (tdes->state != TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE);
      /* <- falls through to next cases */
    case LOG_SYSOP_END_LOGICAL_UNDO:
    case LOG_SYSOP_END_LOGICAL_MVCC_UNDO:
      // ... condensed: todo comment ...
      commit_start_postpone = true;
      break;
    case LOG_SYSOP_END_LOGICAL_COMPENSATE:
      tdes->undo_nxlsa = sysop_end->compensate_lsa;     /* <- jump undo over compensated range */
      commit_start_postpone = true;
      break;

The ABORT arm is the aborted-sysop tracker: a LOG_DBEXTERN_REDO_DATA logical redo inside the aborted range would re-create state the pre-crash rollback destroyed, so log_recovery_needs_skip_logical_redo (Ch 6) skips records with analysis_last_aborted_sysop_start_lsa < lsa < analysis_last_aborted_sysop_lsa. Each ABORT end overwrites the fields, so only the last aborted sysop is tracked.
The LOG_SYSOP_END_LOGICAL_RUN_POSTPONE arm: in topope-committed state the run-postpone sysop could belong to either postpone scope; run_postpone.is_sysop_postpone decides:
- true (sysop’s postpone): if topops.last < 0 or state is not topope-committed, the stack is conjured: realloc if max == 0 (fatal ER_OUT_OF_VIRTUAL_MEMORY), topops.last = 0, state forced to topope-committed; then topops.stack[last].posp_lsa = run_postpone.postpone_lsa. commit_start_postpone stays false, so the phase continues.
- false (transaction’s postpone): posp_nxlsa = run_postpone.postpone_lsa; topops.last != -1 → asserts confirm the topope state, else state := TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE; undo_nxlsa nulled; commit_start_postpone = true.
The epilogue runs for every arm. In topope-committed state (assert (topops.last == 0)) with commit_start_postpone set, the sysop postpone phase is over and tran_start_postpone_lsa picks the restored state: non-null restores TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE (asserted LSA_LE to lastparent_lsa — the sysop ran inside the transaction’s postpone phase); null restores the default recovery state TRAN_UNACTIVE_UNILATERALLY_ABORTED. Either way topops.last = -1. Without commit_start_postpone the phase continues (topops.last stays 0); in any non-topope state it is (re)set to -1.
Two symmetric reset guards follow — a postpone phase and an atomic sysop can nest either way, and the end belongs to whichever started later. The atomic guard nulls rcv.atomic_sysop_start_lsa only if three conditions hold: (1) it is non-null; (2) LSA_GT over sysop_start_postpone_lsa — the atomic op is the more recent open phase; (3) LSA_GT (atomic_sysop_start_lsa, sysop_end->lastparent_lsa). Condition 3 is the resurrection guard: if lastparent_lsa >= atomic_sysop_start_lsa, this end closes a sysop that began after the atomic marker — one nested inside the still-open atomic operation — and clearing the annotation on its end would let recovery skip the still-unfinished atomic operation. Only an end whose lastparent_lsa precedes the marker (the sysop containing the marker) proves the atomic op completed and may clear it. The mirror-image guard nulls sysop_start_postpone_lsa identically.
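The three conditions of the atomic reset guard reduce to one boolean predicate. The sketch below is an illustration under simplifying assumptions: LSAs are plain integers with -1 as null (so any real LSA compares greater than a null one), and may_clear_atomic_marker is a hypothetical name, not a function in log_recovery.c.

```cpp
#include <cassert>
#include <cstdint>

using lsa_t = std::int64_t;               // assumption: one integer stands in for LOG_LSA
constexpr lsa_t NULL_LSA_SKETCH = -1;

// Returns true when a LOG_SYSOP_END whose body carries `lastparent`
// may null the atomic_sysop_start_lsa annotation.
inline bool may_clear_atomic_marker (lsa_t atomic_start, lsa_t sysop_start_postpone, lsa_t lastparent)
{
  return atomic_start != NULL_LSA_SKETCH        // (1) a marker is recorded
         && atomic_start > sysop_start_postpone // (2) the atomic op is the more recent open phase
         && atomic_start > lastparent;          // (3) the ended sysop started before the marker
}
```

For a marker at LSA 50 and no open postpone phase, an end with lastparent 40 clears the marker, while an end with lastparent 60 (a sysop nested inside the atomic operation) leaves it in place.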
5.8 Chapter summary — key takeaways
- LOG_RCV_TDES is a five-LSA annotation block in every LOG_TDES, written by analysis arms (plus checkpoint restore, Ch 4), consumed by Chapters 6 and 8, nulled once its phase is proven concluded.
- log_rv_analysis_commit_with_postpone writes tran_start_postpone_lsa and doubles as the point-in-time stop on media crash; the obsolete variant is the same minus the timestamp.
- log_rv_analysis_sysop_start_postpone writes sysop_start_postpone_lsa, forces topops.last from -1 to 0, and clears an atomic_sysop_start_lsa proven nested inside the now-postponing sysop.
- log_rv_analysis_sysop_end is a six-arm switch: ABORT records the skip range without changing state; COMMIT and both LOGICAL_UNDO flavors close an open sysop postpone phase; LOGICAL_COMPENSATE also redirects undo_nxlsa; LOGICAL_RUN_POSTPONE disambiguates via is_sysop_postpone.
- When a sysop postpone phase closes, the prior state is rebuilt from tran_start_postpone_lsa: non-null → TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE, null → TRAN_UNACTIVE_UNILATERALLY_ABORTED; the reset guards then compare both annotations against lastparent_lsa so an end clears only its own phase’s annotation.
- analysis_last_aborted_sysop_start_lsa < lsa < analysis_last_aborted_sysop_lsa is how log_recovery_needs_skip_logical_redo suppresses LOG_DBEXTERN_REDO_DATA replay inside a pre-crash-aborted sysop.
Chapter 6: Redo Pass Driver and Synchronous Apply
log_recovery_redo replays the log forward over the range analysis fixed (Chapter 3) against the rebuilt transaction table (Chapter 4): the driver loop, then the synchronous apply path down to LZ4/XOR payload assembly. Theory: companion, “Redo pass”; parallel leg: Chapter 7; loose ends: Chapter 8; RV_fun: Chapter 10.
6.1 The redo context and per-record structs
log_rv_redo_context is the whole apply state; its constructor pre-allocates both LOG_ZIP buffers at LOGAREA_SIZE:
| Field | Role |
|---|---|
| m_reader | private log cursor (Chapter 2); parallel workers each need one |
| m_redo_zip | redo payload scratch+output; rcv.data points into it, no per-record malloc |
| m_undo_zip | undo scratch for diff undoredo records; diffed redo XORs against the undo image |
| m_end_redo_lsa | const hard stop; past-end records are torn tail; bounds the §6.4 page-LSA assert |
| m_reader_fetch_page_mode | NORMAL for crash recovery (trusts its snapshot), FORCE for replication re-fetch |
The copy constructor delegates to the main constructor — copies share nothing, making Chapter 7’s per-worker copies safe. Each record travels as a value snapshot, log_rv_redo_rec_info<T>, with exactly three fields: m_start_lsa — the record header’s LSA, stamped onto the page after apply, the idempotence comparand; m_type — the concrete LOG_RECTYPE (one T serves plain and DIFF rectypes; the diff decision needs it); m_logrec — a by-value copy of the typed body taken via reinterpret_copy_and_add_align, so a queued job holds no log-page pointer.
Debug-only vpid_lsa_consistency_check (check / cleanup) has exactly two fields: mtx — parallel redo workers call check concurrently — and consistency_check_map, the first-seen LSA per (volid, pageid) (emplace never overwrites an existing key); cleanup clears it at pass end.
Invariant — per-page LSA ordering. Out-of-order apply loses updates. Enforced (debug, rcv_phase != LOG_RESTARTED) by assert ((*map_it).second < a_log_lsa) — each new LSA compared against the page’s first recorded LSA (weaker than pairwise monotonicity; emplace keeps the original entry).
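The "weaker than pairwise monotonicity" remark follows from how std::map::emplace behaves when the key already exists. The sketch below is a hypothetical reduction of the debug check: the class name, the integer LSA type, and the bool return value (the text describes an assert instead) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <utility>

using lsa_t = std::int64_t;               // assumption: one integer stands in for LOG_LSA
using vpid_key = std::pair<int, int>;     // (volid, pageid)

class vpid_lsa_check_sketch
{
  public:
    // Returns false when a_lsa is not greater than the first LSA seen for the page.
    bool check (const vpid_key &vpid, lsa_t a_lsa)
    {
      std::lock_guard<std::mutex> guard (m_mtx);
      const auto result = m_first_seen.emplace (vpid, a_lsa);
      if (result.second)
        {
          return true;                        // first sighting: entry recorded
        }
      return result.first->second < a_lsa;    // emplace kept the original entry
    }

    void cleanup ()
    {
      std::lock_guard<std::mutex> guard (m_mtx);
      m_first_seen.clear ();
    }

  private:
    std::mutex m_mtx;
    std::map<vpid_key, lsa_t> m_first_seen;
};
```

After LSAs 100 and 300 have been checked for one page, an LSA of 200 still passes, because the comparison is against 100 (the first entry) and not against 300; only an LSA at or below 100 fails.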
Invariant — m_redo_zip buffer stability. rcv.data aliases m_redo_zip.log_data until the redofun returns; enforced structurally — one context per thread, sequential assembly; recycle early and the redofun reads garbage.
6.2 log_recovery_redo — setup and the outer loop
The driver drops the log critical section (LOG_CS_EXIT; re-entered at the tail). log_recovery_get_redo_parallel_count, which returns MAX (16, system_core_count), sizes reusable_jobs and cublog::redo_parallel under SERVER_MODE (Chapter 7); in SA mode parallel_recovery_redo stays nullptr, all applies synchronous. Pre-loop defenses: a start_redolsa offset too close to the page end trips assert (false) and resumes at the next page; PRM_ID_RECOVERY_PROGRESS_LOGGING_INTERVAL (5-second floor) periodically emits ER_LOG_RECOVERY_PROGRESS with pages done/total and ETA.
The outer loop fetches the page holding lsa; on fetch failure, lsa > m_end_redo_lsa is the normal past-the-end goto exit, failure inside the promised range is logpb_fatal_error. The inner loop walks records while lsa.pageid == m_reader.get_pageid (); each iteration re-positions the reader at the (possibly repaired) record lsa via set_lsa_and_fetch_page before reading the header:
flowchart TD
A["record at lsa"] --> B{"past end_redo_lsa?"}
B -- yes --> Z["null lsa, break"]
B -- no --> C["offset repair if NULL"]
C --> H["re-fetch at lsa, read header, lsa = forw_lsa"]
H --> K{"lsa strictly advances?"}
K -- no --> L["fatal: loop in log"]
K -- yes --> M["switch on log_rtype"]
M --> P["pageid fixup"] --> A
Figure 6-1: skeleton of one inner-loop iteration of log_recovery_redo; the callouts below account for each branch.
Archive page-boundary repair — an incompletely archived record leaves the page-header offset or forw_lsa NULL. A NULL lsa.offset takes the page-header offset; if that too is NULL, archive page -> pageid + 1, active page -> genuine end of log (pageid = NULL_PAGEID); continue. A NULL forw_lsa on an archived page likewise advances to pageid + 1. Loop-in-log defense — a next lsa that does not strictly advance is logpb_fatal_error instead of spinning. Post-switch fixup — after a multi-page body, lsa.pageid jumps to the reader’s page so consumed pages are not re-fetched.
Invariant — the scan strictly advances. Every path moves lsa forward or nulls it and terminates; otherwise recovery replays the same range forever.
6.3 The dispatch switch — every record-type arm
Past the header, two local macros carry each typed arm: BUILD_RECORD_INFO (TEMPLATE_TYPE) wraps rcv_lsa, log_rtype and the reinterpret_copy_and_add_align<TEMPLATE_TYPE> () body copy into a log_rv_redo_rec_info; INVOKE_REDO_RECORD forwards it to log_rv_redo_record_sync_or_dispatch_async, where log_rv_need_sync_redo forces the sync leg for null-VPID records and the volume/sector RVDK_* rcvindexes (enumerated in Chapter 7). Every arm, branch-complete:
| Arm | Action |
|---|---|
| LOG_UNDOREDO_DATA, LOG_DIFF_UNDOREDO_DATA, LOG_RUN_POSTPONE, LOG_COMPENSATE | plain build+invoke (§6.4) |
| LOG_UNDO_DATA, LOG_POSTPONE, LOG_SAVEPOINT, postpone markers (LOG_DUMMY_HEAD_POSTPONE, LOG_COMMIT_WITH_POSTPONE/_OBSOLETE, LOG_SYSOP_START_POSTPONE), checkpoint, 2PC decision/inform, HA/replication/dummy types, LOG_SUPPLEMENTAL_INFO, LOG_SYSOP_ATOMIC_START, LOG_END_OF_LOG | explicit no-op break |
| LOG_MVCC_UNDOREDO_DATA, LOG_MVCC_DIFF_UNDOREDO_DATA | bump mvcc_next_id past mvccid, set mvcc_op_log_lsa = rcv_lsa (vacuum); invoke |
| LOG_MVCC_REDO_DATA | bump mvcc_next_id only (vacuum reads undo data); invoke |
| LOG_REDO_DATA | RVVAC_COMPLETE -> logpb_vacuum_reset_log_header_cache; invoke |
| LOG_DBEXTERN_REDO_DATA | page-less (pgptr = NULL, offset = -1); gated by the skip check below; applies via log_rv_redo_record |
| LOG_2PC_PREPARE | missing tran/tdes -> break; else log_2pc_read_prepare re-reads the gtrid, with LOG_2PC_OBTAIN_LOCKS only in state TRAN_UNACTIVE_2PC_PREPARE |
| LOG_2PC_START | rebuild coordinator info if tran alive and LOG_ISTRAN_2PC; alloc failure -> fatal + break |
| LOG_COMMIT, LOG_ABORT | assert-only: completed non-system tran must be gone |
| LOG_MVCC_UNDO_DATA | bookkeeping only (mvcc_next_id, mvcc_op_log_lsa); not applied |
| LOG_SYSOP_END | LOG_SYSOP_END_LOGICAL_MVCC_UNDO -> mvcc_op_log_lsa = rcv_lsa |
| default (+LOG_SMALLER/LARGER_LOGREC_TYPE) | er_set (ER_LOG_PAGE_CORRUPTED); null lsa if forw_lsa pointed back at this record |
log_recovery_needs_skip_logical_redo, the repeated-crash defense, has three early false returns — wrong rectype, NULL_TRAN_INDEX, NULL tdes — and one true path:

    // log_recovery_needs_skip_logical_redo -- src/transaction/log_recovery.c
    if (LSA_LT (&tdes->rcv.analysis_last_aborted_sysop_start_lsa, lsa)
        && LSA_LT (lsa, &tdes->rcv.analysis_last_aborted_sysop_lsa))
      {
        /* ... condensed: er_log_debug ... */
        return true;    /* <- strictly inside a sysop a previous recovery already aborted */
      }

An LSA outside the window falls through to the trailing return false. Analysis stamped the endpoints (Chapter 5); the record and its compensation already sit in the log from a previous recovery cycle.
Tail sequence. (SERVER_MODE) parallel_recovery_redo->wait_for_termination_and_stop_execution () drains every async job; LOG_CS_ENTER; log_Gl.mvcc_table.reset_start_mvccid () recomputes the MVCC baseline; the Chapter 8 hand-off (log_recovery_abort_all_atomic_sysops, log_recovery_finish_all_postpone); then logpb_flush_pages_direct, logpb_flush_header, pgbuf_flush_all. The exit: label — also the past-the-end target — nulls curr_rcv_rec_lsa, runs the consistency-check cleanup (), reports perf stats.
6.4 log_rv_redo_record_sync — fix, extract, apply

    // log_rv_redo_record_sync -- src/transaction/log_recovery_redo.hpp
    // ... condensed: debug-only vpid_lsa_consistency_check.check (rcv_vpid, m_start_lsa) ...
    const LOG_DATA &log_data = log_rv_get_log_rec_data<T> (record_info.m_logrec);
    LOG_RCV rcv;
    if (!log_rv_fix_page_and_check_redo_is_needed (thread_p, rcv_vpid, rcv, log_data.rcvindex,
                                                   record_info.m_start_lsa, redo_context.m_end_redo_lsa))
      {
        // ... condensed: assert (rcv.pgptr == nullptr) ...
        return;         /* <- page gone, or change already on disk */
      }
    scope_exit unfix_rcv_pgptr { [&thread_p, &rcv] () { pgbuf_unfix_and_init_after_check (thread_p, rcv.pgptr); } };    /* <- unfix on every exit */
    // ... condensed: rcv field extractors; payload assembly ...
    rvfun::fun_t redofunc = log_rv_get_fun<T> (record_info.m_logrec, log_data.rcvindex);

The condensed tail: payload-assembly error -> logpb_fatal_error + return (the scope_exit still unfixes); non-null redofunc runs under perfmon_counter_timer_raii_tracker (PSTAT_LOG_REDO_FUNC_EXEC), failure -> logpb_fatal_error; null redofunc -> er_log_debug warning only; a non-null rcv.pgptr is then stamped with m_start_lsa via pgbuf_set_lsa.
The gatekeeper log_rv_fix_page_and_check_redo_is_needed has three outcomes: (1) non-null VPID but log_rv_redo_fix_page returns null — assert (log_is_in_crash_recovery ()), return false, the deallocated-page skip; (2) page fixed but rcv_lsa <= *pgbuf_get_lsa (rcv.pgptr) — pgbuf_unfix_and_init, return false, change already on disk (an assert also rejects page LSAs beyond end_redo_lsa); (3) otherwise return true, including the null-VPID fall-through that leaves rcv.pgptr == nullptr for page-less records. log_rv_redo_fix_page fixes in RECOVERY_PAGE mode with no sector-reservation check — sector tables replay in parallel, so a page may transiently look deallocated; the check costs more than skipping saves; NULL is assert_release material (“this is terrible, because it makes recovery impossible”).
Invariant — redo idempotence via page LSA. Skip when rcv_lsa <= page LSA, stamp m_start_lsa after applying; break the stamp and every later crash double-applies non-idempotent redo.
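The skip-then-stamp rule can be shown on a toy page. In the sketch below, page_sketch, the integer LSA type, and redo_if_needed are hypothetical stand-ins chosen for illustration; the real path goes through log_rv_fix_page_and_check_redo_is_needed and pgbuf_set_lsa as described above.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

using lsa_t = std::int64_t;               // assumption: one integer stands in for LOG_LSA

struct page_sketch
{
  lsa_t lsa = -1;   // LSA of the last change known to be on the page
  int value = 0;    // stand-in for page contents
};

// Applies the redo function only when the record is newer than the page,
// then stamps the page with the record's LSA. Returns true if redo ran.
inline bool redo_if_needed (page_sketch &page, lsa_t rcv_lsa,
                            const std::function<void (page_sketch &)> &redofun)
{
  if (rcv_lsa <= page.lsa)
    {
      return false;       // change already on the page: skip
    }
  redofun (page);
  page.lsa = rcv_lsa;     // stamp, so replaying the same record later is a no-op
  return true;
}
```

With a non-idempotent redo such as an increment, replaying the record at LSA 10 twice changes the page once; dropping the stamp line would make the second replay increment again.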
Six extractor template families — primaries uninstantiable via static_assert (sizeof (T) == 0) — flatten six record shapes into one generic routine. The outlier: log_rv_get_fun<LOG_REC_COMPENSATE> returns RV_fun[rcvindex].undofun (“yes, undo” in source) — a CLR’s redo payload is the undo image, so replay runs the undo function: ARIES repeating history (companion, “Compensation log records”).
| T | _data / _vpid / _offset from | _mvccid | _redo_length | log_rv_get_fun |
|---|---|---|---|---|
| LOG_REC_MVCC_UNDOREDO | undoredo.data | mvccid | undoredo.rlength | redofun |
| LOG_REC_UNDOREDO | data | MVCCID_NULL | rlength | redofun |
| LOG_REC_MVCC_REDO | redo.data | mvccid | redo.length | redofun |
| LOG_REC_REDO | data | MVCCID_NULL | length | redofun |
| LOG_REC_RUN_POSTPONE | data | MVCCID_NULL | length | redofun |
| LOG_REC_COMPENSATE | data | MVCCID_NULL | length | undofun |
6.5 Payload assembly — unzip, diff, hand off
log_rv_get_log_rec_redo_data<T> decodes the payload. The four single-image specializations (LOG_REC_MVCC_REDO, LOG_REC_REDO, LOG_REC_RUN_POSTPONE, LOG_REC_COMPENSATE) call log_rv_get_unzip_and_diff_redo_log_data with no undo data; LOG_REC_MVCC_UNDOREDO re-wraps its embedded undoredo member as a log_rv_redo_rec_info<LOG_REC_UNDOREDO> and delegates. Only LOG_REC_UNDOREDO branches, on m_type rather than T: for the two DIFF rectypes (need_diff_with_undo) it first unzips the undo image into m_undo_zip via log_rv_get_unzip_log_data (fatal + return on error), aligns, and passes m_undo_zip.data_length / .log_data on; otherwise it skips the unneeded undo image (m_reader.skip (GET_ZIP_LEN (ulength)), fatal + ER_FAILED on error), aligns, and passes (0, nullptr).
log_rv_get_unzip_log_data decodes one image, branch-complete. The length field’s sign bit is the compression flag — MAKE_ZIP_LEN sets 0x80000000 at logging time, ZIP_CHECK tests it, GET_ZIP_LEN strips it; even the skip path above goes through GET_ZIP_LEN. is_zip = ZIP_CHECK (length); an image that does_fit_in_current_page is aliased straight off the page buffer, a spanning one is heap-copied via copy_from_log. Compressed -> log_unzip (failure: fatal + ER_FAILED); uncompressed -> log_zip_realloc_if_needed (failure fatal) + memcpy. Finally add_align in the fits case, bare align () in the copy case since copy_from_log already advanced the reader.
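The sign-bit encoding is simple enough to model directly. The sketch below uses lowercase constexpr functions on an unsigned 32-bit value as stand-ins for the MAKE_ZIP_LEN, ZIP_CHECK and GET_ZIP_LEN macros; the function names and the unsigned type are assumptions for illustration, and only the 0x80000000 flag value comes from the text.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t ZIP_FLAG_SKETCH = 0x80000000u;   // top bit marks a compressed image

// Sets the flag at logging time.
constexpr std::uint32_t make_zip_len (std::uint32_t length)
{
  return length | ZIP_FLAG_SKETCH;
}

// Tests the flag at recovery time.
constexpr bool zip_check (std::uint32_t length)
{
  return (length & ZIP_FLAG_SKETCH) != 0;
}

// Strips the flag to recover the byte count; harmless on an unflagged length.
constexpr std::uint32_t get_zip_len (std::uint32_t length)
{
  return length & ~ZIP_FLAG_SKETCH;
}

static_assert (get_zip_len (make_zip_len (100)) == 100, "flag round-trips");
static_assert (!zip_check (100), "plain length carries no flag");
```

Because get_zip_len leaves an unflagged length unchanged, a caller that only needs to skip an image can strip the bit without first testing it, which matches the skip path described above.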
log_rv_get_unzip_and_diff_redo_log_data layers the diff on top: after log_rv_get_unzip_log_data into the caller’s redo_unzip (failure: fatal + ER_FAILED), it un-diffs only if (is_zip) and only when undo_length > 0 && undo_data != nullptr — log_diff (undo_length, undo_data, redo_unzip.data_length, redo_unzip.log_data) — then hands off rcv->length / rcv->data, borrowing m_redo_zip storage. The is_zip gate works because diffed redo exists only compressed: at append time the XOR runs on a scratch copy and the DIFF rectype is set only when is_redo_zip; a failed compression writes the original un-diffed crumbs with the bit clear. log_unzip reads the original-length prefix log_zip stored, rejects buf_size <= 0, fails if log_zip_realloc_if_needed cannot grow the destination, LZ4-decompresses, and succeeds only when unzip_len == buf_size — short or negative means corruption, not truncation. log_diff is *(p++) ^= *(q++) over MIN (undo_length, redo_length) bytes — XOR is its own inverse, so one routine serves both directions.
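That one XOR routine serves both directions can be checked on a few bytes. The sketch below is a minimal model of the in-place XOR over the shorter of the two lengths; xor_diff and its std::vector interface are hypothetical and differ from the pointer-and-length signature of log_diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// XORs the first min (undo, redo) bytes of undo into redo, in place.
// Bytes of redo beyond the undo length are left untouched.
inline void xor_diff (const std::vector<unsigned char> &undo, std::vector<unsigned char> &redo)
{
  const std::size_t n = std::min (undo.size (), redo.size ());
  for (std::size_t i = 0; i < n; ++i)
    {
      redo[i] ^= undo[i];
    }
}
```

Applying xor_diff once turns the redo image into a diff (bytes equal to the undo image become zero, which compresses well); applying it a second time with the same undo image restores the original redo bytes.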
The page-less twin log_rv_redo_record (the LOG_DBEXTERN_REDO_DATA arm) runs the same assemble-then-call sequence without the fix/skip gate: payload failure -> fatal + return; redofun failure -> fatal too; redofun == NULL -> debug warning; rcv->pgptr != NULL -> pgbuf_set_lsa (vacuous here — pgptr is NULL).
6.6 Chapter summary — key takeaways
- log_rv_redo_context is the whole redo state: one log cursor plus two pre-allocated LOG_ZIP buffers whose storage rcv.data borrows; share-nothing copies enable Chapter 7’s workers.
- The switch also bookkeeps mvcc_next_id, mvcc_op_log_lsa (undo-bearing records only), the RVVAC_COMPLETE reset, and the logical-redo skip window.
- Idempotence: skip when rcv_lsa <= page LSA, stamp m_start_lsa after apply; log_rv_redo_fix_page deliberately accepts deallocated pages.
- Six extractor families flatten six record shapes into one apply routine; log_rv_get_fun<LOG_REC_COMPENSATE> returns the undofun, so a CLR replays as an undo.
- One sign bit encodes compression (MAKE_ZIP_LEN/ZIP_CHECK/GET_ZIP_LEN); diffed redo exists only compressed, so log_diff (XOR) runs only when is_zip.
- Nothing after the loop runs until wait_for_termination_and_stop_execution drains parallel redo; only then reset_start_mvccid, Chapter 8 finishing, and the flushes.
Chapter 7: Parallel Redo Infrastructure
Chapter 6’s driver hands each redoable record to log_rv_redo_record_sync_or_dispatch_async; only concrete-page, non-volume records go async, and per-page LSA order is inherited from push order because every VPID hashes to a fixed task.
7.1 Dispatch — log_rv_redo_record_sync_or_dispatch_async
Instantiated per record type by INVOKE_REDO_RECORD:

    // log_rv_redo_record_sync_or_dispatch_async -- src/transaction/log_recovery_redo_parallel.hpp
    const VPID rcv_vpid = log_rv_get_log_rec_vpid<T> (record_info.m_logrec);
    #if defined (SERVER_MODE)
    // ... condensed: log_data ref ...
    const bool need_sync_redo = log_rv_need_sync_redo (rcv_vpid, log_data.rcvindex);
    // ... condensed: PREP perf tick ...
    if (parallel_recovery_redo == nullptr || need_sync_redo)
      {
        log_rv_redo_record_sync<T> (thread_p, redo_context, record_info, rcv_vpid);
        // ... condensed: DO_SYNC perf tick ...
      }
    else
      {
        cublog::redo_job_impl *const job = a_reusable_jobs.blocking_pop (a_rcv_redo_perf_stat);
        assert (job != nullptr);
        job->set_record_info (rcv_vpid, record_info.m_start_lsa, record_info.m_type);
        parallel_recovery_redo->add (job);
        // ... condensed: DO_ASYNC perf tick ...
      }
    #else // !SERVER_MODE = SA_MODE
    log_rv_redo_record_sync<T> (thread_p, redo_context, record_info, rcv_vpid);
    #endif

SA_MODE compiles the cublog classes to empty dummies; Figure 7-1 covers every exit. The predicate:
// log_rv_need_sync_redo -- src/transaction/log_recovery.c if (VPID_ISNULL (&a_rcv_vpid)) { return true; /* <- no target page to hash */ } switch (a_rcvindex) { case RVDK_NEWVOL: // ... condensed: RVDK_FORMAT, RVDK_INITMAP, RVDK_EXPAND_VOLUME, RVDK_VOLHEAD_EXPAND ... return true; /* <- see Inv 7-A */ case RVDK_RESERVE_SECTORS: // ... condensed: RVDK_UNRESERVE_SECTORS ... return true; /* <- "may be changed to async" */ default: return false; }Invariant 7-A (sync record as happens-before barrier). The main thread applies a sync record before pushing any later job; new-volume pages appear only in later records, so no worker can fix a page of a volume whose creation is still unexecuted.
flowchart TD
A["record"] --> B{"SERVER_MODE?"}
B -- "no (SA)" --> S1["sync apply"]
B -- yes --> C{"infra null?"}
C -- yes --> S1
C -- no --> D{"log_rv_need_sync_redo"}
D -- "null VPID or RVDK volume, sector" --> S1
D -- false --> E["blocking_pop + set_record_info"]
E --> G["add: hash vpid to fixed task"]
Figure 7-1: dispatch exits.
7.2 Sizing and construction — redo_parallel
log_recovery_redo registers pool demand via REGISTER_WORKERPOOL and builds once before the forward scan: reusable_jobs.initialize (count) plus new cublog::redo_parallel (count, false, MAX_LSA, redo_context); false/MAX_LSA disables monitoring (7.8). The count:

```cpp
// log_recovery_get_redo_parallel_count -- src/transaction/log_recovery.c
  const int num_cpus = cubthread::system_core_count ();
  const int minimum_threads_to_redo = 16;       /* <- "determined experimentally" */
  return MAX (minimum_threads_to_redo, num_cpus);
```

The floor of 16 oversubscribes small machines, which is tolerable because workers are I/O-bound. The constructor runs do_init_worker_pool (workers = slots = a_task_count), then do_init_tasks and the monitor.
| Field | Role | Why it exists |
|---|---|---|
| m_task_count | VPID-binning modulus | Fixed at construction (Inv 7-B) |
| m_pool_entry_manager | TT_RECOVERY entry factory | Workers need real entries |
| m_task_state_bookkeeping | Bitset of active tasks | Unbounded wait (7.7) |
| m_worker_pool | Worker pool pointer | Owns OS threads |
| m_redo_tasks | vector<unique_ptr<redo_task>> | Owner-managed; perf stats survive |
| m_vpid_hash | std::hash<VPID> | Binning function of add |
| m_min_unapplied_log_lsa_calculation | Progress monitor (7.8) | Replication only |
7.3 The VPID hash — order without locks
```cpp
// redo_parallel::add -- src/transaction/log_recovery_redo_parallel.cpp
  const std::size_t task_index = m_vpid_hash (a_job->get_vpid ()) % m_task_count;
  redo_task *const task = m_redo_tasks[task_index].get ();
  task->push_job (a_job);
```

Invariant 7-B (per-page order from push order). The main thread pushes in increasing LSA order, a VPID always hashes to the same task (m_task_count is immutable), and each task drains FIFO, so per-page apply order is log order without any lock. Break any leg and two workers race on one page, masked by the rcv_lsa <= page_lsa skip. Cross-page order is not preserved.
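The three legs of Invariant 7-B can be illustrated with a toy model. This is a sketch under stated assumptions: `toy_vpid`, `toy_job`, `toy_redo_parallel` and the hash mixing are hypothetical; only the "hash the page id modulo a fixed task count, append to that task's FIFO" idea comes from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Toy stand-ins, not CUBRID types.
struct toy_vpid { std::int32_t pageid; std::int16_t volid; };
struct toy_job { toy_vpid vpid; std::int64_t lsa; };

// Hypothetical hash; the real std::hash<VPID> specialization may differ.
inline std::size_t toy_hash_vpid (const toy_vpid &v)
{
  const std::uint64_t key =
    (static_cast<std::uint64_t> (static_cast<std::uint16_t> (v.volid)) << 32)
    | static_cast<std::uint32_t> (v.pageid);
  return std::hash<std::uint64_t> {} (key);
}

class toy_redo_parallel
{
  public:
    explicit toy_redo_parallel (std::size_t task_count) : m_queues (task_count) {}

    // Same VPID -> same queue, because the modulus never changes.
    std::size_t add (const toy_job &job)
    {
      const std::size_t idx = toy_hash_vpid (job.vpid) % m_queues.size ();
      m_queues[idx].push_back (job);    // FIFO: append at the tail
      return idx;
    }

    std::size_t task_count () const { return m_queues.size (); }
    const std::vector<toy_job> &queue (std::size_t idx) const { return m_queues[idx]; }

  private:
    std::vector<std::vector<toy_job>> m_queues;
};

// True when every queue holds its jobs in strictly increasing LSA order.
inline bool every_queue_is_lsa_ordered (const toy_redo_parallel &rp)
{
  for (std::size_t i = 0; i < rp.task_count (); ++i)
    {
      const std::vector<toy_job> &q = rp.queue (i);
      for (std::size_t k = 1; k < q.size (); ++k)
        {
          if (q[k - 1].lsa >= q[k].lsa)
            {
              return false;
            }
        }
    }
  return true;
}
```

If the producer pushes in increasing LSA order, each queue is LSA-sorted as a whole, which implies per-page order; nothing orders two different queues relative to each other.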
redo_job_base — the queueable unit:
| Field | Role | Why it exists |
|---|---|---|
| m_vpid | Target page; null when defaulted | Binning key; get_vpid asserts non-null |
| m_log_lsa | LSA of the record | Where to re-read from (7.5); progress marker (7.8) |
redo_task::push_job sets the unapplied marker only when monitoring is armed and the queue was empty (crash recovery passes false, so never), and notifies only past PRM_ID_RECOVERY_REDO_MINIMUM_JOB_COUNT (hidden, default 100).
7.4 redo_task::execute
redo_task (.cpp-private cubthread::task):

| Field | Role | Why it exists |
|---|---|---|
| m_task_idx | Identity 0..N-1 | Index into bitset and push vectors |
| m_do_monitor_unapplied_log_lsa | Maintain marker or not | Recovery passes false |
| m_task_state_bookkeeping | Ref to owner’s bitset | Set in ctor, cleared after drain |
| m_perf_stats_definition / m_perf_stats | Per-task counters | Timings in 7.9 |
| m_redo_context | Private context copy | Own reader + zip buffers (7.5) |
| m_produce_vec (+_mtx, _cv) | Job queue; reserves ONE_M | Swap: one lock per batch |
| m_adding_finished | End-of-stream flag, set under mutex | Checked only when queue empty |
| m_unapplied_log_lsa | atomic<log_lsa>, MAX_LSA idle | Feeds global minimum (7.8) |
```cpp
// redo_task::execute -- src/transaction/log_recovery_redo_parallel.cpp
  for ( ; ; )
    {
      bool adding_finished { false };
      pop_jobs (jobs_vec, adding_finished);
      if (jobs_vec.empty () && adding_finished)
        {
          break;                /* <- only exit */
        }
      else
        {
          assert (!jobs_vec.empty ());
          THREAD_ENTRY *const thread_entry = &context;
          for (auto &job : jobs_vec)
            {
              // ... condensed: marker update ...
              job->execute (thread_entry, m_redo_context);
              job->retire (m_task_idx);
            }
          jobs_vec.clear ();    /* <- jobs already recycled */
        }
    }
  m_task_state_bookkeeping.set_inactive (m_task_idx);
```

pop_jobs asserts its post-condition as an exact xor of empty and finished. Its 1 s wait_for period (PRM_ID_RECOVERY_REDO_JOB_PERIOD_IN_SECS, hidden) drains the un-notified trickle; notify_adding_finished flips the flag under the same mutex, so no wakeup is lost.
flowchart TD
W["wait_for, 1 s period"] --> P{"queue empty?"}
P -- no --> SW["swap into local jobs_vec"]
P -- yes --> MK["park marker, monitored only"]
MK --> F{"m_adding_finished?"}
F -- no --> W
F -- yes --> Z["return empty + finished"]
SW --> EX["per job: execute, retire"]
EX --> W
Z --> IN["set_inactive, cv notify"]
Figure 7-2: pop_jobs and drain loop exits.
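The swap-based pop and its xor post-condition can be sketched in isolation. This is a simplified model, not the CUBRID class: `toy_task_queue` is hypothetical, jobs are plain integers, every push notifies (the text says the real code notifies only past a threshold), and the wait period is shortened to 10 ms.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <vector>

class toy_task_queue
{
  public:
    void push_job (int job)
    {
      {
        std::lock_guard<std::mutex> lock (m_mtx);
        assert (!m_adding_finished);    // no pushes after end-of-stream
        m_produce_vec.push_back (job);
      }
      m_cv.notify_one ();               // assumption: notify on every push
    }

    void notify_adding_finished ()
    {
      {
        std::lock_guard<std::mutex> lock (m_mtx);
        m_adding_finished = true;       // flipped under the queue mutex
      }
      m_cv.notify_one ();
    }

    // Post-condition: exactly one of "out is non-empty" and "finished" holds.
    void pop_jobs (std::vector<int> &out, bool &finished)
    {
      assert (out.empty ());
      std::unique_lock<std::mutex> lock (m_mtx);
      for (;;)
        {
          if (!m_produce_vec.empty ())
            {
              out.swap (m_produce_vec); // one lock per batch, O(1) hand-off
              finished = false;
              return;
            }
          if (m_adding_finished)        // checked only when the queue is empty
            {
              finished = true;
              return;
            }
          m_cv.wait_for (lock, std::chrono::milliseconds (10));
        }
    }

  private:
    std::mutex m_mtx;
    std::condition_variable m_cv;
    std::vector<int> m_produce_vec;
    bool m_adding_finished = false;
};

// Consumer loop in the shape of Figure 7-2; returns the sum of all jobs.
inline long toy_drain (toy_task_queue &q)
{
  long sum = 0;
  std::vector<int> jobs;
  for (;;)
    {
      bool finished = false;
      q.pop_jobs (jobs, finished);
      if (jobs.empty () && finished)
        {
          break;                        // the only exit
        }
      for (int j : jobs)
        {
          sum += j;                     // stands in for execute + retire
        }
      jobs.clear ();
    }
  return sum;
}
```

Checking the finished flag only after the queue is found empty is what lets the consumer drain every job pushed before end-of-stream.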
7.5 redo_job_impl::execute — the re-fetch
| redo_job_impl field | Role | Why it exists |
|---|---|---|
| m_reusable_job_stack | Pool back-pointer, “guaranteed to outlive this instance” | retire = push (a_task_idx, this) |
| m_log_rtype | LOG_RECTYPE stamped by set_record_info | Selects the log_rec_* layout to re-read |

```cpp
// redo_job_impl::execute -- src/transaction/log_recovery_redo_parallel.cpp
  const int err_fetch = redo_context.m_reader.set_lsa_and_fetch_page (get_log_lsa (), redo_context.m_reader_fetch_page_mode);
  if (err_fetch != NO_ERROR)
    {
      return err_fetch;         /* <- sole error exit */
    }
  redo_context.m_reader.add_align (sizeof (LOG_RECORD_HEADER));
  switch (m_log_rtype)
    {
    case LOG_REDO_DATA:
      read_record_and_redo<log_rec_redo> (thread_p, redo_context);
      break;
    // ... condensed: 7 more labels (MVCC/diff undoredo, RUN_POSTPONE, COMPENSATE) ...
    default:
      assert (false);           /* <- unreachable */
    }
```

The eight labels are Chapter 6’s page-bound redoable types; read_record_and_redo<T> re-parses the typed header and funnels into log_rv_redo_record_sync<T>, the sync path’s sink. That error exit is swallowed: redo_task::execute ignores the return, so a failed fetch silently skips the record’s redo. log_rv_redo_context is copy-constructible, not assignable: each task owns a private reader and zip buffers.
7.6 reusable_jobs_stack — recycling
| Field | Role | Why it exists |
|---|---|---|
| m_flush_push_at_count | PARALLEL_REDO_REUSABLE_JOBS_FLUSH_BACK_COUNT (ONE_K) | One mutex touch per ~1024 retires |
| m_job_pool | vector<redo_job_impl> of PARALLEL_REDO_REUSABLE_JOBS_COUNT (ONE_M) | The only allocation |
| m_pop_jobs | Consumer stack, popped unsynchronized | Single consumer (Inv 7-C) |
| m_push_jobs (+m_push_mtx, m_push_jobs_available_cv) | Shared return bin | Sole synchronized hand-off |
| m_per_task_push_jobs_vec | One private vector per task | Lock-free retire fast path |

```cpp
// reusable_jobs_stack::blocking_pop -- src/transaction/log_recovery_redo_parallel.cpp
  if (!m_pop_jobs.empty ())
    {
      redo_job_impl *const pop_job = m_pop_jobs.back ();
      m_pop_jobs.pop_back ();   /* <- no lock */
      return pop_job;
    }
  else
    {
      {
        std::unique_lock<std::mutex> locku { m_push_mtx };
        // ... condensed: cv wait until !m_push_jobs.empty () ...
        m_pop_jobs.swap (m_push_jobs);  /* <- O(1) refill */
      }
      // ... condensed: pop_back ...
    }
```

push (a_task_idx, a_job) mirrors it: append to the caller’s private vector; only past m_flush_push_at_count does it lock, bulk-insert into the shared bin, clear the private vector, and notify_one. The slow path is the backpressure valve: when all ONE_M jobs are in flight the main thread blocks until a batch returns.
Invariant 7-C (single consumer, conservation of jobs). m_pop_jobs is popped unsynchronized because only the recovery main thread calls blocking_pop. The destructor asserts pop + push + sum(per_task) == m_job_pool.size (): a job executed but never retired trips it.
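The recycling scheme and its conservation check can be modelled in a few lines. This is a hedged sketch: `toy_reusable_jobs` and `toy_job` are hypothetical, the pool is tiny, and the flush fires at `>= flush_at`, which is an assumption about the exact threshold comparison.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

struct toy_job { int payload = 0; };

class toy_reusable_jobs
{
  public:
    toy_reusable_jobs (std::size_t job_count, std::size_t task_count, std::size_t flush_at)
      : m_flush_at (flush_at), m_pool (job_count), m_per_task (task_count)
    {
      for (toy_job &j : m_pool)
        {
          m_pop.push_back (&j);         // the pool is the only allocation
        }
    }

    // Single consumer only: the fast path takes no lock.
    toy_job *blocking_pop ()
    {
      if (m_pop.empty ())
        {
          std::unique_lock<std::mutex> lock (m_push_mtx);
          m_cv.wait (lock, [this] { return !m_push.empty (); });
          m_pop.swap (m_push);          // O(1) refill from the shared bin
        }
      toy_job *const job = m_pop.back ();
      m_pop.pop_back ();
      return job;
    }

    // Called by worker task_idx: private vector first, shared bin in batches.
    void push (std::size_t task_idx, toy_job *job)
    {
      std::vector<toy_job *> &mine = m_per_task[task_idx];
      mine.push_back (job);
      if (mine.size () >= m_flush_at)   // assumption: ">=" as the threshold test
        {
          {
            std::lock_guard<std::mutex> lock (m_push_mtx);
            m_push.insert (m_push.end (), mine.begin (), mine.end ());
          }
          mine.clear ();
          m_cv.notify_one ();
        }
    }

    // Conservation check; call only when no worker is running.
    bool all_jobs_accounted_for () const
    {
      std::size_t total = m_pop.size () + m_push.size ();
      for (const std::vector<toy_job *> &v : m_per_task)
        {
          total += v.size ();
        }
      return total == m_pool.size ();
    }

  private:
    std::size_t m_flush_at;
    std::vector<toy_job> m_pool;
    std::vector<toy_job *> m_pop;
    std::vector<toy_job *> m_push;
    std::vector<std::vector<toy_job *>> m_per_task;
    std::mutex m_push_mtx;
    std::condition_variable m_cv;
};
```

A job that is popped but never pushed back leaves the three containers short of the pool size, which is the condition the destructor assert in the text detects.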
7.7 task_active_state_bookkeeping and termination
| Field | Role | Why it exists |
|---|---|---|
| m_size | Task count, asserted < BITSET_MAX_SIZE (256) | Bounds-checks indices |
| m_values | std::bitset<256>, bit per task | set_active/set_inactive assert prior state |
| m_values_mtx / m_values_cv | Guard + wakeup | wait_for_termination sleeps until m_values.none () |
The pool’s own wait asserts after “a hardcoded maximum wait time (60 seconds)”; this private bookkeeping waits unbounded. Tasks set their bit in the constructor, so an early wait cannot miss not-yet-started tasks. Shutdown:
```cpp
// redo_parallel::wait_for_termination_and_stop_execution -- src/transaction/log_recovery_redo_parallel.cpp
  for (auto &redo_task: m_redo_tasks)
    {
      redo_task->notify_adding_finished ();
    }
  m_task_state_bookkeeping.wait_for_termination ();
  // ... condensed: assert every task is_idle ...
  m_worker_pool->stop_execution ();
  // ... condensed: get_manager ()->destroy_worker_pool ...
```

redo_task::retire is a no-op (“avoid self destruct”) so per-task perf stats stay readable; WAIT_FOR_PARALLEL times the straggler wait. Both ends assert the ordering: push_job asserts !m_adding_finished, and ~redo_parallel asserts no active task and a null pool, so this blocking call is mandatory before destruction.
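The bitset bookkeeping behind wait_for_termination can be sketched as below. `toy_active_bookkeeping` is a hypothetical model: it borrows the 256-bit bound and the assert-prior-state behaviour from the table, while the locking and notification details are assumptions.

```cpp
#include <bitset>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

class toy_active_bookkeeping
{
  public:
    static constexpr std::size_t MAX_SIZE = 256;

    explicit toy_active_bookkeeping (std::size_t size) : m_size (size)
    {
      assert (size < MAX_SIZE);
    }

    void set_active (std::size_t idx)
    {
      std::lock_guard<std::mutex> lock (m_mtx);
      assert (idx < m_size && !m_values.test (idx));    // must have been inactive
      m_values.set (idx);
    }

    void set_inactive (std::size_t idx)
    {
      {
        std::lock_guard<std::mutex> lock (m_mtx);
        assert (idx < m_size && m_values.test (idx));   // must have been active
        m_values.reset (idx);
      }
      m_cv.notify_one ();                               // wake a waiting owner
    }

    bool is_any_active ()
    {
      std::lock_guard<std::mutex> lock (m_mtx);
      return m_values.any ();
    }

    // Unbounded wait: returns only once no bit is set.
    void wait_for_termination ()
    {
      std::unique_lock<std::mutex> lock (m_mtx);
      m_cv.wait (lock, [this] { return m_values.none (); });
    }

  private:
    std::size_t m_size;
    std::bitset<MAX_SIZE> m_values;
    std::mutex m_mtx;
    std::condition_variable m_cv;
};
```

Setting the bit at task construction rather than at task start is what makes an early wait safe: a task that has not yet run still counts as active.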
7.8 min_unapplied_log_lsa_monitoring
Dormant in crash recovery (false, MAX_LSA); armed when the same infrastructure replicates on a page server. The constructor asserts the pairing: monitoring needs a valid starting LSA; no monitoring, MAX_LSA.

| Field | Role | Why it exists |
|---|---|---|
| m_do_monitor | Master switch | Asserted by every method |
| m_main_thread_unapplied_log_lsa | atomic<log_lsa> advanced by dispatcher | Sync records bypass task queues |
| m_redo_tasks | Const ref to task vector | calculate reads each task’s marker |
| m_calculated_log_lsa | Last global minimum, under m_calculate_mtx | What waiters compare against |
| m_calculate_mtx / m_calculate_cv / m_terminate_calculation / m_calculate_thread | Calculation-thread plumbing | Guard, cv, stop flag, thread |
calculate minimizes the main-thread LSA against the task markers, skipping idle ones that hold MAX_LSA. wait_past_target_log_lsa has two exits: an unlocked fast path when a_target_lsa < m_calculated_log_lsa; otherwise it calls notify_all (waking the calculation thread from its 10 ms nap) and blocks until the minimum passes the target. redo_parallel::wait_past_target_lsa and set_main_thread_unapplied_log_lsa are forwarders.
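The minimum calculation reduces to a short fold. The sketch below assumes LSAs can be modelled as plain 64-bit integers and uses hypothetical names (`toy_lsa`, `toy_calculate_min_unapplied`, `toy_is_past_target`); it shows only the arithmetic, not the thread plumbing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

using toy_lsa = std::int64_t;   // assumption: an LSA modelled as one integer
constexpr toy_lsa TOY_MAX_LSA = std::numeric_limits<toy_lsa>::max ();

// Global minimum over the dispatcher's LSA and every busy task's marker.
inline toy_lsa toy_calculate_min_unapplied (toy_lsa main_thread_lsa,
                                            const std::vector<toy_lsa> &task_markers)
{
  toy_lsa result = main_thread_lsa;
  for (toy_lsa marker : task_markers)
    {
      if (marker == TOY_MAX_LSA)
        {
          continue;             // idle task: nothing unapplied
        }
      result = std::min (result, marker);
    }
  return result;
}

// Fast-path test of the waiter: the target is applied once the minimum is past it.
inline bool toy_is_past_target (toy_lsa target, toy_lsa calculated_min)
{
  return target < calculated_min;
}
```

Including the main-thread LSA matters because sync records never enter a task queue, so no task marker would otherwise account for them.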
7.9 perf_stats
perf_stats (log_recovery_redo_perf.hpp) is a nullable wrapper over cubperf:

| Field | Role | Why it exists |
|---|---|---|
| m_definition | Const ref to cubperf::statset_definition | Slot names/types (all COUNTER_AND_TIMER) |
| m_stats_set | cubperf::statset *, nullptr when disabled | One-point fork: every method checks it |
Activation is per-side (perf_stats_is_active_for_main / ..._for_async); do_not_record_t builds a disabled instance. time_and_increment (id) adds the time since the previous call:
- Main: FETCH_PAGE, READ_LOG, REDO_OR_PUSH_{PREP, DO_SYNC, POP_REUSABLE_DIRECT/_WAIT, DO_ASYNC}, COMMIT_ABORT, WAIT_FOR_PARALLEL, FINALIZE.
- Workers: PARALLEL_POP, PARALLEL_SLEEP (never incremented), PARALLEL_EXECUTE, PARALLEL_RETIRE.
redo_parallel::log_perf_stats logs each worker’s set plus an element-wise average — EXECUTE vs POP shows saturation; DIRECT vs WAIT shows pool throttling.
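The "time since the previous call" accounting can be sketched without cubperf. `toy_perf_stats` below is a hypothetical stand-in: a boolean replaces the nullable stat set, and slots are plain indices rather than cubperf ids.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

class toy_perf_stats
{
  public:
    using clock = std::chrono::steady_clock;

    // enabled == false models the disabled (do-not-record) instance.
    toy_perf_stats (std::size_t slot_count, bool enabled)
      : m_enabled (enabled)
      , m_counts (enabled ? slot_count : 0)
      , m_nanos (enabled ? slot_count : 0)
      , m_last (clock::now ())
    {
    }

    // Charge the time elapsed since the previous call to slot `id`.
    void time_and_increment (std::size_t id)
    {
      if (!m_enabled)
        {
          return;               // single fork point: disabled means no work
        }
      const clock::time_point now = clock::now ();
      m_nanos[id] += std::chrono::duration_cast<std::chrono::nanoseconds> (now - m_last).count ();
      m_counts[id] += 1;
      m_last = now;             // the next call measures from here
    }

    std::uint64_t count (std::size_t id) const { return m_enabled ? m_counts[id] : 0; }
    std::int64_t nanos (std::size_t id) const { return m_enabled ? m_nanos[id] : 0; }

  private:
    bool m_enabled;
    std::vector<std::uint64_t> m_counts;
    std::vector<std::int64_t> m_nanos;
    clock::time_point m_last;
};
```

With this shape, each slot accumulates the interval that ended at its call, so consecutive calls partition wall-clock time among the slots.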
7.10 Chapter summary — key takeaways
- Three dispatcher exits: SA_MODE always-sync; forced sync via null infra or log_rv_need_sync_redo (null VPID, volume ops, sector reserve/unreserve); async dispatch of a recycled job.
- Invariant 7-A makes forced-sync records happens-before barriers; Invariant 7-B (fixed hash(VPID) % m_task_count, in-order push, FIFO drain) gives per-page LSA order without page locks.
- Worker count is MAX (16, cores), an experimental floor for I/O-bound workers; teardown’s private bitset dodges the pool’s 60-second assert.
- Jobs carry (vpid, lsa, rectype); redo_job_impl::execute re-fetches the log page via the task’s private log_rv_redo_context and converges on the sync apply path; the worker loop discards its error return.
- reusable_jobs_stack recycles ONE_M jobs: lock-free pop, ONE_K flush-back, conservation asserted (Inv 7-C); slow path = backpressure; min_unapplied_log_lsa_monitoring and perf_stats serve replication and diagnostics.
Chapter 8: Atomic Sysop Abort and Postpone Completion
Redo (Ch 6-7) reproduced the crash state exactly, leaving two loose ends that must not reach undo: open atomic system operations, and transactions/sysops committed with postpone whose postpones never finished. The tail of log_recovery_redo closes both; see the high-level companion (cubrid-recovery-manager.md) for the postpone/sysop concept.
8.1 Placement in the redo tail
Both phases run on the recovery main thread after the parallel redo pool drains (Ch 7): they append new log records through the runtime logging path (log_sysop_start / log_sysop_abort / log_run_postpone_op), which is safe only once every queued redo job is applied.

```cpp
// log_recovery_redo (tail) -- src/transaction/log_recovery.c
  LOG_CS_ENTER (thread_p);
  log_Gl.mvcc_table.reset_start_mvccid ();
  /* ... er_set: "REDO" finishing-up notification ... */
  log_recovery_abort_all_atomic_sysops (thread_p);      /* <- must run FIRST */
  log_recovery_finish_all_postpone (thread_p);
  /* ... flush data pages, log pages, log header ... */
```

Invariant 8-A (atomic-before-postpone). Stated in the log_rcv_tdes comment: interrupted file_perm_alloc/file_perm_dealloc “must be executed atomically … before executing finish all postpones”. Postpone actions (typically permanent-file destruction) would otherwise hit half-modified file headers and sector tables, risking a crash or file-tracker corruption.
8.2 LOG_RCV_TDES — the recovery scratchpad
Analysis (Ch 4-5) recorded everything this chapter consumes into tdes->rcv (struct log_rcv_tdes, log_impl.h), which holds five LOG_LSA fields; NULL_LSA means no such loose end:

| Field | Role | Why it exists |
|---|---|---|
| sysop_start_postpone_lsa | LOG_SYSOP_START_POSTPONE of a sysop committed-with-postpone whose LOG_SYSOP_END never landed (8.6). | That record embeds the LOG_REC_SYSOP_END to replay (8.3). |
| tran_start_postpone_lsa | The transaction’s LOG_COMMIT_WITH_POSTPONE. | Separates branches (c)/(d) in 8.6; abort boundary in 8.7. |
| atomic_sysop_start_lsa | Last unmatched LOG_SYSOP_ATOMIC_START; non-NULL means crashed mid-atomic-op. | Rollback target for 8.4: the log suffix to undo as one unit. |
| analysis_last_aborted_sysop_lsa | End LSA of the last sysop analysis saw aborted. | Upper bound of the Ch 6 skip window for a rolled-back sysop’s logical redo. |
| analysis_last_aborted_sysop_start_lsa | That sysop’s lastparent_lsa. | Lower bound of the skip window; unused here. |
The last input is the LOG_RUN_POSTPONE trail in the log itself, consumed by 8.8.
8.3 LOG_REC_SYSOP_START_POSTPONE — a deferred sysop end
A sysop committing with postpone logs its future end record up front, so recovery can finish the commit even if the real LOG_SYSOP_END never reached disk:

```cpp
// log_rec_sysop_start_postpone -- src/transaction/log_record.hpp
struct log_rec_sysop_start_postpone
{
  LOG_REC_SYSOP_END sysop_end;  /* log record used for end of system operation */
  LOG_LSA posp_lsa;             /* address where the first postpone operation start */
};
```

| Field | Role | Why it exists |
|---|---|---|
| sysop_end | Pre-built end record; re-read via log_read_sysop_start_postpone, appended via log_sysop_end_recovery_postpone. | Persists the commit decision before postpones run; its type decides the post-finish TDES state (8.6). |
| posp_lsa | First LOG_POSTPONE of this sysop. | Seed for the forward scan; analysis copies it to tdes->topops.stack[last].posp_lsa (Ch 5). |
8.6 reads four fields of the embedded LOG_REC_SYSOP_END (full table in Ch 5): type (discriminator), lastparent_lsa (transaction LSA just before the sysop — the rollback boundary), run_postpone.postpone_lsa (the LOG_POSTPONE this sysop ran — the parent’s resume point), run_postpone.is_sysop_postpone (sysop parent — asserted impossible — vs transaction).
8.4 Aborting atomic sysops
Both drivers share one skeleton: walk regular TDES slots 1..num_total_indices, skipping tdes == NULL || trid == NULL_TRANID, then the system TDESes rebuilt by analysis via log_system_tdes::map_all_tdes (locks systb_Mutex). Each call is bracketed by log_rv_simulate_runtime_worker / log_rv_end_simulation, so runtime logging primitives, which resolve the current transaction from the thread, act on the impersonated TDES (log_system_tdes::rv_simulate_system_tdes for system ones). log_recovery_abort_atomic_sysop handles one TDES:
flowchart TD
G1{"tdes NULL or<br/>trid NULL?"} -- yes --> R1["return"]
G1 -- no --> G2{"atomic_sysop_start_lsa<br/>NULL?"}
G2 -- yes --> R1
G2 -- no --> G3{"start >= undo_nxlsa?"}
G3 -- yes --> R3["reset LSA, return"]
G3 -- no --> G4{"TOPOPE and start postpone<br/>> atomic start?"}
G4 -- yes --> N1["nested postpone in atomic op:<br/>finish it first"]
G4 -- no --> G5{"TOPOPE?"}
G5 -- yes --> N2["atomic op in sysop postpone:<br/>abort now"]
G5 -- no --> N3["standalone"]
N1 --> RB["fetch start page,<br/>prev = prev_tranlsa"]
N2 --> RB
N3 --> RB
RB --> ERR{"fetch failed?"}
ERR -- yes --> F["logpb_fatal_error"]
ERR -- no --> SIM["log_sysop_start,<br/>lastparent_lsa = prev,<br/>log_sysop_abort"]
SIM --> DONE["clear atomic_sysop_start_lsa"]
Figure 8-1: every branch of log_recovery_abort_atomic_sysop.
The nested cases order against 8.5: sysop_start_postpone_lsa > atomic_sysop_start_lsa means a sysop committed-with-postpone inside the atomic op — finish its postpone first, then abort. The opposite TOPOPE case is an atomic op started during a sysop’s postpone — abort now, finish the postpone in 8.5. The source comments spell out both numbered crash scenarios verbatim.
The rollback simulates a runtime sysop instead of calling undo — the in-source comment calls the lastparent_lsa overwrite “hack last parent”: the new sysop’s rollback boundary becomes the prev_tranlsa of the LOG_SYSOP_ATOMIC_START, so log_sysop_abort compensates everything after it and logs an abort LOG_SYSOP_END.
Invariant 8-B (no atomic residue). On return, atomic_sysop_start_lsa is NULL_LSA on every TDES: each exit path finds it NULL, resets it, or dies in logpb_fatal_error. Later phases can assume no half-open atomic file operation exists; plain record-by-record undo would recreate the partial state the marker prevents.
8.5 Finishing transaction postpones
Per TDES, log_recovery_finish_postpone: (1) return on the guard tdes == NULL || trid == NULL_TRANID; (2) always call log_recovery_finish_sysop_postpone (8.6), which resolves a TOPOPE_COMMITTED_WITH_POSTPONE state, possibly promoting it to COMMITTED_WITH_POSTPONE; (3) branch on state:

```cpp
// log_recovery_finish_postpone -- src/transaction/log_recovery.c
  if (tdes->state == TRAN_UNACTIVE_WILL_COMMIT || tdes->state == TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE)
    {
      if (tdes->state == TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE)
        {
          /* make sure to abort interrupted logical postpone. */
          log_recovery_abort_interrupted_sysop (thread_p, tdes, &tdes->rcv.tran_start_postpone_lsa);
          LSA_SET_NULL (&tdes->undo_nxlsa);
        }                       /* <- committed: nothing left to undo */
      /* ... find_first_postpone -> log_do_postpone -> log_complete ... */
    }
  else if (tdes->state == TRAN_UNACTIVE_COMMITTED)
    {
      /* log_complete + free index only; postpones already done */
    }
```

TRAN_UNACTIVE_WILL_COMMIT means the commit was logged but the postpone start was not; COMMITTED_WITH_POSTPONE first aborts a possibly interrupted logical run postpone (8.7). The elided body: log_recovery_find_first_postpone (8.8), log_do_postpone (8.9) on a non-NULL result, then, for local transactions only (tdes->coord == NULL), log_complete appends the LOG_COMMIT EOT, sets TRAN_UNACTIVE_COMMITTED, and logtb_free_tran_index frees the slot (2PC: Ch 11). System TDESes pass through step (2) only; an unfinishable interrupted sysop leaves them TRAN_UNACTIVE_UNILATERALLY_ABORTED, branch (d), for undo (Ch 9).
8.6 log_recovery_finish_sysop_postpone
Runs only for TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE; analysis pushed exactly one topops entry (assert (tdes->topops.last == 0)). Sequence: abort an interrupted logical run postpone (8.7) relative to rcv.sysop_start_postpone_lsa; find the first unexecuted postpone (8.8) seeded from topops.stack[last].posp_lsa; log_do_postpone (8.9); re-read the start-postpone record via log_read_sysop_start_postpone (failure: assert_release, give up); append the pre-built end via log_sysop_end_recovery_postpone. Four outcomes:

```cpp
// log_recovery_finish_sysop_postpone (outcomes) -- src/transaction/log_recovery.c
  if (sysop_start_postpone.sysop_end.type == LOG_SYSOP_END_LOGICAL_RUN_POSTPONE)
    {
      if (sysop_start_postpone.sysop_end.run_postpone.is_sysop_postpone)
        {
          /* (a) sysop postpone during sysop postpone? should not happen! */
          assert (false);
          tdes->state = TRAN_UNACTIVE_UNILATERALLY_ABORTED;
          tdes->undo_nxlsa = tdes->tail_lsa;
        }
      else
        {
          /* (b) logical run postpone during transaction postpone */
          tdes->state = TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE;
          LSA_SET_NULL (&tdes->undo_nxlsa);
          tdes->posp_nxlsa = sysop_start_postpone.sysop_end.run_postpone.postpone_lsa;
        }
    }
  else if (!LSA_ISNULL (&tdes->rcv.tran_start_postpone_lsa))
    {
      /* (c) sysop nested in transaction postpone phase */
      tdes->state = TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE;
    }
  else
    {
      /* (d) standalone: hand the rest to undo (Ch 9) */
      tdes->state = TRAN_UNACTIVE_UNILATERALLY_ABORTED;
      tdes->undo_nxlsa = tdes->tail_lsa;
    }
```

(b) resumes the parent’s postpone after the one this sysop ran (8.3); (b)/(c) fall through into 8.5’s branch in the same invocation; (d) parks the TDES for undo. A defensive clamp resets topops.last to -1 under assert_release.
8.7 log_recovery_abort_interrupted_sysop — the backward scan
Postpone execution can itself use logical run postpone sysops (file destroy/deallocate); a crash mid-sysop leaves a fragment to abort before resuming. Walk the undo chain backwards from tdes->undo_nxlsa down to postpone_start_lsa:

- Early return if undo_nxlsa is NULL or <= postpone_start_lsa: nothing to abort.
- Per record (page fetch failure: logpb_fatal_error, return):
  - LOG_RUN_POSTPONE: a physical run postpone completed; stop, last_parent_lsa = iter_lsa.
  - LOG_SYSOP_END: stop likewise if type == LOG_SYSOP_END_LOGICAL_RUN_POSTPONE, else hop to sysop_end->lastparent_lsa, skipping the nested sysop whole.
  - anything else: prev_lsa = logrec_head.prev_tranlsa; asserts forbid postpone-start types.
- Loop drained: assert (LSA_EQ (&iter_lsa, postpone_start_lsa)); the interrupted sysop was the first postpone action, so last_parent_lsa = *postpone_start_lsa.
Then the 8.4 simulated-sysop trick with stack[last].lastparent_lsa = last_parent_lsa: everything after the last completed run postpone is compensated; completed ones stay.
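The backward scan can be traced on a toy log. The sketch assumes a single transaction whose records sit in a vector indexed by a small integer LSA; `toy_rec`, `toy_type` and `toy_find_last_parent` are hypothetical, and page fetching is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int TOY_NULL_LSA = -1;

enum class toy_type { RUN_POSTPONE, SYSOP_END_LOGICAL_RUN_POSTPONE, SYSOP_END_OTHER, OTHER };

struct toy_rec
{
  toy_type type;
  int prev_tranlsa;     // previous record of the same transaction
  int lastparent_lsa;   // meaningful for sysop end records only
};

// Returns the rollback boundary, or TOY_NULL_LSA when there is nothing to abort.
inline int toy_find_last_parent (const std::vector<toy_rec> &toy_log, int undo_nxlsa,
                                 int postpone_start_lsa)
{
  if (undo_nxlsa == TOY_NULL_LSA || undo_nxlsa <= postpone_start_lsa)
    {
      return TOY_NULL_LSA;              // early return
    }
  int iter_lsa = undo_nxlsa;
  while (iter_lsa > postpone_start_lsa)
    {
      const toy_rec &rec = toy_log[static_cast<std::size_t> (iter_lsa)];
      switch (rec.type)
        {
        case toy_type::RUN_POSTPONE:
        case toy_type::SYSOP_END_LOGICAL_RUN_POSTPONE:
          return iter_lsa;              // last completed run postpone: stop here
        case toy_type::SYSOP_END_OTHER:
          iter_lsa = rec.lastparent_lsa;        // skip the nested sysop whole
          break;
        case toy_type::OTHER:
          iter_lsa = rec.prev_tranlsa;
          break;
        }
    }
  assert (iter_lsa == postpone_start_lsa);
  return postpone_start_lsa;            // the fragment was the first postpone action
}
```

Everything after the returned LSA is what the simulated sysop then compensates; records at or before it stay applied.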
8.8 log_recovery_find_first_postpone — the run-postpone trail
tdes->posp_nxlsa after analysis is ambiguous: analysis advances it to run_posp->ref_lsa of every LOG_RUN_POSTPONE scanned (the last confirmed postpone) or, if none ran, to the first postpone of LOG_COMMIT_WITH_POSTPONE (Ch 4). One forward scan disambiguates. Guards: a call outside crash recovery, or in a state other than the three postpone states, hits assert (0) and returns ER_FAILED; a NULL start_postpone_lsa returns NO_ERROR with a NULL result. The scan reuses log_do_postpone’s nested-top range walk and page-fetch error path (8.9), inspecting only this trid:

- LOG_RUN_POSTPONE with ref_lsa == start_postpone_lsa: candidate ran; set start_postpone_lsa_wasapplied, done.
- LOG_SYSOP_END of type LOG_SYSOP_END_LOGICAL_RUN_POSTPONE: same test on run_postpone.postpone_lsa (logical run postpones log no LOG_RUN_POSTPONE).
- LOG_POSTPONE: the first non-candidate goes to next_postpone_lsa.
- LOG_END_OF_LOG / NULL-offset: archive-boundary page advance as in 8.9.

Tail: if the candidate never ran, ret_lsa = start_postpone_lsa; otherwise ret_lsa = next_postpone_lsa, the next LOG_POSTPONE, or NULL if none remain.
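The disambiguating forward scan can be sketched on a toy log. The model assumes one transaction, a vector indexed by integer LSA, and no nested-top ranges or page boundaries; `toy_rec`, `toy_type` and `toy_find_first_postpone` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int TOY_NULL_LSA = -1;

enum class toy_type { POSTPONE, RUN_POSTPONE, SYSOP_END_LOGICAL_RUN_POSTPONE, OTHER };

struct toy_rec
{
  toy_type type;
  int ref_lsa;          // the postpone this record executed, if any
};

// Returns the first postpone that still has to run, or TOY_NULL_LSA.
inline int toy_find_first_postpone (const std::vector<toy_rec> &toy_log, int start_postpone_lsa)
{
  if (start_postpone_lsa == TOY_NULL_LSA)
    {
      return TOY_NULL_LSA;
    }
  bool was_applied = false;
  int next_postpone_lsa = TOY_NULL_LSA;
  const int end = static_cast<int> (toy_log.size ());
  for (int lsa = start_postpone_lsa; lsa < end && !was_applied; ++lsa)
    {
      const toy_rec &rec = toy_log[static_cast<std::size_t> (lsa)];
      switch (rec.type)
        {
        case toy_type::RUN_POSTPONE:
        case toy_type::SYSOP_END_LOGICAL_RUN_POSTPONE:
          if (rec.ref_lsa == start_postpone_lsa)
            {
              was_applied = true;       // the candidate already ran
            }
          break;
        case toy_type::POSTPONE:
          if (lsa != start_postpone_lsa && next_postpone_lsa == TOY_NULL_LSA)
            {
              next_postpone_lsa = lsa;  // first non-candidate postpone
            }
          break;
        case toy_type::OTHER:
          break;
        }
    }
  return was_applied ? next_postpone_lsa : start_postpone_lsa;
}
```

The return value is the seed for the forward executor: the candidate itself if no trail entry confirms it, otherwise the postpone after it.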
8.9 log_do_postpone — the shared forward executor
The routine that runs postpones at runtime commit re-runs them here. log_get_next_nested_top builds a stack of nested-sysop ranges; the outer loop seeks each range three ways: up to a range’s start, restarting after its end, or (when start_seek_lsa == nxtop_range->end_lsa) running to tdes->tail_lsa and stopping. This skips the interior of every completed nested sysop, committed or aborted, since their LOG_POSTPONE records belong to the sysop, not the enclosing postpone phase; only a LOG_SYSOP_END_LOGICAL_RUN_POSTPONE end record stays inside the scanned range (log_get_next_nested_top ends that range one record earlier so the end record itself is processed). Both forward scanners share the page-fetch error path: logpb_fetch_page failure raises logpb_fatal_error and jumps to the end label, which frees a heap-grown nxtop_stack.
Dispatch inside a range: ordinary data/dummy/replication types are ignored; LOG_POSTPONE executes now via log_run_postpone_op — goto end on failure; LOG_COMMIT_WITH_POSTPONE (plus _OBSOLETE, LOG_SYSOP_START_POSTPONE, the 2PC starts) nulls forward_lsa — the postpone region is over; LOG_SYSOP_END is tolerated only at start_seek_lsa, else debug-logged as a bad range. log_run_postpone_op reads the LOG_REC_REDO payload (copying across page boundaries; logpb_fatal_error on OOM) and calls log_execute_run_postpone: apply the redo function, log a new LOG_RUN_POSTPONE — a second crash just extends the trail 8.8 consumes.
Invariant 8-C (postpones execute exactly once). The posp_nxlsa trail, 8.8’s applied-check, and the fresh LOG_RUN_POSTPONE each execution logs guarantee each LOG_POSTPONE runs exactly once across any number of crashes. Seeding log_do_postpone with an already-run LSA would double-apply non-idempotent redo such as page deallocation.
8.10 Chapter summary — key takeaways
- The redo tail runs two cleanups after the parallel pool drains: abort open atomic sysops, then finish pending postpones (Invariant 8-A), fed by tdes->rcv (LOG_RCV_TDES) plus the LOG_RUN_POSTPONE trail.
- Both drivers walk regular TDES slots then log_system_tdes::map_all_tdes, impersonating each transaction via log_rv_simulate_runtime_worker.
- Rollback simulates a sysop (log_sysop_start, overwrite lastparent_lsa, log_sysop_abort), ordered by 8.4’s nested-case branches.
- log_recovery_finish_sysop_postpone replays the embedded LOG_REC_SYSOP_END, landing in the transaction-postpone path or TRAN_UNACTIVE_UNILATERALLY_ABORTED for undo.
- Each finished worker TDES exits via log_complete (LOG_COMMIT) to TRAN_UNACTIVE_COMMITTED, except 2PC participants (Ch 11).
Chapter 9: Undo Pass and Compensation
Redo (Chapter 6) left even the losers’ effects in place; undo rolls them back. CLR theory is in the companion (cubrid-recovery-manager.md); here: log_recovery_undo and log_rv_undo_record branch by branch, plus the sysop bracket that makes rollback crash-restartable.
9.1 Record structs of the undo pass
All four live in log_record.hpp (log_rec_undo, log_rec_mvcc_undo, log_rec_compensate, log_rec_sysop_end), read in place from the log page.
LOG_REC_UNDO — body of LOG_UNDO_DATA:
| Field | Role |
|---|---|
| data (LOG_DATA) | rcvindex + volid/pageid/offset: one locator for RV_fun dispatch and page fix; NULL vpid triggers RCV_IS_LOGICAL_LOG |
| length | undo-image byte count; carries the ZIP_CHECK flag |
LOG_REC_MVCC_UNDO — body of LOG_MVCC_UNDO_DATA:
| Field | Role |
|---|---|
| undo (LOG_REC_UNDO) | embedded plain undo (strict superset); arms extract &mvcc_undo->undo |
| mvccid | writer’s MVCCID, re-activated during undo so the version stays invisible |
| vacuum_info (LOG_VACUUM_INFO) | prev_mvcc_op_log_lsa chain + vfid: vacuum’s list through MVCC op records; undo skips it |
LOG_REC_COMPENSATE — body of LOG_COMPENSATE, the CLR:
| Field | Role |
|---|---|
| data (LOG_DATA) | locator + rcvindex of the compensation’s redo; CLRs are redo-only, and redo replays them through the rcvindex’s undofun (Chapter 6) |
| undo_nxlsa | next record to undo, captured before the compensated one (ARIES UndoNxtLSA); restarted undo skips done work (9.3 arm 4) |
| length | after-image length |
LOG_REC_SYSOP_END — body of LOG_SYSOP_END; union keyed by type:
| Field | Role |
|---|---|
| lastparent_lsa | last LSA before the sysop; undo jumps here, so committed sysops are never re-undone |
| prv_topresult_lsa | previous completed top action: nested-sysop chaining (Chapter 5) |
| type (LOG_SYSOP_END_TYPE) | union discriminator: six end flavors, one record |
| vfid | file of affected pages: TDE decision for trailing undo data |
| undo (union) | LOG_REC_UNDO for LOGICAL_UNDO: the sysop’s own undo recipe if its owner aborts |
| mvcc_undo (union) | LOG_REC_MVCC_UNDO for LOGICAL_MVCC_UNDO: same, plus MVCCID |
| compensate_lsa (union) | resume point for LOGICAL_COMPENSATE: the bracket was itself a compensation |
| run_postpone (union) | postpone_lsa + is_sysop_postpone for LOGICAL_RUN_POSTPONE: analysis-side twin (Chapter 5); undo asserts it never sees one |
9.2 log_recovery_undo — pre-pass and loser selection
Called from log_recovery under LOG_RECOVERY_UNDO_PHASE. The pre-pass retires losers with nothing left to undo: a TDES in state TRAN_UNACTIVE_UNILATERALLY_ABORTED / TRAN_UNACTIVE_ABORTED with a NULL undo_nxlsa finished its rollback pre-crash but its LOG_ABORT never hit disk; log_complete (… LOG_ABORT, LOG_DONT_NEED_NEWTRID, LOG_NEED_TO_WRITE_EOT_LOG) writes it now and logtb_free_tran_index frees the slot. System TDESes need no EOT: log_system_tdes::rv_delete_all_tdes_if erases every system entry with NULL undo_nxlsa.
Selection uses logtb_rv_read_only_map_undo_tdes (log_tran_table.c): under a read-mode TR_TABLE_CS it maps a functor over every non-system slot in those two states, then over system workers via log_system_tdes::map_all_tdes — a max-scan lambda yields max_undo_lsa, two more feed the start notice (log_find_unilaterally_largest_undo_lsa duplicates the max-scan; nothing calls it today). The driver allocates undo_unzip_ptr = log_zip_alloc (LOGAREA_SIZE), arms an optional progress timer, exits LOG_CS (fetches use LOG_CS_FORCE_USE; alloc and fetch failures are fatal), then loops per Figure 9-1.
Invariant (globally descending undo order). Each iteration undoes max_undo_lsa, the largest undo_nxlsa over all losers, recomputed after every record, and every arm moves a cursor strictly backward (prev_tranlsa, a CLR’s undo_nxlsa, or a sysop’s lastparent_lsa). The inner while (max_undo_lsa.pageid == log_lsa.pageid) drains a page before fetching an earlier one; a forward-moving arm would live-lock.
flowchart TD
A["prune finished losers"] --> B["max_undo_lsa = max undo_nxlsa"]
B --> C{NULL?}
C -- yes --> Z["free unzip buffer, LOG_CS_ENTER,<br/>flush log + header + data pages"]
C -- no --> D["fetch page; while same pageid:<br/>resolve tdes, switch on log_rtype"]
D --> G{prev_tranlsa NULL?}
G -- yes --> H["chain done: log_complete +<br/>logtb_free_tran_index or rv_delete_tdes"]
G -- no --> I["undo_nxlsa = prev_tranlsa"]
H --> B
I --> B
Figure 9-1: driver loop.
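The loop in Figure 9-1 can be modeled in a few lines. The sketch below is a toy, not CUBRID code: LSAs are plain integers instead of (pageid, offset) pairs, and ToyLsa, ToyRecord, ToyLoser, and toy_undo_order are hypothetical names introduced only for illustration. It shows that repeatedly picking the largest undo_nxlsa across losers yields a globally descending undo order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-ins: one integer per LSA, -1 plays the role of a NULL LSA.
using ToyLsa = std::int64_t;
constexpr ToyLsa TOY_NULL_LSA = -1;

struct ToyRecord { int trid; ToyLsa prev_tranlsa; };
struct ToyLoser { int trid; ToyLsa undo_nxlsa; };

// Returns the LSAs in the order the toy driver would undo them.
std::vector<ToyLsa> toy_undo_order(const std::map<ToyLsa, ToyRecord>& records,
                                   std::vector<ToyLoser> losers)
{
  std::vector<ToyLsa> order;
  for (;;) {
    // Recompute the maximum undo_nxlsa over all remaining losers.
    auto it = std::max_element(losers.begin(), losers.end(),
        [](const ToyLoser& a, const ToyLoser& b) { return a.undo_nxlsa < b.undo_nxlsa; });
    if (it == losers.end() || it->undo_nxlsa == TOY_NULL_LSA) break;

    const ToyLsa cur = it->undo_nxlsa;
    const ToyRecord& rec = records.at(cur);
    assert(rec.trid == it->trid);
    assert(rec.prev_tranlsa < cur);      // every arm must move strictly backward
    order.push_back(cur);

    it->undo_nxlsa = rec.prev_tranlsa;   // follow the transaction's backward chain
    if (it->undo_nxlsa == TOY_NULL_LSA) losers.erase(it);  // chain done
  }
  return order;
}
```

With two interleaved losers the returned sequence is strictly decreasing, which is the property the inner page-draining loop relies on.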
TDES resolution forks on logtb_is_system_worker_tranid: workers via log_system_tdes::rv_get_tdes (NULL asserts); regular transactions via logtb_find_tran_index + LOG_FIND_TDES — on lookup failure (a trid analysis never registered) logtb_free_tran_index_with_undo_lsa scrubs any slot holding that undo_nxlsa and the record is skipped. if (tran_index != NULL_TRAN_INDEX && tdes != NULL) gates the switch; on the worker path tran_index is stale — only tdes matters.
9.3 The record-type switch — every arm
Every arm is preceded unconditionally by LSA_COPY (&tdes->undo_nxlsa, &prev_tranlsa); the order is the point:
Invariant (cursor advances before the undo executes). log_append_compensate copies tdes->undo_nxlsa into the CLR it writes; the driver advanced it to prev_tranlsa first, so the CLR points at the next record to undo. Reverse the order and a crash mid-rollback replays the same undo twice.
- UNDOREDO family (LOG_UNDOREDO_DATA, LOG_DIFF_UNDOREDO_DATA, LOG_MVCC_* twins): MVCC flavors read LOG_REC_MVCC_UNDOREDO and set rcv.mvcc_id, plain ones read LOG_REC_UNDOREDO with MVCCID_NULL; fill rcv from the embedded LOG_DATA + ulength, call log_rv_undo_record. DIFF matters only to redo.
- LOG_MVCC_UNDO_DATA / LOG_UNDO_DATA: same shape with LOG_REC_MVCC_UNDO / LOG_REC_UNDO and undo->length.
- Redo-only / bookkeeping types (LOG_REDO_DATA, LOG_MVCC_REDO_DATA, LOG_DBEXTERN_REDO_DATA, LOG_DUMMY_HEAD_POSTPONE, LOG_POSTPONE, LOG_SAVEPOINT, LOG_REPLICATION_DATA, LOG_REPLICATION_STATEMENT, LOG_DUMMY_HA_SERVER_STATE, LOG_DUMMY_OVF_RECORD, LOG_DUMMY_GENERIC, LOG_SUPPLEMENTAL_INFO, LOG_SYSOP_ATOMIC_START): /* Not for UNDO ... */, fall through to the previous record.
- LOG_COMPENSATE: LSA_COPY (&prev_tranlsa, &compensate->undo_nxlsa). No work; the cursor leapfrogs everything already undone pre-crash.
- LOG_SYSOP_END, switching on sysop_end->type:
  - LOGICAL_UNDO / LOGICAL_MVCC_UNDO: the committed bracket carries its own undo recipe. rcv is filled from sysop_end->undo (or mvcc_undo.undo plus rcv.mvcc_id); both prev_tranlsa and tdes->undo_nxlsa move to lastparent_lsa before log_rv_undo_record runs, so its compensation skips the whole sysop. (rcv_lsa is not refreshed; diagnostics may print a stale LSA.)
  - LOGICAL_COMPENSATE: prev_tranlsa = sysop_end->compensate_lsa, which resumes before the record the bracket compensated.
  - default (COMMIT, ABORT): prev_tranlsa = sysop_end->lastparent_lsa; an assert documents that LOGICAL_RUN_POSTPONE never reaches undo (Chapter 8).
- Terminal/illegal types (LOG_RUN_POSTPONE, the LOG_COMMIT* trio, LOG_SYSOP_START_POSTPONE, LOG_ABORT, checkpoint and 2PC records, LOG_DUMMY_CRASH_RECOVERY, LOG_END_OF_LOG) and the default arm (corrupted type → ER_LOG_PAGE_CORRUPTED): analysis went wrong. After assert (false), release builds amputate: clear tdes->mvccinfo.id, call log_system_tdes::rv_delete_tdes (workers) or log_complete (… LOG_ABORT …) + logtb_free_tran_index, and set tdes = NULL so the epilogue skips it.
Epilogue (if (tdes != NULL)): a NULL prev_tranlsa ends the chain — clear tdes->mvccinfo.id, then rv_delete_tdes (workers) or log_complete + logtb_free_tran_index as in the pre-pass (#ifdef CCI_XA builds skip completion for TRAN_UNACTIVE_2PC_PREPARE). Otherwise prev_tranlsa goes back into tdes->undo_nxlsa, re-asserting the copy arms 4-5 may have redirected. After the loop: free the unzip buffer, re-enter LOG_CS, force-flush log, header and data pages.
Inside log_complete, updaters get log_append_abort_log + log_change_tran_as_completed and unlock_global_oldest_visible_mvccid; no-update losers (LSA_ISNULL (&tdes->tail_lsa)) just flip state.
9.4 log_rv_undo_record — one undo step, every branch
The recovery twin of run-time log_rollback_rec; identity is simulated via log_rv_simulate_runtime_worker / log_rv_end_simulation, with no page locks. Pre-dispatch: (1) a valid rcv->mvcc_id is re-activated via logtb_rv_assign_mvccid_for_undo_recovery; (2) RCV_IS_LOGICAL_LOG (rcv_vpid, rcvindex), meaning a NULL vpid or a logical rcvindex, leaves rcv->pgptr = NULL, else pgbuf_fix takes an unconditional write latch (failure asserted, tolerated); (3) ZIP_CHECK (rcv->length) strips the compression flag; the image is aliased from the log page if it fits, else malloced + logpb_copy_from_log, and zipped images are inflated by log_unzip into undo_unzip_ptr (alloc/unzip failures are fatal, as in the reader-based redo-side twins log_rv_get_unzip_log_data / log_rv_get_unzip_and_diff_redo_log_data, Chapter 6). Then, under if (rcv->pgptr != NULL || RCV_IS_LOGICAL_LOG (…)):
    // log_rv_undo_record -- src/transaction/log_recovery.c
    if (rcvindex == RVBT_MVCC_INCREMENTS_UPD)
      {
        /* nothing to do during recovery */
      }
    else if (rcvindex == RVBT_MVCC_NOTIFY_VACUUM || rcvindex == RVES_NOTIFY_VACUUM)
      {
        /* nothing to do */
      }
    else if (rcvindex == RVBT_LOG_GLOBAL_UNIQUE_STATS_COMMIT)
      {
        /* <- in-memory only: undo on every restart, cannot compensate */
        error_code = (*RV_fun[rcvindex].undofun) (thread_p, rcv);
        assert (error_code == NO_ERROR);
      }
    else if (RCV_IS_LOGICAL_COMPENSATE_MANUAL (rcvindex))
      {
        /* <- undofun logs its own compensation */
        LSA_COPY (&rcv->reference_lsa, &tdes->undo_nxlsa);
        error_code = (*RV_fun[rcvindex].undofun) (thread_p, rcv);
        // ... condensed ... logpb_fatal_error on failure; optional b-tree trace
      }
    else if (!RCV_IS_LOGICAL_LOG (rcv_vpid, rcvindex))
      {
        /* <- PHYSICAL undo: CLR first, then apply before-image */
        log_append_compensate (thread_p, rcvindex, rcv_vpid, rcv->offset, rcv->pgptr, rcv->length, rcv->data, tdes);
        error_code = (*RV_fun[rcvindex].undofun) (thread_p, rcv);
        // ... condensed ... logpb_fatal_error on failure
      }
    else
      {
        /* <- LOGICAL undo: bracket in a system operation */
        save_state = tdes->state;
        LSA_COPY (&rcv->reference_lsa, &tdes->undo_nxlsa);
        log_sysop_start (thread_p);
        (void) (*RV_fun[rcvindex].undofun) (thread_p, rcv);
        log_sysop_end_logical_compensate (thread_p, &rcv->reference_lsa);
        tdes->state = save_state;
      }

A physical record whose page could not be fixed (the guard’s else) still gets log_append_compensate with pgptr = NULL, so the chain stays restartable, plus ER_LOG_MAYNEED_MEDIA_RECOVERY naming the volume; the undofun is skipped and recovery continues (log-and-skip). The end: label frees the area, unfixes the page, and calls log_rv_end_simulation.
Invariant (every undo step is logged before or while it happens). Physical undo writes the CLR before undofun; logical undo opens log_sysop_start first so all page changes land inside the bracket, sealed by log_sysop_end_logical_compensate with compensate_lsa = rcv->reference_lsa. Crash inside the bracket: analysis aborts the sysop (Chapter 8), undo resumes at the original record. Crash after the seal: the LOGICAL_COMPENSATE arm jumps to compensate_lsa. Either way the logical undo runs exactly once.
recovery.h defines the manual sets: RCV_IS_BTREE_LOGICAL_LOG (ten RVBT_* object-level ops) inside the wider RCV_IS_LOGICAL_COMPENSATE_MANUAL (plus RVFL_ALLOC, RVFL_USER_PAGE_MARK_DELETE, RVPGBUF_DEALLOC, RVFL_TRACKER_HEAP_REUSE, RVHF_LOB_REMOVE_DIR, RVFL_TRACKER_UNREGISTER). Their undofuns append page-level compensations themselves via log_append_compensate_with_undo_nxlsa with the saved rcv->reference_lsa — a b-tree undo may split or merge pages before compensating — so an extra bracket would be redundant.
9.5 log_append_compensate — the CLR writer
log_append_compensate and log_append_compensate_with_undo_nxlsa wrap log_append_compensate_internal (log_manager.c); log_append_compensate_with_undo_nxlsa passes an explicit undo_nxlsa for the b-tree case, while log_append_compensate passes NULL:
    // log_append_compensate_internal -- src/transaction/log_manager.c
    // ... condensed ... node = prior_lsa_alloc_and_copy_data (.., LOG_COMPENSATE, ..); NULL -> silent return
    LSA_COPY (&prev_lsa, &tdes->undo_nxlsa);	/* <- next record to undo, saved */
    compensate = (LOG_REC_COMPENSATE *) node->data_header;
    // ... condensed ... fill compensate->data; store the undo_nxlsa parameter
    // into compensate->undo_nxlsa if non-NULL (b-tree override), else prev_lsa
    start_lsa = prior_lsa_next_record (thread_p, node, tdes);
    // ... condensed ... pgbuf_set_lsa (pgptr, start_lsa) when pgptr != NULL
    /* Go back to our undo link */
    LSA_COPY (&tdes->undo_nxlsa, &prev_lsa);	/* <- CLR must not become next undo target */

Branches: a prior_lsa_alloc_and_copy_data failure returns silently and the undo proceeds unlogged; since undo_nxlsa never advanced past the record, a re-crash simply undoes it again (re-applying a before-image is harmless). A NULL pgptr (media path, 9.4) skips TDE marking and pgbuf_set_lsa; a failed pgbuf_set_lsa asserts and returns. The last line is load-bearing: prior_lsa_next_record drags undo_nxlsa forward with tail_lsa; restoring prev_lsa keeps the rollback cursor behind the CLR. Per the header comment, CLRs “are never undone.”
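The save-and-restore of the cursor can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the real implementation: LSAs are vector positions, and ToyTdes, ToyLogRec, toy_append, and toy_append_compensate are hypothetical names. toy_append imitates only the one behavior the text attributes to prior_lsa_next_record, namely that appending moves undo_nxlsa forward together with tail_lsa.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using ToyLsa = std::int64_t;
constexpr ToyLsa TOY_NULL_LSA = -1;

struct ToyTdes { ToyLsa tail_lsa = TOY_NULL_LSA; ToyLsa undo_nxlsa = TOY_NULL_LSA; };
struct ToyLogRec { bool is_clr; ToyLsa prev_tranlsa; ToyLsa clr_undo_nxlsa; };

// Appends one record; as a side effect the cursor follows the tail forward.
ToyLsa toy_append(std::vector<ToyLogRec>& records, ToyTdes& tdes, bool is_clr, ToyLsa clr_undo_nxlsa)
{
  const ToyLsa lsa = static_cast<ToyLsa>(records.size());
  records.push_back({is_clr, tdes.tail_lsa, clr_undo_nxlsa});
  tdes.tail_lsa = lsa;
  tdes.undo_nxlsa = lsa;
  return lsa;
}

// Writes a CLR that carries the current cursor, then puts the cursor back.
ToyLsa toy_append_compensate(std::vector<ToyLogRec>& records, ToyTdes& tdes)
{
  const ToyLsa prev_lsa = tdes.undo_nxlsa;   // next record to undo, saved
  const ToyLsa clr = toy_append(records, tdes, true, prev_lsa);
  assert(tdes.undo_nxlsa == clr);            // the append dragged the cursor onto the CLR
  tdes.undo_nxlsa = prev_lsa;                // restore: the CLR must not become an undo target
  return clr;
}
```

In the toy, a driver that first sets tdes.undo_nxlsa to the undone record's prev_tranlsa and then calls toy_append_compensate ends with a CLR whose stored resume point and the live cursor both name the next record to undo, while tail_lsa names the CLR.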
9.6 Chapter summary — key takeaways
- The pre-pass retires losers whose undo_nxlsa is NULL (log_complete writes the missing LOG_ABORT, logtb_free_tran_index frees the slot), and rv_delete_all_tdes_if prunes finished system TDESes.
- The driver always undoes the globally largest undo_nxlsa (recomputed each record via logtb_rv_read_only_map_undo_tdes): a strictly backward, page-at-a-time sweep.
- tdes->undo_nxlsa advances before the undo executes, so every CLR carries the correct resume point; log_append_compensate_internal restores it after appending so the CLR is never undone. Undo never undoes an undo, which makes the pass idempotent across repeated crashes.
- LOG_COMPENSATE and the LOG_SYSOP_END arms other than LOGICAL_UNDO / LOGICAL_MVCC_UNDO are pure cursor redirections during undo (compensate->undo_nxlsa, lastparent_lsa / compensate_lsa), so a crashed rollback resumes without repetition.
- log_rv_undo_record forks on RCV_IS_LOGICAL_LOG: physical undo = CLR then undofun; logical undo = sysop bracket (log_sysop_start to log_sysop_end_logical_compensate) that analysis aborts if half-done and undo skips if sealed; RCV_IS_LOGICAL_COMPENSATE_MANUAL undofuns compensate manually.
- The only tolerated failure is an unfixable data page: the CLR is still written (NULL pgptr) plus ER_LOG_MAYNEED_MEDIA_RECOVERY; everything else is logpb_fatal_error, because a half-applied undo with its CLR on disk would lie to the next restart.
Chapter 10: The RV_fun Dispatch Table
Every redo, undo, compensation replay, and logdump print indexes one global array: RV_fun[] in recovery.c. The drivers of Ch 6, 7, and 9 know nothing about heap or b-tree semantics, only how to find the right function pointer. This chapter covers the entry layout, the index-equals-position invariant, NULL arms, and the shared packed-change machinery; theory lives in the high-level companion (“ARIES in CUBRID”, “Recovery Function Dispatch”).
10.1 The rvfun entry and the table it forms
Each slot is a struct rvfun (recovery.h):
    // rvfun -- src/transaction/recovery.h
    struct rvfun
    {
      using fun_t = int (*)(THREAD_ENTRY * thread_p, LOG_RCV * logrcv);
      using dump_fun_t = void (*)(FILE * fp, int length, void *data);

      LOG_RCVINDEX recv_index;	/* For verification */
      const char *recv_string;
      fun_t undofun;
      fun_t redofun;
      dump_fun_t dump_undofun;
      dump_fun_t dump_redofun;
    };

| Field | Role | Why it exists |
|---|---|---|
| recv_index | LOG_RCVINDEX this slot claims | Compared to slot position by rv_check_rvfuns |
| recv_string | Printable name ("RVHF_INSERT") | logdump and fatal errors, via rv_rcvindex_string |
| undofun | Rollback, undo pass (Ch 9), redo of LOG_COMPENSATE | CLR payloads are undo-direction; NULL = never logs undo data |
| redofun | Redo pass (Ch 6/7), run-postpone (Ch 8) | NULL = never logs redo data (undo-only logical records) |
| dump_undofun | Debug printer, undo payload | logdump only, via log_dump_data |
| dump_redofun | Debug printer, redo payload | NULL = payload not formatted |
RV_fun[] is an aggregate initializer, one literal per LOG_RCVINDEX, from RVDK_NEWVOL (NULL undo arm — volume creation is redo-only) to RVHF_LOB_REMOVE_DIR. Arms are often mirror pairs (RVDK_UNRESERVE_SECTORS: undo disk_rv_reserve_sectors, redo disk_rv_unreserve_sectors).
flowchart LR
REDO["redo + run-postpone Ch 6-8"] --> R["redofun"]
CLR["LOG_COMPENSATE replay"] --> U["undofun"]
UNDO["undo + rollback Ch 9"] --> U
DUMP["logdump"] --> D["dump arms"]
Figure 10-1: consumers of each rvfun arm; compensate replay crosses to undofun.
The crossed wire is explicit in log_rv_get_fun<LOG_REC_COMPENSATE> (log_recovery_redo.hpp): its body is return RV_fun[rcvindex].undofun; — comment // yes, undo. Hence RVBT_RECORD_MODIFY_COMPENSATE registers btree_rv_redo_record_modify as undofun with NULL redo: the CLR payload is redo-format, replayed only through undofun.
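The crossed wire is easy to see in a miniature table. The sketch below is an assumption-laden toy, not the CUBRID table: the two indices, the ToyRcv payload, and the toy_* functions are hypothetical, and only the routing rule (compensate replay returns the undofun slot) is taken from the text.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical two-entry index space; real indices come from LOG_RCVINDEX.
enum ToyRcvIndex { TOY_RV_PAGE_WRITE = 0, TOY_RV_MODIFY_COMPENSATE = 1 };
enum class ToyRecKind { Redo, Undo, Compensate };

struct ToyRcv { int value; };
using toy_fun_t = int (*)(ToyRcv*);

struct ToyRvfun {
  ToyRcvIndex recv_index;
  const char* recv_string;
  toy_fun_t undofun;
  toy_fun_t redofun;
};

int toy_undo_write(ToyRcv* r) { r->value -= 1; return 0; }
int toy_redo_write(ToyRcv* r) { r->value += 1; return 0; }
int toy_redo_modify(ToyRcv* r) { r->value += 10; return 0; }

const ToyRvfun TOY_RV_fun[] = {
  {TOY_RV_PAGE_WRITE, "TOY_RV_PAGE_WRITE", toy_undo_write, toy_redo_write},
  // Compensate-only index: the redo-direction function sits in the undofun slot.
  {TOY_RV_MODIFY_COMPENSATE, "TOY_RV_MODIFY_COMPENSATE", toy_redo_modify, nullptr},
};

toy_fun_t toy_get_fun(ToyRecKind kind, ToyRcvIndex idx)
{
  assert(static_cast<std::size_t>(idx) < sizeof(TOY_RV_fun) / sizeof(TOY_RV_fun[0]));
  const ToyRvfun& slot = TOY_RV_fun[idx];
  switch (kind) {
    case ToyRecKind::Redo:       return slot.redofun;
    case ToyRecKind::Undo:       return slot.undofun;
    case ToyRecKind::Compensate: return slot.undofun;  // yes, undo
  }
  return nullptr;
}
```

Asking the toy for the Compensate function of TOY_RV_MODIFY_COMPENSATE yields toy_redo_modify, while asking for its Redo function yields a null pointer, mirroring the NULL redo arm described above.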
10.2 The index-equals-position invariant
LOG_RCVINDEX (recovery.h) is an explicitly numbered enum, RVDK_NEWVOL = 0 through RVHF_LOB_REMOVE_DIR = 129, closed by two specials: RV_LAST_LOGID = RVHF_LOB_REMOVE_DIR (an alias, not a slot) and RV_NOT_DEFINED = 999 (sentinel; must never index RV_fun). Its head comment mandates new entries at the bottom, “to AVOID OLD DATABASES TO BE RECOVERED UNDER OLD FILE”.
Invariant (table ordering): for every i in [0, DIM(RV_fun)), RV_fun[i].recv_index == i. The rcvindex in each on-disk log record is the array subscript — dispatch is an unchecked array load. Enforcement runs once at startup, debug builds only:
    // rv_check_rvfuns -- src/transaction/recovery.c
    for (i = 0; i < num_indices; i++)	/* num_indices = DIM (RV_fun) */
      if (RV_fun[i].recv_index != i)
        {
          // ... condensed: er_log_debug "out of sequence" ...
          er_set (ER_FATAL_ERROR_SEVERITY, ARG_FILE_LINE, ER_GENERIC_ERROR, 0);
          assert (false);
          break;	/* <- first mismatch only; one insertion shifts all later slots */
        }

Branch accounting: one loop, one conditional. A match falls through; a mismatch logs, raises a fatal-severity error, asserts, and breaks. Function and call site vanish under NDEBUG (the call opens log_initialize_internal, log_manager.c); a misordered release-build table is caught by nothing, and recovery applies some other index’s function to each payload.
rv_rcvindex_string trusts the invariant: its whole body is return RV_fun[rcvindex].recv_string; — no bounds check, so RV_NOT_DEFINED must never reach it. (A stale recovery.c header comment still directs authors to rv_rcvindex_string() for new names.)
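The index-equals-position check is small enough to restate as a standalone function. This is a hedged sketch with hypothetical names (ToySlot, toy_first_out_of_sequence); unlike rv_check_rvfuns it returns the offending position instead of raising an error, so the behavior can be observed without aborting.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ToySlot { int recv_index; const char* recv_string; };

// Returns the position of the first slot whose recv_index differs from its
// position, or -1 when the table is in sequence.
int toy_first_out_of_sequence(const std::vector<ToySlot>& table)
{
  for (std::size_t i = 0; i < table.size(); i++) {
    if (table[i].recv_index != static_cast<int>(i)) {
      return static_cast<int>(i);   // first mismatch only, like the break above
    }
  }
  return -1;
}
```

Inserting one literal in the middle of a toy table makes every later slot mismatch as well, which is why reporting only the first offender is enough to locate the bad edit.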
10.3 NULL arms and where they are policed
A NULL arm is a contract about logging, not a recovery-time fallback: no record with this rcvindex ever carries data for that direction. Enforcement lives at append time, in CUBRID_DEBUG blocks (log_manager.c): log_append_undoredo_crumbs asserts both arms non-NULL, log_append_undo_crumbs only undofun, log_append_redo_crumbs only redofun; rollback adds assert (RV_fun[rcvindex].undofun != NULL). At recovery only log_rv_redo_record is defensive (a NULL redofun merely logs a warning), while log_rv_undo_record calls the arm with no NULL test: the append-time contract is its only safety net.
Dump arms take (FILE *, int length, void *data) — payload only, no page — since logdump runs offline; printers are generic (log_rv_dump_char, log_rv_dump_hexa) or subsystem decoders (disk_rv_dump_hdr).
10.4 Index families and the RCV_IS_* macro overlay
The prefix encodes the owning subsystem; the append-only rule scatters late additions to 124–129 regardless of family: RVDK_* 0–9 (disk) · RVFL_* 10–32, 128 (file mgr; 128 = TDE) · RVHF_* 33–53, 126, 129 (heap) · RVOVF_* 54–57 (overflow) · RVEH_* 58–65 (ext hash) · RVBT_* 66–91, 124–125 (b-tree; 124–125 = online index) · RVCT_* 92–96 (catalog) · RVLOG_* 97 (log no-op) · RVREPL_* 98–103 (replication; HA shipping, not page recovery) · RVVAC_* 104–117 (vacuum) · RVES_* 118 (external storage) · RVLOC_* 119 (locator dummy) · RVPGBUF_* 120–123, 127 (page buffer; 127 = TDE).
The RCV_IS_* macros (Ch 1) are a second axis: the index value selects the function; its macro membership selects the protocol around the call in log_rv_undo_record’s six-way ladder (Ch 9). Indices in RCV_IS_LOGICAL_COMPENSATE_MANUAL (fed by RCV_IS_BTREE_LOGICAL_LOG) get rcv->reference_lsa preloaded from tdes->undo_nxlsa and their undofun logs its own CLR; indices failing RCV_IS_LOGICAL_LOG get a driver-side log_append_compensate first. RCV_IS_LOGICAL_RUN_POSTPONE_MANUAL is the postpone analogue (Ch 8).
Invariant (macro/table coherence): every index named in a RCV_IS_* macro must keep an arm whose internal logging matches the protocol the macro routes it to. Nothing checks this mechanically; a RCV_IS_LOGICAL_COMPENSATE_MANUAL index whose undofun logs no CLR leaves undo_nxlsa pointing at the same record: infinite rollback loop or missing-CLR crash at next restart.
10.5 The packed partial-change mini-format
Many slotted-page entries log a sequence of splices instead of whole records. One splice unit is short offset_to_data | byte A | byte B | payload, padded to INT_ALIGNMENT, as produced by the packers in log_recovery.c:
    // log_rv_pack_redo_record_changes -- src/transaction/log_recovery.c
    assert (offset_to_data >= 0 && offset_to_data <= 0x8FFF);
    /* <- intends flag bits clear; mask 0xC000 needs <= 0x3FFF, so 0x8FFF looks like a source typo */
    // ... condensed: asserts both sizes <= 255 (single wire bytes); PTR_ALIGN to INT_ALIGNMENT ...
    OR_PUT_SHORT (ptr, (short) offset_to_data);
    ptr += OR_SHORT_SIZE;
    OR_PUT_BYTE (ptr, (INT16) old_data_size);
    ptr += OR_BYTE_SIZE;
    OR_PUT_BYTE (ptr, (INT16) new_data_size);
    ptr += OR_BYTE_SIZE;
    if (new_data_size > 0)
      {
        memcpy (ptr, new_data, new_data_size);
        ptr += new_data_size;
      }
    // ... condensed: trailing PTR_ALIGN ...

log_rv_pack_undo_record_changes differs in exactly two ways: the two OR_PUT_BYTE lines are swapped (new_data_size first), and the memcpy payload is old_data, guarded by old_data_size > 0. That asymmetry is the whole trick:
| Wire field | In redo data | In undo data |
|---|---|---|
| offset_to_data | splice position | splice position (same) |
| byte A (“remove size”) | runtime old size | runtime new size — bytes undo strips |
| byte B (“insert size”) | runtime new size | runtime old size — bytes undo restores |
| payload | new data | old data |
The packers pre-swap, so one interpreter serves both directions with no direction flag.
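The pre-swap can be sketched with a byte vector in place of the raw pointer arithmetic. Everything named toy_* below is hypothetical. Two details are assumptions rather than facts from the source: the offset is written big-endian, and padding is to a 4-byte boundary measured from the start of the buffer (standing in for INT_ALIGNMENT). The bound 0x3FFF is the flag-safe limit discussed above, not the value asserted in the source.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using ToyBytes = std::vector<std::uint8_t>;

// Appends one unit: offset (2 bytes), remove size, insert size, payload, padding.
void toy_put_unit(ToyBytes& out, int offset, const std::string& remove, const std::string& insert)
{
  assert(offset >= 0 && offset <= 0x3FFF);
  assert(remove.size() <= 255 && insert.size() <= 255);   // sizes travel as single bytes
  out.push_back(static_cast<std::uint8_t>(offset >> 8));
  out.push_back(static_cast<std::uint8_t>(offset & 0xFF));
  out.push_back(static_cast<std::uint8_t>(remove.size()));  // byte A: bytes to remove
  out.push_back(static_cast<std::uint8_t>(insert.size()));  // byte B: bytes to insert
  out.insert(out.end(), insert.begin(), insert.end());      // payload is what gets inserted
  while (out.size() % 4 != 0) out.push_back(0);             // assumed 4-byte alignment
}

// Redo removes the old bytes and inserts the new ones.
void toy_pack_redo(ToyBytes& out, int offset, const std::string& old_data, const std::string& new_data)
{
  toy_put_unit(out, offset, old_data, new_data);
}

// Undo is the same unit with the roles swapped: remove new, insert old.
void toy_pack_undo(ToyBytes& out, int offset, const std::string& old_data, const std::string& new_data)
{
  toy_put_unit(out, offset, new_data, old_data);
}
```

Because both packers emit the same wire shape, a reader needs no direction flag: it always removes byte A bytes and inserts byte B bytes of payload.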
10.6 The interpreter: ordered replay, reversed unreplay
log_rv_undoredo_record_partial_changes is a three-assert wrapper that wraps the payload in an OR_BUF and calls the recursive core. The core is recursive because undo must apply splices in reverse log order: each offset_to_data was computed against the record as that splice saw it.
    // log_rv_undoredo_partial_changes_recursive -- src/transaction/log_recovery.c
    if (rcv_buf->ptr == rcv_buf->endptr)
      return NO_ERROR;	/* (1) clean termination */
    if (rcv_buf->ptr + OR_SHORT_SIZE + 2 * OR_BYTE_SIZE > rcv_buf->endptr)
      {
        assert_release (false);
        return ER_TF_BUFFER_OVERFLOW;
      }			/* (2) truncated unit */
    offset_to_data = (int) or_get_short (rcv_buf, &error_code);	/* (3,4,5) per-field errors */
    // ... condensed: old_data_size, new_data_size; each returns error_code on failure ...
    if (new_data_size > 0)
      {
        new_data = rcv_buf->ptr;
        error_code = or_advance (rcv_buf, new_data_size);	/* (6) payload overruns buffer */
      }
    else
      new_data = NULL;	/* (7) pure deletion splice */
    or_align (rcv_buf, INT_ALIGNMENT);	/* <- mirrors packer's PTR_ALIGN */
    if (!is_undo)
      RECORD_REPLACE_DATA (record, offset_to_data, old_data_size, new_data_size, new_data);
    error_code = log_rv_undoredo_partial_changes_recursive (thread_p, rcv_buf, record, is_undo);
    if (error_code != NO_ERROR)
      {
        assert_release (false);
        return error_code;
      }			/* (8) deeper error skips this splice */
    if (is_undo)
      RECORD_REPLACE_DATA (record, offset_to_data, old_data_size, new_data_size, new_data);
    return NO_ERROR;

(7) is legal because RECORD_REPLACE_DATA (storage_common.h) skips its memcpy when insert size is 0.
flowchart TD
A["parse unit i"] --> B{"is_undo?"}
B -- "no" --> C["apply splice i, then recurse into i+1"]
B -- "yes" --> D["recurse into i+1, apply splice i on unwind"]
Figure 10-2: redo applies before recursing; undo applies on unwind, reversing order for free.
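The apply-before versus apply-on-unwind distinction in Figure 10-2 can be tried on a plain string. The sketch below is illustrative only: ToySplice, toy_replace, and toy_apply are hypothetical, the units are already parsed (no buffer decoding or error arms), and std::string::replace stands in for RECORD_REPLACE_DATA.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One parsed unit: remove remove_size bytes at offset, then insert `insert`.
struct ToySplice { std::size_t offset; std::size_t remove_size; std::string insert; };

void toy_replace(std::string& record, const ToySplice& s)
{
  assert(s.offset + s.remove_size <= record.size());
  record.replace(s.offset, s.remove_size, s.insert);
}

// Redo applies unit i before recursing; undo applies it while unwinding,
// so undo units written in log order are executed last-to-first.
void toy_apply(std::string& record, const std::vector<ToySplice>& units, std::size_t i, bool is_undo)
{
  if (i == units.size()) return;                 // clean termination
  if (!is_undo) toy_replace(record, units[i]);
  toy_apply(record, units, i + 1, is_undo);
  if (is_undo) toy_replace(record, units[i]);
}
```

As a usage example, take a record "hello world" changed at run time first to "hi world" and then to "hi there!". The redo units {0, 5, "hi"} and {3, 5, "there!"} replay the change forward; the pre-swapped undo units {0, 2, "hello"} and {3, 6, "world"}, kept in the same log order, restore the original only because the second one runs first on unwind.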
10.7 log_rv_record_modify_internal and the thin wrappers
The generic record modifier reads two flag bits smuggled into rcv->offset (LOG_RV_RECORD_SET_MODIFY_MODE, mask LOG_RV_RECORD_MODIFY_MASK = 0xC000, log_append.hpp; §10.5’s 0x8FFF assert intends to protect these bits, though a flag-safe bound would be 0x3FFF):
| flags | Meaning | Redo action | Undo action |
|---|---|---|---|
| LOG_RV_RECORD_INSERT (0x8000) | record inserted | spage_insert_at | spage_delete |
| LOG_RV_RECORD_DELETE (0x4000) | record deleted | spage_delete | spage_insert_at |
| LOG_RV_RECORD_UPDATE_ALL (0xC000) | full replacement | spage_update | spage_update (per-arm payload) |
| LOG_RV_RECORD_UPDATE_PARTIAL (0x0000) | splice chain | splice forward, spage_update | splice reversed, spage_update |
    // log_rv_record_modify_internal -- src/transaction/log_recovery.c
    INT16 flags = rcv->offset & LOG_RV_RECORD_MODIFY_MASK;
    PGSLOTID slotid = rcv->offset & (~LOG_RV_RECORD_MODIFY_MASK);
    if ((!is_undo && LOG_RV_RECORD_IS_INSERT (flags)) || (is_undo && LOG_RV_RECORD_IS_DELETE (flags)))
      {
        /* ... condensed: unpack type byte + body; spage_insert_at ... */
      }
    else if ((!is_undo && LOG_RV_RECORD_IS_DELETE (flags)) || (is_undo && LOG_RV_RECORD_IS_INSERT (flags)))
      {
        /* ... condensed: spage_delete ... */
      }
    else if (LOG_RV_RECORD_IS_UPDATE_ALL (flags))
      {
        /* ... condensed: unpack type + body; spage_update ... */
      }
    else
      {
        assert (LOG_RV_RECORD_IS_UPDATE_PARTIAL (flags));
        // ... condensed: spage_get_record (..., COPY); /* <- splice on a private copy */
        // log_rv_undoredo_record_partial_changes (..., is_undo); spage_update ...
      }
    pgbuf_set_dirty (thread_p, rcv->pgptr, DONT_FREE);	/* <- every success path lands here */
    return NO_ERROR;

The four arms are mutually exclusive and exhaustive; every failure path is assert_release (false) plus ER_FAILED before the dirty mark, so a failed arm never advertises a half-applied page. (One blemish: the UPDATE_PARTIAL arm’s failed spage_update returns error_code, which is still NO_ERROR there, so the failure is asserted but not propagated.) log_rv_redo_record_modify / log_rv_undo_record_modify are one-line wrappers binding is_undo to false/true, giving the table two distinct pointers.
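The flag decoding and the insert/delete swap under undo can be isolated from the page operations. A minimal sketch, assuming a 16-bit field and using the mask and flag values from the table above; the TOY_* constants, ToyAction, ToyDecoded, and toy_decode are hypothetical names, and the returned action merely labels which slotted-page call the real ladder would make.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint16_t TOY_MODIFY_MASK    = 0xC000;
constexpr std::uint16_t TOY_INSERT         = 0x8000;
constexpr std::uint16_t TOY_DELETE         = 0x4000;
constexpr std::uint16_t TOY_UPDATE_ALL     = 0xC000;
constexpr std::uint16_t TOY_UPDATE_PARTIAL = 0x0000;

enum class ToyAction { InsertAt, Delete, UpdateAll, SplicePartial };

struct ToyDecoded { std::uint16_t slotid; ToyAction action; };

// Splits the field into flag bits and slot id, then picks the arm.
ToyDecoded toy_decode(std::uint16_t offset_field, bool is_undo)
{
  const std::uint16_t flags = static_cast<std::uint16_t>(offset_field & TOY_MODIFY_MASK);
  const std::uint16_t slotid = static_cast<std::uint16_t>(offset_field & ~TOY_MODIFY_MASK);
  ToyAction action;
  if (flags == TOY_INSERT) {
    action = is_undo ? ToyAction::Delete : ToyAction::InsertAt;   // undo of insert deletes
  } else if (flags == TOY_DELETE) {
    action = is_undo ? ToyAction::InsertAt : ToyAction::Delete;   // undo of delete re-inserts
  } else if (flags == TOY_UPDATE_ALL) {
    action = ToyAction::UpdateAll;
  } else {
    assert(flags == TOY_UPDATE_PARTIAL);
    action = ToyAction::SplicePartial;
  }
  return {slotid, action};
}
```

The same arithmetic shows why any value above 0x3FFF is unsafe in such a field: 0x8FFF, for instance, decodes as the INSERT flag with slot id 0x0FFF.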
The b-tree indices RVBT_RECORD_MODIFY_UNDOREDO / _NO_UNDO / _COMPENSATE register btree_rv_redo_record_modify / btree_rv_undo_record_modify (btree.c) instead; their core btree_rv_record_modify_internal clones this ladder (wider BTREE_RV_FLAGS_MASK, same call into log_rv_undoredo_record_partial_changes) plus node-header upkeep.
10.8 The registration ritual — adding a new LOG_RCVINDEX
1. Append the enumerator at the bottom of LOG_RCVINDEX; retarget RV_LAST_LOGID. Never renumber: values persist in on-disk logs.
2. Append the matching rvfun literal at the last slot of RV_fun[], string = enumerator spelling.
3. Pick arms by logging discipline: both for undoredo, redo-only for redo/postpone, undo-only for logical-undo or compensate-replay (§10.1’s crossed wire).
4. If the index needs a manual compensation/postpone protocol, add it to the right RCV_IS_* macro and implement that protocol (§10.4 invariant).
5. Build with asserts and boot: rv_check_rvfuns is the only mechanical check. A skipped or transposed slot dies fatally there; a release build applies the wrong function to every record from the bad slot on.
10.9 Chapter summary — key takeaways
- RV_fun[] maps the on-disk rcvindex to function pointers via an unchecked array load; recv_index == position, checked only by rv_check_rvfuns at debug startup, is load-bearing for every pass; the enum is append-only because the numbers persist in logs.
- NULL arms encode logging contracts policed at append time; recovery mostly trusts them.
- The redo pass replays LOG_COMPENSATE through the undofun slot, so compensate-only indices register their redo-direction function as undofun.
- The RCV_IS_* macros pick the compensation/postpone protocol wrapped around the undofun call in log_rv_undo_record (Ch 9).
- The packed splice format is direction-agnostic (the undo packer pre-swaps size bytes and stores old data), so one interpreter replays redo forward, undo reversed on unwind; log_rv_record_modify_internal layers a four-way ladder over it, insert/delete arms swapping under undo, and the b-tree clones the machinery rather than registering the generic wrappers.
Chapter 11: Special Paths
Off the main crash-restart lifecycle: point-in-time restore, log truncation, post-restore archive/volume discard, append-point repair, execution-context shims, and the 2PC handoff.
11.1 The stopat contract
Two log_recovery parameters gate everything: ismedia_crash and stopat (high-level doc, “Restart orchestrator”):
    // log_recovery -- src/transaction/log_recovery.c
    if (ismedia_crash != false)
      {
        /* Media crash, we may have to start from an older checkpoint... check disk headers */
        (void) fileio_map_mounted (thread_p, (bool (*)(THREAD_ENTRY *, VOLID, void *)) log_rv_find_checkpoint, &rcv_lsa);
      }
    else
      {
        // ... condensed ...
        if (stopat != NULL)
          {
            *stopat = -1;	/* <- normal restart: point-in-time target forcibly disabled */
          }
      }

Invariant (incomplete recovery happens only on the media-crash path). A normal restart neutralizes stopat; a broken page outside media recovery is fatal, not truncated, since otherwise analysis could silently destroy committed transactions. log_rv_find_checkpoint picks the oldest checkpoint among restored volumes.
Three analysis-time triggers cut the log, all converging on log_recovery_resetlog with *did_incom_recovery = true:
1. Commit/abort newer than the target. Without a media crash, log_rv_analysis_complete jumps to end (goto end), freeing the tran index (Chapter 4). Otherwise it reads LOG_REC_DONETIME; when *stop_at is set and difftime (*stop_at, last_at_time) < 0, it releases the held page (log_lsa->pageid = NULL_PAGEID), calls log_recovery_resetlog with the record’s header LSA, so that the new log ends before the too-new record, and returns NO_ERROR.
2. Commit-with-postpone newer than the target. log_rv_analysis_commit_with_postpone runs the same difftime test on LOG_REC_START_POSTPONE.at_time inside its if (is_media_crash) arm; the else arm is normal Chapter 5 bookkeeping. A passing time check on the media branch does nothing.
3. Physically broken page. When logpb_fetch_page fails in the log_recovery_analysis loop, the media branch stores last_at_time into *stop_at, rewinds the last transaction’s tail_lsa / undo_nxlsa to log_rec->prev_tranlsa (the half-written record is never undone), re-fetches the previous page (failure fatal: “reset log is impossible”), calls log_recovery_resetlog with prev_lsa / prev_prev_lsa, resets log_Gl.mvcc_table.reset_start_mvccid (), and returns. The non-media branch is fatal, with a TDE-specific message when er_errid () is ER_TDE_CIPHER_IS_NOT_LOADED.
A fourth, milder repair: when a record’s forward LSA is NULL while log_rtype != LOG_END_OF_LOG and *did_incom_recovery == false, the analysis loop calls log_startof_nxrec (§11.5) on log_Gl.hdr.append_lsa, patches log_rec->forw_lsa, rewrites the page (logpb_write_page_to_disk), and sets log_Gl.hdr.next_trid = tran_id in the same block; on failure the append point falls back to end_redo_lsa (“we may destroy a record”).
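The time test shared by the first two triggers reduces to one comparison. The helper below is a hypothetical sketch (toy_should_cut is not a CUBRID function), and it assumes that "set" means a non-null pointer whose value is not the -1 written on a normal restart.

```cpp
#include <cassert>
#include <ctime>

// True when a completion record stamped record_time is newer than the
// point-in-time target, i.e. the log should be cut before that record.
bool toy_should_cut(const std::time_t* stop_at, std::time_t record_time)
{
  if (stop_at == nullptr || *stop_at == static_cast<std::time_t>(-1)) {
    return false;   // no target, or target disabled
  }
  return std::difftime(*stop_at, record_time) < 0;
}
```

A record stamped exactly at the target is kept, because the comparison is strict.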
11.2 log_recovery_resetlog — truncate and re-arm
log_recovery_resetlog is the one function that rewrites history. It asserts LOG_CS_OWN_WRITE_MODE and non-NULL new_prev_lsa, then runs six steps:
1. Flush what exists. If log_Gl.append.vdes != NULL_VOLDES with an append page held: logpb_flush_pages_direct + logpb_invalid_all_append_pages.
2. Pick the new append LSA. NULL new_append_lsa → header restarts at 0|0. Otherwise, with no active log or a to-the-past reset at a mid-page offset, the append page is saved so its surviving prefix carries into the recreated log:

       // log_recovery_resetlog -- src/transaction/log_recovery.c
       if (log_Gl.append.vdes == NULL_VOLDES
           || (log_Gl.hdr.fpageid > new_append_lsa->pageid && new_append_lsa->offset > 0))
         {
           // ... condensed ... (rationale comment)
           newappend_pgptr = (LOG_PAGE *) aligned_newappend_pgbuf;
           if ((logpb_fetch_page (thread_p, new_append_lsa, LOG_CS_FORCE_USE, newappend_pgptr)) != NO_ERROR)
             {
               newappend_pgptr = NULL;	/* <- tolerated: the page copy is best-effort */
             }
         }
       LOG_RESET_APPEND_LSA (new_append_lsa);

3. Reset header state. chkpt_lsa = append_lsa (the truncated tail is the new checkpoint), is_shutdown = false, logpb_invalidate_pool.
4. Two regimes. If log_Gl.append.vdes == NULL_VOLDES || log_Gl.hdr.fpageid > log_Gl.hdr.append_lsa.pageid (no active log, or the append point moved before the active range), the log is rebuilt: arv_num = logpb_get_archive_number (append page - 1) + 1 names the first unneeded archive (-1 fatal); log_recovery_notpartof_archives (§11.3) removes from there up (reason strdup-ed, raw fallback); the header is rewritten as if the log began here: fpageid = nxarv_pageid = append_lsa.pageid, nxarv_num = arv_num, last_arv_num_for_syscrashes = last_deleted_arv_num = -1. A missing active log file is recreated (disk_get_db_creation, fileio_format, logpb_create_header_page, logpb_flush_page, failures fatal), and either way a fresh first append page is created and flushed. Else only nxarv_pageid is clamped down if past the new append page.
5. Re-seed the append page. logpb_fetch_start_append_page; on success a step-2 saved image is memcpy-ed over the fetched buffer, marked dirty, flushed direct. If logpb_fetch_start_append_page fails, the restore-and-flush is skipped silently (no error is raised) and finalization proceeds regardless.
6. Finalize. LOG_RESET_PREV_LSA (new_prev_lsa); mvcc_op_log_lsa.set_null () and vacuum_last_blockid = 0 disconnect vacuum from truncated ranges; was_active_log_reset = true; logpb_flush_header; logpb_decache_archive_info.
Invariant — after resetlog, every position-bearing header field points at or before the new append LSA. chkpt_lsa, fpageid, nxarv_pageid, prev_lsa, and the vacuum anchors are rewritten in one LOG_CS critical section; a missed field would send vacuum or the archiver chasing truncated pages.
11.3 log_recovery_notpartof_archives
Archives start_arv_num and up describe truncated pages. Two scan modes, keyed on whether the active log (a trustworthy header) is mounted:
    // log_recovery_notpartof_archives -- src/transaction/log_recovery.c
    if (log_Gl.append.vdes != NULL_VOLDES)
      {
        /* Trust the current log header */
        // ... condensed ... (unformat archives start_arv_num .. nxarv_num - 1)
      }
    else
      {
        /* We don't know where to stop. Stop when an archive is not in the OS */
        for (i = start_arv_num; i <= INT_MAX; i++)
          {
            fileio_make_log_archive_name (logarv_name, log_Archive_path, log_Prefix, i);
            if (fileio_is_volume_exist (logarv_name) == false)
              {
                // ... condensed ... /* <- rebuild name of archive i-1, the LAST removed */
                break;
              }
            fileio_unformat (thread_p, logarv_name);
          }
      }

With info_reason non-NULL and at least one archive removed (start_arv_num != i), a REMOVE ... REASON line goes to the log-info file via log_dump_log_info (single-vs-range format branch); errors other than ER_LOG_MOUNT_FAIL return early, before the header update (the files are already gone). Finally log_Gl.hdr.last_deleted_arv_num = (start_arv_num == i) ? i : i - 1 (set even on a no-op call, which is a quirk); the header is flushed only when the active log is mounted; logpb_decache_archive_info is left to callers.
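The open-ended scan and the final last_deleted_arv_num expression can be reproduced over a set of archive numbers. This is a toy under stated assumptions: a std::set stands in for the files present on disk, and ToyScanResult and toy_remove_archives are hypothetical names.

```cpp
#include <cassert>
#include <set>
#include <vector>

struct ToyScanResult { std::vector<int> removed; int last_deleted_arv_num; };

// Removes consecutive archives starting at start_arv_num and stops at the
// first number that is not on disk.
ToyScanResult toy_remove_archives(std::set<int>& on_disk, int start_arv_num)
{
  ToyScanResult result{};
  int i = start_arv_num;
  for (; on_disk.count(i) != 0; i++) {
    on_disk.erase(i);                 // stands in for unformatting the archive file
    result.removed.push_back(i);
  }
  // Same expression as in the text: i is the first archive that was NOT found.
  result.last_deleted_arv_num = (start_arv_num == i) ? i : i - 1;
  return result;
}
```

Running the toy with nothing to remove still reports start_arv_num as the last deleted archive, which is the no-op quirk noted above; archives beyond a gap in the numbering are left untouched.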
11.4 log_recovery_notpartof_volumes
When did_incom_recovery is set, the driver calls log_recovery_notpartof_volumes after the undo pass. The boundary: start_volid = boot_find_next_permanent_volid (thread_p), the first volid the restored catalog does not know about. Two sweeps:
Sweep 1: already-mounted volumes. fileio_map_mounted runs log_unformat_ahead_volumes over every mounted volume: if volid != NULL_VOLID && volid >= *start_volid, the volume's buffer-pool pages are dropped first (pgbuf_invalidate_all, so no stale dirty page is later flushed into the volume), then the volume is fileio_unformat-ed and its label freed. If invalidation fails the callback returns false, stopping the map early; stragglers fall to sweep 2.
Sweep 2: volumes lying around on disk. Extension-named candidates are probed from start_volid to LOG_MAX_DBVOLID, breaking at the first missing name. Each candidate is mounted, its creation time read via disk_get_db_creation, and dismounted; only if difftime (vol_dbcreation, log_Gl.hdr.db_creation) == 0 is it unformatted. The db_creation timestamp is the identity test: a same-named volume that belongs to an unrelated database is deliberately left alone. A candidate that exists but fails to mount (vdes == NULL_VOLDES) is skipped silently and never unformatted. The extension directory derives from log_Db_fullname (empty-string fallback on malloc or fileio_get_directory_path failure). logpb_recreate_volume_info then rebuilds the volume-info file.
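The three outcomes of sweep 2 (stop, skip, unformat) fit in one loop. The sketch below is a minimal model under stated assumptions: candidate_volume and select_volumes_to_unformat are hypothetical, and the vector replaces the real mount, read, and dismount calls. Only the control flow and the difftime identity test follow the text.

```cpp
#include <cassert>
#include <ctime>
#include <vector>

// Hypothetical description of one extension-named file probed on disk.
struct candidate_volume
{
  bool exists;           // false models "name not found": the probe stops here
  bool mountable;        // false models vdes == NULL_VOLDES
  std::time_t creation;  // what disk_get_db_creation would report
};

// Returns the indices (relative to start_volid) that would be unformatted.
inline std::vector<int>
select_volumes_to_unformat (const std::vector<candidate_volume> &candidates, std::time_t db_creation)
{
  std::vector<int> doomed;
  for (int i = 0; i < static_cast<int> (candidates.size ()); i++)
    {
      const candidate_volume &vol = candidates[i];
      if (!vol.exists)
        {
          break;  // first missing name ends the sweep
        }
      if (!vol.mountable)
        {
          continue;  // skipped silently, never unformatted
        }
      if (std::difftime (vol.creation, db_creation) == 0)
        {
          doomed.push_back (i);  // same database: discard
        }
      // otherwise: another database's volume, deliberately left alone
    }
  return doomed;
}
```

The model makes the asymmetry visible: a timestamp mismatch or a mount failure lets the sweep continue, while a missing name ends it, so any candidate after a gap is never examined.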
11.5 log_startof_nxrec
log_startof_nxrec answers: where does the next record start? Analysis uses it (§11.1, repair 4) when the last record’s forw_lsa is NULL but the record is complete. Branches:
- NULL input LSA → return NULL; logpb_fetch_page failure → goto error.
- lsa->offset == NULL_OFFSET (page from an archive cut mid-record) → adopt log_pgptr->hdr.offset, the first record the page knows; still NULL → error.
- canuse_forwaddr == true → take log_rec->forw_lsa; if NULL but the page lives in an archive, the next record can only be at pageid + 1 (incomplete record archived, completed later). Only if still NULL does it fall to manual scan.
- Manual scan: advance past LOG_RECORD_HEADER, then a switch (type) steps over the type-specific header plus every variable payload: undo/redo images (GET_ZIP_LEN-decoded), postpone/compensate lengths, checkpoint arrays, savepoint names, 2PC and replication payloads, the sysop family’s conditional embedded undo image (Chapter 5); fixed-size markers just break. The epilogue LOG_READ_ADVANCE_WHEN_DOESNT_FIT rounds up to the next page when another record header cannot fit.
- Two quirks: LOG_SUPPLEMENTAL_INFO lacks a break and falls into the marker group, which is currently harmless since those cases do nothing; LOG_END_OF_LOG is assert (false), since no caller asks for the record after end-of-log.
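The manual scan is easier to follow as arithmetic on a (page, offset) pair. The following is a deliberately simplified sketch: the page area, alignment, and header size are made-up toy constants, toy_log_pos, toy_skip, and toy_startof_nxrec are hypothetical names, and real records carry type-specific layouts that this model reduces to a list of lengths.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// All sizes below are made-up toy values, not CUBRID constants.
constexpr int TOY_PAGE_AREA = 4096;    // usable bytes per log page
constexpr int TOY_ALIGN = 8;           // alignment of every piece
constexpr int TOY_RECORD_HEADER = 32;  // stands in for LOG_RECORD_HEADER

struct toy_log_pos
{
  std::int64_t pageid;
  int offset;
};

inline int
toy_align (int n)
{
  return (n + TOY_ALIGN - 1) / TOY_ALIGN * TOY_ALIGN;
}

// Align the cursor, then step over `length` bytes, spilling into later pages.
inline void
toy_skip (toy_log_pos &pos, int length)
{
  assert (length >= 0);
  pos.offset = toy_align (pos.offset) + length;
  while (pos.offset >= TOY_PAGE_AREA)
    {
      pos.offset -= TOY_PAGE_AREA;
      pos.pageid++;
    }
}

// Walks one record: common header, type-specific header, each payload.
inline toy_log_pos
toy_startof_nxrec (toy_log_pos rec, int type_header_len, const std::vector<int> &payload_lens)
{
  toy_log_pos pos = rec;
  toy_skip (pos, TOY_RECORD_HEADER);
  toy_skip (pos, type_header_len);
  for (int len : payload_lens)
    {
      toy_skip (pos, len);
    }
  pos.offset = toy_align (pos.offset);
  // Epilogue: if another record header cannot fit, round up to the next page.
  if (pos.offset + TOY_RECORD_HEADER > TOY_PAGE_AREA)
    {
      pos.pageid++;
      pos.offset = 0;
    }
  return pos;
}
```

Under these toy sizes, a record starting at offset 4000 with a 16-byte type header and a 40-byte payload ends at 4088; a 32-byte header no longer fits, so the next record starts at offset 0 of the following page.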
11.6 The simulate/end shims
Undo, postpone, and sysop-abort code expects the current transaction in thread_p->tran_index or an attached system tdes. Recovery’s thread owns LOG_SYSTEM_TRAN_INDEX and walks other transactions’ chains, so each per-tdes operation is bracketed by a shim pair:

```cpp
// log_rv_simulate_runtime_worker -- src/transaction/log_recovery.c
if (tdes->is_active_worker_transaction ())
  {
    thread_p->tran_index = tdes->tran_index;    /* <- runtime code now sees this tdes as "mine" */
    // ... condensed ... (SA_MODE: mirror via LOG_SET_CURRENT_TRAN_INDEX)
  }
else if (tdes->is_system_worker_transaction ())
  {
    log_system_tdes::rv_simulate_system_tdes (tdes->trid);    /* <- attach system tdes to thread */
  }
else
  {
    assert (false);
  }
```

```cpp
// log_rv_end_simulation -- src/transaction/log_recovery.c
thread_p->reset_system_tdes ();
thread_p->tran_index = LOG_SYSTEM_TRAN_INDEX;    /* <- unconditional restore */
// ... condensed ... (SA_MODE: mirror restore)
```

Both shims keep the SA-mode global mirror (LOG_SET_CURRENT_TRAN_INDEX under #if defined (SA_MODE)). For a system worker transaction (Chapter 4’s rebuilt log_system_tdes population) rv_simulate_system_tdes looks the trid up in systb_System_tdes (asserting on a miss) and installs it via set_system_tdes.
Invariant — every simulate is paired with an end; the thread is back on LOG_SYSTEM_TRAN_INDEX between transactions. log_rv_undo_record (Chapter 9) closes the pair after its end: label, so error paths restore the thread too; log_recovery_finish_all_postpone and log_recovery_abort_all_atomic_sysops (Chapter 8) wrap it in a per-tdes lambda and assert tran_index == LOG_SYSTEM_TRAN_INDEX on entry. A missing end would leave a stale system tdes attached, logging for the wrong transaction.
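One way to see why the pairing matters is to express it as a scope guard. This is an illustration only: per the text, CUBRID pairs the shims with explicit calls and an end: label, not with a guard object. toy_thread, toy_simulation_scope, and toy_run_for_tdes are hypothetical, and TOY_SYSTEM_TRAN_INDEX merely stands in for LOG_SYSTEM_TRAN_INDEX.

```cpp
#include <cassert>

constexpr int TOY_SYSTEM_TRAN_INDEX = 0;  // stands in for LOG_SYSTEM_TRAN_INDEX

// Hypothetical thread entry: only the field the shims touch.
struct toy_thread
{
  int tran_index = TOY_SYSTEM_TRAN_INDEX;
};

// Hypothetical guard: the constructor plays the simulate shim,
// the destructor plays the end shim.
class toy_simulation_scope
{
public:
  toy_simulation_scope (toy_thread &thread, int worker_tran_index)
    : m_thread (thread)
  {
    // Mirrors the entry assertion: simulations are never nested.
    assert (m_thread.tran_index == TOY_SYSTEM_TRAN_INDEX);
    m_thread.tran_index = worker_tran_index;
  }

  ~toy_simulation_scope ()
  {
    m_thread.tran_index = TOY_SYSTEM_TRAN_INDEX;  // unconditional restore
  }

  toy_simulation_scope (const toy_simulation_scope &) = delete;
  toy_simulation_scope &operator= (const toy_simulation_scope &) = delete;

private:
  toy_thread &m_thread;
};

// Runs `op` as if the thread belonged to worker_tran_index and returns
// op's error code; the restore happens on every exit path.
template <typename Op>
int
toy_run_for_tdes (toy_thread &thread, int worker_tran_index, Op op)
{
  toy_simulation_scope scope (thread, worker_tran_index);
  return op (thread);
}
```

Even when the operation returns an error code, the thread is back on the system index once toy_run_for_tdes returns, which is the same guarantee the end: label gives log_rv_undo_record.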
11.7 The 2PC handoff
After undo and (if needed) log_recovery_notpartof_volumes, the driver counts distributed loose ends:

```cpp
// log_recovery -- src/transaction/log_recovery.c
(void) logtb_set_num_loose_end_trans (thread_p);

/* Try to finish any 2PC blocked transactions */
if (log_Gl.trantable.num_coord_loose_end_indices > 0 || log_Gl.trantable.num_prepared_loose_end_indices > 0)
  {
    log_Gl.rcv_phase = LOG_RECOVERY_FINISH_2PC_PHASE;
    // ... condensed ...
    log_2pc_recovery (thread_p);

    /* Check number of loose end transactions again.. */
    // ... condensed ... (reset rcv_tdes, re-bind tran index)
    (void) logtb_set_num_loose_end_trans (thread_p);
  }
```

logtb_set_num_loose_end_trans zeroes both counters under TR_TABLE_CS_ENTER and walks every non-system tdes with a valid trid through logtb_set_loose_end_tdes: LOG_ISTRAN_2PC_PREPARE sets isloose_end and bumps num_prepared_loose_end_indices (in-doubt participant; keeps locks); LOG_ISTRAN_2PC_IN_SECOND_PHASE or TRAN_UNACTIVE_2PC_COLLECTING_PARTICIPANT_VOTES bumps num_coord_loose_end_indices (coordinator re-drives its decision). The driver keys off the two globals, not the returned sum.
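The counting pass reduces to a filter plus a two-way classification. The sketch below is a toy model with hypothetical names (toy_state, toy_tdes, toy_loose_end_counts, toy_count_loose_ends, toy_needs_2pc_phase); the state enum collapses the real predicates into four values, and the critical section is omitted.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, collapsed view of the transaction states that matter here.
enum class toy_state
{
  active,
  prepared_2pc,       // models LOG_ISTRAN_2PC_PREPARE
  second_phase_2pc,   // models LOG_ISTRAN_2PC_IN_SECOND_PHASE
  collecting_votes    // models the collecting-participant-votes state
};

struct toy_tdes
{
  bool is_system;
  bool has_valid_trid;
  toy_state state;
};

// Stands in for the two trantable globals.
struct toy_loose_end_counts
{
  int coord = 0;
  int prepared = 0;
};

// Zeroes the counters, then classifies every eligible tdes. Returns the sum,
// which the driver in the text ignores in favour of the two counters.
inline int
toy_count_loose_ends (const std::vector<toy_tdes> &table, toy_loose_end_counts &counts)
{
  counts = toy_loose_end_counts ();
  for (const toy_tdes &tdes : table)
    {
      if (tdes.is_system || !tdes.has_valid_trid)
        {
          continue;
        }
      switch (tdes.state)
        {
        case toy_state::prepared_2pc:
          counts.prepared++;  // in-doubt participant
          break;
        case toy_state::second_phase_2pc:
        case toy_state::collecting_votes:
          counts.coord++;  // coordinator must re-drive its decision
          break;
        default:
          break;
        }
    }
  assert (counts.coord >= 0 && counts.prepared >= 0);
  return counts.coord + counts.prepared;
}

// The handoff condition: the 2PC phase runs iff either counter is non-zero.
inline bool
toy_needs_2pc_phase (const toy_loose_end_counts &counts)
{
  return counts.coord > 0 || counts.prepared > 0;
}
```

Because the function zeroes the counters on every call, running it a second time after the 2PC phase gives a fresh count rather than an accumulated one, which is what makes the recount in the driver meaningful.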
log_2pc_recovery sweeps the table — skipping tdes == NULL, NULL_TRANID, and !LOG_ISTRAN_2PC (tdes) — and switches on tdes->state: collecting-votes aborts the undecided coordinator, abort/commit-decision re-executes the decision, and TRAN_UNACTIVE_WILL_COMMIT / TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE [[fallthrough]] into informing participants — local postpones are already done (Chapter 8). Vote mechanics belong to the 2PC document and the high-level companion’s “Transaction table with loose-end annotations”; here only the handoff condition matters: the fourth phase runs iff a coordinator or prepared loose end survived analysis. Prepared participants without a verdict legitimately remain in-doubt — hence a recount, not an assert-zero.
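The state switch described above can be reduced to a pure function from state to action. This is a sketch under assumptions: toy_2pc_state, toy_2pc_action, and toy_2pc_action_for are hypothetical, the informing_participants value is an assumed name for the case the two commit states fall into, and the actual work each action performs is not modelled.

```cpp
#include <cassert>

// Hypothetical, simplified state and action sets for illustration.
enum class toy_2pc_state
{
  collecting_votes,
  abort_decision,
  commit_decision,
  will_commit,
  committed_with_postpone,
  informing_participants,  // assumed name for the fall-through target
  other
};

enum class toy_2pc_action
{
  abort_coordinator,
  reexecute_abort,
  reexecute_commit,
  inform_participants,
  none
};

inline toy_2pc_action
toy_2pc_action_for (toy_2pc_state state)
{
  switch (state)
    {
    case toy_2pc_state::collecting_votes:
      return toy_2pc_action::abort_coordinator;  // undecided coordinator aborts
    case toy_2pc_state::abort_decision:
      return toy_2pc_action::reexecute_abort;
    case toy_2pc_state::commit_decision:
      return toy_2pc_action::reexecute_commit;
    case toy_2pc_state::will_commit:
    case toy_2pc_state::committed_with_postpone:
      // Local postpones are assumed finished already; share the next arm.
    case toy_2pc_state::informing_participants:
      return toy_2pc_action::inform_participants;
    default:
      return toy_2pc_action::none;
    }
}
```

A prepared participant has no arm in this model and maps to none, which matches the point that it can remain in-doubt after the phase and is why the driver recounts instead of asserting zero.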
11.8 Chapter summary — key takeaways
- Incomplete recovery is media-crash-only: a normal restart forces *stopat = -1; a broken page outside media recovery is fatal, never truncated.
- Three triggers cut the log (completion or commit-with-postpone newer than stopat, or an unreadable page), all via log_recovery_resetlog; a missing end-of-log is patched via log_startof_nxrec.
- log_recovery_resetlog rewrites every position-bearing header field in one LOG_CS section and delegates archive removal to log_recovery_notpartof_archives.
- Volume discard is two-phase and identity-checked by db_creation; buffer pages are invalidated before unformat; mount failures are skipped.
- log_startof_nxrec walks record lengths type by type, so a new payload layout means a new switch arm; LOG_SUPPLEMENTAL_INFO’s missing break is only accidentally harmless.
- The simulate/end shims bind the thread to a worker or system tdes so runtime code runs unmodified; pairing is structural; both keep the SA-mode mirror.
- LOG_RECOVERY_FINISH_2PC_PHASE runs only when coordinator or prepared loose ends survive; prepared participants may stay in-doubt.
Position hints as of this revision
The following are line numbers as observed on 2026-06-11; symbols are the canonical anchor and line numbers are hints that decay.
| Symbol | File | Line |
|---|---|---|
| vacuum_notify_server_crashed | src/query/vacuum.c | 7570 |
| btree_rv_record_modify_internal | src/storage/btree.c | 29757 |
| NULL_OFFSET | src/storage/storage_common.h | 49 |
| RECORD_REPLACE_DATA | src/storage/storage_common.h | 231 |
| log_2pc_recovery_analysis_info | src/transaction/log_2pc.c | 2029 |
| log_2pc_recovery | src/transaction/log_2pc.c | 2303 |
| LOG_RV_RECORD_MODIFY_MASK | src/transaction/log_append.hpp | 139 |
| LOG_PAGE_INIT_VALUE | src/transaction/log_common_impl.h | 46 |
| log_zip | src/transaction/log_compress.c | 45 |
| log_unzip | src/transaction/log_compress.c | 112 |
| log_diff | src/transaction/log_compress.c | 176 |
| log_zip_realloc_if_needed | src/transaction/log_compress.c | 203 |
| log_zip_alloc | src/transaction/log_compress.c | 238 |
| log_zip_free | src/transaction/log_compress.c | 279 |
| GET_ZIP_LEN | src/transaction/log_compress.h | 36 |
| ZIP_CHECK | src/transaction/log_compress.h | 39 |
| log_zip | src/transaction/log_compress.h | 53 |
| LOG_ISTRAN_2PC | src/transaction/log_impl.h | 173 |
| LOG_HAS_LOGGING_BEEN_IGNORED | src/transaction/log_impl.h | 190 |
| log_rcv_tdes | src/transaction/log_impl.h | 458 |
| log_recvphase | src/transaction/log_impl.h | 625 |
| log_cs_access_mode | src/transaction/log_impl.h | 923 |
| log_initialize_internal | src/transaction/log_manager.c | 1100 |
| log_append_compensate | src/transaction/log_manager.c | 2985 |
| log_append_compensate_with_undo_nxlsa | src/transaction/log_manager.c | 3011 |
| log_append_compensate_internal | src/transaction/log_manager.c | 3047 |
| log_sysop_end_recovery_postpone | src/transaction/log_manager.c | 4024 |
| log_complete | src/transaction/log_manager.c | 5653 |
| log_rollback_record | src/transaction/log_manager.c | 7349 |
| log_get_next_nested_top | src/transaction/log_manager.c | 8023 |
| log_do_postpone | src/transaction/log_manager.c | 8237 |
| log_run_postpone_op | src/transaction/log_manager.c | 8481 |
| log_execute_run_postpone | src/transaction/log_manager.c | 8543 |
| log_read_sysop_start_postpone | src/transaction/log_manager.c | 9962 |
| LOGPB_IS_ARCHIVE_PAGE | src/transaction/log_page_buffer.c | 155 |
| logpb_page_has_valid_checksum | src/transaction/log_page_buffer.c | 523 |
| logpb_fetch_page | src/transaction/log_page_buffer.c | 1739 |
| logpb_copy_page | src/transaction/log_page_buffer.c | 1871 |
| logpb_read_page_from_file | src/transaction/log_page_buffer.c | 2003 |
| logpb_fetch_start_append_page | src/transaction/log_page_buffer.c | 2504 |
| logpb_page_get_first_null_block_lsa | src/transaction/log_page_buffer.c | 3190 |
| logpb_is_page_in_archive | src/transaction/log_page_buffer.c | 4994 |
| logpb_copy_from_log | src/transaction/log_page_buffer.c | 6532 |
| logpb_checkpoint | src/transaction/log_page_buffer.c | 6877 |
| logpb_page_check_corruption | src/transaction/log_page_buffer.c | 11508 |
| log_reader | src/transaction/log_reader.hpp | 36 |
| log_reader::set_lsa_and_fetch_page | src/transaction/log_reader.hpp | 162 |
| LOG_READ_ALIGN | src/transaction/log_reader.hpp | 315 |
| log_rec_undo | src/transaction/log_record.hpp | 176 |
| log_vacuum_info | src/transaction/log_record.hpp | 192 |
| log_rec_mvcc_undo | src/transaction/log_record.hpp | 211 |
| log_rec_compensate | src/transaction/log_record.hpp | 262 |
| log_sysop_end_type | src/transaction/log_record.hpp | 285 |
| log_rec_sysop_end | src/transaction/log_record.hpp | 305 |
| log_rec_sysop_start_postpone | src/transaction/log_record.hpp | 328 |
| log_rec_chkpt | src/transaction/log_record.hpp | 345 |
| log_info_chkpt_trans | src/transaction/log_record.hpp | 354 |
| log_info_chkpt_sysop | src/transaction/log_record.hpp | 372 |
| log_rv_undo_record | src/transaction/log_recovery.c | 163 |
| log_rv_redo_record | src/transaction/log_recovery.c | 430 |
| log_rv_fix_page_and_check_redo_is_needed | src/transaction/log_recovery.c | 494 |
| log_rv_need_sync_redo | src/transaction/log_recovery.c | 541 |
| log_rv_find_checkpoint | src/transaction/log_recovery.c | 579 |
| log_rv_get_unzip_log_data | src/transaction/log_recovery.c | 609 |
| log_rv_get_unzip_and_diff_redo_log_data | src/transaction/log_recovery.c | 699 |
| log_recovery | src/transaction/log_recovery.c | 736 |
| log_rv_analysis_undo_redo | src/transaction/log_recovery.c | 965 |
| log_rv_analysis_dummy_head_postpone | src/transaction/log_recovery.c | 1000 |
| log_rv_analysis_postpone | src/transaction/log_recovery.c | 1042 |
| log_rv_analysis_run_postpone | src/transaction/log_recovery.c | 1086 |
| log_rv_analysis_compensate | src/transaction/log_recovery.c | 1181 |
| log_rv_analysis_commit_with_postpone | src/transaction/log_recovery.c | 1230 |
| log_rv_analysis_commit_with_postpone_obsolete | src/transaction/log_recovery.c | 1315 |
| log_rv_analysis_sysop_start_postpone | src/transaction/log_recovery.c | 1365 |
| log_rv_analysis_atomic_sysop_start | src/transaction/log_recovery.c | 1472 |
| log_rv_analysis_complete | src/transaction/log_recovery.c | 1509 |
| log_rv_analysis_sysop_end | src/transaction/log_recovery.c | 1612 |
| log_rv_analysis_start_checkpoint | src/transaction/log_recovery.c | 1797 |
| log_rv_analysis_end_checkpoint | src/transaction/log_recovery.c | 1830 |
| log_rv_analysis_save_point | src/transaction/log_recovery.c | 2077 |
| log_rv_analysis_2pc_prepare | src/transaction/log_recovery.c | 2114 |
| log_rv_analysis_2pc_start | src/transaction/log_recovery.c | 2153 |
| log_rv_analysis_2pc_commit_decision | src/transaction/log_recovery.c | 2190 |
| log_rv_analysis_2pc_abort_decision | src/transaction/log_recovery.c | 2224 |
| log_rv_analysis_2pc_commit_inform_particps | src/transaction/log_recovery.c | 2258 |
| log_rv_analysis_2pc_abort_inform_particps | src/transaction/log_recovery.c | 2293 |
| log_rv_analysis_2pc_recv_ack | src/transaction/log_recovery.c | 2328 |
| log_rv_analysis_log_end | src/transaction/log_recovery.c | 2355 |
| log_rv_analysis_record | src/transaction/log_recovery.c | 2378 |
| log_is_page_of_record_broken | src/transaction/log_recovery.c | 2518 |
| log_recovery_analysis | src/transaction/log_recovery.c | 2587 |
| log_recovery_needs_skip_logical_redo | src/transaction/log_recovery.c | 3153 |
| log_recovery_get_redo_parallel_count | src/transaction/log_recovery.c | 3197 |
| log_recovery_redo | src/transaction/log_recovery.c | 3251 |
| BUILD_RECORD_INFO | src/transaction/log_recovery.c | 3468 |
| INVOKE_REDO_RECORD | src/transaction/log_recovery.c | 3471 |
| log_recovery_abort_interrupted_sysop | src/transaction/log_recovery.c | 3960 |
| log_recovery_finish_sysop_postpone | src/transaction/log_recovery.c | 4064 |
| log_recovery_finish_postpone | src/transaction/log_recovery.c | 4174 |
| log_recovery_finish_all_postpone | src/transaction/log_recovery.c | 4243 |
| log_recovery_abort_all_atomic_sysops | src/transaction/log_recovery.c | 4280 |
| log_recovery_abort_atomic_sysop | src/transaction/log_recovery.c | 4317 |
| log_recovery_undo | src/transaction/log_recovery.c | 4418 |
| log_recovery_notpartof_archives | src/transaction/log_recovery.c | 4997 |
| log_unformat_ahead_volumes | src/transaction/log_recovery.c | 5100 |
| log_recovery_notpartof_volumes | src/transaction/log_recovery.c | 5132 |
| log_recovery_resetlog | src/transaction/log_recovery.c | 5221 |
| log_startof_nxrec | src/transaction/log_recovery.c | 5414 |
| log_recovery_find_first_postpone | src/transaction/log_recovery.c | 5793 |
| log_rv_undoredo_partial_changes_recursive | src/transaction/log_recovery.c | 6048 |
| log_rv_undoredo_record_partial_changes | src/transaction/log_recovery.c | 6144 |
| log_rv_redo_record_modify | src/transaction/log_recovery.c | 6173 |
| log_rv_undo_record_modify | src/transaction/log_recovery.c | 6191 |
| log_rv_record_modify_internal | src/transaction/log_recovery.c | 6210 |
| log_rv_pack_redo_record_changes | src/transaction/log_recovery.c | 6310 |
| log_rv_pack_undo_record_changes | src/transaction/log_recovery.c | 6352 |
| log_rv_redo_fix_page | src/transaction/log_recovery.c | 6390 |
| log_rv_simulate_runtime_worker | src/transaction/log_recovery.c | 6417 |
| log_rv_end_simulation | src/transaction/log_recovery.c | 6438 |
| log_cnt_pages_containing_lsa | src/transaction/log_recovery.c | 6449 |
| log_find_unilaterally_largest_undo_lsa | src/transaction/log_recovery.c | 6470 |
| vpid_lsa_consistency_check::check | src/transaction/log_recovery_redo.cpp | 28 |
| log_rv_redo_context::log_rv_redo_context | src/transaction/log_recovery_redo.cpp | 52 |
| log_rv_redo_context | src/transaction/log_recovery_redo.hpp | 33 |
| log_rv_redo_rec_info | src/transaction/log_recovery_redo.hpp | 53 |
| log_rv_get_log_rec_data | src/transaction/log_recovery_redo.hpp | 112 |
| log_rv_get_log_rec_mvccid | src/transaction/log_recovery_redo.hpp | 163 |
| log_rv_get_log_rec_vpid | src/transaction/log_recovery_redo.hpp | 206 |
| log_rv_get_log_rec_redo_length | src/transaction/log_recovery_redo.hpp | 273 |
| log_rv_get_log_rec_offset | src/transaction/log_recovery_redo.hpp | 316 |
| log_rv_get_fun | src/transaction/log_recovery_redo.hpp | 359 |
| log_rv_get_fun<LOG_REC_COMPENSATE> | src/transaction/log_recovery_redo.hpp | 396 |
| log_rv_get_fun | src/transaction/log_recovery_redo.hpp | 396 |
| log_rv_get_log_rec_redo_data | src/transaction/log_recovery_redo.hpp | 457 |
| vpid_lsa_consistency_check | src/transaction/log_recovery_redo.hpp | 558 |
| log_rv_redo_record_sync | src/transaction/log_recovery_redo.hpp | 587 |
| redo_task | src/transaction/log_recovery_redo_parallel.cpp | 99 |
| redo_task::execute | src/transaction/log_recovery_redo_parallel.cpp | 221 |
| redo_parallel::add | src/transaction/log_recovery_redo_parallel.cpp | 626 |
| redo_parallel::wait_for_termination_and_stop_execution | src/transaction/log_recovery_redo_parallel.cpp | 635 |
| redo_parallel::wait_past_target_lsa | src/transaction/log_recovery_redo_parallel.cpp | 728 |
| redo_job_impl::execute | src/transaction/log_recovery_redo_parallel.cpp | 752 |
| reusable_jobs_stack::blocking_pop | src/transaction/log_recovery_redo_parallel.cpp | 868 |
| redo_parallel | src/transaction/log_recovery_redo_parallel.hpp | 55 |
| task_active_state_bookkeeping | src/transaction/log_recovery_redo_parallel.hpp | 100 |
| min_unapplied_log_lsa_monitoring | src/transaction/log_recovery_redo_parallel.hpp | 131 |
| redo_job_base | src/transaction/log_recovery_redo_parallel.hpp | 215 |
| redo_job_impl | src/transaction/log_recovery_redo_parallel.hpp | 269 |
| reusable_jobs_stack | src/transaction/log_recovery_redo_parallel.hpp | 306 |
| log_rv_redo_record_sync_or_dispatch_async | src/transaction/log_recovery_redo_parallel.hpp | 382 |
| perf_stats | src/transaction/log_recovery_redo_perf.hpp | 105 |
| log_system_tdes::rv_simulate_system_tdes | src/transaction/log_system_tran.cpp | 174 |
| log_system_tdes::map_all_tdes | src/transaction/log_system_tran.cpp | 253 |
| log_system_tdes::rv_delete_all_tdes_if | src/transaction/log_system_tran.cpp | 265 |
| log_system_tdes::rv_delete_tdes | src/transaction/log_system_tran.cpp | 281 |
| logtb_rv_find_allocate_tran_index | src/transaction/log_tran_table.c | 1056 |
| logtb_rv_assign_mvccid_for_undo_recovery | src/transaction/log_tran_table.c | 1115 |
| logtb_free_tran_index | src/transaction/log_tran_table.c | 1202 |
| logtb_free_tran_index_with_undo_lsa | src/transaction/log_tran_table.c | 1281 |
| logtb_set_loose_end_tdes | src/transaction/log_tran_table.c | 4124 |
| logtb_set_num_loose_end_trans | src/transaction/log_tran_table.c | 4170 |
| logtb_rv_read_only_map_undo_tdes | src/transaction/log_tran_table.c | 4204 |
| mvcctable::reset_start_mvccid | src/transaction/mvcc_table.cpp | 600 |
| RV_fun | src/transaction/recovery.c | 54 |
| rv_rcvindex_string | src/transaction/recovery.c | 857 |
| rv_check_rvfuns | src/transaction/recovery.c | 872 |
| LOG_RCVINDEX | src/transaction/recovery.h | 36 |
| log_rcv | src/transaction/recovery.h | 197 |
| rvfun | src/transaction/recovery.h | 221 |
| RCV_IS_BTREE_LOGICAL_LOG | src/transaction/recovery.h | 241 |
| RCV_IS_LOGICAL_COMPENSATE_MANUAL | src/transaction/recovery.h | 253 |
| RCV_IS_LOGICAL_RUN_POSTPONE_MANUAL | src/transaction/recovery.h | 261 |
| RCV_IS_LOGICAL_LOG | src/transaction/recovery.h | 267 |
Sources
- cubrid-recovery-manager.md: the high-level companion. See also cubrid-log-manager-detail.md (how the replayed records were appended) and cubrid-checkpoint.md (the restart anchor).
- Raw analyses under raw/code-analysis/cubrid/storage/recovery_manager/.
- Code: src/transaction/log_recovery.{c,h}, log_recovery_redo.{cpp,hpp}, log_recovery_redo_parallel.{cpp,hpp}, recovery.{c,h}.
- Methodology: knowledge/methodology/code-analysis-detail-doc.md.