CUBRID Recovery Manager — Code-Level Deep Dive

Where this document fits: The high-level analysis cubrid-recovery-manager.md covers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full restart of a crashed database inside the kernel.

Contents:

Ch	Title	Status
1	Data-Structure Map	✅
2	Restart Entry and Log Page Access	✅
3	Analysis Pass Driver	✅
4	Analysis Record Dispatch and Transaction Table Rebuild	✅
5	Sysop and Postpone Bookkeeping During Analysis	✅
6	Redo Pass Driver and Synchronous Apply	✅
7	Parallel Redo Infrastructure	✅
8	Atomic Sysop Abort and Postpone Completion	✅
9	Undo Pass and Compensation	✅
10	The RV_fun Dispatch Table	✅
11	Special Paths	✅

Chapter 1: Data-Structure Map

Theory lives in the companion cubrid-recovery-manager.md (“The recovery dispatch table”, “Redo pass — modern dispatch via templates”); this chapter pins down every field of every recovery-side structure and the pointers between them.

1.1 Overview — who points at whom

flowchart TB
  LD["LOG_DATA\nrcvindex / vpid / offset"]
  RVF["RV_fun[rcvindex]\n(struct rvfun)"]
  CTX["log_rv_redo_context"]
  RCV["LOG_RCV"]
  RECINFO["log_rv_redo_rec_info&lt;T&gt;"]
  FUNC["undofun / redofun"]

  LD -->|"selects"| RVF
  RVF --> FUNC
  RECINFO -->|"typed header copy"| RCV
  CTX -->|"unzip buffer feeds rcv.data"| RCV
  FUNC -->|"called with &rcv"| RCV

Figure 1-1 — the rcvindex selects an RV_fun entry; the redo context unzips the payload into the LOG_RCV the chosen function receives.

1.2 `LOG_RCV` — the universal recovery argument

Every undo, redo, compensate, and run-postpone function has signature int (*)(THREAD_ENTRY *, LOG_RCV *) — LOG_RCV is the narrow waist.

// log_rcv -- src/transaction/recovery.h
struct log_rcv
{                               /* Recovery information */
  MVCCID mvcc_id = MVCCID_NULL; /* mvcc id */
  PAGE_PTR pgptr = nullptr;     /* Page to recover. Page should not be free by recovery functions,
                                 * however it should be set dirty whenever is needed */
  // ... condensed: PGLENGTH offset; int length ...
  const char *data = nullptr;   /* Replacement data. Pointer becomes invalid once the recovery
                                 * of the data is finished */    /* <- borrowed, see invariant below */
  LOG_LSA reference_lsa = NULL_LSA;  /* Next LSA used by compensate/postpone. */

  // ... condensed: default ctor; copy/move ctors and both assignments deleted ...
};

Field	Role	Why it exists
`mvcc_id`	MVCCID for MVCC-class records, else `MVCCID_NULL`	record-header field; set only by the MVCC `log_rv_get_log_rec_mvccid` specializations
`pgptr`	page to recover, fixed by the driver; `nullptr` for logical records	fix/unfix is centralized in the driver
`offset`	offset or slot id within `pgptr`, from `LOG_DATA.offset`	physical recovery is page+offset addressed
`length`	byte length of `data`	raw buffer, no terminator
`data`	redo replacement or undo before-image	points into a `LOG_ZIP` buffer or the log page — lifetime rule below
`reference_lsa`	compensate: the transaction’s `undo_nxlsa` at undo time — the next LSA the undo chain resumes from, handed to `log_sysop_end_logical_compensate`; both `log_rollback_record` (runtime rollback) and `log_rv_undo_record` (restart undo) fill it. Run-postpone: LSA of the original postpone record, filled by `log_execute_run_postpone`	anchor for the manual logical functions (1.7) that append their own compensation / run-postpone records

Invariant (borrowed-data lifetime). rcv.data and rcv.pgptr are loans: data aliases the unzip buffer (m_redo_zip.log_data) or the log page; pgptr is unfixed by the caller on return. Enforcement: all four copy/move operations are deleted, and log_rv_redo_record_sync builds a fresh stack-local LOG_RCV per record, with a scope_exit that unfixes pgptr. A recovery function that stashes rcv->data keeps a pointer the next record’s unzip overwrites, silently corrupting the replay.
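The loan pattern above can be sketched in isolation. This is a minimal illustration, not CUBRID code: rcv_sketch, scope_exit_sketch, and apply_one are hypothetical stand-ins, and an integer counter plays the role of the page fix count.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for LOG_RCV: it only borrows pointers, so copies are forbidden.
struct rcv_sketch
{
  const char *data = nullptr;   // borrowed: aliases a buffer owned elsewhere
  int length = 0;
  int *pgptr = nullptr;         // borrowed: "fixed" by the driver, released on return

  rcv_sketch () = default;
  rcv_sketch (const rcv_sketch &) = delete;
  rcv_sketch (rcv_sketch &&) = delete;
  rcv_sketch &operator= (const rcv_sketch &) = delete;
  rcv_sketch &operator= (rcv_sketch &&) = delete;
};

// Hypothetical scope guard: runs the callable when the enclosing scope ends.
template <typename F>
struct scope_exit_sketch
{
  F m_f;
  explicit scope_exit_sketch (F f) : m_f (std::move (f)) {}
  ~scope_exit_sketch () { m_f (); }
  scope_exit_sketch (const scope_exit_sketch &) = delete;
  scope_exit_sketch &operator= (const scope_exit_sketch &) = delete;
};

// Applies one record: fresh stack-local argument, page released on every exit path.
// fix_count models the page fix counter; the return value is the payload length seen.
inline int apply_one (const std::string &unzip_buffer, int &fix_count)
{
  rcv_sketch rcv;
  ++fix_count;                                  // the driver "fixes" the page
  rcv.pgptr = &fix_count;
  auto unfix_fn = [&rcv] { --*rcv.pgptr; };     // the driver "unfixes" on return
  scope_exit_sketch<decltype (unfix_fn)> unfix (unfix_fn);
  rcv.data = unzip_buffer.data ();              // loan, valid only during this call
  rcv.length = static_cast<int> (unzip_buffer.size ());
  return rcv.length;
}
```

Because every copy and move is deleted, the only way to keep the payload past the call is an explicit byte copy, which is the behavior the invariant asks for.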

1.3 `rvfun` and the `RV_fun[]` dispatch table

rvfun (recovery.h) bundles fun_t = int (*)(THREAD_ENTRY *, LOG_RCV *), dump_fun_t = void (*)(FILE *, int, void *), and six fields; extern struct rvfun RV_fun[] is initialized in recovery.c:

Field	Role	Why it exists
`recv_index`	copy of the entry’s own index (`/* For verification */`)	`rv_check_rvfuns` asserts `RV_fun[i].recv_index == i` at debug startup
`recv_string`	name, e.g. `"RVDK_FORMAT"`	trace/dump output via `rv_rcvindex_string`
`undofun`	applied by undo/rollback — and by redo of compensate records: `log_rv_get_fun<LOG_REC_COMPENSATE>` returns `undofun` (`// yes, undo`)	a CLR’s redo is the original undo
`redofun`	applied by redo pass, run-postpone, HA replication apply	the forward image applier
`dump_undofun` / `dump_redofun`	payload pretty-printers, `NULL` if none	log-dump tooling only

rv_rcvindex_string is branch-free (return RV_fun[rcvindex].recv_string;). rv_check_rvfuns only turns initializer misordering into a debug-build startup failure (er_set plus assert (false)); nothing guards an out-of-range argument such as RV_NOT_DEFINED (999) — callers must pass a defined index.
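A small sketch of the two properties discussed here: the self-index check, and the bounds guard that the real lookup does not perform. All names (rvfun_sketch, table_sketch, name_or_unknown) are hypothetical, and the two-entry table is invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical miniature of an index-keyed dispatch table.
enum rcvindex_sketch { IDX_FORMAT = 0, IDX_ALLOC = 1, IDX_LAST = IDX_ALLOC, IDX_NOT_DEFINED = 999 };

struct rvfun_sketch
{
  int recv_index;            // copy of the entry's own position, for verification
  const char *recv_string;   // printable name
};

static const rvfun_sketch table_sketch[] = {
  { IDX_FORMAT, "IDX_FORMAT" },
  { IDX_ALLOC, "IDX_ALLOC" },
};

// Same idea as the startup check: every entry must sit at its own index.
inline bool table_is_ordered (const rvfun_sketch *table, std::size_t count)
{
  for (std::size_t i = 0; i < count; i++)
    {
      if (table[i].recv_index != static_cast<int> (i))
        {
          return false;
        }
    }
  return true;
}

// A guarded lookup (an assumption of this sketch, absent from the real helper):
// out-of-range input yields a fallback name instead of reading past the table.
inline const char *name_or_unknown (int idx)
{
  if (idx < 0 || idx > IDX_LAST)
    {
      return "UNKNOWN";
    }
  return table_sketch[idx].recv_string;
}
```

The ordering check catches a shuffled initializer once at startup; it says nothing about what a caller passes later, which is why the sentinel 999 needs a guard of this kind wherever it can reach a lookup.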

1.4 `LOG_RCVINDEX` — the index space, by family

Invariant (append-only numbering). Indices persist inside log records, so renumbering replays the wrong function on old databases. The enum header warns: “NEW ENTRIES SHOULD BE ADDED AT THE BOTTON OF THE FILE … to AVOID OLD DATABASES TO BE RECOVERED UNDER OLD FILE” — hence RVPGBUF_SET_TDE_ALGORITHM (127) far from its siblings (120–123). RV_LAST_LOGID = RVHF_LOB_REMOVE_DIR (129) marks the top; RV_NOT_DEFINED = 999 is the “no rcvindex” sentinel.

Family	Range	Subsystem
`RVDK_*`	0–9	disk manager
`RVFL_*`	10–32, 128	file manager
`RVHF_*`	33–53, 126, 129	heap
`RVOVF_*`	54–57	overflow records
`RVEH_*`	58–65	extendible hash
`RVBT_*`	66–91, 124–125	b-tree, incl. logical-key set (1.7)
`RVCT_*`	92–96	catalog pages
`RVLOG_*`	97	logical-redo noop marker
`RVREPL_*`	98–103	replication, HA appliers only
`RVVAC_*`	104–117	vacuum
`RVES_*`	118	external storage (LOB)
`RVLOC_*`	119	locator classname dummy
`RVPGBUF_*`	120–123, 127	page buffer

1.5 Modern redo-side types

The redo pass (Ch 6) and each parallel-redo applier (Ch 7) own one log_rv_redo_context (log_recovery_redo.hpp):

Field	Role	Why it exists
`m_reader`	`log_reader` cursor, built with `LOG_CS_SAFE_READER`	independent log position per applier
`m_redo_zip`	unzip target for redo payloads; its `log_data` becomes `rcv.data`	output must outlive the recovery-function call
`m_undo_zip`	unzip target for the undo half of diff undoredo records	`LOG_DIFF_UNDOREDO_DATA` stores redo as an XOR diff against undo
`m_end_redo_lsa`	`const` upper bound; records at or past it are not redone	freezes the redo horizon before the pass
`m_reader_fetch_page_mode`	`const` fetch mode for `set_lsa_and_fetch_page`; `NORMAL` refetches only when the pageid changes (`do_fetch_page = FORCE \|\| m_lsa.pageid != lsa.pageid`)	the only constructor call (redo pass, `log_recovery.c`) passes `NORMAL`; `FORCE` is retained unused for future reuse (`log_reader.hpp` comment)

Default constructor deleted; the two-argument constructor pre-grows both buffers to LOGAREA_SIZE; move and both assignments deleted. The copy constructor — the only allowed copy — delegates back with (o.m_end_redo_lsa, o.m_reader_fetch_page_mode): only the two const knobs survive, so each parallel-redo worker gets fresh buffers and reader.
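The copy-by-delegation rule can be sketched as follows. redo_context_sketch is a hypothetical miniature: a std::vector stands in for the zip buffers, and a plain integer stands in for the end-redo LSA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature of a context whose copies keep only the const settings.
class redo_context_sketch
{
  public:
    redo_context_sketch () = delete;
    redo_context_sketch (long end_lsa, bool force_fetch, std::size_t presize = 64)
      : m_end_lsa (end_lsa), m_force_fetch (force_fetch), m_buffer ()
    {
      m_buffer.reserve (presize);     // pre-grown scratch space, like the zip buffers
    }
    // The copy delegates to the main constructor: scratch contents are not carried over.
    redo_context_sketch (const redo_context_sketch &o)
      : redo_context_sketch (o.m_end_lsa, o.m_force_fetch)
    {
    }
    redo_context_sketch (redo_context_sketch &&) = delete;
    redo_context_sketch &operator= (const redo_context_sketch &) = delete;
    redo_context_sketch &operator= (redo_context_sketch &&) = delete;

    void scribble (char c) { m_buffer.push_back (c); }
    std::size_t used () const { return m_buffer.size (); }
    long end_lsa () const { return m_end_lsa; }
    bool force_fetch () const { return m_force_fetch; }

  private:
    const long m_end_lsa;             // frozen horizon, survives the copy
    const bool m_force_fetch;         // frozen fetch mode, survives the copy
    std::vector<char> m_buffer;       // per-instance scratch, never shared
};
```

A worker that copies the context therefore starts with empty scratch space and the same frozen horizon, without any state leaking from the original.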

Each applied record is a log_rv_redo_rec_info<T>: every special member is deleted except the (log_lsa, LOG_RECTYPE, const T &) constructor — built once, fully initialized, never reseated.

Field	Role	Why it exists
`m_start_lsa`	LSA of the record header	stamped onto the page after apply (`pgbuf_set_lsa`); key of the check below
`m_type`	the `LOG_RECTYPE`	drives the `LOG_DIFF_UNDOREDO_DATA` XOR-diff branch in `log_rv_get_log_rec_redo_data`
`m_logrec`	by-value copy of the typed body `T` — one of `LOG_REC_{UNDOREDO, MVCC_UNDOREDO, REDO, MVCC_REDO, RUN_POSTPONE, COMPENSATE}`	frees the reader to advance; `log_rv_get_log_rec_*` specializations extract vpid/mvccid/length/offset

Invariant (per-page LSA ordering, debug builds). Redo for one page must apply in log order even across threads; vpid_lsa_consistency_check (compiled out under NDEBUG) checks a necessary condition of it:

// vpid_lsa_consistency_check::check -- src/transaction/log_recovery_redo.cpp
  std::lock_guard<std::mutex> lck (mtx);
  const vpid_key_t key {a_vpid.volid, a_vpid.pageid};
  const auto map_it =  consistency_check_map.find (key);
  if (map_it != consistency_check_map.cend ())
    {
      assert ((*map_it).second < a_log_lsa);  /* <- later applies must beat the stored LSA */
    }
  consistency_check_map.emplace (key, a_log_lsa);  /* <- emplace never overwrites an existing key */

Field	Role	Why it exists
`mtx`	serializes `check` and `cleanup`	the map (global `log_Gl_recovery_redo_consistency_check`) is hit by every applier
`consistency_check_map`	per-page baseline — `vpid_log_lsa_map_t` maps `vpid_key_t` = `(volid, pageid)` to the first LSA applied to the page; `emplace` never overwrites, so the baseline never advances	the `assert` demands every later apply carry an LSA above the baseline — weaker than pairwise monotonicity (a swap between two later applies passes), but an image older than the first apply still trips it

cleanup() clears the map after the pass; log_rv_redo_record_sync consults it only while log_Gl.rcv_phase != LOG_RESTARTED.
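The gap between the baseline check and full pairwise monotonicity shows up in a small model. baseline_check_sketch is hypothetical; LSAs are plain integers here, and check returns false where the real code would assert.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical model: key is (volid, pageid), value is the first LSA seen for that page.
using page_key = std::pair<int, int>;

struct baseline_check_sketch
{
  std::map<page_key, long> first_lsa;

  // Returns false where the real check would assert; records the baseline on first sight.
  bool check (const page_key &key, long lsa)
  {
    const auto it = first_lsa.find (key);
    const bool ok = (it == first_lsa.end ()) || (it->second < lsa);
    first_lsa.emplace (key, lsa);   // no effect when the key already exists
    return ok;
  }
};
```

Feeding one page the LSAs 10, 30, 20 passes every call, because 20 is still above the baseline 10; only an LSA at or below 10 is caught.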

1.6 Analysis-side state

LOG_RCV_TDES (field rcv of log_tdes, log_impl.h) carries analysis-pass discoveries into later passes (Ch 4, 5, 8). Five LSAs:

Field	Role	Why it exists
`sysop_start_postpone_lsa`	LSA of the `LOG_SYSOP_START_POSTPONE` in progress at crash	resume anchor for the sysop postpone phase (Ch 8)
`tran_start_postpone_lsa`	where transaction-level postpone began	splits “committed, postpones pending” from plain active
`atomic_sysop_start_lsa`	start of an interrupted atomic file op (`file_perm_alloc` / `file_perm_dealloc`)	must complete or roll back fully before postpones run (Ch 8)
`analysis_last_aborted_sysop_lsa`	end LSA of the last sysop aborted during analysis (“to recover logical redo operation”)	logical redo must not re-enter the aborted range
`analysis_last_aborted_sysop_start_lsa`	matching start LSA of that sysop	the other end of the bracket

LOG_RECVPHASE (log_impl.h), the global mode switch log_Gl.rcv_phase, is consulted far outside recovery (page-buffer fix rules, the check above): LOG_RESTARTED (recovery done), LOG_RECOVERY_ANALYSIS_PHASE (Ch 3–5), LOG_RECOVERY_REDO_PHASE (Ch 6–7), LOG_RECOVERY_UNDO_PHASE (Ch 9), LOG_RECOVERY_FINISH_2PC_PHASE (Ch 11).

Checkpoint snapshot records (log_record.hpp): a fixed LOG_REC_CHKPT header, then ntrans trans entries, then ntops sysop entries.

`log_rec_chkpt` field	Role	Why it exists
`redo_lsa`	“Oldest LSA of dirty data page in page buffers” (source comment)	redo-pass lower bound; the fuzzy-checkpoint contract
`ntrans`	count of trans entries following	variable-sized record
`ntops`	count of sysop entries after the trans array	same

LOG_INFO_CHKPT_TRANS snapshots the same-named live log_tdes fields; analysis re-creates a TDES per entry, then corrects it from the log tail (Ch 4):

Field	Role	Why it exists
`isloose_end`	loose-end flag at checkpoint	marks 2PC/client loose ends
`trid`	transaction identifier	key for re-creating the TDES
`state`	`TRAN_STATE` at checkpoint	seeds loose-end classification
`head_lsa`	first log record of the transaction	bounds the backward chain
`tail_lsa`	last record at checkpoint	analysis scan resumes here
`undo_nxlsa`	next record to undo, given CLRs logged during undo	rollback skips already-compensated work
`posp_nxlsa`	first postpone record	where postpone execution starts
`savept_lsa`	last savepoint	savepoint chain head, partial rollback
`tail_topresult_lsa`	last partial abort/commit	nested-sysop resolution
`start_postpone_lsa`	start-postpone address, if mid-postpone	such a transaction must finish postpones, not be undone
`user_name`	client name (`char[LOG_USERNAME_MAX]`)	restored into the TDES

LOG_INFO_CHKPT_SYSOP snapshots the two persistent sysop anchors of LOG_RCV_TDES. The other three LOG_RCV_TDES LSAs never travel in it: the analysis_last_aborted_* pair are products of the current analysis run, never persisted, and tran_start_postpone_lsa rides in the per-transaction entry instead, as LOG_INFO_CHKPT_TRANS.start_postpone_lsa:

Field	Role	Why it exists
`trid`	which transaction’s TDES the two LSAs are restored into	keyed by transaction, not parallel to the trans array
`sysop_start_postpone_lsa`	saved `rcv.sysop_start_postpone_lsa`	the sysop state can predate the checkpoint
`atomic_sysop_start_lsa`	saved `rcv.atomic_sysop_start_lsa`	same, for interrupted atomic file ops

1.7 `LOG_ZIP` and the logical-classifier macros

LOG_ZIP (log_compress.h), the compression workspace of the write path and (1.5) the redo context, owns log_data (freed by log_zip_free_data); all four copy/move operations are deleted — a member-wise copy would double-free:

Field	Role	Why it exists
`data_length`	valid bytes currently in `log_data`	after `log_unzip`, the length handed to `rcv.length`
`buf_size`	allocated capacity	`log_zip_realloc_if_needed` grows it; sticky across records
`log_data`	the owned buffer (“used as data buffer”)	the storage `rcv.data` borrows — the 1.2 lifetime rule

A stored length marks compression in its top bit: MAKE_ZIP_LEN(l) sets 0x80000000, ZIP_CHECK(l) tests, GET_ZIP_LEN(l) strips.
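The top-bit convention can be sketched with constexpr functions. These are hypothetical equivalents written for illustration (unsigned 32-bit lengths are an assumption of the sketch); the real macros live in log_compress.h.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constexpr equivalents of the three length macros.
constexpr std::uint32_t ZIP_FLAG_SKETCH = 0x80000000u;

// Marks a length as "payload is compressed" by setting the top bit.
constexpr std::uint32_t make_zip_len (std::uint32_t len) { return len | ZIP_FLAG_SKETCH; }

// Tests the top bit.
constexpr bool zip_check (std::uint32_t len) { return (len & ZIP_FLAG_SKETCH) != 0; }

// Strips the top bit, recovering the stored byte count.
constexpr std::uint32_t get_zip_len (std::uint32_t len) { return len & ~ZIP_FLAG_SKETCH; }
```

A reader that forgets the strip step would treat a 300-byte compressed payload as roughly 2 GB, so the test and the strip always travel together.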

The classifier macros. Four pure disjunctions over LOG_RCVINDEX; the branches are the listed indices.

// RCV_IS_BTREE_LOGICAL_LOG -- src/transaction/recovery.h
#define RCV_IS_BTREE_LOGICAL_LOG(idx) \
  ((idx) == RVBT_DELETE_OBJECT_PHYSICAL \
   || (idx) == RVBT_MVCC_DELETE_OBJECT \
   || (idx) == RVBT_MVCC_INSERT_OBJECT \
   || (idx) == RVBT_NON_MVCC_INSERT_OBJECT \
   || (idx) == RVBT_MARK_DELETED \
   || (idx) == RVBT_DELETE_OBJECT_POSTPONE \
   || (idx) == RVBT_MVCC_INSERT_OBJECT_UNQ \
   || (idx) == RVBT_MVCC_NOTIFY_VACUUM \
   || (idx) == RVBT_ONLINE_INDEX_UNDO_TRAN_DELETE \
   || (idx) == RVBT_ONLINE_INDEX_UNDO_TRAN_INSERT)

These ten ops are logged by key value, not page image — undo re-descends the tree, never running against one fixed page.

RCV_IS_LOGICAL_COMPENSATE_MANUAL is the btree set plus exactly six: RVFL_ALLOC, RVFL_USER_PAGE_MARK_DELETE, RVPGBUF_DEALLOC, RVFL_TRACKER_HEAP_REUSE, RVHF_LOB_REMOVE_DIR, RVFL_TRACKER_UNREGISTER; their undofun appends its own compensation via rcv.reference_lsa, so the rollback driver must not auto-append a LOG_COMPENSATE — a re-crash would double-undo. RCV_IS_LOGICAL_RUN_POSTPONE_MANUAL matches exactly four: RVFL_DEALLOC, RVHF_MARK_DELETED, RVHF_LOB_REMOVE_DIR, RVBT_DELETE_OBJECT_POSTPONE; as postpone actions their redofun closes with LOG_SYSOP_END_LOGICAL_RUN_POSTPONE, not a standard LOG_RUN_POSTPONE (Ch 8). RVHF_LOB_REMOVE_DIR and RVBT_DELETE_OBJECT_POSTPONE sit in both sets.

RCV_IS_LOGICAL_LOG (vpid, idx) is the master test and the only one that inspects the address: ((vpid)->volid == NULL_VOLID) || ((vpid)->pageid == NULL_PAGEID) short-circuits to logical regardless of index; then RCV_IS_BTREE_LOGICAL_LOG (idx); then eleven indices: RVBT_MVCC_INCREMENTS_UPD, RVPGBUF_FLUSH_PAGE, RVFL_DESTROY, RVFL_ALLOC, RVFL_DEALLOC, RVVAC_NOTIFY_DROPPED_FILE, RVPGBUF_DEALLOC, RVES_NOTIFY_VACUUM, RVHF_MARK_DELETED, RVFL_TRACKER_HEAP_REUSE, RVFL_TRACKER_UNREGISTER. A new logical index missing here makes recovery try to fix a nonexistent page — a fix error during rollback, far from the bug.

flowchart TD
  A["record vpid + rcvindex"] --> B{"volid or pageid NULL?"}
  B -- yes --> L["logical: undofun gets pgptr = nullptr"]
  B -- no --> C{"RCV_IS_BTREE_LOGICAL_LOG?"}
  C -- yes --> L
  C -- no --> D{"one of the 11 listed indices?"}
  D -- yes --> L
  D -- no --> P["physical: driver fixes page, passes pgptr"]

Figure 1-2 — RCV_IS_LOGICAL_LOG as evaluated by undo/rollback drivers.
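The evaluation order in Figure 1-2 can be sketched with a three-entry index subset. Every name and constant below is hypothetical (including the -1 null values assumed for this sketch); only the order of the three tests mirrors the text.

```cpp
#include <cassert>

// Hypothetical null markers and a three-entry index subset.
constexpr int NULL_VOLID_SKETCH = -1;
constexpr int NULL_PAGEID_SKETCH = -1;
enum idx_sketch { IDX_PHYSICAL_UPDATE, IDX_BTREE_KEY_INSERT, IDX_FILE_DEALLOC };

struct vpid_sketch
{
  int pageid;
  int volid;
};

// Stand-in for the btree logical set.
inline bool is_btree_logical (idx_sketch idx) { return idx == IDX_BTREE_KEY_INSERT; }

// Stand-in for the explicitly listed non-btree logical indices.
inline bool is_listed_logical (idx_sketch idx) { return idx == IDX_FILE_DEALLOC; }

// Master test: the address is inspected first, then the two index sets.
inline bool is_logical (const vpid_sketch &vpid, idx_sketch idx)
{
  if (vpid.volid == NULL_VOLID_SKETCH || vpid.pageid == NULL_PAGEID_SKETCH)
    {
      return true;    // no page address: logical regardless of index
    }
  return is_btree_logical (idx) || is_listed_logical (idx);
}
```

An index that is logical in intent but absent from both sets falls through to the physical arm whenever its record carries a real page address, which is the failure mode described above.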

1.8 Chapter summary — key takeaways

LOG_RCV is the one calling convention; data/pgptr are borrowed, so all four copy/move operations are deleted.
RV_fun[] is indexed by the append-only LOG_RCVINDEX; debug-startup rv_check_rvfuns catches only misordering — nothing bounds-checks lookups.
Compensate records redo through undofun (log_rv_get_fun<LOG_REC_COMPENSATE>): a CLR’s redo re-does the undo.
log_rv_redo_context (reader + two zip buffers + frozen m_end_redo_lsa; copies rebuild fresh buffers; only NORMAL fetch mode used) feeds immutable log_rv_redo_rec_info<T> snapshots; debug-only vpid_lsa_consistency_check asserts every later apply per page stays above the first-applied LSA — a necessary condition of log order, not full pairwise monotonicity.
Analysis state = LOG_RCV_TDES (five LSAs), seeded from LOG_REC_CHKPT + LOG_INFO_CHKPT_TRANS + LOG_INFO_CHKPT_SYSOP (only the two sysop anchors persist; tran-level postpone travels in the trans entry), gated by LOG_RECVPHASE.
The RCV_IS_* macros split physical vs logical, automatic vs manual; a new logical index missing from RCV_IS_LOGICAL_LOG breaks rollback long after the feature ships.

Chapter 2: Restart Entry and Log Page Access

Who drives recovery at server start, how the checkpoint anchor is found and downgraded for a media crash, and how the passes (Ch 3, 6, 9) physically read log pages. Theory: the companion cubrid-recovery-manager.md.

2.1 `log_recovery` — the restart orchestrator, branch by branch

log_recovery (in log_recovery.c) has one caller, log_initialize_internal, gated on init_emergency == false && (log_Gl.hdr.is_shutdown == false || ismedia_crash == true) (restoredb passes ismedia_crash; emergency startup skips recovery); the caller holds the log CS in write mode (assert (LOG_CS_OWN_WRITE_MODE)).

// log_recovery -- src/transaction/log_recovery.c
  /* ... condensed: branch 1 -- NULL LOG_FIND_TDES is er_set + logpb_fatal_error, return ... */
  rcv_tdes->state = TRAN_RECOVERY;   /* <- the recovery "transaction" */
  if (LOG_HAS_LOGGING_BEEN_IGNORED ())
    {                                /* <- branch 2: fatal, then clear the flag */
      /* ... condensed ... */
    }
  /* ... condensed ... */
  LSA_COPY (&rcv_lsa, &log_Gl.hdr.chkpt_lsa);
  if (ismedia_crash != false)
    {                                /* <- branch 3a: downgrade anchor */
      (void) fileio_map_mounted (thread_p,
          (bool (*)(THREAD_ENTRY *, VOLID, void *)) log_rv_find_checkpoint, &rcv_lsa);
    }
  /* ... condensed: else, branch 3b -- if (stopat != NULL) *stopat = -1 ... */
  vacuum_notify_server_crashed (&rcv_lsa);

Branches 1 and 2 are fatal via logpb_fatal_error. Branch 2 fires when LOG_HAS_LOGGING_BEEN_IGNORED() (log_impl.h) sees log_Gl.hdr.has_logging_been_skipped: a crash while logging was skipped is unrepairable (ER_LOG_CORRUPTED_DB_DUE_CRASH_NOLOGGING). Branch 3a: restored volumes may be older than the header checkpoint, so log_rv_find_checkpoint is mapped over every volume. For each volume it copies the disk_get_checkpoint LSA into rcv_lsa when LSA_ISNULL (rcv_lsa) || LSA_LT (&chkpt_lsa, rcv_lsa), and it returns true so that all volumes are visited; the oldest checkpoint wins. vacuum_notify_server_crashed copies rcv_lsa into vacuum_Data.recovery_lsa, which vacuum’s backward scan uses when analysis finds no MVCC op record.
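The branch 3a fold reduces to a running minimum over per-volume checkpoints. The sketch below uses hypothetical names (lsa_sketch, visit_volume, resolve_anchor) and models a null LSA as pageid -1, which is an assumption of the sketch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical LSA: ordered by pageid, then offset; pageid -1 models the null LSA.
struct lsa_sketch
{
  long pageid;
  int offset;
};

inline bool lsa_is_null (const lsa_sketch &l) { return l.pageid == -1; }

inline bool lsa_lt (const lsa_sketch &a, const lsa_sketch &b)
{
  return a.pageid != b.pageid ? a.pageid < b.pageid : a.offset < b.offset;
}

// Per-volume visitor: keep the volume checkpoint when it is older than the running anchor.
// Returning true tells the mapper to keep visiting the remaining volumes.
inline bool visit_volume (const lsa_sketch &vol_chkpt, lsa_sketch &anchor)
{
  if (lsa_is_null (anchor) || lsa_lt (vol_chkpt, anchor))
    {
      anchor = vol_chkpt;
    }
  return true;
}

// Starts from the header checkpoint and lets every volume pull the anchor back.
inline lsa_sketch resolve_anchor (lsa_sketch header_chkpt, const std::vector<lsa_sketch> &volumes)
{
  lsa_sketch anchor = header_chkpt;
  for (const lsa_sketch &v : volumes)
    {
      (void) visit_volume (v, anchor);
    }
  return anchor;
}
```

With a header checkpoint at (50, 0) and volumes at (40, 16), (40, 8), and (60, 0), the anchor settles at (40, 8): a volume newer than the header never moves it forward.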

Invariant — the analysis start LSA is no newer than the checkpoint recorded in any permanent volume. A volume header stores the checkpoint LSA at its last flush; replay must start at or before it, else redo skips updates the restored volume never received. Figure 2-1 maps the rest.

flowchart TD
  A["ANALYSIS  Ch 3"] --> B["logpb_fetch_start_append_page<br/>error: fatal"]
  B --> C{did_incom_recovery}
  C -->|false| D["LOG_RESET_PREV_LSA from EOF back_lsa"] --> G
  C -->|true| G["append LOG_DUMMY_CRASH_RECOVERY<br/>rcv_phase_lsa = tail_lsa"]
  G --> H["REDO Ch 6, then UNDO Ch 9<br/>log_system_tdes::rv_final"] --> K{did_incom_recovery}
  K -->|true| L["log_recovery_notpartof_volumes"] --> N
  K -->|false| N["TRAN_ACTIVE, logtb_set_num_loose_end_trans"]
  N --> O{2pc loose ends}
  O -->|yes| P["FINISH_2PC: log_2pc_recovery  Ch 11"] --> R
  O -->|no| R["logpb_decache_archive_info<br/>CS exit, logpb_checkpoint, CS enter"]
  R --> S["flush all + header, then locator_initialize,<br/>heap_classrepr_restart_cache -- each error: fatal"]

Figure 2-1: log_recovery after the anchor is fixed.

Section 2.5 covers the append-point re-arm; log_append_empty_record writes LOG_DUMMY_CRASH_RECOVERY, whose LSA becomes log_Gl.rcv_phase_lsa, the crash boundary undo keys on (Ch 9). A stopat cut sets did_incom_recovery (Ch 3); log_recovery_notpartof_volumes then drops volumes created after the restore point. The close exits the log CS around logpb_checkpoint, flushes dirty pages and the header, and re-caches the catalog tracker and class representations — two further fatal branches; the caller then sets LOG_RESTARTED.

2.2 The `rcv_phase` transitions

log_Gl.rcv_phase (enum log_recvphase, log_impl.h) is the global mode; LOG_ISRESTARTED() tests for LOG_RESTARTED, which the caller sets after the phases above. logpb_copy_page fills its recovery cache only if (!LOG_ISRESTARTED ()), and the physical readers run debug checksum checks only when LOG_RESTARTED — torn tails during recovery are repaired logically (Section 2.7).

2.3 `logpb_fetch_page` — the single physical-read entry

logpb_fetch_page (in log_page_buffer.c) takes an enum log_cs_access_mode (log_impl.h). The classic analysis and undo scans call it with LOG_CS_FORCE_USE (they run under the log CS held by log_recovery); the redo machinery’s log_reader forwards LOG_CS_SAFE_READER so its positioned fetches skip the CS (Section 2.6).

// logpb_fetch_page -- src/transaction/log_page_buffer.c
  if (LSA_LE (&append_lsa, req_lsa)        /* <- case 1: page beyond flushed area */
      || LSA_LE (&append_prev_lsa, req_lsa)) /* <- case 2: page may hold a temp EOL */
    {
      LOG_CS_ENTER (thread_p);
      /* ... condensed ... */
      if (LSA_LE (&log_Gl.hdr.append_lsa, req_lsa))  /* retry with mutex */
        {
          logpb_prior_lsa_append_all_list (thread_p); /* <- drain prior list to buffers */
        }
      LOG_CS_EXIT (thread_p);
    }
  rv = logpb_copy_page (thread_p, req_lsa->pageid, access_mode, log_pgptr);
  /* ... condensed: rv != NO_ERROR is the only error exit ... */

The front gate folds the in-memory prior-LSA list into the page buffer so a reader near the append point never sees a stale tail. logpb_copy_page then has four arms: a LOGPB_HEADER_PAGE_ID request is served from the cached header_buffer (file read when not cached); an out-of-range buffer index raises ER_LOG_PAGE_CORRUPTED; a buffer hit memcpys and re-checks log_bufptr->pageid — the safe-reader mode skips the read CS, so this re-check is its lock-free validation; everything else falls to logpb_read_page_from_file, caching the page forward-only while !LOG_ISRESTARTED ().

2.4 Active versus archive: `logpb_read_page_from_file`

A pageid is archived iff LOGPB_IS_ARCHIVE_PAGE (pageid) — not the header page and below LOGPB_NEXT_ARCHIVE_PAGE_ID (log_Gl.hdr.nxarv_pageid); logpb_is_page_in_archive wraps it. LOG_CS_SAFE_READER takes the read CS itself (sets log_csect_entered); other modes assert (LOG_CS_OWN). The CS protects the archive set — an archive created mid-read once left logpb_to_physical_pageid stale (the in-code comment records the bug).

// logpb_read_page_from_file -- src/transaction/log_page_buffer.c
  bool fetch_from_archive = logpb_is_page_in_archive (pageid);
  if (fetch_from_archive)
    {
      bool is_archive_page_in_active_log = (pageid + LOGPB_ACTIVE_NPAGES) > log_Gl.hdr.append_lsa.pageid;
      bool dont_fetch_archive_from_active = !LOG_ISRESTARTED () || log_Gl.hdr.was_active_log_reset;
      if (is_archive_page_in_active_log && !dont_fetch_archive_from_active)
        {
          fetch_from_archive = false;   /* <- slot not yet lapped in circular active file */
        }
    }

The shortcut: the active file is circular with LOGPB_ACTIVE_NPAGES (= log_Gl.hdr.npages) slots, so an archived page stays readable from active until its slot is re-appended — disabled during recovery and after an active-log reset, when the active tail is exactly what the crash made suspect.

The remaining arms: an archive fetch (logpb_fetch_from_archive) returning NULL is goto error. An active fetch maps the slot via logpb_to_physical_pageid, then calls fileio_read (on failure ER_LOG_READ, goto error). The self-id check that follows has three outcomes: hdr.logical_pageid == pageid is good, followed by tde_decrypt_log_page if the page is encrypted (archives decrypt inside logpb_fetch_from_archive); hdr.logical_pageid == pageid + LOGPB_ACTIVE_NPAGES means the slot was lapped since the check, so the read retries from the archive; anything else is ER_LOG_PAGE_CORRUPTED. Both exits release the CS iff log_csect_entered; the debug checksum check runs only when LOG_RESTARTED.

Invariant — every log page self-identifies. An active-file read is valid only if hdr.logical_pageid matches; the one benign mismatch is one lap, pageid + LOGPB_ACTIVE_NPAGES — without the check a lapped slot would replay as the old page.
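The three outcomes of the self-id check can be written as a pure function. read_verdict and classify_active_read are hypothetical names; the sketch covers only the comparison, not the file I/O or decryption around it.

```cpp
#include <cassert>

// Hypothetical verdicts for a page just read from the circular active file.
enum class read_verdict { good, lapped_retry_archive, corrupted };

// Compares the page id stored in the page header with the id that was requested.
inline read_verdict classify_active_read (long requested_pageid, long header_logical_pageid,
                                          long active_npages)
{
  if (header_logical_pageid == requested_pageid)
    {
      return read_verdict::good;                    // the slot still holds the requested page
    }
  if (header_logical_pageid == requested_pageid + active_npages)
    {
      return read_verdict::lapped_retry_archive;    // exactly one lap ahead: reread from archive
    }
  return read_verdict::corrupted;                   // any other id is treated as corruption
}
```

Only the exact one-lap alias is tolerated; a slot that is two laps ahead is reported as corrupted rather than silently retried.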

2.5 `logpb_fetch_start_append_page` — re-arming the append point

Between analysis and redo the log must become writable again. Four branches: an empty log (append_lsa offset 0, pageid 0 — debug builds: PRM_ID_FIRST_LOG_PAGEID) makes logpb_locate_page get NEW_PAGE instead of OLD_PAGE; a leftover log_Gl.append.log_pgptr is discarded (logpb_invalid_all_append_pages); NULL from logpb_locate_page is the only error exit (ER_FAILED, fatal in log_recovery); on success set_nxio_lsa (log_Gl.hdr.append_lsa) is recorded and the page joins flush_info->toflush, flushed (logpb_flush_pages_direct) when the array is full.

2.6 `log_reader` — the C++ fetch wrapper for the redo machinery

The modern redo path (Ch 6, Ch 7) uses log_reader (log_reader.hpp, final class, header-only; the sibling log_reader.cpp is stale — no CMakeLists builds it).

Field	Role	Why it exists
`m_thread_entry`	Lazily cached `THREAD_ENTRY *`	Single-thread contract, asserted each use
`m_lsa`	Read position; starts `NULL_LSA`	Drives fetch pageid, intra-page offset, memoization
`m_cs_access`	Mode passed to `logpb_fetch_page`; default `LOG_CS_FORCE_USE`	CS-owning passes vs CS-free readers (LETS-port leftover)
`m_page`	`log_page *` aligned into `m_area_buffer` by the constructor	Private fetch destination — no shared-buffer locking
`m_area_buffer`	`char [IO_MAX_PAGE_SIZE + DOUBLE_ALIGNMENT]`	Inline no-heap storage; copied workers get their own

set_lsa_and_fetch_page computes do_fetch_page { fetch_page_mode == fetch_mode::FORCE || m_lsa.pageid != lsa.pageid }, assigns m_lsa = lsa, and fetches (logpb_fetch_page (.., m_cs_access, m_page), fatal on failure) only when true: NORMAL memoizes the current page, FORCE always refetches. align, add_align, advance_when_does_not_fit and copy_from_log delegate to the classic LOG_READ_ALIGN family and logpb_copy_from_log (bottom of the same header), refetching on page crossings — but only fetch_page (under set_lsa_and_fetch_page and skip) forwards m_cs_access; the delegating members use the family’s default LOG_CS_FORCE_USE, so even a safe reader briefly takes the read CS at mid-record page crossings.
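The memoization rule is easy to see with a reader that only counts fetches. reader_sketch is hypothetical; the counter stands in for the physical page fetch, and the null start position is modeled as (-1, -1).

```cpp
#include <cassert>

// Hypothetical position and fetch mode, mirroring the names used in the text.
struct lsa_pos
{
  long pageid;
  int offset;
};

enum class fetch_mode_sketch { NORMAL, FORCE };

class reader_sketch
{
  public:
    // Moves the cursor; fetches only when forced or when the page id changes.
    void set_lsa_and_fetch_page (const lsa_pos &lsa,
                                 fetch_mode_sketch mode = fetch_mode_sketch::NORMAL)
    {
      const bool do_fetch_page = (mode == fetch_mode_sketch::FORCE) || (m_lsa.pageid != lsa.pageid);
      m_lsa = lsa;
      if (do_fetch_page)
        {
          ++m_fetch_count;    // stands in for the physical page fetch
        }
    }

    int fetch_count () const { return m_fetch_count; }
    lsa_pos position () const { return m_lsa; }

  private:
    lsa_pos m_lsa { -1, -1 };   // starts at a null position, so the first call always fetches
    int m_fetch_count = 0;
};
```

Repositioning within the same page costs nothing under NORMAL, which is what makes record-by-record stepping cheap; FORCE pays a fetch on every call.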

The owning aggregate log_rv_redo_context (log_recovery_redo.hpp):

Field	Role	Why it exists
`m_reader`	`log_reader { LOG_CS_SAFE_READER }`	Private reader per context; CS-free positioned fetches
`m_redo_zip`, `m_undo_zip`	`LOG_ZIP` scratch buffers	Decompression targets reused across records (Section 2.8)
`m_end_redo_lsa`	`const LOG_LSA` redo stop bound	Workers compare record LSAs without touching globals
`m_reader_fetch_page_mode`	`const log_reader::fetch_mode`	`NORMAL` memoizes pages; `FORCE` kept for reuse

The synchronous redo driver constructs it with fetch_mode::NORMAL; the copy constructor re-runs the main one so each parallel worker (Ch 7) gets fresh buffers; the constructor pre-sizes both zips to LOGAREA_SIZE, the destructor frees them (log_zip_free_data).

2.7 The `NULL_OFFSET` convention for incompletely archived records

NULL_OFFSET is (-1) (storage_common.h). When the archiver copies an active page whose last record continues onto the next page, an LSA into the continuation may carry offset == NULL_OFFSET: the record’s completion postdates archiving. Every forward scan — analysis, redo, walkers like log_startof_nxrec — must repair it before dereferencing:

// log_recovery_analysis (record loop) -- src/transaction/log_recovery.c
  if (lsa.offset == NULL_OFFSET)
    {
      lsa.offset = log_page_p->hdr.offset;   /* <- page's first record offset */
      if (lsa.offset == NULL_OFFSET)
        {
          /* Continue with next pageid */
          if (logpb_is_page_in_archive (log_lsa.pageid))
            {
              lsa.pageid = log_lsa.pageid + 1; /* <- archive: keep walking */
            }
          else
            {
              lsa.pageid = NULL_PAGEID;        /* <- active: stop scan */
            }
          continue;
        }
    }

A page whose own hdr.offset is NULL_OFFSET holds no record start (pure continuation) — in an archive try the next page; in the active log the scan ran off the end. Analysis scratch pages are initialized to hdr.offset = NULL_OFFSET.
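The repair step can be isolated into one function for illustration. repair_null_offset and its parameters are hypothetical; the sketch assumes -1 for both null markers and leaves the page fetch to the caller.

```cpp
#include <cassert>

// Hypothetical null markers, both modeled as -1 in this sketch.
constexpr int NULL_OFFSET_SK = -1;
constexpr long NULL_PAGEID_SK = -1;

struct scan_lsa
{
  long pageid;
  int offset;
};

// Returns true when lsa points at a record start on the fetched page.
// Returns false when the caller should loop again: lsa then names the next page to try,
// or carries a null pageid, which ends the scan.
inline bool repair_null_offset (scan_lsa &lsa, long fetched_pageid, int page_first_record_offset,
                                bool page_is_archived)
{
  if (lsa.offset != NULL_OFFSET_SK)
    {
      return true;                            // nothing to repair
    }
  lsa.offset = page_first_record_offset;      // take the first record start on the page
  if (lsa.offset != NULL_OFFSET_SK)
    {
      return true;
    }
  // The page holds no record start at all: it is a pure continuation page.
  lsa.pageid = page_is_archived ? fetched_pageid + 1 : NULL_PAGEID_SK;
  return false;
}
```

The archive branch keeps the offset null, so the next iteration repeats the same repair on the following page until a record start turns up.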

2.8 LOG_ZIP allocation helpers each pass instantiates

LOG_ZIP (struct log_zip, log_compress.h) is a grow-only buffer:

Field	Role	Why it exists
`data_length`	Bytes currently stored	Consumers read exactly this much; capacity may be larger
`buf_size`	Capacity of `log_data`	Grow-only check; sized to the LZ4 worst case
`log_data`	The buffer (`char *`)	Reused across records; `nullptr` until first sizing

log_zip_realloc_if_needed (log_zip, new_size) (in log_compress.c) grows only when new_size > 0 && new_size > log_zip.buf_size, to LOG_ZIP_BUF_SIZE (LZ4, new_size), setting ER_OUT_OF_VIRTUAL_MEMORY on failure. A second check, new_size > 0 && log_zip.log_data == nullptr, zeroes the fields and returns false (the caller treats this as fatal); true covers both success and no-grow. log_zip_alloc mallocs and zeroes the struct and sizes it the same way (returning nullptr on failure, with the husk freed); log_zip_free runs log_zip_free_data, then frees the struct. The redo context pre-sizes its two zips (Section 2.6). The undo pass allocates undo_unzip_ptr with log_zip_alloc (LOGAREA_SIZE) and frees it on every exit of log_recovery_undo. The shared consumer log_rv_get_unzip_log_data splits compressed from plain payloads via ZIP_CHECK (length): log_unzip versus memcpy after log_zip_realloc_if_needed.
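A grow-only buffer with the same three fields can be sketched as below. zip_sketch, capacity_for, realloc_if_needed, and free_data are hypothetical; the capacity rule is modeled on LZ4's published worst-case bound (n + n/255 + 16) as an assumption, and failure handling is simplified to a single reset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical miniature of the three-field buffer.
struct zip_sketch
{
  std::size_t data_length = 0;   // bytes currently stored
  std::size_t buf_size = 0;      // allocated capacity
  char *log_data = nullptr;      // owned buffer, null until first sizing
};

// Assumed capacity rule, modeled on LZ4's worst-case bound.
inline std::size_t capacity_for (std::size_t n) { return n + n / 255 + 16; }

// Grows only when the request exceeds the current capacity; never shrinks.
// Returns true on success and on no-grow, false when the allocation fails.
inline bool realloc_if_needed (zip_sketch &z, std::size_t new_size)
{
  if (new_size > 0 && new_size > z.buf_size)
    {
      const std::size_t cap = capacity_for (new_size);
      char *p = static_cast<char *> (std::realloc (z.log_data, cap));
      if (p == nullptr)
        {
          std::free (z.log_data);   // simplified: drop the old buffer and reset all fields
          z = zip_sketch ();
          return false;
        }
      z.log_data = p;
      z.buf_size = cap;
    }
  return true;
}

// Releases the buffer and returns the struct to its empty state.
inline void free_data (zip_sketch &z)
{
  std::free (z.log_data);
  z = zip_sketch ();
}
```

A smaller later request reuses the existing allocation untouched, so one buffer sized for the largest record serves the rest of the pass.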

2.9 Chapter summary — key takeaways

log_recovery runs as the TRAN_RECOVERY system transaction under an already-held write-mode log CS; only log_initialize_internal calls it, and emergency startup skips it.
The analysis anchor is log_Gl.hdr.chkpt_lsa, downgraded on a media crash to the oldest per-volume checkpoint found via log_rv_find_checkpoint.
Between analysis and redo the append point is re-armed and a LOG_DUMMY_CRASH_RECOVERY appended; its LSA (log_Gl.rcv_phase_lsa) is the crash boundary undo keys on.
Classic analysis and undo scans fetch with LOG_CS_FORCE_USE under the held log CS; log_reader forwards LOG_CS_SAFE_READER for positioned fetches, though its page-crossing helpers still default to LOG_CS_FORCE_USE and briefly take the read CS.
logpb_read_page_from_file splits active versus archive on LOGPB_IS_ARCHIVE_PAGE; the only benign self-id mismatch is the one-lap alias pageid + LOGPB_ACTIVE_NPAGES, and the archived-but-still-in-active shortcut is disabled during recovery.
NULL_OFFSET (-1) marks LSAs into incompletely archived records; every forward scan repairs it from hdr.offset, advancing a page in archives and terminating in the active log.

Chapter 3: Analysis Pass Driver

log_recovery_analysis walks forward from the checkpoint anchor through possibly corrupted or truncated log and computes the redo range: a page-fetch outer loop around a record-step inner loop. Record semantics go to log_rv_analysis_record (Ch 4–5); the driver owns cursor advancement, corruption defenses, the truncate-or-fatal decision, and redo-range bookkeeping. ARIES rationale: recovery-phases section of cubrid-recovery-manager.md; page-fetch mechanics: Ch 2.

3.1 Entry point, outputs, and driver state

log_recovery resolves the anchor — log_Gl.hdr.chkpt_lsa, or under media crash the oldest checkpoint among data-volume headers (log_rv_find_checkpoint) — and passes it as start_lsa, with is_media_crash (truncate vs fatal, 3.2) and stop_at (the restoredb -d boundary, 3.7). Outputs: start_redo_lsa (the anchor unless Ch 4 pulls it back), end_redo_lsa (Invariant 3-B), did_incom_recovery (truncated; log_recovery skips the EOF back-link fix-up), num_redo_log_records (3.8).

Key driver locals of log_recovery_analysis:

Local	Role	Why
`lsa`	next record; NULL ends both loops	single termination condition
`log_lsa`	current record, page in `log_page_p`	`lsa` advances before dispatch (3.6)
`prev_lsa`	last good record	resetlog target
`prev_prev_lsa`	resetlog’s `new_prev_lsa`	tracks `prev_lsa`; NULL only if the first fetched page is broken
`first_corrupted_rec_lsa`	first all-`0xff` 4 KB block	per-record cut-off (3.5)
`last_checked_page_id`	page already checksummed	probe once per page (3.3)
`is_log_page_broken`	fetch failed / record tail missing	truncate-or-fatal fork (3.2)
`is_log_page_corrupted`	readable but checksum failed	partial flush (3.5); terminal (Invariant 3-C)
`null_block`	4 KB of `LOG_PAGE_INIT_VALUE` (`0xff`, `log_common_impl.h`)	tear-detection `memcmp` operand
`checkpoint_lsa`	set by `LOG_END_CHKPT` dispatch (Ch 4)	2PC tail re-read (3.8)
`may_use_checkpoint` / `may_need_synch_checkpoint_2pc`	dispatch flags (Ch 4)	the second arms the 2PC tail
`last_at_time`	stays `-1` in the driver	echo to `*stop_at` is inert (3.7)

Initialization copies start_lsa into lsa, start_redo_lsa, end_redo_lsa, and prev_lsa — a degenerate redo range until proven otherwise — and nulls or zeroes everything else.

3.2 Outer loop: the is_log_page_broken branch

Each outer iteration fetches the page under the cursor with logpb_fetch_page. A failed fetch (past the flushed log, missing archive, TDE decryption error) sets is_log_page_broken, as can the inner loop’s broken-tail break (3.4), so a single branch decides what broken means.

Media-crash arm: truncate and accept, because log past the restore point may legitimately not exist. The arm echoes last_at_time via *stop_at, steps the last record’s owner tdes (tail_lsa/undo_nxlsa) back to log_rec->prev_tranlsa so undo never chases the truncated record, and re-fetches prev_lsa’s page (clobbered by the failed fetch; fatal on failure). Then log_recovery_resetlog (thread_p, &prev_lsa, &prev_prev_lsa) makes prev_lsa the new append point (Ch 11); the arm sets *did_incom_recovery, resets the MVCC table, and returns, skipping the 2PC tail (3.8).

Normal-crash arm: fatal. After a plain crash every page up to eof_lsa must be readable. When er_errid () is ER_TDE_CIPHER_IS_NOT_LOADED the message names TDE: the page is intact but undecryptable.

Invariant 3-B (redo-range honesty). On return, every record in [start_redo_lsa, end_redo_lsa) is readable and structurally complete; the boundary itself is the last fully-probed record or a position re-initialized before redo reads it. Normal end: the last dispatched record. Truncation (3.6 step 8): reverted to prev_lsa. Broken-record probe (3.4): deliberately advanced onto the broken record — equal to prev_lsa — so resetlog makes that position the new append point, overwritten by LOG_DUMMY_CRASH_RECOVERY before redo runs. Violation: redo (Ch 6) applies half-written bodies.

flowchart TD
  A["fetch page at lsa"] --> B{"broken?"}
  B -- no --> C["inner loop 3.3-3.6"] --> D{"lsa null?"}
  D -- no --> A
  D -- yes --> E["2PC tail; reset_start_mvccid"]
  B -- "yes, media crash" --> F["resetlog at prev_lsa; did_incom_recovery; return"]
  B -- "yes, normal crash" --> G["fatal (TDE or generic)"]

Figure 3-1 — outer loop of log_recovery_analysis.

3.3 Inner loop entry: NULL_OFFSET repair and the corruption probe

The inner loop runs while the cursor stays on the fetched page: while (!LSA_ISNULL (&lsa) && lsa.pageid == log_lsa.pageid). Two housekeeping steps precede record access.

NULL_OFFSET repair. A record archived while incomplete leaves the continuation’s offset unknown: the cursor arrives as (pageid, NULL_OFFSET) and is re-anchored on log_page_p->hdr.offset, the first header starting in this page. If that too is NULL_OFFSET (only continuation bytes here): archive page — lsa.pageid = log_lsa.pageid + 1, keep walking the record’s middle; active page — lsa.pageid = NULL_PAGEID, scan over. Either way continue.
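The repair above can be modelled in a few lines. This is a minimal sketch, not the CUBRID code: `Lsa`, `kNullPageId`, `kNullOffset`, and `repair_null_offset` are hypothetical stand-ins, and the boolean `page_is_archived` replaces the real archive test.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for LOG_LSA and its NULL sentinels.
constexpr std::int64_t kNullPageId = -1;
constexpr std::int32_t kNullOffset = -1;

struct Lsa
{
  std::int64_t pageid;
  std::int32_t offset;
};

// Re-anchor a (pageid, NULL_OFFSET) cursor on the first record header that
// starts in the fetched page. Returns true when the cursor now names a record
// on this page; false when the cursor moved to the next page or became null
// and the caller should restart the loop.
bool
repair_null_offset (Lsa &cursor, std::int32_t page_hdr_offset, bool page_is_archived)
{
  assert (cursor.offset == kNullOffset);    // only called for unanchored cursors
  if (page_hdr_offset != kNullOffset)
    {
      cursor.offset = page_hdr_offset;      // first header starting in this page
      return true;
    }
  if (page_is_archived)
    {
      cursor.pageid += 1;                   // continuation bytes only: keep walking
    }
  else
    {
      cursor.pageid = kNullPageId;          // active log: the scan is over
    }
  return false;
}
```

The three outcomes map one-to-one onto the cases in the paragraph: re-anchored, advanced by one archive page, or terminated in the active log.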

Per-page corruption probe. Guarded by last_checked_page_id, so it runs once per page. logpb_page_check_corruption wraps logpb_page_has_valid_checksum (CRC32 vs hdr.checksum); a helper error is fatal. A corrupt archive page is fatal (/* Should not happen. */, since archives are written once); a corrupt active page means a partial page flush. logpb_page_get_first_null_block_lsa locates the tear: the first 4 KB block that compares equal to null_block under memcmp yields (hdr.logical_pageid, i * block_size), with sizeof (LOG_HDRPAGE) subtracted from a nonzero offset, because LSA offsets index area[] while the raw page starts earlier.

If no block matches (corrupt, but every block holds data), first_corrupted_rec_lsa stays NULL: the 3.5 cut-off and its safety nets (gated on !is_log_page_corrupted) are skipped; only the page-advance ban and EOF stop of Invariant 3-C still apply.
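The tear search described in this section can be sketched as follows. It is an illustration under stated assumptions, not the body of logpb_page_get_first_null_block_lsa: `kPageHeaderSize` is an assumed placeholder for sizeof (LOG_HDRPAGE), the function name is hypothetical, and the page is passed as a raw byte vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kBlockSize = 4096;      // 4 KB probe granularity (from the text)
constexpr std::size_t kPageHeaderSize = 32;   // assumed stand-in for sizeof (LOG_HDRPAGE)
constexpr unsigned char kInitValue = 0xff;    // LOG_PAGE_INIT_VALUE

// Returns the area[]-relative offset of the first all-0xff block in a raw
// page image, or -1 when every block holds data (cut-off stays NULL).
std::int64_t
first_null_block_offset (const std::vector<unsigned char> &raw_page)
{
  assert (raw_page.size () % kBlockSize == 0);

  unsigned char null_block[kBlockSize];
  std::memset (null_block, kInitValue, kBlockSize);

  for (std::size_t i = 0; i < raw_page.size () / kBlockSize; i++)
    {
      if (std::memcmp (raw_page.data () + i * kBlockSize, null_block, kBlockSize) == 0)
        {
          std::size_t raw_offset = i * kBlockSize;
          // LSA offsets index area[], which starts after the page header.
          return raw_offset == 0 ? 0 : static_cast<std::int64_t> (raw_offset - kPageHeaderSize);
        }
    }
  return -1;
}
```

A return of -1 corresponds to the case above where the page is corrupt but no block matches, so first_corrupted_rec_lsa stays NULL.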

3.4 Multi-page records: log_is_page_of_record_broken

After log_rec = LOG_GET_LOG_RECORD_HEADER (log_page_p, &log_lsa), the media-crash path runs one more probe — a header may sit on the last restored page while its body spills onto pages never restored:

// log_is_page_of_record_broken -- src/transaction/log_recovery.c
  LSA_COPY (&fwd_log_lsa, &log_rec_header->forw_lsa);
  /* TODO - Do we need to handle NULL fwd_log_lsa? */
  if (!LSA_ISNULL (&fwd_log_lsa))
    {
      if (LSA_GE (log_lsa, &fwd_log_lsa)
          || (!LSA_ISNULL (&log_Gl.hdr.eof_lsa) && LSA_GT (&fwd_log_lsa, &log_Gl.hdr.eof_lsa)))
        {
          is_log_page_broken = true;  /* <- forw_lsa is nonsense */
        }
      else
        {
          if (fwd_log_lsa.pageid != log_lsa->pageid
              && (fwd_log_lsa.offset != 0 || fwd_log_lsa.pageid > log_lsa->pageid + 1))
            {
              // ... condensed: record spans pages -- probe-fetch fwd_log_lsa page;
              //     failure -> broken ...
            }
        }
    }

Branch by branch: (1) forw_lsa NULL — declines; the 3.5 safety nets judge instead (the TODO admits the gap). (2) forw_lsa not after the current record, or beyond eof_lsa — the header itself is garbage: broken (eof_lsa is NULL-guarded: restoring without an active volume recovers it only during analysis). (3) forw_lsa on a later page at nonzero offset, or more than one page ahead — the body provably reaches that page: probe-fetch it; failure means the tail is gone, success means sane. The excluded case — next record at offset 0 of the next page — proves nothing; no fetch is spent.

On a broken verdict the inner loop sets end_redo_lsa to lsa, sets prev_lsa and prev_prev_lsa to the same value, debug-traces, and breaks. The reset itself happens in 3.2, where prev_lsa is now the broken record: resetlog cuts there, sacrificing it so everything earlier survives.
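To make the three branches easy to test in isolation, the plausibility check can be restated as a pure function. This is a simplified model: `Lsa`, `ForwVerdict`, and `classify_forw_lsa` are hypothetical names, a pageid of -1 stands in for a NULL LSA, and the probe-fetch itself is reduced to a verdict that says a fetch would be needed.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for LOG_LSA; pageid -1 plays the role of a NULL LSA.
struct Lsa
{
  std::int64_t pageid;
  std::int32_t offset;
};

bool lsa_is_null (const Lsa &a) { return a.pageid == -1; }

bool lsa_less (const Lsa &a, const Lsa &b)
{
  return a.pageid < b.pageid || (a.pageid == b.pageid && a.offset < b.offset);
}

enum class ForwVerdict { NotJudged, Broken, NeedsProbeFetch, Sane };

ForwVerdict
classify_forw_lsa (const Lsa &rec, const Lsa &forw, const Lsa &eof)
{
  if (lsa_is_null (forw))
    {
      return ForwVerdict::NotJudged;        // branch 1: left to the 3.5 safety nets
    }
  if (!lsa_less (rec, forw) || (!lsa_is_null (eof) && lsa_less (eof, forw)))
    {
      return ForwVerdict::Broken;           // branch 2: the header itself is garbage
    }
  if (forw.pageid != rec.pageid && (forw.offset != 0 || forw.pageid > rec.pageid + 1))
    {
      return ForwVerdict::NeedsProbeFetch;  // branch 3: the body reaches forw's page
    }
  return ForwVerdict::Sane;                 // same page, or offset 0 of the next page
}
```

In the real function a failed probe-fetch turns the third verdict into broken and a successful one into sane; the sketch stops at the decision to fetch.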

3.5 The first_corrupted_rec_lsa cut-off

For pages that failed the checksum, the driver decides per record whether it precedes the torn region. Two safety nets first widen the verdict (only while is_log_page_corrupted is false): (1) missing end-of-log — forw_lsa NULL on a non-LOG_END_OF_LOG record in the active log is impossible (every chain ends at an EOF record): page declared corrupted, cut-off from the null-block scan. (2) Body crossing a null block — when forw_lsa stays in-page, map the record start and forw_lsa - 1 to block indexes ((offset + sizeof (LOG_HDRPAGE)) / block_size); if they differ and the ending block equals null_block, the body was never fully flushed: the cut-off becomes the record itself.

With a non-NULL cut-off, three outcomes. A record strictly past the tear ends the scan at the previous good record:

// log_recovery_analysis -- src/transaction/log_recovery.c
          if (LSA_GT (&log_lsa, &first_corrupted_rec_lsa))
            {
              LOG_RESET_APPEND_LSA (end_redo_lsa);  /* <- starts past the tear */
              LSA_SET_NULL (&lsa);
              break;
            }

The else arm flags the record itself corrupted when log_lsa == first_corrupted_rec_lsa, when forw_lsa points past the tear, or when the DB_ALIGN-ed end of its header overruns LOGAREA_SIZE or lands past the tear. It then calls LOG_RESET_APPEND_LSA (&log_lsa), so the first casualty’s own position becomes the new append point, nulls lsa, and breaks. A record provably before the tear is processed normally.
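The three per-record outcomes can be sketched as a classifier. This is a deliberately reduced model with hypothetical names (`CutoffOutcome`, `classify_against_cutoff`): it omits the DB_ALIGN-ed header overrun check and assumes that "points past the tear" means strictly greater than the cut-off.

```cpp
#include <cassert>
#include <cstdint>

struct Lsa
{
  std::int64_t pageid;
  std::int32_t offset;
};

bool lsa_less (const Lsa &a, const Lsa &b)
{
  return a.pageid < b.pageid || (a.pageid == b.pageid && a.offset < b.offset);
}

bool lsa_equal (const Lsa &a, const Lsa &b)
{
  return a.pageid == b.pageid && a.offset == b.offset;
}

enum class CutoffOutcome { PastTear, FirstCasualty, BeforeTear };

// rec: the record under the cursor; forw: its forw_lsa; cutoff: a non-NULL
// first_corrupted_rec_lsa. Header-overrun checks are intentionally left out.
CutoffOutcome
classify_against_cutoff (const Lsa &rec, const Lsa &forw, const Lsa &cutoff)
{
  if (lsa_less (cutoff, rec))
    {
      return CutoffOutcome::PastTear;       // scan ends at the previous good record
    }
  if (lsa_equal (rec, cutoff) || lsa_less (cutoff, forw))
    {
      return CutoffOutcome::FirstCasualty;  // this record's position becomes the append point
    }
  return CutoffOutcome::BeforeTear;         // processed normally
}
```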

Invariant 3-C (corruption is terminal per page). Once is_log_page_corrupted is true, the cursor never advances to another page. Enforced by the post-advance null-out (corrupted, not LOG_END_OF_LOG, lsa.pageid != log_lsa.pageid → LSA_SET_NULL) plus the stop after dispatching LOG_END_OF_LOG. Recycled pages from earlier log wraps can hold valid-looking stale records; following them replays a previous epoch.

3.6 Advancing the cursor: every remaining branch

The rest of the inner-loop body, in order:

1. end_redo_lsa = lsa; lsa = log_rec->forw_lsa: the range tip moves before dispatch.
2. Corrupted-page page-advance ban (Invariant 3-C).
3. Archive null-forward fix. NULL lsa on an archive page → log_lsa.pageid + 1: incomplete archiving, not end of log.
4. Loop guard. lsa backward or sideways (lsa.pageid < log_lsa.pageid, or same page and lsa.offset <= log_lsa.offset): “loop in the log” debug-trace, logpb_fatal_error, then LSA_SET_NULL (&lsa); break;.
5. Missing-EOF repair. NULL lsa, log_rtype != LOG_END_OF_LOG, no truncation yet: the append LSA parks at end_redo_lsa; if log_startof_nxrec finds the next record start, advance there, patch the in-buffer log_rec->forw_lsa, and flush the page (logpb_write_page_to_disk), which is a physical repair. Either way log_Gl.hdr.next_trid = tran_id.
6. Redo counting. *num_redo_log_records counts twelve redo-bearing types: LOG_REDO_DATA, LOG_UNDOREDO_DATA, LOG_DIFF_UNDOREDO_DATA, their three LOG_MVCC_* counterparts, LOG_DBEXTERN_REDO_DATA, LOG_RUN_POSTPONE, LOG_COMPENSATE, LOG_2PC_PREPARE, LOG_2PC_START, LOG_2PC_RECV_ACK; everything else hits the silent default.
7. Dispatch. log_rv_analysis_record rebuilds transaction state (Ch 4); its LOG_END_OF_LOG case is log_rv_analysis_log_end (3.8).
8. Post-dispatch truncation. *did_incom_recovery raised (3.7): end_redo_lsa = prev_lsa, so the trigger is excluded from redo; lsa nulled, break.
9. Self-loop assert. LSA_EQ (end_redo_lsa, &lsa), meaning the cursor did not move: assert_release, scan aborts via NULL cursor.
10. Corrupted page + LOG_END_OF_LOG → stop (second half of Invariant 3-C).
11. prev_lsa = end_redo_lsa; prev_prev_lsa = prev_lsa; (the resetlog anchors trail the tip by one record).
12. Page-id back-fill. Forward (pageid, NULL_OFFSET) with a stale smaller pageid → current page (pairs with 3.3’s repair).

Invariant 3-A (monotone cursor). Every iteration strictly increases the cursor (pageid, offset). Enforced by steps 4 and 9, both terminating the scan. Violation: analysis spins forever.
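The loop guard behind Invariant 3-A reduces to one comparison. The sketch below uses hypothetical names (`Lsa`, `cursor_loops`) and assumes that a NULL next cursor is not a loop, since the NULL cases are handled by the archive fix and the missing-EOF repair.

```cpp
#include <cassert>
#include <cstdint>

struct Lsa
{
  std::int64_t pageid;
  std::int32_t offset;
};

// True when `next` fails to move strictly forward from `current`,
// i.e. the "loop in the log" condition of the guard.
bool
cursor_loops (const Lsa &current, const Lsa &next)
{
  if (next.pageid == -1)
    {
      return false;   // assumption: a NULL cursor ends the scan through other steps
    }
  return next.pageid < current.pageid
         || (next.pageid == current.pageid && next.offset <= current.offset);
}
```

Both the backward case and the sideways case (same position) return true, which is what makes the cursor strictly increasing whenever the scan continues.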

3.7 Point-in-time stop: stop_at and LOG_REC_DONETIME

stop_at comes from log_recovery: -1 (no limit) on normal restart, the restoredb -d timestamp on media crash. The driver never reads commit times itself; log_rv_analysis_complete does, for LOG_COMMIT / LOG_ABORT (log_rv_analysis_commit_with_postpone applies the same test to its LOG_REC_START_POSTPONE at_time). It reads the LOG_REC_DONETIME payload behind the header. When *stop_at != (time_t) (-1) and difftime (*stop_at, last_at_time) < 0, that is, at the first done record stamped after the stop point, it nulls the page cursor, calls log_recovery_resetlog (thread_p, &record_header_lsa, prev_lsa) to cut the log before this commit, and raises *did_incom_recovery; 3.6 step 8 then excludes the record and ends the scan. The last_at_time used here is local to log_rv_analysis_complete; the driver’s own copy, echoed into *stop_at in 3.2, stays -1, so that echo is inert today.
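The stop test itself is small enough to isolate. A minimal sketch, with a hypothetical function name; only the -1 sentinel and the difftime comparison come from the text.

```cpp
#include <cassert>
#include <ctime>

// True when a completion stamped at `at_time` lies after the requested
// stop point. A stop_at of (time_t) -1 means "no limit".
bool
completion_is_past_stop_point (std::time_t stop_at, std::time_t at_time)
{
  return stop_at != (std::time_t) -1 && std::difftime (stop_at, at_time) < 0;
}
```

A completion stamped exactly at the stop point is kept: the comparison is strict, so only later completions trigger the truncation.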

3.8 log_rv_analysis_log_end, the 2PC re-read tail, and the outputs

The one dispatch case belonging to the driver’s story is the clean end of log, log_rv_analysis_log_end — one branch on logpb_is_page_in_archive. In the active log the EOF’s own position becomes log_Gl.hdr.append_lsa (LOG_RESET_APPEND_LSA (log_lsa) — new appends overwrite the EOF record), next_trid is restored from its owner, and the cursor takes the EOF’s NULL forw_lsa — both loops end (the missing-EOF repair exempts LOG_END_OF_LOG). An EOF in an archive page is a stale leftover from before the archiving cut: the header is untouched, the NULL forward goes through 3.6 step 3, and the scan continues.

The 2PC re-read tail. If any dispatched record set may_need_synch_checkpoint_2pc (a LOG_REC_CHKPT listing transactions in 2PC at checkpoint time — Ch 4), the driver re-reads the checkpoint record after the outer loop: (1) logpb_fetch_page on checkpoint_lsa, failure fatal; (2) the LOG_INFO_CHKPT_TRANS array of chkpt.ntrans entries, read in-page when log_lsa.offset + size < LOGAREA_SIZE, else malloc + logpb_copy_from_log (failed malloc: fatal); (3) each chkpt_trans[i].trid resolves via logtb_find_tran_index; log_2pc_recovery_analysis_info runs only for tdes still LOG_ISTRAN_2PC. The media-crash arm of 3.2 returns before this tail — truncated restores skip 2PC reconstruction.

log_recovery then emits ER_LOG_RECOVERY_REDO_STARTED from the range and the count; log_cnt_pages_containing_lsa returns 0 when *to_lsa == *from_lsa, else the inclusive to_lsa->pageid - from_lsa->pageid + 1. When nothing past the anchor survived, end_redo_lsa still equals start_redo_lsa from initialization — the count is honestly zero.
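The page count reported in that message follows directly from the rule just stated. A sketch with hypothetical names, modelling only the two cases described (equal LSAs, otherwise an inclusive page span):

```cpp
#include <cassert>
#include <cstdint>

struct Lsa
{
  std::int64_t pageid;
  std::int32_t offset;
};

// Number of log pages touched by the range [from, to]; 0 for an empty range.
std::int64_t
count_pages_containing_lsa (const Lsa &from, const Lsa &to)
{
  if (from.pageid == to.pageid && from.offset == to.offset)
    {
      return 0;                           // degenerate range: nothing to redo
    }
  return to.pageid - from.pageid + 1;     // inclusive of both end pages
}
```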

3.9 Chapter summary — key takeaways

log_recovery_analysis is a page-fetch outer loop around a record-step inner loop; corruption decisions belong to the driver, record semantics to log_rv_analysis_record (Ch 4–5).
Broken pages fork on is_media_crash: backups truncate via log_recovery_resetlog at prev_lsa and raise did_incom_recovery; normal restarts are fatal (ER_TDE_CIPHER_IS_NOT_LOADED means “load the TDE key”).
Partial page flush is caught by a once-per-page CRC check; the tear is the first all-0xff 4 KB block, and first_corrupted_rec_lsa cuts the scan with three per-record outcomes. A corrupted page is terminal (Invariant 3-C).
log_is_page_of_record_broken (media crash only) validates forw_lsa plausibility and probe-fetches a multi-page record’s last page; a missing tail parks end_redo_lsa and prev_lsa on the broken record so resetlog cuts there.
The redo range is honest (Invariant 3-B): everything strictly before end_redo_lsa is readable and complete; the boundary is fully probed or re-initialized before redo reads it.
Point-in-time restore lives in log_rv_analysis_complete (LOG_REC_DONETIME), not the driver; analysis is also not read-only — a missing LOG_END_OF_LOG is physically repaired via log_startof_nxrec, a patched forw_lsa, and a page flush.

Chapter 4: Analysis Record Dispatch and Transaction Table Rebuild

Chapter 3’s driver feeds every LOG_RECORD_HEADER it reads to log_rv_analysis_record. This chapter traces how each arm rebuilds transaction-table state, plus the global counters that ride along — append point, next TRANID, MVCCID horizon. The postpone/sysop arms belong to Chapter 5; ARIES theory lives in the companion cubrid-recovery-manager.md.

4.1 The dispatch switch in `log_rv_analysis_record`

log_rv_analysis_record is a pure demultiplexer: one switch (log_type), no logic of its own; its pointer parameters all belong to the driver’s loop state (Chapter 3). Every LOG_RECTYPE lands in exactly one arm:

Record type(s)	Handler	Effect on the table
`LOG_UNDOREDO_DATA`, `LOG_DIFF_UNDOREDO_DATA`, `LOG_UNDO_DATA`, `LOG_REDO_DATA`, their four `LOG_MVCC_*` twins, `LOG_DBEXTERN_REDO_DATA`	`log_rv_analysis_undo_redo`	advance `tail_lsa` + `undo_nxlsa` (4.3)
`LOG_SAVEPOINT`	`log_rv_analysis_save_point`	same, plus `savept_lsa` (4.3)
`LOG_COMPENSATE`	`log_rv_analysis_compensate`	redirect `undo_nxlsa` past undone work (4.3)
`LOG_COMMIT`, `LOG_ABORT`	`log_rv_analysis_complete`	free the tran index, or stop analysis early (4.4)
the seven `LOG_2PC_*` types	the seven `log_rv_analysis_2pc_*` arms	stamp a 2PC `tdes->state` (4.5)
`LOG_START_CHKPT` / `LOG_END_CHKPT`	`log_rv_analysis_start_checkpoint` / `_end_checkpoint`	arm `may_use_checkpoint` (4.7) / merge snapshot (4.8)
`LOG_DUMMY_HEAD_POSTPONE`, `LOG_POSTPONE`, `LOG_RUN_POSTPONE`, `LOG_COMMIT_WITH_POSTPONE` (+`_OBSOLETE`), `LOG_SYSOP_START_POSTPONE`, `LOG_SYSOP_END`, `LOG_SYSOP_ATOMIC_START`	the matching `log_rv_analysis_*` postpone/sysop arms	Chapter 5; commit-with-postpone’s early-stop branch mirrors 4.4
`LOG_END_OF_LOG`	`log_rv_analysis_log_end`	reset append point + `next_trid` (4.10)
`LOG_DUMMY_CRASH_RECOVERY`, `LOG_REPLICATION_DATA`, `LOG_REPLICATION_STATEMENT`, `LOG_DUMMY_HA_SERVER_STATE`, `LOG_DUMMY_OVF_RECORD`, `LOG_DUMMY_GENERIC`, `LOG_SUPPLEMENTAL_INFO`	none — bare `break`	no table effect
`LOG_SMALLER_LOGREC_TYPE`, `LOG_LARGER_LOGREC_TYPE`, `default`	none	`er_set (ER_LOG_PAGE_CORRUPTED)` + `assert (false)` — “probably the log is corrupted”

Return codes are discarded — most via (void) casts; the sysop-end and checkpoint arms simply ignore them. Almost every failure calls logpb_fatal_error, which terminates recovery; the lone exception is end-checkpoint’s sysop re-read (4.8 step 7) — debug builds assert, release builds swallow the error.

4.2 `logtb_rv_find_allocate_tran_index` — the lazy TDES allocator

Nearly every arm starts here (log_tran_table.c): map tran_id to a TDES, allocating on first sight. Three branches: B1 — logtb_is_system_worker_tranid (trid) short-circuits to log_system_tdes::rv_get_or_alloc_tdes, keeping system workers out of the table. B2 — logtb_find_tran_index misses: logtb_allocate_tran_index (..., TRAN_UNACTIVE_UNILATERALLY_ABORTED, ...), then LSA_COPY (&tdes->head_lsa, log_lsa); allocation failure is logpb_fatal_error + return NULL. B3 — hit: LOG_FIND_TDES.

Invariant — presumed abort. Every TDES created during analysis is born TRAN_UNACTIVE_UNILATERALLY_ABORTED, head_lsa = first sighting. Only a later completion record (removal, 4.4) or 2PC/postpone record (state upgrade) changes the verdict; any other initial state would make the undo pass (Chapter 9) skip a loser and leave its updates on disk.
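The find-or-allocate pattern and its presumed-abort default can be sketched with a hash map. This is a toy model, not logtb_rv_find_allocate_tran_index: the class and enum names are hypothetical, the system-worker branch (B1) and allocation failure are left out, and the state enum lists only a few illustrative values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

struct Lsa
{
  std::int64_t pageid;
  std::int32_t offset;
};

enum class TranState { UnilaterallyAborted, TwoPcPrepare, CommittedWithPostpone };

struct Tdes
{
  TranState state;
  Lsa head_lsa;
};

class RecoveryTranTable
{
public:
  // Find the descriptor for trid, creating it on first sight.
  Tdes &find_or_allocate (int trid, const Lsa &log_lsa)
  {
    auto it = table_.find (trid);
    if (it == table_.end ())
      {
        // Miss (B2): born as a presumed loser, head_lsa = first sighting.
        it = table_.emplace (trid, Tdes{TranState::UnilaterallyAborted, log_lsa}).first;
      }
    return it->second;   // hit (B3) or freshly created entry
  }

  std::size_t size () const { return table_.size (); }

private:
  std::unordered_map<int, Tdes> table_;
};
```

Later sightings never touch head_lsa or reset the state; only the individual analysis arms upgrade or remove an entry.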

4.3 The simple arms — undo_redo, save_point, compensate

log_rv_analysis_undo_redo covers all nine data-change types. Only non-happy branch: NULL TDES means logpb_fatal_error, return ER_FAILED. Otherwise LSA_COPY (&tdes->tail_lsa, log_lsa) then LSA_COPY (&tdes->undo_nxlsa, &tdes->tail_lsa): tail_lsa is the latest record, undo_nxlsa where undo starts walking backward; for a plain data record they coincide. log_rv_analysis_save_point adds LSA_COPY (&tdes->savept_lsa, &tdes->tail_lsa) for post-restart partial rollback.

log_rv_analysis_compensate handles LOG_COMPENSATE — a CLR, proof some update was already undone — and is the one arm where undo_nxlsa diverges from tail_lsa. After the allocator + NULL-fatal branch, it advances to the LOG_REC_COMPENSATE body (LOG_READ_ADD_ALIGN, LOG_READ_ADVANCE_WHEN_DOESNT_FIT) and executes one copy — LSA_COPY (&tdes->undo_nxlsa, &compensate->undo_nxlsa) — and does not advance tail_lsa. The copied pointer lands before the compensated update, so undo never restarts from the CLR itself: ARIES’ never-undo-an-undo rule, enforced purely by pointer redirection.
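The pointer bookkeeping of the three simple arms, as described in this section, can be sketched as follows. The struct and function names are hypothetical, page reads are omitted, and the CLR's stored undo_nxlsa is passed in directly.

```cpp
#include <cassert>
#include <cstdint>

struct Lsa
{
  std::int64_t pageid;
  std::int32_t offset;
};

struct Tdes
{
  Lsa tail_lsa{-1, -1};
  Lsa undo_nxlsa{-1, -1};
  Lsa savept_lsa{-1, -1};
};

// Plain data record: the latest record is also where undo would start.
void on_data_record (Tdes &tdes, const Lsa &rec)
{
  tdes.tail_lsa = rec;
  tdes.undo_nxlsa = tdes.tail_lsa;
}

// Savepoint: same as a data record, plus remembering the savepoint position.
void on_savepoint (Tdes &tdes, const Lsa &rec)
{
  on_data_record (tdes, rec);
  tdes.savept_lsa = tdes.tail_lsa;
}

// CLR: only undo_nxlsa moves, to the pointer stored in the CLR body,
// so undo resumes before the update that was already compensated.
void on_compensate (Tdes &tdes, const Lsa &clr_undo_nxlsa)
{
  tdes.undo_nxlsa = clr_undo_nxlsa;
}
```

After two updates and one CLR for the second update, undo_nxlsa points at the first update while tail_lsa still names the last data record seen.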

4.4 `log_rv_analysis_complete` — commit/abort finalization

LOG_COMMIT and LOG_ABORT share log_rv_analysis_complete — the only arm that removes table state, and one of two early-stop arms (the other, log_rv_analysis_commit_with_postpone in Chapter 5, carries the same stop_at/resetlog branch). Four branches:

// log_rv_analysis_complete -- src/transaction/log_recovery.c
  tran_index = logtb_find_tran_index (thread_p, tran_id);   /* <- find, never allocate */
  // ... condensed: B1 -- if not media crash, goto end; else read LOG_REC_DONETIME -> last_at_time ...
  if (stop_at != NULL && *stop_at != (time_t) (-1) && difftime (*stop_at, last_at_time) < 0)
    {                                    /* B2: completion is newer than the restoredb -d stop time */
      log_lsa->pageid = NULL_PAGEID;
      log_recovery_resetlog (thread_p, &record_header_lsa, prev_lsa);
      *did_incom_recovery = true;
      return NO_ERROR;                   /* <- index NOT freed: tran stays a loser */
    }
end:
  // ... condensed: B3 -- if tran_index != NULL_TRAN_INDEX, logtb_free_tran_index ...
  return NO_ERROR;                       /* B4: never seen before -> nothing to drop */

Two asymmetries: it finds, never allocates — a completion whose transaction predates the window is a no-op (B4); and B2 keeps the index — truncating the log at the commit record makes the transaction retroactively in-flight, so undo rolls it back: restore-to-timestamp.

4.5 The seven 2PC arms — a state-transition table

Structurally identical: allocate the TDES (NULL: logpb_fatal_error, ER_FAILED), overwrite tdes->state, advance tail_lsa; none touches undo_nxlsa. Only the stamped state differs:

Record type	Handler	`tdes->state` stamped
`LOG_2PC_PREPARE`	`log_rv_analysis_2pc_prepare`	`TRAN_UNACTIVE_2PC_PREPARE`
`LOG_2PC_START`	`log_rv_analysis_2pc_start`	`TRAN_UNACTIVE_2PC_COLLECTING_PARTICIPANT_VOTES`
`LOG_2PC_COMMIT_DECISION`	`log_rv_analysis_2pc_commit_decision`	`TRAN_UNACTIVE_2PC_COMMIT_DECISION`
`LOG_2PC_ABORT_DECISION`	`log_rv_analysis_2pc_abort_decision`	`TRAN_UNACTIVE_2PC_ABORT_DECISION`
`LOG_2PC_COMMIT_INFORM_PARTICPS`	`log_rv_analysis_2pc_commit_inform_particps`	`TRAN_UNACTIVE_COMMITTED_INFORMING_PARTICIPANTS`
`LOG_2PC_ABORT_INFORM_PARTICPS`	`log_rv_analysis_2pc_abort_inform_particps`	`TRAN_UNACTIVE_ABORTED_INFORMING_PARTICIPANTS`
`LOG_2PC_RECV_ACK`	`log_rv_analysis_2pc_recv_ack`	unchanged — only `tail_lsa` advances

LOG_2PC_PREPARE is the participant side; the rest are coordinator records. Prepare and start also plant tdes->gtrid = LOG_2PC_NULL_GTRID: a sentinel that the body (gtrid, participants, locks) was not read — it “needs to be read during either redo phase, or during finish_commit_protocol phase” (source comment); 4.9 consumes it.

4.6 The checkpoint payload structs

A completed checkpoint is two records: an empty LOG_START_CHKPT anchor and a LOG_END_CHKPT whose body (log_record.hpp) is a LOG_REC_CHKPT header, ntrans LOG_INFO_CHKPT_TRANS entries, then ntops LOG_INFO_CHKPT_SYSOP entries.

LOG_REC_CHKPT (log_rec_chkpt) has three fields: redo_lsa — oldest recovery LSA of any dirty data page, because redo must start at the oldest unflushed change (4.8 step 8); ntrans and ntops — counts of the two arrays that follow, which are not self-delimiting (ntops is commonly zero).

LOG_INFO_CHKPT_TRANS (log_info_chkpt_trans) — one serialized TDES per live transaction:

Field	Role	Why it exists
`isloose_end`	to `tdes->isloose_end`	Client loose ends
`trid`	Transaction id	Merge key for the allocator
`state`	Snapshot state; `TRAN_ACTIVE` and `TRAN_UNACTIVE_ABORTED` remap to `TRAN_UNACTIVE_UNILATERALLY_ABORTED`, others verbatim	Presumed abort; 2PC/postpone states survive
`head_lsa`	to `tdes->head_lsa`	May predate the analysis window
`tail_lsa`	to `tdes->tail_lsa`	Chain resume point; 2PC walk cursor (4.9)
`undo_nxlsa`	to `tdes->undo_nxlsa`	Pre-checkpoint CLR redirects (4.3)
`posp_nxlsa`	to `tdes->posp_nxlsa`	Postpone chain start (Chapter 5)
`savept_lsa`	to `tdes->savept_lsa`	Pre-checkpoint savepoints
`tail_topresult_lsa`	to `tdes->tail_topresult_lsa`	Skip completed sysops on rollback
`start_postpone_lsa`	to `tdes->rcv.tran_start_postpone_lsa`	Postpone completion (Chapter 8)
`user_name`	to `tdes->client` via `set_system_internal_with_user`	Loose-end owner

LOG_INFO_CHKPT_SYSOP (log_info_chkpt_sysop) — only sysops committing with postpone are checkpointed; an ordinary in-flight sysop simply dies with its transaction:

Field	Role	Why it exists
`trid`	Owning transaction	The sysop array is flat; entries join by id
`sysop_start_postpone_lsa`	to `tdes->rcv.sysop_start_postpone_lsa`	Non-null triggers re-reading that record (4.8 step 7)
`atomic_sysop_start_lsa`	to `tdes->rcv.atomic_sysop_start_lsa`	Drives atomic-sysop abort (Chapter 8)

4.7 `log_rv_analysis_start_checkpoint` and the `may_use_checkpoint` guard

The LOG_START_CHKPT arm is one condition — if (LSA_EQ (log_lsa, start_lsa)) { *may_use_checkpoint = true; } — and that condition is the design. start_lsa is where analysis began: log_Gl.hdr.chkpt_lsa, updated only when a checkpoint completes (Chapter 3). The flag arms only for the anchor start record, never for a LOG_START_CHKPT met mid-scan — such a snapshot “can contain stuff which does not exist any longer” (source comment).

stateDiagram-v2
    [*] --> Unset : analysis starts, flag false
    Unset --> Armed : LOG_START_CHKPT at start_lsa
    Unset --> Unset : LOG_START_CHKPT elsewhere, LSA_EQ fails
    Armed --> Consumed : LOG_END_CHKPT, merge snapshot then clear flag
    Unset --> Unset : LOG_END_CHKPT, guard returns early
    Consumed --> Consumed : any later checkpoint records ignored

Figure 4-1: Lifecycle of may_use_checkpoint. Only the END pairing with the anchor START can merge a snapshot.

This answers the crash-window question. Crash between START and END: the header still names the previous completed checkpoint; the unfinished window’s START fails LSA_EQ, its END was never written. A second complete window inside the range (media recovery): its START fails LSA_EQ, its END dies on the 4.8 guard.
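The single-shot guard can be modelled as a two-method state machine. This is a sketch with hypothetical names (`CheckpointGuard` and its members); the snapshot merge itself is reduced to a boolean result.

```cpp
#include <cassert>
#include <cstdint>

struct Lsa
{
  std::int64_t pageid;
  std::int32_t offset;
};

struct CheckpointGuard
{
  bool may_use_checkpoint = false;

  // LOG_START_CHKPT arm: arm only for the record analysis started at.
  void on_start_chkpt (const Lsa &rec, const Lsa &start_lsa)
  {
    if (rec.pageid == start_lsa.pageid && rec.offset == start_lsa.offset)
      {
        may_use_checkpoint = true;
      }
  }

  // LOG_END_CHKPT arm: returns true when this END may merge its snapshot.
  bool on_end_chkpt ()
  {
    if (!may_use_checkpoint)
      {
        return false;               // unpaired END: ignored
      }
    may_use_checkpoint = false;     // single-shot: cleared before merging
    return true;
  }
};
```

Running the two crash-window scenarios through it gives the expected answers: an anchor START followed by its END merges once, and any later START/END pair is ignored.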

4.8 `log_rv_analysis_end_checkpoint` — merging the snapshot, branch by branch

The longest arm; every branch accounted for:

1. Guard. if (*may_use_checkpoint == false) return NO_ERROR; so unpaired ENDs die here. Otherwise the flag clears at once: single-shot.
2. Anchor capture. LSA_COPY (check_point, log_lsa) saves the END’s LSA into the driver’s checkpoint_lsa, used by the run-postpone arm (Chapter 5) and 4.9.
3. Header read. LOG_REC_CHKPT is copied by value (chkpt = *tmp_chkpt) because later page advances may evict its page.
4. Trans array, two branches. In-page (log_lsa->offset + size < LOGAREA_SIZE): used in place; else malloc + logpb_copy_from_log; malloc failure is fatal.
5. Merge loop over chkpt.ntrans entries: allocator first (NULL: free area, logpb_fatal_error, ER_FAILED), then:

// log_rv_analysis_end_checkpoint -- src/transaction/log_recovery.c
      logtb_clear_tdes (thread_p, tdes);    /* <- wipe what the loop built so far */
      if (chkpt_one->state == TRAN_ACTIVE || chkpt_one->state == TRAN_UNACTIVE_ABORTED)
        {
          tdes->state = TRAN_UNACTIVE_UNILATERALLY_ABORTED;   /* <- presumed-abort remap */
        }
      else
        {
          tdes->state = chkpt_one->state;   /* <- 2PC / postpone states survive */
        }
      // ... condensed: isloose_end, six LSA_COPYs, rcv.tran_start_postpone_lsa, user name ...
      if (LOG_ISTRAN_2PC (tdes))
        {
          *may_need_synch_checkpoint_2pc = true;   /* <- defer 2PC body reads (4.9) */
        }

Invariant: snapshot atomicity with the END record. logtb_clear_tdes clobbers state already built from records between START and END. This is safe only because logpb_checkpoint snapshots the table and appends LOG_END_CHKPT (prior_lsa_next_record_with_lock) under one log_Gl.prior_info.prior_lsa_mutex hold: nothing appends in between, so the snapshot supersedes everything since START. Release the mutex earlier and this merge would silently regress tail_lsa/undo_nxlsa, and undo would skip live changes.

6. Trans area release. free_and_init (area) nulls area for reuse by the sysop array.
7. Sysop merge, gated by chkpt.ntops > 0. Same in-page-vs-malloc branches as step 4. Per entry: allocate the TDES by trid; grow the topops stack (logtb_realloc_topops_stack) when tdes->topops.max == 0 || (tdes->topops.last + 1) >= tdes->topops.max (failure: free, fatal); copy both LSAs into tdes->rcv. If sysop_start_postpone_lsa is non-null: bump topops.last from -1 to 0 (if it is not -1, assert (tdes->topops.last == 0): at most one level during recovery), then log_read_sysop_start_postpone re-reads that record on a private page buffer to fill topops.stack[last].lastparent_lsa and .posp_lsa, which the checkpoint entry omits. This is the only place analysis re-reads an older record; its error path is assert (false); return error_code; with no logpb_fatal_error (4.1).
8. Redo start pull-back. if (LSA_LT (&chkpt.redo_lsa, start_redo_lsa)) LSA_COPY (start_redo_lsa, &chkpt.redo_lsa); so redo (Chapter 6) begins at the oldest dirty page’s recovery LSA.
9. Final free_and_init (area) for the sysop copy (no-op if in-page), then NO_ERROR.
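Two of the merge decisions are pure functions of their inputs and can be sketched in isolation: the presumed-abort state remap from the merge loop and the redo start pull-back. Names are hypothetical and the state enum is a reduced illustration, not the full TRAN_STATE.

```cpp
#include <cassert>
#include <cstdint>

struct Lsa
{
  std::int64_t pageid;
  std::int32_t offset;
};

bool lsa_less (const Lsa &a, const Lsa &b)
{
  return a.pageid < b.pageid || (a.pageid == b.pageid && a.offset < b.offset);
}

enum class TranState { Active, UnactiveAborted, UnilaterallyAborted, TwoPcPrepare, CommittedWithPostpone };

// Snapshot states that mean "was running or aborting" collapse into the
// presumed-abort state; everything else is restored verbatim.
TranState remap_snapshot_state (TranState snapshot)
{
  if (snapshot == TranState::Active || snapshot == TranState::UnactiveAborted)
    {
      return TranState::UnilaterallyAborted;
    }
  return snapshot;
}

// Redo may only move its start earlier, never later.
void pull_back_redo_start (Lsa &start_redo_lsa, const Lsa &chkpt_redo_lsa)
{
  if (lsa_less (chkpt_redo_lsa, start_redo_lsa))
    {
      start_redo_lsa = chkpt_redo_lsa;
    }
}
```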

4.9 `may_need_synch_checkpoint_2pc` — the deferred 2PC reconstruction

After the main loop, log_recovery_analysis re-fetches LOG_END_CHKPT at the saved checkpoint_lsa and, for every trans entry whose TDES still satisfies LOG_ISTRAN_2PC, calls log_2pc_recovery_analysis_info (thread_p, tdes, &chkpt_trans[i].tail_lsa) (log_2pc.c): a prev_tranlsa back-chain walk from the snapshot-time tail_lsa, reading the LOG_2PC_PREPARE body while tdes->gtrid == LOG_2PC_NULL_GTRID and the LOG_2PC_START body while tdes->coord == NULL, collecting acks. The snapshot omits 2PC bodies “due to the big space overhead (e.g., locks)” (source comment), and they may predate the window — only a backward walk recovers them; the re-check skips transactions that completed after the snapshot.

4.10 `LOG_END_OF_LOG`, `next_trid`, and MVCCID restoration

Two pieces of global state ride along with the per-transaction rebuild. First, the EOF arm — log_rv_analysis_log_end is one branch, if (!logpb_is_page_in_archive (log_lsa->pageid)): only an EOF in the active log counts. Inside it, LOG_RESET_APPEND_LSA (log_lsa) re-anchors the append point so post-recovery writes overwrite the EOF, and log_Gl.hdr.next_trid = tran_id restarts the TRANID counter from the EOF record’s own trid — restart never re-issues an id seen in the log. An EOF inside an archive is an artifact of archiving an incomplete log and is skipped; the no-EOF-found repair path is the driver’s (Chapter 3).

Second, MVCCIDs. Deliberately, no analysis arm restores tdes->mvccinfo — rebuilt losers carry no MVCCID out of analysis. Instead the last statement of log_recovery_analysis (and of its incomplete-recovery early return) is log_Gl.mvcc_table.reset_start_mvccid () (mvcc_table.cpp), re-seeding the active-MVCCID bitmap start and m_current_status_lowest_active_mvccid from log_Gl.hdr.mvcc_next_id: every lower MVCCID is treated as no longer active. Redo refines the header value — each replayed MVCC record pushes log_Gl.hdr.mvcc_next_id past its own id — and reset_start_mvccid runs once more after redo (Chapter 6). A loser’s original MVCCID reappears only during undo: logtb_rv_assign_mvccid_for_undo_recovery sets tdes->mvccinfo.id from the undone record’s rcv->mvcc_id (Chapter 9).
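The interplay between the header counter and the table reset can be sketched with two integers. This is a simplified model under stated assumptions: `MvccRecoveryState` and both functions are hypothetical, MVCCIDs are compared as plain unsigned integers, and the active-MVCCID bitmap is reduced to a single lowest-active value.

```cpp
#include <cassert>
#include <cstdint>

using Mvccid = std::uint64_t;

struct MvccRecoveryState
{
  Mvccid next_id;         // stands in for log_Gl.hdr.mvcc_next_id
  Mvccid lowest_active;   // stands in for the table's lowest active MVCCID
};

// Redo side: a replayed MVCC record pushes the counter past its own id.
void on_replayed_mvcc_record (MvccRecoveryState &s, Mvccid id)
{
  if (id >= s.next_id)
    {
      s.next_id = id + 1;
    }
}

// Table reset: everything below the counter is treated as no longer active.
void reset_start_mvccid (MvccRecoveryState &s)
{
  s.lowest_active = s.next_id;
}
```

Calling the reset once after analysis and again after redo, as the text describes, means the second call picks up whatever the replayed records pushed the counter to.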

4.11 Chapter summary — key takeaways

log_rv_analysis_record is a logic-free demultiplexer; an unknown LOG_RECTYPE is page corruption; seven dummy/replication types are no-ops. Handler failures end in logpb_fatal_error — except end-checkpoint’s sysop re-read, dropped in release builds.
logtb_rv_find_allocate_tran_index enforces presumed abort: transactions are born TRAN_UNACTIVE_UNILATERALLY_ABORTED at first sighting; system workers live in a separate log_system_tdes map.
Only log_rv_analysis_compensate makes undo_nxlsa diverge from tail_lsa, jumping over already-undone work via the CLR’s stored pointer.
log_rv_analysis_complete finds but never allocates, and is the only arm that removes table state; its stop_at branch truncates the log and keeps the index — point-in-time restore.
The seven 2PC arms differ only in the stamped TRAN_STATE; prepare/start plant the gtrid = LOG_2PC_NULL_GTRID sentinel consumed by the post-loop log_2pc_recovery_analysis_info walk.
A LOG_END_CHKPT merges only when armed by a LOG_START_CHKPT at exactly start_lsa — half-built or extra checkpoint windows are ignored by construction; the logtb_clear_tdes-then-overwrite merge is safe because logpb_checkpoint snapshots the table and appends the END under one prior_lsa_mutex hold.
Global counters ride along: LOG_END_OF_LOG re-anchors the append point and next_trid; MVCCIDs are not rebuilt per transaction — reset_start_mvccid re-seeds the MVCC table from log_Gl.hdr.mvcc_next_id, and undo re-attaches loser MVCCIDs lazily.

Chapter 5: Sysop and Postpone Bookkeeping During Analysis

The messy middles — transactions caught inside system operations, atomic sysops, or commit-time postpones — become five LSA annotations in LOG_RCV_TDES, written by the log_rv_analysis_* arms below (driver: Ch 3, dispatch: Ch 4). Theory: high-level companion (cubrid-recovery-manager.md).

5.1 LOG_RCV_TDES — the recovery annotation block

LOG_RCV_TDES (struct log_rcv_tdes in log_impl.h) is five LOG_LSA fields embedded in every LOG_TDES as field rcv; outside recovery all five stay null.

Field	Role	Why it exists
`sysop_start_postpone_lsa`	Last open `LOG_SYSOP_START_POSTPONE`; written by `log_rv_analysis_sysop_start_postpone`, checkpoint-restored (Ch 4), reset by `log_rv_analysis_sysop_end`	`log_recovery_finish_sysop_postpone` (Ch 8) re-reads it to resume the sysop’s postpone list — no end record points to it
`tran_start_postpone_lsa`	The transaction’s `LOG_COMMIT_WITH_POSTPONE`; written by `log_rv_analysis_commit_with_postpone` + obsolete variant, checkpoint-restored (Ch 4)	Non-null-ness picks the state restored when a sysop postpone ends (5.7); bound for `log_recovery_finish_postpone`
`atomic_sysop_start_lsa`	Last unmatched `LOG_SYSOP_ATOMIC_START`; written by `log_rv_analysis_atomic_sysop_start`, checkpoint-restored (Ch 4), reset by both sysop arms when the atomic op is proven complete	Still set after redo → `log_recovery_abort_all_atomic_sysops` (Ch 8) rolls back to it before postpones run
`analysis_last_aborted_sysop_lsa`	Most recent ABORT-type `LOG_SYSOP_END`; written only in that arm of `log_rv_analysis_sysop_end`	Upper bound of the logical-redo skip range (`log_recovery_needs_skip_logical_redo`, Ch 6)
`analysis_last_aborted_sysop_start_lsa`	`lastparent_lsa` of that same aborted sysop end	Lower bound of the same skip range

flowchart LR
    cwp["commit_with_postpone"] --> f1["tran_start_postpone_lsa"]
    ssp["sysop_start_postpone"] --> f2["sysop_start_postpone_lsa"]
    ats["atomic_sysop_start"] --> f3["atomic_sysop_start_lsa"]
    se["sysop_end"] --> f4["analysis_last_aborted_sysop_lsa<br/>+ _start_lsa"]
    se -. resets .-> f2
    se -. resets .-> f3
    f1 --> fp["finish_postpone (Ch 8)"]
    f1 --> fsp["finish_sysop_postpone (Ch 8)"]
    f2 --> fsp
    f3 --> aas["abort_all_atomic_sysops (Ch 8)"]
    f4 --> skip["needs_skip_logical_redo (Ch 6)"]

Figure 5-1: annotation writers (left) and post-redo consumers (right), prefixes elided.

Invariant — annotations survive only while their phase is open. Each field is nulled once analysis proves its phase concluded pre-crash (reset guards, 5.7). Stale atomic_sysop_start_lsa → Ch 8 rolls back a committed operation; stale sysop_start_postpone_lsa → an already-run postpone list replays.

5.2 LOG_REC_SYSOP_END and LOG_SYSOP_END_TYPE

Every system operation ends with LOG_SYSOP_END, body LOG_REC_SYSOP_END (log_record.hpp) — three fixed fields, a vfid pointer, and a union switched by type:

Field	Role	Why it exists
`lastparent_lsa`	Transaction’s last LSA before the sysop started	Undo jump target over the sysop; compared against the annotations to detect nesting order
`prv_topresult_lsa`	Previous concluded top action’s LSA	Chains sysop results so partial abort can skip them (`tail_topresult_lsa`)
`type`	One of six `LOG_SYSOP_END_TYPE` values	Selects union interpretation and recovery behavior
`vfid`	Owning file; equals `mvcc_undo`’s vacuum-info file for MVCC undo	TDE (encryption) context lookup
union `undo`	Logical undo payload (`LOGICAL_UNDO`)	Multi-page op undoes via one logical recovery function
union `mvcc_undo`	Undo + MVCCID/vacuum info (`LOGICAL_MVCC_UNDO`)	Vacuum must see the operation’s MVCCID
union `compensate_lsa`	Next-undo LSA (`LOGICAL_COMPENSATE`)	The sysop replaces a compensation record; undo resumes here
union `run_postpone`	`postpone_lsa` + `is_sysop_postpone` flag (`LOGICAL_RUN_POSTPONE`)	Replaces a `LOG_RUN_POSTPONE`; the flag says whose postpone list advances (5.7)

LOG_SYSOP_END_TYPE (enum log_sysop_end_type, log_record.hpp) has six values: LOG_SYSOP_END_COMMIT (“permanent changes”), LOG_SYSOP_END_ABORT (“aborted system op”), and the four LOG_SYSOP_END_LOGICAL_* flavors UNDO, MVCC_UNDO, COMPENSATE, RUN_POSTPONE. The union is a role matrix switched solely by type (asserted by LOG_SYSOP_END_TYPE_CHECK); 5.7 traces each value’s analysis-time effect.

5.3 Postpone-side arms: LOG_POSTPONE, LOG_DUMMY_HEAD_POSTPONE, LOG_RUN_POSTPONE

log_rv_analysis_postpone (LOG_POSTPONE) and log_rv_analysis_dummy_head_postpone (the no-op LOG_DUMMY_HEAD_POSTPONE marker) each have two branches: a fatal logtb_rv_find_allocate_tran_index == NULL early return (logpb_fatal_error, ER_FAILED) and the first-postpone capture. On LSA_ISNULL (posp_nxlsa) the postpone arm copies the previous tail_lsa into posp_nxlsa before advancing tail_lsa/undo_nxlsa (“set address early”); the dummy-head arm advances first and captures after (“set address late”), landing on the dummy head itself. posp_nxlsa is where log_recovery_find_first_postpone (Ch 8) starts scanning.

log_rv_analysis_run_postpone handles LOG_RUN_POSTPONE (a postpone already executed and redo-logged). Branches:

tdes == NULL → fatal, ER_FAILED.
State not in {WILL_COMMIT, COMMITTED_WITH_POSTPONE, TOPOPE_COMMITTED_WITH_POSTPONE} (TRAN_UNACTIVE_ elided): impossible for a checkpointed tdes (SYSTEM ERROR debug log), normal otherwise; recovery guesses topops.last == -1 → committed-with-postpone, else topope-committed.
State now TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE → LSA_SET_NULL (undo_nxlsa): nothing left to undo.
Body read (Ch 2 macros); run_posp->ref_lsa — the LOG_POSTPONE this record executed — resets the cursor: topops.stack[last].posp_lsa in the topope state, else tdes->posp_nxlsa (other two states asserted).

Invariant — posp_nxlsa always points at the next postpone not yet known to have run. LOG_POSTPONE sets it once; every LOG_RUN_POSTPONE advances it to ref_lsa. Lagging → Chapter 8 runs a postpone twice; overshooting → deferred work silently lost.
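
The early/late capture and the cursor advance can be sketched in a few lines. This is a minimal model, not the CUBRID code: `Lsa` is flattened to one integer with -1 as the null value, `Tdes` keeps only the two cursors discussed here, and the `on_*` function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: an LSA flattened to one integer, with -1 standing in for the null LSA.
using Lsa = std::int64_t;
constexpr Lsa NULL_LSA = -1;

// Only the two cursors this section discusses; not the real LOG_TDES.
struct Tdes
{
  Lsa tail_lsa = NULL_LSA;
  Lsa posp_nxlsa = NULL_LSA;
};

// LOG_POSTPONE, "set address early": capture the previous tail, then advance it.
inline void on_postpone (Tdes &tdes, Lsa log_lsa)
{
  if (tdes.posp_nxlsa == NULL_LSA)
    {
      tdes.posp_nxlsa = tdes.tail_lsa;
    }
  tdes.tail_lsa = log_lsa;
}

// LOG_DUMMY_HEAD_POSTPONE, "set address late": advance first, then capture.
inline void on_dummy_head_postpone (Tdes &tdes, Lsa log_lsa)
{
  tdes.tail_lsa = log_lsa;
  if (tdes.posp_nxlsa == NULL_LSA)
    {
      tdes.posp_nxlsa = tdes.tail_lsa;
    }
}

// LOG_RUN_POSTPONE, transaction flavor: the cursor moves to the record's ref_lsa.
inline void on_run_postpone (Tdes &tdes, Lsa ref_lsa)
{
  tdes.posp_nxlsa = ref_lsa;
}
```

With a tail at 100 and a record at 140, the postpone arm leaves the cursor at 100 while the dummy-head arm leaves it at 140; a second postpone never moves a cursor that is already set.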

5.4 Transaction commit with postpone

log_rv_analysis_commit_with_postpone handles LOG_COMMIT_WITH_POSTPONE: outcome decided, deferred work possibly unfinished. After the fatal-tdes branch it reads LOG_REC_START_POSTPONE (posp_lsa + at_time) and forks on is_media_crash:

// log_rv_analysis_commit_with_postpone -- src/transaction/log_recovery.c
  if (is_media_crash)
    {
      // ... condensed: stop_at test -> resetlog + *did_incom_recovery = true ...
    }
  else
    {
      tdes->state = TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE;
      LSA_SET_NULL (&tdes->undo_nxlsa);                  /* Nothing to undo */
      LSA_COPY (&tdes->tail_lsa, log_lsa);
      tdes->rcv.tran_start_postpone_lsa = tdes->tail_lsa; /* <- annotation write */
      LSA_COPY (&tdes->posp_nxlsa, &start_posp->posp_lsa);
    }

The media-crash arm is point-in-time recovery. When stop_at != NULL && *stop_at != (time_t) (-1) && difftime (*stop_at, last_at_time) < 0, that is, when the commit is stamped later than the restore target, it releases the page, truncates the log (log_recovery_resetlog, Ch 11), sets *did_incom_recovery, and the transaction is treated as never committed. If the test fails, which includes stop_at being NULL or (time_t) (-1), the media-crash arm does nothing: the annotation and state updates happen only in the non-media-crash arm.
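
The three-part stop test is small enough to isolate. The sketch below wraps it in a hypothetical helper, `commit_is_past_restore_target`; only the condition itself comes from the text, and `commit_time` stands in for the record's timestamp.

```cpp
#include <cassert>
#include <ctime>

// True when a commit stamped commit_time lies strictly after the restore target.
// A null pointer or a (time_t) -1 target means "no target", so nothing is cut off.
inline bool commit_is_past_restore_target (const std::time_t *stop_at, std::time_t commit_time)
{
  return stop_at != nullptr
	 && *stop_at != (std::time_t) (-1)
	 && std::difftime (*stop_at, commit_time) < 0;
}
```

`std::difftime (end, beginning)` returns `end - beginning` in seconds, so a negative result means the commit is later than the target. A commit stamped exactly at the target is kept.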

log_rv_analysis_commit_with_postpone_obsolete (LOG_COMMIT_WITH_POSTPONE_OBSOLETE, old layout LOG_REC_START_POSTPONE_OBSOLETE without at_time) performs exactly the non-media-crash arm — no timestamp, no point-in-time stop. Kept only to read old-release logs; slated for removal “maybe 12.0”.

5.5 log_rv_analysis_sysop_start_postpone

LOG_SYSOP_START_POSTPONE marks a sysop that finished its main work and began its own postpone list. Its body LOG_REC_SYSOP_START_POSTPONE is an embedded LOG_REC_SYSOP_END sysop_end (what the end record will say) plus posp_lsa (first postpone of the sysop). Branches:

1. Fatal-tdes → ER_FAILED.
2. tail_lsa/undo_nxlsa advance; annotation write: tdes->rcv.sysop_start_postpone_lsa = tdes->tail_lsa.
3. Three-way fork, first on the current state and then on the embedded end type: state already topope-committed → assert_release (false) (two simultaneous sysop postpones cannot exist); sysop_end.type == LOG_SYSOP_END_LOGICAL_RUN_POSTPONE → a nested is_sysop_postpone == true is asserted impossible, and the transaction-postpone flavor nulls undo_nxlsa (the transaction is committing regardless of its guessed state); otherwise assert (type != LOG_SYSOP_END_ABORT), since an aborting sysop never starts a postpone phase.
4. State := TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE.
5. Topops stack grown via logtb_realloc_topops_stack if needed (ER_OUT_OF_VIRTUAL_MEMORY on failure); topops.last must be -1, bumped to 0 (assert (false) otherwise); lastparent_lsa and posp_lsa copy into topops.stack[0].
6. LSA_LT (sysop_end.lastparent_lsa, rcv.atomic_sysop_start_lsa) means the atomic marker was logged inside this sysop; reaching start-postpone proves the atomic part completed, so the marker is nulled.

Invariant — at most one live sysop postpone, so topops.last <= 0 throughout recovery. Enforced by the asserts in steps 3 and 5, re-checked in log_rv_analysis_sysop_end (assert (tdes->topops.last == 0)). If violated, the run-postpone arms would advance the wrong stack entry’s posp_lsa.

5.6 log_rv_analysis_atomic_sysop_start

The simplest arm, for LOG_SYSOP_ATOMIC_START — two branches: fatal-tdes, and success, which advances tail_lsa/undo_nxlsa then writes tdes->rcv.atomic_sysop_start_lsa = *log_lsa (the record has no body — the LSA is the payload). If nothing clears it (5.5, 5.7), log_recovery_abort_all_atomic_sysops → log_recovery_abort_atomic_sysop (Ch 8) rolls the transaction back to this LSA before postpones resume.

5.7 log_rv_analysis_sysop_end — the intricate one

Closes a sysop of unknown kind for a transaction in an only-guessed state. Prologue: fatal-tdes branch; advance tail_lsa, undo_nxlsa, tail_topresult_lsa; read LOG_REC_SYSOP_END; LOG_SYSOP_END_TYPE_CHECK. Then the switch, where local commit_start_postpone decides whether this end also closes an open sysop-postpone phase:

// log_rv_analysis_sysop_end -- src/transaction/log_recovery.c
    case LOG_SYSOP_END_ABORT:
      // ... condensed: comment -- abort neither changes state nor finishes a topope postpone ...
      if (tdes->state == TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE)
        {
          LSA_SET_NULL (&tdes->undo_nxlsa);     /* no undo */
        }
      tdes->rcv.analysis_last_aborted_sysop_lsa = *log_lsa;             /* <- skip-range upper bound */
      tdes->rcv.analysis_last_aborted_sysop_start_lsa = sysop_end->lastparent_lsa;  /* <- lower bound */
      break;
    case LOG_SYSOP_END_COMMIT:
      assert (tdes->state != TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE);  /* <- falls through to next cases */
    case LOG_SYSOP_END_LOGICAL_UNDO:
    case LOG_SYSOP_END_LOGICAL_MVCC_UNDO:
      // ... condensed: todo comment ...
      commit_start_postpone = true;
      break;
    case LOG_SYSOP_END_LOGICAL_COMPENSATE:
      tdes->undo_nxlsa = sysop_end->compensate_lsa;  /* <- jump undo over compensated range */
      commit_start_postpone = true;
      break;

The ABORT arm is the aborted-sysop tracker: a LOG_DBEXTERN_REDO_DATA logical redo inside the aborted range would re-create state the pre-crash rollback destroyed, so log_recovery_needs_skip_logical_redo (Ch 6) skips records with analysis_last_aborted_sysop_start_lsa < lsa < analysis_last_aborted_sysop_lsa. Each ABORT end overwrites the fields — only the last aborted sysop is tracked.

The LOG_SYSOP_END_LOGICAL_RUN_POSTPONE arm: in topope-committed state the run-postpone sysop could belong to either postpone scope; run_postpone.is_sysop_postpone decides:

true (sysop’s postpone): if topops.last < 0 or state is not topope-committed, the stack is conjured — realloc if max == 0 (fatal ER_OUT_OF_VIRTUAL_MEMORY), topops.last = 0, state forced to topope-committed; then topops.stack[last].posp_lsa = run_postpone.postpone_lsa. commit_start_postpone stays false — the phase continues.
false (transaction’s postpone): posp_nxlsa = run_postpone.postpone_lsa; topops.last != -1 → asserts confirm the topope state, else state := TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE; undo_nxlsa nulled; commit_start_postpone = true.

The epilogue runs for every arm. In topope-committed state (assert (topops.last == 0)) with commit_start_postpone set, the sysop postpone phase is over and tran_start_postpone_lsa picks the restored state: non-null restores TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE (asserted LSA_LE to lastparent_lsa — the sysop ran inside the transaction’s postpone phase); null restores the default recovery state TRAN_UNACTIVE_UNILATERALLY_ABORTED. Either way topops.last = -1. Without commit_start_postpone the phase continues (topops.last stays 0); in any non-topope state it is (re)set to -1.
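
The epilogue's decision table can be written as a short function. This is a sketch under simplifying assumptions: `TranState` is a hypothetical reduction of TRAN_STATE to three values, `has_tran_start_postpone` stands for a non-null rcv.tran_start_postpone_lsa, and the asserts of the real code are left out.

```cpp
#include <cassert>

// Hypothetical reduction of TRAN_STATE to the values this section needs.
enum class TranState
{
  UnilaterallyAborted,
  CommittedWithPostpone,
  TopopeCommittedWithPostpone
};

struct Tdes
{
  TranState state;
  int topops_last;		// -1 means no open sysop postpone
  bool has_tran_start_postpone;	// stands for a non-null rcv.tran_start_postpone_lsa
};

inline void sysop_end_epilogue (Tdes &tdes, bool commit_start_postpone)
{
  if (tdes.state != TranState::TopopeCommittedWithPostpone)
    {
      tdes.topops_last = -1;	// any non-topope state: (re)set to -1
      return;
    }
  if (!commit_start_postpone)
    {
      return;			// the sysop postpone phase continues, topops_last stays 0
    }
  // The phase is over: the transaction-level annotation picks the restored state.
  tdes.state = tdes.has_tran_start_postpone
	       ? TranState::CommittedWithPostpone
	       : TranState::UnilaterallyAborted;
  tdes.topops_last = -1;
}
```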

Two symmetric reset guards follow — a postpone phase and an atomic sysop can nest either way, and the end belongs to whichever started later. The atomic guard nulls rcv.atomic_sysop_start_lsa only if three conditions hold: (1) it is non-null; (2) LSA_GT over sysop_start_postpone_lsa — the atomic op is the more recent open phase; (3) LSA_GT (atomic_sysop_start_lsa, sysop_end->lastparent_lsa). Condition 3 is the resurrection guard: if lastparent_lsa >= atomic_sysop_start_lsa, this end closes a sysop that began after the atomic marker — one nested inside the still-open atomic operation — and clearing the annotation on its end would let recovery skip the still-unfinished atomic operation. Only an end whose lastparent_lsa precedes the marker (the sysop containing the marker) proves the atomic op completed and may clear it. The mirror-image guard nulls sysop_start_postpone_lsa identically.
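
The atomic guard's three conditions reduce to one predicate. The sketch assumes LSAs flattened to integers with -1 as null, and assumes a null LSA compares below every valid one; `may_clear_atomic_marker` is a hypothetical name, not a CUBRID function.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: an LSA flattened to one integer; -1 is null and sorts below every valid LSA.
using Lsa = std::int64_t;
constexpr Lsa NULL_LSA = -1;

// True when this LOG_SYSOP_END proves the atomic operation completed.
inline bool may_clear_atomic_marker (Lsa atomic_sysop_start_lsa, Lsa sysop_start_postpone_lsa,
				     Lsa lastparent_lsa)
{
  return atomic_sysop_start_lsa != NULL_LSA			// (1) a marker is set
	 && atomic_sysop_start_lsa > sysop_start_postpone_lsa	// (2) atomic op is the more recent open phase
	 && atomic_sysop_start_lsa > lastparent_lsa;		// (3) the ending sysop contains the marker
}
```

With the marker at 500, an end whose lastparent_lsa is 450 clears it, while an end whose lastparent_lsa is 600 (a sysop nested inside the atomic operation) leaves it in place.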

5.8 Chapter summary — key takeaways

LOG_RCV_TDES is a five-LSA annotation block in every LOG_TDES, written by analysis arms (plus checkpoint restore, Ch 4), consumed by Chapters 6 and 8, nulled once its phase is proven concluded.
log_rv_analysis_commit_with_postpone writes tran_start_postpone_lsa and doubles as the point-in-time stop on media crash; the obsolete variant is the same minus the timestamp.
log_rv_analysis_sysop_start_postpone writes sysop_start_postpone_lsa, forces topops.last from -1 to 0, and clears an atomic_sysop_start_lsa proven nested inside the now-postponing sysop.
log_rv_analysis_sysop_end is a six-arm switch: ABORT records the skip range without changing state; COMMIT and both LOGICAL_UNDO flavors close an open sysop postpone phase; LOGICAL_COMPENSATE also redirects undo_nxlsa; LOGICAL_RUN_POSTPONE disambiguates via is_sysop_postpone.
When a sysop postpone phase closes, the prior state is rebuilt from tran_start_postpone_lsa: non-null → TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE, null → TRAN_UNACTIVE_UNILATERALLY_ABORTED; the reset guards then compare both annotations against lastparent_lsa so an end clears only its own phase’s annotation.
analysis_last_aborted_sysop_start_lsa < lsa < analysis_last_aborted_sysop_lsa is how log_recovery_needs_skip_logical_redo suppresses LOG_DBEXTERN_REDO_DATA replay inside a pre-crash-aborted sysop.

Chapter 6: Redo Pass Driver and Synchronous Apply

log_recovery_redo replays the log forward over the range analysis fixed (Chapter 3) against the rebuilt transaction table (Chapter 4) — the driver loop, then the synchronous apply path down to LZ4/XOR payload assembly. Theory: companion, “Redo pass”; parallel leg: Chapter 7; loose ends: Chapter 8; RV_fun: Chapter 10.

6.1 The redo context and per-record structs

log_rv_redo_context is the whole apply state; its constructor pre-allocates both LOG_ZIP buffers at LOGAREA_SIZE:

Field	Role
`m_reader`	private log cursor (Chapter 2) — parallel workers each need one
`m_redo_zip`	redo payload scratch+output; `rcv.data` points into it, no per-record malloc
`m_undo_zip`	undo scratch for diff undoredo records — diffed redo XORs against the undo image
`m_end_redo_lsa`	`const` hard stop; past-end records are torn tail; bounds the §6.4 page-LSA assert
`m_reader_fetch_page_mode`	`NORMAL` for crash recovery (trusts its snapshot), `FORCE` for replication re-fetch

The copy constructor delegates to the main constructor — copies share nothing, making Chapter 7’s per-worker copies safe. Each record travels as a value snapshot, log_rv_redo_rec_info<T>, with exactly three fields: m_start_lsa — the record header’s LSA, stamped onto the page after apply, the idempotence comparand; m_type — the concrete LOG_RECTYPE (one T serves plain and DIFF rectypes; the diff decision needs it); m_logrec — a by-value copy of the typed body taken via reinterpret_copy_and_add_align, so a queued job holds no log-page pointer.

Debug-only vpid_lsa_consistency_check (check / cleanup) has exactly two fields: mtx — parallel redo workers call check concurrently — and consistency_check_map, the first-seen LSA per (volid, pageid) (emplace never overwrites an existing key); cleanup clears it at pass end.

Invariant — per-page LSA ordering. Out-of-order apply loses updates. Enforced (debug, rcv_phase != LOG_RESTARTED) by assert ((*map_it).second < a_log_lsa) — each new LSA compared against the page’s first recorded LSA (weaker than pairwise monotonicity; emplace keeps the original entry).
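
Why the check is weaker than pairwise monotonicity follows from `emplace` semantics, which a small model makes concrete. The class below is a sketch only: `VpidLsaCheck`, `PageKey`, and the integer `Lsa` are hypothetical stand-ins, and `check` returns a bool where the text describes an assert.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <utility>

using Lsa = std::int64_t;		// assumption: LSA flattened to one integer
using PageKey = std::pair<int, int>;	// (volid, pageid)

class VpidLsaCheck
{
  public:
    // Returns false when lsa does not exceed the first LSA recorded for the page.
    bool check (const PageKey &page, Lsa lsa)
    {
      std::lock_guard<std::mutex> guard (m_mtx);
      const auto result = m_first_seen.emplace (page, lsa);	// leaves an existing entry untouched
      if (result.second)
	{
	  return true;		// first sighting of this page
	}
      return result.first->second < lsa;
    }

    void cleanup ()
    {
      std::lock_guard<std::mutex> guard (m_mtx);
      m_first_seen.clear ();
    }

  private:
    std::mutex m_mtx;
    std::map<PageKey, Lsa> m_first_seen;
};
```

After LSAs 100 and 150 on one page, an LSA of 120 still passes because it is compared against 100, not 150; only an LSA at or below 100 fails.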

Invariant — m_redo_zip buffer stability. rcv.data aliases m_redo_zip.log_data until the redofun returns; enforced structurally — one context per thread, sequential assembly; recycle early and the redofun reads garbage.

6.2 log_recovery_redo — setup and the outer loop

The driver drops the log critical section (LOG_CS_EXIT; re-entered at the tail). log_recovery_get_redo_parallel_count — MAX (16, system_core_count) — sizes reusable_jobs and cublog::redo_parallel under SERVER_MODE (Chapter 7); in SA mode parallel_recovery_redo stays nullptr, all applies synchronous. Pre-loop defenses: a start_redolsa offset too close to the page end trips assert (false) and resumes at the next page; PRM_ID_RECOVERY_PROGRESS_LOGGING_INTERVAL (5-second floor) periodically emits ER_LOG_RECOVERY_PROGRESS with pages done/total and ETA.

The outer loop fetches the page holding lsa; on fetch failure, lsa > m_end_redo_lsa is the normal past-the-end goto exit, failure inside the promised range is logpb_fatal_error. The inner loop walks records while lsa.pageid == m_reader.get_pageid (); each iteration re-positions the reader at the (possibly repaired) record lsa via set_lsa_and_fetch_page before reading the header:

flowchart TD
    A["record at lsa"] --> B{"past end_redo_lsa?"}
    B -- yes --> Z["null lsa, break"]
    B -- no --> C["offset repair if NULL"]
    C --> H["re-fetch at lsa, read header, lsa = forw_lsa"]
    H --> K{"lsa strictly advances?"}
    K -- no --> L["fatal: loop in log"]
    K -- yes --> M["switch on log_rtype"]
    M --> P["pageid fixup"] --> A

Figure 6-1: skeleton of one inner-loop iteration of log_recovery_redo; the callouts below account for each branch.

Archive page-boundary repair — an incompletely archived record leaves the page-header offset or forw_lsa NULL. A NULL lsa.offset takes the page-header offset; if that too is NULL, archive page -> pageid + 1, active page -> genuine end of log (pageid = NULL_PAGEID); continue. A NULL forw_lsa on an archived page likewise advances to pageid + 1. Loop-in-log defense — a next lsa that does not strictly advance is logpb_fatal_error instead of spinning. Post-switch fixup — after a multi-page body, lsa.pageid jumps to the reader’s page so consumed pages are not re-fetched.

Invariant — the scan strictly advances. Every path moves lsa forward or nulls it and terminates; otherwise recovery replays the same range forever.
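
The loop-in-log defense can be modeled on a toy log that maps each record's LSA to its forward pointer. Everything here is hypothetical scaffolding (`ToyLog`, `scan_forward`, integer LSAs); only the rule that the next LSA must strictly advance comes from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

using Lsa = std::int64_t;	// assumption: LSA flattened to one integer
constexpr Lsa NULL_LSA = -1;

// Toy log: record LSA -> forw_lsa from its header (NULL_LSA ends the log).
using ToyLog = std::map<Lsa, Lsa>;

// Returns false on a "loop in log"; visited counts the records read before stopping.
inline bool scan_forward (const ToyLog &log, Lsa start, std::size_t &visited)
{
  visited = 0;
  Lsa lsa = start;
  while (lsa != NULL_LSA)
    {
      const auto it = log.find (lsa);
      if (it == log.end ())
	{
	  break;		// no record here: treat as end of log
	}
      ++visited;
      const Lsa next = it->second;
      if (next != NULL_LSA && next <= lsa)
	{
	  return false;		// next LSA does not strictly advance
	}
      lsa = next;
    }
  return true;
}
```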

6.3 The dispatch switch — every record-type arm

Past the header, two local macros carry each typed arm: BUILD_RECORD_INFO (TEMPLATE_TYPE) wraps rcv_lsa, log_rtype and the reinterpret_copy_and_add_align<TEMPLATE_TYPE> () body copy into a log_rv_redo_rec_info; INVOKE_REDO_RECORD forwards it to log_rv_redo_record_sync_or_dispatch_async, where log_rv_need_sync_redo forces the sync leg for null-VPID records and the volume/sector RVDK_* rcvindexes (enumerated in Chapter 7). Every arm, branch-complete:

Arm	Action
`LOG_UNDOREDO_DATA`, `LOG_DIFF_UNDOREDO_DATA`, `LOG_RUN_POSTPONE`, `LOG_COMPENSATE`	plain build+invoke (§6.4)
`LOG_UNDO_DATA`, `LOG_POSTPONE`, `LOG_SAVEPOINT`, postpone markers (`LOG_DUMMY_HEAD_POSTPONE`, `LOG_COMMIT_WITH_POSTPONE`/`_OBSOLETE`, `LOG_SYSOP_START_POSTPONE`), checkpoint, 2PC decision/inform, HA/replication/dummy types, `LOG_SUPPLEMENTAL_INFO`, `LOG_SYSOP_ATOMIC_START`, `LOG_END_OF_LOG`	explicit no-op `break`
`LOG_MVCC_UNDOREDO_DATA`, `LOG_MVCC_DIFF_UNDOREDO_DATA`	bump `mvcc_next_id` past mvccid, set `mvcc_op_log_lsa = rcv_lsa` (vacuum); invoke
`LOG_MVCC_REDO_DATA`	bump `mvcc_next_id` only — vacuum reads undo data; invoke
`LOG_REDO_DATA`	`RVVAC_COMPLETE` -> `logpb_vacuum_reset_log_header_cache`; invoke
`LOG_DBEXTERN_REDO_DATA`	page-less (`pgptr = NULL`, `offset = -1`); gated by the skip check below; applies via `log_rv_redo_record`
`LOG_2PC_PREPARE`	missing tran/tdes -> `break`; else `log_2pc_read_prepare` re-reads the gtrid, with `LOG_2PC_OBTAIN_LOCKS` only in state `TRAN_UNACTIVE_2PC_PREPARE`
`LOG_2PC_START`	rebuild coordinator info if tran alive and `LOG_ISTRAN_2PC`; alloc failure -> fatal + `break`
`LOG_COMMIT`, `LOG_ABORT`	assert-only: completed non-system tran must be gone
`LOG_MVCC_UNDO_DATA`	bookkeeping only — `mvcc_next_id`, `mvcc_op_log_lsa`; not applied
`LOG_SYSOP_END`	`LOG_SYSOP_END_LOGICAL_MVCC_UNDO` -> `mvcc_op_log_lsa = rcv_lsa`
`default` (+`LOG_SMALLER/LARGER_LOGREC_TYPE`)	`er_set (ER_LOG_PAGE_CORRUPTED)`; null `lsa` if `forw_lsa` pointed back at this record

log_recovery_needs_skip_logical_redo, the repeated-crash defense, has three early false returns — wrong rectype, NULL_TRAN_INDEX, NULL tdes — and one true path:

// log_recovery_needs_skip_logical_redo -- src/transaction/log_recovery.c
  if (LSA_LT (&tdes->rcv.analysis_last_aborted_sysop_start_lsa, lsa)
      && LSA_LT (lsa, &tdes->rcv.analysis_last_aborted_sysop_lsa))
    {
      /* ... condensed: er_log_debug ... */
      return true;   /* <- strictly inside a sysop a previous recovery already aborted */
    }

An LSA outside the window falls through to the trailing return false. Analysis stamped the endpoints (Chapter 5); the record and its compensation already sit in the log from a previous recovery cycle.

Tail sequence. (SERVER_MODE) parallel_recovery_redo->wait_for_termination_and_stop_execution () drains every async job; LOG_CS_ENTER; log_Gl.mvcc_table.reset_start_mvccid () recomputes the MVCC baseline; the Chapter 8 hand-off (log_recovery_abort_all_atomic_sysops, log_recovery_finish_all_postpone); then logpb_flush_pages_direct, logpb_flush_header, pgbuf_flush_all. The exit: label — also the past-the-end target — nulls curr_rcv_rec_lsa, runs the consistency-check cleanup (), reports perf stats.

6.4 log_rv_redo_record_sync — fix, extract, apply

// log_rv_redo_record_sync -- src/transaction/log_recovery_redo.hpp
  // ... condensed: debug-only vpid_lsa_consistency_check.check (rcv_vpid, m_start_lsa) ...
  const LOG_DATA &log_data = log_rv_get_log_rec_data<T> (record_info.m_logrec);
  LOG_RCV rcv;
  if (!log_rv_fix_page_and_check_redo_is_needed (thread_p, rcv_vpid, rcv, log_data.rcvindex,
      record_info.m_start_lsa, redo_context.m_end_redo_lsa))
    {
      // ... condensed: assert (rcv.pgptr == nullptr) ...
      return;   /* <- page gone, or change already on disk */
    }
  scope_exit unfix_rcv_pgptr { [&thread_p, &rcv] ()
    { pgbuf_unfix_and_init_after_check (thread_p, rcv.pgptr); } };  /* <- unfix on every exit */
  // ... condensed: rcv field extractors; payload assembly ...
  rvfun::fun_t redofunc = log_rv_get_fun<T> (record_info.m_logrec, log_data.rcvindex);

The condensed tail: payload-assembly error -> logpb_fatal_error + return (the scope_exit still unfixes); non-null redofunc runs under perfmon_counter_timer_raii_tracker (PSTAT_LOG_REDO_FUNC_EXEC), failure -> logpb_fatal_error; null redofunc -> er_log_debug warning only; a non-null rcv.pgptr is then stamped with m_start_lsa via pgbuf_set_lsa.

The gatekeeper log_rv_fix_page_and_check_redo_is_needed has three outcomes: (1) non-null VPID but log_rv_redo_fix_page returns null — assert (log_is_in_crash_recovery ()), return false, the deallocated-page skip; (2) page fixed but rcv_lsa <= *pgbuf_get_lsa (rcv.pgptr) — pgbuf_unfix_and_init, return false, change already on disk (an assert also rejects page LSAs beyond end_redo_lsa); (3) otherwise return true, including the null-VPID fall-through that leaves rcv.pgptr == nullptr for page-less records. log_rv_redo_fix_page fixes in RECOVERY_PAGE mode with no sector-reservation check — sector tables replay in parallel, so a page may transiently look deallocated; the check costs more than skipping saves; NULL is assert_release material (“this is terrible, because it makes recovery impossible”).

Invariant — redo idempotence via page LSA. Skip when rcv_lsa <= page LSA, stamp m_start_lsa after applying; break the stamp and every later crash double-applies non-idempotent redo.
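
A toy page shows why the stamp matters for a redo that is not naturally idempotent. The `Page` struct and `redo_increment` below are illustrative assumptions; the skip test and the stamp follow the invariant as stated.

```cpp
#include <cassert>
#include <cstdint>

using Lsa = std::int64_t;	// assumption: LSA flattened to one integer

struct Page
{
  Lsa lsa;	// LSA of the last change applied to the page
  int value;	// toy payload
};

// Applies a non-idempotent redo (an increment) only if the page has not seen it yet.
inline bool redo_increment (Page &page, Lsa rcv_lsa, int delta)
{
  if (rcv_lsa <= page.lsa)
    {
      return false;		// change already on the page: skip
    }
  page.value += delta;
  page.lsa = rcv_lsa;		// stamp, so a later replay of this record is skipped
  return true;
}
```

Replaying the same record twice changes the page once. Remove the stamp line and the second replay would increment again.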

Six extractor template families — primaries uninstantiable via static_assert (sizeof (T) == 0) — flatten six record shapes into one generic routine. The outlier: log_rv_get_fun<LOG_REC_COMPENSATE> returns RV_fun[rcvindex].undofun (“yes, undo” in source) — a CLR’s redo payload is the undo image, so replay runs the undo function: ARIES repeating history (companion, “Compensation log records”).

`T`	`_data` / `_vpid` / `_offset` from	`_mvccid`	`_redo_length`	`log_rv_get_fun`
`LOG_REC_MVCC_UNDOREDO`	`undoredo.data`	`mvccid`	`undoredo.rlength`	`redofun`
`LOG_REC_UNDOREDO`	`data`	`MVCCID_NULL`	`rlength`	`redofun`
`LOG_REC_MVCC_REDO`	`redo.data`	`mvccid`	`redo.length`	`redofun`
`LOG_REC_REDO`	`data`	`MVCCID_NULL`	`length`	`redofun`
`LOG_REC_RUN_POSTPONE`	`data`	`MVCCID_NULL`	`length`	`redofun`
`LOG_REC_COMPENSATE`	`data`	`MVCCID_NULL`	`length`	`undofun`
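
The extractor pattern, an uninstantiable primary template plus one explicit specialization per record type, can be sketched as follows. The record structs, `RvFunEntry`, and the toy functions are hypothetical stand-ins; only the `static_assert (sizeof (T) == 0)` idiom and the choice of undofun for the compensate type come from the text.

```cpp
#include <cassert>

struct RecRedo {};		// stand-in for LOG_REC_REDO
struct RecCompensate {};	// stand-in for LOG_REC_COMPENSATE

using RvFun = int (*) (int);

struct RvFunEntry
{
  RvFun undofun;
  RvFun redofun;
};

// Primary template: instantiating it for an unlisted T is a compile error.
template <typename T>
RvFun get_fun (const RvFunEntry &)
{
  static_assert (sizeof (T) == 0, "no extractor specialization for this record type");
  return nullptr;
}

template <>
inline RvFun get_fun<RecRedo> (const RvFunEntry &entry)
{
  return entry.redofun;
}

// A compensation record carries an undo image, so its replay runs the undo function.
template <>
inline RvFun get_fun<RecCompensate> (const RvFunEntry &entry)
{
  return entry.undofun;
}

inline int toy_undo (int value) { return value - 1; }
inline int toy_redo (int value) { return value + 1; }
```

Because the assertion depends on `T`, it fires only when the primary is actually instantiated, so a record type without a specialization fails at compile time rather than at recovery time.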

6.5 Payload assembly — unzip, diff, hand off

log_rv_get_log_rec_redo_data<T> decodes the payload. The four single-image specializations (LOG_REC_MVCC_REDO, LOG_REC_REDO, LOG_REC_RUN_POSTPONE, LOG_REC_COMPENSATE) call log_rv_get_unzip_and_diff_redo_log_data with no undo data; LOG_REC_MVCC_UNDOREDO re-wraps its embedded undoredo member as a log_rv_redo_rec_info<LOG_REC_UNDOREDO> and delegates. Only LOG_REC_UNDOREDO branches — on m_type, not T: for the two DIFF rectypes (need_diff_with_undo) it first unzips the undo image into m_undo_zip via log_rv_get_unzip_log_data (fatal + return on error), aligns, and passes m_undo_zip.data_length / .log_data on; otherwise it skips the unneeded undo image (m_reader.skip (GET_ZIP_LEN (ulength)), fatal + ER_FAILED on error), aligns, and passes (0, nullptr).

log_rv_get_unzip_log_data decodes one image, branch-complete. The length field’s sign bit is the compression flag — MAKE_ZIP_LEN sets 0x80000000 at logging time, ZIP_CHECK tests it, GET_ZIP_LEN strips it; even the skip path above goes through GET_ZIP_LEN. is_zip = ZIP_CHECK (length); an image that does_fit_in_current_page is aliased straight off the page buffer, a spanning one is heap-copied via copy_from_log. Compressed -> log_unzip (failure: fatal + ER_FAILED); uncompressed -> log_zip_realloc_if_needed (failure fatal) + memcpy. Finally add_align in the fits case, bare align () in the copy case since copy_from_log already advanced the reader.
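
The flag-in-the-length encoding is easy to model. The functions below mirror the roles of MAKE_ZIP_LEN, ZIP_CHECK, and GET_ZIP_LEN but are a sketch, not the macros themselves; the length is held in an unsigned 32-bit value here (an assumption) to keep the bit operations free of signed-conversion concerns.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t ZIP_FLAG = 0x80000000u;	// top bit of the 32-bit length field

// Logging side: mark a length as belonging to a compressed image.
constexpr std::uint32_t make_zip_len (std::uint32_t length)
{
  return length | ZIP_FLAG;
}

// Recovery side: is the image compressed?
constexpr bool zip_check (std::uint32_t length)
{
  return (length & ZIP_FLAG) != 0;
}

// Recovery side: the byte count with the flag stripped; safe on unflagged lengths too.
constexpr std::uint32_t get_zip_len (std::uint32_t length)
{
  return length & ~ZIP_FLAG;
}
```

Stripping is harmless on an uncompressed length, which is why a skip path can call the strip step unconditionally.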

log_rv_get_unzip_and_diff_redo_log_data layers the diff on top: after log_rv_get_unzip_log_data into the caller’s redo_unzip (failure: fatal + ER_FAILED), it un-diffs only if (is_zip) and only when undo_length > 0 && undo_data != nullptr — log_diff (undo_length, undo_data, redo_unzip.data_length, redo_unzip.log_data) — then hands off rcv->length / rcv->data, borrowing m_redo_zip storage. The is_zip gate works because diffed redo exists only compressed: at append time the XOR runs on a scratch copy and the DIFF rectype is set only when is_redo_zip; a failed compression writes the original un-diffed crumbs with the bit clear. log_unzip reads the original-length prefix log_zip stored, rejects buf_size <= 0, fails if log_zip_realloc_if_needed cannot grow the destination, LZ4-decompresses, and succeeds only when unzip_len == buf_size — short or negative means corruption, not truncation. log_diff is *(p++) ^= *(q++) over MIN (undo_length, redo_length) bytes — XOR is its own inverse, so one routine serves both directions.
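
That one XOR routine serves both directions can be checked directly. The sketch uses `std::vector` buffers and a hypothetical `xor_diff` name; the loop over the shorter of the two lengths follows the description of log_diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// XORs the undo image into the redo buffer over the shorter of the two lengths.
// Applying it twice with the same undo image restores the original redo bytes.
inline void xor_diff (const std::vector<unsigned char> &undo, std::vector<unsigned char> &redo)
{
  const std::size_t n = std::min (undo.size (), redo.size ());
  for (std::size_t i = 0; i < n; ++i)
    {
      redo[i] ^= undo[i];
    }
}
```

Bytes of the redo image beyond the undo length pass through unchanged in both directions.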

The page-less twin log_rv_redo_record (the LOG_DBEXTERN_REDO_DATA arm) runs the same assemble-then-call sequence without the fix/skip gate: payload failure -> fatal + return; redofun failure -> fatal too; redofun == NULL -> debug warning; rcv->pgptr != NULL -> pgbuf_set_lsa (vacuous here — pgptr is NULL).

6.6 Chapter summary — key takeaways

log_rv_redo_context is the whole redo state — one log cursor plus two pre-allocated LOG_ZIP buffers whose storage rcv.data borrows; share-nothing copies enable Chapter 7’s workers.
The switch also bookkeeps mvcc_next_id, mvcc_op_log_lsa (undo-bearing records only), the RVVAC_COMPLETE reset, and the logical-redo skip window.
Idempotence: skip when rcv_lsa <= page LSA, stamp m_start_lsa after apply; log_rv_redo_fix_page deliberately accepts deallocated pages.
Six extractor families flatten six record shapes into one apply routine; log_rv_get_fun<LOG_REC_COMPENSATE> returns the undofun — a CLR replays as an undo.
One sign bit encodes compression (MAKE_ZIP_LEN/ZIP_CHECK/GET_ZIP_LEN); diffed redo exists only compressed, so log_diff (XOR) runs only when is_zip.
Nothing after the loop runs until wait_for_termination_and_stop_execution drains parallel redo; only then reset_start_mvccid, Chapter 8 finishing, and the flushes.

Chapter 7: Parallel Redo Infrastructure

Chapter 6’s driver hands each redoable record to log_rv_redo_record_sync_or_dispatch_async; only concrete-page, non-volume records go async, and per-page LSA order is inherited from push order because every VPID hashes to a fixed task.

7.1 Dispatch — log_rv_redo_record_sync_or_dispatch_async

Instantiated per record type by INVOKE_REDO_RECORD:

// log_rv_redo_record_sync_or_dispatch_async -- src/transaction/log_recovery_redo_parallel.hpp
  const VPID rcv_vpid = log_rv_get_log_rec_vpid<T> (record_info.m_logrec);
#if defined (SERVER_MODE)
  // ... condensed: log_data ref ...
  const bool need_sync_redo = log_rv_need_sync_redo (rcv_vpid, log_data.rcvindex);
  // ... condensed: PREP perf tick ...
  if (parallel_recovery_redo == nullptr || need_sync_redo)
    {
      log_rv_redo_record_sync<T> (thread_p, redo_context, record_info, rcv_vpid);
      // ... condensed: DO_SYNC perf tick ...
    }
  else
    {
      cublog::redo_job_impl *const job = a_reusable_jobs.blocking_pop (a_rcv_redo_perf_stat);
      assert (job != nullptr);
      job->set_record_info (rcv_vpid, record_info.m_start_lsa, record_info.m_type);
      parallel_recovery_redo->add (job);
      // ... condensed: DO_ASYNC perf tick ...
    }
#else // !SERVER_MODE = SA_MODE
  log_rv_redo_record_sync<T> (thread_p, redo_context, record_info, rcv_vpid);
#endif

SA_MODE compiles the cublog classes to empty dummies; Figure 7-1 covers every exit. The predicate:

// log_rv_need_sync_redo -- src/transaction/log_recovery.c
  if (VPID_ISNULL (&a_rcv_vpid))
    {
      return true;   /* <- no target page to hash */
    }
  switch (a_rcvindex)
    {
    case RVDK_NEWVOL:  // ... condensed: RVDK_FORMAT, RVDK_INITMAP, RVDK_EXPAND_VOLUME, RVDK_VOLHEAD_EXPAND ...
      return true;     /* <- see Inv 7-A */
    case RVDK_RESERVE_SECTORS:  // ... condensed: RVDK_UNRESERVE_SECTORS ...
      return true;     /* <- "may be changed to async" */
    default:
      return false;
    }

Invariant 7-A (sync record as happens-before barrier). The main thread applies a sync record before pushing any later job; new-volume pages appear only in later records, so no worker can fix a page of a volume whose creation is still unexecuted.

flowchart TD
  A["record"] --> B{"SERVER_MODE?"}
  B -- "no (SA)" --> S1["sync apply"]
  B -- yes --> C{"infra null?"}
  C -- yes --> S1
  C -- no --> D{"log_rv_need_sync_redo"}
  D -- "null VPID or RVDK volume, sector" --> S1
  D -- false --> E["blocking_pop + set_record_info"]
  E --> G["add: hash vpid to fixed task"]

Figure 7-1: dispatch exits.
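The exits of Figure 7-1 can be condensed into a pure decision function. The sketch below is a standalone model, not CUBRID code: `redo_route`, `dispatch_input`, `choose_redo_route` and `check_dispatch_model` are hypothetical names introduced for illustration, and the two boolean inputs `vpid_is_null` / `rcvindex_forced` stand in for what `log_rv_need_sync_redo` computes from the real VPID and rcvindex.

```cpp
#include <cassert>

// Hypothetical model of the exits in Figure 7-1; names are illustrative only.
enum class redo_route { SYNC, ASYNC };

struct dispatch_input
{
  bool server_mode;       // false models an SA_MODE build
  bool infra_present;     // false models parallel_recovery_redo == nullptr
  bool vpid_is_null;      // record has no target page to hash
  bool rcvindex_forced;   // rcvindex is one of the RVDK volume/sector cases
};

// Stand-in for log_rv_need_sync_redo: null VPID or a forced rcvindex.
inline bool need_sync_redo (const dispatch_input &in)
{
  return in.vpid_is_null || in.rcvindex_forced;
}

inline redo_route choose_redo_route (const dispatch_input &in)
{
  if (!in.server_mode || !in.infra_present || need_sync_redo (in))
    {
      return redo_route::SYNC;    // applied by the main thread
    }
  return redo_route::ASYNC;       // handed to a worker task
}

// Usage: one call per path of the figure.
inline void check_dispatch_model ()
{
  assert (choose_redo_route ({false, true, false, false}) == redo_route::SYNC);   // SA build
  assert (choose_redo_route ({true, false, false, false}) == redo_route::SYNC);   // no infra
  assert (choose_redo_route ({true, true, true, false}) == redo_route::SYNC);     // null VPID
  assert (choose_redo_route ({true, true, false, true}) == redo_route::SYNC);     // volume/sector op
  assert (choose_redo_route ({true, true, false, false}) == redo_route::ASYNC);   // page-bound record
}
```

Only the last combination reaches a worker; every other input collapses onto the synchronous path.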

7.2 Sizing and construction — redo_parallel

log_recovery_redo registers pool demand via REGISTER_WORKERPOOL and builds the parallel infrastructure once, before the forward scan: reusable_jobs.initialize (count) plus new cublog::redo_parallel (count, false, MAX_LSA, redo_context); the false/MAX_LSA pair disables monitoring (7.8). The count:

// log_recovery_get_redo_parallel_count -- src/transaction/log_recovery.c
  const int num_cpus = cubthread::system_core_count ();
  const int minimum_threads_to_redo = 16;  /* <- "determined experimentally" */
  return MAX (minimum_threads_to_redo, num_cpus);

The floor of 16 oversubscribes machines with fewer than 16 cores, which is tolerable because the workers are I/O-bound. The constructor runs do_init_worker_pool (workers = slots = a_task_count), then do_init_tasks and the monitor.

Field	Role	Why it exists
`m_task_count`	VPID-binning modulus	Fixed at construction (Inv 7-B)
`m_pool_entry_manager`	`TT_RECOVERY` entry factory	Workers need real entries
`m_task_state_bookkeeping`	Bitset of active tasks	Unbounded wait (7.7)
`m_worker_pool`	Worker pool pointer	Owns OS threads
`m_redo_tasks`	`vector<unique_ptr<redo_task>>`	Owner-managed; perf stats survive
`m_vpid_hash`	`std::hash<VPID>`	Binning function of `add`
`m_min_unapplied_log_lsa_calculation`	Progress monitor (7.8)	Replication only

7.3 The VPID hash — order without locks

// redo_parallel::add -- src/transaction/log_recovery_redo_parallel.cpp
  const std::size_t task_index = m_vpid_hash (a_job->get_vpid ()) % m_task_count;
  redo_task *const task = m_redo_tasks[task_index].get ();
  task->push_job (a_job);

Invariant 7-B (per-page order from push order). The main thread pushes in increasing LSA order, a VPID always hashes to the same task (m_task_count is immutable), and each task drains FIFO — per-page apply order is log order, lock-free. Break any leg and two workers race on one page, masked by the rcv_lsa <= page_lsa skip. Cross-page order is not preserved.
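Invariant 7-B can be checked on a toy model. The following is a minimal sketch under stated assumptions: `page_id`, `job`, `hash_page`, `bin_jobs` and `per_queue_lsa_order_holds` are hypothetical names, the hash is an arbitrary deterministic function rather than CUBRID's `std::hash<VPID>` specialization, and LSAs are reduced to plain integers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-ins for VPID and a queued job; not the CUBRID types.
struct page_id { std::int32_t pageid; std::int16_t volid; };
struct job { page_id vpid; std::int64_t lsa; };

// Assumed hash: any deterministic function of the VPID supports the argument.
inline std::size_t hash_page (const page_id &p)
{
  const std::uint64_t key = (static_cast<std::uint64_t> (static_cast<std::uint16_t> (p.volid)) << 32)
                            | static_cast<std::uint32_t> (p.pageid);
  return std::hash<std::uint64_t> {} (key);
}

// Push jobs, already in increasing LSA order, into task_count FIFO queues.
inline std::vector<std::vector<job>> bin_jobs (const std::vector<job> &in_lsa_order, std::size_t task_count)
{
  assert (task_count > 0);
  std::vector<std::vector<job>> queues (task_count);
  for (const job &j : in_lsa_order)
    {
      queues[hash_page (j.vpid) % task_count].push_back (j);
    }
  return queues;
}

// True when every queue is in increasing LSA order. Because all jobs of one
// page share one queue, this implies per-page log order.
inline bool per_queue_lsa_order_holds (const std::vector<std::vector<job>> &queues)
{
  for (const auto &q : queues)
    {
      for (std::size_t i = 1; i < q.size (); ++i)
        {
          if (q[i - 1].lsa >= q[i].lsa)
            {
              return false;
            }
        }
    }
  return true;
}
```

The check says nothing about the relative order of two different queues, which matches the statement that cross-page order is not preserved.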

redo_job_base — the queueable unit:

Field	Role	Why it exists
`m_vpid`	Target page; null when defaulted	Binning key; `get_vpid` asserts non-null
`m_log_lsa`	LSA of the record	Where to re-read from (7.5); progress marker (7.8)

redo_task::push_job sets the unapplied marker only when monitoring is armed and the queue was empty (crash recovery passes false, so never), and notifies only past PRM_ID_RECOVERY_REDO_MINIMUM_JOB_COUNT (hidden, default 100).

7.4 redo_task::execute

redo_task (.cpp-private cubthread::task):

Field	Role	Why it exists
`m_task_idx`	Identity 0..N-1	Index into bitset and push vectors
`m_do_monitor_unapplied_log_lsa`	Maintain marker or not	Recovery passes `false`
`m_task_state_bookkeeping`	Ref to owner’s bitset	Set in ctor, cleared after drain
`m_perf_stats_definition` / `m_perf_stats`	Per-task counters	Timings in 7.9
`m_redo_context`	Private context copy	Own reader + zip buffers (7.5)
`m_produce_vec` (+`_mtx`, `_cv`)	Job queue; reserves ONE_M	Swap: one lock per batch
`m_adding_finished`	End-of-stream flag, set under mutex	Checked only when queue empty
`m_unapplied_log_lsa`	`atomic<log_lsa>`, `MAX_LSA` idle	Feeds global minimum (7.8)

// redo_task::execute -- src/transaction/log_recovery_redo_parallel.cpp
  for ( ; ; )
    {
      bool adding_finished { false };
      pop_jobs (jobs_vec, adding_finished);
      if (jobs_vec.empty () && adding_finished)
        {
          break;                  /* <- only exit */
        }
      else
        {
          assert (!jobs_vec.empty ());
          THREAD_ENTRY *const thread_entry = &context;
          for (auto &job : jobs_vec)
            {
              // ... condensed: marker update ...
              job->execute (thread_entry, m_redo_context);
              job->retire (m_task_idx);
            }
          jobs_vec.clear ();      /* <- jobs already recycled */
        }
    }
  m_task_state_bookkeeping.set_inactive (m_task_idx);

pop_jobs asserts its post-condition: the returned batch is empty exactly when adding has finished (non-empty xor finished), which is what lets the caller's else branch assert a non-empty batch. Its 1 s wait_for period (PRM_ID_RECOVERY_REDO_JOB_PERIOD_IN_SECS, hidden) drains the un-notified trickle; notify_adding_finished flips the flag under the same mutex, so no wakeup is lost.

flowchart TD
  W["wait_for, 1 s period"] --> P{"queue empty?"}
  P -- no --> SW["swap into local jobs_vec"]
  P -- yes --> MK["park marker, monitored only"]
  MK --> F{"m_adding_finished?"}
  F -- no --> W
  F -- yes --> Z["return empty + finished"]
  SW --> EX["per job: execute, retire"]
  EX --> W
  Z --> IN["set_inactive, cv notify"]

Figure 7-2: pop_jobs and drain loop exits.
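The swap-and-drain protocol of Figure 7-2 can be modelled with a mutex, a condition variable and two vectors. This is a simplified sketch, not the CUBRID class: `swap_queue` and its members are hypothetical, jobs are plain `int`s, the period is hard-coded to 1 s, and unlike the real `push_job` (which notifies only past a job-count threshold) this model notifies on every push.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <vector>

// Hypothetical model of a task queue drained by swapping; not the CUBRID class.
class swap_queue
{
  public:
    void push (int job)
    {
      std::lock_guard<std::mutex> lock (m_mtx);
      assert (!m_adding_finished);          // pushing after end-of-stream is a bug
      m_produce.push_back (job);
      m_cv.notify_one ();
    }

    void notify_adding_finished ()
    {
      std::lock_guard<std::mutex> lock (m_mtx);
      m_adding_finished = true;             // flipped under the same mutex as the wait
      m_cv.notify_one ();
    }

    // Post-condition: out is empty exactly when finished is true.
    void pop_jobs (std::vector<int> &out, bool &finished)
    {
      assert (out.empty ());
      std::unique_lock<std::mutex> lock (m_mtx);
      while (m_produce.empty () && !m_adding_finished)
        {
          m_cv.wait_for (lock, std::chrono::seconds (1));   // periodic wake-up
        }
      out.swap (m_produce);                 // one lock acquisition per batch
      finished = out.empty ();              // the flag only matters once the queue is empty
      assert (!finished || m_adding_finished);
    }

  private:
    std::mutex m_mtx;
    std::condition_variable m_cv;
    std::vector<int> m_produce;
    bool m_adding_finished = false;
};
```

A consumer loops on `pop_jobs`, processes the batch, clears it, and exits on the first call that reports `finished`; jobs pushed before `notify_adding_finished` are always delivered first.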

7.5 redo_job_impl::execute — the re-fetch

`redo_job_impl` field	Role	Why it exists
`m_reusable_job_stack`	Pool back-pointer, “guaranteed to outlive this instance”	`retire` = `push (a_task_idx, this)`
`m_log_rtype`	`LOG_RECTYPE` stamped by `set_record_info`	Selects the `log_rec_*` layout to re-read

// redo_job_impl::execute -- src/transaction/log_recovery_redo_parallel.cpp
  const int err_fetch =
    redo_context.m_reader.set_lsa_and_fetch_page (get_log_lsa (), redo_context.m_reader_fetch_page_mode);
  if (err_fetch != NO_ERROR)
    {
      return err_fetch;          /* <- sole error exit */
    }
  redo_context.m_reader.add_align (sizeof (LOG_RECORD_HEADER));
  switch (m_log_rtype)
    {
    case LOG_REDO_DATA:
      read_record_and_redo<log_rec_redo> (thread_p, redo_context);
      break;
    // ... condensed: 7 more labels (MVCC/diff undoredo, RUN_POSTPONE, COMPENSATE) ...
    default:
      assert (false);            /* <- unreachable */
    }

The eight labels are Chapter 6’s page-bound redoable types; read_record_and_redo<T> re-parses the typed header and funnels into log_rv_redo_record_sync<T>, the sync path’s sink. That error exit is swallowed: redo_task::execute ignores the return, so a failed fetch silently skips the record’s redo. log_rv_redo_context is copy-constructible, not assignable: each task owns a private reader and zip buffers.

7.6 reusable_jobs_stack — recycling

Field	Role	Why it exists
`m_flush_push_at_count`	`PARALLEL_REDO_REUSABLE_JOBS_FLUSH_BACK_COUNT` (ONE_K)	One mutex touch per ~1024 retires
`m_job_pool`	`vector<redo_job_impl>` of `PARALLEL_REDO_REUSABLE_JOBS_COUNT` (ONE_M)	The only allocation
`m_pop_jobs`	Consumer stack, popped unsynchronized	Single consumer (Inv 7-C)
`m_push_jobs` (+`m_push_mtx`, `m_push_jobs_available_cv`)	Shared return bin	Sole synchronized hand-off
`m_per_task_push_jobs_vec`	One private vector per task	Lock-free retire fast path

// reusable_jobs_stack::blocking_pop -- src/transaction/log_recovery_redo_parallel.cpp
  if (!m_pop_jobs.empty ())
    {
      redo_job_impl *const pop_job = m_pop_jobs.back ();
      m_pop_jobs.pop_back ();        /* <- no lock */
      return pop_job;
    }
  else
    {
      {
        std::unique_lock<std::mutex> locku { m_push_mtx };
        // ... condensed: cv wait until !m_push_jobs.empty () ...
        m_pop_jobs.swap (m_push_jobs);   /* <- O(1) refill */
      }
      // ... condensed: pop_back ...
    }

push (a_task_idx, a_job) mirrors it: append to the caller’s private vector; only past m_flush_push_at_count lock, bulk-insert, clear, notify_one. The slow path is the backpressure valve: when all ONE_M jobs are in flight the main thread blocks until a batch returns.

Invariant 7-C (single consumer, conservation of jobs). m_pop_jobs is popped unsynchronized because only the recovery main thread calls blocking_pop. The destructor asserts pop + push + sum(per_task) == m_job_pool.size (): a job executed but never retired trips it.
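The recycling scheme and its conservation property can be illustrated with a reduced model. Everything here is an assumption made for the sketch: `job_pool`, `try_pop`, `at_rest` and `capacity` are hypothetical names, an `int` stands in for `redo_job_impl`, and `try_pop` returns `nullptr` where the real `blocking_pop` would wait on the condition variable.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical, simplified recycling pool; sizes and names are illustrative.
class job_pool
{
  public:
    job_pool (std::size_t job_count, std::size_t task_count, std::size_t flush_at)
      : m_jobs (job_count), m_per_task (task_count), m_flush_at (flush_at)
    {
      assert (flush_at > 0);
      for (int &j : m_jobs)
        {
          m_pop.push_back (&j);
        }
    }

    // Single consumer. Returns nullptr instead of blocking when nothing is available.
    int *try_pop ()
    {
      if (m_pop.empty ())
        {
          std::lock_guard<std::mutex> lock (m_push_mtx);
          m_pop.swap (m_push);              // O(1) refill from the shared return bin
        }
      if (m_pop.empty ())
        {
          return nullptr;
        }
      int *const job = m_pop.back ();
      m_pop.pop_back ();
      return job;
    }

    // Called by task task_idx only: no lock until the private vector fills up.
    void push (std::size_t task_idx, int *job)
    {
      std::vector<int *> &mine = m_per_task[task_idx];
      mine.push_back (job);
      if (mine.size () >= m_flush_at)
        {
          std::lock_guard<std::mutex> lock (m_push_mtx);
          m_push.insert (m_push.end (), mine.begin (), mine.end ());
          mine.clear ();
        }
    }

    // Jobs currently at rest in the pool; call only when no task is running.
    std::size_t at_rest () const
    {
      std::size_t n = m_pop.size () + m_push.size ();
      for (const auto &v : m_per_task)
        {
          n += v.size ();
        }
      return n;
    }

    std::size_t capacity () const { return m_jobs.size (); }

  private:
    std::vector<int> m_jobs;                // the only allocation
    std::vector<int *> m_pop;               // consumer stack
    std::vector<int *> m_push;              // shared return bin
    std::vector<std::vector<int *>> m_per_task;
    std::size_t m_flush_at;
    std::mutex m_push_mtx;
};
```

Once every job has been retired, `at_rest () == capacity ()` is the model's counterpart of the destructor assertion; a retired job that sits below the flush threshold stays invisible to the consumer until its owner flushes.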

7.7 task_active_state_bookkeeping and termination

Field	Role	Why it exists
`m_size`	Task count, asserted `< BITSET_MAX_SIZE` (256)	Bounds-checks indices
`m_values`	`std::bitset<256>`, bit per task	`set_active`/`set_inactive` assert prior state
`m_values_mtx` / `m_values_cv`	Guard + wakeup	`wait_for_termination` sleeps until `m_values.none ()`

The pool’s own wait asserts after “a hardcoded maximum wait time (60 seconds)”; this private bookkeeping waits unbounded. Tasks set their bit in the constructor, so an early wait cannot miss not-yet-started tasks. Shutdown:

// redo_parallel::wait_for_termination_and_stop_execution -- src/transaction/log_recovery_redo_parallel.cpp
    for (auto &redo_task: m_redo_tasks)
      {
        redo_task->notify_adding_finished ();
      }
    m_task_state_bookkeeping.wait_for_termination ();
    // ... condensed: assert every task is_idle ...
    m_worker_pool->stop_execution ();
    // ... condensed: get_manager ()->destroy_worker_pool ...

redo_task::retire is a no-op (“avoid self destruct”) so per-task perf stats stay readable; WAIT_FOR_PARALLEL times the straggler wait. Both ends assert the ordering: push_job asserts !m_adding_finished, and ~redo_parallel asserts no active task and a null pool — this blocking call is mandatory before destruction.
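The bookkeeping of 7.7 reduces to a guarded bitset with an untimed wait. The class below is a sketch under assumptions: `active_tasks`, `MAX_TASKS` and `any_active` are hypothetical names, and the assertions mirror the "assert prior state" behaviour described in the table rather than reproduce the source.

```cpp
#include <bitset>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical model of an active-task bitset with an unbounded termination wait.
class active_tasks
{
  public:
    static constexpr std::size_t MAX_TASKS = 256;

    explicit active_tasks (std::size_t size) : m_size (size)
    {
      assert (size < MAX_TASKS);
    }

    void set_active (std::size_t idx)
    {
      std::lock_guard<std::mutex> lock (m_mtx);
      assert (idx < m_size && !m_bits.test (idx));   // must not already be active
      m_bits.set (idx);
    }

    void set_inactive (std::size_t idx)
    {
      std::lock_guard<std::mutex> lock (m_mtx);
      assert (idx < m_size && m_bits.test (idx));    // must currently be active
      m_bits.reset (idx);
      m_cv.notify_all ();
    }

    bool any_active ()
    {
      std::lock_guard<std::mutex> lock (m_mtx);
      return m_bits.any ();
    }

    // Sleeps with no timeout until every bit is cleared.
    void wait_for_termination ()
    {
      std::unique_lock<std::mutex> lock (m_mtx);
      m_cv.wait (lock, [this] { return m_bits.none (); });
    }

  private:
    std::size_t m_size;
    std::bitset<MAX_TASKS> m_bits;
    std::mutex m_mtx;
    std::condition_variable m_cv;
};
```

Setting the bit before the task can run (in the model, before any worker thread exists) is what makes an early `wait_for_termination` safe: the waiter cannot observe an all-clear bitset while a task is still pending.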

7.8 min_unapplied_log_lsa_monitoring

Dormant in crash recovery (false, MAX_LSA); armed when the same infrastructure replicates on a page server. The constructor asserts the pairing: monitoring needs a valid starting LSA; no monitoring, MAX_LSA.

Field	Role	Why it exists
`m_do_monitor`	Master switch	Asserted by every method
`m_main_thread_unapplied_log_lsa`	`atomic<log_lsa>` advanced by dispatcher	Sync records bypass task queues
`m_redo_tasks`	Const ref to task vector	`calculate` reads each task’s marker
`m_calculated_log_lsa`	Last global minimum, under `m_calculate_mtx`	What waiters compare against
`m_calculate_mtx` / `m_calculate_cv` / `m_terminate_calculation` / `m_calculate_thread`	Calculation-thread plumbing	Guard, cv, stop flag, thread

calculate minimizes the main-thread LSA against the task markers, skipping idle ones (MAX_LSA). wait_past_target_log_lsa has two exits: an unlocked fast path when a_target_lsa < m_calculated_log_lsa; otherwise notify_all (cutting short the calculation thread’s 10 ms nap) and block until the minimum passes the target. redo_parallel::wait_past_target_lsa and set_main_thread_unapplied_log_lsa are forwarders.
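The minimum computation itself is small enough to state directly. A minimal sketch, assuming LSAs are modelled as 64-bit integers; `lsa_t`, `MAX_LSA_MODEL`, `calculate_min_unapplied` and `is_past_target` are hypothetical names, not the CUBRID API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

using lsa_t = std::int64_t;                                            // simplified LSA
constexpr lsa_t MAX_LSA_MODEL = std::numeric_limits<lsa_t>::max ();   // idle marker

// Minimum of the main-thread marker and every busy task marker.
inline lsa_t calculate_min_unapplied (lsa_t main_thread_lsa, const std::vector<lsa_t> &task_markers)
{
  assert (main_thread_lsa >= 0);   // the model only uses non-negative LSAs
  lsa_t min_lsa = main_thread_lsa;
  for (const lsa_t marker : task_markers)
    {
      if (marker == MAX_LSA_MODEL)
        {
          continue;                // idle task: nothing queued, nothing to hold back
        }
      min_lsa = std::min (min_lsa, marker);
    }
  return min_lsa;
}

// Fast-path test: a target strictly below the global minimum is already applied.
inline bool is_past_target (lsa_t target_lsa, lsa_t calculated_min)
{
  return target_lsa < calculated_min;
}
```

The main-thread marker matters because synchronously applied records never appear in any task queue; without it the minimum could run ahead of a sync record that is still being applied.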

7.9 perf_stats

perf_stats (log_recovery_redo_perf.hpp) is a nullable wrapper over cubperf:

Field	Role	Why it exists
`m_definition`	Const ref to `cubperf::statset_definition`	Slot names/types (all `COUNTER_AND_TIMER`)
`m_stats_set`	`cubperf::statset *`, `nullptr` when disabled	One-point fork: every method checks it

Activation is per-side (perf_stats_is_active_for_main / ..._for_async); do_not_record_t builds a disabled instance. time_and_increment (id) adds the time since the previous call:

Main: FETCH_PAGE, READ_LOG, REDO_OR_PUSH_{PREP, DO_SYNC, POP_REUSABLE_DIRECT/_WAIT, DO_ASYNC}, COMMIT_ABORT, WAIT_FOR_PARALLEL, FINALIZE.
Workers: PARALLEL_POP, PARALLEL_SLEEP (never incremented), PARALLEL_EXECUTE, PARALLEL_RETIRE.

redo_parallel::log_perf_stats logs each worker’s set plus an element-wise average — EXECUTE vs POP shows saturation; DIRECT vs WAIT shows pool throttling.

7.10 Chapter summary — key takeaways

Three dispatcher exits: SA_MODE always-sync; forced sync via null infra or log_rv_need_sync_redo (null VPID, volume ops, sector reserve/unreserve); async dispatch of a recycled job.
Invariant 7-A makes forced-sync records happens-before barriers; Invariant 7-B (fixed hash(VPID) % m_task_count, in-order push, FIFO drain) gives per-page LSA order without page locks.
Worker count is MAX (16, cores), an experimental floor for I/O-bound workers; teardown’s private bitset dodges the pool’s 60-second assert.
Jobs carry (vpid, lsa, rectype); redo_job_impl::execute re-fetches the log page via the task’s private log_rv_redo_context and converges on the sync apply path; the worker loop discards its error return.
reusable_jobs_stack recycles ONE_M jobs — lock-free pop, ONE_K flush-back, conservation asserted (Inv 7-C); slow path = backpressure; min_unapplied_log_lsa_monitoring and perf_stats serve replication and diagnostics.

Chapter 8: Atomic Sysop Abort and Postpone Completion

Redo (Ch 6-7) reproduced the crash state exactly, leaving two loose ends that must not reach undo: open atomic system operations, and transactions/sysops committed with postpone whose postpones never finished. The tail of log_recovery_redo closes both; see the high-level companion (cubrid-recovery-manager.md) for the postpone/sysop concept.

8.1 Placement in the redo tail

Both phases run on the recovery main thread after the parallel redo pool drains (Ch 7): they append new log records through the runtime logging path (log_sysop_start / log_sysop_abort / log_run_postpone_op), only safe once every queued redo job is applied.

// log_recovery_redo (tail) -- src/transaction/log_recovery.c
  LOG_CS_ENTER (thread_p);
  log_Gl.mvcc_table.reset_start_mvccid ();
  /* ... er_set: "REDO" finishing-up notification ... */
  log_recovery_abort_all_atomic_sysops (thread_p);  /* <- must run FIRST */
  log_recovery_finish_all_postpone (thread_p);
  /* ... flush data pages, log pages, log header ... */

Invariant 8-A — atomic-before-postpone. Stated in the log_rcv_tdes comment: interrupted file_perm_alloc/file_perm_dealloc “must be executed atomically … before executing finish all postpones”. Postpone actions (typically permanent-file destruction) would otherwise hit half-modified file headers and sector tables — crash or file-tracker corruption.

8.2 `LOG_RCV_TDES` — the recovery scratchpad

Analysis (Ch 4-5) recorded everything this chapter consumes into tdes->rcv (struct log_rcv_tdes, log_impl.h) — five LOG_LSA fields, NULL_LSA meaning no such loose end:

Field	Role	Why it exists
`sysop_start_postpone_lsa`	`LOG_SYSOP_START_POSTPONE` of a sysop committed-with-postpone whose `LOG_SYSOP_END` never landed (8.6).	That record embeds the `LOG_REC_SYSOP_END` to replay (8.3).
`tran_start_postpone_lsa`	The transaction’s `LOG_COMMIT_WITH_POSTPONE`.	Separates branches (c)/(d) in 8.6; abort boundary in 8.7.
`atomic_sysop_start_lsa`	Last unmatched `LOG_SYSOP_ATOMIC_START`; non-NULL means crashed mid-atomic-op.	Rollback target for 8.4: the log suffix to undo as one unit.
`analysis_last_aborted_sysop_lsa`	End LSA of the last sysop analysis saw aborted.	Upper bound of the Ch 6 skip window for a rolled-back sysop’s logical redo.
`analysis_last_aborted_sysop_start_lsa`	That sysop’s `lastparent_lsa`.	Lower bound of the skip window; unused here.

The last input is the LOG_RUN_POSTPONE trail in the log itself, consumed by 8.8.

8.3 `LOG_REC_SYSOP_START_POSTPONE` — a deferred sysop end

A sysop committing with postpone logs its future end record up front, so recovery can finish the commit even if the real LOG_SYSOP_END never reached disk:

// log_rec_sysop_start_postpone -- src/transaction/log_record.hpp
struct log_rec_sysop_start_postpone
{
  LOG_REC_SYSOP_END sysop_end;  /* log record used for end of system operation */
  LOG_LSA posp_lsa;             /* address where the first postpone operation start */
};

Field	Role	Why it exists
`sysop_end`	Pre-built end record; re-read via `log_read_sysop_start_postpone`, appended via `log_sysop_end_recovery_postpone`.	Persists the commit decision before postpones run; its `type` decides the post-finish TDES state (8.6).
`posp_lsa`	First `LOG_POSTPONE` of this sysop.	Seed for the forward scan; analysis copies it to `tdes->topops.stack[last].posp_lsa` (Ch 5).

8.6 reads four fields of the embedded LOG_REC_SYSOP_END (full table in Ch 5): type (discriminator), lastparent_lsa (transaction LSA just before the sysop — the rollback boundary), run_postpone.postpone_lsa (the LOG_POSTPONE this sysop ran — the parent’s resume point), run_postpone.is_sysop_postpone (sysop parent — asserted impossible — vs transaction).

8.4 Aborting atomic sysops

Both drivers share one skeleton: walk regular TDES slots 1..num_total_indices, skipping tdes == NULL || trid == NULL_TRANID, then the system TDESes rebuilt by analysis via log_system_tdes::map_all_tdes (locks systb_Mutex). Each call is bracketed by log_rv_simulate_runtime_worker / log_rv_end_simulation, so runtime logging primitives — which resolve the current transaction from the thread — act on the impersonated TDES (log_system_tdes::rv_simulate_system_tdes for system ones). log_recovery_abort_atomic_sysop handles one TDES:

flowchart TD
    G1{"tdes NULL or<br/>trid NULL?"} -- yes --> R1["return"]
    G1 -- no --> G2{"atomic_sysop_start_lsa<br/>NULL?"}
    G2 -- yes --> R1
    G2 -- no --> G3{"start &gt;= undo_nxlsa?"}
    G3 -- yes --> R3["reset LSA, return"]
    G3 -- no --> G4{"TOPOPE and start postpone<br/>&gt; atomic start?"}
    G4 -- yes --> N1["nested postpone in atomic op:<br/>finish it first"]
    G4 -- no --> G5{"TOPOPE?"}
    G5 -- yes --> N2["atomic op in sysop postpone:<br/>abort now"]
    G5 -- no --> N3["standalone"]
    N1 --> RB["fetch start page,<br/>prev = prev_tranlsa"]
    N2 --> RB
    N3 --> RB
    RB --> ERR{"fetch failed?"}
    ERR -- yes --> F["logpb_fatal_error"]
    ERR -- no --> SIM["log_sysop_start,<br/>lastparent_lsa = prev,<br/>log_sysop_abort"]
    SIM --> DONE["clear atomic_sysop_start_lsa"]

Figure 8-1: every branch of log_recovery_abort_atomic_sysop.

The nested cases order against 8.5: sysop_start_postpone_lsa > atomic_sysop_start_lsa means a sysop committed-with-postpone inside the atomic op — finish its postpone first, then abort. The opposite TOPOPE case is an atomic op started during a sysop’s postpone — abort now, finish the postpone in 8.5. The source comments spell out both numbered crash scenarios verbatim.

The rollback simulates a runtime sysop instead of calling undo — the in-source comment calls the lastparent_lsa overwrite “hack last parent”: the new sysop’s rollback boundary becomes the prev_tranlsa of the LOG_SYSOP_ATOMIC_START, so log_sysop_abort compensates everything after it and logs an abort LOG_SYSOP_END.

Invariant 8-B — no atomic residue. On return, atomic_sysop_start_lsa is NULL_LSA on every TDES — each exit path finds it NULL, resets it, or dies in logpb_fatal_error. Later phases can assume no half-open atomic file operation exists; plain record-by-record undo would recreate the partial state the marker prevents.
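The branch structure of Figure 8-1 can be expressed as a classifier over the three LSAs and the TDES state. This is a hedged model only: `tdes_model`, `atomic_action`, `classify_atomic_sysop` and `check_atomic_model` are hypothetical names, LSAs are integers with `-1` as the NULL value, and the actual rollback (simulated sysop plus `log_sysop_abort`) is not modelled.

```cpp
#include <cassert>
#include <cstdint>

using lsa_t = std::int64_t;
constexpr lsa_t NULL_LSA_MODEL = -1;

enum class atomic_action
{
  NOTHING,                      // no atomic sysop recorded
  RESET_ONLY,                   // nothing was logged after the atomic start
  FINISH_POSTPONE_THEN_ABORT,   // sysop postpone started inside the atomic op
  ABORT_NOW                     // standalone, or atomic op inside a sysop postpone
};

struct tdes_model
{
  bool topope_committed_with_postpone;   // models TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE
  lsa_t atomic_sysop_start_lsa;
  lsa_t sysop_start_postpone_lsa;
  lsa_t undo_nxlsa;
};

inline atomic_action classify_atomic_sysop (const tdes_model &t)
{
  if (t.atomic_sysop_start_lsa == NULL_LSA_MODEL)
    {
      return atomic_action::NOTHING;
    }
  if (t.atomic_sysop_start_lsa >= t.undo_nxlsa)
    {
      return atomic_action::RESET_ONLY;
    }
  if (t.topope_committed_with_postpone && t.sysop_start_postpone_lsa > t.atomic_sysop_start_lsa)
    {
      return atomic_action::FINISH_POSTPONE_THEN_ABORT;
    }
  return atomic_action::ABORT_NOW;
}

// Usage: one TDES per branch of the figure.
inline void check_atomic_model ()
{
  assert (classify_atomic_sysop ({false, NULL_LSA_MODEL, NULL_LSA_MODEL, 90}) == atomic_action::NOTHING);
  assert (classify_atomic_sysop ({false, 50, NULL_LSA_MODEL, 50}) == atomic_action::RESET_ONLY);
  assert (classify_atomic_sysop ({true, 50, 70, 90}) == atomic_action::FINISH_POSTPONE_THEN_ABORT);
  assert (classify_atomic_sysop ({true, 50, 30, 90}) == atomic_action::ABORT_NOW);
  assert (classify_atomic_sysop ({false, 50, NULL_LSA_MODEL, 90}) == atomic_action::ABORT_NOW);
}
```

Every result other than `NOTHING` ends with the atomic start LSA cleared, which is the model's view of Invariant 8-B.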

8.5 Finishing transaction postpones

Per TDES, log_recovery_finish_postpone: (1) return on the guard tdes == NULL || trid == NULL_TRANID; (2) always call log_recovery_finish_sysop_postpone (8.6), which resolves a TOPOPE_COMMITTED_WITH_POSTPONE state — possibly promoting it to COMMITTED_WITH_POSTPONE; (3) branch on state:

// log_recovery_finish_postpone -- src/transaction/log_recovery.c
  if (tdes->state == TRAN_UNACTIVE_WILL_COMMIT || tdes->state == TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE)
    {
      if (tdes->state == TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE)
        { /* make sure to abort interrupted logical postpone. */
          log_recovery_abort_interrupted_sysop (thread_p, tdes, &tdes->rcv.tran_start_postpone_lsa);
          LSA_SET_NULL (&tdes->undo_nxlsa); }  /* <- committed: nothing left to undo */
      /* ... find_first_postpone -> log_do_postpone -> log_complete ... */
    }
  else if (tdes->state == TRAN_UNACTIVE_COMMITTED)
    { /* log_complete + free index only; postpones already done */ }

TRAN_UNACTIVE_WILL_COMMIT = commit logged, postpone start not; COMMITTED_WITH_POSTPONE first aborts a possibly interrupted logical run postpone (8.7). The elided body: log_recovery_find_first_postpone (8.8), log_do_postpone (8.9) on a non-NULL result, then — local transactions only, tdes->coord == NULL — log_complete appends the LOG_COMMIT EOT, sets TRAN_UNACTIVE_COMMITTED, and logtb_free_tran_index frees the slot (2PC: Ch 11). System TDESes pass through step (2) only; an unfinishable interrupted sysop leaves them TRAN_UNACTIVE_UNILATERALLY_ABORTED — branch (d) — for undo (Ch 9).

8.6 `log_recovery_finish_sysop_postpone`

Runs only for TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE; analysis pushed exactly one topops entry (assert (tdes->topops.last == 0)). Sequence: abort an interrupted logical run postpone (8.7) relative to rcv.sysop_start_postpone_lsa; find the first unexecuted postpone (8.8) seeded from topops.stack[last].posp_lsa; log_do_postpone (8.9); re-read the start-postpone record via log_read_sysop_start_postpone (failure: assert_release, give up); append the pre-built end via log_sysop_end_recovery_postpone. Four outcomes:

// log_recovery_finish_sysop_postpone (outcomes) -- src/transaction/log_recovery.c
  if (sysop_start_postpone.sysop_end.type == LOG_SYSOP_END_LOGICAL_RUN_POSTPONE)
    {
      if (sysop_start_postpone.sysop_end.run_postpone.is_sysop_postpone)
        { /* (a) sysop postpone during sysop postpone? should not happen! */
          assert (false);
          tdes->state = TRAN_UNACTIVE_UNILATERALLY_ABORTED; tdes->undo_nxlsa = tdes->tail_lsa; }
      else
        { /* (b) logical run postpone during transaction postpone */
          tdes->state = TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE;
          LSA_SET_NULL (&tdes->undo_nxlsa);
          tdes->posp_nxlsa = sysop_start_postpone.sysop_end.run_postpone.postpone_lsa; }
    }
  else if (!LSA_ISNULL (&tdes->rcv.tran_start_postpone_lsa))
    { /* (c) sysop nested in transaction postpone phase */
      tdes->state = TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE; }
  else
    { /* (d) standalone: hand the rest to undo (Ch 9) */
      tdes->state = TRAN_UNACTIVE_UNILATERALLY_ABORTED; tdes->undo_nxlsa = tdes->tail_lsa; }

(b) resumes the parent’s postpone after the one this sysop ran (8.3); (b)/(c) fall through into 8.5’s branch in the same invocation; (d) parks the TDES for undo. A defensive clamp resets topops.last to -1 under assert_release.

8.7 `log_recovery_abort_interrupted_sysop` — the backward scan

Postpone execution can itself use logical run postpone sysops (file destroy/deallocate); a crash mid-sysop leaves a fragment to abort before resuming. Walk the undo chain backwards from tdes->undo_nxlsa down to postpone_start_lsa:

Early return if undo_nxlsa is NULL or <= postpone_start_lsa — nothing to abort.
Per record (page fetch failure: logpb_fatal_error, return):
- LOG_RUN_POSTPONE — physical run postpone completed: stop, last_parent_lsa = iter_lsa.
- LOG_SYSOP_END — stop likewise if type == LOG_SYSOP_END_LOGICAL_RUN_POSTPONE, else hop to sysop_end->lastparent_lsa, skipping the nested sysop whole.
- anything else — prev_lsa = logrec_head.prev_tranlsa; asserts forbid postpone-start types.
Loop drained — assert (LSA_EQ (&iter_lsa, postpone_start_lsa)); the interrupted sysop was the first postpone action: last_parent_lsa = *postpone_start_lsa.

Then the 8.4 simulated-sysop trick with stack[last].lastparent_lsa = last_parent_lsa: everything after the last completed run postpone is compensated; completed ones stay.
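The backward scan can be traced on an in-memory chain. The sketch below is a simplified model under assumptions: `log_rec`, `rec_type` and `find_last_parent` are hypothetical names, the log is a `std::map` keyed by integer LSA, and page fetching and its fatal-error path are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using lsa_t = std::int64_t;
constexpr lsa_t NULL_LSA_MODEL = -1;

enum class rec_type { RUN_POSTPONE, SYSOP_END_LOGICAL_RUN_POSTPONE, SYSOP_END_OTHER, OTHER };

struct log_rec
{
  rec_type type;
  lsa_t prev_tranlsa;     // previous record of the same transaction
  lsa_t lastparent_lsa;   // only meaningful for sysop end records
};

// Returns the rollback boundary, or NULL_LSA_MODEL when there is nothing to abort.
inline lsa_t find_last_parent (const std::map<lsa_t, log_rec> &log, lsa_t undo_nxlsa, lsa_t postpone_start_lsa)
{
  if (undo_nxlsa == NULL_LSA_MODEL || undo_nxlsa <= postpone_start_lsa)
    {
      return NULL_LSA_MODEL;
    }
  lsa_t iter = undo_nxlsa;
  while (iter > postpone_start_lsa)
    {
      const log_rec &rec = log.at (iter);
      if (rec.type == rec_type::RUN_POSTPONE || rec.type == rec_type::SYSOP_END_LOGICAL_RUN_POSTPONE)
        {
          return iter;    // last completed run postpone: keep it and everything before it
        }
      // A finished nested sysop is skipped whole; anything else steps one record back.
      iter = (rec.type == rec_type::SYSOP_END_OTHER) ? rec.lastparent_lsa : rec.prev_tranlsa;
    }
  assert (iter == postpone_start_lsa);
  return postpone_start_lsa;   // the interrupted sysop was the first postpone action
}
```

Because `log.at` throws on a missing key, leaving the interior of a nested sysop out of the map is a cheap way to confirm that the scan hops over it.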

8.8 `log_recovery_find_first_postpone` — the run-postpone trail

tdes->posp_nxlsa after analysis is ambiguous: analysis advances it to the run_posp->ref_lsa of every LOG_RUN_POSTPONE scanned (so it ends on the last confirmed postpone) or, if none ran, sets it to the first postpone of LOG_COMMIT_WITH_POSTPONE (Ch 4). One forward scan disambiguates. Guards: a call outside crash recovery, or with the TDES in none of the three postpone states, hits assert (0) and returns ER_FAILED; a NULL start_postpone_lsa returns NO_ERROR with a NULL result. The scan reuses log_do_postpone’s nested-top range walk and page-fetch error path (8.9), inspecting only this trid:

LOG_RUN_POSTPONE with ref_lsa == start_postpone_lsa — candidate ran: set start_postpone_lsa_wasapplied, done.
LOG_SYSOP_END of type LOG_SYSOP_END_LOGICAL_RUN_POSTPONE — same test on run_postpone.postpone_lsa (logical run postpones log no LOG_RUN_POSTPONE).
LOG_POSTPONE — the first non-candidate goes to next_postpone_lsa.
LOG_END_OF_LOG / NULL-offset — archive-boundary page advance as in 8.9.

Tail: candidate never ran — ret_lsa = start_postpone_lsa; else ret_lsa = next_postpone_lsa, the next LOG_POSTPONE, NULL if none remain.
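The disambiguating scan can be reproduced on a flat record list. This is a sketch under assumptions, not the CUBRID routine: `tran_rec`, `rec_kind` and `find_first_postpone` (in this simplified signature) are hypothetical, the records are those of a single transaction in LSA order, and the nested-top range walk and archive-boundary handling are omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using lsa_t = std::int64_t;
constexpr lsa_t NULL_LSA_MODEL = -1;

enum class rec_kind { POSTPONE, RUN_POSTPONE, LOGICAL_RUN_POSTPONE_END, OTHER };

struct tran_rec
{
  lsa_t lsa;
  rec_kind kind;
  lsa_t ref_lsa;   // for run-postpone kinds: the LOG_POSTPONE that was executed
};

// Returns the first postpone that still has to run, or NULL_LSA_MODEL if none remains.
inline lsa_t find_first_postpone (const std::vector<tran_rec> &recs, lsa_t start_postpone_lsa)
{
  if (start_postpone_lsa == NULL_LSA_MODEL)
    {
      return NULL_LSA_MODEL;
    }
  bool was_applied = false;
  lsa_t next_postpone_lsa = NULL_LSA_MODEL;
  for (const tran_rec &r : recs)
    {
      if (r.lsa < start_postpone_lsa)
        {
          continue;   // the scan starts at the candidate
        }
      const bool is_run = (r.kind == rec_kind::RUN_POSTPONE || r.kind == rec_kind::LOGICAL_RUN_POSTPONE_END);
      if (is_run && r.ref_lsa == start_postpone_lsa)
        {
          was_applied = true;   // the candidate already ran
          break;
        }
      if (r.kind == rec_kind::POSTPONE && r.lsa != start_postpone_lsa && next_postpone_lsa == NULL_LSA_MODEL)
        {
          next_postpone_lsa = r.lsa;   // remember the first non-candidate postpone
        }
    }
  return was_applied ? next_postpone_lsa : start_postpone_lsa;
}
```

With postpones at 10 and 20 and a run-postpone record referencing 10, a candidate of 10 resolves to 20, while a candidate of 20 resolves to itself because no record confirms that it ran.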

8.9 `log_do_postpone` — the shared forward executor

The routine that runs postpones at runtime commit re-runs them here. log_get_next_nested_top builds a stack of nested-sysop ranges; the outer loop seeks each range three ways — up to a range’s start, restarting after its end, or (when start_seek_lsa == nxtop_range->end_lsa) running to tdes->tail_lsa and stopping. This skips the interior of every completed nested sysop — committed or aborted — since their LOG_POSTPONE records belong to the sysop, not the enclosing postpone phase; only a LOG_SYSOP_END_LOGICAL_RUN_POSTPONE end record stays inside the scanned range (log_get_next_nested_top ends that range one record earlier so the end record itself is processed). Both forward scanners share the page-fetch error path: logpb_fetch_page failure raises logpb_fatal_error and jumps to the end label, which frees a heap-grown nxtop_stack.

Dispatch inside a range: ordinary data/dummy/replication types are ignored; LOG_POSTPONE executes now via log_run_postpone_op — goto end on failure; LOG_COMMIT_WITH_POSTPONE (plus _OBSOLETE, LOG_SYSOP_START_POSTPONE, the 2PC starts) nulls forward_lsa — the postpone region is over; LOG_SYSOP_END is tolerated only at start_seek_lsa, else debug-logged as a bad range. log_run_postpone_op reads the LOG_REC_REDO payload (copying across page boundaries; logpb_fatal_error on OOM) and calls log_execute_run_postpone: apply the redo function, log a new LOG_RUN_POSTPONE — a second crash just extends the trail 8.8 consumes.

Invariant 8-C — postpones execute exactly once. The posp_nxlsa trail, 8.8’s applied-check, and the fresh LOG_RUN_POSTPONE each execution logs guarantee each LOG_POSTPONE runs exactly once across any number of crashes. Seeding log_do_postpone with an already-run LSA would double-apply non-idempotent redo such as page deallocation.

8.10 Chapter summary — key takeaways

The redo tail runs two cleanups after the parallel pool drains: abort open atomic sysops, then finish pending postpones (Invariant 8-A), fed by tdes->rcv (LOG_RCV_TDES) plus the LOG_RUN_POSTPONE trail.
Both drivers walk regular TDES slots then log_system_tdes::map_all_tdes, impersonating each transaction via log_rv_simulate_runtime_worker.
Rollback simulates a sysop — log_sysop_start, overwrite lastparent_lsa, log_sysop_abort — ordered by 8.4’s nested-case branches.
log_recovery_finish_sysop_postpone replays the embedded LOG_REC_SYSOP_END, landing in the transaction-postpone path or TRAN_UNACTIVE_UNILATERALLY_ABORTED for undo.
Each finished regular (non-system) transaction TDES exits via log_complete (LOG_COMMIT) to TRAN_UNACTIVE_COMMITTED, except 2PC participants (Ch 11); system TDESes only pass through the sysop-postpone step.

Chapter 9: Undo Pass and Compensation

Redo (Chapter 6) left even the losers’ effects in place; undo rolls them back. CLR theory is in the companion (cubrid-recovery-manager.md); here: log_recovery_undo and log_rv_undo_record branch by branch, plus the sysop bracket that makes rollback crash-restartable.

9.1 Record structs of the undo pass

All four live in log_record.hpp (log_rec_undo, log_rec_mvcc_undo, log_rec_compensate, log_rec_sysop_end), read in place from the log page.

LOG_REC_UNDO — body of LOG_UNDO_DATA:

Field	Role
`data` (`LOG_DATA`)	`rcvindex` + volid/pageid/offset: one locator for `RV_fun` dispatch and page fix; NULL vpid triggers `RCV_IS_LOGICAL_LOG`
`length`	undo-image byte count; carries the `ZIP_CHECK` flag

LOG_REC_MVCC_UNDO — body of LOG_MVCC_UNDO_DATA:

Field	Role
`undo` (`LOG_REC_UNDO`)	embedded plain undo — strict superset; arms extract `&mvcc_undo->undo`
`mvccid`	writer’s MVCCID, re-activated during undo so the version stays invisible
`vacuum_info` (`LOG_VACUUM_INFO`)	`prev_mvcc_op_log_lsa` chain + `vfid` — vacuum’s list through MVCC op records; undo skips it

LOG_REC_COMPENSATE — body of LOG_COMPENSATE, the CLR:

Field	Role
`data` (`LOG_DATA`)	locator + `rcvindex` of the compensation’s redo — CLRs are redo-only, replayed via `redofun`
`undo_nxlsa`	next record to undo, captured before the compensated one — ARIES UndoNxtLSA; restarted undo skips done work (9.3 arm 4)
`length`	after-image length

LOG_REC_SYSOP_END — body of LOG_SYSOP_END; union keyed by type:

Field	Role
`lastparent_lsa`	last LSA before the sysop — undo jumps here; committed sysops are never re-undone
`prv_topresult_lsa`	previous completed top action — nested-sysop chaining (Chapter 5)
`type` (`LOG_SYSOP_END_TYPE`)	union discriminator — six end flavors, one record
`vfid`	file of affected pages — TDE decision for trailing undo data
`undo` (union)	`LOG_REC_UNDO` for `LOGICAL_UNDO` — the sysop’s own undo recipe if its owner aborts
`mvcc_undo` (union)	`LOG_REC_MVCC_UNDO` for `LOGICAL_MVCC_UNDO` — same, plus MVCCID
`compensate_lsa` (union)	resume point for `LOGICAL_COMPENSATE` — the bracket was itself a compensation
`run_postpone` (union)	`postpone_lsa` + `is_sysop_postpone` for `LOGICAL_RUN_POSTPONE` — analysis-side twin (Chapter 5); undo asserts it never sees one

9.2 `log_recovery_undo` — pre-pass and loser selection

Called from log_recovery under LOG_RECOVERY_UNDO_PHASE. The pre-pass retires losers with nothing left to undo: a TDES in state TRAN_UNACTIVE_UNILATERALLY_ABORTED / TRAN_UNACTIVE_ABORTED with a NULL undo_nxlsa finished its rollback pre-crash but its LOG_ABORT never hit disk — log_complete (… LOG_ABORT, LOG_DONT_NEED_NEWTRID, LOG_NEED_TO_WRITE_EOT_LOG) writes it now, logtb_free_tran_index frees the slot. System TDESes need no EOT: log_system_tdes::rv_delete_all_tdes_if erases every system entry with NULL undo_nxlsa.

Selection uses logtb_rv_read_only_map_undo_tdes (log_tran_table.c): under a read-mode TR_TABLE_CS it maps a functor over every non-system slot in those two states, then over system workers via log_system_tdes::map_all_tdes — a max-scan lambda yields max_undo_lsa, two more feed the start notice (log_find_unilaterally_largest_undo_lsa duplicates the max-scan; nothing calls it today). The driver allocates undo_unzip_ptr = log_zip_alloc (LOGAREA_SIZE), arms an optional progress timer, exits LOG_CS (fetches use LOG_CS_FORCE_USE; alloc and fetch failures are fatal), then loops per Figure 9-1.

Invariant (globally descending undo order). Each iteration undoes max_undo_lsa — the largest undo_nxlsa over all losers, recomputed after every record — and every arm moves a cursor strictly backward (prev_tranlsa, a CLR’s undo_nxlsa, or a sysop’s lastparent_lsa). The inner while (max_undo_lsa.pageid == log_lsa.pageid) drains a page before fetching an earlier one; a forward-moving arm would live-lock.

flowchart TD
    A["prune finished losers"] --> B["max_undo_lsa = max undo_nxlsa"]
    B --> C{NULL?}
    C -- yes --> Z["free unzip buffer, LOG_CS_ENTER,<br/>flush log + header + data pages"]
    C -- no --> D["fetch page; while same pageid:<br/>resolve tdes, switch on log_rtype"]
    D --> G{prev_tranlsa NULL?}
    G -- yes --> H["chain done: log_complete +<br/>logtb_free_tran_index or rv_delete_tdes"]
    G -- no --> I["undo_nxlsa = prev_tranlsa"]
    H --> B
    I --> B

Figure 9-1: driver loop.
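The max-scan loop of Figure 9-1 can be simulated to confirm the descending-order invariant. A minimal sketch, assuming integer LSAs and integer transaction ids; `undo_rec` and `undo_order` are hypothetical names, every record is treated as a plain undoable record, and no page I/O or compensation logging is modelled.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using lsa_t = std::int64_t;
constexpr lsa_t NULL_LSA_MODEL = -1;

struct undo_rec
{
  int trid;             // owning transaction
  lsa_t prev_tranlsa;   // previous record of that transaction, or NULL_LSA_MODEL
};

// Returns the LSAs in the order the driver would undo them.
// losers maps trid to its current undo_nxlsa and is consumed by value.
inline std::vector<lsa_t> undo_order (const std::map<lsa_t, undo_rec> &log, std::map<int, lsa_t> losers)
{
  std::vector<lsa_t> visited;
  for (;;)
    {
      // Max-scan: the largest undo_nxlsa over all remaining losers.
      lsa_t max_undo_lsa = NULL_LSA_MODEL;
      int owner = -1;
      for (const auto &entry : losers)
        {
          if (entry.second > max_undo_lsa)
            {
              max_undo_lsa = entry.second;
              owner = entry.first;
            }
        }
      if (max_undo_lsa == NULL_LSA_MODEL)
        {
          break;   // no loser left
        }
      const undo_rec &rec = log.at (max_undo_lsa);
      assert (rec.trid == owner);
      assert (rec.prev_tranlsa < max_undo_lsa);   // the cursor moves strictly backward
      visited.push_back (max_undo_lsa);
      if (rec.prev_tranlsa == NULL_LSA_MODEL)
        {
          losers.erase (owner);                   // chain done: retire the loser
        }
      else
        {
          losers[owner] = rec.prev_tranlsa;
        }
    }
  return visited;
}
```

For two interleaved losers with chains 50, 30, 10 and 40, 20, the result is 50, 40, 30, 20, 10: the transactions alternate, but the global sequence never moves forward.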

TDES resolution forks on logtb_is_system_worker_tranid: workers via log_system_tdes::rv_get_tdes (NULL asserts); regular transactions via logtb_find_tran_index + LOG_FIND_TDES — on lookup failure (a trid analysis never registered) logtb_free_tran_index_with_undo_lsa scrubs any slot holding that undo_nxlsa and the record is skipped. if (tran_index != NULL_TRAN_INDEX && tdes != NULL) gates the switch; on the worker path tran_index is stale — only tdes matters.

9.3 The record-type switch — every arm

Every arm is preceded unconditionally by LSA_COPY (&tdes->undo_nxlsa, &prev_tranlsa) — the order is the point:

Invariant (cursor advances before the undo executes). log_append_compensate copies tdes->undo_nxlsa into the CLR it writes; the driver advanced it to prev_tranlsa first, so the CLR points at the next record to undo. Reverse the order and a crash mid-rollback replays the same undo twice.
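The effect of advancing the cursor before logging the CLR can be shown with a crash-and-restart simulation. The model is deliberately small and rests on assumptions: `rec` and `rollback` are hypothetical names, a record's LSA is its index in the vector, "executing" an undo just records the LSA, and a crash is modelled by a budget of undos.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using lsa_t = std::int64_t;
constexpr lsa_t NULL_LSA_MODEL = -1;

struct rec
{
  bool is_clr;
  lsa_t prev_tranlsa;   // backward chain of the transaction
  lsa_t undo_nxlsa;     // CLR only: next record to undo
};

// Undo at most max_undos records starting at cursor; a record's LSA is its index in log.
// Returns the cursor at which the rollback stopped.
inline lsa_t rollback (std::vector<rec> &log, lsa_t cursor, std::size_t max_undos, std::vector<lsa_t> &undone)
{
  while (cursor != NULL_LSA_MODEL)
    {
      assert (cursor >= 0 && static_cast<std::size_t> (cursor) < log.size ());
      const rec current = log[static_cast<std::size_t> (cursor)];   // copy: log grows below
      if (current.is_clr)
        {
          cursor = current.undo_nxlsa;      // leapfrog work that was already undone
          continue;
        }
      if (max_undos == 0)
        {
          break;                            // models a crash before the next undo
        }
      --max_undos;
      const lsa_t undone_lsa = cursor;
      cursor = current.prev_tranlsa;        // advance the cursor first,
      undone.push_back (undone_lsa);        // then "execute" the undo,
      const lsa_t tail = static_cast<lsa_t> (log.size ()) - 1;
      log.push_back ({true, tail, cursor}); // and log a CLR that points past it
    }
  return cursor;
}
```

Restarting from the transaction tail after the simulated crash lands on the CLR, follows its `undo_nxlsa`, and resumes with the first record that was not yet undone, so each record is undone exactly once across both runs.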

UNDOREDO family (LOG_UNDOREDO_DATA, LOG_DIFF_UNDOREDO_DATA, LOG_MVCC_* twins) — MVCC flavors read LOG_REC_MVCC_UNDOREDO and set rcv.mvcc_id, plain ones LOG_REC_UNDOREDO with MVCCID_NULL; fill rcv from the embedded LOG_DATA + ulength, call log_rv_undo_record. DIFF matters only to redo.
LOG_MVCC_UNDO_DATA / LOG_UNDO_DATA — same shape with LOG_REC_MVCC_UNDO / LOG_REC_UNDO and undo->length.
Redo-only / bookkeeping types — LOG_REDO_DATA, LOG_MVCC_REDO_DATA, LOG_DBEXTERN_REDO_DATA, LOG_DUMMY_HEAD_POSTPONE, LOG_POSTPONE, LOG_SAVEPOINT, LOG_REPLICATION_DATA, LOG_REPLICATION_STATEMENT, LOG_DUMMY_HA_SERVER_STATE, LOG_DUMMY_OVF_RECORD, LOG_DUMMY_GENERIC, LOG_SUPPLEMENTAL_INFO, LOG_SYSOP_ATOMIC_START: /* Not for UNDO ... */, fall through to the previous record.
LOG_COMPENSATE — LSA_COPY (&prev_tranlsa, &compensate->undo_nxlsa). No work — the cursor leapfrogs everything already undone pre-crash.
LOG_SYSOP_END — on sysop_end->type:
- LOGICAL_UNDO / LOGICAL_MVCC_UNDO: the committed bracket carries its own undo recipe. rcv is filled from sysop_end->undo (or mvcc_undo.undo plus rcv.mvcc_id); both prev_tranlsa and tdes->undo_nxlsa move to lastparent_lsa before log_rv_undo_record runs, so its compensation skips the whole sysop. (rcv_lsa is not refreshed; diagnostics may print a stale LSA.)
- LOGICAL_COMPENSATE: prev_tranlsa = sysop_end->compensate_lsa — resume before the record the bracket compensated.
- default (COMMIT, ABORT): prev_tranlsa = sysop_end->lastparent_lsa; an assert documents that LOGICAL_RUN_POSTPONE never reaches undo (Chapter 8).
Terminal/illegal types (LOG_RUN_POSTPONE, the LOG_COMMIT* trio, LOG_SYSOP_START_POSTPONE, LOG_ABORT, checkpoint and 2PC records, LOG_DUMMY_CRASH_RECOVERY, LOG_END_OF_LOG) and the default arm (corrupted type → ER_LOG_PAGE_CORRUPTED) — analysis went wrong: after assert (false), release builds amputate — clear tdes->mvccinfo.id, log_system_tdes::rv_delete_tdes (workers) or log_complete (… LOG_ABORT …) + logtb_free_tran_index, tdes = NULL so the epilogue skips it.

Epilogue (if (tdes != NULL)): a NULL prev_tranlsa ends the chain: clear tdes->mvccinfo.id, then rv_delete_tdes (workers) or log_complete + logtb_free_tran_index as in the pre-pass (#ifdef CCI_XA builds skip completion for TRAN_UNACTIVE_2PC_PREPARE). Otherwise prev_tranlsa goes back into tdes->undo_nxlsa, re-asserting the cursor that the LOG_COMPENSATE and LOG_SYSOP_END arms may have redirected. After the loop: free the unzip buffer, re-enter LOG_CS, force-flush log, header and data pages.
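The cursor arithmetic of the switch can be condensed into one function. The sketch below is a simplified model under stated assumptions: `Rec`, `RecType`, `SysopEndType`, and `next_cursor` are hypothetical names, LSAs are plain integers, and the side effects of each arm (the actual undo, the amputation of illegal records) are left out.

```cpp
#include <cassert>

using Lsa = long long;          // hypothetical scalar LSA
const Lsa NULL_LSA = -1;        // stands in for a NULL LSA

enum class RecType { Undoable, RedoOnly, Compensate, SysopEnd };
enum class SysopEndType { Commit, Abort, LogicalUndo, LogicalCompensate };

// Only the fields the cursor logic reads; everything else is omitted.
struct Rec
{
  RecType type;
  Lsa prev_tranlsa;             // previous record of the same transaction
  Lsa clr_undo_nxlsa;           // meaningful for Compensate only
  SysopEndType sysop_type;      // meaningful for SysopEnd only
  Lsa lastparent_lsa;           // meaningful for SysopEnd only
  Lsa compensate_lsa;           // meaningful for LogicalCompensate only
};

// Where the rollback cursor lands after the driver has handled `r`.
Lsa next_cursor (const Rec &r)
{
  switch (r.type)
    {
    case RecType::Compensate:
      return r.clr_undo_nxlsa;          // leapfrog what was already undone
    case RecType::SysopEnd:
      if (r.sysop_type == SysopEndType::LogicalCompensate)
        return r.compensate_lsa;        // resume before the compensated record
      return r.lastparent_lsa;          // skip the whole bracket
    default:
      return r.prev_tranlsa;            // undoable and redo-only records
    }
}
```

A NULL result is what the epilogue treats as the end of the chain.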

Inside log_complete, updaters get log_append_abort_log + log_change_tran_as_completed and unlock_global_oldest_visible_mvccid; no-update losers (LSA_ISNULL (&tdes->tail_lsa)) just flip state.

9.4 `log_rv_undo_record` — one undo step, every branch

The recovery twin of run-time log_rollback_rec. The transaction's identity is simulated via log_rv_simulate_runtime_worker / log_rv_end_simulation, and no page locks are taken. Pre-dispatch has three steps: (1) a valid rcv->mvcc_id is re-activated via logtb_rv_assign_mvccid_for_undo_recovery; (2) RCV_IS_LOGICAL_LOG (rcv_vpid, rcvindex), meaning a NULL vpid or a logical rcvindex, leaves rcv->pgptr = NULL, otherwise pgbuf_fix takes an unconditional write latch (a failure is asserted but tolerated); (3) ZIP_CHECK (rcv->length) strips the compression flag; the image is aliased from the log page if it fits, else malloced + logpb_copy_from_log, and zipped images are inflated by log_unzip into undo_unzip_ptr (alloc/unzip failures are fatal, as in the reader-based redo-side twins log_rv_get_unzip_log_data / log_rv_get_unzip_and_diff_redo_log_data, Chapter 6). Then, under if (rcv->pgptr != NULL || RCV_IS_LOGICAL_LOG (…)):

// log_rv_undo_record -- src/transaction/log_recovery.c
      if (rcvindex == RVBT_MVCC_INCREMENTS_UPD)
  { /* nothing to do during recovery */ }
      else if (rcvindex == RVBT_MVCC_NOTIFY_VACUUM || rcvindex == RVES_NOTIFY_VACUUM)
  { /* nothing to do */ }
      else if (rcvindex == RVBT_LOG_GLOBAL_UNIQUE_STATS_COMMIT)
  { /* <- in-memory only: undo on every restart, cannot compensate */
    error_code = (*RV_fun[rcvindex].undofun) (thread_p, rcv);
    assert (error_code == NO_ERROR);
  }
      else if (RCV_IS_LOGICAL_COMPENSATE_MANUAL (rcvindex))
  { /* <- undofun logs its own compensation */
    LSA_COPY (&rcv->reference_lsa, &tdes->undo_nxlsa);
    error_code = (*RV_fun[rcvindex].undofun) (thread_p, rcv);
    // ... condensed ... logpb_fatal_error on failure; optional b-tree trace
  }
      else if (!RCV_IS_LOGICAL_LOG (rcv_vpid, rcvindex))
  { /* <- PHYSICAL undo: CLR first, then apply before-image */
    log_append_compensate (thread_p, rcvindex, rcv_vpid, rcv->offset, rcv->pgptr,
         rcv->length, rcv->data, tdes);
    error_code = (*RV_fun[rcvindex].undofun) (thread_p, rcv);
    // ... condensed ... logpb_fatal_error on failure
  }
      else
  { /* <- LOGICAL undo: bracket in a system operation */
    save_state = tdes->state;
    LSA_COPY (&rcv->reference_lsa, &tdes->undo_nxlsa);
    log_sysop_start (thread_p);
    (void) (*RV_fun[rcvindex].undofun) (thread_p, rcv);
    log_sysop_end_logical_compensate (thread_p, &rcv->reference_lsa);
    tdes->state = save_state;
  }

A physical record whose page could not be fixed (the guard’s else) still gets log_append_compensate with pgptr = NULL, so the chain stays restartable, plus ER_LOG_MAYNEED_MEDIA_RECOVERY naming the volume; the undofun is skipped and recovery continues (log-and-skip). The end: label then frees the malloced image area, unfixes the page, and calls log_rv_end_simulation.

Invariant (every undo step is logged before or while it happens). Physical undo writes the CLR before undofun; logical undo opens log_sysop_start first so all page changes land inside the bracket, sealed by log_sysop_end_logical_compensate with compensate_lsa = rcv->reference_lsa. Crash inside the bracket: analysis aborts the sysop (Chapter 8), undo resumes at the original record. Crash after the seal: the LOGICAL_COMPENSATE arm jumps to compensate_lsa. Either way the logical undo runs exactly once.

recovery.h defines the manual sets: RCV_IS_BTREE_LOGICAL_LOG (ten RVBT_* object-level ops) inside the wider RCV_IS_LOGICAL_COMPENSATE_MANUAL (plus RVFL_ALLOC, RVFL_USER_PAGE_MARK_DELETE, RVPGBUF_DEALLOC, RVFL_TRACKER_HEAP_REUSE, RVHF_LOB_REMOVE_DIR, RVFL_TRACKER_UNREGISTER). Their undofuns append page-level compensations themselves via log_append_compensate_with_undo_nxlsa with the saved rcv->reference_lsa — a b-tree undo may split or merge pages before compensating — so an extra bracket would be redundant.

9.5 `log_append_compensate` — the CLR writer

log_append_compensate and log_append_compensate_with_undo_nxlsa wrap log_append_compensate_internal (log_manager.c); the latter passes an explicit undo_nxlsa for the b-tree case, the former NULL:

// log_append_compensate_internal -- src/transaction/log_manager.c
  // ... condensed ... node = prior_lsa_alloc_and_copy_data (.., LOG_COMPENSATE, ..); NULL -> silent return
  LSA_COPY (&prev_lsa, &tdes->undo_nxlsa);   /* <- next record to undo, saved */
  compensate = (LOG_REC_COMPENSATE *) node->data_header;
  // ... condensed ... fill compensate->data; store the undo_nxlsa parameter
  //     into compensate->undo_nxlsa if non-NULL (b-tree override), else prev_lsa
  start_lsa = prior_lsa_next_record (thread_p, node, tdes);
  // ... condensed ... pgbuf_set_lsa (pgptr, start_lsa) when pgptr != NULL
  /* Go back to our undo link */
  LSA_COPY (&tdes->undo_nxlsa, &prev_lsa);   /* <- CLR must not become next undo target */

Branches: prior_lsa_alloc_and_copy_data failure returns silently, so the undo proceeds unlogged; because no CLR reached the log, nothing on disk moves the rollback cursor past the record, and a re-crash simply undoes it again (re-applying a before-image is harmless). NULL pgptr (media path, 9.4) skips TDE marking and pgbuf_set_lsa; a failed pgbuf_set_lsa asserts and returns. The last line is load-bearing: prior_lsa_next_record drags undo_nxlsa forward with tail_lsa; restoring prev_lsa keeps the rollback cursor behind the CLR. Per the header comment, CLRs “are never undone.”
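The save-and-restore around the append is easy to get wrong, so a toy model may help. Everything here is an illustrative assumption: `Tdes`, `Clr`, `append_record`, and `append_compensate` are hypothetical names, LSAs are integers, and `append_record` only mimics the one effect of prior_lsa_next_record that matters (dragging tail_lsa and undo_nxlsa to the new record).

```cpp
#include <cassert>

using Lsa = long long;   // hypothetical scalar LSA

struct Tdes
{
  Lsa undo_nxlsa;   // rollback cursor
  Lsa tail_lsa;     // last record appended by this transaction
};

struct Clr
{
  Lsa lsa;          // where the CLR itself landed
  Lsa undo_nxlsa;   // resume point stored inside the CLR
};

// Appending any record drags both tail_lsa and undo_nxlsa to it.
Lsa append_record (Tdes &tdes, Lsa &next_free_lsa)
{
  Lsa lsa = next_free_lsa++;
  tdes.tail_lsa = lsa;
  tdes.undo_nxlsa = lsa;
  return lsa;
}

// Model of the CLR writer: save the cursor, append, restore the cursor.
// A non-null override mirrors the explicit undo_nxlsa of the manual path.
Clr append_compensate (Tdes &tdes, Lsa &next_free_lsa, const Lsa *undo_nxlsa_override)
{
  Lsa prev_lsa = tdes.undo_nxlsa;                 // next record to undo, saved
  Clr clr;
  clr.undo_nxlsa = (undo_nxlsa_override != nullptr) ? *undo_nxlsa_override : prev_lsa;
  clr.lsa = append_record (tdes, next_free_lsa);  // cursor is dragged forward here
  tdes.undo_nxlsa = prev_lsa;                     // CLR must not become an undo target
  return clr;
}
```

In the model, the caller first moves `undo_nxlsa` to the record's prev_tranlsa (the driver's job in 9.3) and only then appends, so the CLR stores the next record to undo rather than the one being undone.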

9.6 Chapter summary — key takeaways

The pre-pass retires losers whose undo_nxlsa is NULL — log_complete writes the missing LOG_ABORT, logtb_free_tran_index frees the slot — and rv_delete_all_tdes_if prunes finished system TDESes.
The driver always undoes the globally largest undo_nxlsa (recomputed each record via logtb_rv_read_only_map_undo_tdes): a strictly backward, page-at-a-time sweep.
tdes->undo_nxlsa advances before the undo executes, so every CLR carries the correct resume point; log_append_compensate_internal restores it after appending so the CLR is never undone — undo never undoes an undo, making the pass idempotent across repeated crashes.
LOG_COMPENSATE and LOG_SYSOP_END are pure cursor redirections during undo (compensate->undo_nxlsa, lastparent_lsa / compensate_lsa) — a crashed rollback resumes without repetition.
log_rv_undo_record forks on RCV_IS_LOGICAL_LOG: physical undo = CLR then undofun; logical undo = sysop bracket (log_sysop_start to log_sysop_end_logical_compensate) that analysis aborts if half-done and undo skips if sealed; RCV_IS_LOGICAL_COMPENSATE_MANUAL undofuns compensate manually.
The only tolerated failure is an unfixable data page — CLR still written (NULL pgptr) plus ER_LOG_MAYNEED_MEDIA_RECOVERY; everything else is logpb_fatal_error, because a half-applied undo with its CLR on disk would lie to the next restart.

Chapter 10: The RV_fun Dispatch Table

Every redo, undo, compensation replay, and logdump print indexes one global array: RV_fun[] in recovery.c. The drivers of Ch 6, 7, and 9 know nothing about heap or b-tree semantics — only how to find the right function pointer. This chapter covers the entry layout, the index-equals-position invariant, NULL arms, and the shared packed-change machinery; theory lives in the high-level companion (“ARIES in CUBRID”, “Recovery Function Dispatch”).

10.1 The rvfun entry and the table it forms

Each slot is a struct rvfun (recovery.h):

// rvfun -- src/transaction/recovery.h
struct rvfun
{
  using fun_t = int (*)(THREAD_ENTRY * thread_p, LOG_RCV * logrcv);
  using dump_fun_t = void (*)(FILE * fp, int length, void *data);
  LOG_RCVINDEX recv_index;  /* For verification */
  const char *recv_string;
  fun_t undofun;
  fun_t redofun;
  dump_fun_t dump_undofun;
  dump_fun_t dump_redofun;
};

Field	Role	Why it exists
`recv_index`	`LOG_RCVINDEX` this slot claims	Compared to slot position by `rv_check_rvfuns`
`recv_string`	Printable name (`"RVHF_INSERT"`)	`logdump` and fatal errors, via `rv_rcvindex_string`
`undofun`	Rollback, undo pass (Ch 9), redo of `LOG_COMPENSATE`	CLR payloads are undo-direction; `NULL` = never logs undo data
`redofun`	Redo pass (Ch 6/7), run-postpone (Ch 8)	`NULL` = never logs redo data (undo-only logical records)
`dump_undofun`	Debug printer, undo payload	`logdump` only, via `log_dump_data`
`dump_redofun`	Debug printer, redo payload	`NULL` = payload not formatted

RV_fun[] is an aggregate initializer, one literal per LOG_RCVINDEX, from RVDK_NEWVOL (NULL undo arm — volume creation is redo-only) to RVHF_LOB_REMOVE_DIR. Arms are often mirror pairs (RVDK_UNRESERVE_SECTORS: undo disk_rv_reserve_sectors, redo disk_rv_unreserve_sectors).

flowchart LR
    REDO["redo + run-postpone Ch 6-8"] --> R["redofun"]
    CLR["LOG_COMPENSATE replay"] --> U["undofun"]
    UNDO["undo + rollback Ch 9"] --> U
    DUMP["logdump"] --> D["dump arms"]

Figure 10-1: consumers of each rvfun arm; compensate replay crosses to undofun.

The crossed wire is explicit in log_rv_get_fun<LOG_REC_COMPENSATE> (log_recovery_redo.hpp): its body is return RV_fun[rcvindex].undofun; — comment // yes, undo. Hence RVBT_RECORD_MODIFY_COMPENSATE registers btree_rv_redo_record_modify as undofun with NULL redo: the CLR payload is redo-format, replayed only through undofun.

10.2 The index-equals-position invariant

LOG_RCVINDEX (recovery.h) is an explicitly numbered enum, RVDK_NEWVOL = 0 through RVHF_LOB_REMOVE_DIR = 129, closed by two specials: RV_LAST_LOGID = RVHF_LOB_REMOVE_DIR (an alias, not a slot) and RV_NOT_DEFINED = 999 (sentinel; must never index RV_fun). Its head comment mandates new entries at the bottom, “to AVOID OLD DATABASES TO BE RECOVERED UNDER OLD FILE”.

Invariant (table ordering): for every i in [0, DIM(RV_fun)), RV_fun[i].recv_index == i. The rcvindex in each on-disk log record is the array subscript — dispatch is an unchecked array load. Enforcement runs once at startup, debug builds only:

// rv_check_rvfuns -- src/transaction/recovery.c
  for (i = 0; i < num_indices; i++)        /* num_indices = DIM (RV_fun) */
    if (RV_fun[i].recv_index != i)
      {
  // ... condensed: er_log_debug "out of sequence" ...
  er_set (ER_FATAL_ERROR_SEVERITY, ARG_FILE_LINE, ER_GENERIC_ERROR, 0);
  assert (false);
  break;   /* <- first mismatch only; one insertion shifts all later slots */
      }

Branch accounting: one loop, one conditional — a match falls through; a mismatch logs, raises a fatal-severity error, asserts, breaks. Function and call site vanish under NDEBUG (the call opens log_initialize_internal, log_manager.c); a misordered release-build table is caught by nothing — recovery applies some other index’s function to each payload.
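To see why a single misplaced literal poisons every later slot, the check can be reproduced on a toy table. `Entry` and `first_out_of_sequence` are hypothetical names for this sketch; only the comparison `recv_index != i` is taken from the excerpt above.

```cpp
#include <cassert>
#include <cstddef>

// Reduced rvfun: just the verification field and the printable name.
struct Entry
{
  int recv_index;
  const char *recv_string;
};

// Position of the first slot whose recv_index differs from its position,
// or -1 when the table is in sequence.
template <std::size_t N>
int first_out_of_sequence (const Entry (&table)[N])
{
  for (std::size_t i = 0; i < N; i++)
    {
      if (table[i].recv_index != (int) i)
        return (int) i;
    }
  return -1;
}
```

A table with one literal inserted at slot 1 reports position 1, yet every slot after it is equally wrong, which is why the real check can afford to stop at the first mismatch.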

rv_rcvindex_string trusts the invariant: its whole body is return RV_fun[rcvindex].recv_string; — no bounds check, so RV_NOT_DEFINED must never reach it. (A stale recovery.c header comment still directs authors to rv_rcvindex_string() for new names.)

10.3 NULL arms and where they are policed

A NULL arm is a contract about logging, not a recovery-time fallback: no record with this rcvindex ever carries data for that direction. Enforcement lives at append time, in CUBRID_DEBUG blocks (log_manager.c): log_append_undoredo_crumbs asserts both arms non-NULL, log_append_undo_crumbs only undofun, log_append_redo_crumbs only redofun; rollback adds assert (RV_fun[rcvindex].undofun != NULL). At recovery only log_rv_redo_record is defensive — a NULL redofun merely logs a warning — while log_rv_undo_record calls the arm with no NULL test: the append-time contract is its only safety net.

Dump arms take (FILE *, int length, void *data) — payload only, no page — since logdump runs offline; printers are generic (log_rv_dump_char, log_rv_dump_hexa) or subsystem decoders (disk_rv_dump_hdr).

10.4 Index families and the RCV_IS_* macro overlay

The prefix encodes the owning subsystem; the append-only rule scatters late additions to 124–129 regardless of family: RVDK_* 0–9 (disk) · RVFL_* 10–32, 128 (file mgr; 128 = TDE) · RVHF_* 33–53, 126, 129 (heap) · RVOVF_* 54–57 (overflow) · RVEH_* 58–65 (ext hash) · RVBT_* 66–91, 124–125 (b-tree; 124–125 = online index) · RVCT_* 92–96 (catalog) · RVLOG_* 97 (log no-op) · RVREPL_* 98–103 (replication; HA shipping, not page recovery) · RVVAC_* 104–117 (vacuum) · RVES_* 118 (external storage) · RVLOC_* 119 (locator dummy) · RVPGBUF_* 120–123, 127 (page buffer; 127 = TDE).
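The ranges above translate directly into a lookup, which makes the scattering of late additions visible. This is an illustrative helper only: `family_of` is a hypothetical function, and the numeric ranges are copied from the paragraph above rather than read from recovery.h.

```cpp
#include <cassert>
#include <string>

// Map a recovery index to its family prefix; "" means out of range.
std::string family_of (int rcvindex)
{
  // Late additions live at 124..129 regardless of family (append-only rule).
  switch (rcvindex)
    {
    case 124: case 125: return "RVBT";      // online index
    case 126: case 129: return "RVHF";
    case 127: return "RVPGBUF";             // TDE
    case 128: return "RVFL";                // TDE
    default: break;
    }

  struct Range { int first; int last; const char *family; };
  static const Range ranges[] = {
    {0, 9, "RVDK"}, {10, 32, "RVFL"}, {33, 53, "RVHF"}, {54, 57, "RVOVF"},
    {58, 65, "RVEH"}, {66, 91, "RVBT"}, {92, 96, "RVCT"}, {97, 97, "RVLOG"},
    {98, 103, "RVREPL"}, {104, 117, "RVVAC"}, {118, 118, "RVES"},
    {119, 119, "RVLOC"}, {120, 123, "RVPGBUF"},
  };
  for (const Range &r : ranges)
    {
      if (rcvindex >= r.first && rcvindex <= r.last)
        return r.family;
    }
  return "";
}
```

A value such as RV_NOT_DEFINED (999) falls through every range and yields the empty string, where the real table would be indexed out of bounds.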

The RCV_IS_* macros (Ch 1) are a second axis: the index value selects the function; its macro membership selects the protocol around the call in log_rv_undo_record’s six-way ladder (Ch 9). Indices in RCV_IS_LOGICAL_COMPENSATE_MANUAL (fed by RCV_IS_BTREE_LOGICAL_LOG) get rcv->reference_lsa preloaded from tdes->undo_nxlsa and their undofun logs its own CLR; indices failing RCV_IS_LOGICAL_LOG get a driver-side log_append_compensate first. RCV_IS_LOGICAL_RUN_POSTPONE_MANUAL is the postpone analogue (Ch 8).

Invariant (macro/table coherence): every index named in a RCV_IS_* macro must keep an arm whose internal logging matches the protocol the macro routes it to. Nothing checks this mechanically; a RCV_IS_LOGICAL_COMPENSATE_MANUAL index whose undofun logs no CLR leaves undo_nxlsa pointing at the same record: infinite rollback loop or missing-CLR crash at next restart.

10.5 The packed partial-change mini-format

Many slotted-page entries log a sequence of splices instead of whole records. One splice unit is short offset_to_data | byte A | byte B | payload, padded to INT_ALIGNMENT, as produced by the packers in log_recovery.c:

// log_rv_pack_redo_record_changes -- src/transaction/log_recovery.c
  assert (offset_to_data >= 0 && offset_to_data <= 0x8FFF);
  /* <- intends flag bits clear; mask 0xC000 needs <= 0x3FFF, so 0x8FFF looks like a source typo */
  // ... condensed: asserts both sizes <= 255 (single wire bytes); PTR_ALIGN to INT_ALIGNMENT ...
  OR_PUT_SHORT (ptr, (short) offset_to_data);  ptr += OR_SHORT_SIZE;
  OR_PUT_BYTE (ptr, (INT16) old_data_size);    ptr += OR_BYTE_SIZE;
  OR_PUT_BYTE (ptr, (INT16) new_data_size);    ptr += OR_BYTE_SIZE;
  if (new_data_size > 0)
    { memcpy (ptr, new_data, new_data_size);   ptr += new_data_size; }
  // ... condensed: trailing PTR_ALIGN ...

log_rv_pack_undo_record_changes differs in exactly two ways: the two OR_PUT_BYTE lines are swapped — new_data_size first — and the memcpy payload is old_data, guarded by old_data_size > 0. That asymmetry is the whole trick:

Wire field	In redo data	In undo data
`offset_to_data`	splice position	splice position (same)
byte A (“remove size”)	runtime old size	runtime new size — bytes undo strips
byte B (“insert size”)	runtime new size	runtime old size — bytes undo restores
payload	new data	old data

The packers pre-swap, so one interpreter serves both directions with no direction flag.
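The pre-swap is easiest to see on actual bytes. The sketch below packs into a `std::vector` instead of a raw pointer; `Bytes`, `pack_unit`, `pack_redo`, and `pack_undo` are hypothetical names, and three details are assumptions of the sketch rather than facts from the source: a 4-byte INT_ALIGNMENT, a big-endian short, and the flag-safe 0x3FFF bound in place of the source's 0x8FFF.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Pad with zero bytes to a 4-byte boundary (assumed INT_ALIGNMENT).
static void align4 (Bytes &out)
{
  while (out.size () % 4 != 0)
    out.push_back (0);
}

// One wire unit: short offset | byte A (remove) | byte B (insert) | payload.
static void pack_unit (Bytes &out, int offset_to_data, std::size_t remove_size,
                       std::size_t insert_size, const std::string &payload)
{
  assert (offset_to_data >= 0 && offset_to_data <= 0x3FFF);   // flag-safe bound (assumption)
  assert (remove_size <= 255 && insert_size <= 255);          // single wire bytes
  align4 (out);
  out.push_back ((std::uint8_t) (offset_to_data >> 8));       // big-endian short (assumption)
  out.push_back ((std::uint8_t) (offset_to_data & 0xFF));
  out.push_back ((std::uint8_t) remove_size);
  out.push_back ((std::uint8_t) insert_size);
  out.insert (out.end (), payload.begin (), payload.end ());
  align4 (out);
}

// Redo: remove the old bytes, insert the new ones.
void pack_redo (Bytes &out, int offset, const std::string &old_data, const std::string &new_data)
{
  pack_unit (out, offset, old_data.size (), new_data.size (), new_data);
}

// Undo: sizes pre-swapped, payload is the old data.
void pack_undo (Bytes &out, int offset, const std::string &old_data, const std::string &new_data)
{
  pack_unit (out, offset, new_data.size (), old_data.size (), old_data);
}
```

Both packers funnel into the same `pack_unit`, so the interpreter never needs to know which direction produced the bytes.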

10.6 The interpreter: ordered replay, reversed unreplay

log_rv_undoredo_record_partial_changes is a thin wrapper: after three asserts it places the payload in an OR_BUF and calls the recursive core. The core is recursive because undo must apply splices in reverse log order, since each offset_to_data was computed against the record as that splice saw it:

// log_rv_undoredo_partial_changes_recursive -- src/transaction/log_recovery.c
  if (rcv_buf->ptr == rcv_buf->endptr)
    return NO_ERROR;                                   /* (1) clean termination */
  if (rcv_buf->ptr + OR_SHORT_SIZE + 2 * OR_BYTE_SIZE > rcv_buf->endptr)
    { assert_release (false); return ER_TF_BUFFER_OVERFLOW; }   /* (2) truncated unit */
  offset_to_data = (int) or_get_short (rcv_buf, &error_code);   /* (3,4,5) per-field errors */
  // ... condensed: old_data_size, new_data_size; each returns error_code on failure ...
  if (new_data_size > 0)
    { new_data = rcv_buf->ptr;
      error_code = or_advance (rcv_buf, new_data_size); /* (6) payload overruns buffer */ }
  else
    new_data = NULL;                                    /* (7) pure deletion splice */
  or_align (rcv_buf, INT_ALIGNMENT);                    /* <- mirrors packer's PTR_ALIGN */
  if (!is_undo)
    RECORD_REPLACE_DATA (record, offset_to_data, old_data_size, new_data_size, new_data);
  error_code = log_rv_undoredo_partial_changes_recursive (thread_p, rcv_buf, record, is_undo);
  if (error_code != NO_ERROR)
    { assert_release (false); return error_code; }      /* (8) deeper error skips this splice */
  if (is_undo)
    RECORD_REPLACE_DATA (record, offset_to_data, old_data_size, new_data_size, new_data);
  return NO_ERROR;

(7) is legal because RECORD_REPLACE_DATA (storage_common.h) skips its memcpy when insert size is 0.

flowchart TD
    A["parse unit i"] --> B{"is_undo?"}
    B -- "no" --> C["apply splice i, then recurse into i+1"]
    B -- "yes" --> D["recurse into i+1, apply splice i on unwind"]

Figure 10-2: redo applies before recursing; undo applies on unwind, reversing order for free.
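The apply-before versus apply-on-unwind trick can be tried on strings. This is a minimal sketch, not the real interpreter: `Splice` and `apply_splices` are hypothetical, the wire parsing and error paths are dropped, and `std::string::replace` stands in for RECORD_REPLACE_DATA.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One already-parsed unit: where to splice, how many bytes to remove,
// and what to insert (empty insert = pure deletion).
struct Splice
{
  std::size_t offset;
  std::size_t remove_size;
  std::string insert;
};

// Redo applies unit i before recursing; undo applies it on the way back,
// so undo units run in reverse order without any explicit reversal.
void apply_splices (const std::vector<Splice> &units, std::size_t i,
                    std::string &record, bool is_undo)
{
  if (i == units.size ())
    return;   // clean termination
  const Splice &s = units[i];
  if (!is_undo)
    record.replace (s.offset, s.remove_size, s.insert);
  apply_splices (units, i + 1, record, is_undo);
  if (is_undo)
    record.replace (s.offset, s.remove_size, s.insert);
}
```

With redo units `{0, 5, "bye"}, {4, 5, "all"}` the record "hello world" becomes "bye all"; the matching pre-swapped undo units `{0, 3, "hello"}, {4, 3, "world"}` restore it only when applied on unwind, because the second offset is valid solely against the record as it looked after the first splice.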

10.7 log_rv_record_modify_internal and the thin wrappers

The generic record modifier reads two flag bits smuggled into rcv->offset (LOG_RV_RECORD_SET_MODIFY_MODE, mask LOG_RV_RECORD_MODIFY_MASK = 0xC000, log_append.hpp; §10.5’s 0x8FFF assert intends to protect these bits, though a flag-safe bound would be 0x3FFF):

`flags`	Meaning	Redo action	Undo action
`LOG_RV_RECORD_INSERT` (0x8000)	record inserted	`spage_insert_at`	`spage_delete`
`LOG_RV_RECORD_DELETE` (0x4000)	record deleted	`spage_delete`	`spage_insert_at`
`LOG_RV_RECORD_UPDATE_ALL` (0xC000)	full replacement	`spage_update`	`spage_update` (per-arm payload)
`LOG_RV_RECORD_UPDATE_PARTIAL` (0x0000)	splice chain	splice forward, `spage_update`	splice reversed, `spage_update`

// log_rv_record_modify_internal -- src/transaction/log_recovery.c
  INT16 flags = rcv->offset & LOG_RV_RECORD_MODIFY_MASK;
  PGSLOTID slotid = rcv->offset & (~LOG_RV_RECORD_MODIFY_MASK);
  if ((!is_undo && LOG_RV_RECORD_IS_INSERT (flags)) || (is_undo && LOG_RV_RECORD_IS_DELETE (flags)))
    { /* ... condensed: unpack type byte + body; spage_insert_at ... */ }
  else if ((!is_undo && LOG_RV_RECORD_IS_DELETE (flags)) || (is_undo && LOG_RV_RECORD_IS_INSERT (flags)))
    { /* ... condensed: spage_delete ... */ }
  else if (LOG_RV_RECORD_IS_UPDATE_ALL (flags))
    { /* ... condensed: unpack type + body; spage_update ... */ }
  else
    {
      assert (LOG_RV_RECORD_IS_UPDATE_PARTIAL (flags));
      // ... condensed: spage_get_record (..., COPY);          /* <- splice on a private copy */
      //     log_rv_undoredo_record_partial_changes (..., is_undo); spage_update ...
    }
  pgbuf_set_dirty (thread_p, rcv->pgptr, DONT_FREE);           /* <- every success path lands here */
  return NO_ERROR;

The four arms are mutually exclusive and exhaustive; every failure path is assert_release (false) plus ER_FAILED before the dirty mark, so a failed arm never advertises a half-applied page. (One blemish: the UPDATE_PARTIAL arm’s failed spage_update returns error_code — still NO_ERROR there — asserted but not propagated.) log_rv_redo_record_modify / log_rv_undo_record_modify are one-line wrappers binding is_undo to false/true, giving the table two distinct pointers.
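The flag decoding and the insert/delete swap fit in a small model. The constants below come from the table above; `Mode`, `Decoded`, `decode_offset`, and `page_action` are hypothetical names, and the sketch uses an unsigned 16-bit value where the source uses INT16.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

const unsigned MODIFY_MASK = 0xC000;   // LOG_RV_RECORD_MODIFY_MASK

enum class Mode { Insert, Delete, UpdateAll, UpdatePartial };

struct Decoded
{
  Mode mode;
  unsigned slotid;
};

// Split the packed offset into its two flag bits and the slot id.
Decoded decode_offset (std::uint16_t raw)
{
  unsigned flags = raw & MODIFY_MASK;
  Decoded d;
  d.slotid = raw & 0x3FFFu;            // everything below the flag bits
  if (flags == 0x8000)
    d.mode = Mode::Insert;
  else if (flags == 0x4000)
    d.mode = Mode::Delete;
  else if (flags == 0xC000)
    d.mode = Mode::UpdateAll;
  else
    d.mode = Mode::UpdatePartial;
  return d;
}

// Which slotted-page call each arm ends in; insert and delete swap under undo.
std::string page_action (Mode mode, bool is_undo)
{
  switch (mode)
    {
    case Mode::Insert:
      return is_undo ? "spage_delete" : "spage_insert_at";
    case Mode::Delete:
      return is_undo ? "spage_insert_at" : "spage_delete";
    default:
      return "spage_update";           // full replacement or spliced copy
    }
}
```

The same arithmetic shows the problem with the packer's bound: `0x8FFF & 0xC000` is non-zero, so an offset that large would read as an insert flag, while `0x3FFF & 0xC000` is zero.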

The b-tree indices RVBT_RECORD_MODIFY_UNDOREDO / _NO_UNDO / _COMPENSATE register btree_rv_redo_record_modify / btree_rv_undo_record_modify (btree.c) instead; their core btree_rv_record_modify_internal clones this ladder (wider BTREE_RV_FLAGS_MASK, same call into log_rv_undoredo_record_partial_changes) plus node-header upkeep.

10.8 The registration ritual — adding a new LOG_RCVINDEX

Append the enumerator at the bottom of LOG_RCVINDEX; retarget RV_LAST_LOGID. Never renumber — values persist in on-disk logs.
Append the matching rvfun literal at the last slot of RV_fun[], string = enumerator spelling.
Pick arms by logging discipline: both for undoredo, redo-only for redo/postpone, undo-only for logical-undo or compensate-replay (§10.1’s crossed wire).
If the index needs a manual compensation/postpone protocol, add it to the right RCV_IS_* macro and implement that protocol (§10.4 invariant).
Build with asserts and boot: rv_check_rvfuns is the only mechanical check. A skipped or transposed slot dies fatally there; a release build applies the wrong function to every record from the bad slot on.

10.9 Chapter summary — key takeaways

RV_fun[] maps the on-disk rcvindex to function pointers via an unchecked array load; recv_index == position, checked only by rv_check_rvfuns at debug startup, is load-bearing for every pass; the enum is append-only because the numbers persist in logs.
NULL arms encode logging contracts policed at append time; recovery mostly trusts them.
The redo pass replays LOG_COMPENSATE through the undofun slot, so compensate-only indices register their redo-direction function as undofun.
The RCV_IS_* macros pick the compensation/postpone protocol wrapped around the undofun call in log_rv_undo_record (Ch 9).
The packed splice format is direction-agnostic — the undo packer pre-swaps size bytes and stores old data — so one interpreter replays redo forward, undo reversed on unwind; log_rv_record_modify_internal layers a four-way ladder over it, insert/delete arms swapping under undo, and the b-tree clones the machinery rather than registering the generic wrappers.

Chapter 11: Special Paths

Off the main crash-restart lifecycle: point-in-time restore, log truncation, post-restore archive/volume discard, append-point repair, execution-context shims, and the 2PC handoff.

11.1 The stopat contract

Two log_recovery parameters gate everything: ismedia_crash and stopat (high-level doc, “Restart orchestrator”):

// log_recovery -- src/transaction/log_recovery.c
  if (ismedia_crash != false)
    {
      /* Media crash, we may have to start from an older checkpoint... check disk headers */
      (void) fileio_map_mounted (thread_p, (bool (*)(THREAD_ENTRY *, VOLID, void *)) log_rv_find_checkpoint, &rcv_lsa);
    }
  else
    {
      // ... condensed ...
      if (stopat != NULL)
  {
    *stopat = -1;   /* <- normal restart: point-in-time target forcibly disabled */
  }
    }

Invariant — incomplete recovery happens only on the media-crash path. A normal restart neutralizes stopat; a broken page outside media recovery is fatal, not truncated — otherwise analysis could silently destroy committed transactions. log_rv_find_checkpoint picks the oldest checkpoint among restored volumes.

Three analysis-time triggers cut the log, all converging on log_recovery_resetlog with *did_incom_recovery = true:

Commit/abort newer than the target. Without a media crash, log_rv_analysis_complete jumps straight to its end label, freeing the tran index (Chapter 4). Otherwise it reads LOG_REC_DONETIME; when *stop_at is set and difftime (*stop_at, last_at_time) < 0, it releases the held page (log_lsa->pageid = NULL_PAGEID), calls log_recovery_resetlog with the record’s header LSA, so that the new log ends before the too-new record, and returns NO_ERROR.
Commit-with-postpone newer than the target. log_rv_analysis_commit_with_postpone runs the same difftime test on LOG_REC_START_POSTPONE.at_time inside its if (is_media_crash) arm; the else arm is normal Chapter 5 bookkeeping. A passing time check on the media branch does nothing.
Physically broken page. When logpb_fetch_page fails in the log_recovery_analysis loop, the media branch stores last_at_time into *stop_at, rewinds the last transaction’s tail_lsa/undo_nxlsa to log_rec->prev_tranlsa (the half-written record is never undone), re-fetches the previous page (failure fatal: “reset log is impossible”), calls log_recovery_resetlog with prev_lsa/prev_prev_lsa, resets log_Gl.mvcc_table.reset_start_mvccid (), and returns. The non-media branch is fatal — TDE-specific message when er_errid () is ER_TDE_CIPHER_IS_NOT_LOADED.
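The time test shared by the first two triggers, together with the gate from the start of this section, can be sketched as two small functions. `neutralize_stopat` and `newer_than_target` are hypothetical names; treating -1 as "no target" follows the `*stopat = -1` assignment quoted above, and checking for it inside the test is an assumption about what "when *stop_at is set" means.

```cpp
#include <cassert>
#include <ctime>

// Normal restart: the point-in-time target is switched off.
void neutralize_stopat (bool ismedia_crash, std::time_t *stopat)
{
  if (!ismedia_crash && stopat != nullptr)
    *stopat = (std::time_t) -1;
}

// True when a commit-like record is newer than the restore target,
// i.e. when the log should be cut before that record.
bool newer_than_target (const std::time_t *stop_at, std::time_t record_time)
{
  return stop_at != nullptr
         && *stop_at != (std::time_t) -1
         && std::difftime (*stop_at, record_time) < 0;
}
```

A record stamped exactly at the target is kept: only a strictly negative difference triggers the cut.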

A fourth, milder repair: when a record’s forward LSA is NULL while log_rtype != LOG_END_OF_LOG and *did_incom_recovery == false, the analysis loop calls log_startof_nxrec (§11.5) on log_Gl.hdr.append_lsa, patches log_rec->forw_lsa, rewrites the page (logpb_write_page_to_disk), and sets log_Gl.hdr.next_trid = tran_id in the same block; on failure the append point falls back to end_redo_lsa (“we may destroy a record”).

11.2 log_recovery_resetlog — truncate and re-arm

log_recovery_resetlog is the one function that rewrites history. It asserts LOG_CS_OWN_WRITE_MODE and non-NULL new_prev_lsa, then runs six steps:

Flush what exists. If log_Gl.append.vdes != NULL_VOLDES with an append page held: logpb_flush_pages_direct + logpb_invalid_all_append_pages.
Pick the new append LSA. NULL new_append_lsa → header restarts at 0|0. Otherwise, with no active log or a to-the-past reset at a mid-page offset, the append page is saved so its surviving prefix carries into the recreated log:

// log_recovery_resetlog -- src/transaction/log_recovery.c
      if (log_Gl.append.vdes == NULL_VOLDES
    || (log_Gl.hdr.fpageid > new_append_lsa->pageid && new_append_lsa->offset > 0))
  {
    // ... condensed ... (rationale comment)
    newappend_pgptr = (LOG_PAGE *) aligned_newappend_pgbuf;
    if ((logpb_fetch_page (thread_p, new_append_lsa, LOG_CS_FORCE_USE, newappend_pgptr)) != NO_ERROR)
      {
        newappend_pgptr = NULL;   /* <- tolerated: the page copy is best-effort */
      }
  }
      LOG_RESET_APPEND_LSA (new_append_lsa);

Reset header state. chkpt_lsa = append_lsa (the truncated tail is the new checkpoint), is_shutdown = false, logpb_invalidate_pool.
Two regimes. If log_Gl.append.vdes == NULL_VOLDES || log_Gl.hdr.fpageid > log_Gl.hdr.append_lsa.pageid — no active log, or the append point moved before the active range — the log is rebuilt: arv_num = logpb_get_archive_number (append page - 1) + 1 names the first unneeded archive (-1 fatal); log_recovery_notpartof_archives (§11.3) removes from there up (reason strdup-ed, raw fallback); the header is rewritten as if the log began here: fpageid = nxarv_pageid = append_lsa.pageid, nxarv_num = arv_num, last_arv_num_for_syscrashes = last_deleted_arv_num = -1. A missing active log file is recreated — disk_get_db_creation, fileio_format, logpb_create_header_page, logpb_flush_page, failures fatal — and either way a fresh first append page is created and flushed. Else only nxarv_pageid is clamped down if past the new append page.
Re-seed the append page. logpb_fetch_start_append_page; on success a step-2 saved image is memcpy-ed over the fetched buffer, marked dirty, flushed direct. If logpb_fetch_start_append_page fails, the restore-and-flush is skipped silently — no error is raised — and finalization proceeds regardless.
Finalize. LOG_RESET_PREV_LSA (new_prev_lsa); mvcc_op_log_lsa.set_null () and vacuum_last_blockid = 0 disconnect vacuum from truncated ranges; was_active_log_reset = true; logpb_flush_header; logpb_decache_archive_info.

Invariant — after resetlog, every position-bearing header field points at or before the new append LSA. chkpt_lsa, fpageid, nxarv_pageid, prev_lsa, and the vacuum anchors are rewritten in one LOG_CS critical section; a missed field would send vacuum or the archiver chasing truncated pages.
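The regime choice and the invariant can be checked on a reduced header. This is a sketch under heavy simplification: `Hdr`, `rebuild_regime`, `rearm`, and `positions_at_or_before_append` are hypothetical, LSAs are collapsed to page ids, and archive handling, file recreation, and the vacuum anchors are omitted.

```cpp
#include <cassert>

// Page-id-only view of the position-bearing header fields.
struct Hdr
{
  long long fpageid;
  long long append_pageid;
  long long nxarv_pageid;
  long long chkpt_pageid;
};

// Rebuild when there is no active log, or when the new append point
// lies before the first page of the active range.
bool rebuild_regime (bool active_log_mounted, const Hdr &h)
{
  return !active_log_mounted || h.fpageid > h.append_pageid;
}

// Re-arm the header after append_pageid has been reset.
void rearm (bool active_log_mounted, Hdr &h)
{
  h.chkpt_pageid = h.append_pageid;          // truncated tail is the new checkpoint
  if (rebuild_regime (active_log_mounted, h))
    {
      h.fpageid = h.append_pageid;           // log now begins here
      h.nxarv_pageid = h.append_pageid;
    }
  else if (h.nxarv_pageid > h.append_pageid)
    {
      h.nxarv_pageid = h.append_pageid;      // clamp only
    }
}

// The invariant: nothing points past the new append position.
bool positions_at_or_before_append (const Hdr &h)
{
  return h.chkpt_pageid <= h.append_pageid
         && h.fpageid <= h.append_pageid
         && h.nxarv_pageid <= h.append_pageid;
}
```

In this model both regimes leave the invariant satisfied; the rebuild regime additionally moves `fpageid`, which is what makes older archives unneeded.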

11.3 log_recovery_notpartof_archives

Archives start_arv_num and up describe truncated pages. Two scan modes, keyed on whether the active log (a trustworthy header) is mounted:

// log_recovery_notpartof_archives -- src/transaction/log_recovery.c
  if (log_Gl.append.vdes != NULL_VOLDES)
    {
      /* Trust the current log header */
      // ... condensed ... (unformat archives start_arv_num .. nxarv_num - 1)
    }
  else
    {
      /* We don't know where to stop. Stop when an archive is not in the OS */
      for (i = start_arv_num; i <= INT_MAX; i++)
  {
    fileio_make_log_archive_name (logarv_name, log_Archive_path, log_Prefix, i);
    if (fileio_is_volume_exist (logarv_name) == false)
      {
        // ... condensed ... /* <- rebuild name of archive i-1, the LAST removed */
        break;
      }
    fileio_unformat (thread_p, logarv_name);
  }
    }

With info_reason non-NULL and at least one archive removed (start_arv_num != i), a REMOVE ... REASON line goes to the log-info file via log_dump_log_info (single-vs-range format branch); errors other than ER_LOG_MOUNT_FAIL return early, before the header update (the files are already gone). Finally log_Gl.hdr.last_deleted_arv_num = (start_arv_num == i) ? i : i - 1 (set even on a no-op call — a quirk); the header is flushed only when the active log is mounted; logpb_decache_archive_info is left to callers.
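The open-ended scan and the last_deleted_arv_num quirk can be reproduced with a set standing in for the archive directory. `ScanResult` and `remove_archives_from` are hypothetical names; erasing from the set replaces fileio_unformat, and only the final ternary is taken from the text above.

```cpp
#include <cassert>
#include <set>

struct ScanResult
{
  int removed;                // how many archives were unformatted
  int last_deleted_arv_num;   // value the header would receive
};

// Remove consecutive archives starting at start_arv_num, stopping at the
// first number that is not on disk.
ScanResult remove_archives_from (int start_arv_num, std::set<int> &on_disk)
{
  int i = start_arv_num;
  while (on_disk.count (i) != 0)
    {
      on_disk.erase (i);   // stands in for fileio_unformat
      i++;
    }
  ScanResult result;
  result.removed = i - start_arv_num;
  result.last_deleted_arv_num = (start_arv_num == i) ? i : i - 1;
  return result;
}
```

When nothing is removed the function still reports `start_arv_num` as the last deleted archive, which is the no-op quirk noted above; a gap in the numbering also ends the scan, so archives beyond the gap survive.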

11.4 log_recovery_notpartof_volumes

When did_incom_recovery is set, the driver calls log_recovery_notpartof_volumes after the undo pass. The boundary: start_volid = boot_find_next_permanent_volid (thread_p), the first volid the restored catalog does not know about. Two sweeps:

Sweep 1 — already-mounted volumes. fileio_map_mounted runs log_unformat_ahead_volumes over every mounted volume: if volid != NULL_VOLID && volid >= *start_volid, buffer-pool pages are dropped first (pgbuf_invalidate_all, so no stale dirty page is later flushed into it), then the volume is fileio_unformat-ed and its label freed. If invalidation fails the callback returns false, stopping the map early; stragglers fall to sweep 2.

Sweep 2 — volumes lying around on disk. Extension-named candidates are probed from start_volid to LOG_MAX_DBVOLID, breaking at the first missing name. Each candidate is mounted, its creation time read via disk_get_db_creation, and dismounted; only if difftime (vol_dbcreation, log_Gl.hdr.db_creation) == 0 is it unformatted. The db_creation timestamp is the identity test: an unrelated database’s same-named volume is a deliberate NO-OP. A candidate that exists but fails to mount (vdes == NULL_VOLDES) is skipped silently and never unformatted. The extension directory derives from log_Db_fullname (empty-string fallback on malloc or fileio_get_directory_path failure). logpb_recreate_volume_info then rebuilds the volume-info file.

11.5 log_startof_nxrec

log_startof_nxrec answers: where does the next record start? Analysis uses it (§11.1, repair 4) when the last record’s forw_lsa is NULL but the record is complete. Branches:

NULL input LSA → return NULL; logpb_fetch_page failure → goto error. lsa->offset == NULL_OFFSET (page from an archive cut mid-record) → adopt log_pgptr->hdr.offset, the first record the page knows; still NULL → error.
canuse_forwaddr == true → take log_rec->forw_lsa; if NULL but the page lives in an archive, the next record can only be at pageid + 1 (incomplete record archived, completed later). Only if still NULL does it fall to manual scan.
Manual scan: advance past LOG_RECORD_HEADER, then a switch (type) steps over the type-specific header plus every variable payload — undo/redo images (GET_ZIP_LEN-decoded), postpone/compensate lengths, checkpoint arrays, savepoint names, 2PC and replication payloads, the sysop family’s conditional embedded undo image (Chapter 5); fixed-size markers just break. The epilogue LOG_READ_ADVANCE_WHEN_DOESNT_FIT rounds up to the next page when another record header cannot fit.
Two quirks: LOG_SUPPLEMENTAL_INFO lacks a break and falls through into the marker group, which is currently harmless because those cases do nothing; LOG_END_OF_LOG hits assert (false), since no caller asks for the record after end-of-log.
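The branch order before the manual scan can be summarized as a small decision function. This is a sketch of one reading of the branches above, not the CUBRID implementation: lsa_sketch, nxrec_source, and decide_nxrec are hypothetical names, the archive fallback is assumed to sit under the canuse_forwaddr branch, and the offset of the pageid + 1 result is assumed to stay at the NULL offset so that the NULL_OFFSET branch resolves it from the page header later.

```cpp
#include <cassert>
#include <cstdint>

const std::int64_t NULL_PAGEID_SKETCH = -1;   // assumption: sentinel page id
const std::int32_t NULL_OFFSET_SKETCH = -1;   // assumption: sentinel offset

struct lsa_sketch
{
  std::int64_t pageid;
  std::int32_t offset;
};

inline bool
lsa_is_null (const lsa_sketch &lsa)
{
  return lsa.pageid == NULL_PAGEID_SKETCH;
}

enum class nxrec_source { forward_address, next_archive_page, manual_scan };

struct nxrec_decision
{
  nxrec_source source;
  lsa_sketch lsa;      // meaningful unless source == manual_scan
};

inline nxrec_decision
decide_nxrec (const lsa_sketch &cur, const lsa_sketch &forw_lsa, bool canuse_forwaddr, bool page_in_archive)
{
  if (canuse_forwaddr)
    {
      if (!lsa_is_null (forw_lsa))
        {
          return {nxrec_source::forward_address, forw_lsa};
        }
      if (page_in_archive)
        {
          // Incomplete record was archived and completed later: next page, offset unresolved.
          return {nxrec_source::next_archive_page, {cur.pageid + 1, NULL_OFFSET_SKETCH}};
        }
    }
  // Nothing usable: the caller must walk the record type by type.
  return {nxrec_source::manual_scan, {NULL_PAGEID_SKETCH, NULL_OFFSET_SKETCH}};
}
```

The manual scan itself is deliberately left out, since its per-type lengths depend on record layouts that this sketch does not model.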

11.6 The simulate/end shims

Undo, postpone, and sysop-abort code expects the current transaction in thread_p->tran_index or an attached system tdes. Recovery’s thread owns LOG_SYSTEM_TRAN_INDEX and walks other transactions’ chains, so each per-tdes operation is bracketed by a shim pair:

// log_rv_simulate_runtime_worker -- src/transaction/log_recovery.c
  if (tdes->is_active_worker_transaction ())
    {
      thread_p->tran_index = tdes->tran_index;   /* <- runtime code now sees this tdes as "mine" */
      // ... condensed ... (SA_MODE: mirror via LOG_SET_CURRENT_TRAN_INDEX)
    }
  else if (tdes->is_system_worker_transaction ())
    {
      log_system_tdes::rv_simulate_system_tdes (tdes->trid);   /* <- attach system tdes to thread */
    }
  else
    {
      assert (false);
    }

// log_rv_end_simulation -- src/transaction/log_recovery.c
  thread_p->reset_system_tdes ();
  thread_p->tran_index = LOG_SYSTEM_TRAN_INDEX;   /* <- unconditional restore */
  // ... condensed ... (SA_MODE: mirror restore)

Both shims keep the SA-mode global mirror (LOG_SET_CURRENT_TRAN_INDEX under #if defined (SA_MODE)). For a system worker transaction (Chapter 4’s rebuilt log_system_tdes population) rv_simulate_system_tdes looks the trid up in systb_System_tdes (asserting on a miss) and installs it via set_system_tdes.

Invariant — every simulate is paired with an end; the thread is back on LOG_SYSTEM_TRAN_INDEX between transactions. log_rv_undo_record (Chapter 9) closes the pair after its end: label, so error paths restore the thread too; log_recovery_finish_all_postpone and log_recovery_abort_all_atomic_sysops (Chapter 8) wrap it in a per-tdes lambda and assert tran_index == LOG_SYSTEM_TRAN_INDEX on entry. A missing end would leave a stale system tdes attached, logging for the wrong transaction.
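The pairing invariant can be illustrated with a scope guard over mock types. This is not how the code above enforces it (the text describes an end: label and per-tdes lambdas); the guard is only an alternative sketch of the same contract. thread_sketch, tdes_sketch, simulation_guard, and the constant value 0 for the system transaction index are all assumptions made for the example.

```cpp
#include <cassert>

const int SYSTEM_TRAN_INDEX_SKETCH = 0;   // assumption: index owned by the recovery thread
const int NO_SYSTEM_TRID_SKETCH = -1;     // assumption: "no system tdes attached"

struct thread_sketch
{
  int tran_index;
  int attached_system_trid;
};

struct tdes_sketch
{
  int tran_index;
  int trid;
  bool is_system_worker;   // false: active worker transaction
};

// Binds the thread to one tdes for the guard's lifetime, then restores it.
class simulation_guard
{
public:
  simulation_guard (thread_sketch &thread, const tdes_sketch &tdes)
    : m_thread (thread)
  {
    assert (thread.tran_index == SYSTEM_TRAN_INDEX_SKETCH);   // entry invariant
    if (tdes.is_system_worker)
      {
        thread.attached_system_trid = tdes.trid;   // attach system tdes to thread
      }
    else
      {
        thread.tran_index = tdes.tran_index;       // runtime code sees this tdes as its own
      }
  }

  ~simulation_guard ()
  {
    // Unconditional restore, mirroring the end shim.
    m_thread.attached_system_trid = NO_SYSTEM_TRID_SKETCH;
    m_thread.tran_index = SYSTEM_TRAN_INDEX_SKETCH;
  }

  simulation_guard (const simulation_guard &) = delete;
  simulation_guard &operator= (const simulation_guard &) = delete;

private:
  thread_sketch &m_thread;
};
```

Because the restore lives in the destructor, an early return inside the guarded scope still leaves the thread on the system index, which is the property the manual pairing has to maintain by hand.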

11.7 The 2PC handoff

After undo and (if needed) log_recovery_notpartof_volumes, the driver counts distributed loose ends:

// log_recovery -- src/transaction/log_recovery.c
  (void) logtb_set_num_loose_end_trans (thread_p);
  /* Try to finish any 2PC blocked transactions */
  if (log_Gl.trantable.num_coord_loose_end_indices > 0 || log_Gl.trantable.num_prepared_loose_end_indices > 0)
    {
      log_Gl.rcv_phase = LOG_RECOVERY_FINISH_2PC_PHASE;
      // ... condensed ...
      log_2pc_recovery (thread_p);
      /* Check number of loose end transactions again.. */
      // ... condensed ... (reset rcv_tdes, re-bind tran index)
      (void) logtb_set_num_loose_end_trans (thread_p);
    }

logtb_set_num_loose_end_trans zeroes both counters under TR_TABLE_CS_ENTER and walks every non-system tdes with a valid trid through logtb_set_loose_end_tdes: LOG_ISTRAN_2PC_PREPARE sets isloose_end and bumps num_prepared_loose_end_indices (in-doubt participant; keeps locks); LOG_ISTRAN_2PC_IN_SECOND_PHASE or TRAN_UNACTIVE_2PC_COLLECTING_PARTICIPANT_VOTES bumps num_coord_loose_end_indices (coordinator re-drives its decision). The driver keys off the two globals, not the returned sum.
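The counting and the handoff test can be modelled in a few lines. The sketch below is illustrative only: loose_end_kind collapses the real state checks (LOG_ISTRAN_2PC_PREPARE and the coordinator states) into a hypothetical three-way enum, and count_loose_ends and needs_finish_2pc_phase are made-up names. Locking and the isloose_end flag are omitted.

```cpp
#include <cassert>
#include <vector>

// Hypothetical condensed classification of one tdes; not CUBRID's TRAN_STATE.
enum class loose_end_kind { none, prepared_participant, coordinator };

struct loose_end_counts
{
  int num_coord_loose_end_indices;
  int num_prepared_loose_end_indices;
};

inline loose_end_counts
count_loose_ends (const std::vector<loose_end_kind> &table)
{
  loose_end_counts counts = {0, 0};   // both counters restart from zero on every call
  for (loose_end_kind kind : table)
    {
      switch (kind)
        {
        case loose_end_kind::prepared_participant:
          counts.num_prepared_loose_end_indices++;   // in-doubt participant
          break;
        case loose_end_kind::coordinator:
          counts.num_coord_loose_end_indices++;      // coordinator must re-drive its decision
          break;
        case loose_end_kind::none:
          break;
        }
    }
  return counts;
}

// The driver's condition: test the two counters, not their sum.
inline bool
needs_finish_2pc_phase (const loose_end_counts &counts)
{
  return counts.num_coord_loose_end_indices > 0 || counts.num_prepared_loose_end_indices > 0;
}
```

Since the counters are rebuilt from scratch on each call, running the count again after log_2pc_recovery reports whatever loose ends remain rather than an accumulated total.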

log_2pc_recovery sweeps the table — skipping tdes == NULL, NULL_TRANID, and !LOG_ISTRAN_2PC (tdes) — and switches on tdes->state: collecting-votes aborts the undecided coordinator, abort/commit-decision re-executes the decision, and TRAN_UNACTIVE_WILL_COMMIT / TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE [[fallthrough]] into informing participants — local postpones are already done (Chapter 8). Vote mechanics belong to the 2PC document and the high-level companion’s “Transaction table with loose-end annotations”; here only the handoff condition matters: the fourth phase runs iff a coordinator or prepared loose end survived analysis. Prepared participants without a verdict legitimately remain in-doubt — hence a recount, not an assert-zero.

11.8 Chapter summary — key takeaways

Incomplete recovery is media-crash-only: a normal restart forces *stopat = -1; a broken page outside media recovery is fatal, never truncated.
Three triggers cut the log — completion or commit-with-postpone newer than stopat, or an unreadable page — all via log_recovery_resetlog; a missing end-of-log is patched via log_startof_nxrec.
log_recovery_resetlog rewrites every position-bearing header field in one LOG_CS section and delegates archive removal to log_recovery_notpartof_archives.
Volume discard is two-phase and identity-checked by db_creation; buffer pages are invalidated before unformat; mount failures are skipped.
log_startof_nxrec walks record lengths type by type — a new payload layout means a new switch arm; LOG_SUPPLEMENTAL_INFO’s missing break is only accidentally harmless.
The simulate/end shims bind the thread to a worker or system tdes so runtime code runs unmodified; pairing is structural; both keep the SA-mode mirror.
LOG_RECOVERY_FINISH_2PC_PHASE runs only when coordinator or prepared loose ends survive; prepared participants may stay in-doubt.

Position hints as of this revision

The following are line numbers as observed on 2026-06-11; symbols are the canonical anchor and line numbers are hints that decay.

Symbol	File	Line
`vacuum_notify_server_crashed`	`src/query/vacuum.c`	7570
`btree_rv_record_modify_internal`	`src/storage/btree.c`	29757
`NULL_OFFSET`	`src/storage/storage_common.h`	49
`RECORD_REPLACE_DATA`	`src/storage/storage_common.h`	231
`log_2pc_recovery_analysis_info`	`src/transaction/log_2pc.c`	2029
`log_2pc_recovery`	`src/transaction/log_2pc.c`	2303
`LOG_RV_RECORD_MODIFY_MASK`	`src/transaction/log_append.hpp`	139
`LOG_PAGE_INIT_VALUE`	`src/transaction/log_common_impl.h`	46
`log_zip`	`src/transaction/log_compress.c`	45
`log_unzip`	`src/transaction/log_compress.c`	112
`log_diff`	`src/transaction/log_compress.c`	176
`log_zip_realloc_if_needed`	`src/transaction/log_compress.c`	203
`log_zip_alloc`	`src/transaction/log_compress.c`	238
`log_zip_free`	`src/transaction/log_compress.c`	279
`GET_ZIP_LEN`	`src/transaction/log_compress.h`	36
`ZIP_CHECK`	`src/transaction/log_compress.h`	39
`log_zip`	`src/transaction/log_compress.h`	53
`LOG_ISTRAN_2PC`	`src/transaction/log_impl.h`	173
`LOG_HAS_LOGGING_BEEN_IGNORED`	`src/transaction/log_impl.h`	190
`log_rcv_tdes`	`src/transaction/log_impl.h`	458
`log_recvphase`	`src/transaction/log_impl.h`	625
`log_cs_access_mode`	`src/transaction/log_impl.h`	923
`log_initialize_internal`	`src/transaction/log_manager.c`	1100
`log_append_compensate`	`src/transaction/log_manager.c`	2985
`log_append_compensate_with_undo_nxlsa`	`src/transaction/log_manager.c`	3011
`log_append_compensate_internal`	`src/transaction/log_manager.c`	3047
`log_sysop_end_recovery_postpone`	`src/transaction/log_manager.c`	4024
`log_complete`	`src/transaction/log_manager.c`	5653
`log_rollback_record`	`src/transaction/log_manager.c`	7349
`log_get_next_nested_top`	`src/transaction/log_manager.c`	8023
`log_do_postpone`	`src/transaction/log_manager.c`	8237
`log_run_postpone_op`	`src/transaction/log_manager.c`	8481
`log_execute_run_postpone`	`src/transaction/log_manager.c`	8543
`log_read_sysop_start_postpone`	`src/transaction/log_manager.c`	9962
`LOGPB_IS_ARCHIVE_PAGE`	`src/transaction/log_page_buffer.c`	155
`logpb_page_has_valid_checksum`	`src/transaction/log_page_buffer.c`	523
`logpb_fetch_page`	`src/transaction/log_page_buffer.c`	1739
`logpb_copy_page`	`src/transaction/log_page_buffer.c`	1871
`logpb_read_page_from_file`	`src/transaction/log_page_buffer.c`	2003
`logpb_fetch_start_append_page`	`src/transaction/log_page_buffer.c`	2504
`logpb_page_get_first_null_block_lsa`	`src/transaction/log_page_buffer.c`	3190
`logpb_is_page_in_archive`	`src/transaction/log_page_buffer.c`	4994
`logpb_copy_from_log`	`src/transaction/log_page_buffer.c`	6532
`logpb_checkpoint`	`src/transaction/log_page_buffer.c`	6877
`logpb_page_check_corruption`	`src/transaction/log_page_buffer.c`	11508
`log_reader`	`src/transaction/log_reader.hpp`	36
`log_reader::set_lsa_and_fetch_page`	`src/transaction/log_reader.hpp`	162
`LOG_READ_ALIGN`	`src/transaction/log_reader.hpp`	315
`log_rec_undo`	`src/transaction/log_record.hpp`	176
`log_vacuum_info`	`src/transaction/log_record.hpp`	192
`log_rec_mvcc_undo`	`src/transaction/log_record.hpp`	211
`log_rec_compensate`	`src/transaction/log_record.hpp`	262
`log_sysop_end_type`	`src/transaction/log_record.hpp`	285
`log_rec_sysop_end`	`src/transaction/log_record.hpp`	305
`log_rec_sysop_start_postpone`	`src/transaction/log_record.hpp`	328
`log_rec_chkpt`	`src/transaction/log_record.hpp`	345
`log_info_chkpt_trans`	`src/transaction/log_record.hpp`	354
`log_info_chkpt_sysop`	`src/transaction/log_record.hpp`	372
`log_rv_undo_record`	`src/transaction/log_recovery.c`	163
`log_rv_redo_record`	`src/transaction/log_recovery.c`	430
`log_rv_fix_page_and_check_redo_is_needed`	`src/transaction/log_recovery.c`	494
`log_rv_need_sync_redo`	`src/transaction/log_recovery.c`	541
`log_rv_find_checkpoint`	`src/transaction/log_recovery.c`	579
`log_rv_get_unzip_log_data`	`src/transaction/log_recovery.c`	609
`log_rv_get_unzip_and_diff_redo_log_data`	`src/transaction/log_recovery.c`	699
`log_recovery`	`src/transaction/log_recovery.c`	736
`log_rv_analysis_undo_redo`	`src/transaction/log_recovery.c`	965
`log_rv_analysis_dummy_head_postpone`	`src/transaction/log_recovery.c`	1000
`log_rv_analysis_postpone`	`src/transaction/log_recovery.c`	1042
`log_rv_analysis_run_postpone`	`src/transaction/log_recovery.c`	1086
`log_rv_analysis_compensate`	`src/transaction/log_recovery.c`	1181
`log_rv_analysis_commit_with_postpone`	`src/transaction/log_recovery.c`	1230
`log_rv_analysis_commit_with_postpone_obsolete`	`src/transaction/log_recovery.c`	1315
`log_rv_analysis_sysop_start_postpone`	`src/transaction/log_recovery.c`	1365
`log_rv_analysis_atomic_sysop_start`	`src/transaction/log_recovery.c`	1472
`log_rv_analysis_complete`	`src/transaction/log_recovery.c`	1509
`log_rv_analysis_sysop_end`	`src/transaction/log_recovery.c`	1612
`log_rv_analysis_start_checkpoint`	`src/transaction/log_recovery.c`	1797
`log_rv_analysis_end_checkpoint`	`src/transaction/log_recovery.c`	1830
`log_rv_analysis_save_point`	`src/transaction/log_recovery.c`	2077
`log_rv_analysis_2pc_prepare`	`src/transaction/log_recovery.c`	2114
`log_rv_analysis_2pc_start`	`src/transaction/log_recovery.c`	2153
`log_rv_analysis_2pc_commit_decision`	`src/transaction/log_recovery.c`	2190
`log_rv_analysis_2pc_abort_decision`	`src/transaction/log_recovery.c`	2224
`log_rv_analysis_2pc_commit_inform_particps`	`src/transaction/log_recovery.c`	2258
`log_rv_analysis_2pc_abort_inform_particps`	`src/transaction/log_recovery.c`	2293
`log_rv_analysis_2pc_recv_ack`	`src/transaction/log_recovery.c`	2328
`log_rv_analysis_log_end`	`src/transaction/log_recovery.c`	2355
`log_rv_analysis_record`	`src/transaction/log_recovery.c`	2378
`log_is_page_of_record_broken`	`src/transaction/log_recovery.c`	2518
`log_recovery_analysis`	`src/transaction/log_recovery.c`	2587
`log_recovery_needs_skip_logical_redo`	`src/transaction/log_recovery.c`	3153
`log_recovery_get_redo_parallel_count`	`src/transaction/log_recovery.c`	3197
`log_recovery_redo`	`src/transaction/log_recovery.c`	3251
`BUILD_RECORD_INFO`	`src/transaction/log_recovery.c`	3468
`INVOKE_REDO_RECORD`	`src/transaction/log_recovery.c`	3471
`log_recovery_abort_interrupted_sysop`	`src/transaction/log_recovery.c`	3960
`log_recovery_finish_sysop_postpone`	`src/transaction/log_recovery.c`	4064
`log_recovery_finish_postpone`	`src/transaction/log_recovery.c`	4174
`log_recovery_finish_all_postpone`	`src/transaction/log_recovery.c`	4243
`log_recovery_abort_all_atomic_sysops`	`src/transaction/log_recovery.c`	4280
`log_recovery_abort_atomic_sysop`	`src/transaction/log_recovery.c`	4317
`log_recovery_undo`	`src/transaction/log_recovery.c`	4418
`log_recovery_notpartof_archives`	`src/transaction/log_recovery.c`	4997
`log_unformat_ahead_volumes`	`src/transaction/log_recovery.c`	5100
`log_recovery_notpartof_volumes`	`src/transaction/log_recovery.c`	5132
`log_recovery_resetlog`	`src/transaction/log_recovery.c`	5221
`log_startof_nxrec`	`src/transaction/log_recovery.c`	5414
`log_recovery_find_first_postpone`	`src/transaction/log_recovery.c`	5793
`log_rv_undoredo_partial_changes_recursive`	`src/transaction/log_recovery.c`	6048
`log_rv_undoredo_record_partial_changes`	`src/transaction/log_recovery.c`	6144
`log_rv_redo_record_modify`	`src/transaction/log_recovery.c`	6173
`log_rv_undo_record_modify`	`src/transaction/log_recovery.c`	6191
`log_rv_record_modify_internal`	`src/transaction/log_recovery.c`	6210
`log_rv_pack_redo_record_changes`	`src/transaction/log_recovery.c`	6310
`log_rv_pack_undo_record_changes`	`src/transaction/log_recovery.c`	6352
`log_rv_redo_fix_page`	`src/transaction/log_recovery.c`	6390
`log_rv_simulate_runtime_worker`	`src/transaction/log_recovery.c`	6417
`log_rv_end_simulation`	`src/transaction/log_recovery.c`	6438
`log_cnt_pages_containing_lsa`	`src/transaction/log_recovery.c`	6449
`log_find_unilaterally_largest_undo_lsa`	`src/transaction/log_recovery.c`	6470
`vpid_lsa_consistency_check::check`	`src/transaction/log_recovery_redo.cpp`	28
`log_rv_redo_context::log_rv_redo_context`	`src/transaction/log_recovery_redo.cpp`	52
`log_rv_redo_context`	`src/transaction/log_recovery_redo.hpp`	33
`log_rv_redo_rec_info`	`src/transaction/log_recovery_redo.hpp`	53
`log_rv_get_log_rec_data`	`src/transaction/log_recovery_redo.hpp`	112
`log_rv_get_log_rec_mvccid`	`src/transaction/log_recovery_redo.hpp`	163
`log_rv_get_log_rec_vpid`	`src/transaction/log_recovery_redo.hpp`	206
`log_rv_get_log_rec_redo_length`	`src/transaction/log_recovery_redo.hpp`	273
`log_rv_get_log_rec_offset`	`src/transaction/log_recovery_redo.hpp`	316
`log_rv_get_fun`	`src/transaction/log_recovery_redo.hpp`	359
`log_rv_get_fun<LOG_REC_COMPENSATE>`	`src/transaction/log_recovery_redo.hpp`	396
`log_rv_get_log_rec_redo_data`	`src/transaction/log_recovery_redo.hpp`	457
`vpid_lsa_consistency_check`	`src/transaction/log_recovery_redo.hpp`	558
`log_rv_redo_record_sync`	`src/transaction/log_recovery_redo.hpp`	587
`redo_task`	`src/transaction/log_recovery_redo_parallel.cpp`	99
`redo_task::execute`	`src/transaction/log_recovery_redo_parallel.cpp`	221
`redo_parallel::add`	`src/transaction/log_recovery_redo_parallel.cpp`	626
`redo_parallel::wait_for_termination_and_stop_execution`	`src/transaction/log_recovery_redo_parallel.cpp`	635
`redo_parallel::wait_past_target_lsa`	`src/transaction/log_recovery_redo_parallel.cpp`	728
`redo_job_impl::execute`	`src/transaction/log_recovery_redo_parallel.cpp`	752
`reusable_jobs_stack::blocking_pop`	`src/transaction/log_recovery_redo_parallel.cpp`	868
`redo_parallel`	`src/transaction/log_recovery_redo_parallel.hpp`	55
`task_active_state_bookkeeping`	`src/transaction/log_recovery_redo_parallel.hpp`	100
`min_unapplied_log_lsa_monitoring`	`src/transaction/log_recovery_redo_parallel.hpp`	131
`redo_job_base`	`src/transaction/log_recovery_redo_parallel.hpp`	215
`redo_job_impl`	`src/transaction/log_recovery_redo_parallel.hpp`	269
`reusable_jobs_stack`	`src/transaction/log_recovery_redo_parallel.hpp`	306
`log_rv_redo_record_sync_or_dispatch_async`	`src/transaction/log_recovery_redo_parallel.hpp`	382
`perf_stats`	`src/transaction/log_recovery_redo_perf.hpp`	105
`log_system_tdes::rv_simulate_system_tdes`	`src/transaction/log_system_tran.cpp`	174
`log_system_tdes::map_all_tdes`	`src/transaction/log_system_tran.cpp`	253
`log_system_tdes::rv_delete_all_tdes_if`	`src/transaction/log_system_tran.cpp`	265
`log_system_tdes::rv_delete_tdes`	`src/transaction/log_system_tran.cpp`	281
`logtb_rv_find_allocate_tran_index`	`src/transaction/log_tran_table.c`	1056
`logtb_rv_assign_mvccid_for_undo_recovery`	`src/transaction/log_tran_table.c`	1115
`logtb_free_tran_index`	`src/transaction/log_tran_table.c`	1202
`logtb_free_tran_index_with_undo_lsa`	`src/transaction/log_tran_table.c`	1281
`logtb_set_loose_end_tdes`	`src/transaction/log_tran_table.c`	4124
`logtb_set_num_loose_end_trans`	`src/transaction/log_tran_table.c`	4170
`logtb_rv_read_only_map_undo_tdes`	`src/transaction/log_tran_table.c`	4204
`mvcctable::reset_start_mvccid`	`src/transaction/mvcc_table.cpp`	600
`RV_fun`	`src/transaction/recovery.c`	54
`rv_rcvindex_string`	`src/transaction/recovery.c`	857
`rv_check_rvfuns`	`src/transaction/recovery.c`	872
`LOG_RCVINDEX`	`src/transaction/recovery.h`	36
`log_rcv`	`src/transaction/recovery.h`	197
`rvfun`	`src/transaction/recovery.h`	221
`RCV_IS_BTREE_LOGICAL_LOG`	`src/transaction/recovery.h`	241
`RCV_IS_LOGICAL_COMPENSATE_MANUAL`	`src/transaction/recovery.h`	253
`RCV_IS_LOGICAL_RUN_POSTPONE_MANUAL`	`src/transaction/recovery.h`	261
`RCV_IS_LOGICAL_LOG`	`src/transaction/recovery.h`	267

Sources

cubrid-recovery-manager.md — the high-level companion. See also cubrid-log-manager-detail.md (how the replayed records were appended) and cubrid-checkpoint.md (the restart anchor).
Raw analyses under raw/code-analysis/cubrid/storage/recovery_manager/.
Code: src/transaction/log_recovery.{c,h}, log_recovery_redo.{cpp,hpp}, log_recovery_redo_parallel.{cpp,hpp}, recovery.{c,h}.
Methodology: knowledge/methodology/code-analysis-detail-doc.md.