CUBRID Reading Path — How a Server Restart Recovers

This is a synthesis doc — a reading path, not a deep dive. It traces the journey a single crashed cub_server takes from SIGKILL mid-flush back to the moment a CAS worker issues SELECT 1. Each step delegates technical detail to a sibling doc; the value of this page is the ordering and the handoffs. If you want the panorama, the diagram in § Diagram compresses everything to a single picture.

Imagine the worst plausible failure short of media loss. The server is mid-work: the checkpoint daemon is flushing dirty pages through the DWB, the prior list holds a dozen unflushed log records, two user transactions are deep in B+Tree splits, a third has just emitted its commit record but its postpones haven’t run. Then the machine loses power, or the OOM killer fires, or kill -9 lands. The process ceases.

The disk a millisecond later carries three categories of damage:

  1. Half-flushed pages. The DWB was mid-batch writing slots back to their home volumes. Some pages reached home; some didn’t. The DWB volume is durable (it was fsync(2)-ed before any home write started), so for every torn home page, a clean copy exists in the DWB.
  2. Prior-list entries that never made it to the log. A handful of LOG_REC_* nodes sat in the prior list awaiting the log flusher. Those records have no on-disk presence, so their changes are lost. The WAL invariant guarantees any data page on disk was preceded by its log record on disk, so if the prior-list entry was lost, the data-page write was also not yet flushed. Consistent.
  3. Partially completed transactions. Committed T_c emitted LOG_COMMIT_WITH_POSTPONE, and that record is durable, so the commit decision survives; its postpones did not run. T_a and T_b are mid-statement: undo chains stretching back twenty records but no commit records. They are losers in ARIES parlance.

The restart’s job is to bring the disk to a state both internally consistent (no torn pages, every committed change present, every uncommitted change gone) and externally correct (T_c’s commit durable; T_a and T_b never happened). Then, and only then, does the network listener open the socket.

The pipeline is nine phases: six mandatory recovery phases, two background follow-ons, and a final phase that opens the door to clients:

  • Phase 1 (process boot): mount volumes, attach to the log.
  • Phase 2 (DWB recovery): heal torn pages.
  • Phase 3 (locate checkpoint): read the log header.
  • Phase 4 (analysis): rebuild the TX/dirty-page tables.
  • Phase 5 (redo): replay forward from the redo-LSA.
  • Phase 6 (undo): roll back losers, emitting CLRs.
  • Phase 7 (vacuum catch-up): background MVCCID reclamation.
  • Phase 8 (HA catch-up): replication resumes.
  • Phase 9 (open listener): accept the first client.

Phases 1-6 run sequentially. Phases 7-8 fan out as background activity in parallel with phase 9; they don’t gate acceptance.

cub_server’s main() (in src/executables/server.c) is a thin shim: parse flags, register signal handlers, call net_server_start (src/communication/network_sr.c). That orchestrator dispatches into boot_restart_server (src/transaction/boot_sr.c), which walks the subsystem-init list in topological order.

The order matters because the dependency graph has cycles: recovery needs the page buffer; the page buffer needs the disk manager; the disk manager wants catalog metadata; the catalog needs recovery to be over. CUBRID breaks the cycle by staging — each subsystem comes up in an early phase that participates in recovery and a late phase that consumes catalog data. See cubrid-boot.md for the topological walk.

Three boot-time actions are load-bearing for the recovery story:

  • Volume opening. The disk manager reads databases.txt, enumerates volumes via the log-info file, opens each by descriptor. Volumes are attached but not yet trusted: their pages may be torn from the mid-flush crash. No client can read or write pages until phase 6 finishes.
  • Log attach. log_initialize (log_manager.c) opens the active log, reads the header (page id -9), inspects hdr.is_shutdown. If true, recovery is skipped. If false — our scenario — it calls log_recovery.
  • Recovery dispatch. log_recovery is the three-pass driver in cubrid-recovery-manager.md. But before it runs, the DWB must have a chance to heal torn pages. That’s phase 2.

At the moment of boot, no client connections are possible: boot_Server_status = BOOT_SERVER_DOWN and the listener is not running, so the OS refuses every connection attempt. “Recover first, listen second” is what prevents partial recovery from leaking to a client.

Before the recovery manager touches the log, the double-write buffer is inspected. The DWB exists for one reason: torn-page protection. A 16 KB CUBRID page sits across many disk sectors (512 B or 4 KiB atomic). A mid-write crash can leave a page on disk with the first half new and the second half old. ARIES redo cannot recover from this — the redo function applies a delta on a coherent page, not a torn one. Postgres uses full-page-image WAL; InnoDB and CUBRID use a doublewrite buffer; SQL Server uses torn-page detection bits.

The DWB runtime invariant: before a dirty page is written to its home volume, a copy is staged in the DWB volume, and the DWB volume is fsync-ed. Then the home write proceeds; if it tears, a clean copy exists in the DWB.
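
A minimal sketch of the ordering that invariant imposes, with hypothetical helper names (dwb_stage_and_sync, write_home are illustrative, not the CUBRID flush API; the real path batches slots per block in double_write_buffer.cpp):

// Sketch only: the ordering that makes torn home pages recoverable.
void
flush_one_dirty_page (PAGE_PTR page, VPID home_vpid)
{
  dwb_stage_and_sync (page, home_vpid); /* 1. copy into a DWB slot, fsync the DWB volume */
  write_home (page, home_vpid);         /* 2. only then write the home volume            */
  /* A crash during step 2 leaves a clean copy in the DWB; a crash before step 1
     completes leaves the home page untouched.  Either way, phase 2 can restore a
     coherent page.                                                                */
}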

Restart-time DWB recovery is dwb_load_and_recover_pages (src/storage/double_write_buffer.cpp):

  1. Open the DWB volume(s).
  2. For each occupied slot, compute the checksum.
  3. Read the home-volume copy.
  4. If the home copy is good (valid checksum, LSA at-or-above the slot’s LSA), skip.
  5. Otherwise write the DWB slot over the home page, fsync.
  6. Mark the DWB clean.

This sweep is mandatory before ARIES. Analysis and redo read pages from home volumes; a torn page would crash the parser or silently corrupt the database under a redo-on-stale-data bug.
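
In pseudocode the sweep reduces to the loop below; the helper names are illustrative, and the real control flow (block iteration, partially written blocks) lives in dwb_load_and_recover_pages:

// Sketch of the restart sweep: hypothetical helpers around the real decision rule.
for (const DWB_SLOT & slot : dwb_read_occupied_slots ())
  {
    PAGE_COPY home = read_home_page (slot.vpid);
    bool home_ok = checksum_valid (home) && LSA_GE (&home.lsa, &slot.lsa);
    if (!home_ok)
      {
        write_home_page (slot.vpid, slot.image);   /* heal with the DWB copy */
        fsync_home_volume (slot.vpid);
      }
  }
dwb_mark_clean ();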

DWB recovery also resolves a subtle checkpoint interaction described in cubrid-checkpoint.md: the checkpoint protocol drives dirty pages through the DWB during step 7 of logpb_checkpoint, so a crash during checkpoint flush leaves both partial home writes and DWB slots holding the clean versions. The DWB sweep heals these without help from the checkpoint protocol; the protections are independent.

For slot lifecycle, the parallel-flush worker pool, and the second-block design, see cubrid-double-write-buffer.md.

After phase 2, every page on every data volume is either correct as of its on-disk LSA or has been healed. The recovery manager can read any page without fear of torn-write artifacts. Whether a page is current — that’s what phase 5 fixes.

ARIES recovery is bounded by checkpoints. Without one, analysis would walk every WAL record ever written; with one, analysis starts from the most-recent checkpoint LSA.

The pointer to the most-recent checkpoint lives in the active log header on page id -9, field log_Gl.hdr.chkpt_lsa (log_storage.hpp). The checkpoint daemon keeps it current by emitting LOG_START_CHKPT (whose LSA becomes the next chkpt_lsa) and LOG_END_CHKPT (carrying the active-TX snapshot, active-sysop snapshot, and redo-LSA hint), then updating the header and fsync-ing.

The checkpoint is fuzzy: the trantable walk inside logpb_checkpoint runs under a read-mode CS, so transactions make progress between the bracket records. The snapshot is coherent but not quiescent — a TX that committed mid-checkpoint appears active in the snapshot, but its commit record sits later in the log and analysis will see it. The ARIES paper proves this correct as long as analysis treats records in the bracket window the same as records after end-CHKPT. cubrid-checkpoint.md walks the proof.

The first action of log_recovery is to read log_Gl.hdr.chkpt_lsa into a local rcv_lsa:

// log_recovery — src/transaction/log_recovery.c (excerpt)
LSA_COPY (&rcv_lsa, &log_Gl.hdr.chkpt_lsa);
if (ismedia_crash != false)
  {
    /* media recovery: per-volume rcv_lsa may predate chkpt_lsa */
    (void) fileio_map_mounted (thread_p,
                               (bool (*)(THREAD_ENTRY *, VOLID, void *)) log_rv_find_checkpoint,
                               &rcv_lsa);
  }

In our crash scenario ismedia_crash is false, so rcv_lsa is exactly the header pointer. The log_rv_find_checkpoint branch handles restore-from-backup, taking the minimum per-volume rcv-LSA across mounted volumes; see cubrid-backup-restore.md.

If chkpt_lsa is NULL_LSA: a brand-new install (analysis walks from log start, slow but correct) or header corruption on an established database (fatal — restore from backup).

One internal handoff worth naming: the redo-LSA hint in the end record (LOG_REC_CHKPT.redo_lsa) is the smallest oldest_unflush_lsa across the page buffer at checkpoint time; it can be earlier than chkpt_lsa on a long-lived dirty page. Analysis always starts from chkpt_lsa; redo starts from chkpt.redo_lsa. See cubrid-log-manager.md for the on-log shape and cubrid-recovery-manager.md for analysis consumption.

Analysis is a forward walk from chkpt_lsa. It changes no data page; its sole product is in-memory state — a reconstructed transaction table (TT), a reconstructed dirty-page hint (start_redo_lsa), and a classification of every TX (committed, aborted, in-doubt, loser).

Entry point is log_recovery_analysis. The per-record dispatcher log_rv_analysis_record switches on LOG_RECTYPE. The relevant arms for the restart story:

  • LOG_START_CHKPT / LOG_END_CHKPT. Only the first checkpoint record consumes its snapshot (may_use_checkpoint gate). Each LOG_INFO_CHKPT_TRANS row seeds a TDES via logtb_rv_find_allocate_tran_index, with state coerced (TRAN_ACTIVE/TRAN_UNACTIVE_ABORTED → TRAN_UNACTIVE_UNILATERALLY_ABORTED, i.e. loser; TRAN_2PC_PREPARED kept verbatim, i.e. in-doubt). The end record’s redo_lsa becomes start_redo_lsa. Later checkpoint records in the analysis window are skipped.
  • LOG_UNDOREDO_DATA / LOG_MVCC_UNDOREDO_DATA. Extend the TX’s tail_lsa; allocate a TDES with TRAN_ACTIVE if absent.
  • LOG_COMMIT. TT → TRAN_UNACTIVE_COMMITTED.
  • LOG_COMMIT_WITH_POSTPONE. TT → TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE; phase-5 postpone replay will finalise.
  • LOG_ABORT. TT → TRAN_UNACTIVE_ABORTED.
  • LOG_2PC_PREPARE. TT → TRAN_UNACTIVE_2PC_PREPARED. The TX is in-doubt, kept alive past restart awaiting the coordinator. See cubrid-2pc.md.
  • LOG_SYSOP_END. Updates per-TDES sysop bookkeeping (LOG_RCV_TDES annotations).
  • LOG_END_OF_LOG. Stop. Current LSA is end_redo_lsa.

At the end of analysis, every TX known to the engine at crash time has a TDES entry in the rebuilt trantable, classified for the later passes.
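
A condensed sketch of that walk; the field and helper names (tdes_of, forw_lsa, read_log_record) are simplified stand-ins, and the real switch in log_rv_analysis_record has many more arms:

// Analysis sketch: no data page is touched, only the trantable and the redo bounds.
LOG_LSA lsa = rcv_lsa;                        /* = log_Gl.hdr.chkpt_lsa           */
while (!LSA_ISNULL (&lsa))
  {
    LOG_RECORD rec = read_log_record (lsa);
    switch (rec.type)
      {
      case LOG_END_CHKPT:                     /* only the first snapshot is used  */
        seed_trantable_from_snapshot (rec);   /* also yields start_redo_lsa       */
        break;
      case LOG_UNDOREDO_DATA:
      case LOG_MVCC_UNDOREDO_DATA:
        tdes_of (rec.trid)->tail_lsa = lsa;   /* allocate a TDES if absent        */
        break;
      case LOG_COMMIT:
        tdes_of (rec.trid)->state = TRAN_UNACTIVE_COMMITTED;
        break;
      case LOG_ABORT:
        tdes_of (rec.trid)->state = TRAN_UNACTIVE_ABORTED;
        break;
      case LOG_2PC_PREPARE:
        tdes_of (rec.trid)->state = TRAN_UNACTIVE_2PC_PREPARED;  /* in-doubt      */
        break;
      default:
        break;
      }
    lsa = rec.forw_lsa;                       /* forward link; NULL at end of log */
  }
/* end_redo_lsa = last LSA seen; TXs still TRAN_ACTIVE here are the losers. */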

Analysis does not touch any data page — it is a pure log walk and is deterministic given an intact log. The recovery manager’s correctness boundary is the analysis-redo handoff. For the full per-record dispatch including unlisted LOG_* arms, see cubrid-recovery-manager.md.

Redo walks forward from start_redo_lsa to end_redo_lsa, applying every record whose target page on disk has a stale LSA. The semantics is textbook ARIES “repeating history”: at the end of redo, every page is in the exact state it was in the moment before the crash, including changes from never-committed TXs. Loser cleanup is phase 6.

The per-record dispatcher is log_rv_redo_record_sync<T> (log_recovery_redo.hpp), a template specialised by log-record payload type. The sync path:

  1. Read the next log record.
  2. Determine target VPID. Multi-page records dispatch one VPID at a time.
  3. Fix the target page (DWB-healed by now).
  4. If page.lsa >= record.lsa, skip — already on disk.
  5. Otherwise call RV_fun[record.rcvindex].redofun (rcv).
  6. Set page.lsa = record.lsa, mark dirty, unfix.
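
The core decision, sketched: RV_fun and the pgbuf_* calls are real names already cited in this doc, while fix_target_page and the record bookkeeping around them are simplified stand-ins.

// Redo sketch: apply only when the on-disk page lags the record.
PAGE_PTR pgptr = fix_target_page (thread_p, &record.vpid);   /* DWB-healed by now */
if (LSA_LT (pgbuf_get_lsa (pgptr), &record.lsa))
  {
    (*RV_fun[record.rcvindex].redofun) (thread_p, &rcv);     /* reapply the delta */
    pgbuf_set_lsa (thread_p, pgptr, &record.lsa);            /* page catches up   */
    pgbuf_set_dirty (thread_p, pgptr, DONT_FREE);
  }
/* else: page.lsa >= record.lsa, the change is already on disk, skip */
pgbuf_unfix (thread_p, pgptr);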

One specialisation is widely misunderstood: for LOG_REC_COMPENSATE (CLR), dispatch returns the undo function from RV_fun[], not the redo function. A CLR’s payload is the undo image of a previously-rolled-back action; replaying it forward during redo means re-applying that undo. The source comment in log_rv_get_fun<LOG_REC_COMPENSATE> is “yes, undo”. Mistakes here are how engines lose data during double-fault recovery. See cubrid-recovery-manager.md.
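
A paraphrase of that shape (not the literal template; rcv_fun_t and the _sketch suffix are hypothetical):

// Paraphrase: ordinary records replay their redo image going forward;
// a compensation record replays its *undo* image going forward.
template <typename RecT> rcv_fun_t log_rv_get_fun_sketch (LOG_RCVINDEX idx);

template <> rcv_fun_t log_rv_get_fun_sketch<LOG_REC_UNDOREDO> (LOG_RCVINDEX idx)
{
  return RV_fun[idx].redofun;
}

template <> rcv_fun_t log_rv_get_fun_sketch<LOG_REC_COMPENSATE> (LOG_RCVINDEX idx)
{
  return RV_fun[idx].undofun;   /* "yes, undo" */
}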

The modern path runs redo in parallel by VPID. In log_recovery_redo_parallel.{cpp,hpp}, a reader thread walks the log sequentially; records whose recovery functions are sync-only are applied inline, and the rest are dispatched as cublog::redo_job_impl jobs to a worker pool hashed by VPID. Hashing by VPID preserves per-page LSA monotonicity without locks while disjoint pages overlap freely.
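
The routing rule is the whole trick; a sketch under assumed names (the real job and pool types live in log_recovery_redo_parallel.hpp):

// Sketch: all records touching one VPID land on one worker, so that worker
// applies them in log order; records for other pages proceed concurrently.
size_t
worker_for_vpid (const VPID & vpid, size_t worker_count)
{
  size_t h = (size_t) vpid.pageid * 31u + (size_t) vpid.volid;
  return h % worker_count;
}
/* reader thread: push_job (queues[worker_for_vpid (job.vpid, N)], job); */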

Two design points to internalise:

  • Redo replays losers’ changes too. Phase 6 undoes them. Without redo applying the loser changes first, undo couldn’t work — the undo functions assume the page is in the post-action state.
  • page.lsa >= record.lsa is the per-page termination condition. A page already flushed before crash carries an LSA at-or-beyond its last log record; redo skips it. A dirty-at-crash page lags; redo applies until it catches up.

After redo, the database is crash-consistent. Then log_recovery_finish_all_postpone walks TT entries in TRAN_UNACTIVE_*_COMMITTED_WITH_POSTPONE and replays each postpone via log_do_postpone. CUBRID’s pass order (analysis → redo → postpone → undo) departs from textbook ARIES because postpones must finish before undo of losers — otherwise an undo could roll back state a postpone depends on.
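
Put end to end, the driver’s pass order looks like this (a condensed paraphrase of log_recovery, not its literal body):

/* Condensed paraphrase of the restart driver's pass order. */
log_recovery_analysis (...);              /* rebuild the trantable, find redo bounds */
log_recovery_redo (...);                  /* repeat history, loser changes included  */
log_recovery_finish_all_postpone (...);   /* run committed-with-postpone work        */
log_recovery_undo (...);                  /* roll back losers, emitting CLRs         */
(void) logpb_checkpoint (thread_p);       /* clean boundary for the next restart     */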

Undo erases loser changes. After phase 5 the database mirrors the moment of crash, so loser changes are present and visible. Undo walks each loser TX’s per-TX log chain backward from tail_lsa, applying compensating actions and emitting compensation log records (CLRs) until the chain hits head_lsa.

The driver is log_recovery_undo. For each loser TX (left as TRAN_UNACTIVE_UNILATERALLY_ABORTED after analysis), it walks prev_tranlsa backward calling log_rv_undo_record:

  • Physical records (LOG_UNDOREDO_DATA, LOG_MVCC_UNDOREDO_DATA): fix the target page, call RV_fun[rcvindex].undofun, emit LOG_COMPENSATE whose undo_nxlsa points at the predecessor of the undone record.
  • Logical records (LOG_SYSOP_END_LOGICAL_UNDO and friends): defer to system-op undo machinery in cubrid-transaction.md. Logical undo is how CUBRID handles B+Tree operations — undoing a split physically would require logging the entire pre-split page; the logical scheme records “delete key K from index I” and the undo function reproduces the inverse against whatever state the page is now in.

The CLR’s undo_nxlsa is what makes “undo itself redoable” — the property ARIES is named for. A re-crash during undo lets the next restart’s redo pass replay the partial CLR chain forward (CLRs are redo-only), and undo resumes by reading undo_nxlsa to skip already-undone records.
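
A sketch of one loser’s walk, with simplified record accessors (read_log_record, rec.rcvindex) standing in for the real readers; logical records and sysop boundaries are elided:

// Undo sketch: the CLRs emitted along the way make the pass itself redoable.
LOG_LSA lsa = tdes->tail_lsa;
while (!LSA_ISNULL (&lsa))                             /* chain ends at head_lsa        */
  {
    LOG_RECORD rec = read_log_record (lsa);
    LOG_LSA prev = rec.prev_tranlsa;                   /* predecessor in this TX's chain */
    log_append_compensate (..., &prev);                /* CLR with undo_nxlsa = prev     */
    (*RV_fun[rec.rcvindex].undofun) (thread_p, &rcv);  /* apply the compensating action  */
    lsa = prev;
  }
/* Re-crash mid-walk: the next restart's redo replays the CLRs already written
   (CLRs are redo-only), then undo resumes from the last CLR's undo_nxlsa.     */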

In our crash scenario T_a and T_b are the B+Tree-split losers; their chains walk back through every split/merge their statements caused. T_c (committed-with-postpone) was already finalised in phase 5’s postpone sub-step. In-doubt 2PC TXs are untouched and sit in TRAN_UNACTIVE_2PC_PREPARED awaiting the coordinator (cubrid-2pc.md).

After undo, the database is transactionally consistent. The recovery manager sets log_Gl.rcv_phase = LOG_RESTARTED and returns. As its last action, log_recovery calls (void) logpb_checkpoint (thread_p) so the next restart starts from a clean boundary that already includes the recovery work. For CLR shape, savepoints, partial undo: see cubrid-recovery-manager.md and cubrid-transaction.md.

CUBRID is MVCC. A “deleted” row is not removed in place — a new version marked deleted-by-MVCCID-X is written, and the old version stays visible to snapshots that still see X. When X drops below the global oldest visible MVCCID, the old version is dead and can be reclaimed. Vacuum is the background subsystem that does this.

The durable handle for vacuum’s progress is LOG_HEADER.mvcc_op_log_lsa, alongside chkpt_lsa in the log header. The vacuum master reads it on restart, computes the oldest visible MVCCID from the recovered TT, and schedules block-reclamation jobs for the log range between mvcc_op_log_lsa and the current tail.
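
In outline, under hypothetical names (the real scheduling is the vacuum master’s job loop; see cubrid-vacuum.md):

// Hypothetical outline of the vacuum master's restart catch-up.
LOG_LSA from = log_Gl.hdr.mvcc_op_log_lsa;                  /* durable progress handle  */
MVCCID oldest_visible = compute_oldest_visible_mvccid ();   /* from the recovered state */
for (const VACUUM_BLOCK & blk : log_blocks_between (from, log_current_tail ()))
  {
    if (blk.newest_mvccid < oldest_visible)                 /* its old versions are dead */
      {
        schedule_vacuum_job (blk);
      }
  }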

Vacuum is not on the restart critical path. It runs as a background daemon and starts after LOG_RESTARTED is set. Clients can connect before vacuum catches up; the cost is dead versions occupying heap space and a slight scan-latency hit. Blocking acceptance on vacuum could add minutes on a busy database, which is operationally unacceptable.

Two interactions:

  • MVCCID issuance is lazy. Analysis does not pre-allocate MVCCIDs; each LOG_MVCC_* record carries the issuing TX’s MVCCID, and per-TDES MVCCID state is rebuilt during analysis from those fields. This is why LOG_INFO_CHKPT_TRANS has no MVCCID field. See cubrid-mvcc.md.
  • Vacuum workers share the engine’s cubthread::manager pool. They are alive but idle until the vacuum master submits its first batch.

For the master/worker split, block-job scheduling, OID heap traversal, and index purging, see cubrid-vacuum.md.

For a stand-alone server (no HA), this phase is a no-op.

With HA configured, the boot’s HA init dispatches by role:

  • Slave. applylogdb persists a progress LSA on every successful apply (in db_ha_apply_info or the equivalent state file). On restart it reads the persisted LSA, opens the master’s WAL stream (or archive), and resumes. Records between persisted LSA and master tail are replayed in order; later records arrive as the master produces them.
  • Master. Slaves’ copylogdb peers reconnect; each tells the master the LSA it last received, and the master streams from there. A slave that has fallen too far behind (master archives purged below its resume LSA) must be re-bootstrapped from a backup — cubrid-backup-restore.md.

The role is decided by cub_master and the heartbeat daemon via UDP heartbeats (cubrid-heartbeat.md); the role is persisted in the database header so the server is self-consistent on restart.

Replication recovery is a separate concern from crash recovery, though they share LSA handles. Crash recovery (phases 1-6) re-establishes this server’s local state up to its log tail; replication catch-up then re-establishes the slave’s position relative to the master. For apply/copy daemon architecture, conflict resolution, and on-the-wire format see cubrid-ha-replication.md.

The last action of the restart pipeline is to start the network listener. Until this point, the OS has refused every connection attempt because no socket was listening.

The listener is brought up by css_init (in cubrid-network-protocol.md), called from net_server_start after boot_restart_server returns. css_init binds a socket on the configured port, listen(2)s with a backlog, and spawns the listener thread whose loop is accept(2) → dispatch to a worker.

At listen(2), boot_Server_status is already BOOT_SERVER_UP. The first accepted connection runs the registration handshake xboot_register_client (boot_sr.c):

  • Validate the client’s credentials against the catalog.
  • Issue a TRANID, allocate a TDES.
  • Return a packed BOOT_SERVER_CREDENTIAL (boot.h) carrying page size, log page size, root-class OID, disk-compatibility number, HA state, charset, language, session key.

The client side is boot_restart_client (boot_cl.c), which the CAS process calls during its own init. Once the credential is unpacked, the CAS can issue real SQL.
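
Sketched end to end, with hypothetical wrapper names (the real validation and packing live in boot_sr.c and boot_cl.c):

/* Server side of the first request a reconnecting CAS sends; a sketch,
   not the literal xboot_register_client body.                           */
int
register_client_sketch (THREAD_ENTRY * thread_p, const CLIENT_CREDENTIAL * client)
{
  if (!validate_credentials (thread_p, client))       /* check against the catalog */
    {
      return ER_FAILED;
    }
  int tran_index = allocate_tdes_with_tranid (client);
  BOOT_SERVER_CREDENTIAL cred = build_server_credential ();  /* page sizes, root-class
                                                                OID, HA state, charset,
                                                                session key            */
  reply_with_credential (client, &cred);
  return tran_index;
}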

Connections arrive via cub_broker, a separate process per logical service that owns a pool of CAS workers. When cub_server restarts, each broker detects that the server is back and its CAS workers reconnect via the register flow. Pool management, sticky-session policy, and broker-to-server failover are in cubrid-broker.md.

A subtle invariant: before the listener opens, the database is fully recovered and transactionally consistent. The first SELECT 1 sees a coherent universe — the engine’s contract with applications. For RPC packet format, keep-alive, and connection lifecycle see cubrid-network-protocol.md.

flowchart TB
  subgraph CRASH["Crash state on disk"]
    direction TB
    HV["Home-volume pages\n(some torn)"]
    DV["DWB volume\n(clean copies of in-flight pages)"]
    LV["Active log\nrecords up to last flush"]
    LH["Log header\nchkpt_lsa, mvcc_op_log_lsa"]
    HA["HA progress files\nlast-applied LSA"]
  end

  subgraph BOOT["Phase 1 — Process boot (boot_sr.c)"]
    direction TB
    B1["main → net_server_start"]
    B2["boot_restart_server: subsystems in topological order"]
    B3["disk manager: open volumes"]
    B4["page buffer: alloc cache"]
    B5["log_initialize: read hdr, see is_shutdown=false"]
  end

  subgraph DWBPHASE["Phase 2 — DWB recovery (dwb_load_and_recover_pages)"]
    direction TB
    D1["Scan DWB slots"]
    D2["For each slot, read home page"]
    D3{"home OK?"}
    D4["Skip"]
    D5["Write DWB slot over home, fsync"]
    D6["Mark DWB clean"]
  end

  subgraph CHKPT["Phase 3 — Locate checkpoint (log_recovery)"]
    direction TB
    C1["rcv_lsa = log_Gl.hdr.chkpt_lsa"]
    C2["If media crash: take min over per-volume rcv_lsa"]
  end

  subgraph ANALYSIS["Phase 4 — Analysis pass (log_recovery_analysis)"]
    direction TB
    A1["Walk log forward from rcv_lsa"]
    A2["log_rv_analysis_record: switch on LOG_RECTYPE"]
    A3["Seed TT from LOG_END_CHKPT snapshot"]
    A4["Update TT/DPT per record"]
    A5["Classify each TX:\ncommitted | aborted | loser | in-doubt"]
    A6["start_redo_lsa, end_redo_lsa"]
  end

  subgraph REDO["Phase 5 — Redo pass (log_recovery_redo)"]
    direction TB
    R1["Walk log forward from start_redo_lsa"]
    R2["log_rv_redo_record_sync<T>: dispatch by payload type"]
    R3["fix page; if page.lsa < record.lsa: apply RV_fun[idx].redofun"]
    R4["Parallel by VPID hash via redo_parallel"]
    R5["log_recovery_finish_all_postpone:\nreplay COMMIT_WITH_POSTPONE"]
  end

  subgraph UNDO["Phase 6 — Undo pass (log_recovery_undo)"]
    direction TB
    U1["For each loser TX:"]
    U2["Walk prev_tranlsa backward from tail_lsa"]
    U3["log_rv_undo_record:\nRV_fun[idx].undofun + emit LOG_COMPENSATE"]
    U4["CLR.undo_nxlsa = predecessor"]
    U5["TX → TRAN_UNACTIVE_UNILATERALLY_ABORTED"]
    U6["Final logpb_checkpoint() — clean boundary"]
  end

  subgraph VACUUM["Phase 7 — Vacuum catch-up (background)"]
    direction TB
    V1["Vacuum master reads mvcc_op_log_lsa"]
    V2["Compute oldest_visible_MVCCID"]
    V3["Schedule block-reclamation jobs"]
  end

  subgraph HAPHASE["Phase 8 — HA catch-up (background)"]
    direction TB
    H1{"role?"}
    H2["slave: applylogdb resumes from last-applied LSA"]
    H3["master: copylogdb peers reconnect, stream from peer LSAs"]
  end

  subgraph LISTEN["Phase 9 — Open for connections (css_init)"]
    direction TB
    L1["bind / listen / spawn listener thread"]
    L2["boot_Server_status = BOOT_SERVER_UP"]
    L3["broker → CAS reconnect → xboot_register_client"]
    L4["BOOT_SERVER_CREDENTIAL returned to client"]
    L5["First SELECT 1"]
  end

  CRASH --> BOOT
  BOOT --> DWBPHASE
  HV -.->|read| D2
  DV -.->|read| D1
  D1 --> D2 --> D3
  D3 -- yes --> D4
  D3 -- no  --> D5
  D4 --> D6
  D5 --> D6
  DWBPHASE --> CHKPT
  LH -.->|read| C1
  C1 --> C2
  CHKPT --> ANALYSIS
  LV -.->|walk| A1
  A1 --> A2 --> A3 --> A4 --> A5 --> A6
  ANALYSIS --> REDO
  A6 --> R1
  R1 --> R2 --> R3 --> R4 --> R5
  REDO --> UNDO
  U1 --> U2 --> U3 --> U4 --> U5 --> U6
  UNDO --> VACUUM
  UNDO --> HAPHASE
  UNDO --> LISTEN
  V1 --> V2 --> V3
  HA -.->|read| H2
  H1 -- slave  --> H2
  H1 -- master --> H3
  L1 --> L2 --> L3 --> L4 --> L5

The diagram compresses every handoff in the restart pipeline. Three properties are visible at this resolution:

  • Phase 2 must precede phase 4. ARIES analysis reads pages via the page buffer; if those pages are torn, the analysis walk crashes on a malformed page header. DWB recovery is the only protection against this.
  • Phases 7 and 8 fan out from phase 6. They are parallel background activity that doesn’t gate phase 9. A client can see a transactionally-consistent database before vacuum has cleaned up dead versions and before HA has fully caught up.
  • The handoff from phase 6 to phase 9 is direct. There is no further preparation required; once LOG_RESTARTED is set and the post-recovery checkpoint is durable, the listener opens.

This document is the panorama, not the encyclopaedia. Several branches of the broader recovery story are deliberately left to their own docs:

  • Recovery from a backup file (PITR). Our scenario assumed the on-disk state was salvageable — torn pages but no destroyed volumes. If a volume is lost (disk failure, FS corruption, accidental rm), the recovery path is different: restore from the most-recent backup, then replay archived WAL forward to the desired point in time. The entry point is boot_restart_from_backup, the WAL replay path goes through the same redo dispatcher we covered, but anchored on a backup LSA instead of chkpt_lsa. See cubrid-backup-restore.md.
  • Media recovery (single-volume restore). A subset of the above. One volume goes bad while the rest of the database is healthy; the operator restores just that volume and the engine rolls it forward through archived WAL. The log_rv_find_checkpoint branch in log_recovery handles the per-volume LSA walk this scenario needs. See cubrid-backup-restore.md.
  • Parallel-redo internals. We mentioned the per-VPID hashing and the worker pool, but the job-queue shape, the back-pressure policy, the page-fix coordination with the buffer manager, and the perf-counter scaffolding shared with the page-server replication path are all in cubrid-recovery-manager.md under “Parallel redo”.
  • Server-mode-versus-stand-alone differences. The boot path for cub_server (server mode) and for csql -S / loaddb (stand-alone mode) differ in which subsystems come up. Stand-alone tools share the recovery manager but skip the network listener and the broker handshake. See cubrid-boot.md and cubrid-sa-cs-runtime.md.
  • In-doubt 2PC resolution after restart. Phase 4 leaves prepared 2PC TXs in TRAN_UNACTIVE_2PC_PREPARED. Phases 5 and 6 do not touch them. After phase 9, the coordinator’s decision arrives over the network, and the in-doubt TX is committed or aborted via xtran_2pc_*. The orphan-TX timer and the LOG_RECOVERY_FINISH_2PC_PHASE enum are the entry points; details in cubrid-2pc.md.
  • TDE-encrypted log pages during recovery. If the database uses transparent data encryption, log pages are encrypted at rest and must be decrypted before the recovery manager can parse them. Decryption happens inside the log reader, before the analysis/redo dispatchers see records. See cubrid-tde.md.
  • Authentication, sessions, and authorization. Phase 9 opens the socket; the first thing each connection does is authenticate. Authentication state, role resolution, and per-session credentials are handled by cubrid-authentication.md and cubrid-server-session.md.
  • Catalog rehydration and class-cache rebuild. The boot module’s late phase reopens the catalog after recovery finishes; the locator, class object cache, and statistics cache are repopulated lazily on first reference. See cubrid-catalog-manager.md, cubrid-class-object.md, and cubrid-locator.md.

Recovery-pipeline detail docs in this knowledge base

  • cubrid-boot.md — the topological subsystem-init order, the create-vs-restart dispatch, the boot-status flag, and xboot_register_client for client connect.
  • cubrid-double-write-buffer.md — torn-page protection: slot lifecycle, dwb_load_and_recover_pages, the parallel flush worker pool, the second-block design.
  • cubrid-checkpoint.md — fuzzy-checkpoint protocol, the LOG_START_CHKPT/LOG_END_CHKPT bracket, the redo-LSA hint, the chkpt_lsa header field.
  • cubrid-log-manager.md — the WAL framework: log-record shape, prior-list discipline, log-page layout, the log header.
  • cubrid-recovery-manager.md — the three-pass ARIES driver: log_recovery, the per-record dispatchers, RV_fun[], the templated redo, parallel redo by VPID, the postpone pass, the CLR contract.
  • cubrid-transaction.md — TDES shape, per-TX log chain, system ops, savepoints, logical undo.
  • cubrid-2pc.md — LOG_2PC_PREPARE, in-doubt recovery, the LOG_RECOVERY_FINISH_2PC_PHASE enum, the coordinator-initiated resolution path.
  • cubrid-vacuum.md — vacuum master/worker split, MVCCID watermarks, block-job scheduling, mvcc_op_log_lsa recovery.
  • cubrid-mvcc.md — MVCC issuance, snapshot rebuild, visibility rules during recovery.
  • cubrid-ha-replication.md — applylogdb, copylogdb, slave/master roles, last-applied LSA persistence.
  • cubrid-heartbeat.md — UDP heartbeats, role negotiation, cub_master arbitration.
  • cubrid-network-protocol.md — css_init, the listener loop, the on-the-wire RPC format.
  • cubrid-broker.md — cub_broker, CAS pool, reconnect policy, sticky sessions.
  • cubrid-backup-restore.md — branched recovery: PITR, media recovery, single-volume restore.
  • cubrid-tde.md — encrypted-log decryption during recovery.
  • cubrid-page-buffer-manager.md — page-fix and dirty-tracking the redo pass relies on.
  • cubrid-disk-manager.md — volume open and the per-volume disk header that media recovery uses.
  • src/executables/server.c — cub_server main().
  • src/communication/network_sr.c — net_server_start, css_init.
  • src/transaction/boot_sr.c — boot_restart_server, xboot_register_client, xboot_initialize_server.
  • src/transaction/log_manager.c — log_initialize, the is_shutdown gate dispatching to recovery.
  • src/transaction/log_recovery.c — log_recovery, log_recovery_analysis, log_recovery_redo, log_recovery_finish_all_postpone, log_recovery_undo, log_rv_find_checkpoint.
  • src/transaction/log_recovery_redo.{cpp,hpp} — templated per-record redo dispatcher and the LOG_REC_COMPENSATE specialisation.
  • src/transaction/log_recovery_redo_parallel.{cpp,hpp} — per-VPID worker pool.
  • src/transaction/recovery.h — RV_fun[], LOG_RCVINDEX, LOG_RCV.
  • src/storage/double_write_buffer.cpp — dwb_load_and_recover_pages.
  • src/storage/page_buffer.c — pgbuf_flush_checkpoint, fix/unfix discipline.
  • src/transaction/log_page_buffer.c — logpb_checkpoint, logpb_flush_header.
  • src/connection/server_support.c — listener thread, accept loop.
  • Mohan, Haderle, Lindsay, Pirahesh, Schwarz, ARIES, ACM TODS 17.1, 1992 — the canonical algorithm CUBRID implements.
  • Petrov, Database Internals, 2019, ch. 5 §“Recovery” and §“ARIES”.
  • Bernstein, Hadzilacos, Goodman, Concurrency Control and Recovery in Database Systems, 1987 — checkpoints, consistent-vs-fuzzy distinction.
  • Silberschatz, Korth, Sudarshan, Database System Concepts, 7th ed., ch. 19.