CUBRID Reading Path — How a Server Restart Recovers
Contents:
- What this traces
- Step 1 — Process boot
- Step 2 — DWB recovery
- Step 3 — Find the last checkpoint
- Step 4 — Analysis pass
- Step 5 — Redo pass
- Step 6 — Undo pass
- Step 7 — Vacuum catches up
- Step 8 — Pick up replication
- Step 9 — Open for connections
- Diagram — full restart pipeline
- What we did NOT cover
- Sources
This is a synthesis doc — a reading path, not a deep dive. It
traces the journey a single crashed cub_server takes from
SIGKILL mid-flush back to the moment a CAS worker issues
SELECT 1. Each step delegates technical detail to a sibling doc;
the value of this page is the ordering and the handoffs. For
the panorama, the diagram in § Diagram — full restart pipeline
compresses everything to a single picture.
What this traces
Imagine the worst plausible failure short of media loss. The server
is mid-work: the checkpoint daemon is flushing dirty pages through
the DWB, the prior list holds a dozen unflushed log records, two
user transactions are deep in B+Tree splits, a third has just
emitted its commit record but its postpones haven’t run. Then the
machine loses power, or the OOM killer fires, or kill -9 lands.
The process ceases.
The disk a millisecond later carries three categories of damage:
- Half-flushed pages. The DWB was mid-batch writing slots back to their home volumes. Some pages reached home; some didn’t. The DWB volume is durable (it was `fsync(2)`-ed before any home write started), so for every torn home page, a clean copy exists in the DWB.
- Prior-list entries that never made it to the log. A handful of `LOG_REC_*` nodes sat in the prior list awaiting the log flusher. Those records have no on-disk presence, so their changes are lost. The WAL invariant guarantees any data page on disk was preceded by its log record on disk, so if the prior-list entry was lost, the data-page write was also not yet flushed. Consistent.
- Partially completed transactions. Committed T_c emitted `LOG_COMMIT` and `LOG_COMMIT_WITH_POSTPONE`; both are durable. Its postpones did not run. T_a and T_b are mid-statement: undo chains stretching back twenty records but no commit records. They are losers in ARIES parlance.
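The WAL invariant the list leans on fits in a few lines of C. This is a toy model with invented names (`page_t`, `log_flush_upto`), not CUBRID's API; the point is only the ordering: log durable before data.

```c
#include <assert.h>
#include <stdint.h>

/* Toy WAL write-ordering rule (invented shapes, not CUBRID's API):
 * a data page may reach disk only after every log record up to its
 * page LSA is durable. */
typedef struct { uint64_t lsa; int dirty; } page_t;

static uint64_t flushed_log_lsa = 0;   /* highest durable log LSA */

/* Simulate fsync-ing the log up to 'lsa'. */
static void log_flush_upto (uint64_t lsa)
{
  if (lsa > flushed_log_lsa)
    flushed_log_lsa = lsa;
}

/* WAL-safe page write: force the log first, then write the page. */
static int page_write_home (page_t *pg)
{
  log_flush_upto (pg->lsa);            /* WAL: log before data */
  assert (flushed_log_lsa >= pg->lsa); /* invariant holds at write time */
  pg->dirty = 0;                       /* the home-volume write would go here */
  return 0;
}
```

Whatever subset of writes completes before a crash, the invariant guarantees the log describes at least everything the data volumes contain, which is exactly what makes the "consistent" verdict above safe.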
The restart’s job is to bring the disk to a state both internally consistent (no torn pages, every committed change present, every uncommitted change gone) and externally correct (T_c’s commit durable; T_a and T_b never happened). Then, and only then, does the network listener open the socket.
The pipeline is six mandatory phases plus two optional follow-ons:
- Phase 1 (process boot): mount volumes, attach to the log.
- Phase 2 (DWB recovery): heal torn pages.
- Phase 3 (locate checkpoint): read the log header.
- Phase 4 (analysis): rebuild the TX/dirty-page tables.
- Phase 5 (redo): replay forward from the redo-LSA.
- Phase 6 (undo): roll back losers, emitting CLRs.
- Phase 7 (vacuum catch-up): background MVCCID reclamation.
- Phase 8 (HA catch-up): replication resumes.
- Phase 9 (open listener): accept the first client.
Phases 1-6 run sequentially. Phases 7-8 fan out as background activity in parallel with phase 9; they don’t gate acceptance.
Step 1 — Process boot
cub_server’s main() (in src/executables/server.c) is a thin
shim: parse flags, register signal handlers, call
net_server_start (src/communication/network_sr.c). That
orchestrator dispatches into boot_restart_server
(src/transaction/boot_sr.c), which walks the subsystem-init list
in topological order.
The order matters because the dependency graph has cycles: recovery
needs the page buffer; the page buffer needs the disk manager; the
disk manager wants catalog metadata; the catalog needs recovery to
be over. CUBRID breaks the cycle by staging — each subsystem
comes up in an early phase that participates in recovery and a
late phase that consumes catalog data. See
cubrid-boot.md for the topological walk.
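A minimal sketch of the staging idea, with invented subsystem names and function shapes (nothing here mirrors boot_sr.c): each entry carries an early phase and an optional late phase, and the walk runs every early phase before any late one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical staged-boot walk; names and shapes are invented. */
typedef struct
{
  const char *name;
  int (*init_early) (void);   /* participates in recovery */
  int (*init_late) (void);    /* consumes catalog data; may be NULL */
} subsystem_t;

static int booted_early, booted_late;
static int disk_early (void)   { return ++booted_early; }
static int pgbuf_early (void)  { return ++booted_early; }
static int catalog_late (void) { return ++booted_late; }

static int boot_walk (subsystem_t *list, size_t n)
{
  /* First sweep: every early phase, in topological order. */
  for (size_t i = 0; i < n; i++)
    if (list[i].init_early && list[i].init_early () < 0)
      return -1;
  /* ... recovery (analysis/redo/undo) would run here ... */
  /* Second sweep: late phases, once the catalog is trustworthy. */
  for (size_t i = 0; i < n; i++)
    if (list[i].init_late && list[i].init_late () < 0)
      return -1;
  return 0;
}
```

The cycle is broken because nothing in the first sweep touches the catalog, and nothing in the second sweep is needed by recovery.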
Three boot-time actions are load-bearing for the recovery story:
- Volume opening. The disk manager reads `databases.txt`, enumerates volumes via the log-info file, opens each by descriptor. Volumes are attached but not yet trusted: their pages may be torn from the mid-flush crash. No client can read or write pages until phase 6 finishes.
- Log attach. `log_initialize` (`log_manager.c`) opens the active log, reads the header (page id -9), inspects `hdr.is_shutdown`. If `true`, recovery is skipped. If `false` — our scenario — it calls `log_recovery`.
- Recovery dispatch. `log_recovery` is the three-pass driver in cubrid-recovery-manager.md. But before it runs, the DWB must have a chance to heal torn pages. That’s phase 2.
At the moment of boot, no client connections are possible.
boot_Server_status = BOOT_SERVER_DOWN and the listener is not
running. The OS rejects every incoming packet. Recover first,
listen second is what prevents partial recovery from leaking to a
client.
Step 2 — DWB recovery
Before the recovery manager touches the log, the double-write buffer is inspected. The DWB exists for one reason: torn-page protection. A 16 KB CUBRID page sits across many disk sectors (512 B or 4 KiB atomic). A mid-write crash can leave a page on disk with the first half new and the second half old. ARIES redo cannot recover from this — the redo function applies a delta on a coherent page, not a torn one. Postgres uses full-page-image WAL; InnoDB and CUBRID use a doublewrite buffer; SQL Server uses torn-page detection bits.
The DWB runtime invariant: before a dirty page is written to its
home volume, a copy is staged in the DWB volume, and the DWB
volume is fsync-ed. Then the home write proceeds; if it tears,
a clean copy exists in the DWB.
Restart-time DWB recovery is dwb_load_and_recover_pages
(src/storage/double_write_buffer.cpp):
- Open the DWB volume(s).
- For each occupied slot, compute the checksum.
- Read the home-volume copy.
- If the home copy is good (valid checksum, LSA at-or-above the slot’s LSA), skip.
- Otherwise write the DWB slot over the home page, fsync.
- Mark the DWB clean.
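The per-slot decision can be modelled in a few lines. Everything here is invented for illustration (a toy checksum, a 64-byte page); the real logic lives in dwb_load_and_recover_pages.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy DWB slot/home page; shapes are invented for this sketch. */
typedef struct
{
  uint64_t lsa;
  uint32_t checksum;
  uint8_t data[64];            /* toy page body */
} dwb_page_t;

static uint32_t toy_checksum (const uint8_t *p, size_t n)
{
  uint32_t c = 0;
  for (size_t i = 0; i < n; i++)
    c = c * 31 + p[i];
  return c;
}

/* Returns 1 if the home page was healed from the DWB slot, 0 if kept. */
static int dwb_recover_slot (const dwb_page_t *slot, dwb_page_t *home)
{
  int home_ok = (toy_checksum (home->data, sizeof home->data) == home->checksum
                 && home->lsa >= slot->lsa);
  if (home_ok)
    return 0;                  /* home write completed; keep it */
  *home = *slot;               /* torn or stale: overwrite from DWB, then fsync */
  return 1;
}
```

Note the two-sided test: a valid checksum alone is not enough, because an old-but-intact page (the home write never started) also needs the DWB copy if its LSA lags the slot's.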
This sweep is mandatory before ARIES. Analysis and redo read pages from home volumes; a torn page would crash the parser or silently corrupt the database under a redo-on-stale-data bug.
DWB recovery also resolves a subtle checkpoint interaction
described in cubrid-checkpoint.md: the
checkpoint protocol drives dirty pages through the DWB during
step 7 of logpb_checkpoint, so a crash during checkpoint flush
leaves both partial home writes and DWB slots holding the clean
versions. The DWB sweep heals these without help from the
checkpoint protocol; the protections are independent.
For slot lifecycle, the parallel-flush worker pool, and the
second-block design, see
cubrid-double-write-buffer.md.
After phase 2, every page on every data volume is either correct as of its on-disk LSA or has been healed. The recovery manager can read any page without fear of torn-write artifacts. Whether a page is current — that’s what phase 5 fixes.
Step 3 — Find the last checkpoint
ARIES recovery is bounded by checkpoints. Without one, analysis would walk every WAL record ever written; with one, analysis starts from the most-recent checkpoint LSA.
The pointer to the most-recent checkpoint lives in the active log
header on page id -9, field log_Gl.hdr.chkpt_lsa
(log_storage.hpp). The checkpoint daemon keeps it current by
emitting LOG_START_CHKPT (whose LSA becomes the next
chkpt_lsa) and LOG_END_CHKPT (carrying the active-TX
snapshot, active-sysop snapshot, and redo-LSA hint), then
updating the header and fsync-ing.
The checkpoint is fuzzy: the trantable walk inside
logpb_checkpoint runs under a read-mode CS, so transactions
make progress between the bracket records. The snapshot is
coherent but not quiescent — a TX that committed mid-checkpoint
appears active in the snapshot, but its commit record sits later
in the log and analysis will see it. The ARIES paper proves this
correct as long as analysis treats records in the bracket window
the same as records after end-CHKPT.
cubrid-checkpoint.md walks the proof.
The first action of log_recovery is to read
log_Gl.hdr.chkpt_lsa into a local rcv_lsa:
```c
// log_recovery — src/transaction/log_recovery.c (excerpt)
LSA_COPY (&rcv_lsa, &log_Gl.hdr.chkpt_lsa);

if (ismedia_crash != false)
  {
    /* media recovery: per-volume rcv_lsa may predate chkpt_lsa */
    (void) fileio_map_mounted (thread_p,
                               (bool (*)(THREAD_ENTRY *, VOLID, void *)) log_rv_find_checkpoint,
                               &rcv_lsa);
  }
```

In our crash scenario `ismedia_crash` is false, so `rcv_lsa` is
exactly the header pointer. The log_rv_find_checkpoint branch
handles restore-from-backup, taking the minimum per-volume
rcv-LSA across mounted volumes; see
cubrid-backup-restore.md.
If chkpt_lsa is NULL_LSA: a brand-new install (analysis walks
from log start, slow but correct) or header corruption on an
established database (fatal — restore from backup).
One internal pipe worth naming: the redo-LSA hint in the end
record (LOG_REC_CHKPT.redo_lsa) is the smallest
oldest_unflush_lsa across the page buffer at checkpoint time;
it can be earlier than chkpt_lsa on a long-lived dirty page.
Analysis always starts from chkpt_lsa; redo starts from
chkpt.redo_lsa. See
cubrid-log-manager.md for the on-log
shape and
cubrid-recovery-manager.md for
analysis consumption.
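As a sketch of that hint, assuming invented buffer-control-block shapes rather than the real page-buffer structures, the minimum over dirty pages looks like:

```c
#include <assert.h>
#include <stdint.h>

/* Toy buffer control block; shapes invented for this sketch. */
typedef struct { int dirty; uint64_t oldest_unflush_lsa; } bcb_t;

/* The redo hint written into the end-checkpoint record: the smallest
 * oldest_unflush_lsa over all dirty buffers, capped at chkpt_lsa. */
static uint64_t checkpoint_redo_hint (const bcb_t *bufs, int n, uint64_t chkpt_lsa)
{
  uint64_t redo = chkpt_lsa;
  for (int i = 0; i < n; i++)
    if (bufs[i].dirty && bufs[i].oldest_unflush_lsa < redo)
      redo = bufs[i].oldest_unflush_lsa;  /* long-lived dirty page pulls it back */
  return redo;
}
```

A single page that has been dirty since long before the checkpoint is enough to pull the redo start far behind `chkpt_lsa`, which is exactly why redo and analysis start from different LSAs.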
Step 4 — Analysis pass
Analysis is a forward walk from chkpt_lsa. It changes no data
page; its sole product is in-memory state — a reconstructed
transaction table (TT), a reconstructed dirty-page hint
(start_redo_lsa), and a classification of every TX (committed,
aborted, in-doubt, loser).
Entry point is log_recovery_analysis. The per-record dispatcher
log_rv_analysis_record switches on LOG_RECTYPE. The relevant
arms for the restart story:
- `LOG_START_CHKPT`/`LOG_END_CHKPT`. Only the first checkpoint record consumes its snapshot (`may_use_checkpoint` gate). Each `LOG_INFO_CHKPT_TRANS` row seeds a TDES via `logtb_rv_find_allocate_tran_index`, with state coerced (`TRAN_ACTIVE`/`TRAN_UNACTIVE_ABORTED` → `TRAN_UNACTIVE_UNILATERALLY_ABORTED`, i.e. loser; `TRAN_2PC_PREPARED` kept verbatim, i.e. in-doubt). The end record’s `redo_lsa` becomes `start_redo_lsa`. Later checkpoint records in the analysis window are skipped.
- `LOG_UNDOREDO_DATA`/`LOG_MVCC_UNDOREDO_DATA`. Extend the TX’s `tail_lsa`; allocate a TDES with `TRAN_ACTIVE` if absent.
- `LOG_COMMIT`. TT → `TRAN_UNACTIVE_COMMITTED`.
- `LOG_COMMIT_WITH_POSTPONE`. TT → `TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE`; phase-5 postpone replay will finalise.
- `LOG_ABORT`. TT → `TRAN_UNACTIVE_ABORTED`.
- `LOG_2PC_PREPARE`. TT → `TRAN_UNACTIVE_2PC_PREPARED`. The TX is in-doubt, kept alive past restart awaiting the coordinator. See cubrid-2pc.md.
- `LOG_SYSOP_END`. Updates per-TDES sysop bookkeeping (`LOG_RCV_TDES` annotations).
- `LOG_END_OF_LOG`. Stop. Current LSA is `end_redo_lsa`.
At the end of analysis, every TX known to the engine at crash time has a TDES entry in the rebuilt trantable, classified for the later passes.
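Stripped to its essence, the dispatcher is a switch that only mutates the in-memory table. The enum names and table below are toy stand-ins for the real `LOG_RECTYPE` arms:

```c
#include <assert.h>

/* Toy analysis dispatcher: classify transactions, touch no page. */
typedef enum { REC_UNDOREDO, REC_COMMIT, REC_ABORT, REC_2PC_PREPARE } rectype_t;
typedef enum { TX_NONE, TX_LOSER, TX_COMMITTED, TX_ABORTED, TX_INDOUBT } txstate_t;

#define MAX_TX 8
static txstate_t tt[MAX_TX];   /* toy transaction table, TX_NONE-initialised */

static void analysis_record (int tranid, rectype_t type)
{
  switch (type)
    {
    case REC_UNDOREDO:
      if (tt[tranid] == TX_NONE)
        tt[tranid] = TX_LOSER;       /* active until proven otherwise */
      break;
    case REC_COMMIT:
      tt[tranid] = TX_COMMITTED;
      break;
    case REC_ABORT:
      tt[tranid] = TX_ABORTED;
      break;
    case REC_2PC_PREPARE:
      tt[tranid] = TX_INDOUBT;       /* kept alive past restart */
      break;
    }
}
```

A TX that emitted data records but never a commit or abort stays classified as a loser, which is precisely the input the undo pass needs.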
Analysis does not touch any data page — it is a pure log walk
and is deterministic given an intact log. The recovery manager’s
correctness boundary is the analysis-redo handoff. For the full
per-record dispatch including unlisted LOG_* arms, see
cubrid-recovery-manager.md.
Step 5 — Redo pass
Redo walks forward from start_redo_lsa to end_redo_lsa,
applying every record whose target page on disk has a stale LSA.
The semantics is textbook ARIES “repeating history”: at the end
of redo, every page is in the exact state it was in the moment
before the crash, including changes from never-committed TXs.
Loser cleanup is phase 6.
The per-record dispatcher is log_rv_redo_record_sync<T>
(log_recovery_redo.hpp), a template specialised by log-record
payload type. The sync path:
- Read the next log record.
- Determine target VPID. Multi-page records dispatch one VPID at a time.
- Fix the target page (DWB-healed by now).
- If `page.lsa >= record.lsa`, skip — already on disk.
- Otherwise call `RV_fun[record.rcvindex].redofun (rcv)`.
- Set `page.lsa = record.lsa`, mark dirty, unfix.
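The apply-or-skip decision, and why it is idempotent across re-crashes, fits in a toy model (invented `page_t`/`redo_rec_t` shapes, a numeric delta standing in for the redo function):

```c
#include <assert.h>
#include <stdint.h>

/* Toy page and redo record; shapes invented for this sketch. */
typedef struct { uint64_t lsa; int value; } page_t;
typedef struct { uint64_t lsa; int delta; } redo_rec_t;

/* Returns 1 if the record was applied, 0 if skipped. */
static int redo_apply (page_t *pg, const redo_rec_t *rec)
{
  if (pg->lsa >= rec->lsa)
    return 0;                  /* page already reflects this record */
  pg->value += rec->delta;     /* stand-in for RV_fun[rcvindex].redofun */
  pg->lsa = rec->lsa;          /* stamp so a re-crash replay skips it */
  return 1;
}
```

Because applying a record also stamps the page with the record's LSA, replaying the same log prefix after a crash during redo applies each record at most once.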
One specialisation is widely misunderstood: for
LOG_REC_COMPENSATE (CLR), dispatch returns the undo function
from RV_fun[], not the redo function. A CLR’s payload is the
undo image of a previously-rolled-back action; replaying it
forward during redo means re-applying that undo. The source
comment in log_rv_get_fun<LOG_REC_COMPENSATE> is “yes, undo”.
Mistakes here are how engines lose data during double-fault
recovery. See
cubrid-recovery-manager.md.
The modern path runs redo in parallel by VPID.
log_recovery_redo_parallel.{cpp,hpp} drives a reader thread that
walks the log sequentially; sync-only recovery functions apply
inline, others dispatch as cublog::redo_job_impl to a worker
pool hashed by VPID. Hashing by VPID preserves per-page LSA
monotonicity without locks while disjoint pages overlap freely.
Two design points to internalise:
- Redo replays losers’ changes too. Phase 6 undoes them. Without redo applying the loser changes first, undo couldn’t work — the undo functions assume the page is in the post-action state.
- `page.lsa >= record.lsa` is the per-page termination condition. A page already flushed before crash carries an LSA at-or-beyond its last log record; redo skips it. A dirty-at-crash page lags; redo applies until it catches up.
After redo, the database is crash-consistent. Then
log_recovery_finish_all_postpone walks TT entries in
TRAN_UNACTIVE_*_COMMITTED_WITH_POSTPONE and replays each
postpone via log_do_postpone. CUBRID’s pass order (analysis →
redo → postpone → undo) departs from textbook ARIES because
postpones must finish before undo of losers — otherwise an undo
could roll back state a postpone depends on.
Step 6 — Undo pass
Undo erases loser changes. After phase 5 the database mirrors the
moment of crash, so loser changes are present and visible. Undo
walks each loser TX’s per-TX log chain backward from tail_lsa,
applying compensating actions and emitting compensation log
records (CLRs) until the chain hits head_lsa.
The driver is log_recovery_undo. For each loser TX (left as
TRAN_UNACTIVE_UNILATERALLY_ABORTED after analysis), it walks
prev_tranlsa backward calling log_rv_undo_record:
- Physical records (`LOG_UNDOREDO_DATA`, `LOG_MVCC_UNDOREDO_DATA`): fix the target page, call `RV_fun[rcvindex].undofun`, emit `LOG_COMPENSATE` whose `undo_nxlsa` points at the predecessor of the undone record.
- Logical records (`LOG_SYSOP_END_LOGICAL_UNDO` and friends): defer to system-op undo machinery in cubrid-transaction.md. Logical undo is how CUBRID handles B+Tree operations — undoing a split physically would require logging the entire pre-split page; the logical scheme records “delete key K from index I” and the undo function reproduces the inverse against whatever state the page is now in.
The CLR’s undo_nxlsa is what makes undo itself redoable, the
property that lets ARIES survive a crash during recovery. A re-crash during undo lets the next
restart’s redo pass replay the partial CLR chain forward (CLRs
are redo-only), and undo resumes by reading undo_nxlsa to skip
already-undone records.
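A toy rollback illustrating the chain walk and the CLR bookkeeping; the shapes are invented and the "undo" is a numeric compensation, but the `undo_nxlsa` discipline is the one described above:

```c
#include <assert.h>
#include <stdint.h>

#define NULL_LSA 0

/* Toy per-TX undo record: its own LSA, the previous record in the
 * same transaction's chain, and the change to compensate. */
typedef struct { uint64_t lsa, prev_tranlsa; int delta; } undo_rec_t;

static uint64_t clr_undo_nxlsa[16];   /* emitted CLRs (redo-only) */
static int clr_count;

/* Walk the chain backward from tail_lsa, undoing each record and
 * recording where a resumed undo should continue. */
static int rollback_loser (const undo_rec_t *chain, int n,
                           uint64_t tail_lsa, int value)
{
  uint64_t at = tail_lsa;
  while (at != NULL_LSA)
    {
      const undo_rec_t *r = 0;
      for (int i = 0; i < n; i++)       /* toy lookup by LSA */
        if (chain[i].lsa == at)
          r = &chain[i];
      value -= r->delta;                /* compensating action */
      clr_undo_nxlsa[clr_count++] = r->prev_tranlsa;  /* CLR bookkeeping */
      at = r->prev_tranlsa;
    }
  return value;
}
```

If a crash interrupts this loop, the last emitted `undo_nxlsa` tells the next restart exactly where to resume, so no record is undone twice.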
In our crash scenario T_a and T_b are the B+Tree-split losers;
their chains walk back through every split/merge their statements
caused. T_c (committed-with-postpone) was already finalised in
phase 5’s postpone sub-step. In-doubt 2PC TXs are untouched and
sit in TRAN_UNACTIVE_2PC_PREPARED awaiting the coordinator
(cubrid-2pc.md).
After undo, the database is transactionally consistent. The
recovery manager sets log_Gl.rcv_phase = LOG_RESTARTED and
returns. As the last action log_recovery calls
(void) logpb_checkpoint (thread_p) so the next restart starts
from a clean boundary already including the recovery work. For
CLR shape, savepoints, partial undo: see
cubrid-recovery-manager.md and
cubrid-transaction.md.
Step 7 — Vacuum catches up
CUBRID is MVCC. A “deleted” row is not removed in place — a new version marked deleted-by-MVCCID-X is written, and the old version stays visible to snapshots that still see X. When X drops below the global oldest visible MVCCID, the old version is dead and can be reclaimed. Vacuum is the background subsystem that does this.
The durable handle for vacuum’s progress is
LOG_HEADER.mvcc_op_log_lsa, alongside chkpt_lsa in the log
header. The vacuum master reads it on restart, computes the
oldest visible MVCCID from the recovered TT, and schedules
block-reclamation jobs for the log range between
mvcc_op_log_lsa and the current tail.
Vacuum is not on the restart critical path. It runs as a
background daemon and starts after LOG_RESTARTED is set.
Clients can connect before vacuum catches up; the cost is dead
versions occupying heap space and a slight scan-latency hit.
Blocking acceptance on vacuum could add minutes on a busy
database, which is operationally unacceptable.
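A sketch of the watermark computation, with invented shapes (a flat array of per-TX MVCCIDs standing in for the recovered trantable):

```c
#include <assert.h>
#include <stdint.h>

#define MVCCID_NULL 0

/* Oldest visible MVCCID: the minimum id still held by any live
 * transaction, else the next id to be issued.  Versions whose
 * deleter is below this watermark are dead and reclaimable. */
static uint64_t oldest_visible_mvccid (const uint64_t *tx_mvccid, int n,
                                       uint64_t next_mvccid)
{
  uint64_t oldest = next_mvccid;
  for (int i = 0; i < n; i++)
    if (tx_mvccid[i] != MVCCID_NULL && tx_mvccid[i] < oldest)
      oldest = tx_mvccid[i];
  return oldest;
}
```

On restart there are no live client transactions yet, so the watermark jumps straight to the next-to-issue id and vacuum can schedule every pending block.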
Two interactions:
- MVCCID issuance is lazy. Analysis does not pre-allocate MVCCIDs; each `LOG_MVCC_*` record carries the issuing TX’s MVCCID, and per-TDES MVCCID state is rebuilt during analysis from those fields. This is why `LOG_INFO_CHKPT_TRANS` has no MVCCID field. See cubrid-mvcc.md.
- Vacuum workers share the engine’s `cubthread::manager` pool. They are alive but idle until the vacuum master submits its first batch.
For the master/worker split, block-job scheduling, OID heap
traversal, and index purging, see
cubrid-vacuum.md.
Step 8 — Pick up replication
For a stand-alone server (no HA), this phase is a no-op.
With HA configured, the boot’s HA init dispatches by role:
- Slave. `applylogdb` persists a progress LSA on every successful apply (in `db_ha_apply_info` or the equivalent state file). On restart it reads the persisted LSA, opens the master’s WAL stream (or archive), and resumes. Records between persisted LSA and master tail are replayed in order; later records arrive as the master produces them.
- Master. Slaves’ `copylogdb` peers reconnect; each tells the master the LSA it last received, and the master streams from there. A slave that has fallen too far behind (master archives purged below its resume LSA) must be re-bootstrapped from a backup — cubrid-backup-restore.md.
The role is decided by cub_master and the heartbeat daemon via
UDP heartbeats (cubrid-heartbeat.md);
the role is persisted in the database header so the server is
self-consistent on restart.
Replication recovery is a separate concern from crash recovery,
though they share LSA handles. Crash recovery (phases 1-6)
re-establishes this server’s local state up to its log tail;
replication catch-up then re-establishes the slave’s position
relative to the master. For apply/copy daemon architecture,
conflict resolution, and on-the-wire format see
cubrid-ha-replication.md.
Step 9 — Open for connections
The last action of the restart pipeline is to start the network listener. Until this point, the OS has rejected every incoming TCP packet because no socket was listening.
The listener is brought up by css_init (in
cubrid-network-protocol.md),
called from net_server_start after boot_restart_server
returns. css_init binds a socket on the configured port,
listen(2)s with a backlog, and spawns the listener thread
whose loop is accept(2) → dispatch to a worker.
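The gate itself reduces to a few lines. This toy model borrows the flag names from the text but none of the real socket plumbing:

```c
#include <assert.h>

/* Toy model of "recover first, listen second": a connection attempt
 * can only succeed once the boot status flips and a socket exists. */
typedef enum { BOOT_SERVER_DOWN, BOOT_SERVER_UP } boot_status_t;

static boot_status_t boot_Server_status = BOOT_SERVER_DOWN;
static int listener_socket_open = 0;

static void open_listener (void)
{
  /* css_init would bind/listen here; it runs only after recovery. */
  boot_Server_status = BOOT_SERVER_UP;
  listener_socket_open = 1;
}

/* A connect attempt: -1 models the OS rejecting the packet because
 * no socket is listening, 0 models a successful accept. */
static int try_connect (void)
{
  return listener_socket_open ? 0 : -1;
}
```

The ordering guarantee is structural, not a check: there is simply no code path that opens the socket before recovery returns.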
At listen(2), boot_Server_status is already BOOT_SERVER_UP.
The first accepted connection runs the registration handshake
xboot_register_client (boot_sr.c):
- Validate the client’s credentials against the catalog.
- Issue a TRANID, allocate a TDES.
- Return a packed `BOOT_SERVER_CREDENTIAL` (`boot.h`) carrying page size, log page size, root-class OID, disk-compatibility number, HA state, charset, language, session key.
The client side is boot_restart_client (boot_cl.c), which
the CAS process calls during its own init. Once the credential
is unpacked, the CAS can issue real SQL.
Client connections arrive via cub_broker, a separate process per
logical service that owns a pool of CAS workers. On cub_server
restart, each broker detects the server is back and its CAS
workers reconnect via the register flow. Pool management,
sticky-session policy, and broker-to-server failover are in
cubrid-broker.md.
A subtle invariant: before the listener opens, the database
is fully recovered and transactionally consistent. The first
SELECT 1 sees a coherent universe — the engine’s contract
with applications. For RPC packet format, keep-alive, and
connection lifecycle see
cubrid-network-protocol.md.
Diagram — full restart pipeline
flowchart TB
subgraph CRASH["Crash state on disk"]
direction TB
HV["Home-volume pages\n(some torn)"]
DV["DWB volume\n(clean copies of in-flight pages)"]
LV["Active log\nrecords up to last flush"]
LH["Log header\nchkpt_lsa, mvcc_op_log_lsa"]
HA["HA progress files\nlast-applied LSA"]
end
subgraph BOOT["Phase 1 — Process boot (boot_sr.c)"]
direction TB
B1["main → net_server_start"]
B2["boot_restart_server: subsystems in topological order"]
B3["disk manager: open volumes"]
B4["page buffer: alloc cache"]
B5["log_initialize: read hdr, see is_shutdown=false"]
end
subgraph DWBPHASE["Phase 2 — DWB recovery (dwb_load_and_recover_pages)"]
direction TB
D1["Scan DWB slots"]
D2["For each slot, read home page"]
D3{"home OK?"}
D4["Skip"]
D5["Write DWB slot over home, fsync"]
D6["Mark DWB clean"]
end
subgraph CHKPT["Phase 3 — Locate checkpoint (log_recovery)"]
direction TB
C1["rcv_lsa = log_Gl.hdr.chkpt_lsa"]
C2["If media crash: take min over per-volume rcv_lsa"]
end
subgraph ANALYSIS["Phase 4 — Analysis pass (log_recovery_analysis)"]
direction TB
A1["Walk log forward from rcv_lsa"]
A2["log_rv_analysis_record: switch on LOG_RECTYPE"]
A3["Seed TT from LOG_END_CHKPT snapshot"]
A4["Update TT/DPT per record"]
A5["Classify each TX:\ncommitted | aborted | loser | in-doubt"]
A6["start_redo_lsa, end_redo_lsa"]
end
subgraph REDO["Phase 5 — Redo pass (log_recovery_redo)"]
direction TB
R1["Walk log forward from start_redo_lsa"]
R2["log_rv_redo_record_sync<T>: dispatch by payload type"]
R3["fix page; if page.lsa < record.lsa: apply RV_fun[idx].redofun"]
R4["Parallel by VPID hash via redo_parallel"]
R5["log_recovery_finish_all_postpone:\nreplay COMMIT_WITH_POSTPONE"]
end
subgraph UNDO["Phase 6 — Undo pass (log_recovery_undo)"]
direction TB
U1["For each loser TX:"]
U2["Walk prev_tranlsa backward from tail_lsa"]
U3["log_rv_undo_record:\nRV_fun[idx].undofun + emit LOG_COMPENSATE"]
U4["CLR.undo_nxlsa = predecessor"]
U5["TX → TRAN_UNACTIVE_UNILATERALLY_ABORTED"]
U6["Final logpb_checkpoint() — clean boundary"]
end
subgraph VACUUM["Phase 7 — Vacuum catch-up (background)"]
direction TB
V1["Vacuum master reads mvcc_op_log_lsa"]
V2["Compute oldest_visible_MVCCID"]
V3["Schedule block-reclamation jobs"]
end
subgraph HAPHASE["Phase 8 — HA catch-up (background)"]
direction TB
H1{"role?"}
H2["slave: applylogdb resumes from last-applied LSA"]
H3["master: copylogdb peers reconnect, stream from peer LSAs"]
end
subgraph LISTEN["Phase 9 — Open for connections (css_init)"]
direction TB
L1["bind / listen / spawn listener thread"]
L2["boot_Server_status = BOOT_SERVER_UP"]
L3["broker → CAS reconnect → xboot_register_client"]
L4["BOOT_SERVER_CREDENTIAL returned to client"]
L5["First SELECT 1"]
end
CRASH --> BOOT
BOOT --> DWBPHASE
HV -.->|read| D2
DV -.->|read| D1
D1 --> D2 --> D3
D3 -- yes --> D4
D3 -- no --> D5
D4 --> D6
D5 --> D6
DWBPHASE --> CHKPT
LH -.->|read| C1
C1 --> C2
CHKPT --> ANALYSIS
LV -.->|walk| A1
A1 --> A2 --> A3 --> A4 --> A5 --> A6
ANALYSIS --> REDO
A6 --> R1
R1 --> R2 --> R3 --> R4 --> R5
REDO --> UNDO
U1 --> U2 --> U3 --> U4 --> U5 --> U6
UNDO --> VACUUM
UNDO --> HAPHASE
UNDO --> LISTEN
V1 --> V2 --> V3
HA -.->|read| H2
H1 -- slave --> H2
H1 -- master --> H3
L1 --> L2 --> L3 --> L4 --> L5
The diagram compresses every handoff in the restart pipeline. Three properties are visible at this resolution:
- Phase 2 must precede phase 4. ARIES analysis reads pages via the page buffer; if those pages are torn, the analysis walk crashes on a malformed page header. DWB recovery is the only protection against this.
- Phases 7 and 8 fan out from phase 6. They are parallel background activity that doesn’t gate phase 9. A client can see a transactionally-consistent database before vacuum has cleaned up dead versions and before HA has fully caught up.
- The handoff from phase 6 to phase 9 is direct. No further preparation is required; once `LOG_RESTARTED` is set and the post-recovery checkpoint is durable, the listener opens.
What we did NOT cover
This document is the panorama, not the encyclopaedia. Several branches of the broader recovery story are deliberately left to their own docs:
- Recovery from a backup file (PITR). Our scenario assumed the on-disk state was salvageable — torn pages but no destroyed volumes. If a volume is lost (disk failure, FS corruption, accidental `rm`), the recovery path is different: restore from the most-recent backup, then replay archived WAL forward to the desired point in time. The entry point is `boot_restart_from_backup`; the WAL replay path goes through the same redo dispatcher we covered, but anchored on a backup LSA instead of `chkpt_lsa`. See cubrid-backup-restore.md.
- Media recovery (single-volume restore). A subset of the above. One volume goes bad while the rest of the database is healthy; the operator restores just that volume and the engine rolls it forward through archived WAL. The `log_rv_find_checkpoint` branch in `log_recovery` handles the per-volume LSA walk this scenario needs. See cubrid-backup-restore.md.
- Parallel-redo internals. We mentioned the per-VPID hashing and the worker pool, but the job-queue shape, the back-pressure policy, the page-fix coordination with the buffer manager, and the perf-counter scaffolding shared with the page-server replication path are all in cubrid-recovery-manager.md under “Parallel redo”.
- Server-mode-versus-stand-alone differences. The boot paths for `cub_server` (server mode) and for `csql -S`/`loaddb` (stand-alone mode) differ in which subsystems come up. Stand-alone tools share the recovery manager but skip the network listener and the broker handshake. See cubrid-boot.md and cubrid-sa-cs-runtime.md.
- In-doubt 2PC resolution after restart. Phase 4 leaves prepared 2PC TXs in `TRAN_UNACTIVE_2PC_PREPARED`. Phases 5 and 6 do not touch them. After phase 9, the coordinator’s decision arrives over the network, and the in-doubt TX is committed or aborted via `xtran_2pc_*`. The orphan-TX timer and the `LOG_RECOVERY_FINISH_2PC_PHASE` enum are the entry points; details in cubrid-2pc.md.
- TDE-encrypted log pages during recovery. If the database uses transparent data encryption, log pages are encrypted at rest and must be decrypted before the recovery manager can parse them. Decryption happens inside the log reader, before the analysis/redo dispatchers see records. See cubrid-tde.md.
- Authentication, sessions, and authorization. Phase 9 opens the socket; the first thing each connection does is authenticate. Authentication state, role resolution, and per-session credentials are handled by cubrid-authentication.md and cubrid-server-session.md.
- Catalog rehydration and class-cache rebuild. The boot module’s late phase reopens the catalog after recovery finishes; the locator, class object cache, and statistics cache are repopulated lazily on first reference. See cubrid-catalog-manager.md, cubrid-class-object.md, and cubrid-locator.md.
Sources
Recovery-pipeline detail docs in this knowledge base
- cubrid-boot.md — the topological subsystem-init order, the create-vs-restart dispatch, the boot-status flag, and `xboot_register_client` for client connect.
- cubrid-double-write-buffer.md — torn-page protection: slot lifecycle, `dwb_load_and_recover_pages`, the parallel flush worker pool, the second-block design.
- cubrid-checkpoint.md — fuzzy-checkpoint protocol, the `LOG_START_CHKPT`/`LOG_END_CHKPT` bracket, the redo-LSA hint, the `chkpt_lsa` header field.
- cubrid-log-manager.md — the WAL framework: log-record shape, prior-list discipline, log-page layout, the log header.
- cubrid-recovery-manager.md — the three-pass ARIES driver: `log_recovery`, the per-record dispatchers, `RV_fun[]`, the templated redo, parallel redo by VPID, the postpone pass, the CLR contract.
- cubrid-transaction.md — TDES shape, per-TX log chain, system ops, savepoints, logical undo.
- cubrid-2pc.md — `LOG_2PC_PREPARE`, in-doubt recovery, the `LOG_RECOVERY_FINISH_2PC_PHASE` enum, the coordinator-initiated resolution path.
- cubrid-vacuum.md — vacuum master/worker split, MVCCID watermarks, block-job scheduling, `mvcc_op_log_lsa` recovery.
- cubrid-mvcc.md — MVCC issuance, snapshot rebuild, visibility rules during recovery.
- cubrid-ha-replication.md — `applylogdb`, `copylogdb`, slave/master roles, last-applied LSA persistence.
- cubrid-heartbeat.md — UDP heartbeats, role negotiation, `cub_master` arbitration.
- cubrid-network-protocol.md — `css_init`, the listener loop, the on-the-wire RPC format.
- cubrid-broker.md — `cub_broker`, CAS pool, reconnect policy, sticky sessions.
- cubrid-backup-restore.md — branched recovery: PITR, media recovery, single-volume restore.
- cubrid-tde.md — encrypted-log decryption during recovery.
- cubrid-page-buffer-manager.md — page-fix and dirty-tracking the redo pass relies on.
- cubrid-disk-manager.md — volume open and the per-volume disk header that media recovery uses.
Code paths consulted
- `src/executables/server.c` — `cub_server` `main()`.
- `src/communication/network_sr.c` — `net_server_start`, `css_init`.
- `src/transaction/boot_sr.c` — `boot_restart_server`, `xboot_register_client`, `xboot_initialize_server`.
- `src/transaction/log_manager.c` — `log_initialize`, the `is_shutdown` gate dispatching to recovery.
- `src/transaction/log_recovery.c` — `log_recovery`, `log_recovery_analysis`, `log_recovery_redo`, `log_recovery_finish_all_postpone`, `log_recovery_undo`, `log_rv_find_checkpoint`.
- `src/transaction/log_recovery_redo.{cpp,hpp}` — templated per-record redo dispatcher and the `LOG_REC_COMPENSATE` specialisation.
- `src/transaction/log_recovery_redo_parallel.{cpp,hpp}` — per-VPID worker pool.
- `src/transaction/recovery.h` — `RV_fun[]`, `LOG_RCVINDEX`, `LOG_RCV`.
- `src/storage/double_write_buffer.cpp` — `dwb_load_and_recover_pages`.
- `src/storage/page_buffer.c` — `pgbuf_flush_checkpoint`, fix/unfix discipline.
- `src/transaction/log_page_buffer.c` — `logpb_checkpoint`, `logpb_flush_header`.
- `src/connection/server_support.c` — listener thread, accept loop.
Theoretical references
- Mohan, Haderle, Lindsay, Pirahesh, Schwarz, ARIES, ACM TODS 17.1, 1992 — the canonical algorithm CUBRID implements.
- Petrov, Database Internals, 2019, ch. 5 §“Recovery” and §“ARIES”.
- Bernstein, Hadzilacos, Goodman, Concurrency Control and Recovery in Database Systems, 1987 — checkpoints, consistent-vs-fuzzy distinction.
- Silberschatz, Korth, Sudarshan, Database System Concepts, 7th ed., ch. 19.