CUBRID Checkpoint — Fuzzy ARIES Checkpoint Protocol, Active-TX Snapshot, and Recovery Anchor LSA
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
Theoretical Background
A checkpoint is the database engine’s negotiated truce with the recovery manager: a periodic record in the write-ahead log whose position bounds the work the next restart will have to do. Without checkpoints, the analysis pass at restart would have to walk the entire log from the beginning of time — every record ever written — because it cannot otherwise know which transactions were active when the engine crashed and which dirty pages may not have reached their home volumes. With checkpoints, the engine periodically commits a piece of self-knowledge to the log: “as of this LSA, here is the set of in-flight transactions and here is the smallest LSA whose data page is still dirty in memory; recovery may safely begin its analysis pass from that LSA.” Database Internals (Petrov, ch. 5 §“Recovery”) frames it as the fundamental tool that makes restart time bounded rather than proportional to engine uptime.
The contract every checkpoint protocol must satisfy is laid out in the ARIES paper (Mohan et al., TODS 17.1, 1992). The paper distinguishes two checkpoint families on a single axis — how much of the engine has to stop while the checkpoint runs.
- Sharp / consistent / quiescent checkpoints. Drain all in-flight transactions, flush every dirty page, then write one checkpoint record. Captured state is consistent. Cost: every checkpoint freezes user transactions for the duration of the buffer-pool flush. Bernstein/Hadzilacos/Goodman (ch. 6 §“Checkpointing”) describe this as the textbook variant; few production engines use it because the throughput hit is unacceptable.
- Fuzzy / non-blocking checkpoints. Take a snapshot of the active-transaction set and the dirty-page set without freezing them, write it to the log between two bracketing records (begin-CHKPT and end-CHKPT in ARIES parlance), and continue serving traffic. Captured state is internally inconsistent — between the bracket records, transactions commit, abort, and dirty new pages — but ARIES proves the analysis pass tolerates this by treating any record in the window the same as a record after end-CHKPT.
CUBRID picks the fuzzy variant. ARIES frames correctness around three properties:
- Coverage. Every transaction active at end-CHKPT must either appear in the snapshot or have its activity visible in records after end-CHKPT. ARIES guarantees this by walking the trantable under a read-mode lock.
- Anchoring. The redo-LSA hint must be a lower bound on the LSA of any dirty page. Anything below is provably on disk; anything above may or may not be.
- Restartability. A crash during checkpoint must not corrupt the recovery boundary. ARIES makes the checkpoint entirely log-resident — the in-memory pointer (chkpt_lsa) advances only after both bracket records are durable.
The two-step (begin-CHKPT / end-CHKPT) pattern is the canonical ARIES expression of these properties. The rest of this document is the slow zoom into how CUBRID realises it.
Common DBMS Design
Almost every WAL-based engine implements a fuzzy checkpoint
protocol; the names differ but the mechanics are remarkably stable.
This section names the shared engineering vocabulary so that the
CUBRID-specific symbols in the “CUBRID’s Approach” section slot
into a familiar shape.
Periodic timer or threshold trigger
The checkpoint is launched either on a wall-clock timer or when
the log has consumed a configured amount of space since the
previous checkpoint. The two policies are usually combined.
Time-based triggering bounds recovery time after long quiet
periods; space-based triggering bounds it after busy periods.
PostgreSQL exposes checkpoint_timeout + max_wal_size; CUBRID
exposes log_checkpoint_interval (time) and log_checkpoint_size
(log-page count).
Active-transaction snapshot
The checkpoint emits one entry per in-flight transaction with
enough state to re-bootstrap recovery’s transaction table from the
checkpoint record alone. The minimum is (trid, state, head_lsa, tail_lsa); state distinguishes losers (active, will be undone)
from in-doubts (2PC-prepared, kept alive) from
committed-with-postpone (need postpones replayed). Most engines
also capture undo_nxlsa and savepoint LSAs.
Dirty-page snapshot — or the redo-LSA hint that subsumes it
Two design points. (a) Some engines emit the entire dirty-page
table — (volid, pageid, recovery_lsa) triples — so recovery
reconstructs the DPT directly; this is the layout the ARIES paper
itself describes. Downside: a busy engine
emits a multi-megabyte record. (b) Most modern engines compress
emits a multi-megabyte record. (b) Most modern engines compress
the DPT into a single redo-LSA scalar (smallest recovery_lsa).
Recovery rebuilds the DPT inline by walking forward from this LSA.
CUBRID takes the compressed approach: LOG_REC_CHKPT.redo_lsa is
a single LSA.
Bracketing records — begin-CHKPT and end-CHKPT
The checkpoint emits begin-CHKPT (marking the LSA the next analysis
starts from) and end-CHKPT (carries the snapshot). Between them
ordinary transactions continue producing log records; ARIES proves
that bracket-window records are correctly handled by analysis as
if they were after end-CHKPT.
Page-buffer flush coordination & header durability
The checkpoint drives enough dirty pages to disk that the next
redo-LSA can advance — otherwise the redo-LSA stays pinned and
recovery work keeps growing. PostgreSQL’s CheckPointBuffers,
InnoDB’s flush-list worker, and CUBRID’s pgbuf_flush_checkpoint
all share this role. Recovery starts by reading the active-log
header to find the checkpoint LSA; the header therefore carries a
chkpt_lsa field that advances only after the checkpoint record
is durable.
Comparative landscape
| Engine | Header / pointer | Trigger | Dirty-page hint |
|---|---|---|---|
| PostgreSQL | pg_control.checkPoint | checkpoint_timeout + max_wal_size | redoLSN scalar; runningXacts side record |
| InnoDB | log-file header | log-bytes-since-last + dirty-pages-pct | oldest_modification per page tracker |
| Oracle | controlfile SCN | redo-log size + manual checkpoint | Mean Time To Recover targets |
| SQL Server | bootpage dbi_checkptLSN | recovery-interval target | dirty-page LSN watermark |
| CUBRID | log_Gl.hdr.chkpt_lsa | log_checkpoint_interval (timer) | LOG_REC_CHKPT.redo_lsa scalar |
All entries are fuzzy. CUBRID sits in the mainstream; its distinctive choices are the explicit dual bracket records (faithful to the ARIES paper) and a timer-only trigger.
Theory ↔ CUBRID mapping
| Theoretical concept | CUBRID name |
|---|---|
| Fuzzy checkpoint daemon | log_Checkpoint_daemon (created by log_checkpoint_daemon_init) |
| Daemon period (timer) | log_get_checkpoint_interval reading PRM_ID_LOG_CHECKPOINT_INTERVAL_SECS |
| Daemon body | log_checkpoint_execute → logpb_checkpoint |
| Begin-CHKPT log record | LOG_START_CHKPT (= 25) |
| End-CHKPT log record | LOG_END_CHKPT (= 26) |
| End-CHKPT payload | LOG_REC_CHKPT { redo_lsa, ntrans, ntops } |
| Per-tran snapshot row | LOG_INFO_CHKPT_TRANS { trid, state, head_lsa, tail_lsa, undo_nxlsa, posp_nxlsa, savept_lsa, ... } |
| Per-sysop snapshot row | LOG_INFO_CHKPT_SYSOP { trid, sysop_start_postpone_lsa, atomic_sysop_start_lsa } |
| Recovery anchor LSA in log header | log_Gl.hdr.chkpt_lsa |
| In-memory copy of last redo-LSA | log_Gl.chkpt_redo_lsa |
| Watermark for archive removal | log_Gl.hdr.smallest_lsa_at_last_chkpt |
| Mutex protecting chkpt_lsa & chkpt_redo_lsa | log_Gl.chkpt_lsa_lock |
| Active-TX walk | for i in trantable.num_total_indices: logpb_checkpoint_trans(...) |
| Active-sysop walk | logpb_checkpoint_topops (called twice — user trans and system trans) |
| Page-buffer flush helper | pgbuf_flush_checkpoint (flush_upto, prev_redo, *out_redo, *out_npages) |
| Force log header to disk | logpb_flush_header |
| Force WAL pages to disk | logpb_flush_pages_direct |
| File-system-level fsync of all volumes | fileio_synchronize_all (DWB cooperates inside) |
| Restart entry that consumes the checkpoint | log_recovery → log_recovery_analysis (start_lsa = log_Gl.hdr.chkpt_lsa) |
| Per-record analysis arms for chkpt records | log_rv_analysis_start_checkpoint, log_rv_analysis_end_checkpoint |
| LOG_ISCHECKPOINT_TIME macro | Page-count-based predicate (legacy path) in log_manager.c |
CUBRID’s Approach
CUBRID’s checkpoint subsystem has six moving parts: the daemon
registration that wakes the checkpoint thread on a timer; the
preflight phase that flushes the existing dirty pages and emits
LOG_START_CHKPT; the active-transaction snapshot captured by
walking the trantable under a read-mode lock; the active-sysop
snapshot for nested top-operations in commit-postpone; the
end-CHKPT emission that packs the snapshot into LOG_REC_CHKPT
plus its trailing arrays; and finally the header-update + fsync
that publishes the new chkpt_lsa so the next restart anchors on
it. We walk them in that order, then close with the recovery-side
view (how log_recovery_analysis consumes what was written) and
the cooperation with the double-write buffer.
Daemon registration
log_Checkpoint_daemon is a cubthread::daemon declared at file
scope in log_manager.c:
```cpp
// log_Checkpoint_daemon — src/transaction/log_manager.c
static cubthread::daemon *log_Checkpoint_daemon = NULL;
```

It is created by log_checkpoint_daemon_init, which the global
log-daemon bootstrap (log_daemons_init) calls during server start.
The daemon’s tick period comes from the
PRM_ID_LOG_CHECKPOINT_INTERVAL_SECS system parameter; the looper
binds a callback that re-reads the parameter on every tick so the
period can be retuned at runtime via db_change_active_log_arg-style
calls.
```cpp
// log_checkpoint_daemon_init — src/transaction/log_manager.c
REGISTER_DAEMON (log_checkpoint);

void
log_checkpoint_daemon_init ()
{
  assert (log_Checkpoint_daemon == NULL);

  cubthread::looper looper = cubthread::looper (log_get_checkpoint_interval);
  cubthread::entry_callable_task *daemon_task = new cubthread::entry_callable_task (log_checkpoint_execute);

  log_Checkpoint_daemon = cubthread::get_manager ()->create_daemon (looper, daemon_task, "log-checkpoint");
}
```

```cpp
// log_get_checkpoint_interval — src/transaction/log_manager.c
void
log_get_checkpoint_interval (bool & is_timed_wait, cubthread::delta_time & period)
{
  int log_checkpoint_interval_sec = prm_get_integer_value (PRM_ID_LOG_CHECKPOINT_INTERVAL_SECS);
  assert (log_checkpoint_interval_sec >= 0);

  if (log_checkpoint_interval_sec > 0)
    {
      is_timed_wait = true;
      period = std::chrono::seconds (log_checkpoint_interval_sec);
    }
  else
    {
      // infinite wait — checkpoint disabled until someone wakes it explicitly
      is_timed_wait = false;
    }
}
```

The default is 360 seconds (6 minutes), observable in
system_parameter.c where PRM_ID_LOG_CHECKPOINT_INTERVAL_SECS is
declared with default 360. Setting the parameter to zero disables
the timer entirely; the daemon will then wait indefinitely until
something calls log_wakeup_checkpoint_daemon. The user-facing wakeup
is the path used by ad-hoc cubrid_check requests and by the
(void) logpb_checkpoint (thread_p) call inside log_recovery after
restart finishes — the engine takes a fresh checkpoint as the very
last act of restart so the next crash starts from a clean boundary.
The daemon body is a thin wrapper:
```cpp
// log_checkpoint_execute — src/transaction/log_manager.c
static void
log_checkpoint_execute (cubthread::entry & thread_ref)
{
  if (!BO_IS_SERVER_RESTARTED ())
    {
      // wait for boot to finish — if we ran during analysis the trantable
      // would not be populated yet
      return;
    }

  logpb_checkpoint (&thread_ref);
}
```

BO_IS_SERVER_RESTARTED is the boolean that flips when boot has
fully completed, including the three-pass restart-recovery (analysis
→ redo → undo) plus all postpone replays. Until that flag is true
the daemon refuses to run — taking a checkpoint mid-restart would
record a transaction table that is itself being mutated by recovery.
A legacy page-count trigger is preserved as a macro:
```cpp
// LOG_ISCHECKPOINT_TIME — src/transaction/log_manager.c
#define LOG_ISCHECKPOINT_TIME() \
  (log_Gl.rcv_phase == LOG_RESTARTED \
   && log_Gl.run_nxchkpt_atpageid != NULL_PAGEID \
   && log_Gl.hdr.append_lsa.pageid >= log_Gl.run_nxchkpt_atpageid)
```

run_nxchkpt_atpageid is a per-process counter that
logpb_checkpoint advances by chkpt_every_npages (default
100000 log pages, clamped to ≥ PRM_ID_LOG_NBUFFERS) every time it
finishes. The intent was for log appenders to poll on each append
and trigger an inline checkpoint; in the modern code path the
daemon-driven timer dominates and the macro is defensive
belt-and-braces.
Top-level flow — logpb_checkpoint
```mermaid
sequenceDiagram
  participant T as log-checkpoint daemon
  participant CHK as logpb_checkpoint
  participant LOG as log append
  participant TT as trantable
  participant PB as page buffer
  participant DWB as DWB
  participant FS as filesystem
  participant HDR as log header page
  Note over T: timer tick (PRM_ID_LOG_CHECKPOINT_INTERVAL_SECS)
  T->>CHK: log_checkpoint_execute
  CHK->>LOG: logpb_flush_pages_direct (drain in-flight WAL)
  CHK->>LOG: prior_lsa_alloc_and_copy_data(LOG_START_CHKPT)
  Note right of CHK: newchkpt_lsa = LSA of begin record
  CHK->>PB: pgbuf_flush_checkpoint(newchkpt_lsa, prev_redo, &out_redo, &nflushed)
  PB->>DWB: dwb_flush_force (page-by-page)
  DWB->>FS: fsync DWB volume
  DWB->>FS: write home pages
  PB-->>CHK: tmp_chkpt.redo_lsa = smallest dirty LSA remaining
  CHK->>FS: fileio_synchronize_all (force all data volumes)
  CHK->>TT: TR_TABLE_CS_ENTER (read mode)
  CHK->>TT: walk trantable, build LOG_INFO_CHKPT_TRANS[]
  CHK->>TT: walk trantable, build LOG_INFO_CHKPT_SYSOP[]
  CHK->>LOG: prior_lsa_alloc_and_copy_data(LOG_END_CHKPT, packed_arrays)
  CHK->>TT: TR_TABLE_CS_EXIT
  CHK->>LOG: logpb_flush_pages_direct (force end record)
  CHK->>HDR: log_Gl.hdr.chkpt_lsa = newchkpt_lsa
  CHK->>HDR: log_Gl.hdr.smallest_lsa_at_last_chkpt = smallest_head_lsa
  CHK->>LOG: logpb_flush_header
  CHK->>FS: write checkpoint LSA into every volume's disk header
```
The numbered structure of logpb_checkpoint (spans roughly
log_page_buffer.c:6877 to :7300):
1. LOG_CS_ENTER. Exclusive on the log structure (the prior-list mutex sits inside).
2. Refuse if recovery hasn’t finished: if (BO_IS_SERVER_RESTARTED () && log_Gl.run_nxchkpt_atpageid == NULL_PAGEID) return; — the NULL_PAGEID sentinel also serves as the “only one checkpoint at a time” guard.
3. Snapshot the previous chkpt_lsa under chkpt_lsa_lock so readers like log_get_db_start_parameters don’t take the log CS.
4. Drain in-flight WAL via logpb_flush_pages_direct — every record before begin-CHKPT must be durable.
5. Emit LOG_START_CHKPT. A bare record; its LSA is captured as newchkpt_lsa, the next restart’s analysis start.
6. logtb_reflect_global_unique_stats_to_btree. CUBRID’s btrees keep cached unique counters that must be flushed to the catalog btree before the redo-LSA advances past them.
7. pgbuf_flush_checkpoint (newchkpt_lsa, …). Picks every BCB whose oldest_unflush_lsa <= newchkpt_lsa and flushes them through the DWB. The smallest remaining LSA is returned in tmp_chkpt.redo_lsa — the redo-LSA hint.
8. fileio_synchronize_all. fsync(2) every data volume.
9. Walk the trantable under TR_TABLE_CS_ENTER (read). Iterate 0..trantable.num_total_indices; skip the system transaction (the checkpoint runs as it). Each live LOG_TDES becomes one logpb_checkpoint_trans row.
10. Walk again for active sysops in TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE via logpb_checkpoint_topops.
11. Emit LOG_END_CHKPT. prior_lsa_alloc_and_copy_data with LOG_REC_CHKPT as data_header and the two trailing arrays.
12. Force the end record via another logpb_flush_pages_direct.
13. Update in-memory pointers under chkpt_lsa_lock: log_Gl.hdr.chkpt_lsa = newchkpt_lsa, log_Gl.chkpt_redo_lsa = tmp_chkpt.redo_lsa, log_Gl.hdr.smallest_lsa_at_last_chkpt ← smallest tran-head_lsa (the archive-removal watermark).
14. logpb_flush_header. Writes the active-log header page — after this fsync the boundary is durable.
15. Stamp every volume’s disk header for media recovery (read by log_rv_find_checkpoint during restart from a backup).
The two-phase commit pattern of the checkpoint itself is worth
naming: the checkpoint becomes effective when the log header is
durable, not when the end-CHKPT record is durable. A crash
between step 12 and step 14 leaves the log with both checkpoint
records present but the in-memory header pointer not yet on disk;
the next restart reads the on-disk header, finds the previous
chkpt_lsa, and runs analysis from there. The new bracket records
appear during that analysis and get treated as ordinary records;
because they don’t carry information the analysis needs (the
previous checkpoint already covered everything they would say
plus more), they are effectively ignored — the
log_rv_analysis_end_checkpoint arm short-circuits when
*may_use_checkpoint == false, which is the case if the start
record’s LSA didn’t match the analysis start LSA. Correctness is
preserved by paying the cost of redoing one checkpoint window.
The active-transaction snapshot — logpb_checkpoint_trans
The per-TDES extractor is short enough to read in full:
```cpp
// logpb_checkpoint_trans — src/transaction/log_page_buffer.c
void
logpb_checkpoint_trans (LOG_INFO_CHKPT_TRANS * chkpt_entries, log_tdes * tdes, int &ntrans, int &ntops,
                        LOG_LSA & smallest_lsa)
{
  LOG_INFO_CHKPT_TRANS *chkpt_entry = &chkpt_entries[ntrans];

  if (tdes != NULL && tdes->trid != NULL_TRANID && !tdes->tail_lsa.is_null () && tdes->commit_abort_lsa.is_null ())
    {
      chkpt_entry->isloose_end = tdes->isloose_end;
      chkpt_entry->trid = tdes->trid;
      chkpt_entry->state = tdes->state;
      LSA_COPY (&chkpt_entry->head_lsa, &tdes->head_lsa);
      LSA_COPY (&chkpt_entry->tail_lsa, &tdes->tail_lsa);

      if (chkpt_entry->state == TRAN_UNACTIVE_ABORTED)
        {
          /* Transaction is in the middle of an abort, since rollback is
           * not run in a critical section. Set the undo point to be the
           * same as its tail. The recovery process will read the last
           * record which is likely a compensating one, and find where to
           * continue a rollback operation. */
          LSA_COPY (&chkpt_entry->undo_nxlsa, &tdes->tail_lsa);
        }
      else
        {
          LSA_COPY (&chkpt_entry->undo_nxlsa, &tdes->undo_nxlsa);
        }

      LSA_COPY (&chkpt_entry->posp_nxlsa, &tdes->posp_nxlsa);
      LSA_COPY (&chkpt_entry->savept_lsa, &tdes->savept_lsa);
      LSA_COPY (&chkpt_entry->tail_topresult_lsa, &tdes->tail_topresult_lsa);
      LSA_COPY (&chkpt_entry->start_postpone_lsa, &tdes->rcv.tran_start_postpone_lsa);
      strncpy (chkpt_entry->user_name, tdes->client.get_db_user (), LOG_USERNAME_MAX);
      ntrans++;

      if (tdes->topops.last >= 0 && (tdes->state == TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE))
        {
          ntops += tdes->topops.last + 1;
        }

      if (LSA_ISNULL (&smallest_lsa) || LSA_GT (&smallest_lsa, &tdes->head_lsa))
        {
          LSA_COPY (&smallest_lsa, &tdes->head_lsa);
        }
    }
}
```

Three things deserve marking up. (a) The eligibility test
filters out three categories: null trantable slots, transactions
that haven’t logged anything yet (tail_lsa.is_null ()), and
transactions that have already appended their commit/abort record
(commit_abort_lsa.is_null () is false). The third test is
notable: a transaction whose commit record is in the prior list but
not yet drained still counts as in-flight from the trantable’s
perspective, but its commit record being already appended means
recovery will see it and resolve it without needing the snapshot
entry. The condition is “the actual transaction state is ignored
by the checkpoint mechanism as long as either the commit or the
abort log records have been appended” — the source comment is
explicit. (b) The TRAN_UNACTIVE_ABORTED path forces
undo_nxlsa = tail_lsa. Rollback in CUBRID does not hold a
critical section, so a checkpoint can land mid-rollback; the
snapshot must be a position recovery can resume from, and the tail
of the chain (most likely a CLR for the last completed undo step)
is the safe rendezvous. (c) The smallest_lsa accumulator is
the watermark used outside the loop to update
log_Gl.hdr.smallest_lsa_at_last_chkpt. This is not the redo-LSA
used by the analysis pass; it is the archive-retention watermark
— no log archive whose pages are all below this LSA is needed for
crash recovery, so archive removal is gated on it.
The active-sysop snapshot — logpb_checkpoint_topops
Sysops (CUBRID’s nested top-operations, equivalent to ARIES’s mini-transactions) only need to be snapshotted when they have side-effects that recovery’s postpone pass would replay. The extractor’s eligibility test reads:
```cpp
// logpb_checkpoint_topops — src/transaction/log_page_buffer.c (excerpt)
if (tdes != NULL && tdes->trid != NULL_TRANID
    && (!LSA_ISNULL (&tdes->rcv.sysop_start_postpone_lsa) || !LSA_ISNULL (&tdes->rcv.atomic_sysop_start_lsa)))
  {
    /* this transaction is running system operation postpone or an
     * atomic system operation
     * note: we cannot compare tdes->state with
     * TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE. we are
     * not synchronizing setting transaction state.
     * however, setting tdes->rcv.sysop_start_postpone_lsa is
     * protected by log_Gl.prior_info.prior_lsa_mutex. so we
     * check this instead of state. */
    ...
    LOG_INFO_CHKPT_SYSOP *chkpt_topop = &chkpt_topops[ntops];
    chkpt_topop->trid = tdes->trid;
    chkpt_topop->sysop_start_postpone_lsa = tdes->rcv.sysop_start_postpone_lsa;
    chkpt_topop->atomic_sysop_start_lsa = tdes->rcv.atomic_sysop_start_lsa;
    ntops++;
  }
```

The comment is the load-bearing piece. The natural eligibility
predicate would be tdes->state == TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE, but tdes->state
mutates without holding a mutex (the writers update it
optimistically and rely on per-record log entries for correctness).
What is protected by a mutex is the LSA that records “this
transaction has entered sysop-postpone”: setting
sysop_start_postpone_lsa requires prior_lsa_mutex. The
checkpoint walk already holds prior_lsa_mutex (it took the
mutex to guarantee no new prior-list nodes appear during the walk),
so the LSA test is the safe stand-in.
The captured LOG_INFO_CHKPT_SYSOP has only three fields — trid,
sysop_start_postpone_lsa, atomic_sysop_start_lsa. Recovery’s
analysis pass uses these to seed tdes->rcv.sysop_start_postpone_lsa
on the rebuilt TDES, which then drives the postpone replay during
log_recovery_finish_all_postpone.
There is a subtle interaction with the ntops counter. The first
trantable walk computes ntops by counting transactions in
TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE. The second walk
re-derives the actual ntops from the eligibility test above. The
two counts can diverge — a transaction can transition between
states between the two walks — and the second walk wins. The
length_all_tops buffer is reallocated inside logpb_checkpoint_topops
if the second walk’s running count exceeds the first walk’s
estimate.
The end record — LOG_REC_CHKPT and its trailing arrays
The on-log shape:
```cpp
// LOG_REC_CHKPT — src/transaction/log_record.hpp
typedef struct log_rec_chkpt LOG_REC_CHKPT;
struct log_rec_chkpt
{
  LOG_LSA redo_lsa;  /* Oldest LSA of dirty data page in page buffers */
  int ntrans;        /* Number of active transactions */
  int ntops;         /* Total number of system operations */
};

/* Transaction descriptor */
typedef struct log_info_chkpt_trans LOG_INFO_CHKPT_TRANS;
struct log_info_chkpt_trans
{
  int isloose_end;
  TRANID trid;                 /* Transaction identifier */
  TRAN_STATE state;            /* Transaction state (e.g., Active, aborted) */
  LOG_LSA head_lsa;            /* First log address of transaction */
  LOG_LSA tail_lsa;            /* Last log record address of transaction */
  LOG_LSA undo_nxlsa;          /* Next log record address for UNDO purposes */
  LOG_LSA posp_nxlsa;          /* First address of a postpone record */
  LOG_LSA savept_lsa;          /* Address of last savepoint */
  LOG_LSA tail_topresult_lsa;  /* Address of last partial abort/commit */
  LOG_LSA start_postpone_lsa;  /* Address of start postpone */
  char user_name[LOG_USERNAME_MAX];
};

typedef struct log_info_chkpt_sysop LOG_INFO_CHKPT_SYSOP;
struct log_info_chkpt_sysop
{
  TRANID trid;
  LOG_LSA sysop_start_postpone_lsa;
  LOG_LSA atomic_sysop_start_lsa;
};
```

The on-disk layout of the end-CHKPT record is therefore:

```
| LOG_RECORD_HEADER (back/forw/trid/type=LOG_END_CHKPT) |
| LOG_REC_CHKPT { redo_lsa, ntrans, ntops }             |
| LOG_INFO_CHKPT_TRANS [ntrans]                         |
| LOG_INFO_CHKPT_SYSOP [ntops]                          |
```

Two implementation details worth marking up. (a) The two arrays
are heap-allocated separately (malloc(ntrans * sizeof(...)) for
trans, ditto for topops) and the prior-list-allocator is then asked
to copy them into the appended record:
```cpp
// from logpb_checkpoint, end-CHKPT emission
node = prior_lsa_alloc_and_copy_data (thread_p, LOG_END_CHKPT, RV_NOT_DEFINED, NULL,
                                      length_all_chkpt_trans, (char *) chkpt_trans,
                                      (int) length_all_tops, (char *) chkpt_topops);

chkpt = (LOG_REC_CHKPT *) node->data_header;
*chkpt = tmp_chkpt;

prior_lsa_next_record_with_lock (thread_p, node, tdes);
```

The prior_lsa_alloc_and_copy_data overload accepts two payloads
which it concatenates after the data_header. The header itself is
filled in after allocation by writing through node->data_header.
(b) The record uses RV_NOT_DEFINED as its recovery index. A
checkpoint record has no redo function and no undo function — it is
purely informational, consumed by the analysis pass. The
RV_fun[RV_NOT_DEFINED] slot is a debug-only dump entry; trying to
redo or undo a checkpoint record would fail an assert.
Recovery interaction — analysis anchors on chkpt_lsa
The starting LSA of log_recovery_analysis is set inside
log_recovery as the very first action:
```cpp
// log_recovery — src/transaction/log_recovery.c (excerpt)
LSA_COPY (&rcv_lsa, &log_Gl.hdr.chkpt_lsa);

if (ismedia_crash != false)
  {
    /* Media crash, we may have to start from an older checkpoint... */
    (void) fileio_map_mounted (thread_p,
                               (bool (*)(THREAD_ENTRY *, VOLID, void *)) log_rv_find_checkpoint,
                               &rcv_lsa);
  }
```

The crash-vs-media distinction is significant. For crash
recovery, chkpt_lsa is exactly the LSA the analysis must start
from — every record before it has been consumed by a prior
checkpoint and is no longer needed. For media recovery (restart
from a backup), the per-volume disk headers may carry rcv-LSAs that
predate the global chkpt_lsa, because the backup was taken at a
different point in time; the loop walks all mounted volumes and
takes the minimum rcv-LSA across them.
The analysis pass arms for the two checkpoint records:
```cpp
// log_rv_analysis_start_checkpoint — src/transaction/log_recovery.c
static int
log_rv_analysis_start_checkpoint (LOG_LSA * log_lsa, LOG_LSA * start_lsa, bool * may_use_checkpoint)
{
  /* Use the checkpoint record only if it is the first record in the
   * analysis. */
  if (LSA_EQ (log_lsa, start_lsa))
    {
      *may_use_checkpoint = true;
    }
  return NO_ERROR;
}
```

```cpp
// log_rv_analysis_end_checkpoint — src/transaction/log_recovery.c (sketch)
if (*may_use_checkpoint == false)
  return NO_ERROR;
*may_use_checkpoint = false;

LSA_COPY (check_point, log_lsa);
/* read LOG_REC_CHKPT header, then the trans + topops trailing arrays */
...
for (i = 0; i < chkpt.ntrans; i++)
  {
    chkpt_one = &chkpt_trans[i];
    tdes = logtb_rv_find_allocate_tran_index (thread_p, chkpt_one->trid, log_lsa);
    logtb_clear_tdes (thread_p, tdes);

    if (chkpt_one->state == TRAN_ACTIVE || chkpt_one->state == TRAN_UNACTIVE_ABORTED)
      tdes->state = TRAN_UNACTIVE_UNILATERALLY_ABORTED;
    else
      tdes->state = chkpt_one->state;

    LSA_COPY (&tdes->head_lsa, &chkpt_one->head_lsa);
    LSA_COPY (&tdes->tail_lsa, &chkpt_one->tail_lsa);
    LSA_COPY (&tdes->undo_nxlsa, &chkpt_one->undo_nxlsa);
    LSA_COPY (&tdes->posp_nxlsa, &chkpt_one->posp_nxlsa);
    /* ...savept_lsa, tail_topresult_lsa, tran_start_postpone_lsa... */

    if (LOG_ISTRAN_2PC (tdes))
      *may_need_synch_checkpoint_2pc = true;
  }
```

Three observations. (a) The eligibility gate
if (LSA_EQ (log_lsa, start_lsa)) enforces that only the first
checkpoint encountered is consumed. Subsequent checkpoint records
inside the analysis window are skipped — they exist as ordinary
log traffic but their snapshot is stale. (b) TRAN_ACTIVE and
TRAN_UNACTIVE_ABORTED from the snapshot are coerced to
TRAN_UNACTIVE_UNILATERALLY_ABORTED — recovery treats every
still-active transaction as a loser. The 2PC-prepared state is
kept verbatim so the in-doubt path can find it. (c)
start_redo_lsa is set from chkpt.redo_lsa. This becomes the
lower bound log_recovery_redo walks from.
Recovery boundary diagram
Section titled “Recovery boundary diagram”flowchart LR
subgraph LOG["WAL on disk"]
direction LR
OLD["...older records..."] --> SC["LOG_START_CHKPT @ chkpt_lsa"]
SC --> M1["LOG_UNDOREDO_DATA"]
M1 --> M2["LOG_COMMIT (T17)"]
M2 --> EC["LOG_END_CHKPT (snapshot, redo_lsa=R)"]
EC --> P1["LOG_UNDOREDO_DATA"]
P1 --> P2["LOG_MVCC_REDO_DATA"]
P2 --> EOF["LOG_END_OF_LOG (crash here)"]
end
subgraph HDR["log header (pageid -9)"]
HC["chkpt_lsa = SC.lsa"]
HS["smallest_lsa_at_last_chkpt"]
end
subgraph PASS["analysis pass"]
A1["start_lsa = chkpt_lsa"]
A2["walk forward"]
A3["seed TT/DPT from end-CHKPT"]
A4["redo_lsa = chkpt.redo_lsa"]
end
HDR -.->|read at restart| A1
A1 --> SC
SC -.->|"start: may_use_chkpt=true"| A2
EC -.->|"consume snapshot"| A3
A3 --> A4
A4 -.->|"redo from R (may be < SC.lsa)"| LOG
The diagram makes the redo-vs-analysis distinction visible:
analysis starts from chkpt_lsa (the start-CHKPT record), but
redo starts from chkpt.redo_lsa (the smallest-dirty-LSA hint
from the end-CHKPT record). Those are typically the same LSA, but
not always — redo_lsa can be earlier than chkpt_lsa when a
page-buffer entry has an oldest_unflush_lsa that predates the
checkpoint (a long-lived dirty page). In that case redo starts
earlier in the log than the analysis start and scans forward from
there, applying each record only if the target page’s on-disk LSA is
below the record’s LSA.
Cooperation with the page buffer and DWB
pgbuf_flush_checkpoint is the page-buffer entry point that step 7
calls. Its essential body:
```cpp
// pgbuf_flush_checkpoint — src/storage/page_buffer.c (sketch)
int
pgbuf_flush_checkpoint (THREAD_ENTRY *thread_p, const LOG_LSA *flush_upto_lsa,
                        const LOG_LSA *prev_chkpt_redo_lsa, LOG_LSA *smallest_lsa, int *flushed_page_cnt)
{
  /* Things must be truly flushed up to this lsa */
  logpb_flush_log_for_wal (thread_p, flush_upto_lsa);
  LSA_SET_NULL (smallest_lsa);

  for (bufid = 0; bufid < pgbuf_Pool.num_buffers; bufid++)
    {
      bufptr = PGBUF_FIND_BCB_PTR (bufid);
      PGBUF_BCB_LOCK (bufptr);

      /* skip non-dirty, post-window-dirty, temp-volume BCBs */
      if (!pgbuf_bcb_is_dirty (bufptr)
          || LSA_GT (&bufptr->oldest_unflush_lsa, flush_upto_lsa)
          || pgbuf_is_temporary_volume (bufptr->vpid.volid))
        {
          PGBUF_BCB_UNLOCK (bufptr);
          continue;
        }

      /* defensive invariant: oldest_unflush_lsa must not predate
       * the previous checkpoint's redo-LSA */
      if (LSA_LT (&bufptr->oldest_unflush_lsa, prev_chkpt_redo_lsa))
        {
          er_set (...ER_LOG_CHECKPOINT_SKIP_INVALID_PAGE...);
          assert (false);
        }

      /* enqueue for flush, sorted by VPID */
      f_list[collected_bcbs++].bufptr = bufptr;
      ...
    }
  /* drain via pgbuf_flush_chkpt_seq_list → DWB → home volume */
}
```

Three things matter. (a) The first call
logpb_flush_log_for_wal enforces WAL ordering: every record up
to and including the begin-CHKPT (flush_upto_lsa = newchkpt_lsa)
is forced before any data page is written. (b) Pages that
became dirty after the begin-CHKPT are deliberately skipped —
flushing them would not advance the next redo-LSA usefully and
would interfere with in-flight transactions. (c) The
prev_chkpt_redo_lsa invariant catches incorrect redo-LSAs from
the previous checkpoint via assertion.
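The redo-LSA promise behind (b) and (c) reduces to a minimum computation: the end-CHKPT's redo_lsa must not exceed any surviving dirty page's oldest_unflush_lsa. A toy sketch of that computation (the `bcb_t` layout and the `chkpt_redo_lsa` helper are invented for illustration, not CUBRID code):

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
  int      dirty;              /* non-zero if the page has unflushed changes */
  uint64_t oldest_unflush_lsa; /* LSA of the oldest change not yet on disk */
} bcb_t;

/* redo_lsa = min(begin-CHKPT LSA, min over dirty BCBs of oldest_unflush_lsa).
 * A page dirtied only after the begin-CHKPT has oldest_unflush_lsa above it
 * and therefore never pulls the redo point back. */
static uint64_t
chkpt_redo_lsa (uint64_t begin_chkpt_lsa, const bcb_t *bcbs, int n)
{
  uint64_t redo = begin_chkpt_lsa;
  for (int i = 0; i < n; i++)
    if (bcbs[i].dirty && bcbs[i].oldest_unflush_lsa < redo)
      redo = bcbs[i].oldest_unflush_lsa;
  return redo;
}
```

This is why a long-lived dirty page can drag the next restart's redo point well behind the checkpoint record itself, and why flushing pages dirtied after the begin-CHKPT would not advance it.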
The actual write path goes through the DWB. BCBs are sorted by VPID
and handed to pgbuf_flush_chkpt_seq_list, which drives each page
through dwb_add_page so the DWB flush daemon can write the
staging slot before the home page. Even if the engine crashes
mid-checkpoint, every page mid-home-write is either fully on disk
(clean) or recoverable from its DWB slot during restart’s pre-redo
DWB scan. See cubrid-double-write-buffer.md.
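The crash-safety claim can be simulated with a toy double-write cycle (the names `dwb_stage`, `home_write_torn`, `dwb_recover` and the byte-sum checksum are invented stand-ins for the real DWB machinery, which uses real page checksums and fsync ordering):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE 8

typedef struct {
  unsigned char bytes[PAGE_SIZE];
  unsigned char checksum;       /* toy torn-write detector */
} disk_page_t;

static unsigned char
toy_sum (const unsigned char *b)
{
  unsigned char s = 0;
  for (int i = 0; i < PAGE_SIZE; i++)
    s = (unsigned char) (s + b[i]);
  return s;
}

/* Step 1: stage the full page in the DWB slot ("fsync" it first). */
static void
dwb_stage (disk_page_t *slot, const unsigned char *buf)
{
  memcpy (slot->bytes, buf, PAGE_SIZE);
  slot->checksum = toy_sum (buf);
}

/* Step 2: write home, but simulate a crash after half the bytes land. */
static void
home_write_torn (disk_page_t *home, const unsigned char *buf)
{
  memcpy (home->bytes, buf, PAGE_SIZE / 2);  /* torn write */
  /* checksum never updated: the tear is detectable at restart */
}

/* Restart: a home page whose checksum mismatches is restored from its
 * slot; an intact page is left alone. */
static bool
dwb_recover (disk_page_t *home, const disk_page_t *slot)
{
  if (toy_sum (home->bytes) == home->checksum)
    return false;               /* home page intact, nothing to do */
  *home = *slot;
  return true;
}
```

The ordering is the whole point: because the slot is durable before the home write starts, the home page is always either the complete old image, the complete new image, or detectably torn and rebuildable from the slot.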
```mermaid
flowchart LR
  subgraph CHK["logpb_checkpoint"]
    direction TB
    SC["1) emit LOG_START_CHKPT"]
    PF["2) pgbuf_flush_checkpoint(newchkpt_lsa)"]
    FA["3) fileio_synchronize_all"]
    EC["4) emit LOG_END_CHKPT"]
    HDR["5) flush log header"]
  end
  subgraph PB["page buffer"]
    BCBS["BCBs with oldest_unflush_lsa <= newchkpt_lsa"]
  end
  subgraph DWBP["DWB"]
    SLOT["staged page in DWB slot"]
    HOME["home volume page"]
  end
  subgraph FS["filesystem"]
    DV["data volumes"]
    LV["log file"]
    HV["log header page"]
  end
  SC --> LV
  PF --> BCBS
  BCBS -->|dwb_add_page| SLOT
  SLOT -->|fsync DWB volume| DV
  SLOT -->|then write home| HOME
  HOME --> DV
  PF --> FA
  FA -->|fsync each volume| DV
  EC --> LV
  HDR --> HV
  HV -.->|"chkpt_lsa = newchkpt_lsa"| LV
```
Failure cases
The protocol is designed to crash safely at any point.

- Before begin-CHKPT. No effect; the previous checkpoint remains the boundary.
- Between begin-CHKPT and end-CHKPT, or between end-CHKPT and header flush. `log_Gl.hdr.chkpt_lsa` on disk still points at the previous checkpoint (the header is the very last thing written). Analysis runs from the previous `chkpt_lsa` and encounters the partial new-bracket records as ordinary log traffic. The begin-CHKPT arm's eligibility gate (`LSA_EQ(log_lsa, start_lsa)`) fails — analysis didn't start from the new begin record — so the arm no-ops and the partial bracket is harmless. Cost: one checkpoint window's worth of re-processing.
- After header flush. The new `chkpt_lsa` is durable. Analysis starts from the new bracket; the gate fires correctly.
- Missing checkpoint (brand-new install). `chkpt_lsa` is `NULL_LSA`; analysis walks from the beginning of the log. Slow but correct.
- Checkpoint LSA past durable end-of-log. Header corruption — fatal (`logpb_fatal_error`), restore from backup.
- Mid-flush of dirty pages (step 7). DWB protects against torn pages: pages mid-home-write are restored from their DWB slot during `dwb_load_and_recover_pages` before redo runs.
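The middle failure windows turn on a single comparison inside the begin-CHKPT analysis arm. A toy model of that gate (names and the `durable_hdr_t` wrapper are invented; the real code compares with `LSA_EQ`):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Durable state at crash time: which chkpt_lsa the header points at.
 * Analysis always starts from this value at restart. */
typedef struct {
  uint64_t hdr_chkpt_lsa;
} durable_hdr_t;

/* The begin-CHKPT arm's eligibility gate: a bracket is trusted only
 * if analysis actually started at this begin record. A partial
 * bracket reached mid-walk is treated as ordinary log traffic. */
static bool
start_chkpt_arm (uint64_t analysis_start_lsa, uint64_t begin_rec_lsa)
{
  return analysis_start_lsa == begin_rec_lsa;
}
```

One comparison covers both windows: a crash before the header flush leaves the old anchor in place, so the new begin record fails the gate and no-ops; a crash after it makes the new record the start point, so the gate fires.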
Source Walkthrough
Daemon registration and timing

- `log_Checkpoint_daemon` (log_manager.c) — file-scope `cubthread::daemon` pointer.
- `log_checkpoint_daemon_init` (log_manager.c) — creates the daemon at server start.
- `log_get_checkpoint_interval` (log_manager.c) — reads `PRM_ID_LOG_CHECKPOINT_INTERVAL_SECS` for the looper period.
- `log_checkpoint_execute` (log_manager.c) — daemon body; defers to `logpb_checkpoint`.
- `log_wakeup_checkpoint_daemon` (log_manager.c) — out-of-band wakeup hook.
- `log_daemons_init` / `log_daemons_destroy` (log_manager.c) — bootstrap and teardown.
- `LOG_ISCHECKPOINT_TIME` macro (log_manager.c) — page-count-based legacy trigger.
Checkpoint emission
- `logpb_checkpoint` (log_page_buffer.c) — orchestrator.
- `logpb_checkpoint_trans` (log_page_buffer.c) — per-TDES extractor.
- `logpb_checkpoint_topops` (log_page_buffer.c) — per-active-sysop extractor.
- `logpb_dump_checkpoint_trans` (log_page_buffer.c) — debug dumper for `cubrid logdump`.
- `log_dump_record_checkpoint` (log_manager.c) — top-level dispatcher for dumping a checkpoint record.
- `log_dump_checkpoint_topops` (log_manager.c) — debug dumper for the active-sysop array.
- `prior_lsa_alloc_and_copy_data` (log_append.cpp) — shared with all log appenders; both bracket records use it.
- `prior_lsa_next_record_with_lock` (log_append.cpp) — assigns the LSA and links the prior-list node.
Page-buffer cooperation
- `pgbuf_flush_checkpoint` (page_buffer.c) — selects dirty BCBs, sorts by VPID, drives them through DWB.
- `pgbuf_flush_chkpt_seq_list` (page_buffer.c) — performs the actual flush of one batch.
- `pgbuf_Pool.is_checkpoint` (page_buffer.c) — atomic flag the page-buffer flusher reads to coordinate with concurrent victims.
- `logpb_flush_log_for_wal` (log_page_buffer.c) — WAL-ordering enforcement called by `pgbuf_flush_checkpoint`.
- `fileio_synchronize_all` (file_io.c) — fsync all volumes after the dirty-page flush.
- `dwb_flush_force` (double_write_buffer.cpp) — forces pending DWB blocks; called transitively from `fileio_synchronize_all`.
Recovery-side consumption
- `log_recovery` (log_recovery.c) — sets `rcv_lsa = log_Gl.hdr.chkpt_lsa` as the analysis starting point.
- `log_rv_find_checkpoint` (log_recovery.c) — per-volume rcv-LSA scan used during media recovery.
- `log_recovery_analysis` (log_recovery.c) — forward walk from `chkpt_lsa`.
- `log_rv_analysis_record` (log_recovery.c) — switch over `LOG_RECTYPE`; arms for `LOG_START_CHKPT` and `LOG_END_CHKPT`.
- `log_rv_analysis_start_checkpoint` (log_recovery.c) — sets `may_use_checkpoint` if the start record's LSA matches the analysis start.
- `log_rv_analysis_end_checkpoint` (log_recovery.c) — reads `LOG_REC_CHKPT`, the trans array, and the topops array; seeds the trantable; sets `start_redo_lsa`.
- `logtb_rv_find_allocate_tran_index` (log_tran_table.c) — allocates a TDES slot keyed by trid for each row in the trans array.
- `logtb_clear_tdes` (log_tran_table.c) — zeroes the TDES before re-populating from the snapshot.
Header & in-memory pointers
- `log_Gl.hdr.chkpt_lsa` (log_storage.hpp field `LOG_HEADER::chkpt_lsa`) — on-disk recovery anchor.
- `log_Gl.hdr.smallest_lsa_at_last_chkpt` (log_storage.hpp) — archive-removal watermark.
- `log_Gl.chkpt_redo_lsa` (log_impl.h::log_global) — in-memory copy of the last end-CHKPT's redo_lsa, used by `pgbuf_flush_checkpoint` as `prev_chkpt_redo_lsa`.
- `log_Gl.chkpt_lsa_lock` (log_impl.h::log_global) — pthread mutex guarding the two LSAs above.
- `log_Gl.run_nxchkpt_atpageid` / `log_Gl.chkpt_every_npages` (log_impl.h::log_global) — legacy page-count trigger state.
- `logpb_flush_header` (log_page_buffer.c) — writes the active-log header page to disk.
Log record type & payload
- `LOG_START_CHKPT` (log_record.hpp, value 25) — begin marker.
- `LOG_END_CHKPT` (log_record.hpp, value 26) — end record carrying the snapshot.
- `LOG_REC_CHKPT` (log_record.hpp) — `{ redo_lsa, ntrans, ntops }` header.
- `LOG_INFO_CHKPT_TRANS` (log_record.hpp) — per-tran row.
- `LOG_INFO_CHKPT_SYSOP` (log_record.hpp) — per-active-sysop row.
System parameters
- `PRM_ID_LOG_CHECKPOINT_INTERVAL_SECS` (system_parameter.c, default 360 s, deprecated) — timer period.
- `PRM_ID_LOG_CHECKPOINT_INTERVAL` (system_parameter.c, default 360 s, replacement) — same role, different unit handling.
- `PRM_ID_LOG_CHECKPOINT_NPAGES` (system_parameter.c, default 100000, deprecated) — page-count trigger.
- `PRM_ID_LOG_CHECKPOINT_SIZE` (system_parameter.c, default 100000, replacement) — size-based equivalent.
- `PRM_ID_LOG_CHECKPOINT_SLEEP_MSECS` (system_parameter.c, default 1 ms, hidden) — inter-page flush throttle.
- `PRM_ID_LOG_CHKPT_DETAILED` (system_parameter.c) — turns on `detailed_er_log` traces inside `logpb_checkpoint`.
Position hints as of 2026-05-01
Section titled “Position hints as of 2026-05-01”| Symbol | File | Line |
|---|---|---|
| `log_Checkpoint_daemon` | log_manager.c | 359 |
| `log_get_checkpoint_interval` | log_manager.c | 10075 |
| `log_wakeup_checkpoint_daemon` | log_manager.c | 10113 |
| `log_checkpoint_execute` | log_manager.c | 10167 |
| `log_checkpoint_daemon_init` | log_manager.c | 10407 |
| `LOG_ISCHECKPOINT_TIME` macro | log_manager.c | 122 |
| `log_dump_checkpoint_topops` | log_manager.c | 6769 |
| `log_dump_record_checkpoint` | log_manager.c | 6792 |
| `logpb_checkpoint_trans` | log_page_buffer.c | 6783 |
| `logpb_checkpoint_topops` | log_page_buffer.c | 6833 |
| `logpb_checkpoint` | log_page_buffer.c | 6877 |
| `logpb_dump_checkpoint_trans` | log_page_buffer.c | 7395 |
| `log_rv_find_checkpoint` | log_recovery.c | 579 |
| `log_rv_analysis_start_checkpoint` | log_recovery.c | 1797 |
| `log_rv_analysis_end_checkpoint` | log_recovery.c | 1830 |
| `log_rv_analysis_record` (`LOG_*_CHKPT` arms) | log_recovery.c | 2436 |
| `log_recovery` (`chkpt_lsa` anchor) | log_recovery.c | 780 |
| `LOG_REC_CHKPT` struct | log_record.hpp | 345 |
| `LOG_INFO_CHKPT_TRANS` struct | log_record.hpp | 354 |
| `LOG_INFO_CHKPT_SYSOP` struct | log_record.hpp | 372 |
| `LOG_START_CHKPT` enum | log_record.hpp | 96 |
| `LOG_END_CHKPT` enum | log_record.hpp | 97 |
| `LOG_HEADER::chkpt_lsa` | log_storage.hpp | 141 |
| `LOG_HEADER::smallest_lsa_at_last_chkpt` | log_storage.hpp | 163 |
| `log_global::chkpt_lsa_lock` | log_impl.h | 681 |
| `log_global::chkpt_redo_lsa` | log_impl.h | 683 |
| `log_global::chkpt_every_npages` | log_impl.h | 684 |
| `log_global::run_nxchkpt_atpageid` | log_impl.h | 678 |
| `pgbuf_flush_checkpoint` | page_buffer.c | 3960 |
| `PRM_ID_LOG_CHECKPOINT_INTERVAL_SECS` | system_parameter.c | 1368 |
| `PRM_ID_LOG_CHECKPOINT_INTERVAL` | system_parameter.c | 1379 |
| `PRM_ID_LOG_CHECKPOINT_NPAGES` | system_parameter.c | 1346 |
| `PRM_ID_LOG_CHECKPOINT_SIZE` | system_parameter.c | 1357 |
Cross-check Notes
vs cubrid-recovery-manager.md

- Bracket records are `LOG_START_CHKPT` / `LOG_END_CHKPT`. Both docs say so; this is the faithful ARIES two-step.
- `log_Gl.hdr.chkpt_lsa` is the analysis anchor, not the redo anchor. The redo anchor (`start_redo_lsa`) is derived from `LOG_REC_CHKPT.redo_lsa`. The two typically match but can diverge when long-lived dirty pages exist.
- 2PC: the recovery-manager doc lists `LOG_RECOVERY_FINISH_2PC_PHASE` as a conditional phase; the trigger is set inside `log_rv_analysis_end_checkpoint` via `*may_need_synch_checkpoint_2pc = true` when the snapshot contains a `TRAN_2PC_PREPARED` row.
- The "post-restart final checkpoint" call at the end of `log_recovery` is the same path used at clean shutdown.
vs cubrid-log-manager.md
Checkpoint records flow through the same prior-list discipline as every other appender:

- The end-CHKPT record carries trailing payloads via the two-payload overload of `prior_lsa_alloc_and_copy_data`.
- The checkpoint bypasses the group-commit waiter and uses `logpb_flush_pages_direct` because it needs synchronous durability of both bracket records.
vs cubrid-mvcc.md
- `LOG_HEADER.mvcc_op_log_lsa` is updated alongside `chkpt_lsa`, giving vacuum a durable handle.
- `LOG_INFO_CHKPT_TRANS` does NOT carry MVCCID. Recovery rebuilds per-TDES MVCCID from the per-record `mvcc_id` field during analysis, not from the snapshot. This is correct because MVCCID issuance is lazy.
vs cubrid-double-write-buffer.md
- `pgbuf_flush_checkpoint` is the largest single DWB producer.
- `fileio_synchronize_all` calls `dwb_flush_force` transitively; step 8 of `logpb_checkpoint` ensures all DWB-staged pages reach home before the end-CHKPT record is emitted, so the redo-LSA promise is sound.
- DWB and checkpoint protocols protect against in-progress crashes independently: torn-page recovery vs. previous-checkpoint fallback. Neither relies on the other.
Open Questions
- Why are both `PRM_ID_LOG_CHECKPOINT_INTERVAL_SECS` and `PRM_ID_LOG_CHECKPOINT_INTERVAL` defined with default 360? The former is marked `PRM_DEPRECATED`, the latter is the modern replacement. Are there call sites that still read the deprecated one? Investigation path: grep for `PRM_ID_LOG_CHECKPOINT_INTERVAL` without the `_SECS` suffix; check whether `log_get_checkpoint_interval` should switch.
- Is the page-count trigger (`LOG_ISCHECKPOINT_TIME`) actually used? The macro is defined but the daemon-driven timer appears to dominate. A grep for `LOG_ISCHECKPOINT_TIME` would show whether any append path still polls it; if not, the macro is dead code preserved for legacy compatibility.
- What is the upper bound on `LOG_REC_CHKPT` record size? A busy engine with many active transactions could produce a multi-MB end-CHKPT record. Is there a clamp? What happens if the serialised size exceeds one log page? The `LOG_READ_ADVANCE_WHEN_DOESNT_FIT` macro inside `log_rv_analysis_end_checkpoint` suggests the recovery side handles multi-page checkpoint records, but the emit side's allocation is monolithic — is that correct?
- Crash atomicity of the per-volume-header write. Step 15 of `logpb_checkpoint` rewrites every data volume's disk header with the new `chkpt_lsa` for media-recovery purposes. This is not atomic across volumes. A crash mid-loop would leave some volumes with the new LSA and others with the old. Is media recovery robust to this? `log_rv_find_checkpoint` takes the minimum per-volume LSA, so the answer is "yes" — but the property should be confirmed.
- Does the standalone (`SA_MODE`) path emit checkpoints? The daemon registration is `#if defined(SERVER_MODE)`-guarded. Standalone tools (`csql -S`, `loaddb`) presumably take a checkpoint only at exit, not periodically. Investigation path: trace `logpb_checkpoint` callers under `SA_MODE`.
- The `tdes->client.set_system_internal_with_user (chkpt_one->user_name)` call in analysis recovery looks unusual. It sets a system marker with a user name from the snapshot. Why does recovery need the user name? Possibly for HA/replication audit logs. Worth tracing.
- Interaction with the page-server replication path. CUBRID has a "page server" replication mode where the page buffer is on a remote node. Does the checkpoint daemon coordinate with the page server? `log_recovery_redo.hpp` mentions the redo dispatcher is shared between recovery and page-server replication; is the checkpoint emitter also shared?
Sources
CUBRID source (`/data/hgryoo/references/cubrid/`)

- `src/transaction/log_manager.c` — daemon registration, looper, legacy `LOG_ISCHECKPOINT_TIME` macro, dump helpers.
- `src/transaction/log_page_buffer.c` — the body of `logpb_checkpoint` plus its helpers `logpb_checkpoint_trans`, `logpb_checkpoint_topops`, `logpb_dump_checkpoint_trans`, and `logpb_flush_header`.
- `src/transaction/log_record.hpp` — the on-log shape: `LOG_REC_CHKPT`, `LOG_INFO_CHKPT_TRANS`, `LOG_INFO_CHKPT_SYSOP`, the `LOG_START_CHKPT`/`LOG_END_CHKPT` enum values.
- `src/transaction/log_storage.hpp` — `LOG_HEADER::chkpt_lsa` and `smallest_lsa_at_last_chkpt`.
- `src/transaction/log_impl.h` — `log_global`'s checkpoint-related fields (`chkpt_lsa_lock`, `chkpt_redo_lsa`, `chkpt_every_npages`, `run_nxchkpt_atpageid`).
- `src/transaction/log_recovery.c` — the consumer: `log_rv_find_checkpoint`, `log_rv_analysis_start_checkpoint`, `log_rv_analysis_end_checkpoint`, the analysis dispatch arm in `log_rv_analysis_record`, the `rcv_lsa = chkpt_lsa` assignment in `log_recovery`.
- `src/transaction/log_tran_table.c` — `TR_TABLE_CS_ENTER`/`EXIT` primitives the checkpoint walks under, and `logtb_clear_tdes`, `logtb_rv_find_allocate_tran_index` used by recovery.
- `src/storage/page_buffer.c` — `pgbuf_flush_checkpoint`, the dirty-page driver invoked from inside `logpb_checkpoint`.
- `src/base/system_parameter.c` — `PRM_ID_LOG_CHECKPOINT_*` entries and their defaults.
Theoretical references
Section titled “Theoretical references”- Mohan, Haderle, Lindsay, Pirahesh, Schwarz, ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 17.1, 1992 — the fuzzy-checkpoint protocol with explicit begin/end records is the ARIES section §6.
- Bernstein, Hadzilacos, Goodman, Concurrency Control and Recovery in Database Systems, 1987 — the textbook treatment of checkpoints in §6 (“Recovery”); distinguishes consistent vs fuzzy variants.
- Petrov, Database Internals, 2019, ch. 5 §“Recovery” and §“ARIES” — modern textbook framing; introduces redo-LSA hint and the relationship between checkpoint frequency and recovery time.
- Silberschatz, Korth, Sudarshan, Database System Concepts, 7th ed., ch. 19 (“Recovery System”) — the standard undergraduate presentation; checkpoints are framed as a way to bound the redo pass.
Sibling docs in this knowledge base
- `knowledge/code-analysis/cubrid/cubrid-recovery-manager.md` — the three-pass restart protocol that consumes what this checkpoint emits.
- `knowledge/code-analysis/cubrid/cubrid-log-manager.md` — the WAL framework whose prior-list and append discipline the checkpoint uses for both bracket records.
- `knowledge/code-analysis/cubrid/cubrid-mvcc.md` — MVCC interactions through `mvcc_op_log_lsa` and the lazy-MVCCID issuance model.
- `knowledge/code-analysis/cubrid/cubrid-double-write-buffer.md` — the torn-page guard the checkpoint cooperates with during step 7 (`pgbuf_flush_checkpoint`) and step 8 (`fileio_synchronize_all`).
- `knowledge/code-analysis/cubrid/cubrid-page-buffer-manager.md` — the dirty-page tracking that drives the redo-LSA hint.
- `knowledge/code-analysis/cubrid/cubrid-2pc.md` — the in-doubt transactions that the active-TX snapshot keeps alive across restart via `may_need_synch_checkpoint_2pc`.