CUBRID Vacuum — Code-Level Deep Dive
Where this document fits: The high-level analysis `cubrid-vacuum.md` covers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full lifecycle of a single vacuum log block inside the kernel, from append-time tracking to heap and index cleanup.
Chapter 1: Data Structure Map
The high-level companion (`cubrid-vacuum.md`, “Overall structure” and “Vacuum data — the catalogue of pending work”) names the four moving parts: log producers, vacuum data, the master, and the worker pool. This chapter is the field dictionary underneath: which structs represent a vacuum block at each stage of its life, and how they point at each other. Later chapters trace operations over these structures without re-explaining fields.
A block has four representations, in chronological order:
- an accumulator inside `log_header` (`log_Gl.hdr`), updated while transactions append MVCC log records (Ch. 3);
- a `vacuum_data_entry` in flight inside the lock-free `vacuum_Block_data_buffer` (Ch. 3 → Ch. 4);
- a persisted `vacuum_data_entry` inside a `vacuum_data_page`, tracked AVAILABLE → IN_PROGRESS → VACUUMED (Ch. 4–6, 9);
- a bare `VACUUM_LOG_BLOCKID` with status flags inside `vacuum_Finished_job_queue` (Ch. 9).
```mermaid
flowchart LR
    subgraph logside["log side"]
        HDR["log_Gl.hdr accumulator"]
        VI["log_vacuum_info per MVCC record"]
    end
    subgraph queues["lock-free queues"]
        BB["vacuum_Block_data_buffer<br/>1024 x vacuum_data_entry"]
        FQ["vacuum_Finished_job_queue<br/>2048 x VACUUM_LOG_BLOCKID"]
    end
    subgraph vdata["vacuum data file"]
        VD["vacuum_Data global"]
        VDP["vacuum_data_page chain<br/>entry window per page"]
    end
    subgraph exec["execution side"]
        W["vacuum_Workers[]"]
        DF["dropped-files page chain"]
    end
    HDR -- "block boundary" --> BB
    VI -. "back-chain read by worker" .- W
    BB -- "master consumes" --> VDP
    VD --- VDP
    VDP -- "job dispatch" --> W
    W -- "is_file_dropped" --> DF
    W -- "done / interrupted" --> FQ
    FQ -- "master marks finished" --> VDP
```
Figure 1-1 — The four representations of a block. The two queues are the only contact points between transaction threads, the master, and workers.
1.1 The log-header accumulator
While transactions append log records, the current (still filling) block exists only as five fields of `log_header` (`log_storage.hpp`):
| Field | Role | Why it exists |
|---|---|---|
| `does_block_need_vacuum` | True once an MVCC undo/undoredo record has landed in the current block | The “block is dirty” bit; a block with no MVCC ops produces no queue entry |
| `mvcc_op_log_lsa` | LSA of the last MVCC op record so far | Becomes `vacuum_data_entry::start_lsa`, the worker’s chain-walk entry point (Ch. 7) |
| `oldest_visible_mvccid` | Global oldest-visible snapshot taken at the block’s first MVCC op | Becomes the entry’s eligibility key (Ch. 5) |
| `newest_block_mvccid` | Max MVCCID among the block’s ops | The block becomes eligible only once the global oldest-visible watermark has risen above this value; checked by `vacuum_master_task::is_cursor_entry_ready_to_vacuum` |
| `vacuum_last_blockid` | Persisted high-water mark of consumed blocks | Written by SA-mode `vacuum_sa_reflect_last_blockid` before archive purge; recovery takes `MAX` with vacuum data (Ch. 11) |
The first four are maintained by one function on the prior-LSA path, under prior_lsa_mutex:
```cpp
// prior_update_header_mvcc_info -- src/transaction/log_append.cpp
static void
prior_update_header_mvcc_info (const LOG_LSA &record_lsa, MVCCID mvccid)
{
  if (!log_Gl.hdr.does_block_need_vacuum)
    {
      // first mvcc record for this block
      log_Gl.hdr.oldest_visible_mvccid = log_Gl.mvcc_table.get_global_oldest_visible ();
      log_Gl.hdr.newest_block_mvccid = mvccid;
    }
  else
    {
      // ... condensed: sanity asserts ...
      assert (vacuum_get_log_blockid (log_Gl.hdr.mvcc_op_log_lsa.pageid)
              == vacuum_get_log_blockid (record_lsa.pageid));  /* <- both records in same block */
      if (log_Gl.hdr.newest_block_mvccid < mvccid)
        {
          log_Gl.hdr.newest_block_mvccid = mvccid;
        }
    }
  log_Gl.hdr.mvcc_op_log_lsa = record_lsa;
  log_Gl.hdr.does_block_need_vacuum = true;
}
```

Invariant 1-A: single-block accumulator. All four MVCC accumulator fields describe exactly one block: the one containing `mvcc_op_log_lsa`. Enforcement: post-restart (`LOG_ISRESTARTED`), while the accumulator is dirty (`does_block_need_vacuum`), `prior_lsa_next_record_internal` checks on every record append (any type, before the MVCC-type branch) whether the newly reserved LSA falls in a different block than `mvcc_op_log_lsa`; if so, it first flushes the accumulator via `vacuum_produce_log_block_data`, all under `prior_lsa_mutex`. What breaks: one entry would aggregate two blocks’ MVCCIDs, so `oldest_visible_mvccid` could be too new (vacuum removes a still-visible version) or `start_lsa` could point outside the block (worker scans the wrong range).
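The accumulator rule and its boundary flush can be modeled in a few lines. This is a minimal sketch, not CUBRID code: `Accumulator`, `BlockSummary`, `on_append`, and `on_mvcc_record` are hypothetical names, LSAs are reduced to page ids, the block size is assumed to be the default 31 pages, and a `std::vector` stands in for the lock-free buffer. Locking and the restart check are omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; not CUBRID's real types.
using PageId = std::int64_t;
using Mvccid = std::uint64_t;

struct BlockSummary {
  std::int64_t blockid;
  PageId last_mvcc_page;   // stands in for start_lsa
  Mvccid oldest_visible;
  Mvccid newest;
};

struct Accumulator {
  static constexpr PageId kBlockPages = 31;  // assumed default block size
  bool dirty = false;                        // models does_block_need_vacuum
  PageId last_mvcc_page = -1;                // models mvcc_op_log_lsa
  Mvccid oldest_visible = 0;
  Mvccid newest = 0;
  std::vector<BlockSummary> produced;        // stands in for the lock-free buffer

  static std::int64_t block_of(PageId page) { return page / kBlockPages; }

  // Called for every appended record of any type, before MVCC bookkeeping.
  void on_append(PageId page) {
    if (dirty && block_of(page) != block_of(last_mvcc_page)) {
      produced.push_back({block_of(last_mvcc_page), last_mvcc_page,
                          oldest_visible, newest});
      dirty = false;   // asymmetric reset: last_mvcc_page is deliberately kept
      newest = 0;
    }
  }

  // Called only for MVCC undo/undoredo records, after on_append.
  void on_mvcc_record(PageId page, Mvccid id, Mvccid global_oldest_visible) {
    if (!dirty) {
      oldest_visible = global_oldest_visible;  // snapshot at the first MVCC op
      newest = id;
    } else if (newest < id) {
      newest = id;
    }
    last_mvcc_page = page;
    dirty = true;
  }
};
```

In this model, a non-MVCC record that lands in the next block is enough to flush the previous block's summary, which is the property Invariant 1-A depends on.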
Each MVCC undo record additionally embeds a log_vacuum_info (log_record.hpp):
| Field | Role | Why it exists |
|---|---|---|
| `prev_mvcc_op_log_lsa` | LSA of the previous MVCC op record: a backward singly linked list, copied from `mvcc_op_log_lsa` under the same mutex | Worker hops the chain from `start_lsa`, touching only MVCC records (Ch. 7) |
| `vfid` | File the operation touched | Worker checks the dropped-files ledger and groups heap objects per file without fetching pages |
Note the asymmetric reset: the boundary flush in vacuum_produce_log_block_data clears does_block_need_vacuum and newest_block_mvccid but not mvcc_op_log_lsa, so the back-chain runs unbroken across block boundaries.
1.2 vacuum_data_entry — the bit-packed block record
```cpp
// vacuum_data_entry -- src/query/vacuum.c
struct vacuum_data_entry
{
  VACUUM_LOG_BLOCKID blockid;    // blockid and flags
  LOG_LSA start_lsa;             // lsa of last mvcc op log record in block
  MVCCID oldest_visible_mvccid;  // oldest visible MVCCID while block was logged
  MVCCID newest_mvccid;          // newest MVCCID in log block
  // ... condensed: constructors, mask-test accessors, setters ...
};
```

| Field | Role | Why it exists |
|---|---|---|
| `blockid` | 61-bit id plus 3 flag bits (2-bit status + interrupted) in the top bits | Status must survive crashes with the entry; stealing high bits avoids widening the record (`VACUUM_LOG_BLOCKID` is `std::int64_t`) |
| `start_lsa` | Copied from `log_Gl.hdr.mvcc_op_log_lsa` | Entry point of the worker’s backward chain walk |
| `oldest_visible_mvccid` | Copied from the accumulator | Eligibility key; feeds `vacuum_Data.oldest_unvacuumed_mvccid` (Ch. 5) |
| `newest_mvccid` | Copied from the accumulator | The block runs only once the global oldest-visible watermark has risen above this value |
The bit layout of blockid:
```cpp
// VACUUM_DATA_ENTRY_FLAG_MASK -- src/query/vacuum.c
#define VACUUM_DATA_ENTRY_FLAG_MASK            0xE000000000000000  /* <- top 3 bits */
#define VACUUM_DATA_ENTRY_BLOCKID_MASK         0x1FFFFFFFFFFFFFFF  /* <- low 61 bits */
#define VACUUM_BLOCK_STATUS_MASK               0xC000000000000000  /* <- top 2 bits = status */
#define VACUUM_BLOCK_STATUS_VACUUMED           0x8000000000000000
#define VACUUM_BLOCK_STATUS_IN_PROGRESS_VACUUM 0x4000000000000000
#define VACUUM_BLOCK_STATUS_AVAILABLE          0x0000000000000000
#define VACUUM_BLOCK_FLAG_INTERRUPTED          0x2000000000000000  /* <- bit 61 */
#define VACUUM_BLOCKID_WITHOUT_FLAGS(blockid) \
  ((blockid) & VACUUM_DATA_ENTRY_BLOCKID_MASK)
```

Bits 63–62: status (`10` VACUUMED, `01` IN_PROGRESS, `00` AVAILABLE, `11` unused); bit 61: the orthogonal INTERRUPTED flag; bits 60–0: the id. Because AVAILABLE is all-zero, a freshly computed blockid is born AVAILABLE for free. The accessors (`is_available` etc.) are one-line mask tests; the two compound setters are the subtle part:

```cpp
// vacuum_data_entry::set_vacuumed -- src/query/vacuum.c
void vacuum_data_entry::set_vacuumed ()
{
  VACUUM_BLOCK_STATUS_SET_VACUUMED (blockid);
  VACUUM_BLOCK_CLEAR_INTERRUPTED (blockid);     /* <- success wipes the interrupted history */
}

void vacuum_data_entry::set_interrupted ()
{
  VACUUM_BLOCK_STATUS_SET_AVAILABLE (blockid);  /* <- back to AVAILABLE for re-dispatch */
  VACUUM_BLOCK_SET_INTERRUPTED (blockid);       /* <- but remember the scar */
}
```

`set_interrupted` does not add a fourth status; it returns to AVAILABLE and raises the flag, which is why `11` never appears.
Role matrix for the status bits — the same two bits mean different things to different actors:
| Status | To the master | To a worker | To recovery |
|---|---|---|---|
| AVAILABLE | candidate for the job cursor (Ch. 6) | never sees it | redo of job start re-marks IN_PROGRESS |
| IN_PROGRESS | skip; a worker owns it | “this is my job” | job died with the crash → interrupted (Ch. 11) |
| VACUUMED | remove entry, advance index_unvacuumed (Ch. 9) | terminal; set by set_vacuumed | entry may be dropped from the page |
| + INTERRUPTED | re-dispatch, flag the prior death | cautious mode: half-cleaned pages tolerated (Ch. 8) | preserved across restart |
```mermaid
stateDiagram-v2
    [*] --> AVAILABLE : appended by master\nvacuum_consume_buffer_log_blocks
    AVAILABLE --> IN_PROGRESS : set_job_in_progress\nmaster dispatches job
    IN_PROGRESS --> VACUUMED : set_vacuumed\nworker success, clears INTERRUPTED
    IN_PROGRESS --> AVAILABLE_INTERRUPTED : set_interrupted\nshutdown or error
    AVAILABLE_INTERRUPTED --> IN_PROGRESS : set_job_in_progress\nredispatch keeps INTERRUPTED flag
    VACUUMED --> [*] : entry removed\nvacuum_data_mark_finished
```
Figure 1-2 — Status lifecycle in the top bits of blockid. AVAILABLE_INTERRUPTED is AVAILABLE plus the bit-61 flag, not a distinct status.
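The lifecycle in Figure 1-2 can be replayed on a plain integer. The sketch below is illustrative only: the mask values are the ones quoted above, but the helper names (`with_status`, `set_in_progress`, `was_interrupted`, and so on) are hypothetical, and an unsigned 64-bit type is used for simplicity where the source uses a signed one.

```cpp
#include <cassert>
#include <cstdint>

using BlockId = std::uint64_t;  // assumption: unsigned for simplicity

constexpr BlockId kIdMask      = 0x1FFFFFFFFFFFFFFFULL;  // low 61 bits
constexpr BlockId kStatusMask  = 0xC000000000000000ULL;  // bits 63-62
constexpr BlockId kVacuumed    = 0x8000000000000000ULL;
constexpr BlockId kInProgress  = 0x4000000000000000ULL;
constexpr BlockId kAvailable   = 0x0000000000000000ULL;
constexpr BlockId kInterrupted = 0x2000000000000000ULL;  // bit 61

// Replace the two status bits, leaving the id and the interrupted flag alone.
inline BlockId with_status(BlockId b, BlockId status) {
  return (b & ~kStatusMask) | status;
}

inline BlockId id_of(BlockId b) { return b & kIdMask; }
inline bool is_available(BlockId b) { return (b & kStatusMask) == kAvailable; }
inline bool is_in_progress(BlockId b) { return (b & kStatusMask) == kInProgress; }
inline bool is_vacuumed(BlockId b) { return (b & kStatusMask) == kVacuumed; }
inline bool was_interrupted(BlockId b) { return (b & kInterrupted) != 0; }

// Dispatch keeps the interrupted flag so the worker can see the prior death.
inline BlockId set_in_progress(BlockId b) { return with_status(b, kInProgress); }

// Success: status VACUUMED, interrupted history wiped.
inline BlockId set_vacuumed(BlockId b) {
  return with_status(b, kVacuumed) & ~kInterrupted;
}

// Failure: status back to AVAILABLE, interrupted flag raised.
inline BlockId set_interrupted(BlockId b) {
  return with_status(b, kAvailable) | kInterrupted;
}
```

Starting from a raw id such as 42, the sequence `set_in_progress`, `set_interrupted`, `set_in_progress`, `set_vacuumed` ends in VACUUMED with the flag cleared and the id intact, and at no point are both status bits set.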
The id is pure arithmetic over log page ids:
```cpp
// vacuum_get_log_blockid -- src/query/vacuum.c
VACUUM_LOG_BLOCKID
vacuum_get_log_blockid (LOG_PAGEID pageid)
{
  if (prm_get_bool_value (PRM_ID_DISABLE_VACUUM) || pageid == NULL_PAGEID)
    {
      return VACUUM_NULL_LOG_BLOCKID;  /* <- -1; the only escape hatch */
    }
  // ... condensed ...
  return pageid / vacuum_Data.log_block_npages;
}
```

`log_block_npages` defaults to `VACUUM_LOG_BLOCK_PAGES_DEFAULT` (31, `vacuum.h`); the inverse `VACUUM_FIRST_LOG_PAGEID_IN_BLOCK(blockid)` is `blockid * log_block_npages`. The `const log_header &` constructor is representation 1 → 2: it delegates to the three-argument constructor, which asserts `oldest <= newest` and computes `blockid = vacuum_get_log_blockid (start_lsa.pageid)`.

Invariant 1-B: blockid/start_lsa coherence. `VACUUM_BLOCKID_WITHOUT_FLAGS (blockid) == vacuum_get_log_blockid (start_lsa.pageid)` for every entry. Enforcement: by construction (the three-argument constructor). What breaks: the prefetch window and the chain walk would target different log ranges.
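The two geometry formulas are small enough to check by hand. The following sketch restates them with hypothetical function names (`block_of_page`, `first_page_of_block`); it assumes the null page id is -1 and leaves out the disable-vacuum parameter check that the real function performs.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int64_t kNullPageId = -1;   // assumption: null page id is -1
constexpr std::int64_t kNullBlockId = -1;  // the "escape hatch" value

// blockid = pageid / npages, with the null page mapping to the null block.
constexpr std::int64_t block_of_page(std::int64_t pageid, std::int64_t npages) {
  return pageid == kNullPageId ? kNullBlockId : pageid / npages;
}

// Inverse direction: the first log page that belongs to a block.
constexpr std::int64_t first_page_of_block(std::int64_t blockid,
                                           std::int64_t npages) {
  return blockid * npages;
}
```

With the default of 31 pages per block, pages 0 through 30 map to block 0, page 31 opens block 1, and block 3 starts at page 93.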
1.3 vacuum_data_page — the persisted array with a sliding window
```cpp
// vacuum_data_page -- src/query/vacuum.c
struct vacuum_data_page
{
  VPID next_page;
  INT16 index_unvacuumed;
  INT16 index_free;
  VACUUM_DATA_ENTRY data[1];  /* <- flexible array; capacity computed at runtime */
  static const INT16 INDEX_NOT_FOUND = -1;
  // ... condensed: is_empty, is_index_valid, get_index_of_blockid, get_first_blockid ...
};
```

| Field | Role | Why it exists |
|---|---|---|
| `next_page` | VPID link to the next page; NULL VPID on the last | Vacuum data is a queue of pages: consume at head, append at tail |
| `index_unvacuumed` | First entry not yet VACUUMED | Finished entries are skipped by sliding this forward, not compacted (Ch. 9) |
| `index_free` | First unused slot; append position | `[index_unvacuumed, index_free)` is the live window |
| `data[1]` | Entry array, `page_data_max_count` slots | `VACUUM_DATA_PAGE_HEADER_SIZE = offsetof (VACUUM_DATA_PAGE, data)`; capacity = `(DB_PAGESIZE - header) / sizeof (entry)` |
| `INDEX_NOT_FOUND` | Sentinel (-1) from `get_index_of_blockid` | Distinguishes “not on this page” from a valid slot |
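The capacity formula in the table can be tried out with stand-in structs. Everything here is an assumption for illustration: `Vpid`, `Entry`, and `Page` only imitate the shape of the real types (the entry is modeled as four 8-byte fields), and the 16384-byte page size is just one example value, so the resulting number should not be read as CUBRID's actual `page_data_max_count`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layouts that only mimic the shape of the real structs.
struct Vpid { std::int32_t pageid; std::int16_t volid; };

struct Entry {
  std::int64_t blockid;
  std::int64_t start_lsa;        // assumption: LSA modeled as one 8-byte value
  std::uint64_t oldest_visible;
  std::uint64_t newest;
};

struct Page {
  Vpid next_page;
  std::int16_t index_unvacuumed;
  std::int16_t index_free;
  Entry data[1];                 // same "array of one" idiom as the source
};

// Header size is whatever precedes the entry array, padding included.
constexpr std::size_t kHeaderSize = offsetof(Page, data);

// capacity = (page size - header) / entry size, rounded down
constexpr std::size_t page_capacity(std::size_t page_size) {
  return (page_size - kHeaderSize) / sizeof(Entry);
}
```

Using `offsetof` rather than summing field sizes matters because alignment padding before `data` belongs to the header too.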
Invariant 1-C: dense, consecutive window. `0 <= index_unvacuumed <= index_free <= page_data_max_count`, and the window holds strictly consecutive blockids. Enforcement: the master appends in blockid order at `index_free` (`vacuum_consume_buffer_log_blocks`, Ch. 4), synthesizing a placeholder entry born VACUUMED (`page_free_data->set_vacuumed ()`) for every gap block that had no MVCC ops; it never deletes from the middle (a finished entry keeps its slot, only its status changes) and slides `index_unvacuumed` only over VACUUMED prefixes. `vacuum_data_mark_finished` asserts `page_free_blockid == data_page->data[data_page->index_free - 1].get_blockid () + 1`. What breaks: the O(1) lookup below returns the wrong slot, and the master would mark the wrong block finished.
```cpp
// vacuum_data_page::get_index_of_blockid -- src/query/vacuum.c
INT16
vacuum_data_page::get_index_of_blockid (VACUUM_LOG_BLOCKID blockid) const
{
  if (is_empty ())
    {
      return INDEX_NOT_FOUND;
    }
  VACUUM_LOG_BLOCKID first_blockid = data[index_unvacuumed].get_blockid ();
  // ... condensed: return INDEX_NOT_FOUND if blockid before or after the window ...
  INT16 index_of_blockid = (INT16) (blockid - first_blockid) + index_unvacuumed;
  assert (data[index_of_blockid].get_blockid () == blockid);  /* <- relies on Invariant 1-C */
  return index_of_blockid;  /* <- O(1), no loop */
}
```

`is_empty ()` is `index_unvacuumed == index_free`; both indexes are reset to 0 by `vacuum_data_initialize_new_page`.
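To see why gap placeholders make the O(1) lookup valid, here is a toy page that applies the same three rules: append in order with gap filling, never delete from the middle, slide the head only over finished prefixes. `WindowPage` and its members are hypothetical, a `std::vector` replaces the fixed-size slot array, and a boolean replaces the status bits.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Entry {
  std::int64_t blockid;
  bool vacuumed;
};

struct WindowPage {
  static constexpr int kNotFound = -1;
  std::vector<Entry> data;   // slots [0, index_free()) are in use
  int index_unvacuumed = 0;  // first slot not yet finished

  int index_free() const { return static_cast<int>(data.size()); }
  bool is_empty() const { return index_unvacuumed == index_free(); }

  // Append in blockid order; skipped ids get placeholders born vacuumed.
  void append(std::int64_t blockid) {
    if (!data.empty()) {
      for (std::int64_t gap = data.back().blockid + 1; gap < blockid; ++gap) {
        data.push_back({gap, true});
      }
    }
    data.push_back({blockid, false});
  }

  // O(1): consecutive ids mean the slot is an offset from the window head.
  int index_of(std::int64_t blockid) const {
    if (is_empty()) return kNotFound;
    const std::int64_t first = data[index_unvacuumed].blockid;
    const std::int64_t last = data.back().blockid;
    if (blockid < first || blockid > last) return kNotFound;
    return static_cast<int>(blockid - first) + index_unvacuumed;
  }

  // Mark one entry finished, then slide the head over the finished prefix.
  void mark_finished(std::int64_t blockid) {
    const int i = index_of(blockid);
    if (i == kNotFound) return;
    data[i].vacuumed = true;
    while (index_unvacuumed < index_free() && data[index_unvacuumed].vacuumed) {
      ++index_unvacuumed;
    }
  }
};
```

Appending blocks 10, 11, and 14 yields five slots, because 12 and 13 are filled in as finished placeholders. Finishing 11 first does not move the head; finishing 10 afterwards slides it straight to the slot of block 14.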
Invariant 1-D: an empty page still carries `last_blockid`. `vacuum_init_data_page_with_last_blockid` writes a blockid into slot 0 (`data->blockid`) of a freshly initialized page even though the window is empty; when vacuum data is empty, recovery reads `vacuum_Data.last_page->data->blockid` and takes `MAX` with `log_Gl.hdr.vacuum_last_blockid` (Ch. 11). Slot 0 plays two roles: a live entry when the window covers it, the persisted high-water mark when the page is empty. What breaks: zeroing slot 0 on emptying would let a restart re-consume already-vacuumed blocks.
1.4 The vacuum_data global and the job cursor
One static instance, `vacuum_Data`, glues everything together:

```cpp
// vacuum_data -- src/query/vacuum.c
struct vacuum_data
{
  VFID vacuum_data_file;
  LOG_PAGEID keep_from_log_pageid;    /* Smallest LOG_PAGEID that vacuum may still need for its jobs. */
  MVCCID oldest_unvacuumed_mvccid;    /* Global oldest MVCCID not vacuumed (yet). */
  VACUUM_DATA_PAGE *first_page;       /* Cached first vacuum data page. */
  VACUUM_DATA_PAGE *last_page;        /* Cached last vacuum data page. */
  // ... condensed ...
private:
  VACUUM_LOG_BLOCKID m_last_blockid;  /* ... the id of last added block
                                       * which may not even be in vacuum data (being already vacuumed). */
};
```

| Field | Role | Why it exists |
|---|---|---|
| `vacuum_data_file` | VFID of the disk file holding the page chain | Finds the head VPID via the file descriptor at boot |
| `keep_from_log_pageid` | First log page of the first unvacuumed block | Archive purger’s fence, via `vacuum_min_log_pageid_to_keep` (Ch. 9) |
| `oldest_unvacuumed_mvccid` | “Everything below is fully cleaned” watermark | Sanity checks, `vacuum_is_mvccid_vacuumed`; `upgrade_oldest_unvacuumed` asserts ascent (Ch. 5) |
| `first_page` / `last_page` | Permanently fixed head/tail pages | Master reads head constantly, producers append at tail; latching once avoids per-op pgbuf fixes |
| `page_data_max_count` | Entries per page | Computed once in `vacuum_initialize` from `DB_PAGESIZE` |
| `log_block_npages` | Block granularity in log pages | Divisor of `vacuum_get_log_blockid`; fixed at db creation, default 31 |
| `is_loaded` | Pages fixed and ready | Guards `vacuum_data_load_first_and_last_page` re-entry |
| `shutdown_sequence` | `vacuum_shutdown_sequence` object | Orderly stop; workers turn it into `set_interrupted` paths |
| `is_archive_removal_safe` | False until `keep_from_log_pageid` is first computed | `vacuum_is_safe_to_remove_archives`: no purge before vacuum knows its needs |
| `recovery_lsa` | LSA where recovery started | Backward-scan anchor for `vacuum_recover_lost_block_data` (Ch. 11) |
| `is_restoredb_session` | Booted by restoredb | Recovery-path behavior switch |
| `is_vacuum_complete` (SA_MODE) | Standalone “all caught up” flag | SA-mode runs vacuum to completion inside `xvacuum` (Ch. 11) |
| `m_last_blockid` (private) | Id of the last block ever consumed | Not necessarily present as an entry; see role matrix |
Role matrix for m_last_blockid — meaning depends on whether vacuum data is empty:
| State | get_last_blockid () means | get_first_blockid () returns |
|---|---|---|
| non-empty | blockid of the last appended entry | first_page->get_first_blockid () — first window entry of the head page |
| empty | last block that was consumed, possibly long since removed | m_last_blockid too — both accessors collapse to the same value |
set_last_blockid strips flags (VACUUM_BLOCKID_WITHOUT_FLAGS) and, in debug builds, asserts the value is strictly below the block of log_Gl.prior_info.prior_lsa — the last block must never overtake the log.
Invariant 1-E: head and tail pages stay fixed. `first_page` and `last_page` are latched once at load and never unfixed between operations. Enforcement: the pgbuf wrappers. `vacuum_fix_data_page` returns the cached pointer when the VPID matches either end, `vacuum_unfix_data_page` silently skips them, and only `vacuum_unfix_first_and_last_data_page` (shutdown) really releases them. What breaks: unfixing `first_page` directly via `pgbuf_unfix` leaves the next `vacuum_fix_data_page` returning a stale pointer to a page the buffer manager may have victimized.
Two small companions live beside vacuum_Data. The vacuum_data_load struct (global vacuum_Data_load, fields vpid_first / vpid_last) records the chain ends so vacuum_data_load_first_and_last_page can fix both without walking the chain — with the special case vpid_first == vpid_last → last_page = first_page (one page must not be fixed twice). And vacuum_job_cursor is the master’s persistent iteration state over the window:
```cpp
// vacuum_job_cursor -- src/query/vacuum.c
class vacuum_job_cursor
{
  // ... condensed: increment_blockid, readjust_to_vacuum_data_changes, load/unload ...
private:
  VACUUM_LOG_BLOCKID m_blockid;  // current cursor blockid
  VACUUM_DATA_PAGE *m_page;      // loaded page of blockid or null
  INT16 m_index;                 // loaded index of blockid or INDEX_NOT_FOUND
};
```

| Field | Role | Why it exists |
|---|---|---|
| `m_blockid` | The cursor’s canonical position | Blockids are stable; after blocks are removed/appended an entry’s location moves but its id does not |
| `m_page` / `m_index` | Cached physical location of `m_blockid` | Avoids re-searching per step; recomputed via `get_index_of_blockid` by `readjust_to_vacuum_data_changes` after vacuum data shifts |
The cursor’s traversal logic is Ch. 6’s subject; only its layout is fixed here.
1.5 The two lock-free queues
```cpp
// vacuum_Block_data_buffer -- src/query/vacuum.c
lockfree::circular_queue<vacuum_data_entry> *vacuum_Block_data_buffer = NULL;
#define VACUUM_BLOCK_DATA_BUFFER_CAPACITY 1024
lockfree::circular_queue<VACUUM_LOG_BLOCKID> *vacuum_Finished_job_queue = NULL;
#define VACUUM_FINISHED_JOB_QUEUE_CAPACITY 2048
```

| Queue | Element | Producer → Consumer | Why a queue at all |
|---|---|---|---|
| `vacuum_Block_data_buffer` | full `vacuum_data_entry` (1024 cap) | any transaction thread on the prior-LSA path → master (`vacuum_consume_buffer_log_blocks`, Ch. 4) | transactions must never latch vacuum data pages; the source comment: “It is advisable to avoid synchronizing running transactions with vacuum threads” |
| `vacuum_Finished_job_queue` | bare `VACUUM_LOG_BLOCKID` with status flags attached (2048 cap) | workers (`vacuum_finished_block_vacuum`, Ch. 9) → master (`vacuum_data_mark_finished`) | workers must not write vacuum data pages either; the master is the only writer of representation 3 |
The producer side of the first queue:
```cpp
// vacuum_produce_log_block_data -- src/query/vacuum.c
void
vacuum_produce_log_block_data (THREAD_ENTRY * thread_p)
{
  // ... condensed: PRM_ID_DISABLE_VACUUM early return ...
  VACUUM_DATA_ENTRY block_data { log_Gl.hdr };   /* <- representation 1 -> 2 */
  log_Gl.hdr.does_block_need_vacuum = false;     /* <- reset accumulator for next block */
  log_Gl.hdr.newest_block_mvccid = MVCCID_NULL;
  // ... condensed: NULL-buffer guard, vacuum_er_log ...
  if (!vacuum_Block_data_buffer->produce (block_data))
    {
      /* Push failed, the buffer must be full */
      vacuum_er_log_error (VACUUM_ER_LOG_ERROR, "%s",
                           "Cannot produce new log block data! The buffer is already full.");
      assert (false);
      return;  /* <- block metadata LOST in release builds */
    }
}
```

The full-queue branch is the known soft spot (the TODO above it admits it): in release builds, where the assert compiles out, a full buffer drops the block’s metadata with nothing more than the error-log call. Ch. 11 covers the safety net: `vacuum_recover_lost_block_data` rebuilds entries by scanning the `mvcc_op_log_lsa` back-chain.
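The failure mode is easiest to see on a bounded queue whose `produce` reports fullness through its return value. The class below is a single-threaded stand-in written for this note; it is not `lockfree::circular_queue` and makes no claim about that class beyond the `produce`-returns-false behavior shown in the excerpt above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Single-threaded bounded ring; illustrative only, not lock-free.
template <typename T, std::size_t N>
class BoundedRing {
 public:
  // Returns false, and stores nothing, when the ring is full.
  bool produce(const T& value) {
    if (count_ == N) return false;
    slots_[(head_ + count_) % N] = value;
    ++count_;
    return true;
  }

  // Returns false when the ring is empty; otherwise pops the oldest element.
  bool consume(T& out) {
    if (count_ == 0) return false;
    out = slots_[head_];
    head_ = (head_ + 1) % N;
    --count_;
    return true;
  }

  std::size_t size() const { return count_; }

 private:
  std::array<T, N> slots_{};
  std::size_t head_ = 0;
  std::size_t count_ = 0;
};
```

A caller that ignores the `false` result loses the element for good, which is the situation the release-build `return` above ends up in. Any fix has to either retry or, as Ch. 11 describes, reconstruct the lost metadata from another source.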
On the finished side, the element format is the trick: the worker pushes data->blockid after calling set_vacuumed () or set_interrupted () on its private copy, so the flags ride along in the top bits. The master strips them with VACUUM_BLOCKID_WITHOUT_FLAGS to locate the entry (via get_index_of_blockid) and reads the flags to decide VACUUMED-remove versus AVAILABLE+INTERRUPTED-requeue.
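The master-side decoding step might look like the sketch below. The mask values come from section 1.2; `decode_finished`, `FinishedJob`, and `FinishedAction` are hypothetical names, and the `Unexpected` branch is an assumption about how an out-of-protocol value could be flagged rather than something the source describes.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kIdMask      = 0x1FFFFFFFFFFFFFFFULL;
constexpr std::uint64_t kStatusMask  = 0xC000000000000000ULL;
constexpr std::uint64_t kVacuumed    = 0x8000000000000000ULL;
constexpr std::uint64_t kAvailable   = 0x0000000000000000ULL;
constexpr std::uint64_t kInterrupted = 0x2000000000000000ULL;

enum class FinishedAction { RemoveEntry, RequeueInterrupted, Unexpected };

struct FinishedJob {
  std::uint64_t blockid;  // id with all flag bits stripped
  FinishedAction action;
};

// Split a finished-queue element into "which entry" and "what to do with it".
inline FinishedJob decode_finished(std::uint64_t raw) {
  const std::uint64_t status = raw & kStatusMask;
  FinishedAction action = FinishedAction::Unexpected;
  if (status == kVacuumed) {
    action = FinishedAction::RemoveEntry;
  } else if (status == kAvailable && (raw & kInterrupted) != 0) {
    action = FinishedAction::RequeueInterrupted;
  }
  return FinishedJob{raw & kIdMask, action};
}
```

The stripped id is what a lookup such as `get_index_of_blockid` needs; the action is derived purely from the top three bits, so the queue element needs no extra fields.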
1.6 The worker side
```cpp
// vacuum_worker -- src/query/vacuum.h
struct vacuum_worker
{
  VACUUM_WORKER_STATE state;
  INT32 drop_files_version;
  struct log_zip *log_zip_p;
  VACUUM_HEAP_OBJECT *heap_objects;
  int heap_objects_capacity;
  int n_heap_objects;
  char *undo_data_buffer;
  int undo_data_buffer_capacity;
  int private_lru_index;  // page buffer private lru list
  char *prefetch_log_buffer;
  LOG_PAGEID prefetch_first_pageid;
  LOG_PAGEID prefetch_last_pageid;
  bool allocated_resources;
  int idx;  // -1 for vacuum_master; Otherwise, the sequence number of vacuum_worker
};
```

| Field | Role | Why it exists |
|---|---|---|
| `state` | `VACUUM_WORKER_STATE_INACTIVE` / `PROCESS_LOG` / `EXECUTE` | Log-reading code behaves differently for vacuum threads (`vacuum_is_process_log_for_vacuum` gates `LOG_CS` handling) |
| `drop_files_version` | Last `vacuum_Dropped_files_version` observed | The master’s min across workers decides when old ledger entries can be cleaned (Ch. 10) |
| `log_zip_p` | Persistent unzip scratch | Log undo data may be compressed; per-record reallocation would thrash |
| `heap_objects` / `heap_objects_capacity` / `n_heap_objects` | Growable target array from the log pass (initial `VACUUM_DEFAULT_HEAP_OBJECT_BUFFER_SIZE` = 4000) | Log pass only collects; execution sorts by VFID, batches per heap page (Ch. 7→8) |
| `undo_data_buffer` / `undo_data_buffer_capacity` | Copy buffer for undo data spanning log pages (initial `IO_PAGESIZE`) | B-tree vacuum needs the data contiguous |
| `private_lru_index` | Private LRU list id in the page buffer | Vacuum touches huge page counts once; quarantine protects the shared LRU |
| `prefetch_log_buffer` | `VACUUM_PREFETCH_LOG_BLOCK_BUFFER_PAGES` = `1 + log_block_npages` (32) pages | One bulk read beats 31 single fetches. The +1 is one extra page beyond the block’s `start_lsa` page (logically the block’s last page), since the last record may spill into the next page; the comment in `vacuum_log_prefetch_vacuum_block` warns that it handles at most that one extra page |
| `prefetch_first_pageid` / `prefetch_last_pageid` | Buffer range: first = `VACUUM_FIRST_LOG_PAGEID_IN_BLOCK (entry->get_blockid ())`, last = first + `VACUUM_PREFETCH_LOG_BLOCK_BUFFER_PAGES` - 1 (first + 31 at default) | `vacuum_fetch_log_page` serves from the buffer iff the pageid is inside the range (Ch. 7) |
| `allocated_resources` | Lazy-allocation flag | Buffers malloc’d on first job (`vacuum_worker_allocate_resources`), not at boot |
| `idx` | -1 for `vacuum_Master`, else slot in `vacuum_Workers[]` (max `VACUUM_MAX_WORKER_COUNT` = 50) | Identifies the thread in logs and the min-version scan |
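The prefetch range arithmetic from the table, written out as a sketch. `PrefetchRange`, `prefetch_range`, and `served_from_buffer` are hypothetical names; the formulas are the ones given for `prefetch_first_pageid` and `prefetch_last_pageid`, with the buffer length taken as `1 + npages`.

```cpp
#include <cassert>
#include <cstdint>

struct PrefetchRange {
  std::int64_t first;  // first log page of the block
  std::int64_t last;   // inclusive; one page past the block's own pages
};

// first = blockid * npages; last = first + (1 + npages) - 1
constexpr PrefetchRange prefetch_range(std::int64_t blockid,
                                       std::int64_t npages) {
  return PrefetchRange{blockid * npages, blockid * npages + (1 + npages) - 1};
}

// A page is served from the buffer iff it lies inside the inclusive range.
constexpr bool served_from_buffer(const PrefetchRange& r, std::int64_t pageid) {
  return r.first <= pageid && pageid <= r.last;
}
```

For block 2 at 31 pages per block, the range is pages 62 through 93: the block's own 31 pages (62 to 92) plus the single spill page 93. Page 94 would fall back to an ordinary log page fetch.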
VACUUM_HEAP_OBJECT is the unit the collection buffer holds — deliberately minimal:
| Field | Role | Why it exists |
|---|---|---|
| `vfid` | Heap file of the object | Primary sort key: the dropped-ledger is checked once per file |
| `oid` | Object id; its pageid is the secondary grouping key | Batches per heap page → one fix, one log record per page (Ch. 8) |
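Sorting by file and then by page is what turns a flat list of collected objects into per-page batches. The sketch below uses simplified stand-ins (`Vfid`, `Oid`, and `HeapObject` are modeled as plain integer fields) and a hypothetical helper, `sort_and_count_page_batches`, that sorts and then counts how many distinct (file, page) groups result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Simplified stand-ins for the real identifiers.
struct Vfid { int volid; int fileid; };
struct Oid { int volid; int pageid; int slotid; };
struct HeapObject { Vfid vfid; Oid oid; };

// Sort key: file first, then page, then slot.
inline std::tuple<int, int, int, int, int> sort_key(const HeapObject& o) {
  return std::make_tuple(o.vfid.volid, o.vfid.fileid,
                         o.oid.volid, o.oid.pageid, o.oid.slotid);
}

inline bool same_page(const HeapObject& a, const HeapObject& b) {
  return a.vfid.volid == b.vfid.volid && a.vfid.fileid == b.vfid.fileid &&
         a.oid.volid == b.oid.volid && a.oid.pageid == b.oid.pageid;
}

// Sorts in place and returns the number of per-page batches.
inline std::size_t sort_and_count_page_batches(std::vector<HeapObject>& objs) {
  std::sort(objs.begin(), objs.end(),
            [](const HeapObject& a, const HeapObject& b) {
              return sort_key(a) < sort_key(b);
            });
  std::size_t batches = 0;
  for (std::size_t i = 0; i < objs.size(); ++i) {
    if (i == 0 || !same_page(objs[i - 1], objs[i])) ++batches;
  }
  return batches;
}
```

Four objects spread over two files and three pages collapse into three batches, so only three page fixes would be needed instead of four.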
The comment above vacuum_Master’s definition explains why the master is also a VACUUM_WORKER: it needed system operations and a transaction descriptor for page allocation, so it reuses the struct. vacuum_heap_helper, the worker’s per-page scratch (home/forward pages, slot and result arrays, MVCC header), is dissected in Ch. 8.
1.7 The dropped-files ledger pages
| Struct / Field | Role | Why it exists |
|---|---|---|
| `vacuum_dropped_file.vfid` | The dropped file | Workers match `log_vacuum_info.vfid` against it |
| `vacuum_dropped_file.mvccid` | MVCCID recorded at drop time | Only records with `mvccid <=` this are skipped; the file id may be reused by a newer file whose records must still be vacuumed (Ch. 10) |
| `vacuum_dropped_files_page.next_page` | VPID chain link | Ledger can outgrow one page |
| `vacuum_dropped_files_page.n_dropped_files` | Live entry count | Entries kept VFID-sorted per page (`vacuum_add_dropped_file` binary-searches via `util_bsearch`, then `memmove`-inserts in position); capacity `VACUUM_DROPPED_FILES_PAGE_CAPACITY` |
| `vacuum_dropped_files_page.dropped_files[1]` | Flexible entry array | Same flexible-array idiom as `vacuum_data_page` |
Supporting globals: vacuum_Dropped_files_vfid / vacuum_Dropped_files_vpid (file and head page), vacuum_Dropped_files_loaded, vacuum_Dropped_files_count, vacuum_Last_dropped_vfid, and vacuum_Dropped_files_version — the INT32 generation counter paired with each worker’s drop_files_version. vacuum_Dropped_files_mutex is held only by vacuum_notify_all_workers_dropped_file: it serializes the notify-workers step of each drop (guarding vacuum_Last_dropped_vfid and the ++vacuum_Dropped_files_version bump, one file at a time) — page edits rely on page latches, and reading workers never take it. Debug builds mirror the page chain in memory via vacuum_Track_dropped_files. (VACUUM_DROPPED_FILE_FLAG_DUPLICATE is defined but unused in the current source — a re-dropped vfid is instead handled in place: the binary search finds the existing entry and only its mvccid is updated.) Behavior is Ch. 10’s subject; the layout is fixed here.
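A single ledger page behaves like a sorted array with update-in-place on re-drop. The sketch below models that with `std::lower_bound` over a `std::vector`. `DroppedLedger` and its methods are hypothetical, the page chain and latching are omitted, and MVCCIDs are compared as plain integers, which is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vfid { int volid; int fileid; };

inline bool vfid_less(const Vfid& a, const Vfid& b) {
  return a.volid != b.volid ? a.volid < b.volid : a.fileid < b.fileid;
}

struct DroppedFile { Vfid vfid; std::uint64_t mvccid; };

class DroppedLedger {
 public:
  // Sorted insert; a re-dropped vfid only has its mvccid updated in place.
  void add(const Vfid& v, std::uint64_t mvccid) {
    auto it = find_slot(v);
    if (it != files_.end() && !vfid_less(v, it->vfid)) {
      it->mvccid = mvccid;
    } else {
      files_.insert(it, DroppedFile{v, mvccid});
    }
  }

  // A record is skipped only if its file is listed and its mvccid is not newer.
  bool is_dropped(const Vfid& v, std::uint64_t record_mvccid) const {
    auto it = std::lower_bound(
        files_.begin(), files_.end(), v,
        [](const DroppedFile& e, const Vfid& key) { return vfid_less(e.vfid, key); });
    if (it == files_.end() || vfid_less(v, it->vfid)) return false;
    return record_mvccid <= it->mvccid;
  }

  std::size_t size() const { return files_.size(); }

 private:
  std::vector<DroppedFile>::iterator find_slot(const Vfid& v) {
    return std::lower_bound(
        files_.begin(), files_.end(), v,
        [](const DroppedFile& e, const Vfid& key) { return vfid_less(e.vfid, key); });
  }

  std::vector<DroppedFile> files_;  // kept sorted by vfid
};
```

The `mvccid` comparison is what lets a reused file id coexist with its dropped predecessor: records newer than the drop are not skipped.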
1.8 Panorama
Figure 1-3: Panorama of vacuum’s data structures. Blue arrows: block birth (Ch. 3–4). Green: master’s append and dispatch (Ch. 4, 6). Red: completion report (Ch. 9). Dashed purple: worker reads log pages back from the volume (Ch. 7).
1.9 Chapter summary — key takeaways
- A vacuum block has four representations: the `log_Gl.hdr` accumulator, an in-flight `vacuum_data_entry` in `vacuum_Block_data_buffer`, a persisted entry in a `vacuum_data_page`, and a flags-carrying `VACUUM_LOG_BLOCKID` in `vacuum_Finished_job_queue`. The same four values (start LSA, oldest and newest MVCCID, blockid) flow through the first three; the finished-queue element carries only the blockid and its flags.
- `vacuum_data_entry::blockid` packs a 2-bit status (AVAILABLE `00`, IN_PROGRESS `01`, VACUUMED `10`) plus the orthogonal bit-61 INTERRUPTED flag above a 61-bit id; `set_interrupted` returns status to AVAILABLE while raising the flag, `set_vacuumed` clears it, so `11` never appears.
- `vacuum_data_page` keeps a dense window `[index_unvacuumed, index_free)` of strictly consecutive blockids (no-MVCC gap blocks get placeholder entries born VACUUMED); `get_index_of_blockid` exploits this for an O(1) lookup, and slot 0 of an empty page doubles as the persisted `last_blockid` (Invariant 1-D).
- `vacuum_Data.first_page` / `last_page` are permanently fixed; the `vacuum_fix_data_page` / `vacuum_unfix_data_page` wrappers route around pgbuf for those two VPIDs. The job cursor anchors on `m_blockid` (stable) and recomputes its cached `m_page` / `m_index` after the window shifts.
- The two lock-free queues exist so transaction threads never touch vacuum data pages and workers never write them; the master is the sole writer of the persisted representation. A full `vacuum_Block_data_buffer` drops block metadata in release builds; recovery’s backward chain scan is the safety net (Ch. 11).
- `VACUUM_WORKER` is a bag of persistent per-thread buffers (a prefetch buffer of `1 + log_block_npages` pages, where the +1 covers the last record spilling past the `start_lsa` page, plus the heap-object array, undo buffer, and zip scratch) together with `drop_files_version`, whose per-worker minimum gates dropped-files cleanup; the master reuses the struct with `idx == -1`.
- Block geometry is pure arithmetic: `blockid = pageid / log_block_npages` and `first pageid = blockid * log_block_npages`. Every fence (`keep_from_log_pageid`, prefetch ranges) derives from these two formulas.
Chapter 2: Initialization and Memory Management
This chapter covers how the Chapter 1 structures (`vacuum_Data`, the two lock-free queues, `vacuum_Master`, `vacuum_Workers`, the on-disk file pair) come into existence, how threads acquire vacuum identity, and in what order everything is torn down. Design rationale for the master/worker split lives in the high-level companion (`cubrid-vacuum.md`); none of it is re-derived here.
Startup is two-phase. vacuum_initialize runs in boot_restart_server (boot_sr.c) before log_initialize, because crash recovery already needs vacuum_Data.log_block_npages, the dropped-files VPID, and worker contexts. vacuum_boot runs after recovery, because the master must not consume vacuum data while redo is still rewriting it.
2.1 vacuum_initialize — parameter capture and static arrays
Inputs come from `boot_Db_parm`: block size in log pages plus the two VFIDs from createdb (section 2.2). The `is_restore` flag is `r_args != NULL && r_args->is_restore_from_backup` in `boot_restart_server`; the only other caller, `xboot_emergency_patch`, hard-codes `false`. It lands in `vacuum_Data.is_restoredb_session`, which is consumed exactly once, at SA shutdown (section 2.7).
```cpp
// vacuum_initialize -- src/query/vacuum.c
  if (prm_get_bool_value (PRM_ID_DISABLE_VACUUM))
    return NO_ERROR;  /* <- branch 1: vacuum disabled, nothing is built */
  vacuum_Data.is_restoredb_session = is_restore;
  vacuum_Data.log_block_npages = vacuum_log_block_npages;
  vacuum_Data.page_data_max_count = (DB_PAGESIZE - VACUUM_DATA_PAGE_HEADER_SIZE) / sizeof (VACUUM_DATA_ENTRY);
  // ... condensed: VFID copies; SA_MODE is_vacuum_complete = false; dropped-files globals, mutex init ...
  if (vacuum_get_first_page_dropped_files (thread_p, &vacuum_Dropped_files_vpid) != NO_ERROR)
    {
      assert (false);
      goto error;
    }  /* <- branch 2: sticky first page lookup failed */
  vacuum_Block_data_buffer = new lockfree::circular_queue<vacuum_data_entry> (VACUUM_BLOCK_DATA_BUFFER_CAPACITY);
  if (vacuum_Block_data_buffer == NULL)
    goto error;  /* <- branches 3, 4: queue NULL checks (vacuum_Finished_job_queue is identical) */
  vacuum_Master.state = VACUUM_WORKER_STATE_EXECUTE;  /* <- master is *always* in execute state */
  vacuum_Master.idx = -1;  /* private_lru_index = -1, buffers NULL */
  for (i = 0; i < VACUUM_MAX_WORKER_COUNT; i++)
    {
      vacuum_Workers[i].state = VACUUM_WORKER_STATE_INACTIVE;
      vacuum_Workers[i].private_lru_index = pgbuf_assign_private_lru (thread_p);  /* <- eager */
      vacuum_Workers[i].allocated_resources = false;
      vacuum_Workers[i].idx = i;  /* buffer pointers NULL, capacities 0 */
    }
  return NO_ERROR;

error:
  vacuum_finalize (thread_p);  /* <- error path reuses the full teardown */
  return (error_code == NO_ERROR) ? ER_FAILED : error_code;
```

Branch (2) is a thin wrapper over `file_get_sticky_first_page`, so failure means the createdb-time file is missing, that is, an unbootable database. Branches (3)/(4) are vestigial under throwing `new`, but route to the shared `error:` label so a half-built state is torn down by the same `vacuum_finalize` as a fully-built one.
Note the worker-loop asymmetry: all VACUUM_MAX_WORKER_COUNT (50) static slots get a private page-buffer LRU list immediately, even though PRM_ID_VACUUM_WORKER_COUNT may cap actual threads far lower — LRU indices must be claimed during page-buffer bootstrap; every other per-worker buffer waits for the first job (section 2.5). The slots are VACUUM_WORKER structs whose per-field Field | Role | Why table lives in Chapter 1; vacuum_initialize only NULLs the pointers and zeroes the capacities, leaving state = INACTIVE, allocated_resources = false.
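Routing every failure to one teardown only works if that teardown tolerates partially built state. The sketch below shows the pattern in isolation; `VacuumLikeState` and its fields are hypothetical, the `fail_after_first` parameter exists only to simulate a mid-initialization failure, and nothing here mirrors the actual body of `vacuum_finalize`.

```cpp
#include <cassert>

// Hypothetical subsystem with two resources built in sequence.
struct VacuumLikeState {
  int* block_buffer = nullptr;
  int* finished_queue = nullptr;
  bool initialized = false;

  // Safe on a half-built, fully built, or already finalized state:
  // deleting a null pointer is a no-op, and pointers are reset afterwards.
  void finalize() {
    delete[] finished_queue;
    finished_queue = nullptr;
    delete[] block_buffer;
    block_buffer = nullptr;
    initialized = false;
  }

  // fail_after_first simulates an error between the two allocations.
  bool initialize(bool fail_after_first) {
    block_buffer = new int[1024]();
    if (fail_after_first) {
      finalize();  // the error path reuses the full teardown
      return false;
    }
    finished_queue = new int[2048]();
    initialized = true;
    return true;
  }
};
```

Because `finalize` never assumes a resource exists, the same function serves the error path, normal shutdown, and an accidental second call.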
2.2 The createdb pair — where the files come from
boot_create_all_volumes (createdb) calls both creation functions back to back and stores the VFIDs in boot_Db_parm.

vacuum_create_file_for_vacuum_data runs file_create_with_npages (FILE_VACUUM_DATA, one page) and file_alloc, then writes the first-page VPID into the file descriptor via file_descriptor_update (that descriptor is how restart finds the page), and finally formats the page with vacuum_init_data_page_with_last_blockid (..., 0): empty data, last_blockid 0.

vacuum_create_file_for_dropped_files is the same skeleton with two differences: it allocates via file_alloc_sticky_first_page instead of writing a descriptor (hence the boot-time file_get_sticky_first_page lookup), and it formats the page inline (NULL next_page, n_dropped_files = 0) before vacuum_set_dirty_dropped_entries_page (..., FREE).

Each function has four exits: create error, alloc error, NULL-page guard, and success. Vacuum data adds a fifth, the descriptor-update error. Page format details: Chapter 10.
2.3 vacuum_boot — load state, then start threads
```mermaid
flowchart TD
  A["boot_restart_server"] --> B["vacuum_initialize<br/>(before log_initialize)"]
  B --> C["log_initialize -- crash recovery runs here"]
  C --> D["vacuum_boot"]
  D --> E["vacuum_data_load_and_recover<br/>recover entries, stash VPIDs in vacuum_Data_load, unload"]
  E --> F["vacuum_load_dropped_files_from_disk"]
  F --> G["new vacuum_master_entry_manager<br/>new vacuum_worker_entry_manager"]
  G --> H{"SERVER_MODE?"}
  H -- yes --> I["thread_create_stats_worker_pool<br/>create_daemon vacuum-master"]
  H -- no --> J["no threads -- xvacuum drives jobs"]
  I --> K["vacuum_Is_booted = true"]
  J --> K
  K --> L["first master iteration / first SA pass<br/>vacuum_data_load_first_and_last_page re-fixes the pair"]
```
Figure 2-1: boot sequence; vacuum_initialize and vacuum_boot bracket crash recovery.
vacuum_boot (assert (!vacuum_Is_booted) — boot once) has four runtime branches plus a compile-time split: the disable-vacuum branch still calls log_Gl.mvcc_table.update_global_oldest_visible () (“for debug only” — the Chapter 5 watermark must advance even with vacuum off); a thread_p == NULL fallback to thread_get_thread_entry_info; and two error returns from the load steps — vacuum_data_load_and_recover (Chapter 11) and vacuum_load_dropped_files_from_disk (Chapter 10). Note what the first one does not do: it fixes the page chain only while walking it; its end: label stashes the first/last VPIDs into vacuum_Data_load and unloads both pages — they “must be fixed by vacuum master” (in-code comment), not by the boot thread (section 2.6). Only then are the two entry managers allocated — in both modes, because the SA path claims workers through vacuum_Worker_entry_manager too. Under SERVER_MODE the worker pool gets PRM_ID_VACUUM_WORKER_COUNT threads and one task queue (thread_create_stats_worker_pool), and the master becomes a cubthread::daemon running vacuum_master_task every PRM_ID_VACUUM_MASTER_WAKEUP_INTERVAL ms; SA builds create no thread. Cross-check: log_vacuum_worker_pool is computed from logging flags but its one use is commented out (// m_log = log_vacuum_worker_pool) — dead configuration in this revision.
2.4 Thread identity — entry managers, system tdes, pool handoff
Vacuum threads are generic cubthread workers until an entry manager hook brands them. Both managers funnel into one helper:

```cpp
// vacuum_init_thread_context -- src/query/vacuum.c
  context.type = type;                  /* <- TT_VACUUM_MASTER or TT_VACUUM_WORKER */
  context.vacuum_worker = worker;
  context.check_interrupt = false;      /* <- vacuum is immune to client interrupt checks */
  assert (context.get_system_tdes () == NULL);
  context.claim_system_worker ();       /* <- new log_system_tdes; tran_index = LOG_SYSTEM_TRAN_INDEX */
```

After restart, claim_system_worker draws its tdes from the shared allocator in log_system_tran.cpp (systdes_claim_tdes: systb_Next_tranid seeded with LOG_SYSTEM_WORKER_FIRST_TRANID = NULL_TRANID - 1, stepped by -1, free-list reuse). The macros VACUUM_WORKER_INDEX_TO_TRANID / ..._TRANID_TO_INDEX still sit at the top of vacuum.c but have no remaining callers; they are leftovers of the old fixed-TRANID-per-slot design. Binding is now dynamic, and recovery rebuilds system transactions via log_system_tdes::rv_get_or_alloc_tdes from TRANIDs in the log.
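The allocation rule just described (seed at `NULL_TRANID - 1`, step by -1, prefer the free list) can be pictured with a small stand-alone sketch. This is not the code in log_system_tran.cpp: the class `SysTranidAllocator` and its members are hypothetical, `NULL_TRANID` is assumed to be -1, and the real allocator recycles whole tdes objects rather than bare ids.

```cpp
#include <cassert>
#include <vector>

// Assumed sentinel for the sketch; the real constant lives in CUBRID's headers.
constexpr int SKETCH_NULL_TRANID = -1;
constexpr int SKETCH_FIRST_SYSTEM_TRANID = SKETCH_NULL_TRANID - 1;

// Hypothetical allocator: hands out negative ids, reusing retired ones first.
class SysTranidAllocator
{
public:
  // Reuse a retired id if one exists, otherwise mint the next (more negative) id.
  int claim ()
  {
    if (!m_free.empty ())
      {
        int id = m_free.back ();
        m_free.pop_back ();
        return id;
      }
    return m_next--;
  }

  // Return an id to the free list so that a later claim can reuse it.
  void retire (int id)
  {
    assert (id <= SKETCH_FIRST_SYSTEM_TRANID);
    m_free.push_back (id);
  }

private:
  int m_next = SKETCH_FIRST_SYSTEM_TRANID;
  std::vector<int> m_free;
};
```

Because ids are never tied to a worker index, any thread can end up with any id, which is the sense in which the binding is dynamic.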
vacuum_master_entry_manager (extends cubthread::daemon_entry_manager) — no data members, two final overrides:
| Member | Role | Why it exists |
|---|---|---|
| `on_daemon_create` | Assert vacuum_Master.state == VACUUM_WORKER_STATE_EXECUTE; vacuum_init_thread_context (..., TT_VACUUM_MASTER, &vacuum_Master) | Master’s VACUUM_WORKER is the static singleton; no pool claim |
| `on_daemon_retire` | vacuum_finalize (&context) (tagged // todo: is this the rightful place?); retire_system_worker; null vacuum_worker (asserting it was &vacuum_Master) | Piggybacks subsystem teardown on the daemon’s death (section 2.7) |
vacuum_worker_entry_manager (extends cubthread::entry_manager):
| Member | Role | Why it exists |
|---|---|---|
| `m_pool` | `resource_shared_pool<VACUUM_WORKER>*` over the static 50-slot array; deleted in destructor | Non-owning: claim () pops a slot pointer off a mutex-guarded free stack, retire () pushes it back; no fixed thread-slot binding |
| `claim_worker` / `retire_worker` | Public pass-throughs to m_pool | Lets vacuum_sa_run_job claim a worker without pool hooks |
| `on_create` | tran_index = 0; vacuum_init_thread_context (..., TT_VACUUM_WORKER, m_pool->claim ()); vacuum_worker_allocate_resources (assert on failure); copy private_lru_index into the entry | Per pooled thread, before its first task: identity, tdes, buffers, private LRU handoff |
| `on_retire` | retire_system_worker; worker state = INACTIVE; m_pool->retire; null vacuum_worker; private_lru_index = -1 | Mirror of on_create; entry returns to the global manager clean |
| `on_recycle` | tran_index = LOG_SYSTEM_TRAN_INDEX | Recycling resets entries to NULL_TRAN_INDEX; vacuum must keep looking like the system transaction |
Invariant — vacuum thread identity is atomic. An entry is a vacuum thread iff type is TT_VACUUM_MASTER/TT_VACUUM_WORKER, vacuum_worker is non-NULL, a system tdes is claimed, and tran_index == LOG_SYSTEM_TRAN_INDEX. The hooks set and clear all four together (vacuum_init_thread_context asserts no tdes pre-exists; on_retire asserts vacuum_worker != NULL). Violated, vacuum_get_vacuum_worker asserts and vacuum log records could carry a client TRANID.
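The claim/retire behaviour attributed to `m_pool` above can be sketched as a mutex-guarded free stack over a caller-owned array. This is an illustration only, not CUBRID's `resource_shared_pool`: `WorkerSlot` and `SlotPool` are hypothetical names, and returning `nullptr` from an exhausted pool is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical, simplified worker slot; only what the sketch needs.
struct WorkerSlot
{
  int idx = -1;
  bool in_use = false;
};

// Non-owning pool over a caller-provided array of slots.
class SlotPool
{
public:
  SlotPool (WorkerSlot *slots, std::size_t count)
  {
    // Push in reverse so that the first claim returns slot 0.
    for (std::size_t i = count; i > 0; --i)
      {
        slots[i - 1].idx = static_cast<int> (i - 1);
        m_free.push_back (&slots[i - 1]);
      }
  }

  // Pop a free slot; returns nullptr when every slot is claimed.
  WorkerSlot *claim ()
  {
    std::lock_guard<std::mutex> guard (m_mutex);
    if (m_free.empty ())
      {
        return nullptr;
      }
    WorkerSlot *slot = m_free.back ();
    m_free.pop_back ();
    slot->in_use = true;
    return slot;
  }

  // Push a slot back; any thread may claim it next.
  void retire (WorkerSlot *slot)
  {
    std::lock_guard<std::mutex> guard (m_mutex);
    assert (slot != nullptr && slot->in_use);
    slot->in_use = false;
    m_free.push_back (slot);
  }

private:
  std::mutex m_mutex;
  std::vector<WorkerSlot *> m_free;
};
```

The pool never allocates or frees slots, which mirrors the "non-owning" note in the table: the array outlives the pool and only pointers move.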
SA mode drapes the same identity over the main thread temporarily: vacuum_convert_thread_to_master (save thread_p->type, set master identity, claim tdes only if none exists), vacuum_convert_thread_to_worker (same, plus vacuum_worker_allocate_resources with assert_release on failure), vacuum_restore_thread (restore type, null vacuum_worker, retire_system_worker, tran_index = LOG_SYSTEM_TRAN_INDEX); each tolerates thread_p == NULL by self-lookup. xvacuum brackets the SA pass with convert-to-master/restore; vacuum_sa_run_job nests convert-to-worker/back-to-master per block, asserting the saved types match, then retires the slot via retire_worker.
2.5 Lazy buffers — vacuum_worker_allocate_resources
```cpp
// vacuum_worker_allocate_resources -- src/query/vacuum.c
  assert (worker->state == VACUUM_WORKER_STATE::VACUUM_WORKER_STATE_INACTIVE);
  if (worker->allocated_resources)
    return NO_ERROR;            /* <- idempotent: SA mode re-converts the same thread repeatedly */

  worker->log_zip_p = log_zip_alloc (IO_PAGESIZE);
  // ... condensed: NULL -> logpb_fatal_error + return ER_FAILED ...
  worker->heap_objects =
    (VACUUM_HEAP_OBJECT *) malloc (worker->heap_objects_capacity * sizeof (VACUUM_HEAP_OBJECT));
  // ... condensed: NULL -> goto error; same for undo_data_buffer (IO_PAGESIZE) ...
  worker->prefetch_log_buffer = (char *) malloc (VACUUM_PREFETCH_LOG_BLOCK_BUFFER_PAGES * LOG_PAGESIZE);
  // ... condensed: NULL -> goto error ...      /* <- (1 + log_block_npages) log pages */

  assert (logtb_get_system_tdes (thread_p) != NULL);    /* <- tdes must already be claimed */
  worker->allocated_resources = true;
```

Six branches: the short-circuit, four allocation failures (each calls logpb_fatal_error: a worker without buffers is server-fatal, since vacuum falling behind is unbounded debt), and success. The four allocations fill exactly the four VACUUM_WORKER buffer fields left NULL by vacuum_initialize (log_zip_p, heap_objects (VACUUM_DEFAULT_HEAP_OBJECT_BUFFER_SIZE), undo_data_buffer (IO_PAGESIZE), prefetch_log_buffer), then flip allocated_resources = true. The first failure returns directly; the rest goto error, where vacuum_finalize_worker frees whatever subset exists. That function is four independent idempotent frees (log_zip_free plus three free_and_init), which also makes it the universal teardown that vacuum_finalize runs on all 50 slots plus vacuum_Master, allocated or not. The prefetch sizing ties memory to the Chapter 3 block geometry: VACUUM_PREFETCH_LOG_BLOCK_BUFFER_PAGES = 1 + log_block_npages, one whole log block plus one page.
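The pattern above (idempotent short-circuit, allocate in sequence, shared cleanup that tolerates any subset) is easy to get subtly wrong, so a reduced sketch may help. It uses two buffers instead of four; `SketchWorker`, `sketch_allocate_resources` and `sketch_finalize_worker` are hypothetical names, and the sketch returns an error code where the real function escalates to `logpb_fatal_error`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical two-buffer worker; the real VACUUM_WORKER has four buffers.
struct SketchWorker
{
  char *undo_buffer = nullptr;
  char *prefetch_buffer = nullptr;
  bool allocated_resources = false;
};

// Frees whatever subset exists; safe on a worker that never allocated.
inline void sketch_finalize_worker (SketchWorker &w)
{
  std::free (w.undo_buffer);            // free (nullptr) is a no-op
  w.undo_buffer = nullptr;
  std::free (w.prefetch_buffer);
  w.prefetch_buffer = nullptr;
  w.allocated_resources = false;
}

// Returns 0 on success, -1 on failure. Calls after a success do nothing.
inline int sketch_allocate_resources (SketchWorker &w, std::size_t undo_size, std::size_t prefetch_size)
{
  if (w.allocated_resources)
    {
      return 0;                         // idempotent short-circuit
    }
  w.undo_buffer = static_cast<char *> (std::malloc (undo_size));
  if (w.undo_buffer == nullptr)
    {
      sketch_finalize_worker (w);
      return -1;
    }
  w.prefetch_buffer = static_cast<char *> (std::malloc (prefetch_size));
  if (w.prefetch_buffer == nullptr)
    {
      sketch_finalize_worker (w);       // releases the first buffer too
      return -1;
    }
  w.allocated_resources = true;
  return 0;
}
```

Because the cleanup resets every pointer after freeing it, the same function serves both as the error path and as the unconditional teardown, which is the property the text relies on.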
2.6 Why first_page and last_page stay permanently fixed
vacuum_Data.first_page and last_page are hot on every master iteration and block append, so vacuum holds write latches on them for the whole of the master’s runtime. The page-buffer discipline is bent in exactly three wrappers:

```cpp
// vacuum_fix_data_page -- src/query/vacuum.c
#define vacuum_fix_data_page(thread_p, vpidp) \
  (vacuum_Data.first_page != NULL && VPID_EQ (pgbuf_get_vpid_ptr ((PAGE_PTR) vacuum_Data.first_page), vpidp) ? \
   vacuum_Data.first_page :     /* <- short-circuit: reuse held latch */ \
   /* ... same test against vacuum_Data.last_page ... */ \
   (VACUUM_DATA_PAGE *) pgbuf_fix (thread_p, vpidp, OLD_PAGE, PGBUF_LATCH_WRITE, PGBUF_UNCONDITIONAL_LATCH))
```

vacuum_unfix_data_page and vacuum_set_dirty_data_page apply the same identity test: a page aliasing the held pair is never unfixed and is dirtied with DONT_FREE. Cross-check note: vacuum_set_dirty_data_page is now an inline function taking the pointer by value, so its trailing data_page = NULL only clears the local copy; unlike the still-macro vacuum_unfix_data_page, it cannot null the caller’s pointer.
Invariant — exactly one extra fix per cached page, held by vacuum. Every fix of a first/last VPID must route through these wrappers; a raw pgbuf_fix/pgbuf_unfix on those VPIDs skews the fix count, which debug builds verify via vacuum_verify_vacuum_data_page_fix_count. Violated, either the latch leaks or the cached pointer dangles.
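To make the fix-count invariant concrete, here is a toy model in which a page is just an id plus a counter. Everything in it is hypothetical (`SketchPage`, `raw_fix`, `fix_data_page`, and so on); it only mirrors the identity test of the wrappers and does not touch the real page buffer.

```cpp
#include <cassert>

// Hypothetical miniature page: an id plus a fix counter.
struct SketchPage
{
  int vpid;
  int fix_count;
  explicit SketchPage (int id) : vpid (id), fix_count (0) {}
};

struct SketchVacuumData
{
  SketchPage *first_page = nullptr;     // held fixed by the owner
  SketchPage *last_page = nullptr;      // may alias first_page
};

// Stand-ins for the raw page-buffer calls: they only count.
inline SketchPage *raw_fix (SketchPage &p)
{
  ++p.fix_count;
  return &p;
}

inline void raw_unfix (SketchPage *p)
{
  assert (p->fix_count > 0);
  --p->fix_count;
}

// Reuse the held latch when the target is one of the cached pair.
inline SketchPage *fix_data_page (SketchVacuumData &vd, SketchPage &target)
{
  if (vd.first_page != nullptr && vd.first_page->vpid == target.vpid)
    {
      return vd.first_page;
    }
  if (vd.last_page != nullptr && vd.last_page->vpid == target.vpid)
    {
      return vd.last_page;
    }
  return raw_fix (target);
}

// Never unfix a page that aliases the cached pair; always clear the caller's pointer.
inline void unfix_data_page (SketchVacuumData &vd, SketchPage *&page)
{
  if (page != vd.first_page && page != vd.last_page)
    {
      raw_unfix (page);
    }
  page = nullptr;
}
```

In this model a cached page keeps a fix count of exactly one no matter how often the wrappers are called on it, while any other page returns to zero after a fix/unfix pair.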
Establishment always funnels through vacuum_data_load_first_and_last_page — not through boot. vacuum_data_load_and_recover deliberately unloads the pair on exit: the in-code comment on vacuum_master_task::execute explains that the load “was initially in boot_restart_server”, but boot’s commit complains about — and unfixes — any page its thread left fixed, “so we have to load the data here (vacuum master never commits)”. Hence the master’s first iteration re-fixes the pair in SERVER_MODE; xvacuum and vacuum_sa_reflect_last_blockid do it in SA mode. Load branches: is_loaded early return; first-page fix failure (assert_release, return); the single-page case where vpid_first == vpid_last makes last_page alias first_page (why vacuum_unfix_first_and_last_data_page only unfixes last_page when it differs); last-page fix failure (unfix both, assert_release). The inverse, vacuum_data_unload_first_and_last_page, early-returns when not loaded, stashes both VPIDs into the file-scope vacuum_Data_load struct, unfixes the pair, clears is_loaded. VACUUM_DATA_LOAD, in full:
| Field | Role | Why it exists |
|---|---|---|
| `vpid_first` | First-page VPID, saved at unload | Reload without re-reading the file descriptor; NULL means never loaded (checked by vacuum_sa_reflect_last_blockid) |
| `vpid_last` | Last-page VPID, saved at unload | Same; reload re-fixes the tail directly instead of walking the page chain |
2.7 Shutdown — workers first, master last, finalize inside
xboot_shutdown_server (boot_sr.c) encodes the ordering contract in its comments:

```cpp
// xboot_shutdown_server -- src/transaction/boot_sr.c
  log_abort_all_active_transaction (thread_p);
  vacuum_stop_workers (thread_p);       /* <- 1: no new jobs, drain pool */
  // ... condensed: stats reflection, caches, boot_remove_all_temp_volumes ...
  // only after all logging is finished can this vacuum master be stopped; boot_remove_all_temp_volumes
  // may add a final log entry
  vacuum_stop_master (thread_p);        /* <- 2: daemon dies, vacuum_finalize runs */
```

vacuum_stop_workers early-returns when !vacuum_Is_booted, calls vacuum_notify_server_shutdown → vacuum_Data.shutdown_sequence.request_shutdown (), then (if the pool exists) logs pool stats, calls stop_execution () and destroy_worker_pool, and deletes vacuum_Worker_entry_manager. That last step destroys the resource pool, whose destructor asserts every worker was retired. vacuum_stop_master (same guard) destroys the daemon if one exists, which triggers on_daemon_retire → vacuum_finalize on the master’s own thread; it then deletes the master entry manager and clears vacuum_Is_booted. The error label of boot_restart_server calls the same two functions back to back, so a failed boot reuses the ordered teardown.
vacuum_shutdown_sequence — all fields:
| Field | Role | Why it exists |
|---|---|---|
| `m_state` | NO_SHUTDOWN to SHUTDOWN_REQUESTED (SERVER_MODE only) to SHUTDOWN_REGISTERED | Separates “shutdown asked” from “master acknowledged”; polled via is_shutdown_requested / check_shutdown_request |
| `m_state_mutex` | SERVER_MODE only; guards transitions and the condvar | Request and acknowledgement happen on different threads |
| `m_condvar` | SERVER_MODE only; requester waits inside request_shutdown | Makes request_shutdown synchronous: it returns only after registration |
The handshake: request_shutdown returns immediately if already SHUTDOWN_REGISTERED (re-requests are no-ops), else sets SHUTDOWN_REQUESTED and blocks until m_state == SHUTDOWN_REGISTERED || vacuum_Master_daemon == NULL. On its next wakeup the master’s vacuum_master_task::check_shutdown calls check_shutdown_request: NO_SHUTDOWN → false; SHUTDOWN_REGISTERED → true; SHUTDOWN_REQUESTED → take the mutex, set SHUTDOWN_REGISTERED, notify_one, true (in SA builds this branch is assert (false) — no requester thread exists). If no daemon was ever created, the requester self-registers; in SA mode request_shutdown jumps straight to SHUTDOWN_REGISTERED.
```mermaid
stateDiagram-v2
  [*] --> NO_SHUTDOWN
  NO_SHUTDOWN --> SHUTDOWN_REQUESTED : request_shutdown\nSERVER_MODE, requester blocks on condvar
  SHUTDOWN_REQUESTED --> SHUTDOWN_REGISTERED : master check_shutdown_request\nor requester self-registers when daemon is NULL
  NO_SHUTDOWN --> SHUTDOWN_REGISTERED : request_shutdown in SA_MODE
  SHUTDOWN_REGISTERED --> [*]
```
Figure 2-2: vacuum_shutdown_sequence states; the REQUESTED state exists only in SERVER_MODE.
Invariant — request before destroy. request_shutdown must run while the master daemon still exists (or before it was ever created); only the master’s acknowledgement or the daemon-NULL escape unblocks the requester. Workers-before-master ordering guarantees this; reversed, the shutdown thread could park on m_condvar with nobody left to notify it.
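The SERVER_MODE handshake can be reproduced in miniature with a mutex and a condition variable. The sketch below is an assumption-laden reduction, not CUBRID's `vacuum_shutdown_sequence`: `ShutdownSequence` and `run_handshake` are hypothetical, the daemon-NULL escape and the SA_MODE shortcut are omitted, and every state read takes the mutex for simplicity.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical re-creation of the three-state handshake.
class ShutdownSequence
{
public:
  enum State { NO_SHUTDOWN, SHUTDOWN_REQUESTED, SHUTDOWN_REGISTERED };

  // Requester side: blocks until the master has registered the request.
  void request_shutdown ()
  {
    std::unique_lock<std::mutex> lock (m_mutex);
    if (m_state == SHUTDOWN_REGISTERED)
      {
        return;                         // re-requests are no-ops
      }
    m_state = SHUTDOWN_REQUESTED;
    m_condvar.wait (lock, [this] { return m_state == SHUTDOWN_REGISTERED; });
  }

  // Master side: polled on each wakeup; true means "stop now".
  bool check_shutdown_request ()
  {
    std::lock_guard<std::mutex> lock (m_mutex);
    if (m_state == NO_SHUTDOWN)
      {
        return false;
      }
    if (m_state == SHUTDOWN_REQUESTED)
      {
        m_state = SHUTDOWN_REGISTERED;
        m_condvar.notify_one ();        // unblock the requester
      }
    return true;
  }

  State state ()
  {
    std::lock_guard<std::mutex> lock (m_mutex);
    return m_state;
  }

private:
  State m_state = NO_SHUTDOWN;
  std::mutex m_mutex;
  std::condition_variable m_condvar;
};

// Drive one full handshake: a "master" thread polls until it sees the request.
inline ShutdownSequence::State run_handshake ()
{
  ShutdownSequence seq;
  std::thread master ([&seq] {
    while (!seq.check_shutdown_request ())
      {
        std::this_thread::yield ();
      }
  });
  seq.request_shutdown ();              // returns only after registration
  master.join ();
  return seq.state ();
}
```

If the master thread were destroyed before `request_shutdown` ran, the requester in this sketch would wait forever, which is the failure the ordering invariant rules out.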
vacuum_finalize (reached from the init error path, the master’s retirement, or end of SA xvacuum) walks six guarded steps: disable-vacuum return; assert (!vacuum_is_work_in_progress ...); drain vacuum_Finished_job_queue via vacuum_data_mark_finished (Chapter 9), assert it emptied, delete it; loop-consume vacuum_Block_data_buffer with vacuum_consume_buffer_log_blocks — a loop because consuming appends to vacuum data, which itself logs and can complete yet another block — with a safe-guard assert/break if vacuum_Data.is_loaded is false; in SA builds, vacuum_data_empty_update_last_blockid; then vacuum_data_unload_first_and_last_page, a belt-and-braces pgbuf_unfix_all, vacuum_finalize_worker over all 50 slots plus vacuum_Master, and pthread_mutex_destroy on the dropped-files mutex.
SA mode inverts who calls finalize. (The SERVER_MODE build of xvacuum is a stub returning ER_VACUUM_CS_NOT_AVAILABLE — a client-issued VACUUM statement is rejected; the rest is the SA compile branch.) No daemon exists, so vacuum_stop_master never triggers it — vacuum_finalize runs at the end of xvacuum, inside the convert-to-master/restore bracket, after which vacuum_Data.is_vacuum_complete = true makes further xvacuum calls no-ops (the flag vacuum_initialize reset at boot). After vacuum_stop_master, the SA branch of xboot_shutdown_server adds one extra step: vacuum_sa_reflect_last_blockid, which early-returns if vacuum_Data_load.vpid_first is NULL (fresh createdb or aborted boot) or if vacuum_Data.is_restoredb_session is set (“restoredb doesn’t vacuum” — the lone consumer of the section 2.1 flag); otherwise it reloads the page pair, takes logpb_last_complete_blockid () (early-out on VACUUM_NULL_LOG_BLOCKID), persists it through vacuum_data_empty_update_last_blockid, and unloads again.
2.8 Chapter summary — key takeaways
- Startup is two-phase around crash recovery: `vacuum_initialize` (parameter capture, queues, static `vacuum_Master`/`vacuum_Workers[50]`) runs before `log_initialize`; `vacuum_boot` (data load + recovery, dropped-files load, daemon and pool creation) runs after, gated by `vacuum_Is_booted`.
- The on-disk pair is born at createdb: `vacuum_create_file_for_vacuum_data` persists its first-page VPID in the file descriptor; `vacuum_create_file_for_dropped_files` uses a sticky first page. The two boot-time lookup paths mirror this.
- Thread identity is granted only through entry-manager hooks or the SA convert functions, always as a bundle: thread type, `vacuum_worker` pointer, a dynamically claimed system tdes with negative TRANID (`VACUUM_WORKER_INDEX_TO_TRANID` is a callerless leftover), `LOG_SYSTEM_TRAN_INDEX`, and the private LRU handoff.
- Worker slots are pooled, not bound: a `resource_shared_pool` over the static array lets any thread claim any slot; only `private_lru_index` is eager, everything else arrives lazily and idempotently in `vacuum_worker_allocate_resources`, with failure escalated to `logpb_fatal_error`.
- `first_page`/`last_page` stay write-latched for the master’s whole runtime. They are established by the master’s first iteration (or the SA pass), not by boot, because boot’s commit would unfix its thread’s latches; `vacuum_fix_data_page`, `vacuum_unfix_data_page`, and the `DONT_FREE` branch of `vacuum_set_dirty_data_page` are the only legal access paths, verified by debug fix-count checks.
- Shutdown is workers-then-master: `vacuum_stop_workers` runs the synchronous `vacuum_shutdown_sequence` handshake while the daemon can still acknowledge; `vacuum_stop_master` then destroys it, and `on_daemon_retire` runs `vacuum_finalize`. The boot error path reuses the same two calls.
- SA mode has no daemon: `vacuum_finalize` runs inside `xvacuum` (then `is_vacuum_complete` blocks re-entry), and shutdown appends `vacuum_sa_reflect_last_blockid`, which is skipped for restoredb sessions via the `is_restoredb_session` flag captured at `vacuum_initialize`.
Chapter 3: Block Birth in the Log Append Path
The subsystem is booted (Ch 2) and idle. Where does a block come from?
Nowhere inside vacuum.c: blocks are born as a side effect of
ordinary transactions appending log records, inside
prior_lsa_next_record_internal (log_append.cpp) — the tree’s sole
caller of vacuum_produce_log_block_data — under the prior-LSA mutex.
Vacuum only receives them through vacuum_Block_data_buffer. This
chapter traces every branch of the producer side.
3.1 The producer pipeline in one picture
Every log record is first materialized as a log_prior_node and
appended to the prior list — an in-memory queue the log flusher later
copies into log pages. If the record is MVCC-flavored, two extra things
happen under the same mutex hold: its
vacuum_info.prev_mvcc_op_log_lsa is patched to point at the previous
MVCC record, and the per-block accumulator in log_Gl.hdr is folded
forward. When a record’s start LSA lands in a different log block
than the accumulator, the pending block is closed and pushed to vacuum
as a vacuum_data_entry.
```mermaid
flowchart LR
  NODE["log_prior_node"] -->|"prior_lsa_next_record\nunder prior_lsa_mutex"| LIST["prior list"]
  NODE -->|"MVCC type: patch vacuum_info,\nfold mvccid into header"| HDR["log_Gl.hdr accumulator\nmvcc_op_log_lsa\noldest_visible_mvccid\nnewest_block_mvccid\ndoes_block_need_vacuum"]
  HDR -->|"block boundary crossed:\nvacuum_produce_log_block_data"| ENTRY["vacuum_data_entry"]
  ENTRY --> QUEUE["vacuum_Block_data_buffer\nlockfree circular_queue, cap 1024"]
  QUEUE -->|"drained by Ch 4"| VD["vacuum data"]
```
Figure 3-1: the producer side. Everything left of the queue runs in
transaction threads inside log_append.cpp; vacuum only consumes.
3.2 log_prior_node and the log_Gl.hdr accumulator
```cpp
// log_prior_node -- src/transaction/log_append.hpp
struct log_prior_node
{
  LOG_RECORD_HEADER log_header;
  LOG_LSA start_lsa;            /* for assertion */
  // ... condensed: tde_encrypted ...
  int data_header_length;
  char *data_header;
  // ... condensed: ulength/udata, rlength/rdata payload pointers ...
  LOG_PRIOR_NODE *next;
};
```

Three fields matter. start_lsa is assigned by
prior_lsa_start_append and drives the boundary test.
log_header.type selects the dispatch branch. data_header holds the
type-specific fixed header (e.g., LOG_REC_MVCC_UNDOREDO) still in
memory — the append path casts it and patches vacuum_info in
place before the bytes reach a log page, which is what makes the
backward MVCC chain possible.
The accumulator lives in log_header — the file header log_Gl.hdr
(log_storage.hpp), not the per-record header. Its fields change
meaning with does_block_need_vacuum:
| `log_Gl.hdr` field | does_block_need_vacuum == false | == true |
|---|---|---|
| `mvcc_op_log_lsa` | Last MVCC op of an already produced block (kept; see Invariant 3-E) | Last MVCC op of the pending block |
| `oldest_visible_mvccid` | Stale; re-sampled at next block birth | Horizon frozen at the block’s first MVCC record |
| `newest_block_mvccid` | MVCCID_NULL (reset on produce) | Running max of the block’s MVCCIDs |
| `does_block_need_vacuum` | Boundary branch disarmed | Pending block exists; branch armed |
3.3 prior_lsa_next_record_internal — branch-complete walkthrough
Both public entry points funnel here: prior_lsa_next_record takes
the mutex itself (LOG_PRIOR_LSA_WITHOUT_LOCK);
prior_lsa_next_record_with_lock is for callers already holding it.
First, prior_lsa_start_append stamps node->start_lsa from
log_Gl.prior_info.prior_lsa and threads the per-transaction undo
chain — except for system-worker transactions outside a sysop (vacuum
workers themselves log this way), whose chain LSAs are nulled. Then
three zones run: (A) block-boundary check, (B) record-type dispatch,
(C) list insertion and overflow.
```mermaid
flowchart TD
  A1["prior_lsa_start_append\nstart_lsa assigned"] --> B0{"LOG_ISRESTARTED and\ndoes_block_need_vacuum?"}
  B0 -->|no| C0
  B0 -->|"yes, and blockid of\nhdr.mvcc_op_log_lsa\n!= blockid of start_lsa"| B2["vacuum_produce_log_block_data\ncloses pending block"]
  B2 --> C0{"record type?"}
  C0 -->|"4 MVCC shapes"| D1["extract vacuum_info + mvccid\npatch prev_mvcc_op_log_lsa\nprior_update_header_mvcc_info"]
  C0 -->|"5 recovery-bookkeeping\ntype groups"| E1["save LSAs into\ntdes rcv state"]
  C0 -->|"anything else"| F0
  D1 --> F0["append payloads\nprior_lsa_end_append\ninsert at prior list tail"]
  E1 --> F0
  F0 --> I0{"WITHOUT_LOCK and\nlist_size >= log buffer size?"}
  I0 -->|no| Z["return start_lsa"]
  I0 -->|"yes, server, not crash recovery"| I3["wake log flush daemon\nsleep 1ms"]
  I0 -->|"yes, crash recovery or SA"| I4["flush prior list inline\nunder LOG_CS"]
  I3 --> Z
  I4 --> Z
```
Figure 3-2: prior_lsa_next_record_internal. The vacuum spine is B0
to D1; E1 covers the non-vacuum dispatch arms.
Zone A runs before the type dispatch, on every record type:
```cpp
// prior_lsa_next_record_internal -- src/transaction/log_append.cpp
  if (LOG_ISRESTARTED () && log_Gl.hdr.does_block_need_vacuum)
    {
      assert (!LSA_ISNULL (&log_Gl.hdr.mvcc_op_log_lsa));
      if (vacuum_get_log_blockid (log_Gl.hdr.mvcc_op_log_lsa.pageid) != vacuum_get_log_blockid (start_lsa.pageid))
        {
          assert (vacuum_get_log_blockid (log_Gl.hdr.mvcc_op_log_lsa.pageid)
                  <= (vacuum_get_log_blockid (start_lsa.pageid) - 1));  /* <- pending block strictly older */
          vacuum_produce_log_block_data (thread_p);
        }
    }
```

Two consequences. First, a pending block is closed by the next record
past the boundary, of any type — a plain LOG_COMMIT suffices; on a
quiet system the block lingers open until traffic resumes, which the
consumer side must tolerate (Ch 4, Ch 9). Second, LOG_ISRESTARTED ()
(log_Gl.rcv_phase == LOG_RESTARTED, log_impl.h) disarms production
during recovery — Invariant 3-C.
Zone B recognizes exactly four MVCC shapes, with a three-way extraction because they carry the payload in different places:
```cpp
// prior_lsa_next_record_internal -- src/transaction/log_append.cpp
  if (node->log_header.type == LOG_MVCC_UNDO_DATA
      || node->log_header.type == LOG_MVCC_UNDOREDO_DATA
      || node->log_header.type == LOG_MVCC_DIFF_UNDOREDO_DATA
      || (node->log_header.type == LOG_SYSOP_END
          && ((LOG_REC_SYSOP_END *) node->data_header)->type == LOG_SYSOP_END_LOGICAL_MVCC_UNDO))
    {
      // ... condensed: vacuum_info / mvccid from LOG_REC_MVCC_UNDO (undo and
      // sysop_end.mvcc_undo cases) or LOG_REC_MVCC_UNDOREDO (undoredo
      // and diff-undoredo, which share the struct) ...

      /* Save previous mvcc operation log lsa to vacuum info */
      LSA_COPY (&vacuum_info->prev_mvcc_op_log_lsa, &log_Gl.hdr.mvcc_op_log_lsa);
      prior_update_header_mvcc_info (start_lsa, mvccid);
    }
```

The LSA_COPY welds the backward MVCC chain:
log_vacuum_info::prev_mvcc_op_log_lsa (log_record.hpp) of the new
record gets the previous MVCC record’s LSA, written directly into the
node’s data_header bytes. The worker (Ch 7) and crash recovery
(Ch 11) walk this chain backwards.
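A reduced model shows what walking the chain backwards means in practice. In the sketch below an LSA is collapsed to a page id and each record stores only the page id of the previous MVCC record; `SketchMvccRecord` and `walk_block_backwards` are hypothetical names, and stopping as soon as the chain leaves the block is an assumption based on the block geometry described in this chapter, not a transcription of the worker code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical flattened record: only the back-pointer survives.
// -1 stands for "no previous MVCC record".
struct SketchMvccRecord
{
  std::int64_t prev_mvcc_pageid;
};

// Collect the page ids of one block's MVCC records, newest first, by
// following back-pointers until the chain leaves the block or ends.
inline std::vector<std::int64_t>
walk_block_backwards (const std::map<std::int64_t, SketchMvccRecord> &log,
                      std::int64_t start_pageid, std::int64_t block_npages)
{
  std::vector<std::int64_t> visited;
  const std::int64_t blockid = start_pageid / block_npages;
  std::int64_t cur = start_pageid;
  while (cur >= 0 && cur / block_npages == blockid)
    {
      auto it = log.find (cur);
      if (it == log.end ())
        {
          break;                        // dangling pointer: stop rather than guess
        }
      visited.push_back (cur);
      cur = it->second.prev_mvcc_pageid;
    }
  return visited;
}
```

The walk needs no index of the block's records: the only entry point is the last MVCC record, and every earlier one is reachable through the pointer patched in at append time.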
The five non-vacuum else if arms: LOG_SYSOP_START_POSTPONE,
non-MVCC LOG_SYSOP_END, LOG_COMMIT_WITH_POSTPONE (+_OBSOLETE),
LOG_SYSOP_ATOMIC_START, and LOG_COMMIT/LOG_ABORT — all save
recovery-bookkeeping LSAs into tdes->rcv (or flip tdes state) under
the same mutex hold; none touch the vacuum accumulator.
Zone C inserts the node at the prior-list tail, grows list_size (in
bytes), and — only in WITHOUT_LOCK mode, after releasing the mutex —
checks overflow against logpb_get_memsize (). On overflow, server
mode outside crash recovery wakes the flush daemon and naps 1 ms;
crash-recovery and standalone modes flush the list inline under
LOG_CS. With-lock callers skip this: they cannot flush while holding
the mutex.
3.4 prior_update_header_mvcc_info — folding the block forward
```cpp
// prior_update_header_mvcc_info -- src/transaction/log_append.cpp
static void
prior_update_header_mvcc_info (const LOG_LSA &record_lsa, MVCCID mvccid)
{
  if (!log_Gl.hdr.does_block_need_vacuum)
    {
      // first mvcc record for this block
      log_Gl.hdr.oldest_visible_mvccid = log_Gl.mvcc_table.get_global_oldest_visible ();       /* <- sampled ONCE */
      log_Gl.hdr.newest_block_mvccid = mvccid;
    }
  else
    {
      // ... condensed: sanity asserts ...
      assert (vacuum_get_log_blockid (log_Gl.hdr.mvcc_op_log_lsa.pageid)
              == vacuum_get_log_blockid (record_lsa.pageid));
      if (log_Gl.hdr.newest_block_mvccid < mvccid)
        {
          log_Gl.hdr.newest_block_mvccid = mvccid;      /* <- running max; ids may arrive out of order */
        }
    }
  log_Gl.hdr.mvcc_op_log_lsa = record_lsa;
  log_Gl.hdr.does_block_need_vacuum = true;
}
```

A block is literally born on the false branch: that is the only place oldest_visible_mvccid is (re)sampled.
Invariant 3-A: the watermark is frozen at logging time. `oldest_visible_mvccid` is sampled exactly once, when the block’s first MVCC record is logged, never at consumption. Re-sampling later would let the advanced horizon exclude transactions still live when the block’s undo was created, breaking monotonic `oldest_unvacuumed` tracking (Ch 5); the recovery path even refuses to reset it between rebuilt blocks (“we don’t reset `data.oldest_visible_mvccid` between blocks” in `vacuum_recover_lost_block_data`). Recovery is not identical to the live path here; see Cross-check Notes (3.7).

Invariant 3-B: one accumulator, one block. Every record folded into the accumulator lies in the same log block. Enforced by ordering: zone A runs before `prior_update_header_mvcc_info` under one hold of `prior_lsa_mutex`, and the `else` branch asserts `vacuum_get_log_blockid` equality. Otherwise the entry’s `blockid` would not cover all its records.

Invariant 3-C: recovery never produces. The `LOG_ISRESTARTED` guard keeps redo-time appends from pushing entries; blocks pending at crash time are rebuilt by `vacuum_recover_lost_block_data` instead (Ch 11). Without it, replayed appends would mint duplicates of blocks already registered in vacuum data before the crash.
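Zones A and B together with the fold can be exercised in a few lines once LSAs are collapsed to page ids. The sketch below is a simplified model under stated assumptions: `SketchHeader`, `SketchEntry` and `SketchProducer` are hypothetical, a plain vector stands in for the lock-free buffer, and 0 stands in for `MVCCID_NULL`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical miniature of the header accumulator; page ids stand in for LSAs.
struct SketchHeader
{
  std::int64_t mvcc_op_pageid = -1;
  std::uint64_t oldest_visible_mvccid = 0;
  std::uint64_t newest_block_mvccid = 0;
  bool does_block_need_vacuum = false;
};

struct SketchEntry
{
  std::int64_t blockid;
  std::uint64_t oldest_visible_mvccid;
  std::uint64_t newest_mvccid;
};

struct SketchProducer
{
  std::int64_t block_npages;
  bool restarted = true;                // models LOG_ISRESTARTED ()
  SketchHeader hdr;
  std::vector<SketchEntry> produced;    // stands in for the block data buffer

  explicit SketchProducer (std::int64_t npages) : block_npages (npages) {}

  // Zone A: a record of any type past the boundary closes the pending block.
  void on_any_record (std::int64_t pageid)
  {
    if (restarted && hdr.does_block_need_vacuum
        && hdr.mvcc_op_pageid / block_npages != pageid / block_npages)
      {
        produced.push_back ({ hdr.mvcc_op_pageid / block_npages,
                              hdr.oldest_visible_mvccid, hdr.newest_block_mvccid });
        hdr.does_block_need_vacuum = false;
        hdr.newest_block_mvccid = 0;    // mvcc_op_pageid is deliberately kept
      }
  }

  // Zone B for MVCC records: boundary check first, then fold.
  void on_mvcc_record (std::int64_t pageid, std::uint64_t mvccid, std::uint64_t global_oldest_visible)
  {
    on_any_record (pageid);
    if (!hdr.does_block_need_vacuum)
      {
        hdr.oldest_visible_mvccid = global_oldest_visible;     // sampled once per block
        hdr.newest_block_mvccid = mvccid;
      }
    else if (hdr.newest_block_mvccid < mvccid)
      {
        hdr.newest_block_mvccid = mvccid;                      // running max
      }
    hdr.mvcc_op_pageid = pageid;
    hdr.does_block_need_vacuum = true;
  }
};
```

Setting `restarted` to false in this model suppresses production entirely, which is the behaviour Invariant 3-C describes for recovery-time appends.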
3.5 vacuum_get_log_blockid — fixed block geometry
```cpp
// vacuum_get_log_blockid -- src/query/vacuum.c
VACUUM_LOG_BLOCKID
vacuum_get_log_blockid (LOG_PAGEID pageid)
{
  if (prm_get_bool_value (PRM_ID_DISABLE_VACUUM) || pageid == NULL_PAGEID)
    {
      return VACUUM_NULL_LOG_BLOCKID;
    }
  assert (vacuum_Data.log_block_npages != 0);
  return pageid / vacuum_Data.log_block_npages;
}
```

```cpp
// VACUUM_FIRST_LOG_PAGEID_IN_BLOCK -- src/query/vacuum.c
#define VACUUM_FIRST_LOG_PAGEID_IN_BLOCK(blockid) \
  ((blockid) * vacuum_Data.log_block_npages)
#define VACUUM_LAST_LOG_PAGEID_IN_BLOCK(blockid) \
  (VACUUM_FIRST_LOG_PAGEID_IN_BLOCK (blockid + 1) - 1)
```

A block is pure arithmetic; no block object exists until the boundary
branch decides one has ended. The disabled/NULL_PAGEID early return
yields VACUUM_NULL_LOG_BLOCKID (-1, log_common_impl.h), making
the boundary comparison inert when vacuum is off.
Invariant 3-D: block boundaries align to a fixed log-page count. Block `b` covers exactly pages `[b * log_block_npages, (b+1) * log_block_npages - 1]`; the integer division above maps every page in that range back to `b`, and the multiplication recovers the range’s first page. `log_block_npages` is set once in `vacuum_initialize` (Ch 2) and never changes, since every `blockid` persisted in vacuum data encodes the geometry. The worker’s page-range termination (Ch 7) and recovery’s per-block stop page (Ch 11) rely on it.
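The geometry is simple enough to check by hand, and a stand-alone version of the arithmetic makes the round trip explicit. The function names below are hypothetical, the block size is passed as a parameter instead of being read from `vacuum_Data`, and both NULL sentinels are assumed to be -1.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int64_t SKETCH_NULL_PAGEID = -1;         // assumed sentinel
constexpr std::int64_t SKETCH_NULL_BLOCKID = -1;        // assumed sentinel

// Page id -> block id by integer division.
inline std::int64_t block_of_page (std::int64_t pageid, std::int64_t block_npages)
{
  if (pageid == SKETCH_NULL_PAGEID)
    {
      return SKETCH_NULL_BLOCKID;
    }
  assert (block_npages != 0);
  return pageid / block_npages;
}

// First page covered by a block.
inline std::int64_t first_page_in_block (std::int64_t blockid, std::int64_t block_npages)
{
  return blockid * block_npages;
}

// Last page covered by a block: one before the next block's first page.
inline std::int64_t last_page_in_block (std::int64_t blockid, std::int64_t block_npages)
{
  return first_page_in_block (blockid + 1, block_npages) - 1;
}
```

With a block size of 31 pages, for example, block 3 spans pages 93 through 123, and page 124 already belongs to block 4.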
3.6 vacuum_produce_log_block_data — minting the entry
```cpp
// vacuum_produce_log_block_data -- src/query/vacuum.c
void
vacuum_produce_log_block_data (THREAD_ENTRY * thread_p)
{
  if (prm_get_bool_value (PRM_ID_DISABLE_VACUUM))
    {
      return;                   /* branch 1: vacuum disabled */
    }
  assert (log_Gl.hdr.does_block_need_vacuum == true);
  VACUUM_DATA_ENTRY block_data { log_Gl.hdr };  /* <- snapshot the accumulator */

  // reset info for next block
  log_Gl.hdr.does_block_need_vacuum = false;
  log_Gl.hdr.newest_block_mvccid = MVCCID_NULL; /* <- mvcc_op_log_lsa deliberately NOT reset */

  if (vacuum_Block_data_buffer == NULL)
    {
      assert (false);
      return;                   /* branch 2: not booted, debug-only trap */
    }
  // ... condensed: vacuum_er_log of the new entry ...
  if (!vacuum_Block_data_buffer->produce (block_data))
    {
      /* TODO: ... Make sure that we do not lose vacuum data ... */
      vacuum_er_log_error (VACUUM_ER_LOG_ERROR, "%s", "Cannot produce new log block data! The buffer is already full.");
      assert (false);
      return;                   /* branch 3: buffer full -- entry DROPPED in release */
    }
  perfmon_add_stat (thread_p, PSTAT_VAC_NUM_TO_VACUUM_LOG_PAGES, vacuum_Data.log_block_npages);
}
```

The entry constructor (a log_header overload delegates to this one)
derives blockid from the accumulator’s last-record LSA:
// vacuum_data_entry::vacuum_data_entry -- src/query/vacuum.cvacuum_data_entry::vacuum_data_entry (const log_lsa &lsa, MVCCID oldest, MVCCID newest) : blockid (VACUUM_NULL_LOG_BLOCKID) , start_lsa (lsa) /* <- lsa of LAST mvcc op in the block, despite the name */ , oldest_visible_mvccid (oldest) , newest_mvccid (newest){ // ... condensed: asserts, incl. oldest <= newest ... blockid = vacuum_get_log_blockid (start_lsa.pageid);}Note start_lsa is where the worker starts its backward walk — the
last MVCC op — not the block’s first page.
blockid is a VACUUM_LOG_BLOCKID — an int64_t alias from
log_storage.hpp — but only the low 61 bits carry the block number;
the top three are lifecycle flags, set later by the master and workers
(Ch 5-7), never by the producer:
```cpp
// VACUUM_DATA_ENTRY_FLAG_MASK -- src/query/vacuum.c
#define VACUUM_DATA_ENTRY_FLAG_MASK             0xE000000000000000
#define VACUUM_DATA_ENTRY_BLOCKID_MASK          0x1FFFFFFFFFFFFFFF

#define VACUUM_BLOCK_STATUS_MASK                0xC000000000000000
#define VACUUM_BLOCK_STATUS_VACUUMED            0x8000000000000000
#define VACUUM_BLOCK_STATUS_IN_PROGRESS_VACUUM  0x4000000000000000
#define VACUUM_BLOCK_STATUS_AVAILABLE           0x0000000000000000

#define VACUUM_BLOCK_FLAG_INTERRUPTED           0x2000000000000000
```

| Bits | Meaning |
|---|---|
| 63-62 | Status: 10 vacuumed, 01 in-progress, 00 available |
| 61 | VACUUM_BLOCK_FLAG_INTERRUPTED — job was cut short (Ch 6) |
| 60-0 | Block number, extracted by VACUUM_BLOCKID_WITHOUT_FLAGS |
(The in-source comment above the masks — “first bit will be used for
this flag” — predates the three-bit reality.) A newborn entry’s flags
are all zero — VACUUM_BLOCK_STATUS_AVAILABLE — because
vacuum_get_log_blockid returns a pure quotient, far below bit 61.
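The bit layout in the table can be exercised with a small sketch. The mask values are copied from the excerpt above; the helper names (blockid_without_flags, status_of, is_interrupted, with_status) are hypothetical, and the model uses an unsigned 64-bit integer to keep the bit operations well defined, whereas the text describes VACUUM_LOG_BLOCKID as an int64_t alias.

```cpp
#include <cassert>
#include <cstdint>

// Mask values as listed in the excerpt; names shortened for the sketch.
constexpr std::uint64_t BLOCKID_MASK       = 0x1FFFFFFFFFFFFFFFULL;
constexpr std::uint64_t STATUS_MASK        = 0xC000000000000000ULL;
constexpr std::uint64_t STATUS_VACUUMED    = 0x8000000000000000ULL;
constexpr std::uint64_t STATUS_IN_PROGRESS = 0x4000000000000000ULL;
constexpr std::uint64_t STATUS_AVAILABLE   = 0x0000000000000000ULL;
constexpr std::uint64_t FLAG_INTERRUPTED   = 0x2000000000000000ULL;

// Bits 60-0: the block number.
constexpr std::uint64_t blockid_without_flags (std::uint64_t raw)
{
  return raw & BLOCKID_MASK;
}

// Bits 63-62: the status.
constexpr std::uint64_t status_of (std::uint64_t raw)
{
  return raw & STATUS_MASK;
}

// Bit 61: the interrupted flag.
constexpr bool is_interrupted (std::uint64_t raw)
{
  return (raw & FLAG_INTERRUPTED) != 0;
}

// Replace the status bits, leaving the block number and bit 61 untouched.
constexpr std::uint64_t with_status (std::uint64_t raw, std::uint64_t status)
{
  return (raw & ~STATUS_MASK) | status;
}

// A newborn entry is a pure quotient, so its status reads as AVAILABLE.
static_assert (status_of (42) == STATUS_AVAILABLE, "flags are zero at birth");
```

Changing the status never disturbs the block number or the interrupted bit, which is why the master and workers can flip states in place.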
Branch 3 deserves emphasis: vacuum_Block_data_buffer is a lock-free
circular queue of VACUUM_BLOCK_DATA_BUFFER_CAPACITY (1024) entries,
allocated in vacuum_initialize. If full, the entry is logged,
asserted on, and lost in a release build — the accumulator was
already reset two lines earlier, so nothing retries. The in-source
TODOs acknowledge this; in practice the consumer
(vacuum_consume_buffer_log_blocks, Ch 4) drains far faster than 1024
blocks of log can be written.
Invariant 3-E: the MVCC chain is continuous across blocks.
vacuum_produce_log_block_data resets does_block_need_vacuum and newest_block_mvccid but leaves log_Gl.hdr.mvcc_op_log_lsa pointing at the previous block’s last MVCC record. The next block’s first MVCC record therefore links back across the boundary: the LSA_COPY in zone B reads the stale value before prior_update_header_mvcc_info overwrites it. This lets vacuum_recover_lost_block_data rebuild several lost blocks in one backward walk (Ch 11); resetting the field to NULL would cap recovery at a single block.
3.7 Cross-check notes — producer vs. recovery rebuild
The live producer and the crash-recovery rebuilder
(vacuum_recover_lost_block_data, Ch 11) mint entries by different
rules, which are easy to conflate:

- oldest_visible_mvccid source. The live path samples the global horizon once per block via get_global_oldest_visible (Invariant 3-A). Recovery cannot, because that horizon is gone, so it carries the minimum MVCCID seen while replaying the block’s records (MVCC_ID_PRECEDES); a later, smaller MVCCID implies a transaction active during the block, so the minimum is the safe substitute. The two values need not be equal.
- Direction. The live path closes the current block going forward; recovery walks prev_mvcc_op_log_lsa backward across several blocks and produces them oldest-first off a std::stack.
- Last block re-armed, not produced. If the rebuilt block is the one the live header still owns (blockid == vacuum_get_log_blockid (prior_lsa.pageid)), recovery restores it into log_Gl.hdr instead of pushing it. This is the seam where the two paths rejoin.
3.8 Chapter summary — key takeaways
- Blocks are born in log_append.cpp, not vacuum.c: prior_lsa_next_record_internal folds MVCC info into log_Gl.hdr and mints a vacuum_data_entry when a record’s start_lsa crosses a block boundary, all under prior_lsa_mutex, the accumulator’s only (and sufficient) serialization.
- Four record shapes feed the accumulator (LOG_MVCC_UNDO_DATA, LOG_MVCC_UNDOREDO_DATA, LOG_MVCC_DIFF_UNDOREDO_DATA, and LOG_SYSOP_END carrying LOG_SYSOP_END_LOGICAL_MVCC_UNDO), each contributing an MVCCID and receiving a prev_mvcc_op_log_lsa back link patched into its still-in-memory data_header.
- The boundary test fires on every record type; a quiet system leaves the last block open in log_Gl.hdr until traffic resumes (handled at consumption, Ch 4/9).
- A block is a fixed arithmetic page range (Invariant 3-D); the entry’s 64-bit blockid keeps the number in bits 60-0 and reserves bits 63-61 for status/interrupted flags, all zero at birth.
- oldest_visible_mvccid is frozen at the block’s first MVCC record (Invariant 3-A); newest_block_mvccid is a running max; mvcc_op_log_lsa survives block production, keeping the backward chain unbroken (Invariant 3-E).
- LOG_ISRESTARTED disarms production during recovery (Invariant 3-C); the 1024-entry handoff queue drops entries on overflow with only an assert and an error log, a known and tolerated weak point.
Chapter 4: Block Registration into Vacuum Data
Chapter 3 ended with a vacuum_data_entry sitting in vacuum_Block_data_buffer, a lock-free circular queue in volatile memory; a crash at that instant loses it. This chapter walks the function that fixes that, vacuum_consume_buffer_log_blocks, branch by branch: draining the buffer, filling blockid gaps, appending at last_page->index_free, growing the file when a page fills, and the redo-only WAL protocol behind it.
4.1 Who calls the consume path, and when
vacuum_consume_buffer_log_blocks has exactly three direct call sites:

| Caller | Mode | Trigger |
|---|---|---|
| vacuum_data::update | both | Canonical wrapper: mark finished jobs (Ch 9), then consume. Reached via vacuum_job_cursor::force_data_update. |
| vacuum_recover_lost_block_data | boot | Replays the log tail into the buffer, then consumes it immediately (Ch 11). |
| vacuum_finalize | shutdown | Drains the buffer in a loop, because consuming appends log, which can complete another block. |
In SERVER_MODE, vacuum_master_task::execute calls m_cursor.force_data_update () once per wakeup, and again whenever should_force_data_update reports is_half_full () on vacuum_Finished_job_queue or vacuum_Block_data_buffer (“don’t wait until it’s full”; log appenders must never find the queue full). The queues are sized independently — VACUUM_BLOCK_DATA_BUFFER_CAPACITY = 1024, VACUUM_FINISHED_JOB_QUEUE_CAPACITY = 2048. In SA mode, xvacuum runs the cursor loop inline, forcing an update when the buffer is non-empty, when vacuum_Finished_job_queue->is_full (), or when the cursor runs off the end. At SA shutdown, xboot_shutdown_server (boot_sr.c) calls vacuum_sa_reflect_last_blockid: last_blockid jumps to logpb_last_complete_blockid (), is mirrored into log_Gl.hdr.vacuum_last_blockid, and re-stamped into the empty page via vacuum_data_empty_update_last_blockid, so the next boot does not re-scan log that SA mode already vacuumed; vacuum_finalize repeats the re-stamp after its drain loop.
vacuum_job_cursor::force_data_update brackets vacuum_data::update with unload () / readjust_to_vacuum_data_changes () + load (), because consuming can swap or free the very page the cursor has fixed (Ch 6). update runs mark-finished, then consume, then — only if !vacuum_Data.is_empty () — upgrade_oldest_unvacuumed (get_first_entry ().oldest_visible_mvccid), valid only because oldest_visible_mvccid is non-decreasing across entries (§4.3).
4.2 Entry guards and the empty-buffer fast path
The function opens with two guards and a subtle fast path:

- PRM_ID_DISABLE_VACUUM set → return NO_ERROR.
- vacuum_Block_data_buffer == NULL → assert (false), return NO_ERROR (never happens live).
- Buffer empty. If vacuum data is also empty (vacuum_is_empty (): single page, index_unvacuumed == index_free), the function advances m_last_blockid anyway, so an idle system does not pin ever-older log archives:

```cpp
// vacuum_consume_buffer_log_blocks -- src/query/vacuum.c
  if (vacuum_Block_data_buffer->is_empty ())
    {
      if (vacuum_is_empty ())
        {
          if (log_Gl.hdr.does_block_need_vacuum)
            {
              return NO_ERROR;   /* <- current block has MVCC ops; cannot skip it */
            }
          std::unique_lock<std::mutex> ulock { log_Gl.prior_info.prior_lsa_mutex };
          // ... condensed: recheck does_block_need_vacuum under the mutex -> return;
          //     recheck buffer: non-empty -> fall through to consume ...
          LOG_LSA log_lsa = log_Gl.prior_info.prior_lsa;
          ulock.unlock ();   // unlock after reading prior_lsa
          const VACUUM_LOG_BLOCKID LOG_BLOCK_TRAILING_DIFF = 2;
          VACUUM_LOG_BLOCKID log_blockid = vacuum_get_log_blockid (log_lsa.pageid);
          if (log_blockid > vacuum_Data.get_last_blockid () + LOG_BLOCK_TRAILING_DIFF)
            {
              vacuum_Data.set_last_blockid (log_blockid - LOG_BLOCK_TRAILING_DIFF);
              vacuum_data_empty_update_last_blockid (thread_p);
              vacuum_update_keep_from_log_pageid (thread_p);
            }
          return NO_ERROR;
        }
      else
        {
          return NO_ERROR;   /* <- data non-empty: last entry already defines last_blockid */
        }
    }
```

Three details matter. First, the check–lock–recheck on log_Gl.hdr.does_block_need_vacuum: skipping past an in-flight block with MVCC ops would orphan them; the recheck under prior_lsa_mutex closes the race with the Chapter 3 producer, which flips the flag under the same mutex. Second, LOG_BLOCK_TRAILING_DIFF = 2: log_blockid is the block containing prior_lsa, still being written; staying two behind means the catch-up never claims a block that could still produce MVCC ops. Third, vacuum_data_empty_update_last_blockid re-initializes the single empty page through vacuum_init_data_page_with_last_blockid, so even the “no work” path is WAL-logged (§4.5).
Past the fast path, vacuum_Data.last_page == NULL is an assert_release + ER_FAILED (data not loaded — caller bug).
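The catch-up rule of the empty-buffer fast path reduces to one comparison. The sketch below is a pure-function model under that reading; catch_up_last_blockid is a hypothetical name, and the locking, logging and watermark update of the real path are left out.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors LOG_BLOCK_TRAILING_DIFF from the excerpt.
constexpr std::int64_t TRAILING_DIFF = 2;

// last_blockid: current high-water mark of vacuum data.
// log_blockid:  block that contains prior_lsa (still being written).
// Returns the new high-water mark, which stays two blocks behind the log head.
constexpr std::int64_t catch_up_last_blockid (std::int64_t last_blockid, std::int64_t log_blockid)
{
  return log_blockid > last_blockid + TRAILING_DIFF ? log_blockid - TRAILING_DIFF : last_blockid;
}

static_assert (catch_up_last_blockid (10, 15) == 13, "advances to two behind the head");
static_assert (catch_up_last_blockid (10, 12) == 10, "no move when already within the margin");
```

With a head at block 12 and a mark at 10 nothing moves, because the strict comparison requires the head to be more than two blocks ahead.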
4.3 The drain loop and the dense-monotonic invariant
The consume loop walks the buffer and appends at the cached last page:

```cpp
// vacuum_consume_buffer_log_blocks -- src/query/vacuum.c
  data_page = vacuum_Data.last_page;
  page_free_data = data_page->data + data_page->index_free;
  save_page_free_data = page_free_data;   /* <- start of the not-yet-logged run */
  was_vacuum_data_empty = vacuum_is_empty ();

  while (vacuum_Block_data_buffer->consume (consumed_data))
    {
      assert (vacuum_Data.get_last_blockid () < consumed_data.blockid);
      for (next_blockid = vacuum_Data.get_last_blockid () + 1; next_blockid <= consumed_data.blockid;
           next_blockid++)
        {
          // ... page-full branch, see 4.4 ...
          if (data_page->index_unvacuumed == data_page->index_free && next_blockid < consumed_data.blockid)
            {
              next_blockid = consumed_data.blockid - 1;   // empty page: skip gaps; for will increment to target
              continue;
            }
          page_free_data->blockid = next_blockid;
          if (next_blockid == consumed_data.blockid)
            {
              LSA_COPY (&page_free_data->start_lsa, &consumed_data.start_lsa);
              page_free_data->newest_mvccid = consumed_data.newest_mvccid;
              page_free_data->oldest_visible_mvccid = consumed_data.oldest_visible_mvccid;
              // ... condensed: NDEBUG asserts, er_log ...
            }
          else
            {
              page_free_data->set_vacuumed ();   /* <- gap block: no MVCC ops */
              LSA_SET_NULL (&page_free_data->start_lsa);
              // ... condensed: oldest_visible_mvccid = newest_mvccid = MVCCID_NULL ...
            }
          vacuum_Data.set_last_blockid (next_blockid);
          page_free_data++;
          data_page->index_free++;
        }
    }
```

The buffer only carries blocks that had MVCC ops (Ch 3). Yet the inner for runs over every blockid between m_last_blockid + 1 and consumed_data.blockid, materializing the missing ones as pre-VACUUMED gap entries with NULL start_lsa and NULL MVCCIDs.
Key invariant: vacuum data blockids are dense and monotonic. Every blockid in [get_first_blockid (), get_last_blockid ()] appears exactly once, in consecutive order across the page chain, and oldest_visible_mvccid is non-decreasing along that order. The gap-fill loop enforces density; the debug asserts (page_free_data - 1)->get_blockid () + 1 == page_free_data->get_blockid () and the oldest_visible_mvccid comparison check both. If violated: the job cursor’s blockid arithmetic (vacuum_job_cursor::change_blockid, Ch 6) lands on the wrong entry, vacuum_data_mark_finished (Ch 9) marks the wrong block, and the “first entry has the oldest MVCCID” shortcut corrupts oldest_unvacuumed_mvccid (Ch 5).
One refinement keeps the invariant cheap: if the current page is empty (index_unvacuumed == index_free) and we are still below consumed_data.blockid, the loop teleports next_blockid to consumed_data.blockid - 1 and continues — gap entries on an empty page would be dead weight; the page’s range legally restarts at the real block. A new real entry starts life AVAILABLE (flags live in the blockid’s high bits, Ch 1); gap entries are born VACUUMED and are reclaimed by the next mark-finished pass without any worker seeing them.
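The gap-fill rule and the empty-page shortcut can be modelled on a single in-memory page. This is a simplified sketch, assuming one unbounded page with no page-full branch and no logging; Entry and VacuumDataModel are hypothetical types, not CUBRID structures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Entry
{
  std::int64_t blockid;
  bool vacuumed;   // true for gap entries, which carry no MVCC ops
};

// Hypothetical single-page model of vacuum data.
struct VacuumDataModel
{
  std::vector<Entry> entries;         // slots [0, index_free)
  std::size_t index_unvacuumed = 0;   // first slot not yet reclaimed
  std::int64_t last_blockid = 0;

  bool page_is_empty () const
  {
    return index_unvacuumed == entries.size ();
  }

  // Register one block taken from the buffer, filling any blockid gap before it.
  void consume (std::int64_t consumed_blockid)
  {
    assert (last_blockid < consumed_blockid);
    for (std::int64_t next = last_blockid + 1; next <= consumed_blockid; next++)
      {
        if (page_is_empty () && next < consumed_blockid)
          {
            next = consumed_blockid - 1;   // skip gaps; the loop increment lands on the target
            continue;
          }
        entries.push_back ({ next, next != consumed_blockid });
        last_blockid = next;
      }
  }
};
```

Starting from an empty page with last_blockid = 5, consuming block 9 stores only entry 9; consuming block 12 afterwards stores gap entries 10 and 11 followed by the real entry 12, so the stored range stays dense.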
After the loop: if was_vacuum_data_empty, vacuum_update_keep_from_log_pageid recomputes the log-removal watermark (Ch 5). The function works in whole blocks, not LSAs: empty data keeps log from the first page of the block after last_blockid; non-empty data keeps from the first log page of get_first_blockid ()’s block. Appending the first entry to empty data therefore pulls the watermark back to that entry’s block boundary.
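The whole-block rule for the log-removal watermark can be written as a two-branch function. This is a sketch of the rule as stated in the paragraph above, not of vacuum_update_keep_from_log_pageid itself; keep_from_log_pageid and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Returns the first log page that must be kept.
// data_is_empty: vacuum data holds no entries.
// first_blockid: blockid of the first entry (ignored when data is empty).
// last_blockid:  high-water mark of registered blocks.
// npages:        log pages per block.
constexpr std::int64_t keep_from_log_pageid (bool data_is_empty, std::int64_t first_blockid,
                                             std::int64_t last_blockid, std::int64_t npages)
{
  return data_is_empty ? (last_blockid + 1) * npages   // first page of the block after last_blockid
                       : first_blockid * npages;       // first page of the first entry's block
}
```

For example, with 31 pages per block, empty data at last_blockid 9 keeps log from page 310, while data whose first entry is block 4 keeps from page 124.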
4.4 The page-full branch
When data_page->index_free == vacuum_Data.page_data_max_count, the chain grows:

```cpp
// vacuum_consume_buffer_log_blocks -- src/query/vacuum.c (page-full branch)
  if (page_free_data > save_page_free_data)
    {
      log_append_redo_data2 (thread_p, RVVAC_DATA_APPEND_BLOCKS, NULL, (PAGE_PTR) data_page,
                             (PGLENGTH) (save_page_free_data - data_page->data),
                             /* ... condensed ... */);   /* <- log the run appended so far */
      vacuum_set_dirty_data_page (thread_p, data_page, DONT_FREE);
    }
  if (is_sysop)
    {
      log_sysop_commit (thread_p);   /* <- second new page in one call: commit previous sysop */
    }
  log_sysop_start (thread_p);
  is_sysop = true;
  error_code = file_alloc (thread_p, &vacuum_Data.vacuum_data_file, file_init_page_type, &ptype,
                           &next_vpid, (PAGE_PTR *) (&data_page));
  // ... condensed: on error or NULL page -> log_sysop_abort + return ...
  vacuum_init_data_page_with_last_blockid (thread_p, data_page, vacuum_Data.get_last_blockid ());
  VPID_COPY (&vacuum_Data.last_page->next_page, &next_vpid);
  log_append_undoredo_data2 (thread_p, RVVAC_DATA_SET_LINK, NULL, (PAGE_PTR) vacuum_Data.last_page,
                             0, 0, sizeof (VPID), NULL, &next_vpid);   /* <- undo data is NULL */
  save_last_page = vacuum_Data.last_page;
  vacuum_Data.last_page = data_page;   /* <- swap the cached fixed page */
  vacuum_set_dirty_data_page (thread_p, save_last_page, FREE);
  // we cannot commit here. we should append some data blocks first.
  page_free_data = data_page->data + data_page->index_free;
  save_page_free_data = page_free_data;
```

Branches, in order: (a) the old page’s pending run is logged only if non-empty; (b) an already-open sysop from a previous page-full iteration is committed first; (c) file_alloc failure or a NULL page aborts the sysop and returns, and the abort undoes the allocation itself; (d) the new page is initialized and stamped with the current last_blockid, the old last page’s next_page is linked, the cached last_page pointer is swapped, and the old page is unfixed (FREE).
The system operation is the consistency boundary: allocation, page init, and link commit as one unit. The deliberate oddity is “we cannot commit here” — the sysop stays open until at least one entry run is logged into the new page (§4.5), so recovery never surfaces an allocated-but-empty page at the end of the chain whose data[0].blockid stamp the loop has already moved past.
flowchart TD
A["page full?"] -->|yes| B["pending run on old page?<br/>log RVVAC_DATA_APPEND_BLOCKS"]
B --> C["previous sysop open?<br/>commit it"]
C --> D["log_sysop_start + file_alloc"]
D -->|fail| X["log_sysop_abort<br/>return error"]
D -->|ok| E["RVVAC_DATA_INIT_NEW_PAGE<br/>RVVAC_DATA_SET_LINK<br/>swap cached last_page"]
A -->|no| G["gap blockid?"]
E --> G
G -->|yes| H["set_vacuumed<br/>NULL lsa and mvccids"]
G -->|no| I["copy start_lsa, oldest, newest"]
H --> J["fill slot at index_free<br/>set_last_blockid; index_free++"]
I --> J
J -->|next blockid| A
J -->|buffer drained| K["log final run<br/>commit sysop if open"]
Figure 4-1. The consume loop with the page-full branch. Error exits abort the sysop; every other path converges on the closing append record.
4.5 The redo-only WAL protocol
The closing branch mirrors the page-full prologue: if save_page_free_data < page_free_data, the final run is logged with RVVAC_DATA_APPEND_BLOCKS, then (only now that the new page has data) an open sysop is committed and the page marked dirty (DONT_FREE). The else arm covers the impossible leftover: an open sysop with no appended run is assert (false) but still committed, “don’t leak the sysop”. Three recovery indexes cover the whole path (the RV_fun table in recovery.c confirms which sides exist):

| rcvindex | undo | redo |
|---|---|---|
| RVVAC_DATA_APPEND_BLOCKS | none | vacuum_rv_redo_append_data: bulk-copy run to data + rcv->offset, advance index_free |
| RVVAC_DATA_INIT_NEW_PAGE | none | vacuum_rv_redo_initialize_data_page: re-run vacuum_data_initialize_new_page, restore data->blockid |
| RVVAC_DATA_SET_LINK | vacuum_rv_undoredo_data_set_link | same function: set or NULL next_page |

```cpp
// vacuum_rv_redo_append_data -- src/query/vacuum.c
  int n_blocks = rcv->length / sizeof (VACUUM_DATA_ENTRY);
  // ... condensed: length sanity asserts ...
  assert (rcv->offset == data_page->index_free);   /* <- append-only: offset must equal index_free */
  memcpy (data_page->data + rcv->offset, rcv->data, n_blocks * sizeof (VACUUM_DATA_ENTRY));
  data_page->index_free += n_blocks;
```

Key invariant: vacuum data appends are idempotent-by-position, so redo suffices and undo is never needed. Entries are only written at index_free, the redo record carries the absolute slot offset, and the page LSA decides whether the redo applies. Nothing here needs rollback: registration runs under the vacuum master’s system thread, not a client transaction. If undo existed, a rollback would erase pending work and the corresponding log interval would never be vacuumed, which is the one failure vacuum cannot afford.
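Why positional appends make redo safe to repeat can be shown with a toy page. The sketch below is an illustration under assumptions: PageModel, AppendRedo and redo_append are hypothetical, the LSA is modelled as a single integer, and the page-LSA comparison is folded into the same function although in a real recovery manager that check sits outside the redo handler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct PageModel
{
  std::uint64_t page_lsa = 0;       // LSA of the last change applied to the page
  std::vector<std::int64_t> data;   // blockids in slots [0, index_free)

  std::size_t index_free () const
  {
    return data.size ();
  }
};

struct AppendRedo
{
  std::uint64_t lsa;                // LSA of this redo record
  std::size_t offset;               // absolute slot offset of the run
  std::vector<std::int64_t> run;    // appended blockids
};

// Returns true when the record was applied, false when the page already contains it.
bool redo_append (PageModel &page, const AppendRedo &rec)
{
  if (page.page_lsa >= rec.lsa)
    {
      return false;   // page is already at or past this change
    }
  assert (rec.offset == page.index_free ());   // append-only: offset must equal index_free
  page.data.insert (page.data.end (), rec.run.begin (), rec.run.end ());
  page.page_lsa = rec.lsa;
  return true;
}
```

Replaying the same record twice leaves the page unchanged the second time, and a later record finds index_free exactly where its offset says it should be.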
The single undoredo record, RVVAC_DATA_SET_LINK, exists for the sysop: on abort (allocation failure) its undo runs with rcv->data == NULL, which vacuum_rv_undoredo_data_set_link maps to VPID_SET_NULL (&data_page->next_page) — detaching the half-born page the aborted sysop simultaneously deallocates. Both directions share the function; the NULL-data branch is the undo, the copy branch the redo.
Page initialization is its own redo record because it does more than zero the page:

```cpp
// vacuum_init_data_page_with_last_blockid -- src/query/vacuum.c
  vacuum_data_initialize_new_page (thread_p, data_page);   /* memset, NULL next_page, indexes = 0, ptype */
  data_page->data->blockid = blockid;   /* <- ghost slot: last_blockid survives in data[0] */
  log_append_redo_data2 (thread_p, RVVAC_DATA_INIT_NEW_PAGE, NULL, (PAGE_PTR) data_page, 0,
                         sizeof (blockid), &blockid);
  vacuum_set_dirty_data_page (thread_p, data_page, DONT_FREE);
```

Key invariant: an empty vacuum data page still remembers last_blockid in data[0].blockid. With index_free == 0 the slot is logically dead, yet it carries the high-water mark; vacuum_data::set_last_blockid can therefore stay a plain in-memory setter: it only strips flag bits (VACUUM_BLOCKID_WITHOUT_FLAGS) and debug-asserts the value stays below prior_lsa’s block. At boot, vacuum_data_load_and_recover rebuilds m_last_blockid. For non-empty data it takes the last real entry, last_page->data[index_free - 1].blockid (with a defensive fallback to slot 0, the ghost slot again, should index_free be 0). For empty data it takes MAX (log_Gl.hdr.vacuum_last_blockid, vacuum_Data.last_page->data->blockid), that is, the ghost slot possibly overridden by the header value SA mode advanced before deleting archives. (Side branches: a fresh log with a still-negative logpb_last_complete_blockid () leaves the value untouched; a 10.1-era database with NULL recovery_lsa/mvcc_op_log_lsa takes it directly.) Break the ghost-slot rule and an idle-then-crashed server resumes with a stale last_blockid, re-fills “gaps” for blocks already consumed, and double-registers work.
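The two main boot-time branches for rebuilding last_blockid can be sketched as follows. This is a simplified model: LastPageModel and recover_last_blockid are hypothetical, blockids are assumed to be already stripped of flag bits, and the side branches for a fresh log and for 10.1-era databases are omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of the last page: slot 0 always exists, even when index_free == 0.
struct LastPageModel
{
  std::vector<std::int64_t> slots;   // blockids; slot 0 doubles as the ghost slot
  std::size_t index_free;            // number of live entries
};

std::int64_t recover_last_blockid (const LastPageModel &page, bool data_is_empty,
                                   std::int64_t header_last_blockid)
{
  if (!data_is_empty)
    {
      // Last real entry, falling back to slot 0 if index_free is unexpectedly 0.
      std::size_t idx = page.index_free > 0 ? page.index_free - 1 : 0;
      return page.slots[idx];
    }
  // Empty data: ghost slot, possibly overridden by a larger header value.
  return std::max (header_last_blockid, page.slots[0]);
}
```

An empty page stamped with 40 recovers 40 when the header says 37, and 45 when the header was advanced to 45, so neither source can pull the mark backwards.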
The function ends with VACUUM_VERIFY_VACUUM_DATA plus a page-fix-count check in debug builds, then NO_ERROR.
4.6 Chapter summary — key takeaways
- vacuum_consume_buffer_log_blocks is the only writer of new vacuum data entries, reached through vacuum_data::update (master tick, SA-mode xvacuum), vacuum_recover_lost_block_data (boot), or the vacuum_finalize drain loop; the master forces it whenever buffer or finished-job queue passes half-full.
- The drain loop enforces the dense-monotonic blockid invariant by materializing gap entries (born VACUUMED, NULL LSA) for every blockid with no MVCC ops, except across an empty page where the range may legally restart.
- Page growth is wrapped in a system operation that stays open until real data lands on the new page; allocation failure aborts the sysop, and RVVAC_DATA_SET_LINK’s undo (NULL data → NULL link) detaches the half-born page.
- Everything else is redo-only (RVVAC_DATA_APPEND_BLOCKS, RVVAC_DATA_INIT_NEW_PAGE): appends are positional at index_free, registration belongs to no client transaction, and undo would mean losing vacuum work forever.
- m_last_blockid is volatile but recoverable: from the last entry when data is non-empty, else from the ghost slot data[0].blockid maxed with log_Gl.hdr.vacuum_last_blockid, which is kept fresh at SA-mode shutdown by vacuum_sa_reflect_last_blockid.
- Even the empty-buffer fast path is durable: catching last_blockid up to prior_lsa’s block minus LOG_BLOCK_TRAILING_DIFF (2) re-logs the empty page, after a check–lock–recheck on log_Gl.hdr.does_block_need_vacuum so an in-flight block with MVCC ops is never skipped.
Chapter 5: Eligibility and the Oldest Visible Watermark
Chapter 4 left a block registered as AVAILABLE, which does not mean safe: its records may still be visible to a running transaction. The decision of when it becomes vacuumable is split. The MVCC table (mvcctable, mvcc_table.hpp/.cpp) owns the oldest visible watermark, the MVCCID below which every snapshot agrees version history is settled, and the vacuum master merely consults it, once per iteration, against each entry’s newest_mvccid. The watermark machinery is the MVCC detail document’s territory: cubrid-mvcc-detail.md Chapter 9 (Vacuum Coordination and the Oldest-Visible Watermark, §9.1–9.5) derives every field, the cross-snapshot sweep, and the pin API line by line; the high-level companion (cubrid-vacuum.md, Common DBMS Design → Oldest-visible-MVCCID watermark) explains why one global watermark suffices. Here, 5.1–5.4 keep just enough of the producer side to read the refresh call correctly; 5.5–5.6 fully trace the vacuum-side consumption, which the MVCC doc only sketches.
5.1 The producer side in brief — mvcctable’s watermark fields
mvcctable is the server-wide MVCC bookkeeping object at log_Gl.mvcc_table. Most fields serve snapshot construction (field tables in cubrid-mvcc-detail.md §1.7, §2.5); two exist purely for vacuum: m_oldest_visible, an std::atomic<MVCCID> caching the published watermark so readers pay one atomic load (get_global_oldest_visible), and m_ov_lock_count, an atomic freeze counter that, while nonzero, forbids publishing (5.4). The scan reads two inputs: the global lower bound m_current_status_lowest_active_mvccid (advanced opportunistically from complete_mvcc via advance_oldest_active) and the per-transaction slot array m_transaction_lowest_visible_mvccids. A slot cell is in one of three states:

| Slot value | State | Who sets it | Scan action |
|---|---|---|---|
| MVCCID_NULL (0) | No live snapshot | reset_transaction_lowest_active after LOG_COMMIT; complete_mvcc on rollback | Skip. |
| MVCCID_ALL_VISIBLE (3, storage_common.h) | Handshake sentinel: real value imminent | build_mvcc_info, momentarily | Wait: the value may be lower than anything seen. |
| Normal MVCCID | Snapshot’s lowest_active_mvccid; post-commit, the tran’s own MVCCID | build_mvcc_info; raised by complete_mvcc on commit | Min-merge. |
flowchart LR
subgraph producers["producers (per transaction)"]
BMI["build_mvcc_info<br/>publish snapshot lowest"]
CM["complete_mvcc<br/>commit: raise to own mvccid<br/>rollback: NULL"]
RST["reset_transaction_lowest_active<br/>after LOG_COMMIT: NULL"]
end
SLOTS["m_transaction_lowest_visible_mvccids[ ]"]
GLOW["m_current_status_lowest_active_mvccid"]
COMP["compute_oldest_visible_mvccid"]
UPD["update_global_oldest_visible"]
OV["m_oldest_visible"]
LCK["m_ov_lock_count"]
BMI --> SLOTS
CM --> SLOTS
RST --> SLOTS
CM -- "advance_oldest_active" --> GLOW
SLOTS --> COMP
GLOW --> COMP
COMP --> UPD
LCK -- "gates store" --> UPD
UPD --> OV
OV -- "get_global_oldest_visible" --> READERS["vacuum master / workers / locator DDL"]
Figure 5-1: producers and consumers around the watermark fields of mvcctable.
Invariant (slot durability across commit). A committing transaction’s slot is not cleared at complete_mvcc; it is raised to the transaction’s own MVCCID and stays until the LOG_COMMIT record is written, when log_complete calls reset_transaction_lowest_active. The in-code comment gives the failure otherwise: slot goes NULL early → vacuum cleans the transaction’s modifications → crash before LOG_COMMIT → recovery rolls back a transaction whose garbage is already gone. The watermark can never pass an MVCCID whose commit record is not yet durable.

```cpp
// mvcctable::complete_mvcc -- src/transaction/mvcc_table.cpp
  if (committed)
    {
      /* be sure that transaction modifications can't be vacuumed up to LOG_COMMIT. ... */
      if (tran_lowest_active == MVCCID_NULL || MVCC_ID_PRECEDES (tran_lowest_active, mvccid))
        {
          oldest_active_set (..., mvccid, ...);   /* <- raise, do not clear */
        }
    }
  else
    {
      oldest_active_set (..., MVCCID_NULL, ...);   /* <- rollback clears immediately */
    }
```

5.2 The ALL_VISIBLE handshake, in one paragraph

build_mvcc_info must read the global m_current_status_lowest_active_mvccid and store it into the transaction’s slot. The two steps are not atomic together; if the thread is descheduled between them while the global advances and vacuum refreshes, the stale lower value would land in the slot and force the watermark backwards. The fix is pre-announcement: the slot is first set to MVCCID_ALL_VISIBLE, then the global is read and the real value stored “between next two code lines … no delays” (the in-code comment). The sentinel therefore lives only between two adjacent statements (no I/O, no locks), which is why the scan may simply busy-wait it out (5.3); an edit inserting work between those lines stalls the master’s refresh (latency, not correctness). The full interleaving derivation, with the excerpt, is cubrid-mvcc-detail.md §5.3 (the build_mvcc_info walkthrough); §9.2 of the same doc classifies the sentinel from the scanner’s side. The oldest_active_set/get wrappers also feed Oldest_active_tracker, an 8K debug-build ring with source tags, which is the first stop when a watermark-regression assert fires.
5.3 compute_oldest_visible_mvccid — the two-phase scan
This private const method (reachable only through update_global_oldest_visible) takes the minimum over the global lower bound and every slot:

```cpp
// mvcctable::compute_oldest_visible_mvccid -- src/transaction/mvcc_table.cpp
  MVCCID lowest_active_mvccid = oldest_active_get (m_current_status_lowest_active_mvccid, ...);
  for (size_t idx = 0; idx < m_transaction_lowest_visible_mvccids_size; idx++)   /* phase 1 */
    {
      loaded_tran_mvccid = oldest_active_get (m_transaction_lowest_visible_mvccids[idx], ...);
      if (loaded_tran_mvccid == MVCCID_ALL_VISIBLE)
        {
          waiting_mvccids_pos.append (idx);   /* <- defer: real value imminent */
        }
      else if (loaded_tran_mvccid != MVCCID_NULL
               && MVCC_ID_PRECEDES (loaded_tran_mvccid, lowest_active_mvccid))
        {
          lowest_active_mvccid = loaded_tran_mvccid;   /* <- new minimum */
        }
    }
  size_t retry_count = 0;
  while (waiting_mvccids_pos.get_size () > 0)   /* phase 2: drain stragglers */
    {
      ++retry_count;
      if (retry_count % 20 == 0)
        {
          thread_sleep (10);   /* <- back off 10 ms every 20 spins */
        }
      // ... condensed: re-read each waiting slot; still ALL_VISIBLE -> keep waiting;
      //     resolved -> min-merge like phase 1, erase from set ...
    }
  assert (MVCCID_IS_NORMAL (lowest_active_mvccid));
  return lowest_active_mvccid;
```

Phase 1 routes each slot down exactly one of three paths: sentinel → deferred, valid-and-lower → new minimum, MVCCID_NULL or not-lower → ignored. Phase 2 spins on the deferred set (short, by sentinel transience, 5.2), sleeping 10 ms every 20th retry so a descheduled snapshotter doesn’t burn a core; a slot resolving to MVCCID_NULL mid-wait is a transaction that finished and stops constraining. The final assert holds because the starting candidate is always a normal MVCCID. (cubrid-mvcc-detail.md §9.2.1 walks the reverse-iterated erase and the perf counters.)
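The two-phase scan can be reproduced over a plain vector of atomics. This is a minimal sketch, not the mvcctable implementation: compute_oldest_visible, MvccId and the constants are stand-ins, MVCCIDs are assumed to compare as plain integers, and the debug tracker and perf counters are omitted.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

using MvccId = std::uint64_t;
constexpr MvccId ID_NULL = 0;          // mirrors MVCCID_NULL
constexpr MvccId ID_ALL_VISIBLE = 3;   // mirrors MVCCID_ALL_VISIBLE

MvccId compute_oldest_visible (MvccId global_lowest_active, const std::vector<std::atomic<MvccId>> &slots)
{
  MvccId lowest = global_lowest_active;
  std::vector<std::size_t> waiting;

  // Min-merge one resolved slot value; NULL slots do not constrain.
  auto merge = [&lowest] (MvccId v) {
    if (v != ID_NULL && v < lowest)
      {
        lowest = v;
      }
  };

  for (std::size_t i = 0; i < slots.size (); i++)   // phase 1: single pass
    {
      MvccId v = slots[i].load ();
      if (v == ID_ALL_VISIBLE)
        {
          waiting.push_back (i);   // real value imminent: defer
        }
      else
        {
          merge (v);
        }
    }

  std::size_t retry_count = 0;
  while (!waiting.empty ())   // phase 2: wait out the sentinels
    {
      if (++retry_count % 20 == 0)
        {
          std::this_thread::sleep_for (std::chrono::milliseconds (10));
        }
      for (std::size_t k = waiting.size (); k-- > 0;)   // reverse order keeps earlier indexes valid
        {
          MvccId v = slots[waiting[k]].load ();
          if (v != ID_ALL_VISIBLE)
            {
              merge (v);
              waiting.erase (waiting.begin () + static_cast<std::ptrdiff_t> (k));
            }
        }
    }
  return lowest;
}
```

Like the original, this sketch never returns while a slot stays at the sentinel, so it inherits the same dependence on the sentinel being short-lived.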
5.4 update_global_oldest_visible and the freeze counter
The scan result is published only when nobody has frozen the watermark:

```cpp
// mvcctable::update_global_oldest_visible -- src/transaction/mvcc_table.cpp
  if (m_ov_lock_count == 0)   /* <- cheap pre-check: skip scan if frozen */
    {
      MVCCID oldest_visible = compute_oldest_visible_mvccid ();
      if (m_ov_lock_count == 0)   /* <- re-check: a locker may have arrived mid-scan */
        {
          assert (m_oldest_visible.load () <= oldest_visible);   /* <- monotonicity */
          m_oldest_visible.store (oldest_visible);
        }
    }
  return m_oldest_visible.load ();   /* <- always returns the published value */
```

Three outcomes: frozen on entry → return cached; freeze acquired mid-scan → discard the computed value, return cached; unfrozen throughout → assert monotonic, publish, return. get_global_oldest_visible is the read-only twin; lock_global_oldest_visible / unlock_global_oldest_visible just move m_ov_lock_count (full pin API in cubrid-mvcc-detail.md §9.4).
Who locks, and why. Both lockers go through log_tdes::lock_global_oldest_visible_mvccid (log_tran_table.c), idempotent per transaction via the TDES flag block_global_oldest_active_until_commit. The matching unlock is not in the MVCC completion path but in the log one: log_complete and log_complete_for_2pc (log_manager.c) call unlock_global_oldest_visible_mvccid after the completion record — on commit and abort alike, immediately before the commit path’s reset_transaction_lowest_active — so no count leaks. The two lockers in the tree, xlocator_upgrade_instances_domain and redistribute_partition_data (locator_sr.c), share a pattern: lock, read get_global_oldest_visible () as threshold_mvccid, then run inline cleanup via heap_vacuum_all_objects on pages they are about to rewrite. The freeze keeps the system-wide watermark from advancing past that captured threshold mid-operation — otherwise real vacuum could clean the same heap with a newer threshold and disagree about which versions exist. (§9.7 of the MVCC doc discusses the cost: one small DDL pins vacuum globally.)
Invariant (watermark monotonicity).
m_oldest_visible never decreases. This is enforced by the commit-slot rule (5.1), the handshake (5.2), and the assert above. If it decreased, an already-dispatched job, which judged eligibility against the higher value, would be cleaning versions some snapshot still needs.
One residual subtlety: check–compute–recheck is not atomic with the store; a locker arriving between the second check and the store still sees one watermark move, by a value computed entirely before its lock. In-tree callers tolerate this (they read their threshold only after locking), but a caller assuming a strict “no store after lock returns” fence would be wrong.
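A minimal sketch of the check, compute, recheck pattern may make the three outcomes easier to see. `watermark_sketch` is a hypothetical class, and the `scan` callback stands in for compute_oldest_visible_mvccid; neither is taken from the CUBRID sources.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>

using MVCCID = std::uint64_t;

// Hypothetical stand-in for the watermark part of mvcctable.
class watermark_sketch
{
  public:
    explicit watermark_sketch (MVCCID initial)
      : m_oldest_visible (initial), m_ov_lock_count (0)
    {
    }

    void lock () { ++m_ov_lock_count; }
    void unlock () { assert (m_ov_lock_count.load () > 0); --m_ov_lock_count; }

    // Read-only twin: never recomputes.
    MVCCID get () const { return m_oldest_visible.load (); }

    // 'scan' plays the role of compute_oldest_visible_mvccid ().
    MVCCID update (const std::function<MVCCID ()> &scan)
    {
      if (m_ov_lock_count.load () == 0)      // pre-check: skip the scan if frozen
        {
          MVCCID computed = scan ();
          if (m_ov_lock_count.load () == 0)  // re-check: a locker may have arrived mid-scan
            {
              assert (m_oldest_visible.load () <= computed);  // monotonicity
              m_oldest_visible.store (computed);
            }
        }
      return m_oldest_visible.load ();       // always the published value
    }

  private:
    std::atomic<MVCCID> m_oldest_visible;
    std::atomic<int> m_ov_lock_count;
};
```

The sketch also shares the residual window noted above: nothing makes the second check and the store a single atomic step.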
5.5 The consumer: master snapshot and the eligibility gate
The master refreshes and snapshots the watermark exactly once per iteration into the member m_oldest_visible_mvccid (declared in vacuum.c as “saved oldest visible mvccid (recomputed on each iteration)”):
```cpp
// vacuum_master_task::execute -- src/query/vacuum.c
m_oldest_visible_mvccid = log_Gl.mvcc_table.update_global_oldest_visible ();
// ... condensed: first-run data load, page flushes, force_data_update ...
for (; m_cursor.is_valid () && !should_interrupt_iteration (); m_cursor.increment_blockid ())
  {
    if (!is_cursor_entry_ready_to_vacuum ())
      {
        // next entries cannot be ready if current entry is not ready; stop this iteration
        break;  /* <- break, not continue */
      }
    if (!is_cursor_entry_available ())
      {
        continue;  /* <- vacuumed or in-progress: try next */
      }
    start_job_on_cursor_entry ();
    // ... condensed ...
  }
```

The master is the only steady-state caller of update_global_oldest_visible: vacuum_data_load_and_recover calls it once at boot, and the only other call, in vacuum_boot, sits in the vacuum-disabled early return and is marked “for debug only”; everyone else reads. One snapshot per iteration gives all entries of a pass a consistent yardstick. The gate has exactly two rejection branches:
```cpp
// vacuum_master_task::is_cursor_entry_ready_to_vacuum -- src/query/vacuum.c
assert (m_cursor.is_valid ());
if (m_cursor.get_current_entry ().newest_mvccid >= m_oldest_visible_mvccid)
  {
    // if entry newest MVCCID is still visible, it cannot be vacuumed
    // ... condensed: vacuum_er_log ...
    return false;  /* <- visibility gate */
  }
if (m_cursor.get_current_entry ().start_lsa.pageid + 1 >= log_Gl.append.prev_lsa.pageid)
  {
    // too close to end of log; let more log be appended before trying to vacuum the block
    // ... condensed: vacuum_er_log ...
    return false;  /* <- log-tail proximity gate */
  }
return true;
```

The first branch is the heart of eligibility: the block recorded at logging time (Chapter 3) the highest MVCCID among its operations as newest_mvccid; once even that falls strictly below the watermark, every operation in the block is settled for all current and future snapshots. The >= holds back blocks whose newest_mvccid equals the watermark, because that MVCCID may itself still be active. The second branch is not MVCC at all: it refuses blocks overlapping the log append head, whose pages are still being written (Chapter 7). The caller’s break-not-continue is justified by registration order (Chapter 4): blocks enter vacuum data in log order, so if this one is not ready, none after it can be. The third decision, is_cursor_entry_available, is state bookkeeping rather than eligibility: it skips entries already vacuumed or with a job in flight (Chapter 6).
Invariant (settled-deleter guarantee). Because newest_mvccid bounds every MVCCID logged in the block, and the watermark lower-bounds every live snapshot’s lowest_active_mvccid, vacuum never processes a record whose inserter or deleter could still be invisible to any snapshot, whether current or, by monotonicity (5.4), future. Every MVCCID the worker meets from an admitted block is either committed-and-globally-visible or rolled back; mvcc_satisfies_vacuum (Chapter 8) never has to ask “is this still in doubt?” for an admitted block’s own operations.
Two different “oldest visible” values now coexist, and confusing them is the classic reader error. The entry’s stored oldest_visible_mvccid is the watermark as of logging time — vacuum_data_entry’s header constructor copies it from log_Gl.hdr.oldest_visible_mvccid, and the delegated constructor asserts oldest <= newest. It plays no role in eligibility, which tests newest_mvccid against the current watermark; the recorded value serves bookkeeping (vacuum_consume_buffer_log_blocks asserts it never exceeds get_global_oldest_visible (), vacuum_verify_vacuum_data_debug asserts entries ascend across neighbors) and drives the trailing edge (5.6). Workers do not reuse the master’s snapshot either: vacuum_process_log_block re-reads get_global_oldest_visible () as its threshold_mvccid at job start — by monotonicity only ever newer than the value that qualified the block.
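The gate and the point about the two "oldest visible" values can be condensed into a small sketch. `entry_sketch`, `gate_result`, and `check_ready` are hypothetical names, and plain integer comparison is assumed in place of the MVCCID comparison macros.

```cpp
#include <cassert>
#include <cstdint>

using MVCCID = std::uint64_t;
using LOG_PAGEID = std::int64_t;

enum class gate_result { READY, STILL_VISIBLE, TOO_CLOSE_TO_LOG_TAIL };

// Hypothetical subset of a vacuum data entry.
struct entry_sketch
{
  MVCCID oldest_visible_mvccid;  // watermark as of logging time; not consulted by the gate
  MVCCID newest_mvccid;          // highest MVCCID logged in the block
  LOG_PAGEID start_pageid;       // page of the block's start_lsa
};

inline gate_result
check_ready (const entry_sketch &entry, MVCCID watermark_snapshot, LOG_PAGEID append_prev_pageid)
{
  if (entry.newest_mvccid >= watermark_snapshot)
    {
      return gate_result::STILL_VISIBLE;          // visibility gate
    }
  if (entry.start_pageid + 1 >= append_prev_pageid)
    {
      return gate_result::TOO_CLOSE_TO_LOG_TAIL;  // log-tail proximity gate
    }
  return gate_result::READY;
}
```

Changing only the stored oldest_visible_mvccid of an entry never changes the verdict, which is the distinction the paragraph above draws.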
5.6 The trailing edge: vacuum_Data.oldest_unvacuumed_mvccid
The watermark is the leading edge: nothing newer may be touched. vacuum_data::oldest_unvacuumed_mvccid (“Global oldest MVCCID not vacuumed (yet)”) is the trailing edge: everything strictly older is guaranteed clean. It is maintained, not computed:
```cpp
// vacuum_data::set_oldest_unvacuumed_on_boot -- src/query/vacuum.c
if (!log_Gl.hdr.does_block_need_vacuum)
  {
    // log_Gl.hdr.oldest_visible_mvccid may not remain uninitialized
    log_Gl.hdr.oldest_visible_mvccid = log_Gl.hdr.mvcc_next_id;  /* <- no pending block: seed */
  }
if (vacuum_Data.is_empty ())
  {
    oldest_unvacuumed_mvccid = log_Gl.hdr.oldest_visible_mvccid;
  }
else
  {
    oldest_unvacuumed_mvccid = first_page->data[0].oldest_visible_mvccid;  /* <- first = oldest */
    assert (oldest_unvacuumed_mvccid <= log_Gl.hdr.oldest_visible_mvccid);
  }
```

Called once from vacuum_data_load_and_recover at boot (Chapter 11), this covers all three boot shapes: seed an uninitialized header watermark, then either vacuum data is empty (trailing edge = header watermark) or the first (oldest) entry’s recorded oldest_visible_mvccid bounds everything undone. Thereafter vacuum_data::update (the master’s housekeeping pass, Chapter 9) advances it after marking finished jobs and consuming new blocks, again from the first remaining entry (skipped while vacuum data is empty), through a deliberately one-directional setter:
```cpp
// vacuum_data::upgrade_oldest_unvacuumed -- src/query/vacuum.c
assert (oldest_unvacuumed_mvccid <= mvccid);  /* <- "upgrade": may only move forward */
oldest_unvacuumed_mvccid = mvccid;
```

Invariant (ascending entry oldest). Entries in vacuum data carry non-decreasing oldest_visible_mvccid (vacuum_verify_vacuum_data_debug asserts this across neighbors, plus oldest_unvacuumed_mvccid <= entry->oldest_visible_mvccid for every live entry). Blocks are consumed in log order and the watermark is monotonic, so the first entry is always the global minimum: “first entry’s oldest” is a correct trailing edge without scanning. Break the ordering and upgrade_oldest_unvacuumed’s assert fires or, where the assert is compiled out, the trailing edge silently overtakes unvacuumed work.
One writer exists outside the steady path: SA-mode full vacuum (xvacuum, Chapter 11) assigns log_Gl.hdr.mvcc_next_id directly after running every job and logs RVVAC_COMPLETE, whose redo handler vacuum_rv_redo_vacuum_complete replays the assignment. The inverse query rounds out the picture:
```cpp
// vacuum_is_mvccid_vacuumed -- src/query/vacuum.c
if (id < vacuum_Data.oldest_unvacuumed_mvccid)  /* <- strictly older than trailing edge */
  {
    return true;
  }
return false;
```

Its consumers sit on the storage side. On the operational side, xheap_reclaim_addresses (heap_file.c) deallocates a heap page only if the page’s max MVCCID is already vacuumed, and heap_page_update_chain_after_mvcc_op uses the same test to resolve a page’s HEAP_PAGE_VACUUM_UNKNOWN status. The rest are diagnostics: mvcc_satisfies_snapshot and mvcc_satisfies_vacuum (mvcc.c) classify perfmon counters as PERF_SNAPSHOT_..._LOST when a record that should have been vacuumed is still encountered, and btree_prepare_bts (btree.c) disables its check_not_vacuumed checker while the index’s creator MVCCID is not yet vacuumed. The trailing edge is also consumed directly, without the wrapper: vacuum_cleanup_dropped_files drops ledger entries via MVCC_ID_PRECEDES (entry mvccid, oldest_unvacuumed_mvccid), since no future job can ask about them (Chapter 10), and the debug checker is_not_vacuumed_and_lost runs mvcc_satisfies_vacuum against oldest_unvacuumed_mvccid to flag versions that should be gone but still exist. The two edges bracket the system: oldest_unvacuumed_mvccid <= entry oldest <= m_oldest_visible. Everything is clean below the first, untouchable above the second, and vacuum’s whole job is the band between.
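The trailing edge can be modelled in a few lines, assuming it is refreshed from the first remaining entry exactly as described. `trailing_edge_sketch` and `update_from_entries` are hypothetical names; the vector holds each remaining entry's stored oldest_visible_mvccid in log order.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using MVCCID = std::uint64_t;

// Hypothetical model of the trailing edge kept in vacuum data.
struct trailing_edge_sketch
{
  MVCCID oldest_unvacuumed_mvccid;

  // One-directional setter: the edge may only move forward.
  void upgrade (MVCCID mvccid)
  {
    assert (oldest_unvacuumed_mvccid <= mvccid);
    oldest_unvacuumed_mvccid = mvccid;
  }

  // Housekeeping step: take the first (oldest) remaining entry; skip when empty.
  void update_from_entries (const std::vector<MVCCID> &entry_oldest_visible)
  {
    if (!entry_oldest_visible.empty ())
      {
        upgrade (entry_oldest_visible.front ());
      }
  }

  // Inverse query: strictly older than the edge means already vacuumed.
  bool is_vacuumed (MVCCID id) const
  {
    return id < oldest_unvacuumed_mvccid;
  }
};
```

Because the entries ascend, reading only the front element is enough; no scan over the remaining entries is needed.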
5.7 Chapter summary — key takeaways
- Eligibility is decided in two places: mvcctable computes and publishes the watermark (m_oldest_visible); the master compares each entry’s newest_mvccid against a once-per-iteration snapshot (m_oldest_visible_mvccid). The mvcctable internals are fully derived in cubrid-mvcc-detail.md Chapter 9; this chapter owns the vacuum-side consumption.
- The scan’s inputs are per-transaction slots encoding three states: MVCCID_NULL (ignore), MVCCID_ALL_VISIBLE (value imminent, so wait), and a real lowest-active MVCCID (min-merge). compute_oldest_visible_mvccid min-merges resolved slots, then busy-waits the sentinels out (10 ms back-off per 20 retries).
- Monotonicity is load-bearing: committed transactions keep their slot until LOG_COMMIT (cleared by log_complete via reset_transaction_lowest_active), the handshake prevents stale-low publishes, and update_global_oldest_visible asserts never-lower.
- m_ov_lock_count freezes the watermark until completion: locked through log_tdes::lock_global_oldest_visible_mvccid by xlocator_upgrade_instances_domain and redistribute_partition_data (so their inline heap_vacuum_all_objects threshold stays valid), unlocked in log_complete / log_complete_for_2pc on commit and abort alike; the double-check honors the freeze but leaves a narrow, currently-benign store-after-lock window.
- The gate is_cursor_entry_ready_to_vacuum has exactly two rejections (newest_mvccid >= watermark and log-tail proximity), and a failure breaks the iteration because blocks register in log order. The settled-deleter invariant follows: no admitted block contains an MVCCID any snapshot still holds in doubt.
- An entry’s stored oldest_visible_mvccid (captured at logging time) never gates eligibility; newest_mvccid vs the live watermark does. The stored value drives the trailing edge (set_oldest_unvacuumed_on_boot, upgrade_oldest_unvacuumed) via the ascending-entry invariant, and vacuum_is_mvccid_vacuumed (id < oldest_unvacuumed_mvccid) serves heap-page reclamation and consistency checkers, completing the bracket around vacuum’s working band.
Chapter 6: Master Dispatch and the Job Cursor
A block is registered in vacuum data (Chapter 4) and eligible under the watermark (Chapter 5). This chapter traces how the master finds it and hands it to a worker without losing its place. vacuum_boot wires the two halves: a worker pool (vacuum_Worker_threads, sized by PRM_ID_VACUUM_WORKER_COUNT) and vacuum_Master_daemon, whose looper runs one vacuum_master_task::execute pass every PRM_ID_VACUUM_MASTER_WAKEUP_INTERVAL milliseconds. Between passes, vacuum_job_cursor preserves the iteration position across vacuum-data mutation. All code is from src/query/vacuum.c, SERVER_MODE only (standalone is Chapter 11).
6.1 The three structs
vacuum_master_task is the cubthread::entry_task run on each wakeup. vacuum_boot constructs it exactly once (new vacuum_master_task ()) and hands it to create_daemon, which re-runs the same instance every interval, so its members carry state across wakeups.
| Field | Role |
|---|---|
| m_cursor (vacuum_job_cursor) | Iteration position; resumes where the previous wakeup stopped |
| m_oldest_visible_mvccid (MVCCID) | Watermark snapshot, recomputed once per execute, held stable for the pass (Chapter 5) |
| m_outstanding_job_count (std::size_t) | Jobs pushed but not yet seen finished; master-thread-only, lock-free; an estimate reconciled at finished-queue drains (6.5) |
vacuum_job_cursor is the struct of this chapter. Its header comment states the contract: the blockid is the real progress indicator; the page/index pair is a refixable cache, since data maintenance can relocate an entry to a different page.
| Field | Role |
|---|---|
| m_blockid (VACUUM_LOG_BLOCKID) | Logical, flag-free position: dense, monotonic, stable across page unlink/append; a bare (page, index) would dangle when vacuum_Data.update () removes a head page |
| m_page (VACUUM_DATA_PAGE *) | Page containing m_blockid, or NULL (the “unloaded” state); caches the page fix for the hot loop |
| m_index (INT16) | Slot of m_blockid in m_page->data[], or INDEX_NOT_FOUND (-1); doubles as the WAL record offset in 6.4 |
Unloaded (m_page == NULL) means only the blockid is meaningful: is_valid () is false and the dispatch loop does not run. Loaded means page fixed, slot valid, get_current_entry () returns m_page->data[m_index]. The destructor asserts m_page == NULL: whoever loads must unload.
vacuum_worker_task is the job descriptor handed to the pool. Its single field m_data is a by-value copy of the entry — the worker must not touch the live data page, which can be relocated or unlinked mid-job.
```cpp
// vacuum_worker_task -- src/query/vacuum.c
class vacuum_worker_task : public cubthread::entry_task
{
  public:
    vacuum_worker_task (const VACUUM_DATA_ENTRY & entry_ref)
      : m_data (entry_ref)  /* <- copy, not reference */
    {
    }

    void execute (cubthread::entry & thread_ref) final
    {
      vacuum_process_log_block (&thread_ref, &m_data, false);  /* <- Chapter 7; assert elided */
    }

  private:
    vacuum_worker_task ();  /* <- private: no entry, no task */

    VACUUM_DATA_ENTRY m_data;
};
```

6.2 vacuum_master_task::execute — the per-wakeup pass

```cpp
// vacuum_master_task::execute -- src/query/vacuum.c
void
vacuum_master_task::execute (cubthread::entry &thread_ref)
{
  if (prm_get_bool_value (PRM_ID_DISABLE_VACUUM)) { return; }
  if (check_shutdown ()) { return; }
  if (!BO_IS_SERVER_RESTARTED ()) { return; }  /* <- boot not finished or aborted */
  // ... condensed: perf tracker, pgbuf_thread_variables_init ...

  m_oldest_visible_mvccid = log_Gl.mvcc_table.update_global_oldest_visible ();

  if (!vacuum_Data.is_loaded)
    {
      vacuum_data_load_first_and_last_page (&thread_ref);  /* <- lazy: master never commits */
      m_cursor.set_on_vacuum_data_start ();
    }
  // ... condensed: pgbuf_flush_if_requested on first_page and last_page ...

  decrease_outstanding_job (m_cursor.force_data_update ());  /* <- the unconditional tick */

  for (; m_cursor.is_valid () && !should_interrupt_iteration (); m_cursor.increment_blockid ())
    {
      if (!is_cursor_entry_ready_to_vacuum ()) { break; }  /* <- entries are blockid-ordered; later ones cannot be ready either */
      if (!is_cursor_entry_available ()) { continue; }  /* <- already vacuumed or in progress; skip */
      start_job_on_cursor_entry ();
      if (should_force_data_update ())
        {
          decrease_outstanding_job (m_cursor.force_data_update ());
        }
    }
  m_cursor.unload ();
  // ... condensed: debug-build (!NDEBUG) fix-count verification, perf timer ...
}
```

Walkthrough, branch by branch:
- Early-outs. Vacuum disabled; shutdown; boot not finished (dispatching against half-restored data would be unsound). check_shutdown () delegates to vacuum_shutdown_sequence::check_shutdown_request (), a three-state handshake (NO_SHUTDOWN to SHUTDOWN_REQUESTED to SHUTDOWN_REGISTERED): the master registers under m_state_mutex and wakes the requester with notify_one; the requester blocks in request_shutdown until registration, so a shutdown request is never missed by a sleeping or mid-pass master.
- Watermark refresh, once per wakeup.
- Lazy data load on the first pass: boot’s commit would unfix the boundary pages and the master never commits, so the master loads them itself. set_on_vacuum_data_start parks the cursor blockid on the first blockid without loading a page.
- The unconditional update tick. m_cursor.force_data_update () (6.3) drains the finished-job queue and consumes the block buffer (vacuum_Data.update ()), then re-positions and re-loads the cursor; its return value, the number of jobs marked finished, feeds decrease_outstanding_job (6.5).
- The cursor loop, three exits and one skip:
  - Not ready → break: newest_mvccid >= m_oldest_visible_mvccid (still visible) or start_lsa.pageid + 1 >= log_Gl.append.prev_lsa.pageid (too close to the log tail). Both are monotone over blockid order, so no later entry can pass; break bypasses the for-increment, and the next wakeup retries the same blockid. This is the nothing-eligible idle path: a quiet pass costs one watermark refresh, one tick, one entry probe.
  - Not available → continue, which does increment (6.4).
  - Dispatch (6.4), then, if should_force_data_update () reports that vacuum_Finished_job_queue or vacuum_Block_data_buffer is_half_full (), a mid-loop tick, which the cursor survives via readjust_to_vacuum_data_changes (6.3).
  - Loop condition false → exit: cursor invalid (all data consumed, or search found nothing), or should_interrupt_iteration () is true. The latter means shutdown, or the pool-full backoff when m_outstanding_job_count reaches VACUUM_MAX_TASKS_IN_WORKER_POOL = 3 * PRM_ID_VACUUM_WORKER_COUNT; three queued tasks per worker bounds the backlog until a later tick reconciles the count.
- m_cursor.unload () unfixes the cursor page but keeps the blockid, so the next wakeup’s tick resumes exactly there: cursor persistence across sleeps.
Invariant — the cursor page is never held across a sleep or a data update. execute ends with unload (); force_data_update unloads before vacuum_Data.update (). Enforced by assert (m_page == NULL) in the destructor and in search, plus debug-build vacuum_verify_vacuum_data_page_fix_count after every pass. Otherwise update () could unlink or deallocate a page the cursor still has fixed.
6.3 The cursor: keeping a place in moving data
All position changes funnel through change_blockid: it asserts the forward-only rule (assert (m_blockid <= blockid)), then either detects exhaustion (m_blockid > vacuum_Data.get_last_blockid (), asserted to be exactly last + 1) and calls unload (), or reloads. increment_blockid is change_blockid (m_blockid + 1). reload is the cheap path: if a page is fixed and get_index_of_blockid finds the new blockid in it, only m_index moves, which is the common case since one page holds many consecutive entries. Otherwise it unloads and falls through to search; load (used after updates) asserts !is_loaded () and calls search directly:
```cpp
// vacuum_job_cursor::search -- src/query/vacuum.c
void
vacuum_job_cursor::search ()
{
  assert (m_page == NULL);

  vacuum_data_page *data_page = vacuum_Data.first_page;
  while (true)
    {
      m_index = data_page->get_index_of_blockid (m_blockid);
      if (m_index != vacuum_data_page::INDEX_NOT_FOUND)
        {
          m_page = data_page;  /* <- found: keep the fix */
          return;
        }
      VPID next_vpid = data_page->next_page;
      vacuum_unfix_data_page (&cubthread::get_entry (), data_page);
      if (VPID_ISNULL (&next_vpid))
        {
          return;  /* <- not found: cursor stays unloaded */
        }
      data_page = vacuum_fix_data_page (&cubthread::get_entry (), &next_vpid);
    }
}
```

The per-page probe is O(1): entries within a page are consecutive blockids, so vacuum_data_page::get_index_of_blockid computes (blockid - first_blockid) + index_unvacuumed after an emptiness check and two range checks. The not-found exit leaves m_page == NULL, hence is_valid () false, which is also how empty vacuum data terminates dispatch without a special case. The fix/unfix calls are vacuum-specific macros: first_page and last_page stay fixed for the master’s entire uptime, so vacuum_fix_data_page returns the cached pointer for a boundary page and vacuum_unfix_data_page skips the real pgbuf_unfix for them. Hence search may start from vacuum_Data.first_page without a fix call, and the debug fix-count check expects exactly those two permanent fixes after unload ().
force_data_update brackets the mutation: unload () (can’t be loaded while updating), mark_finished = vacuum_Data.update (), readjust_to_vacuum_data_changes (), load (), return mark_finished. The readjustment handles the update removing the cursor’s blockid: vacuum_data_mark_finished (Chapter 9) trims vacuumed head entries and may unlink head pages, so the first blockid can leap past m_blockid; the cursor “was left behind” and jumps to the new first blockid (on empty data it does nothing — load () finds nothing). Relocation of the same blockid to a different page needs no code: load () simply re-searches from the new first page.
Invariant — the cursor blockid is monotonically non-decreasing. Enforced by the change_blockid assert; readjust only moves forward. A backward move would re-visit dispatched entries and break the arithmetic assumptions in get_index_of_blockid for finished-and-removed blockids.
flowchart TD
A["change_blockid(b)"] --> B{"b > last_blockid?"}
B -- yes --> C["unload: all data consumed"]
B -- no --> D{"m_page fixed and<br/>b still in this page?"}
D -- yes --> G["move m_index only (hot path)"]
D -- no --> I["unload + search"]
I --> J{"probe page: get_index_of_blockid"}
J -- found --> K["m_page = page, keep fix"]
J -- "not found, next_page null" --> L["stay unloaded -> is_valid false"]
J -- "not found, has next" --> M["unfix, fix next page"] --> J
Figure 6-1: cursor relocation — change_blockid, reload, and search branches.
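The probe arithmetic and the page-chain walk can be tried out with a small model. This is a sketch under assumptions: `page_sketch` and `cursor_sketch` are hypothetical, `index_free` (one past the last used slot) is an assumed field name, `first_blockid` is taken to be the blockid stored at slot `index_unvacuumed`, and a vector index replaces the fixed page pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using BLOCKID = std::int64_t;

// Hypothetical page model: slots [index_unvacuumed, index_free) hold consecutive blockids.
struct page_sketch
{
  static constexpr int INDEX_NOT_FOUND = -1;

  BLOCKID first_blockid;  // assumed: blockid stored at slot index_unvacuumed
  int index_unvacuumed;
  int index_free;         // assumed name: one past the last used slot

  int get_index_of_blockid (BLOCKID blockid) const
  {
    if (index_unvacuumed == index_free)
      {
        return INDEX_NOT_FOUND;  // emptiness check
      }
    if (blockid < first_blockid)
      {
        return INDEX_NOT_FOUND;  // range check: before this page
      }
    BLOCKID last_blockid = first_blockid + (index_free - index_unvacuumed) - 1;
    if (blockid > last_blockid)
      {
        return INDEX_NOT_FOUND;  // range check: after this page
      }
    return static_cast<int> (blockid - first_blockid) + index_unvacuumed;
  }
};

// Hypothetical cursor: page == -1 plays the role of m_page == NULL (unloaded).
struct cursor_sketch
{
  BLOCKID blockid = 0;
  int page = -1;
  int index = page_sketch::INDEX_NOT_FOUND;

  bool is_valid () const { return page >= 0; }

  // Walk the chain from the first page, one O(1) probe per page.
  void search (const std::vector<page_sketch> &chain)
  {
    page = -1;
    for (std::size_t i = 0; i < chain.size (); ++i)
      {
        index = chain[i].get_index_of_blockid (blockid);
        if (index != page_sketch::INDEX_NOT_FOUND)
          {
            page = static_cast<int> (i);
            return;
          }
      }
  }
};
```

A not-found search leaves the cursor unloaded, so empty data and an exhausted cursor both fall out of the same `is_valid ()` test.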
6.4 Starting a job: flag, WAL, copy, push
is_cursor_entry_available gates on the status bits packed into the top of the 64-bit entry blockid (Chapter 1): is_available () passes; otherwise the entry is asserted to be is_vacuumed () (done, awaiting removal) or is_job_in_progress () (a worker owns it, or a pre-crash run left it flagged) and is skipped. For an available entry, start_job_on_cursor_entry marks the entry in the cursor’s page, pushes a vacuum_worker_task built from m_cursor.get_current_entry (), and calls increase_outstanding_job (). The marking step:
```cpp
// vacuum_job_cursor::start_job_on_current_entry -- src/query/vacuum.c
void
vacuum_job_cursor::start_job_on_current_entry () const
{
  assert (is_valid ());

  cubthread::entry *thread_p = &cubthread::get_entry ();
  vacuum_data_entry &entry = m_page->data[m_index];
  entry.set_job_in_progress ();  /* <- status bits -> IN_PROGRESS, in the page itself */
  if (!entry.was_interrupted ())
    {
      /* Log that a new job is starting. After recovery, the system will then know this job was
       * partially executed. */
      LOG_DATA_ADDR addr { NULL, (PAGE_PTR) m_page, (PGLENGTH) m_index };
      log_append_redo_data (thread_p, RVVAC_START_JOB, &addr, 0, NULL);
    }
  vacuum_set_dirty_data_page_dont_free (thread_p, m_page);
}
```

The entry is flagged IN_PROGRESS in the data page and the RVVAC_START_JOB redo record is appended before the task is pushed. The record carries zero data bytes: the page pointer plus the entry index (the record’s offset) are the payload. Its redo function, vacuum_rv_redo_start_job, replays set_job_in_progress () on data[rcv->offset] and dirties the page.
Invariant — a job is WAL-marked started before any worker can act on it, and a block is in the worker pool at most once at any time. Enforced by the flag-then-log-then-push ordering in one dispatch call (the source comment notes logging happens here to avoid re-latching vacuum data later) plus the is_job_in_progress () skip above. On a mid-job crash, recovery replays RVVAC_START_JOB, and Chapter 11’s restore pass converts IN_PROGRESS to AVAILABLE plus INTERRUPTED, so the block is re-vacuumed in safe mode; the !was_interrupted () guard skips re-logging such re-dispatched blocks. The worker task’s by-value copy carries the just-set IN_PROGRESS and any INTERRUPTED bit, which vacuum_process_log_block consults for safe-mode redo (Chapter 7).
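The flag, log, push ordering and the at-most-once property can be exercised in a toy model. Everything here is hypothetical: `dispatcher_sketch` uses a vector of slot indexes as its "WAL" and a `std::queue` as its "worker pool", and `restore_after_crash` imitates the restore pass that the text attributes to Chapter 11.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

enum class job_status { AVAILABLE, IN_PROGRESS, VACUUMED };

// Hypothetical entry: the real status bits live in the top of the blockid.
struct entry_sketch
{
  std::int64_t blockid;
  job_status status;
  bool interrupted;
};

struct dispatcher_sketch
{
  std::vector<int> wal;           // slot index of each logged "start job" record
  std::queue<entry_sketch> pool;  // by-value copies handed to workers

  bool dispatch (std::vector<entry_sketch> &page, int index)
  {
    entry_sketch &entry = page[index];
    if (entry.status != job_status::AVAILABLE)
      {
        return false;                         // vacuumed or in progress: skip
      }
    entry.status = job_status::IN_PROGRESS;   // 1. flag, in the page itself
    if (!entry.interrupted)
      {
        wal.push_back (index);                // 2. log, unless this is a re-dispatch
      }
    pool.push (entry);                        // 3. push a copy carrying the flags
    return true;
  }
};

// Restore pass after a crash: IN_PROGRESS becomes AVAILABLE plus INTERRUPTED.
inline void
restore_after_crash (std::vector<entry_sketch> &page)
{
  for (entry_sketch &entry : page)
    {
      if (entry.status == job_status::IN_PROGRESS)
        {
          entry.status = job_status::AVAILABLE;
          entry.interrupted = true;
        }
    }
}
```

Dispatching the same slot twice without an intervening restore returns false the second time, and a re-dispatch after the restore pushes a task without appending a second log record.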
6.5 Outstanding-job accounting
increase_outstanding_job is a bare ++m_outstanding_job_count, master-thread-only. The decrease side never observes completion directly: workers report into the lock-free vacuum_Finished_job_queue (Chapter 9), and the master learns of completions only when vacuum_Data.update () drains that queue inside force_data_update, whose return value (the count vacuum_data_mark_finished consumed) flows into decrease_outstanding_job. Both of its defensive branches (negative count, and count exceeding the current total) assert (false), log, and clamp the counter to zero rather than wrap: an unsigned underflow would read as “pool full forever” and silently stop all vacuuming. Because decreases happen only at update ticks, the count over-approximates in-flight jobs between ticks; the worst case is a conservative early is_task_queue_full exit, fixed by the next tick. This is also why should_force_data_update fires at half-full (VACUUM_FINISHED_JOB_QUEUE_CAPACITY = 2048): waiting for full would risk workers blocking on a full finished queue while the master is still mid-loop.
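A minimal sketch of the clamp-instead-of-wrap accounting follows. `job_counter_sketch` is a hypothetical type; it models only the "more finished than outstanding" branch, and it omits the assert and logging the text attributes to the real code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical master-only counter; no atomics because only one thread touches it.
struct job_counter_sketch
{
  std::size_t outstanding;
  std::size_t max_tasks;  // 3 * worker count, per the text

  explicit job_counter_sketch (std::size_t worker_count)
    : outstanding (0), max_tasks (3 * worker_count)
  {
  }

  void increase ()
  {
    ++outstanding;
  }

  // 'finished' is what one update tick reported as marked finished.
  void decrease (std::size_t finished)
  {
    if (finished > outstanding)
      {
        outstanding = 0;  // clamp: wrapping would look like "pool full forever"
        return;
      }
    outstanding -= finished;
  }

  bool is_pool_full () const
  {
    return outstanding >= max_tasks;
  }
};
```

With two workers the pool reads as full at six outstanding jobs, and an over-reported tick drops the count to zero rather than to a huge unsigned value.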
6.6 Chapter summary — key takeaways
- vacuum_master_task::execute runs one bounded pass per PRM_ID_VACUUM_MASTER_WAKEUP_INTERVAL wakeup: early-out guards (disable parameter, shutdown handshake, boot incomplete), watermark refresh, an unconditional force_data_update tick, then the cursor loop, which is exited by entry-not-ready (break; same blockid retried next wakeup, the idle path), cursor exhaustion, shutdown, or pool saturation at 3 * PRM_ID_VACUUM_WORKER_COUNT.
- vacuum_job_cursor’s source of truth is the logical m_blockid; (m_page, m_index) is a refixable cache: reload moves only the index in the common case, search re-walks the page chain with an O(1) arithmetic probe per page, and readjust_to_vacuum_data_changes jumps forward when head trimming removed the cursor’s blockid.
- The cursor is forward-only (assert (m_blockid <= blockid)) and must be unloaded across every vacuum_Data.update () and at the end of every pass, because the update may unlink the very page the cursor has fixed; the fix/unfix macros special-case the permanently fixed first/last pages.
- Dispatch is flag → WAL → push: RVVAC_START_JOB (zero payload, addressed by page + entry index) is logged before the worker task exists, so recovery knows which jobs may have partially run; the IN_PROGRESS skip guarantees a block is dispatched at most once concurrently.
- vacuum_worker_task carries a by-value copy of the data entry, fully decoupling workers from vacuum data pages; its default constructor is private.
- m_outstanding_job_count is master-only and lock-free, reconciled solely through the finished-job queue at update ticks; it over-counts between ticks (occasional early backoff, zero synchronization) and clamps to zero on accounting errors instead of wrapping.
Chapter 7: Worker Log Pass and Per Record Dispatch
Chapter 6 ended with vacuum_worker_task::execute handing a copy of a VACUUM_DATA_ENTRY to vacuum_process_log_block. This chapter traces its PROCESS_LOG phase: prefetch the block’s log pages once, walk the MVCC-op chain backward from start_lsa, and turn each record into one of three actions (collect a heap OID, vacuum a b-tree entry inline, or delete an external-storage file). Replaying the collected heap array is Chapter 8; completion reporting is Chapter 9.
7.1 Worker-local state and the one-shot prefetch
Everything here lives in VACUUM_WORKER, allocated once by vacuum_worker_allocate_resources (Chapter 2) and reused across jobs:
```cpp
// vacuum_worker -- src/query/vacuum.h
struct vacuum_worker
{
  VACUUM_WORKER_STATE state;         /* INACTIVE / PROCESS_LOG / EXECUTE */
  INT32 drop_files_version;          /* last seen dropped-files version (Ch 10) */
  struct log_zip *log_zip_p;         /* unzip scratch */
  VACUUM_HEAP_OBJECT *heap_objects;  /* collected per-job heap targets */
  int heap_objects_capacity;         /* starts at 4000, doubles on demand */
  int n_heap_objects;                /* reset to 0 at job start */
  char *undo_data_buffer;            /* page-straddling undo reassembly */
  int undo_data_buffer_capacity;
  // ... condensed (private_lru_index) ...
  char *prefetch_log_buffer;         /* the block's log pages, fetched once */
  LOG_PAGEID prefetch_first_pageid;
  LOG_PAGEID prefetch_last_pageid;
  // ... condensed (allocated_resources, idx) ...
};

// vacuum_heap_object -- src/query/vacuum.h
struct vacuum_heap_object
{
  VFID vfid;  /* File ID of heap file. */
  OID oid;    /* Object OID. */
};
```

The prefetch buffer holds VACUUM_PREFETCH_LOG_BLOCK_BUFFER_PAGES = 1 + vacuum_Data.log_block_npages pages: one extra beyond the block, because a record starting in the block’s final page may spill into the next, and vacuum_log_prefetch_vacuum_block’s header comment is explicit that only one spill page is handled. The function sets prefetch_first_pageid/prefetch_last_pageid and loops logpb_fetch_page (.., LOG_CS_SAFE_READER, ..) across the range; its only branch is fetch failure (assert (false) + ER_FAILED). Prefetch is skipped when sa_mode_partial_block is true, because the SA_MODE tail block of Chapter 11 is not fully logged.
Two early-outs in vacuum_process_log_block — the PRM_ID_DISABLE_VACUUM guard at entry and a prefetch failure — return before the end: label, so vacuum_finished_block_vacuum (Chapter 9) never runs and the block’s entry stays in-progress in vacuum data; nothing retries it in this server’s lifetime. Only the next restart reclaims it, when vacuum_data_load_and_recover sweeps in-progress entries with set_interrupted ().
Every later page access goes through vacuum_fetch_log_page:
```cpp
// vacuum_fetch_log_page -- src/query/vacuum.c
if (vacuum_is_thread_vacuum (thread_p))
  {
    perfmon_inc_stat (thread_p, PSTAT_VAC_NUM_PREFETCH_REQUESTS_LOG_PAGES);
    if (worker->prefetch_first_pageid <= log_pageid && log_pageid <= worker->prefetch_last_pageid)
      {
        size_t page_index = log_pageid - worker->prefetch_first_pageid;
        memcpy (log_page_p, worker->prefetch_log_buffer + page_index * LOG_PAGESIZE, LOG_PAGESIZE);
        perfmon_inc_stat (thread_p, PSTAT_VAC_NUM_PREFETCH_HITS_LOG_PAGES);
        return NO_ERROR;
      }
    // else: warning log, fall through
  }
// need to fetch from log
error = logpb_fetch_page (thread_p, &req_lsa, LOG_CS_SAFE_READER, log_page_p);
if (error != NO_ERROR)
  {
    assert (false);
    logpb_fatal_error (thread_p, true, ARG_FILE_LINE, "vacuum_fetch_log_page");
  }
```

A worker can legitimately miss only forward, on a tail record extending past the single spill page. It never misses backward: the loop bound in 7.2 stops before any LSA in the previous block is dereferenced. The non-worker arm serves vacuum_recover_lost_block_data (Chapter 11), where the boot thread has no prefetch buffer. In every path, fetch failure is logpb_fatal_error: vacuum cannot progress without the page, and skipping it would silently leak dead versions forever.
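The hit test and offset arithmetic of the prefetch buffer can be reproduced in isolation. `prefetch_sketch` is hypothetical; it assumes the last prefetched page is the first block page plus the block size (the block's pages plus one spill page), and it fills each page with a byte derived from its page id in place of a real log fetch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using LOG_PAGEID = std::int64_t;

// Hypothetical prefetch buffer: block_npages pages plus one spill page.
struct prefetch_sketch
{
  std::size_t page_size;
  LOG_PAGEID first_pageid;
  LOG_PAGEID last_pageid;  // assumed: first_pageid + block_npages (the spill page)
  std::vector<char> buffer;

  prefetch_sketch (std::size_t page_size_arg, LOG_PAGEID first_block_pageid, std::size_t block_npages)
    : page_size (page_size_arg),
      first_pageid (first_block_pageid),
      last_pageid (first_block_pageid + static_cast<LOG_PAGEID> (block_npages)),
      buffer ((block_npages + 1) * page_size_arg)
  {
    // Fake "fetch": every byte of a page holds the low bits of its page id.
    for (std::size_t i = 0; i <= block_npages; ++i)
      {
        int fill = static_cast<int> ((first_pageid + static_cast<LOG_PAGEID> (i)) & 0x7f);
        std::memset (buffer.data () + i * page_size, fill, page_size);
      }
  }

  // Returns true on a prefetch hit; a miss means the caller must fetch from the log.
  bool copy_page (LOG_PAGEID pageid, char *out) const
  {
    if (first_pageid <= pageid && pageid <= last_pageid)
      {
        std::size_t page_index = static_cast<std::size_t> (pageid - first_pageid);
        std::memcpy (out, buffer.data () + page_index * page_size, page_size);
        return true;
      }
    return false;
  }
};
```

For a three-page block starting at page 32, pages 32 through 35 hit the buffer and page 36 is the forward miss that falls through to a direct fetch.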
7.2 The backward walk and its per-iteration gates
Chapter 3 showed how every MVCC op log record embeds a LOG_VACUUM_INFO whose prev_mvcc_op_log_lsa points at the previous MVCC op record. The block’s start_lsa is the chain’s newest end; the worker walks backward, and each next position comes out of the record just parsed rather than from a scan:
```cpp
// vacuum_process_log_block -- src/query/vacuum.c
for (LSA_COPY (&log_lsa, &data->start_lsa);
     !LSA_ISNULL (&log_lsa) && log_lsa.pageid >= first_block_pageid;
     LSA_COPY (&log_lsa, &log_vacuum.prev_mvcc_op_log_lsa))
```

Invariant (chain-complete, block-bounded walk). Every MVCC op record in the block is reachable from start_lsa through prev_mvcc_op_log_lsa, in strictly decreasing LSA order. The bound log_lsa.pageid >= first_block_pageid (VACUUM_FIRST_LOG_PAGEID_IN_BLOCK of the blockid) partitions the single global chain into per-block jobs: this worker stops where the previous block’s job takes over. A producer that appended an MVCC op record without linking it would make it permanently invisible to vacuum, because there is no fallback scan.
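The way one global back-chain is cut into per-block walks can be shown with a toy log. The names are hypothetical: an LSA is modelled as a (pageid, offset) pair, and a map from each record's LSA to its predecessor's LSA stands in for parsing prev_mvcc_op_log_lsa out of the record.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using lsa_sketch = std::pair<std::int64_t, int>;  // (pageid, offset)
const lsa_sketch NULL_LSA_SKETCH (-1, -1);

// prev_of maps each MVCC op record to the record it links back to.
inline std::vector<lsa_sketch>
walk_block (const std::map<lsa_sketch, lsa_sketch> &prev_of, lsa_sketch start_lsa,
            std::int64_t first_block_pageid)
{
  std::vector<lsa_sketch> visited;
  for (lsa_sketch lsa = start_lsa;
       lsa != NULL_LSA_SKETCH && lsa.first >= first_block_pageid;
       lsa = prev_of.at (lsa))  // next position comes from the record just visited
    {
      visited.push_back (lsa);
    }
  return visited;
}
```

Two walks over the same chain with different page bounds visit disjoint record sets, which is how each worker stops where the previous block's job takes over.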
Before the loop, the job snapshots threshold_mvccid from get_global_oldest_visible (), zeroes n_heap_objects, and computes was_interrupted = data->was_interrupted () || sa_mode_partial_block; Chapter 8 relaxes its safeguards when a previous run may have already vacuumed some targets. Each iteration then runs four gates before dispatch:
- Shutdown / interrupt. Under
SERVER_MODE,thread_p->shutdowncausesgoto endwitherror_codestillNO_ERROR— the job is abandoned and Chapter 9’svacuum_finished_block_vacuummarks the block interrupted for re-execution. UnderSA_MODEthe equivalent (logtb_get_check_interruptpluslogtb_is_interrupted) does seterror_code = ER_INTERRUPTED: standalone vacuum runs inside a user-visible operation, so the interrupt must surface as an error. - State flip to PROCESS_LOG, paired with
PERF_UTIME_TRACKER_TIME_AND_RESTART (..., PSTAT_VAC_WORKER_EXECUTE)— the time until the flip was execute time. - Page cache check.
if (log_page_p->hdr.logical_pageid != log_lsa.pageid)refetches viavacuum_fetch_log_page; failure isassert_release+logpb_fatal_error+goto end. - Record parse.
vacuum_process_log_record(7.3). On error,vacuum_check_shutdown_interruptionasserts the failure is shutdown-legitimate (!vacuum_is_thread_vacuum_worker (thread_p) || (thread_p->shutdown && error_code == ER_INTERRUPTED)), thengoto end.
After parsing, the state flips to VACUUM_WORKER_STATE_EXECUTE (mirror perf restart against PSTAT_VAC_WORKER_PROCESS_LOG), then two more gates: the dropped-file continue (the record’s whole file is gone — 7.3 and Chapter 10), and a !NDEBUG-only envelope check — assert (0) + logpb_fatal_error + goto end on violation.
Invariant — MVCCID envelope. In debug builds, every MVCCID met on the walk must lie inside [data->oldest_visible_mvccid, data->newest_mvccid] recorded at block birth (Chapter 3) and strictly below the job’s threshold_mvccid snapshot. Violation means vacuum data or the watermark is corrupt, and proceeding would delete versions some snapshot can still see — hence fatal, not skip.
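The envelope invariant amounts to a three-part predicate. A minimal sketch, assuming plain unsigned comparison stands in for CUBRID's MVCCID ordering helpers; the function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when mvccid lies inside the block's recorded range
 * [oldest_visible, newest] and strictly below the job's threshold.
 * Assumption: MVCCIDs compare as plain unsigned integers here. */
bool
sketch_mvccid_in_envelope (uint64_t mvccid, uint64_t oldest_visible,
                           uint64_t newest, uint64_t threshold)
{
  assert (oldest_visible <= newest);
  return oldest_visible <= mvccid && mvccid <= newest && mvccid < threshold;
}
```

An MVCCID equal to the threshold fails the check even when it sits inside the block range, which matches the "strictly below" wording of the invariant.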
flowchart TD
A["log_lsa = start_lsa"] --> B{"LSA null or pageid<br/>below first_block_pageid?"}
B -- yes --> Z["loop done -> vacuum_heap<br/>Chapter 8, then vacuum_complete=true"]
B -- no --> C{"shutdown SERVER_MODE /<br/>interrupt SA_MODE?"}
C -- yes --> END["goto end: state INACTIVE,<br/>vacuum_finished_block_vacuum Ch 9"]
C -- no --> D["state = PROCESS_LOG"] --> E{"page cached?"}
E -- no --> F["vacuum_fetch_log_page"]
F -- error --> END
E -- yes --> G
F -- ok --> G["vacuum_process_log_record"]
G -- error --> END
G -- ok --> H["state = EXECUTE"] --> I{"is_file_dropped?"}
I -- yes --> N["continue"]
I -- no --> V{"debug build: MVCCID<br/>outside envelope?"}
V -- "yes: fatal" --> END
V -- no --> J{"rcvindex?"}
J -- "heap op" --> K["collect OID, 7.4"]
J -- "btree op" --> L["decode + vacuum inline, 7.5"]
J -- "RVES_NOTIFY_VACUUM" --> M["delete lob file, 7.6"]
J -- other --> Q["assert safeguard"]
K --> N
L --> N
M --> N
Q --> N
N --> R["log_lsa = prev_mvcc_op_log_lsa"] --> B
Figure 7-1: branch-complete iteration of the PROCESS_LOG loop in vacuum_process_log_block.
stateDiagram-v2
[*] --> INACTIVE
INACTIVE --> PROCESS_LOG : iteration starts \n parse log record
PROCESS_LOG --> EXECUTE : record parsed \n dispatch arm runs
EXECUTE --> PROCESS_LOG : next chain hop
EXECUTE --> INACTIVE : loop ends or goto end
Figure 7-2: worker state flips per iteration. The split keys the PSTAT_VAC_WORKER_PROCESS_LOG / PSTAT_VAC_WORKER_EXECUTE accounting, and a non-INACTIVE state makes the worker visible to vacuum_get_worker_min_dropped_files_version (Chapter 10) — how file droppers know they must wait for this worker.
7.3 vacuum_process_log_record, fully dissected
The parser leaves log_lsa_p at the record’s undo data, having extracted the MVCCID, the recovery target (LOG_DATA: rcvindex, volid, pageid, offset), the vacuum chain info, and, when needed, a usable undo-data pointer.
Header parse. After LOG_GET_LOG_RECORD_HEADER and an aligned hop over it, four record types are legal in three parse arms, each extracting the same five things from a differently shaped body:
```c
// vacuum_process_log_record -- src/query/vacuum.c
if (log_rec_type == LOG_MVCC_UNDO_DATA) {
  vacuum_read_advance_when_doesnt_fit (thread_p, sizeof (*mvcc_undo), log_lsa_p, log_page_p);
  mvcc_undo = (LOG_REC_MVCC_UNDO *) (log_page_p->area + log_lsa_p->offset);
  *mvccid = mvcc_undo->mvccid;
  *log_record_data = mvcc_undo->undo.data;
  ulength = mvcc_undo->undo.length;
  LSA_COPY (&vacuum_info->prev_mvcc_op_log_lsa, &mvcc_undo->vacuum_info.prev_mvcc_op_log_lsa);
  VFID_COPY (&vacuum_info->vfid, &mvcc_undo->vacuum_info.vfid);
} else if (log_rec_type == LOG_MVCC_UNDOREDO_DATA || log_rec_type == LOG_MVCC_DIFF_UNDOREDO_DATA) {
  /* same shape via LOG_REC_MVCC_UNDOREDO; ulength = undoredo.ulength */
} else if (log_rec_type == LOG_SYSOP_END) {
  if (sysop_end->type != LOG_SYSOP_END_LOGICAL_MVCC_UNDO) {
    assert (false);
    return ER_FAILED;
  } /* <- only this flavor carries vacuum info */
  mvcc_undo = &sysop_end->mvcc_undo; /* <- embedded LOG_REC_MVCC_UNDO, same extraction */
} else {
  assert (false); /* ER_GENERIC_ERROR */
  return ER_FAILED;
} /* <- any other type = corrupt chain */
```

(Struct shapes, with LOG_REC_MVCC_UNDO/LOG_REC_MVCC_UNDOREDO wrapping LOG_VACUUM_INFO, are Chapter 3 material in log_record.hpp; diff and non-diff undoredo parse identically because vacuum reads only the undo side.)
The aligned-read clones. Four helpers exist because the stock LOG_READ_* macros fetch through the log page buffer, while vacuum must route through vacuum_fetch_log_page to hit its prefetch buffer. vacuum_read_log_aligned aligns offset to DOUBLE_ALIGNMENT and, while offset >= LOGAREA_SIZE, advances pageid and refetches (fetch failure: logpb_fatal_error); vacuum_read_log_add_aligned is add-then-align; vacuum_read_advance_when_doesnt_fit forces a next-page fetch when the requested struct would straddle the boundary — so the casts above always see a contiguous struct; vacuum_copy_data_from_log is one memcpy when the data fits the page, else a chunked copy across fetches.
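The align-then-hop behaviour described for vacuum_read_log_aligned can be sketched as below. This is an illustration only: the alignment value, the page area size, and the way the offset carries over to the next page (subtracting the area size) are assumptions, and the real helper also refetches the page on every hop.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ALIGNMENT 8         /* assumed value for DOUBLE_ALIGNMENT */
#define SKETCH_LOGAREA_SIZE 16000  /* hypothetical usable bytes per log page */

/* Hypothetical stand-in for a log address: page id plus offset in the page area. */
typedef struct
{
  long pageid;
  int offset;
} sketch_lsa;

/* Rounds the offset up to the alignment, then advances page by page
 * while the offset lies at or past the end of the page area. */
void
sketch_read_log_aligned (sketch_lsa *lsa)
{
  assert (lsa != NULL && lsa->offset >= 0);
  lsa->offset = (lsa->offset + SKETCH_ALIGNMENT - 1) & ~(SKETCH_ALIGNMENT - 1);
  while (lsa->offset >= SKETCH_LOGAREA_SIZE)
    {
      lsa->pageid++;
      lsa->offset -= SKETCH_LOGAREA_SIZE;  /* assumption: remainder carries over */
      /* the real helper would refetch through vacuum_fetch_log_page here */
    }
}
```

An offset of 13 on page 5 becomes 16 on the same page; an offset of 16001 aligns to 16008 and therefore hops to page 6 at offset 8.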
Recovery early-out. stop_after_vacuum_info == true returns NO_ERROR right after the header arms — the caller only wanted prev_mvcc_op_log_lsa. The entry asserts show the contract: worker, undo_data_ptr, undo_data_size, is_file_dropped may all be NULL in this mode. The only such caller is vacuum_recover_lost_block_data (Chapter 11).
Dropped-file short-circuit. When the record carries a non-NULL vacuum_info->vfid:
```c
// vacuum_process_log_record -- src/query/vacuum.c
if (worker->drop_files_version != vacuum_Dropped_files_version) {
  /* But first, cleanup collected heap objects. */
  VFID_COPY (&vfid, &vacuum_Last_dropped_vfid);
  vacuum_cleanup_collected_by_vfid (worker, &vfid);
  worker->drop_files_version = vacuum_Dropped_files_version;
}
error_code = vacuum_is_file_dropped (thread_p, is_file_dropped, &vacuum_info->vfid, *mvccid);
if (error_code != NO_ERROR) {
  vacuum_check_shutdown_interruption (...);
  return error_code;
}
if (*is_file_dropped == true) {
  return NO_ERROR;
}
```

The handshake and ledger lookup are Chapter 10’s subject; what matters here is ordering. The worker must scrub its own heap_objects array before publishing the new version, because publishing releases the dropper, after which the file’s pages may be reused. vacuum_cleanup_collected_by_vfid qsorts the array with vacuum_compare_heap_object and excises the contiguous run matching the VFID. A dropped verdict returns NO_ERROR with *is_file_dropped = true; the block loop continues.
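The sort-then-excise step can be sketched on a simplified object type. The types, the comparator, and the function name below are hypothetical; the comparator orders by file first (as the text describes for vacuum_compare_heap_object) so that one file's objects form a contiguous run.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct { int fileid; int volid; } sketch_vfid;
typedef struct { sketch_vfid vfid; int pageid; int slotid; } sketch_heap_object;

/* Orders by file (fileid, volid), then page, then slot. */
static int
sketch_compare (const void *a, const void *b)
{
  const sketch_heap_object *x = (const sketch_heap_object *) a;
  const sketch_heap_object *y = (const sketch_heap_object *) b;
  if (x->vfid.fileid != y->vfid.fileid) return x->vfid.fileid < y->vfid.fileid ? -1 : 1;
  if (x->vfid.volid != y->vfid.volid) return x->vfid.volid < y->vfid.volid ? -1 : 1;
  if (x->pageid != y->pageid) return x->pageid < y->pageid ? -1 : 1;
  if (x->slotid != y->slotid) return x->slotid < y->slotid ? -1 : 1;
  return 0;
}

static int
sketch_same_file (sketch_vfid a, sketch_vfid b)
{
  return a.fileid == b.fileid && a.volid == b.volid;
}

/* Sorts the array, finds the contiguous run belonging to `dropped`,
 * and moves the tail down over it. Returns the new object count. */
int
sketch_cleanup_by_vfid (sketch_heap_object *objs, int n, sketch_vfid dropped)
{
  int start, end;

  assert (objs != NULL && n >= 0);
  qsort (objs, (size_t) n, sizeof (objs[0]), sketch_compare);
  for (start = 0; start < n && !sketch_same_file (objs[start].vfid, dropped); start++)
    ;
  for (end = start; end < n && sketch_same_file (objs[end].vfid, dropped); end++)
    ;
  memmove (objs + start, objs + end, (size_t) (n - end) * sizeof (objs[0]));
  return n - (end - start);
}
```

Sorting first is what makes a single memmove sufficient: after the sort, every object of the dropped file is adjacent, so the excision never has to hunt for stragglers.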
Undo-data extraction. Heap records return early — if (!LOG_IS_MVCC_BTREE_OPERATION (rcvindex) && rcvindex != RVES_NOTIFY_VACUUM) return NO_ERROR; — because the heap pass (Chapter 8) reads the current heap page, not logged images. For the rest, ZIP_CHECK (ulength) decides the real size (GET_ZIP_LEN), then:
```c
// vacuum_process_log_record -- src/query/vacuum.c
if (log_lsa_p->offset + *undo_data_size < (int) LOGAREA_SIZE) {
  *undo_data_ptr = (char *) log_page_p->area + log_lsa_p->offset;
} /* <- zero-copy into the page */
else {
  if (worker->undo_data_buffer_capacity < *undo_data_size) {
    /* realloc; NULL -> fatal-logged ER_FAILED; capacity grows monotonically */
  }
  *undo_data_ptr = worker->undo_data_buffer;
  vacuum_copy_data_from_log (thread_p, *undo_data_ptr, *undo_data_size, log_lsa_p, log_page_p);
}
if (is_zipped) {
  if (log_unzip (worker->log_zip_p, *undo_data_size, *undo_data_ptr)) {
    *undo_data_size = (int) worker->log_zip_p->data_length;
    *undo_data_ptr = (char *) worker->log_zip_p->log_data;
  } /* <- now into log_zip's buffer */
  else {
    /* fatal-logged */
    return ER_FAILED;
  }
}
```

So undo_data may alias three owners (the log page, worker->undo_data_buffer, or worker->log_zip_p), all stable until the next record is parsed, exactly as long as the dispatch arms need.
7.4 Arm 1 — heap: collect now, execute later
Heap rcvindexes (LOG_IS_MVCC_HEAP_OPERATION in mvcc.h: RVHF_MVCC_DELETE_REC_HOME, RVHF_MVCC_INSERT, RVHF_UPDATE_NOTIFY_VACUUM, RVHF_MVCC_DELETE_MODIFY_HOME, RVHF_MVCC_NO_MODIFY_HOME, RVHF_MVCC_REDISTRIBUTE) are not executed per record. The OID is reassembled from LOG_DATA, with one subtlety:

```c
// vacuum_process_log_block -- src/query/vacuum.c
heap_object_oid.slotid = heap_rv_remove_flags_from_offset (log_record_data.offset);
/* <- offset & ~HEAP_RV_FLAG_VACUUM_STATUS_CHANGE (0x8000): the producer smuggled a recovery
 *    flag into the slotid field (Chapter 3); strip it or vacuum targets a garbage slot */
error_code = vacuum_collect_heap_objects (thread_p, worker, &heap_object_oid, &log_vacuum.vfid);
if (error_code != NO_ERROR) {
  assert_release (false);
  er_clear ();
  error_code = NO_ERROR;
  continue; /* <- one lost OID must not sink the block: release keeps going */
}
```

vacuum_collect_heap_objects appends a VACUUM_HEAP_OBJECT (7.1) to worker->heap_objects, doubling capacity by realloc when full (initial VACUUM_DEFAULT_HEAP_OBJECT_BUFFER_SIZE = 4000; only failure mode is ER_OUT_OF_VIRTUAL_MEMORY). The VFID rides along because Chapter 8 must ask the file whether slots are reusable. vacuum_compare_heap_object (VFID fileid, volid, then OID pageid, volid, slotid) groups the array by file then page, both for Chapter 8’s vacuum_heap qsort and for 7.3’s excision. Deferring heap work batches all of a page’s records into one fix/log cycle instead of one per record.
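Both mechanics of this arm, the flag strip and the capacity-doubling append, fit in a few lines. The sketch below uses the 0x8000 flag value quoted above; the collector type, the function names, and the small starting capacity are hypothetical simplifications (the real array starts at 4000 entries and stores VFID plus OID).

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_FLAG_VACUUM_STATUS_CHANGE 0x8000  /* value quoted in the text */

/* Clears the recovery flag that shares the offset field with the slotid. */
int
sketch_remove_flags_from_offset (int offset)
{
  return offset & ~SKETCH_FLAG_VACUUM_STATUS_CHANGE;
}

/* Simplified collector: only slotids, growable by doubling. */
typedef struct
{
  int *slots;
  int count;
  int capacity;
} sketch_collector;

/* Appends one slotid, doubling the capacity when the array is full.
 * Returns 0 on success and -1 when the allocation fails. */
int
sketch_collect (sketch_collector *c, int slotid)
{
  assert (c != NULL);
  if (c->count == c->capacity)
    {
      int new_capacity = c->capacity > 0 ? c->capacity * 2 : 4;
      int *grown = (int *) realloc (c->slots, (size_t) new_capacity * sizeof (int));
      if (grown == NULL)
        {
          return -1;  /* the old buffer stays valid and owned by the caller */
        }
      c->slots = grown;
      c->capacity = new_capacity;
    }
  c->slots[c->count++] = slotid;
  return 0;
}
```

Stripping the flag turns a logged offset of 0x8005 back into slot 5; forgetting the strip would produce slot 32773, which is the "garbage slot" the comment in the excerpt warns about.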
7.5 Arm 2 — b-tree: decode and execute inline
B-tree records (LOG_IS_MVCC_BTREE_OPERATION: RVBT_MVCC_DELETE_OBJECT, RVBT_MVCC_INSERT_OBJECT, RVBT_MVCC_INSERT_OBJECT_UNQ, RVBT_MVCC_NOTIFY_VACUUM) cannot be batched by page, because the key must be searched top-down each time, so they run inline. The undo payload is decoded two ways. RVBT_MVCC_INSERT_OBJECT_UNQ goes through btree_rv_read_keybuf_two_objects, which unpacks the BTID and two BTREE_OBJECT_INFOs: a unique-index MVCC insert moves the incumbent out of the leaf’s first slot, so the log carries both versions and vacuum’s target OID/class-OID come from the old one. Everything else goes through btree_rv_read_keybuf_nocopy, the one-object flavor that also fills mvcc_info from flag bits packed into the OID. Either way key_buf wraps the still-packed key, and assert (!OID_ISNULL (&oid)) seals the decode. Then the purpose dispatch, four-way:

```c
// vacuum_process_log_block -- src/query/vacuum.c
if (log_record_data.rcvindex == RVBT_MVCC_NOTIFY_VACUUM) {
  if (MVCCID_IS_VALID (mvcc_info.delete_mvccid)) {
    error_code = btree_vacuum_object (..., mvcc_info.delete_mvccid);
  } else if (MVCCID_IS_VALID (mvcc_info.insert_mvccid)
             && mvcc_info.insert_mvccid != MVCCID_ALL_VISIBLE) {
    error_code = btree_vacuum_insert_mvccid (..., mvcc_info.insert_mvccid);
  } else {
    /* impossible case */
    assert_release (false);
    continue;
  }
} else if (log_record_data.rcvindex == RVBT_MVCC_DELETE_OBJECT) {
  error_code = btree_vacuum_object (..., mvccid);
} /* <- record's own MVCCID is the delid */
else if (log_record_data.rcvindex == RVBT_MVCC_INSERT_OBJECT
         || log_record_data.rcvindex == RVBT_MVCC_INSERT_OBJECT_UNQ) {
  error_code = btree_vacuum_insert_mvccid (..., mvccid);
} else {
  /* Unexpected. */
  assert_release (false);
}
```

RVBT_MVCC_NOTIFY_VACUUM is the either-or case: an index load logs one notification per object without knowing whether vacuum will see a dead deleted object (valid delete_mvccid: remove it) or a settled insert (valid, non-MVCCID_ALL_VISIBLE insert_mvccid: strip the insid only); both invalid is corrupt, skipped with continue in release. The executors are thin wrappers over one engine:

```c
// btree_vacuum_object -- src/storage/btree.c
BTREE_MVCC_INFO_SET_DELID (&match_mvccinfo, delete_mvccid);
return btree_delete_internal (thread_p, btid, oid, class_oid, &mvcc_info, NULL, buffered_key, NULL,
                              SINGLE_ROW_MODIFY, NULL, &match_mvccinfo, NULL, NULL,
                              BTREE_OP_DELETE_VACUUM_OBJECT);
```

btree_vacuum_insert_mvccid is symmetric with BTREE_MVCC_INFO_SET_INSID and BTREE_OP_DELETE_VACUUM_INSID. The match_mvccinfo makes the operation idempotent: if a previous interrupted run already cleaned the entry, the traversal finds no match and succeeds. This is crucial because interrupted blocks re-execute from start_lsa (Chapter 9). The arm’s epilogue is a SERVER_MODE-only overflow-page accounting block (thread_p->read_ovfl_pages_count, zeroed before the dispatch, checked against g_ovfp_threshold_mgr), then the error branch: thread_p->shutdown makes a b-tree error an acceptable interruption (goto end); otherwise it is asserted, logged, er_clear ()ed, and neutralized (error_code = NO_ERROR) so the block continues. This is the same robustness policy as the heap arm.
7.6 Arm 3 — external storage, and the nop that explains it
RVES_NOTIFY_VACUUM carries a packed URI string in its undo data:

```c
// vacuum_process_log_block -- src/query/vacuum.c
(void) or_unpack_string (undo_data, &es_uri);
if (es_delete_file (es_uri) != NO_ERROR) {
  er_clear ();
} /* <- file may already be gone; swallow */
else {
  ASSERT_NO_ERROR ();
}
db_private_free_and_init (thread_p, es_uri);
```

The producer side explains the trick. A LOB file cannot be unlinked at delete-commit time, because older snapshots may still read it, so vacuum_notify_es_deleted appends an undo-only record addressed to nobody:

```c
// vacuum_notify_es_deleted -- src/query/vacuum.c
/* This is not actually ever undone, but vacuum will process undo data of log entry. */
log_append_undo_data (thread_p, RVES_NOTIFY_VACUUM, &addr, length, data);
```

Because rollback would execute the record’s undo function, vacuum_rv_es_nop exists as the registered handler that does nothing: the record is a message in a bottle for vacuum, not a recoverable change. The undo-data channel is reused purely because every MVCC-undo record automatically joins the chain this chapter walks. The final else after all three arms is the safeguard assert_release (false): an rcvindex that is neither heap, b-tree, nor ES should never have entered the chain.
Each iteration closes by asserting worker->state == VACUUM_WORKER_STATE_EXECUTE and that no system op leaked (!...is_under_sysop ()); the same pair guards the loop exit before vacuum_heap runs (Chapter 8) and vacuum_complete = true is set. The end: epilogue flips the state to INACTIVE and — except for sa_mode_partial_block jobs, which have no vacuum-data entry to report — calls vacuum_finished_block_vacuum (Chapter 9) with that flag; under SERVER_MODE it also runs pgbuf_unfix_all as a leak backstop.
7.7 Chapter summary — key takeaways
- A worker reads each block’s log pages exactly once into its private prefetch_log_buffer (block size + 1 spill page); later accesses in vacuum_fetch_log_page are memcpy hits. Legitimate misses are forward-only (a tail record spilling past the one extra page), never backward.
- The PROCESS_LOG loop is a backward walk of the prev_mvcc_op_log_lsa chain from data->start_lsa, partitioned per block by the pageid >= first_block_pageid bound. There is no scan, so an unlinked MVCC record is permanently invisible to vacuum.
- vacuum_process_log_record accepts four record types in three parse arms (LOG_MVCC_UNDO_DATA, the two MVCC undoredo types, LOG_SYSOP_END of type LOG_SYSOP_END_LOGICAL_MVCC_UNDO); its stop_after_vacuum_info mode serves Chapter 11’s recovery walk, and its dropped-file short-circuit (Chapter 10) scrubs already-collected OIDs before acknowledging a new dropped-files version.
- Undo data is materialized lazily and may alias the log page, the growable undo_data_buffer, or the log_zip_p buffer after log_unzip. It is valid only until the next record parse; heap records skip extraction entirely.
- The three arms execute asymmetrically: heap OIDs are collected (flag-stripped slotid, capacity-doubling array) for Chapter 8’s batched pass; b-tree entries are vacuumed inline through btree_delete_internal with match-MVCCID idempotence; ES records delete a LOB file whose log record exists only as a message, and its recovery handler vacuum_rv_es_nop does nothing.
- Per-record failures in release builds are logged, cleared, and skipped; shutdown/interrupt and parse corruption abandon the block for re-execution via Chapter 9, while the PRM_ID_DISABLE_VACUUM guard and a prefetch failure return before the completion path. Those entries stay in-progress until vacuum_data_load_and_recover flips them to interrupted at the next restart.
- The PROCESS_LOG / EXECUTE state flips bracket the perf accounting and keep the worker visible to the dropped-files version handshake; both loop tail and exit assert the state discipline and that no system operation leaked.
Chapter 8: Heap Execution
Chapter 7 left the worker with a flat array of (VFID, OID) pairs in worker->heap_objects. This chapter traces the batched second pass: vacuum_heap groups the array by heap page, vacuum_heap_page works each batch through the VACUUM_HEAP_HELPER workbench, mvcc_satisfies_vacuum issues per-record verdicts, and every change is logged for Chapter 11. Heap page anatomy, REC_* slot types, and heap_remove_page_on_vacuum live in cubrid-heap-manager-detail.md; the MVCC header layout and visibility family live in cubrid-mvcc-detail.md. Both are used here, not re-derived.
8.1 Batching: vacuum_heap sorts, groups, and survives errors
Each VACUUM_HEAP_OBJECT (in vacuum.h) holds just vfid and oid. vacuum_heap runs once per job from vacuum_process_log_block, with the block’s threshold_mvccid and the was_interrupted flag (true when the job re-executes after crash or shutdown; see Chapter 6). It qsorts with vacuum_compare_heap_object (vfid.fileid, vfid.volid, then oid.pageid, oid.volid, oid.slotid) so the array becomes file groups of page runs of slot-sorted duplicates, then calls vacuum_heap_page once per page run. At each file boundary it does HFID_SET_NULL on the cached HFID; the first vacuum_heap_page of the new group lazily refills it (8.3). Error handling is build-asymmetric: a debug build (or a shutdown) stops the job, but a release build (#if defined (NDEBUG)) clears the error and abandons the failed page. The block is still marked vacuumed (Chapter 9), so a skipped record leaks until a later delete touches it. Forward progress wins.
Dropped files never reach this loop; the filter runs at collection time. vacuum_process_log_record (Chapter 7) checks vacuum_is_file_dropped (Chapter 10) before vacuum_collect_heap_objects, and when vacuum_Dropped_files_version advanced mid-block it calls vacuum_cleanup_collected_by_vfid to purge collected entries of vacuum_Last_dropped_vfid (sort, find the VFID’s range, memmove the tail down).
8.2 The workbench: VACUUM_HEAP_HELPER
vacuum_heap_page keeps all working state in one stack struct so latch-dropping retries can rebuild context cheaply:

| Field | Role |
|---|---|
| home_page, home_vpid | Fixed PAGE_HEAP pointer under write latch + its VPID (survives unfix/re-fix); NULL page = “latch dropped” |
| forward_page, forward_oid | Second fixed page and the OID from the home link record; meaning depends on record_type (matrix below) |
| crt_slotid | Slot being vacuumed; doubles as the duplicate filter |
| record_type | Slot type, re-read fresh on every retry, because the record may change while unlatched |
| record over rec_buf[IO_MAX_PAGE_SIZE + MAX_ALIGNMENT] | COPY (not PEEK) of the current record; rewritten in place by the insid strip, and for REC_RELOCATION it doubles as the NEWHOME removal’s undo image |
| mvcc_header | Decoded header, the input to mvcc_satisfies_vacuum |
| hfid, overflow_vfid, reusable | File identity: per-group HFID cache, lazily-resolved overflow file (heap_ovf_delete needs it), FILE_HEAP_REUSE_SLOTS flag (slot afterlife, 8.6) |
| can_vacuum | Verdict for the current record; drives the execute split |
| slots[MAX_SLOTS_IN_PAGE], results[MAX_SLOTS_IN_PAGE], n_bulk_vacuumed | Pending bulk-logging batch (8.7); single-page REC_HOME changes only |
| forward_recdes over forward_link | COPY of the home link slot; doubles as undo image of the home removal |
| n_vacuumed, initial_home_free_space, time_track | Status-assert input, heap_stats_update delta base, prepare/execute/log perf split |

| Field | REC_HOME | REC_RELOCATION | REC_BIGONE |
|---|---|---|---|
| forward_page | unused (asserted NULL) | page holding REC_NEWHOME | first overflow page |
| forward_oid | unused | OID of REC_NEWHOME slot | OID naming first overflow VPID |
| record | copy of home record | copy of REC_NEWHOME record | unused |
| mvcc_header | from record | from record | via heap_get_mvcc_rec_header_from_overflow |
8.3 vacuum_heap_page: the page loop
- Fix, interrupted flavor. If was_interrupted: pgbuf_fix_if_not_deallocated; error → return; home_page == NULL (deallocated by the earlier partial run) → warn, return NO_ERROR; page type PAGE_FTAB (deallocated and reused as a file-table page) → unfix, return NO_ERROR. Both tolerated cases assert n_heap_objects == 1. A normal run uses plain pgbuf_fix and treats failure as a hard error. Invariant: a re-executed job must tolerate a vanished page; a first execution must not. The relaxed path used unconditionally would silently skip genuine fix failures on live data.
- File identity. If the cached hfid is NULL, vacuum_heap_get_hfid_and_file_type runs heap_get_class_oid_from_page, file_descriptor_get, file_get_type; any failure asserts and returns; NULL HFID or type outside {FILE_HEAP, FILE_HEAP_REUSE_SLOTS} → ER_FAILED. On success helper.reusable = (ftype == FILE_HEAP_REUSE_SLOTS); both values copy out to the caller’s per-group cache.
- Per-slot loop. Duplicates are skipped via crt_slotid; vacuum_heap_prepare_record (8.4) on error unfixes forward and jumps to end. Then REC_RELOCATION / REC_HOME / REC_BIGONE call mvcc_satisfies_vacuum (8.5): REMOVE → vacuum_heap_record; DELETE_INSID_PREV_VER → vacuum_heap_record_insid_and_prev_version; CANNOT_VACUUM → no-op. Any other slot type hits default: (per the in-code comment, already vacuumed by another worker or rolled back and reused) and is ignored. An execution error does assert_release (false), then goto end if the home latch was lost, else continue.
- Status downgrade. After each object (vacuum-worker threads only), re-read heap_page_get_vacuum_status. The terminal branch fires on (ONCE && !was_interrupted) || (NONE && was_interrupted), asserting n_heap_objects == 1. First half: the normal “single expected vacuum”: heap_page_set_vacuum_status_none, then a bulk log record with all_vacuumed = true so redo downgrades too. Second half: the paranoid re-execution case (the in-code comment walks an insert/vacuum/delete/crash interleaving where an old job re-runs while a newer vacuum is still owed; downgrading could let the page deallocate under a pending job), so the re-run only resets counters. Either way the page is dirtied, and if spage_number_of_records <= 1 && helper.reusable the worker tries heap_remove_page_on_vacuum (guarded by pgbuf_has_prevent_dealloc): success deallocates the emptied page; failure unfixes. Always exits to end.
- Courtesy yield. If pgbuf_has_any_non_vacuum_waiters and objects remain, the batch is flushed with vacuum_heap_page_log_and_reset (which unfixes) and the page re-fixed (re-fix failure → end). This buys fairness for foreground threads at the cost of an extra log record.
- end: the remaining batch is flushed with update_best_space_stat = true.
flowchart TD
A["fix home page<br/>(interrupted: tolerate dealloc/FTAB)"] --> B{"hfid cached?"}
B -- no --> C["vacuum_heap_get_hfid_and_file_type"]
B -- yes --> D
C --> D["next object; skip dup slotid"]
D --> E["vacuum_heap_prepare_record"]
E -- error --> Z["end: flush batch, unfix"]
E --> F{"record_type"}
F -- "HOME / REL / BIG" --> G["mvcc_satisfies_vacuum"]
F -- other --> J
G -- REMOVE --> H["vacuum_heap_record"]
G -- DELETE_INSID_PREV_VER --> I["vacuum_heap_record_insid_and_prev_version"]
G -- CANNOT_VACUUM --> J["check page vacuum status"]
H --> J
I --> J
J -- "ONCE and not interrupted<br/>or NONE and interrupted" --> K["set status none, log all_vacuumed<br/>maybe heap_remove_page_on_vacuum"] --> Z
J -- "waiters and more objects" --> L["log_and_reset, re-fix"] --> D
J -- otherwise --> D
Figure 8-1: vacuum_heap_page control flow; every exit funnels through end.
8.4 vacuum_heap_prepare_record: every record-type branch
Prepare gathers what each record shape needs, under a retry_prepare: label that re-reads the slot after any latch drop. It is entered with forward_page == NULL (asserted); a non-NULL forward_page inside the switch can only be left over from an earlier retry of the same call, and every non-matching branch unfixes it.
- Slot gone (spage_get_slot NULL): type forced to REC_MARKDELETED, return NO_ERROR; the caller’s default: ignores it.
- REC_RELOCATION: COPY the home link into forward_recdes/forward_link (failure → ER_FAILED); unfix a retry-leftover forward page with the wrong VPID. The forward fix obeys pgbuf_get_condition_for_ordered_fix: home-before-forward order → unconditional latch; forward-before-home → conditional try. If the try fails: flush and unfix home (vacuum_heap_page_log_and_reset), fix forward then home unconditionally (the deadlock-safe order) and goto retry_prepare because home was unlatched. Invariant: two heap pages are never latched contrary to pgbuf’s ordered-fix rule; violating it risks an undetected latch deadlock with foreground writers. With both pages held, COPY-read the REC_NEWHOME record into rec_buf (the future undo image) and decode it with or_mvcc_get_header.
- REC_BIGONE: if overflow_vfid is unresolved, try heap_ovf_find_vfid with a conditional latch; on failure flush/unfix home, retry unconditionally (failure → ER_FAILED), re-fix home, goto retry_prepare. Then COPY the link record, fix the first overflow page unconditionally (overflow pages are always fixed after home pages, so no ordering dance is needed), and read the header with heap_get_mvcc_rec_header_from_overflow.
- REC_HOME: COPY the record into rec_buf (the in-code comment says “Peek” but the call passes COPY). The copy is not for undo this time (REC_HOME changes are logged redo-only, 8.7) but because the insid strip mutates the buffer before spage_update, which a PEEK pointer into the page would not allow; decode the header with or_mvcc_get_header.
- default: (direct REC_NEWHOME, REC_MARKDELETED, REC_DELETED_WILL_REUSE, …) only the type is reported.
```c
// vacuum_heap_prepare_record -- src/query/vacuum.c
/* Assert forward page is fixed if and only if record type is either REC_RELOCATION or REC_BIGONE. */
assert ((helper->record_type == REC_RELOCATION || helper->record_type == REC_BIGONE)
        == (helper->forward_page != NULL));
```

Invariant: forward_page != NULL iff the record type has a forward component. If violated, a later iteration would write through a latch belonging to the wrong page.
8.5 The verdict: mvcc_satisfies_vacuum
The heap pass hinges on one pure function of the MVCC header and the block’s threshold_mvccid (the oldest-visible watermark snapshotted at dispatch; see Chapter 5). It is the vacuum-side sibling of the visibility family in cubrid-mvcc-detail.md:

```c
// mvcc_satisfies_vacuum -- src/transaction/mvcc.c
if (!MVCC_IS_HEADER_DELID_VALID (rec_header)
    || MVCC_IS_REC_DELETED_SINCE_MVCCID (rec_header, oldest_mvccid)) {
  /* The record was not deleted or was recently deleted and cannot be vacuumed completely. */
  if (!MVCC_IS_HEADER_INSID_NOT_ALL_VISIBLE (rec_header)
      || MVCC_IS_REC_INSERTED_SINCE_MVCCID (rec_header, oldest_mvccid)) {
    // ... condensed (perfmon) ...
    return VACUUM_RECORD_CANNOT_VACUUM;
  } else {
    // ... condensed (perfmon) ...
    return VACUUM_RECORD_DELETE_INSID_PREV_VER;
  }
} else {
  return VACUUM_RECORD_REMOVE; /* <- delete committed before every live snapshot */
}
```

The macros are exact bit tests: MVCC_IS_HEADER_DELID_VALID = DELID flag set and MVCCID_IS_VALID (delid); MVCC_IS_HEADER_INSID_NOT_ALL_VISIBLE = INSID flag set and value != MVCCID_ALL_VISIBLE (the constant 3); the *_SINCE_MVCCID macros are !MVCC_ID_PRECEDES (id, T), i.e. id >= T (the version was touched at or after the threshold, so it is too recent to finish). Note the polarity: the outer test is !DELID_VALID || DELETED_SINCE_MVCCID, so REMOVE is the else branch, reached only when the record has a valid delid and delid < T, meaning the delete committed before every live snapshot. When that fails, an old-enough, not-yet-stripped insert id (insid < T, flag set, not ALL_VISIBLE) yields DELETE_INSID_PREV_VER; otherwise (a fresh insert/delete, or an already-stripped insid) nothing happens.
flowchart TD
S["MVCC header + threshold T"] --> D{"delid valid<br/>and delid < T ?"}
D -- yes --> R["VACUUM_RECORD_REMOVE<br/>no live snapshot can see it"]
D -- no --> I{"insid flag set, not ALL_VISIBLE,<br/>and insid < T ?"}
I -- yes --> P["VACUUM_RECORD_DELETE_INSID_PREV_VER<br/>keep row, strip insid + prev_version_lsa"]
I -- no --> C["VACUUM_RECORD_CANNOT_VACUUM<br/>already stripped, or too fresh"]
Figure 8-2: the three verdicts.
Two consequences. First, CANNOT_VACUUM is normal, not an error: the same OID is collected once per log record touching it, so a record deleted shortly after this block’s threshold gets DELETE_INSID_PREV_VER now and REMOVE from the later job whose threshold passes the delete. Second, MVCCID_ALL_VISIBLE doing double duty — “no insid flag” and “insid == 3” both mean already stripped — is what makes the REC_BIGONE path idempotent (8.6).
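The three verdicts can be restated as a small decision function. This is a sketch under simplifying assumptions: the header is reduced to two flag/value pairs (where a set flag also implies a valid MVCCID), MVCCIDs compare as plain integers, and all names are hypothetical rather than CUBRID's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MVCCID_ALL_VISIBLE 3  /* constant quoted in the text */

typedef enum
{
  SKETCH_REMOVE,
  SKETCH_DELETE_INSID_PREV_VER,
  SKETCH_CANNOT_VACUUM
} sketch_verdict;

/* Simplified MVCC header: has_* means "flag set and id valid". */
typedef struct
{
  bool has_insid;
  uint64_t insid;
  bool has_delid;
  uint64_t delid;
} sketch_header;

sketch_verdict
sketch_satisfies_vacuum (const sketch_header *h, uint64_t threshold)
{
  assert (h != NULL);
  if (h->has_delid && h->delid < threshold)
    {
      return SKETCH_REMOVE;                 /* deleted before every live snapshot */
    }
  if (h->has_insid && h->insid != SKETCH_MVCCID_ALL_VISIBLE && h->insid < threshold)
    {
      return SKETCH_DELETE_INSID_PREV_VER;  /* row stays, insid can be stripped */
    }
  return SKETCH_CANNOT_VACUUM;              /* too fresh, or already stripped */
}
```

A record inserted at 10 and deleted at 45 gets the insid strip under threshold 30 and the removal under threshold 46, which is the two-job progression described in the first consequence above.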
8.6 Execution: strip versus remove
vacuum_heap_record_insid_and_prev_version (verdict DELETE_INSID_PREV_VER) edits the header; the row survives. REC_RELOCATION and REC_HOME share the same byte surgery on the copied record: look up the current header size from mvcc_header_size_lookup[mvcc_flags]; if both DELID and INSID are present, memcpy the DELID over the INSID slot so it survives; clear OR_MVCC_FLAG_VALID_INSID | OR_MVCC_FLAG_VALID_PREV_VERSION; memmove closes the gap. The shrunken record goes back via spage_update: to forward_page/forward_oid.slotid for the NEWHOME, to home_page/crt_slotid for HOME. REC_RELOCATION then logs immediately (a one-slot vacuum_log_vacuum_heap_page on the forward page) and unfixes; REC_HOME just appends (crt_slotid, DELETE_INSID_PREV_VER) to the bulk batch. REC_BIGONE cannot resize, because the overflow header area is fixed-width, so the insid is overwritten, not stripped:

```c
// vacuum_heap_record_insid_and_prev_version -- src/query/vacuum.c
/* Replace current insert MVCCID with MVCCID_ALL_VISIBLE. Header must remain the same size. */
MVCC_SET_INSID (&helper->mvcc_header, MVCCID_ALL_VISIBLE);
LSA_SET_NULL (&helper->mvcc_header.prev_version_lsa);
error_code = heap_set_mvcc_rec_header_on_overflow (helper->forward_page, &helper->mvcc_header);
// ... condensed ...
vacuum_log_remove_ovf_insid (thread_p, helper->forward_page); /* <- redo-only, zero payload */
```

Invariant: an overflow MVCC header never changes size. This is enforced by substitution instead of removal; violating it would shift the large overflow payload that follows the header.
vacuum_heap_record (verdict REMOVE) deletes the version. REC_HOME removals join the bulk batch; REC_RELOCATION/REC_BIGONE are two-page operations, so the batch is flushed first (vacuum_heap_page_log_and_reset with unlatch_page = false) and the pair wrapped in a system operation (log_sysop_start … log_sysop_commit) so recovery never sees a dangling half. All three then run spage_vacuum_slot (… helper->reusable) on the home slot, whose afterlife is the OID-stability contract:
```c
// spage_vacuum_slot -- src/storage/slotted_page.c
slot_p->offset_to_record = SPAGE_EMPTY_OFFSET;
if (reusable)
  {
    slot_p->record_type = REC_DELETED_WILL_REUSE;  /* <- nothing references this OID; recycle slotid */
  }
else
  {
    slot_p->record_type = REC_MARKDELETED;  /* <- referable file: indexes may hold the OID; tombstone */
  }
```

A referable heap (FILE_HEAP) may still have b-tree entries or OID references pointing at the slot, so the slotid is never handed out again; a reusable heap (FILE_HEAP_REUSE_SLOTS) guarantees no such references. The REC_NEWHOME forward slot is always vacuumed with reusable = true: nothing references a NEWHOME directly except its REC_RELOCATION home, which dies in the same sysop.

Within the sysop, REC_RELOCATION logs the home-slot removal (vacuum_log_redoundo_vacuum_record, undo = the copied link forward_recdes), vacuums the forward slot, conditionally feeds the forward page into best-space statistics (PRM_ID_HF_MAX_BESTSPACE_ENTRIES > 0 && freespace > HEAP_DROP_FREE_SPACE → heap_stats_update), logs the forward-slot removal (undo = the copied NEWHOME record), dirties and unfixes the forward page, and commits. REC_BIGONE logs the home-slot removal, unfixes the overflow page, then calls heap_ovf_delete to deallocate the whole overflow chain: failure → log_sysop_abort + ER_FAILED; success → commit.
8.7 Logging and recovery pairs
vacuum_heap_page_log_and_reset is the batch flusher: n_bulk_vacuumed == 0 → just unfix (if asked); else compact if spage_need_compact, fold freed space into best-space stats when update_best_space_stat && initial_home_free_space != -1, emit one vacuum_log_vacuum_heap_page, dirty, optionally unfix, zero the counter. Three log families result:
- Bulk RVVAC_HEAP_PAGE_VACUUM (redo-only): addr.offset packs the slot count plus two flag bits (VACUUM_LOG_VACUUM_HEAP_REUSABLE = 0x8000, VACUUM_LOG_VACUUM_HEAP_ALL_VACUUMED = 0x4000), and each slotid’s sign encodes the verdict: negated for REMOVE, positive for the insid strip. Redo vacuum_rv_redo_vacuum_heap_page unpacks rcv->offset & ~VACUUM_LOG_VACUUM_HEAP_MASK: n_slots == 0 (asserts all_vacuumed) → heap_page_set_vacuum_status_none; negative slotid → negate, spage_vacuum_slot; positive → peek (must be REC_HOME/REC_NEWHOME, else ER_FAILED), rebuild with the smaller flag-cleared header. Then compact, downgrade if all_vacuumed, dirty.
- Per-record RVVAC_HEAP_RECORD_VACUUM for two-page removals, via vacuum_log_redoundo_vacuum_record: an undoredo whose offset packs slotid + reusable bit, undo crumbs are record type + pre-image, redo payload empty (“only the object’s address to re-vacuum”). Redo vacuum_rv_redo_vacuum_heap_record re-derives slotid/reusable and re-runs spage_vacuum_slot + compaction; undo vacuum_rv_undo_vacuum_heap_record strips the flag bits and delegates to heap_rv_redo_insert to re-insert the pre-image. The undo half exists because these removals live in a sysop: if it aborts (the heap_ovf_delete failure branch), the home slot must come back.
- Overflow RVVAC_REMOVE_OVF_INSID, via vacuum_log_remove_ovf_insid: zero bytes (log_append_redo_data2 (… 0, 0, NULL)). Redo vacuum_rv_redo_remove_ovf_insid rebuilds from the page: read header, MVCC_SET_INSID (… MVCCID_ALL_VISIBLE), LSA_SET_NULL (… prev_version_lsa), write back, dirty. It is idempotent because of the substitution trick (8.6).
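The bulk record's encoding (slot count and flag bits in the offset, verdict in the slotid sign) can be sketched as follows. The two flag values are the ones quoted above; treating the mask as their union, the 16-bit widths, and the helper names are assumptions made for this illustration, as is the premise that record slotids are strictly positive so the sign is always available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values quoted in the text; the mask as their union is an assumption. */
#define HEAP_REUSABLE     0x8000
#define HEAP_ALL_VACUUMED 0x4000
#define HEAP_MASK         (HEAP_REUSABLE | HEAP_ALL_VACUUMED)

/* Pack the slot count and the two page-level flags into one 16-bit offset. */
static uint16_t
pack_offset (int n_slots, bool reusable, bool all_vacuumed)
{
  assert (n_slots >= 0 && (n_slots & HEAP_MASK) == 0);
  return (uint16_t) (n_slots | (reusable ? HEAP_REUSABLE : 0)
                     | (all_vacuumed ? HEAP_ALL_VACUUMED : 0));
}

/* Recover the slot count by masking the flag bits away. */
static int
unpack_n_slots (uint16_t offset)
{
  return offset & ~HEAP_MASK;
}

/* Encode one slot: a negated slotid means REMOVE, a positive one the insid strip. */
static int16_t
encode_slot (int16_t slotid, bool is_remove)
{
  assert (slotid > 0);  /* assumed: zero could not carry a sign */
  return is_remove ? (int16_t) -slotid : slotid;
}

/* Decode: returns the slotid and reports the verdict through *is_remove. */
static int16_t
decode_slot (int16_t logged, bool *is_remove)
{
  *is_remove = logged < 0;
  return *is_remove ? (int16_t) -logged : logged;
}
```

Packing the verdict into the sign keeps the redo payload at one 16-bit value per slot, which matters when a single record covers a whole page of vacuumed slots.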
was_interrupted ties execution to recovery: a re-executed job replays log records whose heap effects may already be on disk, so its relaxations — pgbuf_fix_if_not_deallocated, tolerated PAGE_FTAB reuse, the softened assert (page_vacuum_status != HEAP_PAGE_VACUUM_NONE || (was_interrupted && helper.n_vacuumed == 0)), the refusal to downgrade an already-NONE page — are the “this already happened” symptoms a crash can manufacture. Chapter 11 covers how the job regains the flag after restart.
8.8 Chapter summary — key takeaways
- vacuum_heap sorts the collected (VFID, OID) array file-major/page-minor and calls vacuum_heap_page per page run, caching HFID/reusability per file group; dropped files were filtered at collection time, and release builds abandon a failed page rather than fail the job.
- VACUUM_HEAP_HELPER is the whole working set: forward_page means “NEWHOME page” for REC_RELOCATION but “first overflow page” for REC_BIGONE; the COPY-read record in rec_buf is rewritten in place by the strip; the bulk batch holds only single-page REC_HOME changes.
- mvcc_satisfies_vacuum reduces to two threshold comparisons: delid < T → REMOVE; else insid present, not MVCCID_ALL_VISIBLE, and < T → DELETE_INSID_PREV_VER; else CANNOT_VACUUM, which is a normal outcome, finished by whichever later job’s threshold passes the pending delete.
- vacuum_heap_prepare_record re-reads the slot under retry_prepare: after any latch drop and preserves pgbuf’s ordered-fix rule with a conditional forward latch plus reverse-order refix, ending in the invariant that forward_page is fixed iff the type is REC_RELOCATION or REC_BIGONE.
- Removal follows the OID-stability contract: referable heaps tombstone (REC_MARKDELETED), reusable heaps recycle (REC_DELETED_WILL_REUSE), NEWHOME forward slots always recycle, and an emptied reusable page is deallocated via heap_remove_page_on_vacuum.
- Logging is three-tiered: bulk redo-only RVVAC_HEAP_PAGE_VACUUM with verdicts as slotid signs and reusable/all_vacuumed bits in the offset; per-record undoredo RVVAC_HEAP_RECORD_VACUUM inside a sysop for two-page removals; zero-payload RVVAC_REMOVE_OVF_INSID whose redo is idempotent because overflow insids are substituted, never stripped.
- was_interrupted converts “impossible” states (deallocated page, FTAB reuse, status already NONE) into tolerated no-ops, because a re-executed job expects to find its own past work already applied.
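The verdict rule from the takeaways above can be sketched as a pure function. This is a simplified model, not the body of mvcc_satisfies_vacuum: the struct, the presence flags, the toy_ names, the numeric value chosen for the all-visible sentinel, and the use of a plain < in place of the MVCCID comparison macros are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t mvccid_t;

/* Assumed sentinel for "inserter visible to everyone"; the value is illustrative. */
#define TOY_MVCCID_ALL_VISIBLE ((mvccid_t) 3)

typedef enum
{
  TOY_REMOVE,
  TOY_DELETE_INSID_PREV_VER,
  TOY_CANNOT_VACUUM
} toy_verdict;

typedef struct
{
  bool has_insid;
  bool has_delid;
  mvccid_t insid;
  mvccid_t delid;
} toy_mvcc_header;

/* Two threshold comparisons: the delete first, then the insert. */
static toy_verdict
toy_satisfies_vacuum (const toy_mvcc_header *h, mvccid_t threshold)
{
  assert (h != NULL);
  if (h->has_delid && h->delid < threshold)
    {
      return TOY_REMOVE;  /* deleter is older than every active snapshot */
    }
  if (h->has_insid && h->insid != TOY_MVCCID_ALL_VISIBLE && h->insid < threshold)
    {
      return TOY_DELETE_INSID_PREV_VER;  /* inserter is old: strip its trace */
    }
  return TOY_CANNOT_VACUUM;  /* nothing to do yet; a later job may succeed */
}
```

A version whose delete is still too recent can therefore be stripped now and removed by a later job, once the threshold has moved past its delid.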
Chapter 9: Block Completion and Log Reclamation
A vacuum job ends one of two ways: it processed every MVCC operation in its block, or it died midway (shutdown in SERVER_MODE, interrupt in SA_MODE). This chapter traces how either outcome travels back into vacuum data, how fully vacuumed entries are physically removed from the vacuum data file, and how that removal advances the log floor (the oldest log page the system must keep) so archive purging can proceed. The garbage-collector rationale is in the companion doc (cubrid-vacuum.md); here we trace every branch.
flowchart LR
W["worker<br/>vacuum_process_log_block"] -->|"vacuum_finished_block_vacuum<br/>blockid + status flags"| Q["vacuum_Finished_job_queue"]
Q -->|"consume in batches"| M["master<br/>vacuum_data_mark_finished"]
M -->|"set_vacuumed / set_interrupted"| VD["vacuum data pages"]
VD -->|"page fully vacuumed"| EP["vacuum_data_empty_page"]
M --> K["vacuum_update_keep_from_log_pageid"]
K -->|"vacuum_min_log_pageid_to_keep"| AP["logpb_remove_archive_logs*"]
Figure 9-1. The completion pipeline: worker outcome flows through the finished-job queue into vacuum data, and the log floor follows.
9.1 vacuum_finished_block_vacuum — encoding the outcome into the blockid
When vacuum_process_log_block reaches its end: label (Chapter 7), it calls vacuum_finished_block_vacuum with vacuum_complete, which is true only after vacuum_heap returned NO_ERROR; every goto end error path leaves it false. The data argument is the worker’s copy of the VACUUM_DATA_ENTRY; the header comment is explicit that the real table entry may have moved while the job ran, so the outcome travels by blockid value, not by pointer.
```c
// vacuum_finished_block_vacuum -- src/query/vacuum.c
if (is_vacuum_complete)
  {
    data->set_vacuumed ();  /* VACUUMED status, INTERRUPTED flag cleared */
  }
else
  {
    /* We expect that worker job is abandoned during shutdown. But all other cases are error cases. */
    // ... condensed: SERVER_MODE warns iff thread_p->shutdown (asserted); SA_MODE iff ER_INTERRUPTED ...
    data->set_interrupted ();  /* AVAILABLE status + INTERRUPTED flag */
  }
blockid = data->blockid;  /* raw field: blockid WITH the flag bits just set */
if (!vacuum_Finished_job_queue->produce (blockid))
  {
    assert_release (false);
    vacuum_er_log_error (..., "%s", "Finished job queue is full!!!");
  }
// ... condensed, SERVER_MODE: is_half_full () -> vacuum_Master_daemon->wakeup ();
```

Branch accounting:
- Success: set_vacuumed () writes VACUUM_BLOCK_STATUS_VACUUMED (0x8000...) into the 2-bit status mask (VACUUM_BLOCK_STATUS_MASK = 0xC000000000000000) and clears VACUUM_BLOCK_FLAG_INTERRUPTED (0x2000...); a block interrupted once and finished on retry must not carry a stale flag.
- Failure: set_interrupted () sets status VACUUM_BLOCK_STATUS_AVAILABLE (all-zero) and raises INTERRUPTED: re-dispatchable, with history. Severity is graded: only thread_p->shutdown (SERVER_MODE) or ER_INTERRUPTED (SA_MODE) is legitimate; anything else logs ERROR but proceeds identically.
- Queue produce failure: release-mode “should never happen”, because the master caps outstanding jobs at VACUUM_MAX_TASKS_IN_WORKER_POOL (3 × PRM_ID_VACUUM_WORKER_COUNT), far below VACUUM_FINISHED_JOB_QUEUE_CAPACITY (2048). The handling is deliberate data loss with a loud log: the entry stays IN_PROGRESS, rescued only by boot-time recovery (Chapter 11).
- Half-full wakeup: the producer nudges the master when vacuum_Finished_job_queue->is_half_full (). The master’s own vacuum_master_task::should_force_data_update polls the same queue and vacuum_Block_data_buffer. Producer nudges, consumer polls; neither alone is load-bearing.

Invariant 9-A: one queue entry per dispatched job, flags already resolved. Every dispatched job calls vacuum_finished_block_vacuum exactly once; the end: label guards it with if (!sa_mode_partial_block), since partial-block jobs never existed in vacuum data. The pushed blockid carries its final status in its high bits; the queue element is the entire worker-to-master report. If a job could exit without producing, its entry would stay IN_PROGRESS, index_unvacuumed would never pass it, and keep_from_log_pageid would freeze, causing unbounded archive growth.
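The flag arithmetic behind set_vacuumed and set_interrupted can be sketched with plain 64-bit masks. The status and flag constants are the values quoted above; the combined flags mask (status bits plus INTERRUPTED), the unsigned type, and the free-function form are assumptions of this sketch, since the real code operates on VACUUM_DATA_ENTRY methods.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Status and flag values quoted in the text. */
#define STATUS_MASK      UINT64_C(0xC000000000000000)
#define STATUS_VACUUMED  UINT64_C(0x8000000000000000)
#define STATUS_AVAILABLE UINT64_C(0x0000000000000000)
#define FLAG_INTERRUPTED UINT64_C(0x2000000000000000)
/* Assumption: everything outside status + flag bits is the block number. */
#define FLAGS_MASK       (STATUS_MASK | FLAG_INTERRUPTED)

/* Replace the status with VACUUMED and drop any stale INTERRUPTED flag. */
static uint64_t
set_vacuumed (uint64_t blockid)
{
  return (blockid & ~FLAGS_MASK) | STATUS_VACUUMED;
}

/* Status back to AVAILABLE (all-zero) with the INTERRUPTED flag raised. */
static uint64_t
set_interrupted (uint64_t blockid)
{
  return (blockid & ~FLAGS_MASK) | STATUS_AVAILABLE | FLAG_INTERRUPTED;
}

/* Strip status and flag bits, leaving the bare block number. */
static uint64_t
without_flags (uint64_t blockid)
{
  return blockid & ~FLAGS_MASK;
}

static bool
is_vacuumed (uint64_t blockid)
{
  return (blockid & STATUS_MASK) == STATUS_VACUUMED;
}

static bool
was_interrupted (uint64_t blockid)
{
  return (blockid & FLAG_INTERRUPTED) != 0;
}
```

Because the outcome lives in the same 64-bit word as the block number, a single queue element is enough to carry the whole report from worker to master.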
9.2 What INTERRUPTED changes on the next attempt
The bit survives in the vacuum data entry (section 9.3) and is consumed twice on redispatch: vacuum_job_cursor::start_job_on_current_entry skips appending RVVAC_START_JOB when entry.was_interrupted () (the entry is already known to be partially executed, Chapter 6), and vacuum_process_log_block passes it into the heap pass:
```c
// vacuum_process_log_block -- src/query/vacuum.c
was_interrupted = data->was_interrupted () || sa_mode_partial_block;
// ... condensed ...
error_code = vacuum_heap (thread_p, worker, threshold_mvccid, was_interrupted);
```

A re-run replays the whole block’s log, revisiting heap pages the first attempt already cleaned. With was_interrupted == true, vacuum_heap_page tolerates pages whose vacuum status says “nothing to do”; without the flag those states would be treated as corruption (Chapter 8).
9.3 vacuum_data_mark_finished — the master consumes the queue
Runs inside vacuum_data::update on the master (and once more from vacuum_finalize to drain stragglers); it is the only runtime writer that moves entries out of IN_PROGRESS, and recovery replays the same transitions (section 9.5). Shape: drain the queue into a stack buffer (at most VACUUM_FINISHED_JOB_QUEUE_CAPACITY elements; zero consumed → return 0 without fixing a page), qsort with vacuum_compare_blockids (which strips the flag bits) so array order matches physical entry order, then walk vacuum data pages and the sorted array in lockstep.
flowchart TD
A["consume queue, qsort,<br/>start at first_page"] --> D["inner loop: mark each blockid<br/>falling inside current page"]
D --> E{"any change<br/>in this page?"}
E -->|no| H
E -->|yes| F["advance index_unvacuumed<br/>past leading vacuumed entries"]
F --> G{"index_unvacuumed ==<br/>index_free?"}
G -->|"yes: page empty"| EP["vacuum_data_empty_page<br/>no FINISHED_BLOCKS log"]
EP --> EPN{"data_page == NULL?"}
EPN -->|"yes, blocks left unmatched"| ERR1["assert(false)<br/>log + return"]
EPN -->|"yes, all matched"| DONE
EPN -->|"no: moved to next page"| D
G -->|"no: page has data"| CMP["memmove-compact if last page;<br/>log RVVAC_DATA_FINISHED_BLOCKS"]
CMP --> H{"index ==<br/>n_finished_blocks?"}
H -->|yes| DONE["unfix pages"]
H -->|"no, next_page is NULL"| ERR2["assert(false)<br/>log + return"]
H -->|"no, follow next_page"| NXT["fix next page<br/>(fix failure: third early return)"]
NXT --> D
DONE --> K["vacuum_update_keep_from_log_pageid"]
Figure 9-2. vacuum_data_mark_finished, branch-complete. All three early-return exits skip the log-floor update.
The marking loop relies on blockids inside one page being contiguous (asserted as page_free_blockid == data[index_free - 1].get_blockid () + 1):
```c
// vacuum_data_mark_finished -- src/query/vacuum.c
while ((index < n_finished_blocks)
       && ((blockid = VACUUM_BLOCKID_WITHOUT_FLAGS (finished_blocks[index])) < page_free_blockid))
  {
    data = page_unvacuumed_data + (blockid - page_unvacuumed_blockid);  /* direct index, no search */
    assert (data->get_blockid () == blockid);
    assert (data->is_job_in_progress ());  /* only dispatched jobs may report back */
    if (VACUUM_BLOCK_STATUS_IS_VACUUMED (finished_blocks[index]))
      data->set_vacuumed ();
    else
      data->set_interrupted ();  /* AVAILABLE + INTERRUPTED, redispatchable */
    index++;
  }
```

VACUUM_BLOCKID_WITHOUT_FLAGS strips flags for addressing; the flagged value decides VACUUMED vs INTERRUPTED. A page with no matched reports falls through untouched and unlogged. After marking, the index_unvacuumed advance walks only over leading is_vacuumed () entries; an INTERRUPTED entry stops the walk and stays visible for redispatch. Two outcomes follow:
- Page emptied (index_unvacuumed == index_free): the page is handed to vacuum_data_empty_page (section 9.4). No RVVAC_DATA_FINISHED_BLOCKS record is written here; the page’s marking is recovered indirectly through the reset / splice records that vacuum_data_empty_page emits, since after recovery the page is reborn empty or dropped.
- Page retains data: only the last page is compacted in place by memmove (new blocks append there, Chapter 4); interior pages are never compacted. Then one RVVAC_DATA_FINISHED_BLOCKS redo record is appended, payload exactly the slice finished_blocks[page_start_index .. index).
Both “report with no matching entry” exits and a failed page fix return early without updating the log floor — better a stale floor than a wrong one.
Invariant 9-B: index_unvacuumed never points past an unvacuumed entry, and everything before it is VACUUMED. Enforced by the strictly local advance loop (and its twin in redo). The log-floor derivation and the job cursor’s restart position both read the first unvacuumed entry; a vacuumed entry left at the watermark would pin the log floor forever, and a skipped INTERRUPTED entry would never be re-vacuumed.
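The sort, direct-index marking, watermark advance, and last-page compaction can be put together in a small single-page model. This is a sketch under assumptions, not the CUBRID routine: toy_page, its fixed 8-entry array, the combined flags mask, and toy_mark_finished are hypothetical, and logging, page latching, and the hand-off of an emptied page to vacuum_data_empty_page are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FLAGS_MASK       UINT64_C(0xE000000000000000)  /* assumed: status bits + INTERRUPTED */
#define STATUS_MASK      UINT64_C(0xC000000000000000)
#define STATUS_VACUUMED  UINT64_C(0x8000000000000000)
#define FLAG_INTERRUPTED UINT64_C(0x2000000000000000)

typedef struct
{
  uint64_t data[8];      /* entries: block number with flags in the high bits */
  int index_unvacuumed;  /* watermark: first entry not known to be vacuumed */
  int index_free;        /* one past the last used entry */
  int is_last_page;
} toy_page;

/* Order reports by block number, ignoring the flag bits. */
static int
compare_blockids (const void *a, const void *b)
{
  uint64_t x = *(const uint64_t *) a & ~FLAGS_MASK;
  uint64_t y = *(const uint64_t *) b & ~FLAGS_MASK;
  return (x > y) - (x < y);
}

/* Apply finished[0..n) to one non-empty page; returns how many reports it consumed. */
static int
toy_mark_finished (toy_page *page, uint64_t *finished, int n)
{
  int index = 0;
  uint64_t first, free_blockid;

  assert (page->index_unvacuumed < page->index_free);
  first = page->data[page->index_unvacuumed] & ~FLAGS_MASK;
  free_blockid = (page->data[page->index_free - 1] & ~FLAGS_MASK) + 1;

  qsort (finished, (size_t) n, sizeof (finished[0]), compare_blockids);
  while (index < n && (finished[index] & ~FLAGS_MASK) < free_blockid)
    {
      uint64_t blockid = finished[index] & ~FLAGS_MASK;
      uint64_t *entry;

      assert (blockid >= first);
      /* Block numbers are contiguous inside a page, so this is a direct index. */
      entry = &page->data[page->index_unvacuumed + (int) (blockid - first)];
      assert ((*entry & ~FLAGS_MASK) == blockid);
      if ((finished[index] & STATUS_MASK) == STATUS_VACUUMED)
        *entry = blockid | STATUS_VACUUMED;
      else
        *entry = blockid | FLAG_INTERRUPTED;  /* AVAILABLE + INTERRUPTED */
      index++;
    }

  /* Advance the watermark over leading VACUUMED entries only. */
  while (page->index_unvacuumed < page->index_free
         && (page->data[page->index_unvacuumed] & STATUS_MASK) == STATUS_VACUUMED)
    {
      page->index_unvacuumed++;
    }

  /* Only a last page that still holds data is compacted in place. */
  if (page->is_last_page && page->index_unvacuumed > 0
      && page->index_unvacuumed < page->index_free)
    {
      int remaining = page->index_free - page->index_unvacuumed;
      memmove (page->data, page->data + page->index_unvacuumed,
               (size_t) remaining * sizeof (page->data[0]));
      page->index_free = remaining;
      page->index_unvacuumed = 0;
    }
  return index;
}
```

In this model an INTERRUPTED report stops the watermark exactly where the unfinished block sits, so the entry stays both redispatchable and responsible for holding the log floor.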
9.4 Physical shrink — vacuum_data_empty_page
The function receives a fully-vacuumed page (asserts index_unvacuumed == index_free) and distinguishes three cases, numbered the same way in the source comment:
Case 1 (last page — also covers first == last) never deallocates; the file always keeps at least one page. The page is reset via vacuum_init_data_page_with_last_blockid, which reinitializes the header and stores vacuum_Data.get_last_blockid () into slot 0 (data_page->data->blockid = blockid), logging RVVAC_DATA_INIT_NEW_PAGE; *data_page = NULL tells the caller there is no next page. An empty table thus remembers the last block it ever held — how vacuum_data::get_last_blockid stays correct across restarts and how Chapter 4’s append path detects already-consumed blocks.
Case 2 (first page, more pages exist) must repoint the file descriptor’s vpid_first before deallocating; a crash between the two would leave the boot loader (Chapter 2) pointing at a deallocated page. Order under a sysop: fix next page, sysop start, file_descriptor_update, swap vacuum_Data.first_page, file_dealloc old first, sysop commit. Errors are graded — a failed next-page fix aborts before any mutation; a failed file_descriptor_update aborts the sysop with nothing visible; a failed file_dealloc aborts and manually re-swaps vacuum_Data.first_page / vacuum_Data_load.vpid_first back (release-mode damage control for a path that asserts in debug).
Case 3 (interior page) requires prev_data_page (NULL → assert, unfix, early return); inside a sysop it deallocates the page (dealloc failure → sysop abort, return) and splices the list with an undoredo RVVAC_DATA_SET_LINK on the previous page — undo restores the old link on abort. It then re-fixes prev_data_page->next_page so the caller’s scan continues on the page that followed.
The companion vacuum_data_empty_update_last_blockid (asserts vacuum_is_empty () and first_page == last_page) re-runs the slot-0 init to persist the freshest last_blockid. It is called from vacuum_finalize (non-SERVER builds) and from vacuum_sa_reflect_last_blockid, which first copies logpb_last_complete_blockid () into vacuum_Data and log_Gl.hdr.vacuum_last_blockid — keeping an offline-restarted server from re-consuming blocks it already covered.
9.5 Recovery of completion — vacuum_rv_redo_data_finished and friends
RVVAC_DATA_FINISHED_BLOCKS redo replays section 9.3’s page mutation from the logged blockid slice, including the watermark advance and last-page compaction, which were not logged separately because they are deterministic functions of page content:
```c
// vacuum_rv_redo_data_finished -- src/query/vacuum.c
if (rcv_data_ptr != NULL)
  {
    // ... condensed: per logged blockid_with_flags, locate entry by
    // (blockid - page_unvacuumed_blockid) and set_vacuumed () / set_interrupted () ...
  }
while (data_page->index_unvacuumed < data_page->index_free
       && data_page->data[data_page->index_unvacuumed].is_vacuumed ())
  {
    data_page->index_unvacuumed++;  /* same watermark advance as runtime */
  }
if (VPID_ISNULL (&data_page->next_page) && data_page->index_unvacuumed > 0)
  {
    /* Remove all vacuumed blocks. */
    // ... condensed: memmove compaction, index_free -= index_unvacuumed, index_unvacuumed = 0 ...
  }
```

The rcv_data_ptr != NULL guard is defensive: the record’s only appender (section 9.3) always logs a non-empty slice. What matters is that the advance and last-page compaction run unconditionally after marking, recomputed from page content rather than from the payload. vacuum_rv_redo_data_finished_dump pretty-prints the payload; its two strings (“vacuumed” vs “available and interrupted”) are the codebase’s most concise statement of the two flag combinations.
vacuum_rv_redo_vacuum_complete recovers the SA_MODE-only RVVAC_COMPLETE record (appended by xvacuum after an offline full vacuum, payload log_Gl.hdr.mvcc_next_id): it installs the logged MVCCID into vacuum_Data.oldest_unvacuumed_mvccid and calls logpb_vacuum_reset_log_header_cache to wipe the header’s partial-block bookkeeping (Chapter 3) — after a complete offline vacuum there is nothing left for vacuum to remember.
9.6 Closing the loop — the log floor and the archive gate
vacuum_data::update is the master’s once-per-iteration consolidation, invoked through vacuum_job_cursor::force_data_update (which unloads the cursor first; see Chapter 6):
```c
// vacuum_data::update -- src/query/vacuum.c
// first remove vacuumed blocks
mark_finished = vacuum_data_mark_finished (thread_p);
// then consume new generated blocks
vacuum_consume_buffer_log_blocks (thread_p);  // Chapter 4
if (!vacuum_Data.is_empty ())
  {
    upgrade_oldest_unvacuumed (get_first_entry ().oldest_visible_mvccid);
  }
```

upgrade_oldest_unvacuumed asserts monotonicity, which is valid because entries are appended in MVCCID order and retired from the front; on an empty table the watermark is left alone rather than guessed. The log floor itself is recomputed at the tail of every successful vacuum_data_mark_finished, and once at boot at the end of vacuum_data_load_and_recover:
```c
// vacuum_update_keep_from_log_pageid -- src/query/vacuum.c
if (vacuum_is_empty ())
  {
    // keep starting with next after last_blockid ()
    vacuum_Data.keep_from_log_pageid = VACUUM_FIRST_LOG_PAGEID_IN_BLOCK (vacuum_Data.get_last_blockid () + 1);
  }
else
  {
    vacuum_Data.keep_from_log_pageid = VACUUM_FIRST_LOG_PAGEID_IN_BLOCK (vacuum_Data.get_first_blockid ());
  }
// ... condensed: er_log; if (!is_archive_removal_safe) is_archive_removal_safe = true; /* set once */
```

VACUUM_FIRST_LOG_PAGEID_IN_BLOCK (b) is just b * vacuum_Data.log_block_npages: the floor is always a block boundary, namely the first unvacuumed block when the table has entries, or the block after the remembered last_blockid when empty.
Invariant 9-C: log pages below keep_from_log_pageid are never needed by vacuum again. Every entry removed (or page dropped) held a VACUUMED block, and INTERRUPTED entries, whose log is still needed, hold the floor down by staying at or above index_unvacuumed (Invariant 9-B). This is the contract that makes archive deletion safe.
Two read-side gates expose the floor to the log layer:
- vacuum_min_log_pageid_to_keep returns the floor, with two overrides: PRM_ID_DISABLE_VACUUM returns 0 (keep everything; a debug aid), and SA_MODE after xvacuum sets is_vacuum_complete returns NULL_PAGEID (keep nothing for vacuum). Consumers: logpb_remove_archive_logs_exceed_limit and logpb_remove_archive_logs (in log_page_buffer.c) bound which archive numbers may go; logpb_backup uses it to decide which archives a backup must include.
- vacuum_is_safe_to_remove_archives returns is_archive_removal_safe. Both purge functions check it first and refuse to delete anything while false. Flag and floor boot as false / NULL_PAGEID; the flag flips only inside vacuum_update_keep_from_log_pageid (normally first at the end of vacuum_data_load_and_recover), so purge stays blocked until vacuum data is loaded and a real floor exists.
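The floor computation and the first read-side gate can be modeled in a few lines. This is a hedged sketch: toy_vacuum_data and its fields, the -1 value used for the null page id, and the order in which the two overrides are checked are assumptions for illustration, with plain booleans standing in for PRM_ID_DISABLE_VACUUM and the SA_MODE is_vacuum_complete flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NULL_PAGEID ((int64_t) -1)  /* assumed sentinel value */

typedef struct
{
  bool is_empty;                 /* no entries in vacuum data */
  int64_t first_blockid;         /* first unvacuumed block (valid when !is_empty) */
  int64_t last_blockid;          /* last block ever held */
  int64_t log_block_npages;      /* log pages per vacuum block */
  int64_t keep_from_log_pageid;  /* the log floor */
  bool is_archive_removal_safe;  /* purge gate, flipped once */
  bool vacuum_disabled;          /* stands in for PRM_ID_DISABLE_VACUUM */
  bool sa_vacuum_complete;       /* stands in for SA_MODE is_vacuum_complete */
} toy_vacuum_data;

/* Recompute the floor: always the first log page of some block. */
static void
toy_update_keep_from (toy_vacuum_data *vd)
{
  int64_t block = vd->is_empty ? vd->last_blockid + 1 : vd->first_blockid;

  assert (vd->log_block_npages > 0);
  vd->keep_from_log_pageid = block * vd->log_block_npages;
  vd->is_archive_removal_safe = true;  /* set once, never cleared */
}

/* Floor as seen by the log layer, with the two overrides. */
static int64_t
toy_min_log_pageid_to_keep (const toy_vacuum_data *vd)
{
  if (vd->vacuum_disabled)
    {
      return 0;  /* keep everything */
    }
  if (vd->sa_vacuum_complete)
    {
      return TOY_NULL_PAGEID;  /* vacuum needs nothing */
    }
  return vd->keep_from_log_pageid;
}
```

A purger in this model would first consult is_archive_removal_safe and only then compare archive page ranges against the returned floor.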
The periodic driver lives in log_manager.c: the “log-rm-archive” daemon runs log_remove_log_archive_daemon_task::execute, calling logpb_remove_archive_logs_exceed_limit either on the configured PRM_ID_REMOVE_LOG_ARCHIVES_INTERVAL (one archive per tick, max_count = 1) or unthrottled (max_count = 0) when no interval is set.
Finally, vacuum_is_work_in_progress (any vacuum_Workers[i].state not VACUUM_WORKER_STATE_INACTIVE; SA_MODE trivially false) is the shutdown barrier: vacuum_finalize asserts it before draining — only when no worker can still produce is the final vacuum_data_mark_finished plus queue-empty assert meaningful, and only then may the queue be deleted.
9.7 Chapter summary — key takeaways
- A job’s outcome is encoded in the blockid’s own high bits, via set_vacuumed (VACUUMED, INTERRUPTED cleared) or set_interrupted (AVAILABLE + INTERRUPTED), and pushed through vacuum_Finished_job_queue; the queue element is the entire worker-to-master report (Invariant 9-A).
- INTERRUPTED means AVAILABLE-with-history, not dead: the next dispatch skips RVVAC_START_JOB logging and passes was_interrupted into vacuum_heap, relaxing safeguards that would otherwise treat re-vacuumed pages as corruption.
- vacuum_data_mark_finished drains the queue in sorted batches, marks entries by direct contiguous-blockid arithmetic, advances index_unvacuumed past leading VACUUMED entries only (Invariant 9-B), logs one RVVAC_DATA_FINISHED_BLOCKS per surviving touched page (an emptied page is recovered via its reset/splice records instead), and treats unmatched reports as a loud early return that leaves the log floor untouched.
- Physical shrink has three cases: last page reset in place (preserving last_blockid in slot 0), first page swapped out under a sysop that updates the file descriptor before deallocating, interior pages spliced out with an undoredo link update.
- Redo replays marking from the logged flagged-blockid slice and recomputes the watermark and last-page compaction, which are deterministic from page content and so never logged separately.
- The log floor keep_from_log_pageid is always a block boundary, and pages below it are never needed by vacuum again (Invariant 9-C). vacuum_min_log_pageid_to_keep serves it to the archive purgers (and backup); vacuum_is_safe_to_remove_archives keeps purge blocked until the floor is first computed, normally at boot in vacuum_data_load_and_recover; the periodic purge is driven by log_remove_log_archive_daemon_task in log_manager.c.
- vacuum_is_work_in_progress makes shutdown draining sound: no producer may be alive when vacuum_finalize performs the last vacuum_data_mark_finished and asserts the queue empty.
Chapter 10: The Dropped Files Ledger
Chapters 7 and 8 showed workers calling vacuum_is_file_dropped for every
log record before touching a heap page or b-tree. This chapter traces the
other side: who writes that ledger, why a DROP TABLE committing mid-job
cannot crash a worker holding collected OIDs for the file, and how the
ledger is trimmed. Design rationale: companion, “Dropped-file table”
under Common DBMS Design.
10.1 On-disk shape, globals, and the support structs
The ledger is a chain of PAGE_DROPPED_FILES pages rooted at
vacuum_Dropped_files_vpid, each a VFID-sorted entry array:
```c
// vacuum_dropped_file -- src/query/vacuum.c
struct vacuum_dropped_file
{
  VFID vfid;
  MVCCID mvccid;
};

// vacuum_dropped_files_page -- src/query/vacuum.c
struct vacuum_dropped_files_page
{
  VPID next_page;                        /* VPID of next dropped files page. */
  INT16 n_dropped_files;                 /* Number of entries on page */
  VACUUM_DROPPED_FILE dropped_files[1];  /* Dropped files. */
};
```

| Field | Role | Why it exists |
|---|---|---|
| entry vfid | Dropped file (heap or b-tree). | Sort/search key; one entry per VFID across reuse (10.3). |
| entry mvccid | Borderline sampled from log_Gl.hdr.mvcc_next_id at notify time. | Strictly-older records belong to the dead file; >= to a reusing successor (Invariant 10-B). |
| page next_page | Link to next ledger page. | Singly linked chain; link changes logged separately (10.3). |
| page n_dropped_files | Entry count on this page. | Bounds bsearch/memmove; full at VACUUM_DROPPED_FILES_PAGE_CAPACITY. |
| page dropped_files[1] | Flexible entry array filling DB_PAGESIZE. | Sorted by vacuum_compare_dropped_files so lookup is bsearch (Invariant 10-A). |
| rcv vfid, class_oid | VACUUM_DROPPED_FILES_RCV_DATA, payload of RVVAC_NOTIFY_DROPPED_FILE. | MVCCID deliberately absent (sampled at apply time, 10.5); non-NULL class triggers heap_delete_hfid_from_cache. |
| track next_tracked_page, dropped_data_page | Debug-only (!NDEBUG) VACUUM_TRACK_DROPPED_FILES: one malloced copy per disk page. | memcpy-refreshed on every mutation so a debugger can walk the ledger without fixing pages; never read by production logic. |
Static globals: vacuum_Dropped_files_vfid / _vpid,
vacuum_Dropped_files_count (fast-path filter, 10.6),
vacuum_Dropped_files_loaded. The handshake trio
vacuum_Dropped_files_version / _mutex / vacuum_Last_dropped_vfid
(10.5) is non-static, though nothing else links to it.
VACUUM_DROPPED_FILE_FLAG_DUPLICATE (0x8000) is defined but unused
— duplicates are handled by the replace record (10.3), not a flag.
flowchart LR
subgraph DISK["disk chain (PAGE_DROPPED_FILES)"]
P1["page 1<br/>n_dropped_files, sorted entries"] -->|next_page| P2["page 2"]
end
G["vacuum_Dropped_files_vpid"] --> P1
C["vacuum_Dropped_files_count"] -.sum of n_dropped_files.- DISK
subgraph HS["handshake globals"]
V["vacuum_Dropped_files_version"]
LV["vacuum_Last_dropped_vfid"]
M["vacuum_Dropped_files_mutex"]
end
Figure 10-1 — Ledger pages and the globals that root them.
vacuum_load_dropped_files_from_disk fills the in-memory side at boot or
lazily during recovery (10.3): already loaded → assert + NO_ERROR; stale
nonzero count → assert (false) + reset; else a read-latched walk sums
n_dropped_files into the global count (debug builds also build the track
list; a failed malloc frees the partial list, ER_OUT_OF_VIRTUAL_MEMORY),
then sets vacuum_Dropped_files_loaded.
10.2 Registration — vacuum_log_add_dropped_file and the POSTPONE/UNDO selector
Droppers never touch ledger pages directly. vacuum_log_add_dropped_file
returns immediately under PRM_ID_DISABLE_VACUUM (ledger never written,
never consulted); otherwise it packs VFID + class OID (NULL OID when no
class) and appends RVVAC_NOTIFY_DROPPED_FILE as one of two flavors:
- a postpone record (VACUUM_LOG_ADD_DROPPED_FILE_POSTPONE = true) when an existing file is dropped: the file dies only if the transaction commits;
- an undo record (VACUUM_LOG_ADD_DROPPED_FILE_UNDO) when a file is created: on abort the new file is garbage.
Callers: heap destroy/create in heap_file.c (xheap_destroy appends the
postpone before file_postpone_destroy, so at commit the notify
handshake completes before the file is destroyed), b-tree drop/create in
btree.c / btree_load.c, and the index-load sort path in
external_sort.c (the b-tree file under construction, undo flavor — not
its temporary sort files). The actual insert happens later, inside
vacuum_rv_notify_dropped_file (10.5).
10.3 The page walk — vacuum_add_dropped_file branch by branch
flowchart TD
A["enter vacuum_add_dropped_file"] --> B{"vacuum_Dropped_files_loaded?"}
B -- no --> C["assert !LOG_ISRESTARTED<br/>load from disk; fail -> ER_FAILED"]
B -- yes --> D
C --> D["fix page write-latched, walk chain"]
D --> E{"util_bsearch found vfid?"}
E -- yes --> F["overwrite mvccid<br/>log RVVAC_DROPPED_FILE_REPLACE undoredo<br/>set dirty FREE; return NO_ERROR"]
E -- no --> G{"page full?"}
G -- yes --> H["advance to next_page<br/>advance debug track node"] --> D
G -- no --> I["memmove tail right at position<br/>n_dropped_files++; ATOMIC_INC count<br/>log RVVAC_DROPPED_FILE_ADD undoredo<br/>set dirty FREE; return NO_ERROR"]
D --> J{"chain exhausted?"}
J -- yes --> K["file_alloc new PAGE_DROPPED_FILES<br/>fail -> unfix, ER_FAILED"]
K --> L["init: next NULL, count 1, entry[0]<br/>log RVPGBUF_NEW_PAGE redo; set dirty FREE"]
L --> M["vacuum_dropped_files_set_next_page on old last page<br/>logs RVVAC_DROPPED_FILE_NEXT_PAGE undoredo"]
M --> N["unfix; return NO_ERROR"]
Figure 10-2 — vacuum_add_dropped_file. Every exit is replace, in-page
insert, new-page append, or an assert-backed ER_FAILED.
The replace branch is VFID reuse — dropped before, recycled, and the reincarnation now dropped too; one entry per VFID, newest borderline wins:
```c
// vacuum_add_dropped_file -- src/query/vacuum.c
undo_data = page->dropped_files[position];
save_mvccid = page->dropped_files[position].mvccid;
page->dropped_files[position].mvccid = mvccid;
assert_release (MVCC_ID_FOLLOW_OR_EQUAL (mvccid, save_mvccid));
log_append_undoredo_data (thread_p, RVVAC_DROPPED_FILE_REPLACE, &addr, /* ... before/after entry ... */);
```

The insert branch logs RVVAC_DROPPED_FILE_ADD with undo length 0:
undoing an add needs only the position (addr.offset); redo carries the
full entry. The new-page branch chains via
vacuum_dropped_files_set_next_page, which logs old/new VPID as an
undoredo pair before assigning page_p->next_page (and on the new-page
branch a failed track-node malloc unfixes both pages, ER_FAILED).
Invariant 10-A: per-page VFID sort order. Entries are sorted by vacuum_compare_dropped_files (fileid, then volid); inserts land at the util_bsearch position, and recovery replays the same positional insert. If violated, the bsearch in vacuum_find_dropped_file misses entries and workers vacuum pages of a dropped file, dereferencing freed extents.
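The replace-or-insert step on a single ledger page can be sketched as below. It assumes a hand-rolled lower-bound search in place of util_bsearch, a hypothetical 8-entry page, and toy_ type names; logging, the global count, and the walk to the next page when the current one is full are omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { int32_t fileid; int16_t volid; } toy_vfid;
typedef struct { toy_vfid vfid; uint64_t mvccid; } toy_dropped_file;

#define TOY_CAPACITY 8  /* hypothetical page capacity */
typedef struct { int n; toy_dropped_file files[TOY_CAPACITY]; } toy_page;

/* Sort key: fileid first, then volid. */
static int
toy_compare (const toy_vfid *a, const toy_vfid *b)
{
  if (a->fileid != b->fileid) return a->fileid < b->fileid ? -1 : 1;
  if (a->volid != b->volid) return a->volid < b->volid ? -1 : 1;
  return 0;
}

/* Lower-bound binary search; *found is set when the key is already present. */
static int
toy_position (const toy_page *p, const toy_vfid *key, int *found)
{
  int lo = 0, hi = p->n;

  while (lo < hi)
    {
      int mid = lo + (hi - lo) / 2;
      if (toy_compare (&p->files[mid].vfid, key) < 0)
        lo = mid + 1;
      else
        hi = mid;
    }
  *found = lo < p->n && toy_compare (&p->files[lo].vfid, key) == 0;
  return lo;
}

/* Returns 1 on replace, 0 on in-page insert, -1 when the page is full. */
static int
toy_add (toy_page *p, toy_vfid vfid, uint64_t mvccid)
{
  int found;
  int pos = toy_position (p, &vfid, &found);

  if (found)
    {
      assert (mvccid >= p->files[pos].mvccid);  /* newest borderline wins */
      p->files[pos].mvccid = mvccid;
      return 1;
    }
  if (p->n == TOY_CAPACITY)
    {
      return -1;  /* caller would move on to the next page */
    }
  /* Open a hole at the sorted position, then fill it. */
  memmove (&p->files[pos + 1], &p->files[pos],
           (size_t) (p->n - pos) * sizeof (p->files[0]));
  p->files[pos].vfid = vfid;
  p->files[pos].mvccid = mvccid;
  p->n++;
  return 0;
}
```

Because the insert position is recorded rather than recomputed, a redo of the same add only needs the position and the entry to reproduce the page exactly.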
10.4 The recovery records and the MVCCID borderline
| rcvindex | undofun | redofun | payload |
|---|---|---|---|
| RVVAC_NOTIFY_DROPPED_FILE | vacuum_rv_notify_dropped_file | same (as run-postpone) | vfid + class_oid |
| RVVAC_DROPPED_FILE_ADD | vacuum_rv_undo_add_dropped_file | vacuum_rv_redo_add_dropped_file | redo: entry; undo: none (position in rcv->offset) |
| RVVAC_DROPPED_FILE_REPLACE | vacuum_rv_replace_dropped_file | same | before/after entry |
| RVVAC_DROPPED_FILE_NEXT_PAGE | vacuum_rv_set_next_page_dropped_files | same | old/new VPID |
| RVVAC_DROPPED_FILE_CLEANUP | none (redo-only) | vacuum_rv_redo_cleanup_dropped_files | n_indexes + descending index array (10.8) |
(New ledger pages ride the generic RVPGBUF_NEW_PAGE redo, 10.3.)
vacuum_rv_redo_add_dropped_file opens the slot (position < n →
memmove), copies the entry, and n_dropped_files++; > n is a logged
assert + ER_FAILED. vacuum_rv_undo_add_dropped_file mirrors it
(position >= n → ER_FAILED; else memmove the tail down,
n_dropped_files--). The replace function is symmetric — one function for
both phases (before-image on undo, after-image on redo) — and validates
position < n and VFID_EQ with the on-page entry. Both redo-side
functions end with the same guard:
```cpp
// vacuum_rv_redo_add_dropped_file -- src/query/vacuum.c
if (!MVCC_ID_PRECEDES (dropped_file->mvccid, log_Gl.hdr.mvcc_next_id))
  {
    log_Gl.hdr.mvcc_next_id = dropped_file->mvccid;
    MVCCID_FORWARD (log_Gl.hdr.mvcc_next_id);   /* <- keep the borderline ahead of every ledger MVCCID */
  }
```

Invariant 10-B: the borderline rule. Every ledger mvccid strictly precedes log_Gl.hdr.mvcc_next_id. Committed changes to the old file carry smaller MVCCIDs; any VFID-reusing transaction starts later and carries MVCCIDs >= the borderline. Redo re-asserts the rule by forwarding mvcc_next_id. Violated, the test in 10.6 misclassifies new-file records as dead (lost vacuum) or dead-file records as live (use-after-free).
10.5 The version handshake — publishing a drop without stalling workers
vacuum_rv_notify_dropped_file executes RVVAC_NOTIFY_DROPPED_FILE both
as run-postpone and as undo (same function in the recovery.c table). It
samples the borderline, inserts, then notifies workers:
```cpp
// vacuum_rv_notify_dropped_file -- src/query/vacuum.c
mvccid = ATOMIC_LOAD_64 (&log_Gl.hdr.mvcc_next_id);   /* <- borderline sampled NOW, not at drop statement */
error = vacuum_add_dropped_file (thread_p, &rcv_data->vfid, mvccid);
// ... condensed: error -> return; then ...
vacuum_notify_all_workers_dropped_file (rcv_data->vfid, mvccid);
if (!OID_ISNULL (class_oid))
  {
    (void) heap_delete_hfid_from_cache (thread_p, class_oid);
  }
```

The notify step is where “without stalling transactions” is bought: the
dropper pays, never the workers. Its body is SERVER_MODE-only — in SA
mode there are no concurrent workers and it compiles to a no-op:
```cpp
// vacuum_notify_all_workers_dropped_file -- src/query/vacuum.c
if (!LOG_ISRESTARTED ())
  {
    return;
  }                             /* <- workers are not running during recovery */
pthread_mutex_lock (&vacuum_Dropped_files_mutex);
VFID_COPY (&vacuum_Last_dropped_vfid, &vfid_dropped);   /* <- one drop published at a time */
my_version = ++vacuum_Dropped_files_version;
for (workers_min_version = vacuum_get_worker_min_dropped_files_version ();
     workers_min_version != -1 && workers_min_version < my_version;
     workers_min_version = vacuum_get_worker_min_dropped_files_version ())
  {
    thread_sleep (1);           /* <- dropper spins; workers never block */
  }
VFID_SET_NULL (&vacuum_Last_dropped_vfid);
pthread_mutex_unlock (&vacuum_Dropped_files_mutex);
```

vacuum_get_worker_min_dropped_files_version scans vacuum_Workers[],
considering only state != VACUUM_WORKER_STATE_INACTIVE; -1 (none
active) exits the loop immediately. The worker side is the version gate
at the top of vacuum_process_log_record (Chapter 7):
```cpp
// vacuum_process_log_record -- src/query/vacuum.c
if (worker->drop_files_version != vacuum_Dropped_files_version)
  {
    VFID_COPY (&vfid, &vacuum_Last_dropped_vfid);
    vacuum_cleanup_collected_by_vfid (worker, &vfid);   /* <- purge BEFORE acknowledging */
    worker->drop_files_version = vacuum_Dropped_files_version;
  }
error_code = vacuum_is_file_dropped (thread_p, is_file_dropped, &vacuum_info->vfid, *mvccid);
```

The ordering is the point: a worker first discards heap objects already
collected for vacuum_Last_dropped_vfid, then advances its own
drop_files_version so the dropper’s min-scan sees it; only after every
in-flight worker has done so does the dropper’s commit proceed to destroy
the file. The counter is a free-running INT32, so
vacuum_compare_dropped_files_version makes the min-scan wraparound-safe
(same-sign values compare by plain a - b; for mixed signs the value in
the extreme positive quarter >= 0x3FFFFFFF is treated as the older,
pre-wraparound side).
Invariant 10-C: no worker outruns the ledger. When
vacuum_notify_all_workers_dropped_file returns, every worker active at publication has purged its collected objects for the dropped VFID and will see the entry on its next check; later workers read it from disk. Enforced by the mutex (one drop in flight), vacuum_Last_dropped_vfid as the purge target, and the min-version spin. If broken, Chapter 8’s heap executor fixes pages of a deallocated file.
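The dropper's wait condition can be sketched in isolation. This is a single-threaded model under stated assumptions: `Worker`, `min_active_version` and `dropper_must_wait` are hypothetical names, and versions are compared with a plain `<` rather than the wraparound-safe comparison the real code uses.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical worker model; only the fields the handshake reads.
struct Worker { bool active; int32_t drop_files_version; };

// Minimum acknowledged version over active workers, or -1 when none is active.
static int32_t min_active_version (const std::vector<Worker> &workers)
{
  int32_t min_version = -1;
  for (const Worker &w : workers)
    {
      if (!w.active)
        continue;                               // inactive workers never hold collected objects
      if (min_version == -1 || w.drop_files_version < min_version)
        min_version = w.drop_files_version;
    }
  return min_version;
}

// True while the dropper has to keep sleeping: some active worker has not yet
// acknowledged the published version.
static bool dropper_must_wait (const std::vector<Worker> &workers, int32_t my_version)
{
  int32_t m = min_active_version (workers);
  assert (m == -1 || m <= my_version);          // no worker can be ahead of the publisher
  return m != -1 && m < my_version;
}
```

In this model the loop ends either when the slowest active worker catches up or when no worker is active at all, which matches the two exits described above.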
10.6 The worker-side query — vacuum_is_file_dropped / vacuum_find_dropped_file
vacuum_is_file_dropped: PRM_ID_DISABLE_VACUUM → false; else
delegate to vacuum_find_dropped_file, which opens with the cheap check
that makes the common case free — vacuum_Dropped_files_count == 0 →
false with no page fix, no latch. Otherwise it walks the chain
read-latched. A failed fix is tolerated only as interrupt/shutdown
(assert (error == ER_INTERRUPTED); workers also assert (thread_p->shutdown)); *is_file_dropped is set false (“actually
unknown but unimportant”) and the error aborts the job. Each fixed page
gets pgbuf_notify_vacuum_follows (dropped-files pages are never
LRU-boosted; this delays victimization). Then bsearch (Invariant 10-A)
gives three outcomes: found and MVCC_ID_PRECEDES (mvccid, dropped_file->mvccid) → true (the record predates the borderline, so it
belongs to the dead file); found but the record MVCCID is >= the entry →
false (VFID reuse — the record belongs to the reincarnation); not found →
follow next_page, and chain-exhausted → false. The first outcome is the
safety property the whole ledger exists for: a worker calls a record
dropped exactly when its MVCCID precedes the entry’s borderline, so it
never vacuums — and never fixes a page of — a file dropped after that
record’s transaction.
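The three lookup outcomes can be sketched as follows. This is a simplified model, not the CUBRID code: `DroppedEntry`, `Page` and `is_file_dropped` are hypothetical, the VFID is reduced to a file id, pages are plain vectors instead of latched buffers, and a plain `<` stands in for MVCC_ID_PRECEDES.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical ledger entry: file id plus the borderline MVCCID sampled at drop time.
struct DroppedEntry { int32_t fileid; uint64_t borderline_mvccid; };
using Page = std::vector<DroppedEntry>;         // each page is sorted by fileid

static bool is_file_dropped (const std::vector<Page> &chain, int32_t fileid, uint64_t record_mvccid)
{
  if (chain.empty ())
    return false;                               // cheap check: nothing was ever dropped
  for (const Page &page : chain)
    {
      auto it = std::lower_bound (page.begin (), page.end (), fileid,
                                  [] (const DroppedEntry &e, int32_t id) { return e.fileid < id; });
      if (it != page.end () && it->fileid == fileid)
        {
          // found: older than the borderline means the record belongs to the dead file,
          // otherwise it belongs to a reincarnation of the same VFID
          return record_mvccid < it->borderline_mvccid;
        }
      // not found on this page: follow the chain
    }
  return false;                                 // chain exhausted
}
```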
10.7 Purging already-collected work — vacuum_cleanup_collected_by_vfid
Called only from the version gate. It qsorts the worker’s heap_objects
with vacuum_compare_heap_object (VFID-major, OID-minor), scans for the
first matching entry — not found → return; scans for the end of the run;
run reaches the array end → truncate (n_heap_objects = start); else
memmove the tail over the run and subtract end - start. The reorder is
harmless: Chapter 8’s executor re-sorts before batching.
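A minimal sketch of the purge, assuming a hypothetical `HeapObject` with the file id flattened to one integer. `std::vector::erase` covers both the truncate case and the move-the-tail case that the real function handles separately.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical collected object: owning file, then the OID parts used as sort tiebreak.
struct HeapObject { int32_t fileid; int32_t pageid; int16_t slotid; };

static void cleanup_collected_by_file (std::vector<HeapObject> &objects, int32_t dropped_fileid)
{
  // file-major, OID-minor order groups all objects of one file into a single run
  std::sort (objects.begin (), objects.end (), [] (const HeapObject &a, const HeapObject &b) {
    return std::tie (a.fileid, a.pageid, a.slotid) < std::tie (b.fileid, b.pageid, b.slotid);
  });
  auto start = std::find_if (objects.begin (), objects.end (),
                             [&] (const HeapObject &o) { return o.fileid == dropped_fileid; });
  if (start == objects.end ())
    return;                                     // nothing collected for the dropped file
  auto end = std::find_if (start, objects.end (),
                           [&] (const HeapObject &o) { return o.fileid != dropped_fileid; });
  objects.erase (start, end);                   // drop the run, keep the tail
}
```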
10.8 Trimming — vacuum_cleanup_dropped_files and its logging pair
The trim runs after a full pass has set
vacuum_Data.oldest_unvacuumed_mvccid = log_Gl.hdr.mvcc_next_id; in this
revision the only call site is xvacuum (SA-mode cubrid vacuumdb,
Chapter 11) — no server-mode caller remains, so on a live server the
ledger only grows until the next offline vacuum. Branches: recovery →
skip; vacuum_Dropped_files_count == 0 → skip; per page (write-latched):
empty page → unfix, continue; else scan entries from the end down, and
for each with MVCC_ID_PRECEDES (mvccid, oldest_unvacuumed_mvccid) record
its index in removed_entries[] and immediately memmove the tail down.
If anything was removed: ATOMIC_INC_32 (&count, -n), decrement the page
counter, log via vacuum_log_cleanup_dropped_files (redo-only
RVVAC_DROPPED_FILE_CLEANUP crumbs: n_indexes + the end-first index
array), refresh the track mirror, set dirty FREE; else plain unfix.
vacuum_rv_redo_cleanup_dropped_files unpacks the crumbs (assert (offset == rcv->length)) and replays each removal end-first, so each memmove
compacts a tail untouched by later (smaller) indexes. No undo function:
re-trimming after a crash removes only provably unneeded entries.
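Why the end-first order matters can be seen in a small model. The sketch below reduces a page to a vector of MVCCIDs; `trim_page` and `redo_trim` are hypothetical names, and plain `<` stands in for MVCC_ID_PRECEDES.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Runtime side: scan from the end down, remove entries older than the watermark,
// and record each removed index, so the index array comes out descending.
static std::vector<int> trim_page (std::vector<uint64_t> &mvccids, uint64_t oldest_unvacuumed)
{
  std::vector<int> removed;
  for (int i = static_cast<int> (mvccids.size ()) - 1; i >= 0; i--)
    {
      if (mvccids[static_cast<std::size_t> (i)] < oldest_unvacuumed)
        {
          mvccids.erase (mvccids.begin () + i);
          removed.push_back (i);
        }
    }
  return removed;
}

// Redo side: replay the same descending indexes against the pre-trim page image.
// Each removal only shifts entries above the index, so smaller indexes stay valid.
static void redo_trim (std::vector<uint64_t> &mvccids, const std::vector<int> &removed)
{
  for (int index : removed)
    {
      assert (index >= 0 && index < static_cast<int> (mvccids.size ()));
      mvccids.erase (mvccids.begin () + index);
    }
}
```

Replaying the indexes in ascending order instead would shift later targets by one per removal and delete the wrong entries.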
Three code smells to know before modifying this path: (1) the in-code
todo notes emptied pages are never deallocated (“it looks like they
are leaked”); (2) the trailing cut-off-link step is doubly broken — it
calls vacuum_dropped_files_set_next_page (thread_p, page, &page->next_page), re-assigning the page’s own current link (a no-op
that never writes the intended NULL), and last_non_empty_page_vpid
records the successor of each non-empty page (it copies vpid after the
advance to page->next_page), so even the fixed page is wrong; and (3) the
cleanup redo computes mem_size = (page->n_dropped_files - indexes[i]) * sizeof (VACUUM_DROPPED_FILE) — one entry more than the runtime path’s
(page_count - i - 1) — so each replayed removal copies one entry past the
live tail (benign garbage beyond the decremented count, except that on a
full page the source read crosses the page boundary).
Invariant 10-D — trim safety. An entry may be removed only when
mvccid < vacuum_Data.oldest_unvacuumed_mvccid: every log record older than the watermark has been vacuumed or skipped-as-dropped (Chapter 9), so no future ledger lookup can need the entry. Violating it resurrects exactly the lost-file problem the ledger exists to prevent.
10.9 Chapter summary — key takeaways
- The ledger is a VFID-sorted, page-chained table keyed by a borderline MVCCID sampled from log_Gl.hdr.mvcc_next_id at notify time; records strictly older belong to the dead file (Invariants 10-A, 10-B).
- Registration is transactional by construction: vacuum_log_add_dropped_file appends a postpone for drops (applies on commit, ahead of the file-destroy postpone) and an undo for creations (applies on abort); both funnel into vacuum_rv_notify_dropped_file.
- vacuum_add_dropped_file has four success exits: replace (VFID reuse), sorted in-page insert (zero-length undo), next-page retry, and new-page append chained via vacuum_dropped_files_set_next_page. VACUUM_DROPPED_FILE_FLAG_DUPLICATE is vestigial.
- The handshake inverts the usual cost: the dropper spins under vacuum_Dropped_files_mutex until every active worker’s drop_files_version catches up (wraparound-safe compare); workers pay one comparison per record and purge collected objects via vacuum_cleanup_collected_by_vfid before acknowledging (Invariant 10-C).
- vacuum_find_dropped_file short-circuits on a zero global count; the found-and-older outcome is the safety test, while found-but-newer correctly classifies VFID reuse as not-dropped.
- Trimming removes entries older than oldest_unvacuumed_mvccid (Invariant 10-D) with a redo-only positional record; it runs only in SA-mode xvacuum, emptied pages leak, the empty-page unlink is a no-op, and the redo over-copies by one entry. Those are four things a modifier should not inherit silently.
Chapter 11: Crash Recovery and Standalone Paths
Every previous chapter assumed a live SERVER_MODE process: a master producing jobs (Ch 6), workers consuming them (Ch 7-8), an append path feeding vacuum_Block_data_buffer (Ch 3-4). This chapter answers what happens to blocks in each state (AVAILABLE, IN_PROGRESS, VACUUMED, or not yet registered) across a crash, a copydb, or an SA-mode session with no daemon. The high-level story is the companion’s “Recovery integration” section (cubrid-vacuum.md); here we trace every branch.
11.1 The recovery handshake: recovery_lsa and the notify hooks
Vacuum does not participate in the ARIES passes. Its entire recovery contract is one field, vacuum_data::recovery_lsa, set by log_recovery immediately before the analysis pass:
```cpp
// log_recovery -- src/transaction/log_recovery.c
  /* Notify vacuum it may need to recover the lost block data.
   * ... 1. recovery finds MVCC op log records after last checkpoint ...
   * ... 2. no MVCC op log record is found, so vacuum has to start recovery from checkpoint LSA ...
   */
  vacuum_notify_server_crashed (&rcv_lsa);   /* <- rcv_lsa = checkpoint LSA (older for media crash) */
```

vacuum_notify_server_crashed is one line: LSA_COPY (&vacuum_Data.recovery_lsa, recovery_lsa). Its clean-shutdown counterpart vacuum_notify_server_shutdown calls vacuum_Data.shutdown_sequence.request_shutdown () from vacuum_stop_workers, telling the master (Ch 6) to stop generating jobs before the worker pool is destroyed.
Invariant 11-A: recovery_lsa is non-null iff crash recovery ran in this boot. Set only by vacuum_notify_server_crashed, consumed by vacuum_recover_lost_block_data, cleared by LSA_SET_NULL in vacuum_data_load_and_recover. A clean boot thus skips lost-block recovery (first early-return in §11.4); a leaked value would force a useless log-tail re-scan.
11.2 The RVVAC_* catalogue — what the redo pass replays for vacuum
Before any vacuum-specific boot logic runs, the ordinary ARIES redo pass has already replayed every page-level vacuum mutation through the RV_fun table in recovery.c. How a record reaches these functions (RV_fun[rcvindex] lookup, page fix, page-LSA gating, undo-vs-redo selection) is the recovery manager’s dispatch machinery; see cubrid-recovery-manager-detail.md for that and take it as given here. The rows are {index, name, undofun, redofun, undo_dump, redo_dump}:
```cpp
// RV_fun -- src/transaction/recovery.c
  {RVVAC_COMPLETE, "RVVAC_COMPLETE", NULL, vacuum_rv_redo_vacuum_complete, NULL, NULL},
  // ... condensed: RVVAC_START_JOB (105) ... RVVAC_DROPPED_FILE_REPLACE (114) ...
  {RVVAC_HEAP_RECORD_VACUUM, "RVVAC_HEAP_RECORD_VACUUM", vacuum_rv_undo_vacuum_heap_record,
   vacuum_rv_redo_vacuum_heap_record, NULL, log_rv_dump_hexa},
  // ... condensed: RVVAC_HEAP_PAGE_VACUUM (116), RVVAC_REMOVE_OVF_INSID (117) ...
```

All fourteen indices (104-117 in recovery.h), their handlers, and what each body does:
| Index | Undo fn | Redo fn | Effect (branches) |
|---|---|---|---|
| RVVAC_COMPLETE (104) | none | vacuum_rv_redo_vacuum_complete | Restore oldest_unvacuumed_mvccid from the logged mvcc_next_id; reset the log-header vacuum cache. Written only by xvacuum (§11.6); doubles as the backward-search terminator in §11.4. |
| RVVAC_START_JOB (105) | none | vacuum_rv_redo_start_job | set_job_in_progress () on the entry at rcv->offset, recreating the IN_PROGRESS state that boot then rewrites (§11.3). Logged by vacuum_job_cursor::start_job_on_current_entry, and only when the entry was not already interrupted (re-logging an interrupted job adds nothing). |
| RVVAC_DATA_APPEND_BLOCKS (106) | none | vacuum_rv_redo_append_data | memcpy of N entries at index_free (asserts rcv->offset == index_free), bump index_free. The Ch 4 producer’s persistence. |
| RVVAC_DATA_INIT_NEW_PAGE (107) | none | vacuum_rv_redo_initialize_data_page | Reformat page, seed data->blockid with the logged watermark. Logged by vacuum_init_data_page_with_last_blockid: page growth (Ch 4), watermark persistence and copydb reset (§11.7). |
| RVVAC_DATA_SET_LINK (108) | vacuum_rv_undoredo_data_set_link | same fn | Two branches: rcv->data == NULL → VPID_SET_NULL (next_page), else copy the VPID. Logged undo+redo when chaining a new page (Ch 4) or unlinking a consumed one (Ch 9). |
| RVVAC_DATA_FINISHED_BLOCKS (109) | none | vacuum_rv_redo_data_finished | Replays Ch 9’s mark-finished; excerpt below. |
| RVVAC_NOTIFY_DROPPED_FILE (110) | vacuum_rv_notify_dropped_file | same fn | Logical record (in RCV_IS_LOGICAL_LOG): re-add the file with mvccid = log_Gl.hdr.mvcc_next_id as drop boundary, notify all workers, and (if class_oid non-null) evict the HFID cache. Appended as postpone on file destroy and as undo on file create: drop-at-commit and drop-at-abort respectively (Ch 10). |
| RVVAC_DROPPED_FILE_CLEANUP (111) | none | vacuum_rv_redo_cleanup_dropped_files | memmove-delete each logged position from a dropped-files page, decrement n_dropped_files. |
| RVVAC_DROPPED_FILE_NEXT_PAGE (112) | vacuum_rv_set_next_page_dropped_files | same fn | Unconditional next_page link write from the before/after image. |
| RVVAC_DROPPED_FILE_ADD (113) | vacuum_rv_undo_add_dropped_file | vacuum_rv_redo_add_dropped_file | Redo: insert at rcv->offset (memmove to make room unless appending; position > n_dropped_files is ER_FAILED), then forward log_Gl.hdr.mvcc_next_id if the entry’s MVCCID is not behind it. Undo: memmove-delete at the position (position >= n_dropped_files is ER_FAILED). |
| RVVAC_DROPPED_FILE_REPLACE (114) | vacuum_rv_replace_dropped_file | same fn | Overwrite the entry’s MVCCID at rcv->offset; position out of range or VFID mismatch is ER_FAILED. |
| RVVAC_HEAP_RECORD_VACUUM (115) | vacuum_rv_undo_vacuum_heap_record | vacuum_rv_redo_vacuum_heap_record | Redo: spage_vacuum_slot (slotid and reusable packed in rcv->offset under VACUUM_LOG_VACUUM_HEAP_MASK), then spage_compact if needed. Undo: strip the mask bits and delegate to heap_rv_redo_insert, re-inserting the record the worker removed (Ch 8). |
| RVVAC_HEAP_PAGE_VACUUM (116) | none | vacuum_rv_redo_vacuum_heap_page | Whole-page replay: n_slots / reusable / all_vacuumed packed in rcv->offset. n_slots == 0 → header-only “once to none” status transition. Else per slot: negative slotid → record fully removed, spage_vacuum_slot; positive → only the insert MVCCID was vacuumed, strip OR_MVCC_FLAG_VALID_INSID \| PREV_VERSION and rebuild the record in place. Then spage_compact if needed; all_vacuumed → status to none (Ch 8). |
| RVVAC_REMOVE_OVF_INSID (117) | none | vacuum_rv_redo_remove_ovf_insid | Overflow-page MVCC header: insid := MVCCID_ALL_VISIBLE, prev_version_lsa := null. The fixed-size-header asymmetry of Ch 8. |

(RVES_NOTIFY_VACUUM, the external-storage neighbor row, maps undo and redo to vacuum_rv_es_nop — an explicit no-op.)
Invariant 11-B: vacuum data is redo-only state. Every RVVAC row targeting a vacuum data page (104-107, 109) has a NULL undo function; only RVVAC_DATA_SET_LINK is undoable, and its undo is the same idempotent link-write applied to the before-image. Vacuum data is mutated exclusively by master/system threads, never rolled back record-by-record: a torn mutation is healed going forward (redo replays it, then §11.3-§11.4 reconcile). The undoable rows all target other structures (dropped-files pages, heap pages) touched inside abortable operations.
The structurally richest body is RVVAC_DATA_FINISHED_BLOCKS, which reproduces Ch 9’s two-phase mark-and-compact on one page:
```cpp
// vacuum_rv_redo_data_finished -- src/query/vacuum.c
  if (rcv_data_ptr != NULL)
    {
      while (rcv_data_ptr < (char *) rcv->data + rcv->length)
        {
          blockid_with_flags = *((VACUUM_LOG_BLOCKID *) rcv_data_ptr);
          blockid = VACUUM_BLOCKID_WITHOUT_FLAGS (blockid_with_flags);
          // ... condensed: data_index = blockid - page_unvacuumed_blockid + index_unvacuumed ...
          if (VACUUM_BLOCK_STATUS_IS_VACUUMED (blockid_with_flags))
            data_page->data[data_index].set_vacuumed ();
          else
            data_page->data[data_index].set_interrupted ();
          rcv_data_ptr += sizeof (VACUUM_LOG_BLOCKID);
        }
    }
  while (data_page->index_unvacuumed < data_page->index_free
         && data_page->data[data_page->index_unvacuumed].is_vacuumed ())
    data_page->index_unvacuumed++;
  if (VPID_ISNULL (&data_page->next_page) && data_page->index_unvacuumed > 0)
    {
      /* Remove all vacuumed blocks. */
      // ... condensed: memmove survivors to front; index_free -= index_unvacuumed; index_unvacuumed = 0 ...
    }
```

Three branches: the flag loop (VACUUMED vs interrupted, exactly the per-block outcome Ch 9 logged), the index_unvacuumed advance, and the last-page-only compaction (next_page null). The loop tolerates rcv->data == NULL: a record can carry no block list and still trigger advance + compaction.
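The mark, advance and compact sequence can be modelled on a plain vector. This is a simplified sketch, not the CUBRID layout: `Status`, `Entry`, `DataPage` and `Finished` are hypothetical, status is an enum rather than flag bits inside the blockid, and the page's free index is implied by the vector size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Status { Available, InProgress, Vacuumed, Interrupted };   // simplified statuses
struct Entry { int64_t blockid; Status status; };
struct DataPage { std::vector<Entry> data; std::size_t index_unvacuumed = 0; bool is_last_page = true; };
struct Finished { int64_t blockid; bool vacuumed; };

static void redo_data_finished (DataPage &page, const std::vector<Finished> &finished)
{
  for (const Finished &f : finished)
    {
      assert (page.index_unvacuumed < page.data.size ());
      // dense addressing: consecutive blockids occupy consecutive slots
      int64_t first_blockid = page.data[page.index_unvacuumed].blockid;
      std::size_t i = static_cast<std::size_t> (f.blockid - first_blockid) + page.index_unvacuumed;
      assert (i < page.data.size ());
      page.data[i].status = f.vacuumed ? Status::Vacuumed : Status::Interrupted;
    }
  // advance past the leading run of vacuumed entries
  while (page.index_unvacuumed < page.data.size ()
         && page.data[page.index_unvacuumed].status == Status::Vacuumed)
    page.index_unvacuumed++;
  // only the last page is compacted; earlier pages are unlinked whole elsewhere
  if (page.is_last_page && page.index_unvacuumed > 0)
    {
      page.data.erase (page.data.begin (),
                       page.data.begin () + static_cast<std::ptrdiff_t> (page.index_unvacuumed));
      page.index_unvacuumed = 0;
    }
}
```

An empty `finished` list still runs the advance and compaction steps, mirroring the NULL-data tolerance noted above.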
11.3 vacuum_data_load_and_recover — reload, reset, rewind
Called from vacuum_boot before any worker exists, in both modes, after the redo pass has put every vacuum data page back into its logged state via §11.2’s functions. Figure 11-1 accounts for every branch:
flowchart TD
A["file_descriptor_get +<br/>fix vpid_first"] -->|error| Z["goto end:<br/>unfix + unload, return error"]
A --> B["page-walk loop:<br/>for each entry in page"]
B --> C{"entry->is_job_in_progress?"}
C -->|yes| D["entry->set_interrupted()<br/>page dirty"]
C -->|no| E[next entry]
D --> E
E --> F{"next_page VPID null?"}
F -->|no, fix fails| Z
F -->|no| B
F -->|yes| G["last_page = data_page"]
G --> H{"vacuum_is_empty()?"}
H -->|yes| I{"logpb_last_complete_blockid < 0?"}
I -->|yes: fresh copydb| J["do not touch last_blockid"]
I -->|no| K{"recovery_lsa null AND<br/>hdr.mvcc_op_log_lsa null?"}
K -->|yes: 10.1 compat| L["set_last_blockid(log_blockid)"]
K -->|no| M["set_last_blockid(MAX(hdr.vacuum_last_blockid,<br/>last_page->data->blockid))"]
H -->|no| N["set_last_blockid(last entry<br/>of last_page)"]
J --> O["is_loaded = true;<br/>update_global_oldest_visible"]
L --> O
M --> O
N --> O
O --> P["vacuum_recover_lost_block_data"]
P -->|error| Z
P --> Q["LSA_SET_NULL(recovery_lsa);<br/>set_oldest_unvacuumed_on_boot;<br/>update_keep_from_log_pageid"]
Q --> R["save vpid_first/vpid_last into<br/>vacuum_Data_load; unfix pages"]
Figure 11-1: vacuum_data_load_and_recover, branch-complete.
The IN_PROGRESS reset touches Ch 1’s status flags directly:
```cpp
// vacuum_data_load_and_recover -- src/query/vacuum.c
  if (entry->is_job_in_progress ())
    {
      /* Reset in progress flag, mark the job as interrupted and update last_blockid. */
      entry->set_interrupted ();   /* <- STATUS_SET_AVAILABLE + SET_INTERRUPTED */
      is_page_dirty = true;
    }
```

Invariant 11-C: IN_PROGRESS never survives a restart. A job claimed by a worker that died with the server (whether the claim came from runtime state or from a replayed RVVAC_START_JOB) must be re-runnable, so the page walk rewrites it to AVAILABLE + INTERRUPTED. INTERRUPTED is what relaxes vacuum_heap_page’s safeguards (Ch 8): the block may have been partially executed, so “record already vacuumed” is no longer an anomaly. Resetting to plain AVAILABLE instead would re-arm assertions on the second pass over half-cleaned pages.
When vacuum data is empty, last_blockid is reconstructed three ways. log_blockid < 0 means the log has not filled its first block (“one case may be soon after copydb” — pairs with §11.7). Both recovery_lsa and log_Gl.hdr.mvcc_op_log_lsa null is an explicit 10.1-compat path trusting logpb_last_complete_blockid (). The default takes MAX (log_Gl.hdr.vacuum_last_blockid, vacuum_Data.last_page->data->blockid) because after a long SA session the on-page blockid “will be outdated. Instead, SA_MODE updates log_Gl.hdr.vacuum_last_blockid before removing old archives” (§11.7). Non-empty data needs no guessing: the last registered entry is the watermark.
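The decision order for last_blockid can be written as one pure function. The `BootInputs` struct and its field names are hypothetical; they only name the inputs the paragraph above mentions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of the values consulted at boot.
struct BootInputs
{
  bool vacuum_data_empty;
  int64_t last_entry_blockid;          // last registered entry, valid when data is not empty
  int64_t log_last_complete_blockid;   // stands in for logpb_last_complete_blockid ()
  bool recovery_lsa_null;
  bool mvcc_op_log_lsa_null;
  int64_t hdr_vacuum_last_blockid;     // stands in for log_Gl.hdr.vacuum_last_blockid
  int64_t on_page_blockid;             // stands in for last_page->data->blockid
  int64_t current_last_blockid;        // value kept when nothing is touched
};

static int64_t rebuild_last_blockid (const BootInputs &in)
{
  if (!in.vacuum_data_empty)
    return in.last_entry_blockid;                       // no guessing needed
  if (in.log_last_complete_blockid < 0)
    return in.current_last_blockid;                     // first block not filled yet: do not touch
  if (in.recovery_lsa_null && in.mvcc_op_log_lsa_null)
    return in.log_last_complete_blockid;                // compatibility path
  return std::max (in.hdr_vacuum_last_blockid, in.on_page_blockid);   // default path
}
```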
After lost-block recovery, vacuum_data::set_oldest_unvacuumed_on_boot seeds Ch 5’s watermark: empty data → log_Gl.hdr.oldest_visible_mvccid (first re-initialized from mvcc_next_id if no block needs vacuum); otherwise the first entry’s oldest_visible_mvccid, which lower-bounds all others. Finally — Ch 2’s ownership model — the pages cannot stay fixed by the boot thread: both VPIDs are stashed into vacuum_Data_load and unfixed via vacuum_data_unload_first_and_last_page; the master (or xvacuum) re-fixes them with vacuum_data_load_first_and_last_page.
11.4 vacuum_recover_lost_block_data — rebuilding blocks the crash swallowed
Ch 3 showed a filled block lives only in log_Gl.hdr and vacuum_Block_data_buffer until consumed (Ch 4). A crash loses both; this function reconstructs them from the WAL. Entry branches:
1. recovery_lsa null → clean boot, return NO_ERROR.
2. log_Gl.hdr.mvcc_op_log_lsa null → the recovered header forgot the last MVCC op; search backward from recovery_lsa for one.
3. vacuum_get_log_blockid (mvcc_op_log_lsa.pageid) <= get_last_blockid () → already inside a registered block; logpb_vacuum_reset_log_header_cache and return.
4. Otherwise → start from log_Gl.hdr.mvcc_op_log_lsa directly.

The branch-2 search walks record headers backward via back_lsa, bounded by stop_at_pageid = VACUUM_LAST_LOG_PAGEID_IN_BLOCK (get_last_blockid ()), with three terminators:
```cpp
// vacuum_recover_lost_block_data -- src/query/vacuum.c
      if (log_rec_header.type == LOG_MVCC_UNDO_DATA || log_rec_header.type == LOG_MVCC_UNDOREDO_DATA
          || log_rec_header.type == LOG_MVCC_DIFF_UNDOREDO_DATA)
        {
          LSA_COPY (&mvcc_op_log_lsa, &log_lsa);   /* <- found the chain tail */
          break;
        }
      else if (log_rec_header.type == LOG_SYSOP_END)
        {
          // ... condensed: hit if sysop_end->type == LOG_SYSOP_END_LOGICAL_MVCC_UNDO ...
        }
      else if (log_rec_header.type == LOG_REDO_DATA)
        {
          // ... condensed: break WITHOUT a hit if redo->data.rcvindex == RVVAC_COMPLETE ...
        }
      LSA_COPY (&log_lsa, &log_rec_header.back_lsa);
```

RVVAC_COMPLETE is the SA-mode “all clean” marker written by xvacuum (§11.6): anything older is vacuumed, so the search ends with mvcc_op_log_lsa still null → “nothing to recovery” → NO_ERROR. The same null check covers reaching stop_at_pageid. Any logpb_fetch_page failure here (and in the main loop) is logpb_fatal_error, because recovery cannot proceed on an unreadable log.
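The backward search and its terminators can be sketched over an in-memory log. Everything here is a hypothetical model: `RecType` collapses the record types to the cases that matter, `back` is an array index rather than an LSA, and the bound is assumed to mean "stop once the page id is no longer above stop_at_pageid".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class RecType { MvccOp, SysopEndLogicalMvccUndo, RedoVacComplete, Other };
struct LogRec { RecType type; int64_t pageid; int back; };   // back = index of previous record, -1 = none

// Returns the index of the newest MVCC op record, or -1 when nothing needs recovery.
static int find_last_mvcc_op (const std::vector<LogRec> &log, int start, int64_t stop_at_pageid)
{
  int i = start;
  while (i >= 0)
    {
      const LogRec &rec = log[static_cast<std::size_t> (i)];
      if (rec.pageid <= stop_at_pageid)
        break;                                  // reached blocks that are already registered
      switch (rec.type)
        {
        case RecType::MvccOp:
        case RecType::SysopEndLogicalMvccUndo:
          return i;                             // found the chain tail
        case RecType::RedoVacComplete:
          return -1;                            // everything older is already vacuumed
        case RecType::Other:
          break;
        }
      i = rec.back;                             // follow the back link
    }
  return -1;
}
```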
The main loop rebuilds one VACUUM_DATA_ENTRY per block, newest to oldest, following each record’s prev_mvcc_op_log_lsa chain (the same chain workers walk in Ch 7) through vacuum_process_log_record:
```cpp
// vacuum_recover_lost_block_data -- src/query/vacuum.c
  std::stack<VACUUM_DATA_ENTRY> vacuum_block_data_buffer_stack;
  /* we don't reset data.oldest_visible_mvccid between blocks. we need to maintain ordered
   * oldest_visible_mvccid's ... */
  data.oldest_visible_mvccid = MVCCID_NULL;
  while (crt_blockid > vacuum_Data.get_last_blockid ())
    {
      // ... condensed: inner loop folds each record's mvccid into oldest/newest ...
      if (data.blockid == vacuum_get_log_blockid (log_Gl.prior_info.prior_lsa.pageid))
        {
          /* <- the still-open block: restore header cache instead of registering */
          log_Gl.hdr.oldest_visible_mvccid = data.oldest_visible_mvccid;
          log_Gl.hdr.newest_block_mvccid = data.newest_mvccid;
          log_Gl.hdr.does_block_need_vacuum = true;
          log_Gl.hdr.mvcc_op_log_lsa = mvcc_op_log_lsa;
        }
      else
        {
          vacuum_block_data_buffer_stack.push (data);
        }
      crt_blockid = vacuum_get_log_blockid (log_lsa.pageid);
    }
```

Two subtleties:

- The still-open block is not registered. If the newest block contains the append point (log_Gl.prior_info.prior_lsa), registering it would violate Ch 4’s complete-blocks-only rule. Instead the four header fields Ch 3 maintains are restored, and the block is produced when it fills, or consumed by xvacuum’s partial pass (§11.6). The header cache was reset just before the loop (“info will be restored if last block is not consumed”).
- Blockid-sorted replay via the stack. The scan visits newest→oldest, but vacuum_consume_buffer_log_blocks (Ch 4) assumes ascending blockids. Popping the std::stack produces oldest→newest into vacuum_Block_data_buffer; the guard crt_blockid > get_last_blockid () guarantees no overlap with already-registered blocks.
Invariant 11-D: recovered blocks enter vacuum data in ascending blockid order, gap-free, strictly above last_blockid. Enforced by the stack reversal plus the loop guard. Violation corrupts Ch 1’s dense-array addressing (the verifier asserts entry->get_blockid () == (entry - 1)->get_blockid () + 1, §11.8).
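The stack reversal behind Invariant 11-D can be sketched with block ids alone. `replay_in_order` is a hypothetical helper: the input vector plays the role of the newest-to-oldest scan, and the returned vector plays the role of what is fed to the block data buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <stack>
#include <vector>

// lost_newest_first: block ids in the order the backward scan meets them.
// open_blockid: the block that still contains the append point.
static std::vector<int64_t> replay_in_order (const std::vector<int64_t> &lost_newest_first,
                                             int64_t last_blockid, int64_t open_blockid)
{
  std::stack<int64_t> pending;
  for (int64_t blockid : lost_newest_first)
    {
      if (blockid <= last_blockid)
        break;                                  // already registered: the loop guard
      if (blockid == open_blockid)
        continue;                               // still-open block: header cache only
      pending.push (blockid);
    }
  std::vector<int64_t> registered;
  while (!pending.empty ())
    {
      registered.push_back (pending.top ());    // popping reverses to oldest-first
      pending.pop ();
    }
  return registered;
}
```

For a scan that yields 9, 8, 7, 6, 5 with last_blockid 6 and the open block 9, the registered order is 7 then 8: ascending, gap-free, and strictly above last_blockid.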
Not resetting data.oldest_visible_mvccid between blocks (quoted comment) re-establishes Ch 5’s monotonicity: an MVCCID active while a newer block was logged must also lower-bound older blocks. Consumption needs master identity, so the function wraps vacuum_consume_buffer_log_blocks in vacuum_convert_thread_to_master / vacuum_restore_thread — the only thread conversion outside the daemons and xvacuum.
11.5 vacuum_rv_check_at_undo — the one vacuum hook inside rollback
Heap undo recovery (heap_rv_undo_delete, heap_rv_undo_update, heap_rv_undo_ovf_update in heap_file.c) restores a record’s before-image, including an MVCC header that vacuum may since have been entitled to clean. vacuum_rv_check_at_undo rewrites the restored record to be “valid in terms of vacuuming”. Branches in order: read the header (heap_get_mvcc_rec_header_from_overflow for REC_BIGONE, else spage_get_record COPY + or_mvcc_get_header; each failure is assert_release + ER_FAILED), then decide:
```cpp
// vacuum_rv_check_at_undo -- src/query/vacuum.c
  if (log_is_in_crash_recovery ())
    {
      /* always clear flags when recovering from crash - all the objects are visible anyway */
      if (MVCC_IS_FLAG_SET (&rec_header, OR_MVCC_FLAG_VALID_INSID))
        can_vacuum = VACUUM_RECORD_DELETE_INSID_PREV_VER;
      else
        can_vacuum = VACUUM_RECORD_CANNOT_VACUUM;
    }
  else
    {
      /* <- runtime rollback: ask the real oracle, the Ch 5 watermark */
      can_vacuum = mvcc_satisfies_vacuum (thread_p, &rec_header, log_Gl.mvcc_table.get_global_oldest_visible ());
    }
  /* it is impossible to restore a record that should be removed by vacuum */
  assert (can_vacuum != VACUUM_RECORD_REMOVE);
```

During crash recovery every undone transaction is doomed, so all surviving versions are visible and any valid insid is flattened unconditionally; at runtime the decision defers to mvcc_satisfies_vacuum. The REMOVE assert is the safety claim: a record vacuum would delete outright requires a committed-and-old deleter, which cannot simultaneously be the uncommitted transaction being undone.
On VACUUM_RECORD_DELETE_INSID_PREV_VER: for REC_BIGONE, set insid to MVCCID_ALL_VISIBLE, null prev_version_lsa, write back via heap_set_mvcc_rec_header_on_overflow; otherwise clear OR_MVCC_FLAG_VALID_INSID | OR_MVCC_FLAG_VALID_PREV_VERSION, or_mvcc_set_header, spage_update. Both paths end with pgbuf_set_dirty; the asymmetry mirrors Ch 8 — overflow headers are fixed-size (values neutralized), heap headers shrink (flags dropped).
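The crash-recovery decision and the heap-header rewrite can be sketched on a one-byte flag field. The flag values and the `RecHeader` struct are hypothetical; only the non-overflow path is modelled, and the runtime branch that consults mvcc_satisfies_vacuum is left out.

```cpp
#include <cassert>
#include <cstdint>

enum class CanVacuum { CannotVacuum, DeleteInsidPrevVer, Remove };

// Hypothetical bit assignments, chosen only for this sketch.
constexpr uint8_t FLAG_VALID_INSID = 0x01;
constexpr uint8_t FLAG_VALID_PREV_VERSION = 0x02;

struct RecHeader { uint8_t flags; };

// Crash-recovery branch: a valid insid is always flattened, anything else is left alone.
static CanVacuum decide_in_crash_recovery (const RecHeader &h)
{
  return (h.flags & FLAG_VALID_INSID) ? CanVacuum::DeleteInsidPrevVer : CanVacuum::CannotVacuum;
}

// Heap-record path: drop both flags so the header no longer carries the fields.
static void apply_to_heap_header (RecHeader &h, CanVacuum decision)
{
  assert (decision != CanVacuum::Remove);       // undo never restores a removable record
  if (decision == CanVacuum::DeleteInsidPrevVer)
    h.flags = static_cast<uint8_t> (h.flags & ~(FLAG_VALID_INSID | FLAG_VALID_PREV_VERSION));
}
```

Applying the decision twice is harmless in this model: once the insid flag is gone, the second decision is CannotVacuum and the header is left untouched.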
Invariant 11-E: after undo, a record header never carries MVCC metadata that vacuum has already passed by. Enforced by this hook in all three heap undo paths. If skipped, a snapshot could chase prev_version_lsa into log pages already reclaimed (Ch 9), or mvcc_satisfies_vacuum would re-classify an already-cleaned record. (tde.c deliberately formats keyinfo records to dodge this rewrite; see its “HACK” comments.)
11.6 SA mode: xvacuum, vacuum_sa_run_job, and the partial block
In SA mode there is no daemon; xvacuum compresses the whole lifecycle into one synchronous call (under SERVER_MODE it returns ER_VACUUM_CS_NOT_AVAILABLE). Figure 11-2:
flowchart TD
A{"PRM_ID_DISABLE_VACUUM or<br/>is_vacuum_complete?"} -->|yes| B[return NO_ERROR]
A -->|no| C["convert thread to master;<br/>load first/last pages;<br/>cursor.set_on_vacuum_data_start + load"]
C --> D{"Block_data_buffer<br/>not empty?"}
D -->|yes| E[cursor.force_data_update]
D -->|no| F{cursor.is_valid?}
E --> F
F -->|yes| G{"logtb_is_interrupted?"}
G -->|yes| H["cursor.unload; vacuum_Data.update;<br/>return NO_ERROR"]
G -->|no| I{"entry.is_available?"}
I -->|yes| J["start_job_on_current_entry;<br/>vacuum_sa_run_job(entry, false)"]
I -->|no: vacuumed| K[skip]
J --> L["increment_blockid"]
K --> L
L --> M{"new block, finished queue full,<br/>or cursor exhausted?"}
M -->|yes| N[cursor.force_data_update]
M -->|no| F
N --> F
F -->|no: data empty| O{"hdr.does_block_need_vacuum?"}
O -->|yes| P["build partial_entry from log_Gl.hdr;<br/>disable interrupt;<br/>vacuum_sa_run_job(entry, true)"]
O -->|no| Q["oldest_unvacuumed = mvcc_next_id;<br/>log RVVAC_COMPLETE; flush;<br/>cleanup dropped files; reset hdr cache;<br/>vacuum_finalize; is_vacuum_complete = true"]
P --> Q
Figure 11-2: xvacuum, branch-complete. The interrupt exit leaves vacuum data consistent for the next invocation.
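The control flow of Figure 11-2 can be approximated by a short loop. This is a deliberately reduced sketch, assuming a flat vector in place of the vacuum data page chain; Entry, SaResult, and sa_vacuum_pass are hypothetical names, and the sketch omits the PRM_ID_DISABLE_VACUUM check, force_data_update, logging, and flushing.

```cpp
#include <cassert>
#include <functional>
#include <vector>

enum class EntryState { Available, Vacuumed };

struct Entry
{
  long blockid;
  EntryState state;
};

struct SaResult
{
  int jobs_run = 0;          // entries vacuumed inline by this pass
  bool partial_run = false;  // true if the still-open header block was processed
  bool complete = false;     // models is_vacuum_complete
};

// Hypothetical, simplified rendering of the xvacuum loop.
SaResult sa_vacuum_pass (std::vector<Entry> &entries, bool header_block_needs_vacuum,
                         const std::function<bool ()> &is_interrupted)
{
  SaResult r;
  for (Entry &e : entries)
    {
      if (is_interrupted ())
        {
          return r;  // interrupt exit: nothing is lost, but the pass is not complete
        }
      if (e.state == EntryState::Available)
        {
          e.state = EntryState::Vacuumed;  // the job runs immediately in this thread
          ++r.jobs_run;
        }
      // Entries that are already vacuumed are skipped.
    }
  if (header_block_needs_vacuum)
    {
      r.partial_run = true;  // partial block: runs with interrupts disabled
    }
  r.complete = true;
  return r;
}
```

Note that the interrupt check sits inside the entry loop only; once the loop finishes, the partial block and the completion step are not interruptible, matching the figure.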
The loop is Ch 6’s master loop re-implemented inline with the same vacuum_job_cursor, except jobs run immediately in this thread. vacuum_sa_run_job performs the double conversion:
```cpp
// vacuum_sa_run_job -- src/query/vacuum.c
VACUUM_WORKER *worker_p = vacuum_Worker_entry_manager->claim_worker ();
thread_type save_type = thread_type::TT_NONE;
vacuum_convert_thread_to_worker (thread_p, worker_p, save_type);
assert (save_type == thread_type::TT_VACUUM_MASTER);	/* <- caller must be master */
VACUUM_DATA_ENTRY copy_data_entry = data_entry;	/* <- worker mutates its copy */
vacuum_process_log_block (thread_p, &copy_data_entry, is_partial);
vacuum_convert_thread_to_master (thread_p, save_type);
// ... condensed: retire_worker, perf tracking ...
```

The is_partial == true call is unique to SA mode: when the cursor is exhausted but log_Gl.hdr.does_block_need_vacuum is set (the still-open block of Ch 3, possibly restored by §11.4), xvacuum builds an entry straight from the header (vacuum_data_entry::vacuum_data_entry (const log_header &)) and runs it with interrupts off (logtb_set_check_interrupt (thread_p, false); the header flag was already cleared, so an abort would lose the block). Inside vacuum_process_log_block, sa_mode_partial_block flips three things:
- No prefetch. vacuum_log_prefetch_vacuum_block is skipped: “block is not entirely logged and we cannot prefetch it”.
- Forced interrupted semantics. was_interrupted = data->was_interrupted () || sa_mode_partial_block; because interruptions are usually marked in the blockid, but a partial block carries no flag.
- No completion bookkeeping. At the end: label, vacuum_finished_block_vacuum (Ch 9) runs only if (!sa_mode_partial_block); a partial block was never in vacuum data, so there is nothing to mark.
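The three switches above reduce to one boolean driving three decisions. The following sketch assumes hypothetical names (BlockPlan, plan_block) and ignores every other condition that may influence prefetching in the real function; it only restates the list as code.

```cpp
#include <cassert>

// What vacuum_process_log_block decides up front for one block (toy model).
struct BlockPlan
{
  bool prefetch;         // read the whole block ahead of processing
  bool was_interrupted;  // tolerate half-vacuumed pages on re-execution
  bool mark_finished;    // report the block through the finished-job path
};

BlockPlan plan_block (bool entry_was_interrupted, bool sa_mode_partial_block)
{
  BlockPlan p;
  // A partial block is not entirely logged, so it cannot be prefetched.
  p.prefetch = !sa_mode_partial_block;
  // A partial block carries no interrupted flag, so the semantics are forced on.
  p.was_interrupted = entry_was_interrupted || sa_mode_partial_block;
  // A partial block was never in vacuum data, so there is nothing to mark.
  p.mark_finished = !sa_mode_partial_block;
  return p;
}
```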
After the partial pass, xvacuum declares total victory: vacuum_Data.oldest_unvacuumed_mvccid = log_Gl.hdr.mvcc_next_id, then logs RVVAC_COMPLETE (carrying mvcc_next_id as redo data) against the first vacuum data page and force-flushes. That record is exactly §11.4’s backward-search terminator — it certifies no unvacuumed MVCC op exists at or before this LSA. vacuum_cleanup_dropped_files (Ch 10), logpb_vacuum_reset_log_header_cache, vacuum_finalize, and is_vacuum_complete = true (making a second xvacuum a no-op) close the pass.
11.7 SA bookkeeping: vacuum_sa_reflect_last_blockid and vacuum_reset_data_after_copydb
A long SA session consumes log without registering blocks; on shutdown (xboot_shutdown_server, SA-only block in boot_sr.c), vacuum_sa_reflect_last_blockid persists the watermark so the next boot’s empty-data branch (§11.3) does not regress it. Early returns: VPID_ISNULL (&vacuum_Data_load.vpid_first), meaning a fresh or aborted boot; vacuum_Data.is_restoredb_session, because “restoredb doesn’t vacuum; we cannot do this here” (the flag is vacuum_initialize’s is_restore parameter); logpb_last_complete_blockid () == VACUUM_NULL_LOG_BLOCKID, in which case it unloads and returns. Otherwise:
```cpp
// vacuum_sa_reflect_last_blockid -- src/query/vacuum.c
vacuum_Data.set_last_blockid (last_blockid);
log_Gl.hdr.vacuum_last_blockid = last_blockid;	/* <- the MAX() source in section 11.3 */
vacuum_data_empty_update_last_blockid (thread_p);	/* <- persists into the empty first page */
```

vacuum_data_empty_update_last_blockid asserts vacuum_is_empty () (single page, index_unvacuumed == index_free, both zero) and rewrites the page via vacuum_init_data_page_with_last_blockid, which logs RVVAC_DATA_INIT_NEW_PAGE redo data (§11.2), so the persisted watermark itself survives a crash.
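The guard order described for vacuum_sa_reflect_last_blockid can be written as a small decision function. This is a sketch under assumptions: ReflectAction and decide_reflect are hypothetical, and NULL_BLOCKID is a placeholder standing in for VACUUM_NULL_LOG_BLOCKID rather than its verified value.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int64_t NULL_BLOCKID = -1;  // placeholder for VACUUM_NULL_LOG_BLOCKID

enum class ReflectAction
{
  SkipNotLoaded,        // vacuum data was never loaded (fresh or aborted boot)
  SkipRestoredb,        // restoredb session: vacuum does not run here
  SkipNoCompleteBlock,  // no complete log block exists yet
  Persist               // write the watermark into the empty first page
};

// Evaluate the early returns in the order the text lists them.
ReflectAction decide_reflect (bool first_page_vpid_is_null, bool is_restoredb_session,
                              std::int64_t last_complete_blockid)
{
  if (first_page_vpid_is_null)
    {
      return ReflectAction::SkipNotLoaded;
    }
  if (is_restoredb_session)
    {
      return ReflectAction::SkipRestoredb;
    }
  if (last_complete_blockid == NULL_BLOCKID)
    {
      return ReflectAction::SkipNoCompleteBlock;
    }
  return ReflectAction::Persist;
}
```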
vacuum_reset_data_after_copydb handles the other identity discontinuity: a copied database carries vacuum data whose blockids reference the source log. On first boot after copydb (boot_after_copydb, gated on log_Gl.hdr.was_copied), it fixes the first page, asserts emptiness (VPID_ISNULL (next_page), index_free == 0 — copydb requires a fully vacuumed source), and reinitializes with vacuum_init_data_page_with_last_blockid (..., VACUUM_NULL_LOG_BLOCKID). That null watermark later trips the log_blockid < 0 “soon after copydb” branch in vacuum_data_load_and_recover.
11.8 The debug verifiers — the document’s invariant catalogue
vacuum_verify_vacuum_data_debug (under !NDEBUG, reached through the VACUUM_VERIFY_VACUUM_DATA macro at the end of vacuum_data_mark_finished and of vacuum_consume_buffer_log_blocks) walks every page and asserts, in effect, every structural claim this document has made:
| Assert (condensed) | Invariant restated | Chapter |
|---|---|---|
| (first_page == last_page) == VPID_ISNULL (first_page->next_page) | page chain has no dangling tail | Ch 1 |
| 0 <= index_unvacuumed <= index_free <= page_data_max_count | per-page cursor sanity | Ch 1 |
| is_vacuumed () ==> i != index_unvacuumed | index_unvacuumed always names a live entry | Ch 9 |
| entry->oldest_visible_mvccid <= get_global_oldest_visible () | no block claims an oldest above the live watermark | Ch 5 |
| oldest_unvacuumed_mvccid <= entry->oldest_visible_mvccid | boot/update watermark lower-bounds all entries | Ch 5, §11.3 |
| entry->get_blockid () <= get_last_blockid () | m_last_blockid upper-bounds registered blocks | Ch 4 |
| vacuum_get_log_blockid (start_lsa.pageid) == get_blockid () | start_lsa lies inside its own block | Ch 3 |
| ascending oldest_visible_mvccid across unvacuumed entries | monotone oldest chain (why §11.4 never resets it) | Ch 5 |
| entry->get_blockid () == (entry - 1)->get_blockid () + 1 | dense, gap-free blockid array | Ch 1, 11-D |
| in_progress_distance > 500 → warning only | job-leak heuristic, “vacuum is behind or blocked” | Ch 6 |
The last row was once an assertion, demoted to vacuum_er_log_warning (“It was an assertion but we have not seen a case that vacuum is blocked”) — far-behind IN_PROGRESS entries indicate leaked jobs, not corruption.
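Several rows of the table translate directly into a checking routine. The sketch below is a single-page toy model, not the real verifier: Entry, Page, and verify_page are hypothetical, the page chain and start_lsa checks are left out, and the choice to apply the MVCCID checks only to unvacuumed entries is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Entry
{
  std::int64_t blockid;
  std::uint64_t oldest_visible_mvccid;
  bool vacuumed;
};

struct Page
{
  int index_unvacuumed;
  int index_free;
  std::vector<Entry> entries;
};

// Returns true when the page satisfies the modelled invariants.
bool verify_page (const Page &page, int max_count, std::int64_t last_blockid,
                  std::uint64_t oldest_unvacuumed, std::uint64_t global_oldest_visible)
{
  // Per-page cursor sanity.
  if (!(0 <= page.index_unvacuumed && page.index_unvacuumed <= page.index_free
        && page.index_free <= max_count))
    return false;
  if (static_cast<std::size_t> (page.index_free) > page.entries.size ())
    return false;

  std::uint64_t prev_oldest = 0;
  for (int i = page.index_unvacuumed; i < page.index_free; ++i)
    {
      const Entry &e = page.entries[i];
      if (e.vacuumed)
        {
          // index_unvacuumed must always name a live entry.
          if (i == page.index_unvacuumed) return false;
        }
      else
        {
          if (e.oldest_visible_mvccid > global_oldest_visible) return false;
          if (oldest_unvacuumed > e.oldest_visible_mvccid) return false;
          if (e.oldest_visible_mvccid < prev_oldest) return false;  // monotone chain
          prev_oldest = e.oldest_visible_mvccid;
        }
      // Registered blocks never exceed the last blockid and form a dense sequence.
      if (e.blockid > last_blockid) return false;
      if (i > page.index_unvacuumed && e.blockid != page.entries[i - 1].blockid + 1)
        return false;
    }
  return true;
}
```

Returning a boolean instead of asserting keeps the sketch testable; the real debug verifier asserts in place.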
Its companion vacuum_verify_vacuum_data_page_fix_count checks Ch 2’s fix discipline at five quiescent points (end of vacuum_data_load_and_recover, end of the master task loop, mid-xvacuum, and right after each VACUUM_VERIFY_VACUUM_DATA site above): first and last page each at fix count exactly 1, and pgbuf_get_hold_count exactly 1 (single-page data) or 2 — anything else means a leaked fix in Ch 4/6/9’s page hand-offs.
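The fix-count rule is compact enough to state as a predicate. The sketch assumes a hypothetical fix_discipline_ok function whose inputs the caller has already collected; how the real code obtains the per-page fix counts is not shown here.

```cpp
#include <cassert>

// True when the quiescent-point fix discipline holds (toy model).
bool fix_discipline_ok (int first_page_fix_count, int last_page_fix_count,
                        int thread_hold_count, bool single_page)
{
  // With single-page data the first and last page are the same page,
  // so the thread holds one page; otherwise it holds exactly two.
  const int expected_hold = single_page ? 1 : 2;
  return first_page_fix_count == 1 && last_page_fix_count == 1
         && thread_hold_count == expected_hold;
}
```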
Finally xvacuum_dump (the utility entry point) is the observability twin: it prints vacuum_min_log_pageid_to_keep (Ch 9’s reclamation floor) and resolves whether that page lives in the active log or which archive (logpb_is_page_in_archive / logpb_get_archive_number). The degenerate branches do not fail — vacuum_Is_booted false or a NULL_PAGEID floor print “vacuum did not boot properly”, a negative archive number prints a bare newline — since it runs against possibly-broken servers.
11.9 Chapter summary — key takeaways
- Vacuum’s crash-recovery contract with ARIES is a single LSA: vacuum_notify_server_crashed records the recovery start before analysis; vacuum_data_load_and_recover consumes and clears it at boot (Invariant 11-A).
- Fourteen RVVAC_* indices (104-117) in RV_fun replay every page-level vacuum mutation during the ordinary redo pass; the rows targeting vacuum data pages are redo-only: vacuum data is never rolled back, only reconciled forward at boot (Invariant 11-B). Dispatch itself is the recovery manager’s job (cubrid-recovery-manager-detail.md).
- IN_PROGRESS is volatile: the boot page-walk rewrites it (including any state recreated by RVVAC_START_JOB redo) to AVAILABLE + INTERRUPTED, which tells re-execution to tolerate half-vacuumed pages (Invariant 11-C).
- vacuum_recover_lost_block_data rebuilds blocks that existed only in the WAL: backward search to the chain tail (terminated by RVVAC_COMPLETE or registered data), per-block chain walks, a std::stack restoring ascending-blockid order (Invariant 11-D), and header-cache restoration (not registration) for the still-open block.
- vacuum_rv_check_at_undo is the only vacuum logic inside rollback: it flattens insid/prev-version metadata on restored records (unconditionally in crash recovery, watermark-driven at runtime) so undo never resurrects vacuum-skipped headers (Invariant 11-E).
- SA mode replays the lifecycle synchronously in xvacuum (per-job worker conversion in vacuum_sa_run_job; an uninterruptible partial-block pass that skips prefetch, forces interrupted semantics, and skips completion bookkeeping; RVVAC_COMPLETE certifies the log fully vacuumed), while vacuum_sa_reflect_last_blockid and vacuum_reset_data_after_copydb protect the watermark across shutdown and copydb. The §11.8 verifiers are the executable regression checklist for all of it.
Position hints as of this revision
The following are line numbers as observed on 2026-06-17; symbols are the canonical anchor and line numbers are hints that decay.
| Symbol | File | Line |
|---|---|---|
resource_shared_pool | src/base/resource_shared_pool.hpp | 29 |
VACUUM_FIRST_LOG_PAGEID_IN_BLOCK | src/query/vacuum.c | 81 |
VACUUM_LAST_LOG_PAGEID_IN_BLOCK | src/query/vacuum.c | 84 |
vacuum_data_entry | src/query/vacuum.c | 104 |
VACUUM_DATA_ENTRY_FLAG_MASK | src/query/vacuum.c | 135 |
VACUUM_DATA_ENTRY_BLOCKID_MASK | src/query/vacuum.c | 137 |
VACUUM_BLOCK_STATUS_MASK | src/query/vacuum.c | 141 |
VACUUM_BLOCK_FLAG_INTERRUPTED | src/query/vacuum.c | 146 |
VACUUM_BLOCKID_WITHOUT_FLAGS | src/query/vacuum.c | 150 |
vacuum_data_page | src/query/vacuum.c | 194 |
VACUUM_DATA_PAGE_HEADER_SIZE | src/query/vacuum.c | 212 |
vacuum_fix_data_page | src/query/vacuum.c | 223 |
vacuum_unfix_data_page | src/query/vacuum.c | 236 |
vacuum_unfix_first_and_last_data_page | src/query/vacuum.c | 255 |
vacuum_job_cursor | src/query/vacuum.c | 277 |
vacuum_shutdown_sequence | src/query/vacuum.c | 319 |
vacuum_data | src/query/vacuum.c | 350 |
oldest_unvacuumed_mvccid | src/query/vacuum.c | 356 |
vacuum_set_dirty_data_page | src/query/vacuum.c | 423 |
vacuum_data_load | src/query/vacuum.c | 442 |
vacuum_Data_load | src/query/vacuum.c | 447 |
vacuum_Master | src/query/vacuum.c | 456 |
vacuum_Block_data_buffer | src/query/vacuum.c | 467 |
VACUUM_BLOCK_DATA_BUFFER_CAPACITY | src/query/vacuum.c | 469 |
vacuum_Finished_job_queue | src/query/vacuum.c | 475 |
VACUUM_PREFETCH_LOG_BLOCK_BUFFER_PAGES | src/query/vacuum.c | 479 |
VACUUM_MAX_TASKS_IN_WORKER_POOL | src/query/vacuum.c | 482 |
VACUUM_FINISHED_JOB_QUEUE_CAPACITY | src/query/vacuum.c | 485 |
VACUUM_WORKER_INDEX_TO_TRANID | src/query/vacuum.c | 490 |
vacuum_Workers | src/query/vacuum.c | 498 |
vacuum_heap_helper | src/query/vacuum.c | 504 |
VACUUM_DEFAULT_HEAP_OBJECT_BUFFER_SIZE | src/query/vacuum.c | 561 |
vacuum_Dropped_files_loaded | src/query/vacuum.c | 567 |
vacuum_Dropped_files_count | src/query/vacuum.c | 576 |
vacuum_dropped_file | src/query/vacuum.c | 580 |
vacuum_dropped_files_page | src/query/vacuum.c | 588 |
VACUUM_DROPPED_FILES_PAGE_CAPACITY | src/query/vacuum.c | 602 |
VACUUM_DROPPED_FILE_FLAG_DUPLICATE | src/query/vacuum.c | 610 |
vacuum_track_dropped_files | src/query/vacuum.c | 640 |
vacuum_Track_dropped_files | src/query/vacuum.c | 645 |
vacuum_Dropped_files_version | src/query/vacuum.c | 650 |
vacuum_Last_dropped_vfid | src/query/vacuum.c | 652 |
vacuum_dropped_files_rcv_data | src/query/vacuum.c | 655 |
vacuum_Is_booted | src/query/vacuum.c | 661 |
vacuum_init_thread_context | src/query/vacuum.c | 766 |
vacuum_master_entry_manager | src/query/vacuum.c | 783 |
vacuum_master_task | src/query/vacuum.c | 813 |
m_oldest_visible_mvccid | src/query/vacuum.c | 834 |
vacuum_worker_entry_manager | src/query/vacuum.c | 843 |
vacuum_worker_task | src/query/vacuum.c | 916 |
vacuum_sa_run_job | src/query/vacuum.c | 949 |
xvacuum | src/query/vacuum.c | 979 |
xvacuum_dump | src/query/vacuum.c | 1121 |
vacuum_initialize | src/query/vacuum.c | 1180 |
vacuum_boot | src/query/vacuum.c | 1291 |
vacuum_stop_workers | src/query/vacuum.c | 1363 |
vacuum_stop_master | src/query/vacuum.c | 1390 |
vacuum_finalize | src/query/vacuum.c | 1416 |
vacuum_heap | src/query/vacuum.c | 1494 |
vacuum_heap_page | src/query/vacuum.c | 1577 |
vacuum_heap_prepare_record | src/query/vacuum.c | 1925 |
vacuum_heap_record_insid_and_prev_version | src/query/vacuum.c | 2195 |
vacuum_heap_record | src/query/vacuum.c | 2361 |
vacuum_heap_get_hfid_and_file_type | src/query/vacuum.c | 2513 |
vacuum_heap_page_log_and_reset | src/query/vacuum.c | 2587 |
vacuum_log_vacuum_heap_page | src/query/vacuum.c | 2651 |
vacuum_rv_redo_vacuum_heap_page | src/query/vacuum.c | 2720 |
vacuum_log_remove_ovf_insid | src/query/vacuum.c | 2856 |
vacuum_rv_redo_remove_ovf_insid | src/query/vacuum.c | 2869 |
vacuum_produce_log_block_data | src/query/vacuum.c | 2905 |
vacuum_data_load_first_and_last_page | src/query/vacuum.c | 2948 |
vacuum_data_unload_first_and_last_page | src/query/vacuum.c | 2979 |
vacuum_master_task::execute | src/query/vacuum.c | 3002 |
vacuum_master_task::check_shutdown | src/query/vacuum.c | 3077 |
vacuum_master_task::is_task_queue_full | src/query/vacuum.c | 3089 |
vacuum_master_task::should_interrupt_iteration | src/query/vacuum.c | 3100 |
vacuum_master_task::is_cursor_entry_ready_to_vacuum | src/query/vacuum.c | 3106 |
vacuum_master_task::is_cursor_entry_available | src/query/vacuum.c | 3136 |
vacuum_master_task::start_job_on_cursor_entry | src/query/vacuum.c | 3155 |
vacuum_master_task::should_force_data_update | src/query/vacuum.c | 3165 |
vacuum_master_task::decrease_outstanding_job | src/query/vacuum.c | 3188 |
vacuum_rv_redo_vacuum_complete | src/query/vacuum.c | 3221 |
vacuum_process_log_block | src/query/vacuum.c | 3251 |
vacuum_worker_allocate_resources | src/query/vacuum.c | 3620 |
vacuum_finalize_worker | src/query/vacuum.c | 3689 |
vacuum_finished_block_vacuum | src/query/vacuum.c | 3724 |
vacuum_read_log_aligned | src/query/vacuum.c | 3797 |
vacuum_read_log_add_aligned | src/query/vacuum.c | 3823 |
vacuum_read_advance_when_doesnt_fit | src/query/vacuum.c | 3838 |
vacuum_copy_data_from_log | src/query/vacuum.c | 3859 |
vacuum_process_log_record | src/query/vacuum.c | 3906 |
vacuum_get_worker_min_dropped_files_version | src/query/vacuum.c | 4135 |
vacuum_compare_blockids | src/query/vacuum.c | 4166 |
vacuum_data_load_and_recover | src/query/vacuum.c | 4183 |
vacuum_load_dropped_files_from_disk | src/query/vacuum.c | 4349 |
vacuum_create_file_for_vacuum_data | src/query/vacuum.c | 4445 |
vacuum_data_initialize_new_page | src/query/vacuum.c | 4498 |
vacuum_rv_redo_initialize_data_page | src/query/vacuum.c | 4520 |
vacuum_create_file_for_dropped_files | src/query/vacuum.c | 4544 |
vacuum_is_work_in_progress | src/query/vacuum.c | 4594 |
vacuum_data_mark_finished | src/query/vacuum.c | 4621 |
vacuum_data_empty_page | src/query/vacuum.c | 4832 |
vacuum_rv_redo_data_finished | src/query/vacuum.c | 4986 |
vacuum_rv_redo_data_finished_dump | src/query/vacuum.c | 5055 |
vacuum_consume_buffer_log_blocks | src/query/vacuum.c | 5096 |
vacuum_rv_undoredo_data_set_link | src/query/vacuum.c | 5361 |
vacuum_rv_redo_append_data | src/query/vacuum.c | 5411 |
vacuum_recover_lost_block_data | src/query/vacuum.c | 5465 |
vacuum_get_log_blockid | src/query/vacuum.c | 5702 |
vacuum_min_log_pageid_to_keep | src/query/vacuum.c | 5722 |
vacuum_is_safe_to_remove_archives | src/query/vacuum.c | 5747 |
vacuum_rv_redo_start_job | src/query/vacuum.c | 5760 |
vacuum_update_keep_from_log_pageid | src/query/vacuum.c | 5782 |
vacuum_compare_dropped_files | src/query/vacuum.c | 5820 |
vacuum_add_dropped_file | src/query/vacuum.c | 5846 |
vacuum_log_add_dropped_file | src/query/vacuum.c | 6121 |
vacuum_rv_redo_add_dropped_file | src/query/vacuum.c | 6167 |
vacuum_rv_undo_add_dropped_file | src/query/vacuum.c | 6235 |
vacuum_rv_replace_dropped_file | src/query/vacuum.c | 6269 |
vacuum_notify_all_workers_dropped_file | src/query/vacuum.c | 6335 |
vacuum_rv_notify_dropped_file | src/query/vacuum.c | 6391 |
vacuum_cleanup_dropped_files | src/query/vacuum.c | 6438 |
vacuum_is_file_dropped | src/query/vacuum.c | 6587 |
vacuum_find_dropped_file | src/query/vacuum.c | 6609 |
vacuum_log_cleanup_dropped_files | src/query/vacuum.c | 6719 |
vacuum_rv_redo_cleanup_dropped_files | src/query/vacuum.c | 6754 |
vacuum_dropped_files_set_next_page | src/query/vacuum.c | 6809 |
vacuum_rv_set_next_page_dropped_files | src/query/vacuum.c | 6834 |
vacuum_compare_heap_object | src/query/vacuum.c | 6862 |
vacuum_collect_heap_objects | src/query/vacuum.c | 6912 |
vacuum_cleanup_collected_by_vfid | src/query/vacuum.c | 6955 |
vacuum_compare_dropped_files_version | src/query/vacuum.c | 6999 |
vacuum_verify_vacuum_data_debug | src/query/vacuum.c | 7060 |
vacuum_log_prefetch_vacuum_block | src/query/vacuum.c | 7165 |
vacuum_fetch_log_page | src/query/vacuum.c | 7215 |
is_not_vacuumed_and_lost | src/query/vacuum.c | 7379 |
vacuum_get_first_page_dropped_files | src/query/vacuum.c | 7449 |
vacuum_is_mvccid_vacuumed | src/query/vacuum.c | 7463 |
vacuum_log_redoundo_vacuum_record | src/query/vacuum.c | 7486 |
vacuum_rv_undo_vacuum_heap_record | src/query/vacuum.c | 7524 |
vacuum_rv_redo_vacuum_heap_record | src/query/vacuum.c | 7539 |
vacuum_notify_server_crashed | src/query/vacuum.c | 7570 |
vacuum_notify_server_shutdown | src/query/vacuum.c | 7582 |
vacuum_verify_vacuum_data_page_fix_count | src/query/vacuum.c | 7595 |
vacuum_rv_check_at_undo | src/query/vacuum.c | 7627 |
vacuum_is_empty | src/query/vacuum.c | 7731 |
vacuum_sa_reflect_last_blockid | src/query/vacuum.c | 7749 |
vacuum_data_empty_update_last_blockid | src/query/vacuum.c | 7783 |
vacuum_convert_thread_to_master | src/query/vacuum.c | 7807 |
vacuum_convert_thread_to_worker | src/query/vacuum.c | 7830 |
vacuum_restore_thread | src/query/vacuum.c | 7856 |
vacuum_rv_es_nop | src/query/vacuum.c | 7876 |
vacuum_notify_es_deleted | src/query/vacuum.c | 7895 |
vacuum_check_shutdown_interruption | src/query/vacuum.c | 7936 |
vacuum_reset_data_after_copydb | src/query/vacuum.c | 7951 |
vacuum_init_data_page_with_last_blockid | src/query/vacuum.c | 7989 |
vacuum_data::get_last_blockid | src/query/vacuum.c | 8019 |
vacuum_data::get_first_blockid | src/query/vacuum.c | 8025 |
vacuum_data::set_last_blockid | src/query/vacuum.c | 8042 |
vacuum_data::update | src/query/vacuum.c | 8058 |
vacuum_data::set_oldest_unvacuumed_on_boot | src/query/vacuum.c | 8089 |
vacuum_data::upgrade_oldest_unvacuumed | src/query/vacuum.c | 8110 |
vacuum_data_entry::vacuum_data_entry | src/query/vacuum.c | 8119 |
vacuum_data_entry::was_interrupted | src/query/vacuum.c | 8162 |
vacuum_data_entry::set_vacuumed | src/query/vacuum.c | 8168 |
vacuum_data_entry::set_job_in_progress | src/query/vacuum.c | 8175 |
vacuum_data_entry::set_interrupted | src/query/vacuum.c | 8181 |
vacuum_data_page::get_index_of_blockid | src/query/vacuum.c | 8203 |
vacuum_data_page::get_first_blockid | src/query/vacuum.c | 8226 |
vacuum_job_cursor::start_job_on_current_entry | src/query/vacuum.c | 8295 |
vacuum_job_cursor::force_data_update | src/query/vacuum.c | 8314 |
vacuum_job_cursor::change_blockid | src/query/vacuum.c | 8327 |
vacuum_job_cursor::readjust_to_vacuum_data_changes | src/query/vacuum.c | 8363 |
vacuum_job_cursor::search | src/query/vacuum.c | 8424 |
vacuum_shutdown_sequence::request_shutdown | src/query/vacuum.c | 8465 |
vacuum_shutdown_sequence::check_shutdown_request | src/query/vacuum.c | 8498 |
VACUUM_LOG_ADD_DROPPED_FILE_POSTPONE | src/query/vacuum.h | 78 |
VACUUM_LOG_BLOCK_PAGES_DEFAULT | src/query/vacuum.h | 82 |
vacuum_worker_state | src/query/vacuum.h | 85 |
vacuum_heap_object | src/query/vacuum.h | 98 |
vacuum_worker | src/query/vacuum.h | 106 |
VACUUM_WORKER | src/query/vacuum.h | 106 |
drop_files_version | src/query/vacuum.h | 109 |
VACUUM_MAX_WORKER_COUNT | src/query/vacuum.h | 132 |
btree_prepare_bts | src/storage/btree.c | 15753 |
btree_rv_read_keybuf_nocopy | src/storage/btree.c | 18391 |
btree_rv_read_keybuf_two_objects | src/storage/btree.c | 18455 |
btree_vacuum_insert_mvccid | src/storage/btree.c | 30304 |
btree_vacuum_object | src/storage/btree.c | 30336 |
HEAP_RV_FLAG_VACUUM_STATUS_CHANGE | src/storage/heap_file.c | 514 |
xheap_reclaim_addresses | src/storage/heap_file.c | 6227 |
heap_rv_undo_delete | src/storage/heap_file.c | 16946 |
heap_rv_undo_update | src/storage/heap_file.c | 16981 |
heap_page_update_chain_after_mvcc_op | src/storage/heap_file.c | 24785 |
heap_rv_remove_flags_from_offset | src/storage/heap_file.c | 25085 |
heap_rv_undo_ovf_update | src/storage/heap_file.c | 26059 |
spage_vacuum_slot | src/storage/slotted_page.c | 4857 |
MVCCID_ALL_VISIBLE | src/storage/storage_common.h | 329 |
entry::claim_system_worker | src/thread/thread_entry.cpp | 425 |
entry::retire_system_worker | src/thread/thread_entry.cpp | 433 |
xboot_shutdown_server | src/transaction/boot_sr.c | 3044 |
xboot_emergency_patch | src/transaction/boot_sr.c | 5292 |
boot_after_copydb | src/transaction/boot_sr.c | 6154 |
xlocator_upgrade_instances_domain | src/transaction/locator_sr.c | 12126 |
redistribute_partition_data | src/transaction/locator_sr.c | 12692 |
prior_update_header_mvcc_info | src/transaction/log_append.cpp | 1320 |
prior_lsa_next_record_internal | src/transaction/log_append.cpp | 1357 |
prior_lsa_next_record | src/transaction/log_append.cpp | 1553 |
prior_lsa_next_record_with_lock | src/transaction/log_append.cpp | 1559 |
prior_lsa_start_append | src/transaction/log_append.cpp | 1593 |
prior_lsa_end_append | src/transaction/log_append.cpp | 1652 |
log_prior_node | src/transaction/log_append.hpp | 91 |
VACUUM_NULL_LOG_BLOCKID | src/transaction/log_common_impl.h | 54 |
LOG_SYSTEM_WORKER_FIRST_TRANID | src/transaction/log_impl.h | 185 |
LOG_ISRESTARTED | src/transaction/log_impl.h | 193 |
block_global_oldest_active_until_commit | src/transaction/log_impl.h | 555 |
LOG_RESTARTED | src/transaction/log_impl.h | 627 |
log_complete | src/transaction/log_manager.c | 5653 |
log_complete_for_2pc | src/transaction/log_manager.c | 5758 |
log_remove_log_archive_daemon_task::execute | src/transaction/log_manager.c | 10243 |
logpb_remove_archive_logs_exceed_limit | src/transaction/log_page_buffer.c | 5991 |
logpb_remove_archive_logs | src/transaction/log_page_buffer.c | 6213 |
logpb_backup | src/transaction/log_page_buffer.c | 7593 |
log_vacuum_info | src/transaction/log_record.hpp | 192 |
log_rec_mvcc_undoredo | src/transaction/log_record.hpp | 202 |
log_rec_mvcc_undo | src/transaction/log_record.hpp | 211 |
log_rec_sysop_end | src/transaction/log_record.hpp | 305 |
log_recovery | src/transaction/log_recovery.c | 736 |
VACUUM_LOG_BLOCKID | src/transaction/log_storage.hpp | 91 |
log_header::vacuum_last_blockid | src/transaction/log_storage.hpp | 153 |
log_header::mvcc_op_log_lsa | src/transaction/log_storage.hpp | 166 |
log_header::oldest_visible_mvccid | src/transaction/log_storage.hpp | 167 |
log_header::newest_block_mvccid | src/transaction/log_storage.hpp | 168 |
log_header::does_block_need_vacuum | src/transaction/log_storage.hpp | 173 |
systdes_claim_tdes | src/transaction/log_system_tran.cpp | 78 |
log_system_tdes::log_system_tdes | src/transaction/log_system_tran.cpp | 104 |
log_tdes::lock_global_oldest_visible_mvccid | src/transaction/log_tran_table.c | 6220 |
MVCC_IS_REC_INSERTED_SINCE_MVCCID | src/transaction/mvcc.c | 58 |
MVCC_IS_REC_DELETED_SINCE_MVCCID | src/transaction/mvcc.c | 61 |
mvcc_satisfies_vacuum | src/transaction/mvcc.c | 321 |
MVCC_IS_HEADER_DELID_VALID | src/transaction/mvcc.h | 87 |
MVCC_IS_HEADER_INSID_NOT_ALL_VISIBLE | src/transaction/mvcc.h | 91 |
mvcc_satisfies_vacuum_result | src/transaction/mvcc.h | 232 |
LOG_IS_MVCC_HEAP_OPERATION | src/transaction/mvcc.h | 245 |
LOG_IS_MVCC_BTREE_OPERATION | src/transaction/mvcc.h | 254 |
Oldest_active_tracker | src/transaction/mvcc_table.cpp | 77 |
mvcctable::advance_oldest_active | src/transaction/mvcc_table.cpp | 142 |
mvcctable::build_mvcc_info | src/transaction/mvcc_table.cpp | 226 |
mvcctable::compute_oldest_visible_mvccid | src/transaction/mvcc_table.cpp | 355 |
mvcctable::complete_mvcc | src/transaction/mvcc_table.cpp | 465 |
mvcctable::reset_transaction_lowest_active | src/transaction/mvcc_table.cpp | 593 |
mvcctable::get_global_oldest_visible | src/transaction/mvcc_table.cpp | 611 |
mvcctable::update_global_oldest_visible | src/transaction/mvcc_table.cpp | 617 |
mvcctable::lock_global_oldest_visible | src/transaction/mvcc_table.cpp | 632 |
mvcctable::unlock_global_oldest_visible | src/transaction/mvcc_table.cpp | 638 |
mvcctable | src/transaction/mvcc_table.hpp | 64 |
m_oldest_visible | src/transaction/mvcc_table.hpp | 118 |
m_ov_lock_count | src/transaction/mvcc_table.hpp | 119 |
RV_fun (RVVAC_* rows) | src/transaction/recovery.c | 687 |
RVVAC_* enum values | src/transaction/recovery.h | 156 |
RVVAC_START_JOB | src/transaction/recovery.h | 157 |
Sources
- cubrid-vacuum.md: the high-level companion. See also cubrid-mvcc-detail.md (the oldest-visible watermark vacuum consumes).
- Raw analyses under raw/code-analysis/cubrid/storage/vacuum/.
- Code: src/query/vacuum.{c,h}; watermark coordination in src/transaction/mvcc_table.cpp.
- Methodology: knowledge/methodology/code-analysis-detail-doc.md.