CUBRID Double Write Buffer — Torn-Page Protection Between Page Buffer and Data Files
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
Theoretical Background
The Double Write Buffer (DWB) exists to defeat one specific failure mode that the WAL protocol on its own cannot recover from: the torn page. A torn page is a half-old / half-new on-disk image of a data page produced when a crash occurs in the middle of a multi-sector write. Because every disk-resident database page is larger than the unit the operating system / hardware can update atomically, a crash mid-write can leave the home location holding a Frankenstein page whose first half corresponds to the new image and whose second half corresponds to the old image (or vice versa, depending on how the OS schedules the sector writes).
The asymmetry between DBMS page size and filesystem /
device atomic write size is the root cause. CUBRID writes data
pages at IO_PAGESIZE (4–16 KiB depending on configuration);
modern Linux filesystems guarantee atomic writes only at the
hardware sector boundary (typically 512 B or 4 KiB). The
expression “atomic write” itself is overloaded here — Database
Internals (Petrov, ch. 5 §“Recovery”) splits it into two
concerns:
- Atomic-on-success. If write(2) succeeds and the OS later crashes, the bytes that made it to the platter may be an arbitrary subset of the buffer. Even with O_DIRECT, the device may complete only some of the sectors before the power loss.
- Atomic-on-failure. Some filesystems (ZFS, Btrfs, XFS with reflinks) commit each page write atomically by writing a new block and switching pointers. On those filesystems torn pages are impossible at the filesystem level — but the DBMS cannot assume it is running on such a filesystem.
The reason WAL alone does not suffice is that ARIES redo
(Mohan et al., TODS 17.1, 1992) needs a coherent before-state
to replay the redo image onto. The redo log carries enough
information to take page state S and produce state S', but if
the on-disk page is neither S nor S' — neither a coherent old
nor coherent new — there is no defined input to redo. Database
Internals §“Torn Pages” frames the three escape hatches every
disk-resident DBMS chooses among:
- Full-page WAL — the first time a page is dirtied after a checkpoint, the entire page image is logged. PostgreSQL’s full_page_writes = on does this. Cost: log-write amplification (a 16 KiB page becomes a 16 KiB log record).
- Atomic 8 KiB / 16 KiB writes from the device or filesystem — relies on the underlying stack guaranteeing atomicity at page granularity. Linux 6.x has experimental support; ZFS has it natively; most enterprise SAN arrays expose it. Cost: non-portable; the DBMS cannot ship the guarantee, only consume it.
- Double-write buffer — every dirty page is written first to a sequential staging area (the “double write” file), the staging area is fsync’d, and only then is the page written to its home location. On crash recovery, any home page that fails its checksum is restored from the staging area. Cost: each data page is written twice and fsync’d twice. Used by MySQL InnoDB, MariaDB, Percona Server, and CUBRID.
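The third option's contract can be seen end to end in a toy simulation. A minimal sketch, with all names invented and a page modeled as four sectors carrying a version stamp: the staged copy survives a crash that tears the home write, and the restart pass repairs from it.

```c
#include <assert.h>
#include <string.h>

/* Toy in-memory model of the double-write protocol (all names invented).
   A "page" is 4 sectors; a crash can tear the home write, but the staged
   copy in the DWB survives and repairs it on restart. */

#define SECTORS 4

typedef struct { int sector[SECTORS]; } page;

static int page_checksum_ok (const page *p)
{
  /* Toy integrity check: every sector must carry the same version stamp. */
  for (int i = 1; i < SECTORS; i++)
    if (p->sector[i] != p->sector[0])
      return 0;
  return 1;
}

/* Steps 1-2: stage the new image in the DWB and "fsync" it (full copy). */
static void dwb_stage (page *dwb_slot, const page *new_image)
{
  *dwb_slot = *new_image;
}

/* Step 3: home write that a crash interrupts after sectors_done sectors. */
static void home_write_torn (page *home, const page *new_image, int sectors_done)
{
  for (int i = 0; i < sectors_done; i++)
    home->sector[i] = new_image->sector[i];
}

/* Recovery: if the home page fails its check, restore it from the DWB. */
static void dwb_recover (page *home, const page *dwb_slot)
{
  if (!page_checksum_ok (home))
    *home = *dwb_slot;
}

int dwb_demo (void)
{
  page home = {{1, 1, 1, 1}};       /* coherent old image, version 1 */
  page new_image = {{2, 2, 2, 2}};  /* coherent new image, version 2 */
  page dwb_slot;

  dwb_stage (&dwb_slot, &new_image);       /* DWB write + fsync */
  home_write_torn (&home, &new_image, 2);  /* crash mid home write */
  assert (!page_checksum_ok (&home));      /* half-old / half-new */

  dwb_recover (&home, &dwb_slot);          /* restart repair pass */
  assert (page_checksum_ok (&home));
  return home.sector[0];
}
```

Without the staged copy, the recovery pass would have nothing coherent to restore from: neither the old nor the new image exists on disk after the torn write.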
CUBRID picks option 3, which is also the InnoDB default. The
remainder of this document is the slow-zoom into how it is built
in src/storage/double_write_buffer.{hpp,cpp} and how it stitches
into the page buffer’s flush path, the file_io subsystem, and the
boot/recovery driver.
Common DBMS Design
Every torn-page-defending engine that picks the double-write
buffer path arranges roughly the same components. The names
differ (InnoDB has Doublewrite_log / dblwr; MariaDB has
buf_dblwr; Percona has the parallel doublewrite; CUBRID has
dwb_Global); the shape is shared. This section names the
shared engineering vocabulary, and the “CUBRID’s Approach”
section that follows is the slow zoom into each row of the
mapping table at the end.
Slot-array layout — sequential, fixed-size, contiguous on disk
The DWB is a fixed-size on-disk file (or pre-allocated extent of the system tablespace) divided into slots. Each slot holds exactly one DBMS page. Writes to slots are sequential — a position counter wraps around — so the underlying disk sees a streaming workload. InnoDB’s classical layout is 2 × 1 MiB blocks × 64 slots × 16 KiB per page; the flushed-to-DWB fsync is a single sequential 1 MiB write.
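Assuming the classical InnoDB numbers just quoted (2 blocks × 64 slots × 16 KiB), the slot arithmetic is a sketch's worth of code; the names here are illustrative, not from any engine's source:

```c
#include <assert.h>
#include <stdint.h>

/* Slot-array arithmetic for a DWB file, using the classical InnoDB
   layout quoted above. Names are illustrative. */

enum { NUM_BLOCKS = 2, SLOTS_PER_BLOCK = 64, PAGE_SIZE = 16 * 1024 };
enum { TOTAL_SLOTS = NUM_BLOCKS * SLOTS_PER_BLOCK };

/* Byte offset of a slot inside the DWB file: slots are contiguous, so a
   full block flush is one sequential SLOTS_PER_BLOCK * PAGE_SIZE write. */
static uint64_t slot_offset (unsigned position)
{
  return (uint64_t) (position % TOTAL_SLOTS) * PAGE_SIZE;
}

/* The position counter simply wraps: after the last slot of the last
   block, the next producer lands back on slot 0. */
static unsigned next_position (unsigned position)
{
  return (position + 1) % TOTAL_SLOTS;
}
```

With these numbers, block 1 begins exactly at the 1 MiB boundary, which is why the per-block flush is a single sequential 1 MiB write.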
Block — the fsync unit
Slots are grouped into blocks. A block is the unit of “fsync
the DWB before writing home pages”. The whole block is written,
then fsync’d, then its slots are written to their home volumes
one at a time. Block-level batching matters for two reasons.
First, it amortises the fsync cost over many pages.
Second, it lets the DWB writer thread know when a block has been
fully filled and is therefore safe to flush — once a block is
full, no more producers can be writing into it.
Position-with-flags — a single 64-bit atomic word
Concurrent producers (page-buffer flushers, foreground writers) need a lock-free way to claim the next free slot. The standard technique is a 64-bit atomic word that packs:
- the current slot position (low bits),
- a per-block “write started” bitmask (high bits, one bit per block), and
- a couple of lifecycle flags (CREATED, MODIFY_STRUCTURE).
CAS-loops on this word coordinate slot acquisition, block-bit flipping at block boundaries, and structure modifications (create / destroy / resize).
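A minimal sketch of the technique, with invented names and a toy slot count (CUBRID's actual macros and CAS loop appear later in the walkthrough):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Sketch of claiming a slot with a CAS loop on a packed 64-bit word:
   low 30 bits hold the slot position, high bits are left free for the
   per-block flags described above. Names and sizes are illustrative. */

#define POS_MASK 0x000000003fffffffULL
#define NUM_SLOTS 128u  /* toy total; real code wraps at the DWB page count */

static _Atomic uint64_t position_with_flags;

static unsigned claim_next_slot (void)
{
  uint64_t cur, next;
  unsigned pos;
  do
    {
      cur = atomic_load (&position_with_flags);
      pos = (unsigned) (cur & POS_MASK);
      /* Advance the position; wrap inside the low bits, keep flag bits. */
      next = (cur & ~POS_MASK) | ((pos + 1u) % NUM_SLOTS);
    }
  while (!atomic_compare_exchange_weak (&position_with_flags, &cur, next));
  return pos;  /* the slot this thread now owns */
}
```

Losers of the CAS race simply retry with the reloaded word, so no slot is ever handed to two producers.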
Producer-side flow
A page-buffer flusher about to write a dirty page to its home volume does, in order:
- Acquire DWB slot. Atomic CAS-bump the position counter, land on a slot.
- Copy page bytes into the slot. The slot’s io_page pointer is the destination memory; the page is memcpy’d in.
- Insert into the DWB hash. The hash is keyed on VPID = (volid, pageid), so a concurrent reader that is about to fix the same page can find a fresh in-memory copy in the DWB rather than re-read from a possibly-torn home volume.
- If block fills, request flush. Bump the per-block page count. When it reaches BLOCK_NUM_PAGES, wake the dwb-flush-block daemon (or do it inline if running standalone). The daemon writes the entire block to the DWB volume sequentially, fsyncs it, then writes each slot’s contents to its home volume.
Recovery-side flow
On restart, before any redo, the engine opens the existing
DWB volume from disk. It reads every slot’s page image and
checks, for each (volid, pageid) named by the slot:
- Read the home page from its volume.
- Compare a checksum / sanity-check the home page.
- If the home page is corrupt and the DWB slot’s image is sane, overwrite the home page with the DWB slot.
After this DWB-driven repair pass, the DWB volume is unceremoniously deleted and recreated with the parameters named in the current configuration. The redo pass then runs against home pages that are now guaranteed to be coherent (either pre- or post-image; never torn).
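The per-page decision in that scan is a small truth table. A hedged sketch, with invented enum names:

```c
#include <assert.h>

/* Per-page repair decision during the recovery scan. The enum names are
   illustrative, not CUBRID symbols. */

enum repair_action { KEEP_HOME, RESTORE_FROM_DWB, FATAL_BOTH_CORRUPT };

static enum repair_action decide (int home_page_sane, int dwb_slot_sane)
{
  if (home_page_sane)
    return KEEP_HOME;         /* home write completed; nothing to do */
  if (dwb_slot_sane)
    return RESTORE_FROM_DWB;  /* torn home page, coherent staged copy */
  return FATAL_BOTH_CORRUPT;  /* unrecoverable: both images damaged */
}
```

The fatal arm is reachable only if the DWB fsync itself was violated, which is why the staging-area fsync is the load-bearing barrier of the whole scheme.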
Theory ↔ CUBRID mapping
| Theoretical concept | CUBRID name |
|---|---|
| DWB volume on disk | dwb_Volume_name (built by fileio_make_dwb_name) |
| Slot — one DBMS page-sized cell | DWB_SLOT (double_write_buffer.hpp) |
| Block — fsync unit, group of slots | DWB_BLOCK (double_write_buffer.cpp) |
| Atomic position-with-flags word | dwb_Global.position_with_flags (UINT64) |
| Slot-VPID hash for in-memory lookup | dwb_Global.slots_hashmap (lockfree hashmap) |
| Producer-side acquire | dwb_acquire_next_slot → dwb_set_data_on_next_slot |
| Producer-side stage + insert | dwb_add_page → dwb_slots_hash_insert |
| Block writer (DWB write + fsync + home write) | dwb_flush_block → dwb_write_block |
| Background flush daemon | dwb_flush_block_daemon (1 ms tick) |
| File-sync helper daemon | dwb_file_sync_helper_daemon (10 ms tick) |
| Reader-side hit (concurrent fix) | dwb_read_page (called from pgbuf_fix page-load path) |
| Crash recovery scan | dwb_load_and_recover_pages (called by boot_sr.c) |
| Per-page corruption gate | dwb_check_data_page_is_sane + fileio_page_check_corruption |
| Force-flush of pending DWB pages | dwb_flush_force (called by fileio_synchronize_all) |
CUBRID’s Approach
The DWB has six moving parts: the on-disk volume that holds the staged copies, the in-memory blocks/slots that mirror the volume layout one-to-one, the position-with-flags atomic that coordinates concurrent producers, the slot hash that turns the DWB into a cache for concurrent readers, the flush machinery that drives a filled block to disk, and the recovery scan that uses the volume after a crash. We walk them in that order.
Layout — fixed-size sequential volume and contiguous blocks
The DWB is a single permanent volume created at database init,
named by fileio_make_dwb_name (file_io.c:5882):

```c
// fileio_make_dwb_name — src/storage/file_io.c
void
fileio_make_dwb_name (char *dwb_name_p, const char *dwb_path_p, const char *db_name_p)
{
  sprintf (dwb_name_p, "%s%s%s%s", dwb_path_p, FILEIO_PATH_SEPARATOR (dwb_path_p),
           db_name_p, FILEIO_SUFFIX_DWB);
}
```

The volume sits next to the active log; the suffix is
FILEIO_SUFFIX_DWB. Its size is governed by two parameters
plus one derived quantity:
| Parameter | Bound | CUBRID symbol |
|---|---|---|
| Total buffer size | 512 KiB ≤ size ≤ 32 MiB (power of 2) | PRM_ID_DWB_SIZE |
| Number of blocks | 1 ≤ blocks ≤ 32 (power of 2) | PRM_ID_DWB_BLOCKS |
| Per-block pages | derived: total_pages / num_blocks | dwb_Global.num_block_pages |
Both knobs are loaded by dwb_load_buffer_size /
dwb_load_block_count, each clamped to its range and rounded
up to a power of two via dwb_power2_ceil. Setting either
to zero disables the DWB entirely.
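A sketch of that clamp-and-round step, an illustrative reimplementation rather than the CUBRID functions, using the bounds from the table:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative reimplementation of the clamp-and-round-up behavior the
   text describes; not the CUBRID source. */

static uint64_t power2_ceil (uint64_t x)
{
  uint64_t p = 1;
  while (p < x)
    p <<= 1;
  return p;
}

enum { DWB_MIN_SIZE = 512 * 1024, DWB_MAX_SIZE = 32 * 1024 * 1024 };

/* 0 disables the DWB; anything else is clamped to [512 KiB, 32 MiB] and
   rounded up to the next power of two. */
static uint64_t load_buffer_size (uint64_t requested)
{
  if (requested == 0)
    return 0;
  if (requested < DWB_MIN_SIZE)
    requested = DWB_MIN_SIZE;
  if (requested > DWB_MAX_SIZE)
    requested = DWB_MAX_SIZE;
  return power2_ceil (requested);
}
```

Keeping both knobs powers of two is what lets the position macros later in this document use masks and shifts instead of division.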
The volume is formatted at create time with
fileio_format (boot_db_full_name, dwb_volume_name, LOG_DBDWB_VOLID, num_block_pages, …) — a permanent volume id
reserved for the DWB. A pre-existing DWB volume on the next
restart is the recovery trigger; an absent volume means there
is nothing to repair.
Producer-side data structures
Section titled “Producer-side data structures”// DWB_SLOT — src/storage/double_write_buffer.hppstruct double_write_slot{ FILEIO_PAGE *io_page; /* The contained page or NULL. */ VPID vpid; /* The page identifier. */ LOG_LSA lsa; /* The page LSA */ bool ensure_metadata; /* Include metadata when syncing */ unsigned int position_in_block; /* The position in block. */ unsigned int block_no; /* The number of the block where the slot reside. */};
// DWB_BLOCK — src/storage/double_write_buffer.cppstruct double_write_block{ FLUSH_VOLUME_INFO *flush_volumes_info; /* per-block flush bookkeeping */ volatile unsigned int count_flush_volumes_info; unsigned int max_to_flush_vdes;
pthread_mutex_t mutex; /* protects wait_queue */ DWB_WAIT_QUEUE wait_queue; /* threads sleeping on this block's flush */
char *write_buffer; /* contiguous block bytes — written in one fileio_write_pages */ DWB_SLOT *slots; /* slot view onto write_buffer */ volatile unsigned int count_wb_pages; /* current fill level */
unsigned int block_no; volatile UINT64 version; /* incremented after each flush */ volatile bool all_pages_written; /* set when home writes complete */};
// DOUBLE_WRITE_BUFFER — global singletonstruct double_write_buffer{ bool logging_enabled; DWB_BLOCK *blocks; /* num_blocks-sized array */ unsigned int num_blocks; /* power of 2, ≤ DWB_MAX_BLOCKS = 32 */ unsigned int num_pages; /* num_blocks × num_block_pages */ unsigned int num_block_pages; /* power of 2 */ unsigned int log2_num_block_pages; /* used by macros */
volatile unsigned int blocks_flush_counter; volatile unsigned int next_block_to_flush;
pthread_mutex_t mutex; DWB_WAIT_QUEUE wait_queue; /* structure-modification waiters */
UINT64 volatile position_with_flags; /* THE coordinator */
dwb_hashmap_type slots_hashmap; /* VPID → DWB_SLOTS_HASH_ENTRY */ int vdes; /* volume descriptor */
DWB_BLOCK *volatile file_sync_helper_block; /* block being post-flushed */};DWB_BLOCK::write_buffer is the only memory backing the
slots — DWB_BLOCK::slots[i].io_page is just a pointer into
write_buffer + i × IO_PAGESIZE. Allocation makes that
explicit:
```c
// dwb_create_blocks — src/storage/double_write_buffer.cpp
for (i = 0; i < num_blocks; i++)
  {
    blocks_write_buffer[i] = (char *) malloc (block_buffer_size * sizeof (char));
    ...
    for (j = 0; j < num_block_pages; j++)
      {
        io_page = (FILEIO_PAGE *) (blocks_write_buffer[i] + j * IO_PAGESIZE);
        fileio_initialize_res (thread_p, io_page, IO_PAGESIZE);
        dwb_initialize_slot (&slots[i][j], io_page, j, i);
      }
    dwb_initialize_block (&blocks[i], i, 0, blocks_write_buffer[i], slots[i],
                          flush_volumes_info[i], 0, num_block_pages);
  }
```

Two consequences. (a) The block can be written to disk as
one contiguous fileio_write_pages call — the bytes are
already laid out in volume order. (b) Slot mutation (a
producer copying its page into a slot) is a memcpy into the
shared write buffer; concurrent producers writing into
different slots are safe because each producer owns its slot
position by way of the position-with-flags atomic.
Position-with-flags — the central coordinator
The 64-bit atomic dwb_Global.position_with_flags is the only
producer-side serialization point. Bit layout:
```
bit 63                32   31    30    29              0
+-----------------------+ +---+ +---+ +-----------------+
| block-write-started   | | M | | C | | slot position   |
| bitmask (one bit per  | |   | |   | | (30 bits)       |
| block, max 32 blocks) | |   | |   | |                 |
+-----------------------+ +---+ +---+ +-----------------+

M = MODIFY_STRUCTURE flag (bit 31)
C = CREATE flag (bit 30)
```

Macro accessors hide the encoding:
```c
// position-with-flags macros — double_write_buffer.cpp
#define DWB_GET_POSITION(p)     ((p) & 0x000000003fffffffULL)  /* low 30 bits */
#define DWB_GET_BLOCK_STATUS(p) ((p) & 0xffffffff00000000ULL)  /* high 32 bits */
#define DWB_MODIFY_STRUCTURE    0x0000000080000000ULL
#define DWB_CREATE              0x0000000040000000ULL

/* "block write started" bit — block_no occupies bit (63 - block_no) */
#define DWB_IS_BLOCK_WRITE_STARTED(p, block_no) \
  (((p) & (1ULL << (63 - (block_no)))) != 0)
#define DWB_STARTS_BLOCK_WRITING(p, block_no) ((p) | (1ULL << (63 - (block_no))))
#define DWB_ENDS_BLOCK_WRITING(p, block_no)   ((p) & ~(1ULL << (63 - (block_no))))

/* "next slot position" — wraps at num_pages */
#define DWB_GET_NEXT_POSITION_WITH_FLAGS(p) \
  (DWB_GET_POSITION (p) == (DWB_NUM_TOTAL_PAGES - 1) \
   ? ((p) & DWB_FLAG_MASK) : ((p) + 1))
```

The 30-bit position field caps the total slot count at 2³⁰
slots, far above the DWB_MAX_SIZE / IO_PAGESIZE cap of a few
thousand. The block-status bit field hard-caps the number of
blocks at 32 (DWB_MAX_BLOCKS), which is why PRM_ID_DWB_BLOCKS
is clamped to that value.
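The quoted macros can be exercised on their own. The macros below are copied from the source above; the harness function is illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* The position-with-flags macros quoted above, exercised standalone. */
#define DWB_GET_POSITION(p)     ((p) & 0x000000003fffffffULL)  /* low 30 bits */
#define DWB_GET_BLOCK_STATUS(p) ((p) & 0xffffffff00000000ULL)  /* high 32 bits */
#define DWB_IS_BLOCK_WRITE_STARTED(p, block_no) \
  (((p) & (1ULL << (63 - (block_no)))) != 0)
#define DWB_STARTS_BLOCK_WRITING(p, block_no) ((p) | (1ULL << (63 - (block_no))))
#define DWB_ENDS_BLOCK_WRITING(p, block_no)   ((p) & ~(1ULL << (63 - (block_no))))

/* Setting then clearing a block's "write started" bit round-trips, and
   never disturbs the low 30-bit position field. Harness is illustrative. */
static int block_bit_roundtrip (uint64_t word, unsigned block_no)
{
  uint64_t started = DWB_STARTS_BLOCK_WRITING (word, block_no);
  assert (DWB_IS_BLOCK_WRITE_STARTED (started, block_no));
  assert (DWB_GET_POSITION (started) == DWB_GET_POSITION (word));
  return DWB_ENDS_BLOCK_WRITING (started, block_no) == word;
}
```

Because block_no maps to bit (63 - block_no), block 0 owns the top bit, which is why at most 32 blocks fit above the two lifecycle flags.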
The CAS loop at the heart of dwb_acquire_next_slot is what
keeps this hot path lock-free:
```c
// dwb_acquire_next_slot — src/storage/double_write_buffer.cpp (condensed)
STATIC_INLINE int
dwb_acquire_next_slot (THREAD_ENTRY *thread_p, bool can_wait, DWB_SLOT **p_dwb_slot)
{
start:
  current_position_with_flags = ATOMIC_INC_64 (&dwb_Global.position_with_flags, 0ULL);

  if (DWB_NOT_CREATED_OR_MODIFYING (current_position_with_flags))
    {
      /* wait or return — see below */
    }

  current_block_no = DWB_GET_BLOCK_NO_FROM_POSITION (current_position_with_flags);
  position_in_current_block = DWB_GET_POSITION_IN_BLOCK (current_position_with_flags);

  if (position_in_current_block == 0)
    {
      /* first write into this block — must wait for previous iteration to flush */
      if (DWB_IS_BLOCK_WRITE_STARTED (current_position_with_flags, current_block_no))
        {
          if (!can_wait)
            return NO_ERROR;
          dwb_wait_for_block_completion (thread_p, current_block_no);
          goto start;
        }
      current_position_with_block_write_started =
        DWB_STARTS_BLOCK_WRITING (current_position_with_flags, current_block_no);
      new_position_with_flags =
        DWB_GET_NEXT_POSITION_WITH_FLAGS (current_position_with_block_write_started);
    }
  else
    {
      new_position_with_flags = DWB_GET_NEXT_POSITION_WITH_FLAGS (current_position_with_flags);
    }

  if (!ATOMIC_CAS_64 (&dwb_Global.position_with_flags,
                      current_position_with_flags, new_position_with_flags))
    goto start;  /* lost race, try again */

  block = dwb_Global.blocks + current_block_no;
  *p_dwb_slot = block->slots + position_in_current_block;
  VPID_SET_NULL (&(*p_dwb_slot)->vpid);  /* invalidate previous occupant */
  return NO_ERROR;
}
```

The first-into-a-block producer is the one who flips the “block-write-started” bit for the new block. That bit is the synchronization handle the flush daemon and the structure-modification path read to know “block N has staged data that must be written”.
Producer side — pgbuf flush flow
The single biggest producer of DWB writes is the page buffer’s
flush path. From pgbuf_bcb_safe_flush_internal_release_mutex
(in page_buffer.c around line 10468), the relevant slice is:
```c
// pgbuf flush path — src/storage/page_buffer.c
DWB_SLOT *dwb_slot = NULL;
bool uses_dwb;

uses_dwb = dwb_is_created () && !is_temp;

start_copy_page:
  /* ... TDE-encrypt or memcpy bufptr → iopage ... */
  if (uses_dwb)
    {
      error = dwb_set_data_on_next_slot (thread_p, iopage, false, false, &dwb_slot);
      if (dwb_slot != NULL)
        {
          iopage = NULL;  /* the slot's io_page replaces our local */
          goto copy_unflushed_lsa;
        }
    }

copy_unflushed_lsa:
  /* ... WAL flush (logpb_flush_log_for_wal) ... */

  if (uses_dwb)
    {
      error = dwb_add_page (thread_p, iopage, &bufptr->vpid, false, &dwb_slot);
      /* dwb_add_page enqueues; the actual home write is done later by
         dwb_flush_block via dwb_write_block */
    }
  else
    {
      write_mode = (dwb_is_created () == true
                    ? FILEIO_WRITE_NO_COMPENSATE_WRITE : FILEIO_WRITE_DEFAULT_WRITE);
      fileio_write (thread_p, ..., iopage, ..., write_mode);
    }
```

Two properties matter. (a) is_temp short-circuits the DWB
for temporary volumes — torn pages on temp volumes are
benign because temp volumes are recreated at next start.
(b) The path is split in two: dwb_set_data_on_next_slot
acquires and stages, but the WAL force happens between
staging and dwb_add_page. This is the WAL invariant honored
inside the DWB path: the log record describing the change must
be on disk before the data page can be written, even though
the data page is going to the DWB rather than the home volume
first.
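The invariant reduces to a one-line predicate on LSAs. A toy sketch, with LSAs modeled as plain integers and invented names:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model (invented names) of the WAL rule honored between staging and
   dwb_add_page: a page may be written, even to the DWB, only once the log
   is durable up to the page's LSA. */

static int wal_allows_page_write (uint64_t flushed_log_lsa, uint64_t page_lsa)
{
  return flushed_log_lsa >= page_lsa;
}

/* A flusher that finds the rule violated forces the log first, advancing
   the durable-log frontier at least up to the page's LSA. */
static uint64_t force_log_up_to (uint64_t flushed_log_lsa, uint64_t page_lsa)
{
  return flushed_log_lsa >= page_lsa ? flushed_log_lsa : page_lsa;
}
```

This is why the WAL force sits between the two DWB calls in the flush path above: staging into the slot is harmless, but nothing may reach any disk location before the log catches up.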
Producer side — file_io.c flow for non-pgbuf writers
A second producer is fileio_write_or_add_to_dwb
(file_io.c:4014). It is used by callers like
volume-extension code that write a page directly via
fileio_write rather than through the page buffer:
```c
// fileio_write_or_add_to_dwb — src/storage/file_io.c
void *
fileio_write_or_add_to_dwb (THREAD_ENTRY *thread_p, int vol_fd, FILEIO_PAGE *io_page_p,
                            PAGEID page_id, size_t page_size, bool ensure_metadata)
{
  bool skip_flush = false;
  DWB_SLOT *p_dwb_slot = NULL;

  skip_flush = dwb_is_created ();
  if (skip_flush)
    {
      arg.vdes = vol_fd;
      vol_info_p = fileio_traverse_permanent_volume (thread_p,
                                                     fileio_is_volume_descriptor_equal,
                                                     &arg);
      if (vol_info_p)
        {
          /* permanent volume — route through DWB */
          VPID_SET (&vpid, vol_info_p->volid, page_id);
          io_page_p->prv.volid = vol_info_p->volid;
          io_page_p->prv.pageid = page_id;
          error_code = dwb_add_page (thread_p, io_page_p, &vpid, ensure_metadata,
                                     &p_dwb_slot);
          if (p_dwb_slot != NULL)
            return io_page_p;  /* staged */
        }
      /* not permanent OR DWB disabled meanwhile — fall through to direct write */
    }

  write_mode = skip_flush ? FILEIO_WRITE_NO_COMPENSATE_WRITE : FILEIO_WRITE_DEFAULT_WRITE;
  return fileio_write (thread_p, vol_fd, io_page_p, page_id, page_size, write_mode);
}
```

FILEIO_WRITE_NO_COMPENSATE_WRITE vs.
FILEIO_WRITE_DEFAULT_WRITE is the second torn-page knob:
when the DWB is active, fileio_write is told not to
re-write the destination page on detection of a partial write
because the DWB will do that recovery on restart. Without the
DWB, fileio_write falls back to its own retry-on-partial-write
path.
Producer-side flow (Mermaid)
```mermaid
sequenceDiagram
  participant PB as page_buffer flush path
  participant DA as dwb_set_data_on_next_slot
  participant WL as logpb_flush_log_for_wal (WAL force)
  participant DP as dwb_add_page
  participant SH as dwb_slots_hashmap (VPID hash)
  participant BL as DWB_BLOCK (count_wb_pages++)
  participant FD as dwb-flush-block daemon
  participant FB as dwb_flush_block
  PB->>DA: stage iopage into next slot
  DA-->>PB: DWB_SLOT* (page memcpy'd into slot)
  PB->>WL: force WAL up to page LSA
  WL-->>PB: nxio_lsa >= page.lsa
  PB->>DP: dwb_add_page(iopage, vpid, slot)
  DP->>SH: insert (vpid → slot)
  DP->>BL: ATOMIC_INC_32(&count_wb_pages)
  alt block now full
    DP->>FD: wakeup
    FD->>FB: dwb_flush_next_block
    FB->>FB: fileio_write_pages (block → DWB volume)
    FB->>FB: fileio_synchronize (DWB volume)
    FB->>FB: dwb_write_block (slots → home volumes)
    FB->>FB: fileio_synchronize (home volumes — partly via helper)
  else block partially full
    DP-->>PB: NO_ERROR (no flush yet)
  end
```
Slot hash — turning the DWB into a read-side cache
When the DWB has staged a page but not yet written it to its
home location, a concurrent reader (any thread doing
pgbuf_fix on a page-table miss) would otherwise read the
old image from the home volume. The slot hash prevents that.
On every successful dwb_add_page, the slot is registered in
dwb_Global.slots_hashmap keyed on VPID.
The page-buffer fix path consults the hash before falling
through to fileio_read:
```c
// pgbuf_fix path — page_buffer.c:8239
if (dwb_read_page (thread_p, vpid, &bufptr->iopage_buffer->iopage, &success) != NO_ERROR)
  return NULL;
else if (success == true)
  /* nothing to do — page bytes copied from DWB slot */ ;
else if (fileio_read (thread_p, fileio_get_volume_descriptor (vpid->volid), ...) == NULL)
  /* error path */ ;
```

dwb_read_page looks up the hash, and if the slot’s VPID still
matches the requested one (a producer could be in the middle of
overwriting it), it memcpy’s the slot’s io_page into the
caller’s buffer:
```c
// dwb_read_page — src/storage/double_write_buffer.cpp:3968
int
dwb_read_page (THREAD_ENTRY *thread_p, const VPID *vpid, void *io_page, bool *success)
{
  *success = false;
  if (!dwb_is_created ())
    return NO_ERROR;

  VPID key_vpid = *vpid;
  slots_hash_entry = dwb_Global.slots_hashmap.find (thread_p, key_vpid);
  if (slots_hash_entry != NULL)
    {
      if (VPID_EQ (&slots_hash_entry->slot->vpid, vpid))
        {
          memcpy ((char *) io_page, (char *) slots_hash_entry->slot->io_page, IO_PAGESIZE);
          *success = true;
        }
      pthread_mutex_unlock (&slots_hash_entry->mutex);
    }
  return NO_ERROR;
}
```

The semantics matter: a DWB hit means the reader gets the newest version of the page, even one that has not yet been written to the home volume. This is one of the rare design points where the DWB is doing more than torn-page protection — it is also a write-back cache for the home volume. A torn-page-only design would not need the slot hash; CUBRID adds it because the cache is essentially free given the slots already exist in memory.
LSA-ordered hash insert — handling re-fixes
A page can land in the DWB twice within one block-fill window:
the same VPID gets dirtied, flushed-staged, dirtied again,
flushed-staged again. dwb_slots_hash_insert resolves this by
keeping the newest version visible to readers and
invalidating the older slot:
```c
// dwb_slots_hash_insert — src/storage/double_write_buffer.cpp (excerpt)
*inserted = dwb_Global.slots_hashmap.find_or_insert (thread_p, *vpid, slots_hash_entry);
if (! (*inserted))
  {
    if (LSA_LT (&slot->lsa, &slots_hash_entry->slot->lsa))
      {
        /* The older slot is better than mine — leave it in hash. */
        pthread_mutex_unlock (&slots_hash_entry->mutex);
        return NO_ERROR;
      }
    else if (LSA_EQ (&slot->lsa, &slots_hash_entry->slot->lsa))
      {
        /* Same LSA — page modified without logging (rare).
           Replace, but invalidate old slot if in same block. */
        if (slots_hash_entry->slot->block_no == slot->block_no)
          {
            VPID_SET_NULL (&slots_hash_entry->slot->vpid);
            fileio_initialize_res (thread_p, slots_hash_entry->slot->io_page, IO_PAGESIZE);
          }
      }
    slot->ensure_metadata = slot->ensure_metadata || slots_hash_entry->slot->ensure_metadata;
  }
slots_hash_entry->slot = slot;
```

The LSA_LT check (“I am older than the cached one”) is the
defense against producers whose own slot got a smaller LSA
than another producer’s that won the hash race. The
LSA_EQ arm handles the “modified without logging” path
(temp-volume-style writes whose LSA was never advanced).
Flush — block-level fsync, then home writes
dwb_flush_block is the heart of the durability story. It
runs either inline (when the producer fills the last slot in a
block and there is no daemon available) or from the
dwb-flush-block daemon. Its sequence is:
```c
// dwb_flush_block — src/storage/double_write_buffer.cpp:2191 (condensed)
STATIC_INLINE int
dwb_flush_block (THREAD_ENTRY *thread_p, DWB_BLOCK *block,
                 bool file_sync_helper_can_flush, UINT64 *cur_pos_w_flags)
{
  ATOMIC_INC_32 (&dwb_Global.blocks_flush_counter, 1);

  /* (1) Snapshot the slots in VPID order so the home writes hit each
         volume contiguously. */
  dwb_block_create_ordered_slots (block, &p_dwb_ordered_slots, &ordered_slots_length);

  /* (2) De-duplicate: the same VPID may appear twice — keep the newer LSA. */
  for (i = 0; i < block->count_wb_pages - 1; i++)
    {
      ...
    }

  /* (3) Wait for the previous block's home writes to finish. */
  while (dwb_Global.file_sync_helper_block != NULL)
    {
      /* thread_sleep (1) or flush inline */
    }

  /* (4) WRITE THE WHOLE BLOCK TO THE DWB VOLUME — sequential, fast. */
  fileio_write_pages (thread_p, dwb_Global.vdes, block->write_buffer, 0,
                      block->count_wb_pages, IO_PAGESIZE,
                      FILEIO_WRITE_NO_COMPENSATE_WRITE);

  /* (5) FSYNC THE DWB VOLUME — durability barrier. */
  fileio_synchronize (thread_p, dwb_Global.vdes, dwb_Volume_name, false);

  /* (6) WRITE THE PAGES TO THEIR HOME VOLUMES (and remove from slot hash). */
  dwb_write_block (thread_p, block, p_dwb_ordered_slots, ordered_slots_length,
                   file_sync_helper_can_flush, /* remove_from_hash = */ true);

  /* (7) FSYNC THE HOME VOLUMES — small ones inline, big ones offloaded to helper. */
  for (i = 0; i < block->count_flush_volumes_info; i++)
    {
      if (file_sync_helper_can_flush && num_pages > max_pages_to_sync
          && dwb_is_file_sync_helper_daemon_available ())
        continue;  /* let the helper do it */
      if (ATOMIC_CAS_32 (&block->flush_volumes_info[i].flushed_status,
                         VOLUME_NOT_FLUSHED, VOLUME_FLUSHED_BY_DWB_FLUSH_THREAD))
        fileio_synchronize (thread_p, block->flush_volumes_info[i].vdes, NULL,
                            block->flush_volumes_info[i].metadata);
    }

  block->all_pages_written = true;
  ATOMIC_TAS_32 (&block->count_wb_pages, 0);
  ATOMIC_INC_64 (&block->version, 1ULL);

  /* (8) Clear the block's "write started" bit; advance next_block_to_flush. */
  /* (9) Wake any threads sleeping on this block's wait_queue. */
}
```

The nine-step sequence is the torn-page protection contract in execution: between step 5 and step 6, the DWB volume is durable on disk. If a crash occurs anywhere during step 6, the DWB has a clean copy of every page whose home write may have been torn. After step 7, both copies are durable and the DWB slots can be reused on the next block-fill cycle.
The helper daemon dwb_file_sync_helper_daemon exists to
parallelise step 7. Home volumes can be many, and fsync on a
big volume is slow; the main flush thread offloads the
expensive ones to the helper while it returns to write more
data into the next block. The handoff is via the single-pointer
atomic dwb_Global.file_sync_helper_block.
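The handoff is a single-slot mailbox. A minimal C11 sketch with invented names; the real slot is dwb_Global.file_sync_helper_block:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Single-slot mailbox handoff between the flush thread and the file-sync
   helper. Names are illustrative, not CUBRID symbols. */

typedef struct dwb_block dwb_block;
struct dwb_block { int id; };

static dwb_block *_Atomic helper_block;

/* Flush thread: publish a block; fails if the helper is still busy. */
static int offload_to_helper (dwb_block *b)
{
  dwb_block *expected = NULL;
  return atomic_compare_exchange_strong (&helper_block, &expected, b);
}

/* Helper thread: take the pending block (or NULL) and clear the slot. */
static dwb_block *helper_take (void)
{
  return atomic_exchange (&helper_block, NULL);
}

int handoff_demo (void)
{
  static dwb_block b = { 7 };
  if (!offload_to_helper (&b)) return -1;
  if (offload_to_helper (&b)) return -2;   /* slot occupied: rejected */
  dwb_block *taken = helper_take ();
  if (taken != &b) return -3;
  if (helper_take () != NULL) return -4;   /* slot now empty */
  return taken->id;
}
```

A one-pointer mailbox is enough here precisely because step 3 of dwb_flush_block waits until the slot drains before the next block can be flushed.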
State machine — one DWB slot’s lifecycle
Section titled “State machine — one DWB slot’s lifecycle”stateDiagram-v2 [*] --> FREE: server start (block created) FREE --> STAGED: dwb_set_data_on_next_slot\n(slot acquired, page memcpy'd) STAGED --> HASHED: dwb_add_page\n(VPID inserted into slots_hashmap) HASHED --> DWB_WRITTEN: dwb_flush_block step 4-5\n(write_buffer flushed + fsync DWB volume) DWB_WRITTEN --> HOME_WRITTEN: dwb_write_block\n(slot's io_page written to home volume) HOME_WRITTEN --> HOME_SYNCED: fileio_synchronize\n(home volume fsync — main or helper thread) HOME_SYNCED --> FREE: count_wb_pages reset; block->version++ HASHED --> HASHED_INVALIDATED: same VPID re-staged with newer LSA HASHED_INVALIDATED --> DWB_WRITTEN: still flushed (with NULL VPID skip in dwb_write_block)
Three properties matter. (a) Between STAGED and HASHED, a
concurrent reader cannot find the slot — it must read from the
(possibly old) home volume. (b) Between HASHED and
HOME_SYNCED, a concurrent reader gets the slot’s contents via
dwb_read_page. (c) The transition out of HOME_SYNCED is
the only point at which the slot is released for reuse by the
next block-fill cycle.
Crash recovery — dwb_load_and_recover_pages
The recovery story has one external entry: boot_sr.c calls
dwb_load_and_recover_pages after vacuum init but before
log recovery (analysis / redo / undo):
```c
// boot_sr.c:2403 — server boot, with crash recovery
oid_set_root (&boot_Db_parm->rootclass_oid);

/* Load and recover data pages before log recovery */
error_code = dwb_load_and_recover_pages (thread_p, log_path, log_prefix);
if (error_code != NO_ERROR)
  goto error;

#if defined(SERVER_MODE)
pgbuf_daemons_init ();
dwb_daemons_init ();
parallel_query::worker_manager_global::get_manager ().init ();
#endif
```

The placement is critical for ARIES correctness: by the time analysis starts walking the log, every home page on disk is either a coherent old image (the DWB had no copy and the home write never started or completed) or a coherent new image (the DWB restored it). Redo can replay LSNs onto a known state.
dwb_load_and_recover_pages body (double_write_buffer.cpp:3199)
unfolds in seven phases:
- Open the on-disk DWB volume. If it does not exist, there is nothing to do — the engine was either freshly initialized or the previous session shut down cleanly enough that the DWB was destroyed.
- Allocate one in-memory DWB_BLOCK of size num_dwb_pages — the entire DWB is loaded into a single block, regardless of how many blocks it had during normal operation. This is simpler because the recovery scan does not need to know the block boundaries; it only needs the slot images.
- Read all pages. fileio_read_pages pulls the entire DWB volume into the block’s write_buffer. For each slot, the VPID and LSA are re-derived from the page’s prv.{volid,pageid,lsa} header.
- Sort + dedup. dwb_block_create_ordered_slots sorts by (VPID, LSA). The dedup loop walks the sorted array and nulls out the older copy when the same VPID appears twice (which can happen if the crash occurred mid-flush of one block and the previous block’s slots were partially overwritten by the in-progress one).
- For each slot, dwb_check_data_page_is_sane. Read the home volume’s copy. If it is sane (passes fileio_page_check_corruption), keep it — null the slot’s VPID so the recovery write skips it. If it is corrupt and the DWB slot is sane, the home page will be replaced. If both are corrupt, that is a fatal recovery error.
- dwb_write_block to overwrite home pages. The same producer-side function that runs during normal flush is reused here, with remove_from_hash = false because there is no live hash yet.
- Dismount + unformat the DWB volume, then re-create it fresh via dwb_create. The new DWB starts empty; whatever was staged at crash time is now committed to home pages.
Recovery flow (Mermaid)
```mermaid
flowchart TB
  S0["server start\nboot_sr.c"]
  S1["log_initialize\n(reads active log header)"]
  S2["vacuum_initialize"]
  S3["dwb_load_and_recover_pages"]
  S4["pgbuf_daemons_init\ndwb_daemons_init"]
  S5["log_recovery (analysis / redo / undo)"]
  S0 --> S1 --> S2 --> S3 --> S4 --> S5

  subgraph DWB_RECOVERY["dwb_load_and_recover_pages"]
    R1["fileio_is_volume_exist?"]
    R2["fileio_mount + fileio_read_pages"]
    R3["dwb_block_create_ordered_slots\n(sort by VPID, LSA)"]
    R4["dedup duplicates\n(same VPID twice)"]
    R5["dwb_check_data_page_is_sane\n(read each home page)"]
    R6{"home corrupt &\nDWB sane?"}
    R7["mark slot for replacement"]
    R8["mark slot skipped"]
    R9["dwb_write_block\n(replace corrupt home pages)"]
    R10["fileio_synchronize home volumes"]
    R11["fileio_dismount + fileio_unformat"]
    R12["dwb_create (fresh)"]
  end

  R1 -- "yes" --> R2
  R1 -- "no" --> R12
  R2 --> R3 --> R4 --> R5
  R5 --> R6
  R6 -- "yes" --> R7
  R6 -- "no" --> R8
  R7 --> R9
  R8 --> R9
  R9 --> R10 --> R11 --> R12
```
The key invariant: by the time dwb_create returns, the DWB volume is empty and the home volumes are coherent. Redo, when it runs next, reads home pages that are each a complete, untorn image — exactly the coherent before-state ARIES redo requires.
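The per-slot repair decision in `dwb_check_data_page_is_sane` reduces to a three-way truth table. A minimal sketch, with simplified names (`decide_slot` is hypothetical; the flags stand for the corruption-check results on the home and DWB copies):

```cpp
// Recovery-time per-slot decision: keep the home page, replace it from
// the DWB copy, or abort recovery if both images are torn.
enum class SlotAction { KeepHome, ReplaceHome, FatalError };

// home_sane / dwb_sane: outcome of the corruption check
// (fileio_page_check_corruption) on each copy.
SlotAction decide_slot(bool home_sane, bool dwb_sane) {
  if (home_sane) return SlotAction::KeepHome;     // null the slot VPID; skip the write
  if (dwb_sane)  return SlotAction::ReplaceHome;  // home torn, DWB copy good
  return SlotAction::FatalError;                  // both corrupt: unrecoverable
}
```

Note the asymmetry: a sane home page wins even when the DWB copy is also sane — the DWB image is only a fallback, never preferred.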
Performance — write amplification and why it’s affordable

The DWB’s worst-case cost is 2× write amplification on data pages: every dirty page is written first to the DWB volume, then to its home volume. Three factors make this affordable in practice.
Sequential vs. random. The DWB write is sequential: a whole
block (typically 1 MiB or more) at a time, into a fixed-location
file. Modern storage handles sequential writes at near-peak
bandwidth, so the DWB write is essentially free in throughput
terms — it is dominated by fsync latency, not transfer time.
The home write, by contrast, is random — pages from many
volumes scattered across the disk. The DWB does not add a
random write; it adds a sequential write and an fsync.
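The discipline that paragraph describes can be sketched with plain POSIX calls. Paths, page size, and the single home file here are illustrative — CUBRID does all of this through its `fileio_*` wrappers, not raw syscalls:

```cpp
// Two-phase doublewrite sketch: stage the whole block sequentially into
// the DWB file and fsync it; only then scatter the pages to their home
// offsets and fsync again. A crash mid-home-write leaves a durable,
// untorn copy in the DWB file to recover from.
#include <fcntl.h>
#include <unistd.h>
#include <vector>

constexpr size_t PAGE = 4096;

struct StagedPage {
  off_t home_offset;   // where the page lives in the home file
  char  image[PAGE];   // the new page image
};

// Returns false on any I/O error.
bool doublewrite_flush(int dwb_fd, int home_fd,
                       const std::vector<StagedPage> &block) {
  // Phase 1: one sequential write of the whole block, then fsync.
  for (size_t i = 0; i < block.size(); ++i)
    if (pwrite(dwb_fd, block[i].image, PAGE, (off_t)(i * PAGE)) != (ssize_t)PAGE)
      return false;
  if (fsync(dwb_fd) != 0) return false;

  // Phase 2: random writes to the home locations, then fsync.
  for (const StagedPage &p : block)
    if (pwrite(home_fd, p.image, PAGE, p.home_offset) != (ssize_t)PAGE)
      return false;
  return fsync(home_fd) == 0;
}
```

The ordering is the whole point: the home writes must not start until the first `fsync` has made the sequential copy durable.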
Group flush via daemon. The dwb-flush-block daemon
amortises the fsync cost of a block over all the slots it
contains. A block with `num_block_pages = 64` pays one fsync
for the DWB volume plus one per home volume the block touches —
so a block landing on a single home volume covers sixty-four
page writes with two fsyncs, roughly 1/32 of a synchronous
fsync per page. That is much cheaper than
`full_page_writes = on` in PostgreSQL, which writes a full
copy of each page into the WAL on its first modification
after every checkpoint.
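That amortisation can be sanity-checked in a few lines. `fsyncs_per_page` is a hypothetical helper, and the volume counts are illustrative — the real cost depends on how many distinct home volumes a block touches:

```cpp
// Back-of-envelope amortisation: one fsync for the DWB volume plus one
// per touched home volume, spread over every page write in the block.
double fsyncs_per_page(int pages, int home_volumes) {
  return (1.0 + home_volumes) / pages;
}
```

With 64 pages per block and one home volume this is 2/64 = 1/32 fsync per page; even with four home volumes it is only 5/64.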
Concurrent home-write pipelining. While the daemon writes home pages and waits for fsync, other producers continue to fill the next block. The pipeline keeps the foreground page flushers from blocking on DWB fsync. When the helper daemon takes over fsync for big home volumes, the main flush thread is free to start the next block immediately.
The configuration knobs let an operator tune the tradeoff: a
small `PRM_ID_DWB_SIZE` reduces memory but increases flush
frequency; more `PRM_ID_DWB_BLOCKS` increases pipeline depth
but reduces the number of slots per block. Setting either
to 0 turns the DWB off entirely — useful on storage with
native atomic 8 KiB / 16 KiB writes.
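The relationship between the two knobs and the block geometry can be sketched as below. This mirrors the clamp-then-round shape of `dwb_load_buffer_size` / `dwb_load_block_count` / `dwb_power2_ceil`, but the min/max bounds used here are illustrative, not CUBRID's actual `DWB_MIN_SIZE` / `DWB_MAX_SIZE` values:

```cpp
// Parameter loading sketch: clamp each knob to its bounds, round up to a
// power of two, then derive slots-per-block by division.
#include <cstdint>

// Round up to the next power of two (returns x itself if already one).
uint64_t power2_ceil(uint64_t x) {
  if (x <= 1) return 1;
  uint64_t p = 1;
  while (p < x) p <<= 1;
  return p;
}

struct DwbGeometry {
  uint64_t buffer_bytes;     // total DWB size
  uint64_t block_count;      // number of flush units
  uint64_t pages_per_block;  // slots per block
};

DwbGeometry load_geometry(uint64_t size_bytes, uint64_t blocks,
                          uint64_t io_pagesize) {
  // Illustrative bounds; the source defines DWB_MIN/MAX_SIZE and
  // DWB_MIN/MAX_BLOCKS (block count is capped at 32 by the bitmask).
  const uint64_t MIN_SIZE = 512 * 1024, MAX_SIZE = 32 * 1024 * 1024;
  const uint64_t MIN_BLOCKS = 1, MAX_BLOCKS = 32;

  if (size_bytes < MIN_SIZE) size_bytes = MIN_SIZE;
  if (size_bytes > MAX_SIZE) size_bytes = MAX_SIZE;
  size_bytes = power2_ceil(size_bytes);

  if (blocks < MIN_BLOCKS) blocks = MIN_BLOCKS;
  if (blocks > MAX_BLOCKS) blocks = MAX_BLOCKS;
  blocks = power2_ceil(blocks);

  return {size_bytes, blocks, size_bytes / blocks / io_pagesize};
}
```

For example, a 2 MiB buffer split into 2 blocks at a 16 KiB `IO_PAGESIZE` yields 64 slots per block — the amortisation unit discussed above.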
Source Walkthrough

Anchor on symbol names, not line numbers.
Public API (src/storage/double_write_buffer.hpp)

- `DWB_SLOT` — slot struct visible to producers; carries `io_page`, `vpid`, `lsa`, `position_in_block`, `block_no`.
- `dwb_is_created` — has the DWB been initialised?
- `dwb_create` — create the DWB volume and the in-memory structure; called from `boot_sr.c` at first volume creation.
- `dwb_recreate` — destroy + create with current parameters (used when `PRM_ID_DWB_SIZE` / `PRM_ID_DWB_BLOCKS` change).
- `dwb_load_and_recover_pages` — the crash-recovery entry point.
- `dwb_destroy` — finalise.
- `dwb_get_volume_name` — observability.
- `dwb_flush_force` — drain everything pending; called from `fileio_synchronize_all`.
- `dwb_read_page` — slot-hash lookup for the page-buffer fix path.
- `dwb_set_data_on_next_slot` — producer-side acquire + stage.
- `dwb_add_page` — producer-side commit (insert into hash, bump block fill count, request flush if block full).
- `dwb_synchronize` — hook for `fileio_synchronize` to push pending DWB content before flushing a single volume.
- `dwb_daemons_init` / `dwb_daemons_destroy` — server-mode daemon lifecycle.
Internal types (src/storage/double_write_buffer.cpp)

- `DWB_WAIT_QUEUE_ENTRY` / `DWB_WAIT_QUEUE` — singly-linked wait queue, per block and global; threads parked here wake on block-flush completion or structure-modification end.
- `FLUSH_VOLUME_INFO` — per-block, per-volume bookkeeping for step-7 fsync (descriptor, page count, status flags).
- `DWB_BLOCK` — the fsync unit (`write_buffer`, `slots`, `count_wb_pages`, `version`, wait queue).
- `DWB_SLOTS_HASH_ENTRY` — VPID-keyed hash entry pointing at a `DWB_SLOT`.
- `DOUBLE_WRITE_BUFFER` (singleton `dwb_Global`) — global state.
Internal functions

- `dwb_init_wait_queue` / `dwb_block_add_wait_queue_entry` / `dwb_block_disconnect_wait_queue_entry` / `dwb_block_free_wait_queue_entry` / `dwb_remove_wait_queue_entry` / `dwb_signal_waiting_threads` / `dwb_destroy_wait_queue` — wait-queue primitives.
- `dwb_signal_waiting_thread` — signals one parked thread (sets `THREAD_DWB_QUEUE_RESUMED`).
- `dwb_set_status_resumed` — flips a parked thread back to resumed after timeout cleanup.
- `dwb_wait_for_block_completion` — park on a block’s wait queue; 20 ms timeout.
- `dwb_wait_for_strucure_modification` (sic) — park on the global wait queue during DWB resize/destroy/create; 10 ms timeout.
- `dwb_signal_block_completion` / `dwb_signal_structure_modificated` — wake-all on those queues.
- `dwb_starts_structure_modification` / `dwb_ends_structure_modification` — set/clear the `MODIFY_STRUCTURE` flag in `position_with_flags`; before setting, the function CAS-loops, then waits for in-flight flushers to drain.
- `dwb_load_buffer_size` / `dwb_load_block_count` / `dwb_power2_ceil` — parameter parsing: clamp to bounds, round up to a power of two.
- `dwb_initialize_slot` / `dwb_initialize_block` / `dwb_create_blocks` / `dwb_finalize_block` — memory-layout helpers.
- `dwb_create_internal` / `dwb_destroy_internal` — create/destroy under the structure-modification flag.
- `dwb_acquire_next_slot` — the lock-free producer entry.
- `dwb_set_slot_data` — `memcpy` page bytes + capture VPID and LSA into the slot.
- `dwb_init_slot` — reset slot fields (used during sentinel insertion in the ordered-slot snapshot).
- `dwb_block_create_ordered_slots` — `qsort` snapshot by VPID then LSA, with sentinel.
- `dwb_compare_slots` — comparator for the sort.
- `dwb_compare_vol_fd` — comparator over volume descriptors.
- `dwb_add_volume_to_block_flush_area` — bookkeeping for step-7 fsync targets.
- `dwb_get_next_block_for_flush` — pick the next full block (used by the daemon).
- `dwb_flush_next_block` — daemon body.
- `dwb_flush_block` — the eight-step durability dance.
- `dwb_write_block` — step 6, write slots to home volumes.
- `dwb_file_sync_helper` — helper-daemon body for offloaded step-7 fsyncs.
- `dwb_slots_hash_entry_alloc` / `_free` / `_init` / `_key_copy` / `_compare_key` / `_key` — hash callbacks.
- `dwb_slots_hash_insert` / `dwb_slots_hash_delete` — hash ops.
- `dwb_check_data_page_is_sane` — recovery-time per-slot decision (replace home or skip).
- `dwb_debug_check_dwb` — debug-only duplicate detector for the recovery-time block.
- `dwb_is_flush_block_daemon_available` / `dwb_is_file_sync_helper_daemon_available` / `dwb_flush_block_daemon_is_running` / `dwb_file_sync_helper_daemon_is_running` — daemon visibility helpers; gate on `PRM_ID_ENABLE_DWB_FLUSH_THREAD`.
- `dwb_flush_block_daemon_init` / `dwb_file_sync_helper_daemon_init` — daemon spin-up; loopers tick at 1 ms / 10 ms.
- `class dwb_flush_block_daemon_task` — task class for the flush daemon.
- `dwb_file_sync_helper_execute` — task entry for the helper daemon (registered as `cubthread::entry_callable_task`).
Cross-references

- `pgbuf_bcb_safe_flush_internal_release_mutex` (in `page_buffer.c`, ≈ line 10468) — primary producer.
- `pgbuf_*_safe_flush_internal_release_mutex` — fix path that consults `dwb_read_page` (`page_buffer.c:8239`).
- `fileio_write_or_add_to_dwb` (in `file_io.c`, line 4014) — secondary producer for direct-write paths.
- `fileio_synchronize_volume_and_dwb` (in `file_io.c`, lines 2844, 2912, 3126, 3332 — multiple call sites) — calls `dwb_synchronize` to drain pending DWB content before per-volume fsync.
- `fileio_synchronize_all` (in `file_io.c`, line 4642) — calls `dwb_flush_force` before sweeping volume fsyncs.
- `boot_restart_server` (in `boot_sr.c`, line 2403) — calls `dwb_load_and_recover_pages` between vacuum init and `pgbuf_daemons_init`.
- `boot_create_database` (in `boot_sr.c`, line 4908) — calls `dwb_create` after database parameter setup, before formatting the first volume.
- `dwb_initialize_pool` is used in disable-flush paths in `file_io.c:1882` (sets `skip_flush = dwb_is_created ()` — so per-page fsyncs during volume creation are skipped when the DWB will handle durability later).
Position hints as of 2026-04-30

| Symbol | File | Line |
|---|---|---|
DWB_SLOT (struct) | double_write_buffer.hpp | 33 |
dwb_is_created (declaration) | double_write_buffer.hpp | 44 |
dwb_create (declaration) | double_write_buffer.hpp | 45 |
dwb_load_and_recover_pages (declaration) | double_write_buffer.hpp | 47 |
dwb_destroy (declaration) | double_write_buffer.hpp | 48 |
dwb_flush_force (declaration) | double_write_buffer.hpp | 50 |
dwb_read_page (declaration) | double_write_buffer.hpp | 51 |
dwb_set_data_on_next_slot (declaration) | double_write_buffer.hpp | 52 |
dwb_add_page (declaration) | double_write_buffer.hpp | 54 |
dwb_synchronize (declaration) | double_write_buffer.hpp | 57 |
dwb_daemons_init (declaration) | double_write_buffer.hpp | 60 |
DWB_MIN_SIZE / DWB_MAX_SIZE | double_write_buffer.cpp | 51-52 |
DWB_MIN_BLOCKS / DWB_MAX_BLOCKS | double_write_buffer.cpp | 53-54 |
DWB_POSITION_MASK | double_write_buffer.cpp | 70 |
DWB_BLOCKS_STATUS_MASK | double_write_buffer.cpp | 73 |
DWB_MODIFY_STRUCTURE flag | double_write_buffer.cpp | 76 |
DWB_CREATE flag | double_write_buffer.cpp | 79 |
struct double_write_wait_queue_entry | double_write_buffer.cpp | 176 |
struct double_write_wait_queue | double_write_buffer.cpp | 184 |
enum FLUSH_VOLUME_STATUS | double_write_buffer.cpp | 196 |
struct flush_volume_info | double_write_buffer.cpp | 204 |
struct double_write_block | double_write_buffer.cpp | 215 |
struct dwb_slots_hash_entry | double_write_buffer.cpp | 235 |
struct double_write_buffer | double_write_buffer.cpp | 261 |
dwb_Global (singleton) | double_write_buffer.cpp | 306 |
slots_entry_Descriptor | double_write_buffer.cpp | 421 |
dwb_init_wait_queue | double_write_buffer.cpp | 451 |
dwb_signal_waiting_threads | double_write_buffer.cpp | 663 |
dwb_power2_ceil | double_write_buffer.cpp | 732 |
dwb_load_buffer_size | double_write_buffer.cpp | 769 |
dwb_load_block_count | double_write_buffer.cpp | 795 |
dwb_starts_structure_modification | double_write_buffer.cpp | 822 |
dwb_ends_structure_modification | double_write_buffer.cpp | 924 |
dwb_initialize_slot | double_write_buffer.cpp | 949 |
dwb_initialize_block | double_write_buffer.cpp | 977 |
dwb_create_blocks | double_write_buffer.cpp | 1009 |
dwb_finalize_block | double_write_buffer.cpp | 1131 |
dwb_create_internal | double_write_buffer.cpp | 1163 |
dwb_slots_hash_insert | double_write_buffer.cpp | 1380 |
dwb_destroy_internal | double_write_buffer.cpp | 1474 |
dwb_set_status_resumed | double_write_buffer.cpp | 1521 |
dwb_wait_for_block_completion | double_write_buffer.cpp | 1552 |
dwb_signal_waiting_thread | double_write_buffer.cpp | 1643 |
dwb_wait_for_strucure_modification | double_write_buffer.cpp | 1704 |
dwb_compare_slots | double_write_buffer.cpp | 1781 |
dwb_block_create_ordered_slots | double_write_buffer.cpp | 1845 |
dwb_slots_hash_delete | double_write_buffer.cpp | 1883 |
dwb_add_volume_to_block_flush_area | double_write_buffer.cpp | 1960 |
dwb_write_block | double_write_buffer.cpp | 2007 |
dwb_flush_block | double_write_buffer.cpp | 2192 |
dwb_acquire_next_slot | double_write_buffer.cpp | 2468 |
dwb_set_slot_data | double_write_buffer.cpp | 2612 |
dwb_init_slot | double_write_buffer.cpp | 2642 |
dwb_get_next_block_for_flush | double_write_buffer.cpp | 2659 |
dwb_set_data_on_next_slot | double_write_buffer.cpp | 2686 |
dwb_add_page | double_write_buffer.cpp | 2726 |
dwb_synchronize | double_write_buffer.cpp | 2841 |
dwb_is_created | double_write_buffer.cpp | 2909 |
dwb_create | double_write_buffer.cpp | 2925 |
dwb_recreate | double_write_buffer.cpp | 2967 |
dwb_debug_check_dwb | double_write_buffer.cpp | 3013 |
dwb_check_data_page_is_sane | double_write_buffer.cpp | 3091 |
dwb_load_and_recover_pages | double_write_buffer.cpp | 3199 |
dwb_destroy | double_write_buffer.cpp | 3403 |
dwb_get_volume_name | double_write_buffer.cpp | 3440 |
dwb_flush_next_block | double_write_buffer.cpp | 3459 |
dwb_flush_force | double_write_buffer.cpp | 3514 |
dwb_file_sync_helper | double_write_buffer.cpp | 3766 |
dwb_read_page | double_write_buffer.cpp | 3969 |
class dwb_flush_block_daemon_task | double_write_buffer.cpp | 4013 |
dwb_file_sync_helper_execute | double_write_buffer.cpp | 4053 |
dwb_flush_block_daemon_init | double_write_buffer.cpp | 4073 |
dwb_file_sync_helper_daemon_init | double_write_buffer.cpp | 4087 |
dwb_daemons_init | double_write_buffer.cpp | 4099 |
dwb_daemons_destroy | double_write_buffer.cpp | 4109 |
dwb_read_page use | page_buffer.c | 8239 |
pgbuf flush / dwb_set_data_on_next_slot | page_buffer.c | 10548 |
pgbuf flush / dwb_add_page | page_buffer.c | 10597 |
fileio_write_or_add_to_dwb | file_io.c | 4014 |
fileio_synchronize_all → dwb_flush_force | file_io.c | 4642 |
fileio_make_dwb_name | file_io.c | 5882 |
boot_restart_server → dwb_load_and_recover_pages | boot_sr.c | 2403 |
boot_create_database → dwb_create | boot_sr.c | 4908 |
Cross-check Notes

The DWB has no extant raw analyses in raw/; this is a source-only walkthrough. The cross-check below names invariants read off the live source as of 2026-04-30 and what the related page-buffer / log-manager / recovery docs (which all reference the DWB) say or imply about it.
- The DWB is consulted before disk reads on page-table miss. `cubrid-page-buffer-manager.md` §“Double Write Buffer (DWB) — torn-page protection” sketches this. Verified at `page_buffer.c:8239` — `dwb_read_page` is the first call in the on-miss read sequence; `fileio_read` is the fallback.
- Producer-side staging is split: stage → WAL force → commit. `cubrid-page-buffer-manager.md` does not call this out, but the source path in `page_buffer.c:10548-10597` shows the WAL force (`logpb_flush_log_for_wal`) sits between `dwb_set_data_on_next_slot` and `dwb_add_page`. The slot is acquired before WAL is forced, but the slot is not visible to readers (not in the hash, not yet flushed) until after WAL has been forced and `dwb_add_page` has committed.
- DWB recovery happens before analysis / redo / undo. `cubrid-recovery-manager.md` describes the three-pass restart but does not say what runs before analysis. Verified: `boot_restart_server` at `boot_sr.c:2403` calls `dwb_load_and_recover_pages`, then `pgbuf_daemons_init` + `dwb_daemons_init`, then `log_recovery`. The redo pass reads home pages whose corruption has already been repaired.
- DWB write does not bypass WAL. Even though the DWB volume is durable on its own and could in principle be replayed without the redo log, CUBRID does not use it that way. `dwb_flush_block` does not consult the log; it simply writes pages whose underlying log records are already on disk (the WAL invariant was honored by the producer). Recovery uses the DWB to reconstruct the home pages, then redo replays log records onto the now-coherent home pages. The DWB is a pre-redo cleanup, not a redo substitute.
- Per-block flush ordering is enforced. `dwb_flush_block` asserts at the top: `(DWB_GET_PREV_BLOCK (flush_block->block_no)->version > flush_block->version) || ...` meaning the previous block must have been flushed (its version is higher) before the current one starts. The block-write-started bits in `position_with_flags` are the synchronization primitive that supports this — the daemon picks `next_block_to_flush` and only that block is eligible.
- The slot hash is consulted during recovery only via the ordered-slots snapshot, not the live hash. During `dwb_load_and_recover_pages`, the recovery block is built fresh; the production `slots_hashmap` does not exist yet (the recovery code initializes it via `dwb_create` only at the end of the recovery scan). This is why `dwb_block_create_ordered_slots` runs over the whole recovery block — there is no concurrent producer to fight, and a `qsort` over a few thousand slots is cheap.
- `FILEIO_WRITE_NO_COMPENSATE_WRITE` is the DWB’s seal on `fileio_write`. When the DWB is active, all pgbuf and file_io writes use this mode, which tells `fileio_write` not to perform per-page write-retry on partial writes — that retry would introduce its own torn-page risk. Without the DWB, `FILEIO_WRITE_DEFAULT_WRITE` is used, which does the retry. This is the load-bearing reason the DWB cannot be partially activated: either every data write skips `fileio_write`’s retry (because the DWB will recover) or none of them does.
- The two daemons have different cadences. The `dwb-flush-block` daemon ticks every 1 ms; the `dwb-file-sync` helper ticks every 10 ms. The 1 ms cadence on the flusher matches its job (drain blocks as fast as possible); the 10 ms cadence on the helper is acceptable because its work (fsync of large home volumes) is bounded by fsync latency anyway. Both daemons honor `PRM_ID_ENABLE_DWB_FLUSH_THREAD` — disabling that parameter reverts to inline flush by the producer that fills the last slot in a block.
- The DWB volume id is reserved at `LOG_DBDWB_VOLID` (`log_volids.hpp`). It is not part of the disk manager’s permanent-volume range and is not visible in `db_volumes_view`. This isolates the DWB from disk-allocation paths that walk permanent volumes.
Open Questions

- TDE-encrypted pages in the DWB. pgbuf flush encrypts the page (`tde_encrypt_data_page`) into a stack buffer before calling `dwb_set_data_on_next_slot`. The slot therefore holds encrypted bytes. Recovery reads the same encrypted bytes from the DWB volume and writes them to the home location; decryption happens later, on the next `pgbuf_fix` of the page. Question: what happens if the TDE master key has been rotated between the crash and the restart? Is the slot’s IV / key generation tracked? The `DWB_SLOT` struct does not carry TDE bookkeeping fields, suggesting either (a) the IV is encoded in the page header itself (`FILEIO_PAGE::prv`) and survives the round-trip, or (b) key rotation requires a clean shutdown (DWB drained) first. Investigation: read `tde_encrypt_data_page` and `tde_decrypt_data_page`; check whether the page header carries the IV.
- DWB and parallel redo. `cubrid-recovery-manager.md` describes per-VPID parallel redo via `log_recovery_redo_parallel.{cpp,hpp}`. The DWB recovery runs before redo, single-threaded (the recovery block walk is sequential). But the redo phase, once started, could in principle race with daemon-driven DWB activity — if the page buffer flushes during redo, those flushes go through the DWB. Since `pgbuf_daemons_init` and `dwb_daemons_init` are both called between `dwb_load_and_recover_pages` and `log_recovery`, the flushers are live during redo. Question: are dirty pages produced by redo subject to the same DWB discipline? Yes — `pgbuf_set_lsa` + `pgbuf_set_dirty` in the redo path eventually triggers a flush, which goes through the DWB. Implication: a crash during redo is recoverable the same way as a crash during normal operation — the DWB on the second restart contains pages that were re-applied by the first restart’s interrupted redo.
- Block-write-started bit reuse. With `DWB_MAX_BLOCKS = 32`, the block-status bitmask occupies bits 32-63 of `position_with_flags`. If a future change wants more blocks, the layout must be redesigned — `dwb_starts_structure_modification` already enforces “only one structure modification at a time” but does not address the layout cap. Question: is 32 the design ceiling forever, or is there a roadmap to widen it (e.g., a separate block-status word)? The `assert (block_no < DWB_MAX_BLOCKS)` appears in every block-bit macro; relaxing it would require auditing every macro.
- Same-LSA, different-block hash collision. The `dwb_slots_hash_insert` LSA_EQ arm has a debug-only assertion that, when same-VPID, same-LSA slots land in different blocks, the older block has a strictly smaller version. This guards against re-insert ordering bugs, but the production path silently accepts the case. Question: is the production path guaranteed to flush the older block first (so the newer slot eventually wins the home write), or could a slow flush of the older block leave the home page at the older image even after the newer staging? The block-flush ordering enforced by `next_block_to_flush` should prevent this, but the interaction with `dwb_flush_force`’s ad-hoc block selection needs tracing.
- `dwb_synchronize` semantics. The function name suggests it synchronizes the DWB with a single volume, but the body (`error = dwb_flush_force (thread_p, &complete)`) actually drains all pending DWB content. Is this intentional — the cheapest safe way to flush “everything that might touch this volume” — or a leftover from an earlier design where `dwb_synchronize` was per-volume? Investigation: git blame the function.
- `fileio_fsync_pending` short-circuit. `dwb_synchronize` bails early if `fileio_fsync_pending ()` is true. Question: under what circumstances is that flag set, and does the short-circuit risk leaving DWB content unsynchronised at shutdown? The risk is low because shutdown forces a drain anyway, but the comment says nothing.
- Disabling DWB at runtime. The header comments (“Activating/deactivating DWB while the server is alive, needs additional work”) suggest the `DWB_NOT_CREATED_OR_MODIFYING` paths are not fully reliable. In practice, operators set `PRM_ID_DWB_SIZE = 0` and restart. Question: is there a roadmap to support live DWB enable/disable, and what would it take? The producer paths in pgbuf and file_io would need to handle a transition state, which they currently don’t (they sample `dwb_is_created` once at the top).
- Interaction with copy-database (migrate) tools. `dwb_synchronize` is called in volume-copy paths (`file_io.c:2844, 2912, 3126, 3332`) — does it need to be called by external tools (e.g., `cubrid copydb`)? The user-facing position is that running tools against a live server is forbidden, so external tools should not encounter a live DWB. But `dwb_synchronize` is structured as if defensive against being invoked from anywhere; this might be an artifact of an older design.
Sources

This document is source-only: there are no curated raw analyses for the DWB in raw/code-analysis/cubrid/storage/. The references below are CUBRID source paths and sibling docs in this knowledge base.

CUBRID source (/data/hgryoo/references/cubrid/)

- `src/storage/double_write_buffer.hpp` — public API.
- `src/storage/double_write_buffer.cpp` — implementation.
- `src/storage/page_buffer.c` — primary producer (flush path) and primary reader (fix path).
- `src/storage/file_io.c` — secondary producer (`fileio_write_or_add_to_dwb`), volume name builder (`fileio_make_dwb_name`), and synchronisation call sites.
- `src/transaction/boot_sr.c` — boot-time wiring (`dwb_load_and_recover_pages` before redo, `dwb_create` on database init).
- `src/transaction/log_volids.hpp` — `LOG_DBDWB_VOLID` reservation.
Sibling docs in this knowledge base

- `knowledge/code-analysis/cubrid/cubrid-page-buffer-manager.md` — page buffer’s flush path and DWB consultation on miss.
- `knowledge/code-analysis/cubrid/cubrid-log-manager.md` — WAL invariant the DWB producer side honors.
- `knowledge/code-analysis/cubrid/cubrid-recovery-manager.md` — three-pass restart whose redo runs after DWB repair.
Textbook chapters (under knowledge/research/dbms-general/)

- Database Internals (Petrov), Ch. 5 §“Recovery”, §“Torn Pages” — torn-page problem and three escape hatches.
- Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging (TODS 17.1, 1992) — WAL invariant the DWB pairs with.
Comparative engines (referenced in the design space)

- MySQL InnoDB doublewrite buffer (dblwr) — the canonical DWB design, the one CUBRID’s most resembles.
- MariaDB / Percona Server parallel doublewrite — a multi-block parallel variant.
- PostgreSQL `full_page_writes = on` — the alternative full-page-WAL approach.
- SQL Server torn-page detection bits — a lighter-weight detect-only design that requires a media restore on failure rather than DWB-driven repair.