
CUBRID Double Write Buffer — Torn-Page Protection Between Page Buffer and Data Files


The Double Write Buffer (DWB) exists to defeat one specific failure mode that the WAL protocol on its own cannot recover from: the torn page. A torn page is a half-old / half-new on-disk image of a data page produced when a crash occurs in the middle of a multi-sector write. Because every disk-resident database page is larger than the unit the operating system / hardware can update atomically, a crash mid-write can leave the home location holding a Frankenstein page whose first half corresponds to the new image and whose second half corresponds to the old image (or vice versa, depending on how the OS schedules the sector writes).

The asymmetry between DBMS page size and filesystem / device atomic write size is the root cause. CUBRID writes data pages at IO_PAGESIZE (4–16 KiB depending on configuration); modern Linux filesystems guarantee atomic writes only at the hardware sector boundary (typically 512 B or 4 KiB). The expression “atomic write” itself is overloaded here — Database Internals (Petrov, ch. 5 §“Recovery”) splits it into two concerns:

  1. Atomic-on-success. If write(2) succeeds and the OS later crashes, the bytes that made it to the platter are an arbitrary subset of the buffer. Even with O_DIRECT, the device may complete only some of the sectors before the power loss.
  2. Atomic-on-failure. Some filesystems (ZFS, Btrfs, XFS with reflinks) commit each page write atomically by writing a new block and switching pointers. On those filesystems torn pages are impossible at the filesystem level — but the DBMS cannot assume it is running on such a filesystem.
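
To make the failure mode concrete, here is a minimal simulation (illustrative C, not CUBRID code; the sizes and the FNV checksum are stand-ins): a 16 KiB page written as four 4 KiB sectors, crashing after two. The surviving image matches neither the old nor the new checksum, which is precisely the state redo cannot consume.

// torn-write simulation — illustrative only, not CUBRID source
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE   16384
#define SECTOR_SIZE 4096

/* Toy whole-page checksum (FNV-1a); real engines embed a CRC in the page header. */
static uint32_t
page_checksum (const uint8_t *page)
{
  uint32_t h = 2166136261u;
  for (size_t i = 0; i < PAGE_SIZE; i++)
    {
      h = (h ^ page[i]) * 16777619u;
    }
  return h;
}

int
main (void)
{
  uint8_t old_img[PAGE_SIZE], new_img[PAGE_SIZE], disk[PAGE_SIZE];
  memset (old_img, 0xAA, PAGE_SIZE);
  memset (new_img, 0xBB, PAGE_SIZE);

  memcpy (disk, old_img, PAGE_SIZE);          /* home location holds state S   */
  memcpy (disk, new_img, 2 * SECTOR_SIZE);    /* crash: only 2 of 4 sectors hit */

  uint32_t c = page_checksum (disk);
  printf ("coherent old: %d\n", c == page_checksum (old_img));  /* prints 0 */
  printf ("coherent new: %d\n", c == page_checksum (new_img));  /* prints 0 */
  /* Neither S nor S': no defined input for ARIES redo. */
  return 0;
}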

The reason WAL alone does not suffice is that ARIES redo (Mohan et al., TODS 17.1, 1992) needs a coherent before-state to replay the redo image onto. The redo log carries enough information to take page state S and produce state S', but if the on-disk page is neither S nor S' — neither a coherent old nor coherent new — there is no defined input to redo. Database Internals §“Torn Pages” frames the three escape hatches every disk-resident DBMS chooses among:

  1. Full-page WAL — the first time a page is dirtied after a checkpoint, the entire page image is logged. PostgreSQL’s full_page_writes = on does this. Cost: log-write amplification (a 16 KiB page becomes a 16 KiB log record).
  2. Atomic 8 KiB / 16 KiB writes from the device or filesystem — relies on the underlying stack guaranteeing atomicity at the page granularity. Linux 6.x has experimental support; ZFS has it natively; most enterprise SAN arrays expose it. Cost: non-portable; the DBMS cannot ship the guarantee, only consume it.
  3. Double-write buffer — every dirty page is written first to a sequential staging area (the “double write” file), the staging area is fsync’d, and only then is the page written to its home location. On crash recovery, any home page that fails its checksum is restored from the staging area. Cost: each data page is written twice and fsync’d twice. Used by MySQL InnoDB, MariaDB, Percona Server, and CUBRID.

CUBRID picks option 3, which is also the InnoDB default. The remainder of this document is the slow-zoom into how it is built in src/storage/double_write_buffer.{hpp,cpp} and how it stitches into the page buffer’s flush path, the file_io subsystem, and the boot/recovery driver.

Every torn-page-defending engine that picks the double-write buffer path arranges roughly the same components. The names differ (InnoDB has Doublewrite_log / dblwr; MariaDB has buf_dblwr; Percona has the parallel doublewrite; CUBRID has dwb_Global); the shape is shared. This section names the shared engineering vocabulary; the “CUBRID’s Approach” material that follows is the slow zoom into each row.

Slot-array layout — sequential, fixed-size, contiguous on disk

Section titled “Slot-array layout — sequential, fixed-size, contiguous on disk”

The DWB is a fixed-size on-disk file (or pre-allocated extent of the system tablespace) divided into slots. Each slot holds exactly one DBMS page. Writes to slots are sequential — a position counter wraps around — so the underlying disk sees a streaming workload. InnoDB’s classical layout is 2 × 1 MiB blocks × 64 slots × 16 KiB per page; each flush to the doublewrite area is a single sequential 1 MiB write followed by one fsync.

Slots are grouped into blocks. A block is the unit of “fsync the DWB before writing home pages”. The whole block is written, then fsync’d, then its slots are written to their home volumes one at a time. Block-level batching matters for two reasons. First, it amortises the fsync cost over many pages. Second, it lets the DWB writer thread know when a block has been fully filled and is therefore safe to flush — once a block is full, no more producers can be writing into it.
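
A back-of-envelope sketch of that geometry (illustrative values; the variable names echo the CUBRID symbols introduced later, but this is not engine code):

// DWB geometry arithmetic — illustrative sketch, not CUBRID source
#include <stdio.h>

int
main (void)
{
  const unsigned dwb_size        = 2u * 1024 * 1024;           /* 2 MiB volume   */
  const unsigned io_pagesize     = 16u * 1024;                 /* one DBMS page  */
  const unsigned num_blocks      = 2;                          /* fsync units    */
  const unsigned total_pages     = dwb_size / io_pagesize;     /* 128 slots      */
  const unsigned num_block_pages = total_pages / num_blocks;   /* 64 slots/block */

  /* Where slot 5 of block 1 lives inside the DWB volume: */
  unsigned block_no = 1, slot_no = 5;
  unsigned byte_offset = (block_no * num_block_pages + slot_no) * io_pagesize;
  printf ("block %u, slot %u -> byte offset %u\n", block_no, slot_no, byte_offset);
  return 0;
}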

Position-with-flags — a single 64-bit atomic word

Section titled “Position-with-flags — a single 64-bit atomic word”

Concurrent producers (page-buffer flushers, foreground writers) need a lock-free way to claim the next free slot. The standard technique is a 64-bit atomic word that packs:

  • the current slot position (low bits),
  • a per-block “write started” bitmask (high bits, one bit per block), and
  • a couple of lifecycle flags (CREATED, MODIFY_STRUCTURE).

CAS-loops on this word coordinate slot acquisition, block-bit flipping at block boundaries, and structure modifications (create / destroy / resize).
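
A generic sketch of the technique, using C11 atomics (the field widths mirror the CUBRID layout shown later; the code itself is illustrative, not the engine's):

// packed position-with-flags word — generic illustrative sketch
#include <stdatomic.h>
#include <stdint.h>

#define POS_MASK     0x000000003fffffffULL   /* low 30 bits: slot position   */
#define FLAG_CREATED 0x0000000040000000ULL   /* lifecycle flag               */
#define BLOCK_BIT(b) (1ULL << (63 - (b)))    /* "write started", per block   */

static _Atomic uint64_t position_with_flags;

/* Claim the next slot position; wraps at total_pages, preserves flag bits. */
static uint64_t
claim_slot (uint64_t total_pages)
{
  uint64_t cur, next;
  do
    {
      cur = atomic_load (&position_with_flags);
      uint64_t pos = cur & POS_MASK;
      next = (pos == total_pages - 1) ? (cur & ~POS_MASK) : (cur + 1);
    }
  while (!atomic_compare_exchange_weak (&position_with_flags, &cur, next));
  return cur & POS_MASK;   /* the position this thread now owns */
}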

A page-buffer flusher about to write a dirty page to its home volume does, in order:

  1. Acquire DWB slot. Atomic CAS-bump the position counter, land on a slot.
  2. Copy page bytes into the slot. The slot’s io_page pointer is the destination memory; the page is memcpy’d in.
  3. Insert into the DWB hash. The hash is keyed on VPID = (volid, pageid), so a concurrent reader that is about to fix the same page can find a fresh in-memory copy in the DWB rather than re-read from a possibly-torn home volume.
  4. If block fills, request flush. Bump the per-block page count. When it reaches BLOCK_NUM_PAGES, wake the dwb-flush-block daemon (or do it inline if running standalone). The daemon writes the entire block to the DWB volume sequentially, fsyncs it, then writes each slot’s contents to its home volume.

On restart, before any redo, the engine opens the existing DWB volume from disk. It reads every slot’s page image and checks, for each (volid, pageid) named by the slot:

  • Read the home page from its volume.
  • Compare a checksum / sanity-check the home page.
  • If the home page is corrupt and the DWB slot’s image is sane, overwrite the home page with the DWB slot.

After this DWB-driven repair pass, the DWB volume is unceremoniously deleted and recreated with the parameters named in the current configuration. The redo pass then runs against home pages that are now guaranteed to be coherent (either pre- or post-image; never torn).

| Theoretical concept | CUBRID name |
| --- | --- |
| DWB volume on disk | dwb_Volume_name (built by fileio_make_dwb_name) |
| Slot — one DBMS page-sized cell | DWB_SLOT (double_write_buffer.hpp) |
| Block — fsync unit, group of slots | DWB_BLOCK (double_write_buffer.cpp) |
| Atomic position-with-flags word | dwb_Global.position_with_flags (UINT64) |
| Slot-VPID hash for in-memory lookup | dwb_Global.slots_hashmap (lockfree hashmap) |
| Producer-side acquire | dwb_acquire_next_slot / dwb_set_data_on_next_slot |
| Producer-side stage + insert | dwb_add_page / dwb_slots_hash_insert |
| Block writer (DWB write + fsync + home write) | dwb_flush_block / dwb_write_block |
| Background flush daemon | dwb_flush_block_daemon (1 ms tick) |
| File-sync helper daemon | dwb_file_sync_helper_daemon (10 ms tick) |
| Reader-side hit (concurrent fix) | dwb_read_page (called from pgbuf_fix page-load path) |
| Crash recovery scan | dwb_load_and_recover_pages (called by boot_sr.c) |
| Per-page corruption gate | dwb_check_data_page_is_sane + fileio_page_check_corruption |
| Force-flush of pending DWB pages | dwb_flush_force (called by fileio_synchronize_all) |

The DWB has six moving parts: the on-disk volume that holds the staged copies, the in-memory blocks/slots that mirror the volume layout one-to-one, the position-with-flags atomic that coordinates concurrent producers, the slot hash that turns the DWB into a cache for concurrent readers, the flush machinery that drives a filled block to disk, and the recovery scan that uses the volume after a crash. We walk them in that order.

Layout — fixed-size sequential volume and contiguous blocks

Section titled “Layout — fixed-size sequential volume and contiguous blocks”

The DWB is a single permanent volume created at database init, named by fileio_make_dwb_name (file_io.c:5882):

// fileio_make_dwb_name — src/storage/file_io.c
void
fileio_make_dwb_name (char *dwb_name_p, const char *dwb_path_p, const char *db_name_p)
{
  sprintf (dwb_name_p, "%s%s%s%s", dwb_path_p, FILEIO_PATH_SEPARATOR (dwb_path_p),
           db_name_p, FILEIO_SUFFIX_DWB);
}

The volume sits next to the active log; the suffix is FILEIO_SUFFIX_DWB. Its size is governed by two parameters:

| Parameter | Bound | CUBRID symbol |
| --- | --- | --- |
| Total buffer size | 512 KiB ≤ size ≤ 32 MiB (power of 2) | PRM_ID_DWB_SIZE |
| Number of blocks | 1 ≤ blocks ≤ 32 (power of 2) | PRM_ID_DWB_BLOCKS |
| Per-block pages | derived: total_pages / num_blocks | dwb_Global.num_block_pages |

Both knobs are loaded by dwb_load_buffer_size / dwb_load_block_count, each clamped to its range and rounded up to a power of two via dwb_power2_ceil. Setting either to zero disables the DWB entirely.
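
In sketch form, the loader logic looks like this (the standard bit-twiddling power-of-two ceiling; dwb_power2_ceil's actual body is not reproduced here):

// clamp + power-of-two rounding — illustrative sketch of the loader logic
#include <stdint.h>
#include <stdio.h>

static uint64_t
power2_ceil (uint64_t x)
{
  if (x <= 1)
    return 1;
  x--;
  x |= x >> 1;  x |= x >> 2;  x |= x >> 4;
  x |= x >> 8;  x |= x >> 16; x |= x >> 32;
  return x + 1;
}

static uint64_t
load_buffer_size (uint64_t requested)
{
  const uint64_t min_size = 512ULL * 1024;         /* DWB_MIN_SIZE */
  const uint64_t max_size = 32ULL * 1024 * 1024;   /* DWB_MAX_SIZE */

  if (requested == 0)
    return 0;                                      /* DWB disabled */
  if (requested < min_size)
    requested = min_size;
  if (requested > max_size)
    requested = max_size;
  return power2_ceil (requested);
}

int
main (void)
{
  /* 3 MiB requested -> clamped in range, rounded up to 4 MiB */
  printf ("%llu\n", (unsigned long long) load_buffer_size (3 * 1024 * 1024));
  return 0;
}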

The volume is formatted at create time with fileio_format (boot_db_full_name, dwb_volume_name, LOG_DBDWB_VOLID, num_block_pages, …) — a permanent volume id reserved for the DWB. A pre-existing DWB volume on the next restart is the recovery trigger; an absent volume means there is nothing to repair.

// DWB_SLOT — src/storage/double_write_buffer.hpp
struct double_write_slot
{
  FILEIO_PAGE *io_page;             /* The contained page or NULL. */
  VPID vpid;                        /* The page identifier. */
  LOG_LSA lsa;                      /* The page LSA. */
  bool ensure_metadata;             /* Include metadata when syncing. */
  unsigned int position_in_block;   /* The position in block. */
  unsigned int block_no;            /* The number of the block where the slot resides. */
};

// DWB_BLOCK — src/storage/double_write_buffer.cpp
struct double_write_block
{
  FLUSH_VOLUME_INFO *flush_volumes_info;    /* per-block flush bookkeeping */
  volatile unsigned int count_flush_volumes_info;
  unsigned int max_to_flush_vdes;
  pthread_mutex_t mutex;                    /* protects wait_queue */
  DWB_WAIT_QUEUE wait_queue;                /* threads sleeping on this block's flush */
  char *write_buffer;                       /* contiguous block bytes — written in one fileio_write_pages */
  DWB_SLOT *slots;                          /* slot view onto write_buffer */
  volatile unsigned int count_wb_pages;     /* current fill level */
  unsigned int block_no;
  volatile UINT64 version;                  /* incremented after each flush */
  volatile bool all_pages_written;          /* set when home writes complete */
};

// DOUBLE_WRITE_BUFFER — global singleton
struct double_write_buffer
{
  bool logging_enabled;
  DWB_BLOCK *blocks;                            /* num_blocks-sized array */
  unsigned int num_blocks;                      /* power of 2, ≤ DWB_MAX_BLOCKS = 32 */
  unsigned int num_pages;                       /* num_blocks × num_block_pages */
  unsigned int num_block_pages;                 /* power of 2 */
  unsigned int log2_num_block_pages;            /* used by macros */
  volatile unsigned int blocks_flush_counter;
  volatile unsigned int next_block_to_flush;
  pthread_mutex_t mutex;
  DWB_WAIT_QUEUE wait_queue;                    /* structure-modification waiters */
  UINT64 volatile position_with_flags;          /* THE coordinator */
  dwb_hashmap_type slots_hashmap;               /* VPID → DWB_SLOTS_HASH_ENTRY */
  int vdes;                                     /* volume descriptor */
  DWB_BLOCK *volatile file_sync_helper_block;   /* block being post-flushed */
};

DWB_BLOCK::write_buffer is the only memory backing the slots — DWB_BLOCK::slots[i].io_page is just a pointer into write_buffer + i × IO_PAGESIZE. Allocation makes that explicit:

// dwb_create_blocks — src/storage/double_write_buffer.cpp
for (i = 0; i < num_blocks; i++)
  {
    blocks_write_buffer[i] = (char *) malloc (block_buffer_size * sizeof (char));
    ...
    for (j = 0; j < num_block_pages; j++)
      {
        io_page = (FILEIO_PAGE *) (blocks_write_buffer[i] + j * IO_PAGESIZE);
        fileio_initialize_res (thread_p, io_page, IO_PAGESIZE);
        dwb_initialize_slot (&slots[i][j], io_page, j, i);
      }
    dwb_initialize_block (&blocks[i], i, 0, blocks_write_buffer[i], slots[i],
                          flush_volumes_info[i], 0, num_block_pages);
  }

Two consequences. (a) The block can be written to disk as one contiguous fileio_write_pages call — the bytes are already laid out in volume order. (b) Slot mutation (a producer copying its page into a slot) is a memcpy into the shared write buffer; concurrent producers writing into different slots are safe because each producer owns its slot position by way of the position-with-flags atomic.

Position-with-flags — the central coordinator

Section titled “Position-with-flags — the central coordinator”

The 64-bit atomic dwb_Global.position_with_flags is the only producer-side serialization point. Bit layout:

bit  63 ......................... 32   31    30   29 ............. 0
    +------------------------------+ +---+ +---+ +-----------------+
    | block-write-started bitmask  | | M | | C | |  slot position  |
    | (one bit per block,          | | S | | R | |  (30 bits)      |
    |  max 32 blocks)              | |   | | E | |                 |
    +------------------------------+ +---+ +---+ +-----------------+
M = MODIFY_STRUCTURE flag (bit 31)
C = CREATE flag (bit 30)

Macro accessors hide the encoding:

// position-with-flags macros — double_write_buffer.cpp
#define DWB_GET_POSITION(p)      ((p) & 0x000000003fffffffULL)   /* low 30 bits  */
#define DWB_GET_BLOCK_STATUS(p)  ((p) & 0xffffffff00000000ULL)   /* high 32 bits */
#define DWB_MODIFY_STRUCTURE     0x0000000080000000ULL
#define DWB_CREATE               0x0000000040000000ULL
/* "block write started" bit — block_no occupies bit (63 - block_no) */
#define DWB_IS_BLOCK_WRITE_STARTED(p, block_no) \
  (((p) & (1ULL << (63 - (block_no)))) != 0)
#define DWB_STARTS_BLOCK_WRITING(p, block_no)  ((p) | (1ULL << (63 - (block_no))))
#define DWB_ENDS_BLOCK_WRITING(p, block_no)    ((p) & ~(1ULL << (63 - (block_no))))
/* "next slot position" — wraps at num_pages */
#define DWB_GET_NEXT_POSITION_WITH_FLAGS(p) \
  (DWB_GET_POSITION (p) == (DWB_NUM_TOTAL_PAGES - 1) \
   ? ((p) & DWB_FLAG_MASK) : ((p) + 1))

The 30-bit position field caps the total slot count at 2³⁰ slots, far above the DWB_MAX_SIZE / IO_PAGESIZE cap of a few thousand. The block-status bit field hard-caps the number of blocks at 32 (DWB_MAX_BLOCKS), which is why PRM_ID_DWB_BLOCKS is clamped to that value.
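
A tiny harness over those macros (DWB_NUM_TOTAL_PAGES and DWB_FLAG_MASK filled in with illustrative values) shows the wrap-around and the block bit surviving it:

// position-with-flags macro harness — illustrative values, not engine code
#include <stdint.h>
#include <stdio.h>

#define DWB_NUM_TOTAL_PAGES 128ULL
#define DWB_FLAG_MASK       0xffffffffc0000000ULL   /* everything but position */
#define DWB_GET_POSITION(p) ((p) & 0x000000003fffffffULL)
#define DWB_STARTS_BLOCK_WRITING(p, b)   ((p) | (1ULL << (63 - (b))))
#define DWB_IS_BLOCK_WRITE_STARTED(p, b) (((p) & (1ULL << (63 - (b)))) != 0)
#define DWB_GET_NEXT_POSITION_WITH_FLAGS(p) \
  (DWB_GET_POSITION (p) == (DWB_NUM_TOTAL_PAGES - 1) \
   ? ((p) & DWB_FLAG_MASK) : ((p) + 1))

int
main (void)
{
  uint64_t w = DWB_NUM_TOTAL_PAGES - 1;        /* producer lands on last slot */
  w = DWB_STARTS_BLOCK_WRITING (w, 0);         /* block 0 is being flushed    */
  w = DWB_GET_NEXT_POSITION_WITH_FLAGS (w);    /* position wraps to 0         */
  printf ("pos=%llu block0-writing=%d\n",
          (unsigned long long) DWB_GET_POSITION (w),
          DWB_IS_BLOCK_WRITE_STARTED (w, 0));  /* pos=0, block bit kept */
  return 0;
}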

The atomic CAS loop at the heart of dwb_acquire_next_slot is what makes this hot path lock-free:

// dwb_acquire_next_slot — src/storage/double_write_buffer.cpp (condensed)
STATIC_INLINE int
dwb_acquire_next_slot (THREAD_ENTRY *thread_p, bool can_wait, DWB_SLOT **p_dwb_slot)
{
start:
  current_position_with_flags = ATOMIC_INC_64 (&dwb_Global.position_with_flags, 0ULL);
  if (DWB_NOT_CREATED_OR_MODIFYING (current_position_with_flags))
    {
      /* wait or return — see below */
    }
  current_block_no = DWB_GET_BLOCK_NO_FROM_POSITION (current_position_with_flags);
  position_in_current_block = DWB_GET_POSITION_IN_BLOCK (current_position_with_flags);
  if (position_in_current_block == 0)
    {
      /* first write into this block — must wait for previous iteration to flush */
      if (DWB_IS_BLOCK_WRITE_STARTED (current_position_with_flags, current_block_no))
        {
          if (!can_wait)
            return NO_ERROR;
          dwb_wait_for_block_completion (thread_p, current_block_no);
          goto start;
        }
      current_position_with_block_write_started =
        DWB_STARTS_BLOCK_WRITING (current_position_with_flags, current_block_no);
      new_position_with_flags =
        DWB_GET_NEXT_POSITION_WITH_FLAGS (current_position_with_block_write_started);
    }
  else
    {
      new_position_with_flags = DWB_GET_NEXT_POSITION_WITH_FLAGS (current_position_with_flags);
    }
  if (!ATOMIC_CAS_64 (&dwb_Global.position_with_flags,
                      current_position_with_flags, new_position_with_flags))
    {
      goto start;   /* lost race, try again */
    }
  block = dwb_Global.blocks + current_block_no;
  *p_dwb_slot = block->slots + position_in_current_block;
  VPID_SET_NULL (&(*p_dwb_slot)->vpid);   /* invalidate previous occupant */
  return NO_ERROR;
}

The first-into-a-block producer is the one who flips the “block-write-started” bit for the new block. That bit is the synchronization handle the flush daemon and the structure-modification path read to know “block N has staged data that must be written”.
Producer side — the page buffer flush path

Section titled “Producer side — the page buffer flush path”

The single biggest producer of DWB writes is the page buffer’s flush path. From pgbuf_bcb_safe_flush_internal_release_mutex (in page_buffer.c around line 10468), the relevant slice is:

// pgbuf flush path — src/storage/page_buffer.c
DWB_SLOT *dwb_slot = NULL;
bool uses_dwb;

uses_dwb = dwb_is_created () && !is_temp;

start_copy_page:
/* ... TDE-encrypt or memcpy bufptr → iopage ... */
if (uses_dwb)
  {
    error = dwb_set_data_on_next_slot (thread_p, iopage, false, false, &dwb_slot);
    if (dwb_slot != NULL)
      {
        iopage = NULL;   /* the slot's io_page replaces our local */
        goto copy_unflushed_lsa;
      }
  }

copy_unflushed_lsa:
/* ... WAL flush (logpb_flush_log_for_wal) ... */
if (uses_dwb)
  {
    error = dwb_add_page (thread_p, iopage, &bufptr->vpid, false, &dwb_slot);
    /* dwb_add_page enqueues; the actual home write is done later by
       dwb_flush_block via dwb_write_block */
  }
else
  {
    write_mode = (dwb_is_created () == true
                  ? FILEIO_WRITE_NO_COMPENSATE_WRITE : FILEIO_WRITE_DEFAULT_WRITE);
    fileio_write (thread_p, ..., iopage, ..., write_mode);
  }

Two properties matter. (a) is_temp short-circuits the DWB for temporary volumes — torn pages on temp volumes are benign because temp volumes are recreated at next start. (b) The path is split in two: dwb_set_data_on_next_slot acquires and stages, but the WAL force happens between staging and dwb_add_page. This is the WAL invariant honored inside the DWB path: the log record describing the change must be on disk before the data page can be written, even though the data page is going to the DWB rather than the home volume first.
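
Distilled to its ordering skeleton, with stub functions standing in for the real calls named above (illustrative only, so the invariant is visible on its own):

// WAL-before-data ordering skeleton — illustrative stubs, not CUBRID source
#include <assert.h>
#include <stdint.h>

typedef uint64_t LSA;
static LSA log_flushed_lsa;   /* highest LSA durable in the WAL */

static void
stage_into_slot (void)
{
  /* dwb_set_data_on_next_slot: memcpy page into slot; invisible to readers */
}

static void
force_wal_up_to (LSA page_lsa)
{
  /* logpb_flush_log_for_wal: stands in for the log force + fsync */
  if (log_flushed_lsa < page_lsa)
    log_flushed_lsa = page_lsa;
}

static void
publish_slot (LSA page_lsa)
{
  /* dwb_add_page: invariant is that the log covers the page before any
     copy of the page (DWB or home) can reach disk */
  assert (log_flushed_lsa >= page_lsa);
  /* insert into slots_hashmap, bump count_wb_pages, maybe wake the daemon */
}

static void
flush_one_page (LSA page_lsa)
{
  stage_into_slot ();
  force_wal_up_to (page_lsa);
  publish_slot (page_lsa);
}

int
main (void)
{
  flush_one_page (42);
  return 0;
}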

Producer side — file_io.c flow for non-pgbuf writers

Section titled “Producer side — file_io.c flow for non-pgbuf writers”

A second producer is fileio_write_or_add_to_dwb (file_io.c:4014). It is used by callers like volume-extension code that write a page directly via fileio_write rather than through the page buffer:

// fileio_write_or_add_to_dwb — src/storage/file_io.c
void *
fileio_write_or_add_to_dwb (THREAD_ENTRY *thread_p, int vol_fd, FILEIO_PAGE *io_page_p,
                            PAGEID page_id, size_t page_size, bool ensure_metadata)
{
  bool skip_flush = false;
  DWB_SLOT *p_dwb_slot = NULL;

  skip_flush = dwb_is_created ();
  if (skip_flush)
    {
      arg.vdes = vol_fd;
      vol_info_p = fileio_traverse_permanent_volume (thread_p,
                                                     fileio_is_volume_descriptor_equal, &arg);
      if (vol_info_p)
        {
          /* permanent volume — route through DWB */
          VPID_SET (&vpid, vol_info_p->volid, page_id);
          io_page_p->prv.volid = vol_info_p->volid;
          io_page_p->prv.pageid = page_id;
          error_code = dwb_add_page (thread_p, io_page_p, &vpid, ensure_metadata, &p_dwb_slot);
          if (p_dwb_slot != NULL)
            {
              return io_page_p;   /* staged */
            }
        }
      /* not permanent OR DWB disabled meanwhile — fall through to direct write */
    }

  write_mode = skip_flush ? FILEIO_WRITE_NO_COMPENSATE_WRITE : FILEIO_WRITE_DEFAULT_WRITE;
  return fileio_write (thread_p, vol_fd, io_page_p, page_id, page_size, write_mode);
}

FILEIO_WRITE_NO_COMPENSATE_WRITE vs. FILEIO_WRITE_DEFAULT_WRITE is the second torn-page knob: when the DWB is active, fileio_write is told not to re-write the destination page on detection of a partial write because the DWB will do that recovery on restart. Without the DWB, fileio_write falls back to its own retry-on-partial-write path.
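
A sketch of what the two modes imply for the low-level write loop; this is not fileio_write's actual body, only a reconstruction of the behavior described above:

// write modes vs. partial writes — illustrative sketch, not fileio_write
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

enum write_mode { WRITE_DEFAULT, WRITE_NO_COMPENSATE };

static int
write_page (int fd, const char *buf, size_t size, off_t off, enum write_mode mode)
{
  size_t done = 0;
  while (done < size)
    {
      ssize_t n = pwrite (fd, buf + done, size - done, off + done);
      if (n < 0)
        {
          if (errno == EINTR)
            continue;
          return -1;
        }
      done += (size_t) n;
      if (done < size && mode == WRITE_NO_COMPENSATE)
        {
          /* Partial write with the DWB active: do NOT re-write in place;
             restart-time DWB recovery will repair the torn page. */
          return -1;
        }
      /* WRITE_DEFAULT: loop and compensate by writing the remainder. */
    }
  return 0;
}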

sequenceDiagram
  participant PB as page_buffer flush path
  participant DA as dwb_set_data_on_next_slot
  participant WL as logpb_flush_log_for_wal (WAL force)
  participant DP as dwb_add_page
  participant SH as dwb_slots_hashmap (VPID hash)
  participant BL as DWB_BLOCK (count_wb_pages++)
  participant FD as dwb-flush-block daemon
  participant FB as dwb_flush_block

  PB->>DA: stage iopage into next slot
  DA-->>PB: DWB_SLOT* (page memcpy'd into slot)
  PB->>WL: force WAL up to page LSA
  WL-->>PB: nxio_lsa >= page.lsa
  PB->>DP: dwb_add_page(iopage, vpid, slot)
  DP->>SH: insert (vpid → slot)
  DP->>BL: ATOMIC_INC_32(&count_wb_pages)
  alt block now full
    DP->>FD: wakeup
    FD->>FB: dwb_flush_next_block
    FB->>FB: fileio_write_pages (block → DWB volume)
    FB->>FB: fileio_synchronize (DWB volume)
    FB->>FB: dwb_write_block (slots → home volumes)
    FB->>FB: fileio_synchronize (home volumes — partly via helper)
  else block partially full
    DP-->>PB: NO_ERROR (no flush yet)
  end

Slot hash — turning the DWB into a read-side cache

Section titled “Slot hash — turning the DWB into a read-side cache”

When the DWB has staged a page but not yet written it to its home location, a concurrent reader (any thread doing pgbuf_fix on a page-table miss) would otherwise read the old image from the home volume. The slot hash prevents that. On every successful dwb_add_page, the slot is registered in dwb_Global.slots_hashmap keyed on VPID.

The page-buffer fix path consults the hash before falling through to fileio_read:

// pgbuf_fix path — page_buffer.c:8239
if (dwb_read_page (thread_p, vpid, &bufptr->iopage_buffer->iopage, &success) != NO_ERROR)
  {
    return NULL;
  }
else if (success == true)
  {
    /* nothing to do — page bytes copied from DWB slot */
  }
else if (fileio_read (thread_p, fileio_get_volume_descriptor (vpid->volid), ...) == NULL)
  {
    /* error path */
  }

dwb_read_page looks up the hash, and if the slot’s VPID still matches the requested one (a producer could be in the middle of overwriting it), it memcpy’s the slot’s io_page into the caller’s buffer:

// dwb_read_page — src/storage/double_write_buffer.cpp:3968
int
dwb_read_page (THREAD_ENTRY *thread_p, const VPID *vpid, void *io_page, bool *success)
{
  *success = false;
  if (!dwb_is_created ())
    {
      return NO_ERROR;
    }

  VPID key_vpid = *vpid;
  slots_hash_entry = dwb_Global.slots_hashmap.find (thread_p, key_vpid);
  if (slots_hash_entry != NULL)
    {
      if (VPID_EQ (&slots_hash_entry->slot->vpid, vpid))
        {
          memcpy ((char *) io_page, (char *) slots_hash_entry->slot->io_page, IO_PAGESIZE);
          *success = true;
        }
      pthread_mutex_unlock (&slots_hash_entry->mutex);
    }
  return NO_ERROR;
}

The semantics matter: a DWB hit means the reader gets the newest version of the page, even one that has not yet been written to the home volume. This is one of the rare design points where the DWB is doing more than torn-page protection — it is also a write-back cache for the home volume. A torn-page-only design would not need the slot hash; CUBRID adds it because the cache is essentially free given the slots already exist in memory.

LSA-ordered hash insert — handling re-fixes

Section titled “LSA-ordered hash insert — handling re-fixes”

A page can land in the DWB twice within one block-fill window: the same VPID gets dirtied, flushed-staged, dirtied again, flushed-staged again. dwb_slots_hash_insert resolves this by keeping the newest version visible to readers and invalidating the older slot:

// dwb_slots_hash_insert — src/storage/double_write_buffer.cpp (excerpt)
*inserted = dwb_Global.slots_hashmap.find_or_insert (thread_p, *vpid, slots_hash_entry);
if (!(*inserted))
  {
    if (LSA_LT (&slot->lsa, &slots_hash_entry->slot->lsa))
      {
        /* The cached slot is newer than mine — leave it in hash. */
        pthread_mutex_unlock (&slots_hash_entry->mutex);
        return NO_ERROR;
      }
    else if (LSA_EQ (&slot->lsa, &slots_hash_entry->slot->lsa))
      {
        /* Same LSA — page modified without logging (rare).
           Replace, but invalidate the old slot if it is in the same block. */
        if (slots_hash_entry->slot->block_no == slot->block_no)
          {
            VPID_SET_NULL (&slots_hash_entry->slot->vpid);
            fileio_initialize_res (thread_p, slots_hash_entry->slot->io_page, IO_PAGESIZE);
          }
      }
    slot->ensure_metadata = slot->ensure_metadata || slots_hash_entry->slot->ensure_metadata;
  }
slots_hash_entry->slot = slot;

The LSA_LT check (“I am older than the cached one”) is the defense against producers whose own slot got a smaller LSA than another producer’s that won the hash race. The LSA_EQ arm handles the “modified without logging” path (temp-volume-style writes whose LSA was never advanced).

Flush — block-level fsync, then home writes

Section titled “Flush — block-level fsync, then home writes”

dwb_flush_block is the heart of the durability story. It runs either inline (when the producer fills the last slot in a block and there is no daemon available) or from the dwb-flush-block daemon. Its sequence is:

// dwb_flush_block — src/storage/double_write_buffer.cpp:2191 (condensed)
STATIC_INLINE int
dwb_flush_block (THREAD_ENTRY *thread_p, DWB_BLOCK *block,
                 bool file_sync_helper_can_flush, UINT64 *cur_pos_w_flags)
{
  ATOMIC_INC_32 (&dwb_Global.blocks_flush_counter, 1);

  /* (1) Snapshot the slots in VPID order so the home writes hit each volume contiguously. */
  dwb_block_create_ordered_slots (block, &p_dwb_ordered_slots, &ordered_slots_length);

  /* (2) De-duplicate: the same VPID may appear twice — keep the newer LSA. */
  for (i = 0; i < block->count_wb_pages - 1; i++)
    {
      ...
    }

  /* (3) Wait for the previous block's home writes to finish. */
  while (dwb_Global.file_sync_helper_block != NULL)
    {
      /* thread_sleep (1) or flush inline */
    }

  /* (4) WRITE THE WHOLE BLOCK TO THE DWB VOLUME — sequential, fast. */
  fileio_write_pages (thread_p, dwb_Global.vdes, block->write_buffer, 0,
                      block->count_wb_pages, IO_PAGESIZE,
                      FILEIO_WRITE_NO_COMPENSATE_WRITE);

  /* (5) FSYNC THE DWB VOLUME — durability barrier. */
  fileio_synchronize (thread_p, dwb_Global.vdes, dwb_Volume_name, false);

  /* (6) WRITE THE PAGES TO THEIR HOME VOLUMES (and remove from slot hash). */
  dwb_write_block (thread_p, block, p_dwb_ordered_slots, ordered_slots_length,
                   file_sync_helper_can_flush, /* remove_from_hash = */ true);

  /* (7) FSYNC THE HOME VOLUMES — small ones inline, big ones offloaded to helper. */
  for (i = 0; i < block->count_flush_volumes_info; i++)
    {
      if (file_sync_helper_can_flush && num_pages > max_pages_to_sync
          && dwb_is_file_sync_helper_daemon_available ())
        {
          continue;   /* let the helper do it */
        }
      if (ATOMIC_CAS_32 (&block->flush_volumes_info[i].flushed_status,
                         VOLUME_NOT_FLUSHED, VOLUME_FLUSHED_BY_DWB_FLUSH_THREAD))
        {
          fileio_synchronize (thread_p, block->flush_volumes_info[i].vdes,
                              NULL, block->flush_volumes_info[i].metadata);
        }
    }

  block->all_pages_written = true;
  ATOMIC_TAS_32 (&block->count_wb_pages, 0);
  ATOMIC_INC_64 (&block->version, 1ULL);

  /* (8) Clear the block's "write started" bit; advance next_block_to_flush. */
  /* (9) Wake any threads sleeping on this block's wait_queue. */
}

The nine-step sequence is the torn-page protection contract in execution: between step 5 and step 6, the DWB volume is durable on disk. If a crash occurs anywhere during step 6, the DWB has a clean copy of every page whose home write may have been torn. After step 7, both copies are durable and the DWB slots can be reused on the next block-fill cycle.

The helper daemon dwb_file_sync_helper_daemon exists to parallelise step 7. Home volumes can be many, and fsync on a big volume is slow; the main flush thread offloads the expensive ones to the helper while it returns to write more data into the next block. The handoff is via the single-pointer atomic dwb_Global.file_sync_helper_block.
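
The handoff pattern, reduced to its essentials with C11 atomics standing in for CUBRID's ATOMIC_* wrappers (illustrative, not the engine's code):

// single-pointer work handoff — illustrative sketch of the helper protocol
#include <stdatomic.h>
#include <stddef.h>

typedef struct dwb_block DWB_BLOCK;         /* opaque here */

static DWB_BLOCK *_Atomic helper_block;     /* NULL = helper is idle */

/* Flush thread: try to offload a block's home-volume fsyncs. Returns 1 on
   success; on 0 the flusher keeps the fsync work inline. */
static int
offload_to_helper (DWB_BLOCK *block)
{
  DWB_BLOCK *expected = NULL;
  return atomic_compare_exchange_strong (&helper_block, &expected, block);
}

/* Helper daemon tick (10 ms cadence): drain the handed-off block, if any. */
static void
helper_tick (void)
{
  DWB_BLOCK *block = atomic_load (&helper_block);
  if (block == NULL)
    return;
  /* ... fileio_synchronize each pending home volume of 'block' ... */
  atomic_store (&helper_block, NULL);       /* signal "done" to the flusher */
}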

State machine — one DWB slot’s lifecycle

Section titled “State machine — one DWB slot’s lifecycle”
stateDiagram-v2
  [*] --> FREE: server start (block created)
  FREE --> STAGED: dwb_set_data_on_next_slot\n(slot acquired, page memcpy'd)
  STAGED --> HASHED: dwb_add_page\n(VPID inserted into slots_hashmap)
  HASHED --> DWB_WRITTEN: dwb_flush_block step 4-5\n(write_buffer flushed + fsync DWB volume)
  DWB_WRITTEN --> HOME_WRITTEN: dwb_write_block\n(slot's io_page written to home volume)
  HOME_WRITTEN --> HOME_SYNCED: fileio_synchronize\n(home volume fsync — main or helper thread)
  HOME_SYNCED --> FREE: count_wb_pages reset; block->version++
  HASHED --> HASHED_INVALIDATED: same VPID re-staged with newer LSA
  HASHED_INVALIDATED --> DWB_WRITTEN: still flushed (with NULL VPID skip in dwb_write_block)

Three properties matter. (a) Between STAGED and HASHED, a concurrent reader cannot find the slot — it must read from the (possibly old) home volume. (b) Between HASHED and HOME_SYNCED, a concurrent reader gets the slot’s contents via dwb_read_page. (c) The transition out of HOME_SYNCED is the only point at which the slot is released for reuse by the next block-fill cycle.

Crash recovery — dwb_load_and_recover_pages

Section titled “Crash recovery — dwb_load_and_recover_pages”

The recovery story has one external entry: boot_sr.c calls dwb_load_and_recover_pages after vacuum init but before log recovery (analysis / redo / undo):

// boot_sr.c:2403 — server boot, with crash recovery
oid_set_root (&boot_Db_parm->rootclass_oid);

/* Load and recover data pages before log recovery */
error_code = dwb_load_and_recover_pages (thread_p, log_path, log_prefix);
if (error_code != NO_ERROR)
  {
    goto error;
  }

#if defined(SERVER_MODE)
pgbuf_daemons_init ();
dwb_daemons_init ();
parallel_query::worker_manager_global::get_manager ().init ();
#endif

The placement is critical for ARIES correctness: by the time analysis starts walking the log, every home page on disk is either a coherent old image (the DWB had no copy and the home write never started or fully completed) or a coherent new image (the DWB restored it). Redo can replay log records onto a known state.

dwb_load_and_recover_pages body (double_write_buffer.cpp:3199) unfolds in seven phases:

  1. Open the on-disk DWB volume. If it does not exist, nothing to do — the engine was either freshly initialized or the previous session shut down cleanly enough that DWB was destroyed.
  2. Allocate one in-memory DWB_BLOCK of size num_dwb_pages — the entire DWB is loaded into a single block, regardless of how many blocks it had during normal operation. This is simpler because the recovery scan does not need to know the block boundaries; it only needs the slot images.
  3. Read all pages. fileio_read_pages pulls the entire DWB volume into the block’s write_buffer. For each slot, the VPID and LSA are re-derived from the page’s prv.{volid,pageid,lsa} header.
  4. Sort + dedup. dwb_block_create_ordered_slots sorts by (VPID, LSA). The dedup loop walks the sorted array and nulls out the older copy when the same VPID appears twice (which can happen if the crash occurred mid-flush of one block and the previous block’s slots were partially overwritten by the in-progress one).
  5. For each slot, dwb_check_data_page_is_sane. Read the home volume’s copy. If it is sane (passes fileio_page_check_corruption), keep it — null the slot’s VPID so the recovery write skips it. If it is corrupt and the DWB slot is sane, the home page will be replaced. If both are corrupt, that is a fatal recovery error. (The decision gate is sketched in code after the flowchart below.)
  6. dwb_write_block to overwrite home pages. The same producer-side function that runs during normal flush is reused here, with remove_from_hash = false because there is no live hash yet.
  7. Dismount + unformat the DWB volume, then re-create it fresh via dwb_create. The new DWB starts empty; whatever was staged at crash time is now committed to home pages.
flowchart TB
  S0["server start\nboot_sr.c"]
  S1["log_initialize\n(reads active log header)"]
  S2["vacuum_initialize"]
  S3["dwb_load_and_recover_pages"]
  S4["pgbuf_daemons_init\ndwb_daemons_init"]
  S5["log_recovery (analysis / redo / undo)"]

  S0 --> S1 --> S2 --> S3 --> S4 --> S5

  subgraph DWB_RECOVERY["dwb_load_and_recover_pages"]
    R1["fileio_is_volume_exist?"]
    R2["fileio_mount + fileio_read_pages"]
    R3["dwb_block_create_ordered_slots\n(sort by VPID, LSA)"]
    R4["dedup duplicates\n(same VPID twice)"]
    R5["dwb_check_data_page_is_sane\n(read each home page)"]
    R6{"home corrupt &\nDWB sane?"}
    R7["mark slot for replacement"]
    R8["mark slot skipped"]
    R9["dwb_write_block\n(replace corrupt home pages)"]
    R10["fileio_synchronize home volumes"]
    R11["fileio_dismount + fileio_unformat"]
    R12["dwb_create (fresh)"]
  end

  R1 -- "yes" --> R2
  R1 -- "no" --> R12
  R2 --> R3 --> R4 --> R5
  R5 --> R6
  R6 -- "yes" --> R7
  R6 -- "no" --> R8
  R7 --> R9
  R8 --> R9
  R9 --> R10 --> R11 --> R12
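
Phase 5's decision gate, reduced to a truth table in code (simplified from dwb_check_data_page_is_sane; "sane" stands for the header and checksum tests done by fileio_page_check_corruption):

// recovery decision gate — simplified sketch of phase 5
enum repair_action
{
  KEEP_HOME,               /* null the slot's VPID; home write is skipped */
  RESTORE_FROM_DWB,        /* dwb_write_block overwrites the home page    */
  FATAL_BOTH_CORRUPT       /* unrecoverable: recovery aborts with error   */
};

static enum repair_action
decide_slot (int home_is_sane, int dwb_slot_is_sane)
{
  if (home_is_sane)
    return KEEP_HOME;
  if (dwb_slot_is_sane)
    return RESTORE_FROM_DWB;
  return FATAL_BOTH_CORRUPT;
}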

The key invariant: by the time dwb_create returns, the DWB volume is empty and the home volumes are coherent. Redo, when it runs next, reads home pages whose state is one consistent snapshot of the database — not torn.

Performance — write amplification and why it’s affordable

Section titled “Performance — write amplification and why it’s affordable”

The DWB’s worst-case cost is 2× write amplification on data pages: every dirty page is written first to the DWB volume, then to its home volume. Two factors make this affordable in practice.

Sequential vs. random. The DWB write is sequential: a whole block (typically 1 MiB or more) at a time, into a fixed-location file. Modern storage handles sequential writes at near-peak bandwidth, so the DWB write is essentially free in throughput terms — it is dominated by fsync latency, not transfer time. The home write, by contrast, is random — pages from many volumes scattered across the disk. The DWB does not add a random write; it adds a sequential write and an fsync.

Group flush via daemon. The dwb-flush-block daemon amortises the fsync cost of a block over all the slots it contains. A block with num_block_pages = 64 pays one fsync for the DWB volume plus one per home volume it touches, across sixty-four page writes; with a single home volume that is 2/64 ≈ 1/32 of an fsync per page. That is much cheaper than PostgreSQL’s full_page_writes = on, which logs the entire page image the first time each page is dirtied after a checkpoint.
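
The arithmetic, spelled out with illustrative numbers (V, the count of distinct home volumes a block touches, varies per block):

// fsync amortisation arithmetic — illustrative numbers
#include <stdio.h>

int
main (void)
{
  const int pages_per_block = 64;   /* num_block_pages                   */
  const int home_volumes    = 3;    /* V: distinct volumes in this block */
  const int fsyncs = 1 /* DWB volume */ + home_volumes;

  printf ("fsyncs per page = %d / %d = %.3f\n",
          fsyncs, pages_per_block, (double) fsyncs / pages_per_block);
  /* 4 / 64 = 0.0625, vs. roughly one full-page log write per
     first-dirtying under PostgreSQL's full_page_writes = on. */
  return 0;
}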

Concurrent home-write pipelining. While the daemon writes home pages and waits for fsync, other producers continue to fill the next block. The pipeline keeps the foreground page flushers from blocking on DWB fsync. When the helper daemon takes over fsync for big home volumes, the main flush thread is free to start the next block immediately.

The configuration knobs let an operator tune the tradeoff: small PRM_ID_DWB_SIZE reduces memory but increases flush frequency; many PRM_ID_DWB_BLOCKS increases pipeline depth but reduces the number of slots per block. Disabling either by setting it to 0 turns the DWB off entirely — useful on storage with native atomic 8 KiB / 16 KiB writes.

When cross-referencing the source, anchor on symbol names, not line numbers; the line numbers cited below are a snapshot and drift with every commit.

Public API (src/storage/double_write_buffer.hpp)

Section titled “Public API (src/storage/double_write_buffer.hpp)”
  • DWB_SLOT — slot struct visible to producers; carries io_page, vpid, lsa, position_in_block, block_no.
  • dwb_is_created — has the DWB been initialised?
  • dwb_create — create the DWB volume and the in-memory structure; called from boot_sr.c at first volume creation.
  • dwb_recreate — destroy + create with current parameters (used when PRM_ID_DWB_SIZE / PRM_ID_DWB_BLOCKS change).
  • dwb_load_and_recover_pages — the crash-recovery entry point.
  • dwb_destroy — finalise.
  • dwb_get_volume_name — observability.
  • dwb_flush_force — drain everything pending; called from fileio_synchronize_all.
  • dwb_read_page — slot-hash lookup for the page-buffer fix path.
  • dwb_set_data_on_next_slot — producer-side acquire+stage.
  • dwb_add_page — producer-side commit (insert into hash, bump block fill count, request flush if block full).
  • dwb_synchronize — hook for fileio_synchronize to push pending DWB content before flushing a single volume.
  • dwb_daemons_init / dwb_daemons_destroy — server-mode daemon lifecycle.

Internal types (src/storage/double_write_buffer.cpp)

Section titled “Internal types (src/storage/double_write_buffer.cpp)”
  • DWB_WAIT_QUEUE_ENTRY / DWB_WAIT_QUEUE — singly-linked wait queue per block + global; threads parked here wake on block flush completion or structure-modification end.
  • FLUSH_VOLUME_INFO — per-block, per-volume bookkeeping for step-7 fsync (descriptor, page count, status flags).
  • DWB_BLOCK — the fsync unit (write_buffer, slots, count_wb_pages, version, wait queue).
  • DWB_SLOTS_HASH_ENTRY — VPID-keyed hash entry pointing at a DWB_SLOT.
  • DOUBLE_WRITE_BUFFER (singleton dwb_Global) — global state.
  • dwb_init_wait_queue / dwb_block_add_wait_queue_entry / dwb_block_disconnect_wait_queue_entry / dwb_block_free_wait_queue_entry / dwb_remove_wait_queue_entry / dwb_signal_waiting_threads / dwb_destroy_wait_queue — wait-queue primitives.
  • dwb_signal_waiting_thread — signals one parked thread (sets THREAD_DWB_QUEUE_RESUMED).
  • dwb_set_status_resumed — flips a parked thread back to resumed after timeout cleanup.
  • dwb_wait_for_block_completion — park on a block’s wait_queue; 20 ms timeout.
  • dwb_wait_for_strucure_modification — park on the global wait_queue during DWB resize/destroy/create; 10 ms timeout.
  • dwb_signal_block_completion / dwb_signal_structure_modificated — wake-all on those queues.
  • dwb_starts_structure_modification / dwb_ends_structure_modification — set/clear the MODIFY_STRUCTURE flag in position_with_flags; before setting, the function CAS-loops, then waits for in-flight flushers to drain.
  • dwb_load_buffer_size / dwb_load_block_count / dwb_power2_ceil — parameter parsing, clamp to bounds, round up to power of two.
  • dwb_initialize_slot / dwb_initialize_block / dwb_create_blocks / dwb_finalize_block — memory layout helpers.
  • dwb_create_internal / dwb_destroy_internal — create / destroy under the structure-modification flag.
  • dwb_acquire_next_slot — the lock-free producer entry.
  • dwb_set_slot_data — memcpy page bytes + capture VPID and LSA into the slot.
  • dwb_init_slot — reset slot fields (used during sentinel insertion in the ordered-slot snapshot).
  • dwb_block_create_ordered_slots — qsort snapshot by VPID then LSA, with sentinel.
  • dwb_compare_slots — comparator for the sort.
  • dwb_compare_vol_fd — comparator over volume descriptors.
  • dwb_add_volume_to_block_flush_area — bookkeeping for step-7 fsync targets.
  • dwb_get_next_block_for_flush — pick the next full block (used by the daemon).
  • dwb_flush_next_block — daemon body.
  • dwb_flush_block — the eight-step durability dance.
  • dwb_write_block — step 6, write slots to home volumes.
  • dwb_file_sync_helper — helper daemon body for offloaded step-7 fsyncs.
  • dwb_slots_hash_entry_alloc / _free / _init / _key_copy / _compare_key / _key — hash callbacks.
  • dwb_slots_hash_insert / dwb_slots_hash_delete — hash ops.
  • dwb_check_data_page_is_sane — recovery-time per-slot decision (replace home or skip).
  • dwb_debug_check_dwb — debug-only dup detector for the recovery-time block.
  • dwb_is_flush_block_daemon_available / dwb_is_file_sync_helper_daemon_available / dwb_flush_block_daemon_is_running / dwb_file_sync_helper_daemon_is_running — daemon visibility helpers, gate PRM_ID_ENABLE_DWB_FLUSH_THREAD.
  • dwb_flush_block_daemon_init / dwb_file_sync_helper_daemon_init — daemon spin-up; loopers tick at 1 ms / 10 ms.
  • class dwb_flush_block_daemon_task — task class for the flush daemon.
  • dwb_file_sync_helper_execute — task entry for the helper daemon (registered as cubthread::entry_callable_task).
  • pgbuf_bcb_safe_flush_internal_release_mutex (in page_buffer.c ≈ line 10468) — primary producer.
  • pgbuf_fix page-load path — consults dwb_read_page before falling back to fileio_read (page_buffer.c:8239).
  • fileio_write_or_add_to_dwb (in file_io.c line 4014) — secondary producer for direct-write paths.
  • fileio_synchronize_volume_and_dwb (in file_io.c line 2844, 2912, 3126, 3332 — multiple call sites) — calls dwb_synchronize to drain pending DWB before per-volume fsync.
  • fileio_synchronize_all (in file_io.c line 4642) — calls dwb_flush_force before sweeping volume fsyncs.
  • boot_restart_server (in boot_sr.c line 2403) — calls dwb_load_and_recover_pages between vacuum init and pgbuf_daemons_init.
  • boot_create_database (in boot_sr.c line 4908) — calls dwb_create after database parameter setup, before formatting the first volume.
  • dwb_is_created gates the disable-flush path in file_io.c:1882 (skip_flush = dwb_is_created ()) — per-page fsyncs during volume creation are skipped when the DWB will handle durability later.
| Symbol | File | Line |
| --- | --- | --- |
| DWB_SLOT (struct) | double_write_buffer.hpp | 33 |
| dwb_is_created (declaration) | double_write_buffer.hpp | 44 |
| dwb_create (declaration) | double_write_buffer.hpp | 45 |
| dwb_load_and_recover_pages (declaration) | double_write_buffer.hpp | 47 |
| dwb_destroy (declaration) | double_write_buffer.hpp | 48 |
| dwb_flush_force (declaration) | double_write_buffer.hpp | 50 |
| dwb_read_page (declaration) | double_write_buffer.hpp | 51 |
| dwb_set_data_on_next_slot (declaration) | double_write_buffer.hpp | 52 |
| dwb_add_page (declaration) | double_write_buffer.hpp | 54 |
| dwb_synchronize (declaration) | double_write_buffer.hpp | 57 |
| dwb_daemons_init (declaration) | double_write_buffer.hpp | 60 |
| DWB_MIN_SIZE / DWB_MAX_SIZE | double_write_buffer.cpp | 51-52 |
| DWB_MIN_BLOCKS / DWB_MAX_BLOCKS | double_write_buffer.cpp | 53-54 |
| DWB_POSITION_MASK | double_write_buffer.cpp | 70 |
| DWB_BLOCKS_STATUS_MASK | double_write_buffer.cpp | 73 |
| DWB_MODIFY_STRUCTURE flag | double_write_buffer.cpp | 76 |
| DWB_CREATE flag | double_write_buffer.cpp | 79 |
| struct double_write_wait_queue_entry | double_write_buffer.cpp | 176 |
| struct double_write_wait_queue | double_write_buffer.cpp | 184 |
| enum FLUSH_VOLUME_STATUS | double_write_buffer.cpp | 196 |
| struct flush_volume_info | double_write_buffer.cpp | 204 |
| struct double_write_block | double_write_buffer.cpp | 215 |
| struct dwb_slots_hash_entry | double_write_buffer.cpp | 235 |
| struct double_write_buffer | double_write_buffer.cpp | 261 |
| dwb_Global (singleton) | double_write_buffer.cpp | 306 |
| slots_entry_Descriptor | double_write_buffer.cpp | 421 |
| dwb_init_wait_queue | double_write_buffer.cpp | 451 |
| dwb_signal_waiting_threads | double_write_buffer.cpp | 663 |
| dwb_power2_ceil | double_write_buffer.cpp | 732 |
| dwb_load_buffer_size | double_write_buffer.cpp | 769 |
| dwb_load_block_count | double_write_buffer.cpp | 795 |
| dwb_starts_structure_modification | double_write_buffer.cpp | 822 |
| dwb_ends_structure_modification | double_write_buffer.cpp | 924 |
| dwb_initialize_slot | double_write_buffer.cpp | 949 |
| dwb_initialize_block | double_write_buffer.cpp | 977 |
| dwb_create_blocks | double_write_buffer.cpp | 1009 |
| dwb_finalize_block | double_write_buffer.cpp | 1131 |
| dwb_create_internal | double_write_buffer.cpp | 1163 |
| dwb_slots_hash_insert | double_write_buffer.cpp | 1380 |
| dwb_destroy_internal | double_write_buffer.cpp | 1474 |
| dwb_set_status_resumed | double_write_buffer.cpp | 1521 |
| dwb_wait_for_block_completion | double_write_buffer.cpp | 1552 |
| dwb_signal_waiting_thread | double_write_buffer.cpp | 1643 |
| dwb_wait_for_strucure_modification | double_write_buffer.cpp | 1704 |
| dwb_compare_slots | double_write_buffer.cpp | 1781 |
| dwb_block_create_ordered_slots | double_write_buffer.cpp | 1845 |
| dwb_slots_hash_delete | double_write_buffer.cpp | 1883 |
| dwb_add_volume_to_block_flush_area | double_write_buffer.cpp | 1960 |
| dwb_write_block | double_write_buffer.cpp | 2007 |
| dwb_flush_block | double_write_buffer.cpp | 2192 |
| dwb_acquire_next_slot | double_write_buffer.cpp | 2468 |
| dwb_set_slot_data | double_write_buffer.cpp | 2612 |
| dwb_init_slot | double_write_buffer.cpp | 2642 |
| dwb_get_next_block_for_flush | double_write_buffer.cpp | 2659 |
| dwb_set_data_on_next_slot | double_write_buffer.cpp | 2686 |
| dwb_add_page | double_write_buffer.cpp | 2726 |
| dwb_synchronize | double_write_buffer.cpp | 2841 |
| dwb_is_created | double_write_buffer.cpp | 2909 |
| dwb_create | double_write_buffer.cpp | 2925 |
| dwb_recreate | double_write_buffer.cpp | 2967 |
| dwb_debug_check_dwb | double_write_buffer.cpp | 3013 |
| dwb_check_data_page_is_sane | double_write_buffer.cpp | 3091 |
| dwb_load_and_recover_pages | double_write_buffer.cpp | 3199 |
| dwb_destroy | double_write_buffer.cpp | 3403 |
| dwb_get_volume_name | double_write_buffer.cpp | 3440 |
| dwb_flush_next_block | double_write_buffer.cpp | 3459 |
| dwb_flush_force | double_write_buffer.cpp | 3514 |
| dwb_file_sync_helper | double_write_buffer.cpp | 3766 |
| dwb_read_page | double_write_buffer.cpp | 3969 |
| class dwb_flush_block_daemon_task | double_write_buffer.cpp | 4013 |
| dwb_file_sync_helper_execute | double_write_buffer.cpp | 4053 |
| dwb_flush_block_daemon_init | double_write_buffer.cpp | 4073 |
| dwb_file_sync_helper_daemon_init | double_write_buffer.cpp | 4087 |
| dwb_daemons_init | double_write_buffer.cpp | 4099 |
| dwb_daemons_destroy | double_write_buffer.cpp | 4109 |
| dwb_read_page use | page_buffer.c | 8239 |
| pgbuf flush → dwb_set_data_on_next_slot | page_buffer.c | 10548 |
| pgbuf flush → dwb_add_page | page_buffer.c | 10597 |
| fileio_write_or_add_to_dwb | file_io.c | 4014 |
| fileio_synchronize_all → dwb_flush_force | file_io.c | 4642 |
| fileio_make_dwb_name | file_io.c | 5882 |
| boot_restart_server → dwb_load_and_recover_pages | boot_sr.c | 2403 |
| boot_create_database → dwb_create | boot_sr.c | 4908 |

The DWB has no extant raw analyses in raw/; this is a source-only walkthrough. The cross-check below names invariants read off the live source as of 2026-04-30 and what the related page-buffer / log-manager / recovery docs (which all reference the DWB) say or imply about it.

  • The DWB is consulted before disk reads on page-table miss. cubrid-page-buffer-manager.md §“Double Write Buffer (DWB) — torn-page protection” sketches this. Verified at page_buffer.c:8239 — dwb_read_page is the first call in the on-miss read sequence; fileio_read is the fallback.

  • Producer-side staging is split: stage → WAL force → commit. cubrid-page-buffer-manager.md does not call this out, but the source path in page_buffer.c:10548-10597 shows the WAL force (logpb_flush_log_for_wal) sits between dwb_set_data_on_next_slot and dwb_add_page. The slot is acquired before WAL is forced, but the slot is not visible to readers (not in the hash, not yet flushed) until after WAL has been forced and dwb_add_page has committed.

  • DWB recovery happens before analysis / redo / undo. cubrid-recovery-manager.md describes the three-pass restart but does not say what runs before analysis. Verified: boot_restart_server at boot_sr.c:2403 calls dwb_load_and_recover_pages, then pgbuf_daemons_init + dwb_daemons_init, then log_recovery. The redo pass reads home pages whose corruption has already been repaired.

  • DWB write does not bypass WAL. Even though the DWB volume is durable on its own and could in principle be replayed without the redo log, CUBRID does not use it that way. dwb_flush_block does not consult the log; it simply staged pages whose underlying log records are also on disk (the WAL invariant was honored by the producer). Recovery uses the DWB to reconstruct the home pages, then redo replays log records onto the now-coherent home pages. The DWB is a pre-redo cleanup, not a redo substitute.

  • Per-block flush ordering is enforced. dwb_flush_block asserts at the top: (DWB_GET_PREV_BLOCK (flush_block->block_no)->version > flush_block->version) || ... meaning the previous block must have been flushed (its version is higher) before the current one starts. The block-write-started bits in position_with_flags are the synchronization primitive that supports this — the daemon picks next_block_to_flush and only that block is eligible.

  • The slot hash is consulted during recovery only via the ordered-slots snapshot, not the live hash. During dwb_load_and_recover_pages, the recovery block is built fresh; the production slots_hashmap does not exist yet (the recovery code initializes it via dwb_create only at the end of the recovery scan). This is why dwb_block_create_ordered_slots runs over the whole recovery block — there is no concurrent producer to fight, and a qsort over a few thousand slots is cheap.

  • FILEIO_WRITE_NO_COMPENSATE_WRITE is the DWB’s seal on fileio_write. When the DWB is active, all pgbuf and file_io writes use this mode, which tells fileio_write to not perform per-page write-retry on partial writes — that retry would introduce its own torn-page risk. Without the DWB, FILEIO_WRITE_DEFAULT_WRITE is used, which does the retry. This is the load-bearing reason the DWB cannot be partially activated: either every data write skips fileio_write’s retry (because the DWB will recover) or none of them do.

  • The two daemons have different cadences. The dwb-flush-block daemon ticks every 1 ms; the dwb-file-sync helper ticks every 10 ms. The 1 ms cadence on the flusher matches its job (drain blocks as fast as possible); the 10 ms cadence on the helper is acceptable because its work (fsync large home volumes) is bounded by fsync latency anyway. Both daemons honor PRM_ID_ENABLE_DWB_FLUSH_THREAD — disabling that parameter reverts to inline flush by the producer that fills the last slot in a block.

  • The DWB volume id is reserved at LOG_DBDWB_VOLID (log_volids.hpp). It is not part of the disk manager’s permanent-volume range and is not visible in db_volumes_view. This isolates the DWB from disk-allocation paths that walk permanent volumes.

  1. TDE-encrypted pages in the DWB. pgbuf flush encrypts the page (tde_encrypt_data_page) into a stack buffer before calling dwb_set_data_on_next_slot. The slot therefore holds encrypted bytes. Recovery reads the same encrypted bytes from the DWB volume and writes them to the home location. Decryption would happen later, on the next pgbuf_fix of the page. Question: what happens if the TDE master key has been rotated between the crash and the restart? Is the slot’s IV/key generation tracked? The DWB_SLOT struct does not carry TDE bookkeeping fields, suggesting either (a) the IV is encoded in the page header itself (FILEIO_PAGE::prv) and survives the round-trip, or (b) key rotation requires a clean shutdown (DWB drained) first. Investigation: read tde_encrypt_data_page and tde_decrypt_data_page, check whether the page header carries the IV.

  2. DWB and parallel redo. cubrid-recovery-manager.md describes per-VPID parallel redo via log_recovery_redo_parallel.{cpp,hpp}. The DWB recovery runs before redo, single-threaded (the recovery block walk is sequential). But the redo phase, once started, could in principle race with daemon-driven DWB activity — if the page buffer flushes during redo, those flushes go through the DWB. Since pgbuf_daemons_init and dwb_daemons_init are both called between dwb_load_and_recover_pages and log_recovery, the flushers are live during redo. Question: are dirty pages produced by redo subject to the same DWB discipline? Yes — pgbuf_set_lsa + pgbuf_set_dirty in the redo path eventually triggers a flush, which eventually goes through the DWB. Implication: a crash during redo is recoverable the same way as a crash during normal operation — the DWB on the second restart contains pages that were re-applied by the first restart’s interrupted redo.

  3. Block-write-started bit reuse. With DWB_MAX_BLOCKS = 32, the block-status bitmask occupies bits 32-63 of position_with_flags. If a future change wants more blocks, the layout must be redesigned — dwb_starts_structure_modification already enforces “only one structure modification at a time” but does not address the layout cap. Question: is 32 the design ceiling forever, or is there a roadmap to widen it (e.g., a separate block-status word)? The assert (block_no < DWB_MAX_BLOCKS) appears in every block-bit macro; relaxing it would require auditing every macro.

  4. Same-LSA, different-block hash collision. The dwb_slots_hash_insert LSA_EQ arm has a debug-only assertion that, when same-VPID, same-LSA slots land in different blocks, the older block has a strictly smaller version. This is a guard against re-insert ordering bugs but the production path silently accepts the case. Question: is the production path guaranteed to flush the older block first (so the newer slot eventually wins the home write), or could a slow flush of the older block leave the home page at the older image even after the newer staging? The block flush ordering enforced by next_block_to_flush should prevent this, but the interaction with dwb_flush_force’s ad-hoc block selection needs tracing.

  5. dwb_synchronize semantics. The function name suggests it synchronizes the DWB with a single volume, but the body (error = dwb_flush_force (thread_p, &complete)) actually drains all pending DWB content. Is this intentional — the cheapest safe way to flush “everything that might touch this volume” — or a leftover from an earlier design where dwb_synchronize was per-volume? Investigation: git blame the function.

  6. fileio_fsync_pending short-circuit. dwb_synchronize bails early if fileio_fsync_pending() is true. Question: under what circumstances is that flag set, and does the short-circuit risk leaving DWB content unsynchronised at shutdown? The risk is low because shutdown forces drain anyway, but the comment says nothing.

  7. Disabling DWB at runtime. The header comments (“Activating/deactivating DWB while the server is alive, needs additional work”) suggest the DWB_NOT_CREATED_OR_MODIFYING paths are not fully reliable. In practice, operators set PRM_ID_DWB_SIZE = 0 and restart. Question: is there a roadmap to support live DWB enable/disable, and what would it take? The producer paths in pgbuf and file_io would need to handle a transition state, which they currently don’t (they sample dwb_is_created once at the top).

  8. Interaction with copy-database (migrate) tools. dwb_synchronize is called in volume copy paths (file_io.c:2844, 2912, 3126, 3332) — does it need to be called by external tools (e.g., cubrid copydb)? The user’s view is that running tools while the server is live is forbidden, so external tools should not encounter a live DWB. But dwb_synchronize is structured as if defensive against being invoked from anywhere; this might be an artifact of an older design.

This document is source-only: there are no curated raw analyses for the DWB in raw/code-analysis/cubrid/storage/. The references below are CUBRID source paths and sibling docs in this knowledge base.

CUBRID source (/data/hgryoo/references/cubrid/)

Section titled “CUBRID source (/data/hgryoo/references/cubrid/)”
  • src/storage/double_write_buffer.hpp — public API.
  • src/storage/double_write_buffer.cpp — implementation.
  • src/storage/page_buffer.c — primary producer (flush path) and primary reader (fix path).
  • src/storage/file_io.c — secondary producer (fileio_write_or_add_to_dwb), volume name builder (fileio_make_dwb_name), and synchronisation call sites.
  • src/transaction/boot_sr.c — boot-time wiring (dwb_load_and_recover_pages before redo, dwb_create on database init).
  • src/transaction/log_volids.hpp — LOG_DBDWB_VOLID reservation.
  • knowledge/code-analysis/cubrid/cubrid-page-buffer-manager.md — page buffer’s flush path and DWB consultation on miss.
  • knowledge/code-analysis/cubrid/cubrid-log-manager.md — WAL invariant the DWB producer side honors.
  • knowledge/code-analysis/cubrid/cubrid-recovery-manager.md — three-pass restart whose redo runs after DWB repair.

Textbook chapters (under knowledge/research/dbms-general/)

Section titled “Textbook chapters (under knowledge/research/dbms-general/)”
  • Database Internals (Petrov), Ch. 5 §“Recovery”, §“Torn Pages” — torn-page problem and three escape hatches.
  • Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging (TODS 17.1, 1992) — WAL invariant the DWB pairs with.

Comparative engines (referenced in the design space)

Section titled “Comparative engines (referenced in the design space)”
  • MySQL InnoDB Doublewrite_log / dblwr — the canonical DWB design, the one CUBRID’s most closely resembles.
  • MariaDB / Percona Server parallel doublewrite — a multi-block parallel variant.
  • PostgreSQL full_page_writes = on — the alternative full-page-WAL approach.
  • SQL Server torn-page detection bits — a lighter-weight detect-only design that requires media restore on failure rather than DWB-driven repair.