CUBRID Prior List — Lock-Free Producer-Side WAL Queue Decoupling Per-TX Commit From the Log Writer

The write-ahead-log (WAL) protocol of Mohan et al.’s ARIES paper (ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, TODS 17.1, 1992) imposes two rules on the engine: log records describing a modification must reach stable storage before the dirty data page they describe (durability), and they must reach stable storage in LSN order (ordering). Database Internals (Petrov, ch. 5 §“Recovery”) repeats the contract and adds the engineering observation that the ordering rule is what makes WAL a concurrency problem — many threads producing records must agree on a single total order before anything is written.

Naïve implementations of that order-then-write step are catastrophic for throughput. The textbook design parks every appender on a global log latch for the duration of the copy-and-stamp work. The latch holds across a memcpy whose size is the record length and across a cursor advance. Mohan and DeWitt’s 1992 commit pipe research named the structural cure: split the pipeline. Producers build records outside the latch into a staging structure; a single drain thread is responsible for the latch-protected work of moving bytes into the authoritative log page.

The split has three moving parts. First, a queue from producers to drain — typically a multi-producer, single-consumer (MPSC) structure (Michael & Scott’s lock-free FIFO, 1996, is the textbook reference; production engines often use a short-held mutex around an enqueue-at-tail operation because the producer’s other work — record formatting, MVCC metadata stamping — happens outside the mutex). Second, a wakeup channel from producers to drain. Third, an ack channel from drain back to producers. The canonical answer for (2) and (3) is a condition variable plus a durable LSN watermark: committers sleep on the CV with a target LSN; the drain broadcasts on each advance. Group commit falls out of this naturally — many committers sleeping on one CV all wake from one fsync’s advance.
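
As a concrete reference for the three parts, the sketch below wires a mutex-guarded tail enqueue, a condition-variable ack channel, and a durable-LSN watermark into one toy class. It is a generic illustration under assumed names (WalQueue, durable_lsn, drain_once) — not CUBRID code — but it shows why one broadcast per drain acknowledges every committer at or below the new watermark.

// Sketch only: generic producer / drain / ack skeleton, not CUBRID code.
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <vector>

struct Record { uint64_t lsn; std::vector<char> bytes; };

class WalQueue {
  std::mutex q_mtx;                      // short-held: LSN claim + tail enqueue only
  std::vector<Record> staged;            // (1) producer -> drain queue
  std::mutex cv_mtx;
  std::condition_variable durable_cv;    // (2) + (3) wakeup and ack channel
  uint64_t next_lsn = 1;
  uint64_t durable_lsn = 0;              // watermark advanced by the drain

public:
  // Producer: the record is formatted *before* this call; only the LSN
  // assignment and the tail link happen under q_mtx.
  uint64_t append (std::vector<char> bytes) {
    std::lock_guard<std::mutex> g (q_mtx);
    uint64_t lsn = next_lsn++;
    staged.push_back ({lsn, std::move (bytes)});
    return lsn;
  }

  // Committer: park until the drain has made my LSN durable. Group commit:
  // one broadcast wakes every committer at or below the new watermark.
  void commit_wait (uint64_t my_lsn) {
    std::unique_lock<std::mutex> g (cv_mtx);
    durable_cv.wait (g, [&] { return durable_lsn >= my_lsn; });
  }

  // Drain (single consumer): detach the batch, write + fsync it (elided),
  // then advance the watermark and broadcast.
  void drain_once () {
    std::vector<Record> batch;
    { std::lock_guard<std::mutex> g (q_mtx); batch.swap (staged); }
    if (batch.empty ()) return;
    /* ... write batch to the log file and fsync ... */
    std::lock_guard<std::mutex> g (cv_mtx);
    durable_lsn = batch.back ().lsn;     // LSNs were assigned in queue order
    durable_cv.notify_all ();
  }
};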

The drain must keep up with producers in steady state, or the queue grows without bound. Real systems install a soft cap: when the queue exceeds a threshold, producers either help drain (the self-help pattern) or sleep on backpressure. The choice changes latency but not throughput, which is bounded by the disk’s effective write bandwidth.

Postgres’s XLogInsert is the latch-protected reference point: appenders take a WALInsertLock (8 striped locks in current versions), copy bytes directly into the WAL buffer pages, and release. No separate producer queue, no separate drain thread on the insert path. CUBRID’s prior list is the deliberate inversion: appenders never touch the authoritative log pages; they hand off LOG_PRIOR_NODE records to a queue that only the log-flush daemon (or a help-out path under the log critical section) ever drains into pages.

The patterns every WAL engine combines into its commit pipe repeat across PostgreSQL, InnoDB, Oracle, SQL Server, and CUBRID; only the partitioning of the work changes.

PostgreSQL XLogInsert — centralised, latch-protected

XLogInsert (in src/backend/access/transam/xlog.c) is the simplest model. An appender computes its record size, reserves space in the WAL, acquires one of the WALInsertLocks (a striped set of 8 in PG 9.4+, chosen per backend), copies bytes directly into the WAL buffer pages at the LSN it just claimed, and releases. On commit, XLogFlush(LSN) acquires WALWriteLock, calls pg_pwrite / pg_fsync on the WAL segment, and advances the flushed-LSN watermark (LogwrtResult.Flush). The cost: a WALInsertLock held across a memcpy whose size is the record length. The benefit: simplicity — no separate producer queue, no drain thread, and the record lands at its claimed LSN in a single step.
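
The cost structure is easy to see in a toy version. The sketch below is not PostgreSQL source — real XLogInsert reserves its insert position under a spinlock and stripes the copy across the eight WALInsertLocks — but it shows the shape the text describes: a lock held across a record-sized memcpy.

// Toy model of the centralised approach (not PostgreSQL code): one lock held
// across both the cursor claim and the record-sized copy.
#include <cstdint>
#include <cstring>
#include <mutex>

static std::mutex insert_lock;
static char wal_buffer[1 << 20];    // stand-in for the WAL buffer pages (no wrap handling)
static uint64_t insert_pos = 0;     // next free offset, i.e. the "LSN"

uint64_t insert_record (const char *rec, size_t len) {
  std::lock_guard<std::mutex> g (insert_lock);
  uint64_t lsn = insert_pos;                    // claim the position
  std::memcpy (wal_buffer + lsn, rec, len);     // copy *under* the lock
  insert_pos += len;                            // advance the cursor
  return lsn;                                   // commit later calls flush(lsn)
}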

MySQL InnoDB — group commit on top of a circular buffer

InnoDB writes redo records into the global log_buf ring buffer. Mini-transactions (mtr_t) batch records locally; on commit they copy into log_buf — under log_sys->mutex in pre-8.0 versions, via atomically reserved LSN ranges and the recent_written link buffer since 8.0. The log_writer thread drains log_buf into redo segments; the log_flusher issues fsync. Group commit emerges from os_event waits. The shape is closer to CUBRID’s (producer / drain / flusher are distinct stages), but the producer still writes into the authoritative buffer itself rather than handing records to a detachable queue.

Oracle — redo strands merged at LGWR

Oracle partitions redo generation into strands. Each strand has its own in-memory log buffer and its own ordering; LGWR interleaves strands at write time. Producers are partitioned across strands (by trid hash) to reduce per-strand mutex contention, and ordering across strands is reconstructed at LGWR time from sequence numbers. CUBRID gets away with a single prior list because all producers feed it under one short-held mutex that does not hold across record-byte movement.
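
A conceptual sketch of the strand idea (assumed names, not Oracle’s implementation): producers hash to a strand and take only that strand’s mutex; a cheap global sequence stamp lets the single writer restore the total order when it gathers the strands.

// Conceptual strand partitioning sketch: per-strand mutexes for producers,
// one writer that merges strands back into a single ordered stream.
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

struct StrandRecord { uint64_t seq; std::vector<char> bytes; };

constexpr int kStrands = 4;
std::mutex strand_mtx[kStrands];
std::vector<StrandRecord> strand_buf[kStrands];
std::atomic<uint64_t> global_seq{0};        // total order stamped at append time

void strand_append (uint64_t trid, std::vector<char> bytes) {
  int s = static_cast<int> (trid % kStrands);      // partition by trid hash
  uint64_t seq = global_seq.fetch_add (1);         // cheap global stamp
  std::lock_guard<std::mutex> g (strand_mtx[s]);   // per-strand, low contention
  strand_buf[s].push_back ({seq, std::move (bytes)});
}

// Writer side: gather every strand, re-sort by seq, emit one log stream.
std::vector<StrandRecord> writer_collect () {
  std::vector<StrandRecord> out;
  for (int s = 0; s < kStrands; ++s) {
    std::lock_guard<std::mutex> g (strand_mtx[s]);
    for (auto &r : strand_buf[s]) out.push_back (std::move (r));
    strand_buf[s].clear ();
  }
  std::sort (out.begin (), out.end (),
             [] (const StrandRecord &a, const StrandRecord &b) { return a.seq < b.seq; });
  return out;
}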

CUBRID — producer-side batched queue, single-consumer drain

CUBRID’s prior list is a single-consumer queue of LOG_PRIOR_NODE records. A producer:

  1. Builds its record into a fresh malloc’d LOG_PRIOR_NODE outside any global mutex, including the optional zlib compression of undo/redo payloads (the most expensive thing on the producer path).
  2. Takes prior_lsa_mutex, assigns the next monotonic LSN to the node, links it onto the tail, releases. The mutex hold time covers only LSN arithmetic and list-tail link manipulation.
  3. Optionally, if list_size exceeds the page-buffer capacity, wakes the log-flush daemon and yields 1 ms — the backpressure path.
  4. On commit, calls logpb_flush_pages(lsn), which schedules a force flush and parks the caller on the group-commit CV until nxio_lsa >= my_commit_lsn.

The log-flush daemon does the latch-protected work: LOG_CS_ENTER, logpb_prior_lsa_append_all_list (detach + copy each LOG_PRIOR_NODE into the authoritative LOG_PAGE ring buffer), logpb_flush_all_append_pages (write to disk, advance nxio_lsa), broadcast the group-commit CV.

flowchart LR
  subgraph PROD["Producers (transaction threads)"]
    direction TB
    T1["TX A\nbuild LOG_PRIOR_NODE,\nzlib-compress payload"]
    T2["TX B\nbuild LOG_PRIOR_NODE"]
    T3["TX C\nbuild LOG_PRIOR_NODE"]
  end

  subgraph QUEUE["Prior list (under prior_lsa_mutex)"]
    direction LR
    H["prior_list_header"] --> N1["NODE\nstart_lsa=L1"]
    N1 --> N2["NODE\nstart_lsa=L2"]
    N2 --> N3["NODE\nstart_lsa=L3"]
    N3 --> Tl["prior_list_tail"]
  end

  subgraph CONS["Consumer (log-flush daemon)"]
    direction TB
    D1["LOG_CS_ENTER"]
    D2["detach prior list\n(swap header/tail to NULL\nunder mutex)"]
    D3["copy each NODE into\nLOG_PAGE ring buffer"]
    D4["fileio_write_pages\n· fsync"]
    D5["advance nxio_lsa\nbroadcast gc_cond"]
    D1 --> D2 --> D3 --> D4 --> D5
  end

  T1 -->|prior_lsa_alloc_and_copy_*\nthen prior_lsa_next_record| H
  T2 --> H
  T3 --> H

  Tl -.->|drained by| D2

  subgraph WAITERS["Commit waiters"]
    W1["TX A wait\nuntil nxio_lsa >= L1"]
    W2["TX B wait\nuntil nxio_lsa >= L2"]
  end

  D5 -. broadcast .-> W1 & W2

CUBRID’s distinguishing claim is that the only mutex on the producer path is prior_lsa_mutex and its hold time is dominated by O(1) arithmetic, not by record-size memcpy. The expensive bytes are moved into the prior node before the mutex is taken (in prior_lsa_alloc_and_copy_*), and the equally expensive bytes are copied into authoritative log pages after the mutex is dropped (by the drain thread, under LOG_CS_OWN_WRITE_MODE).

The prior list has eight moving parts, walked here in producer-to-consumer order:

  1. The LOG_PRIOR_NODE shape that defines what is queued.
  2. The producer-side allocator that builds it.
  3. The producer-side attach that assigns the LSN and links it onto the tail.
  4. The prior-list state (prior_list_header, prior_list_tail, list_size, prior_lsa, prev_lsa) that the mutex protects.
  5. The drain detach that hands the producer-side list to a flush-side list.
  6. The drain copy that walks the flush-side list and fills authoritative LOG_PAGE frames.
  7. The commit-waiter mechanism that parks committers on nxio_lsa.
  8. The backpressure path that triggers when the queue grows past the page-buffer’s capacity.

The struct lives in log_append.hpp. Three byte buffers, three length fields, one forward link:

// log_prior_node — src/transaction/log_append.hpp
struct log_prior_node
{
  LOG_RECORD_HEADER log_header;   /* prev_tranlsa, back_lsa, forw_lsa, trid, type */
  LOG_LSA start_lsa;              /* assigned at attach time; for assertions */
  bool tde_encrypted;

  int data_header_length;         /* type-specific data header */
  char *data_header;

  int ulength; char *udata;       /* undo data (zlib-compressed if eligible) */
  int rlength; char *rdata;       /* redo data (zlib-compressed if eligible) */

  LOG_PRIOR_NODE *next;
};

The node is a malloc’d free-store object — not pooled — allocated in prior_lsa_alloc_and_copy_data / _crumbs (log_append.cpp:273 / :410) and freed in the drain (logpb_append_prior_lsa_list, log_page_buffer.c:3040). The malloc/free pair lives outside the prior-list mutex; only the link manipulation is mutex-protected. The node carries a fully-formed record header including prev_tranlsa (per-transaction back-link) and back_lsa (physical previous record), both filled by prior_lsa_start_append under the mutex because they depend on tdes->tail_lsa and prior_info.prev_lsa, which the attach mutates.

// log_prior_lsa_info — src/transaction/log_append.hpp
struct log_prior_lsa_info
{
  LOG_LSA prior_lsa;                        /* next LSA to assign */
  LOG_LSA prev_lsa;                         /* last attached node's LSA */

  LOG_PRIOR_NODE *prior_list_header;        /* singly-linked producer queue */
  LOG_PRIOR_NODE *prior_list_tail;
  INT64 list_size;                          /* bytes — backpressure trigger */

  LOG_PRIOR_NODE *prior_flush_list_header;  /* drain-owned hand-off */

  std::mutex prior_lsa_mutex;
};

Invariants:

  • prior_lsa is the next LSN — every attach copies it into start_lsa and advances it past the record’s serialised size (via log_prior_lsa_append_advance_when_doesnot_fit, _add_align, prior_lsa_append_data, which walk {pageid, offset} forward across page boundaries — sketched below).
  • prev_lsa is the previous LSN, used for the new node’s back_lsa before prior_lsa is advanced.
  • list_size is the cumulative byte count (sizeof(LOG_PRIOR_NODE) + data_header_length + ulength + rlength, summed); it triggers backpressure when it exceeds logpb_get_memsize().
  • prior_flush_list_header is owned by the drain side; it is NULL outside the drain CS. The drain detaches prior_list_header under the mutex, releases, then stages the detached list at prior_flush_list_header for the walk-and-copy loop.
  • log_Gl.prior_info is the singleton instance.
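
A minimal sketch of the cursor arithmetic behind the first bullet, under an assumed per-page payload size (the real advance helpers also handle record alignment and the page-header layout):

// Sketch of how a {pageid, offset} cursor advances past n bytes of payload,
// crossing page boundaries; LOGAREA_SIZE stands in for the usable bytes per
// log page (the real constant depends on the page header).
#include <cstdint>

struct ToyLsa { std::int64_t pageid; std::int16_t offset; };

constexpr std::int16_t LOGAREA_SIZE = 16 * 1024 - 48;   // assumed payload size

// Advance the cursor by 'length' bytes of serialised data, the way the
// prior-list attach advances prior_lsa before linking the node.
void lsa_advance (ToyLsa &lsa, int length)
{
  while (length > 0)
    {
      int room = LOGAREA_SIZE - lsa.offset;   // bytes left on the current page
      if (length < room)
        {
          lsa.offset += static_cast<std::int16_t> (length);   // fits here
          return;
        }
      length -= room;                         // spill onto the next page
      lsa.pageid += 1;
      lsa.offset = 0;
    }
}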

Producer step 1 — build the node outside the mutex

prior_lsa_alloc_and_copy_data (log_append.cpp:273) is the allocator for non-UNDOREDO records. It mallocs a node, zeros the fields, sets log_header.type, then dispatches on rec_type to a per-type generator (prior_lsa_gen_postpone_record, prior_lsa_gen_dbout_redo_record, prior_lsa_gen_2pc_prepare_record, prior_lsa_gen_end_chkpt_record, or the catch-all prior_lsa_gen_record). Each generator mallocs a data-header struct (sized to LOG_REC_POSTPONE / LOG_REC_DONETIME / etc.) and copies the caller’s data via prior_lsa_copy_undo_data_to_node / _redo_data_to_node (which both malloc(length) + memcpy).

Nothing in this path takes a mutex; nothing touches log_Gl.prior_info. The heavy lifting — formatting the record, possibly compressing it — is wholly outside the critical section.

prior_lsa_alloc_and_copy_crumbs (log_append.cpp:410) is the sister allocator for UNDOREDO-shaped records (the most common kind). It accepts undo and redo as arrays of LOG_CRUMB { length, data } so callers can hand in scattered buffers without pre-concatenating, and delegates to prior_lsa_gen_undoredo_record_from_crumbs (:651) which handles diff-encoding (LOG_DIFF_UNDOREDO_DATA), MVCC stamping (LOG_REC_MVCC_UNDOREDO), and zlib compression.

Compression is the load-bearing detail. UNDOREDO records larger than log_Zip_min_size_to_compress (default 255 bytes) are zlib-compressed via log_zip(zip_undo, ...) / log_zip(zip_redo, ...) on per-thread LOG_ZIP contexts (log_append_get_zip_undo / _redo:1725 / :1751). Compression is CPU-bound and can take tens of microseconds for kilobyte-sized records. If compression had to happen under prior_lsa_mutex, the queue would serialise on the slowest compressor. By doing it before the mutex, CUBRID lets N producers compress in parallel, each finishing with a fully-formed LOG_PRIOR_NODE ready to attach in O(1).
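
A sketch of that split under assumed shapes (ToyNode, build_node, attach_node are illustrative, not CUBRID’s signatures): the zlib call runs with no shared lock held, and the attach that follows touches only the tail pointer.

// Sketch of build-outside-mutex, attach-in-O(1). The compression step uses
// zlib's compress2/compressBound; everything else is a stand-in.
#include <mutex>
#include <vector>
#include <zlib.h>

struct ToyNode { std::vector<unsigned char> redo; ToyNode *next = nullptr; };

std::mutex prior_mutex;          // stands in for prior_lsa_mutex
ToyNode *tail = nullptr;

ToyNode *build_node (const unsigned char *redo, uLong redo_len)
{
  auto *node = new ToyNode;
  uLongf out_len = compressBound (redo_len);      // zlib's worst-case size
  node->redo.resize (out_len);
  // CPU-heavy step happens here, before any shared lock is taken.
  if (compress2 (node->redo.data (), &out_len, redo, redo_len,
                 Z_DEFAULT_COMPRESSION) == Z_OK)
    node->redo.resize (out_len);
  else
    node->redo.assign (redo, redo + redo_len);    // fall back to raw bytes
  return node;
}

void attach_node (ToyNode *node)
{
  std::lock_guard<std::mutex> g (prior_mutex);    // O(1) work only
  if (tail != nullptr) tail->next = node;
  tail = node;
}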

Producer step 2 — assign LSN and attach under the mutex

The attach is prior_lsa_next_record (log_append.cpp:1553) which delegates to prior_lsa_next_record_internal (:1357).

// prior_lsa_next_record_internal — src/transaction/log_append.cpp:1357 (condensed)
static LOG_LSA
prior_lsa_next_record_internal (THREAD_ENTRY *thread_p,
                                LOG_PRIOR_NODE *node,
                                LOG_TDES *tdes, int with_lock)
{
  if (with_lock == LOG_PRIOR_LSA_WITHOUT_LOCK)
    log_Gl.prior_info.prior_lsa_mutex.lock ();

  prior_lsa_start_append (thread_p, node, tdes);    /* assign start_lsa */
  LOG_LSA start_lsa = node->start_lsa;

  /* MVCC bookkeeping (chain prev_mvcc_op_log_lsa for vacuum) and
     tdes state stamping for SYSOP_START_POSTPONE / SYSOP_END /
     COMMIT_WITH_POSTPONE / SYSOP_ATOMIC_START / COMMIT / ABORT */
  if (LOG_IS_MVCC_OP_RECORD_TYPE (node->log_header.type) || ...)
    {
      prior_update_header_mvcc_info (start_lsa, mvccid);
    }

  /* advance prior_lsa past data_header + udata + rdata */
  log_prior_lsa_append_advance_when_doesnot_fit (node->data_header_length);
  log_prior_lsa_append_add_align (node->data_header_length);
  if (node->ulength > 0) prior_lsa_append_data (node->ulength);
  if (node->rlength > 0) prior_lsa_append_data (node->rlength);
  prior_lsa_end_append (thread_p, node);            /* fill forw_lsa */

  /* link the node onto the tail */
  if (log_Gl.prior_info.prior_list_tail == NULL)
    {
      log_Gl.prior_info.prior_list_header = log_Gl.prior_info.prior_list_tail = node;
    }
  else
    {
      log_Gl.prior_info.prior_list_tail->next = node;
      log_Gl.prior_info.prior_list_tail = node;
    }
  log_Gl.prior_info.list_size +=
    sizeof (LOG_PRIOR_NODE) + node->data_header_length + node->ulength + node->rlength;

  if (with_lock == LOG_PRIOR_LSA_WITHOUT_LOCK)
    {
      log_Gl.prior_info.prior_lsa_mutex.unlock ();

      /* backpressure: queue too big -> wake daemon and yield, or self-help drain */
      if (log_Gl.prior_info.list_size >= (INT64) logpb_get_memsize ())
        {
          perfmon_inc_stat (thread_p, PSTAT_PRIOR_LSA_LIST_MAXED);
#if defined(SERVER_MODE)
          if (!log_is_in_crash_recovery ())
            {
              log_wakeup_log_flush_daemon ();
              thread_sleep (1);
            }
          else
            {
              LOG_CS_ENTER (thread_p);
              logpb_prior_lsa_append_all_list (thread_p);
              LOG_CS_EXIT (thread_p);
            }
#else
          LOG_CS_ENTER (thread_p);
          logpb_prior_lsa_append_all_list (thread_p);
          LOG_CS_EXIT (thread_p);
#endif
        }
    }

  tdes->num_log_records_written++;
  return start_lsa;
}

The function does five things under the mutex: assign start_lsa from prior_lsa; update MVCC bookkeeping (prev_mvcc_op_log_lsa chain, vacuum info, tdes->rcv.atomic_sysop_start_lsa reset, tdes->commit_abort_lsa stamp); advance prior_lsa past the record’s serialised size; link the node onto the tail; bump list_size. None involves bytes — all pointer and integer arithmetic — so the mutex hold time is bounded by a small constant number of cache-line accesses.

prior_lsa_start_append (:1593) fills node->start_lsa = prior_lsa and the back-links (prev_tranlsa = tdes->tail_lsa, back_lsa = prior_info.prev_lsa), then advances prior_lsa past the header. prior_lsa_end_append (:1652) fills forw_lsa = prior_info.prior_lsa after the data has advanced prior_lsa. The node’s header has correct prev/back/forw pointers as if the record had already been serialised.

Two consequences. (a) The LSN returned to the caller is the LSN that record will have when it reaches disk; the caller can stamp it on a data page (pgbuf_set_lsa), use it as a savepoint LSN, hand it to an HA replica. (b) A producer that has finished prior_lsa_next_record has no further work on the WAL path; the bytes will be moved by the drain, the page flushed by the daemon, the durable LSN advanced without the producer’s involvement.

prior_lsa_next_record_with_lock (:1559) is the sibling that expects the caller to already hold prior_lsa_mutex — used during checkpoint where the mutex must wrap LOG_END_CHKPT stamping so no record interleaves between start and end LSAs.

The drain side — logpb_prior_lsa_append_all_list

The drain runs under LOG_CS_OWN_WRITE_MODE — the global log critical section. Three callers reach it: the log-flush daemon (via logpb_flush_pages_direct), any thread that LOG_CS_ENTERs and calls logpb_flush_pages_direct directly (e.g., the backpressure self-help path in standalone mode or during crash recovery, or a log_no_logging forced flush at commit), and the partial-record flush path (logpb_append_next_record calls logpb_flush_all_append_pages when a record straddles a buffer boundary).

// logpb_prior_lsa_append_all_list — src/transaction/log_page_buffer.c:3106
int
logpb_prior_lsa_append_all_list (THREAD_ENTRY * thread_p)
{
  LOG_PRIOR_NODE *prior_list;
  INT64 current_size;

  assert (LOG_CS_OWN_WRITE_MODE (thread_p));

  log_Gl.prior_info.prior_lsa_mutex.lock ();
  current_size = log_Gl.prior_info.list_size;
  prior_list = prior_lsa_remove_prior_list (thread_p);
  log_Gl.prior_info.prior_lsa_mutex.unlock ();

  if (prior_list != NULL)
    {
      perfmon_add_stat (thread_p, PSTAT_PRIOR_LSA_LIST_SIZE,
                        (unsigned int) current_size / ONE_K);
      perfmon_inc_stat (thread_p, PSTAT_PRIOR_LSA_LIST_REMOVED);
      logpb_append_prior_lsa_list (thread_p, prior_list);
    }

  return NO_ERROR;
}

The detach is the simplest possible: under the mutex, snapshot list_size, swap prior_list_header to NULL, swap prior_list_tail to NULL, zero list_size, release. The mutex is held for exactly three pointer stores and one INT64 store. No bytes are moved while the mutex is held. The detached list is a private singly-linked list whose nodes are no longer reachable from any other thread.

prior_lsa_remove_prior_list (log_page_buffer.c:3084) is the helper that does the swap. It is called only from logpb_prior_lsa_append_all_list and asserts LOG_CS_OWN_WRITE_MODE.

logpb_append_prior_lsa_list (log_page_buffer.c:3040) is the walk:

// logpb_append_prior_lsa_list — src/transaction/log_page_buffer.c:3040
static int
logpb_append_prior_lsa_list (THREAD_ENTRY * thread_p, LOG_PRIOR_NODE * list)
{
  LOG_PRIOR_NODE *node;

  assert (LOG_CS_OWN_WRITE_MODE (thread_p));

  /* stage the detached list at the flush-side header */
  assert (log_Gl.prior_info.prior_flush_list_header == NULL);
  log_Gl.prior_info.prior_flush_list_header = list;

  while (log_Gl.prior_info.prior_flush_list_header != NULL)
    {
      node = log_Gl.prior_info.prior_flush_list_header;
      log_Gl.prior_info.prior_flush_list_header = node->next;

      logpb_append_next_record (thread_p, node);   /* copy bytes into LOG_PAGE */

      if (node->data_header) free_and_init (node->data_header);
      if (node->udata) free_and_init (node->udata);
      if (node->rdata) free_and_init (node->rdata);
      free_and_init (node);
    }

  return NO_ERROR;
}

logpb_append_next_record (log_page_buffer.c:2981) is where the actual bytes finally meet the authoritative pages. It validates that node->start_lsa == log_Gl.hdr.append_lsa (the prior list’s LSAs match the page-side cursor — a sanity check that the queue was assembled in order); it pre-flushes if the next record would overflow the in-memory log buffer; it stamps the page’s tde_encrypted flag from the node; it calls logpb_start_append (writes the record header), then logpb_append_data for each of {data_header, udata, rdata}, then logpb_end_append.

The flush-side header pointer is not strictly needed — the drain could walk a local pointer just as well — but it is checked at many other places as a liveness signal (“a record is being appended; do not concurrently re-detach”). The assertion assert(prior_flush_list_header == NULL) at the top of logpb_append_prior_lsa_list enforces that only one drain runs at a time.

After each node’s bytes are copied into the page, the node’s malloc’d payload (data_header, udata, rdata, and the node itself) is freed. There is no node pool; allocation pressure falls on the C library’s malloc arenas. CUBRID’s monitoring exposes PSTAT_PRIOR_LSA_LIST_SIZE (the average bytes per drain) and PSTAT_PRIOR_LSA_LIST_REMOVED (the drain count) so an operator can tell whether the queue is being drained smoothly.

logpb_flush_all_append_pages (log_page_buffer.c:3232) is the write-to-disk half. It walks the log-page-buffer’s dirty list, issues fileio_write_pages to the active log volume, and advances log_Gl.append.nxio_lsa. The function is large (≈ 600 lines) because it has to handle two-step flushing of partial records: it flushes everything except the page where the most-recent record header lives, then flushes that header page last, so a crash mid-flush leaves either an old end-of-log marker or a new one — never a dangling forward pointer. The two-step dance is documented inline at :3355-3380.
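
The ordering idea, greatly simplified and with stand-in types (ToyPage and write_and_fsync are illustrative, not CUBRID symbols): everything except the page holding the newest record header is written and fsync’d first, then that page last.

// Sketch of the two-step flush ordering, not logpb_flush_all_append_pages.
#include <vector>

struct ToyPage { long pageid; /* ... page image ... */ };

// Stand-in for fileio_write_pages + fsync of the active log volume.
static void write_and_fsync (const std::vector<ToyPage> & /*pages*/)
{
  /* pwrite() each page, then fsync() the log volume (elided) */
}

void two_step_flush (std::vector<ToyPage> dirty, long header_pageid)
{
  std::vector<ToyPage> first, last;
  for (auto &p : dirty)
    (p.pageid == header_pageid ? last : first).push_back (p);

  write_and_fsync (first);   // step 1: everything except the header page
  write_and_fsync (last);    // step 2: the newest record-header page, made visible last
}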

logpb_force_flush_pages (log_page_buffer.c:4096) is the bare-bones force path:

// logpb_force_flush_pages — src/transaction/log_page_buffer.c:4096
void
logpb_force_flush_pages (THREAD_ENTRY * thread_p)
{
  LOG_CS_ENTER (thread_p);
  logpb_flush_pages_direct (thread_p);
  LOG_CS_EXIT (thread_p);
}

// logpb_flush_pages_direct — src/transaction/log_page_buffer.c:3952
void
logpb_flush_pages_direct (THREAD_ENTRY * thread_p)
{
  assert (LOG_CS_OWN_WRITE_MODE (thread_p));

  logpb_prior_lsa_append_all_list (thread_p);       /* drain the prior list */
  (void) logpb_flush_all_append_pages (thread_p);   /* write pages to disk */
  log_Stat.direct_flush_count++;
}

logpb_flush_pages_direct is the canonical “make sure everything in the prior list is on disk” entry point. It first drains the prior list into the page buffer, then writes the dirty pages out and fsyncs them. The two-stage call mirrors the producer/consumer split: the drain is the queue-to-page step, the flush is the page-to-disk step.

log_commit → log_commit_local → … → logpb_flush_pages(commit_lsa) (log_page_buffer.c:3980) is the committer’s path. The function implements an explicit four-quadrant policy on (PRM_ID_LOG_ASYNC_COMMIT, group_commit):

| async | gc  | Behaviour |
|-------|-----|-----------|
| no    | no  | Wake LFT, wait on gc_cond until nxio_lsa >= flush_lsa. |
| no    | yes | Wait on gc_cond (LFT ticks every log_get_log_group_commit_interval). |
| yes   | no  | Wake LFT, return immediately (commit acked before durable). |
| yes   | yes | Return immediately, LFT will tick. |

The synchronous-with-group-commit case is the interesting one: the committer does not wake the daemon. The daemon’s looper has a timed period; each daemon tick batches all committers waiting on gc_cond since the previous tick. Fewer wakeups → larger batches. That is the group-commit win.

The committer parks in a CV loop:

// logpb_flush_pages — log_page_buffer.c:3980 (committer wait, condensed)
nxio_lsa = log_Gl.append.get_nxio_lsa ();
while (LSA_LT (&nxio_lsa, flush_lsa))
  {
    timespec to = now () + max_wait_time_in_msec;   /* 1 second */

    pthread_mutex_lock (&gc->gc_mutex);
    nxio_lsa = log_Gl.append.get_nxio_lsa ();
    if (LSA_GE (&nxio_lsa, flush_lsa))
      {
        pthread_mutex_unlock (&gc->gc_mutex);
        break;
      }
    if (need_wakeup_LFT)
      log_wakeup_log_flush_daemon ();
    pthread_cond_timedwait (&gc->gc_cond, &gc->gc_mutex, &to);
    pthread_mutex_unlock (&gc->gc_mutex);

    need_wakeup_LFT = true;
    nxio_lsa = log_Gl.append.get_nxio_lsa ();
  }

The group-commit ack is pthread_cond_broadcast (&gc_cond) issued by the daemon’s log_flush_execute (log_manager.c:10377) after logpb_flush_pages_direct returns — i.e., after the drain and the page flush have both completed and nxio_lsa has been advanced:

// log_flush_execute — log_manager.c:10377
static void
log_flush_execute (cubthread::entry & thread_ref)
{
  if (!BO_IS_SERVER_RESTARTED () || !log_Flush_has_been_requested)
    return;

  LOG_CS_ENTER (&thread_ref);
  logpb_flush_pages_direct (&thread_ref);   /* drain + write to disk */
  LOG_CS_EXIT (&thread_ref);

  log_Stat.gc_flush_count++;

  pthread_mutex_lock (&log_Gl.group_commit_info.gc_mutex);
  pthread_cond_broadcast (&log_Gl.group_commit_info.gc_cond);
  log_Flush_has_been_requested = false;
  pthread_mutex_unlock (&log_Gl.group_commit_info.gc_mutex);
}

The daemon is registered with a cubthread::looper whose period is log_get_log_group_commit_interval — a function pointer to the live config value, so changing the parameter propagates without restart.
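
A sketch of the “period is a function, not a constant” idea (generic C++, not the cubthread API): the loop re-evaluates the interval on every tick, so a runtime parameter change is picked up without re-registering the daemon.

// Sketch only: a periodic daemon whose interval is re-read each iteration.
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

std::atomic<int> group_commit_interval_ms{20};   // stand-in for the live system parameter

std::chrono::milliseconds get_group_commit_interval ()
{
  return std::chrono::milliseconds (group_commit_interval_ms.load ());
}

void run_flush_daemon (std::function<std::chrono::milliseconds ()> period,
                       std::function<void ()> body,
                       std::atomic<bool> &shutdown)
{
  while (!shutdown.load ())
    {
      std::this_thread::sleep_for (period ());   // period re-evaluated each tick
      body ();                                   // drain + flush + broadcast
    }
}

Calling run_flush_daemon(get_group_commit_interval, flush_body, stop) means a change to group_commit_interval_ms takes effect on the very next iteration, which mirrors the behaviour the text describes for log_get_log_group_commit_interval.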

Backpressure — when producers grow the queue faster than the drain

The backpressure path is the “self-help” pattern at the tail of prior_lsa_next_record_internal:

// excerpt from prior_lsa_next_record_internal — log_append.cpp:1521
if (log_Gl.prior_info.list_size >= (INT64) logpb_get_memsize ())
  {
    perfmon_inc_stat (thread_p, PSTAT_PRIOR_LSA_LIST_MAXED);
#if defined(SERVER_MODE)
    if (!log_is_in_crash_recovery ())
      {
        log_wakeup_log_flush_daemon ();
        thread_sleep (1);   /* 1 millisecond */
      }
    else
      {
        LOG_CS_ENTER (thread_p);
        logpb_prior_lsa_append_all_list (thread_p);
        LOG_CS_EXIT (thread_p);
      }
#else
    LOG_CS_ENTER (thread_p);
    logpb_prior_lsa_append_all_list (thread_p);
    LOG_CS_EXIT (thread_p);
#endif
  }

logpb_get_memsize () returns the size of the in-memory log page buffer in bytes (the buffer typically holds a few thousand pages). When the queued bytes exceed that, the producer recognises that the drain is falling behind and:

  • in server mode (the normal case), wakes the flush daemon and yields the OS thread for 1 ms. The daemon does the drain; the producer comes back, observes the drained queue, and proceeds to its commit wait.
  • in crash recovery (no daemon yet) or standalone mode (no daemon at all), the producer takes the log critical section itself and drains. This is the “self-help” path — it is slower (the drain happens on the producer thread instead of in parallel) but correct.

The PSTAT_PRIOR_LSA_LIST_MAXED counter is the operator-visible signal that backpressure is firing. Steady-state firing means the producer load has saturated the daemon’s drain rate; the cure is either faster I/O, larger log page buffer, or async-commit / group-commit tuning.

State machine — one transaction’s commit

stateDiagram-v2
  [*] --> Active: BEGIN
  Active --> ProducerBuilding: log_append_undoredo_data\n(or any append-API call)
  ProducerBuilding --> ProducerAttaching: prior_lsa_alloc_and_copy_*\n(zlib compress, malloc node)
  ProducerAttaching --> Active: prior_lsa_next_record\n(LSN assigned, node on tail)
  Active --> ProducerBuilding: more log records
  Active --> Committing: log_commit
  Committing --> CommitNodeAttached: append LOG_COMMIT,\nattach to prior list
  CommitNodeAttached --> Waiting: logpb_flush_pages(commit_lsa)
  Waiting --> Waiting: nxio_lsa < commit_lsa\n(timed wait on gc_cond)
  Waiting --> Durable: nxio_lsa >= commit_lsa\n(daemon broadcast)
  Durable --> [*]: log_complete, ack client

  state ProducerBuilding {
    [*] --> Allocating
    Allocating --> Compressing: large UNDOREDO?
    Compressing --> Filling
    Allocating --> Filling: small record
    Filling --> [*]
  }

  state Waiting {
    [*] --> CheckLsn
    CheckLsn --> WakeLFT: need_wakeup_LFT
    WakeLFT --> Sleep
    CheckLsn --> Sleep: !need_wakeup_LFT
    Sleep --> CheckLsn: cond_timedwait wake
  }

The diagram makes explicit that the producer phase has no durable-LSN dependency — the transaction is allowed to keep producing records until the commit; only at commit does the transaction enter Waiting, and only then does its progress depend on the drain’s progress.

Comparative Mermaid — Postgres central latch vs CUBRID prior list

flowchart TB
  subgraph PG["PostgreSQL XLogInsert (centralised, latch-protected)"]
    direction TB
    PGT1["TX A"]
    PGT2["TX B"]
    PGT3["TX C"]
    PGL["WALInsertLock\n(8 striped locks)"]
    PGB["WAL buffer pages\n(authoritative)"]
    PGW["WAL writer\n(XLogFlush)"]
    PGF["WAL segment file"]

    PGT1 -->|hold lock| PGL
    PGT2 -->|hold lock| PGL
    PGT3 -->|hold lock| PGL
    PGL -->|memcpy under lock| PGB
    PGB --> PGW
    PGW --> PGF
  end

  subgraph CUB["CUBRID prior list (producer-side queued)"]
    direction TB
    CT1["TX A"]
    CT2["TX B"]
    CT3["TX C"]
    CN1["build NODE\n(zlib outside mutex)"]
    CN2["build NODE\n(zlib outside mutex)"]
    CN3["build NODE\n(zlib outside mutex)"]
    CL["prior_lsa_mutex\n(O(1) hold)"]
    CQ["prior_list\n(singly-linked queue)"]
    CD["log-flush daemon\n(LOG_CS_ENTER)"]
    CB["LOG_PAGE buffer\n(authoritative)"]
    CF["active log file"]

    CT1 --> CN1
    CT2 --> CN2
    CT3 --> CN3
    CN1 -->|attach under mutex| CL
    CN2 -->|attach under mutex| CL
    CN3 -->|attach under mutex| CL
    CL --> CQ
    CQ -->|drain| CD
    CD --> CB
    CB --> CF
  end

Two structural differences are highlighted by the diagram. (a) Postgres’s lock holds across the memcpy of record bytes; CUBRID’s mutex holds across O(1) link-tail manipulation. (b) Postgres has no intermediate queue between the appender and the authoritative buffer; CUBRID has a queue. The price CUBRID pays is two-stage memory traffic — bytes are copied first into the prior node, then later into the log page. The price Postgres pays is contention serialised on the lock. In low-concurrency CPU-bound workloads Postgres wins; in high-concurrency I/O-bound workloads CUBRID’s design has more headroom.

Failure cases — crash with non-empty prior list

Records in the prior list at crash time are in main memory only — not on disk. The owning transaction has not received a commit ack (it would still be parked on gc_cond if it had finished its append). On recovery:

  • The active log file ends at whatever LSN was durably written before the crash. LSNs beyond that never existed.
  • Recovery’s analysis pass finds no LOG_COMMIT for the transaction; it is marked active at crash time. Undo rolls back whatever the transaction did manage to log durably.
  • Data pages obey the WAL invariant: they cannot be fresher than the durable log, so undo can roll them back.

The crucial property: a transaction never sees a commit ack unless its LOG_COMMIT record is durable, because log_commit calls logpb_flush_pages(commit_lsa), which blocks until nxio_lsa >= commit_lsa. A non-empty prior list at crash therefore cannot contain an acknowledged commit; what is lost is uncommitted work, which is supposed to be lost. The malloc’d nodes that never drained simply vanish with the process.

sequenceDiagram
  participant TX as TX thread
  participant API as log_append_*
  participant PA as prior_lsa_alloc_and_copy_*
  participant PN as prior_lsa_next_record
  participant DR as logpb_prior_lsa_append_all_list
  participant FA as logpb_flush_all_append_pages
  participant FD as log_Flush_daemon
  participant DK as Disk

  TX->>API: log_append_undoredo_data(...)
  API->>PA: malloc node, format header,\nzlib-compress payload (outside mutex)
  PA-->>API: LOG_PRIOR_NODE *
  API->>PN: prior_lsa_mutex.lock()
  PN->>PN: assign start_lsa,\nstamp tdes / MVCC,\nlink to prior_list_tail,\nbump list_size
  PN->>PN: prior_lsa_mutex.unlock()
  PN-->>API: LSN
  API-->>TX: return LSN
  Note over TX: business logic continues
  TX->>API: log_commit(...)
  API->>PA: build LOG_COMMIT node
  API->>PN: attach commit node, get commit_lsa
  TX->>FD: logpb_flush_pages(commit_lsa)\n[wake daemon if !group_commit]
  TX->>TX: pthread_cond_timedwait(gc_cond)
  loop daemon iteration
    FD->>FD: LOG_CS_ENTER
    FD->>DR: logpb_prior_lsa_append_all_list
    DR->>DR: detach prior list under mutex
    DR->>DR: walk nodes,\nlogpb_append_next_record\n(copy bytes into LOG_PAGE)
    DR->>DR: free nodes
    FD->>FA: logpb_flush_all_append_pages
    FA->>DK: fileio_write_pages, fsync
    FA->>FA: nxio_lsa = append_lsa
    FD->>FD: LOG_CS_EXIT
    FD->>TX: pthread_cond_broadcast(gc_cond)
  end
  TX->>TX: nxio_lsa >= commit_lsa? yes
  API-->>TX: TRAN_UNACTIVE_COMMITTED

Anchor on symbol names, not line numbers. Lines drift.

Producer-side types and globals (log_append.hpp)

  • LOG_PRIOR_NODE — queued record (header, TDE flag, three byte buffers, forward link).
  • LOG_PRIOR_LSA_INFO — queue state (head/tail, size, prior_lsa, prev_lsa, mutex, flush-side header).
  • log_append_info — page-side cursor (vdes, atomic nxio_lsa, prev_lsa, log_pgptr, TDE-on-new-page flag).
  • LOG_PRIOR_LSA_LOCK — with/without-lock enum for the attach API.
  • LOG_DATA_ADDR — (vfid, page, offset) triple passed by every append call; high bits of offset carry partial-update flags (LOG_RV_RECORD_INSERT, _DELETE, _UPDATE_ALL, _UPDATE_PARTIAL).

Producer-side allocators and attach (log_append.cpp)

  • prior_lsa_alloc_and_copy_data — non-UNDOREDO allocator; dispatches to per-type generators (prior_lsa_gen_postpone_record / _dbout_redo_record / _2pc_prepare_record / _end_chkpt_record / _record).
  • prior_lsa_alloc_and_copy_crumbs — UNDOREDO allocator over scattered LOG_CRUMB arrays; delegates to prior_lsa_gen_undoredo_record_from_crumbs which handles MVCC stamping, zlib compression, diff-encoding.
  • prior_lsa_copy_{undo,redo}_data_to_node / _{undo,redo}_crumbs_to_node — buffer copy helpers (malloc + memcpy).
  • prior_lsa_next_record / _with_lock / _internal — the attach (assign LSN, stamp MVCC / sysop / commit state on tdes, advance prior_lsa, link node, fire backpressure on overflow).
  • prior_lsa_start_append — fills start_lsa, prev_tranlsa, back_lsa.
  • prior_lsa_end_append — fills forw_lsa.
  • prior_lsa_append_data — advances prior_lsa by length, crossing page boundaries.
  • log_prior_lsa_append_align / _advance_when_doesnot_fit / _add_align — alignment / boundary helpers.
  • prior_update_header_mvcc_info — chains MVCC operations through log_Gl.hdr.mvcc_op_log_lsa for vacuum.
  • prior_set_tde_encrypted / prior_is_tde_encrypted — TDE flag.
  • log_append_init_zip / _final_zip / log_append_get_zip_undo / _redo — per-thread zlib context lifecycle.
  • log_append_realloc_data_ptr / _get_data_ptr — per-thread scratch buffer for compression staging.
  • log_prior_has_worker_log_records — drain-side HA check.
  • LOG_RESET_APPEND_LSA / LOG_RESET_PREV_LSA / LOG_APPEND_PTR — global cursor manipulation used at boot / recovery.

Drain, flush, and daemon (log_page_buffer.c, log_manager.c)

  • logpb_prior_lsa_append_all_list — entry under LOG_CS_OWN_WRITE_MODE; detach + walk.
  • prior_lsa_remove_prior_list — detach (swap head/tail to NULL under mutex).
  • logpb_append_prior_lsa_list — walk-and-copy loop.
  • logpb_append_next_record — copy one node into authoritative LOG_PAGE; pre-flush on overflow.
  • logpb_next_append_page — page-boundary cross; dirty current page in flush-info.
  • logpb_flush_all_append_pages — write dirty pages, two-step partial-record dance, advance nxio_lsa.
  • logpb_flush_pages_direct — drain + flush in existing CS.
  • logpb_flush_pages — committer path with (async, group) matrix.
  • logpb_force_flush_pages / _header_and_pages — force-flush wrappers.
  • log_flush_execute — daemon body: drain + flush + broadcast.
  • log_wakeup_log_flush_daemon — set log_Flush_has_been_requested, call wakeup().
  • log_is_log_flush_daemon_available — presence check.
  • log_flush_daemon_init — create daemon with looper period log_get_log_group_commit_interval.

Public append API (consumed by the rest of the engine)

log_append_undoredo_data / _undoredo_crumbs (UNDOREDO), log_append_undo_data, log_append_redo_data, log_append_dboutside_redo, log_append_postpone, log_append_run_postpone, log_append_compensate (CLR), log_append_savepoint, log_append_ha_server_state, log_append_supplemental_* (CDC). Each builds a node via prior_lsa_alloc_and_copy_* and attaches via prior_lsa_next_record.

| Symbol | File | Line |
|--------|------|------|
| log_append_info | log_append.hpp | 73 |
| log_prior_node | log_append.hpp | 91 |
| log_prior_lsa_info | log_append.hpp | 112 |
| log_prior_has_worker_log_records | log_append.cpp | 152 |
| log_append_init_zip / _final_zip | log_append.cpp | 185 / 232 |
| prior_lsa_alloc_and_copy_data | log_append.cpp | 273 |
| prior_lsa_alloc_and_copy_crumbs | log_append.cpp | 410 |
| prior_lsa_copy_{undo,redo}_data_to_node | log_append.cpp | 493 / 524 |
| prior_lsa_copy_{undo,redo}_crumbs_to_node | log_append.cpp | 555 / 600 |
| prior_lsa_gen_undoredo_record_from_crumbs | log_append.cpp | 651 |
| prior_lsa_gen_postpone_record | log_append.cpp | 1062 |
| prior_lsa_gen_dbout_redo_record | log_append.cpp | 1109 |
| prior_lsa_gen_2pc_prepare_record | log_append.cpp | 1144 |
| prior_lsa_gen_end_chkpt_record | log_append.cpp | 1181 |
| prior_lsa_gen_record | log_append.cpp | 1217 |
| prior_update_header_mvcc_info | log_append.cpp | 1320 |
| prior_lsa_next_record_internal | log_append.cpp | 1357 |
| prior_lsa_next_record | log_append.cpp | 1553 |
| prior_lsa_next_record_with_lock | log_append.cpp | 1559 |
| prior_set_tde_encrypted / prior_is_tde_encrypted | log_append.cpp | 1565 / 1581 |
| prior_lsa_start_append | log_append.cpp | 1593 |
| prior_lsa_end_append | log_append.cpp | 1652 |
| prior_lsa_append_data | log_append.cpp | 1660 |
| log_append_get_zip_undo / _redo | log_append.cpp | 1725 / 1751 |
| log_prior_lsa_append_align and friends | log_append.cpp | 1892-1922 |
| logpb_next_append_page | log_page_buffer.c | 2630 |
| logpb_append_next_record | log_page_buffer.c | 2981 |
| logpb_append_prior_lsa_list | log_page_buffer.c | 3040 |
| prior_lsa_remove_prior_list | log_page_buffer.c | 3084 |
| logpb_prior_lsa_append_all_list | log_page_buffer.c | 3106 |
| logpb_flush_all_append_pages | log_page_buffer.c | 3232 |
| logpb_flush_pages_direct | log_page_buffer.c | 3952 |
| logpb_flush_pages | log_page_buffer.c | 3980 |
| logpb_force_flush_pages | log_page_buffer.c | 4096 |
| log_commit | log_manager.c | 5352 |
| log_wakeup_log_flush_daemon | log_manager.c | 10126 |
| log_flush_execute | log_manager.c | 10377 |
| log_flush_daemon_init | log_manager.c | 10493 |

Reconciliation with sibling docs

This section reconciles the prior-list-from-the-outside view given in sibling docs with what the source actually does.

  • vs cubrid-log-manager.md §“The prior list — staging before the page buffer”. The log-manager doc names the queue, the mutex, and the drain function correctly. What it does not zoom into is the producer-side separation of “build” (outside the mutex) from “attach” (under the mutex), nor the role of zlib compression in keeping the mutex hold time short. The log-manager doc leaves “what is the exact group-commit window policy” as an open question; the answer (verified above) is the four-quadrant (async_commit, group_commit) matrix in logpb_flush_pages plus the daemon’s looper period log_get_log_group_commit_interval. There is no count-based threshold; the policy is purely time-based plus on-demand wakeup.

  • vs cubrid-double-write-buffer.md §“Producer-side staging is split: stage → WAL force → commit”. The DWB doc notes that page-flush sequence is dwb_set_data_on_next_slot → WAL force → dwb_add_page. The WAL force is logpb_flush_log_for_wal, which threads through to logpb_flush_pages(page_lsa) — the same group-commit parking described here. The prior list is in the page-flush durability path: a dirty page cannot be home-written unless its log-LSN has been drained from the prior list and fsync’d.

  • vs cubrid-thread-worker-pool.md §“Daemon consumers”. The worker-pool doc lists log-flush with looper policy “INF (woken by commits)”. The “INF” claim is imprecise: log_flush_daemon_init registers a looper with log_get_log_group_commit_interval, a period function — WAIT_PERIODIC_FUNC mode, not WAIT_INF. The daemon ticks on the interval and responds to wakeups. The daemon body short-circuits if log_Flush_has_been_requested is false, so an empty tick does no work. Net behaviour: at most one flush per interval, plus on-demand flushes.

  • The drain runs inside the global log critical section. LOG_CS_ENTER acquires log_Gl.log_cs (a heavyweight csect_t) in writer mode. This is the outer lock; prior_lsa_mutex is the inner lock. Producers take only the inner; the drain takes both — outer first, inner briefly to detach. No path takes them in the opposite order.

  • The prior-list mutex’s hold time is bounded but not formalised. Under the mutex, prior_lsa_next_record_internal does per-record-type stamping (MVCC chain, sysop state, commit state, vacuum link, prior_lsa alignment). More than O(1) in the strict sense (one branch per record type), but bounded by a small constant — no loops over bytes, no loops over the list itself, no I/O.

  • Backpressure is not flow control. The list_size >= logpb_get_memsize() check is a soft cap: wake daemon, yield, continue. No hard cap blocks producers. A pathological producer can outpace the daemon; the symptom is OOM. In practice the disk’s effective bandwidth flow-controls producers via the WAL invariant on data pages.

Open questions

  1. Multiple consumer threads. The single-daemon design is enforced by assert (prior_flush_list_header == NULL) at the top of logpb_append_prior_lsa_list. Two daemons could not help because the destination LOG_PAGE buffer is also single-writer under LOG_CS. The right axis to widen is concurrent fsync (multiple log segments fsync’d in parallel, à la PostgreSQL’s segment striping), not multiple drain threads.

  2. Per-NUMA-node prior lists. Oracle’s redo strands shard the queue across NUMA nodes; CUBRID’s single mutex is a possible bottleneck on high-core machines. Sharding would require an LSN-merge step at drain time; LSNs would no longer be monotonic at attach time. Worth investigating only if prior_lsa_mutex shows measurable contention under load.

  3. The cost of malloc / free per node. Every node is malloc’d by the producer and free’d by the drain. A per-thread node pool (each producer keeps a small free-list, the drain returns nodes to their owner) would be NUMA-friendly and would remove an allocator hotspot if profiling shows malloc/free high in the profile; a sketch follows this list.

  4. The list_size accounting overhead. Every attach is one cache-line write to list_size. Switching to record count would lose precision because record sizes vary by orders of magnitude (tiny LOG_DUMMY_GENERIC vs multi-kilobyte UNDOREDO).

  5. prior_lsa_next_record_with_lock callers. The variant that expects an externally-held mutex is used during checkpoint. std::mutex is not recursive, so the caller must guarantee single acquisition — every caller needs auditing.

  6. TDE-encrypted node lifecycle. A LOG_PRIOR_NODE with tde_encrypted = true lands on a log page marked appending_page_tde_encrypted. If the master key rotates between node build and drain, encryption is by the old key but the page records the current generation. The prior list carries no IV bookkeeping; the IV mechanism warrants the same investigation as the DWB’s open question 1.

  7. Help-out drain in standalone mode. SA_MODE has no daemon; the producer drains itself when list_size overflows. The producer releases prior_lsa_mutex before entering LOG_CS, so recursion is impossible — but any future code path that holds the mutex into LOG_CS would deadlock.

  8. Async commit observability. With PRM_ID_LOG_ASYNC_COMMIT = true, a committer returns before its commit LSN is durable. log_Stat.async_commit_request_count counts requests; there is no “ack’d but not yet durable” gauge.
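
For open question 3 above, a hypothetical per-thread pool could look like the sketch below — none of these names exist in CUBRID. The producer pops from its own free-list instead of calling malloc; the drain returns the node to its owner’s list instead of free’ing it. A real version would also have to cope with a producer thread exiting while its nodes are still queued.

// Hypothetical per-thread node pool (illustrative only, not CUBRID code).
#include <mutex>

struct NodePool;

struct PoolNode
{
  PoolNode *next = nullptr;
  /* ... record payload pointers would live here ... */
  NodePool *owner = nullptr;           // so the drain can return it
};

struct NodePool
{
  std::mutex mtx;                      // the drain returns nodes concurrently
  PoolNode *free_list = nullptr;

  PoolNode *acquire ()
  {
    {
      std::lock_guard<std::mutex> g (mtx);
      if (free_list != nullptr)
        {
          PoolNode *n = free_list;     // hot path: reuse a drained node
          free_list = n->next;
          n->next = nullptr;
          return n;
        }
    }
    PoolNode *n = new PoolNode;        // cold path: grow the pool
    n->owner = this;
    return n;
  }

  static void release (PoolNode *n)    // called by the drain after the copy
  {
    NodePool *pool = n->owner;
    std::lock_guard<std::mutex> g (pool->mtx);
    n->next = pool->free_list;
    pool->free_list = n;
  }
};

thread_local NodePool tls_node_pool;   // one pool per producer thread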

Source-only document; no curated raw analyses focus on the prior list specifically.

CUBRID source (/data/hgryoo/references/cubrid/)

  • src/transaction/log_append.hpp — public types (LOG_PRIOR_NODE, LOG_PRIOR_LSA_INFO, log_append_info).
  • src/transaction/log_append.cpp — allocator and attach (prior_lsa_alloc_and_copy_*, prior_lsa_next_record*, prior_lsa_start_append, _end_append, _append_data).
  • src/transaction/log_record.hpp — LOG_RECORD_HEADER, LOG_RECTYPE, per-type record structs.
  • src/transaction/log_manager.c — daemon (log_flush_execute, log_wakeup_log_flush_daemon), commit pipeline (log_commit).
  • src/transaction/log_page_buffer.c — drain (logpb_prior_lsa_append_all_list, _append_prior_lsa_list, _append_next_record), flush (logpb_flush_all_append_pages, _force_flush_pages, _flush_pages_direct, _flush_pages).
  • src/transaction/log_postpone_cache.{hpp,cpp} — adjacent in-memory buffering for postpone records; not part of the prior list but conceptually related.
  • src/storage/page_buffer.c — WAL-invariant consumer via logpb_flush_log_for_wal.
  • cubrid-log-manager.md — full WAL discipline, LSA scheme, log record types, archive bookkeeping.
  • cubrid-double-write-buffer.md — page-flush durability that depends on prior-list drain.
  • cubrid-thread-worker-pool.md — daemon framework, cubthread::looper semantics, LOG_CS heavyweight csect_t.
  • cubrid-recovery-manager.md — recovery walks the prior list’s durable byproduct.
  • cubrid-mvcc.md — consumer of LOG_MVCC_* records the prior list stamps.
  • cubrid-page-buffer-manager.md — pgbuf_flush_check_log_lsa invariant.
  • Mohan et al., ARIES (TODS 17.1, 1992) — the WAL contract the prior list services.
  • Database Internals (Petrov), ch. 5 §“Recovery” / §“Write-Ahead Logging” — commit-pipe and group-commit framing.
  • Mohan & DeWitt commit-pipe research (1992) — structural argument for splitting appender from writer.
  • Michael & Scott, Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms (PODC 1996) — canonical MPSC FIFO; CUBRID’s prior list is its short-mutex equivalent.
  • Herlihy & Shavit, The Art of Multiprocessor Programming — lock-free queue patterns.
  • PostgreSQL XLogInsert / WALInsertLock — centralised, latch-protected; the contrast point.
  • MySQL InnoDB log_buf + log_writer / log_flusher — group commit on a circular buffer.
  • Oracle redo strands — multi-strand with merge-at-LGWR.
  • SQL Server log buffer + WAL writer — comparable to InnoDB.