CUBRID Transaction — TDES, Isolation, and Savepoints
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Source verification (as of 2026-05-01)
- Beyond CUBRID — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A transaction in the relational sense is the unit of atomicity and isolation the engine sells to the client. Database Internals (Petrov, ch. 5 §“Transactions”) frames the responsibilities hierarchically: ACID-A (atomicity) is owned by the recovery manager through WAL and CLRs; ACID-D (durability) by the log force-flush at commit; ACID-C (consistency) by the data-side constraints; and ACID-I (isolation) by the joint operation of the lock manager and the MVCC visibility machinery. The transaction module sits at the hub where those four threads meet: it owns the per-transaction state that lock manager, log manager, MVCC table, and recovery manager all read and write.
The unit of state is the transaction descriptor (TDES in CUBRID,
PROC in PostgreSQL, trx_t in InnoDB). The descriptor has a long
list of obligations:
- A stable identity (trid) that survives connection drops and appears in every log record.
- A lifecycle state (active, committed, aborted, in-doubt, …) that recovery’s analysis pass reconstructs at restart.
- An isolation level that gates what the access path’s visibility and lock acquisition look like.
- A history of LSAs — head, tail, undo-next, postpone-next, savepoint, top-op — that recovery and rollback need to walk the transaction backwards.
- A collection of side state — modified-classes registry, replication records, unique-index statistics, lob locators — that needs to be cleaned at commit / abort.
Two implementation choices the transaction model leaves open shape the rest of this document:
- Where the TDES lives and how it is named. The textbook answer is “in a fixed-size transaction table indexed by transaction index”. The variants are how the table is sized (static vs. elastic), how indices are reused, and what is hot vs. cold inside the descriptor. CUBRID picks a fixed-size table, allocates TDES slots from a contiguous area, and recycles slots through a hint_free_index.
- How isolation is enforced: at the access boundary, at the statement boundary, or both. SI engines enforce read isolation via snapshot acquisition; 2PL engines enforce it via lock acquisition; hybrid engines (CUBRID is one) acquire snapshots and take key-range locks for SERIALIZABLE / REPEATABLE READ. The isolation field on the TDES is the dispatch key.
Once those choices are named, every other piece of state on the TDES is in service of one of them.
Common DBMS Design
Every relational engine that supports nested rollbacks, isolation level toggling, and recovery uses the same handful of patterns around the transaction descriptor.
Per-transaction descriptor table
A fixed-size array of descriptors, indexed by a small integer
(“transaction index”). The trade-off: fixed size means client count
caps at table size, but indexing is O(1) and the memory layout is
cache-friendly. PostgreSQL uses MaxBackends-sized PROC array;
InnoDB has a trx_sys->trx_list; SQL Server uses a hash on tid.
CUBRID is in the fixed-array camp.
Isolation level as TDES enum
Isolation is a per-TDES enum with three or four values (READ COMMITTED, REPEATABLE READ, SERIALIZABLE; some engines add READ UNCOMMITTED). The value is read at three places: snapshot acquisition (which kind of MVCC snapshot to build), lock acquisition (whether to take key-range locks), and statement boundary (whether to release short-cursor locks early). Every engine has the same three-way switch.
Nested top-operations / system operations
Any operation that touches multiple pages atomically (B+Tree split, heap overflow allocation, schema mutation) needs a sub-transactional unit of recovery: a “system op” in CUBRID, a “subxact” in PostgreSQL, a “mtr” in InnoDB. The TDES carries a stack of in-progress system ops. Commit pops the top frame and merges its log range into the parent; abort pops and rolls back the frame. The stack is a recursion-style structure, not a tree.
Savepoints as named LSAs
A savepoint is a name attached to “the LSA of the most recent log
record at savepoint creation”. Rollback-to-savepoint is “undo all log
records between current tail LSA and the savepoint LSA”. The
implementation is a chain on the TDES: savept_lsa plus a
prv_savept chain in each savepoint log record. Re-establishing
savepoints in recovery happens by walking the chain.
Lifecycle as state machine, not flag soup
Naive engines use a bool committed and a bool aborted. Real
engines use an enum with 10+ states because 2PC and postpone
operations create transitions (“committed-with-postpone”,
“committed-informing-participants”, “2PC-prepared”,
“unilaterally-aborted”) that aren’t expressible with two booleans.
CUBRID’s TRAN_STATE enum has 15 values; PostgreSQL has fewer
(2PC is its own subsystem); InnoDB has fewer still (no 2PC server
side).
Theory ↔ CUBRID mapping
| Theoretical concept | CUBRID name |
|---|---|
| Transaction identifier | TRANID trid (in LOG_TDES) |
| Transaction descriptor | LOG_TDES (log_impl.h) |
| Transaction table | TRANTABLE log_Gl.trantable (log_impl.h) |
| Transaction state enum | TRAN_STATE — 15 states (log_comm.h) |
| Isolation level enum | DB_TRAN_ISOLATION aliased to TRAN_ISOLATION (compat/dbtran_def.h) |
| Active / committed / aborted predicates | LOG_ISTRAN_ACTIVE, LOG_ISTRAN_COMMITTED, LOG_ISTRAN_ABORTED macros (log_impl.h) |
| MVCC info (visibility, snapshot) | LOG_TDES::mvccinfo of type MVCC_INFO |
| Nested top-op stack | LOG_TDES::topops of type LOG_TOPOPS_STACK (log_impl.h) |
| Top-op log range | LOG_TOPOPS_ADDRESSES { lastparent_lsa; posp_lsa } per stack frame |
| Savepoint | LOG_TDES::savept_lsa + LOG_REC_SAVEPT chain |
| Postpone (deferred actions) | LOG_TDES::posp_nxlsa + log_postpone_cache m_log_postpone_cache |
| Modified-class registry | LOG_TDES::m_modified_classes of type tx_transient_class_registry |
| Per-tran B+Tree unique stats | LOG_TDES::m_multiupd_stats of type multi_index_unique_stats |
| 2PC coordinator info | LOG_TDES::coord of type LOG_2PC_COORDINATOR * (covered in cubrid-2pc.md) |
| 2PC global tran info | LOG_TDES::gtrinfo of type LOG_2PC_GTRINFO |
| Recovery-time TDES annotations | LOG_TDES::rcv of type LOG_RCV_TDES |
| Server-side commit entry | xtran_server_commit (transaction_sr.c) |
| Server-side abort entry | xtran_server_abort (transaction_sr.c) |
| Client-visible commit | tran_commit (transaction_cl.c) |
| Reusable index assignment | logtb_assign_tran_index (log_tran_table.c) |
| System op (sub-transaction) stack push | log_sysop_start (log_manager.c) |
CUBRID’s Approach
The transaction module’s four moving parts are the trantable that holds all live TDES, the TDES itself, the lifecycle state machine that the trantable’s entries traverse, and the system-op stack the TDES owns for sub-transactional rollback. We walk them in that order.
Overall structure
flowchart LR
subgraph CL["Client side (transaction_cl.c)"]
TCL["tm_Tran_index\ntm_Tran_isolation\ntm_Tran_ID"]
API["tran_commit\ntran_abort\ntran_savepoint_internal"]
TCL --> API
end
subgraph SR["Server side (transaction_sr.c)"]
XSC["xtran_server_commit"]
XSA["xtran_server_abort"]
XSV["xtran_server_savepoint"]
end
subgraph TT["Trantable (log_Gl.trantable)"]
HDR["TRANTABLE { num_total_indices, hint_free_index, all_tdes[] }"]
T1["LOG_TDES idx=1\ntrid=42"]
T2["LOG_TDES idx=2\ntrid=43"]
Tn["..."]
HDR --> T1
HDR --> T2
HDR --> Tn
end
subgraph LM["log_manager"]
LC["log_commit"]
LA["log_abort"]
LS["log_sysop_start / commit / abort"]
end
API -->|RPC| XSC
API -->|RPC| XSA
API -->|RPC| XSV
XSC --> LC
XSA --> LA
LC --> T1
LA --> T1
LS --> T1
The figure encodes the three boundaries the transaction module sits
at:
- Client / server: the client TDES is a thin shadow (tm_Tran_* globals); the heavy state lives server-side.
- TDES / trantable: TDES slots are owned by the trantable; lookups are O(1) by index.
- TDES / log: every commit / abort / savepoint / system-op call mutates the TDES and appends a log record; the two are kept in lockstep.
TDES — the descriptor
LOG_TDES in log_impl.h is the central data structure of the
transaction module. It is large; below is the load-bearing slice
with line-comment annotation.

    // LOG_TDES (src/transaction/log_impl.h)
    struct log_tdes
    {
      /* === MVCC and identity === */
      MVCC_INFO mvccinfo;           /* MVCC info: snapshot, MVCCID, sub-IDs */
      int tran_index;               /* Index into trantable */
      TRANID trid;                  /* Stable transaction identifier */

      /* === lifecycle === */
      bool isloose_end;
      TRAN_STATE state;             /* 15-value enum */
      TRAN_ISOLATION isolation;     /* READ_COMMITTED | REPEATABLE_READ | SERIALIZABLE */
      int wait_msecs;               /* Lock wait timeout */

      /* === LSA chain: these are the recovery anchors === */
      LOG_LSA head_lsa;             /* First record of this transaction */
      LOG_LSA tail_lsa;             /* Last record */
      LOG_LSA undo_nxlsa;           /* Next record to undo (compensate-aware) */
      LOG_LSA posp_nxlsa;           /* First / next postpone record */
      LOG_LSA savept_lsa;           /* Last user/system savepoint */
      LOG_LSA topop_lsa;            /* Last system op */
      LOG_LSA tail_topresult_lsa;   /* Last partial abort/commit */
      LOG_LSA commit_abort_lsa;     /* Commit/abort record (used by checkpoint) */

      /* === client identity, locking, and 2PC === */
      int client_id;
      int gtrid;                    /* Global tran ID for 2PC */
      CLIENTIDS client;
      SYNC_RMUTEX rmutex_topop;     /* Reentrant mutex serialising sysop begin/end */
      LOG_TOPOPS_STACK topops;      /* Active sub-transactional ops */
      LOG_2PC_GTRINFO gtrinfo;
      LOG_2PC_COORDINATOR *coord;   /* NULL unless this site is the 2PC coordinator */

      /* === per-transaction caches and stats === */
      int num_unique_btrees;
      multi_index_unique_stats m_multiupd_stats;
      volatile sig_atomic_t interrupt;
      tx_transient_class_registry m_modified_classes;
      int num_transient_classnames;
      int num_repl_records;
      struct log_repl *repl_records;
      LOG_LSA repl_insert_lsa;
      LOG_LSA repl_update_lsa;
      void *first_save_entry;
      int suppress_replication;
      struct lob_rb_root lob_locator_root;
      INT64 query_timeout;
      INT64 query_start_time;
      INT64 tran_start_time;
      XASL_ID xasl_id;
      LK_RES *waiting_for_res;      /* The lock-resource I'm blocked on, if any */
      int disable_modifications;
      TRAN_ABORT_REASON tran_abort_reason;
      int num_exec_queries;
      DB_VALUE_ARRAY bind_history[MAX_NUM_EXEC_QUERY_HISTORY];
      int num_log_records_written;
      LOG_TRAN_UPDATE_STATS log_upd_stats;
      bool has_deadlock_priority;
      bool block_global_oldest_active_until_commit;
      bool is_user_active;
      LOG_RCV_TDES rcv;             /* Recovery-time annotations only */
      log_postpone_cache m_log_postpone_cache;
      bool has_supplemental_log;
      char *ddl_sql_user_text;
      // ... member functions for sysop locking and oldest-mvccid pinning ...
    };

The struct is dense but cleanly stratified. The first block (mvcc
info, identity) is what every page access reads. The second block
(state, isolation, wait_msecs) is what lock acquisition and
visibility decisions read. The LSA chain is the recovery-side
contract: every TDES needs head_lsa/tail_lsa so analysis can
identify the transaction, undo_nxlsa for rollback, posp_nxlsa
for postpone replay, and savept_lsa / topop_lsa /
tail_topresult_lsa for partial-rollback semantics. The remaining
fields are bookkeeping: lock the transaction is waiting on, lobs to
clean up, replication records to ship, modified classes to invalidate
on commit.
Trantable — the table of TDES
The trantable in log_impl.h is a small header plus a contiguous
allocation area for TDES.

    // TRANTABLE (src/transaction/log_impl.h)
    struct trantable
    {
      int num_total_indices;              /* Capacity (configured at boot) */
      int num_assigned_indices;           /* Currently in use */
      int num_coord_loose_end_indices;
      int num_prepared_loose_end_indices;
      int hint_free_index;                /* Speeds up next assignment */
      volatile sig_atomic_t num_interrupts;
      LOG_ADDR_TDESAREA *area;            /* Linked list of TDES storage areas */
      LOG_TDES **all_tdes;                /* Indexed pointer table */
    };

The two important properties: (a) all_tdes is the lookup table
indexed by transaction index, so LOG_FIND_TDES(idx) is one
load. (b) area is a chain of contiguous allocations rather
than a single block, because the table can grow (logtb_grow_*
paths) without invalidating existing pointers.
logtb_assign_tran_index (log_tran_table.c:796) is the assigner.
It uses hint_free_index to find a free slot fast, allocates a new
area if needed, and initializes a fresh TDES — including
logtb_set_loose_end_chkpt_lsa and the first head_lsa. The
matching logtb_release_tran_index (1139) clears the TDES, releases
locks the transaction held, and updates hint_free_index.
The trantable lives behind the TR_TABLE_CS critical section
(csect_enter (CSECT_TRAN_TABLE)); writers (assign / release) take
it in write mode, readers (most TDES lookups) take it in read mode
or skip it entirely when going via direct index.
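The hint-driven slot assignment described above can be sketched in a few lines. This is a minimal illustration, not CUBRID's code: every `demo_` / `DEMO_` name is hypothetical, the table is a fixed array rather than a chain of areas, and no locking is shown (the real table sits behind its critical section).

```c
#include <assert.h>

#define DEMO_NUM_SLOTS 4          /* hypothetical capacity */
#define DEMO_NULL_INDEX (-1)

typedef struct demo_tdes
{
  int tran_index;                 /* slot number, or DEMO_NULL_INDEX when free */
  int trid;                       /* transaction identifier */
} DEMO_TDES;

typedef struct demo_trantable
{
  int num_total_indices;
  int num_assigned_indices;
  int hint_free_index;            /* where the next search starts */
  DEMO_TDES slots[DEMO_NUM_SLOTS];
} DEMO_TRANTABLE;

void
demo_table_init (DEMO_TRANTABLE *t)
{
  t->num_total_indices = DEMO_NUM_SLOTS;
  t->num_assigned_indices = 0;
  t->hint_free_index = 0;
  for (int i = 0; i < DEMO_NUM_SLOTS; i++)
    {
      t->slots[i].tran_index = DEMO_NULL_INDEX;
      t->slots[i].trid = 0;
    }
}

/* Returns the assigned index, or DEMO_NULL_INDEX when the table is full. */
int
demo_assign_index (DEMO_TRANTABLE *t, int trid)
{
  for (int probe = 0; probe < t->num_total_indices; probe++)
    {
      int i = (t->hint_free_index + probe) % t->num_total_indices;
      if (t->slots[i].tran_index == DEMO_NULL_INDEX)
        {
          t->slots[i].tran_index = i;
          t->slots[i].trid = trid;
          t->num_assigned_indices++;
          t->hint_free_index = (i + 1) % t->num_total_indices;
          return i;
        }
    }
  return DEMO_NULL_INDEX;
}

/* Frees a slot and points the hint at it, so the next assignment finds it first. */
void
demo_release_index (DEMO_TRANTABLE *t, int i)
{
  assert (i >= 0 && i < t->num_total_indices);
  assert (t->slots[i].tran_index == i);
  t->slots[i].tran_index = DEMO_NULL_INDEX;
  t->num_assigned_indices--;
  t->hint_free_index = i;
}
```

The hint only shortens the search; correctness comes from the free check on each slot, so a stale hint costs extra probes but never hands out an occupied slot.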
Lifecycle — the TRAN_STATE state machine
TRAN_STATE (log_comm.h) is the lifecycle enum. It has 15 values,
which matters: a “transaction” is more than alive/dead/in-progress
because postpone, 2PC, and unilateral abort each create their own
intermediate states the recovery analysis pass must distinguish.

    // TRAN_STATE (src/transaction/log_comm.h)
    typedef enum
    {
      TRAN_RECOVERY,                                    /* system tran for recovery */
      TRAN_ACTIVE,                                      /* normal in-flight */

      TRAN_UNACTIVE_COMMITTED,                          /* commit complete */
      TRAN_UNACTIVE_WILL_COMMIT,                        /* commit log written, force pending */
      TRAN_UNACTIVE_COMMITTED_WITH_POSTPONE,            /* committed, postpones running */
      TRAN_UNACTIVE_TOPOPE_COMMITTED_WITH_POSTPONE,     /* sysop-postpone variant */

      TRAN_UNACTIVE_ABORTED,                            /* user-initiated abort */
      TRAN_UNACTIVE_UNILATERALLY_ABORTED,               /* system aborted (crash) */

      TRAN_UNACTIVE_2PC_PREPARE,                        /* prepared, waiting decision */
      TRAN_UNACTIVE_2PC_COLLECTING_PARTICIPANT_VOTES,   /* coordinator phase 1 */
      TRAN_UNACTIVE_2PC_ABORT_DECISION,                 /* coordinator phase 2: abort */
      TRAN_UNACTIVE_2PC_COMMIT_DECISION,                /* coordinator phase 2: commit */
      TRAN_UNACTIVE_COMMITTED_INFORMING_PARTICIPANTS,   /* informing after commit */
      TRAN_UNACTIVE_ABORTED_INFORMING_PARTICIPANTS,     /* informing after abort */

      TRAN_UNACTIVE_UNKNOWN
    } TRAN_STATE;

The LOG_ISTRAN_* predicates in log_impl.h:143-183 collapse the
enum into the questions the rest of the engine asks: LOG_ISTRAN_ACTIVE
checks “is this a normal in-flight tran on a restarted server”,
LOG_ISTRAN_COMMITTED collapses 5 commit-side states,
LOG_ISTRAN_ABORTED collapses 4 abort-side states,
LOG_ISTRAN_2PC_IN_SECOND_PHASE collapses the second-phase 2PC
states. The collapsed views are how the recovery manager decides
“do we need to redo, undo, or finish 2PC”, and how callers like
logpb_checkpoint decide whether a TDES needs further attention.
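To make the idea of "collapsing" concrete, here is a small sketch over a reduced, hypothetical state enum. The grouping of states into each predicate is an assumption made for illustration; the real LOG_ISTRAN_* macros cover more states and may group them differently. Only the "active means in-flight on a restarted server" rule is taken from the text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced, hypothetical state set; not the real 15-value TRAN_STATE. */
typedef enum
{
  DEMO_ACTIVE,
  DEMO_WILL_COMMIT,
  DEMO_COMMITTED_WITH_POSTPONE,
  DEMO_COMMITTED,
  DEMO_ABORTED,
  DEMO_UNILATERALLY_ABORTED
} DEMO_STATE;

/* Assumed grouping: every commit-side state answers "yes". */
bool
demo_is_committed (DEMO_STATE s)
{
  switch (s)
    {
    case DEMO_WILL_COMMIT:
    case DEMO_COMMITTED_WITH_POSTPONE:
    case DEMO_COMMITTED:
      return true;
    default:
      return false;
    }
}

/* Assumed grouping: user abort and system abort look the same to callers. */
bool
demo_is_aborted (DEMO_STATE s)
{
  return s == DEMO_ABORTED || s == DEMO_UNILATERALLY_ABORTED;
}

/* "Active" means in-flight AND the server has finished restarting. */
bool
demo_is_active (DEMO_STATE s, bool server_restarted)
{
  return s == DEMO_ACTIVE && server_restarted;
}
```

Callers ask one yes/no question and never enumerate intermediate states themselves, which is why adding a state touches the predicates rather than every call site.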
Isolation — three levels, dispatched at access time
DB_TRAN_ISOLATION (compat/dbtran_def.h) is a 3-bit field:

    // DB_TRAN_ISOLATION (src/compat/dbtran_def.h)
    typedef enum
    {
      TRAN_UNKNOWN_ISOLATION = 0x00,

      TRAN_READ_COMMITTED = 0x04,     /* alias TRAN_REP_CLASS_COMMIT_INSTANCE, TRAN_CURSOR_STABILITY */
      TRAN_REPEATABLE_READ = 0x05,    /* alias TRAN_REP_READ */
      TRAN_SERIALIZABLE = 0x06,       /* alias TRAN_NO_PHANTOM_READ */

      TRAN_DEFAULT_ISOLATION = TRAN_READ_COMMITTED,
    } DB_TRAN_ISOLATION;

The value is set per-TDES (log_tdes::isolation), with a client-side
shadow in tm_Tran_isolation (transaction_cl.h). It is read at
three places:
- Snapshot acquisition (mvcc_satisfies_snapshot in cubrid-mvcc.md): READ COMMITTED reacquires per-statement; REPEATABLE READ holds the snapshot for the transaction; SERIALIZABLE behaves like REPEATABLE READ at the snapshot level but adds key-range locks.
- Lock acquisition (lock manager): SERIALIZABLE takes key-range locks at scan boundaries. REPEATABLE READ relies on MVCC for read-stability and only locks data being written. READ COMMITTED takes minimal locks; locks released at statement boundary.
- Statement boundary (xtran_*_query_end_*): for cursor-stability semantics READ COMMITTED releases its snapshot; the others retain.
The deliberate aliasing — TRAN_CURSOR_STABILITY is the same value
as TRAN_READ_COMMITTED — is a backward-compatibility seam: older
APIs named the levels by the locking-engine vocabulary
(cursor-stability, repeatable-class), newer code uses the
SQL-standard names. Both work, and both compile to the same dispatch.
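The three-way dispatch on the isolation value can be sketched as three small predicates. The enum values mirror the ones quoted above; the `demo_` function names are hypothetical and stand in for decisions that the real engine makes inside the MVCC, lock-manager, and query-end code paths.

```c
#include <assert.h>
#include <stdbool.h>

/* Values mirror the enum quoted above; DEMO_ marks this as a sketch. */
typedef enum
{
  DEMO_READ_COMMITTED = 0x04,
  DEMO_REPEATABLE_READ = 0x05,
  DEMO_SERIALIZABLE = 0x06,
  DEMO_CURSOR_STABILITY = DEMO_READ_COMMITTED   /* legacy alias, same value */
} DEMO_ISOLATION;

/* Snapshot acquisition: does each statement get a fresh snapshot? */
bool
demo_snapshot_per_statement (DEMO_ISOLATION iso)
{
  return iso == DEMO_READ_COMMITTED;
}

/* Lock acquisition: are key-range locks taken at scan boundaries? */
bool
demo_needs_key_range_locks (DEMO_ISOLATION iso)
{
  return iso == DEMO_SERIALIZABLE;
}

/* Statement boundary: is the snapshot dropped when the statement ends? */
bool
demo_release_snapshot_at_statement_end (DEMO_ISOLATION iso)
{
  return iso == DEMO_READ_COMMITTED;
}
```

Because the alias shares the value 0x04, code written against the older name takes exactly the same branch as code written against the SQL-standard name.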
Commit and abort — the standard paths
Server-side commit lands at xtran_server_commit (transaction_sr.c:71);
abort at xtran_server_abort (128). They are thin RPC wrappers that
forward to log_commit / log_abort in log_manager.c, which is
where the actual sequencing happens.

    // xtran_server_commit (src/transaction/transaction_sr.c, condensed)
    TRAN_STATE
    xtran_server_commit (THREAD_ENTRY *thread_p, bool retain_lock)
    {
      TRAN_STATE state;
      int tran_index = LOG_FIND_THREAD_TRAN_INDEX (thread_p);

      /* Guard rails: no in-flight queries, no held mutex stack. */
      // ... condensed ...

      state = log_commit (thread_p, tran_index, retain_lock);

      /* Fire post-commit triggers (replication, CDC supplemental flush). */
      // ... condensed ...
      return state;
    }

The path inside log_commit (covered in cubrid-log-manager.md):
append LOG_COMMIT_WITH_POSTPONE if any postpone records are
buffered, run pending postpones, append LOG_COMMIT,
force-flush, transition state to TRAN_UNACTIVE_COMMITTED,
release locks (or retain if retain_lock), free the TDES via
logtb_release_tran_index. log_abort is the mirror: append
LOG_ABORT, drive undo, release locks, free.
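The ordering of the commit steps is the part that matters, so the sketch below only records the sequence as a trace. It follows the order listed in the paragraph above; the `demo_` names and the trace structure are hypothetical, and nothing here touches a real log.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_MAX_STEPS 8

typedef struct demo_trace
{
  int count;
  const char *steps[DEMO_MAX_STEPS];
} DEMO_TRACE;

static void
demo_record (DEMO_TRACE *trace, const char *step)
{
  assert (trace->count < DEMO_MAX_STEPS);
  trace->steps[trace->count++] = step;
}

/* Walks the commit steps in the order the text lists them. */
void
demo_commit (DEMO_TRACE *trace, bool has_postpones, bool retain_lock)
{
  trace->count = 0;
  if (has_postpones)
    {
      demo_record (trace, "append LOG_COMMIT_WITH_POSTPONE");
      demo_record (trace, "run postpones");
    }
  demo_record (trace, "append LOG_COMMIT");
  demo_record (trace, "force-flush log");
  demo_record (trace, "state = TRAN_UNACTIVE_COMMITTED");
  if (!retain_lock)
    {
      demo_record (trace, "release locks");
    }
  demo_record (trace, "release tran index");
}
```

In this ordering the locks are released only after the force-flush, so no other transaction can observe the committed data before the commit record is durable.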
System ops — sub-transactional units of recovery
Operations that are atomic-as-a-group but touch many pages (B+Tree
splits, heap overflow allocation, schema mutations) use system
ops. A system op opens with log_sysop_start
(log_manager.c:3599), nests on the TDES’s topops stack, runs its
sub-operation, then commits with log_sysop_commit (3916) or aborts
with log_sysop_abort.

    // LOG_TOPOPS_STACK / LOG_TOPOPS_ADDRESSES (src/transaction/log_impl.h)
    struct log_topops_addresses
    {
      LOG_LSA lastparent_lsa;   /* Where the parent's log range was when this op began */
      LOG_LSA posp_lsa;         /* First postpone of this op */
    };

    struct log_topops_stack
    {
      int max;
      int last;                 /* -1 ⇒ no system op in progress */
      LOG_TOPOPS_ADDRESSES *stack;
    };

log_sysop_start pushes a new LOG_TOPOPS_ADDRESSES whose
lastparent_lsa is the parent’s current tail_lsa. While the system
op is active, the system op’s log records form a contiguous range
on the log; log_sysop_commit writes a LOG_SYSOP_END_COMMIT
record (with lastparent_lsa and prv_topresult_lsa for chaining)
that marks the range complete. log_sysop_abort writes
LOG_SYSOP_END_ABORT and walks the range backward applying undos.
The variants on log_sysop_end_* correspond to the union arms in
LOG_REC_SYSOP_END (covered in cubrid-log-manager.md):
- log_sysop_end_logical_undo: system op carries its own logical undo image (used for index splits where physical undo is not enough).
- log_sysop_end_logical_compensate: system op was undone, leaves a CLR pointer.
- log_sysop_end_logical_run_postpone: system op was used to drive a postpone.
Recovery is aware of these variants: the analysis pass categorises
sysop ranges by their LOG_SYSOP_END_TYPE, and the redo / undo
passes invoke the right path per variant. (Details in
cubrid-recovery-manager.md.)
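The push/pop discipline of the system-op stack can be sketched with integer LSAs. This is a simplified model under stated assumptions: a real LOG_LSA is a (page, offset) pair rather than a counter, the stack is heap-allocated and growable, and abort also writes compensation records, none of which is modelled here. All `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t DEMO_LSA;            /* simplified: one counter per log record */
#define DEMO_NULL_LSA ((DEMO_LSA) -1)
#define DEMO_MAX_DEPTH 8

typedef struct demo_topops_addresses
{
  DEMO_LSA lastparent_lsa;           /* parent's tail when this op began */
  DEMO_LSA posp_lsa;
} DEMO_TOPOPS_ADDRESSES;

typedef struct demo_tdes
{
  DEMO_LSA tail_lsa;
  int last;                          /* -1 means no system op in progress */
  DEMO_TOPOPS_ADDRESSES stack[DEMO_MAX_DEPTH];
} DEMO_TDES;

void
demo_tdes_init (DEMO_TDES *tdes)
{
  tdes->tail_lsa = DEMO_NULL_LSA;
  tdes->last = -1;
}

/* Stand-in for appending a log record: advance the tail, return the new LSA. */
DEMO_LSA
demo_append (DEMO_TDES *tdes)
{
  return ++tdes->tail_lsa;
}

void
demo_sysop_start (DEMO_TDES *tdes)
{
  assert (tdes->last + 1 < DEMO_MAX_DEPTH);
  tdes->last++;
  tdes->stack[tdes->last].lastparent_lsa = tdes->tail_lsa;
  tdes->stack[tdes->last].posp_lsa = DEMO_NULL_LSA;
}

/* Pops the frame and appends an end record; returns the frame's lastparent_lsa. */
DEMO_LSA
demo_sysop_commit (DEMO_TDES *tdes)
{
  assert (tdes->last >= 0);
  DEMO_LSA lastparent = tdes->stack[tdes->last].lastparent_lsa;
  tdes->last--;
  demo_append (tdes);                /* stand-in for LOG_SYSOP_END_COMMIT */
  return lastparent;
}

/* Pops the frame; returns how many records in its range would need undo. */
int
demo_sysop_abort (DEMO_TDES *tdes)
{
  assert (tdes->last >= 0);
  DEMO_LSA lastparent = tdes->stack[tdes->last].lastparent_lsa;
  int to_undo = (int) (tdes->tail_lsa - lastparent);
  tdes->last--;
  demo_append (tdes);                /* stand-in for LOG_SYSOP_END_ABORT */
  return to_undo;
}
```

The frame's lastparent_lsa is the only bookmark needed: everything the transaction logged after it belongs to the system op, which is what makes the range contiguous.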
Savepoints — named LSAs in a chain
Savepoint creation is a log_append_savepoint call
(log_manager.c, declared in log_manager.h:132) that emits a
LOG_SAVEPOINT record carrying the savepoint name. The TDES updates
savept_lsa to the new record’s LSA; the record carries a
prv_savept pointer to the previous savepoint, so the chain can be
walked at rollback-to-savepoint time.

    // LOG_REC_SAVEPT (src/transaction/log_record.hpp)
    struct log_rec_savept
    {
      LOG_LSA prv_savept;   /* Previous savepoint record */
      int length;           /* Savepoint name length follows */
    };

Rollback to savepoint:

    log_abort_partial(savepoint_name, savept_lsa)
      → walk savept_lsa chain to find named savepoint
      → undo from tail_lsa back to that savepoint
      → emit CLRs at each step
      → reset tail_lsa to the savepoint record's LSA

Savepoints come in two flavours marked by SAVEPOINT_TYPE
(transaction_cl.h:49): USER_SAVEPOINT (named explicitly via
SQL SAVEPOINT foo) and SYSTEM_SAVEPOINT (engine-internal, e.g.,
to bracket a statement so an error rolls back just that
statement).
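The chain walk that finds a named savepoint can be sketched over an in-memory array standing in for the log. The assumptions are loud ones: an LSA is modelled as an array index, each record stores its name directly, and the `demo_` names are hypothetical. The real code reads log pages and decodes LOG_REC_SAVEPT records.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NULL_LSA (-1)

typedef struct demo_log_record
{
  int is_savepoint;
  int prv_savept;           /* LSA of the previous savepoint, or DEMO_NULL_LSA */
  const char *name;         /* savepoint name; NULL for other records */
} DEMO_LOG_RECORD;

/* Walks the prv_savept chain from savept_lsa; returns the LSA of the newest
   savepoint with the given name, or DEMO_NULL_LSA if none matches. */
int
demo_find_savepoint (const DEMO_LOG_RECORD *log, int savept_lsa, const char *name)
{
  for (int lsa = savept_lsa; lsa != DEMO_NULL_LSA; lsa = log[lsa].prv_savept)
    {
      assert (log[lsa].is_savepoint);
      if (strcmp (log[lsa].name, name) == 0)
        {
          return lsa;
        }
    }
  return DEMO_NULL_LSA;
}

/* Number of records after the savepoint that a partial rollback must undo. */
int
demo_records_to_undo (int tail_lsa, int savepoint_lsa)
{
  return tail_lsa - savepoint_lsa;
}
```

Because the walk starts at the most recent savepoint, a name that was reused resolves to its latest definition.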
Lifecycle, end to end
stateDiagram-v2
[*] --> ACTIVE: logtb_assign_tran_index
ACTIVE --> WILL_COMMIT: log_commit\n(append LOG_COMMIT)
WILL_COMMIT --> COMMITTED_W_POSTPONE: postpones queued
COMMITTED_W_POSTPONE --> COMMITTED: log_do_postpone done
WILL_COMMIT --> COMMITTED: no postpones
ACTIVE --> ABORTED: log_abort\n(append LOG_ABORT, drive undo)
ACTIVE --> UNILATERALLY_ABORTED: crash detected\nin recovery analysis
ACTIVE --> PREPARED_2PC: log_2pc_prepare
PREPARED_2PC --> COMMIT_DECISION: coordinator says commit
PREPARED_2PC --> ABORT_DECISION: coordinator says abort
COMMIT_DECISION --> INFORMING_PARTICIPANTS_C: phase 2 send
ABORT_DECISION --> INFORMING_PARTICIPANTS_A: phase 2 send
INFORMING_PARTICIPANTS_C --> COMMITTED: all acks received
INFORMING_PARTICIPANTS_A --> ABORTED: all acks received
COMMITTED --> [*]: logtb_release_tran_index
ABORTED --> [*]
UNILATERALLY_ABORTED --> [*]
Each transition is named by its log-record emission: log_commit
emits LOG_COMMIT_WITH_POSTPONE or LOG_COMMIT; log_abort emits
LOG_ABORT; the 2PC paths (covered in cubrid-2pc.md) emit
LOG_2PC_* types. The analysis pass at recovery walks the log
forward, builds a per-TDES picture, and uses LOG_ISTRAN_* to
decide each TDES’s fate.
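The diagram's edges can be written down as a transition check. The sketch below encodes only the edges drawn in the state diagram, using hypothetical `DEMO_` names that follow the diagram's node labels rather than the real TRAN_STATE constants; the real engine does not necessarily route state changes through a single function like this.

```c
#include <assert.h>
#include <stdbool.h>

/* Node labels from the state diagram; not the real TRAN_STATE values. */
typedef enum
{
  DEMO_ACTIVE,
  DEMO_WILL_COMMIT,
  DEMO_COMMITTED_W_POSTPONE,
  DEMO_COMMITTED,
  DEMO_ABORTED,
  DEMO_UNILATERALLY_ABORTED,
  DEMO_PREPARED_2PC,
  DEMO_COMMIT_DECISION,
  DEMO_ABORT_DECISION,
  DEMO_INFORMING_C,
  DEMO_INFORMING_A
} DEMO_STATE;

/* Returns true when the diagram has an edge from `from` to `to`. */
bool
demo_can_transition (DEMO_STATE from, DEMO_STATE to)
{
  switch (from)
    {
    case DEMO_ACTIVE:
      return to == DEMO_WILL_COMMIT || to == DEMO_ABORTED
        || to == DEMO_UNILATERALLY_ABORTED || to == DEMO_PREPARED_2PC;
    case DEMO_WILL_COMMIT:
      return to == DEMO_COMMITTED_W_POSTPONE || to == DEMO_COMMITTED;
    case DEMO_COMMITTED_W_POSTPONE:
      return to == DEMO_COMMITTED;
    case DEMO_PREPARED_2PC:
      return to == DEMO_COMMIT_DECISION || to == DEMO_ABORT_DECISION;
    case DEMO_COMMIT_DECISION:
      return to == DEMO_INFORMING_C;
    case DEMO_ABORT_DECISION:
      return to == DEMO_INFORMING_A;
    case DEMO_INFORMING_C:
      return to == DEMO_COMMITTED;
    case DEMO_INFORMING_A:
      return to == DEMO_ABORTED;
    default:
      return false;             /* terminal states have no outgoing edge */
    }
}
```

A check of this shape is handy in debug assertions: any state change that is not an edge in the diagram fails fast instead of leaving a descriptor in a state recovery cannot classify.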
Client-side shadow and the API surface
The client side of the transaction module is small: a handful of
globals (tm_Tran_index, tm_Tran_isolation, tm_Tran_ID,
tm_Tran_async_ws, tm_Tran_wait_msecs) and the tran_* API
in transaction_cl.h.

    // Client-visible API (src/transaction/transaction_cl.h, excerpt)
    extern int tran_commit (bool retain_lock);
    extern int tran_abort (void);
    extern int tran_unilaterally_abort (void);
    extern int tran_reset_isolation (TRAN_ISOLATION isolation, bool async_ws);
    extern int tran_reset_wait_times (int wait_in_msecs);
    extern int tran_savepoint_internal (const char *name, SAVEPOINT_TYPE type);
    extern int tran_abort_upto_user_savepoint (const char *name);
    extern int tran_abort_upto_system_savepoint (const char *name);
    extern int tran_2pc_start (void);
    extern int tran_2pc_prepare (void);
    extern int tran_set_global_tran_info (int gtrid, void *info, int size);
    extern bool tran_has_updated (void);

Each entry point does some local bookkeeping (e.g., call any
registered tran_end_libcas_function for the broker) and then
performs the server RPC. The server side is covered above.
Source Walkthrough
Anchor on symbol names, not line numbers.
Headers and types
- log_tdes (log_impl.h): the descriptor.
- trantable (log_impl.h): the table of descriptors.
- log_topops_stack, log_topops_addresses (log_impl.h): system op stack.
- log_rcv_tdes (log_impl.h): recovery-time TDES annotations.
- TRAN_STATE (log_comm.h): 15-value lifecycle enum.
- DB_TRAN_ISOLATION (compat/dbtran_def.h): 3-level isolation enum.
- LOG_TOPOP_RANGE (log_manager.h): pair of (start_lsa, end_lsa) used for nested-top postpone replay.
- tx_transient_class_registry (transaction_transient.hpp): modified-classes list that needs invalidation.
Trantable management
- logtb_assign_tran_index (log_tran_table.c): allocate a slot for a new transaction.
- logtb_release_tran_index (log_tran_table.c): return the slot.
- logtb_set_current_tran_index (log_tran_table.c): set thread-current index.
- logtb_complete_mvcc (log_tran_table.c): close out MVCC info on commit/abort.
- logtb_grow_* (log_tran_table.c): table growth.
Server entry points
- xtran_server_commit (transaction_sr.c): server commit.
- xtran_server_abort (transaction_sr.c): server abort.
- xtran_server_savepoint (transaction_sr.c): server savepoint creation.
- xtran_server_unilaterally_abort_tran (transaction_sr.c): forced abort during error recovery.
System op surface
- log_sysop_start (log_manager.c): push frame.
- log_sysop_commit (log_manager.c): pop with commit, write LOG_SYSOP_END_COMMIT.
- log_sysop_abort (log_manager.c): pop with abort, write LOG_SYSOP_END_ABORT, walk backward for undo.
- log_sysop_start_atomic (log_manager.c): atomic variant for recovery-sensitive operations (file allocation/deallocation).
- log_sysop_end_logical_undo (log_manager.c): system op that carries its own logical undo.
- log_sysop_end_logical_compensate / log_sysop_end_logical_run_postpone (log_manager.c): variants for compensation and postpone replay.
- log_sysop_attach_to_outer (log_manager.c): attach a system op’s log range to its parent without writing an end record (used when a system op is essentially a marker).
Client-side API
- tran_commit (transaction_cl.c).
- tran_abort (transaction_cl.c).
- tran_savepoint_internal (transaction_cl.c): both USER and SYSTEM savepoints land here.
- tran_abort_upto_user_savepoint / tran_abort_upto_system_savepoint (transaction_cl.c).
- tran_reset_isolation (transaction_cl.c): flips tm_Tran_isolation and forwards to server.
Position hints as of 2026-05-01
| Symbol | File | Line |
|---|---|---|
| log_tdes (struct) | log_impl.h | 475 |
| log_topops_stack | log_impl.h | 362 |
| log_topops_addresses | log_impl.h | 353 |
| log_rcv_tdes | log_impl.h | 458 |
| trantable | log_impl.h | 602 |
| LOG_ISTRAN_ACTIVE (macro) | log_impl.h | 143 |
| LOG_ISTRAN_COMMITTED (macro) | log_impl.h | 146 |
| LOG_ISTRAN_ABORTED (macro) | log_impl.h | 153 |
| LOG_ISTRAN_2PC (macro) | log_impl.h | 173 |
| TRAN_STATE enum | log_comm.h | 36 |
| DB_TRAN_ISOLATION enum | dbtran_def.h | 36 |
| logtb_assign_tran_index | log_tran_table.c | 796 |
| logtb_release_tran_index | log_tran_table.c | 1139 |
| logtb_complete_mvcc | log_tran_table.c | 4050 |
| logtb_set_current_tran_index | log_tran_table.c | 6002 |
| xtran_server_commit | transaction_sr.c | 71 |
| xtran_server_abort | transaction_sr.c | 128 |
| xtran_server_savepoint | transaction_sr.c | 348 |
| log_sysop_start | log_manager.c | 3599 |
| log_sysop_start_atomic | log_manager.c | 3665 |
| log_sysop_commit_internal | log_manager.c | 3825 |
| log_sysop_commit | log_manager.c | 3916 |
| log_commit | log_manager.c | 5352 |
| log_abort | log_manager.c | 5461 |
Source verification (as of 2026-05-01)
Verified facts
- LOG_TDES is a single struct of ~50 fields, not split between hot and cold. Verified at log_impl.h:475. Unlike PostgreSQL (which splits PROC from PGXACT so visibility scans only touch hot fields), CUBRID inlines visibility-relevant mvccinfo next to bookkeeping fields like bind_history and query_timeout. Implication: visibility scans of the TDES table read more cache lines per descriptor than the strictly necessary set.
- TRAN_STATE has 15 values, of which 7 belong to the 2PC state machine. Verified at log_comm.h:36-67. The LOG_ISTRAN_2PC macro at log_impl.h:173-176 collapses 6 of them into “is in 2PC”. The 15-value enum does not include TRAN_RECOVERY separately as a 2PC variant; it’s a pseudo-state used for the recovery worker’s pseudo-tran.
- Default isolation is TRAN_READ_COMMITTED (0x04). Verified at dbtran_def.h:53 (TRAN_DEFAULT_ISOLATION = TRAN_READ_COMMITTED) and dbtran_def.h:54 (MVCC_TRAN_DEFAULT_ISOLATION = TRAN_READ_COMMITTED). Both defaults agree because CUBRID is MVCC across the board; there is no non-MVCC mode where the default would differ.
- Isolation-level enum values are deliberately aliased. TRAN_READ_COMMITTED == TRAN_REP_CLASS_COMMIT_INSTANCE == TRAN_CURSOR_STABILITY == 0x04. Verified at dbtran_def.h:40-42. The aliases preserve API compatibility with the older locking-vocabulary names; they compile to the same dispatch path.
- Trantable size is configured at boot, not dynamic per transaction. Verified by reading logtb_assign_tran_index (log_tran_table.c:796): it allocates from a contiguous area managed by the LOG_ADDR_TDESAREA linked list, growing only when exhausted, never shrinking. The cap is set by the max_clients server parameter.
- System ops nest via a stack on the TDES, not a separate table. Verified at log_impl.h:361-367 (LOG_TOPOPS_STACK). The stack’s last field is -1 when no system op is active, an integer index otherwise. There is no global system-op table; every TDES owns its own stack.
- Lock-acquisition wait timeout is per-TDES. Verified at log_impl.h:486 (wait_msecs field) and the corresponding client-side global tm_Tran_wait_msecs in transaction_cl.h:58. The macro TRAN_LOCK_INFINITE_WAIT = -1 (log_comm.h:29) encodes the “wait forever” sentinel.
- block_global_oldest_active_until_commit exists for long-running operations that need to do their own vacuuming. Verified at log_impl.h:555 and the lock_global_oldest_visible_mvccid member function declared at log_impl.h:585. Used by reorganize-partition / upgrade-domain code paths that scan large amounts of data and would otherwise have their MVCC threshold pushed forward by concurrent transactions.
- LOG_2PC_GTRINFO and LOG_2PC_COORDINATOR * are inline TDES fields, present even for non-2PC transactions. Verified at log_impl.h:505-508. coord is NULL if the site is not the coordinator. The cost is one pointer per TDES; the benefit is that attaching a 2PC role to a previously-local transaction does not re-allocate.
- LOG_RCV_TDES is non-NULL only during recovery. Verified at log_impl.h:458 (struct definition) and 558 (inlined into log_tdes::rcv). Its fields (sysop_start_postpone_lsa, tran_start_postpone_lsa, atomic_sysop_start_lsa, analysis_last_aborted_sysop_*) are populated during analysis-pass and consumed during redo/undo.
Open questions
- TDES hot/cold split. Has anyone measured the cache-miss penalty of putting `mvccinfo` next to `bind_history`? Other engines split, presumably for a reason. Investigation path: `perf stat -e cache-misses` on a high-concurrency read workload; compare against a hypothetical TDES split.
- Trantable growth. The header field `LOG_ADDR_TDESAREA *area` suggests growth is supported at runtime, but the trigger and coordination are unverified. Investigation path: grep for `area` writes in `log_tran_table.c`; check whether growth happens in the request path or only at a quiescent point.
- `hint_free_index` correctness under contention. Multiple threads can simultaneously call `logtb_assign_tran_index`. The hint is single-valued, so what guards it? Investigation path: read the body of `logtb_assign_tran_index` for compare-and-swap or mutex usage.
- System-op `rmutex_topop` behaviour. A reentrant mutex per TDES suggests system ops can recursively start while one is in progress on the same thread, but the depth bound is unverified. Investigation path: examine `log_sysop_start` for `lock_topop()` calls and chase the reentrance count.
- Postpone cache integration. `m_log_postpone_cache` is a C++ class (`log_postpone_cache`) inlined into the TDES. Its purpose per the field comment is to remember postpone records that may be replayed at `log_do_postpone`. The exact lifetime (cleared on commit? on abort? carried across sysop boundaries?) is unverified. Investigation path: read `log_postpone_cache.cpp` together with `log_do_postpone` in `log_manager.c`.
- Client-side TDES shadow vs. server reality. `tm_Tran_*` are client-side globals; what happens on a connection failover when the server has a different `wait_msecs`? Investigation path: trace `tran_cache_tran_settings` consumers; check whether the CAS broker re-syncs on reconnect.
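For the `hint_free_index` question, one design that would need no lock around the hint is to treat it as advisory and make the per-slot claim the real synchronization point. The sketch below assumes that design purely for illustration; it is not taken from `logtb_assign_tran_index`, whose actual guard remains unverified. `SlotTable`, `assign_index`, and `release_index` are hypothetical names.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Fixed-size slot table with an advisory free-index hint (hypothetical).
template <std::size_t N>
class SlotTable {
public:
  SlotTable() {
    for (auto &slot : used_) slot.store(false, std::memory_order_relaxed);
  }

  // Claims a free slot and returns its index, or -1 if the table is full.
  int assign_index() {
    const std::size_t start = hint_.load(std::memory_order_relaxed) % N;
    for (std::size_t i = 0; i < N; ++i) {
      const std::size_t idx = (start + i) % N;
      bool expected = false;
      // The CAS on the slot is what guarantees exclusive ownership.
      if (used_[idx].compare_exchange_strong(expected, true,
                                             std::memory_order_acq_rel)) {
        // A stale or racing hint only costs extra probes, never correctness.
        hint_.store((idx + 1) % N, std::memory_order_relaxed);
        return static_cast<int>(idx);
      }
    }
    return -1;
  }

  // Frees a slot and points the hint at it.
  void release_index(int idx) {
    assert(idx >= 0 && static_cast<std::size_t>(idx) < N);
    used_[static_cast<std::size_t>(idx)].store(false, std::memory_order_release);
    hint_.store(static_cast<std::size_t>(idx), std::memory_order_relaxed);
  }

private:
  std::array<std::atomic<bool>, N> used_;
  std::atomic<std::size_t> hint_{0};
};
```

If the real function instead holds a table-wide mutex while scanning, the hint needs no atomics at all; which of the two applies is exactly what the investigation path should settle.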
Beyond CUBRID — Comparative Designs & Research Frontiers
Pointers, not analysis. Each bullet is a starting handle for a follow-up doc.
- PostgreSQL `PROC`/`PGXACT` split: PostgreSQL releases before 14 split the descriptor into a hot half (`PGXACT`: xid, xmin, vacuumFlags) read by visibility scans and a cold half (`PROC`: locktag arrays, myProcLocks). A side-by-side with CUBRID’s monolithic TDES would measure the cache cost.
- InnoDB `trx_t` plus `lock_sys` reservation: InnoDB embeds per-transaction lock reservation inside `trx_t::lock` and uses a global `lock_sys_t` mutex. CUBRID separates this: `LK_RES *waiting_for_res` on the TDES plus the lock manager’s per-resource hash. Comparing the two would illuminate the lock-acquisition critical path.
- Hekaton in-memory transaction map (Larson et al., VLDB 2011): Hekaton stores transaction descriptors in a lock-free hash keyed on transaction ID, with versions stored inline on records. CUBRID’s fixed-array trantable is the opposite design point.
- Partial rollback chains in PostgreSQL subtransactions: PG uses `SubTransactionId` and a per-backend stack much like CUBRID’s topops stack. The two-version subtransaction-id mapping in PG (subxact + parent xid) is more elaborate than CUBRID’s `LOG_TOPOPS_ADDRESSES`, but the lifecycle is structurally identical.
- Optimistic concurrency control on RDMA (FaRM, NSDI 2014): FaRM eliminates the TDES table by encoding transaction state directly in record versions. CUBRID’s TDES survives because its isolation modes need the descriptor for lock acquisition; the comparison highlights what the descriptor is for on a shared-memory engine.
- JTA `XAResource` semantics (JSR 907): the CUBRID 2PC `TRAN_STATE` branch is conformant to JTA prepare/commit/rollback semantics; the `cubrid-2pc.md` doc is the natural follow-up that enumerates the conformance points.
- CockroachDB serializable + parallel commits (Taft et al., SIGMOD 2020): Cockroach pushes the descriptor into a distributed KV layer and commits a transaction by writing a single intent record whose status is resolved lazily; the “transaction record” plays the role of CUBRID’s TDES but without a fixed-size table. A side-by-side would surface what shared-memory engines pay (the trantable cap) versus what shared-nothing engines pay (intent resolution traffic).
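Before running the `PROC`/`PGXACT` side-by-side, a back-of-the-envelope model can indicate what a hot/cold split might save. The sketch below only counts the distinct cache lines a visibility scan would read; it assumes 64-byte lines, a line-aligned array base, and descriptor sizes supplied by the caller. `lines_touched` is a hypothetical helper, and none of the numbers are measurements of CUBRID or PostgreSQL.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLineBytes = 64;  // assumed cache-line size

// Counts the distinct cache lines read when a scan touches the first
// `hot_bytes` of each of `n` descriptors laid out `stride` bytes apart.
// The array base is assumed to be cache-line aligned.
inline std::size_t lines_touched(std::size_t stride, std::size_t hot_bytes,
                                 std::size_t n) {
  assert(hot_bytes >= 1 && stride >= hot_bytes);
  std::size_t count = 0;
  std::size_t next_uncounted = 0;  // lowest line index not yet counted
  for (std::size_t k = 0; k < n; ++k) {
    std::size_t first = (k * stride) / kCacheLineBytes;
    const std::size_t last = (k * stride + hot_bytes - 1) / kCacheLineBytes;
    if (first < next_uncounted) first = next_uncounted;  // skip shared lines
    if (last + 1 > first) {
      count += last + 1 - first;
      next_uncounted = last + 1;
    }
  }
  return count;
}
```

With an assumed 24 hot bytes per descriptor, scanning 100 descriptors reads 38 lines when the hot parts are packed back to back (`lines_touched(24, 24, 100)`) and 125 lines when each hot part sits at the front of a 792-byte monolithic descriptor (`lines_touched(792, 24, 100)`). Real behaviour also depends on prefetching and write sharing, so the `perf stat` run remains the authoritative test.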
Sources
Raw analyses (raw/code-analysis/cubrid/storage/transaction/)
- `Transaction Internals.pdf`
- `Transaction Internals.pptx`
Textbook chapters (under knowledge/research/dbms-general/)
- Database Internals (Petrov), Ch. 5 “Transactions and Recovery”, §“ACID” and §“Isolation levels”.
- Concurrency Control and Recovery in Database Systems (Bernstein, Hadzilacos, Goodman), Ch. 1–4.
CUBRID source (/data/hgryoo/references/cubrid/)
- `src/transaction/log_impl.h`: TDES, trantable, sysop stack.
- `src/transaction/log_tran_table.c`: trantable management.
- `src/transaction/transaction_cl.{h,c}`: client-side API.
- `src/transaction/transaction_sr.{h,c}`: server entry points.
- `src/transaction/transaction_global.hpp`: system tran constants.
- `src/transaction/transaction_transient.hpp`: modified-class registry, lob locator chain.
- `src/transaction/log_comm.h`: `TRAN_STATE` enum.
- `src/transaction/log_manager.c`: sysop, commit, abort.
- `src/compat/dbtran_def.h`: `DB_TRAN_ISOLATION` enum.
Sibling docs in this knowledge base
- `knowledge/code-analysis/cubrid/cubrid-log-manager.md`: log records the TDES emits.
- `knowledge/code-analysis/cubrid/cubrid-mvcc.md`: consumer of `log_tdes::mvccinfo`.
- `knowledge/code-analysis/cubrid/cubrid-lock-manager.md`: consumer of `log_tdes::wait_msecs` and producer of `log_tdes::waiting_for_res`.
- `knowledge/code-analysis/cubrid/cubrid-recovery-manager.md`: consumer of TDES at analysis time; in-progress in the same batch.
- `knowledge/code-analysis/cubrid/cubrid-2pc.md`: owner of the 2PC state-machine arms and `coord`/`gtrinfo`; in-progress in the same batch.