CUBRID Lock Manager — Code-Level Deep Dive
Where this document fits: The high-level analysis
cubrid-lock-manager.mdcovers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full lifecycle of a single lock request inside the kernel.
Contents:
Chapter 1: Data-Structure Map
Section titled “Chapter 1: Data-Structure Map”Every lock-manager operation is built on top of five structs. This chapter walks through every field of each struct with code excerpts and draws the pointer relationships between them. Having this map in your head is the prerequisite for following the flows in later chapters.
1.1 Five core structs — overview
Section titled “1.1 Five core structs — overview”flowchart TB
subgraph GLOBAL["lk_Gl (LK_GLOBAL_DATA) — single global instance"]
HT["m_obj_hash_table\n(lockfree_hashmap)"]
TLT["tran_lock_table[0..N]\n(LK_TRAN_LOCK array)"]
FREE["obj_free_entry_list\n(LF_FREELIST)"]
WFG["TWFG_node[] / TWFG_edge[]\n(deadlock detection)"]
end
subgraph RES["LK_RES — one lockable resource"]
KEY["key: LK_RES_KEY"]
AGG["total_holders_mode\ntotal_waiters_mode"]
HL["holder -> waiter -> non2pl"]
end
subgraph ENTRY["LK_ENTRY — one lock fact"]
E_RES["res_head -> LK_RES"]
E_MODE["granted_mode / blocked_mode"]
E_LINK["next (resource side)\ntran_next/tran_prev (tran side)"]
E_CLASS["class_entry -> parent"]
end
subgraph TRAN["LK_TRAN_LOCK — one transaction's lock state"]
ROOT["root_class_hold"]
CLS["class_hold_list"]
INST["inst_hold_list"]
POOL["lk_entry_pool\n(local pool, max 10)"]
N2PL["non2pl_list"]
end
HT --> RES
TLT --> TRAN
RES -.-> ENTRY
TRAN -.-> ENTRY
ENTRY --> RES
ENTRY --> TRAN
Figure 1-1 — Relationship among the five core structs. Arrows show
pointer direction. The key insight is that LK_ENTRY is linked into
both LK_RES (resource view) and LK_TRAN_LOCK (transaction view)
simultaneously.
1.2 LK_RES_KEY — identifier of a lockable resource
Section titled “1.2 LK_RES_KEY — identifier of a lockable resource”The key that identifies what object a lock protects. Used as the hash-table lookup key.
// struct lk_res_key — src/transaction/lock_manager.hstruct lk_res_key{ LOCK_RESOURCE_TYPE type; /* INSTANCE, CLASS, ROOT_CLASS */ OID oid; /* target object's OID */ OID class_oid; /* owning class OID (instance only) */};| Field | Size | Description |
|---|---|---|
type | enum (4B) | One of LOCK_RESOURCE_INSTANCE, LOCK_RESOURCE_CLASS, LOCK_RESOURCE_ROOT_CLASS. LOCK_RESOURCE_OBJECT is obsolete. |
oid | 8B (pageid 4B + slotid 2B + volid 2B) | Identifier of the object this lock protects. |
class_oid | 8B | Valid only for instance locks. Set to NULL OID for class/root-class locks. |
type is derived from oid and class_oid. lock_create_search_key
implements the derivation rule:
// lock_create_search_key — src/transaction/lock_manager.cstatic LK_RES_KEYlock_create_search_key (OID * oid, OID * class_oid){ LK_RES_KEY search_key; // ... OID copying ...
if (oid != NULL && OID_IS_ROOTOID (oid)) { search_key.type = LOCK_RESOURCE_ROOT_CLASS; } else if (class_oid == NULL || OID_IS_ROOTOID (class_oid)) { search_key.type = LOCK_RESOURCE_CLASS; } else { search_key.type = LOCK_RESOURCE_INSTANCE; } return search_key;}The derivation rules in tabular form:
| Condition | type | Meaning |
|---|---|---|
oid is root OID | ROOT_CLASS | The database root class itself |
class_oid is NULL or root OID | CLASS | A table (class) — the coarse-grain lock target |
| Otherwise | INSTANCE | An individual row (instance) within a table |
Hash-table comparison ignores type and checks oid only.
lock_res_key_compare calls OID_EQ(&k1->oid, &k2->oid) and uses
type only in an assert. This relies on the invariant that a single
OID cannot simultaneously serve as two different resource types.
1.3 LK_RES — lock resource (state of one lockable object)
Section titled “1.3 LK_RES — lock resource (state of one lockable object)”Holds all lock state for one lockable object (database, class, or
instance). Found in the hash table by LK_RES_KEY.
// struct lk_res — src/transaction/lock_manager.hstruct lk_res{ LK_RES_KEY key; /* hash key — see 1.2 */ LOCK total_holders_mode; /* aggregate mode of all current holders */ LOCK total_waiters_mode; /* aggregate mode of all current waiters */ LK_ENTRY *holder; /* head of granted-lock linked list */ LK_ENTRY *waiter; /* head of blocked-lock linked list */ LK_ENTRY *non2pl; /* head of early-released-lock tracking list */ pthread_mutex_t res_mutex; /* per-resource mutex */ LK_RES *hash_next; /* hash-bucket collision chain */ LK_RES *stack; /* lock-free freelist retired stack */ UINT64 del_id; /* lock-free reclamation epoch ID */};Field-by-field:
| Field | Role | Why it exists |
|---|---|---|
total_holders_mode | LUB (least upper bound) of all granted locks | Makes compatibility checks O(1) — compare against this one value instead of walking the holder list. |
total_waiters_mode | LUB of all waiting locks | Starvation guard: a new request must also be compatible with waiting requests, so a stream of S requests cannot leapfrog a waiting X. |
holder | Singly-linked list of granted LK_ENTRYs | Linked via LK_ENTRY.next. Multiple transactions can hold the same resource simultaneously (e.g., multiple S_LOCKs). |
waiter | Singly-linked list of blocked LK_ENTRYs | FIFO order. Head is the longest-waiting request. |
non2pl | Traces of S locks released early under READ COMMITTED | Used for conflict detection when the same transaction re-reads later. |
res_mutex | Per-resource mutex | Protects holder/waiter/non2pl list manipulation. Per-resource (not global), so lock requests on different resources proceed in parallel. |
When total_holders_mode is recomputed: every time an entry is
added to, removed from, or converted within the holder list, the
entire holder list is walked to recompute the LUB. This happens under
res_mutex, so it is serialized per-resource.
1.4 LK_ENTRY — one lock fact (the dual-threaded record)
Section titled “1.4 LK_ENTRY — one lock fact (the dual-threaded record)”The most important struct in the lock manager. Represents a single lock fact and is linked into both the resource view and the transaction view simultaneously.
// struct lk_entry — src/transaction/lock_manager.hstruct lk_entry{ struct lk_res *res_head; /* back-pointer to owning resource */ THREAD_ENTRY *thrd_entry; /* requesting thread (valid only when blocked) */ int tran_index; /* owning transaction's index */ LOCK granted_mode; /* currently granted mode */ LOCK blocked_mode; /* requested mode when blocked */ int count; /* re-entrance counter */ UINT64 del_id; /* lock-free reclamation */ LK_ENTRY *stack; /* freelist retired stack */ LK_ENTRY *next; /* resource view: next in holder/waiter list */ LK_ENTRY *tran_next; /* transaction view: next lock of same tran */ LK_ENTRY *tran_prev; /* transaction view: prev lock of same tran */ LK_ENTRY *class_entry; /* parent class's LK_ENTRY (for instance locks) */ int ngranules; /* count of instance locks below this class lock */ int instant_lock_count; /* instant lock request count */ int bind_index_in_tran; XASL_ID xasl_id; /* query plan ID that requested this lock */};An LK_ENTRY plays one of three roles — granted holder, blocked
waiter, or non2pl marker. Which fields are meaningful depends on the
role:
| Field | holder (granted) | waiter (blocked) | non2pl | Description |
|---|---|---|---|---|
res_head | ✓ | ✓ | ✓ | Always valid. Which resource this lock belongs to. |
thrd_entry | NULL | ✓ | NULL | Waiter only. Points to the thread to resume. |
tran_index | ✓ | ✓ | ✓ | Which transaction owns this lock. |
granted_mode | current mode | NULL_LOCK | pre-release mode | For holders, the actual lock mode. For waiters, not yet granted so NULL_LOCK. |
blocked_mode | NULL_LOCK | requested mode | NULL_LOCK | For waiters, the mode being requested. During conversion wait, a holder can have blocked_mode != NULL_LOCK — granted_mode is the currently held mode, blocked_mode is the upgrade target. |
count | ≥ 1 | 1 | 0 | Re-entrance counter. If the same transaction requests the same resource and mode n times, count becomes n. Always 0 for non2pl. |
next | next in holder list | next in waiter list | next in non2pl list | Resource-view linkage. Singly-linked. |
tran_next/tran_prev | ✓ | — | ✓ | Transaction-view linkage. Doubly-linked. Waiters are not in the transaction hold list. |
class_entry | parent class’s LK_ENTRY | — | — | One-hop back-reference from instance lock to class lock. Used for escalation. |
ngranules | count of child instance locks | — | — | For class locks only. Compared against escalation threshold. Zero for instance locks. |
instant_lock_count | instant-mode lock count | — | — | Managed by lock_start_instant_lock_mode / lock_stop_instant_lock_mode. |
Key invariant: exclusive use of granted_mode and blocked_mode
Entry in holder list: granted_mode = actual held mode (S, X, IS, IX, ...) blocked_mode = NULL_LOCK (normal) or upgrade-target mode (conversion wait)
Entry in waiter list: granted_mode = NULL_LOCK blocked_mode = requested modeRead the pair as the state. One entry per (resource,
transaction): the field pair plus which list the entry threads
is the transaction’s state on that resource — X/– in the
holder list: plain holder · IX/SIX in the holder list: upgrader
(partially acquired, with a pending target) · –/X in the waiter
list: waiter (nothing acquired yet). Vocabulary note: blocked_mode
is a pending target, not an “intention” — that word is reserved
for the intention modes IS/IX/SIX.
This invariant is enforced by the two initializers:
// lock_initialize_entry_as_granted — src/transaction/lock_manager.cstatic voidlock_initialize_entry_as_granted (LK_ENTRY * entry_ptr, int tran_index, LK_RES * res, LOCK lock){ entry_ptr->granted_mode = lock; /* actual mode */ entry_ptr->blocked_mode = NULL_LOCK; /* not waiting */ entry_ptr->count = 1; // ...}
// lock_initialize_entry_as_blocked — src/transaction/lock_manager.cstatic voidlock_initialize_entry_as_blocked (LK_ENTRY * entry_ptr, THREAD_ENTRY * thread_p, int tran_index, LK_RES * res, LOCK lock){ entry_ptr->granted_mode = NULL_LOCK; /* not yet granted */ entry_ptr->blocked_mode = lock; /* requested mode */ entry_ptr->count = 1; // ...}Conversion wait is the exception where both fields are non-NULL.
For example, if tran 7 holds S_LOCK and requests an upgrade to
X_LOCK:
granted_mode = S_LOCK(still held)blocked_mode = X_LOCK(upgrade pending)- The entry stays in the holder list but is repositioned to the
back via
lock_position_holder_entry. Covered in detail in Ch. 4.
1.5 LK_TRAN_LOCK — one transaction’s complete lock state
Section titled “1.5 LK_TRAN_LOCK — one transaction’s complete lock state”// struct lk_tran_lock — src/transaction/lock_manager.cstruct lk_tran_lock{ pthread_mutex_t hold_mutex; /* protects hold lists */ LK_ENTRY *inst_hold_list; /* instance lock hold list */ LK_ENTRY *class_hold_list; /* class lock hold list */ LK_ENTRY *root_class_hold; /* root class lock (at most one) */ LK_ENTRY *lk_entry_pool; /* local pool of reusable LK_ENTRY */ int lk_entry_pool_count; /* current count in pool */ int inst_hold_count; /* number of instance locks held */ int class_hold_count; /* number of class locks held */
LK_ENTRY *waiting; /* currently blocked lock entry (at most one) */
pthread_mutex_t non2pl_mutex; /* protects non2pl_list */ LK_ENTRY *non2pl_list; /* early-released lock tracking list */ int num_incons_non2pl; /* inconsistent non2pl count */
bool lock_escalation_on; /* escalation-in-progress guard */ bool is_instant_duration; /* instant lock mode on/off */};Why three separate hold sub-lists:
flowchart TB TRAN["LK_TRAN_LOCK (tran #7)"] ROOT["root_class_hold\n(at most 1)"] CLS["class_hold_list\n-> entry(class A, IX)\n-> entry(class B, IS)\n-> ..."] INST["inst_hold_list\n-> entry(row A.1, X)\n-> entry(row A.2, S)\n-> ..."] TRAN --> ROOT TRAN --> CLS TRAN --> INST
Figure 1-2 — Three-way hold list split. Separating root/class/instance means “what lock does this transaction hold on class A?” can be answered by walking only the class list, O(number of classes), instead of scanning all locks.
| Field | Detail |
|---|---|
hold_mutex | Acquired when manipulating inst/class/root hold lists. Separate from res_mutex — cross-locking is needed when one transaction manipulates locks on different resources. |
lk_entry_pool | Each transaction locally caches up to LOCK_TRAN_LOCAL_POOL_MAX_SIZE = 10 LK_ENTRY structs. lock_get_new_entry checks this pool first; only falls through to the global obj_free_entry_list when empty. Purpose: reduce contention on the global freelist in hot OLTP paths. |
waiting | The single LK_ENTRY this transaction is currently blocked on. A transaction can wait on at most one lock at a time (one thread suspends on one lock request). |
non2pl_list | When an S lock is released early under READ COMMITTED, the LK_ENTRY is moved here. Other transactions check this list when attempting an X lock on the same resource to detect inconsistencies. |
lock_escalation_on | Re-entrance guard during escalation. |
is_instant_duration | Turned on by lock_start_instant_lock_mode. Locks acquired in this mode are released as soon as the statement ends. |
1.6 LK_GLOBAL_DATA — the global singleton lk_Gl
Section titled “1.6 LK_GLOBAL_DATA — the global singleton lk_Gl”Exactly one instance exists system-wide. The root of all lock state.
// struct lk_global_data — src/transaction/lock_manager.cstruct lk_global_data{ int max_obj_locks; /* max lock count set at init */ lk_hashmap_type m_obj_hash_table; /* OID -> LK_RES lockfree hashmap */ LF_FREELIST obj_free_entry_list; /* global LK_ENTRY freelist */
int num_trans; /* number of initialized transactions */ LK_TRAN_LOCK *tran_lock_table; /* per-tran LK_TRAN_LOCK array */
pthread_mutex_t DL_detection_mutex; /* deadlock detection guard */ struct timeval last_deadlock_run; /* timestamp of last deadlock scan */ LK_WFG_NODE *TWFG_node; /* WFG node array (per-tran) */ LK_WFG_EDGE *TWFG_edge; /* WFG edge array */ int max_TWFG_edge; /* max edge array size */ int TWFG_free_edge_idx; /* next free edge index */ int global_edge_seq_num; /* edge creation sequence number */
short no_victim_case_count; bool verbose_mode; std::atomic_int deadlock_and_timeout_detector;};
LK_GLOBAL_DATA lk_Gl; /* global instance — file scope */Overall memory allocation strategy:
flowchart TB REQ["lock request -> need LK_ENTRY"] LP["1. transaction local pool\n(lk_entry_pool, max 10)"] GF["2. global freelist\n(obj_free_entry_list)"] MA["3. malloc new block"] REQ --> LP LP -- "empty" --> GF GF -- "empty" --> MA REL["lock release -> return LK_ENTRY"] LP2["1. return to local pool\n(if count < 10)"] GF2["2. return to global freelist"] REL --> LP2 LP2 -- "pool full" --> GF2
Figure 1-3 — Three-tier LK_ENTRY allocation/deallocation strategy.
Allocation: local pool → global freelist → malloc. Deallocation:
reverse order. Implemented by lock_get_new_entry and
lock_free_entry.
1.7 Lock modes — the 12-value LOCK enum and two static tables
Section titled “1.7 Lock modes — the 12-value LOCK enum and two static tables”// enum LOCK — src/transaction/lock_table.htypedef enum{ NA_LOCK = 0, INCON_NON_TWO_PHASE_LOCK = 1, /* NON2PL */ NULL_LOCK = 2, SCH_S_LOCK = 3, IS_LOCK = 4, S_LOCK = 5, IX_LOCK = 6, BU_LOCK = 7, SIX_LOCK = 8, U_LOCK = 9, X_LOCK = 10, SCH_M_LOCK = 11, LOCK_COUNT /* = 12 */} LOCK;Two static 12×12 tables drive all mode decisions:
lock_Comp[requested][current] — compatibility table:
LOCK_COMPAT_YES: the new request can coexist with the existing lockLOCK_COMPAT_NO: conflict — the new request must waitLOCK_COMPAT_UNKNOWN: this combination cannot occur in practice (assert-guarded)
lock_Conv[requested][current] — conversion table:
- When the same transaction already holds
currentand requestsrequested, the resulting mode islock_Conv[requested][current](the LUB of the two modes) NA_LOCKresult: this conversion is undefined (assert failure)
// inline accessors — src/transaction/lock_table.hinline LOCKlock_conv (LOCK requested, LOCK current){ assert (lock_Conv[requested][current] != NA_LOCK); return lock_Conv[requested][current];}
inline LOCK_COMPATIBILITYlock_compat (LOCK requested, LOCK current){ assert (lock_Comp[requested][current] != LOCK_COMPAT_UNKNOWN); return lock_Comp[requested][current];}Conversion examples (read directly from lock_Conv):
| Currently held | New request | Result (LUB) | Meaning |
|---|---|---|---|
IS_LOCK | IX_LOCK | IX_LOCK | Intention shared → intention exclusive upgrade |
S_LOCK | IX_LOCK | SIX_LOCK | Shared + intention exclusive = SIX |
S_LOCK | X_LOCK | X_LOCK | Shared → exclusive upgrade |
IX_LOCK | S_LOCK | SIX_LOCK | IX + S = SIX (symmetric) |
U_LOCK | X_LOCK | X_LOCK | Update → exclusive upgrade |
1.8 Pointer-relationship summary — everywhere a single lock connects
Section titled “1.8 Pointer-relationship summary — everywhere a single lock connects”Every path a single LK_ENTRY is linked into:
flowchart LR
subgraph Resource["LK_RES (resource R)"]
direction TB
RH["holder list"]
RW["waiter list"]
RN["non2pl list"]
end
subgraph Tran["LK_TRAN_LOCK (tran T)"]
direction TB
TI["inst_hold_list"]
TC["class_hold_list"]
TR["root_class_hold"]
TN["non2pl_list"]
end
E["LK_ENTRY\n(tran T, resource R)"]
RH -- "next" --> E
E -- "res_head" --> Resource
TI -- "tran_next/prev" --> E
E -- "class_entry" --> CE["LK_ENTRY\n(tran T, class of R)"]
CE -- "ngranules++" --> CE
Figure 1-4 — Every pointer relationship involving a single LK_ENTRY.
next is the resource view, tran_next/tran_prev is the transaction
view, class_entry bridges the granularity hierarchy, and res_head
is the back-reference.
1.9 Chapter summary — key takeaways
Section titled “1.9 Chapter summary — key takeaways”LK_ENTRYis a dual-threaded record. A single lock fact exists in both the resource view (next) and the transaction view (tran_next/tran_prev) simultaneously.total_holders_modeandtotal_waiters_modeenable O(1) compatibility checks. A new request compares against two aggregate values instead of walking the entire holder list.- The combination of
granted_modeandblocked_modedetermines an entry’s state:granted != NULL_LOCK, blocked == NULL_LOCK→ normal holdergranted == NULL_LOCK, blocked != NULL_LOCK→ waitergranted != NULL_LOCK, blocked != NULL_LOCK→ holder undergoing conversion wait
- Memory allocation is three-tier: local pool (10 entries) → global freelist → malloc.
- Per-resource
res_mutexprovides resource-granularity serialization, so lock requests on different resources proceed fully in parallel.
Chapter 2: Initialization and Memory Management
Section titled “Chapter 2: Initialization and Memory Management”The lock manager is bootstrapped once during server startup by
lock_initialize and torn down by lock_finalize. This chapter
traces the initialization sequence step by step, then examines the
runtime memory management — the three-tier allocation strategy and
its integration with the lockfree infrastructure.
2.1 lock_initialize — the top-level entry point
Section titled “2.1 lock_initialize — the top-level entry point”// lock_initialize — src/transaction/lock_manager.cintlock_initialize (void){ error_code = lock_initialize_tran_lock_table (); /* step 1 */ lock_initialize_object_hash_table (); /* step 2 */ error_code = lock_initialize_object_lock_entry_list (); /* step 3 */ error_code = lock_initialize_deadlock_detection (); /* step 4 */ // ... env-var checks for verbose/dump mode ... lock_deadlock_detect_daemon_init (); /* step 5 */ return error_code;}The order matters. Step 1 must complete before step 4 because
lock_initialize_deadlock_detection reads lk_Gl.num_trans
(set by step 1) to size the WFG node array.
flowchart LR INIT["lock_initialize"] S1["1. tran_lock_table\nMAX_NTRANS slots\n+ 10 entries each"] S2["2. obj_hash_table\nlockfree hashmap\nMAX(10K, NTRANS*300)"] S3["3. obj_free_entry_list\nlockfree freelist"] S4["4. deadlock detection\nTWFG_node[num_trans]"] S5["5. deadlock daemon\n100ms loop"] INIT --> S1 --> S2 --> S3 --> S4 --> S5
Figure 2-1 — lock_initialize sequence. Each step builds on the
previous one’s output. The deadlock daemon (step 5) is the last
piece — it must not run until all data structures are ready.
2.2 Step 1 — lock_initialize_tran_lock_table
Section titled “2.2 Step 1 — lock_initialize_tran_lock_table”Allocates and initializes the per-transaction lock state table.
// lock_initialize_tran_lock_table — src/transaction/lock_manager.cstatic intlock_initialize_tran_lock_table (void){ num_trans = MAX_NTRANS; lk_Gl.tran_lock_table = (LK_TRAN_LOCK *) malloc (SIZEOF_LK_TRAN_LOCK * num_trans); memset (lk_Gl.tran_lock_table, 0, SIZEOF_LK_TRAN_LOCK * num_trans);
for (i = 0; i < num_trans; i++) { tran_lock = &lk_Gl.tran_lock_table[i]; pthread_mutex_init (&tran_lock->hold_mutex, NULL); pthread_mutex_init (&tran_lock->non2pl_mutex, NULL);
/* pre-fill local entry pool */ for (j = 0; j < LOCK_TRAN_LOCAL_POOL_MAX_SIZE; j++) /* = 10 */ { entry = (LK_ENTRY *) malloc (sizeof (LK_ENTRY)); lock_initialize_entry (entry); entry->next = tran_lock->lk_entry_pool; tran_lock->lk_entry_pool = entry; } tran_lock->lk_entry_pool_count = LOCK_TRAN_LOCAL_POOL_MAX_SIZE; } return NO_ERROR;}Key observations:
MAX_NTRANSslots are allocated up front. This is a compile-time constant, not a runtime parameter. Everytran_indexfrom 0 toMAX_NTRANS - 1gets its ownLK_TRAN_LOCKregardless of how many transactions are active.- Each slot gets 10 pre-allocated
LK_ENTRYs via directmalloc. These go into thelk_entry_poolsingly-linked list. This is the Tier 1 local pool from Ch. 1 §1.6. - Two mutexes per transaction:
hold_mutex(protects the three hold lists) andnon2pl_mutex(protects the non2pl list). They are separate because a lock release can simultaneously manipulate the hold list and the non2pl list from different code paths. - All other fields are zeroed by the
memset, which means:inst_hold_list = NULL,class_hold_list = NULL,root_class_hold = NULL,waiting = NULL,non2pl_list = NULL,lock_escalation_on = false,is_instant_duration = false.
2.3 Step 2 — lock_initialize_object_hash_table
Section titled “2.3 Step 2 — lock_initialize_object_hash_table”Sets up the lockfree hashmap that maps LK_RES_KEY → LK_RES.
// lock_initialize_object_hash_table — src/transaction/lock_manager.cstatic voidlock_initialize_object_hash_table (void){ lk_Gl.max_obj_locks = 10000; /* LK_INITIAL_OBJECT_LOCK_TABLE_SIZE */ const int obj_hash_size = MAX (lk_Gl.max_obj_locks, LK_MIN_OBJECT_LOCKS); /* LK_MIN_OBJECT_LOCKS = MAX_NTRANS * 300 */
lk_Gl.m_obj_hash_table.init ( obj_lock_res_Ts, THREAD_TS_OBJ_LOCK_RES, obj_hash_size, block_size, block_count, lk_Obj_lock_res_desc);}The hash table is a lockfree_hashmap<LK_RES_KEY, LK_RES> — a
generic lockfree hash table from CUBRID’s lockfree infrastructure
(see cubrid-lockfree-hashmap.md). The lock manager plugs in its
own behavior through an LF_ENTRY_DESCRIPTOR:
// lk_Obj_lock_res_desc — src/transaction/lock_manager.cLF_ENTRY_DESCRIPTOR lk_Obj_lock_res_desc = { offsetof (LK_RES, stack), /* retired-entry stack pointer */ offsetof (LK_RES, hash_next), /* hash collision chain */ offsetof (LK_RES, del_id), /* epoch ID for reclamation */ offsetof (LK_RES, key), /* key location */ offsetof (LK_RES, res_mutex), /* per-entry mutex */ LF_EM_USING_MUTEX, /* each LK_RES has its own mutex */ LF_ENTRY_DESCRIPTOR_MAX_ALLOC, lock_alloc_resource, /* malloc + pthread_mutex_init */ lock_dealloc_resource, /* pthread_mutex_destroy + free */ lock_init_resource, /* zero out holders/waiters/non2pl */ lock_uninit_resource, /* assert lists are empty */ lock_res_key_copy, lock_res_key_compare, /* OID_EQ only — type is assert-checked */ lock_res_key_hash, /* delegates to LK_OBJ_LOCK_HASH macro */ NULL /* no insert callback */};The hash function (lock_res_key_hash → LK_OBJ_LOCK_HASH →
lock_get_hash_value) deserves a closer look:
// lock_get_hash_value — src/transaction/lock_manager.cstatic unsigned intlock_get_hash_value (const OID * oid, int htsize){ if (oid->slotid <= 0) { addr = oid->pageid - oid->slotid; } else { next_base_slotid = 2; while (next_base_slotid <= (unsigned) oid->slotid) next_base_slotid *= 2;
addr = oid->pageid + (htsize / next_base_slotid) * (2 * oid->slotid - next_base_slotid + 1); } return (addr % htsize);}This is not a simple modulo hash. For positive slotid values, it
uses a van Emde Boas–style bit-interleave that spreads
consecutive slot IDs within the same page across different hash
buckets. The intent: rows on the same heap page (which tend to be
locked together in scan order) are distributed across the hash
table, reducing bucket-chain contention.
For slotid <= 0 (index keys — unique index last-key OIDs use
slotid = -1), it falls back to a simpler pageid - slotid
computation.
Worked example — hash distribution for rows on page 100:
Assume htsize = 6000 (a small example). Six rows on the same heap
page (pageid=100) with slotid 1–6:
| OID (pageid, slotid) | next_base_slotid | addr formula | addr | bucket (addr % 6000) |
|---|---|---|---|---|
| (100, 1) | 2 | 100 + (6000/2) * (2·1 - 2 + 1) = 100 + 3000 | 3100 | 3100 |
| (100, 2) | 4 | 100 + (6000/4) * (2·2 - 4 + 1) = 100 + 1500 | 1600 | 1600 |
| (100, 3) | 4 | 100 + (6000/4) * (2·3 - 4 + 1) = 100 + 4500 | 4600 | 4600 |
| (100, 4) | 8 | 100 + (6000/8) * (2·4 - 8 + 1) = 100 + 750 | 850 | 850 |
| (100, 5) | 8 | 100 + (6000/8) * (2·5 - 8 + 1) = 100 + 2250 | 2350 | 2350 |
| (100, 6) | 8 | 100 + (6000/8) * (2·6 - 8 + 1) = 100 + 3750 | 3850 | 3850 |
All six rows land in different buckets (3100, 1600, 4600, 850,
2350, 3850). A naive pageid * 31 + slotid hash would cluster them
within 6 adjacent buckets. The van Emde Boas–style stride ensures
that a heap-page scan locking rows 1, 2, 3, … in order hits
widely separated buckets, minimizing hash-chain contention under
concurrent scans.
Hash table structure after several lock requests:
flowchart TB
subgraph HT["m_obj_hash_table (lockfree hashmap, 6000 buckets)"]
B850["bucket 850"]
B1600["bucket 1600"]
B3100["bucket 3100"]
B3850["bucket 3850"]
BDOT["..."]
end
R1["LK_RES\nkey: (100,4) class\ntotal_holders: IX\nholder -> waiter -> non2pl"]
R2["LK_RES\nkey: (100,2) inst\ntotal_holders: X\nholder -> ..."]
R3["LK_RES\nkey: (100,1) inst\ntotal_holders: S\nholder -> ..."]
R4["LK_RES\nkey: (200,3) inst\ntotal_holders: S\nholder -> ..."]
B850 --> R1
B1600 --> R2
B3100 --> R3
R3 -. "hash_next\n(collision)" .-> R4
B3850 -. "empty" .-> EMPTY["NULL"]
Figure 2-2 — Hash table after locking rows from pages 100 and 200.
Each bucket points to an LK_RES chain. Most buckets have a single
entry (the hash function’s goal); collisions (bucket 3100 here:
OID (100,1) and (200,3) happen to collide) are resolved via the
hash_next singly-linked chain. Each LK_RES carries its own
res_mutex — concurrent threads touching different buckets never
contend.
End-to-end: from OID to lock grant (the hash table’s role):
flowchart LR OID["OID (100, 2, 5)\npageid=100, slotid=2, volid=5"] KEY["lock_create_search_key\n-> LK_RES_KEY\n type=INSTANCE\n oid=(100,2,5)\n class_oid=(50,1,5)"] HASH["lock_res_key_hash\n-> bucket 1600"] FIND["m_obj_hash_table\n.find_or_insert\n(lockfree)"] RES["LK_RES found/created\nres_mutex locked"] CHECK["check compat with\ntotal_holders_mode\ntotal_waiters_mode"] OID --> KEY --> HASH --> FIND --> RES --> CHECK
Figure 2-3 — From OID to compatibility check. The lockfree hashmap
resolves the OID to an LK_RES without a global lock. The
per-resource res_mutex is held only after the LK_RES is found,
narrowing the critical section to one resource.
2.4 Step 3 — lock_initialize_object_lock_entry_list
Section titled “2.4 Step 3 — lock_initialize_object_lock_entry_list”Sets up the lockfree freelist for LK_ENTRY allocation (Tier 2 of
the three-tier strategy).
// lock_initialize_object_lock_entry_list — src/transaction/lock_manager.cstatic intlock_initialize_object_lock_entry_list (void){ block_count = 1; block_size = (int) MAX ((lk_Gl.max_obj_locks * LK_ENTRY_RATIO), 1); /* LK_ENTRY_RATIO = 0.1f */
ret = lf_freelist_init (&lk_Gl.obj_free_entry_list, block_count, block_size, &obj_lock_entry_desc, &obj_lock_ent_Ts); return ret;}The LK_ENTRY descriptor differs from the LK_RES descriptor in
one crucial way:
// obj_lock_entry_desc — src/transaction/lock_manager.cLF_ENTRY_DESCRIPTOR obj_lock_entry_desc = { offsetof (LK_ENTRY, stack), offsetof (LK_ENTRY, next), offsetof (LK_ENTRY, del_id), 0, /* no key — not in a hash table */ 0, /* no mutex offset */ LF_EM_NOT_USING_MUTEX, /* ← key difference */ // ... lock_alloc_entry, /* plain malloc */ lock_dealloc_entry, /* plain free */ // ...};LK_ENTRY uses LF_EM_NOT_USING_MUTEX because it does not carry
its own mutex. Protection comes from the owning LK_RES.res_mutex
or LK_TRAN_LOCK.hold_mutex, depending on which list the entry is
being manipulated through.
2.5 Step 4 — lock_initialize_deadlock_detection
Section titled “2.5 Step 4 — lock_initialize_deadlock_detection”// lock_initialize_deadlock_detection — src/transaction/lock_manager.cstatic intlock_initialize_deadlock_detection (void){ pthread_mutex_init (&lk_Gl.DL_detection_mutex, NULL); gettimeofday (&lk_Gl.last_deadlock_run, NULL);
lk_Gl.TWFG_node = (LK_WFG_NODE *) malloc (SIZEOF_LK_WFG_NODE * lk_Gl.num_trans); for (i = 0; i < lk_Gl.num_trans; i++) { lk_Gl.TWFG_node[i].DL_victim = false; lk_Gl.TWFG_node[i].checked_by_deadlock_detector = false; lk_Gl.TWFG_node[i].thrd_wait_stime = 0; }
lk_Gl.TWFG_edge = NULL; /* allocated lazily on first use */ lk_Gl.max_TWFG_edge = 0; lk_Gl.TWFG_free_edge_idx = -1; lk_Gl.global_edge_seq_num = 0; return NO_ERROR;}Key observations:
- WFG nodes are allocated per-transaction — one
LK_WFG_NODEpertran_index. This is why step 4 depends onlk_Gl.num_transbeing set by step 1. - WFG edges are lazily allocated.
TWFG_edgestarts as NULL and is allocated on the first call tolock_add_WFG_edge(when the first actual lock wait occurs). Initial size isLK_MID_TWFG_EDGE_COUNT = 1000; max isMAX_NTRANS * MAX_NTRANS. DL_detection_mutexserializes deadlock detection runs. Only one deadlock scan runs at a time.
2.6 Step 5 — lock_deadlock_detect_daemon_init
Section titled “2.6 Step 5 — lock_deadlock_detect_daemon_init”// lock_deadlock_detect_daemon_init — src/transaction/lock_manager.cvoidlock_deadlock_detect_daemon_init (){ cubthread::looper looper = cubthread::looper (std::chrono::milliseconds (100)); cubthread::entry_callable_task *daemon_task = new cubthread::entry_callable_task (deadlock_detect_task_execute); lock_Deadlock_detect_daemon = cubthread::get_manager ()->create_daemon (looper, daemon_task, "deadlock-detect");}The daemon loops every 100ms and performs three duties per
iteration (in deadlock_detect_task_execute):
- Interrupt check — resume any thread whose transaction was interrupted externally.
- Timeout check — resume any thread whose lock-wait has
exceeded its
wait_msecsdeadline (lock_force_timeout_expired_wait_transactions). - Deadlock detection — if the interval since the last run
exceeds
PRM_ID_LK_RUN_DEADLOCK_INTERVALand at least two threads are suspended, runlock_detect_local_deadlock. This avoids the cost of a full WFG scan when contention is low.
The 100ms loop period is the timeout check granularity — a lock
wait with wait_msecs = 50 will be noticed within 100ms, not
50ms. The deadlock scan itself runs less frequently, gated by the
configurable interval parameter.
2.7 lock_finalize — teardown
Section titled “2.7 lock_finalize — teardown”// lock_finalize — src/transaction/lock_manager.cvoidlock_finalize (void){ free_and_init (lk_Gl.TWFG_node); /* WFG nodes */ lock_finalize_tran_lock_table (); /* per-tran state */ pthread_mutex_destroy (&lk_Gl.DL_detection_mutex); lk_Gl.m_obj_hash_table.destroy (); /* lockfree hashmap */ lf_freelist_destroy (&lk_Gl.obj_free_entry_list); /* lockfree freelist */ lock_deadlock_detect_daemon_destroy (); /* stop daemon */}lock_finalize_tran_lock_table walks each LK_TRAN_LOCK, destroys
its two mutexes, and frees every entry remaining in
lk_entry_pool via plain free. Note: this does not walk the hold
lists — by the time lock_finalize runs, all transactions must
have already released their locks (or been forcibly aborted).
2.8 Three-tier LK_ENTRY allocation at runtime
Section titled “2.8 Three-tier LK_ENTRY allocation at runtime”This is the runtime counterpart of the initialization in §2.2 and
§2.4. Every lock acquisition needs an LK_ENTRY; every release
returns one.
// lock_get_new_entry — src/transaction/lock_manager.cstatic LK_ENTRY *lock_get_new_entry (int tran_index, LF_TRAN_ENTRY * tran_entry, LF_FREELIST * freelist){ LK_TRAN_LOCK *tran_lock = &lk_Gl.tran_lock_table[tran_index];
/* Tier 1: check local pool (no synchronization needed) */ if (tran_lock->lk_entry_pool) { lock_entry = tran_lock->lk_entry_pool; tran_lock->lk_entry_pool = lock_entry->next; tran_lock->lk_entry_pool_count--; return lock_entry; }
/* Tier 2+3: lockfree global freelist (may malloc internally) */ return (LK_ENTRY *) lf_freelist_claim (tran_entry, freelist);}// lock_free_entry — src/transaction/lock_manager.cstatic voidlock_free_entry (int tran_index, LF_TRAN_ENTRY * tran_entry, LF_FREELIST * freelist, LK_ENTRY * lock_entry){ LK_TRAN_LOCK *tran_lock = &lk_Gl.tran_lock_table[tran_index];
/* Tier 1: return to local pool if not full */ if (tran_lock->lk_entry_pool_count < LOCK_TRAN_LOCAL_POOL_MAX_SIZE) { lock_uninit_entry (lock_entry); lock_entry->next = tran_lock->lk_entry_pool; tran_lock->lk_entry_pool = lock_entry; tran_lock->lk_entry_pool_count++; } else { /* Tier 2: retire to global freelist (epoch-based reclamation) */ lf_freelist_retire (tran_entry, freelist, lock_entry); }}Why three tiers instead of just one global freelist?
| Tier | Structure | Sync cost | When used |
|---|---|---|---|
| 1 | lk_entry_pool (per-tran, max 10) | Zero — only the owning transaction touches it | Hot path: short OLTP transactions that acquire/release ≤10 locks |
| 2 | obj_free_entry_list (global lockfree) | Lockfree CAS — lf_freelist_claim/retire | When local pool is empty (alloc) or full (dealloc) |
| 3 | malloc / free | Kernel syscall | When the global freelist’s pre-allocated blocks are exhausted |
For a typical OLTP transaction that touches 3–5 rows, all lock entries come from and return to Tier 1 — the global freelist is never contended. Tier 2 absorbs the spillover from batch operations that exceed 10 concurrent locks per transaction. Tier 3 (malloc) is a cold path that fires only during initial ramp-up or extreme load.
2.9 Lockfree integration — two transaction systems
Section titled “2.9 Lockfree integration — two transaction systems”The lock manager participates in two independent lockfree transaction systems, each with its own epoch counter:
| System ID | Manages | Used by |
|---|---|---|
THREAD_TS_OBJ_LOCK_RES | LK_RES (hash table entries) | m_obj_hash_table.find_or_insert, erase_locked |
THREAD_TS_OBJ_LOCK_ENT | LK_ENTRY (lock entries) | lock_get_new_entry, lock_free_entry |
Every lock manager function that allocates or frees resources begins by obtaining the thread’s participation handle:
LF_TRAN_ENTRY *t_entry_res = thread_get_tran_entry (thread_p, THREAD_TS_OBJ_LOCK_RES);LF_TRAN_ENTRY *t_entry_ent = thread_get_tran_entry (thread_p, THREAD_TS_OBJ_LOCK_ENT);The del_id fields on LK_RES and LK_ENTRY are set when the
object is retired (via lf_freelist_retire or hashmap erase).
A retired object is not physically freed until every concurrent
reader — tracked by the epoch counter — has advanced past the
retirement epoch. This is the standard hazard-pointer / epoch-based
reclamation pattern described in cubrid-lockfree-transaction.md.
What is lockfree vs. mutex-protected:
| Operation | Mechanism |
|---|---|
Hash table lookup/insert (LK_RES by OID) | Lockfree hashmap |
LK_ENTRY alloc from global pool | Lockfree freelist (lf_freelist_claim) |
LK_ENTRY alloc from local pool | No sync (per-tran, single-thread) |
| Memory reclamation of retired entries | Lockfree epoch-based (del_id) |
| Holder/waiter/non2pl list manipulation | res_mutex (per-resource) |
| Transaction hold list manipulation | hold_mutex (per-transaction) |
| Transaction non2pl list manipulation | non2pl_mutex (per-transaction) |
The boundary is clear: finding or creating the LK_RES is
lockfree; manipulating its contents (holder/waiter lists) is
mutex-protected. This means concurrent lock requests on
different resources never contend with each other at the hash
level, and only share-nothing local-pool entries on the allocation
side.
2.10 Chapter summary — key takeaways
Section titled “2.10 Chapter summary — key takeaways”- Initialization order is load-bearing. Transaction table
(step 1) must precede deadlock detection (step 4) because the
WFG is sized by
num_trans. - The hash table size is
MAX(10000, MAX_NTRANS * 300)— a hard-coded formula, not a tunable parameter. The hash function uses a van Emde Boas–style distribution to spread consecutive slot IDs across buckets. - Each transaction gets 10 pre-allocated
LK_ENTRYs at init time. These form the Tier 1 local pool that satisfies most OLTP lock requests without any synchronization. - WFG edges are lazily allocated on first lock-wait, not at startup. This avoids wasting memory in read-only workloads.
- The deadlock daemon loops every 100ms but only runs the full WFG cycle scan when (a) the configured interval has elapsed and (b) at least two threads are suspended. Timeout checks run on every iteration.
- Lockfree boundary: hash lookup/insert and global freelist claim/retire are lockfree. Everything below — holder/waiter list manipulation, transaction hold list updates — is mutex-serialized per-resource or per-transaction.
Chapter 3: Lock Acquisition
Section titled “Chapter 3: Lock Acquisition”This chapter traces every branch of the lock acquisition path — from
the public API (lock_object, lock_scan) through the internal
workhorse lock_internal_perform_lock_object. By the end you should
be able to predict exactly what happens for any combination of
resource state and request parameters.
3.1 Two public entry points — lock_object and lock_scan
Section titled “3.1 Two public entry points — lock_object and lock_scan”The lock manager exposes two primary APIs. They differ in
granularity intent, not mechanism — both eventually call
lock_internal_perform_lock_object.
flowchart LR SCAN["lock_scan(class_oid, class_lock)\nClass-grain: IS or IX\non the class itself"] OBJ["lock_object(oid, class_oid, lock)\nAny grain: root, class, or instance"] SCAN --> IPLK["lock_internal_perform_lock_object\n(the real work)"] OBJ --> PREP["Preparation layer:\n1. Root OID? -> lock root directly\n2. Class lock? -> ensure IX/IS on root first\n3. Instance lock? -> ensure IX/IS on class first"] PREP --> IPLK
Figure 3-1 — Two entry points converge on the same internal
function. lock_scan is a thin wrapper for class-grain locks.
lock_object does multi-granularity preparation before calling
the internal function.
lock_scan is the simpler of the two. It acquires a class-level
lock (typically IS_LOCK for SELECT, IX_LOCK for DML) with no
instance-level work:
// lock_scan — src/transaction/lock_manager.c (simplified)intlock_scan (THREAD_ENTRY * thread_p, const OID * class_oid, int cond_flag, LOCK class_lock){ root_class_entry = lock_get_class_lock (thread_p, oid_Root_class_oid); granted = lock_internal_perform_lock_object ( thread_p, tran_index, class_oid, NULL, /* class_oid = NULL → this IS a class lock */ class_lock, wait_msecs, &class_entry, root_class_entry); return granted;}lock_object handles three cases based on the OID:
flowchart TD
A["lock_object(oid, class_oid, lock)"]
A --> B{"OID_IS_ROOTOID(oid)?"}
B -- "yes" --> C1["Case 1: ROOT_CLASS\nlock root directly\nclass_entry = NULL"]
B -- "no" --> D{"OID_IS_ROOTOID(class_oid)?"}
D -- "yes" --> C2["Case 2: CLASS\n1. Ensure IX/IS on root\n (if not already held)\n2. Lock the class itself"]
D -- "no" --> C3["Case 3: INSTANCE\n1. Ensure IX/IS on class\n (if not already held)\n2. Check if class lock\n subsumes the request\n3. Lock the instance"]
C1 --> IPLK["lock_internal_perform_lock_object"]
C2 --> IPLK
C3 --> IPLK
Figure 3-2 — lock_object three-case dispatch. Each case ensures
the multi-granularity hierarchy is respected: an instance lock
requires an intention lock on its class, a class lock requires an
intention lock on the root.
The intention lock mode is derived from the requested lock:
// lock_object — src/transaction/lock_manager.cif (lock <= S_LOCK) new_class_lock = IS_LOCK; /* read intent */else new_class_lock = IX_LOCK; /* write intent */Key optimization in Case 3: before calling
lock_internal_perform_lock_object for the instance, lock_object
checks lock_is_class_lock_escalated(old_class_lock, lock). If the
class lock already held is strong enough to subsume the instance
request (e.g., holding X on class makes any instance lock
redundant), the function returns LK_GRANTED immediately.
3.2 Caller map — who calls the lock manager and why
Section titled “3.2 Caller map — who calls the lock manager and why”The lock manager is a service layer. Before diving into the internal state machine, it helps to know who calls the APIs and in what context. This map covers both acquisition and release APIs — it is referenced again in Ch. 5.
flowchart TB
subgraph Callers["Upper-layer callers"]
LOC["locator_sr.c (41 calls)\nObject fetch / store / DDL"]
BT["btree.c (13)\nIndex traversal, FK checks"]
HEAP["heap_file.c (8)\nHeap scan, page access"]
LOG["log_manager.c (5)\nCommit / rollback"]
BOOT["boot_sr.c (5)\nServer startup / DDL"]
SCAN["scan_manager.c (4)\nHeap/index scan iteration"]
QE["query_executor.c (4)\nQuery eval, instant lock mode"]
SER["serial.c (4)\nNEXT VALUE / CURRENT VALUE"]
CAT["system_catalog.c (4)\nCatalog access"]
STAT["statistics_sr.c (4)\nALTER INDEX statistics"]
end
subgraph LockAPI["Lock Manager API"]
LO["lock_object\n(any grain)"]
LS["lock_scan\n(class grain)"]
LH["lock_hold_object_instant\n(probe, no block)"]
LC["lock_classes_lock_hint\n(batch class locks)"]
UO["lock_unlock_object"]
UDN["lock_unlock_object_donot_move_to_non2pl"]
UA["lock_unlock_all"]
end
LOC --> LO & UO & UDN & LC
BT --> LO & UDN
HEAP --> LO & LS & UO
LOG --> UA
BOOT --> LO & UO
SCAN --> LO & UO & LH
QE --> LO & UO
SER --> LO & UO
CAT --> LO & UDN
STAT --> LO & UO
Figure 3-3 — Caller map. Numbers in parentheses are approximate
call-site counts. locator_sr.c is the dominant caller — it
mediates all client-initiated object access. log_manager.c is
the exclusive caller of lock_unlock_all (commit/rollback).
Key caller patterns:
| Caller | Lock API used | Typical mode | Context |
|---|---|---|---|
locator_sr.c | lock_object, lock_unlock_object, lock_classes_lock_hint | SCH_M for DDL, S/X for DML, IS/IX intent | Object fetch (xlocator_fetch), store (xlocator_force), class creation/drop |
btree.c | lock_object, lock_unlock_object_donot_move_to_non2pl | S/X on instance OIDs | Index key lock during traversal; FK validation — uses donot_move_to_non2pl because rows failing FK check are never surfaced to client |
heap_file.c | lock_scan, lock_object, lock_unlock_object | IS via lock_scan; S/X per row | Heap scan start (lock_scan for class IS), row-level lock during fetch |
scan_manager.c | lock_object, lock_hold_object_instant, lock_unlock_object | S or X on rows; instant probe | Heap/index scan: lock_hold_object_instant for non-blocking probe, fallback to lock_object if probe fails |
query_executor.c | lock_start_instant_lock_mode, lock_stop_instant_lock_mode | (controls mode, not direct lock) | WHERE clause evaluation — instant mode brackets |
serial.c | lock_object(X_LOCK), lock_unlock_object | X on serial instance OID | NEXT VALUE / CURRENT VALUE — the primary exerciser of the non2pl path under RC |
log_manager.c | lock_unlock_all | (releases everything) | log_commit / log_abort — the only caller of bulk release |
Why scan_manager.c uses lock_hold_object_instant: During
a heap or index scan, the scan manager first tries a non-blocking
instant probe. If the probe fails (lock conflict), it falls back
to the full lock_object path with waiting. This optimistic-first
pattern avoids suspending the thread for rows that are likely
uncontended.
3.3 lock_internal_perform_lock_object — the complete state machine
Section titled “3.3 lock_internal_perform_lock_object — the complete state machine”This is the core function (~700 lines). It handles four top-level scenarios:
flowchart TD
START["lock_internal_perform_lock_object\n(oid, class_oid, lock, wait_msecs)"]
START --> DISPATCH{"instance lock?\n(class_oid != NULL\n&& !root)"}
DISPATCH -- "yes (instance)" --> ESC["lock_escalate_if_needed\n(may promote to class lock)"]
ESC --> SUBSUME{"class lock\nsubsumes request?"}
SUBSUME -- "yes" --> GRANT_FAST["return LK_GRANTED\n(no instance lock needed)"]
SUBSUME -- "no" --> HASH
DISPATCH -- "no (class/root)" --> FAST{"lock_find_class_entry\n(already held?)"}
FAST -- "found" --> CONV["goto lock_tran_lk_entry\n(conversion path)"]
FAST -- "not found" --> HASH
HASH["find_or_insert in hash table\n(lockfree, res_mutex locked on return)"]
HASH --> EMPTY{"res empty?\n(no holder/waiter/non2pl)"}
EMPTY -- "yes" --> PATH_A["PATH A: Fresh resource\nGrant unconditionally"]
EMPTY -- "no" --> SCAN_HOLDER{"scan holder list\nfor my tran_index"}
SCAN_HOLDER -- "not found" --> NEW_REQ["I am NOT a holder"]
SCAN_HOLDER -- "found" --> CONV
NEW_REQ --> COMPAT{"compat with\nholders AND waiters?"}
COMPAT -- "yes" --> PATH_B["PATH B: Compatible\nGrant immediately"]
COMPAT -- "no" --> ZERO{"wait_msecs ==\nZERO_WAIT?"}
ZERO -- "yes" --> PATH_C["PATH C: Timeout\nReturn immediately"]
ZERO -- "no" --> PATH_D["PATH D: Enqueue\nas waiter, suspend"]
CONV --> CONV_CHECK{"new_mode ==\ngranted_mode?"}
CONV_CHECK -- "yes" --> PATH_E["PATH E: Re-entrance\nJust bump count"]
CONV_CHECK -- "no" --> CONV_COMPAT{"compat with\nOTHER holders?"}
CONV_COMPAT -- "yes" --> PATH_F["PATH F: Conversion granted\nUpgrade in place"]
CONV_COMPAT -- "no" --> CONV_ZERO{"wait_msecs ==\nZERO_WAIT?"}
CONV_ZERO -- "yes" --> PATH_C
CONV_ZERO -- "no" --> PATH_G["PATH G: Conversion blocked\nSet blocked_mode, reposition, suspend"]
Figure 3-4 — Complete state machine of
lock_internal_perform_lock_object. Seven paths (A–G) cover every
combination of resource state, holder status, compatibility, and
wait policy.
Entry conditions for each path:
| Path | Am I already a holder? | Resource state | Compatibility | Wait policy | Result |
|---|---|---|---|---|---|
| A | No | Empty (no holder, waiter, non2pl) | N/A | Any | Grant unconditionally |
| B | No | Non-empty | lock_compat(req, total_holders) == YES AND lock_compat(req, total_waiters) == YES | Any | Grant immediately |
| C | No | Non-empty | Fails holder or waiter compat | ZERO_WAIT or FORCE_ZERO_WAIT | Return timeout |
| D | No | Non-empty | Fails holder or waiter compat | wait_msecs > 0 or INFINITE_WAIT | Enqueue as waiter, suspend |
| E | Yes | — | lock_conv(req, granted) == granted (no upgrade needed) | Any | Bump count, return |
| F | Yes | — | lock_conv(req, granted) != granted AND lock_compat(new_mode, group_mode_of_others) == YES | Any | Upgrade in place |
| G | Yes | — | lock_conv(req, granted) != granted AND lock_compat(new_mode, group_mode_of_others) == NO | wait_msecs > 0 or INFINITE_WAIT | Set blocked_mode, reposition (UPR), suspend |
Notes:
- PATH C also applies to holders whose conversion is blocked with zero-wait (the same timeout logic fires from the conversion branch).
- PATH E does not acquire
res_mutexwhen reached via thelock_find_class_entryfast path. That fast path does acquirehold_mutex(per-transaction) to walk the class hold list, but avoids the more expensive per-resourceres_mutex. - PATH F and G both compute
group_modeby walking other holders (excluding self), not using the cachedtotal_holders_mode.
3.4 PATH A — fresh resource, unconditional grant
Section titled “3.4 PATH A — fresh resource, unconditional grant”The fastest path. The LK_RES was just created by find_or_insert
(no prior holders, waiters, or non2pl entries).
// PATH A — src/transaction/lock_manager.c (annotated)if (res_ptr->holder == NULL && res_ptr->waiter == NULL && res_ptr->non2pl == NULL) { lock_initialize_resource_as_allocated (res_ptr, NULL_LOCK);
entry_ptr = lock_get_new_entry (tran_index, t_entry_ent, &lk_Gl.obj_free_entry_list); lock_initialize_entry_as_granted (entry_ptr, tran_index, res_ptr, lock); /* instant lock tracking */ if (is_instant_duration) entry_ptr->instant_lock_count++;
res_ptr->holder = entry_ptr; /* sole holder */ entry_ptr->class_entry = class_entry; /* link to parent */ lock_increment_class_granules (class_entry); lock_insert_into_tran_hold_list (entry_ptr, tran_index); res_ptr->total_holders_mode = lock;
pthread_mutex_unlock (&res_ptr->res_mutex); *entry_addr_ptr = entry_ptr; return LK_GRANTED; }What happens step by step:
- Allocate an
LK_ENTRYfrom the three-tier pool (Ch. 2 §2.8). - Initialize as granted (
granted_mode = lock,blocked_mode = NULL_LOCK). - Splice into
res_ptr->holder(sole entry — the list was empty). - Set
class_entryback-pointer and incrementngranuleson the parent. - Splice into the transaction’s hold list via
lock_insert_into_tran_hold_list. - Set
total_holders_mode = lock(no LUB needed — we are the only holder). - Release
res_mutexand return.
3.5 PATH B — existing resource, compatible, grant immediately
Section titled “3.5 PATH B — existing resource, compatible, grant immediately”The resource already has holders and/or waiters, but the new request
is compatible with both total_holders_mode and
total_waiters_mode.
// PATH B — src/transaction/lock_manager.c (annotated)compat1 = lock_compat (lock, res_ptr->total_waiters_mode);compat2 = lock_compat (lock, res_ptr->total_holders_mode);if (compat1 == LOCK_COMPAT_YES && compat2 == LOCK_COMPAT_YES) { entry_ptr = lock_get_new_entry (...); lock_initialize_entry_as_granted (entry_ptr, tran_index, res_ptr, lock); /* position in holder list per UPR */ lock_position_holder_entry (res_ptr, entry_ptr);
/* update aggregate mode */ res_ptr->total_holders_mode = lock_conv (lock, res_ptr->total_holders_mode);
lock_insert_into_tran_hold_list (entry_ptr, tran_index); lock_update_non2pl_list (thread_p, res_ptr, tran_index, lock);
pthread_mutex_unlock (&res_ptr->res_mutex); return LK_GRANTED; }Key invariant: dual compatibility check. The new request must
pass both total_holders_mode and total_waiters_mode. Checking
only holders would allow a stream of S requests to leapfrog a
waiting X request — the starvation guard from Ch. 1 §1.3.
lock_update_non2pl_list is called on every successful grant.
It walks the resource’s non2pl list and marks any other-transaction
non2pl entries as INCON_NON_TWO_PHASE_LOCK if the newly granted
lock is incompatible with the non2pl entry’s recorded mode. This is
the conflict-detection trigger described in Ch. 5 notes.
3.6 PATH C — incompatible, zero-wait timeout
Section titled “3.6 PATH C — incompatible, zero-wait timeout”When the caller specifies LK_ZERO_WAIT or LK_FORCE_ZERO_WAIT
(conditional lock request), the function does not block:
// PATH C — src/transaction/lock_manager.c (annotated)if (wait_msecs == LK_ZERO_WAIT || wait_msecs == LK_FORCE_ZERO_WAIT) { pthread_mutex_unlock (&res_ptr->res_mutex); if (wait_msecs == LK_ZERO_WAIT) { /* create a temporary blocked entry just for the error message */ p = lock_get_new_entry (...); lock_initialize_entry_as_blocked (p, thread_p, ...); lock_set_error_for_timeout (thread_p, p); /* ← sets ER_LK_... */ lock_free_entry (..., p); } return LK_NOTGRANTED_DUE_TIMEOUT; }The difference between the two zero-wait modes:
LK_ZERO_WAIT(value 0): timeout with error message. A temporaryLK_ENTRYis created just to populate the error diagnostics (xasl_id,bind_index_in_tranvialock_set_error_for_timeout), then immediately freed.LK_FORCE_ZERO_WAIT(value -2): silent timeout, no error set. Used by conditional lock probes that expect failure as a normal outcome.
3.7 PATH D — incompatible, enqueue as waiter
Section titled “3.7 PATH D — incompatible, enqueue as waiter”The standard blocking path. The new request cannot be granted, and the caller is willing to wait.
// PATH D — src/transaction/lock_manager.c (annotated)/* allocate and initialize as blocked */entry_ptr = lock_get_new_entry (...);lock_initialize_entry_as_blocked (entry_ptr, thread_p, tran_index, res_ptr, lock);
/* append at the END of the waiter list (FIFO) */prev = NULL;for (i = res_ptr->waiter; i != NULL; i = i->next) prev = i;if (prev == NULL) res_ptr->waiter = entry_ptr;else prev->next = entry_ptr;
/* update aggregate waiter mode */res_ptr->total_waiters_mode = lock_conv (lock, res_ptr->total_waiters_mode);
goto blocked; /* → suspend thread */At the blocked: label, the thread is suspended:
blocked: /* release res_mutex, then suspend */ pthread_mutex_unlock (&res_ptr->res_mutex); ret_val = lock_suspend (thread_p, entry_ptr, wait_msecs);On resume, the return value of lock_suspend determines the
outcome:
ret_val | Meaning | Action |
|---|---|---|
LOCK_RESUMED | Lock granted by a releaser | Continue to lock_conversion_treatement |
LOCK_RESUMED_TIMEOUT | Wait timed out | lock_internal_perform_unlock_object removes the waiter, return LK_NOTGRANTED_DUE_TIMEOUT |
LOCK_RESUMED_ABORTED | Deadlock victim | Same cleanup, return LK_NOTGRANTED_DUE_ABORTED |
LOCK_RESUMED_INTERRUPT | Server shutdown | Return LK_NOTGRANTED_DUE_ERROR |
Key detail: who grants the waiter? The suspended thread does
not re-check compatibility on resume. Instead, when a holder
releases its lock (lock_internal_perform_unlock_object, Ch. 5),
the releaser calls lock_grant_blocked_holder and
lock_grant_blocked_waiter, which walk the waiter list, check
compatibility against the updated total_holders_mode, and move
grantable waiters to the holder list before waking them. The
resumed thread finds its entry already in the holder list with
granted_mode set.
3.8 PATH E — re-entrance (same tran, same or weaker mode)
Section titled “3.8 PATH E — re-entrance (same tran, same or weaker mode)”When a transaction requests a lock it already holds, and the
conversion result (lock_conv(requested, granted)) equals the
currently granted mode, the request is a no-op upgrade — just
bump the count:
// PATH E — src/transaction/lock_manager.clock_tran_lk_entry: new_mode = lock_conv (lock, entry_ptr->granted_mode); if (new_mode == entry_ptr->granted_mode) { entry_ptr->count += 1; /* instant lock tracking ... */ return LK_GRANTED; /* no res_mutex needed */ }This path does not need res_mutex when reached via the
lock_find_class_entry fast path for class locks.
lock_find_class_entry acquires hold_mutex (the per-transaction
mutex) to safely walk the class_hold_list, but never touches
res_mutex. The count increment itself is safe because only one
thread per transaction can be in the lock manager at a time.
3.9 PATH F — conversion granted immediately
Section titled “3.9 PATH F — conversion granted immediately”The transaction already holds a lock but requests a stronger mode.
The conversion result (new_mode) differs from granted_mode, but
is compatible with all other holders:
// PATH F — src/transaction/lock_manager.c (annotated)/* compute aggregate of OTHER holders (excluding myself) */group_mode = NULL_LOCK;for (i = res_ptr->holder; i != NULL; i = i->next) { if (i != entry_ptr) group_mode = lock_conv (i->granted_mode, group_mode); }
compat1 = lock_compat (new_mode, group_mode);if (compat1 == LOCK_COMPAT_YES) { entry_ptr->granted_mode = new_mode; /* upgrade in place */ entry_ptr->count += 1; res_ptr->total_holders_mode = lock_conv (lock, res_ptr->total_holders_mode); lock_update_non2pl_list (...); pthread_mutex_unlock (&res_ptr->res_mutex); goto lock_conversion_treatement; }Key difference from PATH B: the compatibility check is against
group_mode (other holders only), not total_holders_mode (which
includes this transaction’s own mode). A transaction cannot conflict
with itself.
lock_conversion_treatement [sic — typo in source] handles
post-grant cleanup for class-lock conversions. When a class lock is
upgraded (e.g., IS→X), instance locks that are now subsumed are
removed:
| Old class mode | New class mode | Action |
|---|---|---|
IS | S, SIX, or X | Remove all S instance locks |
IX | SIX | Remove all S instance locks |
IX | X | Remove all X instance locks |
SIX | X | Remove all X instance locks |
3.10 PATH G — conversion blocked (Upgrader Positioning Rule)
Section titled “3.10 PATH G — conversion blocked (Upgrader Positioning Rule)”The conversion is incompatible with other holders, and the transaction must wait. This is the most complex path because the entry stays in the holder list (it still holds the old mode) but is repositioned to signal the pending upgrade:
// PATH G — src/transaction/lock_manager.c (annotated)entry_ptr->blocked_mode = new_mode; /* ← both fields now non-NULL */entry_ptr->count += 1;entry_ptr->thrd_entry = thread_p;
res_ptr->total_holders_mode = lock_conv (lock, res_ptr->total_holders_mode);
/* remove from current position in holder list *//* ... unlink entry_ptr ... */
/* reposition per Upgrader Positioning Rule (UPR) */lock_position_holder_entry (res_ptr, entry_ptr);
goto blocked; /* → suspend thread */Key invariant: granted_mode != NULL_LOCK AND blocked_mode != NULL_LOCK. This is the conversion-wait state from Ch. 1 §1.4.
The entry stays in the holder list (other transactions still see
the old granted_mode in compatibility checks) but is flagged as
pending upgrade.
Upgrader Positioning Rule (UPR):
lock_position_holder_entry places the entry based on compatibility
between its blocked_mode and other holders’ modes. The goal is
stronger than reducing unnecessary waits: the grant pass
(lock_grant_blocked_holder) scans only the front of the holder
list and stops at the first non-upgrading holder, so an upgrader
that ends up behind a plain holder is invisible to every future
release — position is a liveness requirement, not a tuning knob
(see §3.11). The three candidate positions (ta, tb, tc) in the
holder list are evaluated in priority order, and the entry is
inserted at the best available position.
3.11 Upgrader Positioning Rule (UPR) — deep dive into lock_position_holder_entry
Section titled “3.11 Upgrader Positioning Rule (UPR) — deep dive into lock_position_holder_entry”lock_position_holder_entry decides where in the holder list a
new or repositioned entry is inserted. The holder list is not a flat
FIFO — it is ordered so that conversion-waiting holders (upgraders)
sit before non-upgrading holders. This ordering directly affects
which waiters can be granted next when a holder releases its lock.
Why position is correctness — the buried upgrader.
lock_grant_blocked_holder walks the holder list front-to-back and
stops at the first entry whose blocked_mode == NULL_LOCK
(while (holder != NULL && holder->blocked_mode != NULL_LOCK)).
That early stop is what keeps the per-release scan cheap — but it is
sound only if every upgrader sits in the front zone. With naive
append (no UPR), a class-grain timeline reproduces the failure; each
entry reads granted/blocked (– = NULL_LOCK):
| t | event | holder list (front → back) |
|---|---|---|
| 1 | T5 IS, T3 IX, T2 IX — all compatible | T5 IS/– → T3 IX/– → T2 IX/– |
| 2 | T3 upgrades to SIX; T2’s IX conflicts → only blocked_mode is written, suspended in place | T5 IS/– → T3 IX/SIX → T2 IX/– |
| 3 | T2 commits — the grant pass starts at T5: blocked = – ⇒ exits, nobody examined | T5 IS/– → T3 IX/SIX |
| 4 | SIX ∧ IS is compatible — T3 is grantable, yet sleeps until lock timeout or deadlock victim | unchanged — forever |
Note that the example needs intention modes: a plain S → X version
cannot reproduce the bug — a plain S holder in front also blocks
the X target, and once it releases (leaving the list) the upgrader
reaches the front anyway. The liveness failure requires a
remaining plain holder whose granted mode is compatible with the
upgrade target — e.g. IS standing in front of an IX → SIX
upgrader.
Figure 3-2 — The same timeline drawn. Panels ①–③ are the
counterfactual (naive append): the scan stops at plain T5 and T3’s
grantable upgrade sleeps forever. The green strip is the real code:
lock_position_holder_entry moves T3 to the front before it
sleeps, so the very next release grants it.
Two cases based on blocked_mode:
Case 1: blocked_mode == NULL_LOCK (normal grant, not upgrading)
The entry is a plain holder with no pending upgrade. It is inserted after all upgraders but before other plain holders:
// Case 1 — src/transaction/lock_manager.cif (entry_ptr->blocked_mode == NULL_LOCK) { prev = NULL; i = res_ptr->holder; while (i != NULL) { if (i->blocked_mode == NULL_LOCK) break; /* ← stop at the first non-upgrader */ prev = i; i = i->next; } /* insert after prev (i.e., after all upgraders) */ }The resulting holder list order is: [upgraders...] [plain holders...].
Case 2: blocked_mode != NULL_LOCK (conversion wait — upgrader)
The entry is an upgrader. Three candidate insertion points are evaluated in priority order:
// Case 2 — src/transaction/lock_manager.c (annotated)ta = tb = tc = NULL; /* three candidates */
for each holder i in the list: if (i is also an upgrader) /* i->blocked_mode != NULL_LOCK */ { /* ta: first upgrader whose blocked_mode is COMPATIBLE with my blocked_mode */ compat1 = lock_compat (me->blocked_mode, i->blocked_mode); if (ta == NULL && compat1 == YES) ta = i;
/* tb: first upgrader where my blocked_mode is compatible with its granted_mode, BUT its blocked_mode is INCOMPATIBLE with my granted_mode */ compat1 = lock_compat (me->blocked_mode, i->granted_mode); compat2 = lock_compat (i->blocked_mode, me->granted_mode); if (ta == NULL && tb == NULL && compat1 == YES && compat2 == NO) tb = i; } else /* i is a plain holder */ { /* tc: first plain (non-upgrading) holder */ if (tc == NULL) tc = i; }
/* choose insertion point by priority: ta > tb > tc */if (ta != NULL) prev = tap; /* before ta */else if (tb != NULL) prev = tbp; /* before tb */else if (tc != NULL) prev = tcp; /* before tc *//* else: append at the end (prev from the loop) */The three candidates explained:
| Candidate | Condition | Meaning | Why insert before it |
|---|---|---|---|
| ta | My blocked_mode is compatible with i’s blocked_mode | I and this upgrader want compatible upgrades | We can both proceed — placing me first lets the releaser grant me sooner without blocking the other |
| tb | My blocked_mode is compat with i’s granted_mode, BUT i’s blocked_mode is incompat with my granted_mode | I don’t conflict with what it has, but it conflicts with what I have | If I go after it, it would block on my granted_mode and I’d block on its blocked_mode — deadlock. Putting me first avoids this |
| tc | First plain (non-upgrading) holder | Boundary between upgraders and plain holders | All upgraders should be ahead of non-upgraders |
Priority ta > tb > tc: ta is the best position (compatible
upgraders grouped together). tb is the fallback (avoid deadlock
between upgraders). tc is the last resort (at least stay in the
upgrader zone).
The tb case, worked. Say the entry being positioned is
new = IX→SIX and the existing upgrader is i = IS→X:
SIX ∧ IS = ✓ (i’s held mode does not block new) but
X ∧ IX = ✗ (new’s held mode blocks i for as long as new holds).
The grant pass grants a compatible prefix of the upgrader zone and
stops at the first failure, so the order decides everything:
| Order in the upgrader zone | What the grant pass does | Outcome |
|---|---|---|
[ new, i ] — tb’s pick | new: SIX ∧ IS = ✓ → granted; i gets its turn at new’s commit | progress |
[ i, new ] | i: X ∧ IX = ✗ → stop — new never examined, though grantable | both sleep — wedged until timeout |
A one-directionally conflicting pair has exactly one viable order: whoever can be granted while the other still holds goes first.
Worked example:
Holder list before UPR: [T3: granted=S, blocked=X] → [T5: granted=IS, blocked=NULL] → [T8: granted=S, blocked=NULL] ^^ upgrader ^^ plain holder ^^ plain holder
New entry: T7 wants to upgrade (granted=IX, blocked=SIX)Evaluation:
- T3 (upgrader):
lock_compat(SIX, X)→ NO. Not ta.lock_compat(SIX, S)→ NO. Not tb either. - T5 (plain): tc = T5 (first plain holder).
Result: no ta, no tb, tc = T5. Insert before T5:
After UPR: [T3: granted=S, blocked=X] → [T7: granted=IX, blocked=SIX] → [T5: granted=IS, blocked=NULL] → [T8: granted=S, blocked=NULL] ^^ upgrader ^^ upgrader (new) ^^ plain holder ^^ plain holderHolder list layout invariant maintained by UPR:
flowchart LR
subgraph HL["Holder list order"]
direction LR
U1["upgrader 1\ngranted + blocked"]
U2["upgrader 2\ngranted + blocked"]
UN["..."]
P1["plain holder 1\ngranted only"]
P2["plain holder 2\ngranted only"]
PN["..."]
U1 --> U2 --> UN --> P1 --> P2 --> PN
end
Figure 3-5 — Holder list ordering invariant. Upgraders (entries
with blocked_mode != NULL_LOCK) are always positioned before
plain holders. Among upgraders, the order is determined by
compatibility of their blocked_modes to minimize inter-upgrader
deadlocks.
Why UPR matters for correctness: When a holder releases its
lock and lock_grant_blocked_holder scans the holder list to see
if any upgrader can now be granted, it walks front-to-back. If
upgraders with compatible goals are grouped at the front, they can
be granted together in one pass. Without UPR, a newly arriving
upgrader at the back might deadlock with an earlier upgrader because
the grant order would be wrong.
3.12 Concrete scenario — three transactions contend for one row
Section titled “3.12 Concrete scenario — three transactions contend for one row”flowchart LR
subgraph T0["Initial"]
H0["holder: T5 (S)"]
W0["waiter: --"]
AGG0["holders=S\nwaiters=NULL"]
end
subgraph T1["T7 req S (PATH B)"]
H1["holder: T5 (S), T7 (S)"]
W1["waiter: --"]
AGG1["holders=S\nwaiters=NULL"]
end
subgraph T2["T9 req X (PATH D)"]
H2["holder: T5 (S), T7 (S)"]
W2["waiter: T9 (X)"]
AGG2["holders=S\nwaiters=X"]
end
subgraph T3["T12 req S (PATH D!)"]
H3["holder: T5 (S), T7 (S)"]
W3["waiter: T9 (X), T12 (S)"]
AGG3["holders=S\nwaiters=SIX"]
end
T0 --> T1 --> T2 --> T3
Figure 3-6 — Four-step contention scenario. T7’s S request is
compatible (PATH B). T9’s X request fails both holder and waiter
compat (PATH D). T12’s S request is compatible with
total_holders_mode (S vs S = yes) but incompatible with
total_waiters_mode (S vs X = no) — so T12 also blocks (PATH D),
demonstrating the starvation guard: T9’s pending X prevents T12
from leapfrogging.
3.13 Diagnostic fields — xasl_id and bind_index_in_tran
Section titled “3.13 Diagnostic fields — xasl_id and bind_index_in_tran”At the very end of a successful acquisition (all paths that return
LK_GRANTED):
// lock_internal_perform_lock_object — src/transaction/lock_manager.cend: if (entry_ptr != NULL && ret_val == LK_GRANTED) lock_event_set_xasl_id_to_entry (tran_index, entry_ptr);This stamps the entry with the currently executing query’s plan ID
and statement index. These are read-only diagnostic fields — they
do not affect lock behavior. Their sole consumer is the event-log
subsystem (event_log_sql_string, event_log_bind_values) which
prints them in timeout and deadlock reports so a DBA can identify
which query caused the lock wait.
3.14 Performance analysis — acquisition cost by path
Section titled “3.14 Performance analysis — acquisition cost by path”Not all paths cost the same. The table below ranks them from cheapest to most expensive:
| Path | Cost | Synchronization | When it fires |
|---|---|---|---|
| E (re-entrance, class fast path) | O(1) | hold_mutex only — no res_mutex, no hash lookup | Same tran re-requests same class with same-or-weaker mode. Most common in OLTP. |
| E (re-entrance, via hash) | O(1) + hash lookup | res_mutex | Same tran re-requests same instance with same-or-weaker mode. |
| A (fresh resource) | O(1) + hash insert | Lockfree find_or_insert + res_mutex | First lock on a never-before-locked OID. |
| B (compatible grant) | O(1) compat check | res_mutex | New holder, compatible with aggregate modes. |
| F (conversion granted) | O(holders) | res_mutex | Upgrade needed, must compute group_mode excluding self. |
| D (enqueue + suspend) | O(waiters) + thread suspend | res_mutex + thread mutex + context switch | Incompatible — thread goes to sleep. |
| G (conversion blocked) | O(holders) + suspend | res_mutex + UPR repositioning + context switch | Upgrade incompatible with other holders. |
Key performance observations:
1. The class-entry fast path (PATH E) avoids the hash table entirely.
lock_object → lock_find_class_entry (hold_mutex only) → found → lock_tran_lk_entry → count++ → returnThis is the hottest path in OLTP: every DML statement re-requests
the class intention lock (IS/IX) that was already acquired by
lock_scan. The cost is one hold_mutex lock/unlock + one linked
list walk of class_hold_list (typically < 10 entries) + one
integer increment. No hash lookup, no res_mutex.
2. The O(1) aggregate-mode check eliminates per-holder scanning for PATH B.
Without total_holders_mode / total_waiters_mode, every new lock
request would need to walk the entire holder list to check
compatibility — O(holders per resource). With the cached aggregates,
PATH B does two table lookups:
compat1 = lock_Comp[lock][res_ptr->total_waiters_mode]; // O(1)compat2 = lock_Comp[lock][res_ptr->total_holders_mode]; // O(1)The trade-off: total_holders_mode must be recomputed on every
holder-list change (O(holders) in Ch. 5 §5.6). But since grant is
far more frequent than release in steady-state OLTP, the net win is
significant.
3. Escalation check on every instance lock request is cheap.
lock_escalate_if_needed → lock_check_escalate is:
- One
hold_mutexlock/unlock - One integer comparison (
ngranules < threshold) - Return false (the common case — no escalation)
Total: ~100ns. This is acceptable because the alternative — unbounded instance lock accumulation — has far worse consequences (memory pressure, O(n) hold-list walks).
4. The hash function’s slot-distribution design pays off under concurrent scans.
Consecutive OIDs on the same heap page (slotid 1, 2, 3, …) land
in different hash buckets (Ch. 2 §2.3, Figure 2-2). Two threads
scanning the same table in parallel hit different res_mutex
instances, so they never contend at the hash level. A naive hash
would cluster them in the same bucket, serializing the scans.
3.15 Chapter summary — key takeaways
Section titled “3.15 Chapter summary — key takeaways”- Two public APIs, one internal function.
lock_scanhandles class-grain intention locks;lock_objecthandles all three resource types with multi-granularity preparation. Both converge onlock_internal_perform_lock_object. - Seven paths (A–G) cover every combination of resource state, holder status, compatibility, and wait policy.
- Dual compatibility check (PATH B/D): a new request must be
compatible with both
total_holders_modeandtotal_waiters_mode. The waiter check is the starvation guard. - Re-entrance (PATH E) is a count bump with no mutex needed when reached via the class-entry fast path.
- Conversion (PATH F/G) checks against other holders only
(excluding self). Blocked conversions set both
granted_modeandblocked_mode, leaving the entry in the holder list but repositioned per UPR. - Waiters are granted by the releaser, not by the waking thread. The resume path does not re-check compatibility.
lock_update_non2pl_listis called on successful grants where the resource already has prior state (PATH B, F) to mark conflicting non2pl entries for decache notification. PATH A (fresh resource) skips this — the non2pl list is empty.
Chapter 4: Lock Conversion and Re-entrance
Section titled “Chapter 4: Lock Conversion and Re-entrance”Chapter 3 covered where conversion and re-entrance appear in the acquisition state machine (PATH E/F/G). This chapter zooms in on the mechanics: how the conversion table works, how re-entrance counting interacts with unlock, how blocked conversions are granted, and what cleanup happens after a class-lock upgrade.
4.1 When conversion happens — the trigger condition
Section titled “4.1 When conversion happens — the trigger condition”Conversion occurs when a transaction requests a lock on a resource
it already holds. The function detects this by scanning the
holder list for a matching tran_index:
// lock_internal_perform_lock_object — src/transaction/lock_manager.centry_ptr = res_ptr->holder;while (entry_ptr != NULL) { if (entry_ptr->tran_index == tran_index) break; /* ← found: I am already a holder */ entry_ptr = entry_ptr->next; }If found, the code jumps to lock_tran_lk_entry: — the conversion
path. If not found, it is a new lock request (PATH A–D).
For class locks, there is a fast path that avoids the hash table
entirely: lock_find_class_entry walks the transaction’s
class_hold_list (O(classes held) instead of O(holders on the
resource)):
// lock_find_class_entry — src/transaction/lock_manager.centry_ptr = tran_lock->class_hold_list;while (entry_ptr != NULL) { if (OID_EQ (&entry_ptr->res_head->key.oid, class_oid)) break; entry_ptr = entry_ptr->tran_next; }This fast path bypasses find_or_insert and res_mutex for the
re-entrance case (PATH E). It does acquire hold_mutex (the
per-transaction mutex) to walk the list safely, but avoids the
heavier per-resource res_mutex. This is the most common
conversion scenario.
4.2 The conversion table — lock_Conv[requested][current]
Section titled “4.2 The conversion table — lock_Conv[requested][current]”The conversion result is the least upper bound (LUB) of the requested mode and the currently held mode:
new_mode = lock_conv (lock, entry_ptr->granted_mode);// equivalent to: new_mode = lock_Conv[lock][entry_ptr->granted_mode];The full 12×12 table is in lock_table.c. The practically
important subset:
| requested ↓ \ current → | IS | S | IX | SIX | U | X |
|---|---|---|---|---|---|---|
| IS | IS | S | IX | SIX | — | X |
| S | S | S | SIX | SIX | U | X |
| IX | IX | SIX | IX | SIX | — | X |
| SIX | SIX | SIX | SIX | SIX | — | X |
| U | — | U | — | — | U | X |
| X | X | X | X | X | X | X |
Reading the table: lock_Conv[S][IX] = SIX means “if I hold IX
and request S, the result is SIX.”
NA_LOCK entries (marked —) mean the combination is undefined
and triggers an assert failure. These are combinations that should
never arise in practice (e.g., U_LOCK is only requested when
already holding S_LOCK).
4.3 Re-entrance — PATH E in detail
Section titled “4.3 Re-entrance — PATH E in detail”Re-entrance is the most common conversion: the transaction requests a mode that is equal to or weaker than what it already holds. The conversion result equals the current mode, so no actual upgrade is needed.
// PATH E — lock_tran_lk_entry in lock_internal_perform_lock_objectnew_mode = lock_conv (lock, entry_ptr->granted_mode);if (new_mode == entry_ptr->granted_mode) { entry_ptr->count += 1; /* ← just bump the counter */ // ... instant lock tracking ... return LK_GRANTED; }The count field is critical for correct unlock behavior. Each
lock_object call increments count; each lock_unlock_object
call decrements it. The lock is only actually released when
count reaches zero. This prevents a nested caller from
accidentally releasing a lock that an outer caller still needs.
Example — nested re-entrance:
lock_object(oid=R1, mode=S) → count=1, granted_mode=S lock_object(oid=R1, mode=S) → count=2, granted_mode=S (PATH E) lock_unlock_object(R1, S) → count=1, lock STAYSlock_unlock_object(R1, S) → count=0, lock RELEASED4.4 Immediate conversion — PATH F in detail
Section titled “4.4 Immediate conversion — PATH F in detail”When new_mode != granted_mode (a real upgrade is needed), the
function must check compatibility with other holders only
(excluding self):
// PATH F — lock_internal_perform_lock_object (annotated)group_mode = NULL_LOCK;for (i = res_ptr->holder; i != NULL; i = i->next) { if (i != entry_ptr) /* ← skip myself */ group_mode = lock_conv (i->granted_mode, group_mode); }
compat1 = lock_compat (new_mode, group_mode);if (compat1 == LOCK_COMPAT_YES) { /* upgrade in place */ entry_ptr->granted_mode = new_mode; entry_ptr->count += 1; res_ptr->total_holders_mode = lock_conv (lock, res_ptr->total_holders_mode); lock_update_non2pl_list (...); goto lock_conversion_treatement; }Why exclude self? A transaction cannot conflict with its own
lock. If T holds S and requests X, the total_holders_mode might
be SIX (because another transaction holds IX). But group_mode
(other holders only) might be IX, and lock_compat(X, IX) == NO —
so the conversion blocks. Without the self-exclusion, the
compatibility check would use total_holders_mode = SIX and
incorrectly conclude the conversion is impossible even when the
only S holder is the requesting transaction itself.
Concrete example:
Resource R: holder list = [T5(IX), T7(S)]T7 requests X_LOCK on R.
new_mode = lock_conv(X, S) = Xgroup_mode = T5's granted_mode = IXlock_compat(X, IX) = NO → PATH G (blocked)
But if R only had: [T7(S)]group_mode = NULL_LOCKlock_compat(X, NULL) = YES → PATH F (granted, upgrade S→X)4.5 Blocked conversion — PATH G in detail
Section titled “4.5 Blocked conversion — PATH G in detail”When the upgrade is incompatible with other holders, the entry enters the dual-field state — the defining characteristic of a blocked conversion:
// PATH G — lock_internal_perform_lock_object (annotated)entry_ptr->blocked_mode = new_mode; /* upgrade target */entry_ptr->count += 1;entry_ptr->thrd_entry = thread_p;
/* update total_holders_mode to reflect the REQUESTED mode */res_ptr->total_holders_mode = lock_conv (lock, res_ptr->total_holders_mode);
/* remove from current position, reposition per UPR */// ... unlink entry_ptr from holder list ...lock_position_holder_entry (res_ptr, entry_ptr);
goto blocked; /* → lock_suspend */Key invariant during blocked conversion:
granted_mode= old mode (still effective — other transactions see this in compatibility checks)blocked_mode= new mode (the upgrade being waited for)- Entry stays in the holder list (not the waiter list)
- Thread is suspended
total_holders_mode includes the requested mode even though it
is not yet granted. This is conservative: it ensures that new
requests see the strongest possible contention picture. The trade-off
is that some requests that could theoretically coexist with the old
mode will be blocked. This is intentional — it prevents starvation
of the upgrader.
4.6 How blocked conversions are granted — lock_grant_blocked_holder
Section titled “4.6 How blocked conversions are granted — lock_grant_blocked_holder”When a holder releases its lock (Ch. 5), the release path calls
lock_grant_blocked_holder. This function scans the holder list
front-to-back (which is why UPR positions upgraders at the
front):
// lock_grant_blocked_holder — src/transaction/lock_manager.c (annotated)static voidlock_grant_blocked_holder (THREAD_ENTRY * thread_p, LK_RES * res_ptr){ holder = res_ptr->holder; while (holder != NULL && holder->blocked_mode != NULL_LOCK) { /* compute aggregate of all holders AFTER this one */ mode = NULL_LOCK; for (h = holder->next; h != NULL; h = h->next) mode = lock_conv (h->granted_mode, mode);
compat = lock_compat (holder->blocked_mode, mode); if (compat == LOCK_COMPAT_NO) break; /* ← stop granting */
/* grant the conversion */ holder->granted_mode = holder->blocked_mode; holder->blocked_mode = NULL_LOCK;
/* reposition: move from upgrader zone to plain-holder zone */ // ... reposition per UPR ...
lock_update_non2pl_list (...); lock_resume (holder, LOCK_RESUMED); /* wake up */
holder = next; }}The algorithm walks upgraders only (stops at the first entry
with blocked_mode == NULL_LOCK). For each upgrader, it computes
the aggregate of holders after it in the list (not before). This
asymmetry is what UPR enables: compatible upgraders at the front
can be granted together because their blocked_modes don’t conflict
with the modes behind them.
Grant sequence:
granted_mode = blocked_mode(the upgrade takes effect)blocked_mode = NULL_LOCK(no longer an upgrader)- Reposition to the plain-holder zone via UPR
lock_update_non2pl_listfor conflict detectionlock_resume(LOCK_RESUMED)wakes the suspended thread
4.7 How blocked waiters are granted — lock_grant_blocked_waiter
Section titled “4.7 How blocked waiters are granted — lock_grant_blocked_waiter”After granting blocked holders, the release path calls
lock_grant_blocked_waiter. This function scans the waiter list
front-to-back:
// lock_grant_blocked_waiter — src/transaction/lock_manager.c (annotated)static intlock_grant_blocked_waiter (THREAD_ENTRY * thread_p, LK_RES * res_ptr){ waiter = res_ptr->waiter; while (waiter != NULL) { compat = lock_compat (waiter->blocked_mode, res_ptr->total_holders_mode); if (compat == LOCK_COMPAT_NO) break; /* ← stop granting */
/* grant the waiter */ waiter->granted_mode = waiter->blocked_mode; waiter->blocked_mode = NULL_LOCK;
/* move from waiter list to holder list */ // ... remove from waiter list ... lock_position_holder_entry (res_ptr, waiter);
/* update aggregates */ res_ptr->total_holders_mode = lock_conv (waiter->granted_mode, res_ptr->total_holders_mode);
/* add to transaction hold list */ lock_insert_into_tran_hold_list (waiter, owner_tran_index);
lock_update_non2pl_list (...); lock_resume (waiter, LOCK_RESUMED);
waiter = next; }
/* recompute total_waiters_mode from remaining waiters */ if (change_total_waiters_mode) { mode = NULL_LOCK; for (w = res_ptr->waiter; w != NULL; w = w->next) mode = lock_conv (w->blocked_mode, mode); res_ptr->total_waiters_mode = mode; }}Key differences from lock_grant_blocked_holder:
| Aspect | lock_grant_blocked_holder | lock_grant_blocked_waiter |
|---|---|---|
| Scans | Holder list (upgraders) | Waiter list |
| Compat check against | Aggregate of holders after this entry | total_holders_mode (all holders) |
| On grant | granted_mode = blocked_mode (in-place) | Move from waiter list to holder list |
| Stop condition | First incompatible upgrader | First incompatible waiter |
Order of the two calls matters: blocked holders (upgraders) are
granted first, which may change total_holders_mode. Then blocked
waiters are checked against the updated total_holders_mode. This
ensures upgraders have priority over new waiters — they were here
first and already hold a weaker lock.
4.8 Post-conversion cleanup — lock_conversion_treatement
Section titled “4.8 Post-conversion cleanup — lock_conversion_treatement”After a class-lock conversion is granted (PATH F, or PATH G after resume), instance locks that are now subsumed by the stronger class lock are removed:
// lock_conversion_treatement — src/transaction/lock_manager.cif (entry_ptr->res_head->key.type == LOCK_RESOURCE_CLASS && lock_conversion == true) { new_mode = entry_ptr->granted_mode; switch (old_mode) { case IS_LOCK: if (IS_WRITE_EXCLUSIVE_LOCK (new_mode) || new_mode == S_LOCK || new_mode == SIX_LOCK) lock_remove_all_inst_locks (thread_p, tran_index, oid, S_LOCK); break;
case IX_LOCK: if (new_mode == SIX_LOCK) lock_remove_all_inst_locks (..., S_LOCK); else if (IS_WRITE_EXCLUSIVE_LOCK (new_mode)) lock_remove_all_inst_locks (..., X_LOCK); break;
case SIX_LOCK: /* new_mode == X_LOCK */ lock_remove_all_inst_locks (..., X_LOCK); break; } }What gets removed and why:
| Old class lock | New class lock | Remove instance locks | Reason |
|---|---|---|---|
| IS | S, SIX, X | All S instance locks | Class S subsumes instance S |
| IX | SIX | All S instance locks | SIX = S + IX; the S part subsumes instance S |
| IX | X | All X instance locks | Class X subsumes everything |
| SIX | X | All X instance locks | Was already S+IX; now full X |
This is a memory optimization: the subsumed instance LK_ENTRYs
are freed and ngranules on the class entry is decremented. Without
this cleanup, a class-lock escalation would leave behind redundant
instance entries that waste memory and slow down hold-list traversal.
4.9 Concrete scenario — conversion lifecycle
Section titled “4.9 Concrete scenario — conversion lifecycle”flowchart LR
subgraph S1["1. T7 holds S on row R"]
H1["holder: T7(S, count=1)"]
AGG1["holders=S"]
end
subgraph S2["2. T5 acquires S on R (PATH B)"]
H2["holder: T7(S), T5(S)"]
AGG2["holders=S"]
end
subgraph S3["3. T7 requests X (PATH G)"]
H3["holder: T7(S, blocked=X), T5(S)"]
AGG3["holders=X\n(conservative)"]
end
subgraph S4["4. T5 releases S"]
H4["holder: T7(X, count=2)"]
AGG4["holders=X"]
end
S1 --> S2 --> S3 --> S4
Figure 4-1 — Conversion lifecycle. Step 3: T7’s S→X conversion
is blocked by T5’s S (PATH G). total_holders_mode becomes X
(conservative). Step 4: T5 releases, lock_grant_blocked_holder
grants T7’s upgrade: granted_mode becomes X, blocked_mode
becomes NULL_LOCK, T7’s thread resumes. Note count=2 because
both the original S request and the X request each incremented it.
4.10 Performance analysis — conversion cost
Section titled “4.10 Performance analysis — conversion cost”Re-entrance (PATH E) is nearly free; real conversion is O(holders).
| Scenario | Cost | Bottleneck |
|---|---|---|
| Re-entrance (same or weaker mode) | O(1): lock_conv table lookup + count++ | None — no list walk, no compat check |
| Conversion granted (PATH F) | O(holders): must walk holder list to compute group_mode excluding self | The self-exclusion loop. A resource with 100 holders means 100 iterations. |
| Conversion blocked (PATH G) | O(holders) + UPR repositioning + suspend | Same as PATH F plus the UPR walk and a context switch |
Grant cascade (lock_grant_blocked_holder) | O(upgraders × holders-after-each) | On resume, each upgrader checks compat against all holders behind it. With k upgraders and n total holders, worst case is O(k × n). |
Why group_mode computation is O(holders), not O(1):
For a new lock request (PATH B), the aggregate
total_holders_mode suffices — O(1) check. But for conversion
(PATH F/G), the aggregate includes the requesting transaction’s
own mode, which must be excluded. There is no way to “subtract”
one mode from an aggregate LUB in O(1) — the LUB is not invertible.
So the code walks the holder list, skipping entry_ptr, and
recomputes the aggregate from scratch.
This is acceptable because:
- Conversion is less frequent than new grants (most lock requests are fresh or re-entrant).
- The holder count per resource is typically small (< 10 in OLTP).
- The walk is under
res_mutex, which is per-resource — other resources are unaffected.
Post-conversion cleanup (lock_conversion_treatement) is
O(instance locks):
When a class lock is upgraded (IS→S, IX→X, etc.),
lock_remove_all_inst_locks walks the transaction’s
inst_hold_list and removes every instance lock of the specified
mode for that class. This is O(total instance locks held by the
transaction), not O(instances of that class) — the list is not
partitioned by class. For a transaction holding 5000 instance locks
across 3 classes, the walk examines all 5000 even if only 1000
belong to the escalated class.
4.11 Chapter summary — key takeaways
Section titled “4.11 Chapter summary — key takeaways”- Conversion is triggered when a holder requests the same
resource again with a different mode. The conversion table
(
lock_Conv) computes the LUB of the two modes. - Re-entrance (PATH E) is the common case — same or weaker
mode, just bump
count. Nores_mutexneeded on the class fast path. countcontrols unlock depth. Eachlock_objectincrements it; eachlock_unlock_objectdecrements it. The lock is only released whencountreaches zero.- Immediate conversion (PATH F) checks against other holders only (self-exclusion). This prevents a transaction from dead-locking with itself.
- Blocked conversion (PATH G) sets both
granted_modeandblocked_mode, stays in the holder list, and is repositioned per UPR.total_holders_modeconservatively includes the requested mode. - Grant order: upgraders first, then waiters.
lock_grant_blocked_holderscans front-to-back among upgraders, thenlock_grant_blocked_waiterscans the waiter list against the updatedtotal_holders_mode. - Post-conversion cleanup removes instance locks subsumed by the upgraded class lock, freeing memory and reducing hold-list size.
Chapter 5: Lock Release and the NON2PL Protocol
Section titled “Chapter 5: Lock Release and the NON2PL Protocol”This chapter traces the full unlock path — from the public API through the internal release mechanics, then into the NON2PL protocol that tracks early-released locks under READ COMMITTED.
5.1 Three public unlock entry points
Section titled “5.1 Three public unlock entry points”flowchart LR UO["lock_unlock_object\n(oid, class_oid, lock, force)"] UDN["lock_unlock_object_donot_move_to_non2pl\n(oid, class_oid, lock)"] UA["lock_unlock_all\n(commit/rollback)"] INT["lock_internal_perform_unlock_object\n(entry, release_flag, move_to_non2pl)"] UO --> INT UDN --> INT UA --> INT
Figure 5-1 — Three unlock entry points. They differ in
release_flag and move_to_non2pl parameters, which control
whether the lock is fully removed and whether a non2pl shadow
record is created.
| Entry point | release_flag | move_to_non2pl | When used |
|---|---|---|---|
lock_unlock_object(force=true) | false | true | Forced mid-transaction release |
lock_unlock_object(force=false) | false | true | Isolation-gated release (see §5.2) |
lock_unlock_object_donot_move_to_non2pl | false | false | Btree FK checks, catalog lookups — caller knows no stale-cache risk |
lock_unlock_all | true | false | Commit/rollback — unconditional release of everything |
5.2 Unlock caller patterns (see Ch. 3 §3.2 for the full map)
Section titled “5.2 Unlock caller patterns (see Ch. 3 §3.2 for the full map)”The full caller map (Figure 3-3) is in Ch. 3 §3.2. The unlock-specific patterns worth highlighting:
| Unlock API | Primary callers | Why this variant |
|---|---|---|
lock_unlock_object(force=false) | locator_sr.c, heap_file.c, scan_manager.c | Standard isolation-gated release — the four-stage filter in §5.3 applies |
lock_unlock_object(force=true) | locator_sr.c (DDL rollback), serial.c | Forced release bypassing isolation check — caller knows the lock must go |
lock_unlock_object_donot_move_to_non2pl | btree.c (FK checks), system_catalog.c, locator_sr.c | Rows never surfaced to client — no stale-cache risk, non2pl tracking skipped |
lock_unlock_all | log_manager.c only | Commit/rollback — unconditional bulk release |
lock_stop_instant_lock_mode | query_executor.c | Releases all instant-duration locks at end of WHERE evaluation |
5.3 lock_unlock_object — the isolation policy gate
Section titled “5.3 lock_unlock_object — the isolation policy gate”This is the most commonly called unlock function. It implements the isolation-level policy before delegating to the internal function:
// lock_unlock_object — src/transaction/lock_manager.c (annotated)voidlock_unlock_object (THREAD_ENTRY * thread_p, const OID * oid, const OID * class_oid, LOCK lock, bool force){ if (force == true) { entry_ptr = lock_find_tran_hold_entry (...); lock_internal_perform_unlock_object (thread_p, entry_ptr, false, true); return; }
/* force == false: apply isolation policy */ if (lock != S_LOCK) { /* These will not be released. */ return; /* ← X, IX, SIX, etc. stay to commit */ }
isolation = logtb_find_isolation (tran_index); switch (isolation) { case TRAN_SERIALIZABLE: case TRAN_REPEATABLE_READ: return; /* nothing to do — S locks held to commit */
case TRAN_READ_COMMITTED: lock_unlock_object_by_isolation (thread_p, tran_index, isolation, class_oid, oid); break; }}Three filters that prevent release:
flowchart TD
REQ["lock_unlock_object(force=false)"]
F1{"lock != S_LOCK?"}
F1 -- "yes (X, IX, ...)" --> KEEP1["KEEP: write locks\nnever released early"]
F1 -- "no (S_LOCK)" --> F2{"isolation >= RR?"}
F2 -- "yes (RR/SER)" --> KEEP2["KEEP: S locks\nheld to commit"]
F2 -- "no (RC)" --> F3["lock_unlock_object_by_isolation"]
F3 --> F3A{"root or class OID?"}
F3A -- "yes" --> KEEP3["KEEP: class locks\nnever released early"]
F3A -- "no" --> F3B{"mvcc_is_mvcc_disabled_class?"}
F3B -- "yes" --> REL["RELEASE:\nlock_unlock_shared_inst_lock\n(move_to_non2pl=true)"]
F3B -- "no (MVCC table)" --> KEEP4["KEEP: snapshot-based,\nno lock to release"]
Figure 5-2 — Four-stage filter in the non-forced unlock path. Only S locks, under RC, on MVCC-disabled instance OIDs, are actually released. Everything else is kept to commit.
Key insight: In a typical MVCC workload, the vast majority of
lock_unlock_object(force=false) calls hit KEEP4 — MVCC-enabled
tables use snapshots, not locks, for reads. The only locks that
reach the actual release path are S locks on MVCC-disabled classes
(serials, root class, collation class, HA apply-info class).
5.4 lock_internal_perform_unlock_object — the complete flow
Section titled “5.4 lock_internal_perform_unlock_object — the complete flow”This is the internal workhorse. It handles three cases depending on where the entry is found:
flowchart TD
START["lock_internal_perform_unlock_object\n(entry_ptr, release_flag, move_to_non2pl)"]
START --> DEC{"release_flag == false?"}
DEC -- "yes" --> COUNT["count--\n(+ instant_lock_count-- if instant mode)"]
COUNT --> EARLY{"blocked_mode == NULL\nAND count > 0?"}
EARLY -- "yes" --> RET["RETURN early\n(other callers still need this lock)"]
EARLY -- "no" --> MUTEX
DEC -- "no (release_flag=true)" --> MUTEX
MUTEX["pthread_mutex_lock(res_mutex)"]
MUTEX --> FIND_H{"find entry in\nholder list?"}
FIND_H -- "yes" --> CASE_H["CASE H: holder release"]
FIND_H -- "no" --> FIND_W{"find entry in\nwaiter list?"}
FIND_W -- "yes" --> CASE_W["CASE W: waiter removal\n(timeout/deadlock victim)"]
FIND_W -- "no" --> ERR["ERROR: lost transaction"]
CASE_H --> H_CHECK{"release_flag==false\nAND count > 0?"}
H_CHECK -- "yes" --> H_BLOCKED["Blocked holder cleanup:\nblocked_mode = NULL\nreposition per UPR"]
H_CHECK -- "no" --> H_RELEASE["Full release:\n1. Remove from tran hold list\n2. Decrement class granules\n3. If move_to_non2pl: lock_add_non2pl_lock\n4. Free entry"]
H_RELEASE --> RECOMP["Recompute total_holders_mode"]
H_BLOCKED --> RECOMP
RECOMP --> EMPTY{"holder==NULL\nAND waiter==NULL?"}
EMPTY -- "yes, non2pl==NULL" --> REMOVE["lock_remove_resource\n(erase from hash table)"]
EMPTY -- "yes, non2pl!=NULL" --> UNLOCK["unlock res_mutex\n(resource kept for non2pl)"]
EMPTY -- "no" --> GRANT["lock_grant_blocked_holder\nlock_grant_blocked_waiter\nunlock res_mutex"]
Figure 5-3 — Complete flow of lock_internal_perform_unlock_object.
The early return (count > 0) is the re-entrance guard from Ch. 4
§4.3. CASE H is the normal holder release. CASE W handles
waiter removal after timeout or deadlock.
5.5 The release_flag and count interaction
Section titled “5.5 The release_flag and count interaction”The two parameters release_flag and count together control
whether the lock is actually released:
// lock_internal_perform_unlock_object — src/transaction/lock_manager.cif (release_flag == false) { entry_ptr->count--; if (lock_is_instant_lock_mode (tran_index)) entry_ptr->instant_lock_count--;
if (entry_ptr->blocked_mode == NULL_LOCK && entry_ptr->count > 0) return; /* ← early return: other callers still need this lock */ }release_flag | count after decrement | blocked_mode | Result |
|---|---|---|---|
| false | > 0 | NULL_LOCK | Early return — lock stays, other re-entrant callers still active |
| false | > 0 | != NULL_LOCK | Continue — this is a blocked holder being cleaned up (timeout/victim) |
| false | 0 | any | Continue — last reference, proceed to actual release |
| true | (not decremented) | any | Continue — unconditional release (lock_unlock_all) |
5.6 CASE H — holder release (the normal path)
Section titled “5.6 CASE H — holder release (the normal path)”When the entry is found in the holder list and the count has
reached zero (or release_flag is true):
// CASE H — lock_internal_perform_unlock_object (annotated)/* remove from holder list */if (prev == NULL) res_ptr->holder = curr->next;else prev->next = curr->next;
if (release_flag == false && curr->count > 0) { /* blocked holder cleanup (timeout/victim) */ curr->blocked_mode = NULL_LOCK; lock_position_holder_entry (res_ptr, entry_ptr); }else { /* full release */ lock_delete_from_tran_hold_list (curr, tran_index); lock_decrement_class_granules (curr->class_entry);
if (release_flag == false && move_to_non2pl == true) lock_add_non2pl_lock (thread_p, res_ptr, tran_index, curr->granted_mode); /* ← NON2PL creation */
lock_free_entry (tran_index, t_entry, &lk_Gl.obj_free_entry_list, curr); }
/* recompute total_holders_mode */mode = NULL_LOCK;for (i = res_ptr->holder; i != NULL; i = i->next) { mode = lock_conv (i->granted_mode, mode); mode = lock_conv (i->blocked_mode, mode); /* include upgraders */ }res_ptr->total_holders_mode = mode;Why this walk exists — and why it is cheap. Commit never
searches for its entries: lock_unlock_all walks the
transaction-side hold lists and hands each LK_ENTRY pointer
directly to the unlock path (the dual-view payoff, §1.4). The only
walk left is finding prev above — the resource-side list is
singly linked, so splicing a node out needs its predecessor.
The asymmetry is a deliberate cost trade: the transaction view is
doubly linked (tran_prev) because commit unlinks thousands of
entries one by one and each unlink must be O(1); the resource-side
holder list stays singly linked because an object rarely has more
than a handful of holders — and the same res_mutex section
re-walks that list anyway (the aggregate recompute below, then the
grant passes).
After the holder is removed, three outcomes based on remaining state:
| Remaining state | Action |
|---|---|
| No holders, no waiters, no non2pl | lock_remove_resource — erase the LK_RES from the hash table |
| No holders, no waiters, but non2pl exists | Keep LK_RES alive (unlock res_mutex) — the non2pl entries still need the resource |
| Holders or waiters remain | lock_grant_blocked_holder → lock_grant_blocked_waiter (Ch. 4 §4.6–4.7) → unlock res_mutex |
5.7 CASE W — waiter removal (timeout / deadlock victim)
Section titled “5.7 CASE W — waiter removal (timeout / deadlock victim)”When the entry is not in the holder list but is found in the waiter list — this happens when a suspended thread is resumed due to timeout or deadlock detection:
// CASE W — lock_internal_perform_unlock_object (annotated)/* remove from waiter list */if (prev == NULL) res_ptr->waiter = curr->next;else prev->next = curr->next;
lock_free_entry (..., curr);
if (from_whom != NULL) lock_grant_blocked_waiter_partial (thread_p, res_ptr, from_whom);else { /* recompute total_waiters_mode from remaining waiters */ mode = NULL_LOCK; for (i = res_ptr->waiter; i != NULL; i = i->next) mode = lock_conv (i->blocked_mode, mode); res_ptr->total_waiters_mode = mode; }lock_grant_blocked_waiter_partial is a variant of
lock_grant_blocked_waiter that only checks waiters from a given
position onward — since the removed waiter was ahead of from_whom,
only the waiters behind it might now be grantable.
5.8 lock_unlock_all — commit/rollback release
Section titled “5.8 lock_unlock_all — commit/rollback release”Called at transaction end. Unconditionally releases everything:
// lock_unlock_all — src/transaction/lock_manager.c (annotated)voidlock_unlock_all (THREAD_ENTRY * thread_p){ /* 1. release all instance locks */ entry_ptr = tran_lock->inst_hold_list; while (entry_ptr != NULL) { lock_internal_perform_unlock_object (thread_p, entry_ptr, true, /* ← release_flag */ false); /* ← no non2pl */ entry_ptr = tran_lock->inst_hold_list; /* re-read head */ }
/* 2. release all class locks */ entry_ptr = tran_lock->class_hold_list; while (entry_ptr != NULL) { lock_internal_perform_unlock_object (thread_p, entry_ptr, true, false); entry_ptr = tran_lock->class_hold_list; }
/* 3. release root class lock */ if (tran_lock->root_class_hold != NULL) lock_internal_perform_unlock_object (thread_p, tran_lock->root_class_hold, true, false);
/* 4. clean up non2pl list */ while (tran_lock->non2pl_list != NULL) { entry_ptr = tran_lock->non2pl_list; tran_lock->non2pl_list = entry_ptr->tran_next; if (entry_ptr->granted_mode == INCON_NON_TWO_PHASE_LOCK) tran_lock->num_incons_non2pl -= 1; lock_remove_non2pl (thread_p, entry_ptr, tran_index); }
lock_clear_deadlock_victim (tran_index);}Release order is deliberate: instances → classes → root. This is leaf-to-root, matching 2PL’s requirement that locks are released in reverse acquisition order. Releasing a class lock while instance locks still exist would violate the multi-granularity hierarchy.
release_flag = true means count is not decremented — the
loop ignores re-entrance depth and frees the entry unconditionally.
move_to_non2pl = false because at commit/rollback time there is
no need to track released locks for future conflict detection.
5.9 The NON2PL protocol — lifecycle
Section titled “5.9 The NON2PL protocol — lifecycle”NON2PL is a shadow-record mechanism for detecting serialization inconsistencies under READ COMMITTED. It is not a lock mode that blocks — it is a passive tracking record.
Phase 1: Creation — lock_add_non2pl_lock
When lock_internal_perform_unlock_object releases a holder with
move_to_non2pl = true:
// lock_add_non2pl_lock — src/transaction/lock_manager.c (annotated)static LK_ENTRY *lock_add_non2pl_lock (THREAD_ENTRY * thread_p, LK_RES * res_ptr, int tran_index, LOCK lock){ /* check if I already have a non2pl entry on this resource */ non2pl = res_ptr->non2pl; while (non2pl != NULL) { if (non2pl->tran_index == tran_index) break; non2pl = non2pl->next; }
if (non2pl != NULL) { /* merge: upgrade the existing non2pl entry's mode */ if (non2pl->granted_mode != INCON_NON_TWO_PHASE_LOCK) { compat = lock_compat (lock, non2pl->granted_mode); if (compat == LOCK_COMPAT_NO) { non2pl->granted_mode = INCON_NON_TWO_PHASE_LOCK; tran_lock->num_incons_non2pl += 1; } else non2pl->granted_mode = lock_conv (lock, non2pl->granted_mode); } } else { /* create new non2pl entry */ non2pl = lock_get_new_entry (...); lock_initialize_entry_as_non2pl (non2pl, tran_index, res_ptr, lock); /* link into resource non2pl list */ non2pl->next = res_ptr->non2pl; res_ptr->non2pl = non2pl; /* link into transaction non2pl list */ lock_insert_into_tran_non2pl_list (non2pl, tran_index); }}Key detail: the merge path. If the transaction already has a
non2pl entry on this resource (e.g., it read the same serial twice
in the same RC transaction, releasing S each time), the existing
entry is upgraded rather than duplicated. If the new mode is
incompatible with the existing non2pl mode, the entry is promoted
directly to INCON_NON_TWO_PHASE_LOCK.
Phase 2: Conflict marking — lock_update_non2pl_list
Called whenever a lock is successfully granted (PATH B, F in Ch. 3). Walks the resource’s non2pl list and checks each other-transaction entry:
// lock_update_non2pl_list — src/transaction/lock_manager.c (annotated)curr = res_ptr->non2pl;while (curr != NULL) { if (curr->tran_index == tran_index) { /* same transaction: remove our own non2pl entry */ lock_delete_from_tran_non2pl_list (curr, tran_index); lock_free_entry (..., curr); } else { /* different transaction: check for conflict */ if (curr->granted_mode != INCON_NON_TWO_PHASE_LOCK) { compat = lock_compat (lock, curr->granted_mode); if (compat == LOCK_COMPAT_NO) { curr->granted_mode = INCON_NON_TWO_PHASE_LOCK; tran_lock->num_incons_non2pl += 1; } } } curr = curr->next; }Two actions:
- Same transaction: if I acquire a new lock on a resource where I have an old non2pl entry, that entry is removed — I now hold a real lock, the shadow is superseded.
- Other transaction: if the newly granted lock is incompatible
with the other transaction’s non2pl
granted_mode, promote it toINCON_NON_TWO_PHASE_LOCK. This means: “the row you read and released has been modified by another transaction.”
Phase 3: Notification — lock_notify_isolation_incons
At the next fetch boundary, locator_sr.c calls this function:
// lock_notify_isolation_incons — src/transaction/lock_manager.c (simplified)voidlock_notify_isolation_incons (THREAD_ENTRY * thread_p, bool (*fun) (const OID *, const OID *, void *), void *args){ if (isolation == TRAN_REPEATABLE_READ || isolation == TRAN_SERIALIZABLE) return; /* nothing was released under RR/SER */
curr = tran_lock->non2pl_list; while (tran_lock->num_incons_non2pl > 0 && curr != NULL) { if (curr->granted_mode == INCON_NON_TWO_PHASE_LOCK) { ret_val = (*fun) (&curr->res_head->key.class_oid, &curr->res_head->key.oid, args); /* remove from non2pl list, free entry */ tran_lock->num_incons_non2pl -= 1; } curr = curr->tran_next; }}The callback (locator_notify_decache) adds each flagged OID to
the response with LC_FETCH_DECACHE_LOCK, telling the client to
invalidate its cached copy.
Phase 4: Cleanup — lock_unlock_all
At commit/rollback, lock_unlock_all walks the entire non2pl list
and removes every entry via lock_remove_non2pl.
5.10 NON2PL scope — why it matters almost exclusively for serials
Section titled “5.10 NON2PL scope — why it matters almost exclusively for serials”lock_unlock_object_by_isolation is the scope gate:
// lock_unlock_object_by_isolation — src/transaction/lock_manager.cif (isolation != TRAN_READ_COMMITTED) return; /* RR/SER: nothing released */
if (OID_IS_ROOTOID (oid) || OID_IS_ROOTOID (class_oid)) return; /* class locks kept */
if (mvcc_is_mvcc_disabled_class (class_oid)) lock_unlock_shared_inst_lock (...); /* ← non2pl path */else { /* MVCC table: "READ COMMITTED isolation uses snapshot instead of locks. We don't have to release anything." */ }mvcc_is_mvcc_disabled_class returns true for only four class
kinds: root class, serial class, collation class, HA apply-info
class. Of these, only serial produces a realistic RC workload
where S locks are acquired on instance OIDs and then released
mid-transaction.
Historical context: The entire non2pl/decache infrastructure
(locator_notify_decache, LC_FETCH_DECACHE_LOCK,
lock_notify_isolation_incons) predates MVCC. Before MVCC, all
instance reads used S locks, and RC released them early — making
non2pl tracking necessary for every row. After MVCC was introduced,
normal tables use snapshots for visibility, and the non2pl path
narrowed to MVCC-disabled classes only. The infrastructure remains
intact but its active scope is essentially serial reads under RC.
5.11 Instant lock and its relationship to NON2PL
Section titled “5.11 Instant lock and its relationship to NON2PL”Instant lock is a mode where locks acquired during WHERE-clause evaluation are automatically released when the evaluation phase ends.
// lock_start_instant_lock_mode — src/transaction/lock_manager.cvoid lock_start_instant_lock_mode (int tran_index){ tran_lock->is_instant_duration = true;}
// lock_stop_instant_lock_mode — src/transaction/lock_manager.cvoid lock_stop_instant_lock_mode (THREAD_ENTRY * thread_p, int tran_index, bool need_unlock){ /* walk instance hold list */ entry_ptr = tran_lock->inst_hold_list; while (entry_ptr != NULL) { count = entry_ptr->instant_lock_count; if (need_unlock) while (count > 0) { lock_internal_perform_unlock_object (thread_p, entry_ptr, false, true); /* ← move_to_non2pl = true */ count--; } entry_ptr->instant_lock_count = 0; entry_ptr = next_ptr; } /* same for class hold list ... */ tran_lock->is_instant_duration = false;}Relationship to NON2PL: Instant lock feeds into non2pl
because the release uses move_to_non2pl = true. But non2pl also
exists without instant lock — any mid-transaction S lock release
under RC goes through the same path.
Callers: Only query_executor.c uses instant lock mode
(lock_start_instant_lock_mode / lock_stop_instant_lock_mode),
specifically around condition evaluation in query execution.
Edge case — protecting non-instant locks: When
is_instant_duration == true and the transaction already holds an
IX or stronger lock with instant_lock_count == 0 (a non-instant
lock), and the new request is incompatible with the held mode,
instant mode is forcibly stopped
(lock_stop_instant_lock_mode(need_unlock=false)) to prevent the
non-instant lock from being accidentally released during instant
mode cleanup.
5.12 Performance analysis — release cost and the grant cascade
Section titled “5.12 Performance analysis — release cost and the grant cascade”Most unlock calls are O(1) — the count early return dominates.
In typical OLTP, a transaction acquires the same lock multiple times
via re-entrance (Ch. 4 §4.3). Each lock_unlock_object just
decrements count. No mutex, no list manipulation:
if (release_flag == false) { entry_ptr->count--; if (entry_ptr->blocked_mode == NULL_LOCK && entry_ptr->count > 0) return; /* ← most unlock calls exit here */ }When count reaches zero — the real release cost:
| Step | Cost | Notes |
|---|---|---|
Acquire res_mutex | O(1) | Per-resource, uncontended in most cases |
| Find entry in holder list | O(holders) | Linear scan of singly-linked list |
| Remove from holder list | O(1) | Pointer unlink |
| Remove from tran hold list | O(1) | Doubly-linked — O(1) unlink |
lock_add_non2pl_lock (if move_to_non2pl) | O(non2pl entries on this resource) | Walks non2pl list to check for existing entry to merge |
| Free entry to pool | O(1) | Local pool or lockfree retire |
Recompute total_holders_mode | O(remaining holders) | Walks entire remaining holder list to compute LUB |
| Grant cascade | O(upgraders + grantable waiters) | See below |
The total_holders_mode recomputation is the per-release
O(holders) cost.
mode = NULL_LOCK;for (i = res_ptr->holder; i != NULL; i = i->next) { mode = lock_conv (i->granted_mode, mode); mode = lock_conv (i->blocked_mode, mode); /* upgraders too */ }res_ptr->total_holders_mode = mode;This cannot be optimized to O(1) because the LUB operation is not invertible — you cannot “subtract” a removed mode from the aggregate without recomputing from scratch. The same limitation explains why conversion (Ch. 4 §4.10) requires O(holders).
The grant cascade — release one, potentially wake many:
After recomputing total_holders_mode, the releaser calls:
lock_grant_blocked_holder— O(upgraders × holders-behind-each)lock_grant_blocked_waiter— O(waiters), each checked againsttotal_holders_mode- Each granted waiter: move to holder list +
lock_resume
flowchart LR REL["T1 releases X_LOCK"] RECOMP["recompute\ntotal_holders_mode\nO(remaining holders)"] GBH["grant_blocked_holder\n(check each upgrader)"] GBW["grant_blocked_waiter\n(check each waiter\nvs total_holders_mode)"] WAKE["resume N threads\n(N context switches)"] REL --> RECOMP --> GBH --> GBW --> WAKE
Figure 5-4 — Release cascade cost. A single release can trigger O(upgraders + waiters) compat checks and N thread wakeups. The worst case is a hot row with many waiters — but this is also the case where the release provides the most value.
Worst-case scenario: A hot row held by one X-lock holder with 50 S-lock waiters. When the holder releases:
- Recompute
total_holders_mode: O(0) (no remaining holders) grant_blocked_holder: O(0) (no upgraders)grant_blocked_waiter: 50 waiters all become grantable (S is compatible with S) → 50 entries moved to holder list, 50lock_resumecalls, 50 context switches
This is expensive but unavoidable — the 50 threads were suspended and need to be woken. The alternative (each woken thread re-checks compatibility independently) would be O(50²) total compat checks instead of O(50).
NON2PL cost on the grant path:
lock_update_non2pl_list is called on every successful grant
(PATH B, F, and now on each cascaded grant). It walks the
resource’s non2pl list — O(non2pl entries per resource). In MVCC
workloads, this list is almost always empty (non2pl only exists for
MVCC-disabled classes under RC). But for serial-heavy workloads,
the non2pl list could accumulate entries across many RC transactions.
5.13 Chapter summary — key takeaways
Section titled “5.13 Chapter summary — key takeaways”- Three public unlock APIs differ in
release_flagandmove_to_non2pl.lock_unlock_all(commit/rollback) usesrelease_flag=trueto bypass count checks. lock_unlock_object(force=false)passes four filters before actually releasing: only S locks, only under RC, only instances, only MVCC-disabled classes. In practice, this means serials.countcontrols release depth. Eachlock_objectincrements it; eachlock_unlock_objectdecrements it. The lock is only freed when count reaches zero — the re-entrance guard.- After a holder release,
total_holders_modeis recomputed by walking the remaining holder list, then blocked holders and waiters are granted (Ch. 4 §4.6–4.7). - Resource removal: an
LK_RESis erased from the hash table only when all three lists (holder, waiter, non2pl) are empty. A non2pl-only resource stays alive. - NON2PL lifecycle: create (
lock_add_non2pl_lock) → mark conflict (lock_update_non2pl_list) → notify client (lock_notify_isolation_incons) → cleanup (lock_unlock_all). - NON2PL scope narrowed post-MVCC. The infrastructure is pre-MVCC era; today only MVCC-disabled classes (serials in practice) exercise the path.
- Instant lock feeds into non2pl via
move_to_non2pl = true, but the two mechanisms are independent — non2pl exists without instant lock, and instant lock cleanup protects pre-existing non-instant locks. - Release order in
lock_unlock_allis instances → classes → root → non2pl — leaf-to-root, matching 2PL’s reverse-order release requirement.
Chapter 6: Lock Escalation
Section titled “Chapter 6: Lock Escalation”Lock escalation replaces many fine-grain instance locks with a
single coarse-grain class lock when the instance lock count exceeds
a threshold. It is a memory-pressure relief valve — without it, a
bulk UPDATE touching 100K rows would hold 100K LK_ENTRY structs.
6.1 When escalation is triggered
Section titled “6.1 When escalation is triggered”Escalation is checked at the top of every instance lock request,
before the hash lookup, inside lock_internal_perform_lock_object:
// lock_internal_perform_lock_object — src/transaction/lock_manager.cstart: if (class_oid != NULL && !OID_IS_ROOTOID (class_oid)) { /* instance lock request */ ret_val = lock_escalate_if_needed (thread_p, class_entry, tran_index); // ... if escalated, class lock subsumes → return LK_GRANTED }This means escalation is evaluated on every instance lock
acquisition — not periodically, not lazily. The lock_check_escalate
function makes the threshold check fast enough to justify this.
6.2 lock_check_escalate — the threshold check
Section titled “6.2 lock_check_escalate — the threshold check”// lock_check_escalate — src/transaction/lock_manager.c (annotated)static boollock_check_escalate (THREAD_ENTRY * thread_p, LK_ENTRY * class_entry, LK_TRAN_LOCK * tran_lock){ if (class_entry == NULL) return false;
if (class_entry->granted_mode == BU_LOCK) return false; /* ← bulk update already has class-level lock */
if (tran_lock->lock_escalation_on == true) return false; /* ← another thread of same tran is escalating */
superclass_entry = class_entry->class_entry;
if (superclass_entry != NULL && !OID_IS_ROOTOID (&superclass_entry->res_head->key.oid)) { /* class hierarchy: check superclass's ngranules */ if (superclass_entry->ngranules < prm_get_integer_value (PRM_ID_LK_ESCALATION_AT)) return false; } else if (class_entry->ngranules < prm_get_integer_value (PRM_ID_LK_ESCALATION_AT)) return false;
return true; /* ← escalation needed */}Five conditions that prevent escalation:
| Condition | Why |
|---|---|
class_entry == NULL | No class context (shouldn’t happen for instance locks) |
granted_mode == BU_LOCK | Bulk update already has class-level protection |
lock_escalation_on == true | Another thread of the same transaction is already escalating — re-entrance guard |
Superclass hierarchy: superclass->ngranules < threshold | In a class hierarchy, the superclass aggregates granule counts |
Direct: class_entry->ngranules < threshold | Normal case — not enough instance locks yet |
The threshold is PRM_ID_LK_ESCALATION_AT, a runtime
configuration parameter (default value set in cubrid.conf). The
ngranules field on class_entry is incremented by
lock_increment_class_granules every time an instance lock is
granted (Ch. 3 PATH A/B) and decremented by
lock_decrement_class_granules on release (Ch. 5).
6.3 lock_escalate_if_needed — the escalation procedure
Section titled “6.3 lock_escalate_if_needed — the escalation procedure”When lock_check_escalate returns true, the actual escalation
begins:
flowchart LR
CHK["lock_check_escalate\n→ true"]
ABORT{"PRM_ID_LK_ROLLBACK\n_ON_LOCK_ESCALATION?"}
ABORT -- "yes" --> ROLL["Abort transaction\nER_LK_ROLLBACK_ON_LOCK_ESCALATION"]
ABORT -- "no" --> MODE["Determine escalated\nclass lock mode"]
MODE --> TRY["lock_internal_perform_lock_object\n(class_oid, max_class_lock,\nFORCE_ZERO_WAIT)"]
TRY --> OK{"granted?"}
OK -- "yes" --> UNDO["Unlock one count\nof old class lock\n(maintain count balance)"]
UNDO --> DONE["Return LK_GRANTED\n(caller checks subsumption)"]
OK -- "no" --> FAIL["Reset lock_escalation_on\nReturn failure\n(fall through to normal path)"]
Figure 6-1 — Escalation procedure. The escalated class lock is attempted conditionally (FORCE_ZERO_WAIT). If it fails (e.g., another transaction holds a conflicting class lock), escalation is silently abandoned and the normal instance lock path proceeds.
The code:
// lock_escalate_if_needed — src/transaction/lock_manager.c (annotated)static intlock_escalate_if_needed (THREAD_ENTRY * thread_p, LK_ENTRY * class_entry, int tran_index){ tran_lock = &lk_Gl.tran_lock_table[tran_index]; pthread_mutex_lock (&tran_lock->hold_mutex);
if (lock_check_escalate (...) == false) { pthread_mutex_unlock (&tran_lock->hold_mutex); return LK_NOTGRANTED; /* ← not an error, just "no escalation" */ }
/* option: abort instead of escalating */ if (prm_get_bool_value (PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATION)) { er_set (ER_ERROR_SEVERITY, ..., ER_LK_ROLLBACK_ON_LOCK_ESCALATION, ...); lock_set_tran_abort_reason (tran_index, TRAN_ABORT_DUE_ROLLBACK_ON_ESCALATION); pthread_mutex_unlock (&tran_lock->hold_mutex); return LK_NOTGRANTED_DUE_ABORTED; }
/* set re-entrance guard */ tran_lock->lock_escalation_on = true;
/* determine escalated mode */ if (class_entry->granted_mode == NULL_LOCK || class_entry->granted_mode == S_LOCK || class_entry->granted_mode == X_LOCK || class_entry->granted_mode == SCH_M_LOCK) { /* already strong enough — no instance locks possible */ tran_lock->lock_escalation_on = false; pthread_mutex_unlock (&tran_lock->hold_mutex); return LK_GRANTED; }
/* IS → S, IX/SIX → X */ if (class_entry->granted_mode == IX_LOCK || class_entry->granted_mode == SIX_LOCK) max_class_lock = X_LOCK; else /* IS_LOCK */ max_class_lock = S_LOCK;
pthread_mutex_unlock (&tran_lock->hold_mutex);
/* attempt the escalated class lock — conditional (no wait) */ granted = lock_internal_perform_lock_object ( thread_p, tran_index, &class_entry->res_head->key.oid, NULL, /* class_oid = NULL → this is a class lock */ max_class_lock, LK_FORCE_ZERO_WAIT, /* ← conditional: don't block */ &class_entry, NULL);
if (granted != LK_GRANTED) { /* escalation failed — fall through to normal instance lock */ tran_lock->lock_escalation_on = false; return granted; }
/* success: undo the extra count from the escalation request */ lock_internal_perform_unlock_object (thread_p, class_entry, false, true);
tran_lock->lock_escalation_on = false; return LK_GRANTED;}6.4 Escalated mode decision — simple rule
Section titled “6.4 Escalated mode decision — simple rule”The mode decision is deliberately simple to avoid expensive instance-list scanning:
| Current class lock | Escalated to | Rationale |
|---|---|---|
IS_LOCK | S_LOCK | Was reading instances → take class S |
IX_LOCK | X_LOCK | Was writing instances → take class X |
SIX_LOCK | X_LOCK | Was reading + writing → take class X |
S_LOCK, X_LOCK, SCH_M_LOCK | (no-op) | Already class-level — no instance locks to escalate |
NULL_LOCK | (no-op) | No class lock held — shouldn’t have instance locks |
The comment in the source explains: “Because to count the shared and exclusive instance locks may cause high CPU usage, we used a simple rule.” Walking the instance hold list to compute the precise LUB of all instance modes would be O(instance locks) — the very thing escalation is trying to avoid.
6.5 Why LK_FORCE_ZERO_WAIT — conditional escalation
Section titled “6.5 Why LK_FORCE_ZERO_WAIT — conditional escalation”The escalated class lock is attempted with LK_FORCE_ZERO_WAIT
(no blocking, no error message). If it fails:
- The escalation is silently abandoned.
lock_escalation_onis reset.- The caller (
lock_internal_perform_lock_object) proceeds with the normal instance lock path.
This is a deliberate design choice: escalation is an optimization, not a requirement. If the class lock can’t be obtained without waiting (because another transaction holds a conflicting lock), it is better to continue with per-instance locks than to block the entire transaction waiting for a class lock that might cause a cascade of conflicts.
6.6 Post-escalation: what happens to instance locks
Section titled “6.6 Post-escalation: what happens to instance locks”After escalation succeeds, the caller in
lock_internal_perform_lock_object checks:
if (ret_val == LK_GRANTED && lock_is_class_lock_escalated ( lock_get_object_lock (class_oid, oid_Root_class_oid), lock) == true) { return LK_GRANTED; /* class lock subsumes the instance request */ }lock_is_class_lock_escalated checks whether the now-upgraded
class lock mode is strong enough to cover the requested instance
lock mode (e.g., class X subsumes instance S or X).
The existing instance LK_ENTRY structs are not immediately
freed by escalation itself. They are freed later via the
lock_conversion_treatement cleanup (Ch. 4 §4.8) when the class
lock conversion is processed, and individually as
lock_unlock_all walks the instance hold list at commit.
6.7 The PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATION option
Section titled “6.7 The PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATION option”Some applications prefer to abort rather than escalate — class locks are coarser and may cause unexpected serialization with other transactions:
if (prm_get_bool_value (PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATION)) { er_set (..., ER_LK_ROLLBACK_ON_LOCK_ESCALATION, ...); lock_set_tran_abort_reason (tran_index, TRAN_ABORT_DUE_ROLLBACK_ON_ESCALATION); return LK_NOTGRANTED_DUE_ABORTED; }When this parameter is true, hitting the escalation threshold
causes the transaction to abort with
TRAN_ABORT_DUE_ROLLBACK_ON_ESCALATION. This is checked in the
caller:
// lock_internal_perform_lock_object — after lock_escalate_if_neededif (ret_val == LK_NOTGRANTED_DUE_ABORTED) { LOG_TDES *tdes = LOG_FIND_TDES (tran_index); if (tdes && tdes->tran_abort_reason == TRAN_ABORT_DUE_ROLLBACK_ON_ESCALATION) goto end; /* propagate the abort */ }6.8 Concrete scenario — escalation during bulk UPDATE
Section titled “6.8 Concrete scenario — escalation during bulk UPDATE”flowchart LR
subgraph S1["Before escalation"]
CL1["class A: IX_LOCK\nngranules = 4999"]
IL1["inst locks:\nrow1(X), row2(X),\n..., row4999(X)"]
end
subgraph S2["5000th row request triggers escalation"]
CHK2["lock_check_escalate:\nngranules(4999) >= threshold(5000)?\nNot yet → grant row5000 normally"]
end
subgraph S3["5001st row request"]
CHK3["lock_check_escalate:\nngranules(5000) >= threshold(5000)?\nYES → escalate"]
ESC3["IX → X (FORCE_ZERO_WAIT)"]
end
subgraph S4["After escalation"]
CL4["class A: X_LOCK\n(subsumes all instances)"]
IL4["inst locks: still present\n(freed at commit or\nby conversion cleanup)"]
end
S1 --> S2 --> S3 --> S4
Figure 6-2 — Escalation during bulk UPDATE. When ngranules
reaches the threshold (here 5000), the next instance lock request
triggers escalation from IX to X. The class X lock subsumes all
future instance requests, so no more instance LK_ENTRY structs
are created. Existing instance entries are cleaned up later.
6.9 Chapter summary — key takeaways
Section titled “6.9 Chapter summary — key takeaways”- Escalation is checked on every instance lock request — at
the top of
lock_internal_perform_lock_object, before the hash lookup. - The threshold is
PRM_ID_LK_ESCALATION_AT(configurable). Thengranulescounter on the classLK_ENTRYtracks the current instance lock count. - Escalated mode is determined by a simple rule: IS→S, IX/SIX→X. No instance-list scanning — the whole point is to avoid that cost.
- Escalation is conditional (
LK_FORCE_ZERO_WAIT). If the class lock can’t be obtained without blocking, escalation is silently abandoned and the normal instance lock path proceeds. lock_escalation_onis a re-entrance guard — prevents multiple threads of the same transaction from escalating simultaneously.PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATIONallows applications to abort instead of escalate, avoiding the coarse-grain serialization that class locks impose.- Existing instance locks are not immediately freed — they are
cleaned up by conversion treatment (Ch. 4 §4.8) and by
lock_unlock_allat commit.
Chapter 7: Suspend/Resume and Timeout
Section titled “Chapter 7: Suspend/Resume and Timeout”When a lock request cannot be granted immediately (PATH D or G in Ch. 3), the requesting thread is suspended until the lock becomes available, a timeout expires, or a deadlock is detected. This chapter traces the suspend/resume machinery and the four wakeup paths.
7.1 lock_suspend — putting a thread to sleep
Section titled “7.1 lock_suspend — putting a thread to sleep”// lock_suspend — src/transaction/lock_manager.c (annotated)static LOCK_WAIT_STATElock_suspend (THREAD_ENTRY * thread_p, LK_ENTRY * entry_ptr, int wait_msecs){ /* assert: no page latches held while waiting for a lock */ assert (lock_is_safe_lock_with_page (thread_p, entry_ptr) || !pgbuf_has_perm_pages_fixed (thread_p));
/* register wait info into the thread entry */ entry_ptr->thrd_entry->lockwait = (void *) entry_ptr; entry_ptr->thrd_entry->lockwait_stime = (tv.tv_sec * 1000000LL + tv.tv_usec) / 1000LL; /* ms */ entry_ptr->thrd_entry->lockwait_msecs = wait_msecs; entry_ptr->thrd_entry->lockwait_state = (int) LOCK_SUSPENDED;
/* register in WFG for deadlock detection */ lk_Gl.TWFG_node[entry_ptr->tran_index].thrd_wait_stime = entry_ptr->thrd_entry->lockwait_stime; lk_Gl.deadlock_and_timeout_detector++;
/* record which resource we're waiting for (for diagnostics) */ tdes->waiting_for_res = entry_ptr->res_head;
/* === actually suspend === */ thread_suspend_wakeup_and_unlock_entry (entry_ptr->thrd_entry, THREAD_LOCK_SUSPENDED);
/* === resumed here === */ lk_Gl.deadlock_and_timeout_detector--; lk_Gl.TWFG_node[entry_ptr->tran_index].thrd_wait_stime = 0; tdes->waiting_for_res = NULL;
/* dispatch on resume reason */ // ... see §7.3 ...}Key fields set before suspension:
| Field | Where | Purpose |
|---|---|---|
thrd_entry->lockwait | Thread entry | Points to the LK_ENTRY being waited on. Non-NULL means “this thread is suspended for a lock.” |
thrd_entry->lockwait_stime | Thread entry | Timestamp (ms) when wait started. Used by timeout check. |
thrd_entry->lockwait_msecs | Thread entry | Timeout deadline. LK_INFINITE_WAIT (-1) for no timeout. |
thrd_entry->lockwait_state | Thread entry | Set to LOCK_SUSPENDED. Changed by the waker to the resume reason. |
TWFG_node[tran].thrd_wait_stime | WFG node | Used by the deadlock daemon to know this transaction is waiting. |
deadlock_and_timeout_detector | lk_Gl | Atomic counter. The deadlock daemon skips its scan when this is 0 (no threads waiting). |
The page-latch safety assert: A thread must not hold any permanent page latches when it suspends for a lock. Holding a page latch while waiting for a lock could deadlock with another thread that holds the lock and needs the page latch — a classic latch-lock ordering violation. The assert catches this at development time.
7.2 lock_resume — waking a thread
Section titled “7.2 lock_resume — waking a thread”// lock_resume — src/transaction/lock_manager.c (annotated)static voidlock_resume (LK_ENTRY * entry_ptr, int state){ /* The caller is holding the thread entry mutex */ entry_ptr->thrd_entry->lockwait = NULL; entry_ptr->thrd_entry->lockwait_state = (int) state; entry_ptr->thrd_entry->resume_status = THREAD_LOCK_RESUMED; pthread_cond_signal (&entry_ptr->thrd_entry->wakeup_cond); thread_unlock_entry (entry_ptr->thrd_entry);}Protocol: The waker (whoever calls lock_resume) must:
- Hold the thread entry mutex (
thread_lock_entry). - Verify
LK_IS_LOCKWAIT_THREAD(thrd)— the thread is still suspended. - Set
lockwait = NULLandlockwait_state = statebefore signaling the condition variable. - Signal and release the mutex.
The suspended thread, upon waking, reads lockwait_state to
determine the resume reason. Since the waker sets the state before
signaling, no race exists.
7.3 Resume dispatch — six states
Section titled “7.3 Resume dispatch — six states”After thread_suspend_wakeup_and_unlock_entry returns, lock_suspend
reads lockwait_state to determine why it was woken:
flowchart LR
WAKE["thread resumes"]
WAKE --> S1{"lockwait_state?"}
S1 -- "LOCK_RESUMED" --> R1["Lock granted\n(entry moved to holder list\nby the releaser)"]
S1 -- "LOCK_RESUMED_ABORTED_FIRST" --> R2["Deadlock victim (primary)\nAbort transaction,\nwait for other threads to finish"]
S1 -- "LOCK_RESUMED_ABORTED_OTHER" --> R3["Deadlock victim (secondary)\nTreated as timeout\n(another thread handles abort)"]
S1 -- "LOCK_RESUMED_DEADLOCK_TIMEOUT" --> R4["Deadlock timeout\n(set error, return timeout)"]
S1 -- "LOCK_RESUMED_TIMEOUT" --> R5["Normal timeout\n(wait_msecs expired)"]
S1 -- "LOCK_RESUMED_INTERRUPT" --> R6["Server shutdown\n(ER_INTERRUPTED)"]
Figure 7-1 — Six resume states. Only LOCK_RESUMED means the lock
was granted. All other states require the caller to clean up the
waiter entry via lock_internal_perform_unlock_object.
| State | Who sets it | Meaning | Return value |
|---|---|---|---|
LOCK_RESUMED | lock_grant_blocked_holder or lock_grant_blocked_waiter (Ch. 4 §4.6–4.7) | Lock granted — entry already in holder list | LOCK_RESUMED |
LOCK_RESUMED_ABORTED_FIRST | lock_wakeup_deadlock_victim_aborted | This thread is the primary deadlock victim — must perform the rollback | LOCK_RESUMED_ABORTED |
LOCK_RESUMED_ABORTED_OTHER | lock_wakeup_deadlock_victim_aborted | Another thread of the same transaction is already handling the abort | LOCK_RESUMED_DEADLOCK_TIMEOUT |
LOCK_RESUMED_DEADLOCK_TIMEOUT | lock_wakeup_deadlock_victim_timeout | Deadlock victim woken as timeout (soft abort) | LOCK_RESUMED_DEADLOCK_TIMEOUT |
LOCK_RESUMED_TIMEOUT | lock_force_timeout_expired_wait_transactions (deadlock daemon) | Wait time exceeded lockwait_msecs | LOCK_RESUMED_TIMEOUT |
LOCK_RESUMED_INTERRUPT | Server shutdown | Server is shutting down — all waiters must exit | LOCK_RESUMED_INTERRUPT |
7.4 Who wakes up suspended threads — four sources
Section titled “7.4 Who wakes up suspended threads — four sources”flowchart TB
subgraph Sources["Four wakeup sources"]
S1["1. Lock releaser\n(lock_grant_blocked_holder\n lock_grant_blocked_waiter)\n→ LOCK_RESUMED"]
S2["2. Deadlock daemon\n(timeout check, every 100ms)\n→ LOCK_RESUMED_TIMEOUT"]
S3["3. Deadlock daemon\n(cycle detection)\n→ LOCK_RESUMED_ABORTED_FIRST\n LOCK_RESUMED_ABORTED_OTHER\n LOCK_RESUMED_DEADLOCK_TIMEOUT"]
S4["4. Server shutdown\n(thread interrupt)\n→ LOCK_RESUMED_INTERRUPT"]
end
S1 --> RESUME["lock_resume(entry, state)"]
S2 --> RESUME
S3 --> RESUME
S4 --> INT["thread wakeup\n(THREAD_RESUME_DUE_TO_INTERRUPT)"]
Figure 7-2 — Four sources that wake a suspended lock-waiting thread. Source 1 is the happy path. Sources 2–4 are failure paths that require cleanup.
7.5 Timeout check — lock_force_timeout_expired_wait_transactions
Section titled “7.5 Timeout check — lock_force_timeout_expired_wait_transactions”The deadlock daemon (Ch. 2 §2.6) calls this on every 100ms iteration for each suspended thread:
// lock_force_timeout_expired_wait_transactions — src/transaction/lock_manager.cstatic boollock_force_timeout_expired_wait_transactions (void *thrd_entry){ thrd = (THREAD_ENTRY *) thrd_entry; thread_lock_entry (thrd);
if (LK_IS_LOCKWAIT_THREAD (thrd)) { /* check for interrupt first */ if (logtb_is_interrupted_tran (thrd, ...)) { lock_resume ((LK_ENTRY *) thrd->lockwait, LOCK_RESUMED_INTERRUPT); return true; }
/* check for timeout */ if (LK_CAN_TIMEOUT (thrd->lockwait_msecs)) { etime = current_time_ms; if (etime - thrd->lockwait_stime > thrd->lockwait_msecs) { lock_resume ((LK_ENTRY *) thrd->lockwait, LOCK_RESUMED_TIMEOUT); return true; } }
thread_unlock_entry (thrd); return false; /* not timed out, still waiting */ } // ... thread not waiting ...}Timeout granularity: The daemon loops every 100ms, so a thread
with wait_msecs = 50 may not be timed out until up to 100ms after
its deadline. This is by design — the daemon does not use
high-resolution timers. For LK_INFINITE_WAIT (-1),
LK_CAN_TIMEOUT returns false and the timeout path is skipped
entirely.
7.6 The tran_next_wait chain — multi-thread wait coordination
Section titled “7.6 The tran_next_wait chain — multi-thread wait coordination”When multiple threads of the same transaction try to lock the same
resource, the second thread finds the first thread already waiting.
Instead of creating a second waiter entry, it chains itself onto
the first thread’s tran_next_wait:
// lock_internal_perform_lock_object — same-tran-already-waiting branchif (wait_entry_ptr != NULL) /* another thread of my tran is waiting */ { thrd_entry->tran_next_wait = wait_entry_ptr->thrd_entry->tran_next_wait; wait_entry_ptr->thrd_entry->tran_next_wait = thrd_entry;
pthread_mutex_unlock (&res_ptr->res_mutex); thread_suspend_wakeup_and_unlock_entry (thrd_entry, THREAD_LOCK_SUSPENDED); goto start; /* retry after wakeup */ }When the primary waiter is granted and resumes in lock_suspend,
it wakes all chained threads:
// lock_suspend — after main dispatch, before returnthread_lock_entry (entry_ptr->thrd_entry);while (entry_ptr->thrd_entry->tran_next_wait) { p = entry_ptr->thrd_entry->tran_next_wait; entry_ptr->thrd_entry->tran_next_wait = p->tran_next_wait; p->tran_next_wait = NULL; thread_wakeup (p, THREAD_LOCK_RESUMED); }thread_unlock_entry (entry_ptr->thrd_entry);The woken secondary threads goto start and retry the entire
lock acquisition from the beginning — they do not inherit the
primary thread’s grant.
7.7 The deadlock_and_timeout_detector counter
Section titled “7.7 The deadlock_and_timeout_detector counter”// lock_suspend — before and after suspensionlk_Gl.deadlock_and_timeout_detector++; /* before suspend */// ... thread sleeps ...lk_Gl.deadlock_and_timeout_detector--; /* after resume */This atomic counter is checked by the deadlock daemon:
// deadlock_detect_task_execute — src/transaction/lock_manager.cif (lk_Gl.deadlock_and_timeout_detector == 0) return; /* no threads waiting — skip entirely */When the counter is zero, no thread is suspended for a lock, so there is nothing to time out and no deadlock is possible. This avoids the cost of the thread-map scan and WFG construction on every 100ms tick when the system is uncontended.
7.8 Chapter summary — key takeaways
Section titled “7.8 Chapter summary — key takeaways”lock_suspendregisters five fields in the thread entry and WFG node before suspending. The page-latch safety assert prevents latch-lock ordering deadlocks.lock_resumesetslockwait_statebefore signaling — the protocol is race-free by construction.- Six resume states distinguish grant, abort (primary/secondary),
deadlock timeout, normal timeout, and interrupt. Only
LOCK_RESUMEDmeans success. - Four wakeup sources: lock releaser (happy path), daemon timeout check, daemon deadlock detection, and server shutdown.
- Timeout granularity is 100ms (the daemon loop period). This is acceptable for lock timeouts, which are typically in the seconds range.
- Multi-thread wait coordination uses
tran_next_waitchains: secondary threads piggyback on the primary waiter and retry from scratch on wakeup. deadlock_and_timeout_detectoris a fast early-exit for the daemon — when zero, the 100ms tick does no work.
Chapter 8: Deadlock Detection
Section titled “Chapter 8: Deadlock Detection”The deadlock daemon (Ch. 2 §2.6, Ch. 7 §7.4) periodically constructs a transaction waits-for graph (WFG) and scans it for cycles. This chapter traces the full detection pipeline: WFG construction, cycle detection via DFS, victim selection, and resolution.
8.1 Overall pipeline — lock_detect_local_deadlock
Section titled “8.1 Overall pipeline — lock_detect_local_deadlock”flowchart LR P1["Phase 1\nInit WFG tables\n(reset nodes + edges)"] P2["Phase 2\nBuild WFG\n(scan all LK_RES,\nadd edges)"] P3["Phase 3\nDFS cycle detection\n(walk TWFG_node graph)"] P4["Phase 4\nVictim selection\n(per cycle)"] P5["Phase 5\nResolution\n(wake victims:\ntimeout or abort)"] P1 --> P2 --> P3 --> P4 --> P5
Figure 8-1 — Five-phase deadlock detection pipeline. The entire
pipeline runs under DL_detection_mutex, serializing detection
runs. Each phase operates on the TWFG_node[] / TWFG_edge[]
arrays in lk_Gl.
Figure 8-1b — What actually happens on a daemon tick. Three gates sit in front of the five phases — any suspended waiter at all (atomic counter), at least two suspended threads, detection interval elapsed — and most ticks die at the gray return rail for the price of one atomic read. Only when all three pass (at most once per second) does the pipeline run; phase 2’s hash-table sweep is the cost center.
8.2 Phase 1 — initialize WFG tables
Section titled “8.2 Phase 1 — initialize WFG tables”// lock_detect_local_deadlock — Phase 1for (i = 1; i < lk_Gl.num_trans; i++) { lk_Gl.TWFG_node[i].first_edge = -1; lk_Gl.TWFG_node[i].tran_edge_seq_num = 0; lk_Gl.TWFG_node[i].checked_by_deadlock_detector = true; }
lk_Gl.TWFG_edge = &TWFG_edge_block[0]; /* static block, 200 edges */lk_Gl.max_TWFG_edge = LK_MIN_TWFG_EDGE_COUNT; /* 200 */for (i = 0; i < LK_MIN_TWFG_EDGE_COUNT; i++) { lk_Gl.TWFG_edge[i].to_tran_index = -1; lk_Gl.TWFG_edge[i].next = (i + 1); /* free-list chain */ }lk_Gl.TWFG_edge[max - 1].next = -1;lk_Gl.TWFG_free_edge_idx = 0;lk_Gl.global_edge_seq_num = 0;victim_count = 0;The WFG is rebuilt from scratch on every detection run — there
is no incremental maintenance between runs. The edge array starts
from a static block (TWFG_edge_block[200]) and can grow to
LK_MID_TWFG_EDGE_COUNT (1000) or LK_MAX_TWFG_EDGE_COUNT
(MAX_NTRANS²) if needed.
8.3 Phase 2 — build the WFG by scanning all resources
Section titled “8.3 Phase 2 — build the WFG by scanning all resources”The function iterates over every LK_RES in the hash table and
adds edges for three categories of wait relationships:
// lock_detect_local_deadlock — Phase 2 (annotated)for (res_ptr = iterator.iterate (); res_ptr != NULL; ...) { /* (a) among holders: upgrader-to-upgrader conflicts */ for (hi = res_ptr->holder; hi != NULL; hi = hi->next) { if (hi->blocked_mode == NULL_LOCK) break; /* end of upgraders */ for (hj = hi->next; hj != NULL; hj = hj->next) { /* hj waits for hi? */ if (incompat(hj->blocked_mode, hi->granted_mode) || incompat(hj->blocked_mode, hi->blocked_mode)) lock_add_WFG_edge (hj→hi, holder=true);
/* hi waits for hj? */ if (incompat(hi->blocked_mode, hj->granted_mode)) lock_add_WFG_edge (hi→hj, holder=true); } }
/* (b) from waiters to holders */ for (hi = res_ptr->holder; hi != NULL; hi = hi->next) for (hj = res_ptr->waiter; hj != NULL; hj = hj->next) { if (incompat(hj->blocked_mode, hi->granted_mode) || incompat(hj->blocked_mode, hi->blocked_mode)) lock_add_WFG_edge (hj→hi, holder=true); }
/* (c) from waiters to other waiters */ for (hi = res_ptr->waiter; hi != NULL; hi = hi->next) for (hj = hi->next; hj != NULL; hj = hj->next) { if (incompat(hj->blocked_mode, hi->blocked_mode)) lock_add_WFG_edge (hj→hi, holder=false); } }Three edge categories:
| Category | From | To | Edge meaning |
|---|---|---|---|
| (a) Holder↔Holder | Upgrader hj | Upgrader hi | hj’s upgrade blocked by hi’s granted or blocked mode |
| (b) Waiter→Holder | Waiter hj | Holder hi | hj can’t be granted because of hi’s granted or blocked mode |
| (c) Waiter→Waiter | Waiter hj | Waiter hi (earlier in queue) | hj can’t be granted even if holders release, because hi’s request (ahead in FIFO) is incompatible |
The holder_flag parameter on each edge records whether the target
(to_tran_index) is a holder — this is used in victim selection
(§8.5) to prefer aborting a holder over a waiter.
8.4 lock_add_WFG_edge — edge allocation and staleness detection
Section titled “8.4 lock_add_WFG_edge — edge allocation and staleness detection”// lock_add_WFG_edge — src/transaction/lock_manager.c (annotated)static intlock_add_WFG_edge (int from_tran_index, int to_tran_index, int holder_flag, INT64 edge_wait_stime, LK_ENTRY * holder, LK_ENTRY * waiter){ /* skip if either transaction is already a victim */ if (TWFG_node[from].DL_victim || TWFG_node[to].DL_victim) return NO_ERROR;
/* increment global edge sequence number */ lk_Gl.global_edge_seq_num++;
/* handle stale transactions: if a node was not checked by the detector yet, it might be a new transaction that reused the tran_index — clear its old edges */ if (TWFG_node[from].checked_by_deadlock_detector == false) { /* recycle old edges to free list */ // ... chain old edges back to TWFG_free_edge_idx ... TWFG_node[from].first_edge = -1; TWFG_node[from].checked_by_deadlock_detector = true; TWFG_node[from].tran_edge_seq_num = global_edge_seq_num; } /* same for to_tran_index ... */
/* allocate edge from free list (may grow the array) */ alloc_idx = lk_Gl.TWFG_free_edge_idx; lk_Gl.TWFG_free_edge_idx = TWFG_edge[alloc_idx].next;
/* populate edge */ TWFG_edge[alloc_idx].to_tran_index = to_tran_index; TWFG_edge[alloc_idx].edge_seq_num = global_edge_seq_num; TWFG_edge[alloc_idx].holder_flag = holder_flag; TWFG_edge[alloc_idx].edge_wait_stime = edge_wait_stime;
/* prepend to from_tran's edge list */ TWFG_edge[alloc_idx].next = TWFG_node[from].first_edge; TWFG_node[from].first_edge = alloc_idx;}Edge array growth (3-tier):
| Tier | Size | Storage |
|---|---|---|
| Initial | LK_MIN_TWFG_EDGE_COUNT = 200 | Static TWFG_edge_block[200] |
| Mid | LK_MID_TWFG_EDGE_COUNT = 1000 | Same static block (reuses first 200 + extends) |
| Max | LK_MAX_TWFG_EDGE_COUNT = MAX_NTRANS² | malloc’d block (copies from static) |
8.5 Phase 3 — DFS cycle detection
Section titled “8.5 Phase 3 — DFS cycle detection”The cycle detection is a classic DFS over the WFG node/edge
structure. The ancestor field chains the DFS path, and a cycle
is detected when a follow-edge leads to a node that is already on
the current path (ancestor != -1):
// lock_detect_local_deadlock — Phase 3 (annotated)for (k = 1; k < lk_Gl.num_trans; k++) { TWFG_node[k].current = TWFG_node[k].first_edge; TWFG_node[k].ancestor = -1; }
for (k = 1; k < lk_Gl.num_trans; k++) { if (TWFG_node[k].current == -1) continue; /* no outgoing edges — skip */
s = k; TWFG_node[s].ancestor = -2; /* sentinel: DFS root */
for (; s != -2; ) { /* skip stale nodes/edges */ // ... validation checks ...
if (TWFG_node[s].current == -1) { /* backtrack: no more edges from s */ t = TWFG_node[s].ancestor; TWFG_node[s].ancestor = -1; s = t; // advance parent's edge pointer continue; }
t = TWFG_edge[TWFG_node[s].current].to_tran_index;
/* skip stale edges */ // ... validation checks ...
if (TWFG_node[t].ancestor != -1) { /* CYCLE FOUND: s → t, and t is already on the path */ lock_select_deadlock_victim (thread_p, s, t); } else { /* no cycle yet: advance DFS into t */ TWFG_node[t].ancestor = s; s = t; } } }Staleness checks during DFS filter out edges that have become
invalid since Phase 2 — a transaction may have committed or been
aborted between the WFG build and the DFS walk. Three conditions
invalidate a node: checked_by_deadlock_detector == false (new
transaction reused the index), thrd_wait_stime == 0 (no longer
waiting), or edge_seq_num < tran_edge_seq_num (edge is from a
previous incarnation of this tran_index).
8.6 Phase 4 — victim selection strategy
Section titled “8.6 Phase 4 — victim selection strategy”When a cycle is found, lock_select_deadlock_victim is called with
s (the node that discovered the back-edge) and t (the node
already on the DFS path — the cycle start):
Cycle: t → ... → s → tThe function first validates the cycle (false-cycle check for stale edges), then selects a victim using a priority-based strategy:
Victim Selection Strategy (from source comments): 1) Must be a lock holder (holder_flag == true). 2) Must be an active transaction. 3) Prefer a transaction that does NOT have deadlock priority. 4) Prefer a transaction that has written fewer log records. 5) Prefer a transaction with a closer timeout. 6) Prefer the youngest transaction (highest tranid).Why each criterion exists:
- Holder — only a holder’s rollback frees what the cycle waits on; aborting a pure waiter removes an edge but releases nothing, so the cycle persists.
- Active — a transaction already committing / aborting is exiting anyway and cannot be aborted again; victimizing it resolves nothing.
- No deadlock priority — the priority flag is an explicit protection knob; protected transactions survive.
- Fewer log records — the cheapest rollback.
- Finite timeout — that application already expects retry-style failures, so a retryable wake costs the least surprise.
- Youngest — the least work thrown away; and because elders always survive ties, a long-running transaction is never re-victimized forever — the system as a whole keeps making progress.
The candidate field on TWFG_node marks which cycle members are
eligible victims (they are on the holder side of at least one edge).
The function walks the cycle via ancestor links, comparing each
candidate against the current best victim using the six criteria
in order.
// lock_select_deadlock_victim — src/transaction/lock_manager.c (annotated)static voidlock_select_deadlock_victim (THREAD_ENTRY * thread_p, int s, int t){ /* Phase A: validate cycle — check for stale edges */ // walk cycle t → ... → s → t, mark false_dd_cycle if any // node/edge is stale; if false cycle, clear ancestors and return
if (false_dd_cycle == true) { /* clear ancestor chain and return */ return; }
/* Phase B: initialize victim from the edge s→t */ if (TWFG_edge[TWFG_node[s].current].holder_flag) { /* t is a holder — candidate for victim */ if (logtb_is_active (thread_p, logtb_find_tranid (t))) { victims[victim_count].tran_index = t; victims[victim_count].can_timeout = LK_CAN_TIMEOUT (logtb_find_wait_msecs (t)); } }
/* Phase C: walk cycle s → ancestor → ... → t, compare candidates */ for (v = s; v != t; v = TWFG_node[v].ancestor) { if (TWFG_node[v].candidate == false) continue; /* not a holder in this cycle — skip */
tranid = logtb_find_tranid (v); if (!logtb_is_active (thread_p, tranid)) continue; /* criterion 2: must be active */
victim_tran_index = victims[victim_count].tran_index;
if (victim_tran_index == NULL_TRAN_INDEX) { victim_tranid = tranid; /* first eligible candidate */ } /* criterion 3: prefer no deadlock priority */ else if (logtb_has_deadlock_priority (victim_tran_index) != logtb_has_deadlock_priority (v)) { if (logtb_has_deadlock_priority (v) == false) victim_tranid = tranid; } else { tran_log_count = logtb_find_log_records_count (v); victim_tran_log_count = logtb_find_log_records_count (victim_tran_index);
/* criterion 4: prefer fewer log records */ if (tran_log_count < victim_tran_log_count) victim_tranid = tranid; else if (tran_log_count == victim_tran_log_count) { /* criterion 5: prefer closer timeout */ /* criterion 6: prefer youngest (highest tranid) */ if ((victims[victim_count].can_timeout == false && can_timeout == true) || (same_timeout_policy && LK_ISYOUNGER (tranid, current_victim))) victim_tranid = tranid; } }
if (victim_tranid != NULL_TRANID) { victims[victim_count].tran_index = v; victims[victim_count].tranid = victim_tranid; victims[victim_count].can_timeout = can_timeout; } }
/* Phase D: record victim for event log and increment victim_count */ if (victims[victim_count].tran_index != NULL_TRAN_INDEX) { /* collect LK_ENTRY pointers along the cycle for event log */ // ... walk cycle again, store holder/waiter entries ... victim_count++; }}The six criteria applied in order (each is a tie-breaker for the previous):
flowchart TD
C1{"1. Is it a lock holder\nin this cycle?"}
C1 -- "no" --> SKIP["skip"]
C1 -- "yes" --> C2{"2. Is transaction\nactive?"}
C2 -- "no" --> SKIP
C2 -- "yes" --> C3{"3. Does it have\ndeadlock priority?"}
C3 -- "yes (protected)" --> KEEP["keep current victim"]
C3 -- "no (unprotected)" --> PREFER["prefer this one"]
C3 -- "same" --> C4{"4. Fewer log\nrecords written?"}
C4 -- "yes (less work)" --> PREFER
C4 -- "same count" --> C5{"5. Has finite\ntimeout?"}
C5 -- "yes (can retry)" --> PREFER
C5 -- "same" --> C6{"6. Younger\n(higher tranid)?"}
C6 -- "yes" --> PREFER
C6 -- "no" --> KEEP
Figure 8-2 — Victim selection criteria as a decision tree. Each criterion is a tie-breaker for the previous. The overall preference is: unprotected, less work done, retryable, youngest.
Resolution modes:
can_timeout = true(the victim’swait_msecsis finite): wake withLOCK_RESUMED_DEADLOCK_TIMEOUT— treated as a soft timeout, the application can retry.can_timeout = false(LK_INFINITE_WAIT): wake withLOCK_RESUMED_ABORTED_FIRST/_OTHER— the transaction must be rolled back.
8.7 Phase 5 — resolution (waking victims)
Section titled “8.7 Phase 5 — resolution (waking victims)”// lock_detect_local_deadlock — Phase 5for (k = 0; k < victim_count; k++) { if (victims[k].can_timeout) lock_wakeup_deadlock_victim_timeout (victims[k].tran_index); else lock_wakeup_deadlock_victim_aborted (victims[k].tran_index); }These functions (detailed in Ch. 7 §7.3–7.4) find the victim’s
suspended thread and call lock_resume with the appropriate state.
The resumed thread then cleans up its waiter entry via
lock_internal_perform_unlock_object (Ch. 5 §5.7 CASE W).
Safety net — the no_victim_case_count escalation:
If 60 consecutive deadlock detection runs find no victims but threads are still suspended, the system suspects a livelock or undetectable deadlock. As a last resort, it forcibly times out one thread:
if (victim_count == 0) { if (lk_Gl.no_victim_case_count >= 60) { if (css_are_all_request_handlers_suspended ()) thread_get_manager ()->map_entries ( lock_victimize_first_thread_mapfunc); lk_Gl.no_victim_case_count = 0; } else lk_Gl.no_victim_case_count += 1; }At 100ms per daemon tick, 60 runs = ~6 seconds of no progress before the safety net fires.
8.8 WFG data structures — recap from Ch. 1
Section titled “8.8 WFG data structures — recap from Ch. 1”For reference, the WFG structures (defined in lock_manager.c):
struct lk_WFG_node{ int first_edge; /* head of outgoing edge list (-1 = none) */ bool candidate; /* eligible as victim in current cycle */ int current; /* DFS cursor into edge list */ int ancestor; /* DFS parent (-1 = not on path, -2 = root) */ INT64 thrd_wait_stime; /* wait start time (0 = not waiting) */ int tran_edge_seq_num; /* staleness marker */ bool checked_by_deadlock_detector; /* false = new tran, clear old edges */ bool DL_victim; /* already selected as victim */};
struct lk_WFG_edge{ int to_tran_index; /* target transaction (-2 = stale edge) */ int edge_seq_num; /* creation sequence number */ int holder_flag; /* true if target is a holder */ int next; /* next edge in same node's list */ INT64 edge_wait_stime; /* waiter's wait start time */ LK_ENTRY *holder; /* LK_ENTRY of the blocking holder */ LK_ENTRY *waiter; /* LK_ENTRY of the blocked waiter */};8.9 Concrete scenario — three-transaction deadlock
Section titled “8.9 Concrete scenario — three-transaction deadlock”flowchart LR
subgraph Build["Phase 2: Build WFG"]
direction LR
T1["T1: holds X on R1\nwaits S on R2"]
T2["T2: holds S on R2\nwaits X on R3"]
T3["T3: holds X on R3\nwaits S on R1"]
T1 -- "edge: T1 waits T2\n(waiter->holder on R2)" --> T2
T2 -- "edge: T2 waits T3\n(waiter->holder on R3)" --> T3
T3 -- "edge: T3 waits T1\n(waiter->holder on R1)" --> T1
end
subgraph DFS["Phase 3: DFS from T1"]
D1["visit T1 (ancestor=-2)"]
D2["visit T2 (ancestor=T1)"]
D3["visit T3 (ancestor=T2)"]
D4["follow edge T3->T1:\nT1.ancestor != -1\n→ CYCLE FOUND"]
D1 --> D2 --> D3 --> D4
end
subgraph Victim["Phase 4: Select victim"]
V1["Cycle: T1 -> T2 -> T3 -> T1\nAll are holders\nAll are active\nPick youngest (highest tranid)\n→ T3 is victim"]
end
subgraph Resolve["Phase 5: Resolution"]
R1["T3.can_timeout?\nyes → DEADLOCK_TIMEOUT\nno → ABORTED_FIRST"]
end
Build --> DFS --> Victim --> Resolve
Figure 8-3 — Three-transaction deadlock detection end-to-end. Phase 2 adds three edges forming a cycle. Phase 3 DFS discovers the cycle when T3’s edge leads back to T1. Phase 4 selects T3 (youngest by the default victim strategy). Phase 5 wakes T3 with timeout or abort.
8.10 Performance analysis — is full rebuild every run acceptable?
Section titled “8.10 Performance analysis — is full rebuild every run acceptable?”The WFG is rebuilt from scratch on every detection run. This section analyzes the actual cost and the design rationale.
Three gates that prevent unnecessary work:
flowchart LR
TICK["daemon tick\n(every 100ms)"]
G1{"deadlock_and_timeout\n_detector == 0?"}
G1 -- "yes (no waiters)" --> SKIP1["skip entirely\n(atomic read only)"]
G1 -- "no" --> TIMEOUT["timeout check\n(map over threads)"]
TIMEOUT --> G2{"lock_wait_count >= 2?"}
G2 -- "no (0 or 1 waiter)" --> SKIP2["skip WFG\n(deadlock impossible)"]
G2 -- "yes" --> G3{"interval elapsed?\n(default 1.0s)"}
G3 -- "no" --> SKIP3["skip WFG\n(too soon)"]
G3 -- "yes" --> FULL["FULL WFG SCAN\nlock_detect_local_deadlock"]
Figure 8-4 — Three gates before the WFG full scan runs. In an uncontended workload, Gate 1 catches everything — cost is one atomic read per 100ms. In a lightly contended workload, Gate 2 catches single-waiter cases. Gate 3 ensures the full scan runs at most once per second.
Cost breakdown when the full scan does execute:
| Phase | Time complexity | Practical cost | Bottleneck? |
|---|---|---|---|
| Node init | O(MAX_NTRANS) | ~tens of µs | No |
| Edge init | O(200) static block | negligible | No |
| Hash table scan | O(all active LK_RES) | The dominant cost. Each resource: acquire res_mutex, scan holders/waiters, release. | Yes |
| Per-resource edge build | O(holders² + holders×waiters + waiters²) | Most resources have 1–2 holders, 0 waiters → O(1) per resource | Rarely |
| DFS cycle detection | O(nodes + edges) | nodes ≤ MAX_NTRANS, edges = actual wait relationships | No |
The hash table scan is the bottleneck. If the system has 100K
active lock resources, the scan performs 100K res_mutex
lock/unlock pairs. However:
- Most resources have no waiter — the scan checks
res_ptr->holderandres_ptr->waiter, adds zero edges, and moves on. The per-resource cost is one mutex acquire + a few pointer reads + one mutex release. res_mutexis per-resource — while scanning resource R₁, other threads can freely lock/unlock resource R₂. The scan does not serialize the entire lock manager.- The scan runs at most once per second (default
PRM_ID_LK_RUN_DEADLOCK_INTERVAL = 1.0).
Design trade-off — full rebuild vs. incremental maintenance:
| Approach | Hot-path overhead (lock acquire/release) | Detection cost | Complexity |
|---|---|---|---|
| Full rebuild (CUBRID) | Zero — no WFG bookkeeping on acquire/release | O(resources) per scan, ≤ 1/sec | Simple |
| Incremental WFG | Edge add on every block, edge remove on every grant/release | O(DFS only) per scan | High — must maintain edge consistency under concurrent modifications |
CUBRID chose zero overhead on the hot path. In OLTP workloads where lock acquires happen tens of thousands of times per second, adding WFG edge maintenance to every acquire would cost more in aggregate than one full scan per second. The full-rebuild approach also avoids the complexity of maintaining graph consistency under concurrent lock state changes.
When the cost could become a problem:
- Tens of thousands of active lock resources and sustained multi-thread contention (many waiters per resource).
- In this scenario, the hash table scan could take tens of
milliseconds. Mitigation options:
- Increase
PRM_ID_LK_RUN_DEADLOCK_INTERVAL(reduce scan frequency — trades detection latency for lower overhead). - Lower
PRM_ID_LK_ESCALATION_AT(escalate sooner — reduces the total number of lock resources). - The timeout-based fallback (
wait_msecson each lock request) still resolves deadlocks even if the WFG scan is delayed.
- Increase
8.11 Chapter summary — key takeaways
Section titled “8.11 Chapter summary — key takeaways”- The WFG is rebuilt from scratch on every detection run — no incremental maintenance. This simplifies correctness at the cost of O(resources × holders²) construction time.
- Three categories of edges capture all wait relationships: holder↔holder (upgrader conflicts), waiter→holder, and waiter→waiter.
- DFS cycle detection uses
ancestorchains. Staleness checks during DFS filter out edges invalidated by concurrent commits. - Victim selection uses a six-criteria priority: must be holder, must be active, no deadlock priority, fewer log records, closer timeout, youngest transaction.
- Two resolution modes:
can_timeout→ soft timeout (retryable);!can_timeout→ abort (rollback required). - Edge array grows in 3 tiers: 200 (static) → 1000 (static extended) → MAX_NTRANS² (malloc). The static initial block avoids allocation for low-contention workloads.
- Safety net: 60 consecutive no-victim runs (~6 seconds) triggers forced timeout of one thread to prevent server hang.
Chapter 9: Special Paths
Section titled “Chapter 9: Special Paths”The main lock lifecycle (acquire → convert → release) was covered in Chapters 3–5. This chapter covers the special-purpose APIs that don’t fit that main flow: instant lock probe, composite lock, lock_scan / lock_classes_lock_hint, lock demotion, and lock_subclass.
9.1 lock_hold_object_instant — non-blocking probe
Section titled “9.1 lock_hold_object_instant — non-blocking probe”A lightweight lock check that never blocks and never creates an
LK_ENTRY. It tests whether a lock could be granted right now
without actually acquiring it.
// lock_internal_hold_lock_object_instant — src/transaction/lock_manager.cstatic intlock_internal_hold_lock_object_instant (THREAD_ENTRY * thread_p, int tran_index, const OID * oid, const OID * class_oid, LOCK lock){ /* shortcut: class lock subsumes the request? */ if (class_oid != NULL && !OID_IS_ROOTOID (class_oid)) if (lock_is_class_lock_escalated (..., lock)) return LK_GRANTED;
/* find (not find_or_insert!) the resource */ res_ptr = lk_Gl.m_obj_hash_table.find (thread_p, search_key); if (res_ptr == NULL) return LK_GRANTED; /* no resource → no contention */
/* am I already a holder? */ for (entry_ptr = res_ptr->holder; ...; entry_ptr = entry_ptr->next) if (entry_ptr->tran_index == tran_index) break;
if (entry_ptr == NULL) { /* not a holder: check compat with holders AND waiters */ compat1 = lock_compat (lock, res_ptr->total_waiters_mode); compat2 = lock_compat (lock, res_ptr->total_holders_mode); pthread_mutex_unlock (&res_ptr->res_mutex); return (compat1 == YES && compat2 == YES) ? LK_GRANTED : LK_NOTGRANTED; } else { /* already a holder: check conversion compat */ new_mode = lock_conv (lock, entry_ptr->granted_mode); if (new_mode == entry_ptr->granted_mode) { /* re-entrance */ return LK_GRANTED; } /* check against other holders */ // ... same group_mode logic as PATH F (Ch. 3 §3.9) ... return compat ? LK_GRANTED : LK_NOTGRANTED; }}Key differences from lock_object:
| Aspect | lock_hold_object_instant | lock_object |
|---|---|---|
Creates LK_ENTRY | No | Yes |
| Blocks on conflict | No (returns LK_NOTGRANTED) | Yes (suspends thread) |
| Modifies holder/waiter lists | No | Yes |
| Hash operation | find (read-only) | find_or_insert (may create LK_RES) |
| Use case | Optimistic probe before full lock | Actual lock acquisition |
Caller: scan_manager.c uses this as an optimistic first try
during heap/index scans. If the probe returns LK_GRANTED, the
scan can proceed knowing the lock would succeed, often without
needing to create any lock entry at all. If LK_NOTGRANTED, it
falls back to the full lock_object path.
9.2 Composite lock — batched multi-object locking
Section titled “9.2 Composite lock — batched multi-object locking”Composite lock is a deferred locking strategy for bulk DELETE/UPDATE. Instead of acquiring X locks on each row during the scan phase (which could deadlock with concurrent transactions), it collects OIDs first and acquires all locks at the end.
Lifecycle:
flowchart LR INIT["lock_initialize_composite_lock\n(set tran_index, wait_msecs)"] ADD["lock_add_composite_lock\n(collect OID + class_oid)\nRepeated per row"] FIN["lock_finalize_composite_lock\n(acquire all X locks)\nEscalate if count >= threshold"] ABORT["lock_abort_composite_lock\n(free collected OIDs)"] INIT --> ADD --> FIN ADD -- "error" --> ABORT FIN -- "error" --> ABORT
Figure 9-1 — Composite lock lifecycle. OIDs are collected during scan, locks are acquired in batch at finalization.
lock_add_composite_lock collects instance OIDs grouped by
class:
// lock_add_composite_lock — src/transaction/lock_manager.c (simplified)int lock_add_composite_lock (... LK_COMPOSITE_LOCK * comp_lock, const OID * oid, const OID * class_oid){ /* find or create per-class bucket */ for (lockcomp_class = lockcomp->class_list; ...; ) if (OID_EQ (class_oid, &lockcomp_class->class_oid)) break;
if (lockcomp_class == NULL) { /* new class: acquire IX on class immediately */ lock_internal_perform_lock_object (..., class_oid, IX_LOCK, ...); /* allocate OID buffer (up to escalation threshold) */ }
/* if class lock already >= X: skip instance OID collection */ if (lockcomp_class->class_lock_ptr->granted_mode >= X_LOCK) return; /* class lock subsumes everything */
/* collect the instance OID */ lockcomp_class->inst_oid_space[num_inst_oids++] = *oid; /* if count reaches escalation threshold: stop collecting */}lock_finalize_composite_lock acquires the actual locks:
// lock_finalize_composite_lock — src/transaction/lock_manager.c (simplified)int lock_finalize_composite_lock (... LK_COMPOSITE_LOCK * comp_lock){ for (each lockcomp_class in class_list) { if (class already >= X_LOCK || num_inst_oids == escalation_threshold) { /* escalation: X lock on the class */ lock_internal_perform_lock_object (..., class_oid, X_LOCK, ...); } else { /* individual: X lock on each collected instance */ for (i = 0; i < num_inst_oids; i++) lock_internal_perform_lock_object (..., inst_oid[i], X_LOCK, ...); } } lock_abort_composite_lock (comp_lock); /* free memory */}Caller: query_executor.c uses composite lock for
DELETE/UPDATE queries. The XASL plan carries a
LK_COMPOSITE_LOCK field; the executor collects target OIDs during
the scan phase and calls lock_finalize_composite_lock before
applying the modifications.
Escalation integration: Composite lock has its own escalation
logic — if the collected OID count for a class reaches
PRM_ID_LK_ESCALATION_AT, it stops collecting OIDs and instead
acquires X on the class at finalization. This mirrors the behavior
of the standard escalation (Ch. 6) but with a two-phase approach.
9.3 lock_scan and lock_classes_lock_hint — class-grain APIs
Section titled “9.3 lock_scan and lock_classes_lock_hint — class-grain APIs”These two APIs operate at the class level only, with no instance locking.
lock_scan (covered briefly in Ch. 3 §3.1) acquires a single
class lock for a heap scan:
// lock_scan — src/transaction/lock_manager.cint lock_scan (THREAD_ENTRY * thread_p, const OID * class_oid, int cond_flag, LOCK class_lock){ root_class_entry = lock_get_class_lock (thread_p, oid_Root_class_oid); return lock_internal_perform_lock_object ( thread_p, tran_index, class_oid, NULL, class_lock, wait_msecs, &class_entry, root_class_entry);}Caller: heap_file.c calls lock_scan(class_oid, IS_LOCK) at the
start of a heap scan.
lock_classes_lock_hint acquires locks on multiple classes at
once, given a pre-built LC_LOCKHINT array:
// lock_classes_lock_hint — src/transaction/lock_manager.c (simplified)int lock_classes_lock_hint (THREAD_ENTRY * thread_p, LC_LOCKHINT * lockhint){ /* sort classes for consistent ordering (avoid class-level deadlocks) */ // ... sort lockhint->classes by OID ...
for (i = 0; i < lockhint->num_classes; i++) { lock_internal_perform_lock_object ( thread_p, tran_index, &lockhint->classes[i].oid, NULL, lockhint->classes[i].lock, wait_msecs, &entry, root_class_entry); }}Caller: locator_sr.c (xlocator_fetch_lockhint_classes) uses
this for batch class locking when the client prefetches multiple
classes. The sort step is critical — acquiring class locks in a
consistent order prevents deadlocks between transactions that need
the same set of classes.
9.4 Lock demotion — lock_demote_class_lock
Section titled “9.4 Lock demotion — lock_demote_class_lock”Demotion is the reverse of conversion: downgrade a held lock to a weaker mode. This is used only in narrow circumstances.
// lock_demote_class_lock — src/transaction/lock_manager.c (simplified)int lock_demote_class_lock (THREAD_ENTRY * thread_p, const OID * oid, LOCK lock, LOCK * ex_lock){ entry_ptr = lock_find_tran_hold_entry (thread_p, tran_index, oid, true); return lock_internal_demote_class_lock (thread_p, entry_ptr, lock, ex_lock);}lock_internal_demote_class_lock changes granted_mode to the
weaker mode, recomputes total_holders_mode, and calls
lock_grant_blocked_holder / lock_grant_blocked_waiter to see if
the demotion unblocks any waiters.
Callers:
lock_demote_read_class_lock_for_checksumdb— demotes S→IS on class locks during checksumdb to reduce contention. Explicitly documented as “ONLY for checksumdb.”lock_demote_all_shared_class_locks— internal function that demotes all S class locks to IS, called during specific transaction state transitions.
Warning from the source: “Demoting a write lock in the middle of a transaction may cause recovery issues. It should be carefully used for some particular cases.” Demotion of instance locks is deliberately not supported.
9.5 lock_subclass — class hierarchy locking
Section titled “9.5 lock_subclass — class hierarchy locking”For CUBRID’s class inheritance hierarchy, locking a subclass requires intention locks on the superclass:
// lock_subclass — src/transaction/lock_manager.c (simplified)int lock_subclass (THREAD_ENTRY * thread_p, const OID * subclass_oid, const OID * superclass_oid, LOCK lock, int cond_flag){ /* determine intention lock for superclass */ if (lock <= S_LOCK) new_superclass_lock = IS_LOCK; else new_superclass_lock = IX_LOCK;
/* acquire intention lock on superclass if needed */ if (old_superclass_lock < new_superclass_lock) lock_internal_perform_lock_object (..., superclass_oid, NULL, new_superclass_lock, ...);
/* acquire the actual lock on the subclass */ lock_internal_perform_lock_object (..., subclass_oid, NULL, lock, ..., &subclass_entry, superclass_entry);}This follows the same multi-granularity pattern as lock_object’s
instance-on-class locking, but one level up in the hierarchy:
subclass-on-superclass instead of instance-on-class. The
class_entry back-pointer chains allow escalation to propagate
through the hierarchy.
9.6 lock_rep_read_tran — REPEATABLE READ helper
Section titled “9.6 lock_rep_read_tran — REPEATABLE READ helper”A specialized entry point for ALTER TABLE ADD COLUMN NOT NULL under REPEATABLE READ isolation:
int lock_rep_read_tran (THREAD_ENTRY * thread_p, LOCK lock, int cond_flag);This function acquires a lock on the root class OID to prevent
concurrent DDL during the operation. It is a narrow special case
and does not introduce any new locking mechanism — it just calls
lock_internal_perform_lock_object on the root OID.
9.7 Caller map for special-path APIs
Section titled “9.7 Caller map for special-path APIs”These APIs have narrower caller bases than the main lock_object/
lock_unlock_object (see Ch. 3 §3.2 for the main caller map).
| API | Callers | Context |
|---|---|---|
lock_hold_object_instant | scan_manager.c (heap scan, index scan) | Optimistic non-blocking probe before full lock_object. Two call sites: scan_next_heap_scan for heap OIDs, scan_next_index_scan for index OIDs. |
lock_initialize/add/finalize_composite_lock | query_executor.c | Bulk DELETE/UPDATE. lock_initialize at query start, lock_add per candidate row during scan, lock_finalize before applying modifications. |
lock_scan | heap_file.c | Single call site: heap_scancache_start acquires IS_LOCK on the class before scanning the heap. |
lock_classes_lock_hint | locator_sr.c (xlocator_fetch_lockhint_classes) | Batch class locking when the client prefetches multiple classes. Paired with lock_unlock_classes_lock_hint on completion. |
lock_subclass | locator_sr.c (INSERT/UPDATE on partitioned tables), query_executor.c (partition scan) | Acquires IX/IS on subclass with intention lock on superclass. Used when operating on class hierarchies / partitions. |
lock_demote_class_lock | locator_sr.c (xlocator_demote_class_lock), btree_load.c (index load) | Demote class lock after bulk operation. General-purpose but restricted. |
lock_demote_read_class_lock_for_checksumdb | locator_sr.c (xlocator_check_fk_validity) | S→IS demotion during checksumdb only. Source explicitly says “NEVER consider to use this function for any other clients/threads.” |
lock_rep_read_tran | transaction_sr.c (xtran_lock_rep_read) | Root-class lock for ALTER TABLE ADD COLUMN NOT NULL under RR. Single call site. |
9.8 Chapter summary — key takeaways
Section titled “9.8 Chapter summary — key takeaways”lock_hold_object_instantis a read-only, non-blocking probe: usesfind(notfind_or_insert), creates noLK_ENTRY, never blocks. Used by scan_manager for optimistic locking.- Composite lock defers X-lock acquisition to a batch finalization phase. Integrates its own escalation logic (class X if count reaches threshold). Used by query_executor for bulk DELETE/UPDATE.
lock_scanis a thin wrapper for class-grain IS/IX locks.lock_classes_lock_hintsorts classes before locking to prevent class-level deadlocks.- Lock demotion is intentionally restricted to class locks only and to specific callers (checksumdb). The source explicitly warns against demoting write locks mid-transaction.
lock_subclassextends multi-granularity locking to the class inheritance hierarchy: subclass-on-superclass mirrors instance-on-class.
Chapter 10: Serializable conflict handling (first-updater-wins)
Section titled “Chapter 10: Serializable conflict handling (first-updater-wins)”This chapter covers the one isolation rule the lock manager enforces
without a lock — the write-path ER_MVCC_SERIALIZABLE_CONFLICT raised
under REPEATABLE READ / SERIALIZABLE. It is the write-side counterpart to
§5.3 (lock_unlock_object — the isolation policy gate): §5.3 decides which
locks a transaction keeps to end-of-transaction under RR; this rule stops a
transaction from modifying a row that another transaction changed out from
under its snapshot. MVCC visibility (cubrid-mvcc-detail.md) supplies the
detection substrate; the serialization decision lives here. (The MVCC detail
doc delegates here for exactly this reason — visibility alone cannot serialize
two writers.)
10.1 What the error is
Section titled “10.1 What the error is”ER_MVCC_SERIALIZABLE_CONFLICT (-1158, base/error_code.h) is CUBRID’s
first-updater-wins write-conflict error — the analogue of PostgreSQL’s
“could not serialize access due to concurrent update” (SQLSTATE 40001).
Under RR/SERIALIZABLE a transaction may not UPDATE/DELETE a row whose latest
version differs from the version its snapshot sees. If a concurrent transaction
created a newer version, or deleted the row, after this transaction’s snapshot
was taken, the modify is refused with -1158 and the statement aborts.
There is no separate SERIALIZABLE mechanism: TRAN_SERIALIZABLE has no
standalone branch anywhere in the tree, so at runtime RR ≡ SERIALIZABLE and
both are gated identically by isolation > TRAN_READ_COMMITTED. The source
enforces serializability through this one first-updater-wins check plus RR’s
strict-2PL lock retention (§5.3); it does not add key-range locks.
10.2 Where it is raised — locator_lock_and_get_object_internal
Section titled “10.2 Where it is raised — locator_lock_and_get_object_internal”The check is not in lock_manager.c. It lives in the locator’s lock-and-fetch
path that UPDATE/DELETE use to pin a row for modification:
locator_lock_and_get_object_internal (src/transaction/locator_sr.c),
reached from locator_lock_and_get_object_with_evaluation with
lock_mode = X_LOCK. The X-lock is taken first, and only then is the
version fetched and checked:
// locator_lock_and_get_object_internal — src/transaction/locator_sr.c (annotated)// 1) acquire the row OID X-lock: conditional, else unconditional (blocking)if (lock_object (.., oid, class_oid, lock_mode, LK_COND_LOCK) != LK_GRANTED) { ... pgbuf_ordered_unfix (..); // drop page watchers if (lock_object (.., oid, class_oid, lock_mode, LK_UNCOND_LOCK) != LK_GRANTED) // ← write-write WAIT goto error; heap_prepare_get_context (..); // pages were unlatched; re-fetch }// 2) only now fetch the latest version and read its MVCC headerheap_get_last_version (.., context);heap_get_mvcc_header (.., context, &recdes_header);// 3) RR/SERIALIZABLE: compare the latest version against the txn snapshot (§10.3)So the write-write wait today is the OID-lock block (LK_UNCOND_LOCK); the
snapshot check below is a post-lock isolation gate, not the conflict
detector. (This ordering is the lever the cubrid_cv I13 implicit-update-lock
analysis proposes to invert.)
10.3 The gate — RR/SERIALIZABLE on RR-eligible classes
Section titled “10.3 The gate — RR/SERIALIZABLE on RR-eligible classes”// locator_sr.c (annotated)if (logtb_find_current_isolation (thread_p) > TRAN_READ_COMMITTED && logtb_check_class_for_rr_isolation_err (class_oid)) { MVCC_SNAPSHOT *snap = logtb_get_mvcc_snapshot (thread_p); MVCC_SATISFIES_SNAPSHOT_RESULT res = snap->snapshot_fnc (.., &recdes_header, snap); ... }Two conditions, both required:
- Isolation > READ COMMITTED — RR or SERIALIZABLE. Under READ COMMITTED the
whole block is skipped: RC never raises
-1158; it waits on the row lock and then re-reads the new version. - RR-eligible class —
logtb_check_class_for_rr_isolation_err(src/transaction/log_tran_table.c) returns true only when the class is notdb_root,_db_user, or_db_trigger. Those are accessed before table locks and secure consistency by locks, so they are exempt; most other catalog classes are MVCC-enabled and are checked.
10.4 The three branches
Section titled “10.4 The three branches”The gate runs the transaction snapshot (mvcc_satisfies_snapshot,
src/transaction/mvcc.c) against the latest version’s header:
| Snapshot result on latest version | Meaning | Outcome |
|---|---|---|
TOO_OLD_FOR_SNAPSHOT | Deleted + committed before my snapshot — invisible to me | S_DOESNT_EXIST (row “gone”, not a conflict) |
TOO_NEW_FOR_SNAPSHOT | Latest version’s inserter is active, or committed after my snapshot (MVCC_IS_REC_INSERTER_IN_SNAPSHOT) | ER_MVCC_SERIALIZABLE_CONFLICT |
SNAPSHOT_SATISFIED and MVCC_IS_HEADER_DELID_VALID | Latest version deleted by a concurrent transaction | ER_MVCC_SERIALIZABLE_CONFLICT |
SNAPSHOT_SATISFIED, not deleted | Latest == my visible version; nobody changed it | fall through — modify proceeds |
TOO_NEW_FOR_SNAPSHOT is defined in mvcc_satisfies_snapshot: a not-deleted
record whose inserter is “in snapshot” (active, or committed after the snapshot)
returns TOO_NEW. That is precisely “a concurrent transaction produced a version
I am not entitled to see” — the first-updater-wins condition.
10.5 Scenarios
Section titled “10.5 Scenarios”- Concurrent UPDATE. T1 snapshots at t0; T2 updates row R and commits after
t0; T1 then updates R → its latest version is
TOO_NEWfor T1 →-1158. (If T1 first blocked on T2’s row X-lock, the re-fetch after the grant sees the new version and raises-1158rather than silently overwriting.) - Concurrent DELETE. T2 deletes R and commits after t0; T1 updates/deletes R
→ latest version carries a concurrent deleter’s delete-id →
-1158. (If T2’s delete committed before t0 the result isTOO_OLD→S_DOESNT_EXIST.) - No conflict. No concurrent modify since t0 → latest == visible version → modify proceeds.
10.6 How callers treat it
Section titled “10.6 How callers treat it”-1158 is an abort signal, recognised alongside ER_LK_UNILATERALLY_ABORTED
by the layers that must not swallow it — query_executor.c, work_space.c
(client retry), trigger_manager.c. The engine does not retry internally: the
transaction aborts and the application decides whether to retry under a fresh
snapshot.
10.7 Relation to MVCC and escalation
Section titled “10.7 Relation to MVCC and escalation”- MVCC (
cubrid-mvcc-detail.md). MVCC supplies the substrate — the record header’s insert/delete MVCCID and themvcc_satisfies_*classifiers. This chapter is where the isolation layer turns “the latest version is not my visible version” into a refusal. - Escalation (Chapter 6). Orthogonal: escalation changes the lock granule
(instance→class) to relieve
LK_ENTRYmemory; it does not change this check. A bulk UPDATE that escalates still runs the per-row first-updater-wins check on each modified row under RR/SERIALIZABLE.
10.8 Position hints (cross-module)
Section titled “10.8 Position hints (cross-module)”These symbols sit outside lock_manager.c; anchor by function name (line numbers
verified against the develop+I8 worktree, 2026-06, and drift across revisions).
| Symbol | File | Line |
|---|---|---|
ER_MVCC_SERIALIZABLE_CONFLICT (-1158) | base/error_code.h | 1487 |
locator_lock_and_get_object_internal (raise sites, gate) | transaction/locator_sr.c | 12935 (raises 13041 / 13047) |
locator_lock_and_get_object_with_evaluation (entry, X_LOCK) | transaction/locator_sr.c | 13099 |
mvcc_satisfies_snapshot (TOO_NEW / TOO_OLD / SATISFIED) | transaction/mvcc.c | 156 |
logtb_check_class_for_rr_isolation_err (RR-eligible gate) | transaction/log_tran_table.c | 5671 |
Position hints as of this revision
Section titled “Position hints as of this revision”| Symbol | File | Line |
|---|---|---|
| Types (headers) | ||
struct lk_entry | lock_manager.h | 79 |
struct lk_res_key | lock_manager.h | 164 |
struct lk_res | lock_manager.h | 175 |
lock_remove_all_inst_locks | lock_manager.h | 195 |
enum LOCK | lock_table.h | 29 |
lock_Comp[][] | lock_table.c | 67 |
lock_Conv[][] | lock_table.c | 179 |
lock_conv() / lock_compat() | lock_table.h | 58 / 65 |
| Internal types (lock_manager.c) | ||
struct lk_tran_lock | lock_manager.c | 324 |
struct lk_global_data | lock_manager.c | 361 |
lk_Gl (global instance) | lock_manager.c | 418 |
lk_Obj_lock_res_desc | lock_manager.c | 621 |
obj_lock_entry_desc | lock_manager.c | 592 |
| Constants and macros | ||
LK_OBJ_LOCK_HASH | lock_manager.c | 84 |
LK_MIN_OBJECT_LOCKS | lock_manager.c | 433 |
LOCK_TRAN_LOCAL_POOL_MAX_SIZE | lock_manager.c | 351 |
LK_MIN_TWFG_EDGE_COUNT | lock_manager.c | 451 |
LK_MID_TWFG_EDGE_COUNT | lock_manager.c | 453 |
LK_MAX_TWFG_EDGE_COUNT | lock_manager.c | 455 |
LK_MAX_VICTIM_COUNT | lock_manager.c | 448 |
PRM_ID_LK_ESCALATION_AT | system_parameter.h | (config) |
PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATION | system_parameter.h | (config) |
PRM_ID_LK_RUN_DEADLOCK_INTERVAL | system_parameter.c | 1313 |
| Initialization / finalization | ||
lock_initialize | lock_manager.c | 5627 |
lock_finalize | lock_manager.c | 5859 |
lock_initialize_tran_lock_table | lock_manager.c | 1061 |
lock_initialize_object_hash_table | lock_manager.c | 1121 |
lock_initialize_object_lock_entry_list | lock_manager.c | 1152 |
lock_initialize_deadlock_detection | lock_manager.c | 1180 |
lock_finalize_tran_lock_table | lock_manager.c | 5702 |
lock_deadlock_detect_daemon_init | lock_manager.c | 5812 |
deadlock_detect_task_execute | lock_manager.c | 5780 |
| Entry init / hash | ||
lock_initialize_entry_as_granted | lock_manager.c | 920 |
lock_initialize_entry_as_blocked | lock_manager.c | 940 |
lock_initialize_entry_as_non2pl | lock_manager.c | 962 |
lock_create_search_key | lock_manager.c | 641 |
lock_res_key_compare | lock_manager.c | 844 |
lock_res_key_hash | lock_manager.c | 884 |
lock_get_hash_value | lock_manager.c | 1015 |
lock_get_new_entry | lock_manager.c | 9778 |
lock_free_entry | lock_manager.c | 9809 |
| Acquisition | ||
lock_object | lock_manager.c | 5945 |
lock_object_wait_msecs | lock_manager.c | 6273 |
lock_scan | lock_manager.c | 6303 |
lock_subclass | lock_manager.c | 6154 |
lock_classes_lock_hint | lock_manager.c | 6379 |
lock_internal_perform_lock_object | lock_manager.c | 3223 |
lock_find_class_entry | lock_manager.c | 1596 |
lock_position_holder_entry | lock_manager.c | 1743 |
lock_is_class_lock_escalated | lock_manager.c | 508 |
| Grant | ||
lock_grant_blocked_holder | lock_manager.c | 2512 |
lock_grant_blocked_waiter | lock_manager.c | 2629 |
lock_grant_blocked_waiter_partial | lock_manager.c | 2758 |
| Release | ||
lock_unlock_object | lock_manager.c | 6630 |
lock_unlock_object_donot_move_to_non2pl | lock_manager.c | 6613 |
lock_unlock_object_lock_internal | lock_manager.c | 6580 |
lock_unlock_object_by_isolation | lock_manager.c | 9841 |
lock_unlock_shared_inst_lock | lock_manager.c | 4399 |
lock_unlock_all | lock_manager.c | 6851 |
lock_internal_perform_unlock_object | lock_manager.c | 3945 |
| NON2PL | ||
lock_add_non2pl_lock | lock_manager.c | 1649 |
lock_update_non2pl_list | lock_manager.c | 4592 |
lock_remove_non2pl | lock_manager.c | 539 |
lock_notify_isolation_incons | lock_manager.c | 7516 |
| Instant lock | ||
lock_start_instant_lock_mode | lock_manager.c | 8879 |
lock_stop_instant_lock_mode | lock_manager.c | 8901 |
| Escalation | ||
lock_escalate_if_needed | lock_manager.c | 2974 |
lock_check_escalate | lock_manager.c | 2907 |
lock_increment_class_granules | lock_manager.c | 548 |
lock_decrement_class_granules | lock_manager.c | 550 |
| Suspend / resume | ||
lock_suspend | lock_manager.c | 2151 |
lock_resume | lock_manager.c | 2338 |
lock_wakeup_deadlock_victim_timeout | lock_manager.c | 2386 |
lock_wakeup_deadlock_victim_aborted | lock_manager.c | 2442 |
lock_force_timeout_expired_wait_transactions | lock_manager.c | 7443 |
| Deadlock detection | ||
lock_detect_local_deadlock | lock_manager.c | 7696 |
lock_add_WFG_edge | lock_manager.c | 4670 |
lock_select_deadlock_victim | lock_manager.c | 4806 |
lock_is_local_deadlock_detection_interval_up | lock_manager.c | 7622 |
| Special paths | ||
lock_hold_object_instant | lock_manager.c | 5901 |
lock_internal_hold_lock_object_instant | lock_manager.c | 3087 |
lock_initialize_composite_lock | lock_manager.c | 8559 |
lock_add_composite_lock | lock_manager.c | 8586 |
lock_finalize_composite_lock | lock_manager.c | 8740 |
lock_abort_composite_lock | lock_manager.c | 8800 |
lock_demote_class_lock | lock_manager.c | 4165 |
lock_internal_demote_class_lock | lock_manager.c | 4188 |
lock_rep_read_tran | lock_manager.c | 9698 |