Skip to content

CUBRID Lock Manager — Code-Level Deep Dive

Where this document fits: The high-level analysis cubrid-lock-manager.md covers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full lifecycle of a single lock request inside the kernel.

Contents:

ChTitleStatus
1Data-Structure Map
2Initialization and Memory Management
3Lock Acquisition — lock_internal_perform_lock_object Dissected
4Lock Conversion and Re-entrance
5Lock Release and the NON2PL Protocol
6Lock Escalation
7Suspend/Resume and Timeout
8Deadlock Detection — WFG Construction and Cycle Scan
9Special Paths — Instant Lock, Composite Lock, lock_scan

Every lock-manager operation is built on top of five structs. This chapter walks through every field of each struct with code excerpts and draws the pointer relationships between them. Having this map in your head is the prerequisite for following the flows in later chapters.

flowchart TB
  subgraph GLOBAL["lk_Gl (LK_GLOBAL_DATA) — single global instance"]
    HT["m_obj_hash_table\n(lockfree_hashmap)"]
    TLT["tran_lock_table[0..N]\n(LK_TRAN_LOCK array)"]
    FREE["obj_free_entry_list\n(LF_FREELIST)"]
    WFG["TWFG_node[] / TWFG_edge[]\n(deadlock detection)"]
  end

  subgraph RES["LK_RES — one lockable resource"]
    KEY["key: LK_RES_KEY"]
    AGG["total_holders_mode\ntotal_waiters_mode"]
    HL["holder -> waiter -> non2pl"]
  end

  subgraph ENTRY["LK_ENTRY — one lock fact"]
    E_RES["res_head -> LK_RES"]
    E_MODE["granted_mode / blocked_mode"]
    E_LINK["next (resource side)\ntran_next/tran_prev (tran side)"]
    E_CLASS["class_entry -> parent"]
  end

  subgraph TRAN["LK_TRAN_LOCK — one transaction's lock state"]
    ROOT["root_class_hold"]
    CLS["class_hold_list"]
    INST["inst_hold_list"]
    POOL["lk_entry_pool\n(local pool, max 10)"]
    N2PL["non2pl_list"]
  end

  HT --> RES
  TLT --> TRAN
  RES -.-> ENTRY
  TRAN -.-> ENTRY
  ENTRY --> RES
  ENTRY --> TRAN

Figure 1-1 — Relationship among the five core structs. Arrows show pointer direction. The key insight is that LK_ENTRY is linked into both LK_RES (resource view) and LK_TRAN_LOCK (transaction view) simultaneously.

1.2 LK_RES_KEY — identifier of a lockable resource

Section titled “1.2 LK_RES_KEY — identifier of a lockable resource”

The key that identifies what object a lock protects. Used as the hash-table lookup key.

// struct lk_res_key — src/transaction/lock_manager.h
struct lk_res_key
{
LOCK_RESOURCE_TYPE type; /* INSTANCE, CLASS, ROOT_CLASS */
OID oid; /* target object's OID */
OID class_oid; /* owning class OID (instance only) */
};
FieldSizeDescription
typeenum (4B)One of LOCK_RESOURCE_INSTANCE, LOCK_RESOURCE_CLASS, LOCK_RESOURCE_ROOT_CLASS. LOCK_RESOURCE_OBJECT is obsolete.
oid8B (pageid 4B + slotid 2B + volid 2B)Identifier of the object this lock protects.
class_oid8BValid only for instance locks. Set to NULL OID for class/root-class locks.

type is derived from oid and class_oid. lock_create_search_key implements the derivation rule:

// lock_create_search_key — src/transaction/lock_manager.c
static LK_RES_KEY
lock_create_search_key (OID * oid, OID * class_oid)
{
LK_RES_KEY search_key;
// ... OID copying ...
if (oid != NULL && OID_IS_ROOTOID (oid))
{
search_key.type = LOCK_RESOURCE_ROOT_CLASS;
}
else if (class_oid == NULL || OID_IS_ROOTOID (class_oid))
{
search_key.type = LOCK_RESOURCE_CLASS;
}
else
{
search_key.type = LOCK_RESOURCE_INSTANCE;
}
return search_key;
}

The derivation rules in tabular form:

ConditiontypeMeaning
oid is root OIDROOT_CLASSThe database root class itself
class_oid is NULL or root OIDCLASSA table (class) — the coarse-grain lock target
OtherwiseINSTANCEAn individual row (instance) within a table

Hash-table comparison ignores type and checks oid only. lock_res_key_compare calls OID_EQ(&k1->oid, &k2->oid) and uses type only in an assert. This relies on the invariant that a single OID cannot simultaneously serve as two different resource types.

1.3 LK_RES — lock resource (state of one lockable object)

Section titled “1.3 LK_RES — lock resource (state of one lockable object)”

Holds all lock state for one lockable object (database, class, or instance). Found in the hash table by LK_RES_KEY.

// struct lk_res — src/transaction/lock_manager.h
struct lk_res
{
LK_RES_KEY key; /* hash key — see 1.2 */
LOCK total_holders_mode; /* aggregate mode of all current holders */
LOCK total_waiters_mode; /* aggregate mode of all current waiters */
LK_ENTRY *holder; /* head of granted-lock linked list */
LK_ENTRY *waiter; /* head of blocked-lock linked list */
LK_ENTRY *non2pl; /* head of early-released-lock tracking list */
pthread_mutex_t res_mutex; /* per-resource mutex */
LK_RES *hash_next; /* hash-bucket collision chain */
LK_RES *stack; /* lock-free freelist retired stack */
UINT64 del_id; /* lock-free reclamation epoch ID */
};

Field-by-field:

FieldRoleWhy it exists
total_holders_modeLUB (least upper bound) of all granted locksMakes compatibility checks O(1) — compare against this one value instead of walking the holder list.
total_waiters_modeLUB of all waiting locksStarvation guard: a new request must also be compatible with waiting requests, so a stream of S requests cannot leapfrog a waiting X.
holderSingly-linked list of granted LK_ENTRYsLinked via LK_ENTRY.next. Multiple transactions can hold the same resource simultaneously (e.g., multiple S_LOCKs).
waiterSingly-linked list of blocked LK_ENTRYsFIFO order. Head is the longest-waiting request.
non2plTraces of S locks released early under READ COMMITTEDUsed for conflict detection when the same transaction re-reads later.
res_mutexPer-resource mutexProtects holder/waiter/non2pl list manipulation. Per-resource (not global), so lock requests on different resources proceed in parallel.

When total_holders_mode is recomputed: every time an entry is added to, removed from, or converted within the holder list, the entire holder list is walked to recompute the LUB. This happens under res_mutex, so it is serialized per-resource.

1.4 LK_ENTRY — one lock fact (the dual-threaded record)

Section titled “1.4 LK_ENTRY — one lock fact (the dual-threaded record)”

The most important struct in the lock manager. Represents a single lock fact and is linked into both the resource view and the transaction view simultaneously.

// struct lk_entry — src/transaction/lock_manager.h
struct lk_entry
{
struct lk_res *res_head; /* back-pointer to owning resource */
THREAD_ENTRY *thrd_entry; /* requesting thread (valid only when blocked) */
int tran_index; /* owning transaction's index */
LOCK granted_mode; /* currently granted mode */
LOCK blocked_mode; /* requested mode when blocked */
int count; /* re-entrance counter */
UINT64 del_id; /* lock-free reclamation */
LK_ENTRY *stack; /* freelist retired stack */
LK_ENTRY *next; /* resource view: next in holder/waiter list */
LK_ENTRY *tran_next; /* transaction view: next lock of same tran */
LK_ENTRY *tran_prev; /* transaction view: prev lock of same tran */
LK_ENTRY *class_entry; /* parent class's LK_ENTRY (for instance locks) */
int ngranules; /* count of instance locks below this class lock */
int instant_lock_count; /* instant lock request count */
int bind_index_in_tran;
XASL_ID xasl_id; /* query plan ID that requested this lock */
};

An LK_ENTRY plays one of three roles — granted holder, blocked waiter, or non2pl marker. Which fields are meaningful depends on the role:

Fieldholder (granted)waiter (blocked)non2plDescription
res_headAlways valid. Which resource this lock belongs to.
thrd_entryNULLNULLWaiter only. Points to the thread to resume.
tran_indexWhich transaction owns this lock.
granted_modecurrent modeNULL_LOCKpre-release modeFor holders, the actual lock mode. For waiters, not yet granted so NULL_LOCK.
blocked_modeNULL_LOCKrequested modeNULL_LOCKFor waiters, the mode being requested. During conversion wait, a holder can have blocked_mode != NULL_LOCKgranted_mode is the currently held mode, blocked_mode is the upgrade target.
count≥ 110Re-entrance counter. If the same transaction requests the same resource and mode n times, count becomes n. Always 0 for non2pl.
nextnext in holder listnext in waiter listnext in non2pl listResource-view linkage. Singly-linked.
tran_next/tran_prevTransaction-view linkage. Doubly-linked. Waiters are not in the transaction hold list.
class_entryparent class’s LK_ENTRYOne-hop back-reference from instance lock to class lock. Used for escalation.
ngranulescount of child instance locksFor class locks only. Compared against escalation threshold. Zero for instance locks.
instant_lock_countinstant-mode lock countManaged by lock_start_instant_lock_mode / lock_stop_instant_lock_mode.

Key invariant: exclusive use of granted_mode and blocked_mode

Entry in holder list:
granted_mode = actual held mode (S, X, IS, IX, ...)
blocked_mode = NULL_LOCK (normal) or upgrade-target mode (conversion wait)
Entry in waiter list:
granted_mode = NULL_LOCK
blocked_mode = requested mode

Read the pair as the state. One entry per (resource, transaction): the field pair plus which list the entry threads is the transaction’s state on that resource — X/– in the holder list: plain holder · IX/SIX in the holder list: upgrader (partially acquired, with a pending target) · –/X in the waiter list: waiter (nothing acquired yet). Vocabulary note: blocked_mode is a pending target, not an “intention” — that word is reserved for the intention modes IS/IX/SIX.

This invariant is enforced by the two initializers:

// lock_initialize_entry_as_granted — src/transaction/lock_manager.c
static void
lock_initialize_entry_as_granted (LK_ENTRY * entry_ptr, int tran_index,
LK_RES * res, LOCK lock)
{
entry_ptr->granted_mode = lock; /* actual mode */
entry_ptr->blocked_mode = NULL_LOCK; /* not waiting */
entry_ptr->count = 1;
// ...
}
// lock_initialize_entry_as_blocked — src/transaction/lock_manager.c
static void
lock_initialize_entry_as_blocked (LK_ENTRY * entry_ptr, THREAD_ENTRY * thread_p,
int tran_index, LK_RES * res, LOCK lock)
{
entry_ptr->granted_mode = NULL_LOCK; /* not yet granted */
entry_ptr->blocked_mode = lock; /* requested mode */
entry_ptr->count = 1;
// ...
}

Conversion wait is the exception where both fields are non-NULL. For example, if tran 7 holds S_LOCK and requests an upgrade to X_LOCK:

  • granted_mode = S_LOCK (still held)
  • blocked_mode = X_LOCK (upgrade pending)
  • The entry stays in the holder list but is repositioned to the back via lock_position_holder_entry. Covered in detail in Ch. 4.

1.5 LK_TRAN_LOCK — one transaction’s complete lock state

Section titled “1.5 LK_TRAN_LOCK — one transaction’s complete lock state”
// struct lk_tran_lock — src/transaction/lock_manager.c
struct lk_tran_lock
{
pthread_mutex_t hold_mutex; /* protects hold lists */
LK_ENTRY *inst_hold_list; /* instance lock hold list */
LK_ENTRY *class_hold_list; /* class lock hold list */
LK_ENTRY *root_class_hold; /* root class lock (at most one) */
LK_ENTRY *lk_entry_pool; /* local pool of reusable LK_ENTRY */
int lk_entry_pool_count; /* current count in pool */
int inst_hold_count; /* number of instance locks held */
int class_hold_count; /* number of class locks held */
LK_ENTRY *waiting; /* currently blocked lock entry (at most one) */
pthread_mutex_t non2pl_mutex; /* protects non2pl_list */
LK_ENTRY *non2pl_list; /* early-released lock tracking list */
int num_incons_non2pl; /* inconsistent non2pl count */
bool lock_escalation_on; /* escalation-in-progress guard */
bool is_instant_duration; /* instant lock mode on/off */
};

Why three separate hold sub-lists:

flowchart TB
  TRAN["LK_TRAN_LOCK (tran #7)"]
  ROOT["root_class_hold\n(at most 1)"]
  CLS["class_hold_list\n-> entry(class A, IX)\n-> entry(class B, IS)\n-> ..."]
  INST["inst_hold_list\n-> entry(row A.1, X)\n-> entry(row A.2, S)\n-> ..."]

  TRAN --> ROOT
  TRAN --> CLS
  TRAN --> INST

Figure 1-2 — Three-way hold list split. Separating root/class/instance means “what lock does this transaction hold on class A?” can be answered by walking only the class list, O(number of classes), instead of scanning all locks.

FieldDetail
hold_mutexAcquired when manipulating inst/class/root hold lists. Separate from res_mutex — cross-locking is needed when one transaction manipulates locks on different resources.
lk_entry_poolEach transaction locally caches up to LOCK_TRAN_LOCAL_POOL_MAX_SIZE = 10 LK_ENTRY structs. lock_get_new_entry checks this pool first; only falls through to the global obj_free_entry_list when empty. Purpose: reduce contention on the global freelist in hot OLTP paths.
waitingThe single LK_ENTRY this transaction is currently blocked on. A transaction can wait on at most one lock at a time (one thread suspends on one lock request).
non2pl_listWhen an S lock is released early under READ COMMITTED, the LK_ENTRY is moved here. Other transactions check this list when attempting an X lock on the same resource to detect inconsistencies.
lock_escalation_onRe-entrance guard during escalation.
is_instant_durationTurned on by lock_start_instant_lock_mode. Locks acquired in this mode are released as soon as the statement ends.

1.6 LK_GLOBAL_DATA — the global singleton lk_Gl

Section titled “1.6 LK_GLOBAL_DATA — the global singleton lk_Gl”

Exactly one instance exists system-wide. The root of all lock state.

// struct lk_global_data — src/transaction/lock_manager.c
struct lk_global_data
{
int max_obj_locks; /* max lock count set at init */
lk_hashmap_type m_obj_hash_table; /* OID -> LK_RES lockfree hashmap */
LF_FREELIST obj_free_entry_list; /* global LK_ENTRY freelist */
int num_trans; /* number of initialized transactions */
LK_TRAN_LOCK *tran_lock_table; /* per-tran LK_TRAN_LOCK array */
pthread_mutex_t DL_detection_mutex; /* deadlock detection guard */
struct timeval last_deadlock_run; /* timestamp of last deadlock scan */
LK_WFG_NODE *TWFG_node; /* WFG node array (per-tran) */
LK_WFG_EDGE *TWFG_edge; /* WFG edge array */
int max_TWFG_edge; /* max edge array size */
int TWFG_free_edge_idx; /* next free edge index */
int global_edge_seq_num; /* edge creation sequence number */
short no_victim_case_count;
bool verbose_mode;
std::atomic_int deadlock_and_timeout_detector;
};
LK_GLOBAL_DATA lk_Gl; /* global instance — file scope */

Overall memory allocation strategy:

flowchart TB
  REQ["lock request -> need LK_ENTRY"]
  LP["1. transaction local pool\n(lk_entry_pool, max 10)"]
  GF["2. global freelist\n(obj_free_entry_list)"]
  MA["3. malloc new block"]

  REQ --> LP
  LP -- "empty" --> GF
  GF -- "empty" --> MA

  REL["lock release -> return LK_ENTRY"]
  LP2["1. return to local pool\n(if count < 10)"]
  GF2["2. return to global freelist"]

  REL --> LP2
  LP2 -- "pool full" --> GF2

Figure 1-3 — Three-tier LK_ENTRY allocation/deallocation strategy. Allocation: local pool → global freelist → malloc. Deallocation: reverse order. Implemented by lock_get_new_entry and lock_free_entry.

1.7 Lock modes — the 12-value LOCK enum and two static tables

Section titled “1.7 Lock modes — the 12-value LOCK enum and two static tables”
// enum LOCK — src/transaction/lock_table.h
typedef enum
{
NA_LOCK = 0,
INCON_NON_TWO_PHASE_LOCK = 1, /* NON2PL */
NULL_LOCK = 2,
SCH_S_LOCK = 3,
IS_LOCK = 4,
S_LOCK = 5,
IX_LOCK = 6,
BU_LOCK = 7,
SIX_LOCK = 8,
U_LOCK = 9,
X_LOCK = 10,
SCH_M_LOCK = 11,
LOCK_COUNT /* = 12 */
} LOCK;

Two static 12×12 tables drive all mode decisions:

lock_Comp[requested][current] — compatibility table:

  • LOCK_COMPAT_YES: the new request can coexist with the existing lock
  • LOCK_COMPAT_NO: conflict — the new request must wait
  • LOCK_COMPAT_UNKNOWN: this combination cannot occur in practice (assert-guarded)

lock_Conv[requested][current] — conversion table:

  • When the same transaction already holds current and requests requested, the resulting mode is lock_Conv[requested][current] (the LUB of the two modes)
  • NA_LOCK result: this conversion is undefined (assert failure)
// inline accessors — src/transaction/lock_table.h
inline LOCK
lock_conv (LOCK requested, LOCK current)
{
assert (lock_Conv[requested][current] != NA_LOCK);
return lock_Conv[requested][current];
}
inline LOCK_COMPATIBILITY
lock_compat (LOCK requested, LOCK current)
{
assert (lock_Comp[requested][current] != LOCK_COMPAT_UNKNOWN);
return lock_Comp[requested][current];
}

Conversion examples (read directly from lock_Conv):

Currently heldNew requestResult (LUB)Meaning
IS_LOCKIX_LOCKIX_LOCKIntention shared → intention exclusive upgrade
S_LOCKIX_LOCKSIX_LOCKShared + intention exclusive = SIX
S_LOCKX_LOCKX_LOCKShared → exclusive upgrade
IX_LOCKS_LOCKSIX_LOCKIX + S = SIX (symmetric)
U_LOCKX_LOCKX_LOCKUpdate → exclusive upgrade

1.8 Pointer-relationship summary — everywhere a single lock connects

Section titled “1.8 Pointer-relationship summary — everywhere a single lock connects”

Every path a single LK_ENTRY is linked into:

flowchart LR
  subgraph Resource["LK_RES (resource R)"]
    direction TB
    RH["holder list"]
    RW["waiter list"]
    RN["non2pl list"]
  end

  subgraph Tran["LK_TRAN_LOCK (tran T)"]
    direction TB
    TI["inst_hold_list"]
    TC["class_hold_list"]
    TR["root_class_hold"]
    TN["non2pl_list"]
  end

  E["LK_ENTRY\n(tran T, resource R)"]

  RH -- "next" --> E
  E -- "res_head" --> Resource
  TI -- "tran_next/prev" --> E
  E -- "class_entry" --> CE["LK_ENTRY\n(tran T, class of R)"]
  CE -- "ngranules++" --> CE

Figure 1-4 — Every pointer relationship involving a single LK_ENTRY. next is the resource view, tran_next/tran_prev is the transaction view, class_entry bridges the granularity hierarchy, and res_head is the back-reference.

  1. LK_ENTRY is a dual-threaded record. A single lock fact exists in both the resource view (next) and the transaction view (tran_next/tran_prev) simultaneously.
  2. total_holders_mode and total_waiters_mode enable O(1) compatibility checks. A new request compares against two aggregate values instead of walking the entire holder list.
  3. The combination of granted_mode and blocked_mode determines an entry’s state:
    • granted != NULL_LOCK, blocked == NULL_LOCK → normal holder
    • granted == NULL_LOCK, blocked != NULL_LOCK → waiter
    • granted != NULL_LOCK, blocked != NULL_LOCKholder undergoing conversion wait
  4. Memory allocation is three-tier: local pool (10 entries) → global freelist → malloc.
  5. Per-resource res_mutex provides resource-granularity serialization, so lock requests on different resources proceed fully in parallel.

Chapter 2: Initialization and Memory Management

Section titled “Chapter 2: Initialization and Memory Management”

The lock manager is bootstrapped once during server startup by lock_initialize and torn down by lock_finalize. This chapter traces the initialization sequence step by step, then examines the runtime memory management — the three-tier allocation strategy and its integration with the lockfree infrastructure.

2.1 lock_initialize — the top-level entry point

Section titled “2.1 lock_initialize — the top-level entry point”
// lock_initialize — src/transaction/lock_manager.c
int
lock_initialize (void)
{
error_code = lock_initialize_tran_lock_table (); /* step 1 */
lock_initialize_object_hash_table (); /* step 2 */
error_code = lock_initialize_object_lock_entry_list (); /* step 3 */
error_code = lock_initialize_deadlock_detection (); /* step 4 */
// ... env-var checks for verbose/dump mode ...
lock_deadlock_detect_daemon_init (); /* step 5 */
return error_code;
}

The order matters. Step 1 must complete before step 4 because lock_initialize_deadlock_detection reads lk_Gl.num_trans (set by step 1) to size the WFG node array.

flowchart LR
  INIT["lock_initialize"]
  S1["1. tran_lock_table\nMAX_NTRANS slots\n+ 10 entries each"]
  S2["2. obj_hash_table\nlockfree hashmap\nMAX(10K, NTRANS*300)"]
  S3["3. obj_free_entry_list\nlockfree freelist"]
  S4["4. deadlock detection\nTWFG_node[num_trans]"]
  S5["5. deadlock daemon\n100ms loop"]

  INIT --> S1 --> S2 --> S3 --> S4 --> S5

Figure 2-1 — lock_initialize sequence. Each step builds on the previous one’s output. The deadlock daemon (step 5) is the last piece — it must not run until all data structures are ready.

2.2 Step 1 — lock_initialize_tran_lock_table

Section titled “2.2 Step 1 — lock_initialize_tran_lock_table”

Allocates and initializes the per-transaction lock state table.

// lock_initialize_tran_lock_table — src/transaction/lock_manager.c
static int
lock_initialize_tran_lock_table (void)
{
num_trans = MAX_NTRANS;
lk_Gl.tran_lock_table =
(LK_TRAN_LOCK *) malloc (SIZEOF_LK_TRAN_LOCK * num_trans);
memset (lk_Gl.tran_lock_table, 0, SIZEOF_LK_TRAN_LOCK * num_trans);
for (i = 0; i < num_trans; i++)
{
tran_lock = &lk_Gl.tran_lock_table[i];
pthread_mutex_init (&tran_lock->hold_mutex, NULL);
pthread_mutex_init (&tran_lock->non2pl_mutex, NULL);
/* pre-fill local entry pool */
for (j = 0; j < LOCK_TRAN_LOCAL_POOL_MAX_SIZE; j++) /* = 10 */
{
entry = (LK_ENTRY *) malloc (sizeof (LK_ENTRY));
lock_initialize_entry (entry);
entry->next = tran_lock->lk_entry_pool;
tran_lock->lk_entry_pool = entry;
}
tran_lock->lk_entry_pool_count = LOCK_TRAN_LOCAL_POOL_MAX_SIZE;
}
return NO_ERROR;
}

Key observations:

  • MAX_NTRANS slots are allocated up front. This is a compile-time constant, not a runtime parameter. Every tran_index from 0 to MAX_NTRANS - 1 gets its own LK_TRAN_LOCK regardless of how many transactions are active.
  • Each slot gets 10 pre-allocated LK_ENTRYs via direct malloc. These go into the lk_entry_pool singly-linked list. This is the Tier 1 local pool from Ch. 1 §1.6.
  • Two mutexes per transaction: hold_mutex (protects the three hold lists) and non2pl_mutex (protects the non2pl list). They are separate because a lock release can simultaneously manipulate the hold list and the non2pl list from different code paths.
  • All other fields are zeroed by the memset, which means: inst_hold_list = NULL, class_hold_list = NULL, root_class_hold = NULL, waiting = NULL, non2pl_list = NULL, lock_escalation_on = false, is_instant_duration = false.

2.3 Step 2 — lock_initialize_object_hash_table

Section titled “2.3 Step 2 — lock_initialize_object_hash_table”

Sets up the lockfree hashmap that maps LK_RES_KEY → LK_RES.

// lock_initialize_object_hash_table — src/transaction/lock_manager.c
static void
lock_initialize_object_hash_table (void)
{
lk_Gl.max_obj_locks = 10000; /* LK_INITIAL_OBJECT_LOCK_TABLE_SIZE */
const int obj_hash_size =
MAX (lk_Gl.max_obj_locks, LK_MIN_OBJECT_LOCKS);
/* LK_MIN_OBJECT_LOCKS = MAX_NTRANS * 300 */
lk_Gl.m_obj_hash_table.init (
obj_lock_res_Ts, THREAD_TS_OBJ_LOCK_RES,
obj_hash_size, block_size, block_count,
lk_Obj_lock_res_desc);
}

The hash table is a lockfree_hashmap<LK_RES_KEY, LK_RES> — a generic lockfree hash table from CUBRID’s lockfree infrastructure (see cubrid-lockfree-hashmap.md). The lock manager plugs in its own behavior through an LF_ENTRY_DESCRIPTOR:

// lk_Obj_lock_res_desc — src/transaction/lock_manager.c
LF_ENTRY_DESCRIPTOR lk_Obj_lock_res_desc = {
offsetof (LK_RES, stack), /* retired-entry stack pointer */
offsetof (LK_RES, hash_next), /* hash collision chain */
offsetof (LK_RES, del_id), /* epoch ID for reclamation */
offsetof (LK_RES, key), /* key location */
offsetof (LK_RES, res_mutex), /* per-entry mutex */
LF_EM_USING_MUTEX, /* each LK_RES has its own mutex */
LF_ENTRY_DESCRIPTOR_MAX_ALLOC,
lock_alloc_resource, /* malloc + pthread_mutex_init */
lock_dealloc_resource, /* pthread_mutex_destroy + free */
lock_init_resource, /* zero out holders/waiters/non2pl */
lock_uninit_resource, /* assert lists are empty */
lock_res_key_copy,
lock_res_key_compare, /* OID_EQ only — type is assert-checked */
lock_res_key_hash, /* delegates to LK_OBJ_LOCK_HASH macro */
NULL /* no insert callback */
};

The hash function (lock_res_key_hashLK_OBJ_LOCK_HASHlock_get_hash_value) deserves a closer look:

// lock_get_hash_value — src/transaction/lock_manager.c
static unsigned int
lock_get_hash_value (const OID * oid, int htsize)
{
if (oid->slotid <= 0)
{
addr = oid->pageid - oid->slotid;
}
else
{
next_base_slotid = 2;
while (next_base_slotid <= (unsigned) oid->slotid)
next_base_slotid *= 2;
addr = oid->pageid
+ (htsize / next_base_slotid)
* (2 * oid->slotid - next_base_slotid + 1);
}
return (addr % htsize);
}

This is not a simple modulo hash. For positive slotid values, it uses a van Emde Boas–style bit-interleave that spreads consecutive slot IDs within the same page across different hash buckets. The intent: rows on the same heap page (which tend to be locked together in scan order) are distributed across the hash table, reducing bucket-chain contention.

For slotid <= 0 (index keys — unique index last-key OIDs use slotid = -1), it falls back to a simpler pageid - slotid computation.

Worked example — hash distribution for rows on page 100:

Assume htsize = 6000 (a small example). Six rows on the same heap page (pageid=100) with slotid 1–6:

OID (pageid, slotid)next_base_slotidaddr formulaaddrbucket (addr % 6000)
(100, 1)2100 + (6000/2) * (2·1 - 2 + 1) = 100 + 300031003100
(100, 2)4100 + (6000/4) * (2·2 - 4 + 1) = 100 + 150016001600
(100, 3)4100 + (6000/4) * (2·3 - 4 + 1) = 100 + 450046004600
(100, 4)8100 + (6000/8) * (2·4 - 8 + 1) = 100 + 750850850
(100, 5)8100 + (6000/8) * (2·5 - 8 + 1) = 100 + 225023502350
(100, 6)8100 + (6000/8) * (2·6 - 8 + 1) = 100 + 375038503850

All six rows land in different buckets (3100, 1600, 4600, 850, 2350, 3850). A naive pageid * 31 + slotid hash would cluster them within 6 adjacent buckets. The van Emde Boas–style stride ensures that a heap-page scan locking rows 1, 2, 3, … in order hits widely separated buckets, minimizing hash-chain contention under concurrent scans.

Hash table structure after several lock requests:

flowchart TB
  subgraph HT["m_obj_hash_table (lockfree hashmap, 6000 buckets)"]
    B850["bucket 850"]
    B1600["bucket 1600"]
    B3100["bucket 3100"]
    B3850["bucket 3850"]
    BDOT["..."]
  end

  R1["LK_RES\nkey: (100,4) class\ntotal_holders: IX\nholder -> waiter -> non2pl"]
  R2["LK_RES\nkey: (100,2) inst\ntotal_holders: X\nholder -> ..."]
  R3["LK_RES\nkey: (100,1) inst\ntotal_holders: S\nholder -> ..."]
  R4["LK_RES\nkey: (200,3) inst\ntotal_holders: S\nholder -> ..."]

  B850 --> R1
  B1600 --> R2
  B3100 --> R3
  R3 -. "hash_next\n(collision)" .-> R4
  B3850 -. "empty" .-> EMPTY["NULL"]

Figure 2-2 — Hash table after locking rows from pages 100 and 200. Each bucket points to an LK_RES chain. Most buckets have a single entry (the hash function’s goal); collisions (bucket 3100 here: OID (100,1) and (200,3) happen to collide) are resolved via the hash_next singly-linked chain. Each LK_RES carries its own res_mutex — concurrent threads touching different buckets never contend.

End-to-end: from OID to lock grant (the hash table’s role):

flowchart LR
  OID["OID (100, 2, 5)\npageid=100, slotid=2, volid=5"]
  KEY["lock_create_search_key\n-> LK_RES_KEY\n  type=INSTANCE\n  oid=(100,2,5)\n  class_oid=(50,1,5)"]
  HASH["lock_res_key_hash\n-> bucket 1600"]
  FIND["m_obj_hash_table\n.find_or_insert\n(lockfree)"]
  RES["LK_RES found/created\nres_mutex locked"]
  CHECK["check compat with\ntotal_holders_mode\ntotal_waiters_mode"]

  OID --> KEY --> HASH --> FIND --> RES --> CHECK

Figure 2-3 — From OID to compatibility check. The lockfree hashmap resolves the OID to an LK_RES without a global lock. The per-resource res_mutex is held only after the LK_RES is found, narrowing the critical section to one resource.

2.4 Step 3 — lock_initialize_object_lock_entry_list

Section titled “2.4 Step 3 — lock_initialize_object_lock_entry_list”

Sets up the lockfree freelist for LK_ENTRY allocation (Tier 2 of the three-tier strategy).

// lock_initialize_object_lock_entry_list — src/transaction/lock_manager.c
static int
lock_initialize_object_lock_entry_list (void)
{
block_count = 1;
block_size = (int) MAX ((lk_Gl.max_obj_locks * LK_ENTRY_RATIO), 1);
/* LK_ENTRY_RATIO = 0.1f */
ret = lf_freelist_init (&lk_Gl.obj_free_entry_list,
block_count, block_size,
&obj_lock_entry_desc, &obj_lock_ent_Ts);
return ret;
}

The LK_ENTRY descriptor differs from the LK_RES descriptor in one crucial way:

// obj_lock_entry_desc — src/transaction/lock_manager.c
LF_ENTRY_DESCRIPTOR obj_lock_entry_desc = {
offsetof (LK_ENTRY, stack),
offsetof (LK_ENTRY, next),
offsetof (LK_ENTRY, del_id),
0, /* no key — not in a hash table */
0, /* no mutex offset */
LF_EM_NOT_USING_MUTEX, /* ← key difference */
// ...
lock_alloc_entry, /* plain malloc */
lock_dealloc_entry, /* plain free */
// ...
};

LK_ENTRY uses LF_EM_NOT_USING_MUTEX because it does not carry its own mutex. Protection comes from the owning LK_RES.res_mutex or LK_TRAN_LOCK.hold_mutex, depending on which list the entry is being manipulated through.

2.5 Step 4 — lock_initialize_deadlock_detection

Section titled “2.5 Step 4 — lock_initialize_deadlock_detection”
// lock_initialize_deadlock_detection — src/transaction/lock_manager.c
static int
lock_initialize_deadlock_detection (void)
{
pthread_mutex_init (&lk_Gl.DL_detection_mutex, NULL);
gettimeofday (&lk_Gl.last_deadlock_run, NULL);
lk_Gl.TWFG_node =
(LK_WFG_NODE *) malloc (SIZEOF_LK_WFG_NODE * lk_Gl.num_trans);
for (i = 0; i < lk_Gl.num_trans; i++)
{
lk_Gl.TWFG_node[i].DL_victim = false;
lk_Gl.TWFG_node[i].checked_by_deadlock_detector = false;
lk_Gl.TWFG_node[i].thrd_wait_stime = 0;
}
lk_Gl.TWFG_edge = NULL; /* allocated lazily on first use */
lk_Gl.max_TWFG_edge = 0;
lk_Gl.TWFG_free_edge_idx = -1;
lk_Gl.global_edge_seq_num = 0;
return NO_ERROR;
}

Key observations:

  • WFG nodes are allocated per-transaction — one LK_WFG_NODE per tran_index. This is why step 4 depends on lk_Gl.num_trans being set by step 1.
  • WFG edges are lazily allocated. TWFG_edge starts as NULL and is allocated on the first call to lock_add_WFG_edge (when the first actual lock wait occurs). Initial size is LK_MID_TWFG_EDGE_COUNT = 1000; max is MAX_NTRANS * MAX_NTRANS.
  • DL_detection_mutex serializes deadlock detection runs. Only one deadlock scan runs at a time.

2.6 Step 5 — lock_deadlock_detect_daemon_init

Section titled “2.6 Step 5 — lock_deadlock_detect_daemon_init”
// lock_deadlock_detect_daemon_init — src/transaction/lock_manager.c
void
lock_deadlock_detect_daemon_init ()
{
cubthread::looper looper =
cubthread::looper (std::chrono::milliseconds (100));
cubthread::entry_callable_task *daemon_task =
new cubthread::entry_callable_task (deadlock_detect_task_execute);
lock_Deadlock_detect_daemon =
cubthread::get_manager ()->create_daemon (looper, daemon_task,
"deadlock-detect");
}

The daemon loops every 100ms and performs three duties per iteration (in deadlock_detect_task_execute):

  1. Interrupt check — resume any thread whose transaction was interrupted externally.
  2. Timeout check — resume any thread whose lock-wait has exceeded its wait_msecs deadline (lock_force_timeout_expired_wait_transactions).
  3. Deadlock detection — if the interval since the last run exceeds PRM_ID_LK_RUN_DEADLOCK_INTERVAL and at least two threads are suspended, run lock_detect_local_deadlock. This avoids the cost of a full WFG scan when contention is low.

The 100ms loop period is the timeout check granularity — a lock wait with wait_msecs = 50 will be noticed within 100ms, not 50ms. The deadlock scan itself runs less frequently, gated by the configurable interval parameter.

// lock_finalize — src/transaction/lock_manager.c
void
lock_finalize (void)
{
free_and_init (lk_Gl.TWFG_node); /* WFG nodes */
lock_finalize_tran_lock_table (); /* per-tran state */
pthread_mutex_destroy (&lk_Gl.DL_detection_mutex);
lk_Gl.m_obj_hash_table.destroy (); /* lockfree hashmap */
lf_freelist_destroy (&lk_Gl.obj_free_entry_list); /* lockfree freelist */
lock_deadlock_detect_daemon_destroy (); /* stop daemon */
}

lock_finalize_tran_lock_table walks each LK_TRAN_LOCK, destroys its two mutexes, and frees every entry remaining in lk_entry_pool via plain free. Note: this does not walk the hold lists — by the time lock_finalize runs, all transactions must have already released their locks (or been forcibly aborted).

2.8 Three-tier LK_ENTRY allocation at runtime

Section titled “2.8 Three-tier LK_ENTRY allocation at runtime”

This is the runtime counterpart of the initialization in §2.2 and §2.4. Every lock acquisition needs an LK_ENTRY; every release returns one.

// lock_get_new_entry — src/transaction/lock_manager.c
static LK_ENTRY *
lock_get_new_entry (int tran_index, LF_TRAN_ENTRY * tran_entry,
LF_FREELIST * freelist)
{
LK_TRAN_LOCK *tran_lock = &lk_Gl.tran_lock_table[tran_index];
/* Tier 1: check local pool (no synchronization needed) */
if (tran_lock->lk_entry_pool)
{
lock_entry = tran_lock->lk_entry_pool;
tran_lock->lk_entry_pool = lock_entry->next;
tran_lock->lk_entry_pool_count--;
return lock_entry;
}
/* Tier 2+3: lockfree global freelist (may malloc internally) */
return (LK_ENTRY *) lf_freelist_claim (tran_entry, freelist);
}
// lock_free_entry — src/transaction/lock_manager.c
static void
lock_free_entry (int tran_index, LF_TRAN_ENTRY * tran_entry,
LF_FREELIST * freelist, LK_ENTRY * lock_entry)
{
LK_TRAN_LOCK *tran_lock = &lk_Gl.tran_lock_table[tran_index];
/* Tier 1: return to local pool if not full */
if (tran_lock->lk_entry_pool_count < LOCK_TRAN_LOCAL_POOL_MAX_SIZE)
{
lock_uninit_entry (lock_entry);
lock_entry->next = tran_lock->lk_entry_pool;
tran_lock->lk_entry_pool = lock_entry;
tran_lock->lk_entry_pool_count++;
}
else
{
/* Tier 2: retire to global freelist (epoch-based reclamation) */
lf_freelist_retire (tran_entry, freelist, lock_entry);
}
}

Why three tiers instead of just one global freelist?

TierStructureSync costWhen used
1lk_entry_pool (per-tran, max 10)Zero — only the owning transaction touches itHot path: short OLTP transactions that acquire/release ≤10 locks
2obj_free_entry_list (global lockfree)Lockfree CASlf_freelist_claim/retireWhen local pool is empty (alloc) or full (dealloc)
3malloc / freeKernel syscallWhen the global freelist’s pre-allocated blocks are exhausted

For a typical OLTP transaction that touches 3–5 rows, all lock entries come from and return to Tier 1 — the global freelist is never contended. Tier 2 absorbs the spillover from batch operations that exceed 10 concurrent locks per transaction. Tier 3 (malloc) is a cold path that fires only during initial ramp-up or extreme load.

2.9 Lockfree integration — two transaction systems

Section titled “2.9 Lockfree integration — two transaction systems”

The lock manager participates in two independent lockfree transaction systems, each with its own epoch counter:

System IDManagesUsed by
THREAD_TS_OBJ_LOCK_RESLK_RES (hash table entries)m_obj_hash_table.find_or_insert, erase_locked
THREAD_TS_OBJ_LOCK_ENTLK_ENTRY (lock entries)lock_get_new_entry, lock_free_entry

Every lock manager function that allocates or frees resources begins by obtaining the thread’s participation handle:

LF_TRAN_ENTRY *t_entry_res =
thread_get_tran_entry (thread_p, THREAD_TS_OBJ_LOCK_RES);
LF_TRAN_ENTRY *t_entry_ent =
thread_get_tran_entry (thread_p, THREAD_TS_OBJ_LOCK_ENT);

The del_id fields on LK_RES and LK_ENTRY are set when the object is retired (via lf_freelist_retire or hashmap erase). A retired object is not physically freed until every concurrent reader — tracked by the epoch counter — has advanced past the retirement epoch. This is the standard hazard-pointer / epoch-based reclamation pattern described in cubrid-lockfree-transaction.md.

What is lockfree vs. mutex-protected:

OperationMechanism
Hash table lookup/insert (LK_RES by OID)Lockfree hashmap
LK_ENTRY alloc from global poolLockfree freelist (lf_freelist_claim)
LK_ENTRY alloc from local poolNo sync (per-tran, single-thread)
Memory reclamation of retired entriesLockfree epoch-based (del_id)
Holder/waiter/non2pl list manipulationres_mutex (per-resource)
Transaction hold list manipulationhold_mutex (per-transaction)
Transaction non2pl list manipulationnon2pl_mutex (per-transaction)

The boundary is clear: finding or creating the LK_RES is lockfree; manipulating its contents (holder/waiter lists) is mutex-protected. This means concurrent lock requests on different resources never contend with each other at the hash level, and only share-nothing local-pool entries on the allocation side.

  1. Initialization order is load-bearing. Transaction table (step 1) must precede deadlock detection (step 4) because the WFG is sized by num_trans.
  2. The hash table size is MAX(10000, MAX_NTRANS * 300) — a hard-coded formula, not a tunable parameter. The hash function uses a van Emde Boas–style distribution to spread consecutive slot IDs across buckets.
  3. Each transaction gets 10 pre-allocated LK_ENTRYs at init time. These form the Tier 1 local pool that satisfies most OLTP lock requests without any synchronization.
  4. WFG edges are lazily allocated on first lock-wait, not at startup. This avoids wasting memory in read-only workloads.
  5. The deadlock daemon loops every 100ms but only runs the full WFG cycle scan when (a) the configured interval has elapsed and (b) at least two threads are suspended. Timeout checks run on every iteration.
  6. Lockfree boundary: hash lookup/insert and global freelist claim/retire are lockfree. Everything below — holder/waiter list manipulation, transaction hold list updates — is mutex-serialized per-resource or per-transaction.

This chapter traces every branch of the lock acquisition path — from the public API (lock_object, lock_scan) through the internal workhorse lock_internal_perform_lock_object. By the end you should be able to predict exactly what happens for any combination of resource state and request parameters.

3.1 Two public entry points — lock_object and lock_scan

Section titled “3.1 Two public entry points — lock_object and lock_scan”

The lock manager exposes two primary APIs. They differ in granularity intent, not mechanism — both eventually call lock_internal_perform_lock_object.

flowchart LR
  SCAN["lock_scan(class_oid, class_lock)\nClass-grain: IS or IX\non the class itself"]
  OBJ["lock_object(oid, class_oid, lock)\nAny grain: root, class, or instance"]

  SCAN --> IPLK["lock_internal_perform_lock_object\n(the real work)"]
  OBJ --> PREP["Preparation layer:\n1. Root OID? -> lock root directly\n2. Class lock? -> ensure IX/IS on root first\n3. Instance lock? -> ensure IX/IS on class first"]
  PREP --> IPLK

Figure 3-1 — Two entry points converge on the same internal function. lock_scan is a thin wrapper for class-grain locks. lock_object does multi-granularity preparation before calling the internal function.

lock_scan is the simpler of the two. It acquires a class-level lock (typically IS_LOCK for SELECT, IX_LOCK for DML) with no instance-level work:

// lock_scan — src/transaction/lock_manager.c (simplified)
int
lock_scan (THREAD_ENTRY * thread_p, const OID * class_oid,
int cond_flag, LOCK class_lock)
{
root_class_entry = lock_get_class_lock (thread_p, oid_Root_class_oid);
granted = lock_internal_perform_lock_object (
thread_p, tran_index, class_oid,
NULL, /* class_oid = NULL → this IS a class lock */
class_lock, wait_msecs,
&class_entry, root_class_entry);
return granted;
}

lock_object handles three cases based on the OID:

flowchart TD
  A["lock_object(oid, class_oid, lock)"]
  A --> B{"OID_IS_ROOTOID(oid)?"}
  B -- "yes" --> C1["Case 1: ROOT_CLASS\nlock root directly\nclass_entry = NULL"]
  B -- "no" --> D{"OID_IS_ROOTOID(class_oid)?"}
  D -- "yes" --> C2["Case 2: CLASS\n1. Ensure IX/IS on root\n   (if not already held)\n2. Lock the class itself"]
  D -- "no" --> C3["Case 3: INSTANCE\n1. Ensure IX/IS on class\n   (if not already held)\n2. Check if class lock\n   subsumes the request\n3. Lock the instance"]

  C1 --> IPLK["lock_internal_perform_lock_object"]
  C2 --> IPLK
  C3 --> IPLK

Figure 3-2 — lock_object three-case dispatch. Each case ensures the multi-granularity hierarchy is respected: an instance lock requires an intention lock on its class, a class lock requires an intention lock on the root.

The intention lock mode is derived from the requested lock:

// lock_object — src/transaction/lock_manager.c
if (lock <= S_LOCK)
new_class_lock = IS_LOCK; /* read intent */
else
new_class_lock = IX_LOCK; /* write intent */

Key optimization in Case 3: before calling lock_internal_perform_lock_object for the instance, lock_object checks lock_is_class_lock_escalated(old_class_lock, lock). If the class lock already held is strong enough to subsume the instance request (e.g., holding X on class makes any instance lock redundant), the function returns LK_GRANTED immediately.

3.2 Caller map — who calls the lock manager and why

Section titled “3.2 Caller map — who calls the lock manager and why”

The lock manager is a service layer. Before diving into the internal state machine, it helps to know who calls the APIs and in what context. This map covers both acquisition and release APIs — it is referenced again in Ch. 5.

flowchart TB
  subgraph Callers["Upper-layer callers"]
    LOC["locator_sr.c (41 calls)\nObject fetch / store / DDL"]
    BT["btree.c (13)\nIndex traversal, FK checks"]
    HEAP["heap_file.c (8)\nHeap scan, page access"]
    LOG["log_manager.c (5)\nCommit / rollback"]
    BOOT["boot_sr.c (5)\nServer startup / DDL"]
    SCAN["scan_manager.c (4)\nHeap/index scan iteration"]
    QE["query_executor.c (4)\nQuery eval, instant lock mode"]
    SER["serial.c (4)\nNEXT VALUE / CURRENT VALUE"]
    CAT["system_catalog.c (4)\nCatalog access"]
    STAT["statistics_sr.c (4)\nALTER INDEX statistics"]
  end

  subgraph LockAPI["Lock Manager API"]
    LO["lock_object\n(any grain)"]
    LS["lock_scan\n(class grain)"]
    LH["lock_hold_object_instant\n(probe, no block)"]
    LC["lock_classes_lock_hint\n(batch class locks)"]
    UO["lock_unlock_object"]
    UDN["lock_unlock_object_donot_move_to_non2pl"]
    UA["lock_unlock_all"]
  end

  LOC --> LO & UO & UDN & LC
  BT --> LO & UDN
  HEAP --> LO & LS & UO
  LOG --> UA
  BOOT --> LO & UO
  SCAN --> LO & UO & LH
  QE --> LO & UO
  SER --> LO & UO
  CAT --> LO & UDN
  STAT --> LO & UO

Figure 3-3 — Caller map. Numbers in parentheses are approximate call-site counts. locator_sr.c is the dominant caller — it mediates all client-initiated object access. log_manager.c is the exclusive caller of lock_unlock_all (commit/rollback).

Key caller patterns:

CallerLock API usedTypical modeContext
locator_sr.clock_object, lock_unlock_object, lock_classes_lock_hintSCH_M for DDL, S/X for DML, IS/IX intentObject fetch (xlocator_fetch), store (xlocator_force), class creation/drop
btree.clock_object, lock_unlock_object_donot_move_to_non2plS/X on instance OIDsIndex key lock during traversal; FK validation — uses donot_move_to_non2pl because rows failing FK check are never surfaced to client
heap_file.clock_scan, lock_object, lock_unlock_objectIS via lock_scan; S/X per rowHeap scan start (lock_scan for class IS), row-level lock during fetch
scan_manager.clock_object, lock_hold_object_instant, lock_unlock_objectS or X on rows; instant probeHeap/index scan: lock_hold_object_instant for non-blocking probe, fallback to lock_object if probe fails
query_executor.clock_start_instant_lock_mode, lock_stop_instant_lock_mode(controls mode, not direct lock)WHERE clause evaluation — instant mode brackets
serial.clock_object(X_LOCK), lock_unlock_objectX on serial instance OIDNEXT VALUE / CURRENT VALUE — the primary exerciser of the non2pl path under RC
log_manager.clock_unlock_all(releases everything)log_commit / log_abort — the only caller of bulk release

Why scan_manager.c uses lock_hold_object_instant: During a heap or index scan, the scan manager first tries a non-blocking instant probe. If the probe fails (lock conflict), it falls back to the full lock_object path with waiting. This optimistic-first pattern avoids suspending the thread for rows that are likely uncontended.

3.3 lock_internal_perform_lock_object — the complete state machine

Section titled “3.3 lock_internal_perform_lock_object — the complete state machine”

This is the core function (~700 lines). It handles four top-level scenarios:

flowchart TD
  START["lock_internal_perform_lock_object\n(oid, class_oid, lock, wait_msecs)"]

  START --> DISPATCH{"instance lock?\n(class_oid != NULL\n&& !root)"}
  DISPATCH -- "yes (instance)" --> ESC["lock_escalate_if_needed\n(may promote to class lock)"]
  ESC --> SUBSUME{"class lock\nsubsumes request?"}
  SUBSUME -- "yes" --> GRANT_FAST["return LK_GRANTED\n(no instance lock needed)"]
  SUBSUME -- "no" --> HASH

  DISPATCH -- "no (class/root)" --> FAST{"lock_find_class_entry\n(already held?)"}
  FAST -- "found" --> CONV["goto lock_tran_lk_entry\n(conversion path)"]
  FAST -- "not found" --> HASH

  HASH["find_or_insert in hash table\n(lockfree, res_mutex locked on return)"]

  HASH --> EMPTY{"res empty?\n(no holder/waiter/non2pl)"}
  EMPTY -- "yes" --> PATH_A["PATH A: Fresh resource\nGrant unconditionally"]
  EMPTY -- "no" --> SCAN_HOLDER{"scan holder list\nfor my tran_index"}
  SCAN_HOLDER -- "not found" --> NEW_REQ["I am NOT a holder"]
  SCAN_HOLDER -- "found" --> CONV

  NEW_REQ --> COMPAT{"compat with\nholders AND waiters?"}
  COMPAT -- "yes" --> PATH_B["PATH B: Compatible\nGrant immediately"]
  COMPAT -- "no" --> ZERO{"wait_msecs ==\nZERO_WAIT?"}
  ZERO -- "yes" --> PATH_C["PATH C: Timeout\nReturn immediately"]
  ZERO -- "no" --> PATH_D["PATH D: Enqueue\nas waiter, suspend"]

  CONV --> CONV_CHECK{"new_mode ==\ngranted_mode?"}
  CONV_CHECK -- "yes" --> PATH_E["PATH E: Re-entrance\nJust bump count"]
  CONV_CHECK -- "no" --> CONV_COMPAT{"compat with\nOTHER holders?"}
  CONV_COMPAT -- "yes" --> PATH_F["PATH F: Conversion granted\nUpgrade in place"]
  CONV_COMPAT -- "no" --> CONV_ZERO{"wait_msecs ==\nZERO_WAIT?"}
  CONV_ZERO -- "yes" --> PATH_C
  CONV_ZERO -- "no" --> PATH_G["PATH G: Conversion blocked\nSet blocked_mode, reposition, suspend"]

Figure 3-4 — Complete state machine of lock_internal_perform_lock_object. Seven paths (A–G) cover every combination of resource state, holder status, compatibility, and wait policy.

Entry conditions for each path:

PathAm I already a holder?Resource stateCompatibilityWait policyResult
ANoEmpty (no holder, waiter, non2pl)N/AAnyGrant unconditionally
BNoNon-emptylock_compat(req, total_holders) == YES AND lock_compat(req, total_waiters) == YESAnyGrant immediately
CNoNon-emptyFails holder or waiter compatZERO_WAIT or FORCE_ZERO_WAITReturn timeout
DNoNon-emptyFails holder or waiter compatwait_msecs > 0 or INFINITE_WAITEnqueue as waiter, suspend
EYeslock_conv(req, granted) == granted (no upgrade needed)AnyBump count, return
FYeslock_conv(req, granted) != granted AND lock_compat(new_mode, group_mode_of_others) == YESAnyUpgrade in place
GYeslock_conv(req, granted) != granted AND lock_compat(new_mode, group_mode_of_others) == NOwait_msecs > 0 or INFINITE_WAITSet blocked_mode, reposition (UPR), suspend

Notes:

  • PATH C also applies to holders whose conversion is blocked with zero-wait (the same timeout logic fires from the conversion branch).
  • PATH E does not acquire res_mutex when reached via the lock_find_class_entry fast path. That fast path does acquire hold_mutex (per-transaction) to walk the class hold list, but avoids the more expensive per-resource res_mutex.
  • PATH F and G both compute group_mode by walking other holders (excluding self), not using the cached total_holders_mode.

3.4 PATH A — fresh resource, unconditional grant

Section titled “3.4 PATH A — fresh resource, unconditional grant”

The fastest path. The LK_RES was just created by find_or_insert (no prior holders, waiters, or non2pl entries).

// PATH A — src/transaction/lock_manager.c (annotated)
if (res_ptr->holder == NULL && res_ptr->waiter == NULL
&& res_ptr->non2pl == NULL)
{
lock_initialize_resource_as_allocated (res_ptr, NULL_LOCK);
entry_ptr = lock_get_new_entry (tran_index, t_entry_ent,
&lk_Gl.obj_free_entry_list);
lock_initialize_entry_as_granted (entry_ptr, tran_index,
res_ptr, lock);
/* instant lock tracking */
if (is_instant_duration)
entry_ptr->instant_lock_count++;
res_ptr->holder = entry_ptr; /* sole holder */
entry_ptr->class_entry = class_entry; /* link to parent */
lock_increment_class_granules (class_entry);
lock_insert_into_tran_hold_list (entry_ptr, tran_index);
res_ptr->total_holders_mode = lock;
pthread_mutex_unlock (&res_ptr->res_mutex);
*entry_addr_ptr = entry_ptr;
return LK_GRANTED;
}

What happens step by step:

  1. Allocate an LK_ENTRY from the three-tier pool (Ch. 2 §2.8).
  2. Initialize as granted (granted_mode = lock, blocked_mode = NULL_LOCK).
  3. Splice into res_ptr->holder (sole entry — the list was empty).
  4. Set class_entry back-pointer and increment ngranules on the parent.
  5. Splice into the transaction’s hold list via lock_insert_into_tran_hold_list.
  6. Set total_holders_mode = lock (no LUB needed — we are the only holder).
  7. Release res_mutex and return.

3.5 PATH B — existing resource, compatible, grant immediately

Section titled “3.5 PATH B — existing resource, compatible, grant immediately”

The resource already has holders and/or waiters, but the new request is compatible with both total_holders_mode and total_waiters_mode.

// PATH B — src/transaction/lock_manager.c (annotated)
compat1 = lock_compat (lock, res_ptr->total_waiters_mode);
compat2 = lock_compat (lock, res_ptr->total_holders_mode);
if (compat1 == LOCK_COMPAT_YES && compat2 == LOCK_COMPAT_YES)
{
entry_ptr = lock_get_new_entry (...);
lock_initialize_entry_as_granted (entry_ptr, tran_index,
res_ptr, lock);
/* position in holder list per UPR */
lock_position_holder_entry (res_ptr, entry_ptr);
/* update aggregate mode */
res_ptr->total_holders_mode =
lock_conv (lock, res_ptr->total_holders_mode);
lock_insert_into_tran_hold_list (entry_ptr, tran_index);
lock_update_non2pl_list (thread_p, res_ptr, tran_index, lock);
pthread_mutex_unlock (&res_ptr->res_mutex);
return LK_GRANTED;
}

Key invariant: dual compatibility check. The new request must pass both total_holders_mode and total_waiters_mode. Checking only holders would allow a stream of S requests to leapfrog a waiting X request — the starvation guard from Ch. 1 §1.3.

lock_update_non2pl_list is called on every successful grant. It walks the resource’s non2pl list and marks any other-transaction non2pl entries as INCON_NON_TWO_PHASE_LOCK if the newly granted lock is incompatible with the non2pl entry’s recorded mode. This is the conflict-detection trigger described in Ch. 5 notes.

3.6 PATH C — incompatible, zero-wait timeout

Section titled “3.6 PATH C — incompatible, zero-wait timeout”

When the caller specifies LK_ZERO_WAIT or LK_FORCE_ZERO_WAIT (conditional lock request), the function does not block:

// PATH C — src/transaction/lock_manager.c (annotated)
if (wait_msecs == LK_ZERO_WAIT || wait_msecs == LK_FORCE_ZERO_WAIT)
{
pthread_mutex_unlock (&res_ptr->res_mutex);
if (wait_msecs == LK_ZERO_WAIT)
{
/* create a temporary blocked entry just for the error message */
p = lock_get_new_entry (...);
lock_initialize_entry_as_blocked (p, thread_p, ...);
lock_set_error_for_timeout (thread_p, p); /* ← sets ER_LK_... */
lock_free_entry (..., p);
}
return LK_NOTGRANTED_DUE_TIMEOUT;
}

The difference between the two zero-wait modes:

  • LK_ZERO_WAIT (value 0): timeout with error message. A temporary LK_ENTRY is created just to populate the error diagnostics (xasl_id, bind_index_in_tran via lock_set_error_for_timeout), then immediately freed.
  • LK_FORCE_ZERO_WAIT (value -2): silent timeout, no error set. Used by conditional lock probes that expect failure as a normal outcome.

3.7 PATH D — incompatible, enqueue as waiter

Section titled “3.7 PATH D — incompatible, enqueue as waiter”

The standard blocking path. The new request cannot be granted, and the caller is willing to wait.

// PATH D — src/transaction/lock_manager.c (annotated)
/* allocate and initialize as blocked */
entry_ptr = lock_get_new_entry (...);
lock_initialize_entry_as_blocked (entry_ptr, thread_p, tran_index,
res_ptr, lock);
/* append at the END of the waiter list (FIFO) */
prev = NULL;
for (i = res_ptr->waiter; i != NULL; i = i->next)
prev = i;
if (prev == NULL)
res_ptr->waiter = entry_ptr;
else
prev->next = entry_ptr;
/* update aggregate waiter mode */
res_ptr->total_waiters_mode =
lock_conv (lock, res_ptr->total_waiters_mode);
goto blocked; /* → suspend thread */

At the blocked: label, the thread is suspended:

blocked:
/* release res_mutex, then suspend */
pthread_mutex_unlock (&res_ptr->res_mutex);
ret_val = lock_suspend (thread_p, entry_ptr, wait_msecs);

On resume, the return value of lock_suspend determines the outcome:

ret_valMeaningAction
LOCK_RESUMEDLock granted by a releaserContinue to lock_conversion_treatement
LOCK_RESUMED_TIMEOUTWait timed outlock_internal_perform_unlock_object removes the waiter, return LK_NOTGRANTED_DUE_TIMEOUT
LOCK_RESUMED_ABORTEDDeadlock victimSame cleanup, return LK_NOTGRANTED_DUE_ABORTED
LOCK_RESUMED_INTERRUPTServer shutdownReturn LK_NOTGRANTED_DUE_ERROR

Key detail: who grants the waiter? The suspended thread does not re-check compatibility on resume. Instead, when a holder releases its lock (lock_internal_perform_unlock_object, Ch. 5), the releaser calls lock_grant_blocked_holder and lock_grant_blocked_waiter, which walk the waiter list, check compatibility against the updated total_holders_mode, and move grantable waiters to the holder list before waking them. The resumed thread finds its entry already in the holder list with granted_mode set.

3.8 PATH E — re-entrance (same tran, same or weaker mode)

Section titled “3.8 PATH E — re-entrance (same tran, same or weaker mode)”

When a transaction requests a lock it already holds, and the conversion result (lock_conv(requested, granted)) equals the currently granted mode, the request is a no-op upgrade — just bump the count:

// PATH E — src/transaction/lock_manager.c
lock_tran_lk_entry:
new_mode = lock_conv (lock, entry_ptr->granted_mode);
if (new_mode == entry_ptr->granted_mode)
{
entry_ptr->count += 1;
/* instant lock tracking ... */
return LK_GRANTED; /* no res_mutex needed */
}

This path does not need res_mutex when reached via the lock_find_class_entry fast path for class locks. lock_find_class_entry acquires hold_mutex (the per-transaction mutex) to safely walk the class_hold_list, but never touches res_mutex. The count increment itself is safe because only one thread per transaction can be in the lock manager at a time.

3.9 PATH F — conversion granted immediately

Section titled “3.9 PATH F — conversion granted immediately”

The transaction already holds a lock but requests a stronger mode. The conversion result (new_mode) differs from granted_mode, but is compatible with all other holders:

// PATH F — src/transaction/lock_manager.c (annotated)
/* compute aggregate of OTHER holders (excluding myself) */
group_mode = NULL_LOCK;
for (i = res_ptr->holder; i != NULL; i = i->next)
{
if (i != entry_ptr)
group_mode = lock_conv (i->granted_mode, group_mode);
}
compat1 = lock_compat (new_mode, group_mode);
if (compat1 == LOCK_COMPAT_YES)
{
entry_ptr->granted_mode = new_mode; /* upgrade in place */
entry_ptr->count += 1;
res_ptr->total_holders_mode =
lock_conv (lock, res_ptr->total_holders_mode);
lock_update_non2pl_list (...);
pthread_mutex_unlock (&res_ptr->res_mutex);
goto lock_conversion_treatement;
}

Key difference from PATH B: the compatibility check is against group_mode (other holders only), not total_holders_mode (which includes this transaction’s own mode). A transaction cannot conflict with itself.

lock_conversion_treatement [sic — typo in source] handles post-grant cleanup for class-lock conversions. When a class lock is upgraded (e.g., IS→X), instance locks that are now subsumed are removed:

Old class modeNew class modeAction
ISS, SIX, or XRemove all S instance locks
IXSIXRemove all S instance locks
IXXRemove all X instance locks
SIXXRemove all X instance locks

3.10 PATH G — conversion blocked (Upgrader Positioning Rule)

Section titled “3.10 PATH G — conversion blocked (Upgrader Positioning Rule)”

The conversion is incompatible with other holders, and the transaction must wait. This is the most complex path because the entry stays in the holder list (it still holds the old mode) but is repositioned to signal the pending upgrade:

// PATH G — src/transaction/lock_manager.c (annotated)
entry_ptr->blocked_mode = new_mode; /* ← both fields now non-NULL */
entry_ptr->count += 1;
entry_ptr->thrd_entry = thread_p;
res_ptr->total_holders_mode =
lock_conv (lock, res_ptr->total_holders_mode);
/* remove from current position in holder list */
/* ... unlink entry_ptr ... */
/* reposition per Upgrader Positioning Rule (UPR) */
lock_position_holder_entry (res_ptr, entry_ptr);
goto blocked; /* → suspend thread */

Key invariant: granted_mode != NULL_LOCK AND blocked_mode != NULL_LOCK. This is the conversion-wait state from Ch. 1 §1.4. The entry stays in the holder list (other transactions still see the old granted_mode in compatibility checks) but is flagged as pending upgrade.

Upgrader Positioning Rule (UPR): lock_position_holder_entry places the entry based on compatibility between its blocked_mode and other holders’ modes. The goal is stronger than reducing unnecessary waits: the grant pass (lock_grant_blocked_holder) scans only the front of the holder list and stops at the first non-upgrading holder, so an upgrader that ends up behind a plain holder is invisible to every future release — position is a liveness requirement, not a tuning knob (see §3.11). The three candidate positions (ta, tb, tc) in the holder list are evaluated in priority order, and the entry is inserted at the best available position.

3.11 Upgrader Positioning Rule (UPR) — deep dive into lock_position_holder_entry

Section titled “3.11 Upgrader Positioning Rule (UPR) — deep dive into lock_position_holder_entry”

lock_position_holder_entry decides where in the holder list a new or repositioned entry is inserted. The holder list is not a flat FIFO — it is ordered so that conversion-waiting holders (upgraders) sit before non-upgrading holders. This ordering directly affects which waiters can be granted next when a holder releases its lock.

Why position is correctness — the buried upgrader. lock_grant_blocked_holder walks the holder list front-to-back and stops at the first entry whose blocked_mode == NULL_LOCK (while (holder != NULL && holder->blocked_mode != NULL_LOCK)). That early stop is what keeps the per-release scan cheap — but it is sound only if every upgrader sits in the front zone. With naive append (no UPR), a class-grain timeline reproduces the failure; each entry reads granted/blocked ( = NULL_LOCK):

teventholder list (front → back)
1T5 IS, T3 IX, T2 IX — all compatibleT5 IS/–T3 IX/–T2 IX/–
2T3 upgrades to SIX; T2’s IX conflicts → only blocked_mode is written, suspended in placeT5 IS/–T3 IX/SIXT2 IX/–
3T2 commits — the grant pass starts at T5: blocked = – ⇒ exits, nobody examinedT5 IS/–T3 IX/SIX
4SIX ∧ IS is compatible — T3 is grantable, yet sleeps until lock timeout or deadlock victimunchanged — forever

Note that the example needs intention modes: a plain S → X version cannot reproduce the bug — a plain S holder in front also blocks the X target, and once it releases (leaving the list) the upgrader reaches the front anyway. The liveness failure requires a remaining plain holder whose granted mode is compatible with the upgrade target — e.g. IS standing in front of an IX → SIX upgrader.

Figure 3-2 — The buried upgrader: naive append vs UPR

Figure 3-2 — The same timeline drawn. Panels ①–③ are the counterfactual (naive append): the scan stops at plain T5 and T3’s grantable upgrade sleeps forever. The green strip is the real code: lock_position_holder_entry moves T3 to the front before it sleeps, so the very next release grants it.

Two cases based on blocked_mode:

Case 1: blocked_mode == NULL_LOCK (normal grant, not upgrading)

The entry is a plain holder with no pending upgrade. It is inserted after all upgraders but before other plain holders:

// Case 1 — src/transaction/lock_manager.c
if (entry_ptr->blocked_mode == NULL_LOCK)
{
prev = NULL;
i = res_ptr->holder;
while (i != NULL)
{
if (i->blocked_mode == NULL_LOCK)
break; /* ← stop at the first non-upgrader */
prev = i;
i = i->next;
}
/* insert after prev (i.e., after all upgraders) */
}

The resulting holder list order is: [upgraders...] [plain holders...].

Case 2: blocked_mode != NULL_LOCK (conversion wait — upgrader)

The entry is an upgrader. Three candidate insertion points are evaluated in priority order:

// Case 2 — src/transaction/lock_manager.c (annotated)
ta = tb = tc = NULL; /* three candidates */
for each holder i in the list:
if (i is also an upgrader) /* i->blocked_mode != NULL_LOCK */
{
/* ta: first upgrader whose blocked_mode is COMPATIBLE
with my blocked_mode */
compat1 = lock_compat (me->blocked_mode, i->blocked_mode);
if (ta == NULL && compat1 == YES)
ta = i;
/* tb: first upgrader where my blocked_mode is compatible
with its granted_mode, BUT its blocked_mode is
INCOMPATIBLE with my granted_mode */
compat1 = lock_compat (me->blocked_mode, i->granted_mode);
compat2 = lock_compat (i->blocked_mode, me->granted_mode);
if (ta == NULL && tb == NULL
&& compat1 == YES && compat2 == NO)
tb = i;
}
else /* i is a plain holder */
{
/* tc: first plain (non-upgrading) holder */
if (tc == NULL)
tc = i;
}
/* choose insertion point by priority: ta > tb > tc */
if (ta != NULL) prev = tap; /* before ta */
else if (tb != NULL) prev = tbp; /* before tb */
else if (tc != NULL) prev = tcp; /* before tc */
/* else: append at the end (prev from the loop) */

The three candidates explained:

CandidateConditionMeaningWhy insert before it
taMy blocked_mode is compatible with i’s blocked_modeI and this upgrader want compatible upgradesWe can both proceed — placing me first lets the releaser grant me sooner without blocking the other
tbMy blocked_mode is compat with i’s granted_mode, BUT i’s blocked_mode is incompat with my granted_modeI don’t conflict with what it has, but it conflicts with what I haveIf I go after it, it would block on my granted_mode and I’d block on its blocked_mode — deadlock. Putting me first avoids this
tcFirst plain (non-upgrading) holderBoundary between upgraders and plain holdersAll upgraders should be ahead of non-upgraders

Priority ta > tb > tc: ta is the best position (compatible upgraders grouped together). tb is the fallback (avoid deadlock between upgraders). tc is the last resort (at least stay in the upgrader zone).

The tb case, worked. Say the entry being positioned is new = IX→SIX and the existing upgrader is i = IS→X: SIX ∧ IS = ✓ (i’s held mode does not block new) but X ∧ IX = ✗ (new’s held mode blocks i for as long as new holds). The grant pass grants a compatible prefix of the upgrader zone and stops at the first failure, so the order decides everything:

Order in the upgrader zoneWhat the grant pass doesOutcome
[ new, i ] — tb’s picknew: SIX ∧ IS = ✓ → granted; i gets its turn at new’s commitprogress
[ i, new ]i: X ∧ IX = ✗ → stop — new never examined, though grantableboth sleep — wedged until timeout

A one-directionally conflicting pair has exactly one viable order: whoever can be granted while the other still holds goes first.

Worked example:

Holder list before UPR:
[T3: granted=S, blocked=X] → [T5: granted=IS, blocked=NULL] → [T8: granted=S, blocked=NULL]
^^ upgrader ^^ plain holder ^^ plain holder
New entry: T7 wants to upgrade (granted=IX, blocked=SIX)

Evaluation:

  • T3 (upgrader): lock_compat(SIX, X) → NO. Not ta. lock_compat(SIX, S) → NO. Not tb either.
  • T5 (plain): tc = T5 (first plain holder).

Result: no ta, no tb, tc = T5. Insert before T5:

After UPR:
[T3: granted=S, blocked=X] → [T7: granted=IX, blocked=SIX] → [T5: granted=IS, blocked=NULL] → [T8: granted=S, blocked=NULL]
^^ upgrader ^^ upgrader (new) ^^ plain holder ^^ plain holder

Holder list layout invariant maintained by UPR:

flowchart LR
  subgraph HL["Holder list order"]
    direction LR
    U1["upgrader 1\ngranted + blocked"]
    U2["upgrader 2\ngranted + blocked"]
    UN["..."]
    P1["plain holder 1\ngranted only"]
    P2["plain holder 2\ngranted only"]
    PN["..."]
    U1 --> U2 --> UN --> P1 --> P2 --> PN
  end

Figure 3-5 — Holder list ordering invariant. Upgraders (entries with blocked_mode != NULL_LOCK) are always positioned before plain holders. Among upgraders, the order is determined by compatibility of their blocked_modes to minimize inter-upgrader deadlocks.

Why UPR matters for correctness: When a holder releases its lock and lock_grant_blocked_holder scans the holder list to see if any upgrader can now be granted, it walks front-to-back. If upgraders with compatible goals are grouped at the front, they can be granted together in one pass. Without UPR, a newly arriving upgrader at the back might deadlock with an earlier upgrader because the grant order would be wrong.

3.12 Concrete scenario — three transactions contend for one row

Section titled “3.12 Concrete scenario — three transactions contend for one row”
flowchart LR
  subgraph T0["Initial"]
    H0["holder: T5 (S)"]
    W0["waiter: --"]
    AGG0["holders=S\nwaiters=NULL"]
  end

  subgraph T1["T7 req S (PATH B)"]
    H1["holder: T5 (S), T7 (S)"]
    W1["waiter: --"]
    AGG1["holders=S\nwaiters=NULL"]
  end

  subgraph T2["T9 req X (PATH D)"]
    H2["holder: T5 (S), T7 (S)"]
    W2["waiter: T9 (X)"]
    AGG2["holders=S\nwaiters=X"]
  end

  subgraph T3["T12 req S (PATH D!)"]
    H3["holder: T5 (S), T7 (S)"]
    W3["waiter: T9 (X), T12 (S)"]
    AGG3["holders=S\nwaiters=SIX"]
  end

  T0 --> T1 --> T2 --> T3

Figure 3-6 — Four-step contention scenario. T7’s S request is compatible (PATH B). T9’s X request fails both holder and waiter compat (PATH D). T12’s S request is compatible with total_holders_mode (S vs S = yes) but incompatible with total_waiters_mode (S vs X = no) — so T12 also blocks (PATH D), demonstrating the starvation guard: T9’s pending X prevents T12 from leapfrogging.

3.13 Diagnostic fields — xasl_id and bind_index_in_tran

Section titled “3.13 Diagnostic fields — xasl_id and bind_index_in_tran”

At the very end of a successful acquisition (all paths that return LK_GRANTED):

// lock_internal_perform_lock_object — src/transaction/lock_manager.c
end:
if (entry_ptr != NULL && ret_val == LK_GRANTED)
lock_event_set_xasl_id_to_entry (tran_index, entry_ptr);

This stamps the entry with the currently executing query’s plan ID and statement index. These are read-only diagnostic fields — they do not affect lock behavior. Their sole consumer is the event-log subsystem (event_log_sql_string, event_log_bind_values) which prints them in timeout and deadlock reports so a DBA can identify which query caused the lock wait.

3.14 Performance analysis — acquisition cost by path

Section titled “3.14 Performance analysis — acquisition cost by path”

Not all paths cost the same. The table below ranks them from cheapest to most expensive:

PathCostSynchronizationWhen it fires
E (re-entrance, class fast path)O(1)hold_mutex only — no res_mutex, no hash lookupSame tran re-requests same class with same-or-weaker mode. Most common in OLTP.
E (re-entrance, via hash)O(1) + hash lookupres_mutexSame tran re-requests same instance with same-or-weaker mode.
A (fresh resource)O(1) + hash insertLockfree find_or_insert + res_mutexFirst lock on a never-before-locked OID.
B (compatible grant)O(1) compat checkres_mutexNew holder, compatible with aggregate modes.
F (conversion granted)O(holders)res_mutexUpgrade needed, must compute group_mode excluding self.
D (enqueue + suspend)O(waiters) + thread suspendres_mutex + thread mutex + context switchIncompatible — thread goes to sleep.
G (conversion blocked)O(holders) + suspendres_mutex + UPR repositioning + context switchUpgrade incompatible with other holders.

Key performance observations:

1. The class-entry fast path (PATH E) avoids the hash table entirely.

lock_object → lock_find_class_entry (hold_mutex only)
→ found → lock_tran_lk_entry → count++ → return

This is the hottest path in OLTP: every DML statement re-requests the class intention lock (IS/IX) that was already acquired by lock_scan. The cost is one hold_mutex lock/unlock + one linked list walk of class_hold_list (typically < 10 entries) + one integer increment. No hash lookup, no res_mutex.

2. The O(1) aggregate-mode check eliminates per-holder scanning for PATH B.

Without total_holders_mode / total_waiters_mode, every new lock request would need to walk the entire holder list to check compatibility — O(holders per resource). With the cached aggregates, PATH B does two table lookups:

compat1 = lock_Comp[lock][res_ptr->total_waiters_mode]; // O(1)
compat2 = lock_Comp[lock][res_ptr->total_holders_mode]; // O(1)

The trade-off: total_holders_mode must be recomputed on every holder-list change (O(holders) in Ch. 5 §5.6). But since grant is far more frequent than release in steady-state OLTP, the net win is significant.

3. Escalation check on every instance lock request is cheap.

lock_escalate_if_neededlock_check_escalate is:

  • One hold_mutex lock/unlock
  • One integer comparison (ngranules < threshold)
  • Return false (the common case — no escalation)

Total: ~100ns. This is acceptable because the alternative — unbounded instance lock accumulation — has far worse consequences (memory pressure, O(n) hold-list walks).

4. The hash function’s slot-distribution design pays off under concurrent scans.

Consecutive OIDs on the same heap page (slotid 1, 2, 3, …) land in different hash buckets (Ch. 2 §2.3, Figure 2-2). Two threads scanning the same table in parallel hit different res_mutex instances, so they never contend at the hash level. A naive hash would cluster them in the same bucket, serializing the scans.

  1. Two public APIs, one internal function. lock_scan handles class-grain intention locks; lock_object handles all three resource types with multi-granularity preparation. Both converge on lock_internal_perform_lock_object.
  2. Seven paths (A–G) cover every combination of resource state, holder status, compatibility, and wait policy.
  3. Dual compatibility check (PATH B/D): a new request must be compatible with both total_holders_mode and total_waiters_mode. The waiter check is the starvation guard.
  4. Re-entrance (PATH E) is a count bump with no mutex needed when reached via the class-entry fast path.
  5. Conversion (PATH F/G) checks against other holders only (excluding self). Blocked conversions set both granted_mode and blocked_mode, leaving the entry in the holder list but repositioned per UPR.
  6. Waiters are granted by the releaser, not by the waking thread. The resume path does not re-check compatibility.
  7. lock_update_non2pl_list is called on successful grants where the resource already has prior state (PATH B, F) to mark conflicting non2pl entries for decache notification. PATH A (fresh resource) skips this — the non2pl list is empty.

Chapter 4: Lock Conversion and Re-entrance

Section titled “Chapter 4: Lock Conversion and Re-entrance”

Chapter 3 covered where conversion and re-entrance appear in the acquisition state machine (PATH E/F/G). This chapter zooms in on the mechanics: how the conversion table works, how re-entrance counting interacts with unlock, how blocked conversions are granted, and what cleanup happens after a class-lock upgrade.

4.1 When conversion happens — the trigger condition

Section titled “4.1 When conversion happens — the trigger condition”

Conversion occurs when a transaction requests a lock on a resource it already holds. The function detects this by scanning the holder list for a matching tran_index:

// lock_internal_perform_lock_object — src/transaction/lock_manager.c
entry_ptr = res_ptr->holder;
while (entry_ptr != NULL)
{
if (entry_ptr->tran_index == tran_index)
break; /* ← found: I am already a holder */
entry_ptr = entry_ptr->next;
}

If found, the code jumps to lock_tran_lk_entry: — the conversion path. If not found, it is a new lock request (PATH A–D).

For class locks, there is a fast path that avoids the hash table entirely: lock_find_class_entry walks the transaction’s class_hold_list (O(classes held) instead of O(holders on the resource)):

// lock_find_class_entry — src/transaction/lock_manager.c
entry_ptr = tran_lock->class_hold_list;
while (entry_ptr != NULL)
{
if (OID_EQ (&entry_ptr->res_head->key.oid, class_oid))
break;
entry_ptr = entry_ptr->tran_next;
}

This fast path bypasses find_or_insert and res_mutex for the re-entrance case (PATH E). It does acquire hold_mutex (the per-transaction mutex) to walk the list safely, but avoids the heavier per-resource res_mutex. This is the most common conversion scenario.

4.2 The conversion table — lock_Conv[requested][current]

Section titled “4.2 The conversion table — lock_Conv[requested][current]”

The conversion result is the least upper bound (LUB) of the requested mode and the currently held mode:

new_mode = lock_conv (lock, entry_ptr->granted_mode);
// equivalent to: new_mode = lock_Conv[lock][entry_ptr->granted_mode];

The full 12×12 table is in lock_table.c. The practically important subset:

requested ↓ \ current →ISSIXSIXUX
ISISSIXSIXX
SSSSIXSIXUX
IXIXSIXIXSIXX
SIXSIXSIXSIXSIXX
UUUX
XXXXXXX

Reading the table: lock_Conv[S][IX] = SIX means “if I hold IX and request S, the result is SIX.”

NA_LOCK entries (marked ) mean the combination is undefined and triggers an assert failure. These are combinations that should never arise in practice (e.g., U_LOCK is only requested when already holding S_LOCK).

Re-entrance is the most common conversion: the transaction requests a mode that is equal to or weaker than what it already holds. The conversion result equals the current mode, so no actual upgrade is needed.

// PATH E — lock_tran_lk_entry in lock_internal_perform_lock_object
new_mode = lock_conv (lock, entry_ptr->granted_mode);
if (new_mode == entry_ptr->granted_mode)
{
entry_ptr->count += 1; /* ← just bump the counter */
// ... instant lock tracking ...
return LK_GRANTED;
}

The count field is critical for correct unlock behavior. Each lock_object call increments count; each lock_unlock_object call decrements it. The lock is only actually released when count reaches zero. This prevents a nested caller from accidentally releasing a lock that an outer caller still needs.

Example — nested re-entrance:

lock_object(oid=R1, mode=S) → count=1, granted_mode=S
lock_object(oid=R1, mode=S) → count=2, granted_mode=S (PATH E)
lock_unlock_object(R1, S) → count=1, lock STAYS
lock_unlock_object(R1, S) → count=0, lock RELEASED

4.4 Immediate conversion — PATH F in detail

Section titled “4.4 Immediate conversion — PATH F in detail”

When new_mode != granted_mode (a real upgrade is needed), the function must check compatibility with other holders only (excluding self):

// PATH F — lock_internal_perform_lock_object (annotated)
group_mode = NULL_LOCK;
for (i = res_ptr->holder; i != NULL; i = i->next)
{
if (i != entry_ptr) /* ← skip myself */
group_mode = lock_conv (i->granted_mode, group_mode);
}
compat1 = lock_compat (new_mode, group_mode);
if (compat1 == LOCK_COMPAT_YES)
{
/* upgrade in place */
entry_ptr->granted_mode = new_mode;
entry_ptr->count += 1;
res_ptr->total_holders_mode =
lock_conv (lock, res_ptr->total_holders_mode);
lock_update_non2pl_list (...);
goto lock_conversion_treatement;
}

Why exclude self? A transaction cannot conflict with its own lock. If T holds S and requests X, the total_holders_mode might be SIX (because another transaction holds IX). But group_mode (other holders only) might be IX, and lock_compat(X, IX) == NO — so the conversion blocks. Without the self-exclusion, the compatibility check would use total_holders_mode = SIX and incorrectly conclude the conversion is impossible even when the only S holder is the requesting transaction itself.

Concrete example:

Resource R: holder list = [T5(IX), T7(S)]
T7 requests X_LOCK on R.
new_mode = lock_conv(X, S) = X
group_mode = T5's granted_mode = IX
lock_compat(X, IX) = NO → PATH G (blocked)
But if R only had: [T7(S)]
group_mode = NULL_LOCK
lock_compat(X, NULL) = YES → PATH F (granted, upgrade S→X)

4.5 Blocked conversion — PATH G in detail

Section titled “4.5 Blocked conversion — PATH G in detail”

When the upgrade is incompatible with other holders, the entry enters the dual-field state — the defining characteristic of a blocked conversion:

// PATH G — lock_internal_perform_lock_object (annotated)
entry_ptr->blocked_mode = new_mode; /* upgrade target */
entry_ptr->count += 1;
entry_ptr->thrd_entry = thread_p;
/* update total_holders_mode to reflect the REQUESTED mode */
res_ptr->total_holders_mode =
lock_conv (lock, res_ptr->total_holders_mode);
/* remove from current position, reposition per UPR */
// ... unlink entry_ptr from holder list ...
lock_position_holder_entry (res_ptr, entry_ptr);
goto blocked; /* → lock_suspend */

Key invariant during blocked conversion:

  • granted_mode = old mode (still effective — other transactions see this in compatibility checks)
  • blocked_mode = new mode (the upgrade being waited for)
  • Entry stays in the holder list (not the waiter list)
  • Thread is suspended

total_holders_mode includes the requested mode even though it is not yet granted. This is conservative: it ensures that new requests see the strongest possible contention picture. The trade-off is that some requests that could theoretically coexist with the old mode will be blocked. This is intentional — it prevents starvation of the upgrader.

4.6 How blocked conversions are granted — lock_grant_blocked_holder

Section titled “4.6 How blocked conversions are granted — lock_grant_blocked_holder”

When a holder releases its lock (Ch. 5), the release path calls lock_grant_blocked_holder. This function scans the holder list front-to-back (which is why UPR positions upgraders at the front):

// lock_grant_blocked_holder — src/transaction/lock_manager.c (annotated)
static void
lock_grant_blocked_holder (THREAD_ENTRY * thread_p, LK_RES * res_ptr)
{
holder = res_ptr->holder;
while (holder != NULL && holder->blocked_mode != NULL_LOCK)
{
/* compute aggregate of all holders AFTER this one */
mode = NULL_LOCK;
for (h = holder->next; h != NULL; h = h->next)
mode = lock_conv (h->granted_mode, mode);
compat = lock_compat (holder->blocked_mode, mode);
if (compat == LOCK_COMPAT_NO)
break; /* ← stop granting */
/* grant the conversion */
holder->granted_mode = holder->blocked_mode;
holder->blocked_mode = NULL_LOCK;
/* reposition: move from upgrader zone to plain-holder zone */
// ... reposition per UPR ...
lock_update_non2pl_list (...);
lock_resume (holder, LOCK_RESUMED); /* wake up */
holder = next;
}
}

The algorithm walks upgraders only (stops at the first entry with blocked_mode == NULL_LOCK). For each upgrader, it computes the aggregate of holders after it in the list (not before). This asymmetry is what UPR enables: compatible upgraders at the front can be granted together because their blocked_modes don’t conflict with the modes behind them.

Grant sequence:

  1. granted_mode = blocked_mode (the upgrade takes effect)
  2. blocked_mode = NULL_LOCK (no longer an upgrader)
  3. Reposition to the plain-holder zone via UPR
  4. lock_update_non2pl_list for conflict detection
  5. lock_resume(LOCK_RESUMED) wakes the suspended thread

4.7 How blocked waiters are granted — lock_grant_blocked_waiter

Section titled “4.7 How blocked waiters are granted — lock_grant_blocked_waiter”

After granting blocked holders, the release path calls lock_grant_blocked_waiter. This function scans the waiter list front-to-back:

// lock_grant_blocked_waiter — src/transaction/lock_manager.c (annotated)
static int
lock_grant_blocked_waiter (THREAD_ENTRY * thread_p, LK_RES * res_ptr)
{
waiter = res_ptr->waiter;
while (waiter != NULL)
{
compat = lock_compat (waiter->blocked_mode,
res_ptr->total_holders_mode);
if (compat == LOCK_COMPAT_NO)
break; /* ← stop granting */
/* grant the waiter */
waiter->granted_mode = waiter->blocked_mode;
waiter->blocked_mode = NULL_LOCK;
/* move from waiter list to holder list */
// ... remove from waiter list ...
lock_position_holder_entry (res_ptr, waiter);
/* update aggregates */
res_ptr->total_holders_mode =
lock_conv (waiter->granted_mode, res_ptr->total_holders_mode);
/* add to transaction hold list */
lock_insert_into_tran_hold_list (waiter, owner_tran_index);
lock_update_non2pl_list (...);
lock_resume (waiter, LOCK_RESUMED);
waiter = next;
}
/* recompute total_waiters_mode from remaining waiters */
if (change_total_waiters_mode)
{
mode = NULL_LOCK;
for (w = res_ptr->waiter; w != NULL; w = w->next)
mode = lock_conv (w->blocked_mode, mode);
res_ptr->total_waiters_mode = mode;
}
}

Key differences from lock_grant_blocked_holder:

Aspectlock_grant_blocked_holderlock_grant_blocked_waiter
ScansHolder list (upgraders)Waiter list
Compat check againstAggregate of holders after this entrytotal_holders_mode (all holders)
On grantgranted_mode = blocked_mode (in-place)Move from waiter list to holder list
Stop conditionFirst incompatible upgraderFirst incompatible waiter

Order of the two calls matters: blocked holders (upgraders) are granted first, which may change total_holders_mode. Then blocked waiters are checked against the updated total_holders_mode. This ensures upgraders have priority over new waiters — they were here first and already hold a weaker lock.

4.8 Post-conversion cleanup — lock_conversion_treatement

Section titled “4.8 Post-conversion cleanup — lock_conversion_treatement”

After a class-lock conversion is granted (PATH F, or PATH G after resume), instance locks that are now subsumed by the stronger class lock are removed:

// lock_conversion_treatement — src/transaction/lock_manager.c
if (entry_ptr->res_head->key.type == LOCK_RESOURCE_CLASS
&& lock_conversion == true)
{
new_mode = entry_ptr->granted_mode;
switch (old_mode)
{
case IS_LOCK:
if (IS_WRITE_EXCLUSIVE_LOCK (new_mode)
|| new_mode == S_LOCK || new_mode == SIX_LOCK)
lock_remove_all_inst_locks (thread_p, tran_index,
oid, S_LOCK);
break;
case IX_LOCK:
if (new_mode == SIX_LOCK)
lock_remove_all_inst_locks (..., S_LOCK);
else if (IS_WRITE_EXCLUSIVE_LOCK (new_mode))
lock_remove_all_inst_locks (..., X_LOCK);
break;
case SIX_LOCK:
/* new_mode == X_LOCK */
lock_remove_all_inst_locks (..., X_LOCK);
break;
}
}

What gets removed and why:

Old class lockNew class lockRemove instance locksReason
ISS, SIX, XAll S instance locksClass S subsumes instance S
IXSIXAll S instance locksSIX = S + IX; the S part subsumes instance S
IXXAll X instance locksClass X subsumes everything
SIXXAll X instance locksWas already S+IX; now full X

This is a memory optimization: the subsumed instance LK_ENTRYs are freed and ngranules on the class entry is decremented. Without this cleanup, a class-lock escalation would leave behind redundant instance entries that waste memory and slow down hold-list traversal.

4.9 Concrete scenario — conversion lifecycle

Section titled “4.9 Concrete scenario — conversion lifecycle”
flowchart LR
  subgraph S1["1. T7 holds S on row R"]
    H1["holder: T7(S, count=1)"]
    AGG1["holders=S"]
  end

  subgraph S2["2. T5 acquires S on R (PATH B)"]
    H2["holder: T7(S), T5(S)"]
    AGG2["holders=S"]
  end

  subgraph S3["3. T7 requests X (PATH G)"]
    H3["holder: T7(S, blocked=X), T5(S)"]
    AGG3["holders=X\n(conservative)"]
  end

  subgraph S4["4. T5 releases S"]
    H4["holder: T7(X, count=2)"]
    AGG4["holders=X"]
  end

  S1 --> S2 --> S3 --> S4

Figure 4-1 — Conversion lifecycle. Step 3: T7’s S→X conversion is blocked by T5’s S (PATH G). total_holders_mode becomes X (conservative). Step 4: T5 releases, lock_grant_blocked_holder grants T7’s upgrade: granted_mode becomes X, blocked_mode becomes NULL_LOCK, T7’s thread resumes. Note count=2 because both the original S request and the X request each incremented it.

4.10 Performance analysis — conversion cost

Section titled “4.10 Performance analysis — conversion cost”

Re-entrance (PATH E) is nearly free; real conversion is O(holders).

ScenarioCostBottleneck
Re-entrance (same or weaker mode)O(1): lock_conv table lookup + count++None — no list walk, no compat check
Conversion granted (PATH F)O(holders): must walk holder list to compute group_mode excluding selfThe self-exclusion loop. A resource with 100 holders means 100 iterations.
Conversion blocked (PATH G)O(holders) + UPR repositioning + suspendSame as PATH F plus the UPR walk and a context switch
Grant cascade (lock_grant_blocked_holder)O(upgraders × holders-after-each)On resume, each upgrader checks compat against all holders behind it. With k upgraders and n total holders, worst case is O(k × n).

Why group_mode computation is O(holders), not O(1):

For a new lock request (PATH B), the aggregate total_holders_mode suffices — O(1) check. But for conversion (PATH F/G), the aggregate includes the requesting transaction’s own mode, which must be excluded. There is no way to “subtract” one mode from an aggregate LUB in O(1) — the LUB is not invertible. So the code walks the holder list, skipping entry_ptr, and recomputes the aggregate from scratch.

This is acceptable because:

  1. Conversion is less frequent than new grants (most lock requests are fresh or re-entrant).
  2. The holder count per resource is typically small (< 10 in OLTP).
  3. The walk is under res_mutex, which is per-resource — other resources are unaffected.

Post-conversion cleanup (lock_conversion_treatement) is O(instance locks):

When a class lock is upgraded (IS→S, IX→X, etc.), lock_remove_all_inst_locks walks the transaction’s inst_hold_list and removes every instance lock of the specified mode for that class. This is O(total instance locks held by the transaction), not O(instances of that class) — the list is not partitioned by class. For a transaction holding 5000 instance locks across 3 classes, the walk examines all 5000 even if only 1000 belong to the escalated class.

  1. Conversion is triggered when a holder requests the same resource again with a different mode. The conversion table (lock_Conv) computes the LUB of the two modes.
  2. Re-entrance (PATH E) is the common case — same or weaker mode, just bump count. No res_mutex needed on the class fast path.
  3. count controls unlock depth. Each lock_object increments it; each lock_unlock_object decrements it. The lock is only released when count reaches zero.
  4. Immediate conversion (PATH F) checks against other holders only (self-exclusion). This prevents a transaction from dead-locking with itself.
  5. Blocked conversion (PATH G) sets both granted_mode and blocked_mode, stays in the holder list, and is repositioned per UPR. total_holders_mode conservatively includes the requested mode.
  6. Grant order: upgraders first, then waiters. lock_grant_blocked_holder scans front-to-back among upgraders, then lock_grant_blocked_waiter scans the waiter list against the updated total_holders_mode.
  7. Post-conversion cleanup removes instance locks subsumed by the upgraded class lock, freeing memory and reducing hold-list size.

Chapter 5: Lock Release and the NON2PL Protocol

Section titled “Chapter 5: Lock Release and the NON2PL Protocol”

This chapter traces the full unlock path — from the public API through the internal release mechanics, then into the NON2PL protocol that tracks early-released locks under READ COMMITTED.

flowchart LR
  UO["lock_unlock_object\n(oid, class_oid, lock, force)"]
  UDN["lock_unlock_object_donot_move_to_non2pl\n(oid, class_oid, lock)"]
  UA["lock_unlock_all\n(commit/rollback)"]

  INT["lock_internal_perform_unlock_object\n(entry, release_flag, move_to_non2pl)"]

  UO --> INT
  UDN --> INT
  UA --> INT

Figure 5-1 — Three unlock entry points. They differ in release_flag and move_to_non2pl parameters, which control whether the lock is fully removed and whether a non2pl shadow record is created.

Entry pointrelease_flagmove_to_non2plWhen used
lock_unlock_object(force=true)falsetrueForced mid-transaction release
lock_unlock_object(force=false)falsetrueIsolation-gated release (see §5.2)
lock_unlock_object_donot_move_to_non2plfalsefalseBtree FK checks, catalog lookups — caller knows no stale-cache risk
lock_unlock_alltruefalseCommit/rollback — unconditional release of everything

5.2 Unlock caller patterns (see Ch. 3 §3.2 for the full map)

Section titled “5.2 Unlock caller patterns (see Ch. 3 §3.2 for the full map)”

The full caller map (Figure 3-3) is in Ch. 3 §3.2. The unlock-specific patterns worth highlighting:

Unlock APIPrimary callersWhy this variant
lock_unlock_object(force=false)locator_sr.c, heap_file.c, scan_manager.cStandard isolation-gated release — the four-stage filter in §5.3 applies
lock_unlock_object(force=true)locator_sr.c (DDL rollback), serial.cForced release bypassing isolation check — caller knows the lock must go
lock_unlock_object_donot_move_to_non2plbtree.c (FK checks), system_catalog.c, locator_sr.cRows never surfaced to client — no stale-cache risk, non2pl tracking skipped
lock_unlock_alllog_manager.c onlyCommit/rollback — unconditional bulk release
lock_stop_instant_lock_modequery_executor.cReleases all instant-duration locks at end of WHERE evaluation

5.3 lock_unlock_object — the isolation policy gate

Section titled “5.3 lock_unlock_object — the isolation policy gate”

This is the most commonly called unlock function. It implements the isolation-level policy before delegating to the internal function:

// lock_unlock_object — src/transaction/lock_manager.c (annotated)
void
lock_unlock_object (THREAD_ENTRY * thread_p, const OID * oid,
const OID * class_oid, LOCK lock, bool force)
{
if (force == true)
{
entry_ptr = lock_find_tran_hold_entry (...);
lock_internal_perform_unlock_object (thread_p, entry_ptr,
false, true);
return;
}
/* force == false: apply isolation policy */
if (lock != S_LOCK)
{
/* These will not be released. */
return; /* ← X, IX, SIX, etc. stay to commit */
}
isolation = logtb_find_isolation (tran_index);
switch (isolation)
{
case TRAN_SERIALIZABLE:
case TRAN_REPEATABLE_READ:
return; /* nothing to do — S locks held to commit */
case TRAN_READ_COMMITTED:
lock_unlock_object_by_isolation (thread_p, tran_index,
isolation, class_oid, oid);
break;
}
}

Three filters that prevent release:

flowchart TD
  REQ["lock_unlock_object(force=false)"]
  F1{"lock != S_LOCK?"}
  F1 -- "yes (X, IX, ...)" --> KEEP1["KEEP: write locks\nnever released early"]
  F1 -- "no (S_LOCK)" --> F2{"isolation >= RR?"}
  F2 -- "yes (RR/SER)" --> KEEP2["KEEP: S locks\nheld to commit"]
  F2 -- "no (RC)" --> F3["lock_unlock_object_by_isolation"]
  F3 --> F3A{"root or class OID?"}
  F3A -- "yes" --> KEEP3["KEEP: class locks\nnever released early"]
  F3A -- "no" --> F3B{"mvcc_is_mvcc_disabled_class?"}
  F3B -- "yes" --> REL["RELEASE:\nlock_unlock_shared_inst_lock\n(move_to_non2pl=true)"]
  F3B -- "no (MVCC table)" --> KEEP4["KEEP: snapshot-based,\nno lock to release"]

Figure 5-2 — Four-stage filter in the non-forced unlock path. Only S locks, under RC, on MVCC-disabled instance OIDs, are actually released. Everything else is kept to commit.

Key insight: In a typical MVCC workload, the vast majority of lock_unlock_object(force=false) calls hit KEEP4 — MVCC-enabled tables use snapshots, not locks, for reads. The only locks that reach the actual release path are S locks on MVCC-disabled classes (serials, root class, collation class, HA apply-info class).

5.4 lock_internal_perform_unlock_object — the complete flow

Section titled “5.4 lock_internal_perform_unlock_object — the complete flow”

This is the internal workhorse. It handles three cases depending on where the entry is found:

flowchart TD
  START["lock_internal_perform_unlock_object\n(entry_ptr, release_flag, move_to_non2pl)"]

  START --> DEC{"release_flag == false?"}
  DEC -- "yes" --> COUNT["count--\n(+ instant_lock_count-- if instant mode)"]
  COUNT --> EARLY{"blocked_mode == NULL\nAND count > 0?"}
  EARLY -- "yes" --> RET["RETURN early\n(other callers still need this lock)"]
  EARLY -- "no" --> MUTEX
  DEC -- "no (release_flag=true)" --> MUTEX

  MUTEX["pthread_mutex_lock(res_mutex)"]
  MUTEX --> FIND_H{"find entry in\nholder list?"}
  FIND_H -- "yes" --> CASE_H["CASE H: holder release"]
  FIND_H -- "no" --> FIND_W{"find entry in\nwaiter list?"}
  FIND_W -- "yes" --> CASE_W["CASE W: waiter removal\n(timeout/deadlock victim)"]
  FIND_W -- "no" --> ERR["ERROR: lost transaction"]

  CASE_H --> H_CHECK{"release_flag==false\nAND count > 0?"}
  H_CHECK -- "yes" --> H_BLOCKED["Blocked holder cleanup:\nblocked_mode = NULL\nreposition per UPR"]
  H_CHECK -- "no" --> H_RELEASE["Full release:\n1. Remove from tran hold list\n2. Decrement class granules\n3. If move_to_non2pl: lock_add_non2pl_lock\n4. Free entry"]

  H_RELEASE --> RECOMP["Recompute total_holders_mode"]
  H_BLOCKED --> RECOMP
  RECOMP --> EMPTY{"holder==NULL\nAND waiter==NULL?"}
  EMPTY -- "yes, non2pl==NULL" --> REMOVE["lock_remove_resource\n(erase from hash table)"]
  EMPTY -- "yes, non2pl!=NULL" --> UNLOCK["unlock res_mutex\n(resource kept for non2pl)"]
  EMPTY -- "no" --> GRANT["lock_grant_blocked_holder\nlock_grant_blocked_waiter\nunlock res_mutex"]

Figure 5-3 — Complete flow of lock_internal_perform_unlock_object. The early return (count > 0) is the re-entrance guard from Ch. 4 §4.3. CASE H is the normal holder release. CASE W handles waiter removal after timeout or deadlock.

5.5 The release_flag and count interaction

Section titled “5.5 The release_flag and count interaction”

The two parameters release_flag and count together control whether the lock is actually released:

// lock_internal_perform_unlock_object — src/transaction/lock_manager.c
if (release_flag == false)
{
entry_ptr->count--;
if (lock_is_instant_lock_mode (tran_index))
entry_ptr->instant_lock_count--;
if (entry_ptr->blocked_mode == NULL_LOCK && entry_ptr->count > 0)
return; /* ← early return: other callers still need this lock */
}
release_flagcount after decrementblocked_modeResult
false> 0NULL_LOCKEarly return — lock stays, other re-entrant callers still active
false> 0!= NULL_LOCKContinue — this is a blocked holder being cleaned up (timeout/victim)
false0anyContinue — last reference, proceed to actual release
true(not decremented)anyContinue — unconditional release (lock_unlock_all)

5.6 CASE H — holder release (the normal path)

Section titled “5.6 CASE H — holder release (the normal path)”

When the entry is found in the holder list and the count has reached zero (or release_flag is true):

// CASE H — lock_internal_perform_unlock_object (annotated)
/* remove from holder list */
if (prev == NULL)
res_ptr->holder = curr->next;
else
prev->next = curr->next;
if (release_flag == false && curr->count > 0)
{
/* blocked holder cleanup (timeout/victim) */
curr->blocked_mode = NULL_LOCK;
lock_position_holder_entry (res_ptr, entry_ptr);
}
else
{
/* full release */
lock_delete_from_tran_hold_list (curr, tran_index);
lock_decrement_class_granules (curr->class_entry);
if (release_flag == false && move_to_non2pl == true)
lock_add_non2pl_lock (thread_p, res_ptr, tran_index,
curr->granted_mode); /* ← NON2PL creation */
lock_free_entry (tran_index, t_entry,
&lk_Gl.obj_free_entry_list, curr);
}
/* recompute total_holders_mode */
mode = NULL_LOCK;
for (i = res_ptr->holder; i != NULL; i = i->next)
{
mode = lock_conv (i->granted_mode, mode);
mode = lock_conv (i->blocked_mode, mode); /* include upgraders */
}
res_ptr->total_holders_mode = mode;

Why this walk exists — and why it is cheap. Commit never searches for its entries: lock_unlock_all walks the transaction-side hold lists and hands each LK_ENTRY pointer directly to the unlock path (the dual-view payoff, §1.4). The only walk left is finding prev above — the resource-side list is singly linked, so splicing a node out needs its predecessor. The asymmetry is a deliberate cost trade: the transaction view is doubly linked (tran_prev) because commit unlinks thousands of entries one by one and each unlink must be O(1); the resource-side holder list stays singly linked because an object rarely has more than a handful of holders — and the same res_mutex section re-walks that list anyway (the aggregate recompute below, then the grant passes).

After the holder is removed, three outcomes based on remaining state:

Remaining stateAction
No holders, no waiters, no non2pllock_remove_resource — erase the LK_RES from the hash table
No holders, no waiters, but non2pl existsKeep LK_RES alive (unlock res_mutex) — the non2pl entries still need the resource
Holders or waiters remainlock_grant_blocked_holderlock_grant_blocked_waiter (Ch. 4 §4.6–4.7) → unlock res_mutex

5.7 CASE W — waiter removal (timeout / deadlock victim)

Section titled “5.7 CASE W — waiter removal (timeout / deadlock victim)”

When the entry is not in the holder list but is found in the waiter list — this happens when a suspended thread is resumed due to timeout or deadlock detection:

// CASE W — lock_internal_perform_unlock_object (annotated)
/* remove from waiter list */
if (prev == NULL)
res_ptr->waiter = curr->next;
else
prev->next = curr->next;
lock_free_entry (..., curr);
if (from_whom != NULL)
lock_grant_blocked_waiter_partial (thread_p, res_ptr, from_whom);
else
{
/* recompute total_waiters_mode from remaining waiters */
mode = NULL_LOCK;
for (i = res_ptr->waiter; i != NULL; i = i->next)
mode = lock_conv (i->blocked_mode, mode);
res_ptr->total_waiters_mode = mode;
}

lock_grant_blocked_waiter_partial is a variant of lock_grant_blocked_waiter that only checks waiters from a given position onward — since the removed waiter was ahead of from_whom, only the waiters behind it might now be grantable.

5.8 lock_unlock_all — commit/rollback release

Section titled “5.8 lock_unlock_all — commit/rollback release”

Called at transaction end. Unconditionally releases everything:

// lock_unlock_all — src/transaction/lock_manager.c (annotated)
void
lock_unlock_all (THREAD_ENTRY * thread_p)
{
/* 1. release all instance locks */
entry_ptr = tran_lock->inst_hold_list;
while (entry_ptr != NULL)
{
lock_internal_perform_unlock_object (thread_p, entry_ptr,
true, /* ← release_flag */
false); /* ← no non2pl */
entry_ptr = tran_lock->inst_hold_list; /* re-read head */
}
/* 2. release all class locks */
entry_ptr = tran_lock->class_hold_list;
while (entry_ptr != NULL)
{
lock_internal_perform_unlock_object (thread_p, entry_ptr,
true, false);
entry_ptr = tran_lock->class_hold_list;
}
/* 3. release root class lock */
if (tran_lock->root_class_hold != NULL)
lock_internal_perform_unlock_object (thread_p,
tran_lock->root_class_hold,
true, false);
/* 4. clean up non2pl list */
while (tran_lock->non2pl_list != NULL)
{
entry_ptr = tran_lock->non2pl_list;
tran_lock->non2pl_list = entry_ptr->tran_next;
if (entry_ptr->granted_mode == INCON_NON_TWO_PHASE_LOCK)
tran_lock->num_incons_non2pl -= 1;
lock_remove_non2pl (thread_p, entry_ptr, tran_index);
}
lock_clear_deadlock_victim (tran_index);
}

Release order is deliberate: instances → classes → root. This is leaf-to-root, matching 2PL’s requirement that locks are released in reverse acquisition order. Releasing a class lock while instance locks still exist would violate the multi-granularity hierarchy.

release_flag = true means count is not decremented — the loop ignores re-entrance depth and frees the entry unconditionally. move_to_non2pl = false because at commit/rollback time there is no need to track released locks for future conflict detection.

NON2PL is a shadow-record mechanism for detecting serialization inconsistencies under READ COMMITTED. It is not a lock mode that blocks — it is a passive tracking record.

Phase 1: Creation — lock_add_non2pl_lock

When lock_internal_perform_unlock_object releases a holder with move_to_non2pl = true:

// lock_add_non2pl_lock — src/transaction/lock_manager.c (annotated)
static LK_ENTRY *
lock_add_non2pl_lock (THREAD_ENTRY * thread_p, LK_RES * res_ptr,
int tran_index, LOCK lock)
{
/* check if I already have a non2pl entry on this resource */
non2pl = res_ptr->non2pl;
while (non2pl != NULL)
{
if (non2pl->tran_index == tran_index)
break;
non2pl = non2pl->next;
}
if (non2pl != NULL)
{
/* merge: upgrade the existing non2pl entry's mode */
if (non2pl->granted_mode != INCON_NON_TWO_PHASE_LOCK)
{
compat = lock_compat (lock, non2pl->granted_mode);
if (compat == LOCK_COMPAT_NO)
{
non2pl->granted_mode = INCON_NON_TWO_PHASE_LOCK;
tran_lock->num_incons_non2pl += 1;
}
else
non2pl->granted_mode = lock_conv (lock, non2pl->granted_mode);
}
}
else
{
/* create new non2pl entry */
non2pl = lock_get_new_entry (...);
lock_initialize_entry_as_non2pl (non2pl, tran_index,
res_ptr, lock);
/* link into resource non2pl list */
non2pl->next = res_ptr->non2pl;
res_ptr->non2pl = non2pl;
/* link into transaction non2pl list */
lock_insert_into_tran_non2pl_list (non2pl, tran_index);
}
}

Key detail: the merge path. If the transaction already has a non2pl entry on this resource (e.g., it read the same serial twice in the same RC transaction, releasing S each time), the existing entry is upgraded rather than duplicated. If the new mode is incompatible with the existing non2pl mode, the entry is promoted directly to INCON_NON_TWO_PHASE_LOCK.

Phase 2: Conflict marking — lock_update_non2pl_list

Called whenever a lock is successfully granted (PATH B, F in Ch. 3). Walks the resource’s non2pl list and checks each other-transaction entry:

// lock_update_non2pl_list — src/transaction/lock_manager.c (annotated)
curr = res_ptr->non2pl;
while (curr != NULL)
{
if (curr->tran_index == tran_index)
{
/* same transaction: remove our own non2pl entry */
lock_delete_from_tran_non2pl_list (curr, tran_index);
lock_free_entry (..., curr);
}
else
{
/* different transaction: check for conflict */
if (curr->granted_mode != INCON_NON_TWO_PHASE_LOCK)
{
compat = lock_compat (lock, curr->granted_mode);
if (compat == LOCK_COMPAT_NO)
{
curr->granted_mode = INCON_NON_TWO_PHASE_LOCK;
tran_lock->num_incons_non2pl += 1;
}
}
}
curr = curr->next;
}

Two actions:

  1. Same transaction: if I acquire a new lock on a resource where I have an old non2pl entry, that entry is removed — I now hold a real lock, the shadow is superseded.
  2. Other transaction: if the newly granted lock is incompatible with the other transaction’s non2pl granted_mode, promote it to INCON_NON_TWO_PHASE_LOCK. This means: “the row you read and released has been modified by another transaction.”

Phase 3: Notification — lock_notify_isolation_incons

At the next fetch boundary, locator_sr.c calls this function:

// lock_notify_isolation_incons — src/transaction/lock_manager.c (simplified)
void
lock_notify_isolation_incons (THREAD_ENTRY * thread_p,
bool (*fun) (const OID *, const OID *, void *), void *args)
{
if (isolation == TRAN_REPEATABLE_READ
|| isolation == TRAN_SERIALIZABLE)
return; /* nothing was released under RR/SER */
curr = tran_lock->non2pl_list;
while (tran_lock->num_incons_non2pl > 0 && curr != NULL)
{
if (curr->granted_mode == INCON_NON_TWO_PHASE_LOCK)
{
ret_val = (*fun) (&curr->res_head->key.class_oid,
&curr->res_head->key.oid, args);
/* remove from non2pl list, free entry */
tran_lock->num_incons_non2pl -= 1;
}
curr = curr->tran_next;
}
}

The callback (locator_notify_decache) adds each flagged OID to the response with LC_FETCH_DECACHE_LOCK, telling the client to invalidate its cached copy.

Phase 4: Cleanup — lock_unlock_all

At commit/rollback, lock_unlock_all walks the entire non2pl list and removes every entry via lock_remove_non2pl.

5.10 NON2PL scope — why it matters almost exclusively for serials

Section titled “5.10 NON2PL scope — why it matters almost exclusively for serials”

lock_unlock_object_by_isolation is the scope gate:

// lock_unlock_object_by_isolation — src/transaction/lock_manager.c
if (isolation != TRAN_READ_COMMITTED)
return; /* RR/SER: nothing released */
if (OID_IS_ROOTOID (oid) || OID_IS_ROOTOID (class_oid))
return; /* class locks kept */
if (mvcc_is_mvcc_disabled_class (class_oid))
lock_unlock_shared_inst_lock (...); /* ← non2pl path */
else
{ /* MVCC table: "READ COMMITTED isolation uses snapshot
instead of locks. We don't have to release anything." */ }

mvcc_is_mvcc_disabled_class returns true for only four class kinds: root class, serial class, collation class, HA apply-info class. Of these, only serial produces a realistic RC workload where S locks are acquired on instance OIDs and then released mid-transaction.

Historical context: The entire non2pl/decache infrastructure (locator_notify_decache, LC_FETCH_DECACHE_LOCK, lock_notify_isolation_incons) predates MVCC. Before MVCC, all instance reads used S locks, and RC released them early — making non2pl tracking necessary for every row. After MVCC was introduced, normal tables use snapshots for visibility, and the non2pl path narrowed to MVCC-disabled classes only. The infrastructure remains intact but its active scope is essentially serial reads under RC.

5.11 Instant lock and its relationship to NON2PL

Section titled “5.11 Instant lock and its relationship to NON2PL”

Instant lock is a mode where locks acquired during WHERE-clause evaluation are automatically released when the evaluation phase ends.

// lock_start_instant_lock_mode — src/transaction/lock_manager.c
void lock_start_instant_lock_mode (int tran_index)
{
tran_lock->is_instant_duration = true;
}
// lock_stop_instant_lock_mode — src/transaction/lock_manager.c
void lock_stop_instant_lock_mode (THREAD_ENTRY * thread_p,
int tran_index, bool need_unlock)
{
/* walk instance hold list */
entry_ptr = tran_lock->inst_hold_list;
while (entry_ptr != NULL)
{
count = entry_ptr->instant_lock_count;
if (need_unlock)
while (count > 0)
{
lock_internal_perform_unlock_object (thread_p, entry_ptr,
false, true); /* ← move_to_non2pl = true */
count--;
}
entry_ptr->instant_lock_count = 0;
entry_ptr = next_ptr;
}
/* same for class hold list ... */
tran_lock->is_instant_duration = false;
}

Relationship to NON2PL: Instant lock feeds into non2pl because the release uses move_to_non2pl = true. But non2pl also exists without instant lock — any mid-transaction S lock release under RC goes through the same path.

Callers: Only query_executor.c uses instant lock mode (lock_start_instant_lock_mode / lock_stop_instant_lock_mode), specifically around condition evaluation in query execution.

Edge case — protecting non-instant locks: When is_instant_duration == true and the transaction already holds an IX or stronger lock with instant_lock_count == 0 (a non-instant lock), and the new request is incompatible with the held mode, instant mode is forcibly stopped (lock_stop_instant_lock_mode(need_unlock=false)) to prevent the non-instant lock from being accidentally released during instant mode cleanup.

5.12 Performance analysis — release cost and the grant cascade

Section titled “5.12 Performance analysis — release cost and the grant cascade”

Most unlock calls are O(1) — the count early return dominates.

In typical OLTP, a transaction acquires the same lock multiple times via re-entrance (Ch. 4 §4.3). Each lock_unlock_object just decrements count. No mutex, no list manipulation:

if (release_flag == false)
{
entry_ptr->count--;
if (entry_ptr->blocked_mode == NULL_LOCK && entry_ptr->count > 0)
return; /* ← most unlock calls exit here */
}

When count reaches zero — the real release cost:

StepCostNotes
Acquire res_mutexO(1)Per-resource, uncontended in most cases
Find entry in holder listO(holders)Linear scan of singly-linked list
Remove from holder listO(1)Pointer unlink
Remove from tran hold listO(1)Doubly-linked — O(1) unlink
lock_add_non2pl_lock (if move_to_non2pl)O(non2pl entries on this resource)Walks non2pl list to check for existing entry to merge
Free entry to poolO(1)Local pool or lockfree retire
Recompute total_holders_modeO(remaining holders)Walks entire remaining holder list to compute LUB
Grant cascadeO(upgraders + grantable waiters)See below

The total_holders_mode recomputation is the per-release O(holders) cost.

mode = NULL_LOCK;
for (i = res_ptr->holder; i != NULL; i = i->next)
{
mode = lock_conv (i->granted_mode, mode);
mode = lock_conv (i->blocked_mode, mode); /* upgraders too */
}
res_ptr->total_holders_mode = mode;

This cannot be optimized to O(1) because the LUB operation is not invertible — you cannot “subtract” a removed mode from the aggregate without recomputing from scratch. The same limitation explains why conversion (Ch. 4 §4.10) requires O(holders).

The grant cascade — release one, potentially wake many:

After recomputing total_holders_mode, the releaser calls:

  1. lock_grant_blocked_holder — O(upgraders × holders-behind-each)
  2. lock_grant_blocked_waiter — O(waiters), each checked against total_holders_mode
  3. Each granted waiter: move to holder list + lock_resume
flowchart LR
  REL["T1 releases X_LOCK"]
  RECOMP["recompute\ntotal_holders_mode\nO(remaining holders)"]
  GBH["grant_blocked_holder\n(check each upgrader)"]
  GBW["grant_blocked_waiter\n(check each waiter\nvs total_holders_mode)"]
  WAKE["resume N threads\n(N context switches)"]

  REL --> RECOMP --> GBH --> GBW --> WAKE

Figure 5-4 — Release cascade cost. A single release can trigger O(upgraders + waiters) compat checks and N thread wakeups. The worst case is a hot row with many waiters — but this is also the case where the release provides the most value.

Worst-case scenario: A hot row held by one X-lock holder with 50 S-lock waiters. When the holder releases:

  • Recompute total_holders_mode: O(0) (no remaining holders)
  • grant_blocked_holder: O(0) (no upgraders)
  • grant_blocked_waiter: 50 waiters all become grantable (S is compatible with S) → 50 entries moved to holder list, 50 lock_resume calls, 50 context switches

This is expensive but unavoidable — the 50 threads were suspended and need to be woken. The alternative (each woken thread re-checks compatibility independently) would be O(50²) total compat checks instead of O(50).

NON2PL cost on the grant path:

lock_update_non2pl_list is called on every successful grant (PATH B, F, and now on each cascaded grant). It walks the resource’s non2pl list — O(non2pl entries per resource). In MVCC workloads, this list is almost always empty (non2pl only exists for MVCC-disabled classes under RC). But for serial-heavy workloads, the non2pl list could accumulate entries across many RC transactions.

  1. Three public unlock APIs differ in release_flag and move_to_non2pl. lock_unlock_all (commit/rollback) uses release_flag=true to bypass count checks.
  2. lock_unlock_object(force=false) passes four filters before actually releasing: only S locks, only under RC, only instances, only MVCC-disabled classes. In practice, this means serials.
  3. count controls release depth. Each lock_object increments it; each lock_unlock_object decrements it. The lock is only freed when count reaches zero — the re-entrance guard.
  4. After a holder release, total_holders_mode is recomputed by walking the remaining holder list, then blocked holders and waiters are granted (Ch. 4 §4.6–4.7).
  5. Resource removal: an LK_RES is erased from the hash table only when all three lists (holder, waiter, non2pl) are empty. A non2pl-only resource stays alive.
  6. NON2PL lifecycle: create (lock_add_non2pl_lock) → mark conflict (lock_update_non2pl_list) → notify client (lock_notify_isolation_incons) → cleanup (lock_unlock_all).
  7. NON2PL scope narrowed post-MVCC. The infrastructure is pre-MVCC era; today only MVCC-disabled classes (serials in practice) exercise the path.
  8. Instant lock feeds into non2pl via move_to_non2pl = true, but the two mechanisms are independent — non2pl exists without instant lock, and instant lock cleanup protects pre-existing non-instant locks.
  9. Release order in lock_unlock_all is instances → classes → root → non2pl — leaf-to-root, matching 2PL’s reverse-order release requirement.

Lock escalation replaces many fine-grain instance locks with a single coarse-grain class lock when the instance lock count exceeds a threshold. It is a memory-pressure relief valve — without it, a bulk UPDATE touching 100K rows would hold 100K LK_ENTRY structs.

Escalation is checked at the top of every instance lock request, before the hash lookup, inside lock_internal_perform_lock_object:

// lock_internal_perform_lock_object — src/transaction/lock_manager.c
start:
if (class_oid != NULL && !OID_IS_ROOTOID (class_oid))
{
/* instance lock request */
ret_val = lock_escalate_if_needed (thread_p, class_entry,
tran_index);
// ... if escalated, class lock subsumes → return LK_GRANTED
}

This means escalation is evaluated on every instance lock acquisition — not periodically, not lazily. The lock_check_escalate function makes the threshold check fast enough to justify this.

6.2 lock_check_escalate — the threshold check

Section titled “6.2 lock_check_escalate — the threshold check”
// lock_check_escalate — src/transaction/lock_manager.c (annotated)
static bool
lock_check_escalate (THREAD_ENTRY * thread_p, LK_ENTRY * class_entry,
LK_TRAN_LOCK * tran_lock)
{
if (class_entry == NULL)
return false;
if (class_entry->granted_mode == BU_LOCK)
return false; /* ← bulk update already has class-level lock */
if (tran_lock->lock_escalation_on == true)
return false; /* ← another thread of same tran is escalating */
superclass_entry = class_entry->class_entry;
if (superclass_entry != NULL
&& !OID_IS_ROOTOID (&superclass_entry->res_head->key.oid))
{
/* class hierarchy: check superclass's ngranules */
if (superclass_entry->ngranules
< prm_get_integer_value (PRM_ID_LK_ESCALATION_AT))
return false;
}
else if (class_entry->ngranules
< prm_get_integer_value (PRM_ID_LK_ESCALATION_AT))
return false;
return true; /* ← escalation needed */
}

Five conditions that prevent escalation:

ConditionWhy
class_entry == NULLNo class context (shouldn’t happen for instance locks)
granted_mode == BU_LOCKBulk update already has class-level protection
lock_escalation_on == trueAnother thread of the same transaction is already escalating — re-entrance guard
Superclass hierarchy: superclass->ngranules < thresholdIn a class hierarchy, the superclass aggregates granule counts
Direct: class_entry->ngranules < thresholdNormal case — not enough instance locks yet

The threshold is PRM_ID_LK_ESCALATION_AT, a runtime configuration parameter (default value set in cubrid.conf). The ngranules field on class_entry is incremented by lock_increment_class_granules every time an instance lock is granted (Ch. 3 PATH A/B) and decremented by lock_decrement_class_granules on release (Ch. 5).

6.3 lock_escalate_if_needed — the escalation procedure

Section titled “6.3 lock_escalate_if_needed — the escalation procedure”

When lock_check_escalate returns true, the actual escalation begins:

flowchart LR
  CHK["lock_check_escalate\n→ true"]
  ABORT{"PRM_ID_LK_ROLLBACK\n_ON_LOCK_ESCALATION?"}
  ABORT -- "yes" --> ROLL["Abort transaction\nER_LK_ROLLBACK_ON_LOCK_ESCALATION"]
  ABORT -- "no" --> MODE["Determine escalated\nclass lock mode"]
  MODE --> TRY["lock_internal_perform_lock_object\n(class_oid, max_class_lock,\nFORCE_ZERO_WAIT)"]
  TRY --> OK{"granted?"}
  OK -- "yes" --> UNDO["Unlock one count\nof old class lock\n(maintain count balance)"]
  UNDO --> DONE["Return LK_GRANTED\n(caller checks subsumption)"]
  OK -- "no" --> FAIL["Reset lock_escalation_on\nReturn failure\n(fall through to normal path)"]

Figure 6-1 — Escalation procedure. The escalated class lock is attempted conditionally (FORCE_ZERO_WAIT). If it fails (e.g., another transaction holds a conflicting class lock), escalation is silently abandoned and the normal instance lock path proceeds.

The code:

// lock_escalate_if_needed — src/transaction/lock_manager.c (annotated)
static int
lock_escalate_if_needed (THREAD_ENTRY * thread_p,
LK_ENTRY * class_entry, int tran_index)
{
tran_lock = &lk_Gl.tran_lock_table[tran_index];
pthread_mutex_lock (&tran_lock->hold_mutex);
if (lock_check_escalate (...) == false)
{
pthread_mutex_unlock (&tran_lock->hold_mutex);
return LK_NOTGRANTED; /* ← not an error, just "no escalation" */
}
/* option: abort instead of escalating */
if (prm_get_bool_value (PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATION))
{
er_set (ER_ERROR_SEVERITY, ...,
ER_LK_ROLLBACK_ON_LOCK_ESCALATION, ...);
lock_set_tran_abort_reason (tran_index,
TRAN_ABORT_DUE_ROLLBACK_ON_ESCALATION);
pthread_mutex_unlock (&tran_lock->hold_mutex);
return LK_NOTGRANTED_DUE_ABORTED;
}
/* set re-entrance guard */
tran_lock->lock_escalation_on = true;
/* determine escalated mode */
if (class_entry->granted_mode == NULL_LOCK
|| class_entry->granted_mode == S_LOCK
|| class_entry->granted_mode == X_LOCK
|| class_entry->granted_mode == SCH_M_LOCK)
{
/* already strong enough — no instance locks possible */
tran_lock->lock_escalation_on = false;
pthread_mutex_unlock (&tran_lock->hold_mutex);
return LK_GRANTED;
}
/* IS → S, IX/SIX → X */
if (class_entry->granted_mode == IX_LOCK
|| class_entry->granted_mode == SIX_LOCK)
max_class_lock = X_LOCK;
else /* IS_LOCK */
max_class_lock = S_LOCK;
pthread_mutex_unlock (&tran_lock->hold_mutex);
/* attempt the escalated class lock — conditional (no wait) */
granted = lock_internal_perform_lock_object (
thread_p, tran_index,
&class_entry->res_head->key.oid,
NULL, /* class_oid = NULL → this is a class lock */
max_class_lock,
LK_FORCE_ZERO_WAIT, /* ← conditional: don't block */
&class_entry, NULL);
if (granted != LK_GRANTED)
{
/* escalation failed — fall through to normal instance lock */
tran_lock->lock_escalation_on = false;
return granted;
}
/* success: undo the extra count from the escalation request */
lock_internal_perform_unlock_object (thread_p, class_entry,
false, true);
tran_lock->lock_escalation_on = false;
return LK_GRANTED;
}

6.4 Escalated mode decision — simple rule

Section titled “6.4 Escalated mode decision — simple rule”

The mode decision is deliberately simple to avoid expensive instance-list scanning:

Current class lockEscalated toRationale
IS_LOCKS_LOCKWas reading instances → take class S
IX_LOCKX_LOCKWas writing instances → take class X
SIX_LOCKX_LOCKWas reading + writing → take class X
S_LOCK, X_LOCK, SCH_M_LOCK(no-op)Already class-level — no instance locks to escalate
NULL_LOCK(no-op)No class lock held — shouldn’t have instance locks

The comment in the source explains: “Because to count the shared and exclusive instance locks may cause high CPU usage, we used a simple rule.” Walking the instance hold list to compute the precise LUB of all instance modes would be O(instance locks) — the very thing escalation is trying to avoid.

6.5 Why LK_FORCE_ZERO_WAIT — conditional escalation

Section titled “6.5 Why LK_FORCE_ZERO_WAIT — conditional escalation”

The escalated class lock is attempted with LK_FORCE_ZERO_WAIT (no blocking, no error message). If it fails:

  • The escalation is silently abandoned.
  • lock_escalation_on is reset.
  • The caller (lock_internal_perform_lock_object) proceeds with the normal instance lock path.

This is a deliberate design choice: escalation is an optimization, not a requirement. If the class lock can’t be obtained without waiting (because another transaction holds a conflicting lock), it is better to continue with per-instance locks than to block the entire transaction waiting for a class lock that might cause a cascade of conflicts.

6.6 Post-escalation: what happens to instance locks

Section titled “6.6 Post-escalation: what happens to instance locks”

After escalation succeeds, the caller in lock_internal_perform_lock_object checks:

if (ret_val == LK_GRANTED
&& lock_is_class_lock_escalated (
lock_get_object_lock (class_oid, oid_Root_class_oid),
lock) == true)
{
return LK_GRANTED; /* class lock subsumes the instance request */
}

lock_is_class_lock_escalated checks whether the now-upgraded class lock mode is strong enough to cover the requested instance lock mode (e.g., class X subsumes instance S or X).

The existing instance LK_ENTRY structs are not immediately freed by escalation itself. They are freed later via the lock_conversion_treatement cleanup (Ch. 4 §4.8) when the class lock conversion is processed, and individually as lock_unlock_all walks the instance hold list at commit.

6.7 The PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATION option

Section titled “6.7 The PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATION option”

Some applications prefer to abort rather than escalate — class locks are coarser and may cause unexpected serialization with other transactions:

if (prm_get_bool_value (PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATION))
{
er_set (..., ER_LK_ROLLBACK_ON_LOCK_ESCALATION, ...);
lock_set_tran_abort_reason (tran_index,
TRAN_ABORT_DUE_ROLLBACK_ON_ESCALATION);
return LK_NOTGRANTED_DUE_ABORTED;
}

When this parameter is true, hitting the escalation threshold causes the transaction to abort with TRAN_ABORT_DUE_ROLLBACK_ON_ESCALATION. This is checked in the caller:

// lock_internal_perform_lock_object — after lock_escalate_if_needed
if (ret_val == LK_NOTGRANTED_DUE_ABORTED)
{
LOG_TDES *tdes = LOG_FIND_TDES (tran_index);
if (tdes && tdes->tran_abort_reason
== TRAN_ABORT_DUE_ROLLBACK_ON_ESCALATION)
goto end; /* propagate the abort */
}

6.8 Concrete scenario — escalation during bulk UPDATE

Section titled “6.8 Concrete scenario — escalation during bulk UPDATE”
flowchart LR
  subgraph S1["Before escalation"]
    CL1["class A: IX_LOCK\nngranules = 4999"]
    IL1["inst locks:\nrow1(X), row2(X),\n..., row4999(X)"]
  end

  subgraph S2["5000th row request triggers escalation"]
    CHK2["lock_check_escalate:\nngranules(4999) >= threshold(5000)?\nNot yet → grant row5000 normally"]
  end

  subgraph S3["5001st row request"]
    CHK3["lock_check_escalate:\nngranules(5000) >= threshold(5000)?\nYES → escalate"]
    ESC3["IX → X (FORCE_ZERO_WAIT)"]
  end

  subgraph S4["After escalation"]
    CL4["class A: X_LOCK\n(subsumes all instances)"]
    IL4["inst locks: still present\n(freed at commit or\nby conversion cleanup)"]
  end

  S1 --> S2 --> S3 --> S4

Figure 6-2 — Escalation during bulk UPDATE. When ngranules reaches the threshold (here 5000), the next instance lock request triggers escalation from IX to X. The class X lock subsumes all future instance requests, so no more instance LK_ENTRY structs are created. Existing instance entries are cleaned up later.

  1. Escalation is checked on every instance lock request — at the top of lock_internal_perform_lock_object, before the hash lookup.
  2. The threshold is PRM_ID_LK_ESCALATION_AT (configurable). The ngranules counter on the class LK_ENTRY tracks the current instance lock count.
  3. Escalated mode is determined by a simple rule: IS→S, IX/SIX→X. No instance-list scanning — the whole point is to avoid that cost.
  4. Escalation is conditional (LK_FORCE_ZERO_WAIT). If the class lock can’t be obtained without blocking, escalation is silently abandoned and the normal instance lock path proceeds.
  5. lock_escalation_on is a re-entrance guard — prevents multiple threads of the same transaction from escalating simultaneously.
  6. PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATION allows applications to abort instead of escalate, avoiding the coarse-grain serialization that class locks impose.
  7. Existing instance locks are not immediately freed — they are cleaned up by conversion treatment (Ch. 4 §4.8) and by lock_unlock_all at commit.

When a lock request cannot be granted immediately (PATH D or G in Ch. 3), the requesting thread is suspended until the lock becomes available, a timeout expires, or a deadlock is detected. This chapter traces the suspend/resume machinery and the four wakeup paths.

7.1 lock_suspend — putting a thread to sleep

Section titled “7.1 lock_suspend — putting a thread to sleep”
// lock_suspend — src/transaction/lock_manager.c (annotated)
static LOCK_WAIT_STATE
lock_suspend (THREAD_ENTRY * thread_p, LK_ENTRY * entry_ptr,
int wait_msecs)
{
/* assert: no page latches held while waiting for a lock */
assert (lock_is_safe_lock_with_page (thread_p, entry_ptr)
|| !pgbuf_has_perm_pages_fixed (thread_p));
/* register wait info into the thread entry */
entry_ptr->thrd_entry->lockwait = (void *) entry_ptr;
entry_ptr->thrd_entry->lockwait_stime =
(tv.tv_sec * 1000000LL + tv.tv_usec) / 1000LL; /* ms */
entry_ptr->thrd_entry->lockwait_msecs = wait_msecs;
entry_ptr->thrd_entry->lockwait_state = (int) LOCK_SUSPENDED;
/* register in WFG for deadlock detection */
lk_Gl.TWFG_node[entry_ptr->tran_index].thrd_wait_stime =
entry_ptr->thrd_entry->lockwait_stime;
lk_Gl.deadlock_and_timeout_detector++;
/* record which resource we're waiting for (for diagnostics) */
tdes->waiting_for_res = entry_ptr->res_head;
/* === actually suspend === */
thread_suspend_wakeup_and_unlock_entry (entry_ptr->thrd_entry,
THREAD_LOCK_SUSPENDED);
/* === resumed here === */
lk_Gl.deadlock_and_timeout_detector--;
lk_Gl.TWFG_node[entry_ptr->tran_index].thrd_wait_stime = 0;
tdes->waiting_for_res = NULL;
/* dispatch on resume reason */
// ... see §7.3 ...
}

Key fields set before suspension:

FieldWherePurpose
thrd_entry->lockwaitThread entryPoints to the LK_ENTRY being waited on. Non-NULL means “this thread is suspended for a lock.”
thrd_entry->lockwait_stimeThread entryTimestamp (ms) when wait started. Used by timeout check.
thrd_entry->lockwait_msecsThread entryTimeout deadline. LK_INFINITE_WAIT (-1) for no timeout.
thrd_entry->lockwait_stateThread entrySet to LOCK_SUSPENDED. Changed by the waker to the resume reason.
TWFG_node[tran].thrd_wait_stimeWFG nodeUsed by the deadlock daemon to know this transaction is waiting.
deadlock_and_timeout_detectorlk_GlAtomic counter. The deadlock daemon skips its scan when this is 0 (no threads waiting).

The page-latch safety assert: A thread must not hold any permanent page latches when it suspends for a lock. Holding a page latch while waiting for a lock could deadlock with another thread that holds the lock and needs the page latch — a classic latch-lock ordering violation. The assert catches this at development time.

// lock_resume — src/transaction/lock_manager.c (annotated)
static void
lock_resume (LK_ENTRY * entry_ptr, int state)
{
/* The caller is holding the thread entry mutex */
entry_ptr->thrd_entry->lockwait = NULL;
entry_ptr->thrd_entry->lockwait_state = (int) state;
entry_ptr->thrd_entry->resume_status = THREAD_LOCK_RESUMED;
pthread_cond_signal (&entry_ptr->thrd_entry->wakeup_cond);
thread_unlock_entry (entry_ptr->thrd_entry);
}

Protocol: The waker (whoever calls lock_resume) must:

  1. Hold the thread entry mutex (thread_lock_entry).
  2. Verify LK_IS_LOCKWAIT_THREAD(thrd) — the thread is still suspended.
  3. Set lockwait = NULL and lockwait_state = state before signaling the condition variable.
  4. Signal and release the mutex.

The suspended thread, upon waking, reads lockwait_state to determine the resume reason. Since the waker sets the state before signaling, no race exists.

After thread_suspend_wakeup_and_unlock_entry returns, lock_suspend reads lockwait_state to determine why it was woken:

flowchart LR
  WAKE["thread resumes"]
  WAKE --> S1{"lockwait_state?"}
  S1 -- "LOCK_RESUMED" --> R1["Lock granted\n(entry moved to holder list\nby the releaser)"]
  S1 -- "LOCK_RESUMED_ABORTED_FIRST" --> R2["Deadlock victim (primary)\nAbort transaction,\nwait for other threads to finish"]
  S1 -- "LOCK_RESUMED_ABORTED_OTHER" --> R3["Deadlock victim (secondary)\nTreated as timeout\n(another thread handles abort)"]
  S1 -- "LOCK_RESUMED_DEADLOCK_TIMEOUT" --> R4["Deadlock timeout\n(set error, return timeout)"]
  S1 -- "LOCK_RESUMED_TIMEOUT" --> R5["Normal timeout\n(wait_msecs expired)"]
  S1 -- "LOCK_RESUMED_INTERRUPT" --> R6["Server shutdown\n(ER_INTERRUPTED)"]

Figure 7-1 — Six resume states. Only LOCK_RESUMED means the lock was granted. All other states require the caller to clean up the waiter entry via lock_internal_perform_unlock_object.

StateWho sets itMeaningReturn value
LOCK_RESUMEDlock_grant_blocked_holder or lock_grant_blocked_waiter (Ch. 4 §4.6–4.7)Lock granted — entry already in holder listLOCK_RESUMED
LOCK_RESUMED_ABORTED_FIRSTlock_wakeup_deadlock_victim_abortedThis thread is the primary deadlock victim — must perform the rollbackLOCK_RESUMED_ABORTED
LOCK_RESUMED_ABORTED_OTHERlock_wakeup_deadlock_victim_abortedAnother thread of the same transaction is already handling the abortLOCK_RESUMED_DEADLOCK_TIMEOUT
LOCK_RESUMED_DEADLOCK_TIMEOUTlock_wakeup_deadlock_victim_timeoutDeadlock victim woken as timeout (soft abort)LOCK_RESUMED_DEADLOCK_TIMEOUT
LOCK_RESUMED_TIMEOUTlock_force_timeout_expired_wait_transactions (deadlock daemon)Wait time exceeded lockwait_msecsLOCK_RESUMED_TIMEOUT
LOCK_RESUMED_INTERRUPTServer shutdownServer is shutting down — all waiters must exitLOCK_RESUMED_INTERRUPT

7.4 Who wakes up suspended threads — four sources

Section titled “7.4 Who wakes up suspended threads — four sources”
flowchart TB
  subgraph Sources["Four wakeup sources"]
    S1["1. Lock releaser\n(lock_grant_blocked_holder\n lock_grant_blocked_waiter)\n→ LOCK_RESUMED"]
    S2["2. Deadlock daemon\n(timeout check, every 100ms)\n→ LOCK_RESUMED_TIMEOUT"]
    S3["3. Deadlock daemon\n(cycle detection)\n→ LOCK_RESUMED_ABORTED_FIRST\n   LOCK_RESUMED_ABORTED_OTHER\n   LOCK_RESUMED_DEADLOCK_TIMEOUT"]
    S4["4. Server shutdown\n(thread interrupt)\n→ LOCK_RESUMED_INTERRUPT"]
  end

  S1 --> RESUME["lock_resume(entry, state)"]
  S2 --> RESUME
  S3 --> RESUME
  S4 --> INT["thread wakeup\n(THREAD_RESUME_DUE_TO_INTERRUPT)"]

Figure 7-2 — Four sources that wake a suspended lock-waiting thread. Source 1 is the happy path. Sources 2–4 are failure paths that require cleanup.

7.5 Timeout check — lock_force_timeout_expired_wait_transactions

Section titled “7.5 Timeout check — lock_force_timeout_expired_wait_transactions”

The deadlock daemon (Ch. 2 §2.6) calls this on every 100ms iteration for each suspended thread:

// lock_force_timeout_expired_wait_transactions — src/transaction/lock_manager.c
static bool
lock_force_timeout_expired_wait_transactions (void *thrd_entry)
{
thrd = (THREAD_ENTRY *) thrd_entry;
thread_lock_entry (thrd);
if (LK_IS_LOCKWAIT_THREAD (thrd))
{
/* check for interrupt first */
if (logtb_is_interrupted_tran (thrd, ...))
{
lock_resume ((LK_ENTRY *) thrd->lockwait,
LOCK_RESUMED_INTERRUPT);
return true;
}
/* check for timeout */
if (LK_CAN_TIMEOUT (thrd->lockwait_msecs))
{
etime = current_time_ms;
if (etime - thrd->lockwait_stime > thrd->lockwait_msecs)
{
lock_resume ((LK_ENTRY *) thrd->lockwait,
LOCK_RESUMED_TIMEOUT);
return true;
}
}
thread_unlock_entry (thrd);
return false; /* not timed out, still waiting */
}
// ... thread not waiting ...
}

Timeout granularity: The daemon loops every 100ms, so a thread with wait_msecs = 50 may not be timed out until up to 100ms after its deadline. This is by design — the daemon does not use high-resolution timers. For LK_INFINITE_WAIT (-1), LK_CAN_TIMEOUT returns false and the timeout path is skipped entirely.

7.6 The tran_next_wait chain — multi-thread wait coordination

Section titled “7.6 The tran_next_wait chain — multi-thread wait coordination”

When multiple threads of the same transaction try to lock the same resource, the second thread finds the first thread already waiting. Instead of creating a second waiter entry, it chains itself onto the first thread’s tran_next_wait:

// lock_internal_perform_lock_object — same-tran-already-waiting branch
if (wait_entry_ptr != NULL) /* another thread of my tran is waiting */
{
thrd_entry->tran_next_wait =
wait_entry_ptr->thrd_entry->tran_next_wait;
wait_entry_ptr->thrd_entry->tran_next_wait = thrd_entry;
pthread_mutex_unlock (&res_ptr->res_mutex);
thread_suspend_wakeup_and_unlock_entry (thrd_entry,
THREAD_LOCK_SUSPENDED);
goto start; /* retry after wakeup */
}

When the primary waiter is granted and resumes in lock_suspend, it wakes all chained threads:

// lock_suspend — after main dispatch, before return
thread_lock_entry (entry_ptr->thrd_entry);
while (entry_ptr->thrd_entry->tran_next_wait)
{
p = entry_ptr->thrd_entry->tran_next_wait;
entry_ptr->thrd_entry->tran_next_wait = p->tran_next_wait;
p->tran_next_wait = NULL;
thread_wakeup (p, THREAD_LOCK_RESUMED);
}
thread_unlock_entry (entry_ptr->thrd_entry);

The woken secondary threads goto start and retry the entire lock acquisition from the beginning — they do not inherit the primary thread’s grant.

7.7 The deadlock_and_timeout_detector counter

Section titled “7.7 The deadlock_and_timeout_detector counter”
// lock_suspend — before and after suspension
lk_Gl.deadlock_and_timeout_detector++; /* before suspend */
// ... thread sleeps ...
lk_Gl.deadlock_and_timeout_detector--; /* after resume */

This atomic counter is checked by the deadlock daemon:

// deadlock_detect_task_execute — src/transaction/lock_manager.c
if (lk_Gl.deadlock_and_timeout_detector == 0)
return; /* no threads waiting — skip entirely */

When the counter is zero, no thread is suspended for a lock, so there is nothing to time out and no deadlock is possible. This avoids the cost of the thread-map scan and WFG construction on every 100ms tick when the system is uncontended.

  1. lock_suspend registers five fields in the thread entry and WFG node before suspending. The page-latch safety assert prevents latch-lock ordering deadlocks.
  2. lock_resume sets lockwait_state before signaling — the protocol is race-free by construction.
  3. Six resume states distinguish grant, abort (primary/secondary), deadlock timeout, normal timeout, and interrupt. Only LOCK_RESUMED means success.
  4. Four wakeup sources: lock releaser (happy path), daemon timeout check, daemon deadlock detection, and server shutdown.
  5. Timeout granularity is 100ms (the daemon loop period). This is acceptable for lock timeouts, which are typically in the seconds range.
  6. Multi-thread wait coordination uses tran_next_wait chains: secondary threads piggyback on the primary waiter and retry from scratch on wakeup.
  7. deadlock_and_timeout_detector is a fast early-exit for the daemon — when zero, the 100ms tick does no work.

The deadlock daemon (Ch. 2 §2.6, Ch. 7 §7.4) periodically constructs a transaction waits-for graph (WFG) and scans it for cycles. This chapter traces the full detection pipeline: WFG construction, cycle detection via DFS, victim selection, and resolution.

8.1 Overall pipeline — lock_detect_local_deadlock

Section titled “8.1 Overall pipeline — lock_detect_local_deadlock”
flowchart LR
  P1["Phase 1\nInit WFG tables\n(reset nodes + edges)"]
  P2["Phase 2\nBuild WFG\n(scan all LK_RES,\nadd edges)"]
  P3["Phase 3\nDFS cycle detection\n(walk TWFG_node graph)"]
  P4["Phase 4\nVictim selection\n(per cycle)"]
  P5["Phase 5\nResolution\n(wake victims:\ntimeout or abort)"]

  P1 --> P2 --> P3 --> P4 --> P5

Figure 8-1 — Five-phase deadlock detection pipeline. The entire pipeline runs under DL_detection_mutex, serializing detection runs. Each phase operates on the TWFG_node[] / TWFG_edge[] arrays in lk_Gl.

Figure 8-1b — The gates in front of the pipeline

Figure 8-1b — What actually happens on a daemon tick. Three gates sit in front of the five phases — any suspended waiter at all (atomic counter), at least two suspended threads, detection interval elapsed — and most ticks die at the gray return rail for the price of one atomic read. Only when all three pass (at most once per second) does the pipeline run; phase 2’s hash-table sweep is the cost center.

// lock_detect_local_deadlock — Phase 1
for (i = 1; i < lk_Gl.num_trans; i++)
{
lk_Gl.TWFG_node[i].first_edge = -1;
lk_Gl.TWFG_node[i].tran_edge_seq_num = 0;
lk_Gl.TWFG_node[i].checked_by_deadlock_detector = true;
}
lk_Gl.TWFG_edge = &TWFG_edge_block[0]; /* static block, 200 edges */
lk_Gl.max_TWFG_edge = LK_MIN_TWFG_EDGE_COUNT; /* 200 */
for (i = 0; i < LK_MIN_TWFG_EDGE_COUNT; i++)
{
lk_Gl.TWFG_edge[i].to_tran_index = -1;
lk_Gl.TWFG_edge[i].next = (i + 1); /* free-list chain */
}
lk_Gl.TWFG_edge[max - 1].next = -1;
lk_Gl.TWFG_free_edge_idx = 0;
lk_Gl.global_edge_seq_num = 0;
victim_count = 0;

The WFG is rebuilt from scratch on every detection run — there is no incremental maintenance between runs. The edge array starts from a static block (TWFG_edge_block[200]) and can grow to LK_MID_TWFG_EDGE_COUNT (1000) or LK_MAX_TWFG_EDGE_COUNT (MAX_NTRANS²) if needed.

8.3 Phase 2 — build the WFG by scanning all resources

Section titled “8.3 Phase 2 — build the WFG by scanning all resources”

The function iterates over every LK_RES in the hash table and adds edges for three categories of wait relationships:

// lock_detect_local_deadlock — Phase 2 (annotated)
for (res_ptr = iterator.iterate (); res_ptr != NULL; ...)
{
/* (a) among holders: upgrader-to-upgrader conflicts */
for (hi = res_ptr->holder; hi != NULL; hi = hi->next)
{
if (hi->blocked_mode == NULL_LOCK) break; /* end of upgraders */
for (hj = hi->next; hj != NULL; hj = hj->next)
{
/* hj waits for hi? */
if (incompat(hj->blocked_mode, hi->granted_mode)
|| incompat(hj->blocked_mode, hi->blocked_mode))
lock_add_WFG_edge (hj→hi, holder=true);
/* hi waits for hj? */
if (incompat(hi->blocked_mode, hj->granted_mode))
lock_add_WFG_edge (hi→hj, holder=true);
}
}
/* (b) from waiters to holders */
for (hi = res_ptr->holder; hi != NULL; hi = hi->next)
for (hj = res_ptr->waiter; hj != NULL; hj = hj->next)
{
if (incompat(hj->blocked_mode, hi->granted_mode)
|| incompat(hj->blocked_mode, hi->blocked_mode))
lock_add_WFG_edge (hj→hi, holder=true);
}
/* (c) from waiters to other waiters */
for (hi = res_ptr->waiter; hi != NULL; hi = hi->next)
for (hj = hi->next; hj != NULL; hj = hj->next)
{
if (incompat(hj->blocked_mode, hi->blocked_mode))
lock_add_WFG_edge (hj→hi, holder=false);
}
}

Three edge categories:

CategoryFromToEdge meaning
(a) Holder↔HolderUpgrader hjUpgrader hihj’s upgrade blocked by hi’s granted or blocked mode
(b) Waiter→HolderWaiter hjHolder hihj can’t be granted because of hi’s granted or blocked mode
(c) Waiter→WaiterWaiter hjWaiter hi (earlier in queue)hj can’t be granted even if holders release, because hi’s request (ahead in FIFO) is incompatible

The holder_flag parameter on each edge records whether the target (to_tran_index) is a holder — this is used in victim selection (§8.5) to prefer aborting a holder over a waiter.

8.4 lock_add_WFG_edge — edge allocation and staleness detection

Section titled “8.4 lock_add_WFG_edge — edge allocation and staleness detection”
// lock_add_WFG_edge — src/transaction/lock_manager.c (annotated)
static int
lock_add_WFG_edge (int from_tran_index, int to_tran_index,
int holder_flag, INT64 edge_wait_stime,
LK_ENTRY * holder, LK_ENTRY * waiter)
{
/* skip if either transaction is already a victim */
if (TWFG_node[from].DL_victim || TWFG_node[to].DL_victim)
return NO_ERROR;
/* increment global edge sequence number */
lk_Gl.global_edge_seq_num++;
/* handle stale transactions: if a node was not checked by the
detector yet, it might be a new transaction that reused the
tran_index — clear its old edges */
if (TWFG_node[from].checked_by_deadlock_detector == false)
{
/* recycle old edges to free list */
// ... chain old edges back to TWFG_free_edge_idx ...
TWFG_node[from].first_edge = -1;
TWFG_node[from].checked_by_deadlock_detector = true;
TWFG_node[from].tran_edge_seq_num = global_edge_seq_num;
}
/* same for to_tran_index ... */
/* allocate edge from free list (may grow the array) */
alloc_idx = lk_Gl.TWFG_free_edge_idx;
lk_Gl.TWFG_free_edge_idx = TWFG_edge[alloc_idx].next;
/* populate edge */
TWFG_edge[alloc_idx].to_tran_index = to_tran_index;
TWFG_edge[alloc_idx].edge_seq_num = global_edge_seq_num;
TWFG_edge[alloc_idx].holder_flag = holder_flag;
TWFG_edge[alloc_idx].edge_wait_stime = edge_wait_stime;
/* prepend to from_tran's edge list */
TWFG_edge[alloc_idx].next = TWFG_node[from].first_edge;
TWFG_node[from].first_edge = alloc_idx;
}

Edge array growth (3-tier):

TierSizeStorage
InitialLK_MIN_TWFG_EDGE_COUNT = 200Static TWFG_edge_block[200]
MidLK_MID_TWFG_EDGE_COUNT = 1000Same static block (reuses first 200 + extends)
MaxLK_MAX_TWFG_EDGE_COUNT = MAX_NTRANS²malloc’d block (copies from static)

The cycle detection is a classic DFS over the WFG node/edge structure. The ancestor field chains the DFS path, and a cycle is detected when a follow-edge leads to a node that is already on the current path (ancestor != -1):

// lock_detect_local_deadlock — Phase 3 (annotated)
for (k = 1; k < lk_Gl.num_trans; k++)
{
TWFG_node[k].current = TWFG_node[k].first_edge;
TWFG_node[k].ancestor = -1;
}
for (k = 1; k < lk_Gl.num_trans; k++)
{
if (TWFG_node[k].current == -1)
continue; /* no outgoing edges — skip */
s = k;
TWFG_node[s].ancestor = -2; /* sentinel: DFS root */
for (; s != -2; )
{
/* skip stale nodes/edges */
// ... validation checks ...
if (TWFG_node[s].current == -1)
{
/* backtrack: no more edges from s */
t = TWFG_node[s].ancestor;
TWFG_node[s].ancestor = -1;
s = t;
// advance parent's edge pointer
continue;
}
t = TWFG_edge[TWFG_node[s].current].to_tran_index;
/* skip stale edges */
// ... validation checks ...
if (TWFG_node[t].ancestor != -1)
{
/* CYCLE FOUND: s → t, and t is already on the path */
lock_select_deadlock_victim (thread_p, s, t);
}
else
{
/* no cycle yet: advance DFS into t */
TWFG_node[t].ancestor = s;
s = t;
}
}
}

Staleness checks during DFS filter out edges that have become invalid since Phase 2 — a transaction may have committed or been aborted between the WFG build and the DFS walk. Three conditions invalidate a node: checked_by_deadlock_detector == false (new transaction reused the index), thrd_wait_stime == 0 (no longer waiting), or edge_seq_num < tran_edge_seq_num (edge is from a previous incarnation of this tran_index).

When a cycle is found, lock_select_deadlock_victim is called with s (the node that discovered the back-edge) and t (the node already on the DFS path — the cycle start):

Cycle: t → ... → s → t

The function first validates the cycle (false-cycle check for stale edges), then selects a victim using a priority-based strategy:

Victim Selection Strategy (from source comments):
1) Must be a lock holder (holder_flag == true).
2) Must be an active transaction.
3) Prefer a transaction that does NOT have deadlock priority.
4) Prefer a transaction that has written fewer log records.
5) Prefer a transaction with a closer timeout.
6) Prefer the youngest transaction (highest tranid).

Why each criterion exists:

  1. Holder — only a holder’s rollback frees what the cycle waits on; aborting a pure waiter removes an edge but releases nothing, so the cycle persists.
  2. Active — a transaction already committing / aborting is exiting anyway and cannot be aborted again; victimizing it resolves nothing.
  3. No deadlock priority — the priority flag is an explicit protection knob; protected transactions survive.
  4. Fewer log records — the cheapest rollback.
  5. Finite timeout — that application already expects retry-style failures, so a retryable wake costs the least surprise.
  6. Youngest — the least work thrown away; and because elders always survive ties, a long-running transaction is never re-victimized forever — the system as a whole keeps making progress.

The candidate field on TWFG_node marks which cycle members are eligible victims (they are on the holder side of at least one edge). The function walks the cycle via ancestor links, comparing each candidate against the current best victim using the six criteria in order.

// lock_select_deadlock_victim — src/transaction/lock_manager.c (annotated)
static void
lock_select_deadlock_victim (THREAD_ENTRY * thread_p, int s, int t)
{
/* Phase A: validate cycle — check for stale edges */
// walk cycle t → ... → s → t, mark false_dd_cycle if any
// node/edge is stale; if false cycle, clear ancestors and return
if (false_dd_cycle == true)
{ /* clear ancestor chain and return */ return; }
/* Phase B: initialize victim from the edge s→t */
if (TWFG_edge[TWFG_node[s].current].holder_flag)
{
/* t is a holder — candidate for victim */
if (logtb_is_active (thread_p, logtb_find_tranid (t)))
{
victims[victim_count].tran_index = t;
victims[victim_count].can_timeout =
LK_CAN_TIMEOUT (logtb_find_wait_msecs (t));
}
}
/* Phase C: walk cycle s → ancestor → ... → t, compare candidates */
for (v = s; v != t; v = TWFG_node[v].ancestor)
{
if (TWFG_node[v].candidate == false)
continue; /* not a holder in this cycle — skip */
tranid = logtb_find_tranid (v);
if (!logtb_is_active (thread_p, tranid))
continue; /* criterion 2: must be active */
victim_tran_index = victims[victim_count].tran_index;
if (victim_tran_index == NULL_TRAN_INDEX)
{
victim_tranid = tranid; /* first eligible candidate */
}
/* criterion 3: prefer no deadlock priority */
else if (logtb_has_deadlock_priority (victim_tran_index)
!= logtb_has_deadlock_priority (v))
{
if (logtb_has_deadlock_priority (v) == false)
victim_tranid = tranid;
}
else
{
tran_log_count = logtb_find_log_records_count (v);
victim_tran_log_count =
logtb_find_log_records_count (victim_tran_index);
/* criterion 4: prefer fewer log records */
if (tran_log_count < victim_tran_log_count)
victim_tranid = tranid;
else if (tran_log_count == victim_tran_log_count)
{
/* criterion 5: prefer closer timeout */
/* criterion 6: prefer youngest (highest tranid) */
if ((victims[victim_count].can_timeout == false
&& can_timeout == true)
|| (same_timeout_policy
&& LK_ISYOUNGER (tranid, current_victim)))
victim_tranid = tranid;
}
}
if (victim_tranid != NULL_TRANID)
{
victims[victim_count].tran_index = v;
victims[victim_count].tranid = victim_tranid;
victims[victim_count].can_timeout = can_timeout;
}
}
/* Phase D: record victim for event log and increment victim_count */
if (victims[victim_count].tran_index != NULL_TRAN_INDEX)
{
/* collect LK_ENTRY pointers along the cycle for event log */
// ... walk cycle again, store holder/waiter entries ...
victim_count++;
}
}

The six criteria applied in order (each is a tie-breaker for the previous):

flowchart TD
  C1{"1. Is it a lock holder\nin this cycle?"}
  C1 -- "no" --> SKIP["skip"]
  C1 -- "yes" --> C2{"2. Is transaction\nactive?"}
  C2 -- "no" --> SKIP
  C2 -- "yes" --> C3{"3. Does it have\ndeadlock priority?"}
  C3 -- "yes (protected)" --> KEEP["keep current victim"]
  C3 -- "no (unprotected)" --> PREFER["prefer this one"]
  C3 -- "same" --> C4{"4. Fewer log\nrecords written?"}
  C4 -- "yes (less work)" --> PREFER
  C4 -- "same count" --> C5{"5. Has finite\ntimeout?"}
  C5 -- "yes (can retry)" --> PREFER
  C5 -- "same" --> C6{"6. Younger\n(higher tranid)?"}
  C6 -- "yes" --> PREFER
  C6 -- "no" --> KEEP

Figure 8-2 — Victim selection criteria as a decision tree. Each criterion is a tie-breaker for the previous. The overall preference is: unprotected, less work done, retryable, youngest.

Resolution modes:

  • can_timeout = true (the victim’s wait_msecs is finite): wake with LOCK_RESUMED_DEADLOCK_TIMEOUT — treated as a soft timeout, the application can retry.
  • can_timeout = false (LK_INFINITE_WAIT): wake with LOCK_RESUMED_ABORTED_FIRST / _OTHER — the transaction must be rolled back.

8.7 Phase 5 — resolution (waking victims)

Section titled “8.7 Phase 5 — resolution (waking victims)”
// lock_detect_local_deadlock — Phase 5
for (k = 0; k < victim_count; k++)
{
if (victims[k].can_timeout)
lock_wakeup_deadlock_victim_timeout (victims[k].tran_index);
else
lock_wakeup_deadlock_victim_aborted (victims[k].tran_index);
}

These functions (detailed in Ch. 7 §7.3–7.4) find the victim’s suspended thread and call lock_resume with the appropriate state. The resumed thread then cleans up its waiter entry via lock_internal_perform_unlock_object (Ch. 5 §5.7 CASE W).

Safety net — the no_victim_case_count escalation:

If 60 consecutive deadlock detection runs find no victims but threads are still suspended, the system suspects a livelock or undetectable deadlock. As a last resort, it forcibly times out one thread:

if (victim_count == 0)
{
if (lk_Gl.no_victim_case_count >= 60)
{
if (css_are_all_request_handlers_suspended ())
thread_get_manager ()->map_entries (
lock_victimize_first_thread_mapfunc);
lk_Gl.no_victim_case_count = 0;
}
else
lk_Gl.no_victim_case_count += 1;
}

At 100ms per daemon tick, 60 runs = ~6 seconds of no progress before the safety net fires.

8.8 WFG data structures — recap from Ch. 1

Section titled “8.8 WFG data structures — recap from Ch. 1”

For reference, the WFG structures (defined in lock_manager.c):

struct lk_WFG_node
{
int first_edge; /* head of outgoing edge list (-1 = none) */
bool candidate; /* eligible as victim in current cycle */
int current; /* DFS cursor into edge list */
int ancestor; /* DFS parent (-1 = not on path, -2 = root) */
INT64 thrd_wait_stime; /* wait start time (0 = not waiting) */
int tran_edge_seq_num; /* staleness marker */
bool checked_by_deadlock_detector; /* false = new tran, clear old edges */
bool DL_victim; /* already selected as victim */
};
struct lk_WFG_edge
{
int to_tran_index; /* target transaction (-2 = stale edge) */
int edge_seq_num; /* creation sequence number */
int holder_flag; /* true if target is a holder */
int next; /* next edge in same node's list */
INT64 edge_wait_stime; /* waiter's wait start time */
LK_ENTRY *holder; /* LK_ENTRY of the blocking holder */
LK_ENTRY *waiter; /* LK_ENTRY of the blocked waiter */
};

8.9 Concrete scenario — three-transaction deadlock

Section titled “8.9 Concrete scenario — three-transaction deadlock”
flowchart LR
  subgraph Build["Phase 2: Build WFG"]
    direction LR
    T1["T1: holds X on R1\nwaits S on R2"]
    T2["T2: holds S on R2\nwaits X on R3"]
    T3["T3: holds X on R3\nwaits S on R1"]
    T1 -- "edge: T1 waits T2\n(waiter->holder on R2)" --> T2
    T2 -- "edge: T2 waits T3\n(waiter->holder on R3)" --> T3
    T3 -- "edge: T3 waits T1\n(waiter->holder on R1)" --> T1
  end

  subgraph DFS["Phase 3: DFS from T1"]
    D1["visit T1 (ancestor=-2)"]
    D2["visit T2 (ancestor=T1)"]
    D3["visit T3 (ancestor=T2)"]
    D4["follow edge T3->T1:\nT1.ancestor != -1\n→ CYCLE FOUND"]
    D1 --> D2 --> D3 --> D4
  end

  subgraph Victim["Phase 4: Select victim"]
    V1["Cycle: T1 -> T2 -> T3 -> T1\nAll are holders\nAll are active\nPick youngest (highest tranid)\n→ T3 is victim"]
  end

  subgraph Resolve["Phase 5: Resolution"]
    R1["T3.can_timeout?\nyes → DEADLOCK_TIMEOUT\nno → ABORTED_FIRST"]
  end

  Build --> DFS --> Victim --> Resolve

Figure 8-3 — Three-transaction deadlock detection end-to-end. Phase 2 adds three edges forming a cycle. Phase 3 DFS discovers the cycle when T3’s edge leads back to T1. Phase 4 selects T3 (youngest by the default victim strategy). Phase 5 wakes T3 with timeout or abort.

8.10 Performance analysis — is full rebuild every run acceptable?

Section titled “8.10 Performance analysis — is full rebuild every run acceptable?”

The WFG is rebuilt from scratch on every detection run. This section analyzes the actual cost and the design rationale.

Three gates that prevent unnecessary work:

flowchart LR
  TICK["daemon tick\n(every 100ms)"]
  G1{"deadlock_and_timeout\n_detector == 0?"}
  G1 -- "yes (no waiters)" --> SKIP1["skip entirely\n(atomic read only)"]
  G1 -- "no" --> TIMEOUT["timeout check\n(map over threads)"]
  TIMEOUT --> G2{"lock_wait_count >= 2?"}
  G2 -- "no (0 or 1 waiter)" --> SKIP2["skip WFG\n(deadlock impossible)"]
  G2 -- "yes" --> G3{"interval elapsed?\n(default 1.0s)"}
  G3 -- "no" --> SKIP3["skip WFG\n(too soon)"]
  G3 -- "yes" --> FULL["FULL WFG SCAN\nlock_detect_local_deadlock"]

Figure 8-4 — Three gates before the WFG full scan runs. In an uncontended workload, Gate 1 catches everything — cost is one atomic read per 100ms. In a lightly contended workload, Gate 2 catches single-waiter cases. Gate 3 ensures the full scan runs at most once per second.

Cost breakdown when the full scan does execute:

PhaseTime complexityPractical costBottleneck?
Node initO(MAX_NTRANS)~tens of µsNo
Edge initO(200) static blocknegligibleNo
Hash table scanO(all active LK_RES)The dominant cost. Each resource: acquire res_mutex, scan holders/waiters, release.Yes
Per-resource edge buildO(holders² + holders×waiters + waiters²)Most resources have 1–2 holders, 0 waiters → O(1) per resourceRarely
DFS cycle detectionO(nodes + edges)nodes ≤ MAX_NTRANS, edges = actual wait relationshipsNo

The hash table scan is the bottleneck. If the system has 100K active lock resources, the scan performs 100K res_mutex lock/unlock pairs. However:

  • Most resources have no waiter — the scan checks res_ptr->holder and res_ptr->waiter, adds zero edges, and moves on. The per-resource cost is one mutex acquire + a few pointer reads + one mutex release.
  • res_mutex is per-resource — while scanning resource R₁, other threads can freely lock/unlock resource R₂. The scan does not serialize the entire lock manager.
  • The scan runs at most once per second (default PRM_ID_LK_RUN_DEADLOCK_INTERVAL = 1.0).

Design trade-off — full rebuild vs. incremental maintenance:

ApproachHot-path overhead (lock acquire/release)Detection costComplexity
Full rebuild (CUBRID)Zero — no WFG bookkeeping on acquire/releaseO(resources) per scan, ≤ 1/secSimple
Incremental WFGEdge add on every block, edge remove on every grant/releaseO(DFS only) per scanHigh — must maintain edge consistency under concurrent modifications

CUBRID chose zero overhead on the hot path. In OLTP workloads where lock acquires happen tens of thousands of times per second, adding WFG edge maintenance to every acquire would cost more in aggregate than one full scan per second. The full-rebuild approach also avoids the complexity of maintaining graph consistency under concurrent lock state changes.

When the cost could become a problem:

  • Tens of thousands of active lock resources and sustained multi-thread contention (many waiters per resource).
  • In this scenario, the hash table scan could take tens of milliseconds. Mitigation options:
    • Increase PRM_ID_LK_RUN_DEADLOCK_INTERVAL (reduce scan frequency — trades detection latency for lower overhead).
    • Lower PRM_ID_LK_ESCALATION_AT (escalate sooner — reduces the total number of lock resources).
    • The timeout-based fallback (wait_msecs on each lock request) still resolves deadlocks even if the WFG scan is delayed.
  1. The WFG is rebuilt from scratch on every detection run — no incremental maintenance. This simplifies correctness at the cost of O(resources × holders²) construction time.
  2. Three categories of edges capture all wait relationships: holder↔holder (upgrader conflicts), waiter→holder, and waiter→waiter.
  3. DFS cycle detection uses ancestor chains. Staleness checks during DFS filter out edges invalidated by concurrent commits.
  4. Victim selection uses a six-criteria priority: must be holder, must be active, no deadlock priority, fewer log records, closer timeout, youngest transaction.
  5. Two resolution modes: can_timeout → soft timeout (retryable); !can_timeout → abort (rollback required).
  6. Edge array grows in 3 tiers: 200 (static) → 1000 (static extended) → MAX_NTRANS² (malloc). The static initial block avoids allocation for low-contention workloads.
  7. Safety net: 60 consecutive no-victim runs (~6 seconds) triggers forced timeout of one thread to prevent server hang.

The main lock lifecycle (acquire → convert → release) was covered in Chapters 3–5. This chapter covers the special-purpose APIs that don’t fit that main flow: instant lock probe, composite lock, lock_scan / lock_classes_lock_hint, lock demotion, and lock_subclass.

9.1 lock_hold_object_instant — non-blocking probe

Section titled “9.1 lock_hold_object_instant — non-blocking probe”

A lightweight lock check that never blocks and never creates an LK_ENTRY. It tests whether a lock could be granted right now without actually acquiring it.

// lock_internal_hold_lock_object_instant — src/transaction/lock_manager.c
static int
lock_internal_hold_lock_object_instant (THREAD_ENTRY * thread_p,
int tran_index, const OID * oid,
const OID * class_oid, LOCK lock)
{
/* shortcut: class lock subsumes the request? */
if (class_oid != NULL && !OID_IS_ROOTOID (class_oid))
if (lock_is_class_lock_escalated (..., lock))
return LK_GRANTED;
/* find (not find_or_insert!) the resource */
res_ptr = lk_Gl.m_obj_hash_table.find (thread_p, search_key);
if (res_ptr == NULL)
return LK_GRANTED; /* no resource → no contention */
/* am I already a holder? */
for (entry_ptr = res_ptr->holder; ...; entry_ptr = entry_ptr->next)
if (entry_ptr->tran_index == tran_index) break;
if (entry_ptr == NULL)
{
/* not a holder: check compat with holders AND waiters */
compat1 = lock_compat (lock, res_ptr->total_waiters_mode);
compat2 = lock_compat (lock, res_ptr->total_holders_mode);
pthread_mutex_unlock (&res_ptr->res_mutex);
return (compat1 == YES && compat2 == YES)
? LK_GRANTED : LK_NOTGRANTED;
}
else
{
/* already a holder: check conversion compat */
new_mode = lock_conv (lock, entry_ptr->granted_mode);
if (new_mode == entry_ptr->granted_mode)
{ /* re-entrance */ return LK_GRANTED; }
/* check against other holders */
// ... same group_mode logic as PATH F (Ch. 3 §3.9) ...
return compat ? LK_GRANTED : LK_NOTGRANTED;
}
}

Key differences from lock_object:

Aspectlock_hold_object_instantlock_object
Creates LK_ENTRYNoYes
Blocks on conflictNo (returns LK_NOTGRANTED)Yes (suspends thread)
Modifies holder/waiter listsNoYes
Hash operationfind (read-only)find_or_insert (may create LK_RES)
Use caseOptimistic probe before full lockActual lock acquisition

Caller: scan_manager.c uses this as an optimistic first try during heap/index scans. If the probe returns LK_GRANTED, the scan can proceed knowing the lock would succeed, often without needing to create any lock entry at all. If LK_NOTGRANTED, it falls back to the full lock_object path.

9.2 Composite lock — batched multi-object locking

Section titled “9.2 Composite lock — batched multi-object locking”

Composite lock is a deferred locking strategy for bulk DELETE/UPDATE. Instead of acquiring X locks on each row during the scan phase (which could deadlock with concurrent transactions), it collects OIDs first and acquires all locks at the end.

Lifecycle:

flowchart LR
  INIT["lock_initialize_composite_lock\n(set tran_index, wait_msecs)"]
  ADD["lock_add_composite_lock\n(collect OID + class_oid)\nRepeated per row"]
  FIN["lock_finalize_composite_lock\n(acquire all X locks)\nEscalate if count >= threshold"]
  ABORT["lock_abort_composite_lock\n(free collected OIDs)"]

  INIT --> ADD --> FIN
  ADD -- "error" --> ABORT
  FIN -- "error" --> ABORT

Figure 9-1 — Composite lock lifecycle. OIDs are collected during scan, locks are acquired in batch at finalization.

lock_add_composite_lock collects instance OIDs grouped by class:

// lock_add_composite_lock — src/transaction/lock_manager.c (simplified)
int lock_add_composite_lock (... LK_COMPOSITE_LOCK * comp_lock,
const OID * oid, const OID * class_oid)
{
/* find or create per-class bucket */
for (lockcomp_class = lockcomp->class_list; ...; )
if (OID_EQ (class_oid, &lockcomp_class->class_oid)) break;
if (lockcomp_class == NULL)
{
/* new class: acquire IX on class immediately */
lock_internal_perform_lock_object (..., class_oid, IX_LOCK, ...);
/* allocate OID buffer (up to escalation threshold) */
}
/* if class lock already >= X: skip instance OID collection */
if (lockcomp_class->class_lock_ptr->granted_mode >= X_LOCK)
return; /* class lock subsumes everything */
/* collect the instance OID */
lockcomp_class->inst_oid_space[num_inst_oids++] = *oid;
/* if count reaches escalation threshold: stop collecting */
}

lock_finalize_composite_lock acquires the actual locks:

// lock_finalize_composite_lock — src/transaction/lock_manager.c (simplified)
int lock_finalize_composite_lock (... LK_COMPOSITE_LOCK * comp_lock)
{
for (each lockcomp_class in class_list)
{
if (class already >= X_LOCK
|| num_inst_oids == escalation_threshold)
{
/* escalation: X lock on the class */
lock_internal_perform_lock_object (..., class_oid, X_LOCK, ...);
}
else
{
/* individual: X lock on each collected instance */
for (i = 0; i < num_inst_oids; i++)
lock_internal_perform_lock_object (..., inst_oid[i], X_LOCK, ...);
}
}
lock_abort_composite_lock (comp_lock); /* free memory */
}

Caller: query_executor.c uses composite lock for DELETE/UPDATE queries. The XASL plan carries a LK_COMPOSITE_LOCK field; the executor collects target OIDs during the scan phase and calls lock_finalize_composite_lock before applying the modifications.

Escalation integration: Composite lock has its own escalation logic — if the collected OID count for a class reaches PRM_ID_LK_ESCALATION_AT, it stops collecting OIDs and instead acquires X on the class at finalization. This mirrors the behavior of the standard escalation (Ch. 6) but with a two-phase approach.

9.3 lock_scan and lock_classes_lock_hint — class-grain APIs

Section titled “9.3 lock_scan and lock_classes_lock_hint — class-grain APIs”

These two APIs operate at the class level only, with no instance locking.

lock_scan (covered briefly in Ch. 3 §3.1) acquires a single class lock for a heap scan:

// lock_scan — src/transaction/lock_manager.c
int lock_scan (THREAD_ENTRY * thread_p, const OID * class_oid,
int cond_flag, LOCK class_lock)
{
root_class_entry = lock_get_class_lock (thread_p, oid_Root_class_oid);
return lock_internal_perform_lock_object (
thread_p, tran_index, class_oid, NULL,
class_lock, wait_msecs,
&class_entry, root_class_entry);
}

Caller: heap_file.c calls lock_scan(class_oid, IS_LOCK) at the start of a heap scan.

lock_classes_lock_hint acquires locks on multiple classes at once, given a pre-built LC_LOCKHINT array:

// lock_classes_lock_hint — src/transaction/lock_manager.c (simplified)
int lock_classes_lock_hint (THREAD_ENTRY * thread_p,
LC_LOCKHINT * lockhint)
{
/* sort classes for consistent ordering (avoid class-level deadlocks) */
// ... sort lockhint->classes by OID ...
for (i = 0; i < lockhint->num_classes; i++)
{
lock_internal_perform_lock_object (
thread_p, tran_index,
&lockhint->classes[i].oid, NULL,
lockhint->classes[i].lock, wait_msecs,
&entry, root_class_entry);
}
}

Caller: locator_sr.c (xlocator_fetch_lockhint_classes) uses this for batch class locking when the client prefetches multiple classes. The sort step is critical — acquiring class locks in a consistent order prevents deadlocks between transactions that need the same set of classes.

9.4 Lock demotion — lock_demote_class_lock

Section titled “9.4 Lock demotion — lock_demote_class_lock”

Demotion is the reverse of conversion: downgrade a held lock to a weaker mode. This is used only in narrow circumstances.

// lock_demote_class_lock — src/transaction/lock_manager.c (simplified)
int lock_demote_class_lock (THREAD_ENTRY * thread_p,
const OID * oid, LOCK lock, LOCK * ex_lock)
{
entry_ptr = lock_find_tran_hold_entry (thread_p, tran_index, oid, true);
return lock_internal_demote_class_lock (thread_p, entry_ptr,
lock, ex_lock);
}

lock_internal_demote_class_lock changes granted_mode to the weaker mode, recomputes total_holders_mode, and calls lock_grant_blocked_holder / lock_grant_blocked_waiter to see if the demotion unblocks any waiters.

Callers:

  • lock_demote_read_class_lock_for_checksumdb — demotes S→IS on class locks during checksumdb to reduce contention. Explicitly documented as “ONLY for checksumdb.”
  • lock_demote_all_shared_class_locks — internal function that demotes all S class locks to IS, called during specific transaction state transitions.

Warning from the source: “Demoting a write lock in the middle of a transaction may cause recovery issues. It should be carefully used for some particular cases.” Demotion of instance locks is deliberately not supported.

9.5 lock_subclass — class hierarchy locking

Section titled “9.5 lock_subclass — class hierarchy locking”

For CUBRID’s class inheritance hierarchy, locking a subclass requires intention locks on the superclass:

// lock_subclass — src/transaction/lock_manager.c (simplified)
int lock_subclass (THREAD_ENTRY * thread_p,
const OID * subclass_oid,
const OID * superclass_oid,
LOCK lock, int cond_flag)
{
/* determine intention lock for superclass */
if (lock <= S_LOCK)
new_superclass_lock = IS_LOCK;
else
new_superclass_lock = IX_LOCK;
/* acquire intention lock on superclass if needed */
if (old_superclass_lock < new_superclass_lock)
lock_internal_perform_lock_object (..., superclass_oid, NULL,
new_superclass_lock, ...);
/* acquire the actual lock on the subclass */
lock_internal_perform_lock_object (..., subclass_oid, NULL,
lock, ..., &subclass_entry,
superclass_entry);
}

This follows the same multi-granularity pattern as lock_object’s instance-on-class locking, but one level up in the hierarchy: subclass-on-superclass instead of instance-on-class. The class_entry back-pointer chains allow escalation to propagate through the hierarchy.

9.6 lock_rep_read_tran — REPEATABLE READ helper

Section titled “9.6 lock_rep_read_tran — REPEATABLE READ helper”

A specialized entry point for ALTER TABLE ADD COLUMN NOT NULL under REPEATABLE READ isolation:

int lock_rep_read_tran (THREAD_ENTRY * thread_p, LOCK lock,
int cond_flag);

This function acquires a lock on the root class OID to prevent concurrent DDL during the operation. It is a narrow special case and does not introduce any new locking mechanism — it just calls lock_internal_perform_lock_object on the root OID.

These APIs have narrower caller bases than the main lock_object/ lock_unlock_object (see Ch. 3 §3.2 for the main caller map).

APICallersContext
lock_hold_object_instantscan_manager.c (heap scan, index scan)Optimistic non-blocking probe before full lock_object. Two call sites: scan_next_heap_scan for heap OIDs, scan_next_index_scan for index OIDs.
lock_initialize/add/finalize_composite_lockquery_executor.cBulk DELETE/UPDATE. lock_initialize at query start, lock_add per candidate row during scan, lock_finalize before applying modifications.
lock_scanheap_file.cSingle call site: heap_scancache_start acquires IS_LOCK on the class before scanning the heap.
lock_classes_lock_hintlocator_sr.c (xlocator_fetch_lockhint_classes)Batch class locking when the client prefetches multiple classes. Paired with lock_unlock_classes_lock_hint on completion.
lock_subclasslocator_sr.c (INSERT/UPDATE on partitioned tables), query_executor.c (partition scan)Acquires IX/IS on subclass with intention lock on superclass. Used when operating on class hierarchies / partitions.
lock_demote_class_locklocator_sr.c (xlocator_demote_class_lock), btree_load.c (index load)Demote class lock after bulk operation. General-purpose but restricted.
lock_demote_read_class_lock_for_checksumdblocator_sr.c (xlocator_check_fk_validity)S→IS demotion during checksumdb only. Source explicitly says “NEVER consider to use this function for any other clients/threads.”
lock_rep_read_trantransaction_sr.c (xtran_lock_rep_read)Root-class lock for ALTER TABLE ADD COLUMN NOT NULL under RR. Single call site.
  1. lock_hold_object_instant is a read-only, non-blocking probe: uses find (not find_or_insert), creates no LK_ENTRY, never blocks. Used by scan_manager for optimistic locking.
  2. Composite lock defers X-lock acquisition to a batch finalization phase. Integrates its own escalation logic (class X if count reaches threshold). Used by query_executor for bulk DELETE/UPDATE.
  3. lock_scan is a thin wrapper for class-grain IS/IX locks. lock_classes_lock_hint sorts classes before locking to prevent class-level deadlocks.
  4. Lock demotion is intentionally restricted to class locks only and to specific callers (checksumdb). The source explicitly warns against demoting write locks mid-transaction.
  5. lock_subclass extends multi-granularity locking to the class inheritance hierarchy: subclass-on-superclass mirrors instance-on-class.

Chapter 10: Serializable conflict handling (first-updater-wins)

Section titled “Chapter 10: Serializable conflict handling (first-updater-wins)”

This chapter covers the one isolation rule the lock manager enforces without a lock — the write-path ER_MVCC_SERIALIZABLE_CONFLICT raised under REPEATABLE READ / SERIALIZABLE. It is the write-side counterpart to §5.3 (lock_unlock_object — the isolation policy gate): §5.3 decides which locks a transaction keeps to end-of-transaction under RR; this rule stops a transaction from modifying a row that another transaction changed out from under its snapshot. MVCC visibility (cubrid-mvcc-detail.md) supplies the detection substrate; the serialization decision lives here. (The MVCC detail doc delegates here for exactly this reason — visibility alone cannot serialize two writers.)

ER_MVCC_SERIALIZABLE_CONFLICT (-1158, base/error_code.h) is CUBRID’s first-updater-wins write-conflict error — the analogue of PostgreSQL’s “could not serialize access due to concurrent update” (SQLSTATE 40001). Under RR/SERIALIZABLE a transaction may not UPDATE/DELETE a row whose latest version differs from the version its snapshot sees. If a concurrent transaction created a newer version, or deleted the row, after this transaction’s snapshot was taken, the modify is refused with -1158 and the statement aborts.

There is no separate SERIALIZABLE mechanism: TRAN_SERIALIZABLE has no standalone branch anywhere in the tree, so at runtime RR ≡ SERIALIZABLE and both are gated identically by isolation > TRAN_READ_COMMITTED. The source enforces serializability through this one first-updater-wins check plus RR’s strict-2PL lock retention (§5.3); it does not add key-range locks.

10.2 Where it is raised — locator_lock_and_get_object_internal

Section titled “10.2 Where it is raised — locator_lock_and_get_object_internal”

The check is not in lock_manager.c. It lives in the locator’s lock-and-fetch path that UPDATE/DELETE use to pin a row for modification: locator_lock_and_get_object_internal (src/transaction/locator_sr.c), reached from locator_lock_and_get_object_with_evaluation with lock_mode = X_LOCK. The X-lock is taken first, and only then is the version fetched and checked:

// locator_lock_and_get_object_internal — src/transaction/locator_sr.c (annotated)
// 1) acquire the row OID X-lock: conditional, else unconditional (blocking)
if (lock_object (.., oid, class_oid, lock_mode, LK_COND_LOCK) != LK_GRANTED)
{
... pgbuf_ordered_unfix (..); // drop page watchers
if (lock_object (.., oid, class_oid, lock_mode, LK_UNCOND_LOCK) != LK_GRANTED) // ← write-write WAIT
goto error;
heap_prepare_get_context (..); // pages were unlatched; re-fetch
}
// 2) only now fetch the latest version and read its MVCC header
heap_get_last_version (.., context);
heap_get_mvcc_header (.., context, &recdes_header);
// 3) RR/SERIALIZABLE: compare the latest version against the txn snapshot (§10.3)

So the write-write wait today is the OID-lock block (LK_UNCOND_LOCK); the snapshot check below is a post-lock isolation gate, not the conflict detector. (This ordering is the lever the cubrid_cv I13 implicit-update-lock analysis proposes to invert.)

10.3 The gate — RR/SERIALIZABLE on RR-eligible classes

Section titled “10.3 The gate — RR/SERIALIZABLE on RR-eligible classes”
// locator_sr.c (annotated)
if (logtb_find_current_isolation (thread_p) > TRAN_READ_COMMITTED
&& logtb_check_class_for_rr_isolation_err (class_oid))
{
MVCC_SNAPSHOT *snap = logtb_get_mvcc_snapshot (thread_p);
MVCC_SATISFIES_SNAPSHOT_RESULT res = snap->snapshot_fnc (.., &recdes_header, snap);
...
}

Two conditions, both required:

  • Isolation > READ COMMITTED — RR or SERIALIZABLE. Under READ COMMITTED the whole block is skipped: RC never raises -1158; it waits on the row lock and then re-reads the new version.
  • RR-eligible classlogtb_check_class_for_rr_isolation_err (src/transaction/log_tran_table.c) returns true only when the class is not db_root, _db_user, or _db_trigger. Those are accessed before table locks and secure consistency by locks, so they are exempt; most other catalog classes are MVCC-enabled and are checked.

The gate runs the transaction snapshot (mvcc_satisfies_snapshot, src/transaction/mvcc.c) against the latest version’s header:

Snapshot result on latest versionMeaningOutcome
TOO_OLD_FOR_SNAPSHOTDeleted + committed before my snapshot — invisible to meS_DOESNT_EXIST (row “gone”, not a conflict)
TOO_NEW_FOR_SNAPSHOTLatest version’s inserter is active, or committed after my snapshot (MVCC_IS_REC_INSERTER_IN_SNAPSHOT)ER_MVCC_SERIALIZABLE_CONFLICT
SNAPSHOT_SATISFIED and MVCC_IS_HEADER_DELID_VALIDLatest version deleted by a concurrent transactionER_MVCC_SERIALIZABLE_CONFLICT
SNAPSHOT_SATISFIED, not deletedLatest == my visible version; nobody changed itfall through — modify proceeds

TOO_NEW_FOR_SNAPSHOT is defined in mvcc_satisfies_snapshot: a not-deleted record whose inserter is “in snapshot” (active, or committed after the snapshot) returns TOO_NEW. That is precisely “a concurrent transaction produced a version I am not entitled to see” — the first-updater-wins condition.

  • Concurrent UPDATE. T1 snapshots at t0; T2 updates row R and commits after t0; T1 then updates R → its latest version is TOO_NEW for T1 → -1158. (If T1 first blocked on T2’s row X-lock, the re-fetch after the grant sees the new version and raises -1158 rather than silently overwriting.)
  • Concurrent DELETE. T2 deletes R and commits after t0; T1 updates/deletes R → latest version carries a concurrent deleter’s delete-id → -1158. (If T2’s delete committed before t0 the result is TOO_OLDS_DOESNT_EXIST.)
  • No conflict. No concurrent modify since t0 → latest == visible version → modify proceeds.

-1158 is an abort signal, recognised alongside ER_LK_UNILATERALLY_ABORTED by the layers that must not swallow it — query_executor.c, work_space.c (client retry), trigger_manager.c. The engine does not retry internally: the transaction aborts and the application decides whether to retry under a fresh snapshot.

  • MVCC (cubrid-mvcc-detail.md). MVCC supplies the substrate — the record header’s insert/delete MVCCID and the mvcc_satisfies_* classifiers. This chapter is where the isolation layer turns “the latest version is not my visible version” into a refusal.
  • Escalation (Chapter 6). Orthogonal: escalation changes the lock granule (instance→class) to relieve LK_ENTRY memory; it does not change this check. A bulk UPDATE that escalates still runs the per-row first-updater-wins check on each modified row under RR/SERIALIZABLE.

These symbols sit outside lock_manager.c; anchor by function name (line numbers verified against the develop+I8 worktree, 2026-06, and drift across revisions).

SymbolFileLine
ER_MVCC_SERIALIZABLE_CONFLICT (-1158)base/error_code.h1487
locator_lock_and_get_object_internal (raise sites, gate)transaction/locator_sr.c12935 (raises 13041 / 13047)
locator_lock_and_get_object_with_evaluation (entry, X_LOCK)transaction/locator_sr.c13099
mvcc_satisfies_snapshot (TOO_NEW / TOO_OLD / SATISFIED)transaction/mvcc.c156
logtb_check_class_for_rr_isolation_err (RR-eligible gate)transaction/log_tran_table.c5671

SymbolFileLine
Types (headers)
struct lk_entrylock_manager.h79
struct lk_res_keylock_manager.h164
struct lk_reslock_manager.h175
lock_remove_all_inst_lockslock_manager.h195
enum LOCKlock_table.h29
lock_Comp[][]lock_table.c67
lock_Conv[][]lock_table.c179
lock_conv() / lock_compat()lock_table.h58 / 65
Internal types (lock_manager.c)
struct lk_tran_locklock_manager.c324
struct lk_global_datalock_manager.c361
lk_Gl (global instance)lock_manager.c418
lk_Obj_lock_res_desclock_manager.c621
obj_lock_entry_desclock_manager.c592
Constants and macros
LK_OBJ_LOCK_HASHlock_manager.c84
LK_MIN_OBJECT_LOCKSlock_manager.c433
LOCK_TRAN_LOCAL_POOL_MAX_SIZElock_manager.c351
LK_MIN_TWFG_EDGE_COUNTlock_manager.c451
LK_MID_TWFG_EDGE_COUNTlock_manager.c453
LK_MAX_TWFG_EDGE_COUNTlock_manager.c455
LK_MAX_VICTIM_COUNTlock_manager.c448
PRM_ID_LK_ESCALATION_ATsystem_parameter.h(config)
PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATIONsystem_parameter.h(config)
PRM_ID_LK_RUN_DEADLOCK_INTERVALsystem_parameter.c1313
Initialization / finalization
lock_initializelock_manager.c5627
lock_finalizelock_manager.c5859
lock_initialize_tran_lock_tablelock_manager.c1061
lock_initialize_object_hash_tablelock_manager.c1121
lock_initialize_object_lock_entry_listlock_manager.c1152
lock_initialize_deadlock_detectionlock_manager.c1180
lock_finalize_tran_lock_tablelock_manager.c5702
lock_deadlock_detect_daemon_initlock_manager.c5812
deadlock_detect_task_executelock_manager.c5780
Entry init / hash
lock_initialize_entry_as_grantedlock_manager.c920
lock_initialize_entry_as_blockedlock_manager.c940
lock_initialize_entry_as_non2pllock_manager.c962
lock_create_search_keylock_manager.c641
lock_res_key_comparelock_manager.c844
lock_res_key_hashlock_manager.c884
lock_get_hash_valuelock_manager.c1015
lock_get_new_entrylock_manager.c9778
lock_free_entrylock_manager.c9809
Acquisition
lock_objectlock_manager.c5945
lock_object_wait_msecslock_manager.c6273
lock_scanlock_manager.c6303
lock_subclasslock_manager.c6154
lock_classes_lock_hintlock_manager.c6379
lock_internal_perform_lock_objectlock_manager.c3223
lock_find_class_entrylock_manager.c1596
lock_position_holder_entrylock_manager.c1743
lock_is_class_lock_escalatedlock_manager.c508
Grant
lock_grant_blocked_holderlock_manager.c2512
lock_grant_blocked_waiterlock_manager.c2629
lock_grant_blocked_waiter_partiallock_manager.c2758
Release
lock_unlock_objectlock_manager.c6630
lock_unlock_object_donot_move_to_non2pllock_manager.c6613
lock_unlock_object_lock_internallock_manager.c6580
lock_unlock_object_by_isolationlock_manager.c9841
lock_unlock_shared_inst_locklock_manager.c4399
lock_unlock_alllock_manager.c6851
lock_internal_perform_unlock_objectlock_manager.c3945
NON2PL
lock_add_non2pl_locklock_manager.c1649
lock_update_non2pl_listlock_manager.c4592
lock_remove_non2pllock_manager.c539
lock_notify_isolation_inconslock_manager.c7516
Instant lock
lock_start_instant_lock_modelock_manager.c8879
lock_stop_instant_lock_modelock_manager.c8901
Escalation
lock_escalate_if_neededlock_manager.c2974
lock_check_escalatelock_manager.c2907
lock_increment_class_granuleslock_manager.c548
lock_decrement_class_granuleslock_manager.c550
Suspend / resume
lock_suspendlock_manager.c2151
lock_resumelock_manager.c2338
lock_wakeup_deadlock_victim_timeoutlock_manager.c2386
lock_wakeup_deadlock_victim_abortedlock_manager.c2442
lock_force_timeout_expired_wait_transactionslock_manager.c7443
Deadlock detection
lock_detect_local_deadlocklock_manager.c7696
lock_add_WFG_edgelock_manager.c4670
lock_select_deadlock_victimlock_manager.c4806
lock_is_local_deadlock_detection_interval_uplock_manager.c7622
Special paths
lock_hold_object_instantlock_manager.c5901
lock_internal_hold_lock_object_instantlock_manager.c3087
lock_initialize_composite_locklock_manager.c8559
lock_add_composite_locklock_manager.c8586
lock_finalize_composite_locklock_manager.c8740
lock_abort_composite_locklock_manager.c8800
lock_demote_class_locklock_manager.c4165
lock_internal_demote_class_locklock_manager.c4188
lock_rep_read_tranlock_manager.c9698