CUBRID Lock Manager — Code-Level Deep Dive

Where this document fits: The high-level analysis cubrid-lock-manager.md covers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full lifecycle of a single lock request inside the kernel.

Contents:

Ch	Title	Status
1	Data-Structure Map	✅
2	Initialization and Memory Management	✅
3	Lock Acquisition — `lock_internal_perform_lock_object` Dissected	✅
4	Lock Conversion and Re-entrance	✅
5	Lock Release and the NON2PL Protocol	✅
6	Lock Escalation	✅
7	Suspend/Resume and Timeout	✅
8	Deadlock Detection — WFG Construction and Cycle Scan	✅
9	Special Paths — Instant Lock, Composite Lock, lock_scan	✅

Chapter 1: Data-Structure Map

Every lock-manager operation is built on top of five structs. This chapter walks through every field of each struct with code excerpts and draws the pointer relationships between them. Having this map in your head is the prerequisite for following the flows in later chapters.

1.1 Five core structs — overview

flowchart TB
  subgraph GLOBAL["lk_Gl (LK_GLOBAL_DATA) — single global instance"]
    HT["m_obj_hash_table\n(lockfree_hashmap)"]
    TLT["tran_lock_table[0..N]\n(LK_TRAN_LOCK array)"]
    FREE["obj_free_entry_list\n(LF_FREELIST)"]
    WFG["TWFG_node[] / TWFG_edge[]\n(deadlock detection)"]
  end

  subgraph RES["LK_RES — one lockable resource"]
    KEY["key: LK_RES_KEY"]
    AGG["total_holders_mode\ntotal_waiters_mode"]
    HL["holder -> waiter -> non2pl"]
  end

  subgraph ENTRY["LK_ENTRY — one lock fact"]
    E_RES["res_head -> LK_RES"]
    E_MODE["granted_mode / blocked_mode"]
    E_LINK["next (resource side)\ntran_next/tran_prev (tran side)"]
    E_CLASS["class_entry -> parent"]
  end

  subgraph TRAN["LK_TRAN_LOCK — one transaction's lock state"]
    ROOT["root_class_hold"]
    CLS["class_hold_list"]
    INST["inst_hold_list"]
    POOL["lk_entry_pool\n(local pool, max 10)"]
    N2PL["non2pl_list"]
  end

  HT --> RES
  TLT --> TRAN
  RES -.-> ENTRY
  TRAN -.-> ENTRY
  ENTRY --> RES
  ENTRY --> TRAN

Figure 1-1 — Relationship among the five core structs. Arrows show pointer direction. The key insight is that LK_ENTRY is linked into both LK_RES (resource view) and LK_TRAN_LOCK (transaction view) simultaneously.

1.2 `LK_RES_KEY` — identifier of a lockable resource

The key that identifies what object a lock protects. Used as the hash-table lookup key.

// struct lk_res_key — src/transaction/lock_manager.h
struct lk_res_key
{
  LOCK_RESOURCE_TYPE type;   /* INSTANCE, CLASS, ROOT_CLASS */
  OID oid;                   /* target object's OID */
  OID class_oid;             /* owning class OID (instance only) */
};

Field	Size	Description
`type`	`enum` (4B)	One of `LOCK_RESOURCE_INSTANCE`, `LOCK_RESOURCE_CLASS`, `LOCK_RESOURCE_ROOT_CLASS`. `LOCK_RESOURCE_OBJECT` is obsolete.
`oid`	8B (`pageid` 4B + `slotid` 2B + `volid` 2B)	Identifier of the object this lock protects.
`class_oid`	8B	Valid only for instance locks. Set to NULL OID for class/root-class locks.

type is derived from oid and class_oid. lock_create_search_key implements the derivation rule:

// lock_create_search_key — src/transaction/lock_manager.c
static LK_RES_KEY
lock_create_search_key (OID * oid, OID * class_oid)
{
  LK_RES_KEY search_key;
  // ... OID copying ...

  if (oid != NULL && OID_IS_ROOTOID (oid))
    {
      search_key.type = LOCK_RESOURCE_ROOT_CLASS;
    }
  else if (class_oid == NULL || OID_IS_ROOTOID (class_oid))
    {
      search_key.type = LOCK_RESOURCE_CLASS;
    }
  else
    {
      search_key.type = LOCK_RESOURCE_INSTANCE;
    }
  return search_key;
}

The derivation rules in tabular form:

Condition	`type`	Meaning
`oid` is root OID	`ROOT_CLASS`	The database root class itself
`class_oid` is NULL or root OID	`CLASS`	A table (class) — the coarse-grain lock target
Otherwise	`INSTANCE`	An individual row (instance) within a table

Hash-table comparison ignores type and checks oid only. lock_res_key_compare calls OID_EQ(&k1->oid, &k2->oid) and uses type only in an assert. This relies on the invariant that a single OID cannot simultaneously serve as two different resource types.

1.3 `LK_RES` — lock resource (state of one lockable object)

Holds all lock state for one lockable object (database, class, or instance). Found in the hash table by LK_RES_KEY.

// struct lk_res — src/transaction/lock_manager.h
struct lk_res
{
  LK_RES_KEY key;              /* hash key — see 1.2 */
  LOCK total_holders_mode;     /* aggregate mode of all current holders */
  LOCK total_waiters_mode;     /* aggregate mode of all current waiters */
  LK_ENTRY *holder;            /* head of granted-lock linked list */
  LK_ENTRY *waiter;            /* head of blocked-lock linked list */
  LK_ENTRY *non2pl;            /* head of early-released-lock tracking list */
  pthread_mutex_t res_mutex;   /* per-resource mutex */
  LK_RES *hash_next;           /* hash-bucket collision chain */
  LK_RES *stack;               /* lock-free freelist retired stack */
  UINT64 del_id;               /* lock-free reclamation epoch ID */
};

Field-by-field:

Field	Role	Why it exists
`total_holders_mode`	LUB (least upper bound) of all granted locks	Makes compatibility checks O(1) — compare against this one value instead of walking the holder list.
`total_waiters_mode`	LUB of all waiting locks	Starvation guard: a new request must also be compatible with waiting requests, so a stream of S requests cannot leapfrog a waiting X.
`holder`	Singly-linked list of granted `LK_ENTRY`s	Linked via `LK_ENTRY.next`. Multiple transactions can hold the same resource simultaneously (e.g., multiple S_LOCKs).
`waiter`	Singly-linked list of blocked `LK_ENTRY`s	FIFO order. Head is the longest-waiting request.
`non2pl`	Traces of S locks released early under READ COMMITTED	Used for conflict detection when the same transaction re-reads later.
`res_mutex`	Per-resource mutex	Protects holder/waiter/non2pl list manipulation. Per-resource (not global), so lock requests on different resources proceed in parallel.

When total_holders_mode is recomputed: every time an entry is added to, removed from, or converted within the holder list, the entire holder list is walked to recompute the LUB. This happens under res_mutex, so it is serialized per-resource.

1.4 `LK_ENTRY` — one lock fact (the dual-threaded record)

The most important struct in the lock manager. Represents a single lock fact and is linked into both the resource view and the transaction view simultaneously.

// struct lk_entry — src/transaction/lock_manager.h
struct lk_entry
{
  struct lk_res *res_head;     /* back-pointer to owning resource */
  THREAD_ENTRY *thrd_entry;    /* requesting thread (valid only when blocked) */
  int tran_index;              /* owning transaction's index */
  LOCK granted_mode;           /* currently granted mode */
  LOCK blocked_mode;           /* requested mode when blocked */
  int count;                   /* re-entrance counter */
  UINT64 del_id;               /* lock-free reclamation */
  LK_ENTRY *stack;             /* freelist retired stack */
  LK_ENTRY *next;              /* resource view: next in holder/waiter list */
  LK_ENTRY *tran_next;         /* transaction view: next lock of same tran */
  LK_ENTRY *tran_prev;         /* transaction view: prev lock of same tran */
  LK_ENTRY *class_entry;       /* parent class's LK_ENTRY (for instance locks) */
  int ngranules;               /* count of instance locks below this class lock */
  int instant_lock_count;      /* instant lock request count */
  int bind_index_in_tran;
  XASL_ID xasl_id;             /* query plan ID that requested this lock */
};

An LK_ENTRY plays one of three roles — granted holder, blocked waiter, or non2pl marker. Which fields are meaningful depends on the role:

Field	holder (granted)	waiter (blocked)	non2pl	Description
`res_head`	✓	✓	✓	Always valid. Which resource this lock belongs to.
`thrd_entry`	NULL	✓	NULL	Waiter only. Points to the thread to resume.
`tran_index`	✓	✓	✓	Which transaction owns this lock.
`granted_mode`	current mode	`NULL_LOCK`	pre-release mode	For holders, the actual lock mode. For waiters, not yet granted so NULL_LOCK.
`blocked_mode`	`NULL_LOCK`	requested mode	`NULL_LOCK`	For waiters, the mode being requested. During conversion wait, a holder can have `blocked_mode != NULL_LOCK` — `granted_mode` is the currently held mode, `blocked_mode` is the upgrade target.
`count`	≥ 1	1	0	Re-entrance counter. If the same transaction requests the same resource and mode n times, count becomes n. Always 0 for non2pl.
`next`	next in holder list	next in waiter list	next in non2pl list	Resource-view linkage. Singly-linked.
`tran_next`/`tran_prev`	✓	—	✓	Transaction-view linkage. Doubly-linked. Waiters are not in the transaction hold list.
`class_entry`	parent class’s `LK_ENTRY`	—	—	One-hop back-reference from instance lock to class lock. Used for escalation.
`ngranules`	count of child instance locks	—	—	For class locks only. Compared against escalation threshold. Zero for instance locks.
`instant_lock_count`	instant-mode lock count	—	—	Managed by `lock_start_instant_lock_mode` / `lock_stop_instant_lock_mode`.

Key invariant: exclusive use of granted_mode and blocked_mode

Entry in holder list:
  granted_mode = actual held mode (S, X, IS, IX, ...)
  blocked_mode = NULL_LOCK (normal) or upgrade-target mode (conversion wait)

Entry in waiter list:
  granted_mode = NULL_LOCK
  blocked_mode = requested mode

Read the pair as the state. One entry per (resource, transaction): the field pair plus which list the entry threads is the transaction’s state on that resource — X/– in the holder list: plain holder · IX/SIX in the holder list: upgrader (partially acquired, with a pending target) · –/X in the waiter list: waiter (nothing acquired yet). Vocabulary note: blocked_mode is a pending target, not an “intention” — that word is reserved for the intention modes IS/IX/SIX.

This invariant is enforced by the two initializers:

// lock_initialize_entry_as_granted — src/transaction/lock_manager.c
static void
lock_initialize_entry_as_granted (LK_ENTRY * entry_ptr, int tran_index,
                                  LK_RES * res, LOCK lock)
{
  entry_ptr->granted_mode  = lock;       /* actual mode */
  entry_ptr->blocked_mode  = NULL_LOCK;  /* not waiting */
  entry_ptr->count         = 1;
  // ...
}

// lock_initialize_entry_as_blocked — src/transaction/lock_manager.c
static void
lock_initialize_entry_as_blocked (LK_ENTRY * entry_ptr, THREAD_ENTRY * thread_p,
                                  int tran_index, LK_RES * res, LOCK lock)
{
  entry_ptr->granted_mode  = NULL_LOCK;  /* not yet granted */
  entry_ptr->blocked_mode  = lock;       /* requested mode */
  entry_ptr->count         = 1;
  // ...
}

Conversion wait is the exception where both fields are non-NULL. For example, if tran 7 holds S_LOCK and requests an upgrade to X_LOCK:

granted_mode = S_LOCK (still held)
blocked_mode = X_LOCK (upgrade pending)
The entry stays in the holder list but is repositioned to the back via lock_position_holder_entry. Covered in detail in Ch. 4.

1.5 `LK_TRAN_LOCK` — one transaction’s complete lock state

// struct lk_tran_lock — src/transaction/lock_manager.c
struct lk_tran_lock
{
  pthread_mutex_t hold_mutex;    /* protects hold lists */
  LK_ENTRY *inst_hold_list;     /* instance lock hold list */
  LK_ENTRY *class_hold_list;    /* class lock hold list */
  LK_ENTRY *root_class_hold;    /* root class lock (at most one) */
  LK_ENTRY *lk_entry_pool;      /* local pool of reusable LK_ENTRY */
  int lk_entry_pool_count;       /* current count in pool */
  int inst_hold_count;           /* number of instance locks held */
  int class_hold_count;          /* number of class locks held */

  LK_ENTRY *waiting;             /* currently blocked lock entry (at most one) */

  pthread_mutex_t non2pl_mutex;  /* protects non2pl_list */
  LK_ENTRY *non2pl_list;         /* early-released lock tracking list */
  int num_incons_non2pl;         /* inconsistent non2pl count */

  bool lock_escalation_on;       /* escalation-in-progress guard */
  bool is_instant_duration;      /* instant lock mode on/off */
};

Why three separate hold sub-lists:

flowchart TB
  TRAN["LK_TRAN_LOCK (tran #7)"]
  ROOT["root_class_hold\n(at most 1)"]
  CLS["class_hold_list\n-> entry(class A, IX)\n-> entry(class B, IS)\n-> ..."]
  INST["inst_hold_list\n-> entry(row A.1, X)\n-> entry(row A.2, S)\n-> ..."]

  TRAN --> ROOT
  TRAN --> CLS
  TRAN --> INST

Figure 1-2 — Three-way hold list split. Separating root/class/instance means “what lock does this transaction hold on class A?” can be answered by walking only the class list, O(number of classes), instead of scanning all locks.

Field	Detail
`hold_mutex`	Acquired when manipulating inst/class/root hold lists. Separate from `res_mutex` — cross-locking is needed when one transaction manipulates locks on different resources.
`lk_entry_pool`	Each transaction locally caches up to `LOCK_TRAN_LOCAL_POOL_MAX_SIZE = 10` `LK_ENTRY` structs. `lock_get_new_entry` checks this pool first; only falls through to the global `obj_free_entry_list` when empty. Purpose: reduce contention on the global freelist in hot OLTP paths.
`waiting`	The single `LK_ENTRY` this transaction is currently blocked on. A transaction can wait on at most one lock at a time (one thread suspends on one lock request).
`non2pl_list`	When an S lock is released early under READ COMMITTED, the `LK_ENTRY` is moved here. Other transactions check this list when attempting an X lock on the same resource to detect inconsistencies.
`lock_escalation_on`	Re-entrance guard during escalation.
`is_instant_duration`	Turned on by `lock_start_instant_lock_mode`. Locks acquired in this mode are released as soon as the statement ends.

1.6 `LK_GLOBAL_DATA` — the global singleton `lk_Gl`

Exactly one instance exists system-wide. The root of all lock state.

// struct lk_global_data — src/transaction/lock_manager.c
struct lk_global_data
{
  int max_obj_locks;                          /* max lock count set at init */
  lk_hashmap_type m_obj_hash_table;           /* OID -> LK_RES lockfree hashmap */
  LF_FREELIST obj_free_entry_list;            /* global LK_ENTRY freelist */

  int num_trans;                              /* number of initialized transactions */
  LK_TRAN_LOCK *tran_lock_table;             /* per-tran LK_TRAN_LOCK array */

  pthread_mutex_t DL_detection_mutex;         /* deadlock detection guard */
  struct timeval last_deadlock_run;           /* timestamp of last deadlock scan */
  LK_WFG_NODE *TWFG_node;                    /* WFG node array (per-tran) */
  LK_WFG_EDGE *TWFG_edge;                    /* WFG edge array */
  int max_TWFG_edge;                          /* max edge array size */
  int TWFG_free_edge_idx;                     /* next free edge index */
  int global_edge_seq_num;                    /* edge creation sequence number */

  short no_victim_case_count;
  bool verbose_mode;
  std::atomic_int deadlock_and_timeout_detector;
};

LK_GLOBAL_DATA lk_Gl;   /* global instance — file scope */

Overall memory allocation strategy:

flowchart TB
  REQ["lock request -> need LK_ENTRY"]
  LP["1. transaction local pool\n(lk_entry_pool, max 10)"]
  GF["2. global freelist\n(obj_free_entry_list)"]
  MA["3. malloc new block"]

  REQ --> LP
  LP -- "empty" --> GF
  GF -- "empty" --> MA

  REL["lock release -> return LK_ENTRY"]
  LP2["1. return to local pool\n(if count < 10)"]
  GF2["2. return to global freelist"]

  REL --> LP2
  LP2 -- "pool full" --> GF2

Figure 1-3 — Three-tier LK_ENTRY allocation/deallocation strategy. Allocation: local pool → global freelist → malloc. Deallocation: reverse order. Implemented by lock_get_new_entry and lock_free_entry.

1.7 Lock modes — the 12-value `LOCK` enum and two static tables

// enum LOCK — src/transaction/lock_table.h
typedef enum
{
  NA_LOCK = 0,
  INCON_NON_TWO_PHASE_LOCK = 1,  /* NON2PL */
  NULL_LOCK = 2,
  SCH_S_LOCK = 3,
  IS_LOCK = 4,
  S_LOCK = 5,
  IX_LOCK = 6,
  BU_LOCK = 7,
  SIX_LOCK = 8,
  U_LOCK = 9,
  X_LOCK = 10,
  SCH_M_LOCK = 11,
  LOCK_COUNT    /* = 12 */
} LOCK;

Two static 12×12 tables drive all mode decisions:

lock_Comp[requested][current] — compatibility table:

LOCK_COMPAT_YES: the new request can coexist with the existing lock
LOCK_COMPAT_NO: conflict — the new request must wait
LOCK_COMPAT_UNKNOWN: this combination cannot occur in practice (assert-guarded)

lock_Conv[requested][current] — conversion table:

When the same transaction already holds current and requests requested, the resulting mode is lock_Conv[requested][current] (the LUB of the two modes)
NA_LOCK result: this conversion is undefined (assert failure)

// inline accessors — src/transaction/lock_table.h
inline LOCK
lock_conv (LOCK requested, LOCK current)
{
  assert (lock_Conv[requested][current] != NA_LOCK);
  return lock_Conv[requested][current];
}

inline LOCK_COMPATIBILITY
lock_compat (LOCK requested, LOCK current)
{
  assert (lock_Comp[requested][current] != LOCK_COMPAT_UNKNOWN);
  return lock_Comp[requested][current];
}

Conversion examples (read directly from lock_Conv):

Currently held	New request	Result (LUB)	Meaning
`IS_LOCK`	`IX_LOCK`	`IX_LOCK`	Intention shared → intention exclusive upgrade
`S_LOCK`	`IX_LOCK`	`SIX_LOCK`	Shared + intention exclusive = SIX
`S_LOCK`	`X_LOCK`	`X_LOCK`	Shared → exclusive upgrade
`IX_LOCK`	`S_LOCK`	`SIX_LOCK`	IX + S = SIX (symmetric)
`U_LOCK`	`X_LOCK`	`X_LOCK`	Update → exclusive upgrade

1.8 Pointer-relationship summary — everywhere a single lock connects

Every path a single LK_ENTRY is linked into:

flowchart LR
  subgraph Resource["LK_RES (resource R)"]
    direction TB
    RH["holder list"]
    RW["waiter list"]
    RN["non2pl list"]
  end

  subgraph Tran["LK_TRAN_LOCK (tran T)"]
    direction TB
    TI["inst_hold_list"]
    TC["class_hold_list"]
    TR["root_class_hold"]
    TN["non2pl_list"]
  end

  E["LK_ENTRY\n(tran T, resource R)"]

  RH -- "next" --> E
  E -- "res_head" --> Resource
  TI -- "tran_next/prev" --> E
  E -- "class_entry" --> CE["LK_ENTRY\n(tran T, class of R)"]
  CE -- "ngranules++" --> CE

Figure 1-4 — Every pointer relationship involving a single LK_ENTRY. next is the resource view, tran_next/tran_prev is the transaction view, class_entry bridges the granularity hierarchy, and res_head is the back-reference.

1.9 Chapter summary — key takeaways

LK_ENTRY is a dual-threaded record. A single lock fact exists in both the resource view (next) and the transaction view (tran_next/tran_prev) simultaneously.
total_holders_mode and total_waiters_mode enable O(1) compatibility checks. A new request compares against two aggregate values instead of walking the entire holder list.
The combination of granted_mode and blocked_mode determines an entry’s state:
- granted != NULL_LOCK, blocked == NULL_LOCK → normal holder
- granted == NULL_LOCK, blocked != NULL_LOCK → waiter
- granted != NULL_LOCK, blocked != NULL_LOCK → holder undergoing conversion wait
Memory allocation is three-tier: local pool (10 entries) → global freelist → malloc.
Per-resource res_mutex provides resource-granularity serialization, so lock requests on different resources proceed fully in parallel.

Chapter 2: Initialization and Memory Management

The lock manager is bootstrapped once during server startup by lock_initialize and torn down by lock_finalize. This chapter traces the initialization sequence step by step, then examines the runtime memory management — the three-tier allocation strategy and its integration with the lockfree infrastructure.

2.1 `lock_initialize` — the top-level entry point

// lock_initialize — src/transaction/lock_manager.c
int
lock_initialize (void)
{
  error_code = lock_initialize_tran_lock_table ();  /* step 1 */
  lock_initialize_object_hash_table ();              /* step 2 */
  error_code = lock_initialize_object_lock_entry_list (); /* step 3 */
  error_code = lock_initialize_deadlock_detection (); /* step 4 */
  // ... env-var checks for verbose/dump mode ...
  lock_deadlock_detect_daemon_init ();               /* step 5 */
  return error_code;
}

The order matters. Step 1 must complete before step 4 because lock_initialize_deadlock_detection reads lk_Gl.num_trans (set by step 1) to size the WFG node array.

flowchart LR
  INIT["lock_initialize"]
  S1["1. tran_lock_table\nMAX_NTRANS slots\n+ 10 entries each"]
  S2["2. obj_hash_table\nlockfree hashmap\nMAX(10K, NTRANS*300)"]
  S3["3. obj_free_entry_list\nlockfree freelist"]
  S4["4. deadlock detection\nTWFG_node[num_trans]"]
  S5["5. deadlock daemon\n100ms loop"]

  INIT --> S1 --> S2 --> S3 --> S4 --> S5

Figure 2-1 — lock_initialize sequence. Each step builds on the previous one’s output. The deadlock daemon (step 5) is the last piece — it must not run until all data structures are ready.

2.2 Step 1 — `lock_initialize_tran_lock_table`

Allocates and initializes the per-transaction lock state table.

// lock_initialize_tran_lock_table — src/transaction/lock_manager.c
static int
lock_initialize_tran_lock_table (void)
{
  num_trans = MAX_NTRANS;
  lk_Gl.tran_lock_table =
    (LK_TRAN_LOCK *) malloc (SIZEOF_LK_TRAN_LOCK * num_trans);
  memset (lk_Gl.tran_lock_table, 0, SIZEOF_LK_TRAN_LOCK * num_trans);

  for (i = 0; i < num_trans; i++)
    {
      tran_lock = &lk_Gl.tran_lock_table[i];
      pthread_mutex_init (&tran_lock->hold_mutex, NULL);
      pthread_mutex_init (&tran_lock->non2pl_mutex, NULL);

      /* pre-fill local entry pool */
      for (j = 0; j < LOCK_TRAN_LOCAL_POOL_MAX_SIZE; j++)  /* = 10 */
        {
          entry = (LK_ENTRY *) malloc (sizeof (LK_ENTRY));
          lock_initialize_entry (entry);
          entry->next = tran_lock->lk_entry_pool;
          tran_lock->lk_entry_pool = entry;
        }
      tran_lock->lk_entry_pool_count = LOCK_TRAN_LOCAL_POOL_MAX_SIZE;
    }
  return NO_ERROR;
}

Key observations:

MAX_NTRANS slots are allocated up front. This is a compile-time constant, not a runtime parameter. Every tran_index from 0 to MAX_NTRANS - 1 gets its own LK_TRAN_LOCK regardless of how many transactions are active.
Each slot gets 10 pre-allocated LK_ENTRYs via direct malloc. These go into the lk_entry_pool singly-linked list. This is the Tier 1 local pool from Ch. 1 §1.6.
Two mutexes per transaction: hold_mutex (protects the three hold lists) and non2pl_mutex (protects the non2pl list). They are separate because a lock release can simultaneously manipulate the hold list and the non2pl list from different code paths.
All other fields are zeroed by the memset, which means: inst_hold_list = NULL, class_hold_list = NULL, root_class_hold = NULL, waiting = NULL, non2pl_list = NULL, lock_escalation_on = false, is_instant_duration = false.

2.3 Step 2 — `lock_initialize_object_hash_table`

Sets up the lockfree hashmap that maps LK_RES_KEY → LK_RES.

// lock_initialize_object_hash_table — src/transaction/lock_manager.c
static void
lock_initialize_object_hash_table (void)
{
  lk_Gl.max_obj_locks = 10000;  /* LK_INITIAL_OBJECT_LOCK_TABLE_SIZE */
  const int obj_hash_size =
    MAX (lk_Gl.max_obj_locks, LK_MIN_OBJECT_LOCKS);
    /* LK_MIN_OBJECT_LOCKS = MAX_NTRANS * 300 */

  lk_Gl.m_obj_hash_table.init (
    obj_lock_res_Ts, THREAD_TS_OBJ_LOCK_RES,
    obj_hash_size, block_size, block_count,
    lk_Obj_lock_res_desc);
}

The hash table is a lockfree_hashmap<LK_RES_KEY, LK_RES> — a generic lockfree hash table from CUBRID’s lockfree infrastructure (see cubrid-lockfree-hashmap.md). The lock manager plugs in its own behavior through an LF_ENTRY_DESCRIPTOR:

// lk_Obj_lock_res_desc — src/transaction/lock_manager.c
LF_ENTRY_DESCRIPTOR lk_Obj_lock_res_desc = {
  offsetof (LK_RES, stack),      /* retired-entry stack pointer */
  offsetof (LK_RES, hash_next),  /* hash collision chain */
  offsetof (LK_RES, del_id),     /* epoch ID for reclamation */
  offsetof (LK_RES, key),        /* key location */
  offsetof (LK_RES, res_mutex),  /* per-entry mutex */
  LF_EM_USING_MUTEX,             /* each LK_RES has its own mutex */
  LF_ENTRY_DESCRIPTOR_MAX_ALLOC,
  lock_alloc_resource,           /* malloc + pthread_mutex_init */
  lock_dealloc_resource,         /* pthread_mutex_destroy + free */
  lock_init_resource,            /* zero out holders/waiters/non2pl */
  lock_uninit_resource,          /* assert lists are empty */
  lock_res_key_copy,
  lock_res_key_compare,          /* OID_EQ only — type is assert-checked */
  lock_res_key_hash,             /* delegates to LK_OBJ_LOCK_HASH macro */
  NULL                           /* no insert callback */
};

The hash function (lock_res_key_hash → LK_OBJ_LOCK_HASH → lock_get_hash_value) deserves a closer look:

// lock_get_hash_value — src/transaction/lock_manager.c
static unsigned int
lock_get_hash_value (const OID * oid, int htsize)
{
  if (oid->slotid <= 0)
    {
      addr = oid->pageid - oid->slotid;
    }
  else
    {
      next_base_slotid = 2;
      while (next_base_slotid <= (unsigned) oid->slotid)
        next_base_slotid *= 2;

      addr = oid->pageid
             + (htsize / next_base_slotid)
               * (2 * oid->slotid - next_base_slotid + 1);
    }
  return (addr % htsize);
}

This is not a simple modulo hash. For positive slotid values, it uses a van Emde Boas–style bit-interleave that spreads consecutive slot IDs within the same page across different hash buckets. The intent: rows on the same heap page (which tend to be locked together in scan order) are distributed across the hash table, reducing bucket-chain contention.

For slotid <= 0 (index keys — unique index last-key OIDs use slotid = -1), it falls back to a simpler pageid - slotid computation.

Worked example — hash distribution for rows on page 100:

Assume htsize = 6000 (a small example). Six rows on the same heap page (pageid=100) with slotid 1–6:

OID (pageid, slotid)	next_base_slotid	addr formula	addr	bucket (addr % 6000)
(100, 1)	2	100 + (6000/2) * (2·1 - 2 + 1) = 100 + 3000	3100	3100
(100, 2)	4	100 + (6000/4) * (2·2 - 4 + 1) = 100 + 1500	1600	1600
(100, 3)	4	100 + (6000/4) * (2·3 - 4 + 1) = 100 + 4500	4600	4600
(100, 4)	8	100 + (6000/8) * (2·4 - 8 + 1) = 100 + 750	850	850
(100, 5)	8	100 + (6000/8) * (2·5 - 8 + 1) = 100 + 2250	2350	2350
(100, 6)	8	100 + (6000/8) * (2·6 - 8 + 1) = 100 + 3750	3850	3850

All six rows land in different buckets (3100, 1600, 4600, 850, 2350, 3850). A naive pageid * 31 + slotid hash would cluster them within 6 adjacent buckets. The van Emde Boas–style stride ensures that a heap-page scan locking rows 1, 2, 3, … in order hits widely separated buckets, minimizing hash-chain contention under concurrent scans.

Hash table structure after several lock requests:

flowchart TB
  subgraph HT["m_obj_hash_table (lockfree hashmap, 6000 buckets)"]
    B850["bucket 850"]
    B1600["bucket 1600"]
    B3100["bucket 3100"]
    B3850["bucket 3850"]
    BDOT["..."]
  end

  R1["LK_RES\nkey: (100,4) class\ntotal_holders: IX\nholder -> waiter -> non2pl"]
  R2["LK_RES\nkey: (100,2) inst\ntotal_holders: X\nholder -> ..."]
  R3["LK_RES\nkey: (100,1) inst\ntotal_holders: S\nholder -> ..."]
  R4["LK_RES\nkey: (200,3) inst\ntotal_holders: S\nholder -> ..."]

  B850 --> R1
  B1600 --> R2
  B3100 --> R3
  R3 -. "hash_next\n(collision)" .-> R4
  B3850 -. "empty" .-> EMPTY["NULL"]

Figure 2-2 — Hash table after locking rows from pages 100 and 200. Each bucket points to an LK_RES chain. Most buckets have a single entry (the hash function’s goal); collisions (bucket 3100 here: OID (100,1) and (200,3) happen to collide) are resolved via the hash_next singly-linked chain. Each LK_RES carries its own res_mutex — concurrent threads touching different buckets never contend.

End-to-end: from OID to lock grant (the hash table’s role):

flowchart LR
  OID["OID (100, 2, 5)\npageid=100, slotid=2, volid=5"]
  KEY["lock_create_search_key\n-> LK_RES_KEY\n  type=INSTANCE\n  oid=(100,2,5)\n  class_oid=(50,1,5)"]
  HASH["lock_res_key_hash\n-> bucket 1600"]
  FIND["m_obj_hash_table\n.find_or_insert\n(lockfree)"]
  RES["LK_RES found/created\nres_mutex locked"]
  CHECK["check compat with\ntotal_holders_mode\ntotal_waiters_mode"]

  OID --> KEY --> HASH --> FIND --> RES --> CHECK

Figure 2-3 — From OID to compatibility check. The lockfree hashmap resolves the OID to an LK_RES without a global lock. The per-resource res_mutex is held only after the LK_RES is found, narrowing the critical section to one resource.

2.4 Step 3 — `lock_initialize_object_lock_entry_list`

Sets up the lockfree freelist for LK_ENTRY allocation (Tier 2 of the three-tier strategy).

// lock_initialize_object_lock_entry_list — src/transaction/lock_manager.c
static int
lock_initialize_object_lock_entry_list (void)
{
  block_count = 1;
  block_size = (int) MAX ((lk_Gl.max_obj_locks * LK_ENTRY_RATIO), 1);
  /* LK_ENTRY_RATIO = 0.1f */

  ret = lf_freelist_init (&lk_Gl.obj_free_entry_list,
                           block_count, block_size,
                           &obj_lock_entry_desc, &obj_lock_ent_Ts);
  return ret;
}

The LK_ENTRY descriptor differs from the LK_RES descriptor in one crucial way:

// obj_lock_entry_desc — src/transaction/lock_manager.c
LF_ENTRY_DESCRIPTOR obj_lock_entry_desc = {
  offsetof (LK_ENTRY, stack),
  offsetof (LK_ENTRY, next),
  offsetof (LK_ENTRY, del_id),
  0,                              /* no key — not in a hash table */
  0,                              /* no mutex offset */
  LF_EM_NOT_USING_MUTEX,          /* ← key difference */
  // ...
  lock_alloc_entry,               /* plain malloc */
  lock_dealloc_entry,             /* plain free */
  // ...
};

LK_ENTRY uses LF_EM_NOT_USING_MUTEX because it does not carry its own mutex. Protection comes from the owning LK_RES.res_mutex or LK_TRAN_LOCK.hold_mutex, depending on which list the entry is being manipulated through.

2.5 Step 4 — `lock_initialize_deadlock_detection`

// lock_initialize_deadlock_detection — src/transaction/lock_manager.c
static int
lock_initialize_deadlock_detection (void)
{
  pthread_mutex_init (&lk_Gl.DL_detection_mutex, NULL);
  gettimeofday (&lk_Gl.last_deadlock_run, NULL);

  lk_Gl.TWFG_node =
    (LK_WFG_NODE *) malloc (SIZEOF_LK_WFG_NODE * lk_Gl.num_trans);
  for (i = 0; i < lk_Gl.num_trans; i++)
    {
      lk_Gl.TWFG_node[i].DL_victim = false;
      lk_Gl.TWFG_node[i].checked_by_deadlock_detector = false;
      lk_Gl.TWFG_node[i].thrd_wait_stime = 0;
    }

  lk_Gl.TWFG_edge = NULL;   /* allocated lazily on first use */
  lk_Gl.max_TWFG_edge = 0;
  lk_Gl.TWFG_free_edge_idx = -1;
  lk_Gl.global_edge_seq_num = 0;
  return NO_ERROR;
}

Key observations:

WFG nodes are allocated per-transaction — one LK_WFG_NODE per tran_index. This is why step 4 depends on lk_Gl.num_trans being set by step 1.
WFG edges are lazily allocated. TWFG_edge starts as NULL and is allocated on the first call to lock_add_WFG_edge (when the first actual lock wait occurs). Initial size is LK_MID_TWFG_EDGE_COUNT = 1000; max is MAX_NTRANS * MAX_NTRANS.
DL_detection_mutex serializes deadlock detection runs. Only one deadlock scan runs at a time.

2.6 Step 5 — `lock_deadlock_detect_daemon_init`

// lock_deadlock_detect_daemon_init — src/transaction/lock_manager.c
void
lock_deadlock_detect_daemon_init ()
{
  cubthread::looper looper =
    cubthread::looper (std::chrono::milliseconds (100));
  cubthread::entry_callable_task *daemon_task =
    new cubthread::entry_callable_task (deadlock_detect_task_execute);
  lock_Deadlock_detect_daemon =
    cubthread::get_manager ()->create_daemon (looper, daemon_task,
                                              "deadlock-detect");
}

The daemon loops every 100ms and performs three duties per iteration (in deadlock_detect_task_execute):

Interrupt check — resume any thread whose transaction was interrupted externally.
Timeout check — resume any thread whose lock-wait has exceeded its wait_msecs deadline (lock_force_timeout_expired_wait_transactions).
Deadlock detection — if the interval since the last run exceeds PRM_ID_LK_RUN_DEADLOCK_INTERVAL and at least two threads are suspended, run lock_detect_local_deadlock. This avoids the cost of a full WFG scan when contention is low.

The 100ms loop period is the timeout check granularity — a lock wait with wait_msecs = 50 will be noticed within 100ms, not 50ms. The deadlock scan itself runs less frequently, gated by the configurable interval parameter.

2.7 `lock_finalize` — teardown

// lock_finalize — src/transaction/lock_manager.c
void
lock_finalize (void)
{
  free_and_init (lk_Gl.TWFG_node);        /* WFG nodes */
  lock_finalize_tran_lock_table ();         /* per-tran state */
  pthread_mutex_destroy (&lk_Gl.DL_detection_mutex);
  lk_Gl.m_obj_hash_table.destroy ();       /* lockfree hashmap */
  lf_freelist_destroy (&lk_Gl.obj_free_entry_list); /* lockfree freelist */
  lock_deadlock_detect_daemon_destroy ();   /* stop daemon */
}

lock_finalize_tran_lock_table walks each LK_TRAN_LOCK, destroys its two mutexes, and frees every entry remaining in lk_entry_pool via plain free. Note: this does not walk the hold lists — by the time lock_finalize runs, all transactions must have already released their locks (or been forcibly aborted).

2.8 Three-tier `LK_ENTRY` allocation at runtime

This is the runtime counterpart of the initialization in §2.2 and §2.4. Every lock acquisition needs an LK_ENTRY; every release returns one.

// lock_get_new_entry — src/transaction/lock_manager.c
static LK_ENTRY *
lock_get_new_entry (int tran_index, LF_TRAN_ENTRY * tran_entry,
                    LF_FREELIST * freelist)
{
  LK_TRAN_LOCK *tran_lock = &lk_Gl.tran_lock_table[tran_index];

  /* Tier 1: check local pool (no synchronization needed) */
  if (tran_lock->lk_entry_pool)
    {
      lock_entry = tran_lock->lk_entry_pool;
      tran_lock->lk_entry_pool = lock_entry->next;
      tran_lock->lk_entry_pool_count--;
      return lock_entry;
    }

  /* Tier 2+3: lockfree global freelist (may malloc internally) */
  return (LK_ENTRY *) lf_freelist_claim (tran_entry, freelist);
}

// lock_free_entry — src/transaction/lock_manager.c
static void
lock_free_entry (int tran_index, LF_TRAN_ENTRY * tran_entry,
                 LF_FREELIST * freelist, LK_ENTRY * lock_entry)
{
  LK_TRAN_LOCK *tran_lock = &lk_Gl.tran_lock_table[tran_index];

  /* Tier 1: return to local pool if not full */
  if (tran_lock->lk_entry_pool_count < LOCK_TRAN_LOCAL_POOL_MAX_SIZE)
    {
      lock_uninit_entry (lock_entry);
      lock_entry->next = tran_lock->lk_entry_pool;
      tran_lock->lk_entry_pool = lock_entry;
      tran_lock->lk_entry_pool_count++;
    }
  else
    {
      /* Tier 2: retire to global freelist (epoch-based reclamation) */
      lf_freelist_retire (tran_entry, freelist, lock_entry);
    }
}

Why three tiers instead of just one global freelist?

Tier	Structure	Sync cost	When used
1	`lk_entry_pool` (per-tran, max 10)	Zero — only the owning transaction touches it	Hot path: short OLTP transactions that acquire/release ≤10 locks
2	`obj_free_entry_list` (global lockfree)	Lockfree CAS — `lf_freelist_claim`/`retire`	When local pool is empty (alloc) or full (dealloc)
3	`malloc` / `free`	Kernel syscall	When the global freelist’s pre-allocated blocks are exhausted

For a typical OLTP transaction that touches 3–5 rows, all lock entries come from and return to Tier 1 — the global freelist is never contended. Tier 2 absorbs the spillover from batch operations that exceed 10 concurrent locks per transaction. Tier 3 (malloc) is a cold path that fires only during initial ramp-up or extreme load.

2.9 Lockfree integration — two transaction systems

The lock manager participates in two independent lockfree transaction systems, each with its own epoch counter:

System ID	Manages	Used by
`THREAD_TS_OBJ_LOCK_RES`	`LK_RES` (hash table entries)	`m_obj_hash_table.find_or_insert`, `erase_locked`
`THREAD_TS_OBJ_LOCK_ENT`	`LK_ENTRY` (lock entries)	`lock_get_new_entry`, `lock_free_entry`

Every lock manager function that allocates or frees resources begins by obtaining the thread’s participation handle:

LF_TRAN_ENTRY *t_entry_res =
  thread_get_tran_entry (thread_p, THREAD_TS_OBJ_LOCK_RES);
LF_TRAN_ENTRY *t_entry_ent =
  thread_get_tran_entry (thread_p, THREAD_TS_OBJ_LOCK_ENT);

The del_id fields on LK_RES and LK_ENTRY are set when the object is retired (via lf_freelist_retire or hashmap erase). A retired object is not physically freed until every concurrent reader — tracked by the epoch counter — has advanced past the retirement epoch. This is the standard hazard-pointer / epoch-based reclamation pattern described in cubrid-lockfree-transaction.md.

What is lockfree vs. mutex-protected:

Operation	Mechanism
Hash table lookup/insert (`LK_RES` by OID)	Lockfree hashmap
`LK_ENTRY` alloc from global pool	Lockfree freelist (`lf_freelist_claim`)
`LK_ENTRY` alloc from local pool	No sync (per-tran, single-thread)
Memory reclamation of retired entries	Lockfree epoch-based (`del_id`)
Holder/waiter/non2pl list manipulation	`res_mutex` (per-resource)
Transaction hold list manipulation	`hold_mutex` (per-transaction)
Transaction non2pl list manipulation	`non2pl_mutex` (per-transaction)

The boundary is clear: finding or creating the LK_RES is lockfree; manipulating its contents (holder/waiter lists) is mutex-protected. This means concurrent lock requests on different resources never contend with each other at the hash level, and only share-nothing local-pool entries on the allocation side.

2.10 Chapter summary — key takeaways

Initialization order is load-bearing. Transaction table (step 1) must precede deadlock detection (step 4) because the WFG is sized by num_trans.
The hash table size is MAX(10000, MAX_NTRANS * 300) — a hard-coded formula, not a tunable parameter. The hash function uses a van Emde Boas–style distribution to spread consecutive slot IDs across buckets.
Each transaction gets 10 pre-allocated LK_ENTRYs at init time. These form the Tier 1 local pool that satisfies most OLTP lock requests without any synchronization.
WFG edges are lazily allocated on first lock-wait, not at startup. This avoids wasting memory in read-only workloads.
The deadlock daemon loops every 100ms but only runs the full WFG cycle scan when (a) the configured interval has elapsed and (b) at least two threads are suspended. Timeout checks run on every iteration.
Lockfree boundary: hash lookup/insert and global freelist claim/retire are lockfree. Everything below — holder/waiter list manipulation, transaction hold list updates — is mutex-serialized per-resource or per-transaction.

Chapter 3: Lock Acquisition

This chapter traces every branch of the lock acquisition path — from the public API (lock_object, lock_scan) through the internal workhorse lock_internal_perform_lock_object. By the end you should be able to predict exactly what happens for any combination of resource state and request parameters.

3.1 Two public entry points — `lock_object` and `lock_scan`

The lock manager exposes two primary APIs. They differ in granularity intent, not mechanism — both eventually call lock_internal_perform_lock_object.

flowchart LR
  SCAN["lock_scan(class_oid, class_lock)\nClass-grain: IS or IX\non the class itself"]
  OBJ["lock_object(oid, class_oid, lock)\nAny grain: root, class, or instance"]

  SCAN --> IPLK["lock_internal_perform_lock_object\n(the real work)"]
  OBJ --> PREP["Preparation layer:\n1. Root OID? -> lock root directly\n2. Class lock? -> ensure IX/IS on root first\n3. Instance lock? -> ensure IX/IS on class first"]
  PREP --> IPLK

Figure 3-1 — Two entry points converge on the same internal function. lock_scan is a thin wrapper for class-grain locks. lock_object does multi-granularity preparation before calling the internal function.

lock_scan is the simpler of the two. It acquires a class-level lock (typically IS_LOCK for SELECT, IX_LOCK for DML) with no instance-level work:

// lock_scan — src/transaction/lock_manager.c (simplified)
int
lock_scan (THREAD_ENTRY * thread_p, const OID * class_oid,
           int cond_flag, LOCK class_lock)
{
  root_class_entry = lock_get_class_lock (thread_p, oid_Root_class_oid);
  granted = lock_internal_perform_lock_object (
    thread_p, tran_index, class_oid,
    NULL,            /* class_oid = NULL → this IS a class lock */
    class_lock, wait_msecs,
    &class_entry, root_class_entry);
  return granted;
}

lock_object handles three cases based on the OID:

flowchart TD
  A["lock_object(oid, class_oid, lock)"]
  A --> B{"OID_IS_ROOTOID(oid)?"}
  B -- "yes" --> C1["Case 1: ROOT_CLASS\nlock root directly\nclass_entry = NULL"]
  B -- "no" --> D{"OID_IS_ROOTOID(class_oid)?"}
  D -- "yes" --> C2["Case 2: CLASS\n1. Ensure IX/IS on root\n   (if not already held)\n2. Lock the class itself"]
  D -- "no" --> C3["Case 3: INSTANCE\n1. Ensure IX/IS on class\n   (if not already held)\n2. Check if class lock\n   subsumes the request\n3. Lock the instance"]

  C1 --> IPLK["lock_internal_perform_lock_object"]
  C2 --> IPLK
  C3 --> IPLK

Figure 3-2 — lock_object three-case dispatch. Each case ensures the multi-granularity hierarchy is respected: an instance lock requires an intention lock on its class, a class lock requires an intention lock on the root.

The intention lock mode is derived from the requested lock:

// lock_object — src/transaction/lock_manager.c
if (lock <= S_LOCK)
  new_class_lock = IS_LOCK;    /* read intent */
else
  new_class_lock = IX_LOCK;    /* write intent */

Key optimization in Case 3: before calling lock_internal_perform_lock_object for the instance, lock_object checks lock_is_class_lock_escalated(old_class_lock, lock). If the class lock already held is strong enough to subsume the instance request (e.g., holding X on class makes any instance lock redundant), the function returns LK_GRANTED immediately.

3.2 Caller map — who calls the lock manager and why

The lock manager is a service layer. Before diving into the internal state machine, it helps to know who calls the APIs and in what context. This map covers both acquisition and release APIs — it is referenced again in Ch. 5.

flowchart TB
  subgraph Callers["Upper-layer callers"]
    LOC["locator_sr.c (41 calls)\nObject fetch / store / DDL"]
    BT["btree.c (13)\nIndex traversal, FK checks"]
    HEAP["heap_file.c (8)\nHeap scan, page access"]
    LOG["log_manager.c (5)\nCommit / rollback"]
    BOOT["boot_sr.c (5)\nServer startup / DDL"]
    SCAN["scan_manager.c (4)\nHeap/index scan iteration"]
    QE["query_executor.c (4)\nQuery eval, instant lock mode"]
    SER["serial.c (4)\nNEXT VALUE / CURRENT VALUE"]
    CAT["system_catalog.c (4)\nCatalog access"]
    STAT["statistics_sr.c (4)\nALTER INDEX statistics"]
  end

  subgraph LockAPI["Lock Manager API"]
    LO["lock_object\n(any grain)"]
    LS["lock_scan\n(class grain)"]
    LH["lock_hold_object_instant\n(probe, no block)"]
    LC["lock_classes_lock_hint\n(batch class locks)"]
    UO["lock_unlock_object"]
    UDN["lock_unlock_object_donot_move_to_non2pl"]
    UA["lock_unlock_all"]
  end

  LOC --> LO & UO & UDN & LC
  BT --> LO & UDN
  HEAP --> LO & LS & UO
  LOG --> UA
  BOOT --> LO & UO
  SCAN --> LO & UO & LH
  QE --> LO & UO
  SER --> LO & UO
  CAT --> LO & UDN
  STAT --> LO & UO

Figure 3-3 — Caller map. Numbers in parentheses are approximate call-site counts. locator_sr.c is the dominant caller — it mediates all client-initiated object access. log_manager.c is the exclusive caller of lock_unlock_all (commit/rollback).

Key caller patterns:

Caller	Lock API used	Typical mode	Context
`locator_sr.c`	`lock_object`, `lock_unlock_object`, `lock_classes_lock_hint`	SCH_M for DDL, S/X for DML, IS/IX intent	Object fetch (`xlocator_fetch`), store (`xlocator_force`), class creation/drop
`btree.c`	`lock_object`, `lock_unlock_object_donot_move_to_non2pl`	S/X on instance OIDs	Index key lock during traversal; FK validation — uses `donot_move_to_non2pl` because rows failing FK check are never surfaced to client
`heap_file.c`	`lock_scan`, `lock_object`, `lock_unlock_object`	IS via `lock_scan`; S/X per row	Heap scan start (`lock_scan` for class IS), row-level lock during fetch
`scan_manager.c`	`lock_object`, `lock_hold_object_instant`, `lock_unlock_object`	S or X on rows; instant probe	Heap/index scan: `lock_hold_object_instant` for non-blocking probe, fallback to `lock_object` if probe fails
`query_executor.c`	`lock_start_instant_lock_mode`, `lock_stop_instant_lock_mode`	(controls mode, not direct lock)	WHERE clause evaluation — instant mode brackets
`serial.c`	`lock_object(X_LOCK)`, `lock_unlock_object`	X on serial instance OID	`NEXT VALUE` / `CURRENT VALUE` — the primary exerciser of the non2pl path under RC
`log_manager.c`	`lock_unlock_all`	(releases everything)	`log_commit` / `log_abort` — the only caller of bulk release

Why scan_manager.c uses lock_hold_object_instant: During a heap or index scan, the scan manager first tries a non-blocking instant probe. If the probe fails (lock conflict), it falls back to the full lock_object path with waiting. This optimistic-first pattern avoids suspending the thread for rows that are likely uncontended.

3.3 `lock_internal_perform_lock_object` — the complete state machine

This is the core function (~700 lines). It handles four top-level scenarios:

flowchart TD
  START["lock_internal_perform_lock_object\n(oid, class_oid, lock, wait_msecs)"]

  START --> DISPATCH{"instance lock?\n(class_oid != NULL\n&& !root)"}
  DISPATCH -- "yes (instance)" --> ESC["lock_escalate_if_needed\n(may promote to class lock)"]
  ESC --> SUBSUME{"class lock\nsubsumes request?"}
  SUBSUME -- "yes" --> GRANT_FAST["return LK_GRANTED\n(no instance lock needed)"]
  SUBSUME -- "no" --> HASH

  DISPATCH -- "no (class/root)" --> FAST{"lock_find_class_entry\n(already held?)"}
  FAST -- "found" --> CONV["goto lock_tran_lk_entry\n(conversion path)"]
  FAST -- "not found" --> HASH

  HASH["find_or_insert in hash table\n(lockfree, res_mutex locked on return)"]

  HASH --> EMPTY{"res empty?\n(no holder/waiter/non2pl)"}
  EMPTY -- "yes" --> PATH_A["PATH A: Fresh resource\nGrant unconditionally"]
  EMPTY -- "no" --> SCAN_HOLDER{"scan holder list\nfor my tran_index"}
  SCAN_HOLDER -- "not found" --> NEW_REQ["I am NOT a holder"]
  SCAN_HOLDER -- "found" --> CONV

  NEW_REQ --> COMPAT{"compat with\nholders AND waiters?"}
  COMPAT -- "yes" --> PATH_B["PATH B: Compatible\nGrant immediately"]
  COMPAT -- "no" --> ZERO{"wait_msecs ==\nZERO_WAIT?"}
  ZERO -- "yes" --> PATH_C["PATH C: Timeout\nReturn immediately"]
  ZERO -- "no" --> PATH_D["PATH D: Enqueue\nas waiter, suspend"]

  CONV --> CONV_CHECK{"new_mode ==\ngranted_mode?"}
  CONV_CHECK -- "yes" --> PATH_E["PATH E: Re-entrance\nJust bump count"]
  CONV_CHECK -- "no" --> CONV_COMPAT{"compat with\nOTHER holders?"}
  CONV_COMPAT -- "yes" --> PATH_F["PATH F: Conversion granted\nUpgrade in place"]
  CONV_COMPAT -- "no" --> CONV_ZERO{"wait_msecs ==\nZERO_WAIT?"}
  CONV_ZERO -- "yes" --> PATH_C
  CONV_ZERO -- "no" --> PATH_G["PATH G: Conversion blocked\nSet blocked_mode, reposition, suspend"]

Figure 3-4 — Complete state machine of lock_internal_perform_lock_object. Seven paths (A–G) cover every combination of resource state, holder status, compatibility, and wait policy.

Entry conditions for each path:

Path	Am I already a holder?	Resource state	Compatibility	Wait policy	Result
A	No	Empty (no holder, waiter, non2pl)	N/A	Any	Grant unconditionally
B	No	Non-empty	`lock_compat(req, total_holders) == YES` AND `lock_compat(req, total_waiters) == YES`	Any	Grant immediately
C	No	Non-empty	Fails holder or waiter compat	`ZERO_WAIT` or `FORCE_ZERO_WAIT`	Return timeout
D	No	Non-empty	Fails holder or waiter compat	`wait_msecs > 0` or `INFINITE_WAIT`	Enqueue as waiter, suspend
E	Yes	—	`lock_conv(req, granted) == granted` (no upgrade needed)	Any	Bump count, return
F	Yes	—	`lock_conv(req, granted) != granted` AND `lock_compat(new_mode, group_mode_of_others) == YES`	Any	Upgrade in place
G	Yes	—	`lock_conv(req, granted) != granted` AND `lock_compat(new_mode, group_mode_of_others) == NO`	`wait_msecs > 0` or `INFINITE_WAIT`	Set `blocked_mode`, reposition (UPR), suspend

Notes:

PATH C also applies to holders whose conversion is blocked with zero-wait (the same timeout logic fires from the conversion branch).
PATH E does not acquire res_mutex when reached via the lock_find_class_entry fast path. That fast path does acquire hold_mutex (per-transaction) to walk the class hold list, but avoids the more expensive per-resource res_mutex.
PATH F and G both compute group_mode by walking other holders (excluding self), not using the cached total_holders_mode.

3.4 PATH A — fresh resource, unconditional grant

The fastest path. The LK_RES was just created by find_or_insert (no prior holders, waiters, or non2pl entries).

// PATH A — src/transaction/lock_manager.c (annotated)
if (res_ptr->holder == NULL && res_ptr->waiter == NULL
    && res_ptr->non2pl == NULL)
  {
    lock_initialize_resource_as_allocated (res_ptr, NULL_LOCK);

    entry_ptr = lock_get_new_entry (tran_index, t_entry_ent,
                                    &lk_Gl.obj_free_entry_list);
    lock_initialize_entry_as_granted (entry_ptr, tran_index,
                                      res_ptr, lock);
    /* instant lock tracking */
    if (is_instant_duration)
      entry_ptr->instant_lock_count++;

    res_ptr->holder = entry_ptr;          /* sole holder */
    entry_ptr->class_entry = class_entry; /* link to parent */
    lock_increment_class_granules (class_entry);
    lock_insert_into_tran_hold_list (entry_ptr, tran_index);
    res_ptr->total_holders_mode = lock;

    pthread_mutex_unlock (&res_ptr->res_mutex);
    *entry_addr_ptr = entry_ptr;
    return LK_GRANTED;
  }

What happens step by step:

Allocate an LK_ENTRY from the three-tier pool (Ch. 2 §2.8).
Initialize as granted (granted_mode = lock, blocked_mode = NULL_LOCK).
Splice into res_ptr->holder (sole entry — the list was empty).
Set class_entry back-pointer and increment ngranules on the parent.
Splice into the transaction’s hold list via lock_insert_into_tran_hold_list.
Set total_holders_mode = lock (no LUB needed — we are the only holder).
Release res_mutex and return.

3.5 PATH B — existing resource, compatible, grant immediately

The resource already has holders and/or waiters, but the new request is compatible with both total_holders_mode and total_waiters_mode.

// PATH B — src/transaction/lock_manager.c (annotated)
compat1 = lock_compat (lock, res_ptr->total_waiters_mode);
compat2 = lock_compat (lock, res_ptr->total_holders_mode);
if (compat1 == LOCK_COMPAT_YES && compat2 == LOCK_COMPAT_YES)
  {
    entry_ptr = lock_get_new_entry (...);
    lock_initialize_entry_as_granted (entry_ptr, tran_index,
                                      res_ptr, lock);
    /* position in holder list per UPR */
    lock_position_holder_entry (res_ptr, entry_ptr);

    /* update aggregate mode */
    res_ptr->total_holders_mode =
      lock_conv (lock, res_ptr->total_holders_mode);

    lock_insert_into_tran_hold_list (entry_ptr, tran_index);
    lock_update_non2pl_list (thread_p, res_ptr, tran_index, lock);

    pthread_mutex_unlock (&res_ptr->res_mutex);
    return LK_GRANTED;
  }

Key invariant: dual compatibility check. The new request must pass both total_holders_mode and total_waiters_mode. Checking only holders would allow a stream of S requests to leapfrog a waiting X request — the starvation guard from Ch. 1 §1.3.

lock_update_non2pl_list is called on every successful grant. It walks the resource’s non2pl list and marks any other-transaction non2pl entries as INCON_NON_TWO_PHASE_LOCK if the newly granted lock is incompatible with the non2pl entry’s recorded mode. This is the conflict-detection trigger described in Ch. 5 notes.

3.6 PATH C — incompatible, zero-wait timeout

When the caller specifies LK_ZERO_WAIT or LK_FORCE_ZERO_WAIT (conditional lock request), the function does not block:

// PATH C — src/transaction/lock_manager.c (annotated)
if (wait_msecs == LK_ZERO_WAIT || wait_msecs == LK_FORCE_ZERO_WAIT)
  {
    pthread_mutex_unlock (&res_ptr->res_mutex);
    if (wait_msecs == LK_ZERO_WAIT)
      {
        /* create a temporary blocked entry just for the error message */
        p = lock_get_new_entry (...);
        lock_initialize_entry_as_blocked (p, thread_p, ...);
        lock_set_error_for_timeout (thread_p, p);  /* ← sets ER_LK_... */
        lock_free_entry (..., p);
      }
    return LK_NOTGRANTED_DUE_TIMEOUT;
  }

The difference between the two zero-wait modes:

LK_ZERO_WAIT (value 0): timeout with error message. A temporary LK_ENTRY is created just to populate the error diagnostics (xasl_id, bind_index_in_tran via lock_set_error_for_timeout), then immediately freed.
LK_FORCE_ZERO_WAIT (value -2): silent timeout, no error set. Used by conditional lock probes that expect failure as a normal outcome.

3.7 PATH D — incompatible, enqueue as waiter

The standard blocking path. The new request cannot be granted, and the caller is willing to wait.

// PATH D — src/transaction/lock_manager.c (annotated)
/* allocate and initialize as blocked */
entry_ptr = lock_get_new_entry (...);
lock_initialize_entry_as_blocked (entry_ptr, thread_p, tran_index,
                                  res_ptr, lock);

/* append at the END of the waiter list (FIFO) */
prev = NULL;
for (i = res_ptr->waiter; i != NULL; i = i->next)
  prev = i;
if (prev == NULL)
  res_ptr->waiter = entry_ptr;
else
  prev->next = entry_ptr;

/* update aggregate waiter mode */
res_ptr->total_waiters_mode =
  lock_conv (lock, res_ptr->total_waiters_mode);

goto blocked;  /* → suspend thread */

At the blocked: label, the thread is suspended:

blocked:
  /* release res_mutex, then suspend */
  pthread_mutex_unlock (&res_ptr->res_mutex);
  ret_val = lock_suspend (thread_p, entry_ptr, wait_msecs);

On resume, the return value of lock_suspend determines the outcome:

`ret_val`	Meaning	Action
`LOCK_RESUMED`	Lock granted by a releaser	Continue to `lock_conversion_treatement`
`LOCK_RESUMED_TIMEOUT`	Wait timed out	`lock_internal_perform_unlock_object` removes the waiter, return `LK_NOTGRANTED_DUE_TIMEOUT`
`LOCK_RESUMED_ABORTED`	Deadlock victim	Same cleanup, return `LK_NOTGRANTED_DUE_ABORTED`
`LOCK_RESUMED_INTERRUPT`	Server shutdown	Return `LK_NOTGRANTED_DUE_ERROR`

Key detail: who grants the waiter? The suspended thread does not re-check compatibility on resume. Instead, when a holder releases its lock (lock_internal_perform_unlock_object, Ch. 5), the releaser calls lock_grant_blocked_holder and lock_grant_blocked_waiter, which walk the waiter list, check compatibility against the updated total_holders_mode, and move grantable waiters to the holder list before waking them. The resumed thread finds its entry already in the holder list with granted_mode set.

3.8 PATH E — re-entrance (same tran, same or weaker mode)

When a transaction requests a lock it already holds, and the conversion result (lock_conv(requested, granted)) equals the currently granted mode, the request is a no-op upgrade — just bump the count:

// PATH E — src/transaction/lock_manager.c
lock_tran_lk_entry:
  new_mode = lock_conv (lock, entry_ptr->granted_mode);
  if (new_mode == entry_ptr->granted_mode)
    {
      entry_ptr->count += 1;
      /* instant lock tracking ... */
      return LK_GRANTED;       /* no res_mutex needed */
    }

This path does not need res_mutex when reached via the lock_find_class_entry fast path for class locks. lock_find_class_entry acquires hold_mutex (the per-transaction mutex) to safely walk the class_hold_list, but never touches res_mutex. The count increment itself is safe because only one thread per transaction can be in the lock manager at a time.

3.9 PATH F — conversion granted immediately

The transaction already holds a lock but requests a stronger mode. The conversion result (new_mode) differs from granted_mode, but is compatible with all other holders:

// PATH F — src/transaction/lock_manager.c (annotated)
/* compute aggregate of OTHER holders (excluding myself) */
group_mode = NULL_LOCK;
for (i = res_ptr->holder; i != NULL; i = i->next)
  {
    if (i != entry_ptr)
      group_mode = lock_conv (i->granted_mode, group_mode);
  }

compat1 = lock_compat (new_mode, group_mode);
if (compat1 == LOCK_COMPAT_YES)
  {
    entry_ptr->granted_mode = new_mode;  /* upgrade in place */
    entry_ptr->count += 1;
    res_ptr->total_holders_mode =
      lock_conv (lock, res_ptr->total_holders_mode);
    lock_update_non2pl_list (...);
    pthread_mutex_unlock (&res_ptr->res_mutex);
    goto lock_conversion_treatement;
  }

Key difference from PATH B: the compatibility check is against group_mode (other holders only), not total_holders_mode (which includes this transaction’s own mode). A transaction cannot conflict with itself.

lock_conversion_treatement [sic — typo in source] handles post-grant cleanup for class-lock conversions. When a class lock is upgraded (e.g., IS→X), instance locks that are now subsumed are removed:

Old class mode	New class mode	Action
`IS`	`S`, `SIX`, or `X`	Remove all S instance locks
`IX`	`SIX`	Remove all S instance locks
`IX`	`X`	Remove all X instance locks
`SIX`	`X`	Remove all X instance locks

3.10 PATH G — conversion blocked (Upgrader Positioning Rule)

The conversion is incompatible with other holders, and the transaction must wait. This is the most complex path because the entry stays in the holder list (it still holds the old mode) but is repositioned to signal the pending upgrade:

// PATH G — src/transaction/lock_manager.c (annotated)
entry_ptr->blocked_mode = new_mode;  /* ← both fields now non-NULL */
entry_ptr->count += 1;
entry_ptr->thrd_entry = thread_p;

res_ptr->total_holders_mode =
  lock_conv (lock, res_ptr->total_holders_mode);

/* remove from current position in holder list */
/* ... unlink entry_ptr ... */

/* reposition per Upgrader Positioning Rule (UPR) */
lock_position_holder_entry (res_ptr, entry_ptr);

goto blocked;  /* → suspend thread */

Key invariant: granted_mode != NULL_LOCK AND blocked_mode != NULL_LOCK. This is the conversion-wait state from Ch. 1 §1.4. The entry stays in the holder list (other transactions still see the old granted_mode in compatibility checks) but is flagged as pending upgrade.

Upgrader Positioning Rule (UPR): lock_position_holder_entry places the entry based on compatibility between its blocked_mode and other holders’ modes. The goal is stronger than reducing unnecessary waits: the grant pass (lock_grant_blocked_holder) scans only the front of the holder list and stops at the first non-upgrading holder, so an upgrader that ends up behind a plain holder is invisible to every future release — position is a liveness requirement, not a tuning knob (see §3.11). The three candidate positions (ta, tb, tc) in the holder list are evaluated in priority order, and the entry is inserted at the best available position.

3.11 Upgrader Positioning Rule (UPR) — deep dive into `lock_position_holder_entry`

lock_position_holder_entry decides where in the holder list a new or repositioned entry is inserted. The holder list is not a flat FIFO — it is ordered so that conversion-waiting holders (upgraders) sit before non-upgrading holders. This ordering directly affects which waiters can be granted next when a holder releases its lock.

Why position is correctness — the buried upgrader. lock_grant_blocked_holder walks the holder list front-to-back and stops at the first entry whose blocked_mode == NULL_LOCK (while (holder != NULL && holder->blocked_mode != NULL_LOCK)). That early stop is what keeps the per-release scan cheap — but it is sound only if every upgrader sits in the front zone. With naive append (no UPR), a class-grain timeline reproduces the failure; each entry reads granted/blocked (– = NULL_LOCK):

t	event	holder list (front → back)
1	T5 `IS`, T3 `IX`, T2 `IX` — all compatible	`T5 IS/–` → `T3 IX/–` → `T2 IX/–`
2	T3 upgrades to `SIX`; T2’s `IX` conflicts → only `blocked_mode` is written, suspended in place	`T5 IS/–` → `T3 IX/SIX` → `T2 IX/–`
3	T2 commits — the grant pass starts at T5: `blocked = –` ⇒ exits, nobody examined	`T5 IS/–` → `T3 IX/SIX`
4	`SIX ∧ IS` is compatible — T3 is grantable, yet sleeps until lock timeout or deadlock victim	unchanged — forever

Note that the example needs intention modes: a plain S → X version cannot reproduce the bug — a plain S holder in front also blocks the X target, and once it releases (leaving the list) the upgrader reaches the front anyway. The liveness failure requires a remaining plain holder whose granted mode is compatible with the upgrade target — e.g. IS standing in front of an IX → SIX upgrader.

Figure 3-2 — The buried upgrader: naive append vs UPR

Figure 3-2 — The same timeline drawn. Panels ①–③ are the counterfactual (naive append): the scan stops at plain T5 and T3’s grantable upgrade sleeps forever. The green strip is the real code: lock_position_holder_entry moves T3 to the front before it sleeps, so the very next release grants it.

Two cases based on blocked_mode:

Case 1: blocked_mode == NULL_LOCK (normal grant, not upgrading)

The entry is a plain holder with no pending upgrade. It is inserted after all upgraders but before other plain holders:

// Case 1 — src/transaction/lock_manager.c
if (entry_ptr->blocked_mode == NULL_LOCK)
  {
    prev = NULL;
    i = res_ptr->holder;
    while (i != NULL)
      {
        if (i->blocked_mode == NULL_LOCK)
          break;       /* ← stop at the first non-upgrader */
        prev = i;
        i = i->next;
      }
    /* insert after prev (i.e., after all upgraders) */
  }

The resulting holder list order is: [upgraders...] [plain holders...].

Case 2: blocked_mode != NULL_LOCK (conversion wait — upgrader)

The entry is an upgrader. Three candidate insertion points are evaluated in priority order:

// Case 2 — src/transaction/lock_manager.c (annotated)
ta = tb = tc = NULL;  /* three candidates */

for each holder i in the list:
  if (i is also an upgrader)  /* i->blocked_mode != NULL_LOCK */
    {
      /* ta: first upgrader whose blocked_mode is COMPATIBLE
             with my blocked_mode */
      compat1 = lock_compat (me->blocked_mode, i->blocked_mode);
      if (ta == NULL && compat1 == YES)
        ta = i;

      /* tb: first upgrader where my blocked_mode is compatible
             with its granted_mode, BUT its blocked_mode is
             INCOMPATIBLE with my granted_mode */
      compat1 = lock_compat (me->blocked_mode, i->granted_mode);
      compat2 = lock_compat (i->blocked_mode, me->granted_mode);
      if (ta == NULL && tb == NULL
          && compat1 == YES && compat2 == NO)
        tb = i;
    }
  else  /* i is a plain holder */
    {
      /* tc: first plain (non-upgrading) holder */
      if (tc == NULL)
        tc = i;
    }

/* choose insertion point by priority: ta > tb > tc */
if (ta != NULL)      prev = tap;   /* before ta */
else if (tb != NULL) prev = tbp;   /* before tb */
else if (tc != NULL) prev = tcp;   /* before tc */
/* else: append at the end (prev from the loop) */

The three candidates explained:

Candidate	Condition	Meaning	Why insert before it
ta	My `blocked_mode` is compatible with `i`’s `blocked_mode`	I and this upgrader want compatible upgrades	We can both proceed — placing me first lets the releaser grant me sooner without blocking the other
tb	My `blocked_mode` is compat with `i`’s `granted_mode`, BUT `i`’s `blocked_mode` is incompat with my `granted_mode`	I don’t conflict with what it has, but it conflicts with what I have	If I go after it, it would block on my granted_mode and I’d block on its blocked_mode — deadlock. Putting me first avoids this
tc	First plain (non-upgrading) holder	Boundary between upgraders and plain holders	All upgraders should be ahead of non-upgraders

Priority ta > tb > tc: ta is the best position (compatible upgraders grouped together). tb is the fallback (avoid deadlock between upgraders). tc is the last resort (at least stay in the upgrader zone).

The tb case, worked. Say the entry being positioned is new = IX→SIX and the existing upgrader is i = IS→X: SIX ∧ IS = ✓ (i’s held mode does not block new) but X ∧ IX = ✗ (new’s held mode blocks i for as long as new holds). The grant pass grants a compatible prefix of the upgrader zone and stops at the first failure, so the order decides everything:

Order in the upgrader zone	What the grant pass does	Outcome
`[ new, i ]` — tb’s pick	`new`: `SIX ∧ IS = ✓` → granted; `i` gets its turn at `new`’s commit	progress
`[ i, new ]`	`i`: `X ∧ IX = ✗` → stop — `new` never examined, though grantable	both sleep — wedged until timeout

A one-directionally conflicting pair has exactly one viable order: whoever can be granted while the other still holds goes first.

Worked example:

Holder list before UPR:
  [T3: granted=S, blocked=X] → [T5: granted=IS, blocked=NULL] → [T8: granted=S, blocked=NULL]
   ^^ upgrader                   ^^ plain holder                   ^^ plain holder

New entry: T7 wants to upgrade (granted=IX, blocked=SIX)

Evaluation:

T3 (upgrader): lock_compat(SIX, X) → NO. Not ta. lock_compat(SIX, S) → NO. Not tb either.
T5 (plain): tc = T5 (first plain holder).

Result: no ta, no tb, tc = T5. Insert before T5:

After UPR:
  [T3: granted=S, blocked=X] → [T7: granted=IX, blocked=SIX] → [T5: granted=IS, blocked=NULL] → [T8: granted=S, blocked=NULL]
   ^^ upgrader                   ^^ upgrader (new)                ^^ plain holder                   ^^ plain holder

Holder list layout invariant maintained by UPR:

flowchart LR
  subgraph HL["Holder list order"]
    direction LR
    U1["upgrader 1\ngranted + blocked"]
    U2["upgrader 2\ngranted + blocked"]
    UN["..."]
    P1["plain holder 1\ngranted only"]
    P2["plain holder 2\ngranted only"]
    PN["..."]
    U1 --> U2 --> UN --> P1 --> P2 --> PN
  end

Figure 3-5 — Holder list ordering invariant. Upgraders (entries with blocked_mode != NULL_LOCK) are always positioned before plain holders. Among upgraders, the order is determined by compatibility of their blocked_modes to minimize inter-upgrader deadlocks.

Why UPR matters for correctness: When a holder releases its lock and lock_grant_blocked_holder scans the holder list to see if any upgrader can now be granted, it walks front-to-back. If upgraders with compatible goals are grouped at the front, they can be granted together in one pass. Without UPR, a newly arriving upgrader at the back might deadlock with an earlier upgrader because the grant order would be wrong.

3.12 Concrete scenario — three transactions contend for one row

flowchart LR
  subgraph T0["Initial"]
    H0["holder: T5 (S)"]
    W0["waiter: --"]
    AGG0["holders=S\nwaiters=NULL"]
  end

  subgraph T1["T7 req S (PATH B)"]
    H1["holder: T5 (S), T7 (S)"]
    W1["waiter: --"]
    AGG1["holders=S\nwaiters=NULL"]
  end

  subgraph T2["T9 req X (PATH D)"]
    H2["holder: T5 (S), T7 (S)"]
    W2["waiter: T9 (X)"]
    AGG2["holders=S\nwaiters=X"]
  end

  subgraph T3["T12 req S (PATH D!)"]
    H3["holder: T5 (S), T7 (S)"]
    W3["waiter: T9 (X), T12 (S)"]
    AGG3["holders=S\nwaiters=SIX"]
  end

  T0 --> T1 --> T2 --> T3

Figure 3-6 — Four-step contention scenario. T7’s S request is compatible (PATH B). T9’s X request fails both holder and waiter compat (PATH D). T12’s S request is compatible with total_holders_mode (S vs S = yes) but incompatible with total_waiters_mode (S vs X = no) — so T12 also blocks (PATH D), demonstrating the starvation guard: T9’s pending X prevents T12 from leapfrogging.

3.13 Diagnostic fields — `xasl_id` and `bind_index_in_tran`

At the very end of a successful acquisition (all paths that return LK_GRANTED):

// lock_internal_perform_lock_object — src/transaction/lock_manager.c
end:
  if (entry_ptr != NULL && ret_val == LK_GRANTED)
    lock_event_set_xasl_id_to_entry (tran_index, entry_ptr);

This stamps the entry with the currently executing query’s plan ID and statement index. These are read-only diagnostic fields — they do not affect lock behavior. Their sole consumer is the event-log subsystem (event_log_sql_string, event_log_bind_values) which prints them in timeout and deadlock reports so a DBA can identify which query caused the lock wait.

3.14 Performance analysis — acquisition cost by path

Not all paths cost the same. The table below ranks them from cheapest to most expensive:

Path	Cost	Synchronization	When it fires
E (re-entrance, class fast path)	O(1)	`hold_mutex` only — no `res_mutex`, no hash lookup	Same tran re-requests same class with same-or-weaker mode. Most common in OLTP.
E (re-entrance, via hash)	O(1) + hash lookup	`res_mutex`	Same tran re-requests same instance with same-or-weaker mode.
A (fresh resource)	O(1) + hash insert	Lockfree `find_or_insert` + `res_mutex`	First lock on a never-before-locked OID.
B (compatible grant)	O(1) compat check	`res_mutex`	New holder, compatible with aggregate modes.
F (conversion granted)	O(holders)	`res_mutex`	Upgrade needed, must compute `group_mode` excluding self.
D (enqueue + suspend)	O(waiters) + thread suspend	`res_mutex` + thread mutex + context switch	Incompatible — thread goes to sleep.
G (conversion blocked)	O(holders) + suspend	`res_mutex` + UPR repositioning + context switch	Upgrade incompatible with other holders.

Key performance observations:

1. The class-entry fast path (PATH E) avoids the hash table entirely.

lock_object → lock_find_class_entry (hold_mutex only)
            → found → lock_tran_lk_entry → count++ → return

This is the hottest path in OLTP: every DML statement re-requests the class intention lock (IS/IX) that was already acquired by lock_scan. The cost is one hold_mutex lock/unlock + one linked list walk of class_hold_list (typically < 10 entries) + one integer increment. No hash lookup, no res_mutex.

2. The O(1) aggregate-mode check eliminates per-holder scanning for PATH B.

Without total_holders_mode / total_waiters_mode, every new lock request would need to walk the entire holder list to check compatibility — O(holders per resource). With the cached aggregates, PATH B does two table lookups:

compat1 = lock_Comp[lock][res_ptr->total_waiters_mode];  // O(1)
compat2 = lock_Comp[lock][res_ptr->total_holders_mode];  // O(1)

The trade-off: total_holders_mode must be recomputed on every holder-list change (O(holders) in Ch. 5 §5.6). But since grant is far more frequent than release in steady-state OLTP, the net win is significant.

3. Escalation check on every instance lock request is cheap.

lock_escalate_if_needed → lock_check_escalate is:

One hold_mutex lock/unlock
One integer comparison (ngranules < threshold)
Return false (the common case — no escalation)

Total: ~100ns. This is acceptable because the alternative — unbounded instance lock accumulation — has far worse consequences (memory pressure, O(n) hold-list walks).

4. The hash function’s slot-distribution design pays off under concurrent scans.

Consecutive OIDs on the same heap page (slotid 1, 2, 3, …) land in different hash buckets (Ch. 2 §2.3, Figure 2-2). Two threads scanning the same table in parallel hit different res_mutex instances, so they never contend at the hash level. A naive hash would cluster them in the same bucket, serializing the scans.

3.15 Chapter summary — key takeaways

Two public APIs, one internal function. lock_scan handles class-grain intention locks; lock_object handles all three resource types with multi-granularity preparation. Both converge on lock_internal_perform_lock_object.
Seven paths (A–G) cover every combination of resource state, holder status, compatibility, and wait policy.
Dual compatibility check (PATH B/D): a new request must be compatible with both total_holders_mode and total_waiters_mode. The waiter check is the starvation guard.
Re-entrance (PATH E) is a count bump with no mutex needed when reached via the class-entry fast path.
Conversion (PATH F/G) checks against other holders only (excluding self). Blocked conversions set both granted_mode and blocked_mode, leaving the entry in the holder list but repositioned per UPR.
Waiters are granted by the releaser, not by the waking thread. The resume path does not re-check compatibility.
lock_update_non2pl_list is called on successful grants where the resource already has prior state (PATH B, F) to mark conflicting non2pl entries for decache notification. PATH A (fresh resource) skips this — the non2pl list is empty.

Chapter 4: Lock Conversion and Re-entrance

Chapter 3 covered where conversion and re-entrance appear in the acquisition state machine (PATH E/F/G). This chapter zooms in on the mechanics: how the conversion table works, how re-entrance counting interacts with unlock, how blocked conversions are granted, and what cleanup happens after a class-lock upgrade.

4.1 When conversion happens — the trigger condition

Conversion occurs when a transaction requests a lock on a resource it already holds. The function detects this by scanning the holder list for a matching tran_index:

// lock_internal_perform_lock_object — src/transaction/lock_manager.c
entry_ptr = res_ptr->holder;
while (entry_ptr != NULL)
  {
    if (entry_ptr->tran_index == tran_index)
      break;    /* ← found: I am already a holder */
    entry_ptr = entry_ptr->next;
  }

If found, the code jumps to lock_tran_lk_entry: — the conversion path. If not found, it is a new lock request (PATH A–D).

For class locks, there is a fast path that avoids the hash table entirely: lock_find_class_entry walks the transaction’s class_hold_list (O(classes held) instead of O(holders on the resource)):

// lock_find_class_entry — src/transaction/lock_manager.c
entry_ptr = tran_lock->class_hold_list;
while (entry_ptr != NULL)
  {
    if (OID_EQ (&entry_ptr->res_head->key.oid, class_oid))
      break;
    entry_ptr = entry_ptr->tran_next;
  }

This fast path bypasses find_or_insert and res_mutex for the re-entrance case (PATH E). It does acquire hold_mutex (the per-transaction mutex) to walk the list safely, but avoids the heavier per-resource res_mutex. This is the most common conversion scenario.

4.2 The conversion table — `lock_Conv[requested][current]`

The conversion result is the least upper bound (LUB) of the requested mode and the currently held mode:

new_mode = lock_conv (lock, entry_ptr->granted_mode);
// equivalent to: new_mode = lock_Conv[lock][entry_ptr->granted_mode];

The full 12×12 table is in lock_table.c. The practically important subset:

requested ↓ \ current →	IS	S	IX	SIX	U	X
IS	IS	S	IX	SIX	—	X
S	S	S	SIX	SIX	U	X
IX	IX	SIX	IX	SIX	—	X
SIX	SIX	SIX	SIX	SIX	—	X
U	—	U	—	—	U	X
X	X	X	X	X	X	X

Reading the table: lock_Conv[S][IX] = SIX means “if I hold IX and request S, the result is SIX.”

NA_LOCK entries (marked —) mean the combination is undefined and triggers an assert failure. These are combinations that should never arise in practice (e.g., U_LOCK is only requested when already holding S_LOCK).

4.3 Re-entrance — PATH E in detail

Re-entrance is the most common conversion: the transaction requests a mode that is equal to or weaker than what it already holds. The conversion result equals the current mode, so no actual upgrade is needed.

// PATH E — lock_tran_lk_entry in lock_internal_perform_lock_object
new_mode = lock_conv (lock, entry_ptr->granted_mode);
if (new_mode == entry_ptr->granted_mode)
  {
    entry_ptr->count += 1;   /* ← just bump the counter */
    // ... instant lock tracking ...
    return LK_GRANTED;
  }

The count field is critical for correct unlock behavior. Each lock_object call increments count; each lock_unlock_object call decrements it. The lock is only actually released when count reaches zero. This prevents a nested caller from accidentally releasing a lock that an outer caller still needs.

Example — nested re-entrance:

lock_object(oid=R1, mode=S)   → count=1, granted_mode=S
  lock_object(oid=R1, mode=S) → count=2, granted_mode=S  (PATH E)
  lock_unlock_object(R1, S)   → count=1, lock STAYS
lock_unlock_object(R1, S)     → count=0, lock RELEASED

4.4 Immediate conversion — PATH F in detail

When new_mode != granted_mode (a real upgrade is needed), the function must check compatibility with other holders only (excluding self):

// PATH F — lock_internal_perform_lock_object (annotated)
group_mode = NULL_LOCK;
for (i = res_ptr->holder; i != NULL; i = i->next)
  {
    if (i != entry_ptr)          /* ← skip myself */
      group_mode = lock_conv (i->granted_mode, group_mode);
  }

compat1 = lock_compat (new_mode, group_mode);
if (compat1 == LOCK_COMPAT_YES)
  {
    /* upgrade in place */
    entry_ptr->granted_mode = new_mode;
    entry_ptr->count += 1;
    res_ptr->total_holders_mode =
      lock_conv (lock, res_ptr->total_holders_mode);
    lock_update_non2pl_list (...);
    goto lock_conversion_treatement;
  }

Why exclude self? A transaction cannot conflict with its own lock. If T holds S and requests X, the total_holders_mode might be SIX (because another transaction holds IX). But group_mode (other holders only) might be IX, and lock_compat(X, IX) == NO — so the conversion blocks. Without the self-exclusion, the compatibility check would use total_holders_mode = SIX and incorrectly conclude the conversion is impossible even when the only S holder is the requesting transaction itself.

Concrete example:

Resource R: holder list = [T5(IX), T7(S)]
T7 requests X_LOCK on R.

new_mode = lock_conv(X, S) = X
group_mode = T5's granted_mode = IX
lock_compat(X, IX) = NO → PATH G (blocked)

But if R only had: [T7(S)]
group_mode = NULL_LOCK
lock_compat(X, NULL) = YES → PATH F (granted, upgrade S→X)

4.5 Blocked conversion — PATH G in detail

When the upgrade is incompatible with other holders, the entry enters the dual-field state — the defining characteristic of a blocked conversion:

// PATH G — lock_internal_perform_lock_object (annotated)
entry_ptr->blocked_mode = new_mode;   /* upgrade target */
entry_ptr->count += 1;
entry_ptr->thrd_entry = thread_p;

/* update total_holders_mode to reflect the REQUESTED mode */
res_ptr->total_holders_mode =
  lock_conv (lock, res_ptr->total_holders_mode);

/* remove from current position, reposition per UPR */
// ... unlink entry_ptr from holder list ...
lock_position_holder_entry (res_ptr, entry_ptr);

goto blocked;   /* → lock_suspend */

Key invariant during blocked conversion:

granted_mode = old mode (still effective — other transactions see this in compatibility checks)
blocked_mode = new mode (the upgrade being waited for)
Entry stays in the holder list (not the waiter list)
Thread is suspended

total_holders_mode includes the requested mode even though it is not yet granted. This is conservative: it ensures that new requests see the strongest possible contention picture. The trade-off is that some requests that could theoretically coexist with the old mode will be blocked. This is intentional — it prevents starvation of the upgrader.

4.6 How blocked conversions are granted — `lock_grant_blocked_holder`

When a holder releases its lock (Ch. 5), the release path calls lock_grant_blocked_holder. This function scans the holder list front-to-back (which is why UPR positions upgraders at the front):

// lock_grant_blocked_holder — src/transaction/lock_manager.c (annotated)
static void
lock_grant_blocked_holder (THREAD_ENTRY * thread_p, LK_RES * res_ptr)
{
  holder = res_ptr->holder;
  while (holder != NULL && holder->blocked_mode != NULL_LOCK)
    {
      /* compute aggregate of all holders AFTER this one */
      mode = NULL_LOCK;
      for (h = holder->next; h != NULL; h = h->next)
        mode = lock_conv (h->granted_mode, mode);

      compat = lock_compat (holder->blocked_mode, mode);
      if (compat == LOCK_COMPAT_NO)
        break;                    /* ← stop granting */

      /* grant the conversion */
      holder->granted_mode = holder->blocked_mode;
      holder->blocked_mode = NULL_LOCK;

      /* reposition: move from upgrader zone to plain-holder zone */
      // ... reposition per UPR ...

      lock_update_non2pl_list (...);
      lock_resume (holder, LOCK_RESUMED);   /* wake up */

      holder = next;
    }
}

The algorithm walks upgraders only (stops at the first entry with blocked_mode == NULL_LOCK). For each upgrader, it computes the aggregate of holders after it in the list (not before). This asymmetry is what UPR enables: compatible upgraders at the front can be granted together because their blocked_modes don’t conflict with the modes behind them.

Grant sequence:

granted_mode = blocked_mode (the upgrade takes effect)
blocked_mode = NULL_LOCK (no longer an upgrader)
Reposition to the plain-holder zone via UPR
lock_update_non2pl_list for conflict detection
lock_resume(LOCK_RESUMED) wakes the suspended thread

4.7 How blocked waiters are granted — `lock_grant_blocked_waiter`

After granting blocked holders, the release path calls lock_grant_blocked_waiter. This function scans the waiter list front-to-back:

// lock_grant_blocked_waiter — src/transaction/lock_manager.c (annotated)
static int
lock_grant_blocked_waiter (THREAD_ENTRY * thread_p, LK_RES * res_ptr)
{
  waiter = res_ptr->waiter;
  while (waiter != NULL)
    {
      compat = lock_compat (waiter->blocked_mode,
                            res_ptr->total_holders_mode);
      if (compat == LOCK_COMPAT_NO)
        break;                    /* ← stop granting */

      /* grant the waiter */
      waiter->granted_mode = waiter->blocked_mode;
      waiter->blocked_mode = NULL_LOCK;

      /* move from waiter list to holder list */
      // ... remove from waiter list ...
      lock_position_holder_entry (res_ptr, waiter);

      /* update aggregates */
      res_ptr->total_holders_mode =
        lock_conv (waiter->granted_mode, res_ptr->total_holders_mode);

      /* add to transaction hold list */
      lock_insert_into_tran_hold_list (waiter, owner_tran_index);

      lock_update_non2pl_list (...);
      lock_resume (waiter, LOCK_RESUMED);

      waiter = next;
    }

  /* recompute total_waiters_mode from remaining waiters */
  if (change_total_waiters_mode)
    {
      mode = NULL_LOCK;
      for (w = res_ptr->waiter; w != NULL; w = w->next)
        mode = lock_conv (w->blocked_mode, mode);
      res_ptr->total_waiters_mode = mode;
    }
}

Key differences from lock_grant_blocked_holder:

Aspect	`lock_grant_blocked_holder`	`lock_grant_blocked_waiter`
Scans	Holder list (upgraders)	Waiter list
Compat check against	Aggregate of holders after this entry	`total_holders_mode` (all holders)
On grant	`granted_mode = blocked_mode` (in-place)	Move from waiter list to holder list
Stop condition	First incompatible upgrader	First incompatible waiter

Order of the two calls matters: blocked holders (upgraders) are granted first, which may change total_holders_mode. Then blocked waiters are checked against the updated total_holders_mode. This ensures upgraders have priority over new waiters — they were here first and already hold a weaker lock.

4.8 Post-conversion cleanup — `lock_conversion_treatement`

After a class-lock conversion is granted (PATH F, or PATH G after resume), instance locks that are now subsumed by the stronger class lock are removed:

// lock_conversion_treatement — src/transaction/lock_manager.c
if (entry_ptr->res_head->key.type == LOCK_RESOURCE_CLASS
    && lock_conversion == true)
  {
    new_mode = entry_ptr->granted_mode;
    switch (old_mode)
      {
      case IS_LOCK:
        if (IS_WRITE_EXCLUSIVE_LOCK (new_mode)
            || new_mode == S_LOCK || new_mode == SIX_LOCK)
          lock_remove_all_inst_locks (thread_p, tran_index,
                                      oid, S_LOCK);
        break;

      case IX_LOCK:
        if (new_mode == SIX_LOCK)
          lock_remove_all_inst_locks (..., S_LOCK);
        else if (IS_WRITE_EXCLUSIVE_LOCK (new_mode))
          lock_remove_all_inst_locks (..., X_LOCK);
        break;

      case SIX_LOCK:
        /* new_mode == X_LOCK */
        lock_remove_all_inst_locks (..., X_LOCK);
        break;
      }
  }

What gets removed and why:

Old class lock	New class lock	Remove instance locks	Reason
IS	S, SIX, X	All S instance locks	Class S subsumes instance S
IX	SIX	All S instance locks	SIX = S + IX; the S part subsumes instance S
IX	X	All X instance locks	Class X subsumes everything
SIX	X	All X instance locks	Was already S+IX; now full X

This is a memory optimization: the subsumed instance LK_ENTRYs are freed and ngranules on the class entry is decremented. Without this cleanup, a class-lock escalation would leave behind redundant instance entries that waste memory and slow down hold-list traversal.

4.9 Concrete scenario — conversion lifecycle

flowchart LR
  subgraph S1["1. T7 holds S on row R"]
    H1["holder: T7(S, count=1)"]
    AGG1["holders=S"]
  end

  subgraph S2["2. T5 acquires S on R (PATH B)"]
    H2["holder: T7(S), T5(S)"]
    AGG2["holders=S"]
  end

  subgraph S3["3. T7 requests X (PATH G)"]
    H3["holder: T7(S, blocked=X), T5(S)"]
    AGG3["holders=X\n(conservative)"]
  end

  subgraph S4["4. T5 releases S"]
    H4["holder: T7(X, count=2)"]
    AGG4["holders=X"]
  end

  S1 --> S2 --> S3 --> S4

Figure 4-1 — Conversion lifecycle. Step 3: T7’s S→X conversion is blocked by T5’s S (PATH G). total_holders_mode becomes X (conservative). Step 4: T5 releases, lock_grant_blocked_holder grants T7’s upgrade: granted_mode becomes X, blocked_mode becomes NULL_LOCK, T7’s thread resumes. Note count=2 because both the original S request and the X request each incremented it.

4.10 Performance analysis — conversion cost

Re-entrance (PATH E) is nearly free; real conversion is O(holders).

Scenario	Cost	Bottleneck
Re-entrance (same or weaker mode)	O(1): `lock_conv` table lookup + `count++`	None — no list walk, no compat check
Conversion granted (PATH F)	O(holders): must walk holder list to compute `group_mode` excluding self	The self-exclusion loop. A resource with 100 holders means 100 iterations.
Conversion blocked (PATH G)	O(holders) + UPR repositioning + suspend	Same as PATH F plus the UPR walk and a context switch
Grant cascade (`lock_grant_blocked_holder`)	O(upgraders × holders-after-each)	On resume, each upgrader checks compat against all holders behind it. With k upgraders and n total holders, worst case is O(k × n).

Why group_mode computation is O(holders), not O(1):

For a new lock request (PATH B), the aggregate total_holders_mode suffices — O(1) check. But for conversion (PATH F/G), the aggregate includes the requesting transaction’s own mode, which must be excluded. There is no way to “subtract” one mode from an aggregate LUB in O(1) — the LUB is not invertible. So the code walks the holder list, skipping entry_ptr, and recomputes the aggregate from scratch.

This is acceptable because:

Conversion is less frequent than new grants (most lock requests are fresh or re-entrant).
The holder count per resource is typically small (< 10 in OLTP).
The walk is under res_mutex, which is per-resource — other resources are unaffected.

Post-conversion cleanup (lock_conversion_treatement) is O(instance locks):

When a class lock is upgraded (IS→S, IX→X, etc.), lock_remove_all_inst_locks walks the transaction’s inst_hold_list and removes every instance lock of the specified mode for that class. This is O(total instance locks held by the transaction), not O(instances of that class) — the list is not partitioned by class. For a transaction holding 5000 instance locks across 3 classes, the walk examines all 5000 even if only 1000 belong to the escalated class.

4.11 Chapter summary — key takeaways

Conversion is triggered when a holder requests the same resource again with a different mode. The conversion table (lock_Conv) computes the LUB of the two modes.
Re-entrance (PATH E) is the common case — same or weaker mode, just bump count. No res_mutex needed on the class fast path.
count controls unlock depth. Each lock_object increments it; each lock_unlock_object decrements it. The lock is only released when count reaches zero.
Immediate conversion (PATH F) checks against other holders only (self-exclusion). This prevents a transaction from dead-locking with itself.
Blocked conversion (PATH G) sets both granted_mode and blocked_mode, stays in the holder list, and is repositioned per UPR. total_holders_mode conservatively includes the requested mode.
Grant order: upgraders first, then waiters. lock_grant_blocked_holder scans front-to-back among upgraders, then lock_grant_blocked_waiter scans the waiter list against the updated total_holders_mode.
Post-conversion cleanup removes instance locks subsumed by the upgraded class lock, freeing memory and reducing hold-list size.

Chapter 5: Lock Release and the NON2PL Protocol

This chapter traces the full unlock path — from the public API through the internal release mechanics, then into the NON2PL protocol that tracks early-released locks under READ COMMITTED.

5.1 Three public unlock entry points

flowchart LR
  UO["lock_unlock_object\n(oid, class_oid, lock, force)"]
  UDN["lock_unlock_object_donot_move_to_non2pl\n(oid, class_oid, lock)"]
  UA["lock_unlock_all\n(commit/rollback)"]

  INT["lock_internal_perform_unlock_object\n(entry, release_flag, move_to_non2pl)"]

  UO --> INT
  UDN --> INT
  UA --> INT

Figure 5-1 — Three unlock entry points. They differ in release_flag and move_to_non2pl parameters, which control whether the lock is fully removed and whether a non2pl shadow record is created.

Entry point	`release_flag`	`move_to_non2pl`	When used
`lock_unlock_object(force=true)`	false	true	Forced mid-transaction release
`lock_unlock_object(force=false)`	false	true	Isolation-gated release (see §5.2)
`lock_unlock_object_donot_move_to_non2pl`	false	false	Btree FK checks, catalog lookups — caller knows no stale-cache risk
`lock_unlock_all`	true	false	Commit/rollback — unconditional release of everything

5.2 Unlock caller patterns (see Ch. 3 §3.2 for the full map)

The full caller map (Figure 3-3) is in Ch. 3 §3.2. The unlock-specific patterns worth highlighting:

Unlock API	Primary callers	Why this variant
`lock_unlock_object(force=false)`	`locator_sr.c`, `heap_file.c`, `scan_manager.c`	Standard isolation-gated release — the four-stage filter in §5.3 applies
`lock_unlock_object(force=true)`	`locator_sr.c` (DDL rollback), `serial.c`	Forced release bypassing isolation check — caller knows the lock must go
`lock_unlock_object_donot_move_to_non2pl`	`btree.c` (FK checks), `system_catalog.c`, `locator_sr.c`	Rows never surfaced to client — no stale-cache risk, non2pl tracking skipped
`lock_unlock_all`	`log_manager.c` only	Commit/rollback — unconditional bulk release
`lock_stop_instant_lock_mode`	`query_executor.c`	Releases all instant-duration locks at end of WHERE evaluation

5.3 `lock_unlock_object` — the isolation policy gate

This is the most commonly called unlock function. It implements the isolation-level policy before delegating to the internal function:

// lock_unlock_object — src/transaction/lock_manager.c (annotated)
void
lock_unlock_object (THREAD_ENTRY * thread_p, const OID * oid,
                    const OID * class_oid, LOCK lock, bool force)
{
  if (force == true)
    {
      entry_ptr = lock_find_tran_hold_entry (...);
      lock_internal_perform_unlock_object (thread_p, entry_ptr,
                                           false, true);
      return;
    }

  /* force == false: apply isolation policy */
  if (lock != S_LOCK)
    {
      /* These will not be released. */
      return;                    /* ← X, IX, SIX, etc. stay to commit */
    }

  isolation = logtb_find_isolation (tran_index);
  switch (isolation)
    {
    case TRAN_SERIALIZABLE:
    case TRAN_REPEATABLE_READ:
      return;                    /* nothing to do — S locks held to commit */

    case TRAN_READ_COMMITTED:
      lock_unlock_object_by_isolation (thread_p, tran_index,
                                       isolation, class_oid, oid);
      break;
    }
}

Three filters that prevent release:

flowchart TD
  REQ["lock_unlock_object(force=false)"]
  F1{"lock != S_LOCK?"}
  F1 -- "yes (X, IX, ...)" --> KEEP1["KEEP: write locks\nnever released early"]
  F1 -- "no (S_LOCK)" --> F2{"isolation >= RR?"}
  F2 -- "yes (RR/SER)" --> KEEP2["KEEP: S locks\nheld to commit"]
  F2 -- "no (RC)" --> F3["lock_unlock_object_by_isolation"]
  F3 --> F3A{"root or class OID?"}
  F3A -- "yes" --> KEEP3["KEEP: class locks\nnever released early"]
  F3A -- "no" --> F3B{"mvcc_is_mvcc_disabled_class?"}
  F3B -- "yes" --> REL["RELEASE:\nlock_unlock_shared_inst_lock\n(move_to_non2pl=true)"]
  F3B -- "no (MVCC table)" --> KEEP4["KEEP: snapshot-based,\nno lock to release"]

Figure 5-2 — Four-stage filter in the non-forced unlock path. Only S locks, under RC, on MVCC-disabled instance OIDs, are actually released. Everything else is kept to commit.

Key insight: In a typical MVCC workload, the vast majority of lock_unlock_object(force=false) calls hit KEEP4 — MVCC-enabled tables use snapshots, not locks, for reads. The only locks that reach the actual release path are S locks on MVCC-disabled classes (serials, root class, collation class, HA apply-info class).

5.4 `lock_internal_perform_unlock_object` — the complete flow

This is the internal workhorse. It handles three cases depending on where the entry is found:

flowchart TD
  START["lock_internal_perform_unlock_object\n(entry_ptr, release_flag, move_to_non2pl)"]

  START --> DEC{"release_flag == false?"}
  DEC -- "yes" --> COUNT["count--\n(+ instant_lock_count-- if instant mode)"]
  COUNT --> EARLY{"blocked_mode == NULL\nAND count > 0?"}
  EARLY -- "yes" --> RET["RETURN early\n(other callers still need this lock)"]
  EARLY -- "no" --> MUTEX
  DEC -- "no (release_flag=true)" --> MUTEX

  MUTEX["pthread_mutex_lock(res_mutex)"]
  MUTEX --> FIND_H{"find entry in\nholder list?"}
  FIND_H -- "yes" --> CASE_H["CASE H: holder release"]
  FIND_H -- "no" --> FIND_W{"find entry in\nwaiter list?"}
  FIND_W -- "yes" --> CASE_W["CASE W: waiter removal\n(timeout/deadlock victim)"]
  FIND_W -- "no" --> ERR["ERROR: lost transaction"]

  CASE_H --> H_CHECK{"release_flag==false\nAND count > 0?"}
  H_CHECK -- "yes" --> H_BLOCKED["Blocked holder cleanup:\nblocked_mode = NULL\nreposition per UPR"]
  H_CHECK -- "no" --> H_RELEASE["Full release:\n1. Remove from tran hold list\n2. Decrement class granules\n3. If move_to_non2pl: lock_add_non2pl_lock\n4. Free entry"]

  H_RELEASE --> RECOMP["Recompute total_holders_mode"]
  H_BLOCKED --> RECOMP
  RECOMP --> EMPTY{"holder==NULL\nAND waiter==NULL?"}
  EMPTY -- "yes, non2pl==NULL" --> REMOVE["lock_remove_resource\n(erase from hash table)"]
  EMPTY -- "yes, non2pl!=NULL" --> UNLOCK["unlock res_mutex\n(resource kept for non2pl)"]
  EMPTY -- "no" --> GRANT["lock_grant_blocked_holder\nlock_grant_blocked_waiter\nunlock res_mutex"]

Figure 5-3 — Complete flow of lock_internal_perform_unlock_object. The early return (count > 0) is the re-entrance guard from Ch. 4 §4.3. CASE H is the normal holder release. CASE W handles waiter removal after timeout or deadlock.

5.5 The `release_flag` and `count` interaction

The two parameters release_flag and count together control whether the lock is actually released:

// lock_internal_perform_unlock_object — src/transaction/lock_manager.c
if (release_flag == false)
  {
    entry_ptr->count--;
    if (lock_is_instant_lock_mode (tran_index))
      entry_ptr->instant_lock_count--;

    if (entry_ptr->blocked_mode == NULL_LOCK && entry_ptr->count > 0)
      return;    /* ← early return: other callers still need this lock */
  }

`release_flag`	`count` after decrement	`blocked_mode`	Result
false	> 0	`NULL_LOCK`	Early return — lock stays, other re-entrant callers still active
false	> 0	!= `NULL_LOCK`	Continue — this is a blocked holder being cleaned up (timeout/victim)
false	0	any	Continue — last reference, proceed to actual release
true	(not decremented)	any	Continue — unconditional release (`lock_unlock_all`)

5.6 CASE H — holder release (the normal path)

When the entry is found in the holder list and the count has reached zero (or release_flag is true):

// CASE H — lock_internal_perform_unlock_object (annotated)
/* remove from holder list */
if (prev == NULL)
  res_ptr->holder = curr->next;
else
  prev->next = curr->next;

if (release_flag == false && curr->count > 0)
  {
    /* blocked holder cleanup (timeout/victim) */
    curr->blocked_mode = NULL_LOCK;
    lock_position_holder_entry (res_ptr, entry_ptr);
  }
else
  {
    /* full release */
    lock_delete_from_tran_hold_list (curr, tran_index);
    lock_decrement_class_granules (curr->class_entry);

    if (release_flag == false && move_to_non2pl == true)
      lock_add_non2pl_lock (thread_p, res_ptr, tran_index,
                            curr->granted_mode);  /* ← NON2PL creation */

    lock_free_entry (tran_index, t_entry,
                     &lk_Gl.obj_free_entry_list, curr);
  }

/* recompute total_holders_mode */
mode = NULL_LOCK;
for (i = res_ptr->holder; i != NULL; i = i->next)
  {
    mode = lock_conv (i->granted_mode, mode);
    mode = lock_conv (i->blocked_mode, mode);  /* include upgraders */
  }
res_ptr->total_holders_mode = mode;

Why this walk exists — and why it is cheap. Commit never searches for its entries: lock_unlock_all walks the transaction-side hold lists and hands each LK_ENTRY pointer directly to the unlock path (the dual-view payoff, §1.4). The only walk left is finding prev above — the resource-side list is singly linked, so splicing a node out needs its predecessor. The asymmetry is a deliberate cost trade: the transaction view is doubly linked (tran_prev) because commit unlinks thousands of entries one by one and each unlink must be O(1); the resource-side holder list stays singly linked because an object rarely has more than a handful of holders — and the same res_mutex section re-walks that list anyway (the aggregate recompute below, then the grant passes).

After the holder is removed, three outcomes based on remaining state:

Remaining state	Action
No holders, no waiters, no non2pl	`lock_remove_resource` — erase the `LK_RES` from the hash table
No holders, no waiters, but non2pl exists	Keep `LK_RES` alive (unlock `res_mutex`) — the non2pl entries still need the resource
Holders or waiters remain	`lock_grant_blocked_holder` → `lock_grant_blocked_waiter` (Ch. 4 §4.6–4.7) → unlock `res_mutex`

5.7 CASE W — waiter removal (timeout / deadlock victim)

When the entry is not in the holder list but is found in the waiter list — this happens when a suspended thread is resumed due to timeout or deadlock detection:

// CASE W — lock_internal_perform_unlock_object (annotated)
/* remove from waiter list */
if (prev == NULL)
  res_ptr->waiter = curr->next;
else
  prev->next = curr->next;

lock_free_entry (..., curr);

if (from_whom != NULL)
  lock_grant_blocked_waiter_partial (thread_p, res_ptr, from_whom);
else
  {
    /* recompute total_waiters_mode from remaining waiters */
    mode = NULL_LOCK;
    for (i = res_ptr->waiter; i != NULL; i = i->next)
      mode = lock_conv (i->blocked_mode, mode);
    res_ptr->total_waiters_mode = mode;
  }

lock_grant_blocked_waiter_partial is a variant of lock_grant_blocked_waiter that only checks waiters from a given position onward — since the removed waiter was ahead of from_whom, only the waiters behind it might now be grantable.

5.8 `lock_unlock_all` — commit/rollback release

Called at transaction end. Unconditionally releases everything:

// lock_unlock_all — src/transaction/lock_manager.c (annotated)
void
lock_unlock_all (THREAD_ENTRY * thread_p)
{
  /* 1. release all instance locks */
  entry_ptr = tran_lock->inst_hold_list;
  while (entry_ptr != NULL)
    {
      lock_internal_perform_unlock_object (thread_p, entry_ptr,
                                           true,    /* ← release_flag */
                                           false);  /* ← no non2pl */
      entry_ptr = tran_lock->inst_hold_list;  /* re-read head */
    }

  /* 2. release all class locks */
  entry_ptr = tran_lock->class_hold_list;
  while (entry_ptr != NULL)
    {
      lock_internal_perform_unlock_object (thread_p, entry_ptr,
                                           true, false);
      entry_ptr = tran_lock->class_hold_list;
    }

  /* 3. release root class lock */
  if (tran_lock->root_class_hold != NULL)
    lock_internal_perform_unlock_object (thread_p,
                                         tran_lock->root_class_hold,
                                         true, false);

  /* 4. clean up non2pl list */
  while (tran_lock->non2pl_list != NULL)
    {
      entry_ptr = tran_lock->non2pl_list;
      tran_lock->non2pl_list = entry_ptr->tran_next;
      if (entry_ptr->granted_mode == INCON_NON_TWO_PHASE_LOCK)
        tran_lock->num_incons_non2pl -= 1;
      lock_remove_non2pl (thread_p, entry_ptr, tran_index);
    }

  lock_clear_deadlock_victim (tran_index);
}

Release order is deliberate: instances → classes → root. This is leaf-to-root, matching 2PL’s requirement that locks are released in reverse acquisition order. Releasing a class lock while instance locks still exist would violate the multi-granularity hierarchy.

release_flag = true means count is not decremented — the loop ignores re-entrance depth and frees the entry unconditionally. move_to_non2pl = false because at commit/rollback time there is no need to track released locks for future conflict detection.

5.9 The NON2PL protocol — lifecycle

NON2PL is a shadow-record mechanism for detecting serialization inconsistencies under READ COMMITTED. It is not a lock mode that blocks — it is a passive tracking record.

Phase 1: Creation — lock_add_non2pl_lock

When lock_internal_perform_unlock_object releases a holder with move_to_non2pl = true:

// lock_add_non2pl_lock — src/transaction/lock_manager.c (annotated)
static LK_ENTRY *
lock_add_non2pl_lock (THREAD_ENTRY * thread_p, LK_RES * res_ptr,
                      int tran_index, LOCK lock)
{
  /* check if I already have a non2pl entry on this resource */
  non2pl = res_ptr->non2pl;
  while (non2pl != NULL)
    {
      if (non2pl->tran_index == tran_index)
        break;
      non2pl = non2pl->next;
    }

  if (non2pl != NULL)
    {
      /* merge: upgrade the existing non2pl entry's mode */
      if (non2pl->granted_mode != INCON_NON_TWO_PHASE_LOCK)
        {
          compat = lock_compat (lock, non2pl->granted_mode);
          if (compat == LOCK_COMPAT_NO)
            {
              non2pl->granted_mode = INCON_NON_TWO_PHASE_LOCK;
              tran_lock->num_incons_non2pl += 1;
            }
          else
            non2pl->granted_mode = lock_conv (lock, non2pl->granted_mode);
        }
    }
  else
    {
      /* create new non2pl entry */
      non2pl = lock_get_new_entry (...);
      lock_initialize_entry_as_non2pl (non2pl, tran_index,
                                        res_ptr, lock);
      /* link into resource non2pl list */
      non2pl->next = res_ptr->non2pl;
      res_ptr->non2pl = non2pl;
      /* link into transaction non2pl list */
      lock_insert_into_tran_non2pl_list (non2pl, tran_index);
    }
}

Key detail: the merge path. If the transaction already has a non2pl entry on this resource (e.g., it read the same serial twice in the same RC transaction, releasing S each time), the existing entry is upgraded rather than duplicated. If the new mode is incompatible with the existing non2pl mode, the entry is promoted directly to INCON_NON_TWO_PHASE_LOCK.

Phase 2: Conflict marking — lock_update_non2pl_list

Called whenever a lock is successfully granted (PATH B, F in Ch. 3). Walks the resource’s non2pl list and checks each other-transaction entry:

// lock_update_non2pl_list — src/transaction/lock_manager.c (annotated)
curr = res_ptr->non2pl;
while (curr != NULL)
  {
    if (curr->tran_index == tran_index)
      {
        /* same transaction: remove our own non2pl entry */
        lock_delete_from_tran_non2pl_list (curr, tran_index);
        lock_free_entry (..., curr);
      }
    else
      {
        /* different transaction: check for conflict */
        if (curr->granted_mode != INCON_NON_TWO_PHASE_LOCK)
          {
            compat = lock_compat (lock, curr->granted_mode);
            if (compat == LOCK_COMPAT_NO)
              {
                curr->granted_mode = INCON_NON_TWO_PHASE_LOCK;
                tran_lock->num_incons_non2pl += 1;
              }
          }
      }
    curr = curr->next;
  }

Two actions:

Same transaction: if I acquire a new lock on a resource where I have an old non2pl entry, that entry is removed — I now hold a real lock, the shadow is superseded.
Other transaction: if the newly granted lock is incompatible with the other transaction’s non2pl granted_mode, promote it to INCON_NON_TWO_PHASE_LOCK. This means: “the row you read and released has been modified by another transaction.”

Phase 3: Notification — lock_notify_isolation_incons

At the next fetch boundary, locator_sr.c calls this function:

// lock_notify_isolation_incons — src/transaction/lock_manager.c (simplified)
void
lock_notify_isolation_incons (THREAD_ENTRY * thread_p,
    bool (*fun) (const OID *, const OID *, void *), void *args)
{
  if (isolation == TRAN_REPEATABLE_READ
      || isolation == TRAN_SERIALIZABLE)
    return;    /* nothing was released under RR/SER */

  curr = tran_lock->non2pl_list;
  while (tran_lock->num_incons_non2pl > 0 && curr != NULL)
    {
      if (curr->granted_mode == INCON_NON_TWO_PHASE_LOCK)
        {
          ret_val = (*fun) (&curr->res_head->key.class_oid,
                            &curr->res_head->key.oid, args);
          /* remove from non2pl list, free entry */
          tran_lock->num_incons_non2pl -= 1;
        }
      curr = curr->tran_next;
    }
}

The callback (locator_notify_decache) adds each flagged OID to the response with LC_FETCH_DECACHE_LOCK, telling the client to invalidate its cached copy.

Phase 4: Cleanup — lock_unlock_all

At commit/rollback, lock_unlock_all walks the entire non2pl list and removes every entry via lock_remove_non2pl.

5.10 NON2PL scope — why it matters almost exclusively for serials

lock_unlock_object_by_isolation is the scope gate:

// lock_unlock_object_by_isolation — src/transaction/lock_manager.c
if (isolation != TRAN_READ_COMMITTED)
  return;                             /* RR/SER: nothing released */

if (OID_IS_ROOTOID (oid) || OID_IS_ROOTOID (class_oid))
  return;                             /* class locks kept */

if (mvcc_is_mvcc_disabled_class (class_oid))
  lock_unlock_shared_inst_lock (...); /* ← non2pl path */
else
  { /* MVCC table: "READ COMMITTED isolation uses snapshot
       instead of locks. We don't have to release anything." */ }

mvcc_is_mvcc_disabled_class returns true for only four class kinds: root class, serial class, collation class, HA apply-info class. Of these, only serial produces a realistic RC workload where S locks are acquired on instance OIDs and then released mid-transaction.

Historical context: The entire non2pl/decache infrastructure (locator_notify_decache, LC_FETCH_DECACHE_LOCK, lock_notify_isolation_incons) predates MVCC. Before MVCC, all instance reads used S locks, and RC released them early — making non2pl tracking necessary for every row. After MVCC was introduced, normal tables use snapshots for visibility, and the non2pl path narrowed to MVCC-disabled classes only. The infrastructure remains intact but its active scope is essentially serial reads under RC.

5.11 Instant lock and its relationship to NON2PL

Instant lock is a mode where locks acquired during WHERE-clause evaluation are automatically released when the evaluation phase ends.

// lock_start_instant_lock_mode — src/transaction/lock_manager.c
void lock_start_instant_lock_mode (int tran_index)
{
  tran_lock->is_instant_duration = true;
}

// lock_stop_instant_lock_mode — src/transaction/lock_manager.c
void lock_stop_instant_lock_mode (THREAD_ENTRY * thread_p,
                                   int tran_index, bool need_unlock)
{
  /* walk instance hold list */
  entry_ptr = tran_lock->inst_hold_list;
  while (entry_ptr != NULL)
    {
      count = entry_ptr->instant_lock_count;
      if (need_unlock)
        while (count > 0)
          {
            lock_internal_perform_unlock_object (thread_p, entry_ptr,
                false, true);   /* ← move_to_non2pl = true */
            count--;
          }
      entry_ptr->instant_lock_count = 0;
      entry_ptr = next_ptr;
    }
  /* same for class hold list ... */
  tran_lock->is_instant_duration = false;
}

Relationship to NON2PL: Instant lock feeds into non2pl because the release uses move_to_non2pl = true. But non2pl also exists without instant lock — any mid-transaction S lock release under RC goes through the same path.

Callers: Only query_executor.c uses instant lock mode (lock_start_instant_lock_mode / lock_stop_instant_lock_mode), specifically around condition evaluation in query execution.

Edge case — protecting non-instant locks: When is_instant_duration == true and the transaction already holds an IX or stronger lock with instant_lock_count == 0 (a non-instant lock), and the new request is incompatible with the held mode, instant mode is forcibly stopped (lock_stop_instant_lock_mode(need_unlock=false)) to prevent the non-instant lock from being accidentally released during instant mode cleanup.

5.12 Performance analysis — release cost and the grant cascade

Most unlock calls are O(1) — the count early return dominates.

In typical OLTP, a transaction acquires the same lock multiple times via re-entrance (Ch. 4 §4.3). Each lock_unlock_object just decrements count. No mutex, no list manipulation:

if (release_flag == false)
  {
    entry_ptr->count--;
    if (entry_ptr->blocked_mode == NULL_LOCK && entry_ptr->count > 0)
      return;    /* ← most unlock calls exit here */
  }

When count reaches zero — the real release cost:

Step	Cost	Notes
Acquire `res_mutex`	O(1)	Per-resource, uncontended in most cases
Find entry in holder list	O(holders)	Linear scan of singly-linked list
Remove from holder list	O(1)	Pointer unlink
Remove from tran hold list	O(1)	Doubly-linked — O(1) unlink
`lock_add_non2pl_lock` (if `move_to_non2pl`)	O(non2pl entries on this resource)	Walks non2pl list to check for existing entry to merge
Free entry to pool	O(1)	Local pool or lockfree retire
Recompute `total_holders_mode`	O(remaining holders)	Walks entire remaining holder list to compute LUB
Grant cascade	O(upgraders + grantable waiters)	See below

The total_holders_mode recomputation is the per-release O(holders) cost.

mode = NULL_LOCK;
for (i = res_ptr->holder; i != NULL; i = i->next)
  {
    mode = lock_conv (i->granted_mode, mode);
    mode = lock_conv (i->blocked_mode, mode);  /* upgraders too */
  }
res_ptr->total_holders_mode = mode;

This cannot be optimized to O(1) because the LUB operation is not invertible — you cannot “subtract” a removed mode from the aggregate without recomputing from scratch. The same limitation explains why conversion (Ch. 4 §4.10) requires O(holders).

The grant cascade — release one, potentially wake many:

After recomputing total_holders_mode, the releaser calls:

lock_grant_blocked_holder — O(upgraders × holders-behind-each)
lock_grant_blocked_waiter — O(waiters), each checked against total_holders_mode
Each granted waiter: move to holder list + lock_resume

flowchart LR
  REL["T1 releases X_LOCK"]
  RECOMP["recompute\ntotal_holders_mode\nO(remaining holders)"]
  GBH["grant_blocked_holder\n(check each upgrader)"]
  GBW["grant_blocked_waiter\n(check each waiter\nvs total_holders_mode)"]
  WAKE["resume N threads\n(N context switches)"]

  REL --> RECOMP --> GBH --> GBW --> WAKE

Figure 5-4 — Release cascade cost. A single release can trigger O(upgraders + waiters) compat checks and N thread wakeups. The worst case is a hot row with many waiters — but this is also the case where the release provides the most value.

Worst-case scenario: A hot row held by one X-lock holder with 50 S-lock waiters. When the holder releases:

Recompute total_holders_mode: O(0) (no remaining holders)
grant_blocked_holder: O(0) (no upgraders)
grant_blocked_waiter: 50 waiters all become grantable (S is compatible with S) → 50 entries moved to holder list, 50 lock_resume calls, 50 context switches

This is expensive but unavoidable — the 50 threads were suspended and need to be woken. The alternative (each woken thread re-checks compatibility independently) would be O(50²) total compat checks instead of O(50).

NON2PL cost on the grant path:

lock_update_non2pl_list is called on every successful grant (PATH B, F, and now on each cascaded grant). It walks the resource’s non2pl list — O(non2pl entries per resource). In MVCC workloads, this list is almost always empty (non2pl only exists for MVCC-disabled classes under RC). But for serial-heavy workloads, the non2pl list could accumulate entries across many RC transactions.

5.13 Chapter summary — key takeaways

Three public unlock APIs differ in release_flag and move_to_non2pl. lock_unlock_all (commit/rollback) uses release_flag=true to bypass count checks.
lock_unlock_object(force=false) passes four filters before actually releasing: only S locks, only under RC, only instances, only MVCC-disabled classes. In practice, this means serials.
count controls release depth. Each lock_object increments it; each lock_unlock_object decrements it. The lock is only freed when count reaches zero — the re-entrance guard.
After a holder release, total_holders_mode is recomputed by walking the remaining holder list, then blocked holders and waiters are granted (Ch. 4 §4.6–4.7).
Resource removal: an LK_RES is erased from the hash table only when all three lists (holder, waiter, non2pl) are empty. A non2pl-only resource stays alive.
NON2PL lifecycle: create (lock_add_non2pl_lock) → mark conflict (lock_update_non2pl_list) → notify client (lock_notify_isolation_incons) → cleanup (lock_unlock_all).
NON2PL scope narrowed post-MVCC. The infrastructure is pre-MVCC era; today only MVCC-disabled classes (serials in practice) exercise the path.
Instant lock feeds into non2pl via move_to_non2pl = true, but the two mechanisms are independent — non2pl exists without instant lock, and instant lock cleanup protects pre-existing non-instant locks.
Release order in lock_unlock_all is instances → classes → root → non2pl — leaf-to-root, matching 2PL’s reverse-order release requirement.

Chapter 6: Lock Escalation

Lock escalation replaces many fine-grain instance locks with a single coarse-grain class lock when the instance lock count exceeds a threshold. It is a memory-pressure relief valve — without it, a bulk UPDATE touching 100K rows would hold 100K LK_ENTRY structs.

6.1 When escalation is triggered

Escalation is checked at the top of every instance lock request, before the hash lookup, inside lock_internal_perform_lock_object:

// lock_internal_perform_lock_object — src/transaction/lock_manager.c
start:
  if (class_oid != NULL && !OID_IS_ROOTOID (class_oid))
    {
      /* instance lock request */
      ret_val = lock_escalate_if_needed (thread_p, class_entry,
                                          tran_index);
      // ... if escalated, class lock subsumes → return LK_GRANTED
    }

This means escalation is evaluated on every instance lock acquisition — not periodically, not lazily. The lock_check_escalate function makes the threshold check fast enough to justify this.

6.2 `lock_check_escalate` — the threshold check

// lock_check_escalate — src/transaction/lock_manager.c (annotated)
static bool
lock_check_escalate (THREAD_ENTRY * thread_p, LK_ENTRY * class_entry,
                     LK_TRAN_LOCK * tran_lock)
{
  if (class_entry == NULL)
    return false;

  if (class_entry->granted_mode == BU_LOCK)
    return false;    /* ← bulk update already has class-level lock */

  if (tran_lock->lock_escalation_on == true)
    return false;    /* ← another thread of same tran is escalating */

  superclass_entry = class_entry->class_entry;

  if (superclass_entry != NULL
      && !OID_IS_ROOTOID (&superclass_entry->res_head->key.oid))
    {
      /* class hierarchy: check superclass's ngranules */
      if (superclass_entry->ngranules
          < prm_get_integer_value (PRM_ID_LK_ESCALATION_AT))
        return false;
    }
  else if (class_entry->ngranules
           < prm_get_integer_value (PRM_ID_LK_ESCALATION_AT))
    return false;

  return true;    /* ← escalation needed */
}

Five conditions that prevent escalation:

Condition	Why
`class_entry == NULL`	No class context (shouldn’t happen for instance locks)
`granted_mode == BU_LOCK`	Bulk update already has class-level protection
`lock_escalation_on == true`	Another thread of the same transaction is already escalating — re-entrance guard
Superclass hierarchy: `superclass->ngranules < threshold`	In a class hierarchy, the superclass aggregates granule counts
Direct: `class_entry->ngranules < threshold`	Normal case — not enough instance locks yet

The threshold is PRM_ID_LK_ESCALATION_AT, a runtime configuration parameter (default value set in cubrid.conf). The ngranules field on class_entry is incremented by lock_increment_class_granules every time an instance lock is granted (Ch. 3 PATH A/B) and decremented by lock_decrement_class_granules on release (Ch. 5).

6.3 `lock_escalate_if_needed` — the escalation procedure

When lock_check_escalate returns true, the actual escalation begins:

flowchart LR
  CHK["lock_check_escalate\n→ true"]
  ABORT{"PRM_ID_LK_ROLLBACK\n_ON_LOCK_ESCALATION?"}
  ABORT -- "yes" --> ROLL["Abort transaction\nER_LK_ROLLBACK_ON_LOCK_ESCALATION"]
  ABORT -- "no" --> MODE["Determine escalated\nclass lock mode"]
  MODE --> TRY["lock_internal_perform_lock_object\n(class_oid, max_class_lock,\nFORCE_ZERO_WAIT)"]
  TRY --> OK{"granted?"}
  OK -- "yes" --> UNDO["Unlock one count\nof old class lock\n(maintain count balance)"]
  UNDO --> DONE["Return LK_GRANTED\n(caller checks subsumption)"]
  OK -- "no" --> FAIL["Reset lock_escalation_on\nReturn failure\n(fall through to normal path)"]

Figure 6-1 — Escalation procedure. The escalated class lock is attempted conditionally (FORCE_ZERO_WAIT). If it fails (e.g., another transaction holds a conflicting class lock), escalation is silently abandoned and the normal instance lock path proceeds.

The code:

// lock_escalate_if_needed — src/transaction/lock_manager.c (annotated)
static int
lock_escalate_if_needed (THREAD_ENTRY * thread_p,
                         LK_ENTRY * class_entry, int tran_index)
{
  tran_lock = &lk_Gl.tran_lock_table[tran_index];
  pthread_mutex_lock (&tran_lock->hold_mutex);

  if (lock_check_escalate (...) == false)
    {
      pthread_mutex_unlock (&tran_lock->hold_mutex);
      return LK_NOTGRANTED;   /* ← not an error, just "no escalation" */
    }

  /* option: abort instead of escalating */
  if (prm_get_bool_value (PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATION))
    {
      er_set (ER_ERROR_SEVERITY, ...,
              ER_LK_ROLLBACK_ON_LOCK_ESCALATION, ...);
      lock_set_tran_abort_reason (tran_index,
                                  TRAN_ABORT_DUE_ROLLBACK_ON_ESCALATION);
      pthread_mutex_unlock (&tran_lock->hold_mutex);
      return LK_NOTGRANTED_DUE_ABORTED;
    }

  /* set re-entrance guard */
  tran_lock->lock_escalation_on = true;

  /* determine escalated mode */
  if (class_entry->granted_mode == NULL_LOCK
      || class_entry->granted_mode == S_LOCK
      || class_entry->granted_mode == X_LOCK
      || class_entry->granted_mode == SCH_M_LOCK)
    {
      /* already strong enough — no instance locks possible */
      tran_lock->lock_escalation_on = false;
      pthread_mutex_unlock (&tran_lock->hold_mutex);
      return LK_GRANTED;
    }

  /* IS → S, IX/SIX → X */
  if (class_entry->granted_mode == IX_LOCK
      || class_entry->granted_mode == SIX_LOCK)
    max_class_lock = X_LOCK;
  else  /* IS_LOCK */
    max_class_lock = S_LOCK;

  pthread_mutex_unlock (&tran_lock->hold_mutex);

  /* attempt the escalated class lock — conditional (no wait) */
  granted = lock_internal_perform_lock_object (
    thread_p, tran_index,
    &class_entry->res_head->key.oid,
    NULL,                 /* class_oid = NULL → this is a class lock */
    max_class_lock,
    LK_FORCE_ZERO_WAIT,  /* ← conditional: don't block */
    &class_entry, NULL);

  if (granted != LK_GRANTED)
    {
      /* escalation failed — fall through to normal instance lock */
      tran_lock->lock_escalation_on = false;
      return granted;
    }

  /* success: undo the extra count from the escalation request */
  lock_internal_perform_unlock_object (thread_p, class_entry,
                                       false, true);

  tran_lock->lock_escalation_on = false;
  return LK_GRANTED;
}

6.4 Escalated mode decision — simple rule

The mode decision is deliberately simple to avoid expensive instance-list scanning:

Current class lock	Escalated to	Rationale
`IS_LOCK`	`S_LOCK`	Was reading instances → take class S
`IX_LOCK`	`X_LOCK`	Was writing instances → take class X
`SIX_LOCK`	`X_LOCK`	Was reading + writing → take class X
`S_LOCK`, `X_LOCK`, `SCH_M_LOCK`	(no-op)	Already class-level — no instance locks to escalate
`NULL_LOCK`	(no-op)	No class lock held — shouldn’t have instance locks

The comment in the source explains: “Because to count the shared and exclusive instance locks may cause high CPU usage, we used a simple rule.” Walking the instance hold list to compute the precise LUB of all instance modes would be O(instance locks) — the very thing escalation is trying to avoid.

6.5 Why `LK_FORCE_ZERO_WAIT` — conditional escalation

The escalated class lock is attempted with LK_FORCE_ZERO_WAIT (no blocking, no error message). If it fails:

The escalation is silently abandoned.
lock_escalation_on is reset.
The caller (lock_internal_perform_lock_object) proceeds with the normal instance lock path.

This is a deliberate design choice: escalation is an optimization, not a requirement. If the class lock can’t be obtained without waiting (because another transaction holds a conflicting lock), it is better to continue with per-instance locks than to block the entire transaction waiting for a class lock that might cause a cascade of conflicts.

6.6 Post-escalation: what happens to instance locks

After escalation succeeds, the caller in lock_internal_perform_lock_object checks:

if (ret_val == LK_GRANTED
    && lock_is_class_lock_escalated (
         lock_get_object_lock (class_oid, oid_Root_class_oid),
         lock) == true)
  {
    return LK_GRANTED;   /* class lock subsumes the instance request */
  }

lock_is_class_lock_escalated checks whether the now-upgraded class lock mode is strong enough to cover the requested instance lock mode (e.g., class X subsumes instance S or X).

The existing instance LK_ENTRY structs are not immediately freed by escalation itself. They are freed later via the lock_conversion_treatement cleanup (Ch. 4 §4.8) when the class lock conversion is processed, and individually as lock_unlock_all walks the instance hold list at commit.

6.7 The `PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATION` option

Some applications prefer to abort rather than escalate — class locks are coarser and may cause unexpected serialization with other transactions:

if (prm_get_bool_value (PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATION))
  {
    er_set (..., ER_LK_ROLLBACK_ON_LOCK_ESCALATION, ...);
    lock_set_tran_abort_reason (tran_index,
                                TRAN_ABORT_DUE_ROLLBACK_ON_ESCALATION);
    return LK_NOTGRANTED_DUE_ABORTED;
  }

When this parameter is true, hitting the escalation threshold causes the transaction to abort with TRAN_ABORT_DUE_ROLLBACK_ON_ESCALATION. This is checked in the caller:

// lock_internal_perform_lock_object — after lock_escalate_if_needed
if (ret_val == LK_NOTGRANTED_DUE_ABORTED)
  {
    LOG_TDES *tdes = LOG_FIND_TDES (tran_index);
    if (tdes && tdes->tran_abort_reason
        == TRAN_ABORT_DUE_ROLLBACK_ON_ESCALATION)
      goto end;    /* propagate the abort */
  }

6.8 Concrete scenario — escalation during bulk UPDATE

flowchart LR
  subgraph S1["Before escalation"]
    CL1["class A: IX_LOCK\nngranules = 4999"]
    IL1["inst locks:\nrow1(X), row2(X),\n..., row4999(X)"]
  end

  subgraph S2["5000th row request triggers escalation"]
    CHK2["lock_check_escalate:\nngranules(4999) >= threshold(5000)?\nNot yet → grant row5000 normally"]
  end

  subgraph S3["5001st row request"]
    CHK3["lock_check_escalate:\nngranules(5000) >= threshold(5000)?\nYES → escalate"]
    ESC3["IX → X (FORCE_ZERO_WAIT)"]
  end

  subgraph S4["After escalation"]
    CL4["class A: X_LOCK\n(subsumes all instances)"]
    IL4["inst locks: still present\n(freed at commit or\nby conversion cleanup)"]
  end

  S1 --> S2 --> S3 --> S4

Figure 6-2 — Escalation during bulk UPDATE. When ngranules reaches the threshold (here 5000), the next instance lock request triggers escalation from IX to X. The class X lock subsumes all future instance requests, so no more instance LK_ENTRY structs are created. Existing instance entries are cleaned up later.

6.9 Chapter summary — key takeaways

Escalation is checked on every instance lock request — at the top of lock_internal_perform_lock_object, before the hash lookup.
The threshold is PRM_ID_LK_ESCALATION_AT (configurable). The ngranules counter on the class LK_ENTRY tracks the current instance lock count.
Escalated mode is determined by a simple rule: IS→S, IX/SIX→X. No instance-list scanning — the whole point is to avoid that cost.
Escalation is conditional (LK_FORCE_ZERO_WAIT). If the class lock can’t be obtained without blocking, escalation is silently abandoned and the normal instance lock path proceeds.
lock_escalation_on is a re-entrance guard — prevents multiple threads of the same transaction from escalating simultaneously.
PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATION allows applications to abort instead of escalate, avoiding the coarse-grain serialization that class locks impose.
Existing instance locks are not immediately freed — they are cleaned up by conversion treatment (Ch. 4 §4.8) and by lock_unlock_all at commit.

Chapter 7: Suspend/Resume and Timeout

When a lock request cannot be granted immediately (PATH D or G in Ch. 3), the requesting thread is suspended until the lock becomes available, a timeout expires, or a deadlock is detected. This chapter traces the suspend/resume machinery and the four wakeup paths.

7.1 `lock_suspend` — putting a thread to sleep

// lock_suspend — src/transaction/lock_manager.c (annotated)
static LOCK_WAIT_STATE
lock_suspend (THREAD_ENTRY * thread_p, LK_ENTRY * entry_ptr,
              int wait_msecs)
{
  /* assert: no page latches held while waiting for a lock */
  assert (lock_is_safe_lock_with_page (thread_p, entry_ptr)
          || !pgbuf_has_perm_pages_fixed (thread_p));

  /* register wait info into the thread entry */
  entry_ptr->thrd_entry->lockwait = (void *) entry_ptr;
  entry_ptr->thrd_entry->lockwait_stime =
    (tv.tv_sec * 1000000LL + tv.tv_usec) / 1000LL;  /* ms */
  entry_ptr->thrd_entry->lockwait_msecs = wait_msecs;
  entry_ptr->thrd_entry->lockwait_state = (int) LOCK_SUSPENDED;

  /* register in WFG for deadlock detection */
  lk_Gl.TWFG_node[entry_ptr->tran_index].thrd_wait_stime =
    entry_ptr->thrd_entry->lockwait_stime;
  lk_Gl.deadlock_and_timeout_detector++;

  /* record which resource we're waiting for (for diagnostics) */
  tdes->waiting_for_res = entry_ptr->res_head;

  /* === actually suspend === */
  thread_suspend_wakeup_and_unlock_entry (entry_ptr->thrd_entry,
                                           THREAD_LOCK_SUSPENDED);

  /* === resumed here === */
  lk_Gl.deadlock_and_timeout_detector--;
  lk_Gl.TWFG_node[entry_ptr->tran_index].thrd_wait_stime = 0;
  tdes->waiting_for_res = NULL;

  /* dispatch on resume reason */
  // ... see §7.3 ...
}

Key fields set before suspension:

Field	Where	Purpose
`thrd_entry->lockwait`	Thread entry	Points to the `LK_ENTRY` being waited on. Non-NULL means “this thread is suspended for a lock.”
`thrd_entry->lockwait_stime`	Thread entry	Timestamp (ms) when wait started. Used by timeout check.
`thrd_entry->lockwait_msecs`	Thread entry	Timeout deadline. `LK_INFINITE_WAIT` (-1) for no timeout.
`thrd_entry->lockwait_state`	Thread entry	Set to `LOCK_SUSPENDED`. Changed by the waker to the resume reason.
`TWFG_node[tran].thrd_wait_stime`	WFG node	Used by the deadlock daemon to know this transaction is waiting.
`deadlock_and_timeout_detector`	`lk_Gl`	Atomic counter. The deadlock daemon skips its scan when this is 0 (no threads waiting).

The page-latch safety assert: A thread must not hold any permanent page latches when it suspends for a lock. Holding a page latch while waiting for a lock could deadlock with another thread that holds the lock and needs the page latch — a classic latch-lock ordering violation. The assert catches this at development time.

7.2 `lock_resume` — waking a thread

// lock_resume — src/transaction/lock_manager.c (annotated)
static void
lock_resume (LK_ENTRY * entry_ptr, int state)
{
  /* The caller is holding the thread entry mutex */
  entry_ptr->thrd_entry->lockwait = NULL;
  entry_ptr->thrd_entry->lockwait_state = (int) state;
  entry_ptr->thrd_entry->resume_status = THREAD_LOCK_RESUMED;
  pthread_cond_signal (&entry_ptr->thrd_entry->wakeup_cond);
  thread_unlock_entry (entry_ptr->thrd_entry);
}

Protocol: The waker (whoever calls lock_resume) must:

Hold the thread entry mutex (thread_lock_entry).
Verify LK_IS_LOCKWAIT_THREAD(thrd) — the thread is still suspended.
Set lockwait = NULL and lockwait_state = state before signaling the condition variable.
Signal and release the mutex.

The suspended thread, upon waking, reads lockwait_state to determine the resume reason. Since the waker sets the state before signaling, no race exists.

7.3 Resume dispatch — six states

After thread_suspend_wakeup_and_unlock_entry returns, lock_suspend reads lockwait_state to determine why it was woken:

flowchart LR
  WAKE["thread resumes"]
  WAKE --> S1{"lockwait_state?"}
  S1 -- "LOCK_RESUMED" --> R1["Lock granted\n(entry moved to holder list\nby the releaser)"]
  S1 -- "LOCK_RESUMED_ABORTED_FIRST" --> R2["Deadlock victim (primary)\nAbort transaction,\nwait for other threads to finish"]
  S1 -- "LOCK_RESUMED_ABORTED_OTHER" --> R3["Deadlock victim (secondary)\nTreated as timeout\n(another thread handles abort)"]
  S1 -- "LOCK_RESUMED_DEADLOCK_TIMEOUT" --> R4["Deadlock timeout\n(set error, return timeout)"]
  S1 -- "LOCK_RESUMED_TIMEOUT" --> R5["Normal timeout\n(wait_msecs expired)"]
  S1 -- "LOCK_RESUMED_INTERRUPT" --> R6["Server shutdown\n(ER_INTERRUPTED)"]

Figure 7-1 — Six resume states. Only LOCK_RESUMED means the lock was granted. All other states require the caller to clean up the waiter entry via lock_internal_perform_unlock_object.

State	Who sets it	Meaning	Return value
`LOCK_RESUMED`	`lock_grant_blocked_holder` or `lock_grant_blocked_waiter` (Ch. 4 §4.6–4.7)	Lock granted — entry already in holder list	`LOCK_RESUMED`
`LOCK_RESUMED_ABORTED_FIRST`	`lock_wakeup_deadlock_victim_aborted`	This thread is the primary deadlock victim — must perform the rollback	`LOCK_RESUMED_ABORTED`
`LOCK_RESUMED_ABORTED_OTHER`	`lock_wakeup_deadlock_victim_aborted`	Another thread of the same transaction is already handling the abort	`LOCK_RESUMED_DEADLOCK_TIMEOUT`
`LOCK_RESUMED_DEADLOCK_TIMEOUT`	`lock_wakeup_deadlock_victim_timeout`	Deadlock victim woken as timeout (soft abort)	`LOCK_RESUMED_DEADLOCK_TIMEOUT`
`LOCK_RESUMED_TIMEOUT`	`lock_force_timeout_expired_wait_transactions` (deadlock daemon)	Wait time exceeded `lockwait_msecs`	`LOCK_RESUMED_TIMEOUT`
`LOCK_RESUMED_INTERRUPT`	Server shutdown	Server is shutting down — all waiters must exit	`LOCK_RESUMED_INTERRUPT`

7.4 Who wakes up suspended threads — four sources

flowchart TB
  subgraph Sources["Four wakeup sources"]
    S1["1. Lock releaser\n(lock_grant_blocked_holder\n lock_grant_blocked_waiter)\n→ LOCK_RESUMED"]
    S2["2. Deadlock daemon\n(timeout check, every 100ms)\n→ LOCK_RESUMED_TIMEOUT"]
    S3["3. Deadlock daemon\n(cycle detection)\n→ LOCK_RESUMED_ABORTED_FIRST\n   LOCK_RESUMED_ABORTED_OTHER\n   LOCK_RESUMED_DEADLOCK_TIMEOUT"]
    S4["4. Server shutdown\n(thread interrupt)\n→ LOCK_RESUMED_INTERRUPT"]
  end

  S1 --> RESUME["lock_resume(entry, state)"]
  S2 --> RESUME
  S3 --> RESUME
  S4 --> INT["thread wakeup\n(THREAD_RESUME_DUE_TO_INTERRUPT)"]

Figure 7-2 — Four sources that wake a suspended lock-waiting thread. Source 1 is the happy path. Sources 2–4 are failure paths that require cleanup.

7.5 Timeout check — `lock_force_timeout_expired_wait_transactions`

The deadlock daemon (Ch. 2 §2.6) calls this on every 100ms iteration for each suspended thread:

// lock_force_timeout_expired_wait_transactions — src/transaction/lock_manager.c
static bool
lock_force_timeout_expired_wait_transactions (void *thrd_entry)
{
  thrd = (THREAD_ENTRY *) thrd_entry;
  thread_lock_entry (thrd);

  if (LK_IS_LOCKWAIT_THREAD (thrd))
    {
      /* check for interrupt first */
      if (logtb_is_interrupted_tran (thrd, ...))
        {
          lock_resume ((LK_ENTRY *) thrd->lockwait,
                       LOCK_RESUMED_INTERRUPT);
          return true;
        }

      /* check for timeout */
      if (LK_CAN_TIMEOUT (thrd->lockwait_msecs))
        {
          etime = current_time_ms;
          if (etime - thrd->lockwait_stime > thrd->lockwait_msecs)
            {
              lock_resume ((LK_ENTRY *) thrd->lockwait,
                           LOCK_RESUMED_TIMEOUT);
              return true;
            }
        }

      thread_unlock_entry (thrd);
      return false;    /* not timed out, still waiting */
    }
  // ... thread not waiting ...
}

Timeout granularity: The daemon loops every 100ms, so a thread with wait_msecs = 50 may not be timed out until up to 100ms after its deadline. This is by design — the daemon does not use high-resolution timers. For LK_INFINITE_WAIT (-1), LK_CAN_TIMEOUT returns false and the timeout path is skipped entirely.

7.6 The `tran_next_wait` chain — multi-thread wait coordination

When multiple threads of the same transaction try to lock the same resource, the second thread finds the first thread already waiting. Instead of creating a second waiter entry, it chains itself onto the first thread’s tran_next_wait:

// lock_internal_perform_lock_object — same-tran-already-waiting branch
if (wait_entry_ptr != NULL)   /* another thread of my tran is waiting */
  {
    thrd_entry->tran_next_wait =
      wait_entry_ptr->thrd_entry->tran_next_wait;
    wait_entry_ptr->thrd_entry->tran_next_wait = thrd_entry;

    pthread_mutex_unlock (&res_ptr->res_mutex);
    thread_suspend_wakeup_and_unlock_entry (thrd_entry,
                                             THREAD_LOCK_SUSPENDED);
    goto start;    /* retry after wakeup */
  }

When the primary waiter is granted and resumes in lock_suspend, it wakes all chained threads:

// lock_suspend — after main dispatch, before return
thread_lock_entry (entry_ptr->thrd_entry);
while (entry_ptr->thrd_entry->tran_next_wait)
  {
    p = entry_ptr->thrd_entry->tran_next_wait;
    entry_ptr->thrd_entry->tran_next_wait = p->tran_next_wait;
    p->tran_next_wait = NULL;
    thread_wakeup (p, THREAD_LOCK_RESUMED);
  }
thread_unlock_entry (entry_ptr->thrd_entry);

The woken secondary threads goto start and retry the entire lock acquisition from the beginning — they do not inherit the primary thread’s grant.

7.7 The `deadlock_and_timeout_detector` counter

// lock_suspend — before and after suspension
lk_Gl.deadlock_and_timeout_detector++;    /* before suspend */
// ... thread sleeps ...
lk_Gl.deadlock_and_timeout_detector--;    /* after resume */

This atomic counter is checked by the deadlock daemon:

// deadlock_detect_task_execute — src/transaction/lock_manager.c
if (lk_Gl.deadlock_and_timeout_detector == 0)
  return;    /* no threads waiting — skip entirely */

When the counter is zero, no thread is suspended for a lock, so there is nothing to time out and no deadlock is possible. This avoids the cost of the thread-map scan and WFG construction on every 100ms tick when the system is uncontended.

7.8 Chapter summary — key takeaways

lock_suspend registers five fields in the thread entry and WFG node before suspending. The page-latch safety assert prevents latch-lock ordering deadlocks.
lock_resume sets lockwait_state before signaling — the protocol is race-free by construction.
Six resume states distinguish grant, abort (primary/secondary), deadlock timeout, normal timeout, and interrupt. Only LOCK_RESUMED means success.
Four wakeup sources: lock releaser (happy path), daemon timeout check, daemon deadlock detection, and server shutdown.
Timeout granularity is 100ms (the daemon loop period). This is acceptable for lock timeouts, which are typically in the seconds range.
Multi-thread wait coordination uses tran_next_wait chains: secondary threads piggyback on the primary waiter and retry from scratch on wakeup.
deadlock_and_timeout_detector is a fast early-exit for the daemon — when zero, the 100ms tick does no work.

Chapter 8: Deadlock Detection

The deadlock daemon (Ch. 2 §2.6, Ch. 7 §7.4) periodically constructs a transaction waits-for graph (WFG) and scans it for cycles. This chapter traces the full detection pipeline: WFG construction, cycle detection via DFS, victim selection, and resolution.

8.1 Overall pipeline — `lock_detect_local_deadlock`

flowchart LR
  P1["Phase 1\nInit WFG tables\n(reset nodes + edges)"]
  P2["Phase 2\nBuild WFG\n(scan all LK_RES,\nadd edges)"]
  P3["Phase 3\nDFS cycle detection\n(walk TWFG_node graph)"]
  P4["Phase 4\nVictim selection\n(per cycle)"]
  P5["Phase 5\nResolution\n(wake victims:\ntimeout or abort)"]

  P1 --> P2 --> P3 --> P4 --> P5

Figure 8-1 — Five-phase deadlock detection pipeline. The entire pipeline runs under DL_detection_mutex, serializing detection runs. Each phase operates on the TWFG_node[] / TWFG_edge[] arrays in lk_Gl.

Figure 8-1b — The gates in front of the pipeline

Figure 8-1b — What actually happens on a daemon tick. Three gates sit in front of the five phases — any suspended waiter at all (atomic counter), at least two suspended threads, detection interval elapsed — and most ticks die at the gray return rail for the price of one atomic read. Only when all three pass (at most once per second) does the pipeline run; phase 2’s hash-table sweep is the cost center.

8.2 Phase 1 — initialize WFG tables

// lock_detect_local_deadlock — Phase 1
for (i = 1; i < lk_Gl.num_trans; i++)
  {
    lk_Gl.TWFG_node[i].first_edge = -1;
    lk_Gl.TWFG_node[i].tran_edge_seq_num = 0;
    lk_Gl.TWFG_node[i].checked_by_deadlock_detector = true;
  }

lk_Gl.TWFG_edge = &TWFG_edge_block[0];   /* static block, 200 edges */
lk_Gl.max_TWFG_edge = LK_MIN_TWFG_EDGE_COUNT;  /* 200 */
for (i = 0; i < LK_MIN_TWFG_EDGE_COUNT; i++)
  {
    lk_Gl.TWFG_edge[i].to_tran_index = -1;
    lk_Gl.TWFG_edge[i].next = (i + 1);    /* free-list chain */
  }
lk_Gl.TWFG_edge[max - 1].next = -1;
lk_Gl.TWFG_free_edge_idx = 0;
lk_Gl.global_edge_seq_num = 0;
victim_count = 0;

The WFG is rebuilt from scratch on every detection run — there is no incremental maintenance between runs. The edge array starts from a static block (TWFG_edge_block[200]) and can grow to LK_MID_TWFG_EDGE_COUNT (1000) or LK_MAX_TWFG_EDGE_COUNT (MAX_NTRANS²) if needed.

8.3 Phase 2 — build the WFG by scanning all resources

The function iterates over every LK_RES in the hash table and adds edges for three categories of wait relationships:

// lock_detect_local_deadlock — Phase 2 (annotated)
for (res_ptr = iterator.iterate (); res_ptr != NULL; ...)
  {
    /* (a) among holders: upgrader-to-upgrader conflicts */
    for (hi = res_ptr->holder; hi != NULL; hi = hi->next)
      {
        if (hi->blocked_mode == NULL_LOCK) break;  /* end of upgraders */
        for (hj = hi->next; hj != NULL; hj = hj->next)
          {
            /* hj waits for hi? */
            if (incompat(hj->blocked_mode, hi->granted_mode)
                || incompat(hj->blocked_mode, hi->blocked_mode))
              lock_add_WFG_edge (hj→hi, holder=true);

            /* hi waits for hj? */
            if (incompat(hi->blocked_mode, hj->granted_mode))
              lock_add_WFG_edge (hi→hj, holder=true);
          }
      }

    /* (b) from waiters to holders */
    for (hi = res_ptr->holder; hi != NULL; hi = hi->next)
      for (hj = res_ptr->waiter; hj != NULL; hj = hj->next)
        {
          if (incompat(hj->blocked_mode, hi->granted_mode)
              || incompat(hj->blocked_mode, hi->blocked_mode))
            lock_add_WFG_edge (hj→hi, holder=true);
        }

    /* (c) from waiters to other waiters */
    for (hi = res_ptr->waiter; hi != NULL; hi = hi->next)
      for (hj = hi->next; hj != NULL; hj = hj->next)
        {
          if (incompat(hj->blocked_mode, hi->blocked_mode))
            lock_add_WFG_edge (hj→hi, holder=false);
        }
  }

Three edge categories:

Category	From	To	Edge meaning
(a) Holder↔Holder	Upgrader hj	Upgrader hi	hj’s upgrade blocked by hi’s granted or blocked mode
(b) Waiter→Holder	Waiter hj	Holder hi	hj can’t be granted because of hi’s granted or blocked mode
(c) Waiter→Waiter	Waiter hj	Waiter hi (earlier in queue)	hj can’t be granted even if holders release, because hi’s request (ahead in FIFO) is incompatible

The holder_flag parameter on each edge records whether the target (to_tran_index) is a holder — this is used in victim selection (§8.5) to prefer aborting a holder over a waiter.

8.4 `lock_add_WFG_edge` — edge allocation and staleness detection

// lock_add_WFG_edge — src/transaction/lock_manager.c (annotated)
static int
lock_add_WFG_edge (int from_tran_index, int to_tran_index,
                   int holder_flag, INT64 edge_wait_stime,
                   LK_ENTRY * holder, LK_ENTRY * waiter)
{
  /* skip if either transaction is already a victim */
  if (TWFG_node[from].DL_victim || TWFG_node[to].DL_victim)
    return NO_ERROR;

  /* increment global edge sequence number */
  lk_Gl.global_edge_seq_num++;

  /* handle stale transactions: if a node was not checked by the
     detector yet, it might be a new transaction that reused the
     tran_index — clear its old edges */
  if (TWFG_node[from].checked_by_deadlock_detector == false)
    {
      /* recycle old edges to free list */
      // ... chain old edges back to TWFG_free_edge_idx ...
      TWFG_node[from].first_edge = -1;
      TWFG_node[from].checked_by_deadlock_detector = true;
      TWFG_node[from].tran_edge_seq_num = global_edge_seq_num;
    }
  /* same for to_tran_index ... */

  /* allocate edge from free list (may grow the array) */
  alloc_idx = lk_Gl.TWFG_free_edge_idx;
  lk_Gl.TWFG_free_edge_idx = TWFG_edge[alloc_idx].next;

  /* populate edge */
  TWFG_edge[alloc_idx].to_tran_index = to_tran_index;
  TWFG_edge[alloc_idx].edge_seq_num = global_edge_seq_num;
  TWFG_edge[alloc_idx].holder_flag = holder_flag;
  TWFG_edge[alloc_idx].edge_wait_stime = edge_wait_stime;

  /* prepend to from_tran's edge list */
  TWFG_edge[alloc_idx].next = TWFG_node[from].first_edge;
  TWFG_node[from].first_edge = alloc_idx;
}

Edge array growth (3-tier):

Tier	Size	Storage
Initial	`LK_MIN_TWFG_EDGE_COUNT` = 200	Static `TWFG_edge_block[200]`
Mid	`LK_MID_TWFG_EDGE_COUNT` = 1000	Same static block (reuses first 200 + extends)
Max	`LK_MAX_TWFG_EDGE_COUNT` = `MAX_NTRANS²`	`malloc`’d block (copies from static)

8.5 Phase 3 — DFS cycle detection

The cycle detection is a classic DFS over the WFG node/edge structure. The ancestor field chains the DFS path, and a cycle is detected when a follow-edge leads to a node that is already on the current path (ancestor != -1):

// lock_detect_local_deadlock — Phase 3 (annotated)
for (k = 1; k < lk_Gl.num_trans; k++)
  {
    TWFG_node[k].current = TWFG_node[k].first_edge;
    TWFG_node[k].ancestor = -1;
  }

for (k = 1; k < lk_Gl.num_trans; k++)
  {
    if (TWFG_node[k].current == -1)
      continue;            /* no outgoing edges — skip */

    s = k;
    TWFG_node[s].ancestor = -2;   /* sentinel: DFS root */

    for (; s != -2; )
      {
        /* skip stale nodes/edges */
        // ... validation checks ...

        if (TWFG_node[s].current == -1)
          {
            /* backtrack: no more edges from s */
            t = TWFG_node[s].ancestor;
            TWFG_node[s].ancestor = -1;
            s = t;
            // advance parent's edge pointer
            continue;
          }

        t = TWFG_edge[TWFG_node[s].current].to_tran_index;

        /* skip stale edges */
        // ... validation checks ...

        if (TWFG_node[t].ancestor != -1)
          {
            /* CYCLE FOUND: s → t, and t is already on the path */
            lock_select_deadlock_victim (thread_p, s, t);
          }
        else
          {
            /* no cycle yet: advance DFS into t */
            TWFG_node[t].ancestor = s;
            s = t;
          }
      }
  }

Staleness checks during DFS filter out edges that have become invalid since Phase 2 — a transaction may have committed or been aborted between the WFG build and the DFS walk. Three conditions invalidate a node: checked_by_deadlock_detector == false (new transaction reused the index), thrd_wait_stime == 0 (no longer waiting), or edge_seq_num < tran_edge_seq_num (edge is from a previous incarnation of this tran_index).

8.6 Phase 4 — victim selection strategy

When a cycle is found, lock_select_deadlock_victim is called with s (the node that discovered the back-edge) and t (the node already on the DFS path — the cycle start):

Cycle: t → ... → s → t

The function first validates the cycle (false-cycle check for stale edges), then selects a victim using a priority-based strategy:

Victim Selection Strategy (from source comments):
  1) Must be a lock holder (holder_flag == true).
  2) Must be an active transaction.
  3) Prefer a transaction that does NOT have deadlock priority.
  4) Prefer a transaction that has written fewer log records.
  5) Prefer a transaction with a closer timeout.
  6) Prefer the youngest transaction (highest tranid).

Why each criterion exists:

Holder — only a holder’s rollback frees what the cycle waits on; aborting a pure waiter removes an edge but releases nothing, so the cycle persists.
Active — a transaction already committing / aborting is exiting anyway and cannot be aborted again; victimizing it resolves nothing.
No deadlock priority — the priority flag is an explicit protection knob; protected transactions survive.
Fewer log records — the cheapest rollback.
Finite timeout — that application already expects retry-style failures, so a retryable wake costs the least surprise.
Youngest — the least work thrown away; and because elders always survive ties, a long-running transaction is never re-victimized forever — the system as a whole keeps making progress.

The candidate field on TWFG_node marks which cycle members are eligible victims (they are on the holder side of at least one edge). The function walks the cycle via ancestor links, comparing each candidate against the current best victim using the six criteria in order.

// lock_select_deadlock_victim — src/transaction/lock_manager.c (annotated)
static void
lock_select_deadlock_victim (THREAD_ENTRY * thread_p, int s, int t)
{
  /* Phase A: validate cycle — check for stale edges */
  // walk cycle t → ... → s → t, mark false_dd_cycle if any
  // node/edge is stale; if false cycle, clear ancestors and return

  if (false_dd_cycle == true)
    { /* clear ancestor chain and return */ return; }

  /* Phase B: initialize victim from the edge s→t */
  if (TWFG_edge[TWFG_node[s].current].holder_flag)
    {
      /* t is a holder — candidate for victim */
      if (logtb_is_active (thread_p, logtb_find_tranid (t)))
        {
          victims[victim_count].tran_index = t;
          victims[victim_count].can_timeout =
            LK_CAN_TIMEOUT (logtb_find_wait_msecs (t));
        }
    }

  /* Phase C: walk cycle s → ancestor → ... → t, compare candidates */
  for (v = s; v != t; v = TWFG_node[v].ancestor)
    {
      if (TWFG_node[v].candidate == false)
        continue;    /* not a holder in this cycle — skip */

      tranid = logtb_find_tranid (v);
      if (!logtb_is_active (thread_p, tranid))
        continue;    /* criterion 2: must be active */

      victim_tran_index = victims[victim_count].tran_index;

      if (victim_tran_index == NULL_TRAN_INDEX)
        {
          victim_tranid = tranid;   /* first eligible candidate */
        }
      /* criterion 3: prefer no deadlock priority */
      else if (logtb_has_deadlock_priority (victim_tran_index)
               != logtb_has_deadlock_priority (v))
        {
          if (logtb_has_deadlock_priority (v) == false)
            victim_tranid = tranid;
        }
      else
        {
          tran_log_count = logtb_find_log_records_count (v);
          victim_tran_log_count =
            logtb_find_log_records_count (victim_tran_index);

          /* criterion 4: prefer fewer log records */
          if (tran_log_count < victim_tran_log_count)
            victim_tranid = tranid;
          else if (tran_log_count == victim_tran_log_count)
            {
              /* criterion 5: prefer closer timeout */
              /* criterion 6: prefer youngest (highest tranid) */
              if ((victims[victim_count].can_timeout == false
                   && can_timeout == true)
                  || (same_timeout_policy
                      && LK_ISYOUNGER (tranid, current_victim)))
                victim_tranid = tranid;
            }
        }

      if (victim_tranid != NULL_TRANID)
        {
          victims[victim_count].tran_index = v;
          victims[victim_count].tranid = victim_tranid;
          victims[victim_count].can_timeout = can_timeout;
        }
    }

  /* Phase D: record victim for event log and increment victim_count */
  if (victims[victim_count].tran_index != NULL_TRAN_INDEX)
    {
      /* collect LK_ENTRY pointers along the cycle for event log */
      // ... walk cycle again, store holder/waiter entries ...
      victim_count++;
    }
}

The six criteria applied in order (each is a tie-breaker for the previous):

flowchart TD
  C1{"1. Is it a lock holder\nin this cycle?"}
  C1 -- "no" --> SKIP["skip"]
  C1 -- "yes" --> C2{"2. Is transaction\nactive?"}
  C2 -- "no" --> SKIP
  C2 -- "yes" --> C3{"3. Does it have\ndeadlock priority?"}
  C3 -- "yes (protected)" --> KEEP["keep current victim"]
  C3 -- "no (unprotected)" --> PREFER["prefer this one"]
  C3 -- "same" --> C4{"4. Fewer log\nrecords written?"}
  C4 -- "yes (less work)" --> PREFER
  C4 -- "same count" --> C5{"5. Has finite\ntimeout?"}
  C5 -- "yes (can retry)" --> PREFER
  C5 -- "same" --> C6{"6. Younger\n(higher tranid)?"}
  C6 -- "yes" --> PREFER
  C6 -- "no" --> KEEP

Figure 8-2 — Victim selection criteria as a decision tree. Each criterion is a tie-breaker for the previous. The overall preference is: unprotected, less work done, retryable, youngest.

Resolution modes:

can_timeout = true (the victim’s wait_msecs is finite): wake with LOCK_RESUMED_DEADLOCK_TIMEOUT — treated as a soft timeout, the application can retry.
can_timeout = false (LK_INFINITE_WAIT): wake with LOCK_RESUMED_ABORTED_FIRST / _OTHER — the transaction must be rolled back.

8.7 Phase 5 — resolution (waking victims)

// lock_detect_local_deadlock — Phase 5
for (k = 0; k < victim_count; k++)
  {
    if (victims[k].can_timeout)
      lock_wakeup_deadlock_victim_timeout (victims[k].tran_index);
    else
      lock_wakeup_deadlock_victim_aborted (victims[k].tran_index);
  }

These functions (detailed in Ch. 7 §7.3–7.4) find the victim’s suspended thread and call lock_resume with the appropriate state. The resumed thread then cleans up its waiter entry via lock_internal_perform_unlock_object (Ch. 5 §5.7 CASE W).

Safety net — the no_victim_case_count escalation:

If 60 consecutive deadlock detection runs find no victims but threads are still suspended, the system suspects a livelock or undetectable deadlock. As a last resort, it forcibly times out one thread:

if (victim_count == 0)
  {
    if (lk_Gl.no_victim_case_count >= 60)
      {
        if (css_are_all_request_handlers_suspended ())
          thread_get_manager ()->map_entries (
            lock_victimize_first_thread_mapfunc);
        lk_Gl.no_victim_case_count = 0;
      }
    else
      lk_Gl.no_victim_case_count += 1;
  }

At 100ms per daemon tick, 60 runs = ~6 seconds of no progress before the safety net fires.

8.8 WFG data structures — recap from Ch. 1

For reference, the WFG structures (defined in lock_manager.c):

struct lk_WFG_node
{
  int first_edge;         /* head of outgoing edge list (-1 = none) */
  bool candidate;         /* eligible as victim in current cycle */
  int current;            /* DFS cursor into edge list */
  int ancestor;           /* DFS parent (-1 = not on path, -2 = root) */
  INT64 thrd_wait_stime;  /* wait start time (0 = not waiting) */
  int tran_edge_seq_num;  /* staleness marker */
  bool checked_by_deadlock_detector;  /* false = new tran, clear old edges */
  bool DL_victim;         /* already selected as victim */
};

struct lk_WFG_edge
{
  int to_tran_index;      /* target transaction (-2 = stale edge) */
  int edge_seq_num;       /* creation sequence number */
  int holder_flag;        /* true if target is a holder */
  int next;               /* next edge in same node's list */
  INT64 edge_wait_stime;  /* waiter's wait start time */
  LK_ENTRY *holder;       /* LK_ENTRY of the blocking holder */
  LK_ENTRY *waiter;       /* LK_ENTRY of the blocked waiter */
};

8.9 Concrete scenario — three-transaction deadlock

flowchart LR
  subgraph Build["Phase 2: Build WFG"]
    direction LR
    T1["T1: holds X on R1\nwaits S on R2"]
    T2["T2: holds S on R2\nwaits X on R3"]
    T3["T3: holds X on R3\nwaits S on R1"]
    T1 -- "edge: T1 waits T2\n(waiter->holder on R2)" --> T2
    T2 -- "edge: T2 waits T3\n(waiter->holder on R3)" --> T3
    T3 -- "edge: T3 waits T1\n(waiter->holder on R1)" --> T1
  end

  subgraph DFS["Phase 3: DFS from T1"]
    D1["visit T1 (ancestor=-2)"]
    D2["visit T2 (ancestor=T1)"]
    D3["visit T3 (ancestor=T2)"]
    D4["follow edge T3->T1:\nT1.ancestor != -1\n→ CYCLE FOUND"]
    D1 --> D2 --> D3 --> D4
  end

  subgraph Victim["Phase 4: Select victim"]
    V1["Cycle: T1 -> T2 -> T3 -> T1\nAll are holders\nAll are active\nPick youngest (highest tranid)\n→ T3 is victim"]
  end

  subgraph Resolve["Phase 5: Resolution"]
    R1["T3.can_timeout?\nyes → DEADLOCK_TIMEOUT\nno → ABORTED_FIRST"]
  end

  Build --> DFS --> Victim --> Resolve

Figure 8-3 — Three-transaction deadlock detection end-to-end. Phase 2 adds three edges forming a cycle. Phase 3 DFS discovers the cycle when T3’s edge leads back to T1. Phase 4 selects T3 (youngest by the default victim strategy). Phase 5 wakes T3 with timeout or abort.

8.10 Performance analysis — is full rebuild every run acceptable?

The WFG is rebuilt from scratch on every detection run. This section analyzes the actual cost and the design rationale.

Three gates that prevent unnecessary work:

flowchart LR
  TICK["daemon tick\n(every 100ms)"]
  G1{"deadlock_and_timeout\n_detector == 0?"}
  G1 -- "yes (no waiters)" --> SKIP1["skip entirely\n(atomic read only)"]
  G1 -- "no" --> TIMEOUT["timeout check\n(map over threads)"]
  TIMEOUT --> G2{"lock_wait_count >= 2?"}
  G2 -- "no (0 or 1 waiter)" --> SKIP2["skip WFG\n(deadlock impossible)"]
  G2 -- "yes" --> G3{"interval elapsed?\n(default 1.0s)"}
  G3 -- "no" --> SKIP3["skip WFG\n(too soon)"]
  G3 -- "yes" --> FULL["FULL WFG SCAN\nlock_detect_local_deadlock"]

Figure 8-4 — Three gates before the WFG full scan runs. In an uncontended workload, Gate 1 catches everything — cost is one atomic read per 100ms. In a lightly contended workload, Gate 2 catches single-waiter cases. Gate 3 ensures the full scan runs at most once per second.

Cost breakdown when the full scan does execute:

Phase	Time complexity	Practical cost	Bottleneck?
Node init	O(MAX_NTRANS)	~tens of µs	No
Edge init	O(200) static block	negligible	No
Hash table scan	O(all active LK_RES)	The dominant cost. Each resource: acquire `res_mutex`, scan holders/waiters, release.	Yes
Per-resource edge build	O(holders² + holders×waiters + waiters²)	Most resources have 1–2 holders, 0 waiters → O(1) per resource	Rarely
DFS cycle detection	O(nodes + edges)	nodes ≤ MAX_NTRANS, edges = actual wait relationships	No

The hash table scan is the bottleneck. If the system has 100K active lock resources, the scan performs 100K res_mutex lock/unlock pairs. However:

Most resources have no waiter — the scan checks res_ptr->holder and res_ptr->waiter, adds zero edges, and moves on. The per-resource cost is one mutex acquire + a few pointer reads + one mutex release.
res_mutex is per-resource — while scanning resource R₁, other threads can freely lock/unlock resource R₂. The scan does not serialize the entire lock manager.
The scan runs at most once per second (default PRM_ID_LK_RUN_DEADLOCK_INTERVAL = 1.0).

Design trade-off — full rebuild vs. incremental maintenance:

Approach	Hot-path overhead (lock acquire/release)	Detection cost	Complexity
Full rebuild (CUBRID)	Zero — no WFG bookkeeping on acquire/release	O(resources) per scan, ≤ 1/sec	Simple
Incremental WFG	Edge add on every block, edge remove on every grant/release	O(DFS only) per scan	High — must maintain edge consistency under concurrent modifications

CUBRID chose zero overhead on the hot path. In OLTP workloads where lock acquires happen tens of thousands of times per second, adding WFG edge maintenance to every acquire would cost more in aggregate than one full scan per second. The full-rebuild approach also avoids the complexity of maintaining graph consistency under concurrent lock state changes.

When the cost could become a problem:

Tens of thousands of active lock resources and sustained multi-thread contention (many waiters per resource).
In this scenario, the hash table scan could take tens of milliseconds. Mitigation options:
- Increase PRM_ID_LK_RUN_DEADLOCK_INTERVAL (reduce scan frequency — trades detection latency for lower overhead).
- Lower PRM_ID_LK_ESCALATION_AT (escalate sooner — reduces the total number of lock resources).
- The timeout-based fallback (wait_msecs on each lock request) still resolves deadlocks even if the WFG scan is delayed.

8.11 Chapter summary — key takeaways

The WFG is rebuilt from scratch on every detection run — no incremental maintenance. This simplifies correctness at the cost of O(resources × holders²) construction time.
Three categories of edges capture all wait relationships: holder↔holder (upgrader conflicts), waiter→holder, and waiter→waiter.
DFS cycle detection uses ancestor chains. Staleness checks during DFS filter out edges invalidated by concurrent commits.
Victim selection uses a six-criteria priority: must be holder, must be active, no deadlock priority, fewer log records, closer timeout, youngest transaction.
Two resolution modes: can_timeout → soft timeout (retryable); !can_timeout → abort (rollback required).
Edge array grows in 3 tiers: 200 (static) → 1000 (static extended) → MAX_NTRANS² (malloc). The static initial block avoids allocation for low-contention workloads.
Safety net: 60 consecutive no-victim runs (~6 seconds) triggers forced timeout of one thread to prevent server hang.

Chapter 9: Special Paths

The main lock lifecycle (acquire → convert → release) was covered in Chapters 3–5. This chapter covers the special-purpose APIs that don’t fit that main flow: instant lock probe, composite lock, lock_scan / lock_classes_lock_hint, lock demotion, and lock_subclass.

9.1 `lock_hold_object_instant` — non-blocking probe

A lightweight lock check that never blocks and never creates an LK_ENTRY. It tests whether a lock could be granted right now without actually acquiring it.

// lock_internal_hold_lock_object_instant — src/transaction/lock_manager.c
static int
lock_internal_hold_lock_object_instant (THREAD_ENTRY * thread_p,
    int tran_index, const OID * oid,
    const OID * class_oid, LOCK lock)
{
  /* shortcut: class lock subsumes the request? */
  if (class_oid != NULL && !OID_IS_ROOTOID (class_oid))
    if (lock_is_class_lock_escalated (..., lock))
      return LK_GRANTED;

  /* find (not find_or_insert!) the resource */
  res_ptr = lk_Gl.m_obj_hash_table.find (thread_p, search_key);
  if (res_ptr == NULL)
    return LK_GRANTED;    /* no resource → no contention */

  /* am I already a holder? */
  for (entry_ptr = res_ptr->holder; ...; entry_ptr = entry_ptr->next)
    if (entry_ptr->tran_index == tran_index) break;

  if (entry_ptr == NULL)
    {
      /* not a holder: check compat with holders AND waiters */
      compat1 = lock_compat (lock, res_ptr->total_waiters_mode);
      compat2 = lock_compat (lock, res_ptr->total_holders_mode);
      pthread_mutex_unlock (&res_ptr->res_mutex);
      return (compat1 == YES && compat2 == YES)
             ? LK_GRANTED : LK_NOTGRANTED;
    }
  else
    {
      /* already a holder: check conversion compat */
      new_mode = lock_conv (lock, entry_ptr->granted_mode);
      if (new_mode == entry_ptr->granted_mode)
        { /* re-entrance */ return LK_GRANTED; }
      /* check against other holders */
      // ... same group_mode logic as PATH F (Ch. 3 §3.9) ...
      return compat ? LK_GRANTED : LK_NOTGRANTED;
    }
}

Key differences from lock_object:

Aspect	`lock_hold_object_instant`	`lock_object`
Creates `LK_ENTRY`	No	Yes
Blocks on conflict	No (returns `LK_NOTGRANTED`)	Yes (suspends thread)
Modifies holder/waiter lists	No	Yes
Hash operation	`find` (read-only)	`find_or_insert` (may create `LK_RES`)
Use case	Optimistic probe before full lock	Actual lock acquisition

Caller: scan_manager.c uses this as an optimistic first try during heap/index scans. If the probe returns LK_GRANTED, the scan can proceed knowing the lock would succeed, often without needing to create any lock entry at all. If LK_NOTGRANTED, it falls back to the full lock_object path.

9.2 Composite lock — batched multi-object locking

Composite lock is a deferred locking strategy for bulk DELETE/UPDATE. Instead of acquiring X locks on each row during the scan phase (which could deadlock with concurrent transactions), it collects OIDs first and acquires all locks at the end.

Lifecycle:

flowchart LR
  INIT["lock_initialize_composite_lock\n(set tran_index, wait_msecs)"]
  ADD["lock_add_composite_lock\n(collect OID + class_oid)\nRepeated per row"]
  FIN["lock_finalize_composite_lock\n(acquire all X locks)\nEscalate if count >= threshold"]
  ABORT["lock_abort_composite_lock\n(free collected OIDs)"]

  INIT --> ADD --> FIN
  ADD -- "error" --> ABORT
  FIN -- "error" --> ABORT

Figure 9-1 — Composite lock lifecycle. OIDs are collected during scan, locks are acquired in batch at finalization.

lock_add_composite_lock collects instance OIDs grouped by class:

// lock_add_composite_lock — src/transaction/lock_manager.c (simplified)
int lock_add_composite_lock (... LK_COMPOSITE_LOCK * comp_lock,
                             const OID * oid, const OID * class_oid)
{
  /* find or create per-class bucket */
  for (lockcomp_class = lockcomp->class_list; ...; )
    if (OID_EQ (class_oid, &lockcomp_class->class_oid)) break;

  if (lockcomp_class == NULL)
    {
      /* new class: acquire IX on class immediately */
      lock_internal_perform_lock_object (..., class_oid, IX_LOCK, ...);
      /* allocate OID buffer (up to escalation threshold) */
    }

  /* if class lock already >= X: skip instance OID collection */
  if (lockcomp_class->class_lock_ptr->granted_mode >= X_LOCK)
    return;    /* class lock subsumes everything */

  /* collect the instance OID */
  lockcomp_class->inst_oid_space[num_inst_oids++] = *oid;
  /* if count reaches escalation threshold: stop collecting */
}

lock_finalize_composite_lock acquires the actual locks:

// lock_finalize_composite_lock — src/transaction/lock_manager.c (simplified)
int lock_finalize_composite_lock (... LK_COMPOSITE_LOCK * comp_lock)
{
  for (each lockcomp_class in class_list)
    {
      if (class already >= X_LOCK
          || num_inst_oids == escalation_threshold)
        {
          /* escalation: X lock on the class */
          lock_internal_perform_lock_object (..., class_oid, X_LOCK, ...);
        }
      else
        {
          /* individual: X lock on each collected instance */
          for (i = 0; i < num_inst_oids; i++)
            lock_internal_perform_lock_object (..., inst_oid[i], X_LOCK, ...);
        }
    }
  lock_abort_composite_lock (comp_lock);  /* free memory */
}

Caller: query_executor.c uses composite lock for DELETE/UPDATE queries. The XASL plan carries a LK_COMPOSITE_LOCK field; the executor collects target OIDs during the scan phase and calls lock_finalize_composite_lock before applying the modifications.

Escalation integration: Composite lock has its own escalation logic — if the collected OID count for a class reaches PRM_ID_LK_ESCALATION_AT, it stops collecting OIDs and instead acquires X on the class at finalization. This mirrors the behavior of the standard escalation (Ch. 6) but with a two-phase approach.

9.3 `lock_scan` and `lock_classes_lock_hint` — class-grain APIs

These two APIs operate at the class level only, with no instance locking.

lock_scan (covered briefly in Ch. 3 §3.1) acquires a single class lock for a heap scan:

// lock_scan — src/transaction/lock_manager.c
int lock_scan (THREAD_ENTRY * thread_p, const OID * class_oid,
               int cond_flag, LOCK class_lock)
{
  root_class_entry = lock_get_class_lock (thread_p, oid_Root_class_oid);
  return lock_internal_perform_lock_object (
    thread_p, tran_index, class_oid, NULL,
    class_lock, wait_msecs,
    &class_entry, root_class_entry);
}

Caller: heap_file.c calls lock_scan(class_oid, IS_LOCK) at the start of a heap scan.

lock_classes_lock_hint acquires locks on multiple classes at once, given a pre-built LC_LOCKHINT array:

// lock_classes_lock_hint — src/transaction/lock_manager.c (simplified)
int lock_classes_lock_hint (THREAD_ENTRY * thread_p,
                            LC_LOCKHINT * lockhint)
{
  /* sort classes for consistent ordering (avoid class-level deadlocks) */
  // ... sort lockhint->classes by OID ...

  for (i = 0; i < lockhint->num_classes; i++)
    {
      lock_internal_perform_lock_object (
        thread_p, tran_index,
        &lockhint->classes[i].oid, NULL,
        lockhint->classes[i].lock, wait_msecs,
        &entry, root_class_entry);
    }
}

Caller: locator_sr.c (xlocator_fetch_lockhint_classes) uses this for batch class locking when the client prefetches multiple classes. The sort step is critical — acquiring class locks in a consistent order prevents deadlocks between transactions that need the same set of classes.

9.4 Lock demotion — `lock_demote_class_lock`

Demotion is the reverse of conversion: downgrade a held lock to a weaker mode. This is used only in narrow circumstances.

// lock_demote_class_lock — src/transaction/lock_manager.c (simplified)
int lock_demote_class_lock (THREAD_ENTRY * thread_p,
                            const OID * oid, LOCK lock, LOCK * ex_lock)
{
  entry_ptr = lock_find_tran_hold_entry (thread_p, tran_index, oid, true);
  return lock_internal_demote_class_lock (thread_p, entry_ptr,
                                          lock, ex_lock);
}

lock_internal_demote_class_lock changes granted_mode to the weaker mode, recomputes total_holders_mode, and calls lock_grant_blocked_holder / lock_grant_blocked_waiter to see if the demotion unblocks any waiters.

Callers:

lock_demote_read_class_lock_for_checksumdb — demotes S→IS on class locks during checksumdb to reduce contention. Explicitly documented as “ONLY for checksumdb.”
lock_demote_all_shared_class_locks — internal function that demotes all S class locks to IS, called during specific transaction state transitions.

Warning from the source: “Demoting a write lock in the middle of a transaction may cause recovery issues. It should be carefully used for some particular cases.” Demotion of instance locks is deliberately not supported.

9.5 `lock_subclass` — class hierarchy locking

For CUBRID’s class inheritance hierarchy, locking a subclass requires intention locks on the superclass:

// lock_subclass — src/transaction/lock_manager.c (simplified)
int lock_subclass (THREAD_ENTRY * thread_p,
                   const OID * subclass_oid,
                   const OID * superclass_oid,
                   LOCK lock, int cond_flag)
{
  /* determine intention lock for superclass */
  if (lock <= S_LOCK)
    new_superclass_lock = IS_LOCK;
  else
    new_superclass_lock = IX_LOCK;

  /* acquire intention lock on superclass if needed */
  if (old_superclass_lock < new_superclass_lock)
    lock_internal_perform_lock_object (..., superclass_oid, NULL,
                                       new_superclass_lock, ...);

  /* acquire the actual lock on the subclass */
  lock_internal_perform_lock_object (..., subclass_oid, NULL,
                                     lock, ..., &subclass_entry,
                                     superclass_entry);
}

This follows the same multi-granularity pattern as lock_object’s instance-on-class locking, but one level up in the hierarchy: subclass-on-superclass instead of instance-on-class. The class_entry back-pointer chains allow escalation to propagate through the hierarchy.

9.6 `lock_rep_read_tran` — REPEATABLE READ helper

A specialized entry point for ALTER TABLE ADD COLUMN NOT NULL under REPEATABLE READ isolation:

int lock_rep_read_tran (THREAD_ENTRY * thread_p, LOCK lock,
                        int cond_flag);

This function acquires a lock on the root class OID to prevent concurrent DDL during the operation. It is a narrow special case and does not introduce any new locking mechanism — it just calls lock_internal_perform_lock_object on the root OID.

9.7 Caller map for special-path APIs

These APIs have narrower caller bases than the main lock_object/ lock_unlock_object (see Ch. 3 §3.2 for the main caller map).

API	Callers	Context
`lock_hold_object_instant`	`scan_manager.c` (heap scan, index scan)	Optimistic non-blocking probe before full `lock_object`. Two call sites: `scan_next_heap_scan` for heap OIDs, `scan_next_index_scan` for index OIDs.
`lock_initialize/add/finalize_composite_lock`	`query_executor.c`	Bulk DELETE/UPDATE. `lock_initialize` at query start, `lock_add` per candidate row during scan, `lock_finalize` before applying modifications.
`lock_scan`	`heap_file.c`	Single call site: `heap_scancache_start` acquires IS_LOCK on the class before scanning the heap.
`lock_classes_lock_hint`	`locator_sr.c` (`xlocator_fetch_lockhint_classes`)	Batch class locking when the client prefetches multiple classes. Paired with `lock_unlock_classes_lock_hint` on completion.
`lock_subclass`	`locator_sr.c` (INSERT/UPDATE on partitioned tables), `query_executor.c` (partition scan)	Acquires IX/IS on subclass with intention lock on superclass. Used when operating on class hierarchies / partitions.
`lock_demote_class_lock`	`locator_sr.c` (`xlocator_demote_class_lock`), `btree_load.c` (index load)	Demote class lock after bulk operation. General-purpose but restricted.
`lock_demote_read_class_lock_for_checksumdb`	`locator_sr.c` (`xlocator_check_fk_validity`)	S→IS demotion during checksumdb only. Source explicitly says “NEVER consider to use this function for any other clients/threads.”
`lock_rep_read_tran`	`transaction_sr.c` (`xtran_lock_rep_read`)	Root-class lock for ALTER TABLE ADD COLUMN NOT NULL under RR. Single call site.

9.8 Chapter summary — key takeaways

lock_hold_object_instant is a read-only, non-blocking probe: uses find (not find_or_insert), creates no LK_ENTRY, never blocks. Used by scan_manager for optimistic locking.
Composite lock defers X-lock acquisition to a batch finalization phase. Integrates its own escalation logic (class X if count reaches threshold). Used by query_executor for bulk DELETE/UPDATE.
lock_scan is a thin wrapper for class-grain IS/IX locks. lock_classes_lock_hint sorts classes before locking to prevent class-level deadlocks.
Lock demotion is intentionally restricted to class locks only and to specific callers (checksumdb). The source explicitly warns against demoting write locks mid-transaction.
lock_subclass extends multi-granularity locking to the class inheritance hierarchy: subclass-on-superclass mirrors instance-on-class.

Chapter 10: Serializable conflict handling (first-updater-wins)

This chapter covers the one isolation rule the lock manager enforces without a lock — the write-path ER_MVCC_SERIALIZABLE_CONFLICT raised under REPEATABLE READ / SERIALIZABLE. It is the write-side counterpart to §5.3 (lock_unlock_object — the isolation policy gate): §5.3 decides which locks a transaction keeps to end-of-transaction under RR; this rule stops a transaction from modifying a row that another transaction changed out from under its snapshot. MVCC visibility (cubrid-mvcc-detail.md) supplies the detection substrate; the serialization decision lives here. (The MVCC detail doc delegates here for exactly this reason — visibility alone cannot serialize two writers.)

10.1 What the error is

ER_MVCC_SERIALIZABLE_CONFLICT (-1158, base/error_code.h) is CUBRID’s first-updater-wins write-conflict error — the analogue of PostgreSQL’s “could not serialize access due to concurrent update” (SQLSTATE 40001). Under RR/SERIALIZABLE a transaction may not UPDATE/DELETE a row whose latest version differs from the version its snapshot sees. If a concurrent transaction created a newer version, or deleted the row, after this transaction’s snapshot was taken, the modify is refused with -1158 and the statement aborts.

There is no separate SERIALIZABLE mechanism: TRAN_SERIALIZABLE has no standalone branch anywhere in the tree, so at runtime RR ≡ SERIALIZABLE and both are gated identically by isolation > TRAN_READ_COMMITTED. The source enforces serializability through this one first-updater-wins check plus RR’s strict-2PL lock retention (§5.3); it does not add key-range locks.

10.2 Where it is raised — `locator_lock_and_get_object_internal`

The check is not in lock_manager.c. It lives in the locator’s lock-and-fetch path that UPDATE/DELETE use to pin a row for modification: locator_lock_and_get_object_internal (src/transaction/locator_sr.c), reached from locator_lock_and_get_object_with_evaluation with lock_mode = X_LOCK. The X-lock is taken first, and only then is the version fetched and checked:

// locator_lock_and_get_object_internal — src/transaction/locator_sr.c (annotated)
//   1) acquire the row OID X-lock: conditional, else unconditional (blocking)
if (lock_object (.., oid, class_oid, lock_mode, LK_COND_LOCK) != LK_GRANTED)
  {
    ... pgbuf_ordered_unfix (..);   // drop page watchers
    if (lock_object (.., oid, class_oid, lock_mode, LK_UNCOND_LOCK) != LK_GRANTED)  // ← write-write WAIT
      goto error;
    heap_prepare_get_context (..);  // pages were unlatched; re-fetch
  }
//   2) only now fetch the latest version and read its MVCC header
heap_get_last_version (.., context);
heap_get_mvcc_header (.., context, &recdes_header);
//   3) RR/SERIALIZABLE: compare the latest version against the txn snapshot (§10.3)

So the write-write wait today is the OID-lock block (LK_UNCOND_LOCK); the snapshot check below is a post-lock isolation gate, not the conflict detector. (This ordering is the lever the cubrid_cv I13 implicit-update-lock analysis proposes to invert.)

10.3 The gate — RR/SERIALIZABLE on RR-eligible classes

// locator_sr.c (annotated)
if (logtb_find_current_isolation (thread_p) > TRAN_READ_COMMITTED
    && logtb_check_class_for_rr_isolation_err (class_oid))
  {
    MVCC_SNAPSHOT *snap = logtb_get_mvcc_snapshot (thread_p);
    MVCC_SATISFIES_SNAPSHOT_RESULT res = snap->snapshot_fnc (.., &recdes_header, snap);
    ...
  }

Two conditions, both required:

Isolation > READ COMMITTED — RR or SERIALIZABLE. Under READ COMMITTED the whole block is skipped: RC never raises -1158; it waits on the row lock and then re-reads the new version.
RR-eligible class — logtb_check_class_for_rr_isolation_err (src/transaction/log_tran_table.c) returns true only when the class is not db_root, _db_user, or _db_trigger. Those are accessed before table locks and secure consistency by locks, so they are exempt; most other catalog classes are MVCC-enabled and are checked.

10.4 The three branches

The gate runs the transaction snapshot (mvcc_satisfies_snapshot, src/transaction/mvcc.c) against the latest version’s header:

Snapshot result on latest version	Meaning	Outcome
`TOO_OLD_FOR_SNAPSHOT`	Deleted + committed before my snapshot — invisible to me	`S_DOESNT_EXIST` (row “gone”, not a conflict)
`TOO_NEW_FOR_SNAPSHOT`	Latest version’s inserter is active, or committed after my snapshot (`MVCC_IS_REC_INSERTER_IN_SNAPSHOT`)	`ER_MVCC_SERIALIZABLE_CONFLICT`
`SNAPSHOT_SATISFIED` and `MVCC_IS_HEADER_DELID_VALID`	Latest version deleted by a concurrent transaction	`ER_MVCC_SERIALIZABLE_CONFLICT`
`SNAPSHOT_SATISFIED`, not deleted	Latest == my visible version; nobody changed it	fall through — modify proceeds

TOO_NEW_FOR_SNAPSHOT is defined in mvcc_satisfies_snapshot: a not-deleted record whose inserter is “in snapshot” (active, or committed after the snapshot) returns TOO_NEW. That is precisely “a concurrent transaction produced a version I am not entitled to see” — the first-updater-wins condition.

10.5 Scenarios

Concurrent UPDATE. T1 snapshots at t0; T2 updates row R and commits after t0; T1 then updates R → its latest version is TOO_NEW for T1 → -1158. (If T1 first blocked on T2’s row X-lock, the re-fetch after the grant sees the new version and raises -1158 rather than silently overwriting.)
Concurrent DELETE. T2 deletes R and commits after t0; T1 updates/deletes R → latest version carries a concurrent deleter’s delete-id → -1158. (If T2’s delete committed before t0 the result is TOO_OLD → S_DOESNT_EXIST.)
No conflict. No concurrent modify since t0 → latest == visible version → modify proceeds.

10.6 How callers treat it

-1158 is an abort signal, recognised alongside ER_LK_UNILATERALLY_ABORTED by the layers that must not swallow it — query_executor.c, work_space.c (client retry), trigger_manager.c. The engine does not retry internally: the transaction aborts and the application decides whether to retry under a fresh snapshot.

10.7 Relation to MVCC and escalation

MVCC (cubrid-mvcc-detail.md). MVCC supplies the substrate — the record header’s insert/delete MVCCID and the mvcc_satisfies_* classifiers. This chapter is where the isolation layer turns “the latest version is not my visible version” into a refusal.
Escalation (Chapter 6). Orthogonal: escalation changes the lock granule (instance→class) to relieve LK_ENTRY memory; it does not change this check. A bulk UPDATE that escalates still runs the per-row first-updater-wins check on each modified row under RR/SERIALIZABLE.

10.8 Position hints (cross-module)

These symbols sit outside lock_manager.c; anchor by function name (line numbers verified against the develop+I8 worktree, 2026-06, and drift across revisions).

Symbol	File	Line
`ER_MVCC_SERIALIZABLE_CONFLICT` (`-1158`)	`base/error_code.h`	1487
`locator_lock_and_get_object_internal` (raise sites, gate)	`transaction/locator_sr.c`	12935 (raises 13041 / 13047)
`locator_lock_and_get_object_with_evaluation` (entry, `X_LOCK`)	`transaction/locator_sr.c`	13099
`mvcc_satisfies_snapshot` (TOO_NEW / TOO_OLD / SATISFIED)	`transaction/mvcc.c`	156
`logtb_check_class_for_rr_isolation_err` (RR-eligible gate)	`transaction/log_tran_table.c`	5671

Position hints as of this revision

Symbol	File	Line
Types (headers)
`struct lk_entry`	`lock_manager.h`	79
`struct lk_res_key`	`lock_manager.h`	164
`struct lk_res`	`lock_manager.h`	175
`lock_remove_all_inst_locks`	`lock_manager.h`	195
`enum LOCK`	`lock_table.h`	29
`lock_Comp[][]`	`lock_table.c`	67
`lock_Conv[][]`	`lock_table.c`	179
`lock_conv()` / `lock_compat()`	`lock_table.h`	58 / 65
Internal types (lock_manager.c)
`struct lk_tran_lock`	`lock_manager.c`	324
`struct lk_global_data`	`lock_manager.c`	361
`lk_Gl` (global instance)	`lock_manager.c`	418
`lk_Obj_lock_res_desc`	`lock_manager.c`	621
`obj_lock_entry_desc`	`lock_manager.c`	592
Constants and macros
`LK_OBJ_LOCK_HASH`	`lock_manager.c`	84
`LK_MIN_OBJECT_LOCKS`	`lock_manager.c`	433
`LOCK_TRAN_LOCAL_POOL_MAX_SIZE`	`lock_manager.c`	351
`LK_MIN_TWFG_EDGE_COUNT`	`lock_manager.c`	451
`LK_MID_TWFG_EDGE_COUNT`	`lock_manager.c`	453
`LK_MAX_TWFG_EDGE_COUNT`	`lock_manager.c`	455
`LK_MAX_VICTIM_COUNT`	`lock_manager.c`	448
`PRM_ID_LK_ESCALATION_AT`	`system_parameter.h`	(config)
`PRM_ID_LK_ROLLBACK_ON_LOCK_ESCALATION`	`system_parameter.h`	(config)
`PRM_ID_LK_RUN_DEADLOCK_INTERVAL`	`system_parameter.c`	1313
Initialization / finalization
`lock_initialize`	`lock_manager.c`	5627
`lock_finalize`	`lock_manager.c`	5859
`lock_initialize_tran_lock_table`	`lock_manager.c`	1061
`lock_initialize_object_hash_table`	`lock_manager.c`	1121
`lock_initialize_object_lock_entry_list`	`lock_manager.c`	1152
`lock_initialize_deadlock_detection`	`lock_manager.c`	1180
`lock_finalize_tran_lock_table`	`lock_manager.c`	5702
`lock_deadlock_detect_daemon_init`	`lock_manager.c`	5812
`deadlock_detect_task_execute`	`lock_manager.c`	5780
Entry init / hash
`lock_initialize_entry_as_granted`	`lock_manager.c`	920
`lock_initialize_entry_as_blocked`	`lock_manager.c`	940
`lock_initialize_entry_as_non2pl`	`lock_manager.c`	962
`lock_create_search_key`	`lock_manager.c`	641
`lock_res_key_compare`	`lock_manager.c`	844
`lock_res_key_hash`	`lock_manager.c`	884
`lock_get_hash_value`	`lock_manager.c`	1015
`lock_get_new_entry`	`lock_manager.c`	9778
`lock_free_entry`	`lock_manager.c`	9809
Acquisition
`lock_object`	`lock_manager.c`	5945
`lock_object_wait_msecs`	`lock_manager.c`	6273
`lock_scan`	`lock_manager.c`	6303
`lock_subclass`	`lock_manager.c`	6154
`lock_classes_lock_hint`	`lock_manager.c`	6379
`lock_internal_perform_lock_object`	`lock_manager.c`	3223
`lock_find_class_entry`	`lock_manager.c`	1596
`lock_position_holder_entry`	`lock_manager.c`	1743
`lock_is_class_lock_escalated`	`lock_manager.c`	508
Grant
`lock_grant_blocked_holder`	`lock_manager.c`	2512
`lock_grant_blocked_waiter`	`lock_manager.c`	2629
`lock_grant_blocked_waiter_partial`	`lock_manager.c`	2758
Release
`lock_unlock_object`	`lock_manager.c`	6630
`lock_unlock_object_donot_move_to_non2pl`	`lock_manager.c`	6613
`lock_unlock_object_lock_internal`	`lock_manager.c`	6580
`lock_unlock_object_by_isolation`	`lock_manager.c`	9841
`lock_unlock_shared_inst_lock`	`lock_manager.c`	4399
`lock_unlock_all`	`lock_manager.c`	6851
`lock_internal_perform_unlock_object`	`lock_manager.c`	3945
NON2PL
`lock_add_non2pl_lock`	`lock_manager.c`	1649
`lock_update_non2pl_list`	`lock_manager.c`	4592
`lock_remove_non2pl`	`lock_manager.c`	539
`lock_notify_isolation_incons`	`lock_manager.c`	7516
Instant lock
`lock_start_instant_lock_mode`	`lock_manager.c`	8879
`lock_stop_instant_lock_mode`	`lock_manager.c`	8901
Escalation
`lock_escalate_if_needed`	`lock_manager.c`	2974
`lock_check_escalate`	`lock_manager.c`	2907
`lock_increment_class_granules`	`lock_manager.c`	548
`lock_decrement_class_granules`	`lock_manager.c`	550
Suspend / resume
`lock_suspend`	`lock_manager.c`	2151
`lock_resume`	`lock_manager.c`	2338
`lock_wakeup_deadlock_victim_timeout`	`lock_manager.c`	2386
`lock_wakeup_deadlock_victim_aborted`	`lock_manager.c`	2442
`lock_force_timeout_expired_wait_transactions`	`lock_manager.c`	7443
Deadlock detection
`lock_detect_local_deadlock`	`lock_manager.c`	7696
`lock_add_WFG_edge`	`lock_manager.c`	4670
`lock_select_deadlock_victim`	`lock_manager.c`	4806
`lock_is_local_deadlock_detection_interval_up`	`lock_manager.c`	7622
Special paths
`lock_hold_object_instant`	`lock_manager.c`	5901
`lock_internal_hold_lock_object_instant`	`lock_manager.c`	3087
`lock_initialize_composite_lock`	`lock_manager.c`	8559
`lock_add_composite_lock`	`lock_manager.c`	8586
`lock_finalize_composite_lock`	`lock_manager.c`	8740
`lock_abort_composite_lock`	`lock_manager.c`	8800
`lock_demote_class_lock`	`lock_manager.c`	4165
`lock_internal_demote_class_lock`	`lock_manager.c`	4188
`lock_rep_read_tran`	`lock_manager.c`	9698