PostgreSQL Lock Manager — The Heavyweight Lock Table, Lock Modes, and Deadlock Detection
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”The lock manager is the component that realizes pessimistic concurrency control: before a transaction reads or writes a logical object, it asks the manager for a lock in a particular mode, and the manager either grants it, blocks the requester, or (in a cycle) aborts somebody. Database System Concepts (Silberschatz, 7e, ch. 18 “Concurrency Control”) frames this as the lock-based protocol family, whose canonical member is two-phase locking (2PL): every transaction acquires all its locks in a growing phase and releases them in a shrinking phase, never acquiring after the first release. 2PL guarantees conflict-serializability; strict 2PL — the variant nearly all engines use — additionally holds every exclusive lock until commit/abort, which also guarantees recoverability and avoids cascading aborts. PostgreSQL holds transaction-level locks to end-of-transaction, so it is a strict-2PL engine for the locks it takes.
Two refinements of the basic model shape every real lock manager and are the knobs PostgreSQL turns:
-
Lock granularity and multi-granularity locking (MGL). A lock can name a whole table, a page, or a single tuple. Coarse locks are cheap to track but kill concurrency; fine locks maximize concurrency but explode the lock table. Database System Concepts §18.3 (“Multiple Granularity”) resolves this with a hierarchy (database → table → page → tuple) and intention modes (IS, IX, SIX): before locking a node in S or X mode, a transaction locks every ancestor in the corresponding intention mode, so a coarse locker can detect a fine locker without walking the whole subtree. The richness of the mode set — and the conflict table that defines which modes coexist — is the heart of the design.
-
Deadlock handling. Because transactions request locks in an order the manager cannot predict, cycles are possible: A holds x and waits for y while B holds y and waits for x. The textbook (§18.2.2) gives two families: prevention (timeouts, wait-die / wound-wait timestamp ordering, or acquiring all locks atomically) and detection-and-recovery (build a waits-for graph (WFG) — a node per transaction, an edge A→B when A waits for a lock B holds — and periodically test it for a cycle, aborting a victim to break it). PostgreSQL chooses detection: it lets the cycle form, then runs a WFG search after a timeout.
The lock manager is deliberately separate from MVCC. Under snapshot
isolation, reads do not take row locks at all — they read a visible version
(see postgres-mvcc-visibility.md). The heavyweight lock manager exists for
the conflicts MVCC cannot resolve by versioning: writer/writer serialization,
DDL vs. DML, explicit LOCK TABLE, SELECT ... FOR UPDATE, and waiting on
another transaction to finish. This is why a read-mostly OLTP workload takes
millions of weak relation locks that never conflict — the observation that
motivates the fast path (below).
Common DBMS Design
Section titled “Common DBMS Design”The textbook gives the model (2PL, MGL, WFG). This section names the engineering conventions that nearly every production lock manager — PostgreSQL, Oracle, DB2, SQL Server, MySQL/InnoDB, CUBRID — adopts to make that model run on a multicore box. PostgreSQL’s specific choices in the next section read as one set of dials within this shared space.
A central lock table keyed by an object identifier
Section titled “A central lock table keyed by an object identifier”Every engine keeps a hash table whose key identifies the lockable object (a table OID, a page number, a tuple address, a transaction id) and whose value records who holds and who waits. Two value records are nearly universal: a per-object record (the queue of holders and waiters on this object) and a per-holder record (what one transaction holds/awaits on this object). The per-object record carries an aggregate of all granted modes so a conflict test is one bitwise operation instead of a list walk.
Table-driven conflict semantics
Section titled “Table-driven conflict semantics”The meaning of “mode M1 conflicts with mode M2” is encoded in a static conflict table (a 2-D bit matrix), not in branching code. A request is tested against the object’s aggregate “granted modes” mask by ANDing the requester’s conflict row with the mask. This makes the mode set a piece of data that can be extended without touching the acquisition path.
Per-backend lock cache
Section titled “Per-backend lock cache”Within one transaction the same lock is requested and released many times
(every catalog lookup re-locks pg_class). Touching shared memory each time
would serialize backends on a latch. Every engine keeps a backend-local
count of how many times it holds each lock, so the second through N-th
acquisition of an already-held lock never touches shared memory.
Partitioned latching for scalability
Section titled “Partitioned latching for scalability”A single global mutex over the lock table is the classic multicore bottleneck — the central finding of A Scalable Lock Manager for Multicores (Jung et al., SIGMOD 2013), which traced MySQL’s throughput collapse at 16 cores to cache-line bouncing on the one lock-table mutex even on a read-only, conflict-free workload. The standard mitigation is to partition the lock table by a hash of the key and give each partition its own latch, so unrelated objects never contend. The frontier the paper pushes past — splitting lock allocation from lock manipulation so the manipulation is latch-free — is exactly the gap PostgreSQL’s fast path attacks from a different angle.
A fast path for the common, non-conflicting case
Section titled “A fast path for the common, non-conflicting case”Even partitioned, a hot table funnels every locker into one partition latch. Engines therefore add an escape hatch: if a lock is known to be weak (unlikely to conflict) and the system can cheaply prove no conflicting strong lock exists, record it in a per-backend private array and skip the shared table entirely. The cost is moved to the rare strong locker, who must “flush” everyone’s fast-path entries into the shared table before it can conflict-check.
Deadlock detection deferred and asynchronous
Section titled “Deadlock detection deferred and asynchronous”Running a WFG search on every block would dominate the no-deadlock common case.
The convention is optimistic waiting: block immediately with no check, arm a
timer, and only search the WFG if the wait outlasts a timeout (PostgreSQL’s
deadlock_timeout, default 1 s). A deadlock that resolves itself within the
timeout costs nothing.
Theory ↔ PostgreSQL mapping
Section titled “Theory ↔ PostgreSQL mapping”| Theory / convention | PostgreSQL name |
|---|---|
| Lockable object identity | LOCKTAG — 16-byte tag, locktag_type ∈ LockTagType |
| Per-object holder/waiter record | LOCK (in shared LockMethodLockHash) |
| Per-holder record | PROCLOCK (in shared LockMethodProcLockHash) |
| Per-backend lock cache | LOCALLOCK (in backend-private LockMethodLocalHash) |
| Conflict table (mode × mode bit matrix) | LockConflicts[] + LockMethodData.conflictTab |
| The 8 lock modes | AccessShareLock … AccessExclusiveLock (lockdefs.h) |
| Aggregate granted-modes mask | LOCK.grantMask; awaited modes in LOCK.waitMask |
| Partitioned latching | NUM_LOCK_PARTITIONS (=16) partition LWLocks, hash of LOCKTAG |
| Fast path for weak relation locks | PGPROC.fpRelId[] / fpLockBits[]; FastPathStrongRelationLocks |
| Wait queue on the object | LOCK.waitProcs (list of PGPROC); request stored in PGPROC |
| Waits-for graph + detection | DeadLockCheck / FindLockCycle (deadlock.c) |
| Optimistic waiting + timeout | ProcSleep arms deadlock_timeout; CheckDeadLock runs the search |
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL’s README (in src/backend/storage/lmgr/) calls these “regular
locks” or “heavyweight locks” and contrasts them with three lighter
interprocess locks — spinlocks and LWLocks (own doc:
postgres-lwlock-spinlock.md) and SIReadLock predicate locks (own doc:
postgres-ssi-predicate-locking.md). The heavyweight manager is the only one
of the four with “a variety of lock modes with table-driven semantics, full
deadlock detection, and automatic release at transaction end.” Everything
below is that manager.
The distinguishing choices are: (1) three tiers of lock record —
backend-local LOCALLOCK, shared LOCK, shared PROCLOCK — so repeated
acquisition is latch-free and shared state stores only one grant per
backend/object/mode; (2) the conflict semantics are a static 8×8 bit
table; (3) the shared hashes are partitioned 16 ways by LOCKTAG hash;
(4) a fast path lets weak relation locks bypass shared memory entirely;
(5) deadlock detection is deferred until deadlock_timeout and resolves
soft cycles by rearranging wait queues rather than aborting.
The three lock records and how they relate
Section titled “The three lock records and how they relate”flowchart LR
subgraph BK["backend-private memory"]
LL["LOCALLOCK<br/>tag = (LOCKTAG, mode)<br/>nLocks, lockOwners[]<br/>lock*, proclock*"]
end
subgraph SHM["shared memory (partitioned)"]
LK["LOCK<br/>tag = LOCKTAG<br/>grantMask, waitMask<br/>granted[], requested[]<br/>procLocks, waitProcs"]
PL["PROCLOCK<br/>tag = (myLock, myProc)<br/>holdMask, releaseMask<br/>groupLeader"]
end
PG["PGPROC<br/>(this backend)<br/>waitLock, waitLockMode<br/>fpRelId[], fpLockBits[]"]
LL -- "lock*" --> LK
LL -- "proclock*" --> PL
PL -- "tag.myLock" --> LK
PL -- "tag.myProc" --> PG
LK -- "procLocks (dlist)" --> PL
LK -- "waitProcs (dclist)" --> PG
Figure 1 — The three lock records. LOCALLOCK is a backend-private cache
keyed by (object, mode) holding the local acquisition count; it points at the
shared LOCK (per-object) and PROCLOCK (per-backend-per-object) when the
lock lives in the main table. A PROCLOCK is doubly linked into both its
LOCK’s procLocks list and its owning PGPROC. When a lock is taken via the
fast path, LOCALLOCK.lock and .proclock are NULL.
The LOCK is “per lockable object” and the PROCLOCK is “per
lock-and-requestor” (README). The shared structures “only allow a single lock
grant to be made per lockable object/lock mode/backend” — the internal request
counts (a transaction taking the same lock twice) are held in LOCALLOCK so
the shared data need not be touched.
// LOCK — src/include/storage/lock.htypedef struct LOCK{ LOCKTAG tag; /* unique identifier of lockable object */ LOCKMASK grantMask; /* bitmask for lock types already granted */ LOCKMASK waitMask; /* bitmask for lock types awaited */ dlist_head procLocks; /* list of PROCLOCK objects assoc. with lock */ dclist_head waitProcs; /* list of PGPROC objects waiting on lock */ int requested[MAX_LOCKMODES]; /* counts of requested locks */ int nRequested; /* total of requested[] array */ int granted[MAX_LOCKMODES]; /* counts of granted locks */ int nGranted; /* total of granted[] array */} LOCK;grantMask is the aggregate the conflict test ANDs against; granted[i] is the
per-mode count behind it (bit i of grantMask is on iff granted[i] > 0).
waitMask flags modes some waiter wants but cannot get yet. When nRequested
falls to zero the LOCK is freed.
// PROCLOCK + PROCLOCKTAG — src/include/storage/lock.htypedef struct PROCLOCKTAG{ LOCK *myLock; /* link to per-lockable-object information */ PGPROC *myProc; /* link to PGPROC of owning backend */} PROCLOCKTAG;
typedef struct PROCLOCK{ PROCLOCKTAG tag; /* unique identifier of proclock object */ PGPROC *groupLeader; /* proc's lock group leader, or proc itself */ LOCKMASK holdMask; /* bitmask for lock types currently held */ LOCKMASK releaseMask; /* bitmask for lock types to be released */ dlist_node lockLink; /* list link in LOCK's list of proclocks */ dlist_node procLink; /* list link in PGPROC's list of proclocks */} PROCLOCK;The PROCLOCKTAG uses raw pointers as a key — safe because “a PROCLOCK never
outlives either its lock or its proc” (README). holdMask is the per-backend
slice of LOCK.grantMask. The two dlist_nodes thread the same object into two
lists: the LOCK’s holder list and the PGPROC’s held-locks list.
// LOCALLOCK + LOCALLOCKTAG — src/include/storage/lock.htypedef struct LOCALLOCKTAG{ LOCKTAG lock; /* identifies the lockable object */ LOCKMODE mode; /* lock mode for this table entry */} LOCALLOCKTAG;
typedef struct LOCALLOCK{ LOCALLOCKTAG tag; /* unique identifier of locallock entry */ uint32 hashcode; /* copy of LOCKTAG's hash value */ LOCK *lock; /* associated LOCK object, if any */ PROCLOCK *proclock; /* associated PROCLOCK object, if any */ int64 nLocks; /* total number of times lock is held */ int numLockOwners; /* # of relevant ResourceOwners */ int maxLockOwners; /* allocated size of array */ LOCALLOCKOWNER *lockOwners; /* dynamically resizable array */ bool holdsStrongLockCount; /* bumped FastPathStrongRelationLocks */ bool lockCleared; /* we read all sinval msgs for lock */} LOCALLOCK;Note LOCALLOCK’s tag includes the mode — there is one LOCALLOCK per
(object, mode) the backend touches, whereas there is one LOCK per object
across all modes. nLocks is the local re-entrancy count; lockOwners[] tracks
which ResourceOwners hold it, so subtransaction rollback can release exactly
its share (see postgres-backend-lifecycle.md for the resource-owner tree).
The LOCKTAG: naming any lockable object in 16 bytes
Section titled “The LOCKTAG: naming any lockable object in 16 bytes”// LOCKTAG — src/include/storage/lock.htypedef struct LOCKTAG{ uint32 locktag_field1; /* a 32-bit ID field */ uint32 locktag_field2; /* a 32-bit ID field */ uint32 locktag_field3; /* a 32-bit ID field */ uint16 locktag_field4; /* a 16-bit ID field */ uint8 locktag_type; /* see enum LockTagType */ uint8 locktag_lockmethodid; /* lockmethod indicator */} LOCKTAG;The struct is “defined with malice aforethought to fit into 16 bytes with no
padding” — padding bytes would make the hash random because the whole struct is
hashed. locktag_type selects how the four ID fields are read; the SET_LOCKTAG_*
macros encode the convention. The 12 types (LockTagType) span the granularity
hierarchy and several non-relation objects:
locktag_type | Identified by (fields) | Typical use |
|---|---|---|
LOCKTAG_RELATION | dbOid, relOid | whole table (most DML/DDL locks) |
LOCKTAG_RELATION_EXTEND | dbOid, relOid | right to add a block to a relation |
LOCKTAG_DATABASE_FROZEN_IDS | dbOid | pg_database.datfrozenxid update |
LOCKTAG_PAGE | dbOid, relOid, blocknum | one page (e.g. GIN cleanup) |
LOCKTAG_TUPLE | dbOid, relOid, blocknum, offnum | one physical tuple (SELECT FOR …) |
LOCKTAG_TRANSACTION | xid | wait for an xact to finish |
LOCKTAG_VIRTUALTRANSACTION | procNumber, lxid | wait for a vxid (CIC, hot standby) |
LOCKTAG_SPECULATIVE_TOKEN | xid, token | speculative insert (INSERT ... ON CONFLICT) |
LOCKTAG_OBJECT | dbOid, classOid, objOid, objsubid | non-relation catalog object |
LOCKTAG_USERLOCK | (reserved) | old contrib/userlock |
LOCKTAG_ADVISORY | 4 user ids | pg_advisory_lock(...) |
LOCKTAG_APPLY_TRANSACTION | dbOid, subOid, xid, objid | logical-replication apply conflict |
locktag_lockmethodid selects the lock method: DEFAULT_LOCKMETHOD (1) for
everything system-driven, USER_LOCKMETHOD (2) for advisory locks. Both methods
share the same mode set and conflict table — they differ only in lifetime rules
(advisory/user locks can outlive a transaction; see README §“User Locks”).
The 8 lock modes and the conflict table
Section titled “The 8 lock modes and the conflict table”// lock modes — src/include/storage/lockdefs.h#define NoLock 0 /* not a lock; "don't get a lock" */#define AccessShareLock 1 /* SELECT */#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */#define ShareUpdateExclusiveLock 4 /* VACUUM, ANALYZE, CREATE INDEX CONCURRENTLY */#define ShareLock 5 /* CREATE INDEX (non-concurrent) */#define ShareRowExclusiveLock 6 /* like EXCLUSIVE but allows ROW SHARE */#define ExclusiveLock 7 /* blocks ROW SHARE / SELECT FOR UPDATE */#define AccessExclusiveLock 8 /* ALTER/DROP TABLE, VACUUM FULL, LOCK TABLE */These are table-level modes — PostgreSQL’s MGL hierarchy is shallow (it does
not take page/tuple S/X locks for ordinary MVCC reads). The lower modes
(AccessShareLock, RowShareLock, RowExclusiveLock) are what DML takes and
are mutually compatible; the higher modes are DDL/maintenance and progressively
exclude. The conflict semantics live entirely in one array:
// LockConflicts[] — src/backend/storage/lmgr/lock.c (condensed)static const LOCKMASK LockConflicts[] = { 0, /* AccessShareLock */ LOCKBIT_ON(AccessExclusiveLock), /* RowShareLock */ LOCKBIT_ON(ExclusiveLock) | LOCKBIT_ON(AccessExclusiveLock), /* RowExclusiveLock */ LOCKBIT_ON(ShareLock) | LOCKBIT_ON(ShareRowExclusiveLock) | LOCKBIT_ON(ExclusiveLock) | LOCKBIT_ON(AccessExclusiveLock), /* ShareUpdateExclusiveLock */ LOCKBIT_ON(ShareUpdateExclusiveLock) | LOCKBIT_ON(ShareLock) | LOCKBIT_ON(ShareRowExclusiveLock) | LOCKBIT_ON(ExclusiveLock) | LOCKBIT_ON(AccessExclusiveLock), /* ShareLock */ LOCKBIT_ON(RowExclusiveLock) | LOCKBIT_ON(ShareUpdateExclusiveLock) | LOCKBIT_ON(ShareRowExclusiveLock) | LOCKBIT_ON(ExclusiveLock) | LOCKBIT_ON(AccessExclusiveLock), /* ShareRowExclusiveLock */ LOCKBIT_ON(RowExclusiveLock) | LOCKBIT_ON(ShareUpdateExclusiveLock) | LOCKBIT_ON(ShareLock) | LOCKBIT_ON(ShareRowExclusiveLock) | LOCKBIT_ON(ExclusiveLock) | LOCKBIT_ON(AccessExclusiveLock), /* ExclusiveLock */ LOCKBIT_ON(RowShareLock) | LOCKBIT_ON(RowExclusiveLock) | LOCKBIT_ON(ShareUpdateExclusiveLock) | LOCKBIT_ON(ShareLock) | LOCKBIT_ON(ShareRowExclusiveLock) | LOCKBIT_ON(ExclusiveLock) | LOCKBIT_ON(AccessExclusiveLock), /* AccessExclusiveLock */ LOCKBIT_ON(AccessShareLock) | LOCKBIT_ON(RowShareLock) | LOCKBIT_ON(RowExclusiveLock) | LOCKBIT_ON(ShareUpdateExclusiveLock) | LOCKBIT_ON(ShareLock) | LOCKBIT_ON(ShareRowExclusiveLock) | LOCKBIT_ON(ExclusiveLock) | LOCKBIT_ON(AccessExclusiveLock)};LockConflicts[i] is the set of modes that conflict with mode i. The matrix
is symmetric; the conventional rendering (✗ = conflict):
| ACS | RS | RE | SUE | S | SRE | E | AE | |
|---|---|---|---|---|---|---|---|---|
| AccessShare (1) | ✗ | |||||||
| RowShare (2) | ✗ | ✗ | ||||||
| RowExclusive (3) | ✗ | ✗ | ✗ | ✗ | ||||
| ShareUpdateExcl (4) | ✗ | ✗ | ✗ | ✗ | ✗ | |||
| Share (5) | ✗ | ✗ | ✗ | ✗ | ✗ | |||
| ShareRowExcl (6) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ||
| Exclusive (7) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | |
| AccessExclusive (8) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
Read it as the MGL/intent picture flattened to table granularity:
AccessShareLock (a reader) conflicts only with AccessExclusiveLock (a
table rewriter), so SELECTs and most DDL-free work coexist freely. The three
DML modes (1–3) form a mutually compatible block — many INSERT/UPDATE/DELETE
backends run against one table at once. ShareUpdateExclusiveLock is
self-conflicting, which is why only one VACUUM/ANALYZE/CREATE INDEX CONCURRENTLY runs on a table at a time. The single DoLockModesConflict
helper is the public face of the table:
// DoLockModesConflict — src/backend/storage/lmgr/lock.cboolDoLockModesConflict(LOCKMODE mode1, LOCKMODE mode2){ LockMethod lockMethodTable = LockMethods[DEFAULT_LOCKMETHOD];
if (lockMethodTable->conflictTab[mode1] & LOCKBIT_ON(mode2)) return true; return false;}The partitioned shared hash
Section titled “The partitioned shared hash”The two shared hashes (LockMethodLockHash for LOCKs,
LockMethodProcLockHash for PROCLOCKs) are partitioned 16 ways. Before 8.2
one global LockMgrLock protected everything and became “a contention
bottleneck” (README); the fix was to assign each lockable object to a
partition by the low-order bits of its LOCKTAG hash, each partition guarded by
its own LWLock.
// partition selection — src/include/storage/lock.h + lwlock.h#define LOG2_NUM_LOCK_PARTITIONS 4#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS) /* = 16 */
#define LockHashPartition(hashcode) \ ((hashcode) % NUM_LOCK_PARTITIONS)#define LockHashPartitionLock(hashcode) \ (&MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + \ LockHashPartition(hashcode)].lock)A normal acquire/release touches one partition lock. The trick that lets one
partition lock cover both hashes is a specialized PROCLOCK hash function that
forces a PROCLOCK into the same partition as its LOCK:
// proclock_hash — src/backend/storage/lmgr/lock.cstatic uint32proclock_hash(const void *key, Size keysize){ const PROCLOCKTAG *proclocktag = (const PROCLOCKTAG *) key; uint32 lockhash; Datum procptr;
/* Look into the associated LOCK object, and compute its hash code */ lockhash = LockTagHashCode(&proclocktag->myLock->tag);
/* xor the proc address in, left-shifted past the partition-number bits */ procptr = PointerGetDatum(proclocktag->myProc); lockhash ^= ((uint32) procptr) << LOG2_NUM_LOCK_PARTITIONS; return lockhash;}The proc pointer is mixed in above the low 4 bits, so the partition number is
inherited from the LOCK while the rest of the hash still spreads PROCLOCKs
across hash chains. Deadlock checking, which must see the whole graph, locks
all 16 partitions in partition-number order (a fixed order that prevents
LWLock deadlock among the partition locks themselves).
LockAcquire: the acquisition path
Section titled “LockAcquire: the acquisition path”LockAcquire (and its backing LockAcquireExtended) is the heart of the
manager. The control flow has four exits — already-held (local count bump),
fast-path grant, immediate shared-table grant, and wait-then-grant — plus the
deadlock/no-wait error exit.
flowchart TD
A["LockAcquireExtended(tag, mode)"] --> B["find/create LOCALLOCK<br/>(backend-private hash)"]
B --> C{"locallock.nLocks > 0?<br/>(already held)"}
C -- "yes" --> C1["GrantLockLocal:<br/>bump local count, return"]
C -- "no" --> D{"eligible for<br/>relation fast path?"}
D -- "yes" --> E{"strong-lock count<br/>for partition == 0?"}
E -- "yes" --> E1["FastPathGrantRelationLock<br/>→ record in PGPROC array,<br/>return LOCKACQUIRE_OK"]
E -- "no" --> F
D -- "no" --> F{"this is a strong lock<br/>that conflicts w/ fast path?"}
F -- "yes" --> F1["BeginStrongLockAcquire +<br/>FastPathTransferRelationLocks<br/>(flush others' fp locks)"]
F1 --> G
F -- "no" --> G["LWLockAcquire(partitionLock)"]
G --> H["SetupLockInTable:<br/>find/create LOCK + PROCLOCK"]
H --> I{"conflictTab[mode] & waitMask<br/>OR LockCheckConflicts?"}
I -- "no conflict" --> J["GrantLock → OK"]
I -- "conflict" --> K["JoinWaitQueue(dontWait)"]
K --> L{"result?"}
L -- "OK (grant-on-join)" --> J
L -- "ERROR (early deadlock<br/>or dontWait)" --> M["clean up, DeadLockReport<br/>or return NOT_AVAIL"]
L -- "WAITING" --> N["WaitOnLock → ProcSleep<br/>(sleep on semaphore)"]
N --> O{"granted or<br/>deadlock?"}
O -- "granted" --> J
O -- "deadlock" --> M
Figure 2 — LockAcquireExtended control flow. The two cheap exits (local
re-acquire, fast-path grant) avoid shared memory entirely. Only a genuine
shared-table conflict drops the backend into JoinWaitQueue → WaitOnLock →
ProcSleep. A strong-lock acquisition first transfers every backend’s matching
fast-path locks into the shared table so conflict and deadlock detection see
them.
The already-held check is the first and cheapest exit:
// LockAcquireExtended — src/backend/storage/lmgr/lock.c (condensed)locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash, &localtag, HASH_ENTER, &found);/* ... initialize if !found ... */
/* If we already hold the lock, just increase the count locally. */if (locallock->nLocks > 0){ GrantLockLocal(locallock, owner); if (locallock->lockCleared) return LOCKACQUIRE_ALREADY_CLEAR; else return LOCKACQUIRE_ALREADY_HELD;}If not already held, the shared-table conflict test is two steps — a cheap
mask test against waiters, then the full LockCheckConflicts:
// LockAcquireExtended — shared-table grant decision (condensed)proclock = SetupLockInTable(lockMethodTable, MyProc, locktag, hashcode, lockmode);/* ... */locallock->proclock = proclock;lock = proclock->tag.myLock;locallock->lock = lock;
/* If request conflicts with waiters, must queue; else check held locks. */if (lockMethodTable->conflictTab[lockmode] & lock->waitMask) found_conflict = true;else found_conflict = LockCheckConflicts(lockMethodTable, lockmode, lock, proclock);
if (!found_conflict){ GrantLock(lock, proclock, lockmode); /* No conflict — granted */ waitResult = PROC_WAIT_STATUS_OK;}else waitResult = JoinWaitQueue(locallock, lockMethodTable, dontWait);The “conflicts with waiters” test first (waitMask) implements FIFO
fairness: a new requester that could be granted but would jump ahead of an
incompatible waiter is made to queue instead — otherwise a stream of compatible
weak lockers could starve a waiting strong locker.
GrantLock is the bookkeeping that records the grant in the shared structures:
// GrantLock — src/backend/storage/lmgr/lock.cvoidGrantLock(LOCK *lock, PROCLOCK *proclock, LOCKMODE lockmode){ lock->nGranted++; lock->granted[lockmode]++; lock->grantMask |= LOCKBIT_ON(lockmode); if (lock->granted[lockmode] == lock->requested[lockmode]) lock->waitMask &= LOCKBIT_OFF(lockmode); proclock->holdMask |= LOCKBIT_ON(lockmode);}LockCheckConflicts: conflict test with group-locking subtraction
Section titled “LockCheckConflicts: conflict test with group-locking subtraction”// LockCheckConflicts — src/backend/storage/lmgr/lock.c (condensed)int conflictMask = lockMethodTable->conflictTab[lockmode];
/* global conflict: does any granted mode conflict with my request? */if (!(conflictMask & lock->grantMask)) return false; /* nobody conflicts — granted */
/* Subtract out locks I already hold myself. */myLocks = proclock->holdMask;for (i = 1; i <= numLockModes; i++){ if ((conflictMask & LOCKBIT_ON(i)) == 0) { conflictsRemaining[i] = 0; continue; } conflictsRemaining[i] = lock->granted[i]; if (myLocks & LOCKBIT_ON(i)) --conflictsRemaining[i]; totalConflictsRemaining += conflictsRemaining[i];}if (totalConflictsRemaining == 0) return false; /* only my own locks conflicted */
/* If no group locking, it's definitely a conflict. */if (proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL) return true;/* ... otherwise subtract conflicts held by my parallel lock-group peers ... */The first AND is the common-case fast test. The subtractions handle two
“conflicts that aren’t really conflicts”: a process never conflicts with its
own held locks (you can take a stronger lock while holding a weaker one), and —
since parallel query — members of the same lock group (a leader and its
parallel workers) treat each other’s locks as non-conflicting, with one
exception: LOCKTAG_RELATION_EXTEND conflicts even within a group. Group
locking exists because a parallel worker would otherwise self-deadlock against
a lock its own leader holds; the README §“Group Locking” lays out why
predicting workers’ lock requests was rejected in favor of declaring intra-group
locks non-conflicting.
The fast path for weak relation locks
Section titled “The fast path for weak relation locks”A read-mostly workload hammering one table sends every backend into the same
partition lock — “measurable even on 2-core servers, and very pronounced as
core count increases” (README). The fast path (since 9.2) lets a backend
record a weak relation lock (AccessShareLock, RowShareLock,
RowExclusiveLock — all below ShareUpdateExclusiveLock) in a private array in
its own PGPROC, never touching the shared hash:
// PGPROC fast-path fields — src/include/storage/proc.hLWLock fpInfoLock; /* protects per-backend fast-path state */uint64 *fpLockBits; /* lock modes held for each fast-path slot */Oid *fpRelId; /* slots for rel oids */Slots are grouped (FP_LOCK_SLOTS_PER_GROUP = 16 per group), and a relation
hashes to a group via FAST_PATH_REL_GROUP. Three bits per slot encode which of
the three weak modes are held. Eligibility and the grant are:
// EligibleForRelationFastPath + grant — src/backend/storage/lmgr/lock.c (condensed)#define EligibleForRelationFastPath(locktag, mode) \ ((locktag)->locktag_lockmethodid == DEFAULT_LOCKMETHOD && \ (locktag)->locktag_type == LOCKTAG_RELATION && \ (locktag)->locktag_field1 == MyDatabaseId && \ MyDatabaseId != InvalidOid && \ (mode) < ShareUpdateExclusiveLock)
if (EligibleForRelationFastPath(locktag, lockmode) && /* group not full */ ){ uint32 fasthashcode = FastPathStrongLockHashPartition(hashcode); LWLockAcquire(&MyProc->fpInfoLock, LW_EXCLUSIVE); if (FastPathStrongRelationLocks->count[fasthashcode] != 0) acquired = false; /* a strong locker is present — fall back */ else acquired = FastPathGrantRelationLock(locktag->locktag_field2, lockmode); LWLockRelease(&MyProc->fpInfoLock); if (acquired) { locallock->lock = NULL; locallock->proclock = NULL; GrantLockLocal(locallock, owner); return LOCKACQUIRE_OK; }}The correctness hinge is FastPathStrongRelationLocks, a 1024-way array of
strong-lock counts (a “1024-way partitioning of the lock space”, README):
// FastPathStrongRelationLockData — src/backend/storage/lmgr/lock.c#define FAST_PATH_STRONG_LOCK_HASH_PARTITIONS (1 << 10) /* = 1024 */typedef struct { slock_t mutex; uint32 count[FAST_PATH_STRONG_LOCK_HASH_PARTITIONS];} FastPathStrongRelationLockData;static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;A weak locker may use the fast path only when its partition’s strong count is
zero. When a strong locker (> ShareUpdateExclusiveLock) arrives, it does the
reverse: bump the strong count, then transfer every backend’s matching
fast-path entries into the shared table so the conflict and deadlock machinery
can see them:
sequenceDiagram
participant W as Weak locker (DML)
participant FP as MyProc fast-path array
participant CNT as FastPathStrongRelationLocks
participant S as Strong locker (DDL)
participant SHM as Shared LOCK table
Note over W,CNT: normal case — no strong lock present
W->>CNT: read count[partition] == 0
W->>FP: FastPathGrantRelationLock (set bit), return OK
Note over S,SHM: strong lock arrives
S->>CNT: BeginStrongLockAcquire: count[partition]++
S->>FP: FastPathTransferRelationLocks: scan every PGPROC
FP->>SHM: SetupLockInTable + GrantLock for each matching fp entry
S->>SHM: now conflict-check against the (now complete) shared table
Figure 3 — Fast-path interaction. The memory-ordering guarantee (README):
LWLock acquisition is a sequence point, so a weak locker that holds its own
fpInfoLock and reads a zero strong count is safe — the strong locker, to
transfer locks, must acquire every backend’s fpInfoLock in turn and will
therefore see any weak lock taken after its own count bump. “Deadlock detection
does not need to examine the fast-path data structures, because any lock that
could possibly be involved in a deadlock must have been transferred to the main
tables beforehand.”
Waiting: JoinWaitQueue and ProcSleep
Section titled “Waiting: JoinWaitQueue and ProcSleep”When a request conflicts, JoinWaitQueue decides where in LOCK.waitProcs
the backend goes. The default is the tail (FIFO), but there is one exception
that pre-empts a deadlock: if the requester already holds a lock that conflicts
with some earlier waiter’s request, it inserts itself just before that
waiter (otherwise ProcLockWakeup would never wake it before the earlier
waiter, manufacturing a wait cycle the deadlock detector would have to untangle):
// JoinWaitQueue — src/backend/storage/lmgr/proc.c (condensed)dclist_foreach(iter, waitQueue){ PGPROC *proc = dlist_container(PGPROC, links, iter.cur); /* Must he wait for me? (my held locks conflict with his request) */ if (lockMethodTable->conflictTab[proc->waitLockMode] & myHeldLocks) { /* Must I wait for him? (his held locks conflict with my request) */ if (lockMethodTable->conflictTab[lockmode] & proc->heldLocks) { RememberSimpleDeadLock(MyProc, lockmode, lock, proc); early_deadlock = true; /* immediate, no timeout wait */ break; } /* I go before him; if I conflict with nothing ahead, grant now */ if ((lockMethodTable->conflictTab[lockmode] & aheadRequests) == 0 && !LockCheckConflicts(lockMethodTable, lockmode, lock, proclock)) { GrantLock(lock, proclock, lockmode); return PROC_WAIT_STATUS_OK; } insert_before = proc; break; } aheadRequests |= LOCKBIT_ON(proc->waitLockMode);}The PGPROC wait fields (waitLock, waitProcLock, waitLockMode, heldLocks)
are the outgoing edge of the WFG: they record exactly what this backend is
blocked on. Once queued, ProcSleep arms the deadlock_timeout timer and
blocks on the backend’s semaphore until granted or signaled:
// ProcSleep — src/backend/storage/lmgr/proc.c (the timeout arming, condensed)deadlock_state = DS_NOT_YET_CHECKED;got_deadlock_timeout = false;/* Set a timer so CheckDeadLock() fires after deadlock_timeout ms. *//* ... enable_timeout_after(DEADLOCK_TIMEOUT, DeadlockTimeout) ... */When the holder releases, ProcLockWakeup walks the wait queue front-to-back
and wakes each waiter whose request now conflicts with neither held locks nor
the requests of earlier un-wakable waiters — preserving arrival order for
conflicting requests:
// ProcLockWakeup — src/backend/storage/lmgr/proc.c (condensed)dclist_foreach_modify(miter, waitQueue){ PGPROC *proc = dlist_container(PGPROC, links, miter.cur); LOCKMODE lockmode = proc->waitLockMode; if ((lockMethodTable->conflictTab[lockmode] & aheadRequests) == 0 && !LockCheckConflicts(lockMethodTable, lockmode, lock, proc->waitProcLock)) { GrantLock(lock, proc->waitProcLock, lockmode); ProcWakeup(proc, PROC_WAIT_STATUS_OK); } else aheadRequests |= LOCKBIT_ON(lockmode); /* keep blocking those behind */}Deadlock detection: the waits-for graph
Section titled “Deadlock detection: the waits-for graph”If deadlock_timeout elapses before the lock is granted, the signal handler
runs CheckDeadLock, which locks all 16 partitions and calls DeadLockCheck.
The graph is the standard WFG: a node per process, an edge A→B when A waits for a
lock B holds (or, more subtly, when A is behind B in a wait queue for
conflicting requests). PostgreSQL distinguishes two edge kinds:
- Hard edge — B already holds a lock conflicting with A’s request. This cannot be undone without aborting someone.
- Soft edge — B is merely ahead of A in a wait queue with a conflicting
request, so
ProcLockWakeupwould grant B first. This can be undone by reordering the wait queue.
flowchart TD
A["CheckDeadLock fires<br/>(deadlock_timeout elapsed)"] --> B["lock all 16 partitions<br/>in partition order"]
B --> C{"still on wait queue?"}
C -- "no (woken meanwhile)" --> Z["release locks, resume"]
C -- "yes" --> D["DeadLockCheck →<br/>DeadLockCheckRecurse"]
D --> E["FindLockCycle: DFS along<br/>hard + soft waits-for edges"]
E --> F{"cycle through<br/>start point?"}
F -- "no cycle" --> G["DS_NO_DEADLOCK<br/>(maybe DS_BLOCKED_BY_AUTOVACUUM)"]
F -- "cycle, all hard edges" --> H["DS_HARD_DEADLOCK<br/>→ abort this transaction"]
F -- "cycle w/ soft edges" --> I["try reversing a soft edge:<br/>TopoSort the wait queue"]
I --> J{"rearrangement<br/>removes all cycles<br/>w/o making new ones?"}
J -- "yes" --> K["DS_SOFT_DEADLOCK:<br/>apply new wait order,<br/>ProcLockWakeup"]
J -- "no (exhausted)" --> H
G --> Z
K --> Z
Figure 4 — Deadlock detection state space. The recursive search
(DeadLockCheckRecurse) tries each soft-edge reversal as a topological-sort
constraint, recursively, until it either finds a queue rearrangement that
eliminates every cycle or proves none exists (hard deadlock). Only then is a
transaction aborted.
FindLockCycleRecurse is the DFS. It marks visited procs; returning to index 0
(the start proc) is a deadlock cycle, while a cycle that does not include the
start point is “somebody else’s deadlock” and ignored (the process that closed
that cycle will detect it):
// FindLockCycleRecurse — src/backend/storage/lmgr/deadlock.c (condensed)for (i = 0; i < nVisitedProcs; i++){ if (visitedProcs[i] == checkProc) { if (i == 0) /* returned to start: real cycle */ { nDeadlockDetails = depth; return true; } return false; /* cycle not through start: ignore */ }}visitedProcs[nVisitedProcs++] = checkProc;The hard-edge scan walks the awaited lock’s procLocks for holders with a
conflicting holdMask; the soft-edge scan walks the wait queue (or a
hypothetical reordered queue) for earlier waiters with conflicting requests:
// FindLockCycleRecurseMember — hard-edge scan (condensed)dlist_foreach(proclock_iter, &lock->procLocks){ PROCLOCK *proclock = dlist_container(PROCLOCK, lockLink, proclock_iter.cur); proc = proclock->tag.myProc; leader = proc->lockGroupLeader == NULL ? proc : proc->lockGroupLeader; if (leader != checkProcLeader) /* never blocks self/group */ for (lm = 1; lm <= numLockModes; lm++) if ((proclock->holdMask & LOCKBIT_ON(lm)) && (conflictMask & LOCKBIT_ON(lm))) { if (FindLockCycleRecurse(proc, depth + 1, softEdges, nSoftEdges)) return true; /* hard edge in a cycle */ /* note an autovacuum directly hard-blocking us, for cancel */ if (checkProc == MyProc && proc->statusFlags & PROC_IS_AUTOVACUUM) blocking_autovacuum_proc = proc; break; }}Three outcomes follow (DeadLockState): DS_NO_DEADLOCK, DS_SOFT_DEADLOCK
(resolved by TopoSort-driven queue rearrangement — see README for the
edge-reversal search), and DS_HARD_DEADLOCK (abort the start-point
transaction via DeadLockReport, which raises ERROR and never returns). A
fourth code, DS_BLOCKED_BY_AUTOVACUUM, is a special “no deadlock, but an
autovacuum worker is in our way” signal — the detector is “abused” (README)
to implement autovacuum’s low locking priority: the blocked backend can cancel
the autovacuum rather than wait.
Two important guarantees the README proves: no deadlock is ever missed by
asynchronous checking (the last process to close a cycle will always detect it
when its own timeout fires), and the soft-edge rearrangement never creates a
new deadlock (it is validated by re-running FindLockCycle before being
applied).
Release and transaction end
Section titled “Release and transaction end”LockRelease decrements the LOCALLOCK count; only when it reaches zero does it
touch shared memory (or clear the fast-path bit). LockReleaseAll, called at
transaction end, sweeps the backend’s LOCALLOCK hash and the fast-path array in
one pass, then ProcLockWakeups each freed object’s queue. Because transaction
locks live to commit/abort, this is where strict-2PL’s shrinking phase happens —
all at once.
Source Walkthrough
Section titled “Source Walkthrough”Anchor on symbol names, not line numbers. The lmgr source moves between releases; function/struct names are the stable handle. Use
git grep -n '<symbol>' src/backend/storage/lmgr/to locate the current position. Line numbers in the table below were observed at commit 273fe94 (REL_18) and are hints only.
Structures (src/include/storage/)
Section titled “Structures (src/include/storage/)”struct LOCKTAG,enum LockTagType(inlock.h) — the 16-byte object key and its 12 object types;SET_LOCKTAG_*macros encode each type.struct LOCK(inlock.h) — per-object record:grantMask,waitMask,granted[]/requested[],procLocks,waitProcs.struct PROCLOCK+PROCLOCKTAG(inlock.h) — per-backend-per-object:holdMask,groupLeader, the twodlist_nodelinks.struct LOCALLOCK+LOCALLOCKTAG(inlock.h) — backend-private cache keyed by (object, mode);nLocks,lockOwners[].enum DeadLockState(inlock.h) — the four detector verdicts.- lock-mode constants
AccessShareLock…AccessExclusiveLock,MaxLockMode(inlockdefs.h). struct PGPROCfast-path + wait fieldsfpRelId/fpLockBits/fpInfoLock,waitLock/waitProcLock/waitLockMode/heldLocks(inproc.h).
Conflict semantics (src/backend/storage/lmgr/lock.c)
Section titled “Conflict semantics (src/backend/storage/lmgr/lock.c)”LockConflicts[]— the 8×8 conflict bit matrix.lock_mode_names[],default_lockmethod,user_lockmethod,LockMethods[]— the lock-method tables.DoLockModesConflict— public conflict test for two modes.LockCheckConflicts— full conflict test with self- and group-subtraction.
Hash + partition (lock.c)
Section titled “Hash + partition (lock.c)”LockTagHashCode— hash of aLOCKTAG; low 4 bits are the partition.proclock_hash/ProcLockHashCode— forces aPROCLOCKinto itsLOCK’s partition.LockHashPartition,LockHashPartitionLock(macros inlock.h).
Acquire / grant (lock.c)
Section titled “Acquire / grant (lock.c)”LockAcquire/LockAcquireExtended— the acquisition state machine.SetupLockInTable— find/create the sharedLOCKandPROCLOCK.GrantLock/GrantLockLocal— record a grant in shared / local state.UnGrantLock/CleanUpLock— the inverse; garbage-collect empty objects.
Fast path (lock.c)
Section titled “Fast path (lock.c)”FastPathGrantRelationLock/FastPathUnGrantRelationLock— manipulate the per-backend array.FastPathTransferRelationLocks— flush all backends’ fast-path entries to the shared table when a strong locker arrives.FastPathStrongRelationLockData/FastPathStrongLockHashPartition— the 1024-way strong-lock counter.BeginStrongLockAcquire/FinishStrongLockAcquire/AbortStrongLockAcquire.
Wait queue (proc.c)
Section titled “Wait queue (proc.c)”JoinWaitQueue— choose queue position; detect the immediate (early) deadlock.ProcSleep— armdeadlock_timeout, block on the semaphore.ProcLockWakeup— wake wakable waiters in arrival order on release.CheckDeadLock— timeout handler: lock all partitions, run the detector.RemoveFromWaitQueue(inlock.c) — dequeue a proc (deadlock victim / cancel).
Deadlock detector (deadlock.c)
Section titled “Deadlock detector (deadlock.c)”DeadLockCheck— entry point; returns aDeadLockState, applies soft fixes.DeadLockCheckRecurse/TestConfiguration— search soft-edge reversals.FindLockCycle/FindLockCycleRecurse/FindLockCycleRecurseMember— the WFG DFS over hard and soft edges.ExpandConstraints/TopoSort— build a hypothetical reordered wait queue.DeadLockReport— raise theERRORthat aborts the victim.InitDeadLockChecking— preallocate worst-case workspace fromMaxBackends.
Wrappers (lmgr.c)
Section titled “Wrappers (lmgr.c)”LockRelationOid/UnlockRelationOid/LockRelation— the relation-lock entry points most of the backend actually calls.
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
LockConflicts[] | lock.c | 65 |
lock_mode_names[] | lock.c | 108 |
LockMethods[] | lock.c | 150 |
FastPathStrongRelationLockData | lock.c | 306 |
LockTagHashCode | lock.c | 556 |
proclock_hash | lock.c | 573 |
DoLockModesConflict | lock.c | 622 |
LockAcquire | lock.c | 808 |
LockAcquireExtended | lock.c | 835 |
SetupLockInTable | lock.c | 1282 |
LockCheckConflicts | lock.c | 1528 |
GrantLock | lock.c | 1657 |
UnGrantLock | lock.c | 1680 |
CleanUpLock | lock.c | 1737 |
LockRelease | lock.c | 2070 |
LockReleaseAll | lock.c | 2275 |
FastPathGrantRelationLock | lock.c | 2750 |
FastPathTransferRelationLocks | lock.c | 2829 |
VirtualXactLock | lock.c | 4713 |
JoinWaitQueue | proc.c | 1173 |
ProcSleep | proc.c | 1342 |
ProcLockWakeup | proc.c | 1772 |
CheckDeadLock | proc.c | 1820 |
DeadLockCheck | deadlock.c | 220 |
DeadLockCheckRecurse | deadlock.c | 312 |
FindLockCycle | deadlock.c | 446 |
FindLockCycleRecurse | deadlock.c | 457 |
FindLockCycleRecurseMember | deadlock.c | 536 |
TopoSort | deadlock.c | 862 |
DeadLockReport | deadlock.c | 1075 |
LockRelationOid | lmgr.c | 107 |
struct LOCKTAG | lock.h | 165 |
struct LOCK | lock.h | 309 |
struct PROCLOCK | lock.h | 370 |
struct LOCALLOCK | lock.h | 427 |
enum DeadLockState | lock.h | 509 |
AccessShareLock … MaxLockMode | lockdefs.h | 36 |
NUM_LOCK_PARTITIONS | lwlock.h | 97 |
FP_LOCK_SLOTS_PER_GROUP | proc.h | 98 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Each entry leads with a fact about the current source at commit 273fe94 (REL_18_STABLE), readable without any external material. The trailing sentence records how it was checked. Open questions follow as recorded gaps.
Verified facts
Section titled “Verified facts”-
There are exactly 8 standard lock modes plus
NoLock(0). Verified inlockdefs.h:AccessShareLock=1…AccessExclusiveLock=8,MaxLockMode=8.MAX_LOCKMODES=10inlock.his the array sizing constant (≥ modes + slack), not the mode count.InplaceUpdateTupleLockis#defined toExclusiveLock, i.e. a reuse, not a 9th mode. -
DEFAULT and USER lock methods share one conflict table. Both
default_lockmethodanduser_lockmethodpoint at the sameLockConflicts[]andlock_mode_names[](lock.c). They differ only in their debug trace flag and in lifetime rules enforced elsewhere; the mode semantics are identical. -
The shared lock hash is partitioned exactly 16 ways.
LOG2_NUM_LOCK_PARTITIONS = 4⇒NUM_LOCK_PARTITIONS = 16(lwlock.h). The partition ishashcode % 16(LockHashPartition), andproclock_hashmixes the proc pointer in above the low 4 bits to keep aPROCLOCKin itsLOCK’s partition. Confirmed by reading both the macro andproclock_hash. -
Fast-path eligibility is strictly the three weak relation modes on an unshared relation in the current database.
EligibleForRelationFastPathrequiresDEFAULT_LOCKMETHOD,LOCKTAG_RELATION,locktag_field1 == MyDatabaseId, andmode < ShareUpdateExclusiveLock.ShareUpdateExclusiveLockis excluded because it is self-conflicting; the comment notes it “does not conflict with any of the locks that do, so we can ignore it completely.” -
The fast-path group count is
FP_LOCK_SLOTS_PER_GROUP = 16slots per group, marked “don’t change”. Verified inproc.h. The number of groups per backend (FastPathLockGroupsPerBackend) is derived at startup frommax_locks_per_transaction, up toFP_LOCK_GROUPS_PER_BACKEND_MAX = 1024. This is a PG18-era refinement: earlier releases had a fixed 16-slot array. -
The strong-lock counter is a 1024-way array, not per-partition-lock.
FAST_PATH_STRONG_LOCK_HASH_PARTITIONS = 1 << 10 = 1024(lock.c), a finer-grained partitioning than the 16 lock-table partitions, so a strong lock on one relation rarely disables the fast path for unrelated relations. -
Deadlock detection is deferred until
deadlock_timeoutand runs under all 16 partition locks.ProcSleeparms the timer;CheckDeadLockacquiresLockHashPartitionLockByIndex(i)for alliin0..15in order before callingDeadLockCheck. Verified by reading both functions. The fixed ascending order is what prevents LWLock deadlock among the partition locks. -
The detector resolves soft cycles by wait-queue rearrangement before ever aborting.
DeadLockCheckreturnsDS_SOFT_DEADLOCKand rewriteslock->waitProcsfromwaitOrders[]when a topological reordering removes all cycles; only an unresolvable (all-hard) cycle yieldsDS_HARD_DEADLOCKand an abort. Verified inDeadLockCheck/DeadLockCheckRecurse. -
LOCKTAG_RELATION_EXTENDconflicts even within a parallel lock group, and can never be in a deadlock cycle.LockCheckConflictsspecial-cases it before the group-subtraction loop;FindLockCycleRecurseMemberreturns early for it. TheIsRelationExtensionLockHeldassert enforces that no other heavyweight lock is taken while holding it, so it cannot extend a wait chain. -
The deadlock detector is reused to implement autovacuum cancellation. A
DS_BLOCKED_BY_AUTOVACUUMreturn withblocking_autovacuum_procset lets the blocked backend signal-cancel a directly-blocking autovacuum worker rather than wait. Verified inFindLockCycleRecurseMember(thePROC_IS_AUTOVACUUMbranch) andGetBlockingAutoVacuumPgproc.
Open questions
Section titled “Open questions”-
Memory-ordering correctness of the fast-path/strong-lock handshake under weak memory models. The
READMEargues the LWLock-as-sequence-point guarantee on SMP, but the exact set of barriers and whether anything beyond LWLock acquire/release is relied upon on ARM/POWER is not traced here. Investigation path: readBeginStrongLockAcquire/FastPathTransferRelationLocksalongside the LWLock memory-barrier semantics inpostgres-lwlock-spinlock.md. -
Worst-case cost of soft-edge rearrangement search.
DeadLockCheckRecurserecursively tries soft-edge reversals; theREADMEclaims combinatorial blowup “shouldn’t be a big problem in practice” because cycles are few and small, but the actual bound as a function of wait-queue length and group size is not established. Investigation path: instrumentnCurConstraints/nPossibleConstraintsunder a synthetic many-way soft-deadlock workload. -
Interaction of group locking with non-relation locks under future parallel writes. The
READMEnotes that declaring intra-group locks non-conflicting is safe only because parallel mode is currently read-only, and flags GIN page locks as a hazard if parallel writes are enabled. Whether REL_18 has tightened any of this is unverified here. Investigation path: cross-check againstaccess/transam/README.paralleland any PG18 parallel-write work.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”Pointers, not analysis. Each bullet is a handle for a follow-up doc.
-
A Scalable Lock Manager for Multicores (Jung, Han, Fekete, Heiser, Yeom, SIGMOD 2013) — the anchor paper. It diagnoses MySQL’s throughput collapse as lock-table latch contention and proposes staged lock allocation: split allocation/de-allocation (done in bulk, asynchronously, as garbage collection) from manipulation (done latch-free with RAW memory barriers). PostgreSQL attacks the same bottleneck from the opposite end — partitioning plus a fast path that avoids the shared table for weak locks — rather than making the shared table itself latch-free. A side-by-side study of “avoid the table” vs. “make the table latch-free” would quantify what each gives up.
-
CUBRID’s heavyweight lock manager (
cubrid-lock-manager.md) — CUBRID useslk_res/lk_entryrecords analogous to PG’sLOCK/PROCLOCK, but takes row-level locks for SERIALIZABLE (first-updater-wins) where PG leans on MVCC plus SSI predicate locks. CUBRID’s victim policy is “most-recently-blocked”; PG aborts the cycle-closing process. A comparison of the two deadlock-victim policies and granularity choices is a natural follow-up. -
SSI predicate locking (
postgres-ssi-predicate-locking.md; Cahill 2008, Ports & Grittner 2012) — the other lock manager. SIReadLocks are a separate subsystem (README-SSI) that detects rw-antidependency cycles to make SI serializable, complementing (not replacing) the heavyweight manager analyzed here. The boundary — what each subsystem is responsible for — is worth a dedicated cross-walk. -
Wait-die / wound-wait timestamp ordering (Database System Concepts §18.2) — deadlock prevention schemes that PostgreSQL rejected in favor of detection. Yu et al., Staring into the Abyss (VLDB 2015), benchmarks detection vs. prevention at 1000 cores; relevant if PG’s all-partition-lock detection becomes a scaling limit.
-
Lock-table-free / optimistic designs (Hekaton, Silo, Cicada) — in-memory engines that replace the central lock table with per-record metadata or optimistic validation. Orthogonal to PG’s disk-resident design but instructive for separating costs intrinsic to locking from costs intrinsic to a centralized lock table.
Sources
Section titled “Sources”In-tree README
Section titled “In-tree README”src/backend/storage/lmgr/README— the authoritative design doc: lock data structures, internal partitioned locking, fast-path locking, the deadlock algorithm (hard/soft edges, topological-sort rearrangement), group locking, user locks, hot-standby locking.
PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)
Section titled “PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)”src/backend/storage/lmgr/lock.c— the primary lock mechanism.src/backend/storage/lmgr/proc.c— wait queue,ProcSleep,CheckDeadLock.src/backend/storage/lmgr/deadlock.c— the WFG detector and soft-edge solver.src/backend/storage/lmgr/lmgr.c— relation-lock wrappers.src/include/storage/lock.h,src/include/storage/lockdefs.h— structures and lock-mode constants.src/include/storage/proc.h,src/include/storage/lwlock.h— PGPROC fast-path/wait fields and partition constants.
Papers (under knowledge/research/dbms-papers/)
Section titled “Papers (under knowledge/research/dbms-papers/)”scalable-lock-manager.md— Jung et al., A Scalable Lock Manager for Multicores, SIGMOD 2013.
Textbook chapters (under knowledge/research/dbms-general/)
Section titled “Textbook chapters (under knowledge/research/dbms-general/)”- Database System Concepts (Silberschatz et al., 7e), ch. 18 “Concurrency Control” — §18.1 lock-based protocols & 2PL, §18.2 deadlock handling (prevention vs. detection, waits-for graph), §18.3 multiple granularity & intention locks.
Sibling docs (cross-references, mechanism not duplicated here)
Section titled “Sibling docs (cross-references, mechanism not duplicated here)”postgres-lwlock-spinlock.md— the lightweight locks the partition latches andfpInfoLockare built from.postgres-ssi-predicate-locking.md— the separate SIReadLock subsystem.postgres-shared-memory-ipc.md— where the shared hashes and PGPROC array live.postgres-backend-lifecycle.md— the ResourceOwner tree behindLOCALLOCK.lockOwnersandLockReleaseAll.