Skip to content

PostgreSQL LWLock and Spinlock — Atomic State Word, Wait-Queue Protocol, and Platform TAS

Contents:

Every database engine that stores its working state in shared memory must solve one coordination problem: when multiple processes read and write the same memory location concurrently, how do you prevent torn reads and lost writes? The answer is mutual exclusion — the guarantee that at most one writer (or, for read-optimized variants, any number of readers but no concurrent writer) holds access at one time.

Mutual exclusion is studied from two angles in classical concurrency theory.

Hardware primitives. The foundation is a test-and-set (TAS) instruction: atomically read a memory word and write 1 to it (or swap in a sentinel), returning the old value. If the old value was 0 the caller has acquired the lock; if it was already 1, someone else holds it. Modern CPUs generalize TAS into compare-and-exchange (CAS) and fetch-and-add (FAA), which are strictly more expressive. Database Internals (Petrov, ch. 8 “Introduction to Distributed Transactions”) notes that hardware atomic operations are the bedrock of all higher-level synchronization. In a single-host shared-memory engine like PostgreSQL, native CAS on a 32-bit word is the critical primitive.

Blocking vs. spinning. Once a caller discovers the lock is held it must decide whether to spin (poll in a tight loop, consuming CPU) or block (relinquish the CPU to the OS scheduler and wait to be woken up). Spinning is optimal for very short waits — the round-trip to the kernel and back is far more expensive than a few wasted cycles. Blocking is mandatory for any wait that could last longer than a timeslice. Most practical implementations combine both: spin for a short burst, then block. This two-tier strategy underlies both lock tiers in PostgreSQL.

Reader-writer locks. When the shared resource is read far more often than it is written, pure mutual exclusion wastes parallelism by serializing reads that could proceed concurrently. A reader-writer lock (a/k/a shared-exclusive lock) distinguishes shared mode (many readers at once, no writer) from exclusive mode (one writer, no readers). Textbook designs encode state as separate shared and exclusive counters, protected by a small internal spinlock. The challenge is doing so with minimal contention — taking an internal spinlock to acquire a shared outer lock is counter-productive when the shared lock is uncontended, because all readers then fight over the internal spinlock even though none of them are conflicting.

This last insight — the internal spinlock bottleneck — is precisely the problem the current LWLock implementation was designed to solve. The file comment in lwlock.c describes the old approach:

// lwlock.c (file header comment)
// This used to be a pretty straight forward reader-writer lock
// implementation, in which the internal state was protected by a
// spinlock. Unfortunately the overhead of taking the spinlock proved to be
// too high for workloads/locks that were taken in shared mode very
// frequently. Often we were spinning in the (obviously exclusive) spinlock,
// while trying to acquire a shared lock that was actually free.

The redesign — an atomic word that encodes all state — eliminates the internal spinlock entirely for the uncontended fast path. Readers acquire a shared lock with a single CAS and never touch a spinlock.

Shared-memory databases universally need at least two locking tiers:

Tier 1 — spinlocks (microseconds). Used only around a handful of instructions: incrementing a counter, updating a flag, reading a pair of values that must be seen together. The lock must never be held across a system call or a non-trivial subroutine. The PostgreSQL README for storage/lmgr/ states: “If a lock is to be held more than a few dozen instructions, or across any sort of kernel call (or even a call to a nontrivial subroutine), don’t use a spinlock.” No deadlock detection, no automatic release, no timeout beyond “stuck for ~1 minute = fatal.”

Tier 2 — lightweight read-write locks (microseconds to milliseconds). Protect shared data structures accessed for the duration of a small operation — traversing a buffer-mapping hash bucket, reading a catalog-cache entry, writing a WAL record. Lightweight locks support shared and exclusive modes, integrate with the engine’s process wait mechanism (so waiters block on a semaphore rather than burning CPU), and are automatically released during error recovery.

Tier 3 — heavyweight locks (milliseconds to seconds). The regular lock manager (lock.c) handles SQL-visible concurrency — row locks, table locks, LOCK TABLE, deadlock detection. It depends on tier-2 locks to protect its own data structures.

The key design tension for tier-2 locks is the uncontended fast path. In a database server under load, the same LWLock may be acquired and released thousands of times per second by many backends, and the vast majority of acquisitions find the lock free. The design must make the common case (uncontended shared or exclusive acquisition) as cheap as possible. The classic reader-writer lock fails this by requiring a spinlock even when the shared lock is trivially available.

The modern approach (used by PostgreSQL, Linux rwlock, Java StampedLock, etc.) packs all state into a single atomic word and uses CAS for the fast path. The word encodes: (a) whether an exclusive holder is present (a sentinel bit or counter region), (b) the number of shared holders (a counter region), and (c) flags for the wait-queue state. A shared acquisition in the uncontended case is a single atomic fetch-and-add; an exclusive acquisition is a CAS that swaps in the sentinel. No internal spinlock is needed for these paths.

PostgreSQL defines a strict separation:

  • Spinlock (slock_t, SpinLockAcquire/SpinLockRelease) — busy-wait only, a few instructions, implemented as a hardware TAS (test-and-set or compare-and-swap) on a single byte or int. The API is a set of macros in spin.h that delegate to platform-specific assembly in s_lock.h.
  • LWLock (LWLock, LWLockAcquire/LWLockRelease) — shared and exclusive modes, OS-block on contention via PGSemaphore, automatic release on error, integrated with the wait-event reporting system.

Spinlocks are used inside the LWLock implementation (specifically for the wait-list mutex, LW_FLAG_LOCKED) and throughout the engine for the narrowest critical sections (e.g., ShmemLock protecting the LWLock tranche counter, ProcGlobal->lock protecting global PGPROC state). LWLocks are used for everything that needs a shared mode or may block for more than a handful of instructions.

The core of the LWLock design is a single pg_atomic_uint32 field lock->state that encodes everything relevant about the lock’s current state:

// state bit layout — lwlock.c
#define LW_FLAG_HAS_WAITERS ((uint32) 1 << 31) // waiters in queue
#define LW_FLAG_RELEASE_OK ((uint32) 1 << 30) // ok to wake waiters now
#define LW_FLAG_LOCKED ((uint32) 1 << 29) // wait-list spinlock
// LW_VAL_EXCLUSIVE = MAX_BACKENDS + 1 (a power-of-2; exclusive sentinel)
// LW_VAL_SHARED = 1 (each shared holder adds 1)
// LW_SHARED_MASK = MAX_BACKENDS (bits 0..N covering shared count)
// LW_LOCK_MASK = MAX_BACKENDS | LW_VAL_EXCLUSIVE

The upper 3 bits (29–31) are flag bits; bits 0 through LOG2_NUM_LOCK_PARTITIONS-ish are the lock-holder region. The exclusive sentinel LW_VAL_EXCLUSIVE is MAX_BACKENDS + 1. Because MAX_BACKENDS is always a power-of-2 minus 1, LW_VAL_EXCLUSIVE = MAX_BACKENDS + 1 is a power of 2 and occupies exactly one bit above the shared count range. This ensures the exclusive sentinel cannot be confused with any possible shared-holder count.

bit 31: HAS_WAITERS — someone is sleeping on this lock
bit 30: RELEASE_OK — release is allowed to wake up waiters
bit 29: LOCKED — wait-list spinlock (internal)
bits 0..27: lock-holder region
if (state & LW_VAL_EXCLUSIVE) != 0 → held exclusively
else (state & LW_SHARED_MASK) → number of shared holders

The static assertions in lwlock.c (around line 109) verify at compile time that MAX_BACKENDS + 1 is a power of 2, that MAX_BACKENDS and LW_FLAG_MASK do not overlap, and that LW_VAL_EXCLUSIVE and LW_FLAG_MASK do not overlap.

// LWLock — src/include/storage/lwlock.h
typedef struct LWLock
{
uint16 tranche; /* tranche ID for wait event naming */
pg_atomic_uint32 state; /* atomic state word */
proclist_head waiters; /* list of waiting PGPROCs */
} LWLock;

The struct is small — one uint16, one atomic uint32, and a two-word list head. Each lock is padded to LWLOCK_PADDED_SIZE (= one cache line) in LWLockPadded to prevent false sharing.

The tranche is a name group for wait-event reporting. All 128 BufferContent locks share the tranche LWTRANCHE_BUFFER_CONTENT; all 16 LockManager partition locks share LWTRANCHE_LOCK_MANAGER. Named tranches (RequestNamedLWLockTranche / GetNamedLWLockTranche) let extensions allocate arrays of LWLocks at postmaster startup with user-visible wait-event names. The tranche ID doubles as the wait-event ID reported to pg_stat_activity.

The main lock array MainLWLockArray is allocated once in shared memory by CreateLWLocks and holds all fixed (individually-named) locks plus the per-partition groups. It is sized by LWLockShmemSize.

The shared-memory layout produced by CreateLWLocks / InitializeLWLocks is a single contiguous block. A 4-byte dynamic-tranche-ID counter sits just before MainLWLockArray (read by LWLockNewTrancheId); the array proper begins on a LWLOCK_PADDED_SIZE (cache-line) boundary. The fixed region is laid out by the offset macros in lwlock.h (BUFFER_MAPPING_LWLOCK_OFFSET, LOCK_MANAGER_LWLOCK_OFFSET, PREDICATELOCK_MANAGER_LWLOCK_OFFSET), so any backend can index a partition group by adding a constant offset. Named-tranche locks (from RequestNamedLWLockTranche) follow the fixed NUM_FIXED_LWLOCKS region, and the NamedLWLockTranche descriptor array plus the tranche name strings are packed at the tail.

flowchart TD
    subgraph SHM["shared memory block from ShmemAlloc (CreateLWLocks)"]
        CTR["int LWLockCounter<br/>(dynamic tranche-ID counter,<br/>init = LWTRANCHE_FIRST_USER_DEFINED)"]
        subgraph ARR["MainLWLockArray : LWLockPadded[] (cache-line aligned)"]
            IND["[0 .. NUM_INDIVIDUAL_LWLOCKS)<br/>individual named locks<br/>(each its own tranche)"]
            BUF["BUFFER_MAPPING_LWLOCK_OFFSET<br/>128 locks, LWTRANCHE_BUFFER_MAPPING"]
            LCK["LOCK_MANAGER_LWLOCK_OFFSET<br/>16 locks, LWTRANCHE_LOCK_MANAGER"]
            PRD["PREDICATELOCK_MANAGER_LWLOCK_OFFSET<br/>16 locks, LWTRANCHE_PREDICATE_LOCK_MANAGER"]
            NAMED["[NUM_FIXED_LWLOCKS ..)<br/>named-tranche locks<br/>(RequestNamedLWLockTranche)"]
        end
        DESC["NamedLWLockTranche[]<br/>(trancheId, trancheName)"]
        STR["tranche name strings (char)"]
    end
    CTR --> IND
    IND --> BUF --> LCK --> PRD --> NAMED
    NAMED --> DESC --> STR

    L["LWLock { uint16 tranche;<br/>pg_atomic_uint32 state;<br/>proclist_head waiters }"] -.padded to.-> PAD["LWLockPadded (= PG_CACHE_LINE_SIZE)<br/>one lock per cache line, no false sharing"]
    NAMED -.each element is a.-> L

Each LWLockPadded element is a union of the bare LWLock and a char pad[LWLOCK_PADDED_SIZE], so every lock occupies a full cache line. The tranche field inside each LWLock is the index into BuiltinTrancheNames[] (or the dynamic LWLockTrancheNames[]), which is how LWLockReportWaitStart resolves a wait event to a user-visible name in pg_stat_activity.

Acquisition protocol — the two-attempt race closure

Section titled “Acquisition protocol — the two-attempt race closure”

The uncontended shared path is a single atomic CAS. The comment in lwlock.c explains why a single attempt is not enough for the contended case:

// LWLockAcquire — lwlock.c (simplified narrative)
// Phase 1: Try to do it atomically, if we succeed, nice
// Phase 2: Add ourselves to the waitqueue of the lock
// Phase 3: Try to grab the lock again; if we succeed, remove ourselves
// Phase 4: Sleep till wake-up, goto Phase 1

The race being closed: a naïve implementation might try once, fail, then enqueue itself — but by the time enqueuing completes, the lock could already have been released and no one will wake the new waiter up. The two-attempt protocol prevents this: after LWLockQueueSelf places the caller in the waiter list, a second LWLockAttemptLock is executed. If the second attempt succeeds, the caller calls LWLockDequeueSelf to remove itself cleanly. If it fails, the caller is guaranteed to be woken up because the releaser will see the enqueued waiter when it processes the wait list.

// LWLockAcquire — lwlock.c
bool
LWLockAcquire(LWLock *lock, LWLockMode mode)
{
// ...
HOLD_INTERRUPTS();
for (;;)
{
mustwait = LWLockAttemptLock(lock, mode); /* Phase 1 */
if (!mustwait)
break;
LWLockQueueSelf(lock, mode); /* Phase 2 */
mustwait = LWLockAttemptLock(lock, mode); /* Phase 3 */
if (!mustwait)
{
LWLockDequeueSelf(lock);
break;
}
/* Phase 4 */
PGSemaphoreLock(proc->sem); /* block in OS */
// ... loop back
}
held_lwlocks[num_held_lwlocks++] = {lock, mode};
return result; /* true = acquired without waiting */
}

HOLD_INTERRUPTS() is called before the first attempt; RESUME_INTERRUPTS() is called in LWLockRelease. This ensures cancel/die signals cannot interrupt a backend that holds an LWLock.

The flow below traces a single LWLockAcquire call: the fast path is the left spine (LWLockAttemptLock succeeds on the first try, the overwhelmingly common case under load), and the slow path is the enqueue/recheck/block loop that closes the enqueue-then-miss race. Note that the block point (PGSemaphoreLock) is reached only after a second LWLockAttemptLock has also failed while the caller is already queued — which is exactly the condition that guarantees a future releaser will see the waiter and call LWLockWakeup.

flowchart TD
    A["LWLockAcquire(lock, mode)"] --> B["HOLD_INTERRUPTS()"]
    B --> C{"num_held_lwlocks &gt;= MAX_SIMUL_LWLOCKS?"}
    C -->|yes| Cerr["elog(ERROR, too many LWLocks taken)"]
    C -->|no| D["LWLockAttemptLock(lock, mode)<br/>Phase 1: single CAS on lock-&gt;state"]
    D --> E{"mustwait?"}
    E -->|no| Z["record in held_lwlocks[]<br/>return result"]
    E -->|yes| F["LWLockQueueSelf(lock, mode)<br/>Phase 2: enqueue MyProc under LW_FLAG_LOCKED"]
    F --> G["LWLockAttemptLock(lock, mode)<br/>Phase 3: retry CAS, now guaranteed visible"]
    G --> H{"mustwait?"}
    H -->|no| I["LWLockDequeueSelf(lock)<br/>undo the enqueue"]
    I --> Z
    H -->|yes| J["LWLockReportWaitStart()<br/>Phase 4: block"]
    J --> K["PGSemaphoreLock(proc-&gt;sem)"]
    K --> L{"proc-&gt;lwWaiting == LW_WS_NOT_WAITING?"}
    L -->|no, spurious| K
    L -->|yes, woken| M["pg_atomic_fetch_or(state, LW_FLAG_RELEASE_OK)<br/>result = false"]
    M --> D
    Z --> Y["caller runs critical section ... eventually LWLockRelease"]
    Y --> R["LWLockReleaseInternal:<br/>pg_atomic_sub_fetch_u32(state, LW_VAL_*)"]
    R --> S{"HAS_WAITERS &amp; RELEASE_OK set<br/>AND LW_LOCK_MASK == 0?"}
    S -->|no| T["RESUME_INTERRUPTS(); done"]
    S -->|yes| U["LWLockWakeup(lock):<br/>walk waiters, mark LW_WS_PENDING_WAKEUP,<br/>PGSemaphoreUnlock each"]
    U --> T

LWLockAttemptLock — the atomic fast path

Section titled “LWLockAttemptLock — the atomic fast path”
// LWLockAttemptLock — lwlock.c
static bool
LWLockAttemptLock(LWLock *lock, LWLockMode mode)
{
uint32 old_state = pg_atomic_read_u32(&lock->state);
while (true)
{
uint32 desired_state = old_state;
bool lock_free;
if (mode == LW_EXCLUSIVE)
{
lock_free = (old_state & LW_LOCK_MASK) == 0;
if (lock_free) desired_state += LW_VAL_EXCLUSIVE;
}
else
{
lock_free = (old_state & LW_VAL_EXCLUSIVE) == 0;
if (lock_free) desired_state += LW_VAL_SHARED;
}
// CAS always executes (acts as memory barrier even on failure)
if (pg_atomic_compare_exchange_u32(&lock->state,
&old_state, desired_state))
return !lock_free; /* false = acquired; true = must wait */
}
}

For shared acquisition: if no exclusive holder, increment the shared count. For exclusive acquisition: if no holder at all (neither exclusive nor any shared), swap in LW_VAL_EXCLUSIVE. The CAS is always issued even when lock_free is false — this doubles as a full memory barrier, and the implementation comment explicitly notes that benchmarks did not show a benefit from skipping it.

The wait list (lock->waiters, a proclist) is not itself lock-free. A tiny spinlock bit LW_FLAG_LOCKED (bit 29 of state) serializes enqueue and dequeue. LWLockWaitListLock acquires it with a fetch-or, spins using perform_spin_delay if busy, and LWLockWaitListUnlock clears it with a fetch-and. This is the only place spinlock semantics appear inside the LWLock code.

// LWLockWaitListLock — lwlock.c
static void
LWLockWaitListLock(LWLock *lock)
{
while (true)
{
old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
if (!(old_state & LW_FLAG_LOCKED))
break; /* got lock */
// spin using perform_spin_delay until cleared
}
}

LWLockRelease calls LWLockReleaseInternal after removing the lock from the held-locks array. The release atomically decrements the state word:

// LWLockReleaseInternal — lwlock.c
static void
LWLockReleaseInternal(LWLock *lock, LWLockMode mode)
{
if (mode == LW_EXCLUSIVE)
oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_EXCLUSIVE);
else
oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
// Wake waiters only when:
// HAS_WAITERS is set AND RELEASE_OK is set AND no holder remains
if ((oldstate & (LW_FLAG_HAS_WAITERS | LW_FLAG_RELEASE_OK)) ==
(LW_FLAG_HAS_WAITERS | LW_FLAG_RELEASE_OK) &&
(oldstate & LW_LOCK_MASK) == 0)
LWLockWakeup(lock);
}

LWLockWakeup acquires the wait-list spinlock, walks lock->waiters, marks selected waiters as LW_WS_PENDING_WAKEUP, releases the wait-list spinlock, then calls PGSemaphoreUnlock on each waiter’s semaphore. Multiple shared waiters can be woken at once; the first exclusive waiter stops the walk.

The RELEASE_OK flag prevents double-wakeup. When LWLockWakeup moves a waiter to LW_WS_PENDING_WAKEUP, it clears LW_FLAG_RELEASE_OK so no further wakeup round fires until the woken process sets it again (pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_RELEASE_OK) at the top of the retry loop in LWLockAcquire).

LWLockWaitForVar — condition-variable variant

Section titled “LWLockWaitForVar — condition-variable variant”

Beyond the standard acquire/release pair, LWLock offers a third mode: LWLockWaitForVar / LWLockUpdateVar. A caller can hold a lock in exclusive mode and broadcast a variable update; any number of waiters block on the lock with LW_WAIT_UNTIL_FREE and wake when the variable changes. This is the mechanism used by WALInsert to let backends wait for the WAL flush LSN to advance without acquiring the WAL insert lock.

// LWLockUpdateVar — lwlock.c
// Caller must hold lock in exclusive mode.
void
LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
{
pg_atomic_exchange_u64(valptr, val); /* full barrier */
// wake all LW_WAIT_UNTIL_FREE waiters
LWLockWaitListLock(lock);
// ... move front-of-queue waiters to wakeup list
LWLockWaitListUnlock(lock);
// ... PGSemaphoreUnlock each
}

LW_WAIT_UNTIL_FREE waiters are placed at the front of the wait queue so that variable-change notifications are delivered promptly and are not blocked behind regular lock waiters.

The spinlock API in PostgreSQL is a thin macro shell over platform-specific TAS:

spin.h
#define SpinLockInit(lock) S_INIT_LOCK(lock)
#define SpinLockAcquire(lock) S_LOCK(lock)
#define SpinLockRelease(lock) S_UNLOCK(lock)
#define SpinLockFree(lock) S_LOCK_FREE(lock)

On x86-64, S_LOCK expands to s_lock() which loops on TAS_SPIN. TAS_SPIN does a non-locking load check first (*(lock) ? 1 : TAS(lock)) to avoid asserting the bus lock when the lock is obviously held — a technique documented in Intel’s “Implementing Scalable Atomic Locks” whitepaper. The TAS macro itself issues a lock; xchgb instruction:

// tas() — src/include/storage/s_lock.h (x86_64 section)
static __inline__ int
tas(volatile slock_t *lock)
{
slock_t _res = 1;
__asm__ __volatile__(
" lock \n"
" xchgb %0,%1 \n"
: "+q"(_res), "+m"(*lock)
: /* no inputs */
: "memory", "cc");
return (int) _res;
}

On ARM64, the equivalent is __sync_lock_test_and_set (GCC builtin for LDREX/STREX or LDXR/STXR), with ISB as the spin delay. On platforms with no hardware TAS, PostgreSQL falls back to a semop-based spinlock.

The portable wait loop in s_lock.c adapts its spin count:

// s_lock / perform_spin_delay — s_lock.c
int
s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
{
SpinDelayStatus delayStatus;
init_spin_delay(&delayStatus, file, line, func);
while (TAS_SPIN(lock))
perform_spin_delay(&delayStatus);
finish_spin_delay(&delayStatus);
return delayStatus.delays;
}

perform_spin_delay calls SPIN_DELAY() (the CPU hint — rep; nop on x86) every iteration. After spins_per_delay spins it escalates to pg_usleep with a randomly-increasing delay (1 ms to 1 s) to avoid starvation under heavy contention. The spins_per_delay value is a per-process heuristic: it is increased rapidly when the lock is acquired without sleeping (indicating a multiprocessor environment where spinning pays), and decreased slowly when sleeping is needed (indicating a uniprocessor or overloaded system). set_spins_per_delay / update_spins_per_delay propagate the estimate between backends via shared memory on startup and exit.

Both SpinLockAcquire and LWLockAcquire are documented to hold off cancel/die interrupts for the duration of the lock hold. The lmgr/README notes: “Acquisition of either a spinlock or a lightweight lock causes query cancel and die() interrupts to be held off until all such locks are released.” This is implemented by HOLD_INTERRUPTS() / RESUME_INTERRUPTS() in LWLockAcquire / LWLockRelease and by the spinlock macros themselves. The implication is that no CHECK_FOR_INTERRUPTS() is possible while any spinlock is held, and the critical section must be kept extremely short.

LWLockReleaseAll is called by the error-recovery path (triggered by ereport(ERROR)). It iterates the process-local held_lwlocks[] array and releases every held lock in reverse acquisition order. This is why the held_lwlocks array is maintained — it is the error-recovery release list, not merely a debugging aid. Spinlocks, by contrast, have no automatic release on error; a spinlock held across a longjmp would deadlock the server.

SymbolKindFileRole
slock_ttypedefs_lock.hPlatform-specific spinlock word (typically unsigned char on x86, int on ARM)
TASmacros_lock.hSingle atomic test-and-set; returns 0 on success
TAS_SPINmacros_lock.hLike TAS but checks without bus-lock first (x86_64 / ARM64)
SPIN_DELAYmacros_lock.hCPU hint inside spin loop (rep; nop / isb)
SpinLockAcquiremacrospin.hPublic API — delegates to S_LOCK
SpinLockReleasemacrospin.hPublic API — delegates to S_UNLOCK
s_lockfunctions_lock.cPlatform-independent spin-with-backoff loop
perform_spin_delayfunctions_lock.cOne spin iteration: CPU hint, count spins, escalate to pg_usleep
finish_spin_delayfunctions_lock.cAdjust spins_per_delay heuristic after acquisition
set_spins_per_delayfunctions_lock.cCopy shared estimate into process-local spins_per_delay at startup
update_spins_per_delayfunctions_lock.cBlend process-local estimate back to shared at exit
SymbolKindFileRole
LWLockstructlwlock.hLock object: tranche (uint16), state (atomic uint32), waiters (proclist)
LWLockPaddedunionlwlock.hLWLock padded to one cache line to prevent false sharing
LWLockModeenumlwlock.hLW_EXCLUSIVE, LW_SHARED, LW_WAIT_UNTIL_FREE
LWLockWaitStateenumlwlock.hLW_WS_NOT_WAITING, LW_WS_WAITING, LW_WS_PENDING_WAKEUP
NamedLWLockTranchestructlwlock.hName + tranche ID for an extension-allocated lock array
LWLockHandlestructlwlock.cProcess-local record of a held lock: {lock *, mode}
MainLWLockArraygloballwlock.cShared-memory base of all fixed + named-tranche LWLocks
held_lwlocks[]static arraylwlock.cPer-process held-lock list (max 200 simultaneous)
LW_FLAG_HAS_WAITERSmacrolwlock.cBit 31 of state
LW_FLAG_RELEASE_OKmacrolwlock.cBit 30 of state
LW_FLAG_LOCKEDmacrolwlock.cBit 29 of state (wait-list spinlock)
LW_VAL_EXCLUSIVEmacrolwlock.cExclusive sentinel = MAX_BACKENDS + 1
LW_VAL_SHAREDmacrolwlock.cShared increment = 1
SymbolKindFileRole
LWLockShmemSizefunctionlwlock.cCompute shmem bytes for all locks + tranche names
CreateLWLocksfunctionlwlock.cPostmaster: allocate MainLWLockArray and named-tranche arrays in shmem
InitializeLWLocksfunctionlwlock.cInitialize all fixed and named-tranche locks via LWLockInitialize
LWLockInitializefunctionlwlock.cSet state = LW_FLAG_RELEASE_OK, bind tranche ID
RequestNamedLWLockTranchefunctionlwlock.cExtension hook: request N locks by name before postmaster starts
GetNamedLWLockTranchefunctionlwlock.cRetrieve base pointer for a named tranche array
LWLockRegisterTranchefunctionlwlock.cPer-process: register tranche ID → name for wait-event lookup
LWLockNewTrancheIdfunctionlwlock.cAllocate a fresh tranche ID from MainLWLockArray[-1] counter (protected by ShmemLock)
LWLockAttemptLockfunctionlwlock.cSingle CAS attempt; returns true = must wait
LWLockQueueSelffunctionlwlock.cAdd MyProc to wait list under LW_FLAG_LOCKED
LWLockDequeueSelffunctionlwlock.cRemove MyProc from wait list if still there
LWLockAcquirefunctionlwlock.cFull blocking acquire; two-attempt protocol
LWLockConditionalAcquirefunctionlwlock.cNon-blocking acquire; returns false immediately if lock busy
LWLockAcquireOrWaitfunctionlwlock.cAcquire if free; wait (without acquiring) if busy — used for WALWriteLock
LWLockWaitListLockfunctionlwlock.cAcquire wait-list spinlock (bit 29)
LWLockWakeupfunctionlwlock.cWalk waiters, move eligible to wakeup, call PGSemaphoreUnlock
LWLockReleaseInternalfunctionlwlock.cAtomic decrement + conditional LWLockWakeup
LWLockReleasefunctionlwlock.cPublic release: disown + LWLockReleaseInternal + RESUME_INTERRUPTS
LWLockReleaseAllfunctionlwlock.cError recovery: release all entries in held_lwlocks[]
LWLockWaitForVarfunctionlwlock.cBlock until *valptr != oldval or lock is released
LWLockUpdateVarfunctionlwlock.cAtomically update *valptr and wake LW_WAIT_UNTIL_FREE waiters

Position hints (commit 273fe94, 2026-06-05)

Section titled “Position hints (commit 273fe94, 2026-06-05)”
SymbolFileLine
LW_FLAG_* / LW_VAL_* macrossrc/backend/storage/lmgr/lwlock.c94–106
LWLock structsrc/include/storage/lwlock.h41–50
LWLockPadded / LWLOCK_PADDED_SIZEsrc/include/storage/lwlock.h62–72
LWLockMode enumsrc/include/storage/lwlock.h112–119
BuiltinTrancheIds enumsrc/include/storage/lwlock.h181–225
*_LWLOCK_OFFSET / NUM_FIXED_LWLOCKS macrossrc/include/storage/lwlock.h104–110
LWLockShmemSizesrc/backend/storage/lmgr/lwlock.c432
CreateLWLockssrc/backend/storage/lmgr/lwlock.c462
InitializeLWLockssrc/backend/storage/lmgr/lwlock.c502
LWLockNewTrancheIdsrc/backend/storage/lmgr/lwlock.c615
LWLockReportWaitStartsrc/backend/storage/lmgr/lwlock.c737
GetNamedLWLockTranchesrc/backend/storage/lmgr/lwlock.c585
LWLockRegisterTranchesrc/backend/storage/lmgr/lwlock.c640
RequestNamedLWLockTranchesrc/backend/storage/lmgr/lwlock.c682
LWLockInitializesrc/backend/storage/lmgr/lwlock.c719
LWLockAttemptLocksrc/backend/storage/lmgr/lwlock.c796
LWLockWaitListLocksrc/backend/storage/lmgr/lwlock.c867
LWLockWakeupsrc/backend/storage/lmgr/lwlock.c932
LWLockQueueSelfsrc/backend/storage/lmgr/lwlock.c1048
LWLockDequeueSelfsrc/backend/storage/lmgr/lwlock.c1091
LWLockAcquiresrc/backend/storage/lmgr/lwlock.c1180
LWLockConditionalAcquiresrc/backend/storage/lmgr/lwlock.c1351
LWLockAcquireOrWaitsrc/backend/storage/lmgr/lwlock.c1408
LWLockWaitForVarsrc/backend/storage/lmgr/lwlock.c1596
LWLockUpdateVarsrc/backend/storage/lmgr/lwlock.c1732
LWLockReleaseInternalsrc/backend/storage/lmgr/lwlock.c1836
LWLockReleasesrc/backend/storage/lmgr/lwlock.c1900
s_locksrc/backend/storage/lmgr/s_lock.c98
perform_spin_delaysrc/backend/storage/lmgr/s_lock.c126
finish_spin_delaysrc/backend/storage/lmgr/s_lock.c186
set_spins_per_delaysrc/backend/storage/lmgr/s_lock.c207
update_spins_per_delaysrc/backend/storage/lmgr/s_lock.c218
tas (x86_64)src/include/storage/s_lock.h214
SpinLockAcquire macrosrc/include/storage/spin.h59

Verified against /data/hgryoo/references/postgres, branch REL_18_STABLE, commit 273fe94.

Confirmed:

  • LWLock struct layout (tranche + state + waiters) verified in lwlock.h:41–50.
  • State bit macros and StaticAssertDecl checks verified in lwlock.c:94–116.
  • LWLockAttemptLock CAS loop (shared and exclusive paths) verified in lwlock.c:796–856.
  • Two-attempt protocol (LWLockAcquireLWLockQueueSelf → second LWLockAttemptLock) verified in lwlock.c:1180–1341.
  • LWLockReleaseInternal atomic decrement + LWLockWakeup guard condition verified in lwlock.c:1836–1877.
  • LWLockWakeup wait-list traversal, LW_WS_PENDING_WAKEUP transition, pg_write_barrier(), PGSemaphoreUnlock verified in lwlock.c:932–1040.
  • LWLockWaitForVar / LWLockUpdateVar mechanism (front-of-queue LW_WAIT_UNTIL_FREE, pg_atomic_exchange_u64 barrier) verified in lwlock.c:1596–1795.
  • LWLockAcquireOrWait semantics (acquire if free, wait without acquiring if busy; documented for WALWriteLock) verified in lwlock.c:1393–1408 comment + 1408–1594 body.
  • s_lock adaptive spin loop and spins_per_delay heuristic verified in s_lock.c.
  • x86_64 lock; xchgb TAS and TAS_SPIN non-locking pre-check verified in s_lock.h:196–241.
  • ARM64 __sync_lock_test_and_set + ISB delay verified in s_lock.h:250–290.
  • LWTRANCHE_AIO_URING_COMPLETION present in BuiltinTrancheNames (PG18 async I/O) — verified in lwlock.c:180.

Unresolved / not verified in this pass:

  • The exact memory-ordering guarantees of pg_atomic_compare_exchange_u32 on weak-ordering architectures (the port/atomics/ implementation is arch-specific and not traced through in full here).
  • The interaction between LWLockDisown / LWLockReleaseDisowned and parallel workers that transfer lock ownership across process boundaries — not fully traced.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”

Linux kernel rwsem and qspinlock. Linux uses a similar atomic-word design for read-write semaphores: a single long encodes reader count, writer-owned bit, and waiter-present bit. The optimistic fast path (a single LOCK XADD) is nearly identical to PostgreSQL’s shared-mode fast path. Linux’s qspinlock (MCS-based) is more sophisticated than PostgreSQL’s simpler TAS spinlock — it uses a queue-based approach that eliminates the O(n) cache-line bouncing of naive TAS under high contention. PostgreSQL has not adopted MCS spinlocks, as its spinlocks are held for only a handful of instructions and contention levels are low.

Java StampedLock (JDK 8+). Java’s StampedLock uses an AtomicLong state word with a similar layout: exclusive bit, reader count, and a stamp for optimistic reads. The optimistic read mode (tryOptimisticRead / validate) is a lock-free variant that reads data without any atomic operation and then checks whether a writer has intervened — closer to MVCC than to locking. PostgreSQL does not implement optimistic LWLock reads, though LWLockConditionalAcquire serves the non-blocking acquire use case.

Scalable lock managers and NUMA. Eyad Alkabani and others observed that a single shared lock array causes cache-coherence traffic that degrades under NUMA. The paper “A Scalable Lock Manager for Multicores” (Johnson et al., SIGMOD 2010, captured in knowledge/research/dbms-papers/scalable-lock-manager.md) proposes per-CPU-core lock tables to reduce this traffic. PostgreSQL partially addresses NUMA effects through LWLockPadded (one lock per cache line) and through the partitioned design of high-contention lock groups (NUM_BUFFER_PARTITIONS = 128, NUM_LOCK_PARTITIONS = 16) — each partition gets its own LWLock so the hot lock is spread across cache lines.

Optimistic concurrency and lock-free approaches. An alternative to locking for read-mostly data is seqlock (sequence lock): a writer increments a version counter before and after the write; a reader checks that the counter is even and unchanged. PostgreSQL uses a structurally similar mechanism in its WAL insert logic (XLogCtl->Insert seqlock-like LSN check) and in LWLockWaitForVar. Lock-free hash tables and B-trees (e.g., Masstree, MICA) eliminate locks entirely for lookups, at the cost of considerably more complex reclamation. PostgreSQL does not adopt these because correct reclamation requires either hazard pointers or epoch-based reclamation — infrastructure not present in the codebase — and the performance benefit is modest for PostgreSQL’s workloads.

PG18 async I/O integration. The LWTRANCHE_AIO_URING_COMPLETION tranche (visible in BuiltinTrancheNames) is new in PG18. It protects the io_uring completion ring in the async I/O subsystem (storage/aio/). The introduction of async I/O required a new LWLock tranche rather than reusing existing ones, because the completion processing can happen in a worker process that did not issue the I/O — the tranche name makes this wait event user-visible and diagnosable in pg_stat_activity.

  • src/backend/storage/lmgr/lwlock.c — full LWLock implementation
  • src/backend/storage/lmgr/s_lock.c — portable spinlock wait loop
  • src/include/storage/lwlock.hLWLock struct, LWLockMode, BuiltinTrancheIds
  • src/include/storage/spin.h — public spinlock API macros
  • src/include/storage/s_lock.h — platform TAS implementations (x86, x86_64, ARM64, s390, …)
  • src/backend/storage/lmgr/README — authoritative overview of the four PostgreSQL lock tiers
  • knowledge/research/dbms-papers/scalable-lock-manager.md — Johnson et al. 2010 on scalable lock managers (NUMA)
  • knowledge/research/dbms-general/database-internals.md — Petrov ch. 8 (atomic operations, concurrency foundations)