PostgreSQL LWLock and Spinlock — Atomic State Word, Wait-Queue Protocol, and Platform TAS
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”Every database engine that stores its working state in shared memory must solve one coordination problem: when multiple processes read and write the same memory location concurrently, how do you prevent torn reads and lost writes? The answer is mutual exclusion — the guarantee that at most one writer (or, for read-optimized variants, any number of readers but no concurrent writer) holds access at one time.
Mutual exclusion is studied from two angles in classical concurrency theory.
Hardware primitives. The foundation is a test-and-set (TAS) instruction: atomically read a memory word and write 1 to it (or swap in a sentinel), returning the old value. If the old value was 0 the caller has acquired the lock; if it was already 1, someone else holds it. Modern CPUs generalize TAS into compare-and-exchange (CAS) and fetch-and-add (FAA), which are strictly more expressive. Database Internals (Petrov, ch. 8 “Introduction to Distributed Transactions”) notes that hardware atomic operations are the bedrock of all higher-level synchronization. In a single-host shared-memory engine like PostgreSQL, native CAS on a 32-bit word is the critical primitive.
Blocking vs. spinning. Once a caller discovers the lock is held it must decide whether to spin (poll in a tight loop, consuming CPU) or block (relinquish the CPU to the OS scheduler and wait to be woken up). Spinning is optimal for very short waits — the round-trip to the kernel and back is far more expensive than a few wasted cycles. Blocking is mandatory for any wait that could last longer than a timeslice. Most practical implementations combine both: spin for a short burst, then block. This two-tier strategy underlies both lock tiers in PostgreSQL.
Reader-writer locks. When the shared resource is read far more often than it is written, pure mutual exclusion wastes parallelism by serializing reads that could proceed concurrently. A reader-writer lock (a/k/a shared-exclusive lock) distinguishes shared mode (many readers at once, no writer) from exclusive mode (one writer, no readers). Textbook designs encode state as separate shared and exclusive counters, protected by a small internal spinlock. The challenge is doing so with minimal contention — taking an internal spinlock to acquire a shared outer lock is counter-productive when the shared lock is uncontended, because all readers then fight over the internal spinlock even though none of them are conflicting.
This last insight — the internal spinlock bottleneck — is precisely the
problem the current LWLock implementation was designed to solve. The
file comment in lwlock.c describes the old approach:
// lwlock.c (file header comment)// This used to be a pretty straight forward reader-writer lock// implementation, in which the internal state was protected by a// spinlock. Unfortunately the overhead of taking the spinlock proved to be// too high for workloads/locks that were taken in shared mode very// frequently. Often we were spinning in the (obviously exclusive) spinlock,// while trying to acquire a shared lock that was actually free.The redesign — an atomic word that encodes all state — eliminates the internal spinlock entirely for the uncontended fast path. Readers acquire a shared lock with a single CAS and never touch a spinlock.
Common DBMS Design
Section titled “Common DBMS Design”Shared-memory databases universally need at least two locking tiers:
Tier 1 — spinlocks (microseconds). Used only around a handful of
instructions: incrementing a counter, updating a flag, reading a pair
of values that must be seen together. The lock must never be held across
a system call or a non-trivial subroutine. The PostgreSQL README for
storage/lmgr/ states: “If a lock is to be held more than a few dozen
instructions, or across any sort of kernel call (or even a call to a
nontrivial subroutine), don’t use a spinlock.” No deadlock detection, no
automatic release, no timeout beyond “stuck for ~1 minute = fatal.”
Tier 2 — lightweight read-write locks (microseconds to milliseconds). Protect shared data structures accessed for the duration of a small operation — traversing a buffer-mapping hash bucket, reading a catalog-cache entry, writing a WAL record. Lightweight locks support shared and exclusive modes, integrate with the engine’s process wait mechanism (so waiters block on a semaphore rather than burning CPU), and are automatically released during error recovery.
Tier 3 — heavyweight locks (milliseconds to seconds). The regular
lock manager (lock.c) handles SQL-visible concurrency — row locks,
table locks, LOCK TABLE, deadlock detection. It depends on tier-2
locks to protect its own data structures.
The key design tension for tier-2 locks is the uncontended fast path. In a database server under load, the same LWLock may be acquired and released thousands of times per second by many backends, and the vast majority of acquisitions find the lock free. The design must make the common case (uncontended shared or exclusive acquisition) as cheap as possible. The classic reader-writer lock fails this by requiring a spinlock even when the shared lock is trivially available.
The modern approach (used by PostgreSQL, Linux rwlock, Java
StampedLock, etc.) packs all state into a single atomic word and uses
CAS for the fast path. The word encodes: (a) whether an exclusive holder
is present (a sentinel bit or counter region), (b) the number of shared
holders (a counter region), and (c) flags for the wait-queue state. A
shared acquisition in the uncontended case is a single atomic
fetch-and-add; an exclusive acquisition is a CAS that swaps in the
sentinel. No internal spinlock is needed for these paths.
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”Two lock tiers
Section titled “Two lock tiers”PostgreSQL defines a strict separation:
- Spinlock (
slock_t,SpinLockAcquire/SpinLockRelease) — busy-wait only, a few instructions, implemented as a hardware TAS (test-and-set or compare-and-swap) on a single byte or int. The API is a set of macros inspin.hthat delegate to platform-specific assembly ins_lock.h. - LWLock (
LWLock,LWLockAcquire/LWLockRelease) — shared and exclusive modes, OS-block on contention viaPGSemaphore, automatic release on error, integrated with the wait-event reporting system.
Spinlocks are used inside the LWLock implementation (specifically for
the wait-list mutex, LW_FLAG_LOCKED) and throughout the engine for
the narrowest critical sections (e.g., ShmemLock protecting the LWLock
tranche counter, ProcGlobal->lock protecting global PGPROC state).
LWLocks are used for everything that needs a shared mode or may block
for more than a handful of instructions.
The LWLock state word
Section titled “The LWLock state word”The core of the LWLock design is a single pg_atomic_uint32 field
lock->state that encodes everything relevant about the lock’s
current state:
// state bit layout — lwlock.c#define LW_FLAG_HAS_WAITERS ((uint32) 1 << 31) // waiters in queue#define LW_FLAG_RELEASE_OK ((uint32) 1 << 30) // ok to wake waiters now#define LW_FLAG_LOCKED ((uint32) 1 << 29) // wait-list spinlock
// LW_VAL_EXCLUSIVE = MAX_BACKENDS + 1 (a power-of-2; exclusive sentinel)// LW_VAL_SHARED = 1 (each shared holder adds 1)// LW_SHARED_MASK = MAX_BACKENDS (bits 0..N covering shared count)// LW_LOCK_MASK = MAX_BACKENDS | LW_VAL_EXCLUSIVEThe upper 3 bits (29–31) are flag bits; bits 0 through LOG2_NUM_LOCK_PARTITIONS-ish
are the lock-holder region. The exclusive sentinel LW_VAL_EXCLUSIVE is
MAX_BACKENDS + 1. Because MAX_BACKENDS is always a power-of-2 minus
1, LW_VAL_EXCLUSIVE = MAX_BACKENDS + 1 is a power of 2 and occupies
exactly one bit above the shared count range. This ensures the exclusive
sentinel cannot be confused with any possible shared-holder count.
bit 31: HAS_WAITERS — someone is sleeping on this lockbit 30: RELEASE_OK — release is allowed to wake up waitersbit 29: LOCKED — wait-list spinlock (internal)bits 0..27: lock-holder region if (state & LW_VAL_EXCLUSIVE) != 0 → held exclusively else (state & LW_SHARED_MASK) → number of shared holdersThe static assertions in lwlock.c (around line 109) verify at compile
time that MAX_BACKENDS + 1 is a power of 2, that MAX_BACKENDS and
LW_FLAG_MASK do not overlap, and that LW_VAL_EXCLUSIVE and
LW_FLAG_MASK do not overlap.
LWLock struct and tranche system
Section titled “LWLock struct and tranche system”// LWLock — src/include/storage/lwlock.htypedef struct LWLock{ uint16 tranche; /* tranche ID for wait event naming */ pg_atomic_uint32 state; /* atomic state word */ proclist_head waiters; /* list of waiting PGPROCs */} LWLock;The struct is small — one uint16, one atomic uint32, and a two-word
list head. Each lock is padded to LWLOCK_PADDED_SIZE (= one cache
line) in LWLockPadded to prevent false sharing.
The tranche is a name group for wait-event reporting. All 128
BufferContent locks share the tranche LWTRANCHE_BUFFER_CONTENT; all
16 LockManager partition locks share LWTRANCHE_LOCK_MANAGER. Named
tranches (RequestNamedLWLockTranche / GetNamedLWLockTranche) let
extensions allocate arrays of LWLocks at postmaster startup with
user-visible wait-event names. The tranche ID doubles as the wait-event
ID reported to pg_stat_activity.
The main lock array MainLWLockArray is allocated once in shared
memory by CreateLWLocks and holds all fixed (individually-named)
locks plus the per-partition groups. It is sized by LWLockShmemSize.
The shared-memory layout produced by CreateLWLocks / InitializeLWLocks
is a single contiguous block. A 4-byte dynamic-tranche-ID counter sits
just before MainLWLockArray (read by LWLockNewTrancheId); the array
proper begins on a LWLOCK_PADDED_SIZE (cache-line) boundary. The fixed
region is laid out by the offset macros in lwlock.h
(BUFFER_MAPPING_LWLOCK_OFFSET, LOCK_MANAGER_LWLOCK_OFFSET,
PREDICATELOCK_MANAGER_LWLOCK_OFFSET), so any backend can index a partition
group by adding a constant offset. Named-tranche locks (from
RequestNamedLWLockTranche) follow the fixed NUM_FIXED_LWLOCKS region,
and the NamedLWLockTranche descriptor array plus the tranche name strings
are packed at the tail.
flowchart TD
subgraph SHM["shared memory block from ShmemAlloc (CreateLWLocks)"]
CTR["int LWLockCounter<br/>(dynamic tranche-ID counter,<br/>init = LWTRANCHE_FIRST_USER_DEFINED)"]
subgraph ARR["MainLWLockArray : LWLockPadded[] (cache-line aligned)"]
IND["[0 .. NUM_INDIVIDUAL_LWLOCKS)<br/>individual named locks<br/>(each its own tranche)"]
BUF["BUFFER_MAPPING_LWLOCK_OFFSET<br/>128 locks, LWTRANCHE_BUFFER_MAPPING"]
LCK["LOCK_MANAGER_LWLOCK_OFFSET<br/>16 locks, LWTRANCHE_LOCK_MANAGER"]
PRD["PREDICATELOCK_MANAGER_LWLOCK_OFFSET<br/>16 locks, LWTRANCHE_PREDICATE_LOCK_MANAGER"]
NAMED["[NUM_FIXED_LWLOCKS ..)<br/>named-tranche locks<br/>(RequestNamedLWLockTranche)"]
end
DESC["NamedLWLockTranche[]<br/>(trancheId, trancheName)"]
STR["tranche name strings (char)"]
end
CTR --> IND
IND --> BUF --> LCK --> PRD --> NAMED
NAMED --> DESC --> STR
L["LWLock { uint16 tranche;<br/>pg_atomic_uint32 state;<br/>proclist_head waiters }"] -.padded to.-> PAD["LWLockPadded (= PG_CACHE_LINE_SIZE)<br/>one lock per cache line, no false sharing"]
NAMED -.each element is a.-> L
Each LWLockPadded element is a union of the bare LWLock and a
char pad[LWLOCK_PADDED_SIZE], so every lock occupies a full cache line.
The tranche field inside each LWLock is the index into
BuiltinTrancheNames[] (or the dynamic LWLockTrancheNames[]), which is how
LWLockReportWaitStart resolves a wait event to a user-visible name in
pg_stat_activity.
Acquisition protocol — the two-attempt race closure
Section titled “Acquisition protocol — the two-attempt race closure”The uncontended shared path is a single atomic CAS. The comment in
lwlock.c explains why a single attempt is not enough for the
contended case:
// LWLockAcquire — lwlock.c (simplified narrative)// Phase 1: Try to do it atomically, if we succeed, nice// Phase 2: Add ourselves to the waitqueue of the lock// Phase 3: Try to grab the lock again; if we succeed, remove ourselves// Phase 4: Sleep till wake-up, goto Phase 1The race being closed: a naïve implementation might try once, fail, then
enqueue itself — but by the time enqueuing completes, the lock could
already have been released and no one will wake the new waiter up.
The two-attempt protocol prevents this: after LWLockQueueSelf places
the caller in the waiter list, a second LWLockAttemptLock is executed.
If the second attempt succeeds, the caller calls LWLockDequeueSelf to
remove itself cleanly. If it fails, the caller is guaranteed to be
woken up because the releaser will see the enqueued waiter when it
processes the wait list.
// LWLockAcquire — lwlock.cboolLWLockAcquire(LWLock *lock, LWLockMode mode){ // ... HOLD_INTERRUPTS(); for (;;) { mustwait = LWLockAttemptLock(lock, mode); /* Phase 1 */ if (!mustwait) break;
LWLockQueueSelf(lock, mode); /* Phase 2 */ mustwait = LWLockAttemptLock(lock, mode); /* Phase 3 */ if (!mustwait) { LWLockDequeueSelf(lock); break; } /* Phase 4 */ PGSemaphoreLock(proc->sem); /* block in OS */ // ... loop back } held_lwlocks[num_held_lwlocks++] = {lock, mode}; return result; /* true = acquired without waiting */}HOLD_INTERRUPTS() is called before the first attempt; RESUME_INTERRUPTS()
is called in LWLockRelease. This ensures cancel/die signals cannot
interrupt a backend that holds an LWLock.
The flow below traces a single LWLockAcquire call: the fast path is the
left spine (LWLockAttemptLock succeeds on the first try, the overwhelmingly
common case under load), and the slow path is the enqueue/recheck/block loop
that closes the enqueue-then-miss race. Note that the block point
(PGSemaphoreLock) is reached only after a second LWLockAttemptLock has
also failed while the caller is already queued — which is exactly the
condition that guarantees a future releaser will see the waiter and call
LWLockWakeup.
flowchart TD
A["LWLockAcquire(lock, mode)"] --> B["HOLD_INTERRUPTS()"]
B --> C{"num_held_lwlocks >= MAX_SIMUL_LWLOCKS?"}
C -->|yes| Cerr["elog(ERROR, too many LWLocks taken)"]
C -->|no| D["LWLockAttemptLock(lock, mode)<br/>Phase 1: single CAS on lock->state"]
D --> E{"mustwait?"}
E -->|no| Z["record in held_lwlocks[]<br/>return result"]
E -->|yes| F["LWLockQueueSelf(lock, mode)<br/>Phase 2: enqueue MyProc under LW_FLAG_LOCKED"]
F --> G["LWLockAttemptLock(lock, mode)<br/>Phase 3: retry CAS, now guaranteed visible"]
G --> H{"mustwait?"}
H -->|no| I["LWLockDequeueSelf(lock)<br/>undo the enqueue"]
I --> Z
H -->|yes| J["LWLockReportWaitStart()<br/>Phase 4: block"]
J --> K["PGSemaphoreLock(proc->sem)"]
K --> L{"proc->lwWaiting == LW_WS_NOT_WAITING?"}
L -->|no, spurious| K
L -->|yes, woken| M["pg_atomic_fetch_or(state, LW_FLAG_RELEASE_OK)<br/>result = false"]
M --> D
Z --> Y["caller runs critical section ... eventually LWLockRelease"]
Y --> R["LWLockReleaseInternal:<br/>pg_atomic_sub_fetch_u32(state, LW_VAL_*)"]
R --> S{"HAS_WAITERS & RELEASE_OK set<br/>AND LW_LOCK_MASK == 0?"}
S -->|no| T["RESUME_INTERRUPTS(); done"]
S -->|yes| U["LWLockWakeup(lock):<br/>walk waiters, mark LW_WS_PENDING_WAKEUP,<br/>PGSemaphoreUnlock each"]
U --> T
LWLockAttemptLock — the atomic fast path
Section titled “LWLockAttemptLock — the atomic fast path”// LWLockAttemptLock — lwlock.cstatic boolLWLockAttemptLock(LWLock *lock, LWLockMode mode){ uint32 old_state = pg_atomic_read_u32(&lock->state); while (true) { uint32 desired_state = old_state; bool lock_free; if (mode == LW_EXCLUSIVE) { lock_free = (old_state & LW_LOCK_MASK) == 0; if (lock_free) desired_state += LW_VAL_EXCLUSIVE; } else { lock_free = (old_state & LW_VAL_EXCLUSIVE) == 0; if (lock_free) desired_state += LW_VAL_SHARED; } // CAS always executes (acts as memory barrier even on failure) if (pg_atomic_compare_exchange_u32(&lock->state, &old_state, desired_state)) return !lock_free; /* false = acquired; true = must wait */ }}For shared acquisition: if no exclusive holder, increment the shared
count. For exclusive acquisition: if no holder at all (neither exclusive
nor any shared), swap in LW_VAL_EXCLUSIVE. The CAS is always issued
even when lock_free is false — this doubles as a full memory barrier,
and the implementation comment explicitly notes that benchmarks did not
show a benefit from skipping it.
Wait-list mutex (LW_FLAG_LOCKED)
Section titled “Wait-list mutex (LW_FLAG_LOCKED)”The wait list (lock->waiters, a proclist) is not itself lock-free. A
tiny spinlock bit LW_FLAG_LOCKED (bit 29 of state) serializes
enqueue and dequeue. LWLockWaitListLock acquires it with a fetch-or,
spins using perform_spin_delay if busy, and LWLockWaitListUnlock
clears it with a fetch-and. This is the only place spinlock semantics
appear inside the LWLock code.
// LWLockWaitListLock — lwlock.cstatic voidLWLockWaitListLock(LWLock *lock){ while (true) { old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED); if (!(old_state & LW_FLAG_LOCKED)) break; /* got lock */ // spin using perform_spin_delay until cleared }}Release and wakeup
Section titled “Release and wakeup”LWLockRelease calls LWLockReleaseInternal after removing the lock
from the held-locks array. The release atomically decrements the state
word:
// LWLockReleaseInternal — lwlock.cstatic voidLWLockReleaseInternal(LWLock *lock, LWLockMode mode){ if (mode == LW_EXCLUSIVE) oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_EXCLUSIVE); else oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
// Wake waiters only when: // HAS_WAITERS is set AND RELEASE_OK is set AND no holder remains if ((oldstate & (LW_FLAG_HAS_WAITERS | LW_FLAG_RELEASE_OK)) == (LW_FLAG_HAS_WAITERS | LW_FLAG_RELEASE_OK) && (oldstate & LW_LOCK_MASK) == 0) LWLockWakeup(lock);}LWLockWakeup acquires the wait-list spinlock, walks lock->waiters,
marks selected waiters as LW_WS_PENDING_WAKEUP, releases the
wait-list spinlock, then calls PGSemaphoreUnlock on each waiter’s
semaphore. Multiple shared waiters can be woken at once; the first
exclusive waiter stops the walk.
The RELEASE_OK flag prevents double-wakeup. When LWLockWakeup moves
a waiter to LW_WS_PENDING_WAKEUP, it clears LW_FLAG_RELEASE_OK so
no further wakeup round fires until the woken process sets it again
(pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_RELEASE_OK) at the top
of the retry loop in LWLockAcquire).
LWLockWaitForVar — condition-variable variant
Section titled “LWLockWaitForVar — condition-variable variant”Beyond the standard acquire/release pair, LWLock offers a third mode:
LWLockWaitForVar / LWLockUpdateVar. A caller can hold a lock in
exclusive mode and broadcast a variable update; any number of waiters
block on the lock with LW_WAIT_UNTIL_FREE and wake when the variable
changes. This is the mechanism used by WALInsert to let backends wait
for the WAL flush LSN to advance without acquiring the WAL insert lock.
// LWLockUpdateVar — lwlock.c// Caller must hold lock in exclusive mode.voidLWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val){ pg_atomic_exchange_u64(valptr, val); /* full barrier */ // wake all LW_WAIT_UNTIL_FREE waiters LWLockWaitListLock(lock); // ... move front-of-queue waiters to wakeup list LWLockWaitListUnlock(lock); // ... PGSemaphoreUnlock each}LW_WAIT_UNTIL_FREE waiters are placed at the front of the wait
queue so that variable-change notifications are delivered promptly and
are not blocked behind regular lock waiters.
Spinlock tier
Section titled “Spinlock tier”The spinlock API in PostgreSQL is a thin macro shell over platform-specific TAS:
#define SpinLockInit(lock) S_INIT_LOCK(lock)#define SpinLockAcquire(lock) S_LOCK(lock)#define SpinLockRelease(lock) S_UNLOCK(lock)#define SpinLockFree(lock) S_LOCK_FREE(lock)On x86-64, S_LOCK expands to s_lock() which loops on TAS_SPIN.
TAS_SPIN does a non-locking load check first (*(lock) ? 1 : TAS(lock))
to avoid asserting the bus lock when the lock is obviously held — a
technique documented in Intel’s “Implementing Scalable Atomic Locks”
whitepaper. The TAS macro itself issues a lock; xchgb instruction:
// tas() — src/include/storage/s_lock.h (x86_64 section)static __inline__ inttas(volatile slock_t *lock){ slock_t _res = 1; __asm__ __volatile__( " lock \n" " xchgb %0,%1 \n" : "+q"(_res), "+m"(*lock) : /* no inputs */ : "memory", "cc"); return (int) _res;}On ARM64, the equivalent is __sync_lock_test_and_set (GCC builtin for
LDREX/STREX or LDXR/STXR), with ISB as the spin delay. On
platforms with no hardware TAS, PostgreSQL falls back to a semop-based
spinlock.
The portable wait loop in s_lock.c adapts its spin count:
// s_lock / perform_spin_delay — s_lock.cints_lock(volatile slock_t *lock, const char *file, int line, const char *func){ SpinDelayStatus delayStatus; init_spin_delay(&delayStatus, file, line, func); while (TAS_SPIN(lock)) perform_spin_delay(&delayStatus); finish_spin_delay(&delayStatus); return delayStatus.delays;}perform_spin_delay calls SPIN_DELAY() (the CPU hint — rep; nop
on x86) every iteration. After spins_per_delay spins it escalates to
pg_usleep with a randomly-increasing delay (1 ms to 1 s) to avoid
starvation under heavy contention. The spins_per_delay value is a
per-process heuristic: it is increased rapidly when the lock is acquired
without sleeping (indicating a multiprocessor environment where spinning
pays), and decreased slowly when sleeping is needed (indicating a
uniprocessor or overloaded system). set_spins_per_delay /
update_spins_per_delay propagate the estimate between backends via
shared memory on startup and exit.
Interrupt holdoff discipline
Section titled “Interrupt holdoff discipline”Both SpinLockAcquire and LWLockAcquire are documented to hold off
cancel/die interrupts for the duration of the lock hold. The lmgr/README
notes: “Acquisition of either a spinlock or a lightweight lock causes
query cancel and die() interrupts to be held off until all such locks
are released.” This is implemented by HOLD_INTERRUPTS() / RESUME_INTERRUPTS()
in LWLockAcquire / LWLockRelease and by the spinlock macros themselves.
The implication is that no CHECK_FOR_INTERRUPTS() is possible while any
spinlock is held, and the critical section must be kept extremely short.
Error recovery
Section titled “Error recovery”LWLockReleaseAll is called by the error-recovery path (triggered by
ereport(ERROR)). It iterates the process-local held_lwlocks[] array
and releases every held lock in reverse acquisition order. This is why
the held_lwlocks array is maintained — it is the error-recovery
release list, not merely a debugging aid. Spinlocks, by contrast, have
no automatic release on error; a spinlock held across a longjmp would
deadlock the server.
Source Walkthrough
Section titled “Source Walkthrough”Spinlock subsystem
Section titled “Spinlock subsystem”| Symbol | Kind | File | Role |
|---|---|---|---|
slock_t | typedef | s_lock.h | Platform-specific spinlock word (typically unsigned char on x86, int on ARM) |
TAS | macro | s_lock.h | Single atomic test-and-set; returns 0 on success |
TAS_SPIN | macro | s_lock.h | Like TAS but checks without bus-lock first (x86_64 / ARM64) |
SPIN_DELAY | macro | s_lock.h | CPU hint inside spin loop (rep; nop / isb) |
SpinLockAcquire | macro | spin.h | Public API — delegates to S_LOCK |
SpinLockRelease | macro | spin.h | Public API — delegates to S_UNLOCK |
s_lock | function | s_lock.c | Platform-independent spin-with-backoff loop |
perform_spin_delay | function | s_lock.c | One spin iteration: CPU hint, count spins, escalate to pg_usleep |
finish_spin_delay | function | s_lock.c | Adjust spins_per_delay heuristic after acquisition |
set_spins_per_delay | function | s_lock.c | Copy shared estimate into process-local spins_per_delay at startup |
update_spins_per_delay | function | s_lock.c | Blend process-local estimate back to shared at exit |
LWLock data structures
Section titled “LWLock data structures”| Symbol | Kind | File | Role |
|---|---|---|---|
LWLock | struct | lwlock.h | Lock object: tranche (uint16), state (atomic uint32), waiters (proclist) |
LWLockPadded | union | lwlock.h | LWLock padded to one cache line to prevent false sharing |
LWLockMode | enum | lwlock.h | LW_EXCLUSIVE, LW_SHARED, LW_WAIT_UNTIL_FREE |
LWLockWaitState | enum | lwlock.h | LW_WS_NOT_WAITING, LW_WS_WAITING, LW_WS_PENDING_WAKEUP |
NamedLWLockTranche | struct | lwlock.h | Name + tranche ID for an extension-allocated lock array |
LWLockHandle | struct | lwlock.c | Process-local record of a held lock: {lock *, mode} |
MainLWLockArray | global | lwlock.c | Shared-memory base of all fixed + named-tranche LWLocks |
held_lwlocks[] | static array | lwlock.c | Per-process held-lock list (max 200 simultaneous) |
LW_FLAG_HAS_WAITERS | macro | lwlock.c | Bit 31 of state |
LW_FLAG_RELEASE_OK | macro | lwlock.c | Bit 30 of state |
LW_FLAG_LOCKED | macro | lwlock.c | Bit 29 of state (wait-list spinlock) |
LW_VAL_EXCLUSIVE | macro | lwlock.c | Exclusive sentinel = MAX_BACKENDS + 1 |
LW_VAL_SHARED | macro | lwlock.c | Shared increment = 1 |
LWLock lifecycle and acquisition
Section titled “LWLock lifecycle and acquisition”| Symbol | Kind | File | Role |
|---|---|---|---|
LWLockShmemSize | function | lwlock.c | Compute shmem bytes for all locks + tranche names |
CreateLWLocks | function | lwlock.c | Postmaster: allocate MainLWLockArray and named-tranche arrays in shmem |
InitializeLWLocks | function | lwlock.c | Initialize all fixed and named-tranche locks via LWLockInitialize |
LWLockInitialize | function | lwlock.c | Set state = LW_FLAG_RELEASE_OK, bind tranche ID |
RequestNamedLWLockTranche | function | lwlock.c | Extension hook: request N locks by name before postmaster starts |
GetNamedLWLockTranche | function | lwlock.c | Retrieve base pointer for a named tranche array |
LWLockRegisterTranche | function | lwlock.c | Per-process: register tranche ID → name for wait-event lookup |
LWLockNewTrancheId | function | lwlock.c | Allocate a fresh tranche ID from MainLWLockArray[-1] counter (protected by ShmemLock) |
LWLockAttemptLock | function | lwlock.c | Single CAS attempt; returns true = must wait |
LWLockQueueSelf | function | lwlock.c | Add MyProc to wait list under LW_FLAG_LOCKED |
LWLockDequeueSelf | function | lwlock.c | Remove MyProc from wait list if still there |
LWLockAcquire | function | lwlock.c | Full blocking acquire; two-attempt protocol |
LWLockConditionalAcquire | function | lwlock.c | Non-blocking acquire; returns false immediately if lock busy |
LWLockAcquireOrWait | function | lwlock.c | Acquire if free; wait (without acquiring) if busy — used for WALWriteLock |
LWLockWaitListLock | function | lwlock.c | Acquire wait-list spinlock (bit 29) |
LWLockWakeup | function | lwlock.c | Walk waiters, move eligible to wakeup, call PGSemaphoreUnlock |
LWLockReleaseInternal | function | lwlock.c | Atomic decrement + conditional LWLockWakeup |
LWLockRelease | function | lwlock.c | Public release: disown + LWLockReleaseInternal + RESUME_INTERRUPTS |
LWLockReleaseAll | function | lwlock.c | Error recovery: release all entries in held_lwlocks[] |
LWLockWaitForVar | function | lwlock.c | Block until *valptr != oldval or lock is released |
LWLockUpdateVar | function | lwlock.c | Atomically update *valptr and wake LW_WAIT_UNTIL_FREE waiters |
Position hints (commit 273fe94, 2026-06-05)
Section titled “Position hints (commit 273fe94, 2026-06-05)”| Symbol | File | Line |
|---|---|---|
LW_FLAG_* / LW_VAL_* macros | src/backend/storage/lmgr/lwlock.c | 94–106 |
LWLock struct | src/include/storage/lwlock.h | 41–50 |
LWLockPadded / LWLOCK_PADDED_SIZE | src/include/storage/lwlock.h | 62–72 |
LWLockMode enum | src/include/storage/lwlock.h | 112–119 |
BuiltinTrancheIds enum | src/include/storage/lwlock.h | 181–225 |
*_LWLOCK_OFFSET / NUM_FIXED_LWLOCKS macros | src/include/storage/lwlock.h | 104–110 |
LWLockShmemSize | src/backend/storage/lmgr/lwlock.c | 432 |
CreateLWLocks | src/backend/storage/lmgr/lwlock.c | 462 |
InitializeLWLocks | src/backend/storage/lmgr/lwlock.c | 502 |
LWLockNewTrancheId | src/backend/storage/lmgr/lwlock.c | 615 |
LWLockReportWaitStart | src/backend/storage/lmgr/lwlock.c | 737 |
GetNamedLWLockTranche | src/backend/storage/lmgr/lwlock.c | 585 |
LWLockRegisterTranche | src/backend/storage/lmgr/lwlock.c | 640 |
RequestNamedLWLockTranche | src/backend/storage/lmgr/lwlock.c | 682 |
LWLockInitialize | src/backend/storage/lmgr/lwlock.c | 719 |
LWLockAttemptLock | src/backend/storage/lmgr/lwlock.c | 796 |
LWLockWaitListLock | src/backend/storage/lmgr/lwlock.c | 867 |
LWLockWakeup | src/backend/storage/lmgr/lwlock.c | 932 |
LWLockQueueSelf | src/backend/storage/lmgr/lwlock.c | 1048 |
LWLockDequeueSelf | src/backend/storage/lmgr/lwlock.c | 1091 |
LWLockAcquire | src/backend/storage/lmgr/lwlock.c | 1180 |
LWLockConditionalAcquire | src/backend/storage/lmgr/lwlock.c | 1351 |
LWLockAcquireOrWait | src/backend/storage/lmgr/lwlock.c | 1408 |
LWLockWaitForVar | src/backend/storage/lmgr/lwlock.c | 1596 |
LWLockUpdateVar | src/backend/storage/lmgr/lwlock.c | 1732 |
LWLockReleaseInternal | src/backend/storage/lmgr/lwlock.c | 1836 |
LWLockRelease | src/backend/storage/lmgr/lwlock.c | 1900 |
s_lock | src/backend/storage/lmgr/s_lock.c | 98 |
perform_spin_delay | src/backend/storage/lmgr/s_lock.c | 126 |
finish_spin_delay | src/backend/storage/lmgr/s_lock.c | 186 |
set_spins_per_delay | src/backend/storage/lmgr/s_lock.c | 207 |
update_spins_per_delay | src/backend/storage/lmgr/s_lock.c | 218 |
tas (x86_64) | src/include/storage/s_lock.h | 214 |
SpinLockAcquire macro | src/include/storage/spin.h | 59 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verified against /data/hgryoo/references/postgres, branch REL_18_STABLE,
commit 273fe94.
Confirmed:
LWLockstruct layout (tranche + state + waiters) verified inlwlock.h:41–50.- State bit macros and
StaticAssertDeclchecks verified inlwlock.c:94–116. LWLockAttemptLockCAS loop (shared and exclusive paths) verified inlwlock.c:796–856.- Two-attempt protocol (
LWLockAcquire→LWLockQueueSelf→ secondLWLockAttemptLock) verified inlwlock.c:1180–1341. LWLockReleaseInternalatomic decrement +LWLockWakeupguard condition verified inlwlock.c:1836–1877.LWLockWakeupwait-list traversal,LW_WS_PENDING_WAKEUPtransition,pg_write_barrier(),PGSemaphoreUnlockverified inlwlock.c:932–1040.LWLockWaitForVar/LWLockUpdateVarmechanism (front-of-queueLW_WAIT_UNTIL_FREE,pg_atomic_exchange_u64barrier) verified inlwlock.c:1596–1795.LWLockAcquireOrWaitsemantics (acquire if free, wait without acquiring if busy; documented forWALWriteLock) verified inlwlock.c:1393–1408comment +1408–1594body.s_lockadaptive spin loop andspins_per_delayheuristic verified ins_lock.c.- x86_64
lock; xchgbTAS andTAS_SPINnon-locking pre-check verified ins_lock.h:196–241. - ARM64
__sync_lock_test_and_set+ ISB delay verified ins_lock.h:250–290. LWTRANCHE_AIO_URING_COMPLETIONpresent inBuiltinTrancheNames(PG18 async I/O) — verified inlwlock.c:180.
Unresolved / not verified in this pass:
- The exact memory-ordering guarantees of
pg_atomic_compare_exchange_u32on weak-ordering architectures (theport/atomics/implementation is arch-specific and not traced through in full here). - The interaction between
LWLockDisown/LWLockReleaseDisownedand parallel workers that transfer lock ownership across process boundaries — not fully traced.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”Linux kernel rwsem and qspinlock. Linux uses a similar
atomic-word design for read-write semaphores: a single long encodes
reader count, writer-owned bit, and waiter-present bit. The optimistic
fast path (a single LOCK XADD) is nearly identical to PostgreSQL’s
shared-mode fast path. Linux’s qspinlock (MCS-based) is more
sophisticated than PostgreSQL’s simpler TAS spinlock — it uses a
queue-based approach that eliminates the O(n) cache-line bouncing of
naive TAS under high contention. PostgreSQL has not adopted MCS spinlocks,
as its spinlocks are held for only a handful of instructions and contention
levels are low.
Java StampedLock (JDK 8+). Java’s StampedLock uses an
AtomicLong state word with a similar layout: exclusive bit, reader
count, and a stamp for optimistic reads. The optimistic read mode
(tryOptimisticRead / validate) is a lock-free variant that reads
data without any atomic operation and then checks whether a writer
has intervened — closer to MVCC than to locking. PostgreSQL does not
implement optimistic LWLock reads, though LWLockConditionalAcquire
serves the non-blocking acquire use case.
Scalable lock managers and NUMA. Eyad Alkabani and others observed
that a single shared lock array causes cache-coherence traffic that
degrades under NUMA. The paper “A Scalable Lock Manager for Multicores”
(Johnson et al., SIGMOD 2010, captured in knowledge/research/dbms-papers/scalable-lock-manager.md)
proposes per-CPU-core lock tables to reduce this traffic. PostgreSQL
partially addresses NUMA effects through LWLockPadded (one lock per
cache line) and through the partitioned design of high-contention lock
groups (NUM_BUFFER_PARTITIONS = 128, NUM_LOCK_PARTITIONS = 16) — each
partition gets its own LWLock so the hot lock is spread across cache
lines.
Optimistic concurrency and lock-free approaches. An alternative to
locking for read-mostly data is seqlock (sequence lock): a writer
increments a version counter before and after the write; a reader checks
that the counter is even and unchanged. PostgreSQL uses a structurally
similar mechanism in its WAL insert logic (XLogCtl->Insert seqlock-like
LSN check) and in LWLockWaitForVar. Lock-free hash tables and
B-trees (e.g., Masstree, MICA) eliminate locks entirely for lookups, at
the cost of considerably more complex reclamation. PostgreSQL does not
adopt these because correct reclamation requires either hazard pointers
or epoch-based reclamation — infrastructure not present in the codebase —
and the performance benefit is modest for PostgreSQL’s workloads.
PG18 async I/O integration. The LWTRANCHE_AIO_URING_COMPLETION
tranche (visible in BuiltinTrancheNames) is new in PG18. It protects
the io_uring completion ring in the async I/O subsystem
(storage/aio/). The introduction of async I/O required a new
LWLock tranche rather than reusing existing ones, because the completion
processing can happen in a worker process that did not issue the I/O —
the tranche name makes this wait event user-visible and diagnosable in
pg_stat_activity.
Sources
Section titled “Sources”src/backend/storage/lmgr/lwlock.c— full LWLock implementationsrc/backend/storage/lmgr/s_lock.c— portable spinlock wait loopsrc/include/storage/lwlock.h—LWLockstruct,LWLockMode,BuiltinTrancheIdssrc/include/storage/spin.h— public spinlock API macrossrc/include/storage/s_lock.h— platform TAS implementations (x86, x86_64, ARM64, s390, …)src/backend/storage/lmgr/README— authoritative overview of the four PostgreSQL lock tiersknowledge/research/dbms-papers/scalable-lock-manager.md— Johnson et al. 2010 on scalable lock managers (NUMA)knowledge/research/dbms-general/database-internals.md— Petrov ch. 8 (atomic operations, concurrency foundations)