PostgreSQL Latches, Wait Event Sets, and Inter-Process Signals
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A multi-process database server is, at its core, a set of cooperating processes that spend most of their wall-clock time asleep, blocked waiting for the next unit of work. The design question that pervades the server’s plumbing is deceptively simple: how does a sleeping process learn that it should wake up, and how does another process (or a signal handler in the same process) tell it to? Three sub-problems define the space:
1. The lost-wakeup race. The naive pattern is a shared flag plus a sleep: a worker loops while (!flag) sleep(), and a producer sets flag = true. This is broken on any preemptive system. If the producer sets the flag and signals between the worker’s test of the flag and its call to sleep(), the wakeup is lost and the worker sleeps forever. A correct primitive must make “check the condition” and “begin sleeping” atomic with respect to the wakeup. That is exactly the guarantee that pselect/ppoll and condition variables provide and that plain poll plus a signal handler does not.
2. Waiting on heterogeneous events at once. A backend rarely waits for just one thing. It waits for client socket readability (a new query), and for its latch (another backend has notified it), and for postmaster death (the parent crashed and the child should exit), and possibly for a timeout (statement_timeout). The OS readiness primitives (epoll, kqueue, poll) natively multiplex file descriptors, but a latch is not a file descriptor and postmaster death is not obviously one either. The abstraction must unify them.
3. Turning asynchronous signals into synchronous, safe actions. A Unix signal can arrive at any instruction boundary, including in the middle of malloc, while holding a spinlock, or mid-way through building a protocol message. Almost nothing is safe to do in a signal handler (the async-signal-safe function list is tiny). So the universal pattern is: the handler does the absolute minimum (set a volatile sig_atomic_t flag and poke a wakeup), and the real work is deferred to a well-defined safe point in the main code path where the process voluntarily checks the flag.
Operating Systems: Three Easy Pieces (Arpaci-Dusseau) frames items (1) and (3) under “concurrency” and “the limited directives a signal handler may issue”; the lost-wakeup problem is the same hazard that motivates condition variables having an associated mutex. Database Internals (Petrov), in its treatment of node-local concurrency and process models, notes that a database’s process/thread scheduler is built on exactly these OS primitives and that the cost of a wakeup (a syscall, a context switch, a cache-line bounce) is a first-order performance concern for OLTP. Architecture of a Database System (Hellerstein et al., §“Process Models”) catalogs the process-per-connection model PostgreSQL uses and observes that such a model leans heavily on the OS for IPC and signalling rather than building a user-space scheduler.
PostgreSQL’s answer composes three layers, each solving one sub-problem:
the Latch (a race-free boolean-with-wakeup, solving #1 for the
one-bit case), the WaitEventSet (a portable multiplexer over the OS
readiness primitives, solving #2 and embedding the latch into the same
wait), and the interrupt machinery — ProcSignal for inter-process
messaging plus InterruptPending/CHECK_FOR_INTERRUPTS() for deferral —
solving #3.
Common DBMS Design
This section names the engineering patterns that recur across multi-process servers (database or otherwise) so PostgreSQL’s specific choices read as selections within a shared design space.
The event/latch object: a boolean plus a wakeup
Almost every server that uses a process-per-connection or process-per-worker model needs a “kick this sleeping process” primitive. The shape is always the same: a small shared object holding (a) a boolean “is there work?” flag and (b) enough identity (a pid, an event handle, a pipe fd) to deliver an OS-level wakeup to whoever owns it. The two halves must be ordered by memory barriers: the setter publishes the flag before checking whether anyone is sleeping, and the waiter clears the flag before re-examining the condition, so neither side can conclude “nothing to do / nobody to wake” on the basis of stale memory. This is the textbook double-checked pattern around a condition variable, adapted to cross-process shared memory.
Multiplexed readiness via the OS reactor
To wait on many fds plus internal events at once, servers build a thin
portability wrapper over whatever “reactor” the OS provides: epoll on
Linux, kqueue on the BSDs and macOS, poll/select as the lowest
common denominator, and I/O completion ports or event objects on Windows.
The wrapper registers a set of interest specifications once and then calls
the blocking primitive repeatedly, translating the kernel’s “these became
ready” answer back into the server’s own event vocabulary. nginx, redis,
libuv, and PostgreSQL all have a module of exactly this shape.
Race-free blocking: ppoll/pselect or the self-pipe trick
Because plain poll() cannot atomically “unblock signals and sleep,” a
signal that arrives just before poll() is entered will not interrupt it.
The two canonical fixes are (a) use the p-variant syscalls
(ppoll/pselect) that take a signal mask, or (b) the self-pipe
trick (Bernstein): the signal handler writes one byte to a pipe whose
read end is in the fd set, so a pending signal manifests as a readable
fd — something poll() cannot miss. Linux offers a third option,
signalfd, which turns signal delivery itself into a readable fd, letting
the process keep the signal blocked and consume it synchronously.
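The self-pipe trick described above can be sketched in a few lines of POSIX C. This is a minimal illustration, not PostgreSQL's code: the names `wake_pipe`, `selfpipe_init`, and `selfpipe_wait` are hypothetical, and the signal number is left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int wake_pipe[2] = {-1, -1};   /* [0] = read end, [1] = write end */

/* Handler does the minimum: one non-blocking, async-signal-safe write. */
static void on_wakeup_signal(int signo)
{
    int saved_errno = errno;
    ssize_t rc = write(wake_pipe[1], "", 1);

    (void) signo;
    (void) rc;               /* EAGAIN means a wakeup is already queued */
    errno = saved_errno;
}

/* Create the pipe, make both ends non-blocking, install the handler. */
int selfpipe_init(int signo)
{
    struct sigaction sa;

    if (pipe(wake_pipe) < 0)
        return -1;
    for (int i = 0; i < 2; i++)
    {
        int flags = fcntl(wake_pipe[i], F_GETFL);

        if (flags < 0 || fcntl(wake_pipe[i], F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;
    }
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_wakeup_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Wait up to timeout_ms: 1 if a signal arrived, 0 on timeout, -1 on error. */
int selfpipe_wait(int timeout_ms)
{
    struct pollfd pfd = { .fd = wake_pipe[0], .events = POLLIN };
    char buf[64];
    int rc;

    assert(wake_pipe[0] >= 0);   /* selfpipe_init() must have run */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);
    if (rc <= 0)
        return rc;
    /* Drain so stale bytes cannot cause a busy loop on the next wait. */
    while (read(wake_pipe[0], buf, sizeof(buf)) > 0)
        ;
    return 1;
}
```

Because the byte stays in the pipe until it is read, a signal that lands just before poll() is entered still makes the read end readable, which is the property plain poll() plus a flag cannot give.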
Multiplexing one signal number into many logical messages
Unix gives a process only a handful of user-definable signals
(SIGUSR1, SIGUSR2). A server that needs dozens of distinct
inter-process notifications cannot afford one signal per message type.
The standard trick is to pick a single signal as a generic “you have
mail” doorbell, and carry the reason out-of-band in shared memory: the
sender sets a per-recipient flag identifying the message, then sends the
one signal; the recipient’s handler scans its flags to learn what
actually happened. This decouples the (scarce) signal namespace from the
(unbounded) message namespace.
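As a rough sketch of the doorbell pattern, the following C fragment keeps one flag per reason and uses a single signal to announce that some flag changed. Everything here is illustrative: the reason names and the `doorbell_*` functions are hypothetical, and sender and recipient are the same process (via raise()) so that the example stays self-contained. A real server would place the flags in shared memory and use kill().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

enum reason { REASON_NOTIFY, REASON_BARRIER, REASON_CONFLICT, NUM_REASONS };

/* Written by the sender before ringing; would live in shared memory. */
static volatile sig_atomic_t reason_flags[NUM_REASONS];
/* Set by the handler, consumed later by the main loop at a safe point. */
static volatile sig_atomic_t pending[NUM_REASONS];

/* Handler: scan every reason, move set flags to "pending", nothing more. */
static void doorbell_handler(int signo)
{
    (void) signo;
    for (int r = 0; r < NUM_REASONS; r++)
    {
        if (reason_flags[r])
        {
            reason_flags[r] = 0;
            pending[r] = 1;
        }
    }
}

int doorbell_install(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = doorbell_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Sender: publish the reason first, then ring the one shared doorbell. */
int doorbell_send(enum reason r)
{
    assert(r >= 0 && r < NUM_REASONS);
    reason_flags[r] = 1;
    return raise(SIGUSR1);
}

/* Main loop: read and clear one pending reason. */
int doorbell_take(enum reason r)
{
    if (!pending[r])
        return 0;
    pending[r] = 0;
    return 1;
}
```

Adding a new message type only grows the enum; the signal namespace is untouched.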
Deferred interrupt handling at safe points
The final shared pattern: signal handlers never do real work. They set
InterruptPending-style flags and return. Real work happens when the main
loop reaches a safe point and calls a check macro. This converts an
asynchronous, dangerous context into a synchronous, controlled one, and
lets the server protect critical sections simply by incrementing a
“hold-off” counter that the check macro respects.
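The hold-off counter idea can be shown with a toy model. The sketch below is an assumption-laden simplification: `interrupt_pending`, `holdoff_count`, and the helper functions are hypothetical names, and the "signal" is simulated by setting the flag directly.

```c
#include <assert.h>
#include <signal.h>

static volatile sig_atomic_t interrupt_pending;   /* set by signal handlers */
static int holdoff_count;      /* > 0 while inside a protected region */
static int processed_count;    /* how many interrupts were serviced */

static void hold_interrupts(void)
{
    holdoff_count++;
}

static void resume_interrupts(void)
{
    assert(holdoff_count > 0);
    holdoff_count--;
}

/* The safe-point check: cheap when nothing is pending. */
static void check_for_interrupts_sketch(void)
{
    if (!interrupt_pending)
        return;
    if (holdoff_count > 0)
        return;                /* not safe yet: leave the flag set */
    interrupt_pending = 0;
    processed_count++;         /* the real work would happen here */
}

/* Returns how many interrupts were serviced while the hold-off was active. */
int demo_protected_region(void)
{
    int serviced_inside;

    hold_interrupts();
    interrupt_pending = 1;             /* pretend a handler fired here */
    check_for_interrupts_sketch();     /* deferred: hold-off is active */
    serviced_inside = processed_count;
    resume_interrupts();
    check_for_interrupts_sketch();     /* serviced now */
    return serviced_inside;
}
```

The flag survives the deferred check, so the interrupt is delayed rather than dropped.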
flowchart TD
A["Producer process<br/>or signal handler"] -->|"SetLatch / kill(SIGUSR1)"| B["Shared state:<br/>is_set flag,<br/>ProcSignal reason bit"]
B --> C["OS wakeup:<br/>self-pipe byte /<br/>signalfd / SIGURG"]
C --> D["Sleeping process in<br/>WaitEventSetWait()"]
D -->|"returns readiness"| E["Main loop reaches<br/>safe point"]
E -->|"CHECK_FOR_INTERRUPTS()"| F["ProcessInterrupts():<br/>cancel / die / barrier"]
PostgreSQL’s Approach
PostgreSQL implements the three layers as three source modules plus the
interrupt glue in postgres.c. The layering is strict: latch.c is a
thin facade over waiteventset.c, which is the only module that touches
the OS reactor; procsignal.c builds the SIGUSR1 multiplexer on top of
latches; and postgres.c defines what the deferred handlers actually do.
The Latch: a one-bit, race-free wakeup
A Latch is three fields that matter (is_set, maybe_sleeping, and
owner_pid) plus an is_shared flag. A process-local latch (InitLatch)
can only be set from within the same process (typically from a signal
handler); a shared latch (InitSharedLatch + OwnLatch) lives in shared
memory and can be set by any backend. MyLatch is every process’s
standard wakeup object, used pervasively as the “something happened, go
re-check your state” signal.
SetLatch is the heart of the race-freedom. It uses two memory barriers
to sandwich the publish-then-check sequence:
    // SetLatch: src/backend/storage/ipc/latch.c
    pg_memory_barrier();

    /* Quick exit if already set */
    if (latch->is_set)
        return;

    latch->is_set = true;

    pg_memory_barrier();
    if (!latch->maybe_sleeping)
        return;

    /* ... figure out owner_pid and deliver a wakeup ... */
    owner_pid = latch->owner_pid;
    if (owner_pid == 0)
        return;
    else if (owner_pid == MyProcPid)
        WakeupMyProc();               /* in-process: self-pipe or SIGURG to self */
    else
        WakeupOtherProc(owner_pid);   /* cross-process: kill(pid, SIGURG) */

The waiter side, inside WaitEventSetWait, performs the mirror-image
dance: it sets maybe_sleeping = true, issues a barrier, then re-checks
is_set before actually sleeping. If SetLatch ran in the window, either
the waiter sees is_set and skips the sleep, or the setter sees
maybe_sleeping and delivers the wakeup — never both-miss. ResetLatch
clears is_set with a trailing barrier so a subsequent flag-read cannot
be reordered before the clear.
flowchart TD
subgraph Setter["SetLatch (any process / handler)"]
S1["barrier; if is_set return"] --> S2["is_set = true"]
S2 --> S3["barrier; if !maybe_sleeping return"]
S3 --> S4["owner==me? WakeupMyProc<br/>else WakeupOtherProc(pid)"]
end
subgraph Waiter["WaitEventSetWait (owner)"]
W1["maybe_sleeping = true"] --> W2["barrier"]
W2 --> W3["if is_set: report WL_LATCH_SET,<br/>skip sleep"]
W3 --> W4["else block in epoll/poll<br/>until self-pipe/signalfd readable"]
W4 --> W5["drain(); recheck is_set"]
end
S4 -.->|"self-pipe byte / SIGURG"| W4
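The two-sided handshake in the diagram can be modelled with C11 atomics. This is a sketch under stated assumptions, not the PostgreSQL implementation: `toy_latch` and its functions are hypothetical, sequentially consistent fences stand in for pg_memory_barrier(), and the actual OS wakeup is left to the caller (the setter only reports whether one is needed).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct
{
    atomic_bool is_set;           /* "there is work" flag */
    atomic_bool maybe_sleeping;   /* owner is about to block, or blocked */
} toy_latch;

/* Setter: returns true if the caller must deliver an OS-level wakeup. */
bool toy_set_latch(toy_latch *l)
{
    assert(l != NULL);
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&l->is_set, memory_order_relaxed))
        return false;                          /* already set: nothing to do */
    atomic_store_explicit(&l->is_set, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst); /* publish before checking */
    return atomic_load_explicit(&l->maybe_sleeping, memory_order_relaxed);
}

/* Waiter: returns true if it should block, false if the latch is already set. */
bool toy_prepare_to_sleep(toy_latch *l)
{
    assert(l != NULL);
    atomic_store_explicit(&l->maybe_sleeping, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst); /* announce before rechecking */
    if (atomic_load_explicit(&l->is_set, memory_order_relaxed))
    {
        atomic_store_explicit(&l->maybe_sleeping, false, memory_order_relaxed);
        return false;                          /* skip the sleep */
    }
    return true;
}

/* Owner only: clear the flag, then fence so later reads see fresh state. */
void toy_reset_latch(toy_latch *l)
{
    assert(l != NULL);
    atomic_store_explicit(&l->is_set, false, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
}
```

With a full fence between each side's store and its subsequent load, at least one side must observe the other's store, which is why the both-miss outcome cannot occur.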
The WaitEventSet: one wait over latch + sockets + PM death + timeout
WaitLatch is now just a convenience wrapper. The real machinery is the
WaitEventSet, which holds a registered array of WaitEvent interest
records and a per-platform kernel object (epoll_fd, kqueue_fd, or a
pollfd array). You build one with CreateWaitEventSet, register events
with AddWaitEventToSet (returning a position you can later
ModifyWaitEvent), and block in WaitEventSetWait. The event kinds are a
bitmask: WL_LATCH_SET, WL_SOCKET_READABLE/WRITEABLE/CLOSED/...,
WL_POSTMASTER_DEATH (or WL_EXIT_ON_PM_DEATH), and WL_TIMEOUT.
The crucial design move is that all three non-socket events are mapped onto something the OS reactor already waits on, so one blocking call covers them uniformly:
- the latch becomes the read end of the self-pipe (poll build) or the signalfd (epoll build);
- postmaster death becomes postmaster_alive_fds[POSTMASTER_FD_WATCH], a pipe the postmaster holds open and the kernel closes (EPOLLHUP) when the postmaster dies;
- the timeout is the reactor call’s own timeout argument.
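The postmaster-death mapping in the list above relies on a general pipe property: once every holder of the write end is gone, the read end reports end-of-file. A minimal sketch follows, assuming a non-blocking read end; `set_nonblocking` and `parent_is_alive` are hypothetical helper names, not PostgreSQL functions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on failure. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Returns true while some process still holds the write end open.
 * Nobody ever writes to this pipe, so the only possible outcomes are
 * EAGAIN (still open) and a zero-byte read (all writers are gone).
 */
bool parent_is_alive(int watch_fd)
{
    char c;
    ssize_t rc;

    assert(watch_fd >= 0);
    rc = read(watch_fd, &c, 1);
    if (rc < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return true;     /* nothing to read, pipe still open */
    return false;        /* EOF, or anything unexpected: treat as death */
}
```

Since the descriptor becomes readable at EOF, the same fd can sit in a poll/epoll interest set, and a check like this one can confirm the report before acting on it.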
    // AddWaitEventToSet: src/backend/storage/ipc/waiteventset.c
    if (events == WL_LATCH_SET)
    {
        set->latch = latch;
        set->latch_pos = event->pos;
    #if defined(WAIT_USE_SELF_PIPE)
        event->fd = selfpipe_readfd;
    #elif defined(WAIT_USE_SIGNALFD)
        event->fd = signal_fd;
    #else
        event->fd = PGINVALID_SOCKET;
    #ifdef WAIT_USE_EPOLL
        return event->pos;
    #endif
    #endif
    }
    else if (events == WL_POSTMASTER_DEATH)
    {
    #ifndef WIN32
        event->fd = postmaster_alive_fds[POSTMASTER_FD_WATCH];
    #endif
    }

WaitLatch exploits a single long-lived LatchWaitSet (built once by
InitializeLatchWaitSet) with exactly two slots — the latch at position 0
and PM-death at position 1 — and merely ModifyWaitEvents them on each
call. That avoids the cost of creating and tearing down an epoll fd for
the extremely common single-latch wait. WaitLatchOrSocket, by contrast,
builds a fresh three-slot set per call because the socket varies.
The self-pipe / signalfd / SIGURG trick
PostgreSQL chooses its wakeup mechanism at compile time based on the
reactor. The header comment in waiteventset.c lays out the matrix
precisely: poll() uses the self-pipe; epoll() uses signalfd and keeps
SIGURG blocked; kqueue() registers EVFILT_SIGNAL for SIGURG; Windows
uses inherited event objects. The signal used to carry a latch wakeup is
SIGURG, deliberately distinct from SIGUSR1 (which carries ProcSignal
messages) so the two concerns don’t interfere.
    // latch_sigurg_handler / sendSelfPipeByte: src/backend/storage/ipc/waiteventset.c
    static void
    latch_sigurg_handler(SIGNAL_ARGS)
    {
        if (waiting)
            sendSelfPipeByte();   /* turn the signal into a readable fd */
    }

On the epoll path there is no handler at all: InitializeWaitEventSupport
blocks SIGURG and opens a signalfd so that delivery of SIGURG becomes a
readable descriptor that epoll watches directly. Either way, when
WaitEventSetWaitBlock reports the latch fd readable it calls drain()
to empty the pipe/signalfd (non-blocking read until EAGAIN) before
re-checking is_set, so accumulated bytes can’t cause a busy-spin.
ProcSignal: multiplexing SIGUSR1 into many reasons
Cross-backend messages that aren’t just “wake up” ride on SIGUSR1.
procsignal.c keeps a shared-memory array of ProcSignalSlot, one per
ProcNumber (plus one per auxiliary process). Each slot has a
pss_signalFlags[NUM_PROCSIGNALS] array of booleans. To signal backend
N with reason R, SendProcSignal sets slot[N].pss_signalFlags[R] = true under the slot’s spinlock and then kill(pid, SIGUSR1). The
recipient’s procsignal_sigusr1_handler scans every reason with
CheckProcSignal, dispatches a Handle...Interrupt for each set flag,
and finally SetLatch(MyLatch).
    // procsignal_sigusr1_handler (excerpt): src/backend/storage/ipc/procsignal.c
    if (CheckProcSignal(PROCSIG_NOTIFY_INTERRUPT))
        HandleNotifyInterrupt();
    if (CheckProcSignal(PROCSIG_PARALLEL_MESSAGE))
        HandleParallelMessageInterrupt();
    if (CheckProcSignal(PROCSIG_BARRIER))
        HandleProcSignalBarrierInterrupt();
    /* ... recovery-conflict reasons ... */
    SetLatch(MyLatch);

Note the layering: every Handle...Interrupt does only flag-setting
(InterruptPending = true; SomethingPending = true;), and the final
SetLatch ensures the backend wakes from any wait so its next
CHECK_FOR_INTERRUPTS() runs the real handler. ProcSignal also carries
global barriers (EmitProcSignalBarrier / WaitForProcSignalBarrier),
a generation-counter protocol used when a state change must be
acknowledged by every backend before proceeding (e.g. SMGR release).
Interrupt deferral: CHECK_FOR_INTERRUPTS and ProcessInterrupts
The deferral contract lives in miscadmin.h and postgres.c. Handlers
set InterruptPending; the main loop sprinkles CHECK_FOR_INTERRUPTS()
at safe points; that macro calls ProcessInterrupts() only when an
interrupt is pending, which then services ProcDiePending,
QueryCancelPending, timeouts, recovery conflicts, ProcSignal barriers,
and parallel messages — throwing ERROR/FATAL as appropriate.
    // CHECK_FOR_INTERRUPTS / INTERRUPTS_CAN_BE_PROCESSED: src/include/miscadmin.h
    #define CHECK_FOR_INTERRUPTS() \
    do { \
        if (INTERRUPTS_PENDING_CONDITION()) \
            ProcessInterrupts(); \
    } while(0)

    #define INTERRUPTS_CAN_BE_PROCESSED() \
        (InterruptHoldoffCount == 0 && CritSectionCount == 0 && \
         QueryCancelHoldoffCount == 0)

HOLD_INTERRUPTS()/RESUME_INTERRUPTS() increment and decrement InterruptHoldoffCount, and
ProcessInterrupts bails out immediately if that counter or
CritSectionCount is nonzero — this is how a critical section defers a
cancel until it is safe. The query-cancel path additionally respects
QueryCancelHoldoffCount so a cancel cannot fire while the backend is
reading a protocol message (which would desync the FE/BE stream).
flowchart TD
K1["kill(pid, SIGINT)<br/>StatementCancelHandler"] --> K2["QueryCancelPending = true<br/>InterruptPending = true<br/>SetLatch(MyLatch)"]
K2 --> K3["backend wakes from<br/>WaitEventSetWait"]
K3 --> K4["next CHECK_FOR_INTERRUPTS()"]
K4 -->|"holdoff/crit-section?"| K5["re-arm InterruptPending,<br/>defer"]
K4 -->|"safe"| K6["ProcessInterrupts():<br/>ereport(ERROR,<br/>'canceling statement<br/>due to user request')"]
Source Walkthrough
This section walks the actual code in call-flow order, grouped by subsystem. Symbols are the stable anchors; line numbers live only in the position-hint table at the end.
Latch lifecycle and SetLatch (latch.c)
A latch is initialized by InitLatch (process-local) or InitSharedLatch + OwnLatch (shared). OwnLatch is the moment a shared latch becomes associated with the current process, recording owner_pid = MyProcPid; DisownLatch reverses it. The sanity check in OwnLatch PANICs if the latch already has an owner. There is intentionally no lock, so callers must interlock externally if two processes might race to own one latch.

    // OwnLatch: src/backend/storage/ipc/latch.c
    owner_pid = latch->owner_pid;
    if (owner_pid != 0)
        elog(PANIC, "latch already owned by PID %d", owner_pid);
    latch->owner_pid = MyProcPid;

SetLatch (quoted in the previous section) is the only setter. Its key
properties: it is safe to call from a signal handler and from critical
sections (it never throws), it is cheap when the latch is already set
(the first barrier+test short-circuits), and it picks the wakeup delivery
by comparing owner_pid to MyProcPid — an in-process set (the common
case for a handler that just flagged an interrupt) routes to
WakeupMyProc, a cross-process set to WakeupOtherProc. ResetLatch
clears is_set and must only be called by the owner; the standard idiom
is wait, then reset, then process, looping at the bottom so a set that
arrives during processing is not lost.
Building and modifying a WaitEventSet (waiteventset.c)
InitializeWaitEventSupport runs once per process at startup. On the poll
build it creates the self-pipe (both ends non-blocking and
close-on-exec), installs latch_sigurg_handler for SIGURG, and reserves
two external FDs with fd.c. On the epoll build it instead blocks SIGURG
and opens a signalfd:
    // InitializeWaitEventSupport (signalfd branch): src/backend/storage/ipc/waiteventset.c
    sigaddset(&UnBlockSig, SIGURG);   /* keep SIGURG blocked */
    sigemptyset(&signalfd_mask);
    sigaddset(&signalfd_mask, SIGURG);
    signal_fd = signalfd(-1, &signalfd_mask, SFD_NONBLOCK | SFD_CLOEXEC);
    if (signal_fd < 0)
        elog(FATAL, "signalfd() failed");

CreateWaitEventSet allocates one contiguous block (MAXALIGN-padded) that
holds the WaitEventSet, the WaitEvent array, and the platform’s return
buffer (epoll_ret_events / kqueue_ret_events / pollfds), then opens
the kernel object (epoll_create1(EPOLL_CLOEXEC) / kqueue()). The set
is optionally tracked by a ResourceOwner so it is freed on error.
AddWaitEventToSet validates the request (a latch event needs a latch
owned by this process; a socket event needs a real fd), stores a
WaitEvent, maps internal events to fds (shown earlier), and calls the
platform-specific adjust routine. For epoll that is WaitEventAdjustEpoll,
which translates the WL_* mask into EPOLLIN/EPOLLOUT/EPOLLRDHUP and
calls epoll_ctl:
    // WaitEventAdjustEpoll: src/backend/storage/ipc/waiteventset.c
    epoll_ev.data.ptr = event;               /* so epoll_wait hands back our WaitEvent */
    epoll_ev.events = EPOLLERR | EPOLLHUP;   /* always watch for errors */
    if (event->events == WL_LATCH_SET)
        epoll_ev.events |= EPOLLIN;
    else if (event->events == WL_POSTMASTER_DEATH)
        epoll_ev.events |= EPOLLIN;
    else
    {
        if (event->events & WL_SOCKET_READABLE)
            epoll_ev.events |= EPOLLIN;
        if (event->events & WL_SOCKET_WRITEABLE)
            epoll_ev.events |= EPOLLOUT;
        if (event->events & WL_SOCKET_CLOSED)
            epoll_ev.events |= EPOLLRDHUP;
    }
    rc = epoll_ctl(set->epoll_fd, action, event->fd, &epoll_ev);

ModifyWaitEvent is the fast-path used by WaitLatch: if neither mask
nor latch changed it returns immediately, and on Unix a latch
modification needs no kernel call at all (the underlying pipe/signalfd is
shared across all latches), so switching MyLatch in and out is nearly
free.
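The mask translation that the adjust routine performs has a portable analogue on the poll() path. The following sketch shows the two directions of that mapping under simplifying assumptions: the `EV_*` bits and both function names are hypothetical, only readable/writeable interest is modelled, and error conditions are reported against whichever interest was registered.

```c
#include <assert.h>
#include <poll.h>

/* Hypothetical server-side event vocabulary (a bitmask). */
enum { EV_READABLE = 1 << 0, EV_WRITEABLE = 1 << 1 };

/* Interest mask -> poll() "events" field. */
short to_poll_events(unsigned interest)
{
    short ev = 0;

    assert((interest & ~(unsigned) (EV_READABLE | EV_WRITEABLE)) == 0);
    if (interest & EV_READABLE)
        ev |= POLLIN;
    if (interest & EV_WRITEABLE)
        ev |= POLLOUT;
    return ev;
}

/* poll() "revents" -> occurred mask, limited to what was asked for. */
unsigned from_poll_revents(unsigned interest, short revents)
{
    const short errflags = POLLHUP | POLLERR | POLLNVAL;
    unsigned occurred = 0;

    /* Errors wake both kinds of waiter so the caller's I/O call reports them. */
    if ((interest & EV_READABLE) && (revents & (POLLIN | errflags)))
        occurred |= EV_READABLE;
    if ((interest & EV_WRITEABLE) && (revents & (POLLOUT | errflags)))
        occurred |= EV_WRITEABLE;
    return occurred;
}
```

Keeping the translation in one pair of functions is what lets the rest of the server speak only its own event vocabulary.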
The wait loop: WaitEventSetWait and WaitEventSetWaitBlock
WaitEventSetWait is the portable outer loop. It sets the
waiting-flag (so the SIGURG handler knows to write to the self-pipe),
then loops until at least one event is returned. The latch fast-path is
the embodiment of the race-free protocol: set maybe_sleeping, barrier,
recheck is_set, and only block if still unset.
    // WaitEventSetWait (latch fast path): src/backend/storage/ipc/waiteventset.c
    if (set->latch && !set->latch->is_set)
    {
        set->latch->maybe_sleeping = true;   /* tell SetLatch we're about to sleep */
        pg_memory_barrier();                 /* and recheck */
    }
    if (set->latch && set->latch->is_set)
    {
        occurred_events->events = WL_LATCH_SET;
        /* ... fill event, return without blocking ... */
        set->latch->maybe_sleeping = false;
    }

WaitEventSetWaitBlock is the platform-specific inner call. The epoll
version calls epoll_wait, treats EINTR as “retry” and rc == 0 as
timeout, then walks the returned epoll_event array. When the latch fd
fires it drain()s the signalfd and re-checks set->latch->is_set && maybe_sleeping before reporting WL_LATCH_SET — guarding against a
spurious wakeup. When the PM-death fd fires it re-confirms with
PostmasterIsAliveInternal() (a spurious death report would be
catastrophic) and, if exit_on_postmaster_death, calls proc_exit(1)
directly.
    // WaitEventSetWaitBlock (epoll, latch + PM-death): src/backend/storage/ipc/waiteventset.c
    if (cur_event->events == WL_LATCH_SET &&
        cur_epoll_event->events & (EPOLLIN | EPOLLERR | EPOLLHUP))
    {
        drain();   /* empty signalfd / self-pipe */
        if (set->latch && set->latch->maybe_sleeping && set->latch->is_set)
        {
            occurred_events->events = WL_LATCH_SET;
            returned_events++;
        }
    }
    else if (cur_event->events == WL_POSTMASTER_DEATH && ...)
    {
        if (!PostmasterIsAliveInternal())
        {
            if (set->exit_on_postmaster_death)
                proc_exit(1);
            occurred_events->events = WL_POSTMASTER_DEATH;
        }
    }

Wakeup delivery and drain (waiteventset.c)
WakeupMyProc is what SetLatch calls when the owner is the current
process (typical inside a signal handler): on the self-pipe build it
writes a byte, on the signalfd build it kill(MyProcPid, SIGURG).
WakeupOtherProc always sends kill(pid, SIGURG). sendSelfPipeByte
writes a single byte non-blocking, treating EAGAIN/EWOULDBLOCK as
success (a full pipe already carries the wakeup) and silently ignoring
other errors because it may run in a handler. drain reads until the
descriptor is empty.
    // WakeupMyProc / WakeupOtherProc: src/backend/storage/ipc/waiteventset.c
    void
    WakeupMyProc(void)
    {
    #if defined(WAIT_USE_SELF_PIPE)
        if (waiting)
            sendSelfPipeByte();
    #else
        if (waiting)
            kill(MyProcPid, SIGURG);
    #endif
    }

    void
    WakeupOtherProc(int pid)
    {
        kill(pid, SIGURG);
    }

ProcSignal: slots, send, and the SIGUSR1 handler (procsignal.c)
ProcSignalShmemInit lays out the ProcSignalHeader (a global barrier
generation + a flexible array of NumProcSignalSlots slots).
ProcSignalInit claims MyProcNumber’s slot, publishing pss_pid with a
write-membarrier so EmitProcSignalBarrier cannot skip a
half-initialized slot, and registers CleanupProcSignalState via
on_shmem_exit.
SendProcSignal is the sender. With a known ProcNumber it goes straight
to the slot; otherwise it scans back-to-front (auxiliary processes live
near the end). It sets the reason flag under the spinlock and sends one
SIGUSR1:
    // SendProcSignal (fast path): src/backend/storage/ipc/procsignal.c
    SpinLockAcquire(&slot->pss_mutex);
    if (pg_atomic_read_u32(&slot->pss_pid) == pid)
    {
        slot->pss_signalFlags[reason] = true;   /* which reason */
        SpinLockRelease(&slot->pss_mutex);
        return kill(pid, SIGUSR1);              /* the doorbell */
    }
    SpinLockRelease(&slot->pss_mutex);

CheckProcSignal reads and clears one reason flag (the flag is volatile sig_atomic_t, readable in the handler without the lock).
procsignal_sigusr1_handler (quoted earlier) walks all reasons and ends
with SetLatch(MyLatch). Each Handle...Interrupt it dispatches lives
elsewhere (async.c, parallel.c, postgres.c) and only sets pending
flags.
Global barriers (procsignal.c)
EmitProcSignalBarrier(type) ORs a type bit into every slot’s
pss_barrierCheckMask, atomically increments the global
psh_barrierGeneration, and SIGUSR1s every live process with the
PROCSIG_BARRIER reason. ProcessProcSignalBarrier (run from
CHECK_FOR_INTERRUPTS) compares the process’s local generation to the
shared one, exchanges out its check-mask, and dispatches each requested
barrier (e.g. ProcessBarrierSmgrRelease) inside a PG_TRY so an ERROR
re-arms the bits. On success it advances its pss_barrierGeneration and
broadcasts the slot’s condition variable. WaitForProcSignalBarrier(gen)
sleeps on each slot’s CV until every slot’s generation reaches gen.
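The generation protocol described above can be modelled in a single process with C11 atomics. This sketch is a simplification under explicit assumptions: the `toy_*` names and the fixed slot count are hypothetical, there is no signalling or error re-arming, and completion is polled instead of waited for on a condition variable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NSLOTS 3

typedef struct
{
    atomic_uint_fast32_t check_mask;   /* barrier types this slot must process */
    atomic_uint_fast64_t generation;   /* last generation this slot absorbed */
} toy_slot;

static atomic_uint_fast64_t global_generation;
static toy_slot slots[NSLOTS];

/* Emitter: set the type bit everywhere first, then advance the generation. */
uint64_t toy_emit_barrier(uint32_t typebit)
{
    for (int i = 0; i < NSLOTS; i++)
        atomic_fetch_or(&slots[i].check_mask, typebit);
    return (uint64_t) atomic_fetch_add(&global_generation, 1) + 1;
}

/* Recipient: absorb pending barrier types; returns the bits it handled. */
uint32_t toy_process_barrier(int slot)
{
    uint64_t shared_gen;
    uint32_t todo;

    assert(slot >= 0 && slot < NSLOTS);
    shared_gen = (uint64_t) atomic_load(&global_generation);
    if ((uint64_t) atomic_load(&slots[slot].generation) == shared_gen)
        return 0;                                   /* already caught up */
    todo = (uint32_t) atomic_exchange(&slots[slot].check_mask, 0);
    /* ... act on each bit in todo here ... */
    atomic_store(&slots[slot].generation, shared_gen);
    return todo;
}

/* Emitter: has every slot reached generation gen? */
bool toy_barrier_complete(uint64_t gen)
{
    for (int i = 0; i < NSLOTS; i++)
        if ((uint64_t) atomic_load(&slots[i].generation) < gen)
            return false;
    return true;
}
```

Setting the mask bits before bumping the generation matters: any recipient that observes the new generation is then guaranteed to find the corresponding bit when it exchanges out its mask.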
    // EmitProcSignalBarrier: src/backend/storage/ipc/procsignal.c
    for (int i = 0; i < NumProcSignalSlots; i++)
        pg_atomic_fetch_or_u32(&ProcSignal->psh_slot[i].pss_barrierCheckMask, flagbit);
    generation = pg_atomic_add_fetch_u64(&ProcSignal->psh_barrierGeneration, 1);
    /* ... then SIGUSR1 every slot with pid != 0, reason PROCSIG_BARRIER ... */
    return generation;

Interrupt handlers and ProcessInterrupts (postgres.c)
The terminal handlers are tiny. die (SIGTERM) sets
ProcDiePending/InterruptPending and SetLatch(MyLatch);
StatementCancelHandler (SIGINT) sets QueryCancelPending likewise. Note
that the cancel signal is SIGINT, sent by SendCancelRequest after a
cancel-key match; ProcSignal messages are SIGUSR1; latch wakeups are
SIGURG — three distinct signals, three distinct purposes.
    // StatementCancelHandler: src/backend/tcop/postgres.c
    if (!proc_exit_inprogress)
    {
        InterruptPending = true;
        QueryCancelPending = true;
    }
    SetLatch(MyLatch);   /* waken anything waiting */

ProcessInterrupts is the out-of-line body of CHECK_FOR_INTERRUPTS. It
returns immediately if InterruptHoldoffCount or CritSectionCount is
nonzero, then clears InterruptPending and services each condition in
priority order — ProcDiePending first (FATAL, with role-specific
messages for autovacuum/bgworker/walreceiver/etc.), then client-connection
loss, then QueryCancelPending (deferred again if
QueryCancelHoldoffCount != 0 to protect protocol reads), then the
timeout family, recovery conflicts, ProcSignal barrier, and parallel
messages.
    // ProcessInterrupts (cancel-during-read guard): src/backend/tcop/postgres.c
    if (QueryCancelPending && QueryCancelHoldoffCount != 0)
    {
        /* Re-arm so we process the cancel once we're done reading the message. */
        InterruptPending = true;
    }
    else if (QueryCancelPending)
    {
        QueryCancelPending = false;
        /* ... distinguish lock_timeout / statement_timeout / user request,
           LockErrorCleanup(), then ereport(ERROR, "canceling statement ...") */
    }

ProcessClientReadInterrupt and ProcessClientWriteInterrupt wrap the
low-level socket reads/writes: while idle (DoingCommandRead) a read
checks for interrupts and services NOTIFY/catchup; while dying
(ProcDiePending) they ensure the latch is set so a blocked I/O comes
back and dies promptly rather than hanging on an unresponsive client.
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| InitializeLatchWaitSet | src/backend/storage/ipc/latch.c | 35 |
| InitLatch | src/backend/storage/ipc/latch.c | 63 |
| InitSharedLatch | src/backend/storage/ipc/latch.c | 93 |
| OwnLatch | src/backend/storage/ipc/latch.c | 126 |
| DisownLatch | src/backend/storage/ipc/latch.c | 144 |
| WaitLatch | src/backend/storage/ipc/latch.c | 172 |
| WaitLatchOrSocket | src/backend/storage/ipc/latch.c | 223 |
| SetLatch | src/backend/storage/ipc/latch.c | 290 |
| ResetLatch | src/backend/storage/ipc/latch.c | 374 |
| InitializeWaitEventSupport | src/backend/storage/ipc/waiteventset.c | 241 |
| CreateWaitEventSet | src/backend/storage/ipc/waiteventset.c | 364 |
| FreeWaitEventSet | src/backend/storage/ipc/waiteventset.c | 481 |
| AddWaitEventToSet | src/backend/storage/ipc/waiteventset.c | 570 |
| ModifyWaitEvent | src/backend/storage/ipc/waiteventset.c | 656 |
| WaitEventAdjustEpoll | src/backend/storage/ipc/waiteventset.c | 738 |
| WaitEventSetWait | src/backend/storage/ipc/waiteventset.c | 1038 |
| WaitEventSetWaitBlock (epoll) | src/backend/storage/ipc/waiteventset.c | 1182 |
| latch_sigurg_handler | src/backend/storage/ipc/waiteventset.c | 1896 |
| sendSelfPipeByte | src/backend/storage/ipc/waiteventset.c | 1904 |
| drain | src/backend/storage/ipc/waiteventset.c | 1945 |
| WakeupMyProc | src/backend/storage/ipc/waiteventset.c | 2020 |
| WakeupOtherProc | src/backend/storage/ipc/waiteventset.c | 2033 |
| ProcSignalSlot (struct) | src/backend/storage/ipc/procsignal.c | 64 |
| ProcSignalShmemInit | src/backend/storage/ipc/procsignal.c | 132 |
| ProcSignalInit | src/backend/storage/ipc/procsignal.c | 167 |
| SendProcSignal | src/backend/storage/ipc/procsignal.c | 293 |
| EmitProcSignalBarrier | src/backend/storage/ipc/procsignal.c | 365 |
| WaitForProcSignalBarrier | src/backend/storage/ipc/procsignal.c | 433 |
| ProcessProcSignalBarrier | src/backend/storage/ipc/procsignal.c | 508 |
| CheckProcSignal | src/backend/storage/ipc/procsignal.c | 658 |
| procsignal_sigusr1_handler | src/backend/storage/ipc/procsignal.c | 683 |
| SendCancelRequest | src/backend/storage/ipc/procsignal.c | 741 |
| ProcessClientReadInterrupt | src/backend/tcop/postgres.c | 502 |
| ProcessClientWriteInterrupt | src/backend/tcop/postgres.c | 548 |
| die | src/backend/tcop/postgres.c | 3027 |
| StatementCancelHandler | src/backend/tcop/postgres.c | 3057 |
| HandleRecoveryConflictInterrupt | src/backend/tcop/postgres.c | 3090 |
| ProcessInterrupts | src/backend/tcop/postgres.c | 3299 |
| CHECK_FOR_INTERRUPTS (macro) | src/include/miscadmin.h | 123 |
| INTERRUPTS_CAN_BE_PROCESSED (macro) | src/include/miscadmin.h | 130 |
Source verification (as of 2026-06-05)
All facts below were checked against the REL_18_STABLE tree at commit
273fe94 under /data/hgryoo/references/postgres.
Verified true:
- latch.c is a thin wrapper over waiteventset.c: WaitLatch uses a shared two-slot LatchWaitSet and WaitLatchOrSocket builds a three-slot set per call (CreateWaitEventSet(CurrentResourceOwner, 3)). Confirmed in WaitLatch/WaitLatchOrSocket.
- SetLatch brackets its publish/check with two pg_memory_barrier() calls and dispatches via WakeupMyProc/WakeupOtherProc. The WaitEventSet side sets maybe_sleeping with a matching barrier in WaitEventSetWait. Confirmed.
- The latch wakeup signal is SIGURG, ProcSignal uses SIGUSR1, and query-cancel uses SIGINT. Confirmed in WakeupOtherProc (kill(pid, SIGURG)), SendProcSignal (kill(pid, SIGUSR1)), and SendCancelRequest (kill(-backendPID, SIGINT) under HAVE_SETSID).
- The compile-time matrix is: epoll⇒signalfd (SIGURG blocked, no handler), poll⇒self-pipe (latch_sigurg_handler installed), kqueue⇒EVFILT_SIGNAL for SIGURG, win32⇒event objects. Confirmed in the WAIT_USE_* #if ladder and InitializeWaitEventSupport.
- Postmaster death is delivered as the read end of postmaster_alive_fds[POSTMASTER_FD_WATCH], re-confirmed with PostmasterIsAliveInternal() before reporting. WL_EXIT_ON_PM_DEATH calls proc_exit(1) from inside WaitEventSetWaitBlock. Confirmed.
- ProcSignal slots are indexed by ProcNumber; NumProcSignalSlots = MaxBackends + NUM_AUXILIARY_PROCS. pss_signalFlags is volatile sig_atomic_t[NUM_PROCSIGNALS]. Confirmed.
- The barrier protocol uses a 64-bit generation counter plus a per-slot pss_barrierCheckMask and a per-slot ConditionVariable pss_barrierCV. ProcessProcSignalBarrier clears the mask via pg_atomic_exchange_u32 before processing and re-arms on failure inside a PG_TRY. Confirmed.
- ProcessInterrupts early-returns on InterruptHoldoffCount || CritSectionCount, and re-arms InterruptPending when QueryCancelPending && QueryCancelHoldoffCount != 0. Confirmed.
- In REL_18, the only barrier type handled in ProcessProcSignalBarrier’s switch is PROCSIGNAL_BARRIER_SMGRRELEASE (→ ProcessBarrierSmgrRelease). Confirmed in the switch (type) block.
Scope notes / non-assertions:
- The kqueue and Win32 WaitEventSetWaitBlock variants exist and were read but are not quoted; the doc asserts only the epoll path’s line-level details. The poll path’s self-pipe handler is quoted.
- signalfd / epoll are Linux-only; the position-hint lines for latch_sigurg_handler / sendSelfPipeByte are inside #if defined(WAIT_USE_SELF_PIPE) and only compile on the poll build. The line numbers are still accurate as source positions.
- contrib/ is out of scope; no contrib symbols are asserted here.
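The SetLatch and WaitEventSetWait barrier pairing confirmed in the list above can be looked at in isolation. What follows is a minimal single-process sketch that uses C11 atomics in place of pg_memory_barrier(); it is not code from the tree. The names demo_latch, demo_set_latch, demo_should_block, and the wakeups_sent counter (a stand-in for the real kill or pipe write) are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a latch; not PostgreSQL's Latch struct. */
typedef struct demo_latch
{
    atomic_bool is_set;         /* the single bit of latch state */
    atomic_bool maybe_sleeping; /* waiter's hint: "I may be about to block" */
    int         wakeups_sent;   /* simulated wakeups; single-threaded demo only */
} demo_latch;

static void
demo_init_latch(demo_latch *latch)
{
    atomic_init(&latch->is_set, false);
    atomic_init(&latch->maybe_sleeping, false);
    latch->wakeups_sent = 0;
}

/* Setter: publish is_set, fence, then decide whether a wakeup is needed. */
static void
demo_set_latch(demo_latch *latch)
{
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&latch->is_set, memory_order_relaxed))
        return;                 /* already set: nothing to publish */
    atomic_store_explicit(&latch->is_set, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    if (!atomic_load_explicit(&latch->maybe_sleeping, memory_order_relaxed))
        return;                 /* waiter is not blocking: skip the wakeup */
    latch->wakeups_sent++;      /* assumption: real code would signal here */
}

/*
 * Waiter: announce the intent to sleep, fence, then re-check is_set.
 * Returns true if the caller really has to block.
 */
static bool
demo_should_block(demo_latch *latch)
{
    atomic_store_explicit(&latch->maybe_sleeping, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&latch->is_set, memory_order_relaxed))
    {
        atomic_store_explicit(&latch->maybe_sleeping, false,
                              memory_order_relaxed);
        return false;           /* latch was set in the meantime */
    }
    return true;
}
```

Each side writes its own flag, issues a full fence, and only then reads the other side's flag, so at least one of them must observe the other: either the setter sees maybe_sleeping and sends a wakeup, or the waiter sees is_set and skips the sleep.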
Beyond PostgreSQL — Comparative Designs & Research Frontiers
The latch/wait-set/interrupt stack is one concrete answer to a problem every concurrent system faces. Placing it beside other designs sharpens what PostgreSQL chose and why.
Latches vs. futexes vs. condition variables
PostgreSQL’s Latch is morally a cross-process condition variable
restricted to a single bit of state, but it deliberately does not use
a kernel blocking primitive like a Linux futex for the wakeup.
The reason is portability and the need to compose the wakeup with other
wait sources in one syscall: a futex can only wait on a futex word, but a
backend must wait on a latch and a socket and postmaster death
simultaneously. Routing the latch through a file descriptor
(self-pipe/signalfd) makes it just another fd in the reactor’s interest
set. The cost is one extra syscall per wakeup (the pipe write or the
kill) plus the drain read. PostgreSQL does have a futex-style sleeping
primitive elsewhere: LWLocks sleep on the proc’s semaphore. A semaphore
cannot be multiplexed with sockets, which is precisely why latches exist
as a separate mechanism. ConditionVariable (used by the barrier CV above)
is not a second kernel primitive: it keeps a queue of waiting procs and
sleeps on the process latch. The sibling doc postgres-lwlock-spinlock.md
covers the semaphore-based sleeping locks.
Self-pipe vs. signalfd vs. EVFILT_SIGNAL vs. eventfd
The “make a signal look like a readable fd” problem has spawned a small
zoo of OS-specific answers, and PostgreSQL uses three of them: the
self-pipe, signalfd, and kqueue’s EVFILT_SIGNAL (Win32 builds use event
objects instead). The
classic self-pipe trick (Bernstein, ~1990s; popularized by Stevens
and the qmail codebase) is the portable baseline. Linux’s signalfd
(2.6.22) and eventfd (2.6.22) were added specifically to retire the
self-pipe; PostgreSQL uses signalfd on epoll builds but notably does not
use eventfd for latches, since signalfd composes cleanly with the existing
SIGURG-based cross-process wakeup. The BSD kqueue EVFILT_SIGNAL
filter folds signal waiting directly into the reactor with no auxiliary
fd at all — arguably the cleanest design, and the reason the kqueue path
needs neither a self-pipe nor a signalfd. A research-frontier question is
whether io_uring’s IORING_OP_* and its native support for waiting on
eventfd/futex could one day replace the WaitEventSet entirely; PostgreSQL
18’s new async I/O subsystem (postgres-aio.md) is the first foothold of
io_uring in the tree, but the latch path is untouched by it so far.
Signal multiplexing vs. message queues
ProcSignal multiplexes one signal number into ~20 reasons via
shared-memory flags. An alternative design — used by many actor-model and
microkernel systems — is a genuine per-process message queue where
each message carries a typed payload. PostgreSQL actually has both: the
shm_mq shared-memory queue (postgres-shared-memory-ipc.md) carries
data between parallel workers and their leader, while ProcSignal carries
only notifications (“there is a message waiting in the queue, go look”).
The split is deliberate: signals are scarce and lossy (coalescing
identical reasons is a feature, not a bug, for idempotent notifications),
whereas shm_mq is lossless and ordered for payloads. The general DBMS
lesson — separate the doorbell from the mailbox — recurs in many systems.
Cooperative cancellation vs. preemption
CHECK_FOR_INTERRUPTS() makes PostgreSQL’s cancellation cooperative:
a backend can only be cancelled at the safe points where it polls.
A long-running C function with no interrupt check is effectively
uncancellable — a real and occasionally painful limitation (tight loops
in extensions, certain regex operations historically). The alternative,
preemptive cancellation (forcibly unwinding a thread), is what
thread-per-connection engines with managed runtimes (e.g. JVM-based
systems) can attempt, but it is notoriously unsafe around locks and
allocator state — the same hazard that forbids real work in signal
handlers. PostgreSQL’s choice trades latency-to-cancel for the guarantee
that cancellation never corrupts shared state. The InterruptHoldoffCount
/ CritSectionCount / QueryCancelHoldoffCount counters are the explicit
knobs that widen or narrow the safe-point windows. A standing
research/engineering frontier is reducing worst-case cancel latency by
sprinkling more checks into hot paths without paying their branch cost in
the common case — the unlikely() hint in INTERRUPTS_PENDING_CONDITION
is the current mitigation.
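The safe-point pattern and the effect of a holdoff counter can be sketched as follows. This is a simplified illustration, not the miscadmin.h macros: in PostgreSQL a processed cancel raises an error, whereas here it only bumps a counter. All coop_* names and COOP_CHECK_FOR_INTERRUPTS are hypothetical.

```c
#include <signal.h>
#include <stdbool.h>

#if defined(__GNUC__)
#define coop_unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define coop_unlikely(x) ((x) != 0)
#endif

static volatile sig_atomic_t coop_interrupt_pending = 0;
static volatile sig_atomic_t coop_cancel_pending = 0;
static int coop_holdoff_count = 0;      /* > 0 means "not a safe point" */
static int coop_cancels_processed = 0;

/* What a signal handler would do: set flags and nothing else. */
static void
coop_request_cancel(void)
{
    coop_cancel_pending = 1;
    coop_interrupt_pending = 1;
}

/* Slow path, reached only when the pending flag is set. */
static void
coop_process_interrupts(void)
{
    if (coop_holdoff_count != 0)
        return;                 /* leave the flags set for a later safe point */
    coop_interrupt_pending = 0;
    if (coop_cancel_pending)
    {
        coop_cancel_pending = 0;
        coop_cancels_processed++;   /* assumption: real code raises an error */
    }
}

/* Fast path: one predicted-not-taken branch in the common case. */
#define COOP_CHECK_FOR_INTERRUPTS() \
    do { \
        if (coop_unlikely(coop_interrupt_pending)) \
            coop_process_interrupts(); \
    } while (0)

/* Runs up to `iterations` steps; only the checking variant can stop early. */
static long
coop_run_loop(long iterations, bool with_checks)
{
    long done = 0;
    int  seen = coop_cancels_processed;

    for (long i = 0; i < iterations; i++)
    {
        if (with_checks)
        {
            COOP_CHECK_FOR_INTERRUPTS();
            if (coop_cancels_processed != seen)
                break;          /* cancel observed at a safe point */
        }
        done++;
    }
    return done;
}
```

With a cancel already pending, the loop without checks runs every iteration, while the loop with checks stops before doing any work. Raising coop_holdoff_count defers the cancel without losing it.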
Wait-event instrumentation
A side benefit of funneling every wait through WaitEventSetWait is
observability: each call passes a wait_event_info that
pgstat_report_wait_start/_end records, which is what populates
pg_stat_activity.wait_event and wait_event_type. This is one of the
most-used production diagnostics in PostgreSQL, and it exists essentially
for free because all sleeping is centralized in one module. Engines that
scatter their blocking calls across the codebase struggle to retrofit such
uniform wait accounting. See postgres-cumulative-stats.md for the
collection side.
Sources
PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)
- src/backend/storage/ipc/latch.c: InitLatch, InitSharedLatch, OwnLatch, DisownLatch, WaitLatch, WaitLatchOrSocket, SetLatch, ResetLatch, InitializeLatchWaitSet.
- src/backend/storage/ipc/waiteventset.c: InitializeWaitEventSupport, CreateWaitEventSet, AddWaitEventToSet, ModifyWaitEvent, WaitEventAdjustEpoll, WaitEventSetWait, WaitEventSetWaitBlock, latch_sigurg_handler, sendSelfPipeByte, drain, WakeupMyProc, WakeupOtherProc, and the WAIT_USE_* selection ladder.
- src/backend/storage/ipc/procsignal.c: ProcSignalSlot, ProcSignalShmemInit, ProcSignalInit, SendProcSignal, EmitProcSignalBarrier, WaitForProcSignalBarrier, ProcessProcSignalBarrier, CheckProcSignal, procsignal_sigusr1_handler, SendCancelRequest.
- src/backend/tcop/postgres.c: die, StatementCancelHandler, HandleRecoveryConflictInterrupt, ProcessInterrupts, ProcessClientReadInterrupt, ProcessClientWriteInterrupt.
- src/include/miscadmin.h: CHECK_FOR_INTERRUPTS, INTERRUPTS_PENDING_CONDITION, INTERRUPTS_CAN_BE_PROCESSED, HOLD_INTERRUPTS / RESUME_INTERRUPTS, InterruptPending.
- src/include/storage/procsignal.h: ProcSignalReason enum, NUM_PROCSIGNALS, ProcSignalBarrierType.
- src/include/storage/latch.h, src/include/storage/waiteventset.h: the WL_* event-mask constants and the public API prototypes.
Textbook chapters (under knowledge/research/dbms-general/)
- Architecture of a Database System (Hellerstein et al.), §“Process Models”: the process-per-connection model and its reliance on OS IPC/signalling rather than a user-space scheduler.
- Database Internals (Petrov): node-local concurrency and process models; the cost of wakeups and context switches for OLTP.
- Operating Systems: Three Easy Pieces (Arpaci-Dusseau): condition variables and the lost-wakeup hazard; the small set of operations a signal handler may safely perform.
Cross-references (sibling module docs)
- postgres-shared-memory-ipc.md: the shared-memory segment that holds ProcSignal and shm_mq; how slots are allocated; the payload-carrying message queue that pairs with ProcSignal’s notification-only role.
- postgres-backend-lifecycle.md: where a backend calls InitializeWaitEventSupport and ProcSignalInit, and reaches the PostgresMain loop that sprinkles CHECK_FOR_INTERRUPTS().
- postgres-aux-processes.md: auxiliary processes that own slots near the end of the ProcSignal array and use latches as their main wait.
- postgres-lwlock-spinlock.md: the semaphore-based sleeping locks (the non-multiplexable cousins of the latch) and ConditionVariable.
- postgres-wire-protocol.md: the FE/BE read/write points wrapped by ProcessClientReadInterrupt / ProcessClientWriteInterrupt, and the cancel request that triggers SendCancelRequest.
- postgres-cumulative-stats.md: the wait-event accounting populated by pgstat_report_wait_start / _end inside WaitEventSetWait.
- postgres-aio.md: PG18 io_uring async I/O, the first io_uring user in the tree and a possible future direction for the wait machinery.