PostgreSQL Latches, Wait Event Sets, and Inter-Process Signals

A multi-process database server is, at its core, a set of cooperating processes that spend most of their wall-clock time asleep — blocked waiting for the next unit of work. The design question that pervades the server’s plumbing is deceptively simple: how does a sleeping process learn that it should wake up, and how does another process (or a signal handler in the same process) tell it to? Three sub-problems define the space:

  1. The lost-wakeup race. The naive pattern is a shared flag plus a sleep: a worker loops while (!flag) sleep(), and a producer sets flag = true. This is broken on any preemptive system. If the producer sets the flag and signals between the worker’s test of the flag and its call to sleep(), the wakeup is lost and the worker sleeps forever. A correct primitive must make “check the condition” and “begin sleeping” atomic with respect to the wakeup — exactly the guarantee that pselect/ppoll and condition variables provide and that plain poll + a signal handler do not.

  2. Waiting on heterogeneous events at once. A backend rarely waits for just one thing. It waits for client socket readability (a new query), and for its latch (another backend has notified it), and for postmaster death (the parent crashed and the child should exit), and possibly for a timeout (statement_timeout). The OS readiness primitives — epoll, kqueue, poll — natively multiplex file descriptors, but a latch is not a file descriptor and postmaster death is not obviously one either. The abstraction must unify them.

  3. Turning asynchronous signals into synchronous, safe actions. A Unix signal can arrive at any instruction boundary, including in the middle of malloc, while holding a spinlock, or mid-way through building a protocol message. Almost nothing is safe to do in a signal handler (the async-signal-safe function list is tiny). So the universal pattern is: the handler does the absolute minimum — set a volatile sig_atomic_t flag and poke a wakeup — and the real work is deferred to a well-defined safe point in the main code path where the process voluntarily checks the flag.

Operating Systems: Three Easy Pieces (Arpaci-Dusseau) frames items (1) and (3) under “concurrency” and “the limited directives a signal handler may issue”; the lost-wakeup problem is the same hazard that motivates condition variables having an associated mutex. Database Internals (Petrov), in its treatment of node-local concurrency and process models, notes that a database’s process/thread scheduler is built on exactly these OS primitives and that the cost of a wakeup (a syscall, a context switch, a cache-line bounce) is a first-order performance concern for OLTP. Architecture of a Database System (Hellerstein et al., §“Process Models”) catalogs the process-per-connection model PostgreSQL uses and observes that such a model leans heavily on the OS for IPC and signalling rather than building a user-space scheduler.

PostgreSQL’s answer composes three layers, each solving one sub-problem: the Latch (a race-free boolean-with-wakeup, solving #1 for the one-bit case), the WaitEventSet (a portable multiplexer over the OS readiness primitives, solving #2 and embedding the latch into the same wait), and the interrupt machinery — ProcSignal for inter-process messaging plus InterruptPending/CHECK_FOR_INTERRUPTS() for deferral — solving #3.

This section names the engineering patterns that recur across multi-process servers (database or otherwise) so PostgreSQL’s specific choices read as selections within a shared design space.

The event/latch object: a boolean plus a wakeup

Almost every server that uses a process-per-connection or process-per-worker model needs a “kick this sleeping process” primitive. The shape is always the same: a small shared object holding (a) a boolean “is there work?” flag and (b) enough identity (a pid, an event handle, a pipe fd) to deliver an OS-level wakeup to whoever owns it. The two halves must be ordered by memory barriers: the setter publishes the flag before checking whether anyone is sleeping, and the waiter clears the flag before re-examining the condition, so neither side can conclude “nothing to do / nobody to wake” on the basis of stale memory. This is the textbook double-checked pattern around a condition variable, adapted to cross-process shared memory.
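The ordering rule above can be sketched with C11 atomics. This is a minimal single-file illustration under stated assumptions: `MiniLatch` and the `mini_latch_*` functions are hypothetical names, sequentially consistent atomics stand in for explicit memory barriers, and the actual OS wakeup is left to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature latch; the field names echo the text. */
typedef struct MiniLatch
{
    atomic_bool is_set;
    atomic_bool maybe_sleeping;
} MiniLatch;

/*
 * Setter: publish the flag first, then look for a sleeper.
 * Returns true when the caller must deliver an OS-level wakeup.
 */
bool mini_latch_set(MiniLatch *l)
{
    assert(l != NULL);
    if (atomic_load(&l->is_set))
        return false;                        /* already set, nothing to do */
    atomic_store(&l->is_set, true);          /* seq_cst store orders the next load */
    return atomic_load(&l->maybe_sleeping);  /* wake only a possible sleeper */
}

/*
 * Waiter: announce the intent to sleep, then re-check the flag.
 * Returns true when blocking is still necessary.
 */
bool mini_latch_prepare_sleep(MiniLatch *l)
{
    assert(l != NULL);
    atomic_store(&l->maybe_sleeping, true);
    if (atomic_load(&l->is_set))
    {
        atomic_store(&l->maybe_sleeping, false);
        return false;                        /* the set won the race: skip the sleep */
    }
    return true;
}

/* Owner only: clear the flag before re-examining the work condition. */
void mini_latch_reset(MiniLatch *l)
{
    assert(l != NULL);
    atomic_store(&l->is_set, false);
}
```

Whichever side runs second observes the other side's store, so at least one of "skip the sleep" or "deliver the wakeup" always happens.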

To wait on many fds plus internal events at once, servers build a thin portability wrapper over whatever “reactor” the OS provides: epoll on Linux, kqueue on the BSDs and macOS, poll/select as the lowest common denominator, and I/O completion ports or event objects on Windows. The wrapper registers a set of interest specifications once and then calls the blocking primitive repeatedly, translating the kernel’s “these became ready” answer back into the server’s own event vocabulary. nginx, redis, libuv, and PostgreSQL all have a module of exactly this shape.

Race-free blocking: ppoll/pselect or the self-pipe trick

Because plain poll() cannot atomically “unblock signals and sleep,” a signal that arrives just before poll() is entered will not interrupt it. The two canonical fixes are (a) use the p-variant syscalls (ppoll/pselect) that take a signal mask, or (b) the self-pipe trick (Bernstein): the signal handler writes one byte to a pipe whose read end is in the fd set, so a pending signal manifests as a readable fd — something poll() cannot miss. Linux offers a third option, signalfd, which turns signal delivery itself into a readable fd, letting the process keep the signal blocked and consume it synchronously.

Multiplexing one signal number into many logical messages

Unix gives a process only a handful of user-definable signals (SIGUSR1, SIGUSR2). A server that needs dozens of distinct inter-process notifications cannot afford one signal per message type. The standard trick is to pick a single signal as a generic “you have mail” doorbell, and carry the reason out-of-band in shared memory: the sender sets a per-recipient flag identifying the message, then sends the one signal; the recipient’s handler scans its flags to learn what actually happened. This decouples the (scarce) signal namespace from the (unbounded) message namespace.

Deferred interrupt handling at safe points

The final shared pattern: signal handlers never do real work. They set InterruptPending-style flags and return. Real work happens when the main loop reaches a safe point and calls a check macro. This converts an asynchronous, dangerous context into a synchronous, controlled one, and lets the server protect critical sections simply by incrementing a “hold-off” counter that the check macro respects.
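A minimal sketch of the hold-off counter, assuming hypothetical function names that loosely mirror the macros discussed later. The pending flag is set directly by a helper here, where a real server would set it from a signal handler.

```c
#include <assert.h>
#include <signal.h>

static volatile sig_atomic_t interrupt_pending; /* set by a handler in real code */
static int holdoff_count;                       /* nonzero means "not now" */
static int serviced;                            /* times real work actually ran */

void request_interrupt(void) { interrupt_pending = 1; }
void hold_interrupts(void)   { holdoff_count++; }

void resume_interrupts(void)
{
    assert(holdoff_count > 0);
    holdoff_count--;
}

/* Out-of-line body: runs only when the check below sees a pending flag. */
static void process_interrupts(void)
{
    if (holdoff_count != 0)
        return;                 /* leave the flag set and retry at a later check */
    interrupt_pending = 0;
    serviced++;                 /* the deferred real work would go here */
}

/* Cheap test that the main loop calls at safe points. */
void check_for_interrupts(void)
{
    if (interrupt_pending)
        process_interrupts();
}

int serviced_count(void) { return serviced; }
```

The check costs one flag test on the common path, which is what makes it practical to place it liberally throughout long-running loops.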

flowchart TD
  A["Producer process<br/>or signal handler"] -->|"SetLatch / kill(SIGUSR1)"| B["Shared state:<br/>is_set flag,<br/>ProcSignal reason bit"]
  B --> C["OS wakeup:<br/>self-pipe byte /<br/>signalfd / SIGURG"]
  C --> D["Sleeping process in<br/>WaitEventSetWait()"]
  D -->|"returns readiness"| E["Main loop reaches<br/>safe point"]
  E -->|"CHECK_FOR_INTERRUPTS()"| F["ProcessInterrupts():<br/>cancel / die / barrier"]

PostgreSQL implements the three layers as three source modules plus the interrupt glue in postgres.c. The layering is strict: latch.c is a thin facade over waiteventset.c, which is the only module that touches the OS reactor; procsignal.c builds the SIGUSR1 multiplexer on top of latches; and postgres.c defines what the deferred handlers actually do.

A Latch is three fields that matter — is_set, maybe_sleeping, and owner_pid — plus an is_shared flag. A process-local latch (InitLatch) can only be set from within the same process (typically from a signal handler); a shared latch (InitSharedLatch + OwnLatch) lives in shared memory and can be set by any backend. MyLatch is every process’s standard wakeup object, used pervasively as the “something happened, go re-check your state” signal.

SetLatch is the heart of the race-freedom. It uses two memory barriers to sandwich the publish-then-check sequence:

// SetLatch — src/backend/storage/ipc/latch.c
pg_memory_barrier();
/* Quick exit if already set */
if (latch->is_set)
    return;
latch->is_set = true;
pg_memory_barrier();
if (!latch->maybe_sleeping)
    return;
/* ... figure out owner_pid and deliver a wakeup ... */
owner_pid = latch->owner_pid;
if (owner_pid == 0)
    return;
else if (owner_pid == MyProcPid)
    WakeupMyProc();             /* in-process: self-pipe or SIGURG to self */
else
    WakeupOtherProc(owner_pid); /* cross-process: kill(pid, SIGURG) */

The waiter side, inside WaitEventSetWait, performs the mirror-image dance: it sets maybe_sleeping = true, issues a barrier, then re-checks is_set before actually sleeping. If SetLatch ran in the window, either the waiter sees is_set and skips the sleep, or the setter sees maybe_sleeping and delivers the wakeup — never both-miss. ResetLatch clears is_set with a trailing barrier so a subsequent flag-read cannot be reordered before the clear.

flowchart TD
  subgraph Setter["SetLatch (any process / handler)"]
    S1["barrier; if is_set return"] --> S2["is_set = true"]
    S2 --> S3["barrier; if !maybe_sleeping return"]
    S3 --> S4["owner==me? WakeupMyProc<br/>else WakeupOtherProc(pid)"]
  end
  subgraph Waiter["WaitEventSetWait (owner)"]
    W1["maybe_sleeping = true"] --> W2["barrier"]
    W2 --> W3["if is_set: report WL_LATCH_SET,<br/>skip sleep"]
    W3 --> W4["else block in epoll/poll<br/>until self-pipe/signalfd readable"]
    W4 --> W5["drain(); recheck is_set"]
  end
  S4 -.->|"self-pipe byte / SIGURG"| W4

The WaitEventSet: one wait over latch + sockets + PM death + timeout

WaitLatch is now just a convenience wrapper. The real machinery is the WaitEventSet, which holds a registered array of WaitEvent interest records and a per-platform kernel object (epoll_fd, kqueue_fd, or a pollfd array). You build one with CreateWaitEventSet, register events with AddWaitEventToSet (returning a position you can later ModifyWaitEvent), and block in WaitEventSetWait. The event kinds are a bitmask: WL_LATCH_SET, WL_SOCKET_READABLE/WRITEABLE/CLOSED/..., WL_POSTMASTER_DEATH (or WL_EXIT_ON_PM_DEATH), and WL_TIMEOUT.

The crucial design move is that the three non-socket events are folded into inputs the OS reactor already accepts, two as file descriptors and one as the call's timeout, so a single blocking call covers them all:

  • the latch becomes the read end of the self-pipe (poll build) or the signalfd (epoll build);
  • postmaster death becomes postmaster_alive_fds[POSTMASTER_FD_WATCH], a pipe the postmaster holds open and the kernel closes (EPOLLHUP) when the postmaster dies;
  • the timeout is the reactor call’s own timeout argument.
// AddWaitEventToSet — src/backend/storage/ipc/waiteventset.c
if (events == WL_LATCH_SET)
{
    set->latch = latch;
    set->latch_pos = event->pos;
#if defined(WAIT_USE_SELF_PIPE)
    event->fd = selfpipe_readfd;
#elif defined(WAIT_USE_SIGNALFD)
    event->fd = signal_fd;
#else
    event->fd = PGINVALID_SOCKET;
#ifdef WAIT_USE_EPOLL
    return event->pos;
#endif
#endif
}
else if (events == WL_POSTMASTER_DEATH)
{
#ifndef WIN32
    event->fd = postmaster_alive_fds[POSTMASTER_FD_WATCH];
#endif
}
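The "parent death as a file descriptor" idea can be sketched with an ordinary pipe. The function name `parent_is_alive` is hypothetical and this is not how PostgreSQL's own liveness check is written; the sketch only shows that once every copy of the write end is closed (as the kernel does when the holding process exits), the read end reports a hang-up to `poll()`.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * watch_fd is the read end of a pipe whose write end is held open only by
 * the parent, and nothing is ever written to it.
 * Returns 1 while a writer still exists, 0 once all writers are gone,
 * and -1 on error.
 */
int parent_is_alive(int watch_fd)
{
    struct pollfd pfd = {.fd = watch_fd, .events = POLLIN, .revents = 0};
    int rc;

    assert(watch_fd >= 0);
    do
        rc = poll(&pfd, 1, 0);          /* zero timeout: just sample the state */
    while (rc < 0 && errno == EINTR);
    if (rc < 0)
        return -1;
    if (rc == 0)
        return 1;                       /* nothing to report: a writer remains */
    /* Readable, hung up, or errored: with no data ever sent, the writer is gone. */
    return (pfd.revents & (POLLIN | POLLHUP | POLLERR)) ? 0 : 1;
}
```

In a single process the effect can be observed by creating a pipe, checking the read end, closing the write end, and checking again. In a real server the same read end would simply be registered in the reactor alongside the sockets.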

WaitLatch exploits a single long-lived LatchWaitSet (built once by InitializeLatchWaitSet) with exactly two slots — the latch at position 0 and PM-death at position 1 — and merely ModifyWaitEvents them on each call. That avoids the cost of creating and tearing down an epoll fd for the extremely common single-latch wait. WaitLatchOrSocket, by contrast, builds a fresh three-slot set per call because the socket varies.

PostgreSQL chooses its wakeup mechanism at compile time based on the reactor. The header comment in waiteventset.c lays out the matrix precisely: poll() uses the self-pipe; epoll() uses signalfd and keeps SIGURG blocked; kqueue() registers EVFILT_SIGNAL for SIGURG; Windows uses inherited event objects. The signal used to carry a latch wakeup is SIGURG, deliberately distinct from SIGUSR1 (which carries ProcSignal messages) so the two concerns don’t interfere.

// latch_sigurg_handler / sendSelfPipeByte — src/backend/storage/ipc/waiteventset.c
static void
latch_sigurg_handler(SIGNAL_ARGS)
{
    if (waiting)
        sendSelfPipeByte(); /* turn the signal into a readable fd */
}

On the epoll path there is no handler at all: InitializeWaitEventSupport blocks SIGURG and opens a signalfd so that delivery of SIGURG becomes a readable descriptor that epoll watches directly. Either way, when WaitEventSetWaitBlock reports the latch fd readable it calls drain() to empty the pipe/signalfd (non-blocking read until EAGAIN) before re-checking is_set, so accumulated bytes can’t cause a busy-spin.

ProcSignal: multiplexing SIGUSR1 into many reasons

Cross-backend messages that aren’t just “wake up” ride on SIGUSR1. procsignal.c keeps a shared-memory array of ProcSignalSlot, one per ProcNumber (plus one per auxiliary process). Each slot has a pss_signalFlags[NUM_PROCSIGNALS] array of booleans. To signal backend N with reason R, SendProcSignal sets slot[N].pss_signalFlags[R] = true under the slot’s spinlock and then kill(pid, SIGUSR1). The recipient’s procsignal_sigusr1_handler scans every reason with CheckProcSignal, dispatches a Handle...Interrupt for each set flag, and finally SetLatch(MyLatch).

// procsignal_sigusr1_handler (excerpt) — src/backend/storage/ipc/procsignal.c
if (CheckProcSignal(PROCSIG_NOTIFY_INTERRUPT))
    HandleNotifyInterrupt();
if (CheckProcSignal(PROCSIG_PARALLEL_MESSAGE))
    HandleParallelMessageInterrupt();
if (CheckProcSignal(PROCSIG_BARRIER))
    HandleProcSignalBarrierInterrupt();
/* ... recovery-conflict reasons ... */
SetLatch(MyLatch);

Note the layering: every Handle...Interrupt does only flag-setting (InterruptPending = true; SomethingPending = true;), and the final SetLatch ensures the backend wakes from any wait so its next CHECK_FOR_INTERRUPTS() runs the real handler. ProcSignal also carries global barriers (EmitProcSignalBarrier / WaitForProcSignalBarrier), a generation-counter protocol used when a state change must be acknowledged by every backend before proceeding (e.g. SMGR release).

Interrupt deferral: CHECK_FOR_INTERRUPTS and ProcessInterrupts

The deferral contract lives in miscadmin.h and postgres.c. Handlers set InterruptPending; the main loop sprinkles CHECK_FOR_INTERRUPTS() at safe points; that macro calls ProcessInterrupts() only when an interrupt is pending, which then services ProcDiePending, QueryCancelPending, timeouts, recovery conflicts, ProcSignal barriers, and parallel messages — throwing ERROR/FATAL as appropriate.

// CHECK_FOR_INTERRUPTS / INTERRUPTS_CAN_BE_PROCESSED — src/include/miscadmin.h
#define CHECK_FOR_INTERRUPTS() \
do { \
    if (INTERRUPTS_PENDING_CONDITION()) \
        ProcessInterrupts(); \
} while(0)
#define INTERRUPTS_CAN_BE_PROCESSED() \
    (InterruptHoldoffCount == 0 && CritSectionCount == 0 && \
     QueryCancelHoldoffCount == 0)

HOLD_INTERRUPTS()/RESUME_INTERRUPTS() bump InterruptHoldoffCount, and ProcessInterrupts bails out immediately if that counter or CritSectionCount is nonzero — this is how a critical section defers a cancel until it is safe. The query-cancel path additionally respects QueryCancelHoldoffCount so a cancel cannot fire while the backend is reading a protocol message (which would desync the FE/BE stream).

flowchart TD
  K1["kill(pid, SIGINT)<br/>StatementCancelHandler"] --> K2["QueryCancelPending = true<br/>InterruptPending = true<br/>SetLatch(MyLatch)"]
  K2 --> K3["backend wakes from<br/>WaitEventSetWait"]
  K3 --> K4["next CHECK_FOR_INTERRUPTS()"]
  K4 -->|"holdoff/crit-section?"| K5["re-arm InterruptPending,<br/>defer"]
  K4 -->|"safe"| K6["ProcessInterrupts():<br/>ereport(ERROR,<br/>'canceling statement<br/>due to user request')"]

This section walks the actual code in call-flow order, grouped by subsystem. Symbols are the stable anchors; line numbers live only in the position-hint table at the end.

A latch is initialized by InitLatch (process-local) or InitSharedLatch + OwnLatch (shared). OwnLatch is the moment a shared latch becomes associated with the current process, recording owner_pid = MyProcPid; DisownLatch reverses it. The sanity check in OwnLatch PANICs if the latch already has an owner. There is intentionally no lock, so callers must interlock externally if two processes might race to own one latch.
// OwnLatch — src/backend/storage/ipc/latch.c
owner_pid = latch->owner_pid;
if (owner_pid != 0)
    elog(PANIC, "latch already owned by PID %d", owner_pid);
latch->owner_pid = MyProcPid;

SetLatch (quoted in the previous section) is the only setter. Its key properties: it is safe to call from a signal handler and from critical sections (it never throws), it is cheap when the latch is already set (the first barrier+test short-circuits), and it picks the wakeup delivery by comparing owner_pid to MyProcPid — an in-process set (the common case for a handler that just flagged an interrupt) routes to WakeupMyProc, a cross-process set to WakeupOtherProc. ResetLatch clears is_set and must only be called by the owner; the standard idiom is wait, then reset, then process, looping at the bottom so a set that arrives during processing is not lost.

Building and modifying a WaitEventSet (waiteventset.c)

InitializeWaitEventSupport runs once per process at startup. On the poll build it creates the self-pipe (both ends non-blocking and close-on-exec), installs latch_sigurg_handler for SIGURG, and reserves two external FDs with fd.c. On the epoll build it instead blocks SIGURG and opens a signalfd:

// InitializeWaitEventSupport (signalfd branch) — src/backend/storage/ipc/waiteventset.c
sigaddset(&UnBlockSig, SIGURG); /* keep SIGURG blocked */
sigemptyset(&signalfd_mask);
sigaddset(&signalfd_mask, SIGURG);
signal_fd = signalfd(-1, &signalfd_mask, SFD_NONBLOCK | SFD_CLOEXEC);
if (signal_fd < 0)
    elog(FATAL, "signalfd() failed");

CreateWaitEventSet allocates one contiguous block (MAXALIGN-padded) that holds the WaitEventSet, the WaitEvent array, and the platform’s return buffer (epoll_ret_events / kqueue_ret_events / pollfds), then opens the kernel object (epoll_create1(EPOLL_CLOEXEC) / kqueue()). The set is optionally tracked by a ResourceOwner so it is freed on error.

AddWaitEventToSet validates the request (a latch event needs a latch owned by this process; a socket event needs a real fd), stores a WaitEvent, maps internal events to fds (shown earlier), and calls the platform-specific adjust routine. For epoll that is WaitEventAdjustEpoll, which translates the WL_* mask into EPOLLIN/EPOLLOUT/EPOLLRDHUP and calls epoll_ctl:

// WaitEventAdjustEpoll — src/backend/storage/ipc/waiteventset.c
epoll_ev.data.ptr = event;              /* so epoll_wait hands back our WaitEvent */
epoll_ev.events = EPOLLERR | EPOLLHUP;  /* always watch for errors */
if (event->events == WL_LATCH_SET)
    epoll_ev.events |= EPOLLIN;
else if (event->events == WL_POSTMASTER_DEATH)
    epoll_ev.events |= EPOLLIN;
else
{
    if (event->events & WL_SOCKET_READABLE)  epoll_ev.events |= EPOLLIN;
    if (event->events & WL_SOCKET_WRITEABLE) epoll_ev.events |= EPOLLOUT;
    if (event->events & WL_SOCKET_CLOSED)    epoll_ev.events |= EPOLLRDHUP;
}
rc = epoll_ctl(set->epoll_fd, action, event->fd, &epoll_ev);
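The `data.ptr` round trip is the part of this excerpt that is easiest to overlook, so here is a standalone Linux sketch of it. `Interest`, `register_readable`, and `wait_one` are hypothetical names standing in for the WaitEvent record and its registration; only the epoll calls themselves are real API.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical stand-in for a registered interest record. */
typedef struct Interest
{
    int fd;
    int pos;                    /* position in the caller's own array */
} Interest;

/* Register an fd for readability and attach our record to the registration. */
int register_readable(int epfd, Interest *it)
{
    struct epoll_event ev = {0};

    assert(it != NULL);
    ev.events = EPOLLIN | EPOLLERR | EPOLLHUP;
    ev.data.ptr = it;           /* the kernel hands this pointer back verbatim */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, it->fd, &ev);
}

/* Returns the pos of one ready record, or -1 on timeout or error. */
int wait_one(int epfd, int timeout_ms)
{
    struct epoll_event ev = {0};
    int rc = epoll_wait(epfd, &ev, 1, timeout_ms);

    if (rc <= 0)
        return -1;
    return ((Interest *) ev.data.ptr)->pos;
}
```

Because the kernel returns the pointer rather than the fd, the caller can map a readiness report back to its own record without a lookup table.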

ModifyWaitEvent is the fast-path used by WaitLatch: if neither mask nor latch changed it returns immediately, and on Unix a latch modification needs no kernel call at all (the underlying pipe/signalfd is shared across all latches), so switching MyLatch in and out is nearly free.

The wait loop: WaitEventSetWait and WaitEventSetWaitBlock

WaitEventSetWait is the portable outer loop. It sets the waiting-flag (so the SIGURG handler knows to write to the self-pipe), then loops until at least one event is returned. The latch fast-path is the embodiment of the race-free protocol: set maybe_sleeping, barrier, recheck is_set, and only block if still unset.

// WaitEventSetWait (latch fast path) — src/backend/storage/ipc/waiteventset.c
if (set->latch && !set->latch->is_set)
{
    set->latch->maybe_sleeping = true; /* tell SetLatch we're about to sleep */
    pg_memory_barrier();
    /* and recheck */
}
if (set->latch && set->latch->is_set)
{
    occurred_events->events = WL_LATCH_SET;
    /* ... fill event, return without blocking ... */
    set->latch->maybe_sleeping = false;
}

WaitEventSetWaitBlock is the platform-specific inner call. The epoll version calls epoll_wait, treats EINTR as “retry” and rc == 0 as timeout, then walks the returned epoll_event array. When the latch fd fires it drain()s the signalfd and re-checks set->latch->is_set && maybe_sleeping before reporting WL_LATCH_SET — guarding against a spurious wakeup. When the PM-death fd fires it re-confirms with PostmasterIsAliveInternal() (a spurious death report would be catastrophic) and, if exit_on_postmaster_death, calls proc_exit(1) directly.

// WaitEventSetWaitBlock (epoll, latch + PM-death) — src/backend/storage/ipc/waiteventset.c
if (cur_event->events == WL_LATCH_SET &&
    cur_epoll_event->events & (EPOLLIN | EPOLLERR | EPOLLHUP))
{
    drain(); /* empty signalfd / self-pipe */
    if (set->latch && set->latch->maybe_sleeping && set->latch->is_set)
    {
        occurred_events->events = WL_LATCH_SET;
        returned_events++;
    }
}
else if (cur_event->events == WL_POSTMASTER_DEATH && ...)
{
    if (!PostmasterIsAliveInternal())
    {
        if (set->exit_on_postmaster_death)
            proc_exit(1);
        occurred_events->events = WL_POSTMASTER_DEATH;
    }
}

Wakeup delivery and drain (waiteventset.c)

WakeupMyProc is what SetLatch calls when the owner is the current process (typical inside a signal handler): on the self-pipe build it writes a byte, on the signalfd build it kill(MyProcPid, SIGURG). WakeupOtherProc always sends kill(pid, SIGURG). sendSelfPipeByte writes a single byte non-blocking, treating EAGAIN/EWOULDBLOCK as success (a full pipe already carries the wakeup) and silently ignoring other errors because it may run in a handler. drain reads until the descriptor is empty.

// WakeupMyProc / WakeupOtherProc — src/backend/storage/ipc/waiteventset.c
void
WakeupMyProc(void)
{
#if defined(WAIT_USE_SELF_PIPE)
    if (waiting)
        sendSelfPipeByte();
#else
    if (waiting)
        kill(MyProcPid, SIGURG);
#endif
}
void
WakeupOtherProc(int pid)
{
    kill(pid, SIGURG);
}
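The send and drain halves described above can be sketched against an ordinary non-blocking pipe. The names `make_nonblocking_pipe`, `send_wakeup_byte`, and `drain_fd` are hypothetical. The error policy follows the prose (a full pipe counts as success), but unlike the description of sendSelfPipeByte this sketch reports other errors to the caller instead of ignoring them.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Create a pipe with both ends non-blocking. Returns 0 or -1. */
int make_nonblocking_pipe(int fds[2])
{
    if (pipe(fds) != 0)
        return -1;
    for (int i = 0; i < 2; i++)
    {
        int fl = fcntl(fds[i], F_GETFL);

        if (fl < 0 || fcntl(fds[i], F_SETFL, fl | O_NONBLOCK) != 0)
            return -1;
    }
    return 0;
}

/* Queue one wakeup byte. A full pipe already carries a wakeup, so it counts as success. */
int send_wakeup_byte(int wfd)
{
    char dummy = 0;

    assert(wfd >= 0);
    for (;;)
    {
        ssize_t n = write(wfd, &dummy, 1);

        if (n == 1)
            return 0;
        if (n < 0 && errno == EINTR)
            continue;                   /* interrupted before writing: retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return 0;
        return -1;
    }
}

/* Discard everything queued. Returns bytes discarded, or -1 on EOF or error. */
long drain_fd(int rfd)
{
    char buf[64];
    long total = 0;

    assert(rfd >= 0);
    for (;;)
    {
        ssize_t n = read(rfd, buf, sizeof buf);

        if (n > 0)
            total += n;
        else if (n == 0)
            return -1;                  /* every writer has closed its end */
        else if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;               /* empty: the normal exit */
        else if (errno != EINTR)
            return -1;
    }
}
```

Three sends followed by one drain discard three bytes, and a second drain finds nothing. Many wakeups collapse into a single readable event, and the waiter pays one drain per wake regardless of how many setters fired.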

ProcSignal: slots, send, and the SIGUSR1 handler (procsignal.c)

ProcSignalShmemInit lays out the ProcSignalHeader (a global barrier generation + a flexible array of NumProcSignalSlots slots). ProcSignalInit claims MyProcNumber’s slot, publishing pss_pid with a write-membarrier so EmitProcSignalBarrier cannot skip a half-initialized slot, and registers CleanupProcSignalState via on_shmem_exit.

SendProcSignal is the sender. With a known ProcNumber it goes straight to the slot; otherwise it scans back-to-front (auxiliary processes live near the end). It sets the reason flag under the spinlock and sends one SIGUSR1:

// SendProcSignal (fast path) — src/backend/storage/ipc/procsignal.c
SpinLockAcquire(&slot->pss_mutex);
if (pg_atomic_read_u32(&slot->pss_pid) == pid)
{
    slot->pss_signalFlags[reason] = true; /* which reason */
    SpinLockRelease(&slot->pss_mutex);
    return kill(pid, SIGUSR1);            /* the doorbell */
}
SpinLockRelease(&slot->pss_mutex);

CheckProcSignal reads-and-clears one reason flag (the flag is volatile sig_atomic_t, readable in the handler without the lock). procsignal_sigusr1_handler (quoted earlier) walks all reasons and ends with SetLatch(MyLatch). Each Handle...Interrupt it dispatches lives elsewhere (async.c, parallel.c, postgres.c) and only sets pending flags.

EmitProcSignalBarrier(type) ORs a type bit into every slot’s pss_barrierCheckMask, atomically increments the global psh_barrierGeneration, and SIGUSR1s every live process with the PROCSIG_BARRIER reason. ProcessProcSignalBarrier (run from CHECK_FOR_INTERRUPTS) compares the process’s local generation to the shared one, exchanges out its check-mask, and dispatches each requested barrier (e.g. ProcessBarrierSmgrRelease) inside a PG_TRY so an ERROR re-arms the bits. On success it advances its pss_barrierGeneration and broadcasts the slot’s condition variable. WaitForProcSignalBarrier(gen) sleeps on each slot’s CV until every slot’s generation reaches gen.

// EmitProcSignalBarrier — src/backend/storage/ipc/procsignal.c
for (int i = 0; i < NumProcSignalSlots; i++)
    pg_atomic_fetch_or_u32(&ProcSignal->psh_slot[i].pss_barrierCheckMask, flagbit);
generation = pg_atomic_add_fetch_u64(&ProcSignal->psh_barrierGeneration, 1);
/* ... then SIGUSR1 every slot with pid != 0, reason PROCSIG_BARRIER ... */
return generation;
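The generation-counter handshake can be modelled without signals or shared memory. This is a simplified sketch using C11 atomics: the slot count, struct, and function names are all hypothetical, the condition-variable sleep is replaced by a polling predicate, and the failure path that re-arms the mask bits is only noted in a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NSLOTS 3                        /* hypothetical fixed number of processes */

typedef struct Slot
{
    _Atomic uint32_t check_mask;        /* barrier types this slot still owes */
    _Atomic uint64_t generation;        /* last generation this slot absorbed */
} Slot;

static Slot slots[NSLOTS];
static _Atomic uint64_t global_generation;

/* Emitter: mark every slot, then bump the shared generation and return it. */
uint64_t emit_barrier(uint32_t flagbit)
{
    for (int i = 0; i < NSLOTS; i++)
        atomic_fetch_or(&slots[i].check_mask, flagbit);
    return atomic_fetch_add(&global_generation, 1) + 1;
}

/* Participant: returns the mask of barrier types it handled (0 if up to date). */
uint32_t absorb_barrier(int i)
{
    uint64_t shared;
    uint32_t mask;

    assert(i >= 0 && i < NSLOTS);
    shared = atomic_load(&global_generation);
    if (atomic_load(&slots[i].generation) == shared)
        return 0;
    mask = atomic_exchange(&slots[i].check_mask, 0);
    /* Per-type work would run here; on failure the bits would be OR-ed back. */
    atomic_store(&slots[i].generation, shared);
    return mask;
}

/* Emitter's wait condition: has every slot reached generation gen? */
bool barrier_complete(uint64_t gen)
{
    for (int i = 0; i < NSLOTS; i++)
        if (atomic_load(&slots[i].generation) < gen)
            return false;
    return true;
}
```

Setting the mask bits before bumping the generation matters: a participant that observes the new generation is then guaranteed to find its bit already set.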

Interrupt handlers and ProcessInterrupts (postgres.c)

The terminal handlers are tiny. die (SIGTERM) sets ProcDiePending/InterruptPending and SetLatch(MyLatch); StatementCancelHandler (SIGINT) sets QueryCancelPending likewise. Note that the cancel signal is SIGINT, sent by SendCancelRequest after a cancel-key match; ProcSignal messages are SIGUSR1; latch wakeups are SIGURG — three distinct signals, three distinct purposes.

// StatementCancelHandler — src/backend/tcop/postgres.c
if (!proc_exit_inprogress)
{
    InterruptPending = true;
    QueryCancelPending = true;
}
SetLatch(MyLatch); /* waken anything waiting */

ProcessInterrupts is the out-of-line body of CHECK_FOR_INTERRUPTS. It returns immediately if InterruptHoldoffCount or CritSectionCount is nonzero, then clears InterruptPending and services each condition in priority order — ProcDiePending first (FATAL, with role-specific messages for autovacuum/bgworker/walreceiver/etc.), then client-connection loss, then QueryCancelPending (deferred again if QueryCancelHoldoffCount != 0 to protect protocol reads), then the timeout family, recovery conflicts, ProcSignal barrier, and parallel messages.

// ProcessInterrupts (cancel-during-read guard) — src/backend/tcop/postgres.c
if (QueryCancelPending && QueryCancelHoldoffCount != 0)
{
    /* Re-arm so we process the cancel once we're done reading the message. */
    InterruptPending = true;
}
else if (QueryCancelPending)
{
    QueryCancelPending = false;
    /* ... distinguish lock_timeout / statement_timeout / user request,
       LockErrorCleanup(), then ereport(ERROR, "canceling statement ...") */
}
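The re-arm step is easier to see in isolation. The sketch below is a hypothetical model, not the PostgreSQL function: `IntrState` and `service_interrupts` are invented names, the state is passed explicitly instead of living in globals, and "delivering" a cancel is reduced to incrementing a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical explicit state; the real flags are process-wide globals. */
typedef struct IntrState
{
    bool interrupt_pending;     /* "something needs attention" */
    bool cancel_pending;        /* a cancel request has arrived */
    int  cancel_holdoff;        /* nonzero while reading a protocol message */
    int  cancels_delivered;     /* stand-in for raising the cancel error */
} IntrState;

/* Model of the cancel branch: clear the summary flag, then deliver or re-arm. */
void service_interrupts(IntrState *s)
{
    assert(s != NULL && s->cancel_holdoff >= 0);
    s->interrupt_pending = false;
    if (s->cancel_pending && s->cancel_holdoff != 0)
    {
        /* Not safe yet: set the summary flag again so a later check retries. */
        s->interrupt_pending = true;
    }
    else if (s->cancel_pending)
    {
        s->cancel_pending = false;
        s->cancels_delivered++;
    }
}
```

Without the re-arm, clearing the summary flag at the top would silently drop a cancel that arrived mid-read, because later checks would see nothing pending.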

ProcessClientReadInterrupt and ProcessClientWriteInterrupt wrap the low-level socket reads/writes: while idle (DoingCommandRead) a read checks for interrupts and services NOTIFY/catchup; while dying (ProcDiePending) they ensure the latch is set so a blocked I/O comes back and dies promptly rather than hanging on an unresponsive client.

Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
| --- | --- | --- |
| InitializeLatchWaitSet | src/backend/storage/ipc/latch.c | 35 |
| InitLatch | src/backend/storage/ipc/latch.c | 63 |
| InitSharedLatch | src/backend/storage/ipc/latch.c | 93 |
| OwnLatch | src/backend/storage/ipc/latch.c | 126 |
| DisownLatch | src/backend/storage/ipc/latch.c | 144 |
| WaitLatch | src/backend/storage/ipc/latch.c | 172 |
| WaitLatchOrSocket | src/backend/storage/ipc/latch.c | 223 |
| SetLatch | src/backend/storage/ipc/latch.c | 290 |
| ResetLatch | src/backend/storage/ipc/latch.c | 374 |
| InitializeWaitEventSupport | src/backend/storage/ipc/waiteventset.c | 241 |
| CreateWaitEventSet | src/backend/storage/ipc/waiteventset.c | 364 |
| FreeWaitEventSet | src/backend/storage/ipc/waiteventset.c | 481 |
| AddWaitEventToSet | src/backend/storage/ipc/waiteventset.c | 570 |
| ModifyWaitEvent | src/backend/storage/ipc/waiteventset.c | 656 |
| WaitEventAdjustEpoll | src/backend/storage/ipc/waiteventset.c | 738 |
| WaitEventSetWait | src/backend/storage/ipc/waiteventset.c | 1038 |
| WaitEventSetWaitBlock (epoll) | src/backend/storage/ipc/waiteventset.c | 1182 |
| latch_sigurg_handler | src/backend/storage/ipc/waiteventset.c | 1896 |
| sendSelfPipeByte | src/backend/storage/ipc/waiteventset.c | 1904 |
| drain | src/backend/storage/ipc/waiteventset.c | 1945 |
| WakeupMyProc | src/backend/storage/ipc/waiteventset.c | 2020 |
| WakeupOtherProc | src/backend/storage/ipc/waiteventset.c | 2033 |
| ProcSignalSlot (struct) | src/backend/storage/ipc/procsignal.c | 64 |
| ProcSignalShmemInit | src/backend/storage/ipc/procsignal.c | 132 |
| ProcSignalInit | src/backend/storage/ipc/procsignal.c | 167 |
| SendProcSignal | src/backend/storage/ipc/procsignal.c | 293 |
| EmitProcSignalBarrier | src/backend/storage/ipc/procsignal.c | 365 |
| WaitForProcSignalBarrier | src/backend/storage/ipc/procsignal.c | 433 |
| ProcessProcSignalBarrier | src/backend/storage/ipc/procsignal.c | 508 |
| CheckProcSignal | src/backend/storage/ipc/procsignal.c | 658 |
| procsignal_sigusr1_handler | src/backend/storage/ipc/procsignal.c | 683 |
| SendCancelRequest | src/backend/storage/ipc/procsignal.c | 741 |
| ProcessClientReadInterrupt | src/backend/tcop/postgres.c | 502 |
| ProcessClientWriteInterrupt | src/backend/tcop/postgres.c | 548 |
| die | src/backend/tcop/postgres.c | 3027 |
| StatementCancelHandler | src/backend/tcop/postgres.c | 3057 |
| HandleRecoveryConflictInterrupt | src/backend/tcop/postgres.c | 3090 |
| ProcessInterrupts | src/backend/tcop/postgres.c | 3299 |
| CHECK_FOR_INTERRUPTS (macro) | src/include/miscadmin.h | 123 |
| INTERRUPTS_CAN_BE_PROCESSED (macro) | src/include/miscadmin.h | 130 |

All facts below were checked against the REL_18_STABLE tree at commit 273fe94 under /data/hgryoo/references/postgres.

Verified true:

  • latch.c is a thin wrapper over waiteventset.c: WaitLatch uses a shared two-slot LatchWaitSet and WaitLatchOrSocket builds a three-slot set per call (CreateWaitEventSet(CurrentResourceOwner, 3)). Confirmed in WaitLatch/WaitLatchOrSocket.
  • SetLatch brackets its publish/check with two pg_memory_barrier() calls and dispatches via WakeupMyProc/WakeupOtherProc. The WaitEventSet side sets maybe_sleeping with a matching barrier in WaitEventSetWait. Confirmed.
  • The latch wakeup signal is SIGURG, ProcSignal uses SIGUSR1, and query-cancel uses SIGINT. Confirmed in WakeupOtherProc (kill(pid, SIGURG)), SendProcSignal (kill(pid, SIGUSR1)), and SendCancelRequest (kill(-backendPID, SIGINT) under HAVE_SETSID).
  • The compile-time matrix is: epoll⇒signalfd (SIGURG blocked, no handler), poll⇒self-pipe (latch_sigurg_handler installed), kqueue⇒EVFILT_SIGNAL for SIGURG, win32⇒event objects. Confirmed in the WAIT_USE_* #if ladder and InitializeWaitEventSupport.
  • Postmaster death is delivered as the read end of postmaster_alive_fds[POSTMASTER_FD_WATCH], re-confirmed with PostmasterIsAliveInternal() before reporting. WL_EXIT_ON_PM_DEATH calls proc_exit(1) from inside WaitEventSetWaitBlock. Confirmed.
  • ProcSignal slots are indexed by ProcNumber; NumProcSignalSlots = MaxBackends + NUM_AUXILIARY_PROCS. pss_signalFlags is volatile sig_atomic_t[NUM_PROCSIGNALS]. Confirmed.
  • The barrier protocol uses a 64-bit generation counter plus a per-slot pss_barrierCheckMask and a per-slot ConditionVariable pss_barrierCV. ProcessProcSignalBarrier clears the mask via pg_atomic_exchange_u32 before processing and re-arms on failure inside a PG_TRY. Confirmed.
  • ProcessInterrupts early-returns on InterruptHoldoffCount || CritSectionCount, and re-arms InterruptPending when QueryCancelPending && QueryCancelHoldoffCount != 0. Confirmed.
  • In REL_18, the only barrier type handled in ProcessProcSignalBarrier’s switch is PROCSIGNAL_BARRIER_SMGRRELEASE (→ ProcessBarrierSmgrRelease). Confirmed in the switch (type) block.

Scope notes / non-assertions:

  • The kqueue and Win32 WaitEventSetWaitBlock variants exist and were read but are not quoted; the doc asserts only the epoll path’s line-level details. The poll path’s self-pipe handler is quoted.
  • signalfd/epoll are Linux-only; the position-hint lines for latch_sigurg_handler/sendSelfPipeByte are inside #if defined(WAIT_USE_SELF_PIPE) and only compile on the poll build. The line numbers are still accurate as source positions.
  • contrib/ is out of scope; no contrib symbols are asserted here.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

The latch/wait-set/interrupt stack is one concrete answer to a problem every concurrent system faces. Placing it beside other designs sharpens what PostgreSQL chose and why.

Latches vs. futexes vs. condition variables

PostgreSQL’s Latch is morally a cross-process condition variable restricted to a single bit of state, but it deliberately does not use a kernel blocking primitive such as a Linux futex for the wakeup. The reasons are portability and the need to compose the wakeup with other wait sources in one syscall: a futex wait blocks on a single futex word, whereas a backend must wait on a latch, a socket, and postmaster death simultaneously. Routing the latch through a file descriptor (self-pipe or signalfd) makes it just another fd in the reactor’s interest set. The cost is extra syscalls per wakeup: the pipe write or the kill on the setting side, plus the drain read on the waiting side. PostgreSQL does have a futex-style blocking primitive elsewhere: LWLocks sleep on the proc’s semaphore, which cannot be multiplexed with sockets, and that is precisely why latches exist as a separate mechanism. ConditionVariable (used by the barrier CV above) is a different case: it keeps its wait queue in shared memory but sleeps on the process latch, so it is layered on top of the latch rather than being an alternative to it. The sibling doc postgres-lwlock-spinlock.md covers the semaphore-based sleeping locks.
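
The composition argument above can be made concrete with a small self-pipe sketch: a one-bit latch whose wakeup travels through a pipe, so a single `poll()` call covers both the latch and an unrelated descriptor. All `demo_` names are hypothetical and the code is a simplified illustration, not PostgreSQL's implementation (it omits cross-process signalling, memory barriers, and postmaster-death handling).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <unistd.h>

#define DEMO_LATCH_SET   1
#define DEMO_FD_READABLE 2

typedef struct DemoLatch
{
    volatile sig_atomic_t is_set;   /* the single bit of latch state */
    int         pipe_rd;            /* read end, polled by the waiter */
    int         pipe_wr;            /* write end, poked by the setter */
} DemoLatch;

static int
demo_latch_init(DemoLatch *l)
{
    int         fds[2];

    if (pipe(fds) != 0)
        return -1;
    /* Non-blocking on both ends so neither the drain nor the poke can stall. */
    if (fcntl(fds[0], F_SETFL, O_NONBLOCK) != 0 ||
        fcntl(fds[1], F_SETFL, O_NONBLOCK) != 0)
        return -1;
    l->is_set = 0;
    l->pipe_rd = fds[0];
    l->pipe_wr = fds[1];
    return 0;
}

static void
demo_latch_set(DemoLatch *l)
{
    char        b = 0;

    l->is_set = 1;
    /* If the pipe is full (EAGAIN), a wakeup byte is already pending. */
    while (write(l->pipe_wr, &b, 1) < 0 && errno == EINTR)
        ;
}

static void
demo_latch_reset(DemoLatch *l)
{
    l->is_set = 0;
}

/* Wait for the latch, for other_fd to become readable, or for the timeout. */
static int
demo_wait(DemoLatch *l, int other_fd, int timeout_ms)
{
    struct pollfd pfd[2];
    char        buf[64];
    int         events = 0;

    pfd[0].fd = l->pipe_rd; pfd[0].events = POLLIN; pfd[0].revents = 0;
    pfd[1].fd = other_fd;   pfd[1].events = POLLIN; pfd[1].revents = 0;

    /*
     * Check before sleeping. A set that lands after this test still leaves a
     * byte in the pipe, so poll() returns promptly and no wakeup is lost.
     */
    if (l->is_set)
        timeout_ms = 0;

    if (poll(pfd, 2, timeout_ms) < 0)
        return -1;

    if (pfd[0].revents & POLLIN)
        while (read(l->pipe_rd, buf, sizeof(buf)) > 0)
            ;                       /* drain all queued wakeup bytes */

    if (l->is_set)
        events |= DEMO_LATCH_SET;
    if (pfd[1].revents & POLLIN)
        events |= DEMO_FD_READABLE;
    return events;                  /* 0 means the timeout expired */
}
```

A caller would loop on `demo_wait`, call `demo_latch_reset` when `DEMO_LATCH_SET` is reported, and then re-check whatever condition the latch was guarding before sleeping again.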

Self-pipe vs. signalfd vs. EVFILT_SIGNAL vs. eventfd

The “make a signal look like a readable fd” problem has spawned a small zoo of OS-specific answers. PostgreSQL implements three of them (self-pipe, signalfd, and EVFILT_SIGNAL), plus event objects on Windows. The classic self-pipe trick (Bernstein, ~1990s; popularized by Stevens and the qmail codebase) is the portable baseline. Linux’s signalfd (2.6.22) and eventfd (2.6.22) were added specifically to retire the self-pipe; PostgreSQL uses signalfd on epoll builds but notably does not use eventfd for latches, since signalfd composes cleanly with the existing SIGURG-based cross-process wakeup. The BSD kqueue EVFILT_SIGNAL filter folds signal waiting directly into the reactor with no auxiliary fd at all. It is arguably the cleanest design, and it is the reason the kqueue path needs neither a self-pipe nor a signalfd. A research-frontier question is whether io_uring, with its IORING_OP_* operations and its ability to wait on eventfd and futex wakeups, could one day replace the WaitEventSet entirely. PostgreSQL 18’s new async I/O subsystem (postgres-aio.md) is the first foothold of io_uring in the tree, but the latch path is untouched by it so far.
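
For comparison with the self-pipe, the signalfd variant can be sketched in a few lines. This is a Linux-only illustration with hypothetical `demo_` names, assuming a single-threaded process; it is not PostgreSQL's code. The signal is blocked, no handler is installed, and the pending signal shows up as readability on an ordinary descriptor.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <poll.h>
#include <signal.h>
#include <stdint.h>
#include <sys/signalfd.h>
#include <unistd.h>

/* Block SIGURG and return an fd that becomes readable while it is pending. */
static int
demo_open_sigurg_fd(void)
{
    sigset_t    mask;

    sigemptyset(&mask);
    sigaddset(&mask, SIGURG);
    /* The signal must stay blocked so it is consumed only through the fd. */
    if (sigprocmask(SIG_BLOCK, &mask, NULL) != 0)
        return -1;
    return signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
}

/*
 * Wait up to timeout_ms, then drain the fd. Returns the number of SIGURG
 * records consumed (0 on timeout) or -1 on error.
 */
static int
demo_wait_sigurg(int sfd, int timeout_ms)
{
    struct pollfd pfd = { .fd = sfd, .events = POLLIN, .revents = 0 };
    struct signalfd_siginfo si;
    int         n = 0;

    if (poll(&pfd, 1, timeout_ms) < 0)
        return -1;
    while (read(sfd, &si, sizeof(si)) == (ssize_t) sizeof(si))
        if (si.ssi_signo == (uint32_t) SIGURG)
            n++;
    return n;
}
```

Because SIGURG is a standard (non-queued) signal, several sends that arrive before the drain collapse into one record, which matches the idempotent "wake up and look" semantics a latch needs.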

ProcSignal multiplexes one signal number into ~20 reasons via shared-memory flags. An alternative design — used by many actor-model and microkernel systems — is a genuine per-process message queue where each message carries a typed payload. PostgreSQL actually has both: the shm_mq shared-memory queue (postgres-shared-memory-ipc.md) carries data between parallel workers and their leader, while ProcSignal carries only notifications (“there is a message waiting in the queue, go look”). The split is deliberate: signals are scarce and lossy (coalescing identical reasons is a feature, not a bug, for idempotent notifications), whereas shm_mq is lossless and ordered for payloads. The general DBMS lesson — separate the doorbell from the mailbox — recurs in many systems.
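
The doorbell/mailbox split can be illustrated in one process. In this sketch, which uses hypothetical `demo_` names and an in-process array where the real system uses shared memory, one signal number carries several reasons through a flag array, and payloads travel through a separate ordered queue. Repeated rings for the same reason coalesce while the queue keeps every message.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

enum { DEMO_REASON_NOTIFY, DEMO_REASON_MAILBOX, DEMO_NUM_REASONS };

/* Doorbell: one flag per reason, plus a "something rang" flag. */
static volatile sig_atomic_t demo_flags[DEMO_NUM_REASONS];
static volatile sig_atomic_t demo_doorbell_rang;

/* Mailbox: a small ordered queue that keeps every payload. */
#define DEMO_QUEUE_CAP 8
static int  demo_queue[DEMO_QUEUE_CAP];
static int  demo_q_head, demo_q_tail;

static void
demo_sigusr1_handler(int signo)
{
    (void) signo;
    demo_doorbell_rang = 1;         /* only async-signal-safe work here */
}

static int
demo_install_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = demo_sigusr1_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Ring the doorbell for one reason; the flag is set before the signal. */
static int
demo_send_signal(int reason)
{
    demo_flags[reason] = 1;
    return kill(getpid(), SIGUSR1);
}

/* Deposit a payload in the mailbox, then ring the doorbell. */
static int
demo_send_message(int payload)
{
    if (demo_q_tail - demo_q_head == DEMO_QUEUE_CAP)
        return -1;                  /* queue full */
    demo_queue[demo_q_tail++ % DEMO_QUEUE_CAP] = payload;
    return demo_send_signal(DEMO_REASON_MAILBOX);
}

/* Test-and-clear one reason flag; returns 1 if it was set. */
static int
demo_check_signal(int reason)
{
    if (demo_flags[reason])
    {
        demo_flags[reason] = 0;
        return 1;
    }
    return 0;
}

/* Pop the oldest payload; returns 0 when the mailbox is empty. */
static int
demo_receive(int *payload)
{
    if (demo_q_head == demo_q_tail)
        return 0;
    *payload = demo_queue[demo_q_head++ % DEMO_QUEUE_CAP];
    return 1;
}
```

After two `demo_send_message` calls the receiver sees a single `DEMO_REASON_MAILBOX` notification but can still pop both payloads in order, which is why coalescing is harmless for the doorbell and unacceptable for the mailbox.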

CHECK_FOR_INTERRUPTS() makes PostgreSQL’s cancellation cooperative: a backend can only be cancelled at the safe points where it polls. A long-running C function with no interrupt check is effectively uncancellable — a real and occasionally painful limitation (tight loops in extensions, certain regex operations historically). The alternative, preemptive cancellation (forcibly unwinding a thread), is what thread-per-connection engines with managed runtimes (e.g. JVM-based systems) can attempt, but it is notoriously unsafe around locks and allocator state — the same hazard that forbids real work in signal handlers. PostgreSQL’s choice trades latency-to-cancel for the guarantee that cancellation never corrupts shared state. The InterruptHoldoffCount / CritSectionCount / QueryCancelHoldoffCount counters are the explicit knobs that widen or narrow the safe-point windows. A standing research/engineering frontier is reducing worst-case cancel latency by sprinkling more checks into hot paths without paying their branch cost in the common case — the unlikely() hint in INTERRUPTS_PENDING_CONDITION is the current mitigation.
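
The cooperative pattern described above can be sketched as a flag set by a signal handler, a holdoff counter, and a cheap check macro. The `demo_` names are hypothetical and the code is a simplified illustration rather than PostgreSQL's implementation: it records the cancellation in a variable where a real server would raise an error, and `__builtin_expect` is assumed to be available only on GCC-compatible compilers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

#if defined(__GNUC__)
#define demo_unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define demo_unlikely(x) ((x) != 0)
#endif

static volatile sig_atomic_t demo_interrupt_pending;   /* set by handler */
static volatile sig_atomic_t demo_cancel_pending;      /* set by handler */
static int  demo_holdoff_count;     /* > 0 means "not a safe point" */
static int  demo_cancelled;         /* set when the cancel is acted on */

static void
demo_cancel_handler(int signo)
{
    (void) signo;
    demo_cancel_pending = 1;        /* record the request, nothing more */
    demo_interrupt_pending = 1;
}

static void
demo_process_interrupts(void)
{
    if (demo_holdoff_count != 0)
        return;                     /* stay pending until holdoff ends */
    demo_interrupt_pending = 0;
    if (demo_cancel_pending)
    {
        demo_cancel_pending = 0;
        demo_cancelled = 1;         /* a real server would raise an error */
    }
}

/* Cheap in the common case: one predicted-not-taken branch. */
#define DEMO_CHECK_FOR_INTERRUPTS() \
    do { \
        if (demo_unlikely(demo_interrupt_pending)) \
            demo_process_interrupts(); \
    } while (0)

/*
 * Run up to max_steps steps. A cancel request arrives at step cancel_at;
 * steps in [hold_from, hold_to) run with interrupts held off. Returns the
 * step at which the cancel took effect, or max_steps if it never did.
 */
static int
demo_work(int max_steps, int cancel_at, int hold_from, int hold_to)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = demo_cancel_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGINT, &sa, NULL) != 0)
        return -1;

    demo_cancelled = 0;
    for (int i = 0; i < max_steps; i++)
    {
        if (i == hold_from)
            demo_holdoff_count++;   /* enter the protected region */
        if (i == hold_to)
            demo_holdoff_count--;   /* leave the protected region */
        if (i == cancel_at)
            raise(SIGINT);          /* the asynchronous request arrives */
        DEMO_CHECK_FOR_INTERRUPTS();
        if (demo_cancelled)
            return i;
    }
    return max_steps;
}
```

With no holdoff, a cancel requested at step 3 takes effect at step 3. With steps 2 through 6 held off, the same request is only honoured at step 7, which is the latency-for-safety trade the paragraph describes.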

A side benefit of funneling every wait through WaitEventSetWait is observability: each call passes a wait_event_info that pgstat_report_wait_start/_end records, which is what populates pg_stat_activity.wait_event and wait_event_type. This is one of the most-used production diagnostics in PostgreSQL, and it exists essentially for free because all sleeping is centralized in one module. Engines that scatter their blocking calls across the codebase struggle to retrofit such uniform wait accounting. See postgres-cumulative-stats.md for the collection side.
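
The "single choke point" idea can be sketched as a wrapper that publishes a wait identifier before blocking and clears it afterwards. The identifiers, the counter array, and the `demo_` functions below are hypothetical illustrations. In particular, the per-event counter is an addition for demonstration and is not something the text attributes to pgstat_report_wait_start.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <stdint.h>

/* Hypothetical wait-event identifiers. */
#define DEMO_WAIT_NONE        0u
#define DEMO_WAIT_CLIENT_READ 1u
#define DEMO_WAIT_TIMEOUT     2u
#define DEMO_NUM_WAIT_EVENTS  3u

/* What a monitoring view would read for this process. */
static volatile uint32_t demo_current_wait = DEMO_WAIT_NONE;
static uint32_t demo_wait_counts[DEMO_NUM_WAIT_EVENTS];

static void
demo_report_wait_start(uint32_t wait_event_info)
{
    demo_current_wait = wait_event_info;
    demo_wait_counts[wait_event_info]++;
}

static void
demo_report_wait_end(void)
{
    demo_current_wait = DEMO_WAIT_NONE;
}

/* Every blocking call goes through here, so every sleep is labelled. */
static int
demo_wait(struct pollfd *fds, nfds_t nfds, int timeout_ms,
          uint32_t wait_event_info)
{
    int         rc;

    demo_report_wait_start(wait_event_info);
    rc = poll(fds, nfds, timeout_ms);
    demo_report_wait_end();
    return rc;
}
```

Because callers cannot block without naming the reason, the accounting needs no cooperation from the rest of the code base beyond passing one extra argument.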

PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)

  • src/backend/storage/ipc/latch.c: InitLatch, InitSharedLatch, OwnLatch, DisownLatch, WaitLatch, WaitLatchOrSocket, SetLatch, ResetLatch, InitializeLatchWaitSet.
  • src/backend/storage/ipc/waiteventset.c: InitializeWaitEventSupport, CreateWaitEventSet, AddWaitEventToSet, ModifyWaitEvent, WaitEventAdjustEpoll, WaitEventSetWait, WaitEventSetWaitBlock, latch_sigurg_handler, sendSelfPipeByte, drain, WakeupMyProc, WakeupOtherProc, and the WAIT_USE_* selection ladder.
  • src/backend/storage/ipc/procsignal.c: ProcSignalSlot, ProcSignalShmemInit, ProcSignalInit, SendProcSignal, EmitProcSignalBarrier, WaitForProcSignalBarrier, ProcessProcSignalBarrier, CheckProcSignal, procsignal_sigusr1_handler, SendCancelRequest.
  • src/backend/tcop/postgres.c: die, StatementCancelHandler, HandleRecoveryConflictInterrupt, ProcessInterrupts, ProcessClientReadInterrupt, ProcessClientWriteInterrupt.
  • src/include/miscadmin.h: CHECK_FOR_INTERRUPTS, INTERRUPTS_PENDING_CONDITION, INTERRUPTS_CAN_BE_PROCESSED, HOLD_INTERRUPTS/RESUME_INTERRUPTS, InterruptPending.
  • src/include/storage/procsignal.h: ProcSignalReason enum, NUM_PROCSIGNALS, ProcSignalBarrierType.
  • src/include/storage/latch.h, src/include/storage/waiteventset.h: the WL_* event-mask constants and the public API prototypes.

Textbook chapters (under knowledge/research/dbms-general/)

  • Architecture of a Database System (Hellerstein et al.), §“Process Models”: the process-per-connection model and its reliance on OS IPC/signalling rather than a user-space scheduler.
  • Database Internals (Petrov): node-local concurrency and process models; the cost of wakeups and context switches for OLTP.
  • Operating Systems: Three Easy Pieces (Arpaci-Dusseau): condition variables and the lost-wakeup hazard; the small set of operations a signal handler may safely perform.
  • postgres-shared-memory-ipc.md — the shared-memory segment that holds ProcSignal and shm_mq; how slots are allocated; the payload-carrying message queue that pairs with ProcSignal’s notification-only role.
  • postgres-backend-lifecycle.md — where a backend calls InitializeWaitEventSupport, ProcSignalInit, and reaches the PostgresMain loop that sprinkles CHECK_FOR_INTERRUPTS().
  • postgres-aux-processes.md — auxiliary processes that own slots near the end of the ProcSignal array and use latches as their main wait.
  • postgres-lwlock-spinlock.md — the semaphore-based sleeping locks and ConditionVariable, the non-multiplexable cousins of the latch.
  • postgres-wire-protocol.md — the FE/BE read/write points wrapped by ProcessClientReadInterrupt/ProcessClientWriteInterrupt, and the cancel request that triggers SendCancelRequest.
  • postgres-cumulative-stats.md — the wait-event accounting populated by pgstat_report_wait_start/_end inside WaitEventSetWait.
  • postgres-aio.md — PG18 io_uring async I/O, the first io_uring user in the tree and a possible future direction for the wait machinery.