
PostgreSQL Wait Events and Progress Reporting


Every production database engine has to answer a deceptively simple operational question: “what is each process doing right now, and if it is not making progress, what is it blocked on?” This is the introspection or online observability problem. It is distinct from the cumulative statistics problem (how many index scans has this table seen since the last reset — covered in the sibling doc postgres-cumulative-stats.md). Introspection is about an instantaneous, per-process snapshot; cumulative stats are about monotonically growing counters aggregated over time.

Three properties define the design space for a per-process introspection facility:

  1. Write cost on the hot path. The instrumentation is poked at the exact moment a backend is about to do something interesting — acquire a lock, issue an I/O, enter the main loop. If publishing “I am now waiting on X” costs a lock acquisition or a system call, the instrumentation perturbs the very thing it measures (a probe effect). The design must make the common write a single un-contended, un-synchronised store.

  2. Read consistency without blocking the writer. A monitoring query (SELECT * FROM pg_stat_activity) reads thousands of other backends’ state. It must never take a lock that the observed backend also needs, or it would turn a read-only diagnostic into a source of contention — or worse, a deadlock. The classic answer is a seqlock: the writer bumps a counter before and after each mutation; the reader copies the record, re-reads the counter, and retries if it changed. Writes stay cheap; reads pay the cost of occasional retries.

  3. Vocabulary management. “Waiting on X” needs a controlled vocabulary of what X can be, partitioned into classes (lock wait, I/O wait, IPC wait, idle-in-main-loop, …) so an operator can reason at the right altitude. As the engine grows, that vocabulary grows to hundreds of entries; hand-maintaining the enum, the C name-lookup, and the user-facing documentation in three places invites drift, so mature engines generate all three from one declarative table.

A fourth concern is progress reporting for long-running maintenance commands (VACUUM, CREATE INDEX, CLUSTER, a base backup). Unlike a wait event — which is a single categorical value — progress is a small vector of integers (phase, blocks scanned, tuples processed, total to do) whose meaning is command-specific. The engine needs a generic transport for that vector and a per-command convention layered on top.

Database System Concepts (Silberschatz, Korth, Sudarshan) frames the DBA’s monitoring loop as observing “the state of currently executing transactions and the resources they hold or await,” and notes that lock waits and I/O waits dominate the diagnosis of stalled OLTP workloads. Architecture of a Database System (Hellerstein, Stonebraker, Hamilton), in its “Process Model” discussion, observes that a multi-process engine must publish per-worker state into shared memory precisely because no single process has a global view — the monitoring process must read the workers’ self-reported state rather than interrogate them. Database Internals (Petrov) connects this to the lock-free / wait-free literature: single-writer/multi-reader records published with memory barriers are the standard mechanism for low-overhead telemetry in concurrent systems.

PostgreSQL’s answer assembles all four pieces: a 4-byte wait_event_info word for the categorical wait, a table-driven codegen (wait_event_names.txt) for the vocabulary, a per-backend PgBackendStatus slot guarded by a st_changecount seqlock for the rest of the activity snapshot, and a 20-element st_progress_param[] vector inside that same slot for command progress.

This section names the recurring engineering patterns that engines adopt for per-process introspection, so PostgreSQL’s specifics read as choices within a shared space.

Single-writer shared-memory status slot per worker


Every multi-process (or multi-threaded) engine reserves one fixed slot of shared memory per worker, indexed by a stable worker id. The worker is the sole writer of its slot; any other process is a reader. This single-writer invariant is what makes cheap publication possible — there is never write-write contention on a slot, only the comparatively rare read-during-write race, which the seqlock handles. PostgreSQL sizes the array at MaxBackends + NUM_AUXILIARY_PROCS and indexes it by ProcNumber.

A reader cannot take a lock the writer needs, so the canonical lock-free publication uses an even/odd version counter:

  • Writer: count++ (now odd) → write barrier → mutate fields → write barrier → count++ (now even).
  • Reader: read count (must be even) → read barrier → copy fields → read barrier → re-read count; if unchanged and even, the copy is consistent; otherwise retry.

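The odd value signals “mutation in flight.” The memory barriers stop the CPU or compiler from reordering the counter bumps around the field writes. This is the same scheme as the sequence counter at the heart of the Linux kernel's seqlock, applied to a per-backend record; because each slot has exactly one writer, no writer-side lock is needed.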
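The writer and reader steps above can be sketched in portable C11. This is a minimal illustration under stated assumptions, not PostgreSQL's code: the StatusSlot type and the slot_* function names are hypothetical, and C11 atomics and fences stand in for the engine's own barrier primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status record; names are illustrative, not PostgreSQL's. */
typedef struct StatusSlot
{
    atomic_uint     changecount;    /* even = stable, odd = write in flight */
    _Atomic int64_t state;
    _Atomic int64_t progress;
} StatusSlot;

static void
slot_init(StatusSlot *slot)
{
    atomic_init(&slot->changecount, 0);
    atomic_init(&slot->state, 0);
    atomic_init(&slot->progress, 0);
}

/* Writer side: only the slot's owner calls this, so no lock is needed. */
static void
slot_write(StatusSlot *slot, int64_t state, int64_t progress)
{
    unsigned    c = atomic_load_explicit(&slot->changecount, memory_order_relaxed);

    assert((c & 1) == 0);       /* single writer: never re-entered mid-write */
    atomic_store_explicit(&slot->changecount, c + 1, memory_order_relaxed); /* odd */
    atomic_thread_fence(memory_order_release);  /* counter bump precedes fields */
    atomic_store_explicit(&slot->state, state, memory_order_relaxed);
    atomic_store_explicit(&slot->progress, progress, memory_order_relaxed);
    atomic_store_explicit(&slot->changecount, c + 2, memory_order_release); /* even */
}

/* Reader side, one attempt: returns false if a write was or went in flight. */
static bool
slot_try_read(StatusSlot *slot, int64_t *state, int64_t *progress)
{
    unsigned    before = atomic_load_explicit(&slot->changecount, memory_order_acquire);
    int64_t     s = atomic_load_explicit(&slot->state, memory_order_relaxed);
    int64_t     p = atomic_load_explicit(&slot->progress, memory_order_relaxed);
    unsigned    after;

    atomic_thread_fence(memory_order_acquire);  /* field reads precede re-read */
    after = atomic_load_explicit(&slot->changecount, memory_order_relaxed);
    if (before != after || (before & 1) != 0)
        return false;
    *state = s;
    *progress = p;
    return true;
}

/* Reader side, full loop: retry until a consistent copy is observed. */
static void
slot_read(StatusSlot *slot, int64_t *state, int64_t *progress)
{
    while (!slot_try_read(slot, state, progress))
        ;
}
```

The reader never writes to the slot, so it cannot slow the writer down; the price is that a reader racing with a writer repeats its copy.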

A separate “free” instrumentation word for the hottest signal


The very hottest signal — “I just started waiting / I just stopped waiting” — is poked so often (every lock, every buffer I/O) that even the seqlock’s barriers are too expensive. Engines special-case it as a single machine-word store that is atomic by virtue of alignment, needing no counter and no barrier, accepting that a reader may observe a momentarily stale value. PostgreSQL splits the wait event out of PgBackendStatus entirely: it lives as MyProc->wait_event_info, written by a bare *(volatile uint32 *) store.

Wait reasons are encoded as an integer whose high bits name a class (lock, I/O, IPC, timeout, client, activity) and whose low bits name the specific event within the class. A reader masks off the class to render the type column and the event column separately. This keeps the wire representation a single integer while preserving a two-level taxonomy.
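As a small sketch of that two-level encoding, the helpers below pack and unpack such a word. The mask values follow the ones this document quotes for PostgreSQL (top byte for the class, low two bytes for the event); the DEMO_ names and the wait_info_* helpers are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_CLASS_MASK 0xFF000000U     /* high byte: wait class */
#define DEMO_ID_MASK    0x0000FFFFU     /* low two bytes: event within class */

/* Combine a class constant and a per-class event id into one word. */
static inline uint32_t
wait_info_pack(uint32_t class_bits, uint16_t event_id)
{
    return (class_bits & DEMO_CLASS_MASK) | event_id;
}

/* Recover the class (drives the "type" column). */
static inline uint32_t
wait_info_class(uint32_t info)
{
    return info & DEMO_CLASS_MASK;
}

/* Recover the event id (drives the "event" column). */
static inline uint16_t
wait_info_event(uint32_t info)
{
    return (uint16_t) (info & DEMO_ID_MASK);
}
```

For instance, packing class 0x0A000000 with event id 7 gives 0x0A000007, and the two accessors return the original parts.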

Rather than hand-write the enum, the integer-to-name function, and the documentation table, mature engines keep one declarative list and generate all artifacts. This guarantees the pg_stat_activity name, the C enum symbol, and the manual entry never drift.

Generic progress vector with per-command semantics


Progress for maintenance commands is a fixed-width array of integers published into the same per-worker slot. The transport is generic; a per-command header file assigns meaning to each slot index (slot 0 = phase, slot 1 = heap blocks total, …) and a SQL view maps the raw integers to friendly columns.
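A rough sketch of that layering follows. The transport is just an array of 64-bit integers; the per-command "header" is a set of index constants, and a small decoder plays the role of the SQL view. All DEMO_ names, the index chosen for blocks scanned, and the phase numbering are assumptions made for illustration, not the engine's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NUM_PROGRESS_PARAM 20      /* fixed vector width */

/* Hypothetical per-command header: what each slot means for one command. */
enum
{
    DEMO_VACUUM_PHASE = 0,
    DEMO_VACUUM_HEAP_BLKS_TOTAL = 1,
    DEMO_VACUUM_HEAP_BLKS_SCANNED = 2   /* assumed index, for illustration */
};

/* Map the raw phase integer to a friendly string (illustrative values). */
static const char *
demo_vacuum_phase_name(int64_t phase)
{
    switch (phase)
    {
        case 0: return "initializing";
        case 1: return "scanning heap";
        default: return "unknown";
    }
}

/* Derive a friendly column from two raw slots. */
static int
demo_vacuum_percent_scanned(const int64_t *param)
{
    int64_t     total = param[DEMO_VACUUM_HEAP_BLKS_TOTAL];

    if (total <= 0)
        return 0;               /* nothing to report yet */
    return (int) (param[DEMO_VACUUM_HEAP_BLKS_SCANNED] * 100 / total);
}
```

The vector itself carries no schema; a reader that does not know which command wrote it cannot interpret the integers, which is why a command tag travels alongside it.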

| Theory / convention | PostgreSQL name |
| --- | --- |
| Per-worker status slot | BackendStatusArray[ProcNumber], type PgBackendStatus (backend_status.c) |
| Sole-writer pointer to own slot | MyBEEntry |
| Seqlock version counter | st_changecount + PGSTAT_BEGIN/END_WRITE_ACTIVITY macros |
| Reader retry loop | pgstat_begin_read_activity / pgstat_read_activity_complete |
| Free hot-path word | MyProc->wait_event_info, written by pgstat_report_wait_start |
| Class bits / event bits | WAIT_EVENT_CLASS_MASK (0xFF000000) / WAIT_EVENT_ID_MASK (0x0000FFFF) |
| Wait class table | wait_classes.h (PG_WAIT_LWLOCK … PG_WAIT_INJECTIONPOINT) |
| Vocabulary source | wait_event_names.txt |
| Vocabulary codegen | generate-wait_event_types.pl → wait_event_types.h, pgstat_wait_event.c |
| Integer → type string | pgstat_get_wait_event_type |
| Integer → event string | pgstat_get_wait_event |
| Current-activity string | st_activity_raw (raw, possibly mid-multibyte truncated) → pgstat_clip_activity |
| Per-session snapshot copy | pgstat_read_current_status → localBackendStatusTable |
| Progress command tag | st_progress_command (ProgressCommandType) |
| Progress vector | st_progress_param[PGSTAT_NUM_PROGRESS_PARAM] (20 slots) |
| Parallel-worker progress relay | PqMsg_Progress message handled in parallel.c |

PostgreSQL splits a backend’s “what am I doing” state across two shared-memory homes, on purpose:

  1. The wait event lives in MyProc->wait_event_info (a PGPROC field), written by an inline single-word store with no changecount and no lock. This is the hottest signal, so it is made as close to free as a memory store can be.

  2. Everything else — session state (active / idle / idle in transaction), the current query text, the application name, the query and plan identifiers, and the progress vector — lives in PgBackendStatus, the per-backend slot guarded by the st_changecount seqlock.

pg_stat_activity is a join of the two: its wait_event_type / wait_event columns come from MyProc->wait_event_info decoded by wait_event.c, while state, query, application_name, query_id, etc. come from PgBackendStatus decoded by backend_status.c. The progress views (pg_stat_progress_vacuum, pg_stat_progress_create_index, …) read st_progress_command plus st_progress_param[] from that same PgBackendStatus slot.

flowchart TB
    subgraph Backend["Observed backend (sole writer)"]
        proc["MyProc->wait_event_info<br/>(single uint32, lock-free store)"]
        be["MyBEEntry = &BackendStatusArray[ProcNumber]<br/>PgBackendStatus slot"]
        be --> st["st_state / st_activity_raw<br/>st_query_id / st_plan_id"]
        be --> prog["st_progress_command<br/>st_progress_param[20]"]
    end
    subgraph Monitor["Monitoring backend (reader)"]
        view["pg_stat_activity /<br/>pg_stat_progress_*"]
    end
    proc -. "pgstat_get_wait_event_type / _event" .-> view
    be -. "pgstat_read_current_status<br/>(changecount retry loop)" .-> view
    classDef w fill:#eef,stroke:#446;
    class proc,be,st,prog w;

Figure 1 — The two homes of per-backend live state. The wait event is a free single-word store in PGPROC; the rest is the seqlock-guarded PgBackendStatus slot. A monitoring backend decodes the first directly and snapshots the second through the changecount retry loop.

The wait event word: class byte + event id


A wait event is a single uint32. The top byte is the class; the low two bytes are the event id within that class. The class constants live in wait_classes.h:

// wait class constants — include/utils/wait_classes.h
#define PG_WAIT_LWLOCK 0x01000000U
#define PG_WAIT_LOCK 0x03000000U
#define PG_WAIT_BUFFERPIN 0x04000000U
#define PG_WAIT_ACTIVITY 0x05000000U
#define PG_WAIT_CLIENT 0x06000000U
#define PG_WAIT_EXTENSION 0x07000000U
#define PG_WAIT_IPC 0x08000000U
#define PG_WAIT_TIMEOUT 0x09000000U
#define PG_WAIT_IO 0x0A000000U
#define PG_WAIT_INJECTIONPOINT 0x0B000000U

wait_event.c masks the word with two constants to split class from id:

// class/id masks — utils/activity/wait_event.c
#define WAIT_EVENT_CLASS_MASK 0xFF000000
#define WAIT_EVENT_ID_MASK 0x0000FFFF

So 0x0A000007 reads as class PG_WAIT_IO (0x0A000000), event id 7. The lock-free store that publishes it is a textbook example of the “free instrumentation word” pattern — note that it does not check pgstat_track_activities, because the check would cost more than the store it guards:

// pgstat_report_wait_start — include/utils/wait_event.h
static inline void
pgstat_report_wait_start(uint32 wait_event_info)
{
    /*
     * Since this is a four-byte field which is always read and written as
     * four-bytes, updates are atomic.
     */
    *(volatile uint32 *) my_wait_event_info = wait_event_info;
}

my_wait_event_info initially points at a process-local variable (local_my_wait_event_info) so the store is safe even before MyProc exists; pgstat_set_wait_event_storage later redirects it into shared memory. pgstat_report_wait_end simply stores 0, and a zero word is the sentinel meaning “not waiting.”
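The redirectable-pointer idea can be sketched in isolation. The version below is a simplified stand-in, assuming a single process and ordinary memory in place of shared memory; every demo_ name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Process-local fallback word, valid from the first instruction onward. */
static uint32_t  demo_local_word;

/* Current publish target; starts at the local word. */
static uint32_t *demo_word_ptr = &demo_local_word;

/* Hot path: one aligned 4-byte store, no lock, no counter, no barrier. */
static inline void
demo_wait_start(uint32_t info)
{
    *(volatile uint32_t *) demo_word_ptr = info;
}

/* Zero is the sentinel for "not waiting". */
static inline void
demo_wait_end(void)
{
    *(volatile uint32_t *) demo_word_ptr = 0;
}

/* Once the shared slot exists, publish there instead. */
static void
demo_set_storage(uint32_t *shared_word)
{
    demo_word_ptr = shared_word;
}

/* At shutdown, fall back to the local word so later stores stay safe. */
static void
demo_reset_storage(void)
{
    demo_word_ptr = &demo_local_word;
}
```

The benefit of the indirection is that callers never have to ask whether shared memory is attached yet; the store is always valid, it merely lands somewhere nobody else reads until the pointer is redirected.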

Two functions decode the word for pg_stat_activity. pgstat_get_wait_event_type masks off the class and returns the type column string:

// pgstat_get_wait_event_type — utils/activity/wait_event.c (condensed)
const char *
pgstat_get_wait_event_type(uint32 wait_event_info)
{
    uint32 classId;
    if (wait_event_info == 0)
        return NULL; /* not waiting */
    classId = wait_event_info & WAIT_EVENT_CLASS_MASK;
    switch (classId)
    {
        case PG_WAIT_LWLOCK: return "LWLock";
        case PG_WAIT_LOCK: return "Lock";
        case PG_WAIT_IO: return "IO";
        case PG_WAIT_IPC: return "IPC";
        /* ... Activity, Client, Timeout, BufferPin, Extension ... */
        default: return "???";
    }
}

pgstat_get_wait_event then dispatches on the class to a per-class name-lookup. Note that LWLock and Lock are handled by their own code (GetLWLockIdentifier, GetLockNameFromTagType — see postgres-lwlock-spinlock.md), while the generated classes route through the codegen’d pgstat_get_wait_* helpers:

// pgstat_get_wait_event — utils/activity/wait_event.c (condensed)
const char *
pgstat_get_wait_event(uint32 wait_event_info)
{
    uint32 classId = wait_event_info & WAIT_EVENT_CLASS_MASK;
    uint16 eventId = wait_event_info & WAIT_EVENT_ID_MASK;
    switch (classId)
    {
        case PG_WAIT_LWLOCK:
            return GetLWLockIdentifier(classId, eventId); /* own code */
        case PG_WAIT_LOCK:
            return GetLockNameFromTagType(eventId); /* own code */
        case PG_WAIT_EXTENSION:
        case PG_WAIT_INJECTIONPOINT:
            return GetWaitEventCustomIdentifier(wait_event_info);
        case PG_WAIT_IO:
            return pgstat_get_wait_io((WaitEventIO) wait_event_info);
        /* ... IPC, Activity, Client, Timeout, BufferPin ... */
    }
}

The pgstat_get_wait_io, pgstat_get_wait_ipc, etc. helpers are not hand-written — they are generated, as the final #include at the bottom of wait_event.c admits:

// tail of utils/activity/wait_event.c
#include "utils/pgstat_wait_event.c"

Extensions register their own wait events (class PG_WAIT_EXTENSION) by name; the registry lives in two shared hash tables (by-info and by-name) plus a spinlock-guarded counter. WaitEventExtensionNew is the public entry point, delegating to WaitEventCustomNew:

// WaitEventCustomNew — utils/activity/wait_event.c (condensed)
static uint32
WaitEventCustomNew(uint32 classId, const char *wait_event_name)
{
    /* fast path: name already registered? return its id */
    LWLockAcquire(WaitEventCustomLock, LW_SHARED);
    entry_by_name = hash_search(WaitEventCustomHashByName, wait_event_name,
                                HASH_FIND, &found);
    LWLockRelease(WaitEventCustomLock);
    if (found)
        return entry_by_name->wait_event_info;
    /* slow path: take exclusive, recheck, allocate a fresh event id */
    LWLockAcquire(WaitEventCustomLock, LW_EXCLUSIVE);
    /* ... recheck ... */
    SpinLockAcquire(&WaitEventCustomCounter->mutex);
    eventId = WaitEventCustomCounter->nextId++;
    SpinLockRelease(&WaitEventCustomCounter->mutex);
    wait_event_info = classId | eventId; /* fold class into the id */
    /* register in both hash directions, then release the LWLock */
}

This is the only part of the wait-event machinery that needs real locking, because it mutates a shared registry rather than a single backend’s own word. The double-checked locking (shared probe, then exclusive recheck) is a deliberate optimisation for the common “already registered” case.

The vocabulary codegen: wait_event_names.txt


The controlled vocabulary of built-in wait events is a single tab-separated file, wait_event_names.txt. Each line gives the enum/event name and a documentation sentence, grouped under Section: ClassName - WaitEvent<Class> headers:

# wait_event_names.txt (excerpt) — class headers + entries
Section: ClassName - WaitEventIO
AIO_IO_COMPLETION "Waiting for another process to complete IO."
BUFFILE_READ "Waiting for a read from a buffered file."
CONTROL_FILE_SYNC "Waiting for the pg_control file to reach durable storage."
Section: ClassName - WaitEventIPC
APPEND_READY "Waiting for subplan nodes of an Append plan node to be ready."
BACKEND_TERMINATION "Waiting for the termination of another backend."

generate-wait_event_types.pl reads this file at build time and emits three artifacts from it: wait_event_types.h (the per-class C enums), pgstat_wait_event.c (the pgstat_get_wait_<class> lookups #included at the tail of wait_event.c), and the SGML documentation table. The script turns each SCREAMING_SNAKE token into both a WAIT_EVENT_ enum symbol and a CamelCase display string:

# generate-wait_event_types.pl (condensed) — name + enum derivation
my $waiteventenumname = "WAIT_EVENT_$waiteventname"; # WAIT_EVENT_BUFFILE_READ
# CamelCase the display name (LWLock/Lock classes are left verbatim)
my @waiteventparts = split("_", $waiteventname);
foreach my $waiteventpart (@waiteventparts)
{
    # keep each part's first letter, lowercase the rest
    $waiteventdescription .= substr($waiteventpart, 0, 1)
      . lc(substr($waiteventpart, 1)); # BUFFILE_READ becomes "BuffileRead"
}
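To make the derivation concrete, here is the same transformation written as a small C function. It is only a re-expression of the Perl loop shown above for readers who want to trace it by hand; demo_display_name is a hypothetical helper and is not part of the build script or the server.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper mirroring the Perl loop: split on '_', keep the first
 * character of each part as written, lowercase the rest, drop the
 * underscores. Output is truncated to fit outlen and always NUL-terminated.
 */
static void
demo_display_name(const char *token, char *out, size_t outlen)
{
    size_t      n = 0;
    int         at_part_start = 1;

    if (outlen == 0)
        return;
    for (const char *p = token; *p != '\0' && n + 1 < outlen; p++)
    {
        if (*p == '_')
        {
            at_part_start = 1;  /* next character begins a new part */
            continue;
        }
        out[n++] = at_part_start ? *p : (char) tolower((unsigned char) *p);
        at_part_start = 0;
    }
    out[n] = '\0';
}
```

Because only underscores mark part boundaries, BUFFILE_READ has two parts and comes out as BuffileRead, while CONTROL_FILE_SYNC has three and comes out as ControlFileSync.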

The enum’s first member in each class is anchored to the class constant, so the low bits become the event id automatically:

# generate-wait_event_types.pl (condensed) — enum base = class constant
printf $h "typedef enum\n{\n";
$pg_wait_class = "PG_WAIT_" . $lastuc; # e.g. PG_WAIT_IO
printf $h "\t%s = %s", $wev->[0], $pg_wait_class; # first = PG_WAIT_IO
# subsequent members just ", NEXT_NAME" — C auto-increments the id

So the generated WaitEventIO enum starts at PG_WAIT_IO (0x0A000000) and each subsequent event is +1, which is exactly why masking with WAIT_EVENT_ID_MASK recovers the per-class ordinal. Four classes are excluded from C generation — WaitEventExtension, WaitEventInjectionPoint, WaitEventLWLock, WaitEventLock — because their names are resolved dynamically (extension/injection-point registry) or by their own subsystems (LWLock, Lock); the script emits only SGML docs for those.
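The effect of anchoring the first member can be checked with a toy enum. The class value below is the one listed for PG_WAIT_IO in this document; the DEMO_ members are placeholders and do not correspond to real generated event names or their real ordinals.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PG_WAIT_IO 0x0A000000U     /* class constant, top byte only */
#define DEMO_ID_MASK    0x0000FFFFU

/* First member is pinned to the class constant; C numbers the rest. */
typedef enum
{
    DEMO_IO_FIRST = DEMO_PG_WAIT_IO,    /* ordinal 0 */
    DEMO_IO_SECOND,                     /* ordinal 1 */
    DEMO_IO_THIRD                       /* ordinal 2 */
} DemoWaitEventIO;

/* Masking off the class recovers the position within the class. */
static inline uint32_t
demo_event_ordinal(uint32_t wait_event_info)
{
    return wait_event_info & DEMO_ID_MASK;
}
```

No explicit numbering appears anywhere in the declarative table; the order of lines within a section is what fixes each event's id.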

flowchart LR
    txt["wait_event_names.txt<br/>(declarative table)"]
    pl["generate-wait_event_types.pl<br/>(build step)"]
    h["wait_event_types.h<br/>(per-class enums)"]
    c["pgstat_wait_event.c<br/>(name lookups)"]
    sgml["wait_event_types.sgml<br/>(docs table)"]
    txt --> pl
    pl --> h
    pl --> c
    pl --> sgml
    h -. "WaitEventIO enum" .-> wec["wait_event.c<br/>pgstat_get_wait_event"]
    c -. "#include at tail" .-> wec
    classDef g fill:#efe,stroke:#484;
    class txt,pl,h,c,sgml g;

Figure 2 — One declarative table feeds the enum, the C name lookup, and the documentation. The single source eliminates drift between the pg_stat_activity string, the C enum symbol, and the manual.

The PgBackendStatus slot and the changecount seqlock


The rest of a backend’s live state is the PgBackendStatus struct. The array is allocated once at postmaster startup, one slot per possible ProcNumber, with the variable-length strings (st_appname, st_clienthostname, st_activity_raw) carved out of separate shared buffers and pointed into:

// PgBackendStatus core fields — include/utils/backend_status.h (condensed)
typedef struct PgBackendStatus
{
    int st_changecount; /* seqlock version: odd = write in flight */
    int st_procpid; /* slot valid iff st_procpid > 0 */
    BackendType st_backendType;
    TimestampTz st_proc_start_timestamp;
    TimestampTz st_xact_start_timestamp;
    TimestampTz st_activity_start_timestamp;
    TimestampTz st_state_start_timestamp;
    Oid st_databaseid;
    Oid st_userid;
    BackendState st_state; /* STATE_RUNNING / STATE_IDLE / ... */
    char *st_appname; /* into BackendAppnameBuffer */
    char *st_activity_raw; /* into BackendActivityBuffer; may be mid-mb-truncated */
    ProgressCommandType st_progress_command;
    Oid st_progress_command_target;
    int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM]; /* 20 slots */
    int64 st_query_id;
    int64 st_plan_id;
} PgBackendStatus;

A backend reaches its own slot through MyBEEntry, set up by pgstat_beinit (MyBEEntry = &BackendStatusArray[MyProcNumber]). Every mutation is bracketed by the seqlock macros. The bracketed region is also a critical section: any error raised between the two macros is promoted to PANIC, because no unwind path would restore st_changecount to an even value. The region must therefore be short, straight-line, and allocation-free:

// changecount seqlock macros — include/utils/backend_status.h
#define PGSTAT_BEGIN_WRITE_ACTIVITY(beentry) \
    do { START_CRIT_SECTION(); \
         (beentry)->st_changecount++; \
         pg_write_barrier(); } while (0)
#define PGSTAT_END_WRITE_ACTIVITY(beentry) \
    do { pg_write_barrier(); \
         (beentry)->st_changecount++; \
         Assert(((beentry)->st_changecount & 1) == 0); \
         END_CRIT_SECTION(); } while (0)

pgstat_report_activity is the canonical writer — the backend calls it from tcop/postgres.c on every state transition. It does all the expensive work (timestamp fetch, string length) before entering the critical section, then performs only stores inside it:

// pgstat_report_activity — utils/activity/backend_status.c (condensed)
void
pgstat_report_activity(BackendState state, const char *cmd_str)
{
    volatile PgBackendStatus *beentry = MyBEEntry;
    /* ... handle track_activities disabled: one final DISABLED update ... */
    /* fetch everything BEFORE the critical section */
    start_timestamp = GetCurrentStatementStartTimestamp();
    if (cmd_str != NULL)
        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
    current_timestamp = GetCurrentTimestamp();
    /* ... accumulate conn active/idle time on state change ... */
    PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
    beentry->st_state = state;
    beentry->st_state_start_timestamp = current_timestamp;
    if (state == STATE_RUNNING)
    {
        beentry->st_query_id = INT64CONST(0); /* reset; set later at parse analysis */
        beentry->st_plan_id = INT64CONST(0);
    }
    if (cmd_str != NULL)
    {
        memcpy(beentry->st_activity_raw, cmd_str, len);
        beentry->st_activity_raw[len] = '\0';
        beentry->st_activity_start_timestamp = start_timestamp;
    }
    PGSTAT_END_WRITE_ACTIVITY(beentry);
}

Note st_activity_raw is stored raw — possibly truncated in the middle of a multi-byte character — because writes are far more frequent than reads, so the cost of correct UTF-8 clipping is deferred to the reader via pgstat_clip_activity.
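The read-time fix-up can be illustrated with a sketch that handles UTF-8 only. This is an assumption made to keep the example short; the server's own routine has to cope with whatever encoding the database uses, and demo_utf8_clip_len is a hypothetical name, not the function discussed above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the length of the longest prefix of raw[0..len) that does not end
 * in the middle of a UTF-8 character. Bytes that are not a valid lead byte
 * are passed through one at a time rather than rejected.
 */
static size_t
demo_utf8_clip_len(const char *raw, size_t len)
{
    size_t      i = 0;

    while (i < len)
    {
        unsigned char lead = (unsigned char) raw[i];
        size_t      seq;

        if (lead < 0x80)
            seq = 1;            /* ASCII */
        else if ((lead & 0xE0) == 0xC0)
            seq = 2;
        else if ((lead & 0xF0) == 0xE0)
            seq = 3;
        else if ((lead & 0xF8) == 0xF0)
            seq = 4;
        else
            seq = 1;            /* stray continuation or invalid byte */

        if (i + seq > len)
            break;              /* character was cut by the raw truncation */
        i += seq;
    }
    return i;
}
```

A reader would copy only that many bytes into its result, so a query text cut after the first byte of a two-byte character loses the whole character instead of exposing half of it.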

A monitoring backend does not read other slots field-by-field on demand; pgstat_read_current_status copies the whole array into local memory once per transaction, honouring the reader side of the seqlock. The per-entry copy loop retries until it observes an even, unchanged st_changecount:

// pgstat_read_current_status — utils/activity/backend_status.c (condensed)
for (;;)
{
    int before_changecount, after_changecount;
    pgstat_begin_read_activity(beentry, before_changecount);
    localentry->backendStatus.st_procpid = beentry->st_procpid;
    if (localentry->backendStatus.st_procpid > 0)
    {
        memcpy(&localentry->backendStatus,
               unvolatize(PgBackendStatus *, beentry), sizeof(PgBackendStatus));
        strcpy(localappname, beentry->st_appname);
        localentry->backendStatus.st_appname = localappname; /* re-point at local copy */
        strcpy(localactivity, beentry->st_activity_raw);
        localentry->backendStatus.st_activity_raw = localactivity;
    }
    pgstat_end_read_activity(beentry, after_changecount);
    if (pgstat_read_activity_complete(before_changecount, after_changecount))
        break;
    CHECK_FOR_INTERRUPTS(); /* don't spin forever on a stuck writer */
}

Because the out-of-line strings are pointers into shared buffers, the copy also re-points the local struct’s pointers at the local string copies — strcpy is safe against concurrent writes here only because each shared buffer is always NUL-terminated. The deadlock detector takes a different path: pgstat_get_backend_current_activity reads a single slot directly (no full snapshot) because it already knows the target is blocked and stable.

Progress reporting reuses the same PgBackendStatus slot but a different set of fields: a command tag and a 20-integer vector. The command tag is a small enum, and the vector width is fixed:

// include/utils/backend_progress.h
typedef enum ProgressCommandType
{
    PROGRESS_COMMAND_INVALID,
    PROGRESS_COMMAND_VACUUM,
    PROGRESS_COMMAND_ANALYZE,
    PROGRESS_COMMAND_CLUSTER,
    PROGRESS_COMMAND_CREATE_INDEX,
    PROGRESS_COMMAND_BASEBACKUP,
    PROGRESS_COMMAND_COPY,
} ProgressCommandType;
#define PGSTAT_NUM_PROGRESS_PARAM 20

A command “opens” progress with pgstat_progress_start_command(cmdtype, relid), which sets the tag, the target relation, and zeroes the vector — all inside the seqlock:

// pgstat_progress_start_command — utils/activity/backend_progress.c
void
pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
{
    volatile PgBackendStatus *beentry = MyBEEntry;
    if (!beentry || !pgstat_track_activities)
        return;
    PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
    beentry->st_progress_command = cmdtype;
    beentry->st_progress_command_target = relid;
    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
    PGSTAT_END_WRITE_ACTIVITY(beentry);
}

Each command then pokes individual slots as it advances. The meaning of each index is a per-command convention defined in commands/progress.h (e.g. for VACUUM, slot 0 is the phase, slot 1 is total heap blocks). pgstat_progress_update_param writes one slot; pgstat_progress_update_multi_param writes several atomically (one seqlock bracket, so a reader never sees a half-updated vector):

// pgstat_progress_update_multi_param — utils/activity/backend_progress.c (condensed)
void
pgstat_progress_update_multi_param(int nparam, const int *index, const int64 *val)
{
    volatile PgBackendStatus *beentry = MyBEEntry;
    if (!beentry || !pgstat_track_activities || nparam == 0)
        return;
    PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
    for (int i = 0; i < nparam; ++i)
    {
        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
        beentry->st_progress_param[index[i]] = val[i];
    }
    PGSTAT_END_WRITE_ACTIVITY(beentry);
}

pgstat_progress_end_command resets the tag to PROGRESS_COMMAND_INVALID, which is the signal the progress views use to decide a backend is no longer running that command.

A parallel CREATE INDEX or VACUUM spreads work across worker processes, but the progress vector that the user reads lives only in the leader’s PgBackendStatus. A worker cannot write the leader’s slot (single-writer invariant), so it sends the leader a PqMsg_Progress message over the parallel-worker message queue. The variant pgstat_progress_parallel_incr_param chooses the path:

// pgstat_progress_parallel_incr_param — utils/activity/backend_progress.c
void
pgstat_progress_parallel_incr_param(int index, int64 incr)
{
    if (IsParallelWorker())
    {
        static StringInfoData progress_message;
        initStringInfo(&progress_message);
        pq_beginmessage(&progress_message, PqMsg_Progress);
        pq_sendint32(&progress_message, index);
        pq_sendint64(&progress_message, incr);
        pq_endmessage(&progress_message);
    }
    else
        pgstat_progress_incr_param(index, incr); /* leader: write own slot directly */
}

The leader picks the message up in its parallel-message handler (HandleParallelMessage in access/transam/parallel.c) and applies the increment to its own slot, so the single-writer invariant is preserved — only the leader ever writes the leader’s vector:

// HandleParallelMessage — access/transam/parallel.c (condensed)
case PqMsg_Progress:
    {
        int index = pq_getmsgint(msg, 4);
        int64 incr = pq_getmsgint64(msg);
        pq_getmsgend(msg);
        pgstat_progress_incr_param(index, incr); /* leader updates its own slot */
        break;
    }

This is why only the incremental progress API has a parallel variant: relaying an absolute “set slot N to V” would race between workers, but “add incr to slot N” composes cleanly when funnelled through the single leader.
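The composition argument can be demonstrated with a single-process simulation. In the sketch below an array of structs stands in for the message queue, and the leader applies whatever arrives in whatever order; DemoProgressMsg and demo_leader_apply are hypothetical names, and the real wire format and handler are the ones shown earlier.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NUM_PROGRESS_PARAM 20

/* Hypothetical in-memory stand-in for one relayed progress message. */
typedef struct DemoProgressMsg
{
    int32_t     index;          /* which slot of the vector */
    int64_t     incr;           /* amount to add, never an absolute value */
} DemoProgressMsg;

/* Leader side: the only code that ever writes the leader's vector. */
static void
demo_leader_apply(int64_t *param, const DemoProgressMsg *msgs, int nmsgs)
{
    for (int i = 0; i < nmsgs; i++)
    {
        assert(msgs[i].index >= 0 && msgs[i].index < DEMO_NUM_PROGRESS_PARAM);
        param[msgs[i].index] += msgs[i].incr;   /* addition commutes */
    }
}
```

Feeding the same set of messages in two different arrival orders leaves the vector in the same final state, which would not hold if the messages carried absolute values and the last writer won.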

This section lists the stable symbols grouped by call-flow. Line numbers are deferred to the position-hint table at the end; anchor on the symbol names, which survive reformatting.

Wait-event encoding and decoding (wait_event.c, wait_event.h, wait_classes.h)

  • PG_WAIT_LWLOCK … PG_WAIT_INJECTIONPOINT (wait_classes.h) — the ten class constants, each occupying the top byte of the 4-byte word.
  • WAIT_EVENT_CLASS_MASK / WAIT_EVENT_ID_MASK (wait_event.c) — 0xFF000000 / 0x0000FFFF; split the word into class and event id.
  • pgstat_report_wait_start / pgstat_report_wait_end (wait_event.h, inline) — the lock-free single-word publish; the hottest instrumentation path in the server. Writes through my_wait_event_info.
  • my_wait_event_info / local_my_wait_event_info (wait_event.c) — the redirectable pointer; local before shared memory, then pointed into PGPROC by pgstat_set_wait_event_storage.
  • pgstat_set_wait_event_storage / pgstat_reset_wait_event_storage (wait_event.c) — redirect / un-redirect the publish target at backend start / shutdown.
  • pgstat_get_wait_event_type (wait_event.c) — class byte → "LWLock" / "IO" / … (the wait_event_type column).
  • pgstat_get_wait_event (wait_event.c) — full word → event name; dispatches to subsystem code for LWLock/Lock and to the codegen’d pgstat_get_wait_* for the rest.

Custom (extension) wait-event registry (wait_event.c)

  • WaitEventExtensionNew / WaitEventInjectionPointNew — public registration entry points for the two custom classes.
  • WaitEventCustomNew — double-checked-locking allocate-or-find; the only locking path in the wait-event subsystem.
  • WaitEventCustomCounterData / WaitEventCustomCounter — spinlock-guarded nextId counter in shared memory.
  • WaitEventCustomHashByInfo / WaitEventCustomHashByName — the two shared hash tables (info→name and name→info).
  • GetWaitEventCustomIdentifier — reverse lookup used by pgstat_get_wait_event for the custom classes.
  • GetWaitEventCustomNames — enumerate registered names for a class (backs pg_get_wait_events()).
  • WaitEventCustomShmemInit / WaitEventCustomShmemSize — shmem setup for the registry.
  • wait_event_names.txt — declarative table; one event per line under Section: ClassName - WaitEvent<Class> headers.
  • generate-wait_event_types.pl — build-time generator; emits wait_event_types.h, pgstat_wait_event.c, and the SGML docs. Skips C generation for WaitEventExtension, WaitEventInjectionPoint, WaitEventLWLock, WaitEventLock.
  • WAIT_EVENT_* enum members / pgstat_get_wait_<class> — generated; the latter are #included at the tail of wait_event.c via utils/pgstat_wait_event.c.

Backend-status slot lifecycle (backend_status.c, backend_status.h)

  • PgBackendStatus (backend_status.h) — the per-backend slot struct.
  • st_changecount + PGSTAT_BEGIN_WRITE_ACTIVITY / PGSTAT_END_WRITE_ACTIVITY — the writer-side seqlock (a critical section: errors inside → PANIC).
  • pgstat_begin_read_activity / pgstat_end_read_activity / pgstat_read_activity_complete — the reader-side seqlock.
  • BackendStatusArray / MyBEEntry — the shared array and the backend’s pointer to its own slot.
  • BackendStatusShmemInit / BackendStatusShmemSize — allocate the array plus the appname / hostname / activity string buffers.
  • pgstat_beinit — set MyBEEntry, register the shutdown hook.
  • pgstat_bestart_initial / pgstat_bestart_security / pgstat_bestart_final — three-phase fill of the slot at backend start (STATE_STARTING → security details → STATE_UNDEFINED + appname).
  • pgstat_beshutdown_hook — zero st_procpid to mark the slot free.

Activity reporting and readback (backend_status.c)

  • pgstat_report_activity — the main writer (state + query text), called from tcop/postgres.c.
  • pgstat_report_appname / pgstat_report_query_id / pgstat_report_plan_id / pgstat_report_xact_timestamp — targeted single-field writers.
  • pgstat_read_current_status — snapshot the whole array into localBackendStatusTable (the seqlock reader loop).
  • pgstat_get_beentry_by_proc_number / pgstat_get_local_beentry_by_index / pgstat_fetch_stat_numbackends — accessors over the local snapshot used by the pg_stat_get_activity SRF.
  • pgstat_get_backend_current_activity — single-slot direct read used by the deadlock detector.
  • pgstat_get_crashed_backend_activity — postmaster-side, deliberately unsynchronised read of a possibly-corrupt slot for crash reporting.
  • pgstat_clip_activity — multi-byte-safe truncation of st_activity_raw at read time.

Progress reporting (backend_progress.c, backend_progress.h)

  • ProgressCommandType + PGSTAT_NUM_PROGRESS_PARAM (backend_progress.h) — the command tag enum and the fixed 20-slot vector width.
  • pgstat_progress_start_command — set tag + target, zero the vector.
  • pgstat_progress_update_param / pgstat_progress_incr_param / pgstat_progress_update_multi_param — write one / increment one / atomically write several slots.
  • pgstat_progress_parallel_incr_param — worker sends PqMsg_Progress; leader increments directly.
  • pgstat_progress_end_command — reset the tag to PROGRESS_COMMAND_INVALID.
  • HandleParallelMessage (parallel.c) — leader-side PqMsg_Progress handler that funnels worker increments into the leader’s own slot.
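
The progress API above reduces to a handful of operations on a fixed-width vector. The sketch below models them in plain C under stated assumptions: all Demo* / demo_* names are hypothetical, the seqlock bracket that the real writers use is left out, and the worker-to-leader relay is modelled as a plain struct handed to a leader-side handler rather than an actual PqMsg_Progress message.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NUM_PROGRESS_PARAM 20  /* width the text gives for PGSTAT_NUM_PROGRESS_PARAM */

typedef enum { DEMO_PROGRESS_INVALID = 0, DEMO_PROGRESS_VACUUM } DemoProgressCommand;

/* Hypothetical progress section of a backend slot. */
typedef struct
{
    DemoProgressCommand command;
    uint32_t            target;     /* e.g. the relation being processed */
    int64_t             param[DEMO_NUM_PROGRESS_PARAM];
} DemoProgress;

/* Hypothetical payload of a worker-to-leader progress message. */
typedef struct
{
    int     index;
    int64_t incr;
} DemoProgressMsg;

/* Start: set tag and target, zero the whole vector. */
static void
demo_progress_start(DemoProgress *p, DemoProgressCommand cmd, uint32_t target)
{
    p->command = cmd;
    p->target = target;
    memset(p->param, 0, sizeof(p->param));
}

/* Overwrite one slot. */
static void
demo_progress_update(DemoProgress *p, int index, int64_t val)
{
    assert(index >= 0 && index < DEMO_NUM_PROGRESS_PARAM);
    p->param[index] = val;
}

/* Add to one slot. */
static void
demo_progress_incr(DemoProgress *p, int index, int64_t incr)
{
    assert(index >= 0 && index < DEMO_NUM_PROGRESS_PARAM);
    p->param[index] += incr;
}

/* Worker side: never touches the leader's slot, only describes the increment. */
static DemoProgressMsg
demo_worker_make_msg(int index, int64_t incr)
{
    DemoProgressMsg msg = {index, incr};

    return msg;
}

/* Leader side: apply a worker's increment to the leader's own slot. */
static void
demo_leader_handle_msg(DemoProgress *leader, const DemoProgressMsg *msg)
{
    demo_progress_incr(leader, msg->index, msg->incr);
}

/* End: reset only the tag; the vector is left as-is. */
static void
demo_progress_end(DemoProgress *p)
{
    p->command = DEMO_PROGRESS_INVALID;
}
```

Routing worker updates through the leader keeps each slot single-writer, and restricting the relay to increments means messages from different workers can be applied in any order without losing updates.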

Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
| --- | --- | --- |
| PG_WAIT_LWLOCK to PG_WAIT_INJECTIONPOINT | src/include/utils/wait_classes.h | 18–27 |
| WAIT_EVENT_CLASS_MASK / WAIT_EVENT_ID_MASK | src/backend/utils/activity/wait_event.c | 42–43 |
| WaitEventCustomShmemInit | src/backend/utils/activity/wait_event.c | 119 |
| WaitEventExtensionNew | src/backend/utils/activity/wait_event.c | 163 |
| WaitEventCustomNew | src/backend/utils/activity/wait_event.c | 175 |
| GetWaitEventCustomIdentifier | src/backend/utils/activity/wait_event.c | 276 |
| GetWaitEventCustomNames | src/backend/utils/activity/wait_event.c | 306 |
| pgstat_set_wait_event_storage | src/backend/utils/activity/wait_event.c | 349 |
| pgstat_get_wait_event_type | src/backend/utils/activity/wait_event.c | 373 |
| pgstat_get_wait_event | src/backend/utils/activity/wait_event.c | 431 |
| #include "utils/pgstat_wait_event.c" | src/backend/utils/activity/wait_event.c | 506 |
| pgstat_report_wait_start / _end (inline) | src/include/utils/wait_event.h | 68–88 |
| name + enum derivation | src/backend/utils/activity/generate-wait_event_types.pl | 106–130 |
| enum base = class constant | src/backend/utils/activity/generate-wait_event_types.pl | 205–212 |
| PgBackendStatus struct | src/include/utils/backend_status.h | 98–177 |
| PGSTAT_BEGIN/END_WRITE_ACTIVITY | src/include/utils/backend_status.h | 209–222 |
| pgstat_begin_read_activity / _complete | src/include/utils/backend_status.h | 224–238 |
| BackendStatusShmemInit | src/backend/utils/activity/backend_status.c | 114 |
| pgstat_beinit | src/backend/utils/activity/backend_status.c | 245 |
| pgstat_bestart_initial | src/backend/utils/activity/backend_status.c | 270 |
| pgstat_beshutdown_hook | src/backend/utils/activity/backend_status.c | 509 |
| pgstat_report_activity | src/backend/utils/activity/backend_status.c | 572 |
| pgstat_read_current_status | src/backend/utils/activity/backend_status.c | 820 |
| pgstat_get_backend_current_activity | src/backend/utils/activity/backend_status.c | 996 |
| pgstat_clip_activity | src/backend/utils/activity/backend_status.c | 1315 |
| ProgressCommandType / PGSTAT_NUM_PROGRESS_PARAM | src/include/utils/backend_progress.h | 22–33 |
| pgstat_progress_start_command | src/backend/utils/activity/backend_progress.c | 27 |
| pgstat_progress_update_param | src/backend/utils/activity/backend_progress.c | 48 |
| pgstat_progress_incr_param | src/backend/utils/activity/backend_progress.c | 69 |
| pgstat_progress_parallel_incr_param | src/backend/utils/activity/backend_progress.c | 91 |
| pgstat_progress_update_multi_param | src/backend/utils/activity/backend_progress.c | 121 |
| pgstat_progress_end_command | src/backend/utils/activity/backend_progress.c | 150 |
| PqMsg_Progress handler | src/backend/access/transam/parallel.c | 1222 |

All symbols and excerpts above were read from /data/hgryoo/references/postgres at REL_18_STABLE, commit 273fe94. Spot checks worth re-running after a pull:

  • Class constants and masks. grep -n "0x0.000000U" wait_classes.h should list exactly the ten PG_WAIT_* classes; WAIT_EVENT_CLASS_MASK (0xFF000000) and WAIT_EVENT_ID_MASK (0x0000FFFF) are defined at the top of wait_event.c. If a new class is added it gets a new top-byte value and a new Section: in wait_event_names.txt.
  • The lock-free publish has no track-activities guard. pgstat_report_wait_start in wait_event.h is a bare *(volatile uint32 *) my_wait_event_info = wait_event_info; — the file’s header comment explicitly states the pgstat_track_activities check was removed because it cost more than it saved. Confirm the guard is still absent.
  • The seqlock is a critical section. PGSTAT_BEGIN_WRITE_ACTIVITY opens with START_CRIT_SECTION() and PGSTAT_END_WRITE_ACTIVITY closes with END_CRIT_SECTION(); the header comment warns that any error between them promotes to PANIC. This is why all writers fetch timestamps and string lengths before the bracket.
  • Progress vector width is 20. PGSTAT_NUM_PROGRESS_PARAM is 20 in backend_progress.h; the per-command slot conventions live in commands/progress.h (e.g. PROGRESS_VACUUM_*), out of scope here.
  • Parallel progress is increment-only. Only pgstat_progress_parallel_incr_param exists (no parallel set / multi variant), and the leader’s parallel.c handler calls pgstat_progress_incr_param. Confirm both ends still use the increment form.
  • Codegen excludes four classes. In generate-wait_event_types.pl, the next if (...) guard skips WaitEventExtension, WaitEventInjectionPoint, WaitEventLWLock, WaitEventLock from C generation; they get SGML only.
  • Build-artifact note. wait_event_types.h and pgstat_wait_event.c are generated — they are not present in a clean checkout. The excerpts here are reconstructed from generate-wait_event_types.pl + the #include at the tail of wait_event.c, which is authoritative for how the generated names plug in.
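
The first two spot checks (class/ID masks and the unguarded publish) can be condensed into a short sketch. The mask values are the ones quoted above; the two class constants are illustrative top-byte values chosen to match the "0x0.000000U" pattern, and every DEMO_* / demo_* name is hypothetical rather than a PostgreSQL symbol.

```c
#include <stdint.h>

/* Illustrative class constants: one value per class, carried in the top byte. */
#define DEMO_WAIT_LWLOCK        0x01000000U
#define DEMO_WAIT_IO            0x0A000000U

/* Mask values as quoted in the spot-check list above. */
#define DEMO_WAIT_CLASS_MASK    0xFF000000U
#define DEMO_WAIT_ID_MASK       0x0000FFFFU

/* Until shared storage is installed, publish into a process-local dummy word. */
static uint32_t  demo_local_wait_word;
static uint32_t *demo_my_wait_event_info = &demo_local_wait_word;

/* Point the publish target at a word in (simulated) shared memory. */
static void
demo_set_wait_event_storage(uint32_t *storage)
{
    demo_my_wait_event_info = storage;
}

/* The whole hot path: one volatile store, no lock, no barrier, no guard. */
static inline void
demo_report_wait_start(uint32_t wait_event_info)
{
    *(volatile uint32_t *) demo_my_wait_event_info = wait_event_info;
}

/* Zero means "not waiting". */
static inline void
demo_report_wait_end(void)
{
    *(volatile uint32_t *) demo_my_wait_event_info = 0;
}

/* Decode helpers a reader would apply to the published word. */
static uint32_t
demo_wait_class(uint32_t wait_event_info)
{
    return wait_event_info & DEMO_WAIT_CLASS_MASK;
}

static uint16_t
demo_wait_id(uint32_t wait_event_info)
{
    return (uint16_t) (wait_event_info & DEMO_WAIT_ID_MASK);
}
```

Because class and event ID share one 32-bit word, a reader always sees a matching pair from a single aligned load; it can be momentarily stale but never torn between two events.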

Beyond PostgreSQL — Comparative Designs & Research Frontiers


PostgreSQL’s wait events are a deliberate echo of Oracle’s Wait Interface, the model that made “wait event” industry vocabulary. Oracle publishes per-session wait state into the SGA (v$session_wait, v$session_event, v$system_event) and, crucially, times each wait: the time_waited columns accumulate elapsed wait time per event (time_waited_micro at microsecond resolution), which feeds the “DB time” / “Active Session History” methodology. PostgreSQL’s core pg_stat_activity is comparatively spartan: it shows the current wait event but does not accumulate per-event wait time. The standard way to recover Oracle-style ASH on PostgreSQL is sampling: an extension (pg_wait_sampling, or one of the commercial samplers that sit alongside pg_stat_statements) periodically reads the wait_event_info word in every backend’s PGPROC entry and histograms the results. The lock-free single-word design is what makes such a high-frequency sampler cheap: it can read every backend’s wait word without ever taking a lock the backend needs.
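
A sampling pass of the kind described above can be sketched in a few lines. This is an illustration only and is not how pg_wait_sampling is implemented: DemoProcSlot stands in for a per-backend shared-memory entry, the histogram is keyed by the class byte alone, and all demo_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_CLASSES 256    /* one bucket per possible class (top) byte */

/* Hypothetical per-backend entry; only the wait word matters here. */
typedef struct DemoProcSlot
{
    volatile uint32_t wait_event_info;  /* 0 = not waiting */
} DemoProcSlot;

/*
 * One sampling pass: read each backend's wait word exactly once, without
 * taking any lock, and count it under its class byte. Returns how many
 * backends were found waiting in this pass.
 */
static size_t
demo_sample_once(const DemoProcSlot *procs, size_t nprocs,
                 uint64_t hist[DEMO_NUM_CLASSES])
{
    size_t waiting = 0;

    for (size_t i = 0; i < nprocs; i++)
    {
        uint32_t word = procs[i].wait_event_info;   /* single aligned load */

        if (word == 0)
            continue;           /* backend is running, not waiting */

        hist[word >> 24]++;     /* top byte selects the wait class */
        waiting++;
    }

    return waiting;
}
```

Calling such a pass at a fixed interval and multiplying each bucket by that interval gives an estimate of time spent per wait class, which is the sampled approximation of the timed counters that Oracle maintains in core.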

MySQL / InnoDB: Performance Schema instruments


MySQL’s Performance Schema takes the opposite tradeoff: instead of one categorical word, it instruments thousands of named points (wait/synch/mutex/..., wait/io/file/..., stage/sql/...) with per-instrument timers and aggregations, all configurable at runtime. The “stage” instruments are MySQL’s analogue of PostgreSQL command progress — SELECT ... FROM performance_schema.events_stages_current shows the current execution stage. The cost model differs sharply: Performance Schema maintains timed, aggregated history in shared memory at all times (with a documented overhead), whereas PostgreSQL keeps the hot path to a single un-timed store and pushes aggregation out to optional samplers. This is the classic always-on rich telemetry vs. cheap raw signal + external aggregation split.

SQL Server: wait statistics as a first-class tuning surface


SQL Server exposes cumulative wait statistics directly (sys.dm_os_wait_stats, sys.dm_exec_requests.wait_type) and the wait taxonomy (PAGEIOLATCH_*, LCK_M_*, CXPACKET, …) is the primary entry point of its performance-tuning methodology. Like Oracle, the engine accumulates wait time per type in core. PostgreSQL’s choice to not time waits in core is a conscious overhead decision; the community has repeatedly debated adding in-core wait timing and has so far preferred sampling extensions to avoid taxing the hot path.

The st_changecount protocol is a single-writer/multi-reader seqlock, the same primitive Linux uses for gettimeofday-class fast reads. The research lineage runs through the lock-free / wait-free literature: a single-writer record with paired memory barriers is among the cheapest known ways to publish a multi-field record to concurrent readers without a lock. The deliberate asymmetry (cheap writes, retrying reads) suits telemetry, where writers are on the hot path and readers are comparatively rare monitoring queries. PostgreSQL pushes this further by splitting the hottest field (the wait word) out into a barrier-free single store, accepting momentary reader staleness in exchange for a near-zero write cost. The relevant entries in the bibliography at dbms-general/dbms-papers.md situate this within the broader concurrent-data-structure work on versioned / optimistic reads.

  • In-core wait timing / sampling. Whether to accumulate per-event wait time in core (Oracle/SQL Server style) or keep relying on external samplers remains an open community tradeoff; the hot-path cost of the former is the sticking point.
  • eBPF / USDT probes. PostgreSQL ships USDT tracepoints (e.g. TRACE_POSTGRESQL_STATEMENT_STATUS fires inside pgstat_report_activity) enabling out-of-process tracing that sidesteps the shared-memory path entirely — a frontier for low-overhead, dynamically-attached observability.
  • Richer progress models. Today’s progress vector is 20 flat integers with per-command conventions; extending it to structured / nested phase reporting (for, say, parallel plans) without breaking the single-writer invariant is an active design space, as the PqMsg_Progress increment relay hints.
  • Custom wait events for extensions. The WaitEventExtensionNew registry (added so extensions stop colliding on the single Extension event) is a recent generalisation; injection-point wait events reuse the same machinery for deterministic testing.
  • PostgreSQL source, REL_18_STABLE @ 273fe94 (/data/hgryoo/references/postgres):
    • src/backend/utils/activity/wait_event.c: encode/decode, custom registry.
    • src/backend/utils/activity/backend_status.c: PgBackendStatus lifecycle, activity reporting, snapshot readback.
    • src/backend/utils/activity/backend_progress.c: progress API and the parallel relay.
    • src/backend/utils/activity/wait_event_names.txt: the declarative vocabulary.
    • src/backend/utils/activity/generate-wait_event_types.pl: the codegen.
    • src/include/utils/wait_event.h, wait_classes.h, backend_status.h, backend_progress.h: the inline publish, class constants, slot struct + seqlock macros, progress enum.
    • src/backend/access/transam/parallel.c: leader-side PqMsg_Progress handler.
  • Cross-references (this KB):
    • postgres-cumulative-stats.md: the other stats system (monotonic counters, pgstat.c shared hash); contrast with the live per-backend snapshot here.
    • postgres-lwlock-spinlock.md: GetLWLockIdentifier, the spinlock behind WaitEventCustomCounter, and the LWLock wait class that pgstat_get_wait_event defers to.
    • postgres-wire-protocol.md: pq_beginmessage / PqMsg_* framing reused by the parallel progress relay.
  • Textbook anchors (knowledge/research/dbms-general/):
    • Database System Concepts (Silberschatz, Korth, Sudarshan) — DBA monitoring of executing transactions, lock/I-O wait diagnosis.
    • Architecture of a Database System (Hellerstein, Stonebraker, Hamilton) — process model; per-worker state published to shared memory.
    • Database Internals (Petrov) — lock-free single-writer telemetry, versioned optimistic reads.
  • Paper bibliography: dbms-general/dbms-papers.md (relevant entries on seqlocks / concurrent versioned reads); see also the Oracle Wait Interface and MySQL Performance Schema documentation for the comparative designs.