
PostgreSQL Auxiliary Processes — bgwriter, walwriter, checkpointer, startup, and syslogger


A database server doing all I/O inside client sessions creates a latency coupling that is architecturally undesirable: a client writing a dirty buffer to disk delays itself and, through lock contention, may delay others. The classical solution is to off-load predictable, periodic I/O work into dedicated background processes that run independently of any client session.

Two sub-problems drive the design space:

  1. Amortizing write-back cost across clients. If every client must flush the page it evicts from the buffer pool, write latency becomes the critical path for common workloads. A background writer that continuously cleans buffers in advance means a client evicting a buffer usually finds a clean page available immediately. The same logic applies to WAL: grouping multiple commit records into a single fsync call (group commit) reduces the per-commit I/O penalty.

  2. Periodic consistency points (checkpoints). ARIES-style recovery (Mohan et al., 1992) requires the ability to identify a consistent on-disk state from which redo can begin. A checkpoint writes all dirty buffers to data files and records the WAL position. At recovery, only WAL after the most recent checkpoint needs replaying. The more frequently checkpoints occur, the shorter the recovery time; the more aggressively dirty data is flushed, the cheaper each checkpoint becomes. These two forces are in tension with normal I/O bandwidth, so checkpoint scheduling and pacing are non-trivial engineering problems.

Architecture of a Database System (Hellerstein et al., 2007, §6) describes the background writer pattern as near-universal in buffer-pool managers and notes that separating the background writer from the checkpoint process — which PostgreSQL did fully in version 9.2 — allows finer I/O scheduling: checkpoints can be spread across their interval (checkpoint_completion_target) while the background writer handles steady-state eviction pressure.

WAL recovery is a different concern. A process that replays WAL records to bring data files to a consistent state must run before any client backend is admitted. This is the startup process pattern: a single process drives the entire recovery pipeline, then signals the postmaster to open connections. In standby mode, this same process never exits — it becomes the continuous redo loop.

Logging infrastructure (capturing stderr from all processes and routing it to rotating files) is not a DBMS-specific problem, but the architectural choice to isolate it in a dedicated process rather than have each backend write directly to log files is significant: it serializes all writes through one writer, eliminating the lock contention that arises from concurrent file writes, and centralizes log rotation logic.

Almost every production RDBMS runs a background process (or thread) that proactively flushes dirty pages from the buffer pool:

  • Oracle: DBWR (Database Writer) processes; multiple instances for parallelism; woken by buffer-free threshold or timeout.
  • MySQL/InnoDB: the io_write threads plus the page-cleaner threads introduced in 5.6 to relieve single-threaded flushing bottlenecks.
  • SQL Server: the LazyWriter and Checkpoint Writer; LazyWriter handles steady-state eviction while Checkpoint Writer handles periodic consistency flushes.
  • CUBRID: the Flush Manager thread, which mirrors the “background writer + checkpointer” split.

The common pattern is: sleep for a configured interval, scan a portion of the buffer pool’s clock-hand or LRU list, write dirty pages whose LSN is old enough, then sleep again. Feedback loops adjust the scan window or the sleep duration based on how much work was found — if the system is idle, the writer enters a “hibernate” mode and backs off exponentially.

Grouping commit records to amortize fsync cost is also near-universal. Each RDBMS has its own name for it: Oracle’s “redo copy” latching, InnoDB’s “group commit” optimization, and PostgreSQL’s WAL writer on the synchronous_commit = off path. The invariant is that a transaction does not have to fsync WAL itself if it can rely on another process doing so within a bounded latency.

A naive checkpoint writes all dirty pages as fast as possible, creating an I/O spike. The design insight is that the spike can be spread across the checkpoint interval. ARIES mentions checkpoint scheduling as an implementation concern; the practical pacing strategy varies. PostgreSQL’s checkpoint_completion_target (default 0.9) tells the checkpointer to finish in 90% of checkpoint_timeout, which leaves headroom without starving normal I/O.
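The pacing idea can be sketched as a simple "am I ahead of schedule?" test. This is a minimal illustration, not PostgreSQL's code: the function name `checkpoint_on_schedule` and its parameters are hypothetical, and it paces by elapsed time only, whereas a real checkpointer may weigh other signals such as WAL volume.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: returns true when the fraction of buffers already
 * written is at least the fraction of the pacing window that has elapsed.
 *
 * progress           fraction of the checkpoint's writes done, in [0, 1]
 * elapsed_secs       seconds since the checkpoint started
 * timeout_secs       checkpoint_timeout, in seconds
 * completion_target  checkpoint_completion_target, e.g. 0.9
 */
bool
checkpoint_on_schedule(double progress, double elapsed_secs,
                       double timeout_secs, double completion_target)
{
    assert(timeout_secs > 0 && completion_target > 0);

    /* Pacing window: 0.9 * 300 s = 270 s with the defaults named above. */
    double window_secs = timeout_secs * completion_target;
    double elapsed_fraction = elapsed_secs / window_secs;

    /* On or ahead of schedule: the writer can nap before the next write. */
    return progress >= elapsed_fraction;
}
```

A caller would write one buffer, recompute `progress`, and sleep briefly whenever the function returns true, which is what spreads the I/O across the interval instead of issuing it in one burst.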

Recovery as a dedicated early-lifecycle phase

Every WAL-based RDBMS must replay WAL on startup before opening connections. The question is whether this happens in the main server thread, in a dedicated subprocess, or in the first client that connects. PostgreSQL’s answer — a dedicated startup process that signals the postmaster when done — provides a clean separation: the postmaster itself never reads WAL; it waits for a PMSIGNAL from the startup process to transition PMState.

| Concept | PostgreSQL name |
| --- | --- |
| Background page cleaner | B_BG_WRITER / BackgroundWriterMain |
| WAL group commit / fsync amortizer | B_WAL_WRITER / WalWriterMain |
| Checkpoint scheduler + executor | B_CHECKPOINTER / CheckpointerMain |
| WAL recovery / redo driver | B_STARTUP / StartupProcessMain, StartupXLOG |
| Stderr log collector + rotator | B_LOGGER / SysLoggerMain |
| Common aux init path | AuxiliaryProcessMainCommon |
| Client-facing checkpoint request | CheckpointerShmem (shared struct) |

AuxiliaryProcessMainCommon: shared initialization

Four of the five processes (all except the syslogger) call AuxiliaryProcessMainCommon (in auxprocess.c) after setting MyBackendType. This function provides the minimal initialization path for processes that do not call the full InitPostgres:

// AuxiliaryProcessMainCommon — src/backend/postmaster/auxprocess.c
void
AuxiliaryProcessMainCommon(void)
{
    Assert(IsUnderPostmaster);

    /* Release postmaster's working memory context */
    if (PostmasterContext)
    {
        MemoryContextDelete(PostmasterContext);
        PostmasterContext = NULL;
    }

    init_ps_display(NULL);              /* process-title display */
    IgnoreSystemIndexes = true;         /* no catalog reads yet */
    InitAuxiliaryProcess();             /* create PGPROC in shared memory */
    BaseInit();                         /* low-level I/O and buffer init */
    ProcSignalInit(NULL, 0);            /* register in procsignal array */
    CreateAuxProcessResourceOwner();    /* resource tracking sans transactions */
    pgstat_beinit();                    /* backend status infrastructure */
    pgstat_bestart_initial();
    pgstat_bestart_final();
    before_shmem_exit(ShutdownAuxiliaryProcess, 0);    /* LWLock cleanup */
    SetProcessingMode(NormalProcessing);
}

The critical call is InitAuxiliaryProcess, which allocates a PGPROC slot for the process. Without a PGPROC, the process cannot acquire LWLocks and cannot be seen by the procarray. Unlike InitPostgres, auxiliary initialization does not open a database, does not set up a role, and does not enable heavyweight locks — auxiliary processes are stateless with respect to user transactions.

The syslogger is the sole exception: it calls neither InitAuxiliaryProcess nor AuxiliaryProcessMainCommon because it does not attach to shared memory at all. Its isolation from the shared segment is intentional — the logger must survive situations where shared memory is in an unknown state.

BackgroundWriterMain (bgwriter.c) runs a continuous loop: call BgBufferSync, sleep for bgwriter_delay milliseconds, repeat.

// BackgroundWriterMain — src/backend/postmaster/bgwriter.c
void
BackgroundWriterMain(const void *startup_data, size_t startup_data_len)
{
    MyBackendType = B_BG_WRITER;
    AuxiliaryProcessMainCommon();

    pqsignal(SIGHUP, SignalHandlerForConfigReload);
    pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
    /* ... condensed signal setup ... */

    for (;;)
    {
        ResetLatch(MyLatch);
        ProcessMainLoopInterrupts();

        can_hibernate = BgBufferSync(&wb_context);

        pgstat_report_bgwriter();
        pgstat_report_wal(true);

        if (FirstCallSinceLastCheckpoint())
            smgrdestroyall();    /* free dropped-relation smgr objects */

        /* Periodically log xl_running_xacts for replication standby */
        if (XLogStandbyInfoActive() && !RecoveryInProgress())
        {
            /* ... condensed: LogStandbySnapshot every 15 s ... */
        }

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       BgWriterDelay, WAIT_EVENT_BGWRITER_MAIN);

        /* Hibernate if nothing happening for two consecutive cycles */
        if (rc == WL_TIMEOUT && can_hibernate && prev_hibernate)
        {
            StrategyNotifyBgWriter(MyProcNumber);
            (void) WaitLatch(MyLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                             BgWriterDelay * HIBERNATE_FACTOR,
                             WAIT_EVENT_BGWRITER_HIBERNATE);
            StrategyNotifyBgWriter(-1);
        }

        prev_hibernate = can_hibernate;
    }
}

BgBufferSync (in bufmgr.c) performs the actual dirty-page scanning via the clock-hand algorithm. It returns true (can hibernate) when it found no work. After two consecutive “can hibernate” cycles, the bgwriter registers itself with StrategyNotifyBgWriter so the buffer pool strategy will wake it on the next allocation, then sleeps for BgWriterDelay * 50 ms instead of BgWriterDelay. This hibernate mechanism cuts idle CPU and disk wakeups significantly.
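The two-cycle hibernate rule described above can be isolated into a small state machine. The sketch below is illustrative only: `WriterState` and `hibernate_wait_ms` are hypothetical names, and latches and wakeup registration are left out so that only the decision logic remains.

```c
#include <assert.h>
#include <stdbool.h>

#define HIBERNATE_FACTOR 50    /* sleep multiplier quoted in the text above */

/* Hypothetical per-process state: what the previous cycle reported. */
typedef struct
{
    bool prev_hibernate;
} WriterState;

/*
 * Returns the length in ms of the extra "hibernate" wait for this cycle,
 * or 0 when the writer should go straight back to scanning.
 *
 * can_hibernate  the scan found no work this cycle
 * timed_out      the normal wait ended by timeout, not by a wakeup
 */
long
hibernate_wait_ms(WriterState *st, bool can_hibernate, bool timed_out,
                  long delay_ms)
{
    long extra = 0;

    assert(delay_ms > 0);

    /* Only hibernate after two consecutive idle cycles. */
    if (timed_out && can_hibernate && st->prev_hibernate)
        extra = delay_ms * HIBERNATE_FACTOR;

    st->prev_hibernate = can_hibernate;
    return extra;
}
```

With a 200 ms delay, the first idle cycle returns 0 and the second returns 10000; any busy cycle resets the streak.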

The cleaning work itself is the clock-sweep LRU scan inside BgBufferSync. The bgwriter does not scan the whole pool every cycle; it estimates how many buffers the next allocation cycle will need (from a smoothed allocation-rate moving average scaled by bgwriter_lru_multiplier), then cleans forward from next_to_clean until it has either lapped the strategy clock sweep, satisfied the estimate, or hit the bgwriter_lru_maxpages cap:

// BgBufferSync — src/backend/storage/buffer/bufmgr.c
bool
BgBufferSync(WritebackContext *wb_context)
{
    /* Find where the freelist clock sweep is + allocations since last call */
    strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
    PendingBgWriterStats.buf_alloc += recent_alloc;

    if (bgwriter_lru_maxpages <= 0)    /* LRU scan disabled */
    {
        saved_info_valid = false;
        return true;                   /* OK to hibernate */
    }

    /* ... condensed: compute strategy_delta, bufs_to_lap, smoothed_density,
       smoothed_alloc, then upcoming_alloc_est = smoothed_alloc * multiplier ... */

    num_to_scan = bufs_to_lap;
    num_written = 0;
    reusable_buffers = reusable_buffers_est;

    /* Execute the LRU scan: clean forward until lapped / estimate met / capped */
    while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
    {
        int sync_state = SyncOneBuffer(next_to_clean, true, wb_context);

        if (++next_to_clean >= NBuffers)
        {
            next_to_clean = 0;
            next_passes++;
        }
        num_to_scan--;

        if (sync_state & BUF_WRITTEN)
        {
            reusable_buffers++;
            if (++num_written >= bgwriter_lru_maxpages)
            {
                PendingBgWriterStats.maxwritten_clean++;
                break;    /* hit per-cycle write cap */
            }
        }
        else if (sync_state & BUF_REUSABLE)
            reusable_buffers++;
    }

    PendingBgWriterStats.buf_written_clean += num_written;

    /* Hibernate only if we lapped the sweep AND no allocations occurred */
    return (bufs_to_lap == 0 && recent_alloc == 0);
}

The return value wired into can_hibernate is bufs_to_lap == 0 && recent_alloc == 0: the bgwriter signals it may hibernate only when it has caught up to the strategy clock sweep and no backend allocated a buffer since the previous call. SyncOneBuffer (same file) writes one dirty candidate buffer if it is not pinned or recently used, returning a bitmask of BUF_WRITTEN / BUF_REUSABLE. The two feedback estimators — smoothed_alloc (fast-attack, slow-decline allocation rate) and smoothed_density (buffers scanned per reusable buffer found) — are what let the writer track a moving workload without re-scanning clean buffers it already passed.
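The "fast-attack, slow-decline" behaviour of smoothed_alloc can be shown in isolation. This is a sketch under assumptions: the function names are hypothetical, and the smoothing window of 16 samples is an illustrative choice rather than a value taken from the text.

```c
#include <assert.h>

/* Assumed smoothing window; 16 is an illustrative value. */
#define SMOOTHING_SAMPLES 16.0f

/*
 * Fast attack: jump straight up to a higher allocation count.
 * Slow decline: move only 1/SMOOTHING_SAMPLES of the way down per cycle.
 */
float
update_smoothed_alloc(float smoothed_alloc, int recent_alloc)
{
    assert(recent_alloc >= 0);

    if (smoothed_alloc <= (float) recent_alloc)
        return (float) recent_alloc;

    return smoothed_alloc +
        ((float) recent_alloc - smoothed_alloc) / SMOOTHING_SAMPLES;
}

/* Scale the smoothed rate by bgwriter_lru_multiplier to get the target. */
int
upcoming_alloc_estimate(float smoothed_alloc, double lru_multiplier)
{
    return (int) (smoothed_alloc * lru_multiplier);
}
```

The asymmetry is the design point: a sudden burst of allocations raises the cleaning target immediately, while a quiet cycle lowers it only gradually, so one idle interval does not leave backends without clean buffers.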

Figure 1b — BgBufferSync LRU clock-sweep cleaning loop

flowchart TD
    START["BgBufferSync"]
    SYNCSTART["StrategySyncStart\nstrategy_buf_id, recent_alloc"]
    DISABLED["bgwriter_lru_maxpages <= 0?"]
    HIB["return true\n(hibernate)"]
    EST["estimate upcoming_alloc_est\nfrom smoothed_alloc x multiplier\ncompute bufs_to_lap"]
    COND["num_to_scan > 0 AND\nreusable_buffers < est?"]
    SYNC["SyncOneBuffer(next_to_clean)"]
    ADV["advance next_to_clean\n(wrap to 0, next_passes++)"]
    WRITTEN["BUF_WRITTEN?\nnum_written++"]
    CAP["num_written >= maxpages?\nbreak"]
    RET["return bufs_to_lap == 0\nAND recent_alloc == 0"]

    START --> SYNCSTART --> DISABLED
    DISABLED -->|yes| HIB
    DISABLED -->|no| EST --> COND
    COND -->|yes| SYNC --> ADV --> WRITTEN
    WRITTEN -->|yes| CAP
    CAP -->|under cap| COND
    CAP -->|at cap| RET
    WRITTEN -->|no| COND
    COND -->|no| RET

Two secondary duties live in the bgwriter loop because it is the only process guaranteed to run regularly: (1) destroying smgr objects for dropped relations after each checkpoint (since the bgwriter, unlike backends, never calls AtEOXact_SMgr), and (2) periodically logging xl_running_xacts snapshots to help standbys reach a consistent state faster.

Figure 1 — bgwriter main loop

flowchart TD
    START["BackgroundWriterMain"]
    INIT["AuxiliaryProcessMainCommon\nsetup signals"]
    LOOP["for(;;)"]
    RESET["ResetLatch"]
    INTR["ProcessMainLoopInterrupts\n(shutdown / reload)"]
    SYNC["BgBufferSync\nreturns can_hibernate"]
    STATS["pgstat_report_bgwriter\npgstat_report_wal"]
    SMGR["FirstCallSinceLastCheckpoint?\nsmgrdestroyall"]
    SNAP["XLogStandbyInfoActive?\nLogStandbySnapshot every 15 s"]
    WAIT["WaitLatch(BgWriterDelay)"]
    HIB["hibernate?\nStrategyNotifyBgWriter\nWaitLatch(BgWriterDelay * 50)"]

    START --> INIT --> LOOP
    LOOP --> RESET --> INTR --> SYNC --> STATS --> SMGR --> SNAP --> WAIT
    WAIT -->|timeout + can_hibernate x2| HIB --> LOOP
    WAIT -->|latch or timeout| LOOP

WalWriterMain (walwriter.c) calls XLogBackgroundFlush on every cycle to flush any unflushed WAL buffers:

// WalWriterMain — src/backend/postmaster/walwriter.c
void
WalWriterMain(const void *startup_data, size_t startup_data_len)
{
    MyBackendType = B_WAL_WRITER;
    AuxiliaryProcessMainCommon();

    /* Advertise proc number so backends can wake us */
    ProcGlobal->walwriterProc = MyProcNumber;

    for (;;)
    {
        if (hibernating != (left_till_hibernate <= 1))
        {
            hibernating = (left_till_hibernate <= 1);
            SetWalWriterSleeping(hibernating);    /* global flag for async commits */
        }

        ResetLatch(MyLatch);
        ProcessMainLoopInterrupts();

        if (XLogBackgroundFlush())
            left_till_hibernate = LOOPS_UNTIL_HIBERNATE;    /* reset to 50 */
        else if (left_till_hibernate > 0)
            left_till_hibernate--;

        pgstat_report_wal(false);

        cur_timeout = (left_till_hibernate > 0)
            ? WalWriterDelay
            : WalWriterDelay * HIBERNATE_FACTOR;    /* 25× */

        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         cur_timeout,
                         WAIT_EVENT_WAL_WRITER_MAIN);
    }
}

ProcGlobal->walwriterProc is the mechanism by which asynchronously committing backends know whose latch to set when they need a flush. SetWalWriterSleeping(true) tells async-commit code that the walwriter is about to enter a long sleep, so the async-commit timeout calculation (synchronous_commit = off guarantees flush within wal_writer_delay * 3) remains correct even when the walwriter hibernates.

XLogBackgroundFlush (in xlog.c) writes any WAL buffers that have not yet been written to the WAL segment and fsyncs up to the current flush point. It returns true when it flushed anything. After LOOPS_UNTIL_HIBERNATE (50) consecutive no-op cycles, left_till_hibernate reaches zero and the sleep is extended to WalWriterDelay * 25.
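The countdown can be traced with a stand-alone model of the timeout selection. The constants come from the text above; `WalWriterState` and `walwriter_next_timeout` are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define LOOPS_UNTIL_HIBERNATE 50    /* idle cycles before the long sleep */
#define HIBERNATE_FACTOR      25    /* long-sleep multiplier */

/* Hypothetical container for the countdown kept across loop iterations. */
typedef struct
{
    int left_till_hibernate;
} WalWriterState;

/*
 * Update the countdown for one cycle and return the sleep time in ms.
 * flushed models the flush routine reporting that it did some work.
 */
long
walwriter_next_timeout(WalWriterState *st, bool flushed, long delay_ms)
{
    assert(delay_ms > 0);

    if (flushed)
        st->left_till_hibernate = LOOPS_UNTIL_HIBERNATE;    /* activity: reset */
    else if (st->left_till_hibernate > 0)
        st->left_till_hibernate--;                          /* one more idle cycle */

    return (st->left_till_hibernate > 0)
        ? delay_ms
        : delay_ms * HIBERNATE_FACTOR;
}
```

Starting from a full countdown with a 200 ms delay, the first 49 idle cycles each return 200, the 50th returns 5000, and a single cycle with work restores the short sleep.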

checkpointer: checkpoint owner and fsync dispatcher

CheckpointerMain (checkpointer.c) owns all checkpoint and restartpoint execution. It communicates with backends via CheckpointerShmem, a shared struct that doubles as a request queue for fsync operations:

// CheckpointerShmemStruct — src/backend/postmaster/checkpointer.c
typedef struct
{
    pid_t       checkpointer_pid;    /* 0 if not started */
    slock_t     ckpt_lck;            /* protects ckpt_* counters */
    int         ckpt_started;        /* incremented at checkpoint start */
    int         ckpt_done;           /* set == ckpt_started on completion */
    int         ckpt_failed;         /* incremented on failure */
    int         ckpt_flags;          /* OR of CHECKPOINT_* request bits */
    ConditionVariable start_cv;      /* signaled when ckpt_started advances */
    ConditionVariable done_cv;       /* signaled when ckpt_done advances */
    int         num_requests;        /* pending fsync requests */
    int         max_requests;
    CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
} CheckpointerShmemStruct;

The three-counter protocol (ckpt_started, ckpt_done, ckpt_failed) lets a backend that sends a checkpoint request (via RequestCheckpoint) wait for its checkpoint to complete without a custom signal. The backend records ckpt_started before signaling, waits for ckpt_started to advance (a new checkpoint has begun), then waits for ckpt_done to catch up to that value. If ckpt_failed increased between start and done, the checkpoint failed.
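The protocol is easier to follow in a single-threaded model that replaces the condition-variable waits with a poll function. Everything below is a hypothetical sketch: the type and function names are invented for illustration, and there is no locking because nothing runs concurrently.

```c
#include <assert.h>
#include <stdbool.h>

/* The three counters, without the spinlock and condition variables. */
typedef struct
{
    int ckpt_started;
    int ckpt_done;
    int ckpt_failed;
} Counters;

/* Snapshot a requester takes before posting its request. */
typedef struct
{
    int old_started;
    int old_failed;
} RequestTicket;

typedef enum { REQ_PENDING, REQ_OK, REQ_FAILED } RequestStatus;

RequestTicket
request_begin(const Counters *c)
{
    RequestTicket t = { c->ckpt_started, c->ckpt_failed };
    return t;
}

/* Checkpointer side: announce the start, later announce completion. */
void
checkpoint_begin(Counters *c)
{
    c->ckpt_started++;
}

void
checkpoint_finish(Counters *c, bool ok)
{
    if (!ok)
        c->ckpt_failed++;
    c->ckpt_done = c->ckpt_started;
}

/* Requester side: what a waiter would conclude if it looked right now. */
RequestStatus
request_poll(const Counters *c, RequestTicket t)
{
    if (c->ckpt_started == t.old_started)
        return REQ_PENDING;    /* no checkpoint has begun since the request */
    if (c->ckpt_done != c->ckpt_started)
        return REQ_PENDING;    /* the new checkpoint is still running */
    return (c->ckpt_failed != t.old_failed) ? REQ_FAILED : REQ_OK;
}
```

The first test in `request_poll` is what prevents a requester from being satisfied by a checkpoint that was already in progress when it asked: that checkpoint may have passed over the pages the requester cares about, so only one that starts afterwards counts.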

The main loop:

// CheckpointerMain loop (condensed) — src/backend/postmaster/checkpointer.c
void
CheckpointerMain(const void *startup_data, size_t startup_data_len)
{
    MyBackendType = B_CHECKPOINTER;
    AuxiliaryProcessMainCommon();

    CheckpointerShmem->checkpointer_pid = MyProcPid;
    ProcGlobal->checkpointerProc = MyProcNumber;

    /* SIGINT = shutdown checkpoint request; SIGUSR2 = exit after that */
    pqsignal(SIGINT, ReqShutdownXLOG);
    pqsignal(SIGTERM, SIG_IGN);    /* ignore normal SIGTERM */
    pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);

    for (;;)
    {
        ResetLatch(MyLatch);
        AbsorbSyncRequests();    /* drain fsync request queue */
        ProcessCheckpointerInterrupts();
        if (ShutdownXLOGPending || ShutdownRequestPending)
            break;

        /* Decide: time-driven or request-driven checkpoint? */
        if (CheckpointerShmem->ckpt_flags)
        {
            do_checkpoint = true;
            chkpt_or_rstpt_requested = true;
        }
        elapsed_secs = (pg_time_t) time(NULL) - last_checkpoint_time;
        if (elapsed_secs >= CheckPointTimeout)
        {
            do_checkpoint = true;
            flags |= CHECKPOINT_CAUSE_TIME;
        }

        if (do_checkpoint)
        {
            /* Broadcast start counter, do checkpoint, broadcast done */
            SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
            CheckpointerShmem->ckpt_flags = 0;
            CheckpointerShmem->ckpt_started++;
            SpinLockRelease(&CheckpointerShmem->ckpt_lck);
            ConditionVariableBroadcast(&CheckpointerShmem->start_cv);

            if (!RecoveryInProgress())
                ckpt_performed = CreateCheckPoint(flags);
            else
                ckpt_performed = CreateRestartPoint(flags);

            SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
            CheckpointerShmem->ckpt_done = CheckpointerShmem->ckpt_started;
            SpinLockRelease(&CheckpointerShmem->ckpt_lck);
            ConditionVariableBroadcast(&CheckpointerShmem->done_cv);
        }

        CheckArchiveTimeout();
        pgstat_report_checkpointer();
        pgstat_report_wal(true);

        /* Sleep until next checkpoint time or signal */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         cur_timeout * 1000L,
                         WAIT_EVENT_CHECKPOINTER_MAIN);
    }

    /* Shutdown path: write shutdown checkpoint, then exit */
    if (ShutdownXLOGPending)
        ShutdownXLOG(0, 0);
}

The checkpointer uses an unusual signal assignment: SIGTERM is ignored (the postmaster sends SIGTERM to client backends during normal shutdown, but the checkpointer must survive that phase). Shutdown is initiated by SIGINT (writes the shutdown checkpoint) followed by SIGUSR2 (exit). This sequencing is enforced by the postmaster’s PostmasterStateMachine.

Figure 2 — CheckpointerShmem three-counter protocol

flowchart LR
    BE["Backend calls\nRequestCheckpoint"]
    REC["Record ckpt_started\nset ckpt_flags bits\nwrite to shmem"]
    SIG["Set checkpointer latch\nvia checkpointerProc"]
    CKP_S["Checkpointer:\nAbsorbSyncRequests\nincrement ckpt_started\nbroadcast start_cv"]
    EXEC["CreateCheckPoint\nor CreateRestartPoint"]
    CKP_D["Checkpointer:\nset ckpt_done = ckpt_started\nbroadcast done_cv"]
    WAIT_S["Backend waits on\nstart_cv"]
    WAIT_D["Backend waits on\ndone_cv"]
    CHK["Backend checks\nckpt_failed delta"]

    BE --> REC --> SIG --> CKP_S
    BE --> WAIT_S
    CKP_S --> EXEC --> CKP_D
    CKP_S --> WAIT_S
    WAIT_S -->|ckpt_started advanced| WAIT_D
    CKP_D --> WAIT_D
    WAIT_D -->|ckpt_done caught up| CHK

StartupProcessMain (startup.c) is the shortest of the five main functions. Its entire job is to call StartupXLOG and exit:

// StartupProcessMain — src/backend/postmaster/startup.c
void
StartupProcessMain(const void *startup_data, size_t startup_data_len)
{
    MyBackendType = B_STARTUP;
    AuxiliaryProcessMainCommon();

    on_shmem_exit(StartupProcExit, 0);    /* cleanup recovery env on exit */

    pqsignal(SIGHUP, StartupProcSigHupHandler);
    pqsignal(SIGTERM, StartupProcShutdownHandler);    /* request abort */
    pqsignal(SIGUSR2, StartupProcTriggerHandler);     /* promote to primary */
    sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);

    StartupXLOG();    /* replays WAL; in standby mode, loops continuously */
    proc_exit(0);     /* exit code 0 → postmaster transitions PMState */
}

StartupXLOG (in xlog.c, with the replay machinery itself living in xlogrecovery.c) reads the control file, finds the latest checkpoint record, and replays WAL forward. After a clean shutdown (no WAL to replay), it completes quickly. In crash recovery, it replays all WAL since the last checkpoint. In standby mode, it enters a loop that continuously applies incoming WAL from the walreceiver until SIGUSR2 triggers promotion.

Signal semantics for the startup process differ from other auxiliaries: SIGTERM sets shutdown_requested, which causes proc_exit(1) (abnormal exit); SIGUSR2 sets promote_signaled, which StartupXLOG polls via IsPromoteSignaled() to trigger a standby-to-primary transition. The in_restore_command flag allows SIGTERM to be handled immediately during restore_command execution (a safe point to stop), rather than deferred.
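The flag-and-poll pattern behind these semantics can be sketched with plain POSIX signals. This is an illustration, not the startup process's code: the handlers only set flags, the polling function and its `ReplayAction` result are hypothetical, and the in_restore_command shortcut is omitted. SIGUSR2 assumes a POSIX system.

```c
#include <assert.h>
#include <signal.h>

/* Flags written by signal handlers and read by the replay loop. */
static volatile sig_atomic_t shutdown_requested = 0;
static volatile sig_atomic_t promote_signaled = 0;

static void
on_sigterm(int signo)
{
    (void) signo;
    shutdown_requested = 1;    /* handler only records the request */
}

static void
on_sigusr2(int signo)
{
    (void) signo;
    promote_signaled = 1;
}

void
install_handlers(void)
{
    signal(SIGTERM, on_sigterm);
    signal(SIGUSR2, on_sigusr2);
}

/* Hypothetical outcome of one interrupt check inside the replay loop. */
typedef enum { KEEP_REPLAYING, PROMOTE, ABORT_RECOVERY } ReplayAction;

ReplayAction
poll_interrupts(void)
{
    if (shutdown_requested)
        return ABORT_RECOVERY;    /* caller would exit with status 1 */
    if (promote_signaled)
    {
        promote_signaled = 0;     /* acknowledge, as a reset call would */
        return PROMOTE;
    }
    return KEEP_REPLAYING;
}
```

Keeping the handlers this small means all real work happens at well-defined points in the loop, where replay state is consistent.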

Because the startup process exits after recovery is complete, it does not have a steady-state event loop. In standby mode, the WAL replay loop in StartupXLOG is the effective main loop, driven by WakeupRecovery() calls from signal handlers.

SysLoggerMain (syslogger.c) is unlike the other auxiliaries: it is forked before shared memory exists (or before the shared segment is in a known state), does not attach to shared memory, and therefore cannot use any shared infrastructure.

Its mechanism: the postmaster redirects its own stderr (and that of all subsequently forked children) to a pipe before forking the syslogger. The syslogger reads from the read end of syslogPipe, reassembles chunked messages, and writes them to the current log file.

// SysLoggerMain startup (condensed) — src/backend/postmaster/syslogger.c
void
SysLoggerMain(const void *startup_data, size_t startup_data_len)
{
    MyBackendType = B_LOGGER;
    init_ps_display(NULL);    /* no AuxiliaryProcessMainCommon */

    /* Ignore all termination signals; exit only when pipe EOF seen */
    pqsignal(SIGTERM, SIG_IGN);
    pqsignal(SIGQUIT, SIG_IGN);
    pqsignal(SIGUSR1, sigUsr1Handler);    /* request log rotation */
    sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);

    /* Main loop: read pipe, rotate files, sleep */
    for (;;)
    {
        /* read from syslogPipe[0], reassemble chunks into save_buffer list,
           write completed messages to syslogFile / csvlogFile / jsonlogFile */
        process_pipe_input(logbuffer, &bytes_in_logbuffer);

        /* rotate if time- or size-based threshold crossed */
        logfile_rotate(time_based_rotation, size_rotation_for);

        if (pipe_eof_seen)
            break;    /* all writers (postmaster + children) are gone */

        (void) WaitEventSetWait(wes, ...);
    }

    /* flush remaining input, close log files, exit */
}

Three log file handles exist simultaneously: syslogFile (plain text), csvlogFile (CSV format), and jsonlogFile (JSON format) — each written to independently depending on log_destination. Log rotation is triggered by SIGUSR1 (pg_rotate_logfile()) or by time/size thresholds (Log_RotationAge, Log_RotationSize). The pipe_eof_seen flag becomes true when all write ends of syslogPipe have been closed, which only happens after every other process (postmaster and all children) has exited. This makes the syslogger the very last process to exit in a clean shutdown.

Messages are sent through the pipe as fixed-size PipeProtoChunk structs. A backend that generates a long log message splits it across multiple chunks. The syslogger maintains a save_buffer list keyed by source PID to reassemble multi-chunk messages before writing them to the log file.
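Reassembly keyed by source PID can be sketched with a small fixed table. The structures below are hypothetical simplifications: the `Chunk` layout is not the real PipeProtoChunk, the table sizes are arbitrary, and oversize input is simply truncated.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_SOURCES 8      /* hypothetical: partial messages tracked at once */
#define MAX_MESSAGE 256    /* hypothetical: longest reassembled message */

/* Simplified chunk: who sent it, whether it ends the message, payload. */
typedef struct
{
    int         pid;
    bool        is_last;
    const char *data;
    size_t      len;
} Chunk;

typedef struct
{
    int    pid;    /* 0 means the slot is free */
    size_t used;
    char   buf[MAX_MESSAGE];
} PartialMessage;

typedef struct
{
    PartialMessage slots[MAX_SOURCES];
} Reassembler;

/*
 * Feed one chunk. Returns true and fills out (NUL-terminated) when the
 * chunk completes a message; returns false while the message is partial.
 */
bool
feed_chunk(Reassembler *r, const Chunk *c, char *out, size_t out_size)
{
    PartialMessage *slot = NULL;
    PartialMessage *free_slot = NULL;

    assert(c->pid != 0 && out_size > 0);

    /* Find this sender's partial message, remembering a free slot. */
    for (int i = 0; i < MAX_SOURCES; i++)
    {
        if (r->slots[i].pid == c->pid)
        {
            slot = &r->slots[i];
            break;
        }
        if (r->slots[i].pid == 0 && free_slot == NULL)
            free_slot = &r->slots[i];
    }
    if (slot == NULL)
    {
        if (free_slot == NULL)
            return false;    /* table full: this sketch drops the chunk */
        slot = free_slot;
        slot->pid = c->pid;
        slot->used = 0;
    }

    /* Append the payload, truncating anything beyond the buffer. */
    size_t room = MAX_MESSAGE - 1 - slot->used;
    size_t n = (c->len < room) ? c->len : room;
    memcpy(slot->buf + slot->used, c->data, n);
    slot->used += n;

    if (!c->is_last)
        return false;

    /* Message complete: hand it out and free the slot. */
    size_t m = (slot->used < out_size - 1) ? slot->used : out_size - 1;
    memcpy(out, slot->buf, m);
    out[m] = '\0';
    slot->pid = 0;
    return true;
}
```

Because each sender has its own slot, chunks from different processes can interleave on the pipe without their messages being mixed in the log file.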

Figure 3 — syslogger pipe architecture

flowchart TD
    PM["postmaster\nstderr → syslogPipe[1]"]
    BE["backends + aux procs\nstderr → syslogPipe[1]"]
    PIPE["syslogPipe[1]\n(write end)"]
    SYS["SysLoggerMain\nreads syslogPipe[0]"]
    CHUNK["process_pipe_input\nreassemble PipeProtoChunks\nvia save_buffer list"]
    ROTATE["logfile_rotate\ntime / size / SIGUSR1"]
    FILES["syslogFile\ncsvlogFile\njsonlogFile"]

    PM --> PIPE
    BE --> PIPE
    PIPE -->|read| SYS
    SYS --> CHUNK --> FILES
    SYS --> ROTATE --> FILES

The postmaster launches auxiliary processes in a specific order, governed by LaunchMissingBackgroundProcesses and the PMState machine:

  1. Syslogger (B_LOGGER) — launched first, before shared memory, before any other process. Captures all subsequent stderr output.
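  2. Startup process (B_STARTUP) — launched during PM_STARTUP. Drives WAL recovery. Its proc_exit(0) moves pmState to PM_RUN; the intermediate PM_RECOVERY and PM_HOT_STANDBY states are entered earlier, when the startup process signals the postmaster during recovery.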
  3. checkpointer (B_CHECKPOINTER) and bgwriter (B_BG_WRITER) — launched in PM_STARTUP and kept running through PM_RECOVERY, PM_HOT_STANDBY, and PM_RUN. They are wanted even before clients are admitted.
  4. walwriter (B_WAL_WRITER) — launched only in PM_RUN (after full recovery), because WAL writes by clients only occur in primary mode.
  5. The checkpointer, bgwriter, and walwriter are relaunched automatically by LaunchMissingBackgroundProcesses on every ServerLoop iteration whenever they are wanted but not running. The syslogger is handled separately, and the startup process is not relaunched because it exits normally after recovery.
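The launch rules in the list above amount to a lookup from postmaster state to the set of wanted processes. The sketch below encodes only what the list states; the enum and flag names carry an SK_ prefix to mark them as hypothetical stand-ins rather than the postmaster's real definitions.

```c
#include <assert.h>

/* Hypothetical stand-ins for the postmaster states named in the list. */
typedef enum
{
    SK_PM_STARTUP,
    SK_PM_RECOVERY,
    SK_PM_HOT_STANDBY,
    SK_PM_RUN
} SketchPMState;

/* One bit per steady-state auxiliary process. */
enum
{
    SK_WANT_CHECKPOINTER = 1 << 0,
    SK_WANT_BGWRITER     = 1 << 1,
    SK_WANT_WALWRITER    = 1 << 2
};

/* Which auxiliaries should be running in the given state? */
unsigned
wanted_processes(SketchPMState state)
{
    unsigned mask;

    assert(state >= SK_PM_STARTUP && state <= SK_PM_RUN);

    /* Checkpointer and bgwriter are wanted from startup onwards. */
    mask = SK_WANT_CHECKPOINTER | SK_WANT_BGWRITER;

    /* The walwriter is wanted only once recovery has fully finished. */
    if (state == SK_PM_RUN)
        mask |= SK_WANT_WALWRITER;

    return mask;
}
```

A relaunch loop would compare this mask against the set of processes currently alive and start whichever are missing.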

A crash of bgwriter, walwriter, or checkpointer is treated by the postmaster the same as a backend crash: HandleChildCrash sends SIGQUIT to all remaining children and starts a crash-recovery cycle. These processes touch shared memory, so an unexpected exit implies potential shared-memory corruption.

  • AuxiliaryProcessMainCommon (auxprocess.c:39) — shared init: delete postmaster context, call InitAuxiliaryProcess + BaseInit + ProcSignalInit + CreateAuxProcessResourceOwner + pgstat init; registers ShutdownAuxiliaryProcess as a before-shmem-exit callback.
  • ShutdownAuxiliaryProcess (auxprocess.c:98) — releases all LWLocks, cancels condition-variable waits, reports wait-end to pgstat.
  • InitAuxiliaryProcess (proc.c) — allocates a PGPROC for the process without the heavyweight-lock fields used by regular backends.
  • BackgroundWriterMain (bgwriter.c:88) — entry; sets B_BG_WRITER, calls AuxiliaryProcessMainCommon; runs the BgBufferSync loop.
  • BgBufferSync (bufmgr.c) — clock-hand scan of the buffer pool; cleans forward from next_to_clean until it laps the strategy sweep, meets upcoming_alloc_est, or hits bgwriter_lru_maxpages; returns true (can_hibernate) only when bufs_to_lap == 0 && recent_alloc == 0.
  • SyncOneBuffer (bufmgr.c) — writes one dirty, unpinned, not-recently-used buffer; returns a BUF_WRITTEN | BUF_REUSABLE bitmask.
  • StrategySyncStart (freelist.c) — reports current clock-sweep position and the allocation count since the last call.
  • StrategyNotifyBgWriter (freelist.c) — registers / deregisters the bgwriter’s proc number for wakeup on next buffer allocation.
  • WritebackContextInit / WritebackContext — tracks dirty-page writebacks for bgwriter_flush_after coalescing.
  • WalWriterMain (walwriter.c:88) — entry; sets B_WAL_WRITER, advertises ProcGlobal->walwriterProc, runs the XLogBackgroundFlush loop.
  • XLogBackgroundFlush (xlog.c) — writes unwritten WAL buffers and fsyncs to the current insert LSN.
  • SetWalWriterSleeping (walwriter.c) — sets the global flag read by async-commit code to adjust its timeout calculation.
  • CheckpointerMain (checkpointer.c:182) — entry; sets B_CHECKPOINTER; registers pgstat_before_server_shutdown as a shmem-exit callback (sole process to flush cumulative stats at shutdown).
  • CheckpointerShmemInit (checkpointer.c) — allocates CheckpointerShmemStruct from shared memory during postmaster startup.
  • CheckpointerShmemSize (checkpointer.c) — reports the size for CalculateShmemSize.
  • AbsorbSyncRequests (checkpointer.c) — drains the fsync request queue built by backends via ForwardSyncRequest.
  • CreateCheckPoint (xlog.c) — writes a full checkpoint record; flushes all dirty buffers via CheckpointWriteDelay-paced calls to smgrwrite.
  • CreateRestartPoint (xlog.c) — restartpoint equivalent for standby mode.
  • RequestCheckpoint (checkpointer.c) — called by backends to post a request via ckpt_flags and wake the checkpointer by setting its latch (located through ProcGlobal->checkpointerProc); optionally waits for completion via the three-counter protocol.
  • StartupProcessMain (startup.c:216) — entry; sets B_STARTUP; calls StartupXLOG; exits 0 on success.
  • StartupXLOG (xlog.c) — reads control file, replays WAL from last checkpoint; enters the continuous replay loop in standby mode.
  • ProcessStartupProcInterrupts (startup.c:154) — polls got_SIGHUP, shutdown_requested, postmaster-alive check, barrier signals.
  • StartupProcTriggerHandler (startup.c:93) — SIGUSR2 handler; sets promote_signaled and calls WakeupRecovery.
  • IsPromoteSignaled / ResetPromoteSignaled — called by StartupXLOG to poll for and acknowledge the promotion signal.
  • SysLoggerMain (syslogger.c:165) — entry; sets B_LOGGER; no AuxiliaryProcessMainCommon; reads syslogPipe[0]; calls process_pipe_input and logfile_rotate in a loop.
  • process_pipe_input (syslogger.c) — reassembles PipeProtoChunk messages from the pipe using save_buffer lists keyed by source PID.
  • logfile_rotate (syslogger.c) — opens a new log file; closes old one; updates last_sys_file_name / last_csv_file_name / last_json_file_name.
  • SysLogger_Start (syslogger.c) — called by the postmaster to create syslogPipe, redirect stderr, and fork the syslogger.
  • write_syslogger_file (syslogger.c) — the syslogger-side function that writes message text to the currently open log file for the given destination; the backend-side chunking that feeds syslogPipe[1] lives in elog.c.

Position hints (as of 2026-06-05, commit 273fe94)

| Symbol | File | Line |
| --- | --- | --- |
| AuxiliaryProcessMainCommon | src/backend/postmaster/auxprocess.c | 39 |
| ShutdownAuxiliaryProcess | src/backend/postmaster/auxprocess.c | 98 |
| BackgroundWriterMain | src/backend/postmaster/bgwriter.c | 88 |
| BgBufferSync | src/backend/storage/buffer/bufmgr.c | 3629 |
| SyncOneBuffer | src/backend/storage/buffer/bufmgr.c | 3927 |
| BgWriterDelay (GUC) | src/backend/postmaster/bgwriter.c | 58 |
| HIBERNATE_FACTOR (bgwriter) | src/backend/postmaster/bgwriter.c | 64 |
| LOG_SNAPSHOT_INTERVAL_MS | src/backend/postmaster/bgwriter.c | 70 |
| WalWriterMain | src/backend/postmaster/walwriter.c | 88 |
| WalWriterDelay (GUC) | src/backend/postmaster/walwriter.c | 70 |
| LOOPS_UNTIL_HIBERNATE (walwriter) | src/backend/postmaster/walwriter.c | 78 |
| HIBERNATE_FACTOR (walwriter) | src/backend/postmaster/walwriter.c | 79 |
| CheckpointerMain | src/backend/postmaster/checkpointer.c | 182 |
| CheckpointerShmemStruct | src/backend/postmaster/checkpointer.c | 107 |
| CheckPointTimeout (GUC) | src/backend/postmaster/checkpointer.c | 144 |
| CheckPointCompletionTarget (GUC) | src/backend/postmaster/checkpointer.c | 146 |
| StartupProcessMain | src/backend/postmaster/startup.c | 216 |
| ProcessStartupProcInterrupts | src/backend/postmaster/startup.c | 154 |
| StartupProcTriggerHandler | src/backend/postmaster/startup.c | 93 |
| SysLoggerMain | src/backend/postmaster/syslogger.c | 165 |
| Logging_collector (GUC) | src/backend/postmaster/syslogger.c | 70 |
| syslogPipe | src/backend/postmaster/syslogger.c | 114 |
| NBUFFER_LISTS | src/backend/postmaster/syslogger.c | 109 |

  • AuxiliaryProcessMainCommon is called by bgwriter, walwriter, checkpointer, and startup, but NOT by syslogger. Verified: SysLoggerMain (syslogger.c:165) sets MyBackendType = B_LOGGER and calls init_ps_display directly without calling AuxiliaryProcessMainCommon. This is the only auxiliary that does not attach to shared memory. Syslogger may be forked before shared memory exists.

  • bgwriter has two hibernate mechanisms: a 50× sleep multiplier and a StrategyNotifyBgWriter wakeup registration. Verified at bgwriter.c:329–342. The condition is: WL_TIMEOUT return from WaitLatch AND can_hibernate (from BgBufferSync) AND prev_hibernate (true for two consecutive cycles).

  • walwriter advertises its proc number via ProcGlobal->walwriterProc. Verified at walwriter.c:216. This is the field async-committing backends use to set the walwriter’s latch instead of signaling it.

  • checkpointer ignores SIGTERM and uses SIGINT for the shutdown-checkpoint request. Verified at checkpointer.c:202–209. The comment explains the rationale: during a normal Unix shutdown, init sends SIGTERM to all processes; the checkpointer must survive that phase to write the shutdown checkpoint only when signaled explicitly by the postmaster via SIGINT.

  • checkpointer is responsible for flushing cumulative statistics at shutdown. Verified at checkpointer.c:232: before_shmem_exit(pgstat_before_server_shutdown, 0) is registered in CheckpointerMain. The comment states “this needs to be called by exactly one process during a normal shutdown.”

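  • startup process exits with code 0 on success; postmaster uses this to transition PMState. Verified at startup.c:263–264. A non-zero exit never moves the postmaster to PM_RUN. How the postmaster reacts depends on context: exit code 1 in response to a requested shutdown (shutdown_requested calling proc_exit(1)) is treated as an expected exit, while an unexpected non-zero exit is handled as a startup failure or a child crash.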

  • syslogger supports three simultaneous log formats: text, CSV, and JSON. Verified at syslogger.c:82–86: three separate FILE * globals (syslogFile, csvlogFile, jsonlogFile); each written based on log_destination. JSON log support was added in PG15.

  • pipe_eof_seen makes the syslogger the last process to exit. Verified at syslogger.c:219–220 and the main-loop exit condition. Pipe EOF is only possible when all write-end file descriptors are closed. The postmaster keeps syslogPipe[1] open until it exits; thus the syslogger outlives all other processes.

  1. B_IO_WORKER and AuxiliaryProcessMainCommon. PG18 adds B_IO_WORKER processes for async I/O (storage/aio/). Whether they also call AuxiliaryProcessMainCommon and follow the same initialization path has not been verified in this document. Investigation path: storage/aio/aio_worker.c (if it exists).

  2. fsync request queue overflow. CheckpointerShmem holds up to MAX_CHECKPOINT_REQUESTS (10,000,000) pending fsync requests. What happens when the queue fills, whether backends block or fall back to a direct fsync, is not traced here. Investigation path: ForwardSyncRequest and the CompactCheckpointerRequestQueue logic in checkpointer.c, plus the code in md.c that registers fsync requests.

  3. Syslogger on Windows (EXEC_BACKEND). On Windows the syslogger receives open file descriptors via SysloggerStartupData through the startup_data parameter rather than inheriting them from fork(). The full EXEC_BACKEND re-entry path for the syslogger (via SubPostmasterMain) is not traced here.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

  • InnoDB page cleaner evolution. Before MySQL 5.6, dirty-page flushing was done by the InnoDB master thread, which became a bottleneck on multi-core hardware. MySQL 5.6 moved flushing into a dedicated page cleaner thread, and MySQL 5.7 added support for multiple page cleaner threads (the innodb_page_cleaners system variable). PostgreSQL’s bgwriter remains a single process but mitigates this by keeping the scan window small and relying on checkpoint spreading. The PG18 async I/O workers (B_IO_WORKER) add parallel I/O dispatch in the I/O subsystem rather than increasing the flusher count; whether bgwriter writes go through that path is not verified in this document.

  • ARIES checkpoint pacing. The original ARIES paper describes “fuzzy checkpoints” that allow normal processing to continue during the checkpoint. PostgreSQL’s checkpoint_completion_target serves the same goal through pacing: CheckpointWriteDelay (called from BufferSync while CreateCheckPoint is writing dirty buffers) sleeps between writes to spread the checkpoint over the target fraction of the checkpoint interval. Increasing max_wal_size reduces checkpoint frequency; increasing checkpoint_completion_target reduces per-checkpoint I/O spikes.

  • WAL writer as group-commit mechanism. “Architecture of a Database System” (§6.3) describes group commit as a key optimization: transactions waiting for their WAL record to be fsynced can be batched so that one fsync covers many commits. PostgreSQL’s walwriter provides a related batching effect when synchronous_commit = off: committing transactions return without waiting for a flush, the walwriter flushes WAL every wal_writer_delay ms, and a commit record becomes durable within at most wal_writer_delay * 3. This is a documented trade-off: transactions committed within the last wal_writer_delay * 3 before a crash can be lost.

  • Oracle’s recovery architecture comparison. Oracle’s SMON (System MONitor) process handles instance recovery by reading the online redo log. Like PostgreSQL’s startup process, SMON runs before client access is permitted. The difference is that SMON is also responsible for ongoing cleanup tasks (coalescing free extents, cleaning up temporary segments), giving it a dual role that PostgreSQL splits between the startup process (pure recovery) and autovacuum (ongoing cleanup).

  • Log architecture: pipe vs. shared-memory queue. PostgreSQL uses a Unix pipe to collect log output, which serializes all writes through the syslogger. An alternative is a lock-free shared-memory ring buffer (used by some systems for high-throughput logging). The pipe approach has the advantage of working before shared memory is initialized, making it safe to capture early postmaster startup messages. The shared-memory approach avoids the overhead of chunking and reassembly but requires shared memory to be available before the first log message.
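The checkpoint pacing idea from the list above can be sketched as a single schedule check. This is a simplified model under stated assumptions: sketch_checkpoint_on_schedule is a hypothetical name, and only elapsed time is considered, whereas a real implementation may also track how much WAL has been consumed. The check scales write progress by the completion target and compares it with the fraction of the checkpoint interval that has elapsed; when the checkpoint is on or ahead of schedule, the writer can afford to sleep between writes.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Time-based pacing check (illustrative only).
 *
 * progress          : fraction of dirty buffers already written, 0.0 .. 1.0
 * elapsed_s         : seconds since this checkpoint started
 * timeout_s         : checkpoint interval in seconds (checkpoint_timeout)
 * completion_target : fraction of the interval in which all writes should
 *                     finish (checkpoint_completion_target)
 *
 * Returns true when the checkpoint is on or ahead of schedule, meaning the
 * writer may sleep before issuing the next write.
 */
bool
sketch_checkpoint_on_schedule(double progress, double elapsed_s,
                              double timeout_s, double completion_target)
{
    double scaled_progress;
    double elapsed_fraction;

    assert(timeout_s > 0.0);
    assert(progress >= 0.0 && progress <= 1.0);

    /* Writing 100% of the buffers should coincide with completion_target. */
    scaled_progress = progress * completion_target;
    elapsed_fraction = elapsed_s / timeout_s;

    return scaled_progress >= elapsed_fraction;
}
```

For example, with a 300 s interval and a target of 0.9, having written half of the buffers after 60 s is ahead of schedule (0.45 versus 0.2), so the writer would sleep. Having written only 10% after 150 s is behind schedule (0.09 versus 0.5), so writes would continue without delay.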

  • None (synthesized directly from source tree at REL_18_STABLE / commit 273fe94).

Source code paths (REL_18_STABLE / commit 273fe94)

  • src/backend/postmaster/auxprocess.c: AuxiliaryProcessMainCommon, ShutdownAuxiliaryProcess
  • src/backend/postmaster/bgwriter.c: BackgroundWriterMain, hibernate logic, LogStandbySnapshot duty
  • src/backend/storage/buffer/bufmgr.c: BgBufferSync LRU clock-sweep cleaning loop, SyncOneBuffer
  • src/backend/postmaster/walwriter.c: WalWriterMain, SetWalWriterSleeping, XLogBackgroundFlush call
  • src/backend/postmaster/checkpointer.c: CheckpointerMain, CheckpointerShmemStruct, CheckpointerShmemInit, AbsorbSyncRequests, RequestCheckpoint
  • src/backend/postmaster/startup.c: StartupProcessMain, ProcessStartupProcInterrupts, signal handlers, promote logic
  • src/backend/postmaster/syslogger.c: SysLoggerMain, SysLogger_Start, process_pipe_input, logfile_rotate, write_syslogger_file
  • src/include/miscadmin.h: BackendType enum (B_BG_WRITER, B_WAL_WRITER, B_CHECKPOINTER, B_STARTUP, B_LOGGER)
  • Mohan, C., et al. “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging.” ACM TODS, 1992. (Checkpoint concepts, redo recovery.) Curated at knowledge/research/dbms-papers/aries.md.
  • Hellerstein, Stonebraker, Hamilton. Architecture of a Database System, Foundations and Trends in Databases, 2007. §6 (buffer management, background writers, group commit). Curated at knowledge/research/dbms-papers/fntdb07-architecture.md.
  • Stonebraker, M., and Rowe, L. A. “The Design of POSTGRES.” SIGMOD 1986. (Original process model and postmaster concept.)