PostgreSQL Auxiliary Processes — bgwriter, walwriter, checkpointer, startup, and syslogger
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A database server doing all I/O inside client sessions creates a latency coupling that is architecturally undesirable: a client writing a dirty buffer to disk delays itself and, through lock contention, may delay others. The classical solution is to off-load predictable, periodic I/O work into dedicated background processes that run independently of any client session.
Two sub-problems drive the design space:
- Amortizing write-back cost across clients. If every client must flush the page it evicts from the buffer pool, write latency becomes the critical path for common workloads. A background writer that continuously cleans buffers in advance means a client evicting a buffer usually finds a clean page available immediately. The same logic applies to WAL: grouping multiple commit records into a single fsync call (group commit) reduces the per-commit I/O penalty.
- Periodic consistency points (checkpoints). ARIES-style recovery (Mohan et al., 1992) requires the ability to identify a consistent on-disk state from which redo can begin. A checkpoint writes all dirty buffers to data files and records the WAL position. At recovery, only WAL after the most recent checkpoint needs replaying. The more frequently checkpoints occur, the shorter the recovery time; the more aggressively dirty data is flushed, the cheaper each checkpoint becomes. These two forces are in tension with normal I/O bandwidth, so checkpoint scheduling and pacing are non-trivial engineering problems.
Architecture of a Database System (Hellerstein et al., 2007, §6) describes
the background writer pattern as near-universal in buffer-pool managers and
notes that separating the background writer from the checkpoint process —
which PostgreSQL did fully in version 9.2 — allows finer I/O scheduling:
checkpoints can be spread across their interval (checkpoint_completion_target)
while the background writer handles steady-state eviction pressure.
WAL recovery is a different concern. A process that replays WAL records to bring data files to a consistent state must run before any client backend is admitted. This is the startup process pattern: a single process drives the entire recovery pipeline, then signals the postmaster to open connections. In standby mode, this same process never exits — it becomes the continuous redo loop.
Logging infrastructure (capturing stderr from all processes and routing it to rotating files) is not a DBMS-specific problem, but the architectural choice to isolate it in a dedicated process rather than have each backend write directly to log files is significant: it serializes all writes through one writer, eliminating the lock contention that arises from concurrent file writes, and centralizes log rotation logic.
Common DBMS Design
Background writer / page cleaner
Almost every production RDBMS runs a background process (or thread) that proactively flushes dirty pages from the buffer pool:
- Oracle: DBWR (Database Writer) processes; multiple instances for parallelism; woken by buffer-free threshold or timeout.
- MySQL/InnoDB: the io_write threads plus the page-cleaner threads introduced in 5.6 to relieve single-threaded flushing bottlenecks.
- SQL Server: the LazyWriter and Checkpoint Writer; LazyWriter handles steady-state eviction while Checkpoint Writer handles periodic consistency flushes.
- CUBRID: the Flush Manager thread, which mirrors the “background writer + checkpointer” split.
The common pattern is: sleep for a configured interval, scan a portion of the buffer pool’s clock-hand or LRU list, write dirty pages whose LSN is old enough, then sleep again. Feedback loops adjust the scan window or the sleep duration based on how much work was found — if the system is idle, the writer enters a “hibernate” mode and backs off exponentially.
WAL group commit
Grouping commit records to amortize fsync cost is also near-universal.
Each RDBMS has its own name — Oracle’s “redo copy” latching, InnoDB’s
“group commit” optimization, PostgreSQL’s WAL writer with
synchronous_commit = off path. The invariant is: a transaction does not
have to fsync WAL itself if it can rely on another process doing so within
a bounded latency.
Checkpoint pacing
A naive checkpoint writes all dirty pages as fast as possible, creating an
I/O spike. The design insight is that the spike can be spread across the
checkpoint interval. ARIES mentions checkpoint scheduling as an
implementation concern; the practical pacing strategy varies. PostgreSQL’s
checkpoint_completion_target (default 0.9) tells the checkpointer to
finish in 90% of checkpoint_timeout, which leaves headroom without
starving normal I/O.
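The pacing idea can be sketched as a comparison between the fraction of work done and the fraction of the interval elapsed. The following is a minimal illustration, not PostgreSQL's actual code: the function name `ckpt_on_schedule` and all of its parameters are hypothetical, and real pacing also has to account for WAL volume, which this sketch ignores.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: returns true when the checkpoint is at or ahead of
 * schedule, so the writer may nap before issuing the next page write.
 *
 *   pages_written / pages_total : work done so far
 *   elapsed_secs / timeout_secs : time used so far out of the interval
 *   completion_target           : fraction of the interval to finish in
 */
bool
ckpt_on_schedule(int pages_written, int pages_total,
                 double elapsed_secs, double timeout_secs,
                 double completion_target)
{
    double progress;
    double time_fraction;

    if (pages_total <= 0)
        return true;            /* nothing to write */

    /* Scale progress so that 100% of pages maps to target * interval. */
    progress = completion_target *
        (double) pages_written / (double) pages_total;
    time_fraction = elapsed_secs / timeout_secs;

    return progress >= time_fraction;
}
```

With a 300-second interval and a target of 0.9, a checkpoint that has written half of its pages after 100 seconds is ahead of schedule (0.45 against 0.33) and can afford to sleep between writes.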
Recovery as a dedicated early-lifecycle phase
Every WAL-based RDBMS must replay WAL on startup before opening connections.
The question is whether this happens in the main server thread, in a
dedicated subprocess, or in the first client that connects. PostgreSQL’s
answer — a dedicated startup process that signals the postmaster when done
— provides a clean separation: the postmaster itself never reads WAL; it
waits for a PMSIGNAL from the startup process to transition PMState.
Theory ↔ implementation mapping
| Concept | PostgreSQL name |
|---|---|
| Background page cleaner | B_BG_WRITER / BackgroundWriterMain |
| WAL group commit / fsync amortizer | B_WAL_WRITER / WalWriterMain |
| Checkpoint scheduler + executor | B_CHECKPOINTER / CheckpointerMain |
| WAL recovery / redo driver | B_STARTUP / StartupProcessMain → StartupXLOG |
| Stderr log collector + rotator | B_LOGGER / SysLoggerMain |
| Common aux init path | AuxiliaryProcessMainCommon |
| Client-facing checkpoint request | CheckpointerShmem (shared struct) |
PostgreSQL’s Approach
AuxiliaryProcessMainCommon: shared initialization
Four of the five processes (all but the syslogger) call AuxiliaryProcessMainCommon (in auxprocess.c)
after setting MyBackendType. This function provides the minimal
initialization path for processes that do not call the full InitPostgres:

    // AuxiliaryProcessMainCommon — src/backend/postmaster/auxprocess.c
    void
    AuxiliaryProcessMainCommon(void)
    {
        Assert(IsUnderPostmaster);

        /* Release postmaster's working memory context */
        if (PostmasterContext)
        {
            MemoryContextDelete(PostmasterContext);
            PostmasterContext = NULL;
        }

        init_ps_display(NULL);              /* process-title display */

        IgnoreSystemIndexes = true;         /* no catalog reads yet */

        InitAuxiliaryProcess();             /* create PGPROC in shared memory */
        BaseInit();                         /* low-level I/O and buffer init */
        ProcSignalInit(NULL, 0);            /* register in procsignal array */

        CreateAuxProcessResourceOwner();    /* resource tracking sans transactions */

        pgstat_beinit();                    /* backend status infrastructure */
        pgstat_bestart_initial();
        pgstat_bestart_final();

        before_shmem_exit(ShutdownAuxiliaryProcess, 0);    /* LWLock cleanup */

        SetProcessingMode(NormalProcessing);
    }

The critical call is InitAuxiliaryProcess, which allocates a PGPROC
slot for the process. Without a PGPROC, the process cannot acquire LWLocks
and cannot be seen by the procarray. Unlike InitPostgres, auxiliary
initialization does not open a database, does not set up a role, and does
not enable heavyweight locks — auxiliary processes are stateless with respect
to user transactions.
The syslogger is the sole exception: it calls neither InitAuxiliaryProcess
nor AuxiliaryProcessMainCommon because it does not attach to shared memory
at all. Its isolation from the shared segment is intentional — the logger
must survive situations where shared memory is in an unknown state.
bgwriter: proactive buffer cleaner
BackgroundWriterMain (bgwriter.c) runs a continuous loop: call
BgBufferSync, sleep for bgwriter_delay milliseconds, repeat.

    // BackgroundWriterMain — src/backend/postmaster/bgwriter.c
    void
    BackgroundWriterMain(const void *startup_data, size_t startup_data_len)
    {
        MyBackendType = B_BG_WRITER;
        AuxiliaryProcessMainCommon();

        pqsignal(SIGHUP, SignalHandlerForConfigReload);
        pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
        /* ... condensed signal setup ... */

        for (;;)
        {
            ResetLatch(MyLatch);
            ProcessMainLoopInterrupts();

            can_hibernate = BgBufferSync(&wb_context);

            pgstat_report_bgwriter();
            pgstat_report_wal(true);

            if (FirstCallSinceLastCheckpoint())
                smgrdestroyall();   /* free dropped-relation smgr objects */

            /* Periodically log xl_running_xacts for replication standby */
            if (XLogStandbyInfoActive() && !RecoveryInProgress())
            {
                /* ... condensed: LogStandbySnapshot every 15 s ... */
            }

            rc = WaitLatch(MyLatch,
                           WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                           BgWriterDelay, WAIT_EVENT_BGWRITER_MAIN);

            /* Hibernate if nothing happening for two consecutive cycles */
            if (rc == WL_TIMEOUT && can_hibernate && prev_hibernate)
            {
                StrategyNotifyBgWriter(MyProcNumber);
                (void) WaitLatch(MyLatch,
                                 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                                 BgWriterDelay * HIBERNATE_FACTOR,
                                 WAIT_EVENT_BGWRITER_HIBERNATE);
                StrategyNotifyBgWriter(-1);
            }
            prev_hibernate = can_hibernate;
        }
    }

BgBufferSync (in bufmgr.c) performs the actual dirty-page scanning via
the clock-hand algorithm. It returns true (can hibernate) when it found
no work. After two consecutive “can hibernate” cycles, the bgwriter registers
itself with StrategyNotifyBgWriter so the buffer pool strategy will wake it
on the next allocation, then sleeps for BgWriterDelay * 50 ms instead
of BgWriterDelay. This hibernate mechanism cuts idle CPU and disk wakeups
significantly.
The cleaning work itself is the clock-sweep LRU scan inside BgBufferSync.
The bgwriter does not scan the whole pool every cycle; it estimates how many
buffers the next allocation cycle will need (from a smoothed allocation-rate
moving average scaled by bgwriter_lru_multiplier), then cleans forward from
next_to_clean until it has either lapped the strategy clock sweep, satisfied
the estimate, or hit the bgwriter_lru_maxpages cap:

    // BgBufferSync — src/backend/storage/buffer/bufmgr.c
    bool
    BgBufferSync(WritebackContext *wb_context)
    {
        /* Find where the freelist clock sweep is + allocations since last call */
        strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
        PendingBgWriterStats.buf_alloc += recent_alloc;

        if (bgwriter_lru_maxpages <= 0)     /* LRU scan disabled */
        {
            saved_info_valid = false;
            return true;                    /* OK to hibernate */
        }

        /* ... condensed: compute strategy_delta, bufs_to_lap, smoothed_density,
           smoothed_alloc, then upcoming_alloc_est = smoothed_alloc * multiplier ... */

        num_to_scan = bufs_to_lap;
        num_written = 0;
        reusable_buffers = reusable_buffers_est;

        /* Execute the LRU scan: clean forward until lapped / estimate met / capped */
        while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
        {
            int sync_state = SyncOneBuffer(next_to_clean, true, wb_context);

            if (++next_to_clean >= NBuffers)
            {
                next_to_clean = 0;
                next_passes++;
            }
            num_to_scan--;

            if (sync_state & BUF_WRITTEN)
            {
                reusable_buffers++;
                if (++num_written >= bgwriter_lru_maxpages)
                {
                    PendingBgWriterStats.maxwritten_clean++;
                    break;                  /* hit per-cycle write cap */
                }
            }
            else if (sync_state & BUF_REUSABLE)
                reusable_buffers++;
        }

        PendingBgWriterStats.buf_written_clean += num_written;

        /* Hibernate only if we lapped the sweep AND no allocations occurred */
        return (bufs_to_lap == 0 && recent_alloc == 0);
    }

The return value wired into can_hibernate is bufs_to_lap == 0 && recent_alloc == 0: the bgwriter signals it may hibernate only when it has
caught up to the strategy clock sweep and no backend allocated a buffer
since the previous call. SyncOneBuffer (same file) writes one dirty
candidate buffer if it is not pinned or recently used, returning a bitmask of
BUF_WRITTEN / BUF_REUSABLE. The two feedback estimators — smoothed_alloc
(fast-attack, slow-decline allocation rate) and smoothed_density (buffers
scanned per reusable buffer found) — are what let the writer track a moving
workload without re-scanning clean buffers it already passed.
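The "fast-attack, slow-decline" behaviour of the allocation estimator can be illustrated in isolation. This is a sketch under stated assumptions rather than a copy of the source: the helper names `update_smoothed_alloc` and `upcoming_alloc_estimate` are hypothetical, and the smoothing window of 16 samples is an assumed constant chosen for the example.

```c
/* Assumed smoothing window for this illustration. */
#define SMOOTHING_SAMPLES 16.0f

/*
 * Hypothetical helper: fold the latest allocation count into the running
 * estimate. The estimate jumps up immediately when demand rises, but
 * decays only gradually when demand falls.
 */
float
update_smoothed_alloc(float smoothed_alloc, int recent_alloc)
{
    if (smoothed_alloc <= (float) recent_alloc)
        return (float) recent_alloc;    /* fast attack */

    /* slow decline: move 1/SMOOTHING_SAMPLES of the way toward the sample */
    return smoothed_alloc +
        ((float) recent_alloc - smoothed_alloc) / SMOOTHING_SAMPLES;
}

/*
 * Hypothetical helper: scale the smoothed rate by the LRU multiplier to
 * get the number of clean buffers to have ready for the next cycle.
 */
int
upcoming_alloc_estimate(float smoothed_alloc, double lru_multiplier)
{
    return (int) (smoothed_alloc * lru_multiplier);
}
```

The asymmetry is deliberate: underestimating demand forces backends to write their own dirty buffers, while overestimating only costs some extra background writes, so the estimator errs toward the higher figure.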
Figure 1b — BgBufferSync LRU clock-sweep cleaning loop
flowchart TD
START["BgBufferSync"]
SYNCSTART["StrategySyncStart\nstrategy_buf_id, recent_alloc"]
DISABLED["bgwriter_lru_maxpages <= 0?"]
HIB["return true\n(hibernate)"]
EST["estimate upcoming_alloc_est\nfrom smoothed_alloc x multiplier\ncompute bufs_to_lap"]
COND["num_to_scan > 0 AND\nreusable_buffers < est?"]
SYNC["SyncOneBuffer(next_to_clean)"]
ADV["advance next_to_clean\n(wrap to 0, next_passes++)"]
WRITTEN["BUF_WRITTEN?\nnum_written++"]
CAP["num_written >= maxpages?\nbreak"]
RET["return bufs_to_lap == 0\nAND recent_alloc == 0"]
START --> SYNCSTART --> DISABLED
DISABLED -->|yes| HIB
DISABLED -->|no| EST --> COND
COND -->|yes| SYNC --> ADV --> WRITTEN
WRITTEN -->|yes| CAP
CAP -->|under cap| COND
CAP -->|at cap| RET
WRITTEN -->|no| COND
COND -->|no| RET
Two secondary duties live in the bgwriter loop because it is the only
process guaranteed to run regularly: (1) destroying smgr objects for dropped
relations after each checkpoint (since the bgwriter, unlike backends, never
calls AtEOXact_SMgr), and (2) periodically logging xl_running_xacts
snapshots to help standbys reach a consistent state faster.
Figure 1 — bgwriter main loop
flowchart TD
START["BackgroundWriterMain"]
INIT["AuxiliaryProcessMainCommon\nsetup signals"]
LOOP["for(;;)"]
RESET["ResetLatch"]
INTR["ProcessMainLoopInterrupts\n(shutdown / reload)"]
SYNC["BgBufferSync\nreturns can_hibernate"]
STATS["pgstat_report_bgwriter\npgstat_report_wal"]
SMGR["FirstCallSinceLastCheckpoint?\nsmgrdestroyall"]
SNAP["XLogStandbyInfoActive?\nLogStandbySnapshot every 15 s"]
WAIT["WaitLatch(BgWriterDelay)"]
HIB["hibernate?\nStrategyNotifyBgWriter\nWaitLatch(BgWriterDelay * 50)"]
START --> INIT --> LOOP
LOOP --> RESET --> INTR --> SYNC --> STATS --> SMGR --> SNAP --> WAIT
WAIT -->|timeout + can_hibernate x2| HIB --> LOOP
WAIT -->|latch or timeout| LOOP
walwriter: WAL fsync amortizer
WalWriterMain (walwriter.c) calls XLogBackgroundFlush on every cycle
to flush any unflushed WAL buffers:

    // WalWriterMain — src/backend/postmaster/walwriter.c
    void
    WalWriterMain(const void *startup_data, size_t startup_data_len)
    {
        MyBackendType = B_WAL_WRITER;
        AuxiliaryProcessMainCommon();

        /* Advertise proc number so backends can wake us */
        ProcGlobal->walwriterProc = MyProcNumber;

        for (;;)
        {
            if (hibernating != (left_till_hibernate <= 1))
            {
                hibernating = (left_till_hibernate <= 1);
                SetWalWriterSleeping(hibernating);  /* global flag for async commits */
            }

            ResetLatch(MyLatch);
            ProcessMainLoopInterrupts();

            if (XLogBackgroundFlush())
                left_till_hibernate = LOOPS_UNTIL_HIBERNATE;    /* reset to 50 */
            else if (left_till_hibernate > 0)
                left_till_hibernate--;

            pgstat_report_wal(false);

            cur_timeout = (left_till_hibernate > 0)
                ? WalWriterDelay
                : WalWriterDelay * HIBERNATE_FACTOR;            /* 25× */

            (void) WaitLatch(MyLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                             cur_timeout, WAIT_EVENT_WAL_WRITER_MAIN);
        }
    }

ProcGlobal->walwriterProc is the mechanism by which asynchronously
committing backends know whose latch to set when they need a flush.
SetWalWriterSleeping(true) tells async-commit code that the walwriter is
about to enter a long sleep, so the async-commit timeout calculation
(synchronous_commit = off guarantees flush within wal_writer_delay * 3)
remains correct even when the walwriter hibernates.
XLogBackgroundFlush (in xlog.c) writes any WAL buffers that have not
yet been written to the WAL segment and fsyncs up to the current flush point.
It returns true when it flushed anything. After LOOPS_UNTIL_HIBERNATE
(50) consecutive no-op cycles, left_till_hibernate reaches zero and the
sleep is extended to WalWriterDelay * 25.
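The countdown described above can be isolated into a small pure function, which makes the transition into and out of hibernation easy to trace. This is a simplified sketch: the function name `walwriter_next_timeout` and its signature are hypothetical, and the two constants simply mirror the values quoted in the text.

```c
#include <stdbool.h>

#define LOOPS_UNTIL_HIBERNATE 50    /* idle cycles before the long sleep */
#define HIBERNATE_FACTOR      25    /* long sleep = delay * this factor  */

/*
 * Hypothetical helper: update the idle countdown after one cycle and
 * return the sleep time (in ms) for the next wait.
 *
 *   left_till_hibernate : in/out countdown state
 *   did_flush           : whether this cycle found WAL to flush
 *   delay_ms            : the normal per-cycle delay
 */
long
walwriter_next_timeout(int *left_till_hibernate, bool did_flush, long delay_ms)
{
    if (did_flush)
        *left_till_hibernate = LOOPS_UNTIL_HIBERNATE;   /* activity: reset */
    else if (*left_till_hibernate > 0)
        (*left_till_hibernate)--;                       /* idle: count down */

    return (*left_till_hibernate > 0)
        ? delay_ms
        : delay_ms * HIBERNATE_FACTOR;
}
```

Starting from a full countdown with a 200 ms delay, the first 49 idle cycles still return 200, the 50th returns 5000, and a single productive flush restores the short delay.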
checkpointer: checkpoint owner and fsync dispatcher
CheckpointerMain (checkpointer.c) owns all checkpoint and restartpoint
execution. It communicates with backends via CheckpointerShmem, a shared
struct that doubles as a request queue for fsync operations:

    // CheckpointerShmemStruct — src/backend/postmaster/checkpointer.c
    typedef struct
    {
        pid_t       checkpointer_pid;   /* 0 if not started */

        slock_t     ckpt_lck;           /* protects ckpt_* counters */

        int         ckpt_started;       /* incremented at checkpoint start */
        int         ckpt_done;          /* set == ckpt_started on completion */
        int         ckpt_failed;        /* incremented on failure */

        int         ckpt_flags;         /* OR of CHECKPOINT_* request bits */

        ConditionVariable start_cv;     /* signaled when ckpt_started advances */
        ConditionVariable done_cv;      /* signaled when ckpt_done advances */

        int         num_requests;       /* pending fsync requests */
        int         max_requests;
        CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
    } CheckpointerShmemStruct;

The three-counter protocol (ckpt_started, ckpt_done, ckpt_failed)
lets a backend that sends a checkpoint request (via RequestCheckpoint)
wait for its checkpoint to complete without a custom signal. The backend
records ckpt_started before signaling, waits for ckpt_started to
advance (a new checkpoint has begun), then waits for ckpt_done to catch
up to that value. If ckpt_failed increased between start and done, the
checkpoint failed.
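The waiting logic can be traced with a single-threaded model of the three counters. This is only an illustration of the ordering of reads and comparisons: every type and function name below (`CkptCounters`, `CkptTicket`, `ckpt_request`, and so on) is hypothetical, and the spinlock and condition variables that make the real protocol safe across processes are deliberately left out.

```c
#include <stdbool.h>

/* Simplified stand-in for the counters in the shared struct. */
typedef struct CkptCounters
{
    int ckpt_started;
    int ckpt_done;
    int ckpt_failed;
} CkptCounters;

/* What a requesting backend remembers from before it posted its request. */
typedef struct CkptTicket
{
    int old_started;
    int old_failed;
} CkptTicket;

/* Backend side: snapshot the counters at request time. */
CkptTicket
ckpt_request(const CkptCounters *c)
{
    CkptTicket t = { c->ckpt_started, c->ckpt_failed };
    return t;
}

/* Backend side: has a checkpoint begun since the request was posted? */
bool
ckpt_has_started(const CkptCounters *c, CkptTicket t)
{
    return c->ckpt_started != t.old_started;
}

/* Backend side: has the checkpoint numbered new_started completed? */
bool
ckpt_is_done(const CkptCounters *c, int new_started)
{
    return c->ckpt_done - new_started >= 0;
}

/* Backend side: did any checkpoint fail since the request was posted? */
bool
ckpt_failed_since(const CkptCounters *c, CkptTicket t)
{
    return c->ckpt_failed != t.old_failed;
}

/* Checkpointer side: mark the start and the end of one checkpoint. */
void
ckpt_begin(CkptCounters *c)
{
    c->ckpt_started++;
}

void
ckpt_finish(CkptCounters *c, bool ok)
{
    if (!ok)
        c->ckpt_failed++;
    c->ckpt_done = c->ckpt_started;
}
```

The value of waiting for the start counter first is that a checkpoint already in progress when the request arrives does not count: the backend only accepts a checkpoint that began after its flags were posted.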
The main loop:

    // CheckpointerMain loop (condensed) — src/backend/postmaster/checkpointer.c
    void
    CheckpointerMain(const void *startup_data, size_t startup_data_len)
    {
        MyBackendType = B_CHECKPOINTER;
        AuxiliaryProcessMainCommon();

        CheckpointerShmem->checkpointer_pid = MyProcPid;
        ProcGlobal->checkpointerProc = MyProcNumber;

        /* SIGINT = shutdown checkpoint request; SIGUSR2 = exit after that */
        pqsignal(SIGINT, ReqShutdownXLOG);
        pqsignal(SIGTERM, SIG_IGN);         /* ignore normal SIGTERM */
        pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);

        for (;;)
        {
            ResetLatch(MyLatch);
            AbsorbSyncRequests();           /* drain fsync request queue */
            ProcessCheckpointerInterrupts();
            if (ShutdownXLOGPending || ShutdownRequestPending)
                break;

            /* Decide: time-driven or request-driven checkpoint? */
            if (CheckpointerShmem->ckpt_flags)
            {
                do_checkpoint = true;
                chkpt_or_rstpt_requested = true;
            }
            elapsed_secs = (pg_time_t) time(NULL) - last_checkpoint_time;
            if (elapsed_secs >= CheckPointTimeout)
            {
                do_checkpoint = true;
                flags |= CHECKPOINT_CAUSE_TIME;
            }

            if (do_checkpoint)
            {
                /* Broadcast start counter, do checkpoint, broadcast done */
                SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
                CheckpointerShmem->ckpt_flags = 0;
                CheckpointerShmem->ckpt_started++;
                SpinLockRelease(&CheckpointerShmem->ckpt_lck);
                ConditionVariableBroadcast(&CheckpointerShmem->start_cv);

                if (!RecoveryInProgress())
                    ckpt_performed = CreateCheckPoint(flags);
                else
                    ckpt_performed = CreateRestartPoint(flags);

                SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
                CheckpointerShmem->ckpt_done = CheckpointerShmem->ckpt_started;
                SpinLockRelease(&CheckpointerShmem->ckpt_lck);
                ConditionVariableBroadcast(&CheckpointerShmem->done_cv);
            }

            CheckArchiveTimeout();
            pgstat_report_checkpointer();
            pgstat_report_wal(true);

            /* Sleep until next checkpoint time or signal */
            (void) WaitLatch(MyLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                             cur_timeout * 1000L, WAIT_EVENT_CHECKPOINTER_MAIN);
        }

        /* Shutdown path: write shutdown checkpoint, then exit */
        if (ShutdownXLOGPending)
            ShutdownXLOG(0, 0);
    }

The checkpointer uses an unusual signal assignment: SIGTERM is ignored
(the postmaster sends SIGTERM to client backends during normal shutdown,
but the checkpointer must survive that phase). Shutdown is initiated by
SIGINT (writes the shutdown checkpoint) followed by SIGUSR2 (exit).
This sequencing is enforced by the postmaster’s PostmasterStateMachine.
Figure 2 — CheckpointerShmem three-counter protocol
flowchart LR
BE["Backend calls\nRequestCheckpoint"]
REC["Record ckpt_started\nset ckpt_flags bits\nwrite to shmem"]
SIG["Send SIGUSR1\nto checkpointer"]
CKP_S["Checkpointer:\nAbsorbSyncRequests\nincrement ckpt_started\nbroadcast start_cv"]
EXEC["CreateCheckPoint\nor CreateRestartPoint"]
CKP_D["Checkpointer:\nset ckpt_done = ckpt_started\nbroadcast done_cv"]
WAIT_S["Backend waits on\nstart_cv"]
WAIT_D["Backend waits on\ndone_cv"]
CHK["Backend checks\nckpt_failed delta"]
BE --> REC --> SIG --> CKP_S
BE --> WAIT_S
CKP_S --> EXEC --> CKP_D
CKP_S --> WAIT_S
WAIT_S -->|ckpt_started advanced| WAIT_D
CKP_D --> WAIT_D
WAIT_D -->|ckpt_done caught up| CHK
startup process: WAL recovery driver
StartupProcessMain (startup.c) is the shortest of the five main
functions. Its entire job is to call StartupXLOG and exit:

    // StartupProcessMain — src/backend/postmaster/startup.c
    void
    StartupProcessMain(const void *startup_data, size_t startup_data_len)
    {
        MyBackendType = B_STARTUP;
        AuxiliaryProcessMainCommon();

        on_shmem_exit(StartupProcExit, 0);  /* cleanup recovery env on exit */

        pqsignal(SIGHUP, StartupProcSigHupHandler);
        pqsignal(SIGTERM, StartupProcShutdownHandler);  /* request abort */
        pqsignal(SIGUSR2, StartupProcTriggerHandler);   /* promote to primary */

        sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);

        StartupXLOG();  /* replays WAL; in standby mode, loops continuously */

        proc_exit(0);   /* exit code 0 → postmaster transitions PMState */
    }

StartupXLOG (in xlog.c; the redo loop it drives lives in xlogrecovery.c) reads the control file, finds the
latest checkpoint record, and replays WAL forward. In primary-only startup
(no WAL to replay), it completes quickly. In crash recovery, it replays all
WAL since the last checkpoint. In standby mode, it enters a loop that
continuously applies incoming WAL from the walreceiver until SIGUSR2
triggers promotion.
Signal semantics for the startup process differ from other auxiliaries:
SIGTERM sets shutdown_requested, which causes proc_exit(1) (abnormal
exit); SIGUSR2 sets promote_signaled, which StartupXLOG polls via
IsPromoteSignaled() to trigger a standby-to-primary transition. The
in_restore_command flag allows SIGTERM to be handled immediately during
restore_command execution (a safe point to stop), rather than deferred.
Because the startup process exits after recovery is complete, it does not
have a steady-state event loop. In standby mode, the WAL replay loop in
StartupXLOG is the effective main loop, driven by WakeupRecovery() calls
from signal handlers.
syslogger: stderr pipe collector
SysLoggerMain (syslogger.c) is unlike the other auxiliaries: it is
forked before shared memory exists (or before the shared segment is in a
known state), does not attach to shared memory, and therefore cannot use
any shared infrastructure.
Its mechanism: the postmaster redirects its own stderr (and that of all
subsequently forked children) to a pipe before forking the syslogger. The
syslogger reads from the read end of syslogPipe, reassembles chunked
messages, and writes them to the current log file.

    // SysLoggerMain startup (condensed) — src/backend/postmaster/syslogger.c
    void
    SysLoggerMain(const void *startup_data, size_t startup_data_len)
    {
        MyBackendType = B_LOGGER;
        init_ps_display(NULL);  /* no AuxiliaryProcessMainCommon */

        /* Ignore all termination signals; exit only when pipe EOF seen */
        pqsignal(SIGTERM, SIG_IGN);
        pqsignal(SIGQUIT, SIG_IGN);
        pqsignal(SIGUSR1, sigUsr1Handler);  /* request log rotation */

        sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);

        /* Main loop: read pipe, rotate files, sleep */
        for (;;)
        {
            /* read from syslogPipe[0], reassemble chunks into save_buffer list,
               write completed messages to syslogFile / csvlogFile / jsonlogFile */
            process_pipe_input(logbuffer, &bytes_in_logbuffer);

            /* rotate if time- or size-based threshold crossed */
            logfile_rotate(time_based_rotation, size_rotation_for);

            if (pipe_eof_seen)
                break;  /* all writers (postmaster + children) are gone */

            (void) WaitEventSetWait(wes, ...);
        }
        /* flush remaining input, close log files, exit */
    }

Three log file handles exist simultaneously: syslogFile (plain text),
csvlogFile (CSV format), and jsonlogFile (JSON format) — each written
to independently depending on log_destination. Log rotation is triggered
by SIGUSR1 (pg_rotate_logfile()) or by time/size thresholds
(Log_RotationAge, Log_RotationSize). The pipe_eof_seen flag becomes
true when all write ends of syslogPipe have been closed, which only
happens after every other process (postmaster and all children) has exited.
This makes the syslogger the very last process to exit in a clean shutdown.
Messages are sent through the pipe as fixed-size PipeProtoChunk structs.
A backend that generates a long log message splits it across multiple chunks.
The syslogger maintains a save_buffer list keyed by source PID to
reassemble multi-chunk messages before writing them to the log file.
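A toy version of per-PID reassembly shows why interleaved chunks from different processes do not corrupt each other. This is a sketch under loud assumptions: the `Chunk`, `PartialMessage`, and `Reassembler` types, the fixed slot count, and the `reassemble_chunk` function are all hypothetical and do not reflect the real PipeProtoChunk layout or the syslogger's buffer lists. PIDs are assumed to be positive, with 0 marking a free slot.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_SOURCES 8       /* assumed limit on concurrent partial messages */
#define MAX_MESSAGE 1024    /* assumed limit on a reassembled message */

typedef struct Chunk            /* hypothetical wire format */
{
    int  pid;                   /* sender, must be > 0 */
    bool is_last;               /* final chunk of the message? */
    int  len;                   /* bytes used in data[] */
    char data[64];
} Chunk;

typedef struct PartialMessage
{
    int  pid;                   /* 0 means the slot is free */
    int  len;
    char buf[MAX_MESSAGE];
} PartialMessage;

typedef struct Reassembler
{
    PartialMessage slots[MAX_SOURCES];
} Reassembler;

/*
 * Feed one chunk. Returns the length of a completed message copied into
 * out (NUL-terminated), 0 if the message is still incomplete, or -1 on
 * error (no free slot, bad length, or message too large).
 */
int
reassemble_chunk(Reassembler *r, const Chunk *c, char *out, int out_size)
{
    PartialMessage *slot = NULL;
    PartialMessage *free_slot = NULL;
    int             n;

    /* Find the sender's slot, remembering the first free one. */
    for (int i = 0; i < MAX_SOURCES; i++)
    {
        if (r->slots[i].pid == c->pid)
            slot = &r->slots[i];
        else if (r->slots[i].pid == 0 && free_slot == NULL)
            free_slot = &r->slots[i];
    }
    if (slot == NULL)
    {
        if (free_slot == NULL)
            return -1;
        slot = free_slot;
        slot->pid = c->pid;
        slot->len = 0;
    }

    if (c->len < 0 || c->len > (int) sizeof(c->data) ||
        slot->len + c->len > MAX_MESSAGE)
    {
        slot->pid = 0;          /* drop the partial message */
        return -1;
    }

    memcpy(slot->buf + slot->len, c->data, (size_t) c->len);
    slot->len += c->len;

    if (!c->is_last)
        return 0;

    n = slot->len;
    slot->pid = 0;              /* release the slot either way */
    if (n + 1 > out_size)
        return -1;
    memcpy(out, slot->buf, (size_t) n);
    out[n] = '\0';
    return n;
}
```

Feeding the first half of a message from one PID, then a complete message from another, then the second half from the first PID yields the two messages intact and in completion order.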
Figure 3 — syslogger pipe architecture
flowchart TD
PM["postmaster\nstderr → syslogPipe[1]"]
BE["backends + aux procs\nstderr → syslogPipe[1]"]
PIPE["syslogPipe[1]\n(write end)"]
SYS["SysLoggerMain\nreads syslogPipe[0]"]
CHUNK["process_pipe_input\nreassemble PipeProtoChunks\nvia save_buffer list"]
ROTATE["logfile_rotate\ntime / size / SIGUSR1"]
FILES["syslogFile\ncsvlogFile\njsonlogFile"]
PM --> PIPE
BE --> PIPE
PIPE -->|read| SYS
SYS --> CHUNK --> FILES
SYS --> ROTATE --> FILES
Startup order and lifetime
The postmaster launches auxiliary processes in a specific order, governed
by LaunchMissingBackgroundProcesses and the PMState machine:
- Syslogger (B_LOGGER) — launched first, before shared memory, before any other process. Captures all subsequent stderr output.
- Startup process (B_STARTUP) — launched during PM_STARTUP. Drives WAL recovery. Its proc_exit(0) moves pmState to PM_RECOVERY or PM_RUN.
- checkpointer (B_CHECKPOINTER) and bgwriter (B_BG_WRITER) — launched in PM_STARTUP and kept running through PM_RECOVERY, PM_HOT_STANDBY, and PM_RUN. They are wanted even before clients are admitted.
- walwriter (B_WAL_WRITER) — launched only in PM_RUN (after full recovery), because WAL writes by clients only occur in primary mode.
- The checkpointer, bgwriter, and walwriter are relaunched automatically by LaunchMissingBackgroundProcesses on every ServerLoop iteration if they are found missing. The syslogger is handled separately, and the startup process is not relaunched because it exits normally after recovery.
A crash of bgwriter, walwriter, or checkpointer is treated by the postmaster
the same as a backend crash: HandleChildCrash sends SIGQUIT to all
remaining children and starts a crash-recovery cycle. These processes touch
shared memory, so an unexpected exit implies potential shared-memory
corruption.
Source Walkthrough
AuxiliaryProcessMainCommon
- AuxiliaryProcessMainCommon (auxprocess.c:39) — shared init: delete postmaster context, call InitAuxiliaryProcess + BaseInit + ProcSignalInit + CreateAuxProcessResourceOwner + pgstat init; registers ShutdownAuxiliaryProcess as a before-shmem-exit callback.
- ShutdownAuxiliaryProcess (auxprocess.c:98) — releases all LWLocks, cancels condition-variable waits, reports wait-end to pgstat.
- InitAuxiliaryProcess (proc.c) — allocates a PGPROC for the process without the heavyweight-lock fields used by regular backends.
bgwriter
- BackgroundWriterMain (bgwriter.c:88) — entry; sets B_BG_WRITER, calls AuxiliaryProcessMainCommon; runs the BgBufferSync loop.
- BgBufferSync (bufmgr.c) — clock-hand scan of the buffer pool; cleans forward from next_to_clean until it laps the strategy sweep, meets upcoming_alloc_est, or hits bgwriter_lru_maxpages; returns true (can_hibernate) only when bufs_to_lap == 0 && recent_alloc == 0.
- SyncOneBuffer (bufmgr.c) — writes one dirty, unpinned, not-recently-used buffer; returns a BUF_WRITTEN | BUF_REUSABLE bitmask.
- StrategySyncStart (freelist.c) — reports current clock-sweep position and the allocation count since the last call.
- StrategyNotifyBgWriter (freelist.c) — registers / deregisters the bgwriter’s proc number for wakeup on next buffer allocation.
- WritebackContextInit / WritebackContext — tracks dirty-page writebacks for bgwriter_flush_after coalescing.
walwriter
- WalWriterMain (walwriter.c:88) — entry; sets B_WAL_WRITER, advertises ProcGlobal->walwriterProc, runs the XLogBackgroundFlush loop.
- XLogBackgroundFlush (xlog.c) — writes unwritten WAL buffers and fsyncs to the current insert LSN.
- SetWalWriterSleeping (walwriter.c) — sets the global flag read by async-commit code to adjust its timeout calculation.
checkpointer
- CheckpointerMain (checkpointer.c:182) — entry; sets B_CHECKPOINTER; registers pgstat_before_server_shutdown as a shmem-exit callback (sole process to flush cumulative stats at shutdown).
- CheckpointerShmemInit (checkpointer.c) — allocates CheckpointerShmemStruct from shared memory during postmaster startup.
- CheckpointerShmemSize (checkpointer.c) — reports the size for CalculateShmemSize.
- AbsorbSyncRequests (checkpointer.c) — drains the fsync request queue built by backends via ForwardSyncRequest.
- CreateCheckPoint (xlog.c) — writes a full checkpoint record; flushes all dirty buffers via CheckpointWriteDelay-paced calls to smgrwrite.
- CreateRestartPoint (xlog.c) — restartpoint equivalent for standby mode.
- RequestCheckpoint (checkpointer.c) — called by backends to post a request via ckpt_flags and send SIGUSR1; optionally waits for completion via the three-counter protocol.
startup process
- StartupProcessMain (startup.c:216) — entry; sets B_STARTUP; calls StartupXLOG; exits 0 on success.
- StartupXLOG (xlog.c) — reads control file, replays WAL from last checkpoint via the redo loop in xlogrecovery.c; in standby mode that loop keeps running until promotion.
- ProcessStartupProcInterrupts (startup.c:154) — polls got_SIGHUP, shutdown_requested, postmaster-alive check, barrier signals.
- StartupProcTriggerHandler (startup.c:93) — SIGUSR2 handler; sets promote_signaled and calls WakeupRecovery.
- IsPromoteSignaled / ResetPromoteSignaled — called by StartupXLOG to poll for and acknowledge the promotion signal.
syslogger
- SysLoggerMain (syslogger.c:165) — entry; sets B_LOGGER; no AuxiliaryProcessMainCommon; reads syslogPipe[0]; calls process_pipe_input and logfile_rotate in a loop.
- process_pipe_input (syslogger.c) — reassembles PipeProtoChunk messages from the pipe using save_buffer lists keyed by source PID.
- logfile_rotate (syslogger.c) — opens a new log file; closes old one; updates last_sys_file_name / last_csv_file_name / last_json_file_name.
- SysLogger_Start (syslogger.c) — called by the postmaster to create syslogPipe, redirect stderr, and fork the syslogger.
- write_syslogger_file (syslogger.c) — writes a reassembled message to the log file selected by its destination argument. The backend-side chunking of messages into syslogPipe[1] is done by write_pipe_chunks (elog.c).
Position hints (as of 2026-06-05, commit 273fe94)
| Symbol | File | Line |
|---|---|---|
| AuxiliaryProcessMainCommon | src/backend/postmaster/auxprocess.c | 39 |
| ShutdownAuxiliaryProcess | src/backend/postmaster/auxprocess.c | 98 |
| BackgroundWriterMain | src/backend/postmaster/bgwriter.c | 88 |
| BgBufferSync | src/backend/storage/buffer/bufmgr.c | 3629 |
| SyncOneBuffer | src/backend/storage/buffer/bufmgr.c | 3927 |
| BgWriterDelay (GUC) | src/backend/postmaster/bgwriter.c | 58 |
| HIBERNATE_FACTOR (bgwriter) | src/backend/postmaster/bgwriter.c | 64 |
| LOG_SNAPSHOT_INTERVAL_MS | src/backend/postmaster/bgwriter.c | 70 |
| WalWriterMain | src/backend/postmaster/walwriter.c | 88 |
| WalWriterDelay (GUC) | src/backend/postmaster/walwriter.c | 70 |
| LOOPS_UNTIL_HIBERNATE (walwriter) | src/backend/postmaster/walwriter.c | 78 |
| HIBERNATE_FACTOR (walwriter) | src/backend/postmaster/walwriter.c | 79 |
| CheckpointerMain | src/backend/postmaster/checkpointer.c | 182 |
| CheckpointerShmemStruct | src/backend/postmaster/checkpointer.c | 107 |
| CheckPointTimeout (GUC) | src/backend/postmaster/checkpointer.c | 144 |
| CheckPointCompletionTarget (GUC) | src/backend/postmaster/checkpointer.c | 146 |
| StartupProcessMain | src/backend/postmaster/startup.c | 216 |
| ProcessStartupProcInterrupts | src/backend/postmaster/startup.c | 154 |
| StartupProcTriggerHandler | src/backend/postmaster/startup.c | 93 |
| SysLoggerMain | src/backend/postmaster/syslogger.c | 165 |
| Logging_collector (GUC) | src/backend/postmaster/syslogger.c | 70 |
| syslogPipe | src/backend/postmaster/syslogger.c | 114 |
| NBUFFER_LISTS | src/backend/postmaster/syslogger.c | 109 |
Source verification (as of 2026-06-05)
Verified facts
- `AuxiliaryProcessMainCommon` is called by bgwriter, walwriter, checkpointer, and startup, but NOT by syslogger. Verified: `SysLoggerMain` (`syslogger.c:165`) sets `MyBackendType = B_LOGGER` and calls `init_ps_display` directly without calling `AuxiliaryProcessMainCommon`. This is the only auxiliary that does not attach to shared memory. Syslogger may be forked before shared memory exists.
- bgwriter has two hibernate mechanisms: a 50× sleep multiplier and a `StrategyNotifyBgWriter` wakeup registration. Verified at `bgwriter.c:329–342`. The bgwriter hibernates only when three conditions hold together: `WaitLatch` returned `WL_TIMEOUT`, `can_hibernate` (from `BgBufferSync`) is true, and `prev_hibernate` is true, meaning the previous cycle also found nothing to do (two consecutive idle cycles).
- walwriter advertises its proc number via `ProcGlobal->walwriterProc`. Verified at `walwriter.c:216`. This is the field async-committing backends use to set the walwriter’s latch instead of signaling it.
- checkpointer ignores `SIGTERM` and uses `SIGINT` for the shutdown-checkpoint request. Verified at `checkpointer.c:202–209`. The comment explains the rationale: during a normal Unix shutdown, `init` sends `SIGTERM` to all processes. The checkpointer ignores that signal so it stays alive, and it writes the shutdown checkpoint only when the postmaster explicitly requests it via `SIGINT`.
- checkpointer is responsible for flushing cumulative statistics at shutdown. Verified at `checkpointer.c:232`: `before_shmem_exit(pgstat_before_server_shutdown, 0)` is registered in `CheckpointerMain`. The comment states “this needs to be called by exactly one process during a normal shutdown.”
- startup process exits with code 0 on success; postmaster uses this to transition `PMState`. Verified at `startup.c:263–264`. A non-zero exit (e.g., `proc_exit(1)` after `shutdown_requested` is set) never leads to `PM_RUN`: the postmaster treats it as a startup failure or child crash, unless a shutdown was already in progress, in which case it continues shutting down.
- syslogger supports three simultaneous log formats: text, CSV, and JSON. Verified at `syslogger.c:82–86`: three separate `FILE *` globals (`syslogFile`, `csvlogFile`, `jsonlogFile`); each written based on `log_destination`. JSON log support was added in PG15.
- `pipe_eof_seen` makes the syslogger the last process to exit. Verified at `syslogger.c:219–220` and the main-loop exit condition. The pipe reports EOF only once every write-end file descriptor has been closed. The postmaster holds a write end of `syslogPipe` open until it exits, so the syslogger outlives all other processes.
Open questions
- `B_IO_WORKER` and `AuxiliaryProcessMainCommon`. PG18 adds `B_IO_WORKER` processes for async I/O (`storage/aio/`). Whether they also call `AuxiliaryProcessMainCommon` and follow the same initialization path has not been verified in this document. Investigation path: `storage/aio/aio_worker.c` (if it exists).
- fsync request queue overflow. `CheckpointerShmem` holds up to `MAX_CHECKPOINT_REQUESTS` (10,000,000) pending fsync requests. What happens when the queue fills, whether backends block or fall back to direct fsync, is not traced here. Investigation path: `ForwardSyncRequest` and the `CompactCheckpointerRequestQueue` logic, both in `checkpointer.c`.
- Syslogger on Windows (`EXEC_BACKEND`). On Windows the syslogger receives open file descriptors via `SysloggerStartupData` through the `startup_data` parameter rather than inheriting them from `fork()`. The full `EXEC_BACKEND` re-entry path for the syslogger (via `SubPostmasterMain`) is not traced here.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
- InnoDB page cleaner evolution. Before MySQL 5.6, InnoDB flushed dirty pages from its master thread. MySQL 5.6 moved flushing into a dedicated page cleaner thread, and MySQL 5.7 added support for multiple page cleaner threads (the `innodb_page_cleaners` system variable) because a single flusher became a bottleneck on multi-core hardware. PostgreSQL’s bgwriter remains a single process but mitigates this by keeping the scan window small and relying on checkpoint spreading. PG18’s asynchronous I/O subsystem (`B_IO_WORKER`) moves some parallelism into the I/O layer rather than the flusher count, although how much of the bgwriter’s write path it covers is not verified in this document.
- ARIES checkpoint pacing. The original ARIES paper describes “fuzzy checkpoints” that allow normal processing to continue during the checkpoint. PostgreSQL’s `checkpoint_completion_target` serves the same goal through write pacing: `CheckpointWriteDelay` (called from `BufferSync` while `CreateCheckPoint` writes out dirty buffers) sleeps between writes to spread the checkpoint over the target fraction of the checkpoint interval. Increasing `max_wal_size` reduces checkpoint frequency; increasing `checkpoint_completion_target` reduces per-checkpoint I/O spikes.
- WAL writer as group-commit mechanism. “Architecture of a Database System” (§6.3) describes group commit as a key optimization: transactions waiting for their WAL records to be fsynced can be batched so that one fsync covers many commits. PostgreSQL’s walwriter applies the same batching to asynchronous commits. With `synchronous_commit = off`, a committing transaction returns without waiting for the flush, the walwriter flushes WAL every `wal_writer_delay` ms, and the records become durable within at most `wal_writer_delay * 3`. This is a documented trade-off: transactions committed within the last `wal_writer_delay * 3` ms can be lost on a crash. At the default `wal_writer_delay` of 200 ms, that window is up to 600 ms.
- Oracle’s recovery architecture comparison. Oracle’s SMON (System MONitor) process handles instance recovery by reading the online redo log. Like PostgreSQL’s startup process, SMON runs before client access is permitted. The difference is that SMON is also responsible for ongoing cleanup tasks (coalescing free extents, cleaning up temporary segments), giving it a dual role that PostgreSQL splits between the startup process (pure recovery) and autovacuum (ongoing cleanup).
- Log architecture: pipe vs. shared-memory queue. PostgreSQL uses a Unix pipe to collect log output, which serializes all writes through the syslogger. An alternative is a lock-free shared-memory ring buffer (used by some systems for high-throughput logging). The pipe approach has the advantage of working before shared memory is initialized, making it safe to capture early postmaster startup messages. The shared-memory approach avoids the overhead of chunking and reassembly but requires shared memory to be available before the first log message.
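The checkpoint pacing idea in the list above (spread writes over a target fraction of the checkpoint interval) can be illustrated with a time-only schedule check. This is a minimal sketch under stated assumptions: `checkpoint_on_schedule` is a hypothetical helper, it considers elapsed time only, and the actual implementation may weigh further inputs such as WAL volume. The pacing loop would sleep between writes while the function returns true and write without pausing while it returns false.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * progress:          fraction of dirty buffers already written (0.0 .. 1.0)
 * elapsed_s:         seconds since the checkpoint started
 * timeout_s:         checkpoint interval in seconds (checkpoint_timeout)
 * completion_target: fraction of the interval the writes should fill
 *                    (checkpoint_completion_target)
 *
 * Returns true when the checkpoint is at or ahead of schedule, so the
 * writer can afford to pause before the next write.
 */
static bool checkpoint_on_schedule(double progress, double elapsed_s,
                                   double timeout_s, double completion_target)
{
    assert(progress >= 0.0 && progress <= 1.0);
    assert(elapsed_s >= 0.0 && timeout_s > 0.0);
    assert(completion_target > 0.0 && completion_target <= 1.0);

    /* Scale progress so that 100% written maps to the target fraction. */
    double scaled_progress = progress * completion_target;
    double elapsed_fraction = elapsed_s / timeout_s;

    return scaled_progress >= elapsed_fraction;
}
```

For a 300-second interval and a target of 0.9, a checkpoint that is half written counts as on schedule until 135 seconds have elapsed (0.5 × 0.9 × 300). After that point it is behind and should stop sleeping between writes.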
Sources
Raw source files consumed
- None (synthesized directly from source tree at REL_18_STABLE / commit 273fe94).
Source code paths (REL_18_STABLE / commit 273fe94)
- `src/backend/postmaster/auxprocess.c`: `AuxiliaryProcessMainCommon`, `ShutdownAuxiliaryProcess`
- `src/backend/postmaster/bgwriter.c`: `BackgroundWriterMain`, hibernate logic, `LogStandbySnapshot` duty
- `src/backend/storage/buffer/bufmgr.c`: `BgBufferSync` LRU clock-sweep cleaning loop, `SyncOneBuffer`
- `src/backend/postmaster/walwriter.c`: `WalWriterMain`, `SetWalWriterSleeping`, `XLogBackgroundFlush` call
- `src/backend/postmaster/checkpointer.c`: `CheckpointerMain`, `CheckpointerShmemStruct`, `CheckpointerShmemInit`, `AbsorbSyncRequests`, `RequestCheckpoint`
- `src/backend/postmaster/startup.c`: `StartupProcessMain`, `ProcessStartupProcInterrupts`, signal handlers, promote logic
- `src/backend/postmaster/syslogger.c`: `SysLoggerMain`, `SysLogger_Start`, `process_pipe_input`, `logfile_rotate`, `write_syslogger_file`
- `src/include/miscadmin.h`: `BackendType` enum (`B_BG_WRITER`, `B_WAL_WRITER`, `B_CHECKPOINTER`, `B_STARTUP`, `B_LOGGER`)
Textbook and paper references
- Mohan, C., et al. “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging.” ACM TODS, 1992. (Checkpoint concepts, redo recovery.) Curated at `knowledge/research/dbms-papers/aries.md`.
- Hellerstein, Stonebraker, Hamilton. Architecture of a Database System, Foundations and Trends in Databases, 2007. §6 (buffer management, background writers, group commit). Curated at `knowledge/research/dbms-papers/fntdb07-architecture.md`.
- Stonebraker, M., and Rowe, L. A. “The Design of POSTGRES.” SIGMOD 1986. (Original process model and postmaster concept.)