PostgreSQL Checkpoint — The Durability Anchor: redo-pointer fixation, buffer flush, and WAL truncation
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A checkpoint in a write-ahead logging (WAL) system is a durable marker that allows the recovery manager to discard all WAL records written before it. Without checkpoints, crash recovery would have to replay the entire WAL history from the very beginning, an O(total-WAL) operation that grows without bound. The checkpoint cuts that tail: after a successful checkpoint at LSN L, recovery need only replay WAL starting from L, because every change before L has already been written to stable storage.
The canonical theory is ARIES (Mohan et al., ARIES: A Transaction
Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks
Using Write-Ahead Logging, ACM TODS 1992; captured at
knowledge/research/dbms-papers/aries.md). ARIES defines the checkpoint
obligation precisely: before writing the checkpoint record at LSN L, the
system must ensure that every data page whose PageLSN ≤ L has been
flushed to stable storage. This is sometimes called the write-ahead rule
for checkpoints — a complement to the write-ahead rule for individual page
writes. Once the checkpoint record is durable, the redo pointer embedded
in it is the earliest LSN that recovery must start from.
ARIES further distinguishes two checkpoint styles:
- Consistent (quiesce) checkpoint. All active transactions are quiesced (no new writes) while the snapshot is taken. The redo pointer equals the checkpoint record LSN. Simple to reason about, but it stalls every writer for the duration of the checkpoint and is unsuitable for high-concurrency systems.
- Fuzzy (online) checkpoint. Writes continue while the buffer flush proceeds. The redo pointer is fixed before the flush begins, at the earliest LSN that could generate a dirty page during the checkpoint window. Because transactions insert WAL records while buffers are being written, the checkpoint record is written at the end of the flush, not the beginning. Recovery must replay from the redo pointer (which precedes the checkpoint record in LSN order), not from the checkpoint record itself.
PostgreSQL uses the fuzzy model for online checkpoints. The redo pointer is pinned at the start, the flush proceeds concurrently, and the checkpoint record is written last. Only shutdown checkpoints use a quiesce variant, because no concurrent WAL insertions are possible at shutdown.
Database Internals (Petrov, ch. 5 “Transaction Processing and Recovery”) identifies two further sub-problems that every checkpoint design must solve:
- I/O amplification. Writing all dirty buffers at once at checkpoint time creates a large, sudden I/O spike, which competes with foreground query I/O and can violate latency SLAs. The remedy is checkpoint pacing: spread the dirty-buffer writes over a fraction of the inter-checkpoint interval, using the inter-checkpoint period as a natural budget.
- WAL retention vs. space consumption. WAL segments before the oldest live redo pointer can be recycled. But if a standby replica or a replication slot holds a position that precedes the redo pointer, segments must be retained longer. Getting this calculation wrong either wastes disk space or breaks replicas.
Both problems have direct implementations in PostgreSQL (see § PostgreSQL’s Approach).
Common DBMS Design
The ARIES theory shapes every production checkpoint implementation into a recognizable pattern. This section names the engineering conventions that PostgreSQL, Oracle, InnoDB, DB2, and SQL Server all share.
Dedicated checkpoint process
Nearly every engine separates checkpoint I/O from normal query processing by running it in a dedicated background process (or thread). This prevents checkpoint latency from appearing in user query response times and allows the checkpoint engine to set its own I/O priority without contending with foreground threads over scheduler timeslices.
A redo pointer fixed at start, checkpoint record written at end
For fuzzy checkpoints the sequence is always: (1) fix the redo pointer while
holding brief exclusive access to the WAL insertion state, (2) flush all
pages whose PageLSN ≤ redo pointer, (3) write and flush the checkpoint
record that carries the redo pointer value. Recovery reads the checkpoint
record only to learn the redo pointer; replay starts at the redo pointer,
not at the checkpoint record's own LSN. In a fuzzy checkpoint the two LSNs
are separated by all the WAL generated while the flush ran, which can be a
large share of the WAL budget (max_wal_size in PostgreSQL).
pg_control / control file as the recovery anchor
Every ARIES implementation maintains a small, frequently updated control file that records the location of the latest checkpoint record. On startup the recovery manager reads this file first, locates the checkpoint record, extracts the redo pointer, and begins replay. The control file is single-page atomic (or uses shadow-copy) to tolerate partial writes.
Paced I/O to bound the spike
Engines spread the dirty-buffer flush over the inter-checkpoint window by
throttling the buffer writer. The tuning knob is usually expressed as a
fraction of the checkpoint interval: “aim to finish the flush by the time
target_fraction × checkpoint_interval has elapsed.” Sleeping between
batches of page writes is the common mechanism. The system accelerates
writes when it detects it is running behind.
fsync consolidation
Writing a page and fsyncing it are two different system calls. In a busy system, many backends write the same page (or pages in the same file) multiple times between checkpoints. An efficient implementation batches fsync calls per file at checkpoint time rather than fsyncing after every write. This turns O(writes) fsyncs into O(files-touched) fsyncs, which is a large reduction in practice. The checkpoint process owns the fsync queue; backends that write files directly register their files with the checkpoint process so that the consolidation occurs centrally.
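The consolidation idea can be sketched in a few lines of C. This is an illustration only, not the code of any engine: `PendingSyncs`, `remember_write`, `fsyncs_needed`, and `MAX_PENDING` are hypothetical names, and files are identified by plain integers.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 64              /* hypothetical capacity for the sketch */

typedef struct PendingSyncs
{
    int     file_ids[MAX_PENDING];  /* distinct files awaiting an fsync */
    size_t  nfiles;
} PendingSyncs;

/*
 * Record that file_id was written.
 * Returns 1 if the file was newly registered, 0 if it was already pending,
 * and -1 if the table is full (the caller would have to fsync by itself).
 */
static int
remember_write(PendingSyncs *p, int file_id)
{
    assert(p != NULL);

    for (size_t i = 0; i < p->nfiles; i++)
        if (p->file_ids[i] == file_id)
            return 0;               /* duplicate: no extra fsync needed */

    if (p->nfiles == MAX_PENDING)
        return -1;

    p->file_ids[p->nfiles++] = file_id;
    return 1;
}

/* One fsync per distinct file, however many writes were registered. */
static size_t
fsyncs_needed(const PendingSyncs *p)
{
    return p->nfiles;
}
```

Registering the write sequence 7, 7, 9, 7, 9, 3 leaves three pending files, so six writes cost three fsyncs at checkpoint time.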
Restartpoint during recovery
While in archive or streaming recovery, a standby cannot take a full checkpoint (it cannot write new WAL other than in the primary’s stream). Instead it takes a restartpoint: a record noting “we have replayed safely up to this previously-written checkpoint record, and all pages referenced up to that point have been flushed.” A restartpoint allows WAL segments before the corresponding checkpoint’s redo pointer to be recycled on the standby without the primary needing to take a new checkpoint.
PostgreSQL’s Approach
Process model: the checkpointer
PostgreSQL devotes a single auxiliary process to all checkpoint work.
CheckpointerMain (in postmaster/checkpointer.c) is launched by
AuxiliaryProcessMain with MyBackendType = B_CHECKPOINTER. The process
owns:
- the timer loop that fires every checkpoint_timeout seconds,
- consumption of CHECKPOINT_CAUSE_XLOG signals when WAL usage exceeds max_wal_size,
- the shutdown checkpoint when it receives SIGINT from the postmaster,
- the fsync request queue that backends post to via ForwardSyncRequest.
Before PostgreSQL 9.2 the bgwriter process performed both background buffer writing and checkpoints. The separation into dedicated bgwriter and checkpointer processes avoids priority inversion: the bgwriter can perform steady proactive writes independent of checkpoint pacing.
CheckpointerShmemStruct: the request and completion protocol
Backends and the checkpointer communicate through a small shared-memory structure:

    // CheckpointerShmemStruct (postmaster/checkpointer.c)
    typedef struct
    {
        pid_t       checkpointer_pid;   /* PID (0 if not started) */
        slock_t     ckpt_lck;           /* protects ckpt_* fields */
        int         ckpt_started;       /* increments when a checkpoint starts */
        int         ckpt_done;          /* set to ckpt_started on completion */
        int         ckpt_failed;        /* increments on failure */
        int         ckpt_flags;         /* OR of pending request flags */
        ConditionVariable start_cv;     /* signaled when ckpt_started advances */
        ConditionVariable done_cv;      /* signaled when ckpt_done advances */
        int         num_requests;
        int         max_requests;
        CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];    /* fsync queue */
    } CheckpointerShmemStruct;

RequestCheckpoint uses a six-step handshake:
- Record ckpt_failed and ckpt_started under ckpt_lck; OR in request flags.
- Set the checkpointer’s latch via SetLatch.
- Sleep on start_cv until ckpt_started advances.
- Record the new ckpt_started.
- Sleep on done_cv until ckpt_done ≥ new_started (modulo arithmetic).
- If ckpt_failed changed, the checkpoint failed; otherwise it succeeded.
This protocol lets multiple backends request checkpoints concurrently
without any of them needing to run the checkpoint themselves. The OR
semantics of ckpt_flags mean that multiple requests are coalesced: a
CHECKPOINT_IMMEDIATE flag set by any backend takes effect for the next
checkpoint regardless of other concurrent requests.
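The "modulo arithmetic" in the wait on done_cv can be illustrated with a wraparound-safe counter comparison. The sketch below assumes 32-bit counters that are allowed to wrap and uses unsigned subtraction so that the wrap is well defined in C; `counter_reached` and `next_checkpoint_id` are hypothetical helpers, not PostgreSQL functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns 1 when `done` has caught up with or passed `target`, treating both
 * as positions on a 32-bit circle. Unsigned subtraction wraps by definition,
 * so a difference below 2^31 means "done is at or ahead of target".
 */
static int
counter_reached(uint32_t done, uint32_t target)
{
    return (uint32_t) (done - target) < UINT32_C(0x80000000);
}

/* The id the next checkpoint will carry; wraps from UINT32_MAX to 0. */
static uint32_t
next_checkpoint_id(uint32_t started)
{
    return (uint32_t) (started + 1u);
}
```

A waiter that recorded a target just before the wrap still sees completion once the counter has wrapped past it, which a plain `done >= target` test would miss.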
The CheckPoint record struct
The body of XLOG_CHECKPOINT_ONLINE and XLOG_CHECKPOINT_SHUTDOWN WAL
records is the CheckPoint struct from src/include/catalog/pg_control.h:

    // CheckPoint (src/include/catalog/pg_control.h)
    typedef struct CheckPoint
    {
        XLogRecPtr      redo;               /* REDO start point */
        TimeLineID      ThisTimeLineID;
        TimeLineID      PrevTimeLineID;
        bool            fullPageWrites;
        int             wal_level;
        FullTransactionId nextXid;          /* next free XID */
        Oid             nextOid;
        MultiXactId     nextMulti;
        MultiXactOffset nextMultiOffset;
        TransactionId   oldestXid;          /* cluster-wide datfrozenxid */
        Oid             oldestXidDB;
        MultiXactId     oldestMulti;
        Oid             oldestMultiDB;
        pg_time_t       time;
        TransactionId   oldestCommitTsXid;
        TransactionId   newestCommitTsXid;
        TransactionId   oldestActiveXid;    /* oldest XID still running (online ckpt) */
    } CheckPoint;

The redo field is the redo pointer. ControlFileData.checkPointCopy holds
a copy of the latest CheckPoint; ControlFileData.checkPoint holds the
LSN of the checkpoint record itself. Recovery reads both: checkPoint to
locate the WAL record, and checkPointCopy.redo to know where to start
replaying.
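How the two control-file values cooperate can be shown with a toy model. `MiniControlFile`, `replay_start`, and `bytes_before_record` are hypothetical stand-ins for illustration; the real ControlFileData carries many more fields.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr */

typedef struct MiniControlFile
{
    Lsn     checkPoint;         /* LSN of the latest checkpoint record */
    Lsn     checkPointRedo;     /* stand-in for checkPointCopy.redo */
} MiniControlFile;

/* Replay begins at the redo pointer, which never follows the record. */
static Lsn
replay_start(const MiniControlFile *cf)
{
    assert(cf->checkPointRedo <= cf->checkPoint);
    return cf->checkPointRedo;
}

/* WAL that must be replayed although it precedes the checkpoint record. */
static uint64_t
bytes_before_record(const MiniControlFile *cf)
{
    return cf->checkPoint - replay_start(cf);
}
```

For an online checkpoint the gap is the WAL written during the flush; for a shutdown checkpoint the two values coincide and the gap is zero.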
ControlFileData.state (a DBState enum with values DB_STARTUP,
DB_SHUTDOWNING, DB_SHUTDOWNED, DB_IN_PRODUCTION, etc.) is updated around
shutdown checkpoints so that the next startup can tell a clean shutdown
from a crash; a torn write of pg_control itself is caught by the file's CRC.
CreateCheckPoint: the online checkpoint flow
CreateCheckPoint in xlog.c performs the full checkpoint sequence. For an
online (non-shutdown) checkpoint:

    // CreateCheckPoint (src/backend/access/transam/xlog.c), condensed
    bool
    CreateCheckPoint(int flags)
    {
        /* 1. Reject if we're already in recovery (except end-of-recovery). */
        if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
            elog(ERROR, "can't create a checkpoint during recovery");

        /* 2. smgr pre-checkpoint hook (outside critical section). */
        SyncPreCheckpoint();

        START_CRIT_SECTION();

        /* 3. Gather XID/OID/MultiXact watermarks into checkPoint struct. */
        checkPoint.nextXid = TransamVariables->nextXid;    /* under XidGenLock */
        checkPoint.nextOid = TransamVariables->nextOid;    /* under OidGenLock */
        /* ... MultiXact, CommitTs fields ... */

        /* 4. Skip if nothing has changed since last checkpoint (idle, non-forced). */
        if (last_important_lsn == ControlFile->checkPoint)
        {
            END_CRIT_SECTION();
            return false;
        }

        /* 5. Pin the redo pointer: insert XLOG_CHECKPOINT_REDO. */
        XLogBeginInsert();
        XLogRegisterData(&wal_level, sizeof(wal_level));
        (void) XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
        checkPoint.redo = RedoRecPtr;       /* now fixed */

        END_CRIT_SECTION();

        /* 6. Wait for transactions in commit critical sections to clear. */
        vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
        while (HaveVirtualXIDsDelayingChkpt(...)) { AbsorbSyncRequests(); ... }

        /* 7. Flush all dirty buffers and SLRUs (paced). */
        CheckPointGuts(checkPoint.redo, flags);

        /* 8. Wait for transactions completing their commit. */
        vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
        while (HaveVirtualXIDsDelayingChkpt(...)) { AbsorbSyncRequests(); ... }

        START_CRIT_SECTION();

        /* 9. Write and flush XLOG_CHECKPOINT_ONLINE record. */
        XLogBeginInsert();
        XLogRegisterData(&checkPoint, sizeof(checkPoint));
        recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_ONLINE);
        XLogFlush(recptr);

        /* 10. Update pg_control atomically. */
        ControlFile->checkPoint = ProcLastRecPtr;   /* LSN of ckpt record */
        ControlFile->checkPointCopy = checkPoint;
        UpdateControlFile();

        END_CRIT_SECTION();

        /* 11. Wake WAL summarizer; recycle old WAL segments. */
        WakeupWalSummarizer();
        RemoveOldXlogFiles(...);
        UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
        return true;
    }

Step 5 is the key insight of the fuzzy checkpoint: XLOG_CHECKPOINT_REDO
is a minimal WAL record whose LSN becomes checkPoint.redo. Any page
modified by a WAL record inserted at or after this LSN carries a
PageLSN > redo, so recovery knows it must replay from redo forward to
reconstruct those pages. The subsequent XLOG_CHECKPOINT_ONLINE record
embeds the same checkPoint.redo value, so recovery knows where to start
even though the redo record itself has no payload beyond a wal_level hint
for the WAL summarizer.
For shutdown checkpoints (flags & CHECKPOINT_IS_SHUTDOWN) the sequence
is simpler: because no concurrent WAL insertions are possible, the redo
pointer equals the next WAL insertion point (computed directly from
Insert->CurrBytePos). Only XLOG_CHECKPOINT_SHUTDOWN is written; the
separate XLOG_CHECKPOINT_REDO record is skipped.
Redo-pointer fixation: the two computation paths
The redo pointer is the single most important value the checkpoint
produces, and CreateCheckPoint computes it by two distinct routes
depending on whether the cluster is quiescing. The shutdown route computes
it directly while CreateCheckPoint still holds every WAL insertion lock;
the online route runs after those locks are released and lets the insertion
of XLOG_CHECKPOINT_REDO set RedoRecPtr, so in both cases the value is
consistent with the WAL tip:

    // CreateCheckPoint (src/backend/access/transam/xlog.c)
    /* all WAL insertion locks are held exclusively at this point */
    if (shutdown)
    {
        XLogRecPtr  curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);

        /*
         * Compute new REDO record ptr = location of next XLOG record.
         * Since this is a shutdown checkpoint, there can't be any concurrent
         * WAL insertion.
         */
        freespace = INSERT_FREESPACE(curInsert);
        if (freespace == 0)
        {
            if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
                curInsert += SizeOfXLogLongPHD;
            else
                curInsert += SizeOfXLogShortPHD;
        }
        checkPoint.redo = curInsert;

        /* update shared RedoRecPtr while holding all insertion locks */
        RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
    }

    WALInsertLockRelease();     /* let other xacts proceed during the flush */

    if (!shutdown)
    {
        /* Include WAL level in record for WAL summarizer's benefit. */
        XLogBeginInsert();
        XLogRegisterData(&wal_level, sizeof(wal_level));
        (void) XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);

        /* XLogInsertRecord already advanced Insert.RedoRecPtr + RedoRecPtr; */
        /* copy that LSN into the checkpoint record we will write at the end. */
        checkPoint.redo = RedoRecPtr;
    }

The shutdown branch must skip past the page header that an empty WAL page
would otherwise force: when INSERT_FREESPACE(curInsert) == 0 the next
record cannot begin at curInsert (a page boundary), so the redo pointer
is bumped past SizeOfXLogLongPHD (at a segment boundary) or
SizeOfXLogShortPHD (mid-segment) — exactly where XLogInsertRecord
would have placed the next record. The online branch sidesteps this
arithmetic entirely: it simply inserts the XLOG_CHECKPOINT_REDO
record and lets the normal insertion path report the LSN back through
RedoRecPtr. The crucial comment — “we can’t postpone advancing
RedoRecPtr because XLogInserts that happen while we are dumping buffers
must assume that their buffer changes are not included in the checkpoint”
— is the fuzzy-checkpoint invariant stated in the negative: any buffer
dirtied at or after RedoRecPtr is outside this checkpoint and must be
recovered by replaying from RedoRecPtr forward.
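The page-header adjustment in the shutdown branch can be reproduced as standalone arithmetic. The constants below are assumptions chosen for the sketch (an 8 kB WAL page, 16 MB segments, 24-byte short and 40-byte long page headers); a real build takes them from its headers and configuration, and `shutdown_redo_ptr` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for this sketch only. */
#define WAL_PAGE_SIZE   8192u                   /* stand-in for XLOG_BLCKSZ */
#define SHORT_PHD_SIZE  24u                     /* stand-in for SizeOfXLogShortPHD */
#define LONG_PHD_SIZE   40u                     /* stand-in for SizeOfXLogLongPHD */
#define SEGMENT_SIZE    (16u * 1024u * 1024u)   /* stand-in for wal_segment_size */

/* Where the next record would begin if it were inserted at cur_insert. */
static uint64_t
shutdown_redo_ptr(uint64_t cur_insert)
{
    assert(SEGMENT_SIZE % WAL_PAGE_SIZE == 0);

    /* Free space on the page is zero exactly at a page boundary. */
    if (cur_insert % WAL_PAGE_SIZE == 0)
    {
        if (cur_insert % SEGMENT_SIZE == 0)
            cur_insert += LONG_PHD_SIZE;    /* first page of a segment */
        else
            cur_insert += SHORT_PHD_SIZE;   /* any other page */
    }
    return cur_insert;
}
```

An insert position in the middle of a page is returned unchanged; only positions that land exactly on a page boundary are bumped past the header.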
flowchart TD
A["CreateCheckPoint(flags)"] --> B["SyncPreCheckpoint<br/>(smgr pre-hook, outside crit section)"]
B --> C["START_CRIT_SECTION<br/>gather nextXid / nextOid / MultiXact<br/>into CheckPoint struct"]
C --> D{"last_important_lsn ==<br/>ControlFile->checkPoint ?<br/>(idle, non-forced)"}
D -->|yes| E["END_CRIT_SECTION<br/>return false (skip)"]
D -->|no| F{"shutdown ?"}
F -->|"online"| G["XLogInsert(XLOG_CHECKPOINT_REDO)<br/>pin checkPoint.redo = RedoRecPtr"]
F -->|"shutdown"| H["redo = XLogBytePosToRecPtr(CurrBytePos)<br/>no separate REDO record"]
G --> I["GetVirtualXIDsDelayingChkpt(DELAY_CHKPT_START)<br/>wait for commit-critical xacts"]
H --> I
I --> J["CheckPointGuts(checkPoint.redo, flags)<br/>flush SLRUs + CheckPointBuffers (paced)<br/>ProcessSyncRequests (fsync)"]
J --> K["GetVirtualXIDsDelayingChkpt(DELAY_CHKPT_COMPLETE)<br/>wait for completing commits"]
K --> L["XLogInsert(XLOG_CHECKPOINT_ONLINE / SHUTDOWN)<br/>XLogFlush(recptr)"]
L --> M["LWLock(ControlFileLock):<br/>ControlFile->checkPoint = ProcLastRecPtr<br/>checkPointCopy = checkPoint<br/>UpdateControlFile"]
M --> N["END_CRIT_SECTION<br/>WakeupWalSummarizer<br/>SyncPostCheckpoint"]
N --> O["UpdateCheckPointDistanceEstimate<br/>KeepLogSeg + RemoveOldXlogFiles<br/>(recycle / remove WAL)"]
O --> P["PreallocXlogFiles<br/>return true"]
Figure 1 — the online checkpoint sequence inside CreateCheckPoint
(xlog.c). The load-bearing ordering is the fuzzy-checkpoint invariant:
XLOG_CHECKPOINT_REDO pins checkPoint.redo before CheckPointGuts
flushes dirty buffers, while XLOG_CHECKPOINT_ONLINE — carrying that same
redo value — is written after the flush completes. The control-file
update under ControlFileLock is the single atomic point that makes the
new checkpoint the recovery anchor; only then are old WAL segments recycled
via KeepLogSeg + RemoveOldXlogFiles. The shutdown branch collapses the
two-record scheme to one XLOG_CHECKPOINT_SHUTDOWN because no concurrent
WAL insertion is possible.
CheckPointGuts: flushing all dirty state
CheckPointGuts(checkPointRedo, flags) is the I/O-heavy phase:

    // CheckPointGuts (src/backend/access/transam/xlog.c)
    static void
    CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
    {
        CheckPointRelationMap();
        CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);
        CheckPointSnapBuild();
        CheckPointLogicalRewriteHeap();
        CheckPointReplicationOrigin();

        /* Write out all dirty data in SLRUs and main buffer pool */
        CheckPointCLOG();
        CheckPointCommitTs();
        CheckPointSUBTRANS();
        CheckPointMultiXact();
        CheckPointPredicate();
        CheckPointBuffers(flags);   /* <-- main shared_buffers flush */

        /* Fsync everything */
        ProcessSyncRequests();

        /* 2PC state last */
        CheckPointTwoPhase(checkPointRedo);
    }

CheckPointBuffers calls BufferSync in storage/buffer/bufmgr.c, which
iterates over the buffer pool in two passes. The tag pass marks every
buffer that was dirty when the checkpoint began with BM_CHECKPOINT_NEEDED
and records it in CkptBufferIds[], so that pages dirtied after this point
are excluded from this checkpoint (they belong to the next one):

    // BufferSync (src/backend/storage/buffer/bufmgr.c): tag pass
    num_to_scan = 0;
    for (buf_id = 0; buf_id < NBuffers; buf_id++)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(buf_id);

        buf_state = LockBufHdr(bufHdr);

        if ((buf_state & mask) == mask)     /* BM_DIRTY [| BM_PERMANENT] */
        {
            CkptSortItem *item;

            buf_state |= BM_CHECKPOINT_NEEDED;
            item = &CkptBufferIds[num_to_scan++];
            item->buf_id = buf_id;
            item->tsId = bufHdr->tag.spcOid;    /* sort key: tablespace */
            /* ... relNumber / forkNum / blockNum ... */
        }
        UnlockBufHdr(bufHdr, buf_state);
    }

The write pass then drains a per-tablespace min-heap so writes are
balanced across tablespaces (sorting alone would write one tablespace at a
time, starving the others’ hardware). After each buffer it reports
progress to CheckpointWriteDelay, which is where pacing happens:

    // BufferSync (src/backend/storage/buffer/bufmgr.c): write pass
    num_processed = 0;
    while (!binaryheap_empty(ts_heap))
    {
        CkptTsStatus *ts_stat = (CkptTsStatus *)
            DatumGetPointer(binaryheap_first(ts_heap));

        buf_id = CkptBufferIds[ts_stat->index].buf_id;
        bufHdr = GetBufferDescriptor(buf_id);
        num_processed++;

        /* Flag may have been cleared by a backend/bgwriter writing it first. */
        if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
        {
            if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
                num_written++;
        }

        /* Advance per-tablespace progress, re-balance the heap. */
        ts_stat->progress += ts_stat->progress_slice;
        /* ... binaryheap_remove_first / binaryheap_replace_first ... */

        /* Sleep to throttle our I/O rate. */
        CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
    }

progress = num_processed / num_to_scan is the fraction-of-work estimate
that CheckpointWriteDelay compares against the elapsed-time and elapsed-WAL
fractions — that single ratio is the bridge between the buffer-flush loop in
bufmgr.c and the checkpoint_completion_target pacing logic in
checkpointer.c. Note the BM_CHECKPOINT_NEEDED re-check inside the loop:
because a normal backend or the bgwriter may have already written (and
cleared the flag on) a tagged buffer, the checkpointer only issues
SyncOneBuffer for buffers still carrying the flag, avoiding redundant I/O.
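The balancing effect of progress_slice is easiest to see in a small simulation. The sketch below is a simplification: a linear scan for the least-progressed tablespace stands in for the binary heap, and `TsStatus` and `balanced_order` are hypothetical names that only mirror the fields mentioned above.

```c
#include <assert.h>
#include <stddef.h>

typedef struct TsStatus
{
    int     num_to_scan;    /* buffers tagged in this tablespace */
    int     num_scanned;    /* buffers already processed */
    double  progress;       /* advances by progress_slice per buffer */
    double  progress_slice; /* total buffers / buffers in this tablespace */
} TsStatus;

/*
 * Fills order[] with the tablespace index chosen for each successive write
 * and returns the number of writes. Ties go to the lower index.
 */
static size_t
balanced_order(TsStatus *ts, size_t nts, int *order, size_t cap)
{
    int     total = 0;
    size_t  n = 0;

    for (size_t i = 0; i < nts; i++)
    {
        assert(ts[i].num_to_scan > 0);
        total += ts[i].num_to_scan;
    }
    for (size_t i = 0; i < nts; i++)
    {
        ts[i].num_scanned = 0;
        ts[i].progress = 0.0;
        ts[i].progress_slice = (double) total / ts[i].num_to_scan;
    }

    while (n < (size_t) total && n < cap)
    {
        size_t  best = nts;     /* sentinel: nothing chosen yet */

        for (size_t i = 0; i < nts; i++)
        {
            if (ts[i].num_scanned == ts[i].num_to_scan)
                continue;       /* this tablespace is finished */
            if (best == nts || ts[i].progress < ts[best].progress)
                best = i;
        }
        order[n++] = (int) best;
        ts[best].num_scanned++;
        ts[best].progress += ts[best].progress_slice;
    }
    return n;
}
```

With two tablespaces holding 2 and 4 tagged buffers, the writes interleave instead of draining one tablespace first, and both tablespaces finish at the same overall progress value.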
flowchart TD
A["CheckPointBuffers(flags)<br/>-> BufferSync(flags)"] --> B["TAG PASS:<br/>scan all NBuffers under LockBufHdr"]
B --> C{"buf_state & mask == mask<br/>(BM_DIRTY [| BM_PERMANENT]) ?"}
C -->|no| B
C -->|yes| D["set BM_CHECKPOINT_NEEDED<br/>append to CkptBufferIds[num_to_scan++]<br/>(buf_id, tsId, relNumber, forkNum, blockNum)"]
D --> B
B --> E["sort CkptBufferIds by tablespace<br/>build per-tablespace min-heap (ts_heap)<br/>progress_slice = num_to_scan / ts.num_to_scan"]
E --> F{"binaryheap_empty(ts_heap) ?"}
F -->|yes| K["IssuePendingWritebacks<br/>CheckpointStats.ckpt_bufs_written += num_written"]
F -->|no| G["pick heap-first tablespace<br/>buf_id = CkptBufferIds[ts.index]<br/>num_processed++"]
G --> H{"state & BM_CHECKPOINT_NEEDED<br/>still set ?"}
H -->|"no (backend/bgwriter wrote it)"| I["skip write"]
H -->|yes| J["SyncOneBuffer -> FlushBuffer<br/>num_written++"]
I --> L["ts.progress += progress_slice<br/>remove_first or replace_first on heap"]
J --> L
L --> M["CheckpointWriteDelay(flags,<br/>num_processed / num_to_scan)"]
M --> N{"!CHECKPOINT_IMMEDIATE &&<br/>IsCheckpointOnSchedule(progress) ?"}
N -->|"yes (ahead)"| O["AbsorbSyncRequests<br/>WaitLatch(100 ms)"]
N -->|"no (behind)"| P["return immediately<br/>(no sleep)"]
O --> F
P --> F
Figure 3 — BufferSync (bufmgr.c) and its coupling to checkpoint pacing.
The tag pass freezes the working set (BM_CHECKPOINT_NEEDED) so concurrent
writers cannot enlarge this checkpoint’s obligation; the write pass drains a
per-tablespace min-heap to keep every tablespace’s storage busy. The single
load-bearing number that flows out of the loop is num_processed / num_to_scan — passed to CheckpointWriteDelay, which sleeps 100 ms only
while IsCheckpointOnSchedule reports the flush is ahead of both the
elapsed-time and elapsed-WAL budgets, and never sleeps under
CHECKPOINT_IMMEDIATE.
ProcessSyncRequests in storage/sync/sync.c consolidates all fsync
requests into one fsync-per-file call. Before this call, AbsorbSyncRequests
transfers the fsync queue from CheckpointerShmem->requests[] (posted by
backends via ForwardSyncRequest) into the sync module’s internal pending
table, deduplicating by file tag.
Checkpoint pacing: CheckpointWriteDelay and IsCheckpointOnSchedule

    // CheckpointWriteDelay (postmaster/checkpointer.c)
    void
    CheckpointWriteDelay(int flags, double progress)
    {
        if (!(flags & CHECKPOINT_IMMEDIATE) &&
            !ShutdownXLOGPending &&
            IsCheckpointOnSchedule(progress))
        {
            AbsorbSyncRequests();
            CheckArchiveTimeout();
            WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT, 100, ...);
            ResetLatch(MyLatch);
        }
    }

IsCheckpointOnSchedule(progress) computes two elapsed-fraction estimates:
- WAL-based: (current_insert_LSN - ckpt_start_recptr) / (wal_segment_size × CheckPointSegments)
- Time-based: elapsed_seconds / checkpoint_timeout
If progress × checkpoint_completion_target is at least as large as both
fractions, the checkpoint is ahead of schedule and takes a 100 ms sleep. If
it falls behind either estimate, writes continue without sleeping. This
adaptive throttle ensures the flush completes approximately at
checkpoint_completion_target × checkpoint_timeout under steady WAL load,
leaving the tail of the interval for the synchronous ProcessSyncRequests
phase.
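The schedule test reduces to two comparisons against the scaled progress. The function below is a standalone sketch of that decision; `on_schedule` and its parameter names are hypothetical, and the elapsed fractions are passed in rather than computed from clocks and LSNs.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * progress:          fraction of tagged buffers processed so far (0..1)
 * completion_target: stand-in for checkpoint_completion_target
 * elapsed_wal:       WAL written since checkpoint start / WAL budget
 * elapsed_time:      seconds since checkpoint start / checkpoint_timeout
 */
static bool
on_schedule(double progress, double completion_target,
            double elapsed_wal, double elapsed_time)
{
    assert(progress >= 0.0 && progress <= 1.0);

    double  scaled = progress * completion_target;

    if (scaled < elapsed_wal)
        return false;       /* behind the WAL budget: keep writing */
    if (scaled < elapsed_time)
        return false;       /* behind the time budget: keep writing */
    return true;            /* ahead of both: a short sleep is affordable */
}
```

With half the buffers written and a target of 0.9, the scaled progress is 0.45; the checkpoint may sleep only while both elapsed fractions are at or below that value.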
CreateRestartPoint: checkpoints during recovery
During archive or streaming recovery the startup process replays WAL records,
and every time it encounters a checkpoint record it calls
RecoveryRestartPoint to stash the record in XLogCtl->lastCheckPoint.
The checkpointer periodically calls CreateRestartPoint:

    // CreateRestartPoint (src/backend/access/transam/xlog.c), condensed
    bool
    CreateRestartPoint(int flags)
    {
        /* 1. Fetch the most recent safe checkpoint from shared memory. */
        SpinLockAcquire(&XLogCtl->info_lck);
        lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr;
        lastCheckPoint = XLogCtl->lastCheckPoint;
        SpinLockRelease(&XLogCtl->info_lck);

        /* 2. Bail if we've already made a restartpoint at this checkpoint. */
        if (lastCheckPoint.redo <= ControlFile->checkPointCopy.redo)
            return false;

        /* 3. Pin the redo pointer. */
        WALInsertLockAcquireExclusive();
        RedoRecPtr = XLogCtl->Insert.RedoRecPtr = lastCheckPoint.redo;
        WALInsertLockRelease();

        /* 4. Flush buffers and SLRUs. */
        CheckPointGuts(lastCheckPoint.redo, flags);

        /* 5. Update pg_control. */
        LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
        ControlFile->checkPoint = lastCheckPointRecPtr;
        ControlFile->checkPointCopy = lastCheckPoint;
        UpdateControlFile();
        LWLockRelease(ControlFileLock);

        /* 6. Recycle old WAL segments. */
        RemoveOldXlogFiles(...);
        return true;
    }

CreateRestartPoint differs from CreateCheckPoint in three ways: (a) it
does not write a new WAL record — the checkpoint record was already written
by the primary; (b) it can only advance to checkpoints the startup process
has already replayed; (c) it runs concurrently with PerformWalRecovery
(which the startup process is executing in parallel), so care is taken to
read lastCheckPoint under info_lck.
flowchart TD
A["CheckpointerMain<br/>MyBackendType = B_CHECKPOINTER"] --> B["for (;;) loop:<br/>ResetLatch · AbsorbSyncRequests<br/>ProcessCheckpointerInterrupts"]
B --> C{"ShutdownXLOGPending ||<br/>ShutdownRequestPending ?"}
C -->|yes| Z["break loop -><br/>ShutdownXLOG(0,0)<br/>(final shutdown ckpt/rstpt)"]
C -->|no| D{"CheckpointerShmem->ckpt_flags != 0 ?"}
D -->|yes| G["do_checkpoint = true"]
D -->|no| E{"elapsed_secs >=<br/>CheckPointTimeout ?"}
E -->|yes| F["do_checkpoint = true<br/>flags |= CHECKPOINT_CAUSE_TIME"]
E -->|no| W["WaitLatch(cur_timeout)<br/>= min(CheckPointTimeout,<br/>XLogArchiveTimeout)"]
W --> B
F --> H
G --> H["SpinLock(ckpt_lck):<br/>flags |= ckpt_flags · ckpt_flags = 0<br/>ckpt_started++"]
H --> I["ConditionVariableBroadcast(start_cv)<br/>ckpt_start_recptr = GetInsertRecPtr /<br/>GetXLogReplayRecPtr"]
I --> J{"do_restartpoint =<br/>RecoveryInProgress() ?<br/>(unless END_OF_RECOVERY)"}
J -->|"false (primary)"| K["CreateCheckPoint(flags)"]
J -->|"true (standby)"| L["CreateRestartPoint(flags)"]
L --> L1{"lastCheckPoint.redo <=<br/>checkPointCopy.redo ?"}
L1 -->|yes| L2["return false<br/>(no new replayed ckpt;<br/>retry in 15 s)"]
L1 -->|no| L3["pin RedoRecPtr = lastCheckPoint.redo<br/>CheckPointGuts(redo, flags)<br/>update ControlFile · RemoveOldXlogFiles"]
K --> M["smgrdestroyall"]
L2 --> M
L3 --> M
M --> N["SpinLock(ckpt_lck):<br/>ckpt_done = ckpt_started<br/>ConditionVariableBroadcast(done_cv)"]
N --> B
Figure 2 — the CheckpointerMain event loop (checkpointer.c) and its
restartpoint branch. Each iteration first drains the fsync queue
(AbsorbSyncRequests), then decides to act on either a requested checkpoint
(ckpt_flags nonzero) or a timed one (elapsed_secs >= CheckPointTimeout).
The branch on RecoveryInProgress() is what splits a primary
(CreateCheckPoint) from a standby (CreateRestartPoint); the
CHECKPOINT_END_OF_RECOVERY flag forces the full checkpoint path even
during recovery. The start_cv / done_cv broadcasts bracket the work so
backends blocked in RequestCheckpoint observe start and completion. Note
the restartpoint early-out: if no checkpoint WAL record has been replayed
since the last restartpoint (lastCheckPoint.redo <= checkPointCopy.redo),
it returns false and the loop retries 15 s later.
WAL segment recycling and max_wal_size
After updating pg_control, both CreateCheckPoint and
CreateRestartPoint call RemoveOldXlogFiles with a horizon computed from
the new redo pointer. The algorithm:
- Convert RedoRecPtr to a segment number _logSegNo.
- Call KeepLogSeg(recptr, &_logSegNo) to extend the horizon for wal_keep_size and any replication slot that holds an older LSN.
- Optionally call InvalidateObsoleteReplicationSlots to invalidate (not drop) slots whose required WAL would otherwise prevent recycling, for example when max_slot_wal_keep_size is exceeded.
- Decrement _logSegNo and pass it to RemoveOldXlogFiles, which recycles (renames to a higher segment name) or removes segments older than the horizon.
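The horizon computation amounts to taking the oldest of several positions and converting it to a segment number. The sketch below is a simplified model, not KeepLogSeg itself: it measures the keep window in bytes, takes the slot minimum as a single precomputed value, and all names (`oldest_segment_to_keep`, `NO_SLOT`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NO_SLOT UINT64_MAX      /* hypothetical marker: no slot holds WAL back */

/*
 * Returns the number of the oldest segment that must be kept; segments with
 * a smaller number are candidates for recycling or removal.
 *
 * redo_ptr:      redo pointer of the checkpoint just completed
 * insert_ptr:    current WAL position (keep_bytes is measured back from it)
 * keep_bytes:    stand-in for wal_keep_size, in bytes
 * slot_min_lsn:  oldest LSN any replication slot still needs, or NO_SLOT
 */
static uint64_t
oldest_segment_to_keep(uint64_t redo_ptr, uint64_t insert_ptr,
                       uint64_t keep_bytes, uint64_t slot_min_lsn,
                       uint64_t segment_size)
{
    assert(segment_size > 0 && redo_ptr <= insert_ptr);

    uint64_t    seg = redo_ptr / segment_size;
    uint64_t    keep_from = insert_ptr > keep_bytes ? insert_ptr - keep_bytes : 0;

    if (keep_from / segment_size < seg)
        seg = keep_from / segment_size;     /* keep window reaches further back */
    if (slot_min_lsn != NO_SLOT && slot_min_lsn / segment_size < seg)
        seg = slot_min_lsn / segment_size;  /* a slot reaches further back */
    return seg;
}
```

With 16 MB segments, a redo pointer in segment 10 and an insert position at the start of segment 12, the horizon is segment 10; a four-segment keep window moves it to 8, and a slot stuck in segment 5 moves it to 5.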
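CheckPointDistanceEstimate tracks the bytes of WAL generated per
inter-checkpoint interval as an asymmetric moving average: it jumps up at
once and decays slowly. The estimate informs how many old segments are
worth recycling for future use rather than removing. PreallocXlogFiles
separately pre-creates the next WAL segment file, so the cost of
fallocate / zero-fill is not paid by a backend in the middle of an insert.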
Full-page writes and the checkpoint connection
fullPageWrites mirrors the full_page_writes GUC and is stored both in the
CheckPoint struct and in XLogCtl->Insert.fullPageWrites. The setting
defaults to on; it is a server-wide parameter changed through the
configuration file or ALTER SYSTEM followed by a reload, not a per-session
SET. When it is on, the first modification of any buffer page after the
checkpoint's redo pointer must be logged as a full-page image (FPI). This
ensures recovery can reconstruct a torn page (written partially by a crash)
by replacing it entirely with the logged copy rather than applying a delta
that would be meaningless on a torn page. The FPI requirement resets at
each checkpoint because recovery will start from the new redo pointer:
images logged before it are never replayed, so every page needs a fresh
image the first time it changes afterwards.
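The per-page decision can be expressed as a comparison between the page's LSN and the redo pointer. This is a sketch of the rule only; `needs_full_page_image` and `apply_change` are hypothetical helpers, and LSNs are plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a WAL record modifying a page must carry a full-page image.
 * page_lsn: LSN stamped on the page by its last logged change
 * redo_ptr: redo pointer of the most recent checkpoint
 */
static bool
needs_full_page_image(uint64_t page_lsn, uint64_t redo_ptr,
                      bool full_page_writes)
{
    if (!full_page_writes)
        return false;
    /* Not touched since the redo pointer: first change in this window. */
    return page_lsn <= redo_ptr;
}

/* Applying a logged change stamps the page with the record's end LSN. */
static uint64_t
apply_change(uint64_t page_lsn, uint64_t record_end_lsn)
{
    assert(record_end_lsn > page_lsn);
    return record_end_lsn;
}
```

A page last changed before the redo pointer needs an image on its next change; once that change is applied its LSN moves past the redo pointer and later changes log only deltas, until the next checkpoint advances the redo pointer again.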
Source Walkthrough
postmaster/checkpointer.c
CheckpointerMain is the process entry point. It sets
MyBackendType = B_CHECKPOINTER, registers signal handlers (SIGINT →
ReqShutdownXLOG, SIGUSR2 → SignalHandlerForShutdownRequest), and runs
the main event loop. The loop: (1) absorbs fsync requests via
AbsorbSyncRequests; (2) checks ckpt_flags for pending requests; (3)
fires a timed checkpoint if elapsed_secs ≥ CheckPointTimeout; (4) calls
CreateCheckPoint or CreateRestartPoint as appropriate; (5) signals
waiting backends via done_cv; (6) sleeps until the earlier of the next
checkpoint timeout and XLogArchiveTimeout.
CheckpointerShmemInit allocates and zero-initialises
CheckpointerShmemStruct in shared memory, sizing the requests[] array at
min(NBuffers, MAX_CHECKPOINT_REQUESTS) entries.
RequestCheckpoint is called by backends. It ORs flags into
ckpt_flags, sets the checkpointer’s latch, then optionally waits via
start_cv + done_cv condition variables for the checkpoint to complete
and succeed.
ForwardSyncRequest posts a (FileTag, SyncRequestType) pair into
requests[] under CheckpointerCommLock. If the queue is more than half
full it nudges the checkpointer latch. If the queue is completely full it
tries CompactCheckpointerRequestQueue (a deduplication pass using an
in-memory hash table); if compaction fails, it returns false, telling the
caller to fsync directly.
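The queue behaviour described above can be sketched with a small fixed-size array. This is a simplified model under stated assumptions: integer tags stand in for (FileTag, SyncRequestType) pairs, a boolean stands in for the latch, locking is omitted, and deduplication uses a quadratic scan instead of a hash table. All names (RequestQueue, compact_queue, forward_request, QUEUE_CAPACITY) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 8        /* hypothetical; the real array is sized at startup */

typedef struct
{
    int    tags[QUEUE_CAPACITY];    /* stand-in for (FileTag, SyncRequestType) */
    size_t num_requests;
    bool   latch_set;               /* stand-in for nudging the checkpointer */
} RequestQueue;

/* Drop duplicate tags in place; return true if at least one slot was freed. */
bool
compact_queue(RequestQueue *q)
{
    size_t kept = 0;
    bool   freed;

    for (size_t i = 0; i < q->num_requests; i++)
    {
        bool dup = false;

        for (size_t j = 0; j < kept; j++)
            if (q->tags[j] == q->tags[i]) { dup = true; break; }
        if (!dup)
            q->tags[kept++] = q->tags[i];
    }
    freed = kept < q->num_requests;
    q->num_requests = kept;
    return freed;
}

/* Returns false when the caller has to fsync the file itself. */
bool
forward_request(RequestQueue *q, int tag)
{
    if (q->num_requests >= QUEUE_CAPACITY && !compact_queue(q))
        return false;
    q->tags[q->num_requests++] = tag;
    if (q->num_requests >= QUEUE_CAPACITY / 2)
        q->latch_set = true;        /* at least half full: wake the consumer */
    return true;
}
```

Returning false rather than blocking keeps the backend from waiting on the checkpointer; the cost is one synchronous fsync in the caller.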
CheckpointWriteDelay is called by BufferSync after each page write.
It sleeps 100 ms when IsCheckpointOnSchedule says the flush is ahead,
absorbs fsync requests every WRITES_PER_ABSORB = 1000 writes otherwise,
and skips sleeping entirely when CHECKPOINT_IMMEDIATE is set.
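The three outcomes of this function can be sketched as a small decision routine. The sketch is an approximation: it models only the CHECKPOINT_IMMEDIATE flag and the on-schedule test, and ignores the shutdown and pending-request checks a real implementation would also need. The names DelayAction, write_delay_action, and SKETCH_WRITES_PER_ABSORB are hypothetical.

```c
#include <stdbool.h>

#define SKETCH_WRITES_PER_ABSORB 1000   /* mirrors WRITES_PER_ABSORB in the text */

typedef enum { ACTION_SLEEP, ACTION_ABSORB, ACTION_NONE } DelayAction;

/*
 * Decide what to do after one page write.
 * absorb_counter counts down the writes left until the next forced absorb.
 */
DelayAction
write_delay_action(bool immediate, bool on_schedule, int *absorb_counter)
{
    if (!immediate && on_schedule)
    {
        /* Ahead of schedule: the caller absorbs requests, then naps ~100 ms. */
        *absorb_counter = SKETCH_WRITES_PER_ABSORB;
        return ACTION_SLEEP;
    }
    if (--(*absorb_counter) <= 0)
    {
        /* Behind schedule or immediate: still drain the queue periodically. */
        *absorb_counter = SKETCH_WRITES_PER_ABSORB;
        return ACTION_ABSORB;
    }
    return ACTION_NONE;
}
```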
AbsorbSyncRequests copies requests[] out of shared memory under
CheckpointerCommLock, clears num_requests, then calls
RememberSyncRequest for each entry (inside a critical section, because
failing after clearing the queue would lose fsync obligations).
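The copy-then-clear step can be illustrated as follows. This is a sketch only: the lock and the critical section are omitted, integer tags stand in for real requests, and SharedQueue, drain_queue, and SKETCH_MAX are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX 16           /* hypothetical queue capacity */

typedef struct
{
    int    tags[SKETCH_MAX];    /* stand-in for queued sync requests */
    size_t num_requests;
} SharedQueue;

/*
 * Copy all pending requests into a private buffer and empty the shared
 * queue. In a real system this runs while holding the queue lock; the
 * caller then processes the private copy after releasing it.
 * Returns the number of requests copied.
 */
size_t
drain_queue(SharedQueue *shared, int *local, size_t local_cap)
{
    size_t n = shared->num_requests;

    assert(n <= local_cap);     /* the private buffer must hold everything */
    memcpy(local, shared->tags, n * sizeof(int));
    shared->num_requests = 0;
    return n;
}
```

Once the shared count is zeroed, the private copy is the only record of those requests, which is why losing it to an error would lose fsync obligations.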
access/transam/xlog.c — checkpoint path
CreateCheckPoint (line 6927): full online/shutdown checkpoint
implementation. Key sub-calls: SyncPreCheckpoint, START_CRIT_SECTION,
WALInsertLockAcquireExclusive (to read Insert->CurrBytePos for shutdown
redo pointer), WALInsertLockRelease, XLogInsert(XLOG_CHECKPOINT_REDO)
for online checkpoints, CheckPointGuts, XLogInsert(XLOG_CHECKPOINT_ONLINE / XLOG_CHECKPOINT_SHUTDOWN), XLogFlush, UpdateControlFile,
WakeupWalSummarizer, RemoveOldXlogFiles, PreallocXlogFiles.
CheckPointGuts (line 7550): dispatches to per-subsystem checkpoint
routines and to CheckPointBuffers / ProcessSyncRequests.
RecoveryRestartPoint (line 7590): called by the startup process each
time it replays a checkpoint WAL record; stashes the record in
XLogCtl->lastCheckPoint under info_lck.
CreateRestartPoint (line 7631): called by the checkpointer during
recovery. Reads lastCheckPoint, pins RedoRecPtr, calls CheckPointGuts,
updates ControlFile, and recycles WAL.
UpdateCheckPointDistanceEstimate (line 6824): slow-decay moving
average of bytes of WAL per checkpoint interval. The estimate jumps immediately to any
larger value; when the new distance is smaller, it closes 10 % of the gap per interval (formula: 0.9 × old + 0.1 × new).
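The asymmetric update can be written in a few lines. The function name update_distance_estimate is hypothetical; the arithmetic follows the formula quoted above.

```c
/*
 * Asymmetric moving average: follow increases immediately,
 * decay slowly toward smaller values.
 */
double
update_distance_estimate(double estimate, double nbytes)
{
    if (estimate < nbytes)
        return nbytes;                      /* rise at once */
    return 0.90 * estimate + 0.10 * nbytes; /* fall by 10 % of the gap */
}
```

Starting from an estimate of 200 with a new distance of 100, the result is 0.9 × 200 + 0.1 × 100 = 190; a new distance of 300 would instead raise the estimate straight to 300.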
IsCheckpointOnSchedule (line 841 of checkpointer.c): computes
elapsed_xlogs and elapsed_time fractions; returns true (on schedule)
only if both are less than progress × CheckPointCompletionTarget.
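The schedule test compares the scaled write progress against both elapsed fractions. In this sketch the fractions are passed in directly rather than derived from WAL positions and timestamps, and the name is_on_schedule is hypothetical.

```c
#include <stdbool.h>

/*
 * progress          fraction of buffers already written (0.0 .. 1.0)
 * completion_target fraction of the interval the flush may use
 * elapsed_xlogs     fraction of the WAL budget consumed so far
 * elapsed_time      fraction of the checkpoint timeout elapsed so far
 */
bool
is_on_schedule(double progress, double completion_target,
               double elapsed_xlogs, double elapsed_time)
{
    double scaled = progress * completion_target;

    if (scaled < elapsed_xlogs)
        return false;           /* WAL is being generated faster than we write */
    if (scaled < elapsed_time)
        return false;           /* the clock is ahead of the flush */
    return true;
}
```

With half the buffers written and a completion target of 0.9, the flush is on schedule only while both elapsed fractions are at most 0.45.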
include/catalog/pg_control.h
CheckPoint struct (line 35): body of checkpoint WAL records; also
copied into ControlFileData.checkPointCopy.
ControlFileData struct (line 104): the pg_control file layout. Key fields:
state (a DBState enum), checkPoint (LSN of the last checkpoint record),
checkPointCopy (copy of the CheckPoint body), and minRecoveryPoint (a standby
must replay at least this far).
DBState enum: DB_STARTUP, DB_SHUTDOWNED, DB_SHUTDOWNED_IN_RECOVERY,
DB_SHUTDOWNING, DB_IN_CRASH_RECOVERY, DB_IN_ARCHIVE_RECOVERY,
DB_IN_PRODUCTION.
Position hints (commit 273fe94, 2026-06-05)
| Symbol | File | Approx. line |
|---|---|---|
| CheckpointerMain | src/backend/postmaster/checkpointer.c | 182 |
| CheckpointerShmemStruct | src/backend/postmaster/checkpointer.c | 107 |
| CheckpointerShmemInit | src/backend/postmaster/checkpointer.c | 959 |
| RequestCheckpoint | src/backend/postmaster/checkpointer.c | 1003 |
| ForwardSyncRequest | src/backend/postmaster/checkpointer.c | 1153 |
| AbsorbSyncRequests | src/backend/postmaster/checkpointer.c | 1329 |
| CompactCheckpointerRequestQueue | src/backend/postmaster/checkpointer.c | 1219 |
| CheckpointWriteDelay | src/backend/postmaster/checkpointer.c | 772 |
| IsCheckpointOnSchedule | src/backend/postmaster/checkpointer.c | 841 |
| CreateCheckPoint | src/backend/access/transam/xlog.c | 6927 |
| CheckPointGuts | src/backend/access/transam/xlog.c | 7550 |
| RecoveryRestartPoint | src/backend/access/transam/xlog.c | 7590 |
| CreateRestartPoint | src/backend/access/transam/xlog.c | 7631 |
| UpdateCheckPointDistanceEstimate | src/backend/access/transam/xlog.c | 6824 |
| XLogBytePosToRecPtr | src/backend/access/transam/xlog.c | 1861 |
| INSERT_FREESPACE macro | src/backend/access/transam/xlog.c | 581 |
| BufferSync | src/backend/storage/buffer/bufmgr.c | 3353 |
| CheckPointBuffers | src/backend/storage/buffer/bufmgr.c | 4219 |
| CkptTsStatus struct | src/backend/storage/buffer/bufmgr.c | 106 |
| SyncOneBuffer | src/backend/storage/buffer/bufmgr.c | 521 |
| CheckPoint struct | src/include/catalog/pg_control.h | 35 |
| ControlFileData struct | src/include/catalog/pg_control.h | 104 |
| DBState enum | src/include/catalog/pg_control.h | 89 |
| XLOG_CHECKPOINT_REDO | src/include/catalog/pg_control.h | 82 |
| XLOG_CHECKPOINT_ONLINE | src/include/catalog/pg_control.h | 69 |
| XLOG_CHECKPOINT_SHUTDOWN | src/include/catalog/pg_control.h | 68 |
Source verification (as of 2026-06-05)
Verified against commit 273fe94 (REL_18_STABLE, PostgreSQL 18.x) under
/data/hgryoo/references/postgres.
Confirmed:
- CheckpointerMain runs the main event loop and calls CreateCheckPoint or CreateRestartPoint depending on RecoveryInProgress(). Confirmed in postmaster/checkpointer.c lines 349–546.
- CheckpointerShmemStruct layout (including start_cv, done_cv, FLEXIBLE_ARRAY_MEMBER requests) confirmed at lines 107–131.
- RequestCheckpoint six-step handshake confirmed at lines 1003–1130.
- ForwardSyncRequest / CompactCheckpointerRequestQueue / AbsorbSyncRequests confirmed at lines 1153–1371.
- CreateCheckPoint two-record scheme (XLOG_CHECKPOINT_REDO then XLOG_CHECKPOINT_ONLINE) confirmed at lines 7086–7109 and 7250–7256.
- CheckPointGuts call chain confirmed at lines 7550–7577.
- CreateRestartPoint confirmed at lines 7631–7810.
- CheckPoint struct confirmed at pg_control.h lines 35–65.
- ControlFileData confirmed at lines 104–239.
- DBState enum confirmed at lines 89–98.
- XLOG_CHECKPOINT_REDO = 0xE0 confirmed at line 82.
- UpdateCheckPointDistanceEstimate moving-average formula (0.9/0.1 decay) confirmed at lines 6848–6853.
- WakeupWalSummarizer() called after checkpoint record flush (WAL summarization, present since PG17) confirmed at line 7337.
- Redo-pointer fixation: shutdown branch (XLogBytePosToRecPtr(CurrBytePos) with the INSERT_FREESPACE page-header skip) confirmed at lines 7044–7077; online branch (XLOG_CHECKPOINT_REDO insert then checkPoint.redo = RedoRecPtr) confirmed at lines 7094–7108.
- BufferSync two-pass structure (tag pass setting BM_CHECKPOINT_NEEDED into CkptBufferIds[], then per-tablespace min-heap write pass calling CheckpointWriteDelay(flags, num_processed / num_to_scan)) confirmed in storage/buffer/bufmgr.c lines 3353–3615; CkptTsStatus struct at 106–128.
Not verified / out of scope:
- ProcessSyncRequests internals (in storage/sync/sync.c): covered in the smgr/md doc.
- The pg_stat_checkpointer view: driven by PendingCheckpointerStats and reported via pgstat_report_checkpointer; internals in utils/activity/pgstat_checkpointer.c (out of scope for this doc).
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Comparison: Oracle and SQL Server
Oracle uses a fuzzy checkpoint with incremental flushing driven by
database writer (DBWn) processes, not a dedicated checkpoint thread. The
redo pointer equivalent is the System Change Number (SCN) stored in the
control file. Oracle’s checkpoint design is notable for its
incremental checkpoint: the dirty-buffer write proceeds continuously
(not just at checkpoint intervals), guided by a write queue ordered by
dirty_since time. PostgreSQL’s bgwriter performs an analogous role but is
fully separated from the checkpointer.
SQL Server uses indirect checkpoints (SQL Server 2012 and later) that target a
configurable recovery time (the TARGET_RECOVERY_TIME database option) rather than a
time or WAL-volume interval, achieving smoother I/O at the cost of more frequent background
flushing. PostgreSQL’s checkpoint_completion_target achieves similar
smoothing at the checkpoint level.
Comparison: InnoDB (MySQL)
InnoDB’s fuzzy checkpoint is also driven by a dedicated thread (the
master thread or dedicated I/O threads). The redo log is circular; when it
wraps, a forced checkpoint flushes the pages needed to reclaim log space.
This is analogous to PostgreSQL’s CHECKPOINT_CAUSE_XLOG trigger when WAL
usage approaches max_wal_size. InnoDB uses innodb_io_capacity as the
I/O rate governor; PostgreSQL has no direct rate knob and instead paces writes with
checkpoint_completion_target, enforced by the 100 ms sleeps in CheckpointWriteDelay.
Comparison: CUBRID
CUBRID (see knowledge/code-analysis/cubrid/cubrid-checkpoint.md) uses a
similar fuzzy checkpoint driven by a dedicated checkpointer thread. The
redo pointer concept is equivalent but CUBRID stores it in a dedicated log
anchor file rather than embedding it in WAL records. CUBRID does not have an
analog of PostgreSQL’s XLOG_CHECKPOINT_REDO two-record scheme; instead the
checkpoint record itself doubles as the redo-pointer anchor. PostgreSQL’s
separation of redo-record from checkpoint-record is a direct consequence of
supporting concurrent WAL insertion during the flush.
Research frontiers
Eliminating checkpoint spikes entirely. The LeanStore and Umbra
storage engines (Leis et al.) use a continuously-running eviction loop that
writes dirty pages to a shadow location before in-place overwriting, making
distinct checkpoint phases unnecessary. PostgreSQL’s checkpoint_completion_target
approximates this but still has a synchronous ProcessSyncRequests phase.
Group commit and checkpoint interaction. High-frequency checkpoints
interact with group commit: if checkpoint_timeout is short relative to
transaction commit rate, the WAL-flush cost at commit is reduced (recent
commits are already flushed as part of the checkpoint). PostgreSQL does not
explicitly exploit this coupling; InnoDB’s group commit documentation
acknowledges a similar interplay.
WAL summarization and incremental backup (PG17 and later). WAL summarization was
introduced in PostgreSQL 17, and CreateCheckPoint calls WakeupWalSummarizer() once the
checkpoint record is flushed. The WAL summarizer (postmaster/walsummarizer.c) tracks which
blocks were modified between consecutive checkpoints and writes summary files under
pg_wal/summaries, which incremental backups (pg_basebackup --incremental) use to skip
unmodified blocks. The checkpoint is the natural epoch boundary for summarization: between
two checkpoints every modified page has a known set of WAL records. See
postgres-archiving-walsummary.md for the full design.
Sources
- src/backend/postmaster/checkpointer.c: checkpointer process, request queue, pacing logic.
- src/backend/access/transam/xlog.c: CreateCheckPoint, CreateRestartPoint, CheckPointGuts, redo-pointer management, WAL segment recycling.
- src/include/catalog/pg_control.h: CheckPoint, ControlFileData, DBState.
- knowledge/research/dbms-papers/aries.md: ARIES theory (Mohan et al. 1992); redo pointer and fuzzy checkpoint formalization.
- knowledge/research/dbms-general/database-internals.md: Petrov ch. 5 “Transaction Processing and Recovery”; checkpoint pacing and I/O amplification framing.
- knowledge/research/dbms-general/database-system-concepts.md: Silberschatz ch. 19 “Recovery System”; WAL, stable storage, shadow paging.
- knowledge/code-analysis/postgres/postgres-xlog-wal.md: WAL buffer management, XLogInsert, XLogFlush, LSN mechanics.
- knowledge/code-analysis/postgres/postgres-recovery-redo.md: recovery modes, PerformWalRecovery, timeline management.
- knowledge/code-analysis/postgres/postgres-buffer-manager.md: BufferSync, FlushBuffer, dirty-buffer tracking.