PostgreSQL Checkpoint — The Durability Anchor: redo-pointer fixation, buffer flush, and WAL truncation

Contents:

Theoretical Background
Common DBMS Design
PostgreSQL’s Approach
Source Walkthrough
Source verification (as of 2026-06-05)
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Sources

Theoretical Background

A checkpoint in a write-ahead logging (WAL) system is a durable marker that bounds how much WAL crash recovery must replay. Without checkpoints, crash recovery would have to replay the entire WAL history from the very beginning, an O(total-WAL) operation that grows without bound. The checkpoint cuts that tail: once a checkpoint with redo pointer L is durable, recovery need only replay WAL starting from L, because every change logged before L has already been written to stable storage, and WAL older than L is no longer needed for crash recovery.

The canonical theory is ARIES (Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 1992; captured at knowledge/research/dbms-papers/aries.md). ARIES itself checkpoints without forcing pages to disk, logging its dirty-page and transaction tables instead; the flush-based variant that PostgreSQL follows states the checkpoint obligation more bluntly: before the checkpoint record is written, every page change logged before the redo pointer must have been flushed to stable storage. This is sometimes called the write-ahead rule for checkpoints, a complement to the write-ahead rule for individual page writes. Once the checkpoint record is durable, the redo pointer embedded in it is the earliest LSN that recovery must start from.

ARIES further distinguishes two checkpoint styles:

Consistent (quiesce) checkpoint. All active transactions are quiesced (no new writes) while the snapshot is taken. The redo pointer equals the checkpoint record LSN. Simple to reason about, but it stalls every writer for the duration of the flush and is therefore unsuitable for high-concurrency systems.
Fuzzy (online) checkpoint. Writes continue while the buffer flush proceeds. The redo pointer is fixed before the flush begins, at the earliest LSN that could generate a dirty page during the checkpoint window. Because transactions insert WAL records while buffers are being written, the checkpoint record is written at the end of the flush, not the beginning. Recovery must replay from the redo pointer (which precedes the checkpoint record in LSN order), not from the checkpoint record itself.

PostgreSQL uses the fuzzy model for online checkpoints. The redo pointer is pinned at the start, the flush proceeds concurrently, and the checkpoint record is written last. Only shutdown checkpoints use a quiesce variant, because no concurrent WAL insertions are possible at shutdown.
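
The ordering can be illustrated with a minimal sketch in C. It assumes a toy log in which an LSN is a plain byte counter; `ToyLog`, `ToyCheckpoint`, `toy_append`, and `toy_fuzzy_checkpoint` are hypothetical names for illustration, not PostgreSQL APIs, and the 64-byte record size is arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: an LSN is just a byte counter; none of these names exist in PostgreSQL. */
typedef uint64_t ToyLsn;

typedef struct ToyLog {
    ToyLsn insert_pos;   /* where the next record will start */
} ToyLog;

typedef struct ToyCheckpoint {
    ToyLsn redo;         /* where recovery must start replaying */
    ToyLsn record_lsn;   /* where the checkpoint record itself sits */
} ToyCheckpoint;

/* Append a record of `len` bytes and return the LSN at which it starts. */
static ToyLsn toy_append(ToyLog *log, uint64_t len)
{
    ToyLsn start = log->insert_pos;
    log->insert_pos += len;
    return start;
}

/*
 * Fuzzy checkpoint: pin redo first, let `concurrent_bytes` of other WAL
 * arrive while buffers are flushed, then write the checkpoint record last.
 */
static ToyCheckpoint toy_fuzzy_checkpoint(ToyLog *log, uint64_t concurrent_bytes)
{
    ToyCheckpoint ckpt;

    ckpt.redo = log->insert_pos;             /* 1. pin the redo pointer */
    toy_append(log, concurrent_bytes);       /* 2. writers keep inserting during the flush */
    ckpt.record_lsn = toy_append(log, 64);   /* 3. checkpoint record goes last */
    assert(ckpt.redo <= ckpt.record_lsn);
    return ckpt;
}
```

Calling `toy_fuzzy_checkpoint` with `concurrent_bytes == 0` models the quiesce case, where the redo pointer and the checkpoint record LSN coincide; any nonzero value opens the gap that recovery has to replay.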

Database Internals (Petrov, ch. 5 “Transaction Processing and Recovery”) identifies two further sub-problems that every checkpoint design must solve:

I/O amplification. Writing all dirty buffers at once at checkpoint time creates a large, sudden I/O spike, which competes with foreground query I/O and can violate latency SLAs. The remedy is checkpoint pacing: spread the dirty-buffer writes over a fraction of the inter-checkpoint interval, using the inter-checkpoint period as a natural budget.
WAL retention vs. space consumption. WAL segments before the oldest live redo pointer can be recycled. But if a standby replica or a replication slot holds a position that precedes the redo pointer, segments must be retained longer. Getting this calculation wrong either wastes disk space or breaks replicas.

Both problems have direct implementations in PostgreSQL (see § PostgreSQL’s Approach).

Common DBMS Design

The ARIES theory shapes every production checkpoint implementation into a recognizable pattern. This section names the engineering conventions that PostgreSQL, Oracle, InnoDB, DB2, and SQL Server all share.

Dedicated checkpoint process

Nearly every engine separates checkpoint I/O from normal query processing by running it in a dedicated background process (or thread). This prevents checkpoint latency from appearing in user query response times and allows the checkpoint engine to set its own I/O priority without contending with foreground threads over scheduler timeslices.

A redo pointer fixed at start, checkpoint record written at end

For fuzzy checkpoints the sequence is always: (1) fix the redo pointer while holding brief exclusive access to the WAL insertion state, (2) flush every page that was dirtied by a change logged before the redo pointer (a page modified again later carries a higher PageLSN but must still be written), (3) write and flush the checkpoint record that carries the redo pointer value. Recovery navigates to the checkpoint record to find where to start replaying (the redo pointer), not where the checkpoint record sits. In a fuzzy checkpoint the two LSNs can be far apart: the gap is all the WAL generated while the flush ran, which in PostgreSQL can be a large share of max_wal_size.

pg_control / control file as the recovery anchor

Every ARIES implementation maintains a small, frequently updated control file that records the location of the latest checkpoint record. On startup the recovery manager reads this file first, locates the checkpoint record, extracts the redo pointer, and begins replay. The control file is single-page atomic (or uses shadow-copy) to tolerate partial writes.

Paced I/O to bound the spike

Engines spread the dirty-buffer flush over the inter-checkpoint window by throttling the buffer writer. The tuning knob is usually expressed as a fraction of the checkpoint interval: “aim to finish the flush by the time target_fraction × checkpoint_interval has elapsed.” Sleeping between batches of page writes is the common mechanism. The system accelerates writes when it detects it is running behind.

fsync consolidation

Writing a page and fsyncing it are two different system calls. In a busy system, many backends write the same page (or pages in the same file) multiple times between checkpoints. An efficient implementation batches fsync calls per file at checkpoint time rather than fsyncing after every write. This turns O(writes) fsyncs into O(files-touched) fsyncs, which is a large reduction in practice. The checkpoint process owns the fsync queue; backends that write files directly register their files with the checkpoint process so that the consolidation occurs centrally.
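
A minimal sketch of the consolidation idea, assuming files are identified by a small integer id and the pending table is a fixed array; `ToySyncTable`, `toy_remember_write`, and `toy_process_sync_requests` are hypothetical names, and a real engine would use a hash table and call `fsync` where the comment indicates.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_FILES 64   /* illustrative capacity, not a PostgreSQL limit */

/* Pending-fsync table keyed by a hypothetical integer file id. */
typedef struct ToySyncTable {
    int    file_ids[TOY_MAX_FILES];
    size_t nfiles;      /* distinct files awaiting fsync */
    size_t nrequests;   /* raw write notifications received */
} ToySyncTable;

/* Remember that `file_id` was written; duplicates collapse into one entry. */
static void toy_remember_write(ToySyncTable *t, int file_id)
{
    t->nrequests++;
    for (size_t i = 0; i < t->nfiles; i++)
        if (t->file_ids[i] == file_id)
            return;                     /* already queued */
    assert(t->nfiles < TOY_MAX_FILES);
    t->file_ids[t->nfiles++] = file_id;
}

/* At checkpoint time: one fsync per distinct file; returns how many were issued. */
static size_t toy_process_sync_requests(ToySyncTable *t)
{
    size_t issued = t->nfiles;          /* a real system would call fsync(fd) here */
    t->nfiles = 0;
    t->nrequests = 0;
    return issued;
}
```

Six write notifications that touch three distinct files leave `nrequests == 6` but `nfiles == 3`, so the checkpoint issues three fsyncs instead of six.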

Restartpoint during recovery

While in archive or streaming recovery, a standby cannot take a full checkpoint (it cannot write new WAL other than in the primary’s stream). Instead it takes a restartpoint: a record noting “we have replayed safely up to this previously-written checkpoint record, and all pages referenced up to that point have been flushed.” A restartpoint allows WAL segments before the corresponding checkpoint’s redo pointer to be recycled on the standby without the primary needing to take a new checkpoint.

PostgreSQL’s Approach

Process model: the checkpointer

PostgreSQL devotes a single auxiliary process to all checkpoint work. CheckpointerMain (in postmaster/checkpointer.c) is the entry point the postmaster launches for it, and the process runs with MyBackendType = B_CHECKPOINTER. The process owns:

the timer loop that fires every checkpoint_timeout seconds,
consumption of CHECKPOINT_CAUSE_XLOG requests, which backends raise when the WAL written since the last checkpoint exceeds the segment budget derived from max_wal_size,
the shutdown checkpoint when it receives SIGINT from the postmaster,
the fsync request queue that backends post to via ForwardSyncRequest.

Before PostgreSQL 9.2 the bgwriter process performed both background buffer writing and checkpoints. The separation into dedicated bgwriter and checkpointer processes avoids priority inversion: the bgwriter can perform steady proactive writes independent of checkpoint pacing.

CheckpointerShmemStruct: the request and completion protocol

Backends and the checkpointer communicate through a small shared-memory structure:

// CheckpointerShmemStruct — postmaster/checkpointer.c
typedef struct
{
    pid_t       checkpointer_pid; /* PID (0 if not started) */
    slock_t     ckpt_lck;         /* protects ckpt_* fields */
    int         ckpt_started;     /* increments when a checkpoint starts */
    int         ckpt_done;        /* set to ckpt_started on completion */
    int         ckpt_failed;      /* increments on failure */
    int         ckpt_flags;       /* OR of pending request flags */
    ConditionVariable start_cv;   /* signaled when ckpt_started advances */
    ConditionVariable done_cv;    /* signaled when ckpt_done advances */
    int         num_requests;
    int         max_requests;
    CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER]; /* fsync queue */
} CheckpointerShmemStruct;

RequestCheckpoint uses a six-step handshake:

Record ckpt_failed and ckpt_started under ckpt_lck; OR in request flags.
Set the checkpointer’s latch via SetLatch.
Sleep on start_cv until ckpt_started advances.
Record the new ckpt_started.
Sleep on done_cv until ckpt_done ≥ new_started (modulo arithmetic).
If ckpt_failed changed, the checkpoint failed; otherwise it succeeded.

This protocol lets multiple backends request checkpoints concurrently without any of them needing to run the checkpoint themselves. The OR semantics of ckpt_flags mean that multiple requests are coalesced: a CHECKPOINT_IMMEDIATE flag set by any backend takes effect for the next checkpoint regardless of other concurrent requests.
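
The "modulo arithmetic" in the wait on done_cv matters because the counters run freely and may wrap. A minimal sketch of a wraparound-tolerant comparison follows; `toy_counter_reached` is a hypothetical helper, and it uses unsigned counters (an assumption made here so that overflow is well defined in C), whereas the struct above declares them as int.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Wraparound-tolerant "a has reached b" test for free-running counters.
 * The subtraction is done in unsigned arithmetic (well defined on overflow)
 * and the result is reinterpreted as a signed distance. The conversion to
 * int32_t is implementation-defined in C but wraps on mainstream compilers.
 */
static bool toy_counter_reached(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) >= 0;
}
```

A waiter that recorded `new_started` would sleep until `toy_counter_reached(ckpt_done, new_started)` holds; the test stays correct even when `ckpt_done` has wrapped past zero while `new_started` is still near the maximum.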

The CheckPoint record struct

The body of XLOG_CHECKPOINT_ONLINE and XLOG_CHECKPOINT_SHUTDOWN WAL records is the CheckPoint struct from src/include/catalog/pg_control.h:

// CheckPoint — src/include/catalog/pg_control.h
typedef struct CheckPoint
{
    XLogRecPtr  redo;              /* REDO start point */
    TimeLineID  ThisTimeLineID;
    TimeLineID  PrevTimeLineID;
    bool        fullPageWrites;
    int         wal_level;
    FullTransactionId nextXid;     /* next free XID */
    Oid         nextOid;
    MultiXactId nextMulti;
    MultiXactOffset nextMultiOffset;
    TransactionId oldestXid;       /* cluster-wide datfrozenxid */
    Oid         oldestXidDB;
    MultiXactId oldestMulti;
    Oid         oldestMultiDB;
    pg_time_t   time;
    TransactionId oldestCommitTsXid;
    TransactionId newestCommitTsXid;
    TransactionId oldestActiveXid; /* oldest XID still running (online ckpt) */
} CheckPoint;

The redo field is the redo pointer. ControlFileData.checkPointCopy holds a copy of the latest CheckPoint; ControlFileData.checkPoint holds the LSN of the checkpoint record itself. Recovery reads both: checkPoint to locate the WAL record, and checkPointCopy.redo to know where to start replaying.

ControlFileData.state (a DBState enum with values DB_STARTUP, DB_SHUTDOWNING, DB_SHUTDOWNED, DB_IN_PRODUCTION, etc.) is rewritten around a shutdown checkpoint (DB_SHUTDOWNING before it, DB_SHUTDOWNED once it is durable) so that the next startup can tell a clean shutdown from a crash.

CreateCheckPoint: the online checkpoint flow

CreateCheckPoint in xlog.c performs the full checkpoint sequence. For an online (non-shutdown) checkpoint:

// CreateCheckPoint — src/backend/access/transam/xlog.c
bool
CreateCheckPoint(int flags)
{
    /* 1. Reject if we're already in recovery (except end-of-recovery). */
    if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
        elog(ERROR, "can't create a checkpoint during recovery");

    /* 2. smgr pre-checkpoint hook (outside critical section). */
    SyncPreCheckpoint();

    START_CRIT_SECTION();

    /* 3. Gather XID/OID/MultiXact watermarks into checkPoint struct. */
    checkPoint.nextXid = TransamVariables->nextXid;   /* under XidGenLock */
    checkPoint.nextOid = TransamVariables->nextOid;   /* under OidGenLock */
    /* ... MultiXact, CommitTs fields ... */

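    /* 4. Skip if idle; shutdown, end-of-recovery, and forced checkpoints never skip. */
    if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
                  CHECKPOINT_FORCE)) == 0 &&
        last_important_lsn == ControlFile->checkPoint)
    { END_CRIT_SECTION(); return false; }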

    /* 5. Pin the redo pointer: insert XLOG_CHECKPOINT_REDO. */
    XLogBeginInsert();
    XLogRegisterData(&wal_level, sizeof(wal_level));
    (void) XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
    checkPoint.redo = RedoRecPtr;   /* now fixed */

    END_CRIT_SECTION();

    /* 6. Wait for transactions in commit critical sections to clear. */
    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
    while (HaveVirtualXIDsDelayingChkpt(...)) { AbsorbSyncRequests(); ... }

    /* 7. Flush all dirty buffers and SLRUs (paced). */
    CheckPointGuts(checkPoint.redo, flags);

    /* 8. Wait for transactions completing their commit. */
    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
    while (HaveVirtualXIDsDelayingChkpt(...)) { AbsorbSyncRequests(); ... }

    START_CRIT_SECTION();

    /* 9. Write and flush XLOG_CHECKPOINT_ONLINE record. */
    XLogBeginInsert();
    XLogRegisterData(&checkPoint, sizeof(checkPoint));
    recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_ONLINE);
    XLogFlush(recptr);

    /* 10. Update pg_control atomically. */
    ControlFile->checkPoint     = ProcLastRecPtr;  /* LSN of ckpt record */
    ControlFile->checkPointCopy = checkPoint;
    UpdateControlFile();

    END_CRIT_SECTION();

    /* 11. Wake WAL summarizer; refresh the distance estimate; recycle old WAL. */
    WakeupWalSummarizer();
    UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
    RemoveOldXlogFiles(...);
    return true;
}

Step 5 is the key insight of the fuzzy checkpoint: XLOG_CHECKPOINT_REDO is a minimal WAL record whose LSN becomes checkPoint.redo. Any page modified after this LSN will carry a PageLSN > redo, so recovery knows it must replay from redo forward to reconstruct those changes. The subsequent XLOG_CHECKPOINT_ONLINE record embeds the same checkPoint.redo value, so recovery knows where to start even though the redo record itself has no payload beyond a wal_level hint for the WAL summarizer.

For shutdown checkpoints (flags & CHECKPOINT_IS_SHUTDOWN) the sequence is simpler: because no concurrent WAL insertions are possible, the redo pointer equals the next WAL insertion point (computed directly from Insert->CurrBytePos). Only XLOG_CHECKPOINT_SHUTDOWN is written; the separate XLOG_CHECKPOINT_REDO record is skipped.

Redo-pointer fixation: the two computation paths

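The redo pointer is the single most important value the checkpoint produces, and CreateCheckPoint computes it by two distinct routes depending on whether the cluster is shutting down. The shutdown path computes it while all WAL insertion locks are held; the online path releases those locks first and lets the ordinary insertion of XLOG_CHECKPOINT_REDO fix the value, so in both cases it is consistent with the WAL tip: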

// CreateCheckPoint — src/backend/access/transam/xlog.c
if (shutdown)
{
    XLogRecPtr  curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);

    /*
     * Compute new REDO record ptr = location of next XLOG record.
     * Since this is a shutdown checkpoint, there can't be any concurrent
     * WAL insertion.
     */
    freespace = INSERT_FREESPACE(curInsert);
    if (freespace == 0)
    {
        if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
            curInsert += SizeOfXLogLongPHD;
        else
            curInsert += SizeOfXLogShortPHD;
    }
    checkPoint.redo = curInsert;

    /* update shared RedoRecPtr while holding all insertion locks */
    RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
}

WALInsertLockRelease();   /* let other xacts proceed during the flush */

if (!shutdown)
{
    /* Include WAL level in record for WAL summarizer's benefit. */
    XLogBeginInsert();
    XLogRegisterData(&wal_level, sizeof(wal_level));
    (void) XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);

    /* XLogInsertRecord already advanced Insert.RedoRecPtr + RedoRecPtr; */
    /* copy that LSN into the checkpoint record we will write at the end. */
    checkPoint.redo = RedoRecPtr;
}

The shutdown branch must skip past the page header that an empty WAL page would otherwise force: when INSERT_FREESPACE(curInsert) == 0 the next record cannot begin at curInsert (a page boundary), so the redo pointer is bumped past SizeOfXLogLongPHD (at a segment boundary) or SizeOfXLogShortPHD (mid-segment) — exactly where XLogInsertRecord would have placed the next record. The online branch sidesteps this arithmetic entirely: it simply inserts the XLOG_CHECKPOINT_REDO record and lets the normal insertion path report the LSN back through RedoRecPtr. The crucial comment — “we can’t postpone advancing RedoRecPtr because XLogInserts that happen while we are dumping buffers must assume that their buffer changes are not included in the checkpoint” — is the fuzzy-checkpoint invariant stated in the negative: any buffer dirtied at or after RedoRecPtr is outside this checkpoint and must be recovered by replaying from RedoRecPtr forward.
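
The page-header arithmetic of the shutdown branch can be tried in isolation. The sketch below is a standalone approximation, not the PostgreSQL macros: the `TOY_` constants are assumed values (8 kB WAL pages, 16 MB segments, 24-byte short and 40-byte long page headers, as on typical 64-bit builds), and `toy_insert_freespace` and `toy_shutdown_redo` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for typical 64-bit builds, hard-coded here for illustration. */
#define TOY_XLOG_BLCKSZ      8192u
#define TOY_WAL_SEGMENT_SIZE (16u * 1024u * 1024u)
#define TOY_SHORT_PHD        24u
#define TOY_LONG_PHD         40u

/* Bytes left on the current WAL page; 0 means pos sits exactly on a page boundary. */
static uint64_t toy_insert_freespace(uint64_t pos)
{
    uint64_t off = pos % TOY_XLOG_BLCKSZ;
    return off == 0 ? 0 : TOY_XLOG_BLCKSZ - off;
}

/* Where the next record would really start if the insert position is `cur_insert`. */
static uint64_t toy_shutdown_redo(uint64_t cur_insert)
{
    if (toy_insert_freespace(cur_insert) == 0)
    {
        if (cur_insert % TOY_WAL_SEGMENT_SIZE == 0)
            cur_insert += TOY_LONG_PHD;    /* first page of a segment */
        else
            cur_insert += TOY_SHORT_PHD;   /* any other page */
    }
    assert(cur_insert % TOY_XLOG_BLCKSZ != 0);
    return cur_insert;
}
```

An insert position in the middle of a page is returned unchanged; a position exactly on a page boundary is moved past the header that the next record's page will carry.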

flowchart TD
    A["CreateCheckPoint(flags)"] --> B["SyncPreCheckpoint<br/>(smgr pre-hook, outside crit section)"]
    B --> C["START_CRIT_SECTION<br/>gather nextXid / nextOid / MultiXact<br/>into CheckPoint struct"]
    C --> D{"last_important_lsn ==<br/>ControlFile-&gt;checkPoint ?<br/>(idle, non-forced)"}
    D -->|yes| E["END_CRIT_SECTION<br/>return false (skip)"]
    D -->|no| F{"shutdown ?"}
    F -->|"online"| G["XLogInsert(XLOG_CHECKPOINT_REDO)<br/>pin checkPoint.redo = RedoRecPtr"]
    F -->|"shutdown"| H["redo = XLogBytePosToRecPtr(CurrBytePos)<br/>no separate REDO record"]
    G --> I["GetVirtualXIDsDelayingChkpt(DELAY_CHKPT_START)<br/>wait for commit-critical xacts"]
    H --> I
    I --> J["CheckPointGuts(checkPoint.redo, flags)<br/>flush SLRUs + CheckPointBuffers (paced)<br/>ProcessSyncRequests (fsync)"]
    J --> K["GetVirtualXIDsDelayingChkpt(DELAY_CHKPT_COMPLETE)<br/>wait for completing commits"]
    K --> L["XLogInsert(XLOG_CHECKPOINT_ONLINE / SHUTDOWN)<br/>XLogFlush(recptr)"]
    L --> M["LWLock(ControlFileLock):<br/>ControlFile-&gt;checkPoint = ProcLastRecPtr<br/>checkPointCopy = checkPoint<br/>UpdateControlFile"]
    M --> N["END_CRIT_SECTION<br/>WakeupWalSummarizer<br/>SyncPostCheckpoint"]
    N --> O["UpdateCheckPointDistanceEstimate<br/>KeepLogSeg + RemoveOldXlogFiles<br/>(recycle / remove WAL)"]
    O --> P["PreallocXlogFiles<br/>return true"]

Figure 1 — the online checkpoint sequence inside CreateCheckPoint (xlog.c). The load-bearing ordering is the fuzzy-checkpoint invariant: XLOG_CHECKPOINT_REDO pins checkPoint.redo before CheckPointGuts flushes dirty buffers, while XLOG_CHECKPOINT_ONLINE — carrying that same redo value — is written after the flush completes. The control-file update under ControlFileLock is the single atomic point that makes the new checkpoint the recovery anchor; only then are old WAL segments recycled via KeepLogSeg + RemoveOldXlogFiles. The shutdown branch collapses the two-record scheme to one XLOG_CHECKPOINT_SHUTDOWN because no concurrent WAL insertion is possible.

CheckPointGuts: flushing all dirty state

CheckPointGuts(checkPointRedo, flags) is the I/O-heavy phase:

// CheckPointGuts — src/backend/access/transam/xlog.c
static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
    CheckPointRelationMap();
    CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);
    CheckPointSnapBuild();
    CheckPointLogicalRewriteHeap();
    CheckPointReplicationOrigin();

    /* Write out all dirty data in SLRUs and main buffer pool */
    CheckPointCLOG();
    CheckPointCommitTs();
    CheckPointSUBTRANS();
    CheckPointMultiXact();
    CheckPointPredicate();
    CheckPointBuffers(flags);      /* <-- main shared_buffers flush */

    /* Fsync everything */
    ProcessSyncRequests();

    /* 2PC state last */
    CheckPointTwoPhase(checkPointRedo);
}

CheckPointBuffers calls BufferSync in storage/buffer/bufmgr.c, which iterates over the buffer pool in two passes. The tag pass marks every buffer that was dirty when the checkpoint began with BM_CHECKPOINT_NEEDED and records it in CkptBufferIds[], so that pages dirtied after this point are excluded from this checkpoint (they belong to the next one):

// BufferSync — src/backend/storage/buffer/bufmgr.c
num_to_scan = 0;
for (buf_id = 0; buf_id < NBuffers; buf_id++)
{
    BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
    buf_state = LockBufHdr(bufHdr);

    if ((buf_state & mask) == mask)        /* BM_DIRTY [| BM_PERMANENT] */
    {
        CkptSortItem *item;
        buf_state |= BM_CHECKPOINT_NEEDED;
        item = &CkptBufferIds[num_to_scan++];
        item->buf_id = buf_id;
        item->tsId = bufHdr->tag.spcOid;   /* sort key: tablespace */
        /* ... relNumber / forkNum / blockNum ... */
    }
    UnlockBufHdr(bufHdr, buf_state);
}

The write pass then drains a per-tablespace min-heap so writes are balanced across tablespaces (sorting alone would write one tablespace at a time, starving the others’ hardware). After each buffer it reports progress to CheckpointWriteDelay, which is where pacing happens:

// BufferSync — src/backend/storage/buffer/bufmgr.c
num_processed = 0;
while (!binaryheap_empty(ts_heap))
{
    CkptTsStatus *ts_stat = (CkptTsStatus *)
        DatumGetPointer(binaryheap_first(ts_heap));
    buf_id = CkptBufferIds[ts_stat->index].buf_id;
    bufHdr = GetBufferDescriptor(buf_id);
    num_processed++;

    /* Flag may have been cleared by a backend/bgwriter writing it first. */
    if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
        if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
            num_written++;

    /* Advance per-tablespace progress, re-balance the heap. */
    ts_stat->progress += ts_stat->progress_slice;
    /* ... binaryheap_remove_first / binaryheap_replace_first ... */

    /* Sleep to throttle our I/O rate. */
    CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
}

progress = num_processed / num_to_scan is the fraction-of-work estimate that CheckpointWriteDelay compares against the elapsed-time and elapsed-WAL fractions — that single ratio is the bridge between the buffer-flush loop in bufmgr.c and the checkpoint_completion_target pacing logic in checkpointer.c. Note the BM_CHECKPOINT_NEEDED re-check inside the loop: because a normal backend or the bgwriter may have already written (and cleared the flag on) a tagged buffer, the checkpointer only issues SyncOneBuffer for buffers still carrying the flag, avoiding redundant I/O.
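
The balancing effect of progress_slice can be seen in a small standalone model. This is a sketch under simplifying assumptions: it replaces the binary heap with a linear scan, breaks ties by lowest index, and uses hypothetical names (`ToyTsStatus`, `toy_init_tablespaces`, `toy_next_tablespace`) rather than the bufmgr.c structures.

```c
#include <assert.h>
#include <stddef.h>

/* Per-tablespace progress tracker, loosely modeled on the description above. */
typedef struct ToyTsStatus {
    int    num_to_scan;     /* buffers this tablespace must write */
    int    num_scanned;     /* buffers written so far */
    double progress;        /* position on the shared 0..total scale */
    double progress_slice;  /* total / num_to_scan */
} ToyTsStatus;

/*
 * Pick the unfinished tablespace that is furthest behind (linear scan in
 * place of a heap), charge it one buffer, and return its index; -1 when done.
 */
static int toy_next_tablespace(ToyTsStatus *ts, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++)
    {
        if (ts[i].num_scanned == ts[i].num_to_scan)
            continue;
        if (best < 0 || ts[i].progress < ts[best].progress)
            best = (int) i;
    }
    if (best >= 0)
    {
        ts[best].num_scanned++;
        ts[best].progress += ts[best].progress_slice;
    }
    return best;
}

/* Initialise slices so that every tablespace finishes at progress == total. */
static void toy_init_tablespaces(ToyTsStatus *ts, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += ts[i].num_to_scan;
    for (size_t i = 0; i < n; i++)
    {
        assert(ts[i].num_to_scan > 0);
        ts[i].num_scanned = 0;
        ts[i].progress = 0.0;
        ts[i].progress_slice = (double) total / ts[i].num_to_scan;
    }
}
```

With one tablespace holding six tagged buffers and another holding two, the first four picks go to tablespaces 0, 1, 0, 0: the small tablespace is visited early and then spread out, instead of being written only after the large one has finished.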

flowchart TD
    A["CheckPointBuffers(flags)<br/>-&gt; BufferSync(flags)"] --> B["TAG PASS:<br/>scan all NBuffers under LockBufHdr"]
    B --> C{"buf_state &amp; mask == mask<br/>(BM_DIRTY [| BM_PERMANENT]) ?"}
    C -->|no| B
    C -->|yes| D["set BM_CHECKPOINT_NEEDED<br/>append to CkptBufferIds[num_to_scan++]<br/>(buf_id, tsId, relNumber, forkNum, blockNum)"]
    D --> B
    B --> E["sort CkptBufferIds by tablespace<br/>build per-tablespace min-heap (ts_heap)<br/>progress_slice = num_to_scan / ts.num_to_scan"]
    E --> F{"binaryheap_empty(ts_heap) ?"}
    F -->|yes| K["IssuePendingWritebacks<br/>CheckpointStats.ckpt_bufs_written += num_written"]
    F -->|no| G["pick heap-first tablespace<br/>buf_id = CkptBufferIds[ts.index]<br/>num_processed++"]
    G --> H{"state &amp; BM_CHECKPOINT_NEEDED<br/>still set ?"}
    H -->|"no (backend/bgwriter wrote it)"| I["skip write"]
    H -->|yes| J["SyncOneBuffer -&gt; FlushBuffer<br/>num_written++"]
    I --> L["ts.progress += progress_slice<br/>remove_first or replace_first on heap"]
    J --> L
    L --> M["CheckpointWriteDelay(flags,<br/>num_processed / num_to_scan)"]
    M --> N{"!CHECKPOINT_IMMEDIATE &amp;&amp;<br/>IsCheckpointOnSchedule(progress) ?"}
    N -->|"yes (ahead)"| O["AbsorbSyncRequests<br/>WaitLatch(100 ms)"]
    N -->|"no (behind)"| P["return immediately<br/>(no sleep)"]
    O --> F
    P --> F

Figure 3 — BufferSync (bufmgr.c) and its coupling to checkpoint pacing. The tag pass freezes the working set (BM_CHECKPOINT_NEEDED) so concurrent writers cannot enlarge this checkpoint’s obligation; the write pass drains a per-tablespace min-heap to keep every tablespace’s storage busy. The single load-bearing number that flows out of the loop is num_processed / num_to_scan — passed to CheckpointWriteDelay, which sleeps 100 ms only while IsCheckpointOnSchedule reports the flush is ahead of both the elapsed-time and elapsed-WAL budgets, and never sleeps under CHECKPOINT_IMMEDIATE.

ProcessSyncRequests in storage/sync/sync.c consolidates all fsync requests into one fsync-per-file call. Before this call, AbsorbSyncRequests transfers the fsync queue from CheckpointerShmem->requests[] (posted by backends via ForwardSyncRequest) into the sync module’s internal pending table, deduplicating by file tag.

Checkpoint pacing: CheckpointWriteDelay and IsCheckpointOnSchedule

// CheckpointWriteDelay — postmaster/checkpointer.c
void
CheckpointWriteDelay(int flags, double progress)
{
    if (!(flags & CHECKPOINT_IMMEDIATE) && !ShutdownXLOGPending
        && IsCheckpointOnSchedule(progress))
    {
        AbsorbSyncRequests();
        CheckArchiveTimeout();
        WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT, 100, ...);
        ResetLatch(MyLatch);
    }
}

IsCheckpointOnSchedule(progress) computes two elapsed-fraction estimates:

WAL-based: (current_insert_LSN - ckpt_start_recptr) / (wal_segment_size × CheckPointSegments)
Time-based: elapsed_seconds / checkpoint_timeout

If progress × checkpoint_completion_target is at least as large as both fractions, the checkpoint is on or ahead of schedule and takes a 100 ms sleep. If it is behind either estimate, writes continue without sleeping. This adaptive throttle ensures the flush completes approximately at checkpoint_completion_target × checkpoint_timeout under steady WAL load, leaving the tail of the interval for the synchronous ProcessSyncRequests phase.
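
The schedule test reduces to two comparisons against the scaled progress. The sketch below is a standalone model, assuming all inputs are already expressed as fractions; `toy_on_schedule` is a hypothetical name, and the real function additionally caches the last computed fraction and reads the WAL position and clock itself.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy version of the schedule test. All inputs are fractions:
 *   progress      - share of tagged buffers already processed
 *   target        - checkpoint_completion_target
 *   elapsed_wal   - WAL written since checkpoint start / WAL budget
 *   elapsed_time  - seconds since checkpoint start / checkpoint_timeout
 * Returns true when the flush may sleep, false when it must keep writing.
 */
static bool toy_on_schedule(double progress, double target,
                            double elapsed_wal, double elapsed_time)
{
    assert(target > 0.0 && target <= 1.0);
    double scaled = progress * target;

    if (scaled < elapsed_wal)
        return false;   /* WAL is being consumed faster than we flush */
    if (scaled < elapsed_time)
        return false;   /* the clock is ahead of the flush */
    return true;
}
```

With half the buffers processed and a target of 0.9, the scaled progress is 0.45: the flush sleeps if both elapsed fractions are at or below that, and keeps writing as soon as either one overtakes it.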

CreateRestartPoint: checkpoints during recovery

During archive or streaming recovery the startup process replays WAL records, and every time it encounters a checkpoint record it calls RecoveryRestartPoint to stash the record in XLogCtl->lastCheckPoint. The checkpointer periodically calls CreateRestartPoint:

// CreateRestartPoint — src/backend/access/transam/xlog.c
bool
CreateRestartPoint(int flags)
{
    /* 1. Fetch the most recent safe checkpoint (record LSN and body) from shared memory. */
    SpinLockAcquire(&XLogCtl->info_lck);
    lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr;
    lastCheckPoint = XLogCtl->lastCheckPoint;
    SpinLockRelease(&XLogCtl->info_lck);

    /* 2. Bail if we've already made a restartpoint at this checkpoint. */
    if (lastCheckPoint.redo <= ControlFile->checkPointCopy.redo)
        return false;

    /* 3. Pin the redo pointer. */
    WALInsertLockAcquireExclusive();
    RedoRecPtr = XLogCtl->Insert.RedoRecPtr = lastCheckPoint.redo;
    WALInsertLockRelease();

    /* 4. Flush buffers and SLRUs. */
    CheckPointGuts(lastCheckPoint.redo, flags);

    /* 5. Update pg_control. */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    ControlFile->checkPoint     = lastCheckPointRecPtr;
    ControlFile->checkPointCopy = lastCheckPoint;
    UpdateControlFile();
    LWLockRelease(ControlFileLock);

    /* 6. Recycle old WAL segments. */
    RemoveOldXlogFiles(...);
    return true;
}

CreateRestartPoint differs from CreateCheckPoint in three ways: (a) it does not write a new WAL record — the checkpoint record was already written by the primary; (b) it can only advance to checkpoints the startup process has already replayed; (c) it runs concurrently with PerformWalRecovery (which the startup process is executing in parallel), so care is taken to read lastCheckPoint under info_lck.

flowchart TD
    A["CheckpointerMain<br/>MyBackendType = B_CHECKPOINTER"] --> B["for (;;) loop:<br/>ResetLatch · AbsorbSyncRequests<br/>ProcessCheckpointerInterrupts"]
    B --> C{"ShutdownXLOGPending ||<br/>ShutdownRequestPending ?"}
    C -->|yes| Z["break loop -&gt;<br/>ShutdownXLOG(0,0)<br/>(final shutdown ckpt/rstpt)"]
    C -->|no| D{"CheckpointerShmem-&gt;ckpt_flags != 0 ?"}
    D -->|yes| G["do_checkpoint = true"]
    D -->|no| E{"elapsed_secs &gt;=<br/>CheckPointTimeout ?"}
    E -->|yes| F["do_checkpoint = true<br/>flags |= CHECKPOINT_CAUSE_TIME"]
    E -->|no| W["WaitLatch(cur_timeout)<br/>= min(CheckPointTimeout,<br/>XLogArchiveTimeout)"]
    W --> B
    F --> H
    G --> H["SpinLock(ckpt_lck):<br/>flags |= ckpt_flags · ckpt_flags = 0<br/>ckpt_started++"]
    H --> I["ConditionVariableBroadcast(start_cv)<br/>ckpt_start_recptr = GetInsertRecPtr /<br/>GetXLogReplayRecPtr"]
    I --> J{"do_restartpoint =<br/>RecoveryInProgress() ?<br/>(unless END_OF_RECOVERY)"}
    J -->|"false (primary)"| K["CreateCheckPoint(flags)"]
    J -->|"true (standby)"| L["CreateRestartPoint(flags)"]
    L --> L1{"lastCheckPoint.redo &lt;=<br/>checkPointCopy.redo ?"}
    L1 -->|yes| L2["return false<br/>(no new replayed ckpt;<br/>retry in 15 s)"]
    L1 -->|no| L3["pin RedoRecPtr = lastCheckPoint.redo<br/>CheckPointGuts(redo, flags)<br/>update ControlFile · RemoveOldXlogFiles"]
    K --> M["smgrdestroyall"]
    L2 --> M
    L3 --> M
    M --> N["SpinLock(ckpt_lck):<br/>ckpt_done = ckpt_started<br/>ConditionVariableBroadcast(done_cv)"]
    N --> B

Figure 2 — the CheckpointerMain event loop (checkpointer.c) and its restartpoint branch. Each iteration first drains the fsync queue (AbsorbSyncRequests), then decides to act on either a requested checkpoint (ckpt_flags nonzero) or a timed one (elapsed_secs >= CheckPointTimeout). The branch on RecoveryInProgress() is what splits a primary (CreateCheckPoint) from a standby (CreateRestartPoint); the CHECKPOINT_END_OF_RECOVERY flag forces the full checkpoint path even during recovery. The start_cv / done_cv broadcasts bracket the work so backends blocked in RequestCheckpoint observe start and completion. Note the restartpoint early-out: if no checkpoint WAL record has been replayed since the last restartpoint (lastCheckPoint.redo <= checkPointCopy.redo), it returns false and the loop retries 15 s later.

WAL segment recycling and max_wal_size

After updating pg_control, both CreateCheckPoint and CreateRestartPoint call RemoveOldXlogFiles with a horizon computed from the new redo pointer. The algorithm:

Convert RedoRecPtr to a segment number _logSegNo.
Call KeepLogSeg(recptr, &_logSegNo) to pull the horizon back for wal_keep_size and for any replication slot that still needs an older LSN.
Call InvalidateObsoleteReplicationSlots to invalidate slots whose required WAL is no longer allowed to be retained; if any slot was invalidated, recompute the horizon from RedoRecPtr.
Decrement _logSegNo and pass it to RemoveOldXlogFiles, which recycles (renames to a higher segment name) or removes segments older than the horizon.
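
The steps above can be sketched as a single horizon computation. This is a simplified standalone model, not the xlog.c logic: it assumes 16 MB segments, treats the keep-size and the slot position as plain inputs, ignores slot invalidation, and uses hypothetical names (`toy_lsn_to_seg`, `toy_removal_horizon`).

```c
#include <assert.h>
#include <stdint.h>

#define TOY_WAL_SEGMENT_SIZE (16u * 1024u * 1024u)  /* assumed default segment size */

/* Segment number containing byte position `lsn`. */
static uint64_t toy_lsn_to_seg(uint64_t lsn)
{
    return lsn / TOY_WAL_SEGMENT_SIZE;
}

/*
 * Newest segment number that may be removed. The horizon starts at the redo
 * pointer's segment, is pulled back by a keep-size in bytes and by the oldest
 * LSN a slot still needs (0 = no slot), then steps back by one.
 * Returns -1 when nothing may be removed.
 */
static int64_t toy_removal_horizon(uint64_t redo, uint64_t insert_pos,
                                   uint64_t keep_bytes, uint64_t slot_min_lsn)
{
    assert(redo <= insert_pos);
    uint64_t seg = toy_lsn_to_seg(redo);

    if (keep_bytes > 0)
    {
        uint64_t keep_segs = keep_bytes / TOY_WAL_SEGMENT_SIZE;
        uint64_t cur = toy_lsn_to_seg(insert_pos);
        uint64_t keep_from = cur > keep_segs ? cur - keep_segs : 0;
        if (keep_from < seg)
            seg = keep_from;
    }
    if (slot_min_lsn != 0 && toy_lsn_to_seg(slot_min_lsn) < seg)
        seg = toy_lsn_to_seg(slot_min_lsn);

    return (int64_t) seg - 1;
}
```

With the redo pointer in segment 10 and the insert position in segment 12, the horizon is 9 with no retention settings, 7 with a four-segment keep-size, and 2 when a slot still needs WAL from segment 3.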

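CheckPointDistanceEstimate maintains a moving estimate of the bytes of WAL generated per inter-checkpoint interval. RemoveOldXlogFiles uses it (through XLOGfileslop) to decide how many old segments to recycle for future use instead of deleting them, while PreallocXlogFiles creates the next segment ahead of need so that the cost of creating and zero-filling a file stays off the WAL insertion path.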
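
The update rule for the estimate, as summarized in the source walkthrough (rise immediately, otherwise 0.9 × old + 0.1 × new), fits in a few lines. `toy_update_distance_estimate` is a hypothetical standalone function written for illustration, not the xlog.c routine.

```c
#include <assert.h>

/*
 * Toy moving estimate: jump up at once when the latest interval wrote more
 * WAL than the estimate, otherwise decay toward the new value with weights
 * 0.9 (old) and 0.1 (new).
 */
static double toy_update_distance_estimate(double estimate, double nbytes)
{
    assert(estimate >= 0.0 && nbytes >= 0.0);
    if (estimate < nbytes)
        return nbytes;
    return 0.90 * estimate + 0.10 * nbytes;
}
```

The asymmetry is deliberate: a single busy interval raises the estimate straight away, while a quiet interval lowers it only gradually, so a short lull does not cause segments to be deleted that the next burst would need again.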

Full-page writes and the checkpoint connection

full_page_writes is a GUC, on by default; its effective value is recorded in the CheckPoint struct (fullPageWrites) and in XLogCtl->Insert.fullPageWrites. It is a server-wide setting that can be changed only in postgresql.conf (or with ALTER SYSTEM) followed by a reload, not with a per-session SET. While it is on, the first modification of any buffer page after a checkpoint's redo pointer must be logged as a full-page image (FPI). This ensures recovery can reconstruct a torn page (written partially by a crash) by replacing it entirely with the logged copy rather than applying a delta that would be meaningless on a torn page. The FPI requirement starts over at each checkpoint because recovery begins at the new redo pointer: an image logged before that point will not be replayed, so it can no longer protect the page.
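
The per-page decision can be modeled as a comparison between the page's LSN and the current redo pointer. The sketch below is a toy under that assumption; `toy_needs_full_page_image` and `toy_modify_page` are hypothetical names, and the model ignores cases such as base backups that force full-page images regardless of the setting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy full-page-image rule: a page needs a full image in the WAL record when
 * full-page writes are on and the page has not been modified since the
 * current redo pointer (its LSN is at or before it).
 */
static bool toy_needs_full_page_image(bool full_page_writes,
                                      uint64_t page_lsn, uint64_t redo_ptr)
{
    return full_page_writes && page_lsn <= redo_ptr;
}

/* Apply one change at `record_lsn`; returns whether it carried a full image. */
static bool toy_modify_page(uint64_t *page_lsn, uint64_t record_lsn,
                            uint64_t redo_ptr, bool full_page_writes)
{
    assert(record_lsn > *page_lsn);
    bool fpi = toy_needs_full_page_image(full_page_writes, *page_lsn, redo_ptr);
    *page_lsn = record_lsn;   /* the page now reflects this record */
    return fpi;
}
```

A page last touched at LSN 900 gets a full image on its first change after a redo pointer of 1000, none on the second change, and a full image again once the next checkpoint moves the redo pointer past the page's LSN.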

Source Walkthrough

postmaster/checkpointer.c

CheckpointerMain is the process entry point. It sets MyBackendType = B_CHECKPOINTER, registers signal handlers (SIGINT → ReqShutdownXLOG, SIGUSR2 → SignalHandlerForShutdownRequest), and runs the main event loop. The loop: (1) absorbs fsync requests via AbsorbSyncRequests; (2) checks ckpt_flags for pending requests; (3) fires a timed checkpoint if elapsed_secs ≥ CheckPointTimeout; (4) calls CreateCheckPoint or CreateRestartPoint as appropriate; (5) signals waiting backends via done_cv; (6) sleeps until the earlier of the next checkpoint timeout and XLogArchiveTimeout.

CheckpointerShmemInit allocates and zero-initialises CheckpointerShmemStruct in shared memory, sizing the requests[] array at min(NBuffers, MAX_CHECKPOINT_REQUESTS) entries.

RequestCheckpoint is called by backends. It ORs flags into ckpt_flags, sets the checkpointer’s latch, then optionally waits via start_cv + done_cv condition variables for the checkpoint to complete and succeed.

ForwardSyncRequest posts a (FileTag, SyncRequestType) pair into requests[] under CheckpointerCommLock. If the queue is more than half full it nudges the checkpointer latch. If the queue is completely full it tries CompactCheckpointerRequestQueue (a deduplication pass using an in-memory hash table); if compaction fails, it returns false, telling the caller to fsync directly.

CheckpointWriteDelay is called by BufferSync after each page write. It sleeps 100 ms when IsCheckpointOnSchedule says the flush is ahead, absorbs fsync requests every WRITES_PER_ABSORB = 1000 writes otherwise, and skips sleeping entirely when CHECKPOINT_IMMEDIATE is set.
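
The per-write decision can be modelled as a small state function. The sketch below is an approximation under stated assumptions: DelayAction and write_delay_action are hypothetical names, the 100 ms sleep and the actual absorb call are reduced to return codes, and shutdown-related conditions are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WRITES_PER_ABSORB 1000

typedef enum {
    DELAY_NAP,          /* absorb requests, then sleep about 100 ms */
    DELAY_ABSORB_ONLY,  /* behind schedule: absorb requests, no sleep */
    DELAY_NONE          /* behind schedule: keep writing */
} DelayAction;

/*
 * immediate:      models the CHECKPOINT_IMMEDIATE flag.
 * on_schedule:    result of the schedule check for the current progress.
 * absorb_counter: writes remaining until the next forced absorb.
 */
static DelayAction
write_delay_action(bool immediate, bool on_schedule, int *absorb_counter)
{
    assert(absorb_counter != NULL);

    if (!immediate && on_schedule) {
        *absorb_counter = WRITES_PER_ABSORB;
        return DELAY_NAP;
    }
    if (--(*absorb_counter) <= 0) {
        *absorb_counter = WRITES_PER_ABSORB;
        return DELAY_ABSORB_ONLY;
    }
    return DELAY_NONE;
}
```

The counter exists so that a checkpoint running at full speed still drains the fsync request queue periodically instead of letting backends fill it.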

AbsorbSyncRequests copies requests[] out of shared memory under CheckpointerCommLock, clears num_requests, then calls RememberSyncRequest for each entry (inside a critical section, because failing after clearing the queue would lose fsync obligations).

access/transam/xlog.c — checkpoint path

CreateCheckPoint (line 6927): full online/shutdown checkpoint implementation. Key sub-calls: SyncPreCheckpoint, START_CRIT_SECTION, WALInsertLockAcquireExclusive (to read Insert->CurrBytePos for shutdown redo pointer), WALInsertLockRelease, XLogInsert(XLOG_CHECKPOINT_REDO) for online checkpoints, CheckPointGuts, XLogInsert(XLOG_CHECKPOINT_ONLINE / XLOG_CHECKPOINT_SHUTDOWN), XLogFlush, UpdateControlFile, WakeupWalSummarizer, RemoveOldXlogFiles, PreallocXlogFiles.
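
For the shutdown case, the redo pointer is the current insert position, adjusted so that it never points at a page boundary where a page header would be written first. The sketch below illustrates that adjustment only. The constants assume a default build (8 kB WAL pages, 16 MB segments, 24-byte short and 40-byte long page headers), and insert_freespace and shutdown_redo_pointer are hypothetical names, not the PostgreSQL functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed default-build values; real builds may differ. */
#define WAL_BLCKSZ   8192
#define WAL_SEG_SIZE (16 * 1024 * 1024)
#define SHORT_HDR    24     /* page header on ordinary pages */
#define LONG_HDR     40     /* page header on the first page of a segment */

/* Bytes left on the current WAL page; 0 means "exactly at a page boundary". */
static uint64_t
insert_freespace(uint64_t lsn)
{
    uint64_t off = lsn % WAL_BLCKSZ;

    return off == 0 ? 0 : WAL_BLCKSZ - off;
}

/* Where the next record would really start, given the raw insert position. */
static uint64_t
shutdown_redo_pointer(uint64_t cur_insert)
{
    assert(WAL_SEG_SIZE % WAL_BLCKSZ == 0);

    if (insert_freespace(cur_insert) == 0) {
        /* Skip the header that will precede the next record. */
        if (cur_insert % WAL_SEG_SIZE == 0)
            cur_insert += LONG_HDR;
        else
            cur_insert += SHORT_HDR;
    }
    return cur_insert;
}
```

With this adjustment the redo pointer equals the start LSN of the checkpoint record that is about to be inserted, which is the property a shutdown checkpoint relies on.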

CheckPointGuts (line 7550): dispatches to per-subsystem checkpoint routines and to CheckPointBuffers / ProcessSyncRequests.

RecoveryRestartPoint (line 7590): called by the startup process each time it replays a checkpoint WAL record; stashes the record in XLogCtl->lastCheckPoint under info_lck.

CreateRestartPoint (line 7631): called by the checkpointer during recovery. Reads lastCheckPoint, pins RedoRecPtr, calls CheckPointGuts, updates ControlFile, and recycles WAL.

UpdateCheckPointDistanceEstimate (line 6824): slow-decay moving average of bytes of WAL per checkpoint interval. The estimate jumps to the new value immediately when the new distance is larger; otherwise it moves toward the new value by 10 % of the gap per checkpoint (formula: 0.9 × old + 0.1 × new).
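
The asymmetric update can be sketched as a pure function. The name next_distance_estimate is hypothetical; the coefficients are the 0.9/0.1 pair quoted above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the new distance estimate given the previous estimate and the
 * number of WAL bytes generated during the checkpoint that just finished.
 */
static double
next_distance_estimate(double estimate, uint64_t nbytes)
{
    assert(estimate >= 0.0);

    if (estimate < (double) nbytes)
        return (double) nbytes;                     /* rise immediately */
    return 0.90 * estimate + 0.10 * (double) nbytes; /* decay slowly */
}
```

Rising fast and falling slowly biases the estimate upward, so a single quiet interval does not cause the server to recycle too few WAL segments before the next burst.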

IsCheckpointOnSchedule (line 841 of checkpointer.c): computes elapsed_xlogs and elapsed_time fractions; returns true (on schedule) only if both are less than progress × CheckPointCompletionTarget.
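
A minimal sketch of the schedule test, assuming the two elapsed fractions have already been computed by the caller. The name checkpoint_on_schedule and the parameter layout are hypothetical; the real function derives the fractions from WAL position and wall-clock time.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * progress:          fraction of buffers written so far, in [0, 1].
 * completion_target: models checkpoint_completion_target.
 * elapsed_xlogs:     fraction of the WAL budget consumed since the checkpoint began.
 * elapsed_time:      fraction of the checkpoint timeout elapsed since it began.
 */
static bool
checkpoint_on_schedule(double progress, double completion_target,
                       double elapsed_xlogs, double elapsed_time)
{
    assert(progress >= 0.0 && progress <= 1.0);

    /* Scale progress so the flush aims to finish early in the interval. */
    progress *= completion_target;

    if (progress < elapsed_xlogs)
        return false;           /* WAL is being consumed faster than we write */
    if (progress < elapsed_time)
        return false;           /* the clock is ahead of us */
    return true;
}
```

For example, with half the buffers written and a completion target of 0.9, the flush counts as on schedule only while no more than 45 % of either budget has been used.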

include/catalog/pg_control.h

CheckPoint struct (line 35): body of checkpoint WAL records; also copied into ControlFileData.checkPointCopy.

ControlFileData struct (line 104): the pg_control file layout. state (a DBState enum), checkPoint (LSN of last checkpoint record), checkPointCopy (copy of CheckPoint body), minRecoveryPoint (standby must replay at least this far).

DBState enum: DB_STARTUP, DB_SHUTDOWNED, DB_SHUTDOWNED_IN_RECOVERY, DB_SHUTDOWNING, DB_IN_CRASH_RECOVERY, DB_IN_ARCHIVE_RECOVERY, DB_IN_PRODUCTION.

Position hints (commit 273fe94, 2026-06-05)

Symbol	File	Approx. line
`CheckpointerMain`	`src/backend/postmaster/checkpointer.c`	182
`CheckpointerShmemStruct`	`src/backend/postmaster/checkpointer.c`	107
`CheckpointerShmemInit`	`src/backend/postmaster/checkpointer.c`	959
`RequestCheckpoint`	`src/backend/postmaster/checkpointer.c`	1003
`ForwardSyncRequest`	`src/backend/postmaster/checkpointer.c`	1153
`AbsorbSyncRequests`	`src/backend/postmaster/checkpointer.c`	1329
`CompactCheckpointerRequestQueue`	`src/backend/postmaster/checkpointer.c`	1219
`CheckpointWriteDelay`	`src/backend/postmaster/checkpointer.c`	772
`IsCheckpointOnSchedule`	`src/backend/postmaster/checkpointer.c`	841
`CreateCheckPoint`	`src/backend/access/transam/xlog.c`	6927
`CheckPointGuts`	`src/backend/access/transam/xlog.c`	7550
`RecoveryRestartPoint`	`src/backend/access/transam/xlog.c`	7590
`CreateRestartPoint`	`src/backend/access/transam/xlog.c`	7631
`UpdateCheckPointDistanceEstimate`	`src/backend/access/transam/xlog.c`	6824
`XLogBytePosToRecPtr`	`src/backend/access/transam/xlog.c`	1861
`INSERT_FREESPACE` macro	`src/backend/access/transam/xlog.c`	581
`BufferSync`	`src/backend/storage/buffer/bufmgr.c`	3353
`CheckPointBuffers`	`src/backend/storage/buffer/bufmgr.c`	4219
`CkptTsStatus` struct	`src/backend/storage/buffer/bufmgr.c`	106
`SyncOneBuffer`	`src/backend/storage/buffer/bufmgr.c`	521
`CheckPoint` struct	`src/include/catalog/pg_control.h`	35
`ControlFileData` struct	`src/include/catalog/pg_control.h`	104
`DBState` enum	`src/include/catalog/pg_control.h`	89
`XLOG_CHECKPOINT_REDO`	`src/include/catalog/pg_control.h`	82
`XLOG_CHECKPOINT_ONLINE`	`src/include/catalog/pg_control.h`	69
`XLOG_CHECKPOINT_SHUTDOWN`	`src/include/catalog/pg_control.h`	68

Source verification (as of 2026-06-05)

Verified against commit 273fe94 (REL_18_STABLE, PostgreSQL 18.x) under /data/hgryoo/references/postgres.

Confirmed:

CheckpointerMain runs the main event loop and calls CreateCheckPoint or CreateRestartPoint depending on RecoveryInProgress(). Confirmed in postmaster/checkpointer.c lines 349–546.
CheckpointerShmemStruct layout (including start_cv, done_cv, FLEXIBLE_ARRAY_MEMBER requests) confirmed at lines 107–131.
RequestCheckpoint six-step handshake confirmed at lines 1003–1130.
ForwardSyncRequest / CompactCheckpointerRequestQueue / AbsorbSyncRequests confirmed at lines 1153–1371.
CreateCheckPoint two-record scheme (XLOG_CHECKPOINT_REDO then XLOG_CHECKPOINT_ONLINE) confirmed at lines 7086–7109 and 7250–7256.
CheckPointGuts call chain confirmed at lines 7550–7577.
CreateRestartPoint confirmed at lines 7631–7810.
CheckPoint struct confirmed at pg_control.h lines 35–65.
ControlFileData confirmed at lines 104–239.
DBState enum confirmed at lines 89–98.
XLOG_CHECKPOINT_REDO = 0xE0 confirmed at line 82.
UpdateCheckPointDistanceEstimate moving-average formula (0.9/0.1 decay) confirmed at lines 6848–6853.
WakeupWalSummarizer() called after checkpoint record flush (WAL summarization, introduced in PostgreSQL 17) confirmed at line 7337.
Redo-pointer fixation: shutdown branch (XLogBytePosToRecPtr(CurrBytePos) with the INSERT_FREESPACE page-header skip) confirmed at lines 7044–7077; online branch (XLOG_CHECKPOINT_REDO insert then checkPoint.redo = RedoRecPtr) confirmed at lines 7094–7108.
BufferSync two-pass structure (tag pass setting BM_CHECKPOINT_NEEDED into CkptBufferIds[], then per-tablespace min-heap write pass calling CheckpointWriteDelay(flags, num_processed / num_to_scan)) confirmed in storage/buffer/bufmgr.c lines 3353–3615; CkptTsStatus struct at 106–128.

Not verified / out of scope:

ProcessSyncRequests internals (in storage/sync/sync.c): covered in the smgr/md doc.
The pg_stat_checkpointer view: driven by PendingCheckpointerStats and reported via pgstat_report_checkpointer; internals in utils/activity/pgstat_checkpointer.c (out of scope for this doc).

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Comparison: Oracle and SQL Server

Oracle uses a fuzzy checkpoint with incremental flushing. Dirty buffers are written by the database writer (DBWn) processes; the checkpoint process (CKPT) records checkpoint progress in the control file and data file headers rather than writing buffers itself. The redo pointer equivalent is the checkpoint position (identified by a System Change Number, SCN) stored in the control file. Oracle’s checkpoint design is notable for its incremental checkpoint: the dirty-buffer write proceeds continuously (not just at checkpoint intervals), guided by a checkpoint queue ordered by the point at which each buffer was first dirtied. PostgreSQL’s bgwriter also writes dirty buffers between checkpoints, but it is fully separated from the checkpointer and does not advance the redo pointer.

SQL Server uses indirect checkpoints (available since SQL Server 2012) that target a configurable recovery time (TARGET_RECOVERY_TIME) rather than a fixed time or log-volume interval, achieving smoother I/O at the cost of more frequent background flushing. PostgreSQL’s checkpoint_completion_target achieves similar smoothing, but only within each checkpoint.

Comparison: InnoDB (MySQL)

InnoDB also uses fuzzy checkpointing, with dirty pages flushed by background threads (the master thread in older releases, dedicated page cleaner threads in newer ones). The redo log is circular; when free log space runs low, InnoDB forces a flush of the oldest dirty pages so the checkpoint can advance and log space can be reclaimed. This is analogous to PostgreSQL’s CHECKPOINT_CAUSE_XLOG trigger when WAL usage approaches max_wal_size. InnoDB uses innodb_io_capacity as an explicit I/O rate governor; PostgreSQL has no absolute I/O budget and instead paces checkpoint writes with checkpoint_completion_target and the 100 ms naps in CheckpointWriteDelay.

Comparison: CUBRID

CUBRID (see knowledge/code-analysis/cubrid/cubrid-checkpoint.md) uses a similar fuzzy checkpoint driven by a dedicated checkpointer thread. The redo pointer concept is equivalent but CUBRID stores it in a dedicated log anchor file rather than embedding it in WAL records. CUBRID does not have an analog of PostgreSQL’s XLOG_CHECKPOINT_REDO two-record scheme; instead the checkpoint record itself doubles as the redo-pointer anchor. PostgreSQL’s separation of redo-record from checkpoint-record is a direct consequence of supporting concurrent WAL insertion during the flush.

Research frontiers

Eliminating checkpoint spikes entirely. The LeanStore (Leis et al.) and Umbra (Neumann and Freitag) storage engines use a continuously running eviction loop that writes dirty pages to a shadow location before in-place overwriting, making distinct checkpoint phases unnecessary. PostgreSQL’s checkpoint_completion_target approximates this but still has a synchronous ProcessSyncRequests phase.

Group commit and checkpoint interaction. High-frequency checkpoints interact with group commit: if checkpoint_timeout is short relative to transaction commit rate, the WAL-flush cost at commit is reduced (recent commits are already flushed as part of the checkpoint). PostgreSQL does not explicitly exploit this coupling; InnoDB’s group commit documentation acknowledges a similar interplay.

WAL summarization and incremental backup. WAL summarization was introduced in PostgreSQL 17; in the verified PostgreSQL 18 tree, CreateCheckPoint calls WakeupWalSummarizer() after the checkpoint record is flushed. The WAL summarizer (postmaster/walsummarizer.c) tracks which blocks were modified over a range of WAL and writes summary files under pg_wal/summaries. An incremental backup requested with pg_basebackup --incremental relies on these summaries to skip unmodified blocks. The checkpoint is the natural epoch boundary for summarization: between two checkpoints every modified page has a known set of WAL records. See postgres-archiving-walsummary.md for the full design.

Sources

src/backend/postmaster/checkpointer.c — checkpointer process, request queue, pacing logic.
src/backend/access/transam/xlog.c — CreateCheckPoint, CreateRestartPoint, CheckPointGuts, redo-pointer management, WAL segment recycling.
src/include/catalog/pg_control.h — CheckPoint, ControlFileData, DBState.
knowledge/research/dbms-papers/aries.md — ARIES theory (Mohan et al. 1992); redo pointer and fuzzy checkpoint formalization.
knowledge/research/dbms-general/database-internals.md — Petrov ch. 5 “Transaction Processing and Recovery”; checkpoint pacing and I/O amplification framing.
knowledge/research/dbms-general/database-system-concepts.md — Silberschatz ch. 19 “Recovery System”; WAL, stable storage, shadow paging.
knowledge/code-analysis/postgres/postgres-xlog-wal.md — WAL buffer management, XLogInsert, XLogFlush, LSN mechanics.
knowledge/code-analysis/postgres/postgres-recovery-redo.md — recovery modes, PerformWalRecovery, timeline management.
knowledge/code-analysis/postgres/postgres-buffer-manager.md — BufferSync, FlushBuffer, dirty-buffer tracking.