PostgreSQL Data Checksums — Page Integrity Verification

A relational engine’s most basic promise is that a row written yesterday reads back unchanged today. Between the INSERT and the SELECT, however, the 8 KB page that holds the row passes through a long chain of components that the DBMS does not control: the OS page cache, the filesystem, the block layer, an HBA or NVMe controller, the drive’s own firmware and DRAM cache, and — for networked storage — a SAN fabric. Any link can flip a bit, mis-route a write to the wrong sector, return a stale copy of a block, or hand back a torn (partially written) page after a power loss. Database Internals (Petrov 2019, ch. 3 “File Formats”, §“Checksumming”; captured in research/dbms-general/database-internals.md) states the problem plainly: “Files on disk may get damaged or corrupted by software bugs and hardware failures. To identify these problems preemptively and avoid propagating corrupt data to other subsystems or even nodes, we can use checksums and cyclic redundancy checks (CRCs).”

The crucial adjective is silent. A drive that returns an explicit read error is easy to handle — the DBMS gets an EIO and reports it. The dangerous case is silent data corruption: the storage stack returns success together with wrong bytes. Without an end-to-end integrity check the engine cannot tell a corrupted page from a good one; it dutifully follows the now-garbage line pointers, mis-reads transaction ids, and either crashes far from the real fault or — worse — returns wrong answers and propagates the damage into backups, replicas, and dumps. A checksum stored with the data turns silent corruption into loud corruption: a mismatch at read time is a detectable, attributable event.

Database Internals is careful to rank the guarantee strength of the three families of integrity codes, because the choice is a real engineering trade-off:

  • Checksums “provide the weakest form of guarantee and aren’t able to detect corruption in multiple bits. They’re usually computed by using XOR with parity checks or summation.”
  • CRCs “can help detect burst errors (e.g., when multiple consecutive bits got corrupted)” via polynomial division, and multi-bit detection matters “since a significant percentage of failures in communication networks and storage devices manifest this way.”
  • Cryptographic hashes resist intentional tampering. The book’s WARNING is explicit: “Noncryptographic hashes and CRCs should not be used to verify whether or not the data has been tampered with… The main goal of CRC is to make sure that there were no unintended and accidental changes in data.”
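To make the "weakest guarantee" point concrete, here is a minimal sketch of a byte-wise XOR checksum. The function names are hypothetical and the code is a toy, not anything PostgreSQL ships. It shows that a single flipped bit changes the XOR checksum, while the same bit flipped in two different bytes cancels out and goes unnoticed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy XOR "checksum": folds every byte of buf into a single byte. */
static uint8_t xor_checksum(const uint8_t *buf, size_t len)
{
    uint8_t sum = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        sum ^= buf[i];
    return sum;
}

/*
 * Flip the bits in `mask` in two distinct bytes (a != b) and report whether
 * the XOR checksum changed.  The two flips cancel, so this returns 0.
 */
static int xor_detects_double_flip(uint8_t *buf, size_t len,
                                   size_t a, size_t b, uint8_t mask)
{
    uint8_t before = xor_checksum(buf, len);

    buf[a] ^= mask;
    buf[b] ^= mask;
    return xor_checksum(buf, len) != before;
}
```

This is the failure mode that motivates stronger mixing: any code in which each input bit influences only one output bit can be defeated by paired errors.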

This last point fixes the threat model. A data-page checksum is an accident detector, not a tamper detector. An attacker who can rewrite a page can recompute its checksum just as easily as the engine can. The design target is therefore random corruption (bit rot, mis-directed writes, RAM flips on a resident clean page), and the design goal is to maximize detection probability per CPU cycle — not cryptographic strength.

The book also explains why the code should be page-granular, which is exactly the granularity PostgreSQL chose: “Since computing a checksum over the whole file is often impractical… page checksums are usually computed on pages and placed in the page header. This way, checksums can be more robust (since they are performed on a small subset of the data), and the whole file doesn’t have to be discarded if corruption is contained in a single page.” A per-page code is verified on exactly the I/O unit the buffer manager already reads and writes, costs nothing to locate (it lives in the page header), and localizes blame to a single 8 KB block.

Two design knobs follow from this framing and organize the rest of the document:

  1. Which algorithm, and how fast? Because the checksum is recomputed on every page read and write, it sits on the hottest I/O path in the engine. On a workload whose working set fits in the OS cache but not shared buffers, pages stream in at memory bandwidth and the checksum can become the bottleneck. So the algorithm must be vectorizable, not merely correct.

  2. When is it computed and verified, and what happens on mismatch? A checksum is only as good as its enforcement points. The engine must stamp it at the last moment before the bytes leave for disk (after all in-memory mutation, including non-WAL-logged hint bits), verify it the instant bytes arrive, and have a defined policy — fail loudly, count the event, or (under operator override) limp on — when verification fails.

Engines that ship page-level integrity checks converge on a recurring set of choices. Naming them first lets the PostgreSQL symbols in the next section read as points in a shared design space.

The checksum field lives in the page header, and excludes itself

Every engine that does this reserves a small fixed field in the page header for the integrity code and computes the code over the rest of the page. The self-reference problem — you cannot checksum a field that holds the checksum — is solved the same way everywhere: the field is treated as zero (or skipped) during computation. Verification re-zeroes it (or recomputes the expected value with it zeroed) and compares. The field is small (16–32 bits is typical for a per-page code) because page-header space is precious and a 16-bit code already yields a ~1-in-65 000 false-negative rate, which combined with the independent header sanity checks is more than enough for accident detection.
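The self-exclusion convention can be sketched on a toy page. Everything here is an assumption for illustration: the 64-byte page size, the field offset, and the additive 16-bit code are hypothetical stand-ins, not any engine's real layout or algorithm.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE    64   /* hypothetical; real pages are far larger */
#define TOY_CKSUM_OFFSET 8    /* hypothetical offset of the 16-bit field */

/* Toy 16-bit code: plain sum of all bytes (illustrative only). */
static uint16_t toy_code(const uint8_t *p, size_t len)
{
    uint16_t acc = 0;

    for (size_t i = 0; i < len; i++)
        acc = (uint16_t) (acc + p[i]);
    return acc;
}

/* Compute the code with the checksum field treated as zero. */
static uint16_t toy_page_code(const uint8_t *page)
{
    uint8_t tmp[TOY_PAGE_SIZE];

    assert(page != NULL);
    memcpy(tmp, page, TOY_PAGE_SIZE);
    memset(tmp + TOY_CKSUM_OFFSET, 0, sizeof(uint16_t));
    return toy_code(tmp, TOY_PAGE_SIZE);
}

/* Write path: store the code in the header field. */
static void toy_page_stamp(uint8_t *page)
{
    uint16_t c = toy_page_code(page);

    memcpy(page + TOY_CKSUM_OFFSET, &c, sizeof c);
}

/* Read path: recompute with the field zeroed and compare. */
static int toy_page_verify(const uint8_t *page)
{
    uint16_t stored;

    memcpy(&stored, page + TOY_CKSUM_OFFSET, sizeof stored);
    return stored == toy_page_code(page);
}
```

Because the field is zeroed before computing, stamping an already-stamped page yields the same value, which is what makes verification a simple recompute-and-compare.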

Compute on write-out, verify on read-in — at the buffer/storage boundary

The natural enforcement points are the two ends of the buffer manager’s I/O:

  • Write path. Just before a dirty page is handed to the storage layer (write()/pwrite()), the engine computes the checksum over the about-to-be-written image and stamps the header. Critically this must happen after every in-memory mutation, including ones that are not WAL-logged.
  • Read path. Just after the storage layer returns a page image, and before the engine trusts any field in it, the checksum is recomputed and compared.

This placement means the checksum protects the at-rest representation only. It is not part of the WAL-protected page image and is generally recomputed fresh on every flush rather than logged, so it cannot detect a logic bug that corrupts a page in memory before the checksum is stamped. It is purpose-built to catch corruption that happens between a correct stamp and a later read — i.e., in the storage stack.

The hint-bit hazard: checksum a stable image

A no-overwrite MVCC engine writes hint bits (cached visibility decisions) into pages opportunistically, often under nothing stronger than a shared lock/latch, and often without WAL. That creates a subtle race with checksumming: if the engine computes a checksum over a shared page while another backend flips a hint bit mid-computation, the stamped checksum will not match the bytes that actually reach disk. Engines solve this either by (a) taking a private copy of the page and checksumming the copy, or (b) ensuring hint-bit writes are themselves serialized against flush. Choice (a) is the common one because it keeps the hot mutation path lock-free.
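Choice (a) can be sketched as follows, again on a hypothetical toy page with an additive code (all names and sizes are assumptions). The shared page is only read; the checksum is computed over, and stamped into, a private snapshot, and that snapshot is the image handed to the write call.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE    64   /* hypothetical page size */
#define TOY_CKSUM_OFFSET 8    /* hypothetical offset of the 16-bit field */

/* Toy additive code over every byte except the checksum field itself. */
static uint16_t toy_sum_excluding_field(const uint8_t *page)
{
    uint16_t acc = 0;

    for (size_t i = 0; i < TOY_PAGE_SIZE; i++)
        if (i < TOY_CKSUM_OFFSET || i >= TOY_CKSUM_OFFSET + 2)
            acc = (uint16_t) (acc + page[i]);
    return acc;
}

/*
 * Snapshot the shared page into caller-owned scratch space, stamp the
 * snapshot, and return it as the image to write.  The shared page is
 * never modified here.
 */
static const uint8_t *stamp_private_copy(const uint8_t *shared, uint8_t *scratch)
{
    uint16_t c;

    assert(shared != NULL && scratch != NULL);
    memcpy(scratch, shared, TOY_PAGE_SIZE);
    c = toy_sum_excluding_field(scratch);
    memcpy(scratch + TOY_CKSUM_OFFSET, &c, sizeof c);
    return scratch;
}

/* Verify an image exactly as a later read would. */
static int toy_verify(const uint8_t *page)
{
    uint16_t stored;

    memcpy(&stored, page + TOY_CKSUM_OFFSET, sizeof stored);
    return stored == toy_sum_excluding_field(page);
}
```

The snapshot may capture a hint bit either before or after it flips; whichever state it captures, the stamped value matches the bytes that are written, which is the only property the checksum needs.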

Cluster-wide on/off, recorded in a control structure

Enabling checksums changes the on-disk format contract for every page in the cluster, so the flag is not per-table or per-session — it is a cluster-wide property recorded once in a control file and consulted by every read and write. Turning it on after the fact requires rewriting (or at least re-stamping) every existing page, which is why engines historically gate it at cluster initialization time and offer a separate offline tool to flip it later.

Defined failure policy with an operator escape hatch

A detected mismatch is, by default, fatal to the read: the engine refuses to return a page it knows is corrupt, raising a clearly-coded error and bumping a visible counter so monitoring can alarm. But operators recovering a damaged cluster sometimes need to read through corruption to salvage what they can, so engines provide an explicit, dangerous override that downgrades the error to a warning (or zeroes the page) for the duration of a recovery session.

| Theory / convention | PostgreSQL name |
| --- | --- |
| Per-page integrity code in the header | PageHeaderData.pd_checksum (16-bit) |
| Page-granular, self-excluding computation | pg_checksum_page zeroes pd_checksum transiently |
| Fast, vectorizable algorithm | FNV-1a-based pg_checksum_block, N_SUMS = 32 parallel sums |
| Mix in physical location | checksum ^= blkno (detect transposed pages) |
| Avoid the zero checksum | (checksum % 65535) + 1 (range 1..65535) |
| Compute on write-out (shared buffer) | PageSetChecksumCopy -> pg_checksum_page, called from FlushBuffer |
| Compute on write-out (private memory) | PageSetChecksumInplace (localbuf, bulk_write, hash overflow) |
| Hint-bit hazard mitigation | PageSetChecksumCopy takes a private pageCopy |
| Verify on read-in | PageIsVerified -> pg_checksum_page, called from buffer read completion |
| Cluster-wide enable flag | ControlFileData.data_checksum_version (0 = off, PG_DATA_CHECKSUM_VERSION = 1) |
| Runtime “are checksums on?” predicate | DataChecksumsEnabled() |
| Enable at initdb | bootstrap.c -k -> bootstrap_data_checksum_version |
| Offline enable/disable/verify tool | pg_checksums (frontend, also calls pg_checksum_page) |
| Failure policy + escape hatch | ignore_checksum_failure, zero_damaged_pages GUCs; PIV_* flags |
| Visible failure counter | pg_stat_database.checksum_failures via pgstat_report_checksum_failures_in_db |
| Generic multi-algorithm helper (CRC32C/SHA) | checksum_helper.c pg_checksum_* (separate from page checksums) |

PostgreSQL data checksums are a cluster-wide, opt-in feature. When a cluster is initialized with initdb -k (--data-checksums), or when checksums are enabled later with the offline pg_checksums tool, every page that the buffer manager writes carries a 16-bit checksum in the pd_checksum field of its page header, and every page the buffer manager reads is verified against it. When the feature is off (the historical default through PG 17; note that PG 18’s initdb defaults -k on, but a cluster can still be initialized without it), the pd_checksum field is simply left at zero and never consulted.

The feature has three moving parts: (1) the algorithm: a single function pg_checksum_page() that turns a page image plus its block number into a uint16; (2) the enforcement points: PageSetChecksumCopy / PageSetChecksumInplace on the write path and PageIsVerified on the read path, each gated by DataChecksumsEnabled(); and (3) the cluster flag: data_checksum_version in pg_control, set at bootstrap.

The checksum lives in a 16-bit slot near the front of every page header, immediately after the page LSN:

// PageHeaderData — src/include/storage/bufpage.h
typedef struct PageHeaderData
{
    PageXLogRecPtr pd_lsn;      /* LSN: next byte after last byte of xlog
                                 * record for last change to this page */
    uint16      pd_checksum;    /* checksum */
    uint16      pd_flags;       /* flag bits, see below */
    LocationIndex pd_lower;     /* offset to start of free space */
    LocationIndex pd_upper;     /* offset to end of free space */
    LocationIndex pd_special;   /* offset to start of special space */
    uint16      pd_pagesize_version;
    TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
    ItemIdData  pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
} PageHeaderData;

Putting pd_checksum inside the checksummed region (it is at byte offset 8, well before pd_lower) is exactly the self-reference problem from the previous section, and pg_checksum_page resolves it by transiently zeroing the field — detailed below. The on-disk format version that says “this cluster uses checksums” is a separate constant, recorded once in the control file:

// version constants — src/include/storage/bufpage.h
#define PG_PAGE_LAYOUT_VERSION 4
#define PG_DATA_CHECKSUM_VERSION 1

The algorithm: FNV-1a, 32-wide, SIMD-friendly

The real code lives not in checksum.c (which is a one-line #include) but in storage/checksum_impl.h, so that external programs — pg_checksums, pg_upgrade, third-party block-inspection tools — can #include the identical algorithm without linking the backend. checksum.c exists only to pull it into the server:

// checksum.c — src/backend/storage/page/checksum.c
#include "storage/checksum.h"
/*
* The actual code is in storage/checksum_impl.h. This is done so that
* external programs can incorporate the checksum code by #include'ing
* that file from the exported Postgres headers. (Compare our CRC code.)
*/
#include "storage/checksum_impl.h" /* IWYU pragma: keep */

The algorithm’s header comment states the performance motivation directly: “The algorithm used to checksum pages is chosen for very fast calculation. Workloads where the database working set fits into OS file cache but not into shared buffers can read in pages at a very fast pace and the checksum algorithm itself can become the largest bottleneck.” It is built on the FNV-1a (Fowler/Noll/Vo) hash, whose primitive folds in data with hash = (hash ^ value) * FNV_PRIME, but with two deliberate departures.

First, plain FNV-1a “has bad mixing of high bits — high order bits in input data only affect high order bits in output data.” PostgreSQL fixes the avalanche by xor-ing the value back in, shifted right 17 bits, and processes 4 bytes at a time:

// CHECKSUM_COMP, FNV_PRIME — src/include/storage/checksum_impl.h
/* prime multiplier of FNV-1a hash */
#define FNV_PRIME 16777619

/*
 * Calculate one round of the checksum.
 */
#define CHECKSUM_COMP(checksum, value) \
do { \
    uint32 __tmp = (checksum) ^ (value); \
    (checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17); \
} while (0)

Second — and this is the SIMD trick — the page is not hashed as one serial stream. The 8 KB page is reinterpreted as a BLCKSZ / (4 * 32) row × 32 column array of uint32, and 32 independent partial sums are advanced in lockstep, one per column. Because the 32 multiplies in an inner iteration have no data dependency on each other, a vectorizing compiler maps them onto SIMD lanes (SSE4.1 pmulld, ARM NEON vmul.i32), hiding the multiply latency. The union reinterprets the page without violating strict aliasing, and each partial sum starts from a distinct random “offset basis” so identical columns do not collapse to identical partials:

// PGChecksummablePage, N_SUMS, checksumBaseOffsets — src/include/storage/checksum_impl.h
/* number of checksums to calculate in parallel */
#define N_SUMS 32

/* Use a union so that this code is valid under strict aliasing */
typedef union
{
    PageHeaderData phdr;
    uint32      data[BLCKSZ / (sizeof(uint32) * N_SUMS)][N_SUMS];
} PGChecksummablePage;

/*
 * Base offsets to initialize each of the parallel FNV hashes into a
 * different initial state.
 */
static const uint32 checksumBaseOffsets[N_SUMS] = {
    0x5B1F36E9, 0xB8525960, 0x02AB50AA, 0x1DE66D2A,
    /* ... 28 more randomly-chosen 32-bit constants ... */
    0x9FBF8C76, 0x15CA20BE, 0xF2CA9FD3, 0x959BD756
};
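The row/column layout implied by the `data[...][N_SUMS]` member can be made explicit with a small index calculation. This is an illustrative sketch assuming the default 8192-byte block size; `LanePos`, `lane_of_byte`, and `rows_per_page` are hypothetical names, not PostgreSQL symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192   /* assumed default block size */
#define N_SUMS 32

typedef struct
{
    size_t row;      /* which inner-loop iteration consumes the word */
    size_t column;   /* which of the 32 partial sums it feeds */
} LanePos;

/* Map a byte offset within the page to its (row, column) in the array. */
static LanePos lane_of_byte(size_t byte_offset)
{
    LanePos pos;
    size_t  word;

    assert(byte_offset < BLCKSZ);
    word = byte_offset / sizeof(uint32_t);   /* row-major word index */
    pos.row = word / N_SUMS;
    pos.column = word % N_SUMS;
    return pos;
}

/* Number of rows: BLCKSZ / (4 * 32). */
static size_t rows_per_page(void)
{
    return BLCKSZ / (sizeof(uint32_t) * N_SUMS);
}
```

For an 8 KB page there are 64 rows, so each partial sum absorbs 64 words. The word holding pd_checksum (byte offset 8) lands in row 0, column 2, and every 128 bytes the column index wraps back to 0.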

pg_checksum_block does the work: seed the 32 partials from the offset table, run the main pass over every row, add two extra rounds of zero “to mix the bits of the last value added,” then xor-fold the 32 partials into one uint32:

// pg_checksum_block — src/include/storage/checksum_impl.h
static uint32
pg_checksum_block(const PGChecksummablePage *page)
{
    uint32      sums[N_SUMS];
    uint32      result = 0;
    uint32      i, j;

    /* ensure that the size is compatible with the algorithm */
    Assert(sizeof(PGChecksummablePage) == BLCKSZ);

    /* initialize partial checksums to their corresponding offsets */
    memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets));

    /* main checksum calculation */
    for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
        for (j = 0; j < N_SUMS; j++)
            CHECKSUM_COMP(sums[j], page->data[i][j]);

    /* finally add in two rounds of zeroes for additional mixing */
    for (i = 0; i < 2; i++)
        for (j = 0; j < N_SUMS; j++)
            CHECKSUM_COMP(sums[j], 0);

    /* xor fold partial checksums together */
    for (i = 0; i < N_SUMS; i++)
        result ^= sums[i];

    return result;
}

The wrapper pg_checksum_page adds the three remaining steps that distinguish a page checksum from a generic block hash: it transiently zeroes pd_checksum so the field excludes itself, mixes in the block number (so a page written to the wrong location is detected even if its bytes are individually intact), and reduces to a non-zero uint16:

// pg_checksum_page — src/include/storage/checksum_impl.h
uint16
pg_checksum_page(char *page, BlockNumber blkno)
{
    PGChecksummablePage *cpage = (PGChecksummablePage *) page;
    uint16      save_checksum;
    uint32      checksum;

    /* We only calculate the checksum for properly-initialized pages */
    Assert(!PageIsNew((Page) page));

    /*
     * Save pd_checksum and temporarily set it to zero, so that the checksum
     * calculation isn't affected by the old checksum stored on the page.
     * Restore it after, because actually updating the checksum is NOT part of
     * the API of this function.
     */
    save_checksum = cpage->phdr.pd_checksum;
    cpage->phdr.pd_checksum = 0;
    checksum = pg_checksum_block(cpage);
    cpage->phdr.pd_checksum = save_checksum;

    /* Mix in the block number to detect transposed pages */
    checksum ^= blkno;

    /*
     * Reduce to a uint16 (to fit in the pd_checksum field) with an offset of
     * one. That avoids checksums of zero, which seems like a good idea.
     */
    return (uint16) ((checksum % 65535) + 1);
}

Three subtleties are worth flagging. (1) The function restores the old pd_checksum; it returns the value but does not store it, so stamping is the caller’s job (PageSetChecksum*). (2) N_SUMS = 32 is “a fixed part of the algorithm because changing the parallelism changes the checksum result”: the on-disk checksum of a given page is defined by this constant, so it can never be tuned without an on-disk format break. (3) The reduction (checksum % 65535) + 1 maps every result into 1..65535, so a computed checksum is never 0 (the value an unstamped pd_checksum field holds), and introduces “a very slight bias towards lower values,” judged insignificant.
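The size of that bias follows from simple arithmetic: 2^32 - 1 = 65535 × 65537, so the 2^32 possible 32-bit inputs split into 65537 per residue, with exactly one left over that falls on residue 0 (output 1). The sketch below uses hypothetical helper names to reproduce the reduction and count the preimages of each output.

```c
#include <assert.h>
#include <stdint.h>

/* The final reduction step: 32-bit checksum -> 1..65535. */
static uint16_t reduce_to_uint16(uint32_t checksum)
{
    return (uint16_t) ((checksum % 65535u) + 1u);
}

/* How many of the 2^32 possible inputs map to a given output (1..65535)? */
static uint64_t preimage_count(uint16_t out)
{
    uint64_t residue;

    assert(out >= 1);
    residue = (uint64_t) out - 1u;
    /* count of x in [0, 2^32 - 1] with x % 65535 == residue */
    return (UINT64_C(0xFFFFFFFF) - residue) / 65535u + 1u;
}
```

Output 1 has 65538 preimages and every other output has 65537, a relative skew of roughly one part in 65537.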

flowchart TD
  P["8 KB page image<br/>(pd_checksum included)"] --> Z["save pd_checksum,<br/>set field = 0"]
  Z --> R["reinterpret as<br/>32-column uint32 array"]
  R --> S["seed 32 partial sums<br/>from checksumBaseOffsets"]
  S --> M["main pass:<br/>CHECKSUM_COMP per column,<br/>32 lanes in parallel"]
  M --> ZR["2 extra zero rounds<br/>(finish avalanche)"]
  ZR --> F["xor-fold 32 partials<br/>to one uint32"]
  F --> B["checksum ^= blkno<br/>(catch transposed page)"]
  B --> RED["(checksum % 65535) + 1<br/>-> uint16 in 1..65535"]
  Z -.->|restore after| P
  RED --> OUT["return uint16"]

Figure 1 — pg_checksum_page data flow. The header field is transiently zeroed so it excludes itself, the page is hashed as 32 parallel FNV streams for SIMD throughput, the block number is folded in to catch mis-located pages, and the 32-bit result is squeezed into the 1..65535 uint16 range. Stamping the returned value into pd_checksum is the caller’s responsibility.
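The whole data flow can be exercised outside the server with a standalone sketch. Two things here are assumptions and differ from the real code: the seed values come from a hypothetical `placeholder_seed` function rather than the checksumBaseOffsets table (which is elided above), so the results will not match real pd_checksum values; and the sketch zeroes the field in a private copy instead of zeroing and restoring it in place. Block size 8192 and field offset 8 are taken from the text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ       8192u      /* assumed default block size */
#define N_SUMS       32u
#define FNV_PRIME    16777619u
#define CKSUM_OFFSET 8u         /* pd_checksum follows the 8-byte pd_lsn */

/* HYPOTHETICAL seeds standing in for checksumBaseOffsets. */
static uint32_t placeholder_seed(uint32_t lane)
{
    return (lane + 1u) * 2654435761u;
}

/* One mixing round, equivalent to CHECKSUM_COMP. */
static uint32_t mix(uint32_t sum, uint32_t value)
{
    uint32_t tmp = sum ^ value;

    return (tmp * FNV_PRIME) ^ (tmp >> 17);
}

static uint32_t sketch_checksum_block(const unsigned char *page)
{
    uint32_t sums[N_SUMS];
    uint32_t result = 0;
    uint32_t i, j, value;

    for (j = 0; j < N_SUMS; j++)
        sums[j] = placeholder_seed(j);

    /* main pass: word (i, j) feeds partial sum j */
    for (i = 0; i < BLCKSZ / (4u * N_SUMS); i++)
        for (j = 0; j < N_SUMS; j++)
        {
            memcpy(&value, page + (i * N_SUMS + j) * 4u, sizeof value);
            sums[j] = mix(sums[j], value);
        }

    /* two extra rounds of zero to finish mixing the last values */
    for (i = 0; i < 2; i++)
        for (j = 0; j < N_SUMS; j++)
            sums[j] = mix(sums[j], 0);

    /* xor-fold the partial sums */
    for (j = 0; j < N_SUMS; j++)
        result ^= sums[j];
    return result;
}

static uint16_t sketch_checksum_page(const unsigned char *page, uint32_t blkno)
{
    unsigned char copy[BLCKSZ];
    uint32_t checksum;

    assert(page != NULL);
    memcpy(copy, page, BLCKSZ);
    memset(copy + CKSUM_OFFSET, 0, sizeof(uint16_t));   /* field excludes itself */
    checksum = sketch_checksum_block(copy) ^ blkno;     /* mix in location */
    return (uint16_t) ((checksum % 65535u) + 1u);       /* 1..65535 */
}
```

Two properties hold regardless of the seed values: the result does not depend on what is currently stored in the checksum field, and the same bytes checksummed under block numbers 7 and 8 give different results, since the two 32-bit values differ by less than 65535 and therefore cannot share a residue.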

Where it is computed (write) and verified (read)

The algorithm is location-agnostic; the policy of when to apply it lives in the buffer manager and the page-support layer, every call gated by DataChecksumsEnabled(). The next section walks the exact call sites; the shape is: stamp in PageSetChecksum{Copy,Inplace} immediately before smgrwrite, verify in PageIsVerified immediately after the read completes.

The code splits cleanly into four call-flows: the algorithm (checksum.c -> checksum_impl.h), the write path (PageSetChecksum* from the buffer flush sites), the read path (PageIsVerified from the buffer read completion, plus the failure-reporting machinery), and the enable / control path (bootstrap.c, xlog.c, pg_control.h). A separate, often-confused sibling, the generic checksum_helper.c, closes the section.

The single exported entry point is declared in storage/checksum.h, which is intentionally tiny so external programs can pull in just the prototype and then the implementation header:

// checksum.h — src/include/storage/checksum.h
#include "storage/block.h"
/*
* Compute the checksum for a Postgres page. The page must be aligned on a
* 4-byte boundary.
*/
extern uint16 pg_checksum_page(char *page, BlockNumber blkno);

The implementation (checksum_impl.h) is split into the macro CHECKSUM_COMP, the N_SUMS = 32 / FNV_PRIME constants, the checksumBaseOffsets[32] seed table, the PGChecksummablePage aliasing union, the inner pg_checksum_block, and the wrapper pg_checksum_page — all quoted in the previous section. The 4-byte-alignment requirement is real: the union casts a char * page to an array of uint32, so callers must hand it a properly-aligned buffer (shared buffers and MemoryContextAllocAligned copies both satisfy this).

Write path: PageSetChecksumCopy and PageSetChecksumInplace

Both stampers live in bufpage.c, and both short-circuit when the page is new (uninitialized, all-zero) or checksums are off — so a checksums-disabled cluster pays nothing. The difference is who else can touch the page concurrently.

PageSetChecksumCopy is used when the page is a shared buffer being flushed under only a shared content lock — another backend may be setting hint bits in it right now. It therefore checksums a private copy, so a concurrent hint-bit write cannot invalidate the stamped value:

// PageSetChecksumCopy — src/backend/storage/page/bufpage.c
char *
PageSetChecksumCopy(Page page, BlockNumber blkno)
{
    static char *pageCopy = NULL;

    /* If we don't need a checksum, just return the passed-in data */
    if (PageIsNew(page) || !DataChecksumsEnabled())
        return page;

    /* ... palloc the aligned copy buffer once, reuse thereafter ... */
    if (pageCopy == NULL)
        pageCopy = MemoryContextAllocAligned(TopMemoryContext,
                                             BLCKSZ, PG_IO_ALIGN_SIZE, 0);

    memcpy(pageCopy, page, BLCKSZ);
    ((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
    return pageCopy;
}

The static pageCopy is allocated once per backend (aligned, for the checksum’s uint32 casts) and reused; the function returns the buffer the caller must write immediately. Its sole caller is FlushBuffer, right before smgrwrite, and the comment there spells out the hazard the copy avoids:

// FlushBuffer — src/backend/storage/buffer/bufmgr.c
/*
 * Update page checksum if desired. Since we have only shared lock on the
 * buffer, other processes might be updating hint bits in it, so we must
 * copy the page to private storage if we do checksumming.
 */
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
/* ... */
smgrwrite(reln, BufTagGetForkNum(&buf->tag), buf->tag.blockNum,
          bufToWrite, false);

PageSetChecksumInplace is the cheaper variant used whenever the caller knows no one else can be mutating the buffer, so it stamps pd_checksum directly with no copy:

// PageSetChecksumInplace — src/backend/storage/page/bufpage.c
void
PageSetChecksumInplace(Page page, BlockNumber blkno)
{
    /* If we don't need a checksum, just return */
    if (PageIsNew(page) || !DataChecksumsEnabled())
        return;

    ((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
}

Its callers are the three contexts where the page is private to the writer: localbuf.c’s FlushLocalBuffer (temp-table buffers are never shared), bulk_write.c’s smgr_bulk_flush (the bulk-write facility used by index builds and COPY-style relation extension owns its pages outright), and hashpage.c’s _hash_alloc_buckets (which pre-formats and writes the last overflow page directly):

// FlushLocalBuffer — src/backend/storage/buffer/localbuf.c
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
/* ... */
smgrwrite(reln, BufTagGetForkNum(&bufHdr->tag), bufHdr->tag.blockNum,
          localpage, false);

// smgr_bulk_flush -> PageSetChecksumInplace — src/backend/storage/smgr/bulk_write.c
for (int i = 0; i < npending; i++)
{
    BlockNumber blkno = pending_writes[i].blkno;
    Page        page = pending_writes[i].buf->data;

    PageSetChecksumInplace(page, blkno);
    /* ... then smgrwrite / smgrextend the run ... */
}

flowchart TD
  subgraph WRITE["Write path (stamp pd_checksum)"]
    FB["FlushBuffer<br/>(shared buffer,<br/>shared content lock)"] --> PSCC["PageSetChecksumCopy<br/>(private copy,<br/>dodge hint-bit race)"]
    LB["FlushLocalBuffer<br/>(temp buffers)"] --> PSCI["PageSetChecksumInplace"]
    BW["smgr_bulk_flush<br/>(bulk write)"] --> PSCI
    HP["_hash_alloc_buckets"] --> PSCI
    PSCC --> PCP["pg_checksum_page"]
    PSCI --> PCP
    PCP --> SW["smgrwrite -> disk"]
  end
  GATE{"DataChecksumsEnabled()<br/>and not PageIsNew?"}
  PSCC -.->|"no -> return page as-is"| GATE
  PSCI -.->|"no -> return"| GATE

Figure 2 — Write-path stamping. Shared buffers go through PageSetChecksumCopy (a private copy guards against concurrent hint-bit writes under the shared lock); private pages (temp buffers, bulk write, hash overflow alloc) use the cheaper in-place stamp. Every site is gated by DataChecksumsEnabled() and skips new/zero pages.

Read path: PageIsVerified and failure reporting

Verification happens in PageIsVerified, called from the buffer read completion callback the moment a page image arrives from smgr. The function does the checksum comparison first (only for non-new pages, only when enabled), then a set of independent header sanity checks, and treats an all-zero page as acceptable (a crash can leave a zeroed but unlogged extension page on disk):

// PageIsVerified — src/backend/storage/page/bufpage.c
bool
PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_failure_p)
{
    const PageHeaderData *p = (const PageHeaderData *) page;
    bool        checksum_failure = false;
    bool        header_sane = false;
    uint16      checksum = 0;

    if (checksum_failure_p)
        *checksum_failure_p = false;

    if (!PageIsNew(page))
    {
        if (DataChecksumsEnabled())
        {
            checksum = pg_checksum_page(page, blkno);
            if (checksum != p->pd_checksum)
            {
                checksum_failure = true;
                if (checksum_failure_p)
                    *checksum_failure_p = true;
            }
        }

        /* independent header sanity (offsets nested correctly, MAXALIGNed) */
        if ((p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
            p->pd_lower <= p->pd_upper &&
            p->pd_upper <= p->pd_special &&
            p->pd_special <= BLCKSZ &&
            p->pd_special == MAXALIGN(p->pd_special))
            header_sane = true;

        if (header_sane && !checksum_failure)
            return true;
    }

    /* ... all-zero page is OK; else fall through to the failure report ... */

The two layers are deliberate: the checksum catches random corruption with high probability, while the header checks catch structured damage (impossible offsets) that a 16-bit checksum might miss. A page must pass both to be trusted. The failure tail logs the diagnostic and decides whether the page is salvageable under operator override:

// PageIsVerified (failure tail) — src/backend/storage/page/bufpage.c
    if (checksum_failure)
    {
        if ((flags & (PIV_LOG_WARNING | PIV_LOG_LOG)) != 0)
            ereport(flags & PIV_LOG_WARNING ? WARNING : LOG,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("page verification failed, calculated checksum %u but expected %u",
                            checksum, p->pd_checksum)));

        if (header_sane && (flags & PIV_IGNORE_CHECKSUM_FAILURE))
            return true;
    }

    return false;
}

The PIV_* flags (in bufpage.h) parameterize the policy: PIV_LOG_WARNING / PIV_LOG_LOG choose the log level, and PIV_IGNORE_CHECKSUM_FAILURE lets a header-sane page through despite a bad checksum:

// PIV flags — src/include/storage/bufpage.h
#define PIV_LOG_WARNING (1 << 0)
#define PIV_LOG_LOG (1 << 1)
#define PIV_IGNORE_CHECKSUM_FAILURE (1 << 2)

The buffer-read completion (buffer_readv_complete_one in bufmgr.c) wires the GUCs into those flags. It always asks PageIsVerified to log (not error) because completion may run in an I/O worker, deferring the user-visible WARNING/ERROR to buffer_readv_report; the session’s ignore_checksum_failure adds PIV_IGNORE_CHECKSUM_FAILURE, and zero_damaged_pages (translated earlier into READ_BUFFERS_ZERO_ON_ERROR) replaces an unverifiable page with zeroes:

// buffer read completion -> PageIsVerified — src/backend/storage/buffer/bufmgr.c
piv_flags = PIV_LOG_LOG;

/* the local zero_damaged_pages may differ from the definer's */
if (flags & READ_BUFFERS_IGNORE_CHECKSUM_FAILURES)
    piv_flags |= PIV_IGNORE_CHECKSUM_FAILURE;

if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags, failed_checksum))
{
    if (flags & READ_BUFFERS_ZERO_ON_ERROR)
    {
        memset(bufdata, 0, BLCKSZ);
        *zeroed_buffer = true;
    }
    else
    {
        *buffer_invalid = true;
        failed = true;
    }
}
else if (*failed_checksum)
    *ignored_checksum = true;

The two session GUCs are read in WaitReadBuffers and folded into the read flags so the completion callback (which may execute in a different process) sees a consistent decision:

// WaitReadBuffers -> read flags — src/backend/storage/buffer/bufmgr.c
if (zero_damaged_pages)
    flags |= READ_BUFFERS_ZERO_ON_ERROR;

/* For the same reason ... we need to use this backend's value. */
if (ignore_checksum_failure)
    flags |= READ_BUFFERS_IGNORE_CHECKSUM_FAILURES;

Every detected mismatch (even an ignored one) is counted in cumulative stats, surfaced through pg_stat_database.checksum_failures and ...checksum_last_failure. The bump goes straight into shared memory so it can run inside a critical section:

// pgstat_report_checksum_failures_in_db — src/backend/utils/activity/pgstat_database.c
void
pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
{
    /* ... fetch the shared DB entry (create=false, no allocation) ... */
    sharedent = (PgStatShared_Database *) entry_ref->shared_stats;
    sharedent->stats.checksum_failures += failurecount;
    sharedent->stats.last_checksum_failure = GetCurrentTimestamp();

    pgstat_unlock_entry(entry_ref);
}

The SQL-visible accessors return NULL when checksums are off, so monitoring can distinguish “zero failures” from “feature disabled”:

// pg_stat_get_db_checksum_failures — src/backend/utils/adt/pgstatfuncs.c
if (!DataChecksumsEnabled())
    PG_RETURN_NULL();
/* ... else return dbentry->checksum_failures ... */

flowchart TD
  RD["smgr read completes<br/>(buffer_readv_complete_one)"] --> PIV["PageIsVerified"]
  PIV --> CK{"DataChecksumsEnabled<br/>and pd_checksum<br/>mismatch?"}
  CK -->|no| HDR{"header sane?"}
  CK -->|yes| LOG["ereport LOG/WARNING<br/>ERRCODE_DATA_CORRUPTED<br/>page verification failed"]
  LOG --> CNT["bump pg_stat_database<br/>.checksum_failures"]
  CNT --> IGN{"PIV_IGNORE_CHECKSUM<br/>and header sane?"}
  IGN -->|yes| OKI["accept page<br/>(ignored)"]
  IGN -->|no| ZERO{"ZERO_ON_ERROR<br/>zero_damaged_pages?"}
  ZERO -->|yes| ZP["memset page to 0<br/>accept"]
  ZERO -->|no| BAD["buffer invalid<br/>-> ERROR to client"]
  HDR -->|yes| OK["accept page"]
  HDR -->|no| BAD

Figure 3 — Read-path verification and failure policy. A checksum mismatch is logged and counted unconditionally; whether it is fatal depends on the ignore_checksum_failure and zero_damaged_pages GUCs, translated into the PIV_* / READ_BUFFERS_* flags. The independent header-sanity check is a second, structured layer that even an enabled checksum does not subsume.
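The policy in Figure 3 reduces to a small decision function. The sketch below is a model, not server code: the enum, the function name, and the boolean parameters are hypothetical, and the all-zero-page case (which is accepted before any of this applies) is deliberately left out.

```c
#include <assert.h>

typedef enum
{
    READ_ACCEPT,           /* checksum and header both fine */
    READ_ACCEPT_IGNORED,   /* bad checksum let through by operator override */
    READ_ZEROED,           /* page replaced with zeroes */
    READ_ERROR             /* buffer invalid, error reaches the client */
} ReadOutcome;

/* Model of the read-path outcome for a non-new, non-zero page. */
static ReadOutcome decide_read(int checksum_ok, int header_sane,
                               int ignore_checksum_failure,
                               int zero_damaged_pages)
{
    assert(checksum_ok == 0 || checksum_ok == 1);

    if (checksum_ok && header_sane)
        return READ_ACCEPT;
    /* the ignore override only applies when the header still looks sane */
    if (!checksum_ok && header_sane && ignore_checksum_failure)
        return READ_ACCEPT_IGNORED;
    if (zero_damaged_pages)
        return READ_ZEROED;
    return READ_ERROR;
}
```

Note the asymmetry: ignore_checksum_failure cannot rescue a page whose header is insane, whereas zero_damaged_pages applies to any page that fails verification.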

Enable / control path: data_checksum_version in pg_control

Whether the whole machinery is active is a single cluster property recorded in the control file:

// ControlFileData (excerpt) — src/include/catalog/pg_control.h
uint32 data_checksum_version;

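Within the server it is written only once, at bootstrap; the offline pg_checksums tool described later in this section is the only other writer. initdb -k passes -k to the bootstrap backend, which raises the version from 0 to PG_DATA_CHECKSUM_VERSION: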

// BootstrapModeMain option handling — src/backend/bootstrap/bootstrap.c
case 'k':
    bootstrap_data_checksum_version = PG_DATA_CHECKSUM_VERSION;
    break;
/* ... later ... */
BootStrapXLOG(bootstrap_data_checksum_version);

BootStrapXLOG threads the value into InitControlFile, which stores it in the control file that every subsequent backend reads at startup:

// InitControlFile — src/backend/access/transam/xlog.c
ControlFile->data_checksum_version = data_checksum_version;

At runtime the predicate DataChecksumsEnabled() is the single source of truth consulted by every stamp and verify site; it is a trivial read of the cached control file:

// DataChecksumsEnabled — src/backend/access/transam/xlog.c
bool
DataChecksumsEnabled(void)
{
    Assert(ControlFile != NULL);
    return (ControlFile->data_checksum_version > 0);
}

Two related facts fall out of this flag. First, xlog.c publishes a read-only data_checksums GUC (SetConfigOption("data_checksums", DataChecksumsEnabled() ? "yes" : "no", ...)) so clients can query the cluster’s state. Second, the same flag forces hint-bit writes to be WAL-logged: XLogHintBitIsNeeded() is (DataChecksumsEnabled() || wal_log_hints). This is required for torn-page safety — without it, a hint-bit-only change could alter a page’s bytes (and therefore its checksum) without any WAL record to replay over a torn write, yielding a spurious checksum failure after crash recovery. The offline pg_checksums tool flips data_checksum_version between 0 and 1 on a stopped cluster, re-stamping (or clearing) every page — the supported way to change the setting after initdb.
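
The two predicates described above compose very simply. The sketch below models them outside the server, assuming a hypothetical `toy_control_file` struct in place of ControlFileData and a plain bool in place of the wal_log_hints GUC; it only illustrates the truth table, not the real xlog.c code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the one ControlFileData field that matters here. */
typedef struct {
    uint32_t data_checksum_version;   /* 0 = checksums off, nonzero = on */
} toy_control_file;

/* Same shape as DataChecksumsEnabled(): any nonzero version means "on". */
bool
toy_data_checksums_enabled(const toy_control_file *cf)
{
    assert(cf != NULL);
    return cf->data_checksum_version > 0;
}

/* Same shape as XLogHintBitIsNeeded(): checksums force hint-bit WAL logging. */
bool
toy_hint_bit_wal_needed(const toy_control_file *cf, bool wal_log_hints)
{
    return toy_data_checksums_enabled(cf) || wal_log_hints;
}
```

The only combination in which a hint-bit-only change may skip WAL is checksums off and wal_log_hints off; enabling checksums removes that option regardless of the GUC.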

The sibling you should not confuse: checksum_helper.c

src/common/checksum_helper.c is a different facility that also lives under the word “checksum” and trips up grep-driven readers. It is a generic, algorithm-pluggable digest used by features like backup manifests and pg_verifybackup, not the page checksum. Its pg_checksum_type enum spans CRC32C and the SHA-2 family, and it offers an init/update/final streaming API:

// pg_checksum_init (excerpt) — src/common/checksum_helper.c
int
pg_checksum_init(pg_checksum_context *context, pg_checksum_type type)
{
    context->type = type;
    switch (type)
    {
        case CHECKSUM_TYPE_NONE:
            break;
        case CHECKSUM_TYPE_CRC32C:
            INIT_CRC32C(context->raw_context.c_crc32c);
            break;
        case CHECKSUM_TYPE_SHA224:
            context->raw_context.c_sha2 = pg_cryptohash_create(PG_SHA224);
            /* ... */
    }
    return 0;
}
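
To make the init/update/final shape concrete, here is a minimal standalone sketch of a streaming CRC-32C digest. It is not the in-tree implementation: the `toy_crc32c_*` names and the context struct are hypothetical, and the bit-at-a-time loop is chosen for clarity rather than speed. What it does show is the property a streaming API exists for, namely that feeding the data in chunks yields the same digest as one pass.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical streaming context; the real pg_checksum_context differs. */
typedef struct {
    uint32_t crc;
} toy_crc32c_ctx;

void
toy_crc32c_init(toy_crc32c_ctx *ctx)
{
    ctx->crc = 0xFFFFFFFFu;   /* conventional CRC-32C initial value */
}

void
toy_crc32c_update(toy_crc32c_ctx *ctx, const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *) data;
    uint32_t crc = ctx->crc;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Reflected Castagnoli polynomial, applied one bit at a time. */
            uint32_t mask = (uint32_t) 0 - (crc & 1u);
            crc = (crc >> 1) ^ (0x82F63B78u & mask);
        }
    }
    ctx->crc = crc;
}

uint32_t
toy_crc32c_final(const toy_crc32c_ctx *ctx)
{
    return ctx->crc ^ 0xFFFFFFFFu;   /* conventional final inversion */
}

/* Usage: digest a buffer in two chunks, split at byte offset `split`. */
uint32_t
toy_crc32c_in_two_chunks(const void *data, size_t len, size_t split)
{
    const unsigned char *p = (const unsigned char *) data;
    toy_crc32c_ctx ctx;

    assert(split <= len);
    toy_crc32c_init(&ctx);
    toy_crc32c_update(&ctx, p, split);
    toy_crc32c_update(&ctx, p + split, len - split);
    return toy_crc32c_final(&ctx);
}
```

For the standard check string "123456789" this produces 0xE3069283 wherever the split falls, which is why a manifest digest can be accumulated as a file is streamed instead of requiring the whole file in memory.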

The naming overlap is purely lexical: pg_checksum_page (page integrity, FNV, fixed 16-bit) and pg_checksum_init/_update/_final (manifest digests, multiple algorithms, variable length) share no code. The basebackup.c verify_page_checksum path does use the real page checksum (pg_checksum_page) — it re-verifies pages while streaming a base backup, but skips pages modified since the backup’s start LSN, because those may be torn and “replaying WAL would reinstate the correct page.”
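
The base-backup skip rule can be sketched as a three-way outcome. This is an illustration under stated assumptions, not the verify_page_checksum code: the `toy_*` names are hypothetical, LSNs are modelled as plain 64-bit integers, and treating "modified since the start LSN" as `page_lsn >= backup_start_lsn` is an assumption about the exact boundary.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    TOY_PAGE_SKIPPED,    /* possibly torn; WAL replay will fix it */
    TOY_PAGE_OK,         /* stored and recomputed checksums agree */
    TOY_PAGE_MISMATCH    /* genuine verification failure */
} toy_backup_check;

toy_backup_check
toy_backup_verify_page(uint64_t page_lsn, uint64_t backup_start_lsn,
                       uint16_t stored_checksum, uint16_t computed_checksum)
{
    /* The page checksum reserves 0, so a recomputed value is never 0. */
    assert(computed_checksum != 0);

    /* Assumption: pages at or past the start LSN count as modified since. */
    if (page_lsn >= backup_start_lsn)
        return TOY_PAGE_SKIPPED;

    return stored_checksum == computed_checksum ? TOY_PAGE_OK
                                                : TOY_PAGE_MISMATCH;
}
```

Skipping is safe only because the backup also carries the WAL from its start LSN onward; a page that old WAL cannot repair must pass the checksum on its own.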

Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
|---|---|---|
| pg_checksum_page | src/include/storage/checksum_impl.h | 187 |
| pg_checksum_block | src/include/storage/checksum_impl.h | 145 |
| CHECKSUM_COMP | src/include/storage/checksum_impl.h | 135 |
| N_SUMS, FNV_PRIME | src/include/storage/checksum_impl.h | 106, 108 |
| checksumBaseOffsets | src/include/storage/checksum_impl.h | 121 |
| PGChecksummablePage | src/include/storage/checksum_impl.h | 111 |
| pg_checksum_page (decl) | src/include/storage/checksum.h | 22 |
| #include checksum_impl.h | src/backend/storage/page/checksum.c | 22 |
| PageHeaderData.pd_checksum | src/include/storage/bufpage.h | 164 |
| PG_DATA_CHECKSUM_VERSION | src/include/storage/bufpage.h | 208 |
| PIV_LOG_WARNING / PIV_LOG_LOG / PIV_IGNORE_CHECKSUM_FAILURE | src/include/storage/bufpage.h | 469–471 |
| PageIsVerified | src/backend/storage/page/bufpage.c | 94 |
| PageSetChecksumCopy | src/backend/storage/page/bufpage.c | 1509 |
| PageSetChecksumInplace | src/backend/storage/page/bufpage.c | 1541 |
| ignore_checksum_failure (GUC var) | src/backend/storage/page/bufpage.c | 27 |
| FlushBuffer → PageSetChecksumCopy | src/backend/storage/buffer/bufmgr.c | 4372 |
| read completion → PageIsVerified | src/backend/storage/buffer/bufmgr.c | 7100 |
| WaitReadBuffers GUC→flags | src/backend/storage/buffer/bufmgr.c | 1821, 1828 |
| buffer_readv_report | src/backend/storage/buffer/bufmgr.c | 7286 |
| FlushLocalBuffer → PageSetChecksumInplace | src/backend/storage/buffer/localbuf.c | 201 |
| smgr_bulk_flush → PageSetChecksumInplace | src/backend/storage/smgr/bulk_write.c | 282 |
| _hash_alloc_buckets → PageSetChecksumInplace | src/backend/access/hash/hashpage.c | 1032 |
| DataChecksumsEnabled | src/backend/access/transam/xlog.c | 4611 |
| InitControlFile (sets version) | src/backend/access/transam/xlog.c | 4200, 4231 |
| bootstrap_data_checksum_version (-k) | src/backend/bootstrap/bootstrap.c | 204, 287 |
| ControlFileData.data_checksum_version | src/include/catalog/pg_control.h | 222 |
| XLogHintBitIsNeeded | src/include/access/xlog.h | 120 |
| verify_page_checksum | src/backend/backup/basebackup.c | 1993 |
| pgstat_report_checksum_failures_in_db | src/backend/utils/activity/pgstat_database.c | 166 |
| pg_stat_get_db_checksum_failures | src/backend/utils/adt/pgstatfuncs.c | 1154 |
| pg_checksum_init (helper, distinct) | src/common/checksum_helper.c | 83 |

All excerpts were read from the REL_18_STABLE working tree at commit 273fe94. Verification notes:

  • pg_checksum_page is in a header, not checksum.c. checksum.c is a two-#include file; the body lives in src/include/storage/checksum_impl.h so frontend tools (pg_checksums, pg_upgrade, basebackup) compile the identical algorithm. Anyone grepping checksum.c for the FNV loop finds nothing — confirmed by reading both files.
  • N_SUMS = 32, FNV_PRIME = 16777619 verified verbatim in checksum_impl.h; the comment “changing the parallelism changes the checksum result” confirms the constant is an on-disk format invariant, not a tunable.
  • The reduction is (checksum % 65535) + 1 — modulo 65535 (not 65536) with a +1 offset, yielding 1..65535 and reserving 0. Read directly from pg_checksum_page.
  • pd_checksum is a uint16 and the second field of PageHeaderData (offset 8, after the 8-byte pd_lsn), confirmed in bufpage.h. The field is inside the checksummed region; pg_checksum_page zeroes it transiently while computing and then restores it.
  • Write sites confirmed: FlushBuffer (shared, via PageSetChecksumCopy), FlushLocalBuffer, smgr_bulk_flush, _hash_alloc_buckets (all via PageSetChecksumInplace). The shared-vs-private distinction and the hint-bit rationale are quoted from the in-tree comments.
  • Read site confirmed: PageIsVerified is the only checksum verification call in the backend read path; reached from the AIO buffer-read completion in bufmgr.c. The user-visible message is "page verification failed, calculated checksum %u but expected %u" with ERRCODE_DATA_CORRUPTED.
  • The two escape-hatch GUCs ignore_checksum_failure and zero_damaged_pages are both session settings translated into READ_BUFFERS_* flags in WaitReadBuffers, because the completion callback may run in a different process than the one that issued the read. Verified at lines 1821/1828.
  • Cluster enable is data_checksum_version in pg_control, set once at bootstrap via -k. DataChecksumsEnabled() is version > 0. Verified in xlog.c, bootstrap.c, pg_control.h.
  • No PG19-only surface asserted. This doc deliberately does not cover the online checksum-enable worker / background process (post-PG18). The only in-cluster way to change data_checksum_version described here is initdb -k at create time or the offline pg_checksums tool on a stopped cluster — both REL_18 facts.
  • checksum_helper.c is a separate facility (CRC32C / SHA-2 digests for backup manifests), sharing only the lexical prefix pg_checksum_. Confirmed by reading its enum and init/update/final API; it never calls pg_checksum_page.
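
The reduction noted above is small enough to check by hand. The following sketch isolates just that final step, assuming a hypothetical helper name `toy_reduce_to_uint16`; in the tree the expression sits inline at the end of pg_checksum_page.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Reduce a 32-bit folded checksum to the 16-bit pd_checksum range.
 * Modulo 65535 yields 0..65534; the +1 shifts that to 1..65535, so 0 is
 * never produced and stays reserved.
 */
uint16_t
toy_reduce_to_uint16(uint32_t checksum)
{
    uint16_t reduced = (uint16_t) ((checksum % 65535u) + 1u);

    assert(reduced != 0);   /* holds for every 32-bit input */
    return reduced;
}
```

A few boundary cases show the wrap: 0 maps to 1, 65534 maps to 65535, and both 65535 and 0xFFFFFFFF (which equals 65535 × 65537) map back to 1.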

Beyond PostgreSQL — Comparative Designs & Research Frontiers

PostgreSQL’s page checksum is a single, deliberately narrow tool: a fast non-cryptographic error-detecting code over the 8 KB block, recomputed on every flush and verified on every read. Placing it against the broader design space clarifies both what it buys and what it leaves on the table.

  • Speed-first detection vs. cryptographic integrity. The page checksum is FNV-1a-derived precisely because, as the header comment warns, “the checksum algorithm itself can become the largest bottleneck” for a working set that fits OS cache but not shared buffers. The choice is error detection, not tamper resistance: a uint16 with a stated ~2e-16 false-positive rate catches storage bit-rot but is trivially forgeable. PostgreSQL keeps the stronger digests entirely separate: checksum_helper.c exposes CRC32C (itself non-cryptographic) and the cryptographic SHA-2 suite for backup manifests, where an adversary-or-bug model and a per-file (not per-page) granularity make a 224–512-bit digest affordable:

    // pg_checksum_init — src/common/checksum_helper.c
    switch (type)
    {
        case CHECKSUM_TYPE_NONE:
            break;
        case CHECKSUM_TYPE_CRC32C:
            INIT_CRC32C(context->raw_context.c_crc32c);
            break;
        case CHECKSUM_TYPE_SHA256:
            context->raw_context.c_sha2 = pg_cryptohash_create(PG_SHA256);
            /* ... SHA224/384/512 are parallel cases ... */
            break;
    }

    The two facilities never meet: the page path optimizes for throughput on the hot read/write loop, the manifest path for digest strength on a cold, once-per-file pass. Conflating them would either slow every page flush or weaken every backup-verification claim.

  • Hardware/format-level CRCs in other engines. Many engines push integrity into a wider or different code. InnoDB historically stamped a CRC32C over the page (with a legacy “innodb” checksum and a crc32 mode), and like PostgreSQL it brackets the page so that the recorded checksum sits at a known offset; SQL Server’s PAGE_VERIFY CHECKSUM option likewise computes a single page-wide value stored in the header. The common shape (one code, page granularity, verified on read) is the same; the polynomial/primitive and the width differ. PostgreSQL’s distinguishing move is mixing in the block number (checksum ^= blkno), so a page written to the wrong physical location is caught even if its bytes are individually intact, a transposition failure mode that a content-only CRC misses.

  • Checksums vs. end-to-end integrity (the storage-stack argument). A classic systems result — Stone & Partridge’s analysis of TCP/Ethernet checksum gaps, and the broader end-to-end argument (Saltzer, Reed & Clark) — shows that a checksum applied at one layer does not protect the data once it leaves that layer’s custody. PostgreSQL’s checksum is computed in PageSetChecksum* just before smgrwrite and verified in PageIsVerified just after the read completes, so it covers exactly the journey through the kernel, the filesystem, the block layer, and the medium — the segment where PostgreSQL has surrendered the bytes. It deliberately does not cover the in-memory lifetime of a buffer (a RAM bit-flip on a dirty page in shared buffers is re-checksummed into a “valid” page on flush) — that gap is the province of ECC memory, not of pd_checksum.

  • What the checksum cannot see. Because the WAL carries its own CRC and full-page images, and because pd_checksum is recomputed on every flush rather than logged, the page checksum detects corruption introduced below PostgreSQL but is blind to logic bugs above it: a backend that writes a semantically wrong but structurally valid tuple produces a perfectly valid checksum. This is the same boundary that filesystem checksums (ZFS, Btrfs) draw — they catch the medium lying, not the application lying. The research frontier of silent data corruption (Bairavasundaram et al.’s large-scale studies of latent sector errors and disk corruption in production fleets) is precisely the failure class pd_checksum was built to surface: corruption that returns successfully from read() with wrong bytes and no I/O error.

  • Granularity and the cost of being wrong. A 16-bit page code is a conscious trade: it costs 2 bytes per 8 KB (0.024%) and a few cycles, versus a per-row or per-cell checksum (finer localization, far higher overhead) or a whole-segment digest (cheaper amortized, but no per-page isolation and no online verification). PostgreSQL’s per-page choice aligns the integrity unit with the I/O unit and the recovery unit — the same block that is the atom of smgrread/smgrwrite, of buffer replacement, and of WAL full-page images — which is why a failure can be reported as a single bad block and optionally zeroed (zero_damaged_pages) or tolerated (ignore_checksum_failure) without dragging in neighbouring data.

  • Online enablement (out of scope here, noted for the reader). This doc describes the REL_18 reality: checksums are a cluster-wide property fixed at initdb -k, changeable afterward only by the offline pg_checksums tool on a stopped cluster. The ability to enable checksums online on a running cluster via a background worker is a later development and is intentionally not asserted here. Readers on newer branches should treat the enable-path described in section 4 (the data_checksum_version bootstrap write) as the floor, not the ceiling.
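
The block-number mix discussed above can be demonstrated with a toy version of the page algorithm. This is a sketch under loud assumptions: it follows the structure described in this document (32 parallel FNV-1a-style sums with the shift-by-17 mixing step, two zero rounds, xor fold, block-number mix, modulo reduction), but the offset bases are placeholders rather than the in-tree checksumBaseOffsets, it skips the transient zeroing of pd_checksum, and every `toy_*` name is hypothetical. Its output therefore does not match real PostgreSQL checksums.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ    8192u
#define TOY_N_SUMS    32u
#define TOY_FNV_PRIME 16777619u

/* One mixing step: FNV-1a style multiply plus the high-bit shift mix. */
static uint32_t
toy_comp(uint32_t sum, uint32_t value)
{
    uint32_t tmp = sum ^ value;

    return (tmp * TOY_FNV_PRIME) ^ (tmp >> 17);
}

uint16_t
toy_page_checksum(const unsigned char *page, uint32_t blkno)
{
    uint32_t sums[TOY_N_SUMS];
    uint32_t folded = 0;

    assert(page != NULL);

    /* Placeholder offset bases: NOT the real checksumBaseOffsets values. */
    for (uint32_t j = 0; j < TOY_N_SUMS; j++)
        sums[j] = 0x9E3779B9u * (j + 1u);

    /* Treat the page as rows of 32 words; each column feeds its own sum. */
    for (size_t row = 0; row < TOY_BLCKSZ / (4u * TOY_N_SUMS); row++) {
        for (uint32_t j = 0; j < TOY_N_SUMS; j++) {
            uint32_t word;

            memcpy(&word, page + (row * TOY_N_SUMS + j) * 4u, sizeof word);
            sums[j] = toy_comp(sums[j], word);
        }
    }

    /* Two extra rounds with zero so the last words are fully mixed. */
    for (int round = 0; round < 2; round++)
        for (uint32_t j = 0; j < TOY_N_SUMS; j++)
            sums[j] = toy_comp(sums[j], 0);

    for (uint32_t j = 0; j < TOY_N_SUMS; j++)
        folded ^= sums[j];

    folded ^= blkno;                                 /* block-number mix */
    return (uint16_t) ((folded % 65535u) + 1u);      /* 1..65535, never 0 */
}

/* Usage: identical bytes stamped for block 0 but later read as block 1. */
int
toy_misplaced_write_is_detected(void)
{
    static unsigned char page[TOY_BLCKSZ];

    for (size_t i = 0; i < sizeof page; i++)
        page[i] = (unsigned char) (i * 31u + 7u);

    return toy_page_checksum(page, 0) != toy_page_checksum(page, 1);
}
```

For blocks 0 and 1 detection is guaranteed by arithmetic: the two folded values differ by exactly one, and consecutive integers never share a residue modulo 65535. For arbitrary block pairs the 16-bit reduction can still collide, so the mix makes misdirected writes very likely, not certain, to be caught.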

In-tree source files (REL_18_STABLE, commit 273fe94, as of 2026-06-06)

  • src/backend/storage/page/checksum.c — the two-#include shim that pulls the algorithm body in from the exported header so frontend tools compile the identical code.
  • src/include/storage/checksum_impl.h — the algorithm itself: PGChecksummablePage union, checksumBaseOffsets[N_SUMS], the CHECKSUM_COMP mixing macro, pg_checksum_block, and the public pg_checksum_page (FNV-1a parallel sums, two zero rounds, xor fold, block-number mix, (checksum % 65535) + 1 reduction). Also the long design rationale comment (SIMD parallelism, the choice of 32, why parallelism is a format invariant).
  • src/include/storage/checksum.h — the one-line public prototype for pg_checksum_page.
  • src/include/storage/bufpage.h — PageHeaderData / pd_checksum field layout, PageSetChecksumInplace/PageSetChecksumCopy declarations, PageIsVerified and the PIV_* flag bits.
  • src/backend/storage/page/bufpage.c — PageIsVerified, PageSetChecksumCopy (private-copy path under shared lock), PageSetChecksumInplace (in-place path), and the checksum-failure WARNING/ERROR text.
  • src/backend/storage/buffer/bufmgr.c — the read-completion verification path (PageIsVerified via the AIO buffer-read callback), the FlushBuffer write site, and the ignore_checksum_failure / zero_damaged_pages → READ_BUFFERS_* translation in WaitReadBuffers.
  • src/backend/storage/buffer/localbuf.c — FlushLocalBuffer, the local-buffer write site stamping checksums in place.
  • src/backend/storage/smgr/bulk_write.c — smgr_bulk_flush, the bulk-load write path that checksums each block before smgrextend/smgrwrite.
  • src/backend/access/transam/xlog.c — DataChecksumsEnabled and the data_checksum_version plumbing in ControlFile.
  • src/backend/bootstrap/bootstrap.c — the bootstrap -k handling that sets the cluster’s checksum version at initdb time.
  • src/backend/backup/basebackup.c — base-backup-time checksum verification of data pages as they are streamed.
  • src/backend/utils/activity/pgstat_database.c — pgstat_report_checksum_failure*, feeding pg_stat_database.checksum_failures / checksum_last_failure.
  • src/include/catalog/pg_control.h — data_checksum_version in ControlFileData, the on-disk home of the cluster-wide flag.
  • src/common/checksum_helper.c — distinct facility: CRC32C / SHA-2 digest contexts for backup manifests (pg_checksum_init/update/final, pg_checksum_type). Shares only the pg_checksum_ prefix; never invokes pg_checksum_page. Included here to document the boundary, not as part of the page-checksum path.
  • FNV-1a hash — Fowler/Noll/Vo, the non-cryptographic hash family the page algorithm is built on; described at the URL cited in the in-tree comment (isthe.com/chongo/tech/comp/fnv). PostgreSQL’s variant adds the ^ ((hash ^ value) >> 17) high-bit mixing step and 32-way parallelism.
  • End-to-end argument / checksum coverage gaps — Saltzer, Reed & Clark, End-to-End Arguments in System Design (1984); Stone & Partridge, When the CRC and TCP Checksum Disagree (SIGCOMM 2000). Motivates computing/verifying the checksum at the exact smgrwrite/read boundary.
  • Silent data corruption in the field — Bairavasundaram et al., An Analysis of Latent Sector Errors / Data Corruption in the Storage Stack (FAST ‘07/’08). The empirical case that read() can succeed with wrong bytes, which is the failure class pd_checksum exists to detect.
  • DBMS reliability framing — see the general DBMS reliability/recovery material in knowledge/research/dbms-general/ (page-level integrity sits beside WAL and the buffer manager as the third leg of on-disk durability), and the apt entries in dbms-papers per the project bibliography.

Related KB docs (cross-references, not duplicated here)

  • postgres-page-layout.md — PageHeaderData field-by-field layout; this doc defers the full header anatomy there and only quotes pd_checksum’s offset.
  • postgres-buffer-manager.md — buffer eviction, FlushBuffer/WaitReadBuffers mechanics, and the AIO read-completion machinery; this doc names the write/read call sites but defers the surrounding flush/eviction lifecycle there.
  • postgres-smgr-md.md — the smgrwrite/smgrread layer the checksum brackets.
  • postgres-xlog-wal.md — WAL CRCs and full-page images, the other integrity code, which protects the log rather than the heap/index pages.
  • postgres-backup-basebackup.md / postgres-incremental-backup.md — where the separate checksum_helper.c manifest digests are actually consumed.