PostgreSQL Data Checksums — Page Integrity Verification
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A relational engine’s most basic promise is that a row written yesterday reads
back unchanged today. Between the INSERT and the SELECT, however, the 8 KB
page that holds the row passes through a long chain of components that the DBMS
does not control: the OS page cache, the filesystem, the block layer, an HBA or
NVMe controller, the drive’s own firmware and DRAM cache, and — for networked
storage — a SAN fabric. Any link can flip a bit, mis-route a write to the wrong
sector, return a stale copy of a block, or hand back a torn (partially written)
page after a power loss. Database Internals (Petrov 2019, ch. 3 “File
Formats”, §“Checksumming”; captured in research/dbms-general/database-internals.md)
states the problem plainly: “Files on disk may get damaged or corrupted by
software bugs and hardware failures. To identify these problems preemptively
and avoid propagating corrupt data to other subsystems or even nodes, we can
use checksums and cyclic redundancy checks (CRCs).”
The crucial adjective is silent. A drive that returns an explicit read error
is easy to handle — the DBMS gets an EIO and reports it. The dangerous case
is silent data corruption: the storage stack returns success together
with wrong bytes. Without an end-to-end integrity check the engine cannot
tell a corrupted page from a good one; it dutifully follows the now-garbage line
pointers, mis-reads transaction ids, and either crashes far from the real fault
or — worse — returns wrong answers and propagates the damage into backups,
replicas, and dumps. A checksum stored with the data turns silent corruption
into loud corruption: a mismatch at read time is a detectable, attributable
event.
Database Internals is careful to rank the guarantee strength of the three families of integrity codes, because the choice is a real engineering trade-off:
- Checksums “provide the weakest form of guarantee and aren’t able to detect corruption in multiple bits. They’re usually computed by using XOR with parity checks or summation.”
- CRCs “can help detect burst errors (e.g., when multiple consecutive bits got corrupted)” via polynomial division, and multi-bit detection matters “since a significant percentage of failures in communication networks and storage devices manifest this way.”
- Cryptographic hashes resist intentional tampering. The book’s WARNING is explicit: “Noncryptographic hashes and CRCs should not be used to verify whether or not the data has been tampered with… The main goal of CRC is to make sure that there were no unintended and accidental changes in data.”
This last point fixes the threat model. A data-page checksum is an accident detector, not a tamper detector. An attacker who can rewrite a page can recompute its checksum just as easily as the engine can. The design target is therefore random corruption (bit rot, mis-directed writes, RAM flips on a resident clean page), and the design goal is to maximize detection probability per CPU cycle — not cryptographic strength.
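The ranking above can be made concrete with a minimal sketch. The helpers `xor_checksum` and `flip_bit` are hypothetical names invented for this illustration; they are not taken from any engine. They show why a plain XOR checksum is the weakest family: a single flipped bit is caught, but a second flip in the same bit position of another byte cancels the first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy code: XOR every byte of the buffer into one byte. */
static uint8_t
xor_checksum(const uint8_t *buf, size_t len)
{
    uint8_t sum = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        sum ^= buf[i];
    return sum;
}

/* Flip one bit (0..7) of one byte, simulating a storage bit flip. */
static void
flip_bit(uint8_t *buf, size_t byte_index, unsigned bit)
{
    assert(bit < 8);
    buf[byte_index] ^= (uint8_t) (1u << bit);
}
```

Flipping bit 5 of one byte changes the result, while additionally flipping bit 5 of a second byte restores the original value, so that double corruption passes unnoticed. CRCs and better-mixed hashes exist to close this kind of gap.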
The book also explains why page-granular, which is exactly the granularity PostgreSQL chose: “Since computing a checksum over the whole file is often impractical… page checksums are usually computed on pages and placed in the page header. This way, checksums can be more robust (since they are performed on a small subset of the data), and the whole file doesn’t have to be discarded if corruption is contained in a single page.” A per-page code is verified on exactly the I/O unit the buffer manager already reads and writes, costs nothing to locate (it lives in the page header), and localizes blame to a single 8 KB block.
Two design knobs follow from this framing and organize the rest of the document:
- Which algorithm, and how fast? Because the checksum is recomputed on every page read and write, it sits on the hottest I/O path in the engine. On a workload whose working set fits in the OS cache but not shared buffers, pages stream in at memory bandwidth and the checksum can become the bottleneck. So the algorithm must be vectorizable, not merely correct.
- When is it computed and verified, and what happens on mismatch? A checksum is only as good as its enforcement points. The engine must stamp it at the last moment before the bytes leave for disk (after all in-memory mutation, including non-WAL-logged hint bits), verify it the instant bytes arrive, and have a defined policy — fail loudly, count the event, or (under operator override) limp on — when verification fails.
Common DBMS Design
Engines that ship page-level integrity checks converge on a recurring set of choices. Naming them first lets the PostgreSQL symbols in the next section read as points in a shared design space.
The checksum field lives in the page header, and excludes itself
Every engine that does this reserves a small fixed field in the page header for the integrity code and computes the code over the rest of the page. The self-reference problem — you cannot checksum a field that holds the checksum — is solved the same way everywhere: the field is treated as zero (or skipped) during computation. Verification re-zeroes it (or recomputes the expected value with it zeroed) and compares. The field is small (16–32 bits is typical for a per-page code) because page-header space is precious and a 16-bit code already yields a ~1-in-65 000 false-negative rate, which combined with the independent header sanity checks is more than enough for accident detection.
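A minimal sketch of the self-excluding convention follows. Everything in it is an assumption made for illustration: the 64-byte `ToyPage` layout, the Fletcher-style `toy_code`, and the `toy_page_*` helpers are hypothetical and are not PostgreSQL's layout or algorithm. The point is only the shape: compute over the whole page with the checksum slot read as zero, stamp the result, and recompute the same way to verify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 64        /* hypothetical; real pages are 8 KB */

/* Hypothetical page layout: a 16-bit checksum slot, then payload. */
typedef struct ToyPage
{
    uint16_t checksum;
    uint8_t  payload[TOY_PAGE_SIZE - sizeof(uint16_t)];
} ToyPage;

/* Toy integrity code (Fletcher-style sums modulo 251), for illustration. */
static uint16_t
toy_code(const uint8_t *bytes, size_t len)
{
    uint32_t a = 1;
    uint32_t b = 0;

    for (size_t i = 0; i < len; i++)
    {
        a = (a + bytes[i]) % 251;
        b = (b + a) % 251;
    }
    return (uint16_t) ((b << 8) | a);
}

/* Compute the code over the whole page with the checksum slot read as zero. */
static uint16_t
toy_page_checksum(const ToyPage *page)
{
    ToyPage scratch;

    assert(sizeof(ToyPage) == TOY_PAGE_SIZE);
    memcpy(&scratch, page, sizeof(scratch));
    scratch.checksum = 0;       /* the field excludes itself */
    return toy_code((const uint8_t *) &scratch, sizeof(scratch));
}

/* Write side: store the freshly computed code in the header slot. */
static void
toy_page_stamp(ToyPage *page)
{
    page->checksum = toy_page_checksum(page);
}

/* Read side: recompute with the slot zeroed and compare. */
static int
toy_page_verify(const ToyPage *page)
{
    return toy_page_checksum(page) == page->checksum;
}
```

Because the slot is zeroed before every computation, stamping an already-stamped page yields the same value, and any single-bit change in the payload makes `toy_page_verify` return 0 in this toy code.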
Compute on write-out, verify on read-in — at the buffer/storage boundary
The natural enforcement points are the two ends of the buffer manager’s I/O:
- Write path. Just before a dirty page is handed to the storage layer (write()/pwrite()), the engine computes the checksum over the about-to-be-written image and stamps the header. Critically this must happen after every in-memory mutation, including ones that are not WAL-logged.
- Read path. Just after the storage layer returns a page image, and before the engine trusts any field in it, the checksum is recomputed and compared.
This placement means the checksum protects the at-rest representation only. It is not part of the WAL-protected page image and is generally recomputed fresh on every flush rather than logged, so it cannot detect a logic bug that corrupts a page in memory before the checksum is stamped. It is purpose-built to catch corruption that happens between a correct stamp and a later read — i.e., in the storage stack.
The hint-bit hazard: checksum a stable image
A no-overwrite MVCC engine writes hint bits (cached visibility decisions) into pages opportunistically, often under nothing stronger than a shared lock/latch, and often without WAL. That creates a subtle race with checksumming: if the engine computes a checksum over a shared page while another backend flips a hint bit mid-computation, the stamped checksum will not match the bytes that actually reach disk. Engines solve this either by (a) taking a private copy of the page and checksumming the copy, or (b) ensuring hint-bit writes are themselves serialized against flush. Choice (a) is the common one because it keeps the hot mutation path lock-free.
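Choice (a) can be sketched in a few lines. The code below is a single-threaded illustration under stated assumptions: the 32-byte page, the byte-sum `demo_code`, and the `demo_*` helpers are hypothetical names, and the "concurrent" hint-bit write is simulated by mutating the shared page after the stamp. It is not how any particular engine implements the copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 32       /* hypothetical; real pages are 8 KB */

/* Toy code: sum of all bytes after the 2-byte checksum slot. */
static uint16_t
demo_code(const uint8_t *page)
{
    uint16_t sum = 0;

    for (size_t i = sizeof(uint16_t); i < DEMO_PAGE_SIZE; i++)
        sum = (uint16_t) (sum + page[i]);
    return sum;
}

/* Hazardous variant: stamp the shared page itself. */
static void
demo_stamp_inplace(uint8_t *page)
{
    uint16_t code = demo_code(page);

    memcpy(page, &code, sizeof(code));
}

/* Choice (a): copy first, then stamp the private copy and write that. */
static const uint8_t *
demo_stamp_copy(const uint8_t *shared, uint8_t *scratch)
{
    uint16_t code;

    assert(shared != scratch);
    memcpy(scratch, shared, DEMO_PAGE_SIZE);
    code = demo_code(scratch);
    memcpy(scratch, &code, sizeof(code));
    return scratch;             /* this image is what goes to disk */
}

/* Read side: recompute and compare with the stored slot. */
static int
demo_verify(const uint8_t *page)
{
    uint16_t stored;

    memcpy(&stored, page, sizeof(stored));
    return stored == demo_code(page);
}
```

A later hint-bit write to the shared page cannot disturb the image returned by `demo_stamp_copy`, whereas the same write after `demo_stamp_inplace` leaves the shared page with a stale checksum if it is written out as-is.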
Cluster-wide on/off, recorded in a control structure
Enabling checksums changes the on-disk format contract for every page in the cluster, so the flag is not per-table or per-session — it is a cluster-wide property recorded once in a control file and consulted by every read and write. Turning it on after the fact requires rewriting (or at least re-stamping) every existing page, which is why engines historically gate it at cluster initialization time and offer a separate offline tool to flip it later.
Defined failure policy with an operator escape hatch
A detected mismatch is, by default, fatal to the read: the engine refuses to return a page it knows is corrupt, raising a clearly-coded error and bumping a visible counter so monitoring can alarm. But operators recovering a damaged cluster sometimes need to read through corruption to salvage what they can, so engines provide an explicit, dangerous override that downgrades the error to a warning (or zeroes the page) for the duration of a recovery session.
Theory ↔ PostgreSQL mapping
| Theory / convention | PostgreSQL name |
|---|---|
| Per-page integrity code in the header | PageHeaderData.pd_checksum (16-bit) |
| Page-granular, self-excluding computation | pg_checksum_page zeroes pd_checksum transiently |
| Fast, vectorizable algorithm | FNV-1a-based pg_checksum_block, N_SUMS = 32 parallel sums |
| Mix in physical location | checksum ^= blkno (detect transposed pages) |
| Avoid the zero checksum | (checksum % 65535) + 1 (range 1..65535) |
| Compute on write-out (shared buffer) | PageSetChecksumCopy → pg_checksum_page, called from FlushBuffer |
| Compute on write-out (private memory) | PageSetChecksumInplace (localbuf, bulk_write, hash bucket allocation) |
| Hint-bit hazard mitigation | PageSetChecksumCopy takes a private pageCopy |
| Verify on read-in | PageIsVerified → pg_checksum_page, called from buffer read completion |
| Cluster-wide enable flag | ControlFileData.data_checksum_version (0 = off, PG_DATA_CHECKSUM_VERSION = 1) |
| Runtime “are checksums on?” predicate | DataChecksumsEnabled() |
| Enable at initdb | bootstrap.c -k → bootstrap_data_checksum_version |
| Offline enable/disable/verify tool | pg_checksums (frontend, also calls pg_checksum_page) |
| Failure policy + escape hatch | ignore_checksum_failure, zero_damaged_pages GUCs; PIV_* flags |
| Visible failure counter | pg_stat_database.checksum_failures via pgstat_report_checksum_failures_in_db |
| Generic multi-algorithm helper (CRC32C/SHA) | checksum_helper.c pg_checksum_* (separate from page checksums) |
PostgreSQL’s Approach
PostgreSQL data checksums are a cluster-wide, opt-in feature. When a cluster
is initialized with initdb -k (--data-checksums), or when checksums are
enabled later with the offline pg_checksums tool, every page that the buffer
manager writes carries a 16-bit checksum in the pd_checksum field of its page
header, and every page the buffer manager reads is verified against it. When the
feature is off (the default through PG 17; PG 18’s initdb enables checksums by
default, and --no-data-checksums opts out), the pd_checksum field is never
stamped or consulted, so it stays zero on a cluster that never had checksums enabled.
The feature has three moving parts: (1) the algorithm — a single function
pg_checksum_page() that turns a page image plus its block number into a
uint16; (2) the enforcement points — PageSetChecksumCopy /
PageSetChecksumInplace on the write path and PageIsVerified on the read
path, each gated by DataChecksumsEnabled(); and (3) the cluster flag —
data_checksum_version in pg_control, set at bootstrap.
The page-header field
The checksum lives in a 16-bit slot near the front of every page header, immediately after the page LSN:
// PageHeaderData — src/include/storage/bufpage.h
typedef struct PageHeaderData
{
    PageXLogRecPtr pd_lsn;      /* LSN: next byte after last byte of xlog
                                 * record for last change to this page */
    uint16      pd_checksum;    /* checksum */
    uint16      pd_flags;       /* flag bits, see below */
    LocationIndex pd_lower;     /* offset to start of free space */
    LocationIndex pd_upper;     /* offset to end of free space */
    LocationIndex pd_special;   /* offset to start of special space */
    uint16      pd_pagesize_version;
    TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
    ItemIdData  pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
} PageHeaderData;

Putting pd_checksum inside the checksummed region (it is at byte offset 8,
well before pd_lower) is exactly the self-reference problem from the previous
section, and pg_checksum_page resolves it by transiently zeroing the field —
detailed below. The on-disk format version that says “this cluster uses
checksums” is a separate constant, recorded once in the control file:
// version constants — src/include/storage/bufpage.h
#define PG_PAGE_LAYOUT_VERSION 4
#define PG_DATA_CHECKSUM_VERSION 1

The algorithm: FNV-1a, 32-wide, SIMD-friendly
The real code lives not in checksum.c (which is a one-line #include) but in
storage/checksum_impl.h, so that external programs — pg_checksums,
pg_upgrade, third-party block-inspection tools — can #include the identical
algorithm without linking the backend. checksum.c exists only to pull it into
the server:
// checksum.c — src/backend/storage/page/checksum.c
#include "storage/checksum.h"
/*
 * The actual code is in storage/checksum_impl.h.  This is done so that
 * external programs can incorporate the checksum code by #include'ing
 * that file from the exported Postgres headers.  (Compare our CRC code.)
 */
#include "storage/checksum_impl.h"  /* IWYU pragma: keep */

The algorithm’s header comment states the performance motivation directly: “The
algorithm used to checksum pages is chosen for very fast calculation. Workloads
where the database working set fits into OS file cache but not into shared
buffers can read in pages at a very fast pace and the checksum algorithm itself
can become the largest bottleneck.” It is built on the FNV-1a (Fowler/Noll/Vo)
hash, whose primitive folds in data with hash = (hash ^ value) * FNV_PRIME,
but with two deliberate departures.
First, plain FNV-1a “has bad mixing of high bits — high order bits in input data only affect high order bits in output data.” PostgreSQL fixes the avalanche by xor-ing the value back in, shifted right 17 bits, and processes 4 bytes at a time:
// CHECKSUM_COMP, FNV_PRIME — src/include/storage/checksum_impl.h
/* prime multiplier of FNV-1a hash */
#define FNV_PRIME 16777619

/*
 * Calculate one round of the checksum.
 */
#define CHECKSUM_COMP(checksum, value) \
do { \
    uint32 __tmp = (checksum) ^ (value); \
    (checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17); \
} while (0)

Second — and this is the SIMD trick — the page is not hashed as one serial
stream. The 8 KB page is reinterpreted as a BLCKSZ / (4 * 32) row × 32
column array of uint32, and 32 independent partial sums are advanced in
lockstep, one per column. Because the 32 multiplies in an inner iteration have
no data dependency on each other, a vectorizing compiler maps them onto SIMD
lanes (SSE4.1 pmulld, ARM NEON vmul.i32), hiding the multiply latency. The
union reinterprets the page without violating strict aliasing, and each partial
sum starts from a distinct random “offset basis” so identical columns do not
collapse to identical partials:
// PGChecksummablePage, N_SUMS, checksumBaseOffsets — src/include/storage/checksum_impl.h
/* number of checksums to calculate in parallel */
#define N_SUMS 32

/* Use a union so that this code is valid under strict aliasing */
typedef union
{
    PageHeaderData phdr;
    uint32      data[BLCKSZ / (sizeof(uint32) * N_SUMS)][N_SUMS];
} PGChecksummablePage;

/*
 * Base offsets to initialize each of the parallel FNV hashes into a
 * different initial state.
 */
static const uint32 checksumBaseOffsets[N_SUMS] = {
    0x5B1F36E9, 0xB8525960, 0x02AB50AA, 0x1DE66D2A,
    /* ... 28 more randomly-chosen 32-bit constants ... */
    0x9FBF8C76, 0x15CA20BE, 0xF2CA9FD3, 0x959BD756
};

pg_checksum_block does the work: seed the 32 partials from the offset table,
run the main pass over every row, add two extra rounds of zero “to mix the bits
of the last value added,” then xor-fold the 32 partials into one uint32:
// pg_checksum_block — src/include/storage/checksum_impl.h
static uint32
pg_checksum_block(const PGChecksummablePage *page)
{
    uint32      sums[N_SUMS];
    uint32      result = 0;
    uint32      i, j;

    /* ensure that the size is compatible with the algorithm */
    Assert(sizeof(PGChecksummablePage) == BLCKSZ);

    /* initialize partial checksums to their corresponding offsets */
    memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets));

    /* main checksum calculation */
    for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
        for (j = 0; j < N_SUMS; j++)
            CHECKSUM_COMP(sums[j], page->data[i][j]);

    /* finally add in two rounds of zeroes for additional mixing */
    for (i = 0; i < 2; i++)
        for (j = 0; j < N_SUMS; j++)
            CHECKSUM_COMP(sums[j], 0);

    /* xor fold partial checksums together */
    for (i = 0; i < N_SUMS; i++)
        result ^= sums[i];

    return result;
}

The wrapper pg_checksum_page adds the three remaining steps that distinguish a
page checksum from a generic block hash: it transiently zeroes pd_checksum so
the field excludes itself, mixes in the block number (so a page written to the
wrong location is detected even if its bytes are individually intact), and
reduces to a non-zero uint16:
// pg_checksum_page — src/include/storage/checksum_impl.h
uint16
pg_checksum_page(char *page, BlockNumber blkno)
{
    PGChecksummablePage *cpage = (PGChecksummablePage *) page;
    uint16      save_checksum;
    uint32      checksum;

    /* We only calculate the checksum for properly-initialized pages */
    Assert(!PageIsNew((Page) page));

    /*
     * Save pd_checksum and temporarily set it to zero, so that the checksum
     * calculation isn't affected by the old checksum stored on the page.
     * Restore it after, because actually updating the checksum is NOT part of
     * the API of this function.
     */
    save_checksum = cpage->phdr.pd_checksum;
    cpage->phdr.pd_checksum = 0;
    checksum = pg_checksum_block(cpage);
    cpage->phdr.pd_checksum = save_checksum;

    /* Mix in the block number to detect transposed pages */
    checksum ^= blkno;

    /*
     * Reduce to a uint16 (to fit in the pd_checksum field) with an offset of
     * one. That avoids checksums of zero, which seems like a good idea.
     */
    return (uint16) ((checksum % 65535) + 1);
}

Three subtleties worth flagging. (1) The function restores the old
pd_checksum; it returns the value but does not store it — stamping is the
caller’s job (PageSetChecksum*). (2) N_SUMS = 32 is “a fixed part of the
algorithm because changing the parallelism changes the checksum result” — the
on-disk checksum of a given page is defined by this constant, so it can never be
tuned without an on-disk format break. (3) The reduction (checksum % 65535) + 1
maps every result into 1..65535, reserving 0 for “no checksum,” and
introduces “a very slight bias towards lower values,” judged insignificant.
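The size of that bias can be worked out directly, since 2^32 = 65535 × 65537 + 1. The sketch below repeats the reduction arithmetic from the last line of pg_checksum_page and counts how many 32-bit inputs land on each output. The function names `reduce_to_pd_checksum` and `inputs_per_output` are hypothetical, introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the final line of pg_checksum_page. */
static uint16_t
reduce_to_pd_checksum(uint32_t checksum)
{
    uint16_t reduced = (uint16_t) ((checksum % 65535u) + 1u);

    assert(reduced != 0);       /* zero is never produced */
    return reduced;
}

/*
 * Number of 32-bit inputs that reduce to a given output (1..65535).
 * Inputs with residue r are r, r + 65535, r + 2 * 65535, ... up to
 * UINT32_MAX, so the count is floor((UINT32_MAX - r) / 65535) + 1.
 */
static uint32_t
inputs_per_output(uint16_t output)
{
    uint32_t residue;

    assert(output != 0);
    residue = (uint32_t) output - 1u;
    return (UINT32_MAX - residue) / 65535u + 1u;
}
```

Under this arithmetic the output value 1 is reached by 65538 inputs and every other output by 65537, a difference of one part in roughly 65 thousand, which is consistent with the source comment's judgment that the bias is insignificant.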
flowchart TD
P["8 KB page image<br/>(pd_checksum included)"] --> Z["save pd_checksum,<br/>set field = 0"]
Z --> R["reinterpret as<br/>32-column uint32 array"]
R --> S["seed 32 partial sums<br/>from checksumBaseOffsets"]
S --> M["main pass:<br/>CHECKSUM_COMP per column,<br/>32 lanes in parallel"]
M --> ZR["2 extra zero rounds<br/>(finish avalanche)"]
ZR --> F["xor-fold 32 partials<br/>to one uint32"]
F --> B["checksum ^= blkno<br/>(catch transposed page)"]
B --> RED["(checksum % 65535) + 1<br/>-> uint16 in 1..65535"]
Z -.->|restore after| P
RED --> OUT["return uint16"]
Figure 1 — pg_checksum_page data flow. The header field is transiently
zeroed so it excludes itself, the page is hashed as 32 parallel FNV streams for
SIMD throughput, the block number is folded in to catch mis-located pages, and
the 32-bit result is squeezed into the 1..65535 uint16 range. Stamping the
returned value into pd_checksum is the caller’s responsibility.
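The mixing fix in CHECKSUM_COMP can be checked with a small experiment. The sketch below rewrites the macro as a function next to a plain FNV-1a step and reports which output bits change when one input bit is flipped. The names `fnv1a_round`, `checksum_round`, and `changed_bits` are hypothetical and exist only for this illustration; the arithmetic of `checksum_round` is copied from the macro quoted earlier.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_FNV_PRIME 16777619u

/* Plain FNV-1a step applied to a 32-bit word. */
static uint32_t
fnv1a_round(uint32_t hash, uint32_t value)
{
    return (hash ^ value) * DEMO_FNV_PRIME;
}

/* Same arithmetic as CHECKSUM_COMP, written as a function. */
static uint32_t
checksum_round(uint32_t hash, uint32_t value)
{
    uint32_t tmp = hash ^ value;

    return (tmp * DEMO_FNV_PRIME) ^ (tmp >> 17);
}

/* Mask of output bits that change when input bit `bit` of value is flipped. */
static uint32_t
changed_bits(uint32_t (*round_fn) (uint32_t, uint32_t),
             uint32_t hash, uint32_t value, unsigned bit)
{
    assert(bit < 32);
    return round_fn(hash, value) ^ round_fn(hash, value ^ (UINT32_C(1) << bit));
}
```

Because the prime is odd, flipping input bit 31 changes only output bit 31 in the plain step (mask 0x80000000). With the extra `tmp >> 17` term the same flip also reaches bit 14 (mask 0x80004000), so high-order input bits now influence low-order output bits within a single round.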
Where it is computed (write) and verified (read)
The algorithm is location-agnostic; the policy of when to apply it lives in
the buffer manager and the page-support layer, every call gated by
DataChecksumsEnabled(). The next section walks the exact call sites; the
shape is: stamp in PageSetChecksum{Copy,Inplace} immediately before
smgrwrite, verify in PageIsVerified immediately after the read completes.
Source Walkthrough
The code splits cleanly into four call-flows: the algorithm (checksum.c
→ checksum_impl.h), the write path (PageSetChecksum* from the buffer
flush sites), the read path (PageIsVerified from the buffer read
completion, plus the failure-reporting machinery), and the enable / control
path (bootstrap.c, xlog.c, pg_control.h). A separate, often-confused
sibling — the generic checksum_helper.c — closes the section.
The algorithm and its public surface
The single exported entry point is declared in storage/checksum.h, which is
intentionally tiny so external programs can pull in just the prototype and then
the implementation header:
// checksum.h — src/include/storage/checksum.h
#include "storage/block.h"

/*
 * Compute the checksum for a Postgres page.  The page must be aligned on a
 * 4-byte boundary.
 */
extern uint16 pg_checksum_page(char *page, BlockNumber blkno);

The implementation (checksum_impl.h) is split into the macro CHECKSUM_COMP,
the N_SUMS = 32 / FNV_PRIME constants, the checksumBaseOffsets[32] seed
table, the PGChecksummablePage aliasing union, the inner pg_checksum_block,
and the wrapper pg_checksum_page — all quoted in the previous section. The
4-byte-alignment requirement is real: the union casts a char * page to an
array of uint32, so callers must hand it a properly-aligned buffer (shared
buffers and MemoryContextAllocAligned copies both satisfy this).
Write path: PageSetChecksumCopy and PageSetChecksumInplace
Both stampers live in bufpage.c, and both short-circuit when the page is new
(uninitialized, all-zero) or checksums are off — so a checksums-disabled cluster
pays nothing. The difference is who else can touch the page concurrently.
PageSetChecksumCopy is used when the page is a shared buffer being flushed
under only a shared content lock — another backend may be setting hint bits in
it right now. It therefore checksums a private copy, so a concurrent hint-bit
write cannot invalidate the stamped value:
// PageSetChecksumCopy — src/backend/storage/page/bufpage.c
char *
PageSetChecksumCopy(Page page, BlockNumber blkno)
{
    static char *pageCopy = NULL;

    /* If we don't need a checksum, just return the passed-in data */
    if (PageIsNew(page) || !DataChecksumsEnabled())
        return page;

    /* ... palloc the aligned copy buffer once, reuse thereafter ... */
    if (pageCopy == NULL)
        pageCopy = MemoryContextAllocAligned(TopMemoryContext,
                                             BLCKSZ,
                                             PG_IO_ALIGN_SIZE,
                                             0);

    memcpy(pageCopy, page, BLCKSZ);
    ((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
    return pageCopy;
}

The static pageCopy is allocated once per backend (aligned, for the
checksum’s uint32 casts) and reused; the function returns the buffer the
caller must write immediately. Its sole caller is FlushBuffer, right before
smgrwrite, and the comment there spells out the hazard the copy avoids:
// FlushBuffer — src/backend/storage/buffer/bufmgr.c
/*
 * Update page checksum if desired.  Since we have only shared lock on the
 * buffer, other processes might be updating hint bits in it, so we must
 * copy the page to private storage if we do checksumming.
 */
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
/* ... */
smgrwrite(reln, BufTagGetForkNum(&buf->tag), buf->tag.blockNum, bufToWrite, false);

PageSetChecksumInplace is the cheaper variant used whenever the caller knows
no one else can be mutating the buffer, so it stamps pd_checksum directly with
no copy:
// PageSetChecksumInplace — src/backend/storage/page/bufpage.c
void
PageSetChecksumInplace(Page page, BlockNumber blkno)
{
    /* If we don't need a checksum, just return */
    if (PageIsNew(page) || !DataChecksumsEnabled())
        return;

    ((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
}

Its callers are the three contexts where the page is private to the writer:
localbuf.c’s FlushLocalBuffer (temp-table buffers are never shared),
bulk_write.c’s smgr_bulk_flush (the bulk-write facility used by index
builds and COPY-style relation extension owns its pages outright), and
hashpage.c’s _hash_alloc_buckets (which pre-formats the last page of a
newly allocated run of bucket pages and writes it directly):
// FlushLocalBuffer — src/backend/storage/buffer/localbuf.c
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
/* ... */
smgrwrite(reln, BufTagGetForkNum(&bufHdr->tag), bufHdr->tag.blockNum, localpage, false);

// smgr_bulk_flush -> PageSetChecksumInplace — src/backend/storage/smgr/bulk_write.c
for (int i = 0; i < npending; i++)
{
    BlockNumber blkno = pending_writes[i].blkno;
    Page        page = pending_writes[i].buf->data;

    PageSetChecksumInplace(page, blkno);
    /* ... then smgrwrite / smgrextend the run ... */
}

flowchart TD
subgraph WRITE["Write path (stamp pd_checksum)"]
FB["FlushBuffer<br/>(shared buffer,<br/>shared content lock)"] --> PSCC["PageSetChecksumCopy<br/>(private copy,<br/>dodge hint-bit race)"]
LB["FlushLocalBuffer<br/>(temp buffers)"] --> PSCI["PageSetChecksumInplace"]
BW["smgr_bulk_flush<br/>(bulk write)"] --> PSCI
HP["_hash_alloc_buckets"] --> PSCI
PSCC --> PCP["pg_checksum_page"]
PSCI --> PCP
PCP --> SW["smgrwrite -> disk"]
end
GATE{"DataChecksumsEnabled()<br/>and not PageIsNew?"}
PSCC -.->|"no -> return page as-is"| GATE
PSCI -.->|"no -> return"| GATE
Figure 2 — Write-path stamping. Shared buffers go through PageSetChecksumCopy
(a private copy guards against concurrent hint-bit writes under the shared
lock); private pages (temp buffers, bulk write, hash bucket allocation) use the
cheaper in-place stamp. Every site is gated by DataChecksumsEnabled() and
skips new/zero pages.
Read path: PageIsVerified and failure reporting
Verification happens in PageIsVerified, called from the buffer read completion
callback the moment a page image arrives from smgr. The function does the
checksum comparison first (only for non-new pages, only when enabled), then a
set of independent header sanity checks, and treats an all-zero page as
acceptable (a crash can leave a zeroed but unlogged extension page on disk):
// PageIsVerified — src/backend/storage/page/bufpage.c
bool
PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_failure_p)
{
    const PageHeaderData *p = (const PageHeaderData *) page;
    bool        checksum_failure = false;
    bool        header_sane = false;
    uint16      checksum = 0;

    if (checksum_failure_p)
        *checksum_failure_p = false;

    if (!PageIsNew(page))
    {
        if (DataChecksumsEnabled())
        {
            checksum = pg_checksum_page(page, blkno);

            if (checksum != p->pd_checksum)
            {
                checksum_failure = true;
                if (checksum_failure_p)
                    *checksum_failure_p = true;
            }
        }

        /* independent header sanity (offsets nested correctly, MAXALIGNed) */
        if ((p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
            p->pd_lower <= p->pd_upper &&
            p->pd_upper <= p->pd_special &&
            p->pd_special <= BLCKSZ &&
            p->pd_special == MAXALIGN(p->pd_special))
            header_sane = true;

        if (header_sane && !checksum_failure)
            return true;
    }
    /* ... all-zero page is OK; else fall through to the failure report ... */

The two layers are deliberate: the checksum catches random corruption with high probability, while the header checks catch structured damage (impossible offsets) that a 16-bit checksum might miss. By default a page must pass both to be trusted; only the operator override shown next lets a header-sane page through with a bad checksum. The failure tail logs the diagnostic and decides whether the page is salvageable under that override:
// PageIsVerified (failure tail) — src/backend/storage/page/bufpage.c
    if (checksum_failure)
    {
        if ((flags & (PIV_LOG_WARNING | PIV_LOG_LOG)) != 0)
            ereport(flags & PIV_LOG_WARNING ? WARNING : LOG,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("page verification failed, calculated checksum %u but expected %u",
                            checksum, p->pd_checksum)));

        if (header_sane && (flags & PIV_IGNORE_CHECKSUM_FAILURE))
            return true;
    }

    return false;
}

The PIV_* flags (in bufpage.h) parameterize the policy: PIV_LOG_WARNING /
PIV_LOG_LOG choose the log level, and PIV_IGNORE_CHECKSUM_FAILURE lets a
header-sane page through despite a bad checksum:
// PIV flags — src/include/storage/bufpage.h
#define PIV_LOG_WARNING (1 << 0)
#define PIV_LOG_LOG (1 << 1)
#define PIV_IGNORE_CHECKSUM_FAILURE (1 << 2)

The buffer-read completion (buffer_readv_complete_one in bufmgr.c) wires the
GUCs into those flags. It always asks PageIsVerified to log (not error)
because completion may run in an I/O worker, deferring the user-visible
WARNING/ERROR to buffer_readv_report; the session’s
ignore_checksum_failure adds PIV_IGNORE_CHECKSUM_FAILURE, and
zero_damaged_pages (translated earlier into READ_BUFFERS_ZERO_ON_ERROR)
replaces an unverifiable page with zeroes:
// buffer read completion -> PageIsVerified — src/backend/storage/buffer/bufmgr.c
piv_flags = PIV_LOG_LOG;

/* the local zero_damaged_pages may differ from the definer's */
if (flags & READ_BUFFERS_IGNORE_CHECKSUM_FAILURES)
    piv_flags |= PIV_IGNORE_CHECKSUM_FAILURE;

if (!PageIsVerified((Page) bufdata, tag.blockNum, piv_flags, failed_checksum))
{
    if (flags & READ_BUFFERS_ZERO_ON_ERROR)
    {
        memset(bufdata, 0, BLCKSZ);
        *zeroed_buffer = true;
    }
    else
    {
        *buffer_invalid = true;
        failed = true;
    }
}
else if (*failed_checksum)
    *ignored_checksum = true;

The two session GUCs are read in WaitReadBuffers and folded into the read
flags so the completion callback (which may execute in a different process)
sees a consistent decision:
// WaitReadBuffers -> read flags — src/backend/storage/buffer/bufmgr.c
if (zero_damaged_pages)
    flags |= READ_BUFFERS_ZERO_ON_ERROR;

/* For the same reason ... we need to use this backend's value. */
if (ignore_checksum_failure)
    flags |= READ_BUFFERS_IGNORE_CHECKSUM_FAILURES;

Every detected mismatch (even an ignored one) is counted in cumulative stats,
surfaced through pg_stat_database.checksum_failures and
...checksum_last_failure. The bump goes straight into shared memory so it can
run inside a critical section:
// pgstat_report_checksum_failures_in_db — src/backend/utils/activity/pgstat_database.c
void
pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
{
    /* ... fetch the shared DB entry (create=false, no allocation) ... */
    sharedent = (PgStatShared_Database *) entry_ref->shared_stats;
    sharedent->stats.checksum_failures += failurecount;
    sharedent->stats.last_checksum_failure = GetCurrentTimestamp();
    pgstat_unlock_entry(entry_ref);
}

The SQL-visible accessors return NULL when checksums are off, so monitoring
can distinguish “zero failures” from “feature disabled”:
// pg_stat_get_db_checksum_failures — src/backend/utils/adt/pgstatfuncs.c
if (!DataChecksumsEnabled())
    PG_RETURN_NULL();
/* ... else return dbentry->checksum_failures ... */

flowchart TD
RD["smgr read completes<br/>(buffer_readv_complete_one)"] --> PIV["PageIsVerified"]
PIV --> CK{"DataChecksumsEnabled<br/>and pd_checksum<br/>mismatch?"}
CK -->|no| HDR{"header sane?"}
CK -->|yes| LOG["ereport LOG/WARNING<br/>ERRCODE_DATA_CORRUPTED<br/>page verification failed"]
LOG --> CNT["bump pg_stat_database<br/>.checksum_failures"]
CNT --> IGN{"PIV_IGNORE_CHECKSUM<br/>and header sane?"}
IGN -->|yes| OKI["accept page<br/>(ignored)"]
IGN -->|no| ZERO{"ZERO_ON_ERROR<br/>zero_damaged_pages?"}
ZERO -->|yes| ZP["memset page to 0<br/>accept"]
ZERO -->|no| BAD["buffer invalid<br/>-> ERROR to client"]
HDR -->|yes| OK["accept page"]
HDR -->|no| BAD
Figure 3 — Read-path verification and failure policy. A checksum mismatch is
logged and counted unconditionally; whether it is fatal depends on the
ignore_checksum_failure and zero_damaged_pages GUCs, translated into the
PIV_* / READ_BUFFERS_* flags. The independent header-sanity check is a
second, structured layer that even an enabled checksum does not subsume.
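The decision tree in Figure 3 can be condensed into one pure function. The sketch below is a simplified model under stated assumptions: it ignores the all-zero-page case and logging, and `DemoReadOutcome`, `demo_classify_read`, and `demo_read_succeeds` are hypothetical names for this illustration, not PostgreSQL symbols. The ordering mirrors the quoted code, where the ignore override is applied inside PageIsVerified before the caller considers zeroing.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum DemoReadOutcome
{
    DEMO_ACCEPT,                /* page passed verification */
    DEMO_ACCEPT_IGNORED,        /* bad checksum let through by the override */
    DEMO_ZEROED,                /* page replaced with zeroes */
    DEMO_ERROR                  /* buffer invalid, error reaches the client */
} DemoReadOutcome;

/* Model of the read-path policy for a page that is not all-zero. */
static DemoReadOutcome
demo_classify_read(bool checksum_failure, bool header_sane,
                   bool ignore_checksum_failure, bool zero_damaged_pages)
{
    if (header_sane && !checksum_failure)
        return DEMO_ACCEPT;
    if (checksum_failure && header_sane && ignore_checksum_failure)
        return DEMO_ACCEPT_IGNORED;
    if (zero_damaged_pages)
        return DEMO_ZEROED;
    return DEMO_ERROR;
}

/* True when the caller ends up with a usable (possibly zeroed) buffer. */
static bool
demo_read_succeeds(DemoReadOutcome outcome)
{
    assert(outcome <= DEMO_ERROR);
    return outcome != DEMO_ERROR;
}
```

Note that the ignore override never rescues a page whose header is not sane: in this model such a page can only be zeroed or rejected.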
Enable / control path: data_checksum_version in pg_control
Whether the whole machinery is active is a single cluster property recorded in the control file:

```c
// ControlFileData (excerpt) — src/include/catalog/pg_control.h
uint32 data_checksum_version;
```

The server itself sets it exactly once, at bootstrap. initdb -k passes -k to the bootstrap
backend, which raises the version from 0 to PG_DATA_CHECKSUM_VERSION (on REL_18,
initdb enables checksums by default, so -k is passed unless --no-data-checksums
is given):
```c
// BootstrapModeMain option handling — src/backend/bootstrap/bootstrap.c
case 'k':
    bootstrap_data_checksum_version = PG_DATA_CHECKSUM_VERSION;
    break;
/* ... later ... */
BootStrapXLOG(bootstrap_data_checksum_version);
```

BootStrapXLOG threads the value into InitControlFile, which stores it in the
control file that every subsequent backend reads at startup:
```c
// InitControlFile — src/backend/access/transam/xlog.c
ControlFile->data_checksum_version = data_checksum_version;
```

At runtime the predicate DataChecksumsEnabled() is the single source of truth
consulted by every stamp and verify site; it is a trivial read of the cached
control file:
```c
// DataChecksumsEnabled — src/backend/access/transam/xlog.c
bool
DataChecksumsEnabled(void)
{
    Assert(ControlFile != NULL);
    return (ControlFile->data_checksum_version > 0);
}
```

Two related facts fall out of this flag. First, xlog.c publishes a read-only
data_checksums GUC (SetConfigOption("data_checksums", DataChecksumsEnabled() ? "yes" : "no", ...))
so clients can query the cluster’s state. Second, the same flag forces hint-bit
writes to be WAL-logged: XLogHintBitIsNeeded() is
(DataChecksumsEnabled() || wal_log_hints). This is required for torn-page
safety — without it, a hint-bit-only change could alter a page’s bytes (and
therefore its checksum) without any WAL record to replay over a torn write,
yielding a spurious checksum failure after crash recovery. The offline
pg_checksums tool flips data_checksum_version between 0 and 1 on a
stopped cluster: enabling rewrites every page with a fresh checksum, while
disabling only updates pg_control. It is the supported way to
change the setting after initdb.
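
The torn-page hazard behind the hint-bit rule can be reproduced in a few lines. The sketch below is a toy model under loud assumptions: the checksum is a plain byte sum (not PostgreSQL's algorithm), the storage is assumed to write 4096-byte sectors atomically, and every name in it is hypothetical. It sets one "hint bit" in the second sector, re-stamps the page, and lets only the first sector reach the disk.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_BYTES   8192u
#define TOY_SECTOR_BYTES 4096u  /* assumed atomic write unit */
#define TOY_CSUM_OFFSET  8u     /* 16-bit checksum position in the header */

/* Toy additive checksum (NOT the real algorithm); skips its own field. */
uint16_t
toy_checksum(const uint8_t *page)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < TOY_PAGE_BYTES; i++)
        if (i != TOY_CSUM_OFFSET && i != TOY_CSUM_OFFSET + 1)
            sum += page[i];
    return (uint16_t) (sum % 65535 + 1);
}

/* Store the checksum in the page header. */
void
toy_stamp(uint8_t *page)
{
    uint16_t c = toy_checksum(page);

    memcpy(page + TOY_CSUM_OFFSET, &c, sizeof c);
}

/* Compare the stored checksum against a fresh computation. */
bool
toy_verifies(const uint8_t *page)
{
    uint16_t stored;

    memcpy(&stored, page + TOY_CSUM_OFFSET, sizeof stored);
    return stored == toy_checksum(page);
}

/* Hint-bit-only change, then a flush torn after the first sector. */
bool
hint_bit_torn_write_survives(void)
{
    static uint8_t disk[TOY_PAGE_BYTES];
    static uint8_t mem[TOY_PAGE_BYTES];

    memset(disk, 0, sizeof disk);
    toy_stamp(disk);                      /* valid page on disk */
    memcpy(mem, disk, sizeof mem);

    mem[6000] |= 0x01;                    /* "hint bit", no WAL record */
    toy_stamp(mem);                       /* checksum covers the new byte */

    memcpy(disk, mem, TOY_SECTOR_BYTES);  /* crash: second sector is lost */
    return toy_verifies(disk);
}
```

In this model the on-disk page ends up with the new checksum and the old second sector, so verification fails although no hardware misbehaved. WAL-logging the hint-bit change gives recovery a page image to replay over exactly this state.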
The sibling you should not confuse: checksum_helper.c
src/common/checksum_helper.c is a different facility that also lives under
the word “checksum” and trips up grep-driven readers. It is a generic,
algorithm-pluggable digest used by features like backup manifests and
pg_verifybackup — not the page checksum. Its pg_checksum_type enum spans
CRC32C and the SHA-2 family, and it offers an init/update/final streaming API:
```c
// pg_checksum_init (excerpt) — src/common/checksum_helper.c
int
pg_checksum_init(pg_checksum_context *context, pg_checksum_type type)
{
    context->type = type;

    switch (type)
    {
        case CHECKSUM_TYPE_NONE:
            break;
        case CHECKSUM_TYPE_CRC32C:
            INIT_CRC32C(context->raw_context.c_crc32c);
            break;
        case CHECKSUM_TYPE_SHA224:
            context->raw_context.c_sha2 = pg_cryptohash_create(PG_SHA224);
            /* ... */
    }

    return 0;
}
```

The naming overlap is purely lexical: pg_checksum_page (page integrity, FNV,
fixed 16-bit) and pg_checksum_init/_update/_final (manifest digests,
multiple algorithms, variable length) share no code. The basebackup.c
verify_page_checksum path does use the real page checksum
(pg_checksum_page) — it re-verifies pages while streaming a base backup, but
skips pages modified since the backup’s start LSN, because those may be torn and
“replaying WAL would reinstate the correct page.”
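
To make the init/update/final shape of the manifest-digest facility concrete, here is a minimal stand-alone sketch of a streaming CRC-32C. It is not the checksum_helper.c code: the crc32c_ctx type and the three function names are hypothetical, and the bitwise loop is the textbook form rather than the table-driven or hardware-assisted variants a production build would use.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical streaming context holding the running CRC register. */
typedef struct {
    uint32_t crc;
} crc32c_ctx;

void
crc32c_init(crc32c_ctx *ctx)
{
    ctx->crc = 0xFFFFFFFFu;
}

/* Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78). */
void
crc32c_update(crc32c_ctx *ctx, const void *data, size_t len)
{
    const uint8_t *p = data;

    while (len--) {
        ctx->crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            ctx->crc = (ctx->crc >> 1) ^ (0x82F63B78u & (0u - (ctx->crc & 1u)));
    }
}

/* Apply the final inversion and return the digest. */
uint32_t
crc32c_final(const crc32c_ctx *ctx)
{
    return ctx->crc ^ 0xFFFFFFFFu;
}
```

Feeding "1234" and then "56789" produces the same digest as feeding "123456789" in one call, which is the property that lets a file of any length be digested chunk by chunk. The page checksum has no such API because its input is always exactly one block.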
Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
|---|---|---|
| pg_checksum_page | src/include/storage/checksum_impl.h | 187 |
| pg_checksum_block | src/include/storage/checksum_impl.h | 145 |
| CHECKSUM_COMP | src/include/storage/checksum_impl.h | 135 |
| N_SUMS, FNV_PRIME | src/include/storage/checksum_impl.h | 106, 108 |
| checksumBaseOffsets | src/include/storage/checksum_impl.h | 121 |
| PGChecksummablePage | src/include/storage/checksum_impl.h | 111 |
| pg_checksum_page (decl) | src/include/storage/checksum.h | 22 |
| #include checksum_impl.h | src/backend/storage/page/checksum.c | 22 |
| PageHeaderData.pd_checksum | src/include/storage/bufpage.h | 164 |
| PG_DATA_CHECKSUM_VERSION | src/include/storage/bufpage.h | 208 |
| PIV_LOG_WARNING / PIV_LOG_LOG / PIV_IGNORE_CHECKSUM_FAILURE | src/include/storage/bufpage.h | 469–471 |
| PageIsVerified | src/backend/storage/page/bufpage.c | 94 |
| PageSetChecksumCopy | src/backend/storage/page/bufpage.c | 1509 |
| PageSetChecksumInplace | src/backend/storage/page/bufpage.c | 1541 |
| ignore_checksum_failure (GUC var) | src/backend/storage/page/bufpage.c | 27 |
| FlushBuffer → PageSetChecksumCopy | src/backend/storage/buffer/bufmgr.c | 4372 |
| read completion → PageIsVerified | src/backend/storage/buffer/bufmgr.c | 7100 |
| WaitReadBuffers GUC→flags | src/backend/storage/buffer/bufmgr.c | 1821, 1828 |
| buffer_readv_report | src/backend/storage/buffer/bufmgr.c | 7286 |
| FlushLocalBuffer → PageSetChecksumInplace | src/backend/storage/buffer/localbuf.c | 201 |
| smgr_bulk_flush → PageSetChecksumInplace | src/backend/storage/smgr/bulk_write.c | 282 |
| _hash_alloc_buckets → PageSetChecksumInplace | src/backend/access/hash/hashpage.c | 1032 |
| DataChecksumsEnabled | src/backend/access/transam/xlog.c | 4611 |
| InitControlFile (sets version) | src/backend/access/transam/xlog.c | 4200, 4231 |
| bootstrap_data_checksum_version (-k) | src/backend/bootstrap/bootstrap.c | 204, 287 |
| ControlFileData.data_checksum_version | src/include/catalog/pg_control.h | 222 |
| XLogHintBitIsNeeded | src/include/access/xlog.h | 120 |
| verify_page_checksum | src/backend/backup/basebackup.c | 1993 |
| pgstat_report_checksum_failures_in_db | src/backend/utils/activity/pgstat_database.c | 166 |
| pg_stat_get_db_checksum_failures | src/backend/utils/adt/pgstatfuncs.c | 1154 |
| pg_checksum_init (helper, distinct) | src/common/checksum_helper.c | 83 |

Source verification (as of 2026-06-05)
All excerpts were read from the REL_18_STABLE working tree at commit 273fe94.
Verification notes:

- pg_checksum_page is in a header, not checksum.c. checksum.c is a two-#include file; the body lives in src/include/storage/checksum_impl.h so frontend tools such as pg_checksums compile the identical algorithm. Anyone grepping checksum.c for the FNV loop finds nothing (confirmed by reading both files).
- N_SUMS = 32, FNV_PRIME = 16777619 verified verbatim in checksum_impl.h; the comment “changing the parallelism changes the checksum result” confirms the constant is an on-disk format invariant, not a tunable.
- The reduction is (checksum % 65535) + 1: modulo 65535 (not 65536) with a +1 offset, yielding 1..65535 and reserving 0. Read directly from pg_checksum_page.
- pd_checksum is a uint16, the second field of PageHeaderData (offset 8, after the 8-byte pd_lsn), confirmed in bufpage.h. The field is inside the checksummed region; pg_checksum_page zeroes it transiently.
- Write sites confirmed: FlushBuffer (shared, via PageSetChecksumCopy), FlushLocalBuffer, smgr_bulk_flush, _hash_alloc_buckets (all via PageSetChecksumInplace). The shared-vs-private distinction and the hint-bit rationale are quoted from the in-tree comments.
- Read site confirmed: PageIsVerified is the only checksum verification call in the backend read path; it is reached from the AIO buffer-read completion in bufmgr.c. The user-visible message is "page verification failed, calculated checksum %u but expected %u" with ERRCODE_DATA_CORRUPTED.
- The two escape-hatch GUCs ignore_checksum_failure and zero_damaged_pages are both session settings translated into READ_BUFFERS_* flags in WaitReadBuffers, because the completion callback may run in a different process than the definer. Verified at lines 1821/1828.
- Cluster enable is data_checksum_version in pg_control, set once at bootstrap via -k. DataChecksumsEnabled() is version > 0. Verified in xlog.c, bootstrap.c, pg_control.h.
- No PG19-only surface asserted. This doc deliberately does not cover the online checksum-enable worker / background process (post-PG18). The only ways to change data_checksum_version described here are initdb -k at creation time and the offline pg_checksums tool on a stopped cluster, both REL_18 facts.
- checksum_helper.c is a separate facility (CRC32C / SHA-2 digests for backup manifests), sharing only the lexical prefix pg_checksum_. Confirmed by reading its enum and init/update/final API; it never calls pg_checksum_page.

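
Two of the notes above, the header offset and the final reduction, are easy to check mechanically. The sketch below is an illustration only: the struct types are hypothetical mirrors of the first fields of PageHeaderData, not the real declarations, and reduce_checksum is a free-standing copy of the one-line reduction.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the 8-byte LSN at the start of the page header. */
typedef struct {
    uint32_t xlogid;   /* high half of pd_lsn */
    uint32_t xrecoff;  /* low half of pd_lsn */
} page_lsn_model;

/* Hypothetical mirror of the first three header fields. */
typedef struct {
    page_lsn_model pd_lsn;       /* bytes 0..7 */
    uint16_t       pd_checksum;  /* second field, bytes 8..9 */
    uint16_t       pd_flags;
} page_header_prefix_model;

/* Fold a 32-bit value into 1..65535; 0 is never produced. */
uint16_t
reduce_checksum(uint32_t checksum)
{
    return (uint16_t) ((checksum % 65535) + 1);
}
```

Because the modulus is 65535 rather than 65536, the inputs 0, 65535, and 4294967295 all reduce to 1, and no input reduces to 0.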
Beyond PostgreSQL — Comparative Designs & Research Frontiers
PostgreSQL’s page checksum is a single, deliberately narrow tool: a fast non-cryptographic error-detecting code over the 8 KB block, recomputed on every flush and verified on every read. Placing it against the broader design space clarifies both what it buys and what it leaves on the table.

- Speed-first detection vs. cryptographic integrity. The page checksum is FNV-1a-derived precisely because, as the header comment warns, “the checksum algorithm itself can become the largest bottleneck” for a working set that fits OS cache but not shared buffers. The choice is error detection, not tamper resistance: a uint16 with a stated ~2e-16 false-positive rate (which can only mean 2^-16, about 1.5e-5 per corrupted page, since a 16-bit code cannot do better) catches storage bit-rot but is trivially forgeable. PostgreSQL keeps the cryptographic family entirely separate: checksum_helper.c exposes CRC32C and the SHA-2 suite for backup manifests, where an adversary-or-bug model and a per-file (not per-page) granularity make a 224–512-bit digest affordable:

  ```c
  // pg_checksum_init — src/common/checksum_helper.c
  switch (type)
  {
      case CHECKSUM_TYPE_NONE:
          break;
      case CHECKSUM_TYPE_CRC32C:
          INIT_CRC32C(context->raw_context.c_crc32c);
          break;
      case CHECKSUM_TYPE_SHA256:
          context->raw_context.c_sha2 = pg_cryptohash_create(PG_SHA256);
          /* ... SHA224/384/512 are parallel cases ... */
          break;
  }
  ```

  The two facilities never meet: the page path optimizes for throughput on the hot read/write loop, the manifest path for digest strength on a cold, once-per-file pass. Conflating them would either slow every page flush or weaken every backup-verification claim.

- Hardware/format-level CRCs in other engines. Many engines push integrity into a wider or different code. InnoDB historically stamped a CRC32C over the page (with a legacy “innodb” checksum and a crc32 mode), and like PostgreSQL it brackets the page so that the recorded checksum sits at a known offset; SQL Server’s PAGE_VERIFY CHECKSUM likewise computes a single page-wide value stored in the header. The common shape (one code, page granularity, verified on read) is the same; the polynomial/primitive and the width differ. PostgreSQL’s distinguishing move is mixing in the block number (checksum ^= blkno) so a page written to the wrong physical location is caught even if its bytes are individually intact, a misdirected-write (transposition) failure mode that a content-only CRC misses.

- Checksums vs. end-to-end integrity (the storage-stack argument). A classic systems result, Stone & Partridge’s analysis of TCP/Ethernet checksum gaps together with the broader end-to-end argument (Saltzer, Reed & Clark), shows that a checksum applied at one layer does not protect the data once it leaves that layer’s custody. PostgreSQL’s checksum is computed in PageSetChecksum* just before smgrwrite and verified in PageIsVerified just after the read completes, so it covers exactly the journey through the kernel, the filesystem, the block layer, and the medium: the segment where PostgreSQL has surrendered the bytes. It deliberately does not cover the in-memory lifetime of a buffer (a RAM bit-flip on a dirty page in shared buffers is re-checksummed into a “valid” page on flush); that gap is the province of ECC memory, not of pd_checksum.

- What the checksum cannot see. Because the WAL carries its own CRC and full-page images, and because pd_checksum is recomputed on every flush rather than logged, the page checksum detects corruption introduced below PostgreSQL but is blind to logic bugs above it: a backend that writes a semantically wrong but structurally valid tuple produces a perfectly valid checksum. This is the same boundary that filesystem checksums (ZFS, Btrfs) draw: they catch the medium lying, not the application lying. The research frontier of silent data corruption (Bairavasundaram et al.’s large-scale studies of latent sector errors and disk corruption in production fleets) is precisely the failure class pd_checksum was built to surface: corruption that returns successfully from read() with wrong bytes and no I/O error.

- Granularity and the cost of being wrong. A 16-bit page code is a conscious trade: it costs 2 bytes per 8 KB (0.024%) and a few cycles, versus a per-row or per-cell checksum (finer localization, far higher overhead) or a whole-segment digest (cheaper amortized, but no per-page isolation and no online verification). PostgreSQL’s per-page choice aligns the integrity unit with the I/O unit and the recovery unit (the same block that is the atom of smgrread/smgrwrite, of buffer replacement, and of WAL full-page images), which is why a failure can be reported as a single bad block and optionally zeroed (zero_damaged_pages) or tolerated (ignore_checksum_failure) without dragging in neighbouring data.

- Online enablement (out of scope here, noted for the reader). This doc describes the REL_18 reality: checksums are a cluster-wide property fixed at initdb -k, changeable afterward only by the offline pg_checksums tool on a stopped cluster. The ability to enable checksums online on a running cluster via a background worker is a later development and is intentionally not asserted here. Readers on newer branches should treat the enable-path described in section 4 (the data_checksum_version bootstrap write) as the floor, not the ceiling.

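
The block-number mixing mentioned in the comparison with other engines can be shown with a deliberately simplified checksum. This sketch is not on-disk compatible and is not pg_checksum_page: it runs a single lane seeded with the standard 32-bit FNV offset basis (an assumption made for illustration), whereas the real code uses 32 lanes with per-lane seeds and extra zero rounds. The function names are hypothetical. Only the mixing round, the blkno XOR, and the final reduction follow the shapes described in this document.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_FNV_PRIME 16777619u

/* One mixing round in the style of CHECKSUM_COMP: FNV-1a plus a >>17 fold. */
uint32_t
toy_mix(uint32_t hash, uint32_t value)
{
    uint32_t tmp = hash ^ value;

    return (tmp * TOY_FNV_PRIME) ^ (tmp >> 17);
}

/* Simplified single-lane page checksum bound to a block number. */
uint16_t
toy_page_checksum(const uint32_t *words, size_t nwords, uint32_t blkno)
{
    uint32_t hash = 0x811C9DC5u;  /* FNV offset basis, assumed seed */

    for (size_t i = 0; i < nwords; i++)
        hash = toy_mix(hash, words[i]);

    hash ^= blkno;                /* tie the checksum to the page's address */
    return (uint16_t) ((hash % 65535) + 1);
}
```

With this construction, identical bytes stamped for block 5 and then read back as block 6 yield different checksums, so a misdirected write is flagged even though the page content is intact.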
Sources
In-tree source files (REL_18_STABLE, commit 273fe94, as of 2026-06-06)

- src/backend/storage/page/checksum.c: the two-#include shim that pulls the algorithm body in from the exported header so frontend tools compile the identical code.
- src/include/storage/checksum_impl.h: the algorithm itself: PGChecksummablePage union, checksumBaseOffsets[N_SUMS], the CHECKSUM_COMP mixing macro, pg_checksum_block, and the public pg_checksum_page (FNV-1a parallel sums, two zero rounds, xor fold, block-number mix, (checksum % 65535) + 1 reduction). Also the long design rationale comment (SIMD parallelism, the choice of 32, why parallelism is a format invariant).
- src/include/storage/checksum.h: the one-line public prototype for pg_checksum_page.
- src/include/storage/bufpage.h: PageHeaderData / pd_checksum field layout, PageSetChecksumInplace / PageSetChecksumCopy declarations, PageIsVerified and the PIV_* flag bits.
- src/backend/storage/page/bufpage.c: PageIsVerified, PageSetChecksumCopy (private-copy path under shared lock), PageSetChecksumInplace (in-place path), and the checksum-failure WARNING/LOG text.
- src/backend/storage/buffer/bufmgr.c: the read-completion verification path (PageIsVerified via the AIO buffer-read callback), the FlushBuffer write site, and the ignore_checksum_failure / zero_damaged_pages → READ_BUFFERS_* translation in WaitReadBuffers.
- src/backend/storage/buffer/localbuf.c: FlushLocalBuffer, the local-buffer write site stamping checksums in place.
- src/backend/storage/smgr/bulk_write.c: smgr_bulk_flush, the bulk-load write path that checksums each block before smgrextend/smgrwrite.
- src/backend/access/transam/xlog.c: DataChecksumsEnabled and the data_checksum_version plumbing in ControlFile.
- src/backend/bootstrap/bootstrap.c: the bootstrap -k handling that sets the cluster’s checksum version at initdb time.
- src/backend/backup/basebackup.c: base-backup-time checksum verification of data pages as they are streamed.
- src/backend/utils/activity/pgstat_database.c: pgstat_report_checksum_failure*, feeding pg_stat_database.checksum_failures / checksum_last_failure.
- src/include/catalog/pg_control.h: data_checksum_version in ControlFileData, the on-disk home of the cluster-wide flag.
- src/common/checksum_helper.c: distinct facility: CRC32C / SHA-2 digest contexts for backup manifests (pg_checksum_init/update/final, pg_checksum_type). Shares only the pg_checksum_ prefix; never invokes pg_checksum_page. Included here to document the boundary, not as part of the page-checksum path.

Theory anchors

- FNV-1a hash: Fowler/Noll/Vo, the non-cryptographic hash family the page algorithm is built on; described at the URL cited in the in-tree comment (isthe.com/chongo/tech/comp/fnv). PostgreSQL’s variant adds the ^ ((hash ^ value) >> 17) high-bit mixing step and 32-way parallelism.
- End-to-end argument / checksum coverage gaps: Saltzer, Reed & Clark, End-to-End Arguments in System Design (1984); Stone & Partridge, When the CRC and TCP Checksum Disagree (SIGCOMM 2000). Motivates computing and verifying the checksum at the exact smgrwrite/read boundary.
- Silent data corruption in the field: Bairavasundaram et al., An Analysis of Latent Sector Errors in Disk Drives (SIGMETRICS ’07) and An Analysis of Data Corruption in the Storage Stack (FAST ’08). The empirical case that read() can succeed with wrong bytes, which is the failure class pd_checksum exists to detect.
- DBMS reliability framing: see the general DBMS reliability/recovery material in knowledge/research/dbms-general/ (page-level integrity sits beside WAL and the buffer manager as the third leg of on-disk durability), and the relevant entries in dbms-papers per the project bibliography.

Related KB docs (cross-references, not duplicated here)

- postgres-page-layout.md: PageHeaderData field-by-field layout; this doc defers the full header anatomy there and only quotes pd_checksum’s offset.
- postgres-buffer-manager.md: buffer eviction, FlushBuffer / WaitReadBuffers mechanics, and the AIO read-completion machinery; this doc names the write/read call sites but defers the surrounding flush/eviction lifecycle there.
- postgres-smgr-md.md: the smgrwrite / smgrread layer the checksum brackets.
- postgres-xlog-wal.md: WAL CRCs and full-page images, the other integrity code, which protects the log rather than the heap/index pages.
- postgres-backup-basebackup.md / postgres-incremental-backup.md: where the separate checksum_helper.c manifest digests are actually consumed.