PostgreSQL Page Layout — Slotted Page Format, ItemId Indirection, and Checksum Mechanics
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Every disk-resident relational database must solve the same physical problem: it stores data on block devices that operate in fixed-size units (sectors, OS pages), yet the tuples the engine manipulates are variable-length and must be located in O(1) time by a logical address rather than a physical byte offset. Database Internals (Petrov, ch. 3 §“Page Structure”) frames this as a two-level addressing problem. The engine carves disk into fixed-size pages (also called blocks in PostgreSQL’s vocabulary), each page is the unit of I/O, and within a page variable-length records are managed by an indirection layer: a directory that maps logical slot numbers to physical byte offsets inside the page.
Two design axes follow immediately:
- How should items be addressed? A physical offset bakes the byte position into every reference and breaks the moment the page is compacted. A logical slot number (an index into the directory) lets the engine shuffle bytes freely on the page without invalidating any external pointer, as long as it keeps the directory up to date. PostgreSQL chooses logical slot numbers, which it calls line pointers or item identifiers.
- How should free space be managed within the page? The simplest scheme grows the item directory from one end and pushes item data from the other end, meeting in a free-space gap in the middle. This is the slotted page layout described in standard textbooks (for example Database System Concepts, Silberschatz et al., ch. 13 §“File Organization”). It is simple, compact, and requires no per-slot free-list overhead.
For heap pages in PostgreSQL, a third design axis matters for MVCC
(multi-version concurrency control): the on-page per-tuple header must
carry enough transaction visibility information (t_xmin, t_xmax,
t_ctid, t_infomask) for the visibility decision to be made without
consulting any external structure in the common case. This is not
universal across DBMSes (InnoDB stores undo log pointers in the tuple
header and reconstructs older versions from a separate undo area), but
it is central to PostgreSQL’s model, in which an update writes a new
tuple version and links it to the old one through a version chain.
Finally, a fourth concern in modern systems is silent data corruption:
a bit flip in storage or a faulty HBA can deliver a page that is
syntactically valid yet contains wrong data. A page checksum computed
at write time and verified at read time detects this class of failure.
PostgreSQL added optional per-page checksums in release 9.3; since
PostgreSQL 12 they can be enabled or disabled with pg_checksums on an
offline cluster, and since PostgreSQL 18 initdb enables them by default.
Common DBMS Design
The slotted page layout has converged to essentially one shape across most disk-resident DBMSes. The engineering conventions are worth naming explicitly so that PostgreSQL’s specific choices read as one instantiation of a well-understood pattern.
Header → line-pointer array → free gap → item area
Every slotted page has four contiguous regions:
- Fixed-size header: page-level metadata, namely the LSN or sequence number (for WAL), flags, and the three offsets (lower, upper, special) that delimit the three variable regions.
- Line-pointer array: grows forward (toward higher offsets) as items are inserted. Each entry is a small fixed-size slot: a byte offset to the item body, the item’s byte length, and a state flag.
- Free space gap: the unused region between pd_lower and pd_upper. The page is full when this gap is too small to hold a new line pointer plus a new item body.
- Item area: grows backward (toward lower offsets), with pd_upper marking its current low edge. Items are packed tightly; MAXALIGN padding is applied to keep items aligned.
The discipline that emerges from this layout: line pointer slots
are never moved once allocated (they are only re-used when marked
UNUSED), but item bodies may be compacted by PageRepairFragmentation,
which slides them together and resets pd_upper. External references
(ItemPointers stored in indexes) hold a (block, slot-number) pair
and remain valid through compaction because the slot number is an
indirection, not a byte offset.
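The slot-number discipline can be illustrated with a toy page. The sketch below is not PostgreSQL code: ToyPage, ToySlot, and the toy_* functions are hypothetical names, the slot array is kept outside the data area for brevity, and there is no alignment or header. It only shows that compaction rewrites byte offsets while slot numbers stay fixed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 256u      /* hypothetical; PostgreSQL uses BLCKSZ */
#define TOY_MAX_SLOTS 16u

/* Toy slot: byte offset and length of one item; len == 0 means unused. */
typedef struct { uint16_t off; uint16_t len; } ToySlot;

typedef struct {
    uint16_t nslots;                /* slots allocated so far */
    uint16_t upper;                 /* lowest byte used by item bodies */
    ToySlot  slots[TOY_MAX_SLOTS];
    uint8_t  data[TOY_PAGE_SIZE];   /* bodies fill this from the end backward */
} ToyPage;

static void toy_init(ToyPage *p)
{
    memset(p, 0, sizeof(*p));
    p->upper = TOY_PAGE_SIZE;
}

/* Returns a 1-based slot number, or 0 if the item does not fit. */
static unsigned toy_insert(ToyPage *p, const void *item, uint16_t len)
{
    assert(len > 0);
    if (p->nslots == TOY_MAX_SLOTS || len > p->upper)
        return 0;
    p->upper = (uint16_t) (p->upper - len);
    memcpy(p->data + p->upper, item, len);
    p->slots[p->nslots].off = p->upper;
    p->slots[p->nslots].len = len;
    return ++p->nslots;
}

/* The slot stays allocated; only its body becomes a hole. */
static void toy_delete(ToyPage *p, unsigned slot)
{
    assert(slot >= 1 && slot <= p->nslots);
    p->slots[slot - 1].len = 0;
}

/* Slide live bodies toward the end of the page; slot numbers never change. */
static void toy_compact(ToyPage *p)
{
    uint8_t  scratch[TOY_PAGE_SIZE];
    uint16_t upper = TOY_PAGE_SIZE;

    for (unsigned i = 0; i < p->nslots; i++) {
        ToySlot *s = &p->slots[i];
        if (s->len == 0)
            continue;
        upper = (uint16_t) (upper - s->len);
        memcpy(scratch + upper, p->data + s->off, s->len);
        s->off = upper;             /* only the byte offset is rewritten */
    }
    memcpy(p->data + upper, scratch + upper, (size_t) (TOY_PAGE_SIZE - upper));
    p->upper = upper;
}

/* Look an item up by slot number; NULL if the slot is unused. */
static const void *toy_get(const ToyPage *p, unsigned slot)
{
    assert(slot >= 1 && slot <= p->nslots);
    return p->slots[slot - 1].len ? p->data + p->slots[slot - 1].off : NULL;
}
```

A caller that remembered slot 3 before toy_compact can still pass 3 to toy_get afterwards, which is the property an index relies on when it stores a (block, slot-number) pair.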
Special space for access-method metadata
Indexes need per-page metadata that is not part of the item stream:
the page’s level in the tree, sibling pointers, cycle-detection markers.
The convention is a special space at the end of the page, carved
out at PageInit time. Heap pages have zero special space; index AMs
initialize it to their own opaque struct (BTPageOpaqueData for
B-tree, etc.).
The TID: a cross-page logical address
A tuple identifier (TID) is the durable logical address by which one
page’s content refers to a tuple on a (possibly different) page. It
holds a (block number, line-pointer offset) pair. Indexes store TIDs
as their “payload” pointing into the heap. The heap itself uses TIDs
in t_ctid to chain MVCC versions. A TID is the minimal cross-page
reference: small enough to store in index leaves, stable under tuple
compaction, and resolvable in one buffer-manager lookup.
Page checksum: per-block, not per-row
The standard technique for storage integrity is a per-page checksum stored in the page header, computed at write time over the full page body (or a representative sample of it), and checked at read time. Computing it per-row instead would require touching every row header on every write and every read, which is too expensive. The checksum covers the whole block so the cost is one computation per block I/O. The checksum is typically tied to the block number to detect block relocation (a page written to the wrong sector) as well as bit-flip corruption.
Theory ↔ PostgreSQL mapping
| Concept | PostgreSQL name |
|---|---|
| Page / block | Page (PageData) — BLCKSZ bytes (default 8192) |
| Page header | PageHeaderData (bufpage.h) |
| Lower free-space boundary | pd_lower |
| Upper free-space boundary | pd_upper |
| Special-space boundary | pd_special |
| WAL position of last change | pd_lsn (PageXLogRecPtr) |
| Line pointer / slot | ItemIdData (itemid.h) |
| Line-pointer state | LP_UNUSED / LP_NORMAL / LP_REDIRECT / LP_DEAD |
| Tuple identifier (TID) | ItemPointerData (itemptr.h) |
| Block number part of TID | ip_blkid (BlockIdData) |
| Offset-number part of TID | ip_posid (OffsetNumber) |
| Per-page checksum | pd_checksum; computed by pg_checksum_page |
| Page initialization | PageInit |
| Page verification | PageIsVerified |
| Item insertion | PageAddItemExtended |
| Tuple compaction | PageRepairFragmentation |
PostgreSQL’s Approach
Page header: PageHeaderData
Every PostgreSQL page (heap page, index page, FSM page, VM page)
begins with the same 24-byte PageHeaderData:
```c
// PageHeaderData: src/include/storage/bufpage.h
typedef struct PageHeaderData
{
    PageXLogRecPtr pd_lsn;          /* WAL position of last change */
    uint16      pd_checksum;        /* page checksum (optional) */
    uint16      pd_flags;           /* PD_HAS_FREE_LINES | PD_PAGE_FULL | PD_ALL_VISIBLE */
    LocationIndex pd_lower;         /* offset to start of free space */
    LocationIndex pd_upper;         /* offset to end of free space */
    LocationIndex pd_special;       /* offset to start of special space */
    uint16      pd_pagesize_version; /* high byte = page size; low byte = layout version */
    TransactionId pd_prune_xid;     /* oldest prunable XID on this page, or 0 */
    ItemIdData  pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
} PageHeaderData;
```

SizeOfPageHeaderData is defined as offsetof(PageHeaderData, pd_linp),
which evaluates to 24 bytes on all current platforms. The line-pointer
array begins immediately after this fixed header.
pd_lsn is the most important field for the buffer manager. When
FlushBuffer is about to write a dirty page to disk, it calls
XLogFlush(PageGetLSN(page)) to ensure the WAL is durably written up
to at least the LSN of the last change to this page. This is how
PostgreSQL physically enforces the WAL-before-data rule.
pd_pagesize_version packs page size and layout version into one
uint16. The high byte encodes the page size in units of 256 (so
BLCKSZ = 8192 → 0x2000); the low byte is the layout version number,
currently PG_PAGE_LAYOUT_VERSION = 4 (unchanged since PostgreSQL 8.3).
PageGetPageSize extracts size by masking with 0xFF00;
PageGetPageLayoutVersion extracts version by masking with 0x00FF.
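The packing described above can be sketched in a few lines. The function names below are hypothetical stand-ins, not the PostgreSQL macros; the sketch assumes only what the text states, namely that the page size is a multiple of 256 and the version fits in one byte.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a page size (a multiple of 256, at most 0xFF00) and a layout version into 16 bits. */
static uint16_t pack_size_version(uint32_t page_size, uint8_t version)
{
    assert((page_size & 0xFF00u) == page_size);   /* only bits 8..15 may be set */
    return (uint16_t) (page_size | version);
}

/* Recover the page size by masking off the low (version) byte. */
static uint32_t unpack_page_size(uint16_t packed)
{
    return packed & 0xFF00u;
}

/* Recover the layout version from the low byte. */
static uint8_t unpack_version(uint16_t packed)
{
    return (uint8_t) (packed & 0x00FFu);
}
```

With an 8192-byte page and layout version 4, the packed field is 0x2004: the size contributes 0x2000 and the version contributes 0x04.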
pd_flags carries three hint bits:
- PD_HAS_FREE_LINES (0x0001): there is at least one LP_UNUSED slot before pd_lower. Set/cleared by PageAddItemExtended and PageRepairFragmentation. It is an unlogged hint: if wrong, the code falls back to a linear scan of the line-pointer array.
- PD_PAGE_FULL (0x0002): an UPDATE found insufficient free space. A prune pass is warranted. Also a hint; not WAL-logged.
- PD_ALL_VISIBLE (0x0004): all tuples on this page are visible to every transaction (no dead versions remain). It mirrors the page’s all-visible bit in the visibility map, which is what lets index-only scans skip per-tuple visibility checks.
pd_prune_xid is a hint that drives pruning decisions. When a tuple
is deleted by transaction xid, PageSetPrunable(page, xid) lowers
pd_prune_xid to xid if xid is older than the value already recorded
(or if none is recorded), so the field tracks the oldest such XID seen
so far on this page. VACUUM and the heap pruner check pd_prune_xid to
decide whether any dead versions on the page are old enough to be removed.
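A minimal sketch of the "keep the oldest XID" rule follows. ToyXid, toy_xid_precedes, and toy_set_prunable are hypothetical names; the comparison uses the usual modulo-2^32 signed-difference trick and, as a simplifying assumption, ignores the special low-numbered XIDs that PostgreSQL treats separately.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ToyXid;
#define TOY_INVALID_XID 0u

/* Circular comparison for normal XIDs: a is older than b when the
 * signed 32-bit difference is negative. */
static int toy_xid_precedes(ToyXid a, ToyXid b)
{
    return (int32_t) (a - b) < 0;
}

/* Keep the oldest XID seen so far as the page's prune hint. */
static void toy_set_prunable(ToyXid *prune_xid, ToyXid xid)
{
    assert(xid != TOY_INVALID_XID);
    if (*prune_xid == TOY_INVALID_XID || toy_xid_precedes(xid, *prune_xid))
        *prune_xid = xid;
}
```

Because the comparison is circular, an XID just below the wraparound point still counts as older than a small XID allocated after wraparound.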
Line pointers: ItemIdData
The pd_linp[] array grows forward from byte offset
SizeOfPageHeaderData. Each slot is a 32-bit word packed as three
bit-fields:

```c
// ItemIdData: src/include/storage/itemid.h
typedef struct ItemIdData
{
    unsigned    lp_off:15,      /* byte offset to item body, from page start */
                lp_flags:2,     /* state: LP_UNUSED/LP_NORMAL/LP_REDIRECT/LP_DEAD */
                lp_len:15;      /* byte length of item body */
} ItemIdData;
```

The 15-bit fields cap item offsets and lengths at 32767, which is why
PostgreSQL pages are limited to 32 KB (BLCKSZ ≤ 32768). Four
line-pointer states exist:
| State | Value | Meaning |
|---|---|---|
| LP_UNUSED | 0 | Free slot; lp_len is always 0. Reusable. |
| LP_NORMAL | 1 | Active item; lp_off and lp_len are valid. |
| LP_REDIRECT | 2 | HOT redirect; lp_off holds the offset number of the next slot in the HOT chain, not a byte offset. lp_len is 0. |
| LP_DEAD | 3 | Dead; tuple body may or may not still be present. Set by heap pruning; cleared to LP_UNUSED by VACUUM. |
The LP_REDIRECT state supports Heap Only Tuples (HOT). When an UPDATE
does not change any indexed column and the page has room, the new tuple
version is placed on the same page without creating a new index entry,
and the old version’s t_ctid links to it. When pruning later removes the
dead root version, its line pointer is flipped to LP_REDIRECT with
lp_off holding the slot number of the first surviving chain member.
Index scans follow the chain on the heap page entirely in memory,
without re-entering the index. This keeps index size stable for
heavily-updated tables.
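To make the double meaning of lp_off concrete, here is a small sketch that resolves an offset number through at most one redirect. ToyItemId and toy_resolve are hypothetical names that mirror the bit-field layout above, and the single-hop limit is an assumption of this sketch rather than a statement about the PostgreSQL code.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_LP_UNUSED = 0, TOY_LP_NORMAL = 1, TOY_LP_REDIRECT = 2, TOY_LP_DEAD = 3 };

/* Same field widths as the line pointer described in the text. */
typedef struct {
    unsigned off:15, flags:2, len:15;
} ToyItemId;

/*
 * Resolve a 1-based offset number to the offset number of a NORMAL slot,
 * following at most one redirect hop. Returns 0 if no live item is found.
 */
static unsigned toy_resolve(const ToyItemId *linp, unsigned nslots, unsigned offnum)
{
    if (offnum < 1 || offnum > nslots)
        return 0;
    const ToyItemId *lp = &linp[offnum - 1];
    if (lp->flags == TOY_LP_REDIRECT) {
        offnum = lp->off;           /* for a redirect, off is a slot number */
        if (offnum < 1 || offnum > nslots)
            return 0;
        lp = &linp[offnum - 1];
    }
    return lp->flags == TOY_LP_NORMAL ? offnum : 0;
}
```

An index entry that still names the redirected slot therefore lands on the live tuple without the index having been touched.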
PageGetMaxOffsetNumber derives the number of allocated line pointers
from pd_lower:
```c
// PageGetMaxOffsetNumber: src/include/storage/bufpage.h
static inline OffsetNumber
PageGetMaxOffsetNumber(const PageData *page)
{
    const PageHeaderData *pageheader = (const PageHeaderData *) page;

    if (pageheader->pd_lower <= SizeOfPageHeaderData)
        return 0;
    else
        return (pageheader->pd_lower - SizeOfPageHeaderData) / sizeof(ItemIdData);
}
```

PageGetItemId(page, offsetNumber) returns a pointer to
pd_linp[offsetNumber - 1]; offset numbers are 1-based by convention.
Tuple identifier: ItemPointerData
An ItemPointerData (also known as a TID) is the 6-byte cross-page address of
one heap tuple:

```c
// ItemPointerData: src/include/storage/itemptr.h
typedef struct ItemPointerData
{
    BlockIdData ip_blkid;       /* block number (4 bytes: hi+lo uint16 pair) */
    OffsetNumber ip_posid;      /* line-pointer offset number (1-based) */
} ItemPointerData;
```

BlockIdData stores the block number as two uint16 values (bi_hi,
bi_lo) to avoid 4-byte alignment padding, which keeps the TID at 6 bytes;
BlockIdGetBlockNumber reassembles them. The largest valid block
number is 2^32 - 2 (0xFFFFFFFF is reserved as InvalidBlockNumber), giving
a per-relation ceiling of about 32 TB at BLCKSZ = 8192.
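The hi/lo split and the size ceiling are easy to check with a short sketch. ToyBlockId, ToyTid, and the helper names are hypothetical; they imitate the layout described above rather than reproduce the PostgreSQL definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Block number stored as two 16-bit halves, so the struct is only 2-byte aligned. */
typedef struct { uint16_t hi; uint16_t lo; } ToyBlockId;

/* A toy TID: 4 bytes of block id plus a 2-byte offset number. */
typedef struct { ToyBlockId blkid; uint16_t posid; } ToyTid;

/* Split a 32-bit block number into its halves. */
static ToyBlockId toy_block_id_set(uint32_t blkno)
{
    ToyBlockId b = { (uint16_t) (blkno >> 16), (uint16_t) (blkno & 0xFFFFu) };
    return b;
}

/* Reassemble the 32-bit block number. */
static uint32_t toy_block_id_get(ToyBlockId b)
{
    return ((uint32_t) b.hi << 16) | b.lo;
}

/* Bytes spanned by 2^32 blocks of the given size (an upper bound on relation size). */
static uint64_t toy_relation_ceiling(uint32_t block_size)
{
    return ((uint64_t) 1 << 32) * block_size;
}
```

At a block size of 8192 the ceiling is 2^45 bytes, that is, 32 TiB; the reserved invalid block number reduces the usable range by one block.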
TIDs serve two roles. In indexes, each leaf entry carries the TID of
the corresponding heap tuple as the lookup payload. In the heap itself,
HeapTupleHeaderData.t_ctid holds the TID of this tuple or of its
successor version: if no newer version exists, t_ctid points back to the
tuple itself; if the tuple was updated, t_ctid points forward to the
new version. Code that needs a later version follows t_ctid chains,
applying the visibility tests in heapam_visibility.c to each version
to find the right one for a given snapshot.
Item area: tuple bodies grow from pd_upper downward
```c
// PageAddItemExtended (condensed): src/backend/storage/page/bufpage.c
OffsetNumber
PageAddItemExtended(Page page, Item item, Size size,
                    OffsetNumber offsetNumber, int flags)
{
    PageHeader  phdr = (PageHeader) page;

    // ... condensed: validate offsets, find free slot or extend array ...
    alignedSize = MAXALIGN(size);
    upper = (int) phdr->pd_upper - (int) alignedSize;
    if (lower > upper)
        return InvalidOffsetNumber;         /* page full */

    ItemIdSetNormal(itemId, upper, size);   /* record byte offset + length */
    memcpy((char *) page + upper, item, size);
    phdr->pd_lower = (LocationIndex) lower;
    phdr->pd_upper = (LocationIndex) upper;
    return offsetNumber;
}
```

Each insert decrements pd_upper by MAXALIGN(size) and, when a new slot
is appended rather than an LP_UNUSED slot reused, increments pd_lower by
sizeof(ItemIdData). The page is full when lower > upper after accounting
for both changes. MAXALIGN ensures that
item bodies start on an 8-byte (or platform-specific) boundary, which
matters for zero-copy access to numeric fields inside tuples.
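The space test can be worked through numerically. The sketch below assumes 8-byte maximum alignment and a 4-byte line pointer, and it covers only the case where a new slot must be appended; TOY_MAXALIGN, TOY_ITEMID_SIZE, and toy_item_fits are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ITEMID_SIZE 4u                                /* one line pointer */
#define TOY_MAXALIGN(len) (((len) + 7u) & ~(size_t) 7u)   /* assumes 8-byte alignment */

/*
 * True if an item of `size` bytes plus one new line pointer fits in the
 * gap between pd_lower and pd_upper.
 */
static bool toy_item_fits(size_t pd_lower, size_t pd_upper, size_t size)
{
    assert(pd_lower <= pd_upper);
    size_t need = TOY_ITEMID_SIZE + TOY_MAXALIGN(size);
    return pd_upper - pd_lower >= need;
}
```

On an empty 8192-byte heap page (pd_lower = 24, pd_upper = 8192) the gap is 8168 bytes, so an 8160-byte item fits (8160 + 4 = 8164), while an 8161-byte item does not, because alignment rounds it up to 8168 and the line pointer pushes the total to 8172.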
Fragmentation and compaction: PageRepairFragmentation
After VACUUM or heap pruning marks dead tuples as LP_UNUSED or
LP_DEAD and clears their bodies, gaps appear in the item area. The
pd_upper pointer cannot be bumped back without a compaction pass.
PageRepairFragmentation collects all live item bodies, calls
compactify_tuples to slide them together toward pd_special, and
resets pd_upper. The line-pointer slots are not moved — their
lp_off values are updated to point to the new positions. Trailing
LP_UNUSED slots at the end of the pd_linp[] array are also
truncated to shrink pd_lower:
```c
// PageRepairFragmentation (condensed): src/backend/storage/page/bufpage.c
void
PageRepairFragmentation(Page page)
{
    // ... condensed: collect live items into itemidbase[], check for corruption ...
    compactify_tuples(itemidbase, nstorage, page, presorted);

    // truncate trailing unused line pointers from pd_linp[]
    if (finalusedlp != nline)
        ((PageHeader) page)->pd_lower -= (sizeof(ItemIdData) * nunusedend);

    // set PD_HAS_FREE_LINES hint
}
```

The presorted fast path (items already in descending lp_off order,
which is the insertion order) uses memmove in one pass. The unsorted
path copies items into a temporary buffer then writes them back in
order. Callers must hold a cleanup lock on the buffer.
Page initialization: PageInit
```c
// PageInit: src/backend/storage/page/bufpage.c
void
PageInit(Page page, Size pageSize, Size specialSize)
{
    PageHeader  p = (PageHeader) page;

    specialSize = MAXALIGN(specialSize);
    MemSet(p, 0, pageSize);                     /* zero the whole page first */
    p->pd_flags = 0;
    p->pd_lower = SizeOfPageHeaderData;         /* no line pointers yet */
    p->pd_upper = pageSize - specialSize;       /* item area ceiling */
    p->pd_special = pageSize - specialSize;     /* special-space floor */
    PageSetPageSizeAndVersion(page, pageSize, PG_PAGE_LAYOUT_VERSION);
    /* pd_prune_xid zeroed by MemSet */
}
```

After PageInit, pd_upper and pd_special are both pageSize - specialSize,
while pd_lower is SizeOfPageHeaderData, so the whole span between the
header and the special space is free. A freshly initialized page would
have pd_lower == pd_upper only if specialSize == pageSize - SizeOfPageHeaderData,
which would be degenerate. PageIsEmpty checks pd_lower <= SizeOfPageHeaderData.
Page verification: PageIsVerified
When the buffer manager reads a page from disk, it calls
PageIsVerified before admitting it to the shared pool:

```c
// PageIsVerified (condensed): src/backend/storage/page/bufpage.c
bool
PageIsVerified(PageData *page, BlockNumber blkno, int flags,
               bool *checksum_failure_p)
{
    if (!PageIsNew(page))
    {
        if (DataChecksumsEnabled())
        {
            checksum = pg_checksum_page(page, blkno);
            if (checksum != p->pd_checksum)
                checksum_failure = true;
        }

        if ((p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
            p->pd_lower <= p->pd_upper &&
            p->pd_upper <= p->pd_special &&
            p->pd_special <= BLCKSZ &&
            p->pd_special == MAXALIGN(p->pd_special))
            header_sane = true;

        if (header_sane && !checksum_failure)
            return true;
    }

    /* all-zeros page is acceptable (crashed extension of relation) */
    if (pg_memory_is_all_zeros((size_t *) page, BLCKSZ))
        return true;

    return false;
}
```

Two independent checks run in sequence: the checksum (if enabled) and a
structural sanity check on pd_flags and the three offset fields.
All-zeros pages pass unconditionally: they arise when a backend extends
a relation and crashes before writing any WAL for the new block; VACUUM
will clean them up. The function accepts PIV_LOG_WARNING, PIV_LOG_LOG, and
PIV_IGNORE_CHECKSUM_FAILURE flags to control error reporting behavior.
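The structural half of the check can be exercised on its own. In this sketch ToyHeader carries only the four fields that the test reads, TOY_VALID_FLAG_BITS is assumed to be 0x0007 (the three hint bits listed earlier), and 8-byte alignment is assumed; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_BLCKSZ 8192u
#define TOY_VALID_FLAG_BITS 0x0007u                 /* assumed: the three hint bits */
#define TOY_MAXALIGN(x) (((x) + 7u) & ~7u)          /* assumes 8-byte alignment */

/* Only the header fields that the sanity test inspects. */
typedef struct { uint16_t flags, lower, upper, special; } ToyHeader;

/* Mirrors the ordering and alignment conditions of the header sanity test. */
static bool toy_header_sane(const ToyHeader *h)
{
    unsigned special = h->special;

    return (h->flags & ~TOY_VALID_FLAG_BITS) == 0 &&
           h->lower <= h->upper &&
           h->upper <= h->special &&
           special <= TOY_BLCKSZ &&
           special == TOY_MAXALIGN(special);
}
```

A page that fails this test is rejected even when checksums are disabled, which is why the header check is useful as a cheap, always-on filter.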
Checksum: 16-bit fold-and-XOR over block-number-perturbed data
The pd_checksum field holds a 16-bit value computed by
pg_checksum_page (declared in checksum.h, implemented in
checksum.c). The algorithm treats the 8 KB page as rows of 32 uint32
words (128 bytes per row), mixes in the block number, and folds the
result down to 16 bits. The block number perturbation means the same
page bytes stored at block 0 and block 1000 produce different checksums,
so misrouted block writes are detected.
PageSetChecksumCopy (used by buffer flush) makes a copy of the page
before computing the checksum to exclude concurrent hint-bit updates
from affecting the final value. PageSetChecksumInplace (used when no
concurrent modification is possible, e.g., during pg_checksums) writes
directly into the page. The README for storage/page/ states the
invariant explicitly: the checksum is valid only when the page enters
or leaves the shared pool, not while it is resident and potentially
being modified by hint bit updates.
Both setters share the same two short-circuits — a page that
PageIsNew (never initialized, all zero) or a cluster with
!DataChecksumsEnabled() skips checksum computation entirely:
```c
// PageSetChecksumCopy: src/backend/storage/page/bufpage.c
char *
PageSetChecksumCopy(Page page, BlockNumber blkno)
{
    static char *pageCopy = NULL;

    if (PageIsNew(page) || !DataChecksumsEnabled())
        return page;                /* nothing to do */

    if (pageCopy == NULL)
        pageCopy = MemoryContextAllocAligned(TopMemoryContext, BLCKSZ,
                                             PG_IO_ALIGN_SIZE, 0);

    memcpy(pageCopy, page, BLCKSZ);
    ((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
    return pageCopy;
}
```

The static pageCopy is allocated once per backend, aligned to
PG_IO_ALIGN_SIZE, and reused on every flush; the caller must write the
returned pointer immediately and not retain it. PageSetChecksumInplace
is the same minus the copy: it writes pg_checksum_page(page, blkno)
straight into pd_checksum.
The checksum algorithm: parallel FNV-1a fold
pg_checksum_page lives in src/backend/storage/page/checksum.c, but
that file is a one-line shell — it #includes storage/checksum_impl.h,
where the real algorithm sits. The indirection lets external tools
(pg_checksums, pg_upgrade, pg_basebackup) embed the exact same code
by including one exported header. The algorithm is a vectorizable variant
of FNV-1a (Fowler/Noll/Vo) tuned so the compiler can auto-SIMD it; on
read-heavy workloads where the working set fits in OS cache but not shared
buffers, the checksum is the dominant cost, so speed is the design driver.
The page is viewed as a two-dimensional uint32 array of N_SUMS = 32
columns. Each column is folded independently into its own partial sum
seeded from a distinct random offset-basis constant, which is what lets a
SIMD unit run all 32 lanes in parallel:
```c
// checksum constants and one fold round: src/include/storage/checksum_impl.h
#define N_SUMS 32
#define FNV_PRIME 16777619

typedef union
{
    PageHeaderData phdr;
    uint32      data[BLCKSZ / (sizeof(uint32) * N_SUMS)][N_SUMS];
} PGChecksummablePage;

#define CHECKSUM_COMP(checksum, value) \
do { \
    uint32 __tmp = (checksum) ^ (value); \
    (checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17); \
} while (0)
```

The >> 17 xor-shift is PostgreSQL’s addition to plain FNV-1a: vanilla
FNV mixes high-order input bits only into high-order output bits, so the
shift folds high bits back down. pg_checksum_block seeds the 32 lanes
from checksumBaseOffsets[], runs CHECKSUM_COMP across every row, adds
two trailing rounds of zero to avalanche the final row, then xor-folds
the 32 partial sums into one 32-bit result:
```c
// pg_checksum_block (condensed): src/include/storage/checksum_impl.h
static uint32
pg_checksum_block(const PGChecksummablePage *page)
{
    uint32      sums[N_SUMS];
    uint32      result = 0,
                i,
                j;

    memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets));

    for (i = 0; i < (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
        for (j = 0; j < N_SUMS; j++)
            CHECKSUM_COMP(sums[j], page->data[i][j]);

    for (i = 0; i < 2; i++)         /* two zero rounds to mix last row */
        for (j = 0; j < N_SUMS; j++)
            CHECKSUM_COMP(sums[j], 0);

    for (i = 0; i < N_SUMS; i++)
        result ^= sums[i];

    return result;
}
```

pg_checksum_page wraps the block hash with two page-specific steps. It
transiently zeroes pd_checksum so the stored value never feeds its own
recomputation, then restores it (updating the field is the caller’s job,
not this function’s). After the block hash it xors in blkno, which is
what makes a page written to the wrong sector fail verification. Finally
it reduces to 16 bits with (checksum % 65535) + 1; the + 1 offset
guarantees the result is never zero, reserving 0 as the “checksums never
written” sentinel:
```c
// pg_checksum_page (condensed): src/include/storage/checksum_impl.h
uint16
pg_checksum_page(char *page, BlockNumber blkno)
{
    PGChecksummablePage *cpage = (PGChecksummablePage *) page;
    uint16      save_checksum;
    uint32      checksum;

    save_checksum = cpage->phdr.pd_checksum;
    cpage->phdr.pd_checksum = 0;    /* exclude stored checksum */
    checksum = pg_checksum_block(cpage);
    cpage->phdr.pd_checksum = save_checksum;

    checksum ^= blkno;              /* detect transposed pages */
    return (uint16) ((checksum % 65535) + 1);
}
```

Note that the PGChecksummablePage union overlays PageHeaderData on the
uint32 grid, so the header (including pd_lsn, pd_flags, and the three
offset fields) is covered by the checksum, but the pd_checksum
field itself is excluded by the transient zeroing. The
Assert(sizeof(PGChecksummablePage) == BLCKSZ) in pg_checksum_block
enforces that the grid exactly tiles the block; BLCKSZ must be a
multiple of 4 * N_SUMS = 128 bytes, which every supported BLCKSZ
(powers of two from 1 KB to 32 KB) satisfies.
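The structure of the algorithm can be tried outside the server with a standalone sketch. The version below is an illustration under stated assumptions, not a compatible reimplementation: the lane seeds are made-up values (not checksumBaseOffsets[]), so its results will not match PostgreSQL’s; it works on a private copy instead of transiently zeroing the field; and it assumes pd_checksum occupies bytes 8 and 9, right after the 8-byte pd_lsn. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 8192u
#define TOY_N_SUMS 32u
#define TOY_FNV_PRIME 16777619u
#define TOY_ROWS (TOY_BLCKSZ / (sizeof(uint32_t) * TOY_N_SUMS))
#define TOY_CHECKSUM_OFFSET 8u      /* assumed position of pd_checksum */

/* One fold round: FNV-1a step plus the >> 17 xor-shift. */
static uint32_t toy_comp(uint32_t sum, uint32_t value)
{
    uint32_t tmp = sum ^ value;
    return tmp * TOY_FNV_PRIME ^ (tmp >> 17);
}

static uint16_t toy_checksum_page(const uint8_t *page, uint32_t blkno)
{
    uint8_t  copy[TOY_BLCKSZ];
    uint32_t sums[TOY_N_SUMS];
    uint32_t result = 0;

    memcpy(copy, page, TOY_BLCKSZ);
    memset(copy + TOY_CHECKSUM_OFFSET, 0, 2);   /* exclude the stored checksum */

    for (uint32_t j = 0; j < TOY_N_SUMS; j++)
        sums[j] = 0x9E3779B9u * (j + 1);        /* hypothetical seeds */

    /* Fold every row: lane j consumes word j of each 128-byte row. */
    for (size_t i = 0; i < TOY_ROWS; i++)
        for (uint32_t j = 0; j < TOY_N_SUMS; j++) {
            uint32_t word;
            memcpy(&word, copy + (i * TOY_N_SUMS + j) * sizeof(word), sizeof(word));
            sums[j] = toy_comp(sums[j], word);
        }

    /* Two extra rounds of zero so the last row is mixed as well. */
    for (int round = 0; round < 2; round++)
        for (uint32_t j = 0; j < TOY_N_SUMS; j++)
            sums[j] = toy_comp(sums[j], 0);

    for (uint32_t j = 0; j < TOY_N_SUMS; j++)
        result ^= sums[j];

    result ^= blkno;                            /* tie the value to the block number */
    return (uint16_t) ((result % 65535u) + 1u); /* 1..65535, never 0 */
}
```

Two properties follow directly from the construction: the result is never zero, and the bytes stored in the checksum field do not influence it. Adjacent block numbers always give different results for the same page bytes, because the two 32-bit values differ by exactly one before the modulo reduction.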
Checksum read/write flow
```mermaid
flowchart TD
  subgraph WRITE["Write path: page leaving shared pool"]
    W0["FlushBuffer / SyncOneBuffer"]
    W1{"PageIsNew or<br/>!DataChecksumsEnabled?"}
    W2["return page as-is<br/>(no checksum)"]
    W3["PageSetChecksumCopy:<br/>memcpy page to aligned pageCopy"]
    W4["pg_checksum_page(pageCopy, blkno)"]
    W5["write pd_checksum into copy"]
    W6["smgrwrite() the copy to disk"]
    W0 --> W1
    W1 -->|yes| W2
    W1 -->|no| W3 --> W4 --> W5 --> W6
  end
  subgraph CKSUM["pg_checksum_page internals"]
    C1["save pd_checksum, set field = 0"]
    C2["pg_checksum_block:<br/>32 parallel FNV-1a lanes<br/>over uint32 grid"]
    C3["two zero rounds + xor-fold to uint32"]
    C4["checksum ^= blkno"]
    C5["restore pd_checksum<br/>return (checksum % 65535) + 1"]
    C1 --> C2 --> C3 --> C4 --> C5
  end
  subgraph READ["Read path: page entering shared pool"]
    R0["smgrread() block from disk"]
    R1["PageIsVerified(page, blkno, ...)"]
    R2{"DataChecksumsEnabled?"}
    R3["computed = pg_checksum_page(page, blkno)"]
    R4{"computed == pd_checksum?"}
    R5["header sanity:<br/>pd_lower<=pd_upper<=pd_special<=BLCKSZ"]
    R6["admit page to buffer pool"]
    R7{"page all zeros?"}
    R8["accept (crashed relation extension)"]
    R9["checksum_failure: ERROR / WARNING<br/>per PIV_* flags"]
    R0 --> R1 --> R2
    R2 -->|yes| R3 --> R4
    R2 -->|no| R5
    R4 -->|yes| R5
    R4 -->|no| R7
    R5 --> R6
    R7 -->|yes| R8
    R7 -->|no| R9
  end
  W4 -.calls.-> C1
  R3 -.calls.-> C1
```
Figure 2 — Checksum lifecycle. The write path computes the checksum on a
private copy as the page leaves the shared pool; the read path
recomputes and compares as the page enters. Both call the same
pg_checksum_page (center), which xors in blkno so a transposed page
fails. An all-zeros page bypasses the failure branch because a crashed
relation-extension legitimately leaves zero blocks.
Page layout diagram
```mermaid
flowchart TD
  A["PageHeaderData<br/>(24 bytes fixed)"]
  B["pd_linp[0]..pd_linp[N-1]<br/>ItemIdData array<br/>grows ↓ toward pd_upper"]
  C["Free space gap<br/>(pd_lower .. pd_upper)"]
  D["Item bodies<br/>(tuples / index entries)<br/>packed ↑ from pd_upper"]
  E["Special space<br/>(0 bytes for heap,<br/>AM-opaque for indexes)"]
  A --> B --> C --> D --> E
```
Figure 1 — The four regions of a PostgreSQL page. pd_lower points to
the next free slot in the line-pointer array; pd_upper points to the
top of the item area. Insertions advance pd_lower up and pd_upper
down. pd_special is fixed at PageInit time.
The diagram above lists the regions in storage order; the next one shows
the same page as a byte-addressed map, making the two-pointer “growing
toward each other” discipline explicit. The header offsets are the only
metadata needed to locate any region: pd_lower is the high-water mark
of the line-pointer array (slots grow up from SizeOfPageHeaderData),
pd_upper is the low-water mark of the item area (bodies grow down from
pd_special), and the gap between them is the only free space. The page
is full when pd_lower + sizeof(ItemIdData) > pd_upper - MAXALIGN(size).
```mermaid
flowchart TB
  subgraph PAGE["One BLCKSZ page (default 8192 bytes), byte offset 0 at top"]
    direction TB
    H["offset 0 .. SizeOfPageHeaderData (24)<br/>PageHeaderData<br/>pd_lsn | pd_checksum | pd_flags<br/>pd_lower | pd_upper | pd_special<br/>pd_pagesize_version | pd_prune_xid"]
    LP["pd_linp[0] (offnum 1)<br/>pd_linp[1] (offnum 2)<br/>...<br/>pd_linp[N-1] (offnum N)<br/>ItemIdData array: slots grow DOWN ↓<br/>each = lp_off:15 lp_flags:2 lp_len:15"]
    FREE["FREE SPACE<br/>PageGetFreeSpace =<br/>pd_upper - pd_lower - sizeof(ItemIdData)"]
    ITEM["tuple body N (newest, lowest offset)<br/>...<br/>tuple body 1 (oldest, highest offset)<br/>item bodies grow UP ↑, MAXALIGN-padded"]
    SP["Special space<br/>heap: 0 bytes (pd_special == BLCKSZ)<br/>btree: BTPageOpaqueData, etc."]
    H --> LP --> FREE --> ITEM --> SP
  end
  PL["pd_lower<br/>= 24 + N * sizeof(ItemIdData)"]
  PU["pd_upper<br/>= top of newest tuple body"]
  PS["pd_special<br/>= BLCKSZ - MAXALIGN(specialSize)"]
  PL -.points at start of.-> FREE
  PU -.points at end of.-> FREE
  PS -.points at start of.-> SP
```
Figure 1b — Byte map of a slotted page. The line-pointer array and the
item area grow toward each other from opposite ends; PageGetItemId
indexes the array by 1-based offset number, and PageGetItem follows a
slot’s lp_off into the item area. An ItemPointerData TID stored in an
index leaf names (blkno, offset number) — the offset number is the
array index here, which is why tuple-body compaction by
PageRepairFragmentation (rewriting lp_off) never invalidates an
external TID.
Heap tuple header layout
A heap tuple body begins with HeapTupleHeaderData:

```c
// HeapTupleHeaderData: src/include/access/htup_details.h
struct HeapTupleHeaderData
{
    union
    {
        HeapTupleFields t_heap;     /* xmin, xmax, cid/xvac */
        DatumTupleFields t_datum;   /* used when stored as a composite datum */
    }           t_choice;

    ItemPointerData t_ctid;         /* TID of this or newer tuple version */
    uint16      t_infomask2;        /* attribute count + flags */
    uint16      t_infomask;         /* HEAP_XMIN_COMMITTED, HEAP_XMAX_INVALID, … */
    uint8       t_hoff;             /* offset to user data (past nulls bitmap) */
    bits8       t_bits[FLEXIBLE_ARRAY_MEMBER]; /* nulls bitmap (optional) */
    /* user data follows at offset t_hoff */
};
```

HeapTupleFields carries t_xmin (inserting transaction), t_xmax
(deleting/locking transaction), and t_cid/t_xvac. The t_infomask
bits such as HEAP_XMIN_COMMITTED and HEAP_XMAX_INVALID are hint
bits: once set, they short-circuit visibility checks by caching the
commit status in the tuple header so the commit log (CLOG) need not be
consulted on repeated reads. Setting a hint bit dirties the page
without a WAL record of its own; when checksums (or wal_log_hints) are
enabled, the first such change to a page after a checkpoint must WAL-log
a full-page image (see the storage/page/README note on MarkBufferDirtyHint).
Source Walkthrough
Page header and layout primitives
- PageHeaderData (bufpage.h): the 24-byte fixed header; pd_lsn and the three offset fields (pd_lower, pd_upper, pd_special) live here.
- PageXLogRecPtr (bufpage.h): historical two-uint32 representation of an XLogRecPtr; PageXLogRecPtrGet / PageXLogRecPtrSet convert.
- PG_PAGE_LAYOUT_VERSION (bufpage.h): compile-time constant, value 4; packed into the low byte of pd_pagesize_version.
- SizeOfPageHeaderData (bufpage.h): offsetof(PageHeaderData, pd_linp), equals 24 bytes; the starting pd_lower after PageInit.
- PageInit (bufpage.c): zeros the page, sets pd_lower, pd_upper, pd_special, packs pd_pagesize_version.
- PageIsNew (bufpage.h): pd_upper == 0; true before PageInit.
- PageIsEmpty (bufpage.h): pd_lower <= SizeOfPageHeaderData; no slots allocated yet.
Line-pointer (ItemId) layer
- ItemIdData (itemid.h): 32-bit packed struct: lp_off:15, lp_flags:2, lp_len:15.
- LP_UNUSED / LP_NORMAL / LP_REDIRECT / LP_DEAD (itemid.h): the four line-pointer states; LP_REDIRECT is the HOT-chain pivot.
- PageGetItemId(page, offnum) (bufpage.h): returns &pd_linp[offnum-1].
- PageGetMaxOffsetNumber (bufpage.h): (pd_lower - SizeOfPageHeaderData) / sizeof(ItemIdData).
- PageGetItem(page, itemId) (bufpage.h): (char *)page + lp_off.
- ItemIdSetNormal / ItemIdSetRedirect / ItemIdSetDead / ItemIdSetUnused (itemid.h): the four state transitions.
Item insertion and space management
- PageAddItemExtended (bufpage.c): inserts an item: validates offsets, picks or allocates a slot, decrements pd_upper, increments pd_lower, copies bytes.
- PageGetFreeSpace (bufpage.c): pd_upper - pd_lower - sizeof(ItemIdData); returns 0 if below that threshold.
- PageGetHeapFreeSpace (bufpage.c): like PageGetFreeSpace but also enforces MaxHeapTuplesPerPage (derived from BLCKSZ and MinHeapTupleSize).
- PageRepairFragmentation (bufpage.c): full defrag: slides live item bodies together, truncates trailing LP_UNUSED slots, resets pd_upper.
- PageTruncateLinePointerArray (bufpage.c): called by VACUUM second pass; removes only trailing LP_UNUSED entries without moving item bodies.
- compactify_tuples (static, bufpage.c): inner loop of PageRepairFragmentation; presorted fast path (memmove) vs. unsorted copy-through-buffer path.
TID layer
- ItemPointerData (itemptr.h): 6-byte TID: ip_blkid + ip_posid.
- BlockIdData (block.h): bi_hi + bi_lo uint16 pair; assembled by BlockIdGetBlockNumber.
- ItemPointerSet / ItemPointerGet{BlockNumber,OffsetNumber} (itemptr.h): TID accessors.
- ItemPointerIsValid (itemptr.h): ip_posid != 0.
Verification and checksum
- PageIsVerified (bufpage.c): entry gate for pages entering the buffer pool from disk; checks header sanity + optional checksum.
- pg_checksum_page (checksum.h decl; checksum_impl.h body): 16-bit checksum over BLCKSZ bytes, parameterised by block number; transiently zeroes pd_checksum, xors in blkno, returns (checksum % 65535) + 1.
- pg_checksum_block (static, checksum_impl.h): the FNV-1a inner loop: 32 parallel partial sums seeded from checksumBaseOffsets[], two zero rounds, xor-fold to one uint32.
- CHECKSUM_COMP / FNV_PRIME / N_SUMS / PGChecksummablePage (checksum_impl.h): the fold macro, prime multiplier 16777619, lane count 32, and the union that overlays PageHeaderData on the grid.
- checksum.c (src/backend/storage/page): one-line shell that #includes checksum_impl.h so external tools can embed identical code.
- PageSetChecksumCopy (bufpage.c): makes a local copy, computes checksum, returns pointer to copy; used by buffer flush path.
- PageSetChecksumInplace (bufpage.c): writes checksum directly into the page; used by the pg_checksums tool and similar offline paths.
- DataChecksumsEnabled() (access/xlog.h): predicate reflecting the cluster-wide checksum setting stored in the control file; consulted by both verification and checksum-setting paths.
Position hints (as of 2026-06-05, commit 273fe94, REL_18_STABLE)
| Symbol | File | Line |
|---|---|---|
| `PageHeaderData` | `src/include/storage/bufpage.h` | 159 |
| `SizeOfPageHeaderData` | `src/include/storage/bufpage.h` | 218 |
| `PG_PAGE_LAYOUT_VERSION` | `src/include/storage/bufpage.h` | 207 |
| `PD_HAS_FREE_LINES` | `src/include/storage/bufpage.h` | 188 |
| `PD_PAGE_FULL` | `src/include/storage/bufpage.h` | 189 |
| `PD_ALL_VISIBLE` | `src/include/storage/bufpage.h` | 190 |
| `PageGetMaxOffsetNumber` | `src/include/storage/bufpage.h` | 372 |
| `PageGetItemId` | `src/include/storage/bufpage.h` | 245 |
| `PageGetItem` | `src/include/storage/bufpage.h` | 354 |
| `PageGetLSN` / `PageSetLSN` | `src/include/storage/bufpage.h` | 386–394 |
| `ItemIdData` | `src/include/storage/itemid.h` | 25 |
| `LP_UNUSED` / `LP_NORMAL` / `LP_REDIRECT` / `LP_DEAD` | `src/include/storage/itemid.h` | 38–41 |
| `ItemPointerData` | `src/include/storage/itemptr.h` | 36 |
| `PageInit` | `src/backend/storage/page/bufpage.c` | 42 |
| `PageIsVerified` | `src/backend/storage/page/bufpage.c` | 94 |
| `PageAddItemExtended` | `src/backend/storage/page/bufpage.c` | 193 |
| `PageRepairFragmentation` | `src/backend/storage/page/bufpage.c` | 698 |
| `PageTruncateLinePointerArray` | `src/backend/storage/page/bufpage.c` | 834 |
| `PageGetFreeSpace` | `src/backend/storage/page/bufpage.c` | 906 |
| `PageSetChecksumCopy` | `src/backend/storage/page/bufpage.c` | 1509 |
| `PageSetChecksumInplace` | `src/backend/storage/page/bufpage.c` | 1541 |
| `pg_checksum_page` (decl) | `src/include/storage/checksum.h` | 22 |
| `pg_checksum_page` (body) | `src/include/storage/checksum_impl.h` | 187 |
| `pg_checksum_block` | `src/include/storage/checksum_impl.h` | 146 |
| `N_SUMS` | `src/include/storage/checksum_impl.h` | 106 |
| `FNV_PRIME` | `src/include/storage/checksum_impl.h` | 108 |
| `checksumBaseOffsets` | `src/include/storage/checksum_impl.h` | 121 |
| `CHECKSUM_COMP` | `src/include/storage/checksum_impl.h` | 135 |
| `PGChecksummablePage` | `src/include/storage/checksum_impl.h` | 111 |
| `HeapTupleHeaderData` | `src/include/access/htup_details.h` | 153 |
| `SizeofHeapTupleHeader` | `src/include/access/htup_details.h` | 185 |
| `HEAP_XMIN_COMMITTED` | `src/include/access/htup_details.h` | 204 |
Source verification (as of 2026-06-05)
Verified facts
- `PageHeaderData` is 24 bytes on all current platforms. Confirmed by `SizeOfPageHeaderData = offsetof(PageHeaderData, pd_linp)` in `bufpage.h` at commit 273fe94. The struct ends at `pd_prune_xid` (offset 20, `TransactionId` = `uint32`), followed by the flexible array `pd_linp[]` at offset 24.
- Page layout version is 4 and has not changed since PostgreSQL 8.3. `PG_PAGE_LAYOUT_VERSION = 4` confirmed in `bufpage.h`. The comment history in the file traces versions 0–4, ending at 8.3.
- `LP_REDIRECT` carries a slot number, not a byte offset, in `lp_off`. Confirmed by `ItemIdGetRedirect(itemId)` in `itemid.h` and the HOT chain-following logic in `heapam.c`. Callers use `ItemIdGetRedirect` to obtain an `OffsetNumber`, then call `PageGetItemId` again with that number.
- `PageRepairFragmentation` is heap-only; index AMs use `PageIndexMultiDelete`. Confirmed by the comment at line 684 of `bufpage.c` (“This routine is usable for heap pages only, but see PageIndexMultiDelete”).
- All-zeros pages pass `PageIsVerified` unconditionally. Confirmed at line 142 of `bufpage.c`: the `pg_memory_is_all_zeros` check runs after the checksum/sanity path and returns `true` on an all-zero block. The comment explains the motivation: a crashed backend that extended a relation leaves a zero page.
- Checksum is computed on a private copy during buffer flush to exclude concurrent hint-bit writes. Confirmed in `PageSetChecksumCopy` at line 1509 of `bufpage.c` and the README note that “many or even most pages in shared buffers have invalid page checksums.”
- `pd_prune_xid` is unused in index pages. Confirmed by the comment in `bufpage.h` line 126: “It is currently unused in index pages.”
- `pg_checksum_page` is a parallel FNV-1a fold with a `>> 17` high-bit mixer. Confirmed directly in `checksum_impl.h` (the body `#include`d by the one-line `checksum.c`): `N_SUMS = 32` lanes, `FNV_PRIME = 16777619`, per-lane seeds in `checksumBaseOffsets[]`, two trailing zero rounds, xor-fold. `pg_checksum_page` zeroes `pd_checksum` transiently, xors in `blkno`, and returns `(checksum % 65535) + 1` so the field is never zero. This matches (and supersedes) the “modified FNV” wording in older docs: the modification is precisely the `^ ((hash ^ value) >> 17)` term.
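The 24-byte header size and the slot-number semantics of `LP_REDIRECT` can both be checked with a small model. The sketch below mirrors the field order described above; the `Sketch*` types, the `SK_LP_*` constants, and `sketch_resolve_slot` are hypothetical, and the bit-field layout assumes a typical compiler that packs the three fields into one 32-bit word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Line pointer: 15-bit offset (or slot number), 2-bit state, 15-bit length. */
typedef struct SketchItemId {
    unsigned lp_off:15, lp_flags:2, lp_len:15;
} SketchItemId;

enum { SK_LP_UNUSED = 0, SK_LP_NORMAL = 1, SK_LP_REDIRECT = 2, SK_LP_DEAD = 3 };

/* Field order follows the page header description in this document. */
typedef struct SketchPageHeader {
    struct { uint32_t xlogid; uint32_t xrecoff; } pd_lsn;
    uint16_t     pd_checksum;
    uint16_t     pd_flags;
    uint16_t     pd_lower;
    uint16_t     pd_upper;
    uint16_t     pd_special;
    uint16_t     pd_pagesize_version;
    uint32_t     pd_prune_xid;
    SketchItemId pd_linp[];  /* line-pointer array starts right after the header */
} SketchPageHeader;

static_assert(offsetof(SketchPageHeader, pd_prune_xid) == 20, "pd_prune_xid at 20");
static_assert(offsetof(SketchPageHeader, pd_linp) == 24, "header is 24 bytes");
static_assert(sizeof(SketchItemId) == 4, "line pointer is 4 bytes");

/*
 * Resolve a 1-based slot to the slot that holds tuple data, following at
 * most one redirect. Returns 0 if the slot is out of range or not usable.
 */
static unsigned sketch_resolve_slot(const SketchItemId *linp, unsigned nslots,
                                    unsigned slot)
{
    const SketchItemId *id;

    if (slot < 1 || slot > nslots)
        return 0;
    id = &linp[slot - 1];
    if (id->lp_flags == SK_LP_REDIRECT)
    {
        slot = id->lp_off;  /* a slot number here, not a byte offset */
        if (slot < 1 || slot > nslots)
            return 0;
        id = &linp[slot - 1];
    }
    return id->lp_flags == SK_LP_NORMAL ? slot : 0;
}
```

The second lookup in `sketch_resolve_slot` corresponds to the "call `PageGetItemId` again" step: the redirect yields another directory index, and only the final `LP_NORMAL` entry carries a byte offset into the page.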
Open questions
- `MaxHeapTuplesPerPage` derivation. The constant caps the line-pointer array for heap pages (`PAI_IS_HEAP` flag in `PageAddItemExtended`). Its exact formula, defined alongside `HeapTupleHeaderData` in `htup_details.h` and derived from `BLCKSZ` and the minimum per-tuple footprint, was not confirmed in this pass.
- `pd_checksum` field when checksums are disabled. The `bufpage.h` comment states “zero is a valid value for a checksum” and that pre-9.3 databases may have non-zero values from the old timelineid field. Whether `PageIsVerified` handles this ambiguity correctly in all upgrade paths was not confirmed from this analysis alone.
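For the first open question, the arithmetic can be sketched under an explicit assumption. The formula below, usable page bytes divided by the aligned tuple header plus one line pointer, is the commonly cited definition and has not been verified against commit 273fe94 in this note. The helper name, the 23-byte unaligned header size, and the 8-byte alignment are likewise assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 8-byte maximum alignment, as on common 64-bit platforms. */
#define SK_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)

/*
 * Upper bound on heap tuples per page, assuming the formula
 *   (BLCKSZ - page header) / (MAXALIGN(tuple header) + line pointer).
 */
static size_t sketch_max_heap_tuples(size_t blcksz)
{
    const size_t page_header  = 24;  /* SizeOfPageHeaderData, per this document */
    const size_t tuple_header = 23;  /* assumed SizeofHeapTupleHeader */
    const size_t item_id      = 4;   /* sizeof(ItemIdData) */

    return (blcksz - page_header) / (SK_MAXALIGN(tuple_header) + item_id);
}
```

Under these assumptions an 8192-byte block gives (8192 - 24) / (24 + 4) = 291 tuples. Each tuple is charged its aligned header plus one line pointer because a heap tuple with no user data still occupies both.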
Beyond PostgreSQL — Comparative Designs & Research Frontiers
- InnoDB’s page format: InnoDB uses a 16 KB page by default (configurable through `innodb_page_size`), with a `FIL_PAGE_LSN` header field and a separate page trailer carrying a checksum plus the low-order bytes of the LSN. Comparing the header and trailer LSN copies gives InnoDB a torn-page check that does not depend on the checksum, which PostgreSQL’s single `pd_checksum` field does not provide by itself. A comparison of tear-detection strategies would illuminate PostgreSQL’s design choice.
- Variable-page-size experiments: PostgreSQL’s `BLCKSZ` is a compile-time constant. The `configure` option `--with-blocksize` accepts power-of-two values from 1 to 32 KB. Research on variable-page-size storage (e.g., for columnar or HTAP workloads) asks whether the page abstraction itself should be widened or replaced. The table access method API (`postgres-table-am.md`) is the extensibility hook that would make this possible without touching `bufpage.c`.
- “Beyond the Page” storage models: Column stores (cstore_fdw and its successor Citus columnar, DuckDB) abandon the slotted-page model entirely in favor of compressed column chunks. PostgreSQL’s Table AM API (PG12+) is the first step toward plugging such a model in-core. Stonebraker et al.’s “The End of an Architectural Era” (VLDB 2007) is the theoretical anchor for this direction.
- HOT chains and line-pointer indirection: The `LP_REDIRECT` mechanism was introduced in PostgreSQL 8.3 (Heap Only Tuples). Digging into the HOT design document (`src/backend/access/heap/README.HOT`) and comparing it with MySQL’s undo-segment approach to in-place updates would make a strong follow-up note under `postgres-heap-am.md`.
- Torn-page protection via full-page images in WAL: The `storage/page/README` notes that full-page writes in WAL protect against torn pages, and that `MarkBufferDirtyHint` must write a WAL record when checksums are on and the page is otherwise clean. The interaction between checksums, full-page writes, and hint-bit updates is the subject of a careful PostgreSQL commit trail worth documenting in `postgres-xlog-wal.md`.
Sources
Raw files consumed
(none; this document was synthesized directly from the source tree)
Source code paths (REL_18_STABLE, commit 273fe94)
- `src/backend/storage/page/bufpage.c`
- `src/backend/storage/page/checksum.c`
- `src/backend/storage/page/itemptr.c`
- `src/backend/storage/page/README`
- `src/include/storage/bufpage.h`
- `src/include/storage/itemid.h`
- `src/include/storage/itemptr.h`
- `src/include/storage/checksum.h`
- `src/include/storage/checksum_impl.h`
- `src/include/access/htup_details.h`
- `src/include/access/htup.h`
Textbook references
- Petrov, Database Internals (2019), ch. 3 §“Page Structure”
- Silberschatz et al., Database System Concepts (7th ed.), ch. 13 §“File Organization” (slotted page layout)
Related knowledge-base documents
- `knowledge/code-analysis/postgres/postgres-buffer-manager.md`: how pages enter and leave the shared pool; WAL-before-flush enforcement
- `knowledge/code-analysis/postgres/postgres-heap-am.md`: heap tuple layout, HOT chain mechanics, heap pruning
- `knowledge/code-analysis/postgres/postgres-mvcc-snapshots.md`: how `t_xmin` / `t_xmax` / `t_infomask` hint bits drive visibility
- `knowledge/code-analysis/postgres/postgres-xlog-wal.md`: full-page writes, checksum interaction with WAL