
PostgreSQL Page Layout — Slotted Page Format, ItemId Indirection, and Checksum Mechanics


Every disk-resident relational database must solve the same physical problem: it stores data on block devices that operate in fixed-size units (sectors, OS pages), yet the tuples the engine manipulates are variable-length and must be located in O(1) time by a logical address rather than a physical byte offset. Database Internals (Petrov, ch. 3 §“Page Structure”) frames this as a two-level addressing problem. The engine carves disk into fixed-size pages (also called blocks in PostgreSQL’s vocabulary), each page is the unit of I/O, and within a page variable-length records are managed by an indirection layer — a directory that maps logical slot numbers to physical byte offsets inside the page.

Two design axes follow immediately:

  1. How should items be addressed? A physical offset bakes the byte position into every reference and breaks the moment the page is compacted. A logical slot number (an index into the directory) lets the engine shuffle bytes freely on the page without invalidating any external pointer, as long as it keeps the directory up to date. PostgreSQL chooses logical slot numbers, which it calls line pointers or item identifiers.

  2. How should free space be managed within the page? The simplest scheme grows the item directory from one end and pushes item data from the other end, meeting in a free-space gap in the middle. This is the slotted page layout described in every major textbook (Database System Concepts, Silberschatz et al., ch. 13 §“File Organization”). It is simple, compact, and requires no per-slot free-list overhead.

For heap pages in PostgreSQL, a third design axis matters for MVCC (multi-version concurrency control): the on-page per-tuple header must carry enough transaction visibility information (t_xmin, t_xmax, t_ctid, t_infomask) for the visibility decision to be made without consulting any external structure in the common case. This is not universal across DBMSes (InnoDB, for example, stores undo log pointers in the tuple header and reconstructs older versions from a separate undo area), but it is central to PostgreSQL’s model, in which an UPDATE writes a new tuple version and links it to the old one through t_ctid rather than overwriting the old version in place.

Finally, a fourth concern in modern systems is silent data corruption: a bit flip in storage or a faulty HBA can deliver a page that is syntactically valid yet contains wrong data. A page checksum computed at write time and verified at read time detects this class of failure. PostgreSQL added optional per-page checksums in release 9.3; since PostgreSQL 12 they can be enabled or disabled on an offline cluster with pg_checksums, and since PostgreSQL 18 initdb enables them by default.

The slotted page layout has converged to essentially one shape across most disk-resident DBMSes. The engineering conventions are worth naming explicitly so that PostgreSQL’s specific choices read as one instantiation of a well-understood pattern.

Header → line-pointer array → free gap → item area


Every slotted page has four contiguous regions:

  1. Fixed-size header — page-level metadata: LSN or sequence number (for WAL), flags, and the three offsets (lower, upper, special) that delimit the three variable regions.
  2. Line-pointer array — grows forward (toward higher offsets) as items are inserted. Each entry is a small fixed-size slot: a byte offset to the item body, the item’s byte length, and a state flag.
  3. Free space gap — the unused region between pd_lower and pd_upper. The page is full when this gap is too small to hold a new line pointer plus a new item body.
  4. Item area — grows backward (toward lower offsets) from pd_upper. Items are packed tightly; MAXALIGN padding is applied to keep items aligned.

The discipline that emerges from this layout: line pointer slots are never moved once allocated (they are only re-used when marked UNUSED), but item bodies may be compacted by PageRepairFragmentation which slides them together and resets pd_upper. External references — ItemPointers stored in indexes — hold a (block, slot-number) pair and remain valid through compaction because the slot number is indirection, not a byte offset.

Indexes need per-page metadata that is not part of the item stream — the page’s level in the tree, sibling pointers, cycle-detection markers. The convention is a special space at the end of the page, carved out at PageInit time. Heap pages have zero special space; index AMs initialize it to their own opaque struct (BTPageOpaqueData for B-tree, etc.).

A tuple identifier (TID) is the durable logical address by which one page’s content refers to a tuple on a (possibly different) page. It holds a (block number, line-pointer offset) pair. Indexes store TIDs as their “payload” pointing into the heap. The heap itself uses TIDs in t_ctid to chain MVCC versions. A TID is the minimal cross-page reference: small enough to store in index leaves, stable under tuple compaction, and resolvable in one buffer-manager lookup.

The standard technique for storage integrity is a per-page checksum stored in the page header, computed at write time over the full page body (or a representative sample of it), and checked at read time. Computing it per-row instead would require touching every row header on every write and every read, which is too expensive. The checksum covers the whole block so the cost is one computation per block I/O. The checksum is typically tied to the block number to detect block relocation (a page written to the wrong sector) as well as bit-flip corruption.

| Concept | PostgreSQL name |
| --- | --- |
| Page / block | Page (PageData), BLCKSZ bytes (default 8192) |
| Page header | PageHeaderData (bufpage.h) |
| Lower free-space boundary | pd_lower |
| Upper free-space boundary | pd_upper |
| Special-space boundary | pd_special |
| WAL position of last change | pd_lsn (PageXLogRecPtr) |
| Line pointer / slot | ItemIdData (itemid.h) |
| Line-pointer state | LP_UNUSED / LP_NORMAL / LP_REDIRECT / LP_DEAD |
| Tuple identifier (TID) | ItemPointerData (itemptr.h) |
| Block number part of TID | ip_blkid (BlockIdData) |
| Offset-number part of TID | ip_posid (OffsetNumber) |
| Per-page checksum | pd_checksum; computed by pg_checksum_page |
| Page initialization | PageInit |
| Page verification | PageIsVerified |
| Item insertion | PageAddItemExtended |
| Tuple compaction | PageRepairFragmentation |

Every PostgreSQL page — heap page, index page, FSM page, VM page — begins with the same 24-byte PageHeaderData:

// PageHeaderData (src/include/storage/bufpage.h)
typedef struct PageHeaderData
{
    PageXLogRecPtr pd_lsn;              /* WAL position of last change */
    uint16         pd_checksum;         /* page checksum (optional) */
    uint16         pd_flags;            /* PD_HAS_FREE_LINES | PD_PAGE_FULL | PD_ALL_VISIBLE */
    LocationIndex  pd_lower;            /* offset to start of free space */
    LocationIndex  pd_upper;            /* offset to end of free space */
    LocationIndex  pd_special;          /* offset to start of special space */
    uint16         pd_pagesize_version; /* high byte = page size; low byte = layout version */
    TransactionId  pd_prune_xid;        /* oldest prunable XID on this page, or 0 */
    ItemIdData     pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
} PageHeaderData;

SizeOfPageHeaderData is defined as offsetof(PageHeaderData, pd_linp), which evaluates to 24 bytes on all current platforms. The line-pointer array begins immediately after this fixed header.

pd_lsn is the most important field for the buffer manager. When FlushBuffer is about to write a dirty page to disk, it calls XLogFlush(PageGetLSN(page)) to ensure the WAL is durably written up to at least the LSN of the last change to this page. This is how PostgreSQL physically enforces the WAL-before-data rule.

pd_pagesize_version packs page size and layout version into one uint16. Page sizes are multiples of 256, so the low byte of the size is always zero and the size can be stored directly in the high byte (BLCKSZ = 8192 is 0x2000); the low byte is the layout version number, currently PG_PAGE_LAYOUT_VERSION = 4 (unchanged since PostgreSQL 8.3). A default page therefore stores 0x2004. PageGetPageSize extracts size by masking with 0xFF00; PageGetPageLayoutVersion extracts version by masking with 0x00FF.
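The packing can be sketched in a few lines of standalone C. The helper names below (pack_size_version, unpack_size, unpack_version) are hypothetical stand-ins that operate on a bare 16-bit value; they are not the PostgreSQL macros, only an illustration of the masking described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: they mimic the masking described in the text,
 * but work on a plain uint16_t rather than on a real page header. */
static uint16_t pack_size_version(uint32_t size, uint8_t version)
{
    /* The size must be a multiple of 256 (and fit in 16 bits),
     * so that its low byte is free to hold the version. */
    assert((size & 0xFF00u) == size);
    return (uint16_t) (size | version);
}

static uint32_t unpack_size(uint16_t packed)
{
    return packed & 0xFF00u;            /* high byte: page size */
}

static uint8_t unpack_version(uint16_t packed)
{
    return (uint8_t) (packed & 0x00FFu); /* low byte: layout version */
}
```

With a page size of 8192 and layout version 4 the packed value is 0x2004, and the two unpack helpers recover 8192 and 4 respectively.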

pd_flags carries three hint bits:

  • PD_HAS_FREE_LINES (0x0001) — there is at least one LP_UNUSED slot before pd_lower. Set/cleared by PageAddItemExtended and PageRepairFragmentation. It is an unlogged hint: if wrong, the code falls back to a linear scan of the line-pointer array.
  • PD_PAGE_FULL (0x0002) — an UPDATE found insufficient free space. A prune pass is warranted. Also a hint; not WAL-logged.
  • PD_ALL_VISIBLE (0x0004) — all tuples on this page are visible to every transaction (no dead versions remain). Scans can skip per-tuple visibility checks on such a page. The page’s bit in the visibility map, which index-only scans consult to avoid heap fetches, may be set only while this flag is set.

pd_prune_xid is a hint that drives pruning decisions. When a tuple is deleted by transaction xid, PageSetPrunable(page, xid) stores xid in pd_prune_xid if the field is unset or if xid is older than the current value, so the field tracks the oldest such XID seen so far on this page. VACUUM and the heap pruner check pd_prune_xid to decide whether any dead versions on the page are old enough to be removed.

The pd_linp[] array grows forward from byte offset SizeOfPageHeaderData. Each slot is a 32-bit word packed as three bit-fields:

// ItemIdData (src/include/storage/itemid.h)
typedef struct ItemIdData
{
    unsigned    lp_off:15,      /* byte offset to item body, from page start */
                lp_flags:2,     /* state: LP_UNUSED/LP_NORMAL/LP_REDIRECT/LP_DEAD */
                lp_len:15;      /* byte length of item body */
} ItemIdData;

The 15-bit fields cap item offsets and lengths at 32767, which is why PostgreSQL pages are limited to 32 KB (BLCKSZ ≤ 32768). Four line-pointer states exist:

| State | Value | Meaning |
| --- | --- | --- |
| LP_UNUSED | 0 | Free slot; lp_len is always 0. Reusable. |
| LP_NORMAL | 1 | Active item; lp_off and lp_len are valid. |
| LP_REDIRECT | 2 | HOT redirect; lp_off holds the offset number of the next slot in the HOT chain, not a byte offset. lp_len is 0. |
| LP_DEAD | 3 | Dead; tuple body may or may not still be present. Set by heap pruning; cleared to LP_UNUSED by VACUUM. |

The LP_REDIRECT state is the implementation of Heap Only Tuples (HOT). When an UPDATE does not change any indexed column, the new tuple version is inserted on the same page without creating a new index entry. The old line pointer is flipped to LP_REDIRECT with lp_off pointing to the new version’s slot number. Index scans follow the redirect chain on the heap page entirely in memory, without re-entering the index. This keeps index size stable for heavily-updated tables.

PageGetMaxOffsetNumber derives the number of allocated line pointers from pd_lower:

// PageGetMaxOffsetNumber (src/include/storage/bufpage.h)
static inline OffsetNumber
PageGetMaxOffsetNumber(const PageData *page)
{
    const PageHeaderData *pageheader = (const PageHeaderData *) page;

    if (pageheader->pd_lower <= SizeOfPageHeaderData)
        return 0;
    else
        return (pageheader->pd_lower - SizeOfPageHeaderData) / sizeof(ItemIdData);
}

PageGetItemId(page, offsetNumber) returns a pointer to pd_linp[offsetNumber - 1]; offset numbers are 1-based by convention.
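The relationship between pd_lower, the slot count, and 1-based offset numbers can be checked with a small standalone model. ToyItemId, TOY_HEADER_SIZE, and the toy_ functions below are hypothetical names introduced for illustration; the sketch assumes the 24-byte header and the 15/2/15 bit-field layout described above, and a compiler that packs the three bit-fields into one 32-bit word (as mainstream compilers do).

```c
#include <assert.h>
#include <stdint.h>

#define TOY_HEADER_SIZE 24u     /* SizeOfPageHeaderData, per the text */

/* Hypothetical mirror of the line-pointer layout: 15 + 2 + 15 bits. */
typedef struct ToyItemId
{
    unsigned lp_off:15, lp_flags:2, lp_len:15;
} ToyItemId;

/* Number of allocated slots implied by a given pd_lower. */
static unsigned toy_max_offset_number(unsigned pd_lower)
{
    if (pd_lower <= TOY_HEADER_SIZE)
        return 0;
    return (pd_lower - TOY_HEADER_SIZE) / (unsigned) sizeof(ToyItemId);
}

/* Byte position, from page start, of the slot with this offset number. */
static unsigned toy_slot_byte_offset(unsigned offnum)
{
    assert(offnum >= 1);        /* offset numbers are 1-based */
    return TOY_HEADER_SIZE + (offnum - 1) * (unsigned) sizeof(ToyItemId);
}
```

A page whose pd_lower is 36 has three slots (offset numbers 1 to 3), located at bytes 24, 28, and 32.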

An ItemPointerData (a/k/a TID) is the 6-byte cross-page address of one heap tuple:

// ItemPointerData (src/include/storage/itemptr.h)
typedef struct ItemPointerData
{
    BlockIdData  ip_blkid;      /* block number (4 bytes: hi+lo uint16 pair) */
    OffsetNumber ip_posid;      /* line-pointer offset number (1-based) */
} ItemPointerData;

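BlockIdData stores the block number as two uint16 values (bi_hi, bi_lo) so that the struct needs only 2-byte alignment and ItemPointerData packs into 6 bytes without padding; BlockIdGetBlockNumber reassembles them. Block numbers are 32 bits wide, and 0xFFFFFFFF is reserved as InvalidBlockNumber, so the largest valid block number is 2^32 - 2, giving a per-relation ceiling of just under 32 TB at BLCKSZ = 8192.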
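A minimal sketch of the hi/lo split and of the size arithmetic follows. The Toy types and toy_ functions are hypothetical illustrations, not the PostgreSQL definitions; toy_relation_ceiling_bytes treats all 2^32 block numbers as usable, so it yields an upper bound on the relation size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirrors of the two-uint16 block id and the 6-byte TID. */
typedef struct ToyBlockId { uint16_t bi_hi; uint16_t bi_lo; } ToyBlockId;
typedef struct ToyItemPointer { ToyBlockId ip_blkid; uint16_t ip_posid; } ToyItemPointer;

static void toy_block_id_set(ToyBlockId *id, uint32_t blkno)
{
    id->bi_hi = (uint16_t) (blkno >> 16);       /* upper 16 bits */
    id->bi_lo = (uint16_t) (blkno & 0xFFFFu);   /* lower 16 bits */
}

static uint32_t toy_block_id_get(const ToyBlockId *id)
{
    return ((uint32_t) id->bi_hi << 16) | id->bi_lo;
}

/* Upper bound on relation size: 2^32 block numbers times the block size. */
static uint64_t toy_relation_ceiling_bytes(uint32_t blcksz)
{
    return ((uint64_t) 1 << 32) * blcksz;
}
```

Because every member is a uint16_t, the toy TID occupies 6 bytes with 2-byte alignment; a single uint32_t block number would typically force 4-byte alignment and pad the struct to 8 bytes.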

TIDs serve two roles. In indexes, each leaf entry carries the TID of the corresponding heap tuple as the lookup payload. In the heap itself, HeapTupleHeaderData.t_ctid holds the TID of this tuple or of its next newer version: if no newer version exists, t_ctid points back to the tuple itself; if the tuple was updated, t_ctid points forward to the new version. Code that needs a later version follows the t_ctid chain, applying the visibility rules in heapam_visibility.c at each step to find the right version for a given snapshot.

Item area: tuple bodies grow from pd_upper downward

// PageAddItemExtended, condensed (src/backend/storage/page/bufpage.c)
OffsetNumber
PageAddItemExtended(Page page, Item item, Size size, OffsetNumber offsetNumber, int flags)
{
    PageHeader  phdr = (PageHeader) page;

    // ... condensed: validate offsets, find free slot or extend array ...
    alignedSize = MAXALIGN(size);
    upper = (int) phdr->pd_upper - (int) alignedSize;
    if (lower > upper)
        return InvalidOffsetNumber;         /* page full */
    ItemIdSetNormal(itemId, upper, size);   /* record byte offset + length */
    memcpy((char *) page + upper, item, size);
    phdr->pd_lower = (LocationIndex) lower;
    phdr->pd_upper = (LocationIndex) upper;
    return offsetNumber;
}

Each insert decrements pd_upper by MAXALIGN(size) and increments pd_lower by sizeof(ItemIdData). The page is full when lower > upper after accounting for both increments. MAXALIGN ensures that item bodies start on an 8-byte (or platform-specific) boundary, which matters for zero-copy access to numeric fields inside tuples.
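The two-pointer arithmetic can be exercised on a toy model. Everything below (ToyPage, the 256-byte page size, the toy_ functions) is hypothetical and deliberately simplified: slots are kept in a separate array rather than inside the page bytes, slots are only ever appended, and MAXALIGN is assumed to be 8. It is a sketch of the bookkeeping, not of PageAddItemExtended itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE   256u    /* hypothetical; real pages are BLCKSZ bytes */
#define TOY_HEADER_SIZE 24u
#define TOY_SLOT_SIZE   4u      /* stands in for sizeof(ItemIdData) */
#define TOY_MAXALIGN(n) (((n) + 7u) & ~7u)   /* assumes 8-byte alignment */

typedef struct ToySlot { uint16_t off; uint16_t len; } ToySlot;

typedef struct ToyPage
{
    unsigned      lower;        /* models pd_lower */
    unsigned      upper;        /* models pd_upper */
    ToySlot       slots[(TOY_PAGE_SIZE - TOY_HEADER_SIZE) / TOY_SLOT_SIZE];
    unsigned char bytes[TOY_PAGE_SIZE];
} ToyPage;

static void toy_page_init(ToyPage *p)
{
    memset(p, 0, sizeof(*p));
    p->lower = TOY_HEADER_SIZE; /* no slots yet */
    p->upper = TOY_PAGE_SIZE;   /* no special space, as on a heap page */
}

/* Appends an item; returns its 1-based offset number, or 0 if the page is full. */
static unsigned toy_page_add_item(ToyPage *p, const void *item, unsigned size)
{
    unsigned lower = p->lower + TOY_SLOT_SIZE;       /* room for one more slot */
    unsigned aligned = TOY_MAXALIGN(size);
    unsigned nslots = (p->lower - TOY_HEADER_SIZE) / TOY_SLOT_SIZE;

    if (aligned > p->upper || lower > p->upper - aligned)
        return 0;                                    /* gap too small */
    p->upper -= aligned;                             /* item area grows down */
    memcpy(p->bytes + p->upper, item, size);
    p->slots[nslots].off = (uint16_t) p->upper;
    p->slots[nslots].len = (uint16_t) size;
    p->lower = lower;                                /* slot array grows up */
    return nslots + 1;
}
```

Inserting a 5-byte item into a fresh toy page moves lower from 24 to 28 and upper from 256 to 248, because the body is padded to 8 bytes. A failed insert leaves both pointers unchanged.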

Fragmentation and compaction: PageRepairFragmentation


After VACUUM or heap pruning marks dead tuples as LP_UNUSED or LP_DEAD and clears their bodies, gaps appear in the item area. The pd_upper pointer cannot be bumped back without a compaction pass. PageRepairFragmentation collects all live item bodies, calls compactify_tuples to slide them together toward pd_special, and resets pd_upper. The line-pointer slots are not moved — their lp_off values are updated to point to the new positions. Trailing LP_UNUSED slots at the end of the pd_linp[] array are also truncated to shrink pd_lower:

// PageRepairFragmentation, condensed (src/backend/storage/page/bufpage.c)
void
PageRepairFragmentation(Page page)
{
    // ... condensed: collect live items into itemidbase[], check for corruption ...
    compactify_tuples(itemidbase, nstorage, page, presorted);

    // truncate trailing unused line pointers from pd_linp[]
    if (finalusedlp != nline)
        ((PageHeader) page)->pd_lower -= (sizeof(ItemIdData) * nunusedend);

    // set PD_HAS_FREE_LINES hint
}

The presorted fast path (items already in descending lp_off order, which is the insertion order) uses memmove in one pass. The unsorted path copies items into a temporary buffer then writes them back in order. Callers must hold an exclusive cleanup lock on the buffer.
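The key property of compaction, that bodies move while slot numbers stay put, can be shown with a small model of the copy-through-buffer path. ToySlot, toy_compact, and the 64-byte page are hypothetical; the sketch assumes 8-byte MAXALIGN and omits the presorted memmove path, line-pointer truncation, and corruption checks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 64u       /* hypothetical tiny page, no special space */
#define TOY_MAXALIGN(n) (((n) + 7u) & ~7u)   /* assumes 8-byte alignment */

typedef struct ToySlot
{
    uint16_t off;               /* byte offset of the body within the page */
    uint16_t len;               /* body length in bytes */
    int      used;              /* 0 models LP_UNUSED */
} ToySlot;

/* Repack live bodies against the end of the page; returns the new upper bound. */
static unsigned toy_compact(unsigned char *page, ToySlot *slots, unsigned nslots)
{
    unsigned char scratch[TOY_PAGE_SIZE];
    unsigned      upper = TOY_PAGE_SIZE;
    unsigned      i;

    memcpy(scratch, page, TOY_PAGE_SIZE);   /* read bodies from a stable copy */
    for (i = 0; i < nslots; i++)
    {
        if (!slots[i].used)
            continue;                        /* the gap is squeezed out */
        upper -= TOY_MAXALIGN(slots[i].len);
        memcpy(page + upper, scratch + slots[i].off, slots[i].len);
        slots[i].off = (uint16_t) upper;     /* only the offset changes; index i does not */
    }
    return upper;
}
```

With three 4-byte bodies at offsets 56, 48, and 40 and the middle slot unused, compaction moves the third body from 40 to 48 and returns 48 as the new upper bound. A reference to slot index 2 still resolves to the same bytes afterwards.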

// PageInit (src/backend/storage/page/bufpage.c)
void
PageInit(Page page, Size pageSize, Size specialSize)
{
    PageHeader  p = (PageHeader) page;

    specialSize = MAXALIGN(specialSize);
    MemSet(p, 0, pageSize);                     /* zero the whole page first */
    p->pd_flags = 0;
    p->pd_lower = SizeOfPageHeaderData;         /* no line pointers yet */
    p->pd_upper = pageSize - specialSize;       /* item area ceiling */
    p->pd_special = pageSize - specialSize;     /* special-space floor */
    PageSetPageSizeAndVersion(page, pageSize, PG_PAGE_LAYOUT_VERSION);
    /* pd_prune_xid zeroed by MemSet */
}

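Immediately after PageInit, pd_upper == pd_special (both equal pageSize - specialSize) and pd_lower == SizeOfPageHeaderData, so everything between the header and the special space is free. A freshly initialized page could have pd_lower == pd_upper (empty, yet with zero free space) only if specialSize == pageSize - SizeOfPageHeaderData, which would be degenerate. PageIsEmpty does not compare the two pointers; it checks pd_lower <= SizeOfPageHeaderData.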

When the buffer manager reads a page from disk, it calls PageIsVerified before admitting it to the shared pool:

// PageIsVerified, condensed (src/backend/storage/page/bufpage.c)
bool
PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_failure_p)
{
    if (!PageIsNew(page))
    {
        if (DataChecksumsEnabled())
        {
            checksum = pg_checksum_page(page, blkno);
            if (checksum != p->pd_checksum)
                checksum_failure = true;
        }

        if ((p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
            p->pd_lower <= p->pd_upper &&
            p->pd_upper <= p->pd_special &&
            p->pd_special <= BLCKSZ &&
            p->pd_special == MAXALIGN(p->pd_special))
            header_sane = true;

        if (header_sane && !checksum_failure)
            return true;
    }

    /* all-zeros page is acceptable (crashed extension of relation) */
    if (pg_memory_is_all_zeros((size_t *) page, BLCKSZ))
        return true;

    return false;
}

Two independent checks run in sequence: the checksum (if enabled) and a structural sanity check on pd_flags and the three offset fields (pd_lower, pd_upper, pd_special). All-zeros pages pass unconditionally: they arise when a backend extends a relation and crashes before writing any WAL for the new block, and VACUUM will clean them up. The function accepts PIV_LOG_WARNING, PIV_LOG_LOG, and PIV_IGNORE_CHECKSUM_FAILURE flags to control error reporting behavior.
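The structural half of the check is simple enough to reproduce on its own. ToyHeader and toy_header_is_sane below are hypothetical; the sketch assumes BLCKSZ = 8192, 8-byte MAXALIGN, and that the valid flag mask is the union of the three pd_flags bits listed earlier (0x0007).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_BLCKSZ          8192u
#define TOY_VALID_FLAG_BITS 0x0007u  /* assumed: the three flag bits from the text */
#define TOY_MAXALIGN(n)     (((n) + 7u) & ~7u)   /* assumes 8-byte alignment */

/* Hypothetical subset of the page header: only the fields the check reads. */
typedef struct ToyHeader
{
    uint16_t pd_flags;
    uint16_t pd_lower;
    uint16_t pd_upper;
    uint16_t pd_special;
} ToyHeader;

static bool toy_header_is_sane(const ToyHeader *h)
{
    return (h->pd_flags & ~TOY_VALID_FLAG_BITS) == 0 &&  /* no unknown flags */
           h->pd_lower <= h->pd_upper &&                 /* regions do not cross */
           h->pd_upper <= h->pd_special &&
           h->pd_special <= TOY_BLCKSZ &&                /* special space inside page */
           h->pd_special == TOY_MAXALIGN(h->pd_special); /* and aligned */
}
```

A freshly initialized heap page (flags 0, lower 24, upper and special both 8192) passes; a header with crossed pointers, an unknown flag bit, or an unaligned pd_special does not.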

Checksum: 16-bit fold-and-XOR over block-number-perturbed data


The pd_checksum field holds a 16-bit value computed by pg_checksum_page (declared in checksum.h, implemented in checksum.c). The algorithm processes the 8 KB page in 128-byte rows (32 lanes of uint32 each), mixes in the block number, and folds the result down to 16 bits. Because of the block-number perturbation, the same page bytes stored at block 0 and block 1000 produce different checksums, so misrouted block writes are detected.

PageSetChecksumCopy (used by buffer flush) makes a copy of the page before computing the checksum to exclude concurrent hint-bit updates from affecting the final value. PageSetChecksumInplace (used when no concurrent modification is possible, e.g., during pg_checksums) writes directly into the page. The README for storage/page/ states the invariant explicitly: the checksum is valid only when the page enters or leaves the shared pool, not while it is resident and potentially being modified by hint bit updates.

Both setters share the same two short-circuits — a page that PageIsNew (never initialized, all zero) or a cluster with !DataChecksumsEnabled() skips checksum computation entirely:

// PageSetChecksumCopy (src/backend/storage/page/bufpage.c)
char *
PageSetChecksumCopy(Page page, BlockNumber blkno)
{
    static char *pageCopy = NULL;

    if (PageIsNew(page) || !DataChecksumsEnabled())
        return page;                /* nothing to do */

    if (pageCopy == NULL)
        pageCopy = MemoryContextAllocAligned(TopMemoryContext, BLCKSZ,
                                             PG_IO_ALIGN_SIZE, 0);

    memcpy(pageCopy, page, BLCKSZ);
    ((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
    return pageCopy;
}

The static pageCopy is allocated once per backend, aligned to PG_IO_ALIGN_SIZE, and reused on every flush; the caller must write the returned pointer immediately and not retain it. PageSetChecksumInplace is the same minus the copy: it writes pg_checksum_page(page, blkno) straight into pd_checksum.

The checksum algorithm: parallel FNV-1a fold


pg_checksum_page lives in src/backend/storage/page/checksum.c, but that file is a one-line shell — it #includes storage/checksum_impl.h, where the real algorithm sits. The indirection lets external tools (pg_checksums, pg_upgrade, pg_basebackup) embed the exact same code by including one exported header. The algorithm is a vectorizable variant of FNV-1a (Fowler/Noll/Vo) tuned so the compiler can auto-SIMD it; on read-heavy workloads where the working set fits in OS cache but not shared buffers, the checksum is the dominant cost, so speed is the design driver.

The page is viewed as a two-dimensional uint32 array of N_SUMS = 32 columns. Each column is folded independently into its own partial sum seeded from a distinct random offset-basis constant, which is what lets a SIMD unit run all 32 lanes in parallel:

// checksum constants and one fold round (src/include/storage/checksum_impl.h)
#define N_SUMS 32
#define FNV_PRIME 16777619

typedef union {
    PageHeaderData phdr;
    uint32         data[BLCKSZ / (sizeof(uint32) * N_SUMS)][N_SUMS];
} PGChecksummablePage;

#define CHECKSUM_COMP(checksum, value) \
do { \
    uint32 __tmp = (checksum) ^ (value); \
    (checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17); \
} while (0)

The >> 17 xor-shift is PostgreSQL’s addition to plain FNV-1a: vanilla FNV mixes high-order input bits only into high-order output bits, so the shift folds high bits back down. pg_checksum_block seeds the 32 lanes from checksumBaseOffsets[], runs CHECKSUM_COMP across every row, adds two trailing rounds of zero to avalanche the final row, then xor-folds the 32 partial sums into one 32-bit result:

// pg_checksum_block, condensed (src/include/storage/checksum_impl.h)
static uint32
pg_checksum_block(const PGChecksummablePage *page)
{
    uint32      sums[N_SUMS];
    uint32      result = 0, i, j;

    memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets));

    for (i = 0; i < (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
        for (j = 0; j < N_SUMS; j++)
            CHECKSUM_COMP(sums[j], page->data[i][j]);

    for (i = 0; i < 2; i++)         /* two zero rounds to mix last row */
        for (j = 0; j < N_SUMS; j++)
            CHECKSUM_COMP(sums[j], 0);

    for (i = 0; i < N_SUMS; i++)
        result ^= sums[i];

    return result;
}

pg_checksum_page wraps the block hash with two page-specific steps. It transiently zeroes pd_checksum so the stored value never feeds its own recomputation (then restores it — updating the field is the caller’s job, not this function’s). After the block hash it xors in blkno, which is what makes a page written to the wrong sector fail verification. Finally it reduces to 16 bits with (checksum % 65535) + 1 — the + 1 offset guarantees the result is never zero, reserving 0 as the “checksums never written” sentinel:

// pg_checksum_page, condensed (src/include/storage/checksum_impl.h)
uint16
pg_checksum_page(char *page, BlockNumber blkno)
{
    PGChecksummablePage *cpage = (PGChecksummablePage *) page;
    uint16      save_checksum;
    uint32      checksum;

    save_checksum = cpage->phdr.pd_checksum;
    cpage->phdr.pd_checksum = 0;        /* exclude stored checksum */
    checksum = pg_checksum_block(cpage);
    cpage->phdr.pd_checksum = save_checksum;

    checksum ^= blkno;                  /* detect transposed pages */
    return (uint16) ((checksum % 65535) + 1);
}

Note that the PGChecksummablePage union overlays PageHeaderData on the uint32 grid, so the header (including pd_lsn, pd_flags, and the three offset fields) is covered by the checksum, while the pd_checksum field itself is excluded by the transient zeroing. The Assert(sizeof (PGChecksummablePage) == BLCKSZ) in pg_checksum_block enforces that the grid exactly tiles the block; BLCKSZ must be a multiple of 4 * N_SUMS = 128 bytes, which every supported BLCKSZ (powers of two from 1 KB to 32 KB) satisfies.
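The overall shape of the computation can be tried out in a standalone sketch. Note the assumptions loudly: the lane seeds below are made up for illustration and are NOT the values in checksumBaseOffsets[], so the results will not match real PostgreSQL checksums; the function works on a private copy of the page instead of zeroing and restoring the field in place; the block size is fixed at 8192; and pd_checksum is taken to sit at byte offset 8, directly after the 8-byte pd_lsn. The names toy_comp and toy_checksum_page are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 8192u
#define TOY_N_SUMS 32
#define TOY_FNV_PRIME 16777619u
#define TOY_ROWS (TOY_BLCKSZ / (sizeof(uint32_t) * TOY_N_SUMS))

/* One fold round: xor in the value, multiply, then fold high bits down. */
static uint32_t toy_comp(uint32_t sum, uint32_t value)
{
    uint32_t tmp = sum ^ value;
    return (tmp * TOY_FNV_PRIME) ^ (tmp >> 17);
}

/* page must point at TOY_BLCKSZ bytes; bytes 8..9 model pd_checksum. */
static uint16_t toy_checksum_page(const unsigned char *page, uint32_t blkno)
{
    uint32_t grid[TOY_ROWS][TOY_N_SUMS];
    uint32_t sums[TOY_N_SUMS];
    uint32_t result = 0;
    unsigned i, j;

    memcpy(grid, page, TOY_BLCKSZ);
    memset((unsigned char *) grid + 8, 0, 2);   /* exclude the stored checksum */

    for (j = 0; j < TOY_N_SUMS; j++)
        sums[j] = 0x9E3779B9u * (j + 1);        /* HYPOTHETICAL seeds, for illustration only */

    for (i = 0; i < TOY_ROWS; i++)              /* 32 independent lanes per row */
        for (j = 0; j < TOY_N_SUMS; j++)
            sums[j] = toy_comp(sums[j], grid[i][j]);

    for (i = 0; i < 2; i++)                     /* two zero rounds to mix the last row */
        for (j = 0; j < TOY_N_SUMS; j++)
            sums[j] = toy_comp(sums[j], 0);

    for (j = 0; j < TOY_N_SUMS; j++)            /* xor-fold the lanes together */
        result ^= sums[j];

    result ^= blkno;                            /* tie the checksum to the block number */
    return (uint16_t) ((result % 65535u) + 1u); /* range 1..65535, never zero */
}
```

Three properties follow directly from the structure: the result is never zero, changing only the bytes of the stored checksum field does not change the result, and the same page bytes checked against block 0 and block 1 give different results, since the two 32-bit values differ by exactly one before the modulo.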

flowchart TD
    subgraph WRITE["Write path — page leaving shared pool"]
        W0["FlushBuffer / SyncOneBuffer"]
        W1{"PageIsNew or<br/>!DataChecksumsEnabled?"}
        W2["return page as-is<br/>(no checksum)"]
        W3["PageSetChecksumCopy:<br/>memcpy page to aligned pageCopy"]
        W4["pg_checksum_page(pageCopy, blkno)"]
        W5["write pd_checksum into copy"]
        W6["smgrwrite() the copy to disk"]
        W0 --> W1
        W1 -->|yes| W2
        W1 -->|no| W3 --> W4 --> W5 --> W6
    end
    subgraph CKSUM["pg_checksum_page internals"]
        C1["save pd_checksum, set field = 0"]
        C2["pg_checksum_block:<br/>32 parallel FNV-1a lanes<br/>over uint32 grid"]
        C3["two zero rounds + xor-fold to uint32"]
        C4["checksum ^= blkno"]
        C5["restore pd_checksum<br/>return (checksum % 65535) + 1"]
        C1 --> C2 --> C3 --> C4 --> C5
    end
    subgraph READ["Read path — page entering shared pool"]
        R0["smgrread() block from disk"]
        R1["PageIsVerified(page, blkno, ...)"]
        R2{"DataChecksumsEnabled?"}
        R3["computed = pg_checksum_page(page, blkno)"]
        R4{"computed == pd_checksum?"}
        R5["header sanity:<br/>pd_lower<=pd_upper<=pd_special<=BLCKSZ"]
        R6["admit page to buffer pool"]
        R7{"page all zeros?"}
        R8["accept (crashed relation extension)"]
        R9["checksum_failure: ERROR / WARNING<br/>per PIV_* flags"]
        R0 --> R1 --> R2
        R2 -->|yes| R3 --> R4
        R2 -->|no| R5
        R4 -->|yes| R5
        R4 -->|no| R7
        R5 --> R6
        R7 -->|yes| R8
        R7 -->|no| R9
    end
    W4 -.calls.-> C1
    R3 -.calls.-> C1

Figure 2 — Checksum lifecycle. The write path computes the checksum on a private copy as the page leaves the shared pool; the read path recomputes and compares as the page enters. Both call the same pg_checksum_page (center), which xors in blkno so a transposed page fails. An all-zeros page bypasses the failure branch because a crashed relation-extension legitimately leaves zero blocks.

flowchart TD
    A["PageHeaderData<br/>(24 bytes fixed)"]
    B["pd_linp[0]..pd_linp[N-1]<br/>ItemIdData array<br/>grows ↓ toward pd_upper"]
    C["Free space gap<br/>(pd_lower .. pd_upper)"]
    D["Item bodies<br/>(tuples / index entries)<br/>packed ↑ from pd_upper"]
    E["Special space<br/>(0 bytes for heap,<br/>AM-opaque for indexes)"]
    A --> B --> C --> D --> E

Figure 1 — The four regions of a PostgreSQL page. pd_lower points to the next free slot in the line-pointer array; pd_upper points to the top of the item area. Insertions advance pd_lower up and pd_upper down. pd_special is fixed at PageInit time.

The diagram above lists the regions in storage order; the next one shows the same page as a byte-addressed map, making the two-pointer “growing toward each other” discipline explicit. The header offsets are the only metadata needed to locate any region: pd_lower is the high-water mark of the line-pointer array (slots grow up from SizeOfPageHeaderData), pd_upper is the low-water mark of the item area (bodies grow down from pd_special), and the gap between them is the only free space. The page is full when pd_lower + sizeof(ItemIdData) > pd_upper - MAXALIGN(size).

flowchart TB
    subgraph PAGE["One BLCKSZ page (default 8192 bytes), byte offset 0 at top"]
        direction TB
        H["offset 0 .. SizeOfPageHeaderData (24)<br/>PageHeaderData<br/>pd_lsn | pd_checksum | pd_flags<br/>pd_lower | pd_upper | pd_special<br/>pd_pagesize_version | pd_prune_xid"]
        LP["pd_linp[0] (offnum 1)<br/>pd_linp[1] (offnum 2)<br/>...<br/>pd_linp[N-1] (offnum N)<br/>ItemIdData array — slots grow DOWN ↓<br/>each = lp_off:15 lp_flags:2 lp_len:15"]
        FREE["FREE SPACE<br/>PageGetFreeSpace =<br/>pd_upper - pd_lower - sizeof(ItemIdData)"]
        ITEM["tuple body N (newest, lowest offset)<br/>...<br/>tuple body 1 (oldest, highest offset)<br/>item bodies grow UP ↑, MAXALIGN-padded"]
        SP["Special space<br/>heap: 0 bytes (pd_special == BLCKSZ)<br/>btree: BTPageOpaqueData, etc."]
        H --> LP --> FREE --> ITEM --> SP
    end
    PL["pd_lower<br/>= 24 + N * sizeof(ItemIdData)"]
    PU["pd_upper<br/>= top of newest tuple body"]
    PS["pd_special<br/>= BLCKSZ - MAXALIGN(specialSize)"]
    PL -.points at start of.-> FREE
    PU -.points at end of.-> FREE
    PS -.points at start of.-> SP

Figure 1b — Byte map of a slotted page. The line-pointer array and the item area grow toward each other from opposite ends; PageGetItemId indexes the array by 1-based offset number, and PageGetItem follows a slot’s lp_off into the item area. An ItemPointerData TID stored in an index leaf names (blkno, offset number) — the offset number is the array index here, which is why tuple-body compaction by PageRepairFragmentation (rewriting lp_off) never invalidates an external TID.
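The free-space figure shown in the byte map can be written as a small function. toy_free_space is a hypothetical name, and the sketch assumes a 4-byte line pointer; it follows the formula in the diagram, reserving room for one new slot and reporting zero when not even a slot would fit.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOT_SIZE 4u        /* stands in for sizeof(ItemIdData) */

/* Bytes available for one new item body, after reserving its line pointer. */
static size_t toy_free_space(unsigned pd_lower, unsigned pd_upper)
{
    long space = (long) pd_upper - (long) pd_lower;

    if (space < (long) TOY_SLOT_SIZE)
        return 0;               /* no room even for the slot itself */
    return (size_t) space - TOY_SLOT_SIZE;
}
```

For a freshly initialized 8192-byte heap page (pd_lower 24, pd_upper 8192) this gives 8164 bytes. The item body additionally consumes its MAXALIGN-padded length, so an item fits only if that padded length does not exceed the returned value.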

A heap tuple body begins with HeapTupleHeaderData:

// HeapTupleHeaderData (src/include/access/htup_details.h)
struct HeapTupleHeaderData
{
    union {
        HeapTupleFields  t_heap;    /* xmin, xmax, cid/xvac */
        DatumTupleFields t_datum;   /* used when stored as a composite datum */
    } t_choice;
    ItemPointerData t_ctid;         /* TID of this or newer tuple version */
    uint16          t_infomask2;    /* attribute count + flags */
    uint16          t_infomask;     /* HEAP_XMIN_COMMITTED, HEAP_XMAX_INVALID, … */
    uint8           t_hoff;         /* offset to user data (past nulls bitmap) */
    bits8           t_bits[FLEXIBLE_ARRAY_MEMBER]; /* nulls bitmap (optional) */
    /* user data follows at offset t_hoff */
};

HeapTupleFields carries t_xmin (inserting transaction), t_xmax (deleting/locking transaction), and t_cid/t_xvac. The t_infomask bits such as HEAP_XMIN_COMMITTED and HEAP_XMAX_INVALID are hint bits: once set, they short-circuit visibility checks by caching the commit status in the tuple header so the commit-log (CLOG) need not be consulted on repeated reads. Setting a hint bit makes the page dirty without a WAL record (unless checksums are enabled and the page is otherwise clean — see the storage/page/README note on MarkBufferDirtyHint).

  • PageHeaderData (bufpage.h) — the 24-byte fixed header; pd_lsn and the three offset fields (pd_lower, pd_upper, pd_special) live here.
  • PageXLogRecPtr (bufpage.h) — historical two-uint32 representation of an XLogRecPtr; PageXLogRecPtrGet / PageXLogRecPtrSet convert.
  • PG_PAGE_LAYOUT_VERSION (bufpage.h) — compile-time constant, value 4; packed into pd_pagesize_version low byte.
  • SizeOfPageHeaderData (bufpage.h) — offsetof(PageHeaderData, pd_linp), equals 24 bytes; the starting pd_lower after PageInit.
  • PageInit (bufpage.c) — zeros the page, sets pd_lower, pd_upper, pd_special, packs pd_pagesize_version.
  • PageIsNew (bufpage.h) — pd_upper == 0; true before PageInit.
  • PageIsEmpty (bufpage.h) — pd_lower <= SizeOfPageHeaderData; no slots allocated yet.
  • ItemIdData (itemid.h) — 32-bit packed struct: lp_off:15, lp_flags:2, lp_len:15.
  • LP_UNUSED / LP_NORMAL / LP_REDIRECT / LP_DEAD (itemid.h) — the four line-pointer states; LP_REDIRECT is the HOT-chain pivot.
  • PageGetItemId(page, offnum) (bufpage.h) — returns &pd_linp[offnum-1].
  • PageGetMaxOffsetNumber (bufpage.h) — (pd_lower - SizeOfPageHeaderData) / sizeof(ItemIdData).
  • PageGetItem(page, itemId) (bufpage.h) — (char *)page + lp_off.
  • ItemIdSetNormal / ItemIdSetRedirect / ItemIdSetDead / ItemIdSetUnused (itemid.h) — the four state transitions.
  • PageAddItemExtended (bufpage.c) — inserts an item: validates offsets, picks or allocates a slot, decrements pd_upper, increments pd_lower, copies bytes.
  • PageGetFreeSpace (bufpage.c) — pd_upper - pd_lower - sizeof(ItemIdData); returns 0 if below that threshold.
  • PageGetHeapFreeSpace (bufpage.c) — like PageGetFreeSpace but also enforces MaxHeapTuplesPerPage (derived from BLCKSZ and MinHeapTupleSize).
  • PageRepairFragmentation (bufpage.c) — full defrag: slides live item bodies together, truncates trailing LP_UNUSED slots, resets pd_upper.
  • PageTruncateLinePointerArray (bufpage.c) — called by VACUUM second pass; removes only trailing LP_UNUSED entries without moving item bodies.
  • compactify_tuples (static, bufpage.c) — inner loop of PageRepairFragmentation; presorted fast path (memmove) vs. unsorted copy-through-buffer path.
  • ItemPointerData (itemptr.h) — 6-byte TID: ip_blkid + ip_posid.
  • BlockIdData (block.h) — bi_hi + bi_lo uint16 pair; assembled by BlockIdGetBlockNumber.
  • ItemPointerSet / ItemPointerGet{BlockNumber,OffsetNumber} (itemptr.h) — TID accessors.
  • ItemPointerIsValid (itemptr.h) — ip_posid != 0.
  • PageIsVerified (bufpage.c) — entry gate for pages entering the buffer pool from disk; checks header sanity + optional checksum.
  • pg_checksum_page (checksum.h decl; checksum_impl.h body) — 16-bit checksum over BLCKSZ bytes, parameterised by block number; transiently zeroes pd_checksum, xors in blkno, returns (checksum % 65535) + 1.
  • pg_checksum_block (static, checksum_impl.h) — the FNV-1a inner loop: 32 parallel partial sums seeded from checksumBaseOffsets[], two zero rounds, xor-fold to one uint32.
  • CHECKSUM_COMP / FNV_PRIME / N_SUMS / PGChecksummablePage (checksum_impl.h) — the fold macro, prime multiplier 16777619, lane count 32, and the union that overlays PageHeaderData on the grid.
  • checksum.c (storage/page) — one-line shell that #includes checksum_impl.h so external tools can embed identical code.
  • PageSetChecksumCopy (bufpage.c) — makes a local copy, computes checksum, returns pointer to copy; used by buffer flush path.
  • PageSetChecksumInplace (bufpage.c) — writes checksum directly into the page; used by pg_checksums tool and similar offline paths.
  • DataChecksumsEnabled() (xlog.h) — predicate reflecting the cluster-wide checksum setting recorded in the control file; consulted by both verification and checksum-setting paths.

Position hints (as of 2026-06-05, commit 273fe94, REL_18_STABLE)

| Symbol | File | Line |
| --- | --- | --- |
| PageHeaderData | src/include/storage/bufpage.h | 159 |
| SizeOfPageHeaderData | src/include/storage/bufpage.h | 218 |
| PG_PAGE_LAYOUT_VERSION | src/include/storage/bufpage.h | 207 |
| PD_HAS_FREE_LINES | src/include/storage/bufpage.h | 188 |
| PD_PAGE_FULL | src/include/storage/bufpage.h | 189 |
| PD_ALL_VISIBLE | src/include/storage/bufpage.h | 190 |
| PageGetMaxOffsetNumber | src/include/storage/bufpage.h | 372 |
| PageGetItemId | src/include/storage/bufpage.h | 245 |
| PageGetItem | src/include/storage/bufpage.h | 354 |
| PageGetLSN / PageSetLSN | src/include/storage/bufpage.h | 386–394 |
| ItemIdData | src/include/storage/itemid.h | 25 |
| LP_UNUSED / LP_NORMAL / LP_REDIRECT / LP_DEAD | src/include/storage/itemid.h | 38–41 |
| ItemPointerData | src/include/storage/itemptr.h | 36 |
| PageInit | src/backend/storage/page/bufpage.c | 42 |
| PageIsVerified | src/backend/storage/page/bufpage.c | 94 |
| PageAddItemExtended | src/backend/storage/page/bufpage.c | 193 |
| PageRepairFragmentation | src/backend/storage/page/bufpage.c | 698 |
| PageTruncateLinePointerArray | src/backend/storage/page/bufpage.c | 834 |
| PageGetFreeSpace | src/backend/storage/page/bufpage.c | 906 |
| PageSetChecksumCopy | src/backend/storage/page/bufpage.c | 1509 |
| PageSetChecksumInplace | src/backend/storage/page/bufpage.c | 1541 |
| pg_checksum_page (decl) | src/include/storage/checksum.h | 22 |
| pg_checksum_page (body) | src/include/storage/checksum_impl.h | 187 |
| pg_checksum_block | src/include/storage/checksum_impl.h | 146 |
| N_SUMS | src/include/storage/checksum_impl.h | 106 |
| FNV_PRIME | src/include/storage/checksum_impl.h | 108 |
| checksumBaseOffsets | src/include/storage/checksum_impl.h | 121 |
| CHECKSUM_COMP | src/include/storage/checksum_impl.h | 135 |
| PGChecksummablePage | src/include/storage/checksum_impl.h | 111 |
| HeapTupleHeaderData | src/include/access/htup_details.h | 153 |
| SizeofHeapTupleHeader | src/include/access/htup_details.h | 185 |
| HEAP_XMIN_COMMITTED | src/include/access/htup_details.h | 204 |

  • PageHeaderData is 24 bytes on all current platforms. Confirmed by SizeOfPageHeaderData = offsetof(PageHeaderData, pd_linp) in bufpage.h at commit 273fe94. The struct ends at pd_prune_xid (offset 20, TransactionId = uint32) followed by the flexible array pd_linp[] at offset 24.

  • Page layout version is 4 and has not changed since PostgreSQL 8.3. PG_PAGE_LAYOUT_VERSION = 4 confirmed in bufpage.h. The comment history in the file traces versions 0–4 ending at 8.3.

  • LP_REDIRECT carries a slot number, not a byte offset, in lp_off. Confirmed by ItemIdGetRedirect(itemId) in itemid.h and the HOT chain-following logic in heapam.c. Callers use ItemIdGetRedirect to obtain an OffsetNumber, then call PageGetItemId again with that number.

  • PageRepairFragmentation is heap-only; index AMs use PageIndexMultiDelete. Confirmed by the comment at line 684 of bufpage.c (“This routine is usable for heap pages only, but see PageIndexMultiDelete”).

  • All-zeros pages pass PageIsVerified unconditionally. Confirmed at line 142 of bufpage.c: the pg_memory_is_all_zeros check runs after the checksum/sanity path and returns true on an all-zero block. The comment explains the motivation: a crashed backend that extended a relation leaves a zero page.

  • Checksum is computed on a private copy during buffer flush to exclude concurrent hint-bit writes. Confirmed in PageSetChecksumCopy at line 1509 of bufpage.c and the README note that “many or even most pages in shared buffers have invalid page checksums.”

  • pd_prune_xid is unused in index pages. Confirmed by the comment in bufpage.h line 126: “It is currently unused in index pages.”

  • pg_checksum_page is a parallel FNV-1a fold with a >> 17 high-bit mixer. Confirmed directly in checksum_impl.h (the body #included by the one-line checksum.c). N_SUMS = 32 lanes, FNV_PRIME = 16777619, per-lane seeds in checksumBaseOffsets[], two trailing zero rounds, xor-fold. pg_checksum_page zeroes pd_checksum transiently, xors in blkno, and returns (checksum % 65535) + 1 so the field is never zero. This matches (and supersedes) the “modified FNV” wording in older docs — the modification is precisely the ^ ((hash ^ value) >> 17) term.
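
The checksum structure described in the last item can be sketched as follows. This is an illustration of the shape only, not a drop-in replacement: the function names (sketch_checksum_block, sketch_checksum_page, placeholder_seed) are hypothetical, the per-lane seeds are placeholders rather than the real checksumBaseOffsets[] values, and BLCKSZ is assumed to be the default 8192. Its output will therefore not match a real PostgreSQL page checksum.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ             8192         /* assumed default block size */
#define N_SUMS             32           /* number of parallel lanes */
#define FNV_PRIME          16777619u
#define PD_CHECKSUM_OFFSET 8            /* pd_checksum follows the 8-byte pd_lsn */

/* One mixing step: FNV-1a style xor-then-multiply, plus the >> 17 term. */
static uint32_t checksum_comp(uint32_t sum, uint32_t value)
{
    uint32_t tmp = sum ^ value;

    return (tmp * FNV_PRIME) ^ (tmp >> 17);
}

/* PLACEHOLDER seeds: the real table is checksumBaseOffsets[] in checksum_impl.h. */
static uint32_t placeholder_seed(int lane)
{
    return 0x9E3779B9u * (uint32_t) (lane + 1);
}

uint32_t sketch_checksum_block(const unsigned char *page)
{
    uint32_t     sums[N_SUMS];
    uint32_t     result = 0;
    const size_t rows = BLCKSZ / (sizeof(uint32_t) * N_SUMS);

    for (int j = 0; j < N_SUMS; j++)
        sums[j] = placeholder_seed(j);

    /* Main loop: each row feeds one 32-bit word into each lane. */
    for (size_t i = 0; i < rows; i++)
        for (int j = 0; j < N_SUMS; j++)
        {
            uint32_t word;

            memcpy(&word, page + (i * N_SUMS + j) * sizeof(word), sizeof(word));
            sums[j] = checksum_comp(sums[j], word);
        }

    /* Two trailing rounds of zeroes so the last words are mixed as well. */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < N_SUMS; j++)
            sums[j] = checksum_comp(sums[j], 0);

    /* Xor-fold the lanes into a single 32-bit value. */
    for (int j = 0; j < N_SUMS; j++)
        result ^= sums[j];
    return result;
}

uint16_t sketch_checksum_page(unsigned char *page, uint32_t blkno)
{
    uint16_t saved;
    uint32_t checksum;

    assert(page != NULL);

    /* Zero pd_checksum transiently so the stored value never feeds itself. */
    memcpy(&saved, page + PD_CHECKSUM_OFFSET, sizeof(saved));
    memset(page + PD_CHECKSUM_OFFSET, 0, sizeof(saved));
    checksum = sketch_checksum_block(page);
    memcpy(page + PD_CHECKSUM_OFFSET, &saved, sizeof(saved));

    /* Mix in the block number, then map into 1..65535 so zero is never returned. */
    checksum ^= blkno;
    return (uint16_t) ((checksum % 65535) + 1);
}
```

Because the lanes are independent until the final fold, a compiler can vectorize the inner loop. Mixing in the block number means a page written to the wrong location fails verification even when its contents are intact.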

  1. MaxHeapTuplesPerPage derivation. The constant caps the line-pointer array for heap pages (PAI_IS_HEAP flag in PageAddItemExtended). Its exact formula (from BLCKSZ and MinHeapTupleSize) is in htup.h and was not confirmed in this pass.

  2. pd_checksum field when checksums are disabled. The bufpage.h comment states “zero is a valid value for a checksum” and that pre-9.3 databases may have non-zero values from the old timelineid field. Whether PageIsVerified handles this ambiguity correctly in all upgrade paths was not confirmed from this analysis alone.
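
For the first open question, the arithmetic can at least be sketched under stated assumptions. The sketch below assumes that the minimum per-tuple footprint is the max-aligned fixed heap tuple header (23 bytes, rounded up to 24 with 8-byte alignment) plus one 4-byte line pointer, and that the page header is 24 bytes as noted earlier. The macro and function names are hypothetical, and the divisor should be checked against the header before being relied on.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_HEADER_SIZE        24  /* SizeOfPageHeaderData, per the notes above */
#define HEAP_TUPLE_HEADER_SIZE  23  /* ASSUMED fixed heap tuple header size */
#define ITEM_ID_SIZE            4   /* one line pointer */
#define MAXIMUM_ALIGNOF         8   /* ASSUMED platform maximum alignment */

/* Round len up to the next multiple of MAXIMUM_ALIGNOF. */
#define MAXALIGN_SKETCH(len) \
    (((size_t) (len) + (MAXIMUM_ALIGNOF - 1)) & ~((size_t) (MAXIMUM_ALIGNOF - 1)))

/* Upper bound on heap tuples per page under the assumptions stated above. */
size_t max_heap_tuples_per_page(size_t blcksz)
{
    assert(blcksz > PAGE_HEADER_SIZE);
    return (blcksz - PAGE_HEADER_SIZE) /
           (MAXALIGN_SKETCH(HEAP_TUPLE_HEADER_SIZE) + ITEM_ID_SIZE);
}
```

With an 8192-byte block this gives (8192 - 24) / (24 + 4) = 291 under these assumptions.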

Beyond PostgreSQL — Comparative Designs & Research Frontiers

  • InnoDB’s page format — InnoDB uses a 16 KB page by default (the size is chosen per instance via innodb_page_size), with a FIL_PAGE_LSN header field and a separate page trailer carrying an LSN plus a checksum. InnoDB’s checksum covers the body but not the header fields, and the trailer checksum allows a form of torn-page detection that PostgreSQL’s single-field approach does not provide. A comparison of tear-detection strategies would illuminate PostgreSQL’s design choice.

  • Variable-page-size experiments — PostgreSQL’s BLCKSZ is a compile-time constant. The configure option --with-blocksize accepts power-of-two values from 1 to 32 KB. Research on variable-page-size storage (e.g., for columnar or HTAP workloads) asks whether the page abstraction itself should be widened or replaced. The table access method API (postgres-table-am.md) is the extensibility hook that would make this possible without touching bufpage.c.

  • “Beyond the Page” storage models — Column stores (cstore_fdw and its successor Citus columnar, DuckDB) abandon the slotted-page model entirely in favor of compressed column chunks. PostgreSQL’s Table AM API (PG12+) is the first step toward plugging such a model in-core. Stonebraker et al.’s “The End of an Architectural Era” (VLDB 2007) is the theoretical anchor for this direction.

  • HOT chains and line-pointer indirection — The LP_REDIRECT mechanism was introduced in PostgreSQL 8.3 (Heap Only Tuples). Digging into the HOT design document (src/backend/access/heap/README.HOT) and comparing it with MySQL’s undo-segment approach to in-place updates would make a strong follow-up note under postgres-heap-am.md.

  • Torn-page protection via full-page images in WAL — The storage/page/README notes that full-page writes in WAL protect against torn pages, and that MarkBufferDirtyHint must write a WAL record when checksums are on and the page is otherwise clean. The interaction between checksums, full-page writes, and hint-bit updates is the subject of a careful PostgreSQL commit trail worth documenting in postgres-xlog-wal.md.

(none — this document was synthesized directly from the source tree)

Source code paths (REL_18_STABLE, commit 273fe94)

  • src/backend/storage/page/bufpage.c
  • src/backend/storage/page/checksum.c
  • src/backend/storage/page/itemptr.c
  • src/backend/storage/page/README
  • src/include/storage/bufpage.h
  • src/include/storage/itemid.h
  • src/include/storage/itemptr.h
  • src/include/storage/checksum.h
  • src/include/storage/checksum_impl.h
  • src/include/access/htup_details.h
  • src/include/access/htup.h
  • Petrov, Database Internals (2019), ch. 3 §“Page Structure”
  • Silberschatz et al., Database System Concepts (7th ed.), ch. 13 §“File Organization” (slotted page layout)
  • knowledge/code-analysis/postgres/postgres-buffer-manager.md — how pages enter/leave the shared pool; WAL-before-flush enforcement
  • knowledge/code-analysis/postgres/postgres-heap-am.md — heap tuple layout, HOT chain mechanics, heap pruning
  • knowledge/code-analysis/postgres/postgres-mvcc-snapshots.md — how t_xmin/t_xmax/t_infomask hint bits drive visibility
  • knowledge/code-analysis/postgres/postgres-xlog-wal.md — full-page writes, checksum interaction with WAL