Skip to content

PostgreSQL Free Space Map — Tracking Reusable Space per Page

Contents:

A heap file in a page-oriented database is an unordered collection of fixed-size pages, and a tuple may be inserted onto any page that has room for it. This freedom is exactly what makes free-space tracking a problem. When a transaction wants to insert a 200-byte row, the storage engine must answer a deceptively simple question: which page should receive it? The naive answers are both bad. Always appending to the last page wastes the space freed by deletes and updates elsewhere, so the file grows without bound even when half of it is empty (the classic “bloat” failure mode). Scanning every page to find one with room is correct but costs O(n) page reads per insert, which is catastrophic on a large relation. The free-space map is the data structure that makes the question answerable in sublinear time.

The textbook framing (Database System Concepts, ch. on “Storage and File Structure”, and Database Internals part I on “File Formats” / “slotted pages”) distinguishes two layers. The slotted page is the intra-page layout: a page header, an array of line pointers (item identifiers) growing from the front, and tuple bodies growing from the back, with a contiguous free region in the middle. The amount of free space on a page is therefore a single derived number — the gap between the line-pointer array and the tuple region (minus what a future line pointer would consume). The free-space management structure is the inter-page layer: a directory that records, for each data page, roughly how much of that gap is available, so the engine can locate a fit without touching the data pages themselves.

Database Internals describes the spectrum of free-space directories used in practice: a simple bitmap (one bit per page: has-room / full), a per-page byte or word giving a coarse fill level, and tree-structured directories that let a “find a page with ≥ X free” query short-circuit at the top. The central design tension is fidelity versus size. A directory that records the exact free byte count per page is precise but large — for a 1 TB table at 8 KB pages that is 128 million entries, and at 4 bytes each that is 512 MB of directory that itself must be paged in and searched. A directory that is small enough to keep hot must therefore be lossy: it approximates free space, accepts that its answer may be stale or wrong, and pushes the burden of verification onto the caller, which already has to lock the target page anyway.

A second axis is search cost. A flat array of per-page fill levels still costs O(n) to scan for a fit. The standard remedy is a tree of maxima: each interior node stores the maximum free-space value among its descendants, so a search for ”≥ X” can prune any subtree whose maximum is below X and descend only along a satisfying path — O(log n) instead of O(n). This is structurally a segment tree / tournament tree specialized to the “range-max + point-update” workload: point updates (one page’s free space changed) bubble a new max up the spine, and queries descend greedily. The same shape underlies interval trees, Fenwick-style aggregates, and the loser-tree used in external sorting; the FSM is this idea applied to disk pages.

A third, subtler axis is concurrency and durability. Free-space information is fundamentally a hint: if it is wrong, the worst outcome is a wasted page read or a slightly larger file, never a corrupt or lost row. That observation licenses two aggressive optimizations that would be illegal for real data. First, the directory need not be crash-safe — it can be rebuilt or self-corrected after a crash, so it need not be write-ahead logged, eliminating a large class of WAL traffic. Second, concurrent readers can tolerate a racy, unlatched “where to look next” pointer, because a stale hint merely sends two backends to adjacent pages rather than colliding on one. The engineering art of a free-space map lies in pushing approximation as far as correctness allows, and then building cheap self-healing measures to mop up the consequences.

Across engines the free-space directory converges on a small set of patterns, differentiated mainly by how lossy they are and how they recover from the loss.

Page-directory / FSIP (Free Space Inventory Page) designs. DB2 and several commercial engines interleave special directory pages into the data file at fixed intervals: every Nth page is an FSIP whose body is an array of small fill-level codes for the N−1 data pages that follow it. A search reads the FSIP, picks a candidate, and verifies on arrival. The directory is co-located with the data (good locality) but its size scales linearly with the table and it offers no top-level pruning — a search may still touch many FSIPs. Oracle’s ASSM (Automatic Segment Space Management) generalizes this into a small hierarchy of bitmap blocks (L1/L2/L3) that classify each data block into a few fullness buckets (e.g., 0–25%, 25–50%, …), with higher levels summarizing lower ones so a search can prune. The bucket granularity is the fidelity knob.

Bitmap free-space management. The coarsest form stores one or two bits per page (free / partially-full / full). This is tiny and cache-friendly but imprecise: a “partially full” page might hold a small tuple but not a large one, so the caller must verify and the map must be conservative. SQL Server’s PFS (Page Free Space) pages take a middle road: one byte per page recording a coarse fullness percent plus allocation status bits, with PFS pages spaced every ~8000 pages; GAM/SGAM pages above them track extent-level allocation. The recurring theme is byte-per-page as the sweet spot between a single bit (too coarse) and an exact count (too large).

Tree-of-maxima designs. To get sublinear search, engines layer a max-tree over the per-page values. The two recurring choices are (a) a tree within a directory page (a heap-ordered array so an in-page search is logarithmic) and (b) a tree across directory pages (an upper directory page whose entries summarize lower directory pages). PostgreSQL is unusual in committing fully to both layers and making them structurally identical, so the same in-page search code serves every level.

Durability stance. Engines split on whether the free-space directory is recoverable. Some WAL-log directory updates so the map is always exact after recovery; this is simple but adds log volume proportional to insert/delete traffic. Others treat the map as a reconstructible hint and do not log it, relying on (i) verification at use time, (ii) periodic background rebuild (typically folded into VACUUM/maintenance), and (iii) idempotent self-repair on read. The non-logged approach trades a small risk of transient inaccuracy for a large reduction in WAL and lock contention. PostgreSQL takes the non-logged route deliberately and documents the self-correcting measures that make it safe.

Concurrency stance. A free-space directory is a write-hot shared structure: every insert that lands on a page wants to update that page’s entry, and every insert that needs a target wants to read near the top. Engines reduce contention by (i) using shared locks for search and exclusive only for update, (ii) spreading concurrent inserters across different candidate pages rather than funneling them to “the” emptiest page, and (iii) tolerating racy hint updates. PostgreSQL’s round-robin fp_next_slot pointer, updated even under a shared lock, is a direct expression of (ii) and (iii).

flowchart TD
  subgraph Approx["Free-space directory: the fidelity/size/durability tradeoffs"]
    A["Exact byte count per page<br/>precise but large, must be paged in"]
    B["Byte per page = 256 buckets<br/>PostgreSQL FSM, SQL Server PFS<br/>small, lossy, verify on use"]
    C["1-2 bits per page<br/>free / partial / full<br/>tiny, very lossy"]
  end
  B --> D["Layer a tree of maxima on top<br/>node = max of children<br/>search '>= X' prunes subtrees<br/>O(log n)"]
  D --> E["Durability choice"]
  E --> F["WAL-logged: always exact after crash<br/>more log volume"]
  E --> G["Not logged: hint only<br/>verify on use + rebuild in VACUUM<br/>PostgreSQL's choice"]

PostgreSQL gives every heap relation (and every index AM that asks for one) its own free-space map, stored not inline but as a separate relation fork: alongside the main data fork and the visibility map fork, there is an FSM_FORKNUM fork addressed by the same RelFileLocator. This fork-based design (introduced in 8.4, replacing a fixed-size shared FSM) means the map grows with the relation, is truncated with it, and is read through the ordinary buffer manager and smgr layer — no special allocation. The fork’s I/O is the concern of postgres-smgr-md.md; here we treat it as “a file of FSM pages numbered from 0.”

One byte per page, 256 fullness categories. The map records free space at a granularity of BLCKSZ/256 bytes. freespace.c defines the quantization:

// freespace.c — category constants
#define FSM_CATEGORIES 256
#define FSM_CAT_STEP (BLCKSZ / FSM_CATEGORIES)
#define MaxFSMRequestSize MaxHeapTupleSize

At the default 8 KB BLCKSZ, FSM_CAT_STEP is 32 bytes, so category c means “between c·32 and (c+1)·32−1 bytes free.” Category 0 is “effectively full,” category 255 is “at least MaxFSMRequestSize (= MaxHeapTupleSize) free.” The asymmetry of category 255 is deliberate: a request for exactly MaxFSMRequestSize must be satisfiable by a completely empty page, so the top category is pinned to that boundary rather than to a round multiple of the step. fsm_space_avail_to_cat rounds down (a page is reported as holding no more than it really does) while fsm_space_needed_to_cat rounds up (a request is never under-estimated), so the map never lies optimistically about a fit:

// freespace.c — fsm_space_needed_to_cat (rounds up; never under-promises)
cat = (needed + FSM_CAT_STEP - 1) / FSM_CAT_STEP;
if (cat > 255)
cat = 255;
return (uint8) cat;

A binary max-heap inside each FSM page. An FSM page is not a flat array of these bytes. Its body (FSMPageData.fp_nodes) is a complete binary tree stored in array order, where a non-leaf node holds the maximum of its two children and the leaves hold the per-heap-page categories. The header reserves NonLeafNodesPerPage (= BLCKSZ/2 − 1) interior slots; the rest are leaves (SlotsPerFSMPage, ~4000 at 8 KB). The root node fp_nodes[0] is therefore the maximum free-space category of every heap page covered by this FSM page — a single byte that answers “is there any page here with ≥ X?” without descending.

// fsm_internals.h — the FSM page body
typedef struct
{
int fp_next_slot; /* round-robin search start hint */
uint8 fp_nodes[FLEXIBLE_ARRAY_MEMBER]; /* binary tree: non-leaf then leaf */
} FSMPageData;

A fanout tree across FSM pages. A single FSM page covers ~4000 heap pages. To cover a whole relation (up to 2^32−1 blocks) PostgreSQL stacks FSM pages into a fixed-height tree: a leaf slot in an upper-level FSM page mirrors the root of a lower-level FSM page, so the same “max of children” invariant holds across pages as within a page. With ~4000-way fanout, three levels reach 4000^3 > 2^32, so FSM_TREE_DEPTH is 3 at default BLCKSZ (4 at very small block sizes). Level 0 is the bottom (one leaf per heap page), level FSM_ROOT_LEVEL is the root, and the root FSM page is always physical block 0.

flowchart TD
  R["Root FSM page (level 2, phys blk 0)<br/>fp_nodes[0] = max free over ENTIRE relation"]
  R --> L1a["Level-1 FSM page #0<br/>covers ~4000 level-0 pages"]
  R --> L1b["Level-1 FSM page #1"]
  L1a --> L0a["Level-0 FSM page #0<br/>4000 leaves = 4000 heap pages"]
  L1a --> L0b["Level-0 FSM page #1"]
  L0a --> H0["heap blk 0 .. 3999<br/>one leaf byte each"]
  L0b --> H1["heap blk 4000 .. 7999"]

The genius of the design is that both trees obey the same rule, so one routine (fsm_search_avail) searches inside a page at every level, and freespace.c only has to walk between pages. The README states the contract crisply: “The root node within each page has the same value as the corresponding leaf node on its parent page.”

Search, update, vacuum. Three public verbs sit on top: GetPageWithFreeSpace (find a heap block with ≥ N bytes), RecordPageWithFreeSpace (set one page’s recorded free space), and RecordAndGetPageWithFreeSpace (the combo the heap uses on a retry — record the failed page and find a nearby alternative in one lock acquisition). FreeSpaceMapVacuum[Range] re-derives the upper-level maxima from the bottom level after VACUUM has refreshed the leaves.

// freespace.c — the public search entry point
BlockNumber
GetPageWithFreeSpace(Relation rel, Size spaceNeeded)
{
uint8 min_cat = fsm_space_needed_to_cat(spaceNeeded);
return fsm_search(rel, min_cat);
}

Not WAL-logged, self-correcting. Writes use MarkBufferDirtyHint, not MarkBufferDirty; reads use RBM_ZERO_ON_ERROR so a torn or checksum-failed FSM page is silently zeroed rather than erroring. Correctness is restored by four mechanisms covered below: a root-vs-set-value sanity check that triggers fsm_rebuild_page; a descent-time corruption check in fsm_search_avail; the fsm_does_block_exist guard that refuses to hand back a block past end-of-file; and VACUUM’s periodic FreeSpaceMapVacuum. The only exception to non-logging is truncation (FreeSpaceMapPrepareTruncateRel), which uses a real MarkBufferDirty plus an optional log_newpage_buffer so a truncated tail does not resurrect after replay.

The code splits cleanly along the two-tree boundary. fsmpage.c knows the in-page binary tree and nothing about forks or addressing — it treats a page as “a black box with SlotsPerFSMPage slots.” freespace.c knows the cross-page fanout, the byte/category encoding, buffer I/O, and the public API, and calls into fsmpage.c for every in-page operation. indexfsm.c is a thin re-skin of the public API for index page recycling.

In-page tree: set, search, rebuild (fsmpage.c)

Section titled “In-page tree: set, search, rebuild (fsmpage.c)”

The page tree is navigated with three macros — leftchild(x) = 2x+1, rightchild(x) = 2x+2, parentof(x) = (x−1)/2 — the standard array embedding of a complete binary tree with the root at index 0.

fsm_set_avail writes a leaf and bubbles the change upward, stopping early at the first parent whose max does not change:

// fsm_set_avail — fsmpage.c (set one leaf, propagate the new max up the spine)
int nodeno = NonLeafNodesPerPage + slot;
FSMPage fsmpage = (FSMPage) PageGetContents(page);
oldvalue = fsmpage->fp_nodes[nodeno];
if (oldvalue == value && value <= fsmpage->fp_nodes[0])
return false; /* unchanged and not above root: nothing to do */
fsmpage->fp_nodes[nodeno] = value;
do {
uint8 newvalue = 0;
nodeno = parentof(nodeno);
lchild = leftchild(nodeno);
rchild = lchild + 1;
newvalue = fsmpage->fp_nodes[lchild];
if (rchild < NodesPerPage)
newvalue = Max(newvalue, fsmpage->fp_nodes[rchild]);
oldvalue = fsmpage->fp_nodes[nodeno];
if (oldvalue == newvalue)
break; /* parent max didn't move: stop bubbling */
fsmpage->fp_nodes[nodeno] = newvalue;
} while (nodeno > 0);
if (value > fsmpage->fp_nodes[0]) /* sanity: set value still above root => corrupt */
fsm_rebuild_page(page);

The final check is the first of the self-correcting measures: after bubbling, if the value we just stored is still greater than the root, some parent on the spine held a too-small value (corruption), so the whole page’s interior is recomputed by fsm_rebuild_page.

fsm_search_avail is the heart of the in-page search. It starts at the round-robin fp_next_slot hint, then performs the “expanding search triangle” walk: at each step move one node right (wrapping within the level via rightneighbor) and climb to the parent, doubling the covered subtree each time until it finds a node ≥ minvalue; then it descends to a satisfying leaf, preferring the left child:

// fsm_search_avail — fsmpage.c (expanding-triangle ascent, then greedy descent)
if (fsmpage->fp_nodes[0] < minvalue)
return -1; /* root says no page qualifies: O(1) miss */
target = fsmpage->fp_next_slot;
if (target < 0 || target >= LeafNodesPerPage)
target = 0;
target += NonLeafNodesPerPage;
nodeno = target;
while (nodeno > 0) {
if (fsmpage->fp_nodes[nodeno] >= minvalue)
break;
nodeno = parentof(rightneighbor(nodeno)); /* move right, wrap, climb */
}
while (nodeno < NonLeafNodesPerPage) { /* descend to a leaf, prefer left */
int childnodeno = leftchild(nodeno);
if (childnodeno < NodesPerPage && fsmpage->fp_nodes[childnodeno] >= minvalue) {
nodeno = childnodeno; continue;
}
childnodeno++; /* try right child */
if (childnodeno < NodesPerPage && fsmpage->fp_nodes[childnodeno] >= minvalue)
nodeno = childnodeno;
else { /* neither child qualifies though parent promised: torn page -> rebuild */ }
}
slot = nodeno - NonLeafNodesPerPage;
fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);

Two design points are visible. The descent’s “neither child satisfies what the parent promised” branch is the second self-correcting measure: it logs a DEBUG1 “fixing corrupt FSM block”, upgrades to an exclusive lock if needed, calls fsm_rebuild_page, and goto restarts. And fp_next_slot is updated even under a shared lock — the comment is explicit that a garbled hint is cheaper than the concurrency hit of exclusive locking.

// fsm_search_avail — fsmpage.c (torn-page recovery during descent)
BufferGetTag(buf, &rlocator, &forknum, &blknum);
elog(DEBUG1, "fixing corrupt FSM block %u, relation %u/%u/%u",
blknum, rlocator.spcOid, rlocator.dbOid, rlocator.relNumber);
if (!exclusive_lock_held) {
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
exclusive_lock_held = true;
}
fsm_rebuild_page(page);
MarkBufferDirtyHint(buf, false);
goto restart;

rightneighbor implements the wrap-around using a two’s-complement trick — the leftmost node at each level has index 2^level − 1, so ((x+1) & x) == 0 detects that stepping right has fallen off the right edge onto the next level’s leftmost node, in which case it folds back to the parent:

// rightneighbor — fsmpage.c (move right within a level, wrap to parent at the edge)
x++;
if (((x + 1) & x) == 0) /* (x+1) a power of two => x is leftmost of next level */
x = parentof(x);
return x;

fsm_rebuild_page recomputes every interior node bottom-up (from the last non-leaf back to the root), and fsm_truncate_avail zeroes leaf slots ≥ n then rebuilds — both used by the recovery and truncation paths respectively. fsm_get_avail, fsm_get_max_avail (returns fp_nodes[0]) are lock-free single-byte reads.

Cross-page navigation and addressing (freespace.c)

Section titled “Cross-page navigation and addressing (freespace.c)”

freespace.c works in logical addresses — an FSMAddress {level, logpageno} plus a slot — and converts to a physical block only at I/O time. The four navigation helpers mirror the in-page macros but at page granularity, using SlotsPerFSMPage as the fanout:

// fsm_get_location / fsm_get_parent / fsm_get_child — freespace.c
addr.level = FSM_BOTTOM_LEVEL; /* heap block -> level-0 addr+slot */
addr.logpageno = heapblk / SlotsPerFSMPage;
*slot = heapblk % SlotsPerFSMPage;
/* parent: */ parent.level = child.level + 1;
parent.logpageno = child.logpageno / SlotsPerFSMPage;
*slot = child.logpageno % SlotsPerFSMPage;
/* child: */ child.level = parent.level - 1;
child.logpageno = parent.logpageno * SlotsPerFSMPage + slot;

fsm_logical_to_physical flattens a logical address into the depth-first physical block layout described in the README (the root at block 0, then each subtree laid out depth-first). It counts the leaf and upper pages preceding the target:

// fsm_logical_to_physical — freespace.c (logical (level,logpageno) -> physical block)
leafno = addr.logpageno;
for (l = 0; l < addr.level; l++)
leafno *= SlotsPerFSMPage; /* logical page -> its first bottom leaf */
pages = 0;
for (l = 0; l < FSM_TREE_DEPTH; l++) {
pages += leafno + 1; /* count pages at each level up to the leaf */
leafno /= SlotsPerFSMPage;
}
pages -= addr.level; /* subtract lower levels not actually traversed */
return pages - 1; /* 0-based block number */

fsm_readbuf is the read/extend gateway: it consults (and lazily refreshes) the smgr_cached_nblocks[FSM_FORKNUM] size cache, reads with RBM_ZERO_ON_ERROR, extends via fsm_extend (ExtendBufferedRelTo with EB_CREATE_FORK_IF_NEEDED) when extend is true, and PageInits a freshly zeroed page under a recheck-after-lock guard against concurrent initializers.

fsm_search descends the cross-page tree from the root. At each level it fsm_search_avails the page under a shared lock, releasing the parent before descending (one page locked at a time, as the README’s “Locking” section mandates). Three things make it robust rather than naive:

// fsm_search — freespace.c (descend root->leaf, repair stale parents, verify EOF)
if (addr.level == FSM_BOTTOM_LEVEL) {
BlockNumber blkno = fsm_get_heap_blk(addr, slot);
if (fsm_does_block_exist(rel, blkno)) { /* (1) guard: block must really exist */
ReleaseBuffer(buf);
return blkno;
}
/* Block past EOF: zero the slot, restart from root */
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
fsm_set_avail(page, slot, 0);
MarkBufferDirtyHint(buf, false);
UnlockReleaseBuffer(buf);
if (restarts++ > 10000) return InvalidBlockNumber;
addr = FSM_ROOT_ADDRESS;
}
/* ... at a non-bottom level, if the child search failed the parent was stale: */
parent = fsm_get_parent(addr, &parentslot);
fsm_set_and_search(rel, parent, parentslot, max_avail, 0); /* (2) self-correct upper node */
if (restarts++ > 10000) return InvalidBlockNumber; /* (3) emergency valve */
addr = FSM_ROOT_ADDRESS;

(1) The fsm_does_block_exist guard prevents handing back a block that the FSM believes is free but that never made it to disk (a post-crash hazard explained below); the bad slot is zeroed and the search restarts. (2) If a leaf search fails after the parent pointed here — the upper node’s max was stale — the parent is corrected via fsm_set_and_search so the loop cannot re-trap, then restarts. (3) A 10000-iteration emergency valve guards against pathological looping on badly out-of-date upper pages.

fsm_set_and_search is the shared update primitive: it exclusive-locks the page, fsm_set_avails the slot (dirtying as a hint), and — when minValue > 0 — immediately fsm_search_avails the same page so a record-then-find can avoid a second lock. RecordAndGetPageWithFreeSpace exploits this to return a nearby page when the failed page and a fit share an FSM page.

RelationGetBufferForTuple is the heap’s page-selection loop. It computes a targetFreeSpace, tries the cached target block, then asks the FSM, then loops: each page that proves too small is reported and a new candidate fetched in one call, until a fit is found or the relation must be extended.

// RelationGetBufferForTuple — hio.c (FSM-driven insert placement loop)
if (targetBlock == InvalidBlockNumber && use_fsm)
targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
/* ... loop: lock targetBlock, recheck real free space ... */
pageFreeSpace = PageGetHeapFreeSpace(page);
if (targetFreeSpace <= pageFreeSpace) {
RelationSetTargetBlock(relation, targetBlock);
return buffer; /* FSM was right: use this page */
}
/* FSM over-promised (stale): record the truth, get another candidate */
targetBlock = RecordAndGetPageWithFreeSpace(relation, targetBlock,
pageFreeSpace, targetFreeSpace);

This is exactly the “caller must be prepared for the page to have too little space” contract GetPageWithFreeSpace’s comment warns about — the heap verifies under the page lock it has to take anyway, and feeds the corrected value back so the FSM converges. After a bulk extension, RelationAddBlocks records the new pages’ free space via RecordPageWithFreeSpace and FreeSpaceMapVacuumRange so they become visible to other backends. (The surrounding insert mechanics — buffer locking order, visibility-map pinning — belong to postgres-heap-am.md.)

VACUUM upper-page refresh (fsm_vacuum_page, callers in vacuumlazy.c)

Section titled “VACUUM upper-page refresh (fsm_vacuum_page, callers in vacuumlazy.c)”

VACUUM updates the leaves as it processes heap pages (via RecordPageWithFreeSpace, which bubbles within the bottom page), but the upper pages only get refreshed by FreeSpaceMapVacuum[Range], whose recursive guts are fsm_vacuum_page. It walks the tree depth-first (matching the physical layout for I/O efficiency), recomputes each interior slot from its child’s true max, and resets fp_next_slot to 0 to bias future searches toward low-numbered pages (helping a later VACUUM truncate the tail):

// fsm_vacuum_page — freespace.c (depth-first refresh of upper nodes from children)
if (addr.level > FSM_BOTTOM_LEVEL) {
for (slot = start_slot; slot <= end_slot; slot++) {
if (!eof)
child_avail = fsm_vacuum_page(rel, fsm_get_child(addr, slot),
start, end, &eof);
else
child_avail = 0;
if (fsm_get_avail(page, slot) != child_avail) { /* upper node was stale */
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
fsm_set_avail(page, slot, child_avail);
MarkBufferDirtyHint(buf, false);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}
}
}
max_avail = fsm_get_max_avail(page);
((FSMPage) PageGetContents(page))->fp_next_slot = 0; /* bias toward low pages */

vacuumlazy.c calls FreeSpaceMapVacuumRange periodically (every VACUUM_FSM_EVERY_PAGES ≈ 8 GB for index-less tables) during the scan and once more at the end, so newly freed space becomes searchable without waiting for the whole VACUUM to finish. The VACUUM driver itself is postgres-vacuum.md.

indexfsm.c reuses the entire heap FSM machinery but collapses the 256 categories to two: an index page is either wholly free or in use. GetFreeIndexPage asks for BLCKSZ/2 (anything ≥ half-page maps to “free”), and the record helpers encode the two states as the byte extremes:

// indexfsm.c — binary free/used encoding over the heap FSM
BlockNumber GetFreeIndexPage(Relation rel) {
BlockNumber blkno = GetPageWithFreeSpace(rel, BLCKSZ / 2);
if (blkno != InvalidBlockNumber)
RecordUsedIndexPage(rel, blkno); /* claim it atomically-ish */
return blkno;
}
void RecordFreeIndexPage(Relation rel, BlockNumber b) { RecordPageWithFreeSpace(rel, b, BLCKSZ - 1); }
void RecordUsedIndexPage(Relation rel, BlockNumber b) { RecordPageWithFreeSpace(rel, b, 0); }

This is how nbtree recycles deleted pages: a page deemed recyclable is RecordFreeIndexPaged, and _bt_getbuf’s “get a free page” path pulls it back via GetFreeIndexPage. (See postgres-nbtree.md for the recyclability gate.)

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
FSM_CATEGORIES, FSM_CAT_STEP, MaxFSMRequestSizesrc/backend/storage/freespace/freespace.c64–66
FSM_TREE_DEPTH, FSM_ROOT_LEVEL, FSM_BOTTOM_LEVELsrc/backend/storage/freespace/freespace.c75–78
FSMAddress structsrc/backend/storage/freespace/freespace.c84–88
GetPageWithFreeSpacesrc/backend/storage/freespace/freespace.c136
RecordAndGetPageWithFreeSpacesrc/backend/storage/freespace/freespace.c153
RecordPageWithFreeSpacesrc/backend/storage/freespace/freespace.c193
XLogRecordPageWithFreeSpacesrc/backend/storage/freespace/freespace.c210
GetRecordedFreeSpacesrc/backend/storage/freespace/freespace.c253
FreeSpaceMapPrepareTruncateRelsrc/backend/storage/freespace/freespace.c284
FreeSpaceMapVacuum / FreeSpaceMapVacuumRangesrc/backend/storage/freespace/freespace.c367 / 386
fsm_space_avail_to_cat / fsm_space_needed_to_catsrc/backend/storage/freespace/freespace.c401 / 441
fsm_logical_to_physicalsrc/backend/storage/freespace/freespace.c464
fsm_get_location / fsm_get_parent / fsm_get_childsrc/backend/storage/freespace/freespace.c500 / 526 / 544
fsm_readbuf / fsm_extendsrc/backend/storage/freespace/freespace.c563 / 638
fsm_set_and_searchsrc/backend/storage/freespace/freespace.c655
fsm_searchsrc/backend/storage/freespace/freespace.c687
fsm_vacuum_pagesrc/backend/storage/freespace/freespace.c821
fsm_does_block_existsrc/backend/storage/freespace/freespace.c935
leftchild / rightchild / parentof macrossrc/backend/storage/freespace/fsmpage.c29–31
rightneighborsrc/backend/storage/freespace/fsmpage.c36
fsm_set_availsrc/backend/storage/freespace/fsmpage.c62
fsm_get_avail / fsm_get_max_availsrc/backend/storage/freespace/fsmpage.c121 / 137
fsm_search_availsrc/backend/storage/freespace/fsmpage.c157
fsm_truncate_availsrc/backend/storage/freespace/fsmpage.c312
fsm_rebuild_pagesrc/backend/storage/freespace/fsmpage.c341
FSMPageData, NonLeafNodesPerPage, SlotsPerFSMPagesrc/include/storage/fsm_internals.h24–61
GetFreeIndexPage / RecordFreeIndexPage / RecordUsedIndexPagesrc/backend/storage/freespace/indexfsm.c37 / 51 / 61
RelationGetBufferForTuple (FSM calls)src/backend/access/heap/hio.c584 / 737 / 759
VACUUM_FSM_EVERY_PAGES, FreeSpaceMapVacuumRange callssrc/backend/access/heap/vacuumlazy.c202 / 1301 / 1488 / 1537
  • Free space is quantized to one byte = 256 categories of BLCKSZ/256 bytes. Verified in freespace.c: FSM_CATEGORIES 256, FSM_CAT_STEP (BLCKSZ / FSM_CATEGORIES), MaxFSMRequestSize = MaxHeapTupleSize. fsm_space_avail_to_cat rounds down and caps non-top values at 254; fsm_space_needed_to_cat rounds up; category 255 is reserved for MaxFSMRequestSize exactly, per the leading comment. Confirmed against the README’s “one map byte to each page … free space divided by BLCKSZ/256.”

  • An FSM page body is a complete binary max-heap, non-leaf nodes first then leaves. Verified in fsm_internals.h: fp_nodes[FLEXIBLE_ARRAY_MEMBER] with NonLeafNodesPerPage = BLCKSZ/2 − 1, LeafNodesPerPage = NodesPerPage − NonLeafNodesPerPage, SlotsPerFSMPage = LeafNodesPerPage. The macros leftchild/rightchild/parentof in fsmpage.c confirm the array embedding with root at index 0. fsm_get_max_avail returns fp_nodes[0].

  • The cross-page tree has fixed height 3 at default BLCKSZ, root at physical block 0. Verified: FSM_TREE_DEPTH ((SlotsPerFSMPage >= 1626) ? 3 : 4), FSM_ROOT_LEVEL = FSM_TREE_DEPTH − 1, FSM_BOTTOM_LEVEL 0, and FSM_ROOT_ADDRESS = {FSM_ROOT_LEVEL, 0}. The README’s “y = n + (n/F + 1) + …” formula matches the loop in fsm_logical_to_physical. The comment that 1626^3 ≥ 2^32−1 is the basis for depth 3 is in the FSM_TREE_DEPTH comment.

  • A leaf in an upper FSM page equals the root of the lower FSM page it covers. Stated in the README (“The root node within each page has the same value as the corresponding leaf node on its parent page”) and enforced by fsm_vacuum_page, which sets fsm_set_avail(page, slot, child_avail) where child_avail is the recursive return = the child page’s fsm_get_max_avail.

  • Search uses shared locks, update uses exclusive, and only one page is locked at a time during descent. Verified in fsm_search (BUFFER_LOCK_SHARE, parent released via UnlockReleaseBuffer/ReleaseBuffer before descending) and fsm_set_and_search (BUFFER_LOCK_EXCLUSIVE). Matches the README “Locking” section.

  • fp_next_slot is a round-robin hint updated even under a shared lock. Verified in fsm_search_avail: fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0) with the comment justifying the shared-lock update; the fsm_internals.h comment explains it is int (not uint16) for atomic fetch/store. fsm_vacuum_page resets it to 0.

  • The FSM is not WAL-logged; writes are MarkBufferDirtyHint, reads are RBM_ZERO_ON_ERROR. Verified across fsm_set_and_search, fsm_search, fsm_vacuum_page (MarkBufferDirtyHint) and fsm_readbuf/fsm_extend (RBM_ZERO_ON_ERROR). The README “Recovery” section states the rationale. The lone exception is XLogRecordPageWithFreeSpace, which uses MarkBufferDirty because MarkBufferDirtyHint is a no-op during recovery with checksums — verified in its inline comment.

  • Three self-correcting measures repair corruption without WAL. (1) fsm_set_avail: after bubbling, if (value > fsmpage->fp_nodes[0]) fsm_rebuild_page(page). (2) fsm_search_avail: the “neither child satisfies” branch logs DEBUG1 "fixing corrupt FSM block", calls fsm_rebuild_page, goto restart. (3) fsm_search: a failed lower search corrects the parent via fsm_set_and_search(rel, parent, parentslot, max_avail, 0).

  • fsm_does_block_exist guards against post-crash phantom free space. Verified in freespace.c: fsm_search and RecordAndGetPageWithFreeSpace both call it before returning a block; it compares against smgr_cached_nblocks[MAIN_FORKNUM] and falls back to RelationGetNumberOfBlocks. The README “Recovery” paragraph explains the hazard: relation extension is not WAL-logged, so an on-disk FSM slot may point at a PageIsNew block that never reached disk after replay.

  • VACUUM updates leaves inline and upper pages periodically. Verified: vacuumlazy.c calls RecordPageWithFreeSpace per heap page and FreeSpaceMapVacuumRange every VACUUM_FSM_EVERY_PAGES and at end of scan. fsm_vacuum_page is depth-first and resets fp_next_slot.

  • The heap insert path verifies FSM answers and feeds corrections back. Verified in RelationGetBufferForTuple (hio.c): GetPageWithFreeSpace for the initial target, PageGetHeapFreeSpace recheck under lock, and RecordAndGetPageWithFreeSpace on a miss. The function’s header comment explicitly anticipates stale FSM info and the retry loop.

  • indexfsm.c reuses the heap FSM with a binary free/used encoding. Verified: RecordFreeIndexPage records BLCKSZ − 1, RecordUsedIndexPage records 0, GetFreeIndexPage requests BLCKSZ/2 and marks the page used. IndexFreeSpaceMapVacuum just calls FreeSpaceMapVacuum.

  1. Real-world fp_next_slot contention relief. The round-robin start pointer is designed to spread concurrent inserters across pages, but the doc asserts the benefit qualitatively. The actual reduction in buffer-lock contention vs. a fixed start, under N concurrent COPY/INSERT backends, is not measured here. Investigation path: instrument fsm_search_avail hit distribution under a synthetic many-writer workload.

  2. How often the corruption self-repair paths actually fire in production. The DEBUG1 "fixing corrupt FSM block" and the fsm_set_avail rebuild are designed for torn pages after crashes, but their real frequency (and whether they ever mask a deeper bug rather than benign torn-page loss) is unknown from the code. Investigation path: collect DEBUG1 rates across a fleet with data_checksums on vs. off.

  3. The exact convergence behavior of stale upper nodes under heavy churn. fsm_search’s self-correcting parent update and the 10000-restart valve bound the loop, but how far the upper pages can drift before the next FreeSpaceMapVacuum — and whether that drift ever causes measurable misplacement / bloat between vacuums — is workload-dependent and untested here. Investigation path: compare GetRecordedFreeSpace against PageGetHeapFreeSpace sampled across a churning table mid-vacuum-interval.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”
  • The fork-as-separate-relation model vs. inline FSIP pages. PostgreSQL keeps the FSM in its own fork (FSM_FORKNUM), physically distinct from the main data fork, so the map can grow and shrink independently and never interleaves with heap pages. DB2’s FSIP and Oracle’s freelist/ASSM bitmaps instead interleave free-space-inventory pages within the tablespace at fixed strides. The fork model trades one extra smgr open and a second set of buffers for a cleaner truncation story (FreeSpaceMapPrepareTruncateRel shrinks the fork in lockstep with the heap) and the ability to drop the whole map by deleting one fork. A side-by-side of fork-relation vs. interleaved-FSIP truncation and crash recovery would sharpen what the indirection buys.

  • CUBRID’s file/extent free-space management. CUBRID tracks free space at the file-manager / extent level rather than with a per-page quantized byte map plus an in-page binary heap. Comparing CUBRID’s allocation bitmap and “numerable file” sector accounting against PostgreSQL’s lossy, non-WAL-logged byte map would frame the durability/precision tradeoff: CUBRID’s structures are recovery-managed allocation metadata, whereas the PostgreSQL FSM is an explicitly approximate, self-healing hint. (See the CUBRID file-manager analysis in the cubrid tree — raw/code-analysis/cubrid/file-manager-*.)

  • Quantized byte map vs. exact free-byte directories. The 256-category encoding (FSM_CATEGORIES, FSM_CAT_STEP) is a deliberate fidelity-for-size trade: one byte per page keeps a 1 TB table’s map around 32 MB instead of the ~512 MB an exact 4-byte-per-page directory would cost. Engines that store exact free counts pay the size and must page more of the directory in; PostgreSQL accepts that two requests of slightly different size map to the same category and that the recheck under page lock (PageGetHeapFreeSpace in hio.c) catches the rounding error. Quantifying the misplacement rate induced by FSM_CAT_STEP granularity on a realistic tuple-size mix is an open measurement.

  • Non-WAL-logged hint vs. logged allocation metadata (ARIES lineage). The FSM’s refusal to WAL-log (writes via MarkBufferDirtyHint, reads via RBM_ZERO_ON_ERROR, rebuild-on-corruption) sits at one extreme of the recovery spectrum. ARIES-style systems (Mohan et al. 1992; dbms-papers/aries.md) log space-map / page-allocation changes so the map is exactly reconstructable by redo. PostgreSQL instead makes the map cheaply re-derivable (VACUUM’s FreeSpaceMapVacuum refreshes upper nodes; the heap recheck repairs leaves on the fly) and tolerates post-crash phantom entries via fsm_does_block_exist. The lone WAL exception, XLogRecordPageWithFreeSpace, exists only so a redo path (heap/visibility-map replay) can seed FSM values it already happens to know. A focused note mapping each FSM self-healing measure to the ARIES recovery guarantee it deliberately forgoes would be a research-grade companion.

  • Round-robin fp_next_slot vs. partitioned / per-backend allocation. The shared-lock round-robin start pointer is PostgreSQL’s lightweight answer to insert-point contention: concurrent backends fan out across pages instead of all hammering the lowest-numbered fitting page. Heavier schemes — per-backend free lists, hashed allocation regions, or Oracle ASSM’s multiple freelist-groups — reduce contention further at the cost of more bloat and more metadata. Measuring fp_next_slot fan-out against a partitioned allocator under many-writer COPY is the natural frontier (also raised in Open Questions).

  • Index-page recycling via the same machinery (indexfsm.c). Reusing the heap FSM with a binary free/used encoding (BLCKSZ-1 vs 0, request BLCKSZ/2) is an elegant economy, but it means index-page reuse inherits the FSM’s lossiness and non-durability. Other engines maintain a dedicated, often logged, free-page list for index structures. Whether the shared-FSM approach ever delays index-page reuse enough to matter (vs. nbtree’s own recyclability gate in postgres-nbtree.md) is an unmeasured interaction.

In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)

Section titled “In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)”
  • src/backend/storage/freespace/README — the design document: one map byte per page at BLCKSZ/256 granularity, the in-page binary max-heap, the cross-page fanout tree, the search/update operations, the “nice properties” argument, locking, recovery (no WAL, torn-page tolerance, the phantom-block hazard), and the higher-level structure / addressing math.
  • src/backend/storage/freespace/freespace.c — the public API (GetPageWithFreeSpace, RecordAndGetPageWithFreeSpace, RecordPageWithFreeSpace, XLogRecordPageWithFreeSpace, GetRecordedFreeSpace), the category encoding (fsm_space_avail_to_cat, fsm_space_needed_to_cat), the cross-page addressing (fsm_logical_to_physical, fsm_get_location, fsm_get_parent, fsm_get_child), buffer/extend helpers (fsm_readbuf, fsm_extend), the search/update drivers (fsm_search, fsm_set_and_search), VACUUM (FreeSpaceMapVacuum/Range, fsm_vacuum_page), truncation (FreeSpaceMapPrepareTruncateRel), and the post-crash guard (fsm_does_block_exist).
  • src/backend/storage/freespace/fsmpage.c — the in-page binary-heap operations: leftchild/rightchild/parentof/rightneighbor macros, fsm_set_avail (leaf set + bubble-up + rebuild-on-overflow), fsm_get_avail, fsm_get_max_avail (root = fp_nodes[0]), fsm_search_avail (the expanding-triangle search anchored at fp_next_slot), fsm_truncate_avail, fsm_rebuild_page.
  • src/backend/storage/freespace/indexfsm.cGetFreeIndexPage, RecordFreeIndexPage, RecordUsedIndexPage, IndexFreeSpaceMapVacuum: the binary free/used reuse of the heap FSM for whole index pages.
  • src/include/storage/fsm_internals.hFSMPageData (fp_next_slot, fp_nodes[FLEXIBLE_ARRAY_MEMBER]), NodesPerPage, NonLeafNodesPerPage, LeafNodesPerPage, SlotsPerFSMPage, and the rationale for fp_next_slot being int.
  • src/include/storage/freespace.h, src/include/storage/indexfsm.h — the declared public surfaces consumed by the heap and the index AMs.
  • src/backend/access/heap/hio.cRelationGetBufferForTuple: the insert path that calls GetPageWithFreeSpace, rechecks with PageGetHeapFreeSpace under the page lock, and feeds misses back via RecordAndGetPageWithFreeSpace.
  • src/backend/access/heap/vacuumlazy.c — the VACUUM driver that records per-page free space and triggers FreeSpaceMapVacuumRange (VACUUM_FSM_EVERY_PAGES).
  • Database Internals (Petrov 2019), part I — slotted-page layout, free-space management directories (bitmaps, per-page bytes, tree-structured maps), and the fidelity-vs-size / hint-vs-durable framing (knowledge/research/dbms-general/).
  • Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. on “Storage and File Structure” — heap file organization, the page-selection problem, and free-list / free-space directory designs (knowledge/research/dbms-general/).
  • Mohan, C. et al. (1992). “ARIES: A Transaction Recovery Method…” ACM TODS 17(1). The logged-recovery baseline against which the FSM’s deliberately non-WAL-logged, self-healing design is contrasted (knowledge/research/dbms-papers/aries.md).

Sibling docs (cross-references — mechanism owned there, not duplicated here)

Section titled “Sibling docs (cross-references — mechanism owned there, not duplicated here)”
  • postgres-heap-am.md — heap insertion mechanics; RelationGetBufferForTuple is summarized here only for its FSM interaction, not its full retry/extend protocol.
  • postgres-smgr-md.md — the smgr/md fork I/O layer that fsm_readbuf and fsm_extend ride on (FSM_FORKNUM, RBM_ZERO_ON_ERROR, relation extension).
  • postgres-vacuum.md — the VACUUM driver (lazy_scan_heap) that owns the cadence of RecordPageWithFreeSpace and FreeSpaceMapVacuumRange; this doc covers only fsm_vacuum_page itself.
  • postgres-nbtree.md — the index-page recyclability gate that decides when a deleted btree page is RecordFreeIndexPaged and pulled back via GetFreeIndexPage.
  • postgres-buffer-manager.mdMarkBufferDirtyHint, content locks, and the pin/lock protocol the FSM’s shared/exclusive page accesses rely on.
  • postgres-page-layout.mdPageGetHeapFreeSpace and the slotted-page model that defines what a page’s “free space” actually means.