PostgreSQL Free Space Map — Tracking Reusable Space per Page
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”A heap file in a page-oriented database is an unordered collection of fixed-size pages, and a tuple may be inserted onto any page that has room for it. This freedom is exactly what makes free-space tracking a problem. When a transaction wants to insert a 200-byte row, the storage engine must answer a deceptively simple question: which page should receive it? The naive answers are both bad. Always appending to the last page wastes the space freed by deletes and updates elsewhere, so the file grows without bound even when half of it is empty (the classic “bloat” failure mode). Scanning every page to find one with room is correct but costs O(n) page reads per insert, which is catastrophic on a large relation. The free-space map is the data structure that makes the question answerable in sublinear time.
The textbook framing (Database System Concepts, ch. on “Storage and File Structure”, and Database Internals part I on “File Formats” / “slotted pages”) distinguishes two layers. The slotted page is the intra-page layout: a page header, an array of line pointers (item identifiers) growing from the front, and tuple bodies growing from the back, with a contiguous free region in the middle. The amount of free space on a page is therefore a single derived number — the gap between the line-pointer array and the tuple region (minus what a future line pointer would consume). The free-space management structure is the inter-page layer: a directory that records, for each data page, roughly how much of that gap is available, so the engine can locate a fit without touching the data pages themselves.
Database Internals describes the spectrum of free-space directories used in practice: a simple bitmap (one bit per page: has-room / full), a per-page byte or word giving a coarse fill level, and tree-structured directories that let a “find a page with ≥ X free” query short-circuit at the top. The central design tension is fidelity versus size. A directory that records the exact free byte count per page is precise but large — for a 1 TB table at 8 KB pages that is 128 million entries, and at 4 bytes each that is 512 MB of directory that itself must be paged in and searched. A directory that is small enough to keep hot must therefore be lossy: it approximates free space, accepts that its answer may be stale or wrong, and pushes the burden of verification onto the caller, which already has to lock the target page anyway.
A second axis is search cost. A flat array of per-page fill levels still costs O(n) to scan for a fit. The standard remedy is a tree of maxima: each interior node stores the maximum free-space value among its descendants, so a search for ”≥ X” can prune any subtree whose maximum is below X and descend only along a satisfying path — O(log n) instead of O(n). This is structurally a segment tree / tournament tree specialized to the “range-max + point-update” workload: point updates (one page’s free space changed) bubble a new max up the spine, and queries descend greedily. The same shape underlies interval trees, Fenwick-style aggregates, and the loser-tree used in external sorting; the FSM is this idea applied to disk pages.
A third, subtler axis is concurrency and durability. Free-space information is fundamentally a hint: if it is wrong, the worst outcome is a wasted page read or a slightly larger file, never a corrupt or lost row. That observation licenses two aggressive optimizations that would be illegal for real data. First, the directory need not be crash-safe — it can be rebuilt or self-corrected after a crash, so it need not be write-ahead logged, eliminating a large class of WAL traffic. Second, concurrent readers can tolerate a racy, unlatched “where to look next” pointer, because a stale hint merely sends two backends to adjacent pages rather than colliding on one. The engineering art of a free-space map lies in pushing approximation as far as correctness allows, and then building cheap self-healing measures to mop up the consequences.
Common DBMS Design
Section titled “Common DBMS Design”Across engines the free-space directory converges on a small set of patterns, differentiated mainly by how lossy they are and how they recover from the loss.
Page-directory / FSIP (Free Space Inventory Page) designs. DB2 and several commercial engines interleave special directory pages into the data file at fixed intervals: every Nth page is an FSIP whose body is an array of small fill-level codes for the N−1 data pages that follow it. A search reads the FSIP, picks a candidate, and verifies on arrival. The directory is co-located with the data (good locality) but its size scales linearly with the table and it offers no top-level pruning — a search may still touch many FSIPs. Oracle’s ASSM (Automatic Segment Space Management) generalizes this into a small hierarchy of bitmap blocks (L1/L2/L3) that classify each data block into a few fullness buckets (e.g., 0–25%, 25–50%, …), with higher levels summarizing lower ones so a search can prune. The bucket granularity is the fidelity knob.
Bitmap free-space management. The coarsest form stores one or two bits per page (free / partially-full / full). This is tiny and cache-friendly but imprecise: a “partially full” page might hold a small tuple but not a large one, so the caller must verify and the map must be conservative. SQL Server’s PFS (Page Free Space) pages take a middle road: one byte per page recording a coarse fullness percent plus allocation status bits, with PFS pages spaced every ~8000 pages; GAM/SGAM pages above them track extent-level allocation. The recurring theme is byte-per-page as the sweet spot between a single bit (too coarse) and an exact count (too large).
Tree-of-maxima designs. To get sublinear search, engines layer a max-tree over the per-page values. The two recurring choices are (a) a tree within a directory page (a heap-ordered array so an in-page search is logarithmic) and (b) a tree across directory pages (an upper directory page whose entries summarize lower directory pages). PostgreSQL is unusual in committing fully to both layers and making them structurally identical, so the same in-page search code serves every level.
Durability stance. Engines split on whether the free-space directory is recoverable. Some WAL-log directory updates so the map is always exact after recovery; this is simple but adds log volume proportional to insert/delete traffic. Others treat the map as a reconstructible hint and do not log it, relying on (i) verification at use time, (ii) periodic background rebuild (typically folded into VACUUM/maintenance), and (iii) idempotent self-repair on read. The non-logged approach trades a small risk of transient inaccuracy for a large reduction in WAL and lock contention. PostgreSQL takes the non-logged route deliberately and documents the self-correcting measures that make it safe.
Concurrency stance. A free-space directory is a write-hot shared structure:
every insert that lands on a page wants to update that page’s entry, and every
insert that needs a target wants to read near the top. Engines reduce contention
by (i) using shared locks for search and exclusive only for update, (ii)
spreading concurrent inserters across different candidate pages rather than
funneling them to “the” emptiest page, and (iii) tolerating racy hint updates.
PostgreSQL’s round-robin fp_next_slot pointer, updated even under a shared
lock, is a direct expression of (ii) and (iii).
flowchart TD
subgraph Approx["Free-space directory: the fidelity/size/durability tradeoffs"]
A["Exact byte count per page<br/>precise but large, must be paged in"]
B["Byte per page = 256 buckets<br/>PostgreSQL FSM, SQL Server PFS<br/>small, lossy, verify on use"]
C["1-2 bits per page<br/>free / partial / full<br/>tiny, very lossy"]
end
B --> D["Layer a tree of maxima on top<br/>node = max of children<br/>search '>= X' prunes subtrees<br/>O(log n)"]
D --> E["Durability choice"]
E --> F["WAL-logged: always exact after crash<br/>more log volume"]
E --> G["Not logged: hint only<br/>verify on use + rebuild in VACUUM<br/>PostgreSQL's choice"]
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL gives every heap relation (and every index AM that asks for one) its
own free-space map, stored not inline but as a separate relation fork:
alongside the main data fork and the visibility map fork, there is an
FSM_FORKNUM fork addressed by the same RelFileLocator. This fork-based design
(introduced in 8.4, replacing a fixed-size shared FSM) means the map grows with
the relation, is truncated with it, and is read through the ordinary buffer
manager and smgr layer — no special allocation. The fork’s I/O is the concern
of postgres-smgr-md.md; here we treat it as “a file of FSM pages numbered from
0.”
One byte per page, 256 fullness categories. The map records free space at a
granularity of BLCKSZ/256 bytes. freespace.c defines the quantization:
// freespace.c — category constants#define FSM_CATEGORIES 256#define FSM_CAT_STEP (BLCKSZ / FSM_CATEGORIES)#define MaxFSMRequestSize MaxHeapTupleSizeAt the default 8 KB BLCKSZ, FSM_CAT_STEP is 32 bytes, so category c means
“between c·32 and (c+1)·32−1 bytes free.” Category 0 is “effectively full,”
category 255 is “at least MaxFSMRequestSize (= MaxHeapTupleSize) free.” The
asymmetry of category 255 is deliberate: a request for exactly
MaxFSMRequestSize must be satisfiable by a completely empty page, so the top
category is pinned to that boundary rather than to a round multiple of the step.
fsm_space_avail_to_cat rounds down (a page is reported as holding no more
than it really does) while fsm_space_needed_to_cat rounds up (a request is
never under-estimated), so the map never lies optimistically about a fit:
// freespace.c — fsm_space_needed_to_cat (rounds up; never under-promises)cat = (needed + FSM_CAT_STEP - 1) / FSM_CAT_STEP;if (cat > 255) cat = 255;return (uint8) cat;A binary max-heap inside each FSM page. An FSM page is not a flat array of
these bytes. Its body (FSMPageData.fp_nodes) is a complete binary tree stored
in array order, where a non-leaf node holds the maximum of its two children
and the leaves hold the per-heap-page categories. The header reserves
NonLeafNodesPerPage (= BLCKSZ/2 − 1) interior slots; the rest are leaves
(SlotsPerFSMPage, ~4000 at 8 KB). The root node fp_nodes[0] is therefore the
maximum free-space category of every heap page covered by this FSM page — a
single byte that answers “is there any page here with ≥ X?” without descending.
// fsm_internals.h — the FSM page bodytypedef struct{ int fp_next_slot; /* round-robin search start hint */ uint8 fp_nodes[FLEXIBLE_ARRAY_MEMBER]; /* binary tree: non-leaf then leaf */} FSMPageData;A fanout tree across FSM pages. A single FSM page covers ~4000 heap pages.
To cover a whole relation (up to 2^32−1 blocks) PostgreSQL stacks FSM pages into
a fixed-height tree: a leaf slot in an upper-level FSM page mirrors the root
of a lower-level FSM page, so the same “max of children” invariant holds across
pages as within a page. With ~4000-way fanout, three levels reach 4000^3 > 2^32,
so FSM_TREE_DEPTH is 3 at default BLCKSZ (4 at very small block sizes). Level
0 is the bottom (one leaf per heap page), level FSM_ROOT_LEVEL is the root, and
the root FSM page is always physical block 0.
flowchart TD R["Root FSM page (level 2, phys blk 0)<br/>fp_nodes[0] = max free over ENTIRE relation"] R --> L1a["Level-1 FSM page #0<br/>covers ~4000 level-0 pages"] R --> L1b["Level-1 FSM page #1"] L1a --> L0a["Level-0 FSM page #0<br/>4000 leaves = 4000 heap pages"] L1a --> L0b["Level-0 FSM page #1"] L0a --> H0["heap blk 0 .. 3999<br/>one leaf byte each"] L0b --> H1["heap blk 4000 .. 7999"]
The genius of the design is that both trees obey the same rule, so one
routine (fsm_search_avail) searches inside a page at every level, and
freespace.c only has to walk between pages. The README states the contract
crisply: “The root node within each page has the same value as the corresponding
leaf node on its parent page.”
Search, update, vacuum. Three public verbs sit on top:
GetPageWithFreeSpace (find a heap block with ≥ N bytes), RecordPageWithFreeSpace
(set one page’s recorded free space), and RecordAndGetPageWithFreeSpace (the
combo the heap uses on a retry — record the failed page and find a nearby
alternative in one lock acquisition). FreeSpaceMapVacuum[Range] re-derives the
upper-level maxima from the bottom level after VACUUM has refreshed the leaves.
// freespace.c — the public search entry pointBlockNumberGetPageWithFreeSpace(Relation rel, Size spaceNeeded){ uint8 min_cat = fsm_space_needed_to_cat(spaceNeeded); return fsm_search(rel, min_cat);}Not WAL-logged, self-correcting. Writes use MarkBufferDirtyHint, not
MarkBufferDirty; reads use RBM_ZERO_ON_ERROR so a torn or checksum-failed FSM
page is silently zeroed rather than erroring. Correctness is restored by four
mechanisms covered below: a root-vs-set-value sanity check that triggers
fsm_rebuild_page; a descent-time corruption check in fsm_search_avail; the
fsm_does_block_exist guard that refuses to hand back a block past end-of-file;
and VACUUM’s periodic FreeSpaceMapVacuum. The only exception to non-logging is
truncation (FreeSpaceMapPrepareTruncateRel), which uses a real
MarkBufferDirty plus an optional log_newpage_buffer so a truncated tail does
not resurrect after replay.
Source Walkthrough
Section titled “Source Walkthrough”The code splits cleanly along the two-tree boundary. fsmpage.c knows the
in-page binary tree and nothing about forks or addressing — it treats a page as
“a black box with SlotsPerFSMPage slots.” freespace.c knows the cross-page
fanout, the byte/category encoding, buffer I/O, and the public API, and calls
into fsmpage.c for every in-page operation. indexfsm.c is a thin re-skin of
the public API for index page recycling.
In-page tree: set, search, rebuild (fsmpage.c)
Section titled “In-page tree: set, search, rebuild (fsmpage.c)”The page tree is navigated with three macros — leftchild(x) = 2x+1,
rightchild(x) = 2x+2, parentof(x) = (x−1)/2 — the standard array embedding of
a complete binary tree with the root at index 0.
fsm_set_avail writes a leaf and bubbles the change upward, stopping early at
the first parent whose max does not change:
// fsm_set_avail — fsmpage.c (set one leaf, propagate the new max up the spine)int nodeno = NonLeafNodesPerPage + slot;FSMPage fsmpage = (FSMPage) PageGetContents(page);oldvalue = fsmpage->fp_nodes[nodeno];if (oldvalue == value && value <= fsmpage->fp_nodes[0]) return false; /* unchanged and not above root: nothing to do */fsmpage->fp_nodes[nodeno] = value;do { uint8 newvalue = 0; nodeno = parentof(nodeno); lchild = leftchild(nodeno); rchild = lchild + 1; newvalue = fsmpage->fp_nodes[lchild]; if (rchild < NodesPerPage) newvalue = Max(newvalue, fsmpage->fp_nodes[rchild]); oldvalue = fsmpage->fp_nodes[nodeno]; if (oldvalue == newvalue) break; /* parent max didn't move: stop bubbling */ fsmpage->fp_nodes[nodeno] = newvalue;} while (nodeno > 0);if (value > fsmpage->fp_nodes[0]) /* sanity: set value still above root => corrupt */ fsm_rebuild_page(page);The final check is the first of the self-correcting measures: after bubbling, if
the value we just stored is still greater than the root, some parent on the
spine held a too-small value (corruption), so the whole page’s interior is
recomputed by fsm_rebuild_page.
fsm_search_avail is the heart of the in-page search. It starts at the
round-robin fp_next_slot hint, then performs the “expanding search triangle”
walk: at each step move one node right (wrapping within the level via
rightneighbor) and climb to the parent, doubling the covered subtree each time
until it finds a node ≥ minvalue; then it descends to a satisfying leaf,
preferring the left child:
// fsm_search_avail — fsmpage.c (expanding-triangle ascent, then greedy descent)if (fsmpage->fp_nodes[0] < minvalue) return -1; /* root says no page qualifies: O(1) miss */target = fsmpage->fp_next_slot;if (target < 0 || target >= LeafNodesPerPage) target = 0;target += NonLeafNodesPerPage;nodeno = target;while (nodeno > 0) { if (fsmpage->fp_nodes[nodeno] >= minvalue) break; nodeno = parentof(rightneighbor(nodeno)); /* move right, wrap, climb */}while (nodeno < NonLeafNodesPerPage) { /* descend to a leaf, prefer left */ int childnodeno = leftchild(nodeno); if (childnodeno < NodesPerPage && fsmpage->fp_nodes[childnodeno] >= minvalue) { nodeno = childnodeno; continue; } childnodeno++; /* try right child */ if (childnodeno < NodesPerPage && fsmpage->fp_nodes[childnodeno] >= minvalue) nodeno = childnodeno; else { /* neither child qualifies though parent promised: torn page -> rebuild */ }}slot = nodeno - NonLeafNodesPerPage;fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);Two design points are visible. The descent’s “neither child satisfies what the
parent promised” branch is the second self-correcting measure: it logs a
DEBUG1 “fixing corrupt FSM block”, upgrades to an exclusive lock if needed,
calls fsm_rebuild_page, and goto restarts. And fp_next_slot is updated
even under a shared lock — the comment is explicit that a garbled hint is
cheaper than the concurrency hit of exclusive locking.
// fsm_search_avail — fsmpage.c (torn-page recovery during descent)BufferGetTag(buf, &rlocator, &forknum, &blknum);elog(DEBUG1, "fixing corrupt FSM block %u, relation %u/%u/%u", blknum, rlocator.spcOid, rlocator.dbOid, rlocator.relNumber);if (!exclusive_lock_held) { LockBuffer(buf, BUFFER_LOCK_UNLOCK); LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); exclusive_lock_held = true;}fsm_rebuild_page(page);MarkBufferDirtyHint(buf, false);goto restart;rightneighbor implements the wrap-around using a two’s-complement trick — the
leftmost node at each level has index 2^level − 1, so ((x+1) & x) == 0 detects
that stepping right has fallen off the right edge onto the next level’s leftmost
node, in which case it folds back to the parent:
// rightneighbor — fsmpage.c (move right within a level, wrap to parent at the edge)x++;if (((x + 1) & x) == 0) /* (x+1) a power of two => x is leftmost of next level */ x = parentof(x);return x;fsm_rebuild_page recomputes every interior node bottom-up (from the last
non-leaf back to the root), and fsm_truncate_avail zeroes leaf slots ≥ n then
rebuilds — both used by the recovery and truncation paths respectively.
fsm_get_avail, fsm_get_max_avail (returns fp_nodes[0]) are lock-free
single-byte reads.
Cross-page navigation and addressing (freespace.c)
Section titled “Cross-page navigation and addressing (freespace.c)”freespace.c works in logical addresses — an FSMAddress {level, logpageno}
plus a slot — and converts to a physical block only at I/O time. The four
navigation helpers mirror the in-page macros but at page granularity, using
SlotsPerFSMPage as the fanout:
// fsm_get_location / fsm_get_parent / fsm_get_child — freespace.caddr.level = FSM_BOTTOM_LEVEL; /* heap block -> level-0 addr+slot */addr.logpageno = heapblk / SlotsPerFSMPage;*slot = heapblk % SlotsPerFSMPage;/* parent: */ parent.level = child.level + 1;parent.logpageno = child.logpageno / SlotsPerFSMPage;*slot = child.logpageno % SlotsPerFSMPage;/* child: */ child.level = parent.level - 1;child.logpageno = parent.logpageno * SlotsPerFSMPage + slot;fsm_logical_to_physical flattens a logical address into the depth-first
physical block layout described in the README (the root at block 0, then each
subtree laid out depth-first). It counts the leaf and upper pages preceding the
target:
// fsm_logical_to_physical — freespace.c (logical (level,logpageno) -> physical block)leafno = addr.logpageno;for (l = 0; l < addr.level; l++) leafno *= SlotsPerFSMPage; /* logical page -> its first bottom leaf */pages = 0;for (l = 0; l < FSM_TREE_DEPTH; l++) { pages += leafno + 1; /* count pages at each level up to the leaf */ leafno /= SlotsPerFSMPage;}pages -= addr.level; /* subtract lower levels not actually traversed */return pages - 1; /* 0-based block number */fsm_readbuf is the read/extend gateway: it consults (and lazily refreshes) the
smgr_cached_nblocks[FSM_FORKNUM] size cache, reads with RBM_ZERO_ON_ERROR,
extends via fsm_extend (ExtendBufferedRelTo with
EB_CREATE_FORK_IF_NEEDED) when extend is true, and PageInits a freshly
zeroed page under a recheck-after-lock guard against concurrent initializers.
Search workhorse (fsm_search)
Section titled “Search workhorse (fsm_search)”fsm_search descends the cross-page tree from the root. At each level it
fsm_search_avails the page under a shared lock, releasing the parent before
descending (one page locked at a time, as the README’s “Locking” section
mandates). Three things make it robust rather than naive:
// fsm_search — freespace.c (descend root->leaf, repair stale parents, verify EOF)if (addr.level == FSM_BOTTOM_LEVEL) { BlockNumber blkno = fsm_get_heap_blk(addr, slot); if (fsm_does_block_exist(rel, blkno)) { /* (1) guard: block must really exist */ ReleaseBuffer(buf); return blkno; } /* Block past EOF: zero the slot, restart from root */ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); fsm_set_avail(page, slot, 0); MarkBufferDirtyHint(buf, false); UnlockReleaseBuffer(buf); if (restarts++ > 10000) return InvalidBlockNumber; addr = FSM_ROOT_ADDRESS;}/* ... at a non-bottom level, if the child search failed the parent was stale: */parent = fsm_get_parent(addr, &parentslot);fsm_set_and_search(rel, parent, parentslot, max_avail, 0); /* (2) self-correct upper node */if (restarts++ > 10000) return InvalidBlockNumber; /* (3) emergency valve */addr = FSM_ROOT_ADDRESS;(1) The fsm_does_block_exist guard prevents handing back a block that the FSM
believes is free but that never made it to disk (a post-crash hazard explained
below); the bad slot is zeroed and the search restarts. (2) If a leaf search
fails after the parent pointed here — the upper node’s max was stale — the parent
is corrected via fsm_set_and_search so the loop cannot re-trap, then restarts.
(3) A 10000-iteration emergency valve guards against pathological looping on
badly out-of-date upper pages.
fsm_set_and_search is the shared update primitive: it exclusive-locks the page,
fsm_set_avails the slot (dirtying as a hint), and — when minValue > 0 —
immediately fsm_search_avails the same page so a record-then-find can avoid a
second lock. RecordAndGetPageWithFreeSpace exploits this to return a nearby
page when the failed page and a fit share an FSM page.
Heap insert placement (hio.c)
Section titled “Heap insert placement (hio.c)”RelationGetBufferForTuple is the heap’s page-selection loop. It computes a
targetFreeSpace, tries the cached target block, then asks the FSM, then loops:
each page that proves too small is reported and a new candidate fetched in one
call, until a fit is found or the relation must be extended.
// RelationGetBufferForTuple — hio.c (FSM-driven insert placement loop)if (targetBlock == InvalidBlockNumber && use_fsm) targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);/* ... loop: lock targetBlock, recheck real free space ... */pageFreeSpace = PageGetHeapFreeSpace(page);if (targetFreeSpace <= pageFreeSpace) { RelationSetTargetBlock(relation, targetBlock); return buffer; /* FSM was right: use this page */}/* FSM over-promised (stale): record the truth, get another candidate */targetBlock = RecordAndGetPageWithFreeSpace(relation, targetBlock, pageFreeSpace, targetFreeSpace);This is exactly the “caller must be prepared for the page to have too little
space” contract GetPageWithFreeSpace’s comment warns about — the heap verifies
under the page lock it has to take anyway, and feeds the corrected value back so
the FSM converges. After a bulk extension, RelationAddBlocks records the new
pages’ free space via RecordPageWithFreeSpace and FreeSpaceMapVacuumRange so
they become visible to other backends. (The surrounding insert mechanics —
buffer locking order, visibility-map pinning — belong to postgres-heap-am.md.)
VACUUM upper-page refresh (fsm_vacuum_page, callers in vacuumlazy.c)
Section titled “VACUUM upper-page refresh (fsm_vacuum_page, callers in vacuumlazy.c)”VACUUM updates the leaves as it processes heap pages (via
RecordPageWithFreeSpace, which bubbles within the bottom page), but the upper
pages only get refreshed by FreeSpaceMapVacuum[Range], whose recursive guts are
fsm_vacuum_page. It walks the tree depth-first (matching the physical layout
for I/O efficiency), recomputes each interior slot from its child’s true max, and
resets fp_next_slot to 0 to bias future searches toward low-numbered pages
(helping a later VACUUM truncate the tail):
// fsm_vacuum_page — freespace.c (depth-first refresh of upper nodes from children)if (addr.level > FSM_BOTTOM_LEVEL) { for (slot = start_slot; slot <= end_slot; slot++) { if (!eof) child_avail = fsm_vacuum_page(rel, fsm_get_child(addr, slot), start, end, &eof); else child_avail = 0; if (fsm_get_avail(page, slot) != child_avail) { /* upper node was stale */ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); fsm_set_avail(page, slot, child_avail); MarkBufferDirtyHint(buf, false); LockBuffer(buf, BUFFER_LOCK_UNLOCK); } }}max_avail = fsm_get_max_avail(page);((FSMPage) PageGetContents(page))->fp_next_slot = 0; /* bias toward low pages */vacuumlazy.c calls FreeSpaceMapVacuumRange periodically (every
VACUUM_FSM_EVERY_PAGES ≈ 8 GB for index-less tables) during the scan and once
more at the end, so newly freed space becomes searchable without waiting for the
whole VACUUM to finish. The VACUUM driver itself is postgres-vacuum.md.
Index page recycling (indexfsm.c)
Section titled “Index page recycling (indexfsm.c)”indexfsm.c reuses the entire heap FSM machinery but collapses the 256
categories to two: an index page is either wholly free or in use. GetFreeIndexPage
asks for BLCKSZ/2 (anything ≥ half-page maps to “free”), and the record helpers
encode the two states as the byte extremes:
// indexfsm.c — binary free/used encoding over the heap FSMBlockNumber GetFreeIndexPage(Relation rel) { BlockNumber blkno = GetPageWithFreeSpace(rel, BLCKSZ / 2); if (blkno != InvalidBlockNumber) RecordUsedIndexPage(rel, blkno); /* claim it atomically-ish */ return blkno;}void RecordFreeIndexPage(Relation rel, BlockNumber b) { RecordPageWithFreeSpace(rel, b, BLCKSZ - 1); }void RecordUsedIndexPage(Relation rel, BlockNumber b) { RecordPageWithFreeSpace(rel, b, 0); }This is how nbtree recycles deleted pages: a page deemed recyclable is
RecordFreeIndexPaged, and _bt_getbuf’s “get a free page” path pulls it back
via GetFreeIndexPage. (See postgres-nbtree.md for the recyclability gate.)
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
FSM_CATEGORIES, FSM_CAT_STEP, MaxFSMRequestSize | src/backend/storage/freespace/freespace.c | 64–66 |
FSM_TREE_DEPTH, FSM_ROOT_LEVEL, FSM_BOTTOM_LEVEL | src/backend/storage/freespace/freespace.c | 75–78 |
FSMAddress struct | src/backend/storage/freespace/freespace.c | 84–88 |
GetPageWithFreeSpace | src/backend/storage/freespace/freespace.c | 136 |
RecordAndGetPageWithFreeSpace | src/backend/storage/freespace/freespace.c | 153 |
RecordPageWithFreeSpace | src/backend/storage/freespace/freespace.c | 193 |
XLogRecordPageWithFreeSpace | src/backend/storage/freespace/freespace.c | 210 |
GetRecordedFreeSpace | src/backend/storage/freespace/freespace.c | 253 |
FreeSpaceMapPrepareTruncateRel | src/backend/storage/freespace/freespace.c | 284 |
FreeSpaceMapVacuum / FreeSpaceMapVacuumRange | src/backend/storage/freespace/freespace.c | 367 / 386 |
fsm_space_avail_to_cat / fsm_space_needed_to_cat | src/backend/storage/freespace/freespace.c | 401 / 441 |
fsm_logical_to_physical | src/backend/storage/freespace/freespace.c | 464 |
fsm_get_location / fsm_get_parent / fsm_get_child | src/backend/storage/freespace/freespace.c | 500 / 526 / 544 |
fsm_readbuf / fsm_extend | src/backend/storage/freespace/freespace.c | 563 / 638 |
fsm_set_and_search | src/backend/storage/freespace/freespace.c | 655 |
fsm_search | src/backend/storage/freespace/freespace.c | 687 |
fsm_vacuum_page | src/backend/storage/freespace/freespace.c | 821 |
fsm_does_block_exist | src/backend/storage/freespace/freespace.c | 935 |
leftchild / rightchild / parentof macros | src/backend/storage/freespace/fsmpage.c | 29–31 |
rightneighbor | src/backend/storage/freespace/fsmpage.c | 36 |
fsm_set_avail | src/backend/storage/freespace/fsmpage.c | 62 |
fsm_get_avail / fsm_get_max_avail | src/backend/storage/freespace/fsmpage.c | 121 / 137 |
fsm_search_avail | src/backend/storage/freespace/fsmpage.c | 157 |
fsm_truncate_avail | src/backend/storage/freespace/fsmpage.c | 312 |
fsm_rebuild_page | src/backend/storage/freespace/fsmpage.c | 341 |
FSMPageData, NonLeafNodesPerPage, SlotsPerFSMPage | src/include/storage/fsm_internals.h | 24–61 |
GetFreeIndexPage / RecordFreeIndexPage / RecordUsedIndexPage | src/backend/storage/freespace/indexfsm.c | 37 / 51 / 61 |
RelationGetBufferForTuple (FSM calls) | src/backend/access/heap/hio.c | 584 / 737 / 759 |
VACUUM_FSM_EVERY_PAGES, FreeSpaceMapVacuumRange calls | src/backend/access/heap/vacuumlazy.c | 202 / 1301 / 1488 / 1537 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verified facts
Section titled “Verified facts”-
Free space is quantized to one byte = 256 categories of
BLCKSZ/256bytes. Verified infreespace.c:FSM_CATEGORIES256,FSM_CAT_STEP(BLCKSZ / FSM_CATEGORIES),MaxFSMRequestSize=MaxHeapTupleSize.fsm_space_avail_to_catrounds down and caps non-top values at 254;fsm_space_needed_to_catrounds up; category 255 is reserved forMaxFSMRequestSizeexactly, per the leading comment. Confirmed against the README’s “one map byte to each page … free space divided by BLCKSZ/256.” -
An FSM page body is a complete binary max-heap, non-leaf nodes first then leaves. Verified in
fsm_internals.h:fp_nodes[FLEXIBLE_ARRAY_MEMBER]withNonLeafNodesPerPage = BLCKSZ/2 − 1,LeafNodesPerPage = NodesPerPage − NonLeafNodesPerPage,SlotsPerFSMPage = LeafNodesPerPage. The macrosleftchild/rightchild/parentofinfsmpage.cconfirm the array embedding with root at index 0.fsm_get_max_availreturnsfp_nodes[0]. -
The cross-page tree has fixed height 3 at default
BLCKSZ, root at physical block 0. Verified:FSM_TREE_DEPTH ((SlotsPerFSMPage >= 1626) ? 3 : 4),FSM_ROOT_LEVEL = FSM_TREE_DEPTH − 1,FSM_BOTTOM_LEVEL 0, andFSM_ROOT_ADDRESS = {FSM_ROOT_LEVEL, 0}. The README’s “y = n + (n/F + 1) + …” formula matches the loop infsm_logical_to_physical. The comment that 1626^3 ≥ 2^32−1 is the basis for depth 3 is in theFSM_TREE_DEPTHcomment. -
A leaf in an upper FSM page equals the root of the lower FSM page it covers. Stated in the README (“The root node within each page has the same value as the corresponding leaf node on its parent page”) and enforced by
fsm_vacuum_page, which setsfsm_set_avail(page, slot, child_avail)wherechild_availis the recursive return = the child page’sfsm_get_max_avail. -
Search uses shared locks, update uses exclusive, and only one page is locked at a time during descent. Verified in
fsm_search(BUFFER_LOCK_SHARE, parent released viaUnlockReleaseBuffer/ReleaseBufferbefore descending) andfsm_set_and_search(BUFFER_LOCK_EXCLUSIVE). Matches the README “Locking” section. -
fp_next_slotis a round-robin hint updated even under a shared lock. Verified infsm_search_avail:fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0)with the comment justifying the shared-lock update; thefsm_internals.hcomment explains it isint(notuint16) for atomic fetch/store.fsm_vacuum_pageresets it to 0. -
The FSM is not WAL-logged; writes are
MarkBufferDirtyHint, reads areRBM_ZERO_ON_ERROR. Verified acrossfsm_set_and_search,fsm_search,fsm_vacuum_page(MarkBufferDirtyHint) andfsm_readbuf/fsm_extend(RBM_ZERO_ON_ERROR). The README “Recovery” section states the rationale. The lone exception isXLogRecordPageWithFreeSpace, which usesMarkBufferDirtybecauseMarkBufferDirtyHintis a no-op during recovery with checksums — verified in its inline comment. -
Three self-correcting measures repair corruption without WAL. (1)
fsm_set_avail: after bubbling,if (value > fsmpage->fp_nodes[0]) fsm_rebuild_page(page). (2)fsm_search_avail: the “neither child satisfies” branch logsDEBUG1 "fixing corrupt FSM block", callsfsm_rebuild_page,goto restart. (3)fsm_search: a failed lower search corrects the parent viafsm_set_and_search(rel, parent, parentslot, max_avail, 0). -
fsm_does_block_existguards against post-crash phantom free space. Verified infreespace.c:fsm_searchandRecordAndGetPageWithFreeSpaceboth call it before returning a block; it compares againstsmgr_cached_nblocks[MAIN_FORKNUM]and falls back toRelationGetNumberOfBlocks. The README “Recovery” paragraph explains the hazard: relation extension is not WAL-logged, so an on-disk FSM slot may point at aPageIsNewblock that never reached disk after replay. -
VACUUM updates leaves inline and upper pages periodically. Verified:
vacuumlazy.ccallsRecordPageWithFreeSpaceper heap page andFreeSpaceMapVacuumRangeeveryVACUUM_FSM_EVERY_PAGESand at end of scan.fsm_vacuum_pageis depth-first and resetsfp_next_slot. -
The heap insert path verifies FSM answers and feeds corrections back. Verified in
RelationGetBufferForTuple(hio.c):GetPageWithFreeSpacefor the initial target,PageGetHeapFreeSpacerecheck under lock, andRecordAndGetPageWithFreeSpaceon a miss. The function’s header comment explicitly anticipates stale FSM info and the retry loop. -
indexfsm.creuses the heap FSM with a binary free/used encoding. Verified:RecordFreeIndexPagerecordsBLCKSZ − 1,RecordUsedIndexPagerecords0,GetFreeIndexPagerequestsBLCKSZ/2and marks the page used.IndexFreeSpaceMapVacuumjust callsFreeSpaceMapVacuum.
Open questions
Section titled “Open questions”-
Real-world
fp_next_slotcontention relief. The round-robin start pointer is designed to spread concurrent inserters across pages, but the doc asserts the benefit qualitatively. The actual reduction in buffer-lock contention vs. a fixed start, under N concurrent COPY/INSERT backends, is not measured here. Investigation path: instrumentfsm_search_availhit distribution under a synthetic many-writer workload. -
How often the corruption self-repair paths actually fire in production. The
DEBUG1 "fixing corrupt FSM block"and thefsm_set_availrebuild are designed for torn pages after crashes, but their real frequency (and whether they ever mask a deeper bug rather than benign torn-page loss) is unknown from the code. Investigation path: collectDEBUG1rates across a fleet withdata_checksumson vs. off. -
The exact convergence behavior of stale upper nodes under heavy churn.
fsm_search’s self-correcting parent update and the 10000-restart valve bound the loop, but how far the upper pages can drift before the nextFreeSpaceMapVacuum— and whether that drift ever causes measurable misplacement / bloat between vacuums — is workload-dependent and untested here. Investigation path: compareGetRecordedFreeSpaceagainstPageGetHeapFreeSpacesampled across a churning table mid-vacuum-interval.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”-
The fork-as-separate-relation model vs. inline FSIP pages. PostgreSQL keeps the FSM in its own fork (
FSM_FORKNUM), physically distinct from the main data fork, so the map can grow and shrink independently and never interleaves with heap pages. DB2’s FSIP and Oracle’s freelist/ASSM bitmaps instead interleave free-space-inventory pages within the tablespace at fixed strides. The fork model trades one extrasmgropen and a second set of buffers for a cleaner truncation story (FreeSpaceMapPrepareTruncateRelshrinks the fork in lockstep with the heap) and the ability to drop the whole map by deleting one fork. A side-by-side of fork-relation vs. interleaved-FSIP truncation and crash recovery would sharpen what the indirection buys. -
CUBRID’s file/extent free-space management. CUBRID tracks free space at the file-manager / extent level rather than with a per-page quantized byte map plus an in-page binary heap. Comparing CUBRID’s allocation bitmap and “numerable file” sector accounting against PostgreSQL’s lossy, non-WAL-logged byte map would frame the durability/precision tradeoff: CUBRID’s structures are recovery-managed allocation metadata, whereas the PostgreSQL FSM is an explicitly approximate, self-healing hint. (See the CUBRID file-manager analysis in the cubrid tree —
raw/code-analysis/cubrid/file-manager-*.) -
Quantized byte map vs. exact free-byte directories. The 256-category encoding (
FSM_CATEGORIES,FSM_CAT_STEP) is a deliberate fidelity-for-size trade: one byte per page keeps a 1 TB table’s map around 32 MB instead of the ~512 MB an exact 4-byte-per-page directory would cost. Engines that store exact free counts pay the size and must page more of the directory in; PostgreSQL accepts that two requests of slightly different size map to the same category and that the recheck under page lock (PageGetHeapFreeSpaceinhio.c) catches the rounding error. Quantifying the misplacement rate induced byFSM_CAT_STEPgranularity on a realistic tuple-size mix is an open measurement. -
Non-WAL-logged hint vs. logged allocation metadata (ARIES lineage). The FSM’s refusal to WAL-log (writes via
MarkBufferDirtyHint, reads viaRBM_ZERO_ON_ERROR, rebuild-on-corruption) sits at one extreme of the recovery spectrum. ARIES-style systems (Mohan et al. 1992;dbms-papers/aries.md) log space-map / page-allocation changes so the map is exactly reconstructable by redo. PostgreSQL instead makes the map cheaply re-derivable (VACUUM’sFreeSpaceMapVacuumrefreshes upper nodes; the heap recheck repairs leaves on the fly) and tolerates post-crash phantom entries viafsm_does_block_exist. The lone WAL exception,XLogRecordPageWithFreeSpace, exists only so a redo path (heap/visibility-map replay) can seed FSM values it already happens to know. A focused note mapping each FSM self-healing measure to the ARIES recovery guarantee it deliberately forgoes would be a research-grade companion. -
Round-robin
fp_next_slotvs. partitioned / per-backend allocation. The shared-lock round-robin start pointer is PostgreSQL’s lightweight answer to insert-point contention: concurrent backends fan out across pages instead of all hammering the lowest-numbered fitting page. Heavier schemes — per-backend free lists, hashed allocation regions, or Oracle ASSM’s multiple freelist-groups — reduce contention further at the cost of more bloat and more metadata. Measuringfp_next_slotfan-out against a partitioned allocator under many-writer COPY is the natural frontier (also raised in Open Questions). -
Index-page recycling via the same machinery (
indexfsm.c). Reusing the heap FSM with a binary free/used encoding (BLCKSZ-1vs0, requestBLCKSZ/2) is an elegant economy, but it means index-page reuse inherits the FSM’s lossiness and non-durability. Other engines maintain a dedicated, often logged, free-page list for index structures. Whether the shared-FSM approach ever delays index-page reuse enough to matter (vs. nbtree’s own recyclability gate inpostgres-nbtree.md) is an unmeasured interaction.
Sources
Section titled “Sources”In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)
Section titled “In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)”src/backend/storage/freespace/README— the design document: one map byte per page atBLCKSZ/256granularity, the in-page binary max-heap, the cross-page fanout tree, the search/update operations, the “nice properties” argument, locking, recovery (no WAL, torn-page tolerance, the phantom-block hazard), and the higher-level structure / addressing math.src/backend/storage/freespace/freespace.c— the public API (GetPageWithFreeSpace,RecordAndGetPageWithFreeSpace,RecordPageWithFreeSpace,XLogRecordPageWithFreeSpace,GetRecordedFreeSpace), the category encoding (fsm_space_avail_to_cat,fsm_space_needed_to_cat), the cross-page addressing (fsm_logical_to_physical,fsm_get_location,fsm_get_parent,fsm_get_child), buffer/extend helpers (fsm_readbuf,fsm_extend), the search/update drivers (fsm_search,fsm_set_and_search), VACUUM (FreeSpaceMapVacuum/Range,fsm_vacuum_page), truncation (FreeSpaceMapPrepareTruncateRel), and the post-crash guard (fsm_does_block_exist).src/backend/storage/freespace/fsmpage.c— the in-page binary-heap operations:leftchild/rightchild/parentof/rightneighbormacros,fsm_set_avail(leaf set + bubble-up + rebuild-on-overflow),fsm_get_avail,fsm_get_max_avail(root =fp_nodes[0]),fsm_search_avail(the expanding-triangle search anchored atfp_next_slot),fsm_truncate_avail,fsm_rebuild_page.src/backend/storage/freespace/indexfsm.c—GetFreeIndexPage,RecordFreeIndexPage,RecordUsedIndexPage,IndexFreeSpaceMapVacuum: the binary free/used reuse of the heap FSM for whole index pages.src/include/storage/fsm_internals.h—FSMPageData(fp_next_slot,fp_nodes[FLEXIBLE_ARRAY_MEMBER]),NodesPerPage,NonLeafNodesPerPage,LeafNodesPerPage,SlotsPerFSMPage, and the rationale forfp_next_slotbeingint.src/include/storage/freespace.h,src/include/storage/indexfsm.h— the declared public surfaces consumed by the heap and the index AMs.src/backend/access/heap/hio.c—RelationGetBufferForTuple: the insert path that callsGetPageWithFreeSpace, rechecks withPageGetHeapFreeSpaceunder the page lock, and feeds misses back viaRecordAndGetPageWithFreeSpace.src/backend/access/heap/vacuumlazy.c— the VACUUM driver that records per-page free space and triggersFreeSpaceMapVacuumRange(VACUUM_FSM_EVERY_PAGES).
Papers and textbook chapters
Section titled “Papers and textbook chapters”- Database Internals (Petrov 2019), part I — slotted-page layout, free-space
management directories (bitmaps, per-page bytes, tree-structured maps), and the
fidelity-vs-size / hint-vs-durable framing
(
knowledge/research/dbms-general/). - Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. on
“Storage and File Structure” — heap file organization, the page-selection
problem, and free-list / free-space directory designs
(
knowledge/research/dbms-general/). - Mohan, C. et al. (1992). “ARIES: A Transaction Recovery Method…”
ACM TODS 17(1). The logged-recovery baseline against which the FSM’s
deliberately non-WAL-logged, self-healing design is contrasted
(
knowledge/research/dbms-papers/aries.md).
Sibling docs (cross-references — mechanism owned there, not duplicated here)
Section titled “Sibling docs (cross-references — mechanism owned there, not duplicated here)”postgres-heap-am.md— heap insertion mechanics;RelationGetBufferForTupleis summarized here only for its FSM interaction, not its full retry/extend protocol.postgres-smgr-md.md— thesmgr/mdfork I/O layer thatfsm_readbufandfsm_extendride on (FSM_FORKNUM,RBM_ZERO_ON_ERROR, relation extension).postgres-vacuum.md— the VACUUM driver (lazy_scan_heap) that owns the cadence ofRecordPageWithFreeSpaceandFreeSpaceMapVacuumRange; this doc covers onlyfsm_vacuum_pageitself.postgres-nbtree.md— the index-page recyclability gate that decides when a deleted btree page isRecordFreeIndexPaged and pulled back viaGetFreeIndexPage.postgres-buffer-manager.md—MarkBufferDirtyHint, content locks, and the pin/lock protocol the FSM’s shared/exclusive page accesses rely on.postgres-page-layout.md—PageGetHeapFreeSpaceand the slotted-page model that defines what a page’s “free space” actually means.