PostgreSQL smgr & md — The Storage Manager Switch and the Magnetic-Disk Driver
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Every relational engine must answer one basic question: given a logical address such as “page N of relation R”, how does the engine translate it to physical bytes on durable storage? The answer is the storage manager layer, which sits between the page-level buffer pool and the operating system. Database Internals (Petrov, ch. 6 “B-Tree Variants”) introduces the concept through the lens of page addressing, and Database System Concepts (Silberschatz, 7e, ch. 10 “Storage and File Structure”) frames it with the classic on-disk organization model: a relation is a sequence of fixed-size pages, the buffer manager fetches or flushes pages one at a time, and the storage layer handles the mechanics of physical location.
Two design choices govern what the storage layer has to do:
- The I/O unit. Virtually all DBMSes choose a fixed page size (4 KB–32 KB) that equals or is a multiple of the OS page size, so that a single kernel call transfers exactly one database page without partial-transfer complications. The buffer manager and the storage layer agree on this unit and never deal in anything smaller at the page-transfer boundary.
- How logical addresses map to physical files. A simple approach puts one relation in one file. Real engines extend this in two directions: (a) relation forks, meaning multiple independent file sequences per relation for auxiliary metadata (free-space map, visibility map, initialization fork), each accessible independently; and (b) file segmentation, meaning a single fork is split into multiple fixed-size OS files to avoid OS file-size limits and to give the file system more predictable file objects.
The storage manager is also where fsync discipline lives. A page written
to the kernel’s buffer cache is not durable; the engine must eventually call
fsync(2) on the underlying file. Three strategies exist: (a) fsync
synchronously on every write (correct but slow); (b) collect “dirty segments”
into a pending list and fsync them at checkpoint time, delegating to a
background process; and (c) use O_DIRECT to bypass the kernel buffer cache
entirely, relying on the engine’s own buffer pool as the only cache tier (the
path PG18’s async I/O makes increasingly viable). PostgreSQL uses strategy (b)
as its default, with strategy (c) available via io_direct_flags.
Common DBMS Design
Dispatch through a virtual method table
Most engines designed to support alternative storage back-ends (disk, WORM jukebox, NVM, S3) implement the storage layer as a vtable of function pointers keyed on a storage-manager identifier. The identifier is stored per relation in the catalog or in the relation descriptor. Dispatch costs one extra indirect call per page operation; real I/O dominates that cost entirely, so the indirection is essentially free.
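The dispatch pattern can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not PostgreSQL code: the names StorageOps, storage_switch, storage_read and storage_write, and the two toy back-ends, are hypothetical and exist only to show a table of function pointers indexed by a storage-manager identifier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DB_PAGE_SIZE 8192   /* illustrative page size */
#define MEM_PAGES 4         /* capacity of the toy in-memory back-end */

/* Hypothetical storage-manager interface: one function pointer per operation. */
typedef struct StorageOps
{
    const char *name;
    int (*read_page)(unsigned rel, unsigned blkno, char *dst);
    int (*write_page)(unsigned rel, unsigned blkno, const char *src);
} StorageOps;

/* Toy in-memory back-end standing in for a real driver. */
static char mem_store[MEM_PAGES][DB_PAGE_SIZE];

static int mem_read(unsigned rel, unsigned blkno, char *dst)
{
    (void) rel;
    if (blkno >= MEM_PAGES)
        return -1;
    memcpy(dst, mem_store[blkno], DB_PAGE_SIZE);
    return 0;
}

static int mem_write(unsigned rel, unsigned blkno, const char *src)
{
    (void) rel;
    if (blkno >= MEM_PAGES)
        return -1;
    memcpy(mem_store[blkno], src, DB_PAGE_SIZE);
    return 0;
}

/* Second toy back-end that rejects every request. */
static int null_read(unsigned rel, unsigned blkno, char *dst)
{
    (void) rel; (void) blkno; (void) dst;
    return -1;
}

static int null_write(unsigned rel, unsigned blkno, const char *src)
{
    (void) rel; (void) blkno; (void) src;
    return -1;
}

/* The switch: the identifier is simply an index into this table. */
static const StorageOps storage_switch[] = {
    {"mem", mem_read, mem_write},
    {"null", null_read, null_write},
};

#define N_STORAGE (sizeof storage_switch / sizeof storage_switch[0])

/* Dispatch wrappers: one indirect call, no other logic. */
int storage_read(size_t which, unsigned rel, unsigned blkno, char *dst)
{
    assert(which < N_STORAGE);
    return storage_switch[which].read_page(rel, blkno, dst);
}

int storage_write(size_t which, unsigned rel, unsigned blkno, const char *src)
{
    assert(which < N_STORAGE);
    return storage_switch[which].write_page(rel, blkno, src);
}
```

Callers never name a back-end function directly; they pass the identifier stored with the relation, so adding a back-end means adding one table entry.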
Cached handle objects
Opening a file is expensive. Engines cache open file descriptors in a
per-backend handle table keyed on the logical relation identifier. A
“handle” object bundles the relation locator, the open file descriptors for
each segment, and a small cache of metadata (current size, target insertion
block). The handle is looked up in the table before every I/O; if already
present the table lookup replaces the open(2) syscall.
Fork-based file layout
Since Postgres 8.4, the standard pattern is:
```
$PGDATA/base/<dbOid>/<relfileNumber>        ← main fork, segment 0
$PGDATA/base/<dbOid>/<relfileNumber>.1      ← main fork, segment 1
$PGDATA/base/<dbOid>/<relfileNumber>_fsm    ← free-space-map fork
$PGDATA/base/<dbOid>/<relfileNumber>_vm     ← visibility-map fork
$PGDATA/base/<dbOid>/<relfileNumber>_init   ← init fork (unlogged)
```
Each fork is an independent file sequence; forks grow and truncate independently. Adding forks is the minimal change that makes the FSM and VM possible without complex multiplexing inside the main file.
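The naming rule in the layout above can be written as a small path builder. This is a sketch under assumptions: the Fork enum, the fork_suffix table and segment_path() are hypothetical names, the path is relative to $PGDATA, and only the default-tablespace base/ directory is handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fork identifiers, in the order used by the layout above. */
typedef enum { FORK_MAIN, FORK_FSM, FORK_VM, FORK_INIT } Fork;

static const char *const fork_suffix[] = {"", "_fsm", "_vm", "_init"};

/*
 * Build "base/<dbOid>/<relfileNumber><suffix>[.<segno>]" into buf.
 * Segment 0 carries no ".N" suffix. Returns what snprintf returns.
 */
int segment_path(char *buf, size_t buflen, unsigned db_oid,
                 unsigned relfile_number, Fork fork, unsigned segno)
{
    assert((unsigned) fork <= FORK_INIT);
    if (segno == 0)
        return snprintf(buf, buflen, "base/%u/%u%s",
                        db_oid, relfile_number, fork_suffix[fork]);
    return snprintf(buf, buflen, "base/%u/%u%s.%u",
                    db_oid, relfile_number, fork_suffix[fork], segno);
}
```

For example, database 5, relation file 16384, main fork, segment 1 yields base/5/16384.1, while segment 0 of the free-space-map fork yields base/5/16384_fsm.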
Deferred fsync via a pending-ops list
Writing a page to the kernel’s cache creates an fsync obligation: before
the next checkpoint completes, that page’s segment must be fsynced. The naive
solution calls fsync on the write path and blocks. The standard DBMS pattern
is to record the obligation in a pending-ops list (keyed by
(tablespace, db, relfileNumber, fork, segment)) and let a background
checkpointer process drain the list. The write path is non-blocking; the
checkpointer does the heavy I/O during its checkpoint window.
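A pending-ops list of this kind can be modeled as a small deduplicating set. The sketch below is illustrative only: SegTag, PendingOps, pending_register(), pending_drain() and fake_sync() are hypothetical names, the capacity is deliberately tiny, and the sync step is a callback so the example runs without touching real files.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical key: (tablespace, db, relfileNumber, fork, segment). */
typedef struct SegTag
{
    unsigned tablespace, db, relfile, fork, segno;
} SegTag;

#define PENDING_CAP 4   /* tiny on purpose, to exercise the "full" case */

typedef struct PendingOps
{
    SegTag tags[PENDING_CAP];
    size_t n;
} PendingOps;

static bool tag_equal(const SegTag *a, const SegTag *b)
{
    return a->tablespace == b->tablespace && a->db == b->db &&
           a->relfile == b->relfile && a->fork == b->fork &&
           a->segno == b->segno;
}

/*
 * Record an fsync obligation. Duplicates are absorbed. Returns false when
 * the list is full, in which case the caller must fsync the segment itself.
 */
bool pending_register(PendingOps *p, SegTag tag)
{
    for (size_t i = 0; i < p->n; i++)
        if (tag_equal(&p->tags[i], &tag))
            return true;
    if (p->n == PENDING_CAP)
        return false;
    p->tags[p->n++] = tag;
    return true;
}

/*
 * Drain the list, as a background process would at checkpoint time.
 * Returns the number of successful syncs. A real engine would not simply
 * drop a failed sync as this sketch does.
 */
size_t pending_drain(PendingOps *p, int (*sync_fn)(const SegTag *))
{
    size_t done = 0;

    for (size_t i = 0; i < p->n; i++)
        if (sync_fn(&p->tags[i]) == 0)
            done++;
    p->n = 0;
    return done;
}

/* Stand-in for fsync(2) that only counts calls. */
int fake_sync_calls = 0;

int fake_sync(const SegTag *tag)
{
    (void) tag;
    fake_sync_calls++;
    return 0;
}
```

Many writes to the same segment collapse into one entry, so the drain cost scales with the number of dirty segments rather than the number of page writes.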
Theory ↔ PostgreSQL mapping
| Theory concept | PostgreSQL name |
|---|---|
| Storage manager vtable | f_smgr struct in smgr.c, single entry smgrsw[0] = md.c |
| Relation handle / cached FD | SMgrRelationData / SMgrRelation |
| Logical address (rel, page) | RelFileLocator + ForkNumber + BlockNumber |
| Fork | ForkNumber: MAIN_FORKNUM, FSM_FORKNUM, VISIBILITYMAP_FORKNUM, INIT_FORKNUM |
| File segment | MdfdVec — one per open segment; md_seg_fds[] array |
| Segment size limit | RELSEG_SIZE blocks (default 131 072 → 1 GB at 8 KB/block) |
| Pending fsync obligation | FileTag entry passed to RegisterSyncRequest |
| Deferred-fsync executor | checkpointer process calling ProcessSyncRequests |
PostgreSQL’s Approach
The smgr switch (smgr.c)
smgr.c is thin by design. It owns two data structures and twenty-odd
functions that are pure dispatch wrappers:
```c
/* f_smgr: smgr.c (the vtable) */
typedef struct f_smgr
{
    void        (*smgr_init) (void);
    void        (*smgr_open) (SMgrRelation reln);
    void        (*smgr_close) (SMgrRelation reln, ForkNumber forknum);
    void        (*smgr_create) (SMgrRelation reln, ForkNumber forknum,
                                bool isRedo);
    bool        (*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
    void        (*smgr_unlink) (RelFileLocatorBackend rlocator, ForkNumber forknum,
                                bool isRedo);
    void        (*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
                                BlockNumber blocknum, const void *buffer, bool skipFsync);
    void        (*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
                                    BlockNumber blocknum, int nblocks, bool skipFsync);
    bool        (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
                                  BlockNumber blocknum, int nblocks);
    uint32      (*smgr_maxcombine) (SMgrRelation reln, ForkNumber forknum,
                                    BlockNumber blocknum);
    void        (*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
                               BlockNumber blocknum,
                               void **buffers, BlockNumber nblocks);
    void        (*smgr_startreadv) (PgAioHandle *ioh,
                                    SMgrRelation reln, ForkNumber forknum,
                                    BlockNumber blocknum,
                                    void **buffers, BlockNumber nblocks);
    void        (*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
                                BlockNumber blocknum,
                                const void **buffers, BlockNumber nblocks,
                                bool skipFsync);
    void        (*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
                                   BlockNumber blocknum, BlockNumber nblocks);
    BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
    void        (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
                                  BlockNumber old_blocks, BlockNumber nblocks);
    void        (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
    void        (*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
    int         (*smgr_fd) (SMgrRelation reln, ForkNumber forknum,
                            BlockNumber blocknum, uint32 *off);
} f_smgr;

static const f_smgr smgrsw[] = {
    /* magnetic disk: the only entry */
    {
        .smgr_init = mdinit,
        .smgr_open = mdopen,
        .smgr_readv = mdreadv,
        .smgr_startreadv = mdstartreadv,
        /* ... */
    }
};
```
The README note is honest: “We retain the notion of a storage manager switch in case anyone ever wants to reintroduce other kinds of storage managers. Removing the switch layer would save nothing noticeable anyway, since storage-access operations are surely far more expensive than one extra layer of C function calls.”
SMgrRelation: the per-backend handle
Every access to a physical relation goes through an SMgrRelation handle.
smgropen() looks up or creates one in a per-backend hash table keyed on
RelFileLocatorBackend (tablespace OID + database OID + relation file number
+ backend number):
```c
/* SMgrRelationData: src/include/storage/smgr.h */
typedef struct SMgrRelationData
{
    RelFileLocatorBackend smgr_rlocator;    /* hash key, must be first */

    BlockNumber smgr_targblock;     /* current insertion target block */
    BlockNumber smgr_cached_nblocks[MAX_FORKNUM + 1];   /* cached fork sizes */

    int         smgr_which;         /* selector into smgrsw[], always 0 */

    /* md.c private fields */
    int         md_num_open_segs[MAX_FORKNUM + 1];
    struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];

    /* pinning */
    int         pincount;
    dlist_node  node;               /* link in unpinned_relns list when pincount==0 */
} SMgrRelationData;
```
The smgr_cached_nblocks field is a size cache per fork. It is invalidated on
an smgr cache-invalidation signal (PROCSIGNAL_BARRIER_SMGRRELEASE or
CacheInvalidateSmgr) and is only reliable during recovery — the comment in
smgrnblocks_cached() says plainly: “For now, this function uses cached values
only in recovery due to lack of a shared invalidation mechanism for changes in
file size.”
Lifetime and pinning. Handles created inside a transaction are unpinned
by default: they live on the unpinned_relns doubly-linked list and are
destroyed at AtEOXact_SMgr() (end of transaction). The relcache pins the
handle via smgrpin() to prevent it from being destroyed while the relcache
entry is alive. The checkpointer and auxiliary processes that operate outside
transactions manage their own lifetime by calling smgrdestroyall() at
appropriate points.
Signal-driven release. A backend can receive PROCSIGNAL_BARRIER_SMGRRELEASE
at any point, ordering it to close all open file descriptors immediately.
ProcessBarrierSmgrRelease() calls smgrreleaseall(), which iterates the
entire hash table and calls smgrrelease() on every entry. The handle objects
remain in the table; only the underlying OS file descriptors are closed.
```c
/* smgropen: smgr.c */
SMgrRelation
smgropen(RelFileLocator rlocator, ProcNumber backend)
{
    RelFileLocatorBackend brlocator;
    SMgrRelation reln;
    bool        found;

    HOLD_INTERRUPTS();

    if (SMgrRelationHash == NULL)
    {
        HASHCTL     ctl;

        ctl.keysize = sizeof(RelFileLocatorBackend);
        ctl.entrysize = sizeof(SMgrRelationData);
        SMgrRelationHash = hash_create("smgr relation table", 400,
                                       &ctl, HASH_ELEM | HASH_BLOBS);
        dlist_init(&unpinned_relns);
    }

    brlocator.locator = rlocator;
    brlocator.backend = backend;
    reln = (SMgrRelation) hash_search(SMgrRelationHash, &brlocator,
                                      HASH_ENTER, &found);
    if (!found)
    {
        reln->smgr_targblock = InvalidBlockNumber;
        for (int i = 0; i <= MAX_FORKNUM; ++i)
            reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
        reln->smgr_which = 0;       /* md.c only */
        reln->pincount = 0;
        dlist_push_tail(&unpinned_relns, &reln->node);
        smgrsw[reln->smgr_which].smgr_open(reln);
    }

    RESUME_INTERRUPTS();
    return reln;
}
```
Relation forks
The ForkNumber enum (in src/include/common/relpath.h) defines four forks:
| Fork | Value | File suffix | Purpose |
|---|---|---|---|
| MAIN_FORKNUM | 0 | (none) | Heap or index pages |
| FSM_FORKNUM | 1 | _fsm | Free-space map |
| VISIBILITYMAP_FORKNUM | 2 | _vm | Visibility map |
| INIT_FORKNUM | 3 | _init | Unlogged-table initializer |
MAX_FORKNUM is INIT_FORKNUM = 3. The SMgrRelationData struct allocates
[MAX_FORKNUM + 1] arrays for md_num_open_segs, md_seg_fds, and
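smgr_cached_nblocks. Most code passes MAIN_FORKNUM. The convenience
wrappers smgrread and smgrwrite in smgr.h still take a ForkNumber; what they
hard-code is the block count, forwarding a single block to smgrreadv and
smgrwritev.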
md.c: the magnetic-disk driver
md.c translates a logical (SMgrRelation, ForkNumber, BlockNumber) address
into POSIX pread/pwrite calls on the appropriate segment file. Its
internal state lives entirely inside the SMgrRelationData fields
(md_num_open_segs, md_seg_fds) and a MemoryContext called MdCxt for
allocating MdfdVec arrays.
MdfdVec: the per-segment file-descriptor wrapper
```c
/* _MdfdVec: md.c */
typedef struct _MdfdVec
{
    File        mdfd_vfd;       /* fd number in fd.c's virtual-fd pool */
    BlockNumber mdfd_segno;     /* segment number, from 0 */
} MdfdVec;
```
File is an integer handle into fd.c’s virtual file descriptor (VFD)
pool, not a raw OS fd. The VFD layer (src/backend/storage/file/fd.c)
manages a pool of OS file descriptors, evicting (closing) them in LRU order
when the process approaches the max_files_per_process limit and reopening
them on demand. MdfdVec therefore stores a stable VFD index, not a raw int
that could be invalidated by eviction.
Segment-file layout on disk
A relation fork with more than RELSEG_SIZE blocks (default 131 072 at
8 KB/block = 1 GB) spans multiple segment files. The naming convention:
```
base/<dbOid>/<relfileNumber>        ← segment 0 of MAIN_FORKNUM
base/<dbOid>/<relfileNumber>.1      ← segment 1
base/<dbOid>/<relfileNumber>.2      ← segment 2
...
base/<dbOid>/<relfileNumber>_fsm    ← segment 0 of FSM_FORKNUM
```
Figure 2: smgr switch → md.c → fork/segment file mapping
flowchart TD
CALL["smgrreadv / smgrwritev / smgrextend<br/>(rlocator, forknum, blocknum)"]
SW["smgrsw[reln->smgr_which]<br/>smgr_which == 0 (md.c only)"]
GETSEG["_mdfd_getseg<br/>targetseg = blocknum / RELSEG_SIZE"]
CALL --> SW
SW -->|mdreadv / mdwritev / mdextend| GETSEG
GETSEG -->|forknum == MAIN_FORKNUM 0| MAIN
GETSEG -->|forknum == FSM_FORKNUM 1| FSM
GETSEG -->|forknum == VISIBILITYMAP_FORKNUM 2| VM
GETSEG -->|forknum == INIT_FORKNUM 3| INIT
subgraph MAIN["MAIN fork — relpath() base name"]
M0["<relfileNumber><br/>segment 0 — blocks 0 .. RELSEG_SIZE-1"]
M1["<relfileNumber>.1<br/>segment 1 — next RELSEG_SIZE blocks"]
M2["<relfileNumber>.2<br/>segment 2 ..."]
M0 --> M1 --> M2
end
subgraph FSM["FSM fork (forkNames[1] = 'fsm')"]
F0["<relfileNumber>_fsm[.N]"]
end
subgraph VM["VM fork (forkNames[2] = 'vm')"]
V0["<relfileNumber>_vm[.N]"]
end
subgraph INIT["INIT fork (forkNames[3] = 'init', unlogged)"]
I0["<relfileNumber>_init[.N]"]
end
M0 -.->|"seekpos = BLCKSZ * (blocknum % RELSEG_SIZE)"| FILEIO["FileReadV / FileWriteV<br/>on MdfdVec.mdfd_vfd"]
F0 -.-> FILEIO
V0 -.-> FILEIO
I0 -.-> FILEIO
Figure 2 — A logical (rlocator, forknum, blocknum) triple enters through the
single smgrsw[0] (md.c) vtable entry. _mdfd_getseg() divides blocknum by
RELSEG_SIZE to pick the segment; the fork selects the file base name built by
relpath() (suffix from forkNames[]: none / _fsm / _vm / _init). Each
fork is an independent segment vector in which every segment is capped at
RELSEG_SIZE blocks (1 GB at the default 8 KB BLCKSZ); .1, .2, … are the
overflow segments.
The _mdfd_segpath() helper builds the path:
```c
/* _mdfd_segpath: md.c */
static MdPathStr
_mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
{
    RelPathStr  path;
    MdPathStr   fullpath;

    path = relpath(reln->smgr_rlocator, forknum);
    if (segno > 0)
        sprintf(fullpath.str, "%s.%u", path.str, segno);
    else
        strcpy(fullpath.str, path.str);
    return fullpath;
}
```
The md_seg_fds[forknum] array is a resizable palloc slice (via
_fdvec_resize()) allocated in MdCxt. The invariant: md_num_open_segs[f]
counts the number of open segments for fork f; segments beyond that index
may exist on disk but have not been opened yet. This is an optimization: most
relations have only one segment and md_num_open_segs stays at 1.
Finding a segment: _mdfd_getseg()
Every read/write path calls _mdfd_getseg() to map a block number to an
MdfdVec:
```c
/* _mdfd_getseg: md.c (condensed) */
static MdfdVec *
_mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
             bool skipFsync, int behavior)
{
    BlockNumber targetseg = blkno / ((BlockNumber) RELSEG_SIZE);

    /* fast path: segment already open */
    if (targetseg < reln->md_num_open_segs[forknum])
        return &reln->md_seg_fds[forknum][targetseg];

    /* open segments from last-open up to targetseg */
    for (nextsegno = reln->md_num_open_segs[forknum];
         nextsegno <= targetseg; nextsegno++)
    {
        /* ... open or create based on 'behavior' flags ... */
        v = _mdfd_openseg(reln, forknum, nextsegno, flags);
    }
    return &reln->md_seg_fds[forknum][targetseg];
}
```
The behavior flags (EXTENSION_FAIL, EXTENSION_RETURN_NULL,
EXTENSION_CREATE, EXTENSION_CREATE_RECOVERY, EXTENSION_DONT_OPEN)
encode five distinct caller requirements cleanly.
Block offset inside a segment
Given a block number, the offset within its segment file is always:
```c
seekpos = (off_t) BLCKSZ * (blocknum % RELSEG_SIZE);
```
This computation recurs identically in mdreadv, mdwritev, mdextend,
mdprefetch, and mdwriteback.
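As a worked example of the two computations (segment number and byte offset), the sketch below hard-codes the default values BLCKSZ = 8192 and RELSEG_SIZE = 131072. The BlockLocation struct and locate_block() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192u            /* default block size in bytes */
#define RELSEG_SIZE 131072u     /* default blocks per segment (1 GB) */

/* Hypothetical result type: which segment file, and where inside it. */
typedef struct BlockLocation
{
    uint32_t segno;     /* segment number, from 0 */
    int64_t  offset;    /* byte offset within that segment file */
} BlockLocation;

BlockLocation locate_block(uint32_t blocknum)
{
    BlockLocation loc;

    loc.segno = blocknum / RELSEG_SIZE;
    /* Widen before multiplying so larger segment sizes cannot overflow. */
    loc.offset = (int64_t) BLCKSZ * (blocknum % RELSEG_SIZE);
    return loc;
}
```

Block 131071 is the last block of segment 0 at offset 1 073 733 632; block 131072 is the first block of segment 1 at offset 0; block 300000 lands in segment 2 at offset 37856 × 8192 = 310 116 352.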
Read path: mdreadv and mdstartreadv
mdreadv() is the synchronous read entry point called by the buffer manager.
It translates to a vectored preadv(2) via FileReadV(). The vector
(struct iovec[]) is built by buffers_to_iovec(), which merges contiguous
buffer pointers into a single iovec element to minimize the iovec count:
```c
/* mdreadv: md.c (condensed) */
void
mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
        void **buffers, BlockNumber nblocks)
{
    while (nblocks > 0)
    {
        struct iovec iov[PG_IOV_MAX];
        int         iovcnt;
        MdfdVec    *v;
        BlockNumber nblocks_this_segment;

        v = _mdfd_getseg(reln, forknum, blocknum, false,
                         EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
        seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
        nblocks_this_segment = Min(nblocks,
                                   RELSEG_SIZE - (blocknum % RELSEG_SIZE));
        iovcnt = buffers_to_iovec(iov, buffers, nblocks_this_segment);

        /* retry loop for short reads */
        for (;;)
        {
            nbytes = FileReadV(v->mdfd_vfd, iov, iovcnt, seekpos,
                               WAIT_EVENT_DATA_FILE_READ);
            if (nbytes == size_this_segment)
                break;
            /* ... handle short read or EOF error ... */
        }
        nblocks -= nblocks_this_segment;
        buffers += nblocks_this_segment;
        blocknum += nblocks_this_segment;
    }
}
```
mdstartreadv() is the PG18 async path. Instead of calling FileReadV()
directly, it sets up a PgAioHandle, populates an iovec, and calls
FileStartReadV() to submit the I/O asynchronously to the AIO subsystem
(storage/aio/). The completion callbacks md_readv_complete and
md_readv_report are registered on the handle:
```c
/* mdstartreadv: md.c (condensed) */
void
mdstartreadv(PgAioHandle *ioh, SMgrRelation reln, ForkNumber forknum,
             BlockNumber blocknum, void **buffers, BlockNumber nblocks)
{
    v = _mdfd_getseg(reln, forknum, blocknum, false,
                     EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
    iovcnt = buffers_to_iovec(iov, buffers, nblocks);

    if (!(io_direct_flags & IO_DIRECT_DATA))
        pgaio_io_set_flag(ioh, PGAIO_HF_BUFFERED);

    pgaio_io_set_target_smgr(ioh, reln, forknum, blocknum, nblocks, false);
    pgaio_io_register_callbacks(ioh, PGAIO_HCB_MD_READV, 0);

    FileStartReadV(ioh, v->mdfd_vfd, iovcnt, seekpos,
                   WAIT_EVENT_DATA_FILE_READ);
}
```
The AIO layer can then execute the I/O in a worker process. smgr_aio_reopen()
is the callback for reopening the file descriptor in that worker (since the
worker does not share the issuer’s VFD table).
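The merging step that both read paths rely on, where adjacent buffers collapse into one iovec element, can be sketched as follows. This is a standalone illustration and not the md.c source: merge_buffers_to_iovec() is a hypothetical name, and BLCKSZ is fixed at the default 8192.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define BLCKSZ 8192

/*
 * Fill iov[] from an array of block-sized buffers, extending the previous
 * element whenever the next buffer starts exactly where it ends.
 * iov must have room for nblocks elements. Returns the element count.
 */
int merge_buffers_to_iovec(struct iovec *iov, void **buffers, int nblocks)
{
    int iovcnt = 1;

    assert(nblocks >= 1);
    iov[0].iov_base = buffers[0];
    iov[0].iov_len = BLCKSZ;

    for (int i = 1; i < nblocks; i++)
    {
        struct iovec *last = &iov[iovcnt - 1];

        if ((char *) last->iov_base + last->iov_len == (char *) buffers[i])
        {
            /* Contiguous in memory: grow the current element. */
            last->iov_len += BLCKSZ;
        }
        else
        {
            /* Gap in memory: start a new element. */
            iov[iovcnt].iov_base = buffers[i];
            iov[iovcnt].iov_len = BLCKSZ;
            iovcnt++;
        }
    }
    return iovcnt;
}
```

Three buffers of which the first two are adjacent produce two elements, the first covering 16 KB. The number of system calls is unchanged; what shrinks is the vector the kernel has to walk.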
Write path: mdwritev and deferred fsync
mdwritev() issues a synchronous vectored write via FileWriteV(), then
calls register_dirty_segment() if skipFsync is false:
```c
/* register_dirty_segment: md.c (condensed) */
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
    FileTag     tag;

    INIT_MD_FILETAG(tag, reln->smgr_rlocator.locator, forknum, seg->mdfd_segno);

    if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
    {
        /* queue full: fsync immediately */
        FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC);
    }
}
```
RegisterSyncRequest() (in storage/sync/sync.c) adds the FileTag to the
checkpointer’s pending-sync table. At checkpoint time ProcessSyncRequests()
walks the table, issues FileSync() on each entry, and clears the table. If
the queue is full, the backend falls back to synchronous fsync in-line.
Extend path: mdextend and mdzeroextend
mdextend() writes a single block at or beyond EOF. It computes seekpos
from blocknum % RELSEG_SIZE, writes via FileWrite(), and registers the
dirty segment. mdzeroextend() extends by N zeroed blocks at once, using
FileFallocate() (via posix_fallocate(2)) for large extensions when
file_extend_method == FILE_EXTEND_METHOD_POSIX_FALLOCATE, or FileZero()
(pwritev of zeros) otherwise. Both update smgr_cached_nblocks after success.
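What counts as a "large" extension is pinned down in this document's verification notes, which quote the condition numblocks > 8 && file_extend_method != FILE_EXTEND_METHOD_WRITE_ZEROS. A minimal sketch of that decision follows; ExtendMethod, FALLOCATE_MIN_BLOCKS and should_fallocate() are hypothetical names that only mirror the quoted condition.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the two extension methods named in the text. */
typedef enum { EXTEND_POSIX_FALLOCATE, EXTEND_WRITE_ZEROS } ExtendMethod;

/* Extensions of more than this many blocks are eligible for fallocate. */
#define FALLOCATE_MIN_BLOCKS 8

/* True when the extension should preallocate rather than write zeros. */
bool should_fallocate(int numblocks, ExtendMethod method)
{
    assert(numblocks > 0);
    return numblocks > FALLOCATE_MIN_BLOCKS && method != EXTEND_WRITE_ZEROS;
}
```

Under this rule an 8-block extension writes zeros even when fallocate is configured, and a 9-block extension is the smallest that preallocates.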
Truncate: mdtruncate
mdtruncate() proceeds from the last segment downward. For segments entirely
above the new EOF it calls FileTruncate(v->mdfd_vfd, 0, ...) to release disk
space, then closes them and shrinks the md_seg_fds array. For the boundary
segment it truncates to the exact byte position. The first segment (segno==0)
is never unlinked — it is truncated to zero and left as an empty file to prevent
the relfileNumber from being recycled before the next checkpoint.
```c
// mdunlink path: md.c (condensed; non-redo, main fork)

// 1. Truncate first segment to zero (reclaim disk space)
ret = do_truncate(path.str);

// 2. Register post-checkpoint unlink request
register_unlink_segment(rlocator, forknum, 0);

// Additional segments: truncate + unlink immediately
for (segno = 1;; segno++)
{
    do_truncate(segpath.str);
    register_forget_request(rlocator, forknum, segno);
    unlink(segpath.str);
    // ... loop ends at the first segment that does not exist ...
}
```
The deferred-unlink dance prevents a scenario where a relation is dropped, its relfileNumber is immediately reused, and a crash before the next checkpoint causes WAL replay to recreate the wrong file.
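Returning to mdtruncate(), the per-segment decision described above (segments wholly above the new EOF are emptied, the boundary segment is cut to an exact length, earlier segments are left alone) can be modeled as a pure function. The sketch assumes the default RELSEG_SIZE of 131072; SegAction and truncate_action() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RELSEG_SIZE 131072u     /* default blocks per segment */

typedef enum
{
    SEG_KEEP,               /* entirely below the new EOF: untouched */
    SEG_TRUNCATE_PARTIAL,   /* contains the new EOF: cut to exact length */
    SEG_TRUNCATE_TO_ZERO    /* entirely above the new EOF: emptied */
} SegAction;

/*
 * Decide what happens to segment segno when the fork shrinks to
 * new_nblocks. *blocks_left receives the segment's resulting block count.
 */
SegAction truncate_action(uint32_t segno, uint32_t new_nblocks,
                          uint32_t *blocks_left)
{
    uint64_t seg_start = (uint64_t) segno * RELSEG_SIZE;
    uint64_t seg_end = seg_start + RELSEG_SIZE;

    assert(blocks_left != NULL);
    if (seg_start > new_nblocks)
    {
        *blocks_left = 0;
        return SEG_TRUNCATE_TO_ZERO;
    }
    if (seg_end > new_nblocks)
    {
        *blocks_left = (uint32_t) (new_nblocks - seg_start);
        return SEG_TRUNCATE_PARTIAL;
    }
    *blocks_left = RELSEG_SIZE;
    return SEG_KEEP;
}
```

Shrinking a three-segment fork to 150000 blocks empties segment 2, cuts segment 1 to 18928 blocks, and leaves segment 0 full.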
Size accounting: mdnblocks
mdnblocks() walks segments from the last open one, probing _mdnblocks() on
each until it finds one shorter than RELSEG_SIZE. The formula:
```
total_blocks = segno * RELSEG_SIZE + nblocks_in_last_segment
```
An important side effect: calling mdnblocks() opens all active segments and
adds them to the md_seg_fds array — which is why mdtruncate() must be
preceded by mdnblocks() per its contract.
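The size formula can be checked with a small model that takes per-segment block counts as input instead of probing files. It is a simplification: it walks from segment 0 rather than from the last open segment, and total_blocks() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RELSEG_SIZE 131072u     /* default blocks per segment */

/*
 * seg_blocks[i] is the block count of segment i. The fork ends at the
 * first segment shorter than RELSEG_SIZE; anything after it is ignored.
 */
uint64_t total_blocks(const uint32_t *seg_blocks, size_t nsegs)
{
    for (size_t segno = 0; segno < nsegs; segno++)
    {
        assert(seg_blocks[segno] <= RELSEG_SIZE);
        if (seg_blocks[segno] < RELSEG_SIZE)
            return (uint64_t) segno * RELSEG_SIZE + seg_blocks[segno];
    }
    /* Every known segment is full. */
    return (uint64_t) nsegs * RELSEG_SIZE;
}
```

Two full segments followed by one holding 100 blocks give 2 × 131072 + 100 = 262244 blocks.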
Interrupt discipline
The most pervasive non-obvious pattern in both files is HOLD_INTERRUPTS() /
RESUME_INTERRUPTS() bracketing every entry point. The reason is documented
in the smgr.c file header: interrupt processing can trigger
PROCSIGNAL_BARRIER_SMGRRELEASE, which calls smgrreleaseall(). Most of
smgr.c is not reentrant, so any function that holds references to the hash
table or open file descriptors must hold interrupts for its duration.
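The effect of the bracketing can be modeled with a hold counter and a pending flag. This is a loose, hypothetical model of the mechanism and not PostgreSQL's implementation: hold_interrupts(), resume_interrupts(), request_release() and check_for_interrupts() are invented names standing in for the real macros and barrier machinery.

```c
#include <assert.h>
#include <stdbool.h>

int  hold_count = 0;            /* nesting depth of active holds */
bool release_pending = false;   /* a release request has arrived */
int  releases_done = 0;         /* how many releases were actually run */

void hold_interrupts(void)
{
    hold_count++;
}

void resume_interrupts(void)
{
    assert(hold_count > 0);
    hold_count--;
}

/* Stand-in for the arrival of a release barrier. */
void request_release(void)
{
    release_pending = true;
}

/* Safe point: act on a pending request only when nothing holds interrupts. */
void check_for_interrupts(void)
{
    if (hold_count == 0 && release_pending)
    {
        release_pending = false;
        releases_done++;    /* here a real system would close its FDs */
    }
}
```

A request that arrives inside a held region stays pending; it is acted on at the first safe point after the matching resume, so the non-reentrant code never sees its file descriptors disappear mid-operation.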
The async I/O target: pgaio_io_set_target_smgr
PG18 introduces a formal AIO target concept. smgr.c registers itself as
the PGAIO_TID_SMGR target via the aio_smgr_target_info struct, providing
two callbacks:
- smgr_aio_reopen(): called in an AIO worker to reopen the file descriptor (the worker does not inherit the issuing backend’s VFD table).
- smgr_aio_describe_identity(): returns a human-readable description of the target for error messages.
```c
/* pgaio_io_set_target_smgr: smgr.c (condensed) */
void
pgaio_io_set_target_smgr(PgAioHandle *ioh, SMgrRelationData *smgr,
                         ForkNumber forknum, BlockNumber blocknum,
                         int nblocks, bool skip_fsync)
{
    PgAioTargetData *sd = pgaio_io_get_target_data(ioh);

    pgaio_io_set_target(ioh, PGAIO_TID_SMGR);

    sd->smgr.rlocator = smgr->smgr_rlocator.locator;
    sd->smgr.forkNum = forknum;
    sd->smgr.blockNum = blocknum;
    sd->smgr.nblocks = nblocks;
    sd->smgr.is_temp = SmgrIsTemp(smgr);
    sd->smgr.skip_fsync = skip_fsync && !SmgrIsTemp(smgr);
}
```
Figure 1: smgr + md layering overview
flowchart TD
BM["Buffer Manager<br/>(bufmgr.c)"]
SMGR["smgr.c<br/>dispatch + SMgrRelation hash"]
MD["md.c<br/>segment files + fsync pending-ops"]
VFD["fd.c<br/>virtual file descriptor pool"]
OS["OS kernel<br/>page cache / io_uring"]
SYNC["sync.c<br/>pending-sync table"]
CKPT["Checkpointer process<br/>ProcessSyncRequests"]
AIO["storage/aio/<br/>PgAioHandle"]
BM -->|smgrreadv / smgrwritev| SMGR
BM -->|smgrstartreadv| SMGR
SMGR -->|mdreadv / mdwritev| MD
SMGR -->|mdstartreadv| MD
MD -->|FileReadV / FileWriteV| VFD
MD -->|FileStartReadV| AIO
VFD -->|pread / pwrite| OS
AIO -->|io_uring / worker| OS
MD -->|RegisterSyncRequest| SYNC
SYNC -->|at checkpoint| CKPT
CKPT -->|FileSync| VFD
Figure 1 — Layering from buffer manager through smgr and md to the OS kernel. The synchronous path (left) and the PG18 async path (right of md.c) share the same SMgrRelation handle and segment-file state.
Source Walkthrough
smgr.c: the switch layer
- f_smgr (struct): vtable definition; smgrsw[] holds the single md.c entry.
- SMgrRelationHash: per-backend HTAB* keyed on RelFileLocatorBackend.
- unpinned_relns: doubly-linked list of all unpinned SMgrRelationData objects; AtEOXact_SMgr destroys them.
- smgrinit: calls smgr_init on all vtable entries; registers smgrshutdown via on_proc_exit.
- smgropen: hash lookup or insert; sets smgr_which = 0, initializes cached fields, calls smgr_open.
- smgrpin / smgrunpin: move handle on/off unpinned_relns; relcache pins handles it holds.
- smgrrelease: closes all fork FDs, invalidates cached block counts; handle stays in table.
- smgrclose: synonym for smgrrelease (deprecated distinction).
- smgrdestroy (static): closes FDs + removes from hash; only called on unpinned handles.
- smgrdestroyall: destroys all unpinned handles; called by AtEOXact_SMgr.
- smgrreleaseall: releases (not destroys) all handles; called by ProcessBarrierSmgrRelease.
- smgrcreate / smgrextend / smgrzeroextend / smgrreadv / smgrstartreadv / smgrwritev / smgrwriteback / smgrnblocks / smgrnblocks_cached / smgrtruncate / smgrimmedsync / smgrregistersync: thin wrappers that HOLD_INTERRUPTS, dispatch through smgrsw[reln->smgr_which], and RESUME_INTERRUPTS.
- smgrdounlinkall: drops buffers, sends sinval, then calls smgr_unlink for each fork.
- pgaio_io_set_target_smgr: populates PgAioTargetData for AIO.
- smgr_aio_reopen: AIO worker reopen callback.
- AtEOXact_SMgr: end-of-transaction cleanup hook.
md.c — the POSIX driver
- MdfdVec (struct): {File mdfd_vfd, BlockNumber mdfd_segno}.
- MdCxt: MemoryContext for MdfdVec arrays.
- mdinit: creates MdCxt.
- mdopen: zeroes md_num_open_segs for all forks (no actual file open).
- mdclose: closes open segment FDs from last to first; shrinks array.
- mdopenfork (static): opens segment 0 of a fork if not already open.
- _mdfd_getseg (static): maps blocknum → MdfdVec*; opens missing segments incrementally.
- _mdfd_openseg (static): opens one segment, resizes array, fills entry.
- _mdfd_segpath (static): builds segment file path string.
- _fdvec_resize (static): palloc/repalloc the segment array (no shrink to avoid allocation in critical section).
- mdcreate: O_CREAT|O_EXCL open of segment 0; registers dirty.
- mdexists: mdopenfork with EXTENSION_RETURN_NULL.
- mdunlink / mdunlinkfork (static): main fork: truncate-to-zero + deferred unlink; other forks: immediate unlink.
- mdextend: single-block write at or beyond EOF; updates smgr_cached_nblocks.
- mdzeroextend: multi-block zero-fill via fallocate or pwritev.
- mdreadv: synchronous vectored read; inner retry loop for short reads.
- mdstartreadv: async vectored read via PgAioHandle.
- mdwritev: synchronous vectored write; registers dirty segment.
- mdwriteback: FileWriteback (kernel writeback hint; no fsync).
- mdnblocks: walks all segments; side-effect: opens them all.
- smgrnblocks_cached (in smgr.c): returns smgr_cached_nblocks in recovery.
- mdtruncate: iterates open segments from last to first; deactivates and closes excess segments.
- mdregistersync: opens all segments (incl. inactive) and registers each dirty.
- mdimmedsync: FileSync on every segment immediately.
- mdfd: returns raw OS fd + offset; used by AIO workers.
- register_dirty_segment (static): RegisterSyncRequest(SYNC_REQUEST).
- register_unlink_segment (static): RegisterSyncRequest(SYNC_UNLINK_REQUEST).
- register_forget_request (static): RegisterSyncRequest(SYNC_FORGET_REQUEST).
- buffers_to_iovec (static): merges contiguous buffer pointers into minimum iovec count.
Position-hint table (as of 2026-06-05, commit 273fe94)
| Symbol | File | Line |
|---|---|---|
| f_smgr (typedef struct) | smgr.c | 88 |
| smgrsw[] | smgr.c | 128 |
| SMgrRelationHash | smgr.c | 160 |
| smgrinit | smgr.c | 188 |
| smgropen | smgr.c | 240 |
| smgrpin | smgr.c | 296 |
| smgrunpin | smgr.c | 311 |
| smgrdestroy | smgr.c | 323 |
| smgrrelease | smgr.c | 350 |
| smgrclose | smgr.c | 374 |
| smgrdestroyall | smgr.c | 386 |
| smgrreleaseall | smgr.c | 412 |
| smgrcreate | smgr.c | 481 |
| smgrextend | smgr.c | 620 |
| smgrzeroextend | smgr.c | 649 |
| smgrreadv | smgr.c | 721 |
| smgrstartreadv | smgr.c | 753 |
| smgrwritev | smgr.c | 791 |
| smgrwriteback | smgr.c | 805 |
| smgrnblocks | smgr.c | 819 |
| smgrnblocks_cached | smgr.c | 847 |
| smgrtruncate | smgr.c | 875 |
| smgrregistersync | smgr.c | 940 |
| smgrimmedsync | smgr.c | 974 |
| AtEOXact_SMgr | smgr.c | 1017 |
| ProcessBarrierSmgrRelease | smgr.c | 1027 |
| pgaio_io_set_target_smgr | smgr.c | 1038 |
| smgr_aio_reopen | smgr.c | 1064 |
| SMgrRelationData (typedef struct) | smgr.h | 35 |
| _MdfdVec (typedef struct) | md.c | 81 |
| MdCxt | md.c | 87 |
| register_dirty_segment (proto) | md.c | 138 |
| _fdvec_resize (proto) | md.c | 144 |
| _mdfd_segpath (proto) | md.c | 147 |
| _mdfd_openseg (proto) | md.c | 149 |
| _mdfd_getseg (proto) | md.c | 151 |
| mdinit | md.c | 180 |
| mdexists | md.c | 193 |
| mdcreate | md.c | 212 |
| mdunlink | md.c | 327 |
| mdunlinkfork | md.c | 364 |
| mdextend | md.c | 477 |
| mdzeroextend | md.c | 542 |
| mdopenfork | md.c | 665 |
| mdopen | md.c | 703 |
| mdclose | md.c | 714 |
| mdprefetch | md.c | 737 |
| mdmaxcombine | md.c | 834 |
| mdreadv | md.c | 848 |
| mdstartreadv | md.c | 986 |
| mdwritev | md.c | 1060 |
| mdwriteback | md.c | 1165 |
| mdnblocks | md.c | 1224 |
| mdtruncate | md.c | 1291 |
| mdregistersync | md.c | 1380 |
| mdimmedsync | md.c | 1431 |
| mdfd | md.c | 1484 |
| register_dirty_segment | md.c | 1508 |
| register_unlink_segment | md.c | 1552 |
| register_forget_request | md.c | 1568 |
Source verification (as of 2026-06-05)
Verified facts
- `smgrsw[]` has exactly one entry (`md.c`) at REL_18_STABLE. The README confirms “only the magnetic disk manager remains.” `NSmgr = lengthof(smgrsw) = 1`. The `smgr_which` field is always set to 0 in `smgropen()`.
- `smgrclose()` is a synonym for `smgrrelease()`, not a destructor. The comment in `smgr.c` says: “The SMgrRelation reference should not be used after this call. However, because we don’t keep track of the references returned by `smgropen()`, we don’t know if there are other references… Therefore, this is just a synonym for `smgrrelease()` at the moment.” Callers that want destruction must use `smgrdestroy()` (internal) or `smgrdestroyall()`.
- `smgr_cached_nblocks` is reliable only in recovery. `smgrnblocks_cached()` returns `InvalidBlockNumber` outside recovery regardless of what the cache holds, per the comment: “lack of a shared invalidation mechanism for changes in file size.”
- Segment 0 of the main fork is not unlinked immediately in normal operation. `mdunlinkfork()` truncates it to zero and calls `register_unlink_segment(…, 0)` for post-checkpoint deletion; segments ≥ 1 are unlinked immediately. Verified in the `mdunlinkfork()` conditional at line 376, whose other branch (redo, binary upgrade, or a non-main fork) unlinks the first segment right away.
- `HOLD_INTERRUPTS` is present on every smgr entry point. The `smgr.c` file header states the rationale explicitly; spot-checked in `smgropen`, `smgrrelease`, `smgrdestroy`, `smgrextend`, `smgrnblocks`, and `smgrdounlinkall`.
- `mdwriteback()` does not fsync. It calls `FileWriteback()` (a `posix_fadvise(POSIX_FADV_DONTNEED)` hint or `sync_file_range()` on Linux), which asks the kernel to write back dirty pages but does not guarantee durability. The `Assert((io_direct_flags & IO_DIRECT_DATA) == 0)` at line 1168 shows it is not used on the O_DIRECT path.
- `mdzeroextend()` uses `posix_fallocate` only for ≥ 9 blocks. Line 595 reads `if (numblocks > 8 && file_extend_method != FILE_EXTEND_METHOD_WRITE_ZEROS)`. The cutoff avoids defeating delayed allocation on some filesystems.
- `mdstartreadv()` does not implement the `zero_damaged_pages` recovery path. The `mdreadv()` comment at line 939 explicitly notes: “we chose, at least for now, to not implement the `zero_damaged_pages` logic present in `mdreadv()`.” An `Assert(false)` in `mdreadv()` marks this code path as targeted for future removal.
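The `posix_fallocate` cutoff in the list above can be restated as a small decision function. This is a minimal sketch, not the `md.c` code: the enum, the constant, and `sketch_use_fallocate` are hypothetical names introduced here, and only the condition itself (`numblocks > 8` and a method other than write-zeros) comes from the line quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the file_extend_method setting; not PostgreSQL's enum. */
typedef enum
{
    SKETCH_EXTEND_POSIX_FALLOCATE,
    SKETCH_EXTEND_WRITE_ZEROS
} SketchExtendMethod;

/* Cutoff from the quoted condition: allocation calls only when numblocks > 8. */
enum { SKETCH_FALLOCATE_CUTOFF = 8 };

/*
 * Returns true when an extension of numblocks would use an allocation call,
 * false when it would be done by writing zero-filled blocks instead.
 */
bool
sketch_use_fallocate(int numblocks, SketchExtendMethod method)
{
    assert(numblocks > 0);      /* extending by zero blocks is not modelled */

    return numblocks > SKETCH_FALLOCATE_CUTOFF &&
        method != SKETCH_EXTEND_WRITE_ZEROS;
}
```

Under these assumptions, an 8-block extension falls back to zero writes, a 9-block extension is the first to use the allocation call, and selecting the write-zeros method disables the allocation path at any size.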
Open questions
- `RELSEG_SIZE` source location. (Resolved 2026-06-05.) `RELSEG_SIZE` is generated at configure time from `--with-segsize` into `pg_config.h`; the template is `src/include/pg_config.h.in` (`#undef RELSEG_SIZE`, with the comment “RELSEG_SIZE is the maximum number of blocks allowed in one disk file … RELSEG_SIZE * BLCKSZ must be … changing RELSEG_SIZE requires an initdb”). The default `--with-segsize=1` yields 1 GB → 131 072 blocks at the default 8 KB `BLCKSZ`. The `#define` is absent from the checked-out tree only because `pg_config.h` is a build artifact, not a committed file.
- `mdwritev()` crossing segment boundaries. `mdwritev()` (line 1092) calls `elog(ERROR, "write crosses segment boundary")` when `nblocks_this_segment != nblocks`. Is the buffer manager guaranteed never to issue a cross-segment write? Investigation path: check `smgrmaxcombine()` (which returns `RELSEG_SIZE - segoff`); this appears to be the mechanism that prevents such writes at the smgr level.
- `smgr_aio_reopen` interrupt discipline. The function asserts `!INTERRUPTS_CAN_BE_PROCESSED()` (line 1076), meaning the caller is responsible for holding interrupts. Does the AIO worker infrastructure guarantee this before invoking the callback? Investigation path: trace `pgaio_io_get_target_data` call sites in `storage/aio/`.
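The segment arithmetic behind the first two questions can be worked through in a few lines. The sketch below assumes the default build (8 KB blocks, 1 GB segments); every `sketch_` name is hypothetical, and only the `RELSEG_SIZE - segoff` expression is taken from the description of `smgrmaxcombine()` above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed defaults: 8 KB blocks and 1 GB segments (--with-segsize=1). */
#define SKETCH_BLCKSZ      8192u
#define SKETCH_RELSEG_SIZE ((uint32_t) ((1024u * 1024u * 1024u) / SKETCH_BLCKSZ))

/* Which segment file a block lives in. */
uint32_t
sketch_segment_number(uint32_t blocknum)
{
    return blocknum / SKETCH_RELSEG_SIZE;
}

/* Offset of the block inside its segment, in blocks. */
uint32_t
sketch_segment_offset(uint32_t blocknum)
{
    return blocknum % SKETCH_RELSEG_SIZE;
}

/* Longest run starting at blocknum that stays in one segment: RELSEG_SIZE - segoff. */
uint32_t
sketch_max_combine(uint32_t blocknum)
{
    return SKETCH_RELSEG_SIZE - sketch_segment_offset(blocknum);
}

/* Clamp a requested run so that it never crosses a segment boundary. */
uint32_t
sketch_clamp_run(uint32_t blocknum, uint32_t nblocks)
{
    uint32_t limit = sketch_max_combine(blocknum);

    assert(nblocks > 0);
    return nblocks < limit ? nblocks : limit;
}
```

With these defaults, block 131 072 is the first block of segment 1, and a 16-block request starting at block 131 070 is clamped to 2 blocks. A caller that applies such a clamp before issuing I/O would never hand the lower layer a run that spans two segment files, which is the property the second question asks about.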
Beyond PostgreSQL — Comparative Designs & Research Frontiers
- VFS / pluggable storage layers elsewhere. MySQL/InnoDB’s `fil0fil.cc` implements a similar file-system abstraction with tablespace-level (not relation-level) granularity; space is allocated in extents (64 pages at the default 16 KB page size) that are grouped into segments inside tablespace files. The comparison reveals a design choice: PostgreSQL’s per-relation file granularity reduces DROP TABLE to unlinking each fork’s segment files with `unlink(2)`, while InnoDB’s tablespace packing requires page-level space management inside files.
- Pluggable storage managers and cloud storage. The `f_smgr` vtable structure was designed with alternative back-ends in mind. Projects such as `pg_directio` and the ongoing NVM/mmap-storage discussions revisit this surface. The PG18 AIO refactor (`storage/aio/`) is the first significant step toward making a non-POSIX storage manager viable, since `mdstartreadv`/`smgr_aio_reopen` abstract the fd-reopening problem that would otherwise prevent I/O workers from operating on a different process’s file descriptors.
- Direct I/O and io_uring. `O_DIRECT` bypasses the kernel page cache; `io_uring` (Linux 5.1+) allows truly asynchronous submission without per-call system call overhead. PostgreSQL 18’s `io_method=io_uring` path wires through `mdstartreadv` → `FileStartReadV` → `io_uring_prep_readv`. The design is described in the `storage/aio/README` (PG18). Prior art: Microsoft SQL Server’s “Scatter-Gather I/O” and Oracle’s asynchronous I/O on AIX/Solaris. See `postgres-aio.md` for the full async-I/O doc.
- Write-ahead storage (Zheap / pluggable heap). The table AM API (`postgres-table-am.md`) defines how a custom heap can override the storage path entirely. A columnar AM (e.g., Hydra, Citus columnar) can provide its own `mdreadv`-equivalent and bypass the smgr layer for its internal storage while still using the smgr path for index forks.
- The fsync disaster (2018). Commit `9ccbe8f7` (PostgreSQL 11) added the `data_sync_elevel` mechanism and changed fsync-error handling after a Linux kernel bug (and a PostgreSQL behavior long predating it) was discovered: a failed `fsync(2)` would clear the “dirty” flag on OS pages even though the data was not durable, so a retried `fsync(2)` could report success and PostgreSQL could complete a checkpoint without the data ever reaching disk. The `register_dirty_segment`/`ProcessSyncRequests` path was audited and hardened at that time. See LWN “PostgreSQL’s fsync() surprise” (2018).
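The failure mode in the last item is easier to see in a toy model. The sketch below is neither kernel nor PostgreSQL code: `ToyPage`, `toy_fsync`, and `toy_reaction_to_fsync_failure` are hypothetical names, and the model assumes the behavior described above, in which a failed write-back still clears the dirty flag. The second function paraphrases the idea behind `data_sync_elevel` (escalate unless retrying has been explicitly allowed) rather than reproducing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one cached page. */
typedef struct
{
    bool dirty;     /* the page cache believes the page still needs writing */
    bool durable;   /* the contents have actually reached stable storage */
} ToyPage;

/* Returns true on reported success, false on reported failure. */
bool
toy_fsync(ToyPage *page, bool device_fails)
{
    assert(page != NULL);

    if (!page->dirty)
        return true;            /* nothing left to flush, so success is reported */

    page->dirty = false;        /* the surprise: cleared even if the write fails */
    if (device_fails)
        return false;

    page->durable = true;
    return true;
}

typedef enum
{
    TOY_REPORT_AND_RETRY,
    TOY_PANIC
} ToyReaction;

/* Assumed policy: treat the failure as fatal unless retrying is explicitly allowed. */
ToyReaction
toy_reaction_to_fsync_failure(bool retry_allowed)
{
    return retry_allowed ? TOY_REPORT_AND_RETRY : TOY_PANIC;
}
```

In this model, calling `toy_fsync` once with a failing device and then again with a healthy one returns failure followed by success, yet `durable` stays false. That false success on retry is why escalating on the first failure, and then recovering from the WAL, is the safer reaction.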
Sources
Source files analyzed
- `src/backend/storage/smgr/smgr.c` (REL_18_STABLE, commit 273fe94)
- `src/backend/storage/smgr/md.c` (REL_18_STABLE, commit 273fe94)
- `src/backend/storage/smgr/README`
- `src/include/storage/smgr.h`
- `src/include/common/relpath.h`
Related documents (this KB)
- `postgres-buffer-manager.md`: the caller of `smgrreadv`/`smgrwritev`
- `postgres-page-layout.md`: the 8 KB page format smgr transfers
- `postgres-aio.md`: the PG18 async-I/O subsystem wired into `mdstartreadv`
- `postgres-xlog-wal.md`: WAL uses `smgrimmedsync`/`smgrregistersync` for fsync-after-WAL-minimal
- `postgres-table-am.md`: table AM interface above smgr
- `postgres-heap-am.md`: the concrete heap AM that calls `smgrextend`
Textbooks
- Database System Concepts, Silberschatz et al., 7e, ch. 10 “Storage and File Structure”
- Database Internals, Petrov, ch. 6 (on-disk page management and file abstraction)