PostgreSQL smgr & md — The Storage Manager Switch and the Magnetic-Disk Driver

Every relational engine must answer one basic question: given a logical address — “page N of relation R” — how does the engine translate that to physical bytes on durable storage? The answer is the storage manager layer, which sits between the page-level buffer pool and the operating system. Database Internals (Petrov, ch. 6 “B-Tree Variants”) introduces the concept through the lens of page addressing, and Database System Concepts (Silberschatz, 7e, ch. 10 “Storage and File Structure”) frames it with the classic on-disk organization model: a relation is a sequence of fixed-size pages, the buffer manager fetches or flushes pages one at a time, and the storage layer handles the mechanics of physical location.

Two design choices govern what the storage layer has to do:

  1. The I/O unit. Virtually all DBMSes choose a fixed page size (4 KB–32 KB) matching or doubling the OS page size, so that a single kernel call transfers exactly one database page without partial-transfer complications. The buffer manager and the storage layer agree on this unit and never deal in anything smaller at the page-transfer boundary.

  2. How logical addresses map to physical files. A simple approach puts one relation in one file. Real engines extend this in two directions: (a) relation forks — multiple independent file sequences per relation for auxiliary metadata (free-space map, visibility map, initialization fork), each accessible independently; and (b) file segmentation — splitting a single fork into multiple fixed-size OS files to avoid OS file-size limits and to give the OS file-system more predictable file objects.

The storage manager is also where fsync discipline lives. A page written to the kernel’s buffer cache is not durable; the engine must eventually call fsync(2) on the underlying file. Three strategies exist: (a) fsync synchronously on every write (correct but slow); (b) collect “dirty segments” into a pending list and fsync them at checkpoint time, delegating to a background process; and (c) use O_DIRECT to bypass the kernel buffer cache entirely, relying on the engine’s own buffer pool as the only cache tier (the path PG18’s async I/O makes increasingly viable). PostgreSQL uses strategy (b) as its default, with strategy (c) available via io_direct_flags.

Most engines that were ever designed to support alternative storage back-ends (disk, WORM jukebox, NVM, S3) implement the storage layer as a vtable of function pointers keyed on a storage-manager identifier. The identifier is stored per relation in the catalog or in the relation descriptor. Dispatch costs one extra indirect call per page operation; real I/O dominates that cost entirely, so the indirection is essentially free.

Opening a file is expensive. Engines cache open file descriptors in a per-backend handle table keyed on the logical relation identifier. A “handle” object bundles the relation locator, the open file descriptors for each segment, and a small cache of metadata (current size, target insertion block). The handle is looked up in the table before every I/O; if already present the table lookup replaces the open(2) syscall.

Since Postgres 8.4, the standard pattern is:

$PGDATA/base/<dbOid>/<relfileNumber> ← main fork, segment 0
$PGDATA/base/<dbOid>/<relfileNumber>.1 ← main fork, segment 1
$PGDATA/base/<dbOid>/<relfileNumber>_fsm ← free-space-map fork
$PGDATA/base/<dbOid>/<relfileNumber>_vm ← visibility-map fork
$PGDATA/base/<dbOid>/<relfileNumber>_init ← init fork (unlogged)

Each fork is an independent file sequence; forks grow and truncate independently. The fork multiplier is the minimal addition that makes FSM and VM possible without complex multiplexing inside the main file.

Writing a page to the kernel’s cache creates a fsync obligation: before the next checkpoint completes, that page’s segment must be fsynced. The naive solution calls fsync on the write path and blocks. The standard DBMS pattern is to record the obligation in a pending-ops list (keyed by (tablespace, db, relfileNumber, fork, segment)) and let a background checkpointer process drain the list. The write path is non-blocking; the checkpointer does the heavy I/O during its checkpoint window.
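The pending-ops pattern can be sketched in a few lines of C. This is a minimal illustration, not PostgreSQL's sync.c: `SegTag`, `PendingSyncs`, `pending_register`, `pending_drain`, and `toy_count_sync` are hypothetical names, the list is a fixed-size array rather than a hash table, and the "fsync" is a counting callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag naming one segment file (not PostgreSQL's FileTag). */
typedef struct SegTag {
    uint32_t tablespace, db, relfile;
    int      fork;
    uint32_t segno;
} SegTag;

#define PENDING_MAX 64

typedef struct PendingSyncs {
    SegTag tags[PENDING_MAX];
    size_t count;
} PendingSyncs;

static bool tag_equal(const SegTag *a, const SegTag *b)
{
    return a->tablespace == b->tablespace && a->db == b->db &&
           a->relfile == b->relfile && a->fork == b->fork &&
           a->segno == b->segno;
}

/* Record an fsync obligation. Returns false when the list is full,
 * in which case the caller has to fsync the segment itself. */
bool pending_register(PendingSyncs *p, SegTag tag)
{
    assert(p != NULL);
    for (size_t i = 0; i < p->count; i++)
        if (tag_equal(&p->tags[i], &tag))
            return true;        /* already owed: many writes, one fsync */
    if (p->count == PENDING_MAX)
        return false;
    p->tags[p->count++] = tag;
    return true;
}

/* Stand-in for fsync: only counts how often it was called. */
size_t toy_sync_calls = 0;
void toy_count_sync(const SegTag *tag) { (void) tag; toy_sync_calls++; }

/* Checkpoint-time drain: one sync per distinct segment, then clear. */
size_t pending_drain(PendingSyncs *p, void (*sync_fn)(const SegTag *))
{
    size_t n = p->count;
    for (size_t i = 0; i < n; i++)
        sync_fn(&p->tags[i]);
    p->count = 0;
    return n;
}
```

Registering the same segment twice leaves a single entry, which is the point of the design: the write path pays only for a lookup, and the drain issues one sync per dirty segment regardless of how many pages were written to it.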

| Theory concept | PostgreSQL name |
| --- | --- |
| Storage manager vtable | f_smgr struct in smgr.c, single entry smgrsw[0] = md.c |
| Relation handle / cached FD | SMgrRelationData / SMgrRelation |
| Logical address (rel, page) | RelFileLocator + ForkNumber + BlockNumber |
| Fork | ForkNumber: MAIN_FORKNUM, FSM_FORKNUM, VISIBILITYMAP_FORKNUM, INIT_FORKNUM |
| File segment | MdfdVec — one per open segment; md_seg_fds[] array |
| Segment size limit | RELSEG_SIZE blocks (default 131 072 → 1 GB at 8 KB/block) |
| Pending fsync obligation | FileTag entry passed to RegisterSyncRequest |
| Deferred-fsync executor | checkpointer process calling ProcessSyncRequests |

smgr.c is thin by design. It owns two data structures and twenty-odd functions that are pure dispatch wrappers:

// f_smgr — smgr.c (the vtable)
typedef struct f_smgr
{
    void (*smgr_init) (void);
    void (*smgr_open) (SMgrRelation reln);
    void (*smgr_close) (SMgrRelation reln, ForkNumber forknum);
    void (*smgr_create) (SMgrRelation reln, ForkNumber forknum, bool isRedo);
    bool (*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
    void (*smgr_unlink) (RelFileLocatorBackend rlocator, ForkNumber forknum,
                         bool isRedo);
    void (*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
                         BlockNumber blocknum, const void *buffer, bool skipFsync);
    void (*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
                             BlockNumber blocknum, int nblocks, bool skipFsync);
    bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
                           BlockNumber blocknum, int nblocks);
    uint32 (*smgr_maxcombine) (SMgrRelation reln, ForkNumber forknum,
                               BlockNumber blocknum);
    void (*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
                        BlockNumber blocknum,
                        void **buffers, BlockNumber nblocks);
    void (*smgr_startreadv) (PgAioHandle *ioh,
                             SMgrRelation reln, ForkNumber forknum,
                             BlockNumber blocknum,
                             void **buffers, BlockNumber nblocks);
    void (*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
                         BlockNumber blocknum,
                         const void **buffers, BlockNumber nblocks,
                         bool skipFsync);
    void (*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
                            BlockNumber blocknum, BlockNumber nblocks);
    BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
    void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
                           BlockNumber old_blocks, BlockNumber nblocks);
    void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
    void (*smgr_registersync) (SMgrRelation reln, ForkNumber forknum);
    int (*smgr_fd) (SMgrRelation reln, ForkNumber forknum,
                    BlockNumber blocknum, uint32 *off);
} f_smgr;

static const f_smgr smgrsw[] = {
    /* magnetic disk — the only entry */
    {
        .smgr_init = mdinit,
        .smgr_open = mdopen,
        .smgr_readv = mdreadv,
        .smgr_startreadv = mdstartreadv,
        /* ... */
    }
};

The README note is honest: “We retain the notion of a storage manager switch in case anyone ever wants to reintroduce other kinds of storage managers. Removing the switch layer would save nothing noticeable anyway, since storage-access operations are surely far more expensive than one extra layer of C function calls.”
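To make the dispatch cost concrete, here is a toy two-operation switch in the same shape as f_smgr. Everything in it (`ToyRel`, `ToySmgr`, `toy_sw`, `toy_extend`, `toy_nblocks`, and the in-memory "disk" back-end) is hypothetical and only illustrates the pattern of selecting a function table by a per-relation index.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical relation handle: a selector plus in-memory state. */
typedef struct ToyRel {
    int      which;     /* index into toy_sw[], like smgr_which */
    unsigned nblocks;   /* pretend on-disk size in blocks */
} ToyRel;

/* Hypothetical switch with two slots; the real f_smgr has about twenty. */
typedef struct ToySmgr {
    void     (*extend)(ToyRel *rel, unsigned nblocks);
    unsigned (*nblocks)(const ToyRel *rel);
} ToySmgr;

/* The only back-end in this sketch. */
void     disk_extend(ToyRel *rel, unsigned n) { rel->nblocks += n; }
unsigned disk_nblocks(const ToyRel *rel)      { return rel->nblocks; }

static const ToySmgr toy_sw[] = {
    { .extend = disk_extend, .nblocks = disk_nblocks },   /* slot 0 */
};
#define TOY_NSMGR (sizeof(toy_sw) / sizeof(toy_sw[0]))

/* Public wrappers: one bounds check plus one indirect call each. */
void toy_extend(ToyRel *rel, unsigned n)
{
    assert((size_t) rel->which < TOY_NSMGR);
    toy_sw[rel->which].extend(rel, n);
}

unsigned toy_nblocks(const ToyRel *rel)
{
    assert((size_t) rel->which < TOY_NSMGR);
    return toy_sw[rel->which].nblocks(rel);
}
```

Adding a second back-end in this sketch would mean appending one initializer to `toy_sw[]` and choosing a different `which` per relation; the wrappers would not change.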

Every access to a physical relation goes through an SMgrRelation handle. smgropen() looks up or creates one in a per-backend hash table keyed on RelFileLocatorBackend (tablespace OID + database OID + relation file number + backend number):

// SMgrRelationData — src/include/storage/smgr.h
typedef struct SMgrRelationData
{
    RelFileLocatorBackend smgr_rlocator;    /* hash key — must be first */
    BlockNumber smgr_targblock;             /* current insertion target block */
    BlockNumber smgr_cached_nblocks[MAX_FORKNUM + 1];   /* cached fork sizes */
    int         smgr_which;                 /* selector into smgrsw[] — always 0 */
    /* md.c private fields */
    int         md_num_open_segs[MAX_FORKNUM + 1];
    struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
    /* pinning */
    int         pincount;
    dlist_node  node;   /* link in unpinned_relns list when pincount==0 */
} SMgrRelationData;

The smgr_cached_nblocks field is a size cache per fork. It is invalidated on an smgr cache-invalidation signal (PROCSIGNAL_BARRIER_SMGRRELEASE or CacheInvalidateSmgr) and is only reliable during recovery — the comment in smgrnblocks_cached() says plainly: “For now, this function uses cached values only in recovery due to lack of a shared invalidation mechanism for changes in file size.”

Lifetime and pinning. Handles created inside a transaction are unpinned by default: they live on the unpinned_relns doubly-linked list and are destroyed at AtEOXact_SMgr() (end of transaction). The relcache pins the handle via smgrpin() to prevent it from being destroyed while the relcache entry is alive. The checkpointer and auxiliary processes that operate outside transactions manage their own lifetime by calling smgrdestroyall() at appropriate points.
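The pin-count rule can be illustrated with a small sketch. It is an assumption-laden simplification: `ToyHandle`, `toy_pin`, `toy_unpin`, and `toy_destroy_unpinned` are hypothetical, and the sweep scans an array, whereas smgr.c keeps unpinned handles on a dedicated list so that it never has to visit pinned ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical handle: just the fields needed to show the lifetime rule. */
typedef struct ToyHandle {
    int  pincount;   /* > 0 while a longer-lived owner holds the handle */
    bool alive;      /* false once the handle has been destroyed */
} ToyHandle;

void toy_pin(ToyHandle *h)   { assert(h->alive); h->pincount++; }
void toy_unpin(ToyHandle *h) { assert(h->pincount > 0); h->pincount--; }

/* End-of-transaction sweep: destroy every live handle nobody has pinned.
 * Returns how many handles were destroyed. */
size_t toy_destroy_unpinned(ToyHandle *handles, size_t n)
{
    size_t destroyed = 0;
    for (size_t i = 0; i < n; i++) {
        if (handles[i].alive && handles[i].pincount == 0) {
            handles[i].alive = false;
            destroyed++;
        }
    }
    return destroyed;
}
```

A pinned handle survives the sweep and becomes eligible again only after the matching unpin, which mirrors how a relcache entry keeps its handle alive across transaction boundaries.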

Signal-driven release. A backend can receive PROCSIGNAL_BARRIER_SMGRRELEASE at any point, ordering it to close all open file descriptors immediately. ProcessBarrierSmgrRelease() calls smgrreleaseall(), which iterates the entire hash table and calls smgrrelease() on every entry. The handle objects remain in the table; only the underlying OS file descriptors are closed.

// smgropen — smgr.c
SMgrRelation
smgropen(RelFileLocator rlocator, ProcNumber backend)
{
    RelFileLocatorBackend brlocator;
    SMgrRelation reln;
    bool        found;

    HOLD_INTERRUPTS();

    if (SMgrRelationHash == NULL)
    {
        HASHCTL     ctl;

        ctl.keysize = sizeof(RelFileLocatorBackend);
        ctl.entrysize = sizeof(SMgrRelationData);
        SMgrRelationHash = hash_create("smgr relation table", 400,
                                       &ctl, HASH_ELEM | HASH_BLOBS);
        dlist_init(&unpinned_relns);
    }

    brlocator.locator = rlocator;
    brlocator.backend = backend;
    reln = (SMgrRelation) hash_search(SMgrRelationHash,
                                      &brlocator,
                                      HASH_ENTER, &found);
    if (!found)
    {
        reln->smgr_targblock = InvalidBlockNumber;
        for (int i = 0; i <= MAX_FORKNUM; ++i)
            reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
        reln->smgr_which = 0;   /* md.c only */
        reln->pincount = 0;
        dlist_push_tail(&unpinned_relns, &reln->node);
        smgrsw[reln->smgr_which].smgr_open(reln);
    }

    RESUME_INTERRUPTS();
    return reln;
}

The ForkNumber enum (in src/include/common/relpath.h) defines four forks:

| Fork | Value | File suffix | Purpose |
| --- | --- | --- | --- |
| MAIN_FORKNUM | 0 | (none) | Heap or index pages |
| FSM_FORKNUM | 1 | _fsm | Free-space map |
| VISIBILITYMAP_FORKNUM | 2 | _vm | Visibility map |
| INIT_FORKNUM | 3 | _init | Unlogged-table initializer |

MAX_FORKNUM is INIT_FORKNUM = 3. The SMgrRelationData struct allocates [MAX_FORKNUM + 1] arrays for md_num_open_segs, md_seg_fds, and smgr_cached_nblocks. Most code passes MAIN_FORKNUM; the convenience wrappers smgrread and smgrwrite in smgr.h still take an explicit fork number and only fix the block count at one.

md.c translates a logical (SMgrRelation, ForkNumber, BlockNumber) address into POSIX pread/pwrite calls on the appropriate segment file. Its internal state lives entirely inside the SMgrRelationData fields (md_num_open_segs, md_seg_fds) and a MemoryContext called MdCxt for allocating MdfdVec arrays.

MdfdVec: the per-segment file-descriptor wrapper

// _MdfdVec — md.c
typedef struct _MdfdVec
{
    File        mdfd_vfd;       /* fd number in fd.c's virtual-fd pool */
    BlockNumber mdfd_segno;     /* segment number, from 0 */
} MdfdVec;

File is an integer handle into fd.c’s virtual file descriptor (VFD) pool — not a raw OS fd. The VFD layer (src/backend/storage/file/fd.c) manages a pool of OS file descriptors, evicting (closing) them on LRU order when the process approaches the max_files_per_process limit and reopening them on demand. MdfdVec therefore stores a stable VFD index, not a raw int that could be invalidated by eviction.

A relation fork with more than RELSEG_SIZE blocks (default 131 072 at 8 KB/block = 1 GB) spans multiple segment files. The naming convention:

base/<dbOid>/<relfileNumber> ← segment 0 of MAIN_FORKNUM
base/<dbOid>/<relfileNumber>.1 ← segment 1
base/<dbOid>/<relfileNumber>.2 ← segment 2 ...
base/<dbOid>/<relfileNumber>_fsm ← segment 0 of FSM_FORKNUM

Figure 2 — smgr switch → md.c → fork/segment file mapping

flowchart TD
    CALL["smgrreadv / smgrwritev / smgrextend<br/>(rlocator, forknum, blocknum)"]
    SW["smgrsw[reln->smgr_which]<br/>smgr_which == 0 (md.c only)"]
    GETSEG["_mdfd_getseg<br/>targetseg = blocknum / RELSEG_SIZE"]

    CALL --> SW
    SW -->|mdreadv / mdwritev / mdextend| GETSEG

    GETSEG -->|forknum == MAIN_FORKNUM 0| MAIN
    GETSEG -->|forknum == FSM_FORKNUM 1| FSM
    GETSEG -->|forknum == VISIBILITYMAP_FORKNUM 2| VM
    GETSEG -->|forknum == INIT_FORKNUM 3| INIT

    subgraph MAIN["MAIN fork — relpath() base name"]
        M0["&lt;relfileNumber&gt;<br/>segment 0 — blocks 0 .. RELSEG_SIZE-1"]
        M1["&lt;relfileNumber&gt;.1<br/>segment 1 — next RELSEG_SIZE blocks"]
        M2["&lt;relfileNumber&gt;.2<br/>segment 2 ..."]
        M0 --> M1 --> M2
    end

    subgraph FSM["FSM fork (forkNames[1] = 'fsm')"]
        F0["&lt;relfileNumber&gt;_fsm[.N]"]
    end
    subgraph VM["VM fork (forkNames[2] = 'vm')"]
        V0["&lt;relfileNumber&gt;_vm[.N]"]
    end
    subgraph INIT["INIT fork (forkNames[3] = 'init', unlogged)"]
        I0["&lt;relfileNumber&gt;_init[.N]"]
    end

    M0 -.->|"seekpos = BLCKSZ * (blocknum % RELSEG_SIZE)"| FILEIO["FileReadV / FileWriteV<br/>on MdfdVec.mdfd_vfd"]
    F0 -.-> FILEIO
    V0 -.-> FILEIO
    I0 -.-> FILEIO

Figure 2 — A logical (rlocator, forknum, blocknum) triple enters through the single smgrsw[0] (md.c) vtable entry. _mdfd_getseg() divides blocknum by RELSEG_SIZE to pick the segment; the fork selects the file base name built by relpath() (suffix from forkNames[]: none / _fsm / _vm / _init). Each fork is an independent segment vector capped at RELSEG_SIZE blocks (1 GB at the default 8 KB BLCKSZ); .1, .2, … are the overflow segments.

The _mdfd_segpath() helper builds the path:

// _mdfd_segpath — md.c
static MdPathStr
_mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
{
    RelPathStr  path;
    MdPathStr   fullpath;

    path = relpath(reln->smgr_rlocator, forknum);

    if (segno > 0)
        sprintf(fullpath.str, "%s.%u", path.str, segno);
    else
        strcpy(fullpath.str, path.str);

    return fullpath;
}
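A standalone approximation of the naming rule may help when reading directory listings. The sketch below is hypothetical (`toy_segpath` and `toy_fork_suffix` are not PostgreSQL symbols) and assumes the default tablespace layout under `base/` only; the real relpath() also handles other tablespaces and temporary relations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Suffix per fork number 0..3: main, FSM, VM, init. */
static const char *const toy_fork_suffix[] = { "", "_fsm", "_vm", "_init" };

/* Build "base/<db>/<relfile><suffix>[.<segno>]" into out.
 * Returns the string length, or -1 on a bad fork number or truncation. */
int toy_segpath(char *out, size_t outlen,
                unsigned db_oid, unsigned relfile, int fork, unsigned segno)
{
    int n;

    assert(out != NULL && outlen > 0);
    if (fork < 0 || fork > 3)
        return -1;

    if (segno > 0)      /* segments after the first carry a numeric suffix */
        n = snprintf(out, outlen, "base/%u/%u%s.%u",
                     db_oid, relfile, toy_fork_suffix[fork], segno);
    else                /* segment 0 is the bare fork name */
        n = snprintf(out, outlen, "base/%u/%u%s",
                     db_oid, relfile, toy_fork_suffix[fork]);

    return (n < 0 || (size_t) n >= outlen) ? -1 : n;
}
```

With database OID 5 and relfile number 16384 (both made-up values), segment 2 of the main fork comes out as `base/5/16384.2` and segment 1 of the FSM fork as `base/5/16384_fsm.1`.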

The md_seg_fds[forknum] array is a resizable palloc slice (via _fdvec_resize()) allocated in MdCxt. The invariant: md_num_open_segs[f] counts the number of open segments for fork f; segments beyond that index may exist on disk but have not been opened yet. This is an optimization: most relations have only one segment and md_num_open_segs stays at 1.

Every read/write path calls _mdfd_getseg() to map a block number to a MdfdVec:

// _mdfd_getseg — md.c (condensed)
static MdfdVec *
_mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
             bool skipFsync, int behavior)
{
    BlockNumber targetseg = blkno / ((BlockNumber) RELSEG_SIZE);

    /* fast path: segment already open */
    if (targetseg < reln->md_num_open_segs[forknum])
        return &reln->md_seg_fds[forknum][targetseg];

    /* open segments from last-open up to targetseg */
    for (nextsegno = reln->md_num_open_segs[forknum];
         nextsegno <= targetseg; nextsegno++)
    {
        /* ... open or create based on 'behavior' flags ... */
        v = _mdfd_openseg(reln, forknum, nextsegno, flags);
    }
    return &reln->md_seg_fds[forknum][targetseg];
}

The behavior flags (EXTENSION_FAIL, EXTENSION_RETURN_NULL, EXTENSION_CREATE, EXTENSION_CREATE_RECOVERY, EXTENSION_DONT_OPEN) encode five distinct caller requirements cleanly.

Given a block number, the offset within its segment file is always:

seekpos = (off_t) BLCKSZ * (blocknum % RELSEG_SIZE)

This computation recurs identically in mdreadv, mdwritev, mdextend, mdprefetch, and mdwriteback.
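The arithmetic is small enough to check by hand. The sketch below hard-codes the default build constants as assumptions (`TOY_BLCKSZ` = 8192 and `TOY_RELSEG_SIZE` = 131072); `SegPos` and `toy_locate` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BLCKSZ      8192u     /* assumed default block size in bytes */
#define TOY_RELSEG_SIZE 131072u   /* assumed default blocks per segment */

/* Where a block lives: which segment file, and at what byte offset. */
typedef struct SegPos {
    uint32_t segno;
    uint64_t offset;
} SegPos;

SegPos toy_locate(uint32_t blocknum)
{
    SegPos p;

    p.segno = blocknum / TOY_RELSEG_SIZE;
    /* Parenthesize the modulo: the offset is relative to the segment,
     * and the 64-bit cast keeps the product from wrapping. */
    p.offset = (uint64_t) TOY_BLCKSZ * (blocknum % TOY_RELSEG_SIZE);

    assert(p.offset < (uint64_t) TOY_BLCKSZ * TOY_RELSEG_SIZE);
    return p;
}
```

Under these constants, block 131071 is the last block of segment 0 (offset 1073733632, that is 1 GB minus one block), block 131072 is the first block of segment 1 at offset 0, and block 300000 lands in segment 2 at offset 310116352.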

mdreadv() is the synchronous read entry point called by the buffer manager. It translates to a vectored preadv(2) via FileReadV(). The vector (struct iovec[]) is built by buffers_to_iovec(), which merges buffers that are contiguous in memory into a single iovec element to keep the iovec count minimal:

// mdreadv — md.c (condensed)
void
mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
        void **buffers, BlockNumber nblocks)
{
    while (nblocks > 0)
    {
        struct iovec iov[PG_IOV_MAX];
        int         iovcnt;
        off_t       seekpos;
        int         nbytes;
        MdfdVec    *v;
        BlockNumber nblocks_this_segment;
        size_t      size_this_segment;

        v = _mdfd_getseg(reln, forknum, blocknum, false,
                         EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
        seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));

        /* never read past the end of the current segment file */
        nblocks_this_segment = Min(nblocks,
                                   RELSEG_SIZE - (blocknum % RELSEG_SIZE));
        iovcnt = buffers_to_iovec(iov, buffers, nblocks_this_segment);
        size_this_segment = nblocks_this_segment * BLCKSZ;

        /* retry loop for short reads */
        for (;;)
        {
            nbytes = FileReadV(v->mdfd_vfd, iov, iovcnt, seekpos,
                               WAIT_EVENT_DATA_FILE_READ);
            if (nbytes == size_this_segment)
                break;
            /* ... handle short read or EOF error ... */
        }

        /* advance to the next segment's share of the request */
        nblocks -= nblocks_this_segment;
        buffers += nblocks_this_segment;
        blocknum += nblocks_this_segment;
    }
}
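The merging step performed by buffers_to_iovec() can be approximated as follows. This is a sketch of the technique rather than the md.c function: `toy_buffers_to_iovec` is a hypothetical name, the block size is assumed to be 8192 bytes, and the caller is assumed to supply an `iov` array with room for `nblocks` entries.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define TOY_BLCKSZ 8192   /* assumed block size in bytes */

/* Fill iov[] from nblocks block-sized buffers, growing the previous
 * entry whenever the next buffer starts exactly where it ends.
 * Returns the number of iovec entries used. */
int toy_buffers_to_iovec(struct iovec *iov, void **buffers, int nblocks)
{
    int iovcnt = 1;

    assert(nblocks >= 1);
    iov[0].iov_base = buffers[0];
    iov[0].iov_len = TOY_BLCKSZ;

    for (int i = 1; i < nblocks; i++) {
        struct iovec *last = &iov[iovcnt - 1];

        if ((char *) last->iov_base + last->iov_len == (char *) buffers[i]) {
            last->iov_len += TOY_BLCKSZ;     /* adjacent in memory: extend */
        } else {
            iov[iovcnt].iov_base = buffers[i];   /* gap: start a new entry */
            iov[iovcnt].iov_len = TOY_BLCKSZ;
            iovcnt++;
        }
    }
    return iovcnt;
}
```

Three buffers of which the first two are adjacent collapse into two entries, the first covering 16384 bytes. Neighbouring buffers in a shared buffer pool are often adjacent, so a multi-block read frequently needs only one entry.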

mdstartreadv() is the PG18 async path. Instead of calling FileReadV() directly, it sets up a PgAioHandle, populates an iovec, and calls FileStartReadV() to submit the I/O asynchronously to the AIO subsystem (storage/aio/). The completion callbacks md_readv_complete and md_readv_report are registered on the handle:

// mdstartreadv — md.c (condensed)
void
mdstartreadv(PgAioHandle *ioh,
             SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
             void **buffers, BlockNumber nblocks)
{
    v = _mdfd_getseg(reln, forknum, blocknum, false,
                     EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
    iovcnt = buffers_to_iovec(iov, buffers, nblocks);

    if (!(io_direct_flags & IO_DIRECT_DATA))
        pgaio_io_set_flag(ioh, PGAIO_HF_BUFFERED);

    pgaio_io_set_target_smgr(ioh, reln, forknum, blocknum, nblocks, false);
    pgaio_io_register_callbacks(ioh, PGAIO_HCB_MD_READV, 0);

    FileStartReadV(ioh, v->mdfd_vfd, iovcnt, seekpos,
                   WAIT_EVENT_DATA_FILE_READ);
}

The AIO layer can then execute the I/O in a worker process. smgr_aio_reopen() is the callback for reopening the file descriptor in that worker (since the worker does not share the issuer’s VFD table).

mdwritev() issues a synchronous vectored write via FileWriteV(), then calls register_dirty_segment() if skipFsync is false:

// register_dirty_segment — md.c (condensed)
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
    FileTag     tag;

    INIT_MD_FILETAG(tag, reln->smgr_rlocator.locator, forknum, seg->mdfd_segno);

    if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
    {
        /* queue full — fsync immediately */
        FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC);
    }
}

RegisterSyncRequest() (in storage/sync/sync.c) adds the FileTag to the checkpointer’s pending-sync table. At checkpoint time ProcessSyncRequests() walks the table, issues FileSync() on each entry, and clears the table. If the queue is full, the backend falls back to synchronous fsync in-line.

mdextend() writes a single block at or beyond EOF. It computes seekpos from blocknum % RELSEG_SIZE, writes via FileWrite(), and registers the dirty segment. mdzeroextend() extends by N zeroed blocks at once, using FileFallocate() (via posix_fallocate(2)) for large extensions when file_extend_method == FILE_EXTEND_METHOD_POSIX_FALLOCATE, or FileZero() (writev of zeros) otherwise. Both update smgr_cached_nblocks after success.
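The choice between the two zero-extension paths reduces to a small predicate. The sketch below is hypothetical (`ToyExtendHow` and `toy_choose_extend` are illustrative names) and takes its cutoff of more than 8 blocks from the mdzeroextend() condition quoted later in this document; the boolean parameter stands in for the file_extend_method setting.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    TOY_EXTEND_WRITE_ZEROS,   /* write zero-filled blocks */
    TOY_EXTEND_FALLOCATE      /* ask the filesystem to allocate the range */
} ToyExtendHow;

/* Small extensions always write zeros; only extensions of more than
 * 8 blocks use fallocate, and only when that method is enabled. */
ToyExtendHow toy_choose_extend(int numblocks, bool fallocate_enabled)
{
    assert(numblocks > 0);
    if (numblocks > 8 && fallocate_enabled)
        return TOY_EXTEND_FALLOCATE;
    return TOY_EXTEND_WRITE_ZEROS;
}
```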

mdtruncate() proceeds from the last segment downward. For segments entirely above the new EOF it calls FileTruncate(v->mdfd_vfd, 0, ...) to release disk space, then closes them and shrinks the md_seg_fds array. For the boundary segment it truncates to the exact byte position. The first segment (segno==0) is never unlinked — it is truncated to zero and left as an empty file to prevent the relfileNumber from being recycled before the next checkpoint.
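The per-segment outcome of a truncation can be expressed as a pure function, which makes the three cases (wholly above the new EOF, boundary segment, untouched) easy to check. This is a sketch under the assumption of the default 131072-block segment size; `toy_blocks_after_truncate` is a hypothetical helper, not an md.c function.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RELSEG_SIZE 131072u   /* assumed default blocks per segment */

/* Number of blocks segment `segno` should hold once the fork has been
 * truncated to `nblocks` blocks in total. */
uint32_t toy_blocks_after_truncate(uint32_t segno, uint32_t nblocks)
{
    uint64_t seg_start = (uint64_t) segno * TOY_RELSEG_SIZE;
    uint64_t kept;

    if (seg_start >= nblocks)
        kept = 0;                         /* wholly above the new EOF */
    else if (nblocks - seg_start >= TOY_RELSEG_SIZE)
        kept = TOY_RELSEG_SIZE;           /* wholly below: untouched */
    else
        kept = nblocks - seg_start;       /* boundary segment */

    assert(kept <= TOY_RELSEG_SIZE);
    return (uint32_t) kept;
}
```

Truncating a fork to 200000 blocks leaves segment 0 full, cuts segment 1 to 68928 blocks, and empties segment 2.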

// mdunlink path — md.c (condensed; non-redo, main fork)
// 1. Truncate first segment to zero (reclaim disk space)
ret = do_truncate(path.str);
// 2. Register post-checkpoint unlink request
register_unlink_segment(rlocator, forknum, 0);
// Additional segments: truncate + unlink immediately, stopping at the
// first segment number that no longer exists
for (segno = 1; ; segno++) {
    sprintf(segpath.str, "%s.%u", path.str, segno);
    if (do_truncate(segpath.str) < 0 && errno == ENOENT)
        break;
    register_forget_request(rlocator, forknum, segno);
    if (unlink(segpath.str) < 0)
        break;
}

The deferred-unlink dance prevents a scenario where a relation is dropped, its relfileNumber is immediately reused, and a crash before the next checkpoint causes WAL replay to recreate the wrong file.

mdnblocks() walks segments from the last open one, probing _mdnblocks() on each until it finds one shorter than RELSEG_SIZE. The formula:

total_blocks = segno * RELSEG_SIZE + nblocks_in_last_segment

An important side effect: calling mdnblocks() opens all active segments and adds them to the md_seg_fds array — which is why mdtruncate() must be preceded by mdnblocks() per its contract.
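The size formula can be exercised without touching a file system. In this sketch the per-segment block counts are passed in as an array, standing in for what _mdnblocks() would measure on disk; `toy_nblocks` is a hypothetical name, the segment size is assumed to be the default 131072, and the walk starts at segment 0 rather than at the last open segment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RELSEG_SIZE 131072u   /* assumed default blocks per segment */

/* Walk segments in order and stop at the first one that is not full.
 * A segment index past the end of the array counts as an empty file. */
uint32_t toy_nblocks(const uint32_t *seg_blocks, size_t nsegs)
{
    uint32_t segno = 0;

    for (;;) {
        uint32_t n = (segno < nsegs) ? seg_blocks[segno] : 0;

        assert(n <= TOY_RELSEG_SIZE);
        if (n < TOY_RELSEG_SIZE)
            return segno * TOY_RELSEG_SIZE + n;   /* the formula above */
        segno++;
    }
}
```

Two full segments followed by a 500-block segment give 2 * 131072 + 500 = 262644 blocks. A short segment ends the walk even if later files exist, which is why inactive segments left behind by truncation do not inflate the result.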

The most pervasive non-obvious pattern in both files is HOLD_INTERRUPTS() / RESUME_INTERRUPTS() bracketing every entry point. The reason is documented in the smgr.c file header: interrupt processing can trigger PROCSIGNAL_BARRIER_SMGRRELEASE, which calls smgrreleaseall(). Most of smgr.c is not reentrant, so any function that holds references to the hash table or open file descriptors must hold interrupts for its duration.

The async I/O target: pgaio_io_set_target_smgr


PG18 introduces a formal AIO target concept. smgr.c registers itself as the PGAIO_TID_SMGR target via the aio_smgr_target_info struct, providing two callbacks:

  • smgr_aio_reopen() — called in an AIO worker to reopen the file descriptor (the worker does not inherit the issuing backend’s VFD table).
  • smgr_aio_describe_identity() — returns a human-readable description of the target for error messages.
// pgaio_io_set_target_smgr — smgr.c (condensed)
void
pgaio_io_set_target_smgr(PgAioHandle *ioh, SMgrRelationData *smgr,
                         ForkNumber forknum, BlockNumber blocknum,
                         int nblocks, bool skip_fsync)
{
    PgAioTargetData *sd = pgaio_io_get_target_data(ioh);

    pgaio_io_set_target(ioh, PGAIO_TID_SMGR);

    sd->smgr.rlocator = smgr->smgr_rlocator.locator;
    sd->smgr.forkNum = forknum;
    sd->smgr.blockNum = blocknum;
    sd->smgr.nblocks = nblocks;
    sd->smgr.is_temp = SmgrIsTemp(smgr);
    sd->smgr.skip_fsync = skip_fsync && !SmgrIsTemp(smgr);
}

Figure 1 — smgr + md layering overview

flowchart TD
    BM["Buffer Manager<br/>(bufmgr.c)"]
    SMGR["smgr.c<br/>dispatch + SMgrRelation hash"]
    MD["md.c<br/>segment files + fsync pending-ops"]
    VFD["fd.c<br/>virtual file descriptor pool"]
    OS["OS kernel<br/>page cache / io_uring"]
    SYNC["sync.c<br/>pending-sync table"]
    CKPT["Checkpointer process<br/>ProcessSyncRequests"]
    AIO["storage/aio/<br/>PgAioHandle"]

    BM -->|smgrreadv / smgrwritev| SMGR
    BM -->|smgrstartreadv| SMGR
    SMGR -->|mdreadv / mdwritev| MD
    SMGR -->|mdstartreadv| MD
    MD -->|FileReadV / FileWriteV| VFD
    MD -->|FileStartReadV| AIO
    VFD -->|pread / pwrite| OS
    AIO -->|io_uring / worker| OS
    MD -->|RegisterSyncRequest| SYNC
    SYNC -->|at checkpoint| CKPT
    CKPT -->|FileSync| VFD

Figure 1 — Layering from buffer manager through smgr and md to the OS kernel. The synchronous path (left) and the PG18 async path (right of md.c) share the same SMgrRelation handle and segment-file state.

  • f_smgr (struct) — vtable definition; smgrsw[] holds the single md.c entry.
  • SMgrRelationHash — per-backend HTAB* keyed on RelFileLocatorBackend.
  • unpinned_relns — doubly-linked list of all unpinned SMgrRelationData objects; AtEOXact_SMgr destroys them.
  • smgrinit — calls smgr_init on all vtable entries; registers smgrshutdown on on_proc_exit.
  • smgropen — hash lookup or insert; sets smgr_which = 0, initializes cached fields, calls smgr_open.
  • smgrpin / smgrunpin — move handle on/off unpinned_relns; relcache pins handles it holds.
  • smgrrelease — closes all fork FDs, invalidates cached block counts; handle stays in table.
  • smgrclose — synonym for smgrrelease (deprecated distinction).
  • smgrdestroy (static) — closes FDs + removes from hash; only called on unpinned handles.
  • smgrdestroyall — destroys all unpinned handles; called by AtEOXact_SMgr.
  • smgrreleaseall — releases (not destroys) all handles; called by ProcessBarrierSmgrRelease.
  • smgrcreate / smgrextend / smgrzeroextend / smgrreadv / smgrstartreadv / smgrwritev / smgrwriteback / smgrnblocks / smgrnblocks_cached / smgrtruncate / smgrimmedsync / smgrregistersync — thin wrappers that HOLD_INTERRUPTS, dispatch through smgrsw[reln->smgr_which], and RESUME_INTERRUPTS.
  • smgrdounlinkall — drops buffers, sends sinval, then calls smgr_unlink for each fork.
  • pgaio_io_set_target_smgr — populates PgAioTargetData for AIO.
  • smgr_aio_reopen — AIO worker reopen callback.
  • AtEOXact_SMgr — end-of-transaction cleanup hook.
  • MdfdVec (struct) — {File mdfd_vfd, BlockNumber mdfd_segno}.
  • MdCxt: MemoryContext for MdfdVec arrays.
  • mdinit — creates MdCxt.
  • mdopen — zeroes md_num_open_segs for all forks (no actual file open).
  • mdclose — closes open segment FDs from last to first; shrinks array.
  • mdopenfork (static) — opens segment 0 of a fork if not already open.
  • _mdfd_getseg (static) — maps blocknum → MdfdVec*; opens missing segments incrementally.
  • _mdfd_openseg (static) — opens one segment, resizes array, fills entry.
  • _mdfd_segpath (static) — builds segment file path string.
  • _fdvec_resize (static) — palloc/repalloc the segment array (no shrink to avoid allocation in critical section).
  • mdcreate → O_CREAT|O_EXCL open of segment 0; registers dirty.
  • mdexists → mdopenfork with EXTENSION_RETURN_NULL.
  • mdunlink / mdunlinkfork (static) — main fork: truncate-to-zero + deferred unlink; other forks: immediate unlink.
  • mdextend — single-block write at or beyond EOF; updates smgr_cached_nblocks.
  • mdzeroextend — multi-block zero-fill via fallocate or pwritev.
  • mdreadv — synchronous vectored read; inner retry loop for short reads.
  • mdstartreadv — async vectored read via PgAioHandle.
  • mdwritev — synchronous vectored write; registers dirty segment.
  • mdwriteback → FileWriteback (kernel writeback hint; no fsync).
  • mdnblocks — walks all segments; side-effect: opens them all.
  • smgrnblocks_cached (in smgr.c) — returns smgr_cached_nblocks in recovery.
  • mdtruncate — iterates open segments from last to first; deactivates and closes excess segments.
  • mdregistersync — opens all segments (incl. inactive) and registers each dirty.
  • mdimmedsync → FileSync on every segment immediately.
  • mdfd — returns raw OS fd + offset; used by AIO workers.
  • register_dirty_segment (static) — RegisterSyncRequest(SYNC_REQUEST).
  • register_unlink_segment (static) — RegisterSyncRequest(SYNC_UNLINK_REQUEST).
  • register_forget_request (static) — RegisterSyncRequest(SYNC_FORGET_REQUEST).
  • buffers_to_iovec (static) — merges contiguous buffer pointers into minimum iovec count.

Position-hint table (as of 2026-06-05, commit 273fe94)

| Symbol | File | Line |
| --- | --- | --- |
| f_smgr (typedef struct) | smgr.c | 88 |
| smgrsw[] | smgr.c | 128 |
| SMgrRelationHash | smgr.c | 160 |
| smgrinit | smgr.c | 188 |
| smgropen | smgr.c | 240 |
| smgrpin | smgr.c | 296 |
| smgrunpin | smgr.c | 311 |
| smgrdestroy | smgr.c | 323 |
| smgrrelease | smgr.c | 350 |
| smgrclose | smgr.c | 374 |
| smgrdestroyall | smgr.c | 386 |
| smgrreleaseall | smgr.c | 412 |
| smgrcreate | smgr.c | 481 |
| smgrextend | smgr.c | 620 |
| smgrzeroextend | smgr.c | 649 |
| smgrreadv | smgr.c | 721 |
| smgrstartreadv | smgr.c | 753 |
| smgrwritev | smgr.c | 791 |
| smgrwriteback | smgr.c | 805 |
| smgrnblocks | smgr.c | 819 |
| smgrnblocks_cached | smgr.c | 847 |
| smgrtruncate | smgr.c | 875 |
| smgrregistersync | smgr.c | 940 |
| smgrimmedsync | smgr.c | 974 |
| AtEOXact_SMgr | smgr.c | 1017 |
| ProcessBarrierSmgrRelease | smgr.c | 1027 |
| pgaio_io_set_target_smgr | smgr.c | 1038 |
| smgr_aio_reopen | smgr.c | 1064 |
| SMgrRelationData (typedef struct) | smgr.h | 35 |
| _MdfdVec (typedef struct) | md.c | 81 |
| MdCxt | md.c | 87 |
| register_dirty_segment (proto) | md.c | 138 |
| _fdvec_resize (proto) | md.c | 144 |
| _mdfd_segpath (proto) | md.c | 147 |
| _mdfd_openseg (proto) | md.c | 149 |
| _mdfd_getseg (proto) | md.c | 151 |
| mdinit | md.c | 180 |
| mdexists | md.c | 193 |
| mdcreate | md.c | 212 |
| mdunlink | md.c | 327 |
| mdunlinkfork | md.c | 364 |
| mdextend | md.c | 477 |
| mdzeroextend | md.c | 542 |
| mdopenfork | md.c | 665 |
| mdopen | md.c | 703 |
| mdclose | md.c | 714 |
| mdprefetch | md.c | 737 |
| mdmaxcombine | md.c | 834 |
| mdreadv | md.c | 848 |
| mdstartreadv | md.c | 986 |
| mdwritev | md.c | 1060 |
| mdwriteback | md.c | 1165 |
| mdnblocks | md.c | 1224 |
| mdtruncate | md.c | 1291 |
| mdregistersync | md.c | 1380 |
| mdimmedsync | md.c | 1431 |
| mdfd | md.c | 1484 |
| register_dirty_segment | md.c | 1508 |
| register_unlink_segment | md.c | 1552 |
| register_forget_request | md.c | 1568 |

  • smgrsw[] has exactly one entry (md.c) at REL_18_STABLE. The README confirms “only the magnetic disk manager remains.” NSmgr = lengthof(smgrsw) = 1. The smgr_which field is always set to 0 in smgropen().

  • smgrclose() is a synonym for smgrrelease(), not a destructor. The comment in smgr.c says: “The SMgrRelation reference should not be used after this call. However, because we don’t keep track of the references returned by smgropen(), we don’t know if there are other references… Therefore, this is just a synonym for smgrrelease() at the moment.” Callers that want destruction must use smgrdestroy() (internal) or smgrdestroyall().

  • smgr_cached_nblocks is reliable only in recovery. smgrnblocks_cached() returns InvalidBlockNumber outside recovery regardless of what the cache holds, per the comment: “lack of a shared invalidation mechanism for changes in file size.”

  • Segment-0 of the main fork is never immediately unlinked. mdunlinkfork() truncates it to zero and calls register_unlink_segment(…, 0) for post-checkpoint deletion. Only segments ≥ 1 are unlinked immediately. Verified in the mdunlinkfork() conditional at line 376.

  • HOLD_INTERRUPTS is present on every smgr entry point. The smgr.c file header states the rationale explicitly; spot-checked in smgropen, smgrrelease, smgrdestroy, smgrextend, smgrnblocks, and smgrdounlinkall.

  • mdwriteback() does not fsync. It calls FileWriteback() (a posix_fadvise(POSIX_FADV_DONTNEED) hint or sync_file_range() on Linux), which asks the kernel to write back dirty pages but does not guarantee durability. The Assert((io_direct_flags & IO_DIRECT_DATA) == 0) at line 1168 shows it is not used on the O_DIRECT path.

  • mdzeroextend() uses posix_fallocate only for ≥ 9 blocks. Line 595: if (numblocks > 8 && file_extend_method != FILE_EXTEND_METHOD_WRITE_ZEROS) — the cutoff avoids defeating delayed allocation on some filesystems.

  • mdstartreadv() does not implement the zero_damaged_pages recovery path. The mdreadv() comment at line 939 explicitly notes: “we chose, at least for now, to not implement the zero_damaged_pages logic present in mdreadv().” An Assert(false) in mdreadv() marks this code path as targeted for future removal.

  1. RELSEG_SIZE source location. (Resolved 2026-06-05.) RELSEG_SIZE is generated at configure time from --with-segsize into pg_config.h; the template is src/include/pg_config.h.in (#undef RELSEG_SIZE, with the comment “RELSEG_SIZE is the maximum number of blocks allowed in one disk file … RELSEG_SIZE * BLCKSZ must be … changing RELSEG_SIZE requires an initdb”). The default --with-segsize=1 yields 1 GB → 131 072 blocks at the default 8 KB BLCKSZ. The #define is absent from the checked-out tree only because pg_config.h is a build artifact, not a committed file.

  2. mdwritev() crossing segment boundaries. mdwritev() (line 1092) calls elog(ERROR, "write crosses segment boundary") when nblocks_this_segment != nblocks. Is the buffer manager guaranteed never to issue a cross-segment write? Investigation path: check smgrmaxcombine() (which returns RELSEG_SIZE - segoff) — this appears to be the mechanism that prevents such writes at the smgr level.

  3. smgr_aio_reopen interrupt discipline. The function asserts !INTERRUPTS_CAN_BE_PROCESSED() (line 1076) — meaning the caller is responsible for holding interrupts. Does the AIO worker infrastructure guarantee this before invoking the callback? Investigation path: trace pgaio_io_get_target_data call sites in storage/aio/.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

  • VFS / pluggable storage layers elsewhere. MySQL/InnoDB’s fil0fil.cc implements a similar file-system abstraction with tablespace-level (not relation-level) granularity; segments are 64-page extents grouped into tablespace files. The comparison reveals a design choice: PostgreSQL’s per-relation file granularity makes DROP TABLE trivially one unlink(2) per fork, while InnoDB’s tablespace packing requires page-level space management inside files.

  • Pluggable storage managers and cloud storage. The f_smgr vtable structure was designed with alternative back-ends in mind. Projects such as pg_directio and the ongoing NVM/mmap-storage discussions revisit this surface. The PG18 AIO refactor (storage/aio/) is the first significant step toward making a non-POSIX storage manager viable, since mdstartreadv / smgr_aio_reopen abstract the fd-reopening problem that would otherwise prevent I/O workers from operating on a different process’s file descriptors.

  • Direct I/O and io_uring. O_DIRECT bypasses the kernel page cache; io_uring (Linux 5.1+) allows truly asynchronous submission without per-call system call overhead. PostgreSQL 18’s io_method=io_uring path wires through mdstartreadv → FileStartReadV → io_uring_prep_readv. The design is described in the storage/aio/README (PG18). Prior art: Microsoft SQL Server’s “Scatter-Gather I/O” and Oracle’s asynchronous I/O on AIX/Solaris. See postgres-aio.md for the full async-I/O doc.

  • Write-ahead storage (Zheap / pluggable heap). The table AM API (postgres-table-am.md) defines how a custom heap can override the storage path entirely. A columnar AM (e.g., Hydra, Citus columnar) can provide its own mdreadv-equivalent and bypass the smgr layer for its internal storage while still using the smgr path for index forks.

  • The fsync disaster (2018). Commit 9ccbe8f7 (PostgreSQL 11) added the data_sync_elevel mechanism and changed fsync-error handling after a Linux kernel bug (and a PostgreSQL behavior long predating it) was discovered: a failed fsync(2) would clear the “dirty” flag on OS pages even though the data was not durable, and PostgreSQL would then silently serve corrupted data on restart. The register_dirty_segment / ProcessSyncRequests path was audited and hardened at that time. See LWN “PostgreSQL’s fsync() surprise” (2018).

  • src/backend/storage/smgr/smgr.c (REL_18_STABLE, commit 273fe94)
  • src/backend/storage/smgr/md.c (REL_18_STABLE, commit 273fe94)
  • src/backend/storage/smgr/README
  • src/include/storage/smgr.h
  • src/include/common/relpath.h
  • postgres-buffer-manager.md — the caller of smgrreadv/smgrwritev
  • postgres-page-layout.md — the 8 KB page format smgr transfers
  • postgres-aio.md — the PG18 async-I/O subsystem wired into mdstartreadv
  • postgres-xlog-wal.md — WAL uses smgrimmedsync / smgrregistersync for fsync-after-WAL-minimal
  • postgres-table-am.md — table AM interface above smgr
  • postgres-heap-am.md — the concrete heap AM that calls smgrextend
  • Database System Concepts, Silberschatz et al., 7e, ch. 10 “Storage and File Structure”
  • Database Internals, Petrov, ch. 6 (on-disk page management and file abstraction)