PostgreSQL Shared Memory & IPC — Static Segment, Dynamic Shared Memory, and the shm_mq Message Layer

A multi-process database must give every worker the same view of the shared state — the buffer pool, the lock table, the transaction status array, and dozens of smaller structures. Two classical strategies exist:

  1. Message passing. Processes have no shared address space; they exchange state via pipes, sockets, or OS message queues. The model is clean (no aliasing, no cache-coherence races at the language level) but every read of shared state pays a kernel crossing and a copy. Oracle’s MTS and MySQL’s connection-per-thread model both push shared state through well-defined channels.

  2. Shared memory. All processes map the same virtual pages. Reads and writes are load/store instructions, not kernel calls. The cost is paid up front (map the region once) and amortized over every subsequent access. Protection must be enforced by the processes themselves, typically via lightweight locks or atomic operations inside the shared region.

PostgreSQL is unambiguously in the second camp. Every backend, every auxiliary process, and every parallel worker attaches to the same shared-memory region created by the postmaster. The buffer pool, the PGPROC array, the lock table, the sinval ring — these are not copies per process; they are single objects in a shared address space that all processes reference through inherited (post-fork) or re-established (post-exec, Windows) pointers.

Database System Concepts (Silberschatz et al., ch. 17) frames shared memory as the natural substrate for buffer management in a shared-disk architecture: “the buffer pool is maintained in shared memory so that all processes can access it directly.” Database Internals (Petrov, ch. 4) notes the same; the buffer manager’s descriptor array and page frames live in shared memory so no inter-process copy is ever needed for a cache hit.

Two design questions define the architecture once shared memory is chosen:

  1. How large is the segment, and who decides? A fixed-at-startup approach (PostgreSQL’s choice) lets the kernel reserve all physical pages at once and eliminates fragmentation at the cost of requiring a restart to resize. A growable-segment approach allows online resizing but complicates pointer stability — any pointer into the region must be invalidated if the base address shifts.

  2. How is the region subdivided? A bump allocator with a directory (PostgreSQL’s ShmemAlloc + ShmemIndex) is simple and prevents fragmentation, but allocations are permanent. A slab or pool allocator per object class (InnoDB’s buf_pool, for instance) trades simplicity for reclamability.

PostgreSQL adds a third layer: dynamic shared memory (dsm), a separate family of on-demand segments created and destroyed at runtime. DSM sidesteps the fixed-at-startup constraint for transient structures like parallel query workers — but the static segment remains the canonical home for all long-lived global state.

The textbook gives the rationale; this section names the engineering conventions that nearly every multi-process DBMS adopts when it chooses shared memory as its IPC substrate.

Every subsystem announces its memory need at startup via a sizing function. A central coordinator sums those needs, creates the segment once, and then hands out slices via a bump (linear) allocator. This pattern appears in Oracle (the SGA), InnoDB (the buffer pool’s buf_pool_init), and PostgreSQL (the ShmemAlloc arena). The invariant is that the bump pointer never moves backward: once a slice is handed out, it belongs to that subsystem forever. The segment never leaks because nothing is ever truly freed — the bump pointer only advances.

A flat hash table inside the static segment maps human-readable string names to (address, size) pairs for every registered object. Any process, regardless of whether it created the object or merely attached to the segment later (EXEC_BACKEND on Windows; or a backend that re-initializes after an unusual restart), can locate a named object by looking it up in the directory. The alternative — hard-coding offsets — was the original POSTGRES design and is fragile: adding one new field changes every downstream offset.

Separate lightweight dynamic segments for transient structures

Long-lived state lives in the static segment; the static segment is sized conservatively for the worst case. Transient structures — the memory a parallel query worker needs for the duration of one query, the shared tuple queue between a parallel scan and a gather node — should not inflate the static segment permanently. Every mature multi-process engine therefore adds a second-tier dynamic allocator: Oracle’s PGA (process-private), PostgreSQL’s DSM, CockroachDB’s goroutine-local arenas. The requirement is that the backing pages be unmapped and returned to the OS when the structure’s owner exits, even if the owner crashes.

Once two processes share a region, a ring-buffer message queue is a natural addition: one process writes into a slot, increments the write pointer, and optionally wakes the reader via a latch. The reader consumes from the read pointer, increments it, and signals back. No kernel copy is involved. This pattern (shared-memory ring buffer plus a latch for wakeup) is used by PostgreSQL’s shm_mq, Oracle’s in-SGA message queues, and the kfifo ring buffers in the Linux kernel.

| Concept | PostgreSQL name |
| --- | --- |
| Static shared segment | PGShmemHeader / ShmemSegHdr (shmem.c) |
| Segment creation | CreateSharedMemoryAndSemaphores (ipci.c) |
| Sizing function (per subsystem) | *ShmemSize() family, summed by CalculateShmemSize |
| Bump allocator | ShmemAlloc / ShmemAllocRaw (shmem.c) |
| Shmem index / name registry | ShmemIndex HTAB + ShmemInitStruct / ShmemInitHash |
| Dynamic segment | dsm_segment / dsm_create / dsm_attach / dsm_detach (dsm.c) |
| Platform dispatch table | dsm_impl_op (dsm_impl.c) |
| Ring-buffer message queue on DSM | shm_mq / shm_mq_create / shm_mq_attach / shm_mq_send / shm_mq_receive (shm_mq.c) |
| Variable-size slab allocator on DSM | dsa_area / dsa_create_ext / dsa_allocate_extended / dsa_free (dsa.c) |
| Add-in request hook | RequestAddinShmemSpace + shmem_startup_hook |

At postmaster startup — before any backend is forked — two functions run in sequence:

// CalculateShmemSize — src/backend/storage/ipc/ipci.c
Size
CalculateShmemSize(int *num_semaphores)
{
    Size        size = 100000;                          /* baseline for small objects */

    size = add_size(size, BufferManagerShmemSize());
    size = add_size(size, LockManagerShmemSize());
    size = add_size(size, ProcGlobalShmemSize());
    size = add_size(size, XLOGShmemSize());
    size = add_size(size, LWLockShmemSize());
    size = add_size(size, ProcArrayShmemSize());
    // ... ~35 more ShmemSize() calls condensed ...
    size = add_size(size, AioShmemSize());              /* PG18 async I/O */
    size = add_size(size, total_addin_request);         /* add-ins */
    size = add_size(size, 8192 - (size % 8192));        /* page-align */
    return size;
}
// CreateSharedMemoryAndSemaphores — src/backend/storage/ipc/ipci.c
void
CreateSharedMemoryAndSemaphores(void)
{
    PGShmemHeader *shim;        /* filled in by PGSharedMemoryCreate */
    PGShmemHeader *seghdr;
    int         numSemas;
    Size        size = CalculateShmemSize(&numSemas);

    seghdr = PGSharedMemoryCreate(size, &shim);     /* SysV / mmap */
    InitShmemAccess(seghdr);
    PGReserveSemaphores(numSemas);
    InitShmemAllocation();
    CreateOrAttachShmemStructs();       /* calls every ShmemInit*() */
    dsm_postmaster_startup(shim);       /* bootstrap DSM control segment */
    if (shmem_startup_hook)
        shmem_startup_hook();           /* add-in extensions */
}

The end-to-end sizing-then-creation sequence at postmaster startup is a single straight-line pipeline: sum the needs, create one segment, set up the bump allocator, then run every subsystem’s ShmemInit* through the shmem index.

flowchart TD
    A["CalculateShmemSize<br/>sum ~37 *ShmemSize() + total_addin_request"] --> B["PGSharedMemoryCreate<br/>SysV shmget / POSIX mmap one segment"]
    B --> C["InitShmemAccess<br/>set ShmemBase / ShmemEnd / ShmemSegHdr"]
    C --> D["PGReserveSemaphores"]
    D --> E["InitShmemAllocation<br/>ShmemAllocUnlocked(ShmemLock spinlock)"]
    E --> F["CreateOrAttachShmemStructs<br/>master initializer"]
    F --> G["InitShmemIndex<br/>ShmemIndex HTAB at arena base"]
    G --> H["per-subsystem ShmemInitStruct / ShmemInitHash<br/>BufferManagerShmemInit, LockManagerShmemInit, InitProcGlobal, ..."]
    H --> I["ShmemAllocRaw<br/>CACHELINEALIGN + advance freeoffset under ShmemLock"]
    F --> J["dsm_postmaster_startup<br/>bootstrap DSM control segment"]
    F --> K["shmem_startup_hook<br/>add-in ShmemInitStruct calls"]

Figure 1 — static-segment sizing and creation at postmaster startup. CalculateShmemSize is called twice in practice: once by InitializeShmemGUCs to publish shared_memory_size, and again here as the authoritative size handed to PGSharedMemoryCreate. Every leaf of the ShmemInitStruct fan-out resolves through the same ShmemAllocRaw bump path, so the order in CreateOrAttachShmemStructs is the order slices are stamped into the segment.
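The sum-then-pad step can be tried outside PostgreSQL. The following is a minimal sketch, not the real implementation: `sketch_add_size`, `sketch_total_size`, and the two sizing callbacks are hypothetical stand-ins, and the sketch reports overflow through a flag where PostgreSQL's `add_size` raises an error instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE 8192    /* rounding unit, matching the excerpt above */

/* One sizing callback per subsystem: stand-ins for the *ShmemSize() family. */
typedef size_t (*sketch_size_fn) (void);

static size_t sketch_buffers_size(void) { return 128 * 8192; }  /* made-up value */
static size_t sketch_locks_size(void)   { return 64 * 1024; }   /* made-up value */

/* Overflow-checked addition: clears *ok and saturates instead of wrapping. */
static size_t
sketch_add_size(size_t a, size_t b, bool *ok)
{
    if (b > SIZE_MAX - a)
    {
        *ok = false;
        return SIZE_MAX;
    }
    return a + b;
}

/* Sum every callback plus the add-in request, then pad to a page multiple. */
static size_t
sketch_total_size(const sketch_size_fn *fns, size_t nfns,
                  size_t addin_request, bool *ok)
{
    size_t      size = 100000;  /* baseline, as in the excerpt */

    *ok = true;
    for (size_t i = 0; i < nfns; i++)
        size = sketch_add_size(size, fns[i](), ok);
    size = sketch_add_size(size, addin_request, ok);
    /* same padding rule as the excerpt: adds between 1 and SKETCH_PAGE bytes */
    size = sketch_add_size(size, SKETCH_PAGE - (size % SKETCH_PAGE), ok);
    return size;
}
```

Note that the padding expression never adds zero: a size that is already a multiple of 8192 still grows by one full page, which is harmless for a one-time reservation.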

The segment header lives at the base of the mapped region:

// PGShmemHeader — src/include/storage/pg_shmem.h
typedef struct PGShmemHeader
{
    int32       magic;          /* identifies a live Postgres segment */
    pid_t       creatorPID;     /* postmaster PID */
    Size        totalsize;      /* total size of the segment */
    Size        freeoffset;     /* bump pointer: next free byte */
    dsm_handle  dsm_control;    /* handle of the DSM control segment */
    dev_t       device;         /* data-directory device (Unix only) */
    ino_t       inode;          /* data-directory inode (Unix only) */
} PGShmemHeader;

freeoffset is the bump pointer. ShmemAlloc advances it through the internal routine ShmemAllocRaw:

// ShmemAllocRaw — src/backend/storage/ipc/shmem.c
static void *
ShmemAllocRaw(Size size, Size *allocated_size)
{
    Size        newStart;
    Size        newFree;
    void       *newSpace;

    size = CACHELINEALIGN(size);    /* align to cache-line boundary */
    *allocated_size = size;         /* report the padded size to the caller */

    SpinLockAcquire(ShmemLock);
    newStart = ShmemSegHdr->freeoffset;
    newFree = newStart + size;
    if (newFree <= ShmemSegHdr->totalsize)
    {
        newSpace = (char *) ShmemBase + newStart;
        ShmemSegHdr->freeoffset = newFree;
    }
    else
        newSpace = NULL;            /* out of shared memory */
    SpinLockRelease(ShmemLock);

    return newSpace;
}

Three things are worth noting. First, CACHELINEALIGN pads each allocation to a cache-line boundary — a micro-optimization to prevent false sharing between adjacent objects in the segment. Second, the allocator holds ShmemLock (a spinlock) only for the bump-pointer update, not for the write into the allocated region. Third, once freeoffset is advanced it is never decremented: shared-memory allocations are permanent.
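The three points above fit in a few lines of standalone C. This is a minimal sketch under stated assumptions: the `sketch_arena` type and its functions are hypothetical, the 64-byte cache line is assumed, and the spinlock is left as a comment because the sketch is single-threaded.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_LINE 64    /* assumed line size for this sketch */
#define SKETCH_ALIGN(n) \
    (((n) + SKETCH_CACHE_LINE - 1) & ~((size_t) SKETCH_CACHE_LINE - 1))

typedef struct sketch_arena
{
    unsigned char *base;        /* start of the region */
    size_t      total;          /* region size in bytes */
    size_t      free_offset;    /* bump pointer: next free byte */
} sketch_arena;

static void
sketch_arena_init(sketch_arena *a, void *mem, size_t total)
{
    a->base = mem;
    a->total = total;
    a->free_offset = 0;
}

/* Returns NULL when the request does not fit; nothing is ever freed. */
static void *
sketch_alloc(sketch_arena *a, size_t size)
{
    size_t      start;

    if (size > SIZE_MAX - SKETCH_CACHE_LINE)
        return NULL;            /* rounding up would overflow */
    size = SKETCH_ALIGN(size);  /* pad so neighbours never share a line */

    /* a shared-memory allocator would hold a spinlock from here ... */
    start = a->free_offset;
    if (size > a->total - start)
        return NULL;
    a->free_offset = start + size;  /* only ever moves forward */
    /* ... to here; the caller fills the slice after the lock is dropped */

    return a->base + start;
}
```

Because every request is rounded up to a whole line, a 10-byte object still consumes 64 bytes. That trade is acceptable here because the arena holds a bounded number of long-lived objects.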

Every object created through ShmemInitStruct is registered in ShmemIndex, a hash table at a known offset inside the segment (a bare ShmemAlloc call hands out memory without registering a name). ShmemInitStruct either creates-and-registers or looks-up-and-returns a named object:

// ShmemInitStruct — src/backend/storage/ipc/shmem.c
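void *
ShmemInitStruct(const char *name, Size size, bool *foundPtr)
{
    ShmemIndexEnt *result;
    void       *structPtr;

    LWLockAcquire(ShmemIndexLock, LW_EXCLUSIVE);

    /* Look up or insert into ShmemIndex */
    result = (ShmemIndexEnt *)
        hash_search(ShmemIndex, name, HASH_ENTER_NULL, foundPtr);
    /* condensed: the real function reports an error if result is NULL */
    if (!*foundPtr)
    {
        structPtr = ShmemAlloc(size);
        result->location = structPtr;
        result->size = size;
    }
    else
        structPtr = result->location;

    LWLockRelease(ShmemIndexLock);
    return structPtr;
}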

This is the re-attach path that makes EXEC_BACKEND work: a Windows backend that cannot fork() calls AttachSharedMemoryStructs() which re-runs CreateOrAttachShmemStructs(). Every ShmemInitStruct call returns *foundPtr = true and hands back the pointer from the index — no re-allocation occurs.
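The create-or-attach contract can be sketched without PostgreSQL. The code below is illustrative only: `reg_segment` and `reg_init_struct` are hypothetical names, a linear scan stands in for the hash table, and the index stores offsets where the excerpt above stores a pointer in `location`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define REG_MAX_ENTRIES 16
#define REG_NAME_LEN    48

typedef struct reg_entry
{
    char        name[REG_NAME_LEN];
    size_t      offset;         /* where the object starts inside data[] */
    size_t      size;           /* size requested at creation */
} reg_entry;

typedef struct reg_segment
{
    size_t      free_offset;    /* bump pointer into data[] */
    size_t      nentries;
    reg_entry   index[REG_MAX_ENTRIES];
    unsigned char data[4096];
} reg_segment;

/*
 * Create the named object on first call, return the existing one afterwards.
 * *found tells the caller whether it still has to initialize the contents.
 * Returns NULL on a size mismatch or when the segment or index is full.
 */
static void *
reg_init_struct(reg_segment *seg, const char *name, size_t size, bool *found)
{
    size_t      need = (size + 7) & ~(size_t) 7;    /* keep slices 8-byte aligned */
    reg_entry  *e;

    for (size_t i = 0; i < seg->nentries; i++)
    {
        if (strcmp(seg->index[i].name, name) == 0)
        {
            *found = true;
            if (seg->index[i].size != size)
                return NULL;    /* same name, different layout */
            return seg->data + seg->index[i].offset;
        }
    }

    *found = false;
    if (seg->nentries == REG_MAX_ENTRIES ||
        strlen(name) >= REG_NAME_LEN ||
        need < size ||
        need > sizeof seg->data - seg->free_offset)
        return NULL;

    e = &seg->index[seg->nentries++];
    strcpy(e->name, name);
    e->offset = seg->free_offset;
    e->size = size;
    seg->free_offset += need;
    return seg->data + e->offset;
}
```

A second caller that asks for the same name gets the first caller's address and `*found == true`, which is the signal to skip initialization.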

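The cluster-wide XID and OID counters (nextXid, nextOid, latestCompletedXid) live in a struct reached through a dedicated global pointer and used by access/transam/varsup.c. Since PostgreSQL 17 that pointer is TransamVariables (typed as TransamVariablesData *), and VarsupShmemInit registers the struct through ShmemInitStruct like any other object. In PostgreSQL 16 and earlier the same struct was called ShmemVariableCache (VariableCacheData *) and was allocated directly by InitShmemAllocation, outside the index.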

CreateOrAttachShmemStructs (ipci.c) is the master initializer. Its call sequence defines what lives in the static segment:

flowchart TD
    A[CreateOrAttachShmemStructs] --> B[CreateLWLocks]
    A --> C[InitShmemIndex]
    A --> D[dsm_shmem_init]
    A --> E[VarsupShmemInit / XLOGShmemInit / XLogRecoveryShmemInit]
    A --> F[CLOGShmemInit / CommitTsShmemInit / SUBTRANSShmemInit]
    A --> G[BufferManagerShmemInit]
    A --> H[LockManagerShmemInit / PredicateLockShmemInit]
    A --> I[InitProcGlobal / ProcArrayShmemInit / BackendStatusShmemInit]
    A --> J[SharedInvalShmemInit]
    A --> K[PMSignalShmemInit / ProcSignalShmemInit / CheckpointerShmemInit]
    A --> L[AutoVacuumShmemInit / ReplicationSlotsShmemInit / WalSndShmemInit]
    A --> M[StatsShmemInit / AioShmemInit]

Figure 2 — CreateOrAttachShmemStructs call graph. Each leaf calls ShmemInitStruct (or its own bump-allocator path) to stake out a slice of the static segment. The order is fixed: LWLocks and the shmem index must come first because every subsequent call uses them.

Every object named above — the buffer descriptors, the PGPROC array, the lock table, the WAL buffers, the sinval ring, the cumulative stats area (PG15+), the async I/O subsystem (PG18) — lives inside this one segment at a pointer inherited by every forked child.

The static segment is sized once at postmaster startup and cannot grow. For transient structures — parallel query worker state, logical replication worker queues — PostgreSQL uses dynamic shared memory: on-demand segments created and destroyed at runtime.

// dsm_segment (backend-local handle) — src/backend/storage/ipc/dsm.c
struct dsm_segment
{
    dlist_node  node;           /* link in backend's segment list */
    ResourceOwner resowner;     /* tracks ownership for cleanup */
    dsm_handle  handle;         /* shared name (integer key) */
    uint32      control_slot;   /* slot in the DSM control segment */
    void       *impl_private;   /* platform-specific private data */
    void       *mapped_address; /* where this backend mapped the segment */
    Size        mapped_size;    /* size of this backend's mapping */
    slist_head  on_detach;      /* callbacks fired on dsm_detach() */
};

The public API is three calls:

// dsm_create / dsm_attach / dsm_detach — src/backend/storage/ipc/dsm.c
dsm_segment *dsm_create(Size size, int flags); /* line 516 */
dsm_segment *dsm_attach(dsm_handle h); /* line 665 */
void dsm_detach(dsm_segment *seg); /* line 803 */

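dsm_create reserves a slot in the DSM control segment (a special segment created at postmaster startup by dsm_postmaster_startup), creates the new segment, maps it into the caller’s address space, and returns a dsm_segment handle. dsm_attach looks up the handle in the control segment and maps it. dsm_detach unmaps the segment and, if that was the last reference, destroys it via the platform backend. This section counts references as attached mappings; internally dsm.c starts the control slot’s refcnt at 2 and treats a value of 1 as “being destroyed”, so the last detach takes the stored count to 1 rather than 0.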

Platform dispatch is handled by dsm_impl_op:

// dsm_impl_op — src/backend/storage/ipc/dsm_impl.c
bool
dsm_impl_op(dsm_op op, dsm_handle handle, Size request_size,
            void **impl_private, void **mapped_address,
            Size *mapped_size, int elevel)
{
    switch (dynamic_shared_memory_type)
    {
        case DSM_IMPL_POSIX:    return dsm_impl_posix(...);     /* shm_open */
        case DSM_IMPL_SYSV:     return dsm_impl_sysv(...);      /* shmget */
        case DSM_IMPL_WINDOWS:  return dsm_impl_windows(...);
        case DSM_IMPL_MMAP:     return dsm_impl_mmap(...);      /* mmap file */
    }
}

The dynamic_shared_memory_type GUC selects the implementation at runtime. The default is posix where POSIX shared memory is available (Linux, macOS) and windows on Windows; sysv and mmap can be chosen explicitly, and mmap is not the default on any platform. The rest of the DSM stack sees only the dsm_segment handle and never knows which backend was used.

The DSM lifecycle:

flowchart TD
    Start([start]) -->|dsm_create| Created[Created refcnt=1]
    Created -->|dsm_attach in other backend| Attached[Attached refcnt&gt;1]
    Created -->|dsm_pin_segment| Pinned[Pinned survives creator exit]
    Pinned -->|dsm_unpin_segment| Attached
    Attached -->|dsm_detach with refcnt&gt;1 — unmap only| Attached
    Attached -->|dsm_detach drops refcnt to 0 — destroy| Detached[Detached]
    Created -->|dsm_detach drops refcnt to 0 — destroy| Detached
    Detached -->|OS resources freed| Done([end])

Figure 3 — DSM segment lifecycle. dsm_pin_segment bumps the reference count to prevent a segment from being destroyed when its creator exits; it is used by DSA areas to keep backing segments alive across creator exit.
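The state machine in Figure 3 reduces to a counter and two flags. The sketch below follows the figure's simplified counting (one reference per mapping, plus one while pinned); the `life_*` names are hypothetical and the real bookkeeping in dsm.c differs in detail.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct life_slot
{
    uint32_t    handle;     /* integer name other processes attach by */
    uint32_t    refcnt;     /* one per mapping, plus one while pinned */
    bool        pinned;
    bool        destroyed;  /* set once the backing memory is released */
} life_slot;

/* Drop one reference; returns true once the segment has been destroyed. */
static bool
life_release(life_slot *s)
{
    if (s->refcnt > 0 && --s->refcnt == 0)
        s->destroyed = true;
    return s->destroyed;
}

static void
life_create(life_slot *s, uint32_t handle)
{
    s->handle = handle;
    s->refcnt = 1;          /* the creator's own mapping */
    s->pinned = false;
    s->destroyed = false;
}

/* A second process maps the segment; fails if it is already gone. */
static bool
life_attach(life_slot *s)
{
    if (s->destroyed)
        return false;
    s->refcnt++;
    return true;
}

/* Unmap; returns true if this detach destroyed the segment. */
static bool
life_detach(life_slot *s)
{
    return life_release(s);
}

/* A pin is an extra reference that belongs to no process. */
static bool
life_pin(life_slot *s)
{
    if (s->destroyed || s->pinned)
        return false;
    s->pinned = true;
    s->refcnt++;
    return true;
}

/* Returns true if removing the pin destroyed the segment. */
static bool
life_unpin(life_slot *s)
{
    if (!s->pinned)
        return false;
    s->pinned = false;
    return life_release(s);
}
```

With a pin in place the creator can detach and exit while the segment survives for a later attach; removing the pin after the last detach is what finally releases it.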

Layer 4: shm_mq — a lock-free ring-buffer message queue

shm_mq is a single-producer/single-consumer ring buffer allocated inside a DSM segment. It is the transport layer for the parallel-query tuple stream (from a parallel worker to a Gather node) and for background-worker result delivery.

// shm_mq — src/backend/storage/ipc/shm_mq.c
struct shm_mq
{
    slock_t     mq_mutex;               /* protects mq_receiver/sender */
    PGPROC     *mq_receiver;            /* set once, then read-only */
    PGPROC     *mq_sender;              /* set once, then read-only */
    pg_atomic_uint64 mq_bytes_read;     /* consumer position */
    pg_atomic_uint64 mq_bytes_written;  /* producer position */
    Size        mq_ring_size;           /* ring buffer capacity */
    bool        mq_detached;            /* either side has gone away */
    uint8       mq_ring_offset;         /* offset of ring within struct */
    char        mq_ring[FLEXIBLE_ARRAY_MEMBER];
};

The protocol is straightforward. After shm_mq_create sizes the ring, the sender calls shm_mq_set_sender and the receiver calls shm_mq_set_receiver to register their PGPROC pointers. Send and receive are then lock-free: the sender writes into mq_ring[mq_bytes_written % mq_ring_size] and advances mq_bytes_written; the receiver reads from mq_ring[mq_bytes_read % mq_ring_size] and advances mq_bytes_read. Only when the ring is full (sender) or empty (receiver) does a side block: it sleeps in WaitLatch on its own process latch, and the counterparty wakes it with SetLatch on the latch in the sleeper’s PGPROC once it has made progress.

sequenceDiagram
    participant W as Worker (sender)
    participant MQ as shm_mq ring
    participant G as Gather (receiver)

    W->>MQ: shm_mq_send(data, nbytes)
    note over MQ: write to ring[written % size]<br/>advance mq_bytes_written
    MQ-->>G: (reader polls or is woken)
    G->>MQ: shm_mq_receive()
    note over MQ: read from ring[read % size]<br/>advance mq_bytes_read
    G-->>W: (ring has space — sender unblocks)
    W->>MQ: shm_mq_detach() on exit
    note over MQ: mq_detached = true<br/>wake counterparty latch

Figure 4 — shm_mq send/receive protocol. The ring buffer itself requires no lock for the data path. mq_mutex protects only the initial assignment of mq_receiver / mq_sender. Memory ordering between producer and consumer is maintained by the atomic mq_bytes_written / mq_bytes_read with appropriate barriers.
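The counter discipline in Figure 4 can be reproduced with C11 atomics. What follows is a sketch of the general single-producer/single-consumer technique, not shm_mq's code: `spsc_ring`, `spsc_send`, and `spsc_receive` are hypothetical, the ring is a fixed 16 bytes, and message framing, latches, and detach handling are all omitted.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 16    /* bytes; a power of two keeps the modulo cheap */

typedef struct spsc_ring
{
    _Atomic uint64_t bytes_read;        /* advanced only by the consumer */
    _Atomic uint64_t bytes_written;     /* advanced only by the producer */
    unsigned char ring[RING_SIZE];
} spsc_ring;

/* Copy in as many bytes as fit; returns the count actually written. */
static size_t
spsc_send(spsc_ring *q, const void *data, size_t n)
{
    uint64_t    wr = atomic_load_explicit(&q->bytes_written, memory_order_relaxed);
    uint64_t    rd = atomic_load_explicit(&q->bytes_read, memory_order_acquire);
    size_t      space = RING_SIZE - (size_t) (wr - rd);

    if (n > space)
        n = space;
    for (size_t i = 0; i < n; i++)
        q->ring[(wr + i) % RING_SIZE] = ((const unsigned char *) data)[i];
    /* release: the bytes above become visible before the new position does */
    atomic_store_explicit(&q->bytes_written, wr + n, memory_order_release);
    return n;
}

/* Copy out up to n bytes; returns the count actually read. */
static size_t
spsc_receive(spsc_ring *q, void *out, size_t n)
{
    uint64_t    rd = atomic_load_explicit(&q->bytes_read, memory_order_relaxed);
    uint64_t    wr = atomic_load_explicit(&q->bytes_written, memory_order_acquire);
    size_t      avail = (size_t) (wr - rd);

    if (n > avail)
        n = avail;
    for (size_t i = 0; i < n; i++)
        ((unsigned char *) out)[i] = q->ring[(rd + i) % RING_SIZE];
    /* release: the slots are free for reuse only after the copy is done */
    atomic_store_explicit(&q->bytes_read, rd + n, memory_order_release);
    return n;
}
```

Each counter has exactly one writer, which is why no lock is needed: the producer only ever trusts a possibly stale `bytes_read` (it may see less free space than exists, never more), and the consumer is in the mirror-image position. A return value of 0 is where a real queue would sleep on its latch.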

Layer 5: DSA — a variable-size slab allocator over DSM

shm_mq moves a stream of length-prefixed messages between one sender and one receiver; for variable-size shared data structures (hash tables, trees, or any heap-allocated object that must be visible to multiple processes), PostgreSQL provides the Dynamic Shared-memory Area (dsa). DSA carves a dsa_area control object out of a DSM segment and then manages a pool of further DSM segments from which it allocates:

// dsa_pointer — src/include/utils/dsa.h
typedef uint64 dsa_pointer;
/*
* Encoded as: (segment_number << DSA_OFFSET_WIDTH) | offset_within_segment
* DSA_OFFSET_WIDTH = 40 bits on 64-bit, giving 1 TB per segment.
* Segment number identifies which DSM segment holds the data.
*/
#define DSA_POINTER_FORMAT "%016" PRIx64
// dsa_create_ext — src/backend/utils/mmgr/dsa.c (line 421)
dsa_area *dsa_create_ext(int tranche_id,
size_t init_segment_size,
size_t max_segment_size);
// dsa_allocate_extended — src/backend/utils/mmgr/dsa.c (line 671)
dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
// dsa_free — src/backend/utils/mmgr/dsa.c (line 826)
void dsa_free(dsa_area *area, dsa_pointer dp);
// dsa_get_address — src/backend/utils/mmgr/dsa.c (line 942)
void *dsa_get_address(dsa_area *area, dsa_pointer dp);

dsa_pointer is a 64-bit integer that encodes a (segment number, offset) pair. It is position-independent — the same dsa_pointer can be passed between processes and each resolves it against its own local mapping of that DSM segment via dsa_get_address. The dsa_area_control structure, embedded at the start of the first DSM segment, holds the pool freelist and segment directory; any backend that calls dsa_attach gets its own dsa_area shell pointing at the same control object.
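The encoding and the per-process resolution step can be sketched in isolation. This is an illustration of the (segment, offset) scheme with the 40-bit offset width quoted above; every `sketch_*` name is hypothetical, and the four-entry mapping table stands in for the per-backend segment map that a real dsa_area maintains.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sketch_dsa_pointer;

#define SKETCH_OFFSET_WIDTH 40
#define SKETCH_OFFSET_MASK  ((UINT64_C(1) << SKETCH_OFFSET_WIDTH) - 1)
#define SKETCH_MAX_SEGMENTS 4   /* tiny on purpose; illustration only */

/* Pack a segment number and a byte offset into one integer. */
static sketch_dsa_pointer
sketch_make_pointer(uint64_t segment, uint64_t offset)
{
    return (segment << SKETCH_OFFSET_WIDTH) | (offset & SKETCH_OFFSET_MASK);
}

static uint64_t
sketch_segment_of(sketch_dsa_pointer p)
{
    return p >> SKETCH_OFFSET_WIDTH;
}

static uint64_t
sketch_offset_of(sketch_dsa_pointer p)
{
    return p & SKETCH_OFFSET_MASK;
}

/* Per-process view: where this process happened to map each segment. */
typedef struct sketch_area
{
    unsigned char *segment_base[SKETCH_MAX_SEGMENTS];
} sketch_area;

/* Resolve against the local mapping; NULL if the segment is not mapped. */
static void *
sketch_get_address(const sketch_area *area, sketch_dsa_pointer p)
{
    uint64_t    seg = sketch_segment_of(p);

    if (seg >= SKETCH_MAX_SEGMENTS || area->segment_base[seg] == NULL)
        return NULL;
    return area->segment_base[seg] + sketch_offset_of(p);
}
```

Two processes that map segment 2 at different base addresses both resolve the same integer to "byte 100 of segment 2", which is the property that lets the value be stored inside shared data structures.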

Putting the dynamic side together: a parallel worker

The three dynamic layers compose in access/transam/parallel.c. The leader’s InitializeParallelDSM sizes and creates one DSM segment, lays a shm_toc table of contents over it, and carves per-worker shm_mq error/tuple queues out of it; dsa_create_in_place can place a DSA area inside the same segment when a parallel node needs variable-size shared scratch (e.g. a parallel hash join). Each worker then re-attaches by handle and registers itself as the shm_mq sender.

flowchart TD
    subgraph Leader
        L1["GetSessionDsmHandle"] --> L2["dsm_create(segsize, DSM_CREATE_NULL_IF_MAXSEGMENTS)"]
        L2 --> L3["shm_toc_create(PARALLEL_MAGIC, seg base)"]
        L3 --> L4["shm_mq_create(start, PARALLEL_ERROR_QUEUE_SIZE)<br/>per worker"]
        L4 --> L5["shm_mq_set_receiver(mq, MyProc)"]
        L5 --> L6["shm_mq_attach(mq, seg, NULL) -> error_mqh"]
        L3 -.optional.-> LD["dsa_create_in_place(place, size, tranche, seg)<br/>variable-size shared scratch"]
        L6 --> L7["LaunchParallelWorkers -> RegisterDynamicBackgroundWorker<br/>handle = dsm_segment_handle(seg)"]
    end
    subgraph Worker
        W1["ParallelWorkerMain(main_arg = dsm_handle)"] --> W2["dsm_attach(DatumGetUInt32(main_arg))"]
        W2 --> W3["shm_toc_attach(PARALLEL_MAGIC, seg base)"]
        W3 --> W4["shm_mq_set_sender(mq, MyProc)"]
        W4 --> W5["shm_mq_attach(mq, seg, NULL) -> mqh"]
        W3 -.optional.-> WD["dsa_attach_in_place(place, seg)"]
    end
    L7 ==>|fork/exec inherits handle| W1
    W5 ==>|tuples + errors via ring| L6

Figure 5 — the dynamic side for one parallel query. A single dsm_create segment is the shared substrate; shm_toc is the offset directory the worker uses to relocate every sub-object (queues, DSA control, plan state) against its own mapping; shm_mq carries the error/tuple streams; an optional in-place dsa_area supplies variable-size allocations. The worker never sees a raw pointer from the leader — only the integer dsm_handle and TOC keys cross the process boundary.

Extension point: RequestAddinShmemSpace and shmem_startup_hook

Extensions loaded via shared_preload_libraries can grow the static segment before it is created:

// RequestAddinShmemSpace — src/backend/storage/ipc/ipci.c
void
RequestAddinShmemSpace(Size size)
{
    if (!process_shmem_requests_in_progress)
        elog(FATAL, "cannot request additional shared memory "
             "outside shmem_request_hook");
    total_addin_request = add_size(total_addin_request, size);
}

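RequestAddinShmemSpace may only be called from shmem_request_hook, which the postmaster runs after loading shared_preload_libraries and before CalculateShmemSize folds total_addin_request into the segment size. A second hook, shmem_startup_hook, fires after CreateOrAttachShmemStructs and gives the extension the opportunity to call ShmemInitStruct for its own objects. Together the two hooks are the core extension API for persistent shared-memory structures.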
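The gating rule is easy to model: requests count only while a window is open around the request hook. The simulation below is a sketch with hypothetical `sim_*` names; it returns false where the real function raises FATAL, and it does not touch any PostgreSQL API.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*sim_hook) (void);

static bool     sim_requests_in_progress = false;
static size_t   sim_total_addin_request = 0;
static sim_hook sim_request_hook = NULL;    /* stand-in for shmem_request_hook */

/* Stand-in for RequestAddinShmemSpace: refuses requests outside the window. */
static bool
sim_request_space(size_t size)
{
    if (!sim_requests_in_progress)
        return false;
    sim_total_addin_request += size;
    return true;
}

/* Stand-in for the startup step: open the window, run the hook, close it. */
static void
sim_process_requests(void)
{
    sim_requests_in_progress = true;
    if (sim_request_hook)
        sim_request_hook();
    sim_requests_in_progress = false;
}

/* What a hypothetical extension would install as its request hook. */
static void
my_ext_request(void)
{
    (void) sim_request_space(1024);
}
```

The point of the window is ordering: once sizing has run, the total is already baked into the segment, so a late request could never be honoured and is rejected outright.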

Static segment sizing and initialization (ipci.c)

  • CalculateShmemSize — sums all per-subsystem *ShmemSize() sizing functions plus total_addin_request; called by CreateSharedMemoryAndSemaphores and by InitializeShmemGUCs.
  • CreateSharedMemoryAndSemaphores — postmaster-only entry point; calls PGSharedMemoryCreate, InitShmemAccess, InitShmemAllocation, CreateOrAttachShmemStructs, and dsm_postmaster_startup.
  • CreateOrAttachShmemStructs — master initializer; calls every subsystem’s ShmemInit* function in dependency order.
  • AttachSharedMemoryStructs — EXEC_BACKEND re-attach path (Windows); re-runs CreateOrAttachShmemStructs to rebuild local pointers.
  • RequestAddinShmemSpace — extension API for pre-creation size reservation.
  • InitializeShmemGUCs — computes shared_memory_size and shared_memory_size_in_huge_pages GUC values.
  • InitShmemAccess — sets module-level ShmemBase, ShmemEnd, and ShmemSegHdr from the PGShmemHeader pointer.
  • InitShmemAllocation — allocates and initializes ShmemLock (spinlock) and zeroes ShmemIndex.
  • ShmemAlloc — calls ShmemAllocRaw; errors on out-of-space.
  • ShmemAllocNoError — same; returns NULL on out-of-space.
  • ShmemAllocUnlocked — lockless variant used during bootstrap before ShmemLock itself is allocated.
  • ShmemAllocRaw — internal; CACHELINEALIGN + spinlock-guarded bump.
  • InitShmemIndex — creates the ShmemIndex HTAB at the base of the arena using ShmemAllocUnlocked.
  • ShmemInitStruct — named-object create-or-attach via ShmemIndex.
  • ShmemInitHash — like ShmemInitStruct but for HTAB; parameters passed as HASHCTL.

Dynamic shared memory (dsm.c + dsm_impl.c)

  • dsm_postmaster_startup — creates the DSM control segment at postmaster startup; records its handle in the static segment’s PGShmemHeader.
  • dsm_shmem_init — called from CreateOrAttachShmemStructs to map the control segment in each backend.
  • dsm_create — allocates a new DSM segment, reserves a dsm_control_item slot, returns a dsm_segment handle.
  • dsm_attach — maps an existing DSM segment by handle; increments the dsm_control_item.refcnt.
  • dsm_detach — unmaps a segment; fires on_detach callbacks; decrements refcnt; destroys the OS object when refcnt reaches 0.
  • on_dsm_detach — registers a callback to fire before a given segment is detached; used by DSA to release area memory.
  • dsm_impl_op — platform dispatch: selects dsm_impl_posix, dsm_impl_sysv, dsm_impl_windows, or dsm_impl_mmap based on the dynamic_shared_memory_type GUC.
  • dsm_control_item — shared metadata per segment: handle, refcnt, first_page, npages, pinned.
  • shm_mq_create — initializes a shm_mq at a caller-provided address inside a DSM segment; sets mq_ring_size.
  • shm_mq_set_receiver / shm_mq_set_sender — register the sending and receiving PGPROC; each may be called only once.
  • shm_mq_attach — creates a backend-local shm_mq_handle for an existing shm_mq; optionally registers a BackgroundWorkerHandle.
  • shm_mq_send — sends a message; blocks (or returns SHM_MQ_WOULD_BLOCK) when the ring is full.
  • shm_mq_receive — receives the next message; blocks (or returns SHM_MQ_WOULD_BLOCK) when the ring is empty.
  • shm_mq_detach — marks mq_detached = true; wakes the counterparty’s latch.
  • dsa_create_ext — creates a new dsa_area backed by a fresh DSM segment; stores the dsa_area_control at the segment’s base.
  • dsa_attach — maps an existing DSA area by dsa_handle (which is a dsm_handle).
  • dsa_allocate_extended — allocates size bytes; returns a position-independent dsa_pointer.
  • dsa_free — returns memory to the area’s freelist.
  • dsa_get_address — translates a dsa_pointer to a backend-local virtual address.
  • dsa_pin / dsa_unpin — prevent / allow the area from being destroyed when the last non-pinned backend detaches.

Position-hint table (as of 2026-06-05, commit 273fe94, REL_18_STABLE)

| Symbol | File | Line |
| --- | --- | --- |
| CalculateShmemSize | storage/ipc/ipci.c | 89 |
| CreateSharedMemoryAndSemaphores | storage/ipc/ipci.c | 200 |
| CreateOrAttachShmemStructs | storage/ipc/ipci.c | 268 |
| AttachSharedMemoryStructs | storage/ipc/ipci.c | 173 |
| RequestAddinShmemSpace | storage/ipc/ipci.c | 74 |
| InitializeShmemGUCs | storage/ipc/ipci.c | 357 |
| PGShmemHeader | include/storage/pg_shmem.h | 29 |
| InitShmemAccess | storage/ipc/shmem.c | 102 |
| InitShmemAllocation | storage/ipc/shmem.c | 115 |
| ShmemAlloc | storage/ipc/shmem.c | 152 |
| ShmemAllocNoError | storage/ipc/shmem.c | 172 |
| ShmemAllocUnlocked | storage/ipc/shmem.c | 238 |
| ShmemAllocRaw (static) | storage/ipc/shmem.c | 183 (approx) |
| InitShmemIndex | storage/ipc/shmem.c | 283 |
| ShmemInitHash | storage/ipc/shmem.c | 332 |
| ShmemInitStruct | storage/ipc/shmem.c | 387 |
| dsm_postmaster_startup | storage/ipc/dsm.c | 177 |
| dsm_shmem_init | storage/ipc/dsm.c | 479 |
| dsm_create | storage/ipc/dsm.c | 516 |
| dsm_attach | storage/ipc/dsm.c | 665 |
| dsm_detach | storage/ipc/dsm.c | 803 |
| on_dsm_detach | storage/ipc/dsm.c | 1132 |
| dsm_segment (struct) | storage/ipc/dsm.c | 66 |
| dsm_control_item (struct) | storage/ipc/dsm.c | 79 |
| dsm_impl_op | storage/ipc/dsm_impl.c | 159 |
| shm_mq (struct) | storage/ipc/shm_mq.c | 71 |
| shm_mq_create | storage/ipc/shm_mq.c | 177 |
| shm_mq_set_receiver | storage/ipc/shm_mq.c | 206 |
| shm_mq_set_sender | storage/ipc/shm_mq.c | 224 |
| shm_mq_attach | storage/ipc/shm_mq.c | 290 |
| shm_mq_send | storage/ipc/shm_mq.c | 329 |
| shm_mq_receive | storage/ipc/shm_mq.c | 572 |
| shm_mq_detach | storage/ipc/shm_mq.c | 843 |
| dsa_create_ext | utils/mmgr/dsa.c | 421 |
| dsa_attach | utils/mmgr/dsa.c | 510 |
| dsa_allocate_extended | utils/mmgr/dsa.c | 671 |
| dsa_free | utils/mmgr/dsa.c | 826 |
| dsa_get_address | utils/mmgr/dsa.c | 942 |
| dsa_pin | utils/mmgr/dsa.c | 975 |
| dsa_unpin | utils/mmgr/dsa.c | 994 |
| dsa_create_in_place_ext | utils/mmgr/dsa.c | 471 |
| dsa_attach_in_place | utils/mmgr/dsa.c | 545 |
| shm_toc_create | storage/ipc/shm_toc.c | 40 |
| shm_toc_attach | storage/ipc/shm_toc.c | 64 |
| GetSessionDsmHandle | access/common/session.c | 70 |
| InitializeParallelDSM | access/transam/parallel.c | 211 |
| LaunchParallelWorkers | access/transam/parallel.c | 580 |
| ParallelWorkerMain | access/transam/parallel.c | 1299 |

  • CalculateShmemSize is a simple sum of per-subsystem sizing functions, with no dynamic adjustment. Verified by reading ipci.c lines 89–170. The function calls ~37 *ShmemSize() helpers plus total_addin_request. Recent additions are SlotSyncShmemSize() (slot sync worker, PG17) and AioShmemSize() (async I/O, PG18). There is no runtime feedback loop; if the segment is created too small, ShmemAlloc will error with “out of shared memory.”

  • ShmemAllocRaw aligns to CACHELINEALIGN, not just MAXALIGN. Verified at shmem.c line ~205. The comment reads: “experience has proved that on modern systems [MAXALIGN] is not good enough… attempt to align the beginning of the allocation to a cache line boundary.” This is a deliberate change from the MAXALIGN-only behavior of releases before 9.6.

  • The DSM implementation is selected at runtime via the dynamic_shared_memory_type GUC, from the set the build supports. Verified in dsm_impl_op (dsm_impl.c line 159). The switch dispatches on the GUC value; each of the four backends (posix/sysv/windows/mmap) is compiled in only where the platform provides it. The default is posix on Linux/macOS.

  • shm_mq’s data path uses no mutex; only mq_receiver/mq_sender assignment is mutex-protected. Verified by reading the struct comment in shm_mq.c lines 30–67. mq_bytes_read and mq_bytes_written are pg_atomic_uint64 with documented memory ordering; mq_ring is written/read without any lock.

  • dsa_pointer encodes a (segment_number, offset) pair. Verified in dsa.h lines 81–103 and dsa.c lines 78–98. On 64-bit systems, DSA_OFFSET_WIDTH = 40 allows segments up to 1 TB and up to 1024 segments per area. The 32-bit fallback uses DSA_OFFSET_WIDTH = 27 (32 segments, 128 MB each).

  • RequestAddinShmemSpace is gated to fire only during shmem_request_hook. Verified at ipci.c lines 74–77. A call from outside the hook causes FATAL. The flag process_shmem_requests_in_progress is set by process_shmem_requests() immediately before it invokes shmem_request_hook and cleared as soon as the hook returns, so the window has closed before CreateSharedMemoryAndSemaphores runs.

  • Recent releases add SlotSyncShmemInit() (PG17) and AioShmemInit() (PG18) to CreateOrAttachShmemStructs. Verified by grepping ipci.c lines 268–355. AioShmemInit initializes the async I/O subsystem (storage/aio/); SlotSyncShmemInit supports the slot sync worker added in PG17. AioShmemInit is absent from REL_17 and earlier; SlotSyncShmemInit is absent from REL_16 and earlier.

  1. The ShmemAllocRaw line in the position-hint table is approximate. ShmemAllocRaw is a static function without a grep-stable entry point; the line 183 estimate was computed as InitShmemAllocation (115) + visible function bodies. Verification path: grep -n 'ShmemAllocRaw' shmem.c.

  2. NUMA-aware shmem allocation. shmem.c references pg_numa.h and a firstNumaTouch flag (line 96). It is unclear whether PG18 actually uses NUMA topology when placing the static segment or merely exposes the pg_numa_available() SQL function. Investigation path: trace firstNumaTouch through ShmemAlloc and the new pg_numa_available built-in.

  3. dsa_create vs dsa_create_ext. Resolved. In PG18 dsa_create is a macro in dsa.h (line 117) that expands to dsa_create_ext(tranche_id, DSA_DEFAULT_INIT_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE); likewise dsa_create_in_place (line 122) wraps dsa_create_in_place_ext. The _ext forms are the real functions; the bare names are convenience macros with default segment-size bounds. No rename occurred.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

  • Oracle SGA (System Global Area). Oracle’s static shared segment is conceptually identical to PostgreSQL’s: sized at startup, bump- allocated, hosts buffer cache + lock table + redo buffer. Oracle adds the Automatic Memory Management (AMM) layer that can redistribute SGA components at runtime — a feature PostgreSQL explicitly does not have (the CalculateShmemSize sum is fixed). A design comparison would quantify what dynamic redistribution buys vs. the complexity it adds.

  • Linux io_uring shared rings and the PG18 storage/aio/ subsystem. PG18’s async I/O layer (AioShmemInit) places I/O worker state in the static segment and uses io_uring submission / completion queues, which are themselves shared-memory ring buffers established by the kernel. The shm_mq ring-buffer pattern in userspace and the io_uring pattern in kernel-space are structurally identical (producer/consumer with atomic counters); a side-by-side would clarify where PostgreSQL’s tuple-transport ring and the kernel’s I/O ring diverge in their memory-ordering guarantees.

  • CockroachDB / distributed shared-nothing. PostgreSQL’s design assumes all processes share physical memory. A shared-nothing distributed engine has no shared segment — all state is either node-local or exchanged via an explicit RPC layer. The scalable-lock-manager.md paper (Johnson et al., VLDB 2010) is relevant here: it observes that even in shared-memory engines, centralized lock-table access is a scalability bottleneck, motivating per-CPU partitioning — a pattern PostgreSQL uses for its buffer-mapping table but not yet for the entire lock table.

  • shm_mq vs. LMAX Disruptor / lock-free ring patterns. shm_mq’s lockless data path is a classic single-producer/single- consumer ring buffer. The LMAX Disruptor (Thompson et al., 2011) extends this to multi-producer/multi-consumer with sequence barriers. PostgreSQL’s parallel executor is bounded to one sender per shm_mq instance; a future multi-parallel-worker-to-gather transport could benefit from a multi-producer variant. Porting Disruptor patterns to a shared-memory context with pg_atomic_* primitives is a concrete research avenue.

  • DSA vs. custom slab allocators (e.g., radix tree in pg_wait). dsa is a general-purpose variable-size allocator. Several PostgreSQL subsystems (cumulative stats, pg_wait_sampling) have rolled specialized data structures on top of DSM without DSA, managing segments and pointers directly. Understanding where the DSA overhead (size-class bucketing, pagemap per segment) is worth paying vs. where a hand-rolled layout is better would clarify when to adopt DSA in new subsystems.
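
The single-producer/single-consumer pattern mentioned in the shm_mq items above can be sketched with two monotonically increasing byte counters. This is a simplified illustration under stated assumptions, not shm_mq itself: it uses C11 stdatomic in place of the pg_atomic_* primitives, a hypothetical fixed 8-byte ring, and it omits message framing, latch wakeups, and detach handling. All spsc_ names are invented for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SPSC_RING_SIZE 8   /* hypothetical tiny capacity, in bytes */

typedef struct
{
    _Atomic uint64_t bytes_read;     /* advanced only by the consumer */
    _Atomic uint64_t bytes_written;  /* advanced only by the producer */
    unsigned char    ring[SPSC_RING_SIZE];
} spsc_ring;

static inline void
spsc_init(spsc_ring *q)
{
    atomic_init(&q->bytes_read, 0);
    atomic_init(&q->bytes_written, 0);
}

/* Producer side: copy as many bytes as fit, then publish the new
 * write counter with release ordering. Returns bytes accepted. */
static inline size_t
spsc_send(spsc_ring *q, const unsigned char *data, size_t len)
{
    uint64_t written = atomic_load_explicit(&q->bytes_written, memory_order_relaxed);
    uint64_t read = atomic_load_explicit(&q->bytes_read, memory_order_acquire);
    size_t   free_space = SPSC_RING_SIZE - (size_t) (written - read);
    size_t   n = len < free_space ? len : free_space;

    for (size_t i = 0; i < n; i++)
        q->ring[(written + i) % SPSC_RING_SIZE] = data[i];

    atomic_store_explicit(&q->bytes_written, written + n, memory_order_release);
    return n;
}

/* Consumer side: copy out what is available, then publish the new
 * read counter so the producer can reuse the space. Returns bytes read. */
static inline size_t
spsc_receive(spsc_ring *q, unsigned char *out, size_t len)
{
    uint64_t read = atomic_load_explicit(&q->bytes_read, memory_order_relaxed);
    uint64_t written = atomic_load_explicit(&q->bytes_written, memory_order_acquire);
    size_t   avail = (size_t) (written - read);
    size_t   n = len < avail ? len : avail;

    for (size_t i = 0; i < n; i++)
        out[i] = q->ring[(read + i) % SPSC_RING_SIZE];

    atomic_store_explicit(&q->bytes_read, read + n, memory_order_release);
    return n;
}
```

Each counter has exactly one writer, which is why no mutex is needed on the data path: the release store on one side pairs with the acquire load on the other, so ring bytes are visible before the counter that announces them. A multi-producer variant would lose that property and need a claim step (for example a compare-and-swap on the write counter), which is where Disruptor-style sequence barriers come in.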

(none — analysis synthesized directly from source tree)

  • Database System Concepts (Silberschatz, Korth, Sudarshan), 7th ed., ch. 17 §“Shared Memory and Semaphores” — shared-memory as the standard buffer-pool substrate in multi-process engines.
  • Database Internals (Petrov, 2019), ch. 4 §“Buffer Management” — buffer pool as shared structure; page ID as cache key.

Source code (REL_18_STABLE, commit 273fe94)

  • src/backend/storage/ipc/ipci.c — segment sizing + master initializer.
  • src/backend/storage/ipc/shmem.c — bump allocator + shmem index.
  • src/backend/storage/ipc/dsm.c — dynamic shared memory segments.
  • src/backend/storage/ipc/dsm_impl.c — platform backends (posix/sysv/windows/mmap).
  • src/backend/storage/ipc/shm_mq.c — ring-buffer message queue.
  • src/backend/utils/mmgr/dsa.c — variable-size DSM allocator.
  • src/backend/storage/ipc/shm_toc.c — table-of-contents over a DSM segment (key → offset directory used by parallel workers).
  • src/backend/access/transam/parallel.c — leader/worker DSM + shm_mq setup (InitializeParallelDSM, LaunchParallelWorkers, ParallelWorkerMain).
  • src/backend/access/common/session.c — GetSessionDsmHandle.
  • src/include/storage/pg_shmem.h — PGShmemHeader.
  • src/include/storage/dsm.h — dsm_handle, dsm_segment public API.
  • src/include/storage/shm_mq.h — shm_mq public API.
  • src/include/utils/dsa.h — dsa_pointer, dsa_area public API.
  • src/include/access/transam.h — VariableCacheData / ShmemVariableCache XID/OID counter layout.
  • postgres-architecture-overview.md — Axis 2 (shared-memory spine); postmaster fork model.
  • postgres-lock-manager.md — LockManagerShmemInit resident in the static segment.
  • postgres-lwlock-spinlock.md — LWLockShmemSize / CreateLWLocks; LWLocks live in the static segment but their internals are covered there.
  • postgres-buffer-manager.md — BufferManagerShmemInit; buffer descriptors and page frames are the largest residents of the static segment.
  • postgres-parallel-query.md — DSM + shm_mq consumer; how nodeGather.c drives the tuple-transport ring.
  • knowledge/research/dbms-papers/scalable-lock-manager.md — Johnson et al. VLDB 2010; per-CPU lock-table partitioning as scalability motivation.