PostgreSQL Shared Memory & IPC — Static Segment, Dynamic Shared Memory, and the shm_mq Message Layer
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A multi-process database must give every worker the same view of the shared state: the buffer pool, the lock table, the transaction status array, and dozens of smaller structures. Two classical strategies exist:
- Message passing. Processes have no shared address space; they exchange state via pipes, sockets, or OS message queues. The model is clean (no aliasing, no cache-coherence races at the language level) but every read of shared state pays a kernel crossing and a copy. Oracle’s shared-server (MTS) mode hands requests between dispatchers and servers through well-defined queues in this spirit, although those queues themselves live in the SGA. MySQL is not an example: it runs one thread per connection inside a single process, so its threads share an address space by construction.
- Shared memory. All processes map the same virtual pages. Reads and writes are load/store instructions, not kernel calls. The cost is paid up front (map the region once) and amortized over every subsequent access. Protection must be enforced by the processes themselves, typically via lightweight locks or atomic operations inside the shared region.
PostgreSQL is unambiguously in the second camp. Every backend, every
auxiliary process, and every parallel worker attaches to the same
shared-memory region created by the postmaster. The buffer pool, the
PGPROC array, the lock table, the sinval ring — these are not copies
per process; they are single objects in a shared address space that all
processes reference through inherited (post-fork) or re-established
(post-exec, Windows) pointers.
Database System Concepts (Silberschatz et al., ch. 17) frames shared memory as the natural substrate for buffer management in a shared-disk architecture: “the buffer pool is maintained in shared memory so that all processes can access it directly.” Database Internals (Petrov, ch. 4) notes the same; the buffer manager’s descriptor array and page frames live in shared memory so no inter-process copy is ever needed for a cache hit.
Two design questions define the architecture once shared memory is chosen:
- How large is the segment, and who decides? A fixed-at-startup approach (PostgreSQL’s choice) lets the kernel reserve all physical pages at once and eliminates fragmentation at the cost of requiring a restart to resize. A growable-segment approach allows online resizing but complicates pointer stability: any pointer into the region must be invalidated if the base address shifts.
- How is the region subdivided? A bump allocator with a directory (PostgreSQL’s ShmemAlloc + ShmemIndex) is simple and prevents fragmentation, but allocations are permanent. A slab or pool allocator per object class (InnoDB’s buf_pool, for instance) trades simplicity for reclamability.
PostgreSQL adds a third layer: dynamic shared memory (dsm), a
separate family of on-demand segments created and destroyed at runtime.
DSM sidesteps the fixed-at-startup constraint for transient structures
like parallel query workers — but the static segment remains the
canonical home for all long-lived global state.
Common DBMS Design
The textbook gives the rationale; this section names the engineering conventions that nearly every multi-process DBMS adopts when it chooses shared memory as its IPC substrate.
Fixed static segment + bump allocator
Every subsystem announces its memory need at startup via a sizing
function. A central coordinator sums those needs, creates the segment
once, and then hands out slices via a bump (linear) allocator. This
pattern appears in Oracle (the SGA), InnoDB (the buffer pool’s
buf_pool_init), and PostgreSQL (the ShmemAlloc arena). The
invariant is that the bump pointer never moves backward: once a slice is
handed out, it belongs to that subsystem forever. The segment never
leaks because nothing is ever truly freed — the bump pointer only
advances.
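The bump-allocator invariant above can be sketched in a few lines of standalone C. This is a minimal illustration, not code from any of the engines named: `ToyArena`, `toy_arena_init`, `toy_arena_alloc`, and the 64-byte `TOY_CACHE_LINE` are all hypothetical names and values, and the sketch is single-process with no locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_LINE 64   /* assumed cache-line size, for illustration only */
#define TOY_ALIGN(n) (((n) + TOY_CACHE_LINE - 1) & ~((size_t) TOY_CACHE_LINE - 1))

typedef struct ToyArena
{
    unsigned char *base;      /* start of the region */
    size_t         total;     /* region size in bytes */
    size_t         free_off;  /* bump pointer: offset of the next free byte */
} ToyArena;

static void
toy_arena_init(ToyArena *a, void *region, size_t total)
{
    /* the region itself must start on a cache-line boundary */
    assert(((uintptr_t) region % TOY_CACHE_LINE) == 0);
    a->base = region;
    a->total = total;
    a->free_off = 0;
}

/* Hand out a slice; returns NULL when the arena is exhausted. Nothing is ever freed. */
static void *
toy_arena_alloc(ToyArena *a, size_t size)
{
    size_t padded = TOY_ALIGN(size);
    void  *p;

    /* "padded < size" catches wrap-around of the rounding itself */
    if (padded < size || padded > a->total - a->free_off)
        return NULL;
    p = a->base + a->free_off;
    a->free_off += padded;      /* the bump pointer only moves forward */
    return p;
}
```

With a 256-byte region, a 10-byte request consumes one 64-byte line and a 65-byte request consumes two, so the third line boundary (offset 192) becomes the new bump pointer. There is deliberately no `toy_arena_free`.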
Shmem index / name registry
A flat hash table inside the static segment maps human-readable string
names to (address, size) pairs for every registered object. Any
process, regardless of whether it created the object or merely attached
to the segment later (EXEC_BACKEND on Windows; or a backend that
re-initializes after an unusual restart), can locate a named object by
looking it up in the directory. The alternative — hard-coding offsets —
was the original POSTGRES design and is fragile: adding one new field
changes every downstream offset.
Separate lightweight dynamic segments for transient structures
Long-lived state lives in the static segment, which is sized conservatively for the worst case. Transient structures (the memory a parallel query worker needs for the duration of one query, the shared tuple queue between a parallel scan and a gather node) should not inflate the static segment permanently. Mature engines therefore tend to add a second-tier dynamic allocator; loose analogues include Oracle’s PGA (process-private), PostgreSQL’s DSM, and CockroachDB’s goroutine-local arenas. For a shared tier the requirement is that the backing pages be unmapped and returned to the OS when the structure’s owner exits, even if the owner crashes.
Message queues layered on shared memory
Once two processes share a region, a ring-buffer message queue is a
natural addition: one process writes into a slot, increments the write
pointer, and optionally wakes the reader via a latch. The reader
consumes from the read pointer, increments it, and signals back.
No kernel copy is involved. This pattern (shared-memory ring buffer +
latch for wakeup) is used by PostgreSQL’s shm_mq and Oracle’s
in-SGA message queues; the Linux kernel’s generic kfifo is the same
ring-buffer idea without the latch.
Theory ↔ PostgreSQL mapping
| Concept | PostgreSQL name |
|---|---|
| Static shared segment | PGShmemHeader / ShmemSegHdr (shmem.c) |
| Segment creation | CreateSharedMemoryAndSemaphores (ipci.c) |
| Sizing function (per subsystem) | *ShmemSize() family, summed by CalculateShmemSize |
| Bump allocator | ShmemAlloc / ShmemAllocRaw (shmem.c) |
| Shmem index / name registry | ShmemIndex HTAB + ShmemInitStruct / ShmemInitHash |
| Dynamic segment | dsm_segment / dsm_create / dsm_attach / dsm_detach (dsm.c) |
| Platform dispatch table | dsm_impl_op (dsm_impl.c) |
| Ring-buffer message queue on DSM | shm_mq / shm_mq_create / shm_mq_attach / shm_mq_send / shm_mq_receive (shm_mq.c) |
| Variable-size slab allocator on DSM | dsa_area / dsa_create_ext / dsa_allocate_extended / dsa_free (dsa.c) |
| Add-in request hook | RequestAddinShmemSpace + shmem_startup_hook |
PostgreSQL’s Approach
Layer 0: the static shared segment
At postmaster startup, before any backend is forked, two functions run in sequence:

```c
/* CalculateShmemSize: src/backend/storage/ipc/ipci.c */
Size
CalculateShmemSize(int *num_semaphores)
{
    Size        size = 100000;      /* baseline for small objects */

    size = add_size(size, BufferManagerShmemSize());
    size = add_size(size, LockManagerShmemSize());
    size = add_size(size, ProcGlobalShmemSize());
    size = add_size(size, XLOGShmemSize());
    size = add_size(size, LWLockShmemSize());
    size = add_size(size, ProcArrayShmemSize());
    /* ... ~35 more ShmemSize() calls condensed ... */
    size = add_size(size, AioShmemSize());          /* PG18 async I/O */
    size = add_size(size, total_addin_request);     /* add-ins */
    size = add_size(size, 8192 - (size % 8192));    /* page-align */
    return size;
}

/* CreateSharedMemoryAndSemaphores: src/backend/storage/ipc/ipci.c */
void
CreateSharedMemoryAndSemaphores(void)
{
    PGShmemHeader *shim;
    PGShmemHeader *seghdr;
    int         numSemas;
    Size        size = CalculateShmemSize(&numSemas);

    seghdr = PGSharedMemoryCreate(size, &shim);     /* SysV / mmap */
    InitShmemAccess(seghdr);
    PGReserveSemaphores(numSemas);
    InitShmemAllocation();
    CreateOrAttachShmemStructs();       /* calls every ShmemInit*() */
    dsm_postmaster_startup(shim);       /* bootstrap DSM control segment */
    if (shmem_startup_hook)
        shmem_startup_hook();           /* add-in extensions */
}
```

The end-to-end sizing-then-creation sequence at postmaster startup is a
single straight-line pipeline: sum the needs, create one segment, set up
the bump allocator, then run every subsystem’s ShmemInit* through the
shmem index.
```mermaid
flowchart TD
A["CalculateShmemSize<br/>sum ~37 *ShmemSize() + total_addin_request"] --> B["PGSharedMemoryCreate<br/>SysV shmget / POSIX mmap one segment"]
B --> C["InitShmemAccess<br/>set ShmemBase / ShmemEnd / ShmemSegHdr"]
C --> D["PGReserveSemaphores"]
D --> E["InitShmemAllocation<br/>ShmemAllocUnlocked(ShmemLock spinlock)"]
E --> F["CreateOrAttachShmemStructs<br/>master initializer"]
F --> G["InitShmemIndex<br/>ShmemIndex HTAB at arena base"]
G --> H["per-subsystem ShmemInitStruct / ShmemInitHash<br/>BufferManagerShmemInit, LockManagerShmemInit, InitProcGlobal, ..."]
H --> I["ShmemAllocRaw<br/>CACHELINEALIGN + advance freeoffset under ShmemLock"]
F --> J["dsm_postmaster_startup<br/>bootstrap DSM control segment"]
F --> K["shmem_startup_hook<br/>add-in ShmemInitStruct calls"]
```
Figure 1 — static-segment sizing and creation at postmaster startup.
CalculateShmemSize is called twice in practice: once by
InitializeShmemGUCs to publish shared_memory_size, and again here as
the authoritative size handed to PGSharedMemoryCreate. Every leaf of
the ShmemInitStruct fan-out resolves through the same
ShmemAllocRaw bump path, so the order in CreateOrAttachShmemStructs
is the order slices are stamped into the segment.
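The summation step relies on overflow-checked addition followed by the page-rounding expression shown in the excerpt. The sketch below models that arithmetic in standalone C under stated assumptions: `toy_add_size`, `toy_calculate_size`, `ToySizeFn`, and the two sample sizing functions are hypothetical stand-ins, and overflow is reported through a return value where the real `add_size` raises an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE 8192

/* Add two sizes; report overflow instead of silently wrapping. */
static bool
toy_add_size(size_t a, size_t b, size_t *out)
{
    if (b > SIZE_MAX - a)
        return false;
    *out = a + b;
    return true;
}

typedef size_t (*ToySizeFn)(void);  /* stand-in for one *ShmemSize() function */

/* Two made-up subsystems, present only so the sketch has something to sum. */
static size_t toy_buffers_size(void) { return 100000; }
static size_t toy_locks_size(void)   { return 5000; }

/* Sum every sizing function plus the add-in request, then pad to a page. */
static bool
toy_calculate_size(const ToySizeFn *fns, size_t nfns, size_t addin, size_t *total)
{
    size_t size = 0;

    for (size_t i = 0; i < nfns; i++)
        if (!toy_add_size(size, fns[i](), &size))
            return false;
    if (!toy_add_size(size, addin, &size))
        return false;
    /* same expression as the excerpt: an already-aligned sum gains a full page */
    return toy_add_size(size, TOY_PAGE - (size % TOY_PAGE), total);
}
```

Summing 100000 and 5000 gives 105000 bytes; the padding term adds 1496, so the total is 106496, exactly 13 pages of 8192 bytes.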
The segment header lives at the base of the mapped region:
```c
/* PGShmemHeader: src/include/storage/pg_shmem.h */
typedef struct PGShmemHeader
{
    int32       magic;          /* identifies a live Postgres segment */
    pid_t       creatorPID;     /* postmaster PID */
    Size        totalsize;      /* total size of the segment */
    Size        freeoffset;     /* bump pointer: next free byte */
    dev_t       device;         /* data-directory device (Unix only) */
    ino_t       inode;          /* data-directory inode (Unix only) */
} PGShmemHeader;
```

freeoffset is the bump pointer. ShmemAlloc advances it through the internal ShmemAllocRaw:

```c
/* ShmemAllocRaw: src/backend/storage/ipc/shmem.c */
static void *
ShmemAllocRaw(Size size, Size *allocated_size)
{
    Size        newStart;
    Size        newFree;
    void       *newSpace;

    size = CACHELINEALIGN(size);        /* align to cache-line boundary */
    *allocated_size = size;

    SpinLockAcquire(ShmemLock);
    newStart = ShmemSegHdr->freeoffset;
    newFree = newStart + size;
    if (newFree <= ShmemSegHdr->totalsize)
    {
        newSpace = (char *) ShmemBase + newStart;
        ShmemSegHdr->freeoffset = newFree;
    }
    else
        newSpace = NULL;                /* out of shared memory */
    SpinLockRelease(ShmemLock);

    return newSpace;
}
```

Three things are worth noting. First, CACHELINEALIGN pads each
allocation to a cache-line boundary, a micro-optimization to prevent
false sharing between adjacent objects in the segment. Second, the
allocator holds ShmemLock (a spinlock) only for the bump-pointer
update, not for the write into the allocated region. Third, once
freeoffset is advanced it is never decremented: shared-memory
allocations are permanent.
Layer 1: the shmem index
Every object created through ShmemInitStruct is registered in ShmemIndex,
a hash table at a known offset inside the segment (raw ShmemAlloc calls
bypass the index). ShmemInitStruct either creates-and-registers or
looks-up-and-returns a named object:

```c
/* ShmemInitStruct: src/backend/storage/ipc/shmem.c (locking and error paths elided) */
void *
ShmemInitStruct(const char *name, Size size, bool *foundPtr)
{
    ShmemIndexEnt *result;
    void       *structPtr;

    /* Look up or insert into ShmemIndex */
    result = (ShmemIndexEnt *)
        hash_search(ShmemIndex, name, HASH_ENTER_NULL, foundPtr);

    if (!*foundPtr)
    {
        structPtr = ShmemAlloc(size);
        result->location = structPtr;
        result->size = size;
    }
    else
        structPtr = result->location;

    return structPtr;
}
```

This is the re-attach path that makes EXEC_BACKEND work: a Windows
backend that cannot fork() calls AttachSharedMemoryStructs() which
re-runs CreateOrAttachShmemStructs(). Every ShmemInitStruct call
returns *foundPtr = true and hands back the pointer from the index;
no re-allocation occurs.
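The create-or-attach contract can be modeled without any PostgreSQL headers. The following is a minimal sketch under stated assumptions: `ToyShmem`, `ToyIndexEnt`, and `toy_init_struct` are hypothetical names, the index is a linear array where PostgreSQL uses a hash table, and entries store offsets where the real index stores pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_ENTRIES 8
#define TOY_NAME_LEN    48

typedef struct ToyIndexEnt
{
    char   name[TOY_NAME_LEN];  /* registry key */
    size_t offset;              /* offset of the object inside arena[] */
    size_t size;
} ToyIndexEnt;

typedef struct ToyShmem
{
    size_t        free_off;     /* bump pointer into arena[] */
    size_t        nentries;
    ToyIndexEnt   index[TOY_MAX_ENTRIES];
    unsigned char arena[1024];
} ToyShmem;

/*
 * First call for a name allocates and registers (*found = false);
 * every later call returns the same address (*found = true).
 * Returns NULL if the index or the arena is full, or the name is too long.
 */
static void *
toy_init_struct(ToyShmem *seg, const char *name, size_t size, bool *found)
{
    ToyIndexEnt *ent;

    for (size_t i = 0; i < seg->nentries; i++)
    {
        if (strcmp(seg->index[i].name, name) == 0)
        {
            *found = true;
            return seg->arena + seg->index[i].offset;
        }
    }

    *found = false;
    if (seg->nentries == TOY_MAX_ENTRIES ||
        strlen(name) >= TOY_NAME_LEN ||
        size > sizeof(seg->arena) - seg->free_off)
        return NULL;

    ent = &seg->index[seg->nentries++];
    strcpy(ent->name, name);
    ent->offset = seg->free_off;
    ent->size = size;
    seg->free_off += size;
    return seg->arena + ent->offset;
}
```

A "creator" and an "attacher" run exactly the same call; only the value of `found` tells them apart, which is why the same initializer can serve both the postmaster and a re-attaching backend.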
Through PG16 the segment also exposed a special pointer, ShmemVariableCache
(typed as VariableCacheData *), which held the cluster-wide XID and OID
counters (nextXid, nextOid, latestCompletedXid) and was placed by
InitShmemAllocation rather than registered in the index. In PG17 and later
the same structure is named TransamVariables and is set up by
VarsupShmemInit in access/transam/varsup.c through ShmemInitStruct, like
any other resident.
Layer 2: the static segment’s residents
CreateOrAttachShmemStructs (ipci.c) is the master initializer. Its
call sequence defines what lives in the static segment:
```mermaid
flowchart TD
A[CreateOrAttachShmemStructs] --> B[CreateLWLocks]
A --> C[InitShmemIndex]
A --> D[dsm_shmem_init]
A --> E[VarsupShmemInit / XLOGShmemInit / XLogRecoveryShmemInit]
A --> F[CLOGShmemInit / CommitTsShmemInit / SUBTRANSShmemInit]
A --> G[BufferManagerShmemInit]
A --> H[LockManagerShmemInit / PredicateLockShmemInit]
A --> I[InitProcGlobal / ProcArrayShmemInit / BackendStatusShmemInit]
A --> J[SharedInvalShmemInit]
A --> K[PMSignalShmemInit / ProcSignalShmemInit / CheckpointerShmemInit]
A --> L[AutoVacuumShmemInit / ReplicationSlotsShmemInit / WalSndShmemInit]
A --> M[StatsShmemInit / AioShmemInit]
```
Figure 2 — CreateOrAttachShmemStructs call graph. Each leaf calls
ShmemInitStruct (or its own bump-allocator path) to stake out a
slice of the static segment. The order is fixed: LWLocks and the shmem
index must come first because every subsequent call uses them.
Every object named above — the buffer descriptors, the PGPROC array, the lock table, the WAL buffers, the sinval ring, the cumulative stats area (PG15+), the async I/O subsystem (PG18) — lives inside this one segment at a pointer inherited by every forked child.
Layer 3: dynamic shared memory (DSM)
The static segment is sized once at postmaster startup and cannot grow. For transient structures (parallel query worker state, logical replication worker queues) PostgreSQL uses dynamic shared memory: on-demand segments created and destroyed at runtime.

```c
/* dsm_segment (backend-local handle): src/backend/storage/ipc/dsm.c */
struct dsm_segment
{
    dlist_node  node;           /* link in backend's segment list */
    ResourceOwner resowner;     /* tracks ownership for cleanup */
    dsm_handle  handle;         /* shared name (integer key) */
    uint32      control_slot;   /* slot in the DSM control segment */
    void       *impl_private;   /* platform-specific private data */
    void       *mapped_address; /* where this backend mapped the segment */
    Size        mapped_size;    /* size of this backend's mapping */
    slist_head  on_detach;      /* callbacks fired on dsm_detach() */
};
```

The public API is three calls:

```c
/* dsm_create / dsm_attach / dsm_detach: src/backend/storage/ipc/dsm.c */
dsm_segment *dsm_create(Size size, int flags);  /* line 516 */
dsm_segment *dsm_attach(dsm_handle h);          /* line 665 */
void         dsm_detach(dsm_segment *seg);      /* line 803 */
```

dsm_create allocates a slot in the
DSM control segment (a special segment pinned at postmaster startup by
dsm_postmaster_startup), maps the new segment into the caller’s
address space, and returns a dsm_segment handle. dsm_attach looks up the
handle in the control segment and maps it. dsm_detach unmaps and, if no
attachment or pin remains, destroys the segment via the platform
implementation. (The shared counter in dsm.c is biased by one: a live
segment starts at 2 and the value 1 marks a segment that is being
destroyed, so "reference count zero" in this document means "no
remaining references".)
Platform dispatch is handled by dsm_impl_op:
```c
/* dsm_impl_op: src/backend/storage/ipc/dsm_impl.c */
bool
dsm_impl_op(dsm_op op, dsm_handle handle, Size request_size,
            void **impl_private, void **mapped_address,
            Size *mapped_size, int elevel)
{
    switch (dynamic_shared_memory_type)
    {
        case DSM_IMPL_POSIX:
            return dsm_impl_posix(...);     /* shm_open */
        case DSM_IMPL_SYSV:
            return dsm_impl_sysv(...);      /* shmget */
        case DSM_IMPL_WINDOWS:
            return dsm_impl_windows(...);
        case DSM_IMPL_MMAP:
            return dsm_impl_mmap(...);      /* mmap file */
        default:
            elog(ERROR, "unexpected dynamic shared memory type: %d",
                 dynamic_shared_memory_type);
            return false;
    }
}
```

The dynamic_shared_memory_type GUC (default posix on Linux/macOS,
windows on Windows, mmap as the universal fallback) selects the
implementation at runtime. The rest of the DSM stack sees only the
dsm_segment handle and never knows which implementation was used.
The DSM lifecycle:
```mermaid
flowchart TD
Start([start]) -->|dsm_create| Created[Created refcnt=1]
Created -->|dsm_attach in other backend| Attached[Attached refcnt>1]
Created -->|dsm_pin_segment| Pinned[Pinned survives creator exit]
Pinned -->|dsm_unpin_segment| Attached
Attached -->|dsm_detach with refcnt>1 — unmap only| Attached
Attached -->|dsm_detach drops refcnt to 0 — destroy| Detached[Detached]
Created -->|dsm_detach drops refcnt to 0 — destroy| Detached
Detached -->|OS resources freed| Done([end])
```
Figure 3 — DSM segment lifecycle. dsm_pin_segment bumps the
reference count to prevent a segment from being destroyed when its
creator exits; it is used by DSA areas to keep backing segments alive
across creator exit.
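The lifecycle in Figure 3 reduces to a small reference-counting state machine. The sketch below models the figure's logical counting (1 at creation, destroy at 0), not the exact counter values stored in dsm.c; `ToyDsmSeg` and the `toy_*` functions are hypothetical names, and "destroyed" stands in for releasing the OS object.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToyDsmSeg
{
    unsigned refcnt;     /* one per attachment, plus one while pinned */
    bool     pinned;
    bool     destroyed;  /* stands in for freeing the OS resource */
} ToyDsmSeg;

/* Drop one reference; the last one out destroys the segment. */
static void
toy_release(ToyDsmSeg *s)
{
    assert(!s->destroyed && s->refcnt > 0);
    if (--s->refcnt == 0)
        s->destroyed = true;
}

/* The creator holds the first reference. */
static void
toy_create(ToyDsmSeg *s)
{
    s->refcnt = 1;
    s->pinned = false;
    s->destroyed = false;
}

/* Attaching to a destroyed segment fails instead of reviving it. */
static bool
toy_attach(ToyDsmSeg *s)
{
    if (s->destroyed)
        return false;
    s->refcnt++;
    return true;
}

static void
toy_detach(ToyDsmSeg *s)
{
    toy_release(s);
}

/* A pin is an extra reference owned by no process. */
static void
toy_pin(ToyDsmSeg *s)
{
    assert(!s->pinned && !s->destroyed);
    s->pinned = true;
    s->refcnt++;
}

static void
toy_unpin(ToyDsmSeg *s)
{
    assert(s->pinned);
    s->pinned = false;
    toy_release(s);
}
```

Treating the pin as one more reference means a pinned segment survives the creator's detach and is destroyed only when the unpin removes the final reference.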
Layer 4: shm_mq — a lock-free ring-buffer message queue
shm_mq is a single-producer/single-consumer ring buffer allocated
inside a DSM segment. It is the transport layer for the parallel-query
tuple stream (from a parallel worker to a Gather node) and for
background-worker result delivery.
```c
/* shm_mq: src/backend/storage/ipc/shm_mq.c */
struct shm_mq
{
    slock_t     mq_mutex;           /* protects mq_receiver/sender */
    PGPROC     *mq_receiver;        /* set once, then read-only */
    PGPROC     *mq_sender;          /* set once, then read-only */
    pg_atomic_uint64 mq_bytes_read;     /* consumer position */
    pg_atomic_uint64 mq_bytes_written;  /* producer position */
    Size        mq_ring_size;       /* ring buffer capacity */
    bool        mq_detached;        /* either side has gone away */
    uint8       mq_ring_offset;     /* offset of ring within struct */
    char        mq_ring[FLEXIBLE_ARRAY_MEMBER];
};
```

The protocol is straightforward. After shm_mq_create sizes the ring,
the sender calls shm_mq_set_sender and the receiver calls
shm_mq_set_receiver to register their PGPROC pointers. Send and
receive are then lock-free: the sender writes into
mq_ring[mq_bytes_written % mq_ring_size] and advances
mq_bytes_written; the receiver reads from
mq_ring[mq_bytes_read % mq_ring_size] and advances mq_bytes_read.
Only when the ring is full (sender) or empty (receiver) does a side
block: it sleeps in WaitLatch on its own process latch, and the
counterparty wakes it with SetLatch on that PGPROC’s latch after
making progress.
```mermaid
sequenceDiagram
participant W as Worker (sender)
participant MQ as shm_mq ring
participant G as Gather (receiver)
W->>MQ: shm_mq_send(data, nbytes)
note over MQ: write to ring[written % size]<br/>advance mq_bytes_written
MQ-->>G: (reader polls or is woken)
G->>MQ: shm_mq_receive()
note over MQ: read from ring[read % size]<br/>advance mq_bytes_read
G-->>W: (ring has space — sender unblocks)
W->>MQ: shm_mq_detach() on exit
note over MQ: mq_detached = true<br/>wake counterparty latch
```
Figure 4 — shm_mq send/receive protocol. The ring buffer itself
requires no lock for the data path. mq_mutex protects only the
initial assignment of mq_receiver / mq_sender. Memory ordering
between producer and consumer is maintained by the atomic
mq_bytes_written / mq_bytes_read with appropriate barriers.
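The counter-based ring described above can be sketched with C11 atomics. This is an illustration of the technique only, not the shm_mq implementation: `ToyRing`, `toy_ring_write`, and `toy_ring_read` are hypothetical names, the ring is a fixed 8 bytes, and both message framing and latch wakeups are left out (a short write or read simply returns fewer bytes).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RING_SIZE 8

typedef struct ToyRing
{
    _Atomic uint64_t bytes_read;     /* total bytes consumed so far */
    _Atomic uint64_t bytes_written;  /* total bytes produced so far */
    unsigned char    ring[TOY_RING_SIZE];
} ToyRing;

/* Producer side: copy as many of the n bytes as fit; return that count. */
static size_t
toy_ring_write(ToyRing *r, const void *data, size_t n)
{
    const unsigned char *src = data;
    uint64_t wr = atomic_load_explicit(&r->bytes_written, memory_order_relaxed);
    uint64_t rd = atomic_load_explicit(&r->bytes_read, memory_order_acquire);
    size_t   space = TOY_RING_SIZE - (size_t) (wr - rd);

    if (n > space)
        n = space;
    for (size_t i = 0; i < n; i++)
        r->ring[(wr + i) % TOY_RING_SIZE] = src[i];
    /* release: the bytes above become visible before the new counter does */
    atomic_store_explicit(&r->bytes_written, wr + n, memory_order_release);
    return n;
}

/* Consumer side: copy out up to n available bytes; return that count. */
static size_t
toy_ring_read(ToyRing *r, void *out, size_t n)
{
    unsigned char *dst = out;
    uint64_t rd = atomic_load_explicit(&r->bytes_read, memory_order_relaxed);
    uint64_t wr = atomic_load_explicit(&r->bytes_written, memory_order_acquire);
    size_t   avail = (size_t) (wr - rd);

    if (n > avail)
        n = avail;
    for (size_t i = 0; i < n; i++)
        dst[i] = r->ring[(rd + i) % TOY_RING_SIZE];
    /* release: the slots are handed back only after they have been copied */
    atomic_store_explicit(&r->bytes_read, rd + n, memory_order_release);
    return n;
}
```

Because each counter has exactly one writer, neither side needs a lock: the producer only ever stores `bytes_written`, the consumer only ever stores `bytes_read`, and the difference between the two is the number of unread bytes.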
Layer 5: DSA — a variable-size slab allocator over DSM
shm_mq carries length-prefixed messages between exactly two processes; for variable-size shared
data structures (hash tables, trees, or any heap-allocated object that
must be visible to multiple processes), PostgreSQL provides the
Dynamic Shared-memory Area (dsa). DSA carves a dsa_area control
object out of a DSM segment and then manages a pool of further DSM
segments from which it allocates:
```c
/* dsa_pointer: src/include/utils/dsa.h */
typedef uint64 dsa_pointer;

/*
 * Encoded as: (segment_number << DSA_OFFSET_WIDTH) | offset_within_segment
 * DSA_OFFSET_WIDTH = 40 bits on 64-bit, giving 1 TB per segment.
 * Segment number identifies which DSM segment holds the data.
 */
#define DSA_POINTER_FORMAT "%016" PRIx64

/* dsa_create_ext: src/backend/utils/mmgr/dsa.c (line 421) */
dsa_area *dsa_create_ext(int tranche_id, size_t init_segment_size,
                         size_t max_segment_size);

/* dsa_allocate_extended: src/backend/utils/mmgr/dsa.c (line 671) */
dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);

/* dsa_free: src/backend/utils/mmgr/dsa.c (line 826) */
void dsa_free(dsa_area *area, dsa_pointer dp);

/* dsa_get_address: src/backend/utils/mmgr/dsa.c (line 942) */
void *dsa_get_address(dsa_area *area, dsa_pointer dp);
```

dsa_pointer is a 64-bit integer that encodes a (segment number,
offset) pair. It is position-independent: the same dsa_pointer
can be passed between processes and each resolves it against its own
local mapping of that DSM segment via dsa_get_address. The
dsa_area_control structure, embedded at the start of the first DSM
segment, holds the pool freelist and segment directory; any backend
that calls dsa_attach gets its own dsa_area shell pointing at the
same control object.
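The (segment number, offset) encoding can be sketched directly from the description above. The names here (`toy_dsa_pointer`, `toy_make_pointer`, `ToyArea`, `toy_get_address`) are hypothetical and only mirror the idea; the 40-bit offset width is the 64-bit value quoted in the excerpt, and the per-process segment table is reduced to four slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_dsa_pointer;

#define TOY_OFFSET_WIDTH 40   /* bits of offset, as in the 64-bit case above */
#define TOY_OFFSET_MASK  ((UINT64_C(1) << TOY_OFFSET_WIDTH) - 1)
#define TOY_MAX_SEGMENTS 4    /* tiny table, for illustration only */

/* Pack a segment number and an offset into one integer. */
static toy_dsa_pointer
toy_make_pointer(uint64_t segno, uint64_t offset)
{
    assert(offset <= TOY_OFFSET_MASK);
    assert(segno < (UINT64_C(1) << (64 - TOY_OFFSET_WIDTH)));
    return (segno << TOY_OFFSET_WIDTH) | offset;
}

static uint64_t
toy_segment_number(toy_dsa_pointer dp)
{
    return dp >> TOY_OFFSET_WIDTH;
}

static uint64_t
toy_offset(toy_dsa_pointer dp)
{
    return dp & TOY_OFFSET_MASK;
}

/* Each process keeps its own record of where it mapped each segment. */
typedef struct ToyArea
{
    unsigned char *segment_base[TOY_MAX_SEGMENTS];
} ToyArea;

/* Resolve a shared pointer against this process's mappings. */
static void *
toy_get_address(const ToyArea *area, toy_dsa_pointer dp)
{
    uint64_t segno = toy_segment_number(dp);

    assert(segno < TOY_MAX_SEGMENTS && area->segment_base[segno] != NULL);
    return area->segment_base[segno] + toy_offset(dp);
}
```

Two `ToyArea` values with different base addresses for segment 1 resolve the same `toy_dsa_pointer` to different local addresses at the same offset, which is the position-independence property the text relies on.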
Putting the dynamic side together: a parallel worker
The three dynamic layers compose in access/transam/parallel.c. The
leader’s InitializeParallelDSM sizes and creates one DSM segment,
lays a shm_toc table of contents over it, and carves per-worker
shm_mq error/tuple queues out of it; dsa_create_in_place can place a
DSA area inside the same segment when a parallel node needs variable-size
shared scratch (e.g. a parallel hash join). Each worker then re-attaches
by handle and registers itself as the shm_mq sender.
```mermaid
flowchart TD
subgraph Leader
L1["GetSessionDsmHandle"] --> L2["dsm_create(segsize, DSM_CREATE_NULL_IF_MAXSEGMENTS)"]
L2 --> L3["shm_toc_create(PARALLEL_MAGIC, seg base)"]
L3 --> L4["shm_mq_create(start, PARALLEL_ERROR_QUEUE_SIZE)<br/>per worker"]
L4 --> L5["shm_mq_set_receiver(mq, MyProc)"]
L5 --> L6["shm_mq_attach(mq, seg, NULL) -> error_mqh"]
L3 -.optional.-> LD["dsa_create_in_place(place, size, tranche, seg)<br/>variable-size shared scratch"]
L6 --> L7["LaunchParallelWorkers -> RegisterDynamicBackgroundWorker<br/>handle = dsm_segment_handle(seg)"]
end
subgraph Worker
W1["ParallelWorkerMain(main_arg = dsm_handle)"] --> W2["dsm_attach(DatumGetUInt32(main_arg))"]
W2 --> W3["shm_toc_attach(PARALLEL_MAGIC, seg base)"]
W3 --> W4["shm_mq_set_sender(mq, MyProc)"]
W4 --> W5["shm_mq_attach(mq, seg, NULL) -> mqh"]
W3 -.optional.-> WD["dsa_attach_in_place(place, seg)"]
end
L7 ==>|fork/exec inherits handle| W1
W5 ==>|tuples + errors via ring| L6
```
Figure 5 — the dynamic side for one parallel query. A single
dsm_create segment is the shared substrate; shm_toc is the offset
directory the worker uses to relocate every sub-object (queues, DSA
control, plan state) against its own mapping; shm_mq carries the
error/tuple streams; an optional in-place dsa_area supplies
variable-size allocations. The worker never sees a raw pointer from the
leader — only the integer dsm_handle and TOC keys cross the process
boundary.
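The role of the table of contents, an offset directory that each process resolves against its own mapping, can be shown with a small standalone model. This is a sketch under stated assumptions, not the shm_toc code: `ToyToc` and the `toy_toc_*` functions are hypothetical, the directory is a fixed eight-entry array, and allocation is a simple forward bump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TOC_MAX 8

typedef struct ToyTocEntry
{
    uint64_t key;     /* caller-chosen identifier */
    size_t   offset;  /* offset from the start of the segment */
} ToyTocEntry;

typedef struct ToyToc
{
    uint64_t    magic;     /* sanity check for attachers */
    size_t      total;     /* segment size in bytes */
    size_t      free_off;  /* next free byte after the directory */
    size_t      nentries;
    ToyTocEntry entry[TOY_TOC_MAX];
} ToyToc;

/* Lay a fresh directory over the start of a segment. */
static ToyToc *
toy_toc_create(uint64_t magic, void *base, size_t total)
{
    ToyToc *toc = base;

    assert(total >= sizeof(ToyToc));
    toc->magic = magic;
    toc->total = total;
    toc->free_off = sizeof(ToyToc);
    toc->nentries = 0;
    return toc;
}

/* Carve nbytes (rounded up to 8) out of the segment; NULL when full. */
static void *
toy_toc_allocate(ToyToc *toc, size_t nbytes)
{
    size_t padded = (nbytes + 7) & ~(size_t) 7;
    void  *p;

    if (padded < nbytes || padded > toc->total - toc->free_off)
        return NULL;
    p = (unsigned char *) toc + toc->free_off;
    toc->free_off += padded;
    return p;
}

/* Publish an address under a key; only its offset is stored. */
static int
toy_toc_insert(ToyToc *toc, uint64_t key, void *address)
{
    if (toc->nentries == TOY_TOC_MAX)
        return 0;
    toc->entry[toc->nentries].key = key;
    toc->entry[toc->nentries].offset =
        (size_t) ((unsigned char *) address - (unsigned char *) toc);
    toc->nentries++;
    return 1;
}

/* Attach to an existing directory; NULL if the magic number does not match. */
static ToyToc *
toy_toc_attach(uint64_t magic, void *base)
{
    ToyToc *toc = base;

    return toc->magic == magic ? toc : NULL;
}

/* Resolve a key against this mapping's base address. */
static void *
toy_toc_lookup(ToyToc *toc, uint64_t key)
{
    for (size_t i = 0; i < toc->nentries; i++)
        if (toc->entry[i].key == key)
            return (unsigned char *) toc + toc->entry[i].offset;
    return NULL;
}
```

Copying the whole segment to a second buffer and attaching there imitates a worker that mapped the segment at a different address: the lookup returns a pointer inside the second buffer, yet it refers to the same logical object.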
Extension point: RequestAddinShmemSpace and shmem_startup_hook
Extensions loaded via shared_preload_libraries can grow the static
segment before it is created:

```c
/* RequestAddinShmemSpace: src/backend/storage/ipc/ipci.c */
void
RequestAddinShmemSpace(Size size)
{
    if (!process_shmem_requests_in_progress)
        elog(FATAL, "cannot request additional shared memory "
             "outside shmem_request_hook");
    total_addin_request = add_size(total_addin_request, size);
}
```

Extensions call this from shmem_request_hook, which the postmaster runs
after loading shared_preload_libraries and before the segment is sized;
CalculateShmemSize then folds total_addin_request into the sum. A second hook,
shmem_startup_hook, fires after CreateOrAttachShmemStructs and
gives the extension the opportunity to call ShmemInitStruct for its
own objects. Together the two hooks form the core extension API for
persistent shared-memory structures.
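The gating rule (requests are accepted only while the request hook is running) can be modeled in standalone C. Everything here is a hypothetical stand-in: `toy_request_addin_space` returns false where the real function raises FATAL, and `toy_extension_request` plays the part of an extension's hook.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*ToyRequestHook)(void);

static bool           toy_requests_in_progress = false;
static size_t         toy_total_addin_request = 0;
static ToyRequestHook toy_shmem_request_hook = NULL;

/* Accept a reservation only while the request phase is open. */
static bool
toy_request_addin_space(size_t size)
{
    if (!toy_requests_in_progress)
        return false;           /* the real function reports FATAL here */
    if (size > SIZE_MAX - toy_total_addin_request)
        return false;           /* overflow guard */
    toy_total_addin_request += size;
    return true;
}

/* Open the window, run the hook chain once, close the window again. */
static void
toy_process_shmem_requests(void)
{
    toy_requests_in_progress = true;
    if (toy_shmem_request_hook)
        toy_shmem_request_hook();
    toy_requests_in_progress = false;
}

/* What an extension's hook might do: reserve 4 kB for its own state. */
static void
toy_extension_request(void)
{
    bool ok = toy_request_addin_space(4096);

    assert(ok);
}
```

Because the window closes before the segment is sized, a late request cannot silently change a total that has already been used to create the segment.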
Source Walkthrough
Static segment sizing and initialization (ipci.c)
- CalculateShmemSize: sums all per-subsystem *ShmemSize() sizing functions plus total_addin_request; called by CreateSharedMemoryAndSemaphores and by InitializeShmemGUCs.
- CreateSharedMemoryAndSemaphores: postmaster-only entry point; calls PGSharedMemoryCreate, InitShmemAccess, InitShmemAllocation, CreateOrAttachShmemStructs, and dsm_postmaster_startup.
- CreateOrAttachShmemStructs: master initializer; calls every subsystem’s ShmemInit* function in dependency order.
- AttachSharedMemoryStructs: EXEC_BACKEND re-attach path (Windows); re-runs CreateOrAttachShmemStructs to rebuild local pointers.
- RequestAddinShmemSpace: extension API for pre-creation size reservation.
- InitializeShmemGUCs: computes shared_memory_size and shared_memory_size_in_huge_pages GUC values.
Bump allocator and shmem index (shmem.c)
- InitShmemAccess: sets module-level ShmemBase, ShmemEnd, and ShmemSegHdr from the PGShmemHeader pointer.
- InitShmemAllocation: allocates and initializes ShmemLock (spinlock) and zeroes ShmemIndex.
- ShmemAlloc: calls ShmemAllocRaw; errors on out-of-space.
- ShmemAllocNoError: same; returns NULL on out-of-space.
- ShmemAllocUnlocked: lockless variant used during bootstrap before ShmemLock itself is allocated.
- ShmemAllocRaw: internal; CACHELINEALIGN + spinlock-guarded bump.
- InitShmemIndex: creates the ShmemIndex HTAB at the base of the arena using ShmemAllocUnlocked.
- ShmemInitStruct: named-object create-or-attach via ShmemIndex.
- ShmemInitHash: like ShmemInitStruct but for HTAB; parameters passed as HASHCTL.
Dynamic shared memory (dsm.c + dsm_impl.c)
- dsm_postmaster_startup: creates the DSM control segment at postmaster startup; records its handle in the static segment’s PGShmemHeader.
- dsm_shmem_init: called from CreateOrAttachShmemStructs to map the control segment in each backend.
- dsm_create: allocates a new DSM segment, reserves a dsm_control_item slot, returns a dsm_segment handle.
- dsm_attach: maps an existing DSM segment by handle; increments the dsm_control_item.refcnt.
- dsm_detach: unmaps a segment; fires on_detach callbacks; decrements refcnt; destroys the OS object once no reference remains.
- on_dsm_detach: registers a callback to fire before a given segment is detached; used by DSA to release area memory.
- dsm_impl_op: platform dispatch; selects dsm_impl_posix, dsm_impl_sysv, dsm_impl_windows, or dsm_impl_mmap based on the dynamic_shared_memory_type GUC.
- dsm_control_item: shared metadata per segment: handle, refcnt, first_page, npages, pinned.
Ring-buffer message queue (shm_mq.c)
- shm_mq_create: initializes a shm_mq at a caller-provided address inside a DSM segment; sets mq_ring_size.
- shm_mq_set_receiver / shm_mq_set_sender: register the receiving and sending PGPROC; each may be called only once.
- shm_mq_attach: creates a backend-local shm_mq_handle for an existing shm_mq; optionally registers a BackgroundWorkerHandle.
- shm_mq_send: sends a message; blocks (or returns SHM_MQ_WOULD_BLOCK) when the ring is full.
- shm_mq_receive: receives the next message; blocks (or returns SHM_MQ_WOULD_BLOCK) when the ring is empty.
- shm_mq_detach: marks mq_detached = true; wakes the counterparty’s latch.
Variable-size allocator over DSM (dsa.c)
- dsa_create_ext: creates a new dsa_area backed by a fresh DSM segment; stores the dsa_area_control at the segment’s base.
- dsa_attach: maps an existing DSA area by dsa_handle (which is a dsm_handle).
- dsa_allocate_extended: allocates size bytes; returns a position-independent dsa_pointer.
- dsa_free: returns memory to the area’s freelist.
- dsa_get_address: translates a dsa_pointer to a backend-local virtual address.
- dsa_pin / dsa_unpin: prevent / allow destruction of the area when the last backend detaches.
Position-hint table (as of 2026-06-05, commit 273fe94, REL_18_STABLE)
| Symbol | File | Line |
|---|---|---|
| CalculateShmemSize | storage/ipc/ipci.c | 89 |
| CreateSharedMemoryAndSemaphores | storage/ipc/ipci.c | 200 |
| CreateOrAttachShmemStructs | storage/ipc/ipci.c | 268 |
| AttachSharedMemoryStructs | storage/ipc/ipci.c | 173 |
| RequestAddinShmemSpace | storage/ipc/ipci.c | 74 |
| InitializeShmemGUCs | storage/ipc/ipci.c | 357 |
| PGShmemHeader | include/storage/pg_shmem.h | 29 |
| InitShmemAccess | storage/ipc/shmem.c | 102 |
| InitShmemAllocation | storage/ipc/shmem.c | 115 |
| ShmemAlloc | storage/ipc/shmem.c | 152 |
| ShmemAllocNoError | storage/ipc/shmem.c | 172 |
| ShmemAllocUnlocked | storage/ipc/shmem.c | 238 |
| ShmemAllocRaw (static) | storage/ipc/shmem.c | 183 (approx) |
| InitShmemIndex | storage/ipc/shmem.c | 283 |
| ShmemInitHash | storage/ipc/shmem.c | 332 |
| ShmemInitStruct | storage/ipc/shmem.c | 387 |
| dsm_postmaster_startup | storage/ipc/dsm.c | 177 |
| dsm_shmem_init | storage/ipc/dsm.c | 479 |
| dsm_create | storage/ipc/dsm.c | 516 |
| dsm_attach | storage/ipc/dsm.c | 665 |
| dsm_detach | storage/ipc/dsm.c | 803 |
| on_dsm_detach | storage/ipc/dsm.c | 1132 |
| dsm_segment (struct) | storage/ipc/dsm.c | 66 |
| dsm_control_item (struct) | storage/ipc/dsm.c | 79 |
| dsm_impl_op | storage/ipc/dsm_impl.c | 159 |
| shm_mq (struct) | storage/ipc/shm_mq.c | 71 |
| shm_mq_create | storage/ipc/shm_mq.c | 177 |
| shm_mq_set_receiver | storage/ipc/shm_mq.c | 206 |
| shm_mq_set_sender | storage/ipc/shm_mq.c | 224 |
| shm_mq_attach | storage/ipc/shm_mq.c | 290 |
| shm_mq_send | storage/ipc/shm_mq.c | 329 |
| shm_mq_receive | storage/ipc/shm_mq.c | 572 |
| shm_mq_detach | storage/ipc/shm_mq.c | 843 |
| dsa_create_ext | utils/mmgr/dsa.c | 421 |
| dsa_attach | utils/mmgr/dsa.c | 510 |
| dsa_allocate_extended | utils/mmgr/dsa.c | 671 |
| dsa_free | utils/mmgr/dsa.c | 826 |
| dsa_get_address | utils/mmgr/dsa.c | 942 |
| dsa_pin | utils/mmgr/dsa.c | 975 |
| dsa_unpin | utils/mmgr/dsa.c | 994 |
| dsa_create_in_place_ext | utils/mmgr/dsa.c | 471 |
| dsa_attach_in_place | utils/mmgr/dsa.c | 545 |
| shm_toc_create | storage/ipc/shm_toc.c | 40 |
| shm_toc_attach | storage/ipc/shm_toc.c | 64 |
| GetSessionDsmHandle | access/common/session.c | 70 |
| InitializeParallelDSM | access/transam/parallel.c | 211 |
| LaunchParallelWorkers | access/transam/parallel.c | 580 |
| ParallelWorkerMain | access/transam/parallel.c | 1299 |
Source verification (as of 2026-06-05)
Verified facts
- CalculateShmemSize is a simple sum of per-subsystem sizing functions, with no dynamic adjustment. Verified by reading ipci.c lines 89–170. The function calls ~37 *ShmemSize() helpers plus total_addin_request. Additions since REL_16 are AioShmemSize() (async I/O, new in PG18) and SlotSyncShmemSize() (slot sync worker, added in PG17). There is no runtime feedback loop; if the segment is created too small, ShmemAlloc will error with “out of shared memory.”
- ShmemAllocRaw aligns to CACHELINEALIGN, not just MAXALIGN. Verified at shmem.c line ~205. The comment reads: “experience has proved that on modern systems [MAXALIGN] is not good enough… attempt to align the beginning of the allocation to a cache line boundary.” This is a deliberate change from the older MAXALIGN-only behavior.
- The DSM platform is selected at runtime via the dynamic_shared_memory_type GUC, not at compile time. Verified in dsm_impl_op (dsm_impl.c line 159). The switch dispatches on the GUC value; all four backends (posix/sysv/windows/mmap) may be compiled in. The default is posix on Linux/macOS.
- shm_mq’s data path uses no mutex; only mq_receiver / mq_sender assignment is mutex-protected. Verified by reading the struct comment in shm_mq.c lines 30–67. mq_bytes_read and mq_bytes_written are pg_atomic_uint64 with documented memory ordering; mq_ring is written and read without any lock.
- dsa_pointer encodes a (segment_number, offset) pair. Verified in dsa.h lines 81–103 and dsa.c lines 78–98. On 64-bit systems, DSA_OFFSET_WIDTH = 40 allows segments up to 1 TB and up to 1024 segments per area. The 32-bit fallback uses DSA_OFFSET_WIDTH = 27 (32 segments, 128 MB each).
- RequestAddinShmemSpace is gated to fire only during shmem_request_hook. Verified at ipci.c lines 74–77. A call from outside the hook causes FATAL. The flag process_shmem_requests_in_progress is set in the postmaster’s pre-fork hook loop and cleared before CreateSharedMemoryAndSemaphores returns.
- Relative to REL_16, CreateOrAttachShmemStructs gains AioShmemInit() and SlotSyncShmemInit(). Verified by grepping ipci.c lines 268–355. AioShmemInit (new in PG18) initializes the async I/O subsystem (storage/aio/); SlotSyncShmemInit supports the slot sync worker added in PG17. Both are absent from REL_16 and earlier.
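The dsa_pointer layout in the list above is plain bit arithmetic, so a small model can make the 1 TB / 1024-segment figures concrete. The sketch below is a Python illustration under the 64-bit assumption (DSA_OFFSET_WIDTH = 40); the function names are hypothetical stand-ins for the C macros in dsa.c, not PostgreSQL's API.

```python
# Hypothetical Python model of the (segment_number, offset) packing.
# The real implementation is a set of C macros; only the arithmetic is shown.
DSA_OFFSET_WIDTH = 40                       # 64-bit build, per the text above
DSA_OFFSET_BITMASK = (1 << DSA_OFFSET_WIDTH) - 1
DSA_MAX_SEGMENTS = 1024                     # per-area cap quoted above


def make_pointer(segment_number: int, offset: int) -> int:
    """Pack a segment number and a byte offset into one integer."""
    if not 0 <= segment_number < DSA_MAX_SEGMENTS:
        raise ValueError("segment number out of range")
    if not 0 <= offset <= DSA_OFFSET_BITMASK:
        raise ValueError("offset does not fit in DSA_OFFSET_WIDTH bits")
    return (segment_number << DSA_OFFSET_WIDTH) | offset


def extract_segment_number(dp: int) -> int:
    """High bits: which DSM segment of the area the pointer refers to."""
    return dp >> DSA_OFFSET_WIDTH


def extract_offset(dp: int) -> int:
    """Low DSA_OFFSET_WIDTH bits: byte offset inside that segment."""
    return dp & DSA_OFFSET_BITMASK


if __name__ == "__main__":
    dp = make_pointer(3, 4096)
    print(extract_segment_number(dp), extract_offset(dp))  # 3 4096
    print((DSA_OFFSET_BITMASK + 1) == 2**40)               # True: 1 TB per segment
```

Because the pointer is an integer rather than an address, each attached process can resolve it against its own mapping of the segment, which is why the encoding works even when segments land at different virtual addresses in different backends.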
Open questions
- The ShmemAllocRaw offset in the position-hint table is approximate. ShmemAllocRaw is a static function without a grep-stable entry point; the line 183 estimate was computed as InitShmemAllocation (115) + visible function bodies. Verification path: grep -n 'ShmemAllocRaw' shmem.c.
- NUMA-aware shmem allocation. shmem.c references pg_numa.h and a firstNumaTouch flag (line 96). It is unclear whether PG18 actually uses NUMA topology when placing the static segment or merely exposes the pg_numa_available() SQL function. Investigation path: trace firstNumaTouch through ShmemAlloc and the new pg_numa_available built-in.
- dsa_create vs dsa_create_ext. Resolved. In PG18 dsa_create is a macro in dsa.h (line 117) that expands to dsa_create_ext(tranche_id, DSA_DEFAULT_INIT_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE); likewise dsa_create_in_place (line 122) wraps dsa_create_in_place_ext. The _ext forms are the real functions; the bare names are convenience macros with default segment-size bounds. No rename occurred.
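The ShmemAllocRaw behavior recorded under Verified facts (cache-line rounding, and “out of shared memory” once the fixed-size segment is exhausted) can be modeled in a few lines. This is a sketch, not the PostgreSQL code: the class and function names are hypothetical, the 128-byte line size is an assumption standing in for the build's cache-line constant, and locking is omitted entirely.

```python
# Hypothetical model of a bump allocator over a fixed-size segment.
CACHE_LINE_SIZE = 128  # assumption: stand-in for the build's cache-line size


def cacheline_align(n: int) -> int:
    """Round n up to the next multiple of CACHE_LINE_SIZE (a power of two)."""
    return (n + CACHE_LINE_SIZE - 1) & ~(CACHE_LINE_SIZE - 1)


class BumpAllocator:
    """Hands out offsets from a segment whose size is fixed at creation."""

    def __init__(self, total_size: int) -> None:
        self.total_size = total_size
        self.free_offset = 0  # starts aligned, so every allocation stays aligned

    def alloc(self, size: int) -> int:
        # Rounding the size keeps the next start on a cache-line boundary.
        size = cacheline_align(size)
        new_free = self.free_offset + size
        if new_free > self.total_size:
            # No feedback loop: the segment cannot grow after startup.
            raise MemoryError("out of shared memory")
        start = self.free_offset
        self.free_offset = new_free
        return start


if __name__ == "__main__":
    seg = BumpAllocator(512)
    print(seg.alloc(1))    # 0
    print(seg.alloc(200))  # 128
    try:
        seg.alloc(129)     # needs 256 bytes, only 128 remain
    except MemoryError as exc:
        print(exc)         # out of shared memory
```

The model also shows why the startup size calculation matters: there is no free list and no way to return space, so an undersized segment fails at the first allocation that crosses the end.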
Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Oracle SGA (System Global Area). Oracle’s static shared segment is conceptually identical to PostgreSQL’s: sized at startup, bump-allocated, and hosting the buffer cache, lock table, and redo buffer. Oracle adds the Automatic Memory Management (AMM) layer that can redistribute SGA components at runtime, a feature PostgreSQL explicitly does not have (the CalculateShmemSize sum is fixed). A design comparison would quantify what dynamic redistribution buys vs. the complexity it adds.
- Linux io_uring shared rings and the PG18 storage/aio/ subsystem. PG18’s async I/O layer (AioShmemInit) places I/O worker state in the static segment and, when io_method is set to io_uring, uses io_uring submission / completion queues, which are themselves shared-memory ring buffers established by the kernel. The shm_mq ring-buffer pattern in userspace and the io_uring pattern in kernel space are structurally similar (producer/consumer with atomic counters); a side-by-side would clarify where PostgreSQL’s tuple-transport ring and the kernel’s I/O ring diverge in their memory-ordering guarantees.
- CockroachDB / distributed shared-nothing. PostgreSQL’s design assumes all processes share physical memory. A shared-nothing distributed engine has no shared segment: all state is either node-local or exchanged via an explicit RPC layer. The scalable-lock-manager.md paper (Johnson et al., VLDB 2010) is relevant here. It observes that even in shared-memory engines, centralized lock-table access is a scalability bottleneck, motivating per-CPU partitioning, a pattern PostgreSQL uses for its buffer-mapping table but not yet for the entire lock table.
- shm_mq vs. LMAX Disruptor / lock-free ring patterns. shm_mq’s lockless data path is a classic single-producer/single-consumer ring buffer. The LMAX Disruptor (Thompson et al., 2011) extends this to multi-producer/multi-consumer with sequence barriers. PostgreSQL’s parallel executor is limited to one sender per shm_mq instance; a future multi-parallel-worker-to-gather transport could benefit from a multi-producer variant. Porting Disruptor patterns to a shared-memory context with pg_atomic_* primitives is a concrete research avenue.
- DSA vs. custom slab allocators (e.g., radix tree in pg_wait). dsa is a general-purpose variable-size allocator. Several PostgreSQL subsystems (cumulative stats, pg_wait_sampling) have rolled specialized data structures on top of DSM without DSA, managing segments and pointers directly. Understanding where the DSA overhead (size-class bucketing, pagemap per segment) is worth paying vs. where a hand-rolled layout is better would clarify when to adopt DSA in new subsystems.
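The single-producer/single-consumer ring mentioned in the shm_mq item above rests on two monotonically increasing byte counters. The following sketch models only that index arithmetic in single-threaded Python; the class and attribute names are hypothetical, and it deliberately omits the atomics, memory barriers, and latch wakeups that the real shm_mq.c depends on.

```python
class RingModel:
    """Single-threaded model of an SPSC byte ring driven by two counters.

    Hypothetical names; illustrates index arithmetic only, not the
    concurrency protocol.
    """

    def __init__(self, ring_size: int) -> None:
        self.ring = bytearray(ring_size)
        self.ring_size = ring_size
        self.bytes_written = 0  # advanced only by the sender
        self.bytes_read = 0     # advanced only by the receiver

    def send(self, data: bytes) -> int:
        """Copy as much of data as currently fits; return bytes accepted."""
        used = self.bytes_written - self.bytes_read
        n = min(len(data), self.ring_size - used)
        for i in range(n):
            # The counter never wraps; only the ring index does.
            self.ring[(self.bytes_written + i) % self.ring_size] = data[i]
        self.bytes_written += n
        return n

    def receive(self, max_bytes: int) -> bytes:
        """Return up to max_bytes of unread data and mark it consumed."""
        used = self.bytes_written - self.bytes_read
        n = min(max_bytes, used)
        out = bytes(
            self.ring[(self.bytes_read + i) % self.ring_size] for i in range(n)
        )
        self.bytes_read += n
        return out


if __name__ == "__main__":
    q = RingModel(8)
    print(q.send(b"abcdef"))  # 6
    print(q.receive(4))       # b'abcd'
    print(q.send(b"ghijkl"))  # 6 (wraps past the end of the ring)
    print(q.receive(8))       # b'efghijkl'
```

Since each counter has exactly one writer, the data path needs no mutex; a multi-producer variant of the kind the Disruptor comparison suggests would have to add coordination on the write counter.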
Sources
Raw files
(none; analysis synthesized directly from source tree)
Textbook chapters
- Database System Concepts (Silberschatz, Korth, Sudarshan), 7th ed., ch. 17 §“Shared Memory and Semaphores”: shared memory as the standard buffer-pool substrate in multi-process engines.
- Database Internals (Petrov, 2019), ch. 4 §“Buffer Management”: buffer pool as shared structure; page ID as cache key.
Source code (REL_18_STABLE, commit 273fe94)
- src/backend/storage/ipc/ipci.c: segment sizing + master initializer.
- src/backend/storage/ipc/shmem.c: bump allocator + shmem index.
- src/backend/storage/ipc/dsm.c: dynamic shared memory segments.
- src/backend/storage/ipc/dsm_impl.c: platform backends (posix/sysv/windows/mmap).
- src/backend/storage/ipc/shm_mq.c: ring-buffer message queue.
- src/backend/utils/mmgr/dsa.c: variable-size DSM allocator.
- src/backend/storage/ipc/shm_toc.c: table-of-contents over a DSM segment (key → offset directory used by parallel workers).
- src/backend/access/transam/parallel.c: leader/worker DSM + shm_mq setup (InitializeParallelDSM, LaunchParallelWorkers, ParallelWorkerMain).
- src/backend/access/common/session.c: GetSessionDsmHandle.
- src/include/storage/pg_shmem.h: PGShmemHeader.
- src/include/storage/dsm.h: dsm_handle, dsm_segment public API.
- src/include/storage/shm_mq.h: shm_mq public API.
- src/include/utils/dsa.h: dsa_pointer, dsa_area public API.
- src/include/access/transam.h: VariableCacheData / ShmemVariableCache XID/OID counter layout.
Cross-references
- postgres-architecture-overview.md: Axis 2 (shared-memory spine); postmaster fork model.
- postgres-lock-manager.md: LockManagerShmemInit resident in the static segment.
- postgres-lwlock-spinlock.md: LWLockShmemSize / CreateLWLocks; LWLocks live in the static segment but their internals are covered there.
- postgres-buffer-manager.md: BufferManagerShmemInit; buffer descriptors and page frames are the largest residents of the static segment.
- postgres-parallel-query.md: DSM + shm_mq consumer; how nodeGather.c drives the tuple-transport ring.
- knowledge/research/dbms-papers/scalable-lock-manager.md: Johnson et al. VLDB 2010; per-CPU lock-table partitioning as scalability motivation.