PostgreSQL Memory Contexts — Hierarchical Region Allocation and longjmp-Safe Cleanup
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Every long-running server faces the same memory-management problem: a request
allocates dozens or hundreds of small objects with a shared lifetime, and if
even one of them is not freed on every exit path — including the error exit
path — the process leaks. Tracking each malloc() with a matching free() is
both slow (per-chunk bookkeeping) and fragile (one missed branch leaks
forever). The classic answer is region-based memory management, also called
arena or zone allocation: allocate many objects from a single region,
then free the entire region in one operation rather than freeing each object.
Database Internals (Petrov, ch. “Implementation Details”) frames this in the
context of a database’s buffer and working memory: a general-purpose allocator
must trade off fragmentation (internal — rounding each request up to a size
class; and external — free holes too small to reuse), allocation speed, and
reclamation cost. A region allocator collapses reclamation to O(1): bump a
pointer to allocate, discard the whole arena to free. The price is that you give
up fine-grained free() of individual objects — you reclaim at region
granularity, not object granularity. For a workload whose objects naturally
share a lifetime (everything a query touches dies when the query ends), that is
exactly the right trade.
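The bump-and-discard idea can be sketched in a few lines of C. This is a toy, not PostgreSQL code: the `Arena` type and the `arena_*` function names are hypothetical, and a single fixed-size buffer stands in for a real allocator's growing list of blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy arena: one malloc'd buffer plus a bump offset. */
typedef struct Arena
{
    char   *base;       /* start of the backing buffer */
    size_t  used;       /* bytes handed out so far */
    size_t  capacity;   /* total bytes in the buffer */
} Arena;

/* Create the arena; returns 0 on success, -1 if malloc fails. */
int
arena_init(Arena *a, size_t capacity)
{
    a->base = malloc(capacity);
    a->used = 0;
    a->capacity = capacity;
    return a->base != NULL ? 0 : -1;
}

/* Allocate by bumping the offset; returns NULL when the buffer is full. */
void *
arena_alloc(Arena *a, size_t size)
{
    size_t  aligned = (size + 7) & ~(size_t) 7;   /* round up to 8 bytes */
    void   *p;

    if (aligned < size || aligned > a->capacity - a->used)
        return NULL;
    p = a->base + a->used;
    a->used += aligned;
    return p;
}

/* Reclaim every object at once: O(1), no per-object bookkeeping. */
void
arena_reset(Arena *a)
{
    a->used = 0;
}

/* Return the backing buffer to malloc. */
void
arena_destroy(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->used = a->capacity = 0;
}
```

Note what is missing: there is no `arena_free(p)`. Individual objects cannot be released, which is exactly the trade described above.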
Two design choices follow from the region model and shape every implementation:
- How to scope a region to a lifetime. A flat “one big arena per process” leaks anything whose lifetime is shorter than the process. The standard refinement is to give regions a hierarchy: a region can have child regions, and discarding a parent discards all descendants. This lets nested lifetimes (per-transaction ⊃ per-statement ⊃ per-tuple) map onto a tree of regions, so the right granularity of “free everything” is always one node away.
- How to recover the region from a bare pointer. free(p) and realloc(p) are handed only the object pointer, not its region. A region allocator must therefore stamp each chunk with enough metadata (a header) to recover which region (and which kind of allocator) owns it. The header’s width is itself a fragmentation cost, so allocators that never need per-object free() can drop it entirely.
PostgreSQL’s memory context subsystem is a region allocator with both
refinements: regions form a tree (the context forest), and chunks carry a
compact header identifying the owning allocator. The rest of this document
traces how the tree, the chunk header, and four interchangeable region
implementations are realized in src/backend/utils/mmgr/.
Common DBMS Design
The textbook gives the region model; this section names the engineering conventions a server-grade region allocator adopts to make it usable across an entire codebase. PostgreSQL’s specifics in the next section are best read as one set of choices within this shared space.
An implicit “current” region
Threading an explicit region argument through every function that might
allocate is intolerable — copyObject(), expression evaluation, and a thousand
helpers would each grow a parameter. Real systems keep a thread-of-control’s
“current region” in a global (or thread-local), and the bare allocation call
(palloc, here) draws from it. Switching regions is a cheap save/restore of
that global, and the discipline is “switch in, allocate, switch back.”
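A minimal sketch of the “switch in, allocate, switch back” discipline, using a toy region rather than PostgreSQL's real types. The names `Region`, `current_region`, `region_switch_to`, `region_alloc`, and `alloc_in` are all hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy region: a fixed buffer with a bump offset. */
typedef struct Region
{
    const char *name;
    size_t      used;
    char        buf[256];
} Region;

/* The implicit "current" region for this thread of control. */
Region *current_region = NULL;

/* Make 'next' current; return the previous region so the caller can restore it. */
Region *
region_switch_to(Region *next)
{
    Region *old = current_region;

    current_region = next;
    return old;
}

/* Allocate from whatever region is current; NULL if it is full. */
void *
region_alloc(size_t size)
{
    Region *r = current_region;
    size_t  aligned = (size + 7) & ~(size_t) 7;

    assert(r != NULL);
    if (aligned > sizeof(r->buf) - r->used)
        return NULL;
    r->used += aligned;
    return r->buf + r->used - aligned;
}

/* Switch in, allocate, switch back: the result lands in 'target'. */
void *
alloc_in(Region *target, size_t size)
{
    Region *old = region_switch_to(target);
    void   *p = region_alloc(size);

    region_switch_to(old);
    return p;
}
```

Helpers that only call `region_alloc` need no region parameter; only the code that knows about lifetimes touches `region_switch_to`.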
A region tree scoped to nested lifetimes
The hierarchy from the theory section becomes a concrete pattern: a small set
of well-known long-lived roots (per-process, per-transaction, per-query,
per-tuple) with transient children hung underneath whichever root matches their
lifetime. Cleanup is then “reset the per-statement region between statements,”
“delete the per-query region at ExecutorEnd,” and so on. Deleting a node
deletes its subtree, so a forgotten child is reclaimed with its parent rather
than leaked.
The chunk header as a type tag
Because free(p) gets only the pointer, every region allocator stamps a
header immediately before the returned address. Minimally the header
encodes the owning region; usefully it also encodes the allocator type (so a
single free() entry point can dispatch to the right implementation) and the
chunk size (so the freed space can be recycled or accounted). Header width
trades directly against density — a header on millions of tiny chunks is pure
overhead — so high-performance allocators shrink or eliminate it.
Multiple allocators behind one interface
No single allocation policy is best for all patterns. A general workload wants
size-class freelists; a stream of identically-sized objects wants a slab; a
FIFO producer/consumer wants generational blocks; a write-once scratchpad wants
a header-less bump pointer. The convention is a vtable of allocator methods
(alloc, free, realloc, reset, delete) so callers pick a policy at
region-creation time and the rest of the code is policy-agnostic.
Error cleanup by region teardown
The decisive payoff for a database: when a statement aborts, the cleanup code does not walk data structures freeing objects; it deletes the regions the statement allocated into. This is why region allocation and a server’s error-handling discipline are co-designed; the exception unwind path’s entire memory job is “delete these contexts.”
Theory ↔ PostgreSQL mapping
| Theory / convention | PostgreSQL name |
|---|---|
| Region / arena | MemoryContext (MemoryContextData header + method vtable) |
| Region tree (parent/child) | parent / firstchild / nextchild links in MemoryContextData |
| Current region (implicit) | CurrentMemoryContext global; MemoryContextSwitchTo() |
| Bare allocate from current region | palloc() / palloc0() |
| Free a chunk via its owning region | pfree() → MCXT_METHOD(p, free_p) |
| Discard whole region | MemoryContextReset() / MemoryContextDelete() |
| Chunk header = allocator type tag | low 4 bits of the uint64 MemoryChunk header = MemoryContextMethodID |
| Allocator vtable | MemoryContextMethods; mcxt_methods[] array indexed by method id |
| General-purpose policy | AllocSetContext (aset.c) |
| Fixed-size policy | SlabContext (slab.c) |
| FIFO / lifespan-group policy | GenerationContext (generation.c) |
| Header-less write-once policy | BumpContext (bump.c) |
| Error cleanup by teardown | abort path deletes TopTransactionContext and friends |
PostgreSQL’s Approach
PostgreSQL allocates essentially all backend-private memory through memory
contexts. A context is an abstract base (MemoryContextData) carrying the
tree links and a pointer to a method table; the concrete storage policy
lives in a derived struct (AllocSetContext, SlabContext, …) whose first
field is that base. palloc() allocates from CurrentMemoryContext;
pfree()/repalloc() recover the owning context from the chunk header and so
work regardless of which context is current. Resetting or deleting a context
frees everything allocated within it — and, for delete, everything in its
descendant contexts — which is the mechanism error handling uses to unwind.
Scope note. This doc owns the context tree, the four allocators, the palloc/pfree API, and reset/delete. How the error path decides which contexts to delete lives in postgres-error-handling.md (the PG_TRY/sigsetjmp machinery and ErrorContext); non-memory resources released alongside memory (buffer pins, relation locks, file descriptors) live in postgres-resource-owners.md. This doc covers only the memory leg, and the reset-callback hook that lets a context release a non-memory resource.
The abstract context and its vtable
MemoryContextData is the common header every context type begins with. It
holds the tree links, the accounting counter, two booleans, and the method
vtable pointer:
```c
// MemoryContextData (src/include/nodes/memnodes.h)
typedef struct MemoryContextData
{
    pg_node_attr(abstract)      /* there are no nodes of this type */

    NodeTag     type;           /* identifies exact kind of context */
    bool        isReset;        /* T = no space alloced since last reset */
    bool        allowInCritSection; /* allow palloc in critical section */
    Size        mem_allocated;  /* track memory allocated for this context */
    const MemoryContextMethods *methods;    /* virtual function table */
    MemoryContext parent;       /* NULL if no parent (toplevel context) */
    MemoryContext firstchild;   /* head of linked list of children */
    MemoryContext prevchild;    /* previous child of same parent */
    MemoryContext nextchild;    /* next child of same parent */
    const char *name;           /* context name */
    const char *ident;          /* context ID if any */
    MemoryContextCallback *reset_cbs;   /* list of reset/delete callbacks */
} MemoryContextData;
```

The methods pointer is a C-level virtual table; the README calls
MemoryContextData “essentially an abstract superclass.” The vtable shape:
```c
// MemoryContextMethods (src/include/nodes/memnodes.h, condensed)
typedef struct MemoryContextMethods
{
    void       *(*alloc) (MemoryContext context, Size size, int flags);
    void        (*free_p) (void *pointer);
    void       *(*realloc) (void *pointer, Size size, int flags);
    void        (*reset) (MemoryContext context);
    void        (*delete_context) (MemoryContext context);
    MemoryContext (*get_chunk_context) (void *pointer);
    Size        (*get_chunk_space) (void *pointer);
    bool        (*is_empty) (MemoryContext context);
    void        (*stats) (MemoryContext context,
                          MemoryStatsPrintFunc printfunc, ...);
    /* ... check() under MEMORY_CONTEXT_CHECKING ... */
} MemoryContextMethods;
```

There is exactly one vtable instance per allocator type, gathered in a single
designated-initializer array in mcxt.c. The array is indexed by a small enum,
MemoryContextMethodID:
```c
// mcxt_methods[] (src/backend/utils/mmgr/mcxt.c, condensed)
static const MemoryContextMethods mcxt_methods[] = {
    [MCTX_ASET_ID].alloc = AllocSetAlloc,
    [MCTX_ASET_ID].free_p = AllocSetFree,
    [MCTX_ASET_ID].realloc = AllocSetRealloc,
    [MCTX_ASET_ID].reset = AllocSetReset,
    [MCTX_ASET_ID].delete_context = AllocSetDelete,
    /* ... get_chunk_context / get_chunk_space / is_empty / stats ... */

    /* generation.c */
    [MCTX_GENERATION_ID].alloc = GenerationAlloc,
    /* ... */

    /* slab.c */
    [MCTX_SLAB_ID].alloc = SlabAlloc,
    /* ... */

    /* bump.c */
    [MCTX_BUMP_ID].alloc = BumpAlloc,
    /* ... */

    /*
     * Reserved / unused IDs get BOGUS_MCTX dummy entries so a bad
     * pointer fails cleanly instead of jumping through garbage.
     */
    BOGUS_MCTX(MCTX_1_RESERVED_GLIBC_ID),
    /* ... */
};
```

flowchart TB
subgraph ABS["abstract base"]
MCD["MemoryContextData<br/>type, isReset, mem_allocated,<br/>parent/firstchild/nextchild,<br/>methods -> vtable"]
end
MCD --> VT["MemoryContextMethods vtable<br/>(one per allocator type)"]
VT --> ARR["mcxt_methods[ MemoryContextMethodID ]"]
ARR --> A1["MCTX_ASET_ID -> AllocSet* fns"]
ARR --> A2["MCTX_GENERATION_ID -> Generation* fns"]
ARR --> A3["MCTX_SLAB_ID -> Slab* fns"]
ARR --> A4["MCTX_BUMP_ID -> Bump* fns"]
subgraph DERIVED["concrete contexts (first field IS MemoryContextData)"]
D1["AllocSetContext"]
D2["GenerationContext"]
D3["SlabContext"]
D4["BumpContext"]
end
D1 -. methods .-> A1
D2 -. methods .-> A2
D3 -. methods .-> A3
D4 -. methods .-> A4
Figure 1 — The context type hierarchy. MemoryContextData is an abstract
header; each concrete allocator embeds it as its first field and points its
methods at one row of the mcxt_methods[] vtable array, selected by a
MemoryContextMethodID. This is C++-style single inheritance done by hand.
The chunk header: how pfree finds the allocator from a pointer
The elegant trick that makes pfree(p) and repalloc(p) work without a context
argument: every chunk is immediately preceded, with no padding, by a uint64
whose low 4 bits are the owning context’s MemoryContextMethodID. Given any
chunk pointer, pfree reads those 4 bits, indexes mcxt_methods[], and calls
the right free_p:
```c
// GetMemoryChunkMethodID (src/backend/utils/mmgr/mcxt.c, condensed)
static inline MemoryContextMethodID
GetMemoryChunkMethodID(const void *pointer)
{
    uint64      header;

    /* a non-MAXALIGNED pointer can't be a real chunk */
    Assert(pointer == (const void *) MAXALIGN(pointer));

    header = *((const uint64 *) ((const char *) pointer - sizeof(uint64)));
    return (MemoryContextMethodID) (header & MEMORY_CONTEXT_METHODID_MASK);
}

#define MCXT_METHOD(pointer, method) \
    mcxt_methods[GetMemoryChunkMethodID(pointer)].method

void
pfree(void *pointer)
{
    MCXT_METHOD(pointer, free_p) (pointer);  /* dispatch to owner's free */
}
```

The 4-bit id is chosen so reserved values are diagnostic: 0000
(MCTX_0_RESERVED_UNUSEDMEM_ID) is what never-touched memory reads as, 1111
(MCTX_15_RESERVED_WIPEDMEM_ID) is what wipe_mem leaves behind, and 0001/
0010 match the patterns glibc’s malloc tends to leave — so handing pfree
a glibc pointer or freed memory lands on a BOGUS_MCTX entry and fails cleanly
instead of corrupting the heap.
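The dispatch-by-tag idea can be tried in isolation. The sketch below is a toy under stated assumptions, not the mcxt.c implementation: the `toy_*` names are hypothetical, the ids 3 and 4 are arbitrary stand-ins for two real allocators, and the "free" handlers only count calls so the routing is observable.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_METHODID_MASK 0x0F      /* low 4 bits of the header */

/* Hypothetical ids: 0 and 15 are reserved, 3 and 4 are "real" allocators. */
enum { TOY_UNUSED_ID = 0, TOY_A_ID = 3, TOY_B_ID = 4, TOY_WIPED_ID = 15 };

typedef struct ToyMethods
{
    void (*free_p) (void *pointer);
} ToyMethods;

int frees_by_a = 0, frees_by_b = 0, bogus_frees = 0;

void toy_a_free(void *p)     { (void) p; frees_by_a++; }
void toy_b_free(void *p)     { (void) p; frees_by_b++; }
void toy_bogus_free(void *p) { (void) p; bogus_frees++; }  /* bad pointer */

/* One row per id; ids without a real allocator get the bogus handler. */
ToyMethods toy_methods[16];

void
toy_methods_init(void)
{
    for (int i = 0; i < 16; i++)
        toy_methods[i].free_p = toy_bogus_free;
    toy_methods[TOY_A_ID].free_p = toy_a_free;
    toy_methods[TOY_B_ID].free_p = toy_b_free;
}

/* Allocate 'size' bytes preceded by a header whose low 4 bits are 'id'. */
void *
toy_alloc(int id, size_t size)
{
    uint64_t *hdr = malloc(sizeof(uint64_t) + size);

    if (hdr == NULL)
        return NULL;
    *hdr = (uint64_t) id & TOY_METHODID_MASK;
    return hdr + 1;
}

/* Recover the id from the 8 bytes before the pointer and dispatch on it. */
void
toy_free(void *pointer)
{
    uint64_t header = *((uint64_t *) pointer - 1);

    toy_methods[header & TOY_METHODID_MASK].free_p(pointer);
}

/* Return the underlying storage (header included) to malloc. */
void
toy_release(void *pointer)
{
    free((uint64_t *) pointer - 1);
}
```

A chunk tagged with a reserved id lands on the bogus handler instead of a real allocator's free routine, which is the "fail cleanly" behavior described above in miniature.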
The remaining 60 bits of that header are the allocator’s to use. All four
in-tree allocators share the MemoryChunk header type (in
memutils_memorychunk.h), which stores three fields there: a one-bit flag
marking an “external” (oversized, single-chunk-per-block) chunk, a 30-bit
chunk size (or freelist index), and a 30-bit chunk-to-block offset. Those
fields add up to 61 bits; as the header file's comments explain, they fit
because the size and offset fields share one bit, which is safe since the
offset between two MAXALIGNed addresses always has a zero low bit. The offset
lets the allocator recover the block header from the chunk, and the block
points back at the owning context, which is how GetMemoryChunkContext answers
“which context owns this pointer?”:
```c
// GetMemoryChunkContext (src/backend/utils/mmgr/mcxt.c)
MemoryContext
GetMemoryChunkContext(void *pointer)
{
    return MCXT_METHOD(pointer, get_chunk_context) (pointer);
}
```

palloc and the current context
palloc is MemoryContextAlloc specialized to CurrentMemoryContext. It is
deliberately a near-duplicate (not a wrapper) to shave a stack frame off the
hottest path in the backend:
```c
// palloc (src/backend/utils/mmgr/mcxt.c, condensed)
void *
palloc(Size size)
{
    void       *ret;
    MemoryContext context = CurrentMemoryContext;

    Assert(MemoryContextIsValid(context));
    AssertNotInCriticalSection(context);

    context->isReset = false;
    ret = context->methods->alloc(context, size, 0);
    Assert(ret != NULL);        /* OOM is handled inside alloc, via elog */
    VALGRIND_MEMPOOL_ALLOC(context, ret, size);
    return ret;
}
```

Two API contracts from the README are load-bearing for the rest of the engine:
- palloc never returns NULL. On OOM it elog(ERROR)s out (through MemoryContextAllocationFailure), so callers never null-check. The palloc_extended(..., MCXT_ALLOC_NO_OOM) variant opts out for the rare caller that wants to handle OOM itself.
- pfree/repalloc ignore CurrentMemoryContext. They route to the chunk’s owning context, so you can free a chunk that belongs to a different context than the one currently active. (repalloc cannot take NULL: it has no context to allocate from if the chunk does not exist yet.)
Switching the current context is a trivial save/restore, defined inline in the header so it inlines everywhere:
```c
// MemoryContextSwitchTo (src/include/utils/palloc.h)
static inline MemoryContext
MemoryContextSwitchTo(MemoryContext context)
{
    MemoryContext old = CurrentMemoryContext;

    CurrentMemoryContext = context;
    return old;
}
```

The README’s warning is worth repeating: CurrentMemoryContext should point at
a short-lived context during normal work (typically the per-tuple context),
so that an accidental un-freed palloc is reclaimed soon, not leaked for the
life of the process.
The context forest and the well-known roots
Contexts form a forest (in current practice a single tree rooted at
TopMemoryContext, the one context that is never reset or deleted). The
well-known globals (declared in mcxt.c, described in the README) are the
lifetime anchors every transient context hangs under:
flowchart TB
    TOP["TopMemoryContext<br/>(never reset; ~ malloc)"]
    TOP --> ERR["ErrorContext<br/>(reserved for error recovery)"]
    TOP --> CACHE["CacheMemoryContext<br/>(relcache/catcache; never reset)"]
    TOP --> MSG["MessageContext<br/>(current FE message; reset each cmd)"]
    TOP --> TTX["TopTransactionContext<br/>(reset at top-level xact end)"]
    TOP --> POSTM["PostmasterContext<br/>(freed in child after fork)"]
    MSG --> PLAN["parse/plan temp<br/>(child of MessageContext)"]
    TTX --> CUR["CurTransactionContext<br/>(= TopTransaction at top level;<br/>child per subtransaction)"]
    POR["PortalContext (per active portal)"] --> EXEC["ExecutorState<br/>(ExecutorStart..ExecutorEnd)"]
    EXEC --> PT1["ExprContext per-tuple<br/>(reset every tuple)"]
    EXEC --> PT2["ExprContext per-tuple<br/>(one per plan node)"]
Figure 2 — The well-known context roots and a typical transient subtree under
them. Each root maps to a lifetime: process (Top, Cache), per-command
(Message), per-transaction (TopTransaction/CurTransaction), per-portal
(Portal), per-query (ExecutorState), per-tuple (ExprContext). Cleanup is
“reset/delete the node matching the lifetime that just ended.”
Highlights from the README that explain why there are so many roots:
- ErrorContext is permanently kept with a few KB free at all times, so that out-of-memory can be reported as an ordinary ERROR (the error path itself needs memory). It is reset after each recovery.
- TopTransactionContext is not cleared immediately on error; its contents survive until the transaction block exits via COMMIT/ROLLBACK. CurTransactionContext equals it at top level but points at a per-subxact child inside a subtransaction; an aborting subxact throws its child away while a committed subxact’s child is retained until top-level commit.
- CacheMemoryContext never resets (like TopMemoryContext); the distinction exists mainly for debugging and so that subsidiary cache storage can live in shorter-lived children of it.
Create, reset, delete — the lifecycle
MemoryContextCreate initializes the abstract header and links the node into
its parent’s child list. It is required not to fail (Asserts only, no
elog), and refuses to run inside a critical section:
```c
// MemoryContextCreate (src/backend/utils/mmgr/mcxt.c, condensed)
void
MemoryContextCreate(MemoryContext node, NodeTag tag,
                    MemoryContextMethodID method_id,
                    MemoryContext parent, const char *name)
{
    Assert(CritSectionCount == 0);

    node->type = tag;
    node->isReset = true;
    node->methods = &mcxt_methods[method_id];
    node->parent = parent;
    node->firstchild = NULL;
    /* ... */
    if (parent)
    {
        node->nextchild = parent->firstchild;
        if (parent->firstchild != NULL)
            parent->firstchild->prevchild = node;
        parent->firstchild = node;
        node->allowInCritSection = parent->allowInCritSection;
    }
    /* ... */
}
```

Callers do not invoke this directly; they use a per-allocator creation helper
(AllocSetContextCreate(parent, name, ALLOCSET_DEFAULT_SIZES) is overwhelmingly
the common case), which mallocs the derived struct (plus an initial “keeper”
block) and then calls MemoryContextCreate on its embedded header.
Reset keeps the context object but frees its contents, and by convention deletes its children rather than resetting them:
```c
// MemoryContextReset / MemoryContextResetOnly (src/backend/utils/mmgr/mcxt.c)
void
MemoryContextReset(MemoryContext context)
{
    /* save a call in the common no-children case */
    if (context->firstchild != NULL)
        MemoryContextDeleteChildren(context);

    if (!context->isReset)
        MemoryContextResetOnly(context);
}

void
MemoryContextResetOnly(MemoryContext context)
{
    if (!context->isReset)
    {
        MemoryContextCallResetCallbacks(context);   /* release non-mem resources */
        context->methods->reset(context);           /* allocator-specific */
        context->isReset = true;
    }
}
```

The isReset flag is a fast-path guard: a context that has had no palloc
since creation or last reset skips the allocator’s reset entirely — important
because per-tuple contexts are reset on every tuple even when nothing was
allocated.
Delete tears down the context and its whole subtree. Two details are deliberate and worth internalizing:
```c
// MemoryContextDelete (src/backend/utils/mmgr/mcxt.c, condensed)
void
MemoryContextDelete(MemoryContext context)
{
    MemoryContext curr = context;

    for (;;)
    {
        MemoryContext parent;

        /* descend to a leaf */
        while (curr->firstchild != NULL)
            curr = curr->firstchild;

        parent = curr->parent;
        MemoryContextDeleteOnly(curr);  /* delink + delete one leaf */

        if (curr == context)
            break;
        curr = parent;
    }
}
```

- No recursion. Deletion is an explicit descend-to-leaf loop, not a recursive walk, precisely because deletion happens on the error/abort path: a “stack depth limit exceeded” error while already cleaning up from an abort would be catastrophic.
- Delink before free. MemoryContextDeleteOnly fires the reset callbacks, then calls MemoryContextSetParent(context, NULL) to unlink the node from the tree before freeing its storage, so that if an error interrupts the delete, the tree is never left pointing at a half-freed context: “better a leak than a crash.”
flowchart TD
A["MemoryContextDelete(ctx)"] --> B{"ctx has children?"}
B -- "yes" --> C["descend to first leaf"]
C --> B
B -- "no (leaf)" --> D["MemoryContextCallResetCallbacks"]
D --> E["MemoryContextSetParent(leaf, NULL)<br/>delink from tree first"]
E --> F["methods->delete_context<br/>(free all blocks + the struct)"]
F --> G{"leaf == original ctx?"}
G -- "no" --> H["climb to saved parent"]
H --> B
G -- "yes" --> I["done"]
Figure 3 — MemoryContextDelete as an iterative descend-to-leaf, delete,
climb loop. Reset callbacks fire, the node is delinked before its storage is
freed, and the loop avoids recursion so an over-deep tree cannot blow the stack
during abort cleanup.
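The descend, delete, climb loop can be exercised on a toy tree. The sketch below assumes a hypothetical `Node` type with only the three links the loop needs; `node_create`, `node_delete_only`, and `node_delete` are illustrative names, and `nodes_freed` exists only so the behavior can be checked.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical toy context node: just the tree links. */
typedef struct Node
{
    struct Node *parent;
    struct Node *firstchild;
    struct Node *nextchild;
} Node;

int nodes_freed = 0;

/* Create a node and push it onto the head of its parent's child list. */
Node *
node_create(Node *parent)
{
    Node *n = calloc(1, sizeof(Node));

    if (n == NULL)
        return NULL;
    n->parent = parent;
    if (parent != NULL)
    {
        n->nextchild = parent->firstchild;
        parent->firstchild = n;
    }
    return n;
}

/* Unlink one childless node from its parent first, then free it. */
void
node_delete_only(Node *n)
{
    assert(n->firstchild == NULL);
    if (n->parent != NULL)
    {
        Node **link = &n->parent->firstchild;

        while (*link != n)
            link = &(*link)->nextchild;
        *link = n->nextchild;
    }
    free(n);
    nodes_freed++;
}

/* Delete 'root' and its whole subtree with a loop, never recursion. */
void
node_delete(Node *root)
{
    Node *curr = root;

    for (;;)
    {
        Node *parent;
        int   done;

        while (curr->firstchild != NULL)    /* descend to a leaf */
            curr = curr->firstchild;
        parent = curr->parent;              /* remember where to climb */
        done = (curr == root);              /* test before curr is freed */
        node_delete_only(curr);
        if (done)
            break;
        curr = parent;
    }
}
```

Stack usage is constant regardless of tree depth, which is the property that matters on an abort path.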
Reset/delete callbacks — releasing non-memory resources
A context can carry a list of callbacks fired once, just before the next
reset or delete. This is how a context becomes a hook for releasing resources
that are associated with palloc’d objects but are not themselves memory — an
open file behind a tuplesort object, a reference count on a cache entry.
```c
// MemoryContextRegisterResetCallback (src/backend/utils/mmgr/mcxt.c)
void
MemoryContextRegisterResetCallback(MemoryContext context,
                                   MemoryContextCallback *cb)
{
    /* push onto head: newest-registered fires first */
    cb->next = context->reset_cbs;
    context->reset_cbs = cb;
    context->isReset = false;   /* ensure reset path runs the callbacks */
}
```

Callbacks fire in reverse order of registration, and a child’s callbacks fire
before its parent’s during a subtree teardown. The caller supplies the
MemoryContextCallback struct itself (typically allocated inside the target
context, so it is freed for free), which avoids an extra palloc. This is the
narrow seam where memory contexts touch resource management — the broader story
of pins, locks, and descriptors lives in postgres-resource-owners.md.
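The reverse-order firing follows directly from pushing onto the head of a singly linked list. A minimal sketch, with hypothetical `Ctx` and `Callback` types standing in for the real structs and a small log buffer added only to make the order observable:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback record; the caller owns its storage. */
typedef struct Callback
{
    void            (*func) (void *arg);
    void             *arg;
    struct Callback  *next;
} Callback;

typedef struct Ctx
{
    Callback *reset_cbs;    /* head of the callback list */
} Ctx;

/* Push onto the head, so the newest registration fires first. */
void
ctx_register_callback(Ctx *ctx, Callback *cb)
{
    cb->next = ctx->reset_cbs;
    ctx->reset_cbs = cb;
}

/* Fire each callback exactly once, unlinking it before the call. */
void
ctx_call_callbacks(Ctx *ctx)
{
    Callback *cb;

    while ((cb = ctx->reset_cbs) != NULL)
    {
        ctx->reset_cbs = cb->next;
        cb->func(cb->arg);
    }
}

/* Demo callback: append the tag character in 'arg' to a log. */
char   fired_log[8];
size_t fired_count = 0;

void
log_tag(void *arg)
{
    if (fired_count < sizeof(fired_log) - 1)
        fired_log[fired_count++] = *(const char *) arg;
}
```

Unlinking each entry before calling it means a second teardown finds an empty list, so no callback runs twice.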
Error handling unwinds by deleting contexts
The decisive payoff. When a backend hits elog(ERROR), control siglongjmps
back to the enclosing PG_TRY/sigsetjmp barrier (owned by
postgres-error-handling.md). Memory cleanup at that barrier is not a walk of
data structures — it is a handful of context resets and deletes. At the
outermost level, AbortTransaction → CleanupTransaction resets/deletes
TopTransactionContext (and thereby every per-statement, per-portal,
per-tuple child created during the transaction); the per-message loop resets
MessageContext at the top of each cycle. Because every transient allocation
was a descendant of one of those roots, a single delete reclaims it all, on
the error path exactly as on the success path — no leak-prone per-object
freeing, no missed branch.
sequenceDiagram
participant Q as backend op
participant E as elog(ERROR)
participant LJ as sigsetjmp barrier<br/>(error-handling doc)
participant MC as memory contexts
Q->>E: error raised
E->>LJ: siglongjmp to enclosing PG_TRY
LJ->>MC: AbortCurrentTransaction()
MC->>MC: MemoryContextDelete(TopTransactionContext)
Note over MC: deletes whole subtree:<br/>per-portal, per-query,<br/>per-tuple contexts
MC->>MC: callbacks fire (files closed, pins... see resowner doc)
LJ->>MC: MemoryContextReset(MessageContext) next loop
Note over MC: every transient chunk reclaimed<br/>by region teardown, not per-object free
Figure 4 — Error unwind as region teardown. The longjmp machinery is owned
by postgres-error-handling.md; this diagram shows only the memory leg —
deleting TopTransactionContext frees the entire transient subtree in one
operation, which is the whole reason the engine can treat OOM and arbitrary
mid-statement errors as ordinary recoverable conditions.
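The same shape can be reproduced with plain setjmp/longjmp. In this sketch, which is an illustration and not PostgreSQL's PG_TRY machinery, `TxnRegion`, `txn_alloc`, `txn_reset`, `do_work`, and `run_statement` are hypothetical names, and a global `jmp_buf` stands in for the error barrier.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdlib.h>

/* Toy region: a singly linked list of malloc'd blocks. */
typedef struct Block
{
    struct Block *next;
} Block;

typedef struct TxnRegion
{
    Block *blocks;
    int    live;        /* blocks currently allocated */
} TxnRegion;

jmp_buf error_barrier;  /* stand-in for the enclosing error barrier */

/* Allocate from the region; on OOM, jump to the barrier like an ERROR. */
void *
txn_alloc(TxnRegion *r, size_t size)
{
    Block *b = malloc(sizeof(Block) + size);

    if (b == NULL)
        longjmp(error_barrier, 1);
    b->next = r->blocks;
    r->blocks = b;
    r->live++;
    return b + 1;
}

/* Region teardown: free every block, whatever state the caller was in. */
void
txn_reset(TxnRegion *r)
{
    while (r->blocks != NULL)
    {
        Block *next = r->blocks->next;

        free(r->blocks);
        r->blocks = next;
        r->live--;
    }
}

/* Allocates several objects and may fail partway, with no local cleanup. */
void
do_work(TxnRegion *r, int fail)
{
    txn_alloc(r, 100);
    txn_alloc(r, 200);
    if (fail)
        longjmp(error_barrier, 1);
    txn_alloc(r, 300);
}

/* Returns 0 on success, 1 on error; the region is empty either way. */
int
run_statement(TxnRegion *r, int fail)
{
    if (setjmp(error_barrier) != 0)
    {
        txn_reset(r);   /* error path: one teardown call */
        return 1;
    }
    do_work(r, fail);
    txn_reset(r);       /* success path: the same teardown call */
    return 0;
}
```

`do_work` contains no cleanup code at all, yet neither exit path leaks, because every allocation was recorded in the region that `run_statement` tears down.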
Four allocators, four allocation patterns
All four concrete allocators sit behind the same external interface (palloc/pfree/reset/delete) but implement radically different internal policies. The README’s one-line summaries, expanded:
| Allocator | File | Best for | Per-chunk pfree? | Chunk header |
|---|---|---|---|---|
| AllocSet | aset.c | general purpose (the default) | yes, recycled via size-class freelists | full MemoryChunk |
| Slab | slab.c | many equally-sized chunks | yes, dense packing, returns empty blocks to OS | full MemoryChunk |
| Generation | generation.c | groups with similar lifespan / FIFO | yes, but space is not reused; block freed when empty | full MemoryChunk |
| Bump | bump.c | write-once, never individually freed | no (pfree/realloc unsupported) | none (in normal builds) |
AllocSet — the general-purpose default
aset.c is what AllocSetContextCreate builds and what nearly every context in
the system is. Its policy: round each request up to a power-of-2 size class
and keep a per-class freelist of recycled chunks; carve chunks out of
malloc’d blocks that double in size (8 KB initial → up to 8 MB) to amortize
malloc overhead; send requests larger than allocChunkLimit (~8 KB) to a
dedicated block returned whole on free.
```c
// AllocSetContext (src/backend/utils/mmgr/aset.c, condensed)
typedef struct AllocSetContext
{
    MemoryContextData header;   /* Standard memory-context fields */
    AllocBlock  blocks;         /* head of list of blocks in this set */
    MemoryChunk *freelist[ALLOCSET_NUM_FREELISTS];  /* free chunk lists */
    uint32      initBlockSize;  /* initial block size */
    uint32      maxBlockSize;   /* maximum block size */
    uint32      nextBlockSize;  /* next block size to allocate */
    uint32      allocChunkLimit;    /* effective chunk size limit */
    int         freeListIndex;  /* index in context_freelists[], or -1 */
} AllocSetContext;
```

With ALLOC_MINBITS = 3 and ALLOCSET_NUM_FREELISTS = 11, freelist k holds
chunks of size 1 << (k+3) — 8, 16, 32, … up to 8192 bytes. The allocation
fast path:
```c
// AllocSetAlloc (src/backend/utils/mmgr/aset.c, condensed)
void *
AllocSetAlloc(MemoryContext context, Size size, int flags)
{
    AllocSet    set = (AllocSet) context;
    AllocBlock  block;
    MemoryChunk *chunk;
    int         fidx;
    Size        chunk_size;

    if (size > set->allocChunkLimit)    /* oversized: dedicated block */
        return AllocSetAllocLarge(context, size, flags);

    fidx = AllocSetFreeIndex(size);     /* size class */
    chunk = set->freelist[fidx];
    if (chunk != NULL)                  /* reuse a recycled chunk */
    {
        set->freelist[fidx] = GetFreeListLink(chunk)->next;
        return MemoryChunkGetPointer(chunk);
    }

    /* else carve from the current block, or start a new one */
    chunk_size = GetChunkSizeFromFreeListIdx(fidx);
    block = set->blocks;
    if (unlikely((block->endptr - block->freeptr) <
                 (chunk_size + ALLOC_CHUNKHDRSZ)))
        return AllocSetAllocFromNewBlock(context, size, flags, fidx);

    return AllocSetAllocChunkFromBlock(context, block, size, chunk_size, fidx);
}
```

pfree pushes the chunk back onto freelist[fidx] (it is not returned to the
OS); a freed oversized chunk’s dedicated block is returned to malloc. Reset
keeps the first (“keeper”) block and frees the rest, so a repeatedly-reset
context does not thrash malloc. There is also a small context freelist
(context_freelists[], cap MAX_FREE_CONTEXTS = 100): freshly deleted
default-sized AllocSets are cached whole and handed back by the next create,
avoiding repeated malloc/free of the context struct itself.
flowchart LR
REQ["palloc(size) in an AllocSet"] --> Q1{"size > allocChunkLimit?"}
Q1 -- "yes" --> LRG["AllocSetAllocLarge<br/>dedicated block, external chunk"]
Q1 -- "no" --> FIDX["fidx = AllocSetFreeIndex(size)<br/>(power-of-2 class)"]
FIDX --> Q2{"freelist[fidx] non-empty?"}
Q2 -- "yes" --> POP["pop recycled chunk<br/>O(1)"]
Q2 -- "no" --> Q3{"room in current block?"}
Q3 -- "yes" --> CARVE["bump freeptr in block"]
Q3 -- "no" --> NEW["malloc new block<br/>(size doubles toward maxBlockSize)"]
Figure 5 — AllocSetAlloc decision flow. Small requests hit a power-of-2
freelist (recycled chunk, else carve from the current doubling block); requests
over allocChunkLimit (~8 KB) go to a dedicated block freed whole on pfree.
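The size-class arithmetic is easy to check by hand. The sketch below uses a simple loop and hypothetical `toy_*` names; it is not how AllocSetFreeIndex is written in aset.c, only an illustration of the mapping from a request size to freelist k with class size 1 << (k + 3).

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MINBITS       3     /* smallest class is 1 << 3 = 8 bytes */
#define TOY_NUM_FREELISTS 11    /* classes 8, 16, ..., 8192 */

/*
 * Return k such that (1 << (k + TOY_MINBITS)) is the smallest class that
 * holds 'size', or -1 if 'size' exceeds the largest class.
 */
int
toy_free_index(size_t size)
{
    int    k = 0;
    size_t class_size = (size_t) 1 << TOY_MINBITS;

    while (class_size < size && k < TOY_NUM_FREELISTS)
    {
        class_size <<= 1;
        k++;
    }
    return k < TOY_NUM_FREELISTS ? k : -1;
}

/* Chunk size served by freelist k. */
size_t
toy_class_size(int k)
{
    return (size_t) 1 << (k + TOY_MINBITS);
}
```

For example, a 100-byte request maps to k = 4 and is served from a 128-byte chunk, so 28 bytes are internal fragmentation; a request of 8193 bytes maps to no class and would take the oversized path.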
Slab — fixed-size chunks, fragmentation-resistant
slab.c is for streams of identically-sized objects (the chunk size is
fixed at context creation). Blocks are carved into exact-size chunks; the
context keeps blocks bucketed by free-chunk count in a blocklist[] array and
always serves new allocations from the fullest non-full block, which keeps
used chunks dense and lets emptied blocks return to the OS — directly attacking
the fragmentation that a general allocator suffers when long-lived and
short-lived same-size objects intermix.
```c
// SlabContext (src/backend/utils/mmgr/slab.c, condensed)
typedef struct SlabContext
{
    MemoryContextData header;
    uint32      chunkSize;      /* the requested (non-aligned) chunk size */
    uint32      fullChunkSize;  /* chunk size with header + alignment */
    uint32      blockSize;      /* size of each block of chunks */
    int32       chunksPerBlock;
    int32       curBlocklistIndex;  /* fullest blocks live here */
    /* ... */
    dlist_head  blocklist[SLAB_BLOCKLIST_COUNT];    /* blocks bucketed by nfree */
} SlabContext;
```

In-tree users: reorderbuffer.c (logical decoding allocates a uniform
ReorderBufferChange per WAL change — a textbook fixed-size stream).
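The core of fixed-size allocation, exact-size chunks recycled through a free list threaded through the free chunks themselves, can be sketched as follows. This toy covers a single block only and omits the block bucketing described above; `ToySlab` and the `toy_slab_*` functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical single-block slab. */
typedef struct ToySlab
{
    char   *block;      /* one malloc'd block carved into equal chunks */
    void   *freehead;   /* free list threaded through the free chunks */
    size_t  chunk_size; /* rounded up so a chunk can hold a pointer */
    int     nchunks;
    int     nfree;
} ToySlab;

/* Carve the block and thread every chunk onto the free list. */
int
toy_slab_init(ToySlab *s, size_t chunk_size, int nchunks)
{
    size_t align = sizeof(void *);

    if (chunk_size < align)
        chunk_size = align;
    chunk_size = (chunk_size + align - 1) / align * align;

    s->block = malloc(chunk_size * (size_t) nchunks);
    if (s->block == NULL)
        return -1;
    s->chunk_size = chunk_size;
    s->nchunks = nchunks;
    s->nfree = nchunks;
    s->freehead = NULL;

    /* Push from the top down so the lowest address is handed out first. */
    for (int i = nchunks - 1; i >= 0; i--)
    {
        void *chunk = s->block + (size_t) i * chunk_size;

        *(void **) chunk = s->freehead;
        s->freehead = chunk;
    }
    return 0;
}

/* Pop one chunk; NULL when the block is full. */
void *
toy_slab_alloc(ToySlab *s)
{
    void *chunk = s->freehead;

    if (chunk == NULL)
        return NULL;
    s->freehead = *(void **) chunk;
    s->nfree--;
    return chunk;
}

/* Push the chunk back; it is the next one handed out. */
void
toy_slab_free(ToySlab *s, void *chunk)
{
    *(void **) chunk = s->freehead;
    s->freehead = chunk;
    s->nfree++;
}

void
toy_slab_destroy(ToySlab *s)
{
    free(s->block);
    s->block = NULL;
    s->freehead = NULL;
}
```

Because every chunk is the same size, a freed chunk fits any later request exactly, so there is no size-class rounding and no external fragmentation inside the block.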
Generation — FIFO / similar-lifespan groups
generation.c suits objects allocated in generations that die together, or
roughly in FIFO order. It is a bump-style block allocator with one twist: it
tracks a free-count per block and returns a block to the OS once all its
chunks are pfree’d, but never reuses space within a block (no freelists). It
keeps a single freeblock around to recycle, to avoid malloc churn.
```c
// GenerationContext (src/backend/utils/mmgr/generation.c, condensed)
typedef struct GenerationContext
{
    MemoryContextData header;
    uint32      initBlockSize, maxBlockSize, nextBlockSize, allocChunkLimit;
    GenerationBlock *block;     /* current (most recently allocated) block */
    GenerationBlock *freeblock; /* one empty block kept for recycling */
    dlist_head  blocks;         /* list of blocks */
} GenerationContext;
```

In-tree users: reorderbuffer.c (tuple data, freed in commit order),
tuplestore.c, gistvacuum.c — all FIFO-ish producer/consumer patterns where
AllocSet’s freelists would just be wasted bookkeeping.
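The per-block free count can be sketched with one toy block. Assumptions to note: `GenBlock` and the `gen_*` functions are hypothetical, `gen_free` takes the block directly (a real allocator would find the block from the chunk header), and the single recycled freeblock mentioned above is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical generation-style block: bump allocation plus a free count. */
typedef struct GenBlock
{
    size_t used;        /* bump offset into data[] */
    size_t capacity;
    int    nchunks;     /* chunks handed out from this block */
    int    nfree;       /* chunks freed so far */
    char   data[];
} GenBlock;

int gen_blocks_released = 0;

GenBlock *
gen_block_create(size_t capacity)
{
    GenBlock *b = malloc(sizeof(GenBlock) + capacity);

    if (b == NULL)
        return NULL;
    b->used = 0;
    b->capacity = capacity;
    b->nchunks = 0;
    b->nfree = 0;
    return b;
}

/* Bump-allocate; space freed earlier is never handed out again. */
void *
gen_alloc(GenBlock *b, size_t size)
{
    size_t  aligned = (size + 7) & ~(size_t) 7;
    void   *p;

    if (aligned > b->capacity - b->used)
        return NULL;
    p = b->data + b->used;
    b->used += aligned;
    b->nchunks++;
    return p;
}

/*
 * Count one freed chunk.  Once every chunk handed out has been freed,
 * release the whole block.  Returns 1 if the block was released.
 */
int
gen_free(GenBlock *b)
{
    b->nfree++;
    if (b->nfree < b->nchunks)
        return 0;
    free(b);
    gen_blocks_released++;
    return 1;
}
```

Freeing is a counter increment, with no freelist to maintain; the cost is that a block with one surviving chunk stays fully allocated, which is why the policy suits FIFO-like lifetimes.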
Bump — header-less, write-once
bump.c (added in PG17) is the densest allocator: chunks have no header at
all in normal builds, so pfree, repalloc, GetMemoryChunkSpace, and
GetMemoryChunkContext are unsupported and raise an error if attempted. You
allocate a large number of small chunks, never free them individually, and
reclaim only by reset/delete of the whole context. Dropping the header fits more
chunks per block and per cache line.
```c
// BumpContext: src/backend/utils/mmgr/bump.c (condensed)
typedef struct BumpContext
{
    MemoryContextData header;
    uint32      initBlockSize, maxBlockSize, nextBlockSize, allocChunkLimit;
    dlist_head  blocks;             /* block being filled is at the head */
} BumpContext;
```

In normal builds `Bump_CHUNKHDRSZ` is 0; only under `MEMORY_CONTEXT_CHECKING`
does bump add a `MemoryChunk` header so the disallowed operations can be caught
with a clear error. In-tree users: `nodeAgg.c` (hash-aggregate group state),
`tuplesort.c`, `tidstore.c`, all “allocate, use, throw the whole arena away”
scratchpads.
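The header-less pattern is small enough to sketch in full. The following is a toy single-block arena, assuming a caller-provided buffer and 8-byte alignment relative to the buffer start; `ToyBump` and the `toy_bump_*` functions are hypothetical and do not mirror `bump.c`, which chains multiple blocks.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BUMP_ALIGN 8    /* assumed alignment for every chunk */

/* Hypothetical single-block arena over a caller-provided buffer. */
typedef struct ToyBump
{
    unsigned char *base;    /* start of the buffer */
    size_t      cap;        /* buffer size in bytes */
    size_t      used;       /* bytes handed out so far */
} ToyBump;

static void toy_bump_init(ToyBump *b, void *buf, size_t cap)
{
    assert(buf != NULL);
    b->base = buf;
    b->cap = cap;
    b->used = 0;
}

/* Allocate by advancing `used`; no per-chunk header is written. */
static void *toy_bump_alloc(ToyBump *b, size_t size)
{
    size_t avail = b->cap - b->used;
    size_t need;

    if (size > avail)
        return NULL;
    need = (size + TOY_BUMP_ALIGN - 1) & ~(size_t) (TOY_BUMP_ALIGN - 1);
    if (need > avail)
        return NULL;
    b->used += need;
    return b->base + (b->used - need);
}

/* The only way to reclaim: forget everything at once. */
static void toy_bump_reset(ToyBump *b)
{
    b->used = 0;
}
```

Nothing in a returned pointer records its size or its owning arena, which is why a per-chunk free or realloc cannot be supported in this design: there is simply no metadata to consult.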
Source Walkthrough
Anchor on symbol names, not line numbers. The PostgreSQL source moves; a function/struct/macro name is the stable handle. Use
`git grep -n '<symbol>' src/backend/utils/mmgr/` to locate the current position. Line numbers in the position-hint table were observed at commit `273fe94` (REL_18) and are quick hints only.
Abstract layer and dispatch (src/include/nodes/memnodes.h, src/backend/utils/mmgr/mcxt.c)
- `struct MemoryContextData` (in `memnodes.h`): the abstract context header (tree links, `methods` vtable pointer, `isReset`, `mem_allocated`).
- `struct MemoryContextMethods` (in `memnodes.h`): the allocator vtable.
- `enum MemoryContextMethodID` (in `memutils_internal.h`): 4-bit allocator tag; reserved values (`_RESERVED_GLIBC_`, `_RESERVED_UNUSEDMEM_`, `_RESERVED_WIPEDMEM_`) make bad pointers fail cleanly.
- `mcxt_methods[]` (in `mcxt.c`): the one-row-per-allocator vtable array.
- `GetMemoryChunkMethodID` / `MCXT_METHOD` (in `mcxt.c`): read the 4-bit id from the chunk header and dispatch.
- `MemoryContextCreate` (in `mcxt.c`): initialize header, link into parent.
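The tag-in-the-header idea can be shown in isolation. This is a minimal sketch, assuming a 64-bit header stored immediately before the chunk with the allocator id in its low 4 bits; the `toy_*` names and the header layout above the id bits are hypothetical. The reserved ids checked here are the bit patterns this document lists under "Verified facts" (0000, 0001, 0010, 1111).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_METHODID_BITS 4
#define TOY_METHODID_MASK ((uint64_t) ((1u << TOY_METHODID_BITS) - 1))

/* Pack an allocator id into the low 4 bits of a 64-bit chunk header. */
static uint64_t toy_make_header(uint64_t payload, unsigned method_id)
{
    assert(method_id <= TOY_METHODID_MASK);
    return (payload << TOY_METHODID_BITS) | method_id;
}

/* Recover the id from the 8 bytes immediately before the chunk pointer. */
static unsigned toy_method_id(const void *chunk)
{
    uint64_t hdr;

    memcpy(&hdr, (const char *) chunk - sizeof(hdr), sizeof(hdr));
    return (unsigned) (hdr & TOY_METHODID_MASK);
}

/* Ids treated as reserved: 0000, 0001, 0010 and 1111. */
static bool toy_id_is_reserved(unsigned id)
{
    return id <= 2 || id == 15;
}
```

A dispatcher built on this would index a method table with the recovered id. Reserving the all-zero and all-one patterns means a pointer into zeroed or wiped memory lands on an entry that reports an error rather than on a real allocator.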
Lifecycle (src/backend/utils/mmgr/mcxt.c)
- `MemoryContextReset` / `MemoryContextResetOnly` / `MemoryContextResetChildren`: free contents, delete (or reset) children; `isReset` fast path.
- `MemoryContextDelete` / `MemoryContextDeleteOnly` / `MemoryContextDeleteChildren`: iterative descend-to-leaf teardown, delink before free.
- `MemoryContextSetParent`: reparent (used to change a context’s lifespan after fill).
- `MemoryContextRegisterResetCallback` / `MemoryContextCallResetCallbacks`: the non-memory-resource hook.
- `MemoryContextAllocationFailure` / `MemoryContextSizeFailure`: the OOM / bad-size `elog(ERROR)` paths.
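The "iterative descend-to-leaf teardown, delink before free" shape is easier to see on a toy tree. The sketch below is an illustration under assumptions, not the `mcxt.c` code: `ToyCtx` and the `toy_ctx_*` functions are hypothetical, contexts carry only an integer id, and the root passed to the delete function is assumed to have no parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct ToyCtx
{
    struct ToyCtx *parent;
    struct ToyCtx *firstchild;
    struct ToyCtx *nextchild;
    int         id;
} ToyCtx;

/* Create a context and push it onto the head of parent's child list. */
static ToyCtx *toy_ctx_create(ToyCtx *parent, int id)
{
    ToyCtx *c = calloc(1, sizeof(*c));

    if (c == NULL)
        abort();
    c->id = id;
    c->parent = parent;
    if (parent != NULL)
    {
        c->nextchild = parent->firstchild;
        parent->firstchild = c;
    }
    return c;
}

/*
 * Delete root and all descendants without recursion. Ids are written to
 * order[] in teardown order; the return value is the number deleted.
 * Simplification: root must have no parent.
 */
static int toy_ctx_delete(ToyCtx *root, int *order)
{
    ToyCtx *cur = root;
    int n = 0;

    assert(root->parent == NULL);
    for (;;)
    {
        ToyCtx *parent;
        bool done;

        while (cur->firstchild != NULL)     /* descend to a leaf */
            cur = cur->firstchild;
        parent = cur->parent;
        done = (cur == root);
        if (parent != NULL)                 /* delink before freeing */
            parent->firstchild = cur->nextchild;
        order[n++] = cur->id;
        free(cur);
        if (done)
            break;
        cur = parent;                       /* climb one level and repeat */
    }
    return n;
}
```

Stack use is constant no matter how deep the tree is, and every node is unreachable from its parent before it is freed, so an interruption mid-teardown leaves a smaller but still well-formed tree.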
Allocation API (src/backend/utils/mmgr/mcxt.c, src/include/utils/palloc.h)
- `palloc` / `palloc0` / `palloc_extended` / `palloc_aligned`: allocate from `CurrentMemoryContext`.
- `MemoryContextAlloc` / `MemoryContextAllocZero` / `MemoryContextAllocExtended`: allocate from a named context.
- `pfree` / `repalloc` / `repalloc0` / `repalloc_extended`: context recovered from the chunk header; ignore `CurrentMemoryContext`.
- `GetMemoryChunkContext` / `GetMemoryChunkSpace`: pointer → context / size.
- `MemoryContextSwitchTo` (inline, in `palloc.h`): save/restore the current context.
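The save/restore discipline around the current context is the part of this API that callers most often get wrong, so a stripped-down model may help. This is a sketch with hypothetical names (`ToyContext`, `toy_current`, `toy_switch_to`, `toy_work_in`); it models only the pointer swap, not allocation.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyContext
{
    const char *name;
} ToyContext;

/* Stand-in for the global "current context" pointer. */
static ToyContext *toy_current = NULL;

/* Install ctx as current and hand back the previous one for later restore. */
static ToyContext *toy_switch_to(ToyContext *ctx)
{
    ToyContext *old = toy_current;

    toy_current = ctx;
    return old;
}

/* Run some work "in" ctx and report which context it would allocate from. */
static const char *toy_work_in(ToyContext *ctx)
{
    ToyContext *old;
    const char *seen;

    assert(ctx != NULL);
    old = toy_switch_to(ctx);
    seen = toy_current->name;   /* an allocation here would target ctx */
    toy_switch_to(old);         /* always restore the caller's context */
    return seen;
}
```

The returned old pointer is what makes nesting safe: each scope restores exactly what it found, so a helper can switch contexts without knowing what its caller had installed.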
Allocators
- `AllocSetContext`, `AllocSetAlloc`, `AllocSetFree`, `AllocSetReset`, `AllocSetDelete`, `AllocSetFreeIndex` (in `aset.c`): general-purpose; power-of-2 freelists, doubling blocks, context freelist.
- `SlabContext`, `SlabAlloc`, `SlabReset`, `SlabContextCreate` (in `slab.c`): fixed-size, fullest-block-first.
- `GenerationContext`, `GenerationAlloc`, `GenerationFree`, `GenerationContextCreate` (in `generation.c`): FIFO/lifespan, no reuse.
- `BumpContext`, `BumpAlloc`, `BumpContextCreate` (in `bump.c`): header-less, no pfree.
- `AllocSetContextCreate` macro + `ALLOCSET_DEFAULT_SIZES` / `ALLOCSET_SMALL_SIZES` (in `memutils.h`): the common creation entry points.
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| `struct MemoryContextData` | `src/include/nodes/memnodes.h` | 117 |
| `struct MemoryContextMethods` | `src/include/nodes/memnodes.h` | 58 |
| `MemoryContextIsValid` | `src/include/nodes/memnodes.h` | 145 |
| `enum MemoryContextMethodID` | `src/include/utils/memutils_internal.h` | 121 |
| `mcxt_methods[]` | `src/backend/utils/mmgr/mcxt.c` | 47 |
| `MCXT_METHOD` (macro) | `src/backend/utils/mmgr/mcxt.c` | 188 |
| `GetMemoryChunkMethodID` | `src/backend/utils/mmgr/mcxt.c` | 196 |
| `MemoryContextReset` | `src/backend/utils/mmgr/mcxt.c` | 389 |
| `MemoryContextResetOnly` | `src/backend/utils/mmgr/mcxt.c` | 408 |
| `MemoryContextDelete` | `src/backend/utils/mmgr/mcxt.c` | 460 |
| `MemoryContextDeleteOnly` | `src/backend/utils/mmgr/mcxt.c` | 502 |
| `MemoryContextRegisterResetCallback` | `src/backend/utils/mmgr/mcxt.c` | 574 |
| `MemoryContextCallResetCallbacks` | `src/backend/utils/mmgr/mcxt.c` | 591 |
| `MemoryContextSetParent` | `src/backend/utils/mmgr/mcxt.c` | 643 |
| `GetMemoryChunkContext` | `src/backend/utils/mmgr/mcxt.c` | 713 |
| `MemoryContextCreate` | `src/backend/utils/mmgr/mcxt.c` | 1106 |
| `MemoryContextAlloc` | `src/backend/utils/mmgr/mcxt.c` | 1191 |
| `MemoryContextAllocExtended` | `src/backend/utils/mmgr/mcxt.c` | 1248 |
| `palloc` | `src/backend/utils/mmgr/mcxt.c` | 1346 |
| `palloc0` | `src/backend/utils/mmgr/mcxt.c` | 1376 |
| `pfree` | `src/backend/utils/mmgr/mcxt.c` | 1553 |
| `repalloc` | `src/backend/utils/mmgr/mcxt.c` | 1573 |
| `MemoryContextSwitchTo` (inline) | `src/include/utils/palloc.h` | 138 |
| `struct AllocSetContext` | `src/backend/utils/mmgr/aset.c` | 152 |
| `AllocSetContextCreateInternal` | `src/backend/utils/mmgr/aset.c` | 347 |
| `AllocSetReset` | `src/backend/utils/mmgr/aset.c` | 537 |
| `AllocSetDelete` | `src/backend/utils/mmgr/aset.c` | 607 |
| `AllocSetAlloc` | `src/backend/utils/mmgr/aset.c` | 967 |
| `AllocSetFree` | `src/backend/utils/mmgr/aset.c` | 1062 |
| `struct GenerationContext` | `src/backend/utils/mmgr/generation.c` | 59 |
| `GenerationContextCreate` | `src/backend/utils/mmgr/generation.c` | 160 |
| `GenerationAlloc` | `src/backend/utils/mmgr/generation.c` | 527 |
| `struct SlabContext` | `src/backend/utils/mmgr/slab.c` | 103 |
| `SlabContextCreate` | `src/backend/utils/mmgr/slab.c` | 322 |
| `SlabAlloc` | `src/backend/utils/mmgr/slab.c` | 631 |
| `struct BumpContext` | `src/backend/utils/mmgr/bump.c` | 66 |
| `BumpContextCreate` | `src/backend/utils/mmgr/bump.c` | 131 |
| `BumpAlloc` | `src/backend/utils/mmgr/bump.c` | 491 |
| `ALLOCSET_DEFAULT_SIZES` (macro) | `src/include/utils/memutils.h` | 160 |
Source verification (as of 2026-06-05)
Each entry leads with a fact about the current source at commit
`273fe94` (REL_18), readable without any other materials. The trailing note records how it was checked. Open questions follow.
Verified facts
- The allocator type is encoded in the low 4 bits of a `uint64` header immediately preceding every chunk, with no padding. Verified in `GetMemoryChunkMethodID` and the `MCXT_METHOD` macro (`mcxt.c`). The mask is `MEMORY_CONTEXT_METHODID_MASK`; reserved id values (`0000`, `0001`, `0010`, `1111`) correspond to unused/glibc/wiped memory so bad pointers fail cleanly rather than dispatching into garbage.
- There are exactly four concrete allocators on REL_18: AllocSet, Slab, Generation, Bump. Verified against `MemoryContextIsValid` (`memnodes.h`, which `IsA`-checks those four node tags) and the `mcxt_methods[]` array (`mcxt.c`). A fifth method id, `MCTX_ALIGNED_REDIRECT_ID`, exists but is not a standalone context type; it is the redirect used by `palloc_aligned` (`alignedalloc.c`), with only `free_p` / `realloc` / `get_chunk_*` populated.
- Bump chunks have no header in normal builds (`Bump_CHUNKHDRSZ == 0`). Verified in `bump.c`. `pfree` / `repalloc` / `GetMemoryChunkSpace` / `GetMemoryChunkContext` on a bump chunk are unsupported; under `MEMORY_CONTEXT_CHECKING` a `MemoryChunk` header is added so those calls raise a clear ERROR instead of corrupting memory.
- `palloc` never returns NULL; OOM exits via `elog(ERROR)` inside the allocator’s `alloc` method. Verified in `palloc` (which `Assert(ret != NULL)`) and `MemoryContextAllocationFailure` (`mcxt.c`). The only NULL-capable path is `MCXT_ALLOC_NO_OOM` via `palloc_extended` / `MemoryContextAllocExtended`.
- `MemoryContextDelete` is iterative, not recursive, and delinks each node before freeing it. Verified in `MemoryContextDelete` (descend-to-leaf loop with an explicit comment that recursion would risk “stack depth limit exceeded” during abort cleanup) and `MemoryContextDeleteOnly` (`MemoryContextSetParent(context, NULL)` before `delete_context`). The README and the in-code comment agree: “better a leak than a crash.”
- AllocSet small allocations round up to one of 11 power-of-2 size classes (8 B … 8 KB); larger requests go to dedicated blocks. Verified from `ALLOC_MINBITS = 3`, `ALLOCSET_NUM_FREELISTS = 11`, `ALLOC_CHUNK_LIMIT`, and `AllocSetAlloc` / `AllocSetFreeIndex` (`aset.c`). `ALLOCSET_DEFAULT_INITSIZE` is 8 KB and `ALLOCSET_DEFAULT_MAXSIZE` is 8 MB (`memutils.h`); blocks double between those bounds.
- Reset/delete callbacks fire newest-first, and child callbacks fire before parent callbacks during a subtree teardown. Verified in `MemoryContextRegisterResetCallback` (pushes onto list head) and `MemoryContextCallResetCallbacks` (pops from head), consistent with the README’s “reverse order of registration” statement. The callback struct is caller-provided, typically allocated inside the target context.
- `TopTransactionContext` is not cleared on error; it survives until the transaction block exits. Stated in the README; the actual abort-time reset/delete is driven from the transaction-management layer, not from `mcxt.c`. Confirming the precise call site is out of this doc’s scope (see `postgres-error-handling.md` / xact).
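The AllocSet size-class fact above reduces to a small piece of arithmetic. The sketch below works it through with the two constants quoted there (3 minimum bits, 11 freelists); the `toy_*` names are hypothetical and the linear loop is a stand-in chosen for readability, not a reproduction of `AllocSetFreeIndex`.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MINBITS         3       /* smallest class is 1 << 3 = 8 bytes */
#define TOY_NUM_FREELISTS   11      /* classes 8 B, 16 B, ..., 8 KB */
#define TOY_CHUNK_LIMIT     ((size_t) 1 << (TOY_NUM_FREELISTS - 1 + TOY_MINBITS))

/* Index of the smallest power-of-2 class that can hold `size` bytes. */
static int toy_free_index(size_t size)
{
    size_t cls = (size_t) 1 << TOY_MINBITS;
    int idx = 0;

    assert(size <= TOY_CHUNK_LIMIT);
    while (cls < size)
    {
        cls <<= 1;
        idx++;
    }
    return idx;
}

/* Bytes actually reserved for a request that lands in class idx. */
static size_t toy_class_size(int idx)
{
    assert(idx >= 0 && idx < TOY_NUM_FREELISTS);
    return (size_t) 1 << (idx + TOY_MINBITS);
}
```

For example, a 100-byte request lands in class 4 and reserves 128 bytes, so 28 bytes are internal fragmentation; the largest class, index 10, is exactly 8192 bytes. Requests above that limit are the ones the text says go to dedicated blocks.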
Open questions
- Bump allocator adoption. Bump landed in PG17 and REL_18 uses it in `nodeAgg.c`, `tuplesort.c`, and `tidstore.c`. Which other historically AllocSet- or Generation-backed scratchpads are candidates to convert, and what is the measured density/CPU win? Investigation path: diff the `BumpContextCreate` call sites across REL_16→REL_18 and look for `aset` / `generation` contexts whose objects are demonstrably never pfree’d.
- `context_freelists` cap of 100. `aset.c` caches up to `MAX_FREE_CONTEXTS = 100` whole deleted default-sized contexts for reuse, and “deletes all existing entries when the list overflows.” Under what workload (many short-lived per-relation contexts?) does this overflow matter, and is 100 ever a bottleneck? Investigation path: instrument `context_freelists[].num_free` under a high-DDL or high-plan-churn workload.
- Memory accounting cost. `mem_allocated` is updated lazily per-block, so `MemoryContextMemAllocated` / stats must walk the subtree recursively (README “Memory Accounting”). For a backend with a very wide/deep context forest (thousands of relcache child contexts), how expensive is a `pg_log_backend_memory_contexts()` dump? Investigation path: time `ProcessLogMemoryContextInterrupt` under a large relcache.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Pointers, not analysis. Each bullet is a starting handle for a follow-up doc.
- CUBRID’s private heap / per-thread allocators. CUBRID does not use a context forest; it mixes a private-heap (`db_private_alloc`) discipline with explicit frees and per-thread scratch areas. A side-by-side would show what PostgreSQL buys with the tree (O(1) error-path reclamation) versus what it costs (per-chunk header overhead, no compaction). Cross-ref the CUBRID heap / memory analyses when written.
- Region inference in languages (Tofte & Talpin, “Region-Based Memory Management”, 1997). PostgreSQL assigns regions by hand (which context to switch into); the research line infers region lifetimes statically. The comparison frames PostgreSQL’s manual `MemoryContextSwitchTo` discipline as the un-inferred baseline and points at where bugs (wrong-context allocation) come from.
- Slab allocation lineage (Bonwick, “The Slab Allocator”, USENIX 1994). PostgreSQL’s `slab.c` is the same idea (object caches of fixed-size items to fight fragmentation) as the Solaris kernel allocator; the fullest-block-first refinement is PostgreSQL’s. A note tracing the lineage would anchor the slab design choices in their original rationale.
- Arena allocators in other systems (jemalloc/tcmalloc arenas, Rust’s `bumpalo`, V8 zones). Bump’s header-less write-once arena is the same pattern as compiler/VM “zones.” Worth a comparison of when a header-less arena beats a freelist allocator, with the density/cache numbers PG17’s bump commit cited.
- Memory accounting and per-query memory limits. PostgreSQL’s lazy per-block `mem_allocated` accounting (and the lack, on REL_18, of a hard per-backend memory cap) is a recurring proposal area. Tie this to the `work_mem` spill mechanisms in `postgres-tuplesort.md` / hash-agg and to the ongoing community discussion of a backend memory limit.
Sources
In-tree README
- `src/backend/utils/mmgr/README`: “Memory Context System Design Overview”, covering the design rationale (current context, parent/child tree, globally-known contexts, reset callbacks, the four allocators, memory accounting). The primary design document for this subsystem.
Source files (under /data/hgryoo/references/postgres/, REL_18 273fe94)
- `src/backend/utils/mmgr/mcxt.c`: abstract layer, dispatch, lifecycle, palloc/pfree.
- `src/backend/utils/mmgr/aset.c`: AllocSet (default allocator).
- `src/backend/utils/mmgr/generation.c`: Generation allocator.
- `src/backend/utils/mmgr/slab.c`: Slab allocator.
- `src/backend/utils/mmgr/bump.c`: Bump allocator.
- `src/include/nodes/memnodes.h`: `MemoryContextData`, `MemoryContextMethods`.
- `src/include/utils/palloc.h`: palloc/pfree API, `MemoryContextSwitchTo`.
- `src/include/utils/memutils.h`: creation macros, `ALLOCSET_*_SIZES`.
- `src/include/utils/memutils_internal.h`: `MemoryContextMethodID`.
Textbook / theory
- Database Internals (Petrov), Part I implementation chapters: memory vs. disk-based stores, allocation/fragmentation trade-offs; the region/arena framing in §“Theoretical Background”.
- Region-based memory management background: Tofte & Talpin 1997 (regions), Bonwick 1994 (slab) — see §“Beyond PostgreSQL” for the specific anchors.
Cross-references (sibling docs, mechanism not duplicated here)
- `postgres-error-handling.md`: the `PG_TRY` / `sigsetjmp` / `elog` machinery that triggers context deletion on the abort path.
- `postgres-resource-owners.md`: release of non-memory resources (buffer pins, locks, file descriptors); memory contexts touch this only via reset callbacks.
- `postgres-architecture-overview.md`: Axis 11 base-infra positioning; “a backend is mostly a thread of control with a private memory-context tree.”