PostgreSQL Memory Contexts — Hierarchical Region Allocation and longjmp-Safe Cleanup

Every long-running server faces the same memory-management problem: a request allocates dozens or hundreds of small objects with a shared lifetime, and if even one of them is not freed on every exit path — including the error exit path — the process leaks. Tracking each malloc() with a matching free() is both slow (per-chunk bookkeeping) and fragile (one missed branch leaks forever). The classic answer is region-based memory management, also called arena or zone allocation: allocate many objects from a single region, then free the entire region in one operation rather than freeing each object.

Database Internals (Petrov, ch. “Implementation Details”) frames this in the context of a database’s buffer and working memory: a general-purpose allocator must trade off fragmentation (internal — rounding each request up to a size class; and external — free holes too small to reuse), allocation speed, and reclamation cost. A region allocator collapses reclamation to O(1): bump a pointer to allocate, discard the whole arena to free. The price is that you give up fine-grained free() of individual objects — you reclaim at region granularity, not object granularity. For a workload whose objects naturally share a lifetime (everything a query touches dies when the query ends), that is exactly the right trade.
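
The bump-and-discard idea can be made concrete with a few lines of C. This is a minimal sketch, not PostgreSQL code: the `Arena` type and the `arena_*` names are hypothetical, the arena is a single fixed-size buffer that never grows, and allocations are rounded up to 8 bytes as an assumed alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy arena: one malloc'd buffer plus a bump offset. */
typedef struct Arena
{
    unsigned char *base;        /* start of the buffer */
    size_t      used;           /* bytes handed out so far */
    size_t      capacity;       /* total bytes in the buffer */
} Arena;

/* Returns 1 on success, 0 if the backing malloc fails. */
static int
arena_init(Arena *a, size_t capacity)
{
    a->base = malloc(capacity);
    a->used = 0;
    a->capacity = capacity;
    return a->base != NULL;
}

/* Allocate by bumping the offset; sizes are rounded up to 8 bytes. */
static void *
arena_alloc(Arena *a, size_t size)
{
    size_t      rounded = (size + 7) & ~(size_t) 7;

    assert(a->base != NULL);
    if (rounded < size || rounded > a->capacity - a->used)
        return NULL;            /* toy policy: no growth, just fail */
    a->used += rounded;
    return a->base + (a->used - rounded);
}

/* Reclaim every object at once: O(1) regardless of object count. */
static void
arena_reset(Arena *a)
{
    a->used = 0;
}

/* Give the buffer itself back to malloc. */
static void
arena_destroy(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->used = a->capacity = 0;
}
```

Note what is missing: there is no way to free one object. That is the trade the paragraph above describes, and the reason a reset costs the same whether the arena holds one object or a million.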

Two design choices follow from the region model and shape every implementation:

  1. How to scope a region to a lifetime. A flat “one big arena per process” leaks anything whose lifetime is shorter than the process. The standard refinement is to give regions a hierarchy: a region can have child regions, and discarding a parent discards all descendants. This lets nested lifetimes (per-transaction ⊃ per-statement ⊃ per-tuple) map onto a tree of regions, so the right granularity of “free everything” is always one node away.

  2. How to recover the region from a bare pointer. free(p) and realloc(p) are handed only the object pointer, not its region. A region allocator must therefore stamp each chunk with enough metadata — a header — to recover which region (and which kind of allocator) owns it. The header’s width is itself a fragmentation cost, so allocators that never need per-object free() can drop it entirely.

PostgreSQL’s memory context subsystem is a region allocator with both refinements: regions form a tree (the context forest), and chunks carry a compact header identifying the owning allocator. The rest of this document traces how the tree, the chunk header, and four interchangeable region implementations are realized in src/backend/utils/mmgr/.

The textbook gives the region model; this section names the engineering conventions a server-grade region allocator adopts to make it usable across an entire codebase. PostgreSQL’s specifics in the next section are best read as one set of choices within this shared space.

Threading an explicit region argument through every function that might allocate is intolerable — copyObject(), expression evaluation, and a thousand helpers would each grow a parameter. Real systems keep a thread-of-control’s “current region” in a global (or thread-local), and the bare allocation call (palloc, here) draws from it. Switching regions is a cheap save/restore of that global, and the discipline is “switch in, allocate, switch back.”
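
A small sketch of the "switch in, allocate, switch back" discipline follows. All names here (`Region`, `current_region`, `region_switch_to`, `region_alloc`, `make_scratch`) are hypothetical stand-ins, and the region is a fixed 256-byte buffer purely to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy region: a fixed buffer with a bump offset. */
typedef struct Region
{
    const char *name;
    size_t      used;
    unsigned char buf[256];
} Region;

/* The implicit "current region"; a threaded server might make this thread-local. */
static Region *current_region = NULL;

/* Install a new current region and return the previous one. */
static Region *
region_switch_to(Region *r)
{
    Region     *old = current_region;

    current_region = r;
    return old;
}

/* Bare allocation call: no region argument, draws from current_region. */
static void *
region_alloc(size_t size)
{
    Region     *r = current_region;
    size_t      rounded = (size + 7) & ~(size_t) 7;

    assert(r != NULL);
    if (rounded > sizeof(r->buf) - r->used)
        return NULL;
    r->used += rounded;
    return r->buf + (r->used - rounded);
}

/* A helper deep in the call tree allocates without knowing about regions. */
static void *
make_scratch(void)
{
    return region_alloc(24);
}
```

A caller that wants `make_scratch()` to land in a short-lived region saves the result of `region_switch_to(&short_lived)`, calls the helper, and then passes the saved pointer back to `region_switch_to()`. The helper's signature never changes.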

The hierarchy from the theory section becomes a concrete pattern: a small set of well-known long-lived roots (per-process, per-transaction, per-query, per-tuple) with transient children hung underneath whichever root matches their lifetime. Cleanup is then “reset the per-statement region between statements,” “delete the per-query region at ExecutorEnd,” and so on. Deleting a node deletes its subtree, so a forgotten child is reclaimed with its parent rather than leaked.

Because free(p) gets only the pointer, every region allocator stamps a header immediately before the returned address. Minimally the header encodes the owning region; usefully it also encodes the allocator type (so a single free() entry point can dispatch to the right implementation) and the chunk size (so the freed space can be recycled or accounted). Header width trades directly against density — a header on millions of tiny chunks is pure overhead — so high-performance allocators shrink or eliminate it.
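
The header convention can be sketched as follows. This is an illustrative toy, not PostgreSQL's layout: the `Owner` and `ChunkHeader` types and the function names are hypothetical, each chunk is its own `malloc` call, and the header is deliberately wide (a full pointer plus a size) to make the density cost visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical owner record; a real allocator would also track blocks. */
typedef struct Owner
{
    const char *name;
    size_t      live_chunks;
    size_t      live_bytes;
} Owner;

/* Header stored immediately before the address returned to the caller. */
typedef union ChunkHeader
{
    struct
    {
        Owner      *owner;      /* which region owns this chunk */
        size_t      size;       /* requested size, for accounting */
    }           h;
    max_align_t align;          /* keeps the payload suitably aligned */
} ChunkHeader;

static void *
owner_alloc(Owner *o, size_t size)
{
    ChunkHeader *hdr = malloc(sizeof(ChunkHeader) + size);

    if (hdr == NULL)
        return NULL;
    hdr->h.owner = o;
    hdr->h.size = size;
    o->live_chunks++;
    o->live_bytes += size;
    return hdr + 1;             /* payload starts right after the header */
}

/* Step back over the header to find out who owns the chunk. */
static Owner *
chunk_owner(void *p)
{
    return ((ChunkHeader *) p - 1)->h.owner;
}

/* free-style entry point: it is handed only the pointer. */
static void
chunk_free(void *p)
{
    ChunkHeader *hdr;

    assert(p != NULL);
    hdr = (ChunkHeader *) p - 1;
    hdr->h.owner->live_chunks--;
    hdr->h.owner->live_bytes -= hdr->h.size;
    free(hdr);
}
```

On a typical 64-bit platform this header costs at least 16 bytes per chunk, which is exactly the overhead that motivates packing the same information into fewer bits.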

No single allocation policy is best for all patterns. A general workload wants size-class freelists; a stream of identically-sized objects wants a slab; a FIFO producer/consumer wants generational blocks; a write-once scratchpad wants a header-less bump pointer. The convention is a vtable of allocator methods (alloc, free, realloc, reset, delete) so callers pick a policy at region-creation time and the rest of the code is policy-agnostic.
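
As a rough illustration of the vtable convention, the sketch below defines two toy policies behind one method table. The `Pool` and `PoolMethods` types and both policies are hypothetical and much simpler than any real allocator; only `alloc` and `reset` are modeled, and alignment is ignored.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Pool Pool;

/* Hypothetical method table; one instance per policy. */
typedef struct PoolMethods
{
    void       *(*alloc) (Pool *pool, size_t size);
    void        (*reset) (Pool *pool);
} PoolMethods;

struct Pool
{
    const PoolMethods *methods; /* chosen once, at creation time */
    unsigned char buf[128];
    size_t      used;           /* bytes handed out so far */
    size_t      slot_size;      /* fixed policy only: size of every slot */
};

/* Policy 1: bump pointer, any size up to the remaining space. */
static void *
bump_alloc(Pool *pool, size_t size)
{
    if (size > sizeof(pool->buf) - pool->used)
        return NULL;
    pool->used += size;
    return pool->buf + (pool->used - size);
}

/* Policy 2: fixed-size slots; requests larger than a slot are refused. */
static void *
fixed_alloc(Pool *pool, size_t size)
{
    if (size > pool->slot_size)
        return NULL;
    return bump_alloc(pool, pool->slot_size);
}

static void
common_reset(Pool *pool)
{
    pool->used = 0;
}

static const PoolMethods bump_methods = {bump_alloc, common_reset};
static const PoolMethods fixed_methods = {fixed_alloc, common_reset};

/* Policy-agnostic entry points: callers never name a concrete policy. */
static void *
pool_alloc(Pool *pool, size_t size)
{
    assert(pool != NULL && pool->methods != NULL);
    return pool->methods->alloc(pool, size);
}

static void
pool_reset(Pool *pool)
{
    assert(pool != NULL && pool->methods != NULL);
    pool->methods->reset(pool);
}
```

Code that calls `pool_alloc()` and `pool_reset()` behaves the same whichever table the pool was created with, which is the property the convention is after.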

The decisive payoff for a database: when a statement aborts, the cleanup code does not walk data structures freeing objects — it deletes the regions the statement allocated into. This is why region allocation and a server’s error-handling discipline are co-designed; the exception unwind path’s entire memory job is “delete these contexts.”

| Theory / convention | PostgreSQL name |
| --- | --- |
| Region / arena | MemoryContext (MemoryContextData header + method vtable) |
| Region tree (parent/child) | parent / firstchild / nextchild links in MemoryContextData |
| Current region (implicit) | CurrentMemoryContext global; MemoryContextSwitchTo() |
| Bare allocate from current region | palloc() / palloc0() |
| Free a chunk via its owning region | pfree(), dispatching through MCXT_METHOD(p, free_p) |
| Discard whole region | MemoryContextReset() / MemoryContextDelete() |
| Chunk header = allocator type tag | low 4 bits of the uint64 MemoryChunk header = MemoryContextMethodID |
| Allocator vtable | MemoryContextMethods; mcxt_methods[] array indexed by method id |
| General-purpose policy | AllocSetContext (aset.c) |
| Fixed-size policy | SlabContext (slab.c) |
| FIFO / lifespan-group policy | GenerationContext (generation.c) |
| Header-less write-once policy | BumpContext (bump.c) |
| Error cleanup by teardown | abort path deletes TopTransactionContext and friends |

PostgreSQL allocates essentially all backend-private memory through memory contexts. A context is an abstract base (MemoryContextData) carrying the tree links and a pointer to a method table; the concrete storage policy lives in a derived struct (AllocSetContext, SlabContext, …) whose first field is that base. palloc() allocates from CurrentMemoryContext; pfree()/repalloc() recover the owning context from the chunk header and so work regardless of which context is current. Resetting or deleting a context frees everything allocated within it — and, for delete, everything in its descendant contexts — which is the mechanism error handling uses to unwind.

Scope note. This doc owns the context tree, the four allocators, the palloc/pfree API, and reset/delete. How the error path decides which contexts to delete lives in postgres-error-handling.md (the PG_TRY/sigsetjmp machinery and ErrorContext); non-memory resources released alongside memory (buffer pins, relation locks, file descriptors) live in postgres-resource-owners.md. This doc covers only the memory leg, and the reset-callback hook that lets a context release a non-memory resource.

MemoryContextData is the common header every context type begins with. It holds the tree links, the accounting counter, two booleans, and the method vtable pointer:

// MemoryContextData — src/include/nodes/memnodes.h
typedef struct MemoryContextData
{
pg_node_attr(abstract) /* there are no nodes of this type */
NodeTag type; /* identifies exact kind of context */
bool isReset; /* T = no space alloced since last reset */
bool allowInCritSection; /* allow palloc in critical section */
Size mem_allocated; /* track memory allocated for this context */
const MemoryContextMethods *methods; /* virtual function table */
MemoryContext parent; /* NULL if no parent (toplevel context) */
MemoryContext firstchild; /* head of linked list of children */
MemoryContext prevchild; /* previous child of same parent */
MemoryContext nextchild; /* next child of same parent */
const char *name; /* context name */
const char *ident; /* context ID if any */
MemoryContextCallback *reset_cbs; /* list of reset/delete callbacks */
} MemoryContextData;

The methods pointer is a C-level virtual table — the README calls MemoryContextData “essentially an abstract superclass.” The vtable shape:

// MemoryContextMethods — src/include/nodes/memnodes.h (condensed)
typedef struct MemoryContextMethods
{
void *(*alloc) (MemoryContext context, Size size, int flags);
void (*free_p) (void *pointer);
void *(*realloc) (void *pointer, Size size, int flags);
void (*reset) (MemoryContext context);
void (*delete_context) (MemoryContext context);
MemoryContext (*get_chunk_context) (void *pointer);
Size (*get_chunk_space) (void *pointer);
bool (*is_empty) (MemoryContext context);
void (*stats) (MemoryContext context, MemoryStatsPrintFunc printfunc, ...);
/* ... check() under MEMORY_CONTEXT_CHECKING ... */
} MemoryContextMethods;

There is exactly one vtable instance per allocator type, gathered in a single designated-initializer array in mcxt.c. The array is indexed by a small enum, MemoryContextMethodID:

// mcxt_methods[] — src/backend/utils/mmgr/mcxt.c (condensed)
static const MemoryContextMethods mcxt_methods[] = {
    /* aset.c */
    [MCTX_ASET_ID].alloc = AllocSetAlloc,
    [MCTX_ASET_ID].free_p = AllocSetFree,
    [MCTX_ASET_ID].realloc = AllocSetRealloc,
    [MCTX_ASET_ID].reset = AllocSetReset,
    [MCTX_ASET_ID].delete_context = AllocSetDelete,
    /* ... get_chunk_context / get_chunk_space / is_empty / stats ... */

    /* generation.c */
    [MCTX_GENERATION_ID].alloc = GenerationAlloc, /* ... */

    /* slab.c */
    [MCTX_SLAB_ID].alloc = SlabAlloc, /* ... */

    /* bump.c */
    [MCTX_BUMP_ID].alloc = BumpAlloc, /* ... */

    /*
     * Reserved / unused IDs get BOGUS_MCTX dummy entries so a bad
     * pointer fails cleanly instead of jumping through garbage.
     */
    BOGUS_MCTX(MCTX_1_RESERVED_GLIBC_ID),
    /* ... */
};

flowchart TB
  subgraph ABS["abstract base"]
    MCD["MemoryContextData<br/>type, isReset, mem_allocated,<br/>parent/firstchild/nextchild,<br/>methods -> vtable"]
  end
  MCD --> VT["MemoryContextMethods vtable<br/>(one per allocator type)"]
  VT --> ARR["mcxt_methods[ MemoryContextMethodID ]"]
  ARR --> A1["MCTX_ASET_ID -> AllocSet* fns"]
  ARR --> A2["MCTX_GENERATION_ID -> Generation* fns"]
  ARR --> A3["MCTX_SLAB_ID -> Slab* fns"]
  ARR --> A4["MCTX_BUMP_ID -> Bump* fns"]
  subgraph DERIVED["concrete contexts (first field IS MemoryContextData)"]
    D1["AllocSetContext"]
    D2["GenerationContext"]
    D3["SlabContext"]
    D4["BumpContext"]
  end
  D1 -. methods .-> A1
  D2 -. methods .-> A2
  D3 -. methods .-> A3
  D4 -. methods .-> A4

Figure 1 — The context type hierarchy. MemoryContextData is an abstract header; each concrete allocator embeds it as its first field and points its methods at one row of the mcxt_methods[] vtable array, selected by a MemoryContextMethodID. This is C++-style single inheritance done by hand.

The chunk header: how pfree finds the allocator from a pointer

The elegant trick that makes pfree(p) and repalloc(p) work without a context argument: every chunk is immediately preceded, with no padding, by a uint64 whose low 4 bits are the owning context’s MemoryContextMethodID. Given any chunk pointer, pfree reads those 4 bits, indexes mcxt_methods[], and calls the right free_p:

// GetMemoryChunkMethodID — src/backend/utils/mmgr/mcxt.c (condensed)
static inline MemoryContextMethodID
GetMemoryChunkMethodID(const void *pointer)
{
    uint64      header;

    /* a non-MAXALIGNED pointer can't be a real chunk */
    Assert(pointer == (const void *) MAXALIGN(pointer));

    header = *((const uint64 *) ((const char *) pointer - sizeof(uint64)));
    return (MemoryContextMethodID) (header & MEMORY_CONTEXT_METHODID_MASK);
}

#define MCXT_METHOD(pointer, method) \
    mcxt_methods[GetMemoryChunkMethodID(pointer)].method

void
pfree(void *pointer)
{
    MCXT_METHOD(pointer, free_p) (pointer); /* dispatch to owner's free */
}

The 4-bit id is chosen so that the reserved values are diagnostic: 0000 (MCTX_0_RESERVED_UNUSEDMEM_ID) is what never-touched memory reads as, 1111 (MCTX_15_RESERVED_WIPEDMEM_ID) is what wipe_mem leaves behind, and 0001/0010 match the patterns glibc’s malloc tends to leave. Handing pfree a glibc pointer or already-freed memory therefore lands on a BOGUS_MCTX entry and fails cleanly instead of corrupting the heap.
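
To see why those reserved values catch bad pointers, here is a small standalone sketch of the tag extraction. It is not the PostgreSQL code: `toy_method_id` and `toy_classify` are hypothetical names, and the 0x7F wipe byte used in the usage note is an assumption about the fill pattern (any byte whose low four bits are all set gives the same tag).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_METHODID_MASK  UINT64_C(0xF)    /* low 4 bits of the header */

/* Read the 8 bytes just before the pointer and keep the low 4 bits. */
static unsigned
toy_method_id(const void *pointer)
{
    uint64_t    header;

    assert(pointer != NULL);
    memcpy(&header, (const unsigned char *) pointer - sizeof(uint64_t),
           sizeof(header));
    return (unsigned) (header & TOY_METHODID_MASK);
}

/* Classify a tag using only the reserved values named in the text. */
static const char *
toy_classify(unsigned id)
{
    switch (id)
    {
        case 0:
            return "reserved: never-written (zeroed) memory";
        case 1:
        case 2:
            return "reserved: glibc malloc pattern";
        case 15:
            return "reserved: wiped memory";
        default:
            return "not one of the reserved values named above";
    }
}
```

Filling the eight bytes in front of a pointer with zeros yields tag 0, and filling them with the assumed wipe byte 0x7F yields tag 15. Both map to a reserved entry, so a dispatch table indexed by this tag can route them to an error handler rather than to a real allocator's `free`.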

The remaining 60 bits of that header are the allocator’s to use. All four in-tree allocators share the MemoryChunk header type (in memutils_memorychunk.h), which packs three things into those bits: one bit marking an “external” (oversized, single-chunk-per-block) chunk, a 30-bit chunk size (or freelist index), and a chunk-to-block offset. The offset lets the allocator recover the block header from the chunk, and the block points back at the owning context, which is how GetMemoryChunkContext answers “which context owns this pointer?”:

// GetMemoryChunkContext — src/backend/utils/mmgr/mcxt.c
MemoryContext
GetMemoryChunkContext(void *pointer)
{
return MCXT_METHOD(pointer, get_chunk_context) (pointer);
}

palloc is MemoryContextAlloc specialized to CurrentMemoryContext. It is deliberately a near-duplicate (not a wrapper) to shave a stack frame off the hottest path in the backend:

// palloc — src/backend/utils/mmgr/mcxt.c (condensed)
void *
palloc(Size size)
{
void *ret;
MemoryContext context = CurrentMemoryContext;
Assert(MemoryContextIsValid(context));
AssertNotInCriticalSection(context);
context->isReset = false;
ret = context->methods->alloc(context, size, 0);
Assert(ret != NULL); /* OOM is handled inside alloc, via elog */
VALGRIND_MEMPOOL_ALLOC(context, ret, size);
return ret;
}

Two API contracts from the README are load-bearing for the rest of the engine:

  • palloc never returns NULL. On OOM it elog(ERROR)s out (through MemoryContextAllocationFailure), so callers never null-check. The palloc_extended(..., MCXT_ALLOC_NO_OOM) variant opts out for the rare caller that wants to handle OOM itself.
  • pfree/repalloc ignore CurrentMemoryContext. They route to the chunk’s owning context, so you can free a chunk that belongs to a different context than the one currently active. (repalloc cannot take NULL — it has no context to allocate from if the chunk does not exist yet.)

Switching the current context is a trivial save/restore, defined inline in the header so it inlines everywhere:

// MemoryContextSwitchTo — src/include/utils/palloc.h
static inline MemoryContext
MemoryContextSwitchTo(MemoryContext context)
{
MemoryContext old = CurrentMemoryContext;
CurrentMemoryContext = context;
return old;
}

The README’s warning is worth repeating: CurrentMemoryContext should point at a short-lived context during normal work (typically the per-tuple context), so that an accidental un-freed palloc is reclaimed soon, not leaked for the life of the process.

The context forest and the well-known roots

Contexts form a forest — in current practice a single tree rooted at TopMemoryContext, the one context that is never reset or deleted. The well-known globals (declared in mcxt.c, described in the README) are the lifetime anchors every transient context hangs under:

flowchart TB
  TOP["TopMemoryContext<br/>(never reset; ~ malloc)"]
  TOP --> ERR["ErrorContext<br/>(reserved for error recovery)"]
  TOP --> CACHE["CacheMemoryContext<br/>(relcache/catcache; never reset)"]
  TOP --> MSG["MessageContext<br/>(current FE message; reset each cmd)"]
  TOP --> TTX["TopTransactionContext<br/>(reset at top-level xact end)"]
  TOP --> POSTM["PostmasterContext<br/>(freed in child after fork)"]
  MSG --> PLAN["parse/plan temp<br/>(child of MessageContext)"]
  TTX --> CUR["CurTransactionContext<br/>(= TopTransaction at top level;<br/>child per subtransaction)"]
  POR["PortalContext (per active portal)"] --> EXEC["ExecutorState<br/>(ExecutorStart..ExecutorEnd)"]
  EXEC --> PT1["ExprContext per-tuple<br/>(reset every tuple)"]
  EXEC --> PT2["ExprContext per-tuple<br/>(one per plan node)"]

Figure 2 — The well-known context roots and a typical transient subtree under them. Each root maps to a lifetime: process (Top, Cache), per-command (Message), per-transaction (TopTransaction/CurTransaction), per-portal (Portal), per-query (ExecutorState), per-tuple (ExprContext). Cleanup is “reset/delete the node matching the lifetime that just ended.”

Highlights from the README that explain why there are so many roots:

  • ErrorContext always keeps a few KB free, so that an out-of-memory condition can be reported as an ordinary ERROR (the error path itself needs memory). It is reset after each recovery.
  • TopTransactionContext is not cleared immediately on error — its contents survive until the transaction block exits via COMMIT/ROLLBACK. CurTransactionContext equals it at top level but points at a per-subxact child inside a subtransaction; an aborting subxact throws its child away while a committed subxact’s child is retained until top-level commit.
  • CacheMemoryContext never resets (like TopMemoryContext); the distinction exists mainly for debugging and so that subsidiary cache storage can live in shorter-lived children of it.

MemoryContextCreate initializes the abstract header and links the node into its parent’s child list. It is required not to fail (Asserts only, no elog), and refuses to run inside a critical section:

// MemoryContextCreate — src/backend/utils/mmgr/mcxt.c (condensed)
void
MemoryContextCreate(MemoryContext node, NodeTag tag,
                    MemoryContextMethodID method_id,
                    MemoryContext parent, const char *name)
{
    Assert(CritSectionCount == 0);

    node->type = tag;
    node->isReset = true;
    node->methods = &mcxt_methods[method_id];
    node->parent = parent;
    node->firstchild = NULL;
    /* ... */

    if (parent)
    {
        /* push the new node onto the head of the parent's child list */
        node->nextchild = parent->firstchild;
        if (parent->firstchild != NULL)
            parent->firstchild->prevchild = node;
        parent->firstchild = node;
        node->allowInCritSection = parent->allowInCritSection;
    }
    /* ... */
}

Callers do not invoke this directly; they use a per-allocator creation helper — AllocSetContextCreate(parent, name, ALLOCSET_DEFAULT_SIZES) is overwhelmingly the common case — which mallocs the derived struct (plus an initial “keeper” block) and then calls MemoryContextCreate on its embedded header.

Reset keeps the context object but frees its contents, and by convention deletes its children rather than resetting them:

// MemoryContextReset / MemoryContextResetOnly — src/backend/utils/mmgr/mcxt.c
void
MemoryContextReset(MemoryContext context)
{
    /* save a call in the common no-children case */
    if (context->firstchild != NULL)
        MemoryContextDeleteChildren(context);

    /* save a call if nothing was allocated since the last reset */
    if (!context->isReset)
        MemoryContextResetOnly(context);
}

void
MemoryContextResetOnly(MemoryContext context)
{
    if (!context->isReset)
    {
        MemoryContextCallResetCallbacks(context);   /* release non-mem resources */
        context->methods->reset(context);           /* allocator-specific */
        context->isReset = true;
    }
}

The isReset flag is a fast-path guard: a context that has had no palloc since creation or last reset skips the allocator’s reset entirely — important because per-tuple contexts are reset on every tuple even when nothing was allocated.
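
The effect of the flag can be shown with a toy model. The `ToyContext` type and its counter are hypothetical and exist only to make the skipped work observable; the real allocator reset does far more than zero a field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy context: tracks whether anything was allocated. */
typedef struct ToyContext
{
    bool        is_reset;           /* true = nothing allocated since last reset */
    size_t      used;               /* bytes handed out */
    int         allocator_resets;   /* example-only: counts the expensive path */
} ToyContext;

/* Any allocation clears the flag. */
static void
toy_alloc(ToyContext *c, size_t size)
{
    assert(c != NULL);
    c->is_reset = false;
    c->used += size;
}

/* Reset does the expensive work only if something was allocated. */
static void
toy_reset(ToyContext *c)
{
    if (c->is_reset)
        return;                     /* fast path: nothing to release */
    c->used = 0;
    c->allocator_resets++;
    c->is_reset = true;
}
```

Calling `toy_reset()` in a loop on a context that was never allocated from leaves `allocator_resets` at zero, which mirrors the per-tuple case described above.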

Delete tears down the context and its whole subtree. Two details are deliberate and worth internalizing:

// MemoryContextDelete — src/backend/utils/mmgr/mcxt.c (condensed)
void
MemoryContextDelete(MemoryContext context)
{
    MemoryContext curr = context;

    for (;;)
    {
        MemoryContext parent;

        /* descend to a leaf */
        while (curr->firstchild != NULL)
            curr = curr->firstchild;

        parent = curr->parent;
        MemoryContextDeleteOnly(curr);  /* delink + delete one leaf */

        if (curr == context)
            break;
        curr = parent;
    }
}

  • No recursion. Deletion is an explicit descend-to-leaf loop, not a recursive walk, precisely because deletion happens on the error/abort path: a “stack depth limit exceeded” error raised while already cleaning up from an abort would be catastrophic.
  • Delink before free. MemoryContextDeleteOnly runs the reset callbacks and then calls MemoryContextSetParent(context, NULL) to unlink the node from the tree before freeing it, so that if an error occurs while the storage is being released, the tree is never left pointing at a half-freed context (“better a leak than a crash”).

flowchart TD
  A["MemoryContextDelete(ctx)"] --> B{"ctx has children?"}
  B -- "yes" --> C["descend to first leaf"]
  C --> B
  B -- "no (leaf)" --> D["MemoryContextCallResetCallbacks"]
  D --> E["MemoryContextSetParent(leaf, NULL)<br/>delink from tree first"]
  E --> F["methods->delete_context<br/>(free all blocks + the struct)"]
  F --> G{"leaf == original ctx?"}
  G -- "no" --> H["climb to saved parent"]
  H --> B
  G -- "yes" --> I["done"]

Figure 3 — MemoryContextDelete as an iterative descend-to-leaf, delete, climb loop. Reset callbacks fire, the node is delinked before its storage is freed, and the loop avoids recursion so an over-deep tree cannot blow the stack during abort cleanup.
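
The descend, delete, climb loop is easy to try on a toy tree. The sketch below is an assumption-laden miniature, not the PostgreSQL implementation: `Node`, `node_create`, `node_delete_leaf`, and `node_delete_tree` are hypothetical names, there are no callbacks or allocator blocks, and `nodes_freed` exists only so the behavior can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct Node Node;
struct Node
{
    Node       *parent;
    Node       *firstchild;
    Node       *prevchild;
    Node       *nextchild;
};

static int  nodes_freed = 0;    /* example-only instrumentation */

/* Create a node and push it onto the head of the parent's child list. */
static Node *
node_create(Node *parent)
{
    Node       *n = calloc(1, sizeof(Node));

    if (n == NULL)
        return NULL;
    n->parent = parent;
    if (parent != NULL)
    {
        n->nextchild = parent->firstchild;
        if (parent->firstchild != NULL)
            parent->firstchild->prevchild = n;
        parent->firstchild = n;
    }
    return n;
}

/* Unlink a leaf from its parent's child list first, then free it. */
static void
node_delete_leaf(Node *n)
{
    assert(n->firstchild == NULL);
    if (n->parent != NULL)
    {
        if (n->prevchild != NULL)
            n->prevchild->nextchild = n->nextchild;
        else
            n->parent->firstchild = n->nextchild;
        if (n->nextchild != NULL)
            n->nextchild->prevchild = n->prevchild;
    }
    free(n);
    nodes_freed++;
}

/* Delete a whole subtree without recursion: descend, delete leaf, climb. */
static void
node_delete_tree(Node *root)
{
    Node       *curr = root;

    for (;;)
    {
        Node       *parent;
        int         is_root;

        while (curr->firstchild != NULL)
            curr = curr->firstchild;
        parent = curr->parent;      /* save before the node is freed */
        is_root = (curr == root);
        node_delete_leaf(curr);
        if (is_root)
            break;
        curr = parent;
    }
}
```

Stack usage is constant no matter how deep the tree is, because the only state carried between iterations is the `curr` pointer.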

Reset/delete callbacks — releasing non-memory resources

A context can carry a list of callbacks fired once, just before the next reset or delete. This is how a context becomes a hook for releasing resources that are associated with palloc’d objects but are not themselves memory — an open file behind a tuplesort object, a reference count on a cache entry.

// MemoryContextRegisterResetCallback — src/backend/utils/mmgr/mcxt.c
void
MemoryContextRegisterResetCallback(MemoryContext context,
MemoryContextCallback *cb)
{
/* push onto head: newest-registered fires first */
cb->next = context->reset_cbs;
context->reset_cbs = cb;
context->isReset = false; /* ensure reset path runs the callbacks */
}

Callbacks fire in reverse order of registration, and a child’s callbacks fire before its parent’s during a subtree teardown. The caller supplies the MemoryContextCallback struct itself (typically allocated inside the target context, so it is reclaimed along with that context), which avoids an extra palloc. This is the narrow seam where memory contexts touch resource management; the broader story of pins, locks, and descriptors lives in postgres-resource-owners.md.
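
The reverse-order behavior falls straight out of the head-insertion list. The following sketch uses hypothetical `Ctx`, `Callback`, and `ctx_*` names, and the choice to pop each entry before calling it is a design decision of this example (so that a callback can never run twice), not a statement about the real function.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Callback Callback;
struct Callback
{
    void        (*func) (void *arg);
    void       *arg;
    Callback   *next;
};

typedef struct Ctx
{
    Callback   *reset_cbs;      /* head of the callback list */
} Ctx;

/* Push onto the head, so the newest registration fires first. */
static void
ctx_register(Ctx *ctx, Callback *cb)
{
    assert(cb->func != NULL);
    cb->next = ctx->reset_cbs;
    ctx->reset_cbs = cb;
}

/* Fire each callback once; pop before calling so it cannot run twice. */
static void
ctx_call_callbacks(Ctx *ctx)
{
    Callback   *cb;

    while ((cb = ctx->reset_cbs) != NULL)
    {
        ctx->reset_cbs = cb->next;
        cb->func(cb->arg);
    }
}

/* Example-only instrumentation: record the order in which callbacks ran. */
static char fired[8];
static int  nfired = 0;

static void
record(void *arg)
{
    if (nfired < (int) sizeof(fired))
        fired[nfired++] = *(const char *) arg;
}
```

Registering a callback tagged 'A' and then one tagged 'B' makes 'B' run first, and a second call to `ctx_call_callbacks()` does nothing because the list is already empty.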

Error handling unwinds by deleting contexts

The decisive payoff. When a backend hits elog(ERROR), control siglongjmps back to the enclosing PG_TRY/sigsetjmp barrier (owned by postgres-error-handling.md). Memory cleanup at that barrier is not a walk of data structures; it is a handful of context resets and deletes. At the outermost level, AbortTransaction followed by CleanupTransaction resets/deletes TopTransactionContext (and thereby every per-statement, per-portal, per-tuple child created during the transaction); the per-message loop resets MessageContext at the top of each cycle. Because every transient allocation was a descendant of one of those roots, a single delete reclaims it all, on the error path exactly as on the success path, with no leak-prone per-object freeing and no missed branch.

sequenceDiagram
    participant Q as backend op
    participant E as elog(ERROR)
    participant LJ as sigsetjmp barrier<br/>(error-handling doc)
    participant MC as memory contexts

    Q->>E: error raised
    E->>LJ: siglongjmp to enclosing PG_TRY
    LJ->>MC: AbortCurrentTransaction()
    MC->>MC: MemoryContextDelete(TopTransactionContext)
    Note over MC: deletes whole subtree:<br/>per-portal, per-query,<br/>per-tuple contexts
    MC->>MC: callbacks fire (files closed, pins... see resowner doc)
    LJ->>MC: MemoryContextReset(MessageContext) next loop
    Note over MC: every transient chunk reclaimed<br/>by region teardown, not per-object free

Figure 4 — Error unwind as region teardown. The longjmp machinery is owned by postgres-error-handling.md; this diagram shows only the memory leg — deleting TopTransactionContext frees the entire transient subtree in one operation, which is the whole reason the engine can treat OOM and arbitrary mid-statement errors as ordinary recoverable conditions.

All four concrete allocators present the same external behavior (palloc/pfree/reset/delete) but implement radically different internal policies. The README’s one-line summaries, expanded:

| Allocator | File | Best for | Per-chunk pfree? | Chunk header |
| --- | --- | --- | --- | --- |
| AllocSet | aset.c | general purpose (the default) | yes, recycled via size-class freelists | full MemoryChunk |
| Slab | slab.c | many equally-sized chunks | yes, dense packing, returns empty blocks to OS | full MemoryChunk |
| Generation | generation.c | groups with similar lifespan / FIFO | yes, but space is not reused; block freed when empty | full MemoryChunk |
| Bump | bump.c | write-once, never individually freed | no (pfree/realloc unsupported) | none (in normal builds) |

aset.c is what AllocSetContextCreate builds and what nearly every context in the system is. Its policy: round each request up to a power-of-2 size class and keep a per-class freelist of recycled chunks; carve chunks out of malloc’d blocks that double in size (8 KB initial → up to 8 MB) to amortize malloc overhead; send requests larger than allocChunkLimit (~8 KB) to a dedicated block returned whole on free.

// AllocSetContext — src/backend/utils/mmgr/aset.c (condensed)
typedef struct AllocSetContext
{
MemoryContextData header; /* Standard memory-context fields */
AllocBlock blocks; /* head of list of blocks in this set */
MemoryChunk *freelist[ALLOCSET_NUM_FREELISTS]; /* free chunk lists */
uint32 initBlockSize; /* initial block size */
uint32 maxBlockSize; /* maximum block size */
uint32 nextBlockSize; /* next block size to allocate */
uint32 allocChunkLimit; /* effective chunk size limit */
int freeListIndex; /* index in context_freelists[], or -1 */
} AllocSetContext;

With ALLOC_MINBITS = 3 and ALLOCSET_NUM_FREELISTS = 11, freelist k holds chunks of size 1 << (k+3) — 8, 16, 32, … up to 8192 bytes. The allocation fast path:
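
The size-class arithmetic can be checked with a short standalone function. This is a sketch that uses a plain loop and hypothetical `toy_*` names; it reproduces only the mapping stated above (class k holds chunks of `1 << (k + 3)` bytes), not how aset.c actually computes the index.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MINBITS        3    /* smallest chunk is 1 << 3 = 8 bytes */
#define TOY_NUM_FREELISTS  11   /* classes 8, 16, ..., 8192 */

/* Return the smallest k such that (1 << (k + TOY_MINBITS)) >= size. */
static int
toy_free_index(size_t size)
{
    int         idx = 0;
    size_t      chunk = (size_t) 1 << TOY_MINBITS;

    /* larger requests belong to the oversized path, not a freelist */
    assert(size <= ((size_t) 1 << (TOY_MINBITS + TOY_NUM_FREELISTS - 1)));
    while (chunk < size)
    {
        chunk <<= 1;
        idx++;
    }
    return idx;
}

/* Inverse mapping: the chunk size served by freelist idx. */
static size_t
toy_chunk_size(int idx)
{
    assert(idx >= 0 && idx < TOY_NUM_FREELISTS);
    return (size_t) 1 << (idx + TOY_MINBITS);
}
```

A 100-byte request maps to class 4 and is served from a 128-byte chunk, so 28 bytes are lost to rounding. That loss is the internal fragmentation a power-of-2 scheme accepts in exchange for O(1) recycling.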

// AllocSetAlloc — src/backend/utils/mmgr/aset.c (condensed)
void *
AllocSetAlloc(MemoryContext context, Size size, int flags)
{
    AllocSet    set = (AllocSet) context;
    AllocBlock  block;
    MemoryChunk *chunk;
    int         fidx;
    Size        chunk_size;

    if (size > set->allocChunkLimit)    /* oversized: dedicated block */
        return AllocSetAllocLarge(context, size, flags);

    fidx = AllocSetFreeIndex(size);     /* size class */
    chunk = set->freelist[fidx];
    if (chunk != NULL)                  /* reuse a recycled chunk */
    {
        set->freelist[fidx] = GetFreeListLink(chunk)->next;
        return MemoryChunkGetPointer(chunk);
    }

    /* else carve from the current block, or start a new one */
    chunk_size = GetChunkSizeFromFreeListIdx(fidx);
    block = set->blocks;
    if (unlikely((block->endptr - block->freeptr) < (chunk_size + ALLOC_CHUNKHDRSZ)))
        return AllocSetAllocFromNewBlock(context, size, flags, fidx);
    return AllocSetAllocChunkFromBlock(context, block, size, chunk_size, fidx);
}

pfree pushes the chunk back onto freelist[fidx] (it is not returned to the OS); a freed oversized chunk’s dedicated block is returned to malloc. Reset keeps the first (“keeper”) block and frees the rest, so a repeatedly-reset context does not thrash malloc. There is also a small context freelist (context_freelists[], cap MAX_FREE_CONTEXTS = 100): freshly deleted default-sized AllocSets are cached whole and handed back by the next create, avoiding repeated malloc/free of the context struct itself.

flowchart LR
  REQ["palloc(size) in an AllocSet"] --> Q1{"size > allocChunkLimit?"}
  Q1 -- "yes" --> LRG["AllocSetAllocLarge<br/>dedicated block, external chunk"]
  Q1 -- "no" --> FIDX["fidx = AllocSetFreeIndex(size)<br/>(power-of-2 class)"]
  FIDX --> Q2{"freelist[fidx] non-empty?"}
  Q2 -- "yes" --> POP["pop recycled chunk<br/>O(1)"]
  Q2 -- "no" --> Q3{"room in current block?"}
  Q3 -- "yes" --> CARVE["bump freeptr in block"]
  Q3 -- "no" --> NEW["malloc new block<br/>(size doubles toward maxBlockSize)"]

Figure 5 — AllocSetAlloc decision flow. Small requests hit a power-of-2 freelist (recycled chunk, else carve from the current doubling block); requests over allocChunkLimit (~8 KB) go to a dedicated block freed whole on pfree.

Slab — fixed-size chunks, fragmentation-resistant

slab.c is for streams of identically-sized objects (the chunk size is fixed at context creation). Blocks are carved into exact-size chunks; the context keeps blocks bucketed by free-chunk count in a blocklist[] array and always serves new allocations from the fullest non-full block, which keeps used chunks dense and lets emptied blocks return to the OS — directly attacking the fragmentation that a general allocator suffers when long-lived and short-lived same-size objects intermix.

// SlabContext — src/backend/utils/mmgr/slab.c (condensed)
typedef struct SlabContext
{
MemoryContextData header;
uint32 chunkSize; /* the requested (non-aligned) chunk size */
uint32 fullChunkSize; /* chunk size with header + alignment */
uint32 blockSize; /* size of each block of chunks */
int32 chunksPerBlock;
int32 curBlocklistIndex; /* fullest blocks live here */
/* ... */
dlist_head blocklist[SLAB_BLOCKLIST_COUNT]; /* blocks bucketed by nfree */
} SlabContext;

In-tree users: reorderbuffer.c (logical decoding allocates a uniform ReorderBufferChange per WAL change — a textbook fixed-size stream).
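
The "fullest non-full block first" rule can be illustrated in isolation. The sketch below is a simplification under stated assumptions: `ToySlabBlock` and the `toy_slab_*` functions are hypothetical, and a linear scan stands in for the bucketed blocklist[] lookup that slab.c uses to find the same block quickly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-block bookkeeping: only the free-chunk count matters here. */
typedef struct ToySlabBlock
{
    int         nfree;          /* free chunks remaining in this block */
} ToySlabBlock;

/*
 * Pick the block with the fewest free chunks that still has at least one,
 * i.e. the fullest non-full block. Returns -1 when every block is full.
 */
static int
toy_slab_pick_block(const ToySlabBlock *blocks, int nblocks)
{
    int         best = -1;

    assert(nblocks >= 0);
    for (int i = 0; i < nblocks; i++)
    {
        if (blocks[i].nfree == 0)
            continue;           /* full block: nothing to hand out */
        if (best < 0 || blocks[i].nfree < blocks[best].nfree)
            best = i;
    }
    return best;
}

/* Take one chunk from the chosen block; returns the block index or -1. */
static int
toy_slab_alloc(ToySlabBlock *blocks, int nblocks)
{
    int         idx = toy_slab_pick_block(blocks, nblocks);

    if (idx >= 0)
        blocks[idx].nfree--;
    return idx;
}
```

Filling the nearly-full blocks first leaves the sparsely used blocks alone, which gives them the best chance of draining completely and being released.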

Generation — FIFO / similar-lifespan groups

generation.c suits objects allocated in generations that die together, or roughly in FIFO order. It is a bump-style block allocator with one twist: it tracks a free-count per block and returns a block to the OS once all its chunks are pfree’d, but never reuses space within a block (no freelists). It keeps a single freeblock around to recycle, to avoid malloc churn.

// GenerationContext — src/backend/utils/mmgr/generation.c (condensed)
typedef struct GenerationContext
{
MemoryContextData header;
uint32 initBlockSize, maxBlockSize, nextBlockSize, allocChunkLimit;
GenerationBlock *block; /* current (most recently allocated) block */
GenerationBlock *freeblock; /* one empty block kept for recycling */
dlist_head blocks; /* list of blocks */
} GenerationContext;

In-tree users: reorderbuffer.c (tuple data, freed in commit order), tuplestore.c, gistvacuum.c — all FIFO-ish producer/consumer patterns where AllocSet’s freelists would just be wasted bookkeeping.
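
The block-level bookkeeping behind this policy can be sketched with two counters. The model is hypothetical (`ToyGenBlock`, `toy_gen_alloc`, `toy_gen_free`): it tracks a single block, returns offsets instead of pointers, and uses a `released` flag in place of actually handing the block back to malloc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single block in a generation-style allocator. */
typedef struct ToyGenBlock
{
    size_t      capacity;       /* bytes in the block */
    size_t      used;           /* bump offset; never moves backwards */
    int         nchunks;        /* chunks handed out from this block */
    int         nfree;          /* chunks freed so far */
    bool        released;       /* stands in for returning the block to malloc */
} ToyGenBlock;

/* Bump-allocate; returns the chunk's offset, or -1 if it does not fit. */
static long
toy_gen_alloc(ToyGenBlock *b, size_t size)
{
    size_t      off = b->used;

    if (b->released || size > b->capacity - b->used)
        return -1;
    b->used += size;
    b->nchunks++;
    return (long) off;
}

/* Freeing only counts; space is reclaimed when the whole block is empty. */
static void
toy_gen_free(ToyGenBlock *b)
{
    assert(b->nfree < b->nchunks);
    b->nfree++;
    if (b->nfree == b->nchunks)
        b->released = true;
}
```

After one of two chunks is freed, a new request still fails because the freed space is not reused; only when the second chunk is freed does the block become releasable. For FIFO-ordered lifetimes that point arrives quickly, so skipping freelists costs little.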

bump.c (added in PG17) is the densest allocator: chunks have no header at all in normal builds, so pfree, repalloc, GetMemoryChunkSpace, and GetMemoryChunkContext are unsupported and raise an error if attempted. You allocate a large number of small chunks, never free them individually, and reclaim only by reset/delete of the whole context. Dropping the header fits more chunks per block and per cache line.

// BumpContext — src/backend/utils/mmgr/bump.c (condensed)
typedef struct BumpContext
{
MemoryContextData header;
uint32 initBlockSize, maxBlockSize, nextBlockSize, allocChunkLimit;
dlist_head blocks; /* block being filled is at the head */
} BumpContext;

In normal builds Bump_CHUNKHDRSZ is 0; only under MEMORY_CONTEXT_CHECKING does bump add a MemoryChunk header so the disallowed operations can be caught with a clear error. In-tree users: nodeAgg.c (hash-aggregate group state), tuplesort.c, tidstore.c — all “allocate, use, throw the whole arena away” scratchpads.

Anchor on symbol names, not line numbers. The PostgreSQL source moves; a function/struct/macro name is the stable handle. Use git grep -n '<symbol>' src/backend/utils/mmgr/ to locate the current position. Line numbers in the position-hint table were observed at commit 273fe94 (REL_18) and are quick hints only.

Abstract layer and dispatch (src/include/nodes/memnodes.h, src/backend/utils/mmgr/mcxt.c)

  • struct MemoryContextData (in memnodes.h) — the abstract context header (tree links, methods vtable pointer, isReset, mem_allocated).
  • struct MemoryContextMethods (in memnodes.h) — the allocator vtable.
  • enum MemoryContextMethodID (in memutils_internal.h) — 4-bit allocator tag; reserved values (_RESERVED_GLIBC_, _RESERVED_UNUSEDMEM_, _RESERVED_WIPEDMEM_) make bad pointers fail cleanly.
  • mcxt_methods[] (in mcxt.c) — the one-row-per-allocator vtable array.
  • GetMemoryChunkMethodID / MCXT_METHOD (in mcxt.c) — read the 4-bit id from the chunk header and dispatch.
  • MemoryContextCreate (in mcxt.c) — initialize header, link into parent.
  • MemoryContextReset / MemoryContextResetOnly / MemoryContextResetChildren — free contents, delete (or reset) children; isReset fast path.
  • MemoryContextDelete / MemoryContextDeleteOnly / MemoryContextDeleteChildren — iterative descend-to-leaf teardown, delink before free.
  • MemoryContextSetParent — reparent (used to change a context’s lifespan after fill).
  • MemoryContextRegisterResetCallback / MemoryContextCallResetCallbacks — the non-memory-resource hook.
  • MemoryContextAllocationFailure / MemoryContextSizeFailure — the OOM / bad-size elog(ERROR) paths.
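The "iterative descend-to-leaf teardown, delink before free" entry in the list above can be sketched in plain C. This is a toy model under stated assumptions, not the mcxt.c code: ToyContext, toy_context_create, toy_context_delete_only, toy_context_delete, and the toy_deleted_count counter are all hypothetical, and a context here owns no memory other than its own node.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy node carrying only the tree links. */
typedef struct ToyContext
{
    struct ToyContext *parent;
    struct ToyContext *firstchild;
    struct ToyContext *prevchild;
    struct ToyContext *nextchild;
} ToyContext;

int         toy_deleted_count = 0;  /* demo counter, not part of the technique */

/* Create a node and push it onto the head of the parent's child list. */
ToyContext *
toy_context_create(ToyContext *parent)
{
    ToyContext *cxt = calloc(1, sizeof(ToyContext));

    if (cxt == NULL)
        return NULL;
    cxt->parent = parent;
    if (parent != NULL)
    {
        cxt->nextchild = parent->firstchild;
        if (parent->firstchild != NULL)
            parent->firstchild->prevchild = cxt;
        parent->firstchild = cxt;
    }
    return cxt;
}

/* Delete one leaf: unlink it from its parent first, then free it. */
void
toy_context_delete_only(ToyContext *cxt)
{
    assert(cxt->firstchild == NULL);
    if (cxt->parent != NULL)
    {
        if (cxt->prevchild != NULL)
            cxt->prevchild->nextchild = cxt->nextchild;
        else
            cxt->parent->firstchild = cxt->nextchild;
        if (cxt->nextchild != NULL)
            cxt->nextchild->prevchild = cxt->prevchild;
    }
    free(cxt);
    toy_deleted_count++;
}

/* Delete a whole subtree with a loop, using no recursion and no extra stack. */
void
toy_context_delete(ToyContext *target)
{
    ToyContext *curr = target;

    for (;;)
    {
        ToyContext *parent;
        bool        done;

        /* Descend until we reach a context with no children. */
        while (curr->firstchild != NULL)
            curr = curr->firstchild;

        /* Remember where to resume before the node disappears. */
        parent = curr->parent;
        done = (curr == target);
        toy_context_delete_only(curr);
        if (done)
            break;
        curr = parent;
    }
}
```

Since each leaf is unlinked before it is freed, the tree is consistent at every step: if the teardown were interrupted partway, the remaining nodes would still form a valid (smaller) tree rather than pointing at freed memory.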

Allocation API (src/backend/utils/mmgr/mcxt.c, src/include/utils/palloc.h)

  • palloc / palloc0 / palloc_extended / palloc_aligned — allocate from CurrentMemoryContext.
  • MemoryContextAlloc / MemoryContextAllocZero / MemoryContextAllocExtended — allocate from a named context.
  • pfree / repalloc / repalloc0 / repalloc_extended — context recovered from the chunk header; ignore CurrentMemoryContext.
  • GetMemoryChunkContext / GetMemoryChunkSpace — pointer → context / size.
  • MemoryContextSwitchTo (inline, in palloc.h) — save/restore the current context.
  • AllocSetContext, AllocSetAlloc, AllocSetFree, AllocSetReset, AllocSetDelete, AllocSetFreeIndex (in aset.c) — general-purpose; power-of-2 freelists, doubling blocks, context freelist.
  • SlabContext, SlabAlloc, SlabReset, SlabContextCreate (in slab.c) — fixed-size, fullest-block-first.
  • GenerationContext, GenerationAlloc, GenerationFree, GenerationContextCreate (in generation.c) — FIFO/lifespan, no reuse.
  • BumpContext, BumpAlloc, BumpContextCreate (in bump.c) — header-less, no pfree.
  • AllocSetContextCreate macro + ALLOCSET_DEFAULT_SIZES/ALLOCSET_SMALL_SIZES (in memutils.h) — the common creation entry points.
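As an illustration of the "power-of-2 freelists" idea behind AllocSetFreeIndex, the sketch below maps a request size to a size-class index with a plain loop. It assumes the constants this document reports (smallest class 8 bytes, 11 classes, hence a largest class of 8 KB); the toy_ names are hypothetical, and the loop stands in for whatever bit-scan technique the real aset.c uses.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MINBITS        3    /* smallest class is 1 << 3 = 8 bytes */
#define TOY_NUM_FREELISTS  11   /* classes 8, 16, ..., 8192 bytes */
#define TOY_CHUNK_LIMIT \
    ((size_t) 1 << (TOY_NUM_FREELISTS - 1 + TOY_MINBITS))

/* Return the index of the smallest power-of-2 class that fits `size`. */
int
toy_free_index(size_t size)
{
    int         idx = 0;
    size_t      classSize = (size_t) 1 << TOY_MINBITS;

    /* Larger requests would go to a dedicated block, not a freelist. */
    assert(size <= TOY_CHUNK_LIMIT);

    while (classSize < size)
    {
        classSize <<= 1;
        idx++;
    }
    return idx;
}

/* Inverse mapping: bytes actually reserved for a chunk of class `idx`. */
size_t
toy_class_size(int idx)
{
    assert(idx >= 0 && idx < TOY_NUM_FREELISTS);
    return (size_t) 1 << (idx + TOY_MINBITS);
}
```

A 100-byte request lands in class 4 and reserves 128 bytes; the 28 unused bytes are the internal fragmentation that the power-of-2 scheme accepts in exchange for a freed chunk being reusable by any later request of the same class.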

Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
|---|---|---|
| struct MemoryContextData | src/include/nodes/memnodes.h | 117 |
| struct MemoryContextMethods | src/include/nodes/memnodes.h | 58 |
| MemoryContextIsValid | src/include/nodes/memnodes.h | 145 |
| enum MemoryContextMethodID | src/include/utils/memutils_internal.h | 121 |
| mcxt_methods[] | src/backend/utils/mmgr/mcxt.c | 47 |
| MCXT_METHOD (macro) | src/backend/utils/mmgr/mcxt.c | 188 |
| GetMemoryChunkMethodID | src/backend/utils/mmgr/mcxt.c | 196 |
| MemoryContextReset | src/backend/utils/mmgr/mcxt.c | 389 |
| MemoryContextResetOnly | src/backend/utils/mmgr/mcxt.c | 408 |
| MemoryContextDelete | src/backend/utils/mmgr/mcxt.c | 460 |
| MemoryContextDeleteOnly | src/backend/utils/mmgr/mcxt.c | 502 |
| MemoryContextRegisterResetCallback | src/backend/utils/mmgr/mcxt.c | 574 |
| MemoryContextCallResetCallbacks | src/backend/utils/mmgr/mcxt.c | 591 |
| MemoryContextSetParent | src/backend/utils/mmgr/mcxt.c | 643 |
| GetMemoryChunkContext | src/backend/utils/mmgr/mcxt.c | 713 |
| MemoryContextCreate | src/backend/utils/mmgr/mcxt.c | 1106 |
| MemoryContextAlloc | src/backend/utils/mmgr/mcxt.c | 1191 |
| MemoryContextAllocExtended | src/backend/utils/mmgr/mcxt.c | 1248 |
| palloc | src/backend/utils/mmgr/mcxt.c | 1346 |
| palloc0 | src/backend/utils/mmgr/mcxt.c | 1376 |
| pfree | src/backend/utils/mmgr/mcxt.c | 1553 |
| repalloc | src/backend/utils/mmgr/mcxt.c | 1573 |
| MemoryContextSwitchTo (inline) | src/include/utils/palloc.h | 138 |
| struct AllocSetContext | src/backend/utils/mmgr/aset.c | 152 |
| AllocSetContextCreateInternal | src/backend/utils/mmgr/aset.c | 347 |
| AllocSetReset | src/backend/utils/mmgr/aset.c | 537 |
| AllocSetDelete | src/backend/utils/mmgr/aset.c | 607 |
| AllocSetAlloc | src/backend/utils/mmgr/aset.c | 967 |
| AllocSetFree | src/backend/utils/mmgr/aset.c | 1062 |
| struct GenerationContext | src/backend/utils/mmgr/generation.c | 59 |
| GenerationContextCreate | src/backend/utils/mmgr/generation.c | 160 |
| GenerationAlloc | src/backend/utils/mmgr/generation.c | 527 |
| struct SlabContext | src/backend/utils/mmgr/slab.c | 103 |
| SlabContextCreate | src/backend/utils/mmgr/slab.c | 322 |
| SlabAlloc | src/backend/utils/mmgr/slab.c | 631 |
| struct BumpContext | src/backend/utils/mmgr/bump.c | 66 |
| BumpContextCreate | src/backend/utils/mmgr/bump.c | 131 |
| BumpAlloc | src/backend/utils/mmgr/bump.c | 491 |
| ALLOCSET_DEFAULT_SIZES (macro) | src/include/utils/memutils.h | 160 |

Each entry leads with a fact about the current source at commit 273fe94 (REL_18), readable without any other materials. The trailing note records how it was checked. Open questions follow.

  • The allocator type is encoded in the low 4 bits of a uint64 header immediately preceding each chunk, with no padding between header and payload (bump chunks in normal builds are the exception: they carry no header at all). Verified in GetMemoryChunkMethodID and the MCXT_METHOD macro (mcxt.c). The mask is MEMORY_CONTEXT_METHODID_MASK; reserved id values (0000, 0001, 0010, 1111) correspond to unused/glibc/wiped memory so bad pointers fail cleanly rather than dispatching into garbage.

  • There are exactly four concrete allocators on REL_18: AllocSet, Slab, Generation, Bump. Verified against MemoryContextIsValid (memnodes.h, which IsA-checks those four node tags) and the mcxt_methods[] array (mcxt.c). A fifth method id, MCTX_ALIGNED_REDIRECT_ID, exists but is not a standalone context type — it is the redirect used by palloc_aligned (alignedalloc.c), with only free_p/realloc/get_chunk_* populated.

  • Bump chunks have no header in normal builds (Bump_CHUNKHDRSZ == 0). Verified in bump.c. pfree/repalloc/GetMemoryChunkSpace/GetMemoryChunkContext on a bump chunk are unsupported; under MEMORY_CONTEXT_CHECKING a MemoryChunk header is added so those calls raise a clear ERROR instead of corrupting memory.

  • palloc never returns NULL; OOM exits via elog(ERROR) inside the allocator’s alloc method. Verified in palloc (which asserts ret != NULL) and MemoryContextAllocationFailure (mcxt.c). The only NULL-capable path is MCXT_ALLOC_NO_OOM via palloc_extended/MemoryContextAllocExtended.

  • MemoryContextDelete is iterative, not recursive, and delinks each node before freeing it. Verified in MemoryContextDelete (descend-to-leaf loop with an explicit comment that recursion would risk “stack depth limit exceeded” during abort cleanup) and MemoryContextDeleteOnly (MemoryContextSetParent(context, NULL) before delete_context). The README and the in-code comment agree: “better a leak than a crash.”

  • AllocSet small allocations round up to one of 11 power-of-2 size classes (8 B … 8 KB); larger requests go to dedicated blocks. Verified from ALLOC_MINBITS = 3, ALLOCSET_NUM_FREELISTS = 11, ALLOC_CHUNK_LIMIT, and AllocSetAlloc/AllocSetFreeIndex (aset.c). ALLOCSET_DEFAULT_INITSIZE is 8 KB and ALLOCSET_DEFAULT_MAXSIZE is 8 MB (memutils.h); blocks double between those bounds.

  • Reset/delete callbacks fire newest-first, and child callbacks fire before parent callbacks during a subtree teardown. Verified in MemoryContextRegisterResetCallback (pushes onto list head) and MemoryContextCallResetCallbacks (pops from head), consistent with the README’s “reverse order of registration” statement. The callback struct is caller-provided, typically allocated inside the target context.

  • TopTransactionContext is not cleared on error; it survives until the transaction block exits. Stated in the README; the actual abort-time reset/delete is driven from the transaction-management layer, not from mcxt.c. Confirming the precise call site is out of this doc’s scope (see postgres-error-handling.md / xact).
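The newest-first callback ordering recorded in the entries above (push onto the list head, pop from the head) can be sketched as a small self-contained C model. It is an illustration only: ToyCbContext, ToyCallback, toy_register_reset_callback, toy_call_reset_callbacks, and the toy_record logging helper are hypothetical names, not PostgreSQL's API.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*ToyCallbackFn) (void *arg);

/* Caller-provided callback record, linked through `next`. */
typedef struct ToyCallback
{
    ToyCallbackFn func;
    void       *arg;
    struct ToyCallback *next;
} ToyCallback;

/* Hypothetical toy context holding only the callback list. */
typedef struct ToyCbContext
{
    ToyCallback *reset_cbs;     /* most recently registered first */
} ToyCbContext;

/* Register by pushing onto the head, so the newest entry runs first. */
void
toy_register_reset_callback(ToyCbContext *cxt, ToyCallback *cb)
{
    cb->next = cxt->reset_cbs;
    cxt->reset_cbs = cb;
}

/* Pop each entry before calling it, so every callback runs at most once. */
void
toy_call_reset_callbacks(ToyCbContext *cxt)
{
    ToyCallback *cb;

    while ((cb = cxt->reset_cbs) != NULL)
    {
        cxt->reset_cbs = cb->next;
        cb->func(cb->arg);
    }
}

/* Demo helper: record the order in which callbacks fire. */
int         toy_log[8];
int         toy_log_len = 0;

void
toy_record(void *arg)
{
    if (toy_log_len < 8)
        toy_log[toy_log_len++] = *(int *) arg;
}
```

Registering callbacks tagged 1, 2, 3 and then firing them records 3, 2, 1. Unlinking before the call matters for the same reason the tree teardown delinks before freeing: if a callback itself triggers an error, the list no longer contains the entry that was running.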

  1. Bump allocator adoption. Bump landed in PG17 and REL_18 uses it in nodeAgg.c, tuplesort.c, and tidstore.c. Which other historically AllocSet- or Generation-backed scratchpads are candidates to convert, and what is the measured density/CPU win? Investigation path: diff the BumpContextCreate call sites across REL_16→REL_18 and look for aset/generation contexts whose objects are demonstrably never pfree’d.

  2. context_freelists cap of 100. aset.c caches up to MAX_FREE_CONTEXTS = 100 whole deleted default-sized contexts for reuse, and “deletes all existing entries when the list overflows.” Under what workload (many short-lived per-relation contexts?) does this overflow matter, and is 100 ever a bottleneck? Investigation path: instrument context_freelists[].num_free under a high-DDL or high-plan-churn workload.

  3. Memory accounting cost. mem_allocated is updated lazily per-block, so MemoryContextMemAllocated/stats must walk the subtree recursively (README “Memory Accounting”). For a backend with a very wide/deep context forest (thousands of relcache child contexts), how expensive is a pg_log_backend_memory_contexts() dump? Investigation path: time ProcessLogMemoryContextInterrupt under a large relcache.

Beyond PostgreSQL — Comparative Designs & Research Frontiers


Pointers, not analysis. Each bullet is a starting handle for a follow-up doc.

  • CUBRID’s private heap / per-thread allocators. CUBRID does not use a context forest; it mixes a private-heap (db_private_alloc) discipline with explicit frees and per-thread scratch areas. A side-by-side would show what PostgreSQL buys with the tree (O(1) error-path reclamation) versus what it costs (per-chunk header overhead, no compaction). Cross-ref the CUBRID heap / memory analyses when written.

  • Region inference in languages (Tofte & Talpin, “Region-Based Memory Management”, 1997). PostgreSQL assigns regions by hand (which context to switch into); the research line infers region lifetimes statically. The comparison frames PostgreSQL’s manual MemoryContextSwitchTo discipline as the un-inferred baseline and points at where bugs (wrong-context allocation) come from.

  • Slab allocation lineage (Bonwick, “The Slab Allocator”, USENIX 1994). PostgreSQL’s slab.c is the same idea (object caches of fixed-size items to fight fragmentation) as the Solaris kernel allocator; the fullest-block-first refinement is PostgreSQL’s. A note tracing the lineage would anchor the slab design choices in their original rationale.

  • Arena allocators in other systems (jemalloc/tcmalloc arenas, Rust’s bumpalo, V8 zones). Bump’s header-less write-once arena is the same pattern as compiler/VM “zones.” Worth a comparison of when a header-less arena beats a freelist allocator, with the density/cache numbers PG17’s bump commit cited.

  • Memory accounting and per-query memory limits. PostgreSQL’s lazy per-block mem_allocated accounting (and the lack, on REL_18, of a hard per-backend memory cap) is a recurring proposal area. Tie this to the work_mem spill mechanisms in postgres-tuplesort.md / hash-agg and to the ongoing community discussion of a backend memory limit.

  • src/backend/utils/mmgr/README — “Memory Context System Design Overview”: the design rationale (current context, parent/child tree, globally-known contexts, reset callbacks, the four allocators, memory accounting). The primary design document for this subsystem.

Source files (under /data/hgryoo/references/postgres/, REL_18 273fe94)

  • src/backend/utils/mmgr/mcxt.c — abstract layer, dispatch, lifecycle, palloc/pfree.
  • src/backend/utils/mmgr/aset.c — AllocSet (default allocator).
  • src/backend/utils/mmgr/generation.c — Generation allocator.
  • src/backend/utils/mmgr/slab.c — Slab allocator.
  • src/backend/utils/mmgr/bump.c — Bump allocator.
  • src/include/nodes/memnodes.h — MemoryContextData, MemoryContextMethods.
  • src/include/utils/palloc.h — palloc/pfree API, MemoryContextSwitchTo.
  • src/include/utils/memutils.h — creation macros, ALLOCSET_*_SIZES.
  • src/include/utils/memutils_internal.h — MemoryContextMethodID.
  • Database Internals (Petrov), Part I implementation chapters — memory vs. disk-based stores, allocation/fragmentation trade-offs; the region/arena framing in §“Theoretical Background”.
  • Region-based memory management background: Tofte & Talpin 1997 (regions), Bonwick 1994 (slab) — see §“Beyond PostgreSQL” for the specific anchors.

Cross-references (sibling docs, mechanism not duplicated here)

  • postgres-error-handling.md — the PG_TRY/sigsetjmp/elog machinery that triggers context deletion on the abort path.
  • postgres-resource-owners.md — release of non-memory resources (buffer pins, locks, file descriptors); memory contexts touch this only via reset callbacks.
  • postgres-architecture-overview.md — Axis 11 base-infra positioning; “a backend is mostly a thread of control with a private memory-context tree.”