PostgreSQL ResourceOwner — Hierarchical Resource Tracking and Error-Safe Release
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-06)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Every database backend that executes a SQL statement acquires a swarm of
short-lived, query-lifespan resources: it pins buffers so the buffer
manager won’t evict pages out from under a scan, it takes lock-manager locks
on the relations and tuples it touches, it bumps reference counts on cached
relation descriptors and catalog-cache entries, it opens transient files for
sorts and hash spills, it registers snapshots, it allocates JIT-compiled
expression modules and DSM segments for parallel workers. Each of these is a
claim on a finite shared pool that must be returned, and the central
correctness problem is not the happy path — where the executor tidily
releases each resource as it finishes — but the error path. A SQL query
can fail at essentially any C statement: a palloc hits the memory ceiling,
a datatype input function rejects a literal, a unique-index insertion raises
a constraint violation, a SIGINT arrives mid-scan. When that happens, the
fifty nested C function frames that were holding pins and references are torn
down by a longjmp, and none of their carefully-placed “release” calls run.
If the resources those frames held were tracked only in the frames’ own local
variables, they would leak — a leaked buffer pin pins a page forever, a
leaked lock blocks every other backend, a leaked relcache reference corrupts
invalidation.
The conceptual answer is the same one that systems languages reach for under
the name RAII (Resource Acquisition Is Initialization): bind every
resource’s lifetime to the lifetime of some owner object, and make tearing
down the owner automatically release everything it owns. C++ does this with
stack-allocated objects whose destructors run during stack unwinding; Rust
does it with Drop. C has neither destructors nor unwinding, so PostgreSQL
builds the equivalent by hand. It interposes, between the resource and the C
stack frame that acquired it, a heap-allocated bookkeeping object — the
ResourceOwner — that records “this owner now holds this pin / lock /
reference.” The owner is not tied to a C stack frame; it is tied to a
transaction, subtransaction, or portal, which are precisely the scopes at
whose boundaries PostgreSQL wants resources reclaimed. When a transaction
ends — commit or abort — its ResourceOwner is walked and every still-held
resource is released by a kind-specific callback. The error path then becomes
trivial: a longjmp lands in the abort handler, which releases the top
transaction’s ResourceOwner, which frees everything every failed frame
forgot.
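The idea can be sketched in a few lines of plain C. This is a toy under stated assumptions, not PostgreSQL code: `ToyOwner`, `toy_pin`, `toy_owner_release` and the other `toy_` names are hypothetical, a "pin" is just a counter, and a bare `setjmp`/`longjmp` pair stands in for the error machinery.

```c
#include <assert.h>
#include <setjmp.h>

#define TOY_MAX_RESOURCES 16

typedef void (*toy_release_fn)(int *resource);

/* Hypothetical owner: remembers each resource plus how to release it. */
typedef struct ToyOwner
{
    int            *items[TOY_MAX_RESOURCES];
    toy_release_fn  release[TOY_MAX_RESOURCES];
    int             nitems;
} ToyOwner;

static jmp_buf  toy_error_stack;   /* plays the role of the top-level handler */
static ToyOwner toy_xact_owner;    /* lives outside any C stack frame */
static int      toy_pins_held;     /* the shared pool being claimed */

static void toy_unpin(int *page)
{
    (void) page;
    toy_pins_held--;
}

/* Acquire a pin and record it against the owner, not the calling frame. */
static void toy_pin(ToyOwner *owner, int *page)
{
    assert(owner->nitems < TOY_MAX_RESOURCES);
    toy_pins_held++;
    owner->items[owner->nitems] = page;
    owner->release[owner->nitems] = toy_unpin;
    owner->nitems++;
}

/* Scope exit: release whatever is still registered, newest first. */
static void toy_owner_release(ToyOwner *owner)
{
    while (owner->nitems > 0)
    {
        owner->nitems--;
        owner->release[owner->nitems](owner->items[owner->nitems]);
    }
}

/* A "query" that pins two pages and then fails before unpinning them. */
static void toy_failing_query(ToyOwner *owner)
{
    static int page_a, page_b;

    toy_pin(owner, &page_a);
    toy_pin(owner, &page_b);
    longjmp(toy_error_stack, 1);   /* this frame's own unpin calls never run */
}

/* Returns the number of pins still held once the abort path has run. */
int toy_pins_left_after_failed_query(void)
{
    toy_xact_owner.nitems = 0;
    toy_pins_held = 0;

    if (setjmp(toy_error_stack) == 0)
        toy_failing_query(&toy_xact_owner);
    else
        toy_owner_release(&toy_xact_owner);   /* abort handler sweeps the owner */
    return toy_pins_held;
}
```

Calling `toy_pins_left_after_failed_query()` from a `main` returns 0: both pins are released even though the frame that took them was discarded. The owner is a file-scope object on purpose, so its contents stay well defined across the `longjmp`.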
There is a second, subtler theoretical requirement: ordering. Resources
are not independent. A pinned buffer is visible to other backends through the
shared buffer descriptor’s refcount; a held lock is something other backends
are blocked waiting on. If a committing transaction released its locks
before its buffer pins, another backend could acquire the freed lock, look at
the relation, and find pages still pinned (and thus possibly mid-modification)
by the “finished” transaction — a backend that, from the lock’s perspective,
has already left. So the release must be phased: everything visible to
other backends (pins) goes before the locks; backend-internal cleanup
(catalog caches, transient files) goes after. This is the same
ordering discipline ARIES (Mohan et al. 1992, captured in
knowledge/research/dbms-papers/aries.md) imposes on commit/abort: locks are
the last thing released to other transactions, and only after the
transaction’s externally-visible state is consistent. The ResourceOwner’s
three-phase release is the operational embodiment of that rule inside a single
backend.
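A minimal sketch of the phasing rule, assuming three hypothetical resources and a hypothetical `toy_release_phase` helper: whatever order the resources were acquired in, the caller drives release one phase at a time, so the pin always goes before the lock and the private temp file after it.

```c
#include <assert.h>
#include <string.h>

typedef enum ToyPhase
{
    TOY_BEFORE_LOCKS = 1,   /* state other backends can see, e.g. pins */
    TOY_LOCKS,              /* the locks themselves */
    TOY_AFTER_LOCKS         /* backend-private cleanup, e.g. temp files */
} ToyPhase;

typedef struct ToyResource
{
    const char *name;
    ToyPhase    phase;
    int         held;
} ToyResource;

/* Release every held resource of one phase, appending names to a trace. */
static void toy_release_phase(ToyResource *res, int n, ToyPhase phase,
                              char *trace, size_t tracesz)
{
    for (int i = 0; i < n; i++)
    {
        if (!res[i].held || res[i].phase != phase)
            continue;
        res[i].held = 0;
        if (trace[0] != '\0')
            strncat(trace, ",", tracesz - strlen(trace) - 1);
        strncat(trace, res[i].name, tracesz - strlen(trace) - 1);
    }
}

/* Acquisition order is deliberately scrambled; release order is not. */
const char *toy_phased_release_trace(void)
{
    static char trace[64];
    ToyResource res[] = {
        { "tempfile", TOY_AFTER_LOCKS, 1 },
        { "lock", TOY_LOCKS, 1 },
        { "pin", TOY_BEFORE_LOCKS, 1 },
    };
    int n = (int) (sizeof(res) / sizeof(res[0]));

    trace[0] = '\0';
    toy_release_phase(res, n, TOY_BEFORE_LOCKS, trace, sizeof(trace));
    toy_release_phase(res, n, TOY_LOCKS, trace, sizeof(trace));
    toy_release_phase(res, n, TOY_AFTER_LOCKS, trace, sizeof(trace));
    return trace;
}
```

The returned trace is `pin,lock,tempfile`, the reverse of the order in which the array lists the resources.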
The standard texts treat this only obliquely. Database System Concepts
(Silberschatz 7e, ch. 17–18) discusses transactions as the unit of
atomicity and recovery, and Database Internals (Petrov 2019, ch. 5–6)
discusses buffer management and lock lifetimes, but neither names the
in-backend bookkeeping object that ties resource lifetime to transaction
scope. That object is an implementation invention, and PostgreSQL’s
resowner/README is explicit that its design was modeled on the
MemoryContext API — another PostgreSQL invention that solved the analogous
problem for heap allocations. The two are deliberately separate (memory leaks
and resource leaks have different usage patterns) but share the same shape: a
tree of scopes, a “current” pointer, and a recursive bulk-free at scope exit.
Common DBMS Design
Every serious DBMS confronts the leak-on-error problem; they differ in what language mechanism they lean on and how explicit the bookkeeping is.
Garbage-collected / managed engines (Java H2, Derby; C# engines) get
memory reclamation for free but still need explicit release for non-memory
resources (locks, file handles, latches). They typically use try/finally
blocks or language using/try-with-resources constructs, which run cleanup
code during exception propagation. This is RAII-by-language-runtime: the
unwinding machinery is built in, and each frame’s finally releases what that
frame holds. The cost is that the discipline is distributed — every frame
must remember its own cleanup — and a missing finally silently leaks.
C++ engines (much of the storage layer of systems written in modern C++)
use RAII directly: a BufferGuard or LockGuard is a stack object whose
destructor unpins/unlocks, and C++ stack unwinding during exception
propagation runs the destructors automatically. This is elegant and local —
the resource and its release live in one type — but it ties resource lifetime
to C stack scope, which is not always the scope you want. A buffer that must
stay pinned across several function returns but be released at transaction end
cannot live in a stack guard; you need a heap-resident owner anyway.
C engines — PostgreSQL, SQLite, and the many engines descended from
Berkeley/Ingres lineages — have no destructors and (in PostgreSQL’s case) no
true exceptions, only setjmp/longjmp. They therefore cannot rely on the
compiler to run cleanup during unwinding, and must maintain an explicit,
centralized registry of outstanding resources that a single error handler
walks. SQLite, being single-threaded-per-connection and arena-ish, largely
sidesteps this by scoping almost everything to the sqlite3 connection
object and its prepared statements. PostgreSQL, being a heavily concurrent,
multi-resource engine, needs something richer: a forest of owners mirroring
the transaction/subtransaction/portal nesting, with a “current owner” global
that acquisition primitives consult automatically so call sites don’t have to
pass an owner around.
Three design axes recur across these implementations:
- Granularity of the owner. Tie it to the C frame (C++ guards), to the statement, or to the transaction/savepoint? PostgreSQL chooses the transaction/subtransaction/portal, because those are the recovery boundaries and the points where it actually wants bulk reclamation.
- Phasing of release. Engines that hold locks until end-of-transaction (two-phase locking, which is essentially all of them) must release backend-visible state before locks. Most encode this as an ordered list of cleanup steps in the commit/abort routine; PostgreSQL encodes it declaratively as a per-resource-kind phase + priority and lets the ResourceOwner sort.
- The lock special case. Locks held by a subtransaction or portal that commits must not be released: they belong to the enclosing transaction now. So the “release” of a child owner is, for locks, really a transfer to the parent. Engines with savepoints all need this; PostgreSQL builds it into the ResourceOwner’s lock handling directly.
The unifying idea is that resource cleanup is too important to leave to the
correctness of fifty hand-written release calls scattered through the
executor. Localize the tracking into one module, make acquisition implicitly
register with the current owner, and make scope-exit a single recursive
sweep. PostgreSQL’s resowner.c is one of the cleaner realizations of that
idea in any open-source engine.
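The three ingredients named above (a tree of scopes, implicit registration with a current owner, one recursive sweep at scope exit) fit in a short sketch. All names here (`ToyScope`, `toy_current_scope`, `toy_acquire`, `toy_scope_release`) are hypothetical, and resources are reduced to a per-scope counter.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyScope ToyScope;
struct ToyScope
{
    ToyScope *parent;
    ToyScope *firstchild;   /* head of the child list */
    ToyScope *nextchild;    /* next sibling under the same parent */
    int       nheld;        /* resources registered against this scope */
};

static ToyScope *toy_current_scope = NULL;  /* the "current owner" global */
static int       toy_total_held = 0;

/* Link a new scope under its parent by pushing onto the child list. */
static void toy_scope_init(ToyScope *scope, ToyScope *parent)
{
    scope->parent = parent;
    scope->firstchild = NULL;
    scope->nextchild = NULL;
    scope->nheld = 0;
    if (parent != NULL)
    {
        scope->nextchild = parent->firstchild;
        parent->firstchild = scope;
    }
}

/* Call sites never name an owner; they register with the current one. */
static void toy_acquire(void)
{
    assert(toy_current_scope != NULL);
    toy_current_scope->nheld++;
    toy_total_held++;
}

/* Scope exit: one recursive sweep, children before the scope itself. */
static void toy_scope_release(ToyScope *scope)
{
    for (ToyScope *c = scope->firstchild; c != NULL; c = c->nextchild)
        toy_scope_release(c);
    toy_total_held -= scope->nheld;
    scope->nheld = 0;
}

/* Reports resources still held after the subxact, then the xact, is swept. */
void toy_nested_scopes_demo(int *after_subxact, int *after_xact)
{
    ToyScope xact, subxact, portal;

    toy_total_held = 0;
    toy_scope_init(&xact, NULL);
    toy_scope_init(&subxact, &xact);
    toy_scope_init(&portal, &subxact);

    toy_current_scope = &xact;
    toy_acquire();
    toy_current_scope = &subxact;
    toy_acquire();
    toy_acquire();
    toy_current_scope = &portal;
    toy_acquire();

    toy_scope_release(&subxact);        /* sweeps subxact and its portal */
    *after_subxact = toy_total_held;

    toy_scope_release(&xact);           /* sweeps everything that is left */
    *after_xact = toy_total_held;
    toy_current_scope = NULL;
}
```

After the subtransaction is swept only the outer transaction's single resource remains; after the outer sweep nothing does.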
PostgreSQL’s Approach
PostgreSQL’s ResourceOwner is a heap object in TopMemoryContext (so it
outlives any transaction’s memory context and is freed only explicitly),
holding a parent pointer and a child list — a forest. Four global owners
anchor the forest:

    // resowner.c: the four globally-known owners (GLOBAL MEMORY section)
    ResourceOwner CurrentResourceOwner = NULL;
    ResourceOwner CurTransactionResourceOwner = NULL;
    ResourceOwner TopTransactionResourceOwner = NULL;
    ResourceOwner AuxProcessResourceOwner = NULL;

CurrentResourceOwner is the one that matters at acquisition time: when
ReadBuffer pins a page or LockAcquire takes a lock, it records the
resource against whatever CurrentResourceOwner points at right now. The
README is emphatic that this is NULL outside any transaction (and inside a
failed transaction), and that acquiring a query-lifespan resource then is
illegal. CurTransactionResourceOwner is the current (sub)transaction’s
owner; TopTransactionResourceOwner is the outermost transaction’s owner (the
root of the per-transaction subtree); AuxProcessResourceOwner serves
non-backend auxiliary processes (checkpointer, walwriter) that have no real
transactions but still pin buffers.
The shape of the forest follows the nesting exactly. A subtransaction’s owner is created as a child of its parent’s owner:

    // xact.c: StartSubTransaction creates a child owner
    s->curTransactionOwner =
        ResourceOwnerCreate(s->parent->curTransactionOwner, "SubTransaction");
    CurTransactionResourceOwner = s->curTransactionOwner;
    CurrentResourceOwner = s->curTransactionOwner;

A portal’s owner is created as a child of the current transaction’s owner, so that when the portal closes, any locks it still holds become the transaction’s responsibility:

    // portalmem.c: CreatePortal hangs the portal owner under the transaction
    portal->resowner = ResourceOwnerCreate(CurTransactionResourceOwner, "Portal");

Each tracked resource kind registers a ResourceOwnerDesc that declares
when and in what order it is released, plus the callbacks to release and to
debug-print it. Buffer pins, for example, are a BEFORE_LOCKS resource (they
are visible to other backends) at the buffer-pin priority:

    // bufmgr.c: the buffer-pin resource kind descriptor
    const ResourceOwnerDesc buffer_pin_resowner_desc =
    {
        .name = "buffer pin",
        .release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
        .release_priority = RELEASE_PRIO_BUFFER_PINS,
        .ReleaseResource = ResOwnerReleaseBufferPin,
        .DebugPrint = ResOwnerPrintBufferPin
    };

The three phases and the built-in priorities are fixed in the header. The
phase enum encodes the ordering rule directly — pins and other
externally-visible resources are BEFORE_LOCKS, locks are their own phase,
and backend-internal caches (catcache, plancache, tupdesc, snapshots, files,
wait-event sets) are AFTER_LOCKS:

    // resowner.h: the three release phases and selected built-in priorities
    typedef enum
    {
        RESOURCE_RELEASE_BEFORE_LOCKS = 1,
        RESOURCE_RELEASE_LOCKS,
        RESOURCE_RELEASE_AFTER_LOCKS,
    } ResourceReleasePhase;

    /* priorities of built-in BEFORE_LOCKS resources */
    #define RELEASE_PRIO_BUFFER_IOS     100
    #define RELEASE_PRIO_BUFFER_PINS    200
    #define RELEASE_PRIO_RELCACHE_REFS  300

    /* priorities of built-in AFTER_LOCKS resources */
    #define RELEASE_PRIO_CATCACHE_REFS  100
    #define RELEASE_PRIO_SNAPSHOT_REFS  500
    #define RELEASE_PRIO_FILES          600

The release itself is driven from xact.c, which calls ResourceOwnerRelease
on TopTransactionResourceOwner once per phase — the phasing is the caller’s
responsibility, with engine cleanup interleaved between phases (catalog
invalidation, relcache cleanup) at exactly the right moment relative to lock
release. The error-unwind path is the same three calls with isCommit=false.
This is the crux of the PG_TRY integration: a longjmp from anywhere lands in
AbortTransaction, which runs these three calls and thereby frees every
resource the aborted C frames left behind.
The whole design intentionally parallels MemoryContexts: a tree of scopes, a
Current* pointer that acquisition primitives consult, and a recursive bulk
sweep at scope exit. The difference is that ResourceOwners track typed
external claims (each with a release callback), whereas MemoryContexts track
raw allocations — which is why the README resists unifying them.
flowchart TD
TT["TopTransactionResourceOwner<br/>(TopTransaction)"]
ST["SubTransaction owner<br/>(child of parent xact)"]
P1["Portal owner<br/>(child of CurTransactionResourceOwner)"]
P2["Portal owner #2"]
AUX["AuxProcessResourceOwner<br/>(separate root: checkpointer, walwriter)"]
TT --> ST
TT --> P1
TT --> P2
ST --> STP["nested Portal under subxact"]
CUR["CurrentResourceOwner<br/>(global: who owns NEW acquisitions)"]
CUR -.points at active scope.-> P1
subgraph claims["resources remembered against an owner"]
B["buffer pins (BEFORE_LOCKS)"]
L["lmgr locks (LOCKS phase, lossy cache)"]
R["relcache / catcache refs"]
F["transient files, snapshots (AFTER_LOCKS)"]
A["AIO handles (dlist, critical-section safe)"]
end
P1 --> claims
Source Walkthrough
The owner object and its storage
ResourceOwnerData is the heart of the module. Note the three storage
regions: a fixed 32-slot array for the most-recent resources, a hash table it
spills into, and a separate 15-entry lock cache. The releasing/sorted
flags lock the owner against further Remember/Forget once release starts.

    // resowner.c: ResourceOwnerData (abridged to the load-bearing fields)
    struct ResourceOwnerData
    {
        ResourceOwner parent;       /* NULL if no parent (toplevel owner) */
        ResourceOwner firstchild;   /* head of linked list of children */
        ResourceOwner nextchild;    /* next child of same parent */
        const char *name;           /* name (just for debugging) */

        bool        releasing;      /* release has started; no more Remember */
        bool        sorted;         /* are 'hash' and 'arr' sorted by priority? */

        uint8       nlocks;         /* number of owned locks */
        uint8       narr;           /* how many items are stored in the array */
        uint32      nhash;          /* how many items are stored in the hash */

        ResourceElem arr[RESOWNER_ARRAY_SIZE];  /* recent resources (size 32) */

        ResourceElem *hash;         /* open-addressing spill table */
        uint32      capacity;       /* allocated length of hash[] */
        uint32      grow_at;        /* grow hash when reach this */

        LOCALLOCK  *locks[MAX_RESOWNER_LOCKS];  /* lossy lock cache (size 15) */
        dlist_head  aio_handles;    /* AIO handles, registered in crit sections */
    };

A ResourceElem is just { Datum item; const ResourceOwnerDesc *kind; }. The
design comment at the top of the file explains the array/hash split: the
common case is remember-then-forget-shortly-after (pin a buffer, read a
tuple, unpin), which a linear scan of a small array serves cheaply; long-lived
or numerous resources spill into the hash. Creation just zeroes the struct in
TopMemoryContext and links it under its parent:

    // resowner.c: ResourceOwnerCreate
    ResourceOwner
    ResourceOwnerCreate(ResourceOwner parent, const char *name)
    {
        ResourceOwner owner;

        owner = (ResourceOwner) MemoryContextAllocZero(TopMemoryContext,
                                                       sizeof(struct ResourceOwnerData));
        owner->name = name;

        if (parent)
        {
            owner->parent = parent;
            owner->nextchild = parent->firstchild;
            parent->firstchild = owner;
        }
        dlist_init(&owner->aio_handles);
        return owner;
    }

Enlarge-then-remember: the reserve-before-acquire contract
Remembering a resource is split into two calls deliberately. ResourceOwnerEnlarge
guarantees room before the resource is acquired, because if making room
fails (out of memory growing the hash), it must fail before you have an
untracked pin in hand:

    // resowner.c: ResourceOwnerEnlarge (hash-growth path abridged)
    void
    ResourceOwnerEnlarge(ResourceOwner owner)
    {
        if (owner->releasing)
            elog(ERROR, "ResourceOwnerEnlarge called after release started");

        if (owner->narr < RESOWNER_ARRAY_SIZE)
            return;                 /* no work needed: array has room */

        /* array full: ensure the hash has space, growing (doubling) if needed */
        if (owner->narr + owner->nhash >= owner->grow_at)
        {
            uint32      newcap = (owner->capacity > 0) ? owner->capacity * 2 : RESOWNER_HASH_INIT_SIZE;
            ResourceElem *newhash = MemoryContextAllocZero(TopMemoryContext,
                                                           newcap * sizeof(ResourceElem));

            /* ... after this point we assume no failure, so scribble on owner ... */
            owner->hash = newhash;
            owner->capacity = newcap;
            owner->grow_at = RESOWNER_HASH_MAX_ITEMS(newcap);
            /* re-hash old entries, pfree old table */
        }

        /* Drain the 32-slot array into the hash so the array is free again */
        for (int i = 0; i < owner->narr; i++)
            ResourceOwnerAddToHash(owner, owner->arr[i].item, owner->arr[i].kind);
        owner->narr = 0;
    }

ResourceOwnerRemember then just appends into the (now guaranteed non-full)
array. It raises an error if the array is full, which means the caller skipped
the reservation, and it asserts that release has not started:

    // resowner.c: ResourceOwnerRemember appends to the fast array
    void
    ResourceOwnerRemember(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind)
    {
        Assert(kind->release_phase != 0);
        Assert(kind->release_priority != 0);
        Assert(!owner->releasing);
        Assert(!owner->sorted);

        if (owner->narr >= RESOWNER_ARRAY_SIZE)
            elog(ERROR, "ResourceOwnerRemember called but array was full");

        owner->arr[owner->narr].item = value;
        owner->arr[owner->narr].kind = kind;
        owner->narr++;
    }

The bufmgr pin path shows the contract in practice: ResourceOwnerEnlarge(CurrentResourceOwner)
is called up front (alongside ReservePrivateRefCountEntry), and the actual
ResourceOwnerRemember of the pin happens later once the pin is secured — the
README’s warning “make sure there are no unrelated ResourceOwnerRemember
calls between your Enlarge and the Remember you reserved for” is exactly
about preserving the one reserved slot.
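The reserve-before-acquire contract can be illustrated outside PostgreSQL with a small sketch. Everything here is an assumption made for illustration: the `ToyOwner` type, the `toy_` functions, and the `toy_simulate_oom` switch are hypothetical, and failure is reported by a `false` return rather than by `elog(ERROR)`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct ToyOwner
{
    int *items;
    int  nitems;
    int  capacity;
} ToyOwner;

static int  toy_pins_held = 0;
static bool toy_simulate_oom = false;   /* hypothetical failure switch */

/* Step 1: make room. This is the only step that is allowed to fail. */
static bool toy_owner_enlarge(ToyOwner *owner)
{
    if (owner->nitems < owner->capacity)
        return true;
    if (toy_simulate_oom)
        return false;

    int  newcap = owner->capacity > 0 ? owner->capacity * 2 : 4;
    int *newitems = realloc(owner->items, (size_t) newcap * sizeof(int));

    if (newitems == NULL)
        return false;
    owner->items = newitems;
    owner->capacity = newcap;
    return true;
}

/* Step 3: record the resource. Cannot fail because room was reserved. */
static void toy_owner_remember(ToyOwner *owner, int page)
{
    assert(owner->nitems < owner->capacity);
    owner->items[owner->nitems++] = page;
}

/* Reserve, then acquire, then remember. Returns false if nothing was pinned. */
static bool toy_pin_page(ToyOwner *owner, int page)
{
    if (!toy_owner_enlarge(owner))
        return false;           /* failed before any pin exists */
    toy_pins_held++;            /* step 2: the actual acquisition */
    toy_owner_remember(owner, page);
    return true;
}

/* Returns pins held minus pins tracked; 0 means no pin is untracked. */
int toy_untracked_pins_after_oom(void)
{
    ToyOwner owner = { NULL, 0, 0 };

    toy_pins_held = 0;
    toy_simulate_oom = false;
    for (int page = 0; page < 4; page++)
        (void) toy_pin_page(&owner, page);      /* fills the first four slots */

    toy_simulate_oom = true;
    (void) toy_pin_page(&owner, 4);             /* enlarge fails: no pin taken */

    int untracked = toy_pins_held - owner.nitems;

    free(owner.items);
    return untracked;
}
```

Had the sketch pinned first and reserved second, the simulated out-of-memory failure would leave one pin that no owner knows about; with the order shown, the difference is always zero.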
Forget: array-first, then hash
ResourceOwnerForget searches the array back-to-front (the most-recent slot
is the most-likely target), and on a hit swaps the last element down — an
O(1) unordered removal. Only if the array misses does it probe the hash. The
back-to-front scan plus swap-with-last is what makes the
forget-the-just-remembered case fall out for free, which several callers rely
on:

    // resowner.c: ResourceOwnerForget (array scan; hash probe abridged)
    void
    ResourceOwnerForget(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind)
    {
        if (owner->releasing)
            elog(ERROR, "ResourceOwnerForget called for %s after release started", kind->name);
        Assert(!owner->sorted);

        /* Search the array first, newest-first */
        for (int i = owner->narr - 1; i >= 0; i--)
        {
            if (owner->arr[i].item == value && owner->arr[i].kind == kind)
            {
                owner->arr[i] = owner->arr[owner->narr - 1];   /* swap last down */
                owner->narr--;
                return;
            }
        }
        /* else probe the open-addressing hash, NULL out the slot, nhash-- */
        /* ... */
        elog(ERROR, "%s %p is not owned by resource owner %s",
             kind->name, DatumGetPointer(value), owner->name);
    }

Three-phase release and the recursion
ResourceOwnerRelease is a thin wrapper over ResourceOwnerReleaseInternal,
which is where the phasing, recursion, and lock special-casing live. Two
structural facts dominate it: (1) it recurses into children first so that a
portal/subxact is fully released before its parent within each phase; (2) on
the first call it sets releasing and sorts the resources by phase+priority,
after which no more Remember/Forget is allowed.

    // resowner.c: ResourceOwnerReleaseInternal (children-first recursion + sort)
    static void
    ResourceOwnerReleaseInternal(ResourceOwner owner,
                                 ResourceReleasePhase phase,
                                 bool isCommit,
                                 bool isTopLevel)
    {
        ResourceOwner child;
        ResourceOwner save;

        /* Recurse to handle descendants before self */
        for (child = owner->firstchild; child != NULL; child = child->nextchild)
            ResourceOwnerReleaseInternal(child, phase, isCommit, isTopLevel);

        if (!owner->releasing)
        {
            Assert(phase == RESOURCE_RELEASE_BEFORE_LOCKS);
            owner->releasing = true;
        }
        if (!owner->sorted)
        {
            ResourceOwnerSort(owner);   /* sort by reverse phase+priority */
            owner->sorted = true;
        }

        /* Make the release callbacks see the owner being released as current */
        save = CurrentResourceOwner;
        CurrentResourceOwner = owner;
        /* ... per-phase work below ... */
        CurrentResourceOwner = save;
    }

The per-phase body is where the ordering rule is enforced. BEFORE_LOCKS
releases the externally-visible resources (and drains AIO handles);
AFTER_LOCKS releases backend-internal ones — both via the sorted
ResourceOwnerReleaseAll. The LOCKS phase is special:

    // resowner.c: the LOCKS phase, bulk for top xact, transfer-or-release for children
    if (phase == RESOURCE_RELEASE_LOCKS)
    {
        if (isTopLevel)
        {
            /* top xact: drop ALL locks in one lmgr call at the top of recursion */
            if (owner == TopTransactionResourceOwner)
            {
                ProcReleaseLocks(isCommit);
                ReleasePredicateLocks(isCommit, false);
            }
        }
        else
        {
            /* subxact/portal: hand this owner's locks to the lock manager */
            LOCALLOCK **locks;
            int         nlocks;

            if (owner->nlocks > MAX_RESOWNER_LOCKS)     /* cache overflowed */
            {
                locks = NULL;                           /* lmgr scans its own table */
                nlocks = 0;
            }
            else
            {
                locks = owner->locks;
                nlocks = owner->nlocks;
            }

            if (isCommit)
                LockReassignCurrentOwner(locks, nlocks);    /* transfer to parent */
            else
                LockReleaseCurrentOwner(locks, nlocks);     /* truly release */
        }
    }

This is the lock special case the README and theory both demand: on commit
of a subtransaction or portal, locks are reassigned to the parent (they
must outlive the child up to end-of-transaction); on abort, they are
genuinely released. The 15-entry owner->locks cache is a lossy fast path:
if a child held ≤15 locks it can be reassigned/released directly from the
cache, but if it overflowed, the code passes NULL and the lock manager
falls back to scanning its own local-lock hash table.
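The transfer-or-release rule and the lossy cache can be modeled in a toy, assuming a tiny array as the authoritative lock table and a 3-slot cache instead of 15. `ToyLockOwner`, `toy_lock_holder`, `toy_end_child` and the other `toy_` names are hypothetical; the real lock manager interfaces are richer than this.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_CACHED_LOCKS 3      /* PostgreSQL's cache holds 15 */
#define TOY_LOCK_TABLE_SIZE  8

typedef struct ToyLockOwner
{
    int id;
    int nlocks;                          /* MAX + 1 means "overflowed" */
    int locks[TOY_MAX_CACHED_LOCKS];     /* lock-table slots, if they fit */
} ToyLockOwner;

/* The authoritative table: which owner id, if any, holds each lock. */
static int toy_lock_holder[TOY_LOCK_TABLE_SIZE];   /* 0 = not held */

static void toy_remember_lock(ToyLockOwner *owner, int slot)
{
    if (owner->nlocks > TOY_MAX_CACHED_LOCKS)
        return;                          /* already overflowed */
    if (owner->nlocks < TOY_MAX_CACHED_LOCKS)
        owner->locks[owner->nlocks] = slot;
    owner->nlocks++;                     /* may become the sentinel */
}

static void toy_lock_acquire(ToyLockOwner *owner, int slot)
{
    assert(slot >= 0 && slot < TOY_LOCK_TABLE_SIZE);
    toy_lock_holder[slot] = owner->id;
    toy_remember_lock(owner, slot);
}

/* Commit hands one lock to the parent; abort drops it. */
static void toy_settle_lock(int slot, ToyLockOwner *parent, bool is_commit)
{
    if (is_commit)
    {
        toy_lock_holder[slot] = parent->id;
        toy_remember_lock(parent, slot);
    }
    else
        toy_lock_holder[slot] = 0;
}

/* Ends the child scope; returns how many slots had to be inspected. */
static int toy_end_child(ToyLockOwner *child, ToyLockOwner *parent,
                         bool is_commit)
{
    int inspected = 0;

    if (child->nlocks <= TOY_MAX_CACHED_LOCKS)
    {
        for (int i = 0; i < child->nlocks; i++, inspected++)
            toy_settle_lock(child->locks[i], parent, is_commit);
    }
    else
    {
        /* cache is lossy: fall back to scanning the whole table */
        for (int slot = 0; slot < TOY_LOCK_TABLE_SIZE; slot++, inspected++)
            if (toy_lock_holder[slot] == child->id)
                toy_settle_lock(slot, parent, is_commit);
    }
    child->nlocks = 0;
    return inspected;
}

/* Child takes `nchild` locks, then commits or aborts; returns parent's count. */
int toy_locks_on_parent_after_child(int nchild, bool is_commit, int *inspected)
{
    ToyLockOwner parent = { .id = 1 };
    ToyLockOwner child = { .id = 2 };
    int held = 0;

    for (int slot = 0; slot < TOY_LOCK_TABLE_SIZE; slot++)
        toy_lock_holder[slot] = 0;
    for (int slot = 0; slot < nchild; slot++)
        toy_lock_acquire(&child, slot);

    *inspected = toy_end_child(&child, &parent, is_commit);
    for (int slot = 0; slot < TOY_LOCK_TABLE_SIZE; slot++)
        if (toy_lock_holder[slot] == parent.id)
            held++;
    return held;
}
```

With two child locks the cache is exact and only two slots are touched; with five the cache has overflowed, so the whole table is scanned, yet the parent still ends up with all five on commit. On abort the parent ends up with none.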
Sorting in reverse priority
ResourceOwnerSort consolidates the array and hash into one contiguous run
and qsorts it by resource_priority_cmp, which orders by phase then
priority in reverse, so that ResourceOwnerReleaseAll can release from the
tail and stop as soon as it crosses into the next phase:

    // resowner.c: resource_priority_cmp orders in reverse so release walks from the end
    static int
    resource_priority_cmp(const void *a, const void *b)
    {
        const ResourceElem *ra = a;
        const ResourceElem *rb = b;

        /* Note: reverse order */
        if (ra->kind->release_phase == rb->kind->release_phase)
            return pg_cmp_u32(rb->kind->release_priority, ra->kind->release_priority);
        else if (ra->kind->release_phase > rb->kind->release_phase)
            return -1;
        else
            return 1;
    }

ResourceOwnerReleaseAll then walks from nitems-1 downward, invoking each
kind’s ReleaseResource(value) callback, and — when printLeakWarnings is
set (i.e. on commit, where the executor should have released everything
itself) — emits WARNING: resource was not closed: ... using the kind’s
DebugPrint. On abort, leaks are expected and silent; that asymmetry is the
README’s “at commit the owner should be empty; at abort we truly rely on this
mechanism.”
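The sort-in-reverse, pop-from-the-tail pattern can be tried in isolation. The sketch below uses the standard `qsort` with a comparator of the same shape; the `ToyKind` and `ToyElem` types, the four sample kinds, and the trace string are hypothetical stand-ins, and the LOCKS phase is left out because locks do not live in this array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct ToyKind
{
    const char *name;
    unsigned    phase;       /* 1 = before locks, 3 = after locks */
    unsigned    priority;    /* lower releases first within a phase */
} ToyKind;

typedef struct ToyElem
{
    const ToyKind *kind;
} ToyElem;

/* Reverse order: later phases and higher priorities sort to the front. */
static int toy_priority_cmp(const void *a, const void *b)
{
    const ToyElem *ra = a;
    const ToyElem *rb = b;

    if (ra->kind->phase != rb->kind->phase)
        return (ra->kind->phase > rb->kind->phase) ? -1 : 1;
    if (ra->kind->priority != rb->kind->priority)
        return (ra->kind->priority > rb->kind->priority) ? -1 : 1;
    return 0;
}

/* Pop from the tail while the tail belongs to `phase`; return the new length. */
static size_t toy_release_all(ToyElem *items, size_t nitems, unsigned phase,
                              char *trace, size_t tracesz)
{
    while (nitems > 0)
    {
        const ToyKind *kind = items[nitems - 1].kind;

        if (kind->phase > phase)
            break;                      /* crossed into a later phase */
        strncat(trace, kind->name, tracesz - strlen(trace) - 1);
        strncat(trace, " ", tracesz - strlen(trace) - 1);
        nitems--;
    }
    return nitems;
}

const char *toy_sorted_release_trace(void)
{
    static const ToyKind pin = { "pin", 1, 200 };
    static const ToyKind io = { "io", 1, 100 };
    static const ToyKind file = { "file", 3, 600 };
    static const ToyKind catref = { "catref", 3, 100 };
    static char trace[64];
    ToyElem items[] = { { &file }, { &pin }, { &catref }, { &io } };
    size_t n = sizeof(items) / sizeof(items[0]);

    trace[0] = '\0';
    qsort(items, n, sizeof(items[0]), toy_priority_cmp);
    n = toy_release_all(items, n, 1, trace, sizeof(trace));
    strncat(trace, "| ", sizeof(trace) - strlen(trace) - 1);
    n = toy_release_all(items, n, 3, trace, sizeof(trace));
    assert(n == 0);
    return trace;
}
```

The trace comes out as `io pin | catref file ` (with a trailing space): within each phase the lowest priority number is released first, and the first call stops at the phase boundary without inspecting the rest of the array.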
The lock cache, AIO handles, and delete
Locks bypass the array/hash entirely and live in the 15-slot cache, populated
by ResourceOwnerRememberLock. The cache is intentionally lossy — once it
overflows it stops tracking, trading exact accounting for cheap bulk
release/reassign:

    // resowner.c: ResourceOwnerRememberLock, the lossy 15-entry cache
    void
    ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock)
    {
        Assert(locallock != NULL);

        if (owner->nlocks > MAX_RESOWNER_LOCKS)
            return;                 /* already overflowed: stop tracking */

        if (owner->nlocks < MAX_RESOWNER_LOCKS)
            owner->locks[owner->nlocks] = locallock;
        else
        {
            /* overflowed (nlocks becomes MAX+1, a sentinel) */
        }
        owner->nlocks++;
    }

AIO handles get their own dlist because they may be remembered inside
critical sections, where the normal ResourceOwnerEnlarge (which can
palloc) is forbidden — so they use a no-allocation push/pop
(ResourceOwnerRememberAioHandle/ForgetAioHandle) and are drained in the
BEFORE_LOCKS phase via pgaio_io_release_resowner. Finally,
ResourceOwnerDelete frees the owner object itself, but only after every
resource is gone — it asserts the array, hash, and lock count are empty (the
lock count may legitimately be the overflow sentinel) and recursively deletes
children:

    // resowner.c: ResourceOwnerDelete asserts emptiness, recurses, frees
    void
    ResourceOwnerDelete(ResourceOwner owner)
    {
        Assert(owner != CurrentResourceOwner);
        Assert(owner->narr == 0);
        Assert(owner->nhash == 0);
        Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);

        while (owner->firstchild != NULL)
            ResourceOwnerDelete(owner->firstchild);     /* child delinks itself */

        ResourceOwnerNewParent(owner, NULL);            /* delink from parent */
        if (owner->hash)
            pfree(owner->hash);
        pfree(owner);
    }

How xact.c drives it: the PG_TRY error-unwind tie-in
The phasing is the caller’s job. CommitTransaction issues the three
ResourceOwnerRelease calls with isCommit=true, interleaving engine cleanup
between phases — note RESOURCE_RELEASE_BEFORE_LOCKS runs, then
AtEOXact_Buffers/AtEOXact_RelationCache/AtEOXact_Inval run, then the
LOCKS and AFTER_LOCKS phases — so catalog invalidation is published while
locks are still held:

    // xact.c: CommitTransaction, phased release with cleanup interleaved
    CurrentResourceOwner = NULL;
    ResourceOwnerRelease(TopTransactionResourceOwner,
                         RESOURCE_RELEASE_BEFORE_LOCKS, true, true);
    AtEOXact_Buffers(true);
    AtEOXact_RelationCache(true);
    AtEOXact_Inval(true);       /* publish catalog invalidations under lock */
    AtEOXact_MultiXact();
    ResourceOwnerRelease(TopTransactionResourceOwner,
                         RESOURCE_RELEASE_LOCKS, true, true);
    ResourceOwnerRelease(TopTransactionResourceOwner,
                         RESOURCE_RELEASE_AFTER_LOCKS, true, true);

AbortTransaction runs the same three phases with isCommit=false. This is
the payoff of the whole design: when a longjmp from a failing C frame lands
in the abort path (via PG_TRY/sigsetjmp in the top-level loop), the executor’s
own release calls never ran, so the ResourceOwner is the only thing that
knows about the still-held pins, locks, and references — and these three calls
free them all:

    // xact.c: AbortTransaction, same three phases with isCommit=false (error unwind)
    if (TopTransactionResourceOwner != NULL)
    {
        CallXactCallbacks(XACT_EVENT_ABORT);
        ResourceOwnerRelease(TopTransactionResourceOwner,
                             RESOURCE_RELEASE_BEFORE_LOCKS, false, true);
        AtEOXact_Buffers(false);
        AtEOXact_RelationCache(false);
        AtEOXact_Inval(false);
        ResourceOwnerRelease(TopTransactionResourceOwner,
                             RESOURCE_RELEASE_LOCKS, false, true);
        ResourceOwnerRelease(TopTransactionResourceOwner,
                             RESOURCE_RELEASE_AFTER_LOCKS, false, true);
    }

CleanupTransaction then calls ResourceOwnerDelete(TopTransactionResourceOwner)
to free the (now-empty) owner objects and nulls the three globals. The
ResourceOwnerReleaseInternal re-entrancy comment (“if an error happens
between the release phases, we might get called again for the same
ResourceOwner from AbortTransaction”) is what makes a failure during commit’s
release safely fall through to abort’s release without double-sorting.
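A much-reduced model of that re-entrancy, assuming hypothetical names (`ToyOwner`, `toy_release`, `toy_commit_then_abort`) and a counter in place of the real sort: the two flags make a second pass over an already-started owner harmless.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToyOwner
{
    bool releasing;     /* set on the first release call */
    bool sorted;        /* the sort runs at most once */
    int  nsorts;        /* how many times the pretend sort ran */
    int  remaining[4];  /* resources left, indexed by phase 1..3 */
} ToyOwner;

static void toy_release(ToyOwner *owner, int phase)
{
    assert(phase >= 1 && phase <= 3);
    if (!owner->releasing)
        owner->releasing = true;
    if (!owner->sorted)
    {
        owner->nsorts++;            /* stand-in for the real sort */
        owner->sorted = true;
    }
    owner->remaining[phase] = 0;    /* releasing an empty phase is a no-op */
}

/* Commit gets through phase 1, "fails", and abort restarts from phase 1. */
int toy_commit_then_abort(int *nsorts)
{
    ToyOwner owner = { .remaining = { 0, 2, 1, 3 } };

    toy_release(&owner, 1);                 /* commit: BEFORE_LOCKS */
    /* ... an error is raised here, before commit reaches LOCKS ... */
    for (int phase = 1; phase <= 3; phase++)
        toy_release(&owner, phase);         /* abort: all three phases */

    *nsorts = owner.nsorts;
    return owner.remaining[1] + owner.remaining[2] + owner.remaining[3];
}
```

The function returns 0 resources remaining with `*nsorts` equal to 1: abort's repeat of the first phase finds nothing left to do and does not sort again.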
flowchart TD
ERR["error: ereport(ERROR) / elog(ERROR)<br/>anywhere in executor C frames"]
LJ["siglongjmp to PG_exception_stack<br/>(set by PG_TRY/sigsetjmp in main loop)"]
AB["AbortTransaction()"]
P1["ResourceOwnerRelease(BEFORE_LOCKS, isCommit=false)<br/>→ unpin buffers, drain AIO"]
EC["AtEOXact_Buffers / RelationCache / Inval"]
P2["ResourceOwnerRelease(LOCKS, false)<br/>→ ProcReleaseLocks (truly release)"]
P3["ResourceOwnerRelease(AFTER_LOCKS, false)<br/>→ drop catcache/files/snapshots"]
CL["CleanupTransaction → ResourceOwnerDelete<br/>free owner tree, null globals"]
ERR --> LJ --> AB --> P1 --> EC --> P2 --> P3 --> CL
P1 -. children-first recursion .-> P1
Position hints (as of 2026-06-06, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| struct ResourceOwnerData | src/backend/utils/resowner/resowner.c | 112 |
| RESOWNER_ARRAY_SIZE (32) | src/backend/utils/resowner/resowner.c | 73 |
| MAX_RESOWNER_LOCKS (15) | src/backend/utils/resowner/resowner.c | 107 |
| CurrentResourceOwner (globals) | src/backend/utils/resowner/resowner.c | 173 |
| resource_priority_cmp | src/backend/utils/resowner/resowner.c | 269 |
| ResourceOwnerSort | src/backend/utils/resowner/resowner.c | 292 |
| ResourceOwnerReleaseAll | src/backend/utils/resowner/resowner.c | 348 |
| ResourceOwnerCreate | src/backend/utils/resowner/resowner.c | 421 |
| ResourceOwnerEnlarge | src/backend/utils/resowner/resowner.c | 452 |
| ResourceOwnerRemember | src/backend/utils/resowner/resowner.c | 524 |
| ResourceOwnerForget | src/backend/utils/resowner/resowner.c | 564 |
| ResourceOwnerRelease | src/backend/utils/resowner/resowner.c | 658 |
| ResourceOwnerReleaseInternal | src/backend/utils/resowner/resowner.c | 678 |
| ResourceOwnerReleaseAllOfKind | src/backend/utils/resowner/resowner.c | 818 |
| ResourceOwnerDelete | src/backend/utils/resowner/resowner.c | 871 |
| ResourceOwnerNewParent | src/backend/utils/resowner/resowner.c | 914 |
| CreateAuxProcessResourceOwner | src/backend/utils/resowner/resowner.c | 999 |
| ResourceOwnerRememberLock | src/backend/utils/resowner/resowner.c | 1062 |
| ResourceOwnerForgetLock | src/backend/utils/resowner/resowner.c | 1082 |
| ResourceOwnerRememberAioHandle | src/backend/utils/resowner/resowner.c | 1104 |
| ResourceReleasePhase enum | src/include/utils/resowner.h | 52 |
| ResourceOwnerDesc struct | src/include/utils/resowner.h | 91 |
| buffer_pin_resowner_desc | src/backend/storage/buffer/bufmgr.c | 244 |
| Subxact owner create | src/backend/access/transam/xact.c | 1293 |
| Portal owner create | src/backend/utils/mmgr/portalmem.c | 205 |
| CommitTransaction release | src/backend/access/transam/xact.c | 2411 |
| AbortTransaction release | src/backend/access/transam/xact.c | 2967 |
| CleanupTransaction delete | src/backend/access/transam/xact.c | 3027 |
Source verification (as of 2026-06-06)
Section titled “Source verification (as of 2026-06-06)”Verified facts
- A ResourceOwner is allocated in TopMemoryContext and freed only explicitly. Verified in ResourceOwnerCreate (MemoryContextAllocZero(TopMemoryContext, ...)) and ResourceOwnerDelete (pfree(owner)). This is why owners survive the reset of any per-transaction memory context: the resource accounting must outlive the memory it tracks.
- The forest mirrors transaction/portal nesting: subxact owner is a child of its parent xact’s owner; portal owner is a child of CurTransactionResourceOwner. Verified in xact.c StartSubTransaction (ResourceOwnerCreate(s->parent->curTransactionOwner, "SubTransaction")) and portalmem.c CreatePortal (ResourceOwnerCreate(CurTransactionResourceOwner, "Portal")). The README’s “any remaining [portal] resources become the responsibility of the current transaction” is implemented by this parent linkage.
- Release runs in exactly three phases, driven three times by the caller, with children released before parents within each phase. Verified in ResourceOwnerReleaseInternal (children-first for loop over firstchild) and the RESOURCE_RELEASE_BEFORE_LOCKS / LOCKS / AFTER_LOCKS branch structure, plus the three ResourceOwnerRelease calls in both CommitTransaction and AbortTransaction.
- Buffer pins are BEFORE_LOCKS and released before locks. Verified in bufmgr.c buffer_pin_resowner_desc (.release_phase = RESOURCE_RELEASE_BEFORE_LOCKS, .release_priority = RELEASE_PRIO_BUFFER_PINS). The README’s rationale is that pins are visible to other backends, so they must be gone before a lock another backend waits on is released.
- On subtransaction/portal commit, locks are reassigned to the parent, not released; on abort they are released. Verified in the RESOURCE_RELEASE_LOCKS non-top-level branch: isCommit ? LockReassignCurrentOwner(...) : LockReleaseCurrentOwner(...). The README’s “release operation on a child transfers lock ownership to the parent if isCommit is true” matches exactly.
- The lock cache holds at most MAX_RESOWNER_LOCKS (15) entries and is lossy on overflow. Verified in ResourceOwnerRememberLock (returns early when nlocks > MAX_RESOWNER_LOCKS; sentinel nlocks == MAX+1) and the LOCKS phase passing locks = NULL when overflowed so lmgr scans its own table. The 15 value and its pg_dump-derived justification are in the MAX_RESOWNER_LOCKS comment.
- The fast store is a 32-slot array spilling into an open-addressing hash; Enlarge must be called before Remember. Verified by RESOWNER_ARRAY_SIZE 32, ResourceOwnerEnlarge (grows/drains before returning), and ResourceOwnerRemember (elog(ERROR, "...array was full") if you skipped Enlarge). The bufmgr pin path calls ResourceOwnerEnlarge(CurrentResourceOwner) up front (bufmgr.c ~692/2023/2366).
- Release sorts in reverse priority and walks from the tail, stopping at the phase boundary. Verified in resource_priority_cmp (/* Note: reverse order */) and ResourceOwnerReleaseAll (while (nitems > 0) { ... if (kind->release_phase > phase) break; ... nitems--; }).
- Commit warns on leaked resources; abort is silent. Verified in ResourceOwnerReleaseAll: printLeakWarnings (passed as isCommit) gates the elog(WARNING, "resource was not closed: %s", ...). The README states the owner should be empty at commit but it is normal to have resources at abort.
- AIO handles use a dlist, not the ResourceElem array, because they may be remembered in critical sections. Verified in ResourceOwnerRememberAioHandle (dlist_push_tail) and the BEFORE_LOCKS drain loop calling pgaio_io_release_resowner. The struct comment names the critical-section constraint explicitly.
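The release ordering in the facts above (three caller-driven phases, children before parents, reverse-sorted items walked from the tail) can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not PostgreSQL code: every Toy* name, the priority values, the fixed 8-slot array, and the string log are hypothetical, and the LOCKS phase is reduced to a log entry because real locks are not array items.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy model: names echo PostgreSQL's, but nothing here is the real API. */
typedef enum { TOY_BEFORE_LOCKS = 1, TOY_LOCKS = 2, TOY_AFTER_LOCKS = 3 } ToyPhase;

typedef struct { const char *name; ToyPhase phase; int priority; } ToyKind;
typedef struct { const ToyKind *kind; } ToyItem;

#define TOY_MAX_ITEMS 8

typedef struct ToyOwner {
    const char *name;
    struct ToyOwner *firstchild;   /* head of the child list */
    struct ToyOwner *nextchild;    /* next sibling */
    ToyItem items[TOY_MAX_ITEMS];
    int nitems;
    int sorted;
} ToyOwner;

static char toy_log[256];          /* records the release order */

static void toy_logf(const char *owner, const char *what)
{
    size_t used = strlen(toy_log);
    snprintf(toy_log + used, sizeof(toy_log) - used, "%s%s.%s",
             used ? " " : "", owner, what);
}

static void toy_remember(ToyOwner *o, const ToyKind *kind)
{
    assert(o->nitems < TOY_MAX_ITEMS);   /* the toy has no hash spill */
    o->items[o->nitems++].kind = kind;
    o->sorted = 0;
}

static void toy_add_child(ToyOwner *parent, ToyOwner *child)
{
    child->nextchild = parent->firstchild;
    parent->firstchild = child;
}

/* Reverse order: later phases and higher priority values sort first,
 * so the item to release next always sits at the tail of the array. */
static int toy_cmp(const void *a, const void *b)
{
    const ToyKind *x = ((const ToyItem *) a)->kind;
    const ToyKind *y = ((const ToyItem *) b)->kind;
    if (x->phase != y->phase)
        return x->phase > y->phase ? -1 : 1;
    return (y->priority > x->priority) - (y->priority < x->priority);
}

static void toy_release(ToyOwner *o, ToyPhase phase)
{
    /* Children first, within this phase. */
    for (ToyOwner *c = o->firstchild; c != NULL; c = c->nextchild)
        toy_release(c, phase);

    if (phase == TOY_LOCKS) {            /* locks are not array items */
        toy_logf(o->name, "locks");
        return;
    }
    if (!o->sorted) {
        qsort(o->items, (size_t) o->nitems, sizeof(ToyItem), toy_cmp);
        o->sorted = 1;
    }
    while (o->nitems > 0) {
        const ToyKind *kind = o->items[o->nitems - 1].kind;
        if (kind->phase > phase)
            break;                       /* belongs to a later phase */
        toy_logf(o->name, kind->name);
        o->nitems--;
    }
}

/* Drives the three phases the way a transaction end would. */
static const char *toy_demo(void)
{
    static const ToyKind pin  = { "pin",  TOY_BEFORE_LOCKS, 100 };
    static const ToyKind ref  = { "ref",  TOY_BEFORE_LOCKS, 200 };
    static const ToyKind file = { "file", TOY_AFTER_LOCKS,  100 };
    ToyOwner top = { .name = "top" }, sub = { .name = "sub" };

    toy_log[0] = '\0';
    toy_add_child(&top, &sub);
    toy_remember(&top, &file);
    toy_remember(&top, &ref);
    toy_remember(&top, &pin);
    toy_remember(&sub, &file);
    toy_remember(&sub, &pin);

    toy_release(&top, TOY_BEFORE_LOCKS);
    toy_release(&top, TOY_LOCKS);
    toy_release(&top, TOY_AFTER_LOCKS);
    return toy_log;
}
```

Calling toy_demo() from a main function returns the recorded order. In this model the child's pin is released before the parent's in the first phase, and no file is touched until both owners have passed the LOCKS phase, which is the property the phase boundary check exists to guarantee.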
Open questions
- How often the 15-entry lock cache actually overflows in real OLTP. The MAX_RESOWNER_LOCKS comment cites 9.2-era pg_dump measurements (≤9 locks per non-top owner). Whether modern partitioned schemas with hundreds of per-partition locks routinely overflow the top owner’s cache (forcing the slower lmgr-hash scan at commit) is workload-dependent and unmeasured here. Investigation path: instrument ResourceOwnerRememberLock’s overflow branch under a partition-heavy benchmark.
- The cost of the array→hash spill threshold (32) for wide executor trees. A deep plan pinning many buffers simultaneously crosses the 32-slot array into the hash, paying a re-hash and losing the cheap linear forget. Whether RESOWNER_ARRAY_SIZE is still well-tuned for current executor pin counts is not established from the code. Investigation path: trace narr high-water marks across TPC-style queries.
- Whether ResourceOwnerReleaseAllOfKind (the retail bulk-release used by, e.g., snapshot/relcache reset) interacts cleanly with a subsequent normal phased release. It temporarily sets releasing without sorting; the re-entrancy comment in ResourceOwnerReleaseInternal suggests the interplay is intentional, but the exact set of callers that mix the two on one owner is not enumerated here.
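To make the first investigation path concrete, the sketch below models the lossy lock cache with a counter on its overflow branch. It assumes only the behaviour described in the verified facts (early return once overflowed, a sentinel of MAX+1, NULL handed to the lock manager afterwards). The ToyLockCache type, the overflow_events field, and the helper names are hypothetical instrumentation, not anything present in resowner.c.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LOCKS 15   /* the value the text cites for MAX_RESOWNER_LOCKS */

typedef struct {
    const void *locks[TOY_MAX_LOCKS];
    int  nlocks;            /* TOY_MAX_LOCKS + 1 means "overflowed" */
    long overflow_events;   /* hypothetical instrumentation counter */
} ToyLockCache;

static void toy_remember_lock(ToyLockCache *c, const void *lock)
{
    if (c->nlocks > TOY_MAX_LOCKS)
        return;                      /* already overflowed: stay lossy */
    if (c->nlocks < TOY_MAX_LOCKS)
        c->locks[c->nlocks] = lock;  /* still fits in the cache */
    else
        c->overflow_events++;        /* 16th lock: count the overflow once */
    c->nlocks++;
}

/* Returns the cached array, or NULL to signal that the caller must
 * fall back to scanning its own lock table. */
static const void **toy_locks_for_release(ToyLockCache *c, int *n)
{
    if (c->nlocks > TOY_MAX_LOCKS) {
        *n = 0;
        return NULL;
    }
    *n = c->nlocks;
    return c->locks;
}

/* Remembers `count` distinct dummy locks in the cache. */
static void toy_fill(ToyLockCache *c, int count)
{
    static int dummy[64];
    assert(count <= 64);
    for (int i = 0; i < count; i++)
        toy_remember_lock(c, &dummy[i]);
}
```

In this model, filling a cache with 15 locks still yields a usable array, while one more lock flips it to the NULL fallback and bumps the counter exactly once per owner. A real measurement would need the counter exported through whatever statistics channel the benchmark harness uses, which is outside this sketch.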
Beyond PostgreSQL — Comparative Designs & Research Frontiers
- CUBRID. CUBRID does not have a single unifying ResourceOwner abstraction; resource cleanup is distributed across its transaction descriptor (LOG_TDES), its page-fix bookkeeping in the page buffer (pgbuf), and its lock manager’s per-transaction lock entry list. Buffer fixes are tracked per thread/transaction and unfixed at the end of a request or on rollback; locks are chained off the transaction descriptor and released in lock_unlock_all. The effect is the same (everything is reclaimed at transaction end and on error unwinding through er_set/setjmp-style error handling), but the bookkeeping is per-subsystem rather than funneled through one owner object with declarative release phases. PostgreSQL’s single phase+priority-sorted owner is notably more uniform; CUBRID trades that uniformity for subsystem-local control. (See the CUBRID code-analysis tree under knowledge/code-analysis/cubrid/ for the lock-manager and buffer details.)
- C++ storage engines (RocksDB-style, InnoDB). These lean on RAII guard objects (mtr_t mini-transactions in InnoDB; unique_ptr/scope guards in RocksDB) whose destructors run during C++ stack unwinding. The lifetime is lexical, bound to the stack frame, which is cleaner for resources that genuinely have lexical scope but cannot express “pin held across many frames, released at transaction end” without a heap-resident owner. InnoDB’s mini-transaction is the closest analog to a scoped ResourceOwner, but it is statement/operation-scoped rather than transaction-scoped, and locks are tracked separately in the trx_t.
- Managed-runtime engines. Java/C# engines get memory reclamation from GC and use try-with-resources/using for the rest, distributing cleanup across frames. The trade-off versus PostgreSQL’s centralized owner is the classic one: language-integrated unwinding is ergonomic but each frame must remember its own cleanup, whereas a central registry makes a single sweep authoritative at the cost of an explicit Remember/Forget protocol on every acquisition.
- Relation to the MemoryContext lineage. The README states the ResourceOwner API was modeled on MemoryContexts, and the parallel is the research-frontier-relevant observation: PostgreSQL discovered that a tree-of-scopes + current-pointer + recursive-bulk-free pattern, proven for heap allocations, generalizes to any reclaimable claim. The deliberate decision not to unify them (different usage patterns: allocations are untyped and enormously frequent; resources are typed, callback-bearing, and fewer) is a small but instructive design judgment: the same abstraction shape, instantiated twice with different cost models.
- Error handling as the real driver. The deeper point, echoing ARIES’s insistence on disciplined commit/abort ordering, is that the ResourceOwner exists because PostgreSQL chose setjmp/longjmp over per-frame cleanup. An engine built on a language with destructors or checked exceptions might never invent it. It is the absence of automatic unwinding in C, combined with the need for transaction-scoped (not frame-scoped) lifetimes, that makes a centralized, phase-ordered owner the right answer here. See postgres-error-handling.md for the PG_TRY/sigsetjmp machinery that this module is the cleanup half of.
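The last point, that longjmp skips per-frame cleanup and a central registry repairs it, can be shown in a few lines of standard C. This is a minimal sketch under stated assumptions: toy_current stands in for CurrentResourceOwner and toy_error_jmp for the PG_TRY jump target, and all toy_* names, the fixed 16-entry table, and the integer "pin count" are hypothetical. It uses plain setjmp/longjmp, not PostgreSQL's macros.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

#define TOY_MAX_RES 16

typedef void (*ToyReleaseFn)(void *);
typedef struct { ToyReleaseFn release; void *arg; } ToyRes;

typedef struct {
    ToyRes res[TOY_MAX_RES];
    int n;
} ToyOwner;

static ToyOwner toy_current;     /* stands in for CurrentResourceOwner */
static jmp_buf  toy_error_jmp;   /* stands in for the PG_TRY jump target */

static void toy_remember(ToyReleaseFn fn, void *arg)
{
    assert(toy_current.n < TOY_MAX_RES);
    toy_current.res[toy_current.n].release = fn;
    toy_current.res[toy_current.n].arg = arg;
    toy_current.n++;
}

/* Happy path: the frame released the resource itself, so drop the entry. */
static void toy_forget(void *arg)
{
    for (int i = toy_current.n - 1; i >= 0; i--) {
        if (toy_current.res[i].arg == arg) {
            toy_current.n--;
            toy_current.res[i] = toy_current.res[toy_current.n];
            return;
        }
    }
}

/* Error path: one sweep releases whatever the torn-down frames still held. */
static void toy_release_all(void)
{
    while (toy_current.n > 0) {
        toy_current.n--;
        toy_current.res[toy_current.n].release(toy_current.res[toy_current.n].arg);
    }
}

static void toy_unpin(void *arg) { (*(int *) arg)--; }

static void toy_scan(int *pincount, int fail)
{
    (*pincount)++;                       /* acquire the "pin" */
    toy_remember(toy_unpin, pincount);
    if (fail)
        longjmp(toy_error_jmp, 1);       /* the release below never runs */
    toy_unpin(pincount);                 /* normal in-frame release */
    toy_forget(pincount);
}

/* Returns the pin count left over after the scan; 0 means nothing leaked. */
static int toy_run(int fail)
{
    static int pincount;   /* static, so longjmp cannot leave it indeterminate */
    pincount = 0;
    toy_current.n = 0;
    if (setjmp(toy_error_jmp) == 0)
        toy_scan(&pincount, fail);
    else
        toy_release_all();               /* cleanup half of the error path */
    return pincount;
}
```

Both toy_run(0) and toy_run(1) end with a pin count of zero, but for different reasons: the first because the frame cleaned up after itself, the second only because the owner remembered the pin. Removing the toy_release_all call would leave the failing run with a leaked pin, which is the situation the centralized owner is designed to rule out.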
Sources
In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)
- src/backend/utils/resowner/README: the design document (the MemoryContext-modeled rationale, the forest/parent-transfer semantics, the lock special case, the “adding a new resource type” recipe, and the three-phase release ordering with the worked parent/child priority example).
- src/include/utils/resowner.h: ResourceOwner opaque type, the four global owners, ResourceReleasePhase, the RELEASE_PRIO_* built-in priorities, ResourceOwnerDesc, and the full exported function surface (Create/Release/Delete/Enlarge/Remember/Forget/RememberLock/…).
- src/backend/utils/resowner/resowner.c: ResourceOwnerData, the array/hash store, Enlarge/Remember/Forget, ResourceOwnerSort + resource_priority_cmp, ResourceOwnerReleaseInternal (the three-phase recursion and lock transfer), the lock cache, AIO handles, and the aux-process owner.
- src/backend/access/transam/xact.c: the driver, with StartSubTransaction owner creation, the three phased ResourceOwnerRelease calls in CommitTransaction/AbortTransaction, and ResourceOwnerDelete in CleanupTransaction.
- src/backend/utils/mmgr/portalmem.c: CreatePortal hangs the portal’s owner under CurTransactionResourceOwner.
- src/backend/storage/buffer/bufmgr.c: buffer_pin_resowner_desc / buffer_io_resowner_desc and the ResourceOwnerEnlarge reserve-before-pin call sites.
- src/backend/storage/lmgr/lock.c: ResourceOwnerRememberLock/ForgetLock call sites and LockReassignCurrentOwner/LockReleaseCurrentOwner used by the LOCKS phase.
Papers and textbook chapters
- Mohan, C. et al. (1992). “ARIES: A Transaction Recovery Method…” ACM TODS 17(1):94-162. The discipline of releasing locks last and only after externally-visible state is consistent, which is the principle the three-phase release embodies. Captured in knowledge/research/dbms-papers/aries.md.
- Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 17-18: transactions as the unit of atomicity/recovery, the scope at whose boundary resources are reclaimed (knowledge/research/dbms-general/).
- Database Internals (Petrov 2019), ch. 5-6: buffer management and lock lifetimes; the resource claims a ResourceOwner tracks (knowledge/research/dbms-general/).
- RAII / scope-bound resource management (Stroustrup, the C++ idiom): the language-runtime alternative PostgreSQL emulates by hand because C lacks destructors and true exceptions.
Cross-references within this knowledge base
- postgres-memory-contexts.md: the sibling allocator the ResourceOwner API was modeled on; tree-of-scopes + current-pointer + recursive bulk-free.
- postgres-error-handling.md: PG_TRY/sigsetjmp/ereport; the unwinding half whose cleanup this module performs.
- postgres-buffer-manager.md: buffer pins, the canonical BEFORE_LOCKS resource, and the ReservePrivateRefCountEntry + ResourceOwnerEnlarge reserve protocol.
- postgres-lock-manager.md: LOCALLOCK, LockReassignCurrentOwner, and the local-lock hash that backs the lossy lock cache.
- postgres-xact.md: transaction/subtransaction state machine that creates and drives the per-transaction owners.
- postgres-portals-prepared.md: portals, whose owners are children of the current transaction’s owner.
- postgres-aio.md: asynchronous I/O handles tracked via the owner’s dlist.
- postgres-overview-base-infra.md: where ResourceOwner sits in the base-infrastructure layer.