PostgreSQL Large Objects — pg_largeobject Chunked Storage and the lo API
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”A relational table is, physically, an array of fixed-size disk pages, and a
row must fit on one page. PostgreSQL’s page is BLCKSZ = 8 KB by default;
Database System Concepts (Silberschatz 7e, ch. 13 “Data Storage
Structures”) and Database Internals (Petrov, “File Formats”) both make the
same observation: once a single attribute’s value approaches or exceeds the
page size, the slotted-page row layout — which assumes many small tuples per
page — breaks down. A database that wants to store a 50 MB image, a PDF, or a
video clip therefore needs an out-of-line representation: the row holds a
small reference, and the bulk bytes live somewhere else, decomposed into
page-sized units.
There are two classical answers to “where else,” and they differ in whether the large value is addressed as a column or addressed as an object:
-
In-row overflow / large-value attribution. The value remains logically a column of the row (a
bytea,text,BLOB,CLOB,VARBINARY(MAX)). The engine transparently spills the bytes into an overflow structure and leaves a pointer in the tuple. The application never sees the decomposition; it reads and writes the whole value as a datum. PostgreSQL’s name for this is TOAST (seepostgres-toast.md); DB2 calls it LOB columns with inlining thresholds, SQL Servervarbinary(max)withTEXTIMAGEallocation units. -
Object-addressed large storage. The large value is not a column. It is an independent object with its own identity (an OID or “locator”), created and dropped by explicit calls, and accessed through a file-like, seekable, byte-stream API —
open,read,write,lseek,truncate,close— so the client can stream a multi-gigabyte object in bounded-memory chunks rather than materializing it whole. The SQL standard’sBLOB/CLOBlocator model and the JDBCBlob/Clobinterfaces are the standardized form of this; Oracle’sBFILE/LOBlocators and PostgreSQL’s large objects are concrete instances.
The two models trade off along one axis: granularity of access. The
column model is dead simple for the application — a value is a value — but it
is all-or-nothing: to read one byte you fetch (and detoast) the whole datum,
and to change one byte you rewrite the whole datum as a new row version.
The object model adds API ceremony (you must lo_open a descriptor, you must
lo_close it, the descriptor is only valid inside a transaction) but buys
random access: you can lseek to offset 1 GB in a 4 GB object and read
4 KB without touching the rest. That is the entire reason the object model
survives alongside bytea.
PostgreSQL’s large object facility traces directly to the original POSTGRES
research system. The header comment in large_object.h records the lineage:
“POSTGRES 4.2 supported zillions of large objects (internal, external,
jaquith, inversion). Now we only support inversion.” The inversion
scheme — the one survivor — is the subject of this document. Its defining
idea, due to the POSTGRES storage work of Stonebraker and collaborators, is
deceptively economical: do not invent a new storage manager for large
objects at all. Instead, store the object’s bytes as ordinary rows of an
ordinary table, chopped into fixed-size chunks, and let the existing heap +
B-tree + MVCC + WAL machinery provide durability, concurrency, and
transactional rollback for free. “Inversion” names the inversion of the
usual relationship: rather than a file system holding database files, the
database holds what would otherwise be file-system objects.
This choice has a profound consequence that organizes the rest of the document: large objects inherit every property of the heap. They are MVCC-versioned (a write produces a new tuple and a dead old one), they are WAL-logged (crash recovery just replays the chunk inserts/updates), they are vacuumed (dead chunk tuples are reclaimed by autovacuum), and they roll back on transaction abort. The price is paid in the same currency: write amplification (rewriting one byte rewrites a whole 2 KB chunk tuple), MVCC bloat (every overwrite leaves a dead chunk), and no out-of-band streaming (bytes pass through the SQL function-call protocol, not a side channel like a TCP file transfer).
Two design questions follow, and they frame sections 2–4:
- What is the chunk size, and why that size? It controls write amplification, the number of catalog tuples per object, and whether the chunk itself gets compressed.
- How does a seekable, sparse, transactional byte stream get implemented
on top of a
(loid, pageno)-keyed catalog table — in particular how reads synthesize holes, how writes do read-modify-write, and how the client-facing file descriptor and its snapshot are scoped to a transaction.
Common DBMS Design
Section titled “Common DBMS Design”Before the PostgreSQL specifics, it helps to name the moving parts that any object-addressed large-storage subsystem must provide, because PostgreSQL’s symbols are then recognizable as choices within a shared design space.
Chunking and the chunk table
Section titled “Chunking and the chunk table”Every implementation that decomposes a large value into page-sized units
needs three things: a chunk size, a chunk container, and a chunk
key that orders the pieces. The container is usually a hidden/system table
(SQL Server’s LOB allocation units, Oracle’s LOB segments,
PostgreSQL’s pg_largeobject). The key is the pair (object identity,
sequence number) so that a range of bytes maps to a contiguous range of
chunk keys and an index range-scan streams them in order. The chunk size is
a tuning knob: too large and a small edit rewrites a lot; too small and the
per-chunk tuple/row overhead and the number of index entries explode.
flowchart LR
subgraph App["Client / SQL"]
A1["lo_open(loid, INV_READ)"]
A2["lo_lseek(fd, off)"]
A3["lo_read(fd, n)"]
end
subgraph API["Inversion API (inv_api.c)"]
B1["LargeObjectDesc<br/>id, offset, snapshot, flags"]
B2["index scan loid eq,<br/>pageno ge start"]
end
subgraph Cat["System catalogs"]
C1["pg_largeobject_metadata<br/>(oid, lomowner, lomacl)"]
C2["pg_largeobject<br/>(loid, pageno, data bytea)"]
C3["btree (loid, pageno)"]
end
A1 --> B1
A2 --> B1
A3 --> B2
B1 -.permission + existence.-> C1
B2 --> C3 --> C2
Metadata separate from data
Section titled “Metadata separate from data”A large object has identity and permissions even when it holds zero bytes.
Implementations therefore split a tiny metadata record (owner, ACL,
sometimes size/timestamps) from the bulky data chunks. The metadata
record is what CREATE inserts and DROP removes; the data chunks are
inserted lazily as bytes are written. This separation also lets the engine
answer “does this object exist?” and “may this role read it?” with a single
small lookup rather than touching the chunk store.
Seekable byte stream with holes
Section titled “Seekable byte stream with holes”The file abstraction implies an lseek-style cursor and the possibility of
sparse objects (“holes”): if a client seeks far past the end and writes,
the gap should logically read back as zero bytes, exactly like a Unix sparse
file, without physically materializing zero chunks for the gap. A read that
crosses a missing chunk must fabricate zeros; a write into a hole creates
only the chunks it actually touches.
Transactional and crash-safe semantics
Section titled “Transactional and crash-safe semantics”Because large objects participate in SQL transactions, a write must be atomic with the surrounding transaction (rolled back on abort) and durable on commit. Engines that store chunks as ordinary rows get this from the table engine’s MVCC + WAL; engines that use a bespoke LOB segment must build logging and undo for it explicitly. The trade-off is bloat versus specialization: row-backed LOBs bloat under heavy overwrite (dead versions) but need zero new recovery code.
A transaction-scoped descriptor with a pinned snapshot
Section titled “A transaction-scoped descriptor with a pinned snapshot”The file descriptor returned by open is a session/transaction-local
handle, not a durable object. It carries the current seek offset and, for
read mode, a snapshot so that successive reads of the same object see a
consistent point-in-time image even as other transactions modify it. The
descriptor and its snapshot must be cleaned up at transaction end (and
reassigned across subtransaction boundaries), or they leak.
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL implements the inversion scheme with two system catalogs and one
API file. The numbers are fixed by large_object.h:
// LOBLKSIZE / MAX_LARGE_OBJECT_SIZE — src/include/storage/large_object.h#define LOBLKSIZE (BLCKSZ / 4)#define MAX_LARGE_OBJECT_SIZE ((int64) INT_MAX * LOBLKSIZE)With the default 8 KB block, LOBLKSIZE is 2048 bytes. The comment in
the header explains the two reasons for BLCKSZ/4 rather than something
near BLCKSZ: first, a smaller chunk means a partial-page write rewrites
less data; second — and more subtly — the value is “deliberately chosen
large enough to trigger the tuple toaster, so that we will attempt to
compress page tuples in-line.” That is, a 2 KB data field is over the
TOAST threshold, so PostgreSQL will try to LZ-compress each chunk in place
(though it won’t push chunks out-of-line to a TOAST table unless the DBA
manually creates one for pg_largeobject). With pageno an int32 and
INT_MAX chunks of 2 KB each, the maximum object is INT_MAX * 2048 ≈
4 TB.
The two catalogs
Section titled “The two catalogs”pg_largeobject_metadata holds one row per object: its OID is the large
object’s identity, plus owner and ACL. pg_largeobject holds the data
chunks, keyed by (loid, pageno):
// FormData_pg_largeobject — src/include/catalog/pg_largeobject.hCATALOG(pg_largeobject,2613,LargeObjectRelationId){ Oid loid BKI_LOOKUP(pg_largeobject_metadata); /* LO identity */ int32 pageno; /* Page number (starting from 0) */ bytea data BKI_FORCE_NOT_NULL; /* Data for page (may be zero-length) */} FormData_pg_largeobject;
DECLARE_UNIQUE_INDEX_PKEY(pg_largeobject_loid_pn_index, 2683, LargeObjectLOidPNIndexId, pg_largeobject, btree(loid oid_ops, pageno int4_ops));The composite unique B-tree on (loid, pageno) is the workhorse: it lets
every inversion operation phrase itself as “scan the chunks of object loid
whose pageno is in some range, in order.” LargeObjectCreate inserts only
the metadata row — no chunks — so a freshly created object has size zero
and exists for permission purposes immediately:
// LargeObjectCreate — src/backend/catalog/pg_largeobject.c (condensed)pg_lo_meta = table_open(LargeObjectMetadataRelationId, RowExclusiveLock);if (OidIsValid(loid)) loid_new = loid;else loid_new = GetNewOidWithIndex(pg_lo_meta, LargeObjectMetadataOidIndexId, Anum_pg_largeobject_metadata_oid);values[Anum_pg_largeobject_metadata_oid - 1] = ObjectIdGetDatum(loid_new);values[Anum_pg_largeobject_metadata_lomowner - 1] = ObjectIdGetDatum(GetUserId());/* ... lomacl from get_user_default_acl ... */ntup = heap_form_tuple(RelationGetDescr(pg_lo_meta), values, nulls);CatalogTupleInsert(pg_lo_meta, ntup);pg_largeobject_metadata was introduced in PostgreSQL 9.0 specifically to
attach ownership and per-object ACLs to large objects; before that, all LOs
were effectively owned in common. Dependency tracking, however, still
records the object under LargeObjectRelationId (the data catalog’s OID,
2613), not the metadata catalog — inv_create comments that this is a
backwards-compatibility decision for pg_dump and other clients.
The inversion descriptor and open
Section titled “The inversion descriptor and open”inv_open is where existence, permission, and snapshot policy are decided.
The single most important design point is the read/write snapshot
asymmetry:
// inv_open — src/backend/storage/large_object/inv_api.c (condensed)if (flags & INV_WRITE) descflags |= IFS_WRLOCK | IFS_RDLOCK;if (flags & INV_READ) descflags |= IFS_RDLOCK;/* ... reject descflags == 0 ... */
/* Get snapshot. If write is requested, use an instantaneous snapshot. */if (descflags & IFS_WRLOCK) snapshot = NULL; /* => "current" semantics each access */else snapshot = GetActiveSnapshot(); /* stable point-in-time view */
if (!LargeObjectExistsWithSnapshot(lobjId, snapshot)) /* existence check */ ereport(ERROR, ... "large object %u does not exist" ...);/* ACL_SELECT for RDLOCK, ACL_UPDATE for WRLOCK, both snapshot-aware */A read-mode open captures GetActiveSnapshot() so that the whole
sequence of lo_read calls against that descriptor sees one consistent
image — a bytea read 4 KB at a time is the same object throughout, even if
a concurrent transaction commits a rewrite. A write-mode open uses
snapshot = NULL, meaning each catalog access uses a fresh
up-to-date (SnapshotSelf/current) view, which is what you want for
read-modify-write: you must see your own just-written chunks and the latest
committed state. The flags IFS_RDLOCK/IFS_WRLOCK double as
“permission was checked for this mode” markers, per the large_object.h
note that since v11 permission is checked at open time.
flowchart TD
O["inv_open(loid, flags, mcxt)"] --> M{"INV_WRITE<br/>set?"}
M -->|yes| W["descflags = WRLOCK|RDLOCK<br/>snapshot = NULL (current)"]
M -->|no, INV_READ| R["descflags = RDLOCK<br/>snapshot = GetActiveSnapshot()"]
W --> E["LargeObjectExistsWithSnapshot"]
R --> E
E -->|missing| ERR["ERROR: does not exist"]
E -->|exists| P["aclcheck: ACL_UPDATE if WRLOCK,<br/>ACL_SELECT if RDLOCK"]
P --> D["alloc LargeObjectDesc in mcxt<br/>id, offset=0, flags, snapshot"]
Read: ordered index scan plus zero-fill for holes
Section titled “Read: ordered index scan plus zero-fill for holes”inv_read translates the byte offset into a starting pageno, opens an
ordered systable scan with two scan keys (loid = id and pageno >= start),
and walks the chunks forward, copying the relevant slice of each and
fabricating zeros for any gap between the expected offset and the next
chunk’s actual offset. This is what makes large objects sparse:
// inv_read — src/backend/storage/large_object/inv_api.c (condensed)int32 pageno = (int32) (obj_desc->offset / LOBLKSIZE);ScanKeyInit(&skey[0], Anum_pg_largeobject_loid, BTEqualStrategyNumber, F_OIDEQ, ...);ScanKeyInit(&skey[1], Anum_pg_largeobject_pageno, BTGreaterEqualStrategyNumber, F_INT4GE, Int32GetDatum(pageno));sd = systable_beginscan_ordered(lo_heap_r, lo_index_r, obj_desc->snapshot, 2, skey);while ((tuple = systable_getnext_ordered(sd, ForwardScanDirection)) != NULL){ pageoff = ((uint64) data->pageno) * LOBLKSIZE; if (pageoff > obj_desc->offset) /* a hole: synthesize zeros */ { n = pageoff - obj_desc->offset; n = (n <= (nbytes - nread)) ? n : (nbytes - nread); MemSet(buf + nread, 0, n); nread += n; obj_desc->offset += n; } if (nread < nbytes) { /* copy slice from this chunk's data field */ } if (nread >= nbytes) break;}Write: read-modify-write of 2 KB chunks
Section titled “Write: read-modify-write of 2 KB chunks”inv_write is the most intricate routine. For each chunk that the write
range touches it either updates an existing chunk (load old chunk into a
stack workbuf, splice in the new bytes, zero-fill any intra-chunk hole, and
heap_modify_tuple + CatalogTupleUpdateWithInfo) or inserts a brand-new
chunk (heap_form_tuple + CatalogTupleInsertWithInfo). The
read-modify-write is why a one-byte change rewrites a whole 2 KB tuple:
// inv_write — src/backend/storage/large_object/inv_api.c (condensed, update branch)if (olddata != NULL && olddata->pageno == pageno){ getdatafield(olddata, &datafield, &len, &pfreeit); /* load old chunk */ memcpy(workb, VARDATA(datafield), len); off = (int) (obj_desc->offset % LOBLKSIZE); if (off > len) MemSet(workb + len, 0, off - len); /* intra-chunk hole */ n = LOBLKSIZE - off; n = (n <= (nbytes - nwritten)) ? n : (nbytes - nwritten); memcpy(workb + off, buf + nwritten, n); /* splice new bytes */ nwritten += n; obj_desc->offset += n; off += n; len = (len >= off) ? len : off; SET_VARSIZE(&workbuf.hdr, len + VARHDRSZ); values[Anum_pg_largeobject_data - 1] = PointerGetDatum(&workbuf); replace[Anum_pg_largeobject_data - 1] = true; newtup = heap_modify_tuple(oldtuple, RelationGetDescr(lo_heap_r), values, nulls, replace); CatalogTupleUpdateWithInfo(lo_heap_r, &newtup->t_self, newtup, indstate);}After the loop, inv_write issues CommandCounterIncrement() so that the
just-written chunks become visible to subsequent operations in the same
command sequence — essential when a lo_write is immediately followed by a
lo_read or another lo_write in the same transaction.
The fmgr surface and the FD table
Section titled “The fmgr surface and the FD table”The SQL-visible functions (lo_open, loread, lowrite, lo_lseek,
lo_creat, lo_unlink, lo_import, lo_export, lo_get, lo_put, …)
live in be-fsstubs.c as be_lo_* fmgr wrappers. They maintain a
per-transaction file-descriptor table — the cookies array of
LargeObjectDesc * allocated in a private fscxt memory context — and an
integer “fd” is just an index into it. be_lo_open allocates a slot,
calls inv_open, stamps the descriptor with the current subtransaction id,
and registers the read snapshot on TopTransactionResourceOwner so it
outlives the current portal:
// be_lo_open — src/backend/libpq/be-fsstubs.c (condensed)fd = newLOfd(); /* index into cookies[] in fscxt */lobjDesc = inv_open(lobjId, mode, fscxt);lobjDesc->subid = GetCurrentSubTransactionId();if (lobjDesc->snapshot) lobjDesc->snapshot = RegisterSnapshotOnOwner(lobjDesc->snapshot, TopTransactionResourceOwner);cookies[fd] = lobjDesc;PG_RETURN_INT32(fd);These descriptors are valid only within the transaction that opened
them; AtEOXact_LargeObject tears down the whole fscxt at transaction end
and AtEOSubXact_LargeObject reassigns or closes them across subtransaction
boundaries. The next section walks these flows symbol by symbol.
Source Walkthrough
Section titled “Source Walkthrough”The implementation splits cleanly into three layers: the catalog layer
(pg_largeobject.c) that owns the two system tables, the inversion layer
(inv_api.c) that presents a seekable byte stream, and the fmgr/FD layer
(be-fsstubs.c) that exposes SQL functions and manages descriptors. We walk
them bottom-up.
Catalog layer — pg_largeobject.c
Section titled “Catalog layer — pg_largeobject.c”LargeObjectCreate(loid) inserts one pg_largeobject_metadata row (owner =
GetUserId(), default ACL via get_user_default_acl) and returns its OID,
choosing a fresh OID with GetNewOidWithIndex when loid is InvalidOid.
No data chunks are written, so the object’s size is zero until first write.
LargeObjectDrop(loid) deletes both halves: first the single
pg_largeobject_metadata row (scanned via LargeObjectMetadataOidIndexId),
then every pg_largeobject chunk for that loid by range-scanning the
(loid, pageno) index and calling CatalogTupleDelete on each:
// LargeObjectDrop — src/backend/catalog/pg_largeobject.c (condensed)ScanKeyInit(&skey[0], Anum_pg_largeobject_loid, BTEqualStrategyNumber, F_OIDEQ, ObjectIdGetDatum(loid));scan = systable_beginscan(pg_largeobject, LargeObjectLOidPNIndexId, true, NULL, 1, skey);while (HeapTupleIsValid(tuple = systable_getnext(scan))) CatalogTupleDelete(pg_largeobject, &tuple->t_self);systable_endscan(scan);LargeObjectExists(loid) / LargeObjectExistsWithSnapshot(loid, snapshot)
answer existence by probing pg_largeobject_metadata (note: the metadata
catalog, not the data catalog) under the given snapshot. The function comment
is explicit that LO metadata is not cached in the syscache “for fear of
using too much local memory,” so each check is a fresh systable_beginscan.
The snapshot-aware variant exists precisely so that a read-mode inv_open
checks existence under the same snapshot it will read with.
Inversion layer — inv_api.c
Section titled “Inversion layer — inv_api.c”Relation caching. open_lo_relation() opens pg_largeobject and its
index once per transaction, holding the references on
TopTransactionResourceOwner so a subtransaction abort doesn’t yank them
out from under a later operation. It takes RowExclusiveLock even for reads
“since we might either read or write.” close_lo_relation(isCommit) releases
them at main-transaction end (called from AtEOXact_LargeObject).
Chunk decoding. getdatafield() is the one place that touches the raw
data column. Because a 2 KB chunk is over the TOAST threshold it may be
stored compressed or in short-header form, so the function detoasts when
VARATT_IS_EXTENDED, validates 0 <= len <= LOBLKSIZE (raising
ERRCODE_DATA_CORRUPTED otherwise), and reports whether the caller must
pfree the detoasted copy:
// getdatafield — src/backend/storage/large_object/inv_api.c (condensed)datafield = &(tuple->data); /* see note at top of file */if (VARATT_IS_EXTENDED(datafield)) { datafield = (bytea *) detoast_attr((struct varlena *) datafield); freeit = true;}len = VARSIZE(datafield) - VARHDRSZ;if (len < 0 || len > LOBLKSIZE) ereport(ERROR, (errcode(ERRCODE_DATA_CORRUPTED), errmsg("pg_largeobject entry for OID %u, page %d has invalid data field size %d", tuple->loid, tuple->pageno, len)));Size. inv_getsize() exploits the (loid, pageno) index ordering: a
single backward ordered scan returns the highest-pageno chunk first, so
the size is pageno * LOBLKSIZE + len of that one tuple — no full scan
needed. The comment notes LOs “can contain gaps, just like Unix files,” so
this is the offset of the last byte + 1, not a count of stored bytes.
Seek/tell. inv_seek(obj_desc, offset, whence) updates the descriptor’s
offset for SEEK_SET/SEEK_CUR/SEEK_END (the last calling
inv_getsize), rejecting a negative result or one beyond
MAX_LARGE_OBJECT_SIZE. inv_tell just returns the current offset. Both are
allowed with either read or write permission, so neither does an ACL check.
Read. inv_read (walked in §3) starts an ordered two-key scan from the
offset’s chunk and copies forward, zero-filling holes via MemSet. It
enforces IFS_RDLOCK and reads under obj_desc->snapshot.
Write. inv_write (walked in §3) loops chunk by chunk. Three sub-cases:
an existing chunk at the target pageno is updated in place
(heap_modify_tuple → CatalogTupleUpdateWithInfo); a target pageno with
no pre-existing chunk is inserted fresh (heap_form_tuple →
CatalogTupleInsertWithInfo); and intra-chunk or whole-chunk holes are
zero-filled with MemSet before splicing. It enforces IFS_WRLOCK, rejects
writes that would exceed MAX_LARGE_OBJECT_SIZE, and ends with
CommandCounterIncrement().
Truncate. inv_truncate(obj_desc, len) finds the chunk containing the
truncation point. If that chunk exists, it is shortened in place (and any
intra-chunk hole zero-filled); if the point lands in a hole, a new short
chunk is created to mark end-of-data. Then all later chunks are deleted
by continuing the ordered scan and calling CatalogTupleDelete:
// inv_truncate — src/backend/storage/large_object/inv_api.c (condensed tail)/* ... write/shorten the boundary chunk at pageno ... */if (olddata != NULL) /* delete every chunk strictly after the boundary */{ while ((oldtuple = systable_getnext_ordered(sd, ForwardScanDirection)) != NULL) CatalogTupleDelete(lo_heap_r, &oldtuple->t_self);}systable_endscan_ordered(sd);CatalogCloseIndexes(indstate);CommandCounterIncrement();Create/open/close/drop. inv_create calls LargeObjectCreate, records a
dependency on the owner (under LargeObjectRelationId for compatibility),
fires the post-create hook, and CommandCounterIncrements. inv_open
(walked in §3) builds the LargeObjectDesc. inv_close just pfrees the
descriptor. inv_drop calls performDeletion with DROP_CASCADE (which
routes through the dependency machinery to LargeObjectDrop), then
CommandCounterIncrements; it always returns 1 “for historical reasons.”
fmgr/FD layer — be-fsstubs.c
Section titled “fmgr/FD layer — be-fsstubs.c”The descriptor table. newLOfd() lazily creates the fscxt
(AllocSetContext under TopMemoryContext, named "Filesystem"), finds a
free slot in cookies or doubles the array (first allocation is 64 slots),
and returns the index. closeLOfd(fd) nulls the slot before freeing — “a
leak is better than a crash” — unregisters the snapshot, and inv_closes:
// closeLOfd — src/backend/libpq/be-fsstubs.c (condensed)lobj = cookies[fd];cookies[fd] = NULL; /* null first: avoid double-free on error */if (lobj->snapshot) UnregisterSnapshotFromOwner(lobj->snapshot, TopTransactionResourceOwner);inv_close(lobj);Open/close/read/write/seek. be_lo_open (walked in §3) gates
INV_WRITE behind PreventCommandIfReadOnly, allocates the fd, registers
the snapshot. be_lo_close validates the fd and closeLOfds. The bare
lo_read/lo_write helpers (not fmgr-callable) re-check the descriptor’s
IFS_RDLOCK/IFS_WRLOCK so the error message is about the FD’s mode
rather than the underlying privilege, then delegate to inv_read/inv_write.
be_loread/be_lowrite are the bytea fmgr entry points over those.
be_lo_lseek/be_lo_tell (32-bit) guard against int32 overflow and raise
ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE; be_lo_lseek64/be_lo_tell64 return
the full int64.
Create/unlink. be_lo_creat/be_lo_create call inv_create.
be_lo_unlink checks ownership (object_ownercheck unless
lo_compat_privileges), closes any open FDs referencing that loid by
scanning cookies, then inv_drops — note it deliberately throws the
ownership error before closing FDs.
Import/export. lo_import_internal opens a server-side file with
OpenTransientFile, creates an LO, and streams it in BUFSIZE (8192-byte)
gulps through inv_write. be_lo_export does the reverse with
inv_read, dropping the umask to 022 around the OpenTransientFilePerm so
exported files are not world-writable. Both run only on the server and
require superuser-level access in practice (file access is server-side).
Whole-object SQL helpers. lo_get_fragment_internal opens read-mode,
seeks to SEEK_END to size the result, clamps to MaxAllocSize, then seeks
back and reads into a bytea. be_lo_get reads the whole object;
be_lo_get_fragment reads a range. be_lo_from_bytea creates an LO and
inv_writes a whole bytea into it; be_lo_put seeks and overwrites a
range. These are the modern, FD-less, single-call accessors.
Transaction lifecycle. AtEOXact_LargeObject(isCommit) is the
end-of-transaction hook. If any LO work happened (lo_cleanup_needed), on
commit it closeLOfds every open descriptor (to avoid leaked-resource
warnings), then zaps the cookies array and deletes fscxt entirely so
the descriptors cannot survive the transaction; finally it calls
close_lo_relation(isCommit):
// AtEOXact_LargeObject — src/backend/libpq/be-fsstubs.c (condensed)if (!lo_cleanup_needed) return;if (isCommit) for (i = 0; i < cookies_size; i++) if (cookies[i] != NULL) closeLOfd(i);cookies = NULL; cookies_size = 0;if (fscxt) MemoryContextDelete(fscxt); /* descriptors die with the xact */fscxt = NULL;close_lo_relation(isCommit);lo_cleanup_needed = false;AtEOSubXact_LargeObject(isCommit, mySubid, parentSubid) handles
subtransaction boundaries: a descriptor whose subid matches the
committing subxact is reassigned to the parent (lo->subid = parentSubid); on abort it is closeLOfd’d. This is why every descriptor
records the subtransaction that owns it.
flowchart TD
T0["BEGIN"] --> OP["lo_open -> newLOfd<br/>cookies[fd] in fscxt"]
OP --> RW["lo_read / lo_write<br/>via inv_read / inv_write"]
RW --> SUB{"subxact<br/>boundary?"}
SUB -->|commit| RE["AtEOSubXact: subid := parent"]
SUB -->|abort| CL["AtEOSubXact: closeLOfd"]
RE --> EOX
CL --> EOX
RW --> EOX{"end of xact"}
EOX -->|commit| CC["AtEOXact: closeLOfd all,<br/>MemoryContextDelete(fscxt),<br/>close_lo_relation(true)"]
EOX -->|abort| CA["AtEOXact: drop fscxt,<br/>close_lo_relation(false)"]
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
LOBLKSIZE (BLCKSZ / 4) | src/include/storage/large_object.h | 70 |
MAX_LARGE_OBJECT_SIZE | src/include/storage/large_object.h | 76 |
LargeObjectDesc struct | src/include/storage/large_object.h | 39 |
IFS_RDLOCK / IFS_WRLOCK | src/include/storage/large_object.h | 48–49 |
FormData_pg_largeobject | src/include/catalog/pg_largeobject.h | 30 |
pg_largeobject_loid_pn_index | src/include/catalog/pg_largeobject.h | 48 |
FormData_pg_largeobject_metadata | src/include/catalog/pg_largeobject_metadata.h | 30 |
INV_WRITE / INV_READ | src/include/libpq/libpq-fs.h | 21–22 |
LargeObjectCreate | src/backend/catalog/pg_largeobject.c | 36 |
LargeObjectDrop | src/backend/catalog/pg_largeobject.c | 95 |
LargeObjectExists | src/backend/catalog/pg_largeobject.c | 167 |
LargeObjectExistsWithSnapshot | src/backend/catalog/pg_largeobject.c | 176 |
open_lo_relation | src/backend/storage/large_object/inv_api.c | 73 |
close_lo_relation | src/backend/storage/large_object/inv_api.c | 97 |
getdatafield | src/backend/storage/large_object/inv_api.c | 131 |
inv_create | src/backend/storage/large_object/inv_api.c | 173 |
inv_open | src/backend/storage/large_object/inv_api.c | 215 |
inv_close | src/backend/storage/large_object/inv_api.c | 299 |
inv_drop | src/backend/storage/large_object/inv_api.c | 311 |
inv_getsize | src/backend/storage/large_object/inv_api.c | 340 |
inv_seek | src/backend/storage/large_object/inv_api.c | 388 |
inv_tell | src/backend/storage/large_object/inv_api.c | 437 |
inv_read | src/backend/storage/large_object/inv_api.c | 450 |
inv_write | src/backend/storage/large_object/inv_api.c | 543 |
inv_truncate | src/backend/storage/large_object/inv_api.c | 740 |
be_lo_open | src/backend/libpq/be-fsstubs.c | 87 |
be_lo_close | src/backend/libpq/be-fsstubs.c | 126 |
lo_read / lo_write | src/backend/libpq/be-fsstubs.c | 154 / 182 |
be_lo_creat / be_lo_create | src/backend/libpq/be-fsstubs.c | 249 / 262 |
be_lo_unlink | src/backend/libpq/be-fsstubs.c | 314 |
be_loread / be_lowrite | src/backend/libpq/be-fsstubs.c | 362 / 380 |
lo_import_internal | src/backend/libpq/be-fsstubs.c | 424 |
be_lo_export | src/backend/libpq/be-fsstubs.c | 486 |
be_lo_truncate / be_lo_truncate64 | src/backend/libpq/be-fsstubs.c | 579 / 591 |
AtEOXact_LargeObject | src/backend/libpq/be-fsstubs.c | 607 |
AtEOSubXact_LargeObject | src/backend/libpq/be-fsstubs.c | 653 |
newLOfd / closeLOfd | src/backend/libpq/be-fsstubs.c | 680 / 721 |
lo_get_fragment_internal | src/backend/libpq/be-fsstubs.c | 746 |
be_lo_get / be_lo_from_bytea / be_lo_put | src/backend/libpq/be-fsstubs.c | 797 / 832 / 855 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verified against /data/hgryoo/references/postgres at REL_18_STABLE,
commit 273fe94 (PostgreSQL 18.x). Checks performed:
LOBLKSIZE = BLCKSZ / 4andMAX_LARGE_OBJECT_SIZE = INT_MAX * LOBLKSIZE: confirmed verbatim inlarge_object.hlines 70 and 76. With the default 8 KBBLCKSZthis is 2048 bytes per chunk and a ~4 TB cap.- Two-catalog split:
pg_largeobject(CATALOG(...,2613,...), columnsloid,pageno,data) andpg_largeobject_metadata(CATALOG(...,2995,...), columnsoid,lomowner,lomacl) confirmed in the respective headers; the composite unique B-treepg_largeobject_loid_pn_indexonbtree(loid oid_ops, pageno int4_ops)is declared atpg_largeobject.h:48. data BKI_FORCE_NOT_NULL: confirmed — thedatacolumn is forced NOT NULL even thoughinitdbmarksloid/pagenobut notdata; theinv_api.chead comment and theHeapTupleHasNulls“paranoia” checks corroborate.- Read/write snapshot asymmetry:
inv_opensetssnapshot = NULLforIFS_WRLOCKandsnapshot = GetActiveSnapshot()for read-only, confirmed atinv_api.c:238–241. The matchingRegisterSnapshotOnOwner/UnregisterSnapshotFromOwnercalls are inbe_lo_openandcloseLOfd. - Sparse-hole zero-fill: the
pageoff > obj_desc->offsetbranch ininv_readwithMemSet(buf + nread, 0, n)confirmed atinv_api.c:505–512. - Read-modify-write:
inv_write’s update branch usesheap_modify_tupleCatalogTupleUpdateWithInfo; its insert branch usesheap_form_tuple+CatalogTupleInsertWithInfo; both confirmed (lines ~675–721). Each mutating routine ends withCommandCounterIncrement().
- FD table in
fscxt:cookies/cookies_size/fscxtstatics and thenewLOfd/closeLOfd/AtEOXact_LargeObject/AtEOSubXact_LargeObjectlifecycle confirmed inbe-fsstubs.c.AtEOXact_LargeObjectcallsMemoryContextDelete(fscxt)on cleanup. - Permission model:
pg_largeobject_aclcheck_snapshotwithACL_SELECT(read) /ACL_UPDATE(write) ininv_open;object_ownercheckforbe_lo_unlink; all gated by thelo_compat_privilegesGUC. Confirmed. - Caveat / version note:
inv_read/inv_writeare declaredint(32-bit) byte counts, so a single SQLloread/lowritecall moves at mostINT_MAXbytes regardless of the 4 TB object cap; thebe_lo_lseek/be_lo_tell32-bit wrappers raiseERRCODE_NUMERIC_VALUE_OUT_OF_RANGEpast 2 GB, which is why the*64variants exist. This matches REL_18; no PG19-only symbols were relied on.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”Large objects vs. bytea/TOAST — the in-house comparison
Section titled “Large objects vs. bytea/TOAST — the in-house comparison”The first comparison is internal. PostgreSQL ships two ways to store big binary values, and the contrast is instructive:
| Dimension | Large object (lo) | bytea + TOAST |
|---|---|---|
| Addressing | Object OID, separate from any row | Column value inside a row |
| Access | Seekable: lo_lseek/lo_read random access | All-or-nothing datum (detoast whole value) |
| Max size | ~4 TB (INT_MAX * LOBLKSIZE) | ~1 GB (varlena 30-bit length) |
| Chunk store | pg_largeobject, shared by all LOs | per-table TOAST relation, TOAST_MAX_CHUNK_SIZE |
| Partial update | lo_put/inv_write rewrites only touched chunks | rewrites the entire datum as a new row version |
| Streaming to client | chunked via loread/lowrite round-trips | one big value over the wire |
| Permissions | per-object owner + ACL (since 9.0) | inherit the table’s row privileges |
| Transactional | yes (heap MVCC + WAL) | yes (heap MVCC + WAL) |
The decision rule the PostgreSQL docs effectively encode: use bytea when
values are below ~a few MB and are read/written whole; use large objects when
you need random access into multi-gigabyte blobs or values exceeding the
1 GB bytea ceiling. The shared mechanism under both is the same heap +
B-tree chunk store — TOAST chunks a column, inversion chunks an object —
which is why both inherit MVCC bloat under heavy overwrite. See
postgres-toast.md for the column side.
The MVCC-bloat tax and vacuuming
Section titled “The MVCC-bloat tax and vacuuming”Because every chunk is an ordinary heap tuple, overwriting LO data leaves
dead tuples in pg_largeobject, exactly as updating a normal row does. A
workload that repeatedly lo_puts into the same offsets of the same objects
will bloat pg_largeobject and depend on autovacuum to reclaim space (see
postgres-autovacuum.md and postgres-vacuum.md). There is also the
orphaned-LO problem: dropping the referencing row in a user table does not
drop the large object (the OID column is just an integer), so abandoned LOs
accumulate. The community answer is the vacuumlo utility and the
lo_manage trigger from the lo contrib module — both client/contrib
tooling, outside this core analysis, but worth naming as the operational
counterpart to the core mechanism.
Other engines
Section titled “Other engines”- Oracle separates
BFILE(a read-only pointer to an OS file, bytes not in the database) from internal LOBs (BLOB/CLOB/NCLOB) stored in dedicated LOB segments with their own chunk size and an optionalSecureFileslayer that adds compression, deduplication, and encryption. Unlike inversion, Oracle’s LOB segments are a bespoke storage structure, not “just another table,” so they carry purpose-built logging and a separate undo/retention model. - SQL Server offers
varbinary(max)/text/imagestored inLOB_DATA/ROW_OVERFLOW_DATAallocation units, plus FILESTREAM and FileTables, which keep the bytes in the NTFS file system while preserving transactional consistency — closer in spirit to OracleBFILEbut transactional. Like PostgreSQL’sbytea, these are column-addressed, not object-addressed. - MySQL/InnoDB stores
BLOB/TEXToverflow off-page in the same tablespace, column-addressed; there is no object-locator API analogous tolo_open. - JDBC/SQL standard locators. The standard’s
BLOB/CLOBlocator model — a handle valid for the duration of a transaction, supportinggetBytes(pos, len)andsetBytes— is conceptually the same contract as PostgreSQL’s transaction-scoped FD. The PostgreSQL JDBC driver mapsjava.sql.Blobonto thelo_*functions.
Research frontiers
Section titled “Research frontiers”The inversion idea — reuse the table engine for blobs instead of building a
new store — anticipates a recurring theme in storage research: unbundling
vs. reusing the storage layer. Modern object stores (S3-style) and
disaggregated-storage databases push large values to an external,
content-addressed tier and keep only a reference in the row, which is the
BFILE/FILESTREAM direction taken to its cloud conclusion. The trade PostgreSQL
made in the 1990s — pay MVCC bloat and write amplification to get
transactional, crash-safe blobs for free — is exactly the trade that
cloud-native engines revisit when they decide whether large values deserve a
separate, append-only, log-structured tier (which avoids in-place rewrite
bloat) or should stay inline for transactional simplicity. The unchanged
LOBLKSIZE = BLCKSZ/4 and int32 pageno since the early releases are also a
reminder of how a chunk-size and key-width choice ossifies: widening the 4 TB
cap would require widening pageno and an initdb, a migration cost that
keeps the original constants frozen.
Sources
Section titled “Sources”- PostgreSQL source (
/data/hgryoo/references/postgres,REL_18_STABLE, commit273fe94):src/backend/storage/large_object/inv_api.c— the inversion API:inv_create/open/close/drop/getsize/seek/tell/read/write/truncate,getdatafield,open_lo_relation/close_lo_relation.src/backend/libpq/be-fsstubs.c— fmgrbe_lo_*wrappers, thecookiesFD table infscxt,lo_import/lo_export,lo_get/lo_put/lo_from_bytea, and theAtEOXact/AtEOSubXactlifecycle.src/backend/catalog/pg_largeobject.c—LargeObjectCreate/Drop/Exists/ExistsWithSnapshot.src/include/storage/large_object.h—LOBLKSIZE,MAX_LARGE_OBJECT_SIZE,LargeObjectDesc,IFS_RDLOCK/IFS_WRLOCK.src/include/catalog/pg_largeobject.h,src/include/catalog/pg_largeobject_metadata.h— the two catalogs and the composite(loid, pageno)index.src/include/libpq/libpq-fs.h—INV_WRITE/INV_READflags.
- Theory anchors (
knowledge/research/dbms-general/):- Database System Concepts, Silberschatz 7e — ch. 13 “Data Storage Structures” (page layout, large-value overflow), file/record organization.
- Database Internals, Petrov — “File Formats” (slotted pages, overflow pages, why oversize values must go out of line).
- Cross-references (this repo):
postgres-toast.md(column-addressed large values, varlena, in-line/out-of-line compression),postgres-heap-am.md(heap tuple insert/update/delete that LO chunks ride on),postgres-autovacuum.md/postgres-vacuum.md(reclaiming dead chunk tuples),postgres-page-layout.md(BLCKSZ, slotted pages),postgres-nbtree.md(the B-tree index backing the(loid, pageno)scans),postgres-mvcc-snapshots.md(the read-mode snapshot semantics).