PostgreSQL Large Objects — pg_largeobject Chunked Storage and the lo API

Contents:

Theoretical Background
Common DBMS Design
PostgreSQL’s Approach
Source Walkthrough
Source verification (as of 2026-06-05)
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Sources

Theoretical Background

A relational table is, physically, an array of fixed-size disk pages, and a row must fit on one page. PostgreSQL’s page is BLCKSZ = 8 KB by default; Database System Concepts (Silberschatz 7e, ch. 13 “Data Storage Structures”) and Database Internals (Petrov, “File Formats”) both make the same observation: once a single attribute’s value approaches or exceeds the page size, the slotted-page row layout — which assumes many small tuples per page — breaks down. A database that wants to store a 50 MB image, a PDF, or a video clip therefore needs an out-of-line representation: the row holds a small reference, and the bulk bytes live somewhere else, decomposed into page-sized units.

There are two classical answers to “where else,” and they differ in whether the large value is addressed as a column or addressed as an object:

In-row overflow / large-value attribution. The value remains logically a column of the row (a bytea, text, BLOB, CLOB, VARBINARY(MAX)). The engine transparently spills the bytes into an overflow structure and leaves a pointer in the tuple. The application never sees the decomposition; it reads and writes the whole value as a datum. PostgreSQL’s name for this is TOAST (see postgres-toast.md); DB2 calls it LOB columns with inlining thresholds, SQL Server varbinary(max) with TEXTIMAGE allocation units.
Object-addressed large storage. The large value is not a column. It is an independent object with its own identity (an OID or “locator”), created and dropped by explicit calls, and accessed through a file-like, seekable, byte-stream API — open, read, write, lseek, truncate, close — so the client can stream a multi-gigabyte object in bounded-memory chunks rather than materializing it whole. The SQL standard’s BLOB/CLOB locator model and the JDBC Blob/Clob interfaces are the standardized form of this; Oracle’s BFILE/LOB locators and PostgreSQL’s large objects are concrete instances.

The two models trade off along one axis: granularity of access. The column model is dead simple for the application — a value is a value — but it is all-or-nothing: to read one byte you fetch (and detoast) the whole datum, and to change one byte you rewrite the whole datum as a new row version. The object model adds API ceremony (you must lo_open a descriptor, you must lo_close it, the descriptor is only valid inside a transaction) but buys random access: you can lseek to offset 1 GB in a 4 GB object and read 4 KB without touching the rest. That is the entire reason the object model survives alongside bytea.

PostgreSQL’s large object facility traces directly to the original POSTGRES research system. The header comment in large_object.h records the lineage: “POSTGRES 4.2 supported zillions of large objects (internal, external, jaquith, inversion). Now we only support inversion.” The inversion scheme — the one survivor — is the subject of this document. Its defining idea, due to the POSTGRES storage work of Stonebraker and collaborators, is deceptively economical: do not invent a new storage manager for large objects at all. Instead, store the object’s bytes as ordinary rows of an ordinary table, chopped into fixed-size chunks, and let the existing heap + B-tree + MVCC + WAL machinery provide durability, concurrency, and transactional rollback for free. “Inversion” names the inversion of the usual relationship: rather than a file system holding database files, the database holds what would otherwise be file-system objects.

This choice has a profound consequence that organizes the rest of the document: large objects inherit every property of the heap. They are MVCC-versioned (a write produces a new tuple and a dead old one), they are WAL-logged (crash recovery just replays the chunk inserts/updates), they are vacuumed (dead chunk tuples are reclaimed by autovacuum), and they roll back on transaction abort. The price is paid in the same currency: write amplification (rewriting one byte rewrites a whole 2 KB chunk tuple), MVCC bloat (every overwrite leaves a dead chunk), and no out-of-band streaming (bytes pass through the SQL function-call protocol, not a side channel like a TCP file transfer).

Two design questions follow, and they frame sections 2–4:

What is the chunk size, and why that size? It controls write amplification, the number of catalog tuples per object, and whether the chunk itself gets compressed.
How does a seekable, sparse, transactional byte stream get implemented on top of a (loid, pageno)-keyed catalog table — in particular how reads synthesize holes, how writes do read-modify-write, and how the client-facing file descriptor and its snapshot are scoped to a transaction.

Common DBMS Design

Before the PostgreSQL specifics, it helps to name the moving parts that any object-addressed large-storage subsystem must provide, because PostgreSQL’s symbols are then recognizable as choices within a shared design space.

Chunking and the chunk table

Every implementation that decomposes a large value into page-sized units needs three things: a chunk size, a chunk container, and a chunk key that orders the pieces. The container is usually a hidden/system table (SQL Server’s LOB allocation units, Oracle’s LOB segments, PostgreSQL’s pg_largeobject). The key is the pair (object identity, sequence number) so that a range of bytes maps to a contiguous range of chunk keys and an index range-scan streams them in order. The chunk size is a tuning knob: too large and a small edit rewrites a lot; too small and the per-chunk tuple/row overhead and the number of index entries explode.

flowchart LR
  subgraph App["Client / SQL"]
    A1["lo_open(loid, INV_READ)"]
    A2["lo_lseek(fd, off)"]
    A3["lo_read(fd, n)"]
  end
  subgraph API["Inversion API (inv_api.c)"]
    B1["LargeObjectDesc<br/>id, offset, snapshot, flags"]
    B2["index scan loid eq,<br/>pageno ge start"]
  end
  subgraph Cat["System catalogs"]
    C1["pg_largeobject_metadata<br/>(oid, lomowner, lomacl)"]
    C2["pg_largeobject<br/>(loid, pageno, data bytea)"]
    C3["btree (loid, pageno)"]
  end
  A1 --> B1
  A2 --> B1
  A3 --> B2
  B1 -.permission + existence.-> C1
  B2 --> C3 --> C2

Metadata separate from data

A large object has identity and permissions even when it holds zero bytes. Implementations therefore split a tiny metadata record (owner, ACL, sometimes size/timestamps) from the bulky data chunks. The metadata record is what CREATE inserts and DROP removes; the data chunks are inserted lazily as bytes are written. This separation also lets the engine answer “does this object exist?” and “may this role read it?” with a single small lookup rather than touching the chunk store.

Seekable byte stream with holes

The file abstraction implies an lseek-style cursor and the possibility of sparse objects (“holes”): if a client seeks far past the end and writes, the gap should logically read back as zero bytes, exactly like a Unix sparse file, without physically materializing zero chunks for the gap. A read that crosses a missing chunk must fabricate zeros; a write into a hole creates only the chunks it actually touches.

Transactional and crash-safe semantics

Because large objects participate in SQL transactions, a write must be atomic with the surrounding transaction (rolled back on abort) and durable on commit. Engines that store chunks as ordinary rows get this from the table engine’s MVCC + WAL; engines that use a bespoke LOB segment must build logging and undo for it explicitly. The trade-off is bloat versus specialization: row-backed LOBs bloat under heavy overwrite (dead versions) but need zero new recovery code.

A transaction-scoped descriptor with a pinned snapshot

The file descriptor returned by open is a session/transaction-local handle, not a durable object. It carries the current seek offset and, for read mode, a snapshot so that successive reads of the same object see a consistent point-in-time image even as other transactions modify it. The descriptor and its snapshot must be cleaned up at transaction end (and reassigned across subtransaction boundaries), or they leak.

PostgreSQL’s Approach

PostgreSQL implements the inversion scheme with two system catalogs and one API file. The numbers are fixed by large_object.h:

// LOBLKSIZE / MAX_LARGE_OBJECT_SIZE — src/include/storage/large_object.h
#define LOBLKSIZE    (BLCKSZ / 4)
#define MAX_LARGE_OBJECT_SIZE  ((int64) INT_MAX * LOBLKSIZE)

With the default 8 KB block, LOBLKSIZE is 2048 bytes. The comment in the header explains the two reasons for BLCKSZ/4 rather than something near BLCKSZ: first, a smaller chunk means a partial-page write rewrites less data; second — and more subtly — the value is “deliberately chosen large enough to trigger the tuple toaster, so that we will attempt to compress page tuples in-line.” That is, a 2 KB data field is over the TOAST threshold, so PostgreSQL will try to LZ-compress each chunk in place (though it won’t push chunks out-of-line to a TOAST table unless the DBA manually creates one for pg_largeobject). With pageno an int32 and INT_MAX chunks of 2 KB each, the maximum object is INT_MAX * 2048 ≈ 4 TB.

The two catalogs

pg_largeobject_metadata holds one row per object: its OID is the large object’s identity, plus owner and ACL. pg_largeobject holds the data chunks, keyed by (loid, pageno):

// FormData_pg_largeobject — src/include/catalog/pg_largeobject.h
CATALOG(pg_largeobject,2613,LargeObjectRelationId)
{
  Oid      loid BKI_LOOKUP(pg_largeobject_metadata);  /* LO identity */
  int32    pageno;      /* Page number (starting from 0) */
  bytea    data BKI_FORCE_NOT_NULL;  /* Data for page (may be zero-length) */
} FormData_pg_largeobject;

DECLARE_UNIQUE_INDEX_PKEY(pg_largeobject_loid_pn_index, 2683,
  LargeObjectLOidPNIndexId, pg_largeobject, btree(loid oid_ops, pageno int4_ops));

The composite unique B-tree on (loid, pageno) is the workhorse: it lets every inversion operation phrase itself as “scan the chunks of object loid whose pageno is in some range, in order.” LargeObjectCreate inserts only the metadata row — no chunks — so a freshly created object has size zero and exists for permission purposes immediately:

// LargeObjectCreate — src/backend/catalog/pg_largeobject.c (condensed)
pg_lo_meta = table_open(LargeObjectMetadataRelationId, RowExclusiveLock);
if (OidIsValid(loid)) loid_new = loid;
else loid_new = GetNewOidWithIndex(pg_lo_meta, LargeObjectMetadataOidIndexId,
                   Anum_pg_largeobject_metadata_oid);
values[Anum_pg_largeobject_metadata_oid - 1] = ObjectIdGetDatum(loid_new);
values[Anum_pg_largeobject_metadata_lomowner - 1] = ObjectIdGetDatum(GetUserId());
/* ... lomacl from get_user_default_acl ... */
ntup = heap_form_tuple(RelationGetDescr(pg_lo_meta), values, nulls);
CatalogTupleInsert(pg_lo_meta, ntup);

pg_largeobject_metadata was introduced in PostgreSQL 9.0 specifically to attach ownership and per-object ACLs to large objects; before that, all LOs were effectively owned in common. Dependency tracking, however, still records the object under LargeObjectRelationId (the data catalog’s OID, 2613), not the metadata catalog — inv_create comments that this is a backwards-compatibility decision for pg_dump and other clients.

The inversion descriptor and open

inv_open is where existence, permission, and snapshot policy are decided. The single most important design point is the read/write snapshot asymmetry:

// inv_open — src/backend/storage/large_object/inv_api.c (condensed)
if (flags & INV_WRITE)  descflags |= IFS_WRLOCK | IFS_RDLOCK;
if (flags & INV_READ)   descflags |= IFS_RDLOCK;
/* ... reject descflags == 0 ... */

/* Get snapshot.  If write is requested, use an instantaneous snapshot. */
if (descflags & IFS_WRLOCK)
  snapshot = NULL;          /* => "current" semantics each access */
else
  snapshot = GetActiveSnapshot();    /* stable point-in-time view */

if (!LargeObjectExistsWithSnapshot(lobjId, snapshot)) /* existence check */
  ereport(ERROR, ... "large object %u does not exist" ...);
/* ACL_SELECT for RDLOCK, ACL_UPDATE for WRLOCK, both snapshot-aware */

A read-mode open captures GetActiveSnapshot() so that the whole sequence of lo_read calls against that descriptor sees one consistent image — a bytea read 4 KB at a time is the same object throughout, even if a concurrent transaction commits a rewrite. A write-mode open uses snapshot = NULL, meaning each catalog access uses a fresh up-to-date (SnapshotSelf/current) view, which is what you want for read-modify-write: you must see your own just-written chunks and the latest committed state. The flags IFS_RDLOCK/IFS_WRLOCK double as “permission was checked for this mode” markers, per the large_object.h note that since v11 permission is checked at open time.

flowchart TD
  O["inv_open(loid, flags, mcxt)"] --> M{"INV_WRITE<br/>set?"}
  M -->|yes| W["descflags = WRLOCK|RDLOCK<br/>snapshot = NULL (current)"]
  M -->|no, INV_READ| R["descflags = RDLOCK<br/>snapshot = GetActiveSnapshot()"]
  W --> E["LargeObjectExistsWithSnapshot"]
  R --> E
  E -->|missing| ERR["ERROR: does not exist"]
  E -->|exists| P["aclcheck: ACL_UPDATE if WRLOCK,<br/>ACL_SELECT if RDLOCK"]
  P --> D["alloc LargeObjectDesc in mcxt<br/>id, offset=0, flags, snapshot"]

Read: ordered index scan plus zero-fill for holes

inv_read translates the byte offset into a starting pageno, opens an ordered systable scan with two scan keys (loid = id and pageno >= start), and walks the chunks forward, copying the relevant slice of each and fabricating zeros for any gap between the expected offset and the next chunk’s actual offset. This is what makes large objects sparse:

// inv_read — src/backend/storage/large_object/inv_api.c (condensed)
int32 pageno = (int32) (obj_desc->offset / LOBLKSIZE);
ScanKeyInit(&skey[0], Anum_pg_largeobject_loid, BTEqualStrategyNumber, F_OIDEQ, ...);
ScanKeyInit(&skey[1], Anum_pg_largeobject_pageno, BTGreaterEqualStrategyNumber,
      F_INT4GE, Int32GetDatum(pageno));
sd = systable_beginscan_ordered(lo_heap_r, lo_index_r, obj_desc->snapshot, 2, skey);
while ((tuple = systable_getnext_ordered(sd, ForwardScanDirection)) != NULL)
{
  pageoff = ((uint64) data->pageno) * LOBLKSIZE;
  if (pageoff > obj_desc->offset)    /* a hole: synthesize zeros */
  {
    n = pageoff - obj_desc->offset;
    n = (n <= (nbytes - nread)) ? n : (nbytes - nread);
    MemSet(buf + nread, 0, n);
    nread += n; obj_desc->offset += n;
  }
  if (nread < nbytes) { /* copy slice from this chunk's data field */ }
  if (nread >= nbytes) break;
}

Write: read-modify-write of 2 KB chunks

inv_write is the most intricate routine. For each chunk that the write range touches it either updates an existing chunk (load old chunk into a stack workbuf, splice in the new bytes, zero-fill any intra-chunk hole, and heap_modify_tuple + CatalogTupleUpdateWithInfo) or inserts a brand-new chunk (heap_form_tuple + CatalogTupleInsertWithInfo). The read-modify-write is why a one-byte change rewrites a whole 2 KB tuple:

// inv_write — src/backend/storage/large_object/inv_api.c (condensed, update branch)
if (olddata != NULL && olddata->pageno == pageno)
{
  getdatafield(olddata, &datafield, &len, &pfreeit);   /* load old chunk */
  memcpy(workb, VARDATA(datafield), len);
  off = (int) (obj_desc->offset % LOBLKSIZE);
  if (off > len) MemSet(workb + len, 0, off - len);    /* intra-chunk hole */
  n = LOBLKSIZE - off;
  n = (n <= (nbytes - nwritten)) ? n : (nbytes - nwritten);
  memcpy(workb + off, buf + nwritten, n);              /* splice new bytes */
  nwritten += n; obj_desc->offset += n; off += n;
  len = (len >= off) ? len : off;
  SET_VARSIZE(&workbuf.hdr, len + VARHDRSZ);
  values[Anum_pg_largeobject_data - 1] = PointerGetDatum(&workbuf);
  replace[Anum_pg_largeobject_data - 1] = true;
  newtup = heap_modify_tuple(oldtuple, RelationGetDescr(lo_heap_r),
                 values, nulls, replace);
  CatalogTupleUpdateWithInfo(lo_heap_r, &newtup->t_self, newtup, indstate);
}

After the loop, inv_write issues CommandCounterIncrement() so that the just-written chunks become visible to subsequent operations in the same command sequence — essential when a lo_write is immediately followed by a lo_read or another lo_write in the same transaction.

The fmgr surface and the FD table

The SQL-visible functions (lo_open, loread, lowrite, lo_lseek, lo_creat, lo_unlink, lo_import, lo_export, lo_get, lo_put, …) live in be-fsstubs.c as be_lo_* fmgr wrappers. They maintain a per-transaction file-descriptor table — the cookies array of LargeObjectDesc * allocated in a private fscxt memory context — and an integer “fd” is just an index into it. be_lo_open allocates a slot, calls inv_open, stamps the descriptor with the current subtransaction id, and registers the read snapshot on TopTransactionResourceOwner so it outlives the current portal:

// be_lo_open — src/backend/libpq/be-fsstubs.c (condensed)
fd = newLOfd();                /* index into cookies[] in fscxt */
lobjDesc = inv_open(lobjId, mode, fscxt);
lobjDesc->subid = GetCurrentSubTransactionId();
if (lobjDesc->snapshot)
  lobjDesc->snapshot = RegisterSnapshotOnOwner(lobjDesc->snapshot,
                         TopTransactionResourceOwner);
cookies[fd] = lobjDesc;
PG_RETURN_INT32(fd);

These descriptors are valid only within the transaction that opened them; AtEOXact_LargeObject tears down the whole fscxt at transaction end and AtEOSubXact_LargeObject reassigns or closes them across subtransaction boundaries. The next section walks these flows symbol by symbol.

Source Walkthrough

The implementation splits cleanly into three layers: the catalog layer (pg_largeobject.c) that owns the two system tables, the inversion layer (inv_api.c) that presents a seekable byte stream, and the fmgr/FD layer (be-fsstubs.c) that exposes SQL functions and manages descriptors. We walk them bottom-up.

Catalog layer — `pg_largeobject.c`

LargeObjectCreate(loid) inserts one pg_largeobject_metadata row (owner = GetUserId(), default ACL via get_user_default_acl) and returns its OID, choosing a fresh OID with GetNewOidWithIndex when loid is InvalidOid. No data chunks are written, so the object’s size is zero until first write.

LargeObjectDrop(loid) deletes both halves: first the single pg_largeobject_metadata row (scanned via LargeObjectMetadataOidIndexId), then every pg_largeobject chunk for that loid by range-scanning the (loid, pageno) index and calling CatalogTupleDelete on each:

// LargeObjectDrop — src/backend/catalog/pg_largeobject.c (condensed)
ScanKeyInit(&skey[0], Anum_pg_largeobject_loid, BTEqualStrategyNumber, F_OIDEQ,
      ObjectIdGetDatum(loid));
scan = systable_beginscan(pg_largeobject, LargeObjectLOidPNIndexId, true,
              NULL, 1, skey);
while (HeapTupleIsValid(tuple = systable_getnext(scan)))
  CatalogTupleDelete(pg_largeobject, &tuple->t_self);
systable_endscan(scan);

LargeObjectExists(loid) / LargeObjectExistsWithSnapshot(loid, snapshot) answer existence by probing pg_largeobject_metadata (note: the metadata catalog, not the data catalog) under the given snapshot. The function comment is explicit that LO metadata is not cached in the syscache “for fear of using too much local memory,” so each check is a fresh systable_beginscan. The snapshot-aware variant exists precisely so that a read-mode inv_open checks existence under the same snapshot it will read with.

Inversion layer — `inv_api.c`

Relation caching. open_lo_relation() opens pg_largeobject and its index once per transaction, holding the references on TopTransactionResourceOwner so a subtransaction abort doesn’t yank them out from under a later operation. It takes RowExclusiveLock even for reads “since we might either read or write.” close_lo_relation(isCommit) releases them at main-transaction end (called from AtEOXact_LargeObject).

Chunk decoding. getdatafield() is the one place that touches the raw data column. Because a 2 KB chunk is over the TOAST threshold it may be stored compressed or in short-header form, so the function detoasts when VARATT_IS_EXTENDED, validates 0 <= len <= LOBLKSIZE (raising ERRCODE_DATA_CORRUPTED otherwise), and reports whether the caller must pfree the detoasted copy:

// getdatafield — src/backend/storage/large_object/inv_api.c (condensed)
datafield = &(tuple->data);        /* see note at top of file */
if (VARATT_IS_EXTENDED(datafield)) {
  datafield = (bytea *) detoast_attr((struct varlena *) datafield);
  freeit = true;
}
len = VARSIZE(datafield) - VARHDRSZ;
if (len < 0 || len > LOBLKSIZE)
  ereport(ERROR, (errcode(ERRCODE_DATA_CORRUPTED),
    errmsg("pg_largeobject entry for OID %u, page %d has invalid data field size %d",
         tuple->loid, tuple->pageno, len)));

Size. inv_getsize() exploits the (loid, pageno) index ordering: a single backward ordered scan returns the highest-pageno chunk first, so the size is pageno * LOBLKSIZE + len of that one tuple — no full scan needed. The comment notes LOs “can contain gaps, just like Unix files,” so this is the offset of the last byte + 1, not a count of stored bytes.

Seek/tell. inv_seek(obj_desc, offset, whence) updates the descriptor’s offset for SEEK_SET/SEEK_CUR/SEEK_END (the last calling inv_getsize), rejecting a negative result or one beyond MAX_LARGE_OBJECT_SIZE. inv_tell just returns the current offset. Both are allowed with either read or write permission, so neither does an ACL check.

Read. inv_read (walked in §3) starts an ordered two-key scan from the offset’s chunk and copies forward, zero-filling holes via MemSet. It enforces IFS_RDLOCK and reads under obj_desc->snapshot.

Write. inv_write (walked in §3) loops chunk by chunk. Three sub-cases: an existing chunk at the target pageno is updated in place (heap_modify_tuple → CatalogTupleUpdateWithInfo); a target pageno with no pre-existing chunk is inserted fresh (heap_form_tuple → CatalogTupleInsertWithInfo); and intra-chunk or whole-chunk holes are zero-filled with MemSet before splicing. It enforces IFS_WRLOCK, rejects writes that would exceed MAX_LARGE_OBJECT_SIZE, and ends with CommandCounterIncrement().

Truncate. inv_truncate(obj_desc, len) finds the chunk containing the truncation point. If that chunk exists, it is shortened in place (and any intra-chunk hole zero-filled); if the point lands in a hole, a new short chunk is created to mark end-of-data. Then all later chunks are deleted by continuing the ordered scan and calling CatalogTupleDelete:

// inv_truncate — src/backend/storage/large_object/inv_api.c (condensed tail)
/* ... write/shorten the boundary chunk at pageno ... */
if (olddata != NULL)  /* delete every chunk strictly after the boundary */
{
  while ((oldtuple = systable_getnext_ordered(sd, ForwardScanDirection)) != NULL)
    CatalogTupleDelete(lo_heap_r, &oldtuple->t_self);
}
systable_endscan_ordered(sd);
CatalogCloseIndexes(indstate);
CommandCounterIncrement();

Create/open/close/drop. inv_create calls LargeObjectCreate, records a dependency on the owner (under LargeObjectRelationId for compatibility), fires the post-create hook, and CommandCounterIncrements. inv_open (walked in §3) builds the LargeObjectDesc. inv_close just pfrees the descriptor. inv_drop calls performDeletion with DROP_CASCADE (which routes through the dependency machinery to LargeObjectDrop), then CommandCounterIncrements; it always returns 1 “for historical reasons.”

fmgr/FD layer — `be-fsstubs.c`

The descriptor table. newLOfd() lazily creates the fscxt (AllocSetContext under TopMemoryContext, named "Filesystem"), finds a free slot in cookies or doubles the array (first allocation is 64 slots), and returns the index. closeLOfd(fd) nulls the slot before freeing — “a leak is better than a crash” — unregisters the snapshot, and inv_closes:

// closeLOfd — src/backend/libpq/be-fsstubs.c (condensed)
lobj = cookies[fd];
cookies[fd] = NULL;          /* null first: avoid double-free on error */
if (lobj->snapshot)
  UnregisterSnapshotFromOwner(lobj->snapshot, TopTransactionResourceOwner);
inv_close(lobj);

Open/close/read/write/seek. be_lo_open (walked in §3) gates INV_WRITE behind PreventCommandIfReadOnly, allocates the fd, registers the snapshot. be_lo_close validates the fd and closeLOfds. The bare lo_read/lo_write helpers (not fmgr-callable) re-check the descriptor’s IFS_RDLOCK/IFS_WRLOCK so the error message is about the FD’s mode rather than the underlying privilege, then delegate to inv_read/inv_write. be_loread/be_lowrite are the bytea fmgr entry points over those. be_lo_lseek/be_lo_tell (32-bit) guard against int32 overflow and raise ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE; be_lo_lseek64/be_lo_tell64 return the full int64.

Create/unlink. be_lo_creat/be_lo_create call inv_create. be_lo_unlink checks ownership (object_ownercheck unless lo_compat_privileges), closes any open FDs referencing that loid by scanning cookies, then inv_drops — note it deliberately throws the ownership error before closing FDs.

Import/export. lo_import_internal opens a server-side file with OpenTransientFile, creates an LO, and streams it in BUFSIZE (8192-byte) gulps through inv_write. be_lo_export does the reverse with inv_read, dropping the umask to 022 around the OpenTransientFilePerm so exported files are not world-writable. Both run only on the server and require superuser-level access in practice (file access is server-side).

Whole-object SQL helpers. lo_get_fragment_internal opens read-mode, seeks to SEEK_END to size the result, clamps to MaxAllocSize, then seeks back and reads into a bytea. be_lo_get reads the whole object; be_lo_get_fragment reads a range. be_lo_from_bytea creates an LO and inv_writes a whole bytea into it; be_lo_put seeks and overwrites a range. These are the modern, FD-less, single-call accessors.

Transaction lifecycle. AtEOXact_LargeObject(isCommit) is the end-of-transaction hook. If any LO work happened (lo_cleanup_needed), on commit it closeLOfds every open descriptor (to avoid leaked-resource warnings), then zaps the cookies array and deletes fscxt entirely so the descriptors cannot survive the transaction; finally it calls close_lo_relation(isCommit):

// AtEOXact_LargeObject — src/backend/libpq/be-fsstubs.c (condensed)
if (!lo_cleanup_needed) return;
if (isCommit)
  for (i = 0; i < cookies_size; i++)
    if (cookies[i] != NULL) closeLOfd(i);
cookies = NULL; cookies_size = 0;
if (fscxt) MemoryContextDelete(fscxt);    /* descriptors die with the xact */
fscxt = NULL;
close_lo_relation(isCommit);
lo_cleanup_needed = false;

AtEOSubXact_LargeObject(isCommit, mySubid, parentSubid) handles subtransaction boundaries: a descriptor whose subid matches the committing subxact is reassigned to the parent (lo->subid = parentSubid); on abort it is closeLOfd’d. This is why every descriptor records the subtransaction that owns it.

flowchart TD
  T0["BEGIN"] --> OP["lo_open -> newLOfd<br/>cookies[fd] in fscxt"]
  OP --> RW["lo_read / lo_write<br/>via inv_read / inv_write"]
  RW --> SUB{"subxact<br/>boundary?"}
  SUB -->|commit| RE["AtEOSubXact: subid := parent"]
  SUB -->|abort| CL["AtEOSubXact: closeLOfd"]
  RE --> EOX
  CL --> EOX
  RW --> EOX{"end of xact"}
  EOX -->|commit| CC["AtEOXact: closeLOfd all,<br/>MemoryContextDelete(fscxt),<br/>close_lo_relation(true)"]
  EOX -->|abort| CA["AtEOXact: drop fscxt,<br/>close_lo_relation(false)"]

Position hints (as of 2026-06-05, REL_18 273fe94)

Symbol	File	Line
`LOBLKSIZE` (`BLCKSZ / 4`)	src/include/storage/large_object.h	70
`MAX_LARGE_OBJECT_SIZE`	src/include/storage/large_object.h	76
`LargeObjectDesc` struct	src/include/storage/large_object.h	39
`IFS_RDLOCK` / `IFS_WRLOCK`	src/include/storage/large_object.h	48–49
`FormData_pg_largeobject`	src/include/catalog/pg_largeobject.h	30
`pg_largeobject_loid_pn_index`	src/include/catalog/pg_largeobject.h	48
`FormData_pg_largeobject_metadata`	src/include/catalog/pg_largeobject_metadata.h	30
`INV_WRITE` / `INV_READ`	src/include/libpq/libpq-fs.h	21–22
`LargeObjectCreate`	src/backend/catalog/pg_largeobject.c	36
`LargeObjectDrop`	src/backend/catalog/pg_largeobject.c	95
`LargeObjectExists`	src/backend/catalog/pg_largeobject.c	167
`LargeObjectExistsWithSnapshot`	src/backend/catalog/pg_largeobject.c	176
`open_lo_relation`	src/backend/storage/large_object/inv_api.c	73
`close_lo_relation`	src/backend/storage/large_object/inv_api.c	97
`getdatafield`	src/backend/storage/large_object/inv_api.c	131
`inv_create`	src/backend/storage/large_object/inv_api.c	173
`inv_open`	src/backend/storage/large_object/inv_api.c	215
`inv_close`	src/backend/storage/large_object/inv_api.c	299
`inv_drop`	src/backend/storage/large_object/inv_api.c	311
`inv_getsize`	src/backend/storage/large_object/inv_api.c	340
`inv_seek`	src/backend/storage/large_object/inv_api.c	388
`inv_tell`	src/backend/storage/large_object/inv_api.c	437
`inv_read`	src/backend/storage/large_object/inv_api.c	450
`inv_write`	src/backend/storage/large_object/inv_api.c	543
`inv_truncate`	src/backend/storage/large_object/inv_api.c	740
`be_lo_open`	src/backend/libpq/be-fsstubs.c	87
`be_lo_close`	src/backend/libpq/be-fsstubs.c	126
`lo_read` / `lo_write`	src/backend/libpq/be-fsstubs.c	154 / 182
`be_lo_creat` / `be_lo_create`	src/backend/libpq/be-fsstubs.c	249 / 262
`be_lo_unlink`	src/backend/libpq/be-fsstubs.c	314
`be_loread` / `be_lowrite`	src/backend/libpq/be-fsstubs.c	362 / 380
`lo_import_internal`	src/backend/libpq/be-fsstubs.c	424
`be_lo_export`	src/backend/libpq/be-fsstubs.c	486
`be_lo_truncate` / `be_lo_truncate64`	src/backend/libpq/be-fsstubs.c	579 / 591
`AtEOXact_LargeObject`	src/backend/libpq/be-fsstubs.c	607
`AtEOSubXact_LargeObject`	src/backend/libpq/be-fsstubs.c	653
`newLOfd` / `closeLOfd`	src/backend/libpq/be-fsstubs.c	680 / 721
`lo_get_fragment_internal`	src/backend/libpq/be-fsstubs.c	746
`be_lo_get` / `be_lo_from_bytea` / `be_lo_put`	src/backend/libpq/be-fsstubs.c	797 / 832 / 855

Source verification (as of 2026-06-05)

Verified against /data/hgryoo/references/postgres at REL_18_STABLE, commit 273fe94 (PostgreSQL 18.x). Checks performed:

LOBLKSIZE = BLCKSZ / 4 and MAX_LARGE_OBJECT_SIZE = INT_MAX * LOBLKSIZE: confirmed verbatim in large_object.h lines 70 and 76. With the default 8 KB BLCKSZ this is 2048 bytes per chunk and a ~4 TB cap.
Two-catalog split: pg_largeobject (CATALOG(...,2613,...), columns loid, pageno, data) and pg_largeobject_metadata (CATALOG(...,2995,...), columns oid, lomowner, lomacl) confirmed in the respective headers; the composite unique B-tree pg_largeobject_loid_pn_index on btree(loid oid_ops, pageno int4_ops) is declared at pg_largeobject.h:48.
data BKI_FORCE_NOT_NULL: confirmed — the data column is forced NOT NULL even though initdb marks loid/pageno but not data; the inv_api.c head comment and the HeapTupleHasNulls “paranoia” checks corroborate.
Read/write snapshot asymmetry: inv_open sets snapshot = NULL for IFS_WRLOCK and snapshot = GetActiveSnapshot() for read-only, confirmed at inv_api.c:238–241. The matching RegisterSnapshotOnOwner / UnregisterSnapshotFromOwner calls are in be_lo_open and closeLOfd.
Sparse-hole zero-fill: the pageoff > obj_desc->offset branch in inv_read with MemSet(buf + nread, 0, n) confirmed at inv_api.c:505–512.
Read-modify-write: inv_write’s update branch uses heap_modify_tuple
- CatalogTupleUpdateWithInfo; its insert branch uses heap_form_tuple + CatalogTupleInsertWithInfo; both confirmed (lines ~675–721). Each mutating routine ends with CommandCounterIncrement().
FD table in fscxt: cookies/cookies_size/fscxt statics and the newLOfd/closeLOfd/AtEOXact_LargeObject/AtEOSubXact_LargeObject lifecycle confirmed in be-fsstubs.c. AtEOXact_LargeObject calls MemoryContextDelete(fscxt) on cleanup.
Permission model: pg_largeobject_aclcheck_snapshot with ACL_SELECT (read) / ACL_UPDATE (write) in inv_open; object_ownercheck for be_lo_unlink; all gated by the lo_compat_privileges GUC. Confirmed.
Caveat / version note: inv_read/inv_write are declared int (32-bit) byte counts, so a single SQL loread/lowrite call moves at most INT_MAX bytes regardless of the 4 TB object cap; the be_lo_lseek / be_lo_tell 32-bit wrappers raise ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE past 2 GB, which is why the *64 variants exist. This matches REL_18; no PG19-only symbols were relied on.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Large objects vs. `bytea`/TOAST — the in-house comparison

The first comparison is internal. PostgreSQL ships two ways to store big binary values, and the contrast is instructive:

Dimension	Large object (`lo`)	`bytea` + TOAST
Addressing	Object OID, separate from any row	Column value inside a row
Access	Seekable: `lo_lseek`/`lo_read` random access	All-or-nothing datum (detoast whole value)
Max size	~4 TB (`INT_MAX * LOBLKSIZE`)	~1 GB (varlena 30-bit length)
Chunk store	`pg_largeobject`, shared by all LOs	per-table TOAST relation, `TOAST_MAX_CHUNK_SIZE`
Partial update	`lo_put`/`inv_write` rewrites only touched chunks	rewrites the entire datum as a new row version
Streaming to client	chunked via `loread`/`lowrite` round-trips	one big value over the wire
Permissions	per-object owner + ACL (since 9.0)	inherit the table’s row privileges
Transactional	yes (heap MVCC + WAL)	yes (heap MVCC + WAL)

The decision rule the PostgreSQL docs effectively encode: use bytea when values are below ~a few MB and are read/written whole; use large objects when you need random access into multi-gigabyte blobs or values exceeding the 1 GB bytea ceiling. The shared mechanism under both is the same heap + B-tree chunk store — TOAST chunks a column, inversion chunks an object — which is why both inherit MVCC bloat under heavy overwrite. See postgres-toast.md for the column side.

The MVCC-bloat tax and vacuuming

Because every chunk is an ordinary heap tuple, overwriting LO data leaves dead tuples in pg_largeobject, exactly as updating a normal row does. A workload that repeatedly lo_puts into the same offsets of the same objects will bloat pg_largeobject and depend on autovacuum to reclaim space (see postgres-autovacuum.md and postgres-vacuum.md). There is also the orphaned-LO problem: dropping the referencing row in a user table does not drop the large object (the OID column is just an integer), so abandoned LOs accumulate. The community answer is the vacuumlo utility and the lo_manage trigger from the lo contrib module — both client/contrib tooling, outside this core analysis, but worth naming as the operational counterpart to the core mechanism.

Other engines

Oracle separates BFILE (a read-only pointer to an OS file, bytes not in the database) from internal LOBs (BLOB/CLOB/NCLOB) stored in dedicated LOB segments with their own chunk size and an optional SecureFiles layer that adds compression, deduplication, and encryption. Unlike inversion, Oracle’s LOB segments are a bespoke storage structure, not “just another table,” so they carry purpose-built logging and a separate undo/retention model.
SQL Server offers varbinary(max)/text/image stored in LOB_DATA/ROW_OVERFLOW_DATA allocation units, plus FILESTREAM and FileTables, which keep the bytes in the NTFS file system while preserving transactional consistency — closer in spirit to Oracle BFILE but transactional. Like PostgreSQL’s bytea, these are column-addressed, not object-addressed.
MySQL/InnoDB stores BLOB/TEXT overflow off-page in the same tablespace, column-addressed; there is no object-locator API analogous to lo_open.
JDBC/SQL standard locators. The standard’s BLOB/CLOB locator model — a handle valid for the duration of a transaction, supporting getBytes(pos, len) and setBytes — is conceptually the same contract as PostgreSQL’s transaction-scoped FD. The PostgreSQL JDBC driver maps java.sql.Blob onto the lo_* functions.

Research frontiers

The inversion idea — reuse the table engine for blobs instead of building a new store — anticipates a recurring theme in storage research: unbundling vs. reusing the storage layer. Modern object stores (S3-style) and disaggregated-storage databases push large values to an external, content-addressed tier and keep only a reference in the row, which is the BFILE/FILESTREAM direction taken to its cloud conclusion. The trade PostgreSQL made in the 1990s — pay MVCC bloat and write amplification to get transactional, crash-safe blobs for free — is exactly the trade that cloud-native engines revisit when they decide whether large values deserve a separate, append-only, log-structured tier (which avoids in-place rewrite bloat) or should stay inline for transactional simplicity. The unchanged LOBLKSIZE = BLCKSZ/4 and int32 pageno since the early releases are also a reminder of how a chunk-size and key-width choice ossifies: widening the 4 TB cap would require widening pageno and an initdb, a migration cost that keeps the original constants frozen.

Sources

PostgreSQL source (/data/hgryoo/references/postgres, REL_18_STABLE, commit 273fe94):
- src/backend/storage/large_object/inv_api.c — the inversion API: inv_create/open/close/drop/getsize/seek/tell/read/write/truncate, getdatafield, open_lo_relation/close_lo_relation.
- src/backend/libpq/be-fsstubs.c — fmgr be_lo_* wrappers, the cookies FD table in fscxt, lo_import/lo_export, lo_get/lo_put/ lo_from_bytea, and the AtEOXact/AtEOSubXact lifecycle.
- src/backend/catalog/pg_largeobject.c — LargeObjectCreate/Drop/ Exists/ExistsWithSnapshot.
- src/include/storage/large_object.h — LOBLKSIZE, MAX_LARGE_OBJECT_SIZE, LargeObjectDesc, IFS_RDLOCK/IFS_WRLOCK.
- src/include/catalog/pg_largeobject.h, src/include/catalog/pg_largeobject_metadata.h — the two catalogs and the composite (loid, pageno) index.
- src/include/libpq/libpq-fs.h — INV_WRITE/INV_READ flags.
Theory anchors (knowledge/research/dbms-general/):
- Database System Concepts, Silberschatz 7e — ch. 13 “Data Storage Structures” (page layout, large-value overflow), file/record organization.
- Database Internals, Petrov — “File Formats” (slotted pages, overflow pages, why oversize values must go out of line).
Cross-references (this repo): postgres-toast.md (column-addressed large values, varlena, in-line/out-of-line compression), postgres-heap-am.md (heap tuple insert/update/delete that LO chunks ride on), postgres-autovacuum.md / postgres-vacuum.md (reclaiming dead chunk tuples), postgres-page-layout.md (BLCKSZ, slotted pages), postgres-nbtree.md (the B-tree index backing the (loid, pageno) scans), postgres-mvcc-snapshots.md (the read-mode snapshot semantics).