PostgreSQL TOAST — Oversized Attribute Storage, Compression, and Detoasting

A fixed-size page of 8 KB is PostgreSQL’s storage atom. Every heap tuple must fit on a single page (the canonical limit is MaxHeapTupleSize, approximately 8 KB minus page and tuple overhead). This is a hard constraint: the page I/O model gives the buffer manager a uniform unit, and the slotted-page layout (Database Internals, Petrov, ch. 3 “File Formats”, §“Slotted Pages”) addresses records by (page, slot), and a slot cannot span two pages.

Real-world data routinely violates this constraint. A TEXT column holding a product description, a BYTEA column holding a document, or a JSONB column holding a large object can all exceed 8 KB. The engine must therefore provide a mechanism to store values that are larger than a page while preserving the illusion to the query layer that the datum is just another column in the tuple.

The two canonical approaches from Database System Concepts (Silberschatz, 7e, ch. 13 “Data Storage Structures”) are:

  1. Large-object / BLOB storage. The engine maintains a separate file or object store; the main tuple holds only a handle. The application must explicitly manage reads and writes through a large-object API. Oracle LOB columns and PostgreSQL’s own large-object facility (the pg_largeobject catalog and its lo_* functions) follow this model.

  2. Transparent overflow. The engine intercepts writes and reads automatically, so the application sees an ordinary column of arbitrary size. The planner, executor, and access methods operate on an opaque Datum pointer; the storage layer decides whether the datum lives inline or in a secondary store, and decompresses or reassembles it before handing it to the caller. This is the TOAST model.

TOAST (acronym coined by the PostgreSQL development team as “The Oversized-Attribute Storage Technique”, sometimes glossed humorously as “the best thing since sliced bread”) is the second approach. Its design constraints are:

  • Transparency. No SQL change is required from the user. Existing TEXT, BYTEA, and other variable-length (varlena) columns are eligible automatically based on their declared type storage strategy.
  • Two reduction strategies. The engine first tries inline compression to keep the datum on the main page. Only if compressed size still exceeds the threshold is the datum moved to an external table.
  • Random access slice retrieval. A caller that only needs part of a datum (e.g., substring(col, 1, 100)) should not have to fetch and decompress the entire datum.

The theoretical anchor for the compression side is general-purpose lossless data compression (PGLZ is a custom LZ-family algorithm; LZ4 is a widely used LZ77-family codec optimized for speed). Neither algorithm is described in the standard DBMS textbooks; both are engineering choices within the TOAST framework.

Every engine that stores variable-length values in fixed-size pages must solve the same header problem: a column value needs to carry its own length (the page does not store it separately for variable-length attributes), and that length field consumes bytes that reduce the usable data space. The universal answer is a compact variable-length header — a small integer prefix whose bit pattern encodes both the length and metadata flags. The design tension is between header size (smaller is better for small strings) and the range of sizes that can be expressed.

PostgreSQL’s struct varlena uses a two-tier scheme:

  • 4-byte header (varattrib_4b): the top two bits encode whether the datum is compressed or external; the remaining bits give the total length including the header. Supports values up to ~1 GB.
  • 1-byte short header (varattrib_1b): the top bit is 1 to signal “short”; the remaining 7 bits give the total length including the header. Supports values up to 126 bytes inline without paying the 4-byte tax.
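
The two-tier header above can be sketched in a few lines of C. This is a simplified model that follows the bullets literally (high bit set for a short header, low 7 bits for the total length); all function and macro names are hypothetical, and PostgreSQL's real macros in varatt.h place the flag bits differently depending on byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHORT_HDR_SIZE  1u      /* bytes used by the short header */
#define LONG_HDR_SIZE   4u      /* bytes used by the normal header */
#define SHORT_MAX_TOTAL 0x7Fu   /* largest total length that fits in 7 bits */

/* Can a payload of data_len bytes use the 1-byte header? */
static int fits_short_header(size_t data_len)
{
    return data_len + SHORT_HDR_SIZE <= SHORT_MAX_TOTAL;
}

/* Build the 1-byte header: high bit set, low 7 bits = total length. */
static uint8_t encode_short_header(size_t data_len)
{
    assert(fits_short_header(data_len));
    return (uint8_t) (0x80u | (data_len + SHORT_HDR_SIZE));
}

/* Recover the payload length from a short header byte. */
static size_t short_payload_len(uint8_t hdr)
{
    assert((hdr & 0x80u) != 0);     /* must be a short header */
    assert((hdr & 0x7Fu) >= 1);     /* total length includes the header */
    return (size_t) (hdr & 0x7Fu) - SHORT_HDR_SIZE;
}

/* Header bytes a value of data_len bytes pays under the two-tier scheme. */
static size_t header_cost(size_t data_len)
{
    return fits_short_header(data_len) ? SHORT_HDR_SIZE : LONG_HDR_SIZE;
}
```

In this model a 126-byte payload is the largest that still pays only one header byte; one byte more and the value needs the 4-byte form.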

The short-header optimization is a well-known technique: MySQL VARCHAR uses a 1- or 2-byte length prefix depending on declared column length; SQL Server uses VARLEN columns with a similar 2-byte length prefix. The exact encoding varies; the tradeoff is always header size vs. addressable range.

Engines that support transparent overflow typically store the out-of-line data in one of three places:

  1. A row-overflow page in the same file (SQL Server’s row-overflow and LOB pages; MySQL InnoDB’s off-page columns using the same tablespace).
  2. A dedicated secondary heap keyed by (value_id, chunk_seq) (Oracle LOB segments; PostgreSQL TOAST tables).
  3. An external file outside the page file (Oracle BFILE; PostgreSQL large objects are also external, but they are a separate mechanism from TOAST).

PostgreSQL TOAST takes approach 2: each relation that has at least one toastable column gets an associated pg_toast_<OID> heap with a schema of (chunk_id OID, chunk_seq INT4, chunk_data BYTEA). This keeps the out-of-line data inside the MVCC machinery (chunks are regular heap tuples subject to vacuum, snapshots, and WAL) but isolates them from the main heap to avoid layout interference.

Compressing before externalizing is a standard optimization: it reduces I/O, increases the chance that the datum stays inline, and, when the datum must still go out of line, leaves fewer chunks to write and later fetch. The practical question is when to give up and accept incompressibility. TOAST’s “savings of more than 2 bytes” threshold (if (VARSIZE(tmp) < valsize - 2) in toast_compress_datum) is typical: a compression that saves nothing net after header and alignment padding is not worth keeping.

Pointer-based indirection for out-of-line datums

When a datum is moved to secondary storage, the main tuple holds a small pointer (also called a toast pointer or indirect datum) rather than the data. The pointer must be small enough not to trigger further toasting and must carry enough information to retrieve the datum: a relation OID and a value OID, from which the engine can reconstruct the chunks. This pattern appears in every engine that supports transparent overflow; the exact pointer layout varies.

All PostgreSQL variable-length types use struct varlena as their in-memory and on-disk representation. The first one or four bytes are the header; the exact interpretation depends on the high bits of the first byte (shown here for big-endian machines; little-endian builds keep the same flags in the low-order bits of the first byte):

| Pattern | Encoding | Meaning |
| --- | --- | --- |
| 00xxxxxx xxxxxxxx xxxxxxxx xxxxxxxx | 4-byte normal | total length = low 30 bits; data inline, uncompressed |
| 01xxxxxx xxxxxxxx xxxxxxxx xxxxxxxx | 4-byte compressed | datum is PGLZ or LZ4 compressed; actual size in tcinfo |
| 1xxxxxxx | 1-byte short | total length = low 7 bits; data inline |
| 10000000 | 1-byte external tag | datum is an external pointer; VARTAG byte follows |

The external pointer’s VARTAG byte discriminates three kinds:

  • VARTAG_ONDISK = 18 — the standard varatt_external toast pointer.
  • VARTAG_INDIRECT = 1 — an in-memory indirect pointer to another varlena (used by the expanded-object and other transient mechanisms).
  • VARTAG_EXPANDED_RO / VARTAG_EXPANDED_RW = 2 / 3 — an expanded-object pointer (e.g., an in-memory array or composite in expanded form).

Only VARTAG_ONDISK ever appears on disk. The others are strictly in-memory representations produced during query execution.

Each column of a toastable type carries a typstorage flag (stored in pg_attribute.attstorage, default from pg_type.typstorage):

| Flag | Constant | Meaning |
| --- | --- | --- |
| 'p' | TYPSTORAGE_PLAIN | Never toast. Store inline as-is; fail if the tuple overflows the page. |
| 'e' | TYPSTORAGE_EXTERNAL | Allow out-of-line but not inline compression. Useful for large blobs where the caller will handle compression (e.g., already-compressed PNG). |
| 'm' | TYPSTORAGE_MAIN | Prefer inline; compress inline if possible; move out-of-line only as last resort. |
| 'x' | TYPSTORAGE_EXTENDED | Full TOAST: try inline compression first; move out-of-line if still too large. Default for TEXT, BYTEA, JSONB. |

The 'm' strategy is the weakest form of the promise “keep this inline”. The toaster will honour it up to TOAST_TUPLE_TARGET_MAIN (approximately one tuple per page), but if even that is violated, the datum goes out-of-line anyway.

Toasting threshold and the four-round algorithm

The toaster activates when a tuple’s data area exceeds TOAST_TUPLE_TARGET (approximately BLCKSZ / 4 - overhead, roughly 2 KB for the default 8 KB page). The logic lives in heap_toast_insert_or_update in heaptoast.c. It works through the tuple’s attributes in four rounds, each more drastic than the last:

Round 1 — Compress EXTENDED, externalize very large EXTENDED/EXTERNAL. For each attribute with attstorage = TYPSTORAGE_EXTENDED, call toast_tuple_try_compression. If after compression the individual attribute is still larger than maxDataLen, push it out-of-line immediately with toast_tuple_externalize. EXTERNAL attributes that cannot be compressed are marked TOASTCOL_INCOMPRESSIBLE and skipped in future compression passes.

Round 2 — Externalize all remaining EXTENDED/EXTERNAL. If the tuple still exceeds maxDataLen and a toast table exists (rel->rd_rel->reltoastrelid != InvalidOid), push remaining eligible attributes out-of-line.

Round 3 — Compress MAIN. If the tuple still exceeds maxDataLen, apply toast_tuple_try_compression to TYPSTORAGE_MAIN attributes. This is the inline-compression last resort for “prefer inline” columns.

Round 4 — Externalize MAIN. The target is relaxed to TOAST_TUPLE_TARGET_MAIN (approximately one full page). If still exceeded, push MAIN attributes out-of-line.
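
The four rounds can be played through on a toy model. The sketch below is not PostgreSQL code: every name is hypothetical, the "compressed size" of each attribute is supplied as an input rather than measured, and a toast table is assumed to exist. It only mirrors the ordering of the rounds and the "largest eligible attribute first" selection described above.

```c
#include <assert.h>
#include <stddef.h>

enum storage { ST_PLAIN, ST_EXTERNAL, ST_MAIN, ST_EXTENDED };

struct toy_attr {
    enum storage st;
    size_t size;        /* current inline size in bytes */
    size_t packed;      /* size it would have after compression (toy input) */
    int no_compress;    /* already compressed, or marked incompressible */
    int external;       /* replaced by an out-of-line pointer */
};

#define TOY_POINTER_SIZE 18     /* bytes left behind by an external pointer */

static size_t toy_data_size(const struct toy_attr *a, int n)
{
    size_t total = 0;
    for (int i = 0; i < n; i++)
        total += a[i].size;
    return total;
}

/* Largest attribute eligible this round, or -1 if none is worth touching. */
static int toy_biggest(const struct toy_attr *a, int n,
                       int for_compression, int check_main)
{
    int best = -1;
    size_t best_size = TOY_POINTER_SIZE;

    for (int i = 0; i < n; i++) {
        if (a[i].external || a[i].st == ST_PLAIN)
            continue;
        if (for_compression && a[i].no_compress)
            continue;
        if (check_main != (a[i].st == ST_MAIN))
            continue;           /* MAIN columns only in MAIN rounds */
        if (a[i].size > best_size) {
            best = i;
            best_size = a[i].size;
        }
    }
    return best;
}

static void toy_compress(struct toy_attr *a)
{
    if (a->packed < a->size)
        a->size = a->packed;
    a->no_compress = 1;         /* never try this attribute again */
}

static void toy_externalize(struct toy_attr *a)
{
    assert(!a->external);
    a->size = TOY_POINTER_SIZE;
    a->external = 1;
}

/* Run the four rounds; returns the final inline data size. */
static size_t toy_toast(struct toy_attr *a, int n,
                        size_t target, size_t target_main)
{
    int i;

    /* Round 1: compress EXTENDED; push out anything that alone busts the target. */
    while (toy_data_size(a, n) > target &&
           (i = toy_biggest(a, n, 1, 0)) >= 0) {
        if (a[i].st == ST_EXTENDED)
            toy_compress(&a[i]);
        else
            a[i].no_compress = 1;       /* EXTERNAL: never compressed */
        if (a[i].size > target)
            toy_externalize(&a[i]);
    }
    /* Round 2: externalize remaining EXTENDED/EXTERNAL, largest first. */
    while (toy_data_size(a, n) > target &&
           (i = toy_biggest(a, n, 0, 0)) >= 0)
        toy_externalize(&a[i]);
    /* Round 3: compress MAIN. */
    while (toy_data_size(a, n) > target &&
           (i = toy_biggest(a, n, 1, 1)) >= 0)
        toy_compress(&a[i]);
    /* Round 4: externalize MAIN, but only against the relaxed target. */
    while (toy_data_size(a, n) > target_main &&
           (i = toy_biggest(a, n, 0, 1)) >= 0)
        toy_externalize(&a[i]);

    return toy_data_size(a, n);
}
```

With a 2000-byte target, a 10000-byte EXTENDED value that only compresses to 6000 bytes is pushed out in Round 1, after which compressing the next-largest EXTENDED column is enough and the MAIN column is never touched.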

The entry point first deforms both the new and (for UPDATE) old tuple, initialises the ToastTupleContext, and computes the header overhead so it can convert TOAST_TUPLE_TARGET into a data-size limit. Note that the loop condition is re-evaluated by recomputing heap_compute_data_size on each iteration — the toaster reacts to the current (possibly already compressed/externalized) state of the value array, not a precomputed plan:

// heap_toast_insert_or_update — src/backend/access/heap/heaptoast.c
heap_deform_tuple(newtup, tupleDesc, toast_values, toast_isnull);
if (oldtup != NULL)
    heap_deform_tuple(oldtup, tupleDesc, toast_oldvalues, toast_oldisnull);
/* ... fill ttc fields ... */
toast_tuple_init(&ttc);
/* compute header overhead --- this should match heap_form_tuple() */
hoff = SizeofHeapTupleHeader;
if ((ttc.ttc_flags & TOAST_HAS_NULLS) != 0)
    hoff += BITMAPLEN(numAttrs);
hoff = MAXALIGN(hoff);
/* now convert to a limit on the tuple data size */
maxDataLen = RelationGetToastTupleTarget(rel, TOAST_TUPLE_TARGET) - hoff;
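
To make the budget arithmetic concrete, here is a standalone sketch of how the target and maxDataLen could be derived. The constants are assumptions for a default 64-bit build with 8 KB pages (8-byte maximum alignment, 24-byte page header, 4-byte line pointer, 23-byte fixed tuple header), the function names are hypothetical, and any per-relation toast_tuple_target override is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for a default 64-bit build; verify against your headers. */
#define TOY_MAXIMUM_ALIGNOF   8
#define TOY_PAGE_HEADER_SIZE  24    /* stands in for SizeOfPageHeaderData */
#define TOY_ITEM_ID_SIZE      4     /* stands in for sizeof(ItemIdData) */
#define TOY_TUPLE_HEADER_SIZE 23    /* stands in for SizeofHeapTupleHeader */

static size_t align_up(size_t n)
{
    return (n + TOY_MAXIMUM_ALIGNOF - 1) & ~(size_t) (TOY_MAXIMUM_ALIGNOF - 1);
}

static size_t align_down(size_t n)
{
    return n & ~(size_t) (TOY_MAXIMUM_ALIGNOF - 1);
}

/* Largest tuple size that still lets tuples_per_page tuples share one page. */
static size_t max_bytes_per_tuple(size_t blcksz, size_t tuples_per_page)
{
    size_t overhead = align_up(TOY_PAGE_HEADER_SIZE +
                               tuples_per_page * TOY_ITEM_ID_SIZE);
    return align_down((blcksz - overhead) / tuples_per_page);
}

/* Header overhead: fixed header plus optional null bitmap, then aligned. */
static size_t tuple_header_overhead(int natts, int has_nulls)
{
    size_t hoff = TOY_TUPLE_HEADER_SIZE;
    if (has_nulls)
        hoff += ((size_t) natts + 7) / 8;   /* one bit per attribute */
    return align_up(hoff);
}

/* Data budget for the first three rounds (four tuples per page). */
static size_t max_data_len(size_t blcksz, int natts, int has_nulls)
{
    size_t target = max_bytes_per_tuple(blcksz, 4);
    size_t hoff = tuple_header_overhead(natts, has_nulls);
    assert(target > hoff);
    return target - hoff;
}
```

Under these assumptions an 8 KB page yields a target of 2032 bytes, and a ten-column tuple without nulls gets a data budget of 2008 bytes; the same formula with one tuple per page gives the relaxed Round 4 target of 8160 bytes.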

The four passes share the same skeleton — pick the largest eligible attribute via toast_tuple_find_biggest_attribute, act on it, and re-test. The two boolean arguments to that selector encode the round’s policy: for_compression (only consider not-yet-compressed columns) and check_main (whether MAIN-storage columns are eligible this round):

// heap_toast_insert_or_update — src/backend/access/heap/heaptoast.c
/* Round 1: compress EXTENDED; externalize a single huge value early */
while (heap_compute_data_size(tupleDesc,
                              toast_values, toast_isnull) > maxDataLen)
{
    int biggest_attno = toast_tuple_find_biggest_attribute(&ttc, true, false);
    if (biggest_attno < 0) break;
    if (TupleDescAttr(tupleDesc, biggest_attno)->attstorage == TYPSTORAGE_EXTENDED)
        toast_tuple_try_compression(&ttc, biggest_attno);
    else
        /* has attstorage EXTERNAL, ignore on subsequent compression passes */
        toast_attr[biggest_attno].tai_colflags |= TOASTCOL_INCOMPRESSIBLE;
    /* if it alone still busts the budget, push it out now */
    if (toast_attr[biggest_attno].tai_size > maxDataLen &&
        rel->rd_rel->reltoastrelid != InvalidOid)
        toast_tuple_externalize(&ttc, biggest_attno, options);
}
/* Round 2: externalize remaining EXTENDED/EXTERNAL (needs a toast table) */
while (heap_compute_data_size(tupleDesc,
                              toast_values, toast_isnull) > maxDataLen &&
       rel->rd_rel->reltoastrelid != InvalidOid)
{
    int biggest_attno = toast_tuple_find_biggest_attribute(&ttc, false, false);
    if (biggest_attno < 0) break;
    toast_tuple_externalize(&ttc, biggest_attno, options);
}
/* Round 3: now take MAIN attributes into compression */
while (heap_compute_data_size(tupleDesc,
                              toast_values, toast_isnull) > maxDataLen)
{
    int biggest_attno = toast_tuple_find_biggest_attribute(&ttc, true, true);
    if (biggest_attno < 0) break;
    toast_tuple_try_compression(&ttc, biggest_attno);
}
/* Round 4: relax the budget to one tuple/page, then externalize MAIN */
maxDataLen = TOAST_TUPLE_TARGET_MAIN - hoff;
while (heap_compute_data_size(tupleDesc,
                              toast_values, toast_isnull) > maxDataLen &&
       rel->rd_rel->reltoastrelid != InvalidOid)
{
    int biggest_attno = toast_tuple_find_biggest_attribute(&ttc, false, true);
    if (biggest_attno < 0) break;
    toast_tuple_externalize(&ttc, biggest_attno, options);
}

The “externalize a single huge value early” branch in Round 1 is a deliberate optimisation: in the common case of one long TEXT/JSONB column and several short ones, pushing the giant out immediately avoids spending CPU compressing the short columns that were never the problem.

If any value was replaced, TOAST_NEEDS_CHANGE is set and the function rebuilds a fresh HeapTuple from the (now smaller) toast_values array via heap_fill_tuple, recomputing t_hoff because an intervening ALTER TABLE ADD COLUMN could have changed the null-bitmap width since the old tuple was stored.

toast_save_datum in toast_internals.c opens the toast relation, assigns a fresh valueid OID via GetNewOidWithIndex, slices the datum into TOAST_MAX_CHUNK_SIZE-byte chunks, and inserts each as a heap tuple (valueid, chunk_seq, chunk_data). After all chunks are inserted, it constructs and returns an 18-byte external pointer: a 2-byte tag header followed by the 16-byte varatt_external payload.

The on-disk pointer is the four-field varatt_external struct. The key subtlety is va_extinfo: it packs both the stored payload size and the compression method into one uint32, leaving va_rawsize to hold the fully-decompressed size (so the reader can pre-size its result buffer without decompressing):

// varatt_external — src/include/varatt.h
typedef struct varatt_external
{
    int32  va_rawsize;    /* Original data size (includes header) */
    uint32 va_extinfo;    /* External saved size (without header) and
                           * compression method */
    Oid    va_valueid;    /* Unique ID of value within TOAST table */
    Oid    va_toastrelid; /* RelID of TOAST table containing it */
} varatt_external;
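
The packing of va_extinfo can be illustrated with a standalone model. The sketch below assumes a 30-bit size field with the method id in the top 2 bits, and method ids of 0 for PGLZ and 1 for LZ4; the struct and function names are hypothetical stand-ins, not the macros from varatt.h.

```c
#include <assert.h>
#include <stdint.h>

#define EXTSIZE_BITS 30
#define EXTSIZE_MASK ((1u << EXTSIZE_BITS) - 1)
#define TOY_VARHDRSZ 4          /* size of a normal varlena header */

/* Assumed method ids; treat these values as illustrative. */
enum toy_method { TOY_PGLZ = 0, TOY_LZ4 = 1 };

struct toy_external {
    int32_t  rawsize;   /* original size, header included */
    uint32_t extinfo;   /* stored size (low 30 bits) + method (top 2 bits) */
};

static void set_size_and_method(struct toy_external *p,
                                uint32_t extsize, enum toy_method m)
{
    assert(extsize <= EXTSIZE_MASK);
    p->extinfo = extsize | ((uint32_t) m << EXTSIZE_BITS);
}

static uint32_t get_extsize(const struct toy_external *p)
{
    return p->extinfo & EXTSIZE_MASK;
}

static uint32_t get_method(const struct toy_external *p)
{
    return p->extinfo >> EXTSIZE_BITS;
}

/* Stored compressed iff the stored payload is smaller than the raw payload. */
static int is_compressed(const struct toy_external *p)
{
    return get_extsize(p) < (uint32_t) p->rawsize - TOY_VARHDRSZ;
}
```

One consequence of this layout, under the stated assumptions: an uncompressed external value has zero method bits, which is indistinguishable from the PGLZ id, so "is it compressed?" has to be answered by comparing the two sizes rather than by inspecting the method bits.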

toast_save_datum opens the toast relation and its indexes, then derives data_p / data_todo and the two pointer fields from the shape of the incoming datum. A short-header datum is written as if it had a normal header; an already-inline-compressed datum carries its compression method straight into va_extinfo (the datum is stored compressed, never re-compressed):

// toast_save_datum — src/backend/access/common/toast_internals.c
if (VARATT_IS_SHORT(dval))
{
    data_p = VARDATA_SHORT(dval);
    data_todo = VARSIZE_SHORT(dval) - VARHDRSZ_SHORT;
    toast_pointer.va_rawsize = data_todo + VARHDRSZ; /* as if not short */
    toast_pointer.va_extinfo = data_todo;
}
else if (VARATT_IS_COMPRESSED(dval))
{
    data_p = VARDATA(dval);
    data_todo = VARSIZE(dval) - VARHDRSZ;
    toast_pointer.va_rawsize = VARDATA_COMPRESSED_GET_EXTSIZE(dval) + VARHDRSZ;
    VARATT_EXTERNAL_SET_SIZE_AND_COMPRESS_METHOD(toast_pointer, data_todo,
                                                 VARDATA_COMPRESSED_GET_COMPRESS_METHOD(dval));
    Assert(VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer));
}
else
{
    data_p = VARDATA(dval);
    data_todo = VARSIZE(dval) - VARHDRSZ;
    toast_pointer.va_rawsize = VARSIZE(dval);
    toast_pointer.va_extinfo = data_todo;
}

The valueid is a fresh OID from GetNewOidWithIndex (uniqueness checked against the toast table’s own index). During CLUSTER/VACUUM FULL rewrites, rel->rd_toastoid is set and the code instead reuses the old value’s OID, short-circuiting the chunk loop with data_todo = 0 if the value already exists in the new toast table — this is how a rewrite avoids duplicating shared toast values.

The chunk loop itself slices data_p into TOAST_MAX_CHUNK_SIZE-byte spans, forms a (valueid, chunk_seq, chunk_data) heap tuple for each, and inserts both the tuple and a matching index entry:

// toast_save_datum — src/backend/access/common/toast_internals.c
t_values[0] = ObjectIdGetDatum(toast_pointer.va_valueid);
t_values[2] = PointerGetDatum(&chunk_data);
while (data_todo > 0)
{
    CHECK_FOR_INTERRUPTS();
    chunk_size = Min(TOAST_MAX_CHUNK_SIZE, data_todo);
    t_values[1] = Int32GetDatum(chunk_seq++);
    SET_VARSIZE(&chunk_data, chunk_size + VARHDRSZ);
    memcpy(VARDATA(&chunk_data), data_p, chunk_size);
    toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
    heap_insert(toastrel, toasttup, mycid, options, NULL);
    /* index entry for each toast index (columns mirror the table);
     * only indexes marked as ready are updated */
    for (i = 0; i < num_indexes; i++)
        if (toastidxs[i]->rd_index->indisready)
            index_insert(toastidxs[i], t_values, t_isnull,
                         &(toasttup->t_self), toastrel,
                         toastidxs[i]->rd_index->indisunique ?
                         UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
                         false, NULL);
    heap_freetuple(toasttup);
    data_todo -= chunk_size;
    data_p += chunk_size;
}

Finally it builds the 18-byte external varlena and hands it back to toast_tuple_externalize, which drops it into ttc_values[attno]:

// toast_save_datum — src/backend/access/common/toast_internals.c
result = (struct varlena *) palloc(TOAST_POINTER_SIZE);
SET_VARTAG_EXTERNAL(result, VARTAG_ONDISK);
memcpy(VARDATA_EXTERNAL(result), &toast_pointer, sizeof(toast_pointer));
return PointerGetDatum(result);

The TOAST table always has a B-tree index on (chunk_id, chunk_seq) — this is what heap_fetch_toast_slice scans with ScanKeyInit to retrieve chunks in order.

toast_compress_datum in toast_internals.c dispatches to pglz_compress_datum or lz4_compress_datum based on default_toast_compression (GUC, default pglz; lz4 requires compile-time USE_LZ4). The compression method ID is stored in the top 2 bits of va_extinfo for external datums and in tcinfo for inline-compressed datums.

// toast_compress_datum — src/backend/access/common/toast_internals.c
switch (cmethod) {
    case TOAST_PGLZ_COMPRESSION:
        tmp = pglz_compress_datum((const struct varlena *) value);
        cmid = TOAST_PGLZ_COMPRESSION_ID;
        break;
    case TOAST_LZ4_COMPRESSION:
        tmp = lz4_compress_datum((const struct varlena *) value);
        cmid = TOAST_LZ4_COMPRESSION_ID;
        break;
}
/* the compressor itself gave up: nothing to size-check */
if (tmp == NULL)
    return PointerGetDatum(NULL);
valsize = VARSIZE_ANY_EXHDR(DatumGetPointer(value));
if (VARSIZE(tmp) < valsize - 2) {
    /* net savings; keep compressed form */
    TOAST_COMPRESS_SET_SIZE_AND_COMPRESS_METHOD(tmp, valsize, cmid);
    return PointerGetDatum(tmp);
} else {
    /* incompressible */
    pfree(tmp);
    return PointerGetDatum(NULL);
}

The “net savings of more than 2 bytes” guard prevents compression from inflating small datums after header and alignment overhead.
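
The guard is easier to reason about once the header is made explicit. In the sketch below, which uses hypothetical names, the left-hand side of the comparison is the whole compressed varlena (payload plus an assumed 8-byte compressed header: a 4-byte length word and a 4-byte tcinfo word), while the right-hand side is the raw payload without any header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed size of the compressed-varlena header (length word + tcinfo). */
#define TOY_HDRSZ_COMPRESSED 8

/*
 * compressed_payload: bytes produced by the compressor.
 * raw_payload:        bytes of the original value, header excluded.
 * Returns nonzero if the compressed form is worth keeping.
 */
static int keep_compressed(int32_t compressed_payload, int32_t raw_payload)
{
    int32_t compressed_total = compressed_payload + TOY_HDRSZ_COMPRESSED;

    assert(compressed_payload >= 0 && raw_payload >= 0);
    return compressed_total < raw_payload - 2;
}
```

Under that assumption the compressor has to shave at least 11 bytes off the payload before the result is kept: a 100-byte value compressed to 89 bytes passes, while 90 bytes does not.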

The two compressors share a contract — take a plain varlena, return a VARHDRSZ_COMPRESSED-prefixed compressed varlena or NULL on failure — but differ in how they decide failure and in their speed/ratio tradeoff.

PGLZ is PostgreSQL’s in-tree LZ77-family coder. It refuses tiny or huge inputs up front (PGLZ_strategy_default bounds) and treats a negative return from the core pglz_compress as “incompressible”:

// pglz_compress_datum — src/backend/access/common/toast_compression.c
valsize = VARSIZE_ANY_EXHDR(value);
if (valsize < PGLZ_strategy_default->min_input_size ||
    valsize > PGLZ_strategy_default->max_input_size)
    return NULL;
tmp = (struct varlena *) palloc(PGLZ_MAX_OUTPUT(valsize) + VARHDRSZ_COMPRESSED);
len = pglz_compress(VARDATA_ANY(value), valsize,
                    (char *) tmp + VARHDRSZ_COMPRESSED, NULL);
if (len < 0)
{
    pfree(tmp);
    return NULL;
}
SET_VARSIZE_COMPRESSED(tmp, len + VARHDRSZ_COMPRESSED);
return tmp;

LZ4 (compiled in only under USE_LZ4; the stub raises an error otherwise) is far faster with a typically lower ratio. It sizes its buffer with LZ4_compressBound, hard-errors on a genuine library failure, and treats “output bigger than input” as the incompressible signal:

// lz4_compress_datum — src/backend/access/common/toast_compression.c
max_size = LZ4_compressBound(valsize);
tmp = (struct varlena *) palloc(max_size + VARHDRSZ_COMPRESSED);
len = LZ4_compress_default(VARDATA_ANY(value),
                           (char *) tmp + VARHDRSZ_COMPRESSED,
                           valsize, max_size);
if (len <= 0)
    elog(ERROR, "lz4 compression failed");
/* data is incompressible so just free the memory and return NULL */
if (len > valsize)
{
    pfree(tmp);
    return NULL;
}
SET_VARSIZE_COMPRESSED(tmp, len + VARHDRSZ_COMPRESSED);
return tmp;

Inside a compressed datum, the 4-byte length word is followed by the va_tcinfo word (tcinfo in toast_compress_header): 30 bits of original payload size plus a 2-bit ToastCompressionId. That id is what toast_decompress_datum reads back to choose the decompressor. Because the method travels with the datum, a table can hold a mix of PGLZ- and LZ4-compressed values after an ALTER TABLE ... ALTER COLUMN ... SET COMPRESSION:

// toast_decompress_datum — src/backend/access/common/detoast.c
cmid = TOAST_COMPRESS_METHOD(attr);
switch (cmid)
{
    case TOAST_PGLZ_COMPRESSION_ID:
        return pglz_decompress_datum(attr);
    case TOAST_LZ4_COMPRESSION_ID:
        return lz4_decompress_datum(attr);
    default:
        elog(ERROR, "invalid compression method id %d", cmid);
        return NULL; /* keep compiler quiet */
}

Both compressors can stop after producing a requested prefix, which is what the slice path (below) relies on: PGLZ exposes pglz_decompress_datum_slice, and LZ4’s LZ4_decompress_safe_partial does the same, but only for liblz4 ≥ 1.8.3, so lz4_decompress_datum_slice falls back to full decompression on older libraries:

// lz4_decompress_datum_slice — src/backend/access/common/toast_compression.c
/* slice decompression not supported prior to 1.8.3 */
if (LZ4_versionNumber() < 10803)
    return lz4_decompress_datum(value);
result = (struct varlena *) palloc(slicelength + VARHDRSZ);
rawsize = LZ4_decompress_safe_partial((char *) value + VARHDRSZ_COMPRESSED,
                                      VARDATA(result),
                                      VARSIZE(value) - VARHDRSZ_COMPRESSED,
                                      slicelength, slicelength);

detoast_attr in detoast.c is the full detoast path: fetch from external storage if needed, then decompress if needed, then expand from short-header if needed. The result is always a normal 4-byte-header varlena.

// detoast_attr — src/backend/access/common/detoast.c
if (VARATT_IS_EXTERNAL_ONDISK(attr))
{
    /* externally stored --- fetch it back from there */
    attr = toast_fetch_datum(attr);
    /* If it's compressed, decompress it */
    if (VARATT_IS_COMPRESSED(attr))
    {
        struct varlena *tmp = attr;
        attr = toast_decompress_datum(tmp);
        pfree(tmp);
    }
}
else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
{
    /* in-memory indirect pointer --- dereference and recurse */
    struct varatt_indirect redirect;
    VARATT_EXTERNAL_GET_POINTER(redirect, attr);
    attr = detoast_attr((struct varlena *) redirect.pointer);
    /* ... copy if it was already flat ... */
}
else if (VARATT_IS_EXTERNAL_EXPANDED(attr))
    attr = detoast_external_attr(attr); /* flatten expanded object */
else if (VARATT_IS_COMPRESSED(attr))
    attr = toast_decompress_datum(attr); /* inline-compressed only */
else if (VARATT_IS_SHORT(attr))
{
    /* short-header varlena --- convert to 4-byte header format */
    Size data_size = VARSIZE_SHORT(attr) - VARHDRSZ_SHORT;
    struct varlena *new_attr = (struct varlena *) palloc(data_size + VARHDRSZ);
    SET_VARSIZE(new_attr, data_size + VARHDRSZ);
    memcpy(VARDATA(new_attr), VARDATA_SHORT(attr), data_size);
    attr = new_attr;
}
return attr;

The five branches are mutually exclusive: a fully external on-disk datum first, then the two in-memory transient forms (INDIRECT, EXPANDED) that never appear on disk, then inline-compressed, then short-header. The post-condition is always a plain 4-byte-header varlena. If the input was already in that form, the same pointer is returned unchanged, so a caller should pfree the result only when it differs from the input.

toast_fetch_datum is the reassembly engine. It copies the (potentially unaligned) varatt_external out of the pointer, pre-sizes a result buffer to the stored size, marks it compressed or not so the caller’s VARATT_IS_COMPRESSED test works, and delegates the actual chunk read to the table AM:

// toast_fetch_datum — src/backend/access/common/detoast.c
VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
attrsize = VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer);
result = (struct varlena *) palloc(attrsize + VARHDRSZ);
if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
    SET_VARSIZE_COMPRESSED(result, attrsize + VARHDRSZ);
else
    SET_VARSIZE(result, attrsize + VARHDRSZ);
if (attrsize == 0)
    return result; /* shouldn't happen, but be safe */
toastrel = table_open(toast_pointer.va_toastrelid, AccessShareLock);
table_relation_fetch_toast_slice(toastrel, toast_pointer.va_valueid,
                                 attrsize, 0, attrsize, result);
table_close(toastrel, AccessShareLock);
return result;

table_relation_fetch_toast_slice is the table-AM hook; heap implements it as heap_fetch_toast_slice. That function computes the chunk range, builds 1–3 scan keys (equality on valueid, plus an optional equality or range condition on chunk_seq), and walks the toast index in order. The “all chunks” fast path uses a single key; a sub-range uses BTGreaterEqual / BTLessEqual bounds:

// heap_fetch_toast_slice — src/backend/access/heap/heaptoast.c
startchunk = sliceoffset / TOAST_MAX_CHUNK_SIZE;
endchunk = (sliceoffset + slicelength - 1) / TOAST_MAX_CHUNK_SIZE;
ScanKeyInit(&toastkey[0], (AttrNumber) 1,
            BTEqualStrategyNumber, F_OIDEQ, ObjectIdGetDatum(valueid));
if (startchunk == 0 && endchunk == totalchunks - 1)
    nscankeys = 1; /* whole value */
else if (startchunk == endchunk)
{
    ScanKeyInit(&toastkey[1], (AttrNumber) 2,
                BTEqualStrategyNumber, F_INT4EQ, Int32GetDatum(startchunk));
    nscankeys = 2; /* single chunk */
}
else
{
    ScanKeyInit(&toastkey[1], (AttrNumber) 2,
                BTGreaterEqualStrategyNumber, F_INT4GE, Int32GetDatum(startchunk));
    ScanKeyInit(&toastkey[2], (AttrNumber) 2,
                BTLessEqualStrategyNumber, F_INT4LE, Int32GetDatum(endchunk));
    nscankeys = 3; /* chunk range */
}
toastscan = systable_beginscan_ordered(toastrel, toastidxs[validIndex],
                                       get_toast_snapshot(), nscankeys, toastkey);
/* loop: copy each chunk's VARDATA into result, verifying curchunk == expectedchunk */

The per-chunk loop verifies curchunk == expectedchunk and that the chunk size matches the expected size for its position, raising ERRCODE_DATA_CORRUPTED on any gap, duplicate, or out-of-order chunk — a cheap integrity check that catches toast-table corruption at read time.
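
The chunk arithmetic and the per-chunk size check can be tried in isolation. The following sketch takes the chunk size as a parameter instead of using TOAST_MAX_CHUNK_SIZE, and all names are hypothetical; it reproduces only the integer arithmetic, not the index scan.

```c
#include <assert.h>
#include <stdint.h>

struct chunk_range {
    int32_t start;  /* first chunk touched by the slice */
    int32_t end;    /* last chunk touched by the slice */
    int32_t total;  /* number of chunks in the whole value */
};

/* Map a byte slice [offset, offset + length) onto chunk numbers. */
static struct chunk_range slice_chunks(int32_t attrsize, int32_t offset,
                                       int32_t length, int32_t chunk_size)
{
    struct chunk_range r;

    assert(attrsize > 0 && chunk_size > 0);
    assert(offset >= 0 && length > 0 && offset + length <= attrsize);
    r.total = (attrsize - 1) / chunk_size + 1;
    r.start = offset / chunk_size;
    r.end = (offset + length - 1) / chunk_size;
    return r;
}

/* 1 key: whole value; 2 keys: a single chunk; 3 keys: a chunk range. */
static int scan_keys_needed(struct chunk_range r)
{
    if (r.start == 0 && r.end == r.total - 1)
        return 1;
    if (r.start == r.end)
        return 2;
    return 3;
}

/* Size a chunk must have for its position; only the last may be short. */
static int32_t expected_chunk_size(int32_t attrsize, int32_t chunk_no,
                                   int32_t chunk_size)
{
    int32_t total = (attrsize - 1) / chunk_size + 1;

    assert(chunk_no >= 0 && chunk_no < total);
    return chunk_no < total - 1
        ? chunk_size
        : attrsize - (total - 1) * chunk_size;
}
```

For a 5000-byte value with an illustrative 2000-byte chunk size, a 100-byte prefix maps to chunk 0 alone (two scan keys), a read of the full value needs only the valueid key, and the final chunk is expected to hold exactly 1000 bytes.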

The slice variant detoast_attr_slice supports partial retrieval. For uncompressed external datums it takes a fast path straight to toast_fetch_datum_slice (which narrows the chunk range as shown above). For compressed external datums it must fetch enough compressed bytes to cover the requested decompressed prefix: PGLZ exposes pglz_maximum_compressed_size to bound that; LZ4 has no equivalent, so the slice path fetches the entire compressed value:

// detoast_attr_slice — src/backend/access/common/detoast.c
if (VARATT_IS_EXTERNAL_ONDISK(attr))
{
    VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
    /* fast path for non-compressed external datums */
    if (!VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
        return toast_fetch_datum_slice(attr, sliceoffset, slicelength);
    /* compressed: fetch enough to decompress the requested prefix */
    if (slicelimit >= 0)
    {
        int32 max_size = VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer);
        if (VARATT_EXTERNAL_GET_COMPRESS_METHOD(toast_pointer) ==
            TOAST_PGLZ_COMPRESSION_ID)
            max_size = pglz_maximum_compressed_size(slicelimit, max_size);
        /* (LZ4 has no such bound, so max_size stays = full size) */
        preslice = toast_fetch_datum_slice(attr, 0, max_size);
    }
    else
        preslice = toast_fetch_datum(attr);
}
/* ... INDIRECT / EXPANDED / inline cases ... */
if (VARATT_IS_COMPRESSED(preslice))
{
    struct varlena *tmp = preslice;
    if (slicelimit >= 0)
        preslice = toast_decompress_datum_slice(tmp, slicelimit);
    else
        preslice = toast_decompress_datum(tmp);
    if (tmp != attr)
        pfree(tmp);
}
/* ... then copy [sliceoffset, sliceoffset+slicelength) out of preslice ... */

This is why substring(big_text, 1, 100) on an uncompressed external value touches only the first chunk, but the same call on an LZ4-compressed value fetches every chunk of the datum (decompression itself can still stop early on liblz4 ≥ 1.8.3): the prefix bound is a property of the compressor, not of TOAST.
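
The cost difference can be estimated with a small model. The PGLZ bound below is an assumption modeled on the idea that, in the worst case, every literal byte costs nine bits, plus two bytes of slack for a trailing match tag; it is a sketch of that reasoning, not a copy of pglz_maximum_compressed_size. The codec enum, function names, and chunk size parameter are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum toy_codec { TOY_NONE, TOY_PGLZ, TOY_LZ4 };

/*
 * Assumed worst case for PGLZ: one control bit per literal byte
 * (9 bits per output byte, rounded up) plus 2 bytes for a match tag
 * that straddles the boundary, capped at the total compressed size.
 */
static int32_t pglz_prefix_bound(int32_t raw_prefix, int32_t total_compressed)
{
    int64_t need = ((int64_t) raw_prefix * 9 + 7) / 8 + 2;

    return (int32_t) (need < total_compressed ? need : total_compressed);
}

/* Stored bytes that must be fetched to produce the first `prefix` bytes. */
static int32_t bytes_to_fetch(enum toy_codec codec, int32_t prefix,
                              int32_t stored_size)
{
    switch (codec) {
    case TOY_NONE:
        return prefix < stored_size ? prefix : stored_size;
    case TOY_PGLZ:
        return pglz_prefix_bound(prefix, stored_size);
    case TOY_LZ4:
        return stored_size;     /* no bound available: fetch everything */
    }
    return stored_size;
}

/* Number of chunks covering the first `bytes` bytes of a stored value. */
static int32_t chunks_to_fetch(int32_t bytes, int32_t chunk_size)
{
    assert(chunk_size > 0);
    return bytes <= 0 ? 0 : (bytes - 1) / chunk_size + 1;
}
```

For a value stored as 100000 bytes with an illustrative 2000-byte chunk size, a 100-byte prefix costs one chunk when the value is uncompressed or PGLZ-compressed (the assumed bound is 115 bytes), and all 50 chunks when it is LZ4-compressed.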

toast_fetch_datum opens the toast relation and calls table_relation_fetch_toast_slice, which dispatches to heap_fetch_toast_slice in heaptoast.c. That function opens the toast indexes, runs systable_beginscan_ordered with the snapshot returned by get_toast_snapshot, and reads chunks in chunk_seq order, copying each chunk’s VARDATA into the pre-allocated result buffer.

Detoasting requires a snapshot to read the TOAST table. get_toast_snapshot (in toast_internals.c) returns &SnapshotToastData, a special snapshot type that does almost no visibility checking of its own: if the main-table row holding the toast pointer is visible, its chunks are assumed to be visible too. It enforces one safety rule: the session must hold a registered or active snapshot before detoasting is attempted, so that the TOAST data cannot be vacuumed away between the main-table fetch and the detoast fetch.

// get_toast_snapshot — src/backend/access/common/toast_internals.c
if (!HaveRegisteredOrActiveSnapshot())
    elog(ERROR, "cannot fetch toast data without an active snapshot");
return &SnapshotToastData;

TOAST is not hardcoded to the heap access method. The table AM interface (tableam.h) exposes table_relation_fetch_toast_slice, which the heap AM implements as heap_fetch_toast_slice. A custom AM that stores large values differently can provide its own implementation. The per-attribute helpers in toast_helper.c and the read path in detoast.c are AM-agnostic; the four-round driver in heaptoast.c and the chunk storage and retrieval are heap-specific.

Write path (INSERT/UPDATE):

flowchart TD
    A[heap_insert / heap_update] --> B[heap_toast_insert_or_update]
    B --> C{tuple > TOAST_TUPLE_TARGET?}
    C -- No --> Z[return original tuple]
    C -- Yes --> D[Round 1: compress EXTENDED\nthen externalize if still large]
    D --> E[Round 2: externalize remaining\nEXTENDED / EXTERNAL]
    E --> F[Round 3: compress MAIN]
    F --> G[Round 4: externalize MAIN\ntarget = TOAST_TUPLE_TARGET_MAIN]
    G --> H[heap_fill_tuple with\nreplaced Datum values]
    H --> Z2[return new tuple]

    D -->|toast_save_datum| T1[toast rel: INSERT chunks\nassign valueid OID\nreplace Datum with varatt_external pointer]
    E -->|toast_save_datum| T1
    G -->|toast_save_datum| T1

Read path (detoasting):

flowchart TD
    R[executor reads Datum from slot] --> V{varlena tag?}
    V -- short header --> S[detoast_attr: expand to 4-byte header]
    V -- inline compressed --> IC[detoast_attr: decompress\npglz / lz4]
    V -- VARTAG_ONDISK --> E1[toast_fetch_datum:\nopen pg_toast_N\nsystable_beginscan_ordered\nread chunks in order]
    E1 --> E2{chunk data compressed?}
    E2 -- Yes --> IC
    E2 -- No --> Done[return assembled varlena]
    IC --> Done
    S --> Done

Slice read (partial retrieval):

flowchart TD
    P[detoast_attr_slice sliceoffset slicelength] --> Q{VARTAG_ONDISK?}
    Q -- Not compressed --> QA[toast_fetch_datum_slice:\nnarrow chunk range scan]
    Q -- PGLZ compressed --> QB[pglz_maximum_compressed_size\nfetch minimal prefix chunks]
    Q -- LZ4 compressed --> QC[fetch all chunks\nno streaming prefix API]
    QA --> D2[copy slice from assembled buffer]
    QB --> D3[decompress prefix then slice]
    QC --> D3
    D2 --> Out[return slice varlena]
    D3 --> Out

End-to-end datum lifecycle (one EXTENDED column):

This view follows a single oversized TEXT/JSONB value across the symbol boundaries — from the toaster decision down through chunk storage, and back up through reassembly and decompression on read.

flowchart TD
    subgraph WRITE [Write path]
      W1[toast_tuple_try_compression] --> W2[toast_compress_datum]
      W2 --> W3{pglz_compress_datum / lz4_compress_datum<br/>net saved over 2 bytes?}
      W3 -- No --> W4[keep raw varlena]
      W3 -- Yes --> W5[set va_tcinfo: size + method id]
      W4 --> W6[toast_tuple_externalize]
      W5 --> W6
      W6 --> W7[toast_save_datum]
      W7 --> W8[slice into TOAST_MAX_CHUNK_SIZE chunks<br/>heap_insert + index_insert per chunk]
      W8 --> W9[build varatt_external<br/>va_rawsize / va_extinfo / va_valueid / va_toastrelid]
      W9 --> W10[replace Datum with VARTAG_ONDISK pointer]
    end

    W10 -.stored on disk.-> R1

    subgraph READ [Read path]
      R1[detoast_attr] --> R2{VARATT_IS_EXTERNAL_ONDISK?}
      R2 -- Yes --> R3[toast_fetch_datum]
      R3 --> R4[table_relation_fetch_toast_slice<br/>heap_fetch_toast_slice]
      R4 --> R5[systable_beginscan_ordered on toast index<br/>get_toast_snapshot]
      R5 --> R6{VARATT_IS_COMPRESSED result?}
      R6 -- Yes --> R7[toast_decompress_datum<br/>pglz / lz4 by va_tcinfo method id]
      R6 -- No --> R8[plain varlena]
      R7 --> R8
    end
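
Two pieces of arithmetic on the write path are easy to check in isolation: the "saved more than 2 bytes" test that decides whether a compressed result is kept, and the number of chunks a stored value is sliced into. The sketch below is illustrative only. The function names are hypothetical, both sizes passed to `keep_compressed` are assumed to be measured the same way (header included), and the chunk size is the 1996-byte figure quoted in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CHUNK_SIZE 1996  /* TOAST_MAX_CHUNK_SIZE for 8 KB pages, per this document */

/* Keep the compressed form only if it is more than 2 bytes smaller. */
bool keep_compressed(int32_t raw_size, int32_t compressed_size)
{
    return compressed_size < raw_size - 2;
}

/* Number of TOAST rows needed for a stored value of the given size. */
int32_t chunk_count(int32_t stored_size)
{
    assert(stored_size > 0);
    return (stored_size + CHUNK_SIZE - 1) / CHUNK_SIZE;
}

/* Every chunk is full except possibly the last one. */
int32_t last_chunk_size(int32_t stored_size)
{
    int32_t rem;

    assert(stored_size > 0);
    rem = stored_size % CHUNK_SIZE;
    return rem == 0 ? CHUNK_SIZE : rem;
}
```

A 10000-byte stored value therefore occupies six chunks, five full ones and a final 20-byte chunk, each with its own heap row and index entry.
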

| Symbol | File | Role |
| --- | --- | --- |
| heap_toast_insert_or_update | access/heap/heaptoast.c | Entry point called by heap_insert / heap_update; drives the four-round loop |
| heap_toast_delete | access/heap/heaptoast.c | Cascaded delete of toast rows when main tuple is deleted |
| toast_tuple_init | access/common/toast_helper.c | Initialises ToastTupleContext, classifies each attribute |
| toast_tuple_find_biggest_attribute | access/common/toast_helper.c | Selects the largest eligible attribute for each round |
| toast_tuple_try_compression | access/common/toast_helper.c | Calls toast_compress_datum; replaces value in ttc_values if compressed |
| toast_tuple_externalize | access/common/toast_helper.c | Calls toast_save_datum; replaces value with varatt_external pointer |
| toast_tuple_cleanup | access/common/toast_helper.c | Frees old external values that were replaced |
| toast_save_datum | access/common/toast_internals.c | Opens toast rel, assigns valueid, inserts chunks, returns pointer |
| toast_delete_datum | access/common/toast_internals.c | Deletes all chunks for one valueid |
| toast_delete_external | access/common/toast_internals.c | Iterates columns and calls toast_delete_datum for external ones |
| toast_compress_datum | access/common/toast_internals.c | Dispatches to pglz_compress_datum or lz4_compress_datum |
| toast_open_indexes | access/common/toast_internals.c | Opens all indexes on the toast relation; returns the valid one |
| toast_close_indexes | access/common/toast_internals.c | Closes toast indexes and frees the array |


| Symbol | File | Role |
| --- | --- | --- |
| detoast_attr | access/common/detoast.c | Full detoast: fetch external + decompress + expand short header |
| detoast_external_attr | access/common/detoast.c | Fetch external datum only (may still be compressed) |
| detoast_attr_slice | access/common/detoast.c | Partial retrieval; dispatches narrow scan or prefix decompress |
| toast_fetch_datum | access/common/detoast.c (static) | Reassembles all chunks for a varatt_external datum |
| toast_fetch_datum_slice | access/common/detoast.c (static) | Reassembles a chunk range for a varatt_external datum |
| toast_decompress_datum | access/common/detoast.c (static) | Dispatches to pglz_decompress_datum or lz4_decompress_datum |
| toast_decompress_datum_slice | access/common/detoast.c (static) | Decompresses only a prefix length of the datum |
| heap_fetch_toast_slice | access/heap/heaptoast.c | AM-side chunk retrieval: systable_beginscan_ordered on the toast index |
| get_toast_snapshot | access/common/toast_internals.c | Returns SnapshotToastData; enforces active-snapshot precondition |
| toast_raw_datum_size | access/common/detoast.c | Returns decompressed size without fully detoasting |
| toast_datum_size | access/common/detoast.c | Returns physical stored size |


| Symbol | File | Role |
| --- | --- | --- |
| pglz_compress_datum | access/common/toast_compression.c | Compresses via pglz_compress; returns NULL if incompressible |
| pglz_decompress_datum | access/common/toast_compression.c | Full PGLZ decompression |
| pglz_decompress_datum_slice | access/common/toast_compression.c | Partial PGLZ decompression |
| lz4_compress_datum | access/common/toast_compression.c | LZ4 compression (requires USE_LZ4) |
| lz4_decompress_datum | access/common/toast_compression.c | Full LZ4 decompression |
| lz4_decompress_datum_slice | access/common/toast_compression.c | Partial LZ4 decompression (requires liblz4 ≥ 1.8.3) |
| toast_get_compression_id | access/common/toast_compression.c | Extracts ToastCompressionId from a varlena |
| CompressionNameToMethod | access/common/toast_compression.c | Maps "pglz" / "lz4" string to compression method char |


| Symbol | File | Role |
| --- | --- | --- |
| varatt_external | include/varatt.h | On-disk toast pointer payload, 16 bytes (18 bytes with the 2-byte external header): va_rawsize, va_extinfo, va_valueid, va_toastrelid |
| varatt_indirect | include/varatt.h | In-memory indirect pointer to another varlena |
| ToastAttrInfo | include/access/toast_helper.h | Per-attribute classification flags and current size within ToastTupleContext |
| ToastTupleContext | include/access/toast_helper.h | Working context for the four-round loop: rel, values, isnull, oldvalues, attr array, flags |
| toast_compress_header | include/access/toast_internals.h | Inline-compressed datum header: vl_len_ + tcinfo (size + method ID) |

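
The four fields of varatt_external can be modelled with a plain struct to show how va_extinfo carries two values at once. This is a sketch under stated assumptions, not a copy of varatt.h: it assumes the stored size occupies the low 30 bits of va_extinfo and the compression method the top 2 bits, that method id 0 means pglz and 1 means lz4, and that a value counts as compressed when its stored size is smaller than va_rawsize minus the 4-byte varlena header. All names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXTSIZE_BITS   30                           /* assumed width of the size field */
#define EXTSIZE_MASK   ((1U << EXTSIZE_BITS) - 1)
#define MODEL_VARHDRSZ 4                            /* 4-byte varlena header */

/* Illustrative mirror of the varatt_external field order. */
typedef struct {
    int32_t  va_rawsize;     /* original size, including the varlena header  */
    uint32_t va_extinfo;     /* stored size (low 30 bits) + method (2 bits)  */
    uint32_t va_valueid;     /* Oid identifying the value in the toast table */
    uint32_t va_toastrelid;  /* Oid of the toast table                       */
} ToastPointerModel;

ToastPointerModel make_pointer(int32_t rawsize, uint32_t extsize,
                               uint32_t method, uint32_t valueid,
                               uint32_t toastrelid)
{
    ToastPointerModel p;

    assert(extsize <= EXTSIZE_MASK && method < 4);
    p.va_rawsize = rawsize;
    p.va_extinfo = extsize | (method << EXTSIZE_BITS);
    p.va_valueid = valueid;
    p.va_toastrelid = toastrelid;
    return p;
}

uint32_t ext_size(ToastPointerModel p)   { return p.va_extinfo & EXTSIZE_MASK; }
uint32_t ext_method(ToastPointerModel p) { return p.va_extinfo >> EXTSIZE_BITS; }

/* Compressed if fewer bytes are stored than the raw payload would need. */
bool ext_is_compressed(ToastPointerModel p)
{
    return ext_size(p) < (uint32_t) (p.va_rawsize - MODEL_VARHDRSZ);
}
```

With four 4-byte fields the struct is 16 bytes, which together with the 2-byte external header gives the 18-byte TOAST_POINTER_SIZE.
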

| Constant | Definition | Value (8 KB page) |
| --- | --- | --- |
| TOAST_TUPLE_TARGET | heaptoast.h | ~2 KB (≡ TOAST_TUPLE_THRESHOLD, 4 tuples/page) |
| TOAST_TUPLE_TARGET_MAIN | heaptoast.h | ~8 KB (1 tuple/page, one full page) |
| TOAST_MAX_CHUNK_SIZE | heaptoast.h | ~1996 bytes (4 chunks/page minus overhead) |
| TOAST_POINTER_SIZE | detoast.h | 18 bytes (VARHDRSZ_EXTERNAL + sizeof(varatt_external)) |
| TYPSTORAGE_PLAIN/EXTERNAL/MAIN/EXTENDED | pg_type.h | 'p', 'e', 'm', 'x' |
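
The constant values listed above can be reproduced by hand. The sketch below redoes the arithmetic for an 8 KB page, assuming a 24-byte page header, 4-byte line pointers, a 23-byte heap tuple header, and 8-byte maximum alignment. These are typical 64-bit build values stated here as assumptions, not read from the headers, and all names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_BLCKSZ       8192  /* page size                                 */
#define MODEL_ALIGN        8     /* assumed maximum alignment                 */
#define MODEL_PAGE_HEADER  24    /* assumed page header size                  */
#define MODEL_ITEM_ID      4     /* assumed line pointer size                 */
#define MODEL_TUPLE_HEADER 23    /* assumed heap tuple header size            */
#define MODEL_OID          4     /* chunk_id column                           */
#define MODEL_INT32        4     /* chunk_seq column                          */
#define MODEL_VARHDRSZ     4     /* varlena header of the chunk_data column   */

size_t align_up(size_t n)   { return (n + MODEL_ALIGN - 1) & ~(size_t) (MODEL_ALIGN - 1); }
size_t align_down(size_t n) { return n & ~(size_t) (MODEL_ALIGN - 1); }

/* Largest tuple size that still lets tuples_per_page tuples share a page. */
size_t max_bytes_per_tuple(size_t tuples_per_page)
{
    size_t overhead;

    assert(tuples_per_page > 0);
    overhead = align_up(MODEL_PAGE_HEADER + tuples_per_page * MODEL_ITEM_ID);
    return align_down((MODEL_BLCKSZ - overhead) / tuples_per_page);
}

/* Chunk payload: per-tuple budget minus tuple header and the two key columns. */
size_t toast_max_chunk_size(void)
{
    return max_bytes_per_tuple(4)
        - align_up(MODEL_TUPLE_HEADER)
        - MODEL_OID - MODEL_INT32 - MODEL_VARHDRSZ;
}
```

Under these assumptions `max_bytes_per_tuple(4)` gives the roughly 2 KB target, `max_bytes_per_tuple(1)` gives the roughly 8 KB MAIN target, and `toast_max_chunk_size()` gives the 1996-byte chunk payload.
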

Position hints (as of 2026-06-05, commit 273fe94)

| Symbol | File | Approx. line |
| --- | --- | --- |
| heap_toast_insert_or_update | src/backend/access/heap/heaptoast.c | 96 |
| heap_toast_delete | src/backend/access/heap/heaptoast.c | 43 |
| heap_fetch_toast_slice | src/backend/access/heap/heaptoast.c | 626 |
| toast_flatten_tuple | src/backend/access/heap/heaptoast.c | 350 |
| toast_save_datum | src/backend/access/common/toast_internals.c | 119 |
| toast_delete_datum | src/backend/access/common/toast_internals.c | 385 |
| toast_compress_datum | src/backend/access/common/toast_internals.c | 46 |
| toast_open_indexes | src/backend/access/common/toast_internals.c | 562 |
| get_toast_snapshot | src/backend/access/common/toast_internals.c | 638 |
| detoast_attr | src/backend/access/common/detoast.c | 116 |
| detoast_external_attr | src/backend/access/common/detoast.c | 45 |
| detoast_attr_slice | src/backend/access/common/detoast.c | 205 |
| toast_fetch_datum | src/backend/access/common/detoast.c | 342 |
| toast_fetch_datum_slice | src/backend/access/common/detoast.c | 395 |
| toast_decompress_datum | src/backend/access/common/detoast.c | 470 |
| toast_decompress_datum_slice | src/backend/access/common/detoast.c | 502 |
| pglz_compress_datum | src/backend/access/common/toast_compression.c | 39 |
| pglz_decompress_datum | src/backend/access/common/toast_compression.c | 81 |
| pglz_decompress_datum_slice | src/backend/access/common/toast_compression.c | 108 |
| lz4_compress_datum | src/backend/access/common/toast_compression.c | 138 |
| lz4_decompress_datum | src/backend/access/common/toast_compression.c | 181 |
| lz4_decompress_datum_slice | src/backend/access/common/toast_compression.c | 214 |
| varatt_external | src/include/varatt.h | 32 |
| TOAST_TUPLE_TARGET | src/include/access/heaptoast.h | 50 |
| TOAST_MAX_CHUNK_SIZE | src/include/access/heaptoast.h | 84 |
| TYPSTORAGE_EXTENDED | src/include/catalog/pg_type.h | 309 |

Verified against REL_18_STABLE, commit 273fe94.

Confirmed:

  • heap_toast_insert_or_update four-round loop structure matches the description: rounds 1–4, TYPSTORAGE_EXTENDED / TYPSTORAGE_MAIN discrimination, TOAST_TUPLE_TARGET / TOAST_TUPLE_TARGET_MAIN thresholds.
  • toast_save_datum chunk loop, valueid assignment via GetNewOidWithIndex, and varatt_external pointer construction confirmed.
  • detoast_attr branch structure for ONDISK / COMPRESSED / SHORT confirmed.
  • detoast_attr_slice narrow-scan path for uncompressed external, PGLZ prefix via pglz_maximum_compressed_size, and LZ4 full-fetch fallback confirmed.
  • get_toast_snapshot HaveRegisteredOrActiveSnapshot guard confirmed.
  • TOAST_MAX_CHUNK_SIZE defined as EXTERN_TUPLE_MAX_SIZE - MAXALIGN(SizeofHeapTupleHeader) - sizeof(Oid) - sizeof(int32) - VARHDRSZ in heaptoast.h.
  • TOAST_POINTER_SIZE = VARHDRSZ_EXTERNAL + sizeof(varatt_external) in detoast.h.
  • lz4_decompress_datum_slice falling back to full decompression when LZ4_versionNumber() < 10803 confirmed.
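
The 10803 threshold in the last item is the integer form of liblz4 1.8.3. A minimal sketch of that encoding follows, assuming the major × 10000 + minor × 100 + release scheme that liblz4 uses for its version number. The function names are hypothetical and the code does not link against liblz4.

```c
#include <assert.h>
#include <stdbool.h>

/* Encode a three-part version the way liblz4 reports it as one integer. */
int lz4_version_number(int major, int minor, int release)
{
    assert(major >= 0 && minor >= 0 && minor < 100 && release >= 0 && release < 100);
    return major * 100 * 100 + minor * 100 + release;
}

/* Partial decompression is only attempted from 1.8.3 (10803) onwards. */
bool can_decompress_partial(int runtime_version)
{
    return runtime_version >= 10803;
}
```

On an older library the slice request still succeeds, but only by decompressing the whole datum first, so the caller sees the same result at a higher cost.
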

AM interface:

  • table_relation_fetch_toast_slice is the tableam dispatch point; heap’s implementation, heap_fetch_toast_slice, is confirmed in src/backend/access/heap/heaptoast.c.

Unresolved / out of scope:

  • The toast_build_flattened_tuple / toast_flatten_tuple variants are utility helpers (for container types and CLUSTER/rewrite); the table-rewrite OID-preservation path in toast_save_datum is present but not fully traced.
  • The large_object subsystem (storage/large_object/) is a separate mechanism from TOAST and is not covered here.
  • Per-column compression settings (ALTER TABLE … SET COMPRESSION), which override default_toast_compression, reach toast_compress_datum through its cmethod parameter. The flow from pg_attribute.attcompression through the caller chain was verified, but the catalog path is not traced.

Beyond PostgreSQL — Comparative Designs & Research Frontiers


InnoDB stores large BLOB, TEXT, and VARCHAR values in overflow pages in the same .ibd file. How much stays inline depends on the row format: COMPACT keeps a 768-byte prefix in the row plus a 20-byte pointer to the overflow pages, while DYNAMIC moves the whole value off-page and leaves only the 20-byte pointer, with values of 40 bytes or less kept inline. Unlike PostgreSQL’s TOAST, InnoDB does not compress at the storage layer by default; compression is a table-level option that applies to B-tree pages, not individual column values. The trade-off: InnoDB’s off-page storage avoids the separate heap, but TOAST’s out-of-line scheme lets vacuum reclaim TOAST rows independently and keeps compression orthogonal to page compression.

Oracle LOB columns are stored in a dedicated LOB segment outside the main table segment. A LOB value below a size threshold (11g: 4 KB; 12c+: configurable) is stored inside the row (“inline LOB”). Oracle BasicFiles use an extent-based storage model; SecureFiles (11g+) add deduplication, compression, and encryption at the LOB level. The semantic difference from TOAST: Oracle LOBs are a distinct SQL type with a read/write cursor API, whereas PostgreSQL TOAST is fully transparent and the application uses TEXT or BYTEA normally.

SQL Server stores VARCHAR(MAX), NVARCHAR(MAX), VARBINARY(MAX), and TEXT/IMAGE in LOB pages separate from the 8 KB data page. The main row holds a 24-byte pointer. Smaller variable-length values that individually fit within 8 KB but push the row past the 8060-byte row limit are moved to row-overflow pages. No inline compression equivalent to TOAST’s Round 1 exists; page-level compression (PAGE COMPRESSION) is a table option analogous to InnoDB’s page compression.

TOAST’s per-datum compression model predates the columnar storage movement, which compresses whole column runs (run-length encoding, dictionary encoding, delta encoding) rather than individual values. Columnar compression ratios are typically far higher because the compressor sees many values of the same type and distribution, not isolated varlenas. Columnar systems such as DuckDB and file formats such as Apache Parquet rely on columnar compression exclusively; hybrid HTAP engines (like Greenplum’s AO tables, which build on PostgreSQL) bypass TOAST for columnar segments and handle large values at the block level.
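
To make the contrast concrete, here is a minimal run-length encoder over a column of integers. It is a generic illustration of compressing a column run, not the scheme of any engine named above; the `Run` type and function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One run: a value and how many consecutive times it occurs. */
typedef struct {
    int32_t  value;
    uint32_t count;
} Run;

/*
 * Encode n column values into runs. Returns the number of runs written.
 * out must have room for out_cap runs; n runs is always enough.
 */
size_t rle_encode(const int32_t *in, size_t n, Run *out, size_t out_cap)
{
    size_t runs = 0;

    for (size_t i = 0; i < n; i++) {
        if (runs > 0 && out[runs - 1].value == in[i]) {
            out[runs - 1].count++;      /* extend the current run */
        } else {
            assert(runs < out_cap);     /* caller sized the buffer */
            out[runs].value = in[i];
            out[runs].count = 1;
            runs++;
        }
    }
    return runs;
}
```

The gain comes entirely from repetition across neighbouring values, which is exactly what a per-datum compressor never gets to see.
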

The pluggable-AM surface in PostgreSQL 12+ (table_relation_fetch_toast_slice dispatch) is the extension point for an AM that wants to bypass TOAST entirely. The Citus columnar AM (columnar) introduced its own large-value handling; it is a third-party extension rather than part of core PostgreSQL, and thus out of scope here as per the plan’s contrib boundary.

PostgreSQL 18 introduced an async I/O layer (storage/aio/) for prefetching heap pages during sequential scans. Toast chunk reads (heap_fetch_toast_slice via systable_beginscan_ordered) currently bypass async I/O and use synchronous ReadBuffer calls. A potential future optimisation would be to prefetch toast chunks when an index scan on the main table detects an out-of-line pointer — this is an open engineering item, not yet implemented in REL_18_STABLE.

  • src/backend/access/heap/heaptoast.c: heap-specific toasting entry points and chunk fetch
  • src/backend/access/common/toast_internals.c: toast_save_datum, toast_delete_datum, toast_compress_datum, index helpers, get_toast_snapshot
  • src/backend/access/common/detoast.c: detoasting, partial fetch, size reporting
  • src/backend/access/common/toast_compression.c: PGLZ and LZ4 compress/decompress dispatch
  • src/include/access/heaptoast.h: TOAST_TUPLE_TARGET, TOAST_MAX_CHUNK_SIZE, function declarations
  • src/include/access/toast_internals.h: toast_compress_header, TOAST_COMPRESS_* macros, function declarations
  • src/include/access/detoast.h: TOAST_POINTER_SIZE
  • src/include/access/toast_helper.h: ToastAttrInfo, ToastTupleContext, TOAST_HAS_NULLS, TOAST_NEEDS_CHANGE, TOASTCOL_* flags
  • src/include/varatt.h: varatt_external, varatt_indirect, VARTAG_*, VARATT_IS_* macros, VARHDRSZ_*
  • src/include/catalog/pg_type.h: TYPSTORAGE_PLAIN/EXTERNAL/MAIN/EXTENDED
  • knowledge/code-analysis/postgres/postgres-heap-am.md: heap tuple layout, slotted page, HOT context
  • knowledge/code-analysis/postgres/postgres-page-layout.md: BLCKSZ, page header layout
  • knowledge/code-analysis/postgres/postgres-mvcc-snapshots.md: snapshot semantics
  • knowledge/code-analysis/postgres/postgres-vacuum.md: vacuum reclaims dead TOAST rows
  • Database System Concepts, Silberschatz et al., 7e, ch. 13 “Data Storage Structures”
  • Database Internals, Alex Petrov, ch. 3 “File Formats”