PostgreSQL Datatype Library — varlena, numeric, datetime, jsonb, and arrays
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A relational database is, at bottom, a machine for storing and comparing values, and a value only means something relative to its type. The relational model (Codd 1970; Database System Concepts, Silberschatz 7e, ch. 4 “Intermediate SQL” and the type-system discussion in ch. 5) takes the domain (the set of permissible values for an attribute) as a primitive. SQL turns the domain into a data type with an external textual syntax, an internal storage encoding, a total or partial order, and a battery of operators. For the engine, a type is therefore not a passive label but an abstract data type (ADT): a representation plus the operations that respect it. The representation is opaque to the rest of the system; only the type’s own functions may interpret the bytes. This is exactly the encapsulation that Liskov & Zilles formalized for programming languages, applied to on-disk tuples.
Three forces shape how a DBMS realizes its types. The first is storage efficiency. A row store packs many attributes into a compact tuple, so the engine wants fixed-length types (a 4-byte integer, an 8-byte timestamp) to be stored inline with no per-value overhead. But strings, decimals, JSON documents, and arrays are intrinsically variable-length, and a billion-row table cannot afford a fat header on every tiny string. The encoding must therefore pay for size only in proportion to size: a one-character string should not cost the same header as a one-megabyte blob.
The second force is comparison and ordering. Indexes (B-trees), sorts,
hash joins, and GROUP BY all require that a type expose a consistent
comparison function and, where applicable, a hash function. The catch is
that “consistent” is type-specific and sometimes locale- or
collation-specific: byte order is the right order for bytea but the wrong
order for human-language text, and 1.0 must compare equal to 1.00 for
numeric even though their byte representations differ. The ordering
contract — the operator class — is what lets a generic B-tree index any
type without knowing what the type is (see postgres-nbtree.md and
postgres-index-am.md).
The third force is extensibility. What Goes Around Comes Around
(Stonebraker & Hellerstein 2005; captured in
dbms-papers/goes-around.md) traces how the object-relational lineage —
POSTGRES above all — chose to make the type system open: users and
extensions add new base types, and the engine treats them exactly like
built-ins because every type is described by catalog rows plus a set of
registered C functions. The price of that openness is a rigid calling
convention: every type function, built-in or third-party, must speak the
same ABI so the executor can call it indirectly. The benefit is that PostGIS
can add a geometry type, or a citext extension a case-insensitive text,
without patching the core executor — the same generality that The Design of
POSTGRES (Stonebraker & Rowe 1986; dbms-papers/) set out to provide.
The deep idea, then, is uniformity through indirection. The query
executor manipulates values as undifferentiated Datums: each Datum is a
register-width token that is either the value itself (for pass-by-value types
like int4) or a pointer to it (for pass-by-reference types like text). All
type-specific behavior is reached by looking up a function in the catalog
and calling it through a uniform signature. The ADT library is the
collection of those functions for the built-in types. It is the layer where
the abstract relational notion of a domain becomes concrete bytes and
concrete C code.
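The indirection can be illustrated with a toy model. The sketch below is not PostgreSQL code: `ToyType`, `toy_types`, `toy_lookup`, and `toy_compare` are hypothetical names standing in for the catalog and the function manager, and `Datum` is modeled as a `uintptr_t`. It shows only the shape of the idea: a generic caller compares two values through a looked-up function pointer without knowing their type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model (assumption): a Datum is a register-width token. */
typedef uintptr_t Datum;

/* Every comparator shares one signature, so callers can stay generic. */
typedef int (*toy_cmp_fn)(Datum a, Datum b);

/* Hypothetical stand-in for a catalog row. */
typedef struct ToyType
{
    const char *name;
    int         by_value;   /* 1: Datum is the value, 0: Datum is a pointer */
    toy_cmp_fn  cmp;
} ToyType;

/* Pass-by-value: the Datum carries the 32-bit integer itself. */
Datum toy_int32_get_datum(int32_t v)
{
    uint32_t u;
    memcpy(&u, &v, sizeof u);
    return (Datum) u;
}

int32_t toy_datum_get_int32(Datum d)
{
    uint32_t u = (uint32_t) d;
    int32_t  v;
    memcpy(&v, &u, sizeof v);
    return v;
}

static int toy_cmp_int32(Datum a, Datum b)
{
    int32_t x = toy_datum_get_int32(a);
    int32_t y = toy_datum_get_int32(b);
    return (x > y) - (x < y);
}

/* Pass-by-reference: the Datum carries a pointer to a NUL-terminated string. */
static int toy_cmp_cstring(Datum a, Datum b)
{
    return strcmp((const char *) a, (const char *) b);
}

static const ToyType toy_types[] = {
    {"int4", 1, toy_cmp_int32},
    {"text", 0, toy_cmp_cstring},
};

/* "Catalog lookup": find a type description by name, or return NULL. */
const ToyType *toy_lookup(const char *name)
{
    for (size_t i = 0; i < sizeof toy_types / sizeof toy_types[0]; i++)
        if (strcmp(toy_types[i].name, name) == 0)
            return &toy_types[i];
    return NULL;
}

/* Generic caller: never inspects the Datum, only dispatches. */
int toy_compare(const ToyType *type, Datum a, Datum b)
{
    assert(type != NULL);
    return type->cmp(a, b);
}
```

A caller would write `toy_compare(toy_lookup("int4"), toy_int32_get_datum(-5), toy_int32_get_datum(3))` and receive a negative result; swapping in `"text"` with two string pointers takes the same code path.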
Common DBMS Design
Every SQL engine must answer the same questions for each type, and the answers converge on a recognizable pattern.
The I/O quartet. A type needs to convert between its internal byte
representation and (a) the human-readable text form used in SQL literals and
COPY/psql output, and (b) a machine-readable binary form used on the
client wire protocol for efficiency. That yields four functions, which
nearly every engine has in some form:
- input: parse external text into internal bytes (used by literals, COPY ... FROM, casts from text);
- output: render internal bytes as text (used by result rows, COPY ... TO);
- receive: decode the binary wire form into internal bytes;
- send: encode internal bytes into the binary wire form.
Text I/O is canonical and stable; binary I/O is an optimization that trades human-readability for parse speed and exactness (no float round-tripping through decimal). A robust engine never lets the binary form become a hidden on-disk format it cannot evolve — hence binary protocols usually carry a version byte.
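As a rough illustration of the quartet, here is a toy 32-bit integer type with all four conversions. The `toy_int4_*` names are hypothetical, error handling is omitted, and the choice of most-significant-byte-first order for the binary form is an assumption of this sketch rather than a statement about any particular engine.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* input: external text -> internal value (no validation in this sketch) */
int32_t toy_int4_in(const char *text)
{
    assert(text != NULL);
    return (int32_t) strtol(text, NULL, 10);
}

/* output: internal value -> external text, written into the caller's buffer */
void toy_int4_out(int32_t value, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%" PRId32, value);
}

/* send: internal value -> 4 wire bytes, most significant byte first */
void toy_int4_send(int32_t value, uint8_t out[4])
{
    uint32_t u;
    memcpy(&u, &value, sizeof u);
    out[0] = (uint8_t) (u >> 24);
    out[1] = (uint8_t) (u >> 16);
    out[2] = (uint8_t) (u >> 8);
    out[3] = (uint8_t) u;
}

/* receive: 4 wire bytes -> internal value */
int32_t toy_int4_recv(const uint8_t in[4])
{
    uint32_t u = ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16) |
                 ((uint32_t) in[2] << 8) | (uint32_t) in[3];
    int32_t  value;
    memcpy(&value, &u, sizeof value);
    return value;
}
```

The text pair round-trips through a decimal string, while the binary pair moves the same 32 bits with no parsing at all, which is the speed and exactness trade described above.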
Variable-length representation. Fixed-length types are trivial: store
the bytes inline. Variable-length types need a length somewhere. The two
classic choices are a length-prefixed encoding (a header word giving the
byte count, then the payload) and a sentinel-terminated encoding (C strings,
NUL-terminated). Databases overwhelmingly choose length-prefix because it
permits embedded NULs, O(1) length, and binary-safe copying. The
refinement every mature engine eventually makes is to shrink the header for
small values and to spill large values out of line — because a row store
wants tuples to fit on a page, and a multi-megabyte value would otherwise
make the tuple unstorable. The out-of-line mechanism (Oracle’s LOBs, SQL
Server’s LOB_DATA / row-overflow pages, PostgreSQL’s TOAST) stores the big
payload in a side relation and leaves a small pointer in the tuple.
Arbitrary-precision decimal. float is fast but lossy; financial and
exact-arithmetic workloads need a decimal type that represents 0.1
exactly and supports hundreds of significant digits. The universal
implementation is a sign, an exponent/scale, and an array of digits in some
radix (often a power of ten so that decimal rounding and text conversion are
clean), with schoolbook algorithms for add/subtract/multiply/divide.
Temporal types. Dates and timestamps are stored as integers — a count of days, or of microseconds, from an epoch — because integer arithmetic is exact and fast, and calendar conversion (the messy part: leap years, month lengths, Gregorian reform) is isolated in a Julian-day kernel that maps (year, month, day) to a single day number and back.
Semi-structured and collection types. Modern engines add JSON and array types. The naive implementation stores the text and re-parses on every access; the mature implementation stores a parsed binary tree so that field access and containment tests are fast, trading insert-time serialization cost for query-time speed. Collection types (arrays, nested tables) carry dimensionality and a null map alongside packed element data.
flowchart TD
subgraph cat["catalog description"]
T["pg_type row<br/>typinput typoutput<br/>typreceive typsend<br/>typlen typbyval typalign typstorage"]
P["pg_proc rows<br/>(the C functions)"]
end
subgraph adt["ADT library (utils/adt)"]
IN["typeIN: cstring to internal"]
OUT["typeOUT: internal to cstring"]
RECV["typeRECV: binary to internal"]
SEND["typeSEND: internal to binary"]
CMP["btTYPEcmp / hashTYPE / sortsupport"]
end
T --> P --> adt
EXEC["executor / COPY / wire protocol"] -->|"FunctionCall via fmgr"| adt
IN --> DATUM["Datum (value or pointer)"]
RECV --> DATUM
DATUM --> OUT
DATUM --> SEND
DATUM --> CMP
PostgreSQL’s distinctive choice is to make this pattern fully catalog-driven
and uniform across built-in and user types. There is no privileged
“built-in type” code path in the executor: int4 and a PostGIS geometry
are dispatched identically, through the function manager (fmgr). That is
the through-line of the rest of this document.
PostgreSQL’s Approach
The type is a catalog row; behavior is registered functions
A PostgreSQL type is a row in pg_type. Its scalar attributes tell generic
code how to move a value around without understanding it: typlen (length,
or -1 for varlena, -2 for cstring), typbyval (pass-by-value vs
by-reference), typalign (c/s/i/d), and typstorage
(plain/extended/external/main, governing TOAST). Its function references
(typinput, typoutput, typreceive, typsend, plus the operator-class entries
for comparison and hashing) point at pg_proc rows, i.e. at C functions in
the ADT library. Generic subsystems (the executor, COPY, the wire protocol,
the planner’s selectivity code) never hard-code a type; they read these
catalog fields and call the registered functions through fmgr. The
mechanics of that dispatch (FmgrInfo, FunctionCallInfo, PG_FUNCTION_ARGS,
the V1 ABI) are the subject of postgres-fmgr.md; here we take them as given
and focus on what the ADT functions do.
Every ADT function has the same C signature:
```c
// textin (src/backend/utils/adt/varlena.c)
Datum
textin(PG_FUNCTION_ARGS)
{
    char       *inputText = PG_GETARG_CSTRING(0);

    PG_RETURN_TEXT_P(cstring_to_text(inputText));
}
```

```c
// textout (src/backend/utils/adt/varlena.c)
Datum
textout(PG_FUNCTION_ARGS)
{
    Datum       txt = PG_GETARG_DATUM(0);

    PG_RETURN_CSTRING(TextDatumGetCString(txt));
}
```

PG_FUNCTION_ARGS expands to a single FunctionCallInfo fcinfo parameter;
the PG_GETARG_* and PG_RETURN_* macros unpack arguments and box the
result back into a Datum. This is the uniformity that lets the executor
call any type’s input function by OID. The recv/send pair is the binary
twin of in/out, reading and writing a StringInfo message buffer:
```c
// textrecv (src/backend/utils/adt/varlena.c)
Datum
textrecv(PG_FUNCTION_ARGS)
{
    StringInfo  buf = (StringInfo) PG_GETARG_POINTER(0);
    text       *result;
    char       *str;
    int         nbytes;

    str = pq_getmsgtext(buf, buf->len - buf->cursor, &nbytes);
    result = cstring_to_text_with_len(str, nbytes);
    pfree(str);
    PG_RETURN_TEXT_P(result);
}
```

varlena: the variable-length header zoo
Every variable-length value (text, bytea, numeric, jsonb, arrays,
and user types declared with typlen = -1) is a varlena. The contract
(stated at the top of src/include/varatt.h) is that the value begins with
a header whose first byte carries flag bits (the low bits on little-endian
machines) encoding which of four physical layouts this datum uses. The
header is read only through macros, never field access, because the layout
differs by endianness and by form.
The four forms (little-endian flag bits shown; big-endian is mirrored):
| Form | First-byte flag | Header size | Capacity | Use |
|---|---|---|---|---|
| 4-byte uncompressed | xxxxxx00 | 4 B (aligned) | up to ~1 GB | normal value |
| 4-byte compressed in-line | xxxxxx10 | 4 B + va_tcinfo | up to ~1 GB | pglz/lz4-compressed, still in the tuple |
| 1-byte short | xxxxxxx1 | 1 B (unaligned) | up to 126 B | small values, saves alignment padding |
| TOAST pointer | 00000001 | 2 B tag + body | n/a | out-of-line / indirect / expanded |
The 4-byte length word includes itself, so the payload length is
VARSIZE - VARHDRSZ. The endianness-specific extraction is:
```c
// VARSIZE_4B / VARSIZE_1B (src/include/varatt.h, little-endian)
#define VARSIZE_4B(PTR) \
    ((((varattrib_4b *) (PTR))->va_4byte.va_header >> 2) & 0x3FFFFFFF)
#define VARSIZE_1B(PTR) \
    ((((varattrib_1b *) (PTR))->va_header >> 1) & 0x7F)
```

The 1-byte short header is the cleverest piece. A normal 4-byte header
must be 4-byte aligned, which on a tuple full of tiny strings wastes up to
3 padding bytes per value plus 3 of the 4 header bytes. The short form
uses a single byte for both length and flag, is stored unaligned, and
caps the value at 126 bytes of payload, a good fit for the short strings
that dominate real schemas. A datum can be down-converted to short form
when it fits (VARATT_CAN_MAKE_SHORT). Because short and external datums are
unaligned, code that might see them must use the *_ANY family of macros,
which dispatch on the flag bits:
```c
// VARSIZE_ANY_EXHDR / VARDATA_ANY (src/include/varatt.h)
#define VARSIZE_ANY_EXHDR(PTR) \
    (VARATT_IS_1B_E(PTR) ? VARSIZE_EXTERNAL(PTR)-VARHDRSZ_EXTERNAL : \
     (VARATT_IS_1B(PTR) ? VARSIZE_1B(PTR)-VARHDRSZ_SHORT : \
      VARSIZE_4B(PTR)-VARHDRSZ))

#define VARDATA_ANY(PTR) \
    (VARATT_IS_1B(PTR) ? VARDATA_1B(PTR) : VARDATA_4B(PTR))
```

A value that is compressed in-line or pushed out of line is extended
(VARATT_IS_EXTENDED). Before a type function can touch the payload it must
detoast: pg_detoast_datum expands any extended datum to a plain 4-byte
varlena, while pg_detoast_datum_packed leaves a short header alone (it only
needs to undo compression/externalization, and the *_ANY macros handle the
short header). The actual out-of-line fetch and decompression are done by
detoast_attr in access/common/detoast.c, covered by postgres-toast.md:
```c
// pg_detoast_datum_packed (src/backend/utils/fmgr/fmgr.c)
struct varlena *
pg_detoast_datum_packed(struct varlena *datum)
{
    if (VARATT_IS_COMPRESSED(datum) || VARATT_IS_EXTERNAL(datum))
        return detoast_attr(datum);
    else
        return datum;
}
```

This is why the canonical text_to_cstring calls pg_detoast_datum_packed
and VARDATA_ANY/VARSIZE_ANY_EXHDR, then frees the unpacked copy only if
it differs from the original (i.e. only if detoasting actually allocated):
```c
// text_to_cstring (src/backend/utils/adt/varlena.c)
char *
text_to_cstring(const text *t)
{
    text       *tunpacked = pg_detoast_datum_packed(unconstify(text *, t));
    int         len = VARSIZE_ANY_EXHDR(tunpacked);
    char       *result;

    result = (char *) palloc(len + 1);
    memcpy(result, VARDATA_ANY(tunpacked), len);
    result[len] = '\0';

    if (tunpacked != t)
        pfree(tunpacked);
    return result;
}
```

The constructor side is symmetric and always builds the full 4-byte header (datums “begin life untoasted”); the system may later shorten or TOAST them at tuple-assembly time:
```c
// cstring_to_text_with_len (src/backend/utils/adt/varlena.c)
text *
cstring_to_text_with_len(const char *s, int len)
{
    text       *result = (text *) palloc(len + VARHDRSZ);

    SET_VARSIZE(result, len + VARHDRSZ);
    memcpy(VARDATA(result), s, len);
    return result;
}
```

flowchart TD
D["incoming varlena Datum"] --> Q{"flag bits in first byte"}
Q -->|"xxxxxx00"| FB["4-byte header<br/>aligned, plain<br/>VARDATA / VARSIZE"]
Q -->|"xxxxxx10"| FC["4-byte header<br/>compressed in-line<br/>va_tcinfo holds rawsize+method"]
Q -->|"xxxxxxx1"| SH["1-byte header<br/>unaligned short<br/>up to 126 bytes"]
Q -->|"00000001"| EX["TOAST pointer (1b_e)<br/>tag: ONDISK / INDIRECT / EXPANDED"]
FC -->|"detoast_attr"| FB
EX -->|"detoast_attr"| FB
FB --> USE["type function reads payload"]
SH --> USE
text and bytea: copy plus collation-aware ordering
text and bytea are varlena with no extra structure: the payload is the
string/byte sequence. Input/output are essentially memcpy around a header
(textin/textout above; byteain/byteaout add escape handling for
non-printable bytes). The interesting code is ordering, because text
must sort by collation, not by byte value. varstr_cmp is the kernel: it
fast-paths the C locale to memcmp, and otherwise calls the collation
provider (pg_strncoll), with a memcmp-equality shortcut to dodge the
expensive collation call when strings are byte-identical:
```c
// varstr_cmp (src/backend/utils/adt/varlena.c)
int
varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
{
    int         result;
    pg_locale_t mylocale;

    check_collation_set(collid);
    mylocale = pg_newlocale_from_collation(collid);

    if (mylocale->collate_is_c)
    {
        result = memcmp(arg1, arg2, Min(len1, len2));
        if ((result == 0) && (len1 != len2))
            result = (len1 < len2) ? -1 : 1;
    }
    else
    {
        if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
            return 0;
        result = pg_strncoll(arg1, len1, arg2, len2, mylocale);
        /* Break tie if necessary. */
        if (result == 0 && mylocale->deterministic)
        {
            result = memcmp(arg1, arg2, Min(len1, len2));
            if ((result == 0) && (len1 != len2))
                result = (len1 < len2) ? -1 : 1;
        }
    }
    return result;
}
```

For sorts, text registers a SortSupport function so the executor can
skip the per-comparison fmgr call and even use abbreviated keys (pack a
prefix of the collation key into a Datum for cheap first-pass comparison).
The C-locale fast comparator is a bare memcmp:
```c
// varstrfastcmp_c (src/backend/utils/adt/varlena.c)
static int
varstrfastcmp_c(Datum x, Datum y, SortSupport ssup)
{
    VarString  *arg1 = DatumGetVarStringPP(x);
    VarString  *arg2 = DatumGetVarStringPP(y);
    char       *a1p = VARDATA_ANY(arg1);
    char       *a2p = VARDATA_ANY(arg2);
    int         len1 = VARSIZE_ANY_EXHDR(arg1);
    int         len2 = VARSIZE_ANY_EXHDR(arg2);
    int         result;

    result = memcmp(a1p, a2p, Min(len1, len2));
    if ((result == 0) && (len1 != len2))
        result = (len1 < len2) ? -1 : 1;

    /* We can't afford to leak memory here. */
    if (PointerGetDatum(arg1) != x)
        pfree(arg1);
    if (PointerGetDatum(arg2) != y)
        pfree(arg2);
    return result;
}
```

Note the explicit pfree of any detoasted copy: comparators must not leak,
because they are called once per comparison and a long sort performs a very
large number of comparisons. The collation machinery and the
abbreviated-key encoding belong to the i18n/sort subsystems
(postgres-overview-i18n-text.md, postgres-agg-sort-nodes.md); here the
point is only that text ordering is not memcmp in general: it is a
catalog-driven, collation-aware function reachable through the same operator
class a B-tree uses for any type.
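Stripped of detoasting and collation, the C-locale rule is small enough to state on its own. The following is a standalone sketch (the name `toy_bytes_cmp` is hypothetical) of the "memcmp over the common prefix, then break ties by length" comparison, working on explicit lengths so that embedded NUL bytes are compared like any other byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte-order comparison of two length-prefixed strings. */
int toy_bytes_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t minlen = (alen < blen) ? alen : blen;
    int    result;

    assert(a != NULL && b != NULL);
    result = memcmp(a, b, minlen);
    /* Equal on the common prefix: the shorter string sorts first. */
    if (result == 0 && alen != blen)
        result = (alen < blen) ? -1 : 1;
    return result;
}
```

Only the sign of the result is meaningful, as with memcmp itself.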
numeric: base-10000 arbitrary precision
numeric is the canonical “fat varlena”: an arbitrary-precision decimal whose
on-disk form packs a sign, a weight, a display scale, and a digit array, while
its arithmetic form (NumericVar) unpacks the same digits into a mutable
buffer. The digit radix is NBASE = 10000 (DEC_DIGITS = 4 decimal digits per
stored int16 digit), chosen so that a product of two digits fits in an
int32 and decimal text conversion is a straightforward 4-digits-at-a-time
loop:
```c
// numeric.c: radix selection (the NBASE==10000 branch is the live one)
#define NBASE               10000
#define HALF_NBASE          5000
#define DEC_DIGITS          4   /* decimal digits per NBASE digit */
#define MUL_GUARD_DIGITS    2   /* these are measured in NBASE digits */
#define DIV_GUARD_DIGITS    4

typedef int16 NumericDigit;
```

The in-memory NumericVar separates the palloc’d buffer (buf) from the
first significant digit (digits), deliberately leaving a spare leading digit
so a carry out of the top can be absorbed by just decrementing digits and
incrementing weight, with no reallocation:
```c
// NumericVar (src/backend/utils/adt/numeric.c)
typedef struct NumericVar
{
    int         ndigits;    /* # of digits in digits[] - can be 0! */
    int         weight;     /* weight of first digit */
    int         sign;       /* NUMERIC_POS, _NEG, _NAN, _PINF, or _NINF */
    int         dscale;     /* display scale */
    NumericDigit *buf;      /* start of palloc'd space for digits[] */
    NumericDigit *digits;   /* base-NBASE digits */
} NumericVar;
```

On-disk, the header is itself adaptive: a short form (one 16-bit header word, used when weight and scale are small) or a long form (two header words), with a third “special” form whose header alone encodes NaN / +Inf / -Inf. The flag bits live in the top of the first header word:
```c
// numeric.c: on-disk header flag bits
#define NUMERIC_SIGN_MASK   0xC000
#define NUMERIC_POS         0x0000
#define NUMERIC_NEG         0x4000
#define NUMERIC_SHORT       0x8000  /* short (1-word) header */
#define NUMERIC_SPECIAL     0xC000  /* NaN / Inf, header is all there is */

#define NUMERIC_HDRSZ       (VARHDRSZ + sizeof(uint16) + sizeof(int16))
#define NUMERIC_HDRSZ_SHORT (VARHDRSZ + sizeof(uint16))
```

Output is “unpack then stringify”: numeric_out special-cases the three
non-finite values, otherwise calls init_var_from_num to get a NumericVar
view over the stored digits and get_str_from_var to render them:
```c
// numeric_out (src/backend/utils/adt/numeric.c)
Datum
numeric_out(PG_FUNCTION_ARGS)
{
    Numeric     num = PG_GETARG_NUMERIC(0);
    NumericVar  x;
    char       *str;

    if (NUMERIC_IS_SPECIAL(num))    /* NaN / Infinity / -Infinity */
    {
        if (NUMERIC_IS_PINF(num))
            PG_RETURN_CSTRING(pstrdup("Infinity"));
        else if (NUMERIC_IS_NINF(num))
            PG_RETURN_CSTRING(pstrdup("-Infinity"));
        else
            PG_RETURN_CSTRING(pstrdup("NaN"));
    }
    init_var_from_num(num, &x);
    str = get_str_from_var(&x);
    PG_RETURN_CSTRING(str);
}
```

Arithmetic is schoolbook on the digit arrays. add_var dispatches on signs to
add_abs/sub_abs (and cmp_abs to decide which is larger when signs
differ), so the absolute-value routines only ever handle like-signed addition
and the larger-minus-smaller subtraction:
```c
// add_var (src/backend/utils/adt/numeric.c; sign dispatch, condensed)
static void
add_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result)
{
    if (var1->sign == NUMERIC_POS)
    {
        if (var2->sign == NUMERIC_POS)      /* (+a) + (+b) */
        {
            add_abs(var1, var2, result);
            result->sign = NUMERIC_POS;
        }
        else                                /* (+a) + (-b) */
        {
            switch (cmp_abs(var1, var2))
            {
                case 0:
                    zero_var(result);
                    /* ... */
                    break;
                case 1:
                    sub_abs(var1, var2, result);
                    result->sign = NUMERIC_POS;
                    break;
                /* case -1: sub_abs(var2, var1, result); sign = NUMERIC_NEG; */
            }
        }
    }
    /* ... symmetric branch for var1->sign == NUMERIC_NEG ... */
}
```

Re-packing a NumericVar to the on-disk Numeric is make_result (via
make_result_opt_error): it strips leading and trailing zero digits, forces a
canonical zero, and chooses the short vs long header by whether the weight and
scale fit (NUMERIC_CAN_BE_SHORT). This is the place the in-memory and
on-disk representations meet:
```c
// make_result_opt_error (src/backend/utils/adt/numeric.c, condensed)
n = var->ndigits;
while (n > 0 && *digits == 0)           /* leading 0s */
{
    digits++;
    weight--;
    n--;
}
while (n > 0 && digits[n - 1] == 0)     /* trailing 0s */
    n--;
if (n == 0)                             /* canonical zero */
{
    weight = 0;
    sign = NUMERIC_POS;
}

if (NUMERIC_CAN_BE_SHORT(var->dscale, weight))  /* short header */
{
    len = NUMERIC_HDRSZ_SHORT + n * sizeof(NumericDigit);
    result = (Numeric) palloc(len);
    SET_VARSIZE(result, len);
    result->choice.n_short.n_header =
        (sign == NUMERIC_NEG ? (NUMERIC_SHORT | NUMERIC_SHORT_SIGN_MASK)
         : NUMERIC_SHORT)
        | (var->dscale << NUMERIC_SHORT_DSCALE_SHIFT)
        | (weight < 0 ? NUMERIC_SHORT_WEIGHT_SIGN_MASK : 0)
        | (weight & NUMERIC_SHORT_WEIGHT_MASK);
}
else
{
    /* long header: n_sign_dscale + n_weight */
}
memcpy(NUMERIC_DIGITS(result), digits, n * sizeof(NumericDigit));
```

The binary I/O pair (numeric_recv/numeric_send) reads/writes the same
digits over the wire as int16 words plus the weight/sign/dscale, so binary
transmission is exact (no decimal-text round-trip).
flowchart TD
ON["on-disk Numeric<br/>short / long / special header<br/>+ base-10000 digit array"]
ON -->|"init_var_from_num"| NV["NumericVar<br/>ndigits weight sign dscale<br/>buf + digits (spare leading digit)"]
NV -->|"add_var / mul_var<br/>(add_abs sub_abs cmp_abs)"| NV2["NumericVar result<br/>guard digits, then round"]
NV2 -->|"make_result<br/>strip 0s, pick header"| ON2["on-disk Numeric"]
NV -->|"get_str_from_var"| STR["cstring (numeric_out)"]
ON -->|"numeric_send"| WIRE["binary wire (int16 digits)"]
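To make the base-10000 schoolbook idea concrete, here is a much-reduced sketch of magnitude addition. It is not the add_abs from numeric.c: `toy_add_abs` is a hypothetical helper that assumes both operands have the same number of digits and the same weight, and it writes the carry into a spare leading slot, echoing the spare-digit trick described for NumericVar.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NBASE 10000

/*
 * Add two n-digit base-10000 magnitudes, most significant digit first.
 * out must have room for n + 1 digits; out[0] is the spare leading digit
 * that absorbs a carry out of the top.
 */
void toy_add_abs(const int16_t *a, const int16_t *b, int n, int16_t *out)
{
    int carry = 0;

    assert(n >= 0);
    for (int i = n - 1; i >= 0; i--)
    {
        int sum = a[i] + b[i] + carry;  /* at most 9999 + 9999 + 1 */

        if (sum >= TOY_NBASE)
        {
            sum -= TOY_NBASE;
            carry = 1;
        }
        else
            carry = 0;
        out[i + 1] = (int16_t) sum;
    }
    out[0] = (int16_t) carry;
}
```

For example, the digits {9999, 9999} (decimal 99999999) plus {0, 1} give {1, 0, 0}, i.e. decimal 100000000, with the new top digit landing in the spare slot.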
datetime: integer storage over a Julian-day kernel
date, time, timestamp, and timestamptz are fixed-length types
(typlen 4 or 8, typbyval where it fits), so unlike the varlena types above
they need no header at all: the Datum carries the integer directly. date
is a day count from the PostgreSQL epoch (2000-01-01); timestamp is a
microsecond count from the same epoch. Their I/O does not parse calendars
inline: it routes through the shared datetime tokenizer
(ParseDateTime → DecodeDateTime), which handles field order, time zones,
and month names, and the calendar math itself is isolated in the
classic Julian-day kernel date2j / j2date (in
src/backend/utils/adt/datetime.c). That isolation is the textbook design
from §“Common DBMS Design”: exact integer arithmetic for ordering and
intervals, with the messy Gregorian conversion confined to two functions.
The tokenizer and the timezone database are large enough to warrant their
own treatment; this doc notes only that the type contract is the same I/O
quartet, with the calendar complexity pushed below it.
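A minimal sketch of the forward direction (calendar date to stored integer), assuming the proleptic Gregorian calendar. The `toy_*` names and constants are illustrative; the arithmetic follows a widely used all-integer Julian-day formulation and is not quoted from datetime.c.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_EPOCH_JDATE   2451545               /* Julian day of 2000-01-01 */
#define TOY_USECS_PER_DAY INT64_C(86400000000)

/* (year, month, day) -> Julian day number, integer arithmetic only. */
int toy_date_to_julian(int year, int month, int day)
{
    int julian;
    int century;

    assert(month >= 1 && month <= 12);
    /* Treat January and February as months 13 and 14 of the prior year. */
    if (month > 2)
    {
        month += 1;
        year += 4800;
    }
    else
    {
        month += 13;
        year += 4799;
    }
    century = year / 100;
    julian = year * 365 - 32167;
    julian += year / 4 - century + century / 4;     /* leap-year rule */
    julian += 7834 * month / 256 + day;             /* month lengths */
    return julian;
}

/* Stored date value: days since 2000-01-01. */
int32_t toy_date_value(int year, int month, int day)
{
    return toy_date_to_julian(year, month, day) - TOY_EPOCH_JDATE;
}

/* Stored timestamp value for midnight of that date: microseconds since epoch. */
int64_t toy_timestamp_midnight(int year, int month, int day)
{
    return (int64_t) toy_date_value(year, month, day) * TOY_USECS_PER_DAY;
}
```

Once dates are plain integers, ordering is integer comparison and "days between" is subtraction; for instance `toy_date_value(1970, 1, 1)` is -10957, the distance from the Unix epoch to 2000-01-01.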
jsonb: a TOAST-compressible binary document tree
jsonb is the most elaborate ADT type: it stores a parsed document so that
key lookup and containment are fast, yet stays a single varlena so TOAST can
compress and spill it. The unit is a JsonbContainer: a 32-bit header whose
low 28 bits count children and whose top bits flag array/object/scalar,
followed by a parallel JEntry array, followed by the children’s
variable-length payloads:
```c
// JsonbContainer + JEntry (src/include/utils/jsonb.h)
typedef uint32 JEntry;

#define JENTRY_OFFLENMASK   0x0FFFFFFF  /* length OR offset of this child */
#define JENTRY_TYPEMASK     0x70000000  /* string/numeric/bool/null/container */
#define JENTRY_HAS_OFF      0x80000000  /* this JEntry holds an offset, not a len */

typedef struct JsonbContainer
{
    uint32      header;     /* # of elements or pairs, plus flag bits */
    JEntry      children[FLEXIBLE_ARRAY_MEMBER];
    /* the data for each child node follows. */
} JsonbContainer;

#define JB_CMASK            0x0FFFFFFF  /* mask for the count field */
#define JB_FSCALAR          0x10000000  /* top-level scalar wrapped in a 1-elem array */
#define JB_FOBJECT          0x20000000
#define JB_FARRAY           0x40000000
```

The subtle design choice is length-or-offset with a stride. Storing a
length per child makes the value highly compressible (lengths of similar
children are similar bytes) but turns “find child k” into an O(k) prefix
sum. Storing an offset per child gives O(1) random access but defeats
compression. PostgreSQL compromises: store a length in most JEntrys, but
convert every JB_OFFSET_STRIDE-th (= 32nd) child’s field to a cumulative
offset (flagged JENTRY_HAS_OFF), so any child is reachable by summing at most
31 lengths from the nearest stored offset:
```c
// jsonb.h: the stride rationale (verbatim constant)
#define JB_OFFSET_STRIDE    32
```

A value is built first as an in-memory JsonbValue tree (by the parser or by
pushJsonbValue), then serialized depth-first into the flat binary form.
JsonbValueToJsonb is the entry point; it wraps a bare scalar in a
one-element JB_FSCALAR array (so the top level is always a container) and
otherwise calls convertToJsonb:
```c
// JsonbValueToJsonb (src/backend/utils/adt/jsonb_util.c, condensed)
Jsonb *
JsonbValueToJsonb(JsonbValue *val)
{
    Jsonb      *out;

    if (IsAJsonbScalar(val))    /* wrap scalar as rawScalar array */
    {
        JsonbParseState *pstate = NULL;
        JsonbValue  scalarArray,
                   *res;

        scalarArray.type = jbvArray;
        scalarArray.val.array.rawScalar = true;
        scalarArray.val.array.nElems = 1;
        pushJsonbValue(&pstate, WJB_BEGIN_ARRAY, &scalarArray);
        pushJsonbValue(&pstate, WJB_ELEM, val);
        res = pushJsonbValue(&pstate, WJB_END_ARRAY, NULL);
        out = convertToJsonb(res);
    }
    else if (val->type == jbvObject || val->type == jbvArray)
        out = convertToJsonb(val);  /* object/array container */
    else
    {
        /* jbvBinary: already-serialized child, just copy with a header */
    }
    return out;
}
```

The recursive serializer convertJsonbArray shows the stride logic in
action: it reserves the JEntry slots, converts each element (which appends
its payload to the buffer and returns a JEntry with that element’s length
in the low bits), and converts the length to a cumulative offset on every
stride boundary:
```c
// convertJsonbArray (src/backend/utils/adt/jsonb_util.c, condensed)
containerhead = nElems | JB_FARRAY;
if (val->val.array.rawScalar)
    containerhead |= JB_FSCALAR;
appendToBuffer(buffer, &containerhead, sizeof(uint32));
jentry_offset = reserveFromBuffer(buffer, sizeof(JEntry) * nElems);

totallen = 0;
for (i = 0; i < nElems; i++)
{
    convertJsonbValue(buffer, &meta, &val->val.array.elems[i], level + 1);
    totallen += JBE_OFFLENFLD(meta);        /* running data size */
    if (totallen > JENTRY_OFFLENMASK)
        ereport(ERROR, ...);                /* 256 MB cap */

    if ((i % JB_OFFSET_STRIDE) == 0)        /* every 32nd: store offset */
        meta = (meta & JENTRY_TYPEMASK) | totallen | JENTRY_HAS_OFF;
    copyToBuffer(buffer, jentry_offset, &meta, sizeof(JEntry));
    jentry_offset += sizeof(JEntry);
}
*header = JENTRY_ISCONTAINER | (buffer->len - base_offset);
```

Comparison is structural, not byte-wise: compareJsonbContainers walks both
documents with synchronized iterators (JsonbIteratorNext), comparing token by
token so two jsonb values that differ only in key insertion order or
whitespace compare equal. This is what lets jsonb participate in B-tree and
GROUP BY with semantic equality:
```c
// compareJsonbContainers (src/backend/utils/adt/jsonb_util.c, condensed)
ita = JsonbIteratorInit(a);
itb = JsonbIteratorInit(b);
do
{
    ra = JsonbIteratorNext(&ita, &va, false);
    rb = JsonbIteratorNext(&itb, &vb, false);
    if (ra == rb)
    {
        if (ra == WJB_DONE)
            break;                                      /* decisively equal */
        if (va.type == vb.type)
            res = compareJsonbScalarValue(&va, &vb);    /* compare like-typed */
        else
            res = (va.type > vb.type) ? 1 : -1;         /* type ordering */
    }
    else
        res = ...;  /* shorter/structurally-different document orders first */
} while (res == 0);
```

flowchart TD
TXT["jsonb input text"] -->|"jsonb_in / parser"| JV["JsonbValue tree<br/>(in-memory, jbvObject/Array/String/...)"]
JV -->|"JsonbValueToJsonb<br/>convertJsonbValue depth-first"| BIN["flat Jsonb varlena<br/>JsonbContainer header<br/>+ JEntry[] (len, offset every 32)<br/>+ payloads"]
BIN -->|"TOAST compress / spill"| DISK["on-disk attribute"]
BIN -->|"JsonbIteratorNext"| ACCESS["key lookup / containment"]
BIN2["other jsonb"] --> CMP["compareJsonbContainers<br/>structural, order-independent"]
BIN --> CMP
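The length-or-offset stride is easiest to see from the reader's side. The sketch below is a simplified model with hypothetical `toy_*` names: it omits the JEntry type bits, keeps only the 28-bit length/offset field and the has-offset flag, and assumes (as in the serializer loop shown earlier) that a stored offset is the cumulative end of that child's data. Finding where child k starts means walking backwards, summing lengths, until an entry that holds an offset is reached.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_STRIDE      32
#define TOY_OFFLEN_MASK 0x0FFFFFFFu
#define TOY_HAS_OFF     0x80000000u

/* Writer side: store a length per child, a cumulative end offset every 32nd. */
void toy_build_entries(const uint32_t *lens, int n, uint32_t *entries)
{
    uint32_t total = 0;

    for (int i = 0; i < n; i++)
    {
        total += lens[i];
        assert(total <= TOY_OFFLEN_MASK);
        entries[i] = (i % TOY_STRIDE == 0) ? (total | TOY_HAS_OFF) : lens[i];
    }
}

/* Reader side: byte offset at which child `index` begins. */
uint32_t toy_child_start(const uint32_t *entries, int index)
{
    uint32_t offset = 0;

    for (int i = index - 1; i >= 0; i--)
    {
        offset += entries[i] & TOY_OFFLEN_MASK;
        if (entries[i] & TOY_HAS_OFF)
            break;              /* reached a stored cumulative offset */
    }
    return offset;
}
```

With the stride at 32 the backward walk touches at most 31 length entries plus one offset entry, no matter how many children the container has.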
arrays: dimensioned varlena with an optional null bitmap
A PostgreSQL array is a varlena whose ArrayType header carries the
dimensionality, an element-type OID, and an offset that is zero exactly when
the array has no NULLs — the presence of a null bitmap is encoded in that one
field:
```c
// ArrayType (src/include/utils/array.h)
typedef struct ArrayType
{
    int32       vl_len_;        /* varlena header (do not touch directly!) */
    int         ndim;           /* # of dimensions */
    int32       dataoffset;     /* offset to data, or 0 if no null bitmap */
    Oid         elemtype;       /* element type OID */
} ArrayType;
/* followed by: int dims[ndim], int lbound[ndim], [null bitmap], element data */

#define ARR_NDIM(a)         ((a)->ndim)
#define ARR_HASNULL(a)      ((a)->dataoffset != 0)
#define ARR_ELEMTYPE(a)     ((a)->elemtype)
```

Because the element type is stored only as an OID, generic array code must look
up the element’s typlen/typbyval/typalign to stride through the packed
data — exactly the catalog-driven indirection of §“PostgreSQL’s Approach”.
array_get_element is the canonical reader: it detoasts, validates the
subscripts against dims/lbound, computes a linear offset with
ArrayGetOffset, and then walks element-by-element (arrays are not randomly
addressable when elements are variable-length or nullable):
```c
// array_get_element (src/backend/utils/adt/arrayfuncs.c, condensed)
else    /* the normal flat-array case */
{
    ArrayType  *array = DatumGetArrayTypeP(arraydatum);     /* detoasts */

    ndim = ARR_NDIM(array);
    dim = ARR_DIMS(array);
    lb = ARR_LBOUND(array);
    arraydataptr = ARR_DATA_PTR(array);
    arraynullsptr = ARR_NULLBITMAP(array);
}
if (ndim != nSubscripts || ndim <= 0 || ndim > MAXDIM)
{
    *isNull = true;
    return (Datum) 0;
}
for (i = 0; i < ndim; i++)
    if (indx[i] < lb[i] || indx[i] >= (dim[i] + lb[i]))
    {
        *isNull = true;
        return (Datum) 0;
    }

offset = ArrayGetOffset(nSubscripts, dim, lb, indx);
/* then array_seek + fetch_att honoring elmlen/elmbyval/elmalign and the null map */
```

Bulk consumers use deconstruct_array to explode the packed form into parallel
Datum/isnull C arrays in one pass, and array_in/array_out handle the
{...} text syntax (with per-element calls back into the element type’s own
I/O functions — again the uniform quartet, one level down). The expanded-array
machinery (ExpandedArrayHeader, VARATT_IS_EXTERNAL_EXPANDED) is an
optimization for repeated in-place updates and is part of the TOAST/expanded-
datum story in postgres-toast.md.
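The subscript-to-position step can be sketched independently of the storage details. The helpers below use hypothetical `toy_*` names; the offset computation is ordinary row-major linearization over dims and lower bounds, and the bitmap helper assumes the convention that a set bit marks a non-null element and that a missing bitmap means no nulls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Row-major linear position of subscript indx[] in an ndim-dimensional array. */
int toy_array_offset(int ndim, const int *dim, const int *lb, const int *indx)
{
    int scale = 1;
    int offset = 0;

    for (int i = ndim - 1; i >= 0; i--)
    {
        /* caller is expected to have range-checked the subscripts */
        assert(indx[i] >= lb[i] && indx[i] < lb[i] + dim[i]);
        offset += (indx[i] - lb[i]) * scale;
        scale *= dim[i];
    }
    return offset;
}

/* Null test at a linear position (assumption: set bit = not null). */
int toy_array_is_null(const uint8_t *nullbitmap, int offset)
{
    if (nullbitmap == NULL)
        return 0;               /* no bitmap: the array has no nulls */
    return (nullbitmap[offset / 8] & (1 << (offset % 8))) == 0;
}
```

For a 2x3 array with lower bounds of 1, subscript [2][3] maps to position 5. With fixed-width, non-null elements that position could be multiplied by the element width, but with variable-length or nullable elements the reader still has to walk forward from the start, as described above.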
Source Walkthrough
The ADT library is organized one file per type family under
src/backend/utils/adt/. The through-line is always the same: a pg_proc-
registered V1 function (Datum f(PG_FUNCTION_ARGS)) unpacks its Datum
arguments, does type-specific work, and boxes a Datum back.
The I/O quartet and varlena plumbing (varlena.c, varatt.h, fmgr.c)
- textin/textout/textrecv/textsend: the reference quartet; thin wrappers over cstring_to_text*/text_to_cstring and the StringInfo wire helpers (pq_getmsgtext).
- cstring_to_text_with_len: the constructor; always builds a full 4-byte header via SET_VARSIZE (datums “begin life untoasted”).
- text_to_cstring: the consumer; calls pg_detoast_datum_packed, then VARDATA_ANY/VARSIZE_ANY_EXHDR, freeing the unpacked copy only if it differs from the input.
- VARSIZE_4B/VARSIZE_1B/VARSIZE_ANY_EXHDR/VARDATA_ANY (in varatt.h): the endianness- and form-dispatching header macros that make the four physical layouts uniform to callers.
- pg_detoast_datum_packed/pg_detoast_datum (fmgr.c): the detoast entry points; delegate to detoast_attr (access/common/detoast.c, covered by postgres-toast.md) only for compressed/external datums.
- varstr_cmp/varstrfastcmp_c: collation-aware ordering and the C-locale memcmp SortSupport fast path.
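The form dispatch that the header macros perform can be illustrated with a toy decoder. This is a simplified sketch under stated assumptions: it models only the little-endian bit layout (low bit set for a 1-byte header, low two bits 00 for a plain 4-byte header), it reads the 4-byte header byte by byte so it behaves the same on any host, and every `sketch_*` name is hypothetical rather than part of varatt.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build a toy plain 4-byte-header datum; header bytes are little-endian. */
static size_t
sketch_put_4b(uint8_t *buf, const void *data, uint32_t len)
{
    uint32_t header = (len + 4) << 2;   /* total size; flag bits stay 00 */

    assert(len <= 0x3FFFFFFFu - 4);
    buf[0] = (uint8_t) (header & 0xFF);
    buf[1] = (uint8_t) ((header >> 8) & 0xFF);
    buf[2] = (uint8_t) ((header >> 16) & 0xFF);
    buf[3] = (uint8_t) ((header >> 24) & 0xFF);
    memcpy(buf + 4, data, len);
    return (size_t) len + 4;
}

/* Build a toy short datum: one header byte, at most 126 payload bytes. */
static size_t
sketch_put_short(uint8_t *buf, const void *data, uint32_t len)
{
    assert(len <= 126);
    buf[0] = (uint8_t) (((len + 1) << 1) | 0x01);  /* total size + flag bit */
    memcpy(buf + 1, data, len);
    return (size_t) len + 1;
}

/*
 * Return the payload length and set *payload, whichever header form is
 * present. Returns -1 for forms this sketch does not model (compressed
 * or external/TOAST-pointer datums).
 */
static long
sketch_payload_len(const uint8_t *p, const uint8_t **payload)
{
    assert(p != NULL && payload != NULL);

    if ((p[0] & 0x01) == 0x01)
    {
        uint32_t total = (p[0] >> 1) & 0x7F;

        if (total == 0)
            return -1;          /* header byte 0x01: external pointer tag */
        *payload = p + 1;
        return (long) total - 1;
    }
    else
    {
        uint32_t header = (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
                          ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);

        if ((header & 0x03) != 0x00)
            return -1;          /* compressed in-line datum */
        *payload = p + 4;
        return (long) ((header >> 2) & 0x3FFFFFFF) - 4;
    }
}
```

A caller that only ever goes through `sketch_payload_len` never needs to know which header form it was handed, which is the property the real macros give ADT code.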
numeric (numeric.c)
- NumericVar: the in-memory arithmetic format; init_var_from_num views an on-disk Numeric as a NumericVar, and make_result/make_result_opt_error packs one back, choosing the short or long header.
- numeric_in/numeric_out/numeric_recv/numeric_send: the quartet; output goes through get_str_from_var, binary I/O ships raw int16 digits.
- add_var/add_abs/sub_abs/cmp_abs/mul_var: schoolbook digit-array arithmetic with guard digits (MUL_GUARD_DIGITS).
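Schoolbook addition over base-10000 digits, the idea behind add_abs, can be sketched as follows. This is a simplified illustration that assumes both operands are non-negative integers stored most-significant digit first. It ignores the weight/dscale bookkeeping of the real NumericVar, and `sketch_add_abs` and `SKETCH_NBASE` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NBASE 10000   /* one int16 digit holds four decimal digits */

/*
 * Add two digit arrays (most-significant digit first).
 * res must have room for max(na, nb) + 1 digits; returns digits written.
 * A leading zero digit may remain; stripping it is left to the caller.
 */
static int
sketch_add_abs(const int16_t *a, int na, const int16_t *b, int nb,
               int16_t *res)
{
    int nres = (na > nb ? na : nb) + 1;
    int ia = na - 1;
    int ib = nb - 1;
    int carry = 0;

    assert(na > 0 && nb > 0);

    /* Walk from the least-significant digit, propagating the carry. */
    for (int ir = nres - 1; ir >= 0; ir--)
    {
        int sum = carry;

        if (ia >= 0)
            sum += a[ia--];
        if (ib >= 0)
            sum += b[ib--];

        if (sum >= SKETCH_NBASE)
        {
            res[ir] = (int16_t) (sum - SKETCH_NBASE);
            carry = 1;
        }
        else
        {
            res[ir] = (int16_t) sum;
            carry = 0;
        }
    }
    assert(carry == 0);     /* the extra leading digit absorbs any carry */
    return nres;
}
```

Adding {9999, 9999} and {1} yields {1, 0, 0}, that is 99999999 + 1 = 100000000. Because each digit is below 10000, the per-digit sum never exceeds 19999, so a single carry bit is enough.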
datetime (date.c, timestamp.c, datetime.c)
- ParseDateTime/DecodeDateTime: the shared field tokenizer/decoder used by every temporal input function.
- date2j/j2date: the Julian-day kernel, mapping (year, month, day) ↔ day number and isolating Gregorian calendar math from the integer-arithmetic storage.
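The Julian-day kernel is compact enough to sketch in full. The arithmetic below follows the shape of date2j/j2date as best recalled and should be checked against the tree before being relied on. The `sketch_` prefixes mark these as illustrative copies, not the in-tree functions.

```c
#include <assert.h>
#include <stddef.h>

/* (year, month, day) in the proleptic Gregorian calendar -> Julian day. */
static int
sketch_date2j(int year, int month, int day)
{
    int julian;
    int century;

    /* Shift the year to start in March so the leap day falls at the end. */
    if (month > 2)
    {
        month += 1;
        year += 4800;
    }
    else
    {
        month += 13;
        year += 4799;
    }

    century = year / 100;
    julian = year * 365 - 32167;
    julian += year / 4 - century + century / 4;   /* leap-year corrections */
    julian += 7834 * month / 256 + day;           /* days before the month */
    return julian;
}

/* Julian day -> (year, month, day); the inverse of sketch_date2j. */
static void
sketch_j2date(int jd, int *year, int *month, int *day)
{
    unsigned int julian = (unsigned int) jd + 32044;
    unsigned int quad = julian / 146097;          /* 400-year cycles */
    unsigned int extra = (julian - quad * 146097) * 4 + 3;
    int y;

    assert(year != NULL && month != NULL && day != NULL);

    julian += 60 + quad * 3 + extra / 146097;
    quad = julian / 1461;                         /* 4-year cycles */
    julian -= quad * 1461;
    y = (int) (julian * 4 / 1461);
    julian = ((y != 0) ? ((julian + 305) % 365)
                       : ((julian + 306) % 366)) + 123;
    y += (int) (quad * 4);
    *year = y - 4800;
    quad = julian * 2141 / 65536;
    *day = (int) (julian - 7834 * quad / 256);
    *month = (int) ((quad + 10) % 12 + 1);
}
```

With these formulas 2000-01-01 maps to day 2451545 and 1970-01-01 to day 2440588. Date subtraction and comparison then reduce to integer arithmetic on the day numbers.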
jsonb (jsonb.c, jsonb_util.c, jsonb.h)
- JsonbContainer/JEntry/JB_OFFSET_STRIDE: the binary container format: count+flags header, length-or-offset child array, packed payloads.
- JsonbValueToJsonb/convertToJsonb/convertJsonbValue/convertJsonbArray/convertJsonbObject/convertJsonbScalar: the in-memory-tree → flat-binary depth-first serializer.
- getJsonbOffset/JsonbIteratorNext: random access (sum of ≤31 lengths from the nearest stored offset) and ordered traversal.
- compareJsonbContainers/compareJsonbScalarValue: structural, order-independent comparison.
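The length-or-offset stride can be illustrated with a toy entry array. This is a sketch under assumptions: the type bits of a real JEntry are omitted, the stored offset is taken to be the end offset of the entry that carries it, and all `SK_`/`sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SK_HAS_OFF    0x80000000u   /* entry stores an end offset */
#define SK_OFFLENMASK 0x0FFFFFFFu   /* low 28 bits: length or offset */
#define SK_STRIDE     32            /* every 32nd entry stores an offset */

/* Encode child payload lengths as a length-or-offset entry array. */
static void
sketch_build_entries(const uint32_t *len, int n, uint32_t *entry)
{
    uint32_t total = 0;

    for (int i = 0; i < n; i++)
    {
        total += len[i];
        assert(total <= SK_OFFLENMASK);
        entry[i] = (i % SK_STRIDE == 0) ? (total | SK_HAS_OFF) : len[i];
    }
}

/* Start offset of child `index`: sum lengths back to the nearest offset. */
static uint32_t
sketch_get_offset(const uint32_t *entry, int index)
{
    uint32_t offset = 0;

    for (int i = index - 1; i >= 0; i--)
    {
        offset += entry[i] & SK_OFFLENMASK;
        if (entry[i] & SK_HAS_OFF)
            break;              /* a stored end offset ends the walk */
    }
    return offset;
}
```

The backward walk visits at most one stride of entries, so lookup cost is bounded regardless of container size, while most entries still hold small lengths, which tend to compress better than monotonically growing offsets.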
arrays (arrayfuncs.c, array.h)
- ArrayType and the ARR_* macros (ARR_NDIM, ARR_DIMS, ARR_LBOUND, ARR_DATA_PTR, ARR_NULLBITMAP, ARR_HASNULL, ARR_ELEMTYPE): the header layout and accessors.
- array_in/array_out/array_recv: text {...} and binary I/O, recursing into each element type’s own functions.
- array_get_element/ArrayGetOffset/deconstruct_array: element access and bulk explosion to Datum/isnull arrays.
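Why element access must walk the data area when elements are variable-length or nullable can be shown with a toy seek routine. The element encoding here (a one-byte length prefix) is invented for the sketch and is not the varlena format. The bitmap convention (a set bit means "not null", least-significant bit first) is assumed to match the one PostgreSQL uses, and `sketch_seek` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True when element i is NULL according to the bitmap (NULL bitmap: none). */
static int
sketch_is_null(const uint8_t *nullbitmap, int i)
{
    return nullbitmap != NULL && !(nullbitmap[i >> 3] & (1 << (i & 7)));
}

/*
 * Return a pointer to element n in the packed data area, or NULL if that
 * element is SQL NULL. Each non-null element is stored as a one-byte
 * length followed by that many bytes (toy encoding for this sketch).
 */
static const uint8_t *
sketch_seek(const uint8_t *data, const uint8_t *nullbitmap,
            int nitems, int n)
{
    const uint8_t *p = data;

    assert(data != NULL && n >= 0 && n < nitems);

    for (int i = 0; i < n; i++)
    {
        if (sketch_is_null(nullbitmap, i))
            continue;           /* NULLs occupy no space in the data area */
        p += 1 + p[0];          /* skip the length byte and the payload */
    }
    return sketch_is_null(nullbitmap, n) ? NULL : p;
}
```

Reaching element n costs a pass over the n elements before it, since neither their sizes nor the number of NULLs among them is known without looking. Fixed-width, null-free arrays avoid the walk because the byte position is simply n times the element width.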
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| textin | src/backend/utils/adt/varlena.c | 588 |
| textout | src/backend/utils/adt/varlena.c | 599 |
| textrecv | src/backend/utils/adt/varlena.c | 610 |
| cstring_to_text_with_len | src/backend/utils/adt/varlena.c | 205 |
| text_to_cstring | src/backend/utils/adt/varlena.c | 226 |
| varstr_cmp | src/backend/utils/adt/varlena.c | 1666 |
| varstrfastcmp_c | src/backend/utils/adt/varlena.c | 2121 |
| VARSIZE_4B / VARSIZE_1B | src/include/varatt.h | 192 / 194 |
| VARSIZE_ANY_EXHDR | src/include/varatt.h | 317 |
| VARDATA_ANY | src/include/varatt.h | 324 |
| pg_detoast_datum | src/backend/utils/fmgr/fmgr.c | 1832 |
| pg_detoast_datum_packed | src/backend/utils/fmgr/fmgr.c | 1864 |
| NumericVar | src/backend/utils/adt/numeric.c | 313 |
| numeric_in | src/backend/utils/adt/numeric.c | 637 |
| numeric_out | src/backend/utils/adt/numeric.c | 816 |
| numeric_recv | src/backend/utils/adt/numeric.c | 1078 |
| numeric_send | src/backend/utils/adt/numeric.c | 1163 |
| init_var_from_num | src/backend/utils/adt/numeric.c | 7570 |
| get_str_from_var | src/backend/utils/adt/numeric.c | 7613 |
| make_result_opt_error | src/backend/utils/adt/numeric.c | 7901 |
| make_result | src/backend/utils/adt/numeric.c | 8010 |
| add_var | src/backend/utils/adt/numeric.c | 8550 |
| mul_var | src/backend/utils/adt/numeric.c | 8788 |
| add_abs | src/backend/utils/adt/numeric.c | 11942 |
| JsonbContainer / JEntry | src/include/utils/jsonb.h | 190 / 136 |
| JB_OFFSET_STRIDE | src/include/utils/jsonb.h | 178 |
| JsonbValueToJsonb | src/backend/utils/adt/jsonb_util.c | 92 |
| compareJsonbContainers | src/backend/utils/adt/jsonb_util.c | 191 |
| getJsonbOffset | src/backend/utils/adt/jsonb_util.c | 134 |
| JsonbIteratorNext | src/backend/utils/adt/jsonb_util.c | 859 |
| convertJsonbValue | src/backend/utils/adt/jsonb_util.c | 1603 |
| convertJsonbArray | src/backend/utils/adt/jsonb_util.c | 1628 |
| convertJsonbObject | src/backend/utils/adt/jsonb_util.c | 1712 |
| ArrayType | src/include/utils/array.h | 92 |
| array_in | src/backend/utils/adt/arrayfuncs.c | 179 |
| array_out | src/backend/utils/adt/arrayfuncs.c | 1016 |
| array_recv | src/backend/utils/adt/arrayfuncs.c | 1271 |
| array_get_element | src/backend/utils/adt/arrayfuncs.c | 1820 |
| deconstruct_array | src/backend/utils/adt/arrayfuncs.c | 3631 |
Source verification (as of 2026-06-05)
Verified facts
All symbols, constants, and code excerpts below were read directly from the
REL_18_STABLE working tree at commit 273fe94 on 2026-06-05.
- The I/O quartet is real and thin. textin/textout/textrecv in varlena.c are exactly the few-line wrappers quoted; the V1 ABI (PG_FUNCTION_ARGS, PG_GETARG_*, PG_RETURN_*) is the universal signature. Confirmed.
- Four varlena physical forms. varatt.h defines the 4-byte (plain/compressed), 1-byte short, and 1B_E external/TOAST-pointer forms, with the flag-bit dispatch in VARSIZE_ANY_EXHDR/VARDATA_ANY. The short header caps the payload at 126 bytes (VARATT_SHORT_MAX-derived). Confirmed.
- Detoast split. pg_detoast_datum_packed (fmgr.c) returns the input untouched unless it is VARATT_IS_COMPRESSED or VARATT_IS_EXTERNAL, in which case it calls detoast_attr. Confirmed; the actual fetch/decompress lives in access/common/detoast.c (deferred to postgres-toast.md).
- numeric radix. The live NBASE is 10000 with DEC_DIGITS = 4 and NumericDigit = int16; MUL_GUARD_DIGITS = 2. The NumericVar struct and the short/long/special on-disk header families are as quoted. Confirmed.
- make_result canonicalization. make_result_opt_error strips leading and trailing zero digits, canonicalizes zero to weight 0 / positive, and picks the short header when NUMERIC_CAN_BE_SHORT. Confirmed.
- jsonb stride. JB_OFFSET_STRIDE == 32; convertJsonbArray converts every 32nd child’s JEntry to a JENTRY_HAS_OFF cumulative offset and caps payload at JENTRY_OFFLENMASK (0x0FFFFFFF, 256 MB). Confirmed.
- Structural jsonb comparison. compareJsonbContainers drives two JsonbIterators in lock-step; because object keys are stored in sorted order, equality does not depend on the key order of the original input. Confirmed.
- array null bitmap encoding. ARR_HASNULL(a) is literally ((a)->dataoffset != 0); the null bitmap exists iff dataoffset is nonzero. array_get_element detoasts, bounds-checks, and computes a linear offset via ArrayGetOffset. Confirmed.
Open questions
- Where exactly short-header down-conversion happens. Constructors build full 4-byte headers; the conversion to 1-byte short form occurs during tuple assembly (the heap_fill_tuple/fill_val path). The precise trigger and the interaction with typstorage belong to postgres-toast.md/postgres-heap-am.md; this doc only asserts the four forms exist.
- lz4 vs pglz selection for in-line compression. The va_tcinfo high bits encode the method, but the default-compression GUC and the per-column ALTER ... SET COMPRESSION path are TOAST concerns, not ADT concerns. Deferred.
- Abbreviated-key encoding for non-C collations. varstrfastcmp_c is the C-locale fast path; the ICU/libc abbreviated-key converter and its abort-if-unhelpful heuristic live in the SortSupport/i18n code and are out of scope here.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
The object-relational bet. PostgreSQL’s “a type is catalog rows plus
registered C functions” is the direct descendant of The Design of POSTGRES
(Stonebraker & Rowe 1986) and the lineage traced in What Goes Around Comes
Around (Stonebraker & Hellerstein 2005; captured in
dbms-papers/goes-around.md): the object-relational school chose an open
type system over a fixed one. The payoff is visible today — PostGIS
(geometry), pgvector (vector), citext, and hstore are all
“just types” with no executor patches. The cost is the rigid V1 ABI and the
fmgr indirection on every call, which a closed system (e.g. a hand-tuned
analytics engine) avoids by hard-coding its handful of types.
Variable-length headers elsewhere. The varlena “shrink the header for
small values, spill the big ones out of line” pattern recurs across engines.
SQL Server uses in-row vs row-overflow vs LOB_DATA allocation units; Oracle
distinguishes inline VARCHAR2 from out-of-line LOB segments with a LOB
locator; MySQL/InnoDB stores long VARCHAR/BLOB columns off-page with a
20-byte pointer. PostgreSQL’s distinctive 1-byte short header (an
unaligned length-and-flag byte capped at 126 bytes) is unusually aggressive
about the small-string common case, reflecting how much of a real schema is
short text. Column stores push this further: dictionary/RLE/bit-packing
encodings (the C-Store/Vertica lineage, dbms-papers/column-vs-row.md)
make the “length” implicit in the encoding rather than per-value.
Decimal arithmetic. Base-10000 schoolbook arithmetic is a conventional
choice: IBM’s decNumber and many engines use a power-of-ten radix for clean
rounding and text conversion, whereas Java’s BigDecimal instead pairs a binary
BigInteger unscaled value with a decimal scale. Hardware decimal
(IEEE 754-2008 decimal floating point, POWER’s DFP unit) is the road not taken
for general-purpose engines, which prefer the portability and unbounded
precision of a software digit array.
Binary JSON. The length-or-offset-with-stride trick is PostgreSQL’s answer to a tension every binary-JSON format faces: MySQL’s binary JSON stores full offset tables (fast access, poor compression), while a pure length encoding compresses well but is O(n) to index. The stride is a tunable midpoint. Research on succinct/compressed semi-structured storage and on schema-inference for JSON columns (e.g. JSON tiles, Sinew) continues to probe whether a fully columnar shredding of JSON beats a single TOASTed blob for analytic workloads — a frontier where PostgreSQL’s “one varlena per document” is deliberately on the OLTP-friendly side.
Arrays vs nested relations. PostgreSQL’s flat dimensioned array with an
element-type OID is the SQL-standard ARRAY realized as a single value. The
alternative lineage — nested tables / MULTISET (Oracle), and the
NF² (non-first-normal-form) research tradition — models collections as
first-class relations. PostgreSQL stays closer to first normal form,
treating the array as an opaque scalar that the unnest/array_agg operators
bridge to and from rows.
Sources
In-tree source files (REL_18_STABLE, commit 273fe94)
- src/backend/utils/adt/varlena.c: text/bytea I/O, varstr_cmp, SortSupport, *_to_text/text_to_cstring.
- src/backend/utils/adt/numeric.c: arbitrary-precision decimal: I/O, NumericVar, make_result, digit-array arithmetic.
- src/backend/utils/adt/jsonb.c, src/backend/utils/adt/jsonb_util.c: jsonb I/O, the JsonbContainer/JEntry binary format, serialization (convertJsonb*), structural comparison, iteration.
- src/backend/utils/adt/arrayfuncs.c: array I/O, ArrayType access, array_get_element, deconstruct_array.
- src/backend/utils/adt/date.c, src/backend/utils/adt/datetime.c: temporal I/O and the date2j/j2date Julian kernel.
- src/include/varatt.h: the varlena header layouts and VAR* macros.
- src/include/utils/jsonb.h, src/include/utils/array.h: jsonb and array on-disk structures and accessor macros.
- src/backend/utils/fmgr/fmgr.c: pg_detoast_datum* detoast entry points.
Knowledge-base cross-references
- postgres-fmgr.md: the V1 calling convention, FmgrInfo, FunctionCallInfo, and how ADT functions are dispatched by OID.
- postgres-toast.md: out-of-line storage, in-line compression (pglz/lz4), detoast_attr, expanded datums.
- postgres-nbtree.md, postgres-index-am.md: how operator classes consume the comparison/hash functions these types register.
- postgres-overview-base-infra.md, postgres-overview-i18n-text.md: surrounding base-infrastructure and collation/text context.
- dbms-papers/goes-around.md: What Goes Around Comes Around (the object-relational type-system lineage).
- research/dbms-general/database-system-concepts.md: domains, types, and the relational type system (ch. 4–5).