PostgreSQL Character Set Encoding — Server/Client Encodings and Conversion
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-06)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A character set encoding is a mapping between abstract characters and the
byte sequences that represent them on the wire and on disk. Every text-like
value a database stores — a text column, a table name, an SQL string
literal — is ultimately a run of bytes, and the encoding is the contract that
says how to cut that run into characters and what each character is. Three
properties define the design space:
- Fixed-width vs. variable-width. A single-byte encoding (ASCII, ISO-8859-1) maps each character to exactly one byte: character count equals byte count, and string[i] is the i-th character. A multibyte encoding (UTF-8, EUC-JP, GB18030) uses one to four bytes per character, so finding the i-th character requires walking the string from the start, decoding each character’s length as you go. This single fact, that you cannot index a multibyte string in O(1), drives almost every interesting piece of engineering in a multibyte-aware engine. It is why length() on a text value is O(n) not O(1), why substr() cannot simply offset a pointer, and why varchar(n) truncation needs pg_mbcliplen to find a character boundary at or before the byte limit rather than blindly cutting at byte n. The single-byte fast path (count = strlen, index = pointer arithmetic) is preserved precisely so that databases declared in a single-byte encoding pay none of this cost: the engine branches on maxmblen == 1 to keep the common Latin-1/ASCII case cheap.
- Self-synchronizing vs. stateful. UTF-8 is self-synchronizing: from any byte you can tell whether it is a leading byte (high bits 0, 110, 1110, 11110) or a continuation byte (10), so a corrupt or truncated stream can resynchronize at the next character boundary. Stateful encodings (ISO-2022 with its escape-sequence shift states) cannot; PostgreSQL deliberately refuses such encodings as a server encoding. Self-synchronization is what makes a fast validator possible.
- Repertoire and round-tripping. Two encodings may cover different sets of characters. Converting EUC-JP → LATIN1 must fail for any Japanese character, because LATIN1 has no code point for it: an untranslatable character. Converting LATIN1 → UTF8 always succeeds because UTF-8’s repertoire is a superset. A correct conversion framework must distinguish “the input was malformed” (invalid encoding) from “the input was well-formed but the target cannot represent it” (untranslatable): two different error classes with two different SQLSTATEs.
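The cost of variable width can be made concrete with a minimal sketch. The helper names below (utf8_len_from_lead, utf8_char_count, utf8_clip_len) are hypothetical and are not PostgreSQL functions; they only illustrate why counting and clipping must walk the string, assuming the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length announced by a UTF-8 lead byte (1 for bogus leads). */
static size_t utf8_len_from_lead(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Count characters by walking the string: O(n), no random access. */
size_t utf8_char_count(const char *s, size_t len)
{
    size_t chars = 0;

    assert(s != NULL);
    while (len > 0)
    {
        size_t l = utf8_len_from_lead((unsigned char) *s);

        if (l > len)
            l = len;            /* truncated tail counts as one character */
        s += l;
        len -= l;
        chars++;
    }
    return chars;
}

/* Largest byte length <= limit that ends on a character boundary. */
size_t utf8_clip_len(const char *s, size_t len, size_t limit)
{
    size_t clipped = 0;

    assert(s != NULL);
    while (clipped < len)
    {
        size_t l = utf8_len_from_lead((unsigned char) s[clipped]);

        if (clipped + l > limit || clipped + l > len)
            break;              /* next character would cross the limit */
        clipped += l;
    }
    return clipped;
}
```

For the five-byte string "caf\xC3\xA9", the count is 4 characters, and clipping to a 4-byte limit returns 3 rather than splitting the two-byte é.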
Database System Concepts (Silberschatz et al.) treats character data as one
of the SQL built-in types and notes that string comparison and ordering are
locale- and encoding-sensitive — it is the encoding that fixes which bytes are
one character before any collation question (how to order those characters)
can even be asked. Database Internals (Petrov) frames the on-disk
representation question generally: the storage engine stores opaque byte runs,
and a layer above must impose meaning. Encoding is precisely that layer for
text. The key architectural consequence the textbooks imply, and that
PostgreSQL makes concrete, is separation of concerns: encoding answers
“where does one character end?”; collation (the sibling
postgres-collation-providers.md) answers “how do two characters order?”.
The two are orthogonal but coupled — a collation needs to decode characters,
so it leans on the encoding for character boundaries, but never the reverse.
Unicode deserves special mention because PostgreSQL builds its whole conversion
matrix around it. UTF-8 (RFC 3629) encodes a Unicode code point (U+0000 to
U+10FFFF) in one to four bytes, with the crucial property that the byte length
is determined entirely by the leading byte and that overlong encodings (using
more bytes than necessary) are illegal. The illegality of overlong forms is a
security property, not a pedantic one: without it, an attacker could smuggle an
ASCII character (say / or ') past an ASCII-level filter by encoding it as a
two-byte sequence that decodes back to the ASCII byte. PostgreSQL’s UTF-8
validator (pg_utf8_islegal) implements exactly the RFC 3629 byte-range
restrictions that forbid overlong forms and surrogate halves.
A final subtlety the textbooks raise but that lives above the encoding layer
is normalization. Unicode lets the same visible character be spelled more
than one way — é can be a single precomposed code point (U+00E9) or a base
letter plus a combining accent (U+0065 U+0301). These are distinct byte
sequences in UTF-8 and therefore distinct encoded strings, even though they
are canonically equivalent. The encoding layer deliberately does not
normalize: its job is byte boundaries and validity, not semantic equivalence.
Normalization (NFC/NFD) is a separate operation a caller invokes explicitly,
which keeps the encoding layer a pure, fast, lossless byte discipline and
leaves equivalence policy to the layer that actually compares or matches text.
The off-by-one Hangul recomposition fix in this very revision (commit message
“Fix off-by-one with NFC recomposition for Hangul U+11A7”) is a reminder that
normalization is intricate and rightly kept out of the hot validation path.
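A small sketch shows what "distinct encoded strings" means at the byte level. The names bytes_equal, E_ACUTE_NFC and E_ACUTE_NFD are hypothetical; the byte values are the standard UTF-8 encodings of U+00E9 and of U+0065 followed by U+0301.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Byte-level equality: the only notion of "same string" the encoding layer has. */
bool bytes_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strlen(a) == strlen(b) && memcmp(a, b, strlen(a)) == 0;
}

/* Precomposed U+00E9 and decomposed U+0065 U+0301, both valid UTF-8. */
const char *const E_ACUTE_NFC = "\xC3\xA9";
const char *const E_ACUTE_NFD = "e\xCC\x81";
```

Both strings pass UTF-8 validation, but they differ in length (2 bytes versus 3) and in content, so a byte comparison reports them as unequal unless a caller normalizes first.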
Common DBMS Design
This section names the engineering patterns that multibyte-aware relational engines converge on, so that PostgreSQL’s specific choices read as selections within a shared space rather than as ad-hoc inventions.
A per-database fixed encoding plus a per-session view encoding
Almost every engine distinguishes the encoding in which data is stored from the encoding in which a particular client wants to see it. Storing data in a single fixed encoding per database keeps every byte comparison, every index, and every on-disk tuple unambiguous: the engine never has to ask “what encoding is this row in?”. Meanwhile a client connecting from a legacy LATIN1 application and a client connecting from a UTF-8 terminal can both talk to the same UTF-8 database, each declaring its own client encoding, and the engine transcodes on the session boundary. This is the server-encoding / client-encoding split, and it localizes all transcoding to two choke points: data coming in (client → server) and data going out (server → client).
Conversion through a pivot encoding
Supporting N encodings with direct pairwise converters needs O(N²) conversion tables, which is unmanageable when N is forty. The standard trick is a pivot encoding: convert source → pivot → target, so each encoding needs only a converter to and from the pivot, giving O(N) tables. Unicode (UTF-8 or UTF-16) is the universal pivot in modern engines because its repertoire is a superset of essentially every legacy charset. The cost is two passes instead of one and a possible loss of fidelity for characters that round-trip imperfectly through Unicode; the benefit is linear table growth and one well-tested code path.
The fidelity caveat is not hypothetical. Some legacy charsets contain
characters that map to the same Unicode code point as another character, or
have multiple legacy code points that Unicode unifies; a pivot through Unicode
can then collapse a distinction the legacy charset made, so source→pivot→source
is not guaranteed to be the identity. Engines handle this with carefully
authored mapping tables (PostgreSQL’s are generated from authoritative Unicode
mapping files and checked into conversion_procs/), and they distinguish a
lossy but defined mapping from an undefined one — the latter being the
untranslatable-character error. The pivot design does not remove the
round-tripping problem; it concentrates it into the two pivot tables where it
can be audited once rather than across O(N²) pairwise tables.
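The table arithmetic and the fidelity caveat can both be sketched briefly. Everything here is hypothetical: the "TOY" charset and its mapping are made up for illustration and do not correspond to any real legacy encoding or to PostgreSQL's mapping tables.

```c
#include <assert.h>

/* Directed conversion tables needed with direct pairwise converters. */
unsigned int pairwise_tables(unsigned int n)
{
    assert(n >= 1);
    return n * (n - 1);
}

/* With a pivot: one table to it and one from it per non-pivot encoding. */
unsigned int pivot_tables(unsigned int n)
{
    assert(n >= 1);
    return 2 * (n - 1);
}

/*
 * Made-up legacy charset "TOY": bytes 0xA1 and 0xA2 are distinct legacy
 * characters that this invented mapping unifies into U+00B7.
 */
unsigned int toy_to_unicode(unsigned char b)
{
    if (b < 0x80)
        return b;
    if (b == 0xA1 || b == 0xA2)
        return 0x00B7;
    return 0xFFFD;              /* no mapping in this toy table */
}

/* Returns the TOY byte for a code point, or -1 if untranslatable. */
int unicode_to_toy(unsigned int cp)
{
    if (cp < 0x80)
        return (int) cp;
    if (cp == 0x00B7)
        return 0xA1;            /* only one of the two sources can win */
    return -1;
}
```

With forty encodings the pairwise design needs 1560 directed tables and the pivot design 78. In the toy mapping, byte 0xA2 comes back as 0xA1 after a round trip through the pivot: a lossy but defined mapping, in contrast to U+3042, which has no TOY byte at all and is reported as untranslatable.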
Verify on ingress, trust on egress
Data that originates outside the server (a string literal in a submitted
query, a COPY line, a bytea handed to convert_from) cannot be trusted to
be validly encoded, and a single malformed multibyte sequence can desynchronize
a parser or overrun a buffer. So engines validate aggressively at the
boundary where untrusted bytes enter, and then trust the data internally:
once a string is known to be valid server-encoded text, internal operations
skip revalidation. The asymmetry is deliberate — validation is O(n) and the
boundary is the only place an adversary controls the bytes.
The security stakes are concrete. A malformed multibyte sequence whose second byte happens to be an ASCII quote or backslash can, if it reaches a byte-oriented parser unverified, break out of a string literal — the classic encoding-based SQL-injection vector. By forcing every externally-sourced byte through a validator before the lexer, the engine guarantees the parser only ever sees well-formed server-encoded text, which is why the ASCII-transparency property (no ASCII byte ever appears as a non-leading byte) is a hard requirement for server encodings and merely a tolerated risk for client encodings that are transcoded away on the way in.
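A well-known instance of this hazard is the Shift-JIS character 表, encoded as 0x95 0x5C: its second byte is the byte ASCII assigns to backslash. The sketch below contrasts a byte-oriented scan with a character-aware one. The function names are hypothetical, and the lead-byte ranges are a simplifying assumption about Shift-JIS rather than a full implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Byte-oriented scan: counts every 0x5C byte, wherever it sits. */
size_t count_backslash_bytes(const unsigned char *s, size_t len)
{
    size_t n = 0;

    assert(s != NULL);
    for (size_t i = 0; i < len; i++)
        if (s[i] == 0x5C)
            n++;
    return n;
}

/*
 * Character-aware scan for Shift-JIS. Simplifying assumption: bytes
 * 0x81..0x9F and 0xE0..0xFC start a two-byte character.
 */
size_t count_backslash_chars_sjis(const unsigned char *s, size_t len)
{
    size_t n = 0;
    size_t i = 0;

    assert(s != NULL);
    while (i < len)
    {
        unsigned char b = s[i];
        int two = (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);

        if (two && i + 1 < len)
        {
            i += 2;             /* skip the trail byte, whatever it is */
            continue;
        }
        if (b == 0x5C)
            n++;
        i++;
    }
    return n;
}
```

On the two bytes 0x95 0x5C, the byte-oriented scan sees one backslash and the character-aware scan sees none. A byte-oriented parser is the first kind of scanner, which is why such encodings must be transcoded away before parsing.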
A per-encoding function table
Because every multibyte operation (how long is this character? is this byte
sequence valid? what is the display width?) is encoding-specific, engines route
them through a table of function pointers indexed by encoding ID, rather than a
giant switch re-evaluated per character. This is a classic vtable: one row per
encoding, one column per primitive (mblen, verify, char↔wchar,
dsplen). Hot loops fetch the function pointer once and call it per character.
The vtable also doubles as a capability declaration. A single-byte encoding’s
row points all of its primitives at trivial implementations (mblen returns 1,
verify checks only for NUL), and its maxmblen of 1 is the flag the
length/clip routines test to take the O(1) fast path. A client-only encoding’s
row may legally have NULL in the mb2wchar/wchar2mb slots (it never needs
to produce the internal wide-char form), while the mblen/verify slots are
always populated because validation must work for any client encoding on
ingress. Reading the table’s shape thus tells you, per encoding, exactly which
operations the engine is prepared to perform — without consulting any catalog.
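The dispatch pattern can be sketched with a two-row table. The struct, enum, and function names below are hypothetical miniatures and not PostgreSQL's definitions; the sketch only shows a row being fetched once and maxmblen acting as the fast-path flag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of a per-encoding function table. */
typedef struct
{
    size_t (*mblen) (const unsigned char *s);   /* bytes in the character at s */
    int     maxmblen;                           /* 1 marks a single-byte encoding */
} enc_ops;

static size_t single_mblen(const unsigned char *s)
{
    (void) s;
    return 1;
}

static size_t utf8_mblen(const unsigned char *s)
{
    if ((*s & 0x80) == 0x00) return 1;
    if ((*s & 0xE0) == 0xC0) return 2;
    if ((*s & 0xF0) == 0xE0) return 3;
    if ((*s & 0xF8) == 0xF0) return 4;
    return 1;
}

enum { ENC_LATIN1 = 0, ENC_UTF8 = 1, ENC_COUNT };

static const enc_ops enc_table[ENC_COUNT] = {
    [ENC_LATIN1] = {single_mblen, 1},
    [ENC_UTF8] = {utf8_mblen, 4},
};

/* Character count of a NUL-terminated string in the given encoding. */
size_t enc_strlen(int enc, const char *s)
{
    assert(enc >= 0 && enc < ENC_COUNT && s != NULL);

    const enc_ops *ops = &enc_table[enc];       /* fetch the row once */

    if (ops->maxmblen == 1)
        return strlen(s);                       /* single-byte fast path */

    size_t chars = 0;
    const unsigned char *p = (const unsigned char *) s;

    while (*p)
    {
        size_t l = ops->mblen(p);

        /* never step past the terminator inside a truncated character */
        while (l > 0 && *p)
        {
            p++;
            l--;
        }
        chars++;
    }
    return chars;
}
```

The same five bytes "caf\xC3\xA9" count as 4 characters through the UTF-8 row and as 5 through the LATIN1 row, which is the encoding-as-contract point in miniature.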
Sharing the encoding library between server and tools
The same mblen/validate logic is needed by client tools (psql, pg_dump),
which have no backend. Engines therefore factor the pure, dependency-free
encoding primitives into a shared library that both the server and the
frontend link, keeping only the catalog-aware parts (looking up a conversion
function in a system catalog) on the server side.
```mermaid
flowchart LR
  subgraph client["Client side"]
    APP["application bytes<br/>in client_encoding"]
  end
  subgraph server["Backend (one process, fixed server encoding)"]
    IN["ingress<br/>pg_client_to_server<br/>VERIFY + convert"]
    CORE["stored / parsed text<br/>(server encoding, trusted)"]
    OUT["egress<br/>pg_server_to_client<br/>convert (trust)"]
  end
  APP -->|"Query / Bind / COPY"| IN --> CORE
  CORE --> OUT -->|"DataRow / results"| APP
  IN -. "no conversion if<br/>client == server or SQL_ASCII" .-> CORE
```
PostgreSQL’s Approach
PostgreSQL fixes the server encoding (also called the database encoding) at
CREATE DATABASE time: it is immutable for the life of the database and is the
encoding of every stored text, varchar, name, cstring, xml, and
json value, and of the SQL text after it enters the backend. The client
encoding is a per-session GUC (client_encoding) that the client may change
at any time. The backend caches three pg_enc2name pointers — one each for the
database, client, and message encoding — and a small set of cached FmgrInfo
conversion-function handles, in mbutils.c:
```c
// ClientEncoding / DatabaseEncoding / MessageEncoding (mbutils.c)
static const pg_enc2name *ClientEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
static const pg_enc2name *DatabaseEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
static const pg_enc2name *MessageEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
```

DatabaseEncoding is set once, early, by SetDatabaseEncoding and never
changes. ClientEncoding moves whenever the session runs SET client_encoding
or sends a startup client_encoding parameter.
The third pointer, MessageEncoding, exists because server log and error
messages may be emitted in a different encoding than either the database or
the client — gettext localizes messages, and the encoding of the translated
message catalog need not match the database. GetMessageEncoding lets the
error-reporting path tag emitted text correctly. The separation matters most
during startup and error handling, when a backend may need to emit a message
before ClientEncoding is even resolved; defaulting all three pointers to
PG_SQL_ASCII at process start (see the static initializers above) guarantees
the message path never dereferences an uninitialized conversion proc.
The encoding identity space
Every encoding has a small integer ID (a pg_enc enum value), and the IDs are
partitioned: IDs 0 .. PG_ENCODING_BE_LAST (which is PG_KOI8U) may be either
a server or a client encoding, while IDs above it (PG_SJIS, PG_BIG5,
PG_GBK, PG_UHC, PG_GB18030, PG_JOHAB, PG_SHIFT_JIS_2004) are
client-only. Those seven are deliberately barred as server encodings
because they are not ASCII-transparent: a byte in the ASCII range can appear as
the second byte of a multibyte character, which would let an embedded \ or
' confuse a byte-oriented parser. The macros encode this partition:
```c
// PG_VALID_BE_ENCODING / PG_VALID_FE_ENCODING (pg_wchar.h)
#define PG_VALID_BE_ENCODING(_enc) \
    ((_enc) >= 0 && (_enc) <= PG_ENCODING_BE_LAST)

/* On FE are possible all encodings */
#define PG_VALID_FE_ENCODING(_enc)  PG_VALID_ENCODING(_enc)
```

Encoding names are resolved to IDs by pg_char_to_encoding (a binary
search over a normalized-name table in src/common/encnames.c) and back by
pg_encoding_to_char. The name table and the ID→name table (pg_enc2name_tbl)
both live in src/common precisely so frontend tools can resolve encoding
names without a backend.
The conversion framework: catalog-driven, UTF-8-pivoted
A conversion between two server-capable encodings is performed by a function
registered in the pg_conversion system catalog. FindDefaultConversionProc
(in namespace.c) walks the active search path to find the default
conversion proc for a (for_encoding, to_encoding) pair; the actual byte
work is done by the proc, which is almost always one of the C functions in
src/backend/utils/mb/conversion_procs/ that ultimately call UtfToLocal or
LocalToUtf (in conv.c). PostgreSQL does not ship a direct converter for
every pair; instead UTF-8 is the pivot, so e.g. EUC_JP → EUC_KR is realized as
two registered conversions (EUC_JP → UTF8, then UTF8 → EUC_KR) when invoked
through the higher-level paths, while the catalog stores the per-pair procs that
each go through UTF-8 internally.
The pivot mechanics live in UtfToLocal (and its mirror LocalToUtf): it
walks the UTF-8 input one character at a time, re-validating each character
with pg_utf8_islegal before looking the code point up in a radix-tree map to
the local encoding. A failed lookup or an illegal byte is the seam where the
two error classes diverge — malformed input vs. untranslatable character:
```c
// UtfToLocal, src/backend/utils/mb/conv.c (condensed)
for (; len > 0; len -= l)
{
    if (*utf == '\0')
        break;
    l = pg_utf_mblen(utf);
    if (len < l)
        break;                  /* truncated trailing char */
    if (!pg_utf8_islegal(utf, l))
        break;                  /* malformed -> report_invalid_encoding */
    if (l == 1)
    {
        *iso++ = *utf++;        /* ASCII passes through */
        continue;
    }
    /* collect b1..b4, pg_mb_radix_conv() lookup, else combined-char
       bsearch, else report_untranslatable_char() */
}
```

Every break falls into the post-loop error path; whether that path raises
report_invalid_encoding or report_untranslatable_char (or silently returns
in noError mode) is exactly the invalid-vs-untranslatable distinction from
the Theoretical Background.
The single most important entry point is pg_do_encoding_conversion, the
general-case converter. Its structure is a cascade of fast-paths before it ever
touches the catalog:
```c
// pg_do_encoding_conversion, mbutils.c (condensed)
if (len <= 0)
    return src;     /* empty string is always valid */
if (src_encoding == dest_encoding)
    return src;     /* no conversion required, assume valid */
if (dest_encoding == PG_SQL_ASCII)
    return src;     /* any string is valid in SQL_ASCII */
if (src_encoding == PG_SQL_ASCII)
{
    /* No conversion is possible, but we must validate the result */
    (void) pg_verify_mbstr(dest_encoding, (const char *) src, len, false);
    return src;
}
/* ... look up proc via FindDefaultConversionProc, allocate, OidFunctionCall6 ... */
```

Note the SQL_ASCII semantics, which are PostgreSQL’s escape hatch and its
foot-gun: SQL_ASCII means “no encoding declared — treat bytes as opaque.” Any
string is “valid” in SQL_ASCII, and no conversion is ever done to or from it.
That makes a SQL_ASCII database a permissive byte bucket; it also means the
server cannot guarantee anything about the text it stores.
When a conversion is needed, the result buffer is sized for the worst case —
MAX_CONVERSION_GROWTH (= 4) bytes of output per input byte — and the proc is
invoked through OidFunctionCall6 with the standard six-argument conversion
signature (src_encoding, dest_encoding, src, dest, len, noError). The
allocation is done with MemoryContextAllocHuge and guarded against integer
overflow, because len * 4 can exceed MaxAllocSize even when the real result
would fit. The over-allocation is then trimmed: after the proc reports how many
bytes it actually wrote, large buffers are shrunk with repalloc so the slack
between worst-case and actual output is not held for the life of the result.
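The overflow guard on the worst-case size can be sketched as follows. This is not PostgreSQL's code: worst_case_bufsize and GROWTH are hypothetical names, and the sketch guards against SIZE_MAX, whereas the server checks against its own allocation limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GROWTH 4    /* worst-case output bytes per input byte, as in the text */

/*
 * Computes len * GROWTH + 1 into *out (the +1 leaves room for a
 * terminator). Returns false if the result would overflow size_t.
 */
bool worst_case_bufsize(size_t len, size_t *out)
{
    assert(out != NULL);
    if (len > (SIZE_MAX - 1) / GROWTH)
        return false;           /* refuse rather than wrap around */
    *out = len * GROWTH + 1;
    return true;
}
```

The check divides the limit instead of multiplying the length, so the comparison itself can never overflow. A 10-byte input yields a 41-byte buffer; a length of SIZE_MAX / 2 is refused.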
The six-argument ABI also carries a noError flag, and this is what lets the
same proc back two very different callers. SQL convert() and the parser want
a hard ERROR on a bad byte; speculative callers (e.g. probing whether a value
can be represented in a target encoding) want a soft failure. With
noError = true the proc stops at the first untranslatable or invalid
character and returns the count of bytes successfully consumed instead of
calling report_invalid_encoding / report_untranslatable_char. The error
classification — invalid encoding (CHARACTER_NOT_IN_REPERTOIRE) versus
untranslatable character (UNTRANSLATABLE_CHARACTER) — is therefore not
decided by the caller but emerges from where in the proc the byte failed:
malformed input fails the pg_utf8_islegal / verify check, while a well-formed
character with no target mapping fails the radix-tree lookup.
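The soft-failure contract can be illustrated with a toy UTF-8 to LATIN1 converter. This is a sketch under stated assumptions, not a PostgreSQL conversion proc: the function name, the conv_status enum, and the out-parameters are hypothetical, and the sketch does not apply the overlong and surrogate range checks to 3- and 4-byte sequences.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { CONV_OK, CONV_INVALID, CONV_UNTRANSLATABLE } conv_status;

/*
 * Convert UTF-8 to LATIN1 without raising errors. dst needs room for len
 * bytes. Returns the number of source bytes consumed; *status records why
 * conversion stopped and *written how many bytes were produced.
 */
size_t utf8_to_latin1_soft(const unsigned char *src, size_t len,
                           unsigned char *dst, size_t *written,
                           conv_status *status)
{
    size_t i = 0;
    size_t o = 0;

    assert(src && dst && written && status);
    *status = CONV_OK;
    while (i < len)
    {
        unsigned char b = src[i];

        if (b < 0x80)
        {
            dst[o++] = b;                       /* ASCII passes through */
            i++;
        }
        else if (b >= 0xC2 && b <= 0xDF)        /* legal 2-byte lead */
        {
            if (i + 1 >= len || (src[i + 1] & 0xC0) != 0x80)
            {
                *status = CONV_INVALID;         /* truncated or bad trail byte */
                break;
            }
            unsigned int cp = ((b & 0x1Fu) << 6) | (src[i + 1] & 0x3Fu);

            if (cp > 0xFF)
            {
                *status = CONV_UNTRANSLATABLE;  /* well-formed, no LATIN1 slot */
                break;
            }
            dst[o++] = (unsigned char) cp;
            i += 2;
        }
        else if (b >= 0xE0 && b <= 0xF4)        /* 3- or 4-byte lead */
        {
            size_t need = (b >= 0xF0) ? 4 : 3;
            bool ok = i + need <= len;

            for (size_t k = 1; ok && k < need; k++)
                ok = (src[i + k] & 0xC0) == 0x80;
            *status = ok ? CONV_UNTRANSLATABLE : CONV_INVALID;
            break;
        }
        else
        {
            *status = CONV_INVALID;             /* 0x80..0xC1 or 0xF5..0xFF */
            break;
        }
    }
    *written = o;
    return i;
}
```

The caller never chooses the error class. A euro sign (0xE2 0x82 0xAC) stops the loop as untranslatable, while the overlong pair 0xC0 0xAF stops it as invalid, and in both cases the return value tells the caller how much input was safely consumed.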
```mermaid
flowchart TD
  S["pg_do_encoding_conversion(src, len, from, to)"] --> A{"len<=0 or<br/>from==to?"}
  A -->|yes| R1["return src as-is"]
  A -->|no| B{"to == SQL_ASCII?"}
  B -->|yes| R1
  B -->|no| C{"from == SQL_ASCII?"}
  C -->|yes| V["pg_verify_mbstr(to)<br/>then return src"]
  C -->|no| D["FindDefaultConversionProc(from,to)"]
  D --> E{"proc found?"}
  E -->|no| ERR["ERROR: no default<br/>conversion function"]
  E -->|yes| F["alloc len*4+1 (Huge)<br/>OidFunctionCall6(proc, ...)"]
  F --> G["repalloc down if large<br/>return result"]
```
Client↔server: the cached fast path
Most conversion in a live session is not the general case but the client to
server and server to client direction, and for those PostgreSQL caches the
FmgrInfo so it can convert without a catalog lookup — important because
pg_server_to_client runs on every result row and may run outside a
transaction. SetClientEncoding installs the cached procs into two static
pointers; perform_default_encoding_conversion uses them:
```c
// pg_any_to_server, mbutils.c (condensed)
if (encoding == DatabaseEncoding->encoding || encoding == PG_SQL_ASCII)
{
    /* No conversion is needed, but we must still validate the data. */
    (void) pg_verify_mbstr(DatabaseEncoding->encoding, s, len, false);
    return unconstify(char *, s);
}

/* Fast path if we can use cached conversion function */
if (encoding == ClientEncoding->encoding)
    return perform_default_encoding_conversion(s, len, true);

/* General case ... will not work outside transactions */
return pg_do_encoding_conversion(...);
```

The asymmetry the file header documents is the heart of the discipline:
pg_any_to_server always validates even when no conversion is needed,
because the bytes came from outside; pg_server_to_any trusts the
server-side bytes when no conversion is needed. Ingress verifies; egress trusts.
There is one case pg_any_to_server cannot resolve cleanly: a SQL_ASCII
server receiving data from an ASCII-unsafe client encoding (one of the
seven client-only encodings). No conversion proc exists into SQL_ASCII, yet the
bytes might contain a multibyte sequence whose second byte is an ASCII
metacharacter the parser would mis-read. PostgreSQL refuses to guess — it
rejects any non-ASCII byte outright:
```c
// pg_any_to_server, mbutils.c (SQL_ASCII server, ASCII-unsafe client)
if (PG_VALID_BE_ENCODING(encoding))
    (void) pg_verify_mbstr(encoding, s, len, false);    /* ASCII-safe: verify */
else
{
    for (i = 0; i < len; i++)
        if (s[i] == '\0' || IS_HIGHBIT_SET(s[i]))       /* ASCII-unsafe: reject */
            ereport(ERROR,
                    (errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
                     errmsg("invalid byte value for encoding \"%s\": 0x%02x",
                            pg_enc2name_tbl[PG_SQL_ASCII].name,
                            (unsigned char) s[i])));
}
```

The comment in the source spells out the reasoning: “we dare not pass such data to the parser but we have no way to convert it. We compromise by rejecting the data if it contains any non-ASCII characters.” This is the seam where the “SQL_ASCII is a permissive byte bucket” semantics meet the “never let an ASCII-unsafe byte reach a byte-oriented parser” safety property, and the resolution is to narrow the permissiveness to pure ASCII.
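The rejection rule is simple enough to restate as a standalone function. The name first_non_ascii is hypothetical; instead of raising an error as the server does, this sketch returns the position of the offending byte so a caller can decide how to report it.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the index of the first byte that is NUL or has its high bit
 * set, or len if every byte is plain non-NUL ASCII.
 */
size_t first_non_ascii(const char *s, size_t len)
{
    assert(s != NULL);
    for (size_t i = 0; i < len; i++)
    {
        unsigned char b = (unsigned char) s[i];

        if (b == 0x00 || (b & 0x80) != 0)
            return i;
    }
    return len;
}
```

A pure-ASCII query passes untouched, while "caf\xC3\xA9" is stopped at index 3, the first byte of the multibyte character.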
The per-encoding vtable: pg_wchar_table
Every multibyte primitive is dispatched through pg_wchar_table, indexed by
encoding ID. Each row is a pg_wchar_tbl of seven members: mb2wchar_with_len,
wchar2mb_with_len, mblen, dsplen, mbverifychar, mbverifystr, and
maxmblen:
```c
// pg_wchar_table, src/common/wchar.c (excerpt)
const pg_wchar_tbl pg_wchar_table[] = {
    [PG_SQL_ASCII] = {pg_ascii2wchar_with_len, pg_wchar2single_with_len,
                      pg_ascii_mblen, pg_ascii_dsplen,
                      pg_ascii_verifychar, pg_ascii_verifystr, 1},
    [PG_UTF8] = {pg_utf2wchar_with_len, pg_wchar2utf_with_len,
                 pg_utf_mblen, pg_utf_dsplen,
                 pg_utf8_verifychar, pg_utf8_verifystr, 4},
    [PG_GB18030] = {0, 0,
                    pg_gb18030_mblen, pg_gb18030_dsplen,
                    pg_gb18030_verifychar, pg_gb18030_verifystr, 4},
    /* ... 40 entries total ... */
};
```

The backend wrappers in mbutils.c (pg_mblen, pg_mbstrlen, pg_dsplen,
pg_verify_mbstr) read DatabaseEncoding->encoding, then index this table; the
pg_encoding_* variants in wchar.c take the encoding as an argument so
frontend tools can call them. Note the LATIN/WIN/ISO single-byte encodings all
share the same function pointers (pg_latin1_*) with maxmblen 1 — for a
single-byte encoding “decode a character” is trivial, and the only validity
question is “is this byte a NUL?”.
UTF-8 in detail
Because UTF-8 is the pivot, its primitives are the most exercised. Decoding a character’s length is a pure function of the leading byte:

```c
// pg_utf_mblen, src/common/wchar.c
if ((*s & 0x80) == 0)
    len = 1;
else if ((*s & 0xe0) == 0xc0)
    len = 2;
else if ((*s & 0xf0) == 0xe0)
    len = 3;
else if ((*s & 0xf8) == 0xf0)
    len = 4;
else
    len = 1;    /* bogus lead: treat as 1 */
```

Decoding the value is the inverse bit-shuffle (utf8_to_unicode in
pg_wchar.h), and encoding a code point back to bytes is unicode_to_utf8.
Validation, however, is more than length: pg_utf8_islegal enforces the RFC
3629 restrictions that forbid overlong forms and UTF-16 surrogate halves, by
constraining the range of the second byte based on the leading byte:
```c
// pg_utf8_islegal, src/common/wchar.c (condensed)
    case 2:
        a = source[1];
        switch (*source)
        {
            case 0xE0: if (a < 0xA0 || a > 0xBF) return false; break;   /* no overlong-3 */
            case 0xED: if (a < 0x80 || a > 0x9F) return false; break;   /* no surrogates */
            case 0xF0: if (a < 0x90 || a > 0xBF) return false; break;   /* no overlong-4 */
            case 0xF4: if (a < 0x80 || a > 0x8F) return false; break;   /* <= U+10FFFF */
            default:   if (a < 0x80 || a > 0xBF) return false; break;
        }
        /* FALL THRU */
    case 1:
        a = *source;
        if (a >= 0x80 && a < 0xC2) return false;    /* 0x80..0xC1: cont/overlong lead */
        if (a > 0xF4) return false;                 /* > U+10FFFF lead */
        break;
```

The FALL THRU chain checks the trailing bytes (0x80..0xBF) for lengths 4,
3, 2 in turn, then the lead byte; lengths 5 and 6 are rejected outright. The
range gates after 0xE0/0xED/0xF0/0xF4 are exactly the RFC 3629 cases
that distinguish a legal minimal encoding from an overlong one or a surrogate
half.
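The same rules can also be stated on the decoded value instead of on byte ranges, which makes it easier to see what each range gate is protecting. The function below is an alternative formulation written for illustration, not PostgreSQL's implementation; utf8_legal_by_value is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Decode-then-check formulation of the RFC 3629 legality rules. */
bool utf8_legal_by_value(const unsigned char *s, size_t length)
{
    /* smallest code point that genuinely needs 1, 2, 3, 4 bytes */
    static const unsigned int min_cp[5] = {0, 0x0, 0x80, 0x800, 0x10000};
    unsigned int cp;

    assert(s != NULL);
    if (length < 1 || length > 4)
        return false;

    /* the lead byte must announce exactly `length` bytes */
    if (length == 1)
    {
        if (s[0] & 0x80) return false;
        cp = s[0];
    }
    else if (length == 2)
    {
        if ((s[0] & 0xE0) != 0xC0) return false;
        cp = s[0] & 0x1F;
    }
    else if (length == 3)
    {
        if ((s[0] & 0xF0) != 0xE0) return false;
        cp = s[0] & 0x0F;
    }
    else
    {
        if ((s[0] & 0xF8) != 0xF0) return false;
        cp = s[0] & 0x07;
    }

    for (size_t i = 1; i < length; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return false;                   /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (cp < min_cp[length]) return false;              /* overlong form */
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;     /* surrogate half */
    if (cp > 0x10FFFF) return false;                    /* beyond Unicode */
    return true;
}
```

Each value test matches one byte-range gate: 0xE0 followed by a byte below 0xA0 decodes below U+0800, 0xED followed by a byte above 0x9F lands in U+D800..U+DFFF, and 0xF4 followed by a byte above 0x8F exceeds U+10FFFF. The byte-range form reaches the same verdicts without decoding, which is why it suits a hot path.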
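For bulk validation of a whole string, pg_utf8_verifystr uses a shift-based
DFA (Utf8Transition, driven by utf8_advance) that consumes the input in chunks
two SIMD-vector widths long, skipping the DFA for chunks that are entirely
ASCII. The byte-wise pg_utf8_verifychar loop handles only what is left over:
the tail shorter than one chunk, plus any bytes re-examined because a chunk
ended mid-character. This is the hot path that runs on every UTF-8 string
entering the server; because each transition is a shift of a per-byte
constant, the DFA avoids the state-dependent table loads of a traditional
table-driven automaton.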
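The chunked ASCII check is the easiest part of this design to illustrate. The sketch below uses plain 64-bit words instead of SIMD vectors and does not implement the DFA at all; chunk_is_plain_ascii and ascii_prefix_len are hypothetical names. It treats NUL as unacceptable, in line with the rule elsewhere in the text that even single-byte verification checks for NUL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True if all 8 bytes at p are ASCII (high bit clear) and none is NUL. */
static bool chunk_is_plain_ascii(const unsigned char *p)
{
    uint64_t w;

    memcpy(&w, p, sizeof w);                    /* alignment-safe load */
    if (w & UINT64_C(0x8080808080808080))
        return false;                           /* some byte >= 0x80 */
    /* classic zero-byte test: nonzero result iff some byte is zero */
    return ((w - UINT64_C(0x0101010101010101)) & ~w &
            UINT64_C(0x8080808080808080)) == 0;
}

/*
 * Length of the leading run accepted by the chunked fast path. A caller
 * would hand the remainder to a byte-wise validator.
 */
size_t ascii_prefix_len(const unsigned char *s, size_t len)
{
    size_t done = 0;

    assert(s != NULL);
    while (len - done >= 8 && chunk_is_plain_ascii(s + done))
        done += 8;
    return done;
}
```

A 16-byte ASCII query is accepted in two word-sized steps with no per-byte branching. A string whose second chunk contains é is accepted only up to byte 8, and the slower path takes over from there.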
Identifiers and string literals
There is no separate “identifier encoding.” Table names, column names, and the
name type are stored and compared in the server encoding, just like any
other text. The point at which an identifier or a string literal acquires the
server encoding is ingress: the entire query string is run through
pg_client_to_server before the lexer ever sees it, so by the time the parser
tokenizes "naïve_column" or 'café', those bytes are already validated
server-encoded text. pg_unicode_to_server handles the special case of a
Unicode escape (U&'\00e9' or \u00e9 in JSON): it takes a single code point,
renders it as UTF-8, and then runs the cached UTF8→server conversion (or, if the
server is already UTF-8, just reformats) — and it is carefully written to work
outside a transaction, because the lexer can run before a transaction starts.
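The "renders it as UTF-8" step is a fixed bit-packing. The sketch below shows it under a hypothetical name, encode_utf8; unlike a bare encoder, it also refuses surrogates and values above U+10FFFF, which is an added assumption about what a caller would want.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encode one code point as UTF-8 into buf (at least 4 bytes). Returns the
 * byte count, or 0 for surrogate halves and values above U+10FFFF.
 */
size_t encode_utf8(unsigned int cp, unsigned char *buf)
{
    assert(buf != NULL);
    if (cp <= 0x7F)
    {
        buf[0] = (unsigned char) cp;
        return 1;
    }
    if (cp <= 0x7FF)
    {
        buf[0] = (unsigned char) (0xC0 | (cp >> 6));
        buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogate halves are not characters */
    if (cp <= 0xFFFF)
    {
        buf[0] = (unsigned char) (0xE0 | (cp >> 12));
        buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        buf[0] = (unsigned char) (0xF0 | (cp >> 18));
        buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                   /* beyond the Unicode range */
}
```

For the escape in the text, U+00E9 becomes the two bytes 0xC3 0xA9. If the server encoding is not UTF-8, those bytes would then go through the UTF8 to server conversion described above.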
Source Walkthrough
The encoding machinery splits across two trees by dependency: the
backend-only, catalog-aware glue in src/backend/utils/mb/, and the pure,
self-contained primitives in src/common/ that both the server and frontend
tools (psql, pg_dump, libpq) link. Below, symbols are grouped by
call-flow.
Encoding state and selection (mbutils.c)
- ConvProcInfo: the cached record pairing a (server, client) encoding pair with its two FmgrInfo handles (to_server_info, to_client_info). Kept in a ConvProcList in TopMemoryContext so a setting can be restored during transaction rollback without catalog access.
- PrepareClientEncoding: validates a requested client encoding and, if in a live transaction, looks up the two conversion procs via FindDefaultConversionProc and caches them. It reports failure before the change is committed, so SET client_encoding can fail gracefully. During backend startup it short-circuits (catalogs not yet available).
- SetClientEncoding: installs the active ClientEncoding pointer and the ToServerConvProc / ToClientConvProc static FmgrInfo pointers from the cache prepared above; the no-conversion cases (client == server, or either side is SQL_ASCII) clear the procs to NULL.
- InitializeClientEncoding: called once from InitPostgres; flips backend_startup_complete, applies the pending_client_encoding, and additionally looks up the UTF8→server proc into Utf8ToServerConvProc (used by pg_unicode_to_server).
- SetDatabaseEncoding / GetDatabaseEncoding / GetDatabaseEncodingName: set-once / read accessors for the fixed server encoding. GetMessageEncoding / SetMessageEncoding track the separate encoding gettext emits messages in.
- pg_get_client_encoding / pg_client_encoding (SQL) / getdatabaseencoding (SQL): expose the current encodings to callers and to SQL.
The conversion entry points (mbutils.c)
- pg_do_encoding_conversion: the general-case converter. It has fast paths for the empty / identical / SQL_ASCII cases, then FindDefaultConversionProc + MemoryContextAllocHuge(len * MAX_CONVERSION_GROWTH + 1) + OidFunctionCall6. Requires a transaction for the catalog lookup.
- pg_do_encoding_conversion_buf: the buffer-output variant. The caller has already found the proc and supplies the destination buffer; the function clamps input length so the worst-case output fits.
- pg_any_to_server / pg_client_to_server: ingress; always validates even with no conversion. Has the SQL_ASCII-server special case that rejects non-ASCII bytes from an ASCII-unsafe client encoding.
- pg_server_to_any / pg_server_to_client: egress; trusts the source when no conversion is needed.
- perform_default_encoding_conversion: the cached-FmgrInfo worker shared by both directions; the only converter safe to call outside a transaction (no catalog access).
- pg_unicode_to_server / pg_unicode_to_server_noerror: convert one Unicode code point to a server-encoded string, via Utf8ToServerConvProc; written to be transaction-independent for the lexer.
- SQL wrappers: pg_convert, pg_convert_to, pg_convert_from (the convert() / convert_to() / convert_from() functions), and length_in_encoding (length(bytea, name)).
The catalog seam (namespace.c)
- FindDefaultConversionProc: walks activeSearchPath (skipping the temp namespace) calling FindDefaultConversion until it finds a default proc for the (for_encoding, to_encoding) pair; returns InvalidOid if none. This is the single bridge from utils/mb into the pg_conversion catalog.
Multibyte string primitives (mbutils.c)
- pg_mblen / pg_mblen_cstr / pg_mblen_range / pg_mblen_with_len / pg_mblen_unbounded: byte length of one character in the database encoding, with varying bounds-checking strictness.
- pg_mbstrlen / pg_mbstrlen_with_len: character count of a string, with a single-byte-encoding fast path (strlen).
- pg_mbcliplen / pg_encoding_mbcliplen / pg_mbcharcliplen: clip a string to a byte (or character) limit without splitting a multibyte character; the workhorse of varchar(n) truncation.
- pg_verify_mbstr / pg_verifymbstr / pg_verify_mbstr_len: validate a string in a given (or the database) encoding; *_len also counts characters and so cannot use the fast mbverifystr.
- pg_database_encoding_max_length / pg_database_encoding_character_incrementer: the maxmblen accessor and the make_greater_string character incrementer (with pg_utf8_increment / pg_eucjp_increment / pg_generic_charinc).
- report_invalid_encoding / report_untranslatable_char / check_encoding_conversion_args: the two error reporters (distinct SQLSTATEs: CHARACTER_NOT_IN_REPERTOIRE vs. UNTRANSLATABLE_CHARACTER) and the argument validator every conversion proc calls.
Pure encoding primitives (src/common/wchar.c)
- pg_wchar_table: the 40-entry vtable of pg_wchar_tbl rows.
- pg_utf_mblen / utf8_to_unicode / unicode_to_utf8 / unicode_utf8len: UTF-8 length, decode, encode, and encoded-length.
- pg_utf8_islegal: RFC 3629 single-character legality (overlong / surrogate rejection).
- pg_utf8_verifychar / pg_utf8_verifystr: single-char and whole-string UTF-8 validators; the latter is the shift-based DFA (Utf8Transition, utf8_advance).
- pg_encoding_mblen / pg_encoding_mblen_or_incomplete / pg_encoding_mblen_bounded: table-dispatched char length by encoding ID (the _or_incomplete form is the GB18030-safe one that may read two bytes).
- pg_encoding_verifymbchar / pg_encoding_verifymbstr / pg_encoding_max_length / pg_encoding_dsplen: the argument-takes-encoding variants used by frontend tools.
Encoding names and conversion-proc helpers
- `pg_char_to_encoding` / `pg_encoding_to_char` / `pg_valid_server_encoding` (src/common/encnames.c): name↔ID, with `pg_enc2name_tbl` / `pg_encname_tbl`.
- `local2local` / `latin2mic` / `mic2latin` / `latin2mic_with_table` / `mic2latin_with_table` (conv.c): the single-byte and MULE_INTERNAL helper converters the conversion procs build on.
- `UtfToLocal` / `LocalToUtf` (conv.c): the UTF-8 ↔ local-encoding radix-tree converters that nearly every `pg_conversion` proc delegates to; `pg_mb_radix_conv` / `store_coded_char` are their inner primitives, and `compare3` / `compare4` drive the `bsearch` over combined-character maps.
Position hints (as of 2026-06-06, REL_18 273fe94)
Symbols are the stable anchor; these line numbers are hints scoped to this revision.
| Symbol | File | Line |
|---|---|---|
| ConvProcInfo | src/backend/utils/mb/mbutils.c | 55 |
| PrepareClientEncoding | src/backend/utils/mb/mbutils.c | 119 |
| SetClientEncoding | src/backend/utils/mb/mbutils.c | 217 |
| InitializeClientEncoding | src/backend/utils/mb/mbutils.c | 290 |
| pg_do_encoding_conversion | src/backend/utils/mb/mbutils.c | 365 |
| pg_do_encoding_conversion_buf | src/backend/utils/mb/mbutils.c | 478 |
| pg_convert | src/backend/utils/mb/mbutils.c | 562 |
| length_in_encoding | src/backend/utils/mb/mbutils.c | 624 |
| pg_client_to_server | src/backend/utils/mb/mbutils.c | 669 |
| pg_any_to_server | src/backend/utils/mb/mbutils.c | 685 |
| pg_server_to_client | src/backend/utils/mb/mbutils.c | 747 |
| pg_server_to_any | src/backend/utils/mb/mbutils.c | 758 |
| perform_default_encoding_conversion | src/backend/utils/mb/mbutils.c | 792 |
| pg_unicode_to_server | src/backend/utils/mb/mbutils.c | 873 |
| pg_mblen_cstr | src/backend/utils/mb/mbutils.c | 1043 |
| pg_mbstrlen | src/backend/utils/mb/mbutils.c | 1163 |
| pg_mbcliplen | src/backend/utils/mb/mbutils.c | 1209 |
| SetDatabaseEncoding | src/backend/utils/mb/mbutils.c | 1287 |
| GetDatabaseEncoding | src/backend/utils/mb/mbutils.c | 1387 |
| pg_database_encoding_character_incrementer | src/backend/utils/mb/mbutils.c | 1648 |
| pg_verify_mbstr | src/backend/utils/mb/mbutils.c | 1692 |
| pg_verify_mbstr_len | src/backend/utils/mb/mbutils.c | 1723 |
| report_invalid_encoding | src/backend/utils/mb/mbutils.c | 1824 |
| report_untranslatable_char | src/backend/utils/mb/mbutils.c | 1869 |
| local2local | src/backend/utils/mb/conv.c | 33 |
| mic2latin | src/backend/utils/mb/conv.c | 127 |
| pg_mb_radix_conv | src/backend/utils/mb/conv.c | 373 |
| UtfToLocal | src/backend/utils/mb/conv.c | 507 |
| LocalToUtf | src/backend/utils/mb/conv.c | 717 |
| pg_utf2wchar_with_len | src/common/wchar.c | 462 |
| pg_utf_mblen | src/common/wchar.c | 556 |
| pg_utf8_verifychar | src/common/wchar.c | 1723 |
| pg_utf8_verifystr | src/common/wchar.c | 1913 |
| pg_utf8_islegal | src/common/wchar.c | 2011 |
| pg_wchar_table | src/common/wchar.c | 2086 |
| pg_encoding_mblen | src/common/wchar.c | 2157 |
| pg_encoding_mblen_or_incomplete | src/common/wchar.c | 2169 |
| pg_encoding_verifymbstr | src/common/wchar.c | 2224 |
| pg_encoding_max_length | src/common/wchar.c | 2235 |
| pg_char_to_encoding | src/common/encnames.c | 552 |
| pg_encoding_to_char | src/common/encnames.c | 590 |
| utf8_to_unicode | src/include/mb/pg_wchar.h | 565 |
| unicode_to_utf8 | src/include/mb/pg_wchar.h | 591 |
| FindDefaultConversionProc | src/backend/catalog/namespace.c | 4083 |
Source verification (as of 2026-06-06)
Facts about the source at commit `273fe94`, readable without external materials. Open questions follow.
Verified facts
- The server encoding is fixed per backend; only the client encoding is mutable. Verified: `DatabaseEncoding` is written only by `SetDatabaseEncoding` (mbutils.c), while `ClientEncoding` is reassigned by `SetClientEncoding`. `pg_unicode_to_server` even comments that “the server encoding is fixed within any one backend process,” which is why `Utf8ToServerConvProc` is looked up exactly once in `InitializeClientEncoding`.
- Ingress validates unconditionally; egress trusts. Verified by direct comparison: `pg_any_to_server` calls `pg_verify_mbstr(...)` even on the no-conversion path (`encoding == DatabaseEncoding->encoding`), whereas `pg_server_to_any` returns the source unconstified on the same path with the comment “assume data is valid.” The file header states this asymmetry explicitly.
- `SQL_ASCII` disables all conversion in both directions. Verified in `pg_do_encoding_conversion`: `dest_encoding == PG_SQL_ASCII` returns `src` unchanged (“any string is valid in SQL_ASCII”), and `src_encoding == PG_SQL_ASCII` validates the bytes as the destination encoding but performs no byte conversion.
- Worst-case conversion growth is a factor of 4. Verified: `MAX_CONVERSION_GROWTH` is `4` in `pg_wchar.h`, and both `pg_do_encoding_conversion` and `perform_default_encoding_conversion` allocate `len * MAX_CONVERSION_GROWTH + 1` with `MemoryContextAllocHuge`, guarding against `len >= MaxAllocHugeSize / MAX_CONVERSION_GROWTH`.
- The conversion-proc ABI is a fixed six-argument signature. Verified: every call site (`OidFunctionCall6` in `pg_do_encoding_conversion`, `FunctionCall6` in `perform_default_encoding_conversion` and `pg_unicode_to_server`) passes `(src_encoding, dest_encoding, src, dest, len, noError)`, and `check_encoding_conversion_args` validates exactly those.
- UTF-8 validity is more than length: overlong forms and surrogates are rejected. Verified in `pg_utf8_islegal` (wchar.c), which constrains the second byte by leading byte: `0xE0` → `0xA0..0xBF`, `0xED` → `0x80..0x9F`, `0xF0` → `0x90..0xBF`, `0xF4` → `0x80..0x8F`, and rejects leads `0x80..0xC1` and `> 0xF4` outright. The comment cites RFC 3629 and the overlong-encoding security hazard explicitly.
- The bulk UTF-8 validator is a shift-based DFA, not a per-byte branch tree. Verified: `pg_utf8_verifystr` drives `utf8_advance` over the `Utf8Transition[256]` table, processing `STRIDE_LENGTH = 2 * sizeof(Vector8)` bytes per iteration, skipping the DFA for strides that are entirely ASCII, and falling back to `pg_utf8_verifychar` only for the tail or to rescan after an error. The unusual state-number constants exist so the transitions pack into 32-bit integers.
- Seven encodings are client-only. Verified from the `pg_enc` enum and `PG_ENCODING_BE_LAST = PG_KOI8U`: `PG_SJIS`, `PG_BIG5`, `PG_GBK`, `PG_UHC`, `PG_GB18030`, `PG_JOHAB`, and `PG_SHIFT_JIS_2004` all sort after `PG_KOI8U`, so `PG_VALID_BE_ENCODING` rejects them as server encodings while `PG_VALID_FE_ENCODING` accepts them as client encodings.
- The conversion catalog lookup is search-path-sensitive. Verified: `FindDefaultConversionProc` (namespace.c) iterates `activeSearchPath`, explicitly skipping `myTempNamespace`, so a `CREATE DEFAULT CONVERSION` in an earlier schema shadows a later one.
- GB18030 is the reason `pg_encoding_mblen` has an `_or_incomplete` variant. Verified: the comment on `pg_encoding_mblen` warns that `encoding==GB18030` may need to read `mbstr[1]` to determine length, and `pg_encoding_mblen_or_incomplete` returns `INT_MAX` when `remaining < 2` for a high-bit GB18030 lead byte.
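The worst-case sizing rule listed above (`len * MAX_CONVERSION_GROWTH + 1`, guarded so the multiplication cannot exceed the allocation limit) can be sketched outside the backend. This is an illustration under stated assumptions, not the backend code: `sketch_alloc_conversion_buf` is a hypothetical name, plain `malloc` stands in for `MemoryContextAllocHuge`, the `max_alloc` parameter stands in for `MaxAllocHugeSize`, and returning `NULL` stands in for the error report.

```c
#include <stddef.h>
#include <stdlib.h>

/* Mirrors the growth factor of 4 cited above. */
#define SKETCH_MAX_CONVERSION_GROWTH 4

/* Hypothetical helper: allocate a destination buffer large enough for the
 * worst-case conversion of src_len bytes, plus a terminating NUL.
 * Returns NULL if the request would exceed max_alloc (or malloc fails). */
char *
sketch_alloc_conversion_buf(size_t src_len, size_t max_alloc, size_t *out_size)
{
    size_t size;

    /* Check before multiplying, so the product can never overflow. */
    if (src_len >= max_alloc / SKETCH_MAX_CONVERSION_GROWTH)
        return NULL;

    size = src_len * SKETCH_MAX_CONVERSION_GROWTH + 1;
    *out_size = size;
    return malloc(size);
}
```

Dividing the limit rather than multiplying the length is the design choice worth noting: the comparison is safe for any `src_len`, whereas `src_len * 4 > max_alloc` could wrap around before the check runs.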
Open questions
- Two-hop conversion for non-UTF8 server encodings. When both client and server are non-UTF8 (e.g. EUC_JP client, EUC_KR server), `pg_do_encoding_conversion` finds a single `(EUC_JP, EUC_KR)` default proc. Whether that proc internally pivots through UTF-8 or carries a direct table is a property of the specific `conversion_procs/` entry, not visible from `mbutils.c` / `conv.c` alone. Investigation path: read `src/backend/utils/mb/conversion_procs/euc_jp_and_*`.
- Interaction of `client_encoding` rollback with cached `ConvProcInfo`. `PrepareClientEncoding` leaves stale duplicate entries in `ConvProcList` for `SetClientEncoding` to garbage-collect. The exact lifetime guarantee that a still-referenced `FmgrInfo` is never freed while a conversion is mid-flight is asserted by the comments but not traced end-to-end here.
- Display width and East-Asian ambiguity. `pg_utf_dsplen` → `ucs_wcwidth` uses fixed nonspacing / east-asian-fullwidth tables; how `psql`’s column alignment handles “ambiguous width” characters (which some terminals render single-width and some double-width) is not resolved in this layer. Investigation path: `ucs_wcwidth` and the generated `unicode_east_asian_fw_table.h`.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Pointers, not analysis. Each bullet is a starting handle for a follow-up document.
- UTF-8 as the only server encoding (the modern monoculture). SQL Server (since 2019, with `UTF8` collations), Oracle (`AL32UTF8`), and the entire cloud-native generation (CockroachDB, Spanner, most NewSQL engines) either default to or mandate UTF-8 as the storage encoding, treating legacy charsets purely as a client-side I/O concern. PostgreSQL’s retention of ~40 server-capable encodings and a full catalog-driven conversion matrix is increasingly the historical outlier. The interesting comparison is cost/benefit: the conversion framework (`pg_conversion`, `UtfToLocal`) is real maintenance surface that a UTF-8-only engine simply does not carry, in exchange for native storage of legacy East-Asian charsets without a transcoding tax on every byte.
- SIMD UTF-8 validation: “Validating UTF-8 In Less Than One Instruction Per Byte” (Keiser & Lemire, 2021) and the `simdutf` library. PostgreSQL’s `pg_utf8_verifystr` shift-based DFA (the `Utf8Transition` table processing `2 * sizeof(Vector8)` bytes per stride) is a deliberate, portable approximation of the fully vectorized validation that libraries like `simdutf` implement with architecture-specific SIMD instructions (AVX-512, NEON). A follow-up could measure the gap between PostgreSQL’s generic-SIMD fallback and a hand-tuned architecture-specific validator on the ingress hot path, and ask whether the portability cost is still justified.
- Collation vs. encoding separation, and ICU. This document deliberately defers ordering to `postgres-collation-providers.md`. The research frontier is the coupling point: ICU works in UTF-16 internally, so a UTF-8 server encoding can pay a transcoding cost inside ICU comparisons. Engines that store UTF-16 natively (older SQL Server `NVARCHAR`, Java-based stores) invert that tradeoff. A comparative note would trace where the decode happens in each design and what it costs per comparison.
- Encoding-aware indexing and the validate-on-ingress boundary. Because PostgreSQL trusts stored bytes (egress trusts, ingress verifies), an index over `text` never re-validates. Systems that allow per-column or per-value encodings (some document stores, polyglot engines) cannot make that assumption and must carry encoding metadata into the index. The PostgreSQL choice of one fixed server encoding per database is what makes the “trust stored bytes” optimization sound, and is worth contrasting with the per-value approach’s flexibility/cost.
- GB18030 and the limits of the leading-byte length rule. GB18030 is the reason `pg_encoding_mblen` needs an `_or_incomplete` variant (a four-byte GB18030 character cannot have its length determined from the first byte alone). The Unicode-superset mandate that GB18030-2022 imposes on Chinese-market software is a live standards question; a follow-up could compare how PostgreSQL’s client-only GB18030 support handles the 2022 revision’s new mappings against engines that store GB18030 natively.
- Streaming / incremental transcoding. PostgreSQL converts whole strings (`pg_do_encoding_conversion` allocates `len * 4 + 1` up front). Streaming parsers (e.g. `COPY` of a huge field, or a future chunked protocol) would benefit from an incremental converter that carries DFA state across chunk boundaries. UTF-8's self-synchronization makes that kind of resumption possible, but the current whole-buffer API does not expose it. See `postgres-copy.md` for where bulk ingest currently sits.
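To make the streaming bullet above concrete, here is a hedged sketch of what carrying state across chunk boundaries could look like. It is a validator rather than a converter, it uses a simple "continuation bytes still owed" counter rather than PostgreSQL's shift-based DFA, and every name in it (`SketchUtf8Stream`, `sketch_stream_init`, `sketch_stream_feed`, `sketch_stream_finish`) is hypothetical; no such API exists in the source discussed here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical streaming state: how many continuation bytes are still
 * owed, and the allowed range for the next one. */
typedef struct
{
    int           remaining;
    unsigned char lo;
    unsigned char hi;
} SketchUtf8Stream;

void
sketch_stream_init(SketchUtf8Stream *st)
{
    st->remaining = 0;
    st->lo = 0x80;
    st->hi = 0xBF;
}

/* Feed one chunk; returns false at the first ill-formed byte. A character
 * may start in one chunk and finish in the next. */
bool
sketch_stream_feed(SketchUtf8Stream *st, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        unsigned char b = buf[i];

        if (st->remaining > 0)
        {
            if (b < st->lo || b > st->hi)
                return false;
            st->remaining--;
            st->lo = 0x80;      /* later continuation bytes are unrestricted */
            st->hi = 0xBF;
            continue;
        }
        if (b < 0x80)
            continue;           /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF)
            st->remaining = 1;
        else if (b >= 0xE0 && b <= 0xEF)
        {
            st->remaining = 2;
            if (b == 0xE0) st->lo = 0xA0;   /* no overlong 3-byte forms */
            if (b == 0xED) st->hi = 0x9F;   /* no surrogates */
        }
        else if (b >= 0xF0 && b <= 0xF4)
        {
            st->remaining = 3;
            if (b == 0xF0) st->lo = 0x90;   /* no overlong 4-byte forms */
            if (b == 0xF4) st->hi = 0x8F;   /* nothing above U+10FFFF */
        }
        else
            return false;       /* stray continuation or invalid lead byte */
    }
    return true;
}

/* At end of input, no character may be left half-finished. */
bool
sketch_stream_finish(const SketchUtf8Stream *st)
{
    return st->remaining == 0;
}
```

A caller would initialize the state once, feed each chunk as it arrives, and call the finish check after the last chunk; a three-byte character split two bytes and one byte across chunks validates the same as when it arrives whole.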
Sources
PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)
- `src/backend/utils/mb/mbutils.c`: encoding state (`ClientEncoding`, `DatabaseEncoding`, `MessageEncoding`, `ConvProcInfo`), selection (`PrepareClientEncoding`, `SetClientEncoding`, `InitializeClientEncoding`, `SetDatabaseEncoding`), conversion entry points (`pg_do_encoding_conversion`, `pg_do_encoding_conversion_buf`, `pg_any_to_server` / `pg_client_to_server`, `pg_server_to_any` / `pg_server_to_client`, `perform_default_encoding_conversion`, `pg_unicode_to_server`), multibyte primitives (`pg_mblen`, `pg_mbstrlen`, `pg_mbcliplen`, `pg_verify_mbstr`), and the error reporters (`report_invalid_encoding`, `report_untranslatable_char`, `check_encoding_conversion_args`).
- `src/backend/utils/mb/conv.c`: the pivot converters `UtfToLocal` / `LocalToUtf`, the radix-tree primitives `pg_mb_radix_conv` / `store_coded_char`, the `bsearch` comparators `compare3` / `compare4`, and the single-byte / MULE_INTERNAL helpers `local2local`, `latin2mic`, `mic2latin`.
- `src/common/wchar.c`: the `pg_wchar_table` vtable, UTF-8 primitives (`pg_utf_mblen`, `pg_utf2wchar_with_len`, `pg_utf8_islegal`, `pg_utf8_verifychar`, `pg_utf8_verifystr` with the `Utf8Transition` DFA), and the encoding-as-argument variants (`pg_encoding_mblen`, `pg_encoding_mblen_or_incomplete`, `pg_encoding_verifymbstr`, `pg_encoding_max_length`).
- `src/common/encnames.c`: name↔ID resolution (`pg_char_to_encoding`, `pg_encoding_to_char`, `pg_valid_server_encoding`) and the shared tables `pg_enc2name_tbl` / `pg_encname_tbl`.
- `src/include/mb/pg_wchar.h`: the `pg_enc` ID space, `PG_VALID_BE_ENCODING` / `PG_VALID_FE_ENCODING`, `MAX_CONVERSION_GROWTH`, and the inline `utf8_to_unicode` / `unicode_to_utf8` code-point codecs.
- `src/backend/catalog/namespace.c`: `FindDefaultConversionProc`, the single bridge from `utils/mb` into the `pg_conversion` system catalog.
Textbook chapters (under knowledge/research/dbms-general/)
- Database System Concepts (Silberschatz et al.): SQL character types and the locale/encoding sensitivity of string comparison and ordering.
- Database Internals (Petrov): the storage engine as an opaque-byte layer over which meaning (here, encoding) is imposed.
Standards and external references
- RFC 3629: UTF-8 byte-range restrictions (overlong-form and surrogate rejection) implemented by `pg_utf8_islegal`.
- Keiser & Lemire, “Validating UTF-8 In Less Than One Instruction Per Byte” (2021): the SIMD-validation frontier that `pg_utf8_verifystr` approximates.
Cross-references (sibling module docs)
- `postgres-collation-providers.md`: string ordering (the orthogonal collation question); owns ICU/libc provider mechanics deferred here.
- `postgres-datatypes-adt.md`: the `text` / `varchar` / `name` types whose bytes are interpreted by this layer; owns `length()` / `substr()` semantics.
- `postgres-wire-protocol.md`: the startup `client_encoding` parameter and the `Query` / `DataRow` byte streams that hit `pg_client_to_server` / `pg_server_to_client`.
- `postgres-copy.md`: bulk ingress, the other major caller of `pg_any_to_server`.
- `postgres-parser.md`: the lexer that runs `pg_client_to_server` on the whole query string before tokenizing, and `pg_unicode_to_server` for Unicode escapes.