PostgreSQL Character Set Encoding — Server/Client Encodings and Conversion

A character set encoding is a mapping between abstract characters and the byte sequences that represent them on the wire and on disk. Every text-like value a database stores — a text column, a table name, an SQL string literal — is ultimately a run of bytes, and the encoding is the contract that says how to cut that run into characters and what each character is. Three properties define the design space:

  1. Fixed-width vs. variable-width. A single-byte encoding (ASCII, ISO-8859-1) maps each character to exactly one byte: character count equals byte count, and string[i] is the i-th character. A multibyte encoding (UTF-8, EUC-JP, GB18030) uses one to four bytes per character, so finding the i-th character requires walking the string from the start, decoding each character’s length as you go. This single fact — that you cannot index a multibyte string in O(1) — drives almost every interesting piece of engineering in a multibyte-aware engine. It is why length() on a text value is O(n) not O(1), why substr() cannot simply offset a pointer, and why varchar(n) truncation needs pg_mbcliplen to find a character boundary at or before the byte limit rather than blindly cutting at byte n. The single-byte fast path (count = strlen, index = pointer arithmetic) is preserved precisely so that databases declared in a single-byte encoding pay none of this cost — the engine branches on maxmblen == 1 to keep the common Latin-1/ASCII case cheap.

  2. Self-synchronizing vs. stateful. UTF-8 is self-synchronizing: from any byte you can tell whether it is a leading byte (high bits 0, 110, 1110, 11110) or a continuation byte (10), so a corrupt or truncated stream can resynchronize at the next character boundary. Stateful encodings (ISO-2022 with its escape-sequence shift states) cannot; PostgreSQL deliberately refuses such encodings as a server encoding. Self-synchronization is what makes a fast validator possible.

  3. Repertoire and round-tripping. Two encodings may cover different sets of characters. Converting EUC-JP → LATIN1 must fail for any Japanese character, because LATIN1 has no code point for it — an untranslatable character. Converting LATIN1 → UTF8 always succeeds because UTF-8’s repertoire is a superset. A correct conversion framework must distinguish “the input was malformed” (invalid encoding) from “the input was well-formed but the target cannot represent it” (untranslatable) — two different error classes with two different SQLSTATEs.
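The cost described in the first property can be sketched in a few lines of C. This is a minimal illustration, not PostgreSQL's code: the helper names `utf8_char_len`, `utf8_count_chars`, and `utf8_clip_bytes` are hypothetical, and the input is assumed to be UTF-8 that has already been validated.

```c
#include <stddef.h>

/* Hypothetical helper: byte length of a UTF-8 character, read from its
 * lead byte. A bogus lead byte is treated as a 1-byte character. */
static size_t utf8_char_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Character count is O(n): each character's length must be decoded
 * before the next character can be located. */
static size_t utf8_count_chars(const char *s, size_t nbytes)
{
    size_t count = 0;
    size_t i = 0;

    while (i < nbytes)
    {
        size_t l = utf8_char_len((unsigned char) s[i]);

        if (l > nbytes - i)
            l = nbytes - i;     /* truncated final character: count it once */
        i += l;
        count++;
    }
    return count;
}

/* Largest byte length <= limit that ends on a character boundary. */
static size_t utf8_clip_bytes(const char *s, size_t nbytes, size_t limit)
{
    size_t i = 0;

    while (i < nbytes)
    {
        size_t l = utf8_char_len((unsigned char) s[i]);

        if (l > nbytes - i || l > limit - i)
            break;              /* next character would not fit whole */
        i += l;
    }
    return i;
}
```

For the five-byte string "café" (the é takes two bytes), the count is 4 characters, and clipping to a 4-byte limit yields 3 bytes rather than cutting the é in half.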

Database System Concepts (Silberschatz et al.) treats character data as one of the SQL built-in types and notes that string comparison and ordering are locale- and encoding-sensitive — it is the encoding that fixes which bytes are one character before any collation question (how to order those characters) can even be asked. Database Internals (Petrov) frames the on-disk representation question generally: the storage engine stores opaque byte runs, and a layer above must impose meaning. Encoding is precisely that layer for text. The key architectural consequence the textbooks imply, and that PostgreSQL makes concrete, is separation of concerns: encoding answers “where does one character end?”; collation (the sibling postgres-collation-providers.md) answers “how do two characters order?”. The two are orthogonal but coupled — a collation needs to decode characters, so it leans on the encoding for character boundaries, but never the reverse.

Unicode deserves special mention because PostgreSQL builds its whole conversion matrix around it. UTF-8 (RFC 3629) encodes a Unicode code point (U+0000 to U+10FFFF) in one to four bytes, with the crucial property that the byte length is determined entirely by the leading byte and that overlong encodings (using more bytes than necessary) are illegal. The illegality of overlong forms is a security property, not a pedantic one: without it, an attacker could smuggle an ASCII character (say / or ') past an ASCII-level filter by encoding it as a two-byte sequence that decodes back to the ASCII byte. PostgreSQL’s UTF-8 validator (pg_utf8_islegal) implements exactly the RFC 3629 byte-range restrictions that forbid overlong forms and surrogate halves.
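To make the overlong-form risk concrete, here is a small sketch with two hypothetical functions: a naive two-byte decoder that only combines payload bits, and a strict check on the byte ranges RFC 3629 allows for two-byte sequences. Neither is PostgreSQL code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Naive decoder: merges the payload bits of a two-byte sequence
 * without asking whether two bytes were actually necessary. */
static uint32_t naive_decode2(unsigned char b0, unsigned char b1)
{
    return ((uint32_t) (b0 & 0x1F) << 6) | (uint32_t) (b1 & 0x3F);
}

/* Strict check: lead bytes 0xC0 and 0xC1 can only produce code points
 * below U+0080, which must use the one-byte form, so they are rejected. */
static bool strict_legal2(unsigned char b0, unsigned char b1)
{
    return b0 >= 0xC2 && b0 <= 0xDF && b1 >= 0x80 && b1 <= 0xBF;
}
```

The sequence 0xC0 0xAF contains no ASCII byte, yet `naive_decode2` turns it into 0x2F, the slash character. `strict_legal2` rejects it, while a legitimate sequence such as 0xC3 0xA9 (é, U+00E9) passes.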

A final subtlety the textbooks raise but that lives above the encoding layer is normalization. Unicode lets the same visible character be spelled more than one way — é can be a single precomposed code point (U+00E9) or a base letter plus a combining accent (U+0065 U+0301). These are distinct byte sequences in UTF-8 and therefore distinct encoded strings, even though they are canonically equivalent. The encoding layer deliberately does not normalize: its job is byte boundaries and validity, not semantic equivalence. Normalization (NFC/NFD) is a separate operation a caller invokes explicitly, which keeps the encoding layer a pure, fast, lossless byte discipline and leaves equivalence policy to the layer that actually compares or matches text. The off-by-one Hangul recomposition fix in this very revision (commit message “Fix off-by-one with NFC recomposition for Hangul U+11A7”) is a reminder that normalization is intricate and rightly kept out of the hot validation path.
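The two spellings of é can be compared at the byte level with a small encoder. The sketch below is illustrative only: `encode_utf8` and `encode_sequence` are hypothetical names, and the output buffer is assumed to be large enough for the whole sequence.

```c
#include <stddef.h>
#include <stdint.h>

/* Write code point cp as UTF-8 into out (at least 4 bytes free).
 * Returns the byte count, or 0 for surrogates and out-of-range values. */
static size_t encode_utf8(uint32_t cp, unsigned char *out)
{
    if (cp <= 0x7F)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp <= 0x7FF)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp <= 0xFFFF)
    {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Encode n code points back to back; returns the total byte count. */
static size_t encode_sequence(const uint32_t *cps, size_t n, unsigned char *out)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += encode_utf8(cps[i], out + total);
    return total;
}
```

Encoding U+00E9 gives the two bytes C3 A9; encoding U+0065 U+0301 gives the three bytes 65 CC 81. A byte comparison sees two different strings, which is exactly why equivalence has to be decided by a normalization step above the encoding layer.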

This section names the engineering patterns that multibyte-aware relational engines converge on, so that PostgreSQL’s specific choices read as selections within a shared space rather than as ad-hoc inventions.

A per-database fixed encoding plus a per-session view encoding

Almost every engine distinguishes the encoding in which data is stored from the encoding in which a particular client wants to see it. Storing data in a single fixed encoding per database keeps every byte comparison, every index, and every on-disk tuple unambiguous: the engine never has to ask “what encoding is this row in?”. Meanwhile a client connecting from a legacy LATIN1 application and a client connecting from a UTF-8 terminal can both talk to the same UTF-8 database, each declaring its own client encoding, and the engine transcodes on the session boundary. This is the server-encoding / client-encoding split, and it localizes all transcoding to two choke points: data coming in (client → server) and data going out (server → client).

Supporting N encodings with direct pairwise converters needs O(N²) conversion tables — unmanageable when N is forty. The standard trick is a pivot encoding: convert source → pivot → target, so each encoding needs only a converter to and from the pivot, giving O(N) tables. Unicode (UTF-8 or UTF-16) is the universal pivot in modern engines because its repertoire is a superset of essentially every legacy charset. The cost is two passes instead of one and a possible loss of fidelity for characters that round-trip imperfectly through Unicode; the benefit is linear table growth and one well-tested code path.
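The table-count arithmetic is easy to check. The sketch below counts directed converters (A to B and B to A are separate) under the two designs; the function names are hypothetical and the pivot is assumed to be one of the N encodings.

```c
/* Direct design: every ordered pair of distinct encodings needs a converter. */
static unsigned long pairwise_converters(unsigned long n)
{
    return n == 0 ? 0 : n * (n - 1);
}

/* Pivot design: each non-pivot encoding needs one converter to the
 * pivot and one converter from it. */
static unsigned long pivot_converters(unsigned long n)
{
    return n == 0 ? 0 : 2 * (n - 1);
}
```

For N = 40 this is 1560 directed converters in the pairwise design against 78 with a pivot.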

The fidelity caveat is not hypothetical. Some legacy charsets contain characters that map to the same Unicode code point as another character, or have multiple legacy code points that Unicode unifies; a pivot through Unicode can then collapse a distinction the legacy charset made, so source→pivot→source is not guaranteed to be the identity. Engines handle this with carefully authored mapping tables (PostgreSQL’s are generated from authoritative Unicode mapping files and checked into conversion_procs/), and they distinguish a lossy but defined mapping from an undefined one — the latter being the untranslatable-character error. The pivot design does not remove the round-tripping problem; it concentrates it into the two pivot tables where it can be audited once rather than across O(N²) pairwise tables.

Data that originates outside the server — a string literal in a submitted query, a COPY line, a bytea handed to convert_from — cannot be trusted to be validly encoded, and a single malformed multibyte sequence can desynchronize a parser or overrun a buffer. So engines validate aggressively at the boundary where untrusted bytes enter, and then trust the data internally: once a string is known to be valid server-encoded text, internal operations skip revalidation. The asymmetry is deliberate — validation is O(n) and the boundary is the only place an adversary controls the bytes.

The security stakes are concrete. A malformed multibyte sequence whose second byte happens to be an ASCII quote or backslash can, if it reaches a byte-oriented parser unverified, break out of a string literal — the classic encoding-based SQL-injection vector. By forcing every externally-sourced byte through a validator before the lexer, the engine guarantees the parser only ever sees well-formed server-encoded text, which is why the ASCII-transparency property (no ASCII byte ever appears as a non-leading byte) is a hard requirement for server encodings and merely a tolerated risk for client encodings that are transcoded away on the way in.
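A byte-level view shows why ASCII-transparency matters. In Shift_JIS the character 表 is the two bytes 0x95 0x5C, and 0x5C is the ASCII backslash; in UTF-8 the same character is E8 A1 A8, with no byte below 0x80. The helper `count_byte` below is a hypothetical stand-in for a byte-oriented scanner that looks for a metacharacter.

```c
#include <stddef.h>

/* Count occurrences of one byte value, the way a byte-oriented
 * scanner that knows nothing about multibyte characters would. */
static size_t count_byte(const unsigned char *s, size_t len, unsigned char target)
{
    size_t hits = 0;

    for (size_t i = 0; i < len; i++)
        if (s[i] == target)
            hits++;
    return hits;
}
```

Scanning the Shift_JIS bytes for 0x5C reports one backslash that is not really there; scanning the UTF-8 bytes reports none.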

Because every multibyte operation (how long is this character? is this byte sequence valid? what is the display width?) is encoding-specific, engines route them through a table of function pointers indexed by encoding ID, rather than a giant switch re-evaluated per character. This is a classic vtable: one row per encoding, one column per primitive (mblen, verify, char↔wchar, dsplen). Hot loops fetch the function pointer once and call it per character.

The vtable also doubles as a capability declaration. A single-byte encoding’s row points all of its primitives at trivial implementations (mblen returns 1, verify checks only for NUL), and its maxmblen of 1 is the flag the length/clip routines test to take the O(1) fast path. A client-only encoding’s row may legally have NULL in the mb2wchar/wchar2mb slots (it never needs to produce the internal wide-char form), while the mblen/verify slots are always populated because validation must work for any client encoding on ingress. Reading the table’s shape thus tells you, per encoding, exactly which operations the engine is prepared to perform — without consulting any catalog.
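A stripped-down version of the dispatch pattern might look like the following. The type and table names (`enc_ops`, `enc_table`, `ENC_LATIN1`, `ENC_UTF8`) are invented for this sketch and carry only two of the primitives; the real table has more columns.

```c
#include <stddef.h>

typedef int (*mblen_fn) (const unsigned char *s);

/* One row per encoding: a primitive plus the capability flag. */
typedef struct
{
    mblen_fn mblen;
    int      maxmblen;
} enc_ops;

enum { ENC_LATIN1 = 0, ENC_UTF8 = 1, ENC_COUNT };

static int latin1_mblen(const unsigned char *s)
{
    (void) s;
    return 1;
}

static int utf8_mblen(const unsigned char *s)
{
    if ((*s & 0x80) == 0x00) return 1;
    if ((*s & 0xE0) == 0xC0) return 2;
    if ((*s & 0xF0) == 0xE0) return 3;
    if ((*s & 0xF8) == 0xF0) return 4;
    return 1;
}

static const enc_ops enc_table[ENC_COUNT] = {
    [ENC_LATIN1] = {latin1_mblen, 1},
    [ENC_UTF8] = {utf8_mblen, 4},
};

/* Count characters, taking the O(1) path for single-byte encodings. */
static size_t count_chars(int enc, const unsigned char *s, size_t len)
{
    const enc_ops *ops = &enc_table[enc];
    mblen_fn mblen = ops->mblen;    /* fetch the pointer once, not per character */
    size_t n = 0;
    size_t i = 0;

    if (ops->maxmblen == 1)
        return len;

    while (i < len)
    {
        size_t l = (size_t) mblen(s + i);

        if (l > len - i)
            l = len - i;
        i += l;
        n++;
    }
    return n;
}
```

The same five bytes count as 5 characters under the single-byte row and 4 under the UTF-8 row; the caller never branches on the encoding itself, only on what the row declares.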

Sharing the encoding library between server and tools

The same mblen/validate logic is needed by client tools (psql, pg_dump), which have no backend. Engines therefore factor the pure, dependency-free encoding primitives into a shared library that both the server and the frontend link, keeping only the catalog-aware parts (looking up a conversion function in a system catalog) on the server side.

flowchart LR
  subgraph client["Client side"]
    APP["application bytes<br/>in client_encoding"]
  end
  subgraph server["Backend (one process, fixed server encoding)"]
    IN["ingress<br/>pg_client_to_server<br/>VERIFY + convert"]
    CORE["stored / parsed text<br/>(server encoding, trusted)"]
    OUT["egress<br/>pg_server_to_client<br/>convert (trust)"]
  end
  APP -->|"Query / Bind / COPY"| IN --> CORE
  CORE --> OUT -->|"DataRow / results"| APP
  IN -. "no conversion if<br/>client == server or SQL_ASCII" .-> CORE

PostgreSQL fixes the server encoding (also called the database encoding) at CREATE DATABASE time: it is immutable for the life of the database and is the encoding of every stored text, varchar, name, cstring, xml, and json value, and of the SQL text after it enters the backend. The client encoding is a per-session GUC (client_encoding) that the client may change at any time. The backend caches three pg_enc2name pointers — one each for the database, client, and message encoding — and a small set of cached FmgrInfo conversion-function handles, in mbutils.c:

// ClientEncoding / DatabaseEncoding / MessageEncoding — mbutils.c
static const pg_enc2name *ClientEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
static const pg_enc2name *DatabaseEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
static const pg_enc2name *MessageEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];

DatabaseEncoding is set once, early, by SetDatabaseEncoding and never changes. ClientEncoding moves whenever the session runs SET client_encoding or sends a startup client_encoding parameter.

The third pointer, MessageEncoding, exists because server log and error messages may be emitted in a different encoding than either the database or the client — gettext localizes messages, and the encoding of the translated message catalog need not match the database. GetMessageEncoding lets the error-reporting path tag emitted text correctly. The separation matters most during startup and error handling, when a backend may need to emit a message before ClientEncoding is even resolved; defaulting all three pointers to PG_SQL_ASCII at process start (see the static initializers above) guarantees the message path never dereferences an uninitialized conversion proc.

Every encoding has a small integer ID — a pg_enc enum value — and the IDs are partitioned: IDs 0 .. PG_ENCODING_BE_LAST (which is PG_KOI8U) may be either a server or a client encoding, while IDs above it (PG_SJIS, PG_BIG5, PG_GBK, PG_UHC, PG_GB18030, PG_JOHAB, PG_SHIFT_JIS_2004) are client-only. Those seven are deliberately barred as server encodings because they are not ASCII-transparent: a byte in the ASCII range can appear as the second byte of a multibyte character, which would let an embedded \ or ' confuse a byte-oriented parser. The macros encode this partition:

// PG_VALID_BE_ENCODING / PG_VALID_FE_ENCODING — pg_wchar.h
#define PG_VALID_BE_ENCODING(_enc) \
    ((_enc) >= 0 && (_enc) <= PG_ENCODING_BE_LAST)
/* On FE are possible all encodings */
#define PG_VALID_FE_ENCODING(_enc) PG_VALID_ENCODING(_enc)

Encoding names are resolved to IDs by pg_char_to_encoding (a binary search over a normalized-name table in src/common/encnames.c) and back by pg_encoding_to_char. The name table and the ID→name table (pg_enc2name_tbl) both live in src/common precisely so frontend tools can resolve encoding names without a backend.
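The normalize-then-binary-search idea can be sketched as follows. Everything here is illustrative: the four-row table, its ID values, and the names `normalize_name` and `lookup_encoding` are assumptions of the sketch, and the normalization rule (keep alphanumerics, lowercase them) is a simplification.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
    const char *name;   /* normalized: lowercase, alphanumerics only */
    int         id;     /* illustrative ID values */
} enc_name;

/* Must stay sorted by name, because bsearch depends on it. */
static const enc_name enc_names[] = {
    {"eucjp", 1},
    {"latin1", 8},
    {"sqlascii", 0},
    {"utf8", 6},
};

/* Drop everything but letters and digits, and fold to lowercase. */
static void normalize_name(const char *in, char *out, size_t outsz)
{
    size_t j = 0;

    if (outsz == 0)
        return;
    for (; *in != '\0' && j + 1 < outsz; in++)
    {
        unsigned char c = (unsigned char) *in;

        if (isalnum(c))
            out[j++] = (char) tolower(c);
    }
    out[j] = '\0';
}

static int cmp_name(const void *key, const void *elem)
{
    return strcmp((const char *) key, ((const enc_name *) elem)->name);
}

/* Returns the encoding ID, or -1 if the name is unknown. */
static int lookup_encoding(const char *name)
{
    char key[64];
    const enc_name *hit;

    normalize_name(name, key, sizeof key);
    hit = bsearch(key, enc_names, sizeof enc_names / sizeof enc_names[0],
                  sizeof enc_names[0], cmp_name);
    return hit ? hit->id : -1;
}
```

With this rule "UTF-8", "utf8", and "Utf_8" all resolve to the same row, and an unknown name yields -1 instead of a guess.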

The conversion framework: catalog-driven, UTF-8-pivoted

A conversion between two server-capable encodings is performed by a function registered in the pg_conversion system catalog. FindDefaultConversionProc (in namespace.c) walks the active search path to find the default conversion proc for a (from_encoding, to_encoding) pair; the actual byte work is done by the proc, which is almost always one of the C functions in src/backend/utils/mb/conversion_procs/ that ultimately call UtfToLocal or LocalToUtf (in conv.c). PostgreSQL does not ship a direct converter for every pair; instead UTF-8 is the pivot, so e.g. EUC_JP → EUC_KR is realized as two registered conversions (EUC_JP → UTF8, then UTF8 → EUC_KR) when invoked through the higher-level paths, while the catalog stores the per-pair procs that each go through UTF-8 internally.

The pivot mechanics live in UtfToLocal (and its mirror LocalToUtf): it walks the UTF-8 input one character at a time, re-validating each character with pg_utf8_islegal before looking the code point up in a radix-tree map to the local encoding. A failed lookup or an illegal byte is the seam where the two error classes diverge — malformed input vs. untranslatable character:

// UtfToLocal — src/backend/utils/mb/conv.c (condensed)
for (; len > 0; len -= l)
{
    if (*utf == '\0') break;
    l = pg_utf_mblen(utf);
    if (len < l) break;                         /* truncated trailing char */
    if (!pg_utf8_islegal(utf, l)) break;        /* malformed -> report_invalid_encoding */
    if (l == 1) { *iso++ = *utf++; continue; }  /* ASCII passes through */
    /* collect b1..b4, pg_mb_radix_conv() lookup, else combined-char bsearch,
       else report_untranslatable_char() */
}

Every break falls into the post-loop error path; whether that path raises report_invalid_encoding or report_untranslatable_char (or silently returns in noError mode) is exactly the invalid-vs-untranslatable distinction from the Theoretical Background.
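The two failure classes can be seen side by side in a toy converter from UTF-8 to LATIN1. This is a sketch under simplifying assumptions, not the radix-tree code: the names `conv_status` and `utf8_to_latin1` are hypothetical, the "mapping table" is just the identity on U+0080..U+00FF, and the function reports a status instead of raising an error.

```c
#include <stddef.h>

typedef enum { CONV_OK, CONV_INVALID, CONV_UNTRANSLATABLE } conv_status;

/* dst must have room for at least len bytes. */
static conv_status utf8_to_latin1(const unsigned char *src, size_t len,
                                  unsigned char *dst, size_t *written)
{
    size_t i = 0, o = 0;
    conv_status st = CONV_OK;

    while (i < len)
    {
        unsigned char b0 = src[i];
        unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of byte 2 */
        size_t l, k;

        if (b0 < 0x80) { dst[o++] = b0; i++; continue; }  /* ASCII passes through */

        if (b0 >= 0xC2 && b0 <= 0xDF) l = 2;
        else if (b0 >= 0xE0 && b0 <= 0xEF) l = 3;
        else if (b0 >= 0xF0 && b0 <= 0xF4) l = 4;
        else { st = CONV_INVALID; break; }   /* stray continuation or bad lead */

        if (b0 == 0xE0) lo = 0xA0;           /* no overlong 3-byte forms */
        if (b0 == 0xED) hi = 0x9F;           /* no surrogate halves */
        if (b0 == 0xF0) lo = 0x90;           /* no overlong 4-byte forms */
        if (b0 == 0xF4) hi = 0x8F;           /* nothing above U+10FFFF */

        if (l > len - i || src[i + 1] < lo || src[i + 1] > hi)
        { st = CONV_INVALID; break; }        /* truncated or malformed */
        for (k = 2; k < l; k++)
            if ((src[i + k] & 0xC0) != 0x80) break;
        if (k < l) { st = CONV_INVALID; break; }

        /* Well-formed from here on. LATIN1 covers only U+0000..U+00FF. */
        if (l == 2 && b0 <= 0xC3)
        {
            dst[o++] = (unsigned char) (((b0 & 0x1F) << 6) | (src[i + 1] & 0x3F));
            i += 2;
            continue;
        }
        st = CONV_UNTRANSLATABLE;            /* valid character, no target mapping */
        break;
    }
    *written = o;
    return st;
}
```

"café" converts cleanly to four LATIN1 bytes; the bytes E6 97 A5 (日) are well-formed but have no LATIN1 code, so the result is untranslatable; C0 AF and a lone trailing C3 never reach the mapping step and are reported as invalid.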

The single most important entry point is pg_do_encoding_conversion, the general-case converter. Its structure is a cascade of fast-paths before it ever touches the catalog:

// pg_do_encoding_conversion — mbutils.c (condensed)
if (len <= 0)
    return src;     /* empty string is always valid */
if (src_encoding == dest_encoding)
    return src;     /* no conversion required, assume valid */
if (dest_encoding == PG_SQL_ASCII)
    return src;     /* any string is valid in SQL_ASCII */
if (src_encoding == PG_SQL_ASCII)
{
    /* No conversion is possible, but we must validate the result */
    (void) pg_verify_mbstr(dest_encoding, (const char *) src, len, false);
    return src;
}
/* ... look up proc via FindDefaultConversionProc, allocate, OidFunctionCall6 ... */

Note the SQL_ASCII semantics, which are PostgreSQL’s escape hatch and its foot-gun: SQL_ASCII means “no encoding declared — treat bytes as opaque.” Any string is “valid” in SQL_ASCII, and no conversion is ever done to or from it. That makes a SQL_ASCII database a permissive byte bucket; it also means the server cannot guarantee anything about the text it stores.

When a conversion is needed, the result buffer is sized for the worst case — MAX_CONVERSION_GROWTH (= 4) bytes of output per input byte — and the proc is invoked through OidFunctionCall6 with the standard six-argument conversion signature (src_encoding, dest_encoding, src, dest, len, noError). The allocation is done with MemoryContextAllocHuge and guarded against integer overflow, because len * 4 can exceed MaxAllocSize even when the real result would fit. The over-allocation is then trimmed: after the proc reports how many bytes it actually wrote, large buffers are shrunk with repalloc so the slack between worst-case and actual output is not held for the life of the result.
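The overflow guard on the worst-case size can be sketched independently of any allocator. In this hypothetical helper, `GROWTH` mirrors the factor of 4 quoted above and `limit` stands in for whatever ceiling the allocator imposes; neither name comes from the source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GROWTH 4    /* worst-case output bytes per input byte */

/* Compute len * GROWTH + 1 (the +1 is for a terminating NUL) without
 * letting the multiplication wrap. Returns false if the result would
 * overflow size_t or exceed limit. */
static bool worst_case_size(size_t len, size_t limit, size_t *out)
{
    size_t need;

    if (len > (SIZE_MAX - 1) / GROWTH)
        return false;               /* len * GROWTH + 1 would wrap around */
    need = len * GROWTH + 1;
    if (need > limit)
        return false;               /* representable, but too large to allocate */
    *out = need;
    return true;
}
```

The check is done on the input length before multiplying, because testing the product afterwards would be too late: a wrapped product can look like a small, perfectly acceptable size.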

The six-argument ABI also carries a noError flag, and this is what lets the same proc back two very different callers. SQL convert() and the parser want a hard ERROR on a bad byte; speculative callers (e.g. probing whether a value can be represented in a target encoding) want a soft failure. With noError = true the proc stops at the first untranslatable or invalid character and returns the count of bytes successfully consumed instead of calling report_invalid_encoding / report_untranslatable_char. The error classification — invalid encoding (CHARACTER_NOT_IN_REPERTOIRE) versus untranslatable character (UNTRANSLATABLE_CHARACTER) — is therefore not decided by the caller but emerges from where in the proc the byte failed: malformed input fails the pg_utf8_islegal / verify check, while a well-formed character with no target mapping fails the radix-tree lookup.

flowchart TD
  S["pg_do_encoding_conversion(src, len, from, to)"] --> A{"len<=0 or<br/>from==to?"}
  A -->|yes| R1["return src as-is"]
  A -->|no| B{"to == SQL_ASCII?"}
  B -->|yes| R1
  B -->|no| C{"from == SQL_ASCII?"}
  C -->|yes| V["pg_verify_mbstr(to)<br/>then return src"]
  C -->|no| D["FindDefaultConversionProc(from,to)"]
  D --> E{"proc found?"}
  E -->|no| ERR["ERROR: no default<br/>conversion function"]
  E -->|yes| F["alloc len*4+1 (Huge)<br/>OidFunctionCall6(proc, ...)"]
  F --> G["repalloc down if large<br/>return result"]

Most conversion in a live session is not the general case but the client-to-server and server-to-client directions, and for those PostgreSQL caches the FmgrInfo so it can convert without a catalog lookup. This matters because pg_server_to_client runs on every result row and may run outside a transaction. SetClientEncoding installs the cached procs into two static pointers; perform_default_encoding_conversion uses them:

// pg_any_to_server — mbutils.c (condensed)
if (encoding == DatabaseEncoding->encoding || encoding == PG_SQL_ASCII)
{
    /* No conversion is needed, but we must still validate the data. */
    (void) pg_verify_mbstr(DatabaseEncoding->encoding, s, len, false);
    return unconstify(char *, s);
}
/* Fast path if we can use cached conversion function */
if (encoding == ClientEncoding->encoding)
    return perform_default_encoding_conversion(s, len, true);
/* General case ... will not work outside transactions */
return pg_do_encoding_conversion(...);

The asymmetry the file header documents is the heart of the discipline: pg_any_to_server always validates even when no conversion is needed, because the bytes came from outside; pg_server_to_any trusts the server-side bytes when no conversion is needed. Ingress verifies; egress trusts.

There is one case pg_any_to_server cannot resolve cleanly: a SQL_ASCII server receiving data from an ASCII-unsafe client encoding (one of the seven client-only encodings). No conversion proc exists into SQL_ASCII, yet the bytes might contain a multibyte sequence whose second byte is an ASCII metacharacter the parser would mis-read. PostgreSQL refuses to guess — it rejects any non-ASCII byte outright:

// pg_any_to_server — mbutils.c (SQL_ASCII-server, ASCII-unsafe client)
if (PG_VALID_BE_ENCODING(encoding))
    (void) pg_verify_mbstr(encoding, s, len, false);    /* ASCII-safe: verify */
else
{
    for (i = 0; i < len; i++)
        if (s[i] == '\0' || IS_HIGHBIT_SET(s[i]))       /* ASCII-unsafe: reject */
            ereport(ERROR,
                    (errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
                     errmsg("invalid byte value for encoding \"%s\": 0x%02x",
                            pg_enc2name_tbl[PG_SQL_ASCII].name,
                            (unsigned char) s[i])));
}

The comment in the source spells out the reasoning: “we dare not pass such data to the parser but we have no way to convert it. We compromise by rejecting the data if it contains any non-ASCII characters.” This is the seam where the “SQL_ASCII is a permissive byte bucket” semantics meet the “never let an ASCII-unsafe byte reach a byte-oriented parser” safety property, and the resolution is to narrow the permissiveness to pure ASCII.
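Outside the backend, the same "pure ASCII or nothing" rule reduces to a one-pass predicate. The function below is a hypothetical restatement that returns a boolean where the server raises an error.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if every byte is a non-NUL 7-bit value. */
static bool is_clean_ascii(const char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        unsigned char c = (unsigned char) s[i];

        if (c == '\0' || (c & 0x80) != 0)
            return false;
    }
    return true;
}
```

A plain query string passes; any string carrying a high-bit byte or an embedded NUL fails, regardless of what the client encoding claims those bytes mean.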

Every multibyte primitive is dispatched through pg_wchar_table, indexed by encoding ID. Each row is a pg_wchar_tbl of seven members: mb2wchar_with_len, wchar2mb_with_len, mblen, dsplen, mbverifychar, mbverifystr, and maxmblen:

// pg_wchar_table — src/common/wchar.c (excerpt)
const pg_wchar_tbl pg_wchar_table[] = {
[PG_SQL_ASCII] = {pg_ascii2wchar_with_len, pg_wchar2single_with_len, pg_ascii_mblen, pg_ascii_dsplen, pg_ascii_verifychar, pg_ascii_verifystr, 1},
[PG_UTF8] = {pg_utf2wchar_with_len, pg_wchar2utf_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_verifychar, pg_utf8_verifystr, 4},
[PG_GB18030] = {0, 0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifychar, pg_gb18030_verifystr, 4},
/* ... 40 entries total ... */
};

The backend wrappers in mbutils.c (pg_mblen, pg_mbstrlen, pg_dsplen, pg_verify_mbstr) read DatabaseEncoding->encoding, then index this table; the pg_encoding_* variants in wchar.c take the encoding as an argument so frontend tools can call them. Note the LATIN/WIN/ISO single-byte encodings all share the same function pointers (pg_latin1_*) with maxmblen 1 — for a single-byte encoding “decode a character” is trivial, and the only validity question is “is this byte a NUL?”.

Because UTF-8 is the pivot, its primitives are the most exercised. Decoding a character’s length is a pure function of the leading byte:

// pg_utf_mblen — src/common/wchar.c
if ((*s & 0x80) == 0) len = 1;
else if ((*s & 0xe0) == 0xc0) len = 2;
else if ((*s & 0xf0) == 0xe0) len = 3;
else if ((*s & 0xf8) == 0xf0) len = 4;
else len = 1; /* bogus lead: treat as 1 */

Decoding the value is the inverse bit-shuffle (utf8_to_unicode in pg_wchar.h), and encoding a code point back to bytes is unicode_to_utf8. Validation, however, is more than length: pg_utf8_islegal enforces the RFC 3629 restrictions that forbid overlong forms and UTF-16 surrogate halves, by constraining the range of the second byte based on the leading byte:

// pg_utf8_islegal — src/common/wchar.c (condensed)
case 2:
    a = source[1];
    switch (*source)
    {
        case 0xE0: if (a < 0xA0 || a > 0xBF) return false; break; /* no overlong-3 */
        case 0xED: if (a < 0x80 || a > 0x9F) return false; break; /* no surrogates */
        case 0xF0: if (a < 0x90 || a > 0xBF) return false; break; /* no overlong-4 */
        case 0xF4: if (a < 0x80 || a > 0x8F) return false; break; /* <= U+10FFFF */
        default:   if (a < 0x80 || a > 0xBF) return false; break;
    }
    /* FALL THRU */
case 1:
    a = *source;
    if (a >= 0x80 && a < 0xC2) return false;    /* 0x80..0xC1: cont/overlong lead */
    if (a > 0xF4) return false;                 /* > U+10FFFF lead */
    break;

The FALL THRU chain checks the trailing bytes (0x80..0xBF) for lengths 4, 3, 2 in turn, then the lead byte; lengths 5 and 6 are rejected outright. The range gates after 0xE0/0xED/0xF0/0xF4 are exactly the RFC 3629 cases that distinguish a legal minimal encoding from an overlong one or a surrogate half.
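The "inverse bit-shuffle" that turns validated bytes back into a code point can be written out directly. The function name `decode_utf8` is hypothetical, and the sketch assumes its argument points at a complete character that has already passed a legality check; it does no validation of its own.

```c
#include <stdint.h>

/* Decode one already-validated UTF-8 character to its code point.
 * Returns 0xFFFFFFFF if the first byte is not a lead byte. */
static uint32_t decode_utf8(const unsigned char *s)
{
    if ((s[0] & 0x80) == 0x00)
        return s[0];
    if ((s[0] & 0xE0) == 0xC0)
        return ((uint32_t) (s[0] & 0x1F) << 6) |
               (uint32_t) (s[1] & 0x3F);
    if ((s[0] & 0xF0) == 0xE0)
        return ((uint32_t) (s[0] & 0x0F) << 12) |
               ((uint32_t) (s[1] & 0x3F) << 6) |
               (uint32_t) (s[2] & 0x3F);
    if ((s[0] & 0xF8) == 0xF0)
        return ((uint32_t) (s[0] & 0x07) << 18) |
               ((uint32_t) (s[1] & 0x3F) << 12) |
               ((uint32_t) (s[2] & 0x3F) << 6) |
               (uint32_t) (s[3] & 0x3F);
    return 0xFFFFFFFF;
}
```

Because decoding is pure bit arithmetic, it will happily produce a value for an overlong or surrogate sequence; that is why the legality check has to run first and cannot be folded into the decoder as an afterthought.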

For bulk validation of a whole string, pg_utf8_verifystr works in chunks of two SIMD-vector widths: a chunk that is pure ASCII is accepted by a cheap vector check, and any other chunk is fed to a shift-based DFA (Utf8Transition). The byte-wise pg_utf8_verifychar loop handles only what is left over: the tail shorter than a chunk, or the bytes from the last character boundary when a chunk ends mid-character. This is the hot path that runs on every UTF-8 string entering the server; the DFA design avoids the data-dependent loads of a traditional table-driven automaton.
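A chunk-level ASCII test of the kind a bulk validator can use to skip pure-ASCII data is sketched below, using ordinary 64-bit words in place of SIMD registers. The name `chunk_is_ascii` is hypothetical, and the sketch is simplified: it checks only the high bit and, unlike a server-side validator, does not also reject NUL bytes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True if no byte in s[0..len) has its high bit set. */
static bool chunk_is_ascii(const unsigned char *s, size_t len)
{
    uint64_t acc = 0;
    size_t i = 0;

    /* OR eight bytes at a time; memcpy avoids unaligned access. */
    while (len - i >= 8)
    {
        uint64_t w;

        memcpy(&w, s + i, sizeof w);
        acc |= w;
        i += 8;
    }
    /* Fold in the remaining bytes one by one. */
    for (; i < len; i++)
        acc |= s[i];

    /* A single test at the end: was any high bit ever seen? */
    return (acc & UINT64_C(0x8080808080808080)) == 0;
}
```

The loop body has no branch that depends on the data, so a long ASCII run costs one load and one OR per word; only the final mask test decides whether a slower per-character path is needed.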

There is no separate “identifier encoding.” Table names, column names, and the name type are stored and compared in the server encoding, just like any other text. The point at which an identifier or a string literal acquires the server encoding is ingress: the entire query string is run through pg_client_to_server before the lexer ever sees it, so by the time the parser tokenizes "naïve_column" or 'café', those bytes are already validated server-encoded text. pg_unicode_to_server handles the special case of a Unicode escape (U&'\00e9', or \u00e9 in JSON): it takes a single code point, renders it as UTF-8, and then runs the cached UTF8→server conversion (or, if the server is already UTF-8, just reformats). It is carefully written to work outside a transaction, because the lexer can run before a transaction starts.

The encoding machinery splits across two trees by dependency: the backend-only, catalog-aware glue in src/backend/utils/mb/, and the pure, self-contained primitives in src/common/ that both the server and frontend tools (psql, pg_dump, libpq) link. Below, symbols are grouped by call-flow.

  • ConvProcInfo — the cached record pairing a (server, client) encoding pair with its two FmgrInfo handles (to_server_info, to_client_info). Kept in a ConvProcList in TopMemoryContext so a setting can be restored during transaction rollback without catalog access.
  • PrepareClientEncoding — validates a requested client encoding and, if in a live transaction, looks up the two conversion procs via FindDefaultConversionProc and caches them. Returns failure before it is committed to, so SET client_encoding can fail gracefully. During backend startup it short-circuits (catalogs not yet available).
  • SetClientEncoding — installs the active ClientEncoding pointer and the ToServerConvProc / ToClientConvProc static FmgrInfo pointers from the cache prepared above; the no-conversion cases (client == server, or either side is SQL_ASCII) clear the procs to NULL.
  • InitializeClientEncoding — called once from InitPostgres; flips backend_startup_complete, applies the pending_client_encoding, and additionally looks up the UTF8→server proc into Utf8ToServerConvProc (used by pg_unicode_to_server).
  • SetDatabaseEncoding / GetDatabaseEncoding / GetDatabaseEncodingName — set-once / read accessors for the fixed server encoding. GetMessageEncoding / SetMessageEncoding track the separate encoding gettext emits messages in.
  • pg_get_client_encoding / pg_client_encoding (SQL) / getdatabaseencoding (SQL) — expose the current encodings to callers and to SQL.
  • pg_do_encoding_conversion — the general-case converter: fast-paths for empty / identical / SQL_ASCII cases, then FindDefaultConversionProc + MemoryContextAllocHuge(len * MAX_CONVERSION_GROWTH + 1) + OidFunctionCall6. Requires a transaction for the catalog lookup.
  • pg_do_encoding_conversion_buf — the buffer-output variant: the caller has already found the proc and supplies the destination buffer; clamps input length so the worst-case output fits.
  • pg_any_to_server / pg_client_to_server — ingress; always validates even with no conversion. Has the SQL_ASCII-server special case that rejects non-ASCII bytes from an ASCII-unsafe client encoding.
  • pg_server_to_any / pg_server_to_client — egress; trusts the source when no conversion is needed.
  • perform_default_encoding_conversion — the cached-FmgrInfo worker shared by both directions; the only converter safe to call outside a transaction (no catalog access).
  • pg_unicode_to_server / pg_unicode_to_server_noerror — convert one Unicode code point to a server-encoded string, via Utf8ToServerConvProc; written to be transaction-independent for the lexer.
  • SQL wrappers: pg_convert, pg_convert_to, pg_convert_from (the convert()/convert_to()/convert_from() functions), and length_in_encoding (length(bytea, name)).
  • FindDefaultConversionProc — walks activeSearchPath (skipping the temp namespace) calling FindDefaultConversion until it finds a default proc for the (for_encoding, to_encoding) pair; returns InvalidOid if none. This is the single bridge from utils/mb into the pg_conversion catalog.
  • pg_mblen / pg_mblen_cstr / pg_mblen_range / pg_mblen_with_len / pg_mblen_unbounded — byte length of one character in the database encoding, with varying bounds-checking strictness.
  • pg_mbstrlen / pg_mbstrlen_with_len — character count of a string, with a single-byte-encoding fast path (strlen).
  • pg_mbcliplen / pg_encoding_mbcliplen / pg_mbcharcliplen — clip a string to a byte (or character) limit without splitting a multibyte character; the workhorse of varchar(n) truncation.
  • pg_verify_mbstr / pg_verifymbstr / pg_verify_mbstr_len — validate a string in a given (or the database) encoding; *_len also counts characters and so cannot use the fast mbverifystr.
  • pg_database_encoding_max_length / pg_database_encoding_character_incrementer: the maxmblen accessor and the make_greater_string character incrementer (with pg_utf8_increment / pg_eucjp_increment / pg_generic_charinc).
  • report_invalid_encoding / report_untranslatable_char / check_encoding_conversion_args — the two error reporters (distinct SQLSTATEs: CHARACTER_NOT_IN_REPERTOIRE vs. UNTRANSLATABLE_CHARACTER) and the argument validator every conversion proc calls.

Pure encoding primitives (src/common/wchar.c)

  • pg_wchar_table — the 40-entry vtable of pg_wchar_tbl rows.
  • pg_utf_mblen / utf8_to_unicode / unicode_to_utf8 / unicode_utf8len — UTF-8 length, decode, encode, and encoded-length.
  • pg_utf8_islegal — RFC 3629 single-character legality (overlong / surrogate rejection).
  • pg_utf8_verifychar / pg_utf8_verifystr — single-char and whole-string UTF-8 validators; the latter is the shift-based DFA (Utf8Transition, utf8_advance).
  • pg_encoding_mblen / pg_encoding_mblen_or_incomplete / pg_encoding_mblen_bounded — table-dispatched char length by encoding ID (the _or_incomplete form is the GB18030-safe one that may read two bytes).
  • pg_encoding_verifymbchar / pg_encoding_verifymbstr / pg_encoding_max_length / pg_encoding_dsplen — the argument-takes-encoding variants used by frontend tools.

Encoding names and conversion-proc helpers

  • pg_char_to_encoding / pg_encoding_to_char / pg_valid_server_encoding (src/common/encnames.c) — name↔ID, with pg_enc2name_tbl / pg_encname_tbl.
  • local2local / latin2mic / mic2latin / latin2mic_with_table / mic2latin_with_table (conv.c) — the single-byte and MULE_INTERNAL helper converters the conversion procs build on.
  • UtfToLocal / LocalToUtf (conv.c) — the UTF-8 ↔ local-encoding radix-tree converters that nearly every pg_conversion proc delegates to; pg_mb_radix_conv / store_coded_char are their inner primitives, and compare3 / compare4 drive the bsearch over combined-character maps.

Position hints (as of 2026-06-06, REL_18 273fe94)


Symbols are the stable anchor; these line numbers are hints scoped to this revision.

| Symbol | File | Line |
| --- | --- | --- |
| ConvProcInfo | src/backend/utils/mb/mbutils.c | 55 |
| PrepareClientEncoding | src/backend/utils/mb/mbutils.c | 119 |
| SetClientEncoding | src/backend/utils/mb/mbutils.c | 217 |
| InitializeClientEncoding | src/backend/utils/mb/mbutils.c | 290 |
| pg_do_encoding_conversion | src/backend/utils/mb/mbutils.c | 365 |
| pg_do_encoding_conversion_buf | src/backend/utils/mb/mbutils.c | 478 |
| pg_convert | src/backend/utils/mb/mbutils.c | 562 |
| length_in_encoding | src/backend/utils/mb/mbutils.c | 624 |
| pg_client_to_server | src/backend/utils/mb/mbutils.c | 669 |
| pg_any_to_server | src/backend/utils/mb/mbutils.c | 685 |
| pg_server_to_client | src/backend/utils/mb/mbutils.c | 747 |
| pg_server_to_any | src/backend/utils/mb/mbutils.c | 758 |
| perform_default_encoding_conversion | src/backend/utils/mb/mbutils.c | 792 |
| pg_unicode_to_server | src/backend/utils/mb/mbutils.c | 873 |
| pg_mblen_cstr | src/backend/utils/mb/mbutils.c | 1043 |
| pg_mbstrlen | src/backend/utils/mb/mbutils.c | 1163 |
| pg_mbcliplen | src/backend/utils/mb/mbutils.c | 1209 |
| SetDatabaseEncoding | src/backend/utils/mb/mbutils.c | 1287 |
| GetDatabaseEncoding | src/backend/utils/mb/mbutils.c | 1387 |
| pg_database_encoding_character_incrementer | src/backend/utils/mb/mbutils.c | 1648 |
| pg_verify_mbstr | src/backend/utils/mb/mbutils.c | 1692 |
| pg_verify_mbstr_len | src/backend/utils/mb/mbutils.c | 1723 |
| report_invalid_encoding | src/backend/utils/mb/mbutils.c | 1824 |
| report_untranslatable_char | src/backend/utils/mb/mbutils.c | 1869 |
| local2local | src/backend/utils/mb/conv.c | 33 |
| mic2latin | src/backend/utils/mb/conv.c | 127 |
| pg_mb_radix_conv | src/backend/utils/mb/conv.c | 373 |
| UtfToLocal | src/backend/utils/mb/conv.c | 507 |
| LocalToUtf | src/backend/utils/mb/conv.c | 717 |
| pg_utf2wchar_with_len | src/common/wchar.c | 462 |
| pg_utf_mblen | src/common/wchar.c | 556 |
| pg_utf8_verifychar | src/common/wchar.c | 1723 |
| pg_utf8_verifystr | src/common/wchar.c | 1913 |
| pg_utf8_islegal | src/common/wchar.c | 2011 |
| pg_wchar_table | src/common/wchar.c | 2086 |
| pg_encoding_mblen | src/common/wchar.c | 2157 |
| pg_encoding_mblen_or_incomplete | src/common/wchar.c | 2169 |
| pg_encoding_verifymbstr | src/common/wchar.c | 2224 |
| pg_encoding_max_length | src/common/wchar.c | 2235 |
| pg_char_to_encoding | src/common/encnames.c | 552 |
| pg_encoding_to_char | src/common/encnames.c | 590 |
| utf8_to_unicode | src/include/mb/pg_wchar.h | 565 |
| unicode_to_utf8 | src/include/mb/pg_wchar.h | 591 |
| FindDefaultConversionProc | src/backend/catalog/namespace.c | 4083 |

Facts about the source at commit 273fe94, readable without external materials. Open questions follow.

  • The server encoding is fixed per backend; only the client encoding is mutable. Verified: DatabaseEncoding is written only by SetDatabaseEncoding (mbutils.c), while ClientEncoding is reassigned by SetClientEncoding. pg_unicode_to_server even comments that “the server encoding is fixed within any one backend process,” which is why Utf8ToServerConvProc is looked up exactly once in InitializeClientEncoding.

  • Ingress validates unconditionally; egress trusts. Verified by direct comparison: pg_any_to_server calls pg_verify_mbstr(...) even on the no-conversion path (encoding == DatabaseEncoding->encoding), whereas pg_server_to_any returns the source unconstified on the same path with the comment “assume data is valid.” The file header states this asymmetry explicitly.

  • SQL_ASCII disables all conversion in both directions. Verified in pg_do_encoding_conversion: dest_encoding == PG_SQL_ASCII returns src unchanged (“any string is valid in SQL_ASCII”), and src_encoding == PG_SQL_ASCII validates only the destination interpretation but performs no byte conversion.

  • Worst-case conversion growth is a factor of 4. Verified: MAX_CONVERSION_GROWTH is 4 in pg_wchar.h, and both pg_do_encoding_conversion and perform_default_encoding_conversion allocate len * MAX_CONVERSION_GROWTH + 1 with MemoryContextAllocHuge, guarding against len >= MaxAllocHugeSize / MAX_CONVERSION_GROWTH.

  • The conversion-proc ABI is a fixed six-argument signature. Verified: every call site (OidFunctionCall6 in pg_do_encoding_conversion, FunctionCall6 in perform_default_encoding_conversion and pg_unicode_to_server) passes (src_encoding, dest_encoding, src, dest, len, noError), and check_encoding_conversion_args validates exactly those.

  • UTF-8 validity is more than length: overlong forms and surrogates are rejected. Verified in pg_utf8_islegal (wchar.c), which constrains the second byte by leading byte: 0xE0 → 0xA0..0xBF, 0xED → 0x80..0x9F, 0xF0 → 0x90..0xBF, 0xF4 → 0x80..0x8F, and rejects leads 0x80..0xC1 and > 0xF4 outright. The comment cites RFC 3629 and the overlong-encoding security hazard explicitly.

  • The bulk UTF-8 validator is a shift-based DFA, not a per-byte branch tree. Verified: pg_utf8_verifystr drives utf8_advance over the Utf8Transition[256] table, processing STRIDE_LENGTH = 2 * sizeof(Vector8) bytes per iteration. A chunk that is entirely ASCII skips the DFA when the validator is not mid-character, and pg_utf8_verifychar handles only the tail shorter than one stride or the rescan needed to locate an error. The unusual state-number constants exist so the transitions pack into 32-bit integers.

  • Seven encodings are client-only. Verified from the pg_enc enum and PG_ENCODING_BE_LAST = PG_KOI8U: PG_SJIS, PG_BIG5, PG_GBK, PG_UHC, PG_GB18030, PG_JOHAB, PG_SHIFT_JIS_2004 all sort after PG_KOI8U, so PG_VALID_BE_ENCODING rejects them as server encodings while PG_VALID_FE_ENCODING accepts them as client encodings.

  • The conversion catalog lookup is search-path-sensitive. Verified: FindDefaultConversionProc (namespace.c) iterates activeSearchPath, explicitly skipping myTempNamespace, so a CREATE CONVERSION ... DEFAULT in an earlier schema shadows a later one.

  • GB18030 is the reason pg_encoding_mblen has an _or_incomplete variant. Verified: the comment on pg_encoding_mblen warns that encoding==GB18030 may need to read mbstr[1] to determine length, and pg_encoding_mblen_or_incomplete returns INT_MAX when remaining < 2 for a high-bit GB18030 lead byte.
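The second-byte ranges listed for pg_utf8_islegal translate directly into a small legality check. The Python sketch below follows those ranges; `utf8_char_is_legal` is a hypothetical name, and unlike a function that is handed the length separately, it also checks that the sequence length matches the leading byte (an addition made for this illustration).

```python
# Leading bytes whose second byte has a narrower range than 0x80..0xBF.
SECOND_BYTE_RANGE = {
    0xE0: (0xA0, 0xBF),  # excludes overlong three-byte forms
    0xED: (0x80, 0x9F),  # excludes UTF-16 surrogates U+D800..U+DFFF
    0xF0: (0x90, 0xBF),  # excludes overlong four-byte forms
    0xF4: (0x80, 0x8F),  # excludes code points above U+10FFFF
}


def utf8_char_is_legal(seq: bytes) -> bool:
    """True if seq is exactly one well-formed UTF-8 character."""
    n = len(seq)
    if n < 1 or n > 4:
        return False
    lead = seq[0]
    # Continuation bytes, the overlong leads 0xC0/0xC1, and leads past 0xF4.
    if 0x80 <= lead <= 0xC1 or lead > 0xF4:
        return False
    if lead < 0x80:
        expected = 1
    elif lead < 0xE0:
        expected = 2
    elif lead < 0xF0:
        expected = 3
    else:
        expected = 4
    if n != expected:
        return False
    if n >= 2:
        lo, hi = SECOND_BYTE_RANGE.get(lead, (0x80, 0xBF))
        if not lo <= seq[1] <= hi:
            return False
    # Any third and fourth bytes are plain continuation bytes.
    return all(0x80 <= b <= 0xBF for b in seq[2:])


print(utf8_char_is_legal("€".encode("utf-8")))
print(utf8_char_is_legal(b"\xc0\xaf"))
print(utf8_char_is_legal(b"\xed\xa0\x80"))
```

Only the second byte needs a per-lead table: once it is constrained, every remaining byte of a well-formed sequence is an ordinary continuation byte.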

  1. Two-hop conversion for non-UTF8 server encodings. When both client and server are non-UTF8 (e.g. EUC_JP client, EUC_KR server), pg_do_encoding_conversion finds a single (EUC_JP, EUC_KR) default proc. Whether that proc internally pivots through UTF-8 or carries a direct table is a property of the specific conversion_procs/ entry, not visible from mbutils.c/conv.c alone. Investigation path: read src/backend/utils/mb/conversion_procs/euc_jp_and_*.

  2. Interaction of client_encoding rollback with cached ConvProcInfo. PrepareClientEncoding leaves stale duplicate entries in ConvProcList for SetClientEncoding to garbage-collect. The exact lifetime guarantee that a still-referenced FmgrInfo is never freed while a conversion is mid-flight is asserted by the comments but not traced end-to-end here.

  3. Display width and East-Asian ambiguity. pg_utf_dsplen → ucs_wcwidth uses fixed nonspacing / east-asian-fullwidth tables; how psql’s column alignment handles “ambiguous width” characters (which some terminals render single- and some double-width) is not resolved in this layer. Investigation path: ucs_wcwidth and the generated unicode_east_asian_fw_table.h.
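To make the third question concrete, the sketch below uses Python's `unicodedata` module to show which characters fall into the "ambiguous" East Asian Width class and how a width policy has to choose. It does not reproduce ucs_wcwidth or its generated tables; `display_width` and its `ambiguous_wide` switch are hypothetical.

```python
import unicodedata


def display_width(ch: str, ambiguous_wide: bool = False) -> int:
    """Terminal columns for one character under a simple width policy.

    Hypothetical helper: nonspacing and enclosing marks take 0 columns,
    wide and fullwidth characters take 2, and the ambiguous class is
    settled by the caller-supplied policy flag.
    """
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2
    if eaw == "A" and ambiguous_wide:
        return 2
    return 1


for ch in ("a", "\u4e2d", "\u0301", "\u00a7"):
    print(repr(ch), unicodedata.east_asian_width(ch),
          display_width(ch), display_width(ch, ambiguous_wide=True))
```

The section sign U+00A7 is the interesting row: its width depends entirely on the policy flag, which is the information a fixed table cannot carry.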

Beyond PostgreSQL — Comparative Designs & Research Frontiers


Pointers, not analysis. Each bullet is a starting handle for a follow-up document.

  • UTF-8 as the only server encoding (the modern monoculture). SQL Server (since 2019, with UTF8 collations), Oracle (AL32UTF8), and the entire cloud-native generation (CockroachDB, Spanner, most NewSQL engines) either default to or mandate UTF-8 as the storage encoding, treating legacy charsets purely as a client-side I/O concern. PostgreSQL’s retention of ~40 server-capable encodings and a full catalog-driven conversion matrix is increasingly the historical outlier. The interesting comparison is cost/benefit: the conversion framework (pg_conversion, UtfToLocal) is real maintenance surface that a UTF-8-only engine simply does not carry, in exchange for native storage of legacy East-Asian charsets without a transcoding tax on every byte.

  • SIMD UTF-8 validation — “Validating UTF-8 In Less Than One Instruction Per Byte” (Keiser & Lemire, 2021) and the simdutf library. PostgreSQL’s pg_utf8_verifystr shift-based DFA (the Utf8Transition table processing 2 * sizeof(Vector8) bytes per stride) is a deliberate, portable approximation of the fully vectorized validators that libraries like simdutf achieve with AVX-512/NEON range checks. A follow-up could measure the gap between PostgreSQL’s generic-SIMD fallback and a hand-tuned architecture-specific validator on the ingress hot path, and ask whether the portability cost is still justified.

  • Collation vs. encoding separation, and ICU. This document deliberately defers ordering to postgres-collation-providers.md. The research frontier is the coupling point: ICU's native string representation is UTF-16, so a UTF-8 server encoding can pay a transcoding cost inside ICU comparisons unless a UTF-8-aware entry point is used. Engines that store UTF-16 natively (older SQL Server NVARCHAR, Java-based stores) invert that tradeoff. A comparative note would trace where the decode happens in each design and what it costs per comparison.

  • Encoding-aware indexing and the validate-on-ingress boundary. Because PostgreSQL trusts stored bytes (egress trusts, ingress verifies), an index over text never re-validates. Systems that allow per-column or per-value encodings (some document stores, polyglot engines) cannot make that assumption and must carry encoding metadata into the index. The PostgreSQL choice — one fixed server encoding per database — is what makes the “trust stored bytes” optimization sound, and is worth contrasting with the per-value approach’s flexibility/cost.

  • GB18030 and the limits of the leading-byte length rule. GB18030 is the reason pg_encoding_mblen needs an _or_incomplete variant (a four-byte GB18030 character cannot have its length determined from the first byte alone). The Unicode-superset mandate that GB18030-2022 imposes on Chinese-market software is a live standards question; a follow-up could compare how PostgreSQL’s client-only GB18030 support handles the 2022 revision’s new mappings against engines that store GB18030 natively.

  • Streaming / incremental transcoding. PostgreSQL converts whole strings (pg_do_encoding_conversion allocates len * 4 + 1 up front). Streaming parsers (e.g. COPY of a huge field, or a future chunked protocol) would benefit from an incremental converter that carries DFA state across chunk boundaries — exactly the resynchronization property UTF-8 self-synchronization enables but that the current whole-buffer API does not expose. See postgres-copy.md for where bulk ingest currently sits.
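The incremental converter described in the last bullet can be illustrated with Python's standard incremental decoders, which carry partial-character state across chunk boundaries. This is a sketch of the general technique, not of any PostgreSQL API; `decode_stream` is a hypothetical name.

```python
import codecs
from typing import Iterable, Iterator


def decode_stream(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Decode a byte stream chunk by chunk.

    A character split across two chunks is held back by the decoder until
    its remaining bytes arrive, so chunk boundaries need not fall on
    character boundaries.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True makes the decoder raise if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Both multibyte characters are deliberately split across chunks.
pieces = [b"caf\xc3", b"\xa9 \xe2", b"\x82\xac"]
print("".join(decode_stream(pieces)))
```

The state carried between chunks is at most one partial character, so memory stays bounded by the chunk size rather than by the full string length.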

PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)

  • src/backend/utils/mb/mbutils.c — encoding state (ClientEncoding, DatabaseEncoding, MessageEncoding, ConvProcInfo), selection (PrepareClientEncoding, SetClientEncoding, InitializeClientEncoding, SetDatabaseEncoding), conversion entry points (pg_do_encoding_conversion, pg_do_encoding_conversion_buf, pg_any_to_server / pg_client_to_server, pg_server_to_any / pg_server_to_client, perform_default_encoding_conversion, pg_unicode_to_server), multibyte primitives (pg_mblen, pg_mbstrlen, pg_mbcliplen, pg_verify_mbstr), and the error reporters (report_invalid_encoding, report_untranslatable_char, check_encoding_conversion_args).
  • src/backend/utils/mb/conv.c — the pivot converters UtfToLocal / LocalToUtf, the radix-tree primitives pg_mb_radix_conv / store_coded_char, the bsearch comparators compare3 / compare4, and the single-byte / MULE_INTERNAL helpers local2local, latin2mic, mic2latin.
  • src/common/wchar.c — the pg_wchar_table vtable, UTF-8 primitives (pg_utf_mblen, pg_utf2wchar_with_len, pg_utf8_islegal, pg_utf8_verifychar, pg_utf8_verifystr with the Utf8Transition DFA), and the encoding-as-argument variants (pg_encoding_mblen, pg_encoding_mblen_or_incomplete, pg_encoding_verifymbstr, pg_encoding_max_length).
  • src/common/encnames.c — name↔ID resolution (pg_char_to_encoding, pg_encoding_to_char, pg_valid_server_encoding) and the shared tables pg_enc2name_tbl / pg_encname_tbl.
  • src/include/mb/pg_wchar.h — the pg_enc ID space, PG_VALID_BE_ENCODING / PG_VALID_FE_ENCODING, MAX_CONVERSION_GROWTH, and the inline utf8_to_unicode / unicode_to_utf8 code-point codecs.
  • src/backend/catalog/namespace.c: FindDefaultConversionProc, the single bridge from utils/mb into the pg_conversion system catalog.

Textbook chapters (under knowledge/research/dbms-general/)

  • Database System Concepts (Silberschatz et al.) — SQL character types and the locale/encoding sensitivity of string comparison and ordering.
  • Database Internals (Petrov) — the storage engine as an opaque-byte layer over which meaning (here, encoding) is imposed.
  • RFC 3629 — UTF-8 byte-range restrictions (overlong-form and surrogate rejection) implemented by pg_utf8_islegal.
  • Keiser & Lemire, “Validating UTF-8 In Less Than One Instruction Per Byte” (2021) — the SIMD-validation frontier pg_utf8_verifystr approximates.
  • postgres-collation-providers.md — string ordering (the orthogonal collation question); owns ICU/libc provider mechanics deferred here.
  • postgres-datatypes-adt.md — the text / varchar / name types whose bytes are interpreted by this layer; owns length()/substr() semantics.
  • postgres-wire-protocol.md — the startup client_encoding parameter and the Query/DataRow byte streams that hit pg_client_to_server / pg_server_to_client.
  • postgres-copy.md — bulk ingress, the other major caller of pg_any_to_server.
  • postgres-parser.md — the lexer that runs pg_client_to_server on the whole query string before tokenizing, and pg_unicode_to_server for Unicode escapes.