PostgreSQL Collation Providers — libc, ICU, and the Builtin Provider
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A relational database must impose a total order on the values of every
sortable type, because ORDER BY, B-tree indexes, MIN/MAX, merge joins,
and GROUP BY all rest on a comparator that, given two values, returns
less / equal / greater. For numbers and dates the order is intrinsic. For
character strings it is not: “is Z < a?” has no universal answer.
ASCII says yes (uppercase sorts before lowercase); a German phone-book order
says treat ä like ae; a Swedish order puts ä after z. The function
that decides is a collation, and it is a cultural artifact, not a
mathematical one.
Database System Concepts (Silberschatz et al.) introduces this under the
SQL COLLATE clause: a collation is “a set of rules that determines how
strings are compared and sorted,” and the standard lets a collation be
attached to a column, an expression, or a query-level override. The textbook
stresses two things the beginner overlooks. First, collation is separate
from character encoding — encoding (UTF-8, LATIN1) decides which byte
sequences are legal characters; collation decides how those characters
sort. Second, collation interacts with equality, not only ordering: a
case-insensitive collation must report 'A' = 'a' as true, which ripples
into unique indexes, DISTINCT, and hash joins.
Three design knobs define the space a collation implementation chooses within:
- Where do the rules come from? A database can lean on the host operating system’s C library (strcoll(3)), bind a dedicated Unicode library (ICU/CLDR), or compile its own ordering tables into the binary. Each choice trades coverage (how many locales) against stability (does the order change when the host upgrades).
- Is the comparison deterministic? The Unicode Collation Algorithm (UCA, Unicode Technical Standard #10) defines a multi-level comparison: the primary level ignores case and accents, the secondary adds accents, the tertiary adds case. A collation can declare that two strings comparing equal at the chosen level are equal (nondeterministic: 'café' = 'café' even though the byte strings differ), or that ties are broken by a final byte comparison so only byte-identical strings are equal (deterministic). The choice changes the meaning of =, unique constraints, and whether abbreviated keys and deduplication are even sound.
- How is order versioned? Collation tables are data, and they change. CLDR ships a new release; glibc 2.28 famously reordered many locales. When the rules underneath a B-tree change, the index is silently corrupt: entries are no longer in the order the tree assumes. A mature engine records the collation version that built each index and warns when the live library disagrees.
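The multi-level idea behind knob 2 can be sketched with a toy comparator. This is a minimal illustration for ASCII strings only, not the UCA: the names `toy_strength` and `toy_level_compare` are hypothetical, and "lowercase sorts first at the tertiary level" is an assumption chosen for the sketch.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical strength levels, loosely modeled on UCA primary/tertiary. */
enum toy_strength { TOY_PRIMARY, TOY_TERTIARY };

/*
 * Toy comparator for ASCII strings. At TOY_PRIMARY, case is ignored.
 * At TOY_TERTIARY, case is consulted only to break primary ties.
 */
int
toy_level_compare(const char *a, const char *b, enum toy_strength strength)
{
    size_t i;

    /* Primary pass: compare characters with case folded away. */
    for (i = 0; a[i] != '\0' && b[i] != '\0'; i++)
    {
        int ca = tolower((unsigned char) a[i]);
        int cb = tolower((unsigned char) b[i]);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
    }
    if (a[i] != '\0' || b[i] != '\0')
        return (a[i] == '\0') ? -1 : 1;     /* the shorter string sorts first */

    if (strength == TOY_PRIMARY)
        return 0;       /* primary-equal strings are equal at this level */

    /* Tertiary pass: the first case difference decides (lowercase first). */
    for (i = 0; a[i] != '\0'; i++)
    {
        if (a[i] != b[i])
            return islower((unsigned char) a[i]) ? -1 : 1;
    }
    return 0;
}
```

At `TOY_PRIMARY`, "abc" and "ABC" compare equal, which is the situation a nondeterministic collation accepts as real equality; at `TOY_TERTIARY` the same pair is ordered.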
PostgreSQL’s answer to knob 1 is a provider abstraction: every collation
carries a one-character collprovider code, and a thin dispatch layer routes
to one of three back ends. Knob 2 is a per-collation boolean,
collisdeterministic, which only ICU collations may set to false. Knob 3 is
the version string stored in pg_collation.collversion (or
pg_database.datcollversion for the database default), checked on first use.
Common DBMS Design
Every engine that supports culturally-correct string ordering converges on the same handful of engineering patterns, regardless of which library it delegates to. Naming them here lets PostgreSQL’s specific symbols read as choices within a shared space.
A provider/strategy indirection over the comparator
No serious engine hard-codes one collation library. The comparator is reached
through an indirection — a function-pointer table, a virtual method, or a
tagged dispatch on a provider code — so the same call site (ORDER BY,
index build) works whether the rules come from libc, ICU, or a built-in
table. The indirection is resolved once per (collation, session) and
cached, because resolving a locale handle (newlocale(3), ucol_open()) is
expensive and the same collation is hit millions of times per query.
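The indirection described above can be sketched as a method table selected once by a provider code. All names here (`toy_locale`, `toy_methods`, the `'b'` and `'n'` codes) are hypothetical and are not PostgreSQL's definitions; the point is only that the call site never branches on the provider.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical provider codes for the sketch. */
enum { TOY_PROVIDER_BYTES = 'b', TOY_PROVIDER_NOCASE = 'n' };

typedef struct toy_methods
{
    int (*compare) (const char *a, const char *b);
} toy_methods;

typedef struct toy_locale
{
    char provider;                  /* discriminant, as stored in a catalog */
    const toy_methods *methods;     /* resolved once, reused on every call */
} toy_locale;

static int
toy_compare_bytes(const char *a, const char *b)
{
    return strcmp(a, b);
}

static int
toy_compare_nocase(const char *a, const char *b)
{
    for (;; a++, b++)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        if (ca == 0)
            return 0;
    }
}

static const toy_methods toy_methods_bytes = { toy_compare_bytes };
static const toy_methods toy_methods_nocase = { toy_compare_nocase };

/* Resolve the provider code to a method table; -1 for an unknown code. */
int
toy_locale_init(toy_locale *loc, char provider)
{
    switch (provider)
    {
        case TOY_PROVIDER_BYTES:
            loc->methods = &toy_methods_bytes;
            break;
        case TOY_PROVIDER_NOCASE:
            loc->methods = &toy_methods_nocase;
            break;
        default:
            return -1;
    }
    loc->provider = provider;
    return 0;
}

/* The call site: one indirect call, no provider switch. */
int
toy_compare(const toy_locale *loc, const char *a, const char *b)
{
    return loc->methods->compare(a, b);
}
```

With a byte-order locale, "Z" sorts before "a"; with the case-folding locale the same call site puts "a" first.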
Locale handle distinct from the resolved comparator
The locale name ("en_US.utf8", "und-x-icu") is catalog data; the
locale handle (locale_t, UCollator *) is a live OS/library object that
cannot be stored in a catalog and must be reconstructed per backend. Engines
keep the two strictly separate: the catalog row names the collation, and a
per-process cache materializes the handle on demand.
Transform keys for sort-once, compare-many
strcoll-style comparison is costly: it re-parses both strings on every
call. The standard optimization is a transform (strxfrm(3),
ucol_getSortKey()) that converts a string once into a byte blob whose plain
memcmp reproduces the collation order. Sorts and abbreviated-key
optimizations use the transform; the catch is that not every library’s
transform is trustworthy, so the engine guards it behind a capability flag.
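A minimal sketch of the transform idea, assuming a toy case-insensitive ASCII order in place of a real locale. `fold_compare` and `fold_xfrm` are hypothetical names; `fold_xfrm` borrows the strxfrm(3) convention of returning the length the full key needs.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Direct comparison: folds both strings again on every call. */
int
fold_compare(const char *a, const char *b)
{
    for (;; a++, b++)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        if (ca == 0)
            return 0;
    }
}

/*
 * Transform: fold once into dest so that a plain byte comparison of two
 * keys reproduces fold_compare. Writes at most destsize bytes including
 * the terminator and returns the length the untruncated key requires.
 */
size_t
fold_xfrm(char *dest, size_t destsize, const char *src)
{
    size_t len = strlen(src);
    size_t i;

    for (i = 0; i < len && i + 1 < destsize; i++)
        dest[i] = (char) tolower((unsigned char) src[i]);
    if (destsize > 0)
        dest[i] = '\0';
    return len;
}
```

A sort would call `fold_xfrm` once per string and then compare keys with `strcmp` or `memcmp`, paying the folding cost n times instead of once per comparison.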
Deterministic tie-break appended to the cultural order
Cultural orders are not total: many distinct byte strings tie at every UCA
level. For a B-tree key or a unique index, ties are unacceptable. The
near-universal fix: after the library returns “equal,” append a raw
memcmp (then a length comparison) to force a total order. A collation that
skips this tie-break is “nondeterministic” and must be excluded from
optimizations that assume byte-equal-iff-equal (deduplication, abbreviated
keys, hash equality by image).
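The tie-break pattern can be sketched as follows, again with a toy case-insensitive order standing in for the library comparison. `cultural_compare` and `compare_with_tiebreak` are hypothetical names used only for this illustration.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the library comparison: case-insensitive ASCII order. */
static int
cultural_compare(const char *a, const char *b)
{
    for (;; a++, b++)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        if (ca == 0)
            return 0;
    }
}

/*
 * Deterministic mode appends a byte comparison and then a length
 * comparison, so the result is 0 only for byte-identical strings.
 * Nondeterministic mode takes the cultural verdict as final.
 */
int
compare_with_tiebreak(const char *a, const char *b, bool deterministic)
{
    int result = cultural_compare(a, b);

    if (result == 0 && deterministic)
    {
        size_t la = strlen(a);
        size_t lb = strlen(b);

        result = memcmp(a, b, la < lb ? la : lb);
        if (result == 0 && la != lb)
            result = (la < lb) ? -1 : 1;
    }
    return result;
}
```

In deterministic mode "abc" and "ABC" are ordered by their bytes; in nondeterministic mode they are equal, which is why byte-image optimizations must be switched off there.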
Version stamping for index validity
Because collation tables drift, engines stamp each collation-dependent on-disk structure with the library version that built it and surface a mismatch as a warning plus an explicit “rebuild + refresh version” workflow. Without this, an OS upgrade turns every text index subtly wrong with no diagnostic.
Theory ↔ PostgreSQL mapping
| Theory / pattern | PostgreSQL name |
|---|---|
| Provider/strategy indirection | collprovider code + create_pg_locale() dispatch |
| Provider codes | COLLPROVIDER_LIBC 'c', _ICU 'i', _BUILTIN 'b', _DEFAULT 'd' |
| Resolved comparator handle | pg_locale_t (struct pg_locale_struct) |
| Comparator method table | struct collate_methods (strncoll, strnxfrm, …) |
| Locale name (catalog data) | collcollate/collctype (libc), colllocale (ICU/builtin) |
| Per-session handle cache | CollationCache + last_collation_cache_* |
| Transform key | pg_strnxfrm / strxfrm_is_safe flag |
| Deterministic tie-break | mylocale->deterministic branch in varstr_cmp |
| Version stamping | collversion column + get_collation_actual_version() |
| Fixed-at-create database locale | datcollate/datctype/datlocale in pg_database |
PostgreSQL’s Approach
PostgreSQL routes all locale-sensitive string work through a single opaque
handle, pg_locale_t, obtained from a collation OID. The handle’s first
field is the provider code; every public entry point (pg_strcoll,
pg_strnxfrm, pg_strlower, …) is a thin dispatcher that branches on that
code and forwards to a provider-specific implementation living in
pg_locale_libc.c, pg_locale_icu.c, or pg_locale_builtin.c.
The resolved handle: pg_locale_t
```c
// struct pg_locale_struct (src/include/utils/pg_locale.h)
struct pg_locale_struct
{
    char        provider;
    bool        deterministic;
    bool        collate_is_c;
    bool        ctype_is_c;
    bool        is_default;

    const struct collate_methods *collate;  /* NULL if collate_is_c */

    union
    {
        struct
        {
            const char *locale;
            bool        casemap_full;
        }           builtin;
        locale_t    lt;         /* libc */
#ifdef USE_ICU
        struct
        {
            const char *locale;
            UCollator  *ucol;
        }           icu;
#endif
    }           info;
};
```

The struct is a discriminated union. provider is the discriminant; the
info union holds exactly one of: the builtin locale string, a libc
locale_t, or an ICU UCollator *. Two booleans, collate_is_c and
ctype_is_c, are hoisted out of the union because the engine optimizes the
C/POSIX path aggressively (plain memcmp, ASCII case folding) and wants to
test for it without a provider-specific call. deterministic is likewise
hoisted because it gates correctness logic in dozens of call sites.
The comparison methods are themselves a pointer table, so even within one provider PostgreSQL can swap implementations (e.g., a Windows UTF-8 variant):
```c
// struct collate_methods (src/include/utils/pg_locale.h)
struct collate_methods
{
    int         (*strncoll) (const char *arg1, ssize_t len1,
                             const char *arg2, ssize_t len2,
                             pg_locale_t locale);           /* required */
    size_t      (*strnxfrm) (char *dest, size_t destsize,
                             const char *src, ssize_t srclen,
                             pg_locale_t locale);           /* required */
    size_t      (*strnxfrm_prefix) (char *dest, size_t destsize,
                                    const char *src, ssize_t srclen,
                                    pg_locale_t locale);    /* optional */
    bool        strxfrm_is_safe;
};
```

Provider dispatch
A collation OID becomes a handle in create_pg_locale(), which reads the
pg_collation row, branches on collprovider, and calls the matching
constructor:
```c
// create_pg_locale (src/backend/utils/adt/pg_locale.c)
collform = (Form_pg_collation) GETSTRUCT(tp);

if (collform->collprovider == COLLPROVIDER_BUILTIN)
    result = create_pg_locale_builtin(collid, context);
else if (collform->collprovider == COLLPROVIDER_ICU)
    result = create_pg_locale_icu(collid, context);
else if (collform->collprovider == COLLPROVIDER_LIBC)
    result = create_pg_locale_libc(collid, context);
else
    PGLOCALE_SUPPORT_ERROR(collform->collprovider);     /* shouldn't happen */
```

The provider codes are single characters, stored as a char column so the
catalog stays compact:
```c
// pg_collation.h (provider codes)
#define COLLPROVIDER_DEFAULT    'd'
#define COLLPROVIDER_BUILTIN    'b'
#define COLLPROVIDER_ICU        'i'
#define COLLPROVIDER_LIBC       'c'
```

COLLPROVIDER_DEFAULT ('d') never appears on a real collation row that
gets a handle; it is the sentinel for “use the database default,” and the
dispatcher reaches it through DEFAULT_COLLATION_OID instead (see the cache
section). The three concrete providers are libc, ICU, and builtin.
Every public string operation follows the same dispatch shape. pg_strlower
is representative:
```c
// pg_strlower (src/backend/utils/adt/pg_locale.c)
size_t
pg_strlower(char *dst, size_t dstsize, const char *src, ssize_t srclen,
            pg_locale_t locale)
{
    if (locale->provider == COLLPROVIDER_BUILTIN)
        return strlower_builtin(dst, dstsize, src, srclen, locale);
#ifdef USE_ICU
    else if (locale->provider == COLLPROVIDER_ICU)
        return strlower_icu(dst, dstsize, src, srclen, locale);
#endif
    else if (locale->provider == COLLPROVIDER_LIBC)
        return strlower_libc(dst, dstsize, src, srclen, locale);
    else
        PGLOCALE_SUPPORT_ERROR(locale->provider);

    return 0;                   /* keep compiler quiet */
}
```

Note #ifdef USE_ICU: ICU is a build-time option. A server compiled without
ICU still understands the 'i' code in the catalog (so pg_dump/restore of
ICU collations does not crash) but errors cleanly if asked to actually use
one. The comparison and transform entry points are even thinner — they bounce
straight through the method table without a provider switch:
```c
// pg_strncoll / pg_strnxfrm (src/backend/utils/adt/pg_locale.c)
int
pg_strncoll(const char *arg1, ssize_t len1, const char *arg2, ssize_t len2,
            pg_locale_t locale)
{
    return locale->collate->strncoll(arg1, len1, arg2, len2, locale);
}

size_t
pg_strnxfrm(char *dest, size_t destsize, const char *src, ssize_t srclen,
            pg_locale_t locale)
{
    return locale->collate->strnxfrm(dest, destsize, src, srclen, locale);
}
```

That is the whole abstraction: provider selects the constructor; the
constructor installs a collate_methods table; comparison and transform go
through the table; case mapping goes through a per-function provider switch.
flowchart TD
A["caller: ORDER BY / index build / texteq<br/>has collation OID"] --> B["pg_newlocale_from_collation(collid)"]
B --> C{"collid?"}
C -->|"DEFAULT_COLLATION_OID"| D["return default_locale<br/>(set at init_database_collation)"]
C -->|"C_COLLATION_OID"| E["return &c_locale<br/>(static, provider=LIBC, collate_is_c)"]
C -->|"other"| F{"in CollationCache?"}
F -->|"yes"| G["return cached pg_locale_t"]
F -->|"no"| H["create_pg_locale(collid)"]
H --> I{"collprovider"}
I -->|"'b' BUILTIN"| J["create_pg_locale_builtin"]
I -->|"'i' ICU"| K["create_pg_locale_icu"]
I -->|"'c' LIBC"| L["create_pg_locale_libc"]
J --> M["install collate_methods + info.builtin"]
K --> M2["install collate_methods + info.icu.ucol"]
L --> M3["install collate_methods + info.lt"]
M --> N["cache + version check"]
M2 --> N
M3 --> N
N --> G
Figure 1 — Resolving a collation OID to a pg_locale_t. The two fast paths
(DEFAULT_COLLATION_OID, C_COLLATION_OID) short-circuit before any catalog
access; everything else goes through the per-backend cache and, on a miss,
the provider-dispatched constructor. The version check (Figure 3) runs inside
create_pg_locale after the constructor returns.
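The resolution flow in Figure 1 can be sketched with a toy cache. Everything here is hypothetical: the OID values, the fixed-size array (PostgreSQL uses a hash table), and the `toy_handle` type. The sketch only shows the order of checks: two static fast paths, a one-entry "last used" shortcut, the cache, and finally construction on a miss.

```c
#include <stddef.h>

typedef unsigned int toy_oid;

/* Hypothetical OIDs and cache size, chosen for the sketch only. */
#define TOY_DEFAULT_OID 100u
#define TOY_C_OID       950u
#define TOY_CACHE_SLOTS 8

typedef struct toy_handle
{
    toy_oid collid;
    int     build_no;       /* which construction produced this handle */
} toy_handle;

static toy_handle toy_default_handle = { TOY_DEFAULT_OID, 0 };
static toy_handle toy_c_handle = { TOY_C_OID, 0 };
static toy_handle toy_cache[TOY_CACHE_SLOTS];
static int toy_cache_used = 0;
static toy_handle *toy_last = NULL;     /* one-entry fast path */
static int toy_builds = 0;              /* counts expensive constructions */

int
toy_build_count(void)
{
    return toy_builds;
}

toy_handle *
toy_resolve(toy_oid collid)
{
    int i;

    /* Fast paths: no lookup, no construction. */
    if (collid == TOY_DEFAULT_OID)
        return &toy_default_handle;
    if (collid == TOY_C_OID)
        return &toy_c_handle;
    if (toy_last != NULL && toy_last->collid == collid)
        return toy_last;

    /* Cache lookup (linear scan stands in for a hash probe). */
    for (i = 0; i < toy_cache_used; i++)
    {
        if (toy_cache[i].collid == collid)
            return toy_last = &toy_cache[i];
    }

    /* Miss: construct and remember. The sketch has no eviction. */
    if (toy_cache_used == TOY_CACHE_SLOTS)
        return NULL;
    toy_cache[toy_cache_used].collid = collid;
    toy_cache[toy_cache_used].build_no = ++toy_builds;
    toy_last = &toy_cache[toy_cache_used++];
    return toy_last;
}
```

Resolving the same OID twice returns the same pointer and leaves `toy_build_count()` unchanged, which is the property the real cache exists to provide.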
The libc provider: lc_collate / lc_ctype fixed at CREATE DATABASE
The libc provider is the historical default and the one tied to the
LC_COLLATE / LC_CTYPE settings. The file-header comment in pg_locale.c
states the central constraint:
```c
// pg_locale.c (file header comment)
/*----------
 * Here is how the locale stuff is handled: LC_COLLATE and LC_CTYPE
 * are fixed at CREATE DATABASE time, stored in pg_database, and cannot
 * be changed. Thus, the effects of strcoll(), strxfrm(), isupper(),
 * toupper(), etc. are always in the same fixed locale.
 * ...
 *----------
 */
```

LC_COLLATE and LC_CTYPE are not GUCs you can SET per session; they
are baked into pg_database.datcollate / datctype when the database is
created, because a mutable collation would make every existing B-tree index
ambiguous. (By contrast LC_MESSAGES, LC_MONETARY, LC_NUMERIC,
LC_TIME are runtime GUCs, handled separately in this same file by the
check_locale_* / assign_locale_* hooks and the PGLC_localeconv /
cache_locale_time caches — but those only affect formatting, never sort
order.)
The libc constructor pulls the two locale strings from the catalog and builds
an OS locale_t:
```c
// create_pg_locale_libc (src/backend/utils/adt/pg_locale_libc.c)
loc = make_libc_collator(collate, ctype);

result = MemoryContextAllocZero(context, sizeof(struct pg_locale_struct));
result->provider = COLLPROVIDER_LIBC;
result->deterministic = true;   /* libc is always deterministic */
result->collate_is_c = (strcmp(collate, "C") == 0) ||
    (strcmp(collate, "POSIX") == 0);
result->ctype_is_c = (strcmp(ctype, "C") == 0) ||
    (strcmp(ctype, "POSIX") == 0);
result->info.lt = loc;
if (!result->collate_is_c)
{
#ifdef WIN32
    if (GetDatabaseEncoding() == PG_UTF8)
        result->collate = &collate_methods_libc_win32_utf8;
    else
#endif
        result->collate = &collate_methods_libc;
}
```

Two things stand out. First, deterministic is hard-wired true: libc has
no nondeterministic mode. Second, for the C/POSIX locale make_libc_collator
returns NULL and result->collate stays NULL: the engine never calls
strcoll_l for C, it uses raw memcmp, which is both faster and immune to
OS collation drift. Under the hood the libc methods call strcoll_l /
strxfrm_l, the forms of strcoll / strxfrm that take an explicit locale_t
instead of reading the process-global locale.
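For readers unfamiliar with the locale_t API, here is a minimal standalone sketch using the POSIX.1-2008 functions newlocale, strcoll_l, and freelocale. The helper name `compare_in_locale` is hypothetical, and the includes assume a glibc-style system (some platforms, macOS among them, declare these functions in `<xlocale.h>` instead).

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <string.h>

/*
 * Compare two strings under a named locale without touching the
 * process-global locale. Sets *ok to 1 and returns -1, 0, or 1 on
 * success; sets *ok to 0 and returns 0 if the locale cannot be loaded.
 */
int
compare_in_locale(const char *locname, const char *a, const char *b, int *ok)
{
    locale_t loc = newlocale(LC_COLLATE_MASK, locname, (locale_t) 0);
    int r;

    if (loc == (locale_t) 0)
    {
        *ok = 0;
        return 0;
    }
    r = strcoll_l(a, b, loc);
    freelocale(loc);
    *ok = 1;
    return (r > 0) - (r < 0);
}
```

With the "C" locale this reduces to byte order, so "Z" sorts before "a". Results for other locale names depend on which locales the host has installed, which is exactly the drift risk the surrounding text describes.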
The ICU provider: BCP-47 language tags and nondeterministic collations
ICU is the modern, OS-independent provider built on Unicode CLDR data. Its
locale names are BCP-47 language tags (en-US, und for the root),
canonicalized before they reach ucol_open():
```c
// icu_language_tag (src/backend/utils/adt/pg_locale.c, condensed)
char *
icu_language_tag(const char *loc_str, int elevel)
{
#ifdef USE_ICU
    /* ... grow buffer in a loop ... */
    uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
    /* ... on U_FAILURE, ereport(elevel) or return NULL ... */
    return langtag;
#else
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("ICU is not supported in this build")));
    return NULL;
#endif
}
```

Canonicalization (uloc_toLanguageTag) does “level-2 canonicalization”:
it accepts POSIX-ish or .NET-ish locale spellings and produces one
consistent language tag, so CREATE COLLATION ... (locale = 'en_US') and
'en-US-x-icu' resolve to the same ICU collator. icu_validate_locale()
then best-effort checks the language exists, and finally ucol_open() builds
the live UCollator.
ICU is the only provider that supports nondeterministic collations,
because it is the only one that can compare at a UCA level where distinct byte
strings legitimately tie. DefineCollation enforces this:
```c
// DefineCollation (src/backend/commands/collationcmds.c)
/*
 * Nondeterministic collations are currently only supported with ICU
 * because that's the only case where it can actually make a
 * difference. ...
 */
if (!collisdeterministic && collprovider != COLLPROVIDER_ICU)
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("nondeterministic collations not supported with this provider")));
```

The builtin provider: C, C.UTF-8, and PG_UNICODE_FAST
The builtin provider (added so PostgreSQL has a fast, version-stable
Unicode option that does not depend on the host libc or a linked ICU) accepts
exactly three locale names, validated and canonicalized in pg_locale.c:
```c
// builtin_validate_locale (src/backend/utils/adt/pg_locale.c)
if (strcmp(locale, "C") == 0)
    canonical_name = "C";
else if (strcmp(locale, "C.UTF-8") == 0 || strcmp(locale, "C.UTF8") == 0)
    canonical_name = "C.UTF-8";
else if (strcmp(locale, "PG_UNICODE_FAST") == 0)
    canonical_name = "PG_UNICODE_FAST";

if (!canonical_name)
    ereport(ERROR,
            (errcode(ERRCODE_WRONG_OBJECT_TYPE),
             errmsg("invalid locale name \"%s\" for builtin provider",
                    locale)));
```

Each builtin locale carries an encoding constraint, returned by
builtin_locale_encoding: "C" is encoding-agnostic (-1), while
"C.UTF-8" and "PG_UNICODE_FAST" require UTF-8. The ordering is pure code
point order (memcmp on UTF-8 bytes gives code-point order), so it is fast
and never changes across PostgreSQL releases: there is no external library
to drift. PG_UNICODE_FAST differs from C.UTF-8 in its ctype behavior:
both keep code-point sort order, but PG_UNICODE_FAST applies full Unicode
case mapping (where one character may map to several), while C.UTF-8 uses
Unicode simple case mapping. Only the builtin "C" locale restricts case
mapping to ASCII.
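The claim that memcmp on UTF-8 bytes yields code-point order can be checked with a small sketch. `utf8_encode` and `utf8_order` are hypothetical helpers written for this illustration; they are not PostgreSQL functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode one code point as UTF-8. Returns bytes written (1-4), 0 if invalid. */
size_t
utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* surrogates are not encodable */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Order two valid code points by comparing their UTF-8 bytes only. */
int
utf8_order(uint32_t a, uint32_t b)
{
    unsigned char ea[4], eb[4];
    size_t la = utf8_encode(a, ea);
    size_t lb = utf8_encode(b, eb);
    int r = memcmp(ea, eb, la < lb ? la : lb);

    if (r == 0 && la != lb)
        r = (la < lb) ? -1 : 1;
    return (r > 0) - (r < 0);
}
```

The lead byte of a longer encoding is always numerically larger than that of a shorter one, so the byte comparison agrees with the numeric comparison of the code points, including across the 1-, 2-, 3-, and 4-byte boundaries.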
Deterministic vs. nondeterministic equality
Determinism is where collation stops being “just sorting” and starts changing
the meaning of =. The comparison core, varstr_cmp, calls the provider and
then breaks ties only when the collation is deterministic:
```c
// varstr_cmp tie-break (src/backend/utils/adt/varlena.c, condensed)
result = pg_strncoll(arg1, len1, arg2, len2, mylocale);

/* Break tie if necessary. */
if (result == 0 && mylocale->deterministic)
{
    result = memcmp(arg1, arg2, Min(len1, len2));
    if ((result == 0) && (len1 != len2))
        result = (len1 < len2) ? -1 : 1;
}
```

For a deterministic collation, two strings are equal iff they are
byte-identical: the cultural comparison decides ordering, but any residual
tie is resolved by raw bytes, so = reduces to memcmp. texteq exploits
this — for a deterministic locale it skips strcoll entirely and does a
bitwise compare:
```c
// texteq (src/backend/utils/adt/varlena.c, condensed)
mylocale = pg_newlocale_from_collation(collid);

if (mylocale->deterministic)
{
    /* ... equality is pure length check + memcmp, no strcoll ... */
    if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
        /* equal */ ;
}
else
{
    /* must run the collation: 'café' may equal 'café' */
}
```

For a nondeterministic collation the tie-break is skipped: two strings
that compare equal at the chosen UCA level are equal even if their bytes
differ. That makes case-insensitive or accent-insensitive unique indexes
possible, but it disables optimizations that assume byte-equal-iff-equal —
PostgreSQL refuses deduplication, abbreviated keys are unsafe, and some
substring operations error out ("nondeterministic collations are not supported for substring searches").
flowchart TD
A["text equality / comparison<br/>varstr_cmp"] --> B{"locale->deterministic?"}
B -->|"true (libc, builtin,<br/>most ICU)"| C["run provider compare"]
C --> D{"compares equal?"}
D -->|"no"| E["return cultural order"]
D -->|"yes (tie)"| F["memcmp tie-break<br/>then length"]
F --> G["total order:<br/>equal iff byte-identical"]
B -->|"false (ICU only)"| H["run provider compare<br/>at UCA level"]
H --> I["equal means equal —<br/>no byte tie-break"]
I --> J["disables: dedup,<br/>abbreviated keys,<br/>substring search"]
Figure 2 — The determinism branch. Deterministic collations (the default,
all that libc and builtin support) append a memcmp tie-break so equality
collapses to byte-identity, which keeps B-tree deduplication and abbreviated
keys sound. Nondeterministic collations (ICU only) take the cultural verdict
as final, enabling accent/case-insensitive uniqueness at the cost of those
optimizations.
Transform keys and the safety flag
Sorting calls the comparator O(n log n) times, each re-parsing both strings.
The classic optimization is to transform each string once into a byte blob
whose memcmp reproduces the collation order, then sort the blobs. PostgreSQL
exposes this as pg_strnxfrm, but not every provider’s transform is
trustworthy on every platform, so the engine gates it behind a capability
flag rather than assuming it works:
```c
// pg_strxfrm_enabled (src/backend/utils/adt/pg_locale.c)
bool
pg_strxfrm_enabled(pg_locale_t locale)
{
    /*
     * locale->collate->strnxfrm is still a required method, even if it may
     * have the wrong behavior, because the planner uses it for estimates in
     * some cases.
     */
    return locale->collate->strxfrm_is_safe;
}
```

The subtlety in the comment is load-bearing: strnxfrm is always present
(the planner calls it for selectivity estimates, where a wrong answer only
hurts plan quality, not correctness), but it is only used for actual
sorting when strxfrm_is_safe is true. Historically some libc
implementations, certain glibc versions among them, produced strxfrm output
whose byte order disagreed with strcoll, so the flag lets PostgreSQL fall
back to direct comparison on those platforms while still offering the
transform where it is sound. The abbreviated-keys optimization in tuplesort
(see postgres-tuplesort.md) is built on this transform.
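The abbreviated-key idea can be sketched independently of any locale library. This is an illustration under a simplifying assumption: the strings are compared in byte order, so each string serves as its own transform key (in a real sort the key would be transform output). `abbrev_key` and `abbrev_compare` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pack the first 8 bytes of a key into an integer, big-endian, zero-padded. */
uint64_t
abbrev_key(const char *key)
{
    uint64_t v = 0;
    size_t len = strlen(key);
    size_t i;

    for (i = 0; i < 8; i++)
    {
        v <<= 8;
        if (i < len)
            v |= (unsigned char) key[i];
    }
    return v;
}

/*
 * Compare by abbreviated key first. Only when the abbreviations tie does
 * the full comparison run; *used_fallback reports which path was taken.
 */
int
abbrev_compare(const char *a, const char *b, int *used_fallback)
{
    uint64_t ka = abbrev_key(a);
    uint64_t kb = abbrev_key(b);
    int r;

    *used_fallback = 0;
    if (ka != kb)
        return (ka < kb) ? -1 : 1;

    *used_fallback = 1;
    r = strcmp(a, b);
    return (r > 0) - (r < 0);
}
```

The shortcut is only valid when a difference in the key prefix implies the same difference in the full order, which is why an untrustworthy transform or a nondeterministic collation rules it out.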
Collation version tracking
A collation’s rules are external data that drifts: a new ICU/CLDR release,
or the notorious glibc 2.28 reordering, can silently change the order of an
existing B-tree index. PostgreSQL defends against this by recording the
provider’s reported version string in pg_collation.collversion (and
pg_database.datcollversion for the database default) at create time, then
re-checking it every time the collation is first resolved in a backend.
get_collation_actual_version asks the live provider what version it reports
now:
```c
// get_collation_actual_version (src/backend/utils/adt/pg_locale.c)
char *
get_collation_actual_version(char collprovider, const char *collcollate)
{
    char       *collversion = NULL;

    if (collprovider == COLLPROVIDER_BUILTIN)
        collversion = get_collation_actual_version_builtin(collcollate);
#ifdef USE_ICU
    else if (collprovider == COLLPROVIDER_ICU)
        collversion = get_collation_actual_version_icu(collcollate);
#endif
    else if (collprovider == COLLPROVIDER_LIBC)
        collversion = get_collation_actual_version_libc(collcollate);

    return collversion;
}
```

create_pg_locale then compares the recorded version (read from the
catalog row) against the actual version and warns on mismatch, with an
errhint naming the exact remediation:
```c
// create_pg_locale (src/backend/utils/adt/pg_locale.c, condensed)
collversionstr = TextDatumGetCString(datum);    /* recorded */

actual_versionstr = get_collation_actual_version(collform->collprovider, ...);

if (strcmp(actual_versionstr, collversionstr) != 0)
    ereport(WARNING,
            (errmsg("collation \"%s\" has version mismatch", ...),
             errdetail("The collation in the database was created using version %s, "
                       "but the operating system provides version %s.",
                       collversionstr, actual_versionstr),
             errhint("Rebuild all objects affected by this collation and run "
                     "ALTER COLLATION %s REFRESH VERSION, ...")));
```

The mismatch is a warning, not an error: PostgreSQL cannot know whether
the specific strings in your indexes are affected by the reordering, so it
flags the risk and lets the DBA decide. The fix is two steps: REINDEX every
affected index (rebuilding it under the new order), then ALTER COLLATION ... REFRESH VERSION to stamp the new version so the warning stops.
AlterCollation implements the refresh by re-reading the actual version and
writing it back into collversion:
```c
// AlterCollation (src/backend/commands/collationcmds.c, condensed)
newversion = get_collation_actual_version(collForm->collprovider,
                                          TextDatumGetCString(datum));

if ((!oldversion && newversion) || (oldversion && !newversion))
    elog(ERROR, "invalid collation version change");
else if (oldversion && newversion && strcmp(newversion, oldversion) != 0)
{
    ereport(NOTICE,
            (errmsg("changing version from %s to %s",
                    oldversion, newversion)));
    /* ... write newversion into Anum_pg_collation_collversion ... */
}
else
    ereport(NOTICE,
            (errmsg("version has not changed")));
```

The default collation gets its version from pg_database, not
pg_collation, so ALTER COLLATION on it is refused with a hint to use
ALTER DATABASE ... REFRESH COLLATION VERSION instead. The SQL function
pg_collation_actual_version(oid) exposes the live version to users for
inspection, reading from pg_database for DEFAULT_COLLATION_OID and from
pg_collation otherwise.
flowchart TD
A["CREATE COLLATION / CREATE DATABASE"] --> B["get_collation_actual_version()"]
B --> C["store version string in<br/>pg_collation.collversion /<br/>pg_database.datcollversion"]
C --> D["index built under these rules"]
D --> E["OS upgrade: glibc 2.28 /<br/>new ICU CLDR release"]
E --> F["first use: create_pg_locale()<br/>recorded vs actual"]
F --> G{"strcmp differs?"}
G -->|"no"| H["silent — index trusted"]
G -->|"yes"| I["WARNING: version mismatch<br/>errhint: REINDEX + REFRESH"]
I --> J["DBA: REINDEX affected indexes"]
J --> K["ALTER COLLATION ... REFRESH VERSION<br/>writes actual into collversion"]
K --> H
Figure 3 — The collation version lifecycle. The version stamped at create time is the contract; a library upgrade can break it. The mismatch surfaces as a warning (PostgreSQL cannot prove which rows are affected), and the two-step REINDEX + REFRESH workflow restores the contract.
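The lifecycle in Figure 3 can be sketched as two small functions over version strings. This is a simplified model, not the PostgreSQL code: the enum names and function names are hypothetical, the check treats a missing version on either side as "nothing to compare" (an assumption of the sketch), and the refresh logic follows the three branches shown in the AlterCollation excerpt.

```c
#include <stddef.h>
#include <string.h>

typedef enum { VER_OK, VER_MISMATCH } ver_status;
typedef enum { REFRESH_UNCHANGED, REFRESH_CHANGED, REFRESH_INVALID } refresh_result;

/* First-use check: flag a mismatch only when both versions are known. */
ver_status
check_collation_version(const char *recorded, const char *actual)
{
    if (recorded != NULL && actual != NULL && strcmp(recorded, actual) != 0)
        return VER_MISMATCH;
    return VER_OK;
}

/*
 * Refresh: stamp the actual version over the recorded one. Going from
 * "has a version" to "has none" (or the reverse) is rejected.
 */
refresh_result
refresh_collation_version(const char **recorded, const char *actual)
{
    if ((*recorded == NULL) != (actual == NULL))
        return REFRESH_INVALID;
    if (*recorded != NULL && strcmp(*recorded, actual) != 0)
    {
        *recorded = actual;
        return REFRESH_CHANGED;
    }
    return REFRESH_UNCHANGED;
}
```

Note that the refresh only changes the stamp. It is safe only after the affected indexes have been rebuilt, since it silences the warning without touching any index.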
CREATE COLLATION: option parsing and provider selection
DefineCollation is where SQL syntax becomes catalog data. It parses the
provider, locale / lc_collate / lc_ctype, deterministic, rules,
and version options, then normalizes them per provider. The locale shorthand
expands differently depending on provider — for libc it fills both
lc_collate and lc_ctype; for ICU/builtin it fills colllocale:
```c
// DefineCollation (src/backend/commands/collationcmds.c, condensed)
if (localeEl)
{
    if (collprovider == COLLPROVIDER_LIBC)
    {
        collcollate = defGetString(localeEl);
        collctype = defGetString(localeEl);
    }
    else
        colllocale = defGetString(localeEl);
}
```

The provider default is libc when none is given. ICU-only options
(rules, nondeterminism) are rejected for other providers, ICU locales are
canonicalized to a language tag (unless in binary upgrade, which preserves the
original string), and finally CollationCreate writes the row. After creation
the code calls pg_newlocale_from_collation(newoid) once as a smoke test —
“check that the locales can be loaded” — so a typo’d locale fails at
CREATE COLLATION time rather than at first query.
Importing system collations
A fresh database does not enumerate every OS locale by hand;
pg_import_system_collations() (invoked by initdb and exposed as an
SQL function) does it. On non-Windows it runs locale -a, normalizes each
name (stripping .utf8-style encoding tags and creating a short alias), and
calls CollationCreate with COLLPROVIDER_LIBC for each valid one:
```c
// create_collation_from_locale (src/backend/commands/collationcmds.c, condensed)
collid = CollationCreate(locale, nspid, GetUserId(),
                         COLLPROVIDER_LIBC, true, enc,
                         locale, locale, NULL, NULL,
                         get_collation_actual_version(COLLPROVIDER_LIBC, locale),
                         true, true);
```

When built with ICU, the same function loops over uloc_getAvailable() and
creates <langtag>-x-icu collations (the root locale sneaks in at index
-1). Each is created with its actual version already stamped, so freshly
imported collations start life version-consistent.
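The name normalization mentioned above (stripping a .utf8-style encoding tag to form the short alias) can be sketched as follows. `strip_encoding_tag` is a hypothetical helper written from that description; it is not the body of normalize_libc_locale_name, and the rule that a tag consists of letters, digits, and hyphens is an assumption of the sketch.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy src to dest with any ".encoding" tag removed, keeping modifiers
 * such as "@euro". Returns true if a tag was stripped. Output is
 * truncated to fit destsize, terminator included.
 */
bool
strip_encoding_tag(char *dest, size_t destsize, const char *src)
{
    size_t n = 0;
    bool changed = false;

    if (destsize == 0)
        return false;

    while (*src != '\0' && n + 1 < destsize)
    {
        if (*src == '.')
        {
            /* Skip the dot and the tag that follows it. */
            src++;
            while (isalnum((unsigned char) *src) || *src == '-')
                src++;
            changed = true;
        }
        else
            dest[n++] = *src++;
    }
    dest[n] = '\0';
    return changed;
}
```

For example, "en_US.utf8" becomes "en_US" and "de_DE.UTF-8@euro" becomes "de_DE@euro", while a name without a tag is copied unchanged and reported as such, so the caller knows no alias is needed.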
Source Walkthrough
The collation-provider machinery spans the catalog (pg_collation,
pg_database), the runtime resolver (pg_locale.c plus three
provider files), and the DDL layer (collationcmds.c). The symbols below are
grouped by call-flow.
Catalog and type definitions
- struct pg_locale_struct / pg_locale_t (in pg_locale.h): the resolved handle, holding the provider discriminant, the deterministic / collate_is_c / ctype_is_c flags, the collate method pointer, and the info union over builtin / libc locale_t / ICU UCollator.
- struct collate_methods (in pg_locale.h): the per-collation method table, with strncoll (required), strnxfrm (required), strnxfrm_prefix (optional), and the strxfrm_is_safe flag.
- COLLPROVIDER_DEFAULT / _BUILTIN / _ICU / _LIBC (in pg_collation.h): the single-char provider codes 'd' / 'b' / 'i' / 'c'.
- Form_pg_collation fields collprovider, collisdeterministic, collencoding, collcollate, collctype, colllocale, collicurules, collversion: the catalog columns the resolver reads.
- Form_pg_database fields datlocprovider, datcollate, datctype, datlocale, datcollversion: the database-default locale, fixed at CREATE DATABASE.
Handle resolution and caching
- pg_newlocale_from_collation (in pg_locale.c): the public entry point; short-circuits DEFAULT_COLLATION_OID → default_locale and C_COLLATION_OID → &c_locale, then consults the cache.
- c_locale (in pg_locale.c): the static pg_locale_struct for C/POSIX, usable without catalog access.
- collation_cache_entry / CollationCache / last_collation_cache_oid / last_collation_cache_locale (in pg_locale.c): the per-backend simplehash cache plus a one-entry fast path for the repeated-collation case.
- create_pg_locale (in pg_locale.c): reads the catalog row, dispatches on collprovider to the provider constructor, then runs the version check.
- init_database_collation (in pg_locale.c): builds default_locale from pg_database at backend startup, dispatching on datlocprovider.
Provider constructors
- create_pg_locale_libc / make_libc_collator (in pg_locale_libc.c): builds a locale_t via newlocale; sets deterministic = true and the collate_is_c / ctype_is_c flags; leaves collate NULL for C/POSIX.
- create_pg_locale_icu (in pg_locale_icu.c): opens a UCollator; the only constructor that can set deterministic = false.
- create_pg_locale_builtin (in pg_locale_builtin.c): installs the compiled-in Unicode tables for C / C.UTF-8 / PG_UNICODE_FAST.
String operations (dispatch)
- pg_strcoll / pg_strncoll (in pg_locale.c): comparison via locale->collate->strncoll.
- pg_strnxfrm / pg_strxfrm / pg_strxfrm_enabled / pg_strnxfrm_prefix / pg_strxfrm_prefix_enabled (in pg_locale.c): transform keys and their safety/availability gates.
- pg_strlower / pg_strtitle / pg_strupper / pg_strfold (in pg_locale.c): case mapping, each a provider switch (note that pg_strfold falls back to strlower_libc for the libc provider).
- varstr_cmp / text_cmp / texteq (in varlena.c): the determinism tie-break and the deterministic fast path for equality.
Builtin / ICU locale validation
- `builtin_validate_locale` / `builtin_locale_encoding` (in `pg_locale.c`): canonicalize and encoding-check the three builtin locale names.
- `icu_language_tag` / `icu_validate_locale` (in `pg_locale.c`): BCP-47 canonicalization and best-effort validation for ICU locales.
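The builtin half of this validation is small enough to model directly. The sketch below assumes the behavior recorded in this document (three accepted names, `C.UTF8` as an alias, `C` usable with any encoding). The function names and the `SKETCH_ENCODING_UTF8` value are hypothetical, and where the real code reports an error through `ereport`, this model returns `NULL`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_ENCODING_ANY  (-1)   /* "works with every encoding" */
#define SKETCH_ENCODING_UTF8 1      /* hypothetical id, not PostgreSQL's value */

/* Return the canonical builtin locale name, or NULL if the name is rejected. */
static const char *
sketch_builtin_canonical_name(const char *name)
{
    assert(name != NULL);

    if (strcmp(name, "C") == 0)
        return "C";
    if (strcmp(name, "C.UTF-8") == 0 || strcmp(name, "C.UTF8") == 0)
        return "C.UTF-8";       /* the alias collapses to one spelling */
    if (strcmp(name, "PG_UNICODE_FAST") == 0)
        return "PG_UNICODE_FAST";
    return NULL;
}

/* Encoding a canonical builtin locale requires; only "C" is unconstrained. */
static int
sketch_builtin_required_encoding(const char *canonical)
{
    assert(canonical != NULL);

    if (strcmp(canonical, "C") == 0)
        return SKETCH_ENCODING_ANY;
    return SKETCH_ENCODING_UTF8;
}
```

A caller would canonicalize first and then compare the required encoding against the database encoding, rejecting the locale when the two disagree and the requirement is not `SKETCH_ENCODING_ANY`.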
Version tracking and DDL
- `get_collation_actual_version` (in `pg_locale.c`): asks the live provider for its current version string.
- `DefineCollation` (in `collationcmds.c`): CREATE COLLATION, which parses options, normalizes them per provider, writes the row, and smoke-tests the locale.
- `AlterCollation` (in `collationcmds.c`): ALTER COLLATION … REFRESH VERSION; refuses the default collation.
- `pg_collation_actual_version` (in `collationcmds.c`): SQL-exposed live version, reading `pg_database` or `pg_collation`.
- `pg_import_system_collations` / `create_collation_from_locale` / `normalize_libc_locale_name` (in `collationcmds.c`): bulk import of OS locales (libc via `locale -a`, ICU via `uloc_getAvailable`).
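Two checks from this group lend themselves to a compact model: the CREATE-time rule that only ICU collations may be nondeterministic, and the load-time comparison of the recorded version against the live one. The sketch below is illustrative only. The function names are hypothetical, the message wording is invented rather than PostgreSQL's, and treating a missing version on either side as "nothing to compare" is a simplification of this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* CREATE-time guard: only the ICU provider ('i') may be nondeterministic. */
static bool
sketch_determinism_allowed(char provider, bool deterministic)
{
    return deterministic || provider == 'i';
}

/*
 * Load-time check: compare the version recorded in the catalog with the
 * version the provider reports now.  On mismatch, write a warning into buf
 * and return true; the caller still receives a usable handle.
 */
static bool
sketch_version_mismatch(const char *collname,
                        const char *recorded, const char *actual,
                        char *buf, size_t buflen)
{
    assert(collname != NULL && buf != NULL && buflen > 0);
    buf[0] = '\0';

    /* Simplification: without both versions there is nothing to compare. */
    if (recorded == NULL || actual == NULL)
        return false;
    if (strcmp(actual, recorded) == 0)
        return false;

    /* Illustrative wording only; a warning, never a hard failure. */
    snprintf(buf, buflen,
             "WARNING: collation \"%s\" was recorded with version %s, "
             "but the provider now reports %s. "
             "HINT: REINDEX affected objects, then run "
             "ALTER COLLATION ... REFRESH VERSION.",
             collname, recorded, actual);
    return true;
}
```

Returning a flag instead of aborting reflects the design choice noted in this document: a mismatch signals risk to existing indexes but does not block access to the data.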
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| `struct collate_methods` | `src/include/utils/pg_locale.h` | 73 |
| `struct pg_locale_struct` | `src/include/utils/pg_locale.h` | 115 |
| `COLLPROVIDER_DEFAULT` | `src/include/catalog/pg_collation.h` | 70 |
| `COLLPROVIDER_BUILTIN` | `src/include/catalog/pg_collation.h` | 71 |
| `COLLPROVIDER_ICU` | `src/include/catalog/pg_collation.h` | 72 |
| `COLLPROVIDER_LIBC` | `src/include/catalog/pg_collation.h` | 73 |
| `c_locale` | `src/backend/utils/adt/pg_locale.c` | 137 |
| `collation_cache_entry` | `src/backend/utils/adt/pg_locale.c` | 146 |
| `last_collation_cache_oid` | `src/backend/utils/adt/pg_locale.c` | 176 |
| `check_locale` | `src/backend/utils/adt/pg_locale.c` | 301 |
| `create_pg_locale` | `src/backend/utils/adt/pg_locale.c` | 1075 |
| `init_database_collation` | `src/backend/utils/adt/pg_locale.c` | 1154 |
| `pg_newlocale_from_collation` | `src/backend/utils/adt/pg_locale.c` | 1196 |
| `get_collation_actual_version` | `src/backend/utils/adt/pg_locale.c` | 1254 |
| `pg_strlower` | `src/backend/utils/adt/pg_locale.c` | 1270 |
| `pg_strcoll` | `src/backend/utils/adt/pg_locale.c` | 1353 |
| `pg_strncoll` | `src/backend/utils/adt/pg_locale.c` | 1373 |
| `pg_strxfrm_enabled` | `src/backend/utils/adt/pg_locale.c` | 1387 |
| `pg_strnxfrm` | `src/backend/utils/adt/pg_locale.c` | 1428 |
| `builtin_locale_encoding` | `src/backend/utils/adt/pg_locale.c` | 1486 |
| `builtin_validate_locale` | `src/backend/utils/adt/pg_locale.c` | 1510 |
| `icu_language_tag` | `src/backend/utils/adt/pg_locale.c` | 1550 |
| `icu_validate_locale` | `src/backend/utils/adt/pg_locale.c` | 1608 |
| `create_pg_locale_libc` | `src/backend/utils/adt/pg_locale_libc.c` | 421 |
| `make_libc_collator` | `src/backend/utils/adt/pg_locale_libc.c` | 488 |
| `varstr_cmp` (tie-break) | `src/backend/utils/adt/varlena.c` | 1698 |
| `texteq` | `src/backend/utils/adt/varlena.c` | 1738 |
| `DefineCollation` | `src/backend/commands/collationcmds.c` | 53 |
| `AlterCollation` | `src/backend/commands/collationcmds.c` | 424 |
| `pg_collation_actual_version` | `src/backend/commands/collationcmds.c` | 507 |
| `normalize_libc_locale_name` | `src/backend/commands/collationcmds.c` | 596 |
| `create_collation_from_locale` | `src/backend/commands/collationcmds.c` | 696 |
| `pg_import_system_collations` | `src/backend/commands/collationcmds.c` | 836 |
Source verification (as of 2026-06-05)
Verified facts
- There are exactly four provider codes and three concrete providers. Verified in `pg_collation.h`: `COLLPROVIDER_DEFAULT` (`'d'`), `COLLPROVIDER_BUILTIN` (`'b'`), `COLLPROVIDER_ICU` (`'i'`), `COLLPROVIDER_LIBC` (`'c'`). `'d'` is a sentinel (“use the database default”) and never reaches a provider constructor: `create_pg_locale` dispatches only on the other three and raises `PGLOCALE_SUPPORT_ERROR` otherwise.
- Nondeterministic collations are ICU-only, enforced at CREATE time. Verified in `DefineCollation` (`collationcmds.c`): `if (!collisdeterministic && collprovider != COLLPROVIDER_ICU) ereport(ERROR, ...)`. The libc constructor hard-wires `result->deterministic = true`; the builtin constructor does likewise. Only `create_pg_locale_icu` can produce a nondeterministic handle.
- The C/POSIX path never calls `strcoll` and never carries a method table. Verified in `create_pg_locale_libc` and `make_libc_collator` (`pg_locale_libc.c`): for `"C"` / `"POSIX"`, `make_libc_collator` returns `NULL`, `collate_is_c` is set true, and `result->collate` is left NULL. Comparison falls back to `memcmp`, which is why the C locale is immune to OS collation drift.
- The collation version mismatch is a WARNING, not an ERROR. Verified in `create_pg_locale` (`pg_locale.c`): the `strcmp(actual, recorded) != 0` branch calls `ereport(WARNING, ...)` with an errhint naming `REINDEX` and `ALTER COLLATION ... REFRESH VERSION`. PostgreSQL cannot prove which rows a reordering affects, so it flags risk rather than blocking access.
- LC_COLLATE and LC_CTYPE are fixed at CREATE DATABASE, not GUCs. Verified by the `pg_locale.c` file-header comment and by the libc constructor reading `datcollate` / `datctype` from `pg_database`. The runtime locale GUCs in this same file (`locale_messages` / `_monetary` / `_numeric` / `_time`) affect only formatting (`PGLC_localeconv`, `cache_locale_time`), never sort order.
- `strnxfrm` is always present but only used when `strxfrm_is_safe` is set. Verified in `pg_strxfrm_enabled` (`pg_locale.c`) and its comment: the method pointer is required because the planner calls it for estimates, but actual sorting consults the `strxfrm_is_safe` flag first. This is the hook through which the broken `strxfrm` of some glibc versions is disabled.
- The builtin provider accepts exactly three locale names. Verified in `builtin_validate_locale` (`pg_locale.c`): `C`, `C.UTF-8` (alias `C.UTF8`), and `PG_UNICODE_FAST`; anything else errors. `C` is encoding-agnostic (`-1`); the other two require UTF-8 per `builtin_locale_encoding`.
- The per-backend collation cache has a one-entry fast path. Verified in `pg_newlocale_from_collation` (`pg_locale.c`): `last_collation_cache_oid` / `last_collation_cache_locale` short-circuit the simplehash lookup when the same collation is requested consecutively, which is the common case inside one query.
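The `strxfrm_is_safe` fact above describes a gate: the transform method always exists, but sort code checks a flag before trusting it. A minimal sketch of that pattern follows, under stated assumptions. The `sketch_` names are hypothetical, an identity copy stands in for a real collation transform, and "abbreviated key" here simply means the first few transformed bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct sketch_methods
{
    /* Required even when unsafe, because other callers may still use it. */
    size_t  (*strnxfrm) (char *dst, size_t dstsize,
                         const char *src, size_t srclen);
    bool    strxfrm_is_safe;
};

/* Toy transform: copy what fits, return the full length needed. */
static size_t
sketch_identity_strnxfrm(char *dst, size_t dstsize,
                         const char *src, size_t srclen)
{
    size_t  n = (srclen < dstsize) ? srclen : dstsize;

    if (n > 0)
        memcpy(dst, src, n);
    return srclen;
}

/* The gate: the pointer is always set, the flag decides whether to sort by it. */
static bool
sketch_strxfrm_enabled(const struct sketch_methods *m)
{
    assert(m != NULL && m->strnxfrm != NULL);
    return m->strxfrm_is_safe;
}

/*
 * Build a fixed-size key from the transform only when the gate allows it.
 * Returns false when the caller must fall back to full comparisons.
 */
static bool
sketch_try_abbreviate(const struct sketch_methods *m,
                      const char *src, size_t srclen,
                      char *key, size_t keylen)
{
    if (!sketch_strxfrm_enabled(m))
        return false;

    memset(key, 0, keylen);     /* zero-pad short inputs */
    m->strnxfrm(key, keylen, src, srclen);
    return true;
}
```

Keeping the method pointer mandatory while gating its use by a flag means a provider can be marked untrustworthy for sorting without every other caller needing a NULL check.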
Open questions
- What exactly distinguishes `PG_UNICODE_FAST` from `C.UTF-8` at runtime? Both are UTF-8 builtin locales with code-point sort order. The difference is in ctype/case mapping (`casemap_full` in the builtin info union: full Unicode case mapping for `PG_UNICODE_FAST` vs. simple case mapping for `C.UTF-8`), but the precise table sources and the list of case-mapping functions that branch on `casemap_full` were not traced here. Investigation path: read `pg_locale_builtin.c` and the `strlower_builtin` / `strupper_builtin` / `strtitle_builtin` / `strfold_builtin` implementations, plus the generated Unicode tables under `src/common/unicode/`.
- How does `strxfrm_is_safe` get set to false in practice on REL_18? The flag exists and is honored, but which providers/platforms actually clear it (and whether any do by default on a modern glibc) was not confirmed from the constructors read here. Investigation path: grep `strxfrm_is_safe` across `pg_locale_libc.c` / `pg_locale_icu.c` / `pg_locale_builtin.c` and trace the `collate_methods` table initializers.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Unicode Collation Algorithm (UTS #10) and CLDR. The multi-level primary/secondary/tertiary comparison PostgreSQL’s ICU provider relies on is specified in Unicode Technical Standard #10; the locale-specific tailorings live in CLDR. A focused note mapping UCA levels onto PostgreSQL’s deterministic/nondeterministic distinction and the `rules` (tailoring) option would make the ICU provider’s behavior concrete.
- The glibc 2.28 collation break. The 2018 glibc release reordered most locales, silently corrupting B-tree indexes on upgraded hosts and motivating much of PostgreSQL’s version-tracking machinery and the push toward ICU and the builtin provider. A short case study of that incident would ground the `collversion` mechanism in the failure it exists to catch.
- ICU vs. libc stability tradeoff. ICU pins collation behavior to a linked library version independent of the host OS, but introduces its own versioning surface. SQL Server (Windows collations + UCA), Oracle (`NLS_SORT` / linguistic collations), and MySQL (per-character-set collations, `utf8mb4_0900_ai_ci`) each pick a different point on the coverage/stability curve. A side-by-side comparison would clarify what PostgreSQL’s three-provider design buys.
- Encoding vs. collation separation. This doc deliberately defers the encoding side; see `postgres-encoding.md` for `pg_wchar`, server/client encoding conversion, and how `collencoding` constrains which collations are usable in a given database. The two subsystems meet at `pg_get_encoding_from_locale` and `check_encoding_locale_matches`.
- Where collations are consumed. The comparator and transform keys surface in sorting (`postgres-tuplesort.md`, abbreviated keys), B-tree ordering (`postgres-nbtree.md`, deduplication disabled for nondeterministic collations), and the text type functions (`postgres-datatypes-adt.md`). A follow-up could trace one `ORDER BY` end-to-end from parse-time collation assignment to the `strncoll` call.
Sources
Raw analysis materials
- None. This document was synthesized directly from the REL_18 source tree (`sources: []` in frontmatter).
Source code (REL_18, commit 273fe94)
- `src/backend/utils/adt/pg_locale.c`: provider dispatch, handle cache, version check, builtin/ICU validation, runtime locale GUC hooks.
- `src/backend/commands/collationcmds.c`: CREATE / ALTER COLLATION, version refresh, system-collation import.
- `src/include/utils/pg_locale.h`: `pg_locale_t`, `collate_methods`, `pg_locale_struct`, the public string-operation prototypes.
- `src/include/catalog/pg_collation.h`: `COLLPROVIDER_*` codes, the `pg_collation` catalog form.
- `src/include/catalog/pg_database.h`: `datlocprovider` / `datcollate` / `datctype` / `datlocale` / `datcollversion`.
- `src/backend/utils/adt/pg_locale_libc.c`: the libc provider constructor and `make_libc_collator`.
- `src/backend/utils/adt/varlena.c`: `varstr_cmp` determinism tie-break, `texteq` deterministic fast path.
Textbook references
- Database System Concepts (Silberschatz, Korth, Sudarshan, 7th ed.): the SQL `COLLATE` clause, collation as comparison-and-sort rules, and the encoding-vs-collation distinction. Captured at `knowledge/research/dbms-general/database-system-concepts.md`.
External standards
- Unicode Technical Standard #10 (Unicode Collation Algorithm) and the CLDR data set: the basis for the ICU provider’s multi-level comparison and tailoring rules.
Cross-references
- `postgres-encoding.md`: character encoding and the encoding/collation boundary.
- `postgres-datatypes-adt.md`: the `text` type functions that consume these comparators.
- `postgres-tuplesort.md`: abbreviated keys built on the transform machinery.
- `postgres-nbtree.md`: B-tree ordering and deduplication’s interaction with determinism.