CUBRID Charset and Collation — Codeset Conversion, Locale-Aware Comparison, and Multi-Encoding Support
CUBRID handles text as a (byte-sequence, codeset, collation) triple. The codeset (INTL_CODESET) tells the engine how to walk bytes into characters; the collation (LANG_COLLATION) tells it how to order, match, and hash those characters. The two are decoupled but constrained — every collation pins itself to exactly one codeset, and the codeset of a string must be coercible to the collation’s codeset before any comparison can run.
This module is unusual relative to other RDBMS engines in two ways. First, CUBRID does not link against ICU. Locale rules — including UCA tailorings written in LDML — are compiled by CUBRID’s own genlocale toolchain into a per-platform shared library (libcubrid_<lang>.so/.dll) that the server dlopens at startup. Second, every codeset has its own dedicated comparator family (lang_fastcmp_byte, lang_strcmp_utf8, lang_strcmp_utf8_uca, …) wired into a LANG_COLLATION vtable, so the hot path skips the codeset-dispatch switch that conversion routines pay at the boundary.
Position table
| Symbol | File | Line |
|---|---|---|
| enum intl_codeset | src/base/intl_support.h | 175 |
| INTL_CODESET_MULT | src/base/intl_support.h | 77 |
| INTL_NEXT_CHAR | src/base/intl_support.h | 99 |
| intl_Len_utf8_char (decl) | src/base/intl_support.h | 173 |
| intl_utf8_to_cp | src/base/intl_support.c | 3754 |
| intl_cp_to_utf8 | src/base/intl_support.c | 3661 |
| intl_back_utf8_to_cp | src/base/intl_support.c | 3812 |
| intl_check_utf8 | src/base/intl_support.c | 3950 |
| intl_check_euckr | src/base/intl_support.c | 4239 |
| intl_fast_iso88591_to_utf8 | src/base/intl_support.c | 4932 |
| intl_euckr_to_utf8 | src/base/intl_support.c | 5101 |
| intl_utf8_to_euckr | src/base/intl_support.c | 5256 |
| intl_convert_charset | src/base/intl_support.c | 944 |
| intl_char_count / intl_char_size | src/base/intl_support.c | 974 / 1021 |
| intl_identifier_casecmp | src/base/intl_support.c | 2783 |
| struct lang_collation | src/base/language_support.h | 159 |
| struct lang_locale_data | src/base/language_support.h | 190 |
| struct db_charset (lang_Db_charsets) | src/base/language_support.h:128 / .c:171 | |
| enum LANG_COLL_* (built-in IDs) | src/base/language_support.h | 105 |
| LANG_GET_BINARY_COLLATION | src/base/language_support.h | 121 |
| LANG_RT_COMMON_COLL | src/base/language_support.h | 65 |
| built_In_collations[] | src/base/language_support.c | 830 |
| lang_init_builtin | src/base/language_support.c | 850 |
| register_collation | src/base/language_support.c | 1568 |
| lang_get_collation | src/base/language_support.c | 1648 |
| lang_get_collation_by_name | src/base/language_support.c | 1678 |
| lang_load_coll_from_lib | src/base/language_support.c | 7105 |
| lang_load_library | src/base/language_support.c | 7272 |
| lang_fastcmp_byte | src/base/language_support.c | 5453 |
| lang_fastcmp_binary | src/base/language_support.c | 6415 |
| lang_strcmp_utf8 | src/base/language_support.c | 2807 |
| lang_strmatch_utf8 | src/base/language_support.c | 2830 |
| lang_strcmp_utf8_uca | src/base/language_support.c | 4321 |
| lang_strmatch_utf8_uca_w_coll_data | src/base/language_support.c | 4368 |
| lang_strmatch_utf8_uca_w_level | src/base/language_support.c | 3583 |
| lang_get_uca_w_l13 | src/base/language_support.c | 3376 |
| lang_get_contr_for_string | src/base/language_support.c | 3296 |
| lang_init_coll_en_cs / _en_ci | src/base/language_support.c | 5304 / 5346 |
| struct coll_data | src/base/locale_support.h | 354 |
| struct coll_tailoring | src/base/locale_support.h | 391 |
| struct uca_options | src/base/locale_support.h | 315 |
| struct alphabet_data | src/base/locale_support.h | 437 |
| struct text_conversion | src/base/locale_support.h | 464 |
| struct unicode_normalization | src/base/locale_support.h | 492 |
| locale_compile_locale | src/base/locale_support.c | 4573 |
| COLL_CONTRACTION | src/base/locale_lib_common.h | 44 |
| QSTR_COMPARE macro | src/query/string_opfunc.h | 59 |
| QSTR_MATCH macro | src/query/string_opfunc.h | 62 |
| QSTR_NEXT_ALPHA_CHAR macro | src/query/string_opfunc.h | 68 |
| QSTR_SPLIT_KEY macro | src/query/string_opfunc.h | 71 |
| btree_compare_individual_key_value | src/storage/btree.c | 19602 |
Theoretical Background
A DBMS that handles text settles two orthogonal questions. Encoding decodes byte sequences into characters; collation orders already-decoded characters. They are independent in principle but coupled in practice — every comparator walks the byte stream, and the walk strategy is dictated by the encoding.
A codepoint is an abstract integer in U+0000 .. U+10FFFF. UTF-8 serializes them as 1..4 bytes: ASCII-compatible (< 0x80), self-synchronizing (continuation bytes start with 10xxxxxx), length-encoded in the lead byte (110xxxxx = 2 bytes, 1110xxxx = 3, 11110xxx = 4). Pre-Unicode encodings — ISO-8859-1, EUC-KR, Shift-JIS — persist in legacy data and CSV imports, so any RDBMS must ingest them.
The Unicode Collation Algorithm (UCA, UTS #10) gives each codepoint a collation element (CE) — a tuple of weights at three or four levels: L1 = base letter, L2 = accent, L3 = case, L4 = punctuation/variable. Sort compares L1 keys; on tie, L2; and so on. Two extensions matter: expansions (“ß” → “ss”) and contractions (“ch” sorts as a unit in Spanish). The Default Unicode Collation Element Table (DUCET) is the baseline; locale tailorings override it. Tailorings are written in LDML (CLDR XML), with rules like & a < b (“b sorts after a at primary level”).
Case sensitivity (CI/CS), accent sensitivity (AI/AS), and kana sensitivity (KI/KS) are usually implemented by truncating compare at level 1 (CI/AI) or collapsing equivalent CEs into a shared weight.
Casing is locale-dependent. The textbook example is Turkish: LOWER('I') is 'ı' (U+0131) and UPPER('i') is 'İ' (U+0130) — engine cannot use C runtime tolower/toupper.
Normalization sits underneath: é is either U+00E9 (NFC) or U+0065+U+0301 (NFD). UCA tables assume NFC; an engine that does not normalize on input ends up with semantically-equal strings comparing unequal.
Implementations cluster into three patterns:
- Link against ICU — delegate conversion, iteration, normalization, collation wholesale.
- Roll your own UCA on top of CLDR data — parse DUCET and LDML at build time, emit compact tables.
- Keep it simple — binary collation plus a few hand-tuned per-language tables.
CUBRID lives in pattern two: genlocale parses LDML, applies tailorings to DUCET, emits per-locale C source, CMake builds shared libraries, the runtime dlopens them. ICU is not a dependency.
Common DBMS Design
- PostgreSQL originally used libc strcoll; since 10, ICU is a first-class provider, and since 16 it is the default. pg_collation stores (collname, collprovider, collcollate, collctype, collicurules). Encoding is per-database. Collation flows as part of the type — operators must agree, raising “could not determine which collation to use” on conflict.
- MySQL/MariaDB has its own framework predating ICU. ~40 charsets, each with multiple collations (utf8mb4_general_ci, utf8mb4_0900_ai_ci). Each collation is a CHARSET_INFO struct with function pointers (strnncoll, hash_sort, like_range) — a near-twin of CUBRID’s LANG_COLLATION. MySQL 8.0 ships CLDR-based UCA 9.0 collations.
- Oracle uses NLS. Database character set fixed at CREATE DATABASE; linguistic collation via NLS_SORT (BINARY, XSPANISH, …) with _CI/_AI suffixes. Own collation engine, predating ICU.
- SQL Server added UTF-8 collations (Latin1_General_100_CI_AS_SC_UTF8) only in 2019; the collation name encodes locale, version, CI/CS, AI/AS, KI/KS, WI/WS, supplementary characters, UTF-8 — all in the string.
- SQLite has three built-ins (BINARY, NOCASE ASCII-only, RTRIM) plus the sqlite3_create_collation extension API.
CUBRID’s shape is closest to MySQL: small fixed charset set (4: binary, ISO-8859-1, EUC-KR, UTF-8 — lang_Db_charsets[]), built-in collation IDs in 0..31, user collations from 32..255, all dispatched through a LANG_COLLATION vtable. The differentiator is the build-time pipeline: locale data is not loaded as runtime files; it is compiled into per-locale shared libraries that the server dlopens.
CUBRID’s Approach
The model is six structured pieces: INTL_CODESET (closed enum of encodings), LANG_COLLATION (per-collation vtable), COLL_DATA (weight tables and UCA options), LANG_LOCALE_DATA (per-language-codeset alphabets, calendar, currency), ALPHABET_DATA (codepoint → lower/upper with multipliers), TEXT_CONVERSION (console codepage ↔ UTF-8).
Built-in collations are static globals at language_support.c:696..791, aggregated into built_In_collations[] at line 830. User collations are loaded via lang_load_coll_from_lib (line 7105). The same lang_Collations[256] array holds both: 0..31 reserved for built-ins, 32..255 for user-defined.
Three constraints behind the no-ICU choice, all visible in the code:
- Ship-as-one-binary — cub_server and cubridsa must not require an ICU runtime install.
- Deterministic upgrades — ICU collation tables churn across versions (9.0/10.0/13.0); a freeze plus per-collation MD5 checksum (coll_data.checksum[33], validated by lang_check_coll_compat) lets indexes survive upgrades.
- Codeset-direct comparison — the server can run on EUC-KR; the B+Tree must compare EUC-KR bytes directly without a conversion hop, so each codeset gets its own comparator wired through the vtable.
Codesets — enum intl_codeset
enum intl_codeset (intl_support.h:175) defines six values plus error/none sentinels: ASCII = 0, RAW_BITS = 1 (the BIT type), RAW_BYTES = 2 (the BINARY type, also aliased INTL_CODESET_BINARY), ISO88591 = 3, KSC5601_EUC = 4, UTF8 = 5. INTL_CODESET_LAST = UTF8.
Two macros drive codeset-aware code: INTL_CODESET_MULT(cs) returns the max bytes per character — INTL_UTF8_MAX_CHAR_SIZE (4) for UTF-8, 3 for EUC-KR (KSC 5601 two-byte plus the JIS X 0212 three-byte form), 1 otherwise. INTL_NEXT_CHAR(ptr, s, codeset, *char_size) advances one character; for UTF-8 it dispatches into intl_nextchar_utf8, for EUC-KR into intl_nextchar_euc, else just s + 1.
intl_Len_utf8_char[256] (intl_support.h:173) is the table that powers branchless UTF-8 walking: index by lead byte, get the byte length (1 for invalid lead, best-effort recovery). The macro INTL_NEXTCHAR_UTF8(c) does c + intl_Len_utf8_char[*c].
The DB-facing charset table is lang_Db_charsets[] (language_support.c:171), six rows mapping (charset_name, charset_desc, space_char, introducer, charset_cubrid_name, charset_id, space_size):
| name | introducer | space (pad) | id | space_size |
|---|---|---|---|---|
| ascii | — | " " | ASCII | 1 |
| raw-bits | — | "" | RAW_BITS | 1 |
| raw-bytes | _binary | "" | BINARY | 1 |
| iso8859-1 | _iso88591 | " " | ISO88591 | 1 |
| ksc-euc | _euckr | "\241\241" (U+3000 ideographic) | KSC5601_EUC | 2 |
| utf-8 | _utf8 | " " | UTF8 | 1 |
The space_char is the CHAR fixed-length pad. EUC-KR uses U+3000, the ideographic space (two bytes); everything else uses ASCII SPACE. The introducer is what the grammar accepts in _utf8'literal'.
UTF-8 codepoint conversion
Three atomic functions cover forward, backward, and emit:
- intl_cp_to_utf8 (line 3661) — codepoint → 1..4-byte UTF-8 sequence. Branches on codepoint range (≤ 0x7F, ≤ 0x7FF, ≤ 0xFFFF, ≤ 0x10FFFF); returns '?' and 1 on assert failure.
- intl_utf8_to_cp (line 3754) — UTF-8 → codepoint. Dispatches on lead-byte high bits, returns 0xFFFFFFFF on a malformed sequence, and unconditionally advances *next_char by at least 1 so callers cannot infinite-loop. Cost on pre-validated input: one branch and two-to-four byte loads.
- intl_back_utf8_to_cp (line 3812) — backward walk from utf8_last, skipping continuation bytes (b & 0xc0) == 0x80, then re-encodes forward. Used by UCA backward sort (French L2 reverse rule) and LIKE escape lookback.
For pre-validated streams the table-driven INTL_NEXTCHAR_UTF8 macro (intl_support.h:67) is faster than intl_utf8_to_cp: intl_Len_utf8_char[lead_byte] returns the byte length. Used inside the UCA inner loops, where ingress validation has already run.
UTF-8 validation
Section titled “UTF-8 validation”intl_check_utf8 (intl_support.c:3950) is the gatekeeper. It is called whenever a string enters the system from outside (parser, CSQL input, client-server protocol). It encodes the modern UTF-8 grammar — exactly the one in RFC 3629 — refusing overlong forms (0xC0..0xC1), surrogates (0xED 0xA0..0xBF), and out-of-range four-byte sequences (0xF4 0x90..0xBF and above).
Valid ranges enforced by intl_check_utf8:

```
1 byte : 00 - 7F
2 bytes: C2 - DF , 80 - BF                     (U+80 .. U+7FF)
3 bytes: E0      , A0 - BF , 80 - BF           (U+800 .. U+FFF)
         E1 - EC , 80 - BF , 80 - BF           (U+1000 .. U+CFFF)
         ED      , 80 - 9F , 80 - BF           (U+D000 .. U+D7FF)  -- excludes surrogates
         EE - EF , 80 - BF , 80 - BF           (U+E000 .. U+FFFF)
4 bytes: F0      , 90 - BF , 80 - BF , 80 - BF (U+10000 .. U+3FFFF)
         F1 - F3 , 80 - BF , 80 - BF , 80 - BF (U+40000 .. U+FFFFF)
         F4      , 80 - 8F , 80 - BF , 80 - BF (U+100000 .. U+10FFFF)
```

The return is a tristate: INTL_UTF8_VALID, INTL_UTF8_INVALID (bad byte sequence, with *pos pointing at the start of the offender), or INTL_UTF8_TRUNCATED (the string ended mid-character, common for binary truncation of a TEXT column).
EUC-KR has its own validator, intl_check_euckr (line 4239). EUC-KR allows three families: ASCII (0x00..0x7F), KSC 5601 (0xA1..0xFE lead, 0xA1..0xFE trail, two bytes), and JIS X 0212 (0x8F lead, two 0xA1..0xFE continuation bytes — three bytes total). This is why INTL_CODESET_MULT for EUC-KR is 3, not 2.
Charset conversion pipeline
Section titled “Charset conversion pipeline”flowchart LR
subgraph Input
A1[ISO-8859-1 byte] --> C1[intl_iso88591_to_utf8]
A2[EUC-KR byte] --> C2[intl_euckr_to_utf8]
A3[Console MBCS] --> C3[TEXT_CONVERSION.text_to_utf8]
A4[UTF-8 byte] --> V[intl_check_utf8]
end
C1 --> U[UTF-8 stream]
C2 --> U
C3 --> U
V --> U
U --> CP[intl_utf8_to_cp]
CP --> CMP[Collation comparator]
CP --> NORM[unicode_compose_string NFC]
CP --> CASE[intl_upper_string / intl_lower_string]
U --> O1[intl_utf8_to_iso88591]
U --> O2[intl_utf8_to_euckr]
O1 --> Z1[ISO-8859-1 out]
O2 --> Z2[EUC-KR out]
U --> Z3[UTF-8 out]
There is no general n-to-n converter; CUBRID instead provides a triangle of pairwise functions:
| Source | Target | Function | File:Line |
|---|---|---|---|
| ISO-8859-1 | UTF-8 | intl_fast_iso88591_to_utf8 | intl_support.c:4933 |
| ISO-8859-1 | EUC-KR | intl_iso88591_to_euckr | (similar) |
| EUC-KR | ISO-8859-1 | intl_euckr_to_iso88591 | intl_support.c:4982 |
| EUC-KR | UTF-8 | intl_euckr_to_utf8 | intl_support.c:5102 |
| UTF-8 | ISO-8859-1 | intl_utf8_to_iso88591 | intl_support.c |
| UTF-8 | EUC-KR | intl_utf8_to_euckr | intl_support.c:5257 |
The ISO-8859-1 → UTF-8 path is a one-byte scan with three cases: ASCII (copy), C1 controls 0x80..0x9F (replaced with '?' — CUBRID does not interpret Windows-1252 typography here), Latin-1 0xA0..0xFF (encoded as a 2-byte UTF-8 sequence 0xC0 | (b >> 6), 0x80 | (b & 0x3F)).
The EUC-KR ↔ UTF-8 path calls into ksc5601_mbtowc (declared in src/base/ksc5601.h, conversion tables from libiconv via charset_converters.h). For each 0xA1..0xFE lead byte it consumes a two-byte KSC 5601 character: ksc_buf[0] = lead - 0x80; ksc_buf[1] = trail - 0x80; then ksc5601_mbtowc → unicode_cp → intl_cp_to_utf8. The 0x8F lead byte triggers the three-byte JIS X 0212 path via jisx0212_mbtowc.
charset_converters.h is a 44-line LGPL header lifted from GNU libiconv. It defines the Summary16 two-level table format (indx, used bitmask) that the ksc5601/jisx0212 tables index into.
intl_convert_charset (line 944) is a no-op stub returning ER_QSTR_BAD_SRC_CODESET — user-level CONVERT(s USING ...) is not wired up. The pairwise intl_*_to_* functions are called only from inside the type-coercion layer when an explicit cast or COLLATE forces transcoding.
LANG_COLLATION — the comparator vtable
struct lang_collation (language_support.h:159) carries: codeset, built_in flag, need_init flag, COLL_OPT options (allow_like_rewrite, allow_index_opt, allow_prefix_index), a LANG_LOCALE_DATA *default_lang back-pointer, embedded COLL_DATA coll (weights, contractions, UCA options), and six function pointers:
- fastcmp(coll, s1, sz1, s2, sz2, ignore_trailing_space) — pure string compare.
- strmatch(coll, is_match, s1, sz1, s2, sz2, escape, has_last_escape, *match_size, ti) — pattern-aware compare for LIKE.
- next_coll_seq(coll, seq, size, next_seq, *len_next, ti) — produce the smallest key strictly greater than seq. Used for range-scan boundary increment.
- split_key(coll, is_desc, s1, sz1, s2, sz2, **key, *byte_size, ti) — given str1 <= str2, return the shortest separator key K with str1 <= K < str2. B+Tree uses it on page split.
- mht2str(coll, str, size) — sort-stable hash for in-memory hash join. Equality must agree with the collation: a CI hash must walk weights, not raw bytes.
- init_coll(coll) — deferred initializer; called once by register_collation.
The vtable is where the codeset × collation-kind cross-product is resolved into one concrete function pointer per pair. Wiring for the ten built-in collations:
| ID | Name | Codeset | fastcmp | strmatch | split_key | mht2str |
|---|---|---|---|---|---|---|
| 0 | iso88591_bin | ISO-8859-1 | lang_fastcmp_byte | lang_strmatch_byte | lang_split_key_iso | lang_mht2str_default |
| 1 | utf8_bin | UTF-8 | lang_fastcmp_byte | lang_strmatch_utf8 | lang_split_key_utf8 | lang_mht2str_byte |
| 2 | iso88591_en_cs | ISO-8859-1 | lang_fastcmp_byte | lang_strmatch_byte | lang_split_key_byte | lang_mht2str_byte |
| 3 | iso88591_en_ci | ISO-8859-1 | lang_fastcmp_byte | lang_strmatch_byte | lang_split_key_byte | lang_mht2str_byte |
| 4 | utf8_en_cs | UTF-8 | lang_fastcmp_byte | lang_strmatch_utf8 | lang_split_key_utf8 | lang_mht2str_byte |
| 5 | utf8_en_ci | UTF-8 | lang_fastcmp_byte | lang_strmatch_utf8 | lang_split_key_utf8 | lang_mht2str_byte |
| 6 | utf8_tr_cs | UTF-8 | lang_strcmp_utf8 | lang_strmatch_utf8 | lang_split_key_utf8 | lang_mht2str_utf8 |
| 7 | utf8_ko_cs | UTF-8 | lang_strcmp_utf8 | lang_strmatch_utf8 | lang_split_key_utf8 | lang_mht2str_utf8 |
| 8 | euckr_bin | EUC-KR | lang_fastcmp_byte | lang_strmatch_byte | lang_split_key_euckr | lang_mht2str_ko |
| 9 | binary | RAW_BYTES | lang_fastcmp_binary | lang_strmatch_binary | lang_split_key_binary | (default) |
Three observations from the table:
- utf8_bin reuses lang_fastcmp_byte rather than a UTF-8-aware compare. This is correct: UTF-8’s design guarantees that lexicographic byte comparison agrees with codepoint comparison. The byte comparator is a tight inner loop and beats any UTF-8 walker.
- strmatch for utf8_bin is lang_strmatch_utf8, not lang_strmatch_byte. Match (LIKE) needs to step one character at a time because the pattern’s _ wildcard means “one character”, not “one byte”. So pattern matching must walk codepoints even when comparison can stay at the byte level.
- CI collations on ISO-8859-1 still use lang_fastcmp_byte. The case-insensitivity is achieved by changing the weight table, not the comparator. lang_init_coll_en_ci (line 5346) populates coll.weights[] so that 'a'..'z' map to the same weight as 'A'..'Z'. The byte compare then runs unchanged — it just reads from a folded weight table.
Built-in collation bootstrap
Section titled “Built-in collation bootstrap”built_In_collations[] (line 830) lists ten pointers in coll_id order: coll_Iso_binary, coll_Utf8_binary, coll_Iso88591_en_cs, coll_Iso88591_en_ci, coll_Utf8_en_cs, coll_Utf8_en_ci, coll_Utf8_tr_cs, coll_Utf8_ko_cs, coll_Euckr_bin, coll_Binary.
lang_init_builtin (line 850) pre-seeds all 256 lang_Collations[] slots with &coll_Iso_binary, then calls register_collation for each built-in, then registers seven LANG_LOCALE_DATA structs. register_collation (line 1568) writes lang_Collations[id] = coll only if the current occupant has coll_id == LANG_COLL_DEFAULT (0) — first-writer-wins, with later collisions returning ER_LOC_INIT.
Weight tables and the deferred initializer
Section titled “Weight tables and the deferred initializer”Two collations share a weight array but only one owns the initialization. register_collation calls coll->init_coll once when not NULL.
lang_init_coll_en_cs (line 5304) writes the identity weight: weights[i] = i, next_cp[i] = i + 1. The trailing-insensitive variant additionally sets weights_ti[32] = 0, next_cp_ti[32] = 1 so SPACE has zero weight.
lang_init_coll_en_ci (line 5346) inherits the identity, then overlays 'a'..'z' to weight 'A'..'Z': weights[i] = i - ('a' - 'A') for lowercase letters. CI is implemented entirely in the weight table; the comparator stays lang_fastcmp_byte. (The C++ lambda used by these initializers is unusual; recall .c files compile as C++17 in CUBRID.)
weights_ti / next_cp_ti are the trailing-insensitive variants. With ignore_trailing_space = true (set by the type system based on CHAR vs VARCHAR), 'foo ' compares equal to 'foo'. The comparator switches table pointers; no inner-loop branch.
lang_fastcmp_byte — the hot path
Section titled “lang_fastcmp_byte — the hot path”language_support.c:5453. The function consumed by every byte-level B+Tree comparison and every string operator on a binary, ISO-8859-1, EUC-KR-binary, or UTF-8-binary key. Three phases:
- Common-prefix loop of min(size1, size2) bytes. Each byte: special-case SPACE → ZERO, otherwise weight[byte]. Compare c1 - c2. If non-zero, return.
- Trailing-sensitive tail (!ignore_trailing_space): if sizes differ, return size1 - size2.
- Trailing-insensitive tail: walk the longer side; each weight is compared against ZERO. First non-zero diff wins.
Subtleties: the SPACE check is inside the loop so it works regardless of which weight table variant is loaded. weight[*s1] is one L1-cache lookup into a 256/352-entry table. The trailing-space tail picks one of two specialized loops by side, so the branch predictor sees a stable target.
lang_strcmp_utf8 — the codepoint walker
Section titled “lang_strcmp_utf8 — the codepoint walker”language_support.c:2830 (entered through lang_strcmp_utf8 at 2807, which delegates with is_match=false). For locale collations that need codepoint-level decisions (Turkish I/i, Korean Hangul order). Both sides walk one UTF-8 character at a time via intl_utf8_to_cp, then weight the codepoint:
- cp < alpha_cnt: weight is weight_ptr[cp] (or ZERO for SPACE).
- cp >= alpha_cnt: weight is the codepoint itself — preserves total order even for codepoints the locale did not enumerate, at the cost of placing them in pure Unicode order rather than locale order.
alpha_cnt is 352 for Turkish (LANG_CHAR_COUNT_TR, large enough to cover the Turkish alphabet plus surrounding Latin Extended-A), 256 for English.
Full UCA — lang_strmatch_utf8_uca_w_coll_data
Section titled “Full UCA — lang_strmatch_utf8_uca_w_coll_data”When the collation has expansions (uca_opt.sett_expansions) or strength above primary, CUBRID drops into the multi-level UCA comparator at language_support.c:4368. The structure is the textbook UCA “level-by-level, restart on tie”:
- Level 1 (primary) compare via lang_strmatch_utf8_uca_w_level (cd, 0, ...). Return on non-zero.
- If strength is TAILOR_PRIMARY only: optionally do level 3 (case) if sett_caseLevel. sett_caseFirst == 1 (upper-first) reverses the sign.
- Level 2 (accent). If sett_backwards (French L2 reverse), call lang_back_strmatch_utf8_uca_w_level. Else forward.
- Level 3 (case).
- Level 4 (variable/quaternary, UCA_L4_W 16-bit weights) — only if sett_strength >= TAILOR_QUATERNARY.
A cmp_offset is threaded through across levels: level 2 must restart at the offset where level 1 declared the prefix equal, not at byte 0. This implements the “longest equal prefix at this level” semantics correctly.
Inside one level — codepoints, contractions, expansions
Section titled “Inside one level — codepoints, contractions, expansions”lang_get_uca_w_l13 (line 3376) translates one source character into a weight-array pointer:
- Decode the codepoint via intl_utf8_to_cp.
- If cp < cd->w_count and the codepoint is a contraction starter (cp_first_contr_offset <= cp < cp_first_contr_offset + cp_first_contr_count and there is enough remaining text), call lang_get_contr_for_string (line 3296) to look up a contraction. On hit, *uca_w_l13 = contr->uca_w_l13, *num_ce = contr->uca_num, *str_next = str + contr->size, and the high bit INTL_MASK_CONTR is set in *cp_out.
- Otherwise the per-codepoint slot: &cd->uca_w_l13[cp * cd->uca_exp_num] with cd->uca_num[cp] valid CEs. The table is row-major with uca_exp_num (max CE count across the locale) as stride. Wasteful in space, one-multiply-one-add for lookup.
- For codepoints beyond w_count, return a single max-weight CE (0xFFFFFFFF) so they sort last.
lang_get_contr_for_string is a two-level filter: cp_first_contr_array[cp - offset] returns the index of the first contraction starting with cp (or -1), then a linear walk through contr_list (kept in lexicographic order) does memcmp until match or strict gap.
LANG_LOCALE_DATA — locale beyond collation
Section titled “LANG_LOCALE_DATA — locale beyond collation”struct lang_locale_data (language_support.h:190) carries per-(language, codeset) data that is not about ordering: lang_name, lang_id, codeset, two ALPHABET_DATA (general string casing and identifier casing), default_lang_coll back-pointer, TEXT_CONVERSION *txt_conv (console MBCS ↔ UTF-8), date/time/timestamp format strings, calendar names (months, days, AM/PM and their parse orders), number_decimal_sym, number_group_sym, default_currency_code, UNICODE_NORMALIZATION unicode_norm, MD5 checksum, and a deferred initloc callback. is_user_data is true if loaded from a .so, false for built-in.
The next_lld chain links same-language different-codeset locales: lc_Korean_iso88591 → lc_Korean_utf8 → lc_Korean_euckr. A session that switches charset can still resolve ko_KR to a locale.
ALPHABET_DATA (locale_support.h:437) holds casing data: a_type (UNICODE | ASCII | TAILORED), l_count (covered codepoints), lower_multiplier/upper_multiplier (typically 1; ≥ 2 when expanding), and the row-major arrays lower_cp/upper_cp indexed at [cp * multiplier .. +multiplier - 1]. The multiplier ≥ 2 case handles “ß → SS” upper-casing: intl_upper_string (line 1573) walks codepoint by codepoint, looks up &upper_cp[cp * upper_multiplier], and emits up to multiplier output codepoints (zeros terminate early).
Two alphabets per locale because Turkish forces the split: the SQL standard wants ASCII-only case-insensitivity for identifiers (so SELECT and select are the same keyword), but a Turkish-locale full alphabet would fold 'I' to 'ı' — breaking SQL keywords. intl_identifier_casecmp (line 2783) uses ident_alphabet; db_string_lower uses alphabet.
LDML build pipeline — genlocale
Section titled “LDML build pipeline — genlocale”locales/make_locale.sh drives the CMake target cubrid_locale_<lang>:
```mermaid
flowchart TD
    L1[locales/data/ldml/cubrid_xx_XX.xml]
    L2[locales/data/ducet.txt]
    L3[locales/data/unicodedata.txt]
    L4[locales/data/codepages/CP949.TXT etc.]
    G[genlocale binary]
    C1[loclib_*/cubrid_xx_XX.c]
    C2[libcubrid_xx_XX.so]
    S[CUBRID server runtime]
    L1 --> G
    L2 --> G
    L3 --> G
    L4 --> G
    G --> C1
    C1 --> C2
    C2 --> S
```
locale_compile_locale (locale_support.c:4573) drives Expat over the LDML XML through start_element_ok / end_element_ok callbacks (locale_support.c:907), building three internal structures: LOCALE_DATA (calendar names, formats, currency, alphabets, normalization parameters); for each <collation> a LOCALE_COLLATION with a COLL_TAILORING (parsed rules) plus an opt_coll slot for post-optimization weights; cross-locale shared data, deduped to keep the unicode mapping tables from being copied per locale.
LDML rules (<rules>) parse into struct tailor_rule (locale_support.h:246): T_LEVEL level, an anchor_buf, a reference buffer (r_buf) with logical position (r_pos_type — RULE_POS_FIRST_VAR, RULE_POS_LAST_NON_IGN, …) or anchor character, a direction (after/before), and a tailored buffer t_buf. A rule like & a < b becomes level=PRIMARY, anchor=a, direction=AFTER, t_buf=b. Multi-character rules (& ch <<< Ch) flag multiple_chars = true and represent contractions plus tertiary case overlays.
CUBRID also has a vendor extension CUBRID_TAILOR_RULE (locale_support.h:279) for absolute weight assignment — LDML’s rule grammar can only anchor relative to other characters, so it cannot say “set U+0020 to weight [0.0.0.0]”. The <cubridrules> element solves this:
```xml
<collation type="utf8_gen">
  <settings id="32" strength="quaternary" caseLevel="on" caseFirst="upper" .../>
  <cubridrules>
    <set><scp>20</scp><w>[0.0.0.0]</w></set> <!-- SPACE is primary-ignorable -->
  </cubridrules>
</collation>
```

After parsing, genlocale deduplicates: if two collations end up bit-identical in weights[], only one copy is exported and the other’s coll_data_ref.coll_weights_ref names the donor symbol (locale_mark_duplicate_collations). locale_save_all_to_C_file emits a .c of static const arrays plus exported symbols coll_<id>_weights, coll_<id>_uca_w_l13, coll_<id>_contr_list. CMake compiles each into libcubrid_<locale>.so.
Runtime load — lang_load_coll_from_lib
Section titled “Runtime load — lang_load_coll_from_lib”lang_init → init_user_locales walks cubrid_locales.txt, calls lang_load_library (line 7272 — dlopen with fallback path), lang_load_count_coll_from_lib, lang_load_get_coll_name_from_lib, and finally lang_load_coll_from_lib (line 7105) to populate runtime COLL_DATA.
The function is a sequence of SHLIB_GET_VAL / SHLIB_GET_ADDR calls (which expand to dlsym plus a symbol-name stringification helper) — coll_name, coll_id, coll_sett_strength, coll_w_count, coll_uca_exp_num, coll_count_contr. If count_contr > 0, fetch coll_contr_list, coll_contr_min_size, coll_cp_first_contr_*. If uca_opt.sett_expansions, fetch coll_uca_w_l13 (and coll_uca_w_l4 if quaternary). Else fetch the simpler coll_weights.
The _W_REF variants resolve the dedupe forwarding from the build-time merge pass — if this collation’s weights[] was merged with another collation’s, the _ref symbol points at the donor.
Comparator binding follows: a UCA collation with expansions gets lang_strcmp_utf8_uca/lang_strmatch_utf8_uca; with only contractions, the _w_contr variants; with only remapped weights, lang_fastcmp_byte or lang_strcmp_utf8.
The collation MD5 (COLL_DATA.checksum) is verified at session handshake by lang_check_coll_compat (language_support.h:348) — this is how CUBRID catches “client and server linked against different locale libraries” before any data flows.
B+Tree integration
Section titled “B+Tree integration”The B+Tree is the heaviest consumer — every key insert, every range scan, every page split. btree_compare_individual_key_value (btree.c:19602) handles NULL ordering (controlled by TP_DOMAIN.is_desc), then dispatches via key_domain->type->cmpval (key1, key2, 2, 1, NULL, key_domain->collation_id). The collation flows from the index TP_DOMAIN.collation_id through the type’s cmpval (e.g., pr_varchar_cmpval) into QSTR_COMPARE, which expands to (LANG_GET_COLLATION(id))->fastcmp(...). In release builds LANG_GET_COLLATION(id) is lang_Collations[id] — one indexed load and one indirect call.
Page split needs a separator K with last_left <= K < first_right — the split_key callback. For UTF-8 binary, lang_split_key_utf8 walks both until divergence at byte position i and returns the prefix of str2 up to and including byte i. For UCA collations, two byte sequences can compare equal (NFC vs NFD), so lang_split_key_w_exp walks the weight stream instead.
next_coll_seq produces “smallest key strictly greater than this” for range-scan boundary increment: for binary, increment the last byte; for UCA, walk next_cp[] per codepoint.
mht2str feeds hash join and OID list directories. It must agree with collation equality (a CI hash on 'foo' must equal that of 'FOO'), so it walks weights, not raw bytes. lang_mht2str_byte sums all bytes mod a prime; lang_mht2str_utf8_exp walks UCA L1 weights.
```mermaid
flowchart TD
    Q[SQL: SELECT ... WHERE name = 'foo'] --> P[Parser: pt_check_collations]
    P --> X[XASL: collation_id propagated to expression]
    X --> E[Executor: pr_varchar_cmpval]
    E --> M[QSTR_COMPARE macro]
    M --> L[LANG_GET_COLLATION coll_id]
    L --> V[lang_Collations 256 vtable]
    V --> F[lang_collation->fastcmp pointer]
    F --> A[lang_fastcmp_byte / lang_strcmp_utf8 / lang_strcmp_utf8_uca]
    A --> R[int comparison result]
    B[B+Tree: btree_search] --> E
    S[Sort: ext_sort] --> E
    H[Hash join: heap_hash] --> V
    V -- mht2str --> H
```
LIKE and pattern matching
LIKE, MATCH, REGEXP_LIKE all funnel through strmatch. The pattern interpretation is in db_string_like (string_opfunc.c), but the per-character match is the collation’s strmatch. The wildcard semantics:
- `_` matches exactly one character (codepoint, not byte).
- `%` matches zero or more characters.
- An escape character (`ESCAPE '\\'` clause) demotes the next character to literal.
For UTF-8 binary, lang_strmatch_utf8 (line 2830) handles this by walking codepoints in both pattern and source, with a recursive descent into the source on %. The escape character is matched by byte (memcmp (str2, escape, str2_next - str2) == 0) — it can be multi-byte.
The COLL_OPT.allow_like_rewrite flag (language_support.h:144) controls whether the optimizer can rewrite col LIKE 'foo%' into col >= 'foo' AND col < 'fop' for index-friendly pushdown. This is true for CS collations (where appending the next codepoint preserves order) and false for CI collations (where 'FOO' can sort between 'foo' and 'fop', breaking the rewrite).
Trailing-space treatment
CHAR vs. VARCHAR semantics in CUBRID follow the SQL standard’s PAD SPACE / NO PAD distinction. When a CHAR(10) holds 'foo', the storage layer pads to ten bytes with U+0020 (or the locale’s space character). When comparing CHAR to VARCHAR, trailing spaces are ignored — 'foo ' = 'foo' is true.
This is implemented by carrying ignore_trailing_space through every comparator. The weights_ti and next_cp_ti variants of the weight tables are pre-built with SPACE having weight 0, so the comparator does no special-casing in the hot loop. The decision to use _ti happens at type-coercion time:
```c
// QSTR_COMPARE call site, e.g. in db_string_compare
QSTR_COMPARE (coll_id, lhs_buf, lhs_size, rhs_buf, rhs_size,
              /* ignore_trailing_space = */ tp_is_padded_string (domain1, domain2));
```

For binary (coll_Binary, ID 9), the answer is always “no, never ignore” — binary strings are byte-exact.
The five-codepoint dance for casing
Casing has four distinct entry points, with subtly different behaviour:
| Function | Used for | Special-cased Turkish? |
|---|---|---|
| intl_lower_string (intl_support.c:1684) | LOWER(s), UPPER(s) SQL functions | yes (alphabet has lang_lower_i_TR) |
| intl_identifier_lower (intl_support.c:2960) | identifier name resolution | no (uses ident_alphabet) |
| intl_identifier_casecmp (intl_support.c:2783) | identifier comparison without folding | yes (per locale rules) |
| intl_identifier_casecmp_w_size | identifier compare with explicit sizes | yes |
intl_lower_string walks a UTF-8 stream, fetches the codepoint, indexes into alphabet->lower_cp[cp * lower_multiplier], and emits the result as one or more codepoints back to UTF-8. The size discrepancy is precomputed by intl_lower_string_size so the caller can allocate properly.
The lower_multiplier ≥ 2 case is rare but real: 'İ' (U+0130) lowercases to 'i' + U+0307 (combining dot above) per Unicode default casing, so a Turkish-aware locale that wants Unicode-default casing has multiplier 2.
Normalization
UNICODE_NORMALIZATION (locale_support.h:492) holds NFC composition tables built from unicodedata.txt. The runtime API is in unicode_support.h:
```c
bool unicode_string_need_compose (const char *str_in, int size_in,
                                  int *size_out,
                                  const UNICODE_NORMALIZATION *norm);
void unicode_compose_string (const char *str_in, int size_in,
                             char *str_out, int *size_out,
                             bool *is_composed,
                             const UNICODE_NORMALIZATION *norm);
```

CUBRID does not normalize automatically on insert — the application is expected to send NFC. The compose API is exposed for client-side use (libcs only, #if !defined (SERVER_MODE)) and for the parser to normalize identifier names. For collation, the UCA tables are built against NFC, so a non-NFC string compared to an NFC string can mis-compare; this is documented as a constraint.
Configuration knobs
The runtime uses environment variables and cubrid_locales.txt:
- `CUBRID_CHARSET` / system locale → `lang_charset()`, the system codeset.
- `CUBRID_LANG` → message language for `cubrid.msg`.
- `cubrid_locales.txt` (under `$CUBRID/conf/`) → list of LDML locales to load. Each entry names the LDML XML and the path to the precompiled shared library.
The system collation LANG_SYS_COLLATION is derived from lang_charset():
```c
// LANG_GET_BINARY_COLLATION — src/base/language_support.h:121
#define LANG_GET_BINARY_COLLATION(c) \
  ((c) == INTL_CODESET_UTF8 ? LANG_COLL_UTF8_BINARY : \
   ((c) == INTL_CODESET_KSC5601_EUC ? LANG_COLL_EUCKR_BINARY : \
    ((c) == INTL_CODESET_ISO88591 ? LANG_COLL_ISO_BINARY : \
     LANG_COLL_BINARY)))
```

A column with no explicit COLLATE clause inherits this collation. The default DB charset can be set at `cubrid createdb` time and is persisted in the `db_root` system catalog row.
Coercion — LANG_RT_COMMON_COLL
When a binary operator (=, <, LIKE) is given two strings of different collations, the engine picks one — but only if at least one side is “coercible”. Built-in binary collations are coercible (LANG_IS_COERCIBLE_COLL accepts them); explicitly-collated values are not. The macro:
```c
// LANG_RT_COMMON_COLL — src/base/language_support.h:65
#define LANG_RT_COMMON_COLL(c1, c2, coll) do { \
    coll = -1; \
    if ((c1) == (c2)) coll = (c1); /* trivial */ \
    else if (LANG_IS_COERCIBLE_COLL(c1)) { \
      if (!LANG_IS_COERCIBLE_COLL(c2)) coll = (c2); \
      else if ((c2) == LANG_COLL_ISO_BINARY) coll = (c2); \
    } else if (LANG_IS_COERCIBLE_COLL(c2)) coll = (c1); \
  } while (0)
```

The result coll == -1 means “no common collation” — the engine raises ER_QSTR_INCOMPATIBLE_COLLATIONS. This is CUBRID’s analogue of MySQL’s “Illegal mix of collations” or Postgres’s “could not determine which collation to use”.
What’s missing / known limitations
- `intl_convert_charset` is a stub. User-level `CONVERT(s USING charset)` is not supported; only the parser/coercion-layer pairwise functions run.
- No UTF-16 or UTF-32 — wire and storage are UTF-8 only.
- No automatic NFC normalization on input. NFD text in a UCA-collation column produces silently mis-sorted index entries.
- Single MD5 checksum per collation; no analogue of Postgres’s `pg_collation.collversion` index-invalidation hook.
- Codeset set is fixed at four. Adding GBK or Shift-JIS would touch `enum intl_codeset`, the comparator family, the check-validity functions, and the conversion pairs.
How to read the code
Layer-cake order: bytes → codepoints → collation elements → vtable dispatch → consumers.
- `intl_support.h` — `enum intl_codeset`, `INTL_NEXT_CHAR`, `INTL_CODESET_MULT`.
- `intl_support.c` — `intl_utf8_to_cp` (3754), `intl_cp_to_utf8` (3661), `intl_back_utf8_to_cp` (3812), `intl_check_utf8` (3950).
- `language_support.h:159` — `struct lang_collation` (the vtable).
- `language_support.c` — `built_In_collations[]` (830) → `lang_init_builtin` (850) → individual static declarations (lines 696, 715, 736, 755…).
- `language_support.c:5453` — `lang_fastcmp_byte` (the hot path).
- `language_support.c:2830` — `lang_strmatch_utf8` (codepoint walker).
- `language_support.c:4368` — `lang_strmatch_utf8_uca_w_coll_data` (UCA driver) → `lang_get_uca_w_l13` (3376).
- `locale_support.h` — `struct coll_data` (354), `struct coll_tailoring` (391); `locale_lib_common.h` for the shared-library ABI.
- `language_support.c:7105` — `lang_load_coll_from_lib` (runtime binding).
- `string_opfunc.h:59-74` — `QSTR_COMPARE`, `QSTR_MATCH`, `QSTR_NEXT_ALPHA_CHAR`, `QSTR_SPLIT_KEY`.
- `btree.c:19602` — `btree_compare_individual_key_value` (the consumer).
References
- `src/base/intl_support.h` / `intl_support.c` — codeset definitions, UTF-8/EUC-KR/ISO conversion, validation, identifier casing.
- `src/base/language_support.h` / `language_support.c` — `LANG_COLLATION` vtable, built-in collations, locale loading, system collation/charset.
- `src/base/locale_support.h` / `locale_support.c` — LDML parser, `genlocale` build-time machinery, `COLL_DATA` and `COLL_TAILORING`.
- `src/base/locale_lib_common.h` — locale shared library ABI: `COLL_CONTRACTION`, `UNICODE_MAPPING`, `CONV_CP_TO_BYTES`.
- `src/base/unicode_support.h` / `unicode_support.c` — NFC/NFD composition and decomposition.
- `src/base/charset_converters.h` — single- and double-byte conversion table format from libiconv (`Summary16`, `ksc5601.h`, `jisx0212.h`).
- `src/query/string_opfunc.h` — `QSTR_COMPARE`, `QSTR_MATCH`, `QSTR_NEXT_ALPHA_CHAR`, `QSTR_SPLIT_KEY` macros.
- `src/storage/btree.c` — `btree_compare_individual_key_value`, the B+Tree’s collation-aware key comparator.
- `locales/data/ldml/cubrid_*.xml` — LDML locale definitions; `common_collations.xml` for vendor extensions.
- `locales/data/ducet.txt` — Default Unicode Collation Element Table.
- `locales/data/unicodedata.txt` — Unicode character data for alphabet and normalization generation.
- `locales/data/codepages/CP949.TXT`, `CP932.TXT`, etc. — non-Unicode codepage tables for `TEXT_CONVERSION`.
- `locales/make_locale.sh` — build entry point that invokes `genlocale` per locale.