CUBRID Semantic Check — Name Resolution, Type Checking, Constant Folding, and Statement-Specific Validation

Contents:

Theoretical Background
Common DBMS Design
CUBRID’s Approach
Source Walkthrough
Cross-check Notes
Open Questions
Sources

Theoretical Background

The semantic-check phase of a SQL compiler sits between the parser and the optimizer. The parser proves the query is grammatically well-formed; the optimizer proves a chosen plan is cheap. Between them, semantic analysis proves the query is meaningful against the catalog: every name resolves to something, every operator’s arguments have types it accepts, and every transformation that removes degrees of freedom for the optimizer (constant folding, predicate normalization) happens here so the optimizer sees a single canonical form.

Three textbook subproblems live underneath this single pass.

Symbol resolution and scoping. The Dragon Book frames it (Aho et al., ch. 6 §“Symbol Tables”, ch. 7 §“Scope”) as a tree-walk that maintains a stack of scopes; an inner scope shadows an outer one, and a free name is bound to the closest scope where the name is declared. SQL’s scope rules are largely flat — a SELECT introduces a scope spanning its FROM, WHERE, GROUP BY, HAVING, and ORDER BY — but two recursive cases push them into the same scope-stack discipline as a programming language: subqueries (each SELECT opens a new scope; a free reference inside walks outward to find a FROM-list provider) and derived tables / CTEs (an inner SELECT produces an aliased relation that the outer scope must see after the inner has been resolved).

The data structure is the same as Wirth-style nested-block compilers: a linked list of scopes, head is innermost. CUBRID’s SCOPES struct (name_resolution.c:78) is exactly this list — a next pointer to the outer scope, a list of PT_SPEC nodes that the scope makes visible, and a correlation_level counter that tells the resolver “I had to walk N scopes outward to bind this name”, which the optimizer later uses to decide whether a subquery is correlated.

Type systems with implicit conversion. Database System Concepts (Silberschatz et al., ch. 4 §“Built-in Data Types”, ch. 5 §“Functions and Procedures”) formalizes a SQL type system as a partial order of types with explicit and implicit conversion rules. Every operator has a signature — a tuple of input types and a return type — and the type-check pass either picks the unique matching signature or chooses one by promoting the actual arguments through implicit casts. The two canonical promotions are integer ↔ floating-point (preserves value, may lose precision) and numeric ↔ string (succeeds only if the string is parseable as a number; otherwise a runtime error). Engines vary on overload selection — exact match preferred, otherwise minimum-cost promotion, otherwise reject — and on whether the rules are operator-driven (each operator has hand-written rules) or signature-table-driven (a single matcher reads a per-operator table). CUBRID uses both: operator-driven rules for arithmetic and a few special cases, plus a signature-table matcher (pt_apply_expressions_definition) for the long tail.

Constant folding as compile-time evaluation. The Dragon Book chapter 9 §“Loop-Invariant Computations” introduces folding as a peephole optimization: if every operand of an expression is a literal, evaluate it now and replace the expression with its result. In SQL the prize is twofold — the executor avoids per-tuple work, and the optimizer sees a simpler predicate that may unlock further rewrites (e.g., WHERE 1=1 AND foo > 5 becoming WHERE foo > 5 after folding the first conjunct). The hazard is that folding must respect side-effect-bearing operators (RAND, NOW, BENCHMARK) and not-yet-bound host variables. The compiler must classify operators by purity and not fold expressions that contain anything impure or any unresolved host variable.

CNF for predicate normalization. Quine’s Methods of Logic and the classical resolution-theorem-proving literature (Robinson 1965; Nilsson’s Principles of Artificial Intelligence, the “resolution1” chapter cited by the deck) define CNF as the conjunction-of-disjunctions form: every formula (A ∨ B) ∧ (C ∨ ¬D ∨ E) ∧ …. Three rewrite steps reach it mechanically — eliminate implication and equivalence, push negation inward via De Morgan, distribute disjunction over conjunction. A query optimizer wants CNF for two reasons: (a) conjunct-by-conjunct selectivity estimation is sound only when conjuncts are independent — each (...) clause becomes a single predicate that the cost model multiplies; (b) predicate pushdown and index selection treat each conjunct as a candidate for a separate index access path, and a CNF predicate is already a flat list of such candidates. A non-CNF predicate would force the optimizer to enumerate all factorizations.

These four ideas — scope stack, type promotion, constant folding, predicate normalization — appear over and over in the rest of this document. They are the load-bearing concepts that the CUBRID source expresses in C.

Common DBMS Design

Every relational engine — PostgreSQL, Oracle, SQL Server, MySQL, CUBRID — adopts roughly the same five-stage shape between parser and optimizer. The names differ; the work does not.

Stage 1: parse-tree walk that resolves spec / FROM-list relations

The first walk binds the FROM-list itself. PostgreSQL’s transformFromClause (in parser/parse_clause.c) takes each RangeVar (a parser node naming a relation) and resolves it against the catalog, returning a RangeTblEntry that the rest of the pass can index by ordinal. Oracle’s kkqs / kkpamqry does the same role — turn each table reference into a slot in the query block’s range table. CUBRID’s parallel is pt_flat_spec_pre: it walks the parse tree, and for each PT_SPEC (CUBRID’s term for a FROM-list element, including subselects, derived tables, and CTEs) it expands the spec into a flat_entity_list of physical class references and attaches the workspace’s DB_OBJECT for each.

The data structure shape is identical across engines: each FROM-list entry gets a stable identifier (PostgreSQL: 1-based ordinal in the range table; CUBRID: the address of the PT_SPEC cast to UINTPTR, re-purposed as info.spec.id). Every later name reference will carry that identifier as its “I belong to this relation” backpointer.

Stage 2: name binding under a scope stack

The second walk binds every other name in the tree to one of the FROM-list entries from stage 1, walking a stack of scopes that grows as the walker descends into subqueries and shrinks as it pops out. Every engine implements this as a recursive descent over the parse tree carrying a mutable scope-list argument.

PostgreSQL’s transformExpr chain calls colNameToVar against the current ParseState’s p_namespace. Oracle does the same against the parsed query block’s range table. CUBRID’s pt_bind_names / pt_bind_name_or_path_in_scope walks SCOPES *scopes (a linked list, innermost first) and resolves names against the specs of each scope in turn. The first match wins; an ambiguity (same column name in two specs in the same scope) errors out as MSGCAT_SEMANTIC_AMBIGUOUS_REF_TO.

The classical correctness invariant — “no name resolves to a relation that was introduced after the current scope was opened” — is enforced by the order in which the walker pushes scopes. CUBRID pushes a new SCOPES frame at every PT_SELECT, PT_UPDATE, PT_DELETE, etc., and unwinds on the post-walk; this is the same discipline as Wirth-style nested blocks.

Stage 3: type checking with overload resolution

After every name has a type (because every PT_SPEC has been mapped to a catalog object whose attributes are typed), the third walk computes a type for every internal node — PT_EXPR, PT_FUNCTION, PT_VALUE, PT_HOST_VAR. The classical algorithm is bottom-up: leaves carry their declared type, an internal node picks an overload of its operator that accepts the children’s types (inserting implicit casts where the language allows), and the parent receives the overload’s return type.

PostgreSQL’s coerce_to_target_type and func_select_candidate and Oracle’s kkmpc cost-rank candidate signatures by exact match versus implicit conversion. CUBRID’s pt_apply_expressions_definition (type_checking.c:5778) reads a per-operator EXPRESSION_DEFINITION table of overloads, scores each by pt_are_equivalent_types plus pt_are_unmatchable_types, and picks the highest-scoring signature (falling back to the first if no signature matches all three argument slots). Picked signatures cause pt_coerce_expression_argument to wrap mismatched arguments in PT_CAST expressions.

Stage 4: constant folding

Once types are known, every expression whose arguments are all PT_VALUE can be evaluated at compile time. PostgreSQL’s eval_const_expressions, Oracle’s kkecf, MySQL’s Item::const_item propagate folding through the tree.

CUBRID does this in pt_fold_const_expr (for PT_EXPR) and pt_fold_const_function (for PT_FUNCTION). The classification of “foldable” is operator-driven: PT_RAND, PT_BENCHMARK, anything involving an unresolved PT_HOST_VAR is excluded by an early-return check. Folded subtrees are replaced in place by the resulting PT_VALUE.

Stage 5: predicate normalization (CNF)

Every engine that does cost-based optimization wants the WHERE-clause predicate in CNF, because the conjunct-by-conjunct selectivity model multiplies independent conjuncts and the index-scan rewriter looks for single-attribute conjuncts to push into index access methods. PostgreSQL’s canonicalize_qual does exactly this. CUBRID has a dedicated cnf.c whose entry point pt_do_cnf walks the tree and calls pt_cnf on each WHERE and HAVING predicate; pt_cnf itself first applies De Morgan’s law to push negation inward (via pt_and_or_form), then distributes disjunction over conjunction (via pt_transform_cnf_pre / _post). The result is a linked list of PT_EXPR nodes joined by implicit AND on next; each conjunct can carry one OR-list joined by or_next, allowing CNF to be represented as a flat 2-D structure without a tree.

Theory ↔ CUBRID mapping

Theoretical concept	CUBRID name
Symbol-table per-scope frame	`SCOPES { next, specs, correlation_level, location }` (`name_resolution.c:78`)
Whole-walk symbol-table state	`PT_BIND_NAMES_ARG { scopes, spec_frames, sc_info }` (`name_resolution.c:88`)
FROM-list entry	`PT_SPEC` carrying `flat_entity_list`, `entity_name`, `as_attr_list`, `id`, `location`, `natural`, `join_type`, `on_cond` (`parse_tree.h`)
Bound-to-spec back-pointer on a name	`PT_NAME::info.name.spec_id` (`parse_tree.h`)
Resolved relation (`db_object`) cache on a name	`PT_NAME::info.name.db_object` (`parse_tree.h`)
Per-statement type-check / fold context	`SEMANTIC_CHK_INFO { top_node, attrdefs, system_class, donot_fold, ... }` (`semantic_check.h`)
Operator-overload table (signature-driven)	`EXPRESSION_DEFINITION { op, overloads_count, overloads[MAX_OVERLOADS] }` (`type_checking.c`)
Implicit cast insertion	`pt_coerce_expression_argument` wrapping in `PT_CAST` with target `TP_DOMAIN`
”Maybe” type for unresolved host vars	`PT_TYPE_MAYBE` + `expected_domain` field on `PT_HOST_VAR`
Per-operator late-binding decision	`pt_is_op_hv_late_bind` (`type_checking.c:20260`)
Constant-fold gate flag	`SEMANTIC_CHK_INFO::donot_fold` and `BENCHMARK` early return in `pt_fold_constants_pre`
CNF “this conjunct already normalized” flag	`PT_EXPR_INFO_CNF_DONE` set inside `pt_cnf`
Conjunct-list / disjunct-list link fields	`PT_NODE::next` for AND, `PT_NODE::or_next` for OR (used as a 2-D linked list)

CUBRID’s Approach

The semantic-check pass enters at pt_compile (in compile.c), which is called once per top-level statement returned by the parser. pt_compile is a thin wrapper that calls pt_semantic_check, which in turn delegates to pt_check_with_info. All real work lives there.

// pt_compile — src/parser/compile.c
PT_NODE *
pt_compile (PARSER_CONTEXT * parser, PT_NODE * volatile statement)
{
  PT_NODE *next;

  PT_SET_JMP_ENV (parser);

  if (statement)
    {
      next = statement->next;
      statement->next = NULL;

      statement = pt_semantic_check (parser, statement);

      /* restore link */
      if (statement)
        {
          statement->next = next;
        }
    }

  PT_CLEAR_JMP_ENV (parser);

  return statement;
}

Note statement->next = NULL — the wrapper detaches the current statement from a multi-statement list before checking it, so each statement is checked in isolation. The setjmp/longjmp envelope is the parser’s panic-mode error recovery; a deep semantic error can unwind out of nested calls without the boilerplate of propagating return codes.

The pipeline

flowchart LR
  PARSE["parser_main\n(bison/flex)"] --> PT["PT_NODE tree\n(unanalyzed)"]
  PT --> COMP["pt_compile"]
  COMP --> SC["pt_semantic_check"]
  SC --> CWI["pt_check_with_info\n(per-statement switch)"]
  CWI --> RN["1) pt_resolve_names\nname resolution"]
  RN --> CW["2) pt_check_where\naggregate-in-WHERE check"]
  CW --> RH["3) pt_check_and_replace_hostvar\nresolve host-var values"]
  RH --> SCL["4) parser_walk_tree\n(post=pt_semantic_check_local)\nstatement-specific\nrewrites + checks\n· pt_semantic_type"]
  SCL --> EI["pt_expand_isnull_preds"]
  EI --> CNF["pt_do_cnf / pt_cnf\n(predicate normalization)"]
  CNF --> XASL["xasl_generation"]
  SCL -. internal call .-> PST["pt_semantic_type"]
  PST --> EVTP["pt_eval_type_pre\n(top-down)"]
  PST --> EVTPOST["pt_eval_type\n(bottom-up)\n  → pt_eval_expr_type\n  → pt_eval_function_type\n  → pt_where_type"]
  PST --> FOLD["pt_fold_constants_pre\n· pt_fold_constants_post\n  → pt_fold_const_expr\n  → pt_fold_const_function"]

Figure 1 — Semantic-check pipeline: pt_compile drives pt_check_with_info through name resolution (pt_resolve_names), aggregate-in-WHERE check, host-var replacement, pt_semantic_check_local (which calls pt_semantic_type for type-check and constant-fold), pt_expand_isnull_preds, and CNF conversion.

The flowchart hides one subtle detail: the type-check and constant-fold walk inside pt_semantic_type is invoked recursively by the statement-specific rewrites, because some rewrites (e.g., wrapping a LIMIT clause as an INST_NUM() predicate) introduce new expressions whose types need to be re-derived. The top-level pt_check_with_info relies on the post-order callback pt_semantic_check_local to invoke pt_semantic_type once per PT_SELECT / PT_UNION / etc. The CNF pass at the end is run after all rewriting and folding settle.

Stage 1 — Name resolution

Inside pt_check_with_info, for any DML or query node, the first call is to pt_resolve_names.

// pt_resolve_names — src/parser/name_resolution.c (condensed)
PT_NODE *
pt_resolve_names (PARSER_CONTEXT * parser, PT_NODE * statement, SEMANTIC_CHK_INFO * sc_info)
{
  PT_BIND_NAMES_ARG bind_arg;
  PT_FLAT_SPEC_INFO info;

  bind_arg.scopes = NULL;
  bind_arg.spec_frames = NULL;
  bind_arg.sc_info = sc_info;

  /* Replace each Entity Spec with an Equivalent flat list */
  info.spec_parent = NULL;
  info.for_update = false;
  statement = parser_walk_tree (parser, statement, pt_flat_spec_pre, &info, pt_continue_walk, NULL);

  /* Mark PT_NAME nodes that are inside GROUP BY / HAVING. */
  statement = parser_walk_tree (parser, statement, pt_mark_group_having_pt_name, NULL, NULL, NULL);

  /* The main name-binding walk. */
  statement = parser_walk_tree (parser, statement, pt_bind_names, &bind_arg, pt_bind_names_post, &bind_arg);

  /* Resolve alias references in GROUP BY / HAVING. */
  statement = parser_walk_tree (parser, statement, pt_resolve_group_having_alias, NULL, NULL, NULL);

  /* Convert NATURAL JOIN into INNER/OUTER JOIN by synthesizing the equi-predicates. */
  statement = parser_walk_tree (parser, statement, NULL, NULL, pt_resolve_natural_join, NULL);

  /* FOR UPDATE: flag the affected specs. */
  if (statement && statement->node_type == PT_SELECT
      && PT_SELECT_INFO_IS_FLAGED (statement, PT_SELECT_INFO_FOR_UPDATE))
    {
      // ... condensed: flag each spec with PT_SPEC_FLAG_FOR_UPDATE_CLAUSE ...
    }

  return statement;
}

Five sub-passes execute in this order:

Spec expansion (pt_flat_spec_pre): for each PT_SPEC, expand entity_name into a flat_entity_list that includes inherited classes (CUBRID is an ORDBMS, so SELECT FROM Parent may need to touch every subclass) and the workspace’s DB_OBJECT. The ALL ... EXCEPT syntax can subtract specific subclasses; the resulting list is exact.
GROUP BY / HAVING marking (pt_mark_group_having_pt_name): tag every PT_NAME inside a GROUP BY or HAVING clause with CPTR_PT_NAME_IN_GROUP_HAVING so that, in the next pass, the resolver knows whether to prefer SELECT-list aliases over FROM-list columns when both could match.
Name binding (pt_bind_names + pt_bind_names_post): the main recursive walk that pushes / pops scope frames and resolves every PT_NAME against the scope stack via pt_bind_name_or_path_in_scope.
Alias resolution (pt_resolve_group_having_alias): for GROUP BY and HAVING expressions, replace any PT_NAME whose string matches a SELECT-list alias with a copy of the SELECT-list node. This is the “GROUP BY alias_name” feature — alias resolution is post-binding because it must let the binder fail to find the name in the FROM list first.
NATURAL JOIN rewriting (pt_resolve_natural_join): walk the tree, find any PT_SPEC whose info.spec.natural is true, and for each common attribute of its left- and right-hand specs synthesize an equi-predicate lhs.col = rhs.col and graft it onto the right-hand spec’s on_cond. After this rewrite the planner only sees a regular INNER JOIN with an explicit ON condition.

Spec expansion (`pt_flat_spec_pre`)

The reason CUBRID expands every spec into a list of physical relations rather than carrying the original entity_name is its ORDBMS heritage: SELECT * FROM Parent must produce rows from every subclass unless explicit ALL / EXCEPT clauses say otherwise. The flat_entity_list materializes that decision at semantic-check time, so the binder and the optimizer can both treat each list element as a regular relation reference.

For each PT_SPEC, the expansion attaches three pieces of state:

id — the address of the PT_SPEC node itself, cast to UINTPTR. This is the back-reference key that every later PT_NAME::info.name.spec_id will carry.
db_object — the workspace pointer fetched in pt_class_pre_fetch; every relation referenced by a query is locked and pre-fetched before semantic check runs, so the workspace is already populated.
flat_entity_list — a fresh linked list of PT_NAME nodes, one per physical class the query reaches. Each carries its own db_object so attribute lookup needs only the list element.

Name binding and the scope stack

The scope stack is a singly-linked list in declaration-order-reversed — innermost first. Each frame holds the relations that frame introduces.

// SCOPES — src/parser/name_resolution.c:78
typedef struct scopes SCOPES;
struct scopes
{
  SCOPES *next;                /* next outermost scope */
  PT_NODE *specs;              /* list of PT_SPEC nodes */
  unsigned short correlation_level;
                               /* how far up the stack was a name found? */
  short location;              /* for outer join */
};

typedef struct pt_bind_names_arg PT_BIND_NAMES_ARG;
struct pt_bind_names_arg
{
  SCOPES *scopes;
  PT_EXTRA_SPECS_FRAME *spec_frames;
  SEMANTIC_CHK_INFO *sc_info;
};

The correlation_level is the answer to “if the binder had to walk outward N scopes to find this name, what was N?” It is the optimizer’s hint that a subquery is correlated — non-zero level means the inner SELECT’s evaluation depends on a row from an outer relation, which forbids subquery-flattening transformations. The location is a companion to outer-join rewriting: every PT_NAME under an outer-join’s ON condition gets the same nonzero location so the optimizer knows the predicate cannot be hoisted out of the join.

flowchart TD
  subgraph OUTER["Outer SELECT scope"]
    OS["specs: history h, olympic o"]
  end
  subgraph INNER["Inner subquery scope (correlated)"]
    IS["specs: prize p"]
    IS --> ISN["Resolving p.year:\n  match in IS — level 0"]
    IS --> ISO["Resolving o.host_year:\n  no match in IS\n  walk to OS — level 1\n  → mark subquery correlated"]
  end
  INNER -- "next" --> OUTER
  ISO -. "spec_id ← &outer_PT_SPEC\nresolved ← 'o'" .-> OS

Figure 2 — Correlated subquery scope walk: resolving p.year matches in the inner scope at level 0; resolving o.host_year finds no match in the inner scope, walks to the outer scope, marks the subquery correlated at level 1, and binds spec_id to the outer PT_SPEC for o.

The walker pushes frames whenever it descends into a query node and pops on the way out. Pushing happens lazily inside pt_bind_scope, which gathers the PT_SPEC nodes the new scope makes visible, builds a SCOPES *new with new->next = bind_arg->scopes, and reassigns bind_arg->scopes = new. On the post-walk callback the scope is unlinked. The two-phase callback signature (pt_bind_names, pt_bind_names_post) of parser_walk_tree is what makes this push/pop reliable.

Resolving `*` in the SELECT list

SELECT * FROM code — the * is parsed as a PT_VALUE of a special “star” kind, not as a PT_NAME. The binder’s job at the SELECT-list level is to expand it. pt_resolve_star does the expansion: locate the PT_SPEC (or all specs, if the * is bare) that the star refers to, fetch each spec’s flat_entity_list’s DB_ATTRIBUTE set, and emit one fresh PT_NAME per attribute, linked by next. Each fresh PT_NAME then re-enters the regular binder and gets its spec_id and data_type set as if it had been written by the user.

flowchart LR
  STAR["PT_VALUE ('*')\nin SELECT list"] --> EXP["pt_resolve_star"]
  EXP --> NL["fresh PT_NAME list:\n  s_name → f_name → ..."]
  NL --> BIND["pt_bind_name_or_path_in_scope\n(re-enters binder)"]
  BIND --> DONE["each PT_NAME has\n  spec_id, resolved,\n  data_type filled"]

Figure 3 — SELECT * expansion: pt_resolve_star locates the matching PT_SPEC, emits a fresh PT_NAME per DB_ATTRIBUTE, and each new name re-enters pt_bind_name_or_path_in_scope to get spec_id and data_type filled as if written explicitly.

Derived tables, subqueries, CTEs

Subqueries (a PT_SELECT inside another PT_SELECT) and derived tables (a PT_SELECT directly under a FROM-list PT_SPEC) push a new scope. The order in which scopes resolve matters: a derived table must be resolved first, before its alias is offered to the outer scope, so that the outer scope can look up as_attr_list columns by their bound types. pt_bind_scope enforces this: when the binder hits a PT_SPEC whose derived_table is non-NULL, it descends into the derived table recursively, calls pt_semantic_type to type its select list, then threads the inner select-list types back into the spec’s as_attr_list. After this, the outer scope sees the derived table as a regular relation.

CTEs follow the same shape via pt_bind_names_in_with_clause / pt_bind_names_in_cte: the CTE’s body is bound as if it were a derived table; the CTE’s name and column list become a PT_SPEC that the WITH-clause’s query body sees in scope.

Oracle-style outer join

Oracle outer-join syntax — WHERE a.col(+) = b.col — pushes the outerness into the predicate rather than the join. The CUBRID parser detects the (+) and flags the corresponding PT_EXPR with PT_EXPR_INFO_RIGHT_OUTER or PT_EXPR_INFO_LEFT_OUTER. At name-resolution time, pt_check_Oracle_outerjoin (called inside pt_bind_names) finds these predicates, identifies the lhs and rhs specs, infers the join order, and rewrites the WHERE-clause predicate into an equivalent ANSI-style join: it sets the affected PT_SPEC’s join_type and moves the predicate from where to on_cond of the right-hand spec. The rest of the pipeline now sees a regular ANSI LEFT/RIGHT OUTER JOIN.

If the lhs and rhs specs are reversed in the FROM list relative to the predicate (FROM x, y WHERE y.i = x.i(+)), the rewriter swaps the join type:

// pt_check_Oracle_outerjoin (excerpt) — src/parser/name_resolution.c
if (lhs_spec && rhs_spec && lhs_spec->info.spec.id != rhs_spec->info.spec.id)
  {
    /* found edge: set join type and spec */
    if (lhs_location < rhs_location)   /* left-to-right in FROM order */
      {
        p_spec = lhs_spec;
        spec   = rhs_spec;
      }
    else                                /* SELECT ... FROM x, y WHERE y.i = x.i(+) */
      {
        join_type = (join_type == PT_JOIN_LEFT_OUTER) ? PT_JOIN_RIGHT_OUTER : PT_JOIN_LEFT_OUTER;
        p_spec    = rhs_spec;
        spec      = lhs_spec;
      }
  }

A trailing detail: a meaningless outer-join predicate (SELECT ... FROM x, y ON y.i(+) = 1) collapses to PT_JOIN_NONE, and the predicate is moved back to the WHERE clause — the rewriter explicitly recognizes that the (+) was a syntactic tic in this position and not a real join request.

NATURAL JOIN

NATURAL JOIN is sugar for “INNER JOIN with equi-predicates on every matching column name”. The parser carries it as a PT_SPEC with info.spec.natural == true, leaving the join-type bit PT_JOIN_NATURAL unused (it exists in the enum for completeness but the natural flag is what the rewriter reads). pt_resolve_natural_join walks the tree, and for each natural-join PT_SPEC builds two attribute lists — one from the lhs, one from the rhs — by reading each spec’s flat_entity_list (or the derived-table’s select-list / CTE’s attribute list, depending on the spec’s kind). Common attribute names are converted into lhs.col = rhs.col PT_EXPR nodes which are appended to the rhs spec’s on_cond:

-- before
SELECT DISTINCT h.host_year, o.host_nation
FROM history h NATURAL JOIN olympic o
WHERE o.host_year > 1950;

-- after pt_resolve_natural_join
SELECT DISTINCT h.host_year, o.host_nation
FROM history h INNER JOIN olympic o ON h.host_year = o.host_year
WHERE o.host_year > 1950;

The rewriter additionally flips the spec’s info.spec.natural to false so that subsequent passes treat it as a regular inner join.

Stage 2 — Aggregate-in-WHERE check (`pt_check_where`)

A small but mandatory check: aggregate (SUM, AVG, …) and analytic (ROW_NUMBER, RANK, …) functions are forbidden in WHERE clauses; they must move to HAVING.

// pt_check_where — src/parser/semantic_check.c:17503
PT_NODE *
pt_check_where (PARSER_CONTEXT * parser, PT_NODE * node)
{
  PT_NODE *function = NULL;

  /* check if exists aggregate/analytic functions in where clause */
  function = pt_find_aggregate_analytic_in_where (parser, node);
  if (function != NULL)
    {
      if (pt_is_aggregate_function (parser, function))
        {
          PT_ERRORm (parser, function, MSGCAT_SET_PARSER_SEMANTIC, MSGCAT_SEMANTIC_INVALID_AGGREGATE);
        }
      else
        {
          PT_ERRORm (parser, function, MSGCAT_SET_PARSER_SEMANTIC, MSGCAT_SEMANTIC_NESTED_ANALYTIC_FUNCTIONS);
        }
    }

  return node;
}

The message MSGCAT_SEMANTIC_NESTED_ANALYTIC_FUNCTIONS is reused for analytics-in-WHERE because the underlying classification function treats both as “function nodes that semantically belong elsewhere”.

Stage 3 — Host-variable replacement (`pt_check_and_replace_hostvar`)

A host variable is the ? (or :name) parameter that the client binds at execute time. After parsing it is a PT_HOST_VAR of PT_TYPE_MAYBE. The walk replaces those whose value the parser has already bound (e.g., literal placeholders in dynamic SQL the client supplied at compile time) with PT_VALUE nodes carrying the bound value, leaving genuinely free host vars as PT_HOST_VAR for late binding. The step also sets a cannot_prepare flag on the statement when it discovers a host var referencing an OBJECT (parameter) — those statements must be re-checked at execute time and so cannot be prepared once.

Stage 4 — Statement-specific rewrites + type checking

pt_semantic_check_local is the largest single function in semantic_check.c. It is the post-order callback of a parser_walk_tree in pt_check_with_info:

// pt_check_with_info (excerpt) — src/parser/semantic_check.c
node = parser_walk_tree (parser, node, NULL, NULL, pt_semantic_check_local, sc_info_ptr);

Because it is post-order only, it sees subqueries before their containing query — important for type propagation upward.

The function dispatches on node->node_type. For PT_SELECT it does:

SELECT-list checks: pt_check_into_clause (the INTO :var count must match the select-list count; INTO is forbidden inside a subquery). MSGCAT_SEMANTIC_NOT_SINGLE_COL reproduces the parser’s multi-column-subquery check at the semantic level. WITH INCREMENT FOR col rewrites into a hidden incr(col) selection.
GROUP BY checks: MSGCAT_SEMANTIC_SORT_SPEC_RANGE_ERR for a position-style GROUP BY beyond the select-list length; MSGCAT_SEMANTIC_NO_GROUPBY_ALLOWED for a host-var GROUP BY (the position would change at execute time); MSGCAT_SEMANTIC_CANNOT_USE_GROUPBYNUM_WITH_ROLLUP for the rare WITH ROLLUP HAVING GROUPBY_NUM() combination.
Aggregate / analytic argument checks: every aggregate / analytic function except a small whitelist (COUNT(*), ROW_NUMBER(), RANK(), DENSE_RANK(), CUME_DIST(), PERCENT_RANK()) must have an arg_list. MEDIAN cannot be used with OVER (ORDER BY ...). PARTITION BY NULL is removed and replaced by ORDER BY <const> to keep the analytic well-formed.
ORDER BY checks: strip ORDER BY if the query is a single-row aggregation that won’t benefit; range-check positional ORDER BY.
Hierarchical query checks (CONNECT BY): forbid CONNECT_BY_ISLEAF and CONNECT_BY_ISCYCLE outside the prior expr context; analyze LEVEL to decide whether a cycle check is needed.
SHOW statement rewrite: replace the placeholder select-list with the real meta-data list for the SHOW kind (e.g., SHOW TABLES → SELECT against a system catalog).
Derived-query checks: for each PT_SPEC carrying a derived table, verify that as_attr_list has unique names (MSGCAT_SEMANTIC_AMBIGUOUS_REF_TO) and that hidden columns get synthetic ha_<n> placeholder names so the count matches.
CAST well-formedness: pt_check_cast_op verifies the source type can be coerced to the target type; otherwise emits MSGCAT_SEMANTIC_CANT_COERCE_TO (target is in the coercion table but marked NO) or MSGCAT_SEMANTIC_COERCE_UNSUPPORTED (target not in the table at all).
LIMIT rewriting — see below.
pt_semantic_type invocation: finally, after all rewrites, pt_semantic_check_local calls pt_semantic_type on the subtree to type-check and constant-fold. This is the “type checking + constant folding” pass.

LIMIT → numbering expression

LIMIT is a syntactic shorthand for “stop after N rows”, but the optimizer has no notion of “stop” — every clause must turn into a predicate. The rewriter chooses the predicate based on the surrounding context:

Context	Rewrite
LIMIT with WHERE only	append `inst_num() <= ?` to WHERE
LIMIT with ORDER BY	append `orderby_num() <= ?` (the ordering is preserved before the cut)
LIMIT with GROUP BY	append `groupby_num() BETWEEN 1 AND ?` to HAVING
LIMIT with DISTINCT only	treat as ORDER BY case (distinct is order-equivalent for the purpose of cutting)
LIMIT inside aggregate	wrap the aggregate select as a derived table, apply LIMIT outside

-- example 4
SELECT * FROM code LIMIT 3;
=> SELECT code.s_name, code.f_name FROM code code WHERE (inst_num() <= ?:0);

-- example 1 — LIMIT pushes through to ORDERBY_NUM inside the analytic
SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY math) AS pdisc FROM scores LIMIT 5;
=> ... WITHIN GROUP (ORDER BY math FOR ORDERBY_NUM() BETWEEN 1 AND 5) ...

The rewrite is performed before pt_semantic_type is called, so the type checker sees a regular WHERE/HAVING with the number-expression in place. This is why pt_eval_type_pre’s LIMIT handling is also the spot that replaces LIMIT with the appropriate xxx_NUM() — the rewrite needs to know the surrounding clause kind, and eval_type_pre is top-down so that knowledge is available when the LIMIT subtree is visited.

`pt_semantic_type` — the type-check + fold engine

// pt_semantic_type — src/parser/type_checking.c
PT_NODE *
pt_semantic_type (PARSER_CONTEXT * parser, PT_NODE * tree, SEMANTIC_CHK_INFO * sc_info_ptr)
{
  /* ... condensed: derive sc_info_ptr->has_dblink from tree ... */

  /* do type checking */
  tree = parser_walk_tree (parser, tree, pt_eval_type_pre, sc_info_ptr, pt_eval_type, sc_info_ptr);
  if (pt_has_error (parser))
    {
      return NULL;
    }

  /* Parsing static sql is only for semantic check. Any kind of execution should be avoided */
  if (!parser->flag.is_parsing_static_sql)
    {
      PT_NODE *spec_list = NULL;
      /* do constant folding */
      tree = parser_walk_tree (parser, tree, pt_fold_constants_pre, &spec_list, pt_fold_constants_post, sc_info_ptr);
      if (pt_has_error (parser))
        {
          tree = NULL;
        }
    }

  return tree;
}

Two parser_walk_tree calls in sequence:

Type checking — pre-callback pt_eval_type_pre (top-down) sets things up that the children need (LIMIT rewrite, recursive-expr handling, derived-table outer-join flag), and post-callback pt_eval_type (bottom-up) computes each node’s type. pt_eval_type dispatches by node kind: PT_EXPR → pt_eval_expr_type, PT_FUNCTION → pt_eval_function_type, PT_CREATE_INDEX / PT_DELETE / PT_UPDATE → pt_where_type on their search conditions.
Constant folding — pre-callback pt_fold_constants_pre decides whether to descend (it stops descent on BENCHMARK to keep the benchmark target unfoldable), post-callback pt_fold_constants_post checks each PT_EXPR / PT_FUNCTION to see if every argument is a PT_VALUE and, if so, evaluates and replaces.

The static-SQL guard is for prepared-statement-only parsing where the statement is being checked but never executed; folding could rely on volatile state that does not exist in that mode.

Type checking for `PT_EXPR` — `pt_eval_expr_type`

The function receives a parent PT_EXPR whose children have already been typed (because pt_eval_type is the post-order callback). It extracts up to three argument slots (arg1, arg2, arg3) plus their host-var siblings if any (arg1_hv, arg2_hv, arg3_hv):

// pt_eval_expr_type — src/parser/type_checking.c (head, condensed)
static PT_NODE *
pt_eval_expr_type (PARSER_CONTEXT * parser, PT_NODE * node)
{
  PT_OP_TYPE op = node->info.expr.op;
  PT_NODE *arg1 = NULL, *arg2 = NULL, *arg3 = NULL;
  PT_NODE *arg1_hv = NULL, *arg2_hv = NULL, *arg3_hv = NULL;
  PT_TYPE_ENUM arg1_type = PT_TYPE_NONE, arg2_type = PT_TYPE_NONE;
  PT_TYPE_ENUM arg3_type = PT_TYPE_NONE, common_type = PT_TYPE_NONE;
  /* ... condensed: enumeration-comparison fast path ... */

  arg1 = node->info.expr.arg1;
  if (arg1)
    {
      arg1_type = arg1->type_enum;
      if (arg1->node_type == PT_HOST_VAR && arg1->type_enum == PT_TYPE_MAYBE)
        arg1_hv = arg1;
      /* ... condensed: unwrap unary-minus / PRIOR / CONNECT_BY_ROOT around a host var ... */
    }
  /* ... arg2, arg3 similarly ... */
}

The body that follows is one of the longest switches in the codebase. At a high level it performs six tasks:

Operator-driven rule for arithmetic and a handful of others. A small set of operators — PT_PLUS, PT_MINUS, the PT_BETWEEN_* family, PT_LIKE, PT_TO_CHAR, PT_FROM_TZ, PT_NEW_TIME, the PT_IS_IN / PT_IS_NOT_IN pair — have hand-written rules. The arithmetic rule, for instance, encodes the date/time × integer matrix from the deck (slide 32): TIMESTAMP - TIMESTAMP produces BIGINT (seconds), INT + DATE produces DATE, etc. The rule lives directly in the switch arm, which then sets common_type and bypasses the signature table.

Signature-driven matching for the long tail (pt_apply_expressions_definition). For everything else, a precomputed EXPRESSION_DEFINITION (returned by pt_get_expression_definition (op, &def)) has overloads_count overloads, each an (arg1_type, arg2_type, arg3_type, return_type) tuple. The matcher scores each overload by counting per-argument pt_are_equivalent_types matches; the highest-scoring overload wins. If no overload’s argument types are even unmatchable-free, the matcher falls back to overload 0 (which propagates an error downstream when pt_coerce_expression_argument cannot insert the needed cast).

// pt_apply_expressions_definition (scoring loop, condensed) — src/parser/type_checking.c:5778
best_match = 0;
matches = -1;
for (i = 0; i < def.overloads_count; i++)
  {
    int match_cnt = 0;
    if (pt_are_unmatchable_types (def.overloads[i].arg1_type, arg1_type)) { match_cnt = -1; continue; }
    if (pt_are_equivalent_types (def.overloads[i].arg1_type, arg1_type))   match_cnt++;
    if (pt_are_unmatchable_types (def.overloads[i].arg2_type, arg2_type)) { match_cnt = -1; continue; }
    if (pt_are_equivalent_types (def.overloads[i].arg2_type, arg2_type))   match_cnt++;
    if (pt_are_unmatchable_types (def.overloads[i].arg3_type, arg3_type)) { match_cnt = -1; continue; }
    if (pt_are_equivalent_types (def.overloads[i].arg3_type, arg3_type))   match_cnt++;
    if (match_cnt == 3) { best_match = i; break; }                /* perfect — short-circuit */
    /* ... condensed: track running best_match by match_cnt > matches ... */
  }

Collation-compatibility check (pt_check_expr_collation). For string-typed arguments, both arguments must agree on collation or one must be coercible. CUBRID’s collation system distinguishes “explicit” (user-written COLLATE), “implicit” (column’s default), and “coercible” (literal); the rule prefers the most-explicit collation and errors if two arguments are explicitly tagged with incompatible collations.
Late-binding host vars (pt_is_op_hv_late_bind). Some operators — range comparisons, arithmetic on a ? argument — can postpone the host var’s exact type until execution. The pass wraps the host var in a PT_CAST with target expected_domain set to a DB_TYPE_VARIABLE domain, then clears the expected_domain so the XASL generator knows to treat the cast as late-binding rather than early.
Type self-evidence. A handful of operators have a return type that is fixed regardless of argument type (PT_RAND → numeric, PT_YEAR → integer, PT_EXTRACT → integer, PT_COALESCE → common-type). After the signature match the type is overridden from the operator’s contract.
Argument-type back-propagation for typed host vars. When the matcher has settled on a return type and the host var is now constrained, pt_eval_expr_type records the inferred domain into expected_domain so that subsequent passes (and ultimately the executor) coerce the actual host-var value at bind time.

The function returns the modified node. By the time it returns, node->type_enum is set, node->data_type is allocated (if the type needs precision/collation), and any required PT_CAST wrappers have been inserted around children.

Recursive expressions

A handful of operators — GREATEST, LEAST, COALESCE (left-recursive) and CASE, DECODE (right-recursive) — are parsed as a chain of identical PT_EXPR nodes. Each chain element has only two interesting arguments (the immediate operand and the recursion-link to the next chain element); the chain is parsed bottom-up but must be type-checked as one expression so the final common type propagates to every link.

pt_eval_type_pre’s recursive-expression handling spots the chain on the way down, walks to the bottom, and from there walks back up calling eval_recursive_expr_type repeatedly to compute the common type once. The result recursive_type is then carried in expr.recursive_type so that the per-link pt_apply_expressions_definition can use it instead of asking each link to type-check independently.

flowchart TD
  CASE1["PT_EXPR (CASE)\narg1: cond\narg2: val\narg3: → CASE2"]
  CASE2["PT_EXPR (CASE)\narg1: cond\narg2: val\narg3: → CASE3"]
  CASE3["PT_EXPR (CASE)\narg1: cond\narg2: val\narg3: → CASE4"]
  CASE4["PT_EXPR (CASE)\narg1: cond\narg2: val\narg3: const"]
  CASE1 --> CASE2 --> CASE3 --> CASE4
  PRE["eval_type_pre walks down\nto CASE4, computes\nrecursive_type from\nleaf vals, propagates\nupward into all\nrecursive_type fields"]
  POST["eval_type then runs\nbottom-up; each link\nreads its own recursive_type\nrather than inferring"]

Figure 4 — CASE expression right-linear chain: pt_eval_type_pre descends to the leaf CASE4, computes recursive_type from all branch values, propagates it upward into every link’s recursive_type field, then pt_eval_type runs bottom-up reading the pre-computed type instead of inferring per-link.

Type checking for `PT_FUNCTION`

pt_eval_function_type is the parallel of pt_eval_expr_type for function nodes (which can have arbitrary arity). CUBRID has two implementations side-by-side:

pt_eval_function_type_old — the long-standing C implementation for built-in aggregates, analytics, and most string / numeric functions.
pt_eval_function_type_new — a C++17 replacement built around a func_type::Node class in func_type.cpp, currently active for JSON functions, the BENCHMARK family, and the new REGEXP_* family. New functions are expected to land in the C++ path.

Both paths read a per-function signature list and pick the best-matching overload by argument type, same scoring discipline as pt_apply_expressions_definition.

`pt_where_type` — predicate cut-off

After child types are known, the WHERE clause itself gets a final sweep to cut off provably-true / provably-false conjuncts. Because the parser delivers the WHERE clause as a CNF-shaped list (next for AND, or_next for OR), the cut-off logic walks the list once:

// pt_where_type (excerpt) — src/parser/type_checking.c:6495
PT_NODE *
pt_where_type (PARSER_CONTEXT * parser, PT_NODE * where)
{
  PT_NODE *cnf_node, *dnf_node, *cnf_prev, *dnf_prev;
  bool cut_off;
  short location;

  /* traverse CNF list and keep track the pointer to previous node */
  cnf_prev = NULL;
  while ((cnf_node = ((cnf_prev) ? cnf_prev->next : where)))
    {
      // ... condensed: extract location, validate logical type ...

      if (cnf_node->or_next == NULL)
        {
          if (cnf_node->node_type == PT_VALUE && cnf_node->type_enum == PT_TYPE_LOGICAL
              && cnf_node->info.value.data_value.i == 1)
            {
              cut_off = true;          /* ... AND TRUE AND ...  → drop the conjunct */
            }
          else if (cnf_node->node_type == PT_VALUE && cnf_node->type_enum == PT_TYPE_LOGICAL
                   && cnf_node->info.value.data_value.i == 0)
            {
              if (cnf_node == where && cnf_node->next == NULL) return where;
              goto always_false;         /* ... AND FALSE AND ... → whole WHERE is false */
            }
        }
      else
        {
          /* DNF inner pass: drop FALSEs from OR-list, short-circuit on first TRUE */
          // ... condensed ...
        }

      // ... condensed: advance cnf_prev or splice cnf_node out ...
    }
  return where;
}

The two reductions are:

A ∧ TRUE ∧ B → A ∧ B (drop the conjunct).
A ∧ FALSE ∧ B → FALSE (replace the whole WHERE with a single false PT_VALUE).

The DNF inner loop does the dual:

D ∨ TRUE ∨ E → TRUE (short-circuit the disjunct as a single true).
D ∨ FALSE ∨ E → D ∨ E (drop the FALSE alternative).

The same predicate-cut-off logic runs against ON conditions, HAVING clauses, START WITH / CONNECT BY clauses, and ORDERBY-FOR clauses.

Constant folding

pt_fold_constants_post is the per-node folder. For a PT_EXPR it delegates to pt_fold_const_expr, which checks that every argument is a PT_VALUE, then evaluates the operator at compile time using the same machinery the executor would use (pt_evaluate_db_value_expr) and replaces the entire PT_EXPR with the resulting PT_VALUE. PT_FUNCTION follows the same pattern via pt_fold_const_function.

The pre-callback pt_fold_constants_pre only intervenes for BENCHMARK: it returns PT_LIST_WALK to stop descent into the benchmark’s argument, ensuring that the benchmark’s target expression is evaluated by the executor (where the benchmark can time it) rather than collapsed at compile time.

There is an additional gate at the SEMANTIC_CHK_INFO level: donot_fold is a per-pass flag that the caller of pt_check_with_info can set to suppress folding entirely. Statement re-checking after a view-transformation rewrite uses this so that folding doesn’t erase the structure the rewriter just produced.

Type infrastructure — `PT_DATA_TYPE` and `TP_DOMAIN`

CUBRID’s parse tree has its own type representation (PT_TYPE_ENUM + PT_DATA_TYPE), separate from the engine’s DB_TYPE + TP_DOMAIN. The two coexist because the parse tree needs to carry parameterized metadata (precision, scale, collation, codeset, enumeration values) in a form aligned with parser nodes, while the engine’s domain cache needs to be canonical to support fast comparisons:

Layer	Header	Purpose
Parse-tree types	`parser/parse_tree.h`	`PT_TYPE_ENUM`, `PT_DATA_TYPE_INFO` — parser-side type metadata
Engine domains	`object/object_domain.h`	`TP_DOMAIN` — built-in + user-defined type descriptor, cached
Primitive types	`object/object_primitive.h`	`PR_TYPE` — per-primitive serialization / comparison / size
On-disk layout	`object/object_representation.h`	`OR_*` — physical byte layout for storage

Every PT_DATA_TYPE node materialized in the parse tree is mapped to a TP_DOMAIN via pt_data_type_to_db_domain. Repeated TP_DOMAIN allocations are deduped through the tp_Domains[] cache (built-in domains by DB_TYPE, with collection-domain and parameterized-domain chains hanging off each entry). The key invariant: any two TP_DOMAIN * pointers that represent the same parameterized type are identical pointers after the cache lookup, so equality is == rather than a deep compare.

Stage 5 — CNF normalization (`pt_do_cnf` / `pt_cnf`)

After all rewriting and folding settle, pt_do_cnf walks the tree and calls pt_cnf on each WHERE and HAVING predicate.

// pt_do_cnf — src/parser/cnf.c:1190
PT_NODE *
pt_do_cnf (PARSER_CONTEXT * parser, PT_NODE * node, void *arg, int *continue_walk)
{
  PT_NODE *list, *spec;

  if (node->node_type != PT_SELECT) return node;

  list = node->info.query.q.select.where;
  if (list)
    {
      for (; list; list = list->next)
        PT_EXPR_INFO_CLEAR_FLAG (list, PT_EXPR_INFO_CNF_DONE);

      node->info.query.q.select.where = pt_cnf (parser, node->info.query.q.select.where);

      /* annotate each conjunct with the spec(s) it touches — used by predicate pushdown */
      for (spec = node->info.query.q.select.from; spec; spec = spec->next)
        pt_tag_terms_with_specs (parser, node->info.query.q.select.where, spec, spec->info.spec.id);
    }

  list = node->info.query.q.select.having;
  if (list)
    {
      for (; list; list = list->next)
        PT_EXPR_INFO_CLEAR_FLAG (list, PT_EXPR_INFO_CNF_DONE);
      node->info.query.q.select.having = pt_cnf (parser, node->info.query.q.select.having);
    }

  return node;
}

The PT_EXPR_INFO_CNF_DONE clear-then-set protocol exists because pt_do_cnf may be invoked more than once on the same subtree across the compile (view transform can re-feed predicates that were already CNF; the second pass should treat them as fresh).

pt_cnf itself is the textbook three-step transform:

// pt_cnf — src/parser/cnf.c:941
PT_NODE *
pt_cnf (PARSER_CONTEXT * parser, PT_NODE * node)
{
  PT_NODE *list = NULL, *cnf, *next, *last = NULL;
  CNF_MODE mode;

  do
    {
      next = node->next;
      node->next = NULL;

      if (node->node_type == PT_VALUE || PT_EXPR_INFO_IS_FLAGED (node, PT_EXPR_INFO_CNF_DONE))
        {
          /* already in CNF (or a folded constant) — splice in unchanged */
          cnf = node;
          // ... condensed: append to (list, last) ...
        }
      else
        {
          /* AND/OR form: push NOT inward via De Morgan */
          node = pt_and_or_form (parser, node);

          /* if too many disjuncts, do CNF in OR-tree-pruning mode */
          mode = (count_and_or (parser, node) > 100) ? TRANSFORM_CNF_OR_COMPACT : TRANSFORM_CNF_AND_OR;

          /* distribute disjunction over conjunction */
          cnf = parser_walk_tree (parser, node, pt_transform_cnf_pre, &mode, pt_transform_cnf_post, &mode);
          // ... condensed: append cnf to (list, last) ...

          /* tag every conjunct as done */
          for (last = cnf; last->next; last = last->next)
            PT_EXPR_INFO_SET_FLAG (last, PT_EXPR_INFO_CNF_DONE);
          PT_EXPR_INFO_SET_FLAG (last, PT_EXPR_INFO_CNF_DONE);
        }

      node = next;
    }
  while (next);

  list = parser_walk_tree (parser, list, NULL, NULL, pt_tag_start_of_cnf_post, NULL);
  return list;
}

The three rewrite steps inside pt_cnf are:

AND/OR form (pt_and_or_form) — replace implication and equivalence with their AND/OR equivalents, push every NOT inward using De Morgan until NOT only appears in front of a leaf predicate.
Distribute (pt_transform_cnf_*) — apply the distributive law A ∨ (B ∧ C) ≡ (A ∨ B) ∧ (A ∨ C) repeatedly until the formula is a flat conjunction of disjunctions. The mode switch (TRANSFORM_CNF_OR_COMPACT over 100 OR-children) is the exponential-blow-up guard: distributing OR over AND can multiply the number of conjuncts by the cross-product of disjunct sizes, so for very wide formulas the compact mode preserves the high-level OR rather than blowing it out.
Tag — set PT_EXPR_INFO_CNF_DONE on every conjunct so a subsequent pt_cnf invocation does not re-distribute.

The post-pass pt_tag_start_of_cnf_post marks the first conjunct of each CNF list with a different flag so that downstream code can detect the head of a CNF list separately from its members.

The pt_tag_terms_with_specs annotation happens after CNF: each conjunct is tagged with the bitset of PT_SPEC ids it touches. The optimizer reads this to decide which conjuncts are pushdown candidates to which join input.

flowchart LR
  RAW["WHERE (a OR b) AND NOT (c AND d)"]
  RAW --> AOF["pt_and_or_form\n→ (a OR b) AND (NOT c OR NOT d)"]
  AOF --> DIST["pt_transform_cnf_∗\n→ (a OR b) AND (NOT c OR NOT d)\n  (already CNF — flat)"]
  DIST --> TAG["each conjunct gets\nPT_EXPR_INFO_CNF_DONE\nplus spec-id tag"]
  TAG --> OPT["xasl_generation\n· optimizer reads each\nconjunct for pushdown"]

Figure 5 — CNF transformation and tagging: pt_and_or_form converts to NNF, pt_transform_cnf_∗ distributes OR over AND, each conjunct receives PT_EXPR_INFO_CNF_DONE and a PT_SPEC-id bitset tag that the optimizer reads to decide which conjuncts are pushdown candidates.

Statement-class coverage

pt_check_with_info’s switch handles every statement kind the parser can produce; the heavy lifting differs:

Statement kind	What `pt_check_with_info` does
`PT_SELECT`, `PT_UNION`, `PT_INTERSECTION`, `PT_DIFFERENCE`	Full pipeline: resolve names, check WHERE, replace host vars, semantic_check_local (which calls semantic_type), expand IS NULL, CNF
`PT_INSERT`, `PT_UPDATE`, `PT_DELETE`, `PT_MERGE`	Full pipeline plus DML-specific assignments / target validation
`PT_CREATE_INDEX`, `PT_ALTER_INDEX`, `PT_DROP_INDEX`	`pt_resolve_names` + `pt_check_create_index` / `pt_check_function_index_expr` / `pt_check_filter_index_expr`; no host-var phase
`PT_CREATE_ENTITY`, `PT_ALTER`, `PT_DROP`, etc.	DDL-specific checkers (uniqueness of attr names, partition-range validation, FK validity, etc.)
`PT_EVALUATE`, `PT_DO`, `PT_SET_*`	Full pipeline — these are SQL-level expressions / commands and need typing
`PT_HOST_VAR`, `PT_VALUE`, `PT_NAME`, `PT_EXPR`, `PT_FUNCTION`	Bare expressions — full pipeline

The DDL paths skip the host-var pass because MSGCAT_SEMANTIC_HOSTVAR_IN_DDL forbids host vars in DDL statements; the parser sets parser->host_var_count if any were encountered, and the DDL arm errors out before semantic checking.

Source Walkthrough

Anchor on symbol names, not line numbers. The line-number table at the end is a position hint scoped to this doc’s updated: date.

Top-level driver (`src/parser/`)

pt_compile (compile.c) — single-statement entry; sets the longjmp envelope, calls pt_semantic_check.
pt_class_pre_fetch (compile.c) — runs before pt_semantic_check from the caller’s perspective; pre-locks every referenced relation. Not part of the semantic pass per se but the precondition for pt_flat_spec_pre to find every relation in the workspace.
pt_semantic_check (semantic_check.c) — wrapper for pt_check_with_info.
pt_check_with_info (semantic_check.c) — the statement-kind switch; the scaffolding under which all four sub-passes run.

Name resolution (`name_resolution.c`)

SCOPES, PT_BIND_NAMES_ARG (name_resolution.c:78,87) — scope-stack frame and walk-state argument.
pt_resolve_names — five-step driver.
pt_flat_spec_pre — expand each PT_SPEC into a flat_entity_list (handles inheritance and ALL ... EXCEPT).
pt_mark_group_having_pt_name — tag GROUP BY / HAVING PT_NAMEs.
pt_bind_names, pt_bind_names_post — main name-binding walk (push / pop scopes; handle PT_DOT_).
pt_bind_scope — push a new SCOPES frame; if the spec is a derived table or CTE, recurse into it before exposing it.
pt_bind_name_or_path_in_scope — resolve a single PT_NAME (or PT_DOT_ path) against the current scope stack.
pt_get_resolution, pt_find_entity_in_scopes, pt_find_outer_entity_in_scopes — the walker primitives behind the binder.
pt_resolve_star, pt_resolve_star_reserved_names — * expansion in the SELECT list.
pt_resolve_correlation — bump correlation_level when a name walks outward; the optimizer reads this.
pt_resolve_group_having_alias, _pt_name, _pt_expr, _pt_sort_spec, _internal — alias resolution in GROUP BY / HAVING.
pt_resolve_natural_join, pt_resolve_natural_join_internal — NATURAL JOIN → INNER JOIN ON … rewrite.
pt_check_Oracle_outerjoin, pt_clear_Oracle_outerjoin_spec_id — Oracle (+) syntax to ANSI rewrite.
pt_bind_names_in_with_clause, pt_bind_names_in_cte — CTE binding.
pt_bind_names_merge_insert, pt_bind_names_merge_update — MERGE-specific binding.
pt_make_flat_name_list, pt_make_subclass_list — list builders for the flat entity list.

Aggregate-in-WHERE check (`semantic_check.c`)

pt_check_where — walks the predicate tree looking for an aggregate / analytic function and emits MSGCAT_SEMANTIC_INVALID_AGGREGATE / MSGCAT_SEMANTIC_NESTED_ANALYTIC_FUNCTIONS.
pt_find_aggregate_analytic_in_where, pt_is_aggregate_function — predicate helpers.

Host-variable replacement (`semantic_check.c`)

pt_check_and_replace_hostvar — parser_walk_tree callback.
SET_HOST_VARIABLES_IF_INTERNAL_STATEMENT (macro) — inherits host vars from a parent parser context, used when the statement was generated internally (view transform / trigger).

Statement-specific checks (`semantic_check.c`)

pt_semantic_check_local — the post-order callback that drives per-statement semantic checks and rewrites; also the call site of pt_semantic_type.
pt_check_into_clause, pt_check_into_clause_for_static_sql — INTO arity and INTO-in-subquery checks.
pt_check_create_index, pt_check_function_index_expr, pt_check_filter_index_expr — DDL index-expression checks.
pt_check_cast_op — CAST type / collation feasibility.
pt_check_unique_attr — duplicate-column check.
pt_check_range_partition_strict_increasing, pt_coerce_partition_value_with_data_type — partition-range validation.
pt_expand_isnull_preds, pt_expand_isnull_preds_helper — expand col IS NULL into the disjunction of class-discriminating predicates needed for inheritance-aware nullability; runs after pt_semantic_check_local and before CNF (so CNF sees the expansion).
pt_mark_union_leaf_nodes — propagate UNION leaf flags upward.

Type checking and constant folding (`type_checking.c`)

pt_semantic_type — driver; calls the type-check walk and the fold walk.
pt_eval_type_pre — top-down callback (LIMIT rewrites, recursive- expr handling, derived-table outer-join flag, sort-spec subquery flag, search-condition cleanup).
pt_eval_type — bottom-up callback (dispatches per node type to pt_eval_expr_type, pt_eval_function_type, pt_where_type).
pt_eval_expr_type — operator-driven + signature-driven type inference for PT_EXPR.
pt_eval_function_type, pt_eval_function_type_old, pt_eval_function_type_new — same for PT_FUNCTION. The _new path delegates to func_type::Node in func_type.cpp for new functions (JSON, REGEXP, BENCHMARK).
pt_apply_expressions_definition — signature-table matcher.
pt_get_expression_definition — returns the per-operator EXPRESSION_DEFINITION.
pt_coerce_expression_argument — wrap an argument in PT_CAST to a target TP_DOMAIN.
pt_check_expr_collation — collation-compatibility verification.
pt_is_op_hv_late_bind — operator predicate; true if a host-var argument can be late-bound.
pt_upd_domain_info — refresh data_type precision/scale on the PT_EXPR from its operator and arguments.
pt_where_type, pt_where_type_keep_true — predicate cut-off based on resolved Boolean values.
pt_fold_constants_pre — descent gate (BENCHMARK).
pt_fold_constants_post — per-node folder.
pt_fold_const_expr — fold a PT_EXPR whose every argument is a PT_VALUE.
pt_fold_const_function — fold a PT_FUNCTION similarly.
pt_evaluate_db_value_expr — the actual compile-time evaluation primitive (lives in parse_evaluate.c); used by the folders.

Type infrastructure (`parse_dbi.c`, `object_domain.c`)

pt_data_type_to_db_domain (parse_dbi.c) — convert PT_DATA_TYPE to TP_DOMAIN.
tp_domain_new, tp_domain_construct (object_domain.c) — fresh domain allocation; the parser-side allocators wrap these.
tp_Domains[] — the global per-DB_TYPE domain cache.
tp_is_domain_cached — cache lookup primitive for parameterized domains.

CNF normalization (`cnf.c`)

pt_do_cnf — top-level walker that finds WHERE / HAVING in PT_SELECT and runs CNF.
pt_cnf — three-step transform (AND/OR form → distribute → tag).
pt_and_or_form — eliminate ⇒ / ⇔, push NOT inward.
pt_aof_to_cnf — recursive AOF-to-CNF helper.
pt_distributes_disjunction — distribute OR over AND.
pt_transform_cnf_pre, pt_transform_cnf_post, pt_tag_start_of_cnf_post — parser_walk_tree callbacks.
pt_tag_terms_with_specs — annotate each CNF conjunct with the spec-id bitset it touches.
count_and_or — disjunct-count gate that selects between TRANSFORM_CNF_AND_OR and TRANSFORM_CNF_OR_COMPACT modes.
PT_EXPR_INFO_CNF_DONE — flag set on every conjunct after CNF.

Position hints as of 2026-04-30

Symbol	File	Line
`pt_compile`	`parser/compile.c`	381
`pt_class_pre_fetch`	`parser/compile.c`	432
`pt_semantic_check`	`parser/semantic_check.c`	12523
`pt_check_with_info`	`parser/semantic_check.c`	12060
`pt_check_where`	`parser/semantic_check.c`	17503
`pt_check_and_replace_hostvar`	`parser/semantic_check.c`	11934
`pt_semantic_check_local`	`parser/semantic_check.c`	10865
`pt_check_into_clause`	`parser/semantic_check.c`	10684
`pt_check_into_clause_for_static_sql`	`parser/semantic_check.c`	10657
`pt_check_create_index`	`parser/semantic_check.c`	8888
`pt_expand_isnull_preds`	`parser/semantic_check.c`	222
`pt_mark_union_leaf_nodes`	`parser/semantic_check.c`	269
`SCOPES` struct	`parser/name_resolution.c`	78
`PT_BIND_NAMES_ARG` struct	`parser/name_resolution.c`	87
`pt_bind_name_or_path_in_scope`	`parser/name_resolution.c`	841
`pt_bind_scope`	`parser/name_resolution.c`	1206
`pt_bind_names_post`	`parser/name_resolution.c`	1467
`pt_bind_names`	`parser/name_resolution.c`	1974
`pt_flat_spec_pre`	`parser/name_resolution.c`	4735
`pt_resolve_star_reserved_names`	`parser/name_resolution.c`	7391
`pt_resolve_star`	`parser/name_resolution.c`	7460
`pt_resolve_natural_join_internal`	`parser/name_resolution.c`	8914
`pt_resolve_natural_join`	`parser/name_resolution.c`	9026
`pt_resolve_group_having_alias_pt_sort_spec`	`parser/name_resolution.c`	9142
`pt_resolve_group_having_alias_pt_name`	`parser/name_resolution.c`	9158
`pt_resolve_group_having_alias_pt_expr`	`parser/name_resolution.c`	9211
`pt_resolve_group_having_alias_internal`	`parser/name_resolution.c`	9269
`pt_resolve_group_having_alias`	`parser/name_resolution.c`	9302
`pt_resolve_names`	`parser/name_resolution.c`	9350
`pt_bind_names_merge_insert`	`parser/name_resolution.c`	10837
`pt_bind_names_merge_update`	`parser/name_resolution.c`	10933
`pt_bind_names_in_with_clause`	`parser/name_resolution.c`	11147
`pt_bind_names_in_cte`	`parser/name_resolution.c`	11202
`pt_coerce_expression_argument`	`parser/type_checking.c`	4473
`pt_apply_expressions_definition`	`parser/type_checking.c`	5778
`pt_where_type`	`parser/type_checking.c`	6495
`pt_where_type_keep_true`	`parser/type_checking.c`	6689
`pt_eval_type_pre`	`parser/type_checking.c`	7082
`pt_fold_constants_pre`	`parser/type_checking.c`	7487
`pt_fold_constants_post`	`parser/type_checking.c`	7661
`pt_eval_type`	`parser/type_checking.c`	7714
`pt_eval_expr_type`	`parser/type_checking.c`	8706
`pt_upd_domain_info`	`parser/type_checking.c`	11245
`pt_eval_function_type`	`parser/type_checking.c`	12285
`pt_fold_const_expr`	`parser/type_checking.c`	17579
`pt_fold_const_function`	`parser/type_checking.c`	18799
`pt_semantic_type`	`parser/type_checking.c`	19073
`pt_is_op_hv_late_bind`	`parser/type_checking.c`	20260
`pt_check_expr_collation`	`parser/type_checking.c`	22076
`pt_aof_to_cnf`	`parser/cnf.c`	255
`pt_cnf`	`parser/cnf.c`	941
`pt_do_cnf`	`parser/cnf.c`	1191

Cross-check Notes

The raw analyses are dated (versions 0.5 – 1.0, internal RND-2팀 review) and lag the current source on a few specific points:

pt_semantic_check_local is no longer the place where pt_semantic_type is unconditionally called. The deck describes pt_semantic_check_local as “calls semantic_type and constant folds”. In the current source, pt_check_with_info schedules the walks and the call to pt_semantic_type happens both inside pt_semantic_check_local (per-statement) and directly from pt_check_with_info for non-DML node kinds (e.g., PT_CREATE_INDEX calls pt_semantic_type directly at semantic_check.c:12257).
The pt_eval_function_type split. The raw type-checking deck refers only to pt_eval_function_type as a single function. The current source has both pt_eval_function_type_old and pt_eval_function_type_new; the new C++17 path (func_type::Node in func_type.cpp) handles JSON, BENCHMARK, and REGEXP families. Most other functions still go through the old path.
PT_JOIN_NATURAL is dead. The deck flags it as “not used”; the comment is still in the enum (parse_tree.h). NATURAL JOIN is carried as a bool natural flag on PT_SPEC, not as a join-type. PT_JOIN_FULL_OUTER is also dead — full outer join is unsupported in CUBRID.
The Oracle outer-join “Meaningless” predicate handling. The deck notes SELECT ... FROM x, y ON y.i(+) = 1 collapses to PT_JOIN_NONE and the predicate moves back to WHERE. Verified in pt_check_Oracle_outerjoin’s post-walk; the current source preserves this behavior.
CNF mode switch threshold. pt_cnf switches to TRANSFORM_CNF_OR_COMPACT when count_and_or > 100. The CNF deck does not mention this exponential-blow-up guard. This is a recent hardening; prefer the source.
pt_check_into_clause has a static-SQL variant. The “INTO” section of the per-statement deck describes only the regular arity check. The current source has pt_check_into_clause_for_static_sql which performs the same check at static-SQL parse time (the is_parsing_static_sql flag is also what gates folding in pt_semantic_type).
The “host-var → expected_domain” handling has matured. The deck describes a single late-bind step. The current source performs five distinct settle-then-revisit operations on host vars: type unwrap (handle -? and prior ?), late-bind operator classification (pt_is_op_hv_late_bind), expected-domain computation, expected-domain clear-on-late-bind (so the XASL generator emits DB_TYPE_VARIABLE), and a final back-propagation of the inferred type into the host var’s expected domain. The five-step shape is implicit in the deck but explicit in the code.
NATURAL JOIN attribute lookup for derived-tables / CTEs. The deck describes two cases (entity-fetched DB_ATTRIBUTE, derived table or CTE select-list / WITH-list). Verified in pt_resolve_natural_join_internal; current source matches.

Open Questions

pt_eval_function_type_new migration plan. Three function families live in the C++17 path; the rest still use pt_eval_function_type_old. Is there a documented roadmap for moving more functions over (and what does the migration look like for an existing aggregate)? Investigation path: git log func_type.cpp plus the pt_eval_function_type_new callsites for merge-commits introducing new function families.
donot_fold semantics. SEMANTIC_CHK_INFO::donot_fold is set by some callers of pt_check_with_info, including the view transform. The deck does not enumerate every site. Is there a complete list of folding-suppression contexts, and does any of them affect plan-cache correctness? Investigation path: grep for donot_fold = true across the parser.
PT_EXPR_INFO_CNF_DONE and re-entrance. pt_do_cnf clears the flag before applying pt_cnf. What if the predicate already contains a partially CNF-tagged sub-tree from view transform? The current code unconditionally clears the flag on every top-level conjunct; a sub-conjunct’s flag is not cleared. Could this cause pt_cnf to skip a sub-tree that should have been re-CNFed? Investigation path: write a regression with a view that contains a CNF-tagged predicate and UNION it with a query that re-enters CNF.
pt_tag_terms_with_specs and natural-join post-conditions. The natural-join rewrite synthesizes new PT_EXPR predicates and appends them to the rhs spec’s on_cond. After this rewrite, the spec-id tag bitset on each synthesized predicate must be the union of lhs + rhs. Is this correctly produced? Investigation path: instrument pt_tag_terms_with_specs and run the natural-join regression suite.
Static-SQL parsing and folding. pt_semantic_type’s is_parsing_static_sql guard skips folding entirely. This means that constant predicates in static SQL stay alive in the parse tree until execution. Is the executor robust against this — does it fold lazily, or does it always evaluate? Investigation path: trace pt_evaluate_db_value_expr callers in query/query_executor.c.
Recursive-expression depth limit. eval_recursive_expr_type walks the chain to its bottom. Long chains (CASE WHEN ... CASE WHEN ... repeated hundreds of times) could blow the stack. Is there a depth guard? Investigation path: search for MAX_RECURSIVE_DEPTH or equivalent in type_checking.c.
The five-bit OR_MVCC_FLAG_MASK analogue here? The MVCC record header has reserved bits; the parse-tree node structure has its own flag bytes (PT_NODE::flag, PT_EXPR_INFO_*, PT_NAME_INFO_*). Are any of those flag-byte bits reserved for future semantic-check use, or is the flag space allocated tight? Investigation path: enumerate PT_*_INFO_ and PT_NODE_FLAG_* masks and confirm none overlap.

Sources

Raw analyses (`raw/code-analysis/cubrid/query-processing/`)

code_analysis_Semantic_check_Overview_v_1_0.pdf — the four-stage pipeline overview.
code_analysis_Semantic_check-Name_Resolution_v_0_9.pdf — the longest deck (26 KB markdown after conversion); covers spec expansion, the scope stack, * resolution, derived tables, subqueries, joins (including Oracle (+) style), GROUP BY / HAVING aliasing, NATURAL JOIN rewriting, FOR UPDATE flagging.
code_analysis_Semantic_check-Type_Checking_and_Constant_Folding_v_1_0.pdf — pt_semantic_type and the type-evaluation rules.
code_analysis_Semantic_check-Checking_Semantic_of_Particular_Statement_v_0_8.pdf — per-statement rewrites and error messages.
type_checking_v1.0.pptx — the deeper companion deck for the type-checking pass (signature table, type coercion table, parameterized types, the PT_DATA_TYPE ↔ TP_DOMAIN relationship).
code_analysis_Common_module_CNF_(parsercnf).pdf — CNF as a utility module shared between semantic check and view transform.

Sibling docs

knowledge/code-analysis/cubrid/cubrid-mvcc.md — same query-processing pipeline; the WHERE clause CNF normalization feeds into the optimizer that the MVCC-aware scan pages out.
knowledge/code-analysis/cubrid/cubrid-heartbeat.md — unrelated but a style reference for the format of this document.

Textbook chapters / references

Database Internals (Petrov), Ch. 5 §“Query Processing” — sets the stage for the parser-to-optimizer pipeline.
Database System Concepts (Silberschatz, Korth, Sudarshan), Ch. 4 §“Built-in Data Types”, Ch. 5 §“Functions and Procedures” — the type-system framing used in §“Theoretical Background”.
Compilers: Principles, Techniques, and Tools (Aho, Lam, Sethi, Ullman; “Dragon Book”), Ch. 6 §“Symbol Tables”, Ch. 9 §“Loop-Invariant Computations” — symbol table and constant-folding framing.
Robinson, A Machine-Oriented Logic Based on the Resolution Principle, JACM 12, 1965 — the formal CNF transform.
Nilsson, Principles of Artificial Intelligence, ch. “Resolution Refutations” — referenced directly by the CUBRID CNF deck.

CUBRID source (`/data/hgryoo/references/cubrid/`)

src/parser/compile.c — pt_compile, pt_class_pre_fetch.
src/parser/semantic_check.c — the per-statement driver and every per-statement check.
src/parser/semantic_check.h — SEMANTIC_CHK_INFO and the module’s public surface.
src/parser/name_resolution.c — every name-resolution sub-pass.
src/parser/type_checking.c — pt_semantic_type and the type-evaluation rules.
src/parser/cnf.c — CNF normalization.
src/parser/parse_tree.h — PT_NODE definitions; the union of info.<kind> payloads for every node type.
src/parser/parse_dbi.c — pt_data_type_to_db_domain and the PT_DATA_TYPE ↔ TP_DOMAIN bridge.
src/parser/parse_evaluate.c — pt_evaluate_db_value_expr, the compile-time evaluator used by the constant folders.
src/parser/func_type.cpp, func_type.hpp — the C++17 function type-evaluation machinery used by pt_eval_function_type_new.
src/object/object_domain.{c,h} — TP_DOMAIN and the domain cache.