CUBRID Parser — Flex/Bison Pipeline, PT_NODE Tree, and the Parser Memory Model

A SQL parser sits between two textbook problems. The first is lexical analysis: chop a flat character stream into tokens that the grammar can reason about. The second is syntactic analysis: decide whether the token stream conforms to a context-free grammar and, if it does, build a tree that captures the derivation. The tooling for both halves was canonised by Aho, Lam, Sethi, and Ullman’s Compilers: Principles, Techniques, and Tools — the Dragon book — and its core abstractions are unchanged in modern DBMS parsers: regular languages → DFA for lexing, and a deterministic shift-reduce automaton (LR(1) / LALR(1) / GLR) for parsing.

Database Internals (Petrov), Ch. 1 §“Query Processor”, frames the same picture from the database side: SQL text enters the query processor, the parser produces an internal tree representation, the optimiser rewrites it, and the executor walks the rewritten tree. The parser owns one job — produce a tree the later passes can rely on — and one constraint — never silently accept input the grammar does not cover.

A lexer is a regular-expression engine specialised for one language. Every token class (keywords, identifiers, integer literals, string literals, operators) is encoded as a regular expression; the union of those regexes is compiled to a single deterministic finite automaton; the DFA reads characters and returns (token_class, lexeme) pairs. Compilers §3 derives the NFA → DFA construction. Two practical points matter for any real lexer: longest-match-wins disambiguation, and a stateful start condition mechanism that lets the same surface syntax mean different things in different contexts (string interior vs. SQL keyword space, for example).

Flex is the free-software re-implementation of AT&T’s Lex. It generates a C function yylex() that returns one token per call and exposes yytext, yyleng, and a per-token yylval slot for the value the parser will reduce on. CUBRID’s lexer is in src/parser/csql_lexer.l; the generated csql_lexer.c is what the build actually compiles.

Bison generates a shift-reduce parser from a context-free grammar with semantic actions. The default flavour is LALR(1) — Look-Ahead LR with one token of lookahead, the same automaton family covered in Compilers §4.7. LALR(1) covers most programming-language grammars cleanly; it does not cover SQL cleanly because SQL has historic ambiguities (the classic example is the parse of SELECT a FROM b GROUP BY c, d where c, d could be a list or a tuple, depending on dialect).

Bison’s escape hatch is GLR (Generalized LR): when the LALR(1) automaton hits a conflict it forks, parses both branches in parallel, and discards the branches that fail to reduce. The performance cost is real but bounded — most real SQL queries encounter conflicts at only a handful of points. CUBRID compiles its grammar with %glr-parser (csql_grammar.y declaration %glr-parser); this is what lets the grammar accept the permissive superset of MySQL- and Oracle-style joins, hint syntax, and method-call syntax that the surface SQL allows.

Bison’s semantic-action mechanism is what builds the tree. Each grammar rule has an embedded C action block; when the parser reduces by that rule, the action runs. The action sees $1, $2, ... for the rule’s right-hand-side semantic values and writes the rule’s own value into $$. The %union declaration tells Bison the type of those values; in CUBRID the union holds PT_NODE *node, int number, char *cptr, and a few helper containers. The grammar is the only place where parse-tree nodes are constructed for the user’s input — every other module reads what the grammar built.

The parse tree is the contract the parser hands to the rest of the system. Two design choices shape every concrete realisation:

  1. Polymorphism mechanism. Every parse tree has nodes of many shapes (a SELECT is not a WHERE clause, which is not a literal). Three approaches dominate:

    • Class hierarchy (modern C++ / Java): one base class, virtual methods. Conventional but requires RTTI and virtual dispatch.
    • Tagged union (the C dialect): one struct with a discriminating tag and a union of per-type substructures. Cheaper at runtime, less safe at compile time.
    • Sum-type ADT (Haskell / OCaml / Rust): the language’s own union type with exhaustive matching.

    PostgreSQL, MySQL, and CUBRID all chose the tagged-union approach for the same C-era reasons.

  2. Error recovery. Compilers §4.1.4 enumerates the four classical strategies: panic-mode (skip until a synchronising token), phrase-level (replace tokens to make the rule reduce), error productions (grammar rules that match ill-formed input), and global correction (find the minimum-edit valid program). Bison’s built-in error token is panic-mode; production engines layer custom location tracking and message synthesis on top.
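The tagged-union choice in point 1 can be reduced to a few lines — an illustrative sketch, not the real PT_NODE layout — where one discriminating tag selects the live arm of the union:

```c
/* Minimal tagged-union AST in the PostgreSQL/CUBRID style (names invented
 * for illustration): the tag field says which union member is valid. */
typedef enum { NODE_INT_LIT, NODE_BINOP } node_tag;

typedef struct node
{
    node_tag tag;               /* the discriminator */
    union
    {
        struct { int value; } int_lit;
        struct { char op; struct node *lhs, *rhs; } binop;
    } u;                        /* layout depends on tag */
} node;

/* A visitor switches on the tag — the C-era substitute for virtual dispatch. */
static int eval (const node *n)
{
    switch (n->tag)
    {
    case NODE_INT_LIT:
        return n->u.int_lit.value;
    case NODE_BINOP:
        return n->u.binop.op == '+'
            ? eval (n->u.binop.lhs) + eval (n->u.binop.rhs)
            : eval (n->u.binop.lhs) * eval (n->u.binop.rhs);
    }
    return 0;
}
```

The cost mentioned above is visible here: nothing stops a caller from reading `u.binop` on a `NODE_INT_LIT` node; the compiler cannot check the tag for you.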

CUBRID’s combination is conventional: Flex with start conditions, Bison %glr-parser, a tagged-union PT_NODE, and panic-mode recovery enriched with YYLTYPE location info that pinpoints the offending byte in the original buffer. The distinguishing engineering choices are in the memory model and in the per-type child layout, which we trace in §“CUBRID’s Approach”.

Every SQL engine that compiles SQL on the client side faces the same set of tasks; the engineering vocabulary that has settled across PostgreSQL, MySQL, and CUBRID is the lens this section adopts. The CUBRID-specific code in §“CUBRID’s Approach” is best read as one set of dials within this shared design space.

Flex + Bison as the default front-end stack

Three of the most-cited C/C++ RDBMSes — PostgreSQL, MySQL, and CUBRID — generate their lexer and parser from Flex/Bison sources. Postgres’ files are src/backend/parser/scan.l and src/backend/parser/gram.y; MySQL’s are sql/sql_lex.cc and sql/sql_yacc.yy; CUBRID’s are src/parser/csql_lexer.l and src/parser/csql_grammar.y. The generation step is part of the build (CMake on CUBRID, Autoconf on Postgres); whether the generated .c files are checked in varies by project — CUBRID checks in only the inputs.

The shared shape is:

  • One .l file with three sections — Definitions, Rules, User subroutines — and the rules section returning token codes that the grammar’s %token declarations match.
  • One .y file with %union, %type, %token, the start symbol, the grammar rules each carrying a C action block, and an epilogue with helper functions used from inside actions.
  • Generated artifacts wired into the build through CMake’s flex_target / bison_target.

ParseNode / PT_NODE / Item — the tagged-union AST

PostgreSQL’s Node (src/include/nodes/nodes.h) tags every parse-tree, plan-tree, and executor-tree element with a NodeTag. The tag selects which concrete struct the pointer actually refers to; visitor-style walks switch on it.

MySQL splits the surface differently — Item for expressions, SELECT_LEX and friends for query blocks — but each variant carries a tag and shares a base type; the discriminator is the class identity in MySQL’s case (it does use C++ inheritance).

CUBRID’s PT_NODE is the same idea as Postgres’s Node minus inheritance: a single struct with one PT_NODE_TYPE node_type field and a giant union pt_statement_info info field whose layout depends on node_type. Every per-type substructure is named PT_<TYPE>_INFO and accessed as node->info.<lowercase>.

Parser-time allocations are short-lived and many. Every engine uses some form of arena / region / zone allocator: a few big blocks of memory acquired up-front, individual nodes carved out by bumping a pointer, and the whole arena released in one free at the end. PostgreSQL has MemoryContext (per-context palloc/pfree); MySQL has MEM_ROOT; CUBRID has the PARSER_CONTEXT plus a per-parser pool of PARSER_NODE_BLOCKs and PARSER_STRING_BLOCKs in parse_tree.c.

The win is the same in all three: nodes built during parsing that turn out to be unreachable (because a rule failed to reduce or because semantic check rewrote them) do not need individual free calls — they vanish with the parser context. The cost is that node pointers cannot safely outlive the parser; if the optimizer or executor needs a node after the parser closes, it has to copy it into its own arena (this is the role PostgreSQL’s copyObject plays, and CUBRID’s parser_copy_tree).
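The arena idea fits in a few lines — a minimal bump-allocator sketch, not CUBRID's PARSER_NODE_BLOCK code, and assuming each request fits in one block:

```c
#include <stdlib.h>
#include <stddef.h>

/* Toy arena: big blocks acquired as needed, allocations carved out by
 * bumping an offset, everything released with one call at the end. */
#define BLOCK_SIZE 4096

typedef struct block
{
    struct block *next;
    size_t used;
    char mem[BLOCK_SIZE];
} block;

typedef struct { block *head; } arena;

static void *arena_alloc (arena *a, size_t n)
{
    n = (n + 7) & ~(size_t) 7;  /* keep returned pointers 8-byte aligned */
    /* assumes n <= BLOCK_SIZE; a real allocator handles oversize requests */
    if (a->head == NULL || a->head->used + n > BLOCK_SIZE)
    {
        block *b = malloc (sizeof *b);  /* grab a fresh block when full */
        b->next = a->head;
        b->used = 0;
        a->head = b;
    }
    void *p = a->head->mem + a->head->used;
    a->head->used += n;         /* the "bump" */
    return p;
}

static void arena_free_all (arena *a)   /* the one free at end of parse */
{
    for (block *b = a->head, *nx; b; b = nx)
    {
        nx = b->next;
        free (b);
    }
    a->head = NULL;
}
```

Note there is no per-allocation free at all — which is exactly why dangling parse-tree scraps cost nothing, and why pointers into the arena die with it.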

Per-type child layout — the apply function table

A parse tree walker has to know which children each node type has. Three approaches:

  • Hard-coded per-type switches in every walker. Fast, brittle.
  • Reflection (Java, C#) — read the field types at runtime. Slow, expensive at build time.
  • Function-pointer table indexed by the discriminator — one function per type, manually written, that visits the correct set of children for that type.

Postgres uses the third approach with macros that expand to visitor code; CUBRID uses it explicitly with the pt_apply_f array of per-PT_NODE_TYPE functions, plus parallel pt_init_f and pt_print_f arrays for initialisation and pretty-printing. The trade-off is a small build-time tax (every new node type must register three functions) for a fast and debuggable walker.

Theoretical / common concept → CUBRID name:

  • Lexer source → src/parser/csql_lexer.l (flex)
  • Generated lexer → csql_lexer.c (built by flex_target in CMake)
  • Grammar source → src/parser/csql_grammar.y (bison %glr-parser)
  • Generated parser → csql_grammar.c (built by bison_target)
  • Token type prefix → csql_yy* (set via --name-prefix=csql_yy / --prefix=csql_yy)
  • Lexer entry → csql_yylex (invoked by Bison-generated csql_yyparse)
  • Parser entry → csql_yyparse, wrapped by parser_main (csql_grammar.y)
  • Parse-tree node → PT_NODE (parse_tree.h)
  • Node discriminator → enum pt_node_type PT_NODE_TYPE (parse_tree.h)
  • Per-type substructure union → union pt_statement_info PT_STATEMENT_INFO (parse_tree.h)
  • Polymorphic child-visit table → pt_apply_f[PT_NODE_NUMBER] (parse_tree_cl.c)
  • Polymorphic init table → pt_init_f[PT_NODE_NUMBER]
  • Polymorphic print table → pt_print_f[PT_NODE_NUMBER]
  • Tree walker → parser_walk_tree, parser_walk_leaves (parse_tree_cl.c)
  • Per-parser arena → PARSER_CONTEXT + PARSER_NODE_BLOCK + PARSER_STRING_BLOCK (parse_tree.c)
  • Node allocator → parser_new_node → parser_create_node → block bump (parse_tree_cl.c, parse_tree.c)
  • String allocator → parser_alloc (parse_tree.c)
  • Lexer-token name interning → pt_makename (scanner_support.c)
  • Bison location type → YYLTYPE { first_line, first_column, last_line, last_column, buffer_pos } (csql_grammar.y)
  • Per-node line/column → PT_NODE::line_number, column_number, buffer_pos
  • Error token → Bison built-in error; messages via csql_yyerror and PT_ERRORmf
  • Parser context lifecycle → parser_create_parser / parser_free_parser (parse_tree.c)
  • Public string-parse entry → parser_parse_string, parser_parse_string_with_escapes (parse_tree_cl.c)

We walk three levels in order: (1) the lex → parse → tree pipeline at runtime, (2) the PT_NODE data model that the pipeline produces, (3) the memory model that lets the producer and the consumer share node pointers safely.

flowchart LR
  SQL["SQL text"]
  PC["PARSER_CONTEXT"]
  PARSE["parser_parse_string"]
  MAIN["parser_main"]
  LEX["csql_yylex<br/>(generated from csql_lexer.l)"]
  PARSER["csql_yyparse<br/>(generated from csql_grammar.y)"]
  ACT["reduce-action<br/>parser_new_node / parser_make_∗"]
  POOL["PARSER_NODE_BLOCK pool<br/>(per-parser arena)"]
  STK["parser->node_stack<br/>(top-level stmt list)"]
  TREE["PT_NODE root tree(s)"]
  STMTS["parser->statements[]"]

  SQL --> PARSE
  PARSE --> PC
  PARSE --> MAIN
  MAIN --> LEX
  MAIN --> PARSER
  LEX -- "(token, csql_yylval)" --> PARSER
  PARSER -- "on each reduce" --> ACT
  ACT -- "carve from arena" --> POOL
  ACT -- "wire children" --> TREE
  PARSER -- "top-level reduce" --> STK
  MAIN -- "snapshot stack" --> STMTS
  STMTS --> TREE

The figure makes three boundaries explicit. Grammar / allocator: the grammar rules are the only place that calls parser_new_node; reduce actions are where nodes are born. Parse / drive: parser_main is a thin wrapper around yyparse that owns the per-call setup (line/column reset, buffer-position save/restore, hint table init, top-level statement-array snapshot). Per-call arena: every node and string allocated during the call lives in PARSER_CONTEXT-owned blocks and is released by parser_free_parser; nothing is individually heap-allocated.

Flex source has the standard three-section layout:

// csql_lexer.l (skeleton)
%{
#include "csql_grammar.h"
#include "csql_grammar_scan.h"
#include "parse_tree.h"
#include "parser_message.h"
#include "system_parameter.h"
#include "message_catalog.h"

#define CSQL_MAXNAME 256

static int parser_yyinput (char *buff, int max_size);

#undef YY_INPUT
#define YY_INPUT(buffer, result, max_size) { \
    result = parser_yyinput (buffer, max_size); \
    result == 0 ? result = YY_NULL : result; \
  }

#define YY_USER_ACTION { \
    yybuffer_pos += csql_yyget_leng (); \
    csql_yylloc.buffer_pos = yybuffer_pos; \
  }
%}
%option yylineno
%x QUOTED_CHAR_STRING DOUBLY_QUOTED_CHAR_STRING DELIMITED_ID_NAME
%x BRACKET_ID_NAME BACKTICK_ID_NAME PLCSQL_TEXT
/* ... twelve more start conditions ... */
%%
[sS][eE][lL][eE][cC][tT]                { begin_token (yytext); return SELECT; }
[sS][eE][nN][sS][iI][tT][iI][vV][eE]    { begin_token (yytext); return SENSITIVE; }
[sS][eE][pP][aA][rR][aA][tT][oO][rR]    { begin_token (yytext);
                                          csql_yylval.cptr = pt_makename (yytext);
                                          return SEPARATOR; }
/* ... ~2000 more rules ... */
%%
/* user subroutines: parser_yyinput, csql_yyerror, yywrap, ... */

Three deliberate choices live in the snippet:

  1. YY_INPUT is overridden. The default Flex YY_INPUT reads from stdin. CUBRID parses an in-memory buffer (parser->buffer) so it routes Flex’s input through parser_yyinput, which copies bytes out of the parser context’s owned buffer and tracks yybuffer_pos.
  2. YY_USER_ACTION accumulates a byte-position counter. yybuffer_pos is the cumulative offset into the original SQL string of the token Flex just matched; it is the anchor every error message and every PT_NODE::buffer_pos later refers back to. Lines and columns alone are not enough — error messages quote the offending substring, which needs the byte offset.
  3. Keyword rules are case-insensitive by enumeration of character classes. [sS][eE][lL][eE][cC][tT] is the manual way of getting case-insensitive matching without asking Flex to apply %option case-insensitive (which would also affect identifier rules and string literals). The generated DFA pays a few extra states per keyword; the alternative would have been to lowercase the input first, which interferes with delimited identifiers like "FooBar".

The user-subroutines block holds the input drivers (parser_yyinput, parser_yyinput_single_line, parser_yyinput_multi_line), the error reporter (csql_yyerror / csql_yyerror_explicit), and the end-of-input handler yywrap (which always returns 1 — the parser sees one buffer per parser_parse_string call, never a chained input).

The fourteen start conditions (excerpt: QUOTED_CHAR_STRING, DOUBLY_QUOTED_CHAR_STRING, DELIMITED_ID_NAME, BRACKET_ID_NAME, BACKTICK_ID_NAME, PL_LANG_SPEC, PLCSQL_TEXT, …) are how CUBRID handles the fact that the same surface character (e.g. ") means different things in the middle of a string literal versus at the start of a delimited identifier. Each BEGIN(state) action in the grammar/lexer flips the lexer into a state where only the rules relevant to that context are eligible to match. PL/CSQL text matches a multi-line balanced BEGIN ... END block; expecting_pl_lang_spec is a one-shot flag that the grammar sets right before it expects to see a procedure body.

The grammar uses Bison %glr-parser to handle SQL’s ambiguities; the %union carries the value types reduce actions can return; the rules build PT_NODEs.

// csql_grammar.y — declarations (excerpt)
%{/*%CODE_REQUIRES_START%*/
#include "json_table_def.h"
#include "parser.h"

typedef struct YYLTYPE
{
  int first_line;
  int first_column;
  int last_line;
  int last_column;
  int buffer_pos;   /* cumulative byte offset; updated by YY_USER_ACTION */
} YYLTYPE;
#define YYLTYPE_IS_DECLARED 1
/*%CODE_END%*/%}
%locations
%glr-parser
%define parse.error verbose
%union
{
  int number;
  bool boolean;
  PT_NODE *node;
  char *cptr;
  container_2 c2;
  container_3 c3;
  container_4 c4;
  container_10 c10;
  container_11 c11;
  struct json_table_column_behavior jtcb;
}
%type <node> stmt
%type <node> select_stmt
%type <node> select_expression
%type <node> select_expression_opt_with
/* ... ~1100 more %type lines ... */

The YYLTYPE extension is the parser’s way of preserving the lexer’s buffer_pos through reductions. The default Bison YYLTYPE carries only (first_line, first_column, last_line, last_column); CUBRID adds buffer_pos so error messages can render the offending byte range against the original SQL text.
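What the extra field buys is easiest to see in a sketch of an error reporter — illustrative names, not CUBRID's csql_yyerror — that turns a buffer_pos plus token length back into the offending bytes:

```c
#include <stdio.h>

/* Why a byte offset beats (line, column) alone: with the cumulative
 * buffer_pos of the token just matched, the reporter can quote the
 * offending substring straight out of the original SQL buffer. */
static void quote_error (const char *sql, int buffer_pos, int tok_len,
                         char *out, size_t outsz)
{
  /* buffer_pos points one past the end of the token (the YY_USER_ACTION
   * above adds the token length after matching) */
  int start = buffer_pos - tok_len;
  snprintf (out, outsz, "syntax error near '%.*s'", tok_len, sql + start);
}
```

For `SELECT FORM t`, a token `FORM` of length 4 with buffer_pos 11 yields `syntax error near 'FORM'` with no re-lexing of the input.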

A representative reduce action — the one that pairs an opt_with_clause with a select_expression — shows the typical shape:

// csql_grammar.y — select_expression_opt_with
select_expression_opt_with
  : opt_with_clause
    select_expression
    {{
      PT_NODE *with_clause = $1;
      PT_NODE *stmt = $2;

      if (stmt && with_clause)
        {
          stmt->info.query.with = with_clause;
        }
      $$ = stmt;
    }}
  ;

Three things to notice: (a) the action only wires up an existing PT_NODE returned by the inner rule; the PT_SELECT itself was allocated by an action deeper down, in select_stmt, with parser_new_node (this_parser, PT_SELECT); (b) the action does not return — it writes to $$, a local variable in the generated parser; (c) this_parser is a file-static PARSER_CONTEXT * that parser_main sets up before calling yyparse and tears down afterwards.

The actual PT_SELECT allocation happens here:

// csql_grammar.y — select_stmt (excerpt)
select_stmt
  : SELECT
    {{
      PT_NODE *node;

      parser_save_found_Oracle_outer ();
      if (parser_select_level >= 0)
        parser_select_level++;
      parser_hidden_incr_list = NULL;

      node = parser_new_node (this_parser, PT_SELECT);
      if (node)
        {
          node->info.query.q.select.flavor = PT_USER_SELECT;
          node->info.query.q.select.hint = PT_HINT_NONE;
        }
      parser_push_select_stmt_node (node);
      parser_push_hint_node (node);
    }}
    opt_hint_list           /* $3 */
    all_distinct            /* $4 */
    select_list             /* $5 */
    opt_select_param_list   /* $6 */
    /* ... wire children, pop the stmt node, set $$, ... */
  ;

Two patterns recur here and across the grammar: a stmt-stack push in the action that creates the node, and a stmt-stack pop in a later action that finalises it — this is how mid-rule actions like opt_hint_list know which PT_SELECT they belong to without threading explicit arguments through every rule.

parser_main lives at the bottom of csql_grammar.y (it has to, because yyparse is a static function in the generated csql_grammar.c and calling it from outside the .y file is awkward):

// parser_main — csql_grammar.y (condensed)
PT_NODE **
parser_main (PARSER_CONTEXT * parser)
{
  long desc_index = 0;
  long i, top;
  int rv, yybuffer_pos_save;
  PARSER_CONTEXT *this_parser_saved;

  if (!parser)
    return 0;

  parser_output_host_index = parser_input_host_index = desc_index = 0;
  this_parser_saved = this_parser;
  this_parser = parser;         /* publish to grammar-action TLS */
  dbcs_start_input ();
  yycolumn = yycolumn_end = 1;
  yybuffer_pos_save = yybuffer_pos;
  yybuffer_pos = 0;
  is_dblink_query_string = 0;
  expecting_pl_lang_spec = 0;
  csql_yylloc.buffer_pos = 0;
  g_query_string = NULL;
  g_query_string_pos = 0;
  g_query_string_len = 0;
  g_original_buffer_len = 0;

  pt_initialize_hint (parser, parser_hint_table);
  rv = yyparse ();              /* drives csql_yylex through tokens */
  /* parser_main can be re-entered (loaddb -s); restore yybuffer_pos. */
  yybuffer_pos = yybuffer_pos_save;
  pt_cleanup_hint (parser, parser_hint_table);

  if (pt_has_error (parser) || parser->stack_top <= 0 || !parser->node_stack)
    {
      parser->statements = NULL;
    }
  else
    {
      parser->statements = (PT_NODE **)
        parser_alloc (parser, (1 + parser->stack_top) * sizeof (PT_NODE *));
      if (parser->statements)
        {
          for (i = 0, top = parser->stack_top; i < top; i++)
            parser->statements[i] = parser->node_stack[i];
          parser->statements[top] = NULL;
        }
      parser->host_var_count = parser_input_host_index;
      /* ... allocate host_variables, host_var_expected_domains ... */
    }
  /* ... restore this_parser ... */
  return parser->statements;
}

The save/restore of this_parser and yybuffer_pos matters because of the loaddb -s re-entry path: a parse-and-execute loop inside loaddb can call parser_main recursively while the outer parser_main is still parsing the input file’s statements.

The parser->statements array is a NULL-terminated list of top-level statements. CUBRID’s parser accepts a multi-statement buffer (e.g. SELECT 1; SELECT 2;), pushes each one onto node_stack as the top-level rule reduces, and parser_main copies the stack into the result array at the end.
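The consumer side of that contract is a plain loop over the NULL terminator — a sketch with a hypothetical stand-in for PT_NODE:

```c
#include <stddef.h>

/* Iterating a parser_main-style result: a NULL-terminated array of
 * top-level statement nodes (stmt stands in for PT_NODE; illustrative). */
typedef struct stmt { int id; } stmt;

static int count_statements (stmt **statements)
{
    int n = 0;
    if (statements == NULL)
        return 0;                   /* parse failed: no result array */
    while (statements[n] != NULL)   /* sentinel marks the end */
        n++;
    return n;
}
```

The NULL sentinel means callers need no separate count field, at the cost of one extra pointer slot per parse — which is why parser_main allocates `1 + parser->stack_top` entries.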

PT_NODE — the polymorphic-tagged-union node

The node type is one struct with one tag and one union:

// parser_node — src/parser/parse_tree.h
struct parser_node
{
  PT_NODE_TYPE node_type;       /* the type of SQL statement this represents */
  int parser_id;                /* which parser did I come from */
  int line_number;              /* the user line number originating this */
  int column_number;            /* the user column number originating this */
  int buffer_pos;               /* position in the parse buffer of the string */
  char *sql_user_text;          /* user input sql string */
  int sql_user_text_len;
  PT_NODE *next;                /* forward link for NULL-terminated list */
  PT_NODE *or_next;             /* forward link for DNF list */
  PT_NODE *next_row;            /* PT_VALUE/NAME/EXPR row in PT_NODE_LIST */
  void *etc;                    /* application-specific info hook */
  UINTPTR spec_ident;           /* entity-spec equivalence class (for FROM resolve) */
  TP_DOMAIN *expected_domain;   /* expected domain for input marker */
  PT_NODE *data_type;           /* for non-primitive types, sets, objects */
  XASL_ID *xasl_id;             /* XASL_ID for this SQL statement */
  const char *alias_print;
  PARSER_VARCHAR *expr_before_const_folding;
  PT_TYPE_ENUM type_enum;       /* PT_TYPE_INTEGER / VARCHAR / ... */
  CACHE_TIME cache_time;
  int sub_host_var_count;
  int *sub_host_var_index;
  struct
  {                             /* 27 single-bit flags */
    unsigned recompile:1;
    unsigned cannot_prepare:1;
    /* ... use_plan_cache, is_hidden_column, with_rollup,
     * do_not_fold, is_value_query, do_not_replace_orderby,
     * use_auto_commit, ... 27 in total */
  } flag;
  PT_STATEMENT_INFO info;       /* depends on 'node_type' field */
};

Three observations:

  • next / or_next / next_row are three independent list spines. next is the natural list (select-list elements, AND-conjoined predicates, FROM-list specs). or_next is the DNF spine for predicate normalisation (a = 1 OR a = 2 OR a = 3 becomes a next of one and an or_next of three). next_row is for VALUES queries that reduce to PT_NODE_LIST. The walker (pt_walk_private) descends each spine independently — this is what lets transformation passes splice into one spine without disturbing the others.
  • data_type is a child, not an in-place type tag. type_enum is the cheap discriminator (PT_TYPE_INTEGER, PT_TYPE_VARCHAR); data_type is a separate PT_NODE of type PT_DATA_TYPE that carries domain-specific extras (precision, scale, collation). The walker’s PT_APPLY_WALK (parser, node->data_type, walk) is what makes type-checking transforms uniform across all parents.
  • Line / column / buffer_pos live on every node. They are set by pt_parser_line_col inside parser_new_node. Error messages can therefore quote the source range of any node in any later pass without re-parsing — semantic check, the optimizer, even XASL generation can produce source-anchored diagnostics.
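The independent-spine point can be made concrete with a toy struct (illustrative, not the real PT_NODE): for (a = 1 OR a = 2) AND b = 3 the next spine has two conjuncts, and the first conjunct carries an or_next spine of two alternatives.

```c
#include <stddef.h>

/* Two independent list spines in the PT_NODE style: `next` chains
 * AND-conjuncts, `or_next` chains the OR-alternatives of one conjunct. */
typedef struct pn
{
    int val;
    struct pn *next;      /* AND spine */
    struct pn *or_next;   /* OR spine of this conjunct */
} pn;

static int count_conjuncts (const pn *p)      /* walk the AND spine only */
{
    int n = 0;
    for (; p; p = p->next)
        n++;
    return n;
}

static int count_alternatives (const pn *p)   /* walk one OR spine only */
{
    int n = 0;
    for (; p; p = p->or_next)
        n++;
    return n;
}
```

Because each walk touches exactly one spine, a normalisation pass can splice nodes into the or_next chain without renumbering or revisiting the next chain — the property the text attributes to pt_walk_private.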

The PT_NODE_TYPE enum carries dual semantics: the values for SQL statements (PT_SELECT, PT_INSERT, …) are intentionally the same as CUBRID_STMT_* codes from the API, so a PT_NODE discriminator can be returned to the API client without translation:

// PT_NODE_TYPE — src/parser/parse_tree.h (excerpt)
enum pt_node_type
{
  PT_NODE_NONE = CUBRID_STMT_NONE,
  PT_ALTER = CUBRID_STMT_ALTER_CLASS,
  PT_CREATE_ENTITY = CUBRID_STMT_CREATE_CLASS,
  PT_INSERT = CUBRID_STMT_INSERT,
  PT_SELECT = CUBRID_STMT_SELECT,
  PT_UPDATE = CUBRID_STMT_UPDATE,
  PT_DELETE = CUBRID_STMT_DELETE,
  /* ... ~50 more statement-typed nodes ... */
  PT_DIFFERENCE = CUBRID_MAX_STMT_TYPE, /* internal-only types from here down */
  PT_INTERSECTION,
  PT_UNION,
  /* expression / clause / fragment nodes */
  PT_EXPR, PT_NAME, PT_VALUE, PT_FUNCTION, PT_DOT_,
  PT_SPEC, PT_DATA_TYPE, PT_SORT_SPEC, PT_HOST_VAR,
  /* ... ~40 more ... */
  PT_NODE_NUMBER,                       /* the number of node types */
  PT_LAST_NODE_NUMBER = PT_NODE_NUMBER
};

The split is meaningful: node_type < CUBRID_MAX_STMT_TYPE identifies a top-level statement that the API and the executor both need; node_type >= CUBRID_MAX_STMT_TYPE is internal parse-tree machinery.

The per-type info union is in parse_tree.h:3470:

// pt_statement_info — src/parser/parse_tree.h (excerpt)
union pt_statement_info
{
  PT_ZZ_ERROR_MSG_INFO error_msg;
  PT_ALTER_INFO alter;
  PT_ALTER_TRIGGER_INFO alter_trigger;
  PT_CREATE_ENTITY_INFO create_entity;
  PT_DELETE_INFO delete_;
  PT_DOT_INFO dot;
  PT_EXPR_INFO expr;
  PT_FUNCTION_INFO function;
  PT_INSERT_INFO insert;
  PT_NAME_INFO name;
  PT_QUERY_INFO query;          /* SELECT, UNION, etc. */
  PT_SPEC_INFO spec;            /* FROM-clause table spec */
  PT_VALUE_INFO value;
  PT_DATA_TYPE_INFO data_type;
  /* ... 80+ more variants ... */
};

PT_QUERY_INFO::q.select is the one most users picture when they think of a parse tree: it has list (the select list), from, where, group_by, having, connect_by, start_with, using_index, hint slots, order_by (lifted to PT_QUERY_INFO::order_by because UNION shares it), limit, for_update, and a dozen others. Each of these is a PT_NODE * that points to whatever the grammar reduced into it.

Per-type child layout — pt_apply_f, pt_init_f, pt_print_f

Three function-pointer tables, all sized PT_NODE_NUMBER, are the polymorphism mechanism:

// parse_tree_cl.c — apply/init/print function tables (excerpt)
typedef PT_NODE *(*PARSER_INIT_NODE_FUNC) (PT_NODE *);
typedef PARSER_VARCHAR *(*PARSER_PRINT_NODE_FUNC) (PARSER_CONTEXT * parser, PT_NODE * p);
typedef PT_NODE *(*PARSER_APPLY_NODE_FUNC) (PARSER_CONTEXT * parser, PT_NODE * p, void *arg);
static PARSER_INIT_NODE_FUNC *pt_init_f = NULL;
static PARSER_PRINT_NODE_FUNC *pt_print_f = NULL;
static PARSER_APPLY_NODE_FUNC *pt_apply_f = NULL;
static PARSER_APPLY_NODE_FUNC pt_apply_func_array[PT_NODE_NUMBER];
static PARSER_INIT_NODE_FUNC pt_init_func_array[PT_NODE_NUMBER];
static PARSER_PRINT_NODE_FUNC pt_print_func_array[PT_NODE_NUMBER];

parser_init_func_vectors (called once from parser_create_parser when pt_apply_f == NULL) populates the arrays. The per-PT_SELECT apply function is the canonical example:

// pt_apply_select — parse_tree_cl.c
static PT_NODE *
pt_apply_select (PARSER_CONTEXT * parser, PT_NODE * p, void *arg)
{
  PT_APPLY_WALK (parser, p->info.query.with, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.list, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.from, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.where, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.connect_by, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.start_with, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.after_cb_filter, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.group_by, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.having, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.using_index, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.with_increment, arg);
  /* ... more hint slots ... */
  PT_APPLY_WALK (parser, p->info.query.into_list, arg);
  PT_APPLY_WALK (parser, p->info.query.order_by, arg);
  PT_APPLY_WALK (parser, p->info.query.orderby_for, arg);
  PT_APPLY_WALK (parser, p->info.query.qcache_hint, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.check_where, arg);
  PT_APPLY_WALK (parser, p->info.query.limit, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.for_update, arg);
  return p;
}

The order matters: every walker (semantic check, type inference, name resolution, plan generation, pretty-printing) visits these children in this order. A new child added to PT_QUERY_INFO::q.select that is not added here will be silently invisible to every existing walker — which is one of the few places in the parser where the build will not warn about an oversight.

pt_init_select is the parallel for default initialisation:

// pt_init_select — parse_tree_cl.c
static PT_NODE *
pt_init_select (PT_NODE * p)
{
  p->info.query.q.select.hint = PT_HINT_NONE;
  p->info.query.q.select.check_cycles = CONNECT_BY_CYCLES_ERROR;
  p->info.query.all_distinct = PT_ALL;
  p->info.query.is_subquery = (PT_MISC_TYPE) 0;
  p->info.query.hint = PT_HINT_NONE;
  p->info.query.scan_op_type = S_SELECT;
  return p;
}

pt_init_f[node->node_type] is called from parser_init_node on every newly-allocated node, so the default values are in-place before any reduce action gets to the new node.

The walker — parser_walk_tree / pt_walk_private

parser_walk_tree is the universal pre/post-order traversal:

// parser_walk_tree — parse_tree_cl.c
PT_NODE *
parser_walk_tree (PARSER_CONTEXT * parser, PT_NODE * node,
                  PT_NODE_WALK_FUNCTION pre_function, void *pre_argument,
                  PT_NODE_WALK_FUNCTION post_function, void *post_argument)
{
  PT_WALK_ARG walk_argument;

  if (node == NULL)
    return NULL;

  walk_argument.continue_walk = PT_CONTINUE_WALK;
  walk_argument.pre_function = pre_function;
  walk_argument.pre_argument = pre_argument;
  walk_argument.post_function = post_function;
  walk_argument.post_argument = post_argument;

  PT_APPLY_WALK (parser, node, &walk_argument);
  return node;
}

The real engine is pt_walk_private. For each node it: (a) calls pre_function, (b) dispatches through pt_apply_f[node_type] to visit the typed children, then visits data_type, then or_next, then next, in that order, (c) calls post_function. The four continue_walk modes (PT_STOP_WALK, PT_CONTINUE_WALK, PT_LEAF_WALK, PT_LIST_WALK) let the pre-function prune the search:

// pt_walk_private — parse_tree_cl.c (condensed)
static PT_NODE *
pt_walk_private (PARSER_CONTEXT * parser, PT_NODE * node, void *void_arg)
{
  PT_WALK_ARG *walk = (PT_WALK_ARG *) void_arg;
  PT_NODE_TYPE node_type;
  PARSER_APPLY_NODE_FUNC apply;
  int save_continue;

  if (walk->pre_function)
    {
      node = (*walk->pre_function) (parser, node, walk->pre_argument,
                                    &(walk->continue_walk));
      if (!node)
        return NULL;
    }

  if (walk->continue_walk != PT_STOP_WALK)
    {
      save_continue = walk->continue_walk;

      if (save_continue == PT_CONTINUE_WALK || save_continue == PT_LEAF_WALK)
        {
          node_type = node->node_type;
          if (node_type >= PT_LAST_NODE_NUMBER || !(apply = pt_apply_f[node_type]))
            return NULL;

          (*apply) (parser, node, walk);                  /* typed children */
          PT_APPLY_WALK (parser, node->data_type, walk);  /* data_type child */
        }

      if (save_continue == PT_CONTINUE_WALK || save_continue == PT_LEAF_WALK
          || save_continue == PT_LIST_WALK)
        PT_APPLY_WALK (parser, node->or_next, walk);

      if (save_continue == PT_CONTINUE_WALK || save_continue == PT_LIST_WALK)
        PT_APPLY_WALK (parser, node->next, walk);

      if (walk->continue_walk != PT_STOP_WALK)
        walk->continue_walk = save_continue;
    }

  if (walk->post_function)
    node = (*walk->post_function) (parser, node, walk->post_argument,
                                   &(walk->continue_walk));
  return node;
}

The traversal shape — pre-visit, typed-child via apply, data_type, or_next, next, post-visit — is fixed; only the user-supplied pre_function and post_function vary. This is how every later pass (pt_resolve_names, pt_check_with_info, pt_semantic_type, pt_to_buildlist_proc, pt_print_tree) plugs into the same walker without each having to re-implement the per-type child visit. The companion parser_walk_leaves is the variant that starts at the leaves of the input node — it skips the pre-visit on node itself and goes straight to the typed children.
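The fixed-shape walk plus prune flag reduces to a few lines — a toy version of the parser_walk_tree contract with an invented node type, where the real per-type dispatch through pt_apply_f is stood in for by a single child pointer:

```c
#include <stddef.h>

/* Pre/post-order walk with a caller-controlled stop flag (illustrative). */
typedef struct wn
{
    int id;
    struct wn *child;   /* stand-in for the typed children */
    struct wn *next;    /* the sibling/list spine */
} wn;

typedef enum { W_CONTINUE, W_STOP } w_mode;
typedef void (*w_fn) (wn *, void *arg, w_mode *mode);

static void walk (wn *n, w_fn pre, w_fn post, void *arg, w_mode *mode)
{
    if (n == NULL || *mode == W_STOP)
        return;
    if (pre)
        pre (n, arg, mode);             /* pre-visit; may set W_STOP */
    if (*mode != W_STOP)
    {
        walk (n->child, pre, post, arg, mode);  /* typed children */
        walk (n->next, pre, post, arg, mode);   /* the next spine */
    }
    if (post)
        post (n, arg, mode);            /* post-visit */
}

/* example pre-function: count every visited node */
static void count_pre (wn *n, void *arg, w_mode *mode)
{
    (void) n;
    (void) mode;
    (*(int *) arg)++;
}
```

Only `pre` and `post` vary between passes; the traversal order is owned by `walk`, which is the same division of labour the text describes between walker callbacks and pt_apply_f.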

The third key piece is the allocator. Every parse-time allocation goes through one of two paths into a per-parser pool:

// parse_tree.c — block-pool sketch (one block per pool node)
#define NODES_PER_BLOCK 256

typedef struct parser_node_block PARSER_NODE_BLOCK;
struct parser_node_block
{
  PARSER_NODE_BLOCK *next;
  int parser_id;
  PT_NODE nodes[NODES_PER_BLOCK];
};

static PARSER_NODE_BLOCK *parser_Node_blocks[HASH_NUMBER];
static PARSER_NODE_FREE_LIST *parser_Node_free_lists[HASH_NUMBER];
static PARSER_STRING_BLOCK *parser_String_blocks[HASH_NUMBER];
static int parser_id = 1;

parser_create_parser is the lifecycle entry:

// parser_create_parser — parse_tree.c (condensed)
PARSER_CONTEXT *
parser_create_parser (void)
{
  PARSER_CONTEXT *parser;
  struct timeval t;
  struct drand48_data rand_buf;

  parser = (PARSER_CONTEXT *) calloc (sizeof (PARSER_CONTEXT), 1);
  if (parser == NULL)
    {
      er_set (ER_ERROR_SEVERITY, ARG_FILE_LINE, ER_OUT_OF_VIRTUAL_MEMORY,
              1, sizeof (PARSER_CONTEXT));
      return NULL;
    }
#if !defined (SERVER_MODE)
  parser_init_func_vectors ();  /* lazy fill of pt_apply_f / init_f / print_f */
#endif
  parser->id = parser_id++;
  if (pt_register_parser (parser) == ER_FAILED)
    {
      free_and_init (parser);
      return NULL;
    }
  parser->execution_values.row_count = -1;
  /* ... seed lrand48 / drand48 / etc ... */
  parser->query_id = NULL_QUERY_ID;
  /* ... 15 single-bit flags ... */
  return parser;
}

Two non-obvious facts. First, parser_init_func_vectors is lazy-fired here — the first parser created in a process pays the cost of populating the three function-pointer tables, every subsequent parser inherits them. Second, the parser does not own its node blocks directly; it owns a parser_id and the node blocks are looked up in a process-wide hash (parser_Node_blocks[HASH_NUMBER]). This makes parser_free_parser a hash-walk that reclaims every block tagged with the dying parser’s id.
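A minimal model of that id-keyed registry (hypothetical names — BLOCK, pool_alloc, pool_free_all — and without CUBRID's free lists or string blocks) makes the lifecycle concrete: allocation bump-carves slots from the newest block for an owner, and teardown is a single hash-chain walk that frees every block tagged with the dying owner's id:

```c
#include <stdlib.h>

#define HASH_NUMBER 8
#define SLOTS_PER_BLOCK 4

typedef struct block BLOCK;
struct block
{
  BLOCK *next;
  int owner_id;       /* the parser_id analog */
  int used;           /* bump-pointer high-water mark */
  double slots[SLOTS_PER_BLOCK];  /* stand-in for PT_NODE slots */
};

static BLOCK *blocks[HASH_NUMBER];  /* process-wide, like parser_Node_blocks */

static void *
pool_alloc (int owner_id)
{
  BLOCK *b = blocks[owner_id % HASH_NUMBER];

  /* reuse the newest block for this owner if it still has room */
  if (b == NULL || b->owner_id != owner_id || b->used == SLOTS_PER_BLOCK)
    {
      BLOCK *nb = calloc (1, sizeof (BLOCK));
      if (nb == NULL)
        return NULL;
      nb->owner_id = owner_id;
      nb->next = blocks[owner_id % HASH_NUMBER];
      blocks[owner_id % HASH_NUMBER] = nb;
      b = nb;
    }
  return &b->slots[b->used++];
}

/* Free every block tagged with the dying owner's id; returns blocks freed.
 * Other owners sharing the same hash chain are left untouched. */
static int
pool_free_all (int owner_id)
{
  BLOCK **link = &blocks[owner_id % HASH_NUMBER];
  int freed = 0;

  while (*link)
    {
      if ((*link)->owner_id == owner_id)
        {
          BLOCK *dead = *link;
          *link = dead->next;
          free (dead);
          freed++;
        }
      else
        link = &(*link)->next;
    }
  return freed;
}

/* Five allocations for owner 1 span two 4-slot blocks; owner 9 hashes to
 * the same chain but survives owner 1's teardown. */
static int
demo (void)
{
  int i;
  for (i = 0; i < 5; i++)
    if (pool_alloc (1) == NULL)
      return -1;
  (void) pool_alloc (9);      /* 9 % 8 == 1: same chain as owner 1 */
  return pool_free_all (1);   /* frees only owner 1's two blocks */
}
```

The design pays a hash-walk at parser death in exchange for free-less allocation during the parse — the same trade CUBRID makes.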

Node allocation:

// parser_new_node — parse_tree_cl.c
PT_NODE *
parser_new_node (PARSER_CONTEXT * parser, PT_NODE_TYPE node_type)
{
  PT_NODE *node;

  node = parser_create_node (parser);      /* carve from PARSER_NODE_BLOCK */
  if (node)
    {
      parser_init_node (node, node_type);  /* memset + pt_init_f[node_type] */
      pt_parser_line_col (node);           /* stamp from current YYLTYPE */
      node->sql_user_text = g_query_string;
      node->sql_user_text_len = g_query_string_len;
    }
  return node;
}

Three things happen in a single call: (a) a PT_NODE-sized slot is taken from the parser’s node-block pool; (b) the slot is zeroed, the node_type field is set, and pt_init_f[node_type] runs to fill in default values for the type-specific union arm; (c) the (line, column, buffer_pos) triple is stamped from the lexer’s current csql_yylloc/g_query_string_pos so future error messages can quote the source position of this exact node.

parser_init_node is the second half:

// parser_init_node — parse_tree_cl.c
PT_NODE *
parser_init_node (PT_NODE * node, PT_NODE_TYPE node_type)
{
  assert (node != NULL);
  assert (node_type < PT_LAST_NODE_NUMBER);

  int parser_id = node->parser_id;
  memset (node, 0x00, sizeof (PT_NODE));
  node->buffer_pos = -1;
  node->type_enum = PT_TYPE_NONE;
  node->parser_id = parser_id;
  node->node_type = node_type;
  if (pt_init_f[node_type])
    node = (pt_init_f[node_type]) (node);
  return node;
}

The parser_id is preserved across the memset because the slot’s identity (which parser does it belong to?) outlives the node’s logical state; this is what lets parser_reinit_node re-stamp an in-use slot during walk-rewrites without losing the slot’s home parser.
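Both ideas — a per-type init dispatch table and a pool-identity field that survives the memset — fit in a few lines. A toy version under invented names (MINI_NODE, init_f here are illustrative, not CUBRID's types):

```c
#include <string.h>

enum node_type { NT_VALUE, NT_EXPR, NT_LAST };

typedef struct mini_node MINI_NODE;
struct mini_node
{
  int parser_id;        /* which pool the slot belongs to: survives re-init */
  enum node_type type;
  int payload;
};

typedef MINI_NODE *(*init_fn) (MINI_NODE *);

/* per-type defaulting, the pt_init_f analog */
static MINI_NODE *init_value (MINI_NODE * n) { n->payload = -1; return n; }
static MINI_NODE *init_expr  (MINI_NODE * n) { n->payload = 0;  return n; }

static init_fn init_f[NT_LAST] = { init_value, init_expr };

static MINI_NODE *
node_init (MINI_NODE * n, enum node_type type)
{
  int id = n->parser_id;       /* slot identity outlives node state */
  memset (n, 0, sizeof (*n));
  n->parser_id = id;
  n->type = type;
  return init_f[type] ? init_f[type] (n) : n;
}

/* Re-initialising an NT_EXPR slot as NT_VALUE wipes the payload but keeps
 * the owning pool's id — the parser_reinit_node behaviour. */
static int
demo_preserved_id (void)
{
  MINI_NODE n = { 7, NT_EXPR, 99 };
  node_init (&n, NT_VALUE);
  return (n.parser_id == 7 && n.payload == -1) ? 1 : 0;
}
```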

The companion parser_alloc is for everything that is not a PT_NODE — strings, arrays, host-variable arrays:

// parser_alloc — parse_tree.c (excerpt)
void *
parser_alloc (const PARSER_CONTEXT * parser, const int length)
{
  /* ... locate or create a PARSER_STRING_BLOCK with `length` bytes free,
   * bump the block's high-water mark, return the new region ... */
}

Strings inside parser_alloc’s pool live and die with the parser context just like nodes. The grammar’s calls to pt_makename(yytext) (which calls pt_append_string) for every identifier therefore do not leak — the entire universe of identifier strings allocated during a parse is freed in one sweep by parser_free_parser → pt_free_string_blocks.
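The arena discipline for strings is easy to model. A hypothetical sketch (STR_ARENA and friends are invented names, not CUBRID's PARSER_STRING_BLOCK layout; oversize strings are simply rejected to keep it short):

```c
#include <stdlib.h>
#include <string.h>

#define STR_BLOCK_BYTES 64

typedef struct str_block STR_BLOCK;
struct str_block
{
  STR_BLOCK *next;
  size_t used;                   /* bump-pointer high-water mark */
  char bytes[STR_BLOCK_BYTES];
};

typedef struct { STR_BLOCK *head; } STR_ARENA;

/* The pt_makename-style intern: copy the lexeme into the arena. */
static char *
arena_strdup (STR_ARENA * a, const char *s)
{
  size_t len = strlen (s) + 1;
  if (len > STR_BLOCK_BYTES)
    return NULL;                 /* sketch: no oversize blocks */
  if (a->head == NULL || STR_BLOCK_BYTES - a->head->used < len)
    {
      STR_BLOCK *b = calloc (1, sizeof (STR_BLOCK));
      if (b == NULL)
        return NULL;
      b->next = a->head;
      a->head = b;
    }
  char *out = a->head->bytes + a->head->used;
  memcpy (out, s, len);
  a->head->used += len;
  return out;
}

/* The pt_free_string_blocks analog: one sweep frees every interned string. */
static void
arena_free_all (STR_ARENA * a)
{
  while (a->head)
    {
      STR_BLOCK *dead = a->head;
      a->head = dead->next;
      free (dead);
    }
}

/* Intern the running example's identifiers, verify them, free in one sweep. */
static int
demo_intern (void)
{
  STR_ARENA a = { NULL };
  const char *ids[] = { "code", "s_name", "G" };
  char *kept[3];
  int i, ok = 1;

  for (i = 0; i < 3; i++)
    kept[i] = arena_strdup (&a, ids[i]);
  for (i = 0; i < 3; i++)
    ok = ok && kept[i] && strcmp (kept[i], ids[i]) == 0;
  arena_free_all (&a);
  return ok;
}
```

No string is ever freed individually; the arena's lifetime is the parser's lifetime, which is why the grammar can intern every identifier without bookkeeping.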

A handful of functions live in csql_grammar.y’s epilogue because every grammar action calls them:

// parser_make_expression — csql_grammar.y
PT_NODE *
parser_make_expression (PARSER_CONTEXT * parser, PT_OP_TYPE OP,
                        PT_NODE * arg1, PT_NODE * arg2, PT_NODE * arg3)
{
  PT_NODE *expr = parser_new_node (parser, PT_EXPR);

  if (expr)
    {
      expr->info.expr.op = OP;
      if (pt_is_operator_logical (expr->info.expr.op))
        expr->type_enum = PT_TYPE_LOGICAL;
      expr->info.expr.arg1 = arg1;
      expr->info.expr.arg2 = arg2;
      expr->info.expr.arg3 = arg3;
      /* ... INST_NUM / GROUPBY_NUM / ORDERBY_NUM compatibility checks ... */
      /* ... mark parser_cannot_cache for SERIAL / SYS_DATE-family ops ... */
    }
  return expr;
}

static PT_NODE *
parser_make_link (PT_NODE * list, PT_NODE * node)
{
  parser_append_node (node, list);
  return list;
}

parser_make_expression is what every binary or ternary operator rule in the grammar reduces to — <expr> + <expr>, <expr> AND <expr>, <expr> BETWEEN <expr> AND <expr> all funnel through it. parser_make_link is what lets a recursive list rule (select_list : select_list ',' expr) extend the list using the next pointer rather than allocating a separate list node. These two helpers — plus parser_make_func_with_arg_count for arity-checked function nodes and pt_makename for interning identifier strings — cover roughly 80% of the grammar's reduce actions.
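The parser_make_link trick — the nodes are their own list cells, chained on next — can be sketched with an illustrative LNODE type (names invented here; the real parser_append_node lives in parse_tree_cl.c and caches the tail):

```c
#include <stddef.h>

typedef struct lnode LNODE;
struct lnode
{
  int value;
  LNODE *next;   /* the PT_NODE::next analog: the node is its own cons cell */
};

/* parser_append_node analog: walk to the tail, hook the new node on. */
static LNODE *
append_node (LNODE * node, LNODE * list)
{
  if (list == NULL)
    return node;
  LNODE *tail = list;
  while (tail->next)
    tail = tail->next;
  tail->next = node;
  return list;
}

/* parser_make_link analog: the recursive list rule's reduce action,
 * i.e. what select_list : select_list ',' expr would call. */
static LNODE *
make_link (LNODE * list, LNODE * node)
{
  return append_node (node, list);
}

static int
list_length (const LNODE * list)
{
  int n = 0;
  for (; list; list = list->next)
    n++;
  return n;
}

/* Reducing "a, b, c" left-to-right keeps the head stable and extends the
 * spine — no separate list nodes are ever allocated. */
static int
demo_list (void)
{
  LNODE a = { 1, NULL }, b = { 2, NULL }, c = { 3, NULL };
  LNODE *l = make_link (make_link (&a, &b), &c);
  return (l == &a && list_length (l) == 3 && a.next == &b && b.next == &c);
}
```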

Putting it together — the SELECT example


The deck’s running example, SELECT * FROM code WHERE s_name > 'G';, exercises every piece above. The rough trace:

  1. The lexer returns the tokens SELECT '*' FROM IDENTIFIER('code') WHERE IDENTIFIER('s_name') '>' STRING_LITERAL('G') ';'. IDENTIFIER and STRING_LITERAL carry their lexeme through pt_makename/pt_append_bytes; the others carry only their token code.
  2. As the bottom-up parser reduces, each rule’s action allocates the right PT_NODE and wires its children. Allocation order is roughly bottom-up: the literal 'G' reduces first (PT_VALUE, type_enum = PT_TYPE_CHAR), then the identifier s_name (PT_NAME), then the comparison (PT_EXPR with op = PT_GT), then the wildcard * (PT_VALUE, type_enum = PT_TYPE_STAR), then the FROM-spec (PT_SPEC whose entity_name points at a PT_NAME for code), then the PT_SELECT whose info.query.q.select.list/from/where points at all of the above.
  3. The top-level reduce pushes the resulting PT_SELECT onto parser->node_stack.
  4. parser_main snapshots the stack into parser->statements and returns it.

After parser_parse_string returns, the rest of the system sees only the PT_SELECT root. The lexer and grammar are out of the picture; every later transformation walks pt_apply_f/pt_walk_private and rewrites the same arena.
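Under invented names (MNODE with a flattened child layout, standing in for PT_NODE and its per-type info arms), the bottom-up construction order in the trace corresponds to roughly this:

```c
#include <stdlib.h>
#include <string.h>

enum mtype { M_VALUE, M_NAME, M_EXPR, M_SPEC, M_SELECT };

typedef struct mnode MNODE;
struct mnode
{
  enum mtype type;
  const char *text;             /* lexeme for names / values */
  MNODE *arg1, *arg2;           /* PT_EXPR-style children */
  MNODE *list, *from, *where;   /* PT_SELECT-style children */
};

static MNODE *
mnew (enum mtype type, const char *text)
{
  MNODE *n = calloc (1, sizeof (MNODE));  /* sketch: assume success */
  n->type = type;
  n->text = text;
  return n;
}

/* SELECT * FROM code WHERE s_name > 'G' — leaves reduce first,
 * the select root last, mirroring the trace above. */
static MNODE *
build_select (void)
{
  MNODE *lit  = mnew (M_VALUE, "G");       /* 'G' reduces first */
  MNODE *name = mnew (M_NAME, "s_name");
  MNODE *gt   = mnew (M_EXPR, ">");
  gt->arg1 = name;
  gt->arg2 = lit;
  MNODE *star = mnew (M_VALUE, "*");
  MNODE *spec = mnew (M_SPEC, "code");
  MNODE *sel  = mnew (M_SELECT, NULL);
  sel->list = star;
  sel->from = spec;
  sel->where = gt;
  return sel;
}

static int
demo_select_shape (void)
{
  MNODE *sel = build_select ();
  return sel->type == M_SELECT
    && sel->where->type == M_EXPR
    && strcmp (sel->where->arg1->text, "s_name") == 0
    && strcmp (sel->where->arg2->text, "G") == 0
    && strcmp (sel->from->text, "code") == 0;
}
```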

Both .l and .y are inputs to a CMake custom command that runs Flex and Bison at build time:

# CMakeLists.txt (root) — CSQL FLEX/BISON targets (excerpt)
# Replace old bison directives with new ones; regenerate csql_grammar.yy
# whenever csql_grammar.y is modified.
add_custom_command(OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/csql_grammar.yy
  COMMAND ${CMAKE_COMMAND}
    -DGRAMMAR_INPUT_FILE=${CMAKE_SOURCE_DIR}/src/parser/csql_grammar.y
    -DGRAMMAR_OUTPUT_FILE=${CMAKE_CURRENT_BINARY_DIR}/csql_grammar.yy
    -P ${CMAKE_SOURCE_DIR}/cmake/CSQLGrammarTransform.cmake
  MAIN_DEPENDENCY ${CMAKE_SOURCE_DIR}/src/parser/csql_grammar.y
)
set(CSQL_GRAMMAR_INPUT ${CMAKE_CURRENT_BINARY_DIR}/csql_grammar.yy)
bison_target(csql_grammar ${CSQL_GRAMMAR_INPUT} ${CSQL_GRAMMAR_OUTPUT}
  COMPILE_FLAGS "--no-lines --name-prefix=csql_yy -d -r all")
flex_target(csql_lexer ${CSQL_LEXER_INPUT} ${CSQL_LEXER_OUTPUT}
  COMPILE_FLAGS "--noline --never-interactive --prefix=csql_yy")
add_flex_bison_dependency(csql_lexer csql_grammar)
add_custom_target(gen_csql_grammar DEPENDS ${BISON_csql_grammar_OUTPUTS})
add_custom_target(gen_csql_lexer DEPENDS ${FLEX_csql_lexer_OUTPUTS})

Two notable choices: --name-prefix=csql_yy (so all generated symbols begin csql_yy* and do not collide with the loader’s own Flex/Bison grammar in src/loaddb/), and the csql_grammar.yy indirection — csql_grammar.y is pre-processed by CSQLGrammarTransform.cmake before Bison sees it, which is how the project rewrites legacy YACC directives into Bison-3 syntax without editing the original .y file.

Anchor on symbol names, not line numbers. The CUBRID source moves; the position table at the end is accurate only as of the date this doc was last updated.

Lexer layout (src/parser/csql_lexer.l)

  • Definitions block — YY_INPUT / YY_USER_ACTION / start conditions / globals (yybuffer_pos, is_dblink_query_string, expecting_pl_lang_spec).
  • Keyword rules — about 600 [xX]... patterns each returning a Bison token code. Keywords that need their lexeme retained set csql_yylval.cptr = pt_makename(yytext) first.
  • Identifier / numeric / string-literal rules.
  • Start-condition rules — <QUOTED_CHAR_STRING>, <DELIMITED_ID_NAME>, <BRACKET_ID_NAME>, <BACKTICK_ID_NAME>, <PLCSQL_TEXT>, <POST_PLCSQL_TEXT>, <PL_LANG_SPEC>, etc.
  • User-subroutines block — parser_yyinput and its single / multi-line variants, csql_yyerror, csql_yyerror_explicit, yywrap, begin_token, parser_c_hint / parser_line_hint.
Grammar layout (src/parser/csql_grammar.y)

  • Prologue (%{ ... %}) — YYLTYPE extension with buffer_pos, helper-container typedefs (container_2..11), parser_make_func_with_arg_count, parser_make_expression, parser_make_link declarations.
  • Bison declarations — %locations, %glr-parser, %union, %type (~1100 entries), %token (~600 entries).
  • Grammar rules — stmt, stmt_, select_stmt, select_expression, select_expression_opt_with, select_list, from_list, extended_table_specification, single_table_spec, subquery, expression, predicate, function_call, data_type, create_stmt, alter_stmt, insert_stmt, update_stmt, delete_stmt, merge_stmt … ~1700 productions.
  • Epilogue — parser_initialize_parser_context, parser_make_func_with_arg_count, parser_make_expression, parser_make_link, parser_make_link_or, parser_main, parser_init_yydebug.

Parse-tree definitions (src/parser/parse_tree.h)

  • enum pt_node_type — the PT_NODE_TYPE tags, with the PT_NODE_NUMBER/PT_LAST_NODE_NUMBER sentinel.
  • enum pt_type_enum — runtime value-type tag (PT_TYPE_INTEGER, PT_TYPE_VARCHAR, …).
  • Per-type PT_*_INFO structs (PT_QUERY_INFO, PT_SELECT_INFO inside PT_QUERY_INFO, PT_SPEC_INFO, PT_NAME_INFO, PT_EXPR_INFO, PT_VALUE_INFO, PT_DATA_TYPE_INFO, …).
  • union pt_statement_info — typedef'd as PT_STATEMENT_INFO.
  • struct parser_node — the PT_NODE itself with the 27 flag bits and the info union.
  • struct parser_context — node stack, statements array, per-parser globals (error_msgs, host_variables, …).
  • enum { PT_STOP_WALK, PT_CONTINUE_WALK, PT_LEAF_WALK, PT_LIST_WALK } — walker continuation modes.
  • typedef PT_NODE *(*PT_NODE_WALK_FUNCTION) (...) — the walker function-pointer type users plug into parser_walk_tree.

Tree-and-arena code (src/parser/parse_tree.c)

  • PARSER_NODE_BLOCK, PARSER_NODE_FREE_LIST, PARSER_STRING_BLOCK — the per-parser pool types.
  • parser_create_node_block — extend the block list.
  • parser_create_node — bump-allocate one PT_NODE slot.
  • pt_free_node_blocks — release all blocks for a parser id.
  • parser_create_string_block / parser_allocate_string_buffer — strings.
  • pt_register_parser / pt_unregister_parser — process-wide hash registration.
  • parser_alloc — generic-allocator wrapper.
  • pt_append_string / pt_append_varchar / pt_append_bytes — string accumulation.
  • parser_create_parser / parser_free_parser — context lifecycle.
  • parser_free_node_resources / parser_free_node — individual-node free (rarely called; the arena handles bulk free).

Walker, init, helpers (src/parser/parse_tree_cl.c)

  • pt_apply_f, pt_init_f, pt_print_f — the three per-PT_NODE_TYPE function-pointer arrays.
  • pt_apply_func_array, pt_init_func_array, pt_print_func_array — the static backing arrays.
  • pt_init_apply_f, pt_init_init_f, pt_init_print_f — populate the three arrays.
  • parser_init_func_vectors — lazy entry called from parser_create_parser.
  • pt_walk_private — the recursive walker, shared by parser_walk_tree and parser_walk_leaves.
  • parser_walk_tree, parser_walk_leaves, pt_continue_walk — public traversal API.
  • parser_new_node, parser_init_node, parser_reinit_node — node creation and re-init.
  • parser_free_tree, parser_free_subtrees — explicit subtree free (used by transforms that re-write a subtree before the parser dies).
  • parser_copy_tree, parser_copy_tree_list — deep-copy with copy_node_in_tree_pre.
  • parser_append_node, parser_append_node_or — list-spine helpers.
  • pt_apply_select, pt_init_select, pt_print_select — the canonical example for PT_SELECT; every other PT_NODE_TYPE has a parallel triple.
  • parser_parse_string, parser_parse_string_with_escapes, parser_parse_string_use_sys_charset, parser_parse_file — public entrypoints.
Public API surface (src/parser/parser.h)

  • parser_main, parser_create_parser, parser_free_parser, parser_init_func_vectors, parser_alloc.
  • parser_parse_string, parser_parse_string_with_escapes, parser_parse_string_use_sys_charset, parser_parse_file.
  • parser_new_node, parser_init_node, parser_reinit_node, parser_free_tree, parser_free_subtrees, parser_clear_node, parser_free_node, parser_free_node_resources.
  • parser_walk_tree, parser_walk_leaves, pt_continue_walk.
  • parser_print_tree, parser_print_tree_with_quotes, parser_print_tree_list.

Scanner support (src/parser/scanner_support.c)

  • pt_makename — interns an identifier string into the per-parser arena and returns the char *.
  • pt_get_keyword / pt_check_word_for_keyword — look up whether a candidate identifier is actually a reserved word.
  • dbcs_get_next / buffgetin / binarygetin / fgetin — the next_char / next_byte implementations the parser context wires up before calling the lexer.
Symbol — file:line
enum pt_node_type (PT_NODE_TYPE) — parse_tree.h:963
PT_LAST_NODE_NUMBER — parse_tree.h:1091
enum pt_type_enum (PT_TYPE_ENUM) — parse_tree.h:1096
union pt_statement_info — parse_tree.h:3470
struct parser_node (the PT_NODE itself) — parse_tree.h:3649
struct parser_context — parse_tree.h:3766
PT_NODE_WALK_FUNCTION typedef — parse_tree.h:1737
PT_STOP_WALK / CONTINUE / LEAF / LIST — parse_tree.h:882
parser_main — csql_grammar.y:23678
parser_make_expression — csql_grammar.y:22573
parser_make_link — csql_grammar.y:22633
parser_make_func_with_arg_count — csql_grammar.y:22542
parser_initialize_parser_context — csql_grammar.y:23480
select_stmt rule — csql_grammar.y:12848
select_expression rule — csql_grammar.y:12365
select_expression_opt_with rule — csql_grammar.y:12263
stmt rule — csql_grammar.y:1776
%union — csql_grammar.y:545
%locations / %glr-parser — csql_grammar.y:540
pt_walk_private — parse_tree_cl.c:914
parser_walk_leaves — parse_tree_cl.c:1001
parser_walk_tree — parse_tree_cl.c:1045
parser_new_node — parse_tree_cl.c:2245
parser_init_node — parse_tree_cl.c:2266
parser_reinit_node — parse_tree_cl.c:2292
parser_append_node — parse_tree_cl.c:3238
pt_init_apply_f — parse_tree_cl.c:5016
pt_init_init_f — parse_tree_cl.c:5145
pt_init_print_f — parse_tree_cl.c:5280
pt_apply_select — parse_tree_cl.c:14302
pt_init_select — parse_tree_cl.c:14340
pt_print_select — parse_tree_cl.c:14358
parser_init_func_vectors — parse_tree_cl.c:17754
parser_parse_string — parse_tree_cl.c:1800
parser_parse_string_with_escapes — parse_tree_cl.c:1813
parser_parse_string_use_sys_charset — parse_tree_cl.c:1782
parser_parse_file — parse_tree_cl.c:1924
parser_create_parser — parse_tree.c:1174
parser_free_parser — parse_tree.c:1253
parser_create_node — parse_tree.c:276
parser_create_node_block — parse_tree.c:219
parser_alloc — parse_tree.c:956
parser_main declaration — parser.h:68
parser_create_parser declaration — parser.h:73
parser_walk_tree declaration — parser.h:114
pt_makename — scanner_support.c:213

The raw analyses (PDF “Parser - Lexing and Parsing SQL v1.0”, PDF “Parser - Parsing Tree Structure v1.0”, PPTX “parser_semantic_check_basic v1.0”) were written against an earlier branch of CUBRID. Verified deltas as of the source state captured above:

  • parser_yyinput is now a single function, not three. The PDF describes parser_yyinput, parser_yyinput_single_line, and parser_yyinput_multi_line as three peers in the user-subroutines block. Current csql_lexer.l declares only parser_yyinput as the active reader; the old multi-line helper is referenced from parser_yyinput_single_mode’s one-shot mode but is not the default. Behaviour is the same for typical multi-statement strings.
  • The extern int yyline; extern int dot_flag; declarations in the deck no longer appear at the top of csql_lexer.l. Current source has only extern int yycolumn and extern int yycolumn_end; line tracking moved to Flex’s own %option yylineno. The deck’s snippet pre-dates that switch.
  • parser_make_link returns PT_NODE *, not void. The PDF excerpt at the bottom of “Lexing and Parsing SQL” (parser_make_link returning a PT_NODE *) is correct; this is consistent with the current source.
  • The pt_walk_private body in the deck’s “Parsing Tree Structure” PDF is current, modulo two structural changes. The PDF’s if (node && walk->pre_function) opening guard has been simplified to an unconditional pre-call (the caller already null-checks). The end-of-walk data_type visit is now folded inside the typed-child arm (PT_APPLY_WALK (parser, node->data_type, walk) after (*apply)), not after it. The four continue_walk modes (PT_STOP_WALK, PT_CONTINUE_WALK, PT_LEAF_WALK, PT_LIST_WALK) and their semantics are unchanged.
  • PT_NODE::flag is a 27-bit struct, not a free int. The deck’s “PT_NODE” diagram shows a single “etc flag” field — it glosses over the bitfield. The current source has 27 named single-bit flags including recompile, cannot_prepare, partition_pruned, si_datetime, si_tran_id, clt_cache_check, use_plan_cache, is_hidden_column, with_rollup, do_not_fold, is_value_query, do_not_replace_orderby, is_added_by_parser, is_alias_enabled_expr, is_wrapped_res_for_coll, is_system_generated_stmt, use_auto_commit, done_reduce_equality_terms, print_in_value_for_dblink, do_not_use_subquery_cache, for_default_func — the rest are pre-existing.
  • The next_row link in PT_NODE is for PT_NODE_LIST / VALUES-query rows. The deck does not mention it; current source uses it for PT_VALUE/PT_NAME/PT_EXPR nodes that belong to a PT_NODE_LIST row of a VALUES-generated PT_SELECT. The walker (pt_walk_private) does not follow next_row directly — a PT_NODE_LIST apply visits its inner row list explicitly via the typed-child path.
  • parser_init_func_vectors is now lazy-fired from parser_create_parser (under #if !defined (SERVER_MODE)). The deck describes parser_init_func_vectors as called during process startup; in current source it is called the first time a parser context is created in a process.
  • PT_NODE::sql_user_text is per-node, set inside parser_new_node. The deck doesn’t describe it; this is what lets the plan cache key on the literal SQL string for the node’s owning statement (and what makes per-statement EXPLAIN quote the original text rather than a re-rendered form).
Open questions (with investigation paths):

  1. GLR conflict count is unmeasured. The grammar declares %glr-parser because it has shift/reduce ambiguities; what is the actual ambiguous-state count, and which constructs are responsible? Investigation path: build with -r all -d (already in bison_target’s COMPILE_FLAGS), inspect csql_grammar.output, count entries in “State has conflicts”. The deck does not address this.
  2. csql_grammar.yy pre-processor scope. cmake/CSQLGrammarTransform.cmake rewrites csql_grammar.y into csql_grammar.yy before Bison sees it. What does the rewrite do — replace deprecated YACC directives (%pure_parser → %define api.pure)? Does it rewrite reduce-action syntax? This matters because every grammar edit goes through that filter. Investigation path: read CSQLGrammarTransform.cmake and diff a representative pair of input / output files in a debug build.
  3. expecting_pl_lang_spec is a one-shot flag. The lexer uses it to switch into the PLCSQL_TEXT start condition when the grammar is about to consume a procedure body. How does it get reset on parse error mid-procedure-body — does an error inside PLCSQL_TEXT leak the start-condition state into the next parser_main call? Investigation path: trace the BEGIN(INITIAL) calls in csql_lexer.l and confirm that parser_main’s entry-time expecting_pl_lang_spec = 0 is sufficient to recover (it is not obvious — Flex’s start-condition stack is separate from the variable).
  4. Memory recovery on parser_main re-entry. The comment in csql_grammar.y — “parser_main can be reentered while executing statements loaded by loaddb -s. During the loaddb -s, the yybuffer_pos must not be currupted” [sic] — explains why parser_main saves and restores yybuffer_pos on entry/exit. Are there other thread-local-ish parser globals (g_query_string, g_query_string_pos, parser_select_level, parser_within_join_condition) that re-entry should protect but currently does not? Investigation path: grep for ^static .* parser_ in csql_grammar.y and audit each for re-entry safety.
  5. parser_id overflow. The static parser_id counter in parse_tree.c (static int parser_id = 1;) is the key into parser_Node_blocks[HASH_NUMBER]. After 2³¹ parser creations it overflows. In a long-lived cub_server (or a heavily-stressed broker CAS) is this reachable, and what happens — does the hash chain just collide and old blocks get returned, or is there a guard? Investigation path: read pt_register_parser for the collision check.
  6. PT_NODE::flag versus struct alignment growth. The flag struct currently has 27 single-bit fields; a few more bits push it into a second word on 32-bit hosts. The PT_NODE is allocated 256-at-a-time; growing each node by even one machine word grows every block by NODES_PER_BLOCK × the word size (roughly 1–2 KB per block). Is anyone tracking this? Investigation path: sizeof (PT_NODE) snapshots over recent commits.

Raw analyses (raw/code-analysis/cubrid/query-processing/)

  • code_analysis_Parser-Lexing_and_Parsing_SQL_v1_0.pdf — the lex/parse pipeline, file-level walkthrough of csql_lexer.l and csql_grammar.y, build wiring snippet.
  • code_analysis_Parser-Parsing_Tree_Structure_v_1_0.pdf — PT_NODE structure, three-element (node, info, walk) framing, BST analogy for traversal, pt_walk_private body.
  • parser_semantic_check_basic_v1.0.pptx — query-processing pipeline diagram, PT_NODE quick reference, the running SELECT * FROM code WHERE s_name > 'G' example, full example trees for joins / DOT paths / derived tables / Oracle-outer joins. The document used the parser-related slides; the semantic-check half of the deck is upstream of the next code-analysis doc.
  • _converted/codeanalysisparser-lexingandparsingsqlv10.pdf.md, _converted/codeanalysisparser-parsingtreestructurev10.pdf.md, _converted/parsersemanticcheckbasicv1.0.pptx.md — markitdown extracts of the above.
  • knowledge/code-analysis/cubrid/cubrid-mvcc.md — MVCC visibility runs against the PT_NODE tree only after semantic check; the parser is upstream of the MVCC code path.
  • Database Internals (Petrov), Ch. 1 §“Query Processor” — the architectural picture of where the parser sits; cited for the framing of the parser as the producer of the tree every later pass walks.
  • Aho, Lam, Sethi, Ullman, Compilers: Principles, Techniques, and Tools (Dragon book), 2nd ed. — Ch. 3 (lexing, regular languages → DFA, Lex), Ch. 4.7 (LALR(1)), Ch. 4.1.4 (error recovery), Ch. 5 (syntax-directed translation, semantic actions). The canonical reference for everything Flex/Bison is generating under the hood.
  • Bison manual, GNU Free Software Foundation — %glr-parser semantics, %locations, the csql_yyparse / csql_yylex driver contract, YYLTYPE / csql_yylval interactions.
  • PostgreSQL source — src/backend/parser/scan.l, src/backend/parser/gram.y, src/include/nodes/nodes.h (the Node/NodeTag pattern that CUBRID’s PT_NODE / PT_NODE_TYPE mirrors).
  • MySQL source — sql/sql_yacc.yy, sql/sql_lex.cc (the GLR-flavoured large grammar that CUBRID’s pattern resembles).

CUBRID source (/data/hgryoo/references/cubrid/)

  • src/parser/csql_lexer.l — Flex source.
  • src/parser/csql_grammar.y — Bison source plus parser_main and reduce-action helpers.
  • src/parser/parse_tree.h — PT_NODE, PT_NODE_TYPE, PT_STATEMENT_INFO, parser_context, walker typedefs.
  • src/parser/parse_tree.c — per-parser arena, lifecycle.
  • src/parser/parse_tree_cl.c — walker, init vectors, per-type apply/init/print, public parser_parse_* wrappers, node-creation entry points.
  • src/parser/parser.h — public API surface.
  • src/parser/scanner_support.c — pt_makename, next_char / next_byte implementations.
  • cmake/CSQLGrammarTransform.cmake, CMakeLists.txt (root) — flex_target / bison_target wiring.