CUBRID Parser — Flex/Bison Pipeline, PT_NODE Tree, and the Parser Memory Model

A SQL parser sits between two textbook problems. The first is lexical analysis: chop a flat character stream into tokens that the grammar can reason about. The second is syntactic analysis: decide whether the token stream conforms to a context-free grammar and, if it does, build a tree that captures the derivation. The tooling for both halves was canonised by Aho, Lam, Sethi, and Ullman’s Compilers: Principles, Techniques, and Tools — the Dragon book — and its core abstractions are unchanged in modern DBMS parsers: regular languages → DFA for lexing, and a deterministic shift-reduce automaton (LR(1) / LALR(1) / GLR) for parsing.

Database Internals (Petrov), Ch. 1 §“Query Processor”, frames the same picture from the database side: SQL text enters the query processor, the parser produces an internal tree representation, the optimiser rewrites it, and the executor walks the rewritten tree. The parser owns one job — produce a tree the later passes can rely on — and one constraint — never silently accept input the grammar does not cover.

A lexer is a regular-expression engine specialised for one language. Every token class (keywords, identifiers, integer literals, string literals, operators) is encoded as a regular expression; the union of those regexes is compiled to a single deterministic finite automaton; the DFA reads characters and returns (token_class, lexeme) pairs. Compilers §3 derives the NFA → DFA construction. Two practical points matter for any real lexer: longest-match-wins disambiguation, and a stateful start condition mechanism that lets the same surface syntax mean different things in different contexts (string interior vs. SQL keyword space, for example).

Flex is the free-software re-implementation of AT&T’s Lex. It generates a C function yylex() that returns one token per call and exposes yytext, yyleng, and a per-token yylval slot for the value the parser will reduce on. CUBRID’s lexer is in src/parser/csql_lexer.l; the generated csql_lexer.c is what the build actually compiles.

Bison generates a shift-reduce parser from a context-free grammar with semantic actions. The default flavour is LALR(1) — Look-Ahead LR with one token of lookahead, the same automaton family covered in Compilers §4.7. LALR(1) covers most programming-language grammars cleanly; it does not cover SQL cleanly because SQL has historic ambiguities (the classic example is the parse of SELECT a FROM b GROUP BY c, d where c, d could be a list or a tuple, depending on dialect).

Bison’s escape hatch is GLR (Generalized LR): when the LALR(1) automaton hits a conflict it forks, parses both branches in parallel, and discards the branches that fail to reduce. The performance cost is real but bounded — most real SQL queries encounter conflicts at only a handful of points. CUBRID compiles its grammar with %glr-parser (csql_grammar.y declaration %glr-parser); this is what lets the grammar accept the permissive superset of MySQL- and Oracle-style joins, hint syntax, and method-call syntax that the surface SQL allows.

Bison’s semantic-action mechanism is what builds the tree. Each grammar rule has an embedded C action block; when the parser reduces by that rule, the action runs. The action sees $1, $2, ... for the rule’s right-hand-side semantic values and writes the rule’s own value into $$. The %union declaration tells Bison the type of those values; in CUBRID the union holds PT_NODE *node, int number, char *cptr, and a few helper containers. The grammar is the only place where parse-tree nodes are constructed for the user’s input — every other module reads what the grammar built.

The parse tree is the contract the parser hands to the rest of the system. Two design choices shape every concrete realisation:

  1. Polymorphism mechanism. Every parse tree has nodes of many shapes (a SELECT is not a WHERE clause, which is not a literal). Three approaches dominate:

    • Class hierarchy (modern C++ / Java): one base class, virtual methods. Conventional but requires RTTI and virtual dispatch.
    • Tagged union (the C dialect): one struct with a discriminating tag and a union of per-type substructures. Cheaper at runtime, less safe at compile time.
    • Sum-type ADT (Haskell / OCaml / Rust): the language’s own union type with exhaustive matching.

    PostgreSQL, MySQL, and CUBRID all chose the tagged-union approach for the same C-era reasons.

  2. Error recovery. Compilers §4.1.4 enumerates the four classical strategies: panic-mode (skip until a synchronising token), phrase-level (replace tokens to make the rule reduce), error productions (grammar rules that match ill-formed input), and global correction (find the minimum-edit valid program). Bison’s built-in error token is panic-mode; production engines layer custom location tracking and message synthesis on top.
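The tagged-union choice in point 1 can be reduced to a few lines — an illustrative sketch, not the real PT_NODE layout — where one discriminating tag selects the live arm of the union:

```c
/* Minimal tagged-union AST in the PostgreSQL/CUBRID style (names invented
 * for illustration): the tag field says which union member is valid. */
typedef enum { NODE_INT_LIT, NODE_BINOP } node_tag;

typedef struct node
{
    node_tag tag;               /* the discriminator */
    union
    {
        struct { int value; } int_lit;
        struct { char op; struct node *lhs, *rhs; } binop;
    } u;                        /* layout depends on tag */
} node;

/* A visitor switches on the tag — the C-era substitute for virtual dispatch. */
static int eval (const node *n)
{
    switch (n->tag)
    {
    case NODE_INT_LIT:
        return n->u.int_lit.value;
    case NODE_BINOP:
        return n->u.binop.op == '+'
            ? eval (n->u.binop.lhs) + eval (n->u.binop.rhs)
            : eval (n->u.binop.lhs) * eval (n->u.binop.rhs);
    }
    return 0;
}
```

The cost mentioned above is visible here: nothing stops a caller from reading `u.binop` on a `NODE_INT_LIT` node; the compiler cannot check the tag for you.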

CUBRID’s combination is conventional: Flex with start conditions, Bison %glr-parser, a tagged-union PT_NODE, and panic-mode recovery enriched with YYLTYPE location info that pinpoints the offending byte in the original buffer. The distinguishing engineering choices are in the memory model and in the per-type child layout, which we trace in §“CUBRID’s Approach”.

Every SQL engine that compiles SQL on the client side faces the same set of tasks; the engineering vocabulary that has settled across PostgreSQL, MySQL, and CUBRID is the lens this section adopts. The CUBRID-specific code in §“CUBRID’s Approach” is best read as one set of dials within this shared design space.

Flex + Bison as the default front-end stack

Three of the most-cited C/C++ RDBMSes — PostgreSQL, MySQL, and CUBRID — generate their lexer and parser from Flex/Bison sources. Postgres’ files are src/backend/parser/scan.l and src/backend/parser/gram.y; MySQL’s are sql/sql_lex.cc and sql/sql_yacc.yy; CUBRID’s are src/parser/csql_lexer.l and src/parser/csql_grammar.y. The generation step is part of the build (CMake on CUBRID, Autoconf on Postgres); whether the generated .c files are checked in varies by project — CUBRID checks in only the inputs.

The shared shape is:

  • One .l file with three sections — Definitions, Rules, User subroutines — and the rules section returning token codes that the grammar’s %token declarations match.
  • One .y file with %union, %type, %token, the start symbol, the grammar rules each carrying a C action block, and an epilogue with helper functions used from inside actions.
  • Generated artifacts wired into the build through CMake’s flex_target / bison_target.

ParseNode / PT_NODE / Item — the tagged-union AST

PostgreSQL’s Node (src/include/nodes/nodes.h) tags every parse-tree, plan-tree, and executor-tree element with a NodeTag. The tag selects which concrete struct the pointer actually refers to; visitor-style walks switch on it.

MySQL splits the surface differently — Item for expressions, SELECT_LEX and friends for query blocks — but each variant carries a tag and shares a base type; the discriminator is the class identity in MySQL’s case (it does use C++ inheritance).

CUBRID’s PT_NODE is the same idea as Postgres’s Node minus inheritance: a single struct with one PT_NODE_TYPE node_type field and a giant union pt_statement_info info field whose layout depends on node_type. Every per-type substructure is named PT_<TYPE>_INFO and accessed as node->info.<lowercase>.

Parser-time allocations are short-lived and many. Every engine uses some form of arena / region / zone allocator: a few big blocks of memory acquired up-front, individual nodes carved out by bumping a pointer, and the whole arena released in one free at the end. PostgreSQL has MemoryContext (per-context palloc/pfree); MySQL has MEM_ROOT; CUBRID has the PARSER_CONTEXT plus a per-parser pool of PARSER_NODE_BLOCKs and PARSER_STRING_BLOCKs in parse_tree.c.

The win is the same in all three: nodes built during parsing that turn out to be unreachable (because a rule failed to reduce or because semantic check rewrote them) do not need individual free calls — they vanish with the parser context. The cost is that node pointers cannot safely outlive the parser; if the optimizer or executor needs a node after the parser closes, it has to copy it into its own arena (this is the role PostgreSQL’s copyObject plays, and CUBRID’s parser_copy_tree).
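The arena idea fits in a few lines — a minimal bump-allocator sketch, not CUBRID's PARSER_NODE_BLOCK code, and assuming each request fits in one block:

```c
#include <stdlib.h>
#include <stddef.h>

/* Toy arena: big blocks acquired as needed, allocations carved out by
 * bumping an offset, everything released with one call at the end. */
#define BLOCK_SIZE 4096

typedef struct block
{
    struct block *next;
    size_t used;
    char mem[BLOCK_SIZE];
} block;

typedef struct { block *head; } arena;

static void *arena_alloc (arena *a, size_t n)
{
    n = (n + 7) & ~(size_t) 7;  /* keep returned pointers 8-byte aligned */
    /* assumes n <= BLOCK_SIZE; a real allocator handles oversize requests */
    if (a->head == NULL || a->head->used + n > BLOCK_SIZE)
    {
        block *b = malloc (sizeof *b);  /* grab a fresh block when full */
        b->next = a->head;
        b->used = 0;
        a->head = b;
    }
    void *p = a->head->mem + a->head->used;
    a->head->used += n;         /* the "bump" */
    return p;
}

static void arena_free_all (arena *a)   /* the one free at end of parse */
{
    for (block *b = a->head, *nx; b; b = nx)
    {
        nx = b->next;
        free (b);
    }
    a->head = NULL;
}
```

Note there is no per-allocation free at all — which is exactly why dangling parse-tree scraps cost nothing, and why pointers into the arena die with it.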

Per-type child layout — the apply function table

A parse tree walker has to know which children each node type has. Three approaches:

  • Hard-coded per-type switches in every walker. Fast, brittle.
  • Reflection (Java, C#) — read the field types at runtime. Slow, expensive at build time.
  • Function-pointer table indexed by the discriminator — one function per type, manually written, that visits the correct set of children for that type.

Postgres uses the third approach with macros that expand to visitor code; CUBRID uses it explicitly with the pt_apply_f array of per-PT_NODE_TYPE functions, plus parallel pt_init_f and pt_print_f arrays for initialisation and pretty-printing. The trade-off is a small build-time tax (every new node type must register three functions) for a fast and debuggable walker.

Theoretical / common concept → CUBRID name:

  • Lexer source → src/parser/csql_lexer.l (flex)
  • Generated lexer → csql_lexer.c (built by flex_target in CMake)
  • Grammar source → src/parser/csql_grammar.y (bison %glr-parser)
  • Generated parser → csql_grammar.c (built by bison_target)
  • Token type prefix → csql_yy* (set via --name-prefix=csql_yy / --prefix=csql_yy)
  • Lexer entry → csql_yylex (invoked by Bison-generated csql_yyparse)
  • Parser entry → csql_yyparse, wrapped by parser_main (csql_grammar.y)
  • Parse-tree node → PT_NODE (parse_tree.h)
  • Node discriminator → enum pt_node_type PT_NODE_TYPE (parse_tree.h)
  • Per-type substructure union → union pt_statement_info PT_STATEMENT_INFO (parse_tree.h)
  • Polymorphic child-visit table → pt_apply_f[PT_NODE_NUMBER] (parse_tree_cl.c)
  • Polymorphic init table → pt_init_f[PT_NODE_NUMBER]
  • Polymorphic print table → pt_print_f[PT_NODE_NUMBER]
  • Tree walker → parser_walk_tree, parser_walk_leaves (parse_tree_cl.c)
  • Per-parser arena → PARSER_CONTEXT + PARSER_NODE_BLOCK + PARSER_STRING_BLOCK (parse_tree.c)
  • Node allocator → parser_new_node → parser_create_node → block bump (parse_tree_cl.c, parse_tree.c)
  • String allocator → parser_alloc (parse_tree.c)
  • Lexer-token name interning → pt_makename (scanner_support.c)
  • Bison location type → YYLTYPE { first_line, first_column, last_line, last_column, buffer_pos } (csql_grammar.y)
  • Per-node line/column → PT_NODE::line_number, column_number, buffer_pos
  • Error token → Bison built-in error; messages via csql_yyerror and PT_ERRORmf
  • Parser context lifecycle → parser_create_parser / parser_free_parser (parse_tree.c)
  • Public string-parse entry → parser_parse_string, parser_parse_string_with_escapes (parse_tree_cl.c)

We walk three levels in order: (1) the lex → parse → tree pipeline at runtime, (2) the PT_NODE data model that the pipeline produces, (3) the memory model that lets the producer and the consumer share node pointers safely.

flowchart LR
  SQL["SQL text"]
  PC["PARSER_CONTEXT"]
  PARSE["parser_parse_string"]
  MAIN["parser_main"]
  LEX["csql_yylex<br/>(generated from csql_lexer.l)"]
  PARSER["csql_yyparse<br/>(generated from csql_grammar.y)"]
  ACT["reduce-action<br/>parser_new_node / parser_make_∗"]
  POOL["PARSER_NODE_BLOCK pool<br/>(per-parser arena)"]
  STK["parser->node_stack<br/>(top-level stmt list)"]
  TREE["PT_NODE root tree(s)"]
  STMTS["parser->statements[]"]

  SQL --> PARSE
  PARSE --> PC
  PARSE --> MAIN
  MAIN --> LEX
  MAIN --> PARSER
  LEX -- "(token, csql_yylval)" --> PARSER
  PARSER -- "on each reduce" --> ACT
  ACT -- "carve from arena" --> POOL
  ACT -- "wire children" --> TREE
  PARSER -- "top-level reduce" --> STK
  MAIN -- "snapshot stack" --> STMTS
  STMTS --> TREE

The figure makes three boundaries explicit. Grammar / allocator: the grammar rules are the only place that calls parser_new_node; reduce actions are where nodes are born. Parse / drive: parser_main is a thin wrapper around yyparse that owns the per-call setup (line/column reset, buffer-position save/restore, hint table init, top-level statement-array snapshot). Per-call arena: every node and string allocated during the call lives in PARSER_CONTEXT-owned blocks and is released by parser_free_parser; nothing is individually heap-allocated.

Flex source has the standard three-section layout:

// csql_lexer.l (skeleton)
%{
#include "csql_grammar.h"
#include "csql_grammar_scan.h"
#include "parse_tree.h"
#include "parser_message.h"
#include "system_parameter.h"
#include "message_catalog.h"

#define CSQL_MAXNAME 256

static int parser_yyinput (char *buff, int max_size);

#undef YY_INPUT
#define YY_INPUT(buffer, result, max_size) { \
    result = parser_yyinput (buffer, max_size); \
    result == 0 ? result = YY_NULL : result; \
  }

#define YY_USER_ACTION { \
    yybuffer_pos += csql_yyget_leng (); \
    csql_yylloc.buffer_pos = yybuffer_pos; \
  }
%}
%option yylineno
%x QUOTED_CHAR_STRING DOUBLY_QUOTED_CHAR_STRING DELIMITED_ID_NAME
%x BRACKET_ID_NAME BACKTICK_ID_NAME PLCSQL_TEXT
/* ... twelve more start conditions ... */
%%
[sS][eE][lL][eE][cC][tT]                { begin_token (yytext); return SELECT; }
[sS][eE][nN][sS][iI][tT][iI][vV][eE]    { begin_token (yytext); return SENSITIVE; }
[sS][eE][pP][aA][rR][aA][tT][oO][rR]    { begin_token (yytext);
                                          csql_yylval.cptr = pt_makename (yytext);
                                          return SEPARATOR; }
/* ... ~2000 more rules ... */
%%
/* user subroutines: parser_yyinput, csql_yyerror, yywrap, ... */

Three deliberate choices live in the snippet:

  1. YY_INPUT is overridden. The default Flex YY_INPUT reads from stdin. CUBRID parses an in-memory buffer (parser->buffer) so it routes Flex’s input through parser_yyinput, which copies bytes out of the parser context’s owned buffer and tracks yybuffer_pos.
  2. YY_USER_ACTION accumulates a byte-position counter. yybuffer_pos is the cumulative offset into the original SQL string of the token Flex just matched; it is the anchor every error message and every PT_NODE::buffer_pos later refers back to. Lines and columns alone are not enough — error messages quote the offending substring, which needs the byte offset.
  3. Keyword rules are case-insensitive by enumeration of character classes. [sS][eE][lL][eE][cC][tT] is the manual way of getting case-insensitive matching without asking Flex to apply %option case-insensitive (which would also affect identifier rules and string literals). The generated DFA pays a few extra states per keyword; the alternative would have been to lowercase the input first, which interferes with delimited identifiers like "FooBar".

The user-subroutines block holds the input drivers (parser_yyinput, parser_yyinput_single_line, parser_yyinput_multi_line), the error reporter (csql_yyerror / csql_yyerror_explicit), and the end-of-input handler yywrap (which always returns 1 — the parser sees one buffer per parser_parse_string call, never a chained input).

The fourteen start conditions (excerpt: QUOTED_CHAR_STRING, DOUBLY_QUOTED_CHAR_STRING, DELIMITED_ID_NAME, BRACKET_ID_NAME, BACKTICK_ID_NAME, PL_LANG_SPEC, PLCSQL_TEXT, …) are how CUBRID handles the fact that the same surface character (e.g. ") means different things in the middle of a string literal versus at the start of a delimited identifier. Each BEGIN(state) action in the grammar/lexer flips the lexer into a state where only the rules relevant to that context are eligible to match. PL/CSQL text matches a multi-line balanced BEGIN ... END block; expecting_pl_lang_spec is a one-shot flag that the grammar sets right before it expects to see a procedure body.

The grammar uses Bison %glr-parser to handle SQL’s ambiguities; the %union carries the value types reduce actions can return; the rules build PT_NODEs.

// csql_grammar.y — declarations (excerpt)
%{/*%CODE_REQUIRES_START%*/
#include "json_table_def.h"
#include "parser.h"

typedef struct YYLTYPE
{
  int first_line;
  int first_column;
  int last_line;
  int last_column;
  int buffer_pos;   /* cumulative byte offset; updated by YY_USER_ACTION */
} YYLTYPE;
#define YYLTYPE_IS_DECLARED 1
/*%CODE_END%*/%}
%locations
%glr-parser
%define parse.error verbose
%union
{
  int number;
  bool boolean;
  PT_NODE *node;
  char *cptr;
  container_2 c2;
  container_3 c3;
  container_4 c4;
  container_10 c10;
  container_11 c11;
  struct json_table_column_behavior jtcb;
}
%type <node> stmt
%type <node> select_stmt
%type <node> select_expression
%type <node> select_expression_opt_with
/* ... ~1100 more %type lines ... */

The YYLTYPE extension is the parser’s way of preserving the lexer’s buffer_pos through reductions. The default Bison YYLTYPE carries only (first_line, first_column, last_line, last_column); CUBRID adds buffer_pos so error messages can render the offending byte range against the original SQL text.
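What the extra field buys is easiest to see in a sketch of an error reporter — illustrative names, not CUBRID's csql_yyerror — that turns a buffer_pos plus token length back into the offending bytes:

```c
#include <stdio.h>

/* Why a byte offset beats (line, column) alone: with the cumulative
 * buffer_pos of the token just matched, the reporter can quote the
 * offending substring straight out of the original SQL buffer. */
static void quote_error (const char *sql, int buffer_pos, int tok_len,
                         char *out, size_t outsz)
{
  /* buffer_pos points one past the end of the token (the YY_USER_ACTION
   * above adds the token length after matching) */
  int start = buffer_pos - tok_len;
  snprintf (out, outsz, "syntax error near '%.*s'", tok_len, sql + start);
}
```

For `SELECT FORM t`, a token `FORM` of length 4 with buffer_pos 11 yields `syntax error near 'FORM'` with no re-lexing of the input.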

A representative reduce action — the one that pairs an opt_with_clause with a select_expression — shows the typical shape:

// csql_grammar.y — select_expression_opt_with
select_expression_opt_with
  : opt_with_clause
    select_expression
    {{
      PT_NODE *with_clause = $1;
      PT_NODE *stmt = $2;

      if (stmt && with_clause)
        {
          stmt->info.query.with = with_clause;
        }
      $$ = stmt;
    }}
  ;

Three things to notice: (a) the action only wires up an existing PT_NODE returned by the inner rule; the PT_SELECT itself was allocated by an action deeper down, in select_stmt, with parser_new_node (this_parser, PT_SELECT); (b) the action does not return — it writes to $$, a local variable in the generated parser; (c) this_parser is a file-static PARSER_CONTEXT * that parser_main sets up before calling yyparse and tears down afterwards.

The actual PT_SELECT allocation happens here:

// csql_grammar.y — select_stmt (excerpt)
select_stmt
  : SELECT
    {{
      PT_NODE *node;

      parser_save_found_Oracle_outer ();
      if (parser_select_level >= 0)
        parser_select_level++;
      parser_hidden_incr_list = NULL;

      node = parser_new_node (this_parser, PT_SELECT);
      if (node)
        {
          node->info.query.q.select.flavor = PT_USER_SELECT;
          node->info.query.q.select.hint = PT_HINT_NONE;
        }
      parser_push_select_stmt_node (node);
      parser_push_hint_node (node);
    }}
    opt_hint_list           /* $3 */
    all_distinct            /* $4 */
    select_list             /* $5 */
    opt_select_param_list   /* $6 */
    /* ... wire children, pop the stmt node, set $$, ... */
  ;

Two patterns recur here and across the grammar: a stmt-stack push in the action that creates the node, and a stmt-stack pop in a later action that finalises it — this is how mid-rule actions like opt_hint_list know which PT_SELECT they belong to without threading explicit arguments through every rule.

parser_main lives at the bottom of csql_grammar.y (it has to, because yyparse is a static function in the generated csql_grammar.c and calling it from outside the .y file is awkward):

// parser_main — csql_grammar.y (condensed)
PT_NODE **
parser_main (PARSER_CONTEXT * parser)
{
  long desc_index = 0;
  long i, top;
  int rv, yybuffer_pos_save;
  PARSER_CONTEXT *this_parser_saved;

  if (!parser)
    return 0;

  parser_output_host_index = parser_input_host_index = desc_index = 0;
  this_parser_saved = this_parser;
  this_parser = parser;         /* publish to grammar-action TLS */
  dbcs_start_input ();
  yycolumn = yycolumn_end = 1;
  yybuffer_pos_save = yybuffer_pos;
  yybuffer_pos = 0;
  is_dblink_query_string = 0;
  expecting_pl_lang_spec = 0;
  csql_yylloc.buffer_pos = 0;
  g_query_string = NULL;
  g_query_string_pos = 0;
  g_query_string_len = 0;
  g_original_buffer_len = 0;

  pt_initialize_hint (parser, parser_hint_table);
  rv = yyparse ();              /* drives csql_yylex through tokens */
  /* parser_main can be re-entered (loaddb -s); restore yybuffer_pos. */
  yybuffer_pos = yybuffer_pos_save;
  pt_cleanup_hint (parser, parser_hint_table);

  if (pt_has_error (parser) || parser->stack_top <= 0 || !parser->node_stack)
    {
      parser->statements = NULL;
    }
  else
    {
      parser->statements = (PT_NODE **)
        parser_alloc (parser, (1 + parser->stack_top) * sizeof (PT_NODE *));
      if (parser->statements)
        {
          for (i = 0, top = parser->stack_top; i < top; i++)
            parser->statements[i] = parser->node_stack[i];
          parser->statements[top] = NULL;
        }
      parser->host_var_count = parser_input_host_index;
      /* ... allocate host_variables, host_var_expected_domains ... */
    }
  /* ... restore this_parser ... */
  return parser->statements;
}

The save/restore of this_parser and yybuffer_pos matters because of the loaddb -s re-entry path: a parse-and-execute loop inside loaddb can call parser_main recursively while the outer parser_main is still parsing the input file’s statements.

The parser->statements array is a NULL-terminated list of top-level statements. CUBRID’s parser accepts a multi-statement buffer (e.g. SELECT 1; SELECT 2;), pushes each one onto node_stack as the top-level rule reduces, and parser_main copies the stack into the result array at the end.
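The consumer side of that contract is a plain loop over the NULL terminator — a sketch with a hypothetical stand-in for PT_NODE:

```c
#include <stddef.h>

/* Iterating a parser_main-style result: a NULL-terminated array of
 * top-level statement nodes (stmt stands in for PT_NODE; illustrative). */
typedef struct stmt { int id; } stmt;

static int count_statements (stmt **statements)
{
    int n = 0;
    if (statements == NULL)
        return 0;                   /* parse failed: no result array */
    while (statements[n] != NULL)   /* sentinel marks the end */
        n++;
    return n;
}
```

The NULL sentinel means callers need no separate count field, at the cost of one extra pointer slot per parse — which is why parser_main allocates `1 + parser->stack_top` entries.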

PT_NODE — the polymorphic-tagged-union node

The node type is one struct with one tag and one union:

// parser_node — src/parser/parse_tree.h
struct parser_node
{
  PT_NODE_TYPE node_type;       /* the type of SQL statement this represents */
  int parser_id;                /* which parser did I come from */
  int line_number;              /* the user line number originating this */
  int column_number;            /* the user column number originating this */
  int buffer_pos;               /* position in the parse buffer of the string */
  char *sql_user_text;          /* user input sql string */
  int sql_user_text_len;
  PT_NODE *next;                /* forward link for NULL-terminated list */
  PT_NODE *or_next;             /* forward link for DNF list */
  PT_NODE *next_row;            /* PT_VALUE/NAME/EXPR row in PT_NODE_LIST */
  void *etc;                    /* application-specific info hook */
  UINTPTR spec_ident;           /* entity-spec equivalence class (for FROM resolve) */
  TP_DOMAIN *expected_domain;   /* expected domain for input marker */
  PT_NODE *data_type;           /* for non-primitive types, sets, objects */
  XASL_ID *xasl_id;             /* XASL_ID for this SQL statement */
  const char *alias_print;
  PARSER_VARCHAR *expr_before_const_folding;
  PT_TYPE_ENUM type_enum;       /* PT_TYPE_INTEGER / VARCHAR / ... */
  CACHE_TIME cache_time;
  int sub_host_var_count;
  int *sub_host_var_index;
  struct
  {                             /* 27 single-bit flags */
    unsigned recompile:1;
    unsigned cannot_prepare:1;
    /* ... use_plan_cache, is_hidden_column, with_rollup,
     * do_not_fold, is_value_query, do_not_replace_orderby,
     * use_auto_commit, ... 27 in total */
  } flag;
  PT_STATEMENT_INFO info;       /* depends on 'node_type' field */
};

Three observations:

  • next / or_next / next_row are three independent list spines. next is the natural list (select-list elements, AND-conjoined predicates, FROM-list specs). or_next is the DNF spine for predicate normalisation (a = 1 OR a = 2 OR a = 3 becomes a next of one and an or_next of three). next_row is for VALUES queries that reduce to PT_NODE_LIST. The walker (pt_walk_private) descends each spine independently — this is what lets transformation passes splice into one spine without disturbing the others.
  • data_type is a child, not an in-place type tag. type_enum is the cheap discriminator (PT_TYPE_INTEGER, PT_TYPE_VARCHAR); data_type is a separate PT_NODE of type PT_DATA_TYPE that carries domain-specific extras (precision, scale, collation). The walker’s PT_APPLY_WALK (parser, node->data_type, walk) is what makes type-checking transforms uniform across all parents.
  • Line / column / buffer_pos live on every node. They are set by pt_parser_line_col inside parser_new_node. Error messages can therefore quote the source range of any node in any later pass without re-parsing — semantic check, the optimizer, even XASL generation can produce source-anchored diagnostics.
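The independent-spine point can be made concrete with a toy struct (illustrative, not the real PT_NODE): for (a = 1 OR a = 2) AND b = 3 the next spine has two conjuncts, and the first conjunct carries an or_next spine of two alternatives.

```c
#include <stddef.h>

/* Two independent list spines in the PT_NODE style: `next` chains
 * AND-conjuncts, `or_next` chains the OR-alternatives of one conjunct. */
typedef struct pn
{
    int val;
    struct pn *next;      /* AND spine */
    struct pn *or_next;   /* OR spine of this conjunct */
} pn;

static int count_conjuncts (const pn *p)      /* walk the AND spine only */
{
    int n = 0;
    for (; p; p = p->next)
        n++;
    return n;
}

static int count_alternatives (const pn *p)   /* walk one OR spine only */
{
    int n = 0;
    for (; p; p = p->or_next)
        n++;
    return n;
}
```

Because each walk touches exactly one spine, a normalisation pass can splice nodes into the or_next chain without renumbering or revisiting the next chain — the property the text attributes to pt_walk_private.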

The PT_NODE_TYPE enum carries dual semantics: the values for SQL statements (PT_SELECT, PT_INSERT, …) are intentionally the same as CUBRID_STMT_* codes from the API, so a PT_NODE discriminator can be returned to the API client without translation:

// PT_NODE_TYPE — src/parser/parse_tree.h (excerpt)
enum pt_node_type
{
  PT_NODE_NONE = CUBRID_STMT_NONE,
  PT_ALTER = CUBRID_STMT_ALTER_CLASS,
  PT_CREATE_ENTITY = CUBRID_STMT_CREATE_CLASS,
  PT_INSERT = CUBRID_STMT_INSERT,
  PT_SELECT = CUBRID_STMT_SELECT,
  PT_UPDATE = CUBRID_STMT_UPDATE,
  PT_DELETE = CUBRID_STMT_DELETE,
  /* ... ~50 more statement-typed nodes ... */
  PT_DIFFERENCE = CUBRID_MAX_STMT_TYPE, /* internal-only types from here down */
  PT_INTERSECTION,
  PT_UNION,
  /* expression / clause / fragment nodes */
  PT_EXPR, PT_NAME, PT_VALUE, PT_FUNCTION, PT_DOT_,
  PT_SPEC, PT_DATA_TYPE, PT_SORT_SPEC, PT_HOST_VAR,
  /* ... ~40 more ... */
  PT_NODE_NUMBER,                       /* the number of node types */
  PT_LAST_NODE_NUMBER = PT_NODE_NUMBER
};

The split is meaningful: node_type < CUBRID_MAX_STMT_TYPE identifies a top-level statement that the API and the executor both need; node_type >= CUBRID_MAX_STMT_TYPE is internal parse-tree machinery.

The per-type info union is in parse_tree.h:3470:

// pt_statement_info — src/parser/parse_tree.h (excerpt)
union pt_statement_info
{
  PT_ZZ_ERROR_MSG_INFO error_msg;
  PT_ALTER_INFO alter;
  PT_ALTER_TRIGGER_INFO alter_trigger;
  PT_CREATE_ENTITY_INFO create_entity;
  PT_DELETE_INFO delete_;
  PT_DOT_INFO dot;
  PT_EXPR_INFO expr;
  PT_FUNCTION_INFO function;
  PT_INSERT_INFO insert;
  PT_NAME_INFO name;
  PT_QUERY_INFO query;          /* SELECT, UNION, etc. */
  PT_SPEC_INFO spec;            /* FROM-clause table spec */
  PT_VALUE_INFO value;
  PT_DATA_TYPE_INFO data_type;
  /* ... 80+ more variants ... */
};

PT_QUERY_INFO::q.select is the one most users picture when they think of a parse tree: it has list (the select list), from, where, group_by, having, connect_by, start_with, using_index, hint slots, order_by (lifted to PT_QUERY_INFO::order_by because UNION shares it), limit, for_update, and a dozen others. Each of these is a PT_NODE * that points to whatever the grammar reduced into it.

Per-type child layout — pt_apply_f, pt_init_f, pt_print_f

Three function-pointer tables, all sized PT_NODE_NUMBER, are the polymorphism mechanism:

// parse_tree_cl.c — apply/init/print function tables (excerpt)
typedef PT_NODE *(*PARSER_INIT_NODE_FUNC) (PT_NODE *);
typedef PARSER_VARCHAR *(*PARSER_PRINT_NODE_FUNC) (PARSER_CONTEXT * parser, PT_NODE * p);
typedef PT_NODE *(*PARSER_APPLY_NODE_FUNC) (PARSER_CONTEXT * parser, PT_NODE * p, void *arg);
static PARSER_INIT_NODE_FUNC *pt_init_f = NULL;
static PARSER_PRINT_NODE_FUNC *pt_print_f = NULL;
static PARSER_APPLY_NODE_FUNC *pt_apply_f = NULL;
static PARSER_APPLY_NODE_FUNC pt_apply_func_array[PT_NODE_NUMBER];
static PARSER_INIT_NODE_FUNC pt_init_func_array[PT_NODE_NUMBER];
static PARSER_PRINT_NODE_FUNC pt_print_func_array[PT_NODE_NUMBER];

parser_init_func_vectors (called once from parser_create_parser when pt_apply_f == NULL) populates the arrays. The per-PT_SELECT apply function is the canonical example:

// pt_apply_select — parse_tree_cl.c
static PT_NODE *
pt_apply_select (PARSER_CONTEXT * parser, PT_NODE * p, void *arg)
{
  PT_APPLY_WALK (parser, p->info.query.with, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.list, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.from, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.where, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.connect_by, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.start_with, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.after_cb_filter, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.group_by, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.having, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.using_index, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.with_increment, arg);
  /* ... more hint slots ... */
  PT_APPLY_WALK (parser, p->info.query.into_list, arg);
  PT_APPLY_WALK (parser, p->info.query.order_by, arg);
  PT_APPLY_WALK (parser, p->info.query.orderby_for, arg);
  PT_APPLY_WALK (parser, p->info.query.qcache_hint, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.check_where, arg);
  PT_APPLY_WALK (parser, p->info.query.limit, arg);
  PT_APPLY_WALK (parser, p->info.query.q.select.for_update, arg);
  return p;
}

The order matters: every walker (semantic check, type inference, name resolution, plan generation, pretty-printing) visits these children in this order. A new child added to PT_QUERY_INFO::q.select that is not added here will be silently invisible to every existing walker — which is one of the few places in the parser where the build will not warn about an oversight.

pt_init_select is the parallel for default initialisation:

// pt_init_select — parse_tree_cl.c
static PT_NODE *
pt_init_select (PT_NODE * p)
{
  p->info.query.q.select.hint = PT_HINT_NONE;
  p->info.query.q.select.check_cycles = CONNECT_BY_CYCLES_ERROR;
  p->info.query.all_distinct = PT_ALL;
  p->info.query.is_subquery = (PT_MISC_TYPE) 0;
  p->info.query.hint = PT_HINT_NONE;
  p->info.query.scan_op_type = S_SELECT;
  return p;
}

pt_init_f[node->node_type] is called from parser_init_node on every newly-allocated node, so the default values are in-place before any reduce action gets to the new node.

The walker — parser_walk_tree / pt_walk_private

parser_walk_tree is the universal pre/post-order traversal:

// parser_walk_tree — parse_tree_cl.c
PT_NODE *
parser_walk_tree (PARSER_CONTEXT * parser, PT_NODE * node,
                  PT_NODE_WALK_FUNCTION pre_function, void *pre_argument,
                  PT_NODE_WALK_FUNCTION post_function, void *post_argument)
{
  PT_WALK_ARG walk_argument;

  if (node == NULL)
    return NULL;

  walk_argument.continue_walk = PT_CONTINUE_WALK;
  walk_argument.pre_function = pre_function;
  walk_argument.pre_argument = pre_argument;
  walk_argument.post_function = post_function;
  walk_argument.post_argument = post_argument;

  PT_APPLY_WALK (parser, node, &walk_argument);
  return node;
}

The real engine is pt_walk_private. For each node it: (a) calls pre_function, (b) dispatches through pt_apply_f[node_type] to visit the typed children, then visits data_type, then or_next, then next, in that order, (c) calls post_function. The four continue_walk modes (PT_STOP_WALK, PT_CONTINUE_WALK, PT_LEAF_WALK, PT_LIST_WALK) let the pre-function prune the search:

// pt_walk_private — parse_tree_cl.c (condensed)
static PT_NODE *
pt_walk_private (PARSER_CONTEXT * parser, PT_NODE * node, void *void_arg)
{
  PT_WALK_ARG *walk = (PT_WALK_ARG *) void_arg;
  PT_NODE_TYPE node_type;
  PARSER_APPLY_NODE_FUNC apply;
  int save_continue;

  if (walk->pre_function)
    {
      node = (*walk->pre_function) (parser, node, walk->pre_argument,
                                    &(walk->continue_walk));
      if (!node)
        return NULL;
    }

  if (walk->continue_walk != PT_STOP_WALK)
    {
      save_continue = walk->continue_walk;

      if (save_continue == PT_CONTINUE_WALK || save_continue == PT_LEAF_WALK)
        {
          node_type = node->node_type;
          if (node_type >= PT_LAST_NODE_NUMBER || !(apply = pt_apply_f[node_type]))
            return NULL;

          (*apply) (parser, node, walk);                  /* typed children */
          PT_APPLY_WALK (parser, node->data_type, walk);  /* data_type child */
        }

      if (save_continue == PT_CONTINUE_WALK || save_continue == PT_LEAF_WALK
          || save_continue == PT_LIST_WALK)
        PT_APPLY_WALK (parser, node->or_next, walk);

      if (save_continue == PT_CONTINUE_WALK || save_continue == PT_LIST_WALK)
        PT_APPLY_WALK (parser, node->next, walk);

      if (walk->continue_walk != PT_STOP_WALK)
        walk->continue_walk = save_continue;
    }

  if (walk->post_function)
    node = (*walk->post_function) (parser, node, walk->post_argument,
                                   &(walk->continue_walk));
  return node;
}

The traversal shape — pre-visit, typed-child via apply, data_type, or_next, next, post-visit — is fixed; only the user-supplied pre_function and post_function vary. This is how every later pass (pt_resolve_names, pt_check_with_info, pt_semantic_type, pt_to_buildlist_proc, pt_print_tree) plugs into the same walker without each having to re-implement the per-type child visit. The companion parser_walk_leaves is the variant that starts at the leaves of the input node — it skips the pre-visit on node itself and goes straight to the typed children.
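The fixed-shape walk plus prune flag reduces to a few lines — a toy version of the parser_walk_tree contract with an invented node type, where the real per-type dispatch through pt_apply_f is stood in for by a single child pointer:

```c
#include <stddef.h>

/* Pre/post-order walk with a caller-controlled stop flag (illustrative). */
typedef struct wn
{
    int id;
    struct wn *child;   /* stand-in for the typed children */
    struct wn *next;    /* the sibling/list spine */
} wn;

typedef enum { W_CONTINUE, W_STOP } w_mode;
typedef void (*w_fn) (wn *, void *arg, w_mode *mode);

static void walk (wn *n, w_fn pre, w_fn post, void *arg, w_mode *mode)
{
    if (n == NULL || *mode == W_STOP)
        return;
    if (pre)
        pre (n, arg, mode);             /* pre-visit; may set W_STOP */
    if (*mode != W_STOP)
    {
        walk (n->child, pre, post, arg, mode);  /* typed children */
        walk (n->next, pre, post, arg, mode);   /* the next spine */
    }
    if (post)
        post (n, arg, mode);            /* post-visit */
}

/* example pre-function: count every visited node */
static void count_pre (wn *n, void *arg, w_mode *mode)
{
    (void) n;
    (void) mode;
    (*(int *) arg)++;
}
```

Only `pre` and `post` vary between passes; the traversal order is owned by `walk`, which is the same division of labour the text describes between walker callbacks and pt_apply_f.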

The third key piece is the allocator. Every parse-time allocation goes through one of two paths into a per-parser pool:

// parse_tree.c — block-pool sketch (one block per pool node)
#define NODES_PER_BLOCK 256

typedef struct parser_node_block PARSER_NODE_BLOCK;
struct parser_node_block
{
  PARSER_NODE_BLOCK *next;
  int parser_id;
  PT_NODE nodes[NODES_PER_BLOCK];
};

static PARSER_NODE_BLOCK *parser_Node_blocks[HASH_NUMBER];
static PARSER_NODE_FREE_LIST *parser_Node_free_lists[HASH_NUMBER];
static PARSER_STRING_BLOCK *parser_String_blocks[HASH_NUMBER];
static int parser_id = 1;

parser_create_parser is the lifecycle entry:

// parser_create_parser — parse_tree.c (condensed)
PARSER_CONTEXT *
parser_create_parser (void)
{
  PARSER_CONTEXT *parser;
  struct timeval t;
  struct drand48_data rand_buf;

  parser = (PARSER_CONTEXT *) calloc (sizeof (PARSER_CONTEXT), 1);
  if (parser == NULL)
    {
      er_set (ER_ERROR_SEVERITY, ARG_FILE_LINE, ER_OUT_OF_VIRTUAL_MEMORY,
              1, sizeof (PARSER_CONTEXT));
      return NULL;
    }
#if !defined (SERVER_MODE)
  parser_init_func_vectors ();  /* lazy fill of pt_apply_f / init_f / print_f */
#endif
  parser->id = parser_id++;
  if (pt_register_parser (parser) == ER_FAILED)
    {
      free_and_init (parser);
      return NULL;
    }
  parser->execution_values.row_count = -1;
  /* ... seed lrand48 / drand48 / etc ... */
  parser->query_id = NULL_QUERY_ID;
  /* ... 15 single-bit flags ... */
  return parser;
}

Two non-obvious facts. First, parser_init_func_vectors is lazy-fired here — the first parser created in a process pays the cost of populating the three function-pointer tables, every subsequent parser inherits them. Second, the parser does not own its node blocks directly; it owns a parser_id and the node blocks are looked up in a process-wide hash (parser_Node_blocks[HASH_NUMBER]). This makes parser_free_parser a hash-walk that reclaims every block tagged with the dying parser’s id.
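A minimal model of that id-keyed registry (hypothetical names — BLOCK, pool_alloc, pool_free_all — and without CUBRID's free lists or string blocks) makes the lifecycle concrete: allocation bump-carves slots from the newest block for an owner, and teardown is a single hash-chain walk that frees every block tagged with the dying owner's id:

```c
#include <stdlib.h>

#define HASH_NUMBER 8
#define SLOTS_PER_BLOCK 4

typedef struct block BLOCK;
struct block
{
  BLOCK *next;
  int owner_id;       /* the parser_id analog */
  int used;           /* bump-pointer high-water mark */
  double slots[SLOTS_PER_BLOCK];  /* stand-in for PT_NODE slots */
};

static BLOCK *blocks[HASH_NUMBER];  /* process-wide, like parser_Node_blocks */

static void *
pool_alloc (int owner_id)
{
  BLOCK *b = blocks[owner_id % HASH_NUMBER];

  /* reuse the newest block for this owner if it still has room */
  if (b == NULL || b->owner_id != owner_id || b->used == SLOTS_PER_BLOCK)
    {
      BLOCK *nb = calloc (1, sizeof (BLOCK));
      if (nb == NULL)
        return NULL;
      nb->owner_id = owner_id;
      nb->next = blocks[owner_id % HASH_NUMBER];
      blocks[owner_id % HASH_NUMBER] = nb;
      b = nb;
    }
  return &b->slots[b->used++];
}

/* Free every block tagged with the dying owner's id; returns blocks freed.
 * Other owners sharing the same hash chain are left untouched. */
static int
pool_free_all (int owner_id)
{
  BLOCK **link = &blocks[owner_id % HASH_NUMBER];
  int freed = 0;

  while (*link)
    {
      if ((*link)->owner_id == owner_id)
        {
          BLOCK *dead = *link;
          *link = dead->next;
          free (dead);
          freed++;
        }
      else
        link = &(*link)->next;
    }
  return freed;
}

/* Five allocations for owner 1 span two 4-slot blocks; owner 9 hashes to
 * the same chain but survives owner 1's teardown. */
static int
demo (void)
{
  int i;
  for (i = 0; i < 5; i++)
    if (pool_alloc (1) == NULL)
      return -1;
  (void) pool_alloc (9);      /* 9 % 8 == 1: same chain as owner 1 */
  return pool_free_all (1);   /* frees only owner 1's two blocks */
}
```

The design pays a hash-walk at parser death in exchange for free-less allocation during the parse — the same trade CUBRID makes.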

Node allocation:

// parser_new_node — parse_tree_cl.c
PT_NODE *
parser_new_node (PARSER_CONTEXT * parser, PT_NODE_TYPE node_type)
{
  PT_NODE *node;

  node = parser_create_node (parser);      /* carve from PARSER_NODE_BLOCK */
  if (node)
    {
      parser_init_node (node, node_type);  /* memset + pt_init_f[node_type] */
      pt_parser_line_col (node);           /* stamp from current YYLTYPE */
      node->sql_user_text = g_query_string;
      node->sql_user_text_len = g_query_string_len;
    }
  return node;
}

Three things happen in a single call: (a) a PT_NODE-sized slot is taken from the parser’s node-block pool; (b) the slot is zeroed, the node_type field is set, and pt_init_f[node_type] runs to fill in default values for the type-specific union arm; (c) the (line, column, buffer_pos) triple is stamped from the lexer’s current csql_yylloc/g_query_string_pos so future error messages can quote the source position of this exact node.

parser_init_node is the second half:

// parser_init_node — parse_tree_cl.c
PT_NODE *
parser_init_node (PT_NODE * node, PT_NODE_TYPE node_type)
{
  assert (node != NULL);
  assert (node_type < PT_LAST_NODE_NUMBER);

  int parser_id = node->parser_id;
  memset (node, 0x00, sizeof (PT_NODE));
  node->buffer_pos = -1;
  node->type_enum = PT_TYPE_NONE;
  node->parser_id = parser_id;
  node->node_type = node_type;
  if (pt_init_f[node_type])
    node = (pt_init_f[node_type]) (node);
  return node;
}

The parser_id is preserved across the memset because the slot’s identity (which parser does it belong to?) outlives the node’s logical state; this is what lets parser_reinit_node re-stamp an in-use slot during walk-rewrites without losing the slot’s home parser.
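Both ideas — a per-type init dispatch table and a pool-identity field that survives the memset — fit in a few lines. A toy version under invented names (MINI_NODE, init_f here are illustrative, not CUBRID's types):

```c
#include <string.h>

enum node_type { NT_VALUE, NT_EXPR, NT_LAST };

typedef struct mini_node MINI_NODE;
struct mini_node
{
  int parser_id;        /* which pool the slot belongs to: survives re-init */
  enum node_type type;
  int payload;
};

typedef MINI_NODE *(*init_fn) (MINI_NODE *);

/* per-type defaulting, the pt_init_f analog */
static MINI_NODE *init_value (MINI_NODE * n) { n->payload = -1; return n; }
static MINI_NODE *init_expr  (MINI_NODE * n) { n->payload = 0;  return n; }

static init_fn init_f[NT_LAST] = { init_value, init_expr };

static MINI_NODE *
node_init (MINI_NODE * n, enum node_type type)
{
  int id = n->parser_id;       /* slot identity outlives node state */
  memset (n, 0, sizeof (*n));
  n->parser_id = id;
  n->type = type;
  return init_f[type] ? init_f[type] (n) : n;
}

/* Re-initialising an NT_EXPR slot as NT_VALUE wipes the payload but keeps
 * the owning pool's id — the parser_reinit_node behaviour. */
static int
demo_preserved_id (void)
{
  MINI_NODE n = { 7, NT_EXPR, 99 };
  node_init (&n, NT_VALUE);
  return (n.parser_id == 7 && n.payload == -1) ? 1 : 0;
}
```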

The companion parser_alloc is for everything that is not a PT_NODE — strings, arrays, host-variable arrays:

// parser_alloc — parse_tree.c (excerpt)
void *
parser_alloc (const PARSER_CONTEXT * parser, const int length)
{
  /* ... locate or create a PARSER_STRING_BLOCK with `length` bytes free,
   * bump the block's high-water mark, return the new region ... */
}

Strings inside parser_alloc’s pool live and die with the parser context just like nodes. The grammar’s calls to pt_makename(yytext) (which calls pt_append_string) for every identifier therefore do not leak — the entire universe of identifier strings allocated during a parse is freed in one sweep by parser_free_parser → pt_free_string_blocks.
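The arena discipline for strings is easy to model. A hypothetical sketch (STR_ARENA and friends are invented names, not CUBRID's PARSER_STRING_BLOCK layout; oversize strings are simply rejected to keep it short):

```c
#include <stdlib.h>
#include <string.h>

#define STR_BLOCK_BYTES 64

typedef struct str_block STR_BLOCK;
struct str_block
{
  STR_BLOCK *next;
  size_t used;                   /* bump-pointer high-water mark */
  char bytes[STR_BLOCK_BYTES];
};

typedef struct { STR_BLOCK *head; } STR_ARENA;

/* The pt_makename-style intern: copy the lexeme into the arena. */
static char *
arena_strdup (STR_ARENA * a, const char *s)
{
  size_t len = strlen (s) + 1;
  if (len > STR_BLOCK_BYTES)
    return NULL;                 /* sketch: no oversize blocks */
  if (a->head == NULL || STR_BLOCK_BYTES - a->head->used < len)
    {
      STR_BLOCK *b = calloc (1, sizeof (STR_BLOCK));
      if (b == NULL)
        return NULL;
      b->next = a->head;
      a->head = b;
    }
  char *out = a->head->bytes + a->head->used;
  memcpy (out, s, len);
  a->head->used += len;
  return out;
}

/* The pt_free_string_blocks analog: one sweep frees every interned string. */
static void
arena_free_all (STR_ARENA * a)
{
  while (a->head)
    {
      STR_BLOCK *dead = a->head;
      a->head = dead->next;
      free (dead);
    }
}

/* Intern the running example's identifiers, verify them, free in one sweep. */
static int
demo_intern (void)
{
  STR_ARENA a = { NULL };
  const char *ids[] = { "code", "s_name", "G" };
  char *kept[3];
  int i, ok = 1;

  for (i = 0; i < 3; i++)
    kept[i] = arena_strdup (&a, ids[i]);
  for (i = 0; i < 3; i++)
    ok = ok && kept[i] && strcmp (kept[i], ids[i]) == 0;
  arena_free_all (&a);
  return ok;
}
```

No string is ever freed individually; the arena's lifetime is the parser's lifetime, which is why the grammar can intern every identifier without bookkeeping.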

A handful of functions live in csql_grammar.y’s epilogue because every grammar action calls them:

// parser_make_expression — csql_grammar.y
PT_NODE *
parser_make_expression (PARSER_CONTEXT * parser, PT_OP_TYPE OP,
                        PT_NODE * arg1, PT_NODE * arg2, PT_NODE * arg3)
{
  PT_NODE *expr = parser_new_node (parser, PT_EXPR);

  if (expr)
    {
      expr->info.expr.op = OP;
      if (pt_is_operator_logical (expr->info.expr.op))
        expr->type_enum = PT_TYPE_LOGICAL;
      expr->info.expr.arg1 = arg1;
      expr->info.expr.arg2 = arg2;
      expr->info.expr.arg3 = arg3;
      /* ... INST_NUM / GROUPBY_NUM / ORDERBY_NUM compatibility checks ... */
      /* ... mark parser_cannot_cache for SERIAL / SYS_DATE-family ops ... */
    }
  return expr;
}

static PT_NODE *
parser_make_link (PT_NODE * list, PT_NODE * node)
{
  parser_append_node (node, list);
  return list;
}

parser_make_expression is what every binary or ternary operator rule in the grammar reduces to — <expr> + <expr>, <expr> AND <expr>, <expr> BETWEEN <expr> AND <expr> all funnel through it. parser_make_link is what lets a recursive list rule (select_list : select_list ',' expr) extend the list using the next pointer rather than allocating a separate list node. These two helpers — plus parser_make_func_with_arg_count for arity-checked function nodes and pt_makename for interning identifier strings — cover roughly 80% of the grammar's reduce actions.
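The parser_make_link trick — the nodes are their own list cells, chained on next — can be sketched with an illustrative LNODE type (names invented here; the real parser_append_node lives in parse_tree_cl.c and caches the tail):

```c
#include <stddef.h>

typedef struct lnode LNODE;
struct lnode
{
  int value;
  LNODE *next;   /* the PT_NODE::next analog: the node is its own cons cell */
};

/* parser_append_node analog: walk to the tail, hook the new node on. */
static LNODE *
append_node (LNODE * node, LNODE * list)
{
  if (list == NULL)
    return node;
  LNODE *tail = list;
  while (tail->next)
    tail = tail->next;
  tail->next = node;
  return list;
}

/* parser_make_link analog: the recursive list rule's reduce action,
 * i.e. what select_list : select_list ',' expr would call. */
static LNODE *
make_link (LNODE * list, LNODE * node)
{
  return append_node (node, list);
}

static int
list_length (const LNODE * list)
{
  int n = 0;
  for (; list; list = list->next)
    n++;
  return n;
}

/* Reducing "a, b, c" left-to-right keeps the head stable and extends the
 * spine — no separate list nodes are ever allocated. */
static int
demo_list (void)
{
  LNODE a = { 1, NULL }, b = { 2, NULL }, c = { 3, NULL };
  LNODE *l = make_link (make_link (&a, &b), &c);
  return (l == &a && list_length (l) == 3 && a.next == &b && b.next == &c);
}
```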

Putting it together — the SELECT example


The deck’s running example, SELECT * FROM code WHERE s_name > 'G';, exercises every piece above. The rough trace:

  1. The lexer returns the tokens SELECT '*' FROM IDENTIFIER('code') WHERE IDENTIFIER('s_name') '>' STRING_LITERAL('G') ';'. IDENTIFIER and STRING_LITERAL carry their lexeme through pt_makename/pt_append_bytes; the others carry only their token code.
  2. As the bottom-up parser reduces, each rule’s action allocates the right PT_NODE and wires its children. Allocation order is roughly bottom-up: the literal 'G' reduces first (PT_VALUE, type_enum = PT_TYPE_CHAR), then the identifier s_name (PT_NAME), then the comparison (PT_EXPR with op = PT_GT), then the wildcard * (PT_VALUE, type_enum = PT_TYPE_STAR), then the FROM-spec (PT_SPEC whose entity_name points at a PT_NAME for code), then the PT_SELECT whose info.query.q.select.list/from/where points at all of the above.
  3. The top-level reduce pushes the resulting PT_SELECT onto parser->node_stack.
  4. parser_main snapshots the stack into parser->statements and returns it.

After parser_parse_string returns, the rest of the system sees only the PT_SELECT root. The lexer and grammar are out of the picture; every later transformation walks pt_apply_f/pt_walk_private and rewrites the same arena.
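Under invented names (MNODE with a flattened child layout, standing in for PT_NODE and its per-type info arms), the bottom-up construction order in the trace corresponds to roughly this:

```c
#include <stdlib.h>
#include <string.h>

enum mtype { M_VALUE, M_NAME, M_EXPR, M_SPEC, M_SELECT };

typedef struct mnode MNODE;
struct mnode
{
  enum mtype type;
  const char *text;             /* lexeme for names / values */
  MNODE *arg1, *arg2;           /* PT_EXPR-style children */
  MNODE *list, *from, *where;   /* PT_SELECT-style children */
};

static MNODE *
mnew (enum mtype type, const char *text)
{
  MNODE *n = calloc (1, sizeof (MNODE));  /* sketch: assume success */
  n->type = type;
  n->text = text;
  return n;
}

/* SELECT * FROM code WHERE s_name > 'G' — leaves reduce first,
 * the select root last, mirroring the trace above. */
static MNODE *
build_select (void)
{
  MNODE *lit  = mnew (M_VALUE, "G");       /* 'G' reduces first */
  MNODE *name = mnew (M_NAME, "s_name");
  MNODE *gt   = mnew (M_EXPR, ">");
  gt->arg1 = name;
  gt->arg2 = lit;
  MNODE *star = mnew (M_VALUE, "*");
  MNODE *spec = mnew (M_SPEC, "code");
  MNODE *sel  = mnew (M_SELECT, NULL);
  sel->list = star;
  sel->from = spec;
  sel->where = gt;
  return sel;
}

static int
demo_select_shape (void)
{
  MNODE *sel = build_select ();
  return sel->type == M_SELECT
    && sel->where->type == M_EXPR
    && strcmp (sel->where->arg1->text, "s_name") == 0
    && strcmp (sel->where->arg2->text, "G") == 0
    && strcmp (sel->from->text, "code") == 0;
}
```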

Both .l and .y are inputs to a CMake custom command that runs Flex and Bison at build time:

# CMakeLists.txt (root) — CSQL FLEX/BISON targets (excerpt)
# Replace old bison directives with new ones; regenerate csql_grammar.yy
# whenever csql_grammar.y is modified.
add_custom_command(OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/csql_grammar.yy
  COMMAND ${CMAKE_COMMAND}
    -DGRAMMAR_INPUT_FILE=${CMAKE_SOURCE_DIR}/src/parser/csql_grammar.y
    -DGRAMMAR_OUTPUT_FILE=${CMAKE_CURRENT_BINARY_DIR}/csql_grammar.yy
    -P ${CMAKE_SOURCE_DIR}/cmake/CSQLGrammarTransform.cmake
  MAIN_DEPENDENCY ${CMAKE_SOURCE_DIR}/src/parser/csql_grammar.y
)
set(CSQL_GRAMMAR_INPUT ${CMAKE_CURRENT_BINARY_DIR}/csql_grammar.yy)
bison_target(csql_grammar ${CSQL_GRAMMAR_INPUT} ${CSQL_GRAMMAR_OUTPUT}
  COMPILE_FLAGS "--no-lines --name-prefix=csql_yy -d -r all")
flex_target(csql_lexer ${CSQL_LEXER_INPUT} ${CSQL_LEXER_OUTPUT}
  COMPILE_FLAGS "--noline --never-interactive --prefix=csql_yy")
add_flex_bison_dependency(csql_lexer csql_grammar)
add_custom_target(gen_csql_grammar DEPENDS ${BISON_csql_grammar_OUTPUTS})
add_custom_target(gen_csql_lexer DEPENDS ${FLEX_csql_lexer_OUTPUTS})

Two notable choices: --name-prefix=csql_yy (so all generated symbols begin csql_yy* and do not collide with the loader’s own Flex/Bison grammar in src/loaddb/), and the csql_grammar.yy indirection — csql_grammar.y is pre-processed by CSQLGrammarTransform.cmake before Bison sees it, which is how the project rewrites legacy YACC directives into Bison-3 syntax without editing the original .y file.

Anchor on symbol names, not line numbers. The CUBRID source moves; the position table at the end is accurate only as of the date this doc was last updated.

Lexer layout (src/parser/csql_lexer.l)

  • Definitions block — YY_INPUT / YY_USER_ACTION / start conditions / globals (yybuffer_pos, is_dblink_query_string, expecting_pl_lang_spec).
  • Keyword rules — about 600 [xX]... patterns each returning a Bison token code. Keywords that need their lexeme retained set csql_yylval.cptr = pt_makename(yytext) first.
  • Identifier / numeric / string-literal rules.
  • Start-condition rules — <QUOTED_CHAR_STRING>, <DELIMITED_ID_NAME>, <BRACKET_ID_NAME>, <BACKTICK_ID_NAME>, <PLCSQL_TEXT>, <POST_PLCSQL_TEXT>, <PL_LANG_SPEC>, etc.
  • User-subroutines block — parser_yyinput and its single / multi-line variants, csql_yyerror, csql_yyerror_explicit, yywrap, begin_token, parser_c_hint / parser_line_hint.
Grammar layout (src/parser/csql_grammar.y)

  • Prologue (%{ ... %}) — YYLTYPE extension with buffer_pos, helper-container typedefs (container_2..11), parser_make_func_with_arg_count, parser_make_expression, parser_make_link declarations.
  • Bison declarations — %locations, %glr-parser, %union, %type (~1100 entries), %token (~600 entries).
  • Grammar rules — stmt, stmt_, select_stmt, select_expression, select_expression_opt_with, select_list, from_list, extended_table_specification, single_table_spec, subquery, expression, predicate, function_call, data_type, create_stmt, alter_stmt, insert_stmt, update_stmt, delete_stmt, merge_stmt … ~1700 productions.
  • Epilogue — parser_initialize_parser_context, parser_make_func_with_arg_count, parser_make_expression, parser_make_link, parser_make_link_or, parser_main, parser_init_yydebug.

Parse-tree definitions (src/parser/parse_tree.h)

  • enum pt_node_type — the PT_NODE_TYPE tags, with the PT_NODE_NUMBER/PT_LAST_NODE_NUMBER sentinel.
  • enum pt_type_enum — runtime value-type tag (PT_TYPE_INTEGER, PT_TYPE_VARCHAR, …).
  • Per-type PT_*_INFO structs (PT_QUERY_INFO, PT_SELECT_INFO inside PT_QUERY_INFO, PT_SPEC_INFO, PT_NAME_INFO, PT_EXPR_INFO, PT_VALUE_INFO, PT_DATA_TYPE_INFO, …).
  • union pt_statement_info — typedef'd as PT_STATEMENT_INFO.
  • struct parser_node — the PT_NODE itself with the 27 flag bits and the info union.
  • struct parser_context — node stack, statements array, per-parser globals (error_msgs, host_variables, …).
  • enum { PT_STOP_WALK, PT_CONTINUE_WALK, PT_LEAF_WALK, PT_LIST_WALK } — walker continuation modes.
  • typedef PT_NODE *(*PT_NODE_WALK_FUNCTION) (...) — the walker function-pointer type users plug into parser_walk_tree.

Tree-and-arena code (src/parser/parse_tree.c)

  • PARSER_NODE_BLOCK, PARSER_NODE_FREE_LIST, PARSER_STRING_BLOCK — the per-parser pool types.
  • parser_create_node_block — extend the block list.
  • parser_create_node — bump-allocate one PT_NODE slot.
  • pt_free_node_blocks — release all blocks for a parser id.
  • parser_create_string_block / parser_allocate_string_buffer — strings.
  • pt_register_parser / pt_unregister_parser — process-wide hash registration.
  • parser_alloc — generic-allocator wrapper.
  • pt_append_string / pt_append_varchar / pt_append_bytes — string accumulation.
  • parser_create_parser / parser_free_parser — context lifecycle.
  • parser_free_node_resources / parser_free_node — individual-node free (rarely called; the arena handles bulk free).

Walker, init, helpers (src/parser/parse_tree_cl.c)

  • pt_apply_f, pt_init_f, pt_print_f — the three per-PT_NODE_TYPE function-pointer arrays.
  • pt_apply_func_array, pt_init_func_array, pt_print_func_array — the static backing arrays.
  • pt_init_apply_f, pt_init_init_f, pt_init_print_f — populate the three arrays.
  • parser_init_func_vectors — lazy entry called from parser_create_parser.
  • pt_walk_private — the recursive walker, shared by parser_walk_tree and parser_walk_leaves.
  • parser_walk_tree, parser_walk_leaves, pt_continue_walk — public traversal API.
  • parser_new_node, parser_init_node, parser_reinit_node — node creation and re-init.
  • parser_free_tree, parser_free_subtrees — explicit subtree free (used by transforms that re-write a subtree before the parser dies).
  • parser_copy_tree, parser_copy_tree_list — deep-copy with copy_node_in_tree_pre.
  • parser_append_node, parser_append_node_or — list-spine helpers.
  • pt_apply_select, pt_init_select, pt_print_select — the canonical example for PT_SELECT; every other PT_NODE_TYPE has a parallel triple.
  • parser_parse_string, parser_parse_string_with_escapes, parser_parse_string_use_sys_charset, parser_parse_file — public entrypoints.
Public API surface (src/parser/parser.h)

  • parser_main, parser_create_parser, parser_free_parser, parser_init_func_vectors, parser_alloc.
  • parser_parse_string, parser_parse_string_with_escapes, parser_parse_string_use_sys_charset, parser_parse_file.
  • parser_new_node, parser_init_node, parser_reinit_node, parser_free_tree, parser_free_subtrees, parser_clear_node, parser_free_node, parser_free_node_resources.
  • parser_walk_tree, parser_walk_leaves, pt_continue_walk.
  • parser_print_tree, parser_print_tree_with_quotes, parser_print_tree_list.

Scanner support (src/parser/scanner_support.c)

  • pt_makename — interns an identifier string into the per-parser arena and returns the char *.
  • pt_get_keyword / pt_check_word_for_keyword — look up whether a candidate identifier is actually a reserved word.
  • dbcs_get_next / buffgetin / binarygetin / fgetin — the next_char / next_byte implementations the parser context wires up before calling the lexer.
Symbol — file:line
enum pt_node_type (PT_NODE_TYPE) — parse_tree.h:963
PT_LAST_NODE_NUMBER — parse_tree.h:1091
enum pt_type_enum (PT_TYPE_ENUM) — parse_tree.h:1096
union pt_statement_info — parse_tree.h:3470
struct parser_node (the PT_NODE itself) — parse_tree.h:3649
struct parser_context — parse_tree.h:3766
PT_NODE_WALK_FUNCTION typedef — parse_tree.h:1737
PT_STOP_WALK / CONTINUE / LEAF / LIST — parse_tree.h:882
parser_main — csql_grammar.y:23678
parser_make_expression — csql_grammar.y:22573
parser_make_link — csql_grammar.y:22633
parser_make_func_with_arg_count — csql_grammar.y:22542
parser_initialize_parser_context — csql_grammar.y:23480
select_stmt rule — csql_grammar.y:12848
select_expression rule — csql_grammar.y:12365
select_expression_opt_with rule — csql_grammar.y:12263
stmt rule — csql_grammar.y:1776
%union — csql_grammar.y:545
%locations / %glr-parser — csql_grammar.y:540
pt_walk_private — parse_tree_cl.c:914
parser_walk_leaves — parse_tree_cl.c:1001
parser_walk_tree — parse_tree_cl.c:1045
parser_new_node — parse_tree_cl.c:2245
parser_init_node — parse_tree_cl.c:2266
parser_reinit_node — parse_tree_cl.c:2292
parser_append_node — parse_tree_cl.c:3238
pt_init_apply_f — parse_tree_cl.c:5016
pt_init_init_f — parse_tree_cl.c:5145
pt_init_print_f — parse_tree_cl.c:5280
pt_apply_select — parse_tree_cl.c:14302
pt_init_select — parse_tree_cl.c:14340
pt_print_select — parse_tree_cl.c:14358
parser_init_func_vectors — parse_tree_cl.c:17754
parser_parse_string — parse_tree_cl.c:1800
parser_parse_string_with_escapes — parse_tree_cl.c:1813
parser_parse_string_use_sys_charset — parse_tree_cl.c:1782
parser_parse_file — parse_tree_cl.c:1924
parser_create_parser — parse_tree.c:1174
parser_free_parser — parse_tree.c:1253
parser_create_node — parse_tree.c:276
parser_create_node_block — parse_tree.c:219
parser_alloc — parse_tree.c:956
parser_main declaration — parser.h:68
parser_create_parser declaration — parser.h:73
parser_walk_tree declaration — parser.h:114
pt_makename — scanner_support.c:213

The raw analyses (PDF “Parser - Lexing and Parsing SQL v1.0”, PDF “Parser - Parsing Tree Structure v1.0”, PPTX “parser_semantic_check_basic v1.0”) were written against an earlier branch of CUBRID. Verified deltas as of the source state captured above:

  • parser_yyinput is now a single function, not three. The PDF describes parser_yyinput, parser_yyinput_single_line, and parser_yyinput_multi_line as three peers in the user-subroutines block. Current csql_lexer.l declares only parser_yyinput as the active reader; the old multi-line helper is referenced from parser_yyinput_single_mode’s one-shot mode but is not the default. Behaviour is the same for typical multi-statement strings.
  • The extern int yyline; extern int dot_flag; declarations in the deck no longer appear at the top of csql_lexer.l. Current source has only extern int yycolumn and extern int yycolumn_end; line tracking moved to Flex’s own %option yylineno. The deck’s snippet pre-dates that switch.
  • parser_make_link returns PT_NODE *, not void. The PDF excerpt at the bottom of “Lexing and Parsing SQL” (parser_make_link returning a PT_NODE *) is correct; this is consistent with the current source.
  • The pt_walk_private body in the deck’s “Parsing Tree Structure” PDF is current, modulo two structural changes. The PDF’s if (node && walk->pre_function) opening guard has been simplified to an unconditional pre-call (the caller already null-checks). The end-of-walk data_type visit is now folded inside the typed-child arm (PT_APPLY_WALK (parser, node->data_type, walk) after (*apply)), not after it. The four continue_walk modes (PT_STOP_WALK, PT_CONTINUE_WALK, PT_LEAF_WALK, PT_LIST_WALK) and their semantics are unchanged.
  • PT_NODE::flag is a 27-bit struct, not a free int. The deck’s “PT_NODE” diagram shows a single “etc flag” field — it glosses over the bitfield. The current source has 27 named single-bit flags including recompile, cannot_prepare, partition_pruned, si_datetime, si_tran_id, clt_cache_check, use_plan_cache, is_hidden_column, with_rollup, do_not_fold, is_value_query, do_not_replace_orderby, is_added_by_parser, is_alias_enabled_expr, is_wrapped_res_for_coll, is_system_generated_stmt, use_auto_commit, done_reduce_equality_terms, print_in_value_for_dblink, do_not_use_subquery_cache, for_default_func — the rest are pre-existing.
  • The next_row link in PT_NODE is for PT_NODE_LIST / VALUES-query rows. The deck does not mention it; current source uses it for PT_VALUE/PT_NAME/PT_EXPR nodes that belong to a PT_NODE_LIST row of a VALUES-generated PT_SELECT. The walker (pt_walk_private) does not follow next_row directly — a PT_NODE_LIST apply visits its inner row list explicitly via the typed-child path.
  • parser_init_func_vectors is now lazy-fired from parser_create_parser (under #if !defined (SERVER_MODE)). The deck describes parser_init_func_vectors as called during process startup; in current source it is called the first time a parser context is created in a process.
  • PT_NODE::sql_user_text is per-node, set inside parser_new_node. The deck doesn’t describe it; this is what lets the plan cache key on the literal SQL string for the node’s owning statement (and what makes per-statement EXPLAIN quote the original text rather than a re-rendered form).
Open questions (with investigation paths):

  1. GLR conflict count is unmeasured. The grammar declares %glr-parser because it has shift/reduce ambiguities; what is the actual ambiguous-state count, and which constructs are responsible? Investigation path: build with -r all -d (already in bison_target’s COMPILE_FLAGS), inspect csql_grammar.output, count entries in “State has conflicts”. The deck does not address this.
  2. csql_grammar.yy pre-processor scope. cmake/CSQLGrammarTransform.cmake rewrites csql_grammar.y into csql_grammar.yy before Bison sees it. What does the rewrite do — replace deprecated YACC directives (%pure_parser → %define api.pure)? Does it rewrite reduce-action syntax? This matters because every grammar edit goes through that filter. Investigation path: read CSQLGrammarTransform.cmake and diff a representative pair of input / output files in a debug build.
  3. expecting_pl_lang_spec is a one-shot flag. The lexer uses it to switch into the PLCSQL_TEXT start condition when the grammar is about to consume a procedure body. How does it get reset on parse error mid-procedure-body — does an error inside PLCSQL_TEXT leak the start-condition state into the next parser_main call? Investigation path: trace the BEGIN(INITIAL) calls in csql_lexer.l and confirm that parser_main’s entry-time expecting_pl_lang_spec = 0 is sufficient to recover (it is not obvious — Flex’s start-condition stack is separate from the variable).
  4. Memory recovery on parser_main re-entry. The comment in csql_grammar.y — “parser_main can be reentered while executing statements loaded by loaddb -s. During the loaddb -s, the yybuffer_pos must not be currupted” [sic] — explains why parser_main saves and restores yybuffer_pos on entry/exit. Are there other thread-local-ish parser globals (g_query_string, g_query_string_pos, parser_select_level, parser_within_join_condition) that re-entry should protect but currently does not? Investigation path: grep for ^static .* parser_ in csql_grammar.y and audit each for re-entry safety.
  5. parser_id overflow. The static parser_id counter in parse_tree.c (static int parser_id = 1;) is the key into parser_Node_blocks[HASH_NUMBER]. After 2³¹ parser creations it overflows. In a long-lived cub_server (or a heavily-stressed broker CAS) is this reachable, and what happens — does the hash chain just collide and old blocks get returned, or is there a guard? Investigation path: read pt_register_parser for the collision check.
  6. PT_NODE::flag versus struct alignment growth. The flag struct currently has 27 single-bit fields; a few more bits push it into a second word on 32-bit hosts. The PT_NODE is allocated 256-at-a-time; growing each node by even one machine word grows every block by NODES_PER_BLOCK × the word size (roughly 1–2 KB per block). Is anyone tracking this? Investigation path: sizeof (PT_NODE) snapshots over recent commits.

Raw analyses (raw/code-analysis/cubrid/query-processing/)

  • code_analysis_Parser-Lexing_and_Parsing_SQL_v1_0.pdf — the lex/parse pipeline, file-level walkthrough of csql_lexer.l and csql_grammar.y, build wiring snippet.
  • code_analysis_Parser-Parsing_Tree_Structure_v_1_0.pdf — PT_NODE structure, three-element (node, info, walk) framing, BST analogy for traversal, pt_walk_private body.
  • parser_semantic_check_basic_v1.0.pptx — query-processing pipeline diagram, PT_NODE quick reference, the running SELECT * FROM code WHERE s_name > 'G' example, full example trees for joins / DOT paths / derived tables / Oracle-outer joins. The document used the parser-related slides; the semantic-check half of the deck is upstream of the next code-analysis doc.
  • _converted/codeanalysisparser-lexingandparsingsqlv10.pdf.md, _converted/codeanalysisparser-parsingtreestructurev10.pdf.md, _converted/parsersemanticcheckbasicv1.0.pptx.md — markitdown extracts of the above.
  • knowledge/code-analysis/cubrid/cubrid-mvcc.md — MVCC visibility runs against the PT_NODE tree only after semantic check; the parser is upstream of the MVCC code path.
  • Database Internals (Petrov), Ch. 1 §“Query Processor” — the architectural picture of where the parser sits; cited for the framing of the parser as the producer of the tree every later pass walks.
  • Aho, Lam, Sethi, Ullman, Compilers: Principles, Techniques, and Tools (Dragon book), 2nd ed. — Ch. 3 (lexing, regular languages → DFA, Lex), Ch. 4.7 (LALR(1)), Ch. 4.1.4 (error recovery), Ch. 5 (syntax-directed translation, semantic actions). The canonical reference for everything Flex/Bison is generating under the hood.
  • Bison manual, GNU Free Software Foundation — %glr-parser semantics, %locations, the csql_yyparse / csql_yylex driver contract, YYLTYPE / csql_yylval interactions.
  • PostgreSQL source — src/backend/parser/scan.l, src/backend/parser/gram.y, src/include/nodes/nodes.h (the Node/NodeTag pattern that CUBRID’s PT_NODE / PT_NODE_TYPE mirrors).
  • MySQL source — sql/sql_yacc.yy, sql/sql_lex.cc (the GLR-flavoured large grammar that CUBRID’s pattern resembles).

CUBRID source (/data/hgryoo/references/cubrid/)

  • src/parser/csql_lexer.l — Flex source.
  • src/parser/csql_grammar.y — Bison source plus parser_main and reduce-action helpers.
  • src/parser/parse_tree.h — PT_NODE, PT_NODE_TYPE, PT_STATEMENT_INFO, parser_context, walker typedefs.
  • src/parser/parse_tree.c — per-parser arena, lifecycle.
  • src/parser/parse_tree_cl.c — walker, init vectors, per-type apply/init/print, public parser_parse_* wrappers, node-creation entry points.
  • src/parser/parser.h — public API surface.
  • src/parser/scanner_support.c — pt_makename, next_char / next_byte implementations.
  • cmake/CSQLGrammarTransform.cmake, CMakeLists.txt (root) — flex_target / bison_target wiring.