CUBRID List-File — Spillable Tuple-Stream Inter-Operator Pipe and Materialisation Substrate
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
Theoretical Background
A query plan is a directed acyclic graph of operators that ingest tuples and emit tuples. Two textbook strategies move tuples between adjacent operators. Volcano (Graefe 1994) couples them by demand-driven iteration: the consumer calls next() on the producer and the producer fabricates exactly one tuple per call, walking down the tree until a leaf yields a row. Nothing is materialised; the call stack is the pipe. The other strategy, explicit materialisation, has the producer write every tuple it generates into a temporary tuple-stream first, and the consumer reads that stream back. Graefe 1993 (Query Evaluation Techniques for Large Databases) catalogues the two designs side-by-side and observes that real engines mix them — most operators iterate, but a small set must materialise.
Four classes of operator have to materialise no matter how aggressive the engine is about pipelining:
- Sort. A merge-sort consumes its entire input before it can emit the smallest tuple. The input has to live somewhere — typically on disk if the run does not fit in memory.
- Hash build (build side of hash join, hash GROUP BY). The probe side streams, but the build side must be fully consumed and indexed before any probe begins.
- Multi-pass scan. Operations that need to walk the same input more than once — analytic functions over a partition, recursive CTEs, certain CONNECT BY traversals, the inner side of a correlated subquery executed once per outer row — must keep the input around. The most efficient way to do this is to capture it in a re-scannable buffer.
- Sub-query result memoisation. When an uncorrelated sub-query feeds the outer plan via a CONSTANT reference, the sub-query is run once, its result tuple is stored, and every reference reads from that store. Without materialisation the sub-query would re-execute on every reference.
Hellerstein and Stonebraker, in Anatomy of a Database System, name this object the inter-operator pipe and observe that it dominates the executor’s runtime cost when queries spill. The pipe has three properties that distinguish it from a generic temp file:
- Sequential write, sequential read. The pipe is never randomly mutated mid-query; the producer appends until it is finished, then becomes immutable for the consumer’s reads. Sort and hash internally re-write, but the pipe they emit is append-only.
- Self-describing tuples. The pipe must carry enough metadata that a reader can decode each tuple back into typed values (for example, to evaluate predicates against it). Per-tuple length-prefixes plus a stream-level type list are the canonical encoding; Database Internals (Petrov, ch. 1, “File Organization”) describes the same pattern under “slotted page” and “fixed/variable-length tuples”.
- Spillable. When the result fits in a few KB the pipe should not allocate disk space; when it grows past that, it should transition to disk transparently. Postgres’s Tuplestore and CUBRID’s list-file both pin themselves to a small in-memory page array first and migrate to a temporary file under pressure.
Tuple representation inside the pipe is a separate design dial. Three families exist: packed row format (fixed-or-variable-length values written contiguously, length prefixes for variable parts; what most row stores use for temp streams), columnar (one buffer per column; favoured by analytical engines), and opaque encoded blobs (e.g., serialised C structs; brittle and rare). CUBRID picks packed row format with two-level length prefixes — a per-tuple length and a per-value length — described in detail in the “CUBRID’s Approach” section below.
The final theoretical piece is lifecycle. A pipe exists strictly inside a query’s execution; it is created when the producer opens, written until the producer closes, scanned by the consumer until the consumer closes, then destroyed. Crashes do not need to recover the pipe — there is nothing to roll back, no transaction has committed against its contents — and so the storage manager treats it as a non-recoverable, non-WAL-logged file. CUBRID’s FILE_TEMP file type and PAGE_QRESULT page type encode exactly this contract.
Common DBMS Design
Engines diverge in how many tuple-stream abstractions they expose, and in where the in-memory-to-disk transition happens. The shape of the abstractions is otherwise uniform.
PostgreSQL — Tuplestore and Tuplesort
Section titled “PostgreSQL — Tuplestore and Tuplesort”PostgreSQL has two distinct objects:
- Tuplestore (src/backend/utils/sort/tuplestore.c) — a generic spillable tuple buffer. Used for Material plan nodes, WindowAgg, Tablefunc, and as the holding tank for RowsFromFunc. Lives entirely in work_mem until it overflows, then writes to a BufFile-backed disk file. Supports forward scan, backward scan, and rewind. The producer calls tuplestore_puttuple / tuplestore_putvalues; the consumer iterates with tuplestore_gettupleslot.
- Tuplesort (tuplesort.c) — a specialised spillable buffer that sorts on insertion. Used by Sort, Agg (sort strategy), Unique, Limit, and by ANALYZE. Internally falls into one of several states (TSS_INITIAL, TSS_BOUNDED, TSS_BUILDRUNS, TSS_FINALMERGE) depending on whether the input fits in work_mem and whether top-N can be exploited.
The split is intentional: Tuplestore does not pay for sort logic when not needed, and Tuplesort’s sort is not bolted onto a generic buffer. Both share BufFile as the spill substrate.
MySQL — filesort, temp tables, Filesort_buffer
MySQL’s analogue is split across more pieces. Filesort (sql/filesort.cc) handles ORDER BY and is conceptually a Tuplesort-shaped object — sort buffer in memory, spill to a sequence of merge files in tmpdir when the buffer is full. Temporary tables (sql/sql_tmp_table.cc) handle GROUP BY and DISTINCT; they look more like a regular MyISAM/InnoDB table than a stream, and migrate from in-memory MEMORY engine to on-disk MyISAM/InnoDB when the size limit (tmp_table_size) is exceeded. The streaming half of the equation is JOIN_CACHE and the iterator interfaces in modern MySQL 8 (AggregateIterator, WindowIterator).
SQL Server — Table Spool / Index Spool
SQL Server has two spool operators in the plan tree itself: Table Spool materialises an arbitrary tuple stream into TempDB, and Index Spool materialises a stream and adds a B-tree index for re-probe. Spools are first-class plan nodes, so the planner reasons explicitly about which sub-plans to spool versus re-execute. The spool storage lives in TempDB, which has its own buffer pool and recovery rules.
Theory ↔ CUBRID mapping
CUBRID collapses everything PostgreSQL splits across Tuplestore and Tuplesort into a single tuple-stream abstraction — the list file. Sort, group-by, hash build, sub-query memo, and final-result delivery all use the same QFILE_LIST_ID underneath. Sort is implemented as a function on top of a list file (qfile_sort_list reads an unsorted list file and produces a sorted list file via external_sort); it is not a different object.
| Theoretical concept | CUBRID name |
|---|---|
| Tuple-stream identifier (the “what”) | QFILE_LIST_ID (query_list.h) |
| Underlying storage handle | QMGR_TEMP_FILE (query_manager.h) |
| In-memory spill buffer | QMGR_TEMP_FILE::membuf[] (PAGE_PTR array) |
| On-disk spill substrate | FILE_TEMP via file_create_temp (file_manager.c) |
| Page format on disk | PAGE_QRESULT (set by qmgr_init_external_file_page) |
| Per-page header | QFILE_PAGE_HEADER (32 bytes) |
| Tuple format | length-prefixed packed row, QFILE_TUPLE |
| Per-tuple value format | QFILE_TUPLE_VALUE_HEADER (flag + length) |
| Stream-level type list | QFILE_TUPLE_VALUE_TYPE_LIST (array of TP_DOMAIN *) |
| Producer iterator state | implicit in QFILE_LIST_ID::last_pgptr / last_offset / lasttpl_len |
| Consumer iterator state | QFILE_LIST_SCAN_ID (query_list.h) |
| Scannable wrapper as scan operator | S_LIST_SCAN / LLIST_SCAN_ID (scan_manager.h) |
| Open the pipe (producer side) | qfile_open_list |
| Append a tuple | qfile_add_tuple_to_list (or qfile_generate_tuple_into_list) |
| Close the pipe (producer side) | qfile_close_list |
| Open a scan over the pipe | qfile_open_list_scan |
| Pull next tuple (consumer side) | qfile_scan_list_next |
| Close the scan | qfile_close_scan |
| Destroy the pipe (release temp pages) | qfile_destroy_list |
| Set-operation pipes | qfile_combine_two_list (UNION / INTERSECT / DIFFERENCE) |
| Sort wrapper | qfile_sort_list / qfile_sort_list_with_func |
| Cross-query result memoisation | QFILE_LIST_CACHE_ENTRY + qfile_lookup_list_cache_entry |
The architectural payoff of one shared abstraction is large: the executor (query_executor.c), the scan layer (scan_manager.c), the post-processing layer (query_aggregate.cpp, query_analytic.cpp), and the cross-query cache (the list-file cache) all speak the same vocabulary. Adding a new materialising operator (or a new consumer) is mostly a matter of calling qfile_open_list and qfile_scan_list_next; the spill logic, page layout, and lifecycle are inherited.
CUBRID’s Approach
This section walks the abstraction at four levels of zoom — the identifier (QFILE_LIST_ID), the page format (QFILE_PAGE_HEADER + tuples), the producer/consumer operations, and the lifecycle binding to the surrounding XASL execution. Each level is exposed by a small set of functions in list_file.c, anchored by symbol names rather than line numbers (the position-hint table at the end of the “Source Walkthrough” section ties symbols to current line offsets).
QFILE_LIST_ID — the tuple-stream identifier
QFILE_LIST_ID is the handle every component passes around. It is not the tuples and not the bytes — it is a small struct (~120 bytes) describing where the bytes live, how many tuples are there, what their schema is, and which writer-side cursor state is current.
```c
// QFILE_LIST_ID — src/query/query_list.h
struct qfile_list_id
{
  QFILE_TUPLE_VALUE_TYPE_LIST type_list;  /* domain of each column */
  SORT_LIST *sort_list;                   /* sort attributes (if sorted) */
  INT64 tuple_cnt;                        /* total tuples */
  int page_cnt;                           /* total pages */
  VPID first_vpid;                        /* head of page chain */
  VPID last_vpid;                         /* tail of page chain */
  PAGE_PTR last_pgptr;                    /* live last-page latch (writer) */
  int last_offset;                        /* end-of-data on last page */
  int lasttpl_len;                        /* last tuple length (for prev links) */
  QUERY_ID query_id;                      /* owning query */
  VFID temp_vfid;                         /* duplicated from tfile_vfid */
  struct qmgr_temp_file *tfile_vfid;      /* the actual storage handle */
  QFILE_TUPLE_DESCRIPTOR tpl_descr;       /* writer-side fast path */
  bool is_domain_resolved;
  bool is_result_cached;
  QFILE_LIST_ID *dependent_list_id;       /* chained on qfile_connect_list */
};
```

Three observations:
- The struct mixes a static description of the file (type list, sort attributes, query owner) with a live writer cursor (last_pgptr, last_offset, lasttpl_len). When a list-file is being constructed, those cursor fields point at a latched page; once qfile_close_list runs they are nulled out and the page is unlatched. A list-file is therefore in one of three states: open-and-writing (cursor live), closed-and-readable (cursor null, pages on disk), or destroyed (struct cleared).
- The actual storage is one indirection away. tfile_vfid points to a QMGR_TEMP_FILE, which holds the in-memory membuf[] and, once the membuf is full, the temp_vfid of a FILE_TEMP file allocated via file_create_temp. The list-file does not see disk directly — it sees qmgr_get_new_page / qmgr_get_old_page, both of which check the membuf first.
- dependent_list_id is a uni-directional chain of secondary list-files attached via qfile_connect_list. Used when a producer needs to logically extend a parent stream with a child — for example, a sub-query’s intermediate that the outer execution will outlive. The chain is destroyed transitively (qfile_destroy_list recurses into dependent_list_id), so ownership is unambiguous.
Page-chain layout
A list-file is a doubly-linked chain of database pages, each of size DB_PAGESIZE (default 16 KB). Each page begins with a 32-byte QFILE_PAGE_HEADER and stores tuples back-to-back from byte 32 to last_offset.
```c
// qfile_page_header — src/query/query_list.h
struct qfile_page_header
{
  int pg_tplcnt;        /* # tuples on this page */
  PAGEID prev_pgid;     /* previous data page */
  PAGEID next_pgid;     /* next data page */
  int lasttpl_off;      /* last-tuple offset (for backward scan) */
  PAGEID ovfl_pgid;     /* overflow page chain head */
  VOLID prev_volid;     /* (volid, pageid) split into separate fields */
  VOLID next_volid;
  VOLID ovfl_volid;
};
```

A separate per-page overflow chain (rooted at ovfl_pgid / ovfl_volid) holds tuples larger than QFILE_MAX_TUPLE_SIZE_IN_PAGE (= DB_PAGESIZE - 32). Big tuples are split: the first chunk lives in-line on the data page, the rest is split across overflow pages and stitched together by qfile_get_tuple_from_current_list when the consumer encounters a tuple whose page header has ovfl_pgid != NULL_PAGEID.
```mermaid
flowchart LR
  subgraph MEMBUF["QMGR_TEMP_FILE.membuf[] (in-memory pages)"]
    M0["page 0<br/>QFILE_PAGE_HEADER<br/>tuples..."]
    M1["page 1<br/>tuples..."]
    M2["page 2<br/>tuples..."]
  end
  subgraph TEMP["FILE_TEMP (on-disk spill)"]
    D0["page 3<br/>tuples..."]
    D1["page 4<br/>tuples..."]
    DN["page N<br/>tuples..."]
  end
  LISTID["QFILE_LIST_ID<br/>first_vpid → page 0<br/>last_vpid → page N<br/>last_pgptr → page N latch"]
  M0 -- next_vpid --> M1
  M1 -- next_vpid --> M2
  M2 -- next_vpid --> D0
  D0 -- next_vpid --> D1
  D1 -- next_vpid --> DN
  LISTID -. owns .-> M0
  LISTID -. live cursor on .-> DN
```
The chain spans both flavours of storage. A page in the membuf has volid == NULL_VOLID and a small pageid index into the array; a page in FILE_TEMP has a real (volid, pageid). The reader does not care: qmgr_get_old_page consults the membuf first (vpid->pageid <= membuf_last) and falls through to pgbuf_fix for real pages. From the list-file’s perspective the chain is uniform.
Tuple format — packed row, two-level length prefix
Each tuple inside a page is a length-prefixed packed row:
```text
[ tuple_length (4) | prev_tuple_length (4) | value_0 | value_1 | ... | value_N-1 ]
```

prev_tuple_length enables backward scan (qfile_scan_list_prev) without revisiting the page header. Each value is itself prefixed:
```text
[ flag (4) | val_len (4) | <val_len bytes of packed value, MAX_ALIGNMENT-padded> ]
```

The flag is V_BOUND for a present value or V_UNBOUND for SQL NULL (in which case val_len == 0). The packed value is in disk-image format — what pr_data_writeval of the column’s PR_TYPE produces. Decoding back into a DB_VALUE is the consumer’s job (typically via qdata_set_valptr_list_with_domain after fetch_val_list).
The macros that realise this layout — QFILE_PUT_TUPLE_LENGTH, QFILE_GET_TUPLE_VALUE_HEADER_POSITION, QFILE_GET_TUPLE_VALUE_LENGTH, QFILE_GET_TUPLE_VALUE_FLAG — live in query_list.h and are inlined wherever a tuple is touched. They are not type-aware; they manipulate raw byte offsets.
Why two prefixes (tuple-length and value-length)?
Because the consumer needs to skip past values it does not care about (e.g., to evaluate a predicate on column 5 without decoding columns 0–4), and because the writer needs to know how much space the next tuple needs before allocating a new page. The tuple-level prefix lets qfile_scan_next walk pages without decoding; the value-level prefix lets qfile_locate_tuple_value jump to column N in O(N) macro stepping.
Producer side — open / write / close
Section titled “Producer side — open / write / close”sequenceDiagram
autonumber
participant Op as Producer (BUILDLIST_PROC)
participant LF as list_file
participant QM as query_manager
participant FM as file_manager
Op->>LF: qfile_open_list(type_list, query_id, flag, NULL)
LF->>QM: qmgr_create_new_temp_file(TEMP_FILE_MEMBUF_NORMAL)
QM-->>LF: QMGR_TEMP_FILE (membuf empty, no FILE_TEMP yet)
LF-->>Op: QFILE_LIST_ID*
loop for each output tuple
Op->>LF: qfile_generate_tuple_into_list(T_NORMAL)
LF->>LF: qfile_allocate_new_page_if_need
alt membuf has free page
LF->>QM: qmgr_get_new_page (returns membuf[i+1])
else membuf full → spill
LF->>QM: qmgr_get_new_page (no membuf left)
QM->>FM: file_create_temp(1, &temp_vfid) (first time)
QM->>FM: file_alloc(temp_vfid, init_page, ...)
FM-->>QM: real disk page
QM-->>LF: PAGE_PTR
end
LF->>LF: qfile_save_normal_tuple → memcpy values<br/>qfile_add_tuple_to_list_id (advance cursor)<br/>qfile_set_dirty_page
end
Op->>LF: qfile_close_list (unlatch last_pgptr)
qfile_open_list is a constructor: it allocates the QFILE_LIST_ID, copies the type list (deep-copies the TP_DOMAIN * array), allocates a QMGR_TEMP_FILE for storage, optionally builds a SORT_LIST if QFILE_FLAG_DISTINCT was set, and registers the file in the per-query temp-file accounting.
```c
// qfile_open_list — src/query/list_file.c (condensed)
QFILE_LIST_ID *
qfile_open_list (THREAD_ENTRY * thread_p, QFILE_TUPLE_VALUE_TYPE_LIST * type_list_p,
                 SORT_LIST * sort_list_p, QUERY_ID query_id, int flag,
                 QFILE_LIST_ID * existing_list_id)
{
  QFILE_LIST_ID *list_id_p =
    existing_list_id ? existing_list_id : (QFILE_LIST_ID *) malloc (sizeof (QFILE_LIST_ID));
  // ... condensed: NULL checks ...

  QFILE_CLEAR_LIST_ID (list_id_p);
  list_id_p->type_list.type_cnt = type_list_p->type_cnt;
  list_id_p->query_id = query_id;

  if (QFILE_IS_FLAG_SET (flag, QFILE_FLAG_RESULT_FILE) && !QFILE_IS_LIST_CACHE_DISABLED)
    list_id_p->tfile_vfid = qmgr_create_result_file (thread_p, query_id);
  else if (QFILE_IS_FLAG_SET (flag, QFILE_FLAG_USE_KEY_BUFFER))
    list_id_p->tfile_vfid =
      qmgr_create_new_temp_file (thread_p, query_id, TEMP_FILE_MEMBUF_KEY_BUFFER);
  else
    list_id_p->tfile_vfid =
      qmgr_create_new_temp_file (thread_p, query_id, TEMP_FILE_MEMBUF_NORMAL);

  // ... condensed: allocate type_list.domp, build sort_list if DISTINCT ...
  return list_id_p;
}
```

Three flavours of underlying temp file are chosen at open time:
- qmgr_create_result_file — the final result of a query. Used when QFILE_FLAG_RESULT_FILE is set on the top-level XASL’s list-file. Its lifetime survives the executor; the client fetches pages via xqfile_get_list_file_page.
- TEMP_FILE_MEMBUF_NORMAL — the default. Membuf size = PRM_ID_TEMP_MEM_BUFFER_PAGES (default 4 pages). Most intermediate list-files use this.
- TEMP_FILE_MEMBUF_KEY_BUFFER — a larger membuf (sized by PRM_ID_INDEX_SCAN_KEY_BUFFER_PAGES) for index-scan key buffers. Same code path otherwise.
After open, qfile_open_list does not allocate a data page yet. The first page is created lazily inside the first qfile_add_tuple_to_list (or qfile_generate_tuple_into_list / qfile_get_first_page) call:
```c
// qfile_allocate_new_page (condensed) — src/query/list_file.c
static PAGE_PTR
qfile_allocate_new_page (THREAD_ENTRY * thread_p, QFILE_LIST_ID * list_id_p,
                         PAGE_PTR page_p, bool is_ovf_page)
{
  VPID new_vpid;
  PAGE_PTR new_page_p = qmgr_get_new_page (thread_p, &new_vpid, list_id_p->tfile_vfid);
  if (new_page_p == NULL)
    return NULL;

  QFILE_PUT_TUPLE_COUNT (new_page_p, 0);
  QFILE_PUT_PREV_VPID (new_page_p, &list_id_p->last_vpid);
  if (is_ovf_page)
    QFILE_PUT_NEXT_VPID_NULL (new_page_p);
  else
    LS_PUT_NEXT_VPID (new_page_p);      /* NULL_PAGEID_IN_PROGRESS */
  QFILE_PUT_LAST_TUPLE_OFFSET (new_page_p, QFILE_PAGE_HEADER_SIZE);
  QFILE_PUT_OVERFLOW_VPID_NULL (new_page_p);

  if (page_p)
    QFILE_PUT_NEXT_VPID (page_p, &new_vpid);   /* link old → new */
  list_id_p->page_cnt++;
  if (page_p)
    qfile_set_dirty_page (thread_p, page_p, FREE, list_id_p->tfile_vfid);
  else
    QFILE_COPY_VPID (&list_id_p->first_vpid, &new_vpid);   /* very first page */
  QFILE_COPY_VPID (&list_id_p->last_vpid, &new_vpid);
  list_id_p->last_pgptr = new_page_p;
  list_id_p->last_offset = QFILE_PAGE_HEADER_SIZE;
  return new_page_p;
}
```

qmgr_get_new_page is the gatekeeper that decides membuf-vs-disk:
```c
// qmgr_get_new_page (condensed) — src/query/query_manager.c
PAGE_PTR
qmgr_get_new_page (THREAD_ENTRY * thread_p, VPID * vpid_p, QMGR_TEMP_FILE * tfile_vfid_p)
{
  /* first try the in-memory page array */
  if (tfile_vfid_p->membuf != NULL
      && tfile_vfid_p->membuf_last < tfile_vfid_p->membuf_npages - 1)
    {
      vpid_p->volid = NULL_VOLID;
      vpid_p->pageid = ++(tfile_vfid_p->membuf_last);
      return tfile_vfid_p->membuf[tfile_vfid_p->membuf_last];
    }

  /* membuf exhausted → spill: ensure a FILE_TEMP exists, then alloc from it */
  if (VFID_ISNULL (&tfile_vfid_p->temp_vfid))
    {
      file_create_temp (thread_p, 1, &tfile_vfid_p->temp_vfid);
      tfile_vfid_p->temp_file_type = FILE_TEMP;
      /* ... TDE setup if encrypted ... */
    }
  return qmgr_get_external_file_page (thread_p, vpid_p, tfile_vfid_p);
}
```

Pages live in two regions stitched into one logical chain. The membuf is a fixed-size array of pre-allocated DB_PAGESIZE buffers carved out of a single large malloc (qmgr_allocate_tempfile_with_buffer); the FILE_TEMP pages are real disk pages tracked by the file manager’s tempcache. The transition happens once per list-file, when the (fixed-size) membuf is full.
Tuple write path — three flavours
The producer has three entry points, distinguished by how it knows the tuple bytes:
- qfile_add_tuple_to_list — the bytes are already in a QFILE_TUPLE buffer. Used when the producer already serialised the tuple (e.g., a sub-block copies into a parent’s list, or the executor ran out of tpl_descr capacity and fell back to qdata_copy_valptr_list_to_tuple). Handles tuples larger than a page via the overflow-page mechanism (qfile_allocate_new_ovf_page).
- qfile_generate_tuple_into_list — the bytes are not serialised yet; they live in a QFILE_TUPLE_DESCRIPTOR whose f_valp[] array points at DB_VALUE *s. The descriptor’s tuple_type (T_SINGLE_BOUND_ITEM, T_NORMAL, T_SORTKEY, T_MERGE) selects one of four qfile_save_*_tuple callbacks that walk the descriptor and write the values. This is the fast path used by the executor’s main loop in qexec_end_one_iteration.
- qfile_fast_intint_tuple_to_list / qfile_fast_intval_tuple_to_list / qfile_fast_val_tuple_to_list — micro-optimisations that bypass the descriptor and write a 1-or-2-value tuple directly. Used by performance-sensitive paths that emit small fixed-shape rows (e.g., the parallel-coordinator emitting (worker_id, count) pairs).
```c
// qfile_save_normal_tuple — src/query/list_file.c (condensed)
static int
qfile_save_normal_tuple (QFILE_TUPLE_DESCRIPTOR * tuple_descr_p, char *tuple_p,
                         char *page_p, int tuple_length)
{
  int total_tuple_value_size = 0, tuple_value_size;

  for (int i = 0; i < tuple_descr_p->f_cnt; i++)
    {
      if (qdata_copy_db_value_to_tuple_value (tuple_descr_p->f_valp[i], tuple_p,
                                              &tuple_value_size) != NO_ERROR)
        return ER_FAILED;
      total_tuple_value_size += tuple_value_size;
      tuple_p += tuple_value_size;
    }
  QFILE_PUT_TUPLE_LENGTH (page_p, tuple_length);
  return NO_ERROR;
}
```

qdata_copy_db_value_to_tuple_value writes [flag, val_len, packed_bytes] for each field, returning the bytes consumed. After all fields are written, the first four bytes of the row are stamped with the total tuple length so consumers can stride past it.
Big-tuple path — overflow pages
If tuple_length > QFILE_MAX_TUPLE_SIZE_IN_PAGE (= DB_PAGESIZE - 32), the writer takes the overflow path:
```c
// qfile_add_tuple_to_list — overflow tail (condensed)
prev_page_p = cur_page_p;
for (offset = tuple_page_size, tuple_p = tuple + offset;
     offset < tuple_length;
     offset += tuple_page_size, tuple_p += tuple_page_size)
  {
    new_page_p = qfile_allocate_new_ovf_page (thread_p, list_id_p, cur_page_p,
                                              prev_page_p, tuple_length, offset,
                                              &tuple_page_size);
    if (new_page_p == NULL)
      {
        /* ... error handling condensed ... */
      }
    memcpy ((char *) new_page_p + QFILE_PAGE_HEADER_SIZE, tuple_p, tuple_page_size);
    prev_page_p = new_page_p;
  }
```

The first chunk fits on the data page like a normal tuple; subsequent chunks live one-per-page in a chain rooted at the data page’s ovfl_pgid. Each overflow page is marked by QFILE_PUT_TUPLE_COUNT (page, QFILE_OVERFLOW_TUPLE_COUNT_FLAG) (= -2), so consumers can recognise it and not mistake it for a regular page.
Closing the producer
qfile_close_list is a small unlatch-only operation. The data is already on disk (or in membuf); closing only releases the latch on the live writer page so consumers and other producers can fix it:
```c
// qfile_close_list — src/query/list_file.c
void
qfile_close_list (THREAD_ENTRY * thread_p, QFILE_LIST_ID * list_id_p)
{
  if (list_id_p && list_id_p->last_pgptr != NULL)
    {
      QFILE_PUT_NEXT_VPID_NULL (list_id_p->last_pgptr);   /* terminate chain */
      qmgr_free_old_page_and_init (thread_p, list_id_p->last_pgptr,
                                   list_id_p->tfile_vfid);
    }
}
```

Note the chain-termination write: while the writer was building the file, every newly allocated page had its next_vpid stamped with NULL_PAGEID_IN_PROGRESS (via the LS_PUT_NEXT_VPID macro under SERVER_MODE); only qfile_close_list finalises it to NULL_PAGEID. This signals to consumers (especially in async / streaming mode) that the producer is done and there will be no further pages. qfile_reopen_list_as_append_mode reverses this — it re-fetches the last page with a write latch so subsequent qfile_add_tuple_to_list calls can extend the file.
Consumer side — scan
Section titled “Consumer side — scan”// QFILE_LIST_SCAN_ID — src/query/query_list.hstruct qfile_list_scan_id{ SCAN_STATUS status; /* S_OPENED / S_STARTED / S_ENDED / S_CLOSED */ SCAN_POSITION position; /* S_BEFORE / S_ON / S_AFTER */ VPID curr_vpid; PAGE_PTR curr_pgptr; /* live page latch */ QFILE_TUPLE curr_tpl; /* pointer into curr_pgptr */ bool keep_page_on_finish; bool is_read_only; int curr_offset; int curr_tplno; QFILE_TUPLE_RECORD tplrec; /* spliced buffer for overflow tuples */ QFILE_LIST_ID list_id; /* COPY of the list-file (independent cursor) */};Note that list_id inside the scan is a copy of the producer’s QFILE_LIST_ID, made by qfile_copy_list_id inside qfile_open_list_scan. This decouples the consumer’s read cursor from the producer (and from any other consumer) — multiple readers can scan the same list-file concurrently, each holding its own curr_pgptr latch. Lifecycle is handled by ref-counting the underlying QMGR_TEMP_FILE (via qfile_update_qlist_count) so the temp file is not destroyed until the last cursor closes.
```c
// qfile_open_list_scan — src/query/list_file.c
int
qfile_open_list_scan (QFILE_LIST_ID * list_id_p, QFILE_LIST_SCAN_ID * scan_id_p)
{
  scan_id_p->status = S_OPENED;
  scan_id_p->position = S_BEFORE;
  scan_id_p->keep_page_on_finish = 0;
  scan_id_p->is_read_only = false;
  scan_id_p->curr_vpid.pageid = NULL_PAGEID;
  scan_id_p->curr_vpid.volid = NULL_VOLID;
  QFILE_CLEAR_LIST_ID (&scan_id_p->list_id);

  if (qfile_copy_list_id (&scan_id_p->list_id, list_id_p, true, QFILE_SKIP_DEPENDENT)
      != NO_ERROR)
    return ER_FAILED;

  scan_id_p->tplrec.size = 0;
  scan_id_p->tplrec.tpl = NULL;
  return NO_ERROR;
}
```

State starts at (S_OPENED, S_BEFORE, no page fixed). The scan does not fix any page until the first qfile_scan_list_next is called; until then there is no buffer-pool cost for an open-but-unread cursor.
qfile_scan_list_next is a thin wrapper that delegates to qfile_scan_next (the position machine) and then qfile_retrieve_tuple (the materialiser):
```c
// qfile_scan_next — src/query/list_file.c (condensed)
static SCAN_CODE
qfile_scan_next (THREAD_ENTRY * thread_p, QFILE_LIST_SCAN_ID * scan_id_p)
{
  if (scan_id_p->position == S_BEFORE)
    {
      if (scan_id_p->list_id.tuple_cnt == 0)
        return S_END;
      /* fix first page, position cursor at first tuple */
      scan_id_p->curr_pgptr = qmgr_get_old_page (thread_p, &scan_id_p->list_id.first_vpid,
                                                 scan_id_p->list_id.tfile_vfid);
      QFILE_COPY_VPID (&scan_id_p->curr_vpid, &scan_id_p->list_id.first_vpid);
      scan_id_p->curr_offset = QFILE_PAGE_HEADER_SIZE;
      scan_id_p->curr_tpl = (char *) scan_id_p->curr_pgptr + QFILE_PAGE_HEADER_SIZE;
      scan_id_p->curr_tplno = 0;
      scan_id_p->position = S_ON;
      return S_SUCCESS;
    }
  else if (scan_id_p->position == S_ON)
    {
      /* still tuples on this page? */
      if (scan_id_p->curr_tplno < QFILE_GET_TUPLE_COUNT (scan_id_p->curr_pgptr) - 1)
        {
          scan_id_p->curr_offset += QFILE_GET_TUPLE_LENGTH (scan_id_p->curr_tpl);
          scan_id_p->curr_tpl += QFILE_GET_TUPLE_LENGTH (scan_id_p->curr_tpl);
          scan_id_p->curr_tplno++;
          return S_SUCCESS;
        }
      /* page exhausted — follow next_vpid */
      else if (qfile_has_next_page (scan_id_p->curr_pgptr))
        {
          VPID next_vpid;
          QFILE_GET_NEXT_VPID (&next_vpid, scan_id_p->curr_pgptr);
          PAGE_PTR next_page_p = qmgr_get_old_page (thread_p, &next_vpid,
                                                    scan_id_p->list_id.tfile_vfid);
          qmgr_free_old_page_and_init (thread_p, scan_id_p->curr_pgptr,
                                       scan_id_p->list_id.tfile_vfid);
          QFILE_COPY_VPID (&scan_id_p->curr_vpid, &next_vpid);
          scan_id_p->curr_pgptr = next_page_p;
          scan_id_p->curr_tplno = 0;
          scan_id_p->curr_offset = QFILE_PAGE_HEADER_SIZE;
          scan_id_p->curr_tpl = (char *) scan_id_p->curr_pgptr + QFILE_PAGE_HEADER_SIZE;
          return S_SUCCESS;
        }
      else
        {
          scan_id_p->position = S_AFTER;
          if (!scan_id_p->keep_page_on_finish)
            qmgr_free_old_page_and_init (thread_p, scan_id_p->curr_pgptr,
                                         scan_id_p->list_id.tfile_vfid);
          return S_END;
        }
    }
  /* ... S_AFTER → S_END ... */
}
```

The state machine is exactly what one would expect: S_BEFORE → S_ON (fix first page) → S_ON (advance within page or to next page) → S_AFTER / S_END (release page). qfile_scan_list_prev is the symmetric operation, using the per-tuple prev_tuple_length prefix to step backward and prev_pgid in the page header to walk the chain backward.
Retrieve — peek vs copy
Section titled “Retrieve — peek vs copy”// qfile_retrieve_tuple — src/query/list_file.c (condensed)static SCAN_CODEqfile_retrieve_tuple (THREAD_ENTRY * thread_p, QFILE_LIST_SCAN_ID * scan_id_p, QFILE_TUPLE_RECORD * tuple_record_p, int peek){ if (QFILE_GET_OVERFLOW_PAGE_ID (scan_id_p->curr_pgptr) == NULL_PAGEID) { if (peek) tuple_record_p->tpl = scan_id_p->curr_tpl; /* zero-copy */ else memcpy (tuple_record_p->tpl, scan_id_p->curr_tpl, QFILE_GET_TUPLE_LENGTH (scan_id_p->curr_tpl)); } else { /* tuple has overflow pages → splice into scan_id_p->tplrec */ qfile_get_tuple_from_current_list (thread_p, scan_id_p, &scan_id_p->tplrec); tuple_record_p->tpl = scan_id_p->tplrec.tpl; } return S_SUCCESS;}peek = true is the zero-copy fast path used by everything that reads the tuple before advancing — predicate evaluation, projection, sort-key extraction, sub-tuple set-membership tests. The pointer is valid until the next qfile_scan_list_next (which may unlatch the page). peek = false copies into a caller-owned buffer and is used when the caller wants to retain the tuple beyond the next advance — e.g., merge join’s “remember the current row of the inner side while we advance the outer”.
For overflow tuples (QFILE_GET_OVERFLOW_PAGE_ID(curr_pgptr) != NULL_PAGEID), the peek path is impossible — the bytes are spread across multiple pages — so qfile_get_tuple_from_current_list always copies them into the per-scan tplrec buffer first.
Set operations — qfile_combine_two_list
UNION, INTERSECT, DIFFERENCE between two list-files share one driver: qfile_combine_two_list. The high-level shape is “sort-merge”, with per-tuple action callbacks plumbed in by flag:
```mermaid
flowchart TB
  A["qfile_combine_two_list(L, R, flag)"] --> B{"flag == UNION ALL?"}
  B -- "yes" --> U["qfile_union_list (clone L, append R, no sort)"]
  B -- "no" --> S1["sort L by all columns (Q_DISTINCT or Q_ALL)"]
  S1 --> S2["sort R by all columns"]
  S2 --> O["open dest list"]
  O --> M{"choose action callbacks"}
  M -- "INTERSECT" --> M1["act_both = qfile_add_one_tuple<br/>act_left = act_right = NULL"]
  M -- "DIFFERENCE" --> M2["act_left = qfile_add_tuple<br/>act_both = act_right = NULL"]
  M -- "UNION DISTINCT" --> M3["act_left = act_right = qfile_add_tuple<br/>act_both = qfile_add_one_tuple"]
  M -- "UNION ALL (post-sort)" --> M4["act_both = qfile_add_two_tuple<br/>(both copies)"]
  M1 --> L["merge loop: compare top-of-L vs top-of-R<br/>cmp<0 → act_left, cmp==0 → act_both, cmp>0 → act_right"]
  M2 --> L
  M3 --> L
  M4 --> L
  L --> C["qfile_close_list (dest)"]
```
The `qfile_advance_single` vs `qfile_advance_group` choice is what makes DISTINCT cheap: `_group` collapses a run of equal keys into one logical position before each comparison, so duplicate rows in the inputs do not generate duplicate output rows even when both sides contain them.
`qfile_union_list` is the UNION ALL fast path: it neither sorts nor iterates over tuples — it physically appends the second list’s pages to the first by editing the `next_vpid` of the first list’s last page to point at the second list’s first page (`qfile_append_list`).
Sort wrapper — qfile_sort_list / qfile_sort_list_with_func
Sort is not a list-file primitive. It is a function from one list-file to another, implemented as a callback sandwich around `external_sort`:
```c
// qfile_sort_list_with_func — src/query/list_file.c (condensed)
QFILE_LIST_ID *
qfile_sort_list_with_func (THREAD_ENTRY * thread_p, QFILE_LIST_ID * list_id_p,
                           SORT_LIST * sort_list_p, QUERY_OPTIONS option,
                           int flag, SORT_GET_FUNC * get_func,
                           SORT_PUT_FUNC * put_func, SORT_CMP_FUNC * cmp_func,
                           void *extra_arg, int limit, bool do_close,
                           int parallelism, ORDERBY_STATS * orderby_stats)
{
  qfile_close_list (thread_p, list_id_p);	/* unlatch input writer */
  QFILE_LIST_ID *srlist_id =
    qfile_open_list (thread_p, &list_id_p->type_list, sort_list_p,
                     list_id_p->query_id, flag, NULL);	/* output */

  QFILE_LIST_SCAN_ID t_scan_id;
  QFILE_SORT_SCAN_ID s_scan_id;
  SORT_INFO info;
  qfile_open_list_scan (list_id_p, &t_scan_id);
  qfile_initialize_sort_info (&info, list_id_p, sort_list_p);
  info.s_id = &s_scan_id;
  info.output_file = srlist_id;
  info.input_file = list_id_p;

  if (get_func == NULL)
    get_func = &qfile_get_next_sort_item;	/* default reader callback */
  if (put_func == NULL)
    put_func = &qfile_put_next_sort_item;	/* default writer callback */
  if (cmp_func == NULL)
    cmp_func = (info.key_info.use_original
                ? &qfile_compare_partial_sort_record
                : &qfile_compare_all_sort_record);

  int sort_result =
    sort_listfile (thread_p, NULL_VOLID, estimated_pages,
                   get_func, &info, put_func, &info, cmp_func, &info.key_info,
                   (option == Q_DISTINCT ? SORT_ELIM_DUP : SORT_DUP), limit,
                   srlist_id->tfile_vfid->tde_encrypted, parallel_type);

  /* ... destroy input list_id_p, copy srlist_id over list_id_p ... */
  return list_id_p;
}
```

Three callbacks plug `external_sort` into the list-file:
- `get_func` — `external_sort` calls this to read the next input record. Default is `qfile_get_next_sort_item`, which pulls a tuple from the input scan and copies its sort-key columns into a `SORT_REC`.
- `put_func` — `external_sort` calls this to emit a sorted record. Default is `qfile_put_next_sort_item`, which decodes the `SORT_REC` back into a tuple and `qfile_add_tuple_to_list`s it to the output. Aggregation overrides this with `qexec_gby_put_next` (sort-based GROUP BY) so each emitted tuple folds into the running aggregate instead of being written.
- `cmp_func` — `external_sort` calls this to compare two `SORT_REC`s. Default is `qfile_compare_all_sort_record` (full-key comparison) or `qfile_compare_partial_sort_record` (when sort-key columns are a prefix of the original tuple).
```mermaid
flowchart LR
    subgraph CONS["CUBRID list-file"]
        LIN["unsorted list-file<br/>(QFILE_LIST_ID)"]
        LSCAN["QFILE_LIST_SCAN_ID<br/>(input cursor)"]
        LOUT["sorted list-file<br/>(QFILE_LIST_ID)"]
    end
    subgraph EXT["external_sort.c"]
        GETFN["get_func →<br/>qfile_get_next_sort_item"]
        PUTFN["put_func →<br/>qfile_put_next_sort_item"]
        CMPFN["cmp_func →<br/>qfile_compare_∗_sort_record"]
        INPHASE["IN-PHASE:<br/>build sorted runs<br/>in sort buffer"]
        EXPHASE["EX-PHASE:<br/>k-way merge runs<br/>through sort buffer"]
    end
    LIN --> LSCAN
    LSCAN --> GETFN
    GETFN --> INPHASE
    INPHASE --> EXPHASE
    EXPHASE --> CMPFN
    EXPHASE --> PUTFN
    PUTFN --> LOUT
```
external_sort itself spills runs to its own FILE_TEMP files when the sort buffer is full — a separate spill substrate from the list-file’s own membuf-then-FILE_TEMP. The list-file is a transport layer; sort is the algorithm.
Type-list inference and the executor binding
A list-file carries its column types as an array of `TP_DOMAIN *` (`type_list.domp`). The executor builds this from the producing XASL’s `outptr_list`:
```mermaid
flowchart LR
    X["xasl_node<br/>type=BUILDLIST_PROC"] --> O["outptr_list<br/>(REGU_VARIABLE list)"]
    O --> R["each REGU_VARIABLE has<br/>domain (TP_DOMAIN*)"]
    R --> T["pt_node_to_db_domain or<br/>regu_variable_get_domain"]
    T --> TL["QFILE_TUPLE_VALUE_TYPE_LIST<br/>(domp[], type_cnt)"]
    TL --> Q["qfile_open_list(..., &type_list, ...)"]
    Q --> L["new QFILE_LIST_ID<br/>type_list.domp = deep copy"]
```
The deep copy in qfile_open_list is intentional — the source domains may be on the parser-tree arena (which is freed before query execution finishes); the list-file outlives that, so it must own its own copies. qfile_clear_list_id free_and_inits type_list.domp when the list is destroyed.
qfile_unify_types reconciles two list-files’ type lists when one is appended to or combined with another; it walks both domp[] arrays and picks the LUB (least upper bound) domain via tp_domain_resolve_value. This is the path that turns SELECT 'abc' UNION SELECT 1 into a stream whose column has VARCHAR domain and whose integer rows are coerced.
qfile_modify_type_list is a lighter rebuilder used when the executor knows the type list it wants and just needs to override the one auto-derived from outptr_list (e.g., for hash-join build sides where the build columns are a subset).
Lifecycle bound to the surrounding XASL
A list-file is owned by exactly one XASL node for the duration of that XASL’s execution. The owning node is whichever node constructed the file via `qfile_open_list`; the result is stored as `xasl->list_id`. When the executor’s `qexec_clear_xasl_head` runs (during `qexec_clear_xasl` cleanup of an XASL tree), it walks every node and calls:
```c
// qexec_clear_xasl_head excerpt — src/query/query_executor.c
if (xasl->list_id && !xasl->list_id->is_result_cached)
  {
    (void) qfile_close_list (thread_p, xasl->list_id);
    qfile_destroy_list (thread_p, xasl->list_id);
  }
```

Two important implications:
- A consumed sub-block’s list-file dies with the consumer’s XASL. A `BUILDLIST_PROC` sub-XASL’s `list_id` is materialised once during `qexec_open_scan`, the parent reads it via `S_LIST_SCAN`, and the file is destroyed when the parent’s XASL tree is cleared. There is no GC; lifetimes are stack-shaped.
- Result-cached list-files survive across queries. When `is_result_cached` is true, the list-file is owned by the list-file cache (`QFILE_LIST_CACHE_ENTRY`), not by the producing XASL. The XASL tree is torn down but the list-file stays — until the cache decides to evict it (`qfile_clear_list_cache`).
```mermaid
sequenceDiagram
    autonumber
    participant E as Query (executor)
    participant X as Parent XASL (BUILDLIST_PROC)
    participant SX as Sub-XASL (BUILDLIST_PROC)
    participant LF as list-file (sub-XASL.list_id)
    E->>X: qexec_execute_mainblock(parent)
    X->>SX: qexec_execute_mainblock(sub) (e.g., aptr_list)
    SX->>LF: qfile_open_list (TEMP_FILE_MEMBUF_NORMAL)
    loop sub-XASL produces tuples
        SX->>LF: qfile_generate_tuple_into_list (T_NORMAL)
    end
    SX->>LF: qfile_close_list
    X->>LF: scan_open_list_scan (S_LIST_SCAN)
    loop parent reads tuples
        X->>LF: qfile_scan_list_next (peek)
    end
    X->>LF: qfile_close_scan
    E->>X: qexec_clear_xasl (tree cleanup)
    X->>LF: qfile_close_list + qfile_destroy_list (sub-XASL.list_id)
```
Caching path — QFILE_LIST_CACHE_ENTRY
The list-file cache is a per-XASL-cache-entry hash table that memoises the result of an entire query, keyed on its parameter values (`DB_VALUE_ARRAY`). A hit in `qfile_lookup_list_cache_entry` returns a `QFILE_LIST_CACHE_ENTRY` whose `list_id` is a previously built, still-live list-file:
```c
// qfile_lookup_list_cache_entry — src/query/list_file.c (condensed)
QFILE_LIST_CACHE_ENTRY *
qfile_lookup_list_cache_entry (THREAD_ENTRY * thread_p, XASL_CACHE_ENTRY * xasl,
                               const DB_VALUE_ARRAY * params, bool * result_cached)
{
  *result_cached = false;
  if (QFILE_IS_LIST_CACHE_DISABLED)
    return NULL;
  if (qfile_List_cache.n_hts == 0)
    return NULL;

  csect_enter (thread_p, CSECT_QPROC_LIST_CACHE, INF_WAIT);
  if (xasl->list_ht_no < 0)
    xasl->list_ht_no = qcache_get_new_ht_no (thread_p);

  QFILE_LIST_CACHE_ENTRY *lent =
    mht_get (qfile_List_cache.list_hts[xasl->list_ht_no], params);
  qfile_List_cache.lookup_counter++;
  if (lent && !lent->deletion_marker)
    {
      *result_cached = true;
      /* register tran_index, bump ref_count + time_last_used */
    }
  csect_exit (thread_p, CSECT_QPROC_LIST_CACHE);
  return *result_cached ? lent : NULL;
}
```

When a query begins execution and a cache hit is found, the executor skips materialisation — it uses the cached `list_id` directly as the result, and the XASL tree’s own list-files are never built. When a query finishes execution and conditions allow caching (no triggers fired, no autocommit boundary issues, etc.), `qfile_update_list_cache_entry` clones the producing list-file into a new cache entry. Eviction is by the standard combination of TTL + LRU (`time_last_used`, `ref_count`) plus capacity (`PRM_ID_LIST_MAX_QUERY_CACHE_PAGES`), implemented in `qfile_list_cache_cleanup`.
The interaction with the XASL cache is: every list-cache entry is owned by an XASL-cache entry. When the XASL cache evicts a plan, qfile_clear_list_cache walks all of that XASL’s list-cache entries and deletes them. This guarantees a list-cache entry never references a stale plan.
Source Walkthrough
This section names the symbols that any cross-reference into the list-file module should anchor on. Group order matches the producer / consumer / set-ops / sort / cache / lifecycle order used in “CUBRID’s Approach”.
Types and macros (query_list.h)
- `QFILE_LIST_ID` — the tuple-stream identifier.
- `QFILE_LIST_SCAN_ID` — consumer cursor.
- `QFILE_PAGE_HEADER` (32 bytes) and offset macros: `QFILE_TUPLE_COUNT_OFFSET`, `QFILE_PREV_PAGE_ID_OFFSET`, `QFILE_NEXT_PAGE_ID_OFFSET`, `QFILE_LAST_TUPLE_OFFSET`, `QFILE_OVERFLOW_PAGE_ID_OFFSET`, `QFILE_PREV_VOL_ID_OFFSET`, `QFILE_NEXT_VOL_ID_OFFSET`, `QFILE_OVERFLOW_VOL_ID_OFFSET`.
- Tuple-format macros: `QFILE_TUPLE_LENGTH_SIZE` (= 8 — both length prefixes), `QFILE_TUPLE_VALUE_HEADER_SIZE` (= 8), `QFILE_MAX_TUPLE_SIZE_IN_PAGE`, `QFILE_GET_TUPLE_LENGTH`, `QFILE_GET_PREV_TUPLE_LENGTH`, `QFILE_GET_TUPLE_VALUE_FLAG`, `QFILE_GET_TUPLE_VALUE_LENGTH`, `QFILE_GET_TUPLE_VALUE_HEADER_POSITION`.
- Stream-flag enum: `QFILE_FLAG_RESULT_FILE`, `QFILE_FLAG_UNION`, `QFILE_FLAG_INTERSECT`, `QFILE_FLAG_DIFFERENCE`, `QFILE_FLAG_ALL`, `QFILE_FLAG_DISTINCT`, `QFILE_FLAG_USE_KEY_BUFFER`, `QFILE_NOT_USE_MEMBUF`.
- Sort metadata: `SORT_LIST`, `SORT_TYPE`, `SORT_ORDER`, `SORT_NULLS`, `QFILE_SORTED_LIST_ID`, `QFILE_SORT_SCAN_ID`.
- Single-tuple type enum: `QFILE_TUPLE_TYPE` (`T_SINGLE_BOUND_ITEM`, `T_NORMAL`, `T_SORTKEY`, `T_MERGE`).
- Special flag for overflow pages: `QFILE_OVERFLOW_TUPLE_COUNT_FLAG` (= -2).
Open / write / close
- `qfile_initialize` / `qfile_finalize` — module-wide setup of `qfile_sort_list_Freelist` (lock-free freelist of `SORT_LIST` records) and the `qfile_Max_tuple_page_size` constant.
- `qfile_open_list` — constructor; allocates `QFILE_LIST_ID`, picks `TEMP_FILE_MEMBUF_NORMAL` / `_KEY_BUFFER` / result-file substrate, deep-copies the type list, builds an optional sort list for DISTINCT.
- `qfile_close_list` — terminates the `next_vpid` chain and unfixes the live writer page.
- `qfile_reopen_list_as_append_mode` — re-fixes the last page with a write latch; used by the `UNION ALL` fast-path append.
- `qfile_get_first_page` — eagerly allocates the first data page (used when external code needs the first VPID before adding tuples).
- `qfile_destroy_list` — unwinds the temp-file storage; recurses into `dependent_list_id`.
- `qfile_free_list_id` / `QFILE_FREE_AND_INIT_LIST_ID` — frees the `QFILE_LIST_ID` struct and its owned arrays.
- `qfile_clear_list_id` — zeroes the struct and frees `type_list.domp` / `sort_list` / `tpl_descr.f_valp` without touching the underlying temp-file (for transferring ownership).
- `qfile_copy_list_id` / `qfile_clone_list_id` — duplicate the identifier, with `QFILE_DEPENDENT_MODE` controlling whether the chained `dependent_list_id` is followed.
Tuple-write path
- `qfile_add_tuple_to_list` — append a pre-serialised `QFILE_TUPLE`; handles oversize tuples via overflow pages.
- `qfile_generate_tuple_into_list` — append from a `QFILE_TUPLE_DESCRIPTOR`; selects `qfile_save_normal_tuple` / `qfile_save_single_bound_item_tuple` / `qfile_save_sort_key_tuple` / `qfile_save_merge_tuple`.
- `qfile_save_tuple` — entry point dispatching to the four `qfile_save_*` callbacks.
- `qfile_fast_intint_tuple_to_list` / `qfile_fast_intval_tuple_to_list` / `qfile_fast_val_tuple_to_list` — bypass the descriptor for known small shapes.
- `qfile_add_item_to_list` — single-column convenience wrapper.
- `qfile_add_overflow_tuple_to_list` — copies an already-overflowing tuple from one list-file into another, preserving its overflow chain.
- `qfile_add_tuple_get_pos_in_list` — append + return the `QFILE_TUPLE_POSITION` of the just-written tuple (used by hash-join build to record where each key landed).
Page-allocation helpers (static)
- `qfile_allocate_new_page` — fix a fresh page from `qmgr_get_new_page`, link prev/next, copy chain pointers into `QFILE_LIST_ID`.
- `qfile_allocate_new_ovf_page` — same, but for an overflow page.
- `qfile_allocate_new_page_if_need` — branchy gate around the two allocators, called from every tuple-write path.
- `qfile_add_tuple_to_list_id` — bookkeeping after a tuple is written (advance `tuple_cnt`, `last_offset`, `lasttpl_len`).
- `qfile_initialize_page_header` — zero-init a fresh page header.
- `qfile_set_dirty_page` / `qfile_set_dirty_page_and_skip_logging` — mark the page dirty and either send it through the page buffer’s normal flush path (with WAL) or skip logging for `FILE_TEMP` pages (which are not recovered).
Scan path
- `qfile_open_list_scan` — initialize a `QFILE_LIST_SCAN_ID`, deep-copy `list_id`.
- `qfile_scan_list_next` — wrapper calling `qfile_scan_list (qfile_scan_next, ...)`.
- `qfile_scan_list_prev` — wrapper calling `qfile_scan_list (qfile_scan_prev, ...)`.
- `qfile_close_scan` — release the live page latch, free the per-scan `tplrec`, deep-clear the scan-local `list_id`.
- `qfile_save_current_scan_tuple_position` / `qfile_jump_scan_tuple_position` — bookmark and seek; used by hash-join build to record per-key row positions and by probe to seek to them.
- `qfile_start_scan_fix` / `qfile_end_scan_fix` — pin/unpin the current page across a sub-call that should not lose the cursor (e.g., the executor briefly swaps to a sub-XASL).
- `qfile_get_tuple` — read a single tuple by `QFILE_TUPLE_POSITION` (used to materialise an overflow tuple’s full bytes).
- `qfile_overwrite_tuple` — in-place rewrite of a tuple’s bytes (UPDATE-style modification of a list-file used as a working buffer; rare).
- `qfile_set_tuple_column_value` — narrower in-place column update.
Static state-machine helpers
- `qfile_scan_next` / `qfile_scan_prev` — the `S_BEFORE` / `S_ON` / `S_AFTER` position machine.
- `qfile_retrieve_tuple` — peek-vs-copy materialisation after positioning.
- `qfile_get_tuple_from_current_list` — splice an overflow tuple’s bytes from its overflow-page chain into the scan’s `tplrec`.
Set operations and append
- `qfile_combine_two_list` — UNION / INTERSECT / DIFFERENCE driver; sorts the inputs (unless the UNION ALL fast path applies), opens the destination, runs the sort-merge with action callbacks.
- `qfile_union_list` (static, called by `qfile_combine_two_list` for UNION ALL) — physical page-chain splice via `qfile_clone_list_id` + `qfile_reopen_list_as_append_mode` + `qfile_copy_tuple` (for the tail).
- `qfile_append_list` — physical page-chain extension (no tuple iteration).
- `qfile_connect_list` — link a child `dependent_list_id` onto a parent so destruction cascades.
- `qfile_truncate_list` — drop all tuples but keep the `QFILE_LIST_ID` open for re-use.
- `qfile_advance_single` / `qfile_advance_group` (static) — advance by one tuple or one group within the merge.
- `qfile_add_one_tuple` / `qfile_add_two_tuple` (static) — action callbacks plumbed into the merge by `qfile_combine_two_list`.
- `qfile_compare_tuple_helper` (static) — full-row comparison for the merge.
- `qfile_unify_types` — reconcile column domains across the two sides.
Sort wrapper
- `qfile_sort_list` — narrow wrapper choosing flags from `QUERY_OPTIONS` (Q_DISTINCT vs Q_ALL).
- `qfile_sort_list_with_func` — full-power entry for callers that supply their own `get_func` / `put_func` / `cmp_func` (the sort-based GROUP BY in `query_executor.c` does this; see `cubrid-post-processing.md`).
- `qfile_make_sort_key` / `qfile_generate_sort_tuple` — pack/unpack a tuple into/out of `SORT_REC` (the in-buffer record passed to `external_sort`).
- `qfile_compare_partial_sort_record` / `qfile_compare_all_sort_record` — default `cmp_func`s.
- `qfile_initialize_sort_key_info` / `qfile_clear_sort_key_info` — build/free a `SORTKEY_INFO` from a `SORT_LIST`.
- `qfile_initialize_sort_info` / `qfile_clear_sort_info` (static) — build/free the `SORT_INFO` plumbing struct.
- `qfile_get_next_sort_item` / `qfile_put_next_sort_item` (static) — default callbacks plugged into `external_sort`.
- `qfile_get_estimated_pages_for_sorting` — fed to `external_sort` so it knows how many runs to expect.
List-file (query result) cache
- `qfile_initialize_list_cache` / `qfile_finalize_list_cache` — module setup of `qfile_List_cache` (the global `QFILE_LIST_CACHE` instance plus its hash-table pool).
- `qfile_lookup_list_cache_entry` — hit-or-miss probe by `(xasl_cache_entry, params)`.
- `qfile_update_list_cache_entry` — insert (called when a query that opted into caching completes).
- `qfile_clear_list_cache` — invalidate every list-cache entry attached to a given XASL-cache entry.
- `qfile_end_use_of_list_cache_entry` — drop a transaction’s reference; the entry becomes eviction-eligible when `ref_count` → 0.
- `qfile_list_cache_cleanup` (static) — LRU + TTL eviction driven by `PRM_ID_LIST_MAX_QUERY_CACHE_PAGES`.
- `qfile_clear_uncommited_list_cache_entry` — purge cache entries from a transaction that aborted (their results are not visible to anyone else).
- `qfile_allocate_list_cache_entry` / `qfile_free_list_cache_entry` — pool-backed allocation (`qfile_List_cache_entry_pool`).
- `qfile_print_list_cache_entry` / `qfile_dump_list_cache_internal` — diagnostics.
- `qfile_hash_db_value_array` / `qfile_compare_equal_db_value_array` — hash-table key functions.
Storage-side substrate
- `QMGR_TEMP_FILE` (`query_manager.h`) — wraps `(membuf[], membuf_last, membuf_npages, temp_vfid, temp_file_type, tde_encrypted)`.
- `qmgr_create_new_temp_file` (`query_manager.c`) — build a `QMGR_TEMP_FILE` with a freshly carved membuf.
- `qmgr_create_result_file` — same, but flagged so the temp file is preserved (survives the producing query for the client to fetch).
- `qmgr_get_new_page` (`query_manager.c`) — membuf-vs-`FILE_TEMP` arbitration on append.
- `qmgr_get_old_page` / `qmgr_get_old_page_read_only` — membuf-vs-`FILE_TEMP` arbitration on read.
- `qmgr_free_old_page_and_init` — release a fixed page (no-op for membuf pages, `pgbuf_unfix` for FILE_TEMP).
- `qmgr_free_list_temp_file` — destroy a `QMGR_TEMP_FILE` (called from `qfile_destroy_list`).
- `qmgr_get_external_file_page` (static) — `file_alloc` against the `temp_vfid`; sets the `PAGE_QRESULT` ptype.
- `file_create_temp` (`file_manager.c`) — allocate a `FILE_TEMP` VFID; this is what backs the spill.
- `file_temp_retire` — return a `FILE_TEMP` to the tempcache.
- `xqfile_get_list_file_page` — server-to-client page transfer for query results.
Cross-references in query_executor.c
- `qexec_end_one_iteration` — calls `qfile_generate_tuple_into_list (T_NORMAL)` for each row that survives outer-block predicates.
- `qexec_clear_xasl` / `qexec_clear_xasl_head` — call `qfile_close_list` + `qfile_destroy_list` on every XASL’s `list_id` (unless `is_result_cached`).
- `qexec_orderby_distinct` — calls `qfile_sort_list`.
- `qexec_groupby` / `qexec_hash_gby_finalize` — call `qfile_sort_list_with_func` with `qexec_gby_put_next` (sort-based) or build a hash via `qexec_hash_gby_agg_tuple` (hash-based; the resulting list-file is the spill from the hash’s overflow tuples).
- `qexec_execute_analytic` — drives `qfile_sort_list` over the analytic input plus several auxiliary list-files (`group_list_id`, `value_list_id` for window-function state).
- `xqmgr_execute_query` / `qmgr_process_query` — the entry point that owns the top-level `QFILE_LIST_ID` and registers it with the query manager.
Cross-references in scan_manager.c
- `scan_open_list_scan` — opens a `QFILE_LIST_SCAN_ID` and wraps it in a `LLIST_SCAN_ID` plus a `SCAN_ID` of type `S_LIST_SCAN`. This is what makes a list-file appear as a scan operator to the rest of the executor.
- `scan_next_scan_local` (case `S_LIST_SCAN`) → `scan_next_list_scan` — pulls one tuple via `qfile_scan_list_next`, then evaluates `regu_list_pred` / `regu_list_rest` against it.
- `scan_init_scan_id` / `scan_init_scan_pred` — generic scan-id init shared across all scan types.
Position hints (as of 2026-05-01, on the worktree at /data/hgryoo/references/cubrid)
| Symbol | File | Line |
|---|---|---|
qfile_page_header (struct) | src/query/query_list.h | 66 |
qfile_list_id (struct) | src/query/query_list.h | 425 |
qfile_list_scan_id (struct) | src/query/query_list.h | 498 |
qfile_initialize | src/query/list_file.c | 1156 |
qfile_finalize | src/query/list_file.c | 1178 |
qfile_open_list | src/query/list_file.c | 1203 |
qfile_close_list | src/query/list_file.c | 1352 |
qfile_reopen_list_as_append_mode | src/query/list_file.c | 1373 |
qfile_allocate_new_page | src/query/list_file.c | 1465 |
qfile_allocate_new_ovf_page | src/query/list_file.c | 1526 |
qfile_allocate_new_page_if_need | src/query/list_file.c | 1561 |
qfile_add_tuple_to_list_id | src/query/list_file.c | 1592 |
qfile_add_tuple_to_list | src/query/list_file.c | 1610 |
qfile_save_single_bound_item_tuple | src/query/list_file.c | 1672 |
qfile_save_normal_tuple | src/query/list_file.c | 1695 |
qfile_save_sort_key_tuple | src/query/list_file.c | 1715 |
qfile_save_merge_tuple | src/query/list_file.c | 1751 |
qfile_save_tuple | src/query/list_file.c | 1804 |
qfile_generate_tuple_into_list | src/query/list_file.c | 1848 |
qfile_fast_intint_tuple_to_list | src/query/list_file.c | 1908 |
qfile_fast_intval_tuple_to_list | src/query/list_file.c | 1968 |
qfile_fast_val_tuple_to_list | src/query/list_file.c | 2055 |
qfile_add_overflow_tuple_to_list | src/query/list_file.c | 2138 |
qfile_get_first_page | src/query/list_file.c | 2218 |
qfile_destroy_list | src/query/list_file.c | 2269 |
xqfile_get_list_file_page | src/query/list_file.c | 2312 |
qfile_add_item_to_list | src/query/list_file.c | 2442 |
qfile_combine_two_list | src/query/list_file.c | 2688 |
qfile_append_list | src/query/list_file.c | 2953 |
qfile_connect_list | src/query/list_file.c | 3130 |
qfile_truncate_list | src/query/list_file.c | 3210 |
qfile_union_list (static) | src/query/list_file.c | 3359 |
qfile_make_sort_key | src/query/list_file.c | 3507 |
qfile_generate_sort_tuple | src/query/list_file.c | 3643 |
qfile_get_next_sort_item | src/query/list_file.c | 3729 |
qfile_put_next_sort_item | src/query/list_file.c | 3765 |
qfile_compare_partial_sort_record | src/query/list_file.c | 3925 |
qfile_compare_all_sort_record | src/query/list_file.c | 4003 |
qfile_initialize_sort_key_info | src/query/list_file.c | 4145 |
qfile_clear_sort_key_info | src/query/list_file.c | 4243 |
qfile_initialize_sort_info | src/query/list_file.c | 4266 |
qfile_clear_sort_info (static) | src/query/list_file.c | 4290 |
qfile_sort_list_with_func | src/query/list_file.c | 4331 |
qfile_sort_list | src/query/list_file.c | 4482 |
qfile_copy_list_pages (static) | src/query/list_file.c | 4509 |
qfile_duplicate_list | src/query/list_file.c | 4626 |
qfile_get_tuple | src/query/list_file.c | 4663 |
qfile_get_tuple_from_current_list (static) | src/query/list_file.c | 4738 |
qfile_scan_next (static) | src/query/list_file.c | 4762 |
qfile_scan_prev (static) | src/query/list_file.c | 4865 |
qfile_save_current_scan_tuple_position | src/query/list_file.c | 4959 |
qfile_retrieve_tuple (static) | src/query/list_file.c | 4971 |
qfile_jump_scan_tuple_position | src/query/list_file.c | 5033 |
qfile_start_scan_fix | src/query/list_file.c | 5124 |
qfile_open_list_scan | src/query/list_file.c | 5162 |
qfile_scan_list (static) | src/query/list_file.c | 5206 |
qfile_scan_list_next | src/query/list_file.c | 5229 |
qfile_scan_list_prev | src/query/list_file.c | 5243 |
qfile_end_scan_fix | src/query/list_file.c | 5257 |
qfile_close_scan | src/query/list_file.c | 5279 |
qfile_initialize_list_cache | src/query/list_file.c | 5400 |
qfile_finalize_list_cache | src/query/list_file.c | 5521 |
qfile_clear_list_cache | src/query/list_file.c | 5570 |
qfile_lookup_list_cache_entry | src/query/list_file.c | 6035 |
qfile_update_list_cache_entry | src/query/list_file.c | 6196 |
qfile_end_use_of_list_cache_entry | src/query/list_file.c | 6419 |
qfile_add_tuple_get_pos_in_list | src/query/list_file.c | 6559 |
qfile_has_next_page | src/query/list_file.c | 6636 |
qfile_update_domains_on_type_list | src/query/list_file.c | 6648 |
qfile_set_tuple_column_value | src/query/list_file.c | 6740 |
qfile_overwrite_tuple | src/query/list_file.c | 6868 |
qfile_update_qlist_count | src/query/list_file.c | 7041 |
qfile_collect_list_sector_info | src/query/list_file.c | 7085 |
qfile_free_list_sector_info | src/query/list_file.c | 7181 |
qmgr_create_new_temp_file | src/query/query_manager.c | 2954 |
qmgr_get_new_page | src/query/query_manager.c | 2822 |
qmgr_get_external_file_page (static) | src/query/query_manager.c | 2912 |
qmgr_init_external_file_page (static) | src/query/query_manager.c | 2890 |
scan_open_list_scan | src/query/scan_manager.c | 3715 |
Cross-check Notes
- Versus `cubrid-query-executor.md` — that doc names the list-file as the materialised output of every BUILDLIST_PROC and asserts `qexec_end_one_iteration` writes via `qfile_generate_tuple_into_list (T_NORMAL)`. Verified: `query_executor.c` near line 1262 in this worktree. The fast-path branch through `qexec_hash_gby_agg_tuple` (when `g_hash_eligible` is set) calls into `qfile_generate_tuple_into_list` only on the spill path; the in-memory hash table holds tuples in an `AGGREGATE_HASH_CONTEXT`, not in a list-file. The spill list-file is `agg_hash_context->part_list_id` — a list-file used as a hash-spill substrate, not as a streaming pipe.
- Versus `cubrid-post-processing.md` — that doc names `qfile_sort_list` as the entry to sort-based GROUP BY and ORDER BY, and `qfile_combine_two_list` as the UNION/INTERSECT/DIFFERENCE driver. Both verified. The post-processing doc also names `external_sort.c::sort_listfile` as the actual sort kernel — the list-file is the input/output substrate; the sort is external_sort’s job. The boundary is the `sort_listfile` callback contract (`get_func` / `put_func` / `cmp_func`); see `qfile_sort_list_with_func` in “CUBRID’s Approach”.
- Versus `cubrid-page-buffer-manager.md` — list-file pages are normal database pages once they spill to `FILE_TEMP`. They go through `pgbuf_fix` / `pgbuf_unfix` like any other page, with the `PAGE_QRESULT` ptype. This means a list-file’s spill pages compete for slots in the page buffer pool with heap and B-tree pages, and they participate in the same LRU. Membuf pages, by contrast, are not in the page buffer pool at all — they are raw `malloc`’d memory from `qmgr_allocate_tempfile_with_buffer`. The two-tier design means a small list-file is essentially free (no pgbuf slot occupied), and a large one becomes a page-buffer citizen only after it spills.
- MVCC interaction. List-files are query-private; no other transaction ever sees them. The visibility check for tuples flowing into a list-file happens at the producing scan (`heap_get_visible_version` and friends in `scan_next_heap_scan`). Once written into the list-file, tuples are committed from the perspective of this query — there is no per-list-file MVCC. A query whose outer scan ran early in a transaction and whose materialised list-file is read late in the same transaction will therefore see the same tuples both times, even if the underlying heap has been updated in between. That is the desired snapshot-stable behaviour for a single statement; for `HOLDABLE` cursors the result list-file is the only thing that survives a commit, so it is implicitly snapshot-stable across the commit boundary.
- WAL. List-file pages on `FILE_TEMP` use `qfile_set_dirty_page` (or, for index-key buffer mode, `qfile_set_dirty_page_and_skip_logging`). Because `FILE_TEMP` files are not recovered, log records for these pages are unnecessary and CUBRID skips logging them where it can. After a crash, the tempcache is wiped and any in-flight queries are simply restarted from the client. This is the same trade-off PostgreSQL makes for `BufFile` and SQL Server makes for TempDB minimal logging — temp tuple-streams pay no WAL cost, but they cost a re-execute on crash.
- Naming hazard. The list-file cache (`QFILE_LIST_CACHE_ENTRY`) is not the XASL cache (`XASL_CACHE_ENTRY`); the two are linked but distinct. The XASL cache memoises plans; the list-file cache memoises the results of those plans. Result-cache hits skip execution; plan-cache hits skip parse-and-optimise. A result-cache entry holds a back-pointer to the XASL-cache entry (`xcache_entry`) so XASL eviction cascades into list-cache eviction (`qfile_clear_list_cache`).
Open Questions
- Parallel writers. `qfile_open_list` produces a single live writer cursor (`last_pgptr`). The parallel-heap-scan path (`scan_open_parallel_heap_scan`) appears to fan tuples out to multiple producer threads — verify whether each parallel worker writes into its own private list-file with a coordinator merging them, or whether the writer cursor is mutex-protected. Searching `query_executor.c` for `parallelism` reveals an `ORDERBY_STATS` parallel argument threaded through `qfile_sort_list_with_func`, suggesting parallel sort is supported, but the parallel-write path inside the list-file itself is not obvious.
- Compressed tuple format. Every tuple is stored uncompressed; for wide list-files (e.g., a sub-query selecting many columns) this is wasteful. Whether CUBRID has plans for or experiments with row-compressed list-files is unclear; no `compress` symbols appear in `list_file.c`.
- Columnar variant. The packed-row format is fine for OLTP-shaped queries but pathological for wide-aggregation analytics. CUBRID has no columnar list-file. PostgreSQL has experimental “columnar” extensions but not in core; MySQL/MariaDB ColumnStore is a separate engine. A columnar list-file (a `QFILE_LIST_ID` with one membuf chain per column) would be an interesting refactor with bounded blast radius, but there is no evidence of work in this direction.
- Membuf size policy. `PRM_ID_TEMP_MEM_BUFFER_PAGES` defaults to a few pages; in practice many sub-query results are small and never spill. Whether the default is right for current workloads — and whether per-query adaptive sizing would pay off — would need profiling. The split between `TEMP_FILE_MEMBUF_NORMAL` and `TEMP_FILE_MEMBUF_KEY_BUFFER` shows the engine already differentiates by use-case, so adding a `TEMP_FILE_MEMBUF_LARGE` for known-big producers (CONNECT BY, large hash builds) is a small change away.
- `is_result_cached` ownership boundary. When `xasl->list_id->is_result_cached` is true, the executor skips destroying the list-file and the cache takes ownership. The control flow that flips this bit is spread across `qexec_execute_query`, `qmgr_process_query`, and `xcache_can_entry_cache_list`. A single ownership-transfer function would help, but the current scattering is consistent with CUBRID’s “C error model” — every owner returns success/failure and the caller decides. Worth a refactor only if a real bug is found.
Sources
- `src/query/list_file.c` (~7200 lines) — the implementation, including page allocation (`qfile_allocate_new_page`), tuple writing (`qfile_save_normal_tuple` and friends), the scan state machine (`qfile_scan_next`), set operations (`qfile_combine_two_list`), the sort wrapper (`qfile_sort_list_with_func`), and the result cache (`qfile_lookup_list_cache_entry` etc.).
- `src/query/list_file.h` — public API.
- `src/query/query_list.h` — `QFILE_LIST_ID`, `QFILE_LIST_SCAN_ID`, `QFILE_PAGE_HEADER`, tuple-format macros, sort-list types.
- `src/query/query_manager.h` — `QMGR_TEMP_FILE` (the storage handle).
- `src/query/query_manager.c` — `qmgr_create_new_temp_file`, `qmgr_get_new_page` / `qmgr_get_old_page`, the membuf-vs-`FILE_TEMP` arbitration.
- `src/query/scan_manager.c` — `scan_open_list_scan` (the wrapping that turns a list-file into an `S_LIST_SCAN` scan operator).
- `src/query/query_executor.c` — call sites: `qexec_end_one_iteration` (writer), `qexec_clear_xasl` (lifecycle), `qexec_groupby` / `qexec_orderby_distinct` / `qexec_execute_analytic` (sort callers), `qexec_clear_list_cache_by_class` (cache invalidation).
- `src/storage/file_manager.c` — `file_create_temp`, `file_alloc`, `file_temp_retire`; the `FILE_TEMP` substrate underneath the spill.
Textbook references:
- Goetz Graefe, Volcano — An Extensible and Parallel Query Evaluation System (TKDE 1994). The iterator model and the open / next / close contract that every list-file consumer follows.
- Goetz Graefe, Query Evaluation Techniques for Large Databases (Computing Surveys 1993). Tuple buffers, materialisation, sort-merge for set operations.
- Alex Petrov, Database Internals (O’Reilly 2019). Chapter 1 (file organisation, slotted page, tuple length prefixes) and Chapter 14 (sort, merge, group-by). Captured locally at `knowledge/research/dbms-general/database-internals.md`.
- Joseph M. Hellerstein, Michael Stonebraker, James Hamilton, Architecture of a Database System (Foundations and Trends in Databases 2007). The “inter-operator pipe” terminology and the operator-tree implementation model.
- Abraham Silberschatz, Henry Korth, S. Sudarshan, Database System Concepts (6th ed.). Chapter 12 (“Query Processing”) and Chapter 13 (“Query Optimization”) describe the materialisation-vs-pipelining trade-off textbook-style. Captured locally at `knowledge/research/dbms-general/database-system-concepts.md`.