CUBRID External Sort — Run Generation, Multi-Way Merge, and the Sort Substrate
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
Theoretical Background
External sorting is what a database engine does when the input does not fit in RAM. The Art of Computer Programming, vol. 3 (Knuth, 2nd ed.) devotes a long stretch of chapter 5 (§5.4, external sorting) to this problem; Graefe’s two surveys — Query Evaluation Techniques (ACM CSUR, 1993) and Implementing Sorting in Database Systems (ACM CSUR, 2006) — translate Knuth into the engineering vocabulary of relational systems. Database Internals (Petrov, 2019) summarises the same model. The shared model has two phases.
Phase 1 — run generation. Read the input until the in-memory sort buffer is full, sort that chunk, and write it to a temporary file as a run. Repeat until the input is exhausted. The trade-off is between simplicity and run length:
- Quicksort / mergesort over a buffer of size M produces runs of length M each. Number of runs ≈ N/M.
- Replacement selection (Knuth §5.4.1, “Algorithm R”) maintains a heap of size M and outputs the minimum that is still ≥ the last emitted key. New input either continues the current run (when ≥ last emitted) or is held back into the next run. On random input this produces runs of expected length 2M; on partially sorted input it can produce a single run of any length. The textbook trade-off: replacement selection halves the merge fan-in needed but pays for a heap-update per record rather than a one-shot quicksort.
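To make the replacement-selection trade-off concrete, here is a minimal standalone sketch of the idea — generic C, not CUBRID code. For brevity the M-slot workspace is scanned linearly where a real implementation keeps a min-heap keyed on (run number, key):

```c
/* Replacement selection (the idea behind Knuth's Algorithm R), standalone sketch. */
#include <stdio.h>

#define M 4   /* workspace size, i.e. the sort buffer in records */

static void emit (long key, int run) { printf ("run %d: %ld\n", run, key); }

static void replacement_selection (const long *input, int n)
{
  long slot[M];
  int slot_run[M];              /* run each workspace slot belongs to */
  int i, next = 0, cur_run = 0, live = 0;

  for (i = 0; i < M && next < n; i++, next++)   /* prime the workspace */
    { slot[i] = input[next]; slot_run[i] = 0; live++; }

  while (live > 0)
    {
      int best = -1;            /* smallest key still eligible for the current run */
      for (i = 0; i < live; i++)
        if (slot_run[i] == cur_run && (best < 0 || slot[i] < slot[best]))
          best = i;
      if (best < 0)             /* nothing left for this run: start the next one */
        { cur_run++; continue; }

      emit (slot[best], cur_run);

      if (next < n)
        {
          /* Refill the slot.  A key smaller than the one just emitted cannot
           * extend the current run, so it is held back for the next run. */
          slot_run[best] = (input[next] < slot[best]) ? cur_run + 1 : cur_run;
          slot[best] = input[next++];
        }
      else                      /* end of input: shrink the workspace */
        {
          live--;
          slot[best] = slot[live];
          slot_run[best] = slot_run[live];
        }
    }
}

int main (void)
{
  long input[] = { 8, 3, 9, 4, 1, 7, 2, 6, 5 };
  replacement_selection (input, 9);
  return 0;
}
```

On the random input above the first run comes out longer than the workspace, which is exactly the ~2M effect the theory predicts.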
A further variant, seen in a few research engines, is natural
mergesort with run-detection —
walk the input, detect already-monotonic ranges, and treat each
as an existing run; sort only the boundaries. Output run length
is at least M (one buffer-full), often much more on
near-sorted input. CUBRID’s run generator is in this family
(Knuth’s “natural mergesort” plus an explicit reverse-flip for
descending stretches; see sort_run_find and sort_run_sort
below).
Phase 2 — multi-way merge. Given R runs each of length L, merge them into one sorted output. The fan-in k is bounded by the number of buffers available — one read buffer per input run plus one write buffer. With k-way merge the number of passes is ⌈log_k(R)⌉ and the total I/O is roughly 2N·⌈log_k(R)⌉ pages (read+write per pass). The data structures used at each merge step:
- Heap (winner tree / loser tree). A balanced binary tree whose leaves are the heads of the k input runs. The root is the current minimum across all runs. Each emit replaces the winner’s leaf with the next element from that run and sifts the path to the root, in O(log k) per element. This is the selection tree of Knuth §5.4.1.
- Sorted linked list. For small k (≤ 8 or so) a simple insertion-sorted singly linked list of run pointers is faster than a heap because the cache footprint is one cache line and the most common case (winner is the same run as last time) is O(1).
- Polyphase merge (Knuth §5.4.2, Gilstad 1960) — when the
number of temporary tapes/files is fixed and small, polyphase
arranges runs across files according to a Fibonacci-like
distribution so that no single file ever holds the bottleneck
run. CUBRID does not use polyphase; it uses balanced merge
with a fixed cap of
SORT_MAX_HALF_FILES = 4 input files plus 4 output files (8 total).
Deduplication during sort. When the same key appears
multiple times — DISTINCT, unique-index loading — duplicates can
be eliminated at every merge step rather than only at the
final output. This shrinks intermediate runs, which shrinks I/O
in subsequent passes. The cost is negligible: the per-emit equality
check piggybacks on the comparison the merge already performs for
ordering. CUBRID exposes
this through the SORT_DUP_OPTION enum, { SORT_DUP, SORT_ELIM_DUP },
and dispatches to a separate sort_exphase_merge_elim_dup merge
loop on the dedup path.
The I/O cost model. With sort buffer size M (in pages),
total input N pages, and merge fan-in k = M − 1 (one buffer per
input run + one write buffer), the number of passes is
⌈log_k(N/M)⌉ + 1 (one for run generation, the rest for merging).
Total I/O is roughly 2N·passes. The model says: making M bigger
helps multiplicatively (fewer runs, higher fan-in, fewer
passes); making k bigger helps logarithmically. Real engines
expose M as a sysparam (work_mem, sort_buffer_size,
sr_nbuffers) precisely because the multiplicative effect is
worth tuning per workload.
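A back-of-the-envelope rendering of the same formulas, with illustrative page counts (not taken from any real workload) and assuming quicksort-style runs of length M:

```c
/* Cost-model illustration: runs, merge passes, and total page I/O for a
 * balanced k-way merge with fan-in k = M - 1. */
#include <math.h>
#include <stdio.h>

int main (void)
{
  double N = 1000000.0;                      /* input size in pages      */
  double M = 1024.0;                         /* sort buffer in pages     */
  double k = M - 1.0;                        /* merge fan-in             */

  double runs = ceil (N / M);                /* phase-1 runs of length M */
  double merge_passes = ceil (log (runs) / log (k));
  double total_passes = merge_passes + 1.0;  /* +1 for run generation    */
  double total_io = 2.0 * N * total_passes;  /* read + write per pass    */

  printf ("runs=%.0f merge_passes=%.0f total I/O=%.0f pages\n",
          runs, merge_passes, total_io);
  return 0;
}
```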
Common DBMS Design
Engines diverge on five axes: (1) run-generation algorithm (quicksort vs. replacement-selection vs. natural mergesort), (2) merge-tree data structure (heap vs. sorted list vs. tree of losers), (3) memory model (fixed buffer vs. spillable hash), (4) parallelism (single-threaded vs. partitioned worker pool), (5) integration with B+Tree bulk load and other in-engine sort clients.
PostgreSQL — tuplesort.c
PostgreSQL’s external sort is implemented in
src/backend/utils/sort/tuplesort.c. It long followed Knuth’s
algorithm 5.4.2D — polyphase merge fed by replacement-selection
run generation. Run generation kept a binary heap of
work_mem-worth of tuples and emitted them in key order; runs were
written to “logical tapes” (logtape.c) that share a physical temp
file, and merge fan-in was bounded by the memory available for
per-tape buffers. Replacement selection was later dropped in
favour of a quicksort-and-spill run generator (around v10/v11),
and v15 replaced the polyphase merge with a balanced k-way merge —
all still in tuplesort.c; the merge step itself still uses a heap. The
client interface (tuplesort_begin_heap, tuplesort_putdatum,
tuplesort_performsort, tuplesort_getdatum) is a state
machine with explicit “begin / put / sort / get / end” phases
rather than a pair of get/put callbacks.
MySQL — filesort
MySQL’s filesort (in sql/filesort.cc) is a server-layer facility shared by all storage engines; it uses a
quicksort-based run generator (no replacement selection),
optionally with radix sort for short fixed-width keys.
Merging is k-way with a heap; k is bounded by
sort_buffer_size. MySQL’s filesort has a special
“in-memory only” fast path that skips temp files entirely when
the result fits in sort_buffer_size; the disk path is only
exercised on overflow.
SQL Server / Oracle
SQL Server’s sort iterator is a black box, but the documentation
describes “exchange and merge” with parallel sort across worker
threads when the optimizer estimates large input. Oracle’s
sort uses workareas managed by the Automatic Memory Management
infrastructure (pga_aggregate_target); when a sort exceeds its
quota, it spills to TEMPFILE.
Theory ↔ CUBRID mapping
The five axes above map onto CUBRID symbols as follows:
| Theoretical concept | CUBRID name |
|---|---|
| Top-level sort entry | sort_listfile (external_sort.c) |
| Per-call sort state | SORT_PARAM (external_sort.c:134) |
| Sort buffer (M pages) | SORT_PARAM::internal_memory, sized by PRM_ID_SR_NBUFFERS |
| Run-generation phase (phase 1) | sort_inphase_sort |
| In-memory sort algorithm | sort_run_sort → sort_run_find + sort_run_merge (natural mergesort) |
| Run-detection (Knuth “natural mergesort”) | sort_run_find (ascending or descending stretch detection + flip) |
| Run flush | sort_run_flush writes a slotted-page run into a FILE_TEMP file |
| Merge phase (phase 2) | sort_exphase_merge / sort_exphase_merge_elim_dup |
| Merge tournament data structure | SORT_REC_LIST sr_list[SORT_MAX_HALF_FILES] — sorted linked list (k ≤ 4) |
| Merge fan-in cap | SORT_MAX_HALF_FILES = 4 (8 total temp files = 4 input + 4 output) |
| Per-merge buffer split | sort_find_inbuf_size — half buffers to output, rest divided across input |
| Backing temp file | FILE_TEMP allocation via file_create_temp / sort_add_new_file |
| Long-record (multi-page) overflow | sort_param->multipage_file + overflow_insert / sort_retrieve_longrec |
| Dedup-during-merge | sort_exphase_merge_elim_dup, gated by SORT_DUP_OPTION::SORT_ELIM_DUP |
| Comparator dispatch | SORT_CMP_FUNC callback (caller-supplied) |
| Input/output sandwich | SORT_GET_FUNC / SORT_PUT_FUNC callbacks |
| Caller-tag for parallel routing | SORT_PARALLEL_TYPE enum (external_sort.h:59) |
| Parallel sort coordinator | sort_start_parallelism / sort_end_parallelism |
| Parallel-merge tree fan-in | SORT_PX_MERGE_FILES (multi-level merge of N parallel results) |
| ORDER BY caller | qfile_sort_list → qfile_sort_list_with_func → sort_listfile |
| Sort-based GROUP BY caller | qexec_groupby → sort_listfile (with qexec_gby_put_next) |
| Hash-spill GROUP BY caller | qexec_groupby → sort_listfile (with qexec_hash_gby_put_next) |
| B+Tree bulk-load caller | btree_index_sort → sort_listfile (with btree_construct_leafs) |
| Parallel index-leaf build | SORT_INDEX_LEAF parallel type, btree_sort_get_next_parallel |
CUBRID’s Approach
CUBRID’s external sort is a single C file —
src/storage/external_sort.c, ~5,850 lines — exposing exactly
two public symbols: sort_listfile (the entry point) and
btree_sort_get_next_parallel (the parallel B+Tree input
adapter). Everything else is static. The file is laid out
as five concentric layers: (a) slotted-page primitives for
temp-file pages, (b) in-memory run-find / run-merge primitives,
(c) the run-generation phase (sort_inphase_sort), (d) the
merge phase (sort_exphase_merge and its dedup twin), and (e)
the parallel coordination layer that wraps phases c and d
across SORT_PARALLEL_TYPE workers.
Top-level flow — sort_listfile
sort_listfile is the contract every caller writes against. Its
signature, declared in external_sort.h, is:
```c
// sort_listfile — src/storage/external_sort.h
extern int sort_listfile (THREAD_ENTRY * thread_p, INT16 volid, int est_inp_pg_cnt,
                          SORT_GET_FUNC * get_fn, void *get_arg,
                          SORT_PUT_FUNC * put_fn, void *put_arg,
                          SORT_CMP_FUNC * cmp_fn, void *cmp_arg,
                          SORT_DUP_OPTION option, int limit,
                          bool includes_tde_class, SORT_PARALLEL_TYPE sort_parallel_type);
```

The caller supplies three function pointers — get_fn delivers
the next input record, put_fn receives sorted records as they
emerge, cmp_fn defines order — and three opaque argument
pointers passed back to those callbacks. Whatever the caller’s
data model is (list-file tuples for ORDER BY, heap-record OIDs
for B+Tree load, hash-spill entries for GROUP BY) is invisible
to the sort module; from inside external_sort.c every record
is just RECDES { data, length, type } plus an opaque
comparator.
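To make the callback sandwich concrete, here is a schematic client in plain C with generic record and callback types — deliberately not the CUBRID typedefs or signatures, just the get/put/cmp shape they express, with a toy in-memory core standing in for the real two-phase engine:

```c
/* Schematic of the get/put/cmp "sandwich"; all names here are illustrative. */
#include <stdio.h>
#include <string.h>

typedef struct { char *data; int length; } RECORD;

typedef int  (*GET_FN) (RECORD *out, void *arg);        /* 1 = got a record, 0 = end of input */
typedef void (*PUT_FN) (const RECORD *rec, void *arg);  /* sink for records in sorted order   */
typedef int  (*CMP_FN) (const RECORD *a, const RECORD *b, void *arg);

/* Toy "sort core": fill a buffer from get_fn, order it with cmp_fn, stream it
 * to put_fn.  A real engine adds spilling and merging around this skeleton. */
static void toy_sort (GET_FN get, void *garg, PUT_FN put, void *parg, CMP_FN cmp, void *carg)
{
  RECORD buf[64];
  int n = 0, i, j;

  while (n < 64 && get (&buf[n], garg))      /* "phase 1": drain the input     */
    n++;
  for (i = 1; i < n; i++)                    /* in-memory sort (insertion sort) */
    {
      RECORD key = buf[i];
      for (j = i - 1; j >= 0 && cmp (&buf[j], &key, carg) > 0; j--)
        buf[j + 1] = buf[j];
      buf[j + 1] = key;
    }
  for (i = 0; i < n; i++)                    /* "phase 2": emit in order        */
    put (&buf[i], parg);
}

/* Caller side: the opaque arguments carry whatever state the caller needs. */
typedef struct { const char **rows; int pos, count; } SCAN;

static int get_row (RECORD *out, void *arg)
{
  SCAN *s = (SCAN *) arg;
  if (s->pos >= s->count)
    return 0;
  out->data = (char *) s->rows[s->pos++];
  out->length = (int) strlen (out->data);
  return 1;
}

static void put_row (const RECORD *rec, void *arg) { (void) arg; printf ("%s\n", rec->data); }
static int  cmp_row (const RECORD *a, const RECORD *b, void *arg) { (void) arg; return strcmp (a->data, b->data); }

int main (void)
{
  const char *rows[] = { "cherry", "apple", "banana" };
  SCAN scan = { rows, 0, 3 };
  toy_sort (get_row, &scan, put_row, NULL, cmp_row, NULL);   /* prints apple, banana, cherry */
  return 0;
}
```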
The top-level body is straightforward:
```c
// sort_listfile — src/storage/external_sort.c (condensed)
int
sort_listfile (THREAD_ENTRY * thread_p, INT16 volid, int est_inp_pg_cnt,
               SORT_GET_FUNC * get_fn, void *get_arg,
               SORT_PUT_FUNC * put_fn, void *put_arg,
               SORT_CMP_FUNC * cmp_fn, void *cmp_arg,
               SORT_DUP_OPTION option, int limit,
               bool includes_tde_class, SORT_PARALLEL_TYPE parallel_type)
{
  SORT_PARAM ori_sort_param;
  SORT_PARAM *sort_param = &ori_sort_param;
  // ... condensed: install callbacks, init temp file slots ...

  if (est_inp_pg_cnt > 0)
    input_pages = est_inp_pg_cnt + MAX ((int) (est_inp_pg_cnt * 0.1), 2);
  else
    input_pages = prm_get_integer_value (PRM_ID_SR_NBUFFERS);

  sort_param->tot_buffers = MIN (prm_get_integer_value (PRM_ID_SR_NBUFFERS), input_pages);
  sort_param->tot_buffers = MAX (4, sort_param->tot_buffers);

  sort_param->internal_memory = (char *) malloc ((size_t) sort_param->tot_buffers * (size_t) DB_PAGESIZE);
  // ... condensed: fall back to 4-page minimum on OOM ...

  sort_param->half_files = sort_get_num_half_tmpfiles (sort_param->tot_buffers, input_pages);
  sort_param->tot_tempfiles = sort_param->half_files << 1;
  sort_param->in_half = 0;

  // ... condensed: per-file dynamic arrays for run-list bookkeeping ...

#if defined(SERVER_MODE)
  sort_param->px_parallel_num = sort_check_parallelism (thread_p, sort_param);
  if (sort_param->px_parallel_num <= 1)
    error = sort_listfile_internal (thread_p, sort_param);   /* serial path */
  else
    {                           /* parallel */
      px_sort_param = malloc (sizeof (SORT_PARAM) * sort_param->px_parallel_num);
      error = sort_start_parallelism (thread_p, px_sort_param, sort_param);
      SORT_EXECUTE_PARALLEL (sort_param->px_parallel_num, px_sort_param, sort_listfile_execute);
      SORT_WAIT_PARALLEL (sort_param->px_parallel_num, sort_param, px_sort_param);
      error = sort_end_parallelism (thread_p, px_sort_param, sort_param);
    }
#else
  error = sort_listfile_internal (thread_p, sort_param);
#endif

  // ... condensed: return all temp resources, free internal memory ...
  return error;
}
```

Three things are worth pulling out. First, the sort buffer size
in pages is tot_buffers, derived as min (PRM_ID_SR_NBUFFERS, input_pages_with_10pct_slack) and floored at 4. The system
parameter sr_nbuffers (alias sort_buffer_size) is the only
knob the DBA has; the caller’s est_inp_pg_cnt only shrinks the
buffer when the input is known to be small. Second,
half_files (the merge fan-in) is computed from the expected
number of runs so that small sorts don’t allocate eight temp
files just to produce one or two runs (see
sort_get_num_half_tmpfiles further down). Third, the parallel
path is gated by a runtime check (sort_check_parallelism) —
the planner does not commit to a parallel sort, the executor
does.
sort_listfile_internal — the two-phase driver
When the parallel decision is “no”, or inside each parallel
worker, control reaches sort_listfile_internal:
```c
// sort_listfile_internal — src/storage/external_sort.c
int
sort_listfile_internal (THREAD_ENTRY * thread_p, SORT_PARAM * sort_param)
{
  int error = NO_ERROR;
  int file_pg_cnt_est;

  /* Phase 1: run generation */
  error = sort_inphase_sort (thread_p, sort_param, sort_param->get_fn, sort_param->get_arg,
                             &sort_param->total_numrecs);
  if (error != NO_ERROR)
    return error;

  if (sort_param->tot_runs > 1)
    {
      /* Allocate the output half of the temp-file pair.
       * The input half was allocated lazily inside Phase 1. */
      file_pg_cnt_est = sort_get_avg_numpages_of_nonempty_tmpfile (sort_param);
      file_pg_cnt_est = MAX (1, file_pg_cnt_est);
      for (i = sort_param->half_files; i < sort_param->tot_tempfiles; i++)
        {
          error = sort_add_new_file (thread_p, &(sort_param->temp[i]), file_pg_cnt_est,
                                     true, sort_param->tde_encrypted);
          if (error != NO_ERROR)
            return error;
        }

      /* Phase 2: merge. Two variants depending on dedup option. */
      if (sort_param->option == SORT_ELIM_DUP)
        error = sort_exphase_merge_elim_dup (thread_p, sort_param);
      else
        error = sort_exphase_merge (thread_p, sort_param);
    }
  return error;
}
```

The structure is rigid: phase 1 always runs; phase 2 only runs when more than one run was produced; the dedup variant is selected at the top of phase 2 once and not revisited. There is no streaming hand-off between phases — phase 1 must completely finish before phase 2 starts, because phase 2 reads the runs that phase 1 wrote.
flowchart TB
subgraph CALLER["Caller"]
GET["get_fn / get_arg<br/>(scan, hash spill, heap iter)"]
PUT["put_fn / put_arg<br/>(list-file writer, leaf builder)"]
CMP["cmp_fn / cmp_arg<br/>(SORTKEY_INFO, collation, NULLS, ASC/DESC)"]
end
subgraph PHASE1["Phase 1 — run generation"]
SIS["sort_inphase_sort"]
SRS["sort_run_sort<br/>(natural mergesort over buffer)"]
SRF["sort_run_flush<br/>(write run as slotted pages)"]
end
subgraph PHASE2["Phase 2 — merge"]
SEM["sort_exphase_merge<br/>or sort_exphase_merge_elim_dup"]
TRN["k-way merge tournament<br/>(SORT_REC_LIST sr_list)"]
end
subgraph TEMP["FILE_TEMP backing store"]
H1["temp[0..half_files-1]<br/>(input half)"]
H2["temp[half_files..tot]<br/>(output half)"]
OVF["multipage_file<br/>(long-record overflow)"]
end
subgraph OUT["Output"]
PS["put_fn called per record<br/>in sorted order"]
end
GET --> SIS
CMP -.-> SRS
CMP -.-> TRN
SIS --> SRS --> SRF --> H1
SIS -.long record.-> OVF
H1 --> SEM
SEM --> TRN --> H2
H2 -.swap halves on next pass.-> H1
SEM -- final pass --> PS --> PUT
The figure makes one structural point. Run generation writes
into the input half of the temp-file pool; the merge phase
reads from input and writes to output. After each merge pass
the halves swap (sort_param->in_half ^= half_files). The
final pass does not write back to a temp file — it streams
records directly to put_fn (see “very_last_run” branches in
sort_exphase_merge).
Phase 1 — run generation in sort_inphase_sort
sort_inphase_sort runs a fill-sort-flush loop. It maintains a
buffer split into two halves: a record area (growing forward
from the start of internal_memory) and an index area (growing
backward from output_buffer). Records are accumulated in
the record area; an array of pointers (the index area) is
built in parallel. When the two halves meet, the buffer is
“full” and the pointer array is sorted.
```c
// sort_inphase_sort — src/storage/external_sort.c (condensed shape)
static int
sort_inphase_sort (THREAD_ENTRY * thread_p, SORT_PARAM * sort_param,
                   SORT_GET_FUNC * get_fn, void *get_arg, unsigned int *total_numrecs)
{
  RECDES temp_recdes;
  char *item_ptr;      /* head of free space in the record area   */
  char **index_area;   /* index-area pointer cursor (grows down)  */
  char **index_buff;   /* boundary used to detect "full"          */
  long numrecs = 0, sort_numrecs = 0;
  int out_curfile = sort_param->in_half;

  output_buffer = sort_param->internal_memory + ((long) (sort_param->tot_buffers - 1) * DB_PAGESIZE);
  item_ptr = sort_param->internal_memory + SORT_RECORD_LENGTH_SIZE;
  index_area = (char **) (output_buffer - sizeof (char *));
  index_buff = index_area - 1;

  for (;;)
    {
      if ((char *) index_buff < item_ptr)
        status = SORT_REC_DOESNT_FIT;   /* buffer full */
      else
        {
          temp_recdes.data = item_ptr;
          temp_recdes.area_size = ...;  /* remaining contiguous */
          status = (*get_fn) (thread_p, &temp_recdes, get_arg);
          if (status == SORT_NOMORE_RECS)
            break;
        }

      switch (status)
        {
        case SORT_REC_DOESNT_FIT:
          /* Sort what we have, flush it as a run, reset cursors. */
          if (numrecs > 0)
            {
              index_area = sort_run_sort (thread_p, sort_param, index_area + 1, numrecs,
                                          sort_numrecs, index_buff, &numrecs);
              if (sort_param->option == SORT_ELIM_DUP && /* still room and dedup shrunk us */ ...)
                {
                  /* keep filling the same run */
                  ...
                }
              else
                {
                  error = sort_run_flush (thread_p, sort_param, out_curfile, cur_page,
                                          output_buffer, index_area, numrecs, REC_HOME);
                  numrecs = 0;
                  item_ptr = sort_param->internal_memory + SORT_RECORD_LENGTH_SIZE;
                  index_area = (char **) (output_buffer - sizeof (char *));
                  index_buff = index_area - 1;
                  if (++out_curfile >= sort_param->half_files)
                    out_curfile = sort_param->in_half;    /* round-robin */
                }
              sort_numrecs = numrecs;
            }

          /* Long-record path: temp_recdes.length > SORT_MAXREC_LENGTH */
          if (temp_recdes.length > SORT_MAXREC_LENGTH)
            {
              /* Re-fetch the full record into long_recdes.
               * Push it into sort_param->multipage_file via overflow_insert.
               * The in-buffer entry stores only the VPID pointer. */
              ...
              error = sort_run_flush (..., REC_BIGONE);
              ...
            }
          break;

        case SORT_SUCCESS:
          SORT_RECORD_LENGTH (item_ptr) = temp_recdes.length;
          *index_area = item_ptr;
          numrecs++;
          index_area--;
          index_buff -= 2;   /* one slot for the pointer, one for slack */
          item_ptr += DB_ALIGN (temp_recdes.length, MAX_ALIGNMENT) + SORT_RECORD_LENGTH_SIZE;
          break;
        }
    }

  /* Drain whatever is left in the buffer at end of input. */
  if (numrecs > 0)
    {
      index_area = sort_run_sort (..., &numrecs);
      if (sort_param->tot_runs > 0 || SORT_IS_PARALLEL (sort_param))
        sort_run_flush (..., REC_HOME);    /* one more run */
      else
        for (i = 0; i < numrecs; i++)      /* short-circuit */
          (*sort_param->put_fn) (thread_p, &temp_recdes, sort_param->put_arg);
    }
  ...
}
```

Several design decisions are worth naming.
Round-robin run distribution. Successive runs go to
successive input-half temp files (out_curfile). With
half_files = 4 and 13 runs, the layout becomes
temp[0]: runs 0,4,8,12; temp[1]: runs 1,5,9; temp[2]: runs 2,6,10; temp[3]: runs 3,7,11. This produces a balanced merge
fan-in regardless of run count and avoids the polyphase
distribution Knuth describes for tape drives.
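A standalone rendering of that placement rule (illustrative only; the constants mirror the defaults named above):

```c
/* Round-robin placement of runs onto the input-half temp files. */
#include <stdio.h>

int main (void)
{
  int half_files = 4;        /* SORT_MAX_HALF_FILES             */
  int in_half = 0;           /* first input-half file index     */
  int out_curfile = in_half;
  int run;

  for (run = 0; run < 13; run++)
    {
      printf ("run %2d -> temp[%d]\n", run, out_curfile);
      if (++out_curfile >= in_half + half_files)
        out_curfile = in_half;             /* wrap back to the first input file */
    }
  return 0;
}
```

Running it reproduces the temp[0]: 0,4,8,12 layout from the example above.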
The “run-extension on dedup” branch. When
SORT_ELIM_DUP is in effect, sorting the buffer often shrinks
it (duplicates collapse). If the post-sort tail still has room
for more records, sort_inphase_sort does not flush; it
re-positions the cursors and continues filling the same run.
This makes the actual run length variable (and often longer
than the buffer) on duplicate-heavy inputs — important for
unique-index loading where many heap rows may share a key.
Long-record overflow. If a single tuple exceeds
SORT_MAXREC_LENGTH = DB_PAGESIZE - sizeof(SLOTTED_PAGE_HEADER) - sizeof(SLOT)
— roughly one full page minus slot overhead — it cannot be
stored inline. The record is re-fetched into a heap-allocated
long_recdes, written to a side-car overflow file
(sort_param->multipage_file, lazily created by
file_create_temp on first use), and the in-buffer entry
becomes a single-VPID pointer with rec_type = REC_BIGONE. The
merge phase, when it sees a REC_BIGONE entry, calls
sort_retrieve_longrec (which goes through overflow_get) to
materialise the actual record before comparison and emission.
Long-record runs are flushed individually — they are always the
last (and only) record in their run.
The single-run short-circuit. If tot_runs == 0 after the
loop drains, the entire input fit in the buffer. There is no
point writing it to a temp file just to read it back; the code
simply iterates the sorted index area and calls put_fn
directly. sort_listfile_internal then sees tot_runs <= 1
and skips phase 2 entirely.
sort_run_sort — the natural-mergesort engine
sort_run_sort is what actually orders the in-memory pointer
array. It is not quicksort; it is Knuth’s natural mergesort:
detect monotone runs with sort_run_find (with a flip if the
run is descending), push them onto a SORT_STACK, and merge
adjacent equal-depth runs with sort_run_merge:
```c
// sort_run_sort — src/storage/external_sort.c (condensed)
static char **
sort_run_sort (THREAD_ENTRY * thread_p, SORT_PARAM * sort_param, char **base, long limit,
               long sort_numrecs, char **otherbase, long *srun_limit)
{
  // ... limit accounting elided ...
  st_p->top = -1;
  cnt = (int) (log10 (ceil ((double) limit / 2.0)) / log10 (2.0)) + 2;
  st_p->srun = db_private_alloc (NULL, cnt * sizeof (SRUN));

  do
    {
      sort_run_find (src, &src_top, st_p, limit, compare, comp_arg, option);
      if (src_top < limit)
        sort_run_find (src, &src_top, st_p, limit, compare, comp_arg, option);

      while (st_p->top >= 1
             && ((src_top >= limit)                                                      /* final */
                 || (st_p->srun[st_p->top - 1].tree_depth == st_p->srun[st_p->top].tree_depth)))  /* equal */
        sort_run_merge (dest, src, st_p, compare, comp_arg, option);
    }
  while (src_top < limit);

  // ... merge with previously-found run if any ...
}
```

The “merge equal-depth runs” rule is what gives natural
mergesort its O(N log N) bound: every merge increments the
depth, so the stack height stays at most ⌈log₂ N⌉. The
SRUN structure carries low_high ∈ {'L', 'H'} to track
whether the current run lives in the low (base) or high
(otherbase) buffer — sort_run_merge ping-pongs between
them, so each merge alternates target buffers and avoids
allocating a third one.
sort_run_find is the run detector: read the next two slots,
decide whether the run is increasing or decreasing, extend it
as long as the trend holds, flip it if descending, advance
*top past the run, push the descriptor onto the stack. On
already-sorted input it produces a single run of length N and
the merge stack has only one entry — degenerate sort cost is
O(N).
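A standalone sketch of the same detect-extend-flip idea over a plain integer array (not the CUBRID code, which works over the pointer index area with a caller-supplied comparator):

```c
/* Run detection with a reverse-flip for descending stretches. */
#include <stdio.h>

static void flip (int *a, int lo, int hi)            /* reverse a[lo..hi] in place */
{
  for (; lo < hi; lo++, hi--) { int t = a[lo]; a[lo] = a[hi]; a[hi] = t; }
}

/* Returns the index one past the monotone run starting at 'top'. */
static int run_find (int *a, int top, int n)
{
  int end = top + 1;
  if (end >= n)
    return n;                                        /* single trailing element */

  if (a[top] <= a[end])                              /* ascending stretch */
    while (end + 1 < n && a[end] <= a[end + 1]) end++;
  else                                               /* descending stretch: extend, then flip */
    {
      while (end + 1 < n && a[end] > a[end + 1]) end++;
      flip (a, top, end);
    }
  return end + 1;
}

int main (void)
{
  int a[] = { 5, 4, 3, 9, 1, 2, 8 };
  int n = 7, top = 0;
  while (top < n)
    {
      int next = run_find (a, top, n);
      printf ("run [%d..%d)\n", top, next);          /* each run is now ascending */
      top = next;
    }
  return 0;
}
```

Note that the descending branch extends only on a strict “greater than”; equal keys end the stretch, which keeps the flip from reordering duplicates.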
The dedup option is folded into sort_run_find itself via the
SORT_CHECK_DUPLICATE(stop, next_stop) macro: when two
consecutive elements compare equal under SORT_DUP the
duplicate is appended to the survivor’s chain (so the put-
callback can walk the list); under SORT_ELIM_DUP the
duplicate’s slot is set to NULL and skipped on the right-shift
compaction at the end of the run-find. Either way the merge
phase sees a duplicate-free input.
Phase 2 — sort_exphase_merge
The merge phase is a loop of merge passes. Each pass reads
act_infiles runs from the input half, merges them, and writes
the result to the output half. After the pass, halves swap.
The loop terminates when only one active run remains across all
input files — at which point that single run streams to
put_fn.
```c
// sort_exphase_merge — src/storage/external_sort.c (skeletal)
static int
sort_exphase_merge (THREAD_ENTRY * thread_p, SORT_PARAM * sort_param)
{
  while ((act_infiles = sort_get_numpages_of_active_infiles (sort_param)) > 1)
    {
      sort_checkalloc_numpages_of_outfiles (thread_p, sort_param);
      in_sectsize = sort_find_inbuf_size (sort_param->tot_buffers, act_infiles);
      out_sectsize = sort_param->tot_buffers - in_sectsize * act_infiles;
      // ... lay out per-input-section addresses ...

      num_runs = max(file_contents[i].count for i in input half);
      if (num_runs == 1)
        very_last_run = true;

      for (j = num_runs; j > 0; j--)
        {
          /* Special case: only one input run survives in this pass.
           * Just copy it to the output instead of running the merge. */
          if (!very_last_run && j == 1 && sort_get_numpages_of_active_infiles (sort_param) == 1)
            {
              // bulk page copy: read into internal_memory, write to output
              continue;
            }

          /* Initialize per-input cursors: read first page of each run, peek
           * its first record into smallest_elem_ptr[i].
           * Set up the SORT_REC_LIST tournament. */
          ...
          /* Tournament loop: emit min, advance its cursor, re-position. */
          for (;;)
            {
              min = min_p->rec_pos;
              if (very_last_run && !SORT_IS_PARALLEL (sort_param))
                (*sort_param->put_fn) (thread_p, &smallest_elem_ptr[min], sort_param->put_arg);
              else
                sort_spage_insert (out_cur_bufaddr, &smallest_elem_ptr[min]);

              /* Advance min's cursor; refill its input section if exhausted;
               * if its run is fully drained, unlink it from sr_list. */
              ...
              /* Re-establish the order of sr_list using the "last_elem_cmp"
               * shortcut — see below. */
              ...
            }
          // ... flush output section, record run length ...
        }
      /* swap input/output halves for the next pass */
      temp = sort_param->in_half;
      sort_param->in_half = out_half;
      out_half = temp;
    }
}
```

The merge tournament — sorted linked list, not a heap
CUBRID’s k-way merge is not a heap. With
SORT_MAX_HALF_FILES = 4, the maximum fan-in is 4 — too small
for a heap to pay off. Instead, the merge keeps a singly linked
list of SORT_REC_LIST nodes, one per active input file,
sorted ascending by the head of each input. Emission is “take
the head of sr_list” (O(1)); after advancing the winner’s
cursor, the new head is bubble-walked down the list until it
finds its position (O(k) worst case but typically O(1) because
in many runs the new head is still the smallest).
```c
// SORT_REC_LIST — src/storage/external_sort.c
typedef struct sort_rec_list SORT_REC_LIST;
struct sort_rec_list
{
  struct sort_rec_list *next;   /* next sorted record item            */
  int rec_pos;                  /* which input file (index 0..k-1)    */
  bool is_duplicated;           /* dedup-during-merge marker          */
};
```

The rec_pos field is what carries the tournament state — the
list nodes themselves are static, only their rec_pos values
get permuted. min_p is the head pointer; after emit, the new
head’s record is read from the input section, the list is
re-sorted (often just one swap), and min_p becomes the new
head.
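A minimal sketch of that bubble walk over a generic linked list (illustrative only; the real loop also refills input pages and applies the last_elem_cmp shortcut described below):

```c
/* Re-position the head of a list sorted by head_key after its key changed. */
#include <stdio.h>
#include <stddef.h>

typedef struct node { struct node *next; long head_key; } NODE;

static NODE *reposition_head (NODE *list)
{
  NODE *moved = list;
  NODE *cur = list->next;

  if (cur == NULL || moved->head_key <= cur->head_key)
    return list;                      /* common case: still the smallest, O(1) */

  list = cur;                         /* unlink the old head */
  while (cur->next != NULL && cur->next->head_key < moved->head_key)
    cur = cur->next;                  /* walk to the insertion point, O(k) worst case */

  moved->next = cur->next;            /* splice back in */
  cur->next = moved;
  return list;
}

int main (void)
{
  NODE c = { NULL, 9 }, b = { &c, 7 }, a = { &b, 5 };
  a.head_key = 8;                     /* the winner read its next record */
  for (NODE *p = reposition_head (&a); p != NULL; p = p->next)
    printf ("%ld ", p->head_key);     /* prints: 7 8 9 */
  printf ("\n");
  return 0;
}
```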
flowchart LR
subgraph IN["Input runs (one per input-half temp file)"]
R0["run on temp[0]<br/>head: 'apple'"]
R1["run on temp[1]<br/>head: 'banana'"]
R2["run on temp[2]<br/>head: 'cherry'"]
R3["run on temp[3]<br/>head: 'date'"]
end
subgraph TOURN["sr_list — sorted linked list"]
M["min_p → rec_pos=0<br/>(apple)"]
M2["next → rec_pos=1<br/>(banana)"]
M3["next → rec_pos=2<br/>(cherry)"]
M4["next → rec_pos=3<br/>(date)"]
M --> M2 --> M3 --> M4
end
subgraph OUT["Output run"]
OS["out_sectaddr<br/>(write buffer = half of internal_memory)"]
end
R0 --> M
R1 --> M2
R2 --> M3
R3 --> M4
M -- emit 'apple' --> OS
R0 -- next: 'avocado' --> M
M -.bubble: apple's slot now 'avocado'<br/>'avocado' > 'banana', swap.-> M2
The last_elem_cmp shortcut
A non-trivial optimisation in sort_exphase_merge: when one
run’s last record on the current input page compares less than
the second-smallest run’s head, the merge knows that every
remaining record on this page of the winning run is also
smaller than the next contender. So it skips the per-record
list re-sort entirely and just streams that page out:
```c
// sort_exphase_merge — last_elem_cmp shortcut (condensed)
if (act_slot[min_p->rec_pos] == 0)    /* new input page */
  {
    /* peek the page's last slot */
    sort_spage_get_record (in_cur_bufaddr[min_p->rec_pos], last_slot[min_p->rec_pos] - 1,
                           &last_elem_ptr, PEEK);
    last_elem_cmp = (*compare) (&last_elem_ptr.data, &smallest_elem_ptr[p->rec_pos].data, compare_arg);
  }

// ... in the main emit loop ...
if (last_elem_cmp <= 0)
  ;                                   /* already found min — skip the bubble walk */
else
  for (s = min_p; s; s = s->next)
    {
      /* re-sort sr_list */
      ...
    }
```

On nearly-disjoint inputs (e.g., a dataset already partitioned by the sort key) this collapses comparison cost to one per page rather than one per tuple.
Dedup-during-merge — sort_exphase_merge_elim_dup
The dedup variant differs from the plain merge in two places.
First, after the initial sort of sr_list, it makes a second
pass to mark consecutive equal heads as is_duplicated = true.
Second, in the emit loop, duplicate-marked records are not
written to the output area — they are silently advanced past:
```c
// sort_exphase_merge_elim_dup — emission gate (condensed)
if (min_p->is_duplicated == false)
  {
    /* OUTPUT THE RECORD (final pass: put_fn; otherwise: out section) */
    ...
  }
else
  ;   /* skip the duplicate */

/* Always advance the winner's cursor regardless of duplication. */
```

A subtle case: when the next record arrives at the start of a
fresh input page, the dedup loop also has to compare it against
the previous head of that page — the
(act_slot == last_slot - 1) && (last_elem_cmp == 0) check
handles “the last element of this page duplicates the first
element of the next page in some other run”, a case the plain
merge ignores. This is what makes sort_exphase_merge_elim_dup
a separate function rather than a parametrised version of
sort_exphase_merge — the bookkeeping is too entangled with
duplicate-tracking to merge cleanly.
Backing store — FILE_TEMP and the run files
Every temp file used by the sort is a FILE_TEMP allocation,
described in cubrid-disk-manager.md. The lifecycle:
- Allocation. sort_add_new_file calls file_create_temp with an estimated page count. The file is registered in the transaction’s temp-file list and survives until either the caller retires it or the transaction ends.
- Encryption. If the caller passes includes_tde_class = true (set when the underlying class is TDE-encrypted), the temp file is itself encrypted via file_apply_tde_algorithm. The TDE flag rides on sort_param->tde_encrypted and is checked at every sort_write_area.
- Page I/O. sort_write_area and sort_read_area use file_numerable_find_nth to map a file-relative page number to a VPID, then pgbuf_copy_from_area / pgbuf_copy_to_area to bypass the regular fix path — temp pages don’t need WAL or LSN tracking.
- Retirement. When sort_listfile exits, every temp file in sort_param->temp[] is released through sort_return_used_resources → file_temp_retire. The multipage overflow file, if created, is retired with it.
The run-list bookkeeping is a sparse layer over the temp file:
each FILE_CONTENTS entry has a dynamic num_pages[] array
recording one entry per run on that file. sort_run_add_new
appends; sort_run_remove_first advances first_run (it
doesn’t compact the array). The run is identified by
(file index, run index); its physical pages are numbered
consecutively starting at cur_page[file_index] and ending at
cur_page + num_pages[run_index].
```c
// FILE_CONTENTS — src/storage/external_sort.c
struct file_contents
{
  int *num_pages;    /* dynamic array of per-run page counts      */
  int num_slots;     /* size of the num_pages allocation          */
  int first_run;     /* index of the head run, or -1 if empty     */
  int last_run;      /* index of the tail run                     */
  int start_index;   /* used for parallel split of the last run   */
};
```

Comparator dispatch — SORTKEY_INFO and the cmp_fn
The comparator is supplied by the caller, but its argument
shape is conventional: the caller-provided cmp_arg is, by
convention across CUBRID’s callers, a SORTKEY_INFO *
describing the multi-column sort key. Each column has a
SUBKEY_INFO:
```c
// SUBKEY_INFO — src/storage/external_sort.h
struct SUBKEY_INFO
{
  int col;              /* tuple-column index               */
  int permuted_col;
  TP_DOMAIN *col_dom;
  TP_DOMAIN *cmp_dom;   /* domain for cross-domain MEDIAN   */
  DB_VALUE_COMPARE_RESULT (*sort_f) (void *tplp1, void *tplp2, TP_DOMAIN * dom,
                                     int do_coercion, int total_order, int *start_col);
  int is_desc;          /* descending column                */
  int is_nulls_first;   /* NULLS FIRST per-column           */
  bool use_cmp_dom;
};

struct SORTKEY_INFO
{
  int nkeys;
  int use_original;              /* sort tuples by ref vs by-value       */
  SUBKEY_INFO *key;
  SUBKEY_INFO default_keys[8];   /* inline allocation for common case    */
  int error;
};
```

The list-file caller (qfile_sort_list_with_func) installs
either qfile_compare_partial_sort_record (when
use_original == 1 — only some columns are key columns and
the comparator must reach back into the original tuple) or
qfile_compare_all_sort_record (when the entire tuple is the
key). The B+Tree caller installs compare_driver, which knows
how to compare midxkey arguments. The comparator is the only
place per-column collation, ASC/DESC, and NULLS-FIRST/LAST is
honoured — the rest of the sort module is collation-blind.
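A schematic of what such a comparator does, with generic column values standing in for the real tuple and domain machinery — note that how NULL placement interacts with DESC is engine-specific and deliberately simplified here:

```c
/* Multi-column comparator skeleton in the SORTKEY_INFO style (illustrative). */
#include <stdbool.h>

typedef struct { long value; bool is_null; } COLVAL;
typedef struct { int col; int is_desc; int is_nulls_first; } KEYCOL;

static int compare_tuples (const COLVAL *a, const COLVAL *b, const KEYCOL *keys, int nkeys)
{
  for (int k = 0; k < nkeys; k++)
    {
      const COLVAL *av = &a[keys[k].col], *bv = &b[keys[k].col];
      int c;

      if (av->is_null || bv->is_null)
        {
          if (av->is_null && bv->is_null)
            continue;                                  /* tie on NULL: fall through to next column */
          c = av->is_null ? -1 : 1;                    /* NULL sorts first ...                     */
          if (!keys[k].is_nulls_first)
            c = -c;                                    /* ... unless NULLS LAST                    */
          return c;
        }

      c = (av->value > bv->value) - (av->value < bv->value);
      if (c != 0)
        return keys[k].is_desc ? -c : c;               /* flip the result for DESC columns         */
    }
  return 0;                                            /* equal on every key column                */
}
```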
Caller integration
Four major callers wire sort_listfile into different parts of
the engine. Each picks a SORT_PARALLEL_TYPE tag that tells
the sort module how to (or whether to) parallelise.
flowchart TB
subgraph CALL["sort_listfile callers"]
QSL["qfile_sort_list_with_func<br/>(SORT_ORDER_BY / SORT_ORDER_WITH_LIMIT)"]
QGB["qexec_groupby<br/>(SORT_GROUP_BY)"]
QGB2["qexec_groupby (hash-spill path)<br/>(SORT_GROUP_BY)"]
QAN["qexec_execute_analytic<br/>(SORT_ANALYTIC)"]
BIS["btree_index_sort<br/>(SORT_INDEX_LEAF)"]
end
subgraph SR["sort_listfile (entry)"]
DEC["sort_check_parallelism<br/>(uses parallel_type + input pages)"]
SER["sort_listfile_internal (serial)"]
PAR["px_sort_param[1..N] + worker_manager"]
end
subgraph OUTS["per-caller put_fn"]
QFS["qfile_put_next_sort_item<br/>(append to QFILE_LIST_ID)"]
GBS["qexec_gby_put_next<br/>(per-group accumulator fold)"]
HGB["qexec_hash_gby_put_next<br/>(re-aggregate spilled groups)"]
ANP["qexec_analytic_put_next"]
BCL["btree_construct_leafs<br/>(write leaf records)"]
end
QSL --> DEC
QGB --> DEC
QGB2 --> DEC
QAN --> DEC
BIS --> DEC
DEC --> SER
DEC --> PAR
SER --> QFS
SER --> GBS
SER --> HGB
SER --> ANP
SER --> BCL
PAR --> QFS
PAR --> BCL
ORDER BY — qfile_sort_list_with_func
qfile_sort_list_with_func (in list_file.c) is the standard
ORDER BY entry. It opens an input scan on the unsorted list
file, allocates a fresh output list file (srlist_id),
initialises a SORT_INFO struct that carries both, and calls
sort_listfile with the default callbacks
qfile_get_next_sort_item and qfile_put_next_sort_item:
```c
// qfile_sort_list_with_func — src/query/list_file.c (condensed)
sort_result = sort_listfile (thread_p, NULL_VOLID, estimated_pages,
                             get_func, &info, put_func, &info,
                             cmp_func, &info.key_info,
                             dup_option, limit,
                             srlist_id->tfile_vfid->tde_encrypted,
                             parallel_type);   /* SORT_ORDER_BY */
```

dup_option is SORT_ELIM_DUP for SELECT DISTINCT and
SORT_DUP for plain ORDER BY. parallel_type is
SORT_ORDER_BY for the unlimited ORDER BY case, falling back
to SORT_ORDER_WITH_LIMIT when LIMIT is present (parallel
sort is gated off for top-K because its split-and-merge logic
doesn’t yet honour the limit). After the sort, the original
list file is destroyed and the output list file’s identifier
is copied into the caller’s slot.
For details of how the sorted output is consumed by aggregation
and window functions, see cubrid-post-processing.md.
Sort-based GROUP BY — qexec_groupby
The GROUP BY caller (in query_executor.c) does not
materialise a sorted output list at all in the sort path — the
put-callback is the aggregator:
```c
// qexec_groupby — src/query/query_executor.c (condensed)
if (sort_listfile (thread_p, NULL_VOLID, estimated_pages,
                   &qexec_gby_get_next, &gbstate, &qexec_gby_put_next, &gbstate,
                   gbstate.cmp_fn, &gbstate.key_info, SORT_DUP, NO_SORT_LIMIT,
                   gbstate.output_file->tfile_vfid->tde_encrypted, SORT_GROUP_BY) != NO_ERROR)
  GOTO_EXIT_ON_ERROR;
```

qexec_gby_put_next is invoked once per sorted record. It
compares the record’s group key against the last seen key; on
key change it flushes the previous group’s accumulators to the
output list file, resets the accumulators, and starts the new
group. The sort module never knows that aggregation is
happening — from its perspective, put_fn is just a sink.
The hash-spill fallback path uses the same pattern but with
qexec_hash_gby_put_next, which re-aggregates partial sums
that were already partially folded in the spilled hash table.
Both paths are walked in detail in
cubrid-post-processing.md.
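A schematic of that boundary-detecting put callback, with generic types standing in for the real accumulator machinery (illustrative only, not the qexec_gby_put_next code):

```c
/* Group-boundary detection inside a sort put callback (illustrative). */
#include <stdio.h>
#include <stdbool.h>

typedef struct { long group_key; long value; } ROW;
typedef struct { bool has_group; long current_key; long sum; long count; } GBY_STATE;

static void flush_group (GBY_STATE *st)
{
  printf ("group %ld: sum=%ld count=%ld\n", st->current_key, st->sum, st->count);
}

/* Called by the sort core once per record, in key order. */
static void gby_put_next (const ROW *row, GBY_STATE *st)
{
  if (st->has_group && row->group_key != st->current_key)
    {
      flush_group (st);                  /* key changed: emit the finished group */
      st->has_group = false;
    }
  if (!st->has_group)
    {
      st->has_group = true;              /* start a new group */
      st->current_key = row->group_key;
      st->sum = 0;
      st->count = 0;
    }
  st->sum += row->value;                 /* fold the record into the accumulator */
  st->count++;
}

int main (void)
{
  ROW sorted[] = { {1, 10}, {1, 5}, {2, 7}, {3, 1}, {3, 2} };   /* already in key order */
  GBY_STATE st = { false, 0, 0, 0 };
  for (int i = 0; i < 5; i++)
    gby_put_next (&sorted[i], &st);
  if (st.has_group)
    flush_group (&st);                   /* final group at end of input */
  return 0;
}
```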
B+Tree bulk load — btree_index_sort
When CREATE INDEX runs on an existing table, btree_load.c’s
btree_index_sort orchestrates bulk-load: scan the heap
file(s), produce sort records of the form
(key || OID || class_OID), sort them, and feed the sorted
stream into btree_construct_leafs which builds leaf pages
bottom-up:
```c
// btree_index_sort — src/storage/btree_load.c
return sort_listfile (thread_p, sort_args->hfids[0].vfid.volid, 0,
                      &btree_sort_get_next, sort_args,
                      out_func, out_args,        /* btree_construct_leafs */
                      compare_driver, sort_args,
                      SORT_DUP,                  /* allow dups; uniq enforced inside */
                      NO_SORT_LIMIT, includes_tde_class, SORT_INDEX_LEAF);
```

The SORT_DUP option is used even for unique indexes — the
unique check is enforced inside btree_construct_leafs rather
than by the sort module, because the comparator on
(key || OID) would treat two equal-keyed rows as different
records and the sort would not detect the violation anyway.
The SORT_INDEX_LEAF parallel type triggers the more
elaborate parallel path: each worker independently scans a
disjoint subset of the heap’s data sectors via
btree_sort_get_next_parallel, produces its own sorted runs,
and the runs are tree-merged across workers (see below). For a
walk of how btree_construct_leafs consumes the sorted stream,
see cubrid-btree.md.
Hash-join spill fallback
When a hash-join’s build side overflows memory, CUBRID’s
join executor falls back to sort-merge join — this is the
historical “spill to sort” path. The sort here is just a
plain qfile_sort_list on the build side and the probe side,
with no special put-callback; the join logic walks the two
sorted lists independently. Cross-reference
cubrid-hash-join.md for the spill conditions and the
post-spill iteration loop.
Parallel sort
Parallel sort is wired through three coordinator functions:
sort_check_parallelism (decides whether to parallelise),
sort_start_parallelism (splits the input across workers), and
sort_end_parallelism (merges per-worker results into one).
The decision is type-specific:
```c
// sort_check_parallelism — src/storage/external_sort.c (condensed)
int
sort_check_parallelism (THREAD_ENTRY * thread_p, SORT_PARAM * sort_param)
{
  if (sort_param->px_type == SORT_ORDER_BY)
    {
      SORT_INFO *sort_info_p = (SORT_INFO *) sort_param->get_arg;
      parallel_num = parallel_query::compute_parallel_degree (parallel_query::parallel_type::SORT,
                                                              sort_info_p->input_file->page_cnt,
                                                              sort_info_p->parallelism /* hint */);
      if (parallel_num < 2 || sort_info_p->input_file->tuple_cnt <= parallel_num)
        return 1;
      sort_param->px_worker_manager = parallel_query::worker_manager::try_reserve_workers (parallel_num);
      return sort_param->px_worker_manager ? parallel_num : 1;
    }
  else if (sort_param->px_type == SORT_INDEX_LEAF)
    {
      // ... compute degree from heap page/sector counts ...
    }
  else
    return 1;   /* GROUP BY, ANALYTIC, ORDER WITH LIMIT: no parallel yet */
}
```

Only SORT_ORDER_BY and SORT_INDEX_LEAF have parallel
support. GROUP BY, analytic functions, and ORDER BY with LIMIT
fall through the return 1 (serial) path.
ORDER BY parallel split
For ORDER BY, sort_split_input_temp_file walks the input
list-file’s page chain, breaks it into parallel_num roughly
equal segments by re-pointing the prev/next VPID links, and
hands each segment to a worker as its private input. Each
worker independently runs sort_listfile_internal — phase 1
and phase 2 to a single sorted run on its private temp file —
then signals completion through a pthread_cond_t.
The main thread waits via SORT_WAIT_PARALLEL, then calls
sort_end_parallelism → sort_merge_run_for_parallel. The
final merge is multi-level: with up to SORT_MAX_PARALLEL
workers and SORT_PX_MERGE_FILES per merge step, the workers’
result runs are merged in a tree of height
⌈log_{SORT_PX_MERGE_FILES}(parallel_num)⌉, each level posting
sort_merge_nruns_parallel tasks back into the worker pool.
Finally, sort_split_last_run partitions the merged result
back across workers so each can write its share into the
output list file in parallel — this preserves the ORDER BY
ordering while letting writes parallelise.
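A quick illustration of how the run count shrinks level by level, with assumed values for the worker count and for SORT_PX_MERGE_FILES (the real constants live in external_sort.c):

```c
/* Multi-level parallel merge: runs shrink by the per-step fan-in per level. */
#include <stdio.h>

int main (void)
{
  int runs = 16;                      /* one sorted result per worker (illustrative)   */
  int fan_in = 4;                     /* assumed SORT_PX_MERGE_FILES per merge step    */
  int level = 0;

  while (runs > 1)
    {
      int merges = (runs + fan_in - 1) / fan_in;     /* merge tasks posted at this level */
      printf ("level %d: %d runs -> %d merge task(s)\n", level, runs, merges);
      runs = merges;
      level++;
    }
  printf ("merge-tree height: %d level(s)\n", level);
  return 0;
}
```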
Index-leaf parallel build
For SORT_INDEX_LEAF, the parallel split is at the file
sector level, not the input list file. sort_start_parallelism
calls file_get_all_data_sectors on each heap file and uses
parallel_query::ftab_set to partition the sector list across
workers. Each worker then runs its own
btree_sort_get_next_parallel, which iterates pages within its
sector subset and produces sort records exactly like the serial
btree_sort_get_next.
The post-merge step
(sort_merge_run_for_parallel_index_leaf_build) does the
multi-level merge tree, then calls sort_put_result_from_tmpfile
to feed the merged stream into btree_construct_leafs on the
main thread. The leaf-construction itself is not
parallelised — only the sort upstream of it is — because B+Tree
leaves are written bottom-up in strict order and parallel
appenders would have to serialise on the rightmost leaf
anyway.
flowchart TB
subgraph S1["sort_start_parallelism"]
SP1["sort_split_input_temp_file<br/>(ORDER BY: chop list-file VPID chain)"]
SP2["ftab_set::split (INDEX_LEAF: chop sector list)"]
end
subgraph WORKERS["N worker threads"]
W0["worker 0:<br/>sort_listfile_execute<br/>= phase 1 + phase 2"]
W1["worker 1: ..."]
W2["worker 2: ..."]
W3["worker 3: ..."]
end
subgraph S2["sort_end_parallelism"]
SM["sort_merge_run_for_parallel<br/>(level 0..L: merge SORT_PX_MERGE_FILES at a time)"]
SLR["sort_split_last_run<br/>(partition merged run back to workers)"]
SPF["sort_put_result_for_parallel<br/>(parallel write to output list file)"]
end
S1 --> WORKERS
WORKERS --> SM --> SLR --> SPF
The parallel layer is opt-in per call (the caller’s
SORT_PARALLEL_TYPE and compute_parallel_degree’s answer
both gate it). Workers that fail funnel their error through
sort_param->main_error_context so the main thread can re-set
it via cuberr::context::swap on completion.
Why CUBRID picked these defaults
The combination of k = 4 (small fan-in), sorted linked list (no heap), and natural mergesort (run-detection + merge) is unusual but coherent.
- A small SORT_MAX_HALF_FILES keeps the linked-list tournament cheap and avoids file-descriptor pressure on large concurrent workloads. The trade-off is more merge passes for large inputs, but PRM_ID_SR_NBUFFERS defaults on most installs are large enough that the extra passes are rare.
- Natural mergesort plus run-detection is friendly to near-sorted inputs, which the executor often hands the sort (for example, GROUP BY on a column that is already an index prefix). A quicksort-based generator would not see the pre-sortedness; this one collapses to O(N).
- Dedup-during-merge in a separate function is a deliberate duplication-of-code choice: the equality-tracking is pervasive enough that interleaving it with the plain merge via runtime checks would slow the (more common) plain merge path. The two functions share roughly 70% of their body, but each is straight-line.
Source Walkthrough
Symbols below, grouped by responsibility.
Public entry / driver
- sort_listfile — top-level entry; sets up SORT_PARAM, picks parallel vs serial, dispatches.
- sort_listfile_internal — serial driver; runs phase 1 then phase 2.
- sort_listfile_execute — parallel-worker variant; called by cubthread::callable_task.
Phase 1 — run generation
- sort_inphase_sort — fill-sort-flush loop; produces all runs.
- sort_run_sort — natural mergesort over the in-memory pointer array; calls sort_run_find then sort_run_merge.
- sort_run_find — detect a monotonic stretch (asc or desc), flip if descending, push onto SORT_STACK. Marks duplicates inline.
- sort_run_merge — merge two adjacent stack-top runs into the alternate buffer; ping-pongs low_high.
- sort_run_flip — in-place reverse (used by run-find on descending stretches).
- sort_run_flush — serialise a sorted in-memory run as a sequence of slotted pages on a FILE_TEMP file.
- sort_validate — debug-only ordering check.
Phase 2 — merge
- sort_exphase_merge — the standard k-way merge loop.
- sort_exphase_merge_elim_dup — dedup-during-merge variant.
- sort_get_numpages_of_active_infiles — count input files with at least one run remaining.
- sort_get_num_file_contents — count runs on one input file.
- sort_run_add_new / sort_run_remove_first — manage the per-file run list.
- sort_checkalloc_numpages_of_outfiles — pre-allocate output pages before each merge pass; retire output files that won’t be used.
- sort_find_inbuf_size — split buffer between input sections (one per active input) and the output section.
Backing store
- sort_add_new_file — wrap file_create_temp for sort temp files; honours tde_encrypted.
- sort_write_area / sort_read_area — bulk page I/O via pgbuf_copy_from_area / pgbuf_copy_to_area.
- sort_return_used_resources — retire all temp files, free internal memory, clean up parallel state.
- sort_retrieve_longrec — fetch a long record from multipage_file via overflow_get.
- sort_get_avg_numpages_of_nonempty_tmpfile — estimator for output-half page allocation.
Slotted-page primitives (private)
- sort_spage_initialize, sort_spage_insert, sort_spage_get_record, sort_spage_get_numrecs, sort_spage_offsetcmp, sort_spage_compact, sort_spage_find_free — local re-implementation of the slotted-page model for sort temp pages, separate from the shared slotted_page.c because temp pages have no LSN, no WAL, and no page-buffer fix.
Comparator + key info (header-exposed)
- SORT_GET_FUNC, SORT_PUT_FUNC, SORT_CMP_FUNC — callback typedefs (external_sort.h).
- SORT_REC — temp-record header with breadcrumb VPID + offset vector; lives in temp-file pages.
- SUBKEY_INFO / SORTKEY_INFO — multi-column key descriptor including per-column collation, ASC/DESC, NULLS FIRST/LAST.
- SORT_INFO — per-list-file caller’s wrapper; carries s_id, input_file, output_file, key_info, parallelism, orderby_stats.
Parallelism (server-mode only)
- sort_check_parallelism — decides degree from parallel_query::compute_parallel_degree and worker reservation.
- sort_start_parallelism / sort_end_parallelism — coordinate worker spawn and result merge.
- sort_split_input_temp_file — chop ORDER BY input list-file pages into parallel_num segments.
- sort_merge_run_for_parallel / sort_merge_run_for_parallel_index_leaf_build — multi-level tree merge across workers’ results, using SORT_PX_MERGE_FILES fan-in per level.
- sort_merge_nruns / sort_merge_nruns_parallel — single merge-tree level worker.
- sort_split_last_run — partition the final merged run back to workers for parallel write to the output list file.
- sort_put_result_for_parallel — per-worker writer of its share of the final result.
- sort_put_result_from_tmpfile — drain the final temp file through put_fn.
Caller adapters
- qfile_sort_list_with_func (list_file.c) — generic ORDER BY / DISTINCT entry; supplies qfile_get_next_sort_item + qfile_put_next_sort_item and the qfile_compare_partial_sort_record / qfile_compare_all_sort_record comparators.
- qfile_sort_list (list_file.c) — thin wrapper over qfile_sort_list_with_func for callers that don’t supply custom callbacks.
- qexec_groupby (query_executor.c) — sort-based GROUP BY driver; uses qexec_gby_get_next / qexec_gby_put_next and the spill-path qexec_hash_gby_put_next.
- qexec_orderby_distinct (query_executor.c) — top-level ORDER BY / DISTINCT driver; delegates to qfile_sort_list_with_func.
- qexec_execute_analytic (query_executor.c) — window-function driver; sorts by (PARTITION BY, ORDER BY).
- btree_index_sort (btree_load.c) — B+Tree bulk-load driver; uses btree_sort_get_next and btree_construct_leafs.
- btree_sort_get_next_parallel (btree_load.c, external_sort.h) — parallel-mode get callback for index-leaf build.
Position hints (as of 2026-05-01)
| Symbol | File | Line |
|---|---|---|
SORT_PARALLEL_TYPE | external_sort.h | 59 |
SORT_GET_FUNC / SORT_PUT_FUNC / SORT_CMP_FUNC | external_sort.h | 68 |
SORT_REC | external_sort.h | 77 |
SUBKEY_INFO | external_sort.h | 101 |
SORTKEY_INFO | external_sort.h | 131 |
SORT_INFO | external_sort.h | 142 |
sort_listfile (decl) | external_sort.h | 160 |
SORT_MAX_HALF_FILES | external_sort.c | 75 |
SORT_MAXREC_LENGTH | external_sort.c | 91 |
FILE_CONTENTS | external_sort.c | 107 |
SORT_PARAM | external_sort.c | 134 |
SORT_REC_LIST | external_sort.c | 192 |
SORT_STACK | external_sort.c | 238 |
sort_inphase_sort (decl) | external_sort.c | 253 |
sort_exphase_merge_elim_dup (decl) | external_sort.c | 255 |
sort_exphase_merge (decl) | external_sort.c | 256 |
sort_run_sort | external_sort.c | 1170 |
sort_listfile | external_sort.c | 1346 |
sort_listfile_execute (server-mode) | external_sort.c | 1599 |
sort_listfile_internal | external_sort.c | 1748 |
sort_inphase_sort | external_sort.c | 1848 |
sort_run_find | external_sort.c | 887 |
sort_run_merge | external_sort.c | 1008 |
sort_retrieve_longrec | external_sort.c | 2412 |
sort_exphase_merge_elim_dup | external_sort.c | 2456 |
sort_put_result_from_tmpfile | external_sort.c | 3245 |
sort_split_last_run | external_sort.c | 3344 |
sort_exphase_merge | external_sort.c | 3380 |
sort_split_input_temp_file | external_sort.c | 4457 |
sort_merge_run_for_parallel | external_sort.c | 4568 |
sort_merge_run_for_parallel_index_leaf_build | external_sort.c | 4733 |
sort_merge_nruns | external_sort.c | 4883 |
sort_check_parallelism | external_sort.c | 4935 |
sort_put_result_for_parallel | external_sort.c | 5040 |
sort_merge_nruns_parallel | external_sort.c | 5131 |
sort_start_parallelism | external_sort.c | 5195 |
sort_end_parallelism | external_sort.c | 5329 |
sort_write_area | external_sort.c | 5417 |
sort_read_area | external_sort.c | 5477 |
sort_get_num_half_tmpfiles | external_sort.c | 5522 |
sort_checkalloc_numpages_of_outfiles | external_sort.c | 5577 |
sort_get_numpages_of_active_infiles | external_sort.c | 5684 |
sort_find_inbuf_size | external_sort.c | 5723 |
sort_run_add_new | external_sort.c | 5746 |
sort_run_remove_first | external_sort.c | 5786 |
sort_get_num_file_contents | external_sort.c | 5807 |
qfile_sort_list_with_func | list_file.c | 4330 |
qfile_sort_list | list_file.c | 4481 |
qexec_groupby (sort path) | query_executor.c | 5247 |
qexec_groupby (hash-spill sort_listfile) | query_executor.c | 5465 |
qexec_groupby (sort-based sort_listfile) | query_executor.c | 5546 |
qexec_orderby_distinct | query_executor.c | 3936 |
qexec_orderby_distinct_by_sorting | query_executor.c | 4000 |
btree_index_sort | btree_load.c | 3208 |
btree_sort_get_next_parallel | btree_load.c | 3325 |
btree_sort_get_next | btree_load.c | 3659 |
Cross-check Notes
- cubrid-post-processing.md documents how qexec_groupby and qexec_execute_analytic use the sort callbacks (qexec_gby_put_next, qexec_hash_gby_put_next, qexec_analytic_put_next). The strategy choice (sort-based vs hash-based GROUP BY) lives there; this document only covers the sort substrate that both paths end up calling. The line-number references in cubrid-post-processing.md (e.g. external_sort.c:1347 for sort_listfile) are off by one as of 2026-05-01 — the function signature line is 1347 but the function name is on 1346. Treat the symbol as the anchor.
- cubrid-btree.md documents the bulk-load consumer (btree_construct_leafs) and the leaf-page write path. The btree_index_sort function itself is a thin wrapper over sort_listfile, walked here. The unique-constraint check is in the consumer, not the sort — SORT_DUP is used even for unique indexes because the sort comparator on (key || OID) cannot detect uniqueness violations on the key alone.
- cubrid-disk-manager.md owns the FILE_TEMP lifecycle. Sort temp files have one peculiarity: they may carry the TDE flag (includes_tde_class) so that intermediate encrypted-class data isn’t readable from raw disk. The flag rides on every sort_write_area call.
- cubrid-list-file.md (planned) — documents QFILE_LIST_ID, the input/output sink shape for ORDER BY and GROUP BY. The SORT_INFO struct in external_sort.h is the bridge.
- cubrid-hash-join.md (planned) — when a hash join spills its build side, the fallback sort-merge join uses qfile_sort_list on both inputs. The path is plain ORDER BY-style, no special put-callback.
- The deck Sort, Hash Group By.pptx (referenced from cubrid-post-processing.md) frames sort GROUP BY’s callback discipline and matches the structure here.
- The phrase “loser tree” is sometimes used informally in CUBRID code reviews; the actual implementation (SORT_REC_LIST sr_list) is a sorted singly linked list that runs in O(k) per emit rather than a tree’s O(log k). For k = 4 the two have identical asymptotic behaviour and the linked list wins on cache footprint.
Open Questions
- Adaptive run-generation algorithm. CUBRID always uses natural mergesort over the in-memory buffer. Is there a workload regime where switching to replacement selection (longer expected runs at the cost of a per-record heap update) would lower total I/O? Postgres made the opposite move — from replacement selection to quicksort, around v10/v11 — citing cache behaviour. A measurement on CUBRID’s typical ORDER-BY-heavy OLAP workloads would settle it.
- GPU-accelerated sort. Modern engines (HeavyDB, ClickHouse with custom kernels) push the in-memory sort to the GPU when available. CUBRID has no GPU integration; the question is whether the in-memory phase (everything inside sort_run_sort) could be replaced by a CUDA Thrust call with host-side DMA in/out. The temp-file phase would remain on the CPU.
- Sort-state checkpointing for restartable progress. A long-running ORDER BY on a multi-terabyte dataset that fails near the end currently restarts from scratch. Checkpointing the run-list bookkeeping every N pages would let recovery skip the completed merge passes. The bookkeeping is small enough (one FILE_CONTENTS per temp file, each a dynamic int array) that the checkpoint cost would be negligible.
- Polyphase merge for k > 4 cases. CUBRID’s hard cap of SORT_MAX_HALF_FILES = 4 means very large sorts make many passes. Raising the cap is straightforward but raises file-descriptor pressure under concurrency; an alternative is polyphase merge (Knuth §5.4.2), which keeps the same total file count but distributes runs Fibonacci-style for better merge balance. Given CUBRID’s typical sort buffer size, the gain on most workloads would be small.
- Parallel sort for GROUP BY and analytic. As of 2026-05-01, only SORT_ORDER_BY and SORT_INDEX_LEAF have parallel coverage. The SORT_GROUP_BY and SORT_ANALYTIC branches in sort_check_parallelism and sort_start_parallelism explicitly fall through to return 1 / ER_FAILED. Extending parallelism to GROUP BY is non-trivial because the per-group accumulator state needs to be merged across workers — equivalent to the hash-spill re-aggregation problem at a different layer.
- Hash-join spill fallback unification. The planned cubrid-hash-join.md cross-reference is currently a forward declaration; the actual code path (qexec_hashjoin and friends) calls qfile_sort_list directly when a partition cannot fit. There is no SORT_HASH_JOIN_SPILL parallel-type tag — the spill is treated as plain ORDER BY, which means it cannot leverage parallel sort even when the spill is large. Whether this is worth a dedicated parallel-type slot is open.
Sources
- Code:
  - src/storage/external_sort.c (driver + run gen + merge + parallel coordination)
  - src/storage/external_sort.h (callback typedefs + sort_listfile declaration)
  - src/storage/btree_load.c (btree_index_sort, btree_sort_get_next, btree_sort_get_next_parallel, btree_construct_leafs)
  - src/query/list_file.c (qfile_sort_list_with_func, qfile_sort_list, qfile_get_next_sort_item, qfile_put_next_sort_item, qfile_compare_partial_sort_record, qfile_compare_all_sort_record)
  - src/query/query_executor.c (qexec_groupby, qexec_orderby_distinct, qexec_orderby_distinct_by_sorting, qexec_execute_analytic, the GROUP BY put/get callbacks)
  - src/storage/file_manager.c (file_create_temp, file_temp_retire, file_apply_tde_algorithm, file_numerable_find_nth, file_alloc_multiple)
  - src/base/system_parameter.c (PRM_ID_SR_NBUFFERS, PRM_ID_SORT_BUFFER_SIZE)
  - src/storage/px_sort.h, src/query/px_callable_task.hpp, src/query/px_worker_manager.hpp, src/query/px_parallel.hpp, src/query/px_ftab_set.hpp (parallel infrastructure referenced from external_sort.c)
- Theoretical references:
- Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching, 2nd ed., chapter 5 (especially §5.4.1 “Multiway merging and replacement selection”, §5.4.2 “The polyphase merge”, §5.2.4 “Sorting by merging”).
- Graefe, “Query Evaluation Techniques for Large Databases”, ACM Computing Surveys 25(2), 1993, §7 “External sorting”.
- Graefe, “Implementing Sorting in Database Systems”, ACM Computing Surveys 38(3), 2006 — practical survey of in-database sort engineering.
- Petrov, Database Internals (O’Reilly, 2019), chapter 14 “Query Execution” — external-sort and sort-based GROUP BY framing.
- Garcia-Molina, Ullman, Widom, Database Systems: The Complete Book, 2nd ed., chapter 15.4 — the I/O cost model used in §“Theoretical Background”.
- Cross-references in this knowledge base:
  - knowledge/code-analysis/cubrid/cubrid-post-processing.md (sort-based GROUP BY + analytic functions over the same sort substrate)
  - knowledge/code-analysis/cubrid/cubrid-btree.md (btree_construct_leafs consumer of the bulk-load sort)
  - knowledge/code-analysis/cubrid/cubrid-disk-manager.md (FILE_TEMP lifecycle)
  - knowledge/research/dbms-general/database-internals.md (Petrov ch. 14 capture)