CUBRID External Sort — Run Generation, Multi-Way Merge, and the Sort Substrate
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
Theoretical Background
External sorting is what a database engine does when the input does not fit in RAM. The Art of Computer Programming, vol. 3 (Knuth, 2nd ed.) devotes a long stretch of chapter 5 (§5.4, external sorting) to this problem; Graefe’s two surveys — Query Evaluation Techniques (ACM CSUR, 1993) and Implementing Sorting in Database Systems (ACM CSUR, 2006) — translate Knuth into the engineering vocabulary of relational systems. Database Internals (Petrov, 2019) summarises the same model. The shared model has two phases.
Phase 1 — run generation. Read the input until the in-memory sort buffer is full, sort that chunk, and write it to a temporary file as a run. Repeat until the input is exhausted. The trade-off is between simplicity and run length:
- Quicksort / mergesort over a buffer of size M produces runs of length M each. Number of runs ≈ N/M.
- Replacement selection (Knuth §5.4.1, “Algorithm R”) maintains a heap of size M and outputs the minimum that is still ≥ the last emitted key. New input either continues the current run (when ≥ last emitted) or is held back into the next run. On random input this produces runs of expected length 2M; on partially sorted input it can produce a single run of any length. The textbook trade-off: replacement selection halves the merge fan-in needed but pays for a heap-update per record rather than a one-shot quicksort.
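To make the replacement-selection trade-off concrete, here is a minimal standalone sketch of the idea — generic C, not CUBRID code. For brevity the M-slot workspace is scanned linearly where a real implementation keeps a min-heap keyed on (run number, key):

```c
/* Replacement selection (the idea behind Knuth's Algorithm R), standalone sketch. */
#include <stdio.h>

#define M 4   /* workspace size, i.e. the sort buffer in records */

static void emit (long key, int run) { printf ("run %d: %ld\n", run, key); }

static void replacement_selection (const long *input, int n)
{
  long slot[M];
  int slot_run[M];              /* run each workspace slot belongs to */
  int i, next = 0, cur_run = 0, live = 0;

  for (i = 0; i < M && next < n; i++, next++)   /* prime the workspace */
    { slot[i] = input[next]; slot_run[i] = 0; live++; }

  while (live > 0)
    {
      int best = -1;            /* smallest key still eligible for the current run */
      for (i = 0; i < live; i++)
        if (slot_run[i] == cur_run && (best < 0 || slot[i] < slot[best]))
          best = i;
      if (best < 0)             /* nothing left for this run: start the next one */
        { cur_run++; continue; }

      emit (slot[best], cur_run);

      if (next < n)
        {
          /* Refill the slot.  A key smaller than the one just emitted cannot
           * extend the current run, so it is held back for the next run. */
          slot_run[best] = (input[next] < slot[best]) ? cur_run + 1 : cur_run;
          slot[best] = input[next++];
        }
      else                      /* end of input: shrink the workspace */
        {
          live--;
          slot[best] = slot[live];
          slot_run[best] = slot_run[live];
        }
    }
}

int main (void)
{
  long input[] = { 8, 3, 9, 4, 1, 7, 2, 6, 5 };
  replacement_selection (input, 9);
  return 0;
}
```

On the random input above the first run comes out longer than the workspace, which is exactly the ~2M effect the theory predicts.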
A further variant, seen in a few research engines, is natural
mergesort with run-detection —
walk the input, detect already-monotonic ranges, and treat each
as an existing run; sort only the boundaries. Output run length
is at least M (one buffer-full), often much more on
near-sorted input. CUBRID’s run generator is in this family
(Knuth’s “natural mergesort” plus an explicit reverse-flip for
descending stretches; see sort_run_find and sort_run_sort
below).
Phase 2 — multi-way merge. Given R runs each of length L, merge them into one sorted output. The fan-in k is bounded by the number of buffers available — one read buffer per input run plus one write buffer. With k-way merge the number of passes is ⌈log_k(R)⌉ and the total I/O is roughly 2N·⌈log_k(R)⌉ pages (read+write per pass). The data structures used at each merge step:
- Heap (winner tree / loser tree). A balanced binary tree whose leaves are the heads of the k input runs. The root is the current minimum across all runs. Each emit replaces the winner’s leaf with the next element from that run and sifts the path to the root, in O(log k) per element. This is the selection tree of Knuth §5.4.1.
- Sorted linked list. For small k (≤ 8 or so) a simple insertion-sorted singly linked list of run pointers is faster than a heap because the cache footprint is one cache line and the most common case (winner is the same run as last time) is O(1).
- Polyphase merge (Knuth §5.4.2, Gilstad 1960) — when the
number of temporary tapes/files is fixed and small, polyphase
arranges runs across files according to a Fibonacci-like
distribution so that no single file ever holds the bottleneck
run. CUBRID does not use polyphase; it uses balanced merge
with a fixed cap of
SORT_MAX_HALF_FILES = 4 input files plus 4 output files (8 total).
Deduplication during sort. When the same key appears
multiple times — DISTINCT, unique-index loading — duplicates can
be eliminated at every merge step rather than only at the
final output. This shrinks intermediate runs, which shrinks I/O
in subsequent passes. The cost is negligible: the per-emit equality
check piggybacks on the comparison the merge already performs for
ordering. CUBRID exposes
this through the SORT_DUP_OPTION enum, { SORT_DUP, SORT_ELIM_DUP },
and dispatches to a separate sort_exphase_merge_elim_dup merge
loop on the dedup path.
The I/O cost model. With sort buffer size M (in pages),
total input N pages, and merge fan-in k = M − 1 (one buffer per
input run + one write buffer), the number of passes is
⌈log_k(N/M)⌉ + 1 (one for run generation, the rest for merging).
Total I/O is roughly 2N·passes. The model says: making M bigger
helps multiplicatively (fewer runs, higher fan-in, fewer
passes); making k bigger helps logarithmically. Real engines
expose M as a sysparam (work_mem, sort_buffer_size,
sr_nbuffers) precisely because the multiplicative effect is
worth tuning per workload.
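A back-of-the-envelope rendering of the same formulas, with illustrative page counts (not taken from any real workload) and assuming quicksort-style runs of length M:

```c
/* Cost-model illustration: runs, merge passes, and total page I/O for a
 * balanced k-way merge with fan-in k = M - 1. */
#include <math.h>
#include <stdio.h>

int main (void)
{
  double N = 1000000.0;                      /* input size in pages      */
  double M = 1024.0;                         /* sort buffer in pages     */
  double k = M - 1.0;                        /* merge fan-in             */

  double runs = ceil (N / M);                /* phase-1 runs of length M */
  double merge_passes = ceil (log (runs) / log (k));
  double total_passes = merge_passes + 1.0;  /* +1 for run generation    */
  double total_io = 2.0 * N * total_passes;  /* read + write per pass    */

  printf ("runs=%.0f merge_passes=%.0f total I/O=%.0f pages\n",
          runs, merge_passes, total_io);
  return 0;
}
```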
Common DBMS Design
Engines diverge on five axes: (1) run-generation algorithm (quicksort vs. replacement-selection vs. natural mergesort), (2) merge-tree data structure (heap vs. sorted list vs. tree of losers), (3) memory model (fixed buffer vs. spillable hash), (4) parallelism (single-threaded vs. partitioned worker pool), (5) integration with B+Tree bulk load and other in-engine sort clients.
PostgreSQL — tuplesort.c
PostgreSQL’s external sort is implemented in
src/backend/utils/sort/tuplesort.c. It long followed Knuth’s
algorithm 5.4.2D — polyphase merge fed by replacement-selection
run generation. Run generation kept a binary heap of
work_mem-worth of tuples and emitted them in key order; runs were
written to “logical tapes” (logtape.c) that share a physical temp
file, and merge fan-in was bounded by the memory available for
per-tape buffers. Replacement selection was later dropped in
favour of a quicksort-and-spill run generator (around v10/v11),
and v15 replaced the polyphase merge with a balanced k-way merge —
all still in tuplesort.c; the merge step itself still uses a heap. The
client interface (tuplesort_begin_heap, tuplesort_putdatum,
tuplesort_performsort, tuplesort_getdatum) is a state
machine with explicit “begin / put / sort / get / end” phases
rather than a pair of get/put callbacks.
MySQL — filesort
MySQL’s filesort (in sql/filesort.cc) is a server-layer facility shared by all storage engines; it uses a
quicksort-based run generator (no replacement selection),
optionally with radix sort for short fixed-width keys.
Merging is k-way with a heap; k is bounded by
sort_buffer_size. MySQL’s filesort has a special
“in-memory only” fast path that skips temp files entirely when
the result fits in sort_buffer_size; the disk path is only
exercised on overflow.
SQL Server / Oracle
SQL Server’s sort iterator is a black box, but the documentation
describes “exchange and merge” with parallel sort across worker
threads when the optimizer estimates large input. Oracle’s
sort uses workareas managed by the Automatic Memory Management
infrastructure (pga_aggregate_target); when a sort exceeds its
quota, it spills to TEMPFILE.
Theory ↔ CUBRID mapping
The five axes above map onto CUBRID symbols as follows:
| Theoretical concept | CUBRID name |
|---|---|
| Top-level sort entry | sort_listfile (external_sort.c) |
| Per-call sort state | SORT_PARAM (external_sort.c:134) |
| Sort buffer (M pages) | SORT_PARAM::internal_memory, sized by PRM_ID_SR_NBUFFERS |
| Run-generation phase (phase 1) | sort_inphase_sort |
| In-memory sort algorithm | sort_run_sort → sort_run_find + sort_run_merge (natural mergesort) |
| Run-detection (Knuth “natural mergesort”) | sort_run_find (ascending or descending stretch detection + flip) |
| Run flush | sort_run_flush writes a slotted-page run into a FILE_TEMP file |
| Merge phase (phase 2) | sort_exphase_merge / sort_exphase_merge_elim_dup |
| Merge tournament data structure | SORT_REC_LIST sr_list[SORT_MAX_HALF_FILES] — sorted linked list (k ≤ 4) |
| Merge fan-in cap | SORT_MAX_HALF_FILES = 4 (8 total temp files = 4 input + 4 output) |
| Per-merge buffer split | sort_find_inbuf_size — half buffers to output, rest divided across input |
| Backing temp file | FILE_TEMP allocation via file_create_temp / sort_add_new_file |
| Long-record (multi-page) overflow | sort_param->multipage_file + overflow_insert / sort_retrieve_longrec |
| Dedup-during-merge | sort_exphase_merge_elim_dup, gated by SORT_DUP_OPTION::SORT_ELIM_DUP |
| Comparator dispatch | SORT_CMP_FUNC callback (caller-supplied) |
| Input/output sandwich | SORT_GET_FUNC / SORT_PUT_FUNC callbacks |
| Caller-tag for parallel routing | SORT_PARALLEL_TYPE enum (external_sort.h:59) |
| Parallel sort coordinator | sort_start_parallelism / sort_end_parallelism |
| Parallel-merge tree fan-in | SORT_PX_MERGE_FILES (multi-level merge of N parallel results) |
| ORDER BY caller | qfile_sort_list → qfile_sort_list_with_func → sort_listfile |
| Sort-based GROUP BY caller | qexec_groupby → sort_listfile (with qexec_gby_put_next) |
| Hash-spill GROUP BY caller | qexec_groupby → sort_listfile (with qexec_hash_gby_put_next) |
| B+Tree bulk-load caller | btree_index_sort → sort_listfile (with btree_construct_leafs) |
| Parallel index-leaf build | SORT_INDEX_LEAF parallel type, btree_sort_get_next_parallel |
CUBRID’s Approach
CUBRID’s external sort is a single C file —
src/storage/external_sort.c, ~5,850 lines — exposing exactly
two public symbols: sort_listfile (the entry point) and
btree_sort_get_next_parallel (the parallel B+Tree input
adapter). Everything else is static. The file is laid out
as five concentric layers: (a) slotted-page primitives for
temp-file pages, (b) in-memory run-find / run-merge primitives,
(c) the run-generation phase (sort_inphase_sort), (d) the
merge phase (sort_exphase_merge and its dedup twin), and (e)
the parallel coordination layer that wraps phases c and d
across SORT_PARALLEL_TYPE workers.
Top-level flow — sort_listfile
sort_listfile is the contract every caller writes against. Its
signature, declared in external_sort.h, is:
```c
// sort_listfile — src/storage/external_sort.h
extern int sort_listfile (THREAD_ENTRY * thread_p, INT16 volid, int est_inp_pg_cnt,
                          SORT_GET_FUNC * get_fn, void *get_arg,
                          SORT_PUT_FUNC * put_fn, void *put_arg,
                          SORT_CMP_FUNC * cmp_fn, void *cmp_arg,
                          SORT_DUP_OPTION option, int limit,
                          bool includes_tde_class, SORT_PARALLEL_TYPE sort_parallel_type);
```

The caller supplies three function pointers — get_fn delivers
the next input record, put_fn receives sorted records as they
emerge, cmp_fn defines order — and three opaque argument
pointers passed back to those callbacks. Whatever the caller’s
data model is (list-file tuples for ORDER BY, heap-record OIDs
for B+Tree load, hash-spill entries for GROUP BY) is invisible
to the sort module; from inside external_sort.c every record
is just RECDES { data, length, type } plus an opaque
comparator.
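To make the callback sandwich concrete, here is a schematic client in plain C with generic record and callback types — deliberately not the CUBRID typedefs or signatures, just the get/put/cmp shape they express, with a toy in-memory core standing in for the real two-phase engine:

```c
/* Schematic of the get/put/cmp "sandwich"; all names here are illustrative. */
#include <stdio.h>
#include <string.h>

typedef struct { char *data; int length; } RECORD;

typedef int  (*GET_FN) (RECORD *out, void *arg);        /* 1 = got a record, 0 = end of input */
typedef void (*PUT_FN) (const RECORD *rec, void *arg);  /* sink for records in sorted order   */
typedef int  (*CMP_FN) (const RECORD *a, const RECORD *b, void *arg);

/* Toy "sort core": fill a buffer from get_fn, order it with cmp_fn, stream it
 * to put_fn.  A real engine adds spilling and merging around this skeleton. */
static void toy_sort (GET_FN get, void *garg, PUT_FN put, void *parg, CMP_FN cmp, void *carg)
{
  RECORD buf[64];
  int n = 0, i, j;

  while (n < 64 && get (&buf[n], garg))      /* "phase 1": drain the input     */
    n++;
  for (i = 1; i < n; i++)                    /* in-memory sort (insertion sort) */
    {
      RECORD key = buf[i];
      for (j = i - 1; j >= 0 && cmp (&buf[j], &key, carg) > 0; j--)
        buf[j + 1] = buf[j];
      buf[j + 1] = key;
    }
  for (i = 0; i < n; i++)                    /* "phase 2": emit in order        */
    put (&buf[i], parg);
}

/* Caller side: the opaque arguments carry whatever state the caller needs. */
typedef struct { const char **rows; int pos, count; } SCAN;

static int get_row (RECORD *out, void *arg)
{
  SCAN *s = (SCAN *) arg;
  if (s->pos >= s->count)
    return 0;
  out->data = (char *) s->rows[s->pos++];
  out->length = (int) strlen (out->data);
  return 1;
}

static void put_row (const RECORD *rec, void *arg) { (void) arg; printf ("%s\n", rec->data); }
static int  cmp_row (const RECORD *a, const RECORD *b, void *arg) { (void) arg; return strcmp (a->data, b->data); }

int main (void)
{
  const char *rows[] = { "cherry", "apple", "banana" };
  SCAN scan = { rows, 0, 3 };
  toy_sort (get_row, &scan, put_row, NULL, cmp_row, NULL);   /* prints apple, banana, cherry */
  return 0;
}
```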
The top-level body is straightforward:
```c
// sort_listfile — src/storage/external_sort.c (condensed)
int
sort_listfile (THREAD_ENTRY * thread_p, INT16 volid, int est_inp_pg_cnt,
               SORT_GET_FUNC * get_fn, void *get_arg,
               SORT_PUT_FUNC * put_fn, void *put_arg,
               SORT_CMP_FUNC * cmp_fn, void *cmp_arg,
               SORT_DUP_OPTION option, int limit,
               bool includes_tde_class, SORT_PARALLEL_TYPE parallel_type)
{
  SORT_PARAM ori_sort_param;
  SORT_PARAM *sort_param = &ori_sort_param;
  // ... condensed: install callbacks, init temp file slots ...

  if (est_inp_pg_cnt > 0)
    input_pages = est_inp_pg_cnt + MAX ((int) (est_inp_pg_cnt * 0.1), 2);
  else
    input_pages = prm_get_integer_value (PRM_ID_SR_NBUFFERS);

  sort_param->tot_buffers = MIN (prm_get_integer_value (PRM_ID_SR_NBUFFERS), input_pages);
  sort_param->tot_buffers = MAX (4, sort_param->tot_buffers);

  sort_param->internal_memory = (char *) malloc ((size_t) sort_param->tot_buffers * (size_t) DB_PAGESIZE);
  // ... condensed: fall back to 4-page minimum on OOM ...

  sort_param->half_files = sort_get_num_half_tmpfiles (sort_param->tot_buffers, input_pages);
  sort_param->tot_tempfiles = sort_param->half_files << 1;
  sort_param->in_half = 0;

  // ... condensed: per-file dynamic arrays for run-list bookkeeping ...

#if defined(SERVER_MODE)
  sort_param->px_parallel_num = sort_check_parallelism (thread_p, sort_param);
  if (sort_param->px_parallel_num <= 1)
    error = sort_listfile_internal (thread_p, sort_param);   /* serial path */
  else
    {                           /* parallel */
      px_sort_param = malloc (sizeof (SORT_PARAM) * sort_param->px_parallel_num);
      error = sort_start_parallelism (thread_p, px_sort_param, sort_param);
      SORT_EXECUTE_PARALLEL (sort_param->px_parallel_num, px_sort_param, sort_listfile_execute);
      SORT_WAIT_PARALLEL (sort_param->px_parallel_num, sort_param, px_sort_param);
      error = sort_end_parallelism (thread_p, px_sort_param, sort_param);
    }
#else
  error = sort_listfile_internal (thread_p, sort_param);
#endif

  // ... condensed: return all temp resources, free internal memory ...
  return error;
}
```

Three things are worth pulling out. First, the sort buffer size
in pages is tot_buffers, derived as min (PRM_ID_SR_NBUFFERS, input_pages_with_10pct_slack) and floored at 4. The system
parameter sr_nbuffers (alias sort_buffer_size) is the only
knob the DBA has; the caller’s est_inp_pg_cnt only shrinks the
buffer when the input is known to be small. Second,
half_files (the merge fan-in) is computed from the expected
number of runs so that small sorts don’t allocate eight temp
files just to produce one or two runs (see
sort_get_num_half_tmpfiles further down). Third, the parallel
path is gated by a runtime check (sort_check_parallelism) —
the planner does not commit to a parallel sort, the executor
does.
sort_listfile_internal — the two-phase driver
When the parallel decision is “no”, or inside each parallel
worker, control reaches sort_listfile_internal:
```c
// sort_listfile_internal — src/storage/external_sort.c
int
sort_listfile_internal (THREAD_ENTRY * thread_p, SORT_PARAM * sort_param)
{
  int error = NO_ERROR;
  int file_pg_cnt_est;

  /* Phase 1: run generation */
  error = sort_inphase_sort (thread_p, sort_param, sort_param->get_fn, sort_param->get_arg,
                             &sort_param->total_numrecs);
  if (error != NO_ERROR)
    return error;

  if (sort_param->tot_runs > 1)
    {
      /* Allocate the output half of the temp-file pair.
       * The input half was allocated lazily inside Phase 1. */
      file_pg_cnt_est = sort_get_avg_numpages_of_nonempty_tmpfile (sort_param);
      file_pg_cnt_est = MAX (1, file_pg_cnt_est);
      for (i = sort_param->half_files; i < sort_param->tot_tempfiles; i++)
        {
          error = sort_add_new_file (thread_p, &(sort_param->temp[i]), file_pg_cnt_est,
                                     true, sort_param->tde_encrypted);
          if (error != NO_ERROR)
            return error;
        }

      /* Phase 2: merge. Two variants depending on dedup option. */
      if (sort_param->option == SORT_ELIM_DUP)
        error = sort_exphase_merge_elim_dup (thread_p, sort_param);
      else
        error = sort_exphase_merge (thread_p, sort_param);
    }
  return error;
}
```

The structure is rigid: phase 1 always runs; phase 2 only runs when more than one run was produced; the dedup variant is selected at the top of phase 2 once and not revisited. There is no streaming hand-off between phases — phase 1 must completely finish before phase 2 starts, because phase 2 reads the runs that phase 1 wrote.
flowchart TB
subgraph CALLER["Caller"]
GET["get_fn / get_arg<br/>(scan, hash spill, heap iter)"]
PUT["put_fn / put_arg<br/>(list-file writer, leaf builder)"]
CMP["cmp_fn / cmp_arg<br/>(SORTKEY_INFO, collation, NULLS, ASC/DESC)"]
end
subgraph PHASE1["Phase 1 — run generation"]
SIS["sort_inphase_sort"]
SRS["sort_run_sort<br/>(natural mergesort over buffer)"]
SRF["sort_run_flush<br/>(write run as slotted pages)"]
end
subgraph PHASE2["Phase 2 — merge"]
SEM["sort_exphase_merge<br/>or sort_exphase_merge_elim_dup"]
TRN["k-way merge tournament<br/>(SORT_REC_LIST sr_list)"]
end
subgraph TEMP["FILE_TEMP backing store"]
H1["temp[0..half_files-1]<br/>(input half)"]
H2["temp[half_files..tot]<br/>(output half)"]
OVF["multipage_file<br/>(long-record overflow)"]
end
subgraph OUT["Output"]
PS["put_fn called per record<br/>in sorted order"]
end
GET --> SIS
CMP -.-> SRS
CMP -.-> TRN
SIS --> SRS --> SRF --> H1
SIS -.long record.-> OVF
H1 --> SEM
SEM --> TRN --> H2
H2 -.swap halves on next pass.-> H1
SEM -- final pass --> PS --> PUT
The figure makes one structural point. Run generation writes
into the input half of the temp-file pool; the merge phase
reads from input and writes to output. After each merge pass
the halves swap (sort_param->in_half ^= half_files). The
final pass does not write back to a temp file — it streams
records directly to put_fn (see “very_last_run” branches in
sort_exphase_merge).
Phase 1 — run generation in sort_inphase_sort
sort_inphase_sort runs a fill-sort-flush loop. It maintains a
buffer split into two halves: a record area (growing forward
from the start of internal_memory) and an index area (growing
backward from output_buffer). Records are accumulated in
the record area; an array of pointers (the index area) is
built in parallel. When the two halves meet, the buffer is
“full” and the pointer array is sorted.
```c
// sort_inphase_sort — src/storage/external_sort.c (condensed shape)
static int
sort_inphase_sort (THREAD_ENTRY * thread_p, SORT_PARAM * sort_param,
                   SORT_GET_FUNC * get_fn, void *get_arg, unsigned int *total_numrecs)
{
  RECDES temp_recdes;
  char *item_ptr;      /* head of free space in the record area   */
  char **index_area;   /* index-area pointer cursor (grows down)  */
  char **index_buff;   /* boundary used to detect "full"          */
  long numrecs = 0, sort_numrecs = 0;
  int out_curfile = sort_param->in_half;

  output_buffer = sort_param->internal_memory + ((long) (sort_param->tot_buffers - 1) * DB_PAGESIZE);
  item_ptr = sort_param->internal_memory + SORT_RECORD_LENGTH_SIZE;
  index_area = (char **) (output_buffer - sizeof (char *));
  index_buff = index_area - 1;

  for (;;)
    {
      if ((char *) index_buff < item_ptr)
        status = SORT_REC_DOESNT_FIT;   /* buffer full */
      else
        {
          temp_recdes.data = item_ptr;
          temp_recdes.area_size = ...;  /* remaining contiguous */
          status = (*get_fn) (thread_p, &temp_recdes, get_arg);
          if (status == SORT_NOMORE_RECS)
            break;
        }

      switch (status)
        {
        case SORT_REC_DOESNT_FIT:
          /* Sort what we have, flush it as a run, reset cursors. */
          if (numrecs > 0)
            {
              index_area = sort_run_sort (thread_p, sort_param, index_area + 1, numrecs,
                                          sort_numrecs, index_buff, &numrecs);
              if (sort_param->option == SORT_ELIM_DUP && /* still room and dedup shrunk us */ ...)
                {
                  /* keep filling the same run */
                  ...
                }
              else
                {
                  error = sort_run_flush (thread_p, sort_param, out_curfile, cur_page,
                                          output_buffer, index_area, numrecs, REC_HOME);
                  numrecs = 0;
                  item_ptr = sort_param->internal_memory + SORT_RECORD_LENGTH_SIZE;
                  index_area = (char **) (output_buffer - sizeof (char *));
                  index_buff = index_area - 1;
                  if (++out_curfile >= sort_param->half_files)
                    out_curfile = sort_param->in_half;    /* round-robin */
                }
              sort_numrecs = numrecs;
            }

          /* Long-record path: temp_recdes.length > SORT_MAXREC_LENGTH */
          if (temp_recdes.length > SORT_MAXREC_LENGTH)
            {
              /* Re-fetch the full record into long_recdes.
               * Push it into sort_param->multipage_file via overflow_insert.
               * The in-buffer entry stores only the VPID pointer. */
              ...
              error = sort_run_flush (..., REC_BIGONE);
              ...
            }
          break;

        case SORT_SUCCESS:
          SORT_RECORD_LENGTH (item_ptr) = temp_recdes.length;
          *index_area = item_ptr;
          numrecs++;
          index_area--;
          index_buff -= 2;   /* one slot for the pointer, one for slack */
          item_ptr += DB_ALIGN (temp_recdes.length, MAX_ALIGNMENT) + SORT_RECORD_LENGTH_SIZE;
          break;
        }
    }

  /* Drain whatever is left in the buffer at end of input. */
  if (numrecs > 0)
    {
      index_area = sort_run_sort (..., &numrecs);
      if (sort_param->tot_runs > 0 || SORT_IS_PARALLEL (sort_param))
        sort_run_flush (..., REC_HOME);    /* one more run */
      else
        for (i = 0; i < numrecs; i++)      /* short-circuit */
          (*sort_param->put_fn) (thread_p, &temp_recdes, sort_param->put_arg);
    }
  ...
}
```

Several design decisions are worth naming.
Round-robin run distribution. Successive runs go to
successive input-half temp files (out_curfile). With
half_files = 4 and 13 runs, the layout becomes
temp[0]: runs 0,4,8,12; temp[1]: runs 1,5,9; temp[2]: runs 2,6,10; temp[3]: runs 3,7,11. This produces a balanced merge
fan-in regardless of run count and avoids the polyphase
distribution Knuth describes for tape drives.
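A standalone rendering of that placement rule (illustrative only; the constants mirror the defaults named above):

```c
/* Round-robin placement of runs onto the input-half temp files. */
#include <stdio.h>

int main (void)
{
  int half_files = 4;        /* SORT_MAX_HALF_FILES             */
  int in_half = 0;           /* first input-half file index     */
  int out_curfile = in_half;
  int run;

  for (run = 0; run < 13; run++)
    {
      printf ("run %2d -> temp[%d]\n", run, out_curfile);
      if (++out_curfile >= in_half + half_files)
        out_curfile = in_half;             /* wrap back to the first input file */
    }
  return 0;
}
```

Running it reproduces the temp[0]: 0,4,8,12 layout from the example above.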
The “run-extension on dedup” branch. When
SORT_ELIM_DUP is in effect, sorting the buffer often shrinks
it (duplicates collapse). If the post-sort tail still has room
for more records, sort_inphase_sort does not flush; it
re-positions the cursors and continues filling the same run.
This makes the actual run length variable (and often longer
than the buffer) on duplicate-heavy inputs — important for
unique-index loading where many heap rows may share a key.
Long-record overflow. If a single tuple exceeds
SORT_MAXREC_LENGTH = DB_PAGESIZE - sizeof(SLOTTED_PAGE_HEADER) - sizeof(SLOT)
— roughly one full page minus slot overhead — it cannot be
stored inline. The record is re-fetched into a heap-allocated
long_recdes, written to a side-car overflow file
(sort_param->multipage_file, lazily created by
file_create_temp on first use), and the in-buffer entry
becomes a single-VPID pointer with rec_type = REC_BIGONE. The
merge phase, when it sees a REC_BIGONE entry, calls
sort_retrieve_longrec (which goes through overflow_get) to
materialise the actual record before comparison and emission.
Long-record runs are flushed individually — they are always the
last (and only) record in their run.
The single-run short-circuit. If tot_runs == 0 after the
loop drains, the entire input fit in the buffer. There is no
point writing it to a temp file just to read it back; the code
simply iterates the sorted index area and calls put_fn
directly. sort_listfile_internal then sees tot_runs <= 1
and skips phase 2 entirely.
sort_run_sort — the natural-mergesort engine
sort_run_sort is what actually orders the in-memory pointer
array. It is not quicksort; it is Knuth’s natural mergesort:
detect monotone runs with sort_run_find (with a flip if the
run is descending), push them onto a SORT_STACK, and merge
adjacent equal-depth runs with sort_run_merge:
```c
// sort_run_sort — src/storage/external_sort.c (condensed)
static char **
sort_run_sort (THREAD_ENTRY * thread_p, SORT_PARAM * sort_param, char **base, long limit,
               long sort_numrecs, char **otherbase, long *srun_limit)
{
  // ... limit accounting elided ...
  st_p->top = -1;
  cnt = (int) (log10 (ceil ((double) limit / 2.0)) / log10 (2.0)) + 2;
  st_p->srun = db_private_alloc (NULL, cnt * sizeof (SRUN));

  do
    {
      sort_run_find (src, &src_top, st_p, limit, compare, comp_arg, option);
      if (src_top < limit)
        sort_run_find (src, &src_top, st_p, limit, compare, comp_arg, option);

      while (st_p->top >= 1
             && ((src_top >= limit)                                                      /* final */
                 || (st_p->srun[st_p->top - 1].tree_depth == st_p->srun[st_p->top].tree_depth)))  /* equal */
        sort_run_merge (dest, src, st_p, compare, comp_arg, option);
    }
  while (src_top < limit);

  // ... merge with previously-found run if any ...
}
```

The “merge equal-depth runs” rule is what gives natural
mergesort its O(N log N) bound: every merge increments the
depth, so the stack height stays at most ⌈log₂ N⌉. The
SRUN structure carries low_high ∈ {'L', 'H'} to track
whether the current run lives in the low (base) or high
(otherbase) buffer — sort_run_merge ping-pongs between
them, so each merge alternates target buffers and avoids
allocating a third one.
sort_run_find is the run detector: read the next two slots,
decide whether the run is increasing or decreasing, extend it
as long as the trend holds, flip it if descending, advance
*top past the run, push the descriptor onto the stack. On
already-sorted input it produces a single run of length N and
the merge stack has only one entry — degenerate sort cost is
O(N).
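A standalone sketch of the same detect-extend-flip idea over a plain integer array (not the CUBRID code, which works over the pointer index area with a caller-supplied comparator):

```c
/* Run detection with a reverse-flip for descending stretches. */
#include <stdio.h>

static void flip (int *a, int lo, int hi)            /* reverse a[lo..hi] in place */
{
  for (; lo < hi; lo++, hi--) { int t = a[lo]; a[lo] = a[hi]; a[hi] = t; }
}

/* Returns the index one past the monotone run starting at 'top'. */
static int run_find (int *a, int top, int n)
{
  int end = top + 1;
  if (end >= n)
    return n;                                        /* single trailing element */

  if (a[top] <= a[end])                              /* ascending stretch */
    while (end + 1 < n && a[end] <= a[end + 1]) end++;
  else                                               /* descending stretch: extend, then flip */
    {
      while (end + 1 < n && a[end] > a[end + 1]) end++;
      flip (a, top, end);
    }
  return end + 1;
}

int main (void)
{
  int a[] = { 5, 4, 3, 9, 1, 2, 8 };
  int n = 7, top = 0;
  while (top < n)
    {
      int next = run_find (a, top, n);
      printf ("run [%d..%d)\n", top, next);          /* each run is now ascending */
      top = next;
    }
  return 0;
}
```

Note that the descending branch extends only on a strict “greater than”; equal keys end the stretch, which keeps the flip from reordering duplicates.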
The dedup option is folded into sort_run_find itself via the
SORT_CHECK_DUPLICATE(stop, next_stop) macro: when two
consecutive elements compare equal under SORT_DUP the
duplicate is appended to the survivor’s chain (so the put-
callback can walk the list); under SORT_ELIM_DUP the
duplicate’s slot is set to NULL and skipped on the right-shift
compaction at the end of the run-find. Either way the merge
phase sees a duplicate-free input.
Phase 2 — sort_exphase_merge
The merge phase is a loop of merge passes. Each pass reads
act_infiles runs from the input half, merges them, and writes
the result to the output half. After the pass, halves swap.
The loop terminates when only one active run remains across all
input files — at which point that single run streams to
put_fn.
```c
// sort_exphase_merge — src/storage/external_sort.c (skeletal)
static int
sort_exphase_merge (THREAD_ENTRY * thread_p, SORT_PARAM * sort_param)
{
  while ((act_infiles = sort_get_numpages_of_active_infiles (sort_param)) > 1)
    {
      sort_checkalloc_numpages_of_outfiles (thread_p, sort_param);
      in_sectsize = sort_find_inbuf_size (sort_param->tot_buffers, act_infiles);
      out_sectsize = sort_param->tot_buffers - in_sectsize * act_infiles;
      // ... lay out per-input-section addresses ...

      num_runs = max(file_contents[i].count for i in input half);
      if (num_runs == 1)
        very_last_run = true;

      for (j = num_runs; j > 0; j--)
        {
          /* Special case: only one input run survives in this pass.
           * Just copy it to the output instead of running the merge. */
          if (!very_last_run && j == 1 && sort_get_numpages_of_active_infiles (sort_param) == 1)
            {
              // bulk page copy: read into internal_memory, write to output
              continue;
            }

          /* Initialize per-input cursors: read first page of each run, peek
           * its first record into smallest_elem_ptr[i].
           * Set up the SORT_REC_LIST tournament. */
          ...
          /* Tournament loop: emit min, advance its cursor, re-position. */
          for (;;)
            {
              min = min_p->rec_pos;
              if (very_last_run && !SORT_IS_PARALLEL (sort_param))
                (*sort_param->put_fn) (thread_p, &smallest_elem_ptr[min], sort_param->put_arg);
              else
                sort_spage_insert (out_cur_bufaddr, &smallest_elem_ptr[min]);

              /* Advance min's cursor; refill its input section if exhausted;
               * if its run is fully drained, unlink it from sr_list. */
              ...
              /* Re-establish the order of sr_list using the "last_elem_cmp"
               * shortcut — see below. */
              ...
            }
          // ... flush output section, record run length ...
        }
      /* swap input/output halves for the next pass */
      temp = sort_param->in_half;
      sort_param->in_half = out_half;
      out_half = temp;
    }
}
```

The merge tournament — sorted linked list, not a heap
CUBRID’s k-way merge is not a heap. With
SORT_MAX_HALF_FILES = 4, the maximum fan-in is 4 — too small
for a heap to pay off. Instead, the merge keeps a singly linked
list of SORT_REC_LIST nodes, one per active input file,
sorted ascending by the head of each input. Emission is “take
the head of sr_list” (O(1)); after advancing the winner’s
cursor, the new head is bubble-walked down the list until it
finds its position (O(k) worst case but typically O(1) because
in many runs the new head is still the smallest).
```c
// SORT_REC_LIST — src/storage/external_sort.c
typedef struct sort_rec_list SORT_REC_LIST;
struct sort_rec_list
{
  struct sort_rec_list *next;   /* next sorted record item            */
  int rec_pos;                  /* which input file (index 0..k-1)    */
  bool is_duplicated;           /* dedup-during-merge marker          */
};
```

The rec_pos field is what carries the tournament state — the
list nodes themselves are static, only their rec_pos values
get permuted. min_p is the head pointer; after emit, the new
head’s record is read from the input section, the list is
re-sorted (often just one swap), and min_p becomes the new
head.
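A minimal sketch of that bubble walk over a generic linked list (illustrative only; the real loop also refills input pages and applies the last_elem_cmp shortcut described below):

```c
/* Re-position the head of a list sorted by head_key after its key changed. */
#include <stdio.h>
#include <stddef.h>

typedef struct node { struct node *next; long head_key; } NODE;

static NODE *reposition_head (NODE *list)
{
  NODE *moved = list;
  NODE *cur = list->next;

  if (cur == NULL || moved->head_key <= cur->head_key)
    return list;                      /* common case: still the smallest, O(1) */

  list = cur;                         /* unlink the old head */
  while (cur->next != NULL && cur->next->head_key < moved->head_key)
    cur = cur->next;                  /* walk to the insertion point, O(k) worst case */

  moved->next = cur->next;            /* splice back in */
  cur->next = moved;
  return list;
}

int main (void)
{
  NODE c = { NULL, 9 }, b = { &c, 7 }, a = { &b, 5 };
  a.head_key = 8;                     /* the winner read its next record */
  for (NODE *p = reposition_head (&a); p != NULL; p = p->next)
    printf ("%ld ", p->head_key);     /* prints: 7 8 9 */
  printf ("\n");
  return 0;
}
```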
flowchart LR
subgraph IN["Input runs (one per input-half temp file)"]
R0["run on temp[0]<br/>head: 'apple'"]
R1["run on temp[1]<br/>head: 'banana'"]
R2["run on temp[2]<br/>head: 'cherry'"]
R3["run on temp[3]<br/>head: 'date'"]
end
subgraph TOURN["sr_list — sorted linked list"]
M["min_p → rec_pos=0<br/>(apple)"]
M2["next → rec_pos=1<br/>(banana)"]
M3["next → rec_pos=2<br/>(cherry)"]
M4["next → rec_pos=3<br/>(date)"]
M --> M2 --> M3 --> M4
end
subgraph OUT["Output run"]
OS["out_sectaddr<br/>(write buffer = half of internal_memory)"]
end
R0 --> M
R1 --> M2
R2 --> M3
R3 --> M4
M -- emit 'apple' --> OS
R0 -- next: 'avocado' --> M
M -.bubble: apple's slot now 'avocado'<br/>'avocado' > 'banana', swap.-> M2
The last_elem_cmp shortcut
A non-trivial optimisation in sort_exphase_merge: when one
run’s last record on the current input page compares less than
the second-smallest run’s head, the merge knows that every
remaining record on this page of the winning run is also
smaller than the next contender. So it skips the per-record
list re-sort entirely and just streams that page out:
```c
// sort_exphase_merge — last_elem_cmp shortcut (condensed)
if (act_slot[min_p->rec_pos] == 0)    /* new input page */
  {
    /* peek the page's last slot */
    sort_spage_get_record (in_cur_bufaddr[min_p->rec_pos], last_slot[min_p->rec_pos] - 1,
                           &last_elem_ptr, PEEK);
    last_elem_cmp = (*compare) (&last_elem_ptr.data, &smallest_elem_ptr[p->rec_pos].data, compare_arg);
  }

// ... in the main emit loop ...
if (last_elem_cmp <= 0)
  ;                                   /* already found min — skip the bubble walk */
else
  for (s = min_p; s; s = s->next)
    {
      /* re-sort sr_list */
      ...
    }
```

On nearly-disjoint inputs (e.g., a dataset already partitioned by the sort key) this collapses comparison cost to one per page rather than one per tuple.
Dedup-during-merge — sort_exphase_merge_elim_dup
The dedup variant differs from the plain merge in two places.
First, after the initial sort of sr_list, it makes a second
pass to mark consecutive equal heads as is_duplicated = true.
Second, in the emit loop, duplicate-marked records are not
written to the output area — they are silently advanced past:
```c
// sort_exphase_merge_elim_dup — emission gate (condensed)
if (min_p->is_duplicated == false)
  {
    /* OUTPUT THE RECORD (final pass: put_fn; otherwise: out section) */
    ...
  }
else
  ;   /* skip the duplicate */

/* Always advance the winner's cursor regardless of duplication. */
```

A subtle case: when the next record arrives at the start of a
fresh input page, the dedup loop also has to compare it against
the previous head of that page — the
(act_slot == last_slot - 1) && (last_elem_cmp == 0) check
handles “the last element of this page duplicates the first
element of the next page in some other run”, a case the plain
merge ignores. This is what makes sort_exphase_merge_elim_dup
a separate function rather than a parametrised version of
sort_exphase_merge — the bookkeeping is too entangled with
duplicate-tracking to merge cleanly.
Backing store — FILE_TEMP and the run files
Every temp file used by the sort is a FILE_TEMP allocation,
described in cubrid-disk-manager.md. The lifecycle:
- Allocation. sort_add_new_file calls file_create_temp with an estimated page count. The file is registered in the transaction’s temp-file list and survives until either the caller retires it or the transaction ends.
- Encryption. If the caller passes includes_tde_class = true (set when the underlying class is TDE-encrypted), the temp file is itself encrypted via file_apply_tde_algorithm. The TDE flag rides on sort_param->tde_encrypted and is checked at every sort_write_area.
- Page I/O. sort_write_area and sort_read_area use file_numerable_find_nth to map a file-relative page number to a VPID, then pgbuf_copy_from_area / pgbuf_copy_to_area to bypass the regular fix path — temp pages don’t need WAL or LSN tracking.
- Retirement. When sort_listfile exits, every temp file in sort_param->temp[] is released through sort_return_used_resources → file_temp_retire. The multipage overflow file, if created, is retired with it.
The run-list bookkeeping is a sparse layer over the temp file:
each FILE_CONTENTS entry has a dynamic num_pages[] array
recording one entry per run on that file. sort_run_add_new
appends; sort_run_remove_first advances first_run (it
doesn’t compact the array). The run is identified by
(file index, run index); its physical pages are numbered
consecutively starting at cur_page[file_index] and ending at
cur_page + num_pages[run_index].
```c
// FILE_CONTENTS — src/storage/external_sort.c
struct file_contents
{
  int *num_pages;    /* dynamic array of per-run page counts      */
  int num_slots;     /* size of the num_pages allocation          */
  int first_run;     /* index of the head run, or -1 if empty     */
  int last_run;      /* index of the tail run                     */
  int start_index;   /* used for parallel split of the last run   */
};
```

Comparator dispatch — SORTKEY_INFO and the cmp_fn
The comparator is supplied by the caller, but its argument
shape is conventional: the caller-provided cmp_arg is, by
convention across CUBRID’s callers, a SORTKEY_INFO *
describing the multi-column sort key. Each column has a
SUBKEY_INFO:
```c
// SUBKEY_INFO — src/storage/external_sort.h
struct SUBKEY_INFO
{
  int col;              /* tuple-column index               */
  int permuted_col;
  TP_DOMAIN *col_dom;
  TP_DOMAIN *cmp_dom;   /* domain for cross-domain MEDIAN   */
  DB_VALUE_COMPARE_RESULT (*sort_f) (void *tplp1, void *tplp2, TP_DOMAIN * dom,
                                     int do_coercion, int total_order, int *start_col);
  int is_desc;          /* descending column                */
  int is_nulls_first;   /* NULLS FIRST per-column           */
  bool use_cmp_dom;
};

struct SORTKEY_INFO
{
  int nkeys;
  int use_original;              /* sort tuples by ref vs by-value       */
  SUBKEY_INFO *key;
  SUBKEY_INFO default_keys[8];   /* inline allocation for common case    */
  int error;
};
```

The list-file caller (qfile_sort_list_with_func) installs
either qfile_compare_partial_sort_record (when
use_original == 1 — only some columns are key columns and
the comparator must reach back into the original tuple) or
qfile_compare_all_sort_record (when the entire tuple is the
key). The B+Tree caller installs compare_driver, which knows
how to compare midxkey arguments. The comparator is the only
place per-column collation, ASC/DESC, and NULLS-FIRST/LAST is
honoured — the rest of the sort module is collation-blind.
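A schematic of what such a comparator does, with generic column values standing in for the real tuple and domain machinery — note that how NULL placement interacts with DESC is engine-specific and deliberately simplified here:

```c
/* Multi-column comparator skeleton in the SORTKEY_INFO style (illustrative). */
#include <stdbool.h>

typedef struct { long value; bool is_null; } COLVAL;
typedef struct { int col; int is_desc; int is_nulls_first; } KEYCOL;

static int compare_tuples (const COLVAL *a, const COLVAL *b, const KEYCOL *keys, int nkeys)
{
  for (int k = 0; k < nkeys; k++)
    {
      const COLVAL *av = &a[keys[k].col], *bv = &b[keys[k].col];
      int c;

      if (av->is_null || bv->is_null)
        {
          if (av->is_null && bv->is_null)
            continue;                                  /* tie on NULL: fall through to next column */
          c = av->is_null ? -1 : 1;                    /* NULL sorts first ...                     */
          if (!keys[k].is_nulls_first)
            c = -c;                                    /* ... unless NULLS LAST                    */
          return c;
        }

      c = (av->value > bv->value) - (av->value < bv->value);
      if (c != 0)
        return keys[k].is_desc ? -c : c;               /* flip the result for DESC columns         */
    }
  return 0;                                            /* equal on every key column                */
}
```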
Caller integration
Four major callers wire sort_listfile into different parts of
the engine. Each picks a SORT_PARALLEL_TYPE tag that tells
the sort module how to (or whether to) parallelise.
flowchart TB
subgraph CALL["sort_listfile callers"]
QSL["qfile_sort_list_with_func<br/>(SORT_ORDER_BY / SORT_ORDER_WITH_LIMIT)"]
QGB["qexec_groupby<br/>(SORT_GROUP_BY)"]
QGB2["qexec_groupby (hash-spill path)<br/>(SORT_GROUP_BY)"]
QAN["qexec_execute_analytic<br/>(SORT_ANALYTIC)"]
BIS["btree_index_sort<br/>(SORT_INDEX_LEAF)"]
end
subgraph SR["sort_listfile (entry)"]
DEC["sort_check_parallelism<br/>(uses parallel_type + input pages)"]
SER["sort_listfile_internal (serial)"]
PAR["px_sort_param[1..N] + worker_manager"]
end
subgraph OUTS["per-caller put_fn"]
QFS["qfile_put_next_sort_item<br/>(append to QFILE_LIST_ID)"]
GBS["qexec_gby_put_next<br/>(per-group accumulator fold)"]
HGB["qexec_hash_gby_put_next<br/>(re-aggregate spilled groups)"]
ANP["qexec_analytic_put_next"]
BCL["btree_construct_leafs<br/>(write leaf records)"]
end
QSL --> DEC
QGB --> DEC
QGB2 --> DEC
QAN --> DEC
BIS --> DEC
DEC --> SER
DEC --> PAR
SER --> QFS
SER --> GBS
SER --> HGB
SER --> ANP
SER --> BCL
PAR --> QFS
PAR --> BCL
ORDER BY — qfile_sort_list_with_func
qfile_sort_list_with_func (in list_file.c) is the standard
ORDER BY entry. It opens an input scan on the unsorted list
file, allocates a fresh output list file (srlist_id),
initialises a SORT_INFO struct that carries both, and calls
sort_listfile with the default callbacks
qfile_get_next_sort_item and qfile_put_next_sort_item:
```c
// qfile_sort_list_with_func — src/query/list_file.c (condensed)
sort_result = sort_listfile (thread_p, NULL_VOLID, estimated_pages,
                             get_func, &info, put_func, &info,
                             cmp_func, &info.key_info,
                             dup_option, limit,
                             srlist_id->tfile_vfid->tde_encrypted,
                             parallel_type);   /* SORT_ORDER_BY */
```

dup_option is SORT_ELIM_DUP for SELECT DISTINCT and
SORT_DUP for plain ORDER BY. parallel_type is
SORT_ORDER_BY for the unlimited ORDER BY case, falling back
to SORT_ORDER_WITH_LIMIT when LIMIT is present (parallel
sort is gated off for top-K because its split-and-merge logic
doesn’t yet honour the limit). After the sort, the original
list file is destroyed and the output list file’s identifier
is copied into the caller’s slot.
For details of how the sorted output is consumed by aggregation
and window functions, see cubrid-post-processing.md.
Sort-based GROUP BY — qexec_groupby
The GROUP BY caller (in query_executor.c) does not
materialise a sorted output list at all in the sort path — the
put-callback is the aggregator:
```c
// qexec_groupby — src/query/query_executor.c (condensed)
if (sort_listfile (thread_p, NULL_VOLID, estimated_pages,
                   &qexec_gby_get_next, &gbstate, &qexec_gby_put_next, &gbstate,
                   gbstate.cmp_fn, &gbstate.key_info, SORT_DUP, NO_SORT_LIMIT,
                   gbstate.output_file->tfile_vfid->tde_encrypted, SORT_GROUP_BY) != NO_ERROR)
  GOTO_EXIT_ON_ERROR;
```

qexec_gby_put_next is invoked once per sorted record. It
compares the record’s group key against the last seen key; on
key change it flushes the previous group’s accumulators to the
output list file, resets the accumulators, and starts the new
group. The sort module never knows that aggregation is
happening — from its perspective, put_fn is just a sink.
The hash-spill fallback path uses the same pattern but with
qexec_hash_gby_put_next, which re-aggregates partial sums
that were already partially folded in the spilled hash table.
Both paths are walked in detail in
cubrid-post-processing.md.
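A schematic of that boundary-detecting put callback, with generic types standing in for the real accumulator machinery (illustrative only, not the qexec_gby_put_next code):

```c
/* Group-boundary detection inside a sort put callback (illustrative). */
#include <stdio.h>
#include <stdbool.h>

typedef struct { long group_key; long value; } ROW;
typedef struct { bool has_group; long current_key; long sum; long count; } GBY_STATE;

static void flush_group (GBY_STATE *st)
{
  printf ("group %ld: sum=%ld count=%ld\n", st->current_key, st->sum, st->count);
}

/* Called by the sort core once per record, in key order. */
static void gby_put_next (const ROW *row, GBY_STATE *st)
{
  if (st->has_group && row->group_key != st->current_key)
    {
      flush_group (st);                  /* key changed: emit the finished group */
      st->has_group = false;
    }
  if (!st->has_group)
    {
      st->has_group = true;              /* start a new group */
      st->current_key = row->group_key;
      st->sum = 0;
      st->count = 0;
    }
  st->sum += row->value;                 /* fold the record into the accumulator */
  st->count++;
}

int main (void)
{
  ROW sorted[] = { {1, 10}, {1, 5}, {2, 7}, {3, 1}, {3, 2} };   /* already in key order */
  GBY_STATE st = { false, 0, 0, 0 };
  for (int i = 0; i < 5; i++)
    gby_put_next (&sorted[i], &st);
  if (st.has_group)
    flush_group (&st);                   /* final group at end of input */
  return 0;
}
```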
B+Tree bulk load — btree_index_sort
When CREATE INDEX runs on an existing table, btree_load.c’s
btree_index_sort orchestrates bulk-load: scan the heap
file(s), produce sort records of the form
(key || OID || class_OID), sort them, and feed the sorted
stream into btree_construct_leafs which builds leaf pages
bottom-up:
```c
// btree_index_sort — src/storage/btree_load.c
return sort_listfile (thread_p, sort_args->hfids[0].vfid.volid, 0,
                      &btree_sort_get_next, sort_args,
                      out_func, out_args,        /* btree_construct_leafs */
                      compare_driver, sort_args,
                      SORT_DUP,                  /* allow dups; uniq enforced inside */
                      NO_SORT_LIMIT, includes_tde_class, SORT_INDEX_LEAF);
```

The SORT_DUP option is used even for unique indexes — the
unique check is enforced inside btree_construct_leafs rather
than by the sort module, because the comparator on
(key || OID) would treat two equal-keyed rows as different
records and the sort would not detect the violation anyway.
The SORT_INDEX_LEAF parallel type triggers the more
elaborate parallel path: each worker independently scans a
disjoint subset of the heap’s data sectors via
btree_sort_get_next_parallel, produces its own sorted runs,
and the runs are tree-merged across workers (see below). For a
walk of how btree_construct_leafs consumes the sorted stream,
see cubrid-btree.md.
Hash-join spill fallback
When a hash-join’s build side overflows memory, CUBRID’s
join executor falls back to sort-merge join — this is the
historical “spill to sort” path. The sort here is just a
plain qfile_sort_list on the build side and the probe side,
with no special put-callback; the join logic walks the two
sorted lists independently. Cross-reference
cubrid-hash-join.md for the spill conditions and the
post-spill iteration loop.
Parallel sort
Parallel sort is wired through three coordinator functions:
sort_check_parallelism (decides whether to parallelise),
sort_start_parallelism (splits the input across workers), and
sort_end_parallelism (merges per-worker results into one).
The decision is type-specific:
```c
// sort_check_parallelism — src/storage/external_sort.c (condensed)
int
sort_check_parallelism (THREAD_ENTRY * thread_p, SORT_PARAM * sort_param)
{
  if (sort_param->px_type == SORT_ORDER_BY)
    {
      SORT_INFO *sort_info_p = (SORT_INFO *) sort_param->get_arg;
      parallel_num = parallel_query::compute_parallel_degree (parallel_query::parallel_type::SORT,
                                                              sort_info_p->input_file->page_cnt,
                                                              sort_info_p->parallelism /* hint */);
      if (parallel_num < 2 || sort_info_p->input_file->tuple_cnt <= parallel_num)
        return 1;
      sort_param->px_worker_manager = parallel_query::worker_manager::try_reserve_workers (parallel_num);
      return sort_param->px_worker_manager ? parallel_num : 1;
    }
  else if (sort_param->px_type == SORT_INDEX_LEAF)
    {
      // ... compute degree from heap page/sector counts ...
    }
  else
    return 1;   /* GROUP BY, ANALYTIC, ORDER WITH LIMIT: no parallel yet */
}
```

Only SORT_ORDER_BY and SORT_INDEX_LEAF have parallel
support. GROUP BY, analytic functions, and ORDER BY with LIMIT
fall through the return 1 (serial) path.
ORDER BY parallel split
For ORDER BY, sort_split_input_temp_file walks the input
list-file’s page chain, breaks it into parallel_num roughly
equal segments by re-pointing the prev/next VPID links, and
hands each segment to a worker as its private input. Each
worker independently runs sort_listfile_internal — phase 1
and phase 2 to a single sorted run on its private temp file —
then signals completion through a pthread_cond_t.
The main thread waits via SORT_WAIT_PARALLEL, then calls
sort_end_parallelism → sort_merge_run_for_parallel. The
final merge is multi-level: with up to SORT_MAX_PARALLEL
workers and SORT_PX_MERGE_FILES per merge step, the workers’
result runs are merged in a tree of height
⌈log_{SORT_PX_MERGE_FILES}(parallel_num)⌉, each level posting
sort_merge_nruns_parallel tasks back into the worker pool.
Finally, sort_split_last_run partitions the merged result
back across workers so each can write its share into the
output list file in parallel — this preserves the ORDER BY
ordering while letting writes parallelise.
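A quick illustration of how the run count shrinks level by level, with assumed values for the worker count and for SORT_PX_MERGE_FILES (the real constants live in external_sort.c):

```c
/* Multi-level parallel merge: runs shrink by the per-step fan-in per level. */
#include <stdio.h>

int main (void)
{
  int runs = 16;                      /* one sorted result per worker (illustrative)   */
  int fan_in = 4;                     /* assumed SORT_PX_MERGE_FILES per merge step    */
  int level = 0;

  while (runs > 1)
    {
      int merges = (runs + fan_in - 1) / fan_in;     /* merge tasks posted at this level */
      printf ("level %d: %d runs -> %d merge task(s)\n", level, runs, merges);
      runs = merges;
      level++;
    }
  printf ("merge-tree height: %d level(s)\n", level);
  return 0;
}
```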
Index-leaf parallel build
For SORT_INDEX_LEAF, the parallel split is at the file
sector level, not the input list file. sort_start_parallelism
calls file_get_all_data_sectors on each heap file and uses
parallel_query::ftab_set to partition the sector list across
workers. Each worker then runs its own
btree_sort_get_next_parallel, which iterates pages within its
sector subset and produces sort records exactly like the serial
btree_sort_get_next.
The post-merge step
(sort_merge_run_for_parallel_index_leaf_build) does the
multi-level merge tree, then calls sort_put_result_from_tmpfile
to feed the merged stream into btree_construct_leafs on the
main thread. The leaf-construction itself is not
parallelised — only the sort upstream of it is — because B+Tree
leaves are written bottom-up in strict order and parallel
appenders would have to serialise on the rightmost leaf
anyway.
flowchart TB
subgraph S1["sort_start_parallelism"]
SP1["sort_split_input_temp_file<br/>(ORDER BY: chop list-file VPID chain)"]
SP2["ftab_set::split (INDEX_LEAF: chop sector list)"]
end
subgraph WORKERS["N worker threads"]
W0["worker 0:<br/>sort_listfile_execute<br/>= phase 1 + phase 2"]
W1["worker 1: ..."]
W2["worker 2: ..."]
W3["worker 3: ..."]
end
subgraph S2["sort_end_parallelism"]
SM["sort_merge_run_for_parallel<br/>(level 0..L: merge SORT_PX_MERGE_FILES at a time)"]
SLR["sort_split_last_run<br/>(partition merged run back to workers)"]
SPF["sort_put_result_for_parallel<br/>(parallel write to output list file)"]
end
S1 --> WORKERS
WORKERS --> SM --> SLR --> SPF
The parallel layer is opt-in per call (the caller’s
SORT_PARALLEL_TYPE and compute_parallel_degree’s answer
both gate it). Workers that fail funnel their error through
sort_param->main_error_context so the main thread can re-set
it via cuberr::context::swap on completion.
Why CUBRID picked these defaults
The combination of k = 4 (small fan-in), sorted linked list (no heap), and natural mergesort (run-detection + merge) is unusual but coherent.
- A small SORT_MAX_HALF_FILES keeps the linked-list tournament cheap and avoids file-descriptor pressure on large concurrent workloads. The trade-off is more merge passes for large inputs, but PRM_ID_SR_NBUFFERS defaults on most installs are large enough that the extra passes are rare.
- Natural mergesort plus run-detection is friendly to near-sorted inputs, which the executor often hands the sort (for example, GROUP BY on a column that is already an index prefix). A quicksort-based generator would not see the pre-sortedness; this one collapses to O(N).
- Dedup-during-merge in a separate function is a deliberate duplication-of-code choice: the equality-tracking is pervasive enough that interleaving it with the plain merge via runtime checks would slow the (more common) plain merge path. The two functions share roughly 70% of their body, but each is straight-line.
Source Walkthrough
Symbols below, grouped by responsibility.
Public entry / driver
- sort_listfile — top-level entry; sets up SORT_PARAM, picks parallel vs serial, dispatches.
- sort_listfile_internal — serial driver; runs phase 1 then phase 2.
- sort_listfile_execute — parallel-worker variant; called by cubthread::callable_task.
Phase 1 — run generation
- sort_inphase_sort — fill-sort-flush loop; produces all runs.
- sort_run_sort — natural mergesort over the in-memory pointer array; calls sort_run_find then sort_run_merge.
- sort_run_find — detect a monotonic stretch (asc or desc), flip if descending, push onto SORT_STACK. Marks duplicates inline.
- sort_run_merge — merge two adjacent stack-top runs into the alternate buffer; ping-pongs low_high.
- sort_run_flip — in-place reverse (used by run-find on descending stretches).
- sort_run_flush — serialise a sorted in-memory run as a sequence of slotted pages on a FILE_TEMP file.
- sort_validate — debug-only ordering check.
Phase 2 — merge
- sort_exphase_merge — the standard k-way merge loop.
- sort_exphase_merge_elim_dup — dedup-during-merge variant.
- sort_get_numpages_of_active_infiles — count input files with at least one run remaining.
- sort_get_num_file_contents — count runs on one input file.
- sort_run_add_new / sort_run_remove_first — manage the per-file run list.
- sort_checkalloc_numpages_of_outfiles — pre-allocate output pages before each merge pass; retire output files that won’t be used.
- sort_find_inbuf_size — split buffer between input sections (one per active input) and the output section.
Backing store
- sort_add_new_file — wrap file_create_temp for sort temp files; honours tde_encrypted.
- sort_write_area / sort_read_area — bulk page I/O via pgbuf_copy_from_area / pgbuf_copy_to_area.
- sort_return_used_resources — retire all temp files, free internal memory, clean up parallel state.
- sort_retrieve_longrec — fetch a long record from multipage_file via overflow_get.
- sort_get_avg_numpages_of_nonempty_tmpfile — estimator for output-half page allocation.
Slotted-page primitives (private)
- sort_spage_initialize, sort_spage_insert, sort_spage_get_record, sort_spage_get_numrecs, sort_spage_offsetcmp, sort_spage_compact, sort_spage_find_free — local re-implementation of the slotted-page model for sort temp pages, separate from the shared slotted_page.c because temp pages have no LSN, no WAL, and no page-buffer fix.
Comparator + key info (header-exposed)
- SORT_GET_FUNC, SORT_PUT_FUNC, SORT_CMP_FUNC — callback typedefs (external_sort.h).
- SORT_REC — temp-record header with breadcrumb VPID + offset vector; lives in temp-file pages.
- SUBKEY_INFO / SORTKEY_INFO — multi-column key descriptor including per-column collation, ASC/DESC, NULLS FIRST/LAST.
- SORT_INFO — per-list-file caller’s wrapper; carries s_id, input_file, output_file, key_info, parallelism, orderby_stats.
Parallelism (server-mode only)
- sort_check_parallelism — decides degree from parallel_query::compute_parallel_degree and worker reservation.
- sort_start_parallelism / sort_end_parallelism — coordinate worker spawn and result merge.
- sort_split_input_temp_file — chop ORDER BY input list-file pages into parallel_num segments.
- sort_merge_run_for_parallel / sort_merge_run_for_parallel_index_leaf_build — multi-level tree merge across workers’ results, using SORT_PX_MERGE_FILES fan-in per level.
- sort_merge_nruns / sort_merge_nruns_parallel — single merge-tree level worker.
- sort_split_last_run — partition the final merged run back to workers for parallel write to the output list file.
- sort_put_result_for_parallel — per-worker writer of its share of the final result.
- sort_put_result_from_tmpfile — drain the final temp file through put_fn.
Caller adapters
- qfile_sort_list_with_func (list_file.c) — generic ORDER BY / DISTINCT entry; supplies qfile_get_next_sort_item + qfile_put_next_sort_item and the qfile_compare_partial_sort_record / qfile_compare_all_sort_record comparators.
- qfile_sort_list (list_file.c) — thin wrapper over qfile_sort_list_with_func for callers that don’t supply custom callbacks.
- qexec_groupby (query_executor.c) — sort-based GROUP BY driver; uses qexec_gby_get_next / qexec_gby_put_next and the spill-path qexec_hash_gby_put_next.
- qexec_orderby_distinct (query_executor.c) — top-level ORDER BY / DISTINCT driver; delegates to qfile_sort_list_with_func.
- qexec_execute_analytic (query_executor.c) — window-function driver; sorts by (PARTITION BY, ORDER BY).
- btree_index_sort (btree_load.c) — B+Tree bulk-load driver; uses btree_sort_get_next and btree_construct_leafs.
- btree_sort_get_next_parallel (btree_load.c, external_sort.h) — parallel-mode get callback for index-leaf build.
Position hints (as of 2026-05-01)
| Symbol | File | Line |
|---|---|---|
SORT_PARALLEL_TYPE | external_sort.h | 59 |
SORT_GET_FUNC / SORT_PUT_FUNC / SORT_CMP_FUNC | external_sort.h | 68 |
SORT_REC | external_sort.h | 77 |
SUBKEY_INFO | external_sort.h | 101 |
SORTKEY_INFO | external_sort.h | 131 |
SORT_INFO | external_sort.h | 142 |
sort_listfile (decl) | external_sort.h | 160 |
SORT_MAX_HALF_FILES | external_sort.c | 75 |
SORT_MAXREC_LENGTH | external_sort.c | 91 |
FILE_CONTENTS | external_sort.c | 107 |
SORT_PARAM | external_sort.c | 134 |
SORT_REC_LIST | external_sort.c | 192 |
SORT_STACK | external_sort.c | 238 |
sort_inphase_sort (decl) | external_sort.c | 253 |
sort_exphase_merge_elim_dup (decl) | external_sort.c | 255 |
sort_exphase_merge (decl) | external_sort.c | 256 |
sort_run_sort | external_sort.c | 1170 |
sort_listfile | external_sort.c | 1346 |
sort_listfile_execute (server-mode) | external_sort.c | 1599 |
sort_listfile_internal | external_sort.c | 1748 |
sort_inphase_sort | external_sort.c | 1848 |
sort_run_find | external_sort.c | 887 |
sort_run_merge | external_sort.c | 1008 |
sort_retrieve_longrec | external_sort.c | 2412 |
sort_exphase_merge_elim_dup | external_sort.c | 2456 |
sort_put_result_from_tmpfile | external_sort.c | 3245 |
sort_split_last_run | external_sort.c | 3344 |
sort_exphase_merge | external_sort.c | 3380 |
sort_split_input_temp_file | external_sort.c | 4457 |
sort_merge_run_for_parallel | external_sort.c | 4568 |
sort_merge_run_for_parallel_index_leaf_build | external_sort.c | 4733 |
sort_merge_nruns | external_sort.c | 4883 |
sort_check_parallelism | external_sort.c | 4935 |
sort_put_result_for_parallel | external_sort.c | 5040 |
sort_merge_nruns_parallel | external_sort.c | 5131 |
sort_start_parallelism | external_sort.c | 5195 |
sort_end_parallelism | external_sort.c | 5329 |
sort_write_area | external_sort.c | 5417 |
sort_read_area | external_sort.c | 5477 |
sort_get_num_half_tmpfiles | external_sort.c | 5522 |
sort_checkalloc_numpages_of_outfiles | external_sort.c | 5577 |
sort_get_numpages_of_active_infiles | external_sort.c | 5684 |
sort_find_inbuf_size | external_sort.c | 5723 |
sort_run_add_new | external_sort.c | 5746 |
sort_run_remove_first | external_sort.c | 5786 |
sort_get_num_file_contents | external_sort.c | 5807 |
qfile_sort_list_with_func | list_file.c | 4330 |
qfile_sort_list | list_file.c | 4481 |
qexec_groupby (sort path) | query_executor.c | 5247 |
qexec_groupby (hash-spill sort_listfile) | query_executor.c | 5465 |
qexec_groupby (sort-based sort_listfile) | query_executor.c | 5546 |
qexec_orderby_distinct | query_executor.c | 3936 |
qexec_orderby_distinct_by_sorting | query_executor.c | 4000 |
btree_index_sort | btree_load.c | 3208 |
btree_sort_get_next_parallel | btree_load.c | 3325 |
btree_sort_get_next | btree_load.c | 3659 |
Cross-check Notes
- cubrid-post-processing.md documents how qexec_groupby and qexec_execute_analytic use the sort callbacks (qexec_gby_put_next, qexec_hash_gby_put_next, qexec_analytic_put_next). The strategy choice (sort-based vs hash-based GROUP BY) lives there; this document only covers the sort substrate that both paths end up calling. The line-number references in cubrid-post-processing.md (e.g. external_sort.c:1347 for sort_listfile) are off by one as of 2026-05-01 — the function signature line is 1347 but the function name is on 1346. Treat the symbol as the anchor.
- cubrid-btree.md documents the bulk-load consumer (btree_construct_leafs) and the leaf-page write path. The btree_index_sort function itself is a thin wrapper over sort_listfile, walked here. The unique-constraint check is in the consumer, not the sort — SORT_DUP is used even for unique indexes because the sort comparator on (key || OID) cannot detect uniqueness violations on the key alone.
- cubrid-disk-manager.md owns the FILE_TEMP lifecycle. Sort temp files have one peculiarity: they may carry the TDE flag (includes_tde_class) so that intermediate encrypted-class data isn’t readable from raw disk. The flag rides on every sort_write_area call.
- cubrid-list-file.md (planned) — documents QFILE_LIST_ID, the input/output sink shape for ORDER BY and GROUP BY. The SORT_INFO struct in external_sort.h is the bridge.
- cubrid-hash-join.md (planned) — when a hash join spills its build side, the fallback sort-merge join uses qfile_sort_list on both inputs. The path is plain ORDER BY-style, no special put-callback.
- The deck Sort, Hash Group By.pptx (referenced from cubrid-post-processing.md) frames sort GROUP BY’s callback discipline and matches the structure here.
- The phrase “loser tree” is sometimes used informally in CUBRID code reviews; the actual implementation (SORT_REC_LIST sr_list) is a sorted singly linked list that runs in O(k) per emit rather than a tree’s O(log k). For k = 4 the two have identical asymptotic behaviour and the linked list wins on cache footprint.
Open Questions
- Adaptive run-generation algorithm. CUBRID always uses natural mergesort over the in-memory buffer. Is there a workload regime where switching to replacement selection (longer expected runs at the cost of a per-record heap update) would lower total I/O? Postgres made the opposite move — from replacement selection to quicksort, around v10/v11 — citing cache behaviour. A measurement on CUBRID’s typical ORDER-BY-heavy OLAP workloads would settle it.
- GPU-accelerated sort. Modern engines (HeavyDB, ClickHouse with custom kernels) push the in-memory sort to the GPU when available. CUBRID has no GPU integration; the question is whether the in-memory phase (everything inside sort_run_sort) could be replaced by a CUDA Thrust call with host-side DMA in/out. The temp-file phase would remain on the CPU.
- Sort-state checkpointing for restartable progress. A long-running ORDER BY on a multi-terabyte dataset that fails near the end currently restarts from scratch. Checkpointing the run-list bookkeeping every N pages would let recovery skip the completed merge passes. The bookkeeping is small enough (one FILE_CONTENTS per temp file, each a dynamic int array) that the checkpoint cost would be negligible.
- Polyphase merge for k > 4 cases. CUBRID’s hard cap of SORT_MAX_HALF_FILES = 4 means very large sorts make many passes. Raising the cap is straightforward but raises file-descriptor pressure under concurrency; an alternative is polyphase merge (Knuth §5.4.2), which keeps the same total file count but distributes runs Fibonacci-style for better merge balance. Given CUBRID’s typical sort buffer size, the gain on most workloads would be small.
- Parallel sort for GROUP BY and analytic. As of 2026-05-01, only SORT_ORDER_BY and SORT_INDEX_LEAF have parallel coverage. The SORT_GROUP_BY and SORT_ANALYTIC branches in sort_check_parallelism and sort_start_parallelism explicitly fall through to return 1 / ER_FAILED. Extending parallelism to GROUP BY is non-trivial because the per-group accumulator state needs to be merged across workers — equivalent to the hash-spill re-aggregation problem at a different layer.
- Hash-join spill fallback unification. The planned cubrid-hash-join.md cross-reference is currently a forward declaration; the actual code path (qexec_hashjoin and friends) calls qfile_sort_list directly when a partition cannot fit. There is no SORT_HASH_JOIN_SPILL parallel-type tag — the spill is treated as plain ORDER BY, which means it cannot leverage parallel sort even when the spill is large. Whether this is worth a dedicated parallel-type slot is open.
Sources
- Code:
  - src/storage/external_sort.c (driver + run gen + merge + parallel coordination)
  - src/storage/external_sort.h (callback typedefs + sort_listfile declaration)
  - src/storage/btree_load.c (btree_index_sort, btree_sort_get_next, btree_sort_get_next_parallel, btree_construct_leafs)
  - src/query/list_file.c (qfile_sort_list_with_func, qfile_sort_list, qfile_get_next_sort_item, qfile_put_next_sort_item, qfile_compare_partial_sort_record, qfile_compare_all_sort_record)
  - src/query/query_executor.c (qexec_groupby, qexec_orderby_distinct, qexec_orderby_distinct_by_sorting, qexec_execute_analytic, the GROUP BY put/get callbacks)
  - src/storage/file_manager.c (file_create_temp, file_temp_retire, file_apply_tde_algorithm, file_numerable_find_nth, file_alloc_multiple)
  - src/base/system_parameter.c (PRM_ID_SR_NBUFFERS, PRM_ID_SORT_BUFFER_SIZE)
  - src/storage/px_sort.h, src/query/px_callable_task.hpp, src/query/px_worker_manager.hpp, src/query/px_parallel.hpp, src/query/px_ftab_set.hpp (parallel infrastructure referenced from external_sort.c)
- Theoretical references:
- Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching, 2nd ed., chapter 5 (especially §5.4.1 “Multiway merging and replacement selection”, §5.4.2 “The polyphase merge”, §5.2.4 “Sorting by merging”).
- Graefe, “Query Evaluation Techniques for Large Databases”, ACM Computing Surveys 25(2), 1993, §7 “External sorting”.
- Graefe, “Implementing Sorting in Database Systems”, ACM Computing Surveys 38(3), 2006 — practical survey of in-database sort engineering.
- Petrov, Database Internals (O’Reilly, 2019), chapter 14 “Query Execution” — external-sort and sort-based GROUP BY framing.
- Garcia-Molina, Ullman, Widom, Database Systems: The Complete Book, 2nd ed., chapter 15.4 — the I/O cost model used in §“Theoretical Background”.
- Cross-references in this knowledge base:
  - knowledge/code-analysis/cubrid/cubrid-post-processing.md (sort-based GROUP BY + analytic functions over the same sort substrate)
  - knowledge/code-analysis/cubrid/cubrid-btree.md (btree_construct_leafs consumer of the bulk-load sort)
  - knowledge/code-analysis/cubrid/cubrid-disk-manager.md (FILE_TEMP lifecycle)
  - knowledge/research/dbms-general/database-internals.md (Petrov ch. 14 capture)