Skip to content

PostgreSQL tuplesort — In-Memory Quicksort, External Merge Sort, Logical Tapes, and Tuplestore

Contents:

Sorting is one of the oldest problems in computing and also one of the most load-bearing inside a relational engine. ORDER BY is the obvious consumer, but sorting also underlies merge joins, GROUP BY/DISTINCT done by sorting, CREATE INDEX (a B-tree is built by sorting the heap tuples and bulk-loading the leaves), window functions, and UNION. The single fact that organizes every database sort implementation is that the data may not fit in memory. A textbook in-memory qsort assumes random access to the whole array; a database cannot assume that, so it must degrade gracefully from a fast in-memory sort to a disk-based one that touches each datum a bounded number of times. Knuth’s The Art of Computer Programming, Vol. 3 (Sorting and Searching) is the canonical reference, and PostgreSQL’s source comments cite it directly — the merge heap is “Algorithm 5.2.3H” and the run-building loop descends from “Algorithm 5.4.1, replacement selection.” Database System Concepts (Silberschatz 7e, ch. 15 “Query Processing”) frames the same material from the DBMS side as external sort-merge.

The external sort-merge algorithm has two phases:

  1. Run generation. Read the input in M-page chunks (where M is the memory budget in pages), sort each chunk in memory, and write it out as a sorted run. After this phase the input has become ⌈N/M⌉ sorted runs on disk, where N is the input size in pages.

  2. Merge. Merge the runs into one. A single merge pass can combine up to M-1 runs at once (one input buffer per run, one output buffer), so it takes ⌈log_{M-1}(N/M)⌉ passes to collapse everything into a single run. Each pass reads and writes the entire dataset once, so the total I/O is 2N · (1 + ⌈log_{M-1}(N/M)⌉) page transfers. For any realistic M, two or three passes suffice for terabytes of data — the logarithm base is large.

Two refinements on this skeleton matter for understanding real code. The first is replacement selection (Knuth 5.4.1): instead of sorting fixed M-page chunks, keep a heap of M tuples and, each time you emit the smallest, read one more from the input; if the new tuple is >= the one just emitted it can join the current run, otherwise it is held for the next run. On already-partially-sorted input this produces runs averaging 2M pages, halving the number of runs and often the number of merge passes. The classic AlphaSort paper (Nyberg et al. 1994, AlphaSort: A Cache-Sensitive Parallel External Sort, captured in dbms-papers/alphasortsigmod.md) argues the opposite for cache-bound modern hardware: replacement selection has poor locality (the heap is random-access and larger than cache), so it can be faster to generate shorter, quicksorted runs that each fit in CPU cache and pay for one extra merge pass. PostgreSQL took exactly this lesson — it abandoned replacement selection for run generation in 2016 in favor of quicksorted runs, keeping the heap only for the merge.

The second refinement is the comparison cost itself. External sort cost is usually modeled in page I/O, but for in-memory sorts of moderate data the bottleneck is the comparator: a text comparison may call strcoll into the OS collation library, which is hundreds of cycles. The abbreviated key technique (popularized for databases but old in spirit) stores a fixed-size, order-preserving proxy for the leading sort column — for text under a C-like collation, the first 8 bytes packed into a Datum — so the common case resolves with a single integer compare and only ties fall through to the expensive full comparator. This is a recurring theme: the number of comparisons is O(N log N) and hard to beat, so engines attack the cost per comparison.

Finally, the bounded case (ORDER BY ... LIMIT n) is asymptotically different. You do not need to sort N items to return the smallest n; a bounded max-heap of size n solves it in O(N log n) time and O(n) space, never spilling to disk regardless of N. The PatSort/patience-sort line of work (Chandramouli & Goldstein 2014, Patience is a Virtue, captured in dbms-papers/patsort-sigmod14.md) explores yet another axis — exploiting existing runs (“patience” run detection) in the input — which PostgreSQL does not implement, but which frames what its quicksort-based run generation leaves on the table.

Every serious engine ends up with a sort component shaped by the same forces. Naming the shared conventions first lets the PostgreSQL symbols read as choices in a known design space.

A database sort module is almost always written as a single state machine that begins optimistically in memory and falls back to disk only when it must. The shared pattern:

  • Accumulate first, decide later. Tuples are copied into an in-memory array as they arrive. A running tally of consumed memory is compared against a budget (work_mem in PostgreSQL). As long as the budget holds, no I/O and no tape machinery is touched.
  • A one-way switch to external mode. The moment the budget is exceeded the module commits to external sorting: it sorts what it has into the first run, writes it out, and from then on operates run-at-a-time. The switch is one-way — once you have spilled a run you cannot un-spill it.
  • Run generation, then merge. The build phase produces sorted runs; the merge phase combines them. The merge uses a selection tree or loser tree or simple binary heap holding the current front tuple of each run, so each output tuple costs one log(M) heap sift rather than a linear scan of M fronts.

A naive “one OS file per tape” design has a notorious flaw: during the final merge, every datum exists simultaneously in an input run and the output run, so peak disk usage is roughly 2x the data volume. Engines that care about disk footprint solve this by packing all logical tapes into a single backing file and recycling a block back into a global free pool the moment it is read — because each run is read strictly once, sequentially, its blocks become free as the merge consumes them, and the freed space is immediately reused by the output run. Peak usage drops to roughly 1x.

  • Abbreviated / normalized keys. A fixed-size sortable proxy for the leading column, compared as a machine integer, with a fallback to the full comparator on ties. Because abbreviation can fail to discriminate (e.g., a column of near-identical long strings), good implementations measure during the sort and abort the optimization if it is not paying off.
  • Type-specialized comparators. Replacing an indirect funcptr(a,b) call with an inlined integer compare for the common int4/int8/Datum cases removes both the call overhead and a branch-prediction miss per comparison.

ORDER BY ... LIMIT n is special-cased into a fixed-size heap: keep the n smallest seen so far in a max-heap keyed so the largest of the kept set is at the root; each new tuple is discarded in one comparison if it is not better than the root. This never spills and runs in O(N log n).

Engines also need to buffer a tuple stream without reordering it — for cursors that scroll backward, for Materialize nodes that re-read an inner relation, for recursive CTE working tables. This is a strictly simpler component: same in-memory-then-spill structure, same memory accounting, but no comparator and no runs — just a sequence with one or more read cursors. PostgreSQL calls it tuplestore.

flowchart TD
  IN["incoming tuples"] --> ACC["accumulate in memtuples[]<br/>track availMem"]
  ACC --> FIT{"fit in work_mem?"}
  FIT -->|"yes, at performsort"| QS["in-memory quicksort<br/>TSS_SORTEDINMEM"]
  FIT -->|"no: budget exceeded"| EXT["switch to external<br/>TSS_BUILDRUNS"]
  EXT --> RUN["quicksort this batch<br/>write sorted run to a tape"]
  RUN --> MORE{"more input?"}
  MORE -->|"yes"| RUN
  MORE -->|"no"| MERGE["merge runs with<br/>binary min-heap"]
  MERGE --> PASS{"runs left > merge order?"}
  PASS -->|"yes"| MERGE
  PASS -->|"no: 1 run"| DONE["TSS_SORTEDONTAPE /<br/>TSS_FINALMERGE on the fly"]
  QS --> OUT["gettuple returns sorted order"]
  DONE --> OUT

PostgreSQL implements all of the above in one file, tuplesort.c, with the tape mechanics factored into logtape.c and the no-sort sibling in tuplestore.c. The generic engine is type-agnostic: callers supply comparetup/writetup/readtup callbacks via tuplesortvariants.c (heap tuples, index tuples, datums, BRIN, GIN), and the core engine only ever manipulates SortTuple structs.

The SortTuple — what the engine actually shuffles

Section titled “The SortTuple — what the engine actually shuffles”

The engine never sorts your HeapTuple directly; it sorts an array of tiny fixed-size SortTuple structs, each carrying the leading key inline so the common comparison needs no heap_getattr:

// SortTuple — src/include/utils/tuplesort.h
typedef struct
{
void *tuple; /* the tuple itself */
Datum datum1; /* value of first key column */
bool isnull1; /* is first key column NULL? */
int srctape; /* source tape number */
} SortTuple;

datum1 holds either the leading column’s value (for pass-by-value types) or, when abbreviation is active, its abbreviated proxy. srctape is only used during the merge, to know which run a heap element came from. The full tuple hangs off tuple in a separate palloc chunk (or, during merge, a slab slot).

A Tuplesortstate walks through six possible states, and the whole module is a switch on the current one:

// TupSortStatus — src/backend/utils/sort/tuplesort.c
typedef enum
{
TSS_INITIAL, /* Loading tuples; still within memory limit */
TSS_BOUNDED, /* Loading tuples into bounded-size heap */
TSS_BUILDRUNS, /* Loading tuples; writing to tape */
TSS_SORTEDINMEM, /* Sort completed entirely in memory */
TSS_SORTEDONTAPE, /* Sort completed, final run is on tape */
TSS_FINALMERGE, /* Performing final merge on-the-fly */
} TupSortStatus;

TSS_INITIAL is the optimistic in-memory state. If work_mem is never exceeded, performsort quicksorts the array and moves to TSS_SORTEDINMEM — no tape, no I/O. If memory is exceeded the engine moves to TSS_BUILDRUNS and spills runs. TSS_BOUNDED is the LIMIT-n heap. The two terminal external states differ by whether a single result tape exists (TSS_SORTEDONTAPE, when the merge collapsed to one run, e.g. for random access) or whether the final merge is being performed lazily as the caller pulls tuples (TSS_FINALMERGE).

flowchart TD
  INIT["TSS_INITIAL<br/>accumulate in memtuples[]"]
  INIT -->|"bounded and count > 2*bound"| BND["TSS_BOUNDED<br/>fixed-size max-heap of n"]
  INIT -->|"work_mem exceeded:<br/>inittapes + dumptuples"| BR["TSS_BUILDRUNS<br/>quicksort each run to tape"]
  INIT -->|"performsort, fit in memory"| SIM["TSS_SORTEDINMEM<br/>final quicksort"]
  BND -->|"performsort:<br/>sort_bounded_heap"| SIM
  BR -->|"performsort: dumptuples<br/>then mergeruns"| FM["TSS_FINALMERGE<br/>merge on the fly"]
  BR -->|"mergeruns collapses to 1 run<br/>or random access requested"| SOT["TSS_SORTEDONTAPE<br/>single frozen result tape"]
  SIM --> GET["tuplesort_gettuple_common"]
  FM --> GET
  SOT --> GET

Run generation uses quicksort, not replacement selection

Section titled “Run generation uses quicksort, not replacement selection”

This is the AlphaSort lesson in code. Each time memory fills, dumptuples quicksorts the whole memtuples array and streams it to the current destination tape as one run, then resets tuple memory:

// dumptuples — src/backend/utils/sort/tuplesort.c
/* Sort all tuples accumulated within the allowed amount of memory for
* this run using quicksort */
tuplesort_sort_memtuples(state);
/* ... */
memtupwrite = state->memtupcount;
for (i = 0; i < memtupwrite; i++)
{
SortTuple *stup = &state->memtuples[i];
WRITETUP(state, state->destTape, stup);
}
state->memtupcount = 0;
/* ... reset tuplecontext, FREEMEM(tupleMem), markrunend ... */
markrunend(state->destTape);

Runs are therefore approximately work_mem-sized (modulo tape buffers), and each is quicksorted independently — cache-friendly per AlphaSort, at the cost of producing more (and thus possibly needing an extra merge pass) than replacement selection would.

The in-memory sort (both the all-in-memory case and each run) dispatches to the cheapest available quicksort. If the leading key is a bare Datum with a known integer comparator, it uses a fully inlined, type-specialized variant that avoids the comparator function-pointer call entirely:

// tuplesort_sort_memtuples — src/backend/utils/sort/tuplesort.c
if (state->base.haveDatum1 && state->base.sortKeys)
{
if (state->base.sortKeys[0].comparator == ssup_datum_unsigned_cmp)
{
qsort_tuple_unsigned(state->memtuples, state->memtupcount, state);
return;
}
#if SIZEOF_DATUM >= 8
else if (state->base.sortKeys[0].comparator == ssup_datum_signed_cmp)
{
qsort_tuple_signed(state->memtuples, state->memtupcount, state);
return;
}
#endif
else if (state->base.sortKeys[0].comparator == ssup_datum_int32_cmp)
{
qsort_tuple_int32(state->memtuples, state->memtupcount, state);
return;
}
}
/* Can we use the single-key sort function? */
if (state->base.onlyKey != NULL)
qsort_ssup(state->memtuples, state->memtupcount, state->base.onlyKey);
else
qsort_tuple(state->memtuples, state->memtupcount,
state->base.comparetup, state);

These qsort_tuple* functions are generated from sort_template.h (the same template that produces other in-tree sorts), giving an interruptible introsort-style quicksort with insertion-sort cutoffs — but specialized so the hot comparison is an inlined integer compare rather than an indirect call.

When sortKeys[0] provides an abbrev_converter, tuplesort_puttuple_common replaces datum1 with the abbreviated proxy as each tuple arrives. Crucially, it periodically asks the opclass whether abbreviation is working and aborts it (reverting all already-stored datum1s to the real value) if not:

// tuplesort_puttuple_common — src/backend/utils/sort/tuplesort.c
if (!useAbbrev)
{
/* Leave ordinary Datum representation, or NULL value. ... */
}
else if (!consider_abort_common(state))
{
/* Store abbreviated key representation */
tuple->datum1 = state->base.sortKeys->abbrev_converter(tuple->datum1,
state->base.sortKeys);
}
else
{
/* Set state to be consistent with never trying abbreviation. */
REMOVEABBREV(state, state->memtuples, state->memtupcount);
}

consider_abort_common samples at exponentially growing intervals (abbrevNext doubles each check) and calls the opclass’s abbrev_abort, which decides whether the abbreviations seen so far are discriminating enough to be worth keeping. Abbreviation is also automatically disabled once we go external, since serialized tuples on tape do not carry abbreviated keys (mergeruns nulls out abbrev_converter before reading runs back).

This section names the stable symbols, grouped by call-flow, and ends with a position-hint table. The narrative anchors are the public API (tuplesort_begin_*tuplesort_puttuple_commontuplesort_performsorttuplesort_gettuple_common), the external-sort machinery (inittapes, dumptuples, mergeruns, mergeonerun, beginmerge), the logical-tape layer, and tuplestore.

Loading tuples and the one-way switch (tuplesort.c)

Section titled “Loading tuples and the one-way switch (tuplesort.c)”

tuplesort_begin_common allocates the state and the initial memtuples array; the per-variant tuplesort_begin_heap/_begin_datum/_begin_index_* wrappers in tuplesortvariants.c install the comparator/IO callbacks. Tuples enter through tuplesort_puttuple_common, whose TSS_INITIAL arm is the heart of the in-memory-vs-external decision:

// tuplesort_puttuple_common (TSS_INITIAL arm) — src/backend/utils/sort/tuplesort.c
case TSS_INITIAL:
/* Save the tuple into the unsorted array. First, grow the array
* as needed. ... if we fail, there'll still be room to store the
* incoming tuple, and then we'll switch to tape-based operation. */
if (state->memtupcount >= state->memtupsize - 1)
{
(void) grow_memtuples(state);
Assert(state->memtupcount < state->memtupsize);
}
state->memtuples[state->memtupcount++] = *tuple;
/* ... bounded-sort switch check (see below) ... */
/* Done if we still fit in available memory and have array slots. */
if (state->memtupcount < state->memtupsize && !LACKMEM(state))
{
MemoryContextSwitchTo(oldcontext);
return;
}
/* Nope; time to switch to tape-based operation. */
inittapes(state, true);
dumptuples(state, false);
break;

The two exit conditions are array full (memtupcount >= memtupsize, the array could not grow further) and out of memory (LACKMEMavailMem < 0). Either one triggers inittapes followed by the first dumptuples. After that the state is TSS_BUILDRUNS; subsequent puttuple calls land in the TSS_BUILDRUNS arm, which just appends to the array and calls dumptuples again whenever memory fills, cutting a new run.

grow_memtuples doubles the array but is careful: it stops growing (sets growmemtuples = false) when doubling would blow the memory budget, so the array growth and the spill decision interlock cleanly.

performsort is where the accumulated state is finalized. It is a switch on the current status:

// tuplesort_performsort — src/backend/utils/sort/tuplesort.c
case TSS_INITIAL:
if (SERIAL(state))
{
/* Just qsort 'em and we're done */
tuplesort_sort_memtuples(state);
state->status = TSS_SORTEDINMEM;
}
else if (WORKER(state))
{
/* Parallel workers must still dump out tuples to tape. No
* merge is required to produce single output run, though. */
inittapes(state, false);
dumptuples(state, true);
worker_nomergeruns(state);
state->status = TSS_SORTEDONTAPE;
}
else
{
/* Leader will take over worker tapes and merge worker runs. */
leader_takeover_tapes(state);
mergeruns(state);
}
break;
case TSS_BOUNDED:
sort_bounded_heap(state); /* heap -> sorted array */
break;
case TSS_BUILDRUNS:
dumptuples(state, true); /* flush the last run */
mergeruns(state); /* sets TSS_SORTEDONTAPE or TSS_FINALMERGE */
break;

The three sub-cases of TSS_INITIAL correspond to a plain serial sort (everything fit, just quicksort), a parallel worker (must materialize its one run to a tape for the leader, even though no merge is needed locally), and the parallel leader (which has done no sorting itself and now imports the workers’ tapes and merges them).

Building tapes: inittapes, selectnewtape, tuplesort_merge_order

Section titled “Building tapes: inittapes, selectnewtape, tuplesort_merge_order”

inittapes chooses the merge order, creates the LogicalTapeSet, and selects the first output tape. The merge order — how many runs we will merge per pass — is derived purely from work_mem:

// tuplesort_merge_order — src/backend/utils/sort/tuplesort.c
mOrder = allowedMem / (2 * TAPE_BUFFER_OVERHEAD + MERGE_BUFFER_SIZE);
/* Even in minimum memory, use at least a MINORDER merge. ... do not use
* more than a MAXORDER merge. ... high order merges are quite slow due to
* CPU cache effects; it can be faster to pay the I/O cost of a multi-pass
* merge than to perform a single merge pass across many hundreds of tapes. */
mOrder = Max(mOrder, MINORDER); /* MINORDER == 6 */
mOrder = Min(mOrder, MAXORDER); /* MAXORDER == 500 */
return mOrder;

This is the modern (post-9.6) tape model: there is no fixed “7 tapes” Polyphase scheme anymore. MINORDER = 6, MAXORDER = 500, each tape budgets one BLCKSZ block plus a MERGE_BUFFER_SIZE (= BLCKSZ * 32) prefetch window. The cap at 500 is explicitly justified in the comment: beyond that, cache effects make a many-way merge slower than paying for an extra pass.

selectnewtape implements round-robin run placement: while fewer than maxTapes output tapes exist it creates a new one per run; after that it appends new runs to existing tapes modulo nOutputTapes. So if you have 1000 runs and a merge order of 500, you get 500 tapes carrying ~2 runs each, ready for a balanced multi-pass merge.

The merge: mergeruns, mergeonerun, beginmerge

Section titled “The merge: mergeruns, mergeonerun, beginmerge”

mergeruns is the balanced k-way merge driver. It first tears down the big memtuples array (it no longer needs work_mem/run worth of slots — only one slot per input tape for the heap), switches the tuple memory to a slab allocator (fixed-size slots, no fragmentation), and gives the rest of memory to the tapes as prefetch buffers:

// mergeruns — src/backend/utils/sort/tuplesort.c
/* We no longer need a large memtuples array. ... */
FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
pfree(state->memtuples);
/* Initialize the slab allocator. We need one slab slot per input tape,
* for the tuples in the heap, plus one to hold the tuple last returned
* from tuplesort_gettuple. */
if (state->base.tuples)
init_slab_allocator(state, state->nOutputTapes + 1);
/* Allocate a new 'memtuples' array, for the heap. It will hold one tuple
* from each input tape. */
state->memtupsize = state->nOutputTapes;
state->memtuples = (SortTuple *) MemoryContextAlloc(state->base.maincontext,
state->nOutputTapes * sizeof(SortTuple));
/* Use all the remaining memory we have available for tape buffers ... */
state->tape_buffer_mem = state->availMem;

Then it loops. Each iteration that starts a fresh pass turns the previous pass’s output tapes into the next pass’s input tapes, splits the buffer memory among them, and merges. If after a pass more than one run remains it loops again (multi-pass merge); if only one run remains, the merge is done. When random access is not required, mergeruns stops one short — it leaves the runs unmerged on their tapes when there are <= maxTapes of them and lets gettuple perform the final merge lazily on the fly, which is the TSS_FINALMERGE state. That saves writing the whole dataset out one last time.

The actual merge of one set of runs is a textbook heap merge:

// mergeonerun — src/backend/utils/sort/tuplesort.c
beginmerge(state); /* load 1 tuple from each input tape */
while (state->memtupcount > 0)
{
/* write the smallest (heap root) to destTape */
srcTapeIndex = state->memtuples[0].srctape;
srcTape = state->inputTapes[srcTapeIndex];
WRITETUP(state, state->destTape, &state->memtuples[0]);
if (state->memtuples[0].tuple)
RELEASE_SLAB_SLOT(state, state->memtuples[0].tuple);
/* pull next tuple from that same tape, replace heap root with it */
if (mergereadnext(state, srcTape, &stup))
{
stup.srctape = srcTapeIndex;
tuplesort_heap_replace_top(state, &stup);
}
else
{
tuplesort_heap_delete_top(state); /* that run is exhausted */
state->nInputRuns--;
}
}
markrunend(state->destTape);

beginmerge primes the heap with the front tuple of each active tape. The heap itself is a binary min-heap (tuplesort_heap_insert / tuplesort_heap_replace_top / tuplesort_heap_delete_top), implemented per Knuth 5.2.3H. replace_top is the hot operation: it overwrites the root and sifts down in one pass, avoiding the delete-then-insert pair.

When the planner knows a LIMIT n, it calls tuplesort_set_bound, and the TSS_INITIAL arm watches for the moment a fixed-size heap becomes cheaper than carrying everything:

// tuplesort_puttuple_common (bounded switch) — src/backend/utils/sort/tuplesort.c
if (state->bounded &&
(state->memtupcount > state->bound * 2 ||
(state->memtupcount > state->bound && LACKMEM(state))))
{
make_bounded_heap(state); /* -> TSS_BOUNDED */
return;
}

make_bounded_heap reverses the sort direction (so the largest of the kept set sits at the root) and builds a heap of exactly bound elements. From then on the TSS_BOUNDED arm discards any incoming tuple that is not smaller than the current root in a single comparison, or replaces the root otherwise:

// tuplesort_puttuple_common (TSS_BOUNDED arm) — src/backend/utils/sort/tuplesort.c
case TSS_BOUNDED:
if (COMPARETUP(state, tuple, &state->memtuples[0]) <= 0)
{
/* new tuple <= top of the heap, so we can discard it */
free_sort_tuple(state, tuple);
}
else
{
/* discard top of heap, replacing it with the new tuple */
free_sort_tuple(state, &state->memtuples[0]);
tuplesort_heap_replace_top(state, tuple);
}
break;

This never spills — the heap is O(bound) — and sort_bounded_heap at performsort unheapifies in place into a sorted array. This is the asymptotic win described in the theory section: O(N log n) instead of O(N log N).

Parallel sort: workers produce runs, leader merges

Section titled “Parallel sort: workers produce runs, leader merges”

PostgreSQL parallelizes sorting by having each worker sort its share of the input into exactly one run, written to a worker-owned logical tape inside a shared SharedFileSet. The leader does no scanning or sorting of its own; it imports the workers’ frozen tapes and runs one final merge over them. Three macros classify a state:

// SERIAL / WORKER / LEADER — src/backend/utils/sort/tuplesort.c
#define SERIAL(state) ((state)->shared == NULL)
#define WORKER(state) ((state)->shared && (state)->worker != -1)
#define LEADER(state) ((state)->shared && (state)->worker == -1)

worker_nomergeruns (called from a worker’s performsort) just freezes the worker’s single output tape as its result. leader_takeover_tapes is the leader-side glue: after confirming all workers finished, it imports each worker tape and arranges the leader’s state to look exactly as if it had generated those runs itself, so mergeruns can run unmodified:

// leader_takeover_tapes — src/backend/utils/sort/tuplesort.c
/* There will always be exactly 1 run per worker, and exactly one input
* tape per run, because workers always output exactly 1 run ... */
state->outputTapes = palloc0(nParticipants * sizeof(LogicalTape *));
state->nOutputTapes = nParticipants;
state->nOutputRuns = nParticipants;
for (j = 0; j < nParticipants; j++)
state->outputTapes[j] = LogicalTapeImport(state->tapeset, j, &shared->tapes[j]);
state->status = TSS_BUILDRUNS;

The cross-process shared state is set up by tuplesort_estimate_shared, tuplesort_initialize_shared, and tuplesort_attach_shared; the Sharedsort struct carries the SharedFileSet, a spinlock-protected workersFinished counter, and the per-worker TapeShare descriptors that LogicalTapeImport consumes.

Logical tapes: one file, recycled blocks (logtape.c)

Section titled “Logical tapes: one file, recycled blocks (logtape.c)”

logtape.c exists to defeat the 2x-disk problem. Its header states the contract plainly: each tape dataset (except the final output) is written and read exactly once, sequentially, so a block once read is never needed again and its space can be recycled. All tapes live in one BufFile; each is a chain of BLCKSZ blocks linked by a trailer at the end of each block:

// TapeBlockTrailer macros — src/backend/utils/sort/logtape.c
#define TapeBlockPayloadSize (BLCKSZ - sizeof(TapeBlockTrailer))
#define TapeBlockGetTrailer(buf) \
((TapeBlockTrailer *) ((char *) buf + TapeBlockPayloadSize))
#define TapeBlockIsLast(buf) (TapeBlockGetTrailer(buf)->next < 0)

The recycling happens during reads. ltsReadBlock reads a block and, unless the tape is frozen, immediately returns it to the freelist:

// ltsReadFillBuffer (read+recycle) — src/backend/utils/sort/logtape.c
ltsReadBlock(lt->tapeSet, datablocknum, thisbuf);
if (!lt->frozen)
ltsReleaseBlock(lt->tapeSet, datablocknum);
lt->curBlockNumber = lt->nextBlockNumber;
lt->nbytes += TapeBlockGetNBytes(thisbuf);

The freelist is a min-heap of block numbers so writes always prefer the lowest free block, keeping the file compact and writes as sequential as possible. ltsGetFreeBlock pops the heap root (or extends the file if empty), and ltsReleaseBlock pushes a freed block back:

// ltsGetFreeBlock — src/backend/utils/sort/logtape.c
/* freelist empty; allocate a new block */
if (lts->nFreeBlocks == 0)
return lts->nBlocksAllocated++;
/* remove top of minheap */
blocknum = heap[0];
holeval = heap[--lts->nFreeBlocks];
/* sift down ... choose smaller child, stop when holeval <= both children */

Because every block freed by a read is instantly available to the merge’s output run, peak file size stays near the actual data volume. When a sort result must support random access (a scrollable cursor, or a merge join that mark/restores), the final tape is frozen by LogicalTapeFreeze, which flushes the last partial block, switches to a single-block read buffer, and stops recycling so the tape can be re-read and seeked:

// LogicalTapeFreeze — src/backend/utils/sort/logtape.c
if (lt->dirty)
{
TapeBlockSetNBytes(lt->buffer, lt->nbytes);
ltsWriteBlock(lt->tapeSet, lt->curBlockNumber, lt->buffer);
}
lt->writing = false;
lt->frozen = true;
/* ... reset to single-BLCKSZ buffer, read first block, set up for seeks ... */

logtape.c also supports enable_prealloc (block preallocation per tape) for the HashAgg use case, where many tapes are written concurrently and interleaved single-block allocation would fragment badly; TAPE_WRITE_PREALLOC_MIN/MAX bound the doubling preallocation window.

Tuplestore: the no-sort sibling (tuplestore.c)

Section titled “Tuplestore: the no-sort sibling (tuplestore.c)”

tuplestore.c is “a dumbed-down version of tuplesort.c; it does no sorting of tuples but can only store and regurgitate a sequence.” Its key extra feature is multiple independent read pointers — needed so a Materialize node, or a cursor scrolling backward, or a recursive CTE working table, can re-read the buffered sequence. Each pointer carries its own position:

// TSReadPointer — src/backend/utils/sort/tuplestore.c
typedef struct
{
int eflags; /* capability flags */
bool eof_reached; /* read has reached EOF */
int current; /* next array index to read */
int file; /* temp file# */
off_t offset; /* byte offset in file */
} TSReadPointer;

It has the same three-state shape as tuplesort but no runs and no comparator:

// TupStoreStatus — src/backend/utils/sort/tuplestore.c
typedef enum
{
TSS_INMEM, /* Tuples still fit in memory */
TSS_WRITEFILE, /* Writing to temp file */
TSS_READFILE, /* Reading from temp file */
} TupStoreStatus;

tuplestore_puttuple_common accumulates in memtuples[] while in TSS_INMEM; when LACKMEM trips it creates a single BufFile, freezes the backward-scan decision (you cannot start storing length-prefixed records and then change your mind about backward scan), dumps everything, and enters TSS_WRITEFILE:

// tuplestore_puttuple_common (TSS_INMEM spill) — src/backend/utils/sort/tuplestore.c
if (state->memtupcount < state->memtupsize && !LACKMEM(state))
return;
/* Nope; time to switch to tape-based operation. */
PrepareTempTablespaces();
state->myfile = BufFileCreateTemp(state->interXact);
state->backward = (state->eflags & EXEC_FLAG_BACKWARD) != 0;
tuplestore_updatemax(state);
state->status = TSS_WRITEFILE;
dumptuples(state);

Reading toggles between TSS_WRITEFILE and TSS_READFILE by seeking the single BufFile to the active pointer’s position — the comment at the top of the file documents that only the active pointer’s position is implicitly the file’s seek position, while inactive pointers carry explicit file/offset. tuplestore_trim discards tuples before the oldest read pointer when rewind has been disabled for all pointers, which is how a Materialize that is read strictly forward keeps memory bounded. Unlike tuplesort, tuplestore never uses logtape.c: there are no runs to recycle, so a plain BufFile suffices. (sharedtuplestore.c is a separate, parallel-aware variant used by parallel hash join, out of scope here.)

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
TupSortStatus (enum)tuplesort.c154
SLAB_SLOT_SIZE / SlabSlottuplesort.c142
MINORDER / MAXORDER / MERGE_BUFFER_SIZEtuplesort.c176
Tuplesortstate (struct)tuplesort.c185
SERIAL / WORKER / LEADERtuplesort.c403
tuplesort_begin_commontuplesort.c642
grow_memtuplestuplesort.c1052
tuplesort_puttuple_commontuplesort.c1169
consider_abort_commontuplesort.c1318
tuplesort_performsorttuplesort.c1363
tuplesort_gettuple_commontuplesort.c1470
tuplesort_merge_ordertuplesort.c1778
inittapestuplesort.c1865
selectnewtapetuplesort.c1948
init_slab_allocatortuplesort.c1981
mergerunstuplesort.c2017
mergeoneruntuplesort.c2200
beginmergetuplesort.c2260
dumptuplestuplesort.c2307
make_bounded_heaptuplesort.c2587
sort_bounded_heaptuplesort.c2636
tuplesort_sort_memtuplestuplesort.c2676
tuplesort_heap_inserttuplesort.c2738
tuplesort_heap_replace_toptuplesort.c2798
worker_nomergerunstuplesort.c3047
leader_takeover_tapestuplesort.c3068
ssup_datum_unsigned_cmptuplesort.c3139
SortTuple (struct)tuplesort.h148
LogicalTape (struct)logtape.c137
LogicalTapeSet (struct)logtape.c187
TapeBlockGetTrailer (macro)logtape.c104
ltsGetFreeBlocklogtape.c371
ltsReleaseBlocklogtape.c469
LogicalTapeSetCreatelogtape.c556
LogicalTapeImportlogtape.c609
LogicalTapeWritelogtape.c761
LogicalTapeRewindForReadlogtape.c846
LogicalTapeReadlogtape.c928
LogicalTapeFreezelogtape.c981
TupStoreStatus (enum)tuplestore.c72
TSReadPointer (struct)tuplestore.c91
Tuplestorestate (struct)tuplestore.c103
tuplestore_begin_heaptuplestore.c330
tuplestore_alloc_read_pointertuplestore.c395
tuplestore_puttuple_commontuplestore.c799
tuplestore_gettupletuplestore.c955
dumptuples (tuplestore)tuplestore.c1258
tuplestore_trimtuplestore.c1424
  • Run generation uses quicksort, not replacement selection. Verified in dumptuples, which calls tuplesort_sort_memtuples (a quicksort) on the whole memtuples array and streams it to the destination tape as one run. There is no Knuth-5.4.1 replacement-selection heap in the build phase; the heap (tuplesort_heap_*) is used only in mergeonerun/beginmerge for the merge. This matches the AlphaSort cache-locality argument and the project’s documented 2016 switch away from replacement selection.

  • The in-memory-vs-external decision is a one-way switch keyed on LACKMEM or array exhaustion. Verified in tuplesort_puttuple_common’s TSS_INITIAL arm: the engine stays in memory while memtupcount < memtupsize && !LACKMEM(state), and otherwise calls inittapes + dumptuples, after which state->status == TSS_BUILDRUNS and never returns to TSS_INITIAL.

  • Merge order is computed from work_mem and clamped to [6, 500]. Verified in tuplesort_merge_order: mOrder = allowedMem / (2 * TAPE_BUFFER_OVERHEAD + MERGE_BUFFER_SIZE), then Max(mOrder, MINORDER) and Min(mOrder, MAXORDER) with MINORDER = 6, MAXORDER = 500. The 500 cap is justified in the source comment by CPU-cache effects of high-order merges.

  • All logical tapes share one temp file and blocks are recycled on read via a min-heap freelist. Verified in the logtape.c header comment, in ltsReadFillBuffer (calls ltsReleaseBlock immediately after ltsReadBlock unless the tape is frozen), and in ltsGetFreeBlock / ltsReleaseBlock (a min-heap of block numbers, popping the lowest free block for the next write). This is what bounds peak disk usage near 1x.

  • Bounded sort uses a fixed-size reversed heap that never spills. Verified in the TSS_INITIAL bounded-switch test (memtupcount > bound * 2 || (... && LACKMEM)), in make_bounded_heap (reversedirection then a heap of exactly bound elements), and in the TSS_BOUNDED arm (discard-or-replace in one comparison). sort_bounded_heap unheapifies in place at performsort.

  • Abbreviated keys are installed at put time and abortable at runtime. Verified in tuplesort_puttuple_common (calls abbrev_converter to fill datum1, or REMOVEABBREV if aborting) and consider_abort_common (samples at exponentially growing abbrevNext intervals and calls abbrev_abort). mergeruns nulls abbrev_converter before reading runs back, since on-tape tuples lack abbreviated keys.

  • Type-specialized quicksorts bypass the comparator function pointer. Verified in tuplesort_sort_memtuples: it dispatches to qsort_tuple_unsigned/_signed/_int32 when sortKeys[0].comparator is the matching ssup_datum_*_cmp, else qsort_ssup (single key) or the generic qsort_tuple.

  • Parallel sort: each worker emits exactly one run on its own tape; the leader imports them and runs one final merge. Verified in tuplesort_performsort (the WORKER arm: inittapes(false) + dumptuples(true) + worker_nomergeruns) and leader_takeover_tapes (imports nParticipants worker tapes via LogicalTapeImport, sets nOutputRuns = nParticipants, status TSS_BUILDRUNS, then mergeruns).

  • Tuplestore does no sorting and supports multiple read pointers, spilling to a plain BufFile. Verified in the tuplestore.c header (“dumbed-down version of tuplesort.c; it does no sorting”), TSReadPointer, tuplestore_alloc_read_pointer, and tuplestore_puttuple_common (spills via BufFileCreateTemp, never LogicalTapeSet). tuplestore_trim reclaims tuples before the oldest read pointer.

  1. When does TSS_FINALMERGE (lazy on-the-fly merge) actually beat materializing one result tape? mergeruns stops one merge pass short when random access is not required and there are <= maxTapes runs, letting gettuple merge lazily. The break-even between saving one full write/read pass and paying per-tuple heap maintenance during the scan is not characterized in the code. Investigation path: compare TSS_FINALMERGE versus a forced final merge under varying run counts.

  2. How often does abbreviation abort in practice, and at what wasted cost? consider_abort_common reverts already-stored datum1s via REMOVEABBREV when it gives up. The cost of converting-then-reverting on adversarial (non-discriminating) leading keys, and how aggressive the doubling abbrevNext schedule is at catching it early, is workload-dependent and unmeasured here.

  3. Whether the min-heap freelist’s “lowest free block” write policy still matters on SSDs. The logtape.c comment itself flags the LIFO-vs-min-heap choice as unclear (“XXX perhaps a LIFO policy for free blocks would be better?”). On seek-free storage the sequentiality argument weakens, but the fragmentation/compactness argument may remain. Investigation path: measure temp-file size and write amplification under both policies on NVMe.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”
  • AlphaSort and cache-conscious run generation. PostgreSQL’s choice of quicksorted (not replacement-selected) runs is the AlphaSort thesis applied to a disk-oriented engine: short cache-fitting runs plus an extra merge pass can beat long replacement-selection runs whose heap thrashes the cache. A measurement of run-count vs. merge-pass tradeoff on modern hardware would quantify what PostgreSQL gives up by not adapting the run-generation strategy to input pre-sortedness. Paper: Nyberg et al. 1994 (dbms-papers/alphasortsigmod.md).

  • Patience sort / run detection. Chandramouli & Goldstein 2014 (Patience is a Virtue, dbms-papers/patsort-sigmod14.md) detect naturally occurring ascending runs in the input and merge them, exploiting partial pre-sortedness that PostgreSQL’s fixed-batch quicksort runs ignore entirely. On nearly-sorted inputs (e.g., append-mostly time-series) a patience-style run detector could collapse to near-linear; positioning it against PostgreSQL’s dumptuples is a concrete research frontier.

  • CUBRID’s external sort. CUBRID has its own external sort with a tape/run model and its own temp-file recycling; a side-by-side of CUBRID’s run merge against mergeruns/logtape.c would sharpen what PostgreSQL’s single-file block-recycling min-heap buys versus a per-tape file scheme. (See the CUBRID query-execution analysis in the cubrid tree.)

  • Vectorized / SIMD sort. Modern analytical engines sort with branch-free, SIMD-accelerated network sorts and radix passes on fixed-width keys, far beyond PostgreSQL’s specialized-but-scalar qsort_tuple_int32. Abbreviated keys are PostgreSQL’s halfway step toward “sort on a fixed-width proxy”; a full normalized-key radix sort is the unrealized extension.

  • Spilling philosophy: tuplesort vs. tuplestore vs. HashAgg. All three spill to temp files but with different recyclers (logtape min-heap for sort, plain BufFile for tuplestore, preallocating logtape for HashAgg). A unified account of the three spill paths and why they diverge would clarify the temp-file subsystem as a whole. (See postgres-memory-contexts.md for the allocator side and postgres-agg-sort-nodes.md for HashAgg spilling.)

In-tree source files (REL_18_STABLE, commit 273fe94)

Section titled “In-tree source files (REL_18_STABLE, commit 273fe94)”
  • src/backend/utils/sort/tuplesort.c — the generic sort engine: state machine, puttuple/performsort/gettuple, run generation (dumptuples), tape setup (inittapes, tuplesort_merge_order, selectnewtape), the merge (mergeruns, mergeonerun, beginmerge), the merge heap (tuplesort_heap_*), bounded sort (make_bounded_heap, sort_bounded_heap), abbreviation (consider_abort_common), and parallel glue (worker_nomergeruns, leader_takeover_tapes).
  • src/backend/utils/sort/tuplesortvariants.c — per-type begin/comparetup/ writetup/readtup callbacks (heap, datum, index btree/hash, BRIN, GIN); the specialized ssup_datum_* comparators wire in here.
  • src/backend/utils/sort/logtape.c — the logical-tape set: one BufFile, block chains with trailers, the min-heap freelist (ltsGetFreeBlock/ ltsReleaseBlock), read-time recycling, LogicalTapeFreeze, preallocation.
  • src/backend/utils/sort/tuplestore.c — the no-sort sequence buffer with multiple read pointers (TSReadPointer), in-mem-then-BufFile spilling, tuplestore_trim.
  • src/backend/utils/sort/sharedtuplestore.c — parallel-aware shared tuple store (parallel hash join); referenced for contrast, out of scope.
  • src/backend/utils/sort/sortsupport.cSortSupport setup, abbreviation hookup.
  • src/include/utils/tuplesort.hSortTuple, TuplesortPublic, the public API and SortCoordinate/Sharedsort for parallel sort.
  • Knuth, D. E. The Art of Computer Programming, Vol. 3: Sorting and Searching. The source comments cite Algorithm 5.2.3H (the merge heap) and the replacement-selection material (5.4.1) that PostgreSQL deliberately no longer uses for run generation.
  • Nyberg, Barclay, Cvetanovic, Gray, Lomet (1994). “AlphaSort: A Cache-Sensitive Parallel External Sort.” SIGMOD. The cache-locality argument for quicksorted runs over replacement selection (knowledge/research/dbms-papers/alphasortsigmod.md).
  • Chandramouli & Goldstein (2014). “Patience is a Virtue: Revisiting Merge and Sort on Modern Processors.” SIGMOD. Run detection and patience sort (knowledge/research/dbms-papers/patsort-sigmod14.md).
  • Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 15 “Query Processing” — the external sort-merge algorithm, run generation, and merge-pass cost model (knowledge/research/dbms-general/).
  • Database Internals (Petrov 2019), ch. 7 — external sorting, run merging, and temp-file management framing.

Sibling docs (cross-references — mechanism owned there, not duplicated here)

Section titled “Sibling docs (cross-references — mechanism owned there, not duplicated here)”
  • postgres-agg-sort-nodes.md — the Sort, IncrementalSort, and Agg/HashAgg plan nodes that drive tuplesort and tuplestore; HashAgg’s own spill path uses logtape.c preallocation.
  • postgres-executor.md — the executor scaffolding (Materialize, cursors, recursive CTE working tables) that instantiates tuplestores.
  • postgres-memory-contexts.md — the work_mem budget, the slab/bump allocators that tuplesort uses for tuple memory, and MemoryContextReset.
  • postgres-parallel-query.md — the parallel-worker/DSM machinery underneath tuplesort_initialize_shared / SharedFileSet.
  • postgres-nbtree.mdCREATE INDEX bulk-loads a B-tree from a tuplesort of the heap (tuplesort_begin_index_btree).