PostgreSQL Parallel Query — From No Parallelism to Parallel Joins, Aggregates, and Maintenance

This is a version-evolution document. It traces how one subsystem — intra-query parallelism — was built up across PostgreSQL major releases, ending at the current REL_18 state. It does not re-derive the mechanism; for how parallel execution works today (the DSM plan, the shm_mq tuple queues, worker state restoration, gather_readnext), read the current-state module doc postgres-parallel-query.md. This doc answers a different question: what changed in each release, and why.

Contents:

Why this subsystem had to evolve — the single-process limitation and what made parallelism finally feasible.
Timeline — the eras from 9.6 through REL_18, left to right.
Pre-9.6 — the single-process executor — the starting point and the infrastructure that had to land first.
9.6 — parallel sequential scan and Gather — the first parallel node; the partial-path concept; the read-only contract.
10 — parallel joins, parallel aggregate, and GatherMerge — making the subtree below Gather worth parallelizing; order-preserving merge.
11 — parallel CREATE INDEX and parallel-aware Append — parallelism enters DDL and the partition-scan layer.
12 — planner and cost refinements — parallel-safety bookkeeping and better partial-path generation.
13 — parallel VACUUM — parallelism reaches maintenance.
14–18 — incremental hardening — leader-participation tuning, more parallel-safe paths, and node-level extensions.
Where it stands at REL_18 — the consolidated design and the PG19 forward note.
Sources — release notes, module docs, key source files.

Why this subsystem had to evolve (the original limitation)

For roughly its first two decades PostgreSQL executed every query in a single backend process. One client connection mapped to one OS process (forked by the postmaster), and that one process ran the entire plan tree top to bottom by itself. This is the model still described structurally in postgres-executor.md: the executor is a demand-pull tree of PlanState nodes, and ExecProcNode pulls one tuple at a time up through that tree inside a single process.

That design was a deliberate, defensible choice. A single-process executor is simple to reason about: there is exactly one snapshot, one transaction state, one set of locks, one memory hierarchy of contexts, and no cross-process coordination to get wrong. For OLTP — many small concurrent transactions — it is also close to optimal, because the server is already parallel across connections; each short query wants a whole core to itself, not a fraction of several.

The limitation bit hard on a different workload: a single large query on a single large table or join. Consider SELECT count(*) FROM big WHERE ... over a hundred-gigabyte table on a machine with 32 idle cores. The single-process executor scans the whole heap on one core while the other 31 sit idle. Analytic and reporting queries — big sequential scans, large hash joins, wide aggregations — were fundamentally CPU- and scan-bound on a single thread, and PostgreSQL simply could not spend more than one core on them no matter how much hardware was available.

Three things had to be true before this could be fixed, and they did not all arrive at once. That staging is why the evolution took several releases rather than landing in one:

A way to run code in extra processes tied to a transaction. PostgreSQL chose processes, not threads, so “use more cores” means “fork more backends” — and those backends must share the originating transaction’s visibility, not start their own. The background worker facility (9.3, generalized in 9.4) and dynamic shared memory (DSM) (9.4) were the enabling primitives. Without DSM there was no post-startup shared region to hand a plan and a result channel through.
A way to ship state between processes so visibility matches. A worker must see exactly what the leader sees: the same snapshot, the same in-progress transaction’s XID and combo CIDs, the same GUC settings, the same loaded libraries and user identity. That serialization machinery — the ParallelContext in src/backend/access/transam/parallel.c — is itself a substantial subsystem that had to be built and is documented in postgres-parallel-query.md.
A planner that knows when parallelism is safe and cheaper. A function marked VOLATILE with side effects, a query that writes, a node that cannot be split across workers — all of these make parallelism wrong or illegal. The optimizer needed parallel-safety classification of every function and plan node, and a cost model that knows the price of launching workers and shuffling tuples through queues, before it could choose a parallel plan. That bookkeeping kept expanding release after release, which is much of why later versions parallelize more node types than earlier ones.

The arc below is the story of those three capabilities maturing: first the plumbing and a single parallel node (9.6), then a rich parallelizable subtree (10), then DDL and partition-aware parallelism (11), then maintenance (13), then years of widening the set of things the planner is willing and able to run in parallel.

Timeline

timeline
    title PostgreSQL Intra-Query Parallelism — Evolution
    section Pre-9.6 (≤2016)
        Single-process executor : One backend runs the whole plan
                                 : DSM (9.4) + bgworkers (9.3/9.4) land as enabling plumbing
    section 9.6 (2016)
        First parallel node : Parallel Seq Scan
                            : Gather node + per-worker tuple queues
                            : Partial paths, read-only parallel mode
    section 10 (2017)
        Rich parallel subtree : Parallel hash / merge / nestloop joins
                              : Partial + Finalize aggregation
                              : GatherMerge preserves sort order
                              : Parallel bitmap heap scan, parallel index scan
    section 11 (2018)
        Parallelism in DDL and partitions : Parallel CREATE INDEX (btree)
                                          : Parallel-aware Append over partitions
                                          : Parallel Hash (shared hash table)
    section 12 (2019)
        Planner refinements : Parallel-safety bookkeeping tightened
                            : Better partial-path generation
    section 13 (2020)
        Parallelism in maintenance : Parallel VACUUM (indexes in parallel)
                                    : Parallel CREATE INDEX cost tuning
    section 14-18 (2021-2025)
        Incremental hardening : Leader-participation tuning
                              : More parallel-safe functions and paths
                              : Node-level extensions, REL_18 current state

Pre-9.6 — the single-process executor (the starting point)

Before 9.6 there was no intra-query parallelism at all. The starting point is worth pinning down precisely, because every later era is a delta against it.

What execution looked like. The postmaster forked one backend per connection. That backend parsed, planned, and executed the query inside itself. Execution was the demand-pull PlanState tree still documented in postgres-executor.md: ExecutePlan repeatedly calls ExecProcNode on the top node, which recursively pulls tuples from its children. A Seq Scan over a billion-row table returned those rows one at a time, all on one core, with the rest of the machine idle for the duration.

Why it could not simply spawn threads. PostgreSQL’s process-per-backend model means there is no shared address space to fan work into. Two design facts made naive parallelism impossible:

Shared memory was fixed at postmaster startup. The main shared-memory segment (buffers, locks, ProcArray) was sized once and never grew. There was no general mechanism for a running backend to create a new shared region, hand it to a helper process, and tear it down at query end.
Transaction state lived in process-local memory. The active snapshot, the current XID, combo CIDs, the resource owner tree — all of it was per-process. A helper process forking fresh would compute different visibility and corrupt results.

The enabling plumbing (9.3–9.4). The fix to the first problem arrived before parallel query did, as standalone infrastructure:

Background workers (9.3, generalized in 9.4): a way to register and launch extra processes managed by the postmaster but tied to a backend’s lifetime.
Dynamic shared memory / DSM (9.4): a backend can now create a shared segment after startup, addressable by cooperating processes, and destroy it when done. This is the substrate every parallel plan rides on.
The shared-memory message queue (shm_mq) and the table-of-contents (shm_toc) layout helpers: a single-reader / single-writer byte pipe over DSM, plus a keyed directory so processes can find sub-structures inside one segment by a known key.

None of this was yet parallel query. It was the runway. 9.6 was the first plane to use it.

Cross-link. The structural model that this era establishes — the PlanState tree, ExecProcNode, the single registered snapshot — is the baseline described in postgres-executor.md. Parallel query does not replace it; it replicates it across processes. Each worker runs its own copy of essentially this same tree.

9.6 — parallel sequential scan and Gather (the first parallel node)

PostgreSQL 9.6 introduced intra-query parallelism for the first time. The release is the genesis of everything in postgres-parallel-query.md: the Gather node, the DSM-serialized plan, the per-worker tuple queues, and the read-only parallel-mode contract all date from here.

What changed

Three new concepts landed together, and they are still the load-bearing ones today:

The Gather node. A new executor node that sits above a partial plan. At execution time it launches N background workers, each running a copy of the subtree below it, and collects their output tuples into a single stream for its parent. Gather is the boundary between “many processes” below and “one process” above. Its current implementation lives in src/backend/executor/nodeGather.c.
Partial paths and the parallel-aware subtree. The planner gained the notion of a partial path: a path that, when run by N workers simultaneously, collectively produces the full result with no duplication. The canonical first example is Parallel Seq Scan — each worker grabs a disjoint block range of the heap (coordinated through shared state in the scan descriptor) so that the union of all workers’ output equals one full scan. The planner builds these in the partial-path machinery now in src/backend/optimizer/path/allpaths.c.
The DSM-carried plan and tuple queues. On first execution Gather serializes the plan into a DSM segment, the postmaster forks workers, each worker attaches the segment, restores the leader’s transaction state so visibility matches, runs its executor copy, and writes MinimalTuples into a per-worker shm_mq tuple queue. The leader drains those queues. This is exactly the ExecInitParallelPlan / ParallelWorkerMain / gather_readnext flow detailed in postgres-parallel-query.md.

Why

The motivation was the limitation from the opening section: large scans were stuck on one core. A parallel sequential scan is the simplest possible win — heap blocks are independent, so dividing block ranges among workers needs no data exchange beyond a shared “next block” counter. It was the right first target precisely because it required the least coordination while exercising all the new plumbing end to end.

The structural shift (before → after)

flowchart TB
    subgraph before["Before 9.6 — single process"]
        A1["Aggregate count(*)"] --> A2["Seq Scan big"]
        note1["One backend\nscans all blocks\non one core"]
    end

    subgraph after["9.6 — Gather + Parallel Seq Scan"]
        B1["Aggregate count(*)\n(leader, finalizes)"] --> B2["Gather\nleader collects tuples"]
        B2 --> B3["Partial Aggregate\nworker 1"]
        B2 --> B4["Partial Aggregate\nworker 2"]
        B2 --> B5["Partial Aggregate\nleader"]
        B3 --> B6["Parallel Seq Scan\ndisjoint block range"]
        B4 --> B7["Parallel Seq Scan\ndisjoint block range"]
        B5 --> B8["Parallel Seq Scan\ndisjoint block range"]
    end

The key shape change: a Gather node is spliced into the tree, and the subtree below it becomes parallel-aware (multiple concurrent copies share the scan). Above Gather the world is single-process as before. Note that even in 9.6 a count(*) already needed a partial aggregate below Gather and a final aggregate above it — but 9.6’s aggregate support was limited; the rich partial/finalize split came into its own in PG10.

The read-only contract

9.6 made parallel mode strictly read-only, enforced by entering parallel mode (EnterParallelMode) which forbids the leader and workers from assigning new XIDs or writing. A query that writes, calls a parallel-unsafe function, or uses constructs the planner cannot reason about simply runs serially. This conservative contract — parallelize only what is provably safe — is the design invariant that the rest of the evolution slowly relaxes at the edges (more functions and node types proven safe) without ever abandoning the core rule. See the parallel-mode discussion in postgres-parallel-query.md.

Cross-link

The mechanism introduced here — ExecInitParallelPlan, the shm_toc layout, worker state restoration, gather_readnext — is documented as the current design in postgres-parallel-query.md. 9.6 is where it was born; the later eras add new consumers of this same plumbing rather than replacing it.

10 — parallel joins, parallel aggregate, and GatherMerge (a worth-it subtree)

9.6 could parallelize a scan, but the subtree below Gather was thin: a parallel scan, a filter, a limited partial aggregate. The moment a query needed a join or a real grouped aggregate, the plan fell back to serial. PostgreSQL 10 is the release that made the partial subtree rich enough to be worth parallelizing for real workloads.

What changed

Parallel joins — all three strategies. 10 taught the planner to put a join below Gather so each worker joins its share of the outer side:
- Parallel hash join: in 10 each worker builds its own private copy of the inner hash table from the (shared) inner scan, then probes it with its slice of the outer rows. (The shared hash table — one hash table built cooperatively by all workers — is a PG11 addition; see the next era.)
- Parallel merge join and parallel nested-loop join: the outer side is a partial path (workers each take a slice); the inner side is re-executed per outer row / merged as usual within each worker. The join-strategy mechanics themselves are documented in postgres-join-nodes.md; what 10 added was the planner’s willingness to generate partial join paths and the executor’s parallel-awareness of the outer input.
Partial + Finalize aggregation. This is the centerpiece. 10 split aggregation into two cooperating nodes:
- Partial Aggregate (below Gather, one per worker): each worker computes a partial aggregate state over its slice — e.g. a running (count, sum) for an avg.
- Finalize Aggregate (above Gather, in the leader): combines the per-worker partial states into the final result using each aggregate’s combine function (aggcombinefn) and emits the user-visible value via its final function. This required every parallelizable aggregate to declare a combine function (combinefn not set for aggregate function is the error you hit otherwise). The partial/finalize split is what lets SUM, COUNT, AVG, MIN/MAX and friends run across workers. Aggregate execution internals are in postgres-agg-sort-nodes.md.
GatherMerge — order-preserving collection. Plain Gather collects tuples in arbitrary arrival order (round-robin over whichever worker queue has data). That destroys any sort order the workers produced, so a query with ORDER BY could not benefit: it would have to re-sort above Gather. 10 added GatherMerge (src/backend/executor/nodeGatherMerge.c): if each worker produces its slice already sorted, GatherMerge binary-heap-merges the per-worker streams to produce a single sorted output — the parallel analogue of a merge step. This unlocked parallel ORDER BY, parallel merge joins feeding sorted output upward, and parallel grouped aggregation that relies on sorted input.
More parallel scan types. 10 also added parallel bitmap heap scan and parallel index scan, so the leaf of the parallel subtree was no longer limited to a plain sequential scan.

Why

After 9.6, the bottleneck moved from “can we scan in parallel” to “the interesting part of an analytic query — the join and the aggregate — still runs serially.” A parallel scan feeding a serial hash join wastes most of the benefit, because the join is where the CPU goes. Making joins and aggregates partial-aware is what turned parallel query from a proof-of-concept into something that accelerated real reporting queries. GatherMerge was the companion fix: without it, every sorted parallel plan paid a full re-sort at the leader, often erasing the gain.

The structural shift (before → after)

flowchart TB
    subgraph nine6["9.6 — thin subtree, serial join"]
        C1["Hash Join\n(serial, leader only)"] --> C2["Gather"]
        C1 --> C3["Hash inner\n(serial)"]
        C2 --> C4["Parallel Seq Scan\nouter"]
    end

    subgraph ten["10 — join + aggregate below Gather"]
        D1["Finalize Aggregate\n(leader, combinefn)"] --> D2["GatherMerge\nmerge sorted streams"]
        D2 --> D3["Partial Aggregate\nworker"]
        D3 --> D4["Parallel Hash Join\nworker"]
        D4 --> D5["Parallel Seq Scan\nouter slice"]
        D4 --> D6["Hash inner\nprivate per worker"]
    end

The shift: in 9.6 the join sat above Gather (serial); in 10 the entire join-and-partial-aggregate stack sits below Gather, with only the cheap finalize step left for the leader, and GatherMerge preserves order on the way up.

Cross-link

The planner-side machinery that decides to generate these partial join and partial aggregate paths — partial path generation, the parallel cost terms — is the optimizer behavior summarized in postgres-planner-overview.md. The executor-side GatherMerge heap-merge and the partial/finalize node execution are part of the current design in postgres-parallel-query.md and postgres-executor.md.

11 — parallel CREATE INDEX and parallel-aware Append (parallelism leaves SELECT)

Through 10, parallelism lived entirely inside read query execution. PG11 broke parallelism out of pure SELECT in two directions: into DDL (building an index) and into the partition layer (handing whole partitions to workers). It also upgraded the parallel hash join from private to shared hash tables.

What changed

Parallel CREATE INDEX (btree). Building a btree is dominated by one thing: sorting every row’s index key. PG11 parallelized that sort. The leader and workers cooperatively scan the heap and feed a parallel external sort (Tuplesort gained a parallel mode); the workers produce sorted runs, and the final merge builds the btree. The driver lives in src/backend/access/nbtree/nbtsort.c (_bt_parallel_* helpers, the BTBuildState shared between leader and workers). This is the first time parallelism touched a write-side operation — legal because CREATE INDEX reads a stable snapshot of the heap and the only writer of the new index is the build itself. The sort mechanics are in postgres-tuplesort.md; the build flow is in postgres-index-creation.md.
Parallel-aware Append. Before 11, an Append over a partitioned table (or a UNION ALL) ran its child subplans one after another within each worker. PG11 made Append parallel-aware (src/backend/executor/nodeAppend.c, the ParallelAppendState): the children are placed in shared memory and workers claim subplans to run, so different workers process different partitions concurrently. Larger non-partial children are scheduled first (and can be claimed by only one worker), while partial children can be shared by several. This is the key to parallelizing queries over partitioned tables where each partition is itself only moderately large — the win comes from running partitions across workers rather than parallelizing each scan in isolation.
Parallel Hash (shared hash table). PG10’s parallel hash join had each worker build a private copy of the inner hash table — duplicated work and duplicated memory. PG11 added Parallel Hash: the workers cooperatively build one shared hash table in DSM, partitioning the build work among themselves, then all probe the single shared table. This removed the per-worker memory blowup and made parallel hash join scale to large inner relations. The hash-join internals are in postgres-join-nodes.md.

Why

Two distinct pressures:

Index builds were a serial wall during data loads and pg_restore. After bulk-loading a large table, the CREATE INDEX step ran on one core for minutes to hours. Parallelizing the sort attacks exactly the dominant cost, with no concurrency hazard because the new index is private until commit.
Partitioning had become a first-class feature (declarative partitioning landed in 10), so queries over many partitions were common — but a serial Append meant only one partition was being scanned at a time even when the planner had chosen parallelism. Parallel-aware Append let the partition dimension itself become a source of parallelism.

The shared-hash upgrade is the natural maturation of PG10’s parallel hash join: once joins ran in parallel, the wasted per-worker hash table was the obvious next inefficiency to remove.

The structural shift (before → after)

flowchart TB
    subgraph before["10 — serial Append, private hash"]
        E1["Append\n(serial: child by child)"] --> E2["Scan part_1"]
        E1 --> E3["Scan part_2"]
        E1 --> E4["Scan part_3"]
        E5["CREATE INDEX\n(serial sort, one core)"]
    end

    subgraph after["11 — parallel Append, shared hash, parallel index build"]
        F1["Gather"] --> F2["Parallel Append\nworkers claim subplans"]
        F2 --> F3["worker A -> part_1"]
        F2 --> F4["worker B -> part_2"]
        F2 --> F5["worker C -> part_3"]
        G1["CREATE INDEX\nparallel Tuplesort"] --> G2["worker sorted run"]
        G1 --> G3["worker sorted run"]
        G1 --> G4["leader merge -> btree"]
    end

The shift on the read side: Append stops being a serial loop over children and becomes a work queue of subplans that workers pull from. On the write side: index construction grows a parallel sort below it for the first time.

Cross-link

The current parallel-aware Append execution (ParallelAppendState, subplan claiming) is part of the executor design in postgres-executor.md and the parallel plumbing in postgres-parallel-query.md. Parallel index build flow is covered by postgres-index-creation.md and postgres-tuplesort.md.

PG12 added no headline new parallel node, but it is a meaningful era because the constraint on parallelism by this point was no longer “can the executor do it” — it was “will the planner choose it correctly, and is the work classified safe.” 12 is representative of the steady planner-side investment that runs through every release.

What changed

Parallel-safety bookkeeping tightened. The optimizer’s classification of expressions, functions, and subplans as parallel-safe / -restricted / -unsafe continued to be refined so that more plans qualified for parallelism without risking incorrect results. (The three-level classification is the gate that decides whether a Gather may be placed above a given subtree at all.)
Better partial-path generation and costing. The planner became more willing and more accurate about generating partial paths in more places, and the parallel cost terms (worker startup, tuple-queue transfer) were refined so the chosen degree of parallelism better matched reality. The cost framework is documented in postgres-cost-model.md.
Surrounding features made more parallel-friendly. As features like partition pruning and partitionwise join matured, the interaction with parallel Append and parallel join improved, so partitioned analytic queries got faster plans even without a new executor node.

Why

By PG12 the executor could parallelize scans, joins, aggregates, sorts (via GatherMerge), Append, and index builds. The remaining gains were less about new mechanisms and more about the planner applying the existing mechanisms in more situations and choosing degree-of-parallelism well. A node that the executor can run in parallel is useless if the planner never emits it or prices it wrong.

The structural shift

There is no node-shape change to draw here; the shift is in the planner’s decision surface — more subtrees now qualify for a Gather, and the degree of parallelism is chosen more accurately. This is the kind of era that does not change the diagrams but changes how often the parallel diagrams from earlier eras actually get used.

Cross-link

Parallel-safety classification and partial-path costing are part of the optimizer behavior in postgres-planner-overview.md and postgres-cost-model.md.

13 — parallel VACUUM (parallelism reaches maintenance)

PG13 extended parallelism to maintenance, the third major domain after read queries (9.6–10) and DDL index builds (11). It made VACUUM able to process a table’s indexes in parallel.

What changed

VACUUM over a table with many indexes spends much of its time in the index-vacuum phase: for each index, scan it and remove pointers to dead heap tuples. Serially, that is one index after another. PG13’s parallel vacuum (src/backend/commands/vacuumparallel.c) launches background workers and assigns each index to a worker, so multiple indexes are vacuumed (and cleaned up) concurrently. The leader still performs the heap scan and coordinates; the parallelism is across the indexes, which is where the embarrassingly-parallel work is — each index is an independent structure.

Key design points, distilled (the mechanism itself is documented in postgres-vacuum.md, “Parallel vacuum”):

Parallelism is per-index, not per-block within one index. An index is eligible only if it is large enough to be worth a worker (min_parallel_index_scan_size) and its access method supports parallel vacuum.
It applies to the index vacuum and index cleanup phases, not the heap scan. The dead-TID list the workers consume is built by the leader’s heap pass and shared through DSM.
It is available to manual VACUUM (PARALLEL n) and plain VACUUM; autovacuum does not use it by default, to avoid autovacuum workers multiplying into many processes.

Why

Maintenance windows were a real operational pain on large tables with many indexes: a nightly VACUUM could run for hours, almost entirely single-threaded, while the box had idle cores. Because indexes are independent structures, vacuuming them concurrently is a clean, low-coordination win — structurally the same insight as parallel Append (independent children) and parallel scan (independent blocks): find the axis along which the work is already disjoint, and assign disjoint pieces to workers. Crucially, parallel vacuum reuses the same ParallelContext plumbing from src/backend/access/transam/parallel.c that 9.6 built for query execution — this era is mostly a new consumer of existing infrastructure, not new infrastructure.

The structural shift (before → after)

flowchart TB
    subgraph before["<=12 — serial index vacuum"]
        H1["VACUUM leader"] --> H2["heap scan -> dead TIDs"]
        H2 --> H3["vacuum index 1"]
        H3 --> H4["vacuum index 2"]
        H4 --> H5["vacuum index 3"]
    end

    subgraph after["13 — parallel index vacuum"]
        I1["VACUUM leader"] --> I2["heap scan -> dead TIDs (shared in DSM)"]
        I2 --> I3["worker -> vacuum index 1"]
        I2 --> I4["worker -> vacuum index 2"]
        I2 --> I5["leader -> vacuum index 3"]
    end

The shift: the serial chain of per-index vacuum steps becomes a fan-out where each index is an independent unit of work claimed by a worker (or the leader), all consuming the one shared dead-TID list.

Cross-link

The current parallel-vacuum design — eligibility rules, the DSM-shared dead-TID list, the leader/worker split — is documented in postgres-vacuum.md. The underlying ParallelContext it rides on is the same one described in postgres-parallel-query.md.

14–18 — incremental hardening (more safe paths, better tuning)

From PG14 onward the evolution shifts from “add a new parallel domain” to hardening and widening what already exists. There is no single dramatic node in this span; instead, each release files down rough edges. Grouping them is honest about their character — they are deltas, not eras with a new architecture.

What changed (themes across 14–18)

Leader-participation tuning. The leader process can also run the partial plan itself (parallel_leader_participation), but on some plans — notably certain parallel hash and GatherMerge shapes — a busy leader stalls draining the worker queues, hurting throughput. Successive releases refined when the leader participates and how the planner costs leader participation, so the default behavior is better on more plan shapes. The leader-participation mechanics are part of the current design in postgres-parallel-query.md.
More parallel-safe functions and constructs. The set of built-in functions and SQL constructs marked parallel-safe kept growing, and more query shapes (additional aggregate/window situations, more subquery forms, more of the partitionwise machinery) became eligible for parallelism. Each such change lets a Gather sit above subtrees that previously forced serial execution.
Node-level extensions and better partial paths. More plan node types gained partial-path variants or improved parallel costing, and partitionwise join/aggregate interplay with parallel Append continued to improve, so partitioned analytic queries got steadily better parallel plans.
Infrastructure that parallel query benefits from indirectly. REL_18 brings a unified asynchronous I/O subsystem (see postgres-aio.md); parallel scans are among the workloads that benefit from better I/O concurrency underneath the per-worker scans, even though AIO is not itself “parallel query.”

Why

By PG14 the architecture of parallel query was essentially complete: Gather / GatherMerge, the ParallelContext plumbing, partial paths, parallel joins/aggregates/Append, parallel index build, parallel vacuum. The remaining returns came from coverage and tuning — making the planner choose parallelism correctly in more cases and making the chosen plans run efficiently — rather than from inventing new parallel mechanisms. This is the expected maturity curve: the hard structural work front-loads into a few releases (9.6–13), then a long tail of incremental polish.

The structural shift

No new node-shape diagram applies; the change is quantitative (more subtrees qualify, better-tuned execution) rather than structural. The “after” picture is simply the union of all the earlier diagrams, now reachable from more queries and tuned to run better — which is exactly the REL_18 picture below.

Where it stands at REL_18 (the current design)

At REL_18 (commit 273fe94, PG 18.x) intra-query parallelism is a mature, multi-domain subsystem built on one consistent foundation. The current design — which this doc deliberately does not re-derive — is documented in postgres-parallel-query.md. In summary of where the arc lands:

The boundary node. Every parallel plan is a single-process tree with a Gather (arbitrary order) or GatherMerge (sorted) node spliced in. Below it the subtree is parallel-aware; above it execution is single-process. Implementations: src/backend/executor/nodeGather.c, src/backend/executor/nodeGatherMerge.c.
The plumbing. On first execution ExecInitParallelPlan (src/backend/executor/execParallel.c) serializes the plan into a DSM segment laid out by an shm_toc, sizes it by walking the PlanState tree, and lays out per-worker shm_mq tuple queues. The ParallelContext in src/backend/access/transam/parallel.c forks the workers and restores leader state (snapshot, XID, combo CIDs, GUCs, libraries, user ID) so visibility matches. ParallelWorkerMain runs each worker’s executor copy; the leader’s gather_readnext drains the queues.
What can be parallelized. Sequential / index / bitmap-heap scans; hash / merge / nested-loop joins (with a shared Parallel Hash); partial+finalize aggregation; sorts via GatherMerge; parallel-aware Append over partitions and UNION ALL; parallel btree CREATE INDEX; and parallel VACUUM of indexes. The read-only contract from 9.6 still holds: parallel mode forbids writes, and only provably parallel-safe work is eligible.
Where each piece came from. Gather + parallel seq scan (9.6); joins, partial aggregate, GatherMerge (10); parallel CREATE INDEX, parallel-aware Append, shared Parallel Hash (11); planner/cost refinements (12); parallel VACUUM (13); leader-participation and coverage tuning (14–18).

The single most important structural fact is the one the timeline makes visible: 9.6–13 built the mechanisms; everything since widens and tunes them on the same ParallelContext substrate. New domains (DDL, maintenance) were added by becoming consumers of the 9.6 plumbing, not by reinventing it.

PG19 forward note. PostgreSQL development continues to push parallel coverage outward — more node types and utility operations gaining parallel-safe paths, and continued tuning of when parallelism and leader participation pay off. Treat any specific PG19 item as a just-released forward step, not as REL_18 behavior; the REL_18 design above is the current state of record.

Sources

Release notes (feature attributions verified against these):

PostgreSQL 9.6 release notes — parallel sequential scan, parallel joins (initial), Gather node, parallel aggregate (initial), background-worker and DSM infrastructure.
PostgreSQL 10 release notes — parallel merge join, parallel bitmap heap scan, parallel index scan, improved parallel aggregation (partial/finalize), GatherMerge.
PostgreSQL 11 release notes — parallel CREATE INDEX (btree), parallel-aware Append, parallel hash join with shared hash table.
PostgreSQL 12 release notes — parallel-query planner and cost refinements; partitioning interactions.
PostgreSQL 13 release notes — parallel VACUUM.
PostgreSQL 14–18 release notes — incremental parallel-safety and leader-participation improvements; REL_18 (commit 273fe94) current state.

Current-state module docs (mechanism lives here — cited, not duplicated):

postgres-parallel-query.md — Gather / GatherMerge, DSM plan, ParallelContext, worker state restoration, gather_readnext, parallel mode.
postgres-executor.md — the PlanState tree, ExecProcNode, parallel-aware node execution, parallel Append.
postgres-planner-overview.md — partial-path generation, parallel-safety classification.
postgres-vacuum.md — parallel vacuum design.
postgres-cost-model.md — parallel cost terms.
postgres-join-nodes.md, postgres-agg-sort-nodes.md, postgres-index-creation.md, postgres-tuplesort.md, postgres-aio.md — supporting mechanism for the parallelized operations.

Key source files (observable on REL_18, commit 273fe94):

src/backend/executor/execParallel.c — ExecInitParallelPlan, DSM/shm_toc layout, worker setup.
src/backend/executor/nodeGather.c — Gather node, gather_readnext.
src/backend/executor/nodeGatherMerge.c — order-preserving merge.
src/backend/executor/nodeAppend.c — ParallelAppendState, parallel-aware Append (PG11).
src/backend/access/transam/parallel.c — ParallelContext, worker fork and state restoration; reused by parallel vacuum and parallel index build.
src/backend/access/nbtree/nbtsort.c — _bt_parallel_*, parallel btree build (PG11).
src/backend/commands/vacuumparallel.c — parallel vacuum (PG13).
src/backend/optimizer/path/allpaths.c — partial-path generation.

PostgreSQL Parallel Query — From No Parallelism to Parallel Joins, Aggregates, and Maintenance

Why this subsystem had to evolve (the original limitation)

Timeline

Pre-9.6 — the single-process executor (the starting point)

9.6 — parallel sequential scan and Gather (the first parallel node)

What changed

Why

The structural shift (before → after)

The read-only contract

Cross-link

10 — parallel joins, parallel aggregate, and GatherMerge (a worth-it subtree)

What changed

Why

The structural shift (before → after)

Cross-link

11 — parallel CREATE INDEX and parallel-aware Append (parallelism leaves SELECT)

What changed

Why

The structural shift (before → after)

Cross-link

12 — planner and cost refinements (no new node, a sharper planner)

What changed

Why

The structural shift

Cross-link

13 — parallel VACUUM (parallelism reaches maintenance)

What changed

Why

The structural shift (before → after)

Cross-link

14–18 — incremental hardening (more safe paths, better tuning)

What changed (themes across 14–18)

Why

The structural shift

Where it stands at REL_18 (the current design)

Sources