PostgreSQL Parallel Query — Gather, Workers, and the DSM Plan

Contents:

Theoretical Background
Common DBMS Design
PostgreSQL’s Approach
Source Walkthrough
Source verification (as of 2026-06-05)
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Sources

Theoretical Background

A single query, no matter how well optimized, runs on one CPU core in the classic demand-pull iterator model (covered by postgres-executor.md). When the bottleneck is a large scan or aggregate over millions of rows, the only way to go faster is to split the work across cores. Database System Concepts (Silberschatz, 7e, ch. 22 “Parallel and Distributed Query Processing”) frames this as intraquery parallelism — “the processing of different parts of the execution of a single query, in parallel across multiple nodes” — and distinguishes it from interquery parallelism (running many independent queries at once, the concern of a transaction-processing system). Intraquery parallelism, the textbook says, “is essential for speeding up long-running queries, and it is the focus of this chapter” (§22.1).

The textbook then splits intraquery parallelism into two complementary sources:

Intraoperation parallelism (§22.1, §22.2–22.4). A single relational operator — a scan, a join, an aggregate — is run in parallel by partitioning its input across nodes. “Since the number of tuples in a relation can be large, the degree of intraoperation parallelism is also potentially very large; thus, intraoperation parallelism is natural in a database system.” A sort over a range-partitioned relation, for example, sorts each partition independently and concatenates the results.
Interoperation parallelism (§22.5.1). Different operators of one query run at the same time, either as a pipeline (operator B consumes A’s output as A produces it, on a different core) or independently (two subtrees with no dependency run concurrently). The textbook is candid that pipelined parallelism “does not scale up well” because pipeline chains are short and stalls propagate; “when the degree of parallelism is high, pipelining is a less important source of parallelism than partitioning” (§22.5.1.1).

The unifying mechanism the textbook settles on is the exchange operator (§22.5, §22.5.2), the contribution of Graefe’s Volcano system:

“a model of parallel query execution which breaks parallel query processing into two types of steps: partitioning of data using the exchange operator, and execution of operations on purely local data, without any data exchange. This model is surprisingly powerful and is widely used in parallel database implementations.” (DSC §22.5)

The exchange operator is the key abstraction because it encapsulates parallelism inside one operator type. Every other operator in the plan is written as if it were sequential, operating on “purely local data”; the exchange operator is the only place that knows about partitioning, data movement between processes, and the merge of multiple streams back into one. Drop an exchange into an otherwise sequential operator tree and the tree becomes parallel without any other operator being rewritten. Graefe’s 1990 paper is titled, precisely, “Encapsulation of Parallelism in the Volcano Query Processing System.”

Three properties of the exchange model matter before reading any engine’s source:

Producers and a consumer. An exchange has N producer processes each running an identical copy of the subplan below it, and a consumer side that gathers their output streams into one. The producers move tuples to the consumer over some transport (a network socket on a shared-nothing cluster; shared memory on a multicore box).
The subplan is unaware. The subplan below the exchange does not know it is one of N copies. Correctness requires that the N copies, taken together, produce the whole result with no duplication and no omission — which is a property of how the leaf scan partitions its input, not of the operators above it.
Order is a separate concern. A plain gather makes no promise about the order tuples arrive in; if the query needs sorted output, the exchange must merge the already-sorted producer streams rather than simply concatenate them.

PostgreSQL is a faithful, multicore-shared-memory realization of this model. Its exchange operator is the Gather node (and its order-preserving variant GatherMerge); its N producers are background worker processes; its transport is a shared-memory queue (shm_mq); and “the subplan is unaware” becomes the parallel-aware flag on the partial plan beneath the Gather. The rest of this document traces those pieces in the REL_18 source.

Common DBMS Design

The textbook gives the model — an exchange operator over otherwise-local operators. This section names the engineering conventions a multicore-shared-memory engine adopts to make that model work, the problems the textbook’s shared-nothing framing leaves implicit. Read PostgreSQL’s choices in the next section as one point in this space.

Processes (or threads) plus a shared transport

The producers must run concurrently, so they are separate threads or processes; and they must hand tuples to the consumer cheaply, so they share an in-memory ring buffer rather than copying over a socket. A thread-per-worker engine shares the whole address space for free but pays in fragile global state and locking; a process-per-worker engine isolates faults and reuses the single-threaded executor unchanged, but must explicitly copy every piece of state a worker needs. PostgreSQL is firmly in the process camp — a parallel worker is a full backend — which is why so much of its machinery is about state transfer.

A serialized snapshot of the leader’s state

A worker process starts empty. To run the same plan and see the same rows, it must be handed: the plan itself, the bound parameters, the MVCC snapshot, the set of in-progress transaction IDs (so visibility checks agree), combo-CID mappings, GUC settings, the authenticated user, loaded libraries. The convention is to serialize all of this into a shared segment the workers can read, with a small table-of-contents index so each consumer finds its piece by a well-known key. The plan is serialized by the engine’s generic node-tree serializer; the snapshot and txn state by purpose-built routines.

Two-pass sizing of the shared segment

The shared segment must be allocated at one fixed size before any worker exists, but its size depends on the plan (how many nodes need shared state, how big the serialized plan is). The convention is a two-pass walk of the operator tree: an estimate pass that sums up every node’s space request, then allocate, then an initialize pass that fills the now -allocated space. Both passes visit the identical node set in the identical order so offsets line up.

Strictly read-only workers

Allowing workers to write would require propagating new XIDs, command- counter increments, and combo CIDs back to the leader and the other workers mid-flight — a synchronization problem with no cheap solution. The universal convention for a first-generation parallel executor is therefore to make parallel mode read-only: no INSERT/UPDATE/DELETE, no DDL, no XID assignment. An explicit enter/exit parallel mode bracket arms error checks that catch violations.

Graceful degradation and best-effort workers

Worker slots are a finite, shared resource (max_worker_processes). A parallel plan must run correctly even if zero workers are obtained — the consumer falls back to running the plan itself. So worker launch is best-effort: request N, accept however many the postmaster grants, and have the leader optionally participate as an N+1-th producer.

Out-of-band error propagation

A worker that hits an error cannot just exit(1) — the leader would block forever waiting for tuples. The convention is a dedicated error channel (separate from the tuple channel) plus a signal that pokes the leader to drain it; the leader re-throws the worker’s error as its own at the next interrupt-check point, so parallel errors “just work” from the user’s perspective.

Theory ↔ PostgreSQL mapping

By the time you reach a named PostgreSQL symbol, you should know what kind of thing it is:

Theory / convention	PostgreSQL name
Exchange operator (gather side)	`Gather` / `GatherState` (`ExecGather`)
Order-preserving exchange	`GatherMerge` / `GatherMergeState` (`ExecGatherMerge`)
“Subplan is unaware” / parallel producer	partial plan with `plan->parallel_aware` true
Producer process	background worker running `ParallelWorkerMain` → `ParallelQueryMain`
Shared transport for tuples	per-worker `shm_mq` tuple queue (`PARALLEL_TUPLE_QUEUE_SIZE` = 64 KB)
Tuple format on the wire	`MinimalTuple` (no txn header), via `TupleQueueReader`
Serialized leader state segment	DSM segment created by `InitializeParallelDSM`
Table-of-contents index	`shm_toc` keyed by `PARALLEL_KEY_*` magic numbers
Serialized plan	`ExecSerializePlan` → `nodeToString(PlannedStmt)`
Two-pass sizing	`ExecParallelEstimate` then `ExecParallelInitializeDSM`
Per-node shared scratch (e.g. scan cursor)	DSA area (`PARALLEL_KEY_DSA`, `dsa_create_in_place`)
Read-only bracket	`EnterParallelMode` / `ExitParallelMode`
Graceful degradation	`need_to_scan_locally`, `parallel_leader_participation`
Out-of-band errors	error `shm_mq` + `PROCSIG_PARALLEL_MESSAGE` + `ProcessParallelMessages`
Per-execution parallel handle	`ParallelExecutorInfo` (`pei`) / `ParallelContext` (`pcxt`)

The generic process-launch and DSM/shm_mq plumbing — ParallelContext, InitializeParallelDSM, shm_mq, background-worker registration — is the subject of postgres-shared-memory-ipc.md; this document covers the executor’s use of it. Why a Gather appears in the plan at all, and how partial paths are costed, belongs to postgres-planner-overview.md and postgres-path-generation.md. The single-process iterator model the workers each run is postgres-executor.md. This document covers the seam: how Gather turns a partial plan into a DSM-serialized job, how workers are launched and synchronized, and how the leader+worker tuple streams are merged back into one.

PostgreSQL’s Approach

PostgreSQL’s parallel query is the exchange model rendered for a multicore shared-memory box, with five design decisions that give it its shape:

The exchange operator is a Gather node, planted by the planner above a partial plan. The partial subtree’s leaf scan is parallel-aware — e.g. a parallel SeqScan whose block range is doled out from a shared counter — so that N identical copies cover the whole relation exactly once. GatherMerge is the variant that preserves the input sort order by heap-merging the worker streams.
Workers are full backend processes, forked lazily on first execution. ExecGather does nothing parallel at ExecInitNode time; only when the first tuple is pulled does it build the DSM segment and ask the postmaster to launch workers — “it needs to allocate a large dynamic segment, so it is better to do it only if it is really needed.”
The plan and all leader state travel through a DSM segment indexed by an shm_toc. ExecInitParallelPlan serializes a dummy PlannedStmt with nodeToString, sizes the segment by a two-pass walk of the PlanState tree, and the generic InitializeParallelDSM layers on GUCs, snapshots, the txn XID set, combo CIDs, libraries — everything a worker needs so its visibility checks match the leader’s.
Tuples flow leader-ward over per-worker shm_mq queues as MinimalTuples. Each worker is the sender on its own 64 KB shared-memory ring; the leader is the receiver on all of them and drains them round-robin. A minimal tuple carries no transaction header, so it is exactly what crosses a process boundary cheaply.
Parallel mode is strictly read-only and best-effort. EnterParallelMode arms checks that forbid writes; the leader tolerates getting fewer workers than requested (down to zero) and can itself act as a producer when parallel_leader_participation is on.

The plan shape: Gather over a partial plan

A parallel plan is an ordinary serial plan with a Gather (or GatherMerge) node spliced in. Everything below the Gather runs in each worker; everything above runs only in the leader. The node directly below is typically a partial path whose leaf is parallel-aware.

flowchart TB
  subgraph LEADER["runs in leader only (above Gather)"]
    AGG["Finalize Aggregate"]
    GATH["Gather<br/>num_workers=2<br/>(the exchange operator)"]
    AGG --> GATH
  end
  subgraph WORKERS["partial plan — one identical copy per worker (+ leader)"]
    PAGG["Partial Aggregate"]
    PSCAN["Parallel SeqScan<br/>(parallel_aware = true)<br/>block range from shared counter"]
    PAGG --> PSCAN
  end
  GATH -->|"DSM-serialized plan;<br/>tuples back via shm_mq"| PAGG

Figure 1 — A parallel aggregate. The Gather is the exchange operator: the partial subtree below it is copied into each worker (and optionally run by the leader too); the parallel-aware SeqScan partitions the heap by handing each worker a different block range from one shared counter, so the copies together scan every block exactly once. The leader’s Finalize Aggregate combines the workers’ partial aggregates.

ExecInitGather — set up the funnel, defer the workers

ExecInitGather builds the GatherState and, crucially, initializes the outer (child) plan but does not launch any workers. It also sets up the funnel slot — a MinimalTuple-backed slot that every worker tuple will be stored into — because that is the format tuples arrive in:

// ExecInitGather — src/backend/executor/nodeGather.c (condensed)
gatherstate = makeNode(GatherState);
gatherstate->ps.plan = (Plan *) node;
gatherstate->ps.state = estate;
gatherstate->ps.ExecProcNode = ExecGather;        /* the next() function */

gatherstate->initialized = false;                 /* workers launched lazily */
gatherstate->need_to_scan_locally =
    !node->single_copy && parallel_leader_participation;
gatherstate->tuples_needed = -1;

/* now initialize outer plan (the partial subtree) */
outerNode = outerPlan(node);
outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);
tupDesc = ExecGetResultType(outerPlanState(gatherstate));

/* Initialize funnel slot to same tuple descriptor as outer plan. */
gatherstate->funnel_slot = ExecInitExtraTupleSlot(estate, tupDesc,
                                                  &TTSOpsMinimalTuple);

Two design facts are visible. First, initialized = false: the whole DSM

worker apparatus is built on the first ExecGather call, not here. Second, need_to_scan_locally records the parallel_leader_participation policy — whether the leader will also act as a producer — but its final value is recomputed after launch, once the actual worker count is known.

ExecGather — lazy launch, then pull from the funnel

ExecGather is the next() for the exchange operator. On the first call it fires up the parallel machinery; on every call it pulls one tuple from gather_getnext and optionally projects it.

// ExecGather — src/backend/executor/nodeGather.c (condensed)
if (!node->initialized)
{
    EState *estate = node->ps.state;
    Gather *gather = (Gather *) node->ps.plan;

    if (gather->num_workers > 0 && estate->es_use_parallel_mode)
    {
        ParallelContext *pcxt;

        /* Build (or rebuild) the shared state the workers need. */
        if (!node->pei)
            node->pei = ExecInitParallelPlan(outerPlanState(node), estate,
                                             gather->initParam,
                                             gather->num_workers,
                                             node->tuples_needed);
        else
            ExecParallelReinitialize(outerPlanState(node), node->pei,
                                     gather->initParam);

        pcxt = node->pei->pcxt;
        LaunchParallelWorkers(pcxt);                 /* ask postmaster to fork */
        node->nworkers_launched = pcxt->nworkers_launched;

        if (pcxt->nworkers_launched > 0)
        {
            ExecParallelCreateReaders(node->pei);    /* one reader per worker */
            node->nreaders = pcxt->nworkers_launched;
            node->reader = palloc(node->nreaders * sizeof(TupleQueueReader *));
            memcpy(node->reader, node->pei->reader,
                   node->nreaders * sizeof(TupleQueueReader *));
        }
        else { node->nreaders = 0; node->reader = NULL; }
        node->nextreader = 0;
    }

    /* Run plan locally if no workers, or if leader participation is on. */
    node->need_to_scan_locally = (node->nreaders == 0)
        || (!gather->single_copy && parallel_leader_participation);
    node->initialized = true;
}

The graceful-degradation convention is right here: num_workers > 0 && es_use_parallel_mode may still yield nworkers_launched == 0, in which case need_to_scan_locally becomes true and the leader runs the plan itself. The single_copy flag is the degenerate one-worker mode where the leader does not also scan, so the child need not be parallel-aware at all.

gather_getnext / gather_readnext — drain the queues, maybe scan locally

gather_getnext is the heart of the consumer side. It loops: try a worker tuple (gather_readnext); if none and the leader is a participant, pull one tuple from the local copy of the plan via ExecProcNode. The local-scan branch installs the shared DSA area into es_query_dsa first, because the parallel-aware leaf nodes the leader is now running themselves reach into that shared area for their cursor state:

// gather_getnext — src/backend/executor/nodeGather.c (condensed)
while (gatherstate->nreaders > 0 || gatherstate->need_to_scan_locally)
{
    CHECK_FOR_INTERRUPTS();

    if (gatherstate->nreaders > 0)
    {
        tup = gather_readnext(gatherstate);           /* a worker MinimalTuple */
        if (HeapTupleIsValid(tup))
        {
            ExecStoreMinimalTuple(tup, fslot, false); /* into the funnel slot */
            return fslot;
        }
    }

    if (gatherstate->need_to_scan_locally)
    {
        EState *estate = gatherstate->ps.state;
        /* Install our DSA area while executing the plan. */
        estate->es_query_dsa = gatherstate->pei ? gatherstate->pei->area : NULL;
        outerTupleSlot = ExecProcNode(outerPlan);     /* leader as producer */
        estate->es_query_dsa = NULL;
        if (!TupIsNull(outerTupleSlot))
            return outerTupleSlot;
        gatherstate->need_to_scan_locally = false;    /* local copy exhausted */
    }
}
return ExecClearTuple(fslot);                         /* everyone done */

gather_readnext is where the round-robin multiplexing lives. It reads non-blocking from the current worker’s queue; advances to the next worker only when the current one would block; removes a worker from the active array when its queue signals done; and, when it has visited every surviving worker without success and the leader is not also scanning, sleeps on its latch until a worker wakes it:

// gather_readnext — src/backend/executor/nodeGather.c (condensed)
for (;;)
{
    CHECK_FOR_INTERRUPTS();                            /* drains worker errors */

    reader = gatherstate->reader[gatherstate->nextreader];
    tup = TupleQueueReaderNext(reader, true, &readerdone);  /* nowait=true */

    if (readerdone)                                    /* this worker is finished */
    {
        --gatherstate->nreaders;
        if (gatherstate->nreaders == 0)
        {
            ExecShutdownGatherWorkers(gatherstate);
            return NULL;
        }
        memmove(/* compact the reader array, removing this slot */);
        if (gatherstate->nextreader >= gatherstate->nreaders)
            gatherstate->nextreader = 0;
        continue;
    }

    if (tup)
        return tup;                                    /* got one; keep this queue */

    /* Empty for now: round-robin to the next worker. */
    gatherstate->nextreader++;
    if (gatherstate->nextreader >= gatherstate->nreaders)
        gatherstate->nextreader = 0;

    if (++nvisited >= gatherstate->nreaders)           /* visited all of them */
    {
        if (gatherstate->need_to_scan_locally)
            return NULL;                               /* let caller scan locally */
        (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
                         WAIT_EVENT_EXECUTE_GATHER);    /* nothing to do but wait */
        ResetLatch(MyLatch);
        nvisited = 0;
    }
}

The comment in the source explains a tuning subtlety: PostgreSQL “used to advance the nextreader pointer after every tuple, but it turns out to be much more efficient to keep reading from the same queue until that would require blocking” — staying on one queue maximizes batch locality and minimizes latch churn.

flowchart TB
  subgraph LDR["Leader process"]
    EG["ExecGather / gather_getnext"]
    RR["gather_readnext<br/>round-robin, non-blocking"]
    LOC["local scan<br/>(parallel_leader_participation)"]
    EG --> RR
    EG --> LOC
  end
  subgraph W0["Worker 0 (ParallelQueryMain)"]
    P0["executor copy of partial plan"]
  end
  subgraph W1["Worker 1"]
    P1["executor copy of partial plan"]
  end
  P0 -->|"MinimalTuple"| Q0[["shm_mq queue 0<br/>64 KB ring"]]
  P1 -->|"MinimalTuple"| Q1[["shm_mq queue 1"]]
  Q0 --> RR
  Q1 --> RR
  LOC -->|"reads via es_query_dsa<br/>shared scan cursor"| DSA[("DSA area")]
  P0 -.-> DSA
  P1 -.-> DSA

Figure 2 — The funnel. Each worker is the sole sender on its own 64 KB shm_mq; the leader is the receiver on all of them and drains them round-robin, parking on its latch only when every queue is empty and it is not itself scanning. The parallel-aware leaf nodes (in both workers and, when participating, the leader) coordinate which heap blocks each one reads through a shared cursor in the DSA area.

GatherMerge — the order-preserving exchange

GatherMerge is used when the plan needs the gathered output sorted (e.g. a Sort below each worker feeding an order-dependent node above). Instead of gather_readnext’s round-robin concatenation, it maintains a binary heap of the per-worker stream heads keyed by the sort columns, and on each call pops the globally-smallest tuple and refills that one worker’s slot. Its launch path is identical to Gather — ExecInitParallelPlan, LaunchParallelWorkers, ExecParallelCreateReaders:

// ExecGatherMerge — src/backend/executor/nodeGatherMerge.c (condensed)
if (!node->initialized)
{
    if (gm->num_workers > 0 && estate->es_use_parallel_mode)
    {
        if (!node->pei)
            node->pei = ExecInitParallelPlan(outerPlanState(node), estate,
                                             gm->initParam, gm->num_workers,
                                             node->tuples_needed);
        else
            ExecParallelReinitialize(outerPlanState(node), node->pei,
                                     gm->initParam);
        pcxt = node->pei->pcxt;
        LaunchParallelWorkers(pcxt);
        node->nworkers_launched = pcxt->nworkers_launched;
        if (pcxt->nworkers_launched > 0)
            ExecParallelCreateReaders(node->pei);
        /* ... build the binary heap, one slot per reader ... */
    }
}

The cost is that GatherMerge cannot return a tuple from a worker until that worker has produced at least one tuple, so a slow or stalled worker holds back the merge; a plain Gather has no such ordering coupling. That is the engineering price of the “order is a separate concern” property from the theory.

Source Walkthrough

Anchor on symbol names, not line numbers. A function or struct name is the stable handle; line numbers drift the moment someone reformats. Use git grep -n '<symbol>' src/backend/ to locate the current position. The line numbers in the position-hint table were observed at commit 273fe94 (REL_18_STABLE) and are quick hints only.

Serializing the plan — ExecSerializePlan

The plan that travels to workers is a copy (workers must not scribble on the leader’s plan), wrapped in a dummy PlannedStmt. Two fix-ups matter: resjunk columns are un-marked (the tuples come back to another backend that may need them, not to the user), and parallel-unsafe subplans are replaced by a NULL “hole” so a worker can never even ExecInitNode one:

// ExecSerializePlan — src/backend/executor/execParallel.c (condensed)
plan = copyObject(plan);                              /* don't touch the original */

foreach(lc, plan->targetlist)
    lfirst_node(TargetEntry, lc)->resjunk = false;    /* keep junk cols on the wire */

pstmt = makeNode(PlannedStmt);
pstmt->commandType = CMD_SELECT;                      /* always read-only */
pstmt->planTree = plan;
pstmt->rtable = estate->es_range_table;
/* Transfer only parallel-safe subplans, NULL-padding the unsafe ones. */
pstmt->subplans = NIL;
foreach(lc, estate->es_plannedstmt->subplans)
{
    Plan *subplan = (Plan *) lfirst(lc);
    if (subplan && !subplan->parallel_safe)
        subplan = NULL;
    pstmt->subplans = lappend(pstmt->subplans, subplan);
}
return nodeToString(pstmt);                           /* the generic serializer */

nodeToString is the engine-wide node-tree serializer (owned by postgres-node-trees.md); reusing it is what makes “ship the plan” a one-liner.

Two-pass DSM sizing — ExecParallelEstimate then init

ExecInitParallelPlan sizes the segment before creating it. The estimate pass walks the PlanState tree via planstate_tree_walker, counting nodes and letting each parallel-aware node reserve space; nodes that need DSM only for EXPLAIN ANALYZE instrumentation reserve unconditionally:

// ExecParallelEstimate — src/backend/executor/execParallel.c (condensed)
e->nnodes++;                                          /* count every node */
switch (nodeTag(planstate))
{
    case T_SeqScanState:
        if (planstate->plan->parallel_aware)
            ExecSeqScanEstimate((SeqScanState *) planstate, e->pcxt);
        break;
    case T_SortState:
        /* even when not parallel-aware, for EXPLAIN ANALYZE */
        ExecSortEstimate((SortState *) planstate, e->pcxt);
        break;
    /* ... Index/BitmapHeap/Append/Hash/Agg/Memoize/... ... */
    default:
        break;
}
return planstate_tree_walker(planstate, ExecParallelEstimate, e);

The estimator accumulates a chunk size and a key count for each of the executor’s fixed pieces — fixed state, query text, the serialized PlannedStmt, ParamListInfo, per-worker BufferUsage/WalUsage, the tuple queues, optional instrumentation, and the DSA area — each tagged with a PARALLEL_KEY_* magic number above 2^32 so individual nodes can use smaller keys for their own state:

// ExecInitParallelPlan — src/backend/executor/execParallel.c (condensed)
pstmt_data = ExecSerializePlan(planstate->plan, estate);
pcxt = CreateParallelContext("postgres", "ParallelQueryMain", nworkers);
pei->pcxt = pcxt;

shm_toc_estimate_chunk(&pcxt->estimator, sizeof(FixedParallelExecutorState));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* ... query text, PlannedStmt, ParamListInfo, Buffer/Wal usage ... */
shm_toc_estimate_chunk(&pcxt->estimator,
                       mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);

/* Let parallel-aware nodes add to the estimate; also counts nnodes. */
e.pcxt = pcxt; e.nnodes = 0;
ExecParallelEstimate(planstate, &e);

shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);   /* DSA area */
shm_toc_estimate_keys(&pcxt->estimator, 1);

/* Everyone's asked for space; now create the DSM. */
InitializeParallelDSM(pcxt);

After InitializeParallelDSM returns, the space exists but is uninitialized; the function then allocates and fills each chunk — shm_toc_allocate to carve it, shm_toc_insert to register it under its key — storing the fixed state, the query string, the serialized plan, the serialized params, and setting up the tuple queues:

// ExecInitParallelPlan — store phase (condensed)
fpes = shm_toc_allocate(pcxt->toc, sizeof(FixedParallelExecutorState));
fpes->tuples_needed = tuples_needed;
fpes->eflags = estate->es_top_eflags;
fpes->jit_flags = estate->es_jit_flags;
shm_toc_insert(pcxt->toc, PARALLEL_KEY_EXECUTOR_FIXED, fpes);

pstmt_space = shm_toc_allocate(pcxt->toc, pstmt_len);
memcpy(pstmt_space, pstmt_data, pstmt_len);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_PLANNEDSTMT, pstmt_space);

pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);   /* the funnel queues */

/* A DSA area shared by leader and workers (parallel-aware node scratch). */
area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_DSA, area_space);
pei->area = dsa_create_in_place(area_space, dsa_minsize,
                                LWTRANCHE_PARALLEL_QUERY_DSA, pcxt->seg);

/* Initialize pass: visits the identical nnodes in identical order. */
estate->es_query_dsa = pei->area;
ExecParallelInitializeDSM(planstate, &d);
estate->es_query_dsa = NULL;
if (e.nnodes != d.nnodes)
    elog(ERROR, "inconsistent count of PlanState nodes");

The e.nnodes != d.nnodes assertion is the guardrail on the two-pass contract: estimate and init must visit the same node set, or offsets computed in one pass would be wrong in the other.

The tuple queues — ExecParallelSetupTupleQueues

Each worker gets one shm_mq of PARALLEL_TUPLE_QUEUE_SIZE (64 KB), carved from a contiguous block. The leader sets itself as the receiver of each here; the worker will set itself as the sender later:

// ExecParallelSetupTupleQueues — src/backend/executor/execParallel.c (condensed)
tqueuespace = shm_toc_allocate(pcxt->toc,
                               mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
for (i = 0; i < pcxt->nworkers; ++i)
{
    shm_mq *mq = shm_mq_create(tqueuespace + ((Size) i) * PARALLEL_TUPLE_QUEUE_SIZE,
                               (Size) PARALLEL_TUPLE_QUEUE_SIZE);
    shm_mq_set_receiver(mq, MyProc);                  /* leader receives */
    responseq[i] = shm_mq_attach(mq, pcxt->seg, NULL);
}
shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tqueuespace);
return responseq;

The readers are created separately by ExecParallelCreateReaders, after launch, because the leader can do useful setup work while workers fork:

// ExecParallelCreateReaders — src/backend/executor/execParallel.c (condensed)
for (i = 0; i < nworkers; i++)
{
    shm_mq_set_handle(pei->tqueue[i], pei->pcxt->worker[i].bgwhandle);
    pei->reader[i] = CreateTupleQueueReader(pei->tqueue[i]);
}

Launching workers — LaunchParallelWorkers

The generic LaunchParallelWorkers (in parallel.c) registers one background worker per requested slot. The entry point name baked in is ParallelWorkerMain; the DSM handle is passed as the worker’s main argument so it can attach. The leader first becomes a lock-group leader to prevent undetected deadlocks between itself and its workers:

// LaunchParallelWorkers — src/backend/access/transam/parallel.c (condensed)
BecomeLockGroupLeader();                              /* group locking */

memset(&worker, 0, sizeof(worker));
snprintf(worker.bgw_name, BGW_MAXLEN, "parallel worker for PID %d", MyProcPid);
worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION
    | BGWORKER_CLASS_PARALLEL;
worker.bgw_start_time = BgWorkerStart_ConsistentState;
sprintf(worker.bgw_library_name, "postgres");
sprintf(worker.bgw_function_name, "ParallelWorkerMain");
worker.bgw_main_arg = UInt32GetDatum(dsm_segment_handle(pcxt->seg));
worker.bgw_notify_pid = MyProcPid;

for (i = 0; i < pcxt->nworkers_to_launch; ++i)
{
    memcpy(worker.bgw_extra, &i, sizeof(int));        /* this worker's number */
    if (!any_registrations_failed &&
        RegisterDynamicBackgroundWorker(&worker, &pcxt->worker[i].bgwhandle))
    {
        shm_mq_set_handle(pcxt->worker[i].error_mqh, pcxt->worker[i].bgwhandle);
        pcxt->nworkers_launched++;                    /* best-effort count */
    }
    else
        any_registrations_failed = true;              /* hit max_worker_processes */
}

Registration failure is not an error — it simply caps nworkers_launched below the request, and the caller (the Gather) copes by scanning locally. This is the graceful-degradation convention in code.

The worker — ParallelWorkerMain restores leader state

ParallelWorkerMain (in parallel.c) is what every parallel worker runs. Before it can execute a single tuple it must make its private process look like the leader. The ordering is delicate and heavily commented in the source; the essential restores:

// ParallelWorkerMain — src/backend/access/transam/parallel.c (condensed, reordered)
seg = dsm_attach(DatumGetUInt32(main_arg));           /* attach the segment */
toc = shm_toc_attach(PARALLEL_MAGIC, dsm_segment_address(seg));
fps = shm_toc_lookup(toc, PARALLEL_KEY_FIXED, false);

/* Attach the error queue FIRST, so later failures reach the leader. */
mq = (shm_mq *) (error_queue_space + ParallelWorkerNumber * PARALLEL_ERROR_QUEUE_SIZE);
shm_mq_set_sender(mq, MyProc);
mqh = shm_mq_attach(mq, seg, NULL);
pq_redirect_to_shm_mq(seg, mqh);                      /* elog/ereport -> leader */

BecomeLockGroupMember(fps->parallel_leader_pgproc, fps->parallel_leader_pid);
SetAuthenticatedUserId(fps->authenticated_user_id);
BackgroundWorkerInitializeConnectionByOid(fps->database_id, ...);  /* same DB */

StartParallelWorkerTransaction(tstatespace);          /* XID set, txn block state */
RestoreComboCIDState(combocidspace);                  /* visibility consistency */

asnapshot = RestoreSnapshot(asnapspace);              /* same MVCC view */
RestoreTransactionSnapshot(tsnapshot, fps->parallel_leader_pgproc);
PushActiveSnapshot(asnapshot);
InvalidateSystemCaches();                             /* tuples we can see changed */

RestoreGUCState(gucspace);                            /* every GUC value */
SetUserIdAndSecContext(fps->current_user_id, fps->sec_context);

EnterParallelMode();                                  /* arm read-only checks */
entrypt(seg, toc);                                    /* == ParallelQueryMain */

The README is explicit about why each restore exists: the XID set and combo CIDs so “tuple visibility checks return the same results in the worker as they do in the initiating backend”; the snapshots for the same reason; GUCs so planner/executor behavior matches. The error-queue attach is deliberately first — “until we do that, any errors that happen here will not be reported back.”

The worker executor — ParallelQueryMain

Once the environment matches the leader, the executor-specific entry point ParallelQueryMain (back in execParallel.c) reconstructs the QueryDesc from the DSM and runs an ordinary executor against it, with the twist that its DestReceiver writes into the worker’s tuple queue:

// ParallelQueryMain — src/backend/executor/execParallel.c (condensed)
fpes = shm_toc_lookup(toc, PARALLEL_KEY_EXECUTOR_FIXED, false);
receiver = ExecParallelGetReceiver(seg, toc);          /* writes to the shm_mq */
queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);

area = dsa_attach_in_place(shm_toc_lookup(toc, PARALLEL_KEY_DSA, false), seg);

ExecutorStart(queryDesc, fpes->eflags);
queryDesc->planstate->state->es_query_dsa = area;      /* shared scan cursors */
ExecParallelInitializeWorker(queryDesc->planstate, &pwcxt);
ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);

InstrStartParallelQuery();
ExecutorRun(queryDesc, ForwardScanDirection,
            fpes->tuples_needed < 0 ? (int64) 0 : fpes->tuples_needed);
ExecutorFinish(queryDesc);
/* report Buffer/Wal usage + instrumentation back through DSM, then: */
ExecutorEnd(queryDesc);

The receiver is built by ExecParallelGetReceiver, which finds this worker’s slot in the tuple-queue block, marks the worker as the sender, and wraps it in a CreateTupleQueueDestReceiver — so every tuple the worker’s executor produces is serialized as a MinimalTuple and pushed into the ring the leader is draining:

// ExecParallelGetReceiver — src/backend/executor/execParallel.c
mqspace = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE, false);
mqspace += ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE;
mq = (shm_mq *) mqspace;
shm_mq_set_sender(mq, MyProc);                         /* worker sends */
return CreateTupleQueueDestReceiver(shm_mq_attach(mq, seg, NULL));

The plan is reconstructed by ExecParallelGetQueryDesc with stringToNode (the inverse of the leader’s nodeToString) and given GetActiveSnapshot() — the snapshot the leader pushed and that ParallelWorkerMain already restored.

flowchart TB
  L0["Leader: ExecGather (first call)"]
  L1["ExecInitParallelPlan<br/>serialize plan, size + fill DSM"]
  L2["LaunchParallelWorkers<br/>RegisterDynamicBackgroundWorker x N"]
  L0 --> L1 --> L2
  L2 -->|"postmaster forks"| W0["ParallelWorkerMain"]
  W0 --> W1["attach DSM + error queue<br/>restore GUC/snapshot/XID/comboCID"]
  W1 --> W2["EnterParallelMode<br/>ParallelQueryMain"]
  W2 --> W3["ExecutorStart/Run<br/>DestReceiver -> shm_mq"]
  W3 -->|"MinimalTuples"| L3["leader gather_readnext<br/>drains queues"]
  W3 -->|"on error"| ERR["error shm_mq +<br/>PROCSIG_PARALLEL_MESSAGE"]
  ERR --> L4["leader CHECK_FOR_INTERRUPTS<br/>ProcessParallelMessages re-throws"]

Figure 3 — End-to-end launch and dataflow. The leader serializes the plan and state into the DSM, asks the postmaster to fork workers, and each worker reconstructs the leader’s environment before running its own executor copy whose output DestReceiver writes MinimalTuples into the shared queue the leader drains. Errors take the separate error queue and surface at the leader’s next interrupt check.

Error propagation — the out-of-band channel

A worker’s elog/ereport is redirected (pq_redirect_to_shm_mq) into its dedicated error queue, distinct from its tuple queue. Sending a message signals the leader with PROCSIG_PARALLEL_MESSAGE; the leader’s next CHECK_FOR_INTERRUPTS runs ProcessParallelMessages, which drains every worker’s error queue and re-throws:

// ProcessParallelMessages — src/backend/access/transam/parallel.c (condensed)
dlist_foreach(iter, &pcxt_list)
{
    pcxt = dlist_container(ParallelContext, node, iter.cur);
    for (i = 0; i < pcxt->nworkers_launched; ++i)
        while (pcxt->worker[i].error_mqh != NULL)
        {
            res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes, &data, true);
            if (res == SHM_MQ_WOULD_BLOCK)
                break;
            else if (res == SHM_MQ_SUCCESS)
                ProcessParallelMessage(pcxt, i, &msg);  /* re-throws ErrorResponse */
            else
                ereport(ERROR, ... "lost connection to parallel worker");
        }
}

This is why the README stresses that the leader “execute CHECK_FOR_INTERRUPTS() regularly” — it is the only point at which a worker’s error becomes the leader’s error. Note gather_readnext calls CHECK_FOR_INTERRUPTS on every loop iteration precisely so that a worker error surfaces promptly while the leader is draining tuples.

Teardown and reuse — ExecParallelFinish / Cleanup / Reinitialize

ExecParallelFinish detaches the tuple queues (so any still-running worker notices no more output is wanted), destroys the readers, and WaitForParallelWorkersToFinish — the leader must not clean up the transaction until every worker has exited, or “the relation could disappear while the worker is still busy scanning it.” ExecParallelCleanup then accumulates instrumentation, frees serialized params, detaches the DSA, and DestroyParallelContext tears down the DSM. For a node that rescans (e.g. a Gather on the inner side of a nested loop), ExecParallelReinitialize recycles the same DSM via ReinitializeParallelDSM and re-serializes only the (possibly changed) parameters, avoiding a full segment rebuild.

Position hints (as of 2026-06-05, REL_18 273fe94)

Symbol	File	Line
`ExecInitGather`	`src/backend/executor/nodeGather.c`	53
`ExecGather`	`src/backend/executor/nodeGather.c`	137
`gather_getnext`	`src/backend/executor/nodeGather.c`	263
`gather_readnext`	`src/backend/executor/nodeGather.c`	311
`ExecShutdownGatherWorkers`	`src/backend/executor/nodeGather.c`	400
`ExecShutdownGather`	`src/backend/executor/nodeGather.c`	418
`ExecReScanGather`	`src/backend/executor/nodeGather.c`	442
`ExecInitGatherMerge`	`src/backend/executor/nodeGatherMerge.c`	67
`ExecGatherMerge`	`src/backend/executor/nodeGatherMerge.c`	183
`gather_merge_readnext`	`src/backend/executor/nodeGatherMerge.c`	636
`ExecSerializePlan`	`src/backend/executor/execParallel.c`	146
`ExecParallelEstimate`	`src/backend/executor/execParallel.c`	233
`ExecParallelSetupTupleQueues`	`src/backend/executor/execParallel.c`	547
`ExecInitParallelPlan`	`src/backend/executor/execParallel.c`	599
`ExecParallelCreateReaders`	`src/backend/executor/execParallel.c`	890
`ExecParallelReinitialize`	`src/backend/executor/execParallel.c`	916
`ExecParallelFinish`	`src/backend/executor/execParallel.c`	1156
`ExecParallelCleanup`	`src/backend/executor/execParallel.c`	1209
`ExecParallelGetReceiver`	`src/backend/executor/execParallel.c`	1245
`ExecParallelGetQueryDesc`	`src/backend/executor/execParallel.c`	1261
`ExecParallelInitializeWorker`	`src/backend/executor/execParallel.c`	1334
`ParallelQueryMain`	`src/backend/executor/execParallel.c`	1429
`CreateParallelContext`	`src/backend/access/transam/parallel.c`	173
`InitializeParallelDSM`	`src/backend/access/transam/parallel.c`	211
`ReinitializeParallelDSM`	`src/backend/access/transam/parallel.c`	508
`LaunchParallelWorkers`	`src/backend/access/transam/parallel.c`	580
`WaitForParallelWorkersToFinish`	`src/backend/access/transam/parallel.c`	803
`ProcessParallelMessages`	`src/backend/access/transam/parallel.c`	1055
`ProcessParallelMessage`	`src/backend/access/transam/parallel.c`	1144
`ParallelWorkerMain`	`src/backend/access/transam/parallel.c`	1299
`PARALLEL_TUPLE_QUEUE_SIZE` (= 65536)	`src/backend/executor/execParallel.c`	69

Source verification (as of 2026-06-05)

Verified against /data/hgryoo/references/postgres at commit 273fe94 (REL_18_STABLE). Checks performed:

Gather launches workers lazily on first ExecGather, not at init. Confirmed: ExecInitGather sets gatherstate->initialized = false and only calls ExecInitNode on the outer plan; the ExecInitParallelPlan / LaunchParallelWorkers block is inside if (!node->initialized) in ExecGather. The source comment states the rationale (“it needs to allocate a large dynamic segment”).
Tuple queue size is 64 KB per worker. #define PARALLEL_TUPLE_QUEUE_SIZE 65536 in execParallel.c; each queue is shm_mq_created at that size in ExecParallelSetupTupleQueues.
Tuples cross as MinimalTuple. The funnel slot is created with &TTSOpsMinimalTuple in ExecInitGather; gather_getnext stores worker tuples via ExecStoreMinimalTuple; the worker receiver is CreateTupleQueueDestReceiver.
The plan is serialized with nodeToString and rebuilt with stringToNode. ExecSerializePlan returns nodeToString(pstmt); ExecParallelGetQueryDesc calls stringToNode(pstmtspace).
Two-pass node-count invariant exists. ExecParallelEstimate increments e.nnodes; ExecParallelInitializeDSM increments d.nnodes; ExecInitParallelPlan errors with “inconsistent count of PlanState nodes” if they differ.
Read-only enforcement. ExecSerializePlan hard-codes pstmt->commandType = CMD_SELECT; ParallelWorkerMain calls EnterParallelMode() before invoking the entry point. (The EnterParallelMode/ExitParallelMode check machinery lives in xact.c; see postgres-xact.md.)
Error path. ParallelWorkerMain calls pq_redirect_to_shm_mq; ProcessParallelMessages drains pcxt->worker[i].error_mqh and ProcessParallelMessage re-throws. The PROCSIG_PARALLEL_MESSAGE signal constant is referenced in parallel.c.
REL_18 sanity. No XLOG2 rmgr and no B_DATACHECKSUMSWORKER_* symbols were asserted. The worker reports per-query BufferUsage and WalUsage (PARALLEL_KEY_WAL_USAGE), consistent with modern releases.
Symbols exist. Every symbol in the position-hint table was located by grep -nE '^<symbol>' in the cited file at the listed line (±0 at this commit).

Not independently re-derived (taken from source comments / README): the precise list of state the generic InitializeParallelDSM copies (libraries, combo CIDs, relmapper, REINDEX state, etc.) is enumerated in README.parallel and the InitializeParallelDSM body; this doc quotes rather than re-verifies each one.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

PostgreSQL’s parallel executor is a deliberately conservative, first-generation implementation of the Volcano exchange model. Placing it against the design space and the literature clarifies which of its choices are fundamental and which are artifacts of its starting point.

Process-per-worker vs. thread-per-worker

PostgreSQL’s single most consequential decision is that a worker is a full backend process, not a thread. The benefit is enormous code reuse: a worker runs the same ExecutorStart/ExecutorRun as a normal query (postgres-executor.md), and fault isolation is free — a crashing worker cannot corrupt the leader’s heap. The cost is everything in §“State Sharing”: every GUC, snapshot, XID, combo CID, and library must be explicitly serialized and re-installed, because processes share no address space. A thread-per-worker engine (e.g. SQL Server, MySQL, DuckDB, most OLAP engines) shares all of that for free, paying instead in fragile global state, pervasive locking, and the loss of crash isolation. PostgreSQL chose processes because its entire architecture (postgres-postmaster.md, postgres-backend-lifecycle.md) is process-per-connection; parallel workers reuse that machinery wholesale. The serialization tax is the price of that reuse.

Pull-based exchange vs. push-based / morsel-driven

PostgreSQL’s Gather is pull-based: the leader’s gather_readnext calls down into per-worker queues, and each worker independently pulls its partition through an iterator. This is the classic Volcano exchange. The influential alternative is morsel-driven parallelism (Leis et al., “Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age,” SIGMOD 2014), as implemented in HyPer and later Umbra. There, a fixed pool of worker threads pulls small “morsels” of input from a shared dispatcher and pushes them through a compiled pipeline; parallelism is elastic (threads pick up the next morsel when free), NUMA-aware, and avoids the static degree-of-parallelism PostgreSQL fixes at plan time. PostgreSQL’s parallel-aware SeqScan block-range handout is a coarse, single-operator echo of morsel dispatch — the shared block counter in the DSA area is a morsel dispatcher in miniature — but PostgreSQL fixes num_workers at plan time and cannot rebalance across operators the way morsel-driven engines do. Vectorized push engines (the MonetDB/X100 lineage; Boncz, Zukowski & Nes, “MonetDB/X100: Hyper-Pipelining Query Execution,” CIDR 2005) combine push dataflow with batch-at-a-time processing, attacking the per-tuple interpreter overhead that PostgreSQL’s tuple-at-a-time funnel still pays — though PG’s expression JIT (postgres-expression-eval.md) narrows that gap inside a worker.

The funnel as a scalability bottleneck

A single Gather is a funnel: every worker tuple is serialized to a MinimalTuple, copied through a 64 KB ring, and deserialized by one leader. For high-cardinality results this re-centralizes work the parallel scan spread out, and the leader can become the bottleneck — the classic “exchange is the scalability limit” observation from parallel-DB folklore and quantified in many shared-nothing studies. Production parallel engines mitigate this with multi-level exchanges (partial aggregation below the exchange so fewer tuples cross it — which PostgreSQL does do via Partial / Finalize Aggregate) and with repartitioning exchanges that keep data parallel across a pipeline of operators rather than funneling to one node after a single operator. PostgreSQL has no general repartitioning exchange: parallelism collapses at every Gather and must be re-expanded below the next one, which is why deeply parallel plans favor pushing as much work as possible below a single top Gather. Parallel hash join (postgres-table-am.md adjacent; the T_HashJoinState parallel-aware path in ExecParallelEstimate) is the main exception where state stays shared across workers via the DSA area rather than funneling.

Static vs. adaptive degree of parallelism

The worker count is decided by the planner from table size and max_parallel_workers_per_gather, then is best-effort at runtime (nworkers_launched may be lower). It is never increased mid-query and never rebalanced. Adaptive systems revisit the degree of parallelism based on runtime feedback — the research frontier of adaptive query processing (Deshpande, Ives & Raman, “Adaptive Query Processing,” Foundations and Trends in Databases, 2007) and runtime re-optimization. PostgreSQL’s parallel_leader_participation is a tiny adaptive touch (the leader fills in when workers are scarce), but the architecture is fundamentally plan-time-static.

Distributed shared-nothing exchange

The DSC chapter frames the exchange operator on a shared-nothing cluster where producers send tuples over the network. PostgreSQL’s exchange is strictly single-node shared-memory: workers are local processes and the transport is shared memory, never a socket. Cross-node parallelism in the PostgreSQL ecosystem lives in extensions (Citus, Greenplum, postgres_fdw async append) layered above the core executor, not in nodeGather.c. This is the cleanest illustration of scope: the core implements exactly the multicore slice of the textbook model, and defers the distributed slice entirely.

Frontier summary

The directions a modern reader should watch: (1) morsel-driven / work-stealing schedulers to replace static num_workers; (2) vectorized or fully compiled push pipelines to retire the tuple-at-a-time funnel; (3) repartitioning exchanges so parallelism survives across operators; (4) relaxing the read-only restriction (parallel DML, parallel index build already exists for B-tree — see postgres-nbtree.md); and (5) NUMA -aware worker placement. PostgreSQL’s design is intentionally the simplest correct point in this space — its conservatism (read-only, process-based, static, single-node, funnel-per-Gather) is exactly what made parallel query landable on a 30-year-old single-threaded executor without rewriting it.

Sources

Source tree (/data/hgryoo/references/postgres, REL_18_STABLE, commit 273fe94, observed 2026-06-05):
- src/backend/executor/nodeGather.c — Gather node: lazy launch, gather_getnext/gather_readnext funnel, local-scan fallback.
- src/backend/executor/nodeGatherMerge.c — order-preserving variant; binary-heap merge of worker streams.
- src/backend/executor/execParallel.c — ExecInitParallelPlan, plan serialization, two-pass DSM sizing, tuple queues, ParallelQueryMain, finish/cleanup/reinitialize.
- src/backend/access/transam/parallel.c — generic ParallelContext, InitializeParallelDSM, LaunchParallelWorkers, ParallelWorkerMain state restore, error message processing.
- src/backend/access/transam/README.parallel — the design document: overview, error reporting, state sharing, transaction integration, coding conventions.
Textbook (knowledge/research/dbms-general/database-system-concepts.md): Silberschatz, Korth & Sudarshan, Database System Concepts, 7e, ch. 22 “Parallel and Distributed Query Processing” — §22.1 (intra-/inter-query, intra-/inter-operation parallelism), §22.5 (parallel evaluation of query plans), §22.5.1 (interoperation: pipelined vs. independent parallelism), §22.5.2 (the exchange-operator model); ch. 15 §15.7.2 (pipelining, referenced by the parallel chapter).
Foundational papers (see .omc/plans/postgres-paper-bibliography.md): G. Graefe, “Encapsulation of Parallelism in the Volcano Query Processing System,” SIGMOD 1990 — the exchange operator. G. Graefe & W. McKenna, “The Volcano Optimizer Generator,” ICDE 1993.
Comparative / frontier literature (named in §6, not in the KB): Leis et al., “Morsel-Driven Parallelism,” SIGMOD 2014; Boncz, Zukowski & Nes, “MonetDB/X100: Hyper-Pipelining Query Execution,” CIDR 2005; Deshpande, Ives & Raman, “Adaptive Query Processing,” FnT Databases 2007.
Cross-references (this KB): postgres-executor.md (the per-worker iterator model), postgres-shared-memory-ipc.md (DSM, shm_toc, shm_mq, background workers), postgres-planner-overview.md and postgres-path-generation.md (why a Gather and partial paths are produced), postgres-mvcc-snapshots.md (snapshot semantics restored in workers), postgres-xact.md (parallel-mode read-only enforcement), postgres-expression-eval.md (per-worker expression evaluation/JIT), postgres-postmaster.md / postgres-backend-lifecycle.md (the process-per-worker substrate).