PostgreSQL Asynchronous I/O — The PG18 AIO Subsystem, io_uring, and Read Streams

Contents:

Theoretical Background
Common DBMS Design
PostgreSQL’s Approach
Source Walkthrough
Source verification (as of 2026-06-05)
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Sources

Theoretical Background

For most of its history PostgreSQL performed I/O the simplest way possible: when a backend needed a page that was not in shared_buffers, it called pread() and blocked until the kernel delivered the bytes. This worked far better than it had any right to, for one reason — the operating system hides storage latency on the backend’s behalf. The kernel’s own read-ahead notices sequential access and prefetches the next few blocks into its page cache, so the synchronous pread() often finds the data already resident and returns almost immediately. PostgreSQL leaned on that, and on posix_fadvise() hints (effective_io_concurrency) to nudge the kernel into prefetching for the handful of access patterns — bitmap heap scans, mostly — where the engine knew the future better than the kernel could guess.

The Architecture of a Database System survey (Hellerstein, Stonebraker & Hamilton 2007; captured in dbms-papers/fntdb07-architecture.md) frames the problem the textbook way: a DBMS is fundamentally a machine for overlapping computation with I/O. A query that must read a million pages spends almost all of its wall-clock time waiting for storage unless the system can have many reads outstanding at once and do useful work — decode tuples, evaluate predicates, build hash tables — while the disk arm (or the NVMe queue) is busy. Synchronous blocking I/O makes that overlap impossible within a single backend: the backend stops dead at every cache miss. PostgreSQL’s process-per-connection model historically recovered some parallelism by running many backends, but a single large scan in a single backend was latency-bound.

Two structural shifts made the old approach untenable. First, storage got faster than the CPU’s ability to feed it. On a modern NVMe array the device can service hundreds of thousands of IOPS, but a single synchronous backend, issuing one 8 KB read at a time and waiting for each, cannot keep the queue deep enough to saturate it — the bottleneck moves from the device to the round-trip latency of the syscall path. Second, buffered I/O itself became the cost. Copying data from the kernel page cache into shared_buffers burns CPU and double-buffers the data (once in the OS cache, once in Postgres). Direct I/O (O_DIRECT) sidesteps both — DMA moves bytes straight from device to the buffer pool, freeing the CPU — but, as the AIO README bluntly states, “Without AIO, Direct IO is unusably slow for most purposes”: there is no kernel read-ahead to hide latency, so every miss is a full storage round trip. Direct I/O is therefore only viable once the engine itself issues explicit, deep, asynchronous prefetches. AIO is the prerequisite for direct I/O, not a separate feature.

Asynchronous I/O decouples issuing a read from consuming its result. The backend submits a read, gets control back immediately, does other work (issues more reads, processes already-resident pages), and only later waits for the specific read it now needs. The classic abstraction is a pair of operations — submit(io) returning a handle, and wait(handle) blocking until that IO completes — plus a completion mechanism that updates shared state (marks a buffer valid, verifies a checksum) when the bytes land. The engineering difficulty is never the happy path; it is everything around it: bounding the number of in-flight IOs, avoiding deadlock when shared resources (buffers) are both the source and the target of concurrent asynchronous reads, and making completion processing safe to run in a critical section (because WAL flushes, which must run in critical sections, want to use AIO too).

Common DBMS Design

Database engines that have adopted asynchronous I/O converge on a recognizable set of building blocks, though they differ in mechanism.

An I/O request descriptor with a lifecycle. Every AIO design has some object — call it a request, a handle, or an IOCB — that represents one outstanding operation: which file, which offset, which buffer, how many bytes, and a state field tracking unsubmitted -> in flight -> complete. The descriptor must outlive the syscall that started it and must be reachable by whatever code processes the completion. Where the descriptor lives matters enormously in a process-per-connection engine like PostgreSQL: it cannot be a plain stack variable, because another process may need to complete the IO.

A pluggable transport. Portable engines abstract how the IO is actually performed behind an interface, because the best mechanism is platform-specific: Linux io_uring, POSIX AIO (aio_read), Windows overlapped I/O / IOCP, a pool of synchronous worker threads, or plain blocking syscalls as a fallback. The transport is a strategy object — a table of function pointers for submit, wait, and initialization — selected at startup.

Completion handling decoupled from issuance. Because the issuer may be busy or blocked when the bytes arrive, mature designs let any worker drain completions, or offload completion to a dedicated thread/process. This is the crux of deadlock avoidance: if a backend prefetches ten pages and then blocks on a lock, those ten completions must still be processed — by someone — or the whole system can wedge. Engines solve this either by making completions globally drainable (io_uring: any backend can reap another’s ring) or by offloading the work entirely (worker pool: the worker that did the read also does the completion).

I/O combining and look-ahead. A read of one 8 KB page is wasteful when the next four pages are also wanted; combining them into one vectored preadv()/ring submission amortizes the per-syscall cost. The producer of “what to read next” is usually decoupled from the consumer via a prefetch stream or async scan iterator that looks a tunable distance ahead, merges adjacent requests, and caps the number concurrently in flight. SQL Server’s read-ahead manager, Oracle’s db_file_multiblock_read_count, and DB2’s prefetchers are all variations on this theme.

// IoMethodOps — src/include/storage/aio_internal.h
// The pluggable-transport vtable: each io_method fills in a subset.
typedef struct IoMethodOps
{
    bool        wait_on_fd_before_close;
    size_t      (*shmem_size) (void);
    void        (*shmem_init) (bool first_time);
    void        (*init_backend) (void);
    bool        (*needs_synchronous_execution) (PgAioHandle *ioh);
    int         (*submit) (uint16 num_staged_ios, PgAioHandle **staged_ios);
    void        (*wait_one) (PgAioHandle *ioh, uint64 ref_generation);
} IoMethodOps;
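
The strategy-table idea can be sketched in isolation. The following is a toy under stated assumptions, not PostgreSQL code: ToyIoMethodOps, toy_sync_submit, toy_worker_submit, and toy_select_method are hypothetical names, and the two "methods" only report how many requests they would hand off rather than performing any I/O.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified transport table: one entry point per method. */
typedef struct ToyIoMethodOps
{
    const char *name;
    /* Returns how many of the n requests were handed off asynchronously. */
    int         (*submit) (int n);
} ToyIoMethodOps;

/* "sync" stand-in: everything runs inline, nothing is handed off. */
static int
toy_sync_submit(int n)
{
    (void) n;
    return 0;
}

/* "worker" stand-in: pretend every request was queued for a helper. */
static int
toy_worker_submit(int n)
{
    return n;
}

static const ToyIoMethodOps toy_methods[] = {
    {"sync", toy_sync_submit},
    {"worker", toy_worker_submit},
};

/* Pick the table once at startup from a configuration string. */
static const ToyIoMethodOps *
toy_select_method(const char *name)
{
    for (size_t i = 0; i < sizeof(toy_methods) / sizeof(toy_methods[0]); i++)
        if (strcmp(toy_methods[i].name, name) == 0)
            return &toy_methods[i];
    return NULL;                /* unknown method name */
}
```

Callers hold only the returned pointer and call through it, so adding a transport means adding one table entry rather than touching any call site.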

The hardest design constraints are the ones the textbooks skip. Because PostgreSQL is a multi-process server, AIO state must live in shared memory: a handle initiated by backend A may be completed by backend B (when A is busy or when io_uring lets B reap A’s ring), so the descriptor, its callbacks, and its result cannot be process-local pointers. And because an EXEC_BACKEND build re-maps each process’s code at a different address under ASLR, shared memory cannot hold function pointers at all — a completion callback installed by backend A would point at garbage in backend B. This single fact forces the callback-by-integer-ID design that pervades the implementation.
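
A minimal sketch of the callback-by-integer-ID pattern follows. It is an illustration only: ToyCallbackID, ToyHandle, toy_callbacks, and toy_run_callback are hypothetical names, and the "shared" handle is an ordinary struct here. The point is that the handle stores a small integer, and each process resolves it through its own copy of a static table.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback IDs; small integers are safe to keep in shared memory. */
typedef enum ToyCallbackID
{
    TOY_CB_INVALID = 0,
    TOY_CB_MARK_VALID,
    TOY_CB_CLEAR_VALID,
    TOY_CB_NUM
} ToyCallbackID;

typedef struct ToyHandle
{
    uint8_t     cb_id;          /* stored instead of a function pointer */
    int         result;         /* raw byte count, or a negative errno */
    int         valid;          /* state updated by the completion callback */
} ToyHandle;

typedef void (*ToyCallback) (ToyHandle *h);

static void
toy_mark_valid(ToyHandle *h)
{
    h->valid = (h->result >= 0);
}

static void
toy_clear_valid(ToyHandle *h)
{
    h->valid = 0;
}

/* Each process has its own copy of this table, at its own addresses. */
static const ToyCallback toy_callbacks[TOY_CB_NUM] = {
    [TOY_CB_MARK_VALID] = toy_mark_valid,
    [TOY_CB_CLEAR_VALID] = toy_clear_valid,
};

/* Resolve the ID locally, then call; returns 0 on success, -1 on a bad ID. */
static int
toy_run_callback(ToyHandle *h)
{
    if (h->cb_id == TOY_CB_INVALID || h->cb_id >= TOY_CB_NUM)
        return -1;
    toy_callbacks[h->cb_id] (h);
    return 0;
}
```

Because only the index crosses the process boundary, it does not matter that the table lives at a different address in each process.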

PostgreSQL’s Approach

PostgreSQL 18 introduces asynchronous I/O as a self-contained subsystem under src/backend/storage/aio/, layered so that most callers never touch the low-level API at all. The whole subsystem is governed by one GUC, io_method, with three values — sync, worker (the default), and io_uring (Linux only, compiled in when liburing is present).

The AIO handle is the unit of work. A PgAioHandle (defined in aio_internal.h) is a fixed-size shared-memory record. The pool is sized once at startup: AioProcs() (every backend plus auxiliary processes) times io_max_concurrency handles each. A backend acquires one with pgaio_io_acquire(), which is guaranteed to succeed — if the backend’s handles are all in use, it blocks waiting for one of its own in-flight IOs to complete. To make that guarantee sound, the API enforces that a backend may hold at most one un-submitted handle at a time (handed_out_io); otherwise a backend could exhaust its handles without any way to wait for them to free up, and self-deadlock.

Definition is layered through the storage stack. The backend that acquires the handle is usually not the code that defines the IO. For a shared-buffer read, bufmgr.c acquires the handle and registers a completion callback, then passes it down to smgr.c, which passes it to md.c, which translates the block number into a segment file and offset and finally calls pgaio_io_start_readv() in fd.c. Each layer on the way down can register its own completion callback. This is how the AIO subsystem keeps layers ignorant of each other: bufmgr does not know the IO goes through md, and md does not know how to validate a page checksum — each contributes one callback that understands only its own concern.

Eight states, one direction. A handle moves monotonically through the states in PgAioHandleState: IDLE (in the backend’s free list) -> HANDED_OUT (returned by acquire) -> DEFINED (operation associated) -> STAGED (stage callbacks run, ready to submit) -> SUBMITTED (handed to the kernel/worker) -> COMPLETED_IO (bytes landed, result known) -> COMPLETED_SHARED (shared callbacks run — buffer marked valid) -> COMPLETED_LOCAL (issuer’s local callbacks run), after which the handle’s generation is bumped and it returns to IDLE for reuse.

// PgAioHandleState — src/include/storage/aio_internal.h
typedef enum PgAioHandleState
{
    PGAIO_HS_IDLE = 0,
    PGAIO_HS_HANDED_OUT,        /* returned by pgaio_io_acquire() */
    PGAIO_HS_DEFINED,           /* pgaio_io_start_*() called */
    PGAIO_HS_STAGED,            /* stage() ran; ready to submit */
    PGAIO_HS_SUBMITTED,         /* given to the IO method */
    PGAIO_HS_COMPLETED_IO,      /* finished, result not yet processed */
    PGAIO_HS_COMPLETED_SHARED,  /* shared completion callbacks ran */
    PGAIO_HS_COMPLETED_LOCAL,   /* local completion callbacks ran */
} PgAioHandleState;
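
The one-directional rule can be written as a small transition check. This sketch assumes only the linear path described above; the real state machine may permit other edges (for example, giving back a handle that was never used), which are not modelled. ToyHandleState and toy_transition_ok are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum ToyHandleState
{
    TOY_HS_IDLE = 0,
    TOY_HS_HANDED_OUT,
    TOY_HS_DEFINED,
    TOY_HS_STAGED,
    TOY_HS_SUBMITTED,
    TOY_HS_COMPLETED_IO,
    TOY_HS_COMPLETED_SHARED,
    TOY_HS_COMPLETED_LOCAL
} ToyHandleState;

/*
 * Assumption: only the linear path from the text is modelled. A step is
 * legal if it advances by exactly one state, or recycles the handle from
 * COMPLETED_LOCAL back to IDLE.
 */
static bool
toy_transition_ok(ToyHandleState from, ToyHandleState to)
{
    if (from == TOY_HS_COMPLETED_LOCAL)
        return to == TOY_HS_IDLE;
    return (int) to == (int) from + 1;
}
```

A debug build could assert such a check on every state update to catch skipped or backward transitions early.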

Wait references survive reuse. Because a handle is recycled the instant its IO completes, you cannot wait on the handle itself — by the time you look, it might already be servicing someone else’s read. Instead the issuer takes a PgAioWaitRef before submitting, which packs the handle’s array index plus its 64-bit generation counter. pgaio_wref_wait() resolves the reference and checks the generation: if it has advanced, the IO you cared about is long done. A wait reference can sit in shared memory and be waited on by any process.

Callbacks are integers, results are compact, errors are deferred. Completion logic is registered as a PgAioHandleCallbackID — a one-byte index into a static table aio_handle_cbs[] (e.g. PGAIO_HCB_SHARED_BUFFER_READV, PGAIO_HCB_MD_READV). The indirection exists precisely because shared memory cannot hold function pointers under EXEC_BACKEND ASLR. Callbacks run in critical sections (so AIO is usable for WAL), which means they cannot ereport(ERROR) — instead each callback “distills” the raw syscall result into a compact PgAioResult (a status enum plus a few bits of error data). The issuing backend later reads the result out of a backend-local PgAioReturn and, if it indicates failure, calls pgaio_result_report() to raise the error in a context where throwing is safe.
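
To make "distill now, report later" concrete, here is a hedged sketch of a compact result record. ToyResult, ToyResultStatus, and toy_distill are hypothetical, and the bit widths are illustrative assumptions chosen to total 32 bits; they are not taken from the PostgreSQL headers. The function never raises an error; it only records enough for a later report.

```c
#include <assert.h>
#include <stdint.h>

typedef enum ToyResultStatus
{
    TOY_RS_UNKNOWN = 0,
    TOY_RS_OK,
    TOY_RS_PARTIAL,
    TOY_RS_ERROR
} ToyResultStatus;

/* Bit widths are illustrative assumptions, chosen to total 32 bits. */
typedef struct ToyResult
{
    unsigned int id:6;          /* which callback produced the result */
    unsigned int status:3;      /* a ToyResultStatus value */
    unsigned int error_data:23; /* e.g. an errno, kept for later reporting */
    int32_t     result;         /* bytes transferred, or 0 on error */
} ToyResult;

_Static_assert(sizeof(ToyResult) == 8, "ToyResult should stay compact");

/* Distill a raw syscall-style result (bytes, or -errno) without raising. */
static ToyResult
toy_distill(unsigned int cb_id, int32_t raw, int32_t expected)
{
    ToyResult   r = {0};

    r.id = cb_id;
    if (raw < 0)
    {
        r.status = TOY_RS_ERROR;
        r.error_data = (unsigned int) -raw;
    }
    else
    {
        r.status = (raw == expected) ? TOY_RS_OK : TOY_RS_PARTIAL;
        r.result = raw;
    }
    return r;
}
```

The issuer would later inspect status and, only when it is safe to throw, turn error_data back into a user-facing error.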

// PgAioHandleFlags — src/include/storage/aio.h
typedef enum PgAioHandleFlags
{
    /* request synchronous execution even when AIO is configured */
    PGAIO_HF_SYNCHRONOUS = 1 << 0,
    /* references process-local memory; worker mode can't reopen it */
    PGAIO_HF_REFERENCES_LOCAL = 1 << 1,
    /* buffered (not direct) IO — io_uring may offload to its workers */
    PGAIO_HF_BUFFERED = 1 << 2,
} PgAioHandleFlags;

The read stream is the helper most callers actually use. Sequential scans, VACUUM, ANALYZE, bitmap heap scans, and more do not call the AIO API directly; they create a ReadStream (read_stream.c), hand it a callback that yields successive block numbers, and pull pinned buffers out one at a time with read_stream_next_buffer(). The stream looks ahead, merges adjacent blocks into vectored reads up to io_combine_limit, keeps up to max_ios reads in flight, and adapts its look-ahead distance to recent cache-hit history — shrinking to 1 when everything is cached (so it does no useless work) and doubling after each real I/O. This is the bridge between the high-level “I want to read these blocks” intent and the low-level AIO handle machinery, and it is where the performance win actually shows up for ordinary queries.

Source Walkthrough

The subsystem is small in line count but dense in invariants. We trace it bottom-up: first the handle lifecycle (aio.c), then the three method implementations (method_sync.c, method_worker.c, method_io_uring.c), then the read-stream consumer that almost every caller actually uses (read_stream.c). Buffer-pool integration (StartReadBuffer / WaitReadBuffers, the PGAIO_HCB_SHARED_BUFFER_READV callback body) lives in postgres-buffer-manager.md; segment-and-fd translation (md.c, PGAIO_HCB_MD_READV) lives in postgres-smgr-md.md. Here we stay inside src/backend/storage/aio/.

Acquiring a handle (`aio.c`)

pgaio_io_acquire() is the entry point and is guaranteed to return a handle: if the backend’s free list is empty it calls pgaio_io_wait_for_free() to reclaim one of its own in-flight IOs. The non-blocking variant pgaio_io_acquire_nb() does the real work and enforces the two invariants that make the guarantee sound. First, a backend may have at most PGAIO_SUBMIT_BATCH_SIZE (32) IOs staged; if it is already at the cap it flushes them with pgaio_submit_staged(). Second, a backend may hold at most one un-submitted handle at a time: the handed_out_io guard turns a second concurrent acquire into a hard elog(ERROR).

// pgaio_io_acquire_nb — src/backend/storage/aio/aio.c
if (pgaio_my_backend->num_staged_ios >= PGAIO_SUBMIT_BATCH_SIZE)
{
    Assert(pgaio_my_backend->num_staged_ios == PGAIO_SUBMIT_BATCH_SIZE);
    pgaio_submit_staged();
}

if (pgaio_my_backend->handed_out_io)
    elog(ERROR, "API violation: Only one IO can be handed out");
// ... pops from dclist idle_ios, moves IDLE -> HANDED_OUT ...
pgaio_io_update_state(ioh, PGAIO_HS_HANDED_OUT);
pgaio_my_backend->handed_out_io = ioh;

The handed_out_io rule is the linchpin of the README’s anti-self-deadlock argument: because a backend can only ever be defining one IO at a time, it can never paint itself into a corner where it holds N undefined handles with no way to wait for any of them.
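
The guard can be modelled in a few lines. This is a simplified sketch, not the real function: ToyBackend, toy_acquire_nb, and toy_stage are hypothetical, handles are plain integers, and an API violation is reported with a return code where the real code raises an error.

```c
#include <assert.h>

#define TOY_HANDLES_PER_BACKEND 4   /* illustrative pool size */

typedef struct ToyBackend
{
    int         idle[TOY_HANDLES_PER_BACKEND];  /* stack of free handle ids */
    int         nidle;
    int         handed_out;     /* id of the one un-submitted handle, or -1 */
} ToyBackend;

static void
toy_backend_init(ToyBackend *b)
{
    for (int i = 0; i < TOY_HANDLES_PER_BACKEND; i++)
        b->idle[i] = i;
    b->nidle = TOY_HANDLES_PER_BACKEND;
    b->handed_out = -1;
}

/*
 * Non-blocking acquire. Returns a handle id, -1 if none is idle (the
 * caller would then wait for one of its own IOs), or -2 on an API
 * violation (a second acquire while one handle is still handed out).
 */
static int
toy_acquire_nb(ToyBackend *b)
{
    if (b->handed_out != -1)
        return -2;
    if (b->nidle == 0)
        return -1;
    b->handed_out = b->idle[--b->nidle];
    return b->handed_out;
}

/* Staging the IO clears the guard so the next acquire is allowed. */
static void
toy_stage(ToyBackend *b)
{
    b->handed_out = -1;
}
```

With at most one handle in the handed-out state, every other non-idle handle is somewhere on the path to completion, so waiting for a free one always has something to wait for.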

Staging and the submit batch (`aio.c`)

Once bufmgr/md/fd have defined the operation and registered callbacks, the handle is staged. pgaio_io_stage() runs the per-target stage callbacks (the counterpart of completion callbacks — they pin resources, snapshot state), moves the handle DEFINED -> STAGED, then decides whether the IO must run synchronously (e.g. the method is sync, or the IO references process-local memory). If not, the handle is appended to the backend’s staged_ios[] array and — unless the caller explicitly opted into batching via pgaio_enter_batchmode() — submitted immediately.

// pgaio_io_stage — src/backend/storage/aio/aio.c
pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);
pgaio_my_backend->handed_out_io = NULL;     /* allow a new IO to be staged */
pgaio_io_call_stage(ioh);
pgaio_io_update_state(ioh, PGAIO_HS_STAGED);

needs_synchronous = pgaio_io_needs_synchronous_execution(ioh);
if (!needs_synchronous)
{
    pgaio_my_backend->staged_ios[pgaio_my_backend->num_staged_ios++] = ioh;
    if (!pgaio_my_backend->in_batchmode)
        pgaio_submit_staged();
}
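
The interaction between immediate submission and batch mode can be sketched with a counter-only model. ToyStager, toy_stage_io, and toy_submit_staged are hypothetical names; as a simplification, the full-batch flush happens inside the staging function here, whereas the excerpt above shows it in the acquire path.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SUBMIT_BATCH_SIZE 32    /* mirrors the cap quoted in the text */

typedef struct ToyStager
{
    int         staged[TOY_SUBMIT_BATCH_SIZE];
    int         num_staged;
    bool        in_batchmode;
    int         submit_calls;   /* how many times we crossed into the method */
    int         submitted_ios;  /* total IOs handed to the method */
} ToyStager;

/* Hand every staged IO to the (imaginary) method in one call. */
static void
toy_submit_staged(ToyStager *s)
{
    if (s->num_staged == 0)
        return;
    s->submit_calls++;
    s->submitted_ios += s->num_staged;
    s->num_staged = 0;
}

/* Stage one IO; outside batch mode it is submitted right away. */
static void
toy_stage_io(ToyStager *s, int io)
{
    if (s->num_staged == TOY_SUBMIT_BATCH_SIZE)
        toy_submit_staged(s);   /* flush a full batch before adding more */
    s->staged[s->num_staged++] = io;
    if (!s->in_batchmode)
        toy_submit_staged(s);
}
```

Three IOs staged outside batch mode cost three submit calls; forty staged inside batch mode cost one call for the first 32, plus one more when the batch is closed.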

pgaio_submit_staged() is where control crosses into the method vtable. It wraps the call to pgaio_method_ops->submit() in a critical section — because submission may itself need to complete earlier IOs (the WAL-in-critical-section scenario from the README), and completion must be crash-safe.

// pgaio_submit_staged — src/backend/storage/aio/aio.c
START_CRIT_SECTION();
did_submit = pgaio_method_ops->submit(pgaio_my_backend->num_staged_ios,
                                      pgaio_my_backend->staged_ios);
END_CRIT_SECTION();
pgaio_my_backend->num_staged_ios = 0;

Completion processing (`aio.c`)

No matter which method ran the IO, completion funnels through one function, pgaio_io_process_completion(). It is always called inside a critical section (Assert(CritSectionCount > 0)), drives the handle SUBMITTED -> COMPLETED_IO, runs the shared completion callbacks (which update shared state visible to every backend — marking a buffer valid, distilling errors into the PgAioResult), moves to COMPLETED_SHARED, and broadcasts the handle’s condition variable so waiters wake. Only if the issuing backend is the one processing the completion does it go on to run local callbacks and reclaim the handle.

// pgaio_io_process_completion — src/backend/storage/aio/aio.c
Assert(ioh->state == PGAIO_HS_SUBMITTED);
Assert(CritSectionCount > 0);
ioh->result = result;
pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_IO);
pgaio_io_call_complete_shared(ioh);
pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_SHARED);
ConditionVariableBroadcast(&ioh->cv);
if (ioh->owner_procno == MyProcNumber)
    pgaio_io_reclaim(ioh);          /* runs local callbacks too */

The shared/local split makes the multi-process constraint concrete: the shared callback (PGAIO_HCB_SHARED_BUFFER_READV) must run in whatever process reaps the completion, so it touches only shared memory; the local callback runs only in the issuer, so it may touch backend-local state. Both are named by a one-byte PgAioHandleCallbackID, never by pointer, which is where the EXEC_BACKEND constraint shows up.

flowchart TD
  A["pgaio_io_acquire_nb()<br/>IDLE to HANDED_OUT"] --> B["bufmgr/smgr/md define IO<br/>register callbacks"]
  B --> C["pgaio_io_stage()<br/>stage cbs, DEFINED to STAGED"]
  C -->|sync needed| D["pgaio_io_perform_synchronously()"]
  C -->|async| E["staged_ios[]<br/>append, maybe batch"]
  E --> F["pgaio_submit_staged()<br/>CRIT: method->submit()"]
  F --> G["SUBMITTED<br/>kernel / worker owns it"]
  G --> H["completion reaped<br/>worker, io_uring drain, or sync"]
  H --> I["pgaio_io_process_completion()<br/>CRIT: shared cbs, COMPLETED_SHARED"]
  I --> J["ConditionVariableBroadcast(cv)<br/>waiters wake"]
  I -->|issuer == self| K["pgaio_io_reclaim()<br/>local cbs, bump generation, IDLE"]
  J -.->|"pgaio_wref_wait()"| K

Method 1 — sync (`method_sync.c`)

The sync method is the degenerate case and the safety net. Its submit hook is never reached in normal operation, because needs_synchronous_execution returns true and the IO is executed inline during staging. There is therefore nothing asynchronous about it: each IO is performed with a blocking preadv()/pwritev() right where it is issued. It exists for two reasons: debugging the upper layers without any AIO machinery in the way, and providing a fallback on platforms or builds where neither workers nor io_uring are usable. Because it has no out-of-process completion, it sidesteps every deadlock concern by construction.

Method 2 — worker (`method_worker.c`, the default)

Worker mode dispatches IOs to a pool of dedicated B_IO_WORKER processes (default io_workers = 3, hard cap MAX_IO_WORKERS = 32) over a shared-memory ring, PgAioWorkerSubmissionQueue. Submission inserts each staged handle into the queue under AioWorkerSubmissionQueueLock and wakes one idle worker by setting its latch; any IOs that don’t fit in the queue are run synchronously by the submitter, after dispatching the rest, to keep concurrency high.

// pgaio_worker_submit_internal — src/backend/storage/aio/method_worker.c
LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
for (int i = 0; i < num_staged_ios; ++i)
{
    if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
    {
        /* queue full: fall back to synchronous, but only after dispatching */
        synchronous_ios[nsync++] = staged_ios[i];
        continue;
    }
    if (wakeup == NULL)
    {
        worker = pgaio_worker_choose_idle();
        if (worker >= 0)
            wakeup = io_worker_control->workers[worker].latch;
    }
}
LWLockRelease(AioWorkerSubmissionQueueLock);
if (wakeup)
    SetLatch(wakeup);

IoWorkerMain() is the per-worker loop. It consumes one queue entry under the lock, opportunistically wakes up to IO_WORKER_WAKEUP_FANOUT (2) peers if the queue is deep, re-opens the file descriptor (pgaio_io_reopen() — the worker is a different process and does not share the issuer’s open fds), then performs the IO synchronously. The completion runs in the worker, which is precisely how worker mode satisfies the README’s anti-deadlock rule: even if the issuing backend is blocked, the worker still completes the IO.

// IoWorkerMain — src/backend/storage/aio/method_worker.c
LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
if ((io_index = pgaio_worker_submission_queue_consume()) == -1)
    io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
else
{
    io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
    nwakeups = Min(pgaio_worker_submission_queue_depth(), IO_WORKER_WAKEUP_FANOUT);
    /* ... gather peer latches ... */
}
LWLockRelease(AioWorkerSubmissionQueueLock);
// ... later, for a real job:
HOLD_INTERRUPTS();
pgaio_io_reopen(ioh);
pgaio_io_perform_synchronously(ioh);    /* contains its own crit section */
RESUME_INTERRUPTS();

flowchart LR
  subgraph Issuer["issuing backend"]
    S["pgaio_worker_submit_internal()<br/>insert into ring"]
  end
  subgraph SHM["shared memory"]
    Q["PgAioWorkerSubmissionQueue<br/>(ring of io_handle indices)"]
    M["idle_worker_mask"]
  end
  subgraph Workers["B_IO_WORKER pool (io_workers, max 32)"]
    W1["IoWorkerMain #0<br/>reopen + preadv + complete"]
    W2["IoWorkerMain #1"]
    W3["IoWorkerMain #2"]
  end
  S -->|"queue_insert + SetLatch"| Q
  Q --> W1
  Q --> W2
  Q --> W3
  W1 -->|"pgaio_io_process_completion()"| SHM
  M -.->|"choose_idle"| S

Method 3 — io_uring (`method_io_uring.c`, Linux)

io_uring mode gives each backend its own ring (PgAioUringContext, one per pgaio_uring_procs() slot in shared memory) created with io_uring_queue_init_mem() so the ring buffers live in the shared segment. Submission fills a submission-queue entry per staged IO via pgaio_uring_sq_from_io() and calls io_uring_submit(). A subtlety: for buffered IO with several IOs already in flight, the code sets IOSQE_ASYNC so the kernel offloads the page-cache copy to its own worker threads rather than doing it inline (inline is lower-latency for the first few, but serializes the copies under load).

// pgaio_uring_submit — src/backend/storage/aio/method_io_uring.c
sqe = io_uring_get_sqe(uring_instance);
if (!sqe)
    elog(ERROR, "io_uring submission queue is unexpectedly full");
pgaio_io_prepare_submit(ioh);
pgaio_uring_sq_from_io(ioh, sqe);
if (in_flight_before > 4 && (ioh->flags & PGAIO_HF_BUFFERED))
    io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
// ... then loop on io_uring_submit(), handling EINTR/EAGAIN ...

The crucial property is that any backend can drain another backend’s ring. The ring is in shared memory, but reaping completion-queue entries is not inherently safe for several processes at once, so each context carries a completion_lock LWLock; pgaio_uring_drain_locked() peeks a batch of CQEs, and for each one recovers the PgAioHandle * stashed in the CQE user-data and routes it through the same pgaio_io_process_completion() used by every method.

// pgaio_uring_drain_locked — src/backend/storage/aio/method_io_uring.c
Assert(LWLockHeldByMeInMode(&context->completion_lock, LW_EXCLUSIVE));
orig_ready = ready = io_uring_cq_ready(&context->io_uring_ring);
while (ready > 0)
{
    ncqes = io_uring_peek_batch_cqe(&context->io_uring_ring, cqes,
                                    Min(PGAIO_MAX_LOCAL_COMPLETED_IO, ready));
    ready -= ncqes;
    for (int i = 0; i < ncqes; i++)
    {
        struct io_uring_cqe *cqe = cqes[i];
        PgAioHandle *ioh = io_uring_cqe_get_data(cqe);
        int         result = cqe->res;  /* read before releasing the CQE */

        io_uring_cqe_seen(&context->io_uring_ring, cqe);
        pgaio_io_process_completion(ioh, result);
    }
}

This is the other half of the anti-deadlock design: where worker mode offloads completion, io_uring mode makes completion globally drainable — backend B blocking on an IO that backend A issued simply takes A’s completion_lock and reaps it itself. The PgAioUringContext comment states it directly: “Multiple backends can process completions for this backend’s io_uring instance … only a single backend gets io completions … at a time.”

Waiting: wait references (`aio.c`)

Because a handle is reclaimed and recycled the instant its IO completes, you cannot safely wait on the handle pointer. The issuer captures a PgAioWaitRef (index + 64-bit generation) before submitting; pgaio_wref_wait() resolves it and returns immediately if the generation has moved on. The low-level pgaio_io_wait() checks pgaio_io_was_recycled() — a generation comparison behind a read barrier — so a stale wait is a cheap no-op rather than a use-after-free.
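
The index-plus-generation idea can be sketched as follows. This is a single-process toy: ToyIoHandle, ToyWaitRef, and the toy_* functions are hypothetical, and the memory barriers a real multi-process implementation needs are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NUM_HANDLES 8

typedef struct ToyIoHandle
{
    uint64_t    generation;     /* bumped every time the slot is recycled */
    bool        in_flight;
} ToyIoHandle;

/* A wait reference names a slot plus the generation it was taken at. */
typedef struct ToyWaitRef
{
    uint32_t    index;
    uint64_t    generation;
} ToyWaitRef;

static ToyIoHandle toy_handles[TOY_NUM_HANDLES];

static ToyWaitRef
toy_get_wref(uint32_t index)
{
    ToyWaitRef  ref = {index, toy_handles[index].generation};

    return ref;
}

/* Completion recycles the slot: new generation, ready for another IO. */
static void
toy_complete_and_recycle(uint32_t index)
{
    toy_handles[index].in_flight = false;
    toy_handles[index].generation++;
}

/* True if the IO the reference named has finished (the slot moved on). */
static bool
toy_wref_is_done(ToyWaitRef ref)
{
    return toy_handles[ref.index].generation != ref.generation;
}
```

Even if the slot is immediately reused for a different read, the old reference still answers "done" because its recorded generation no longer matches.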

The read-stream consumer (`read_stream.c`)

Almost no caller touches the handle API directly. Sequential scans, VACUUM, ANALYZE, bitmap heap scans, and more create a ReadStream via read_stream_begin_relation() (or read_stream_begin_smgr_relation()), hand it a ReadStreamBlockNumberCB that yields successive block numbers, and pull pinned buffers out one at a time with read_stream_next_buffer(). The stream is the piece that turns “I want these blocks” into deep, combined, concurrent AIO.

read_stream_look_ahead() is the engine. It runs while two budgets allow: fewer than max_ios reads are in flight, and the number of pinned-or-pending buffers is below the adaptive look-ahead distance. It merges a new block into the pending read when it is contiguous, and starts the pending read (issuing the vectored AIO) when it reaches io_combine_limit or can grow no further.

// read_stream_look_ahead — src/backend/storage/aio/read_stream.c
while (stream->ios_in_progress < stream->max_ios &&
       stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
{
    // ...
    blocknum = read_stream_get_block(stream, per_buffer_data);
    if (blocknum == InvalidBlockNumber)
    {
        stream->distance = 0;           /* end of stream */
        break;
    }
    /* contiguous? merge into the pending vectored read */
    if (stream->pending_read_nblocks > 0 &&
        stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
    {
        stream->pending_read_nblocks++;
        continue;
    }
    /* non-contiguous: flush the pending read, then start a new one */
    while (stream->pending_read_nblocks > 0)
    {
        if (!read_stream_start_pending_read(stream) ||
            stream->ios_in_progress == stream->max_ios)
        {
            read_stream_unget_block(stream, blocknum);   /* rewind, stop */
            return;
        }
    }
    stream->pending_read_blocknum = blocknum;
    stream->pending_read_nblocks = 1;
}
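
Stripped of pinning and I/O limits, the merging rule reduces to a run-length pass over block numbers. The sketch below is an approximation under assumptions: ToyRead and toy_combine_blocks are hypothetical, combine_limit is assumed to be at least 1, and the whole block sequence is known up front, whereas the real stream pulls blocks from a callback one at a time.

```c
#include <assert.h>
#include <stdint.h>

typedef struct ToyRead
{
    uint32_t    blocknum;       /* first block of the vectored read */
    int         nblocks;        /* how many contiguous blocks it covers */
} ToyRead;

/*
 * Turn a sequence of block numbers into combined reads. A block joins
 * the pending read only if it directly follows it and the read is still
 * below combine_limit; otherwise the pending read is emitted first.
 * Returns the number of reads written to out (at most n).
 */
static int
toy_combine_blocks(const uint32_t *blocks, int n, int combine_limit,
                   ToyRead *out)
{
    int         nreads = 0;
    ToyRead     pending = {0, 0};

    for (int i = 0; i < n; i++)
    {
        if (pending.nblocks > 0 &&
            pending.nblocks < combine_limit &&
            pending.blocknum + (uint32_t) pending.nblocks == blocks[i])
        {
            pending.nblocks++;
            continue;
        }
        if (pending.nblocks > 0)
            out[nreads++] = pending;
        pending.blocknum = blocks[i];
        pending.nblocks = 1;
    }
    if (pending.nblocks > 0)
        out[nreads++] = pending;
    return nreads;
}
```

With a limit of 4, the blocks 10 to 14 become one four-block read plus a single-block read, because the run is cut when it reaches the limit.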

read_stream_next_buffer() is the consumer side and where the distance adapts. When it must actually wait for the IO behind the buffer it is about to return (WaitReadBuffers), it doubles the look-ahead distance (clamped to max_pinned_buffers) — so a stream that keeps hitting real I/O ramps its prefetch depth up rapidly, while a stream finding everything in cache lets the distance collapse toward 1 and does no wasted look-ahead.

// read_stream_next_buffer — src/backend/storage/aio/read_stream.c
WaitReadBuffers(&stream->ios[io_index].op);
stream->ios_in_progress--;
if (++stream->oldest_io_index == stream->max_ios)
    stream->oldest_io_index = 0;

/* Look-ahead distance ramps up rapidly after we do I/O. */
distance = stream->distance * 2;
distance = Min(distance, stream->max_pinned_buffers);
stream->distance = distance;
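
The ramp-up and decay can be modelled with one function. The doubling and the clamp follow the excerpt above; the decay step (shrinking by one on a cache hit, never below 1) is an assumption made for this sketch, since the text only says the distance collapses toward 1. ToyStream and toy_adapt_distance are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToyStream
{
    int         distance;           /* current look-ahead distance */
    int         max_pinned_buffers; /* upper bound on the distance */
} ToyStream;

/*
 * Update the distance after returning one buffer. Waiting for real I/O
 * doubles it (clamped); a cache hit is assumed here to shrink it by one,
 * never below 1.
 */
static void
toy_adapt_distance(ToyStream *s, bool waited_for_io)
{
    if (waited_for_io)
    {
        int         distance = s->distance * 2;

        s->distance = distance < s->max_pinned_buffers
            ? distance : s->max_pinned_buffers;
    }
    else if (s->distance > 1)
        s->distance--;
}
```

Starting from 1 with a cap of 16, four consecutive waits reach the cap, and a long run of cache hits walks the distance back down to 1.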

A note on batch_mode: when a stream’s callback is known to be deadlock-safe, the stream brackets its look-ahead in pgaio_enter_batchmode() / pgaio_exit_batchmode(), so several reads are staged and then submitted in one pgaio_submit_staged() call, amortizing the submit cost. The default is to submit each IO immediately (see pgaio_io_stage() above), precisely because holding staged but un-submitted IOs is what can deadlock if the callback blocks.

Position hints (as of 2026-06-05, REL_18 273fe94)

Symbol	File	Line
`pgaio_io_acquire`	`src/backend/storage/aio/aio.c`	162
`pgaio_io_acquire_nb`	`src/backend/storage/aio/aio.c`	188
`pgaio_io_get_wref`	`src/backend/storage/aio/aio.c`	366
`pgaio_io_update_state`	`src/backend/storage/aio/aio.c`	386
`pgaio_io_stage`	`src/backend/storage/aio/aio.c`	424
`pgaio_io_prepare_submit`	`src/backend/storage/aio/aio.c`	510
`pgaio_io_process_completion`	`src/backend/storage/aio/aio.c`	528
`pgaio_io_was_recycled`	`src/backend/storage/aio/aio.c`	558
`pgaio_io_wait`	`src/backend/storage/aio/aio.c`	579
`pgaio_io_wait_for_free`	`src/backend/storage/aio/aio.c`	761
`pgaio_wref_wait`	`src/backend/storage/aio/aio.c`	991
`pgaio_submit_staged`	`src/backend/storage/aio/aio.c`	1123
`PgAioWorkerSubmissionQueue` (struct)	`src/backend/storage/aio/method_worker.c`	55
`pgaio_worker_submission_queue_insert`	`src/backend/storage/aio/method_worker.c`	181
`pgaio_worker_submission_queue_consume`	`src/backend/storage/aio/method_worker.c`	202
`pgaio_worker_submit_internal`	`src/backend/storage/aio/method_worker.c`	244
`pgaio_worker_submit`	`src/backend/storage/aio/method_worker.c`	295
`IoWorkerMain`	`src/backend/storage/aio/method_worker.c`	386
`io_workers` (GUC var)	`src/backend/storage/aio/method_worker.c`	94
`PgAioUringContext` (struct)	`src/backend/storage/aio/method_io_uring.c`	87
`pgaio_uring_submit`	`src/backend/storage/aio/method_io_uring.c`	405
`pgaio_uring_sq_from_io`	`src/backend/storage/aio/method_io_uring.c`	(decl 59)
`pgaio_uring_drain_locked`	`src/backend/storage/aio/method_io_uring.c`	526
`pgaio_uring_wait_one`	`src/backend/storage/aio/method_io_uring.c`	584
`read_stream_get_block`	`src/backend/storage/aio/read_stream.c`	179
`read_stream_start_pending_read`	`src/backend/storage/aio/read_stream.c`	230
`read_stream_look_ahead`	`src/backend/storage/aio/read_stream.c`	429
`read_stream_begin_relation`	`src/backend/storage/aio/read_stream.c`	746
`read_stream_next_buffer`	`src/backend/storage/aio/read_stream.c`	800
`PGAIO_SUBMIT_BATCH_SIZE` (=32)	`src/include/storage/aio_internal.h`	28
`PgAioHandleState` (enum)	`src/include/storage/aio_internal.h`	43
`IoMethodOps` (vtable)	`src/include/storage/aio_internal.h`	260
`PgAioHandleCallbackID` (enum)	`src/include/storage/aio.h`	192
`PgAioResult` (struct)	`src/include/storage/aio_types.h`	99
`MAX_IO_WORKERS` (=32)	`src/include/storage/proc.h`	460

Source verification (as of 2026-06-05)

All claims below were checked against the working tree at /data/hgryoo/references/postgres, branch REL_18_STABLE, commit 273fe94852b3a7e34fd171e8abdf1481beb302fa (PostgreSQL 18.x).

Verified facts

Three io_methods, worker is the default. IoMethod in src/include/storage/aio.h enumerates IOMETHOD_SYNC, IOMETHOD_WORKER, and (guarded by IOMETHOD_IO_URING_ENABLED) IOMETHOD_IO_URING. The default is worker mode; io_workers = 3 is the default worker count (method_worker.c).
At most one handed-out IO. pgaio_io_acquire_nb() (aio.c) raises elog(ERROR, "API violation: Only one IO can be handed out") when handed_out_io is already set — verified verbatim.
Submit batch size is 32. PGAIO_SUBMIT_BATCH_SIZE is defined as 32 in aio_internal.h; staged_ios[PGAIO_SUBMIT_BATCH_SIZE] is the per-backend staging array.
Eight-state handle machine. PgAioHandleState lists exactly PGAIO_HS_IDLE, _HANDED_OUT, _DEFINED, _STAGED, _SUBMITTED, _COMPLETED_IO, _COMPLETED_SHARED, _COMPLETED_LOCAL (8 values) in aio_internal.h.
Submission and completion run in critical sections. pgaio_io_process_completion() asserts CritSectionCount > 0; pgaio_submit_staged() wraps pgaio_method_ops->submit() in START_CRIT_SECTION() / END_CRIT_SECTION(). Confirmed.
Callbacks are integer IDs, not pointers. PgAioHandleCallbackID (aio.h) is an enum (PGAIO_HCB_INVALID, PGAIO_HCB_MD_READV, PGAIO_HCB_SHARED_BUFFER_READV, PGAIO_HCB_LOCAL_BUFFER_READV); the README states shared memory “currently cannot contain function pointers” under EXEC_BACKEND ASLR, motivating the indirection.
PgAioResult is 8 bytes, bit-packed. aio_types.h defines it with bitfields id, status, error_data plus an int32 result, and a StaticAssertDecl(sizeof(PgAioResult) == 8, ...).
Worker pool cap is 32. MAX_IO_WORKERS is #defined to 32 in src/include/storage/proc.h; IO_WORKER_WAKEUP_FANOUT is 2 in method_worker.c. Workers run as B_IO_WORKER (MyBackendType = B_IO_WORKER in IoWorkerMain).
io_uring: one ring per backend, drainable by any backend. PgAioUringContext carries a per-ring completion_lock LWLock; its header comment confirms multiple backends may drain one ring under that lock. pgaio_uring_drain_locked() recovers the handle via io_uring_cqe_get_data() and routes through pgaio_io_process_completion().
io_uring buffered-IO offload heuristic. pgaio_uring_submit() sets IOSQE_ASYNC only when in_flight_before > 4 && (ioh->flags & PGAIO_HF_BUFFERED) — verified verbatim.
Read-stream distance doubles after real I/O. read_stream_next_buffer() computes distance = stream->distance * 2; distance = Min(distance, stream->max_pinned_buffers) immediately after WaitReadBuffers(). Look-ahead is bounded by max_ios and the adaptive distance; merges are capped at io_combine_limit (DEFAULT_IO_COMBINE_LIMIT = Min(MAX_IO_COMBINE_LIMIT, (128*1024)/BLCKSZ) in bufmgr.h).
“Direct IO is unusably slow without AIO.” Quoted from src/backend/storage/aio/README.md (Motivation), confirming AIO is the prerequisite for direct I/O rather than an independent feature.
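Several of the facts above are layout or arithmetic claims, so they can be restated as compile-time checks. The following is a minimal standalone sketch, not PostgreSQL source. Every `Sketch*` / `SK_*` name is hypothetical, the bit-field widths (6/3/23) are an assumption chosen only so the three fields fill one 32-bit word, the upper bound on the combine limit is an illustrative value, and the block size is assumed to be the default 8192 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for illustration only; none of these are PostgreSQL headers. */
#define SK_BLCKSZ               8192   /* assumed default block size */
#define SK_MAX_IO_COMBINE_LIMIT 128    /* assumed illustrative upper bound */
#define SK_MIN(a, b)            ((a) < (b) ? (a) : (b))

/* Mirrors the shape Min(MAX_IO_COMBINE_LIMIT, (128*1024)/BLCKSZ). */
#define SK_DEFAULT_IO_COMBINE_LIMIT \
    SK_MIN(SK_MAX_IO_COMBINE_LIMIT, (128 * 1024) / SK_BLCKSZ)

/* The eight handle states named in the verification notes, plus a sentinel. */
typedef enum SketchHandleState
{
    SK_HS_IDLE,
    SK_HS_HANDED_OUT,
    SK_HS_DEFINED,
    SK_HS_STAGED,
    SK_HS_SUBMITTED,
    SK_HS_COMPLETED_IO,
    SK_HS_COMPLETED_SHARED,
    SK_HS_COMPLETED_LOCAL,
    SK_HS_COUNT                 /* sentinel, not a real state */
} SketchHandleState;

/* A bit-packed result: three bit-fields sharing 32 bits, then an int32. */
typedef struct SketchAioResult
{
    uint32_t id : 6;            /* assumed width */
    uint32_t status : 3;        /* assumed width */
    uint32_t error_data : 23;   /* assumed width */
    int32_t  result;
} SketchAioResult;

/* Compile-time restatements of the verified facts. */
_Static_assert(SK_HS_COUNT == 8, "handle state machine has eight states");
_Static_assert(sizeof(SketchAioResult) == 8, "result must pack into 8 bytes");
_Static_assert(SK_DEFAULT_IO_COMBINE_LIMIT == 16,
               "128 KB / 8 KB blocks = 16 blocks per combined IO");

/* Runtime accessor so the arithmetic can also be checked from a test. */
int
sk_default_io_combine_limit(void)
{
    return SK_DEFAULT_IO_COMBINE_LIMIT;
}
```

With an 8 KB block size the default combine limit works out to 16 blocks, i.e. a single combined read of at most 128 KB; a build with a different block size would change that quotient.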

Open questions / deferred

The body of PGAIO_HCB_SHARED_BUFFER_READV (page-validity marking, checksum verification) and StartReadBuffer/WaitReadBuffers live in bufmgr.c and are covered in postgres-buffer-manager.md, not re-verified here.
PGAIO_HCB_MD_READV and the block-to-segment translation it wraps live in md.c / smgr.c; see postgres-smgr-md.md.
Exact sizing of the per-backend handle pool (AioProcs() * io_max_concurrency) is in aio_init.c / aio_funcs.c; the doc states the shape but the precise auxiliary-process accounting was not line-verified.
io_uring capability probing (pgaio_uring_check_capabilities(), io_uring_queue_init_mem() vs io_uring_queue_init() fallback) is described at the design level only.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

PostgreSQL’s AIO subsystem arrives late relative to other engines, and the design reflects lessons learned from watching them.

SQL Server has had an asynchronous, scatter/gather read-ahead manager since the 1990s. Its buffer manager issues read-ahead based on the access pattern the query processor declares (a range scan tells the storage engine the next N pages it will want), and completions are serviced by I/O completion ports (IOCP) — Windows’ equivalent of io_uring’s globally drainable completion model. The key structural difference is threading: SQL Server is a thread-per-task (or fiber) engine, so an outstanding read does not block a worker the way a synchronous pread() blocks a PostgreSQL backend. PostgreSQL’s process model is exactly why it needed shared-memory handles and integer callback IDs — state that a thread engine keeps on a private stack must, here, be reachable by any process.
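The integer-callback-ID indirection mentioned above can be illustrated with a small sketch. It assumes a handle living in shared memory stores only a small integer, and each process resolves that integer through its own constant table, so no function address ever crosses a process boundary. All `Sketch*` / `sk_*` names, the handle layout, and the callback bodies are hypothetical; only the four ID names echo the enum values listed in the verification notes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical IDs echoing the PGAIO_HCB_* values; COUNT is a sentinel. */
typedef enum SketchCallbackID
{
    SK_HCB_INVALID = 0,
    SK_HCB_MD_READV,
    SK_HCB_SHARED_BUFFER_READV,
    SK_HCB_LOCAL_BUFFER_READV,
    SK_HCB_COUNT
} SketchCallbackID;

typedef int (*SketchCompleteFn)(int raw_result);

/* Placeholder completion bodies; the real callbacks do far more. */
static int sk_md_readv_complete(int raw) { return raw; }
static int sk_shared_buffer_complete(int raw) { return raw < 0 ? -1 : 0; }
static int sk_local_buffer_complete(int raw) { return raw < 0 ? -1 : 0; }

/*
 * Per-process table. Each process has its own copy at whatever address its
 * image was loaded, which is why the shared handle stores an index instead.
 */
static const SketchCompleteFn sk_callbacks[SK_HCB_COUNT] = {
    [SK_HCB_INVALID] = NULL,
    [SK_HCB_MD_READV] = sk_md_readv_complete,
    [SK_HCB_SHARED_BUFFER_READV] = sk_shared_buffer_complete,
    [SK_HCB_LOCAL_BUFFER_READV] = sk_local_buffer_complete,
};

/* What would sit in shared memory: plain integers, no pointers. */
typedef struct SketchSharedHandle
{
    uint8_t callback_id;
    int32_t raw_result;
} SketchSharedHandle;

/* Any process can run the completion by looking the ID up locally. */
int
sk_run_completion(const SketchSharedHandle *h)
{
    if (h->callback_id == SK_HCB_INVALID || h->callback_id >= SK_HCB_COUNT)
        return -1;              /* unknown or unset callback */
    return sk_callbacks[h->callback_id](h->raw_result);
}
```

The design choice this illustrates is that the table is code-side and the handle is data-side: a process other than the issuer can complete the IO as long as it was built from the same source and therefore agrees on the ID numbering.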

Oracle exposes asynchronous I/O through its DBWR/LGWR background processes and sizes multi-block reads with db_file_multiblock_read_count, the closest analogue of PostgreSQL’s io_combine_limit. Oracle’s ASM and direct-path reads bypass the OS cache much as PostgreSQL’s direct-I/O-plus-AIO does, and Oracle long ago concluded what the PG18 README now states: direct I/O without engine-driven prefetch is a performance trap.

The io_uring interface itself (Axboe, 2019) is the enabling technology, and PostgreSQL’s adoption is a careful one. io_uring’s submission/completion ring pair maps almost directly onto the submit()/drain model, but PG had to solve a problem io_uring does not address: in a multi-process server, whose ring does a completion land in, and who is allowed to reap it? The answer — one ring per backend in shared memory, guarded by a per-ring completion_lock so any backend can drain any ring — is PostgreSQL-specific glue around a Linux primitive. The IOSQE_ASYNC heuristic for buffered I/O reflects a known io_uring sharp edge: inline execution copies page-cache data on the submitting CPU, which serializes under load, so PG offloads to kernel workers once a few IOs are already in flight.
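The offload decision described above reduces to a two-part predicate, which the verification notes quote as `in_flight_before > 4 && (ioh->flags & PGAIO_HF_BUFFERED)`. A minimal sketch of that predicate follows. It deliberately avoids liburing so it compiles anywhere; the function name, the flag constant, and its bit position are hypothetical stand-ins, and only the threshold of 4 and the shape of the condition come from the notes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for PGAIO_HF_BUFFERED; the bit position is assumed. */
#define SK_HF_BUFFERED      (1u << 0)

/* Threshold quoted in the verification notes. */
#define SK_ASYNC_THRESHOLD  4

/*
 * Decide whether a submission should be pushed to kernel workers (the role
 * IOSQE_ASYNC plays) instead of being executed inline by the submitter.
 * Only buffered IO qualifies, and only once enough IOs are already in flight.
 */
bool
sk_should_force_async(int in_flight_before, uint32_t handle_flags)
{
    return in_flight_before > SK_ASYNC_THRESHOLD &&
           (handle_flags & SK_HF_BUFFERED) != 0;
}
```

Note the strict inequality: with exactly four IOs already in flight the fifth is still executed inline, and an unbuffered (direct) IO is never forced to a worker by this rule regardless of queue depth.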

Research frontiers. The textbook framing of a DBMS as a machine for overlapping computation with I/O (Hellerstein, Stonebraker & Hamilton, Architecture of a Database System, captured in dbms-papers/fntdb07-architecture.md) is now being pushed in two directions. First, learned and adaptive prefetching: PostgreSQL’s read-stream distance heuristic (double after a real I/O wait, decay gradually when no I/O was needed) is a hand-tuned controller, and there is active research on replacing such controllers with models that predict access patterns. Second, kernel-bypass and computational storage: SPDK-style user-space NVMe drivers and smart SSDs that run predicate evaluation on the device push the “overlap I/O with compute” idea past what a host-side AIO layer can do. PostgreSQL’s pluggable IoMethodOps vtable is, deliberately, the seam where such a method could one day be slotted in beside sync, worker, and io_uring. The deadlock-avoidance contract (a method must either let any backend complete an IO, or guarantee out-of-band completion) is the invariant any future method must honor.
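To make the "hand-tuned controller" concrete, here is a minimal sketch of the growth half of the read-stream distance rule, the part the verification notes confirm: after a real I/O wait the distance doubles and is clamped to the pin limit. The function names are hypothetical, and the path taken when no I/O was needed is intentionally not modeled here.

```c
#include <assert.h>

/*
 * Growth step: double the look-ahead distance after a real I/O wait,
 * clamped to max_pinned_buffers. Distances below 1 are treated as 1 so
 * that doubling always makes progress.
 */
int
sk_distance_after_io(int distance, int max_pinned_buffers)
{
    int next;

    if (distance < 1)
        distance = 1;
    next = distance * 2;
    return next < max_pinned_buffers ? next : max_pinned_buffers;
}

/*
 * How many consecutive I/O waits it takes to ramp from `start` up to `cap`.
 * Illustrates that the ramp is logarithmic in the cap.
 */
int
sk_waits_to_reach_cap(int start, int cap)
{
    int distance = start < 1 ? 1 : start;
    int waits = 0;

    while (distance < cap)
    {
        distance = sk_distance_after_io(distance, cap);
        waits++;
    }
    return waits;
}
```

Starting from a distance of 1 with a cap of 64, six waits are enough to reach full look-ahead (1, 2, 4, 8, 16, 32, 64). A learned controller would have to beat that ramp without overshooting on workloads that turn out to be cached.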

Sources

Code (REL_18_STABLE, commit 273fe94):
- src/backend/storage/aio/aio.c — handle lifecycle, staging, submission, completion, wait references.
- src/backend/storage/aio/aio_callback.c, aio_target.c, aio_io.c — callback dispatch, target abstraction, IO-op definition.
- src/backend/storage/aio/aio_init.c, aio_funcs.c — shared-memory sizing and SQL-visible introspection.
- src/backend/storage/aio/method_sync.c — synchronous fallback method.
- src/backend/storage/aio/method_worker.c — worker pool, submission queue, IoWorkerMain, B_IO_WORKER.
- src/backend/storage/aio/method_io_uring.c — per-backend rings, submit, drain, wait_one.
- src/backend/storage/aio/read_stream.c — look-ahead, IO combining, adaptive distance, the consumer API.
- src/backend/storage/aio/README.md — motivation, deadlock/starvation design criteria, EXEC_BACKEND callback-ID rationale.
- src/include/storage/aio.h, aio_internal.h, aio_types.h — public API, IoMethodOps vtable, PgAioHandleState, PgAioResult, callback IDs.
- src/include/storage/proc.h — MAX_IO_WORKERS.
- src/include/storage/bufmgr.h — io_combine_limit, MAX_IO_COMBINE_LIMIT.
Theory: Hellerstein, Stonebraker & Hamilton, Architecture of a Database System (FnTDB 2007) — overlap of computation and I/O; captured locally in knowledge/research/dbms-papers/fntdb07-architecture.md.
Related KB docs: postgres-buffer-manager.md (shared-buffer completion callbacks, StartReadBuffer/WaitReadBuffers), postgres-smgr-md.md (segment/fd translation, PGAIO_HCB_MD_READV), postgres-shared-memory-ipc.md (shared-memory layout, LWLocks), postgres-aux-processes.md (B_IO_WORKER in the auxiliary-process taxonomy), postgres-checkpoint.md and postgres-xlog-wal.md (AIO for WAL/data writes).