PostgreSQL COPY — Bulk Load/Unload, Formats, and Multi-Insert
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”Every database engine offers two doors into and out of its tables. The
narrow door is the per-row INSERT/SELECT path: each row passes
through the full SQL machinery — parse, plan, execute, log, index — and
pays that machinery’s fixed per-statement overhead. The wide door is a
bulk loader: a command that streams a large external dataset into a
table (or out of it) while amortizing as much of that overhead as
possible across many rows. PostgreSQL’s wide door is COPY.
The problem a bulk loader solves is fundamentally an impedance-matching
one. On one side sits an external byte stream — a file, a pipe to a
program, or a client socket — laid out in some serialization format
(delimited text, CSV, a binary tuple encoding). On the other side sits
the engine’s internal row representation: a TupleTableSlot of Datum
values typed according to the table’s TupleDesc, destined for heap
pages through the table access method. The loader’s job is to convert
between these two representations as fast as correctness allows, and the
unloader’s job is the exact inverse.
Database System Concepts (Silberschatz, Korth, Sudarshan), in its
chapter on storage and the physical organization of data, frames the
cost of loading in terms of the number of distinct I/O and bookkeeping
operations per row. A naïve loader that inserts one row at a time pays,
for each row: a tuple-insert call, a WAL record, a per-row index update,
and a free-space-map probe. A good bulk loader collapses these: it
batches tuple inserts so one page write carries many rows, it can
skip the free-space-map when it knows the relation is new, and — when
the engine’s transactional rules permit — it can write tuples
pre-frozen so they never need a later visibility-establishing rewrite
by VACUUM. Each of these is a recognized bulk-load optimization;
PostgreSQL’s COPY implements all three.
The design space a bulk loader chooses within has four axes:
-
Format negotiation. Does the loader hard-code one wire format, or does it support several (human-readable delimited text, RFC-4180 CSV, a compact binary tuple encoding) behind a common dispatch layer? A pluggable format layer lets the same row-handling core serve very different serializations.
-
Parsing strategy. Text formats require a state machine that finds row and field boundaries while honoring quoting and escaping rules; binary formats require a framed reader that trusts a length-prefixed layout. The parser is where most of a text loader’s CPU goes, so it is heavily tuned.
-
Batching and durability. How many rows accumulate before a write, and what transactional guarantees does the load make? Batching trades memory for throughput; freezing trades MVCC strictness for the elimination of a later rewrite.
-
Error tolerance. Does one malformed row abort the whole load, or can the loader skip bad rows and keep going up to some limit? This is the difference between a brittle and a production-grade loader.
PostgreSQL’s COPY is a single command (DoCopy) that resolves all four
axes from an option list and then runs one of two engines — a load
engine (CopyFrom) or an unload engine (DoCopyTo) — sharing a common
format-routine abstraction. The rest of this document follows that
structure.
Common DBMS Design
Section titled “Common DBMS Design”Bulk loaders across systems converge on a small set of engineering conventions. Naming them makes PostgreSQL’s specific symbols read as one set of choices within a shared playbook.
A format-dispatch vtable
Section titled “A format-dispatch vtable”Rather than thread if (csv) ... else if (binary) ... through every
function, mature loaders factor each supported format into a small table
of function pointers: start, per-field type setup, per-row, end.
The core engine calls through the vtable and never re-tests the format.
PostgreSQL names these CopyToRoutine and CopyFromRoutine
(copyapi.h) and selects one with CopyFromGetRoutine /
CopyToGetRoutine. The same indirection is what lets out-of-core
extensions register custom COPY formats.
A staged buffer pipeline for text parsing
Section titled “A staged buffer pipeline for text parsing”Text/CSV parsing is split into stages with a buffer between each, so each
stage has one job and the expensive ones (encoding conversion, line
splitting, field de-escaping) are isolated and individually tunable. The
universal shape is: read raw bytes -> transcode to server encoding ->
split into lines -> split each line into de-escaped fields -> call type
input functions. PostgreSQL implements exactly this:
raw_buf -> input_buf -> line_buf -> attribute_buf.
Multi-row insert batching
Section titled “Multi-row insert batching”The single most important load optimization is to insert many rows per
physical operation. Loaders accumulate parsed rows in an in-memory buffer
and flush the buffer as a batch to the storage layer, which writes one
(or few) pages and one WAL record covering all of them. The batch size is
bounded by both a row count and a byte count to keep memory predictable.
PostgreSQL’s buffer is CopyMultiInsertBuffer; the batch flush is
table_multi_insert.
Conditional fast paths and correctness fallbacks
Section titled “Conditional fast paths and correctness fallbacks”Batching is only safe when nothing observes the partially-loaded table
mid-load. Triggers that query the table, volatile default expressions,
and certain foreign-data-wrapper limitations all force a fallback to
one-row-at-a-time insertion. Every batching loader carries this
escape hatch. PostgreSQL encodes it as the CopyInsertMethod enum
(CIM_SINGLE, CIM_MULTI, CIM_MULTI_CONDITIONAL).
Pre-frozen / minimally-logged loads
Section titled “Pre-frozen / minimally-logged loads”When a load targets a relation created in the same transaction (so no
other session can yet see it), the engine can take liberties: skip the
free-space map, and write tuples already marked as committed-and-frozen
so a later VACUUM need not revisit them. PostgreSQL exposes this as
COPY ... FREEZE and implements it with the TABLE_INSERT_FROZEN and
TABLE_INSERT_SKIP_FSM table-AM flags.
Soft error handling
Section titled “Soft error handling”A production loader distinguishes hard errors (protocol corruption, I/O
failure — always fatal) from soft errors (a single field that won’t
parse as its declared type). Soft errors can be trapped, the offending
row skipped, and the load continued up to a configurable limit.
PostgreSQL routes type-input failures through an ErrorSaveContext and
InputFunctionCallSafe, governed by ON_ERROR, REJECT_LIMIT, and
LOG_VERBOSITY.
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”One entry point, two engines
Section titled “One entry point, two engines”DoCopy (in copy.c) is the SQL-level entry point. It performs the
privilege checks that make server-side file access safe, opens and locks
the relation, transforms an optional WHERE clause, and then branches on
direction:
// DoCopy — src/backend/commands/copy.cif (is_from){ CopyFromState cstate; Assert(rel); /* check read-only transaction and parallel mode */ if (XactReadOnly && !rel->rd_islocaltemp) PreventCommandIfReadOnly("COPY FROM"); cstate = BeginCopyFrom(pstate, rel, whereClause, stmt->filename, stmt->is_program, NULL, stmt->attlist, stmt->options); *processed = CopyFrom(cstate); /* copy from file to database */ EndCopyFrom(cstate);}else{ CopyToState cstate; cstate = BeginCopyTo(pstate, rel, query, relid, stmt->filename, stmt->is_program, NULL, stmt->attlist, stmt->options); *processed = DoCopyTo(cstate); /* copy from database to file */ EndCopyTo(cstate);}The privilege model deserves a note: anyone may COPY ... TO STDOUT or
FROM STDIN (the data flows over the client connection), but reading or
writing a server-side file or program requires membership in
pg_read_server_files / pg_write_server_files /
pg_execute_server_program. That check is the first thing DoCopy does.
A second subtlety: if row-level security is enabled on the target,
COPY relation TO is silently rewritten into a query-based
COPY (SELECT ...) TO so the rewriter can inject the RLS quals; and
COPY FROM under RLS is simply refused.
Options become a CopyFormatOptions
Section titled “Options become a CopyFormatOptions”ProcessCopyOptions walks the DefElem list from the grammar and
transposes it into a flat CopyFormatOptions struct, applying a long
sequence of self-consistency checks. This is where FORMAT csv|binary,
the delimiter/quote/escape characters, HEADER, FREEZE, ON_ERROR,
LOG_VERBOSITY, and REJECT_LIMIT are all resolved, and where the
defaults are filled in:
// ProcessCopyOptions — src/backend/commands/copy.c/* Set defaults for omitted options */if (!opts_out->delim) opts_out->delim = opts_out->csv_mode ? "," : "\t";if (!opts_out->null_print) opts_out->null_print = opts_out->csv_mode ? "" : "\\N";opts_out->null_print_len = strlen(opts_out->null_print);if (opts_out->csv_mode){ if (!opts_out->quote) opts_out->quote = "\""; if (!opts_out->escape) opts_out->escape = opts_out->quote;}Note the format-coupled defaults: the field delimiter is a tab in text
mode but a comma in CSV mode; the NULL marker is \N in text mode but an
empty unquoted field in CSV. Many of the subsequent checks enforce that
options only legal in one mode are rejected in the others — for example
only ON_ERROR STOP is allowed in BINARY mode, and REJECT_LIMIT requires ON_ERROR to be set to IGNORE.
The three error-handling options are parsed by dedicated helpers that map keywords to enum values:
// defGetCopyOnErrorChoice — src/backend/commands/copy.cif (pg_strcasecmp(sval, "stop") == 0) return COPY_ON_ERROR_STOP;if (pg_strcasecmp(sval, "ignore") == 0) return COPY_ON_ERROR_IGNORE;// defGetCopyLogVerbosityChoice maps silent/default/verbose to// COPY_LOG_VERBOSITY_{SILENT,DEFAULT,VERBOSE}Three formats behind one vtable
Section titled “Three formats behind one vtable”The format abstraction is the spine of the modern COPY code. Each
built-in format is a static const routine table; CopyFromGetRoutine
picks one based on the resolved options:
// CopyFromRoutineCSV / CopyFromGetRoutine — src/backend/commands/copyfrom.cstatic const CopyFromRoutine CopyFromRoutineCSV = { .CopyFromInFunc = CopyFromTextLikeInFunc, .CopyFromStart = CopyFromTextLikeStart, .CopyFromOneRow = CopyFromCSVOneRow, .CopyFromEnd = CopyFromTextLikeEnd,};
static const CopyFromRoutine *CopyFromGetRoutine(const CopyFormatOptions *opts){ if (opts->csv_mode) return &CopyFromRoutineCSV; else if (opts->binary) return &CopyFromRoutineBinary; return &CopyFromRoutineText; /* default is text */}Text and CSV share the same start, infunc, and end callbacks and
differ only in the per-row callback (CopyFromTextOneRow vs.
CopyFromCSVOneRow), both thin wrappers around the inlined
CopyFromTextLikeOneRow(..., is_csv). Binary is wholly separate: it uses
the type receive functions instead of input functions, and its
start callback reads and validates the PGCOPY file header. The
CopyToRoutine side mirrors this exactly.
The load engine: CopyFrom
Section titled “The load engine: CopyFrom”CopyFrom is the heart of COPY FROM. After validating that the target
is a plain/foreign/partitioned table (or a view with an INSTEAD OF
trigger), it decides the insert method, sets up executor state and a
ResultRelInfo, installs the error-context callback, and then loops:
NextCopyFrom → optional soft-error skip → WHERE filter → partition
routing → triggers/constraints → insert (batched or single).
flowchart TD
A["DoCopy: COPY FROM"] --> B["BeginCopyFrom<br/>ProcessCopyOptions, open file/pipe/socket,<br/>look up input functions, alloc buffers"]
B --> C["CopyFrom: choose CopyInsertMethod<br/>CIM_SINGLE / CIM_MULTI / CIM_MULTI_CONDITIONAL"]
C --> D{"for each row"}
D --> E["NextCopyFrom<br/>parse one row into values[]/nulls[]"]
E --> F{"soft error?<br/>(ON_ERROR IGNORE)"}
F -->|yes| G["num_errors++, skip row,<br/>check REJECT_LIMIT"]
G --> D
F -->|no| H{"WHERE matches?"}
H -->|no| D
H -->|yes| I{"partitioned?"}
I -->|yes| J["ExecFindPartition,<br/>map root rowtype to leaf"]
I -->|no| K["BEFORE ROW triggers,<br/>constraints, gen cols"]
J --> K
K --> L{"multi-insert<br/>enabled?"}
L -->|yes| M["buffer slot<br/>(CopyMultiInsertInfoStore)"]
M --> N{"buffer full?"}
N -->|yes| O["CopyMultiInsertInfoFlush<br/>(table_multi_insert)"]
N -->|no| D
O --> D
L -->|no| P["table_tuple_insert,<br/>index tuples, AFTER triggers"]
P --> D
D -->|EOF| Q["flush remaining buffers,<br/>NOTICE skipped rows, cleanup"]
Multi-insert buffering
Section titled “Multi-insert buffering”When batching is allowed, parsed tuples are not inserted immediately;
they are materialized into slots held in a CopyMultiInsertBuffer and
flushed in bulk. The buffer is bounded two ways — a tuple count and a
byte count — so a load of wide rows flushes on bytes and a load of narrow
rows flushes on count:
// CopyMultiInsertInfoIsFull / limits — src/backend/commands/copyfrom.c#define MAX_BUFFERED_TUPLES 1000#define MAX_BUFFERED_BYTES 65535static inline boolCopyMultiInsertInfoIsFull(CopyMultiInsertInfo *miinfo){ if (miinfo->bufferedTuples >= MAX_BUFFERED_TUPLES || miinfo->bufferedBytes >= MAX_BUFFERED_BYTES) return true; return false;}For a partitioned target there is one buffer per partition, all tracked
in CopyMultiInsertInfo->multiInsertBuffers. To bound memory when a load
sprays across thousands of partitions, after each flush the list is
trimmed back to MAX_PARTITION_BUFFERS (32), evicting the
oldest-created buffers first — never the one currently in use. This is
the quadratic-memory guard called out in the source comment on
MAX_BUFFERED_TUPLES.
COPY FREEZE and the insert flags
Section titled “COPY FREEZE and the insert flags”COPY FROM builds up a ti_options bitmask. Two flags matter for bulk
loads. TABLE_INSERT_SKIP_FSM is set whenever the relation’s storage was
created in this (sub)transaction — there is no point probing the
free-space map for a brand-new relfilenode. TABLE_INSERT_FROZEN is the
FREEZE option: it writes tuples already marked frozen, but only after a
strict eligibility check that guarantees the rows can never become
visible to a transaction that should not see them.
// CopyFrom (FREEZE path) — src/backend/commands/copyfrom.cif (cstate->opts.freeze){ /* COPY FREEZE is disallowed on partitioned and foreign tables */ ... InvalidateCatalogSnapshot(); if (!ThereAreNoPriorRegisteredSnapshots() || !ThereAreNoReadyPortals()) ereport(ERROR, ... "cannot perform COPY FREEZE because of prior transaction activity"); if (cstate->rel->rd_createSubid != GetCurrentSubTransactionId() && cstate->rel->rd_newRelfilelocatorSubid != GetCurrentSubTransactionId()) ereport(ERROR, ... "the table was not created or truncated in the current subtransaction"); ti_options |= TABLE_INSERT_FROZEN;}The subtransaction test is “crucial for correctness,” as the comment
says: if this subxact aborts, the frozen rows vanish with the relfilenode,
so the MVCC anomaly FREEZE introduces (other sessions can see the rows
immediately) is confined to a relation that will not survive a rollback.
The heap-AM side of TABLE_INSERT_FROZEN is covered in
postgres-heap-am.md.
Soft error handling: ON_ERROR / REJECT_LIMIT / LOG_VERBOSITY
Section titled “Soft error handling: ON_ERROR / REJECT_LIMIT / LOG_VERBOSITY”By default a row that fails type input aborts the whole COPY
(COPY_ON_ERROR_STOP). With ON_ERROR ignore, BeginCopyFrom allocates
an ErrorSaveContext, and the per-row parser calls the safe variant of
the type input function, which records a soft error rather than throwing:
// CopyFromTextLikeOneRow (soft error path) — src/backend/commands/copyfromparse.celse if (!InputFunctionCallSafe(&in_functions[m], string, typioparams[m], att->atttypmod, (Node *) cstate->escontext, &values[m])){ Assert(cstate->opts.on_error != COPY_ON_ERROR_STOP); cstate->num_errors++; if (cstate->opts.log_verbosity == COPY_LOG_VERBOSITY_VERBOSE) ereport(NOTICE, errmsg("skipping row due to data type incompatibility ...")); return true; /* row returned with error_occurred set */}CopyFrom sees escontext->error_occurred, resets it, bumps the skipped
counter, and — if REJECT_LIMIT is exceeded — turns the soft error into a
hard ERROR. At end of load it emits a summary NOTICE of how many rows
were skipped (unless LOG_VERBOSITY silent). The whole mechanism is built
on the generic soft-error infrastructure described in
postgres-error-handling.md; COPY is its most prominent consumer.
The unload engine: DoCopyTo
Section titled “The unload engine: DoCopyTo”COPY TO is simpler. BeginCopyTo either takes a table directly or, for
COPY (query) TO, runs parse-analysis/rewrite/plan and wires a
DestCopyOut receiver onto the query. DoCopyTo then either table-scans
the relation, formatting each slot with CopyOneRowTo, or runs the plan
and lets the copy_dest_receive callback format each emitted tuple:
// DoCopyTo (table-scan path) — src/backend/commands/copyto.cscandesc = table_beginscan(cstate->rel, GetActiveSnapshot(), 0, NULL);slot = table_slot_create(cstate->rel, NULL);processed = 0;while (table_scan_getnextslot(scandesc, ForwardScanDirection, slot)){ CHECK_FOR_INTERRUPTS(); slot_getallattrs(slot); /* deconstruct the tuple */ CopyOneRowTo(cstate, slot); /* format and send */ pgstat_progress_update_param(PROGRESS_COPY_TUPLES_PROCESSED, ++processed);}Each row is formatted in a per-row memory context that is reset every
iteration, so the type output functions can leak freely without
unbounded growth. The detail of running an arbitrary SELECT/INSERT ... RETURNING through a DestReceiver connects to
postgres-ddl-execution.md and the executor docs.
The output side mirrors the input side’s vtable. CopyOneRowTo walks
the slot’s attributes, and for text/CSV each non-null value is run
through the type’s output function and then escaped by
CopyAttributeOutText (or CopyAttributeOutCSV in CSV mode), which is
the exact inverse of CopyReadAttributesText: control characters become
\n, \t, \r, the delimiter is backslash-escaped, and NULLs are
emitted as the null_print marker. The binary writer is symmetric to
the binary reader — it emits the PGCOPY signature once, then a
per-row int16 field count and length-prefixed bytea blobs from each
type’s send function:
// CopyToBinaryOneRow — src/backend/commands/copyto.cCopySendInt16(cstate, list_length(cstate->attnumlist)); /* field count */foreach_int(attnum, cstate->attnumlist){ Datum value = slot->tts_values[attnum - 1]; bool isnull = slot->tts_isnull[attnum - 1]; if (isnull) CopySendInt32(cstate, -1); /* NULL = length of -1 */ else { bytea *outputbytes = SendFunctionCall(&out_functions[attnum - 1], value); CopySendInt32(cstate, VARSIZE(outputbytes) - VARHDRSZ); /* length */ CopySendData(cstate, VARDATA(outputbytes), VARSIZE(outputbytes) - VARHDRSZ); }}CopySendEndOfRow(cstate);CopyToBinaryStart writes the fixed 19-byte header — an 11-byte
signature ("PGCOPY\n\377\r\n\0"), a 4-byte flags word, and a 4-byte
header-extension length — and CopyToBinaryEnd writes the -1
field-count trailer that the reader recognizes as the EOF marker. The
exact same BinarySignature constant gates the reader (next section), so
a binary dump produced by one server reloads on any other.
The text parse pipeline: four buffers, four stages
Section titled “The text parse pipeline: four buffers, four stages”The single densest piece of COPY is the load-side parser in
copyfromparse.c. Its file header lays out the staged pipeline
explicitly: raw bytes are read into raw_buf, transcoded into
input_buf, split into a single line in line_buf, and de-escaped into
per-field slices of attribute_buf. The clever optimization is that
when no encoding conversion is needed, raw_buf and input_buf are the
same buffer, so transcoding collapses to a validation pass:
flowchart TD
SRC["source<br/>file / PROGRAM pipe / client socket"] --> CGD["CopyGetData<br/>read bytes"]
CGD --> RAW["raw_buf<br/>(RAW_BUF_SIZE)"]
RAW --> CC{"need_transcoding?"}
CC -->|yes| CONV["CopyConvertBuf<br/>encoding conversion"]
CC -->|no| ALIAS["input_buf aliases raw_buf<br/>(validate only)"]
CONV --> INP["input_buf<br/>(server encoding)"]
ALIAS --> INP
INP --> CRL["CopyReadLineText<br/>honor quote/escape, find EOL"]
CRL --> LB["line_buf<br/>(one logical row)"]
LB --> FMT{"format?"}
FMT -->|text| CRAT["CopyReadAttributesText<br/>split on delim, de-escape"]
FMT -->|csv| CRAC["CopyReadAttributesCSV<br/>quote-aware split"]
CRAT --> AB["attribute_buf + raw_fields[]<br/>(de-escaped field slices)"]
CRAC --> AB
AB --> NCF["NextCopyFrom<br/>InputFunctionCall(Safe) per column"]
NCF --> VALS["values[] / nulls[]<br/>typed Datums for the slot"]
CopyLoadRawBuf is the bottom of the stack: it slides any unprocessed
bytes to the front of raw_buf, reads more via CopyGetData, and sets
raw_reached_eof when the source is drained. CopyConvertBuf runs the
encoding conversion routine only when need_transcoding is set;
otherwise the validation comment in its header notes the buffers are
aliased and it merely confirms the bytes are valid in the server
encoding.
// CopyLoadRawBuf — src/backend/commands/copyfromparse.cnbytes = RAW_BUF_BYTES(cstate);if (nbytes > 0 && cstate->raw_buf_index > 0) memmove(cstate->raw_buf, cstate->raw_buf + cstate->raw_buf_index, nbytes);cstate->raw_buf_len -= cstate->raw_buf_index;cstate->raw_buf_index = 0;/* Load more data */inbytes = CopyGetData(cstate, cstate->raw_buf + cstate->raw_buf_len, 1, RAW_BUF_SIZE - cstate->raw_buf_len);nbytes += inbytes;cstate->raw_buf[nbytes] = '\0';cstate->raw_buf_len = nbytes;if (inbytes == 0) cstate->raw_reached_eof = true;Finding line boundaries: CopyReadLineText
Section titled “Finding line boundaries: CopyReadLineText”CopyReadLineText is the state machine that converts a byte stream into
one logical row in line_buf. The subtlety is quoting: in CSV mode a
\r or \n inside a quoted field is data, not a row terminator, so
the loop carries in_quote/last_was_esc flags and only treats a
newline as end-of-line when not inside quotes. A second subtlety is
encoding-safety — because every supported server encoding has the
high bit set on all bytes of a multibyte character, the loop can scan
byte-by-byte without ever mistaking a continuation byte for a delimiter:
// CopyReadLineText — src/backend/commands/copyfromparse.cif (is_csv){ quotec = cstate->opts.quote[0]; escapec = cstate->opts.escape[0]; /* ignore special escape processing if it's the same as quotec */ if (quotec == escapec) escapec = '\0';}/* ... scan input_buf byte-by-byte, tracking in_quote, until an * unquoted \n / \r / \r\n (or the \. end-marker) is found, copying * the line — minus its terminator — into line_buf ... */Splitting a text line into fields: CopyReadAttributesText
Section titled “Splitting a text line into fields: CopyReadAttributesText”Once line_buf holds one row, CopyReadAttributesText walks it
left-to-right, splitting on the (unescaped) delimiter and de-escaping
backslash sequences (\t, \n, octal \013, hex \x1b) into a
parallel attribute_buf, recording each field as a pointer into that
buffer via raw_fields[]. A crucial ordering rule lives in its inner
loop’s comment: the scanner must locate the end of the field and test it
against the NULL marker before throwing any de-escaping syntax error,
so that a field equal to \N is read as SQL NULL rather than an invalid
escape:
// CopyReadAttributesText — src/backend/commands/copyfromparse.cfor (;;) /* outer loop over fields */{ start_ptr = cur_ptr; cstate->raw_fields[fieldno] = output_ptr; for (;;) /* inner loop scans + de-escapes one field */ { c = *cur_ptr++; if (c == delimc) { found_delim = true; break; } if (c == '\\') { /* handle \n \t \013 \x1b ... */ } else *output_ptr++ = c; } /* compare raw field to null_print BEFORE committing de-escaped bytes */}CopyReadAttributesCSV is the RFC-4180 variant: a field that opens with
the quote character is read verbatim (including embedded delimiters and
newlines) until a closing quote, with a doubled quote ("") collapsing
to one. Both functions return the field count, which NextCopyFrom
checks against the column list.
Binary parsing: framed, length-prefixed, header-validated
Section titled “Binary parsing: framed, length-prefixed, header-validated”The binary reader is the mirror image of the binary writer. The load
opens by validating the 19-byte header in ReceiveCopyBinaryHeader:
the 11-byte signature must memcmp-equal BinarySignature, the
flags word must not set the old WITH-OIDS bit, and any header extension
is skipped:
// ReceiveCopyBinaryHeader — src/backend/commands/copyfromparse.cstatic const char BinarySignature[11] = "PGCOPY\n\377\r\n\0";if (CopyReadBinaryData(cstate, readSig, 11) != 11 || memcmp(readSig, BinarySignature, 11) != 0) ereport(ERROR, ... "COPY file signature not recognized");if (!CopyGetInt32(cstate, &tmp)) ... ; /* flags */if ((tmp & (1 << 16)) != 0) ereport(ERROR, ... "invalid COPY file header (WITH OIDS)");Each row then begins with an int16 field count (-1 is the EOF
marker), and each field is read by CopyReadBinaryAttribute: a 4-byte
length (-1 = NULL), then that many bytes handed to the type’s
receive function. Two integrity checks make the format
self-describing — a negative non--1 length is rejected, and the
receive function must consume the whole buffer or the row is declared
malformed:
// CopyReadBinaryAttribute — src/backend/commands/copyfromparse.cif (!CopyGetInt32(cstate, &fld_size)) ereport(ERROR, ... "unexpected EOF in COPY data");if (fld_size == -1) { *isnull = true; return ReceiveFunctionCall(flinfo, NULL, ...); }if (fld_size < 0) ereport(ERROR, ... "invalid field size");/* load fld_size bytes into attribute_buf, then: */result = ReceiveFunctionCall(flinfo, &cstate->attribute_buf, typioparam, typmod);if (cstate->attribute_buf.cursor != cstate->attribute_buf.len) ereport(ERROR, ... "incorrect binary data format"); /* must eat whole field */Because binary fields carry no type names and no per-value text, the
binary path has no soft-error story: ON_ERROR IGNORE is rejected for
BINARY at option-parse time (only ON_ERROR STOP is legal), since a
length-framed stream that desynchronizes cannot be resynchronized at a
row boundary the way a text stream can at the next newline.
The insert-method decision, in full
Section titled “The insert-method decision, in full”CopyFrom resolves the CopyInsertMethod once, before the row loop,
through a cascade of disqualifying conditions — each one a case where a
buffered, not-yet-inserted tuple could be observed (or mis-handled) by
something running mid-load:
// CopyFrom (insert-method cascade) — src/backend/commands/copyfrom.cif (resultRelInfo->ri_TrigDesc != NULL && (resultRelInfo->ri_TrigDesc->trig_insert_before_row || resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) insertMethod = CIM_SINGLE; /* trigger may query the table */else if (resultRelInfo->ri_FdwRoutine != NULL && resultRelInfo->ri_BatchSize == 1) insertMethod = CIM_SINGLE; /* FDW can't batch */else if (cstate->volatile_defexprs) insertMethod = CIM_SINGLE; /* volatile DEFAULT may query table */else if (contain_volatile_functions(cstate->whereClause)) insertMethod = CIM_SINGLE; /* volatile WHERE */else insertMethod = proute ? CIM_MULTI_CONDITIONAL : CIM_MULTI;CIM_MULTI_CONDITIONAL is the partitioned-table compromise: the parent
qualifies for batching, but each leaf partition is only checked when the
first row routes to it (a leaf might be a foreign table or carry its own
BEFORE-ROW trigger), at which point leafpart_use_multi_insert is
decided per partition.
Flushing a batch: table_multi_insert and the index/AFTER-trigger fan-out
Section titled “Flushing a batch: table_multi_insert and the index/AFTER-trigger fan-out”When a buffer fills (CopyMultiInsertInfoIsFull) or the load ends,
CopyMultiInsertBufferFlush hands the whole slot array to the table AM’s
table_multi_insert in one call, then — because the AM only writes heap
tuples — loops over the now-inserted slots to maintain indexes and fire
AFTER-ROW triggers. The flush switches to a short-lived memory context
first, because table_multi_insert is documented to leak:
// CopyMultiInsertBufferFlush — src/backend/commands/copyfrom.coldcontext = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));table_multi_insert(resultRelInfo->ri_RelationDesc, slots, nused, mycid, ti_options, buffer->bistate);MemoryContextSwitchTo(oldcontext);for (i = 0; i < nused; i++){ if (resultRelInfo->ri_NumIndices > 0) /* ExecInsertIndexTuples for slot i */ ; /* ExecARInsertTriggers for slot i */}For a foreign target the same function instead drives
ExecForeignBatchInsert in ri_BatchSize-sized chunks — the FDW batch
path covered in postgres-fdw.md. The heap side of table_multi_insert
(one WAL record covering many tuples, heap_multi_insert) lives in
postgres-heap-am.md.
Source Walkthrough
Section titled “Source Walkthrough”This section indexes the stable symbols that make up COPY, grouped by
call-flow. Treat the symbol names as the durable anchors; the
(file, line) pairs in the closing table are hints valid as of the
updated: date.
Entry point and option parsing (copy.c)
Section titled “Entry point and option parsing (copy.c)”DoCopy— the SQL-level entry. Resolves privileges (pg_read_server_files/pg_write_server_files/pg_execute_server_programfor server-side file/program access), opens and locks the relation, applies the RLS rewrite to a query when needed, transforms theWHEREclause, and branches to the load or unload engine.ProcessCopyOptions— folds the grammar’sDefElemlist into a flatCopyFormatOptions, fills format-coupled defaults (tab vs comma delimiter,\Nvs empty NULL marker,"quote/escape in CSV), and enforces cross-option legality (e.g.ON_ERROR IGNOREforbidden inBINARY;REJECT_LIMITrequiresON_ERROR IGNORE).defGetCopyOnErrorChoice/defGetCopyLogVerbosityChoice/defGetCopyRejectLimit— keyword→enum mappers for the error-handling options.
Load engine and format vtable (copyfrom.c)
Section titled “Load engine and format vtable (copyfrom.c)”CopyFromGetRoutine— selectsCopyFromRoutineCSV,CopyFromRoutineBinary, orCopyFromRoutineTextfrom the resolved options.CopyFromText/CSV/BinaryStart/OneRow/Endcallbacks — the three vtables. Text and CSV share start/infunc/end and differ only in the per-row callback; binary uses receive functions and validates thePGCOPYheader inCopyFromBinaryStart→ReceiveCopyBinaryHeader.BeginCopyFrom— opens the source, looks up per-column input (or receive) functions andtypioparams, allocates the four parse buffers, evaluates default expressions, and — underON_ERROR IGNORE— allocates theErrorSaveContext.CopyFrom— the load driver. Decides theCopyInsertMethod, setsti_options(TABLE_INSERT_SKIP_FSM,TABLE_INSERT_FROZEN), installsCopyFromErrorCallback, and runs the per-row loop.CopyMultiInsertInfoInit/…SetupBuffer/…Store/…IsFull/…Flush/CopyMultiInsertBufferFlush— the batching subsystem;MAX_BUFFERED_TUPLES(1000),MAX_BUFFERED_BYTES(65535),MAX_PARTITION_BUFFERS(32) bound it.CopyFromErrorCallback— theerrcontextcallback that addsCOPY rel, line N, column C(and the offending value, in text mode) to every error raised mid-load.
Parse pipeline (copyfromparse.c)
Section titled “Parse pipeline (copyfromparse.c)”CopyGetData— lowest-level reader; abstracts file vs PROGRAM pipe vs frontend socket.CopyLoadRawBuf— fillsraw_buf, slides unprocessed bytes, setsraw_reached_eof.CopyConvertBuf— encoding conversionraw_buf → input_buf(or a validate-only pass when the buffers are aliased).CopyReadLine/CopyReadLineText— the quote-aware EOL state machine that fillsline_buf.CopyReadAttributesText/CopyReadAttributesCSV— split one line into de-escapedraw_fields[]slices ofattribute_buf.NextCopyFromRawFields/NextCopyFromRawFieldsInternal— public raw-field reader (also used by extensions); handlesHEADER/HEADER MATCHvalidation.NextCopyFrom— turns raw fields into typedvalues[]/nulls[]viaInputFunctionCallSafe(orExecEvalExprfor defaulted columns), routing soft errors to theErrorSaveContext.CopyFromTextLikeOneRow/CopyFromCSVOneRow/CopyFromBinaryOneRow— the per-row vtable callbacks.ReceiveCopyBinaryHeader/CopyReadBinaryAttribute/CopyGetInt16/CopyGetInt32/CopyReadBinaryData— the binary framing layer.
Unload engine (copyto.c)
Section titled “Unload engine (copyto.c)”CopyToGetRoutine— selects theCopyToRoutinevtable.BeginCopyTo— opens the destination; forCOPY (query) TOruns parse-analysis/rewrite/plan and wires aDestCopyOutreceiver.DoCopyTo— the unload driver: table-scan +CopyOneRowTo, or run-the-plan withcopy_dest_receive.CopyOneRowTo— formats one slot; calls the format’s per-row callback.CopyAttributeOutText/CopyAttributeOutCSV— the escaping inverse of the read-attributes functions.CopyToBinaryStart/CopyToBinaryOneRow/CopyToBinaryEnd/CopySendInt16/CopySendInt32/CopySendData— the binary writer;BinarySignatureheader, length-prefixed fields,-1trailer.
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
DoCopy | src/backend/commands/copy.c | 62 |
ProcessCopyOptions | src/backend/commands/copy.c | 536 |
CopyFromGetRoutine | src/backend/commands/copyfrom.c | 156 |
CopyFromBinaryStart | src/backend/commands/copyfrom.c | 221 |
CopyFromErrorCallback | src/backend/commands/copyfrom.c | 254 |
CopyMultiInsertInfoSetupBuffer | src/backend/commands/copyfrom.c | 380 |
CopyMultiInsertInfoIsFull | src/backend/commands/copyfrom.c | 425 |
CopyMultiInsertBufferFlush | src/backend/commands/copyfrom.c | 446 |
CopyMultiInsertInfoFlush | src/backend/commands/copyfrom.c | 662 |
CopyMultiInsertInfoStore | src/backend/commands/copyfrom.c | 756 |
CopyFrom | src/backend/commands/copyfrom.c | 779 |
BeginCopyFrom | src/backend/commands/copyfrom.c | 1529 |
MAX_BUFFERED_TUPLES (=1000) | src/backend/commands/copyfrom.c | 63 |
MAX_BUFFERED_BYTES (=65535) | src/backend/commands/copyfrom.c | 69 |
MAX_PARTITION_BUFFERS (=32) | src/backend/commands/copyfrom.c | 75 |
| `ti_options | = TABLE_INSERT_SKIP_FSM` | src/backend/commands/copyfrom.c |
| `ti_options | = TABLE_INSERT_FROZEN` | src/backend/commands/copyfrom.c |
BinarySignature | src/backend/commands/copyfromparse.c | 139 |
ReceiveCopyBinaryHeader | src/backend/commands/copyfromparse.c | 190 |
CopyConvertBuf | src/backend/commands/copyfromparse.c | 400 |
CopyLoadRawBuf | src/backend/commands/copyfromparse.c | 590 |
NextCopyFromRawFields | src/backend/commands/copyfromparse.c | 747 |
NextCopyFrom | src/backend/commands/copyfromparse.c | 871 |
CopyFromTextLikeOneRow | src/backend/commands/copyfromparse.c | 937 |
CopyFromBinaryOneRow | src/backend/commands/copyfromparse.c | 1086 |
CopyReadLineText | src/backend/commands/copyfromparse.c | 1234 |
CopyReadAttributesText | src/backend/commands/copyfromparse.c | 1564 |
CopyReadAttributesCSV | src/backend/commands/copyfromparse.c | 1818 |
CopyReadBinaryAttribute | src/backend/commands/copyfromparse.c | 2013 |
CopyToBinaryStart | src/backend/commands/copyto.c | 314 |
CopyToBinaryOneRow | src/backend/commands/copyto.c | 345 |
BeginCopyTo | src/backend/commands/copyto.c | 623 |
DoCopyTo | src/backend/commands/copyto.c | 1026 |
CopyOneRowTo | src/backend/commands/copyto.c | 1122 |
CopyAttributeOutText | src/backend/commands/copyto.c | 1147 |
CopyAttributeOutCSV | src/backend/commands/copyto.c | 1300 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verified facts
Section titled “Verified facts”-
DoCopydispatches to exactly two engines and applies the RLS rewrite/refusal. Verified inDoCopy(copy.c):is_from→BeginCopyFrom/CopyFrom/EndCopyFrom, elseBeginCopyTo/DoCopyTo/EndCopyTo. Server-side file/program access is gated on thepg_read_server_files/pg_write_server_files/pg_execute_server_programroles;COPY rel TOunder RLS is rewritten toCOPY (SELECT …) TOandCOPY FROMunder RLS is refused. -
Three formats are dispatched through a
static constCopyFromRoutine/CopyToRoutinevtable. Verified inCopyFromGetRoutine(csv → CSV routine, binary → Binary routine, else Text) and the symmetricCopyToGetRoutine. Text and CSV share start/infunc/end; only the per-row callback differs. -
The text load runs a four-buffer pipeline
raw_buf → input_buf → line_buf → attribute_buf, and the first two alias when no transcoding is needed. Verified against thecopyfromparse.cfile header andCopyLoadRawBuf(theraw_buf == input_bufaliasing assertions) andCopyConvertBuf. -
Multi-insert batching is bounded by both a tuple count and a byte count, and partition buffers are capped. Verified:
MAX_BUFFERED_TUPLES= 1000,MAX_BUFFERED_BYTES= 65535,MAX_PARTITION_BUFFERS= 32, withCopyMultiInsertInfoIsFulltesting the first two andCopyMultiInsertInfoFlushtrimming the buffer list to the third. -
The insert method is chosen once via a disqualification cascade. Verified in
CopyFrom: BEFORE/INSTEAD-OF row triggers, a non-batching FDW (ri_BatchSize == 1), a partitioned statement-level insert trigger, volatile default expressions, or a volatileWHEREeach forceCIM_SINGLE; a partitioned target otherwise getsCIM_MULTI_CONDITIONAL, a plain targetCIM_MULTI. -
COPY FREEZE sets
TABLE_INSERT_FROZENonly after a prior-activity check and a same-subtransaction check. Verified inCopyFrom:InvalidateCatalogSnapshot(), theThereAreNoPriorRegisteredSnapshots() / ThereAreNoReadyPortals()guard, and therd_createSubid/rd_newRelfilelocatorSubid == GetCurrentSubTransactionId()test, all gatingti_options |= TABLE_INSERT_FROZEN.TABLE_INSERT_SKIP_FSMis set independently when the relfilenode is new in this subxact. -
Soft errors flow through
InputFunctionCallSafe+ anErrorSaveContext, are counted innum_errors, and onlyON_ERROR STOPis legal forBINARY. Verified inCopyFromTextLikeOneRow(theInputFunctionCallSafe(...)soft-error branch that bumpsnum_errorsand conditionally emits theVERBOSENOTICE) and the option checks inProcessCopyOptions. The binary per-row callbackCopyFromBinaryOneRowhas no soft-error branch. -
The binary format is header-validated and length-framed. Verified in
ReceiveCopyBinaryHeader(theBinarySignaturememcmp, the WITH-OIDS flag rejection),CopyReadBinaryAttribute(-1length = NULL; the “incorrect binary data format” check that the receive function consumed the whole field), and the symmetric writerCopyToBinaryStart/CopyToBinaryOneRow/CopyToBinaryEnd.
Open questions
Section titled “Open questions”-
The practical ceiling on
MAX_PARTITION_BUFFERSchurn. A COPY that round-robins across thousands of partitions evicts and rebuilds multi-insert buffers repeatedly (32-buffer cap, oldest-first eviction). The amortized cost of buffer setup/teardown under an adversarial partition-key ordering is not obvious from the code; investigation path is to instrumentCopyMultiInsertInfoFlush’s trim loop under a synthetic interleaved load. -
Interaction of
COPY FREEZEwith parallel-safe table AMs. The freeze eligibility check is heap-AM-shaped (same-subxact relfilenode); how a future custom table AM that does not establish visibility via the heap’st_infomaskfrozen bits would honorTABLE_INSERT_FROZENis left topostgres-table-am.md.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”PostgreSQL’s COPY sits at a deliberately modest point in the bulk-load design space: a single-process, row-at-a-time parser feeding a batched inserter, with no parallelism inside one COPY statement. The comparative landscape is wide.
External bulk loaders and the \copy client variant. The
psql-side \copy is not a server feature at all — it runs COPY … FROM STDIN and streams a client-local file over the protocol, which
is why it needs no server-file privilege. Tools like pg_bulkload
(out of core, not covered here) bypass the buffer manager and WAL by
writing heap pages directly, trading crash-safety for speed — the same
trade other systems expose as a “direct path” load.
Direct-path and minimally-logged loads elsewhere. Oracle’s
SQL*Loader direct path and SQL Server’s BULK INSERT with
TABLOCK + minimal logging both formalize what COPY FREEZE only
gestures at: under an exclusive table lock on a freshly created or
empty object, the loader writes formatted pages straight to the data
files and emits only an allocation-level redo record rather than a
per-row log. PostgreSQL’s WAL design (full-tuple logging unless
wal_level = minimal lets a same-transaction-created relation skip WAL
entirely) reaches a similar place by a different route; the
wal_level = minimal optimization is the closest analogue, and
COPY FREEZE layers visibility-skipping on top.
Parallel and distributed ingest. Greenplum and Citus shard the
input and run one loader per segment/worker; cloud warehouses
(Snowflake, BigQuery, Redshift COPY) parse many input files
concurrently across compute nodes and commit a manifest. PostgreSQL
core has no intra-statement COPY parallelism — the parser is a serial
state machine — though the staged-buffer design (the four-buffer
pipeline) is exactly the shape one would pipeline across threads, and
the format vtable is exactly the seam at which a parallel parser would
plug in. The relevant theory anchor is the classic observation in the
Architecture of a Database System survey (Hellerstein, Stonebraker,
Hamilton; captured in dbms-papers/fntdb07-architecture.md) that
admission and loading are throughput-bound on CPU for parse/transcode,
which is why every serious warehouse parallelizes the parser, not the
inserter.
The format vtable as an extension seam. Recent PostgreSQL versions
made CopyFromRoutine / CopyToRoutine an extensible surface
(copyapi.h), so an extension can register a custom COPY format —
Arrow, Parquet-row, JSON-lines — that reuses the entire row-handling
core (partition routing, batching, FREEZE, soft errors) and supplies
only start/infunc/onerow/end. This is the same plug-point philosophy
as the table AM (postgres-table-am.md) and the index AM
(postgres-index-am.md): a narrow vtable that lets out-of-core code
ride the in-core machinery. It is the single most consequential recent
change in the COPY subsystem’s shape.
Error-tolerant ingest as a first-class feature. ON_ERROR IGNORE
REJECT_LIMIT+LOG_VERBOSITYbrings PostgreSQL toward what warehouses have long offered (ON_ERROR = CONTINUE/MAXERROR/ reject tables). The PostgreSQL implementation is notable for reusing the generic soft-error infrastructure (InputFunctionCallSafe+ErrorSaveContext, seepostgres-error-handling.md) rather than building a COPY-specific error path — COPY is simply that machinery’s most prominent consumer. A natural frontier is a reject-rows-to-a-table sink (currently the rows are only counted and optionally logged, not captured), which theErrorSaveContextalready has enough structure to support.
Sources
Section titled “Sources”- PostgreSQL source (REL_18_STABLE, commit 273fe94, as of
2026-06-05):
src/backend/commands/copy.c—DoCopy,ProcessCopyOptions, option enum mappers, privilege/RLS handling.src/backend/commands/copyfrom.c—CopyFrom, theCopyFromRoutinevtables,BeginCopyFrom, theCopyMultiInsert*batching subsystem, COPY FREEZE eligibility.src/backend/commands/copyfromparse.c— the four-buffer parse pipeline (CopyLoadRawBuf,CopyConvertBuf,CopyReadLineText,CopyReadAttributesText/CSV,NextCopyFrom), and the binary framing (ReceiveCopyBinaryHeader,CopyReadBinaryAttribute).src/backend/commands/copyto.c—DoCopyTo,BeginCopyTo,CopyOneRowTo,CopyAttributeOutText/CSV, the binary writer.src/include/commands/copyfrom_internal.h,src/include/commands/copyapi.h,src/include/commands/copy.h— theCopyFromState/CopyToStatestructs, the format-routine vtable types, and the publicCopyFormatOptions.
- Theory / textbook anchors:
- Database System Concepts (Silberschatz, Korth, Sudarshan) — storage and physical-organization cost model for per-row vs bulk loading.
- Architecture of a Database System (Hellerstein, Stonebraker,
Hamilton;
knowledge/research/dbms-papers/fntdb07-architecture.md) — process model and the CPU-bound nature of parse/admission.
- Cross-references within this KB:
postgres-heap-am.md—table_multi_insert/heap_multi_insert,TABLE_INSERT_FROZENheap semantics.postgres-ddl-execution.md— runningCOPY (query) TOthrough aDestReceiver; partition routing setup.postgres-error-handling.md—ErrorSaveContext/InputFunctionCallSafesoft-error infrastructure.postgres-fdw.md—ExecForeignBatchInsertbatch path.postgres-table-am.md,postgres-index-am.md— the vtable / extension-seam design pattern COPY’s format routines follow.