PostgreSQL Triggers — Definition, Firing Points, and the After-Trigger Queue
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A trigger is a database object that ties a piece of procedural code to a data-modification event so that the code fires automatically and as part of the same transaction as the event. Database System Concepts (Silberschatz, 7e, ch. 5 “Advanced SQL”, §5.3 “Triggers”) defines it as “a statement that the system executes automatically as a side effect of a modification to the database,” and names the two design decisions that every trigger facility must make explicit:
- The event and condition that cause the trigger to be executed. SQL triggers specify the triggering event (INSERT, UPDATE, DELETE), optionally a column list for UPDATE, and an optional WHEN condition that is checked before the body runs (§5.3.1). Only when the event occurs and the condition holds does the action fire: this is the textbook’s event-condition-action (ECA) model.
- The actions to be taken when the trigger executes, together with two orthogonal axes that decide how often and when the action runs:
  - Granularity. A FOR EACH ROW trigger fires once per affected row and can see the row’s OLD/NEW images; a FOR EACH STATEMENT trigger fires once for the whole statement regardless of how many rows it touched (§5.3.1, “row-level” vs. “statement-level”).
  - Timing. A BEFORE trigger runs before the change is applied, and can therefore inspect or alter the proposed row, or cancel the operation; an AFTER trigger runs after the change, when the new state is visible and final (§5.3.1).

The textbook is also blunt about the hazards, and these hazards are exactly what the implementation must defend against. “Triggers can be used to implement certain integrity constraints” and to maintain derived data such as materialized aggregates, but “triggers need to be written with great care, since a trigger error detected at runtime causes the failure of the … statement that set off the trigger” (§5.3.2). The two named pitfalls are cascading / non-termination (a trigger whose action fires further triggers, possibly without bound) and unintended ordering, where the result depends on the order in which multiple triggers on the same event run. The SQL standard calls for time-of-creation order, but engines do not follow it uniformly (PostgreSQL, for one, fires in name order), so each engine must pick and document a firing order.
There is a third concept the standard adds and that the textbook treats more
briefly: transition tables (REFERENCING OLD TABLE AS / NEW TABLE AS). A statement-level AFTER trigger can ask to see the entire set of
rows the statement changed, as two read-only relations, rather than firing
per row. This turns a per-row visitor into a set-oriented one — the
difference between an O(rows) cascade of function calls and a single call
that can issue one set-based SQL statement over the delta. It is the
construct that makes statement-level triggers useful for bulk integrity
maintenance.
Finally, the textbook situates triggers against declarative integrity
constraints: “In many cases it is preferable to use … features [foreign
keys, check constraints] rather than triggers” because the system can reason
about declarative constraints, whereas a trigger is opaque procedural code
(§5.3.3). The deep implementation consequence, which PostgreSQL makes
literal, is that constraints are themselves implemented as triggers: a
foreign key is a set of internal AFTER triggers (on both the referencing
and the referenced table), and DEFERRABLE constraints are exactly
deferrable AFTER triggers. So the trigger
machinery is not a peripheral feature; it is the substrate on which
referential integrity rides, which raises the engineering bar for its queue
discipline, ordering, and transaction integration.
Common DBMS Design
The ECA model gives the semantics; production engines converge on a small set of engineering conventions to make those semantics fast, ordered, and transaction-safe. PostgreSQL’s specific choices in the next section are best read as one point in this shared design space.
1. Catalog the trigger; compile a per-relation dispatch summary. A trigger definition is metadata — a catalog row naming a function, an event mask, a timing/level flag, and an optional condition. But consulting the catalog on every row would be ruinous, so engines build a cached, per-relation descriptor listing the relation’s triggers, attached to the in-memory table descriptor and invalidated when the catalog changes. Crucially, the descriptor carries summary booleans (“does this table have any BEFORE-INSERT-ROW trigger at all?”) so the hot path can branch out in one test when a table has no triggers of the relevant kind — the common case.
2. A fixed set of firing points wired into the DML executor. The engine does not “scan for triggers” at arbitrary times; it calls a fixed family of hooks at hard-coded points in its insert/update/delete code — one hook per (timing × event × level) combination. Each hook is a no-op (one boolean test) when the summary flag is clear. This keeps the trigger subsystem pluggable into the executor rather than entangled with it.
3. BEFORE-row triggers are synchronous and may transform the tuple.
Because a BEFORE FOR EACH ROW trigger can rewrite NEW or veto the
operation, its hook must run inline, take the candidate tuple, and return a
(possibly different, possibly null) tuple that the executor then proceeds to
store. This is a pipeline transform, not a queued event.
4. AFTER triggers are deferred onto a queue. An AFTER trigger must see
the final post-modification state and, for DEFERRABLE constraints, must
possibly wait until commit. So the AFTER hook does not run the function; it
records an event — minimally, which trigger and which row — onto a
queue, and a later drain phase fires the queued events in order. The two
hard problems this creates are (a) compactly identifying the row so the
queue does not balloon (engines store a row identifier / tuple pointer, not
a tuple copy), and (b) re-locating the row at fire time under the right
visibility, since the heap may have changed.
5. Deterministic firing order. Because engines do not uniformly follow the standard’s time-of-creation rule, each picks a rule and documents it. One such rule, the one PostgreSQL uses, is alphabetical by trigger name within a (timing, event) class; it is arbitrary but stable and inspectable, which is what users actually need.
6. Transaction and subtransaction integration. The AFTER queue is transaction-scoped: it must survive across the statements of a transaction (for deferred constraints), be drained at commit, be discarded at abort, and roll back partially on subtransaction abort — events queued by a savepoint that rolls back must vanish, while earlier events survive. This demands that the queue’s structure make “truncate back to a saved position” cheap.
7. Re-entrancy and recursion control. A trigger function can issue DML that fires more triggers. The queue drain must therefore be a loop (“fire; new events may appear; fire again until empty”), and the engine tracks nesting depth to bound runaway recursion and to integrate with statement timeouts.
PostgreSQL implements every one of these conventions, and the rest of this
document is essentially a tour of how: pg_trigger + TriggerDesc for
(1) and (5); the ExecBR/AR/IR/BS/AS family for (2); ExecBRInsertTriggers
returning a tuple for (3); AfterTriggerSaveEvent + the chunked
AfterTriggerEventList for (4) and (6); and MyTriggerDepth plus the
firing loop for (7).
PostgreSQL’s Approach
A trigger is a pg_trigger row, compiled into a relcache TriggerDesc
The persistent form of a trigger is one row in the pg_trigger system
catalog. CREATE TRIGGER (CreateTriggerFiringOn in trigger.c) parses
the statement, validates it, looks up the function, and inserts that row.
The single most important field is tgtype — a packed int16 bitmask that
encodes timing, level, and events together, using the bits defined in
pg_trigger.h:
```c
// tgtype bit layout (src/include/catalog/pg_trigger.h)
#define TRIGGER_TYPE_ROW        (1 << 0)    /* else STATEMENT */
#define TRIGGER_TYPE_BEFORE     (1 << 1)    /* else AFTER (=0) */
#define TRIGGER_TYPE_INSERT     (1 << 2)
#define TRIGGER_TYPE_DELETE     (1 << 3)
#define TRIGGER_TYPE_UPDATE     (1 << 4)
#define TRIGGER_TYPE_TRUNCATE   (1 << 5)
#define TRIGGER_TYPE_INSTEAD    (1 << 6)    /* INSTEAD OF, views only */
```
Note two encodings that trip readers up. First, AFTER and STATEMENT are
not bits — they are the absence of TRIGGER_TYPE_BEFORE/INSTEAD and
TRIGGER_TYPE_ROW respectively. Second, a pg_trigger row may carry
several event bits at once (INSERT OR UPDATE), unlike a runtime
TriggerEvent, which always names exactly one operation (see the comment in
trigger.h warning that the two representations differ).
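As a minimal sketch of how the packed encoding reads back out, the helpers below decode timing, level, and event count from a tgtype value. The bit values are copied from the listing above; the sketch_* function names are hypothetical and are not PostgreSQL functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit values as listed for pg_trigger.h in the text. */
#define TRIGGER_TYPE_ROW      (1 << 0)
#define TRIGGER_TYPE_BEFORE   (1 << 1)
#define TRIGGER_TYPE_INSERT   (1 << 2)
#define TRIGGER_TYPE_DELETE   (1 << 3)
#define TRIGGER_TYPE_UPDATE   (1 << 4)
#define TRIGGER_TYPE_TRUNCATE (1 << 5)
#define TRIGGER_TYPE_INSTEAD  (1 << 6)

/* Hypothetical helper: AFTER is the absence of both timing bits. */
const char *sketch_timing(int16_t tgtype)
{
    /* BEFORE and INSTEAD OF are mutually exclusive timings. */
    assert(!((tgtype & TRIGGER_TYPE_BEFORE) && (tgtype & TRIGGER_TYPE_INSTEAD)));
    if (tgtype & TRIGGER_TYPE_BEFORE)
        return "BEFORE";
    if (tgtype & TRIGGER_TYPE_INSTEAD)
        return "INSTEAD OF";
    return "AFTER";
}

/* Hypothetical helper: STATEMENT is the absence of the ROW bit. */
const char *sketch_level(int16_t tgtype)
{
    return (tgtype & TRIGGER_TYPE_ROW) ? "ROW" : "STATEMENT";
}

/* A catalog row may carry several event bits at once (INSERT OR UPDATE). */
int sketch_event_count(int16_t tgtype)
{
    int n = 0;

    if (tgtype & TRIGGER_TYPE_INSERT)
        n++;
    if (tgtype & TRIGGER_TYPE_DELETE)
        n++;
    if (tgtype & TRIGGER_TYPE_UPDATE)
        n++;
    if (tgtype & TRIGGER_TYPE_TRUNCATE)
        n++;
    return n;
}
```

A value such as TRIGGER_TYPE_ROW | TRIGGER_TYPE_INSERT | TRIGGER_TYPE_UPDATE therefore decodes as an AFTER ROW trigger on two events, even though no bit says "AFTER".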
When a backend first touches a relation, the relcache builds a TriggerDesc
by scanning pg_trigger for that table. RelationBuildTriggers reads the
rows in name order (it scans via TriggerRelidNameIndexId), which is
how PostgreSQL realizes the “alphabetical firing order” convention — the
array order is the firing order:
```c
// RelationBuildTriggers (src/backend/commands/trigger.c)
/*
 * Note: since we scan the triggers using TriggerRelidNameIndexId, we will
 * be reading the triggers in name order ... This in turn
 * ensures that triggers will be fired in name order.
 */
ScanKeyInit(&skey,
            Anum_pg_trigger_tgrelid,
            BTEqualStrategyNumber, F_OIDEQ,
            ObjectIdGetDatum(RelationGetRelid(relation)));

tgrel = table_open(TriggerRelationId, AccessShareLock);
tgscan = systable_beginscan(tgrel, TriggerRelidNameIndexId, true,
                            NULL, 1, &skey);

while (HeapTupleIsValid(htup = systable_getnext(tgscan)))
{
    Form_pg_trigger pg_trigger = (Form_pg_trigger) GETSTRUCT(htup);

    /* ... copy tgname, tgfoid, tgtype, tgenabled, tgdeferrable, ... into build ... */
}
```
The resulting in-memory shape is two structs in reltrigger.h. Trigger is
one trigger (mostly a copy of the pg_trigger row plus the resolved OID);
TriggerDesc is the per-relation array plus a wall of summary booleans:
```c
// TriggerDesc (src/include/utils/reltrigger.h)
typedef struct TriggerDesc
{
    Trigger    *triggers;       /* array, in name order */
    int         numtriggers;

    bool        trig_insert_before_row; /* one flag per (event,timing,level) */
    bool        trig_insert_after_row;
    bool        trig_insert_instead_row;
    bool        trig_insert_before_statement;
    bool        trig_insert_after_statement;
    /* ... update_*, delete_*, truncate_* ... */

    bool        trig_insert_new_table;  /* any NEW TABLE transition table? */
    bool        trig_update_old_table;
    bool        trig_update_new_table;
    bool        trig_delete_old_table;
} TriggerDesc;
```
Those flags are filled by SetTriggerFlags, which ORs each trigger’s
classification into the descriptor. The whole point is the negative test:
the executor’s firing hooks begin with if (!trigdesc->trig_insert_before_row) return;, so a table with no triggers of a given class pays a single boolean
load, never an array walk.
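The negative test can be illustrated with a small stand-alone sketch. Everything named Sketch* or SK_* here is hypothetical: the descriptor holds one summary flag, the classification test is a simplified stand-in for TRIGGER_TYPE_MATCHES, and the walks counter exists only to make the skipped array walk observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bit positions follow the tgtype layout quoted earlier. */
#define SK_ROW     (1 << 0)
#define SK_BEFORE  (1 << 1)
#define SK_INSERT  (1 << 2)
#define SK_INSTEAD (1 << 6)

typedef struct SketchTriggerDesc {
    const int *tgtypes;           /* one packed type per trigger */
    int        numtriggers;
    bool       insert_before_row; /* summary flag */
    int        walks;             /* demo only: counts array walks */
} SketchTriggerDesc;

/* Simplified stand-in for the catalog's type-matching macro. */
bool sketch_is_before_insert_row(int tgtype)
{
    int mask = SK_ROW | SK_BEFORE | SK_INSTEAD | SK_INSERT;

    return (tgtype & mask) == (SK_ROW | SK_BEFORE | SK_INSERT);
}

/* Build step: OR every trigger's classification into the summary flag. */
void sketch_set_flags(SketchTriggerDesc *desc)
{
    assert(desc != NULL);
    desc->insert_before_row = false;
    for (int i = 0; i < desc->numtriggers; i++)
        desc->insert_before_row |= sketch_is_before_insert_row(desc->tgtypes[i]);
}

/* Hot path: returns how many triggers would fire for one inserted row. */
int sketch_fire_before_insert_row(SketchTriggerDesc *desc)
{
    int fired = 0;

    if (!desc->insert_before_row)
        return 0;               /* one boolean test, no array walk */

    desc->walks++;
    for (int i = 0; i < desc->numtriggers; i++)
        if (sketch_is_before_insert_row(desc->tgtypes[i]))
            fired++;
    return fired;
}
```

The cost of classification is paid once when the descriptor is built, so a table whose triggers are all of some other class never touches the array on the insert path.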
```c
// SetTriggerFlags (src/backend/commands/trigger.c)
trigdesc->trig_insert_before_row |=
    TRIGGER_TYPE_MATCHES(tgtype, TRIGGER_TYPE_ROW,
                         TRIGGER_TYPE_BEFORE, TRIGGER_TYPE_INSERT);
trigdesc->trig_insert_after_row |=
    TRIGGER_TYPE_MATCHES(tgtype, TRIGGER_TYPE_ROW,
                         TRIGGER_TYPE_AFTER, TRIGGER_TYPE_INSERT);
/* ... and so on for every (event, timing, level) combination ... */
trigdesc->trig_insert_new_table |=
    (TRIGGER_FOR_INSERT(tgtype) &&
     TRIGGER_USES_TRANSITION_TABLE(trigger->tgnewtable));
```

The firing-point family: one hook per (timing, level, event)
At runtime, a trigger does not “watch” for events. The executor
(overwhelmingly nodeModifyTable.c, plus COPY, ExecuteTruncate, and the
RI code) calls a hook at the exact moment a tuple operation happens. The
hooks form a regular grid, named Exec + timing + level + event + Triggers:
| timing \ event | INSERT | UPDATE | DELETE |
|---|---|---|---|
| BEFORE statement | ExecBSInsertTriggers | ExecBSUpdateTriggers | ExecBSDeleteTriggers |
| BEFORE row | ExecBRInsertTriggers | ExecBRUpdateTriggers | ExecBRDeleteTriggers |
| INSTEAD OF row | ExecIRInsertTriggers | ExecIRUpdateTriggers | ExecIRDeleteTriggers |
| AFTER row | ExecARInsertTriggers | ExecARUpdateTriggers | ExecARDeleteTriggers |
| AFTER statement | ExecASInsertTriggers | ExecASUpdateTriggers | ExecASDeleteTriggers |
(B=before, A=after, S=statement, R=row, I=instead.) TRUNCATE adds
only ExecBSTruncateTriggers / ExecASTruncateTriggers — there is no
row-level truncate trigger, as the TriggerDesc comment notes. The contract
differs sharply by timing, and that difference is the heart of the design:
```mermaid
flowchart TD
  subgraph row["per affected row, inside nodeModifyTable"]
    BR["ExecBRInsertTriggers<br/>runs function NOW<br/>returns tuple or NULL"]
    STORE["heap_insert / table_tuple_update<br/>apply the change"]
    AR["ExecARInsertTriggers<br/>does NOT run function<br/>queues an event"]
  end
  BR -->|"NULL = skip this row"| SKIP["row discarded"]
  BR -->|"tuple"| STORE
  STORE --> AR
  AR --> SAVE["AfterTriggerSaveEvent<br/>append AfterTriggerEventData"]
  SAVE --> Q[("per-query<br/>AfterTriggerEventList")]
  Q -.->|"AfterTriggerEndQuery"| FIRE["afterTriggerInvokeEvents<br/>fire immediate-mode events"]
  Q -.->|"deferrable events<br/>moved to xact list"| DEF["AfterTriggerFireDeferred<br/>at commit"]
```
BEFORE-row hooks run the function synchronously and transform the tuple.
ExecBRInsertTriggers walks the trigger array, and for each matching+enabled
trigger calls ExecCallTriggerFunc; the returned tuple becomes the input to
the next trigger, so triggers chain. A NULL return means “skip this row”;
a different (non-NULL) tuple is stored back into the slot:
```c
// ExecBRInsertTriggers (src/backend/commands/trigger.c)
newtuple = ExecCallTriggerFunc(&LocTriggerData,
                               i,
                               relinfo->ri_TrigFunctions,
                               relinfo->ri_TrigInstrument,
                               GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
    if (should_free)
        heap_freetuple(oldtuple);
    return false;               /* "do nothing": skip this row */
}
else if (newtuple != oldtuple)
{
    newtuple = check_modified_virtual_generated(RelationGetDescr(...), newtuple);
    ExecForceStoreHeapTuple(newtuple, slot, false); /* trigger rewrote NEW */
    /* ... partition-fit recheck for cloned triggers ... */
}
```
AFTER-row hooks do almost nothing at fire time. ExecARInsertTriggers
is a thin guard that, if any after-row trigger or transition table is in
play, calls AfterTriggerSaveEvent — and returns. No user code runs here:
```c
// ExecARInsertTriggers (src/backend/commands/trigger.c)
if ((trigdesc && trigdesc->trig_insert_after_row) ||
    (transition_capture && transition_capture->tcs_insert_new_table))
    AfterTriggerSaveEvent(estate, relinfo, NULL, NULL,
                          TRIGGER_EVENT_INSERT,
                          true /* row_trigger */ ,
                          NULL, slot,
                          recheckIndexes, NULL,
                          transition_capture,
                          false);
```
Statement-level AFTER hooks are similar (ExecASInsertTriggers calls
AfterTriggerSaveEvent once with row_trigger = false), while
statement-level BEFORE hooks (ExecBSInsertTriggers) run synchronously like
BEFORE-row but cannot return a value (a BEFORE STATEMENT trigger returning
non-NULL is an error). INSTEAD-OF hooks (ExecIR*, views only) run
synchronously and replace the operation entirely.
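The BEFORE-row contract described earlier (each trigger's result feeds the next, and NULL vetoes the row) can be sketched without any PostgreSQL types. SketchRow, the function-pointer type, and the two sample trigger bodies are hypothetical illustrations, not executor code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct SketchRow {
    int x;
    int stamped;
} SketchRow;

/* A BEFORE-row trigger body: returns the row to keep, or NULL to skip it. */
typedef SketchRow *(*SketchBeforeRowFn)(SketchRow *candidate);

/*
 * Run the triggers in array order. Each result becomes the next input,
 * so the triggers chain; the first NULL ends processing of this row.
 */
SketchRow *sketch_run_before_row(SketchBeforeRowFn *fns, int nfns, SketchRow *row)
{
    assert(row != NULL);
    for (int i = 0; i < nfns; i++)
    {
        row = fns[i](row);
        if (row == NULL)
            return NULL;        /* "do nothing": caller skips the store */
    }
    return row;                 /* caller stores this (possibly rewritten) row */
}

/* Sample trigger: rewrites the candidate row in place. */
SketchRow *sketch_stamp(SketchRow *r)
{
    r->stamped = 1;
    return r;
}

/* Sample trigger: vetoes rows with a negative x. */
SketchRow *sketch_reject_negative(SketchRow *r)
{
    return (r->x < 0) ? NULL : r;
}
```

With the chain { sketch_stamp, sketch_reject_negative }, a row with x = 5 comes back stamped and is stored, while a row with x = -1 is dropped before the storage layer ever sees it.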
The single chokepoint through which every trigger function is actually
invoked is ExecCallTriggerFunc, which switches into the per-tuple memory
context, sets up the fcinfo with the TriggerData as fmgr context,
bumps MyTriggerDepth (recursion accounting), and calls through fmgr:
```c
// ExecCallTriggerFunc (src/backend/commands/trigger.c)
oldContext = MemoryContextSwitchTo(per_tuple_context);
InitFunctionCallInfoData(*fcinfo, finfo, 0, InvalidOid,
                         (Node *) trigdata, NULL);
pgstat_init_function_usage(fcinfo, &fcusage);

MyTriggerDepth++;
PG_TRY();
{
    result = FunctionCallInvoke(fcinfo);
}
PG_FINALLY();
{
    MyTriggerDepth--;
}
PG_END_TRY();
```
The function receives no SQL arguments; everything (event type, OLD/NEW
slots, the Trigger struct, transition tuplestores) arrives through the
TriggerData node in fcinfo->context, which a PL trigger reads via the
CALLED_AS_TRIGGER macro and per-language glue.
TriggerEnabled: the per-firing gate
Before any hook actually calls ExecCallTriggerFunc, it filters each
candidate trigger through TriggerEnabled, which folds together the three
“should this trigger fire for this specific row/event” tests that the
catalog flags alone cannot answer:
```c
// TriggerEnabled (src/backend/commands/trigger.c)

/* 1. session_replication_role vs. tgenabled */
if (SessionReplicationRole == SESSION_REPLICATION_ROLE_REPLICA)
{
    if (trigger->tgenabled == TRIGGER_FIRES_ON_ORIGIN ||
        trigger->tgenabled == TRIGGER_DISABLED)
        return false;
}

/* 2. column-specific UPDATE trigger: skip if no listed column changed */
if (trigger->tgnattr > 0 && TRIGGER_FIRED_BY_UPDATE(event))
{
    /* ... return false unless some tgattr[] member is in modifiedCols ... */
}

/* 3. WHEN (...) qualifier */
if (trigger->tgqual)
{
    econtext->ecxt_innertuple = oldslot;    /* OLD -> INNER_VAR */
    econtext->ecxt_outertuple = newslot;    /* NEW -> OUTER_VAR */
    if (!ExecQual(*predicate, econtext))
        return false;
}
```
Three things are worth pulling out. First, tgenabled is not a boolean:
it is a four-way state (Origin / Replica / Always / Disabled) that
interacts with session_replication_role, which is how logical replication
apply workers suppress origin-side triggers, the same mechanism that
replication tools such as pglogical rely on. Second, the column list (UPDATE OF col1, col2)
is checked here against the statement’s modifiedCols bitmap, not at queue
time — a column-specific UPDATE trigger on an unmodified column is filtered
out before it ever reaches the queue. Third, the WHEN condition is
compiled lazily (the first firing per query stringToNodes tgqual, rewrites
OLD/NEW Var references to INNER_VAR/OUTER_VAR, and caches the
ExprState in ri_TrigWhenExprs[]) and then evaluated against the OLD/NEW
slots — so a WHEN that fails also keeps the event off the queue entirely.
For AFTER triggers this matters: the WHEN is evaluated at save time
against the in-flight tuples, not at fire time, which is the only point where
the OLD/NEW images are still both at hand.
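The three tests can be condensed into a stand-alone sketch. All types and names below are hypothetical; in particular, the column list is modeled as a plain bitmask, the WHEN clause as a function pointer, and the non-replica branch of the first test (elided in the excerpt above) is filled in as an assumption that replica-only and disabled triggers are skipped there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    SK_FIRES_ON_ORIGIN,
    SK_FIRES_ON_REPLICA,
    SK_FIRES_ALWAYS,
    SK_DISABLED
} SketchEnabled;

typedef enum { SK_ROLE_ORIGIN, SK_ROLE_REPLICA } SketchRole;

typedef struct SketchRowPair {
    int old_x;
    int new_x;
} SketchRowPair;

typedef bool (*SketchWhenFn)(const SketchRowPair *rows);

typedef struct SketchTrigger {
    SketchEnabled enabled;        /* four-way state, not a boolean */
    uint32_t      update_of_cols; /* bitmask of UPDATE OF columns; 0 = no list */
    SketchWhenFn  when;           /* NULL = no WHEN clause */
} SketchTrigger;

bool sketch_trigger_enabled(const SketchTrigger *trg, SketchRole role,
                            bool is_update, uint32_t modified_cols,
                            const SketchRowPair *rows)
{
    assert(trg != NULL);

    /* 1. enabled state versus the session's replication role */
    if (role == SK_ROLE_REPLICA)
    {
        if (trg->enabled == SK_FIRES_ON_ORIGIN || trg->enabled == SK_DISABLED)
            return false;
    }
    else
    {
        /* assumption: mirror image of the replica branch */
        if (trg->enabled == SK_FIRES_ON_REPLICA || trg->enabled == SK_DISABLED)
            return false;
    }

    /* 2. column-specific UPDATE trigger: skip if no listed column changed */
    if (trg->update_of_cols != 0 && is_update &&
        (trg->update_of_cols & modified_cols) == 0)
        return false;

    /* 3. WHEN qualifier, evaluated against the in-flight OLD/NEW images */
    if (trg->when != NULL && !trg->when(rows))
        return false;

    return true;
}

/* Sample WHEN: fire only when x actually changed. */
bool sketch_when_changed(const SketchRowPair *rows)
{
    return rows->old_x != rows->new_x;
}
```

Because the gate runs before anything is queued, an event rejected by any of the three tests costs no queue space at all.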
Worked example: ordering, recursion, and the firing loop
Put the pieces together with one statement. Suppose table t has a
BEFORE-row trigger a_stamp, an AFTER-row trigger b_audit, and an
AFTER-statement trigger c_summary, and we run
UPDATE t SET x = x + 1 WHERE x < 100 touching 40 rows.
```mermaid
flowchart TD
  START["ExecutorStart -> AfterTriggerBeginQuery<br/>query_depth++"]
  BS["ExecBSUpdateTriggers<br/>(no BS trigger here: flag clear, return)"]
  LOOP["for each of the 40 matching rows"]
  BR["ExecBRUpdateTriggers<br/>run a_stamp NOW, may rewrite NEW"]
  UPD["table_tuple_update applies the row"]
  AR["ExecARUpdateTriggers<br/>queue b_audit event (ctid1=old, ctid2=new)"]
  AS["ExecASUpdateTriggers<br/>queue ONE c_summary statement event"]
  END["ExecutorFinish -> AfterTriggerEndQuery"]
  MARK["afterTriggerMarkEvents: stamp firable events<br/>with firing_id = counter++"]
  INV["afterTriggerInvokeEvents: fire in queue order<br/>40x b_audit, then c_summary"]
  START --> BS --> LOOP --> BR --> UPD --> AR --> LOOP
  LOOP -->|"all rows done"| AS --> END --> MARK --> INV
  INV -.->|"a fired trigger queued more?"| MARK
```
Several invariants from the source surface in this trace. The 40 b_audit
events are queued in row-processing order and share a single
AfterTriggerSharedData record (same tgoid, relid, rolid), so the
queue holds 40 two-CTID event records (old and new CTID) plus one descriptor. The single
c_summary event is queued after all row events because ExecASUpdateTriggers
runs once at end-of-statement; and cancel_prior_stmt_triggers (called from
AfterTriggerSaveEvent for the statement event) guarantees the statement
trigger fires exactly once even across writable-CTE re-entry. At
AfterTriggerEndQuery the events are marked with a fresh firing_id and
fired in queue order — row triggers before the statement trigger — and the
surrounding for (;;) loop re-runs afterTriggerMarkEvents if b_audit or
c_summary itself issued DML that queued more events. Runaway recursion is
bounded not by static analysis but by MyTriggerDepth (incremented in
ExecCallTriggerFunc) interacting with max_stack_depth and statement
timeouts.
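The mark-then-invoke loop in this trace can be mimicked with a fixed-size array. This is only a sketch under simplifying assumptions: every event is immediate-mode, triggers are identified by small integers (1 = b_audit, 2 = c_summary, 3 = a follow-up queued by c_summary), and all Sketch* and sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_MAX_EVENTS 128

typedef struct SketchEvent {
    int      trigger;    /* which (hypothetical) trigger to fire */
    unsigned firing_id;  /* 0 = not yet scheduled */
    bool     done;
} SketchEvent;

typedef struct SketchQueue {
    SketchEvent events[SK_MAX_EVENTS];
    int         nevents;
    unsigned    firing_counter;        /* next cycle ID to hand out */
    int         fired[SK_MAX_EVENTS];  /* firing log, for inspection */
    int         nfired;
} SketchQueue;

typedef void (*SketchTriggerFn)(SketchQueue *q, int trigger);

void sketch_queue_init(SketchQueue *q)
{
    q->nevents = 0;
    q->nfired = 0;
    q->firing_counter = 1;
}

void sketch_enqueue(SketchQueue *q, int trigger)
{
    assert(q->nevents < SK_MAX_EVENTS);
    q->events[q->nevents].trigger = trigger;
    q->events[q->nevents].firing_id = 0;
    q->events[q->nevents].done = false;
    q->nevents++;
}

/* Stamp every unscheduled event with the current cycle ID. */
bool sketch_mark(SketchQueue *q)
{
    bool found = false;

    for (int i = 0; i < q->nevents; i++)
        if (q->events[i].firing_id == 0)
        {
            q->events[i].firing_id = q->firing_counter;
            found = true;
        }
    return found;
}

/* Drain: repeat mark + invoke until a pass finds nothing new to fire. */
void sketch_drain(SketchQueue *q, SketchTriggerFn fire)
{
    while (sketch_mark(q))
    {
        unsigned id = q->firing_counter++;

        for (int i = 0; i < q->nevents; i++)
        {
            SketchEvent *ev = &q->events[i];

            if (ev->firing_id != id || ev->done)
                continue;       /* queued mid-cycle: wait for the next pass */
            ev->done = true;
            q->fired[q->nfired++] = ev->trigger;
            fire(q, ev->trigger);
        }
    }
}

/* Demo body: the statement trigger (2) queues one follow-up event (3). */
void sketch_demo_fire(SketchQueue *q, int trigger)
{
    if (trigger == 2)
        sketch_enqueue(q, 3);
}
```

Queuing 40 events for trigger 1 followed by one for trigger 2 and then draining fires them in queue order; the follow-up event queued by trigger 2 is picked up by a second pass with a fresh cycle ID rather than being fired in the middle of the first.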
The after-trigger event queue
The reason AFTER triggers are not run inline is that they must observe the final state of the statement (after all rows are modified, after BEFORE triggers, after constraint application) and, for deferrable constraints, may have to wait until commit. PostgreSQL therefore records a tiny event per firing and drains the queue later. The record is deliberately minimal, holding a flags word and one or two item pointers (CTIDs) rather than a tuple copy:
```c
// AfterTriggerEventData (src/backend/commands/trigger.c)
typedef struct AfterTriggerEventData
{
    TriggerFlags    ate_flags;      /* status bits + offset to shared data */
    ItemPointerData ate_ctid1;      /* inserted/deleted/old-updated tuple */
    ItemPointerData ate_ctid2;      /* new updated tuple */
    Oid             ate_src_part;   /* cross-partition update only */
    Oid             ate_dst_part;
} AfterTriggerEventData;
```
The clever part is that the per-trigger metadata (which trigger, which
relation, which role, modified-column set) is factored out into a separate
AfterTriggerSharedData record, and many events can point at one shared
record. The low 27 bits of ate_flags hold the byte offset from the event
to its shared record (GetTriggerSharedData), and the high bits encode size
class and status (AFTER_TRIGGER_1CTID, _2CTID, _CP_UPDATE, plus
AFTER_TRIGGER_IN_PROGRESS / _DONE). Because most queued events for a
statement share the same trigger and relation, the per-event cost collapses
to roughly the size of AfterTriggerEventDataOneCtid — a flags word plus a
single 6-byte CTID. For a million-row INSERT that fires one FK check
trigger, the queue is a million 12-ish-byte records sharing one descriptor,
not a million tuple copies.
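The size arithmetic and the offset packing can be checked with a few stand-in types. The structs below are hypothetical mirrors, not the PostgreSQL definitions: a 6-byte item pointer is modeled as three 16-bit fields, the offset mask is derived from the "low 27 bits" statement above, and SK_DONE is an arbitrary high status bit chosen for the demonstration. The sizes in the usage note assume a typical ABI with 4-byte alignment for uint32_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a 6-byte CTID: block number in two halves plus an offset. */
typedef struct SketchItemPointer {
    uint16_t block_hi;
    uint16_t block_lo;
    uint16_t posid;
} SketchItemPointer;

typedef struct SketchEventOneCtid {
    uint32_t          flags;
    SketchItemPointer ctid1;
} SketchEventOneCtid;

typedef struct SketchEventTwoCtid {
    uint32_t          flags;
    SketchItemPointer ctid1;
    SketchItemPointer ctid2;
} SketchEventTwoCtid;

#define SK_OFFSET_MASK ((1u << 27) - 1u)  /* low 27 bits: offset to shared record */
#define SK_DONE        (1u << 31)         /* hypothetical status bit */

/* Store a byte offset in the low bits without disturbing the status bits. */
uint32_t sketch_pack_offset(uint32_t flags, uint32_t offset)
{
    assert(offset <= SK_OFFSET_MASK);
    return (flags & ~SK_OFFSET_MASK) | offset;
}

uint32_t sketch_unpack_offset(uint32_t flags)
{
    return flags & SK_OFFSET_MASK;
}

/* Queue footprint when all events share a single descriptor. */
size_t sketch_queue_bytes(size_t nevents, size_t event_size, size_t shared_size)
{
    return nevents * event_size + shared_size;
}
```

On such an ABI sizeof(SketchEventOneCtid) is 12 (4 + 6, padded to a multiple of 4) and sizeof(SketchEventTwoCtid) is 16, so a million one-CTID events come to about 12 MB plus one shared record.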
Events live in an AfterTriggerEventList — a linked list of
geometrically growing chunks (1 KB doubling up to 1 MB). Each chunk is a
double-ended arena: AfterTriggerEventData records grow upward from
freeptr, AfterTriggerSharedData records grow downward from endfree,
and the offset link bridges them:
```mermaid
flowchart LR
  subgraph chunk["AfterTriggerEventChunk (arena)"]
    direction TB
    E1["event[0]<br/>flags+ctid1"]
    E2["event[1]<br/>flags+ctid1"]
    EDOTS["..."]
    FREE["free space"]
    SDOTS["..."]
    S1["shared[1]"]
    S0["shared[0]<br/>tgoid, relid, firing_id"]
  end
  E1 -.->|"ate_flags & OFFSET"| S0
  E2 -.->|"ate_flags & OFFSET"| S0
  L["AfterTriggerEventList<br/>head / tail / tailfree"] --> chunk
```
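The double-ended arena in the diagram can be sketched in a few dozen lines. This is a simplified illustration with hypothetical names: the chunk is a fixed 1 KB buffer, a shared record is reduced to (tgoid, relid), events carry a row number instead of a CTID, and records are copied in and out with memcpy to stay clear of alignment and aliasing concerns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_CHUNK_BYTES 1024

typedef struct SketchShared {
    uint32_t tgoid;
    uint32_t relid;
} SketchShared;

typedef struct SketchEvent {
    uint32_t offset_to_shared;  /* byte distance from this event to its shared record */
    uint32_t row_id;
} SketchEvent;

typedef struct SketchChunk {
    char *freeptr;              /* events grow upward from here */
    char *endfree;              /* shared records grow downward to here */
    char *endptr;               /* end of the buffer */
    char  data[SK_CHUNK_BYTES];
} SketchChunk;

void sketch_chunk_init(SketchChunk *c)
{
    c->freeptr = c->data;
    c->endptr = c->data + SK_CHUNK_BYTES;
    c->endfree = c->endptr;
}

/* Append one event, reusing a matching shared record. Returns -1 if full. */
int sketch_add_event(SketchChunk *c, const SketchShared *shared, uint32_t row_id)
{
    char        *sp;
    SketchShared cur;
    SketchEvent  ev;
    int          found = 0;
    size_t       need;

    /* look for a matching shared record already in the chunk */
    for (sp = c->endfree; sp < c->endptr; sp += sizeof(SketchShared))
    {
        memcpy(&cur, sp, sizeof cur);
        if (cur.tgoid == shared->tgoid && cur.relid == shared->relid)
        {
            found = 1;
            break;
        }
    }

    need = sizeof(SketchEvent) + (found ? 0 : sizeof(SketchShared));
    if ((size_t) (c->endfree - c->freeptr) < need)
        return -1;              /* a real allocator would start a new chunk */

    if (!found)
    {
        c->endfree -= sizeof(SketchShared);
        sp = c->endfree;
        memcpy(sp, shared, sizeof *shared);
    }

    ev.offset_to_shared = (uint32_t) (sp - c->freeptr);  /* the link */
    ev.row_id = row_id;
    memcpy(c->freeptr, &ev, sizeof ev);
    c->freeptr += sizeof ev;
    return 0;
}

/* Follow the offset link from the index-th event to its shared record. */
SketchShared sketch_shared_of(const SketchChunk *c, size_t index)
{
    const char  *ep = c->data + index * sizeof(SketchEvent);
    SketchEvent  ev;
    SketchShared sh;

    assert(ep + sizeof ev <= c->freeptr);
    memcpy(&ev, ep, sizeof ev);
    memcpy(&sh, ep + ev.offset_to_shared, sizeof sh);
    return sh;
}

size_t sketch_shared_count(const SketchChunk *c)
{
    return (size_t) (c->endptr - c->endfree) / sizeof(SketchShared);
}
```

Adding 40 events for one trigger and one event for another leaves 41 event records at the bottom of the buffer and only two shared records at the top, which is the deduplication the offset link exists to enable.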
afterTriggerAddEvent is the allocator: it finds room in the tail chunk (or
mallocs a bigger one from the AfterTriggerEvents memory context), scans the
chunk’s existing shared records for a match, reuses it if found or copies a
new one in, then memcpys the event and patches the offset link:
```c
// afterTriggerAddEvent (src/backend/commands/trigger.c)

/* try to locate a matching shared-data record already in the chunk */
for (newshared = (AfterTriggerShared) chunk->endfree;
     (char *) newshared < chunk->endptr;
     newshared++)
{
    if (newshared->ats_tgoid == evtshared->ats_tgoid &&
        newshared->ats_event == evtshared->ats_event &&
        newshared->ats_firing_id == 0 &&
        /* ... relid, rolid, modifiedcols all equal ... */ )
        break;
}
/* ... allocate a new shared record if none matched ... */

newevent = (AfterTriggerEvent) chunk->freeptr;
memcpy(newevent, event, eventsize);
newevent->ate_flags &= ~AFTER_TRIGGER_OFFSET;
newevent->ate_flags |= (char *) newshared - (char *) newevent;  /* link */
chunk->freeptr += eventsize;
```
The queue is two-level: afterTriggers.query_stack[query_depth].events
holds events from the currently running query, while
afterTriggers.events is the transaction-global deferred list. The split is
what makes immediate-vs-deferred and subtransaction rollback tractable, and
it is described in the big comment on AfterTriggersData:
```c
// AfterTriggersData (src/backend/commands/trigger.c)
typedef struct AfterTriggersData
{
    CommandId   firing_counter;     /* next firing-cycle ID to assign */
    SetConstraintState state;       /* active SET CONSTRAINTS state */
    AfterTriggerEventList events;   /* transaction-global deferred list */
    MemoryContext event_cxt;        /* memory context for events */

    AfterTriggersQueryData *query_stack;    /* per-query-level events */
    int         query_depth;        /* current index; -1 when empty */
    int         maxquerydepth;

    AfterTriggersTransData *trans_stack;    /* per-subxact saved pointers */
    int         maxtransdepth;
} AfterTriggersData;
```
The lifecycle hooks (all called from xact.c / the executor, not from user
code) are:
- AfterTriggerBeginXact: zero the state at transaction start; firing_counter = 1, query_depth = -1.
- AfterTriggerBeginQuery: query_depth++; called from standard_ExecutorStart / ExecutorStart. Cheap: real allocation is lazy.
- AfterTriggerEndQuery: the drain for immediate-mode events; called from ExecutorFinish. It calls afterTriggerMarkEvents to tag firable events with the next firing-cycle ID (deferred ones are migrated to the global list here), then loops afterTriggerInvokeEvents until none remain, because a fired trigger may queue more at the same level.

```c
// AfterTriggerEndQuery (src/backend/commands/trigger.c)
qs = &afterTriggers.query_stack[afterTriggers.query_depth];
for (;;)
{
    if (afterTriggerMarkEvents(&qs->events, &afterTriggers.events, true))
    {
        CommandId   firing_id = afterTriggers.firing_counter++;
        AfterTriggerEventChunk *oldtail = qs->events.tail;

        if (afterTriggerInvokeEvents(&qs->events, firing_id, estate, false))
            break;              /* all fired */

        qs = &afterTriggers.query_stack[afterTriggers.query_depth];  /* may have moved */

        /* drop fully-fired leading chunks to speed the rescan */
        while (qs->events.head != oldtail)
            afterTriggerDeleteHeadEventChunk(qs);
    }
    else
        break;
}
```

- AfterTriggerFireDeferred: the drain for the transaction-global deferred list; called from CommitTransaction just before commit. It pushes a snapshot, then loops mark+invoke until empty, since deferred triggers may queue more.

The firing_counter/firing_id scheme is what keeps SET CONSTRAINTS ... IMMEDIATE sane: each drain pass stamps the events it intends to fire with a
unique cycle ID, and afterTriggerInvokeEvents only fires events whose
ats_firing_id matches the current cycle and whose AFTER_TRIGGER_IN_PROGRESS
bit is set — so a nested SET CONSTRAINTS issued by a trigger fires only
events that were not already scheduled.
At fire time the event carries only CTIDs, so afterTriggerInvokeEvents (via
AfterTriggerExecute) re-fetches the tuple by CTID under SnapshotAny
— the row must be found regardless of MVCC visibility, because the trigger
acts on behalf of the modifying transaction:
```c
// AfterTriggerExecute (src/backend/commands/trigger.c), default heap case
if (!table_tuple_fetch_row_version(src_rel,
                                   &(event->ate_ctid1),
                                   SnapshotAny,
                                   src_slot))
    elog(ERROR, "failed to fetch tuple1 for AFTER trigger");

LocTriggerData.tg_trigtuple =
    ExecFetchSlotHeapTuple(LocTriggerData.tg_trigslot, false,
                           &should_free_trig);
```
(Foreign-table events take a different branch: their tuples cannot be
re-fetched by CTID, so they are spooled into an FDW tuplestore at save time
and read back here, flagged AFTER_TRIGGER_FDW_FETCH/_REUSE.)
Subtransaction rollback is handled by trans_stack: at subxact start the
current events head/tail pointers are saved, and on abort
afterTriggerRestoreEventList truncates the list back to the saved position,
discarding exactly the chunks added by the aborted subxact — O(chunks-added),
not O(total events).
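The "save a position, truncate back to it" discipline can be sketched with a small chunked list. The names and the four-events-per-chunk size are hypothetical; the point is only that a savepoint is a (tail chunk, fill level) pair and that rollback frees whole chunks added after it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SK_PER_CHUNK 4

typedef struct SketchChunk {
    struct SketchChunk *next;
    int                 used;
    int                 events[SK_PER_CHUNK];
} SketchChunk;

typedef struct SketchList {
    SketchChunk *head;
    SketchChunk *tail;
} SketchList;

/* What a subtransaction remembers at start: where the list ended. */
typedef struct SketchSavepoint {
    SketchChunk *tail;
    int          tail_used;
} SketchSavepoint;

/* Append one event, adding a chunk when the tail is full. -1 on OOM. */
int sketch_append(SketchList *l, int ev)
{
    if (l->tail == NULL || l->tail->used == SK_PER_CHUNK)
    {
        SketchChunk *c = malloc(sizeof *c);

        if (c == NULL)
            return -1;
        c->next = NULL;
        c->used = 0;
        if (l->tail != NULL)
            l->tail->next = c;
        else
            l->head = c;
        l->tail = c;
    }
    l->tail->events[l->tail->used++] = ev;
    return 0;
}

SketchSavepoint sketch_save(const SketchList *l)
{
    SketchSavepoint sp = { l->tail, l->tail ? l->tail->used : 0 };

    return sp;
}

/* Truncate back to the savepoint: cost is proportional to chunks added since. */
void sketch_restore(SketchList *l, SketchSavepoint sp)
{
    SketchChunk *c;

    assert(l != NULL);
    c = sp.tail ? sp.tail->next : l->head;
    while (c != NULL)
    {
        SketchChunk *next = c->next;

        free(c);
        c = next;
    }
    if (sp.tail != NULL)
    {
        sp.tail->next = NULL;
        sp.tail->used = sp.tail_used;   /* drop events added to the old tail */
        l->tail = sp.tail;
    }
    else
        l->head = l->tail = NULL;       /* list was empty at the savepoint */
}

int sketch_count(const SketchList *l)
{
    int n = 0;

    for (const SketchChunk *c = l->head; c != NULL; c = c->next)
        n += c->used;
    return n;
}
```

Events queued before the savepoint survive untouched, which matches the requirement that earlier events outlive an aborted subtransaction.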
Transition tables are captured separately, into tuplestores
REFERENCING OLD TABLE / NEW TABLE is a different mechanism that shares
the AFTER machinery’s lifecycle but not its CTID-event representation. When a
relation has any transition-table trigger, the executor (e.g.,
nodeModifyTable’s setup) builds a TransitionCaptureState via
MakeTransitionCaptureState, which allocates tuplestore objects in the
(sub)transaction’s CurTransactionContext:
```c
// MakeTransitionCaptureState (src/backend/commands/trigger.c)
if (need_old_upd && upd_table->old_tuplestore == NULL)
    upd_table->old_tuplestore = tuplestore_begin_heap(false, false, work_mem);
if (need_new_upd && upd_table->new_tuplestore == NULL)
    upd_table->new_tuplestore = tuplestore_begin_heap(false, false, work_mem);
/* ... old_del, new_ins similarly; keyed by (relid, cmdType) ... */

state = (TransitionCaptureState *) palloc0(sizeof(TransitionCaptureState));
state->tcs_update_old_table = need_old_upd;
state->tcs_update_new_table = need_new_upd;
state->tcs_update_private = upd_table;
```
As each row flows through ExecAR* → AfterTriggerSaveEvent, the OLD/NEW
slot is also appended to the matching tuplestore (TransitionTableAddTuple)
before — or instead of — queuing an event. The decisive design constraint,
stated in both MakeTransitionCaptureState and the AfterTriggersData
comment, is that transition tables are never deferrable: they live only
until AfterTriggerEndQuery, so a deferrable trigger cannot reference one.
This is why the tuplestores can sit in CurTransactionContext and be freed
when the query level pops, rather than surviving to commit like the deferred
event list. The trigger function ultimately sees these as
tg_oldtable/tg_newtable in its TriggerData, exposed to SQL as the named
OLD/NEW relations.
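The capture-then-consume shape can be sketched with two plain arrays standing in for the OLD and NEW tuplestores. All names are hypothetical, rows are reduced to a single integer column, and the fixed capacity replaces the spill-to-disk behavior a real tuplestore provides.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX_ROWS 64

typedef struct SketchTransitionCapture {
    int    old_rows[SK_MAX_ROWS];  /* stand-in for the OLD TABLE store */
    int    new_rows[SK_MAX_ROWS];  /* stand-in for the NEW TABLE store */
    size_t nrows;
} SketchTransitionCapture;

/* Row path: called once per updated row; only appends, runs no trigger code. */
void sketch_capture_update(SketchTransitionCapture *cap, int old_x, int new_x)
{
    assert(cap->nrows < SK_MAX_ROWS);
    cap->old_rows[cap->nrows] = old_x;
    cap->new_rows[cap->nrows] = new_x;
    cap->nrows++;
}

/* Statement path: called once at end of statement, sees the whole delta. */
long sketch_statement_delta(const SketchTransitionCapture *cap)
{
    long delta = 0;

    for (size_t i = 0; i < cap->nrows; i++)
        delta += (long) cap->new_rows[i] - cap->old_rows[i];
    return delta;
}
```

For the 40-row UPDATE t SET x = x + 1 from the worked example, the row path runs 40 times and the statement-level function runs once, computing a total change of 40 in a single pass over the captured rows.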
Source Walkthrough
This section names the stable symbols, grouped by the path a trigger takes
from catalog to execution. Adjacent mechanisms, namely the DML node that
calls these hooks (nodeModifyTable.c), the generic CREATE TRIGGER utility
plumbing, and the fmgr call convention, are covered in
postgres-executor.md, postgres-ddl-execution.md, and postgres-fmgr.md
respectively; here we stay inside the trigger subsystem proper.
Definition and catalog → relcache descriptor.
- CreateTrigger / CreateTriggerFiringOn: implement CREATE TRIGGER; validate, look up the function, build and CatalogTupleInsert the pg_trigger row. Returns the new trigger’s ObjectAddress.
- RemoveTriggerById, renametrig, EnableDisableTrigger: the rest of the DDL surface (drop, rename, ENABLE/DISABLE).
- Form_pg_trigger / the tgtype bit macros (TRIGGER_TYPE_ROW, _BEFORE, _INSERT, …, _INSTEAD) and TRIGGER_TYPE_MATCHES: the packed on-disk classification and the test macro.
- RelationBuildTriggers: relcache hook; scans pg_trigger by TriggerRelidNameIndexId (name order = firing order) and fills a TriggerDesc.
- SetTriggerFlags: ORs each trigger into the TriggerDesc summary booleans (trig_insert_before_row, …, trig_*_old_table).
- CopyTriggerDesc, FreeTriggerDesc, equalTriggerDescs: descriptor lifecycle used by the relcache.
- Trigger, TriggerDesc (in reltrigger.h); TriggerData, TransitionCaptureState (in trigger.h): the in-memory structs.

Row/statement firing points (called by the executor).
- ExecBSInsertTriggers / ExecBSUpdateTriggers / ExecBSDeleteTriggers / ExecBSTruncateTriggers: BEFORE STATEMENT; run synchronously, must not return a value. Guard against double-firing via before_stmt_triggers_fired.
- ExecBRInsertTriggers / ExecBRUpdateTriggers / ExecBRDeleteTriggers: BEFORE ROW; run synchronously, return the (possibly rewritten / NULL) tuple. ExecBRUpdateTriggers / ExecBRDeleteTriggers first fetch the old row via GetTupleForTrigger.
- ExecIRInsertTriggers / ExecIRUpdateTriggers / ExecIRDeleteTriggers: INSTEAD OF ROW (views); replace the operation.
- ExecARInsertTriggers / ExecARUpdateTriggers / ExecARDeleteTriggers: AFTER ROW; thin guards that call AfterTriggerSaveEvent.
- ExecASInsertTriggers / ExecASUpdateTriggers / ExecASDeleteTriggers / ExecASTruncateTriggers: AFTER STATEMENT; call AfterTriggerSaveEvent with row_trigger = false.
- ExecCallTriggerFunc: the single fmgr chokepoint; sets up TriggerData, bumps MyTriggerDepth, invokes the function in the per-tuple context.
- TriggerEnabled: evaluates tgenabled (session replication role), the UPDATE OF column list, and the WHEN qualifier; returns whether this trigger fires for this row/event.

After-trigger queue (event records, chunks, drain).
- AfterTriggerEventData (+ …NoOids, …OneCtid, …ZeroCtids size variants), AfterTriggerSharedData, AfterTriggerEventChunk, AfterTriggerEventList: the on-queue representation.
- SizeofTriggerEvent, GetTriggerSharedData, the for_each_event / for_each_chunk iterator macros, AFTER_TRIGGER_OFFSET/_IN_PROGRESS/_DONE/_1CTID/_2CTID/_CP_UPDATE flag bits.
- AfterTriggersData, AfterTriggersQueryData, AfterTriggersTransData, AfterTriggersTableData, and the file-static afterTriggers: global state.
- AfterTriggerSaveEvent: the entry point from every ExecAR* / ExecAS*; validates the event, captures transition tuples, computes flags, and calls afterTriggerAddEvent.
- afterTriggerAddEvent: the chunked-arena allocator + shared-record dedup.
- afterTriggerMarkEvents: tag firable events with the current firing_id; migrate deferred events to the move-list.
- afterTriggerInvokeEvents → AfterTriggerExecute: re-fetch tuples by CTID under SnapshotAny (or from the FDW tuplestore) and call ExecCallTriggerFunc.
- afterTriggerCheckState, SetConstraintsCommand, SetConstraintStateCreate: SET CONSTRAINTS (deferral) state.
- afterTriggerFreeEventList, afterTriggerRestoreEventList, afterTriggerDeleteHeadEventChunk: teardown and subxact-abort truncation.
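The way one 32-bit flags word carries both status bits and the link to the shared record can be sketched like this. The offset mask uses the AFTER_TRIGGER_OFFSET value quoted in this document (0x07FFFFFF, the low 27 bits); the exact positions chosen for the done and in-progress bits, and every SK_ name, are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_OFFSET_MASK 0x07FFFFFFu  /* low 27 bits: byte offset to shared record */
#define SK_DONE        0x80000000u  /* assumed position of the "done" bit */
#define SK_IN_PROGRESS 0x40000000u  /* assumed position of the "in progress" bit */

/* Combine status bits with an offset; the two must not overlap. */
static uint32_t sk_pack(uint32_t status_bits, uint32_t offset)
{
    assert((offset & ~SK_OFFSET_MASK) == 0);
    assert((status_bits & SK_OFFSET_MASK) == 0);
    return status_bits | offset;
}

/* Recover the byte distance from the event to its shared record. */
static uint32_t sk_offset(uint32_t flags)
{
    return flags & SK_OFFSET_MASK;
}

/* Locate the shared record: it sits `offset` bytes past the event itself. */
static const char *sk_shared_of(const char *event, uint32_t flags)
{
    return event + sk_offset(flags);
}

static bool sk_is_done(uint32_t flags)
{
    return (flags & SK_DONE) != 0;
}

/* After a trigger function returns: clear in-progress, set done,
 * leaving the offset bits untouched. */
static uint32_t sk_mark_done(uint32_t flags)
{
    return (flags & ~SK_IN_PROGRESS) | SK_DONE;
}
```

Because the link is a relative offset rather than a pointer, the event record stays small and remains valid if the chunk that holds it is copied as a block.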
Lifecycle hooks (called from xact.c / executor).
- AfterTriggerBeginXact, AfterTriggerBeginQuery, AfterTriggerEndQuery, AfterTriggerFireDeferred, AfterTriggerEndXact, AfterTriggerBeginSubXact, AfterTriggerEndSubXact: the queue’s transaction integration.
- AfterTriggerEnlargeQueryState: grow query_stack on demand.
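The ordering these hooks impose (immediate events fire at end of query, deferred events migrate to a transaction-level list and fire just before commit) can be modelled with a toy queue. This is a deliberately reduced sketch: fixed-size arrays replace the chunked event lists, firing only records an id, and the real end-of-query loop, which repeats while triggers queue further events, is omitted. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_MAX_EVENTS 16

typedef struct { int id; bool deferred; } SkEvent;
typedef struct { SkEvent items[SK_MAX_EVENTS]; size_t n; } SkList;

typedef struct {
    SkList query;                 /* events queued by the current statement */
    SkList xact;                  /* deferred events waiting for commit */
    int    fired[SK_MAX_EVENTS];  /* ids in the order they fired */
    size_t nfired;
} SkQueue;

static void sk_push(SkList *l, SkEvent e)
{
    assert(l->n < SK_MAX_EVENTS);
    l->items[l->n++] = e;
}

static void sk_fire(SkQueue *q, SkEvent e)
{
    assert(q->nfired < SK_MAX_EVENTS);
    q->fired[q->nfired++] = e.id;
}

/* Stand-in for the save-event step: just queue, never fire. */
static void sk_save_event(SkQueue *q, int id, bool deferred)
{
    sk_push(&q->query, (SkEvent){ id, deferred });
}

/* End of statement: fire immediate events, migrate deferred ones. */
static void sk_end_query(SkQueue *q)
{
    for (size_t i = 0; i < q->query.n; i++) {
        if (q->query.items[i].deferred)
            sk_push(&q->xact, q->query.items[i]);
        else
            sk_fire(q, q->query.items[i]);
    }
    q->query.n = 0;
}

/* Just before commit: drain everything that was deferred. */
static void sk_fire_deferred(SkQueue *q)
{
    for (size_t i = 0; i < q->xact.n; i++)
        sk_fire(q, q->xact.items[i]);
    q->xact.n = 0;
}
```

Queuing events 1 (immediate), 2 (deferred), 3 (immediate) and then ending the query fires 1 and 3; event 2 fires only when the deferred list is drained.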
Transition tables.
- MakeTransitionCaptureState: allocate the OLD/NEW tuplestores keyed by (relid, cmdType); returns NULL when no transition table is needed.
- GetAfterTriggersTableData, GetAfterTriggersTransitionTable, GetAfterTriggersStoreSlot, TransitionTableAddTuple: find/create the per-table data and append rows.
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| Trigger (struct) | src/include/utils/reltrigger.h | 23 |
| TriggerDesc (struct) | src/include/utils/reltrigger.h | 47 |
| TriggerData (struct) | src/include/commands/trigger.h | 31 |
| TransitionCaptureState (struct) | src/include/commands/trigger.h | 56 |
| TRIGGER_EVENT_* flags | src/include/commands/trigger.h | 94 |
| TRIGGER_TYPE_ROW … _INSTEAD | src/include/catalog/pg_trigger.h | 93 |
| TRIGGER_TYPE_MATCHES | src/include/catalog/pg_trigger.h | 141 |
| CreateTrigger | src/backend/commands/trigger.c | 161 |
| CreateTriggerFiringOn | src/backend/commands/trigger.c | 178 |
| RelationBuildTriggers | src/backend/commands/trigger.c | 1862 |
| SetTriggerFlags | src/backend/commands/trigger.c | 2014 |
| CopyTriggerDesc | src/backend/commands/trigger.c | 2091 |
| ExecCallTriggerFunc | src/backend/commands/trigger.c | 2310 |
| ExecBSInsertTriggers | src/backend/commands/trigger.c | 2402 |
| ExecASInsertTriggers | src/backend/commands/trigger.c | 2453 |
| ExecBRInsertTriggers | src/backend/commands/trigger.c | 2466 |
| ExecARInsertTriggers | src/backend/commands/trigger.c | 2544 |
| ExecIRInsertTriggers | src/backend/commands/trigger.c | 2570 |
| ExecBRDeleteTriggers | src/backend/commands/trigger.c | 2702 |
| ExecARDeleteTriggers | src/backend/commands/trigger.c | 2802 |
| ExecBRUpdateTriggers | src/backend/commands/trigger.c | 2972 |
| ExecARUpdateTriggers | src/backend/commands/trigger.c | 3145 |
| ExecBSTruncateTriggers | src/backend/commands/trigger.c | 3281 |
| TriggerEnabled | src/backend/commands/trigger.c | 3483 |
| AFTER_TRIGGER_* flag bits | src/backend/commands/trigger.c | 3682 |
| AfterTriggerSharedData | src/backend/commands/trigger.c | 3694 |
| AfterTriggerEventData | src/backend/commands/trigger.c | 3707 |
| SizeofTriggerEvent / GetTriggerSharedData | src/backend/commands/trigger.c | 3743 |
| AfterTriggerEventChunk | src/backend/commands/trigger.c | 3762 |
| AfterTriggerEventList | src/backend/commands/trigger.c | 3774 |
| AfterTriggersData | src/backend/commands/trigger.c | 3880 |
| afterTriggerCheckState | src/backend/commands/trigger.c | 4008 |
| afterTriggerAddEvent | src/backend/commands/trigger.c | 4078 |
| afterTriggerRestoreEventList | src/backend/commands/trigger.c | 4226 |
| AfterTriggerExecute | src/backend/commands/trigger.c | 4328 |
| afterTriggerMarkEvents | src/backend/commands/trigger.c | 4614 |
| afterTriggerInvokeEvents | src/backend/commands/trigger.c | 4698 |
| GetAfterTriggersTableData | src/backend/commands/trigger.c | 4867 |
| MakeTransitionCaptureState | src/backend/commands/trigger.c | 4958 |
| AfterTriggerBeginXact | src/backend/commands/trigger.c | 5084 |
| AfterTriggerBeginQuery | src/backend/commands/trigger.c | 5116 |
| AfterTriggerEndQuery | src/backend/commands/trigger.c | 5136 |
| AfterTriggerFireDeferred | src/backend/commands/trigger.c | 5287 |
| AfterTriggerEndXact | src/backend/commands/trigger.c | 5343 |
| GetAfterTriggersTransitionTable | src/backend/commands/trigger.c | 5536 |
| AfterTriggerSaveEvent | src/backend/commands/trigger.c | 6169 |
| before_stmt_triggers_fired | src/backend/commands/trigger.c | 6584 |
Source verification (as of 2026-06-05)
Checked against /data/hgryoo/references/postgres at REL_18_STABLE,
commit 273fe94. Confirmed facts:
- No PG18-incompatible symbols asserted. The doc references no removed rmgr (XLOG2) or B_DATACHECKSUMSWORKER_* worker states. All trigger symbols cited exist in the REL_18 tree.
- tgtype bit layout (TRIGGER_TYPE_ROW=1<<0 … TRIGGER_TYPE_INSTEAD=1<<6, with STATEMENT/AFTER being the zero-bit cases) verified in src/include/catalog/pg_trigger.h lines 93–98; TRIGGER_TYPE_MATCHES at line 141.
- Name-order firing verified by the TriggerRelidNameIndexId scan and the explanatory comment in RelationBuildTriggers (src/backend/commands/trigger.c).
- The 15+2 firing-point functions (ExecBR/AR/IR × INSERT/UPDATE/DELETE, ExecBS/AS × INSERT/UPDATE/DELETE/TRUNCATE; no row-level TRUNCATE) all present with the signatures quoted; ExecBRInsertTriggers returns a tuple and ExecARInsertTriggers returns void after calling AfterTriggerSaveEvent, as quoted.
- Event record sizing. AfterTriggerEventData carries ate_flags, ate_ctid1, ate_ctid2, ate_src_part, ate_dst_part; the four size variants and SizeofTriggerEvent are as quoted; the offset link via the low 27 bits (AFTER_TRIGGER_OFFSET = 0x07FFFFFF) and GetTriggerSharedData verified.
- Chunk growth 1 KB → 1 MB (MIN_CHUNK_SIZE 1024, MAX_CHUNK_SIZE 1024*1024) with the doubling/halving heuristic in afterTriggerAddEvent verified.
- SnapshotAny re-fetch at fire time via table_tuple_fetch_row_version in AfterTriggerExecute verified; FDW path uses AFTER_TRIGGER_FDW_FETCH/_REUSE and a tuplestore, as stated.
- Lifecycle ordering. AfterTriggerEndQuery fires immediate events and migrates deferred ones; AfterTriggerFireDeferred (pre-commit) drains the global list under a pushed transaction snapshot. Verified against the quoted bodies and their header comments.
- Transition tables never deferrable. Asserted in the MakeTransitionCaptureState and AfterTriggersData comments; tuplestores allocated in CurTransactionContext and freed in AfterTriggerFreeQuery.
- Caveat on line numbers. The position-hint table lists line numbers as observed at 273fe94; AfterTriggerExecute (4328) and CreateTriggerFiringOn (178) are function-definition lines. Symbols are the durable anchor; line numbers decay on any reformat.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
PostgreSQL’s trigger architecture is one well-tested point in a space the SQL standard only partly fixes. Comparing it to other engines and to the research literature sharpens why its choices land where they do.
Firing order and the standard. The SQL standard calls for multiple
triggers on the same event to fire in time-of-creation order, yet
engines diverge. PostgreSQL fires in trigger-name alphabetical order
(a consequence of the TriggerRelidNameIndexId scan in
RelationBuildTriggers) — arbitrary but stable and inspectable, which is
the property Database System Concepts (§5.3.2) implicitly calls for when
it warns that “unintended order of firing” is a hazard. Oracle historically
left same-timing order undefined until 11g added the FOLLOWS/PRECEDES
clause for explicit ordering; SQL Server fires AFTER triggers in an
order that is undefined except for a configurable first/last via
sp_settriggerorder. PostgreSQL’s “name order” is the most predictable of
the three but the least expressive — there is no per-trigger ordering
clause, so users encode order in names (01_audit, 02_fk). This is a
deliberate simplicity trade.
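The naming convention can be checked with a small sketch that sorts trigger names the way a byte-wise name index would return them. It assumes plain strcmp ordering as a stand-in for the index's comparison; the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator over an array of C strings, byte-wise order. */
static int sk_cmp_name(const void *a, const void *b)
{
    const char *const *x = a;
    const char *const *y = b;
    return strcmp(*x, *y);
}

/* Reorder trigger names into the order they would fire under
 * name-order firing (assuming byte-wise comparison). */
static void sk_firing_order(const char **names, size_t n)
{
    qsort(names, n, sizeof *names, sk_cmp_name);
}
```

Sorting `02_fk`, `10_notify`, `01_audit`, `2_late` yields `01_audit`, `02_fk`, `10_notify`, `2_late`. The unpadded `2_late` lands after `10_notify`, which is why numeric prefixes used for ordering should be zero-padded to a fixed width.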
BEFORE-row tuple rewrite vs. the standard’s SET model. PostgreSQL lets
a BEFORE FOR EACH ROW trigger return a modified NEW tuple, and the
returned tuple becomes what is stored (ExecBRInsertTriggers). The SQL
standard instead models row modification through assignment to NEW.col in
the trigger body. The two are observationally similar, but PostgreSQL’s
“return a tuple” convention is what makes BEFORE triggers compose as a
pipeline (each trigger’s output feeds the next) and is the same convention
that lets a BEFORE trigger veto a row by returning NULL — a capability
the assignment model expresses only awkwardly.
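The pipeline behaviour described above can be sketched as a loop over trigger functions in which each output becomes the next input and a NULL result cancels the row. The row type, the three example triggers, and the runner are hypothetical; this models the convention, not the executor's actual slot handling.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int id; int amount; } SkRow;

/* A BEFORE ROW trigger: returns the row to store, or NULL to veto. */
typedef SkRow *(*SkBeforeRowFn)(SkRow *proposed);

/* Run triggers in order; each sees the previous trigger's output.
 * A NULL return stops the chain and suppresses the operation. */
static SkRow *sk_run_before_row(SkBeforeRowFn *fns, size_t n, SkRow *proposed)
{
    SkRow *cur = proposed;
    for (size_t i = 0; i < n; i++) {
        cur = fns[i](cur);
        if (cur == NULL)
            return NULL;
    }
    return cur;
}

/* Example triggers, for illustration only. */
static SkRow *sk_clamp_amount(SkRow *r)
{
    if (r->amount < 0)
        r->amount = 0;
    return r;
}

static SkRow *sk_reject_id_zero(SkRow *r)
{
    return r->id == 0 ? NULL : r;
}

static SkRow *sk_double_amount(SkRow *r)
{
    r->amount *= 2;
    return r;
}
```

With the chain clamp, reject, double, a row `{1, 21}` is stored as `{1, 42}`, while a row with `id == 0` is vetoed by the second trigger and the third never runs.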
Constraints as triggers — the deep design bet. PostgreSQL implements
foreign keys and deferrable uniqueness as internal AFTER triggers riding
the same queue (RI_FKey_* trigger functions, F_UNIQUE_KEY_RECHECK). This
unifies two facilities the textbook treats separately (§5.3.3 contrasts
declarative constraints against triggers), and it means the after-trigger
queue’s correctness is referential-integrity correctness. The cost is
visible in AfterTriggerSaveEvent, which is shot through with RI-specific
skip logic (RI_FKey_trigger_type, the cross-partition-update special
cases). An engine that kept constraints in a separate enforcement path would
have a leaner trigger queue but two integrity mechanisms to keep coherent.
The research lineage here is assertion / integrity-constraint maintenance
(Ceri & Widom’s work on deriving production rules for constraint
maintenance, early 1990s), which framed triggers as the operational form of
declarative rules — exactly PostgreSQL’s stance.
Active databases and the ECA heritage. The trigger is the surviving
commercial fragment of the active database research program (HiPAC, Ariel,
Starburst, late 1980s–early 1990s), which studied event-condition-action
rules as a general reactive mechanism: composite events, coupling modes
(immediate / deferred / detached), and rule-execution semantics
(termination, confluence). PostgreSQL implements a pragmatic subset —
immediate and deferred coupling map to its immediate-mode and deferrable
events; composite events and detached (separate-transaction) coupling are
absent. The termination problem the active-DB literature studied formally
shows up here as a runtime guard: MyTriggerDepth plus
max_stack_depth/statement timeout, rather than static confluence analysis.
Set-oriented vs. row-oriented reactions. Transition tables (and their
tuplestore capture in MakeTransitionCaptureState) are PostgreSQL’s answer
to a long-standing critique of row-level triggers: firing a function per row
is O(rows) procedure-call overhead, whereas a statement-level trigger over a
transition table can run one set-based SQL statement over the whole delta.
This mirrors the delta-relation approach in incremental view maintenance
(the DRed and counting algorithms; Gupta & Mumick), where a change is
represented as insert/delete sets rather than per-tuple events. A frontier
question — relevant to PostgreSQL’s incremental-matview efforts — is whether
the transition-table capture path could feed an IVM engine directly rather
than a user trigger.
Push-down and streaming frontiers. Modern systems push reactive logic
out of the trigger queue: change-data-capture and logical replication
(PostgreSQL’s own logical decoding, covered in
postgres-logical-decoding.md) reconstruct row deltas from WAL after
commit, decoupling reaction from the writing transaction entirely: the
“detached coupling mode” the active-DB literature anticipated. For
high-fan-out audit/replication workloads this is usually cheaper than an
AFTER trigger per row, because the write path pays only the extra WAL that
wal_level = logical requires, not a function call per row. The
trigger queue remains the right tool only when the reaction must be
synchronous with and transactional with the change — which is precisely
the FK-enforcement case the queue was built around.
Sources
- Source tree. /data/hgryoo/references/postgres at REL_18_STABLE, commit 273fe94 (PG 18.x). Primary file: src/backend/commands/trigger.c. Headers: src/include/commands/trigger.h, src/include/utils/reltrigger.h, src/include/catalog/pg_trigger.h. Callers (cross-referenced, not re-analyzed here): src/backend/executor/nodeModifyTable.c, src/backend/utils/cache/relcache.c (RelationBuildTriggers hook), src/backend/access/transam/xact.c (lifecycle calls).
- Textbook anchor. Silberschatz, Korth & Sudarshan, Database System Concepts, 7th ed., ch. 5 “Advanced SQL”, §5.3 “Triggers” (ECA model, granularity, timing, transition tables, the constraints-vs-triggers guidance and the cascading/ordering hazards). Captured under knowledge/research/dbms-general/.
- Comparative / historical. Active-database ECA lineage (HiPAC, Ariel, Starburst); Ceri & Widom on production rules for integrity-constraint maintenance; Gupta & Mumick on incremental view maintenance and delta relations. Cited for context, not consumed as PG source.
- Cross-references within this KB. knowledge/code-analysis/postgres/postgres-executor.md (the demand-pull node tree and ExecutorFinish that drives AfterTriggerEndQuery), postgres-ddl-execution.md (the CREATE TRIGGER utility path), postgres-fmgr.md (the function-call convention behind ExecCallTriggerFunc), postgres-mvcc-snapshots.md (why fire-time re-fetch uses SnapshotAny), postgres-logical-decoding.md (the post-commit, detached alternative to AFTER triggers).