CUBRID Partitioning — Range/Hash/List Strategies, Partition Pruning, and Per-Partition Execution
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
Theoretical Background
Section titled “Theoretical Background”A relational table is, semantically, a single set of tuples. When the set grows past the point at which a single heap and a single index fit comfortably in working memory or on a single device, physical horizontal partitioning splits the set into disjoint subsets — each a “partition” — keyed by a deterministic function of the row. The logical table is then implemented as the union of its partitions.
Database System Concepts (Silberschatz, Korth, Sudarshan, 7th ed.,
Ch. 22 “Parallel and Distributed Storage”) gives the canonical
treatment: a partitioning function P(tuple) → {0, 1, …, N-1} is
chosen by the DBA (or generated by the system) so that each tuple
maps to exactly one partition. The textbook then enumerates three
canonical partitioning strategies:
- Range partitioning.
P(t) = iwheret.k ∈ [low_i, high_i). The partition key column has a total order; partitions cover non-overlapping intervals of that order. Good for time-series data where most queries hit a recent slice. - Hash partitioning.
P(t) = h(t.k) mod N. The partition key is hashed; the modulus picks the partition. Good when queries want even distribution across partitions and there is no natural ordering to exploit (or when the natural ordering is skewed). - List partitioning. Each partition declares an explicit set
of key values it owns;
P(t) = iift.k ∈ values_i. Good for categorical data — e.g., country codes or tenant IDs.
Petrov’s Database Internals (Ch. 11 “Distributed Systems”) emphasizes a fourth design dimension: partition awareness in the optimizer and executor. A partitioned table only pays for itself when the engine can:
- Prune at plan time. Given a predicate
WHERE k = 42over a range-partitioned table, the optimizer should compute, before generating any scan, that only the partition containing 42 needs to be scanned. The other N-1 partitions are eliminated. This is partition pruning. - Dispatch at execute time. Even after pruning, the surviving
partitions are independent heaps with independent indexes. The
executor must produce a single tuple stream by iterating through
each surviving partition’s
SCAN_IDin turn (or in parallel). - Resolve per-row at DML time. For
INSERT INTO t VALUES (...)the engine must computeP(row)to know which heap to write to. ForUPDATEit must do the same on the post-image and, if the partition assignment changed, move the row from the old heap to the new one.
The textbook calls out two further refinements that more aggressive engines support:
- Partition-wise joins. If
T1andT2are both partitioned on the same key with the same partitioning function, then the joinT1 ⨝ T2 ON T1.k = T2.kcan be decomposed into N parallel joins ofT1.partition_i ⨝ T2.partition_i. No row inT1.partition_ican match a row inT2.partition_jfori ≠ j, so the cross-partition matches are eliminated. - Local vs global indexes. A local index is partitioned along with the table — each partition owns its own B-tree. A global index spans all partitions; its leaf entries carry the partition identifier in addition to the row pointer. Local indexes are cheaper to maintain (DDL on a partition only touches that partition’s index) but cannot enforce uniqueness across the whole table without help. Global indexes are more expensive to maintain but enforce table-wide uniqueness directly.
Partition pruning itself reduces to a simple algebraic question:
given a predicate Φ over the partition key column and a partition
descriptor P_i (a range, a hash modulus position, or a value list),
is there any value v such that Φ(v) ∧ v ∈ P_i is satisfiable?
If no, the partition is pruned. The pruning algorithm walks the
predicate tree and, for each leaf comparison, asks each partition
descriptor whether it can supply a witness. AND nodes intersect the
results, OR nodes union them. NOT and unfamiliar functions
conservatively keep all partitions.
The remainder of this document follows CUBRID’s realization of these concepts: how it represents partition descriptors in catalog memory, how the optimizer hooks pruning into access-spec preparation, how the executor dispatches per-partition scans, and how the locator routes DML to the correct partition heap.
Common DBMS Design
Section titled “Common DBMS Design”Every relational engine that supports partitioning makes the same set of decisions in slightly different shapes; understanding the variants clarifies CUBRID’s choices.
PostgreSQL — declarative partitioning, native pruning, partition-wise joins
Section titled “PostgreSQL — declarative partitioning, native pruning, partition-wise joins”PostgreSQL added declarative partitioning in 10 (range/list, 2017), hash in 11 (2018), partition-wise joins in 12, and tightened pruning in subsequent releases. The model:
- Catalog. Each partitioned table is a parent
pg_classrow; each partition is a childpg_classrow connected bypg_inherits. The parent has no heap of its own (it is a “logical parent”). The partition descriptor lives inpg_partitioned_tablepg_class.relpartbound.
- Pruning. Two layers. Plan-time pruning
(
partprune.c::prune_append_rel_partitions) eliminates partitions when the predicate is fully known at plan time. Execution-time pruning re-runs the pruning logic each time the executor opens theAppendnode, so parameter values bound at execute time (e.g., from a generic plan or from a parameterized nested-loop) can still drop partitions. - Plan shape. The query plan replaces the parent table with an
Append(orMergeAppend) of per-partition scans. The optimizer treats each partition as an independent relation for cost purposes; statistics are maintained per partition. - Partition-wise join. Optional (
enable_partitionwise_join). When both inputs share the same partition shape, the planner decomposes the join into per-partition joins under a singleAppend. - Indexes. Local. The parent table has no heap, so a “global” index would be storage with no parent heap to index. Each partition owns its own indexes; uniqueness enforced across all partitions requires the partition key to be part of the unique constraint.
MySQL — partitioning syntax, type-pruning at plan time
Section titled “MySQL — partitioning syntax, type-pruning at plan time”MySQL has supported partitioning since 5.1. The model:
- Catalog. A single
.frm(nowmysql.tablesrow) describes the partitioned table including the per-partition definition. Partitions are not separate tables; the storage engine (ha_partitionhistorically, native InnoDB partitioning since 5.7) is responsible for fan-out. - Pruning.
sql/sql_partition.cwalks the WHERE clause and fills a bitmap of “used partitions”. The pruning engine understands range, hash, key, and list partitioning, and composite partitioning (subpartition by hash within range). - Plan shape. The optimizer hands a partition bitmap to the
storage engine; the storage engine iterates over the surviving
partitions internally. There is no
Appendnode visible at the optimizer level. - Indexes. Local only. A unique index that does not include the partition key is rejected at DDL time.
Oracle — range/hash/list/composite plus interval partitioning
Section titled “Oracle — range/hash/list/composite plus interval partitioning”Oracle has the most mature partitioning feature set in the commercial space:
- Strategies. Range, hash, list, plus composite partitioning (range-hash, range-list, list-hash, list-range, list-list) and interval partitioning where new partitions are auto-created as data arrives.
- Pruning. Static (plan-time) and dynamic (execute-time, for bind variables and join keys). Documented in the Oracle Database VLDB and Partitioning Guide.
- Indexes. Both local and global indexes. Global indexes
contain
(key, ROWID, partition_no)so they can resolve rows across partitions; index-organized tables get special treatment. - Partition-wise join. Supported, both full (both sides identically partitioned) and partial (one side partitioned, serial broadcast or hash redistribution of the other).
Where CUBRID sits
Section titled “Where CUBRID sits”CUBRID’s partitioning is closest in spirit to MySQL’s: range, hash,
and list strategies; pruning at plan time only; local indexes;
no partition-wise joins; no composite partitioning. But CUBRID’s
implementation is closer to PostgreSQL’s: each partition is a
real CUBRID class (so it has its own OID, HFID, attributes
inherited from the parent, and its own catalog entries), and
partition pruning is a server-side helper invoked by the executor
when it opens an access spec rather than a parser-level rewrite.
The CUBRID-specific twist is that the master class — the
“partitioned class” — is not just a logical parent like Postgres’s;
it is a real class object that holds the partition rule in its
SM_PARTITION member, and partition discovery walks the
class-superclass chain (heap_get_class_supers) because each
partition’s class has the master class as its single
superclass. This reuses the OODB inheritance machinery described
in cubrid-class-object.md rather than introducing a new
catalog table for partitions. The cost is that a partition of a
partitioned table looks, to the schema layer, like a subclass —
and CUBRID consequently forbids ordinary subclassing on
partitioned tables (ER_SM_NO_PARTITION_ON_HIERARCHIES).
CUBRID’s Approach
Section titled “CUBRID’s Approach”CUBRID partitions a table by creating a master class (the
partitioned table the user references in SQL) and N child
classes, one per partition. The master class holds the partition
rule (range bounds / hash count / list values) in its
SM_PARTITION member; each child class also has a SM_PARTITION
that says “I am partition i of class M with values V_i”. The DDL
path in execute_schema.c materializes both. The runtime path in
partition.c reads them, builds a PRUNING_CONTEXT, and uses it
to (a) eliminate partitions before scan generation, (b) compute
the destination partition for each inserted row, (c) compute the
partition that owns a given index key.
The whole subsystem is server-side: partition.c is wrapped in
#if !defined (SERVER_MODE) && !defined (SA_MODE) guards (see
partition_sr.h:26-28) and is called from query_executor.c,
locator_sr.c, btree.c, and btree_load.c. The client-side DDL
in execute_schema.c runs in client mode and writes the
partition catalog through ordinary schema-template machinery.
The catalog model: SM_PARTITION on the master class
Section titled “The catalog model: SM_PARTITION on the master class”The persistent partition descriptor is SM_PARTITION, defined in
class_object.h:
// sm_partition — class_object.hstruct sm_partition{ struct sm_partition *next; /* currently not used, always NULL */ const char *pname; /* partition name */ int partition_type; /* partition type (range, list, hash) */ DB_CLASS_PARTITION_TYPE class_partition_type; /* partitioned class | partition class */ DB_SEQ *values; /* values for range and list partition types */ const char *expr; /* partition expression */ const char *comment;};SM_CLASS carries it as SM_PARTITION *partition
(class_object.h:798). For the master class, partition is a
single descriptor with class_partition_type = DB_PARTITIONED_CLASS, the strategy in partition_type (one of
the three values from the DB_PARTITION_TYPE enum), the user’s
partition expression in expr, and the partition predicate
sequence in values. The partition predicate is a serialized
FUNC_PRED (a regu-variable expression) that, when evaluated
against a row, yields the value used to assign the row to a
partition. For a child class partition->class_partition_type
is DB_PARTITION_CLASS and values carries the per-partition
data — for range it is [min, max), for list it is the explicit
value set, for hash it is unused.
The DB_PARTITION_TYPE and DB_CLASS_PARTITION_TYPE enums are
in storage_common.h:
// DB_PARTITION_TYPE — storage_common.htypedef enum{ DB_PARTITION_HASH = 0, DB_PARTITION_RANGE, DB_PARTITION_LIST} DB_PARTITION_TYPE;
typedef enum{ DB_NOT_PARTITIONED_CLASS = 0, DB_PARTITIONED_CLASS = 1, DB_PARTITION_CLASS = 2} DB_CLASS_PARTITION_TYPE;On disk the same information is split across two structures. The
per-class summary on the partitioned class lives in
OR_CLASSREP::has_partition_info (a single bit), and the actual
per-partition tuple array materialized at server startup time is
OR_PARTITION from object_representation_sr.h:
// or_partition — object_representation_sr.htypedef struct or_partition OR_PARTITION;struct or_partition{ OID class_oid; /* class OID */ HFID class_hfid; /* class HFID */ int partition_type; /* partition type (range, list, hash) */ REPR_ID rep_id; /* class representation id */ DB_SEQ *values; /* values for range and list partition types */};This is the runtime form the server hands to partition.c.
Conceptually OR_PARTITION[0] is the master (the partitioned
class itself) and OR_PARTITION[1..N] are the partitions. The
master entry carries the serialized partition predicate in its
values (see partition_load_partition_predicate); the
partition entries carry the partition’s data (range bounds or
list values) in values. The bridge between the SM form (in the
client workspace) and the OR form (on the server) is the
heap_get_class_partitions helper called from
partition_load_pruning_context.
classDiagram
class SM_CLASS {
SM_PARTITION* partition
DB_OBJLIST* users
}
class SM_PARTITION {
const char* pname
int partition_type
DB_CLASS_PARTITION_TYPE class_partition_type
DB_SEQ* values
const char* expr
}
class OR_PARTITION {
OID class_oid
HFID class_hfid
int partition_type
REPR_ID rep_id
DB_SEQ* values
}
class PRUNING_CONTEXT {
OID root_oid
OR_PARTITION* partitions
int count
DB_PARTITION_TYPE partition_type
ATTR_ID attr_id
func_pred* partition_pred
}
SM_CLASS *-- SM_PARTITION
SM_PARTITION ..> OR_PARTITION : serialized to disk and reloaded
OR_PARTITION ..> PRUNING_CONTEXT : array, [0]=master, [1..N]=partitions
Two design choices stand out. First, partitions are real CUBRID
classes — each child class has its own OID, its own HFID,
its own attribute representations, its own indexes — and they are
linked to the master class by single inheritance (SM_CLASS::users
on the master, SM_CLASS::inheritance on the partition). Pruning
discovery exploits this: partition_find_root_class_oid calls
heap_get_class_supers and treats the single super as the
partitioned class root. Second, the partition predicate is
stored as a serialized regu-variable expression rather than as a
column-id plus operator. This is more flexible (you can
PARTITION BY HASH (YEAR(birthdate)) and the predicate is a
function call) but it means pruning has to deserialize and
evaluate it through stx_map_stream_to_func_pred and
fetch_peek_dbval rather than do a fast attribute compare.
The DDL path: do_create_partition and friends
Section titled “The DDL path: do_create_partition and friends”When the user writes CREATE TABLE t (...) PARTITION BY HASH (id) PARTITIONS 8, the parser builds a PT_CREATE_ENTITY whose
partition_info describes the strategy and (for HASH) the
partition count. The schema executor calls
do_create_partition (in execute_schema.c:4036+) which:
- Fetches the master class (
au_fetch_class) so it has the attribute list and OID handy. - Rejects the partitioning DDL if the master class is part of an
inheritance hierarchy:
if (smclass->users != NULL || smclass->inheritance != NULL) { … ER_SM_NO_PARTITION_ON_HIERARCHIES }. This is the constraint that prevents a partitioned table from being subclassed and a subclass from being partitioned. - For HASH: synthesizes the partition names
<class>__p__p<i>for i in[0, hashsize), and for each one callsdbt_create_classanddo_create_localto materialize the partition class.PARTITIONED_SUB_CLASS_TAGis the delimiter"__p__"that lets later code grep partition names. - For RANGE/LIST: walks the explicit
partslist from the parse tree, validates that range bounds are monotonic and disjoint (or that list values across partitions are pairwise disjoint), and creates one class per partition entry. - Each created partition class is linked to the master via single
inheritance (
partition_parent_atts = smclass->attributes), gets its ownSM_PARTITIONrecord withclass_partition_type = DB_PARTITION_CLASS, and finally is committed through the schema-template flow.
Companion DDL helpers:
do_alter_partitioning_pre/do_alter_partitioning_post— pair of pre/post-pruning hooks that run before and after the actualALTER TABLEbody, e.g., forADD PARTITION,DROP PARTITION,COALESCE PARTITION,REORGANIZE PARTITION.do_remove_partition_pre/_post— drops a partition and detaches it from the master.do_coalesce_partition_pre/_post— collapses N HASH partitions to N-K (rehashes existing data).do_reorganize_partition_pre/_post— restructures range or list partitions, redistributing data.do_create_partition_constraintsanddo_create_partition_constraint— replicate indexes and constraints from the master to each partition (since CUBRID uses local indexes; see below).
The DDL writes the partition info into the master class’s
SM_PARTITION member and into each child class’s own member;
both ride through the standard schema-template flush, so the
catalog records (_db_class, _db_partition if present, and the
representation chain in _db_class_repr) are updated atomically.
There is no special-purpose partition catalog table — the data is
all on _db_class.partition (a DB_SEQ literal embedded in the
class record) plus the inheritance edges.
The runtime model: PRUNING_CONTEXT
Section titled “The runtime model: PRUNING_CONTEXT”When an executor wants to scan, insert, or update a partitioned
table, it builds a server-side PRUNING_CONTEXT. The struct lives
in partition_sr.h:
// pruning_context — partition_sr.hstruct pruning_context{ THREAD_ENTRY *thread_p; OID root_oid; /* OID of the root class */ REPR_ID root_repr_id; access_spec_node *spec; /* class to prune */ val_descr *vd; DB_PARTITION_TYPE partition_type; /* hash, range, list */
OR_PARTITION *partitions; /* partitions array, [0] is master */ OR_PARTITION *selected_partition; SCANCACHE_LIST *scan_cache_list; int count; /* number of entries (master+children) */
xasl_unpack_info *fp_cache_context; func_pred *partition_pred; /* deserialized partition expression */ int attr_position; /* attribute position in index key */ ATTR_ID attr_id; /* attribute id of partition key */ HEAP_CACHE_ATTRINFO attr_info; int error_code;
int pruning_type; /* DB_PARTITIONED_CLASS | DB_PARTITION_CLASS */ bool is_attr_info_inited; bool is_from_cache;};partitions[0] is the master, partitions[1..count-1] are the
partitions, partition_pred is the deserialized regu-variable
that computes the partition key, attr_id is the column id of
the partition key (extracted from the partition predicate by
walking the regu tree), and attr_info is the heap attribute
cache used to evaluate the predicate against a RECDES.
The lifecycle is clear:
partition_init_pruning_contextzeroes the struct.partition_load_pruning_contextpopulates it. First it tries to load from a per-server hash cache (db_Partition_Ht, keyed by root class OID, withPARTITION_CACHE_SIZE = 200). On miss it callsheap_get_class_partitionsto read theOR_PARTITIONarray from disk, thenpartition_load_partition_predicateto deserialize the regu tree frompartitions[0].values[2](a serializedFUNC_PREDstream), then it caches the result.- The caller does pruning (per spec, per insert, per update, per index lookup).
partition_clear_pruning_contextreleases scan caches and the deserialized predicate.
The cache is essential for INSERT INTO t SELECT ... FROM t2
style bulk DML: without it every inserted row would re-read the
catalog. The cache hash table is created by
partition_cache_init at server boot and walked on schema
invalidation by partition_decache_class.
Pruning at optimize time: partition_prune_spec
Section titled “Pruning at optimize time: partition_prune_spec”The principal optimizer-side entry point is
partition_prune_spec in partition.c:3313. It is called from
query_executor.c:8656 (regular scan preparation) and from
query_executor.c:13361 (parallel-query class lookup), and from
the parallel px_heap_scan task. The contract: given an
access_spec_node whose pruning_type == DB_PARTITIONED_CLASS,
walk spec->where_pred, spec->where_key, and
spec->where_range against the partition descriptor; produce a
PARTITION_SPEC_TYPE linked list in spec->parts listing the
surviving partitions.
// partition_prune_spec — partition.cintpartition_prune_spec (THREAD_ENTRY * thread_p, val_descr * vd, access_spec_node * spec){ ... if (spec->pruning_type != DB_PARTITIONED_CLASS) { spec->pruned = true; return NO_ERROR; } ... partition_load_pruning_context (thread_p, &ACCESS_SPEC_CLS_OID (spec), spec->pruning_type, &pinfo); ... pruningset_init (&pruned, PARTITIONS_COUNT (&pinfo)); pruningset_set_all (&pruned);
if (pinfo.spec->where_pred != NULL) { pruningset_init (&pruned_pred, PARTITIONS_COUNT (&pinfo)); status = partition_match_pred_expr (&pinfo, pinfo.spec->where_pred, &pruned_pred); if (status != MATCH_NOT_FOUND) pruningset_intersect (&pruned, &pruned_pred); } if (pinfo.spec->where_key != NULL) { /* same idea, intersect */ } if (pinfo.spec->where_range != NULL) { /* same idea, intersect */ }
error = pruningset_to_spec_list (&pinfo, &pruned); ... spec->pruned = true;}The bitset pruned starts as “all partitions kept”. Each of the
three predicate slots — where_pred (regular WHERE),
where_key (key-range filter for index scans),
where_range (MVCC-reevaluation predicate for UPDATE/DELETE) —
contributes a bitset; the result is the intersection. If a slot
returns MATCH_NOT_FOUND (predicate doesn’t reference the
partition key), it is not intersected — its bit pattern is
treated as “no information”, which is the safe default.
pruningset_to_spec_list then walks the bitset and produces a
PARTITION_SPEC_TYPE * chain in spec->parts, copying
(class_oid, class_hfid, btid) for each surviving partition.
For index scans, it also calls heap_get_index_with_name to find
the local index with the same name on each partition.
The bitset itself (PRUNING_BITSET) is hand-rolled at
partition.c:70-83, sized for MAX_ELEMENTS = 1024 partitions
(matching the MAX_PARTITIONS = 1024 macro in partition.h).
The operations are unsurprising — pruningset_init, _set_all,
_add, _remove, _intersect, _union, _is_set,
_popcount, plus an iterator. The encoding is one bit per
partition; partition i corresponds to OR_PARTITION[i+1]
because index 0 is reserved for the master.
partition_match_pred_expr walks the predicate tree:
T_PREDwithB_AND: recurse on both children, intersect the resulting bitsets. If one side isMATCH_NOT_FOUND, take the other side’s bitset alone. (BecauseA AND ?is at mostA— anythingAexcludes is excluded; whatever?would exclude is a free additional reduction.)T_PREDwithB_OR: recurse on both children, union the resulting bitsets. If either side isMATCH_NOT_FOUND, returnMATCH_NOT_FOUND(becauseA OR ?is at leastAand?could be anything). This is the conservative-OR rule.T_EVAL_TERM/T_COMP_EVAL_TERM: a comparisonlhs op rhs. Iflhsmatches the partition expression, prune byrhs; ifrhsmatches, prune bylhs. If neither matches but the partition expression is a function (e.g.,YEAR(date)) and one side of the predicate contains the partition expression as a sub-tree, fall through topartition_prune_for_functionto try a tighter analysis.T_EVAL_TERM/T_ALSM_EVAL_TERM: handlesIN (...)/NOT IN (...)and= ANY (...)/<> ANY (...). The op is rewritten toPO_IN/PO_NOT_INbased on theeq_flag.T_LIKE_EVAL_TERM,T_RLIKE_EVAL_TERM,T_NOT_TERM: returnsMATCH_NOT_FOUND. CUBRID does not currently extract a range from a LIKE pattern even when one is feasible (e.g.,name LIKE 'A%'over a list-partitioned table).
The leaf operation is partition_prune_db_val, which dispatches
on partition_type:
// partition_prune_db_val — partition.cstatic MATCH_STATUSpartition_prune_db_val (PRUNING_CONTEXT * pinfo, const DB_VALUE * val, const PRUNING_OP op, PRUNING_BITSET * pruned){ switch (pinfo->partition_type) { case DB_PARTITION_HASH: return partition_prune_hash (pinfo, val, op, pruned); case DB_PARTITION_RANGE: return partition_prune_range (pinfo, val, op, pruned); case DB_PARTITION_LIST: return partition_prune_list (pinfo, val, op, pruned); default: return MATCH_NOT_FOUND; }}Hash pruning
Section titled “Hash pruning”partition_prune_hash is the simplest:
// partition_prune_hash — partition.c (excerpted)case PO_EQ: if (TP_DOMAIN_TYPE (col_domain) != DB_VALUE_TYPE (val_p)) { /* coerce val_p into the partition column's domain */ if (tp_value_cast (val_p, &val, col_domain, false) == DOMAIN_INCOMPATIBLE) { /* impossible; treat as no-match */ er_clear (); status = MATCH_OK; } } else pr_clone_value (val_p, &val);
idx = mht_get_hash_number (hash_size, &val); pruningset_add (pruned, idx); status = MATCH_OK; break;
case PO_IN: /* iterate through values in the IN-list, hash each, set its bit */case PO_IS_NULL: pruningset_add (pruned, 0); /* first partition holds NULLs */ break;default: status = MATCH_NOT_FOUND;The key call is mht_get_hash_number (hash_size, &val) — the
generic CUBRID hash function (the same one used for hash-tables
in mht.c). For =, exactly one bit is set; for IN, one bit
per IN-list element; for < > <= >= and other ops the result is
“don’t know” so the hash type returns MATCH_NOT_FOUND and all
partitions stay alive. NULLs are pinned to partition 0 by
convention.
Range pruning
Section titled “Range pruning”partition_prune_range is the most involved because it must
reason about interval overlap. For each partition i, it reads
min and max from partitions[i+1].values[0..1] (where
min == NULL means MINVALUE — first partition — and max == NULL means MAXVALUE — last partition). It then compares
val against both:
// partition_prune_range — partition.c (excerpted)rmin = (min is null ? DB_LT : tp_value_compare (&min, val, 1, 1));rmax = (max is null ? DB_LT : tp_value_compare (val, &max, 1, 1));
switch (op) { case PO_EQ: /* min <= val < max */ if ((rmin == DB_EQ || rmin == DB_LT) && rmax == DB_LT) pruningset_add (pruned, i), goto cleanup; /* exactly one fits */ break; case PO_LT: /* val < max for any partition where min < val */ if (rmin == DB_LT) pruningset_add (pruned, i); break; case PO_LE: ... case PO_GT: ... case PO_GE: ... case PO_IS_NULL: if (DB_IS_NULL (&min)) pruningset_add (pruned, i), goto cleanup; break; }The range interval is [min, max) — left-closed, right-open. For
PO_GT (strict >), the code calls partition_decrement_value
on max first to convert the half-open interval to a fully
closed comparison. The trick is that > on a value that equals
max should not match the partition [min, max), but the
bound comparison wants val < max; by decrementing max to
max - 1 (only valid for integer-like and date-like types) the
comparison stays correct. For types where decrement fails the
function returns false and the code falls through to the
default rule. For collection-typed values
(val IN (1, 2, 3) etc.), the function recursively prunes by
each element and unions the per-element bitsets (or
intersects for = since equality with a collection is
strictly impossible — but the pre-existing code path handles it
defensively).
List pruning
Section titled “List pruning”partition_prune_list walks the per-partition value set and asks
membership questions:
// partition_prune_list — partition.c (excerpted)case PO_EQ: if (db_set_ismember (part_collection, (DB_VALUE *) val)) { pruningset_add (pruned, i); return MATCH_OK; /* values are disjoint, no need to look further */ } break;case PO_IN: /* perform intersection of part_collection and val_collection; if non-empty, this partition qualifies */ for (j = 0; j < size; j++) { db_set_get (part_collection, j, &col); if (db_set_ismember (val_collection, &col)) { pruningset_add (pruned, i); break; } }case PO_IS_NULL: if (db_set_has_null (part_collection)) pruningset_add (pruned, i), return MATCH_OK;case PO_NE: case PO_LT: case PO_LE: case PO_GT: case PO_GE: return MATCH_NOT_FOUND; /* list partitions are unordered */The disjointness optimization in PO_EQ (“no need to look any
further”) is key: list partitions, by definition, have pairwise
disjoint value sets (the DDL enforces this), so the first match
is also the only match. Comparison operators other than = and
IN always return MATCH_NOT_FOUND because list partitioning
imposes no order on the value set.
flowchart TD
Start[partition_prune_spec called] --> Init[partition_load_pruning_context<br/>cache hit ? load from cache : read OR_PARTITION array]
Init --> Bitset[pruned = all_ones bitset]
Bitset --> WhereP{where_pred?}
WhereP -->|yes| MatchP[partition_match_pred_expr<br/>walks T_PRED tree]
MatchP --> Intersect1[pruned = pruned ∩ pruned_pred]
WhereP -->|no| WhereK
Intersect1 --> WhereK{where_key?}
WhereK -->|yes| MatchK[partition_match_pred_expr<br/>on key filter]
MatchK --> Intersect2[pruned = pruned ∩ pruned_key]
WhereK -->|no| WhereR
Intersect2 --> WhereR{where_range?}
WhereR -->|yes| MatchR[partition_match_pred_expr<br/>on MVCC reeval predicate]
MatchR --> Intersect3[pruned = pruned ∩ pruned_range]
WhereR -->|no| ToList
Intersect3 --> ToList[pruningset_to_spec_list<br/>materialize PARTITION_SPEC_TYPE chain]
ToList --> SpecParts[access_spec.parts = chain]
SpecParts --> Done[spec->pruned = true]
Per-row partition resolution at INSERT/UPDATE
Section titled “Per-row partition resolution at INSERT/UPDATE”For DML, pruning must produce a single partition: the row’s
target heap. This is partition_find_partition_for_record in
partition.c:3465. It is wrapped by partition_prune_insert
and partition_prune_update to provide INSERT-side and
UPDATE-side semantics.
// partition_find_partition_for_record — partition.c (excerpted)static intpartition_find_partition_for_record (PRUNING_CONTEXT * pinfo, const OID * class_oid, RECDES * recdes, OID * partition_oid, HFID * partition_hfid){ ... /* lazy-init heap attribute cache for partition column */ if (pinfo->is_attr_info_inited == false) { heap_attrinfo_start (pinfo->thread_p, &pinfo->root_oid, 1, &pinfo->attr_id, &pinfo->attr_info); partition_set_cache_info_for_expr (pinfo->partition_pred->func_regu, pinfo->attr_id, &pinfo->attr_info); pinfo->is_attr_info_inited = true; }
/* read the partition column value out of the record */ repr_id = or_rep_id (recdes); or_set_rep_id (recdes, pinfo->root_repr_id); /* read as if for the master */ heap_attrinfo_read_dbvalues (pinfo->thread_p, ..., recdes, &pinfo->attr_info); or_set_rep_id (recdes, repr_id); /* restore */
/* evaluate the partition expression */ fetch_peek_dbval (pinfo->thread_p, pinfo->partition_pred->func_regu, ..., &result);
if (db_value_is_null (result)) op = PO_IS_NULL;
/* prune by the resulting value */ status = partition_prune_db_val (pinfo, result, op, &pruned); count = pruningset_popcount (&pruned); if (status != MATCH_OK || count != 1) return ER_PARTITION_NOT_EXIST; /* row has no home */
pos = pruningset_iterator_next (&it); COPY_OID (partition_oid, &pinfo->partitions[pos + 1].class_oid); HFID_COPY (partition_hfid, &pinfo->partitions[pos + 1].class_hfid);
/* rewrite the record's REPR_ID to the partition's representation */ if (!OID_EQ (class_oid, partition_oid)) or_set_rep_id (recdes, pinfo->partitions[pos + 1].rep_id);}Three interesting moments:
- Read with the master REPR_ID, write with the partition REPR_ID. The record arrives with whichever REPR_ID the caller built it under. To get the partition column out, the function temporarily flips the REPR_ID to the master class’s so the heap-attribute machinery decodes the layout. After the evaluation, it flips it back. Then, before returning, if the target partition has a different REPR_ID, it overwrites the REPR_ID bits in place. The comment in the source explains why this shortcut is safe: partitions inherit their attribute layout from the master, so the disk image is identical except for the REPR_ID bits.
- NULL handling:
PO_IS_NULLop. A NULL result of the partition expression is pruned withPO_IS_NULL, which routes to partition 0 for hash, to whichever range partition hasmin == NULL(the MINVALUE partition) for range, and to a list partition that explicitly accepts NULL for list. If no partition accepts NULL the function returnsER_PARTITION_NOT_EXIST. - Exactly-one assertion. For INSERT, more than one partition
matching means the partition descriptor is corrupt; zero
matching means the row has no home. Both are
ER_PARTITION_NOT_EXIST.
partition_prune_insert is a thin wrapper that loads the
pruning context (cached or fresh), calls
partition_find_partition_for_record, and — for the
DB_PARTITION_CLASS pruning type (caller is targeting a
specific partition like INSERT INTO t__p__p3 VALUES …) —
verifies that the computed partition matches the targeted one,
otherwise returns ER_INVALID_DATA_FOR_PARTITION.
partition_prune_update is similar but starts from the partition
class OID instead of the master OID (because UPDATE always
operates on a specific partition’s heap, not on the master).
It calls partition_find_root_class_oid to climb to the master,
loads the pruning context, computes the destination partition,
and reports it back to the caller. The locator
(locator_update_force / locator_move_record) then decides
whether to rewrite the row in place (same partition) or move it
between partitions (different partition).
flowchart TD
Insert[INSERT INTO t VALUES ...] --> Locator[locator_insert_force<br/>pruning_type != DB_NOT_PARTITIONED_CLASS]
Locator --> Prune[partition_prune_insert]
Prune --> Find[partition_find_partition_for_record]
Find --> Read[heap_attrinfo_read_dbvalues<br/>read partition column with master REPR_ID]
Read --> Eval[fetch_peek_dbval on partition_pred]
Eval --> Dispatch{partition_type}
Dispatch -->|HASH| Hash[mht_get_hash_number mod N → idx]
Dispatch -->|RANGE| Range[binary search for [min, max)]
Dispatch -->|LIST| List[for each partition: db_set_ismember]
Hash --> One[exactly one bit must be set]
Range --> One
List --> One
One -->|yes| Rewrite[or_set_rep_id to partition's REPR_ID]
Rewrite --> Cache[locator_get_partition_scancache for target heap]
Cache --> Heap[heap_insert_logical into target partition's HFID]
One -->|no| Err[ER_PARTITION_NOT_EXIST]
Per-partition scan dispatch at execute time
Section titled “Per-partition scan dispatch at execute time”Once partition_prune_spec has populated spec->parts, the
executor iterates partitions one at a time. The relevant
access_spec_node fields are in xasl.h:
// access_spec_node — xasl.hstruct access_spec_node{ ... int pruning_type; /* DB_PARTITIONED_CLASS | DB_NOT_PARTITIONED_CLASS | DB_PARTITION_CLASS */ ...#if defined (SERVER_MODE) || defined (SA_MODE) SCAN_ID s_id; /* scan identifier */ PARTITION_SPEC_TYPE *parts; /* partitions of the current spec */ PARTITION_SPEC_TYPE *curent; /* current partition */ ... bool pruned; /* true if partition pruning has been performed */#endif};
// partition_spec_node — xasl.hstruct partition_spec_node{ OID oid; /* class oid */ HFID hfid; /* class hfid */ BTID btid; /* index id */ PARTITION_SPEC_TYPE *next; /* next partition */ SCAN_STATS scan_stats;};The scan loop in the executor (in query_executor.c and
scan_manager.c) re-opens the underlying SCAN_ID once per
partition, walking the linked list parts via curent, and
emits each partition’s tuples in turn. For an aggregate that
collapses across all partitions, the executor treats the
partitions as logically one stream; for a sort/hash that
preserves per-partition identity, the scan stats are tracked per
partition_spec_node (scan_stats).
The lock-acquisition step in query_executor.c:8680+ is
important: after pruning, the executor walks spec->parts and
takes an IS_LOCK (or IX_LOCK for UPDATE/DELETE) on each
surviving partition’s class OID via lock_subclass. The
unpruned partitions are not locked, which is one of the
performance wins of pruning: if a 100-partition table has 99
partitions pruned, only one class lock is taken instead of 100.
Locator integration: routing DML to the partition heap
Section titled “Locator integration: routing DML to the partition heap”locator_sr.c is the bridge between the heap layer (per-class
HFIDs) and the partition layer. The relevant integration
points:
-
locator_insert_force(locator_sr.c:4937+): when called withpruning_type != DB_NOT_PARTITIONED_CLASS, it callspartition_prune_insertto compute the target(real_class_oid, real_hfid). It then takes an IX-lock on the partition class vialock_subclass, fetches a per-partition scan cache vialocator_get_partition_scancache, and routes the actualheap_insert_logicalcall to the partition’s heap rather than the master’s. The master class has no heap of its own (in the partitioned case), so an insert that bypassed the pruning step would fail. -
locator_get_partition_scancache(locator_sr.c:12314+): this is the helper that backspartition_get_scancache/partition_new_scancache. Given a target partition OID and HFID, it either finds an existingPRUNING_SCAN_CACHEin the pruning context’sscan_cache_listor allocates a new one and starts aHEAP_SCANCACHEon it. The cache lives for the lifetime of the pruning context — typically the whole multi-row INSERT/UPDATE statement — so each partition pays the scan-cache setup cost only once. -
locator_move_record(locator_sr.c:5295+): when an UPDATE on a partitioned table changes the partition key value such that the post-image belongs to a different partition than the pre-image, the locator must delete-and-re-insert the row in the new partition. This function does both atomically: it fetches the destination partition’s scan cache, callslocator_insert_forceon the new heap, thenlocator_delete_force_for_movingon the old. The pruning layer has already rewritten the record’s REPR_ID to the destination partition’s, so the insert just lays the bytes down. The outermost driver (locator_update_force) callspartition_prune_updatefirst to discover whether the partition assignment changed, and dispatches tolocator_move_recordonly when it did (seelocator_sr.c:5915, 5953). -
locator_area_op_to_pruning_type(locator_sr.c:12372+): mapsLC_FLUSH_INSERT_PRUNE→DB_PARTITIONED_CLASS,LC_FLUSH_INSERT_PRUNE_VERIFY→DB_PARTITION_CLASS, etc. The client-side workspace flush marks each dirty record with one of theseLC_FLUSH_*codes based on whether the user inserted via the master or a specific partition; the server uses this to decide between full pruning and pruning-as-verification.
The unique-index integrity check in locator_sr.c:4139 is
another pruning consumer: when checking a unique-index
constraint on insert, the server uses
partition_prune_unique_btid to find the partition that owns
the unique key, then checks just that partition’s local index.
This works because CUBRID uses local indexes — the partition
that owns the row also owns the unique-index entry — and
because the unique-index column must include the partition
key when pruning is to identify a single partition. (Without
that, uniqueness across partitions would not be enforceable
locally.)
Indexes on partitioned tables: local only
Section titled “Indexes on partitioned tables: local only”CUBRID currently uses local indexes on partitioned tables.
When do_create_partition_constraints runs, it iterates over the
master class’s constraints and creates a copy of each one on
every partition. The B-tree IDs (BTIDs) are different on each
partition — heap_get_index_with_name is the lookup that finds a
partition’s local BTID given the master’s index name. The
implication of local-only:
- Cheap DDL. Adding/dropping a partition only affects that partition’s indexes; no global rebuild.
- Cheap maintenance. Inserts and updates only touch the local BTID in the partition where the row lives.
- No partition-wise unique enforcement without partition key in
the unique columns. If the user defines
UNIQUE(x)on a table partitioned byy, two different partitions could each have a row with the samexand the local-only enforcement would not catch it. CUBRID’s DDL therefore requires that the partition key is part of any UNIQUE or PRIMARY KEY on a partitioned table. partition_prune_unique_btidis the helper that, given a unique-index search key, computes which partition owns the key (via the partition key value embedded in the search key) and rewrites(class_oid, hfid, btid)to the partition-local triple.
There is a feature flag partition_is_global_index in
partition.c:4250, but it is wrapped in #if 0 (disabled). The
function distinguishes between a primary-key index (which it
would treat as global) and a non-primary-key unique index (whose
globality depends on whether the index columns include the
partition key). Today it is dead code. The current production
behavior is “all indexes are local”.
Partition-wise joins
Section titled “Partition-wise joins”query_planner.c shows several qo_partition / QO_PARTITION
references — but these are planner partitions, not
table-level partitions. In query_planner.c, QO_PARTITION is
the planner’s term for a connected component of the join graph
(Selinger-style partitioning of the query graph for join
enumeration). Symbols like qo_search_partition,
qo_combine_partitions, QO_PARTITION_NODES,
QO_PARTITION_EDGES operate on join-graph partitions, not
horizontal table partitions.
A grep over the optimizer for table-partition awareness comes up mostly empty. Partition pruning is run after plan generation, when the executor opens each access spec — not folded into the optimizer’s cost computation. Consequence: CUBRID’s optimizer plans a partitioned-table scan as a single scan against the master class, with pruning applied opaquely at runtime. There is no partition-wise join optimization. Two partitioned tables with the same partition shape get joined as if they were unpartitioned tables; the partitions of T1 each scan against all partitions of T2.
flowchart LR
Parser[Parser] --> Resolve[Name resolution]
Resolve --> SemCheck[Semantic check]
SemCheck --> Opt[Optimizer<br/>cost-based plan]
Opt --> XASL[XASL generation<br/>access_spec.pruning_type set]
XASL --> Stream[xasl_to_stream → stream_to_xasl]
Stream --> Exec[query_executor<br/>opens each spec]
Exec --> Prune[partition_prune_spec<br/>fills spec->parts]
Prune --> Lock[lock_subclass on each surviving partition]
Lock --> ScanLoop[for each PARTITION_SPEC_TYPE in spec->parts:<br/>scan_open_scan, scan_next_scan, scan_close_scan]
ScanLoop --> Output[merged tuple stream]
The diagram shows clearly that pruning is between XASL deserialization and scan execution, not folded into the plan itself. This is a deliberate trade-off: pruning is fast (a few hundred microseconds for 1024 partitions) and runs lazily, so the cost of “always plan as if all partitions are alive” is small in practice. The cost is that partition-wise join is not available — the planner does not see per-partition relations to combine.
Source Walkthrough
Section titled “Source Walkthrough”Symbols are grouped by subsystem so that the reader can navigate by role rather than by file order.
Catalog metadata (in-memory and on-disk)
Section titled “Catalog metadata (in-memory and on-disk)”SM_PARTITION(struct,class_object.h) — the in-memory partition descriptor on eachSM_CLASS. Carriespartition_type,class_partition_type,pname,expr,values. Per-class instance:SM_CLASS::partition.OR_PARTITION(struct,object_representation_sr.h) — the server-side reloaded form:(class_oid, class_hfid, partition_type, rep_id, values). Materialized byheap_get_class_partitions(inheap_file.c).OR_CLASSREP::has_partition_info— the single bit on each class representation that says “this class has partition info, load it”.DB_PARTITION_TYPE(enum,storage_common.h) —DB_PARTITION_HASH=0,DB_PARTITION_RANGE=1,DB_PARTITION_LIST=2.DB_CLASS_PARTITION_TYPE(enum,storage_common.h) —DB_NOT_PARTITIONED_CLASS=0,DB_PARTITIONED_CLASS=1,DB_PARTITION_CLASS=2. The first marks ordinary tables; the second marks the master class; the third marks a partition.classobj_make_partition_info,classobj_free_partition_info,classobj_copy_partition_info— the SM_PARTITION life-cycle helpers (inclass_object.c).
DDL (client-side, execute_schema.c)
Section titled “DDL (client-side, execute_schema.c)”do_create_partition— entry forCREATE TABLE … PARTITION BY …andALTER TABLE … ADD PARTITIONandALTER TABLE … ADD HASH PARTITION. Materializes the partition classes, links them to the master via inheritance, fills theSM_PARTITIONrecords.do_alter_partitioning_pre/do_alter_partitioning_post— pre/post hook pair that runs around the schema-template flush for partition-related ALTER variants.do_remove_partition_pre/_post— drops a partition.do_coalesce_partition_pre/_post— collapses N HASH partitions to fewer (rehashes rows).do_reorganize_partition_pre/_post— restructures RANGE/LIST partitions (redistribution).do_create_partition_constraints/do_create_partition_constraint— replicates indexes and constraints from the master to each partition.SM_PARTITION_ALTER_INFO(struct) — context bundle passed through the alter-partition pipeline (root class OP, key column, promoted partition names).PART_CLASS_INFO(struct) — per-partition scratch info during create (name, template, object).PARTITION_VARCHAR_LEN,PARTITIONED_SUB_CLASS_TAG ("__p__")— naming-policy constants.
Per-row partition resolution (server-side, partition.c)
Section titled “Per-row partition resolution (server-side, partition.c)”partition_prune_insert— outer entry for INSERT-side routing. Loads/uses aPRUNING_CONTEXT, callspartition_find_partition_for_record, returns(pruned_class_oid, pruned_hfid).partition_prune_update— outer entry for UPDATE-side routing. Climbs to the root class first (partition_find_root_class_oid), then same as INSERT.partition_find_partition_for_record(static) — the heart of per-row routing. Reads the partition column out of the record, evaluates the partition expression, prunes by the result, returns the unique target partition.partition_find_root_class_oid— given any class OID (master or partition), returns the master class OID. Usesheap_get_class_supers.partition_prune_unique_btid— given a unique-index search key, returns the partition that owns the key plus the partition-localBTID. Used bybtree.cfor unique-constraint enforcement and byquery_executor.cfor unique scans.partition_prune_partition_index— given a search key and the master’s index BTID, returns the position (index) of the partition that owns the key. Backspartition_prune_unique_btid.partition_attrinfo_get_key— extracts the partition-key DB_VALUE out of a (possibly multi-column) index search key.
Pruning at optimize time (server-side, partition.c)
Section titled “Pruning at optimize time (server-side, partition.c)”partition_prune_spec— outer entry for scan-side pruning. Applieswhere_pred ∩ where_key ∩ where_rangeto the partition descriptor, fillsspec->partswith the surviving partitions.partition_match_pred_expr(static, recursive) — walks aPRED_EXPRtree (T_PRED/T_EVAL_TERM/T_COMP_EVAL_TERM/T_ALSM_EVAL_TERM/T_NOT_TERM/T_LIKE_EVAL_TERM) and returns a bitset of surviving partitions plus aMATCH_STATUS.partition_match_index_key— pruning specialization for index key ranges.partition_match_key_range— handlesKEY_RANGElower and upper bounds for index scans.partition_prune— leaf wrapper that resolves aREGU_VARIABLEto aDB__VALUE(viapartition_get_value_from_regu_var) and then callspartition_prune_db_val.partition_prune_db_val— strategy dispatch: HASH →partition_prune_hash, RANGE →partition_prune_range, LIST →partition_prune_list.partition_prune_hash—mht_get_hash_number (hash_size, &val) → bit position.partition_prune_range— interval comparison against(min, max)per partition; usespartition_decrement_valueto convert(min, max]to(min, max)for>comparisons.partition_prune_list—db_set_ismemberper partition; short-circuits on first match because list partitions are pairwise disjoint.partition_prune_for_function— pruning for predicates whose left or right side contains the partition expression as a sub-expression (e.g.,YEAR(date) BETWEEN 2020 AND 2024when partitioning byYEAR(date)).partition_supports_pruning_op_for_function— lookup table that says which(op, function)combinations the function-pruning logic can handle.partition_do_regu_variables_match/partition_do_regu_variables_contain— structural matchers on regu trees: “are these the same partition expression?” vs. “does this regu tree contain the partition expression?”.partition_get_value_from_key,partition_get_value_from_inarith,partition_get_value_from_regu_var— three flavors of “pull aDB_VALUEout of a regu node” depending on shape.partition_is_reguvar_const— whether a regu variable is a constant value (and therefore pruning-safe).partition_decrement_value— type-aware integer/date decrement, used to massage range comparisons.partition_get_attribute_id— extracts the underlying attribute id out of a partition-expression regu tree. Used during context load to populatePRUNING_CONTEXT::attr_id.partition_set_cache_info_for_expr/partition_set_cache_dbvalp_for_attribute— wire the heap-attribute cache into the regu tree before evaluation, sofetch_peek_dbvalcan read the column value out ofpinfo->attr_inforather than via a fresh disk fetch.partition_get_position_in_key— for index pruning, finds the position of the partition column within the index key column list.partition_rel_op_to_pruning_op—REL_OP→PRUNING_OPenum mapping (R_LT→PO_LT,R_EQ→PO_EQ, etc.).
Partition descriptor lifecycle (server-side, partition.c)
Section titled “Partition descriptor lifecycle (server-side, partition.c)”partition_init_pruning_context— zero-init.partition_load_pruning_context— populate (try cache first; on miss readOR_PARTITIONarray from heap and cache).partition_clear_pruning_context— release scan caches and predicate.partition_load_partition_predicate/partition_free_partition_predicate— deserialize and free the partition expression (stx_map_stream_to_func_pred).partition_set_specified_partition— when caller explicitly named a partition (DB_PARTITION_CLASSmode), pointselected_partitionat it.PARTITION_CACHE_ENTRY(struct) — per-class cache entry.partition_cache_init/partition_cache_finalize/partition_decache_class— process-wide hash table of cached pruning contexts (db_Partition_Ht,PARTITION_CACHE_SIZE = 200).partition_cache_pruning_context/partition_load_context_from_cache/partition_cache_entry_to_pruning_context/partition_pruning_context_to_cache_entry— bidirectional conversion.partition_free_cache_entry/partition_free_cache_entry_kv— destructors.
Per-partition scan caches
Section titled “Per-partition scan caches”PRUNING_SCAN_CACHE(struct,partition_sr.h) — wraps aHEAP_SCANCACHEplus function-index unpack info per partition.SCANCACHE_LIST(struct) — singly-linked list ofPRUNING_SCAN_CACHEhanging offPRUNING_CONTEXT.partition_get_scancache— find an existing scan cache by partition OID.partition_new_scancache— create and prepend a new one.locator_get_partition_scancache(locator_sr.c:12314+) — full-service wrapper used bylocator_insert_force/locator_move_record: find or create, and start the innerHEAP_SCANCACHEif new.
Pruning bitset (server-side, partition.c)
Section titled “Pruning bitset (server-side, partition.c)”PRUNING_BITSET(struct) —unsigned int set[];plus count, hand-rolled bitset for up to 1024 partitions.PRUNING_BITSET_ITERATOR— companion iterator.pruningset_init/_set_all/_copy/_add/_remove/_intersect/_union/_is_set/_popcount— bitset arithmetic.pruningset_iterator_init/pruningset_iterator_next— iteration.pruningset_to_spec_list— bitset → linked list ofPARTITION_SPEC_TYPEfor the executor.
Scan dispatch (executor + scan_manager)
Section titled “Scan dispatch (executor + scan_manager)”access_spec_node(struct,xasl.h) — the optimizer’s access spec; carriespruning_type,pruned,parts(linked list of surviving partitions),curent(current iterator position).partition_spec_node/PARTITION_SPEC_TYPE(struct,xasl.h) —(oid, hfid, btid, next, scan_stats).prepare_mvcc_reev_data(inquery_executor.c) — callspartition_prune_specandlock_subclassper partition before scan open.qexec_clear_partition_expression(referenced frompartition_free_partition_predicate) — releases the deserialized partition expression’s regu tree.
Locator integration (locator_sr.c)
Section titled “Locator integration (locator_sr.c)”locator_insert_force— inserts a record; callspartition_prune_insertif the class is partitioned and routes the heap insert to the partition.locator_update_force— updates a record; callspartition_prune_updateand dispatches tolocator_move_recordif the partition assignment changed.locator_move_record— delete-then-insert across two partitions atomically.locator_get_partition_scancache— see above.locator_area_op_to_pruning_type—LC_FLUSH_*→DB_*_PARTITIONED_CLASSenum mapping.FOR_INSERT_OR_DELETE/FOR_MOVE— internal enum that tags a delete-and-reinsert pair as a partition move (so MVCC treats the new row as a continuation rather than a fresh insert for some bookkeeping purposes).redistribute_partition_data— bulk path used byREORGANIZE PARTITION.
B-tree / index integration
Section titled “B-tree / index integration”btree.c::partition_prune_unique_btidcallers (line 6080, 22719) — uniqueness checks on insert.btree_load.c::partition_prune_partition_indexcaller (line 4559) — bulk index build sees per-partition keys.btree.c::partition_load_pruning_contextcaller (line 22707) — referencing-side foreign-key lookup needs the FK’s pruning context.statistics_sr.c::partition_get_partition_oidscaller (line 198) — analyze runs per partition.system_catalog.c::partition_get_partition_oidscaller (line 5449) — catalog accessor.file_manager.c::partition_get_partition_oidscaller (line 6854) — file-cleanup walks all partitions.
Aggregate hierarchy helper
Section titled “Aggregate hierarchy helper”partition_load_aggregate_helper— for queries with aggregates over partitioned tables, builds anHIERARCHY_AGGREGATE_HELPERcontaining per-partition(HFID, BTID)arrays so the aggregate executor can iterate partitions in turn.HIERARCHY_AGGREGATE_HELPER(inquery_aggregate.hpp) — the recipient struct.
Position hints as of this revision
Section titled “Position hints as of this revision”The following table records the symbol positions observed during
this document’s updated: date. Symbol names are the canonical
anchor; the line numbers age with refactors.
| Symbol | File | Line |
|---|---|---|
MAX_PARTITIONS = 1024 | src/query/partition.h | 26 |
pruning_scan_cache | src/query/partition_sr.h | 50 |
scancache_list | src/query/partition_sr.h | 60 |
pruning_context | src/query/partition_sr.h | 67 |
partition_init_pruning_context (decl) | src/query/partition_sr.h | 95 |
partition_load_pruning_context (decl) | src/query/partition_sr.h | 97 |
partition_prune_spec (decl) | src/query/partition_sr.h | 112 |
partition_prune_insert (decl) | src/query/partition_sr.h | 114 |
partition_prune_update (decl) | src/query/partition_sr.h | 118 |
partition_prune_unique_btid (decl) | src/query/partition_sr.h | 122 |
partition_get_partition_oids (decl) | src/query/partition_sr.h | 125 |
partition_load_aggregate_helper (decl) | src/query/partition_sr.h | 128 |
partition_find_root_class_oid (decl) | src/query/partition_sr.h | 135 |
partition_prune_partition_index (decl) | src/query/partition_sr.h | 137 |
pruning_bitset | src/query/partition.c | 71 |
partition_cache_entry | src/query/partition.c | 95 |
partition_init_pruning_context | src/query/partition.c | 2589 |
partition_load_pruning_context | src/query/partition.c | 2674 |
partition_clear_pruning_context | src/query/partition.c | 2805 |
partition_load_partition_predicate | src/query/partition.c | 2862 |
partition_set_specified_partition | src/query/partition.c | 2929 |
partition_get_attribute_id | src/query/partition.c | 3052 |
partition_get_position_in_key | src/query/partition.c | 3125 |
partition_prune_spec | src/query/partition.c | 3313 |
partition_find_partition_for_record | src/query/partition.c | 3465 |
partition_prune_insert | src/query/partition.c | 3603 |
partition_prune_update | src/query/partition.c | 3710 |
partition_get_scancache | src/query/partition.c | 3831 |
partition_new_scancache | src/query/partition.c | 3861 |
partition_get_partition_oids | src/query/partition.c | 3891 |
partition_decrement_value | src/query/partition.c | 3976 |
partition_prune_unique_btid | src/query/partition.c | 4064 |
partition_find_inherited_btid | src/query/partition.c | 4106 |
partition_load_aggregate_helper | src/query/partition.c | 4150 |
partition_attrinfo_get_key | src/query/partition.c | 4337 |
partition_prune_partition_index | src/query/partition.c | 4441 |
partition_prune_list | src/query/partition.c | 1086 |
partition_prune_hash | src/query/partition.c | 1221 |
partition_prune_range | src/query/partition.c | 1340 |
partition_prune_db_val | src/query/partition.c | 1545 |
partition_prune | src/query/partition.c | 1574 |
partition_match_pred_expr | src/query/partition.c | 1933 |
partition_prune_for_function | src/query/partition.c | 2105 |
partition_match_key_range | src/query/partition.c | 2406 |
pruningset_to_spec_list | src/query/partition.c | 873 |
pruningset_init, etc. | src/query/partition.c | 187-396 |
sm_partition | src/object/class_object.h | 728 |
SM_CLASS::partition | src/object/class_object.h | 798 |
or_partition | src/base/object_representation_sr.h | 196 |
DB_PARTITION_TYPE | src/storage/storage_common.h | 1211 |
DB_CLASS_PARTITION_TYPE | src/storage/storage_common.h | 1218 |
partition_spec_node | src/query/xasl.h | 1029 |
access_spec_node::pruning_type | src/query/xasl.h | 1052 |
access_spec_node::parts | src/query/xasl.h | 1057 |
access_spec_node::pruned | src/query/xasl.h | 1061 |
do_create_partition | src/query/execute_schema.c | 4036 |
do_alter_partitioning_pre (decl) | src/query/execute_schema.c | 334 |
do_alter_partitioning_post (decl) | src/query/execute_schema.c | 335 |
do_remove_partition_pre (decl) | src/query/execute_schema.c | 337 |
do_coalesce_partition_pre (decl) | src/query/execute_schema.c | 339 |
do_reorganize_partition_pre (decl) | src/query/execute_schema.c | 341 |
do_create_partition_constraints | src/query/execute_schema.c | 5728 |
SM_PARTITION_ALTER_INFO | src/query/execute_schema.c | 195 |
partition_prune_spec caller | src/query/query_executor.c | 8656 |
partition_prune_spec parallel-path caller | src/query/query_executor.c | 13361 |
locator_insert_force partition branch | src/transaction/locator_sr.c | 4969 |
locator_move_record | src/transaction/locator_sr.c | 5295 |
locator_update_force::partition_prune_update | src/transaction/locator_sr.c | 5915 |
locator_get_partition_scancache | src/transaction/locator_sr.c | 12314 |
locator_area_op_to_pruning_type | src/transaction/locator_sr.c | 12372 |
partition_prune_unique_btid (in btree.c) | src/storage/btree.c | 6080, 22719 |
partition_prune_partition_index (in btree_load.c) | src/storage/btree_load.c | 4559 |
partition_get_partition_oids (statistics) | src/storage/statistics_sr.c | 198 |
Cross-check Notes
Section titled “Cross-check Notes”This document was written after cubrid-class-object.md,
cubrid-query-optimizer.md, cubrid-locator.md, and
cubrid-catalog-manager.md. The points of contact and disagreement:
-
cubrid-class-object.mdintroducesSM_PARTITIONas a member ofSM_CLASSand notes that partitions are subclasses of the master class wired through the OODB inheritance machinery (SM_CLASS::users/SM_CLASS::inheritance). This document fills in the run-time half: howSM_PARTITION(an in-memory schema artifact) maps toOR_PARTITION(an on-disk reload artifact) and toPRUNING_CONTEXT(a request-scoped evaluation artifact). All three carry the partition rule, but in three different shapes and lifetimes. -
cubrid-query-optimizer.mddescribes the optimizer-side job — XASL generation and cost-based plan selection — and observes that the optimizer’s notion of “partition” (as inqo_partition,QO_PARTITION) is the planner’s term for a connected component of the join graph, not a horizontal table partition. This document confirms the cross-check: CUBRID’s optimizer does not know about table partitions; pruning runs lazily, just before scan open, inpartition_prune_spec. There is no partition-wise join. -
cubrid-locator.mddescribes the locator’s MOP routing and per-class scan-cache infrastructure. The DML routing in this document —locator_insert_forcecallingpartition_prune_insert,locator_move_recordfor partition moves,locator_get_partition_scancachereusing per-partition caches — slots cleanly into the locator narrative. The new detail here is that pruning is performed server-side with a full deserialized regu-variable expression, not client-side via the workspace. The client-side workspace code inlocator_cl.conly setsmop->pruning_type = DB_PARTITIONED_CLASSon the master class’s MOP so that the flush stage emitsLC_FLUSH_INSERT_PRUNErequests, which the server then routes throughpartition_prune_insert. -
cubrid-catalog-manager.mddescribes how the schema flushes to_db_classrecords.SM_PARTITIONdata persists as aDB_SEQliteral embedded in the class object (no dedicated partition catalog table). TheOR_PARTITIONarray is built at server side from the master class’s_db_classrow and the_db_class_reprrows of the partitions; the partition predicate stream lives in the master partition’svalues[2]slot. Confirmed against this document’s reading ofpartition_load_partition_predicate. -
cubrid-btree.mddescribes the B-tree subsystem; this document adds the “B-trees on partitioned tables are local per partition” rule.partition_prune_unique_btidis the bridge that translates a search on the master’s logical BTID into a search on the right partition’s physical BTID.
Open Questions
Section titled “Open Questions”-
Subpartitioning / composite partitioning. Is a CUBRID partition itself allowed to be partitioned (Oracle-style range-hash composite, or MySQL’s HASH-within-RANGE)? The DDL forbids
CREATE TABLEon a partition class (SM_NO_PARTITION_ON_HIERARCHIES), but this document did not trace whetherALTER TABLE partition__p__p3 PARTITION BY HASH(...)is reachable through some code path. Worth confirming with a small test (./csql -Sand DDL). -
Online repartitioning.
do_reorganize_partition_pre/postsuggests there is a path that redistributes data across partitions when bounds change. Does it run online (without a full table lock) or offline? Theredistribute_partition_datasymbol inlocator_sr.c:218is the worth-tracing entry; does it use MVCC to serve readers while it copies, or does it take an exclusive lock? -
Foreign keys across partitions. When a partitioned table is the referencing side of a foreign key, the FK has to look up the referenced table’s index.
btree_load.cline 4559 showspartition_prune_partition_indexbeing used in an FK context on the referenced side. What about when the FK references a partitioned table (the referenced side is partitioned) and the index is local? Is the cross-partition uniqueness assumption documented? Is the FK guaranteed to find the referenced row? -
Partition pruning for parameterized predicates. This document focused on plan-time pruning, but
partition_prune_specis called from the executor every time the access spec is re-opened. Does CUBRID re-prune when a host variable bound to the partition key changes between executions of a prepared statement? The cachepartition_load_context_from_cacheargues yes (the context is reusable across executions, but the bitset is rebuilt fresh each timepartition_prune_specruns). A side-by-side comparison with PostgreSQL’s “execution-time pruning” (which also re-runs pruning per execution to catch parameter changes) would clarify whether CUBRID matches. -
The disabled
partition_is_global_index. The function sits behind#if 0inpartition.c:4250. Is global-index support a planned future feature, or is it permanently dead code from a feature that was prototyped and abandoned? Annotating with agit log -ppass on that block would answer. -
partition_match_pred_exprandT_NOT_TERM. Currently anyNOTpredicate returnsMATCH_NOT_FOUND(no pruning). The textbook treatment of partition pruning under negation is straightforward: De Morgan-rewrite, then prune. Is the current behavior a deliberate choice (the rewrite is unsafe in CUBRID’s predicate language) or just unfinished?
Sources
Section titled “Sources”src/query/partition.c— server-side partition logic; pruning, per-row resolution, partition descriptor cache.src/query/partition.h—MAX_PARTITIONS = 1024.src/query/partition_sr.h—PRUNING_CONTEXT,PRUNING_SCAN_CACHE,SCANCACHE_LIST, public API.src/optimizer/query_planner.c— checked for table-partition awareness; the optimizer’sQO_PARTITIONis the join graph’s connected components, not table partitions; no partition-wise joins.src/transaction/locator_sr.c—locator_insert_forcepartition routing,locator_move_record,locator_get_partition_scancache,locator_area_op_to_pruning_type.src/query/execute_schema.c—do_create_partition,do_alter_partitioning_*,do_remove_partition_*,do_coalesce_partition_*,do_reorganize_partition_*,do_create_partition_constraints.src/object/class_object.c/src/object/class_object.h—SM_PARTITION,SM_CLASS::partition, lifecycle helpers.src/base/object_representation_sr.h—OR_PARTITION(the on-disk-reload form).src/storage/storage_common.h—DB_PARTITION_TYPE,DB_CLASS_PARTITION_TYPEenums.src/query/xasl.h—access_spec_node::pruning_type,access_spec_node::parts,partition_spec_node.src/query/query_executor.c—partition_prune_speccallers, per-partition lock and scan dispatch.src/storage/btree.candsrc/storage/btree_load.c—partition_prune_unique_btidandpartition_prune_partition_indexconsumers.
Theoretical references:
- Silberschatz, Korth, Sudarshan. Database System Concepts (7th ed.), Ch. 22 (“Parallel and Distributed Storage”), §22.2 on horizontal partitioning strategies and §22.5 on partition pruning.
- Petrov. Database Internals, Ch. 11 on distributed systems and the section on partition placement and partition awareness in query planning.
- PostgreSQL documentation: “Table Partitioning” chapter in the
PostgreSQL manual; the
partprune.csource file in the PostgreSQL source tree for the prune algorithm. - MySQL documentation: “Partitioning” chapter in the MySQL Reference
Manual; the
sql/sql_partition.ccsource file for the prune algorithm. - Oracle Database VLDB and Partitioning Guide (Oracle Database 19c or later) for range/hash/list/composite + interval partitioning.