CUBRID LOB — External Storage, Locator Lifecycle, and Transactional Cleanup
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Source verification (as of 2026-05-01)
- Beyond CUBRID — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A large object (LOB) — BLOB for binary, CLOB for character — is a column type whose payload is too large to keep inline with the rest of the row without distorting the storage strategy. Tables are tuned for short rows that pack many records per page; a 50 MB image as the fourth column undoes that tuning. Every relational engine therefore splits LOB storage from the surrounding row.
Database textbooks frame the design space along three axes:
- Where does the byte payload live? Inside the same heap file (with row-level chunking and continuation pointers), in a separate internal LOB file, or outside the database entirely on the host filesystem (or an object store).
- What does the row carry? A direct pointer to the payload, an indirect handle (locator) the engine resolves on demand, or both (handle for small LOBs, indirection for large ones).
- How does the engine reconcile transactional semantics with the storage choice? When an INSERT is rolled back, the LOB write must disappear. When a DELETE commits, the bytes must eventually be reclaimed. If the bytes live outside the WAL-protected data volume, the engine cannot rely on log undo/redo to fix them up — it has to intercept commit/abort itself.
The textbook reference is Database Internals (Petrov, 2019), Ch. 2 “File Formats”, which discusses out-of-page record storage; Database System Concepts (Silberschatz et al., 7e), §13.5 “Record Organization”, covers BLOB/CLOB semantics. Lehman and Lindsay’s “The Starburst Long Field Manager” (VLDB 1989) is the early reference for transaction-aware out-of-row LOB management.
CUBRID makes the third choice on axis 1 (filesystem outside the data volume), the second on axis 2 (a locator URI string in the record), and explicitly handles axis 3 in the transaction commit/abort path through a per-transaction tracking structure. The next two sections explain why each of those choices is sensible and how CUBRID implements them.
Common DBMS Design
Out-of-row LOB storage is a well-trodden pattern. Almost every production engine adopts some subset of the following conventions, and CUBRID’s specific choices are best read as one set of dials within this shared space.
Locator handles, not raw pointers
A LOB-bearing column stores a locator — a fixed-size string or
struct — rather than the bytes. The locator is enough for the engine
to find the bytes when a query asks for them, and small enough that
it pages with the rest of the row. PostgreSQL’s lo_oid, Oracle’s
LOB locator, and CUBRID’s DB_ELO.locator URI all play this role.
The locator format usually carries three pieces of information: (a) which storage backend to dispatch to (filesystem, object store, internal), (b) the address inside that backend (a path, an OID, a hash), and (c) lightweight metadata so the engine can sanity-check or reject the locator without dereferencing it (size, type, owning class). The locator is opaque to the SQL layer.
Background reclamation, not synchronous delete
When a transaction commits a DELETE, the LOB cannot be unlinked
from the filesystem at the same moment as the row’s MVCC delete bit
is set. Other concurrent snapshots may still be reading the old
version. Engines therefore defer the unlink: they record the
intent (“delete this locator at commit”), let the transaction
finish, and then (synchronously at commit, or asynchronously via a
vacuum-style daemon) actually remove the file. This is the LOB
analogue of MVCC’s dead-tuple reclamation.
Two-level hash directories on disk
A naive LOB-as-file scheme runs into directory scaling limits:
a single directory holding millions of files becomes hostile both to
the kernel’s readdir and to backup tools. Every engine that uses
the host filesystem responds with the same trick: hash the filename
into one or two levels of intermediate directories, so each leaf
directory holds a bounded number of files. PostgreSQL’s pg_largeobject
uses internal pages instead and skips the issue; engines that do go
to the filesystem (Oracle BFILE, MySQL with file-per-LOB extensions,
CUBRID’s POSIX/OWFS backends) use the hash-dir layout.
Transactional cleanup on a per-TDES list
The defer-and-clean-up policy needs a place to record which locators
are pending. The standard answer is a per-transaction list (or tree)
of (locator, intended action) pairs hung off the transaction
descriptor. Commit walks the list and applies the intent; rollback
walks the same list with the opposite intent. The list lives in
volatile memory (it is regenerated on demand from log records during
recovery, but in the steady state it is just bookkeeping).
Savepoint stack per locator
Subtransactions and savepoints add a wrinkle: a locator can be
created at savepoint A, mutated at savepoint B, and partially rolled
back to A. The textbook implementation gives each locator entry a
stack of prior states, popping back through them as the rollback
LSA crosses each savepoint boundary. Few systems document this
explicitly; CUBRID’s lob_savepoint_entry is one of the cleaner
implementations.
Theory ↔ CUBRID mapping
| Theory | CUBRID name |
|---|---|
| Locator (handle pointing into a backend) | DB_ELO.locator — URI string <scheme>:<backend-specific> (elo.c) |
| Backend-selector prefix in the locator | ES_OWFS_PATH_PREFIX = "owfs:", ES_POSIX_PATH_PREFIX = "file:", ES_LOCAL_PATH_PREFIX = "local:" (es_common.h) |
| Per-storage-type backend | es_owfs_* (OWFS — Object-style FS), xes_posix_* (POSIX — host filesystem), es_local_* (local-read-only client cache) (es*.c) |
| Two-level hash dir | ces_<HASH1>/ces_<HASH2>/<file> via es_get_unique_name and ES_POSIX_HASH1 / ES_POSIX_HASH2 (es_posix.c) |
| LOB state machine | enum lob_locator_state — six states; the transition table is documented inline in lob_locator.hpp |
| Per-TDES locator list | tdes->lob_locator_root — a red-black tree (lob_rb_root), keyed by hash + key portion of the locator (transaction_transient.cpp) |
| Savepoint stack per locator | lob_locator_entry::top — a lob_savepoint_entry linked list pushed at state-change, popped at partial rollback (same file) |
| Commit/rollback walker | tx_lob_locator_clear (tdes, at_commit, savept_lsa) — single entry point called from log_manager.c (commit, abort) and transaction_sr.c (sysop end) |
| Background reclamation hand-off | vacuum_notify_es_deleted — at commit, LOB_PERMANENT_DELETED files are queued for vacuum rather than unlinked synchronously |
CUBRID’s Approach
CUBRID has four moving parts on the LOB path: the ES (External Storage) layer that abstracts where the bytes live, the locator URI that names a single LOB, the per-TDES locator tree that remembers which locators this transaction touched, and the commit/rollback walker that turns that tree into actual filesystem operations. We walk them in that order.
Overall structure
```mermaid
flowchart LR
  subgraph SQL["SQL row"]
    R["heap record<br/>(...) ... locator='file:/.../ces_007/dba.t1.123_4567' ..."]
  end
  subgraph TDES["Per-transaction state (TDES)"]
    RB["lob_locator_root<br/>(red-black tree)"]
    E1["lob_locator_entry<br/>key, hash, top"]
    E2["lob_locator_entry<br/>key, hash, top"]
    SV["savepoint stack<br/>state, savept_lsa"]
    RB --> E1
    RB --> E2
    E1 --> SV
  end
  subgraph ES["ES layer (es.c, es_init/es_create_file/...)"]
    POS["es_posix_∗<br/>host filesystem"]
    OWFS["es_owfs_∗<br/>object-style FS"]
    LOC["es_local_∗<br/>client-side cache"]
  end
  DISK[("filesystem<br/>ces_<H1>/ces_<H2>/<file>")]
  R --> RB
  ES --> DISK
  POS --> DISK
  OWFS --> DISK
  TX[("commit / rollback")] --> CLR["tx_lob_locator_clear"]
  CLR --> RB
  CLR --> ES
```
The figure encodes the three boundaries that matter for correctness:
(SQL boundary) the row carries a locator URI, never bytes;
(TDES boundary) every state-changing locator operation funnels
through xtx_add_lob_locator / xtx_change_state_of_locator /
xtx_drop_lob_locator so the per-transaction tree is the single
source of truth; (ES boundary) every filesystem call goes through
the es.c dispatch front so the backend choice (OWFS / POSIX / LOCAL)
is one switch in one place.
The ES layer — backend dispatch on a URI prefix
Every LOB byte read or write goes through es.c. The module is
initialized once with es_init(uri); the URI’s prefix determines
which backend the rest of the engine routes to.
```c
// ES_TYPE — src/storage/es_common.h
typedef enum
{
  ES_NONE  = -1,
  ES_OWFS  = 0,
  ES_POSIX = 1,
  ES_LOCAL = 2
} ES_TYPE;
```
```c
#define ES_OWFS_PATH_PREFIX  "owfs:"
#define ES_POSIX_PATH_PREFIX "file:"
#define ES_LOCAL_PATH_PREFIX "local:"

// es_create_file — src/storage/es.c (condensed)
int
es_create_file (char *out_uri)
{
  // ... condensed ...
  if (es_initialized_type == ES_OWFS)
    {
      memcpy (out_uri, ES_OWFS_PATH_PREFIX, sizeof (ES_OWFS_PATH_PREFIX));
      ret = es_owfs_create_file (ES_OWFS_PATH_POS (out_uri));
    }
  else if (es_initialized_type == ES_POSIX)
    {
      memcpy (out_uri, ES_POSIX_PATH_PREFIX, sizeof (ES_POSIX_PATH_PREFIX));
#if defined (CS_MODE)
      ret = es_posix_create_file (ES_POSIX_PATH_POS (out_uri));
#else
      ret = xes_posix_create_file (ES_POSIX_PATH_POS (out_uri));
#endif
    }
  // ... condensed (ES_LOCAL is read-only, no create) ...
}
```

Two properties fall out of this dispatch. (a) ES_LOCAL has no create path — it is only a client-side read cache for files staged elsewhere; the deck does not mention it because the deck is server-focused. (b) Inside CUBRID’s server, POSIX paths are served by xes_posix_* — the x-prefixed variants are the in-server implementations, while the bare es_posix_* symbols are RPC stubs the client invokes via network_interface_cl.h.
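The prefix routing reduces to a few string compares. A hypothetical sketch (the function name `es_type_of` and the `_S` enumerators are illustration-only, not CUBRID symbols):

```cpp
// Hypothetical sketch: pick a backend by matching the locator's URI
// prefix, mirroring how es.c routes on "owfs:" / "file:" / "local:".
#include <cassert>
#include <cstring>

enum es_type_sketch { ES_NONE_S = -1, ES_OWFS_S, ES_POSIX_S, ES_LOCAL_S };

es_type_sketch
es_type_of (const char *uri)
{
  if (std::strncmp (uri, "owfs:", 5) == 0)
    return ES_OWFS_S;
  if (std::strncmp (uri, "file:", 5) == 0)
    return ES_POSIX_S;
  if (std::strncmp (uri, "local:", 6) == 0)
    return ES_LOCAL_S;
  return ES_NONE_S;   // unrecognized scheme: reject before dereferencing
}
```

Because the scheme rides inside the stored locator itself, a row written under one backend keeps resolving correctly even if the server is later initialized with a different default.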
The locator URI — name, key, and meta
A locator is a zero-terminated string with a fixed shape:

```
file:/<base-dir>/ces_<HASH1>/ces_<HASH2>/<metaname>.<unum>_<rand>
```

Two helpers in lob_locator.cpp parse it without allocating:
```cpp
// lob_locator_meta / lob_locator_key — src/object/lob_locator.cpp
const char *
lob_locator_key (const char *locator)
{
  return std::strrchr (locator, '.') + 1;          // points just past last '.'
}

const char *
lob_locator_meta (const char *locator)
{
  return std::strrchr (locator, PATH_SEPARATOR);   // points at last '/'
}
```

key is the unique-name suffix (<unum>_<rand>), used to identify the locator inside the per-transaction tree. meta is the directory portion, which the rollback path uses to rename a locator back to its pre-savepoint identity (see §“Savepoint stack” below).
The <metaname> segment between meta and key carries human-readable context — typically <schema>.<table>. The deck calls this “db info, e.g. dba.t1”. On creation the metaname is the temporary sentinel ces_temp until the locator graduates to permanent at commit (see §“Create flow”).
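The strrchr arithmetic is easy to sanity-check against a sample locator. A self-contained rendition of the two helpers (free-standing illustrative names, with PATH_SEPARATOR assumed to be '/'):

```cpp
// Sketch of the key/meta split, assuming PATH_SEPARATOR == '/'.
#include <cassert>
#include <cstring>

// same shape as lob_locator_key: everything past the last '.'
const char *
key_of (const char *locator)
{
  return std::strrchr (locator, '.') + 1;
}

// same shape as lob_locator_meta: the last '/' and everything after it
const char *
meta_of (const char *locator)
{
  return std::strrchr (locator, '/');
}
```

On `file:/lob/ces_007/ces_011/dba.t1.123_4567`, key_of yields `123_4567` and meta_of yields `/dba.t1.123_4567` — pointers into the original string, no allocation.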
The state machine — six states, one transition table
Every locator the engine touches is in one of six states. The
canonical reference is the comment block at the top of
lob_locator.hpp:
```cpp
// enum lob_locator_state — src/object/lob_locator.hpp
/*
 * locator   | created               | deleted
 * ----------|-----------------------|--------------------------
 * in-tran   | LOB_TRANSIENT_CREATED | LOB_UNKNOWN (s1)
 * permanent | LOB_PERMANENT_CREATED | LOB_PERMANENT_DELETED (s3)
 * out-tran  | LOB_UNKNOWN           | LOB_UNKNOWN
 *           | LOB_UNKNOWN           | LOB_TRANSIENT_DELETED (s4)
 *
 * s1: create transient locator and delete it
 *     LOB_TRANSIENT_CREATED -> LOB_UNKNOWN
 * s2: create transient locator and bind it to a row
 *     LOB_TRANSIENT_CREATED -> LOB_PERMANENT_CREATED
 * s3: bind transient locator to a row and delete the locator
 *     LOB_PERMANENT_CREATED -> LOB_PERMANENT_DELETED
 * s4: delete a locator created out of transaction
 *     LOB_UNKNOWN -> LOB_TRANSIENT_DELETED
 */
enum lob_locator_state
{
  LOB_UNKNOWN,
  LOB_TRANSIENT_CREATED,
  LOB_TRANSIENT_DELETED,
  LOB_PERMANENT_CREATED,
  LOB_PERMANENT_DELETED,
  LOB_NOT_FOUND
};
```

The LOB_TRANSIENT_* states mean “the file exists on disk but no row in any committed table points at it yet” — a transaction has called elo_create() or asked to delete a still-uncommitted file. The LOB_PERMANENT_* states mean “a row in a committed (or about-to-commit) table points at this locator”. LOB_UNKNOWN and LOB_NOT_FOUND are sentinel results from lob_locator_find when the caller’s locator does not appear in this transaction’s tree.
The state machine is what the commit/rollback walker (§“Commit and rollback dispatch” below) reads to decide what to do.
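Read as code, the four labelled transitions amount to a small predicate. A hypothetical sketch (short state names standing in for the real enumerators):

```cpp
// Hypothetical sketch: the s1–s4 transitions from the lob_locator.hpp
// comment, expressed as an allowed-transition check.
#include <cassert>

enum lls { UNKNOWN, T_CREATED, T_DELETED, P_CREATED, P_DELETED, NOT_FOUND };

bool
allowed (lls from, lls to)
{
  switch (from)
    {
    case T_CREATED:
      // s1: create then delete; s2: create then bind to a row
      return to == UNKNOWN || to == P_CREATED;
    case P_CREATED:
      // s3: bound locator is deleted
      return to == P_DELETED;
    case UNKNOWN:
      // s4: delete a locator created out of transaction
      return to == T_DELETED;
    default:
      return false;
    }
}
```

Anything outside these arrows (e.g. resurrecting a LOB_PERMANENT_DELETED locator) is not a legal state change in the documented table.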
Per-TDES locator tree — red-black, hashed key
Every transaction that touches a LOB grows a red-black tree on its own TDES:
```cpp
// lob_locator_entry / lob_savepoint_entry — src/transaction/transaction_transient.cpp
struct lob_savepoint_entry
{
  LOB_LOCATOR_STATE state;
  LOG_LSA savept_lsa;          // savepoint at which this state was set
  char locator[ES_URI];
  lob_savepoint_entry *prev;   // savepoint stack
};

struct lob_locator_entry
{
  RB_ENTRY (lob_locator_entry) head;
  lob_savepoint_entry *top;    // top of the savepoint stack
  int key_hash;                // mht_5strhash of m_key, for fast compare
  std::string m_key;           // the <unum>_<rand> suffix
};
```

The compare function looks at key_hash first and only falls through to a std::string::compare on the key when the hashes match:

```cpp
// lob_locator_cmp — src/transaction/transaction_transient.cpp
static int
lob_locator_cmp (const lob_locator_entry *e1, const lob_locator_entry *e2)
{
  if (e1->key_hash != e2->key_hash)
    return e1->key_hash - e2->key_hash;
  return e1->m_key.compare (e2->m_key);
}
```

This is the standard “hash-then-compare” trick: most lookups exit on the cheap integer compare; only collisions pay for string::compare.
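A stripped-down rendition of the comparator (the `entry_sketch` type is illustrative, not the CUBRID struct) makes the two paths visible:

```cpp
// Hypothetical sketch of hash-then-compare ordering: an integer hash
// decides most comparisons; the string is consulted only on collision.
#include <cassert>
#include <string>

struct entry_sketch
{
  int key_hash;
  std::string key;
};

int
cmp_sketch (const entry_sketch &a, const entry_sketch &b)
{
  if (a.key_hash != b.key_hash)
    return a.key_hash - b.key_hash;   // cheap path: ints differ
  return a.key.compare (b.key);       // collision: full string compare
}
```

The ordering is total as long as the hash is deterministic, which is all a red-black tree needs; the hash does not have to be collision-free, only cheap.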
Savepoint stack — pop on partial rollback
xtx_change_state_of_locator is called every time a locator’s state
changes (e.g. LOB_TRANSIENT_CREATED → LOB_PERMANENT_CREATED at
commit-binding, or rename during update). Crucially, if the change
happens at a savepoint LSA strictly later than the entry’s last
savepoint, it pushes the prior state onto the savepoint stack
rather than overwriting it:
```cpp
// xtx_change_state_of_locator — src/transaction/transaction_transient.cpp (condensed)
last_lsa = LSA_GE (&tdes->savept_lsa, &tdes->topop_lsa) ? tdes->savept_lsa : tdes->topop_lsa;

if (LSA_LT (&entry->top->savept_lsa, &last_lsa))
  {
    lob_savepoint_entry *savept = new lob_savepoint_entry ();
    savept->state = entry->top->state;
    savept->savept_lsa = entry->top->savept_lsa;
    std::strcpy (savept->locator, entry->top->locator);
    savept->prev = entry->top;
    entry->top = savept;   // push
  }

if (new_locator != NULL)
  strlcpy (entry->top->locator, new_locator, sizeof (ES_URI));
entry->top->state = state;
entry->top->savept_lsa = last_lsa;
```

When the transaction rolls back to a savepoint LSA, tx_lob_locator_clear walks the stack popping entries whose savept_lsa >= rollback LSA, calling es_rename_file for each pop where the locator string changed (so the on-disk filename is restored to its pre-rollback identity). The walker is the only place in CUBRID that touches es_rename_file for LOBs.
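The push/pop discipline can be modelled with a plain vector standing in for the linked stack. Everything below (`locator_sketch`, `sv_frame`, integer states and LSAs, `demo_rollback`) is hypothetical illustration, not CUBRID source:

```cpp
// Hypothetical sketch: push the prior (state, lsa) when a change crosses
// a savepoint boundary; pop frames at or above the rollback LSA.
#include <cassert>
#include <vector>

struct sv_frame
{
  int state;
  long savept_lsa;
};

struct locator_sketch
{
  std::vector<sv_frame> stack;   // back() plays the role of entry->top

  void change (int new_state, long lsa)
  {
    if (stack.empty ())
      stack.push_back (sv_frame{new_state, lsa});
    else if (stack.back ().savept_lsa < lsa)
      {
        sv_frame prior = stack.back ();   // crossing a savepoint: push prior state
        stack.push_back (prior);
      }
    // same-savepoint changes just overwrite the top
    stack.back ().state = new_state;
    stack.back ().savept_lsa = lsa;
  }

  int rollback_to (long lsa)   // pop frames set at or after lsa
  {
    while (stack.size () > 1 && stack.back ().savept_lsa >= lsa)
      stack.pop_back ();
    return stack.back ().state;
  }
};

// create at savepoint 10, graduate at savepoint 20, roll back to 20
int
demo_rollback ()
{
  locator_sketch e;
  e.change (1, 10);   // e.g. TRANSIENT_CREATED
  e.change (3, 20);   // e.g. PERMANENT_CREATED — pushes the prior frame
  return e.rollback_to (20);   // recovers the pre-savepoint state
}
```

The same trace in CUBRID would also restore the on-disk filename via es_rename_file whenever the popped frame carried a different locator string.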
Commit and rollback dispatch — one entry point
tx_lob_locator_clear is called from exactly four places in the
engine: commit (log_commit), abort (log_rollback), partial
rollback to a savepoint (log_rollback_to_savepoint), and end of a
nested system op (xlog_topop_end). It receives at_commit (bool)
and savept_lsa (NULL for full commit/abort, non-NULL for partial
rollback) and decides per-entry whether the file should be deleted,
renamed, or left alone:
```cpp
// tx_lob_locator_clear — src/transaction/transaction_transient.cpp (condensed)
for (entry = RB_MIN (lob_rb_root, &tdes->lob_locator_root); entry != NULL; entry = next)
  {
    next = RB_NEXT (lob_rb_root, &tdes->lob_locator_root, entry);
    need_to_delete = false;

    if (at_commit)
      {
        // anything not bound to a row at commit time is garbage
        if (entry->top->state != LOB_PERMANENT_CREATED)
          need_to_delete = true;
      }
    else   // rollback
      {
        if (savept_lsa != NULL)
          {
            // partial rollback: pop savepoint stack, rename files back
            // ... condensed: see source ...
          }
        // anything created and rolled back is garbage
        if ((savept_lsa == NULL || LSA_GE (&entry->top->savept_lsa, savept_lsa))
            && entry->top->state != LOB_TRANSIENT_DELETED)
          need_to_delete = true;
      }

    if (need_to_delete)
      {
#if defined (SERVER_MODE)
        if (at_commit && entry->top->state == LOB_PERMANENT_DELETED)
          vacuum_notify_es_deleted (thread_p, entry->top->locator);
        else
          (void) es_delete_file (entry->top->locator);
#else
        (void) es_delete_file (entry->top->locator);
#endif
        RB_REMOVE (lob_rb_root, &tdes->lob_locator_root, entry);
        lob_locator_free (entry);
      }
  }
```

The rule the dispatch encodes:
| at_commit | top->state | action |
|---|---|---|
| true | LOB_PERMANENT_CREATED | leave file on disk (it backs a committed row) |
| true | LOB_PERMANENT_DELETED | hand off to vacuum via vacuum_notify_es_deleted |
| true | LOB_TRANSIENT_* | es_delete_file directly (visible to nobody else) |
| false | LOB_TRANSIENT_DELETED | leave alone (the file the transaction tried to delete still exists) |
| false | other states | es_delete_file directly (rolled-back creation; nothing referenced it) |
The vacuum hand-off is the LOB analogue of the heap’s
vacuum_log_vacuum_record: at commit a LOB_PERMANENT_DELETED file
may still be visible to older snapshots, so the vacuum daemon
controls the actual unlink — it sees the file gone from heap and
no in-flight snapshot still references the row before reclaiming.
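The table above reads naturally as a pure function. A hypothetical sketch that encodes only the full-commit/full-abort rows (the partial-rollback LSA guard and the savepoint renames are deliberately omitted; names are illustration-only):

```cpp
// Hypothetical sketch: file disposition per (at_commit, top state),
// mirroring the five rows of the decision table.
#include <cassert>
#include <string>

enum state_s { S_UNKNOWN, S_T_CREATED, S_T_DELETED, S_P_CREATED, S_P_DELETED };

std::string
file_action (bool at_commit, state_s top)
{
  if (at_commit)
    {
      if (top == S_P_CREATED)
        return "keep";     // backs a committed row
      if (top == S_P_DELETED)
        return "vacuum";   // deferred unlink past older snapshots
      return "delete";     // transient garbage, visible to nobody else
    }
  // rollback
  if (top == S_T_DELETED)
    return "keep";         // the file the transaction tried to delete survives
  return "delete";         // rolled-back creation; nothing referenced it
}
```

The only asymmetric cell is the commit-time vacuum hand-off: every other disposition is an immediate keep-or-unlink decision.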
Create flow — elo_create end to end
A user’s INSERT INTO t VALUES (1, BIT_TO_BLOB(X'...')) reaches
elo_create after the parser/value layer has built a DB_ELO:
```c
// elo_create — src/object/elo.c (condensed)
int
elo_create (DB_ELO *elo)
{
  ES_URI out_uri;
  int ret;

  ret = es_create_file (out_uri);   // (1) backend creates "ces_temp.<unum>_<rand>"
  // ... condensed ...
  elo->locator = db_private_strdup (NULL, out_uri);
  elo->type = ELO_FBO;              // FBO = File-Backed Object
  elo->es_type = es_get_type (out_uri);

  if (ELO_NEEDS_TRANSACTION (elo))  // ES_OWFS or ES_POSIX
    {
      ret = lob_locator_add (elo->locator, LOB_TRANSIENT_CREATED);   // (2) per-TDES tree
    }
  return ret;
}

#define ELO_NEEDS_TRANSACTION(e) \
  ((e)->es_type == ES_OWFS || (e)->es_type == ES_POSIX)
```

Two non-obvious properties:
- The file is created before the locator is added to the tree — if the create fails, no per-TDES bookkeeping is needed; if the create succeeds and the `lob_locator_add` fails, an orphan file remains. The deck does not mention this and the source’s error handling does not unlink either — see §“Open questions”.
- `ES_LOCAL` does not go through transactional tracking (`ELO_NEEDS_TRANSACTION` excludes it). Read-only client-side caches have no write path, so there is nothing for commit/rollback to reconcile.
Subsequent calls in the lifecycle:
- `elo_write (elo, pos, buf, count)` → `es_write_file` (no locator state change; the file is `LOB_TRANSIENT_CREATED` and writes are free).
- The `INSERT` reaches the heap layer and the row is inserted with `elo->locator` as a column value.
- At commit, the row’s heap insert is visible; `tx_lob_locator_clear` fires; the locator is in `LOB_TRANSIENT_CREATED` state — but the row in the heap binds it. Question: who sets it to `LOB_PERMANENT_CREATED`? In current source, this is the job of `xtx_change_state_of_locator` called from `network_interface_sr.cpp:2380` (the only call site outside the module). The deck’s “rename ces_temp.xxx → dba.t1.xxx” sequence is this call. See §“Open questions” for the version-drift detail.
Read flow — elo_read and the locator round trip
A SELECT clob_to_char(c2) FROM t1 walks down through the value
layer to elo_read:
```c
// elo_read — src/object/elo.c (signature)
extern ssize_t elo_read (const DB_ELO *elo, off_t pos, void *buf, size_t count);
```

The caller passes a position and length; elo_read fans out to
es_read_file which dispatches to xes_posix_read_file /
es_owfs_read_file based on the locator’s prefix. No tree
traversal happens on read — the locator string in the heap row is
self-sufficient. Reads do not allocate per-TDES state because they do
not need to be undone: a read that the transaction later rolls back
is invisible to the on-disk file.
Pre-read size lookup goes through elo_size → es_get_file_size for
the same reason: the locator alone is enough.
Update flow — copy-then-replace, not in-place
The deck makes the update path explicit because it differs from create:
```
new directory / file creation — logic differs from create
generated_file → locator entry (LOB_TRANSIENT_CREATED)
old_file      → locator entry (LOB_TRANSIENT_DELETED)
old_file is deleted at commit
```
The reason update does not use elo_create + elo_write is that the
old locator must also be tracked. CUBRID handles this with two
locator-tree entries: one at LOB_TRANSIENT_CREATED for the new
file, one at LOB_TRANSIENT_DELETED for the old. At commit:
- `LOB_TRANSIENT_CREATED` for the new file → bound to the row → `LOB_PERMANENT_CREATED` (via `xtx_change_state_of_locator`).
- `LOB_TRANSIENT_DELETED` for the old file → vacuum hand-off (because the row that referenced it is now an MVCC dead version, and other snapshots may still need to read it).
elo_copy (in elo.c) is the shared helper used by update — it calls es_copy_file, registers the new locator, marks the old as deleted, and returns. The deck’s note that “the elo_copy() function performs the rename, locator-entry drop, and file-copy work” describes this composite behaviour.
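A minimal sketch of the double bookkeeping (hypothetical names throughout; CUBRID records the two intents in the locator tree rather than returning them):

```cpp
// Hypothetical sketch: one update produces two locator intents —
// the new file is garbage unless the row commits, the old file is
// garbage once it does.
#include <cassert>
#include <string>
#include <utility>
#include <vector>

enum upd_state { U_T_CREATED, U_T_DELETED };

std::vector<std::pair<std::string, upd_state>>
track_update (const std::string &old_loc, const std::string &new_loc)
{
  return {
    { new_loc, U_T_CREATED },   // copy target: dies on rollback
    { old_loc, U_T_DELETED },   // old version: dies (via vacuum) on commit
  };
}
```

Either outcome leaves exactly one of the two files alive, which is the invariant copy-then-replace is buying.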
Hash directory layout — bounded files per directory
The POSIX backend distributes files across two hash levels so no single directory holds too many entries:
```c
// es_get_unique_name — src/storage/es_posix.c (condensed)
static void
es_get_unique_name (char *dirname1, char *dirname2, const char *metaname, char *filename)
{
  UINT64 unum;
  int hashval, r;

  r = (rand () < 0) ? -rand () : rand ();
  unum = es_get_unique_num ();   // microsecond-precision time

  snprintf (filename, NAME_MAX, "%s.%020llu_%04d", metaname, unum, r % 10000);

  hashval = es_name_hash_func (ES_POSIX_HASH1, filename);
  snprintf (dirname1, NAME_MAX, "ces_%03d", hashval);

  hashval = es_name_hash_func (ES_POSIX_HASH2, filename);
  snprintf (dirname2, NAME_MAX, "ces_%03d", hashval);
}
```

ES_POSIX_HASH1 and ES_POSIX_HASH2 are the bucket counts at each
level. The hash is mht_5strhash(filename) mod bucket count, so
sibling files in the same metaname (e.g. dba.t1.*) end up scattered
across both levels — uniform distribution by design.
The deck flags the design’s tradeoff cleanly: hash dirs reduce
files-per-directory and ease lock contention on readdir-style
operations, but the backup story is harder (every LOB lives in a
different leaf directory) and any administrative scan must walk both
levels. CUBRID lives with this tradeoff because the DB-volume
alternative (BLOB-as-segment) would push every LOB through the page
buffer and WAL.
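The bucket naming is easy to reproduce. A hypothetical sketch with a toy hash standing in for mht_5strhash (only the `ces_%03d` formatting and the mod-by-bucket-count mirror the source; `toy_hash` and `bucket_dir` are illustration-only names):

```cpp
// Hypothetical sketch: hash the filename per level and format the
// directory as ces_%03d, as es_get_unique_name does.
#include <cassert>
#include <cstdio>
#include <string>

// toy stand-in for mht_5strhash: any stable string hash works here
unsigned
toy_hash (const std::string &s)
{
  unsigned h = 0;
  for (char c : s)
    h = h * 31 + static_cast<unsigned char> (c);
  return h;
}

std::string
bucket_dir (const std::string &filename, unsigned buckets)
{
  char buf[16];
  std::snprintf (buf, sizeof buf, "ces_%03u", toy_hash (filename) % buckets);
  return buf;
}
```

Calling it twice with the two bucket counts yields the `ces_<H1>/ces_<H2>` pair; because the hash is over the whole filename, siblings from the same table scatter uniformly across both levels.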
Source Walkthrough
Anchor on symbol names, not line numbers. Function names survive most refactors; line numbers drift the moment someone reformats a header.
ES front and dispatch
- `es_init`, `es_final` (in `es.c`) — choose backend by URI prefix on first call; tear down on shutdown.
- `es_get_type`, `es_get_type_string` (in `es_common.c`) — URI prefix → `ES_TYPE` enum and back.
- `es_create_file`, `es_read_file`, `es_write_file`, `es_delete_file`, `es_copy_file`, `es_rename_file` (in `es.c`) — public API the rest of the engine calls. Each switches on `es_initialized_type` and forwards to the chosen backend.
POSIX backend
- `xes_posix_create_file`, `xes_posix_write_file`, `xes_posix_read_file`, `xes_posix_delete_file`, `xes_posix_rename_file`, `xes_posix_copy_file` (in `es_posix.c`) — server-side implementations.
- `es_posix_create_file` etc. (without the `x` prefix, in `es_posix.c`) — client-side stubs that RPC to the server via `network_interface_cl.h`.
- `es_get_unique_name` (in `es_posix.c`) — file-name-and-hash-dir generator.
- `es_make_dirs` (in `es_posix.c`) — `mkdir -p` for the two-level hash directory.
OWFS backend
- `es_owfs_create_file`, `es_owfs_write_file`, etc. (in `es_owfs.c`) — OWFS object-storage backend; same surface as POSIX.
LOB locator state machine
- `enum lob_locator_state` (in `lob_locator.hpp`) — six states with inline transition table.
- `lob_locator_is_valid`, `lob_locator_key`, `lob_locator_meta` (in `lob_locator.cpp`) — locator-string parsing.
- `lob_locator_add`, `lob_locator_change_state`, `lob_locator_drop`, `lob_locator_find` (in `lob_locator.cpp`) — public wrappers that dispatch to the server-side `xtx_*` functions or to RPC stubs in client mode.
Per-TDES tracking
- `struct lob_rb_root`, `struct lob_locator_entry`, `struct lob_savepoint_entry` (in `transaction_transient.cpp`) — the data structures.
- `xtx_add_lob_locator` — RB-insert at first state-change.
- `xtx_change_state_of_locator` — savepoint-stack push + state update.
- `xtx_drop_lob_locator` — RB-remove (used by callers that decide a locator is no longer interesting before transaction end, e.g. client clean-up of failed creates).
- `xtx_find_lob_locator` — RB-find by hashed key; returns the top state.
- `tx_lob_locator_clear` — the commit/rollback walker.
- `lob_locator_cmp` — hash-then-compare comparator.
ELO API
- `elo_create`, `elo_copy`, `elo_copy_with_prefix`, `elo_delete`, `elo_size`, `elo_read`, `elo_write` (in `elo.c`) — the per-row-value layer that orchestrates ES + locator-tree.
- `elo_init_structure`, `elo_copy_structure`, `elo_free_structure` (in `elo.c`) — `DB_ELO` lifecycle.
- `ELO_NEEDS_TRANSACTION` macro — distinguishes OWFS/POSIX (tracked) from LOCAL (untracked).
Vacuum hand-off
- `vacuum_notify_es_deleted` (declared in the vacuum subsystem, called from `tx_lob_locator_clear`) — queues a `LOB_PERMANENT_DELETED` file for asynchronous unlink so vacuum can delay the unlink past in-flight snapshots.
Position hints as of this revision
Each line is the function-definition line of the symbol. The §“Source verification” bullets quote sub-ranges within the same function bodies; cross-check by counting from the symbol’s definition line.
| Symbol | File | Line |
|---|---|---|
| enum lob_locator_state | src/object/lob_locator.hpp | 53 |
| lob_locator_key | src/object/lob_locator.cpp | 56 |
| lob_locator_meta | src/object/lob_locator.cpp | 62 |
| lob_locator_add | src/object/lob_locator.cpp | 90 |
| lob_locator_change_state | src/object/lob_locator.cpp | 107 |
| xtx_add_lob_locator | src/transaction/transaction_transient.cpp | 174 |
| xtx_find_lob_locator | src/transaction/transaction_transient.cpp | 210 |
| xtx_change_state_of_locator | src/transaction/transaction_transient.cpp | 245 |
| xtx_drop_lob_locator | src/transaction/transaction_transient.cpp | 308 |
| tx_lob_locator_clear | src/transaction/transaction_transient.cpp | 374 |
| lob_locator_cmp | src/transaction/transaction_transient.cpp | 477 |
| elo_create | src/object/elo.c | 85 |
| ELO_NEEDS_TRANSACTION | src/object/elo.c | 71 |
| es_create_file | src/storage/es.c | 142 |
| es_get_unique_name | src/storage/es_posix.c | 78 |
| es_get_type | src/storage/es_common.c | 45 |
| enum ES_TYPE | src/storage/es_common.h | 28 |
Source verification (as of 2026-05-01)
Each entry leads with a fact about the current source. The trailing note shows the verification trail and any historical drift. Open questions follow as the curator’s recorded gaps.
Verified facts

- The ES layer supports exactly three backends. Verified in `src/storage/es_common.h` on 2026-05-01: `ES_TYPE` is `{ES_NONE, ES_OWFS, ES_POSIX, ES_LOCAL}`. The deck lists the same three backends in the order POSIX/OWFS/LOCAL; current source numbers them OWFS `0`, POSIX `1`, LOCAL `2`, with `ES_NONE = -1` as the uninitialized sentinel.
- `ELO_NEEDS_TRANSACTION` excludes `ES_LOCAL` from the per-TDES tree. Verified in `src/object/elo.c:71` on 2026-05-01: the macro is `(es_type == ES_OWFS || es_type == ES_POSIX)`. The deck does not mention this asymmetry; readers reaching the locator tree from the LOB column type would mistakenly assume all locators are tracked.
- The locator-state transition table in `lob_locator.hpp` is authoritative and matches the deck. Verified by reading the comment block at `src/object/lob_locator.hpp:26-52` (immediately preceding `enum lob_locator_state` at line 53) on 2026-05-01. The four labelled transitions (`s1` through `s4`) in the comment are the same four cases the deck illustrates with INSERT / UPDATE / DELETE / abort examples.
- Commit-time `LOB_PERMANENT_DELETED` files are queued for vacuum, not unlinked synchronously. Verified inside `tx_lob_locator_clear` (`src/transaction/transaction_transient.cpp:374`, body lines 443-457) on 2026-05-01: the `vacuum_notify_es_deleted (thread_p, entry->top->locator)` branch fires only for `(at_commit && state == LOB_PERMANENT_DELETED)` in `SERVER_MODE`. Other delete paths call `es_delete_file` directly. The deck describes the commit-time delete as immediate (“delete is performed at commit”), which is not accurate for the deleted-permanent case.
- Hash directories are two-level by `mht_5strhash` of the filename. Verified in `src/storage/es_posix.c:104-108` on 2026-05-01: `ES_POSIX_HASH1` and `ES_POSIX_HASH2` give the bucket counts; the dirname format is `ces_%03d` (zero-padded three-digit hash, so each level supports up to 1000 buckets). A `CUBRID_OWFS_POSIX_TWO_DEPTH_DIRECTORY` build flag (line 124) controls whether the second level is actually nested or flattened — most builds set it.
- Per-locator savepoint stack pushes only when the savepoint LSA has advanced. Verified inside `xtx_change_state_of_locator` (`src/transaction/transaction_transient.cpp:245`, body lines 273-283) on 2026-05-01: the `LSA_LT (&entry->top->savept_lsa, &last_lsa)` guard means same-savepoint state changes overwrite, but cross-savepoint changes push a new `lob_savepoint_entry`. The deck does not surface this optimization.
Open questions
Section titled “Open questions”
- Who calls xtx_change_state_of_locator to graduate a locator from TRANSIENT_CREATED to PERMANENT_CREATED? The only non-test reference outside the module is src/communication/network_interface_sr.cpp:2380. The deck describes this as part of elo_copy() and the commit handler. Trace the actual path from INSERT execution to the state change; document whether it happens at heap insert time, at row commit time, or via a separate client-driven RPC. Investigation path: trace the sr-side stub from xes_posix_create_file callers; search for LOB_PERMANENT_CREATED writes.
- Orphan-file behaviour on lob_locator_add failure after es_create_file succeeds. elo_create does not unlink the file if the locator-tree insert fails. Investigation path: read lob_locator_add failure modes (the only error is ER_LOG_UNKNOWN_TRANINDEX) and check whether the caller chain ever sees a successful create plus a failed add; if it does, this is an orphan-leak corner case worth a CBRD ticket.
- What does LOB_NOT_FOUND mean to the heap layer when a locator in a row is not present in the per-TDES tree? Verified that xtx_find_lob_locator returns LOB_NOT_FOUND and copies the input locator unchanged, but did not trace how the heap-side elo_read/elo_size reacts. If the answer is “fall through to the on-disk file”, that is correct for read-after-commit; if the answer is “error”, concurrent rollbacks could surface spurious failures. Investigation path: trace heap → elo_read → ES dispatch and check whether the locator-tree state is even consulted on reads.
- Two-level vs. flattened hash directories. The CUBRID_OWFS_POSIX_TWO_DEPTH_DIRECTORY macro (es_posix.c:124) gates the second level. The default is enabled in current builds, but a flag-controlled fallback exists. Investigation path: git log -S CUBRID_OWFS_POSIX_TWO_DEPTH_DIRECTORY to find when single-level was the default; check whether any active deployment still uses it.
- Behaviour of xes_posix_rename_file on partial rollback when the on-disk filename has already been observed by another replica. HA replication ships log records; it does not ship filesystem renames. If a replica has read the post-rename locator and the primary then rolls back the rename, the replica’s on-disk path no longer exists. Investigation path: trace HA’s LOB handling, search for lob_locator in HA paths, and check whether physical replication carries ES-side actions.
- Deprecation status of OWFS. The deck and current source keep OWFS as a first-class backend, but every deployment we have visibility on uses POSIX. Investigation path: git log src/storage/es_owfs.c to look for new development; ask whether any 11.x customer has OWFS in production.
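For intuition on the hashed directory layout questioned above, here is a hedged sketch of a two-level ces_%03d path computation. Only the ces_%03d naming and the two-level nesting come from the source; the hash function (toy_strhash, a stand-in for mht_5strhash) and the way both levels are derived from one hash value are our assumptions for illustration. The real bucket counts come from ES_POSIX_HASH1 and ES_POSIX_HASH2:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Stand-in for CUBRID's mht_5strhash; the real hash algorithm differs.
static unsigned int toy_strhash(const std::string &s) {
  unsigned int h = 0;
  for (unsigned char c : s) h = h * 31u + c;
  return h;
}

// Two-level bucket path. Each level is a ces_%03d directory, so the
// zero-padded three-digit name caps a level at 1000 buckets.
static std::string bucket_path(const std::string &base, const std::string &filename,
                               unsigned int hash1 = 1000, unsigned int hash2 = 1000) {
  unsigned int h = toy_strhash(filename);
  char dir1[16], dir2[16];
  std::snprintf(dir1, sizeof dir1, "ces_%03u", h % hash1);
  std::snprintf(dir2, sizeof dir2, "ces_%03u", (h / hash1) % hash2);
  return base + "/" + dir1 + "/" + dir2 + "/" + filename;
}
```

The point of the layout is simply to bound directory fan-out: with two levels of up to 1000 buckets each, a million LOB files still average one file per leaf directory.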
Beyond CUBRID — Comparative Designs & Research Frontiers
Section titled “Beyond CUBRID — Comparative Designs & Research Frontiers”
Pointers, not analysis. Each bullet is a starting handle for a follow-up doc; depth here is intentionally shallow.
- PostgreSQL TOAST — The Oversized-Attribute Storage Technique, as documented in the PostgreSQL internals documentation. Oversized attribute payloads are chunked into a per-table TOAST table with the same MVCC visibility rules as ordinary rows. CUBRID’s external-file approach trades MVCC integration for simpler per-LOB I/O; a comparison would quantify the cost of vacuum-driven TOAST chunk reclamation against ES’s commit-time unlink.
- PostgreSQL Large Objects (lo_*) — separate from TOAST, this is the older pg_largeobject system table. The locator is an OID, the payload is sliced into 2 KB internal pages, and everything is fully WAL-protected. CUBRID’s choice of out-of-WAL ES storage trades some recovery semantics for I/O performance and simpler LOB copy.
- Oracle SecureFiles — Oracle’s modern LOB engine adds in-place updates, deduplication, encryption, and compression to the older BasicFiles. The relevant CUBRID comparison is the encryption story (CUBRID’s TDE does not currently encrypt ES files; see cubrid-tde.md once written).
- Oracle BFILE — Oracle’s read-only out-of-database file pointer; the closest analogue to CUBRID’s ES_LOCAL. A side-by-side would clarify whether ES_LOCAL is actually used in production, given the deck does not mention it.
- MySQL InnoDB off-page columns — InnoDB stores large columns on overflow pages within the same tablespace; ROW_FORMAT=DYNAMIC pushes long columns entirely off-page with a 20-byte pointer. Closer to TOAST than to ES, but with a different page-format story. A comparison would highlight CUBRID’s design choice of sidestepping the page format entirely.
- Lehman & Lindsay, “The Starburst Long Field Manager” (VLDB 1989) — the earliest paper to articulate transactional out-of-row LOB management. The savepoint-stack pattern in lob_savepoint_entry is recognizably descended from Starburst’s multi-level recovery; a re-reading would clarify whether CUBRID’s stack semantics match Starburst’s exactly or diverge.
- Object stores as LOB backends (S3, GCS, MinIO) — the modern successor to OWFS. None of the cloud-native object stores support filesystem rename atomicity, which would force a redesign of the rollback path (CUBRID currently uses es_rename_file for partial-rollback file renames). A research-grade follow-up would map CUBRID’s rename-driven rollback to a versioned-PUT object-store equivalent.
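To make the rename-atomicity point concrete, the sketch below shows the write-under-a-temporary-name-then-rename pattern a POSIX backend can rely on. The ces_temp prefix follows the deck's naming; the function name is ours, not CUBRID's es_* API. The final step is the one an object store cannot reproduce, since its "rename" is copy-then-delete, two separately failable operations:

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Sketch only. POSIX rename() is atomic within one filesystem, so the LOB
// file flips between its temporary and permanent names with no window in
// which readers see a half-written or missing file.
void write_then_publish(const fs::path &dir, const std::string &name,
                        const std::string &payload) {
  fs::create_directories(dir);
  const fs::path tmp = dir / ("ces_temp." + name);  // ces_temp prefix per the deck
  std::ofstream(tmp, std::ios::binary) << payload;  // write under the temp name
  fs::rename(tmp, dir / name);                      // atomic publish
}
```

A versioned-PUT equivalent would have to replace the rename with an object version switch plus deferred garbage collection of the superseded version, which is exactly the redesign the bullet above anticipates.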
Sources
Section titled “Sources”
Raw analyses (under raw/code-analysis/cubrid/storage/lob/)
Section titled “Raw analyses (under raw/code-analysis/cubrid/storage/lob/)”
LOB 세미나.pptx (“LOB Seminar”) — single-deck seminar by 인치준 (CUBRID development team 2). Covers the CRUD logic at the locator level with example SQL, the ces_temp rename trick, the per-TDES locator entry list, and the hashdir structure tradeoffs.
Textbook chapters (under knowledge/research/dbms-general/)
Section titled “Textbook chapters (under knowledge/research/dbms-general/)”
- Database Internals (Petrov), Ch. 2 “File Formats” — out-of-page records, indirection vs. inline.
- Database System Concepts (Silberschatz et al.), §13.5 “Record Organization” — BLOB/CLOB semantics.
External workspace pages
Section titled “External workspace pages”
- (None cited — this doc was assembled from raw deck + source.)
CUBRID source (under /data/hgryoo/references/cubrid/)
Section titled “CUBRID source (under /data/hgryoo/references/cubrid/)”
- src/object/elo.h, src/object/elo.c
- src/object/lob_locator.hpp, src/object/lob_locator.cpp
- src/storage/es.h, src/storage/es.c
- src/storage/es_common.h, src/storage/es_common.c
- src/storage/es_posix.h, src/storage/es_posix.c
- src/storage/es_owfs.h, src/storage/es_owfs.c
- src/transaction/transaction_transient.hpp, src/transaction/transaction_transient.cpp
- src/compat/db_elo.h, src/compat/db_elo.c (client-side wrapper)