
CUBRID LOB — External Storage, Locator Lifecycle, and Transactional Cleanup


A large object (LOB) — BLOB for binary, CLOB for character — is a column type whose payload is too large to keep inline with the rest of the row without distorting the storage strategy. Tables are tuned for short rows that pack many records per page; a 50 MB image as the fourth column undoes that tuning. Every relational engine therefore splits LOB storage from the surrounding row.

Database textbooks frame the design space along three axes:

  1. Where does the byte payload live? Inside the same heap file (with row-level chunking and continuation pointers), in a separate internal LOB file, or outside the database entirely on the host filesystem (or an object store).
  2. What does the row carry? A direct pointer to the payload, an indirect handle (locator) the engine resolves on demand, or both (handle for small LOBs, indirection for large ones).
  3. How does the engine reconcile transactional semantics with the storage choice? When an INSERT is rolled back, the LOB write must disappear. When a DELETE commits, the bytes must eventually be reclaimed. If the bytes live outside the WAL-protected data volume, the engine cannot rely on log undo/redo to fix them up — it has to intercept commit/abort itself.

The textbook reference is Database Internals (Petrov, 2019), Ch. 2 “File Formats”, which discusses out-of-page record storage; Database System Concepts (Silberschatz et al., 7e), §13.5 “Record Organization”, covers BLOB/CLOB semantics. Lehman and Lindsay’s “The Starburst Long Field Manager” (VLDB 1989) is the early reference for transaction-aware out-of-row LOB management.

CUBRID makes the third choice on axis 1 (filesystem outside the data volume), the second on axis 2 (a locator URI string in the record), and explicitly handles axis 3 in the transaction commit/abort path through a per-transaction tracking structure. The next two sections explain why each of those choices is sensible and how CUBRID implements them.

Out-of-row LOB storage is a well-trodden pattern. Almost every production engine adopts some subset of the following conventions, and CUBRID’s specific choices are best read as one set of dials within this shared space.

A LOB-bearing column stores a locator — a fixed-size string or struct — rather than the bytes. The locator is enough for the engine to find the bytes when a query asks for them, and small enough that it pages with the rest of the row. PostgreSQL’s lo_oid, Oracle’s LOB locator, and CUBRID’s DB_ELO.locator URI all play this role.

The locator format usually carries three pieces of information: (a) which storage backend to dispatch to (filesystem, object store, internal), (b) the address inside that backend (a path, an OID, a hash), and (c) lightweight metadata so the engine can sanity-check or reject the locator without dereferencing it (size, type, owning class). The locator is opaque to the SQL layer.

Background reclamation, not synchronous delete


When a transaction commits a DELETE, the LOB cannot be unlinked from the filesystem at the same moment as the row’s MVCC delete bit is set. Other concurrent snapshots may still be reading the old version. Engines therefore defer the unlink: they record the intent (“delete this locator at commit”), let the transaction finish, and then (synchronously at commit, or asynchronously via a vacuum-style daemon) actually remove the file. This is the LOB analogue of MVCC’s dead-tuple reclamation.

A naive LOB-as-file scheme runs into filesystem scalability limits: a single directory holding millions of files becomes hostile both to the kernel’s readdir and to backup tools. Every engine that uses the host filesystem responds with the same trick: hash the filename into one or two levels of intermediate directories, so each leaf directory holds a bounded number of files. PostgreSQL’s pg_largeobject uses internal pages instead and skips the issue; engines that do go to the filesystem (Oracle BFILE, MySQL with file-per-LOB extensions, CUBRID’s POSIX/OWFS backends) use the hash-dir layout.

The defer-and-clean-up policy needs a place to record which locators are pending. The standard answer is a per-transaction list (or tree) of (locator, intended action) pairs hung off the transaction descriptor. Commit walks the list and applies the intent; rollback walks the same list with the opposite intent. The list lives in volatile memory (it is regenerated on demand from log records during recovery, but in the steady state it is just bookkeeping).

Subtransactions and savepoints add a wrinkle: a locator can be created at savepoint A, mutated at savepoint B, and partially rolled back to A. The textbook implementation gives each locator entry a stack of prior states, popping back through them as the rollback LSA crosses each savepoint boundary. Few systems document this explicitly; CUBRID’s lob_savepoint_entry is one of the cleaner implementations.

Theory | CUBRID name
Locator (handle pointing into a backend) | DB_ELO.locator — URI string <scheme>:<backend-specific> (elo.c)
Backend-selector prefix in the locator | ES_OWFS_PATH_PREFIX = "owfs:", ES_POSIX_PATH_PREFIX = "file:", ES_LOCAL_PATH_PREFIX = "local:" (es_common.h)
Per-storage-type backend | es_owfs_* (OWFS — Object-style FS), xes_posix_* (POSIX — host filesystem), es_local_* (local read-only client cache) (es*.c)
Two-level hash dir | ces_<HASH1>/ces_<HASH2>/<file> via es_get_unique_name and ES_POSIX_HASH1 / ES_POSIX_HASH2 (es_posix.c)
LOB state machine | enum lob_locator_state — six states; the transition table is documented inline in lob_locator.hpp
Per-TDES locator list | tdes->lob_locator_root — a red-black tree (lob_rb_root), keyed by hash + key portion of the locator (transaction_transient.cpp)
Savepoint stack per locator | lob_locator_entry::top — a lob_savepoint_entry linked list pushed at state-change, popped at partial rollback (same file)
Commit/rollback walker | tx_lob_locator_clear (tdes, at_commit, savept_lsa) — single entry point called from log_manager.c (commit, abort) and transaction_sr.c (sysop end)
Background reclamation hand-off | vacuum_notify_es_deleted — at commit, LOB_PERMANENT_DELETED files are queued for vacuum rather than unlinked synchronously

CUBRID has four moving parts on the LOB path: the ES (External Storage) layer that abstracts where the bytes live, the locator URI that names a single LOB, the per-TDES locator tree that remembers which locators this transaction touched, and the commit/rollback walker that turns that tree into actual filesystem operations. We walk them in that order.

flowchart LR
  subgraph SQL["SQL row"]
    R["heap record<br/>(...) ... locator='file:/.../ces_007/dba.t1.123_4567' ..."]
  end
  subgraph TDES["Per-transaction state (TDES)"]
    RB["lob_locator_root<br/>(red-black tree)"]
    E1["lob_locator_entry<br/>key, hash, top"]
    E2["lob_locator_entry<br/>key, hash, top"]
    SV["savepoint stack<br/>state, savept_lsa"]
    RB --> E1
    RB --> E2
    E1 --> SV
  end
  subgraph ES["ES layer (es.c, es_init/es_create_file/...)"]
    POS["es_posix_*<br/>host filesystem"]
    OWFS["es_owfs_*<br/>object-style FS"]
    LOC["es_local_*<br/>client-side cache"]
  end
  DISK[("filesystem<br/>ces_&lt;H1&gt;/ces_&lt;H2&gt;/&lt;file&gt;")]
  R --> RB
  ES --> DISK
  POS --> DISK
  OWFS --> DISK
  TX[("commit / rollback")] --> CLR["tx_lob_locator_clear"]
  CLR --> RB
  CLR --> ES

The figure encodes the three boundaries that matter for correctness: (SQL boundary) the row carries a locator URI, never bytes; (TDES boundary) every state-changing locator operation funnels through xtx_add_lob_locator / xtx_change_state_of_locator / xtx_drop_lob_locator so the per-transaction tree is the single source of truth; (ES boundary) every filesystem call goes through the es.c dispatch front so the backend choice (OWFS / POSIX / LOCAL) is one switch in one place.

The ES layer — backend dispatch on a URI prefix


Every LOB byte read or write goes through es.c. The module is initialized once with es_init(uri); the URI’s prefix determines which backend the rest of the engine routes to.

// ES_TYPE — src/storage/es_common.h
typedef enum
{
  ES_NONE = -1,
  ES_OWFS = 0,
  ES_POSIX = 1,
  ES_LOCAL = 2
} ES_TYPE;

#define ES_OWFS_PATH_PREFIX  "owfs:"
#define ES_POSIX_PATH_PREFIX "file:"
#define ES_LOCAL_PATH_PREFIX "local:"

// es_create_file — src/storage/es.c (condensed)
int
es_create_file (char *out_uri)
{
  // ... condensed ...
  if (es_initialized_type == ES_OWFS)
    {
      memcpy (out_uri, ES_OWFS_PATH_PREFIX, sizeof (ES_OWFS_PATH_PREFIX));
      ret = es_owfs_create_file (ES_OWFS_PATH_POS (out_uri));
    }
  else if (es_initialized_type == ES_POSIX)
    {
      memcpy (out_uri, ES_POSIX_PATH_PREFIX, sizeof (ES_POSIX_PATH_PREFIX));
#if defined (CS_MODE)
      ret = es_posix_create_file (ES_POSIX_PATH_POS (out_uri));
#else
      ret = xes_posix_create_file (ES_POSIX_PATH_POS (out_uri));
#endif
    }
  // ... condensed (ES_LOCAL is read-only, no create) ...
}

Two properties fall out of this dispatch. (a) ES_LOCAL has no create path — it is only a client-side read cache for files staged elsewhere; the deck does not mention it because the deck is server-focused. (b) Inside CUBRID’s server, POSIX paths are served by xes_posix_* — the x-prefixed variants are the in-server implementations, while the bare es_posix_* symbols are RPC stubs the client invokes via network_interface_cl.h.

A locator is a zero-terminated string with a fixed shape:

file:/<base-dir>/ces_<HASH1>/ces_<HASH2>/<metaname>.<unum>_<rand>

Two helpers in lob_locator.cpp parse it without allocating:

// lob_locator_meta / lob_locator_key — src/object/lob_locator.cpp
const char *
lob_locator_key (const char *locator)
{
  return std::strrchr (locator, '.') + 1;        // points just past last '.'
}

const char *
lob_locator_meta (const char *locator)
{
  return std::strrchr (locator, PATH_SEPARATOR); // points at last '/'
}

key is the unique-name suffix (<unum>_<rand>), used to identify the locator inside the per-transaction tree. meta is the directory portion, which the rollback path uses to rename a locator back to its pre-savepoint identity (see §“Savepoint stack” below).

The <metaname> segment between meta and key carries human-readable context — typically <schema>.<table>. The deck calls this "db information, e.g. dba.t1". On creation the metaname is the temporary sentinel ces_temp until the locator graduates to permanent at commit (see §“Create flow”).

The state machine — six states, one transition table


Every locator the engine touches is in one of six states. The canonical reference is the comment block at the top of lob_locator.hpp:

// enum lob_locator_state — src/object/lob_locator.hpp
/*
* locator | created | deleted
* ----------|-----------------------|--------------------------
* in-tran | LOB_TRANSIENT_CREATED | LOB_UNKNOWN (s1)
* permanent | LOB_PERMANENT_CREATED | LOB_PERMANENT_DELETED (s3)
* out-tran | LOB_UNKNOWN | LOB_UNKNOWN
* | LOB_UNKNOWN | LOB_TRANSIENT_DELETED (s4)
*
* s1: create transient locator and delete it
* LOB_TRANSIENT_CREATED -> LOB_UNKNOWN
* s2: create transient locator and bind it to a row
* LOB_TRANSIENT_CREATED -> LOB_PERMANENT_CREATED
* s3: bind transient locator to a row and delete the locator
* LOB_PERMANENT_CREATED -> LOB_PERMANENT_DELETED
* s4: delete a locator created out of transaction
* LOB_UNKNOWN -> LOB_TRANSIENT_DELETED
*/
enum lob_locator_state
{
  LOB_UNKNOWN,
  LOB_TRANSIENT_CREATED,
  LOB_TRANSIENT_DELETED,
  LOB_PERMANENT_CREATED,
  LOB_PERMANENT_DELETED,
  LOB_NOT_FOUND
};

The LOB_TRANSIENT_* states mean “the file exists on disk but no row in any committed table points at it yet” — a transaction has called elo_create() or asked to delete a still-uncommitted file. The LOB_PERMANENT_* states mean “a row in a committed (or about-to-commit) table points at this locator”. LOB_UNKNOWN and LOB_NOT_FOUND are sentinel results from lob_locator_find when the caller’s locator does not appear in this transaction’s tree.

The state machine is what the commit/rollback walker (§“Commit and rollback dispatch” below) reads to decide what to do.

Per-TDES locator tree — red-black, hashed key


Every transaction that touches a LOB grows a red-black tree on its own TDES:

// lob_locator_entry / lob_savepoint_entry — src/transaction/transaction_transient.cpp
struct lob_savepoint_entry
{
  LOB_LOCATOR_STATE state;
  LOG_LSA savept_lsa;           // savepoint at which this state was set
  char locator[ES_URI];
  lob_savepoint_entry *prev;    // savepoint stack
};

struct lob_locator_entry
{
  RB_ENTRY (lob_locator_entry) head;
  lob_savepoint_entry *top;     // top of the savepoint stack
  int key_hash;                 // mht_5strhash of m_key, for fast compare
  std::string m_key;            // the <unum>_<rand> suffix
};

The compare function looks at key_hash first and only falls through to a std::string::compare on the key when the hash matches:

// lob_locator_cmp — src/transaction/transaction_transient.cpp
static int
lob_locator_cmp (const lob_locator_entry *e1, const lob_locator_entry *e2)
{
  if (e1->key_hash != e2->key_hash)
    {
      return e1->key_hash - e2->key_hash;
    }
  return e1->m_key.compare (e2->m_key);
}

This is the standard “hash-then-compare” trick: most lookups exit on the cheap integer compare, only collisions pay for string::compare.

Savepoint stack — pop on partial rollback


xtx_change_state_of_locator is called every time a locator’s state changes (e.g. LOB_TRANSIENT_CREATED → LOB_PERMANENT_CREATED at commit-binding, or rename during update). Crucially, if the change happens at a savepoint LSA strictly later than the entry’s last savepoint, it pushes the prior state onto the savepoint stack rather than overwriting it:

// xtx_change_state_of_locator — src/transaction/transaction_transient.cpp (condensed)
last_lsa = LSA_GE (&tdes->savept_lsa, &tdes->topop_lsa)
           ? tdes->savept_lsa : tdes->topop_lsa;
if (LSA_LT (&entry->top->savept_lsa, &last_lsa))
  {
    lob_savepoint_entry *savept = new lob_savepoint_entry ();
    savept->state = entry->top->state;
    savept->savept_lsa = entry->top->savept_lsa;
    std::strcpy (savept->locator, entry->top->locator);
    savept->prev = entry->top;
    entry->top = savept;        // push
  }
if (new_locator != NULL)
  {
    strlcpy (entry->top->locator, new_locator, sizeof (ES_URI));
  }
entry->top->state = state;
entry->top->savept_lsa = last_lsa;

When the transaction rolls back to a savepoint LSA, tx_lob_locator_clear walks the stack popping entries whose savept_lsa >= rollback LSA, calling es_rename_file for each pop where the locator string changed (so the on-disk filename is restored to its pre-rollback identity). The walker is the only place in CUBRID that touches es_rename_file for LOBs.

Commit and rollback dispatch — one entry point


tx_lob_locator_clear is called from exactly four places in the engine: commit (log_commit), abort (log_rollback), partial rollback to a savepoint (log_rollback_to_savepoint), and end of a nested system op (xlog_topop_end). It receives at_commit (bool) and savept_lsa (NULL for full commit/abort, non-NULL for partial rollback) and decides per-entry whether the file should be deleted, renamed, or left alone:

// tx_lob_locator_clear — src/transaction/transaction_transient.cpp (condensed)
for (entry = RB_MIN (lob_rb_root, &tdes->lob_locator_root); entry != NULL; entry = next)
  {
    next = RB_NEXT (lob_rb_root, &tdes->lob_locator_root, entry);
    need_to_delete = false;

    if (at_commit)
      {
        // anything not bound to a row at commit time is garbage
        if (entry->top->state != LOB_PERMANENT_CREATED)
          {
            need_to_delete = true;
          }
      }
    else                        // rollback
      {
        if (savept_lsa != NULL)
          {
            // partial rollback: pop savepoint stack, rename files back
            // ... condensed: see source ...
          }
        // anything created and rolled back is garbage
        if ((savept_lsa == NULL || LSA_GE (&entry->top->savept_lsa, savept_lsa))
            && entry->top->state != LOB_TRANSIENT_DELETED)
          {
            need_to_delete = true;
          }
      }

    if (need_to_delete)
      {
#if defined (SERVER_MODE)
        if (at_commit && entry->top->state == LOB_PERMANENT_DELETED)
          {
            vacuum_notify_es_deleted (thread_p, entry->top->locator);
          }
        else
          {
            (void) es_delete_file (entry->top->locator);
          }
#else
        (void) es_delete_file (entry->top->locator);
#endif
        RB_REMOVE (lob_rb_root, &tdes->lob_locator_root, entry);
        lob_locator_free (entry);
      }
  }

The rule the dispatch encodes:

at_commit | top->state | action
true | LOB_PERMANENT_CREATED | leave file on disk (it backs a committed row)
true | LOB_PERMANENT_DELETED | hand off to vacuum via vacuum_notify_es_deleted
true | LOB_TRANSIENT_* | es_delete_file directly (visible to nobody else)
false | LOB_TRANSIENT_DELETED | leave alone (the file the transaction tried to delete still exists)
false | other states | es_delete_file directly (rolled-back creation; nothing referenced it)

The vacuum hand-off is the LOB analogue of the heap’s vacuum_log_vacuum_record: at commit a LOB_PERMANENT_DELETED file may still be visible to older snapshots, so the vacuum daemon controls the actual unlink — it sees the file gone from heap and no in-flight snapshot still references the row before reclaiming.

A user’s INSERT INTO t VALUES (1, BIT_TO_BLOB(X'...')) reaches elo_create after the parser/value layer has built a DB_ELO:

// elo_create — src/object/elo.c (condensed)
int
elo_create (DB_ELO *elo)
{
  ES_URI out_uri;
  int ret;

  ret = es_create_file (out_uri);   // (1) backend creates "ces_temp.<unum>_<rand>"
  // ... condensed ...
  elo->locator = db_private_strdup (NULL, out_uri);
  elo->type = ELO_FBO;              // FBO = File-Backed Object
  elo->es_type = es_get_type (out_uri);
  if (ELO_NEEDS_TRANSACTION (elo))  // ES_OWFS or ES_POSIX
    {
      ret = lob_locator_add (elo->locator, LOB_TRANSIENT_CREATED);  // (2) per-TDES tree
    }
  return ret;
}

#define ELO_NEEDS_TRANSACTION(e) \
  ((e)->es_type == ES_OWFS || (e)->es_type == ES_POSIX)

Two non-obvious properties:

  1. The file is created before the locator is added to the tree — if the create fails, no per-TDES bookkeeping is needed; if the create succeeds and the lob_locator_add fails, an orphan file remains. The deck does not mention this and the source’s error handling does not unlink either — see §“Open questions”.
  2. ES_LOCAL does not go through transactional tracking (ELO_NEEDS_TRANSACTION excludes it). Read-only client-side caches have no write path, so there is nothing for commit/rollback to reconcile.

Subsequent calls in the lifecycle:

  • elo_write (elo, pos, buf, count) → es_write_file (no locator state change; the file is LOB_TRANSIENT_CREATED and writes are free).
  • The INSERT reaches the heap layer and the row is inserted with elo->locator as a column value.
  • At commit, the row’s heap insert is visible; tx_lob_locator_clear fires; the locator is in LOB_TRANSIENT_CREATED state — but the row in the heap binds it. Question: who sets it to LOB_PERMANENT_CREATED? In current source, this is the job of xtx_change_state_of_locator called from network_interface_sr.cpp:2380 (the only call site outside the module). The deck’s “rename ces_temp.xxx → dba.t1.xxx” sequence is this call. See §“Open questions” for the version-drift detail.

Read flow — elo_read and the locator round trip


A SELECT clob_to_char(c2) FROM t1 walks down through the value layer to elo_read:

// elo_read — src/object/elo.c (signature)
extern ssize_t elo_read (const DB_ELO *elo, off_t pos, void *buf, size_t count);

The caller passes a position and length; elo_read fans out to es_read_file which dispatches to xes_posix_read_file / es_owfs_read_file based on the locator’s prefix. No tree traversal happens on read — the locator string in the heap row is self-sufficient. Reads do not allocate per-TDES state because they do not need to be undone: a read that the transaction later rolls back is invisible to the on-disk file.

Pre-read size lookup goes through elo_size → es_get_file_size for the same reason: the locator alone is enough.

Update flow — copy-then-replace, not in-place


The deck makes the update path explicit because it differs from create:

new directory / file creation — the logic differs from create; generated_file → locator entry (LOB_TRANSIENT_CREATED); old_file → locator entry (LOB_TRANSIENT_DELETED); old_file is deleted at commit

The reason update does not use elo_create + elo_write is that the old locator must also be tracked. CUBRID handles this with two locator-tree entries: one at LOB_TRANSIENT_CREATED for the new file, one at LOB_TRANSIENT_DELETED for the old. At commit:

  • LOB_TRANSIENT_CREATED for the new file → bound to the row → LOB_PERMANENT_CREATED (via xtx_change_state_of_locator).
  • LOB_TRANSIENT_DELETED for the old file → vacuum hand-off (because the row that referenced it is now an MVCC dead version, and other snapshots may still need to read it).

elo_copy (in elo.c) is the shared helper used by update — it calls es_copy_file, registers the new locator, marks the old as deleted, and returns. The deck’s note that "the elo_copy() function performs the rename, locator-entry drop, and file-copy operations" describes this composite behaviour.

Hash directory layout — bounded files per directory


The POSIX backend distributes files across two hash levels so no single directory holds too many entries:

// es_get_unique_name — src/storage/es_posix.c (condensed)
static void
es_get_unique_name (char *dirname1, char *dirname2,
                    const char *metaname, char *filename)
{
  UINT64 unum;
  int hashval, r;

  r = (rand () < 0) ? -rand () : rand ();
  unum = es_get_unique_num ();  // microsecond-precision time
  snprintf (filename, NAME_MAX, "%s.%020llu_%04d", metaname, unum, r % 10000);

  hashval = es_name_hash_func (ES_POSIX_HASH1, filename);
  snprintf (dirname1, NAME_MAX, "ces_%03d", hashval);
  hashval = es_name_hash_func (ES_POSIX_HASH2, filename);
  snprintf (dirname2, NAME_MAX, "ces_%03d", hashval);
}

ES_POSIX_HASH1 and ES_POSIX_HASH2 are the bucket counts at each level. The hash is mht_5strhash(filename) mod bucket count, so sibling files in the same metaname (e.g. dba.t1.*) end up scattered across both levels — uniform distribution by design.

The deck flags the design’s tradeoff cleanly: hash dirs reduce files-per-directory and ease lock contention on readdir-style operations, but the backup story is harder (every LOB lives in a different leaf directory) and any administrative scan must walk both levels. CUBRID lives with this tradeoff because the DB-volume alternative (BLOB-as-segment) would push every LOB through the page buffer and WAL.

Anchor on symbol names, not line numbers. Function names survive most refactors; line numbers drift the moment someone reformats a header.

  • es_init, es_final (in es.c) — choose backend by URI prefix on first call; tear down on shutdown.
  • es_get_type, es_get_type_string (in es_common.c) — URI prefix → ES_TYPE enum and back.
  • es_create_file, es_read_file, es_write_file, es_delete_file, es_copy_file, es_rename_file (in es.c) — public API the rest of the engine calls. Each switches on es_initialized_type and forwards to the chosen backend.
  • xes_posix_create_file, xes_posix_write_file, xes_posix_read_file, xes_posix_delete_file, xes_posix_rename_file, xes_posix_copy_file (in es_posix.c) — server-side implementations.
  • es_posix_create_file etc. (without the x prefix, in es_posix.c) — client-side stubs that RPC to the server via network_interface_cl.h.
  • es_get_unique_name (in es_posix.c) — file-name-and-hash-dir generator.
  • es_make_dirs (in es_posix.c) — mkdir -p for the two-level hash directory.
  • es_owfs_create_file, es_owfs_write_file, etc. (in es_owfs.c) — One-World FS object-storage backend; same surface as POSIX.
  • enum lob_locator_state (in lob_locator.hpp) — six states with inline transition table.
  • lob_locator_is_valid, lob_locator_key, lob_locator_meta (in lob_locator.cpp) — locator-string parsing.
  • lob_locator_add, lob_locator_change_state, lob_locator_drop, lob_locator_find (in lob_locator.cpp) — public wrappers that dispatch to the server-side xtx_* functions or to RPC stubs in client mode.
  • struct lob_rb_root, struct lob_locator_entry, struct lob_savepoint_entry (in transaction_transient.cpp) — the data structures.
  • xtx_add_lob_locator — RB-insert at first state-change.
  • xtx_change_state_of_locator — savepoint-stack push + state update.
  • xtx_drop_lob_locator — RB-remove (used by callers that decide a locator is no longer interesting before transaction end, e.g. client clean-up of failed creates).
  • xtx_find_lob_locator — RB-find by hashed key; returns the top state.
  • tx_lob_locator_clear — the commit/rollback walker.
  • lob_locator_cmp — hash-then-compare comparator.
  • elo_create, elo_copy, elo_copy_with_prefix, elo_delete, elo_size, elo_read, elo_write (in elo.c) — the per-row-value layer that orchestrates ES + locator-tree.
  • elo_init_structure, elo_copy_structure, elo_free_structure (in elo.c) — DB_ELO lifecycle.
  • ELO_NEEDS_TRANSACTION macro — distinguishes OWFS/POSIX (tracked) from LOCAL (untracked).
  • vacuum_notify_es_deleted (declared in vacuum subsystem, called from tx_lob_locator_clear) — queues a LOB_PERMANENT_DELETED file for asynchronous unlink so vacuum can delay the unlink past in-flight snapshots.

Each line is the function-definition line of the symbol. The §“Source verification” bullets quote sub-ranges within the same function bodies; cross-check by counting from the symbol’s definition line.

Symbol | File | Line
enum lob_locator_state | src/object/lob_locator.hpp | 53
lob_locator_key | src/object/lob_locator.cpp | 56
lob_locator_meta | src/object/lob_locator.cpp | 62
lob_locator_add | src/object/lob_locator.cpp | 90
lob_locator_change_state | src/object/lob_locator.cpp | 107
xtx_add_lob_locator | src/transaction/transaction_transient.cpp | 174
xtx_find_lob_locator | src/transaction/transaction_transient.cpp | 210
xtx_change_state_of_locator | src/transaction/transaction_transient.cpp | 245
xtx_drop_lob_locator | src/transaction/transaction_transient.cpp | 308
tx_lob_locator_clear | src/transaction/transaction_transient.cpp | 374
lob_locator_cmp | src/transaction/transaction_transient.cpp | 477
elo_create | src/object/elo.c | 85
ELO_NEEDS_TRANSACTION | src/object/elo.c | 71
es_create_file | src/storage/es.c | 142
es_get_unique_name | src/storage/es_posix.c | 78
es_get_type | src/storage/es_common.c | 45
enum ES_TYPE | src/storage/es_common.h | 28

Each entry leads with a fact about the current source. The trailing note records the verification trail and any historical drift. Open questions follow as the curator’s recorded gaps.

  • The ES layer supports exactly three backends. Verified in src/storage/es_common.h on 2026-05-01: ES_TYPE is {ES_NONE, ES_OWFS, ES_POSIX, ES_LOCAL}. The deck listed ES_OWFS, ES_POSIX, ES_LOCAL in the order POSIX/OWFS/LOCAL; current source numbers them 0/1/2 with ES_NONE = -1 as the uninitialized sentinel.

  • ELO_NEEDS_TRANSACTION excludes ES_LOCAL from the per-TDES tree. Verified in src/object/elo.c:71 on 2026-05-01: the macro is (es_type == ES_OWFS || es_type == ES_POSIX). The deck does not mention this asymmetry; readers reaching the locator tree from the LOB column type would mistakenly assume all locators are tracked.

  • The locator-state transition table in lob_locator.hpp is authoritative and matches the deck. Verified by reading the comment block at src/object/lob_locator.hpp:26-52 (immediately preceding enum lob_locator_state at line 53) on 2026-05-01. The four labelled transitions (s1 through s4) in the comment are the same four cases the deck illustrates with INSERT / UPDATE / DELETE / abort examples.

  • Commit-time LOB_PERMANENT_DELETED files are queued for vacuum, not unlinked synchronously. Verified inside tx_lob_locator_clear (src/transaction/transaction_transient.cpp:374, body lines 443-457) on 2026-05-01: the vacuum_notify_es_deleted (thread_p, entry->top->locator) branch fires only for (at_commit && state == LOB_PERMANENT_DELETED) in SERVER_MODE. Other delete paths call es_delete_file directly. The deck describes the commit-time delete as immediate ("delete is performed at commit"), which is not accurate for the deleted-permanent case.

  • Hash directories are two-level by mht_5strhash of the filename. Verified in src/storage/es_posix.c:104-108 on 2026-05-01: ES_POSIX_HASH1 and ES_POSIX_HASH2 give the bucket counts; the dirname format is ces_%03d (zero-padded three-digit hash, so each level supports up to 1000 buckets). A CUBRID_OWFS_POSIX_TWO_DEPTH_DIRECTORY build flag (line 124) controls whether the second level is actually nested or flattened — most builds set it.

  • Per-locator savepoint stack pushes only when the savepoint LSA has advanced. Verified inside xtx_change_state_of_locator (src/transaction/transaction_transient.cpp:245, body lines 273-283) on 2026-05-01: the LSA_LT (&entry->top->savept_lsa, &last_lsa) guard means same-savepoint state changes overwrite, but cross-savepoint changes push a new lob_savepoint_entry. The deck does not surface this optimization.

  1. Who calls xtx_change_state_of_locator to graduate a locator from TRANSIENT_CREATED to PERMANENT_CREATED? The only non-test reference outside the module is src/communication/network_interface_sr.cpp:2380. The deck describes this as part of elo_copy() and the commit handler. Trace the actual path from INSERT execution to the state change; document whether it happens at heap insert time, at row commit time, or via a separate client-driven RPC. Investigation path: trace the sr-side stub from xes_posix_create_file callers, search for LOB_PERMANENT_CREATED writes.

  2. Orphan-file behaviour on lob_locator_add failure after es_create_file succeeds. elo_create does not unlink the file if the locator-tree insert fails. Investigation path: read lob_locator_add failure modes (only error is ER_LOG_UNKNOWN_TRANINDEX) and check whether the caller chain ever sees a successful create + failed add; if it does, this is an orphan-leak corner case worth a CBRD ticket.

  3. What does LOB_NOT_FOUND mean to the heap layer when a locator in a row is not present in the per-TDES tree? Verified that xtx_find_lob_locator returns LOB_NOT_FOUND and copies the input locator unchanged, but did not trace how the heap-side elo_read / elo_size reacts. If the answer is “fall through to the on-disk file”, that’s correct for read-after-commit; if the answer is “error”, concurrent rollbacks could surface spurious failures. Investigation path: trace heap → elo_read → ES dispatch and check whether the locator-tree state is even consulted on reads.

  4. Two-level vs. flattened hash directories. The CUBRID_OWFS_POSIX_TWO_DEPTH_DIRECTORY macro (es_posix.c:124) gates the second level. Default is enabled in current builds, but a flag-controlled fallback exists. Investigation path: git log -S CUBRID_OWFS_POSIX_TWO_DEPTH_DIRECTORY to find when single-level was the default; check if any active deployment still uses it.

  5. Behaviour of xes_posix_rename_file on partial rollback when the on-disk filename has already been observed by another replica. HA replication ships log records; it does not ship filesystem renames. If a replica has read the post-rename locator and the primary then rolls back the rename, the replica’s on-disk path no longer exists. Investigation path: trace HA’s LOB handling, search for lob_locator in HA paths, check whether physical replication carries ES-side actions.

  6. Deprecation status of OWFS. The deck and current source keep OWFS as a first-class backend, but every deployment we have visibility on uses POSIX. Investigation path: git log src/storage/es_owfs.c to look for new development; ask whether any 11.x customer has OWFS in production.

Beyond CUBRID — Comparative Designs & Research Frontiers


Pointers, not analysis. Each bullet is a starting handle for a follow-up doc; depth here is intentionally shallow.

  • PostgreSQL TOAST — The Oversized-Attribute Storage Technique, as documented in the PostgreSQL storage internals. Oversized column values are compressed and/or chunked into a per-table TOAST table with the same MVCC visibility rules as ordinary rows. CUBRID’s external-file approach trades MVCC integration for simpler per-LOB I/O; a comparison would quantify the cost of the vacuum-driven TOAST chunk reclamation against ES’s commit-time unlink.

  • PostgreSQL Large Objects (lo_*) — separate from TOAST, this is the older pg_largeobject system table. Locator is an OID, payload is sliced into 2 KB internal pages, all fully WAL-protected. CUBRID’s choice of out-of-WAL ES storage trades some recovery semantics for I/O performance and simpler LOB copy.

  • Oracle SecureFiles — Oracle’s modern LOB engine adds in-place updates, deduplication, encryption, and compression to the older BasicFiles. The relevant CUBRID comparison is the encryption story (CUBRID’s TDE does not currently encrypt ES files; see cubrid-tde.md once written).

  • Oracle BFILE — Oracle’s read-only out-of-database file pointer; the closest analogue to CUBRID’s ES_LOCAL. A side-by-side would clarify whether ES_LOCAL is actually used in production, given the deck does not mention it.

  • MySQL InnoDB off-page columns — InnoDB stores large columns on overflow pages within the same tablespace; ROW_FORMAT=DYNAMIC pushes long columns entirely off-page with a 20-byte pointer. Closer to TOAST than to ES, but with a different page-format story. A comparison would highlight CUBRID’s design choice of sidestepping the page format entirely.

  • Lehman & Lindsay, “The Starburst Long Field Manager” (VLDB 1989). The earliest paper to articulate transactional out-of-row LOB management. The savepoint-stack pattern in lob_savepoint_entry is recognizably descended from Starburst’s multi-level recovery; a re-reading would clarify whether CUBRID’s stack semantics match Starburst’s exactly or diverge.

  • Object stores as LOB backends (S3, GCS, MinIO) — modern successor to OWFS. None of the cloud-native object stores support filesystem rename atomicity, which would force a redesign of the rollback path (CUBRID currently uses es_rename_file for partial-rollback file renames). A research-grade follow-up would map CUBRID’s rename-driven rollback to a versioned-PUT object-store equivalent.

Raw analyses (under raw/code-analysis/cubrid/storage/lob/)

  • LOB 세미나.pptx ("LOB seminar") — single-deck seminar by 인치준 (CUBRID development team 2). Covers the CRUD logic at the locator level with example SQL, the ces_temp rename trick, the per-TDES locator entry list, and the hash-dir structure tradeoffs.

Textbook chapters (under knowledge/research/dbms-general/)

  • Database Internals (Petrov), Ch. 2 “File Formats” — out-of-page records, indirection vs. inline.
  • Database System Concepts (Silberschatz et al.), §13.5 “Record Organization” — BLOB/CLOB semantics.
  • (None cited — this doc was assembled from raw deck + source.)

CUBRID source (under /data/hgryoo/references/cubrid/)

  • src/object/elo.h, src/object/elo.c
  • src/object/lob_locator.hpp, src/object/lob_locator.cpp
  • src/storage/es.h, src/storage/es.c
  • src/storage/es_common.h, src/storage/es_common.c
  • src/storage/es_posix.h, src/storage/es_posix.c
  • src/storage/es_owfs.h, src/storage/es_owfs.c
  • src/transaction/transaction_transient.hpp, src/transaction/transaction_transient.cpp
  • src/compat/db_elo.h, src/compat/db_elo.c (client-side wrapper)