Date: 2026-05-04
Scope: Determine canonical source, schema, progress marker, and parser risks for lethe collector.
Respects: IV10, PC3, AS3, AS4, UK1.
Recommendation: SQLite database (~/.local/share/opencode/opencode.db)
Rationale:
storage/ are hydrated projections of DB rows with extra derived fields (e.g. messageID in part JSON).opencode.db-wal ~2MB, -shm 32KB), so reads must use the live DB path, not a copy.Alternative rejected:
storage/{message,part,session,project}/ JSON tree — incomplete, redundant, and 2GB total vs 1.3GB DB.event table — append-only event log with event_sequence aggregate IDs. Useful for replay but overkill for collector polling.~/.cache/opencode/ — models.json, node_modules; no transcript data.~/.config/opencode/ — auth, plugins, skills; no transcript data.~/.local/share/opencode/tool-output/ — binary blobs referenced by callID, not structured text.CREATE TABLE session (
id TEXT PRIMARY KEY,
project_id TEXT NOT NULL,
parent_id TEXT,
slug TEXT NOT NULL,
directory TEXT NOT NULL,
title TEXT NOT NULL,
version TEXT NOT NULL,
time_created INTEGER NOT NULL,
time_updated INTEGER NOT NULL
);
CREATE TABLE message (
id TEXT PRIMARY KEY,
session_id TEXT NOT NULL,
time_created INTEGER NOT NULL,
time_updated INTEGER NOT NULL,
data TEXT NOT NULL -- JSON
);
CREATE TABLE part (
id TEXT PRIMARY KEY,
message_id TEXT NOT NULL,
session_id TEXT NOT NULL,
time_created INTEGER NOT NULL,
time_updated INTEGER NOT NULL,
data TEXT NOT NULL -- JSON
);
Indexes:
part_session_idx ON part (session_id)session_project_idx ON session (project_id)data JSON shapeTop-level keys (observed):
role: "user" | "assistant"agent: "build" | "implementer-deep" | "implementer-smart" | "oracle" | ...mode: same as agent, the execution modeparentID: previous message ID (for threading)path: { cwd, root } — absolute filesystem paths (PII risk)modelID, providerID: model metadatatokens: { input, output, reasoning, cache: { read, write } }cost: floattime: { created, completed? } — epoch millisfinish: "stop" | "tool-calls" | ...data JSON shape (by type)| Type | Keys | Content |
|---|---|---|
text |
type, text, time, metadata? |
Raw prose |
reasoning |
type, text, time |
Model reasoning block |
tool |
type, tool, callID, state |
Tool invocation + output |
step-start |
type |
Turn boundary marker |
step-finish |
type, reason, tokens, cost, snapshot |
Turn summary |
patch |
type, patch, path |
Code diff |
file |
type, path, content? |
File reference |
compaction |
... | Summarized history |
Tool outputs are not stored in part.data JSON. Instead:
part.data.state.output contains a short summary/truncation marker for some tools.~/.local/share/opencode/tool-output/tool_<callID> as opaque files.callID in the part JSON links to the filename.A single user → assistant turn is not one row. It is a multi-part sequence:
message (user, role=user)
└── part (type=text) — user prompt
message (assistant, role=assistant)
└── part (type=step-start) — turn begins
└── part (type=reasoning) — model thinking
└── part (type=tool) — tool call(s)
└── part (type=step-finish) — turn ends
For search indexing, a "turn" is best defined as:
message with role=user + its text part(s).message with role=assistant + all associated text, reasoning, and tool parts between step-start and step-finish.Recommendation: message.rowid as the progress offset.
Rationale:
rowid is an integer marker and fits the existing collector state schema without a new format.Seq values. TurnEvent.Seq stores the current message.rowid, while ingestion_state.last_offset stores the next rowid to scan. This keeps partial acceptance resumable: a saved next event rowid is included on the next poll, and a fully accepted scan saves max(rowid)+1.message.rowid is unique and monotonic for rows in the source DB, so equal-millisecond messages do not skip unaccepted rows.message.time_created is still the event timestamp, but it is not a safe offset: multiple messages can share the same millisecond.time_created, but messages are the turn boundaries; parts inherit the message's session and can be queried via message_id.message.id lexicographic — IDs are sortable ULIDs (e.g. msg_df14a9a20001MitrpJBdwZHMN4) but integer offset is simpler for the existing state schema.event.seq — requires joining through event_sequence, adds complexity.Query pattern for incremental collection:
SELECT * FROM message
WHERE rowid >= ?
ORDER BY rowid ASC;
Fields that must be redacted in test fixtures and logs:
path.cwd, path.root — absolute local paths.modelID / providerID — may contain API key fragments or account info.tool-output/tool_<callID>) — may contain secrets, env vars, tokens.auth.json — contains access tokens, refresh tokens; never read in parser.title — may contain internal project names or ticket IDs.Safe to retain:
ses_, msg_, prt_, tool_).role, agent, mode, type enums.time_created relative offsets.slug — random adjective-noun pairs, no PII.| Risk | Impact | Mitigation |
|---|---|---|
| Schema drift | opencode may add new part types or message keys |
Defensive JSON unmarshaling; ignore unknown keys |
| WAL lock contention | Collector reads while opencode writes | Use mode=ro URI or copy with sqlite3 CLI if needed |
| Large tool outputs | 6MB+ files could bloat index | Only index tool output metadata/summary; skip tool-output/ files unless include_tool_outputs=1 |
| Part ordering ambiguity | time_created may collide for parts in same message |
Use id lexicographic as secondary sort |
| Eventual consistency | Event sourcing means DB state may lag events | Poll message table directly; events are for replay only |
| Nested JSON depth | state.output in tool parts may be stringified JSON |
Treat as opaque text or attempt one-level unmarshal |
| Session archiving | time_archived column exists; archived sessions may be compacted |
Respect time_archived IS NULL or include archived based on config |
reasoning parts be indexed as prose or excluded by default?patch parts be indexed as code/text or handled separately?parentID) in search results?compaction parts? (Only 60 observed, low priority.)~/.local/share/opencode/opencode.db, read-only.session, message, part.message.rowid (INTEGER SQLite row marker).message row + linked part rows by message_id.tool-output/, referenced by callID.parser.Parser + registration in collector.