# opencode Storage Format Spike **Date:** 2026-05-04 **Scope:** Determine canonical source, schema, progress marker, and parser risks for lethe collector. **Respects:** IV10, PC3, AS3, AS4, UK1. --- ## 1. Canonical Source Choice **Recommendation: SQLite database (`~/.local/share/opencode/opencode.db`)** Rationale: - The DB is the source of truth. JSON files in `storage/` are hydrated projections of DB rows with extra derived fields (e.g. `messageID` in part JSON). - DB counts: **725 sessions**, **26,611 messages**, **107,889 parts**. - JSON files are fewer: 16,926 message JSONs vs 26,611 DB rows, indicating some messages exist only in the DB. - The DB is actively WAL-journaled (`opencode.db-wal` ~2MB, `-shm` 32KB), so reads must use the live DB path, not a copy. Alternative rejected: - `storage/{message,part,session,project}/` JSON tree — incomplete, redundant, and 2GB total vs 1.3GB DB. - `event` table — append-only event log with `event_sequence` aggregate IDs. Useful for replay but overkill for collector polling. - `~/.cache/opencode/` — models.json, node_modules; no transcript data. - `~/.config/opencode/` — auth, plugins, skills; no transcript data. - `~/.local/share/opencode/tool-output/` — binary blobs referenced by `callID`, not structured text. --- ## 2. Schema Overview ### 2.1 Tables (relevant to collector) ```sql CREATE TABLE session ( id TEXT PRIMARY KEY, project_id TEXT NOT NULL, parent_id TEXT, slug TEXT NOT NULL, directory TEXT NOT NULL, title TEXT NOT NULL, version TEXT NOT NULL, time_created INTEGER NOT NULL, time_updated INTEGER NOT NULL ); CREATE TABLE message ( id TEXT PRIMARY KEY, session_id TEXT NOT NULL, time_created INTEGER NOT NULL, time_updated INTEGER NOT NULL, data TEXT NOT NULL -- JSON ); CREATE TABLE part ( id TEXT PRIMARY KEY, message_id TEXT NOT NULL, session_id TEXT NOT NULL, time_created INTEGER NOT NULL, time_updated INTEGER NOT NULL, data TEXT NOT NULL -- JSON ); ``` Indexes: - `part_session_idx` ON `part` (`session_id`) - `session_project_idx` ON `session` (`project_id`) ### 2.2 Message `data` JSON shape Top-level keys (observed): - `role`: `"user" | "assistant"` - `agent`: `"build" | "implementer-deep" | "implementer-smart" | "oracle" | ...` - `mode`: same as agent, the execution mode - `parentID`: previous message ID (for threading) - `path`: `{ cwd, root }` — absolute filesystem paths (PII risk) - `modelID`, `providerID`: model metadata - `tokens`: `{ input, output, reasoning, cache: { read, write } }` - `cost`: float - `time`: `{ created, completed? }` — epoch millis - `finish`: `"stop" | "tool-calls" | ...` ### 2.3 Part `data` JSON shape (by type) | Type | Keys | Content | |------|------|---------| | `text` | `type`, `text`, `time`, `metadata?` | Raw prose | | `reasoning` | `type`, `text`, `time` | Model reasoning block | | `tool` | `type`, `tool`, `callID`, `state` | Tool invocation + output | | `step-start` | `type` | Turn boundary marker | | `step-finish` | `type`, `reason`, `tokens`, `cost`, `snapshot` | Turn summary | | `patch` | `type`, `patch`, `path` | Code diff | | `file` | `type`, `path`, `content?` | File reference | | `compaction` | ... | Summarized history | ### 2.4 Tool output storage Tool outputs are **not** stored in `part.data` JSON. Instead: - `part.data.state.output` contains a short summary/truncation marker for some tools. - Large tool outputs are written to `~/.local/share/opencode/tool-output/tool_` as opaque files. - 20 files observed, ranging from 0B to 6MB. - The `callID` in the part JSON links to the filename. --- ## 3. Session / Message / Part Flow A single user → assistant turn is not one row. It is a multi-part sequence: ``` message (user, role=user) └── part (type=text) — user prompt message (assistant, role=assistant) └── part (type=step-start) — turn begins └── part (type=reasoning) — model thinking └── part (type=tool) — tool call(s) └── part (type=step-finish) — turn ends ``` For search indexing, a "turn" is best defined as: - **User turn**: one `message` with `role=user` + its `text` part(s). - **Assistant turn**: one `message` with `role=assistant` + all associated `text`, `reasoning`, and `tool` parts between `step-start` and `step-finish`. --- ## 4. Progress Marker Choice **Recommendation: `message.rowid` as the progress offset.** Rationale: - AS4 assumes an integer offset. SQLite `rowid` is an integer marker and fits the existing collector state schema without a new format. - The collector persists progress from accepted event `Seq` values. `TurnEvent.Seq` stores the current `message.rowid`, while `ingestion_state.last_offset` stores the next rowid to scan. This keeps partial acceptance resumable: a saved next event rowid is included on the next poll, and a fully accepted scan saves `max(rowid)+1`. - `message.rowid` is unique and monotonic for rows in the source DB, so equal-millisecond messages do not skip unaccepted rows. - `message.time_created` is still the event timestamp, but it is not a safe offset: multiple messages can share the same millisecond. - Messages and parts both have `time_created`, but messages are the turn boundaries; parts inherit the message's session and can be queried via `message_id`. - Alternative: `message.id` lexicographic — IDs are sortable ULIDs (e.g. `msg_df14a9a20001MitrpJBdwZHMN4`) but integer offset is simpler for the existing state schema. - Alternative: `event.seq` — requires joining through `event_sequence`, adds complexity. Query pattern for incremental collection: ```sql SELECT * FROM message WHERE rowid >= ? ORDER BY rowid ASC; ``` --- ## 5. Fixture Anonymization Notes Fields that must be redacted in test fixtures and logs: - `path.cwd`, `path.root` — absolute local paths. - `modelID` / `providerID` — may contain API key fragments or account info. - Tool output file contents (`tool-output/tool_`) — may contain secrets, env vars, tokens. - `auth.json` — contains access tokens, refresh tokens; never read in parser. - Session `title` — may contain internal project names or ticket IDs. Safe to retain: - ID prefixes and structure (`ses_`, `msg_`, `prt_`, `tool_`). - `role`, `agent`, `mode`, `type` enums. - `time_created` relative offsets. - `slug` — random adjective-noun pairs, no PII. - Token counts and costs (aggregated, not content). --- ## 6. Parser Risks | Risk | Impact | Mitigation | |------|--------|------------| | **Schema drift** | opencode may add new `part` types or message keys | Defensive JSON unmarshaling; ignore unknown keys | | **WAL lock contention** | Collector reads while opencode writes | Use `mode=ro` URI or copy with `sqlite3` CLI if needed | | **Large tool outputs** | 6MB+ files could bloat index | Only index tool output metadata/summary; skip `tool-output/` files unless `include_tool_outputs=1` | | **Part ordering ambiguity** | `time_created` may collide for parts in same message | Use `id` lexicographic as secondary sort | | **Eventual consistency** | Event sourcing means DB state may lag events | Poll `message` table directly; events are for replay only | | **Nested JSON depth** | `state.output` in tool parts may be stringified JSON | Treat as opaque text or attempt one-level unmarshal | | **Session archiving** | `time_archived` column exists; archived sessions may be compacted | Respect `time_archived IS NULL` or include archived based on config | --- ## 7. Open Questions for Parser Implementation 1. Should the collector index **every** session or only those matching a project filter? 2. Should `reasoning` parts be indexed as prose or excluded by default? 3. Should `patch` parts be indexed as code/text or handled separately? 4. How to handle parent-child message threading (`parentID`) in search results? 5. What is the compaction format in `compaction` parts? (Only 60 observed, low priority.) --- ## 8. Summary for Next Phase - **Source:** SQLite `~/.local/share/opencode/opencode.db`, read-only. - **Tables:** `session`, `message`, `part`. - **Progress:** `message.rowid` (INTEGER SQLite row marker). - **Turn definition:** `message` row + linked `part` rows by `message_id`. - **Tool outputs:** External files in `tool-output/`, referenced by `callID`. - **Registration:** One parser package implementing `parser.Parser` + registration in collector.