~bigbes/lethe (f0f651bfc74681988f56bd610074e1dce6dbee1c): docs/spikes/opencode-format.md

#opencode Storage Format Spike

Date: 2026-05-04
Scope: Determine canonical source, schema, progress marker, and parser risks for lethe collector.
Respects: IV10, PC3, AS3, AS4, UK1.

#1. Canonical Source Choice

Recommendation: SQLite database (~/.local/share/opencode/opencode.db)

Rationale:

The DB is the source of truth. JSON files in storage/ are hydrated projections of DB rows with extra derived fields (e.g. messageID in part JSON).
DB counts: 725 sessions, 26,611 messages, 107,889 parts.
JSON files are fewer: 16,926 message JSONs vs 26,611 DB rows, indicating some messages exist only in the DB.
The DB is actively WAL-journaled (opencode.db-wal ~2MB, -shm 32KB), so reads must use the live DB path, not a copy.

Alternative rejected:

storage/{message,part,session,project}/ JSON tree — incomplete, redundant, and 2GB total vs 1.3GB DB.
event table — append-only event log with event_sequence aggregate IDs. Useful for replay but overkill for collector polling.
~/.cache/opencode/ — models.json, node_modules; no transcript data.
~/.config/opencode/ — auth, plugins, skills; no transcript data.
~/.local/share/opencode/tool-output/ — binary blobs referenced by callID, not structured text.

#2. Schema Overview

#2.1 Tables (relevant to collector)

CREATE TABLE session (
    id          TEXT PRIMARY KEY,
    project_id  TEXT NOT NULL,
    parent_id   TEXT,
    slug        TEXT NOT NULL,
    directory   TEXT NOT NULL,
    title       TEXT NOT NULL,
    version     TEXT NOT NULL,
    time_created INTEGER NOT NULL,
    time_updated INTEGER NOT NULL
);

CREATE TABLE message (
    id          TEXT PRIMARY KEY,
    session_id  TEXT NOT NULL,
    time_created INTEGER NOT NULL,
    time_updated INTEGER NOT NULL,
    data        TEXT NOT NULL  -- JSON
);

CREATE TABLE part (
    id          TEXT PRIMARY KEY,
    message_id  TEXT NOT NULL,
    session_id  TEXT NOT NULL,
    time_created INTEGER NOT NULL,
    time_updated INTEGER NOT NULL,
    data        TEXT NOT NULL  -- JSON
);

Indexes:

part_session_idx ON part (session_id)
session_project_idx ON session (project_id)

#2.2 Message `data` JSON shape

Top-level keys (observed):

role: "user" | "assistant"
agent: "build" | "implementer-deep" | "implementer-smart" | "oracle" | ...
mode: same as agent, the execution mode
parentID: previous message ID (for threading)
path: { cwd, root } — absolute filesystem paths (PII risk)
modelID, providerID: model metadata
tokens: { input, output, reasoning, cache: { read, write } }
cost: float
time: { created, completed? } — epoch millis
finish: "stop" | "tool-calls" | ...

#2.3 Part `data` JSON shape (by type)

Type	Keys	Content
`text`	`type`, `text`, `time`, `metadata?`	Raw prose
`reasoning`	`type`, `text`, `time`	Model reasoning block
`tool`	`type`, `tool`, `callID`, `state`	Tool invocation + output
`step-start`	`type`	Turn boundary marker
`step-finish`	`type`, `reason`, `tokens`, `cost`, `snapshot`	Turn summary
`patch`	`type`, `patch`, `path`	Code diff
`file`	`type`, `path`, `content?`	File reference
`compaction`	...	Summarized history

#2.4 Tool output storage

Tool outputs are not stored in part.data JSON. Instead:

part.data.state.output contains a short summary/truncation marker for some tools.
Large tool outputs are written to ~/.local/share/opencode/tool-output/tool_<callID> as opaque files.
20 files observed, ranging from 0B to 6MB.
The callID in the part JSON links to the filename.

#3. Session / Message / Part Flow

A single user → assistant turn is not one row. It is a multi-part sequence:

message (user, role=user)
  └── part (type=text)          — user prompt

message (assistant, role=assistant)
  └── part (type=step-start)    — turn begins
  └── part (type=reasoning)     — model thinking
  └── part (type=tool)          — tool call(s)
  └── part (type=step-finish)   — turn ends

For search indexing, a "turn" is best defined as:

User turn: one message with role=user + its text part(s).
Assistant turn: one message with role=assistant + all associated text, reasoning, and tool parts between step-start and step-finish.

#4. Progress Marker Choice

Recommendation: message.rowid as the progress offset.

Rationale:

AS4 assumes an integer offset. SQLite rowid is an integer marker and fits the existing collector state schema without a new format.
The collector persists progress from accepted event Seq values. TurnEvent.Seq stores the current message.rowid, while ingestion_state.last_offset stores the next rowid to scan. This keeps partial acceptance resumable: a saved next event rowid is included on the next poll, and a fully accepted scan saves max(rowid)+1.
message.rowid is unique and monotonic for rows in the source DB, so equal-millisecond messages do not skip unaccepted rows.
message.time_created is still the event timestamp, but it is not a safe offset: multiple messages can share the same millisecond.
Messages and parts both have time_created, but messages are the turn boundaries; parts inherit the message's session and can be queried via message_id.
Alternative: message.id lexicographic — IDs are sortable ULIDs (e.g. msg_df14a9a20001MitrpJBdwZHMN4) but integer offset is simpler for the existing state schema.
Alternative: event.seq — requires joining through event_sequence, adds complexity.

Query pattern for incremental collection:

SELECT * FROM message
WHERE rowid >= ?
ORDER BY rowid ASC;

#5. Fixture Anonymization Notes

Fields that must be redacted in test fixtures and logs:

path.cwd, path.root — absolute local paths.
modelID / providerID — may contain API key fragments or account info.
Tool output file contents (tool-output/tool_<callID>) — may contain secrets, env vars, tokens.
auth.json — contains access tokens, refresh tokens; never read in parser.
Session title — may contain internal project names or ticket IDs.

Safe to retain:

ID prefixes and structure (ses_, msg_, prt_, tool_).
role, agent, mode, type enums.
time_created relative offsets.
slug — random adjective-noun pairs, no PII.
Token counts and costs (aggregated, not content).

#6. Parser Risks

Risk	Impact	Mitigation
Schema drift	opencode may add new `part` types or message keys	Defensive JSON unmarshaling; ignore unknown keys
WAL lock contention	Collector reads while opencode writes	Use `mode=ro` URI or copy with `sqlite3` CLI if needed
Large tool outputs	6MB+ files could bloat index	Only index tool output metadata/summary; skip `tool-output/` files unless `include_tool_outputs=1`
Part ordering ambiguity	`time_created` may collide for parts in same message	Use `id` lexicographic as secondary sort
Eventual consistency	Event sourcing means DB state may lag events	Poll `message` table directly; events are for replay only
Nested JSON depth	`state.output` in tool parts may be stringified JSON	Treat as opaque text or attempt one-level unmarshal
Session archiving	`time_archived` column exists; archived sessions may be compacted	Respect `time_archived IS NULL` or include archived based on config

#7. Open Questions for Parser Implementation

Should the collector index every session or only those matching a project filter?
Should reasoning parts be indexed as prose or excluded by default?
Should patch parts be indexed as code/text or handled separately?
How to handle parent-child message threading (parentID) in search results?
What is the compaction format in compaction parts? (Only 60 observed, low priority.)

#8. Summary for Next Phase

Source: SQLite ~/.local/share/opencode/opencode.db, read-only.
Tables: session, message, part.
Progress: message.rowid (INTEGER SQLite row marker).
Turn definition: message row + linked part rows by message_id.
Tool outputs: External files in tool-output/, referenced by callID.
Registration: One parser package implementing parser.Parser + registration in collector.

~bigbes/lethe