~bigbes/lethe

ref: f0f651bfc74681988f56bd610074e1dce6dbee1c lethe/docs/spikes/opencode-format.md -rw-r--r-- 8.4 KiB
f0f651bf — Eugene Blikh feat: add search data layer — adapter, highlight helper, hook, and tests 24 days ago

#opencode Storage Format Spike

Date: 2026-05-04
Scope: Determine canonical source, schema, progress marker, and parser risks for lethe collector.
Respects: IV10, PC3, AS3, AS4, UK1.


#1. Canonical Source Choice

Recommendation: SQLite database (~/.local/share/opencode/opencode.db)

Rationale:

  • The DB is the source of truth. JSON files in storage/ are hydrated projections of DB rows with extra derived fields (e.g. messageID in part JSON).
  • DB counts: 725 sessions, 26,611 messages, 107,889 parts.
  • JSON files are fewer: 16,926 message JSONs vs 26,611 DB rows, indicating some messages exist only in the DB.
  • The DB is actively WAL-journaled (opencode.db-wal ~2MB, -shm 32KB), so reads must use the live DB path, not a copy.

Alternative rejected:

  • storage/{message,part,session,project}/ JSON tree — incomplete, redundant, and 2GB total vs 1.3GB DB.
  • event table — append-only event log with event_sequence aggregate IDs. Useful for replay but overkill for collector polling.
  • ~/.cache/opencode/ — models.json, node_modules; no transcript data.
  • ~/.config/opencode/ — auth, plugins, skills; no transcript data.
  • ~/.local/share/opencode/tool-output/ — binary blobs referenced by callID, not structured text.

#2. Schema Overview

#2.1 Tables (relevant to collector)

CREATE TABLE session (
    id          TEXT PRIMARY KEY,
    project_id  TEXT NOT NULL,
    parent_id   TEXT,
    slug        TEXT NOT NULL,
    directory   TEXT NOT NULL,
    title       TEXT NOT NULL,
    version     TEXT NOT NULL,
    time_created INTEGER NOT NULL,
    time_updated INTEGER NOT NULL
);

CREATE TABLE message (
    id          TEXT PRIMARY KEY,
    session_id  TEXT NOT NULL,
    time_created INTEGER NOT NULL,
    time_updated INTEGER NOT NULL,
    data        TEXT NOT NULL  -- JSON
);

CREATE TABLE part (
    id          TEXT PRIMARY KEY,
    message_id  TEXT NOT NULL,
    session_id  TEXT NOT NULL,
    time_created INTEGER NOT NULL,
    time_updated INTEGER NOT NULL,
    data        TEXT NOT NULL  -- JSON
);

Indexes:

  • part_session_idx ON part (session_id)
  • session_project_idx ON session (project_id)

#2.2 Message data JSON shape

Top-level keys (observed):

  • role: "user" | "assistant"
  • agent: "build" | "implementer-deep" | "implementer-smart" | "oracle" | ...
  • mode: same as agent, the execution mode
  • parentID: previous message ID (for threading)
  • path: { cwd, root } — absolute filesystem paths (PII risk)
  • modelID, providerID: model metadata
  • tokens: { input, output, reasoning, cache: { read, write } }
  • cost: float
  • time: { created, completed? } — epoch millis
  • finish: "stop" | "tool-calls" | ...

#2.3 Part data JSON shape (by type)

Type Keys Content
text type, text, time, metadata? Raw prose
reasoning type, text, time Model reasoning block
tool type, tool, callID, state Tool invocation + output
step-start type Turn boundary marker
step-finish type, reason, tokens, cost, snapshot Turn summary
patch type, patch, path Code diff
file type, path, content? File reference
compaction ... Summarized history

#2.4 Tool output storage

Tool outputs are not stored in part.data JSON. Instead:

  • part.data.state.output contains a short summary/truncation marker for some tools.
  • Large tool outputs are written to ~/.local/share/opencode/tool-output/tool_<callID> as opaque files.
  • 20 files observed, ranging from 0B to 6MB.
  • The callID in the part JSON links to the filename.

#3. Session / Message / Part Flow

A single user → assistant turn is not one row. It is a multi-part sequence:

message (user, role=user)
  └── part (type=text)          — user prompt

message (assistant, role=assistant)
  └── part (type=step-start)    — turn begins
  └── part (type=reasoning)     — model thinking
  └── part (type=tool)          — tool call(s)
  └── part (type=step-finish)   — turn ends

For search indexing, a "turn" is best defined as:

  • User turn: one message with role=user + its text part(s).
  • Assistant turn: one message with role=assistant + all associated text, reasoning, and tool parts between step-start and step-finish.

#4. Progress Marker Choice

Recommendation: message.rowid as the progress offset.

Rationale:

  • AS4 assumes an integer offset. SQLite rowid is an integer marker and fits the existing collector state schema without a new format.
  • The collector persists progress from accepted event Seq values. TurnEvent.Seq stores the current message.rowid, while ingestion_state.last_offset stores the next rowid to scan. This keeps partial acceptance resumable: a saved next event rowid is included on the next poll, and a fully accepted scan saves max(rowid)+1.
  • message.rowid is unique and monotonic for rows in the source DB, so equal-millisecond messages do not skip unaccepted rows.
  • message.time_created is still the event timestamp, but it is not a safe offset: multiple messages can share the same millisecond.
  • Messages and parts both have time_created, but messages are the turn boundaries; parts inherit the message's session and can be queried via message_id.
  • Alternative: message.id lexicographic — IDs are sortable ULIDs (e.g. msg_df14a9a20001MitrpJBdwZHMN4) but integer offset is simpler for the existing state schema.
  • Alternative: event.seq — requires joining through event_sequence, adds complexity.

Query pattern for incremental collection:

SELECT * FROM message
WHERE rowid >= ?
ORDER BY rowid ASC;

#5. Fixture Anonymization Notes

Fields that must be redacted in test fixtures and logs:

  • path.cwd, path.root — absolute local paths.
  • modelID / providerID — may contain API key fragments or account info.
  • Tool output file contents (tool-output/tool_<callID>) — may contain secrets, env vars, tokens.
  • auth.json — contains access tokens, refresh tokens; never read in parser.
  • Session title — may contain internal project names or ticket IDs.

Safe to retain:

  • ID prefixes and structure (ses_, msg_, prt_, tool_).
  • role, agent, mode, type enums.
  • time_created relative offsets.
  • slug — random adjective-noun pairs, no PII.
  • Token counts and costs (aggregated, not content).

#6. Parser Risks

Risk Impact Mitigation
Schema drift opencode may add new part types or message keys Defensive JSON unmarshaling; ignore unknown keys
WAL lock contention Collector reads while opencode writes Use mode=ro URI or copy with sqlite3 CLI if needed
Large tool outputs 6MB+ files could bloat index Only index tool output metadata/summary; skip tool-output/ files unless include_tool_outputs=1
Part ordering ambiguity time_created may collide for parts in same message Use id lexicographic as secondary sort
Eventual consistency Event sourcing means DB state may lag events Poll message table directly; events are for replay only
Nested JSON depth state.output in tool parts may be stringified JSON Treat as opaque text or attempt one-level unmarshal
Session archiving time_archived column exists; archived sessions may be compacted Respect time_archived IS NULL or include archived based on config

#7. Open Questions for Parser Implementation

  1. Should the collector index every session or only those matching a project filter?
  2. Should reasoning parts be indexed as prose or excluded by default?
  3. Should patch parts be indexed as code/text or handled separately?
  4. How to handle parent-child message threading (parentID) in search results?
  5. What is the compaction format in compaction parts? (Only 60 observed, low priority.)

#8. Summary for Next Phase

  • Source: SQLite ~/.local/share/opencode/opencode.db, read-only.
  • Tables: session, message, part.
  • Progress: message.rowid (INTEGER SQLite row marker).
  • Turn definition: message row + linked part rows by message_id.
  • Tool outputs: External files in tool-output/, referenced by callID.
  • Registration: One parser package implementing parser.Parser + registration in collector.