# opencode Storage Format Spike

**Date:** 2026-05-04  
**Scope:** Determine canonical source, schema, progress marker, and parser risks for lethe collector.  
**Respects:** IV10, PC3, AS3, AS4, UK1.

---

## 1. Canonical Source Choice

**Recommendation: SQLite database (`~/.local/share/opencode/opencode.db`)**

Rationale:
- The DB is the source of truth. JSON files in `storage/` are hydrated projections of DB rows with extra derived fields (e.g. `messageID` in part JSON).
- DB counts: **725 sessions**, **26,611 messages**, **107,889 parts**.
- JSON files are fewer: 16,926 message JSONs vs 26,611 DB rows, indicating some messages exist only in the DB.
- The DB is actively WAL-journaled (`opencode.db-wal` ~2MB, `-shm` 32KB), so reads must use the live DB path, not a copy.

Alternative rejected:
- `storage/{message,part,session,project}/` JSON tree — incomplete, redundant, and 2GB total vs 1.3GB DB.
- `event` table — append-only event log with `event_sequence` aggregate IDs. Useful for replay but overkill for collector polling.
- `~/.cache/opencode/` — models.json, node_modules; no transcript data.
- `~/.config/opencode/` — auth, plugins, skills; no transcript data.
- `~/.local/share/opencode/tool-output/` — binary blobs referenced by `callID`, not structured text.

---

## 2. Schema Overview

### 2.1 Tables (relevant to collector)

```sql
CREATE TABLE session (
    id          TEXT PRIMARY KEY,
    project_id  TEXT NOT NULL,
    parent_id   TEXT,
    slug        TEXT NOT NULL,
    directory   TEXT NOT NULL,
    title       TEXT NOT NULL,
    version     TEXT NOT NULL,
    time_created INTEGER NOT NULL,
    time_updated INTEGER NOT NULL
);

CREATE TABLE message (
    id          TEXT PRIMARY KEY,
    session_id  TEXT NOT NULL,
    time_created INTEGER NOT NULL,
    time_updated INTEGER NOT NULL,
    data        TEXT NOT NULL  -- JSON
);

CREATE TABLE part (
    id          TEXT PRIMARY KEY,
    message_id  TEXT NOT NULL,
    session_id  TEXT NOT NULL,
    time_created INTEGER NOT NULL,
    time_updated INTEGER NOT NULL,
    data        TEXT NOT NULL  -- JSON
);
```

Indexes:
- `part_session_idx` ON `part` (`session_id`)
- `session_project_idx` ON `session` (`project_id`)

### 2.2 Message `data` JSON shape

Top-level keys (observed):
- `role`: `"user" | "assistant"`
- `agent`: `"build" | "implementer-deep" | "implementer-smart" | "oracle" | ...`
- `mode`: same as agent, the execution mode
- `parentID`: previous message ID (for threading)
- `path`: `{ cwd, root }` — absolute filesystem paths (PII risk)
- `modelID`, `providerID`: model metadata
- `tokens`: `{ input, output, reasoning, cache: { read, write } }`
- `cost`: float
- `time`: `{ created, completed? }` — epoch millis
- `finish`: `"stop" | "tool-calls" | ...`

### 2.3 Part `data` JSON shape (by type)

| Type | Keys | Content |
|------|------|---------|
| `text` | `type`, `text`, `time`, `metadata?` | Raw prose |
| `reasoning` | `type`, `text`, `time` | Model reasoning block |
| `tool` | `type`, `tool`, `callID`, `state` | Tool invocation + output |
| `step-start` | `type` | Turn boundary marker |
| `step-finish` | `type`, `reason`, `tokens`, `cost`, `snapshot` | Turn summary |
| `patch` | `type`, `patch`, `path` | Code diff |
| `file` | `type`, `path`, `content?` | File reference |
| `compaction` | ... | Summarized history |

### 2.4 Tool output storage

Tool outputs are **not** stored in `part.data` JSON. Instead:
- `part.data.state.output` contains a short summary/truncation marker for some tools.
- Large tool outputs are written to `~/.local/share/opencode/tool-output/tool_<callID>` as opaque files.
- 20 files observed, ranging from 0B to 6MB.
- The `callID` in the part JSON links to the filename.

---

## 3. Session / Message / Part Flow

A single user → assistant turn is not one row. It is a multi-part sequence:

```
message (user, role=user)
  └── part (type=text)          — user prompt

message (assistant, role=assistant)
  └── part (type=step-start)    — turn begins
  └── part (type=reasoning)     — model thinking
  └── part (type=tool)          — tool call(s)
  └── part (type=step-finish)   — turn ends
```

For search indexing, a "turn" is best defined as:
- **User turn**: one `message` with `role=user` + its `text` part(s).
- **Assistant turn**: one `message` with `role=assistant` + all associated `text`, `reasoning`, and `tool` parts between `step-start` and `step-finish`.

---

## 4. Progress Marker Choice

**Recommendation: `message.rowid` as the progress offset.**

Rationale:
- AS4 assumes an integer offset. SQLite `rowid` is an integer marker and fits the existing collector state schema without a new format.
- The collector persists progress from accepted event `Seq` values. `TurnEvent.Seq` stores the current `message.rowid`, while `ingestion_state.last_offset` stores the next rowid to scan. This keeps partial acceptance resumable: a saved next event rowid is included on the next poll, and a fully accepted scan saves `max(rowid)+1`.
- `message.rowid` is unique and monotonic for rows in the source DB, so equal-millisecond messages do not skip unaccepted rows.
- `message.time_created` is still the event timestamp, but it is not a safe offset: multiple messages can share the same millisecond.
- Messages and parts both have `time_created`, but messages are the turn boundaries; parts inherit the message's session and can be queried via `message_id`.
- Alternative: `message.id` lexicographic — IDs are sortable ULIDs (e.g. `msg_df14a9a20001MitrpJBdwZHMN4`) but integer offset is simpler for the existing state schema.
- Alternative: `event.seq` — requires joining through `event_sequence`, adds complexity.

Query pattern for incremental collection:
```sql
SELECT * FROM message
WHERE rowid >= ?
ORDER BY rowid ASC;
```

---

## 5. Fixture Anonymization Notes

Fields that must be redacted in test fixtures and logs:
- `path.cwd`, `path.root` — absolute local paths.
- `modelID` / `providerID` — may contain API key fragments or account info.
- Tool output file contents (`tool-output/tool_<callID>`) — may contain secrets, env vars, tokens.
- `auth.json` — contains access tokens, refresh tokens; never read in parser.
- Session `title` — may contain internal project names or ticket IDs.

Safe to retain:
- ID prefixes and structure (`ses_`, `msg_`, `prt_`, `tool_`).
- `role`, `agent`, `mode`, `type` enums.
- `time_created` relative offsets.
- `slug` — random adjective-noun pairs, no PII.
- Token counts and costs (aggregated, not content).

---

## 6. Parser Risks

| Risk | Impact | Mitigation |
|------|--------|------------|
| **Schema drift** | opencode may add new `part` types or message keys | Defensive JSON unmarshaling; ignore unknown keys |
| **WAL lock contention** | Collector reads while opencode writes | Use `mode=ro` URI or copy with `sqlite3` CLI if needed |
| **Large tool outputs** | 6MB+ files could bloat index | Only index tool output metadata/summary; skip `tool-output/` files unless `include_tool_outputs=1` |
| **Part ordering ambiguity** | `time_created` may collide for parts in same message | Use `id` lexicographic as secondary sort |
| **Eventual consistency** | Event sourcing means DB state may lag events | Poll `message` table directly; events are for replay only |
| **Nested JSON depth** | `state.output` in tool parts may be stringified JSON | Treat as opaque text or attempt one-level unmarshal |
| **Session archiving** | `time_archived` column exists; archived sessions may be compacted | Respect `time_archived IS NULL` or include archived based on config |

---

## 7. Open Questions for Parser Implementation

1. Should the collector index **every** session or only those matching a project filter?
2. Should `reasoning` parts be indexed as prose or excluded by default?
3. Should `patch` parts be indexed as code/text or handled separately?
4. How to handle parent-child message threading (`parentID`) in search results?
5. What is the compaction format in `compaction` parts? (Only 60 observed, low priority.)

---

## 8. Summary for Next Phase

- **Source:** SQLite `~/.local/share/opencode/opencode.db`, read-only.
- **Tables:** `session`, `message`, `part`.
- **Progress:** `message.rowid` (INTEGER SQLite row marker).
- **Turn definition:** `message` row + linked `part` rows by `message_id`.
- **Tool outputs:** External files in `tool-output/`, referenced by `callID`.
- **Registration:** One parser package implementing `parser.Parser` + registration in collector.