# lethe-search-and-opencode

**Status:** done
**Branch:** `task/lethe-search-and-opencode`
**Worktree:** `/Users/blikh/data/home/lethe/.worktrees/lethe-search-and-opencode`
**Mode:** hands-off
**Module:** `sourcecraft.dev/bigbes/lethe`
**Depends on:** `lethe-server.md` (#1) — FTS5 tables and triggers were created in #1; this task only adds query code. `lethe-collector-claude-code.md` (#2) — the collector framework and Parser interface this task extends.
**Sibling tasks (deferred):** per-tool parsers (`lethe-collector-crush.md`, `lethe-collector-pi.md`, `lethe-collector-kimi.md`); RFC backlog items (cost rollups for tools that report it, tagging, JSON/Markdown export).

## Design

### Purpose

Make the archive searchable by exposing the existing FTS5 indexes through `/api/v1/search`, and prove the collector parser boundary with a second tool: opencode.

A successful end state for this task: ingested Claude Code and opencode turns can be searched through the authenticated JSON API, with ranked snippets and session anchors that #7 can render in the existing React `/search` route.

### Scope

**In:**
- `GET /api/v1/search?q=&tool=&host=&since=&until=&include_tool_outputs=&limit=&cursor=` — owner-scoped FTS5 query against `turns_fts`; opt-in union with `tool_outputs_fts`.
- `internal/domain/search/` — repository and handler matching the existing domain package shape.
- Cursor pagination over the ranked result set; cursor is opaque and invalid cursors are `400 INVALID`.
- JSON result rows include enough data for #7 to link to `/session/{tool}/{host}/{session_id}#turn-{turn_id}`.
- Collector-side:
  - `internal/collector/parser/opencode/` — new parser implementing the same Parser interface from #2.
  - Format-discovery spike captured under `docs/spikes/opencode-format.md` before the parser is written; spike output is checked in.
  - One parser registration in `cmd/lethe-collector`; otherwise the collector runner framework is untouched.
- Golden fixtures for the opencode parser (anonymized snippets in `testdata/opencode/`).

**Out:**
- React search UI — #7 owns filling `web/src/routes/search.tsx` and any saved-search execute flow.
- Stats API and stats page — already shipped by #5 and left unchanged here.
- Server-rendered HTML and vanilla search JS — superseded by the shipped React SPA.
- Schema migrations — #1 already created `turns_fts` and `tool_outputs_fts`.
- crush, pi, kimi parsers — separate task files when the time comes.
- Tag system for manual session annotation (RFC backlog).
- JSON / Markdown export endpoints (RFC backlog).
- Faceted search UI (e.g. histogram-driven date range pickers). Filters stay as form fields and URL params.
- Saved-search CRUD, alerts, RSS, anything subscription-shaped.
- A second machine's deploy (the goal is to prove the parser interface from #2 with a second tool; running on the work PC is a deployment exercise, not a code change).

### Chosen approach

**Search API.** Add a `search` domain package, mount it under the existing authenticated `/api/v1` group, and register it in the steward graph beside `session`, `project`, `stats`, and `savedsearch`.

Response shape:

```json
{
  "results": [
    {
      "tool": "claude-code",
      "host": "laptop",
      "session_id": "...",
      "turn_id": "...",
      "timestamp": 1760000000,
      "role": "user",
      "working_dir": "/repo",
      "snippet": "...\u0002term\u0003...",
      "match_source": "turn",
      "rank": -1.23
    }
  ],
  "limit": 50,
  "next_cursor": "opaque-or-empty"
}
```

`snippet` uses marker runes instead of HTML so #7 can render highlights with React text nodes; `match_source` is `turn` or `tool_output`.
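A minimal sketch of how a consumer could split a marker snippet into plain and highlighted runs. It is written in Go only to pin down the contract; #7 would do the equivalent in the React route. `Segment` and `SplitSnippet` are hypothetical names, and the marker values are assumed to match the `char(2)` / `char(3)` arguments in the query below.

```go
package main

import (
	"fmt"
	"strings"
)

// Marker runes, assumed to match char(2) and char(3) in the snippet() call.
const (
	markStart = '\x02'
	markEnd   = '\x03'
)

// Segment is a hypothetical helper type: one run of snippet text plus
// whether that run sits inside a highlight.
type Segment struct {
	Text      string
	Highlight bool
}

// SplitSnippet walks the snippet once and cuts it at every marker rune.
// Unbalanced markers are tolerated: a missing end marker highlights to
// the end of the string.
func SplitSnippet(s string) []Segment {
	var (
		out []Segment
		buf strings.Builder
		hl  bool
	)
	flush := func() {
		if buf.Len() > 0 {
			out = append(out, Segment{Text: buf.String(), Highlight: hl})
			buf.Reset()
		}
	}
	for _, r := range s {
		switch r {
		case markStart:
			flush()
			hl = true
		case markEnd:
			flush()
			hl = false
		default:
			buf.WriteRune(r)
		}
	}
	flush()
	return out
}

func main() {
	for _, seg := range SplitSnippet("fix the \x02parser\x03 offset") {
		fmt.Printf("%q highlight=%v\n", seg.Text, seg.Highlight)
	}
}
```

Because the markers never reach the DOM as markup, each segment can be rendered as an ordinary text node, with highlighted segments wrapped in `<mark>`.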

**Search query (default — `include_tool_outputs=0`).** One FTS5 `MATCH` against `turns_fts`, joined back to `turns`/`sessions` by rowid and composite key so filters and result metadata come from canonical tables:

```sql
SELECT
    t.tool, t.host, t.session_id, t.turn_id, t.timestamp, t.role,
    s.working_dir,
    snippet(turns_fts, 0, char(2), char(3), '…', 32) AS snippet,
    bm25(turns_fts) AS rank,
    'turn' AS match_source
FROM turns_fts
JOIN turns AS t ON t.rowid = turns_fts.rowid
JOIN sessions AS s ON s.owner = t.owner AND s.tool = t.tool AND s.host = t.host AND s.session_id = t.session_id
WHERE turns_fts MATCH ?
  AND t.owner = ?
  AND (? IS NULL OR t.tool = ?)
  AND (? IS NULL OR t.host = ?)
  AND (? IS NULL OR t.timestamp >= ?)
  AND (? IS NULL OR t.timestamp <  ?)
ORDER BY rank ASC, t.timestamp DESC, t.turn_id ASC
LIMIT ?;
```
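Each optional filter in the query above appears as a `(? IS NULL OR col = ?)` pair, so its value has to be bound twice and in the right order. The sketch below shows one way to build the argument list. The `Filter` fields, `opt`, and `SearchArgs` are assumptions made for illustration, not the shipped repository code.

```go
package main

import "fmt"

// Filter is a hypothetical shape; nil pointers mean "filter not set".
type Filter struct {
	Query, Owner string
	Tool, Host   *string
	Since, Until *int64
	Limit        int
}

// opt turns a nil pointer into an untyped nil (bound as SQL NULL) and a
// non-nil pointer into its value.
func opt[T any](p *T) any {
	if p == nil {
		return nil
	}
	return *p
}

// SearchArgs returns the 11 positional arguments in placeholder order:
// MATCH, owner, then each optional filter twice, then LIMIT.
func SearchArgs(f Filter) []any {
	tool, host := opt(f.Tool), opt(f.Host)
	since, until := opt(f.Since), opt(f.Until)
	return []any{
		f.Query, f.Owner,
		tool, tool,
		host, host,
		since, since,
		until, until,
		f.Limit,
	}
}

func main() {
	tool := "opencode"
	args := SearchArgs(Filter{Query: "parser", Owner: "me", Tool: &tool, Limit: 50})
	fmt.Println(len(args), args)
}
```

The resulting slice could be passed as the variadic arguments of `QueryContext` in `database/sql`. Named parameters would remove the duplication, at the cost of depending on driver support for them.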

The pagination cursor encodes `(rank, timestamp, turn_id, match_source)` of the last returned row and is bound to the normalized query/filter tuple that produced it, so it is only valid when replayed with the same query and filters.
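One way to tie a cursor to its query/filter tuple is to store a short hash of the normalized filter next to the position and reject the cursor when the hash does not match. The sketch below assumes that approach. The `Filter` and `Cursor` fields, the `envelope` wrapper, and `ErrInvalidCursor` are hypothetical; only the `EncodeCursor` / `DecodeCursor` signatures come from the plan.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
)

// Filter and Cursor reuse names from the plan; the fields are assumptions.
type Filter struct {
	Query, Tool, Host  string
	Since, Until       int64
	IncludeToolOutputs bool
}

type Cursor struct {
	Rank        float64 `json:"r"`
	Timestamp   int64   `json:"ts"`
	TurnID      string  `json:"id"`
	MatchSource string  `json:"src"`
}

// ErrInvalidCursor is what a handler would map to 400 INVALID.
var ErrInvalidCursor = errors.New("invalid cursor")

// envelope is the serialized cursor: the position plus a filter fingerprint.
type envelope struct {
	Cursor
	FilterHash string `json:"f"`
}

// filterHash fingerprints the normalized filter tuple.
func filterHash(f Filter) (string, error) {
	b, err := json.Marshal(f)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:8]), nil
}

func EncodeCursor(c Cursor, f Filter) (string, error) {
	h, err := filterHash(f)
	if err != nil {
		return "", err
	}
	b, err := json.Marshal(envelope{Cursor: c, FilterHash: h})
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(b), nil
}

func DecodeCursor(raw string, f Filter) (Cursor, error) {
	b, err := base64.RawURLEncoding.DecodeString(raw)
	if err != nil {
		return Cursor{}, ErrInvalidCursor
	}
	var e envelope
	if err := json.Unmarshal(b, &e); err != nil {
		return Cursor{}, ErrInvalidCursor
	}
	h, err := filterHash(f)
	if err != nil {
		return Cursor{}, err
	}
	if e.FilterHash != h {
		return Cursor{}, ErrInvalidCursor // cursor belongs to a different query
	}
	return e.Cursor, nil
}

func main() {
	f := Filter{Query: "parser", Tool: "opencode"}
	raw, _ := EncodeCursor(Cursor{Rank: -1.23, Timestamp: 1760000000, TurnID: "t1", MatchSource: "turn"}, f)
	c, err := DecodeCursor(raw, f)
	fmt.Println(c.TurnID, err)
	_, err = DecodeCursor(raw, Filter{Query: "other"})
	fmt.Println(err)
}
```

The encoding is opaque but not signed. That is adequate if the cursor only narrows a query that is already owner-scoped, which is the assumption here.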

**Search query (`include_tool_outputs=1`).** Two `MATCH` queries, one per FTS table, `UNION ALL`, then window-dedupe on `(tool, host, session_id, turn_id)` keeping the better-ranked match and exposing which source won.
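The dedupe rule is meant to run in SQL as a window over the union. The Go sketch below restates the same rule in memory so the expected behavior is easy to test against. `Row`, `turnKey`, and `DedupeBestRank` are hypothetical names, and the sketch relies on FTS5 `bm25()` scores being better when smaller.

```go
package main

import "fmt"

// Row is a hypothetical, trimmed-down result row.
type Row struct {
	Tool, Host, SessionID, TurnID string
	Rank                          float64
	MatchSource                   string
}

// turnKey is the dedupe key named in the design.
type turnKey struct{ tool, host, sessionID, turnID string }

// DedupeBestRank keeps one row per turn, preferring the lower bm25 value.
// The surviving row carries the match_source that won. Ordering by rank is
// applied afterwards, as the SQL ORDER BY would do.
func DedupeBestRank(rows []Row) []Row {
	idx := make(map[turnKey]int, len(rows))
	out := make([]Row, 0, len(rows))
	for _, r := range rows {
		k := turnKey{r.Tool, r.Host, r.SessionID, r.TurnID}
		if i, ok := idx[k]; ok {
			if r.Rank < out[i].Rank {
				out[i] = r
			}
			continue
		}
		idx[k] = len(out)
		out = append(out, r)
	}
	return out
}

func main() {
	rows := []Row{
		{"opencode", "laptop", "s1", "t1", -1.0, "turn"},
		{"opencode", "laptop", "s1", "t1", -2.5, "tool_output"},
		{"opencode", "laptop", "s1", "t2", -0.7, "turn"},
	}
	for _, r := range DedupeBestRank(rows) {
		fmt.Println(r.TurnID, r.Rank, r.MatchSource)
	}
}
```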

**Query validation.** Empty `q` is `400 INVALID`; `limit` clamps to the existing 50/200 pattern; `since`/`until` parse as Unix seconds; invalid FTS syntax returns `400 INVALID` rather than a 500.
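The validation rules can be sketched as one parsing function. This is an illustration under stated assumptions: "50/200" is read as a default of 50 and a maximum of 200, an unparsable `limit` falls back to the default instead of failing, and `Params`, `ParseSearchQuery`, and `ErrInvalid` are hypothetical names. FTS syntax errors are not covered here because they only surface when the query runs.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// Assumed reading of the "50/200" pattern: default 50, hard cap 200.
const (
	defaultLimit = 50
	maxLimit     = 200
)

// ErrInvalid is what a handler would render as 400 INVALID.
var ErrInvalid = errors.New("INVALID")

type Params struct {
	Q                  string
	Limit              int
	Since, Until       *int64 // Unix seconds; nil when not supplied
	IncludeToolOutputs bool
}

// parseUnix reads an optional Unix-seconds parameter.
func parseUnix(v url.Values, key string) (*int64, error) {
	raw := v.Get(key)
	if raw == "" {
		return nil, nil
	}
	n, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return nil, fmt.Errorf("%w: %s must be Unix seconds", ErrInvalid, key)
	}
	return &n, nil
}

// ParseSearchQuery validates the raw query string of a search request.
func ParseSearchQuery(rawQuery string) (Params, error) {
	v, err := url.ParseQuery(rawQuery)
	if err != nil {
		return Params{}, fmt.Errorf("%w: malformed query string", ErrInvalid)
	}
	p := Params{Q: strings.TrimSpace(v.Get("q")), Limit: defaultLimit}
	if p.Q == "" {
		return Params{}, fmt.Errorf("%w: q must not be empty", ErrInvalid)
	}
	if n, err := strconv.Atoi(v.Get("limit")); err == nil && n > 0 {
		p.Limit = n
		if p.Limit > maxLimit {
			p.Limit = maxLimit
		}
	}
	if p.Since, err = parseUnix(v, "since"); err != nil {
		return Params{}, err
	}
	if p.Until, err = parseUnix(v, "until"); err != nil {
		return Params{}, err
	}
	p.IncludeToolOutputs = v.Get("include_tool_outputs") == "1"
	return p, nil
}

func main() {
	p, err := ParseSearchQuery("q=parser+offset&limit=999&since=1760000000")
	fmt.Println(p.Q, p.Limit, *p.Since, err)
	_, err = ParseSearchQuery("q=&limit=10")
	fmt.Println(err)
}
```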

**opencode parser — discovery first.** The local install currently exposes `~/.local/share/opencode/opencode.db`, `storage/session/**/*.json`, and `tool-output/*`; the spike decides which is canonical before parser code exists.

If session JSON files are canonical, implementation mirrors the Claude Code parser: discover session JSON files, parse from byte offset, and emit complete-turn events. If SQLite is canonical, implementation opens the DB read-only and uses `ingestion_state.last_offset` as a row marker. If neither source is stable, opencode leaves this task and the task still ships `/api/v1/search`.

**Tradeoffs that settled it.**
- *Keep #3 API-only vs include React UI:* API-only matches `docs/TODO.md`, unblocks #7 cleanly, and avoids mixing parser discovery with frontend route work.
- *Marker snippets vs HTML snippets:* markers avoid `dangerouslySetInnerHTML`; #7 can convert them to `<mark>` with normal React nodes.
- *Single-table FTS query vs always-union:* default prose search is faster and less noisy; tool outputs remain an explicit power-user toggle.
- *Cursor vs offset pagination:* cursor prevents an API break if the corpus grows; it costs one helper and a cursor validation test.
- *Discovery spike vs guess-and-iterate on opencode:* the spike is cheaper than implementing against the wrong store and creates the parser fixture map.

### Backwards-compatibility check

- Server: additive route only; existing `/api/v1/stats`, `/api/v1/sessions`, `/api/v1/projects`, `/api/v1/saved-searches`, and ingest behavior stay unchanged.
- Collector: additive parser registration only; existing `claude-code` sources keep the same config and parser behavior.
- Database: no migrations; the task reads the existing FTS tables and canonical `turns`/`sessions` tables.
- Web: the React `/search` stub remains a stub until #7.

### Hands-off decisions

- udesign-refresh: scope narrowed to `/api/v1/search` plus opencode parser — current `docs/TODO.md` assigns React search UI to #7, and stats already shipped in #5.
- udesign-refresh: server-rendered HTML/vanilla-JS search removed — the repo now serves a React SPA with an existing `/search` stub.
- udesign-refresh: snippets use non-HTML markers — future React UI can render highlights without unsafe HTML insertion.

TDD: yes (reason: FTS query behavior, cursor round-trips, owner scoping, FTS syntax errors, and opencode parser offsets are deterministic contracts where regressions should fail CI).

### Invariants

- IV1 — This task adds no schema migration files.
- IV2 — `internal/shared/wire/` types are not modified.
- IV3 — `/api/v1/search` is read-only and executes `SELECT` only.
- IV4 — Search results are scoped through the same authenticated owner rules as sessions/projects.
- IV5 — Default search queries `turns_fts` only; `tool_outputs_fts` is read only when `include_tool_outputs=1`.
- IV6 — API snippets contain marker runes, not HTML.
- IV7 — Empty or syntactically invalid FTS queries return `400 INVALID`, not `500`.
- IV8 — The opencode parser implements `parser.Parser` unchanged.
- IV9 — The collector runner and state schema are unchanged by opencode support.
- IV10 — `docs/spikes/opencode-format.md` is committed before opencode parser implementation lands.
- IV11 — Existing `/api/v1/stats` behavior and React `/stats` page are not changed by this task.
- IV12 — `web/src/routes/search.tsx` remains a stub until #7.

### Principles

- PC1 — API first, UI later: #3 returns data; #7 decides presentation.
- PC2 — Search defaults to prose turns; tool-output search is explicit.
- PC3 — Spike before parser code when the source format is unknown.
- PC4 — New parser support is one package plus one registration, not a new collector abstraction.

### Assumptions

- AS1 — `turns_fts` and `tool_outputs_fts` are kept current by #1's triggers for every ingested turn.
- AS2 — Joining FTS rowid back to `turns.rowid` is stable for the existing regular FTS5 tables.
- AS3 — opencode local storage has a readable canonical transcript source under `~/.local/share/opencode/`.
- AS4 — The collector state's integer offset can represent the chosen opencode progress marker.

### Unknowns

- UK1 — Which opencode store is canonical: `opencode.db`, `storage/session/**/*.json`, `tool-output/*`, or a combination.
- UK2 — Whether SQLite FTS query syntax needs a stricter user-query normalizer than passing the validated `q` through to `MATCH`.
- UK3 — Whether default BM25 quality is good enough on real lethe data.

## Plan

Approach: ship `/api/v1/search` as an additive read domain first, then run the opencode storage spike before writing the parser; keep #3 API/parser-only so #7 can consume the search contract without frontend churn here.

### PH1 — Search Repository

- Tier: deep — FTS5, owner scoping, dedupe, and cursor semantics are correctness-sensitive.
- **1.1** `internal/domain/search/repository.go:1-260` (create)
  - `type Result struct`, `type Row struct`, `type Filter struct`, `type Cursor struct` — API/domain shapes for JSON output, filters, and pagination.
  - `func (r *Repository) Search(ctx context.Context, f Filter) (*Result, error)` — executes default `turns_fts` search and optional `tool_outputs_fts` union with owner/tool/host/time filters.
  - `func EncodeCursor(c Cursor, f Filter) (string, error)` / `func DecodeCursor(raw string, f Filter) (Cursor, error)` — opaque cursor tied to normalized query/filter tuple.
  - Respects: IV1, IV2, IV3, IV4, IV5, IV6, IV7, IV11, IV12, PC1, PC2, AS1, AS2, UK2, UK3.
- **1.2** `internal/domain/search/repository_test.go:1-360` (create)
  - RED tests for owner isolation, tool/host/since/until filters, prose-only default, tool-output opt-in, dedupe, cursor next page, invalid cursor, marker snippets, and invalid FTS syntax mapping.
  - Respects: TDD, IV3-IV7, AS1, AS2.
- Commit: `search: add fts repository`

### PH2 — Search HTTP Wiring

- Tier: smart — follows existing handler/steward patterns but defines a new public API contract.
- **2.1** `internal/domain/search/handler.go:1-220` (create)
  - `func (h *Handler) Mount(r chi.Router)` — registers `GET /search` under `/api/v1`.
  - `func (h *Handler) List(w http.ResponseWriter, r *http.Request)` — resolves auth owner scope, parses query params, clamps limit to 50/200, renders JSON or RFC 7807 errors.
  - `func (h *Handler) resolveScope(r *http.Request) (session.OwnerScope, error)` — mirrors session/project admin owner rules.
  - Respects: IV3, IV4, IV7, PC1.
- **2.2** `internal/domain/search/handler_test.go:1-260` (create)
  - RED tests for route registration, missing/empty `q`, bad `since`, non-admin `owner`, admin `owner=*`, bad cursor, and successful response envelope.
  - Respects: TDD, IV4, IV7.
- **2.3** `internal/server/server.go:31-66,103-110` (modify)
  - Inject `*search.Handler` and mount it inside the authenticated `/api/v1` group.
  - Respects: IV4, IV11, IV12.
- **2.4** `cmd/lethe/main.go:26-137` and `cmd/lethe/main_e2e_test.go:73-92` (modify)
  - Register `search.Repository` and `search.Handler` with steward in production and e2e graph setup.
  - Respects: IV11.
- Commit: `search: expose search endpoint`

### PH3 — opencode Format Spike

- Tier: smart — exploratory but needs a durable writeup before parser code.
- **3.1** `cmd/lethe-spike-opencode/main.go:1-180` (create, then delete before phase commit)
  - Walk `~/.local/share/opencode/`, `~/.config/opencode/`, and `~/.cache/opencode/`; report structural file types, counts, sizes, and redacted samples.
  - Respects: PC3, AS3, UK1.
- **3.2** `docs/spikes/opencode-format.md:1-160` (create)
  - Record canonical source choice, session/message/tool-output shape, progress marker choice, fixture anonymization notes, and parser risks.
  - Respects: IV10, PC3, AS3, AS4, UK1.
- Commit: `collector: document opencode storage format`

### PH4 — opencode Parser

- Tier: deep — parser correctness affects resumability and archive integrity.
- **4.1** `internal/collector/parser/opencode/parser.go:1-320` (create)
  - `func New(host string) *Parser`, `func (p *Parser) Tool() string`, `func (p *Parser) Discover(root string) ([]parser.SourceFile, error)`, `func (p *Parser) Parse(path string, since int64) ([]wire.TurnEvent, int64, error)` — implement the source shape chosen in PH3 without changing `parser.Parser`.
  - `func mapRecord(...) (wire.TurnEvent, bool)` or SQLite-equivalent mapper — converts opencode session/message/tool-output records into `wire.TurnEvent`.
  - Respects: IV2, IV8, IV9, IV10, PC3, PC4, AS3, AS4.
- **4.2** `internal/collector/parser/opencode/parser_test.go:1-260` and `internal/collector/parser/opencode/testdata/*` (create)
  - RED tests for discovery, turn mapping, tool-output mapping, offset/marker resume, malformed-record fallback/skip behavior, and host/tool/source identity.
  - Respects: TDD, IV8, IV9, IV10.
- **4.3** `cmd/lethe-collector/main.go:17-221` and `cmd/lethe-collector/main_test.go:1-90` (modify)
  - Register `opencode.New(host)` in `buildParsers`; test that both `claude-code` and `opencode` are present.
  - Respects: IV8, IV9, PC4.
- Commit: `collector: add opencode parser`

### Test Strategy

- RED first: `internal/domain/search` repository tests for FTS result shape, owner scope, filters, cursor, tool-output opt-in, and invalid query handling.
- RED first: `internal/domain/search` handler tests for query parsing, auth scoping, route mount, and response envelope.
- RED first: opencode parser tests after PH3 selects the canonical source; no parser production code before fixtures exist.
- Existing safety net: `go test ./... -count=1`; collector CLI smoke with an opencode source in config once PH4 lands.

### Order & Dependencies

- PH1 blocks PH2.
- PH3 blocks PH4.
- PH1/PH2 and PH3/PH4 are otherwise independent; PH4 needs the collector branch already merged on `master`.

### Risks / Rollback

- RK1 — FTS5 `MATCH` syntax can turn user input into hard SQL errors; PH1 maps those to `400 INVALID` and keeps normalization isolated.
- RK2 — opencode may require multi-file joins between session JSON and `tool-output/*`; PH3 must choose a marker that PH4 can persist in `last_offset` without state schema changes.
- RK3 — Cursor pagination over BM25 may duplicate or skip rows if the tie-breaker is incomplete; PH1 orders by rank, timestamp, turn_id, and match_source and tests the boundary.
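RK3 can be made concrete with a small in-memory model of keyset paging. The sketch below assumes the full tie-break order is rank ascending, timestamp descending, turn_id ascending, then match_source ascending. `Row`, `less`, and `PageAfter` are hypothetical names; the real implementation would express the same comparison as a SQL predicate.

```go
package main

import (
	"fmt"
	"sort"
)

// Row holds only the columns that take part in ordering.
type Row struct {
	Rank        float64
	Timestamp   int64
	TurnID      string
	MatchSource string
}

// less is a total order as long as (turn_id, match_source) is unique,
// which is what keeps pages from overlapping or skipping rows.
func less(a, b Row) bool {
	if a.Rank != b.Rank {
		return a.Rank < b.Rank // lower bm25 first
	}
	if a.Timestamp != b.Timestamp {
		return a.Timestamp > b.Timestamp // newer first
	}
	if a.TurnID != b.TurnID {
		return a.TurnID < b.TurnID
	}
	return a.MatchSource < b.MatchSource
}

// PageAfter returns up to limit rows that sort strictly after cursor.
// A nil cursor means "first page".
func PageAfter(rows []Row, cursor *Row, limit int) []Row {
	sorted := append([]Row(nil), rows...)
	sort.Slice(sorted, func(i, j int) bool { return less(sorted[i], sorted[j]) })
	start := 0
	if cursor != nil {
		start = sort.Search(len(sorted), func(i int) bool { return less(*cursor, sorted[i]) })
	}
	end := start + limit
	if end > len(sorted) {
		end = len(sorted)
	}
	return sorted[start:end]
}

func main() {
	rows := []Row{
		{-1, 20, "c", "turn"},
		{-2, 10, "a", "turn"},
		{-1, 20, "b", "turn"},
		{-1, 5, "d", "turn"},
	}
	var cur *Row
	for {
		page := PageAfter(rows, cur, 2)
		if len(page) == 0 {
			break
		}
		fmt.Println(page)
		last := page[len(page)-1]
		cur = &last
	}
}
```

Dropping any of the trailing keys makes rows with equal leading keys indistinguishable to the cursor, which is the duplicate-or-skip failure the risk describes.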

### Interfaces

- IF1 — `func (r *Repository) Search(ctx context.Context, f Filter) (*Result, error)` — search read boundary used only by the HTTP handler.
- IF2 — `func (h *Handler) Mount(r chi.Router)` — server mount contract matching other domain packages.
- IF3 — `func New(host string) *Parser` — opencode parser constructor registered by the collector CLI.
- IF4 — `func buildParsers(host string) map[string]parser.Parser` — collector parser registry remains the only dispatch point.
- IF5 — `docs/spikes/opencode-format.md` — canonical opencode source choice consumed by the parser phase.
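IF3 and IF4 together amount to a keyed registry. The sketch below illustrates that shape with stand-in types: `SourceFile`, `TurnEvent`, and `stubParser` are placeholders for the real `parser.SourceFile`, `wire.TurnEvent`, and tool parsers, and the `Parser` method set is assumed from the signatures listed in PH4.

```go
package main

import (
	"fmt"
	"sort"
)

// Stand-ins for parser.SourceFile and wire.TurnEvent (assumed shapes).
type SourceFile struct{ Path string }
type TurnEvent struct{ Seq int64 }

// Parser is assumed to carry the three methods named in PH4 4.1.
type Parser interface {
	Tool() string
	Discover(root string) ([]SourceFile, error)
	Parse(path string, since int64) ([]TurnEvent, int64, error)
}

// stubParser is a placeholder for the real claude-code and opencode parsers.
type stubParser struct{ tool, host string }

func (p *stubParser) Tool() string { return p.tool }

func (p *stubParser) Discover(string) ([]SourceFile, error) { return nil, nil }

func (p *stubParser) Parse(_ string, since int64) ([]TurnEvent, int64, error) {
	return nil, since, nil // nothing parsed, marker unchanged
}

// buildParsers keys each parser by its own Tool() name, so adding a tool
// is one more element in the slice and nothing else.
func buildParsers(host string) map[string]Parser {
	out := map[string]Parser{}
	for _, p := range []Parser{
		&stubParser{tool: "claude-code", host: host},
		&stubParser{tool: "opencode", host: host},
	} {
		out[p.Tool()] = p
	}
	return out
}

func main() {
	parsers := buildParsers("laptop")
	names := make([]string, 0, len(parsers))
	for name := range parsers {
		names = append(names, name)
	}
	sort.Strings(names)
	fmt.Println(names)
}
```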

### Interface Graph

- PH1 -> IF1 @ `internal/domain/search/`
- PH2 IF1 -> IF2 @ `internal/domain/search/`, `internal/server/`, `cmd/lethe/`
- PH3 -> IF5 @ `docs/spikes/opencode-format.md`
- PH4 IF5 -> IF3, IF4 @ `internal/collector/parser/opencode/`, `cmd/lethe-collector/`

Backwards-compat: additive route and parser registration only; PH1/PH2 do not alter existing routes or schema, and PH4 does not change the parser interface, runner, or collector state schema.

Scope check: no stats work, no React search UI, no schema migration, no saved-search changes, and no parser abstraction beyond `buildParsers`.

## Verify

**Result:** passed

Positive:
- CK1 — `/api/v1/search` repository and handler tests cover ranked prose search, tool-output opt-in, filters, cursors, and response envelope.
- CK2 — opencode parser tests cover SQLite discovery, turn mapping, tool summaries, resume marker, malformed skips, and collector registration.
- CK3 — `go build ./cmd/lethe ./cmd/lethe-collector` succeeds.
- CK4 — `go test ./... -count=1` passes.

Negative:
- CK5 — empty/invalid search query and bad cursor return `INVALID`.
- CK6 — non-admin `?owner=` on search returns `FORBIDDEN`.
- CK7 — opencode parser does not ingest external `tool-output/` blob contents.

Invariants / assumptions:
- CK8 (IV1, IV2) — no search package references schema DDL or `internal/shared/wire`.
- CK9 (IV3-IV7) — search tests verify read-path behavior, owner scoping, prose default, marker snippets, and invalid-query handling.
- CK10 (IV8-IV10, AS3, AS4) — opencode parser implements `parser.Parser`, keeps collector state schema unchanged, and consumes the committed storage spike.
- CK11 (IV11, IV12) — stats packages and React `/search` route were not changed.

Interfaces:
- CK12 (IF1) — `Repository.Search(ctx, Filter)` is called by handler and repository tests.
- CK13 (IF2) — `Handler.Mount(r chi.Router)` registers `/api/v1/search`.
- CK14 (IF3, IF4) — `opencode.New(host)` is registered through `buildParsers` and tested by `cmd/lethe-collector`.
- CK15 (IF5) — `docs/spikes/opencode-format.md` records the SQLite source and `message.rowid` marker used by PH4.

Smoke: `go test ./internal/domain/search -run TestHandler_SuccessfulResponseEnvelope -v` and `go test ./internal/collector/parser/opencode -run TestParse_MapsTurnsAndIdentity -v` both pass.

## Conclusion

Outcome: `/api/v1/search` and the opencode collector parser shipped on `task/lethe-search-and-opencode` through `5cc599d`.

Invariants:
- IV1 — no migration files were added.
- IV2 — `internal/shared/wire/` was not modified.
- IV3 — search implementation is repository/handler read-path code only.
- IV4 — search handler uses the existing authenticated owner-scope rules.
- IV5 — repository tests cover prose-only default and tool-output opt-in.
- IV6 — snippets use marker runes, not HTML.
- IV7 — empty, malformed, and bad-cursor search inputs return `INVALID`.
- IV8 — opencode implements `parser.Parser` unchanged.
- IV9 — collector runner and state schema were unchanged.
- IV10 — `docs/spikes/opencode-format.md` landed before parser implementation.
- IV11 — stats API/page code was not changed.
- IV12 — React `/search` route was not changed.

### Assumptions check
- AS1 — held — search tests exercise FTS rows populated by existing triggers.
- AS2 — held — search joins FTS rowid back to `turns.rowid` in tests and implementation.
- AS3 — held — spike confirmed readable opencode SQLite storage under `~/.local/share/opencode/`.
- AS4 — held after review fix — collector `last_offset` stores next opencode `message.rowid`, and `TurnEvent.Seq` stores current rowid.

### Unknowns outcome
- UK1 — resolved — SQLite `opencode.db` is canonical for v1.
- UK2 — resolved for v1 — invalid FTS syntax maps to `INVALID`; no stricter normalizer was needed.
- UK3 — still-open — BM25 quality needs real archive usage after ingest.

### Review findings
- Critical: opencode offset marker changed from `message.time_created` to inclusive next-`message.rowid` after reviewer found skipped-row risk in partial-accept paths.
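The inclusive next-rowid marker can be illustrated without SQLite. In the sketch below, `Message`, `Event`, and `ParseFrom` are hypothetical stand-ins for an opencode `message` row, `wire.TurnEvent`, and the parser's resume logic. It assumes rows arrive sorted by rowid ascending.

```go
package main

import "fmt"

// Message stands in for one row of opencode's message table (assumed shape).
type Message struct {
	RowID int64
	Text  string
}

// Event stands in for wire.TurnEvent; Seq carries the current rowid.
type Event struct {
	Seq  int64
	Text string
}

// ParseFrom emits every row with rowid >= since and returns the marker to
// persist: the rowid after the last emitted row. The marker is inclusive,
// so resuming from it neither repeats nor skips a row. If nothing new is
// found, the marker is returned unchanged.
func ParseFrom(rows []Message, since int64) ([]Event, int64) {
	next := since
	var out []Event
	for _, m := range rows {
		if m.RowID < since {
			continue
		}
		out = append(out, Event{Seq: m.RowID, Text: m.Text})
		next = m.RowID + 1
	}
	return out, next
}

func main() {
	rows := []Message{{1, "hello"}, {2, "run tests"}, {5, "done"}}
	events, next := ParseFrom(rows, 0)
	fmt.Println(len(events), next)
	events, next = ParseFrom(rows, next)
	fmt.Println(len(events), next)
}
```

Unlike a creation timestamp, rowids are unique per row, so the boundary between "already sent" and "not yet sent" is a single unambiguous integer that fits in `last_offset`.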