ref: 96e95ab9e44d2234ab036319836a5087eb4c2a2f lethe/docs/tasks/lethe-search-and-opencode.md -rw-r--r-- 19.9 KiB
96e95ab9 — Eugene Blikh fix: add tool column to search table; remove conversation bleed from comments 23 days ago

#lethe-search-and-opencode

Status: done
Branch: task/lethe-search-and-opencode
Worktree: /Users/blikh/data/home/lethe/.worktrees/lethe-search-and-opencode
Mode: hands-off
Module: sourcecraft.dev/bigbes/lethe

Depends on:

  • lethe-server.md (#1) — FTS5 tables and triggers were created in #1; this task only adds query code.
  • lethe-collector-claude-code.md (#2) — the collector framework and Parser interface this task extends.

Sibling tasks (deferred): per-tool parsers (lethe-collector-crush.md, lethe-collector-pi.md, lethe-collector-kimi.md); RFC backlog items (cost rollups for tools that report it, tagging, JSON/Markdown export).

#Design

#Purpose

Make the archive searchable by exposing the existing FTS5 indexes through /api/v1/search, and prove the collector parser boundary with a second tool: opencode.

A successful end state for this task: ingested Claude Code and opencode turns can be searched through the authenticated JSON API, with ranked snippets and session anchors that #7 can render in the existing React /search route.

#Scope

In:

  • GET /api/v1/search?q=&tool=&host=&since=&until=&include_tool_outputs=&limit=&cursor= — owner-scoped FTS5 query against turns_fts; opt-in union with tool_outputs_fts.
  • internal/domain/search/ — repository and handler matching the existing domain package shape.
  • Cursor pagination over the ranked result set; cursor is opaque and invalid cursors are 400 INVALID.
  • JSON result rows include enough data for #7 to link to /session/{tool}/{host}/{session_id}#turn-{turn_id}.
  • Collector-side:
    • internal/collector/parser/opencode/ — new parser implementing the same Parser interface from #2.
    • Format-discovery spike captured under docs/spikes/opencode-format.md before the parser is written; spike output is checked in.
    • One parser registration in cmd/lethe-collector; otherwise the collector runner framework is untouched.
  • Golden fixtures for the opencode parser (anonymized snippets in testdata/opencode/).

Out:

  • React search UI — #7 owns filling web/src/routes/search.tsx and any saved-search execute flow.
  • Stats API and stats page — already shipped by #5 and left unchanged here.
  • Server-rendered HTML and vanilla search JS — superseded by the shipped React SPA.
  • Schema migrations — #1 already created turns_fts and tool_outputs_fts.
  • crush, pi, kimi parsers — separate task files when the time comes.
  • Tag system for manual session annotation (RFC backlog).
  • JSON / Markdown export endpoints (RFC backlog).
  • Faceted search UI (e.g. histogram-driven date range pickers). Filters stay as form fields and URL params.
  • Saved-search CRUD, alerts, RSS, anything subscription-shaped.
  • A second machine's deploy (the goal is to prove the parser interface from #2 with a second tool; running on the work PC is a deployment exercise, not a code change).

#Chosen approach

Search API. Add a search domain package, mount it under the existing authenticated /api/v1 group, and register it in the steward graph beside session, project, stats, and savedsearch.

Response shape:

{
  "results": [
    {
      "tool": "claude-code",
      "host": "laptop",
      "session_id": "...",
      "turn_id": "...",
      "timestamp": 1760000000,
      "role": "user",
      "working_dir": "/repo",
      "snippet": "...\u0002term\u0003...",
      "match_source": "turn",
      "rank": -1.23
    }
  ],
  "limit": 50,
  "next_cursor": "opaque-or-empty"
}

snippet uses marker runes instead of HTML so #7 can render highlights with React text nodes; match_source is turn or tool_output.
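A consumer of the API has to split the snippet on the two marker runes before rendering. The sketch below shows one way to do that in Go. The `Segment` type and `SplitSnippet` function are hypothetical names for illustration, not part of the shipped code, and the handling of unbalanced markers is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Segment is a hypothetical type: one run of snippet text plus a highlight flag.
type Segment struct {
	Text      string
	Highlight bool
}

const (
	markStart = "\x02" // char(2) in the SQL snippet() call
	markEnd   = "\x03" // char(3) in the SQL snippet() call
)

// SplitSnippet cuts a marker-delimited snippet into plain and highlighted runs.
// Assumption: a start marker without an end marker highlights to the end of
// the string; a stray end marker is left inside the plain text.
func SplitSnippet(s string) []Segment {
	var out []Segment
	for len(s) > 0 {
		i := strings.Index(s, markStart)
		if i < 0 {
			out = append(out, Segment{Text: s})
			break
		}
		if i > 0 {
			out = append(out, Segment{Text: s[:i]})
		}
		s = s[i+len(markStart):]
		j := strings.Index(s, markEnd)
		if j < 0 {
			if s != "" {
				out = append(out, Segment{Text: s, Highlight: true})
			}
			break
		}
		if j > 0 {
			out = append(out, Segment{Text: s[:j], Highlight: true})
		}
		s = s[j+len(markEnd):]
	}
	return out
}

func main() {
	for _, seg := range SplitSnippet("fix \x02search\x03 handler") {
		fmt.Printf("%q highlight=%v\n", seg.Text, seg.Highlight)
	}
}
```

Because the segments carry no markup, a renderer can wrap the highlighted runs in whatever element it prefers without parsing HTML.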

Search query (default — include_tool_outputs=0). One FTS5 MATCH against turns_fts, joined back to turns/sessions by rowid and composite key so filters and result metadata come from canonical tables:

SELECT
    t.tool, t.host, t.session_id, t.turn_id, t.timestamp, t.role,
    s.working_dir,
    snippet(turns_fts, 0, char(2), char(3), '…', 32) AS snippet,
    bm25(turns_fts) AS rank,
    'turn' AS match_source
FROM turns_fts
JOIN turns AS t ON t.rowid = turns_fts.rowid
JOIN sessions AS s ON s.owner = t.owner AND s.tool = t.tool AND s.host = t.host AND s.session_id = t.session_id
WHERE turns_fts MATCH ?
  AND t.owner = ?
  AND (? IS NULL OR t.tool = ?)
  AND (? IS NULL OR t.host = ?)
  AND (? IS NULL OR t.timestamp >= ?)
  AND (? IS NULL OR t.timestamp <  ?)
ORDER BY rank ASC, t.timestamp DESC, t.turn_id ASC
LIMIT ?;

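The pagination cursor encodes (rank, timestamp, turn_id, match_source) of the last row on the page and is valid only for the normalized query/filter tuple it was generated from. In the default query match_source is always turn, so it only acts as a tie-breaker when tool outputs are included.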
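
One possible encoding of such a cursor is base64url-wrapped JSON that also carries a fingerprint of the filter, so a cursor replayed against a different search fails to decode. This is a sketch under assumptions: the `EncodeCursor` and `DecodeCursor` signatures follow the plan, but the `Filter` fields, the `envelope` type, and the SHA-256 fingerprint are illustrative choices and may not match the shipped implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Filter holds the normalized query parameters (hypothetical field set).
type Filter struct {
	Query              string
	Tool, Host         string
	Since, Until       int64
	IncludeToolOutputs bool
}

// Cursor is the sort key of the last row on the previous page.
type Cursor struct {
	Rank        float64 `json:"r"`
	Timestamp   int64   `json:"ts"`
	TurnID      string  `json:"id"`
	MatchSource string  `json:"src"`
}

var ErrInvalidCursor = errors.New("invalid cursor")

// envelope pairs the cursor with a fingerprint of the filter that produced it.
type envelope struct {
	Cursor Cursor `json:"c"`
	Filter string `json:"f"`
}

// fingerprint hashes the normalized filter tuple (assumed normalization: trim q).
func fingerprint(f Filter) string {
	norm := fmt.Sprintf("%q|%q|%q|%d|%d|%t",
		strings.TrimSpace(f.Query), f.Tool, f.Host, f.Since, f.Until, f.IncludeToolOutputs)
	sum := sha256.Sum256([]byte(norm))
	return hex.EncodeToString(sum[:8])
}

func EncodeCursor(c Cursor, f Filter) (string, error) {
	raw, err := json.Marshal(envelope{Cursor: c, Filter: fingerprint(f)})
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(raw), nil
}

// DecodeCursor rejects malformed input and cursors minted for another filter.
func DecodeCursor(raw string, f Filter) (Cursor, error) {
	b, err := base64.RawURLEncoding.DecodeString(raw)
	if err != nil {
		return Cursor{}, ErrInvalidCursor
	}
	var env envelope
	if err := json.Unmarshal(b, &env); err != nil {
		return Cursor{}, ErrInvalidCursor
	}
	if env.Filter != fingerprint(f) {
		return Cursor{}, ErrInvalidCursor
	}
	return env.Cursor, nil
}

func main() {
	f := Filter{Query: "fts5", Tool: "opencode"}
	tok, _ := EncodeCursor(Cursor{Rank: -1.23, Timestamp: 1760000000, TurnID: "t-9", MatchSource: "turn"}, f)
	c, err := DecodeCursor(tok, f)
	fmt.Println(c, err)
}
```

The fingerprint is a consistency check, not a security boundary; owner scoping still has to be enforced by the query itself.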

Search query (include_tool_outputs=1). Two MATCH queries, one per FTS table, combined with UNION ALL and then deduplicated by a window function over (tool, host, session_id, turn_id), keeping the better-ranked match per turn; match_source reports which source won.
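
The dedupe rule can be illustrated in Go rather than SQL. The sketch below applies it in memory to rows from both sources. The `Row` type and `Dedupe` function are hypothetical, the tie rule (prefer the turn match on equal rank) is an assumption, and the final ordering is simplified compared to the query's full tie-breaker.

```go
package main

import (
	"fmt"
	"sort"
)

// Row is a hypothetical, trimmed-down search row.
type Row struct {
	Tool, Host, SessionID, TurnID string
	Rank                          float64 // bm25(): numerically smaller is a better match
	MatchSource                   string  // "turn" or "tool_output"
}

// Dedupe keeps one row per (tool, host, session_id, turn_id): the better-ranked one.
func Dedupe(rows []Row) []Row {
	type key struct{ tool, host, session, turn string }
	best := make(map[key]Row, len(rows))
	for _, r := range rows {
		k := key{r.Tool, r.Host, r.SessionID, r.TurnID}
		cur, seen := best[k]
		// Assumption: on equal rank the prose match wins.
		better := !seen || r.Rank < cur.Rank ||
			(r.Rank == cur.Rank && r.MatchSource == "turn")
		if better {
			best[k] = r
		}
	}
	out := make([]Row, 0, len(best))
	for _, r := range best {
		out = append(out, r)
	}
	// Simplified ordering for the example: rank, then turn id.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Rank != out[j].Rank {
			return out[i].Rank < out[j].Rank
		}
		return out[i].TurnID < out[j].TurnID
	})
	return out
}

func main() {
	rows := []Row{
		{"opencode", "laptop", "s1", "t1", -1.0, "turn"},
		{"opencode", "laptop", "s1", "t1", -2.5, "tool_output"},
		{"opencode", "laptop", "s1", "t2", -0.5, "turn"},
	}
	for _, r := range Dedupe(rows) {
		fmt.Println(r.TurnID, r.MatchSource, r.Rank)
	}
}
```

Doing the same work inside the SQL statement keeps LIMIT and the cursor predicate operating on deduplicated rows, which an in-memory pass after LIMIT could not guarantee.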

Query validation. Empty q is 400 INVALID; limit clamps to the existing 50/200 pattern; since/until parse as Unix seconds; invalid FTS syntax returns 400 INVALID rather than a 500.
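
The validation rules can be sketched as a single parsing function. Several details here are assumptions and are labeled as such in the code: "50/200" is read as default 50 and maximum 200, a whitespace-only q counts as empty, and a non-numeric limit is rejected. `Params`, `ParseParams`, and `ErrInvalid` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

const (
	defaultLimit = 50  // assumed reading of the "50/200 pattern"
	maxLimit     = 200 // assumed upper clamp
)

// ErrInvalid stands in for whatever the handler maps to 400 INVALID.
var ErrInvalid = errors.New("INVALID")

type Params struct {
	Query        string
	Since, Until *int64 // Unix seconds; nil when the parameter is absent
	Limit        int
}

func parseUnix(v url.Values, name string) (*int64, error) {
	raw := v.Get(name)
	if raw == "" {
		return nil, nil
	}
	n, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return nil, fmt.Errorf("%w: %s must be Unix seconds", ErrInvalid, name)
	}
	return &n, nil
}

// ParseParams validates the raw query string of a search request.
func ParseParams(rawQuery string) (Params, error) {
	v, err := url.ParseQuery(rawQuery)
	if err != nil {
		return Params{}, fmt.Errorf("%w: malformed query string", ErrInvalid)
	}
	p := Params{Query: strings.TrimSpace(v.Get("q")), Limit: defaultLimit}
	if p.Query == "" { // assumption: whitespace-only q is treated as empty
		return Params{}, fmt.Errorf("%w: q is required", ErrInvalid)
	}
	if p.Since, err = parseUnix(v, "since"); err != nil {
		return Params{}, err
	}
	if p.Until, err = parseUnix(v, "until"); err != nil {
		return Params{}, err
	}
	if raw := v.Get("limit"); raw != "" {
		n, err := strconv.Atoi(raw)
		if err != nil { // assumption: non-numeric limit is an error, not a default
			return Params{}, fmt.Errorf("%w: limit must be an integer", ErrInvalid)
		}
		switch {
		case n < 1:
			// keep the default
		case n > maxLimit:
			p.Limit = maxLimit
		default:
			p.Limit = n
		}
	}
	return p, nil
}

func main() {
	p, err := ParseParams("q=fts5&limit=999&since=1760000000")
	fmt.Println(p.Query, p.Limit, *p.Since, err)
}
```

FTS syntax errors cannot be caught at this stage because they only surface when SQLite evaluates MATCH, so the repository still needs its own error mapping.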

opencode parser — discovery first. The local install currently exposes ~/.local/share/opencode/opencode.db, storage/session/**/*.json, and tool-output/*; the spike decides which is canonical before parser code exists.

If session JSON files are canonical, implementation mirrors the Claude Code parser: discover session JSON files, parse from byte offset, and emit complete-turn events. If SQLite is canonical, implementation opens the DB read-only and uses ingestion_state.last_offset as a row marker. If neither source is stable, opencode leaves this task and the task still ships /api/v1/search.

Tradeoffs that settled it.

  • Keep #3 API-only vs include React UI: API-only matches docs/TODO.md, unblocks #7 cleanly, and avoids mixing parser discovery with frontend route work.
  • Marker snippets vs HTML snippets: markers avoid dangerouslySetInnerHTML; #7 can convert them to <mark> with normal React nodes.
  • Single-table FTS query vs always-union: default prose search is faster and less noisy; tool outputs remain an explicit power-user toggle.
  • Cursor vs offset pagination: cursor prevents an API break if the corpus grows; it costs one helper and a cursor validation test.
  • Discovery spike vs guess-and-iterate on opencode: the spike is cheaper than implementing against the wrong store and creates the parser fixture map.

#Backwards-compatibility check

  • Server: additive route only; existing /api/v1/stats, /api/v1/sessions, /api/v1/projects, /api/v1/saved-searches, and ingest behavior stay unchanged.
  • Collector: additive parser registration only; existing claude-code sources keep the same config and parser behavior.
  • Database: no migrations; the task reads the existing FTS tables and canonical turns/sessions tables.
  • Web: the React /search stub remains a stub until #7.

#Hands-off decisions

  • udesign-refresh: scope narrowed to /api/v1/search plus opencode parser — current docs/TODO.md assigns React search UI to #7, and stats already shipped in #5.
  • udesign-refresh: server-rendered HTML/vanilla-JS search removed — the repo now serves a React SPA with an existing /search stub.
  • udesign-refresh: snippets use non-HTML markers — future React UI can render highlights without unsafe HTML insertion.

TDD: yes (reason: FTS query behavior, cursor round-trips, owner scoping, FTS syntax errors, and opencode parser offsets are deterministic contracts whose regressions should fail CI).

#Invariants

  • IV1 — This task adds no schema migration files.
  • IV2 — internal/shared/wire/ types are not modified.
  • IV3 — /api/v1/search is read-only and executes SELECT only.
  • IV4 — Search results are scoped through the same authenticated owner rules as sessions/projects.
  • IV5 — Default search queries turns_fts only; tool_outputs_fts is read only when include_tool_outputs=1.
  • IV6 — API snippets contain marker runes, not HTML.
  • IV7 — Empty or syntactically invalid FTS queries return 400 INVALID, not 500.
  • IV8 — The opencode parser implements parser.Parser unchanged.
  • IV9 — The collector runner and state schema are unchanged by opencode support.
  • IV10 — docs/spikes/opencode-format.md is committed before opencode parser implementation lands.
  • IV11 — Existing /api/v1/stats behavior and React /stats page are not changed by this task.
  • IV12 — web/src/routes/search.tsx remains a stub until #7.

#Principles

  • PC1 — API first, UI later: #3 returns data; #7 decides presentation.
  • PC2 — Search defaults to prose turns; tool-output search is explicit.
  • PC3 — Spike before parser code when the source format is unknown.
  • PC4 — New parser support is one package plus one registration, not a new collector abstraction.

#Assumptions

  • AS1 — turns_fts and tool_outputs_fts are kept current by #1's triggers for every ingested turn.
  • AS2 — Joining FTS rowid back to turns.rowid is stable for the existing regular FTS5 tables.
  • AS3 — opencode local storage has a readable canonical transcript source under ~/.local/share/opencode/.
  • AS4 — The collector state's integer offset can represent the chosen opencode progress marker.

#Unknowns

  • UK1 — Which opencode store is canonical: opencode.db, storage/session/**/*.json, tool-output/*, or a combination.
  • UK2 — Whether SQLite FTS query syntax needs a stricter user-query normalizer than passing the validated q through to MATCH.
  • UK3 — Whether default BM25 quality is good enough on real lethe data.

#Plan

Approach: ship /api/v1/search as an additive read domain first, then run the opencode storage spike before writing the parser; keep #3 API/parser-only so #7 can consume the search contract without frontend churn here.

#PH1 — Search Repository

  • Tier: deep — FTS5, owner scoping, dedupe, and cursor semantics are correctness-sensitive.
  • 1.1 internal/domain/search/repository.go:1-260 (create)
    • type Result struct, type Row struct, type Filter struct, type Cursor struct — API/domain shapes for JSON output, filters, and pagination.
    • func (r *Repository) Search(ctx context.Context, f Filter) (*Result, error) — executes default turns_fts search and optional tool_outputs_fts union with owner/tool/host/time filters.
    • func EncodeCursor(c Cursor, f Filter) (string, error) / func DecodeCursor(raw string, f Filter) (Cursor, error) — opaque cursor tied to normalized query/filter tuple.
    • Respects: IV1, IV2, IV3, IV4, IV5, IV6, IV7, IV11, IV12, PC1, PC2, AS1, AS2, UK2, UK3.
  • 1.2 internal/domain/search/repository_test.go:1-360 (create)
    • RED tests for owner isolation, tool/host/since/until filters, prose-only default, tool-output opt-in, dedupe, cursor next page, invalid cursor, marker snippets, and invalid FTS syntax mapping.
    • Respects: TDD, IV3-IV7, AS1, AS2.
  • Commit: search: add fts repository

#PH2 — Search HTTP Wiring

  • Tier: smart — follows existing handler/steward patterns but defines a new public API contract.
  • 2.1 internal/domain/search/handler.go:1-220 (create)
    • func (h *Handler) Mount(r chi.Router) — registers GET /search under /api/v1.
    • func (h *Handler) List(w http.ResponseWriter, r *http.Request) — resolves auth owner scope, parses query params, clamps limit to 50/200, renders JSON or RFC 7807 errors.
    • func (h *Handler) resolveScope(r *http.Request) (session.OwnerScope, error) — mirrors session/project admin owner rules.
    • Respects: IV3, IV4, IV7, PC1.
  • 2.2 internal/domain/search/handler_test.go:1-260 (create)
    • RED tests for route registration, missing/empty q, bad since, non-admin owner, admin owner=*, bad cursor, and successful response envelope.
    • Respects: TDD, IV4, IV7.
  • 2.3 internal/server/server.go:31-66,103-110 (modify)
    • Inject *search.Handler and mount it inside the authenticated /api/v1 group.
    • Respects: IV4, IV11, IV12.
  • 2.4 cmd/lethe/main.go:26-137 and cmd/lethe/main_e2e_test.go:73-92 (modify)
    • Register search.Repository and search.Handler with steward in production and e2e graph setup.
    • Respects: IV11.
  • Commit: search: expose search endpoint

#PH3 — opencode Format Spike

  • Tier: smart — exploratory but needs a durable writeup before parser code.
  • 3.1 cmd/lethe-spike-opencode/main.go:1-180 (create, then delete before phase commit)
    • Walk ~/.local/share/opencode/, ~/.config/opencode/, and ~/.cache/opencode/; report structural file types, counts, sizes, and redacted samples.
    • Respects: PC3, AS3, UK1.
  • 3.2 docs/spikes/opencode-format.md:1-160 (create)
    • Record canonical source choice, session/message/tool-output shape, progress marker choice, fixture anonymization notes, and parser risks.
    • Respects: IV10, PC3, AS3, AS4, UK1.
  • Commit: collector: document opencode storage format

#PH4 — opencode Parser

  • Tier: deep — parser correctness affects resumability and archive integrity.
  • 4.1 internal/collector/parser/opencode/parser.go:1-320 (create)
    • func New(host string) *Parser, func (p *Parser) Tool() string, func (p *Parser) Discover(root string) ([]parser.SourceFile, error), func (p *Parser) Parse(path string, since int64) ([]wire.TurnEvent, int64, error) — implement the source shape chosen in PH3 without changing parser.Parser.
    • func mapRecord(...) (wire.TurnEvent, bool) or SQLite-equivalent mapper — converts opencode session/message/tool-output records into wire.TurnEvent.
    • Respects: IV2, IV8, IV9, IV10, PC3, PC4, AS3, AS4.
  • 4.2 internal/collector/parser/opencode/parser_test.go:1-260 and internal/collector/parser/opencode/testdata/* (create)
    • RED tests for discovery, turn mapping, tool-output mapping, offset/marker resume, malformed-record fallback/skip behavior, and host/tool/source identity.
    • Respects: TDD, IV8, IV9, IV10.
  • 4.3 cmd/lethe-collector/main.go:17-221 and cmd/lethe-collector/main_test.go:1-90 (modify)
    • Register opencode.New(host) in buildParsers; test that both claude-code and opencode are present.
    • Respects: IV8, IV9, PC4.
  • Commit: collector: add opencode parser

#Test Strategy

  • RED first: internal/domain/search repository tests for FTS result shape, owner scope, filters, cursor, tool-output opt-in, and invalid query handling.
  • RED first: internal/domain/search handler tests for query parsing, auth scoping, route mount, and response envelope.
  • RED first: opencode parser tests after PH3 selects the canonical source; no parser production code before fixtures exist.
  • Existing safety net: go test ./... -count=1; collector CLI smoke with an opencode source in config once PH4 lands.

#Order & Dependencies

  • PH1 blocks PH2.
  • PH3 blocks PH4.
  • PH1/PH2 and PH3/PH4 are otherwise independent; PH4 needs the collector branch already merged on master.

#Risks / Rollback

  • RK1 — FTS5 MATCH syntax can turn user input into hard SQL errors; PH1 maps those to 400 INVALID and keeps normalization isolated.
  • RK2 — opencode may require multi-file joins between session JSON and tool-output/*; PH3 must choose a marker that PH4 can persist in last_offset without state schema changes.
  • RK3 — Cursor pagination over BM25 may duplicate or skip rows if the tie-breaker is incomplete; PH1 orders by rank, timestamp, turn_id, and match_source and tests the boundary.
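
RK3 depends on the sort key being a total order. The sketch below expresses the four-part ordering as a comparator and pages through an in-memory slice with it, which is enough to check that adjacent pages neither overlap nor skip rows when rank and timestamp tie. `Key`, `Less`, and `PageAfter` are hypothetical names; the real repository applies the equivalent predicate in SQL.

```go
package main

import (
	"fmt"
	"sort"
)

// Key is the full sort key carried by the cursor.
type Key struct {
	Rank        float64
	Timestamp   int64
	TurnID      string
	MatchSource string
}

// Less orders by rank ASC, timestamp DESC, turn_id ASC, match_source ASC.
func Less(a, b Key) bool {
	if a.Rank != b.Rank {
		return a.Rank < b.Rank
	}
	if a.Timestamp != b.Timestamp {
		return a.Timestamp > b.Timestamp
	}
	if a.TurnID != b.TurnID {
		return a.TurnID < b.TurnID
	}
	return a.MatchSource < b.MatchSource
}

// PageAfter returns up to limit keys strictly after cursor; nil means first page.
func PageAfter(rows []Key, cursor *Key, limit int) []Key {
	if limit < 0 {
		limit = 0
	}
	sorted := append([]Key(nil), rows...)
	sort.Slice(sorted, func(i, j int) bool { return Less(sorted[i], sorted[j]) })
	start := 0
	if cursor != nil {
		// First index whose key sorts strictly after the cursor.
		start = sort.Search(len(sorted), func(i int) bool { return Less(*cursor, sorted[i]) })
	}
	end := start + limit
	if end > len(sorted) {
		end = len(sorted)
	}
	return sorted[start:end]
}

func main() {
	rows := []Key{
		{-1, 100, "b", "turn"},
		{-1, 100, "a", "tool_output"},
		{-1, 100, "c", "turn"},
		{-2, 50, "z", "turn"},
	}
	page1 := PageAfter(rows, nil, 2)
	page2 := PageAfter(rows, &page1[len(page1)-1], 2)
	fmt.Println(page1)
	fmt.Println(page2)
}
```

If any component were dropped from the comparator, rows that tie on the remaining components could land on both sides of a page boundary, or on neither.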

#Interfaces

  • IF1 — func (r *Repository) Search(ctx context.Context, f Filter) (*Result, error) — search read boundary used only by the HTTP handler.
  • IF2 — func (h *Handler) Mount(r chi.Router) — server mount contract matching other domain packages.
  • IF3 — func New(host string) *Parser — opencode parser constructor registered by the collector CLI.
  • IF4 — func buildParsers(host string) map[string]parser.Parser — collector parser registry remains the only dispatch point.
  • IF5 — docs/spikes/opencode-format.md — canonical opencode source choice consumed by the parser phase.
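
To make IF3 and IF4 concrete, the sketch below shows a registry of the shape `buildParsers` describes. The method set is taken from PH4, but `SourceFile` and `TurnEvent` are simplified stand-ins for parser.SourceFile and wire.TurnEvent, `stubParser` replaces both real parsers, and keying the map by `Tool()` is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// SourceFile and TurnEvent are simplified stand-ins for the real types.
type SourceFile struct{ Path string }

type TurnEvent struct {
	Tool, Host string
	Seq        int64
}

// Parser mirrors the method set listed for the opencode parser in PH4.
type Parser interface {
	Tool() string
	Discover(root string) ([]SourceFile, error)
	Parse(path string, since int64) ([]TurnEvent, int64, error)
}

// stubParser is a placeholder that discovers nothing and emits nothing.
type stubParser struct{ tool, host string }

func (p stubParser) Tool() string { return p.tool }

func (p stubParser) Discover(root string) ([]SourceFile, error) { return nil, nil }

func (p stubParser) Parse(path string, since int64) ([]TurnEvent, int64, error) {
	return nil, since, nil // no progress: hand back the marker unchanged
}

// buildParsers is the single dispatch point; adding a tool is one more entry.
func buildParsers(host string) map[string]Parser {
	out := map[string]Parser{}
	for _, p := range []Parser{
		stubParser{tool: "claude-code", host: host},
		stubParser{tool: "opencode", host: host}, // the one new registration
	} {
		out[p.Tool()] = p // assumption: the registry is keyed by Tool()
	}
	return out
}

func main() {
	names := []string{}
	for name := range buildParsers("laptop") {
		names = append(names, name)
	}
	sort.Strings(names)
	fmt.Println(names)
}
```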

#Interface Graph

  • PH1 -> IF1 @ internal/domain/search/
  • PH2 IF1 -> IF2 @ internal/domain/search/, internal/server/, cmd/lethe/
  • PH3 -> IF5 @ docs/spikes/opencode-format.md
  • PH4 IF5 -> IF3, IF4 @ internal/collector/parser/opencode/, cmd/lethe-collector/

Backwards-compat: additive route and parser registration only; PH1/PH2 do not alter existing routes or schema, and PH4 does not change the parser interface, runner, or collector state schema.

Scope check: no stats work, no React search UI, no schema migration, no saved-search changes, and no parser abstraction beyond buildParsers.

#Verify

Result: passed

Positive:

  • CK1 — /api/v1/search repository and handler tests cover ranked prose search, tool-output opt-in, filters, cursors, and response envelope.
  • CK2 — opencode parser tests cover SQLite discovery, turn mapping, tool summaries, resume marker, malformed skips, and collector registration.
  • CK3 — go build ./cmd/lethe ./cmd/lethe-collector succeeds.
  • CK4 — go test ./... -count=1 passes.

Negative:

  • CK5 — empty/invalid search query and bad cursor return INVALID.
  • CK6 — non-admin ?owner= on search returns FORBIDDEN.
  • CK7 — opencode parser does not ingest external tool-output/ blob contents.

Invariants / assumptions:

  • CK8 (IV1, IV2) — the search package contains no schema DDL and does not reference internal/shared/wire.
  • CK9 (IV3-IV7) — search tests verify read-path behavior, owner scoping, prose default, marker snippets, and invalid-query handling.
  • CK10 (IV8-IV10, AS3, AS4) — opencode parser implements parser.Parser, keeps collector state schema unchanged, and consumes the committed storage spike.
  • CK11 (IV11, IV12) — stats packages and React /search route were not changed.

Interfaces:

  • CK12 (IF1) — Repository.Search(ctx, Filter) is called by handler and repository tests.
  • CK13 (IF2) — Handler.Mount(r chi.Router) registers /api/v1/search.
  • CK14 (IF3, IF4) — opencode.New(host) is registered through buildParsers and tested by cmd/lethe-collector.
  • CK15 (IF5) — docs/spikes/opencode-format.md records the SQLite source and message.rowid marker used by PH4.

Smoke: go test ./internal/domain/search -run TestHandler_SuccessfulResponseEnvelope -v and go test ./internal/collector/parser/opencode -run TestParse_MapsTurnsAndIdentity -v both pass.

#Conclusion

Outcome: /api/v1/search and the opencode collector parser shipped on task/lethe-search-and-opencode through 5cc599d.

Invariants:

  • IV1 — no migration files were added.
  • IV2 — internal/shared/wire/ was not modified.
  • IV3 — search implementation is repository/handler read-path code only.
  • IV4 — search handler uses the existing authenticated owner-scope rules.
  • IV5 — repository tests cover prose-only default and tool-output opt-in.
  • IV6 — snippets use marker runes, not HTML.
  • IV7 — empty, malformed, and bad-cursor search inputs return INVALID.
  • IV8 — opencode implements parser.Parser unchanged.
  • IV9 — collector runner and state schema were unchanged.
  • IV10 — docs/spikes/opencode-format.md landed before parser implementation.
  • IV11 — stats API/page code was not changed.
  • IV12 — React /search route was not changed.

#Assumptions check

  • AS1 — held — search tests exercise FTS rows populated by existing triggers.
  • AS2 — held — search joins FTS rowid back to turns.rowid in tests and implementation.
  • AS3 — held — spike confirmed readable opencode SQLite storage under ~/.local/share/opencode/.
  • AS4 — held after review fix — collector last_offset stores next opencode message.rowid, and TurnEvent.Seq stores current rowid.

#Unknowns outcome

  • UK1 — resolved — SQLite opencode.db is canonical for v1.
  • UK2 — resolved for v1 — invalid FTS syntax maps to INVALID; no stricter normalizer was needed.
  • UK3 — still-open — BM25 quality needs real archive usage after ingest.

#Review findings

  • Critical: opencode offset marker changed from message.time_created to inclusive next-message.rowid after reviewer found skipped-row risk in partial-accept paths.
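
The difference between the two markers can be shown with a small simulation. It assumes rows arrive sorted by rowid and reads "partial accept" as a batch that stops early; both are assumptions about the review finding, and `Message`, `TurnEvent`, and `ParseFrom` are hypothetical stand-ins for the real parser types.

```go
package main

import "fmt"

// Message is a stand-in for a row of the opencode message table.
type Message struct {
	RowID       int64
	TimeCreated int64
	Text        string
}

// TurnEvent is a stand-in for wire.TurnEvent; Seq carries the current rowid.
type TurnEvent struct {
	Seq  int64
	Text string
}

// ParseFrom emits events for rows with RowID >= since, stops after max rows
// (simulating a partially accepted batch), and returns the next inclusive marker.
// Assumption: rows are sorted by RowID ascending.
func ParseFrom(rows []Message, since int64, max int) ([]TurnEvent, int64) {
	next := since
	var out []TurnEvent
	for _, m := range rows {
		if m.RowID < since {
			continue
		}
		if len(out) == max {
			break
		}
		out = append(out, TurnEvent{Seq: m.RowID, Text: m.Text})
		next = m.RowID + 1 // inclusive marker: the first rowid not yet emitted
	}
	return out, next
}

func main() {
	// All three rows share one time_created, so a time-based marker of 1000
	// could not tell the unprocessed third row from the two already sent.
	rows := []Message{
		{RowID: 1, TimeCreated: 1000, Text: "a"},
		{RowID: 2, TimeCreated: 1000, Text: "b"},
		{RowID: 3, TimeCreated: 1000, Text: "c"},
	}
	first, marker := ParseFrom(rows, 0, 2)
	second, marker2 := ParseFrom(rows, marker, 2)
	fmt.Println(len(first), marker, len(second), marker2)
}
```

Because rowid is unique per row, resuming from the returned marker picks up exactly where the previous batch stopped, even when several rows share a timestamp.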