# lethe-collector-claude-code

**Status:** Design (hands-off)

**Module:** `sourcecraft.dev/bigbes/lethe`

**Depends on:** `lethe-server.md` (#1), which locks the wire format and ingest semantics this task targets.

**Sibling tasks (deferred):** `lethe-search-and-opencode.md` (#3) and per-tool follow-ups (`lethe-collector-crush.md`, etc.) when the time comes.

## Design

### Purpose

Stand up the `lethe-collector` binary and the first parser (Claude Code). End state: a systemd user service on the laptop watches `~/.claude/projects/`, parses new turns, ships them to the running `lethe` server over Tailscale, and survives offline periods via a local outbox. Re-runs are safe and resumable.

A successful end state for this task: the collector has been running on the laptop for an hour against real Claude Code activity, and the server's HTML timeline shows my actual recent sessions, with turns matching what's in the `.jsonl` files.

### Scope

**In:**

- Single Go binary `lethe-collector` (`cmd/lethe-collector/main.go`), cobra-based:
  - `lethe-collector daemon`: long-running, watches all configured sources.
  - `lethe-collector backfill `: one-shot, walks all source files from offset 0, ships everything; resumable via the same offset state.
  - `lethe-collector status`: prints per-source ingestion lag, outbox depth, last error.
- Parser interface in `internal/collector/parser/` (the Parser type from RFC §6.2, populated against the locked `internal/shared/wire/` types).
- Claude Code parser in `internal/collector/parser/claudecode/` with golden-file fixtures.
- Local state DB in `~/.local/state/lethe/state.db` (SQLite, one file): tables `ingestion_state` (per source file offset) and `outbox` (buffered events when server is unreachable).
- Polling-based discovery and ingestion loop (no fsnotify); per-source goroutines orchestrated via `auxilia/async`.
- HTTPS POST to the server's `/api/v1/ingest` over Tailscale; relies on `tailscale serve` injecting `Tailscale-User-Login` for the authenticated daemon.
- Outbox replay with exponential backoff, bounded size (default 100 MiB), oldest-drop on overflow with WARN.
- systemd user unit shipped at `deploy/lethe-collector.service` with `Restart=always`, `WantedBy=default.target`, journald logging.
- Configuration via YAML at `~/.config/lethe/collector.yaml`, loaded with the same Viper strict-mode pattern as the server.
- Logging via `scribe`, errors via `culpa`.

**Out:**

- Other parsers (opencode → #3; crush, pi, kimi → their own task files later).
- Any server-side change. Server is locked from #1; if a wire-format gap is discovered, it gets a separate amendment task.
- macOS launchd unit (Linux only for v1; trivially added later with the same binary and a different unit file).
- TUI / curses status. `status` prints plain text.
- File-watcher backend (fsnotify or inotify directly).
- HTTP/2 push, gRPC, or any non-NDJSON-over-HTTPS transport.
- Multi-user / multi-account configurations (still single-tenant).

### Chosen approach

**CLI: cobra.** Three subcommands. `daemon` is the default deployed mode; `backfill` is the bootstrap and disaster-recovery tool; `status` is the operator's quick-look. Cobra is overkill for one command but right for three and pulls its weight from this task onward.

**Discovery: polling.** Every source has a configurable `poll_interval` (default 30s). On each tick, the source walks its root, lists candidate files (e.g. `**/*.jsonl` under `~/.claude/projects/`), and processes each one independently. Polling beats fsnotify here because:

- Source tools may write via `rename(tmp, final)`, so fsnotify can fire for a path that no longer exists by the time the handler opens it.
- Long-running sessions append continuously, so a single file gets touched many times; polling coalesces those writes naturally.
- Cross-machine, cross-FS, cross-tool: polling has no edge cases. fsnotify has many.
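The polling discovery described above can be sketched roughly as follows. This is a minimal illustration, not the collector's actual code: `discover`, `pending`, and `pollLoop` are hypothetical names, the `.jsonl` suffix match stands in for the real file pattern, and offsets are passed as a plain map instead of being read from `ingestion_state`. Truncated files (size below the stored offset) are not handled here.

```go
package main

import (
	"context"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// SourceFile mirrors the shape used by the Parser interface in this document.
type SourceFile struct {
	Path string
	Size int64
}

// discover walks root and returns every *.jsonl file with its current size.
// Entries that vanish or cannot be read mid-walk are skipped; the next tick retries.
func discover(root string) ([]SourceFile, error) {
	var out []SourceFile
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			if path == root {
				return err // the root itself is unreadable: report it
			}
			return nil // skip one bad entry, keep walking
		}
		if d.IsDir() || !strings.HasSuffix(d.Name(), ".jsonl") {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return nil // file disappeared between listing and stat
		}
		out = append(out, SourceFile{Path: path, Size: info.Size()})
		return nil
	})
	return out, err
}

// pending keeps the files that have bytes beyond their last persisted offset.
func pending(files []SourceFile, offsets map[string]int64) []SourceFile {
	var out []SourceFile
	for _, f := range files {
		if f.Size > offsets[f.Path] {
			out = append(out, f)
		}
	}
	return out
}

// pollLoop runs one discovery pass immediately, then one per interval until ctx is cancelled.
func pollLoop(ctx context.Context, root string, interval time.Duration, handle func([]SourceFile)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if files, err := discover(root); err != nil {
			fmt.Fprintln(os.Stderr, "discover:", err)
		} else {
			handle(files)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

func main() {
	root, err := os.MkdirTemp("", "lethe-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)
	_ = os.WriteFile(filepath.Join(root, "a.jsonl"), []byte("{}\n"), 0o600)

	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	pollLoop(ctx, root, 20*time.Millisecond, func(files []SourceFile) {
		fmt.Println("files with new bytes:", len(pending(files, map[string]int64{})))
	})
}
```

Because every tick re-lists the tree and compares sizes against stored offsets, many appends to one file between ticks collapse into a single unit of work, which is the coalescing property claimed above.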
**Per-source ingestion loop.**

1. Walk the root, list source files.
2. For each file: load `last_offset` from `ingestion_state` keyed by `(tool, source_file)`.
3. Open file read-only, seek to `last_offset`, scan to EOF using `bufio.Scanner` with a sufficiently large buffer (Claude Code lines can be hundreds of KB).
4. For each complete line, hand to the parser, accumulate `wire.TurnEvent`s.
5. Batch up to N events (default 500) or M bytes (default 8 MiB), whichever comes first; serialize to NDJSON; POST to `/api/v1/ingest`.
6. On `200 {accepted: K, errors: [...]}`: persist `last_offset = offset_at_line(K)` and continue from line K+1. If `K` is less than the number of events sent, log the errors at WARN and skip the bad lines (the offset is persisted past them as well, so they don't loop forever).
7. On 5xx or network error: serialize the unsent events into the `outbox` table and break to the next file. Replay is attempted on the next tick.
8. Sleep `poll_interval`, loop.

**Outbox.** An `outbox` table in the state DB: `(id INTEGER PK AUTOINCREMENT, tool TEXT, host TEXT, source_file TEXT, payload BLOB, created_at INTEGER)`. On every tick, before processing fresh files, the loop tries to replay outbox rows oldest-first in chunks. Each successful POST deletes the rows it committed. Bounded by `outbox.max_bytes` config (default 100 MiB); when exceeded, oldest rows are dropped and a WARN is logged. The "happy path" (server reachable) never writes to the outbox at all; it's a strict overflow buffer.

**Parser interface.**

```go
package parser

import "sourcecraft.dev/bigbes/lethe/internal/shared/wire"

// Parser is the only point of tool-specific knowledge in the collector.
type Parser interface {
	// Tool returns the tool name used to key ingestion state.
	Tool() string
	// Discover lists candidate source files under root.
	Discover(root string) ([]SourceFile, error)
	// Parse reads path starting at byte offset since and returns the parsed
	// events plus the offset immediately after the last fully-parsed line.
	Parse(path string, since int64) (events []wire.TurnEvent, newOffset int64, err error)
}

// SourceFile describes one discovered source file.
type SourceFile struct {
	Path string
	Size int64
}
```

`Parse` returns events in source order with monotonically increasing `seq`.
If a line is malformed, the parser returns it as a `system`-role turn with the raw line in `metadata` (so it shows up in the archive but doesn't poison search) and continues. `newOffset` is the byte position immediately after the last fully-parsed line, never mid-line, so a partial trailing write is left for the next poll.

**Claude Code parser specifics.**

- Source root: `~/.claude/projects/`. File pattern: `*/.jsonl` (one file per session).
- `session_id`: the UUID from the filename. The directory name (``) goes into `session_meta.metadata` for project attribution.
- One `.jsonl` line = one event, parsed into a permissive struct that uses `json.RawMessage` for any ambiguous field.
- Event-type mapping:
  - `type: "user"` → `role: "user"`, `content` = the user message text.
  - `type: "assistant"` → `role: "assistant"`, `content` = joined assistant text parts; `model` from the event; `tokens_in/out` from `usage.input_tokens/output_tokens` when present.
  - `type: "tool_use"` and `type: "tool_result"` → `role: "tool"`, `content` = a short rendered summary (e.g. `""`), full payload into `tool_calls` JSON.
  - `type: "summary"` and unknown types → `role: "system"`, content from event, full event into `metadata`.
- `cwd` field → `session_meta.working_dir`. The path of the file → `session_meta.source_file`.
- `cost_usd` left null (Max-billed sessions don't reliably report cost).
- `turn_id`: prefer the event's `uuid` field. When missing, synthesize `sha256(session_id || seq || timestamp || content[:64])` truncated to 16 bytes hex.
- `parentUuid` (resume chaining): stored in turn `metadata` for now. Chaining sessions across files is a #3-or-later UI concern; every `.jsonl` file is a session in this task.
- Slash commands and sub-agent invocations: their event subtypes go into `metadata` opaquely. The UI in #1 already renders `metadata` as JSON-collapsed; surfacing them properly is a later refinement.

**Auth.** The collector POSTs to `https://.tailnet.ts.net/api/v1/ingest`.
`tailscale serve` on phoebe terminates HTTPS and injects `Tailscale-User-Login` from the connecting node's owner. The server validates that header against its allowlist. If `tailscale serve` doesn't inject the header for non-browser clients (the open question from #1), the deploy step fixes it; the collector code itself is unchanged.

**Configuration.** YAML at `~/.config/lethe/collector.yaml`:

```yaml
server_url: "https://phoebe..ts.net"
host: "laptop"                # required; identifies this machine in the archive
state_dir: "~/.local/state/lethe"
http:
  timeout: "30s"
  retry_max: 5
outbox:
  max_bytes: 104857600        # 100 MiB
sources:
  - tool: "claude-code"
    path: "~/.claude/projects"
    poll_interval: "30s"
    batch_max_lines: 500
    batch_max_bytes: 8388608  # 8 MiB
log:
  level: info
  format: human
```

`host` is required and has no default. The host string is the user's choice; the server stores it verbatim.

**Tradeoffs that settled it.**

- *Polling vs fsnotify:* polling is correct in every case; fsnotify isn't. The CPU spent on one `os.ReadDir` pass per `poll_interval` (30s by default) is negligible.
- *Outbox in SQLite vs flat-file queue:* one file (`state.db`) for both offsets and outbox, atomic transactions, no separate format to debug. The cost is the SQLite dependency, which was already required.
- *Parse-then-batch vs streaming POST:* batching keeps the wire protocol simple (NDJSON body, one HTTP call) and lets the server commit chunks atomically. Streaming would force the server to handle interrupted bodies; the RFC's chunked-commit response shape works because the body is bounded.
- *Synthesize missing turn_ids vs require source IDs:* Claude Code always provides UUIDs in current versions, but the parser can't assume that holds for older fixture files or future regressions. Synthesis preserves idempotency; the rare case of a `content[:64]` collision within one session at one timestamp is acceptable.
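The `turn_id` synthesis discussed in the last tradeoff might look like the sketch below. Several details are assumptions rather than decisions from this document: the function name `synthTurnID` is hypothetical, "16 bytes hex" is read as the first 16 bytes of the digest (32 hex characters), the timestamp is taken as an already-formatted string, and a NUL separator is written between fields so that adjacent fields cannot bleed into each other. Note that a literal `content[:64]` panics in Go when the content is shorter than 64 bytes, so the prefix is clamped.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// synthTurnID derives a deterministic ID for an event that carries no uuid.
// Assumption: the ID is the first 16 bytes of the SHA-256 digest, hex-encoded.
func synthTurnID(sessionID string, seq int64, timestamp, content string) string {
	prefix := content
	if len(prefix) > 64 {
		// First 64 bytes only. This may cut a multi-byte rune in half,
		// which is harmless because the bytes are only hashed.
		prefix = prefix[:64]
	}
	h := sha256.New()
	for _, field := range []string{sessionID, strconv.FormatInt(seq, 10), timestamp, prefix} {
		h.Write([]byte(field))
		h.Write([]byte{0}) // assumed separator so ("ab","c") differs from ("a","bc")
	}
	sum := h.Sum(nil)
	return hex.EncodeToString(sum[:16])
}

func main() {
	a := synthTurnID("session-1", 7, "2024-01-01T00:00:00Z", "hello")
	b := synthTurnID("session-1", 7, "2024-01-01T00:00:00Z", "hello")
	fmt.Println(len(a), a == b) // same inputs always give the same ID
}
```

Determinism is the property that matters here: re-parsing the same line after a crash yields the same ID, so the server's idempotent ingest can discard the duplicate.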
**Unknowns that remain.**

- Whether `tailscale serve` injects `Tailscale-User-Login` for daemon HTTP clients (vs only browsers). If not, I add a `lethe-token` shared-secret fallback header in the deploy step, a 5-line server change. This must be confirmed empirically before declaring this task done.
- True line-size distribution of Claude Code `.jsonl` events. If it exceeds `bufio.Scanner`'s default 64 KiB token buffer, the parser uses `Scanner.Buffer(buf, maxSize)` with maxSize = 16 MiB. Captured here so the test fixtures cover the long-line case.
- Whether the laptop's `~/.claude/projects/` ever contains files written concurrently by multiple Claude Code processes. If yes, the parser still works (append-only, monotonic offset), but the test plan should cover it.

### Backwards-compatibility check

Greenfield collector. The only interface contract this task can break is the wire format with the server, which is locked into `internal/shared/wire/` and cannot drift unilaterally.

### Hands-off decisions

- udesign: parser interface placed at `internal/collector/parser/` rather than `internal/parsers/` (RFC §6.2 used `internal/parsers//`); this keeps "collector-internal" code in one subtree and mirrors `internal/server/` from #1.
- udesign: outbox lives in the same state DB as offsets. A single SQLite file is simpler than two stores, and transaction guarantees are useful when an offset bump and an outbox dequeue happen together.
- udesign: bad-line handling skips with WARN rather than halting the file. One corrupt line in `.jsonl` shouldn't pause ingestion of the rest. Risk: silent data loss if many lines are wrong; mitigated by counting WARNs in `status`.
- udesign: `host` is required config with no default. Auto-detecting via `os.Hostname()` produces noise on machines whose hostname is `myname-mbp.local`. Forcing the user to choose `laptop` / `workpc` keeps the archive's host column meaningful.
- udesign: TOML → YAML for collector config. This is consistent with the server's config format from #1: one parser, one mental model.
- udesign: `parentUuid` chaining of resumed Claude Code sessions deferred. Every `.jsonl` is one session in this task. Surfacing chains is a UI concern for later.
- udesign: synthesized `turn_id` uses `sha256(session_id || seq || timestamp || content[:64])[:16]`. `content[:64]` is enough to disambiguate within a single timestamp; a full-content hash would balloon for large turns.

TDD: yes (reason: parser behavior on golden fixture `.jsonl` files, offset persistence/resume semantics, outbox replay, and idempotent re-POST behavior are exactly the deterministic regression-prone surfaces TDD is good for. CLI scaffolding and systemd unit are exempt.)

### Invariants

- The collector opens source files **read-only**. No code path writes, renames, or deletes anything under any source root.
- `ingestion_state.last_offset` is persisted **only after** the server returns `accepted: N` and the offset has been advanced past line N.
- Offsets are byte positions immediately after a fully-parsed `\n`. Never mid-line; partial trailing lines are left for the next poll.
- A re-run after crash, kill -9, power loss, or container restart resumes from the last persisted offset. Server-side idempotency handles any duplicates caused by an offset advance that was not persisted before the interruption.
- The outbox is bounded by `outbox.max_bytes`. Overflow drops oldest entries with a WARN; never blocks the loop.
- One source file's failure (parse error, permission error, deleted file mid-loop) does not stop ingestion of any other source file.
- `wire.TurnEvent.host` on every emitted event equals the `host` from config. The collector does not infer host from `os.Hostname()` or any environment.
- The Parser interface is the only point of tool-specific knowledge in the collector. The ingestion loop knows nothing about Claude Code's JSON shape.
- All HTTP requests to the server target the configured `server_url` with the path `/api/v1/ingest`; no other endpoints are called from `daemon` mode (`status` may call read endpoints).
- `daemon` mode handles SIGTERM by stopping new polls, draining in-flight POSTs (bounded), persisting any in-memory offset advances, and exiting with code 0.

### Principles

- Polling beats watching. The cost is bounded; the correctness is total.
- Each source's loop is independent. Nothing in the architecture forces cross-source coordination.
- The parser is the only file that knows a tool's format. Adding a tool is one new directory under `internal/collector/parser/`, plus a `register.go` import.
- The outbox is a safety net, not the primary path. The happy path skips it entirely. If a bug forces all traffic through the outbox, that's a regression worth alerting on.
- Permissive parsing: unknown fields → `metadata`, malformed lines → system-role turn with raw payload. Never panic, never stall.
- No background goroutines without a `context.Context` tied to shutdown.
- Test against real fixture files (anonymized snippets from `~/.claude/projects/` checked into `testdata/`), not hand-crafted minimal JSON.
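The permissive-parsing principle and the "never mid-line" offset invariant can be illustrated together with a small sketch. Everything named here is hypothetical: `turn` is a stand-in for `wire.TurnEvent` (whose real fields are not shown in this document), `rawEvent` reads only the `type` field, and `parseChunk` works on an in-memory byte slice rather than a file. It assumes the caller adds the returned byte count to the previous offset.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// turn is a hypothetical stand-in for wire.TurnEvent.
type turn struct {
	Role     string
	Metadata map[string]string
}

// rawEvent is a permissive view of one .jsonl line; unknown fields are ignored.
type rawEvent struct {
	Type string `json:"type"`
}

// parseChunk parses every complete line in chunk and returns the turns plus
// the number of bytes consumed. A trailing line without '\n' is not consumed,
// so a partial write is picked up whole on the next poll.
func parseChunk(chunk []byte) (turns []turn, consumed int) {
	for {
		nl := bytes.IndexByte(chunk[consumed:], '\n')
		if nl < 0 {
			return turns, consumed
		}
		line := chunk[consumed : consumed+nl]
		consumed += nl + 1 // offset now sits immediately after the '\n'

		var ev rawEvent
		if err := json.Unmarshal(line, &ev); err != nil {
			// Malformed line: keep it as a system-role turn and move on.
			turns = append(turns, turn{Role: "system", Metadata: map[string]string{
				"raw":         string(line),
				"parse_error": err.Error(),
			}})
			continue
		}
		role := "system" // "summary" and unknown types fall through to system
		switch ev.Type {
		case "user", "assistant":
			role = ev.Type
		case "tool_use", "tool_result":
			role = "tool"
		}
		turns = append(turns, turn{Role: role, Metadata: map[string]string{"raw": string(line)}})
	}
}

func main() {
	data := []byte("{\"type\":\"user\"}\nnot json\n{\"type\":\"assi")
	turns, n := parseChunk(data)
	for _, t := range turns {
		fmt.Println(t.Role)
	}
	fmt.Println("consumed", n, "of", len(data), "bytes")
}
```

In this sketch the malformed second line becomes a system turn instead of stopping the loop, and the truncated third line is left untouched because no newline has been written for it yet.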