Status: Design (hands-off)
Module: sourcecraft.dev/bigbes/lethe
Depends on: lethe-server.md (#1) — locks the wire format and ingest semantics this task targets.
Sibling tasks (deferred): lethe-search-and-opencode.md (#3) and per-tool follow-ups (lethe-collector-crush.md, etc.) when the time comes.
Stand up the lethe-collector binary and the first parser (Claude Code). End state: a systemd user service on the laptop watches ~/.claude/projects/, parses new turns, ships them to the running lethe server over Tailscale, and survives offline periods via a local outbox. Re-runs are safe and resumable.
A successful end state for this task: the collector has been running on the laptop for an hour against real Claude Code activity, and the server's HTML timeline shows my actual recent sessions, with turns matching what's in the .jsonl files.
In:
lethe-collector (cmd/lethe-collector/main.go), cobra-based:
lethe-collector daemon — long-running, watches all configured sources.lethe-collector backfill <tool> — one-shot, walks all source files from offset 0, ships everything; resumable via the same offset state.lethe-collector status — prints per-source ingestion lag, outbox depth, last error.internal/collector/parser/ (the Parser type from RFC §6.2, populated against the locked internal/shared/wire/ types).internal/collector/parser/claudecode/ with golden-file fixtures.~/.local/state/lethe/state.db (SQLite, one file): tables ingestion_state (per source file offset) and outbox (buffered events when server is unreachable).auxilia/async./api/v1/ingest over Tailscale; relies on tailscale serve injecting Tailscale-User-Login for the authenticated daemon.deploy/lethe-collector.service with Restart=always, WantedBy=default.target, journald logging.~/.config/lethe/collector.yaml, loaded with the same Viper strict-mode pattern as the server.scribe, errors via culpa.Out:
status prints plain text.CLI: cobra. Three subcommands. daemon is the default deployed mode; backfill is the bootstrap and disaster-recovery tool; status is the operator's quick-look. Cobra is overkill for one command but right for three and pulls its weight from this task onward.
Discovery: polling. Every source has a configurable poll_interval (default 30s). On each tick, the source walks its root, lists candidate files (e.g. **/*.jsonl under ~/.claude/projects/), and processes each one independently. Polling beats fsnotify here because:
rename(tmp, final) — fsnotify fires on a path that immediately doesn't exist at handle-open time.Per-source ingestion loop.
last_offset from ingestion_state keyed by (tool, source_file).last_offset, scan to EOF using bufio.Scanner with a sufficiently-large buffer (Claude Code lines can be hundreds of KB).wire.TurnEvents./api/v1/ingest.200 {accepted: K, errors: [...]}: persist last_offset = offset_at_line(K) and continue from line K+1. If K < N, log the errors at WARN and skip the bad lines (their offset is also persisted past them so they don't loop forever).outbox table and break to next file. Replay attempted on next tick.poll_interval, loop.Outbox. A outbox table in the state DB: (id INTEGER PK AUTOINCREMENT, tool TEXT, host TEXT, source_file TEXT, payload BLOB, created_at INTEGER). On every tick, before processing fresh files, the loop tries to replay outbox rows oldest-first in chunks. Each successful POST deletes the rows it committed. Bounded by outbox.max_bytes config (default 100 MiB); when exceeded, oldest rows are dropped and a WARN is logged. The "happy path" (server reachable) never writes to the outbox at all — it's a strict overflow buffer.
Parser interface.
package parser
import "sourcecraft.dev/bigbes/lethe/internal/shared/wire"
type Parser interface {
Tool() string
Discover(root string) ([]SourceFile, error)
Parse(path string, since int64) (events []wire.TurnEvent, newOffset int64, err error)
}
type SourceFile struct {
Path string
Size int64
}
Parse returns events in source order with monotonically-increasing seq. If a line is malformed, the parser returns it as a system-role turn with the raw line in metadata (so it shows up in the archive but doesn't poison search) and continues. newOffset is the byte position immediately after the last fully-parsed line — never mid-line, so a partial trailing write is left for the next poll.
Claude Code parser specifics.
~/.claude/projects/. Real corpus includes both */<session-uuid>.jsonl and nested */<session-uuid>/subagents/*.jsonl; ingest every .jsonl file as its own session.session_id: the UUID from the filename. The directory name (<project-hash>) goes into session_meta.metadata for project attribution..jsonl line = one event, parsed into a permissive struct that uses json.RawMessage for any ambiguous field.message.role plus nested message.content[].type, not just the top-level record type: in current Claude logs, tool use lives inside assistant records and tool results live inside user records.
message.role: "user" with string content → role: "user", content = the user message text.message.role: "assistant" with text parts → role: "assistant", content = joined assistant text parts; model from the event; tokens_in/out from usage.input_tokens/output_tokens when present.message.content[].type: "tool_use" and message.content[].type: "tool_result" → role: "tool", content = a short rendered summary (e.g. "<tool_use: Read file=...>"), full payload into tool_calls JSON.permission-mode, attachment, ai-title, last-prompt, etc.) are skipped unless they fail to parse, in which case they degrade to a system turn with the raw line in metadata.cwd field → session_meta.working_dir. The path of the file → session_meta.source_file.cost_usd left null (Max-billed sessions don't reliably report cost).turn_id: prefer the event's uuid field. When missing, synthesize sha256(session_id || seq || timestamp || content[:64]) truncated to 16 bytes hex.parentUuid (resume chaining): stored in turn metadata for now. Chaining sessions across files is a #3-or-later UI concern — every .jsonl file is a session in this task.metadata opaquely. The UI in #1 already renders metadata as JSON-collapsed; surfacing them properly is a later refinement.Auth. The collector POSTs to https://<phoebe>.tailnet.ts.net/api/v1/ingest. tailscale serve on phoebe terminates HTTPS and injects Tailscale-User-Login from the connecting node's owner. The server validates that header against its allowlist. If tailscale serve doesn't inject the header for non-browser clients (the open question from #1), the deploy step fixes it — the collector code itself is unchanged.
Configuration. YAML at ~/.config/lethe/collector.yaml:
server_url: "https://phoebe.<tailnet>.ts.net"
host: "laptop" # required; identifies this machine in the archive
state_dir: "~/.local/state/lethe"
http:
timeout: "30s"
retry_max: 5
outbox:
max_bytes: 104857600 # 100 MiB
sources:
- tool: "claude-code"
path: "~/.claude/projects"
poll_interval: "30s"
batch_max_lines: 500
batch_max_bytes: 8388608 # 8 MiB
log:
level: info
format: human
host is required and has no default. The host string is the user's choice; the server stores it verbatim.
Tradeoffs that settled it.
os.ReadDir per minute is irrelevant.state.db) for both offsets and outbox, atomic transactions, no separate format to debug. Cost is one extra dependency that was already required.content[:64] collision within one session at one timestamp is acceptable.Unknowns that remain.
tailscale serve injects Tailscale-User-Login for daemon HTTP clients (vs only browsers). If not, I add a lethe-token shared-secret fallback header in the deploy step — a 5-line server change. Confirmed empirically before declaring this task done..jsonl events. If it exceeds bufio.Scanner's default 64 KiB token buffer, the parser uses Scanner.Buffer(buf, maxSize) with maxSize = 16 MiB. Captured here so the test fixtures cover the long-line case.~/.claude/projects/ ever contains files concurrent-written from multiple Claude Code processes. If yes, the parser still works (append-only, monotonic offset), but the test plan should cover it.Greenfield collector. The only interface contract this task can break is the wire format with the server, which is locked into internal/shared/wire/ and cannot drift unilaterally.
internal/collector/parser/ rather than internal/parsers/ (RFC §6.2 used internal/parsers/<tool>/) — keeps "collector-internal" code in one subtree, mirrors internal/server/ from #1..jsonl shouldn't pause ingestion of the rest. Risk: silent data loss if many lines are wrong; mitigated by counting WARNs in status.host is required config with no default — auto-detecting via os.Hostname() produces noise on machines whose hostname is myname-mbp.local. Forcing the user to choose laptop / workpc keeps the archive's host column meaningful.parentUuid chaining of resumed Claude Code sessions deferred — every .jsonl is one session in this task. Surfacing chains is a UI concern for later.turn_id uses sha256(session_id || seq || timestamp || content[:64])[:16] — content[:64] is enough to disambiguate within a single timestamp; full-content hash would balloon for large turns.TDD: yes (reason: parser behavior on golden fixture .jsonl files, offset persistence/resume semantics, outbox replay, and idempotent re-POST behavior are exactly the deterministic regression-prone surfaces TDD is good for. CLI scaffolding and systemd unit are exempt.)
ingestion_state.last_offset is persisted only after the server returns accepted: N and the offset has been advanced past line N.\n. Never mid-line; partial trailing lines are left for the next poll.outbox.max_bytes. Overflow drops oldest entries with a WARN; never blocks the loop.wire.TurnEvent.host on every emitted event equals the host from config. The collector does not infer host from os.Hostname() or any environment.server_url and the path /api/v1/ingest; no other endpoints are called from daemon mode (status may call read endpoints).daemon mode handles SIGTERM by stopping new polls, draining in-flight POSTs (bounded), persisting any in-memory offset advances, and exiting with code 0.internal/collector/parser/, plus a register.go import.metadata, malformed lines → system-role turn with raw payload. Never panic, never stall.context.Context tied to shutdown.~/.claude/projects/ checked into testdata/), not hand-crafted minimal JSON.