From 20e5301a02603e4b525a946a0b077aab86f94f99 Mon Sep 17 00:00:00 2001
From: Eugene Blikh
Date: Sun, 3 May 2026 14:41:23 +0300
Subject: [PATCH] docs: plan lethe collector implementation

---
 docs/TODO.md                              |   2 +-
 docs/tasks/lethe-collector-claude-code.md | 114 +++++++++++++++++++++-
 2 files changed, 114 insertions(+), 2 deletions(-)

diff --git a/docs/TODO.md b/docs/TODO.md
index b068f03d2e2d8f99e2b21fd06ed1329f6265a66e..4897793c7e2415dcee118eac5c487a38180d7c82 100644
--- a/docs/TODO.md
+++ b/docs/TODO.md
@@ -7,7 +7,7 @@ Index of task specs and their state. Each row points at a `docs/tasks/.md`
 | # | Slug | Status | Description |
 |---|---|---|---|
 | 1 | [`lethe-server`](tasks/lethe-server.md) | **Verified** | Backend skeleton: SQLite ingest, sessions list/detail, forward-auth, RFC 7807, deployable on phoebe behind Authelia. Shipped over 9 phases. |
-| 2 | [`lethe-collector-claude-code`](tasks/lethe-collector-claude-code.md) | Designed (deferred) | Per-host systemd-user collector that tails `~/.claude/projects/*.jsonl` and POSTs normalized turns to ingest. Blocks #8 and #9. |
+| 2 | [`lethe-collector-claude-code`](tasks/lethe-collector-claude-code.md) | **Executing** | Per-host systemd-user collector that tails `~/.claude/projects/*.jsonl` and POSTs normalized turns to ingest. Blocks #8 and #9. |
 | 3 | [`lethe-search-and-opencode`](tasks/lethe-search-and-opencode.md) | Designed (deferred) | Adds `GET /api/v1/search` (FTS5) and an `opencode` collector. Blocks #7. |
 | 4 | [`lethe-web-ui-foundation`](tasks/lethe-web-ui-foundation.md) | **Reviewed** | Vite/React/TS SPA, embed pipeline, shell + Home + Session views, palette skeleton, 5 stub routes. Plus `/sessions` aggregate fields. |
 | 5 | [`lethe-web-ui-aggregates`](tasks/lethe-web-ui-aggregates.md) | **Reviewed** | Backend `/projects` + `/stats` endpoints, Projects index + Project detail + Stats screen. Replaces 3 of #4's stubs. |
diff --git a/docs/tasks/lethe-collector-claude-code.md b/docs/tasks/lethe-collector-claude-code.md
index 7dd234536dbc143f5ef575d786cb7746226d213b..621fe1f5bf9a6d687cc575ee19548988d9dcde7a 100644
--- a/docs/tasks/lethe-collector-claude-code.md
+++ b/docs/tasks/lethe-collector-claude-code.md
@@ -1,6 +1,9 @@
 # lethe-collector-claude-code
 
-**Status:** Design (hands-off)
+**Status:** executing
+**Branch:** `task/lethe-collector-claude-code`
+**Worktree:** `/Users/blikh/data/home/lethe/.worktrees/lethe-collector-claude-code`
+**Mode:** hands-off
 **Module:** `sourcecraft.dev/bigbes/lethe`
 **Depends on:** `lethe-server.md` (#1) — locks the wire format and ingest semantics this task targets.
 **Sibling tasks (deferred):** `lethe-search-and-opencode.md` (#3) and per-tool follow-ups (`lethe-collector-crush.md`, etc.) when the time comes.
@@ -174,3 +177,112 @@ TDD: yes (reason: parser behavior on golden fixture `.jsonl` files, offset persi
 - Permissive parsing: unknown fields → `metadata`, malformed lines → system-role turn with raw payload. Never panic, never stall.
 - No background goroutines without a `context.Context` tied to shutdown.
 - Test against real fixture files (anonymized snippets from `~/.claude/projects/` checked into `testdata/`), not hand-crafted minimal JSON.
+
+## Plan
+
+Approach: keep the parser as the only Claude-specific layer, then add small collector packages for config, SQLite state/outbox, HTTP sending, and orchestration; the CLI is a thin cobra shell over those packages.
+
+### PH1 — Config And State
+
+- Tier: smart — config/state define the contracts every later phase consumes.
+- **1.1** `internal/collector/config/config.go:1-220` (create)
+  - `Load(path string) (*Config, error)` — strict Viper YAML loader with `~` expansion, defaults except required `host`, and validation.
+  - Respects: IV7, PC2, GPC1.
+- **1.2** `internal/collector/state/store.go:1-260` (create)
+  - `Open(ctx context.Context, path string) (*Store, error)` — opens SQLite, creates parent dir, applies embedded migrations.
+  - `GetOffset(ctx context.Context, tool, sourceFile string) (int64, error)` / `SaveOffset(ctx context.Context, tool, sourceFile string, offset int64) error`.
+  - `Enqueue(ctx context.Context, item OutboxItem) error`, `Oldest(ctx context.Context, limit int) ([]OutboxRow, error)`, `Delete(ctx context.Context, ids []int64) error`, `Stats(ctx context.Context) (Stats, error)`.
+  - Respects: IV2, IV4, IV5, GPC5.
+- **1.3** `internal/collector/state/migrations.go:1-80` (create)
+  - `applyMigrations(ctx context.Context, db *sqlx.DB) error` — idempotent DDL for `ingestion_state` and `outbox`.
+  - Respects: IV4, IV5.
+- Commit: `collector: add config and state store`
+
+### PH2 — HTTP Send And Outbox Replay
+
+- Tier: smart — partial-accept offset semantics and outbox deletion must match the server contract exactly.
+- **2.1** `internal/collector/ingest/sender.go:1-240` (create)
+  - `PostBatch(ctx context.Context, events []wire.TurnEvent) (Result, error)` — serializes NDJSON, POSTs `server_url + /api/v1/ingest`, decodes `{accepted,errors}`.
+  - `EncodeNDJSON(events []wire.TurnEvent) ([]byte, error)` — shared by sender and outbox tests.
+  - Respects: IV2, IV8, GPC4.
+- **2.2** `internal/collector/ingest/outbox.go:1-220` (create)
+  - `ReplayOutbox(ctx context.Context, store *state.Store, sender *Sender, limit int) error` — oldest-first replay, delete only fully accepted rows.
+  - `EnforceOutboxLimit(ctx context.Context, store *state.Store, maxBytes int64) error` — oldest-drop overflow.
+  - Respects: IV5, PC3, GPC5.
+- Commit: `collector: add ingest sender and outbox replay`
+
+### PH3 — Source Runner
+
+- Tier: deep — this phase owns resumability, shutdown, and per-source isolation.
+- **3.1** `internal/collector/ingest/runner.go:1-320` (create)
+  - `RunOnce(ctx context.Context, cfg config.Config, src config.Source, p parser.Parser, store *state.Store, sender *Sender) error` — replay outbox, discover files, parse from persisted offset, send batches, persist accepted offsets.
+  - `RunDaemon(ctx context.Context, cfg config.Config, parsers map[string]parser.Parser, store *state.Store, sender *Sender) error` — per-source polling loops via `auxilia/async` and context-bound shutdown.
+  - Respects: IV1-IV8, PC1-PC6, AS1-AS3.
+- **3.2** `internal/collector/ingest/batch.go:1-160` (create)
+  - `BuildBatches(events []wire.TurnEvent, maxLines int, maxBytes int) ([]Batch, error)` — records event indexes so accepted counts map back to offsets.
+  - Respects: IV2, IV3.
+- Commit: `collector: add polling source runner`
+
+### PH4 — CLI And Deploy
+
+- Tier: smart — command behavior is user-facing but mostly glue.
+- **4.1** `cmd/lethe-collector/main.go:1-260` (create)
+  - `newRootCmd() *cobra.Command`, `newDaemonCmd() *cobra.Command`, `newBackfillCmd() *cobra.Command`, `newStatusCmd() *cobra.Command`.
+  - Default config path is `~/.config/lethe/collector.yaml`; `host` still has no default inside config.
+  - Respects: IV6, IV7, IV9, GPC6.
+- **4.2** `deploy/lethe-collector.service:1-40` (create)
+  - systemd user unit running `lethe-collector daemon` with journald logging and restart policy.
+  - Respects: IV9.
+- **4.3** `docs/tasks/lethe-collector-claude-code.md` (modify)
+  - Record implementation decisions, deferred items, and verify results.
+  - Respects: GPC7.
+- Commit: `collector: add lethe-collector cli`
+
+### Test strategy
+
+- RED first: `internal/collector/config` tests for strict unknown-key rejection, required `host`, YAML defaults, and `~` expansion.
+- RED first: `internal/collector/state` tests for migration idempotency, offset upsert, outbox FIFO replay rows, byte accounting, and oldest-drop limit.
+- RED first: `internal/collector/ingest` tests for NDJSON encoding, partial accepted-count offset persistence, network-failure outbox enqueue, replay deletion, and batch byte/line caps.
+- Existing parser tests remain the regression gate for Claude Code format handling.
+
+### Order & dependencies
+
+- PH1 blocks PH2-PH4.
+- PH2 blocks PH3.
+- PH3 blocks PH4 daemon/backfill behavior; `status` can be implemented after PH1.
+
+### Risks / rollback
+
+- RK1 — The server returns `accepted` counts but not source offsets, so PH3 must retain per-event source offsets in-memory and enqueue whole batches on hard failures.
+- RK2 — `tailscale serve` header behavior remains empirical; verify records the result and defers token fallback if needed rather than changing the locked server in this task.
+
+### Interfaces
+
+- IF1 — `config.Load(path string) (*Config, error)` — all CLI commands load the same strict collector YAML.
+- IF2 — `state.Store` offset/outbox methods — runner and status share one SQLite boundary.
+- IF3 — `ingest.Sender.PostBatch(ctx, events)` — runner and outbox replay share one HTTP boundary.
+- IF4 — `ingest.RunOnce` / `ingest.RunDaemon` — CLI commands do not know parser, offset, or batching internals.
+
+### Interface graph
+
+- PH1 -> IF1, IF2 @ `internal/collector/config/`, `internal/collector/state/`
+- PH2 IF2 -> IF3 @ `internal/collector/ingest/sender.go`, `internal/collector/ingest/outbox.go`
+- PH3 IF1, IF2, IF3 -> IF4 @ `internal/collector/ingest/runner.go`, `internal/collector/ingest/batch.go`
+- PH4 IF1, IF2, IF4 -> @ `cmd/lethe-collector/`, `deploy/`
+
+Backwards-compat: greenfield collector; PH2 must not mutate `internal/shared/wire`, and all server interaction stays inside the existing `POST /api/v1/ingest` contract.
+
+Scope check: no server changes, no extra parser registry abstraction, and no token-auth fallback unless verify proves Tailscale forwarding cannot work.
+
+## Verify
+
+## Conclusion
+
+### Hands-off decisions
+
+- size: Medium — the design is complete and remaining work spans CLI, config, state, HTTP, daemon, deploy, and tests.
+- worktree: `task/lethe-collector-claude-code` at `/Users/blikh/data/home/lethe/.worktrees/lethe-collector-claude-code` — hands-off requires isolated reversible edits.
+- worktree setup: added `.worktrees/` to `.gitignore` on `master` before creating the task worktree — `git-worktrees` requires project-local worktree directories to be ignored.
+- uplan: plan auto-approved (hands-off).
+
+### Deferred (needs user input)