~bigbes/lethe

20e5301a02603e4b525a946a0b077aab86f94f99 — Eugene Blikh 24 days ago 35b7a18
docs: plan lethe collector implementation
2 files changed, 114 insertions(+), 2 deletions(-)

M docs/TODO.md
M docs/tasks/lethe-collector-claude-code.md
M docs/TODO.md => docs/TODO.md +1 -1
@@ 7,7 7,7 @@ Index of task specs and their state. Each row points at a `docs/tasks/<slug>.md`
| # | Slug | Status | Description |
|---|---|---|---|
| 1 | [`lethe-server`](tasks/lethe-server.md) | **Verified** | Backend skeleton: SQLite ingest, sessions list/detail, forward-auth, RFC 7807, deployable on phoebe behind Authelia. Shipped over 9 phases. |
| 2 | [`lethe-collector-claude-code`](tasks/lethe-collector-claude-code.md) | Designed (deferred) | Per-host systemd-user collector that tails `~/.claude/projects/*.jsonl` and POSTs normalized turns to ingest. Blocks #8 and #9. |
| 2 | [`lethe-collector-claude-code`](tasks/lethe-collector-claude-code.md) | **Executing** | Per-host systemd-user collector that tails `~/.claude/projects/*.jsonl` and POSTs normalized turns to ingest. Blocks #8 and #9. |
| 3 | [`lethe-search-and-opencode`](tasks/lethe-search-and-opencode.md) | Designed (deferred) | Adds `GET /api/v1/search` (FTS5) and an `opencode` collector. Blocks #7. |
| 4 | [`lethe-web-ui-foundation`](tasks/lethe-web-ui-foundation.md) | **Reviewed** | Vite/React/TS SPA, embed pipeline, shell + Home + Session views, palette skeleton, 5 stub routes. Plus `/sessions` aggregate fields. |
| 5 | [`lethe-web-ui-aggregates`](tasks/lethe-web-ui-aggregates.md) | **Reviewed** | Backend `/projects` + `/stats` endpoints, Projects index + Project detail + Stats screen. Replaces 3 of #4's stubs. |

M docs/tasks/lethe-collector-claude-code.md => docs/tasks/lethe-collector-claude-code.md +113 -1
@@ 1,6 1,9 @@
# lethe-collector-claude-code

**Status:** Design (hands-off)
**Status:** executing
**Branch:** `task/lethe-collector-claude-code`
**Worktree:** `/Users/blikh/data/home/lethe/.worktrees/lethe-collector-claude-code`
**Mode:** hands-off
**Module:** `sourcecraft.dev/bigbes/lethe`
**Depends on:** `lethe-server.md` (#1) — locks the wire format and ingest semantics this task targets.
**Sibling tasks (deferred):** `lethe-search-and-opencode.md` (#3) and per-tool follow-ups (`lethe-collector-crush.md`, etc.) when the time comes.


@@ 174,3 177,112 @@ TDD: yes (reason: parser behavior on golden fixture `.jsonl` files, offset persi
- Permissive parsing: unknown fields → `metadata`, malformed lines → system-role turn with raw payload. Never panic, never stall.
- No background goroutines without a `context.Context` tied to shutdown.
- Test against real fixture files (anonymized snippets from `~/.claude/projects/` checked into `testdata/`), not hand-crafted minimal JSON.

## Plan

Approach: keep the parser as the only Claude-specific layer, then add small collector packages for config, SQLite state/outbox, HTTP sending, and orchestration; the CLI is a thin cobra shell over those packages.

### PH1 — Config And State

- Tier: smart — config/state define the contracts every later phase consumes.
- **1.1** `internal/collector/config/config.go:1-220` (create)
  - `Load(path string) (*Config, error)` — strict Viper YAML loader with `~` expansion, defaults for every field except the required `host`, and validation.
  - Respects: IV7, PC2, GPC1.
- **1.2** `internal/collector/state/store.go:1-260` (create)
  - `Open(ctx context.Context, path string) (*Store, error)` — opens SQLite, creates parent dir, applies embedded migrations.
  - `GetOffset(ctx context.Context, tool, sourceFile string) (int64, error)` / `SaveOffset(ctx context.Context, tool, sourceFile string, offset int64) error`.
  - `Enqueue(ctx context.Context, item OutboxItem) error`, `Oldest(ctx context.Context, limit int) ([]OutboxRow, error)`, `Delete(ctx context.Context, ids []int64) error`, `Stats(ctx context.Context) (Stats, error)`.
  - Respects: IV2, IV4, IV5, GPC5.
- **1.3** `internal/collector/state/migrations.go:1-80` (create)
  - `applyMigrations(ctx context.Context, db *sqlx.DB) error` — idempotent DDL for `ingestion_state` and `outbox`.
  - Respects: IV4, IV5.
- Commit: `collector: add config and state store`

### PH2 — HTTP Send And Outbox Replay

- Tier: smart — partial-accept offset semantics and outbox deletion must match the server contract exactly.
- **2.1** `internal/collector/ingest/sender.go:1-240` (create)
  - `PostBatch(ctx context.Context, events []wire.TurnEvent) (Result, error)` — serializes NDJSON, POSTs `server_url + /api/v1/ingest`, decodes `{accepted,errors}`.
  - `EncodeNDJSON(events []wire.TurnEvent) ([]byte, error)` — shared by sender and outbox tests.
  - Respects: IV2, IV8, GPC4.
- **2.2** `internal/collector/ingest/outbox.go:1-220` (create)
  - `ReplayOutbox(ctx context.Context, store *state.Store, sender *Sender, limit int) error` — oldest-first replay, delete only fully accepted rows.
  - `EnforceOutboxLimit(ctx context.Context, store *state.Store, maxBytes int64) error` — drops the oldest rows once the outbox exceeds `maxBytes`.
  - Respects: IV5, PC3, GPC5.
- Commit: `collector: add ingest sender and outbox replay`

### PH3 — Source Runner

- Tier: deep — this phase owns resumability, shutdown, and per-source isolation.
- **3.1** `internal/collector/ingest/runner.go:1-320` (create)
  - `RunOnce(ctx context.Context, cfg config.Config, src config.Source, p parser.Parser, store *state.Store, sender *Sender) error` — replay outbox, discover files, parse from persisted offset, send batches, persist accepted offsets.
  - `RunDaemon(ctx context.Context, cfg config.Config, parsers map[string]parser.Parser, store *state.Store, sender *Sender) error` — per-source polling loops via `auxilia/async` and context-bound shutdown.
  - Respects: IV1-IV8, PC1-PC6, AS1-AS3.
- **3.2** `internal/collector/ingest/batch.go:1-160` (create)
  - `BuildBatches(events []wire.TurnEvent, maxLines int, maxBytes int) ([]Batch, error)` — records event indexes so accepted counts map back to offsets.
  - Respects: IV2, IV3.
- Commit: `collector: add polling source runner`
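The batching rule in 3.2 (cap by line count and by byte size, while remembering which input events landed in which batch) could look roughly like the sketch below. To stay self-contained it takes already-encoded lines instead of `[]wire.TurnEvent`; the `Batch` fields and the decision to return an error for an oversized single line are assumptions, not decisions recorded in the plan.

```go
package main

import "fmt"

// Batch groups encoded lines and records the input index of each one, so an
// accepted count can later be mapped back to source positions.
type Batch struct {
	Lines   [][]byte
	Indexes []int
	Bytes   int // total size including one newline per line
}

// BuildBatches splits lines into batches that respect both caps.
func BuildBatches(lines [][]byte, maxLines, maxBytes int) ([]Batch, error) {
	if maxLines <= 0 || maxBytes <= 0 {
		return nil, fmt.Errorf("caps must be positive: maxLines=%d maxBytes=%d", maxLines, maxBytes)
	}
	var out []Batch
	var cur Batch
	for i, line := range lines {
		size := len(line) + 1 // +1 for the NDJSON newline
		if size > maxBytes {
			return nil, fmt.Errorf("line %d is %d bytes, over the %d-byte cap", i, size, maxBytes)
		}
		// Close the current batch if adding this line would break a cap.
		if len(cur.Lines) == maxLines || cur.Bytes+size > maxBytes {
			out = append(out, cur)
			cur = Batch{}
		}
		cur.Lines = append(cur.Lines, line)
		cur.Indexes = append(cur.Indexes, i)
		cur.Bytes += size
	}
	if len(cur.Lines) > 0 {
		out = append(out, cur)
	}
	return out, nil
}

func main() {
	lines := [][]byte{[]byte("aa"), []byte("bb"), []byte("cc")}
	batches, err := BuildBatches(lines, 2, 100)
	if err != nil {
		panic(err)
	}
	for _, b := range batches {
		fmt.Println(b.Indexes, b.Bytes)
	}
}
```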

### PH4 — CLI And Deploy

- Tier: smart — command behavior is user-facing but mostly glue.
- **4.1** `cmd/lethe-collector/main.go:1-260` (create)
  - `newRootCmd() *cobra.Command`, `newDaemonCmd() *cobra.Command`, `newBackfillCmd() *cobra.Command`, `newStatusCmd() *cobra.Command`.
  - Default config path is `~/.config/lethe/collector.yaml`; `host` still has no default inside config.
  - Respects: IV6, IV7, IV9, GPC6.
- **4.2** `deploy/lethe-collector.service:1-40` (create)
  - systemd user unit running `lethe-collector daemon` with journald logging and restart policy.
  - Respects: IV9.
- **4.3** `docs/tasks/lethe-collector-claude-code.md` (modify)
  - Record implementation decisions, deferred items, and verify results.
  - Respects: GPC7.
- Commit: `collector: add lethe-collector cli`

### Test strategy

- RED first: `internal/collector/config` tests for strict unknown-key rejection, required `host`, YAML defaults, and `~` expansion.
- RED first: `internal/collector/state` tests for migration idempotency, offset upsert, FIFO ordering of outbox replay rows, byte accounting, and oldest-drop limit.
- RED first: `internal/collector/ingest` tests for NDJSON encoding, partial accepted-count offset persistence, network-failure outbox enqueue, replay deletion, and batch byte/line caps.
- Existing parser tests remain the regression gate for Claude Code format handling.

### Order & dependencies

- PH1 blocks PH2-PH4.
- PH2 blocks PH3.
- PH3 blocks PH4 daemon/backfill behavior; `status` can be implemented after PH1.

### Risks / rollback

- RK1 — The server returns `accepted` counts but not source offsets, so PH3 must retain per-event source offsets in memory and enqueue whole batches on hard failures.
- RK2 — `tailscale serve` header behavior remains empirical; the Verify step records the result and, if needed, defers the token fallback rather than changing the locked server in this task.
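RK1 can be made concrete with a small helper that turns an `accepted` count into the offset that is safe to persist. This is only a sketch: the name `resumeOffset` is hypothetical, and it assumes the server accepts events in order, so that `accepted` always describes a prefix of the batch. The plan does not state that ordering guarantee.

```go
package main

import "fmt"

// resumeOffset returns the source offset to persist after sending a batch.
// start is the offset before the batch, and endOffsets[i] is the offset just
// past event i. ASSUMPTION: accepted counts a prefix of the batch.
func resumeOffset(start int64, endOffsets []int64, accepted int) (int64, error) {
	if accepted < 0 || accepted > len(endOffsets) {
		return start, fmt.Errorf("accepted=%d outside batch of %d events", accepted, len(endOffsets))
	}
	if accepted == 0 {
		return start, nil // nothing accepted: resume from where the batch began
	}
	return endOffsets[accepted-1], nil
}

func main() {
	ends := []int64{120, 305, 410}
	off, err := resumeOffset(0, ends, 2)
	fmt.Println(off, err)
}
```

On an out-of-range count the helper returns the starting offset alongside the error, so a caller that ignores the error re-sends events rather than skipping them.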

### Interfaces

- IF1 — `config.Load(path string) (*Config, error)` — all CLI commands load the same strict collector YAML.
- IF2 — `state.Store` offset/outbox methods — runner and status share one SQLite boundary.
- IF3 — `ingest.Sender.PostBatch(ctx, events)` — runner and outbox replay share one HTTP boundary.
- IF4 — `ingest.RunOnce` / `ingest.RunDaemon` — CLI commands do not know parser, offset, or batching internals.

### Interface graph

- PH1 -> IF1, IF2 @ `internal/collector/config/`, `internal/collector/state/`
- PH2 IF2 -> IF3 @ `internal/collector/ingest/sender.go`, `internal/collector/ingest/outbox.go`
- PH3 IF1, IF2, IF3 -> IF4 @ `internal/collector/ingest/runner.go`, `internal/collector/ingest/batch.go`
- PH4 IF1, IF2, IF4 -> @ `cmd/lethe-collector/`, `deploy/`

Backwards-compat: greenfield collector; PH2 must not mutate `internal/shared/wire`, and all server interaction stays inside the existing `POST /api/v1/ingest` contract.

Scope check: no server changes, no extra parser registry abstraction, and no token-auth fallback unless verify proves Tailscale forwarding cannot work.

## Verify

## Conclusion

### Hands-off decisions

- size: Medium — the design is complete and remaining work spans CLI, config, state, HTTP, daemon, deploy, and tests.
- worktree: `task/lethe-collector-claude-code` at `/Users/blikh/data/home/lethe/.worktrees/lethe-collector-claude-code` — hands-off requires isolated reversible edits.
- worktree setup: added `.worktrees/` to `.gitignore` on `master` before creating the task worktree — `git-worktrees` requires project-local worktree directories to be ignored.
- uplan: plan auto-approved (hands-off).

### Deferred (needs user input)