ADR 0020 — AI Lab eval workbench
Accepted · 2026-06-03
Context
Restura already ships an AI assistant: a request-aware chat panel that talks to OpenAI, Anthropic, and OpenRouter through a provider-agnostic core (shared/protocol/ai). That core streams a single chat against a known-safe cloud endpoint. It does not cover the other thing developers want from a model: systematically testing a prompt (comparing models, running it over a dataset, and grading the outputs), including against local runtimes (Ollama, LM Studio, vLLM) on localhost.
Two things make this more than “reuse the chat panel”:
- Local runtimes need localhost. The SSRF guard (ADR 0004) blocks loopback on every existing path, and correctly so, since the cloud assistant never hits `127.0.0.1`. A prompt-testing tool targeting Ollama must reach loopback without thereby opening a hole to LAN, private ranges, or cloud metadata.
- Evals are a fan-out, not a stream. Grading needs a non-streaming completion per (case × model) cell, structured-output calls (judge, dataset generation), and scorers that run untrusted user code. The streaming chat orchestrator doesn't provide that shape.
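To make the fan-out shape concrete, here is a minimal sketch of a sweep over (case × model) cells with a cap on in-flight completions and `AbortSignal` cancellation. Every name in it (`Cell`, `Complete`, `buildCells`, `runGrid`) is hypothetical and chosen for illustration; it is not taken from the Restura codebase, and the real runner may differ in error handling and result ordering.

```typescript
// Hypothetical types for illustration only.
type Cell = { caseId: string; model: string };
type CellResult = { cell: Cell; output: string };
// One non-streaming completion per cell; the caller supplies the transport.
type Complete = (cell: Cell, signal: AbortSignal) => Promise<string>;

// Expand cases × models into a flat list of cells.
function buildCells(caseIds: string[], models: string[]): Cell[] {
  return caseIds.flatMap((caseId) => models.map((model) => ({ caseId, model })));
}

// Run the cells with at most `limit` completions in flight.
// Results arrive in completion order, not input order.
// In this sketch a rejected completion rejects the whole sweep.
async function runGrid(
  cells: Cell[],
  complete: Complete,
  limit: number,
  signal: AbortSignal,
): Promise<CellResult[]> {
  const results: CellResult[] = [];
  let next = 0;

  async function worker(): Promise<void> {
    while (!signal.aborted) {
      // Claim the next cell synchronously so no two workers share one.
      const cell = cells[next];
      if (cell === undefined) break;
      next += 1;
      results.push({ cell, output: await complete(cell, signal) });
    }
  }

  const workerCount = Math.max(1, Math.min(limit, cells.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Passing the completion function in as a parameter keeps the sweep independent of where the model call actually happens, so a test can substitute a fake without touching any network path.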
Decision
Add AI Lab as a separate, Electron-only feature (src/features/ai-lab) that reuses the AI provider core but layers its own engine, stores, and a provider-kind-aware security carve-out.
- Provider model: widen the `Provider` union into `CloudProvider` (openai/anthropic/openrouter) and `LocalProvider` (ollama/openai-compatible), with `isLocalProvider()` as the single predicate. Ollama and OpenAI-compatible share one route, the OpenAI wire shape, differing only in that auth is optional and the base URL is user-supplied. The OpenAI decoder is reused unchanged.
- Security: the Electron AI Lab handler sets `allowLocalhost = isLocalProvider(provider)` and passes it into the same shared SSRF guard everything else uses. Local providers reach `127.0.0.1`/`::1` and nothing else; LAN, RFC 1918/6598, link-local, IPv6 unique-local, and metadata stay blocked for everyone, across redirects and DNS rebind. Cloud providers get no carve-out. No second guard to drift.
- Eval engine: `ai-complete.ts` drains the provider stream to one `CompletionResult`. A bounded-concurrency runner sweeps (case × model) cells with `AbortSignal` cancel. Only the model call crosses IPC; scorers run in the renderer, the `json-schema` (Ajv) and `script` scorers included, the latter on the QuickJS sandbox. The `judge` and dataset-generation paths use structured output. Cost is `null` (unknown) for unpriced gateways rather than coerced to `$0`.
- Persistence: new `aiLab`/`evalRuns` Dexie tables with Zod validators; `evalRuns` uses the shared `debouncedStorage` wrapper. API keys are `SecretRef` handles, never plaintext.
- Capabilities: four `aiLab.*` rows added to `capabilities.ts` (the single source of truth), all desktop-only.
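The provider split and the loopback carve-out can be sketched together. The union names and `isLocalProvider()` come from this ADR; everything else (`classifyHost`, `isAllowedTarget`, the string-literal provider shape, treating the `localhost` hostname as loopback) is an assumption made for illustration. This is a literal-address check only. It does not resolve DNS, follow redirects, or handle IPv4-mapped IPv6 addresses, all of which the shared SSRF guard would still need to cover.

```typescript
// Assumed shape: the ADR names the unions, not their representation.
type CloudProvider = "openai" | "anthropic" | "openrouter";
type LocalProvider = "ollama" | "openai-compatible";
type Provider = CloudProvider | LocalProvider;

function isLocalProvider(provider: Provider): provider is LocalProvider {
  return provider === "ollama" || provider === "openai-compatible";
}

type HostClass = "loopback" | "blocked" | "public";

// Hypothetical stand-in for the shared guard's address classification.
function classifyHost(hostname: string): HostClass {
  // WHATWG URL keeps brackets around IPv6 literals; strip them.
  const host = hostname.replace(/^\[|\]$/g, "").toLowerCase();
  if (host === "127.0.0.1" || host === "::1") return "loopback";
  // Assumption: the conventional localhost names count as loopback.
  if (host === "localhost" || host.endsWith(".localhost")) return "loopback";

  const v4 = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (v4) {
    const a = Number(v4[1]);
    const b = Number(v4[2]);
    const blocked =
      a === 0 ||
      a === 127 || // rest of 127.0.0.0/8; only 127.0.0.1 is carved out
      a === 10 || // RFC 1918
      (a === 172 && b >= 16 && b <= 31) || // RFC 1918
      (a === 192 && b === 168) || // RFC 1918
      (a === 100 && b >= 64 && b <= 127) || // RFC 6598
      (a === 169 && b === 254); // link-local, incl. cloud metadata
    return blocked ? "blocked" : "public";
  }

  if (host.includes(":")) {
    if (/^f[cd][0-9a-f]{2}:/.test(host)) return "blocked"; // fc00::/7 unique-local
    if (/^fe[89ab][0-9a-f]:/.test(host)) return "blocked"; // fe80::/10 link-local
  }
  // Other hostnames pass here; a real guard must re-check resolved addresses.
  return "public";
}

function isAllowedTarget(rawUrl: string, provider: Provider): boolean {
  const allowLocalhost = isLocalProvider(provider); // the single predicate
  let hostClass: HostClass;
  try {
    hostClass = classifyHost(new URL(rawUrl).hostname);
  } catch {
    return false; // unparseable URL
  }
  if (hostClass === "public") return true;
  return hostClass === "loopback" && allowLocalhost;
}
```

Under these assumptions, `isAllowedTarget("http://127.0.0.1:11434/v1", "ollama")` passes while the same URL with `"openai"` is rejected, and a LAN address such as `192.168.1.10` is rejected for both.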
Consequences
Positive
- Local-model testing without weakening the guard: one predicate, one guard, loopback-only.
- Adding a local runtime is a base URL, not new code — anything OpenAI-compatible already works.
- Deterministic scorers stay pure and unit-testable; judge/script capabilities are injected, not imported.
- The provider-union split is reusable by a future Worker AI path.
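The point about pure scorers and injected capabilities could look roughly like the sketch below. All names here (`Score`, `Scorer`, `Judge`, `exactMatch`, `containsAll`, `judgedBy`) are hypothetical; the ADR does not specify the scorer interface, so treat this as one plausible shape rather than the actual one.

```typescript
// Hypothetical scorer interface for illustration.
type Score = { pass: boolean; detail?: string };
type Scorer = (output: string) => Score | Promise<Score>;
// A capability supplied by the caller, e.g. a judge-model call.
type Judge = (rubric: string, output: string) => Promise<Score>;

// Deterministic scorer: a pure function of its inputs.
function exactMatch(expected: string): Scorer {
  return (output) => ({ pass: output.trim() === expected.trim() });
}

// Deterministic scorer that reports what was missing.
function containsAll(needles: string[]): Scorer {
  return (output) => {
    const missing = needles.filter((needle) => !output.includes(needle));
    return missing.length === 0
      ? { pass: true }
      : { pass: false, detail: `missing: ${missing.join(", ")}` };
  };
}

// Capability-backed scorer: the judge is passed in, not imported,
// so this module has no dependency on any provider or IPC code.
function judgedBy(judge: Judge, rubric: string): Scorer {
  return (output) => judge(rubric, output);
}
```

Because the judge arrives as an argument, a unit test can pass a fake that returns a fixed score, and the deterministic scorers need no setup at all.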
Negative
- Desktop-only — like the AI assistant’s web gap, there is no Worker route (recorded in the capability matrix, not hidden).
- A new feature surface (5 tabs, two stores, an engine) to maintain alongside the assistant.
- `script` scorers run user code; safety rests entirely on the QuickJS sandbox boundary (ADR 0015), now exercised by a second caller.