ADR 0020 — AI Lab eval workbench

Accepted · 2026-06-03

Restura already ships an AI assistant — a request-aware chat panel talking to OpenAI, Anthropic, and OpenRouter through a provider-agnostic core (shared/protocol/ai). That core streams a single chat against a known-safe cloud endpoint. It does not cover the other thing developers want from a model: systematically testing a prompt — comparing models, running it over a dataset, and grading the outputs — including against local runtimes (Ollama, LM Studio, vLLM) on localhost.

Two things make this more than “reuse the chat panel”:

  1. Local runtimes need localhost. The SSRF guard (ADR 0004) blocks loopback on every existing path — correctly, since the cloud assistant never hits 127.0.0.1. A prompt-testing tool targeting Ollama must reach loopback without thereby opening a hole to the LAN, private ranges, or cloud-metadata endpoints.
  2. Evals are a fan-out, not a stream. Grading needs a non-streaming completion per (case × model) cell, structured-output calls (judge, dataset generation), and scorers that run untrusted user code — a shape the streaming chat orchestrator doesn’t provide.
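The fan-out shape in item 2 can be sketched as a small worker pool over a flattened grid. This is a minimal illustration, not the engine's actual code: `EvalCase`, `CellResult`, `Complete`, and `runGrid` are hypothetical names, and the sketch simply lets a failed call reject the whole run.

```typescript
// Hypothetical shapes for illustration; the real engine's types may differ.
interface EvalCase { id: string; input: string }
interface CellResult { caseId: string; model: string; output: string }

// A non-streaming completion: one request in, one finished text out.
type Complete = (model: string, input: string, signal: AbortSignal) => Promise<string>;

async function runGrid(
  cases: EvalCase[],
  models: string[],
  complete: Complete,
  concurrency: number,
  signal: AbortSignal,
): Promise<CellResult[]> {
  // Flatten the grid into one queue of (case, model) cells.
  const queue: { c: EvalCase; model: string }[] = [];
  for (const c of cases) {
    for (const model of models) queue.push({ c, model });
  }

  // Results are collected in completion order, not grid order.
  const results: CellResult[] = [];

  // Each worker pulls the next cell until the queue is empty or the run is cancelled.
  async function worker(): Promise<void> {
    while (!signal.aborted) {
      const job = queue.shift();
      if (job === undefined) return;
      const output = await complete(job.model, job.c.input, signal);
      results.push({ caseId: job.c.id, model: job.model, output });
    }
  }

  // The number of workers is the concurrency bound.
  const workerCount = Math.max(1, Math.min(concurrency, queue.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Because every worker checks `signal.aborted` before claiming a cell, cancelling stops new cells from starting; cells already in flight receive the same signal and can stop early if the completion call honours it.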

Add AI Lab as a separate, Electron-only feature (src/features/ai-lab) that reuses the AI provider core but layers its own engine, stores, and a provider-kind-aware security carve-out.

  • Provider model — widen the Provider union into CloudProvider (openai/anthropic/openrouter) and LocalProvider (ollama/openai-compatible), with isLocalProvider() as the single predicate. Ollama and OpenAI-compatible share one route — the OpenAI wire shape — differing only in that auth is optional and the base URL is user-supplied. The OpenAI decoder is reused unchanged.
  • Security — the Electron AI Lab handler sets allowLocalhost = isLocalProvider(provider) and passes it into the same shared SSRF guard everything else uses. Local providers reach 127.0.0.1 / ::1 and nothing else; LAN, RFC 1918/6598, link-local, IPv6 unique-local, and metadata stay blocked for everyone, across redirects and DNS rebind. Cloud providers get no carve-out. No second guard to drift.
  • Eval engine — ai-complete.ts drains the provider stream to one CompletionResult. A bounded-concurrency runner sweeps (case × model) cells with AbortSignal cancellation. Only the model call crosses IPC; scorers run in the renderer, including the json-schema (Ajv) and script scorers, the latter in the QuickJS sandbox. The judge and dataset-generation paths use structured output. Cost is null (unknown) for unpriced gateways rather than coerced to $0.
  • Persistence — new aiLab / evalRuns Dexie tables with Zod validators; evalRuns uses the shared debouncedStorage wrapper. API keys are SecretRef handles, never plaintext.
  • Capabilities — four aiLab.* rows added to capabilities.ts (the single source of truth), all desktop-only.
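The provider split and the security carve-out above can be sketched together. The union members and `isLocalProvider()` come from this ADR; `isLoopbackLiteral` and `loopbackCarveOutApplies` are hypothetical helpers that stand in for one narrow question the shared guard answers. The sketch only inspects literal addresses, so it should not be read as the guard itself, which also has to hold across redirects and DNS rebinding.

```typescript
type CloudProvider = "openai" | "anthropic" | "openrouter";
type LocalProvider = "ollama" | "openai-compatible";
type Provider = CloudProvider | LocalProvider;

// The single predicate that decides whether a provider is local.
function isLocalProvider(provider: Provider): provider is LocalProvider {
  return provider === "ollama" || provider === "openai-compatible";
}

// Hypothetical helper: true only for the two loopback literals the ADR names.
// WHATWG URL serializes IPv6 hostnames with brackets, hence "[::1]".
function isLoopbackLiteral(hostname: string): boolean {
  return hostname === "127.0.0.1" || hostname === "[::1]";
}

// Hypothetical helper: does the loopback carve-out apply to this request?
// It returns false for every non-loopback target; those are left to the
// rest of the guard, which this sketch does not model.
function loopbackCarveOutApplies(provider: Provider, url: string): boolean {
  const allowLocalhost = isLocalProvider(provider);
  return allowLocalhost && isLoopbackLiteral(new URL(url).hostname);
}
```

Under these assumptions, `loopbackCarveOutApplies("ollama", "http://127.0.0.1:11434/v1/chat/completions")` is true, while the same URL with `"openai"` is false, and an Ollama base URL on a LAN address such as `http://192.168.1.10:11434` gets no carve-out.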

Positive

  • Local-model testing without weakening the guard: one predicate, one guard, loopback-only.
  • Adding a local runtime is a base URL, not new code — anything OpenAI-compatible already works.
  • Deterministic scorers stay pure and unit-testable; judge/script capabilities are injected, not imported.
  • The provider-union split is reusable by a future Worker AI path.
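The "injected, not imported" point about scorers can be illustrated with a small sketch. All names here (`ScoreResult`, `exactMatch`, `Judge`, `makeJudgeScorer`) are hypothetical and only show the pattern: a deterministic scorer is a plain function of its inputs, while a judge-backed scorer receives its model-calling capability as an argument so a test can substitute a stub.

```typescript
interface ScoreResult { pass: boolean; score: number; reason?: string }

// Deterministic scorer: no I/O, same inputs always give the same result.
function exactMatch(output: string, expected: string): ScoreResult {
  const pass = output.trim() === expected.trim();
  return { pass, score: pass ? 1 : 0 };
}

// The judge capability is a function type; the scorer never imports a model client.
type Judge = (prompt: string) => Promise<{ score: number; reason: string }>;

// Build a judge-backed scorer from an injected capability and a pass threshold.
function makeJudgeScorer(judge: Judge, threshold: number) {
  return async (output: string, rubric: string): Promise<ScoreResult> => {
    const verdict = await judge(`Rubric: ${rubric}\nOutput: ${output}`);
    return {
      pass: verdict.score >= threshold,
      score: verdict.score,
      reason: verdict.reason,
    };
  };
}
```

In a unit test the judge can be a stub that returns a fixed verdict, so the scorer's thresholding logic is exercised without any model call.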

Negative

  • Desktop-only — like the AI assistant’s web gap, there is no Worker route (recorded in the capability matrix, not hidden).
  • A new feature surface (five tabs, two stores, an engine) to maintain alongside the assistant.
  • script scorers run user code; safety rests entirely on the QuickJS sandbox boundary (ADR 0015), now exercised by a second caller.