ADR 0020 — AI Lab eval workbench

Accepted · 2026-05-20

Context

Restura already ships an AI assistant — a request-aware chat panel talking to OpenAI, Anthropic, and OpenRouter through a provider-agnostic core (shared/protocol/ai). That core streams a single chat against a known-safe cloud endpoint. It does not cover the other thing developers want from a model: systematically testing a prompt — comparing models, running it over a dataset, and grading the outputs — including against local runtimes (Ollama, LM Studio, vLLM) on localhost.

Two things make this more than “reuse the chat panel”:

Local runtimes need localhost. The SSRF guard (ADR 0004) blocks loopback on every existing path — correctly, since the cloud assistant never hits 127.0.0.1. A prompt-testing tool targeting Ollama must reach loopback, without thereby opening a hole to LAN, private ranges, or cloud-metadata.
Evals are a fan-out, not a stream. Grading needs a non-streaming completion per (case × model) cell, structured-output calls (judge, dataset generation), and scorers that run untrusted user code — a shape the streaming chat orchestrator doesn’t provide.
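The fan-out shape in the second point can be pictured as a grid of independent cells, each needing one complete, non-streaming answer. The following is a minimal sketch under stated assumptions, not the AI Lab engine: `Cell`, `CellResult`, `Complete`, `expandCells`, and `sweep` are hypothetical names, and `complete` stands in for whatever performs a single model call.

```typescript
// Hypothetical shapes for illustration; not the actual AI Lab types.
interface Cell { caseId: string; model: string }
interface CellResult extends Cell { output: string }

// Stand-in for one non-streaming completion call.
type Complete = (cell: Cell, signal: AbortSignal) => Promise<string>;

// Expand cases x models into the flat list of cells to grade.
function expandCells(caseIds: string[], models: string[]): Cell[] {
  const cells: Cell[] = [];
  for (const caseId of caseIds) {
    for (const model of models) cells.push({ caseId, model });
  }
  return cells;
}

// Run every cell with at most `limit` calls in flight.
// Workers stop picking up new cells once the signal is aborted.
// Results are collected in completion order, not grid order.
async function sweep(
  cells: Cell[],
  complete: Complete,
  limit: number,
  signal: AbortSignal,
): Promise<CellResult[]> {
  const results: CellResult[] = [];
  let next = 0;

  async function worker(): Promise<void> {
    while (next < cells.length && !signal.aborted) {
      const cell = cells[next++];
      if (!cell) break;
      const output = await complete(cell, signal);
      results.push({ ...cell, output });
    }
  }

  const workerCount = Math.max(1, Math.min(limit, cells.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

A streaming chat orchestrator has no equivalent of this loop: it holds one conversation open, whereas a sweep needs many short, independent calls that can be cancelled together.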

Decision

Add AI Lab as a separate, Electron-only feature (src/features/ai-lab) that reuses the AI provider core but layers its own engine, stores, and a provider-kind-aware security carve-out.

Provider model — widen the Provider union into CloudProvider (openai/anthropic/openrouter) and LocalProvider (ollama/openai-compatible), with isLocalProvider() as the single predicate. Ollama and OpenAI-compatible share one route — the OpenAI wire shape — differing only in that auth is optional and the base URL is user-supplied. The OpenAI decoder is reused unchanged.
Security — the Electron AI Lab handler sets allowLocalhost = isLocalProvider(provider) and passes it into the same shared SSRF guard everything else uses. Local providers reach 127.0.0.1 / ::1 and nothing else; LAN, RFC 1918/6598, link-local, IPv6 unique-local, and metadata stay blocked for everyone, across redirects and DNS rebind. Cloud providers get no carve-out. No second guard to drift.
Eval engine — ai-complete.ts drains the provider stream to one CompletionResult. A bounded-concurrency runner sweeps (case × model) cells with AbortSignal cancel. Only the model call crosses IPC; scorers run in the renderer — the json-schema (Ajv) and script scorers included, the latter on the QuickJS sandbox. The judge and dataset-generation paths use structured output. Cost is null (unknown) for unpriced gateways rather than coerced to $0.
Persistence — new aiLab / evalRuns (and later arenaRuns) Dexie tables with Zod validators; evalRuns uses the shared debouncedStorage wrapper. API keys are SecretRef handles, never plaintext.
Capabilities — aiLab.* rows added to capabilities.ts (the single source of truth), all desktop-only.
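The provider split and the localhost carve-out can be sketched together. The provider kind names and `isLocalProvider` come from this ADR; everything else (`guardAllowsAddress`, `mayConnect`, the helper names) is a hypothetical stand-in for the shared SSRF guard, checking a single already-resolved IP address. The real guard also has to hold across redirects and DNS rebinding, which this sketch does not attempt.

```typescript
// Provider kinds as named in this ADR.
type CloudProvider = "openai" | "anthropic" | "openrouter";
type LocalProvider = "ollama" | "openai-compatible";
type Provider = CloudProvider | LocalProvider;

const LOCAL_PROVIDERS: readonly string[] = ["ollama", "openai-compatible"];

// The single predicate that decides whether a provider is local.
function isLocalProvider(provider: Provider): provider is LocalProvider {
  return LOCAL_PROVIDERS.includes(provider);
}

// Parse a dotted-decimal IPv4 literal into four octets, or null.
function parseIPv4(address: string): number[] | null {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(address);
  if (!m) return null;
  const octets = m.slice(1).map(Number);
  return octets.every((o) => o <= 255) ? octets : null;
}

// Loopback, RFC 1918, RFC 6598, and link-local (which contains 169.254.169.254).
function isNonPublicIPv4([a, b]: number[]): boolean {
  return (
    a === 0 || a === 10 || a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 100 && b >= 64 && b <= 127) ||
    (a === 169 && b === 254)
  );
}

// Unique-local (fc00::/7) and link-local (fe80::/10), lowercase input assumed.
function isNonPublicIPv6(address: string): boolean {
  return /^f[cd][0-9a-f]{2}:/.test(address) || /^fe[89ab][0-9a-f]:/.test(address);
}

// Hypothetical guard check for one resolved address (assumption, not the real API).
function guardAllowsAddress(address: string, allowLocalhost: boolean): boolean {
  const addr = address.toLowerCase();
  if (addr === "127.0.0.1" || addr === "::1") return allowLocalhost;
  const v4 = parseIPv4(addr);
  if (v4) return !isNonPublicIPv4(v4);
  // Fail closed on IPv4-mapped IPv6 and anything that is not an IP literal.
  if (!addr.includes(":") || addr.includes(".")) return false;
  return !isNonPublicIPv6(addr);
}

// The carve-out: only the provider kind decides whether loopback is allowed.
function mayConnect(provider: Provider, resolvedAddress: string): boolean {
  return guardAllowsAddress(resolvedAddress, isLocalProvider(provider));
}
```

Under these assumptions `mayConnect("ollama", "127.0.0.1")` is allowed while `mayConnect("openai", "127.0.0.1")` and `mayConnect("ollama", "192.168.1.10")` are not, which is the loopback-only behavior the Security item describes.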

Consequences

Positive

Local-model testing without weakening the guard: one predicate, one guard, loopback-only.
Adding a local runtime is a base URL, not new code — anything OpenAI-compatible already works.
Deterministic scorers stay pure and unit-testable; judge/script capabilities are injected, not imported.
The provider-union split is reusable by a future Worker AI path.
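The point about injected capabilities can be illustrated with a small sketch. The names here (`ScoreResult`, `ScorerDeps`, `containsScorer`, `judgeScorer`) are hypothetical and chosen only for illustration; the actual scorer interfaces may differ.

```typescript
// Hypothetical scorer shapes; not AI Lab's actual API.
interface ScoreResult { pass: boolean; detail?: string }

// Capabilities that need a model are passed in as arguments, never imported.
interface ScorerDeps {
  judge: (prompt: string) => Promise<ScoreResult>;
}

// A deterministic scorer: a pure function of the output, with no dependencies.
function containsScorer(expected: string) {
  return (output: string): ScoreResult => ({ pass: output.includes(expected) });
}

// A judge-backed scorer: it receives its capability at call time,
// so a unit test can supply a stub instead of a real model call.
function judgeScorer(rubric: string) {
  return (output: string, deps: ScorerDeps): Promise<ScoreResult> =>
    deps.judge(`${rubric}\n\n${output}`);
}
```

In a test, `deps.judge` can be an async stub that returns a fixed `ScoreResult`, so neither scorer needs a network, a model, or a sandbox to be exercised.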

Negative

Desktop-only — like the AI assistant’s web gap, there is no Worker route (recorded in the capability matrix, not hidden).
A new feature surface (six tabs, three stores, an engine) to maintain alongside the assistant.
script scorers run user code; safety rests entirely on the QuickJS sandbox boundary (ADR 0015), now exercised by a second caller.

Related

Guide: AI Lab — the workbench in context.
ADR 0010 — AI assistant architecture — the provider core AI Lab reuses.
ADR 0004 — Security Hardening — the SSRF guard the localhost carve-out plugs into.
ADR 0007 — SecretRef Pattern — how provider API keys avoid plaintext storage.
ADR 0015 — QuickJS script sandbox — the boundary the script scorer runs inside.
