AI Lab

Desktop only

AI Lab is a workbench for testing prompts and models. Add a provider (a local Ollama instance, an OpenAI-compatible gateway, or a cloud API key), compare model outputs side by side in the Playground, then run a prompt over a dataset and score every output with deterministic checks, sandboxed scripts, or an LLM-as-judge.

It is reachable at /ai-lab (the flask icon in the top bar) and is Electron-only — testing local LLMs and arbitrary OpenAI-compatible endpoints needs direct network access the browser can’t provide. On the web build the route renders a desktop-only state.

Providers

A provider config is one row per endpoint you add. AI Lab splits them into two kinds:

Kind	Providers	Base URL	API key	Pricing
Cloud	OpenAI, Anthropic, OpenRouter	Hardcoded (overridable)	Required	Known — cost estimated
Local	Ollama, OpenAI-compatible	User-supplied	Optional	Unknown — shown as “free”/“unknown”

Ollama defaults to http://localhost:11434 and needs no key.
OpenAI-compatible covers LM Studio, vLLM, llama.cpp, Together, Groq, and any gateway that speaks the OpenAI /v1/chat/completions wire shape — you must supply the base URL.

API keys are stored as a SecretRef handle, never in plaintext in the renderer; the main process resolves the key only at the moment it signs the outbound request. The Providers tab also lets you test the connection and discover models (/api/tags for Ollama, /v1/models for OpenAI-compatible).
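
As a rough illustration of the provider split and model discovery described above, the sketch below models a provider config and resolves its discovery URL. The type names, field names, and helper functions are hypothetical (the real config types are not shown on this page), and it assumes the user-supplied base URL does not already end in /v1.

```typescript
// Hypothetical shapes for illustration; the real config types are not shown here.
type ProviderId = 'openai' | 'anthropic' | 'openrouter' | 'ollama' | 'openai-compatible';
type ProviderKind = 'cloud' | 'local';

interface ProviderConfig {
  provider: ProviderId;
  baseUrl?: string;   // user-supplied for local providers
  apiKeyRef?: string; // opaque SecretRef handle, never the key itself
}

const OLLAMA_DEFAULT_BASE_URL = 'http://localhost:11434';

// Ollama and OpenAI-compatible endpoints are "local"; the rest are "cloud".
function kindOf(provider: ProviderId): ProviderKind {
  return provider === 'ollama' || provider === 'openai-compatible' ? 'local' : 'cloud';
}

function trimTrailingSlash(url: string): string {
  return url.replace(/\/+$/, '');
}

// Model-discovery URL for the two local provider types.
// Assumes the base URL does not already end in /v1.
function discoveryUrl(config: ProviderConfig): string {
  if (config.provider === 'ollama') {
    return trimTrailingSlash(config.baseUrl ?? OLLAMA_DEFAULT_BASE_URL) + '/api/tags';
  }
  if (config.provider === 'openai-compatible') {
    if (!config.baseUrl) throw new Error('an OpenAI-compatible provider needs a base URL');
    return trimTrailingSlash(config.baseUrl) + '/v1/models';
  }
  throw new Error('discovery is only sketched for local providers');
}
```

Note that the config carries only a handle (apiKeyRef), mirroring the rule that the renderer never holds the plaintext key.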

Playground

Pick one or more model references (provider config + model id) and stream a prompt to all of them at once. Outputs render side by side as they arrive, so you can eyeball quality, speed, and tone differences between models or providers in one shot. Streaming uses the same subscribe-before-invoke IPC contract as the AI assistant, so no early tokens are dropped.
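
To show why the subscribe-before-invoke ordering matters, here is a toy model with an unbuffered channel. TokenChannel and invokeStream are hypothetical stand-ins, not the actual IPC bridge, and the stream is synchronous only to keep the sketch small.

```typescript
type Listener = (token: string) => void;

// A toy channel with no buffering: tokens emitted while nobody listens are lost.
class TokenChannel {
  private listeners: Listener[] = [];

  subscribe(listener: Listener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  emit(token: string): void {
    for (const listener of this.listeners) listener(token);
  }
}

// Hypothetical invoke: starts emitting tokens immediately.
function invokeStream(channel: TokenChannel, tokens: string[]): void {
  for (const token of tokens) channel.emit(token);
}

// Correct order: the listener is attached before the stream starts.
function subscribeThenInvoke(tokens: string[]): string {
  const channel = new TokenChannel();
  const received: string[] = [];
  const unsubscribe = channel.subscribe((token) => { received.push(token); });
  invokeStream(channel, tokens);
  unsubscribe();
  return received.join('');
}

// Wrong order: every token emitted before subscribing is dropped.
function invokeThenSubscribe(tokens: string[]): string {
  const channel = new TokenChannel();
  const received: string[] = [];
  invokeStream(channel, tokens);
  const unsubscribe = channel.subscribe((token) => { received.push(token); });
  unsubscribe();
  return received.join('');
}
```

In this toy model, subscribeThenInvoke receives every token while invokeThenSubscribe receives none, which is the failure the contract is meant to rule out.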

Datasets

A dataset is a list of cases. Each case carries:

vars — template variables substituted into the prompt’s {{placeholders}}.
expected (optional) — an exact expected output, for exact-match scoring.
reference (optional) — a gold/reference answer, for reference-based scorers, the judge, and the pairwise scorer.
turns (optional) — a multi-turn conversation ({ role: 'user' | 'assistant'; content }[]). When present, the runner replays these turns as the model input instead of the prompt’s single user message ({{vars}} still resolve inside each turn).
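
The case fields above could be modeled roughly as follows. This is a minimal sketch: the render and buildMessages helpers are hypothetical, placeholder names are assumed to be word characters, unknown placeholders are left untouched, and an empty turns array is treated as absent.

```typescript
interface Turn {
  role: 'user' | 'assistant';
  content: string;
}

interface DatasetCase {
  vars: Record<string, string>;
  expected?: string;
  reference?: string;
  turns?: Turn[];
}

// Replace {{name}} with vars[name]; unknown placeholders stay as written (an assumption).
function render(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, name: string) => {
    const value = vars[name];
    return value !== undefined ? value : match;
  });
}

// Build the model input: replay the case's turns when present,
// otherwise send the rendered prompt as a single user message.
function buildMessages(prompt: string, c: DatasetCase): Turn[] {
  if (c.turns && c.turns.length > 0) {
    return c.turns.map((t) => ({ role: t.role, content: render(t.content, c.vars) }));
  }
  return [{ role: 'user', content: render(prompt, c.vars) }];
}
```

For example, buildMessages('Summarize {{topic}}', { vars: { topic: 'rate limits' } }) yields one user message with the content "Summarize rate limits".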

Cases are edited as a compact JSON array, and you can import / export a dataset as CSV or JSONL (JSONL preserves turns; flat CSV does not). Beyond hand-authoring, four generators seed a dataset:

Generate from an OpenAPI spec

The Datasets tab can seed a dataset from an OpenAPI 3 / Swagger 2 document. AI Lab extracts a compact operation summary (method, path, summary, parameter names — no $ref dereferencing) and asks a model to emit diverse test cases as a structured tool call. Review and edit the generated cases before saving.
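
A compact operation summary of the kind described above might be extracted like this. The function and type names are hypothetical; the sketch reads only operation-level parameters and skips $ref entries rather than dereferencing them.

```typescript
interface OperationSummary {
  method: string;
  path: string;
  summary: string;
  parameters: string[];
}

const HTTP_METHODS = ['get', 'put', 'post', 'delete', 'options', 'head', 'patch', 'trace'];

// Walk spec.paths and keep only method, path, summary, and inline parameter names.
function summarizeOperations(
  spec: { paths?: Record<string, Record<string, unknown>> },
): OperationSummary[] {
  const out: OperationSummary[] = [];
  const paths = spec.paths ?? {};
  for (const path of Object.keys(paths)) {
    const item = paths[path] ?? {};
    for (const method of HTTP_METHODS) {
      const op = item[method] as { summary?: unknown; parameters?: unknown } | undefined;
      if (!op || typeof op !== 'object') continue;
      const params: unknown[] = Array.isArray(op.parameters) ? op.parameters : [];
      out.push({
        method: method.toUpperCase(),
        path,
        summary: typeof op.summary === 'string' ? op.summary : '',
        // $ref entries carry no inline name, so they are dropped, not resolved.
        parameters: params
          .map((p) => (p && typeof p === 'object' ? (p as { name?: unknown }).name : undefined))
          .filter((n): n is string => typeof n === 'string'),
      });
    }
  }
  return out;
}
```

Keeping the summary this small limits how much of the spec is sent to the model; the trade-off is that parameters defined only through $ref are invisible to it.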

Import from request history / collections

Turn real traffic into eval cases. The From history option lists your saved HTTP requests (request history and saved collection requests), and each selected request becomes a case: its method, URL, headers, and body become vars, and the captured response body becomes the reference. Before anything reaches a model, secrets (credential headers, Bearer/JWT/key=value tokens, recognizable provider keys) are redacted by the same redactor the AI assistant uses.
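
The request-to-case mapping can be sketched as below. This is not Restura's redactor: the header list, the [REDACTED] placeholder, and the SavedRequest shape are assumptions, and only credential headers and Bearer tokens are handled here (the real redactor is described as covering more patterns).

```typescript
interface SavedRequest {
  method: string;
  url: string;
  headers: Record<string, string>;
  body: string;
  responseBody: string;
}

// Illustrative list only; the actual set of credential headers is not documented here.
const CREDENTIAL_HEADERS = ['authorization', 'proxy-authorization', 'cookie', 'x-api-key'];

// Mask Bearer tokens that appear inside free text.
function redactText(text: string): string {
  return text.replace(/Bearer\s+[A-Za-z0-9._~+\/-]+=*/g, 'Bearer [REDACTED]');
}

// Request fields become vars; the captured response body becomes the reference.
function requestToCase(req: SavedRequest): { vars: Record<string, string>; reference: string } {
  const headers: Record<string, string> = {};
  for (const name of Object.keys(req.headers)) {
    const value = req.headers[name] ?? '';
    headers[name] = CREDENTIAL_HEADERS.includes(name.toLowerCase())
      ? '[REDACTED]'
      : redactText(value);
  }
  return {
    vars: {
      method: req.method,
      url: redactText(req.url),
      headers: JSON.stringify(headers),
      body: redactText(req.body),
    },
    reference: redactText(req.responseBody),
  };
}
```

Redaction runs on every field before the case is built, so nothing downstream (generators, judges, exports) ever sees the raw values.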

Generate adversarial / red-team cases

Red-team asks a model to generate jailbreak, prompt-injection, or boundary/abuse inputs to probe a prompt’s robustness. Describe the system under test and pick a focus; each generated case’s reference describes the safe expected behavior, so you can pair it with a judge scorer.

Save Playground outputs as a dataset

After a Playground run, Save outputs as dataset captures the current vars plus each finished model output (as the case reference) into a new dataset — a quick way to bootstrap a regression set from a good run.

Evals

An eval is the cross-product of dataset cases × selected models, each cell scored by your configured scorers. The runner is a bounded-concurrency sweep (like the load-test runner): render the prompt → call the model → score → emit progress. Model calls cross IPC to the main process; scorers run in the renderer.

Pick a prompt template, a dataset, and one or more models.
Add one or more scorers (below).
Set a concurrency cap and run. Progress streams in per cell; you can cancel mid-run (in-flight cells finish).
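
The sweep described above can be approximated with a small worker pool. This is a sketch, not the actual runner: buildCells, runBounded, and the isCancelled callback are hypothetical names, and results are collected in completion order.

```typescript
interface Cell {
  caseIndex: number;
  model: string;
}

// Cross-product of dataset cases and selected models.
function buildCells(caseCount: number, models: string[]): Cell[] {
  const cells: Cell[] = [];
  for (let caseIndex = 0; caseIndex < caseCount; caseIndex++) {
    for (const model of models) cells.push({ caseIndex, model });
  }
  return cells;
}

// Run `work` over `items` with at most `limit` calls in flight.
// Cancellation is checked only before a worker takes a new item,
// so cells that are already in flight run to completion.
async function runBounded<T, R>(
  items: T[],
  limit: number,
  work: (item: T) => Promise<R>,
  isCancelled: () => boolean = () => false,
): Promise<R[]> {
  const results: R[] = [];
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length && !isCancelled()) {
      const item = items[next++] as T;
      results.push(await work(item));
    }
  }
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

In the real flow, `work` would render the prompt, call the model over IPC, run the scorers, and emit a progress event for the cell.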

Scorers

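A cell passes only when its model call succeeded and every scorer passed. Each scorer fails closed: an error or an unusable input (for example an invalid regex pattern or an unknown cost) counts as a failure, never a pass.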

Scorer	Passes when
`exact-match`	Output equals the case’s `expected` / `reference` (optionally trimmed and compared case-insensitively).
`contains`	Output contains a substring.
`regex`	Output matches a pattern (invalid pattern → fail).
`json-valid`	Output parses as JSON.
`json-schema`	Output is JSON valid against a supplied JSON Schema (Ajv).
`latency`	Round-trip ≤ `maxMs`.
`cost`	Estimated USD ≤ `maxUSD`. Unknown cost fails — an unpriced gateway can’t satisfy a budget.
`script`	A QuickJS test script passes (the output is exposed as `pm.response.text()`).
`judge`	An LLM-as-judge scores the output ≥ a pass threshold.
`tool-call`	The model called the expected tool, and its JSON arguments validate against a JSON Schema and/or match the case’s `expected` / `reference` args.
`pairwise`	A preference judge prefers the cell output over the case’s `reference` (head-to-head A/B, with optional position-bias swap).
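
A few of the deterministic scorers, and the fail-closed pass rule, could look like the sketch below. The Scorer type and function names are hypothetical, and treating a run with no scorers as "not-evaluated" before looking at the call result is an assumption about ordering.

```typescript
interface ScoreResult {
  pass: boolean;
  detail: string;
}
type Scorer = (output: string) => ScoreResult;
type CellStatus = 'pass' | 'fail' | 'not-evaluated';

const contains = (needle: string): Scorer => (output) => ({
  pass: output.includes(needle),
  detail: 'contains ' + JSON.stringify(needle),
});

// An invalid pattern makes `new RegExp` throw; that is reported as a fail.
const regex = (pattern: string): Scorer => (output) => {
  try {
    return { pass: new RegExp(pattern).test(output), detail: 'regex ' + pattern };
  } catch {
    return { pass: false, detail: 'invalid pattern' };
  }
};

const jsonValid: Scorer = (output) => {
  try {
    JSON.parse(output);
    return { pass: true, detail: 'valid JSON' };
  } catch {
    return { pass: false, detail: 'not JSON' };
  }
};

// A cell passes only if the model call succeeded and every scorer passed.
// A scorer that throws is counted as a fail (fail closed).
function cellStatus(callSucceeded: boolean, output: string, scorers: Scorer[]): CellStatus {
  if (scorers.length === 0) return 'not-evaluated';
  if (!callSucceeded) return 'fail';
  const allPassed = scorers.every((score) => {
    try {
      return score(output).pass;
    } catch {
      return false;
    }
  });
  return allPassed ? 'pass' : 'fail';
}
```

Wrapping each scorer call in its own try/catch means one broken scorer cannot turn into an accidental pass for the cell.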

LLM-as-judge

The judge scorer calls a model of your choosing with the output, the case’s reference, and a rubric, and asks it to return a structured judgement (score 0–1, reasoning, pass) via a tool call. Any provider can serve as the judge, including a local model or one cheaper than the model under test. The judge engine supports three refinements: multi-criteria weighted rubrics (each criterion is scored independently, and a failing gate criterion fails the cell regardless of the weighted score), self-consistency (run the judge N ≤ 5 times and aggregate by median, reporting score variance), and calibration anchors (reference-scored examples that pin the 0–1 scale).
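
The rubric arithmetic can be sketched as follows. Names are hypothetical, and two details are assumptions: a gate criterion counts as failed when its own score is below the pass threshold, and variance is the population variance.

```typescript
interface CriterionScore {
  name: string;
  weight: number;
  score: number; // 0 to 1
  gate?: boolean;
}

// Weighted mean of the per-criterion scores.
function weightedScore(criteria: CriterionScore[]): number {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight <= 0) return 0;
  return criteria.reduce((sum, c) => sum + c.weight * c.score, 0) / totalWeight;
}

// A failing gate criterion fails the rubric regardless of the weighted score.
function rubricPasses(criteria: CriterionScore[], threshold: number): boolean {
  const gateFailed = criteria.some((c) => c.gate === true && c.score < threshold);
  return !gateFailed && weightedScore(criteria) >= threshold;
}

function median(values: number[]): number {
  if (values.length === 0) throw new Error('median of an empty list');
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const upper = sorted[mid] as number;
  if (sorted.length % 2 === 1) return upper;
  return ((sorted[mid - 1] as number) + upper) / 2;
}

function variance(values: number[]): number {
  if (values.length === 0) throw new Error('variance of an empty list');
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
}

// Self-consistency: aggregate up to five judge runs by median and report the spread.
function selfConsistency(runs: number[]): { score: number; variance: number } {
  if (runs.length < 1 || runs.length > 5) throw new Error('expected 1 to 5 judge runs');
  return { score: median(runs), variance: variance(runs) };
}
```

The median is less sensitive than the mean to a single outlier judgement, and a high variance is a hint that the rubric or the judge model is unstable for that case.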

Tool-call & pairwise scorers

tool-call grades function-calling: expose tool definitions to the model and assert it called the right tool with arguments that validate against a schema (or match the case’s expected args). Useful for agent/MCP-style prompts.
pairwise is preference judging — instead of an absolute score, a judge picks a winner between the cell output and the case reference. Enable position swap to run both A/B orderings and cancel position bias (a flip-flop is scored as a tie). For model-vs-model ranking across a whole dataset, use the Arena instead.
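
Position swap can be illustrated with a judge callback that reports which slot it prefers. The types and the synchronous judge are hypothetical simplifications (a real judge is a model call), and scoring "one win plus one tie" as a tie is an assumption.

```typescript
type Verdict = 'A' | 'B' | 'tie';
type Outcome = 'output' | 'reference' | 'tie';

// `judge(a, b)` is a hypothetical callback returning the slot it prefers.
function pairwiseWithSwap(
  output: string,
  reference: string,
  judge: (a: string, b: string) => Verdict,
): Outcome {
  const first = judge(output, reference);  // output shown in slot A
  const second = judge(reference, output); // output shown in slot B
  const outputWonBoth = first === 'A' && second === 'B';
  const referenceWonBoth = first === 'B' && second === 'A';
  if (outputWonBoth) return 'output';
  if (referenceWonBoth) return 'reference';
  // Anything else, including a flip-flop between orderings, is a tie.
  return 'tie';
}
```

A judge that always favors slot A picks the output in the first ordering and the reference in the second, so its position bias cancels out as a tie.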

Execute-and-score (http-exec target)

By default a cell scores the model’s text. Switch the eval’s target to Execute as HTTP request (or GraphQL) and the cell instead: calls the model → parses an HTTP/GraphQL request out of the output (a JSON object, or a fenced ```json block) → executes it through Restura’s real request executor → scores the upstream response. This answers “did the model produce a request that actually works”, not just “did it read correctly”. The executed request goes through the same SSRF guard, redirect policy, and cookie jar as any request you’d send by hand.
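
The parsing step might look like the sketch below. The JSON field names (method, url, headers, body) are assumptions, since this page does not specify the expected request shape, and the sketch fails closed by returning undefined when method or url is missing.

```typescript
interface ParsedRequest {
  method: string;
  url: string;
  headers: Record<string, string>;
  body: string | null;
}

// Prefer the contents of a fenced json block; otherwise try the whole output.
function extractJson(output: string): unknown {
  const fence = /`{3}json\s*([\s\S]*?)`{3}/.exec(output);
  const candidate = fence ? String(fence[1]) : output;
  try {
    return JSON.parse(candidate.trim());
  } catch {
    return undefined;
  }
}

function parseHttpRequest(output: string): ParsedRequest | undefined {
  const value = extractJson(output);
  if (!value || typeof value !== 'object' || Array.isArray(value)) return undefined;
  const obj = value as { method?: unknown; url?: unknown; headers?: unknown; body?: unknown };
  if (typeof obj.method !== 'string' || typeof obj.url !== 'string') return undefined;

  // Keep only string-valued headers.
  const headers: Record<string, string> = {};
  if (obj.headers && typeof obj.headers === 'object' && !Array.isArray(obj.headers)) {
    const raw = obj.headers as Record<string, unknown>;
    for (const name of Object.keys(raw)) {
      const headerValue = raw[name];
      if (typeof headerValue === 'string') headers[name] = headerValue;
    }
  }
  return {
    method: obj.method.toUpperCase(),
    url: obj.url,
    headers,
    body: typeof obj.body === 'string' ? obj.body : null,
  };
}
```

A result of undefined would mean the cell has nothing to execute, so there is no upstream response to score.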

Reports

The Reports tab shows a leaderboard across a run (pass rate, p50/p95 latency, cost per model) and a regression compare against an earlier run of the same eval config. Eval-run ids are stable per config so re-running and comparing works across sessions. Pass rate excludes “not evaluated” cells (a run with no scorers), so a misconfigured eval reads as neither 0% nor 100%.
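
The leaderboard numbers can be reproduced with two small helpers. This is a sketch: the names are hypothetical, and the nearest-rank percentile method is an assumption, since the page does not say which method Reports uses.

```typescript
type CellStatus = 'pass' | 'fail' | 'not-evaluated';

// Pass rate over evaluated cells only; null when nothing was evaluated,
// so a run with no scorers reads as neither 0% nor 100%.
function passRate(cells: CellStatus[]): number | null {
  const evaluated = cells.filter((c) => c !== 'not-evaluated');
  if (evaluated.length === 0) return null;
  return evaluated.filter((c) => c === 'pass').length / evaluated.length;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number | null {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] as number;
}
```

With latencies of 100, 200, 300, and 400 ms, this gives a p50 of 200 and a p95 of 400.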

Each run can be exported as CSV, JSON, or Markdown. A per-case drill-down opens any case to see every model’s full output, per-scorer detail, and judge per-criterion reasoning side by side — a cross-model diff for that case.

Arena

The Arena ranks models head-to-head. Pick a dataset and two or more contestant models plus a judge model; the Arena runs every model pair against every case as a round-robin pairwise comparison (with position-bias swap) and folds the results into an Elo leaderboard and a win-rate matrix. It’s the model-vs-model counterpart to the per-cell pairwise scorer, and the right tool when you want a ranking across many models rather than a pass/fail per cell. Runs are bounded-concurrency and cancellable, and persist to their own history (the arenaRuns store).
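
For intuition, the round-robin pairing and a standard Elo update are sketched below. The K-factor of 32 and the starting rating of 1000 are assumptions (the Arena's actual constants are not stated here), and sequential Elo updates depend on the order in which results are folded in.

```typescript
// Every unordered pair of contestants, each pair once.
function roundRobinPairs(models: string[]): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  for (let i = 0; i < models.length; i++) {
    for (let j = i + 1; j < models.length; j++) {
      pairs.push([models[i] as string, models[j] as string]);
    }
  }
  return pairs;
}

// Expected score of a player rated `ra` against one rated `rb`.
function expectedScore(ra: number, rb: number): number {
  return 1 / (1 + Math.pow(10, (rb - ra) / 400));
}

// scoreA is 1 when A wins, 0 when B wins, and 0.5 for a tie.
function updateElo(
  ratings: Record<string, number>,
  a: string,
  b: string,
  scoreA: number,
  k: number = 32,
  initial: number = 1000,
): void {
  const ra = ratings[a] ?? initial;
  const rb = ratings[b] ?? initial;
  const ea = expectedScore(ra, rb);
  ratings[a] = ra + k * (scoreA - ea);
  ratings[b] = rb + k * ((1 - scoreA) - (1 - ea));
}
```

Two unrated models start level, so a single win moves the winner to 1016 and the loser to 984 under these constants; a tie between equally rated models changes nothing.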

Agent suites

An agent suite evaluates a multi-step agent against repeatable tasks rather than a single completion. It records the trajectory and outcome for each trial, applies task-aware grading, and can repeat trials to show reliability rather than treating one run as conclusive. Suites have run-wide step, time, tool, token, cost, and output budgets; cancelling a suite stops provider, grader, and tool work, and late results cannot overwrite a cancelled run.

Providers and capabilities

Desktop suites use OpenAI Chat, Anthropic Messages, OpenRouter, Ollama, Hugging Face, and generic OpenAI-compatible providers through the encrypted Electron bridge. Model capabilities start conservative; Restura uses trusted catalog data where available and otherwise requires an explicit user assertion before it enables a capability such as tool calling. Gemini, Azure OpenAI, and Bedrock profiles are not shipped as agent transports.

Tools, grounding, and approval

Saved HTTP requests run through Restura’s normal request executor and its SSRF, auth, and cancellation boundaries. Only unscripted GET, HEAD, and OPTIONS requests are read-only; all other requests, and any request with executable scripts, require explicit per-call approval.
MCP tools reuse a saved MCP profile with fresh desktop-owned sessions, existing SSRF/DNS protections, tool allowlists, cancellation, and approval.
Grounding is selected explicitly and recorded with provenance in the trace. Saved reports are bounded and sanitized before persistence or export.
The sandbox contract is present for future providers, but no Docker or hosted sandbox provider ships today.
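
The approval rule for saved HTTP requests reduces to a small predicate. The ToolRequest shape and function name below are hypothetical; the logic follows the rule stated above.

```typescript
interface ToolRequest {
  method: string;
  hasScripts: boolean; // true when the saved request carries executable scripts
}

const READ_ONLY_METHODS = ['GET', 'HEAD', 'OPTIONS'];

// Only unscripted GET, HEAD, and OPTIONS requests are read-only;
// everything else needs explicit per-call approval.
function requiresApproval(req: ToolRequest): boolean {
  const readOnly = READ_ONLY_METHODS.includes(req.method.toUpperCase()) && !req.hasScripts;
  return !readOnly;
}
```

A scripted GET still requires approval, because the script can have side effects even when the HTTP method itself does not.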

CI boundary

restura agent eval runs versioned suites and portable Agent Bundles in CI. It supports stateless OpenAI Responses and Anthropic Messages, environment credentials, deterministic fixture tools, and fail-closed baseline gates. Live tools must be explicitly listed in a runtime manifest; HTTP is limited to manifest-listed GET / HEAD / OPTIONS requests and MCP sources must be declared read-only with a tool allowlist. The CLI refuses desktop secret handles, suite base-URL overrides, judge graders, and sandbox providers. See CLI — Agent eval for the command contract.

Security & SSRF

Every outbound model call goes through the Electron AI Lab handler, which applies the same shared SSRF guard as the rest of Restura, with one provider-kind-aware twist: allowLocalhost is enabled only for local providers (Ollama, OpenAI-compatible).

Loopback only. Local providers may reach 127.0.0.1 / ::1. Nothing else opens up — LAN, RFC 1918/6598 private ranges, link-local 169.254/16, IPv6 unique-local, and cloud-metadata endpoints remain blocked for everyone, including across redirects and DNS rebind.
Cloud providers get no carve-out — they can only reach their public endpoints.
The handler also enforces a hard concurrency ceiling independent of the renderer-side cap.
http-exec (the execute-and-score target) adds no new outbound path: AI-generated requests run through the standard request executor and the same SSRF guard as user-issued requests. The residual risk is that the model picks the destination; see the Execute-and-score section above and ADR 0023.
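
The provider-kind-aware loopback rule can be illustrated with a hostname check. This is a deliberately simplified sketch, not the shared guard: it inspects literal hostnames only (no DNS resolution or redirect handling), the function names are hypothetical, and treating the name localhost as loopback is an assumption based on the default Ollama URL.

```typescript
type ProviderKind = 'cloud' | 'local';

// Strip the brackets that surround IPv6 literals in URLs.
function normalizeHost(hostname: string): string {
  return hostname.toLowerCase().replace(/^\[/, '').replace(/\]$/, '');
}

function isLoopback(host: string): boolean {
  return host === '127.0.0.1' || host === '::1' || host === 'localhost';
}

// Ranges that stay blocked for every provider kind.
function isBlockedRange(host: string): boolean {
  const v4 = /^(\d{1,3})\.(\d{1,3})\.\d{1,3}\.\d{1,3}$/.exec(host);
  if (v4) {
    const a = Number(v4[1]);
    const b = Number(v4[2]);
    return (
      a === 10 ||                          // RFC 1918
      a === 127 ||                         // other loopback addresses
      (a === 172 && b >= 16 && b <= 31) || // RFC 1918
      (a === 192 && b === 168) ||          // RFC 1918
      (a === 100 && b >= 64 && b <= 127) ||// RFC 6598
      (a === 169 && b === 254)             // link-local, incl. cloud metadata
    );
  }
  return /^f[cd][0-9a-f]{2}:/.test(host);  // IPv6 unique-local (fc00::/7)
}

// Loopback is reachable only by local providers; blocked ranges by nobody.
function mayConnect(kind: ProviderKind, hostname: string): boolean {
  const host = normalizeHost(hostname);
  if (isLoopback(host)) return kind === 'local';
  return !isBlockedRange(host);
}
```

A production guard also has to apply the same decision to the resolved IP address and to every redirect hop, which is what defeats DNS rebinding; a literal-hostname check like this one does not.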

Where it runs

Desktop (primary path): fully wired. The renderer drives model calls over the window.electron.aiLab IPC bridge → electron/main/ai-lab-handler.ts → shared/protocol/ai.
Web / self-hosted: not available; the route shows a desktop-only message. There is no Worker route for AI Lab.

Related

AI assistant — the request-aware chat panel (shares the provider core).
Scripts — the same QuickJS sandbox the script scorer uses.
ADR 0020 — AI Lab eval workbench — the architecture decision behind this feature.
ADR 0023 — AI Lab http-exec — scoring AI-generated requests through the real executor.
Capability matrix — what’s desktop-only and why.