MCPLab

Getting Started

Setting Up Evaluations

Set up a robust evaluation workflow before running your first full test suite.

Recommended Project Layout

Use a consistent workspace layout so CLI and App commands resolve configs, libraries, and results predictably.

  • Keep evaluation YAML files in `mcplab/evals`.
  • Keep reusable server and agent definitions in library files such as `servers.yaml` and `agents.yaml` (a sketch follows the layout below).
  • Store run output in `mcplab/results/evaluation-runs`.
recommended layout
mcplab/
  evals/
    eval.yaml
  results/
    evaluation-runs/
  servers.yaml
  agents.yaml
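
If you split shared definitions out of individual eval files, the library files in the layout would hold the same entries the eval config below defines inline. This is a hedged sketch: the top-level `servers:` and `agents:` keys and the overall file schema are assumptions, while the field names mirror the `mcp_servers` and `agents` entries shown in `eval.yaml` later in this guide.

servers.yaml and agents.yaml sketch
# servers.yaml -- reusable MCP server definitions (schema assumed)
servers:
  - id: demo-server
    transport: http
    url: http://localhost:3000/mcp

# agents.yaml -- reusable agent definitions (schema assumed)
agents:
  - id: claude-haiku
    provider: anthropic
    model: claude-haiku-4-5-20251001
    temperature: 0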

Author a Minimal, Valid Config

Start with one agent and one scenario. Validate this baseline before adding more scenarios or models.

mcplab/evals/eval.yaml
agents:
  - id: claude-haiku
    provider: anthropic
    model: claude-haiku-4-5-20251001
    temperature: 0

scenarios:
  - id: setup-check
    agent: claude-haiku
    servers: [demo-server]
    mcp_servers:
      - id: demo-server
        transport: http
        url: http://localhost:3000/mcp
    prompt: Use available tools to complete this setup verification task.

Configure Auth and Environment

Set provider keys and server auth variables before running evaluations. Keep secret values in environment variables, not in committed YAML.

  • Use `auth.type: bearer` + `env` for bearer-token server auth (see the sketch after the .env example).
  • Use `auth.type: oauth_client_credentials` for client-credentials flows.
  • Use `auth.type: oauth_authorization_code` when interactive/browser OAuth is required.
.env example
ANTHROPIC_API_KEY=...
OPENAI_API_KEY=...
MY_SERVER_TOKEN=...
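
As a hedged sketch of the bearer option from the list above, a server entry could reference the token variable rather than embedding the secret in YAML. Only the `auth.type: bearer` and `env` names come from this guide; the exact nesting shown here is an assumption.

bearer auth sketch
mcp_servers:
  - id: demo-server
    transport: http
    url: http://localhost:3000/mcp
    auth:
      type: bearer          # assumed placement of auth.type
      env: MY_SERVER_TOKEN  # assumed: token is read from this env var at run time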

Preflight Checklist

  • MCP endpoint URL is reachable and returns MCP responses (see the curl sketch after the validation run).
  • Scenario IDs are unique and agent references resolve.
  • Server labels in `scenarios[].servers` match your intended MCP server entries.
  • All required environment variables are set in your shell or session.
  • You can run one scenario successfully at least once before scaling up.
first validation run
npx @inspectr/mcplab run -c mcplab/evals/eval.yaml -s setup-check -n 1
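
To check the first preflight item without the runner, you can hit the endpoint directly with curl. This is a sketch against the standard MCP Streamable HTTP transport, not an MCPLab command; the protocol version string is an assumption, and a reachable server should answer the `initialize` request with a JSON-RPC result.

endpoint reachability check
curl -s -X POST http://localhost:3000/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"preflight-check","version":"0.0.0"}}}'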

Next Setup Steps

  • Add more scenarios after the baseline setup-check passes.
  • Use `--agents` to compare models on the same scenarios (a two-agent sketch follows this list).
  • Open `report.html` or the App results view to inspect failures and tool traces.
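
As a hedged example of a comparison setup, the agent list from the minimal config could grow to a second provider once the baseline passes. The `gpt-4o` entry below is an illustrative placeholder (the `OPENAI_API_KEY` from the .env example would cover it), and the exact argument format for `--agents` should be confirmed against the CLI help.

agent comparison sketch
agents:
  - id: claude-haiku
    provider: anthropic
    model: claude-haiku-4-5-20251001
    temperature: 0
  - id: gpt-4o              # illustrative second agent; provider/model values assumed
    provider: openai
    model: gpt-4o
    temperature: 0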