Methodology

A verdict is never a bare leaderboard rank. It’s defined against a verifier and backed by reproduced evidence.

  1. A task enters the catalog from a real-usage taxonomy (the Anthropic Economic Index, WildBench, WildChat) or a community submission, normalized to a clear name, a category, and — critically — a verification method.
  2. A benchmark only shortlists. A score (AA Index / LiveBench / Arena) picks which local models are worth running; it never decides the verdict.
  3. The shortlisted models are run against the verifier. “Clears it” means the model passes the task’s check (tests pass, output matches the schema, or a judge rubric is satisfied) and that the pass is reproduced, not a single lucky run.
  4. Practitioner reports corroborate. Manipulation-resistant signals (bridging-based ranking à la Community Notes, GitHub-gated votes, decay) confirm or contest the eval, and surface disagreement rather than hiding it.
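
As a rough illustration of step 3, the sketch below treats "clears it" as passing a verifier on every one of several independent runs. The function names, the JSON-schema-style check, and the all-runs-must-pass rule with a default of five runs are assumptions made for the example, not the project's actual harness or thresholds.

```python
import json
from typing import Callable


def matches_schema(output: str, required_keys: set[str]) -> bool:
    """Hypothetical verifier: output must be a JSON object with the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


def clears(
    run_once: Callable[[], str],
    verifier: Callable[[str], bool],
    runs: int = 5,  # assumed repeat count; the source does not state one
) -> bool:
    """True only if every independent run passes the verifier.

    A single lucky pass is not enough: one failing run means the task
    is not cleared under this (assumed) strict rule.
    """
    return all(verifier(run_once()) for _ in range(runs))
```

Here `run_once` stands in for whatever invokes the shortlisted model on the task; a test-suite or judge-rubric verifier would slot into the same `verifier` parameter.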

A false “safe for local” is the trust-killing error, so 🟢 requires strong reproduced positive evidence and 🔴 requires reproduced failure. Everything else is 🔶 needs more data. See reading a verdict.
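
The asymmetry above can be sketched as a small decision rule. The `min_runs` threshold and the requirement that the evidence be unanimous are assumptions for illustration; the source only says 🟢 needs strong reproduced positive evidence and 🔴 needs reproduced failure.

```python
def verdict(passes: int, failures: int, min_runs: int = 3) -> str:
    """Map reproduced eval outcomes to a verdict symbol.

    min_runs is a hypothetical bar for "reproduced"; anything mixed
    or under-sampled falls through to "needs more data".
    """
    if failures == 0 and passes >= min_runs:
        return "🟢"  # strong, reproduced positive evidence
    if passes == 0 and failures >= min_runs:
        return "🔴"  # reproduced failure
    return "🔶"  # everything else: needs more data
```

Under this rule a single pass, or a mix of passes and failures, never yields 🟢, which is the point of guarding against a false “safe for local”.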

When a verdict is positive, the recommendation is the smallest, cheapest local model that is reliably safe — not the biggest that happens to pass. When no local model clears the task, the fallback escalates by exactly one rung: hosted open weights before frontier. It’s a FrugalGPT-style cascade made concrete.
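
A minimal sketch of that recommendation rule, assuming each local model carries a single comparable cost figure and that the rungs are ordered local, hosted open weights, frontier. The `Model` type, the `cost` field, and the rung names are hypothetical, chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed rung order, cheapest first.
RUNGS = ("local", "hosted-open-weights", "frontier")


@dataclass(frozen=True)
class Model:
    name: str
    cost: float  # hypothetical proxy, e.g. parameter count or price per token


def escalate(rung: str) -> str:
    """Move up exactly one rung, staying at the top once it is reached."""
    i = RUNGS.index(rung)
    return RUNGS[min(i + 1, len(RUNGS) - 1)]


def recommend(
    local_models: list[Model], cleared: set[str]
) -> tuple[str, Optional[Model]]:
    """Pick the cheapest local model that cleared the task, else escalate one rung."""
    passing = [m for m in local_models if m.name in cleared]
    if passing:
        return "local", min(passing, key=lambda m: m.cost)
    return escalate("local"), None
```

For example, if both a mid-sized and a large local model clear the task, `recommend` returns the mid-sized one; if none clears it, the result is the `hosted-open-weights` rung rather than a jump straight to frontier.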

The full pipeline — task catalog, evidence and eval runs, practitioner signals, and the agent API/MCP — is laid out end to end in How it works.

Built by Sam Carlton