Methodology
A verdict is never a bare leaderboard rank. It’s defined against a verifier and backed by reproduced evidence.
From a task to a verdict
- A task enters the catalog from a real-usage taxonomy (the Anthropic Economic Index, WildBench, WildChat) or a community submission, normalized to a clear name, a category, and, critically, a verification method.
- A benchmark only shortlists. A score (AA Index / LiveBench / Arena) picks which local models are worth running; it never decides the verdict.
- The shortlisted models are run against the verifier. “Clears it” means it passes the task’s check — tests pass, output matches the schema, a judge rubric is satisfied — reproduced, not a single lucky run.
- Practitioner reports corroborate. Manipulation-resistant signals (bridging-based ranking à la Community Notes, GitHub-gated votes, decay) confirm or contest the eval, and surface disagreement rather than hiding it.
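The shortlist and verification steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names, the `top_k` cutoff, the default of three runs, and the rule that every run must pass are all assumptions chosen to show that a benchmark score only decides which models get run, while the verifier decides whether a model clears the task.

```python
from typing import Callable, Dict, List


def shortlist(scores: Dict[str, float], top_k: int) -> List[str]:
    """Pick the top_k models by benchmark score (hypothetical cutoff).

    The score only decides which models are worth running; it plays
    no part in the verdict itself.
    """
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def clears(run_check: Callable[[str], bool], model: str, runs: int = 3) -> bool:
    """Return True only if the task's check passes on every repeated run.

    Requiring all runs to pass is an assumed reading of "reproduced,
    not a single lucky run"; the real threshold may differ.
    """
    return all(run_check(model) for _ in range(runs))
```

Here `run_check` stands in for whatever the task's verification method is: a test suite, a schema validation, or a judge rubric.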
The bar is asymmetric
A false “safe for local” is the trust-killing error, so 🟢 requires strong reproduced positive evidence and 🔴 requires reproduced failure. Everything else is 🔶 needs more data. See reading a verdict.
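One way to picture the asymmetric bar is as a function from run counts to a verdict. The sketch below is illustrative only: the thresholds (`min_pass_runs`, `min_fail_runs`) and the choice to send mixed results to 🔶 are assumptions, not the project's published rule.

```python
from enum import Enum


class Verdict(Enum):
    SAFE = "🟢"
    NEEDS_DATA = "🔶"
    NOT_SAFE = "🔴"


def verdict(passes: int, failures: int,
            min_pass_runs: int = 3, min_fail_runs: int = 2) -> Verdict:
    """Map reproduced run results to a verdict (hypothetical thresholds)."""
    # Green needs strong, reproduced positive evidence and no failures.
    if passes >= min_pass_runs and failures == 0:
        return Verdict.SAFE
    # Red needs a failure that was reproduced, with no passing runs.
    if failures >= min_fail_runs and passes == 0:
        return Verdict.NOT_SAFE
    # Everything else, including mixed or thin evidence, needs more data.
    return Verdict.NEEDS_DATA
```

Under these assumed thresholds, a single passing run and a single failing run both land on 🔶, which is the point of the asymmetry: neither extreme is declared without reproduction.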
Recommend the cheapest safe rung
When a verdict is positive, the recommendation is the smallest, cheapest local model that is reliably safe, not the biggest one that happens to pass. When no local model clears the task, the fallback escalates by exactly one rung: hosted open weights before frontier. It’s a FrugalGPT-style cascade made concrete.
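The cascade can be sketched as a walk over rungs ordered from cheapest to most expensive, stopping at the first rung that has a safe candidate. Everything named here (`Candidate`, `cost`, `safe`, `recommend`, and the rung labels) is hypothetical and exists only to illustrate the selection rule.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float  # hypothetical relative cost; lower is cheaper
    safe: bool   # True if the model reliably clears the verifier


def recommend(
    rungs: List[Tuple[str, List[Candidate]]]
) -> Optional[Tuple[str, str]]:
    """Return (rung, model) for the cheapest safe model on the lowest rung.

    Rungs are assumed to be ordered local -> hosted open weights -> frontier,
    so escalation moves one rung at a time.
    """
    for rung_name, candidates in rungs:
        safe = [c for c in candidates if c.safe]
        if safe:
            best = min(safe, key=lambda c: c.cost)
            return rung_name, best.name
    return None  # nothing on any rung clears the task
```

Choosing `min` by cost among the safe candidates, instead of the highest-scoring model, is what makes the recommendation "the cheapest safe rung".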
The full pipeline — task catalog, evidence and eval runs, practitioner signals, and the agent API/MCP — is laid out end to end in How it works.
Built by Sam Carlton