How it works
DoesItLocal turns a messy practitioner question (“can my local model do this?”) into a single, evidence-backed verdict per task. Here is the flow end to end.
1. A task enters the catalog
The unit is a task: a describable piece of work you’d ask an AI to do, such as “extract structured fields from an invoice,” “write a unit test for a pure function,” “summarize a meeting transcript,” or “refactor a 200-line module.” Tasks are seeded from real-usage and benchmark taxonomies (so they reflect what people actually ask, not invented categories) and grow from community submissions. Each task carries a name, a plain description, a category, and, crucially, a verification method: how you’d check that the output is right.
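As a rough illustration, a catalog entry could be modeled as a small record. This is a sketch under assumptions, not DoesItLocal's actual schema: the class names, field names, and the `source` field are hypothetical, and the verifier kinds are taken from the list in the next section.

```python
from dataclasses import dataclass
from enum import Enum


class VerificationMethod(Enum):
    # Verifier kinds named in the text; the enum itself is hypothetical.
    TESTS_PASS = "tests_pass"
    LINT_CLEAN = "lint_clean"
    TYPES_CHECK = "types_check"
    EXACT_MATCH = "exact_match"
    VALIDATED_JUDGE = "validated_judge"


@dataclass(frozen=True)
class Task:
    name: str
    description: str
    category: str
    verification: VerificationMethod
    source: str = "seed"  # hypothetical field: "seed" or "community"


# Example entry built from one of the tasks quoted above.
example = Task(
    name="Write a unit test for a pure function",
    description="Given a pure function, produce a unit test that passes.",
    category="coding",
    verification=VerificationMethod.TESTS_PASS,
)
```

Making the verification method a required field mirrors the point above: a task without a way to check its output cannot receive a verdict.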
2. Evidence accumulates
Two kinds of signal attach to a task, newest first:
- Eval runs — a model runs the task and its output is scored by the task’s verifier (tests pass / lint clean / types check / exact-match / a validated judge). This is the reproducible measurement.
- Practitioner reports — developers report what happened when they ran a model on this kind of task, with their hardware and model. This is the cheap, scaling, breadth signal — and it’s collected through a manipulation-resistant voting system, not a naïve up/down tally. See Methodology.
A model’s published benchmark score (AA Index, LiveBench, etc.) is used only to decide which models are worth running on the task at all — a shortlist input, never evidence of the verdict.
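The separation between evidence and shortlist input can be sketched as follows. Everything here is an assumption made for illustration: the `Evidence` record, the function names, the model names, and the score threshold are hypothetical and not part of any published DoesItLocal interface.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Evidence:
    kind: str          # "eval_run" or "practitioner_report"
    model: str
    passed: bool
    observed_on: date


def newest_first(items):
    """Order evidence so the most recent signal comes first."""
    return sorted(items, key=lambda e: e.observed_on, reverse=True)


def shortlist(benchmark_scores, min_score):
    """Pick which models are worth running on a task.

    Benchmark scores only gate which models get evaluated;
    they are never turned into Evidence records.
    """
    return sorted(m for m, s in benchmark_scores.items() if s >= min_score)


# Hypothetical data: model names, dates, and scores are made up.
signals = [
    Evidence("eval_run", "model-a", True, date(2025, 1, 10)),
    Evidence("practitioner_report", "model-a", False, date(2025, 3, 2)),
]
ordered = newest_first(signals)
candidates = shortlist({"model-a": 61.0, "model-b": 38.5}, min_score=50.0)
```

Note that `shortlist` returns model names only, so a benchmark score has no path into the evidence list.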
3. The verdict resolves
The signals roll up into one current verdict per task, on an asymmetric, default-conservative scale:
- 🟢 Safe for local — a named local model reliably clears the task’s verifier.
- 🟡 Local with a check — a local model can do it if you gate the output with a cheap verification step; trusting it blind is not safe.
- 🔴 Needs a bigger model — no local model reliably clears it; use the recommended fallback.
- 🔶 Needs more data — the honest default until the evidence is strong enough to assert anything.
The bar is asymmetric on purpose: a negative (“needs a bigger model”) demands reproduced evidence, and anything ambiguous stays “needs more data” — DoesItLocal never guesses a green light. See Reading a verdict.
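One way to picture the default-conservative rule is the sketch below. The thresholds (`MIN_RUNS`, `SAFE_RATE`, `CHECK_RATE`) and the pass-rate logic are assumptions chosen for illustration. The actual resolution rules are described in Methodology and may differ substantially.

```python
from enum import Enum


class Verdict(Enum):
    SAFE_FOR_LOCAL = "safe for local"
    LOCAL_WITH_CHECK = "local with a check"
    NEEDS_BIGGER_MODEL = "needs a bigger model"
    NEEDS_MORE_DATA = "needs more data"


# Hypothetical thresholds, not DoesItLocal's real values.
MIN_RUNS = 5        # runs needed before a model's results count as reproduced
SAFE_RATE = 0.95    # pass rate treated as "reliably clears the verifier"
CHECK_RATE = 0.60   # pass rate where gating the output is worthwhile


def resolve(runs_by_model):
    """Map {local model: [verifier pass/fail, ...]} to a single verdict."""
    # Only models with enough runs contribute to the verdict.
    rates = {
        model: sum(runs) / len(runs)
        for model, runs in runs_by_model.items()
        if len(runs) >= MIN_RUNS
    }
    if not rates:
        return Verdict.NEEDS_MORE_DATA
    best = max(rates.values())
    if best >= SAFE_RATE:
        return Verdict.SAFE_FOR_LOCAL
    if best >= CHECK_RATE:
        return Verdict.LOCAL_WITH_CHECK
    # A negative is asserted only when every model tried has reproduced evidence.
    if len(rates) == len(runs_by_model):
        return Verdict.NEEDS_BIGGER_MODEL
    return Verdict.NEEDS_MORE_DATA
```

In this sketch an under-sampled model can block a red verdict but can never produce a green one, which is the asymmetry described above.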
4. You get a recommendation, not just a label
Alongside the verdict, each task shows:
- a table of local models that clear it (and at what size/quantization/hardware), and
- a recommended fallback — the cheapest open-weights or frontier model that does handle it — when local isn’t safe.
Every task page carries this table. Browse the tasks to see it in action.
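The two parts of the recommendation can be sketched as a filter and a minimum. The `Candidate` record, its cost field, and the model names are hypothetical, and "cheapest" is assumed here to mean lowest price per million tokens.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    model: str
    local: bool                      # runs on the practitioner's own hardware
    clears_task: bool                # reliably passes the task's verifier
    cost_per_million_tokens: float   # hypothetical cost measure


def local_table(candidates):
    """Local models that clear the task."""
    return [c for c in candidates if c.local and c.clears_task]


def recommend_fallback(candidates) -> Optional[Candidate]:
    """Cheapest non-local model that handles the task, or None if there is none."""
    viable = [c for c in candidates if c.clears_task and not c.local]
    return min(viable, key=lambda c: c.cost_per_million_tokens, default=None)


# Hypothetical candidates for one task.
pool = [
    Candidate("small-local", local=True, clears_task=False, cost_per_million_tokens=0.0),
    Candidate("hosted-open-weights", local=False, clears_task=True, cost_per_million_tokens=0.9),
    Candidate("frontier", local=False, clears_task=True, cost_per_million_tokens=15.0),
]
fallback = recommend_fallback(pool)
```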
5. Agents query it directly
A coding agent or router sends the agent API / MCP (planned) a task, plus the local model it has available, and gets back the verdict and recommendation in one request. It can then run the task locally when that’s safe, add a verifier when that’s the unlock, and escalate to a bigger model only when it must.
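Since the agent API is still planned, the sketch below leaves out the request itself and shows only how a router might act on a verdict once it has one. The verdict strings and action names are hypothetical. Treating "needs more data" as a reason to escalate is an assumption made here, not documented behavior.

```python
def choose_action(verdict: str) -> str:
    """Translate a verdict into a routing decision (hypothetical names)."""
    actions = {
        "safe_for_local": "run_local",
        "local_with_check": "run_local_then_verify",
        "needs_bigger_model": "escalate_to_fallback",
    }
    # Assumption: unknown or "needs_more_data" verdicts take the conservative path.
    return actions.get(verdict, "escalate_to_fallback")


# Example: a router deciding what to do with three tasks.
decisions = [
    choose_action(v)
    for v in ("safe_for_local", "local_with_check", "needs_more_data")
]
```

A router could equally choose to run locally behind a verifier when data is missing. The conservative default is used here only to stay consistent with the asymmetry of the verdict scale.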
Why it stays honest
Freshness and asymmetry are the whole game. Models turn over every quarter, so verdicts carry a date and a staleness flag and re-resolve as new evidence lands; and because a false “safe for local” is the one error that burns trust, the default is always the conservative call. The reasoning behind this design is in Methodology.
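A staleness flag of the kind described could be derived from the verdict's date, as in this minimal sketch. The 90-day window is an assumption loosely based on "every quarter". The source does not state the actual cutoff.

```python
from datetime import date, timedelta

# Assumption: roughly one quarter; the real cutoff is not stated in the source.
MAX_AGE = timedelta(days=90)


def is_stale(resolved_on: date, today: date, max_age: timedelta = MAX_AGE) -> bool:
    """Return True when a verdict is older than the allowed age."""
    return today - resolved_on > max_age


# Example with made-up dates.
old_verdict = is_stale(date(2025, 1, 1), today=date(2025, 6, 1))
fresh_verdict = is_stale(date(2025, 5, 1), today=date(2025, 6, 1))
```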
Built by Sam Carlton