How it works

DoesItLocal turns a messy practitioner question (“can my local model do this?”) into a single, evidence-backed verdict per task. Here is the flow end to end.

1. A task enters the catalog

The unit is a task — a describable piece of work you’d ask an AI to do: “extract structured fields from an invoice,” “write a unit test for a pure function,” “summarize a meeting transcript,” “refactor a 200-line module.” Tasks are seeded from real-usage and benchmark taxonomies (so they reflect what people actually ask, not invented categories) and grow from community submissions. Each carries a name, a plain description, a category, and — crucially — a verification method: how you’d check the output is right.
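As a rough illustration, a catalog entry with those four fields could be modeled as below. This is a minimal sketch, not DoesItLocal's actual schema: the class names, the `source` field, and the enum values are assumptions, with the verifier kinds borrowed from the ones this page lists under eval runs.

```python
from dataclasses import dataclass
from enum import Enum


class Verifier(Enum):
    # Hypothetical enum; the kinds mirror the verifiers this page names.
    TESTS_PASS = "tests_pass"
    LINT_CLEAN = "lint_clean"
    TYPES_CHECK = "types_check"
    EXACT_MATCH = "exact_match"
    VALIDATED_JUDGE = "validated_judge"


@dataclass(frozen=True)
class Task:
    name: str
    description: str
    category: str
    verifier: Verifier      # how you would check the output is right
    source: str = "seed"    # assumed values: "seed" or "community"


# Example entry built from one of the tasks quoted above.
unit_test_task = Task(
    name="write-unit-test-pure-function",
    description="Write a unit test for a pure function",
    category="coding",
    verifier=Verifier.TESTS_PASS,
)
```

Making the verifier a required field reflects the point above: a task without a way to check its output cannot be scored.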

2. Evidence accumulates

Two kinds of signal attach to a task, newest first:

Eval runs: a model runs the task and its output is scored by the task’s verifier (tests pass / lint clean / types check / exact-match / a validated judge). This is the reproducible measurement.
Practitioner reports: developers report what happened when they ran a model on this kind of task, along with their hardware and model. This is the cheap signal that scales and adds breadth, and it is collected through a manipulation-resistant voting system, not a naïve up/down tally. See Methodology.

A model’s published benchmark score (AA Index, LiveBench, etc.) is used only to decide which models are worth running on the task at all. It is a shortlist input, never evidence for the verdict.
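The separation between evidence and shortlist input can be sketched as follows. All names here (`EvalRun`, `PractitionerReport`, `shortlist`, `newest_first`) and the score threshold are hypothetical, chosen only to show that a benchmark score selects which models get run and is never stored as evidence.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EvalRun:
    model: str
    passed: bool        # did the output clear the task's verifier?
    run_on: date


@dataclass(frozen=True)
class PractitionerReport:
    model: str
    hardware: str
    worked: bool
    reported_on: date


def shortlist(benchmark_scores: dict[str, float], min_score: float) -> list[str]:
    """Pick models worth running; the score itself is not kept as evidence."""
    return sorted(m for m, s in benchmark_scores.items() if s >= min_score)


def newest_first(runs: list[EvalRun]) -> list[EvalRun]:
    """Order eval runs so the most recent measurement comes first."""
    return sorted(runs, key=lambda r: r.run_on, reverse=True)
```

Only `EvalRun` and `PractitionerReport` records would attach to a task; the output of `shortlist` decides what to run next and nothing more.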

3. The verdict resolves

The signals roll up into one current verdict per task, on an asymmetric, default-conservative scale:

🟢 Safe for local — a named local model reliably clears the task’s verifier.
🟡 Local with a check — a local model can do it if you gate the output with a cheap verification step; trusting it blind is not safe.
🔴 Needs a bigger model — no local model reliably clears it; use the recommended fallback.
🔶 Needs more data — the honest default until the evidence is strong enough to assert anything.

The bar is asymmetric on purpose: a negative (“needs a bigger model”) demands reproduced evidence, and anything ambiguous stays “needs more data” — DoesItLocal never guesses a green light. See Reading a verdict.
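One way to picture the default-conservative roll-up is the sketch below. It is not the real resolution logic: every threshold, the idea of mapping "local with a check" to a middle pass-rate band, and the `reproduced` flag are assumptions made for illustration. What it does try to preserve is the shape described above: too little or ambiguous evidence falls through to "needs more data", and a negative verdict requires reproduced evidence.

```python
from enum import Enum


class Verdict(Enum):
    SAFE_FOR_LOCAL = "safe for local"
    LOCAL_WITH_A_CHECK = "local with a check"
    NEEDS_BIGGER_MODEL = "needs a bigger model"
    NEEDS_MORE_DATA = "needs more data"


# Illustrative thresholds only; none of these numbers come from DoesItLocal.
MIN_RUNS = 5
SAFE_RATE = 0.95
CHECK_RATE = 0.70
FAIL_RATE = 0.30


def resolve(passes: int, runs: int, reproduced: bool) -> Verdict:
    """Resolve a verdict from the best local model's eval runs.

    `reproduced` is an assumed flag meaning the result was confirmed
    by an independent rerun.
    """
    if runs < MIN_RUNS:
        return Verdict.NEEDS_MORE_DATA      # never assert on thin evidence
    rate = passes / runs
    if rate >= SAFE_RATE:
        return Verdict.SAFE_FOR_LOCAL
    if rate >= CHECK_RATE:
        return Verdict.LOCAL_WITH_A_CHECK
    if rate <= FAIL_RATE and reproduced:
        return Verdict.NEEDS_BIGGER_MODEL   # a negative needs reproduction
    return Verdict.NEEDS_MORE_DATA          # anything ambiguous stays here
```

In this sketch, `resolve(1, 10, reproduced=False)` returns `Verdict.NEEDS_MORE_DATA` rather than a red verdict, because the failure has not been reproduced.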

4. You get a recommendation, not just a label

Alongside the verdict, each task shows:

a table of local models that clear it (and at what size/quantization/hardware), and
a recommended fallback — the cheapest open-weights or frontier model that does handle it — when local isn’t safe.

Every task page carries this table. Browse the tasks to see it in action.
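The two pieces of the recommendation can be sketched as a filter and a minimum. The record types, field names, and the single `cost` number are hypothetical; the page does not say how cost is measured, so treat it as a stand-in for whatever pricing signal is actually used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocalResult:
    model: str
    size: str            # e.g. parameter count, as a label
    quantization: str
    hardware: str
    clears: bool         # does this setup clear the task's verifier?


@dataclass(frozen=True)
class FallbackCandidate:
    model: str
    handles_task: bool
    cost: float          # hypothetical relative cost, lower is cheaper


def clearing_models(results: list[LocalResult]) -> list[LocalResult]:
    """Rows for the table of local models that clear the task."""
    return [r for r in results if r.clears]


def cheapest_fallback(candidates: list[FallbackCandidate]) -> Optional[FallbackCandidate]:
    """Cheapest candidate that handles the task, or None if there is none."""
    eligible = [c for c in candidates if c.handles_task]
    return min(eligible, key=lambda c: c.cost, default=None)
```

Returning `None` when nothing qualifies keeps the sketch from inventing a fallback that the evidence does not support.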

5. Agents query it directly

A coding agent or router sends a task, plus the local model it has available, to the agent API / MCP (planned) and gets back the verdict and recommendation in one request. It can then run the task locally when that is safe, add a verifier when that is the unlock, and escalate to a bigger model only when it must.
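Since the agent API is still planned, the sketch below makes no claim about its endpoints or response format. It only shows the client-side decision an agent might make once it has a verdict in hand. The `VerdictResponse` shape, the verdict strings, and the `Action` names are all hypothetical, and escalating on "needs more data" is this sketch's own conservative choice rather than documented behavior.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    RUN_LOCAL = "run_local"
    RUN_LOCAL_THEN_VERIFY = "run_local_then_verify"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class VerdictResponse:
    # Hypothetical response shape; the real API is not yet published.
    verdict: str
    fallback_model: Optional[str] = None


def plan(response: VerdictResponse) -> Action:
    """Map a verdict to what the agent does next."""
    if response.verdict == "safe_for_local":
        return Action.RUN_LOCAL
    if response.verdict == "local_with_a_check":
        return Action.RUN_LOCAL_THEN_VERIFY   # gate the output with a verifier
    # "needs_bigger_model", "needs_more_data", or anything unrecognized:
    # assumed here to escalate instead of trusting the local model blind.
    return Action.ESCALATE
```

An agent would call `plan` once per task and, on `Action.ESCALATE`, route to `response.fallback_model` if one was returned.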

Why it stays honest

Freshness and asymmetry are the whole game. Models turn over every quarter, so verdicts carry a date and a staleness flag and re-resolve as new evidence lands. And because a false “safe for local” is the one error that burns trust, the default is always the conservative call. The reasoning behind this design is in Methodology.
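The date and staleness flag could work along these lines. The 90-day window is an assumption loosely derived from the "every quarter" turnover mentioned above, and both function names are hypothetical.

```python
from datetime import date, timedelta

# Assumed window, roughly one quarter; the real cutoff is not documented here.
STALE_AFTER = timedelta(days=90)


def is_stale(resolved_on: date, today: date) -> bool:
    """Flag a verdict whose resolution date is older than the window."""
    return today - resolved_on > STALE_AFTER


def needs_reresolve(resolved_on: date, newest_evidence_on: date) -> bool:
    """A verdict should re-resolve when evidence landed after it was resolved."""
    return newest_evidence_on > resolved_on
```

The two checks are independent: a verdict can be recent but outdated by new evidence, or old with nothing new to re-resolve against, in which case it is only flagged as stale.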

Built by Sam Carlton