Introduction

DoesItLocal answers one question, with current evidence: can a local LLM safely do this task, and if not, what should you use instead? The answer is not a vibe or a leaderboard rank but a specific verdict per task: safe for local, local with a verification step, needs a bigger model, or needs more data. Each verdict names which local models clear the task, on what hardware, and the recommended open-weights or frontier fallback when none do.

Running a capable model on your own MacBook is suddenly realistic — Gemma, Qwen, Llama and friends fit in 16–64 GB and cost nothing per token. So the real question stopped being “can I run a model locally?” and became “which of my tasks can I trust a local model with, and which still need a frontier model?” Today that question is answered badly:

  • Leaderboards rank models, not tasks. A single “intelligence” score (or even a coding-agent score) tells you a model’s average standing — it cannot tell you whether your task is one a local model handles cleanly or one where it fails silently. The dominant variance is the model × task × harness interaction, not a scalar you can read off a chart.
  • Practitioner knowledge is real but scattered and decaying. The useful answer (“Qwen2.5-Coder 7B nails small refactors but invents APIs on anything unfamiliar”) lives in an r/LocalLLaMA thread, a Hacker News comment, or someone’s blog: undated, unaggregated, and stale within a quarter as models turn over.
  • “Safe” is the word nobody quantifies. A task is safe for local only if the model’s output reliably passes a check with low risk of a confident, wrong answer slipping through. That’s a property of the task and its verifier, not a star rating.
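
To make the last point concrete, here is a minimal sketch of how "safe" could be quantified from verifier runs. It is an illustration under stated assumptions, not DoesItLocal's actual scoring: the `Run` record, the function names, and both thresholds are hypothetical, and it assumes each run can later be audited for ground-truth correctness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One model attempt at a task (hypothetical record shape)."""
    passed_check: bool      # the task's verifier accepted the output
    actually_correct: bool  # ground truth, e.g. from a stricter audit


def pass_rate(runs):
    """Fraction of runs whose output the verifier accepted."""
    return sum(r.passed_check for r in runs) / len(runs)


def escape_rate(runs):
    """Fraction of accepted outputs that were in fact wrong.

    This is the "confident, wrong answer slipping through" case.
    """
    passed = [r for r in runs if r.passed_check]
    if not passed:
        return 0.0
    return sum(not r.actually_correct for r in passed) / len(passed)


def is_safe_for_local(runs, min_pass_rate=0.95, max_escape_rate=0.01):
    """Both thresholds are illustrative assumptions, not published values."""
    if not runs:
        return False  # no evidence: stay conservative
    return (pass_rate(runs) >= min_pass_rate
            and escape_rate(runs) <= max_escape_rate)
```

Note that a model can pass the check every time and still fail this test: if a tenth of the accepted outputs are wrong, the verifier is too weak for the task to count as safe, whatever the model's rating.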

The cost of getting it wrong is asymmetric: route a task to a local model that looks fine and silently botches it, and you’ve shipped a bug or burned an afternoon — far worse than paying a few cents to a frontier model. So the honest default has to be conservative.

DoesItLocal is a single, fresh dataset of per-task local-safety verdicts, surfaced three ways:

  1. A free public website — humans browse tasks, see the verdict, the model table, and the evidence behind it.
  2. An agent API / MCP server (planned) — a coding agent or router queries “is this task safe for the local model I have?” before spending tokens, and routes accordingly.
  3. A practitioner community (planned) — developers vote and comment on what a local model can actually do, through a voting system built to resist the manipulation that wrecks naïve score sites.
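
As a rough illustration of the second surface, the sketch below shows how a router might act on a verdict before spending tokens. The agent API is still planned, so nothing here reflects a real endpoint or schema: the in-memory `VERDICTS` table stands in for the dataset, and the task ids, model name, and route labels are all hypothetical. Only the four verdict names come from the description above.

```python
from enum import Enum


class Verdict(Enum):
    SAFE_FOR_LOCAL = "safe for local"
    LOCAL_WITH_VERIFICATION = "local with a verification step"
    NEEDS_BIGGER_MODEL = "needs a bigger model"
    NEEDS_MORE_DATA = "needs more data"


# Hypothetical stand-in for the verdict dataset, keyed by (task, local model).
VERDICTS = {
    ("small-refactor", "example-local-7b"): Verdict.SAFE_FOR_LOCAL,
    ("sql-migration", "example-local-7b"): Verdict.LOCAL_WITH_VERIFICATION,
    ("unfamiliar-api", "example-local-7b"): Verdict.NEEDS_BIGGER_MODEL,
}


def route(task_id, local_model):
    """Return 'local', 'local+verify', or 'fallback' for a task.

    Unknown pairs are treated as NEEDS_MORE_DATA and sent to the
    fallback, which keeps the default conservative.
    """
    verdict = VERDICTS.get((task_id, local_model), Verdict.NEEDS_MORE_DATA)
    if verdict is Verdict.SAFE_FOR_LOCAL:
        return "local"
    if verdict is Verdict.LOCAL_WITH_VERIFICATION:
        return "local+verify"
    return "fallback"
```

For example, `route("small-refactor", "example-local-7b")` returns `"local"`, while a task the table has never seen returns `"fallback"`.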

Each verdict is produced by a hybrid signal: reproducible eval/verifier runs (the measurement) plus manipulation-resistant practitioner reports (the breadth). A model’s benchmark score is only ever an input that decides which models are worth trying — never the verdict itself.
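
The separation between benchmark score and verdict can be sketched as two functions that never share an input. This is one possible arrangement under assumptions, not the project's actual pipeline: the record shapes, the thresholds, and the rule for combining eval runs with practitioner reports are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    benchmark_score: float  # hypothetical 0-100 scale


@dataclass(frozen=True)
class Evidence:
    eval_pass_rate: float   # from reproducible eval/verifier runs
    reports_positive: int   # practitioner reports that the model worked
    reports_negative: int   # practitioner reports that it failed


def shortlist(candidates, min_score):
    """The benchmark score only decides which models are worth trying."""
    return [c.name for c in candidates if c.benchmark_score >= min_score]


def clears_task(evidence, min_pass_rate=0.9, min_report_ratio=0.7):
    """Decide from measurement plus reports; no benchmark score is used.

    Assumed rule: the eval pass rate must meet its threshold, and
    practitioner reports (when there are any) must not contradict it.
    """
    total = evidence.reports_positive + evidence.reports_negative
    reports_ok = (total == 0
                  or evidence.reports_positive / total >= min_report_ratio)
    return evidence.eval_pass_rate >= min_pass_rate and reports_ok
```

Because `clears_task` takes no `Candidate`, a high benchmark score cannot rescue a model that fails the task's own evidence.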

DoesItLocal is not another model leaderboard, and not a router you install. It publishes facts about tasks — which kinds of work a local model can be trusted with — so that humans and the routers they already use (LiteLLM, a cascade, Gemini CLI’s local classifier) can make a cheaper, safer call. Scoring the task, not the model, is the whole point — see About DoesItLocal.

Built by Sam Carlton