Insights | June 17, 2026

How to Compare AI Coding Agents

Thirty-plus agents. Roughly five models. What you're choosing is the harness.

There are more than thirty serious coding agents you can install today. They feel wildly different. Open four of them and you'd swear they were built by rival species.

They're mostly the same five or so model families in different costumes.

Claude, GPT, Gemini, and a short list of strong open-weight models do almost all the actual thinking. Most agents are a harness wrapped around one of them: the part that reads your repo, decides what to look at, runs commands, edits files, and recovers when a test fails. Once you see that, the question stops being "which agent is smartest" and becomes "which harness do I want to drive, on which model, for this task." That's a much more useful question, and it's the one this post is about.

Benchmarks won't pick your agent

The standard way to compare agents is a leaderboard — SWE-bench, a terminal benchmark, a number next to each name. It's the wrong tool for this decision, for three reasons.

They measure the model, not the harness. Most of a benchmark score comes from the model underneath, and the model is the part you can swap. A ranking of agents is mostly a ranking of whichever frontier model each one happened to be pointed at that week.

They measure a task shape you don't have. The classic benchmark is "here's a GitHub issue, produce a patch that passes hidden tests." Real work is ambiguous, half-specified, spread across services, and judged by a human. An agent that tops the patch-the-issue game can still be exhausting to work with on a vague feature request.

They churn and they saturate. A new model lands, one harness gets to it first, the order reshuffles, and three weeks later it's stale. The top scores now cluster so tightly that the gaps are noise.

Use a benchmark as a coarse filter — is this model capable enough to trust with real code — and then ignore it. What you feel every day isn't on the leaderboard.
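To put a rough number on "the gaps are noise," here is a minimal sketch of the sampling error on a pass rate. The 500-task suite size and the scores are hypothetical inputs, not real leaderboard results, and the normal approximation treats two runs as independent samples, which is a simplification.

```python
import math


def score_margin(pass_rate: float, n_tasks: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error on a pass rate (normal approximation)."""
    standard_error = math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)
    return z * standard_error


def gap_is_noise(score_a: float, score_b: float, n_tasks: int, z: float = 1.96) -> bool:
    """True when the gap between two scores is within sampling error.

    Assumes the two runs are independent samples of n_tasks each.
    """
    var_a = score_a * (1.0 - score_a) / n_tasks
    var_b = score_b * (1.0 - score_b) / n_tasks
    return abs(score_a - score_b) < z * math.sqrt(var_a + var_b)


# Hypothetical scores on a hypothetical 500-task suite.
print(f"margin at 75%: +/-{score_margin(0.75, 500):.3f}")
print("75% vs 77% is noise:", gap_is_noise(0.75, 0.77, 500))
print("50% vs 75% is noise:", gap_is_noise(0.50, 0.75, 500))
```

Under these assumptions a two-point lead sits well inside the margin, while a twenty-five-point gap does not. That is the coarse filter doing its job: it separates capable from not capable, and little else.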

What actually differs

Strip away the model and the score, and agents separate along a handful of axes that genuinely change how it feels to work with them.

Model strategy. Locked to one lab (Claude Code is Claude, Codex is GPT) or bring-your-own across many providers (opencode, Aider, Cline). Locked agents are tuned tightly to their model and tend to get new capabilities first. Open agents let you chase the best — or cheapest — model without changing tools, and let you run a model you host yourself.
Context strategy. How the agent figures out what to read before it acts. Naive agents grep and hope. Serious ones build an index, do semantic search, or carry a persistent map of your codebase. On a small repo it barely matters. On a million-line monorepo it's the whole game.
Autonomy level. Where it sits on the line from "suggests one edit and waits" to "disappears for twenty minutes and comes back with a branch." Pairing-style tools keep you in the loop on every change. Autonomous ones are leverage when the task is well-specified and a liability when it isn't.
Permission and safety model. What it will do without asking — edit files, run shell commands, install packages, hit the network. This is the difference between an agent that feels safe to let loose and one you have to babysit.
Where it runs. A terminal CLI you can script and drop into any workflow, versus something welded to one editor or living only in someone's cloud. Terminal-native agents compose. The rest you adapt to.
State and cost shape. Whether it remembers anything between sessions, and whether you pay a flat subscription, metered tokens, or nothing because you brought your own key. These quietly decide whether you actually reach for it.

None of these show up in a score. All of them decide whether you keep the agent after a week.
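One way to make the axes above concrete is to write them down as a profile and check it against the task in front of you. This is a sketch, not anyone's real schema: `HarnessProfile`, `concerns`, the field values, and the thresholds are all hypothetical names and rules chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, List


class Autonomy(Enum):
    SUGGEST = "suggests one edit and waits"
    AUTONOMOUS = "works unattended and returns a branch"


@dataclass(frozen=True)
class HarnessProfile:
    name: str
    bring_your_own_model: bool    # model strategy
    context_strategy: str         # "grep", "index", "semantic", or "map"
    autonomy: Autonomy
    asks_before: FrozenSet[str]   # actions gated on approval, e.g. {"shell"}
    runs_in: str                  # "terminal", "editor", or "cloud"
    remembers_sessions: bool      # state
    pricing: str                  # "subscription", "metered", or "own-key"


def concerns(p: HarnessProfile, *, repo_lines: int, well_specified: bool,
             must_self_host: bool) -> List[str]:
    """Return mismatches between a harness and a task; empty means none found."""
    found = []
    if repo_lines >= 1_000_000 and p.context_strategy == "grep":
        found.append("context: grep-only on a very large repo")
    if p.autonomy is Autonomy.AUTONOMOUS and not well_specified:
        found.append("autonomy: unattended run on a vague task")
    if must_self_host and not p.bring_your_own_model:
        found.append("model: locked to one lab, cannot self-host")
    if "shell" not in p.asks_before:
        found.append("safety: runs shell commands without asking")
    return found


# A made-up agent, not a description of any real product.
example = HarnessProfile(
    name="example-agent", bring_your_own_model=False, context_strategy="grep",
    autonomy=Autonomy.AUTONOMOUS, asks_before=frozenset({"network"}),
    runs_in="terminal", remembers_sessions=False, pricing="subscription",
)
for line in concerns(example, repo_lines=2_000_000, well_specified=False,
                     must_self_host=True):
    print(line)
```

The rules are deliberately crude. The point is that every check reads a field a leaderboard never reports.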

The field, by archetype

Here's the whole field Agentastic supports — 33 agents — grouped by what kind of thing each one actually is. The takes are opinionated on purpose.

The frontier labs' own CLIs

The model makers shipping their own harness. Locked to their model, tuned for it, usually first to a new release.

Agent	Vendor	What sets it apart
Claude Code	Anthropic	The one to beat for hard, multi-file work — strong planning, sub-agents, hooks, MCP.
Codex	OpenAI	A tight edit-run-test loop; at its best when the job is "make the tests pass."
Gemini	Google	An enormous context window and a free tier that's hard to argue with.
Qwen Code	Alibaba	A lab CLI for a strong open-weight model you can also self-host.
Kimi	Moonshot AI	Long context on an open-weight stack; a lot of capability per dollar.
Mistral Vibe	Mistral	EU-hosted models — the one to reach for when data residency is the constraint.

Bring-your-own-model agents

Open, provider-agnostic, often self-hostable. The harness is the product; you choose the brain. This is where most experimentation lives.

Agent	Vendor	What sets it apart
opencode	open source	The popular open default — provider-agnostic and built for multiple sessions at once.
Aider	open source	Minimal and git-native; commits every change, so it feels like real pairing.
Cline	open source	Approval-gated by default — autonomy you grant a step at a time.
Continue	Continue	Config-driven and customizable; bend it to your stack.
Goose	Block	MCP-native and extensible; built to be wired into your own tools.
OpenHands	OpenHands	Open and capable, runs local or in the cloud, leans autonomous.
Charm	Charmbracelet	The best-looking agent in the terminal, and not just for show; the CLI itself is called Crush.
Codebuff	Codebuff	Fast, no-ceremony terminal edits.
Pi	community	Tiny and hackable — a good base to build your own thing on.
Kilo Code	Kilo	Provider-agnostic with a managed option if you don't want to wire keys.
Command Code	Langbase	Workflow- and skills-oriented; structure over free-for-all.

Specialists

Each does one thing other agents treat as an afterthought.

Agent	Vendor	What sets it apart
Amp	Sourcegraph	Code search is the feature; shines where finding the right file is the hard part.
Auggie	Augment Code	A real context engine for enterprise-scale codebases.
Droid	Factory	Built for long-running, autonomous background work.
Letta Code	Letta	Persistent memory across sessions — it remembers your repo and your decisions.
Hermes Agent	Nous Research	Privacy-first: self-hosted, no telemetry, no cloud lock-in.
mini-SWE-agent	SWE-bench team	About a hundred readable lines of Python; the best way to actually learn how agents work.
Cortex Code	Snowflake	Data-engineering and warehouse-adjacent code.
OB-1	OpenBlock Labs	Autonomous on-chain and data work.
Autohand Code	Autohand	A ReAct loop plus a skills system.

Agents that meet you where your work lives

From companies whose agent plugs into a product you may already use.

Agent	Vendor	What sets it apart
GitHub Copilot	GitHub	Wired into PRs and Actions; multi-model, with deep inline-editing roots.
Cursor	Anysphere	IDE-grade editing brought to the terminal, and cheap per task.
Junie	JetBrains	For JetBrains shops; model-agnostic.
Kiro	AWS	Spec-driven — you write the spec, it builds to it.
Rovo Dev	Atlassian	Work that starts from a Jira ticket.

Review specialists

Not builders. Point them at a diff and they tell you what's wrong.

Agent	Vendor	What sets it apart
CodeRabbit	CodeRabbit	An automated reviewer on every change.
Greptile	Greptile	Whole-codebase-aware review, not just line-by-line.

And anything not on this list still works — point Agentastic at any terminal CLI in Settings → Connections and it becomes an agent too.

How to choose without overthinking it

You don't need the perfect agent. You need a small kit and the judgment to match it to the task.

One heavyweight for hard, multi-file work — a frontier-lab CLI on its best model. This is where capability actually pays for itself.
One open, bring-your-own agent for the long tail and anything cost- or privacy-sensitive — pointed at a cheaper or self-hosted model. Most tasks don't need the frontier.
One specialist if your bottleneck has a name — search on a monorepo, memory across sessions, autonomous background runs.
A reviewer on the diff before you merge.

The trap is treating this as a marriage. The best model moves every few weeks, the best harness for today's task isn't the one for tomorrow's, and the cost of guessing wrong compounds if switching means relearning your tools.

So don't marry one. The skill that actually compounds isn't picking the winner — it's orchestration: running two agents on the same problem and keeping the better diff, handing the boring half to a cheap agent while the expensive one does the thinking, reviewing output instead of babysitting it.

That's the bet Agentastic makes. Every agent runs in its own git worktree or container, so you can launch three of them on the same repo at once without them stepping on each other. Whatever produced the diff, you review it the same way — one surface, merge or delete. Auto-approve, resume, and plan mode are normalized across agents, so switching is a dropdown, not a weekend.
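The isolation idea is easy to try by hand with plain git. The sketch below is not how Agentastic implements it; it is a minimal illustration that gives each agent its own worktree and branch using `git worktree add`. The helper names, the `agent/<name>` branch scheme, and the sibling-directory layout are assumptions, and the agent commands are placeholders you would replace with a real CLI invocation.

```python
import subprocess
from pathlib import Path
from typing import Dict, List, Tuple


def worktree_add_cmd(repo: Path, agent: str, base: str = "HEAD") -> Tuple[Path, List[str]]:
    """Build the git command that gives one agent its own checkout and branch."""
    path = repo.parent / f"{repo.name}-{agent}"   # sibling directory per agent
    branch = f"agent/{agent}"                     # hypothetical naming scheme
    cmd = ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]
    return path, cmd


def launch(repo: Path, agent_cmds: Dict[str, List[str]],
           base: str = "HEAD") -> Dict[str, subprocess.Popen]:
    """Create one worktree per agent, then start each agent inside its own copy."""
    procs = {}
    for agent, cmd in agent_cmds.items():
        path, add_cmd = worktree_add_cmd(repo, agent, base)
        subprocess.run(add_cmd, check=True)             # fails if the branch already exists
        procs[agent] = subprocess.Popen(cmd, cwd=path)  # agents run concurrently
    return procs


# Usage (placeholder commands, not real agent CLIs):
#   procs = launch(Path("/work/app"), {"alpha": ["agent-alpha"], "beta": ["agent-beta"]})
#   for p in procs.values():
#       p.wait()
```

Each branch can then be compared against the base with `git diff`. Keep the better one, and drop the other with `git worktree remove` followed by deleting its branch.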

The honest conclusion of any agent comparison in 2026 is that there is no winner that stays won. The developers moving fastest aren't the ones who picked right. They're the ones who never had to pick just one.