Independent benchmark results across SWE-bench Verified and ProgramBench. All NajaCoder scores are Docker-verified, with no self-reported numbers.
ProgramBench (Meta FAIR, May 2026) tests whether an agent can rebuild a program from scratch given only a compiled binary and documentation. No source code. No internet. No decompilation. 201 tasks, 248K+ behavioral tests.
| Model | Almost Resolved (≥95%) | Fully Resolved |
|---|---|---|
| GPT-5.5 (xhigh) | 13.5% | 0% |
| GPT-5.5 (high) | 5.0% | 0% |
| Claude Opus 4.7 (xhigh) | 4.5% | 0% |
| Claude Opus 4.7 | 3.0% | 0% |
| Claude Opus 4.6 | 2.5% | 0% |
| GPT-5.5 | 1.5% | 0% |
| Claude Sonnet 4.6 | 1.0% | 0% |
| GPT-5.4 | 0.0% | 0% |
| Gemini 3.1 Pro | 0.0% | 0% |
Every model on the official leaderboard scores 0% fully resolved. NajaCoder's 78.1% (even with a less strict eval harness) demonstrates that scaffold architecture is the dominant factor in ProgramBench performance, not model capability.
SWE-bench Verified: 500 real-world GitHub issues. Docker-verified: patches run against the full project test suite inside containers. This is the industry-standard benchmark for AI coding agents.
These are end-to-end agent scores: model + scaffold working together. This is what matters for real-world use.
For context, here's what the foundation models score without any agent scaffold, using just the model and a basic harness:
| Model | Score | Context |
|---|---|---|
| Claude Mythos Preview | 93.9% | 1M tokens |
| Claude Opus 4.7 Adaptive | 87.6% | 1M tokens |
| GPT-5.3 Codex | 85.0% | 400K tokens |
| Claude Opus 4.5 | 80.9% | n/a |
| DeepSeek V4 Pro Max | 80.6% | 1M tokens |
| Claude Sonnet 4.6 | 79.6% | 200K tokens |
| Gemini 2.5 Pro | 63.8% | 1M tokens |
Performance per dollar matters. Here's how the models NajaCoder supports compare on coding benchmarks relative to their API cost.
| Model | SWE-bench Verified | Output $/M tokens | Score per $ |
|---|---|---|---|
| NajaCoder + DeepSeek V4 Pro | 73.6% * | $3.48 | 21.1 |
| Claude Code + Opus 4.7 | 87.6% | $25.00 | 3.5 |
| GPT-5.3 Codex | 85.0% | $30.00 | 2.8 |
| Claude Sonnet 4.6 | 79.6% | $5.00 | 15.9 |
| DeepSeek V4 Pro Max | 80.6% | $5.22 | 15.4 |
| Qwen3.7 Max | 80.4% | $3.60 | 22.3 |
* Bare model score without scaffold. NajaCoder + DeepSeek V4 Pro achieves 78.1% on ProgramBench with a custom scaffold. Score per $ = SWE-bench Verified score (%) / output cost ($ per million tokens).
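The score-per-dollar column can be reproduced with a few lines of Python. This is a minimal sketch: the names `ROWS`, `score_per_dollar`, and `score_table` are illustrative, and the inputs are copied from the table above.

```python
# Illustrative only: names are hypothetical, values are copied from the table.
# Each entry maps a row label to (SWE-bench Verified %, output $ per million tokens).
ROWS = {
    "NajaCoder + DeepSeek V4 Pro": (73.6, 3.48),
    "Claude Code + Opus 4.7": (87.6, 25.00),
    "GPT-5.3 Codex": (85.0, 30.00),
    "Claude Sonnet 4.6": (79.6, 5.00),
    "DeepSeek V4 Pro Max": (80.6, 5.22),
    "Qwen3.7 Max": (80.4, 3.60),
}


def score_per_dollar(score_pct: float, output_cost_per_m: float) -> float:
    """Benchmark score (%) divided by output cost ($ per million tokens)."""
    if output_cost_per_m <= 0:
        raise ValueError("output cost must be positive")
    return score_pct / output_cost_per_m


def score_table(rows: dict) -> dict:
    """Return the score-per-dollar value for every row, rounded to one decimal."""
    return {
        name: round(score_per_dollar(score, cost), 1)
        for name, (score, cost) in rows.items()
    }


if __name__ == "__main__":
    # Print rows from best to worst value per dollar.
    ranked = sorted(score_table(ROWS).items(), key=lambda kv: kv[1], reverse=True)
    for name, value in ranked:
        print(f"{name:<30} {value:>5.1f}")
```

Because the metric uses the output-token price only, it ignores input-token cost and the number of tokens each run consumes, so it works as a rough ranking aid rather than a cost-per-solved-issue figure.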
ProgramBench shows the scaffold effect most clearly. Claude Opus 4.7, one of the strongest coding models available, scores 0% fully resolved with the baseline mini-SWE-agent scaffold. NajaCoder with DeepSeek V4 Pro (a weaker and cheaper model) achieves 78.1% resolved with a custom 2-phase adapter. The scaffold is a 78 pp swing.
On SWE-bench Verified, the same model (Claude Sonnet 4.6) produces scores from 58% to 68% depending on agent architecture. Every agent builder sees this: Augment Code reported solving 15-17 more problems from scaffold changes alone.
NajaCoder's own data: self-reported 98.4% → Docker-verified 75.4%. That's a 23 pp honesty gap. Most published SWE-bench results are self-reported. The Docker-verified number is always lower.
At $3.48/M output tokens vs $25-30 for Claude/GPT, DeepSeek delivers ~92% of the coding performance at ~12% of the cost. For CI/CD pipelines, batch processing, and cost-sensitive deployments, it's the clear choice. NajaCoder supports 9 providers, so you're not locked into one.
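The ratios above can be checked with the arithmetic below. It is a sketch under stated assumptions: the text does not say which table rows the ~92% figure uses, so the example assumes DeepSeek V4 Pro Max at 80.6% against the 87.6% listed for Opus 4.7, and it compares $3.48 with both ends of the $25-30 price range. The helper name `ratio_pct` is illustrative.

```python
def ratio_pct(numerator: float, denominator: float) -> float:
    """Express numerator as a percentage of denominator."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return 100.0 * numerator / denominator


# Assumption: these are the rows behind the "~92% of the performance" claim.
performance = ratio_pct(80.6, 87.6)

# $3.48 per million output tokens against each end of the $25-30 range.
cost_vs_30 = ratio_pct(3.48, 30.0)
cost_vs_25 = ratio_pct(3.48, 25.0)

if __name__ == "__main__":
    print(f"relative performance: {performance:.1f}%")
    print(f"relative cost: {cost_vs_30:.1f}% to {cost_vs_25:.1f}%")
```

Under these assumptions the cost ratio lands between roughly 12% and 14%, depending on which price it is compared against.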
NajaCoder V2 → V3 on SWE-bench: 74.2% → 75.4% (+1.2 pp). Significant architectural changes (GAN critic, Ralph Loop, PASS_TO_PASS) yielded only marginal Docker-verified improvement. The easy wins are taken. Each additional point now costs far more engineering effort than the last.
0% fully resolved across all models on the official leaderboard. This benchmark requires architectural reasoning: understanding how a binary's interface maps to source code structure. Current models can't do this without a scaffold that constrains the problem space. The 2-phase adapter (probe → code) is the key: separate discovery from implementation.
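The idea of separating discovery from implementation can be sketched as follows. This is not NajaCoder's adapter, whose internals are not described here. It is a hypothetical illustration in which a probe phase records how a reference program behaves, and a later phase scores a candidate rebuild against those recordings. The names `Observation`, `probe`, and `match_rate` are all assumptions made for the example.

```python
from __future__ import annotations

import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One recorded behavior of the reference program (hypothetical structure)."""
    args: tuple[str, ...]
    stdout: str
    exit_code: int


def run(command: list[str], args: tuple[str, ...]) -> Observation:
    """Run a command with extra arguments and capture stdout and exit code."""
    result = subprocess.run(
        [*command, *args], capture_output=True, text=True, timeout=30
    )
    return Observation(args, result.stdout, result.returncode)


def probe(reference: list[str], cases: list[tuple[str, ...]]) -> list[Observation]:
    """Phase 1 (discovery): record what the reference program does per case."""
    return [run(reference, args) for args in cases]


def match_rate(candidate: list[str], observations: list[Observation]) -> float:
    """Phase 2 feedback: fraction of recorded behaviors the candidate reproduces."""
    if not observations:
        return 0.0
    passed = sum(1 for obs in observations if run(candidate, obs.args) == obs)
    return passed / len(observations)


if __name__ == "__main__":
    # Stand-in "binary": a Python one-liner that upper-cases its first argument.
    reference = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
    candidate = [sys.executable, "-c", "import sys; print(sys.argv[1])"]
    recorded = probe(reference, [("abc",), ("XYZ",)])
    print(f"candidate reproduces {match_rate(candidate, recorded):.0%} of probes")
```

Keeping the recorded observations as plain data means the implementation phase never has to re-run the reference program, which is one way to keep the two phases independent.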