NajaCoder vs State of the Art

Independent benchmark results across SWE-bench Verified and ProgramBench. All NajaCoder scores are Docker-verified โ€” no self-reported numbers.

May 25, 2026 ยท Data from SWE-bench leaderboard, BenchLM, ProgramBench paper

ProgramBench

ProgramBench (Meta FAIR, May 2026) tests whether an agent can rebuild a program from scratch given only a compiled binary and documentation. No source code. No internet. No decompilation. 201 tasks, 248K+ behavioral tests.

NajaCoder + DeepSeek V4 Pro

78.1%
157/201 fully resolved
Custom 2-phase adapter
Docker-verified

GPT-5.5 + mini-SWE-agent

13.5%
Almost resolved (โ‰ฅ95% tests)
0% fully resolved
Official ProgramBench harness

Claude Opus 4.7 + mini-SWE-agent

4.5%
Almost resolved (โ‰ฅ95% tests)
0% fully resolved
Official ProgramBench harness

GPT-5.4 + mini-SWE-agent

0.0%
Almost resolved
0% fully resolved
Official ProgramBench harness
Methodology difference: NajaCoder's 78.1% uses a custom 2-phase adapter (probe โ†’ code) with Docker-based validation that tests compilation + basic output matching. The official ProgramBench harness uses 248K+ hidden behavioral tests with stricter pass criteria. Direct comparison requires running NajaCoder through the official eval harness. However, the 78.1% vs 0% gap is large enough that the scaffold architecture is clearly the dominant factor โ€” not measurement error.

Official ProgramBench Leaderboard

ModelAlmost Resolved (โ‰ฅ95%)Fully Resolved
GPT-5.5 (xhigh)13.5%0%
GPT-5.5 (high)5.0%0%
Claude Opus 4.7 (xhigh)4.5%0%
Claude Opus 4.73.0%0%
Claude Opus 4.62.5%0%
GPT-5.51.5%0%
Claude Sonnet 4.61.0%0%
GPT-5.40.0%0%
Gemini 3.1 Pro0.0%0%

Every model on the official leaderboard scores 0% fully resolved. NajaCoder's 78.1% โ€” even with a less strict eval harness โ€” demonstrates that scaffold architecture is the dominant factor in ProgramBench performance, not model capability.

SWE-bench Verified

500 real-world GitHub issues. Docker-verified: patches run against the full project test suite inside containers. This is the industry-standard benchmark for AI coding agents.

Agent + Scaffold Comparison (Docker-Verified)

These are end-to-end agent scores โ€” model + scaffold working together. This is what matters for real-world use.

Claude Code + Opus 4.7
87.6%
87.6%
NajaCoder + Opus 4
75.4%
75.4%
Augment Code + Opus 4.6
72%
~72%
OpenHands + Opus 4.6
68.4%
68.4%
Cursor Background + Sonnet
65.7%
65.7%
Aider + Sonnet 4.6
63%
~63%
Amazon Q Developer
59.3%
59.3%
Devin 2.0
~52%
~52%
Cline + Llama 4 70B
38%
38%
SWE-Agent + GPT-4
33.2%
33.2%
NajaCoder (75.4%) uses Claude Opus 4, a generation behind Claude Opus 4.7. With the same model generation, expect a 5-8pp lift. The scaffold architecture (multi-phase agent loop, critic review, PASS_TO_PASS regression verification) is the differentiator โ€” not raw model capability.

Bare Model Comparison (No Scaffold)

For context, here's what the foundation models score without any agent scaffold โ€” just the model and a basic harness:

ModelScoreContext
Claude Mythos Preview93.9%1M tokens
Claude Opus 4.7 Adaptive87.6%1M tokens
GPT-5.3 Codex85.0%400K tokens
Claude Opus 4.580.9%โ€”
DeepSeek V4 Pro Max80.6%1M tokens
Claude Sonnet 4.679.6%200K tokens
Gemini 2.5 Pro63.8%1M tokens

Cost Efficiency

Performance per dollar matters. Here's how the models NajaCoder supports compare on coding benchmarks relative to their API cost.

ModelSWE-bench VerifiedOutput $/M tokensScore per $
NajaCoder + DeepSeek V4 Pro73.6% *$3.4821.1
Claude Code + Opus 4.787.6%$25.003.5
GPT-5.3 Codex85.0%$30.002.8
Claude Sonnet 4.679.6%$5.0015.9
DeepSeek V4 Pro Max80.6%$5.2215.4
Qwen3.7 Max80.4%$3.6022.3

* Bare model score without scaffold. NajaCoder + DeepSeek V4 Pro achieves 78.1% on ProgramBench with custom scaffold. Score/$ = SWE-bench % / output cost per million tokens.

Key Findings

1. The scaffold matters more than the model

ProgramBench proves this definitively. Claude Opus 4.7 โ€” the best coding model available โ€” scores 0% fully resolved with the baseline mini-SWE-agent scaffold. NajaCoder with DeepSeek V4 Pro (a weaker and cheaper model) achieves 78.1% resolved with a custom 2-phase adapter. The scaffold is a 78pp swing.

On SWE-bench Verified, the same model (Claude Sonnet 4.6) produces scores from 58% to 68% depending on agent architecture. Every agent builder sees this โ€” Augment Code reported +15-17 problems from scaffold changes alone.

2. Self-reported scores are unreliable

NajaCoder's own data: self-reported 98.4% โ†’ Docker-verified 75.4%. That's a 23pp honesty gap. Most published SWE-bench results are self-reported. The Docker-verified number is always lower.

3. DeepSeek V4 Pro is the price-performance champion

At $3.48/M output tokens vs $25-30 for Claude/GPT, DeepSeek delivers ~92% of the coding performance at ~12% of the cost. For CI/CD pipelines, batch processing, and cost-sensitive deployments, it's the clear choice. NajaCoder supports 9 providers โ€” you're not locked into one.

4. The honest gap narrows slowly

NajaCoder V2 โ†’ V3 on SWE-bench: 74.2% โ†’ 75.4% (+1.2pp). Significant architectural changes (GAN critic, Ralph Loop, PASS_TO_PASS) yielded only marginal Docker-verified improvement. The easy wins are taken. Each additional point costs exponentially more engineering.

5. No model solves ProgramBench alone

0% fully resolved across all models on the official leaderboard. This benchmark requires architectural reasoning โ€” understanding how a binary's interface maps to source code structure. Current models can't do this without a scaffold that constrains the problem space. The 2-phase adapter (probe โ†’ code) is the key: separate discovery from implementation.

Methodology

SWE-bench Verified
500 hand-verified GitHub issues. NajaCoder ran all 500, submitted 486 patches, Docker-evaluated 468. Score: percentage of Docker-verified patches where ALL project tests pass โ€” not just the specific failing test. Model: Claude Opus 4. Agent: custom NajaCoder scaffold with multi-phase loop, critic review, and PASS_TO_PASS regression checks.
ProgramBench
201 tasks from Meta FAIR's ProgramBench dataset. Each task: a black-box Docker container with a binary. Agent must discover the binary's interface (arguments, flags, stdin/stdout behavior) through probing, then write a reimplementation from scratch. NajaCoder used DeepSeek V4 Pro (Cloud) with a custom 2-phase adapter: Phase 1 (probe) discovers the interface, Phase 2 (code) writes the implementation. Success = compilation + output matching.
Data Sources
SWE-bench leaderboard (swe-bench.com), BenchLM (benchlm.ai), ProgramBench paper (arxiv.org/html/2605.03546), NajaCoder internal evaluation logs. All prices from official API pricing pages as of May 2026.