# AI Coding Benchmarks
How do LLMs actually perform on coding tasks? Here are the benchmark scores that matter.
Scores are from official papers and third-party evaluations. Last updated March 2025.
## SWE-bench
Real-world GitHub issues benchmark. Tests ability to fix actual bugs in popular repos.
| Model | Score | Date |
|---|---|---|
| Claude 3.5 Sonnet | | 2024-10 |
| GPT-4o | | 2024-09 |
| DeepSeek V2 | | 2024-08 |
| Gemini 1.5 Pro | | 2024-09 |
| Code Llama 70B | | 2024-03 |
## HumanEval
Function-level code generation. 164 Python programming problems.
| Model | Score | Date |
|---|---|---|
| Claude 3.5 Sonnet | | 2024-10 |
| GPT-4o | | 2024-09 |
| Gemini 1.5 Pro | | 2024-09 |
| DeepSeek Coder V2 | | 2024-08 |
| Code Llama 70B | | 2024-03 |
| StarCoder2 15B | | 2024-02 |
## MBPP
Mostly Basic Python Problems: 974 crowd-sourced, entry-level Python programming problems.
| Model | Score | Date |
|---|---|---|
| Claude 3.5 Sonnet | | 2024-10 |
| GPT-4o | | 2024-09 |
| DeepSeek Coder V2 | | 2024-08 |
| Gemini 1.5 Pro | | 2024-09 |
| Code Llama 70B | | 2024-03 |
## Aider Polyglot
Multi-language benchmark by Aider. Tests code editing across Python, JS, Java, C++, Go, Rust.
| Model | Score | Date |
|---|---|---|
| Claude 3.5 Sonnet | | 2024-10 |
| GPT-4o | | 2024-09 |
| DeepSeek V2 | | 2024-08 |
| Gemini 1.5 Pro | | 2024-09 |
| Claude 3 Opus | | 2024-03 |
## Understanding the Benchmarks
### SWE-bench
The most practical benchmark. Tests whether an AI can fix real bugs reported in actual GitHub issues. Requires understanding large codebases, writing correct patches, and passing the repository's existing tests. A score of 49% means the model's patches resolved 49% of the issues it was evaluated on.
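To make the scoring concrete, here is a minimal sketch of how a SWE-bench-style "resolved" rate could be computed. The field names (`patch_applied`, `fail_to_pass`, `pass_to_pass`) are simplified assumptions for illustration, not the official harness schema:

```python
# Illustrative only: an issue counts as resolved when the model's patch
# applies cleanly, the previously failing tests now pass, and the
# previously passing tests still pass.

def resolved_rate(results: list[dict]) -> float:
    """results: one dict per issue (hypothetical field names)."""
    resolved = sum(
        1 for r in results
        if r["patch_applied"] and r["fail_to_pass"] and r["pass_to_pass"]
    )
    return resolved / len(results)

outcomes = [
    {"patch_applied": True,  "fail_to_pass": True,  "pass_to_pass": True},
    {"patch_applied": True,  "fail_to_pass": False, "pass_to_pass": True},
    {"patch_applied": False, "fail_to_pass": False, "pass_to_pass": False},
    {"patch_applied": True,  "fail_to_pass": True,  "pass_to_pass": True},
]
print(f"{resolved_rate(outcomes):.0%}")  # → 50%
```

Note that a patch which fixes the bug but breaks an unrelated test counts as a failure, which is part of why SWE-bench scores are so much lower than HumanEval scores.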
### HumanEval
Tests basic code generation ability. 164 Python function completion problems with test cases. Easy for modern LLMs — scores above 90% are common. Good for comparing speed and cost, less useful for comparing quality at the frontier.
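The structure of a HumanEval item is simple: a prompt (signature plus docstring), a model completion, and a hidden checker. Below is an invented problem in that style, not an actual benchmark item:

```python
# Illustrative HumanEval-style problem: the model sees PROMPT and must
# emit the function body; a checker runs assertions against the result.

PROMPT = '''def add_elements(xs, ys):
    """Return the element-wise sum of two equal-length lists."""
'''

# A model completion for the prompt above:
COMPLETION = "    return [a + b for a, b in zip(xs, ys)]\n"

def check(candidate) -> bool:
    try:
        assert candidate([1, 2], [3, 4]) == [4, 6]
        assert candidate([], []) == []
        return True
    except Exception:
        return False

namespace: dict = {}
exec(PROMPT + COMPLETION, namespace)     # run prompt + completion together
print(check(namespace["add_elements"]))  # → True
```

A model's pass@1 score is simply the fraction of the 164 problems whose first completion passes its checker.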
### MBPP
974 Python problems crowd-sourced from non-expert contributors. Broader than HumanEval but still function-level. Good for measuring consistent quality across many different problem types.
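Each MBPP item pairs a one-sentence task description with three assert-statement tests. The sketch below shows that shape with an invented example (not a real benchmark entry):

```python
# Illustrative MBPP-style item: a short task description plus the three
# assert tests MBPP attaches to each problem.

task = "Write a function to return the larger of two numbers."

# A model-generated solution for the task:
solution = "def max_of_two(a, b):\n    return a if a >= b else b\n"

tests = [
    "assert max_of_two(3, 7) == 7",
    "assert max_of_two(-1, -5) == -1",
    "assert max_of_two(4, 4) == 4",
]

ns: dict = {}
exec(solution, ns)
for t in tests:
    exec(t, ns)  # an AssertionError here would mean the solution failed
print("passed", len(tests), "tests")  # → passed 3 tests
```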
### Aider Polyglot
Tests code editing across 6 languages (not just generation). More realistic than HumanEval because it tests the model's ability to modify existing code, which is what developers actually do most of the time.
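As a rough illustration of edit-based evaluation, the sketch below applies a model-proposed search/replace edit to existing code and re-runs it. This is loosely modeled on Aider's search/replace edit format, not the actual benchmark harness:

```python
# Simplified sketch of edit-based evaluation: the model must modify
# existing code rather than generate it from scratch, and scoring
# re-executes the edited file.

original = '''def greet(name):
    return "Hello, " + name
'''

# A model-proposed edit: exact text to find, and its replacement.
search = 'return "Hello, " + name'
replace = 'return f"Hello, {name}!"'

assert search in original, "edit does not match the current file"
edited = original.replace(search, replace)

ns: dict = {}
exec(edited, ns)
print(ns["greet"]("Ada"))  # → Hello, Ada!
```

An edit that doesn't exactly match the existing source fails before any tests run, which is why edit benchmarks punish sloppy context handling much harder than generation benchmarks do.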
## Key Takeaways
- Claude 3.5 Sonnet leads on most coding benchmarks — especially SWE-bench (real-world tasks) where it has a significant lead.
- DeepSeek V2 is remarkably competitive at a fraction of the cost. Best value for coding at scale.
- HumanEval scores are saturating — top models all score 90%+. SWE-bench is more meaningful for comparing frontier models.
- Open-source models are catching up but still trail by 20-30 points on the hardest benchmarks.
- Benchmarks don't tell the whole story — instruction following, context utilization, and tool use matter as much as raw coding ability.