AI Coding Benchmarks

How do LLMs actually perform on coding tasks? Here are the benchmark scores that matter.

Scores are from official papers and third-party evaluations. Last updated March 2025.

SWE-bench

Real-world GitHub issues benchmark. Tests ability to fix actual bugs in popular repos.

Model               Score   Date
Claude 3.5 Sonnet   49.0%   2024-10
GPT-4o              33.2%   2024-09
DeepSeek V2         27.6%   2024-08
Gemini 1.5 Pro      25.1%   2024-09
Code Llama 70B       8.3%   2024-03

HumanEval

Function-level code generation. 164 Python programming problems.

Model               Score   Date
Claude 3.5 Sonnet   92.0%   2024-10
GPT-4o              90.2%   2024-09
Gemini 1.5 Pro      84.1%   2024-09
DeepSeek Coder V2   90.2%   2024-08
Code Llama 70B      67.8%   2024-03
StarCoder2 15B      46.3%   2024-02

MBPP

Mostly Basic Python Problems. 974 crowd-sourced Python problems.

Model               Score   Date
Claude 3.5 Sonnet   90.5%   2024-10
GPT-4o              87.8%   2024-09
DeepSeek Coder V2   89.4%   2024-08
Gemini 1.5 Pro      82.3%   2024-09
Code Llama 70B      62.4%   2024-03

Aider Polyglot

Multi-language benchmark from the Aider project. Tests code editing across Python, JavaScript, Java, C++, Go, and Rust.

Model               Score   Date
Claude 3.5 Sonnet   73.7%   2024-10
GPT-4o              66.0%   2024-09
DeepSeek V2         59.2%   2024-08
Gemini 1.5 Pro      56.4%   2024-09
Claude 3 Opus       68.4%   2024-03

Understanding the Benchmarks

SWE-bench

The most practical benchmark. Tests whether an AI can fix real bugs from actual GitHub issues. Requires understanding large codebases, writing correct patches, and passing existing tests. A score of 49% means the model produced a working fix for 49% of the issues in the test set.
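To make the scoring concrete, here is a minimal sketch of how a SWE-bench-style resolved rate could be computed. This is illustrative, not the official harness: an issue counts as resolved only if the model's patch applies cleanly and the previously failing tests now pass.

```python
# Illustrative sketch of SWE-bench-style scoring (not the official harness).
# An issue is "resolved" only if the patch applied AND the tests now pass.

def resolved_rate(results):
    """results: list of dicts with 'patch_applied' and 'tests_passed' flags."""
    resolved = sum(1 for r in results if r["patch_applied"] and r["tests_passed"])
    return resolved / len(results)

runs = [
    {"patch_applied": True,  "tests_passed": True},   # resolved
    {"patch_applied": True,  "tests_passed": False},  # patch applied, tests still fail
    {"patch_applied": False, "tests_passed": False},  # patch didn't apply at all
]
print(f"{resolved_rate(runs):.1%}")  # 33.3%
```

Note that a patch that applies but does not make the tests pass scores zero, which is why SWE-bench numbers are so much lower than HumanEval numbers.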

HumanEval

Tests basic code generation ability. 164 Python function completion problems with test cases. Easy for modern LLMs — scores above 90% are common. Good for comparing speed and cost, less useful for comparing quality at the frontier.
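A typical HumanEval task gives the model a function signature plus a docstring, and hidden unit tests decide pass/fail. The sketch below is modeled on the benchmark's first problem; the completion shown is the kind of output a model would be scored on.

```python
# A HumanEval-style task: the model sees the signature and docstring,
# and must generate the body. (Modeled on the benchmark's first problem.)

def has_close_elements(numbers, threshold):
    """Return True if any two numbers in the list are closer to each
    other than the given threshold."""
    # --- model-generated completion below ---
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Hidden unit tests then decide pass/fail:
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
assert has_close_elements([1.0, 2.8, 3.0], 0.3) is True
```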

MBPP

974 Python problems sourced from crowd workers. Broader than HumanEval but still function-level. Good for measuring consistent quality across many different problem types.
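MBPP tasks look slightly different from HumanEval: each is a one-line natural-language prompt plus a few assert-based tests the completion must satisfy. The example below is illustrative of the format.

```python
# An MBPP-style task (illustrative of the format): a short natural-language
# prompt plus assert-based tests.

# Prompt: "Write a function to find the shared elements from two lists."
def similar_elements(a, b):
    # --- model-generated completion ---
    return set(a) & set(b)

# The benchmark's asserts decide pass/fail:
assert similar_elements([3, 4, 5, 6], [5, 7, 4, 10]) == {4, 5}
assert similar_elements([1, 2, 3, 4], [5, 4, 3, 7]) == {3, 4}
```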

Aider Polyglot

Tests code editing across six languages (not just generation). More realistic than HumanEval because it measures the model's ability to modify existing code, which is what developers actually do most of the time.
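To illustrate what "editing" means here, the sketch below applies a single search/replace edit of the kind an editing benchmark might accept from a model, failing loudly if the search text doesn't match the file verbatim. This is a hedged sketch of the general idea, not Aider's actual edit format or harness.

```python
# Hedged sketch of benchmark-style code *editing* (not Aider's real format):
# the model proposes a search/replace pair and the harness applies it.

def apply_edit(source, search, replace):
    """Apply one search/replace edit; raise if the search text
    isn't found verbatim, as an editing harness would."""
    if search not in source:
        raise ValueError("search block not found in source")
    return source.replace(search, replace, 1)

original = "def greet(name):\n    return 'Hi ' + name\n"
edited = apply_edit(
    original,
    "return 'Hi ' + name",
    "return f'Hello, {name}!'",
)
print(edited)
```

Requiring the search text to match exactly is what makes editing benchmarks harder than generation: a model that hallucinates even one character of the existing code produces an edit that cannot be applied.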

Key Takeaways

  • Claude 3.5 Sonnet leads on most coding benchmarks — especially SWE-bench (real-world tasks) where it has a significant lead.
  • DeepSeek V2 is remarkably competitive at a fraction of the cost. Best value for coding at scale.
  • HumanEval scores are saturating — top models all score 90%+. SWE-bench is more meaningful for comparing frontier models.
  • Open-source models are catching up but still trail by 20-30 percentage points on the hardest benchmarks.
  • Benchmarks don't tell the whole story — instruction following, context utilization, and tool use matter as much as raw coding ability.