AI Coding Benchmarks

How do LLMs actually perform on coding tasks? Here are the benchmark scores that matter.

Scores are from official papers and third-party evaluations. Last updated March 2025.

SWE-bench

Real-world GitHub issues benchmark. Tests ability to fix actual bugs in popular repos.

Model               Score   Date
Claude 3.5 Sonnet   49.0%   2024-10
GPT-4o              33.2%   2024-09
DeepSeek V2         27.6%   2024-08
Gemini 1.5 Pro      25.1%   2024-09
Code Llama 70B       8.3%   2024-03

HumanEval

Function-level code generation. 164 Python programming problems.

Model               Score   Date
Claude 3.5 Sonnet   92.0%   2024-10
GPT-4o              90.2%   2024-09
Gemini 1.5 Pro      84.1%   2024-09
DeepSeek Coder V2   90.2%   2024-08
Code Llama 70B      67.8%   2024-03
StarCoder2 15B      46.3%   2024-02

MBPP

Mostly Basic Python Problems. 974 crowd-sourced Python problems.

Model               Score   Date
Claude 3.5 Sonnet   90.5%   2024-10
GPT-4o              87.8%   2024-09
DeepSeek Coder V2   89.4%   2024-08
Gemini 1.5 Pro      82.3%   2024-09
Code Llama 70B      62.4%   2024-03

Aider Polyglot

Multi-language benchmark from the Aider project. Tests code editing across Python, JavaScript, Java, C++, Go, and Rust.

Model               Score   Date
Claude 3.5 Sonnet   73.7%   2024-10
GPT-4o              66.0%   2024-09
DeepSeek V2         59.2%   2024-08
Gemini 1.5 Pro      56.4%   2024-09
Claude 3 Opus       68.4%   2024-03

Understanding the Benchmarks

SWE-bench

The most practical benchmark. Tests whether an AI can fix real bugs from actual GitHub issues. Requires understanding large codebases, writing correct patches, and passing existing tests. A score of 49% means the model produced a working fix for 49% of the issues in the test set.
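To make the scoring concrete, here is a minimal sketch of how a SWE-bench-style resolved rate could be computed. This is illustrative, not the official harness: an issue counts as resolved only if the model's patch applies cleanly and the previously failing tests now pass.

```python
# Illustrative sketch of SWE-bench-style scoring (not the official harness).
# An issue is "resolved" only if the patch applied AND the tests now pass.

def resolved_rate(results):
    """results: list of dicts with 'patch_applied' and 'tests_passed' flags."""
    resolved = sum(1 for r in results if r["patch_applied"] and r["tests_passed"])
    return resolved / len(results)

runs = [
    {"patch_applied": True,  "tests_passed": True},   # resolved
    {"patch_applied": True,  "tests_passed": False},  # patch applied, tests still fail
    {"patch_applied": False, "tests_passed": False},  # patch didn't apply at all
]
print(f"{resolved_rate(runs):.1%}")  # 33.3%
```

Note that a patch that applies but does not make the tests pass scores zero, which is why SWE-bench numbers are so much lower than HumanEval numbers.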

HumanEval

Tests basic code generation ability. 164 Python function completion problems with test cases. Easy for modern LLMs — scores above 90% are common. Good for comparing speed and cost, less useful for comparing quality at the frontier.
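A typical HumanEval task gives the model a function signature plus a docstring, and hidden unit tests decide pass/fail. The sketch below is modeled on the benchmark's first problem; the completion shown is the kind of output a model would be scored on.

```python
# A HumanEval-style task: the model sees the signature and docstring,
# and must generate the body. (Modeled on the benchmark's first problem.)

def has_close_elements(numbers, threshold):
    """Return True if any two numbers in the list are closer to each
    other than the given threshold."""
    # --- model-generated completion below ---
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Hidden unit tests then decide pass/fail:
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
assert has_close_elements([1.0, 2.8, 3.0], 0.3) is True
```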

MBPP

974 Python problems sourced from crowd workers. Broader than HumanEval but still function-level. Good for measuring consistent quality across many different problem types.
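MBPP tasks look slightly different from HumanEval: each is a one-line natural-language prompt plus a few assert-based tests the completion must satisfy. The example below is illustrative of the format.

```python
# An MBPP-style task (illustrative of the format): a short natural-language
# prompt plus assert-based tests.

# Prompt: "Write a function to find the shared elements from two lists."
def similar_elements(a, b):
    # --- model-generated completion ---
    return set(a) & set(b)

# The benchmark's asserts decide pass/fail:
assert similar_elements([3, 4, 5, 6], [5, 7, 4, 10]) == {4, 5}
assert similar_elements([1, 2, 3, 4], [5, 4, 3, 7]) == {3, 4}
```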

Aider Polyglot

Tests code editing across six languages (not just generation). More realistic than HumanEval because it measures the model's ability to modify existing code, which is what developers actually do most of the time.
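To illustrate what "editing" means here, the sketch below applies a single search/replace edit of the kind an editing benchmark might accept from a model, failing loudly if the search text doesn't match the file verbatim. This is a hedged sketch of the general idea, not Aider's actual edit format or harness.

```python
# Hedged sketch of benchmark-style code *editing* (not Aider's real format):
# the model proposes a search/replace pair and the harness applies it.

def apply_edit(source, search, replace):
    """Apply one search/replace edit; raise if the search text
    isn't found verbatim, as an editing harness would."""
    if search not in source:
        raise ValueError("search block not found in source")
    return source.replace(search, replace, 1)

original = "def greet(name):\n    return 'Hi ' + name\n"
edited = apply_edit(
    original,
    "return 'Hi ' + name",
    "return f'Hello, {name}!'",
)
print(edited)
```

Requiring the search text to match exactly is what makes editing benchmarks harder than generation: a model that hallucinates even one character of the existing code produces an edit that cannot be applied.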

Key Takeaways

  • Claude 3.5 Sonnet leads on most coding benchmarks — especially SWE-bench (real-world tasks) where it has a significant lead.
  • DeepSeek V2 is remarkably competitive at a fraction of the cost. Best value for coding at scale.
  • HumanEval scores are saturating — top models all score 90%+. SWE-bench is more meaningful for comparing frontier models.
  • Open-source models are catching up but still trail by 20-30 percentage points on the hardest benchmarks.
  • Benchmarks don't tell the whole story — instruction following, context utilization, and tool use matter as much as raw coding ability.