LLM Leaderboard
Benchmark scores for leading large language models. Compare performance across coding, reasoning, and creative tasks.
| Rank | Model | Provider | Context | Platform Price (USD / 1M tokens) | Official Price (USD / 1M tokens) | Code Arena | Chat Arena | GPQA | AIME 2025 | SWE-Bench | ARC-AGI v2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 1.0M | $4.00 / $20.00 | $5.00 / $25.00 | 2,003 | 1,476 | 91.3% | 99.8% | 80.8% | 68.8% |
| 2 | Gemini 3.1 Pro | Google | 1.0M | $2.00 / $10.00 | $2.00 / $12.00 | 1,859 | 1,222 | 94.3% | - | 80.6% | 77.1% |
| 3 | GLM-5 | Zhipu AI | 200K | $0.95 / $2.85 | $1.00 / $3.20 | 1,594 | 1,179 | - | - | 77.8% | - |
| 4 | Claude Opus 4.5 | Anthropic | 200K | $4.00 / $20.00 | $5.00 / $25.00 | 1,580 | 1,345 | 87.0% | - | 80.9% | 37.6% |
| 5 | Gemini 3 Pro | Google | - | $2.00 / $10.00 | $2.00 / $12.00 | 1,579 | 1,045 | 91.9% | 100.0% | 76.2% | 31.1% |
| 6 | Gemini 3 Flash | Google | 1.0M | $0.40 / $2.50 | $0.50 / $3.00 | 1,578 | 1,172 | 90.4% | 99.7% | 78.0% | 33.6% |
| 7 | GPT-5.2 | OpenAI | 400K | - | $1.75 / $14.00 | 1,505 | 1,172 | 92.4% | 100.0% | 80.0% | 52.9% |
| 8 | Kimi K2.5 | Moonshot AI | 262K | $0.50 / $2.80 | $0.60 / $3.00 | 1,448 | 988 | 87.6% | 96.1% | 76.8% | - |
| 9 | GPT-5.4 | OpenAI | 1.0M | $2.00 / $10.00 | $2.50 / $15.00 | 1,437 | 1,146 | 92.8% | - | - | 73.3% |
| 10 | Claude Sonnet 4.6 | Anthropic | 200K | $2.40 / $12.00 | $3.00 / $15.00 | 1,380 | 941 | 89.9% | - | 79.6% | 58.3% |
| 11 | GPT-5 High | OpenAI | 400K | - | $1.25 / $10.00 | 1,301 | 1,037 | 87.3% | 94.6% | - | - |
| 12 | Qwen3.5-397B-A17B | Alibaba Cloud / Qwen Team | 262K | - | $0.60 / $3.60 | 1,214 | 1,067 | 88.4% | - | 76.4% | - |
| 13 | GLM-4.6 | zai-org | 131K | $0.45 / $1.80 | $0.55 / $2.19 | 1,143 | 1,079 | 81.0% | 93.9% | 68.0% | - |
| 14 | GPT-5.2 Codex | OpenAI | 400K | $1.20 / $9.60 | $1.75 / $14.00 | 1,139 | 802 | - | - | - | - |
| 15 | GPT-5.1 High | OpenAI | 400K | - | $1.25 / $10.00 | 1,117 | 1,132 | 88.1% | 99.6% | - | - |
| 16 | Claude Sonnet 4.5 | Anthropic | 200K | $2.40 / $12.00 | $3.00 / $15.00 | 1,103 | 1,294 | 83.4% | 87.0% | - | - |
| 17 | GPT-5 Medium | OpenAI | 400K | - | $1.25 / $10.00 | 1,098 | 1,026 | 88.1% | 88.9% | - | - |
| 18 | GPT-5.3 Codex | OpenAI | 400K | $1.20 / $9.60 | $1.75 / $14.00 | 1,089 | 650 | - | - | - | - |
| 19 | GPT-5.1 | OpenAI | 400K | - | $1.25 / $10.00 | 1,079 | 1,010 | 88.1% | 94.0% | 76.3% | - |
| 20 | GLM-4.7 | Zhipu AI | 205K | $0.60 / $2.20 | $0.60 / $2.20 | 1,030 | 1,017 | 85.7% | 95.7% | 73.8% | - |
Showing 1 to 20 of 275 models
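The gap between the Platform Price and Official Price columns can be worked out with simple arithmetic. The sketch below is illustrative only: the price pairs are copied from four rows of the table, while the `discount_pct` and `discounts` helpers are hypothetical names introduced here. The source does not label the two prices in each cell (input and output token rates is a plausible reading, but that is an assumption), so the code treats them simply as a first and second price.

```python
# Prices in USD per 1M tokens, copied from the table as (first, second) pairs.
# The meaning of each pair is not labeled in the source; this is an assumption-free copy.
PRICES = {
    "Claude Opus 4.6": {"platform": (4.00, 20.00), "official": (5.00, 25.00)},
    "Gemini 3.1 Pro": {"platform": (2.00, 10.00), "official": (2.00, 12.00)},
    "GLM-5": {"platform": (0.95, 2.85), "official": (1.00, 3.20)},
    "GLM-4.7": {"platform": (0.60, 2.20), "official": (0.60, 2.20)},
}


def discount_pct(platform: float, official: float) -> float:
    """Percentage saved on the platform price relative to the official price."""
    return round((official - platform) / official * 100, 1)


def discounts(entry: dict) -> tuple:
    """Discount for each price in the pair, in the same order as the table cell."""
    return tuple(
        discount_pct(p, o) for p, o in zip(entry["platform"], entry["official"])
    )


if __name__ == "__main__":
    for model, entry in PRICES.items():
        first, second = discounts(entry)
        print(f"{model}: {first}% / {second}%")
```

Rows whose Platform Price is "-" have no platform listing and would need to be skipped before applying the same calculation.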
Metric Definitions
LLM
- Code Arena: Average score across coding arenas based on human votes.
- Chat Arena: Human preference score from blind comparisons.
- GPQA: Graduate-level science questions requiring expert knowledge.
- AIME 2025: Recent math competition problems.
- SWE-Bench: Real GitHub issues requiring code changes.
- ARC-AGI v2: Abstract reasoning problems.
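The arena scores come from human votes in head-to-head comparisons, but the source does not say how votes are turned into a number. One common way to do this is an Elo-style rating update, sketched below purely as an illustration. The choice of Elo, the scale constant 400, the step size `k=32`, and the function names are all assumptions, not the leaderboard's documented method.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one blind comparison (assumed k-factor of 32)."""
    actual = 1.0 if a_won else 0.0
    delta = k * (actual - expected_score(rating_a, rating_b))
    # Points gained by one model are lost by the other, so the total is conserved.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two hypothetical models start level; model A wins one vote.
    print(update(1000.0, 1000.0, a_won=True))
```

Under this scheme an upset (a lower-rated model winning a vote) moves the ratings more than an expected result, which is why a score gap reflects accumulated preference rather than a single comparison.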
Image
- Image Gen: Human preference score for text-to-image generation.
- Image Edit: Human preference score for image editing and transformation.
Video
- Text to Video: Human preference score for text-to-video generation.
- Image to Video: Human preference score for image-to-video generation.
- Video to Video: Human preference score for video editing capabilities.
TTS
- TTS: Human preference score for text-to-speech quality.
STT
- STT: Human preference score for transcription accuracy.