# LLM Leaderboard
Comprehensive benchmark scores for top Large Language Models. Compare performance across Coding, Reasoning, and Creative tasks.
| # | Model | Developer | Context | Platform Price (in / out, per 1M tokens) | Official Price (in / out, per 1M tokens) | Code Arena | Chat Arena | GPQA | AIME 2025 | SWE-Bench | ARC-AGI v2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLM-5 | Zhipu AI | - | $0.95 / $2.85 | $1.00 / $3.20 | 0 | 0 | - | - | 77.8% | - |
| 2 | Phi 4 | Microsoft | - | - | - | 0 | 0 | 56.1% | - | - | - |
| 3 | Grok-2 | xAI | - | - | - | 0 | 0 | 56.0% | - | - | - |
| 4 | Grok-3 | xAI | - | $3.00 / $15.00 | $3.00 / $15.00 | 0 | 0 | 84.6% | 93.3% | - | - |
| 5 | Grok-4 | xAI | - | $2.00 / $10.00 | - | 0 | 0 | 87.5% | 91.7% | - | 15.9% |
| 6 | o1-pro | OpenAI | - | - | - | 0 | 0 | 79.0% | - | - | - |
| 7 | GLM-4.5 | Zhipu AI | - | $0.35 / $1.50 | - | 0 | 0 | 79.1% | - | 64.2% | - |
| 8 | GLM-4.6 | Zhipu AI | - | $0.45 / $1.80 | $0.55 / $2.19 | 0 | 0 | 81.0% | 93.9% | 68.0% | - |
| 9 | GLM-4.7 | Zhipu AI | - | $0.60 / $2.20 | $0.60 / $2.20 | 0 | 0 | 85.7% | 95.7% | 73.8% | - |
| 10 | GLM-5.1 | Zhipu AI | - | $1.40 / $4.20 | $1.40 / $4.40 | 0 | 0 | 86.2% | - | - | - |
| 11 | GPT-4.5 | OpenAI | - | - | - | 0 | 0 | 69.5% | - | 38.0% | - |
| 12 | GPT-5.4 | OpenAI | - | $2.00 / $10.00 | $2.50 / $15.00 | 0 | 0 | 92.8% | - | - | 73.3% |
| 13 | GPT-5.5 | OpenAI | - | $5.00 / $30.00 | $5.00 / $30.00 | 0 | 0 | 93.6% | - | - | 85.0% |
| 14 | o1-mini | OpenAI | - | - | - | 0 | 0 | 60.0% | - | - | - |
| 15 | o3-mini | OpenAI | - | - | - | 0 | 0 | 77.2% | - | 49.3% | - |
| 16 | o4-mini | OpenAI | - | - | - | 0 | 0 | 81.4% | 92.7% | 68.1% | - |
| 17 | QwQ-32B | Alibaba Cloud / Qwen Team | - | - | - | 0 | 0 | 65.2% | - | - | - |
| 18 | GLM-4.5V | Zhipu AI | - | - | - | 0 | 0 | - | - | - | - |
| 19 | Grok-1.5 | xAI | - | - | - | 0 | 0 | 35.9% | - | - | - |
| 20 | Grok 4.3 | xAI | - | - | $1.25 / $2.50 | 0 | 0 | - | - | - | - |
*Showing 1 to 20 of 298 models.*
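The prices above are quoted per 1M tokens, split into input and output rates, so the cost of a single request is a weighted sum of the two. A minimal sketch of that arithmetic (the `request_cost` function is illustrative, not a provider API; the rates used below are the GLM-4.6 platform prices from the table):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """USD cost of one request, given token counts and $/1M-token rates."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

# GLM-4.6 platform price: $0.45 input / $1.80 output per 1M tokens.
# 50k input tokens + 10k output tokens:
cost = request_cost(50_000, 10_000, 0.45, 1.80)
print(f"${cost:.4f}")  # prints "$0.0405"
```

Note that output tokens usually dominate cost despite being fewer, since output rates run 3x to 6x the input rate across the models listed.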
## Metric Definitions

### LLM
- **Code Arena**: Average score across coding arenas, based on human votes.
- **Chat Arena**: Human preference score from blind comparisons.
- **GPQA**: Graduate-level science questions requiring expert knowledge.
- **AIME 2025**: Recent math competition problems.
- **SWE-Bench**: Real GitHub issues requiring code changes.
- **ARC-AGI v2**: Abstract reasoning problems.

### Image
- **Image Gen**: Human preference score for text-to-image generation.
- **Image Edit**: Human preference score for image editing and transformation.

### Video
- **Text to Video**: Human preference score for text-to-video generation.
- **Image to Video**: Human preference score for image-to-video generation.
- **Video to Video**: Human preference score for video editing capabilities.

### TTS
- **TTS**: Human preference score for text-to-speech quality.

### STT
- **STT**: Human preference score for transcription accuracy.
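Arena-style metrics like the ones above turn blind pairwise human votes into a single ranking score, typically via an Elo-style rating update (this leaderboard does not document its exact method, so the following is a generic illustrative sketch, not the site's actual formula):

```python
def elo_update(r_a: float, r_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated Elo ratings for models A and B after one blind vote.

    expected_a is A's predicted win probability given the rating gap;
    k controls how far a single vote moves the ratings.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start at equal ratings; A wins one comparison.
a, b = elo_update(1000.0, 1000.0, True)
print(a, b)  # prints "1016.0 984.0"
```

Because the update is zero-sum and weighted by surprise, an upset win over a much higher-rated model moves the ratings far more than a win over a peer.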