LLM Leaderboard

Comprehensive benchmark scores for top Large Language Models. Compare performance across Coding, Reasoning, and Creative tasks.

| # | Model (provider) | Context | Platform Price (per 1M tokens) | Official Price (per 1M tokens) | Code Arena | Chat Arena | GPQA | AIME 2025 | SWE-Bench | ARC-AGI v2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLM-5 (Zhipu AI) | - | $0.95, $2.85 | $1.00, $3.20 | 0 | 0 | - | - | 77.8% | - |
| 2 | Phi 4 (Microsoft) | - | - | - | 0 | 0 | 56.1% | - | - | - |
| 3 | Grok-2 (xAI) | - | - | - | 0 | 0 | 56.0% | - | - | - |
| 4 |  | - | $3.00, $15.00 | $3.00, $15.00 | 0 | 0 | 84.6% | 93.3% | - | - |
| 5 |  | - | $2.00, $10.00 | - | 0 | 0 | 87.5% | 91.7% | - | 15.9% |
| 6 | o1-pro (OpenAI) | - | - | - | 0 | 0 | 79.0% | - | - | - |
| 7 |  | - | $0.35, $1.50 | - | 0 | 0 | 79.1% | - | 64.2% | - |
| 8 |  | - | $0.45, $1.80 | $0.55, $2.19 | 0 | 0 | 81.0% | 93.9% | 68.0% | - |
| 9 |  | - | $0.60, $2.20 | $0.60, $2.20 | 0 | 0 | 85.7% | 95.7% | 73.8% | - |
| 10 |  | - | $1.40, $4.20 | $1.40, $4.40 | 0 | 0 | 86.2% | - | - | - |
| 11 | GPT-4.5 (OpenAI) | - | - | - | 0 | 0 | 69.5% | - | 38.0% | - |
| 12 |  | - | $2.00, $10.00 | $2.50, $15.00 | 0 | 0 | 92.8% | - | - | 73.3% |
| 13 |  | - | $5.00, $30.00 | $5.00, $30.00 | 0 | 0 | 93.6% | - | - | 85.0% |
| 14 | o1-mini (OpenAI) | - | - | - | 0 | 0 | 60.0% | - | - | - |
| 15 | o3-mini (OpenAI) | - | - | - | 0 | 0 | 77.2% | - | 49.3% | - |
| 16 | o4-mini (OpenAI) | - | - | - | 0 | 0 | 81.4% | 92.7% | 68.1% | - |
| 17 | QwQ-32B (Alibaba Cloud / Qwen Team) | - | - | - | 0 | 0 | 65.2% | - | - | - |
| 18 | GLM-4.5V (Zhipu AI) | - | - | - | 0 | 0 | - | - | - | - |
| 19 | Grok-1.5 (xAI) | - | - | - | 0 | 0 | 35.9% | - | - | - |
| 20 | Grok 4.3 (xAI) | - | - | $1.25, $2.50 | 0 | 0 | - | - | - | - |
Showing 1 to 20 of 298 models
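
The Platform Price and Official Price columns can be compared directly. The following is a minimal sketch, assuming the two rates listed in each price cell line up one-to-one between the two columns; the page does not say what each rate covers, and the helper name `discount_pct` is hypothetical. It uses the GLM-5 row as input.

```python
from typing import Optional


def discount_pct(platform: float, official: Optional[float]) -> Optional[float]:
    """Percent saved by paying the platform price instead of the official one.

    Returns None when no official price is listed (shown as "-" in the table).
    """
    if official is None or official == 0:
        return None
    return round((official - platform) / official * 100, 1)


# GLM-5 row, USD per 1M tokens, in the order the two rates appear in each cell.
# Assumption: the first platform rate corresponds to the first official rate,
# and likewise for the second.
glm5_platform = (0.95, 2.85)
glm5_official = (1.00, 3.20)

for platform, official in zip(glm5_platform, glm5_official):
    print(f"${platform:.2f} vs ${official:.2f}: {discount_pct(platform, official)}% lower")
```

Rows that list only one of the two prices have nothing to compare against, which is why the helper returns `None` rather than a number in that case.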

Metric Definitions

LLM

Code Arena
Average score across coding arenas based on human votes.
Chat Arena
Human preference score from blind comparisons.
GPQA
Graduate-level science questions requiring expert knowledge.
AIME 2025
Problems from the 2025 American Invitational Mathematics Examination (AIME), a math competition.
SWE-Bench
Real GitHub issues requiring code changes.
ARC-AGI v2
Abstract reasoning problems.
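
The two arena metrics are described as human preference scores built from votes and blind comparisons. The page does not state how votes are converted into a score. One common approach is an Elo-style rating, sketched below purely as an illustration; the starting rating of 1000, the K-factor of 32, and the model names are all assumptions, not values from this leaderboard.

```python
from typing import Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, a_won: bool,
           k: float = 32.0) -> Tuple[float, float]:
    """Return the new (A, B) ratings after one blind comparison."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical models and votes; each vote is (winner, loser).
ratings: Dict[str, float] = {"model_x": 1000.0, "model_y": 1000.0}
votes: List[Tuple[str, str]] = [
    ("model_x", "model_y"),
    ("model_x", "model_y"),
    ("model_y", "model_x"),
]

for winner, loser in votes:
    ratings[winner], ratings[loser] = update(ratings[winner], ratings[loser], a_won=True)

print(ratings)
```

Each vote moves rating points from the loser to the winner, so the total stays constant, and an upset against a higher-rated model moves more points than an expected win.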

Image

IMAGE GEN
Human preference score for text-to-image generation.
IMAGE EDIT
Human preference score for image editing and transformation.

Video

Text to Video
Human preference score for text-to-video generation.
Image to Video
Human preference score for image-to-video generation.
Video to Video
Human preference score for video editing capabilities.

TTS

TTS
Human preference score for text-to-speech quality.

STT

STT
Human preference score for transcription accuracy.