LLM Leaderboard

Comprehensive benchmark scores for top Large Language Models. Compare performance across Coding, Reasoning, and Creative tasks.

| # | Model | Context | Platform Price (per 1M Tokens) | Official Price (per 1M Tokens) | Code Arena | Chat Arena | GPQA | AIME 2025 | SWE-Bench | ARC-AGI v2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (name missing) | 1.0M | $2.00 / $10.00 | $2.00 / $12.00 | 2,084 | 1,222 | 94.3% | - | 80.6% | 77.1% |
| 2 | (name missing) | 1.0M | $4.00 / $20.00 | $5.00 / $25.00 | 2,018 | 1,491 | 91.3% | 99.8% | 80.8% | 68.8% |
| 3 | (name missing) | 1.0M | $5.00 / $25.00 | $5.00 / $25.00 | 1,774 | 358 | 94.2% | - | 87.6% | - |
| 4 | (name missing) | 1.0M | $2.00 / $10.00 | $2.50 / $15.00 | 1,743 | 1,146 | 92.8% | - | - | 73.3% |
| 5 | (name missing) | 1.0M | $0.40 / $2.50 | $0.50 / $3.00 | 1,704 | 1,143 | 90.4% | 99.7% | 78.0% | 33.6% |
| 6 | (name missing) | 200K | $4.00 / $20.00 | $5.00 / $25.00 | 1,616 | 1,342 | 87.0% | - | 80.9% | 37.6% |
| 7 | (name missing) | - | $2.00 / $10.00 | $2.00 / $12.00 | 1,579 | 1,045 | 91.9% | 100.0% | 76.2% | 31.1% |
| 8 | GLM-5 (Zhipu AI) | 200K | $0.95 / $2.85 | $1.00 / $3.20 | 1,575 | 1,158 | - | - | 77.8% | - |
| 9 | GPT-5.2 (OpenAI) | 400K | - | $1.75 / $14.00 | 1,517 | 1,193 | 92.4% | 100.0% | 80.0% | 52.9% |
| 10 | Kimi K2.5 (Moonshot AI) | 262K | $0.50 / $2.80 | $0.60 / $3.00 | 1,479 | 1,003 | 87.6% | 96.1% | 76.8% | - |
| 11 | (name missing) | 200K | $2.40 / $12.00 | $3.00 / $15.00 | 1,414 | 956 | 89.9% | - | 79.6% | 58.3% |
| 12 | (name missing) | 200K | $1.40 / $4.20 | $6.00 / $24.00 | 1,332 | -179 | 86.2% | - | - | - |
| 13 | GPT-5 High (OpenAI) | - | - | - | 1,301 | 1,067 | 87.3% | 94.6% | - | - |
| 14 | (name missing) | 400K | $1.20 / $9.60 | $1.75 / $14.00 | 1,242 | 895 | - | - | - | - |
| 15 | GPT-5.1 (OpenAI) | 400K | - | $1.25 / $10.00 | 1,232 | 1,013 | 88.1% | 94.0% | 76.3% | - |
| 16 | Qwen3.5-397B-A17B (Alibaba Cloud / Qwen Team) | 262K | - | $0.60 / $3.60 | 1,208 | 963 | 88.4% | - | 76.4% | - |
| 17 | (name missing) | 200K | $12.00 / $60.00 | $15.00 / $75.00 | 1,155 | 1,180 | 80.9% | 78.0% | 74.5% | - |
| 18 | (name missing) | 400K | $1.20 / $9.60 | $1.75 / $14.00 | 1,148 | 812 | - | - | - | - |
| 19 | GPT-5.1 High (OpenAI) | - | - | - | 1,140 | 1,132 | 88.1% | 99.6% | - | - |
| 20 | (name missing) | 131K | $0.45 / $1.80 | $0.55 / $2.19 | 1,137 | 1,079 | 81.0% | 93.9% | 68.0% | - |

Showing 1 to 20 of 291 models
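
To compare the Platform Price and Official Price columns, the relative saving can be worked out per price. The sketch below is illustrative only: the function name `discount_pct` is hypothetical, and because the page does not label the two prices in each cell, they are treated simply as the first and second price. The figures are taken from the GLM-5 row.

```python
def discount_pct(platform: float, official: float) -> float:
    """Return the percentage saved by the platform price relative to the official price."""
    if official <= 0:
        raise ValueError("official price must be positive")
    return (official - platform) / official * 100.0


# GLM-5 row: platform $0.95 / $2.85, official $1.00 / $3.20 (all per 1M tokens).
first = discount_pct(0.95, 1.00)   # saving on the first listed price
second = discount_pct(2.85, 3.20)  # saving on the second listed price

print(f"first price: {first:.1f}% cheaper, second price: {second:.1f}% cheaper")
```

For rows where one of the two price columns is "-", no comparison is possible and the row would need to be skipped.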

Metric Definitions

LLM

Code Arena: Average score across coding arenas based on human votes.
Chat Arena: Human preference score from blind comparisons.
GPQA: Graduate-level science questions requiring expert knowledge.
AIME 2025: Recent math competition problems.
SWE-Bench: Real GitHub issues requiring code changes.
ARC-AGI v2: Abstract reasoning problems.
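
The arena metrics are described only as human preference scores; the page does not say how votes are turned into a number. One common approach for pairwise blind comparisons is an Elo-style rating update, sketched below purely as an assumption and not as this leaderboard's documented method. The starting rating of 1000, the K-factor of 32, and the model names are all hypothetical.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Return the updated (r_a, r_b) after a single vote between A and B."""
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Hypothetical votes, each recorded as (winner, loser).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]

for winner, loser in votes:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], True)

print(ratings)
```

Each vote moves the same number of points from the loser to the winner, so the total stays constant, and an upset win against a higher-rated model shifts more points than an expected win.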

Image

Image Gen: Human preference score for text-to-image generation.
Image Edit: Human preference score for image editing and transformation.

Video

Text to Video: Human preference score for text-to-video generation.
Image to Video: Human preference score for image-to-video generation.
Video to Video: Human preference score for video editing capabilities.

TTS

TTS: Human preference score for text-to-speech quality.

STT

STT: Human preference score for transcription accuracy.