GDPval-AA Leaderboard

GDPval-AA is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.

The GDPval gold public dataset includes 220 tasks developed by OpenAI in collaboration with industry professionals to reflect real-world complexity.

The benchmark requires models to produce diverse outputs including documents, slides, diagrams, and spreadsheets, mirroring actual work products across finance, healthcare, legal, and other professional domains.

All evaluations are conducted independently by Artificial Analysis. More information can be found on our Intelligence Benchmarking Methodology page.

Publication

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, Jerry Tworek.

We introduce GDPval, a benchmark designed to evaluate AI models on real-world, economically valuable tasks across 44 occupations. The dataset encompasses 1,320 tasks derived from nine major industries contributing significantly to the U.S. GDP. These tasks were developed in collaboration with industry professionals averaging 14 years of experience, ensuring they accurately represent real-world complexities. The evaluation requires models to produce diverse outputs, including documents, slides, diagrams, and spreadsheets, mirroring actual work products. Initial results indicate that frontier AI models are approaching the quality of work produced by human experts, with models able to perform certain professional tasks approximately 100 times faster and at a fraction of the cost compared to human experts.

Related links: GDPval, arXiv 2510.04374, openai/gdpval

GDPval

Claude Opus 4.8 (Adaptive Reasoning, Max Effort) has the highest GDPval-AA Elo at 1890, followed by GPT-5.5 (xhigh) at 1769 and GPT-5.5 (high) at 1755.
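To give a rough sense of what an Elo gap means, the sketch below converts a rating difference into an expected head-to-head win rate. It assumes the conventional logistic Elo formula with a 400-point scale; the page does not state which scale GDPval-AA uses, so treat the numbers as illustrative. The function name `elo_expected_score` is hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional Elo formula.

    Assumption: a 400-point logistic scale, which this page does not confirm.
    Ties are ignored, so the value can be read as a win probability.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings taken from the top two leaderboard entries.
top_rating = 1890
runner_up_rating = 1769

p_top_wins = elo_expected_score(top_rating, runner_up_rating)
print(f"Expected win rate of the top entry over the runner-up: {p_top_wins:.1%}")
```

Under these assumptions, a gap of about 120 points corresponds to the higher-rated model being preferred in roughly two out of three matchups, not in every one.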

GDPval-AA Elo

[Chart: GDPval-AA Leaderboard. Elo scores for agentic performance on real-world work tasks using web and shell access via Stirrup, an open-source harness developed by Artificial Analysis. Series: Stirrup Agent Harness, AI Chatbot.]

Chatbots

[Chart: GDPval-AA: AI Chatbots. Elo scores for AI chatbots tested in the GDPval-AA evaluation.]

Score Comparisons

[Chart: GDPval-AA: Elo vs. Artificial Analysis Intelligence Index. Axes: GDPval-AA Elo and Artificial Analysis Intelligence Index, with the most attractive quadrant highlighted. Legend entries shown: Alibaba, Amazon, Anthropic, DeepSeek, Google, Kimi, MBZUAI Institute of Foundation Models.]

Token Usage

[Chart: GDPval-AA: Output Token Usage. Output tokens used to run the evaluation, split into reasoning tokens and answer tokens.]

The total number of tokens used to run the evaluation includes input tokens (prompt), reasoning tokens (for reasoning models), and answer tokens (final response).

Average Turns

[Chart: GDPval-AA: Average Turns per Task. Average number of turns per task.]

Score vs. Release Date

[Chart: GDPval-AA: Elo vs. Release Date, with the most attractive region highlighted. Legend entries shown: Alibaba, Amazon, Anthropic, DeepSeek, Google, Kimi, MBZUAI Institute of Foundation Models.]

GDPval-AA Leaderboard

Rank	Creator	Model	GDPval-AA Elo	Interval (- / +)	Release date
1	Anthropic	Claude Opus 4.8 (Adaptive Reasoning, Max Effort)	1890	-34 / +35	May 2026
2	OpenAI	GPT-5.5 (xhigh)	1769	-32 / +31	Apr 2026
3	OpenAI	GPT-5.5 (high)	1755	-31 / +32	Apr 2026
4	Anthropic	Claude Opus 4.7 (Adaptive Reasoning, Max Effort)	1753	-41 / +40	Apr 2026
5	Anthropic	Claude Opus 4.7 (Non-reasoning, High Effort)	1677	-27 / +28	Apr 2026
6	Anthropic	Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)	1676	-26 / +29	Feb 2026
7	OpenAI	GPT-5.4 (xhigh)	1674	-34 / +32	Mar 2026
8	Google	Gemini 3.5 Flash (medium)	1659	-29 / +28	May 2026
9	Google	Gemini 3.5 Flash (high)	1656	-26 / +30	May 2026
10	OpenAI	GPT-5.5 (medium)	1650	-24 / +28	Apr 2026
11	Anthropic	Claude Opus 4.6 (Adaptive Reasoning, Max Effort)	1619	-31 / +33	Feb 2026
12	Anthropic	Claude Sonnet 4.6 (Non-reasoning, High Effort)	1599	-26 / +28	Feb 2026
13	Anthropic	Claude Opus 4.6 (Non-reasoning, High Effort)	1591	-23 / +27	Feb 2026
14	Xiaomi	MiMo-V2.5-Pro	1571	-27 / +28	Apr 2026
15	DeepSeek	DeepSeek V4 Pro (Reasoning, High Effort)	1558	-29 / +31	Apr 2026
16	DeepSeek	DeepSeek V4 Pro (Reasoning, Max Effort)	1554	-29 / +29	Apr 2026
17	Xiaomi	MiMo-V2.5	1551	-25 / +25	Apr 2026
18	Alibaba	Qwen3.7 Max	1546	-25 / +27	May 2026
19	Z AI	GLM-5.1 (Reasoning)	1535	-0 / +0	Apr 2026
20	Alibaba	Qwen3.7 Plus	1523	-24 / +25	Jun 2026
21	MiniMax	MiniMax-M2.7	1505	-24 / +26	Mar 2026
22	Alibaba	Qwen3.6 Max Preview	1504	-20 / +21	Apr 2026
23	OpenAI	GPT-5.4 (low)	1503	-27 / +29	Mar 2026
24	Z AI	GLM-5-Turbo	1497	-21 / +23	Mar 2026
25	xAI	Grok 4.3 (high)	1495	-25 / +23	Apr 2026
26	Z AI	GLM-5.1 (Non-reasoning)	1494	-25 / +26	Apr 2026
27	OpenAI	GPT-5.3 Codex (xhigh)	1482	-25 / +25	Feb 2026
28	Kimi	Kimi K2.6	1481	-25 / +26	Apr 2026
29	DeepSeek	DeepSeek V4 Pro (Non-reasoning)	1476	-24 / +27	Apr 2026
30	OpenAI	GPT-5.2 (xhigh)	1467	-27 / +26	Dec 2025
31	Anthropic	Claude Sonnet 4.6 (Non-reasoning, Low Effort)	1456	-23 / +27	Feb 2026
32	Anthropic	Claude Opus 4.5 (Reasoning)	1452	-24 / +25	Nov 2025
33	OpenAI	GPT-5.5 (low)	1443	-23 / +25	Apr 2026
34	Google	Gemini 3.5 Flash (minimal)	1440	-24 / +27	May 2026
35	OpenAI	GPT-5.4 mini (xhigh)	1438	-23 / +26	Mar 2026
36	Anthropic	Claude Opus 4.5 (Non-reasoning)	1419	-21 / +23	Nov 2025
37	Meta	Muse Spark	1417	-23 / +23	Apr 2026
38	DeepSeek	DeepSeek V4 Flash (Reasoning, High Effort)	1414	-25 / +26	Apr 2026
39	Xiaomi	MiMo-V2-Pro	1406	-22 / +22	Mar 2026
40	OpenAI	GPT-5.2 (medium)	1405	-21 / +23	Dec 2025
41	Alibaba	Qwen3.6 27B (Reasoning)	1403	-23 / +24	Apr 2026
42	Z AI	GLM-5 (Reasoning)	1395	-23 / +23	Feb 2026
43	DeepSeek	DeepSeek V4 Flash (Non-reasoning)	1392	-25 / +28	Apr 2026
44	DeepSeek	DeepSeek V4 Flash (Reasoning, Max Effort)	1388	-20 / +34	Apr 2026
45	Alibaba	Qwen3.6 27B (Non-reasoning)	1385	-21 / +26	Apr 2026
46	Alibaba	Qwen3.6 Plus	1354	-23 / +24	Apr 2026
47	Xiaomi	MiMo-V2-Omni-0327	1344	-24 / +25	Mar 2026
48	OpenAI	GPT-5.4 (Non-reasoning)	1342	-26 / +26	Mar 2026
49	Z AI	GLM 5V Turbo (Reasoning)	1330	-22 / +24	Apr 2026
50	Kimi	Kimi K2.6 (Non-reasoning)	1326	-25 / +28	Apr 2026
51	Google	Gemini 3 Deep Think	1324	-30 / +31	Feb 2026
52	Z AI	GLM-5 (Non-reasoning)	1323	-23 / +24	Feb 2026
53		Claude Pro - 4.5 Opus (Extended Thinking)	1319	-41 / +38	-
54	Xiaomi	MiMo-V2-Omni	1319	-21 / +23	Mar 2026
55	Anthropic	Claude 4.5 Sonnet (Reasoning)	1317	-23 / +26	Sep 2025
56	OpenAI	GPT-5.4 mini (medium)	1316	-22 / +24	Mar 2026
57	OpenAI	GPT-5.5 (Non-reasoning)	1315	-23 / +25	Apr 2026
58	Google	Gemini 3.1 Pro Preview	1314	-26 / +27	Feb 2026
59	xAI	Grok 4.3 (medium)	1310	-24 / +26	Apr 2026
60	Anthropic	Claude 4.5 Sonnet (Non-reasoning)	1306	-21 / +24	Sep 2025
61	xAI	Grok 4.3 (Non-reasoning)	1300	-24 / +25	Apr 2026
62	Alibaba	Qwen3.6 35B A3B (Reasoning)	1299	-23 / +25	Apr 2026
63	StepFun	Step 3.7 Flash	1298	-25 / +25	May 2026
64	Xiaomi	MiMo-V2.5-Pro (Non-reasoning)	1298	-23 / +25	Apr 2026
65	OpenAI	GPT-5 (high)	1294	-22 / +23	Aug 2025
66	OpenAI	GPT-5.2 Codex (xhigh)	1288	-25 / +28	Dec 2025
67	Kimi	Kimi K2.5 (Reasoning)	1285	-23 / +25	Jan 2026
68	Kimi	Kimi K2.5 (Non-reasoning)	1265	-23 / +25	Jan 2026
69	Tencent	Hy3-preview (Reasoning)	1238	-24 / +23	Apr 2026
70	OpenAI	GPT-5.1 (high)	1227	-20 / +22	Nov 2025
71	Tencent	Hy3-preview (Non-reasoning)	1226	-23 / +28	Apr 2026
72	Alibaba	Qwen3.6 35B A3B (Non-reasoning)	1225	-23 / +23	Apr 2026
73	OpenAI	GPT-5.2 (Non-reasoning)	1222	-22 / +21	Dec 2025
74	Alibaba	Qwen3.5 397B A17B (Non-reasoning)	1220	-23 / +25	Feb 2026
75	OpenAI	GPT-5 Codex (high)	1214	-22 / +24	Sep 2025
76	Google	Gemini 3 Flash Preview (Reasoning)	1204	-24 / +24	Dec 2025
77	OpenAI	GPT-5.4 nano (medium)	1199	-21 / +22	Mar 2026
78	DeepSeek	DeepSeek V3.2 (Reasoning)	1197	-24 / +22	Dec 2025
79	OpenAI	GPT-5.4 nano (xhigh)	1194	-24 / +28	Mar 2026
80	OpenAI	GPT-5.1 Codex (high)	1192	-25 / +25	Nov 2025
81	Alibaba	Qwen3.5 397B A17B (Reasoning)	1190	-23 / +22	Feb 2026
82	Alibaba	Qwen3.5 Omni Plus	1185	-21 / +23	Mar 2026
83	Z AI	GLM-4.7 (Reasoning)	1185	-22 / +23	Dec 2025
84	OpenAI	GPT-5 mini (high)	1184	-21 / +23	Aug 2025
85	Google	Gemini 3 Pro Preview (high)	1184	-23 / +23	Nov 2025
86	Z AI	GLM-4.7 (Non-reasoning)	1177	-25 / +23	Dec 2025
87	MiniMax	MiniMax-M2.5	1176	-21 / +24	Feb 2026
88	Anthropic	Claude 4.5 Haiku (Reasoning)	1171	-24 / +25	Oct 2025
89	xAI	Grok 4.20 0309 v2 (Reasoning)	1169	-24 / +24	Apr 2026
90	Mistral	Mistral Medium 3.5	1168	-25 / +24	Apr 2026
91	Google	Gemini 3 Pro Preview (low)	1167	-26 / +27	Nov 2025
92	Alibaba	Qwen3.5 27B (Non-reasoning)	1161	-22 / +22	Feb 2026
93	Alibaba	Qwen3.5 27B (Reasoning)	1160	-22 / +22	Feb 2026
94		ChatGPT Plus - 5.1 Thinking (Extended Thinking)	1149	-41 / +45	-
95	OpenAI	GPT-5 (low)	1149	-23 / +23	Aug 2025
96	OpenAI	GPT-5.5 Instant (May 2026)	1141	-23 / +23	May 2026
97	Alibaba	Qwen3 Max Thinking	1136	-23 / +23	Jan 2026
98	Anthropic	Claude 4.5 Haiku (Non-reasoning)	1135	-25 / +27	Oct 2025
99	Anthropic	Claude 4 Sonnet (Reasoning)	1133	-26 / +26	May 2025
100	Anthropic	Claude 4 Sonnet (Non-reasoning)	1125	-23 / +23	May 2025
101	xAI	Grok 4.3 (low)	1124	-22 / +24	Apr 2026
102	InclusionAI	Ring-2.6-1T	1122	-26 / +25	May 2026
103	KwaiKAT	KAT Coder Pro V2	1120	-23 / +24	Mar 2026
104	Google	Gemini 3 Flash Preview (Non-reasoning)	1116	-24 / +25	Dec 2025
105	Alibaba	Qwen3.5 122B A10B (Reasoning)	1116	-22 / +23	Feb 2026
106	Google	Gemma 4 31B (Reasoning)	1113	-22 / +23	Apr 2026
107	Alibaba	Qwen3.5 122B A10B (Non-reasoning)	1111	-22 / +24	Feb 2026
108	MiniMax	MiniMax-M2.1	1091	-24 / +23	Dec 2025
109	DeepSeek	DeepSeek V3.1 (Non-reasoning)	1080	-21 / +25	Aug 2025
110	Xiaomi	MiMo-V2-Flash (Reasoning)	1079	-23 / +24	Dec 2025
111	China Mobile	JT-35B-Flash	1076	-23 / +25	May 2026
112	Google	Gemini 2.5 Flash Preview (Sep '25) (Reasoning)	1071	-24 / +24	Sep 2025
113	StepFun	Step 3.5 Flash 2603	1070	-23 / +24	Apr 2026
114	DeepSeek	DeepSeek V3.2 Exp (Non-reasoning)	1067	-24 / +25	Sep 2025
115	Xiaomi	MiMo-V2-Flash (Non-reasoning)	1059	-25 / +24	Dec 2025
116	StepFun	Step 3.5 Flash	1054	-27 / +27	Feb 2026
117	OpenAI	GPT-5.1 Codex mini (high)	1052	-24 / +25	Nov 2025
118	Anthropic	Claude 3.7 Sonnet (Reasoning)	1048	-23 / +26	Feb 2025
119	Alibaba	Qwen3.5 35B A3B (Non-reasoning)	1047	-24 / +23	Feb 2026
120	Anthropic	Claude 3.7 Sonnet (Non-reasoning)	1047	-24 / +24	Feb 2025
121	xAI	Grok 4.1 Fast (Reasoning)	1046	-24 / +26	Nov 2025
122	InclusionAI	Ling-2.6-1T	1046	-23 / +24	Apr 2026
123	xAI	Grok 4.20 0309 (Reasoning)	1043	-22 / +24	Mar 2026
124	Xiaomi	MiMo-V2-Flash (Feb 2026)	1043	-26 / +25	Dec 2025
125	Alibaba	Qwen3 Max	1038	-22 / +23	Sep 2025
126	xAI	Grok 4.20 0309 v2 (Non-reasoning)	1037	-27 / +26	Apr 2026
127		Perplexity Pro - Labs	1032	-41 / +39	-
128	MiniMax	MiniMax-M2	1031	-25 / +26	Oct 2025
129	Z AI	GLM-4.6 (Reasoning)	1029	-25 / +31	Sep 2025
130	Google	Gemma 4 26B A4B (Reasoning)	1014	-22 / +23	Apr 2026
131	xAI	Grok 4 Fast (Reasoning)	1014	-23 / +24	Sep 2025
132	OpenAI	o4-mini (high)	1008	-24 / +23	Apr 2025
133	Google	Gemma 4 31B (Non-reasoning)	1005	-21 / +23	Apr 2026
134	DeepSeek	DeepSeek V3.1 Terminus (Reasoning)	1004	-25 / +27	Sep 2025
135	NVIDIA	NVIDIA Nemotron 3 Super 120B A12B (Reasoning)	1003	-22 / +21	Mar 2026
136	OpenAI	GPT-5.4 mini (Non-Reasoning)	1002	-22 / +24	Mar 2026
137	OpenAI	GPT-5 mini (medium)	1002	-25 / +25	Aug 2025
138	OpenAI	GPT-5.1 (Non-reasoning)	1000	-0 / +0	Nov 2025
139	DeepSeek	DeepSeek V3.2 Exp (Reasoning)	1000	-25 / +25	Sep 2025
140	OpenAI	GPT-5 (medium)	1000	-25 / +25	Aug 2025
141	MiniMax	MiniMax M1 80k	997	-24 / +25	Jun 2025
142	Kimi	Kimi K2 Thinking	992	-23 / +25	Nov 2025
143	xAI	Grok 4	989	-24 / +24	Jul 2025
144	ByteDance Seed	Doubao Seed Code	986	-26 / +27	Nov 2025
145	Z AI	GLM-4.6 (Non-reasoning)	985	-23 / +26	Sep 2025
146	DeepSeek	DeepSeek V3.1 Terminus (Non-reasoning)	975	-23 / +24	Sep 2025
147	Amazon	Nova 2.0 Pro Preview (medium)	973	-27 / +23	Nov 2025
148		Google AI Pro - Thinking with 3 Pro	972	-43 / +43	-
149	Inception	Mercury 2	959	-23 / +24	Feb 2026
150	Google	Gemma 4 26B A4B (Non-reasoning)	948	-24 / +24	Apr 2026
151	Alibaba	Qwen3 Max Thinking (Preview)	948	-25 / +26	Nov 2025
152	OpenAI	gpt-oss-120b (high)	947	-28 / +27	Aug 2025
153	OpenAI	GPT-5.4 nano (Non-Reasoning)	941	-30 / +30	Mar 2026
154	Google	Gemini 3.1 Flash-Lite	928	-22 / +24	Mar 2026
155	Google	Gemini 2.5 Pro	919	-23 / +26	Jun 2025
156	Cohere	Command A+	919	-24 / +24	May 2026
157	Alibaba	Qwen3 Coder Next	912	-24 / +23	Feb 2026
158	xAI	Grok 4.20 0309 (Non-reasoning)	909	-22 / +22	Mar 2026
159	Alibaba	Qwen3.5 35B A3B (Reasoning)	906	-22 / +24	Feb 2026
160	Alibaba	Qwen3.5 Omni Flash	897	-24 / +28	Mar 2026
161		SuperGrok - Grok 4	882	-46 / +40	-
162	DeepSeek	DeepSeek V3.2 (Non-reasoning)	877	-28 / +28	Dec 2025
163	Arcee AI	Trinity Large Thinking	865	-24 / +23	Apr 2026
164	Mistral	Mistral Large 3	864	-25 / +24	Dec 2025
165	Kimi	Kimi K2 0905	864	-28 / +28	Sep 2025
166	Mistral	Mistral Small 4 (Reasoning)	861	-23 / +25	Mar 2026
167	Mistral	Devstral 2	854	-24 / +28	Dec 2025
168	Amazon	Nova 2.0 Lite (high)	854	-24 / +25	Oct 2025
169	Google	Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning)	852	-26 / +27	Sep 2025
170	Mistral	Mistral Small 4 (Non-reasoning)	845	-23 / +22	Mar 2026
171	Alibaba	Qwen3.5 9B (Non-reasoning)	844	-22 / +23	Mar 2026
172	LongCat	LongCat Flash Lite	837	-25 / +26	Jan 2026
173	Z AI	GLM-4.7-Flash (Reasoning)	837	-26 / +26	Jan 2026
174	China Mobile	JT-MINI	832	-25 / +23	Apr 2026
175	Mistral	Devstral Small (May '25)	829	-26 / +25	May 2025
176	OpenAI	gpt-oss-120b (low)	828	-23 / +26	Aug 2025
177	LG AI Research	K-EXAONE (Reasoning)	825	-28 / +24	Dec 2025
178	Mistral	Devstral Small 2	820	-25 / +25	Dec 2025
179	Alibaba	Qwen3 235B A22B 2507 (Reasoning)	820	-24 / +24	Jul 2025
180	KwaiKAT	KAT-Coder-Pro V1	818	-27 / +26	Nov 2025
181	Alibaba	Qwen3 Max (Preview)	816	-23 / +22	Sep 2025
182	LG AI Research	EXAONE 4.5 33B	813	-25 / +27	Apr 2026
183	Z AI	GLM-4.7-Flash (Non-reasoning)	801	-41 / +38	Jan 2026
184	Baidu	ERNIE 5.0 Thinking Preview	789	-27 / +27	Nov 2025
185	Amazon	Nova 2.0 Omni (medium)	783	-24 / +26	Nov 2025
186	InclusionAI	Ling 2.6 Flash	783	-21 / +24	Apr 2026
187	xAI	Grok 4.1 Fast (Non-reasoning)	783	-25 / +27	Nov 2025
188	Mistral	Mistral Medium 3.1	780	-27 / +27	Aug 2025
189	Alibaba	Qwen3 235B A22B 2507 Instruct	778	-28 / +27	Jul 2025
190	xAI	Grok 4 Fast (Non-reasoning)	778	-27 / +26	Sep 2025
191	OpenAI	GPT-4.1	776	-26 / +29	Apr 2025
192	Alibaba	Qwen3 VL 4B (Reasoning)	776	-39 / +40	Oct 2025
193	LG AI Research	K-EXAONE (Non-reasoning)	764	-26 / +25	Dec 2025
194	NVIDIA	Nemotron 3 Nano Omni 30B A3B Reasoning	763	-26 / +27	Apr 2026
195	xAI	Grok Code Fast 1	762	-26 / +28	Aug 2025
196	ByteDance Seed	Seed-OSS-36B-Instruct	759	-27 / +27	Aug 2025
197	Alibaba	Qwen3 235B A22B (Reasoning)	756	-26 / +30	Apr 2025
198	NVIDIA	Nemotron Cascade 2 30B A3B	756	-23 / +25	Mar 2026
199	OpenAI	GPT-5 nano (high)	755	-25 / +26	Aug 2025
200	OpenAI	o3	753	-32 / +29	Apr 2025
201	Prime Intellect	INTELLECT-3	749	-25 / +25	Nov 2025
202	OpenAI	o3-mini (high)	747	-25 / +26	Jan 2025
203	Google	Gemini 2.5 Flash (Non-reasoning)	741	-28 / +26	May 2025
204	Alibaba	Qwen3 235B A22B (Non-reasoning)	738	-30 / +28	Apr 2025
205	Sarvam	Sarvam 105B (high)	738	-22 / +23	Mar 2026
206	OpenAI	o1	736	-28 / +29	Dec 2024
207	Alibaba	Qwen3 Next 80B A3B (Reasoning)	725	-27 / +26	Sep 2025
208	Alibaba	Qwen3.5 9B (Reasoning)	714	-24 / +23	Mar 2026
209	Alibaba	Qwen3 VL 235B A22B (Reasoning)	714	-25 / +27	Sep 2025
210	Alibaba	Qwen3 Coder 30B A3B Instruct	711	-24 / +26	Jul 2025
211	Anthropic	Claude 3.5 Haiku	708	-24 / +25	Oct 2024
212	Google	Gemini 2.5 Flash (Reasoning)	698	-28 / +26	May 2025
213	Z AI	GLM-4.6V (Non-reasoning)	692	-27 / +27	Dec 2025
214	Mistral	Devstral Medium	690	-27 / +26	Jul 2025
215	InclusionAI	Ring-1T	687	-27 / +27	Oct 2025
216	Alibaba	Qwen3 VL 8B Instruct	682	-37 / +40	Oct 2025
217	DeepSeek	DeepSeek R1 0528 (May '25)	681	-28 / +28	May 2025
218	Naver	HyperCLOVA X SEED Think (32B)	679	-27 / +28	Dec 2025
219	Upstage	Solar Pro 3	675	-25 / +24	Apr 2026
220	Mistral	Magistral Small 1.2	669	-26 / +25	Sep 2025
221	Alibaba	Qwen3 VL 8B (Reasoning)	669	-27 / +29	Oct 2025
222	Alibaba	Qwen3.5 4B (Non-reasoning)	668	-22 / +23	Mar 2026
223	Alibaba	Qwen3 VL 30B A3B (Reasoning)	667	-37 / +37	Oct 2025
224	xAI	Grok 3	666	-24 / +28	Feb 2025
225	Mistral	Magistral Medium 1	666	-27 / +27	Jun 2025
226	Upstage	Solar Open 100B (Reasoning)	665	-27 / +28	Dec 2025
227	Alibaba	Qwen3 30B A3B 2507 (Reasoning)	660	-25 / +25	Jul 2025
228	Amazon	Nova 2.0 Pro Preview (low)	659	-28 / +29	Nov 2025
229	Mistral	Ministral 3 14B	655	-26 / +27	Dec 2025
230	OpenAI	gpt-oss-20B (high)	649	-25 / +26	Aug 2025
231	Alibaba	Qwen3 VL 32B (Reasoning)	647	-25 / +27	Oct 2025
232	Amazon	Nova 2.0 Lite (medium)	643	-27 / +27	Oct 2025
233	Korea Telecom	Mi:dm K 2.5 Pro	642	-26 / +26	Dec 2025
234	Mistral	Ministral 3 8B	639	-28 / +26	Dec 2025
235	Alibaba	Qwen3 VL 235B A22B Instruct	636	-33 / +36	Sep 2025
236	Alibaba	Qwen3 Next 80B A3B Instruct	627	-29 / +28	Sep 2025
237	Mistral	Magistral Medium 1.2	627	-29 / +27	Sep 2025
238	OpenAI	GPT-4.1 mini	620	-28 / +25	Apr 2025
239	DeepSeek	DeepSeek V3.1 (Reasoning)	614	-29 / +31	Aug 2025
240	Z AI	GLM-4.6V (Reasoning)	609	-30 / +32	Dec 2025
241	MBZUAI Institute of Foundation Models	K2 Think V2	608	-25 / +26	Dec 2025
242	OpenAI	GPT-5 nano (medium)	593	-27 / +27	Aug 2025
243	Alibaba	Qwen3 4B 2507 (Reasoning)	589	-28 / +27	Aug 2025
244	Mistral	Mistral Medium 3	586	-27 / +25	May 2025
245	Nous Research	Hermes 4 - Llama-3.1 405B (Reasoning)	585	-23 / +25	Aug 2025
246	MBZUAI Institute of Foundation Models	K2-V2 (medium)	581	-26 / +26	Dec 2025
247	ServiceNow	Apriel-v1.6-15B-Thinker	571	-26 / +25	Nov 2025
248	Google	Gemini 2.0 Flash (Feb '25)	566	-26 / +26	Feb 2025
249	Mistral	Devstral Small (Jul '25)	564	-30 / +29	Jul 2025
250	NVIDIA	NVIDIA Nemotron 3 Nano 30B A3B (Reasoning)	563	-28 / +27	Dec 2025
251	MBZUAI Institute of Foundation Models	K2-V2 (high)	561	-27 / +27	Dec 2025
252	Z AI	GLM-4.5-Air	559	-31 / +27	Jul 2025
253	OpenAI	gpt-oss-20B (low)	550	-25 / +26	Aug 2025
254	IBM	Granite 4.1 8B	542	-26 / +26	Apr 2026
255	Nous Research	Hermes 4 - Llama-3.1 70B (Reasoning)	538	-24 / +23	Aug 2025
256	Kimi	Kimi K2	526	-29 / +31	Jul 2025
257	Nous Research	Hermes 4 - Llama-3.1 70B (Non-reasoning)	522	-26 / +24	Aug 2025
258	Alibaba	Qwen3 30B A3B 2507 Instruct	515	-27 / +27	Jul 2025
259	Nous Research	Hermes 4 - Llama-3.1 405B (Non-reasoning)	510	-24 / +24	Aug 2025
260	Z AI	GLM-4.5V (Reasoning)	510	-23 / +22	Aug 2025
261	Alibaba	Qwen3.5 4B (Reasoning)	510	-28 / +31	Mar 2026
262	Amazon	Nova 2.0 Lite (low)	508	-29 / +30	Oct 2025
263	Alibaba	Qwen3 30B A3B (Reasoning)	507	-26 / +26	Apr 2025
264	Alibaba	Qwen3 Coder 480B A35B Instruct	506	-30 / +29	Jul 2025
265	Amazon	Nova Premier	505	-28 / +28	Apr 2025
266	LG AI Research	EXAONE 4.0 32B (Reasoning)	503	-28 / +25	Jul 2025
267	Allen Institute for AI	Molmo2-8B	500	-0 / +0	Dec 2025
268	DeepSeek	DeepSeek V3.2 Speciale	500	-0 / +0	Dec 2025
269	Alibaba	Qwen3 8B (Reasoning)	496	-27 / +29	Apr 2025
270	Alibaba	Qwen3 Omni 30B A3B (Reasoning)	495	-27 / +26	Sep 2025
271	Alibaba	Qwen3 VL 30B A3B Instruct	495	-30 / +28	Oct 2025
272	IBM	Granite 4.1 30B	495	-25 / +25	Apr 2026
273	Alibaba	Qwen3 32B (Reasoning)	489	-27 / +27	Apr 2025
274	Motif Technologies	Motif-2-12.7B-Reasoning	484	-30 / +30	Dec 2025
275	Mistral	Ministral 3 3B	483	-28 / +29	Dec 2025
276	Alibaba	Qwen3 14B (Reasoning)	479	-30 / +27	Apr 2025
277	NVIDIA	NVIDIA Nemotron 3 Nano 4B	478	-29 / +28	Mar 2026
278	Alibaba	Qwen3 8B (Non-reasoning)	473	-27 / +26	Apr 2025
279	Alibaba	Qwen3 14B (Non-reasoning)	471	-27 / +25	Apr 2025
280	OpenAI	GPT-5 mini (minimal)	471	-28 / +29	Aug 2025
281	Z AI	GLM-4.5 (Reasoning)	468	-31 / +33	Jul 2025
282	Z AI	GLM-4.5V (Non-reasoning)	459	-31 / +31	Aug 2025
283	Upstage	Solar Pro 2 (Reasoning)	449	-30 / +28	Jul 2025
284	Upstage	Solar Pro 2 (Non-reasoning)	445	-27 / +28	Jul 2025
285	NVIDIA	NVIDIA Nemotron Nano 9B V2 (Reasoning)	439	-26 / +28	Aug 2025
286	Meta	Llama 4 Maverick	435	-28 / +29	Apr 2025
287	Google	Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning)	434	-30 / +29	Sep 2025
288	InclusionAI	Ling-flash-2.0	419	-30 / +29	Sep 2025
289	xAI	Grok 3 mini Reasoning (high)	419	-40 / +36	Feb 2025
290	DeepSeek	DeepSeek V3 (Dec '24)	410	-30 / +28	Dec 2024
291	DeepSeek	DeepSeek V3 0324	407	-27 / +28	Mar 2025
292	InclusionAI	Ling-1T	401	-28 / +31	Oct 2025
293	Meta	Llama 3.3 Instruct 70B	400	-31 / +28	Dec 2024
294	Amazon	Nova Pro	387	-31 / +30	Dec 2024
295	OpenAI	GPT-5 (minimal)	385	-30 / +30	Aug 2025
296	Amazon	Nova 2.0 Lite (Non-reasoning)	380	-30 / +29	Oct 2025
297	Google	Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning)	380	-29 / +30	Sep 2025
298	NVIDIA	Llama Nemotron Super 49B v1.5 (Non-reasoning)	379	-28 / +28	Jul 2025
299	OpenAI	GPT-4o (Aug '24)	378	-28 / +29	Aug 2024
300	Anthropic	Claude 3 Haiku	378	-24 / +23	Mar 2024
301	Trillion Labs	Tri-21B-Think	372	-25 / +26	Feb 2026
302	TII UAE	Falcon-H1R-7B	372	-31 / +28	Jan 2026
303	NVIDIA	Llama Nemotron Super 49B v1.5 (Reasoning)	368	-31 / +29	Jul 2025
304	MBZUAI Institute of Foundation Models	K2-V2 (low)	365	-29 / +30	Dec 2025
305	IBM	Granite 4.1 3B	364	-25 / +29	Apr 2026
306	Amazon	Nova 2.0 Omni (low)	359	-34 / +29	Nov 2025
307	Sarvam	Sarvam 30B (high)	357	-23 / +24	Mar 2026
308	Nanbeige	Nanbeige4.1-3B	356	-30 / +31	Feb 2026
309	Allen Institute for AI	Olmo 3.1 32B Instruct	354	-28 / +29	Jan 2026
310	OpenAI	GPT-4o (Nov '24)	348	-26 / +25	Nov 2024
311	NVIDIA	NVIDIA Nemotron 3 Nano 30B A3B (Non-reasoning)	346	-30 / +29	Dec 2025
312	IBM	Granite 4.0 H Small	343	-30 / +29	Sep 2025
313	Amazon	Nova Lite	343	-30 / +32	Dec 2024
314	Alibaba	Qwen3 VL 4B Instruct	342	-28 / +27	Oct 2025
315	Amazon	Nova Micro	339	-31 / +29	Dec 2024
316	Trillion Labs	Tri-21B-think Preview	337	-33 / +30	Feb 2026
317	Mistral	Mistral Small 3.1	336	-31 / +27	Mar 2025
318	NVIDIA	Llama 3.1 Nemotron Instruct 70B	335	-30 / +31	Oct 2024
319	Alibaba	Qwen3 30B A3B (Non-reasoning)	329	-31 / +28	Apr 2025
320	LG AI Research	EXAONE 4.0 32B (Non-reasoning)	328	-33 / +29	Jul 2025
321	NVIDIA	NVIDIA Nemotron Nano 12B v2 VL (Reasoning)	328	-29 / +28	Oct 2025
322	Mistral	Mistral Large 2 (Nov '24)	323	-31 / +31	Nov 2024
323	Alibaba	Qwen3.5 2B (Reasoning)	319	-25 / +24	Mar 2026
324	OpenAI	GPT-4.1 nano	318	-32 / +29	Apr 2025
325	Google	Gemini 2.5 Flash-Lite (Reasoning)	318	-31 / +29	Jun 2025
326	Amazon	Nova 2.0 Pro Preview (Non-reasoning)	317	-30 / +30	Nov 2025
327	Alibaba	Qwen3 0.6B (Reasoning)	314	-31 / +28	Apr 2025
328	Alibaba	Qwen3 4B 2507 Instruct	305	-31 / +28	Aug 2025
329	Amazon	Nova 2.0 Omni (Non-reasoning)	304	-32 / +30	Nov 2025
330	Google	Gemini 2.5 Flash-Lite (Non-reasoning)	303	-31 / +30	Jun 2025
331	Mistral	Mistral Small 3.2	303	-31 / +32	Jun 2025
332	Google	Gemma 4 E4B (Reasoning)	302	-26 / +24	Apr 2026
333	NVIDIA	NVIDIA Nemotron Nano 9B V2 (Non-reasoning)	302	-28 / +29	Aug 2025
334	Alibaba	Qwen3 VL 32B Instruct	300	-34 / +29	Oct 2025
335	Alibaba	Qwen3 Omni 30B A3B Instruct	295	-31 / +30	Sep 2025
336	LG AI Research	Exaone 4.0 1.2B (Non-reasoning)	294	-28 / +27	Jul 2025
337	LG AI Research	Exaone 4.0 1.2B (Reasoning)	291	-28 / +30	Jul 2025
338	Google	Gemma 4 E4B (Non-reasoning)	291	-26 / +26	Apr 2026
339	IBM	Granite 4.0 H 350M	289	-31 / +31	Oct 2025
340	Google	Gemma 3 27B Instruct	285	-31 / +28	Mar 2025
341	Meta	Llama 3.1 Instruct 70B	284	-27 / +27	Jul 2024
342	NVIDIA	NVIDIA Nemotron Nano 12B v2 VL (Non-reasoning)	283	-27 / +29	Oct 2025
343	AI21 Labs	Jamba 1.7 Large	282	-30 / +31	Jul 2025
344	Google	Gemma 3 12B Instruct	279	-30 / +26	Mar 2025
345	OpenAI	GPT-5 nano (minimal)	279	-30 / +27	Aug 2025
346	IBM	Granite 4.0 Micro	279	-29 / +28	Sep 2025
347	Alibaba	Qwen3 0.6B (Non-reasoning)	277	-34 / +30	Apr 2025
348	Meta	Llama 3.1 Instruct 8B	277	-32 / +31	Jul 2024
349	Alibaba	Qwen3.5 0.8B (Reasoning)	276	-25 / +24	Mar 2026
350	Allen Institute for AI	Olmo 3 7B Instruct	276	-27 / +30	Nov 2025
351	Alibaba	Qwen3 1.7B (Reasoning)	275	-28 / +32	Apr 2025
352	AI21 Labs	Jamba 1.7 Mini	274	-30 / +30	Jul 2025
353	Cohere	Command A	273	-29 / +30	Mar 2025
354	Liquid AI	LFM2 1.2B	271	-32 / +28	Jul 2025
355	Meta	Llama 4 Scout	270	-30 / +29	Apr 2025
356	Google	Gemma 4 E2B (Reasoning)	269	-25 / +25	Apr 2026
357	IBM	Granite 4.0 H 1B	266	-30 / +29	Oct 2025
358	IBM	Granite 4.0 350M	265	-31 / +28	Oct 2025
359	Liquid AI	LFM2.5-1.2B-Instruct	265	-32 / +31	Jan 2026
360	InclusionAI	Ling-mini-2.0	260	-24 / +26	Sep 2025
361	Liquid AI	LFM2 8B A1B	258	-31 / +29	Oct 2025
362	IBM	Granite 4.0 1B	258	-30 / +30	Oct 2025
363	StepFun	Step3 VL 10B	257	-29 / +29	Jan 2026
364	OpenBMB	MiniCPM-V 4.6 1.3B	256	-28 / +28	May 2026
365	Meta	Llama 3.1 Instruct 405B	255	-31 / +29	Jul 2024
366	Google	Gemma 3 4B Instruct	254	-30 / +28	Mar 2025
367	AI21 Labs	Jamba Reasoning 3B	254	-30 / +29	Oct 2025
368	Alibaba	Qwen3 1.7B (Non-reasoning)	253	-30 / +31	Apr 2025
369	Liquid AI	LFM2.5-1.2B-Thinking	252	-31 / +29	Jan 2026
370	Google	Gemma 4 E2B (Non-reasoning)	251	-27 / +26	Apr 2026
371	DeepSeek	DeepSeek R1 (Jan '25)	248	-30 / +28	Jan 2025
372	Google	Gemma 3n E4B Instruct	243	-35 / +31	Jun 2025
373	NVIDIA	Llama 3.1 Nemotron Ultra 253B v1 (Reasoning)	238	-26 / +25	Apr 2025
374	Alibaba	Qwen3.5 2B (Non-reasoning)	237	-24 / +24	Mar 2026
375	Liquid AI	LFM2 2.6B	235	-28 / +30	Sep 2025
376	Liquid AI	LFM2 24B A2B	234	-27 / +24	Feb 2026
377	Alibaba	Qwen3.5 0.8B (Non-reasoning)	233	-27 / +24	Mar 2026
378	Liquid AI	LFM2.5-VL-1.6B	231	-33 / +29	Jan 2026
379	OpenBMB	MiniCPM5-1B (Non-reasoning)	231	-29 / +27	May 2026
380	Microsoft	Phi-4 Mini Instruct	228	-30 / +28	Feb 2024
381	IBM	Granite 3.3 8B (Non-reasoning)	224	-29 / +28	Apr 2025
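When reading the table, the "- / +" column matters as much as rank. The sketch below parses tab-separated rows into records and checks whether two entries' ranges overlap. It assumes the "- / +" values describe an uncertainty range around the Elo score (the page does not define them precisely), and the names `LeaderboardRow`, `parse_row`, and `ranges_overlap` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LeaderboardRow:
    rank: int
    creator: str  # may be empty for chatbot-product entries
    model: str
    elo: int
    minus: int
    plus: int
    released: str

    @property
    def low(self) -> int:
        return self.elo - self.minus

    @property
    def high(self) -> int:
        return self.elo + self.plus


def parse_row(line: str) -> LeaderboardRow:
    """Parse one tab-separated leaderboard row such as the ones above."""
    rank, creator, model, elo, interval, released = line.rstrip("\n").split("\t")
    minus_text, plus_text = interval.split("/")
    return LeaderboardRow(
        int(rank), creator, model, int(elo),
        abs(int(minus_text)), int(plus_text), released,
    )


def ranges_overlap(a: LeaderboardRow, b: LeaderboardRow) -> bool:
    """True when the two [low, high] ranges share at least one point."""
    return a.low <= b.high and b.low <= a.high


first = parse_row("1\tAnthropic\tClaude Opus 4.8 (Adaptive Reasoning, Max Effort)\t1890\t-34 / +35\tMay 2026")
second = parse_row("2\tOpenAI\tGPT-5.5 (xhigh)\t1769\t-32 / +31\tApr 2026")
fourth = parse_row("4\tAnthropic\tClaude Opus 4.7 (Adaptive Reasoning, Max Effort)\t1753\t-41 / +40\tApr 2026")

print(ranges_overlap(first, second))   # top entry vs. runner-up
print(ranges_overlap(second, fourth))  # ranks 2 and 4
```

Under this reading, the top entry is separated from the rest, while ranks 2 through 4 have overlapping ranges, so their relative order should be treated with some caution.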

Frequently Asked Questions

What is GDPval-AA?

GDPval-AA is Artificial Analysis' evaluation based on OpenAI's GDPval dataset, which tests AI models on real-world economically valuable tasks across 44 occupations and 9 major industries.

How does GDPval-AA decide which model did better?

GDPval-AA compares model submissions head-to-head on the same task. For each matchup, the two outputs are anonymized and an LLM judge picks a winner. These blind pairwise results are aggregated into an Elo rating per model.

Which AI model has the highest GDPval-AA score?

Claude Opus 4.8 (Adaptive Reasoning, Max Effort) has the highest GDPval-AA score, with a GDPval-AA Elo rating of 1,890 among models with published GDPval-AA results.

What kinds of tasks are included in GDPval-AA?

GDPval-AA covers real-world professional tasks across a range of occupations and industries, producing outputs such as documents, spreadsheets, slides, and diagrams. Generating these deliverables generally requires interacting with a sandbox filesystem through shell access and using web search, capabilities the model is given through the Stirrup agentic harness.

How is GDPval-AA different from standard AI benchmarks?

Most benchmarks test short-answer or multiple-choice responses. GDPval-AA instead evaluates complete deliverables: models operate in an agentic environment with tools, produce file outputs, and have their submissions scored through pairwise grading on relative quality.
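The aggregation of blind pairwise results into per-model ratings can be sketched as follows. This is a minimal illustration using the classic sequential Elo update, not Artificial Analysis' actual procedure, which the page does not describe (it may use a different fit or anchoring). The starting rating of 1000, the K-factor of 32, the 400-point scale, and the model names are all assumptions.

```python
from collections import defaultdict


def update_elo(ratings, winner, loser, k=32.0):
    """Apply one classic Elo update after the judge prefers `winner` over `loser`."""
    expected_win = 1.0 / (1.0 + 10.0 ** ((ratings[loser] - ratings[winner]) / 400.0))
    delta = k * (1.0 - expected_win)  # larger gain for an unexpected win
    ratings[winner] += delta
    ratings[loser] -= delta


def rate(matchups, start=1000.0, k=32.0):
    """Fold a sequence of (winner, loser) judgments into ratings."""
    ratings = defaultdict(lambda: start)
    for winner, loser in matchups:
        update_elo(ratings, winner, loser, k)
    return dict(ratings)


# Hypothetical judge decisions: each tuple is (preferred output, other output).
matchups = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]

ratings = rate(matchups)
for name, score in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.0f}")
```

Sequential updates depend on the order of matchups, which is one reason a batch fit over all comparisons is often preferred in practice; the sketch only shows how win/loss judgments translate into a rating scale.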

Explore Evaluations

Artificial Analysis Intelligence Index

A composite benchmark aggregating ten challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

APEX-Agents-AA Benchmark Leaderboard

Artificial Analysis' implementation of the APEX-Agents benchmark, testing AI agents on long-horizon, cross-application tasks in professional-services environments with realistic application tooling.

𝜏²-Bench Telecom Benchmark Leaderboard

A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.

Terminal-Bench Hard Benchmark Leaderboard

An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.

SciCode Benchmark Leaderboard

A scientist-curated coding benchmark featuring 288 test set subproblems from 80 laboratory problems across 16 scientific disciplines.

Artificial Analysis Long Context Reasoning Benchmark Leaderboard

A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).

AA-Omniscience: Knowledge and Hallucination Benchmark

A benchmark measuring factual recall and hallucination across various economically relevant domains.

IFBench Benchmark Leaderboard

A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements.

Humanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.

GPQA Diamond Benchmark Leaderboard

The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.

CritPt Benchmark Leaderboard

A benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.

ITBench-AA Benchmark Leaderboard

Artificial Analysis' implementation of IBM's ITBench benchmark, testing AI agents on Kubernetes incident root-cause analysis from offline incident snapshots. The agent inspects alerts, events, traces, and topology and identifies the contributing-factor entities (deployments, pods, namespaces, network policies, etc.) responsible for the failure.

Artificial Analysis Openness Index

A composite measure providing an industry standard to communicate model openness for users and developers.

MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

Global-MMLU-Lite Benchmark Leaderboard

A lightweight, multilingual version of MMLU, designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts.

LiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.

MATH-500 Benchmark Leaderboard

A 500-problem subset from the MATH dataset, featuring competition-level mathematics across six domains including algebra, geometry, and number theory.

AIME 2025 Benchmark Leaderboard

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000-999.

MMMU-Pro Benchmark Leaderboard

An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines.
