Android Bench
Android LLM Leaderboard
| Model | Score (%) | CI range (%) | Date |
|---|---|---|---|
| | 72.4 | 65.3 – 79.8 | 2026-03-04 |
| | 66.6 | 58.9 – 73.9 | 2026-03-04 |
| | 62.5 | 54.7 – 70.3 | 2026-03-04 |
| | 61.9 | 53.9 – 69.6 | 2026-03-04 |
| | 60.4 | 52.6 – 67.8 | 2026-03-04 |
| | 58.4 | 51.1 – 66.6 | 2026-03-04 |
| | 54.2 | 45.5 – 62.4 | 2026-03-04 |
| | 42.0 | 36.3 – 47.9 | 2026-03-04 |
| | 16.1 | 10.9 – 21.9 | 2026-03-04 |
Score is the average percentage of 100 test cases successfully resolved across 10 runs for each model.
The confidence interval (CI) is the expected performance range and reflects the statistical reliability of the results (p-value < 0.05).
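
As a rough illustration of how a score and interval of this kind could be computed, the sketch below averages the per-run pass percentage and estimates a 95% interval with a percentile bootstrap over runs. The page does not state how its interval is actually derived, so the bootstrap is an assumption, and the function names (`score_percent`, `bootstrap_ci`) and the run counts are hypothetical.

```python
import random
from statistics import mean


def score_percent(resolved_per_run, cases_per_run=100):
    """Average percentage of test cases resolved across runs."""
    return mean(100.0 * r / cases_per_run for r in resolved_per_run)


def bootstrap_ci(resolved_per_run, cases_per_run=100, level=0.95,
                 resamples=2000, seed=0):
    """Percentile bootstrap over runs.

    This is an assumed method for illustration only; the leaderboard's
    own interval calculation is not documented in the text.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(resolved_per_run)
    means = sorted(
        score_percent([rng.choice(resolved_per_run) for _ in range(n)],
                      cases_per_run)
        for _ in range(resamples)
    )
    lo_idx = int((1 - level) / 2 * resamples)
    hi_idx = int((1 + level) / 2 * resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical counts of resolved cases in each of 10 runs (made-up data).
runs = [58, 62, 60, 61, 59, 63, 57, 60, 61, 59]
score = score_percent(runs)
low, high = bootstrap_ci(runs)
print(f"score={score:.1f}%  CI=({low:.1f}, {high:.1f})")
```

Resampling whole runs captures only run-to-run variation. An interval that also reflects variation across the 100 test cases would need per-case results, which are not shown here.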

