BASELINE (2025 leader): The current leader as of Aug 2025 appears to be GPT-4.5 at ~90.2% on MMLU, with Claude 4 and Gemini 2.5 Pro at ~85-86%.
To establish the 2025 baseline:
- On Dec 31, 2025, identify the LLM with the highest average score across the "Core Benchmark Suite" (defined below) 
- This average becomes the baseline for calculating the 10% improvement (see the sketch below)
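For concreteness, here is a minimal Python sketch of the baseline pick. The model names and scores are hypothetical placeholders, and suite_average is a helper invented for illustration, not a real leaderboard API:

    from statistics import mean

    # The five benchmarks in the Core Benchmark Suite (defined below).
    CORE_SUITE = ["MMLU", "HumanEval", "GSM8K", "ARC-Challenge", "GPQA"]

    # Hypothetical hand-collected scores; not real leaderboard figures.
    scores_2025 = {
        "model_a": {"MMLU": 90.2, "HumanEval": 92.0, "GSM8K": 95.1,
                    "ARC-Challenge": 96.3, "GPQA": 59.4},
        "model_b": {"MMLU": 86.0, "HumanEval": 89.5, "GSM8K": 94.0,
                    "ARC-Challenge": 95.0, "GPQA": 55.1},
    }

    def suite_average(benchmarks):
        """Simple arithmetic mean across the five-benchmark suite."""
        return mean(benchmarks[b] for b in CORE_SUITE)

    # The 2025 baseline is the model with the highest suite average.
    baseline_model = max(scores_2025, key=lambda m: suite_average(scores_2025[m]))
    baseline_score = suite_average(scores_2025[baseline_model])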
CORE BENCHMARK SUITE (to avoid cherry-picking):
- MMLU (general knowledge) 
- HumanEval (coding) 
- GSM8K (math reasoning) 
- ARC-Challenge (scientific reasoning) 
- GPQA (expert-level knowledge) 
RESOLUTION CRITERIA:
- On Dec 31, 2026, identify the highest-scoring LLM on the same benchmark suite 
- Calculate the percentage improvement: (2026_score - 2025_score) / 2025_score × 100 (see the sketch after this list)
- BET RESOLVES YES if improvement is less than 10% 
- BET RESOLVES NO if improvement is 10% or greater 
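A sketch of the resolution arithmetic, assuming both inputs are the suite averages described above (resolve_bet is a hypothetical helper name):

    def resolve_bet(score_2025, score_2026):
        """YES if the year-over-year improvement is under 10%, NO otherwise."""
        improvement = (score_2026 - score_2025) / score_2025 * 100
        return "YES" if improvement < 10 else "NO"

Note that an improvement of exactly 10% resolves NO, since the criterion is "10% or greater".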
KEY DEFINITIONS:
- "LLM": Text-based language models (excludes multimodal-only systems) 
- "Publicly available": Model must be accessible via API, open-source, or major consumer platform 
- "Score sources": Use official leaderboards (HuggingFace, Papers with Code) or company-reported figures 
- "Average": Simple arithmetic mean across the 5 benchmarks 
EDGE CASES:
- If benchmarks become saturated (>98% scores), substitute the most widely adopted replacement benchmark
- If a benchmark is discontinued, use the closest equivalent as determined by academic consensus 
- Minimum 3 valid benchmark scores required for a model's inclusion (see the sketch below)
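One mechanical reading of the minimum-scores rule, assuming a missing or discontinued benchmark is recorded as None (that encoding is an assumption, not something the criteria specify):

    from statistics import mean

    def eligible_average(benchmarks):
        """Mean over the available benchmarks; returns None (excluded)
        when a model has fewer than 3 valid scores."""
        valid = [v for v in benchmarks.values() if v is not None]
        if len(valid) < 3:
            return None  # fewer than 3 valid scores: not eligible
        return mean(valid)

Averaging over the valid subset keeps a model eligible when a benchmark is discontinued, at the cost of making averages less comparable across models; the criteria leave that trade-off open.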
Example calculation:
- 2025 leader: 85% average 
- 2026 leader: 92% average 
- Improvement: (92 - 85) / 85 × 100 ≈ 8.2% → YES (less than 10%)
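Plugging the example into the resolve_bet sketch above confirms the outcome:

    >>> resolve_bet(85, 92)   # (92 - 85) / 85 * 100 ≈ 8.24
    'YES'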