The “cognitive core” hypothesis (that the general-reasoning components of a trained LLM are not that large in parameter count) is looking surprisingly plausible. This would explain why distillation is so effective.
The contrary hypothesis (call it associationism) is that general reasoning is just a bunch of heuristics and priors piled atop each other, and that you need lots of memorisation because the “reasoning” just is that pile of memorised heuristics. This is also a live possibility: for instance, consider that a year of intense RLVR mostly led to task-specific improvements. Or, more basically: bigger models seem to reason better as well as just knowing more things.
Resolution: by the end of next year (i.e. end of 2026), will a publicly evaluated model with <23B active parameters match o3 on ECI (i.e. score >=148) and on ADeLE (i.e. weighted ability average >=5.0), while using <30x o3's per-query inference FLOPs?
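To make the boolean structure unambiguous (both benchmarks must be met, per the 2025-12-07 update below), here is a minimal sketch of the resolution predicate; the function and argument names are hypothetical, and only the thresholds come from this question:

```python
# Sketch of the resolution predicate. Names are illustrative; the thresholds
# are the ones stated in the question text.

def resolves_yes(active_params_billions: float,
                 eci_score: float,
                 adele_weighted_avg: float,
                 flops_ratio_vs_o3: float) -> bool:
    """True iff every criterion is met (ECI AND ADeLE, per the update)."""
    return (active_params_billions < 23       # <23B active parameters
            and eci_score >= 148              # match o3 on ECI
            and adele_weighted_avg >= 5.0     # match o3 on ADeLE
            and flops_ratio_vs_o3 < 30)       # <30x o3's per-query FLOPs

# E.g. GPT-5-mini-high at ECI 147 fails whatever its other numbers are.
```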
Spoiler resolution: this is of course possible to do by goodharting the benchmarks, so I reserve the right to resolve "no" if the signs point to this. Sorry.
My current credence (Dec 2025): 20%
If you want to use a model of me as well as your model of ML to answer, here are some of my views.
Update 2025-12-07 (PST) (AI summary of creator comment): For models without publicly disclosed parameter counts, the creator will accept either:
Publicly confirmed parameter counts, OR
Convincing third-party estimates (e.g., Nesov/Epoch style analysis working backwards from inference speeds or similar methods)
Update 2025-12-07 (PST) (AI summary of creator comment): The active parameter threshold has been changed from <20B to <23B, and the resolution criteria now require models to meet the benchmark thresholds on both ECI AND ADeLE (previously it was OR).
Update 2025-12-07 (PST) (AI summary of creator comment): The creator does not trust Qwen's anticontamination measures. This may affect resolution decisions regarding Qwen models meeting the criteria.
Not all evaluated models have made their parameter counts public. How will you rule if a model reaches the benchmark thresholds but its parameter count, while plausibly in the necessary range, is not public? E.g. GPT-5-mini-high has an unclear parameter count (it could be in range already if they used an aggressive MoE architecture) and scores ECI 147, which is only one point shy.
@eapache going to go with publicly confirmed counts OR a really convincing Nesov / Epoch style estimate working backwards from speeds or w/e. Good enough?
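For intuition on what a speed-based estimate involves, here is a toy version of the arithmetic. It assumes the standard ~2 FLOPs per active parameter per generated token for a transformer forward pass and that decoding is compute-bound; all the numbers plugged in are made up, and real estimates have to wrestle with batching, memory-bandwidth limits, and speculative decoding:

```python
# Toy back-of-envelope: infer active parameter count from decode throughput.
# Uses the ~2N FLOPs-per-token forward-pass approximation and assumes the
# deployment is compute-bound at the stated utilization. Real serving is
# usually batched and bandwidth-bound, so this is a sketch, not a method.

def estimate_active_params(accelerator_flops_per_sec: float,
                           utilization: float,
                           tokens_per_sec: float) -> float:
    """Estimated active parameter count, given per-replica throughput."""
    effective_flops = accelerator_flops_per_sec * utilization
    return effective_flops / (2 * tokens_per_sec)

# Hypothetical inputs, not measurements: one accelerator at ~1e15 dense
# FLOP/s, 10% decode utilization, serving 2,000 tokens/sec aggregate.
n = estimate_active_params(1e15, 0.10, 2_000)
print(f"~{n / 1e9:.0f}B active parameters")  # -> ~25B
```

Real analyses of this style (e.g. the Nesov/Epoch estimates mentioned above) combine several such constraints rather than leaning on any single number.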
Qwen has been in the habit of releasing public models with 22B active parameters (the last tested version of which, released in April 2025, scored an ECI of 137). An updated version of this model with sufficient ECI would not meet the letter of the description (since 22 > 20), but I think it would probably be in the spirit?