I believe creating a model 90% as good as o4-mini is within reach of a smart hobby researcher today.
Specifically, I believe it can be achieved using an open-weights model of the caliber available today as the base, clever scaffolding for agentic tool use/web search, and an affordable amount of GPU compute.
Specs:
If an LLM is used as the base, it must be open-weights and released during or before June 2025.
The base model must use fewer than 40B activated parameters if it is an MoE, or fewer than 80B parameters if dense.
A scaffolding/harness that lets the model search and run in a loop is allowed and encouraged. Anything goes as long as it is fully automated and contains no additional machine-learned components.
If compute is used for fine-tuning/reinforcement learning, its cost must be at most $500, with the compute priced at the amount actually paid or fair market value, whichever is higher.
"90% as good" is defined as difference between o4-mini and hypothetical model of Cohen's d over task-wise scores in 5 runs of THUDM AgentBench ≤ 0.32.
If there are any competent, good-faith attempts (as judged by me), this market resolves YES if any of them satisfies all criteria, else NO. If there are no such attempts, this market resolves N/A.
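To make the "90% as good" criterion concrete, here is a minimal sketch of the Cohen's d computation. It assumes the standard pooled-standard-deviation form of Cohen's d over per-task scores; the score values below are made up purely for illustration and are not real benchmark results.

```python
import statistics

def cohens_d(scores_a, scores_b):
    """Cohen's d between two lists of task-wise scores (pooled SD)."""
    n_a, n_b = len(scores_a), len(scores_b)
    mean_a, mean_b = statistics.mean(scores_a), statistics.mean(scores_b)
    # statistics.variance is the sample variance (n - 1 denominator)
    var_a, var_b = statistics.variance(scores_a), statistics.variance(scores_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (mean_a - mean_b) / pooled_sd

# Hypothetical task-wise scores (each averaged over 5 runs), for illustration only:
o4_mini   = [0.82, 0.75, 0.64, 0.90, 0.71, 0.58, 0.80, 0.69]
candidate = [0.78, 0.70, 0.62, 0.85, 0.66, 0.55, 0.77, 0.65]

print(cohens_d(o4_mini, candidate))  # ≈ 0.39, which would fail the ≤ 0.32 bar
```

Under this definition, d ≤ 0.32 is a small-to-medium effect size, i.e. the candidate's task-wise score distribution must sit within about a third of a pooled standard deviation of o4-mini's.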