Can a model 90% as good as o4-mini be created with open source and <$500 of GPU compute?

I believe creating a model 90% as good as o4-mini is within reach of a smart hobbyist researcher today.

Specifically, I believe it can be achieved using an open-source model of roughly the caliber available today as a base, clever scaffolding for agentic tool use and web search, and an affordable amount of GPU compute.

Specs:

If an LLM is used as the base, it must be open-weights and released during or before June 2025.

Base model must use fewer than 40B activated params if MoE or fewer than 80B params if dense.

Scaffolding/harness to let the model search/run in a loop is allowed and encouraged. Anything goes as long as it's fully automated and not machine learned.
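
As a rough illustration of what a fully automated, non-learned harness could look like, here is a minimal sketch of a tool-use loop. Everything in it is an assumption made for illustration: the `run_agent` function, the `CALL <tool> <argument>` and `FINAL: <answer>` text protocol, and the `model` and `tools` callables are hypothetical and are not part of the market's rules or of any particular library.

```python
from typing import Callable, Dict, Optional


def run_agent(
    model: Callable[[str], str],             # hypothetical: maps a prompt to a reply
    tools: Dict[str, Callable[[str], str]],  # hypothetical: tool name -> function
    task: str,
    max_steps: int = 8,
) -> Optional[str]:
    """Run a fixed, hand-written loop around a model until it gives a final answer.

    Assumed protocol (illustrative only): the model replies either with
    'CALL <tool> <argument>' or with 'FINAL: <answer>'.
    """
    transcript = [f"TASK: {task}"]
    for _ in range(max_steps):
        reply = model("\n".join(transcript))
        transcript.append(reply)
        if reply.startswith("FINAL:"):
            # The model has committed to an answer; stop looping.
            return reply[len("FINAL:"):].strip()
        if reply.startswith("CALL "):
            name, _, arg = reply[len("CALL "):].partition(" ")
            tool = tools.get(name)
            result = tool(arg) if tool else f"unknown tool {name!r}"
            # Feed the tool output back so the next step can use it.
            transcript.append(f"RESULT: {result}")
    # No final answer within the step budget.
    return None
```

The loop itself contains no learned components: the control flow, parsing, and step budget are all hand-written, which is the kind of scaffolding the rule above permits.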

If compute is used for fine-tuning/reinforcement learning, the cost of the compute must be within $500 or fair market value (whichever is higher).

"90% as good" is defined as difference between o4-mini and hypothetical model of Cohen's d over task-wise scores in 5 runs of THUDM AgentBench ≤ 0.32.

If there are any competent, good-faith attempts (as judged by me), this market resolves YES if any of them satisfy all criteria, else NO. If there are no such attempts, this market resolves N/A.
