
Any proof that this is possible counts.
Apple’s own on-device models reach this threshold (they benchmark around Qwen2.5, which seems comparable to GPT-4o on LMArena):
https://machinelearning.apple.com/research/introducing-apple-foundation-models
Mistral Small seems similar in performance to GPT-3.5: https://mistral.ai/news/la-plateforme/
Should be a matter of days until someone runs it on their iPhone
Might happen much sooner than I expected
https://twitter.com/harmlessai/status/1626769581858758661?s=46&t=-aOs5vi8y_5tlqgbjNoRKA
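A quick back-of-envelope check on whether a model in this class fits on a phone at all. This is a rough sketch, not a spec: the ~7B parameter count, the 8 GB RAM figure, and the headroom factor are illustrative assumptions, not published numbers for Mistral Small or any particular iPhone.

```python
# Back-of-envelope: does a ~7B-parameter model fit in iPhone RAM once quantized?
# The 7B size, the 8 GB RAM figure, and the 75% usable-memory headroom are all
# illustrative assumptions, not specs of Mistral Small or any particular iPhone.

PARAMS = 7e9          # assumed parameter count
IPHONE_RAM_GB = 8     # recent Pro models; older iPhones have less
HEADROOM = 0.75       # leave room for the OS, KV cache, and activations

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    weight_gb = PARAMS * bits / 8 / 1e9   # memory needed for weights alone
    verdict = "fits" if weight_gb < IPHONE_RAM_GB * HEADROOM else "too big"
    print(f"{name}: ~{weight_gb:.1f} GB of weights -> {verdict}")
```

At 4-bit quantization the weights alone drop to roughly 3.5 GB, which is why quantized builds are the ones people actually get running on phones.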
As evidence in favour:
this type of advance in shrinking existing transformers while keeping the same performance: https://twitter.com/arankomatsuzaki/status/1624947959644278786?s=20&t=sWoi47Zz-RRprcyDC-jZog
optimal scaling of training data vs. parameter count, à la Chinchilla (see the rough arithmetic after this list)
the hefty Neural Engines that Apple keeps improving in its Apple Silicon
that we may find new architectures that work better for language
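On the Chinchilla point above: the rule of thumb is roughly 20 training tokens per parameter, with training compute of about 6·N·D FLOPs. A minimal sketch of that arithmetic, where the model sizes are arbitrary examples of phone-scale models:

```python
# Chinchilla-style rule of thumb: compute-optimal training uses roughly
# 20 tokens per parameter, and training compute is approximately 6 * N * D FLOPs.
# The model sizes below are arbitrary examples of phone-scale models.

TOKENS_PER_PARAM = 20  # Chinchilla heuristic

def chinchilla_budget(n_params: float) -> tuple[float, float]:
    """Return (compute-optimal training tokens, approximate training FLOPs)."""
    tokens = TOKENS_PER_PARAM * n_params
    flops = 6 * n_params * tokens
    return tokens, flops

for n_params in (1e9, 3e9, 7e9):
    tokens, flops = chinchilla_budget(n_params)
    print(f"{n_params / 1e9:.0f}B params: ~{tokens / 1e12:.2f}T tokens, ~{flops:.1e} FLOPs")
```

The takeaway is that pre-Chinchilla models were badly under-trained for their size, so sub-10B models still have headroom to improve just by training on more data.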