Will a large language model beat a super grandmaster playing chess by EOY 2028?

The LLM must not be limited to only playing chess. For example, if an LLM can play chess and hold a conversation, it counts. However, it cannot make use of a non-LLM-based chess engine, or non-LLM computation such as a calculator or a Python script.

A super grandmaster has a FIDE Elo rating of 2700 or more.

The super GM must be trying to win the game.

The game format is unspecified. The GM can play blindfold, play on a board whose moves are relayed to the LLM via standard notation, or anything else.

duplicate, with different resolution criteria, to

The following AI-generated updates are not binding and may contain mistakes. I will try to occasionally add clarifications to the description reflecting the changes prompted by user questions.

  • Update 2025-06-19 (PST) (AI summary of creator comment): The creator has specified that they would count an AI that is like a 'human brain in AI form' for the purposes of this market. This indicates a broad interpretation of what constitutes a 'large language model'.

  • Update 2025-06-20 (PST) (AI summary of creator comment): The creator has specified which tools and inputs are permitted for the LLM:

    • Allowed: A scratchpad for memory and calls to other instances of the same model.

    • Not allowed: A board evaluation function or an opening book provided in the prompt/context.

  • Update 2025-06-20 (PST) (AI summary of creator comment): The creator has stated they lean towards allowing the LLM's output to be constrained to only output valid moves, viewing this as a minor form of assistance.

  • Update 2025-06-20 (PST) (AI summary of creator comment): The creator has specified that a hybrid model, such as a chimera network, will not count if it contains a subnetwork that is a specialized non-LLM for chess.

  • Update 2025-06-20 (PST) (AI summary of creator comment): The creator has specified that a model architecture containing a sub-network that is a chess-specialized LLM is allowed. This sub-network can be a transformer specifically trained on chess-related "language" such as PGNs or FENs.

  • Update 2025-06-20 (PST) (AI summary of creator comment): The creator has specified that in a hybrid model, a sub-network specialized for chess must also understand English. A model where the chess sub-network is so specialized that it does not know English is considered an 'unpleasant edge-case' that may not count.

  • Update 2025-06-20 (PST) (AI summary of creator comment): The creator has clarified the criteria for hybrid models with specialized sub-networks:

    • Any sub-network, even one specialized for chess, must also be able to understand and use a non-chess based human language (e.g., English).

  • Update 2025-06-20 (PST) (AI summary of creator comment): The creator has specified that the model is not restricted to text-only interactions. As long as the model uses language, other modalities such as audio-to-audio are permitted.

  • Update 2025-06-20 (PST) (AI summary of creator comment): The creator has provided additional specifications for the model:

    • The model can be multimodal (e.g., using visual or audio inputs); it is not restricted to text only.

    • The model is not required to be autoregressive. Models with different architectures or training methods will be considered.

  • Update 2025-06-21 (PST) (AI summary of creator comment): The creator has provided further clarification on the architecture of qualifying hybrid or Mixture-of-Experts (MoE) models:

    • Likely to count: Models where specialized experts are trained together and specialization is emergent.

    • Unlikely to count: Systems where models are trained completely separately and a router or classifier directs prompts to a specialized chess model.

  • Update 2025-06-21 (PST) (AI summary of creator comment): In response to concerns that an LLM could use unlimited time to search the game tree, the creator is now considering imposing time controls on the match. Potential options mentioned include:

    • A specific time limit per move (e.g., 5 minutes).

    • Letting the super grandmaster decide the time controls for the match.


Text-only as the modality? (I.e., you put LLM instead of LMM/MLLM/ARM)

@Soaffine No, language is necessary, but if it's audio-to-audio only for some weird reason, that would be fine. Does 'language' necessarily imply text in common usage? I guess I didn't intend it that way

@Bayesian I'm more concerned about visual tokens, where it is meaningful to distinguish between a model that is only trained on language sequences with some biased wrapper for chess states (like a chess notation, or stapling a *LIP on it à la GPT-4), and a model which is trained on both language and visual tokens. I suspect that not only will the models large enough to win here be exclusively multimodal, but moreover that native multimodality will be a crucial part of winning. In general, I think "LLM" connotes and implies training on language sequences only, whereas Multimodal LLM, Large Multimodal, Multimodal, or Autoregressive model are neutral to the modalities they are trained on.

@Soaffine alright then I mean an optionally multimodal LLM, I think? Like not necessarily multimodal. Maybe like:
llm*
* multimodal counts

I would also expect it to be autoregressive, but if in 2 years all chatbots are multimodal language models that train in a different way than by being autoregressive or whatever, I would want that to still count, I think

Are the following tools allowed:

  • no-op tool for reasoning

  • scratchpad tool for long term memory

  • tool to call other instances of the same model

Can the prompt include a board evaluation function that the model could manually calculate? Can an opening book be given in the model’s context?

Some of these are not particularly realistic or useful, but still curious.

Honestly I just want to see a fleet of LLMs with an orchestrator manually running MCTS, and deploying instances to manually evaluate positions in the tree, and deploying others to play each other on leaf nodes, etc. Prob worse than fine tuning your model to have good intuitions about positions, but way cooler!
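For the curious, here is a minimal sketch of that orchestration idea, assuming the python-chess library. llm_evaluate() is a hypothetical stand-in for dispatching a model instance to judge a position, and this is flat Monte Carlo over the root moves rather than full MCTS with UCT:

```python
# Sketch of the "fleet of LLMs" orchestration described above. llm_evaluate()
# is hypothetical -- it stands in for asking a model instance to judge a
# position -- and is stubbed with a coin flip so the sketch runs standalone.
import random
import chess

def llm_evaluate(board: chess.Board, for_color: chess.Color) -> float:
    """Hypothetical: win probability for `for_color`, as judged by an LLM."""
    return random.random()  # stub

def rollout_value(board: chess.Board, for_color: chess.Color, depth: int = 4) -> float:
    """Play a few plies (stand-in for LLM-vs-LLM play), then judge the leaf."""
    board = board.copy()
    for _ in range(depth):
        if board.is_game_over():
            break
        board.push(random.choice(list(board.legal_moves)))  # stand-in for an LLM move
    return llm_evaluate(board, for_color)

def choose_move(board: chess.Board, sims_per_move: int = 8) -> chess.Move:
    """Orchestrator: average sampled leaf values behind each legal root move."""
    root_color = board.turn
    best_move, best_value = None, -1.0
    for move in board.legal_moves:
        board.push(move)
        value = sum(rollout_value(board, root_color)
                    for _ in range(sims_per_move)) / sims_per_move
        board.pop()
        if value > best_value:
            best_move, best_value = move, value
    return best_move

board = chess.Board()
print(board.san(choose_move(board)))
```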

A no-op for reasoning is basically just reasoning? In which case, fine

Scratchpad fine

Tool call to other instances of the same model fine.

Prompt cannot include board eval.

Opening book cannot be provided
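For concreteness about what is being disallowed here: a "board evaluation function" in the prompt would be something like the material-count sketch below (assuming the python-chess library; real engine evaluations add many more terms):

```python
# A minimal sketch of the kind of "board evaluation function" asked about
# above -- per the ruling here, providing something like this in the
# prompt would NOT be allowed. Uses the python-chess library.
import chess

# Conventional material values in centipawns.
PIECE_VALUES = {
    chess.PAWN: 100,
    chess.KNIGHT: 320,
    chess.BISHOP: 330,
    chess.ROOK: 500,
    chess.QUEEN: 900,
}

def material_eval(board: chess.Board) -> int:
    """Material balance from White's perspective, in centipawns."""
    score = 0
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, chess.WHITE))
        score -= value * len(board.pieces(piece_type, chess.BLACK))
    return score

print(material_eval(chess.Board()))  # 0: the starting position is balanced
```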

@BionicD0LPH1N What is a no-op tool for reasoning?

@Bayesian If you're going to allow the model to call new instances of itself I think you need to set a time limit. Otherwise this is trivially doable by just recursively exploring the entire game tree.

@IsaacKing good observation, hmmm. Picking a time limit is tricky though, and for any time limit you could just spend more compute to parallelize, right? Maybe I should not let the LLMs call new instances of themselves, but I want to avoid the market resolving NO on a technicality where people still feel like it counts as LLMs beating a super GM in a non-cheating way. Occasionally-hard-to-specify categorization, I guess. Maybe something like "5 minutes per move", or we let the super GM decide; if the super GM wants to play 5 minutes total for each, or 1 hour for each (of the super GM and the chess LLM), that would seem fine to me? Let me know if that seems like a problematic / bad way to slice it

@Bayesian Technically yes, but in practice we don't have enough compute for that. Even traditional chess playing engines like Stockfish can't explore anywhere near the whole tree, they need to rely on some heuristics. So I think any reasonable time limit would be fine. I suspect that the hardest for an LLM will be somewhere from 10-60 minutes per game.
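Rough numbers behind that point, as a sketch (standard back-of-envelope estimates, not figures from the thread):

```python
# Back-of-envelope for why "explore the entire game tree" is infeasible.
# Figures are standard rough estimates (Shannon's), not measurements.
branching_factor = 35        # typical legal moves per chess position
plies = 80                   # a typical game length in half-moves
tree_size = branching_factor ** plies
print(f"~10^{len(str(tree_size)) - 1} nodes")       # ~10^123

nodes_per_second = 1e8       # wildly generous for an LLM-orchestrated search
seconds_per_year = 3.15e7
years = tree_size / (nodes_per_second * seconds_per_year)
print(f"~{years:.0e} years at {nodes_per_second:.0e} nodes/s")
```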

Too long and the LLM starts benefiting from its perfect memory; humans just can't think of that many possibilities at once. Too short and the LLM benefits from being a computer, able to think far faster than a human. It's the middle range where the LLM will have to be what I would consider "actually intelligent", not able to benefit from the same unfair advantages that any other computer program has.

@IsaacKing A no-op tool is a dummy tool that does nothing externally and returns nothing, so it gives the model the explicit option to e.g. think without acting or to decide that “no external tool is needed” during a reasoning step. It's sometimes used in reasoning models.
[Or maybe you can consider that all reasoning models necessarily use this tool when reasoning? But some reasoning models must reason as the first step and don't get the choice to call a no-op tool, so I wouldn't really call it a no-op tool in that case. But some newer models do get the choice to think whenever, so this terminology feels more appropriate.]
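As a concrete illustration, a no-op tool might be registered like the hypothetical sketch below, written in the style of common function-calling APIs (no particular vendor's schema is implied):

```python
# Hypothetical registration of a no-op "think" tool. The schema shape is
# illustrative only; exact formats vary by vendor.
NOOP_TOOL = {
    "name": "think",
    "description": "Pause and reason privately. Takes no action and returns nothing.",
    "parameters": {
        "type": "object",
        "properties": {
            "thought": {"type": "string", "description": "Free-form reasoning."}
        },
    },
}

def handle_tool_call(name: str, arguments: dict) -> str:
    if name == "think":
        return ""  # no-op: nothing is executed externally, nothing comes back
    raise ValueError(f"unknown tool: {name}")
```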

What if the model is constrained to only output valid moves?

E.g., right now current models hallucinate illegal moves, but output shaping could fix this at runtime without making the model smarter. Would this be permissible to you, or would you consider it cheating?
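Mechanically, the simplest form of that output shaping is a runtime filter over legal moves, as in the sketch below (assuming the python-chess library; generate_candidates() is a hypothetical stand-in for sampling from the model, and stronger versions constrain the decoder's logits or grammar directly):

```python
# Sketch of runtime "output shaping": the model's answer is restricted to
# the position's legal moves. generate_candidates() is hypothetical and
# stubbed so the sketch runs standalone.
import chess

def generate_candidates(board: chess.Board) -> list[str]:
    """Stub for an LLM sampler: one hallucinated move plus a few legal ones."""
    legal = [board.san(m) for m in board.legal_moves]
    return ["Qh7#"] + legal[:3]

def constrained_move(board: chess.Board) -> chess.Move:
    legal_san = {board.san(m) for m in board.legal_moves}
    for san in generate_candidates(board):
        if san in legal_san:                 # drop hallucinated moves
            return board.parse_san(san)
    # Fall back to any legal move rather than forfeiting on a hallucination.
    return next(iter(board.legal_moves))

board = chess.Board()
print(board.san(constrained_move(board)))    # always a legal move
```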

Other markets are split on whether illegal moves will count as a forfeit to the model, for example

@KJW_01294 Is this something you think a superhuman chess LLM would plausibly need, i.e., it can't beat the super GM without it? I'd lean toward it being fine just because it seems like so little help, and I wouldn't want the market to resolve NO despite the capability clearly existing because someone constrained the model output to things it would almost certainly output anyway. But if it seems like it would be very hard for a superhuman chess LLM to play valid moves unless constrained, let me know

What qualifies as LLM? Just how much must the architecture resemble what we have now?

@patrik Do you find any loophole or flaw with this draft def?

A large language model is a neural network with millions to trillions of parameters, trained on vast amounts of text data, that can generate coherent human-like text and perform diverse language tasks through a unified learned representation of language.

@Bayesian Yes this doesn't limit the architecture much.

@Bayesian I propose that you limit it to only verbal reasoning plus no new major architectural changes (not sure how to quantify that one tbh).

@patrik Yeah, that is tricky. If it has non-verbal reasoning using internal representations I'd definitely count that, I think. It's tricky to tell which future architectural changes are "too big" before they are created. I am tentatively fine with a pretty weakly limited architecture, where it feels like the LLM is an at least partly text-based intelligence that thinks about and plays chess in a way that doesn't feel like cheating against the human, i.e. it's using its brain and reasoning, without something like a specialized chess computer program.

@Bayesian I don't think something centered around language can do it, but if it's just partially using language then sure. It might not make a lot of sense to call it an LLM though.

@patrik I think something centered around language can do it? What do you have in mind that would partially use language and be superhuman at chess, but that would be a stretch to call an LLM?

@Bayesian The human brain.

@patrik I see. Yeah, I'd want something like a human brain in AI form to count for the purposes of this market, so that doesn't seem like a problem?

@Bayesian But it'd be a stretch to call it an LLM.

@Bayesian would a chimera network count?

I think I could build an MoE model to do this in that timeframe if no one else does

@Quillist Like with one subnetwork being a chess-specialized non-LLM and another subnetwork being a non-chess-GM-level LLM? That wouldn't count. I'm not set on whether chimera networks don't count in general though; I don't know enough about them to say

@Bayesian Would it count if one subnet was a chess-specialized LLM? The chess subnet would still in essence be a transformer-based network like the other subnets, but trained on PGNs+FENs (which are technically a language)

@Quillist yes, that is allowed!

Seems weird if it is so specialized that the subnet doesn't know English, though. That would be a pretty unpleasant edge case

What about: subnets, or the entire net, can be specialized for chess, as long as each subnet can understand and make use of some non-chess human language? Thoughts?

@Bayesian When I was thinking about making my own question in this vein, I thought of the following constraints:

  1. Must be able to tell me 10 things worth doing in any major city (1 random failure is fine)

  2. Must be able to write poetry as well as llama 3.1 70b

Just so it's clear that it's not hyper-fine-tuned on chess to the point of losing all other basic low-level LLM uses

@Bayesian The whole point of subnet systems is to allow for network specialization, similar to how different brain regions specialize in different information streams.
There are many subnets in major LLMs that arguably have no understanding of English, activating only on weird encoding/spacing patterns.
Some subnets are trained specifically for coding, etc.

PGNs+FENs are formal languages as opposed to natural languages, but they are languages all the same.

@Quillist I'm not sure yet. I would think that if the nets were trained together in an MoE-like way, but the specialization was emergent, that would be fine (as it is in the brain); but if they're trained completely separately, with some logic to pick one subnet or the other depending on whether the prompt is classified as chess-related, that seems iffy to me, like it wouldn't feel to people like it should count as a superhuman-level chess LLM. hmmmmmm
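A toy contrast of those two setups, purely as a sketch (the expert, gate, and model functions are hypothetical stand-ins, not any real system):

```python
import math

# (a) Likely to count: one jointly trained MoE layer. The gate is learned
# end to end with the experts, so any chess specialization is emergent.
def moe_forward(x: list[float], experts, gate_weights) -> list[float]:
    scores = [sum(w * xi for w, xi in zip(ws, x)) for ws in gate_weights]
    total = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / total for s in scores]          # softmax gate
    outputs = [expert(x) for expert in experts]
    return [sum(p * out[i] for p, out in zip(probs, outputs))
            for i in range(len(x))]

# (b) Unlikely to count: separately trained models behind a hand-written
# router that diverts chess prompts to a dedicated chess model.
def looks_like_chess(prompt: str) -> bool:
    return any(tok in prompt for tok in ("FEN", "PGN", "e4", "checkmate"))

def dispatch(prompt: str, chess_model, general_model) -> str:
    if looks_like_chess(prompt):    # hard routing fixed outside training
        return chess_model(prompt)
    return general_model(prompt)

print(dispatch("1. e4 e5 2. Nf3 ...", lambda p: "chess expert", lambda p: "generalist"))
```

The distinction being drawn lives in where the routing comes from: in (a) it is learned jointly with the experts, so specialization emerges; in (b) it is imposed from outside training.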

To provide more grounding on why it should count:

Would the same logic apply to coding? If you had subnets that are, by design, biased to specialize on code to boost coding performance, would people dismiss it as an LLM? Going by recent history, pretty much anyone would still consider it an LLM.

And if we try to avoid the trap of using public sentiment/media definitions and go with strict academic nomenclature, it would definitely still be considered an LLM.
