Background
LiveCodeBench is a holistic, contamination-free benchmark that continuously collects fresh coding problems from LeetCode, AtCoder and Codeforces contests, then replays them inside a deterministic harness to prevent data leakage and measure real generalisation. Each evaluation window fixes a start and end date (the current one spans 454 problems released between 1 Aug 2024 and 1 May 2025) and scores models by Pass@1: the share of tasks whose very first generated solution compiles and passes the hidden tests. The benchmark also tags every problem as easy, medium or hard and reports per-tier accuracy, revealing where models still stumble.
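As a rough illustration of the metric described above, the following sketch computes overall and per-tier Pass@1 from a list of first-attempt outcomes. It assumes exactly one generated solution per problem; the function name, the tuple layout and the sample data are hypothetical and are not taken from the LiveCodeBench harness.

```python
from collections import defaultdict


def pass_at_1(results):
    """Return (overall, per_tier) Pass@1 for first-attempt outcomes.

    `results` is a list of (difficulty, passed) tuples, where `passed`
    is True if the first generated solution passed all hidden tests.
    This input layout is an assumption made for illustration only.
    """
    if not results:
        raise ValueError("results must not be empty")

    # Overall Pass@1: fraction of problems solved on the first attempt.
    overall = sum(1 for _, passed in results if passed) / len(results)

    # Group outcomes by difficulty tag to get per-tier accuracy.
    tiers = defaultdict(list)
    for difficulty, passed in results:
        tiers[difficulty].append(passed)
    per_tier = {d: sum(v) / len(v) for d, v in tiers.items()}

    return overall, per_tier


# Made-up outcomes for five problems (not real benchmark data).
sample = [
    ("easy", True),
    ("easy", True),
    ("medium", True),
    ("medium", False),
    ("hard", False),
]
overall, per_tier = pass_at_1(sample)
print(overall, per_tier)
```

On the made-up sample, three of five problems pass, and the per-tier breakdown shows the failures concentrated in the medium and hard tiers, which is the kind of signal the per-tier report is meant to expose.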
State of play (July 2025):
O4-Mini-High: 80.2% [Pass@1]
Why 95% matters
True zero-shot coding mastery. Pass@1 at 95% means the agent solves at least 19 of every 20 tasks on its first attempt, without retries or human edits, approaching the reliability expected of senior engineers.
Contamination guard. Because tasks are time-filtered, near-perfect accuracy points to genuine problem-solving rather than memorisation of training-set snippets.
Broader skill coverage. LiveCodeBench evaluates code generation, self-repair and test-output prediction; a model that aces code generation Pass@1 is likely strong on the other tracks too, hinting at near-general software autonomy.
Resolution Criteria
The market resolves to the first calendar year in which ALL of the following conditions are satisfied:
Leaderboard evidence – The public LiveCodeBench leaderboard lists a run with Pass@1 ≥ 95%, computed over all problems in the active evaluation window (currently 454 tasks, so at least 432 must be solved).
Independent verification – The claim is confirmed by either
(a) a peer-reviewed or widely-cited paper (e.g. arXiv, NeurIPS, ICSE) that releases evaluation logs, or
(b) acceptance by the LiveCodeBench maintainers as an official leaderboard entry.
Autonomy – After evaluation starts, no human may alter code. Compute, retrieval and tool use are unlimited, provided the agent invokes them automatically.
Expiry – If no qualifying run is verified by Jan 1, 2033, the market resolves “Not Applicable.”
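The criteria above can be sketched as a small resolution check. This is only an illustration under stated assumptions: the dictionary keys (`solved`, `window_size`, `independently_verified`, `autonomous`, `verified_on`) are hypothetical, "by Jan 1, 2033" is read as inclusive, and the resolution year is taken to be the year in which the run was verified. None of this is an official resolution procedure.

```python
from datetime import date

THRESHOLD_PERCENT = 95          # Pass@1 >= 95%
EXPIRY = date(2033, 1, 1)       # assumed inclusive deadline


def min_solved(window_size, threshold_percent=THRESHOLD_PERCENT):
    """Smallest number of solved problems that reaches the threshold.

    Uses integer ceiling division to avoid floating-point rounding.
    """
    return (threshold_percent * window_size + 99) // 100


def resolve(runs, today):
    """Return the resolution year as a string, "Not Applicable", or None.

    `runs` is a list of dicts with hypothetical keys describing each
    candidate run; None means the market is still open.
    """
    qualifying_dates = [
        r["verified_on"]
        for r in runs
        if r["solved"] >= min_solved(r["window_size"])   # leaderboard evidence
        and r["independently_verified"]                  # paper or maintainers
        and r["autonomous"]                              # no human edits
        and r["verified_on"] <= EXPIRY                   # before expiry
    ]
    if qualifying_dates:
        return str(min(qualifying_dates).year)
    if today > EXPIRY:
        return "Not Applicable"
    return None


print(min_solved(454))  # 432, since 95% of 454 is 431.3

# Made-up runs for illustration only.
example_runs = [
    {"solved": 431, "window_size": 454, "independently_verified": True,
     "autonomous": True, "verified_on": date(2027, 3, 1)},
    {"solved": 432, "window_size": 454, "independently_verified": True,
     "autonomous": True, "verified_on": date(2028, 6, 15)},
]
print(resolve(example_runs, today=date(2029, 1, 1)))
```

In the made-up example, the first run falls one problem short of the 432 needed on a 454-task window, so only the second run qualifies. Because the window size is stored per run, the same check still applies if the active evaluation window changes.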