Updated February 8, 2026
Research-level mathematics proved in natural language and verified in Lean 4. Problems from the First Proof benchmark — spanning stochastic analysis, representation theory, algebraic combinatorics, spectral graph theory, algebraic topology, symplectic geometry, lattices & topology, tensor analysis, and numerical linear algebra.
We demonstrate a workflow toward automated mathematical discovery: formal verification at scale makes it possible to achieve rigorous progress on open problems at the frontier of current understanding.
Each problem is solved and verified through a three-stage pipeline combining frontier AI models with formal verification.
OpenAI's GPT-5.2 Pro (extended thinking, with web search) iteratively develops a complete natural-language proof. The model identifies prior work and known barriers, then mounts rigorous attacks with numbered steps, cited theorems, and verified hypotheses. Each proof undergoes adversarial self-verification (checking quantifier order, boundary cases, and theorem applicability) before the full solution is finalized.
GPT 🔁 Codex: If the iterative GPT-5.2 process does not converge to a complete and correct proof (as for problem 6), the workflow switches to OpenAI's Codex app running GPT-5.3-Codex Extra High, with the entire GPT-5.2 conversation as input. Codex iterates on the same prompt to make computational progress on the problem; its results are then fed back to GPT-5.2 Pro (extended thinking), which continues the proof. Cycling between the iterative GPT and Codex sessions combines the strengths of both models.
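The GPT 🔁 Codex cycle described above can be sketched as a simple loop. The function names below (`run_gpt_proof`, `run_codex_computation`) are hypothetical stand-ins for the model sessions, not real APIs; the stub logic only mimics the shape of the workflow.

```python
# Hypothetical sketch of the GPT <-> Codex cycling loop.
# Both functions are illustrative stubs, not real model APIs.

def run_gpt_proof(transcript: list[str]) -> tuple[str, bool]:
    """Stub: one iterative GPT-5.2 proving session over the conversation."""
    attempt = f"proof attempt #{len(transcript)}"
    # Pretend the proof converges once computational evidence is present.
    converged = any("computation" in t for t in transcript)
    return attempt, converged

def run_codex_computation(transcript: list[str]) -> str:
    """Stub: a Codex session that returns computational results."""
    return f"computation results after {len(transcript)} messages"

def solve(max_cycles: int = 5) -> list[str]:
    transcript: list[str] = ["problem statement"]
    for _ in range(max_cycles):
        proof, converged = run_gpt_proof(transcript)
        transcript.append(proof)
        if converged:
            return transcript  # complete, correct proof reached
        # Hand the whole conversation to Codex for computational progress,
        # then feed its results back into the next GPT session.
        transcript.append(run_codex_computation(transcript))
    return transcript

transcript = solve()
```

The key design point is that the full conversation, not just the latest message, is passed between the two sessions, so each model sees the other's progress.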
OpenAI's Codex app with GPT-5.3-Codex Extra High formalizes the problem in Lean 4: first the problem statement, then the full proof. Each formalization is compiled in Lean 4, stripped of comments, and backtranslated to natural language. The backtranslation is checked for mathematical equivalence against the original, ensuring the formal version faithfully captures the intended mathematics. Missing Lean definitions are added and verified, and the final proofs compile without sorry statements or extra axioms.
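As a toy illustration of the target shape (not one of the benchmark problems), a Lean 4 formalization states the result first and then proves it with no `sorry`, so that an axiom audit reports only Lean's standard axioms:

```lean
import Mathlib

/-- Toy statement, formalized before its proof: a sum of squares
of real numbers is nonnegative. -/
theorem sq_add_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  positivity

-- Audit: should list only standard axioms, never `sorryAx`.
#print axioms sq_add_sq_nonneg
```

A proof containing `sorry` would still compile with a warning but would show `sorryAx` in the audit, which is what the pipeline's "no sorry statements or axioms" check rules out.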
Anthropic's Claude Code (with Opus 4.6 and agent teams enabled) orchestrates a team of specialized agents to explore the problem from diverse angles; the team includes Numina's Lean prover agent. The agents formalize, compile, backtranslate, and check equivalence in a loop, iterating until the Lean formalization both compiles and is mathematically correct and complete.