Formal verification demands proofs that are machine-checkable, reproducible, and sound, and AI-generated proofs rarely meet those standards without integration into established proof-checking tooling. Projects such as CompCert, led by Xavier Leroy (INRIA), and the seL4 microkernel verification, led by Gerwin Klein (Data61 and the University of New South Wales), deliver end-to-end guarantees by producing proof artifacts that are checked both by proof assistants and by independent reviewers. These efforts show the level of rigor required of software that must meet high-assurance criteria.
What formal verification requires
A proof assistant such as Coq, developed at INRIA, or Isabelle, developed with major contributions from Lawrence C. Paulson (University of Cambridge), enforces a small trusted kernel that checks every inference. The critical property is soundness of the proof checker: an accepted proof corresponds to a genuine mathematical guarantee about the code. Projects such as CompCert (Xavier Leroy, INRIA) and the verified seL4 kernel (Gerwin Klein, Data61 and the University of New South Wales) demonstrate that machine-checked proofs must connect specifications, through formal semantics, to executable artifacts in order to satisfy the certification regimes used in aerospace, medical devices, and critical infrastructure.
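To make the "small trusted kernel" idea concrete, here is a minimal sketch in Lean 4 (one of the proof assistants in this family). The theorem names are illustrative; the point is that every tactic step elaborates to a proof term that the kernel independently re-checks, so an accepted theorem carries a machine-verified guarantee.

```lean
-- A proof given directly as a term: the kernel checks that the term
-- `Nat.add_comm a b` really has the stated type (the proposition).
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A proof built interactively with tactics. Tactics are untrusted
-- automation: whatever term they construct, the trusted kernel still
-- re-verifies it before the theorem is accepted.
theorem succ_pos_example (n : Nat) : 0 < n + 1 := by
  exact Nat.succ_pos n
```

The design choice matters for assurance: bugs in tactics or automation cannot produce an accepted-but-wrong theorem, because acceptance is decided only by the small kernel.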
Role of AI and its limitations
Large language models can generate plausible proofs, suggest lemmas, and produce tactic scripts that are useful starting points for formal development. However, model outputs often lack the explicit, low-level proof objects required by proof assistants, and they can hallucinate steps that are not checkable. This gap means that raw AI-generated proofs do not by themselves meet formal verification standards; they do so only when their outputs are consumed and validated by a proof assistant such as Coq or Lean, with the resulting proof terms fully checked by the system.
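The acceptance criterion can be illustrated with a hypothetical AI-suggested script in Lean 4 (both theorems below are invented for illustration). The checker, not the model, decides what counts: a script is only as good as the kernel-checked term it produces, and any incomplete step is flagged.

```lean
-- Accepted: the tactic script elaborates to a complete proof term,
-- which the kernel verifies against the stated proposition.
theorem mul_one_example (n : Nat) : n * 1 = n := by
  simp

-- Not accepted as verified: `sorry` is a placeholder that Lean
-- flags as an unproven obligation. A file containing it cannot be
-- treated as a certified artifact, even if the claim looks plausible.
theorem unfinished_claim (n : Nat) : n ≤ n * n + 1 := by
  sorry
```

A workflow that pipes model output through the assistant therefore filters hallucinated steps automatically: a fabricated lemma name fails to elaborate, and a skipped step surfaces as an explicit open goal.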
The consequences affect safety and trust. Relying on unverified AI-generated proofs in safety-critical domains risks undetected errors with human and material impacts, particularly in regulated industries where certification bodies demand auditable evidence. Adoption by engineering teams will depend on workflows that combine human expertise, trusted toolchains, and reproducible artifacts. Experience from these projects suggests the most credible path is tool-assisted automation, in which AI accelerates proof development while proof assistants remain the ultimate arbiters of correctness.