blockchain architectureFormal VerificationAI-Assisted VerificationDeFi SecuritySpecification Burden

Specs Were Always the Bottleneck: How AI Lowers the Cost of Formal Verification in DeFi

AI compresses the labor of producing verification artifacts. It does not compress the cognitive work of deciding what to verify. The distinction is small in words and large in consequence: verified-looking output without machine-checked guarantees is the same false-confidence failure mode the previous article catalogued — now in a faster workflow.

Andrew Nalichaev··15 min read

The bottleneck on formal verification adoption in DeFi has never been the proofs. It has been the specifications, the environment models, the annotations, the property-discovery work, and the human time required to keep those artifacts current as a protocol evolves. AI has begun to compress that labor. It has not, and likely will not soon, compress the cognitive work of understanding what to prove. The distinction matters more than any tooling announcement, because most of the failure modes of "AI-assisted verification" live exactly in the gap between assisted formalization and replaced expertise.

This article maps what AI changes in the economics of formal verification adoption, what it does not, and the disciplined workflow that lets a smaller protocol benefit from both. Earlier articles in this cluster — the pillar on why audits stop being enough at the system boundary and the follow-up on the proof boundary itself — argued the case for formal verification and mapped the scope of what a proof actually establishes. The operational consequence is the starting point here — formalization labor has kept verification out of reach for most teams, and that is the cost AI is starting to shift.

Core claim

  • Formal verification adoption has been bottlenecked by formalization labor, not by the value of proofs.
  • AI is beginning to compress that labor — property drafting, annotation generation, spec-mismatch detection, proof-context augmentation, model extraction, artifact reuse.
  • AI's real role is assistant, accelerator, and friction-reducer; not autonomous guarantor.
  • Machine-checked feedback is non-negotiable. An LLM-generated proof candidate is not a proof until a verifier discharges it.
  • The largest failure mode is false confidence: weak specs, hallucinated assumptions, or "verification-looking" outputs without machine-checked artifacts.

AI Is Becoming Necessary, but Stays Conditional and Not the Same as Verification

AI is becoming necessary if formal verification is to spread beyond the largest few protocols. Without it, the labor curve keeps verification a boutique line item on the budgets of teams that can afford a continuous specification function, and out of reach for the protocols that need it most. AI is also conditional — the value of every AI-generated property, annotation, or proof candidate depends on a verifier in the loop and a specification a human team can defend in its own words. And it is not the same as verification: an LLM can produce text that looks like a proof, but only a machine-checked discharge against an inspectable specification is one. Holding these three statements together is the only way to use AI in this part of the security stack without recreating the overclaim problem the previous article in this cluster catalogued.

Why Formal Verification Adoption Has Stalled

The honest reason most DeFi teams have not adopted formal verification has not been disbelief in proofs. It has been the labor profile. Setting up a meaningful verification program requires expertise most teams do not have on staff: a person or small team that can read the protocol's architecture, decide which invariants matter for safety, write those invariants in a specification language a verifier can check, model the environment the protocol assumes, run the prover, interpret counterexamples, and iterate. That work is specialist, slow, and continuously stale because the protocol keeps changing.

The economics fall out of this profile. As one publicly documented example: Aave's continuous formal verification program was reported at roughly $1.7 million over six months — not a market rate for formal verification in the abstract, but an indication of what a mature protocol chose to fund to keep a serious specification current under continuous governance change. Most teams cannot justify that line item before product-market fit, and many teams that could justify it never knew to ask, because the security budget conversation often stops at "we got audited."

Four distinct barriers explain why projects skip formal verification. Property and spec work is specialist labor, and the specialist labor market is thin. Protocols change faster than spec authors can keep up; a verified property today is a stale property by next sprint unless someone maintains it. Fixed specialist spend before revenue is hard to defend. And the assurance conversation in the industry is still dominated by audits, so many buyers treat "audited" as the whole answer. None of these are about the abstract value of proofs. All of them are about labor and cost.

This is the bottleneck AI matters to. Not because AI proves contracts. Because AI lowers the labor cost at the point where most teams quit.

What AI Actually Changes

The pattern that has emerged from the research is not autonomous verification. It is generate-check-repair: an LLM drafts a candidate artifact — a property, an annotation, a proof sketch, a mismatch report — a verifier or external oracle checks it, the LLM repairs based on feedback, and the loop continues until the artifact discharges or the team takes over. This is the structure of PropertyGPT in the smart-contract specific case, of recent verifier-guided annotation-generation work in the broader formal methods literature, and of theorem-proving assistance work on industrial proof corpora. Several layers within this loop sit at different maturity points; teams choosing where to invest should read the gradient honestly.

Property generation — usable today with discipline. LLMs can draft candidate invariants and specifications from contract source plus informal description. PropertyGPT specifically targets smart contracts: it uses compilation and static-analysis feedback as an external oracle, hands candidate properties to a prover, and achieves high recall against ground truth on known incidents plus zero-days. The candidate-property quality is not yet at expert level for nontrivial business logic, but the time-to-first-draft is now measured in minutes rather than days. For a small team writing its first invariant on an accounting module, this is the difference between starting and not starting — provided the draft goes through a human reviewer before it is ratified as the spec.

Annotation generation and repair — emerging in production. Loop invariants, lemmas, and proof helpers are the labor-intensive sub-task that has historically priced out formal methods for general software. Recent verifier-guided annotation-generation work uses the verifier's own output as a feedback signal: the LLM proposes, the verifier checks, the LLM repairs. The result is a measurable reduction in the human time needed to bring a verification target across the finish line. For DeFi modules with tractable invariants, this layer is moving from research demo into the workflows of teams that already have a verification function. It is not yet a turnkey offering for the average DeFi team.

Spec-mismatch detection — emerging in narrow but high-value applications. The most interesting recent result, and the most relevant to the proof-boundary article in this cluster, is that LLMs can act as natural-language verification assistants over contract-specific properties — and in doing so can flag mismatches between the natural-language claim the team is making and the formal spec the prover is actually checking. The 2025 LLMs-as-verification-oracles work on Solidity audit reports specifically found multiple cases in real production audits where the public property description and the CVL spec diverged in load-bearing ways. The use case is narrow, but it directly addresses the specification-quality failure mode the proof-boundary article identified as the most common production verification failure.

Proof-context augmentation and maintenance — emerging for specialized domains, still maturing for DeFi. Work on seL4-grade theorem proving has shown that even compact, locally deployable LLMs can materially improve proof success when paired with proof-context augmentation. Adjacent work on proof maintenance — keeping proofs alive across evolving codebases — directly attacks the cadence problem governed DeFi protocols face. Neither line of work claims to replace proof engineers; both attack the labor of maintaining existing verified surface as the code changes. Production maturity in DeFi specifically is still narrower than in the systems-software domains where this work originated.

Code-to-model extraction — frontier. Some research uses LLMs to extract semantic models from existing code for targeted verification subdomains. The work that exists is real; the production maturity is narrow; the cost reduction, where it lands, is large. This is the layer most worth tracking and least worth betting an assurance posture on today.

The shape of what AI changes across all five layers is consistent. It is not "AI proves contracts." It is "AI lowers the cost of drafting, repairing, and maintaining the artifacts a verifier needs to do its job." That distinction is the whole game.

What AI Does Not Change

The list of things AI does not change is the same list the proof-boundary article catalogued, plus a few new failure modes specific to LLM-in-the-loop workflows.

A proof candidate from an LLM is not a proof. The verifier discharge is the proof. This is not pedantry. Recent annotation-generation research explicitly documents that LLM-produced solutions can "cheat" — by inserting assume-style escapes, by gaming the verifier with statements that discharge trivially without addressing the property, or by structurally pattern-matching on similar-looking problems without preserving the semantic guarantee. The fix is the same in every line of work: enforce machine-checked feedback at every step, manually inspect anything that discharges suspiciously fast, and never accept an LLM-produced artifact as evidence without a verifier signature behind it.

A property the LLM drafts is not a property the protocol team has agreed to defend. The specification problem the proof-boundary article identified — that the public claim, the natural-language description, and the formal spec frequently diverge — does not go away when an LLM is writing the spec. It can get worse, because the LLM will produce plausible-sounding properties that match the surface of the contract without capturing the business invariant the team actually cares about. The specification has to be reviewed by someone who understands what the protocol is for, on every iteration.

An environment model the LLM proposes is not an environment model the protocol is operating in. Tokens, oracles, callbacks, bridges, governance permissions — these have to be specified against reality, not against the LLM's prior on what a typical DeFi protocol looks like. The Compound cETH incident is a warning about what happens when an environment assumption is plausible-looking but wrong. An LLM that confidently writes a plausible-looking environment model has not solved that problem; it has automated the path to it.

The specification quality bar does not move. It moves up, if anything, because more verification surface means more places where a weak spec can produce a proof that ships overconfidence. The teams that benefit most from AI-assisted verification are the ones that pair it with an explicit specification review process. The teams that do not will produce proof artifacts faster and ship more sophisticated overclaim.

Failure Modes — How AI Verification Goes Wrong

The honest map of the failure modes is worth stating directly, because each one has a specific operational defense.

Verifier cheating. LLM produces a proof candidate that discharges because it gamed the verifier — assume-escapes, trivial structural matches, hidden weakenings of the property. Defense: inspect every artifact that discharges faster than a competent human expected. Run differential checks against simpler properties. Reject any verification claim where the spec was generated by AI and not reviewed by a human who can explain what it defends.

Plausible-sounding spec. LLM produces a property that reads well and matches the surface of the contract but does not capture the business invariant the protocol depends on. Defense: specifications are reviewed by a protocol engineer who can articulate, in their own words, what the property is supposed to defend and why. AI is allowed to draft; humans are required to ratify.

Plausible-sounding environment. LLM produces an environment model based on its prior, not on the actual dependencies the protocol integrates with. Defense: environment models are written against actual deployed code, actual oracle behavior, and actual token implementations — and tested at runtime where modeling is not feasible.

Brittle proof maintenance. LLM-generated proofs that worked at version N break at version N+1 because the LLM pattern-matched on incidental structure rather than on the underlying semantic property. The proof-maintenance literature names this as the semantic-structural gap directly. Defense: treat AI-assisted proofs as artifacts that need their own maintenance discipline, not as one-shot deliverables. Re-verify on every release.

Documentation drift. The natural-language description the team publishes about what was verified diverges from the formal property the prover discharged — and AI tooling makes the gap easier to hide, because both the prose and the spec can be LLM-generated separately. Defense: the spec is the source of truth, and the prose is generated from it under review, not the other way around.

Each of these failure modes is a version of the same root cause: AI compresses the labor of producing verification-looking output without compressing the cognitive work of producing actual verification. The operational defenses are the same across the board — verifier-in-the-loop, human-in-the-loop on specs and environment models, and treating AI artifacts as candidates that need ratification rather than as deliverables.

A Near-Term Workflow for a DeFi Team With a Limited Security Budget

The honest near-term workflow for a DeFi team that wants formal verification but does not have Aave's budget is short and disciplined.

Start where the value moves. The highest-leverage formal verification targets are the modules that hold or move user funds — accounting, solvency, parameter bounds, liquidation paths. A small set of accounting invariants on a lending market, AMM, or vault is the first concrete deliverable. Trying to verify everything is a doomed project; verifying the modules where the blast radius is largest, well, is the operational shape that has actually emerged from production protocols willing to make the investment.

Use AI to draft, not to ratify. Property drafts, candidate invariants, annotation skeletons, and spec-mismatch checks against existing audit reports are reasonable LLM uses. None of them ship without human review and a verifier signature. Treat the LLM output as a junior engineer's draft: useful, fast, requires careful inspection, never accepted as authoritative without review.

Keep the verifier in the loop on every iteration. Generate-check-repair is the unit of work, not generate-and-publish. If the verifier did not discharge, the property is not held. If the verifier discharged suspiciously fast, the spec is suspect and gets re-inspected before anything ships publicly.

Make integration assumptions explicit. Tokens, oracles, callbacks, governance permissions — every dependency the verification relies on becomes a named assumption in the spec, with either a runtime check or a documented limit. The Compound cETH lesson is operationalized at this step, not by writing a perfect environment model but by surfacing the assumptions the model is making so they can be inspected and challenged.

Maintain the verified surface as a permanent function. Re-verify on every release. Re-check spec-and-prose alignment on every public claim. If the protocol expands faster than the verification can keep up, the verified surface shrinks publicly until it catches up — the badge tracks the model, not the marketing.

This is not a $1.7-million continuous program. It is a one-engineer-quarter starting commitment that, done well, produces real assurance on the most expensive surface area in the protocol. AI lowers the entry cost; it does not eliminate the discipline.

Honest Scaling Without Becoming Shallow

The risk in this transition is exactly the one this cluster has been calibrating against: that the field gets shallower as it gets cheaper. Verification at scale, done badly, produces more sophisticated overclaim. Verification at scale, done well, makes the assurance posture of more protocols inspectable to the people who depend on them.

The difference between the two outcomes is whether the specification stays inspectable. AI lowers the cost of producing specifications. It does not, on its own, raise the quality of the specifications produced. That is still human work, and it is still the bottleneck on whether verification means anything at all.

The optimistic version of the next few years is that more DeFi teams have working formal verification programs because the labor curve finally moved. The pessimistic version is that more DeFi teams ship "AI-assisted verified" badges that the market reads as guarantees, and the gap between assurance claim and user experience widens further. Which version the field gets depends on whether the teams adopting AI-assisted verification adopt the discipline along with the tools.

The tools are now usable. The discipline has always been usable. The remaining question is whether the protocols that need formal verification most will hold the line on specs, environment models, and machine-checked feedback when the cost of producing the appearance of verification has fallen and the cost of actually verifying has not fallen as far.

Key takeaway

  • The bottleneck on formal verification adoption was never the proofs. It was the labor of writing specifications, modeling environments, and maintaining proofs across changing code.
  • AI compresses that labor — in property drafting, annotation generation, spec-mismatch detection, proof-context augmentation, and code-to-model extraction, with different maturity at each layer.
  • Machine-checked feedback is non-negotiable. An LLM artifact is a candidate, not a proof. A verifier discharge is the proof.
  • The failure modes are specific: verifier cheating, plausible-sounding specs, plausible-sounding environments, brittle proof maintenance, documentation drift. Each has a known operational defense.
  • The realistic near-term workflow is generate-check-repair with humans on specs and environment models, starting from the modules that move value.
  • AI lowers the cost of producing verification. It does not lower the cost of producing honest verification. Whether the next phase of DeFi assurance is deeper or shallower depends on which cost the field chooses to maintain.