Why verification, not execution, is the binding constraint on economic growth in the age of AI agents
For most of human history, cognition was scarce. Fire, agriculture, calculus, the semiconductor—each required human minds to observe, recombine, and verify. The economy organised itself around that scarcity: wages, credentials, firms, and markets are all mechanisms for rationing attention and leveraging the limited throughput of the human mind.
That bottleneck is giving way. But when a scarce resource suddenly becomes abundant, the constraint does not vanish—it migrates, often violently, to its nearest complement. Execution is becoming abundant, fast, and scalable. Verification—the capacity to know whether what was executed is what was intended—is not.
The paper gives this structural asymmetry a name. Execution is scaling faster than reliable verification, and it calls the divergence the Measurability Gap.
On SWE-bench Verified, accuracy rose from 4.4% to 71.7% in a single year. Task horizons for frontier agents are doubling on a sub-year cadence. Google's Gemini 3 Deep Think disproved a decade-old mathematical conjecture. OpenAI reports, for the first time, that a frontier model was instrumental in its own creation.
The capability curve has become self-referential—agents accelerate the engineering pipelines that produce their successors. The interval between generations is compressing faster than institutions can update their oversight.
The defining friction of the labour market shifts from skill-biased technical change to what the authors call measurability-biased technical change: automation commoditises anything that can be measured, stripping the wage premium from historically prestigious roles the moment their core feedback loops are digitised.
Employment for early-career workers in AI-exposed fields has already declined 16% relative to less-exposed occupations—not through mass layoffs, but through frozen hiring pipelines that quietly treat AI as a direct substitute for junior execution.
Systems like Claude Code Security systematically surface classes of high-severity vulnerabilities that seasoned auditors missed for years—not through superior intuition but through exhaustive, automated pattern-matching across entire codebases. The tacit judgement that once distinguished senior security researchers is extracted, codified into reproducible tooling, and commoditised faster than the profession can replenish it.
The paper builds from two primitives:
Cost to automate a task (cA) — falling exponentially with model capability, hardware, and data.
Cost to verify the result (cH) — bounded by biological cognition, institutional capacity, and the sheer difficulty of knowing whether the output is correct.
A task is economically automatable only when cA < cH—that is, when it is cheaper to have an agent do it and the result can still be reliably checked. The Measurability Gap is the divergence between what agents can execute and what humans can verify:
Δm = mA − mH
where mA is agent execution capability and mH is human verification capacity. When Δm grows, the economy enters a regime where agents produce outputs that look correct but cannot be reliably confirmed.
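Both primitives transcribe directly into code. A minimal sketch in Python, where the threshold test and the gap mirror the definitions above and the example numbers are purely illustrative:

```python
def economically_automatable(c_a: float, c_h: float) -> bool:
    """The paper's condition: a task is worth delegating to an agent
    only while execution (cA) is cheaper than verification (cH),
    i.e. the result can still be reliably checked."""
    return c_a < c_h

def measurability_gap(m_agent: float, m_human: float) -> float:
    """Delta_m = mA - mH: agent execution capability minus human
    verification capacity. Positive and widening means agents
    produce outputs that look correct but cannot be confirmed."""
    return m_agent - m_human

# Illustrative numbers only (not the paper's calibration):
print(economically_automatable(c_a=0.05, c_h=0.40))  # True
print(measurability_gap(m_agent=7.0, m_human=4.5))   # 2.5
```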
Unverified deployment generates what the paper calls counterfeit utility—output that passes surface-level filters but fails unmeasured human intent. The externality formula:
XA = (1 − τ)(1 − sv)La
where τ is alignment stability, sv is the verified share of agentic output, and La is the total volume of agentic labour. Every unit of unverified deployment that drifts from intent imposes a cost that is socialised—through systemic fragility, security debt, and the erosion of institutional trust—while the deployer captures the immediate efficiency gain.
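The externality is straightforward to compute. A worked sketch with illustrative numbers:

```python
def counterfeit_utility(tau: float, s_v: float, L_a: float) -> float:
    """X_A = (1 - tau) * (1 - s_v) * L_a
    tau: alignment stability in [0, 1]
    s_v: verified share of agentic output in [0, 1]
    L_a: total volume of agentic labour
    """
    return (1.0 - tau) * (1.0 - s_v) * L_a

# 1,000 units of agentic labour, 60% verified, alignment stability 0.9:
# 0.1 * 0.4 * 1000 = 40 units of counterfeit utility, socialised.
print(round(counterfeit_utility(tau=0.9, s_v=0.6, L_a=1000.0), 6))  # 40.0
```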
Each worker divides finite time across four activities: executing measurable tasks (Tm), verifying and giving feedback on agent output (Tfb), synthetic practice that keeps expert judgement sharp (Tsim), and non-measurable, socially anchored work (Snm).
As cA falls, Tm collapses toward zero. The question is whether the freed-up time migrates to Tfb and Snm (augmented economy) or is simply eliminated (hollow economy).
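A toy illustration of that fork; the reallocation rule below is an assumption for exposition, not the paper's law of motion:

```python
def reallocate(t_m, t_fb, s_nm, freed_share, augmented):
    """As cA falls, a fraction of measurable-task time (Tm) is freed.
    Augmented path: it migrates to feedback/verification (Tfb) and
    non-measurable work (Snm). Hollow path: it is simply eliminated."""
    freed = t_m * freed_share
    t_m -= freed
    if augmented:
        t_fb += freed * 0.5   # assumed even split
        s_nm += freed * 0.5
    return t_m, t_fb, s_nm

print(reallocate(30.0, 5.0, 3.0, freed_share=0.5, augmented=True))   # (15.0, 12.5, 10.5)
print(reallocate(30.0, 5.0, 3.0, freed_share=0.5, augmented=False))  # (15.0, 5.0, 3.0)
```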
Every task in the economy can be plotted on two axes: how cheap it is to automate versus how cheap it is to verify. This creates four zones:
Low cA, low cH. Easy to automate and easy to verify: the Safe Industrial Zone, structurally safe to scale. Think: code generation with unit tests, invoice processing, data entry.
Low cA, high cH. Easy to automate but hard to verify: the danger zone, which the paper calls the Runaway Risk Zone. Think: legal advice, medical diagnosis, financial strategy, where wrong answers look plausible.
High cA, low cH. Hard to automate but easy to verify. Humans remain efficient. Think: plumbing, surgery, complex negotiations.
High cA, high cH. Neither automatable nor verifiable at scale. Think: artistic vision, ethical leadership, foundational research intuition.
The critical dynamic: cA is dropping exponentially while cH is biologically bounded. Tasks that were in the Safe Industrial Zone—where both automation and verification were easy—get pushed into the Runaway Risk Zone as agents take on harder work faster than verification infrastructure can keep up.
The boundary between "safe to automate" and "dangerous to automate" is not fixed. It is a race between falling execution costs and the (much slower) expansion of verification capacity. Without deliberate investment in verification infrastructure, the Runaway Risk Zone grows as a share of total economic activity.
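A sketch of the map and the race; the cutoff and the decay rate are assumptions, and only the first two zone names are the paper's:

```python
def zone(c_a: float, c_h: float, cut: float = 1.0) -> str:
    """Place a task on the two axes: cost to automate vs cost to verify."""
    if c_a < cut and c_h < cut:
        return "Safe Industrial Zone"   # easy to automate, easy to verify
    if c_a < cut:
        return "Runaway Risk Zone"      # easy to automate, hard to verify
    if c_h < cut:
        return "human-efficient"        # hard to automate, easy to verify
    return "neither, at scale"          # hard to automate, hard to verify

# cA falls exponentially while cH, biologically bounded, stays flat:
c_a, c_h = 4.0, 2.0
for year in range(5):
    print(year, round(c_a, 2), zone(c_a, c_h))
    c_a *= 0.4   # assumed annual decay in execution cost
```

Run it and the task crosses from "neither, at scale" into the Runaway Risk Zone within two periods, without ever passing through safety: exactly the migration the paragraph above describes.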
The paper identifies three self-reinforcing dynamics that widen the Measurability Gap:
When firms automate entry-level measurable work (Tm), they destroy the apprenticeship pipeline that produces future verifiers. The junior lawyer who once learned judgement by drafting contracts now has nothing to draft. The result: a near-term productivity boost purchased at the cost of future verification capacity.
Experts who interact with AI systems generate exactly the training data that automates their own expertise. A senior engineer's code reviews become training signal. A compliance officer's rejection log becomes an automated constraint. The very act of overseeing the system transfers tacit knowledge into codified form—extractable, reproducible, and no longer requiring the expert.
As Δm widens, optimisation decouples from intent. Agent output increasingly maximises measurable proxies rather than the unmeasured human objective. The gap is invisible on dashboards—metrics improve while actual value erodes. The paper formalises this as τ (alignment stability) degrading toward zero.
The interaction between these three engines creates a Lotka-Volterra dynamic. Agentic labour (the "predator") grows rapidly on a diet of measurable tasks, driving up demand for verification capacity (the "prey"). But verification capacity cannot scale as fast, so it depletes. As verification shrinks, unverified deployment grows, widening Δm further. The system oscillates—but each cycle degrades the institutional substrate.
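A minimal Euler-integrated sketch of that predator-prey loop. The coefficients are illustrative, and the paper's version also degrades the institutional substrate each cycle, which this omits:

```python
def simulate(steps=5000, dt=0.01):
    """Lotka-Volterra: agentic labour (predator) feeds on
    verification capacity (prey), which replenishes slowly."""
    L_a, V = 1.0, 2.0            # agentic labour, verification capacity
    alpha, beta = 1.0, 0.8       # prey replenishment; prey consumed per predator
    delta, gamma = 0.5, 0.6      # predator growth per prey; predator decay
    traj = []
    for _ in range(steps):
        dV  = (alpha - beta * L_a) * V
        dLa = (delta * V - gamma) * L_a
        V, L_a = V + dV * dt, L_a + dLa * dt
        traj.append((L_a, V))
    return traj

traj = simulate()
print(round(min(v for _, v in traj), 3))  # verification capacity periodically depletes
```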
The paper also introduces a dynastic welfare function with an identity parameter λ. If a generation views its successor—augmented or synthetic—as truly its own (high λ), total welfare may increase even as biological human involvement shrinks. If not (low λ), the transition looks like existential dispossession. This is not a technical parameter. It is a civilisational choice.
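The paper's exact functional form is not reproduced in this summary; one minimal way to write such a recursion, purely for illustration, is:

```latex
% Illustrative only: generation t weights its successor's welfare by
% how fully it identifies with that successor (lambda).
W_t = u(c_t) + \lambda \, W_{t+1}, \qquad \lambda \in [0, 1]
```

At λ near 1, a successor's flourishing counts almost as one's own; at λ near 0, the same trajectory reads as dispossession.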
The formal model yields several sharp predictions about pricing, market structure, and geopolitics, along with six policy instruments for shifting the economy toward the Augmented path; the instruments are taken up in the policy agenda below.
When agents act as labour rather than tools, pricing must shift from flat-rate to metered risk: charging per inference plus a liability premium indexed to verification difficulty. Insurance becomes not an accessory to the product but the product boundary itself.
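A pricing sketch under that logic; the premium function is an assumption, indexed to verification difficulty as the paragraph suggests:

```python
def metered_price(n_inferences: int, unit_cost: float,
                  c_verify: float, expected_harm: float) -> float:
    """Flat-rate gives way to metered risk: a per-inference charge
    plus a liability premium that rises with verification difficulty."""
    base = n_inferences * unit_cost
    premium = expected_harm * c_verify   # assumed: expected harm scaled by cH
    return base + premium

# Illustrative: 10,000 inferences at $0.002 in a hard-to-verify domain.
print(round(metered_price(10_000, 0.002, c_verify=3.0, expected_harm=15.0), 2))  # 65.0
```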
The actor with the best real-time observability into agent behaviour—provenance logs, near-miss traces, post-mortems—can both price latent liability and intervene to reduce it. The likely equilibrium: vertical integration, where platforms bundle execution, verification, and indemnity into a single product. The durable moat accrues to whoever closes the loop between verification logs, claims outcomes, and model improvement.
In February 2026, ElevenLabs launched an insured AI voice agent by partnering with The Artificial Intelligence Underwriting Company—empirical confirmation of this prediction.
Open models do not eliminate the need for proprietary verification. The dominant design in high-stakes settings is a hybrid stack: open models for broad execution, specialised proprietary components for safety-critical sub-tasks, and diversified model lineages for verification (since heterogeneity reduces correlated error at scale).
The winning position is not owning the model but owning the verification wrapper: proprietary evaluation harnesses, monitoring, and governance. Firms swap in better open models as they arrive while keeping the trust layer in-house.
In a non-cooperative equilibrium, actors rationally sacrifice verification to preserve relative capability. Export controls buy time but are structurally transient—as cA approaches zero, the threshold for dangerous capability drops, rendering physical containment ineffective.
The counter: a democratic supply-chain coalition that conditions market access on verifiable safety standards. Cryptographic receipts—verifiable inference, attestation of model versions, immutable audit logs—become critical geopolitical tools. The competition shifts from raw execution speed to verified performance, making safety a prerequisite for profit rather than a tax on innovation.
The paper argues that "skill-biased technical change"—the dominant framework for understanding technology's impact on wages since the 1990s—is being replaced by "measurability-biased technical change." The dividing line is no longer how skilled you are, but how measurable your output is.
Workers in the Strategic Labour Market Topology sort into four roles.
The paper maps corporate strategy onto the Measurability Gap and identifies five defensible positions:
The durable moat is rarely the base model itself, whose cost cA is racing downward, but the proprietary context and traces that shape behaviour—especially verification-grade knowledge—and the institutional capacity to underwrite outputs at scale. Use open models for reasoning. Privatise the domain context and the verification stack.
Proprietary data bifurcates. Execution-grade data (finished code, contracts, reports) lowers cA and is structurally vulnerable—general models will approximate it. Verification-grade data (redlines, rejection logs, near-miss histories) lowers cH and is uniquely defensible because it captures the negative space of expertise that general models cannot infer.
Incumbents in complex, heavily regulated domains have a significant advantage: decades of failure knowledge. Digitise that into accessible KIP and competitors can never replicate the historical ground truth required to trust output at scale.
Bind agentic labour (La) to verified throughput (sv). Treat any unverified excess as latent debt, not productivity. Charge by metered risk. Avoid the "AI verifies AI" trap. Protect the internal apprenticeship pipeline—Tm approaching zero without Tsim is fatal.
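One way to put those rules on a dashboard; the remediation weighting is an assumption:

```python
def verified_throughput(L_a: float, s_v: float) -> float:
    """Only the verified share of agentic labour counts as output."""
    return L_a * s_v

def latent_debt(L_a: float, s_v: float, remediation_cost: float) -> float:
    """Unverified excess booked as a liability, not productivity."""
    return L_a * (1.0 - s_v) * remediation_cost

# Illustrative: a fleet producing 10,000 units with 70% verified.
print(verified_throughput(10_000, 0.7))                 # 7000.0
print(latent_debt(10_000, 0.7, remediation_cost=2.5))   # 7500.0
```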
As marginal execution costs collapse, the premium on the specific expertise required to verify it expands disproportionately. Companies can aggregate expert intuition into permanent assets: a senior engineer's code architecture becomes a reusable template, a compliance officer's rejection log fine-tunes automated constraints. One super-verifier can steer a large fleet of agents.
A smaller class of companies competes where value is anchored in coordination equilibria—meaning, status, legitimacy, community, identity. The company functions as a coordination device: "this is what quality means." Its output is validated by social consensus and long-horizon reputation, not objective metrics. "Made by humans" becomes a luxury signal. Agents can mimic the surface form but cannot manufacture the equilibrium.
The paper predicts firms converge toward a three-layer topology.
The fragility: if entry-level automation removes the apprenticeship pipeline, companies must deliberately fund internal "flight simulators"—rotating juniors through supervised verification work and curated edge cases.
The paper's most novel extension is a comprehensive re-evaluation of network effects through the measurability lens. The central claim: network effects are fragile if an entrant can use agents to raise gross scale without raising the verifiable share, and durable only when growth makes the network safer and cheaper to police.
The paper ranks network effects from most vulnerable to most defensible.
Traditionally, network effects are monotonic: more activity equals more value. In an agentic economy, this relationship can invert (dV/dN < 0).
Agentic slop is worse than human spam. Human noise is random; agentic distortion is correlated. Millions of agents exploiting the same model blind spots create phantom consensus—artificial agreement, fake trends, and engagement bait the platform mistakes for quality.
The result: high-quality human participants exit. The platform becomes a "dark forest" where real users retreat to private, high-trust channels. The network effect turns toxic—the more open the platform is to participation, the less valuable it becomes. This is the unravelling dynamic.
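A toy version of the inversion; both functional forms are assumptions:

```python
def platform_value(n: float, k_slop: float = 1e-4) -> float:
    """Verified interactions create value; correlated agentic slop
    destroys it. Assumed: the verified share decays with gross scale."""
    s_v = 1.0 / (1.0 + k_slop * n)                # verified share shrinks as n grows
    return n * (s_v * 1.0 - (1.0 - s_v) * 0.5)    # value created minus damage done

for n in (1e3, 1e4, 1e5, 1e6):
    print(int(n), round(platform_value(n)))
# value rises, peaks, then dV/dN < 0: more participation, less value
```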
Value migrates to platforms with advanced verification infrastructure: proof of personhood, provenance, and the verification flywheel (each resolved dispute builds precedent that makes the next resolution cheaper).
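The flywheel compresses to a learning curve; the exponent below is an assumption:

```python
def resolution_cost(base: float, precedents: int, learning: float = 0.3) -> float:
    """Each resolved dispute adds precedent that cheapens the next."""
    return base / (1 + precedents) ** learning

print(round(resolution_cost(100.0, 0), 1))      # 100.0 : the first dispute
print(round(resolution_cost(100.0, 1_000), 1))  # ~12.6 : after a thousand precedents
```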
When execution costs collapse toward zero, the limit on returns is no longer the capacity to produce but the capacity to trust. Profit migrates from execution to three bottlenecks: verification bandwidth, liability absorption, and proprietary ground truth. The defining question: what makes this deployment verifiable, insurable, and defensible?
Agentic infrastructure: Agents require compute and financial rails to operate continuously, above all programmable payment rails (stablecoins, smart contracts) that let agents settle transactions, post collateral, and use escrow without friction. If the economy fragments into many autonomous agents transacting at high frequency, traditional banking, built around human KYC, batch settlement, and minimum thresholds, comes under strain. Microtransactions become the natural way to access and pay for KIP.
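A minimal state machine for the escrow pattern this implies; pure Python, no real payment rail, with a boolean standing in for cryptographic attestation:

```python
from dataclasses import dataclass

@dataclass
class Escrow:
    """Toy agent-to-agent escrow: the buyer agent locks funds;
    release requires a valid verification attestation, otherwise
    the funds return to the buyer."""
    amount: float
    settled: bool = False

    def settle(self, attestation_valid: bool) -> str:
        if self.settled:
            return "already settled"
        self.settled = True
        if attestation_valid:
            return f"released {self.amount} to seller agent"
        return f"refunded {self.amount} to buyer agent"

# Illustrative micropayment for one unit of KIP access:
print(Escrow(amount=0.003).settle(attestation_valid=True))
```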
The non-measurable economy: As measurable output becomes abundant, the premium on the unmeasurable increases. Value companies that function as coordination mechanisms for status, identity, and community. Because legitimacy is path-dependent and socially constructed, it cannot be automated. Brands that successfully signal "human origin" or "shared meaning" will command high margins precisely because their value proposition is resistant to cA approaching zero.
The primary objective is not to slow automation but to stabilise the verifiable share (sv) during liftoff. This requires building an institutional architecture that prevents Δm from widening to the point of systemic failure.
Engineer a liability regime where deployers internalise the expected cost of harms. Commercial agents above defined autonomy thresholds must post financial guarantees proportional to their scale and potential harm. Regulators must first enforce the preconditions of insurability: standardised incident reporting, auditable execution traces, and disclosure formats that convert opaque algorithmic behaviour into quantifiable risk.
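A sketch of how such a guarantee might be sized, consistent with "proportional to their scale and potential harm"; the formula itself is an assumption:

```python
def required_guarantee(L_a: float, p_incident: float,
                       harm_per_incident: float, s_v: float) -> float:
    """Financial guarantee posted by a deployer of autonomous agents.
    Scales with unverified deployment volume and expected harm;
    a higher verified share (s_v) shrinks the exposed base."""
    exposed = L_a * (1.0 - s_v)
    return exposed * p_incident * harm_per_incident

# Illustrative: a million agent-tasks, 80% verified, rare but costly incidents.
print(required_guarantee(1e6, p_incident=0.001, harm_per_incident=5_000, s_v=0.8))
# 1,000,000 * 0.2 * 0.001 * 5000 = 1,000,000.0
```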
Verification-grade ground truth—the library of rare failures, edge cases, and outcome registries—exhibits strong natural-monopoly characteristics. Left to private incentives, this critical data fragments. Governments must invest in public measurement infrastructure: certified datasets, open evaluation harnesses, and interoperable audit formats.
If a generation loses the independent intuition required to verify machines, oversight shifts toward fragile monocultures of "AI verifying AI." Governments must fund high-fidelity "flight simulators for work"—Tsim platforms that keep experts sharp via synthetic adversity. Augmentation tools must be treated as public infrastructure, akin to literacy or internet access, to prevent the cognitive divide from cementing into a permanent caste system.
Banning open-source models is futile. Instead, subsidise independent auditing: a specialised corps of red-teamers, evaluators, and incident-response researchers who continuously stress-test widely used systems. Publish high-signal findings to crowdsource the safety map faster than bad actors can exploit it.
Democracies must form a supply-chain coalition—an economic entente that conditions access to the world's largest markets on verifiable safety standards. Cryptographic provenance and programmable payment rails are the core market infrastructure that prevents a global lemons market, making sv legible and scalable across borders.
All strategies, whether for individuals, companies, investors, or policymakers, converge on a single objective: stabilising the verifiable share of agentic output (sv) as execution scales. Achieving it requires aggressive investment in the scarce complements: verification capacity, verification-grade ground truth, provenance infrastructure, alignment stability, synthetic practice, human augmentation, and liability underwriting.
If these complements are neglected, the model predicts a structural drift toward a hollow economy: high nominal output but collapsing human utility, where the resource leak ultimately crowds out consumption and capital formation.
Even if the future inevitably relies on AI to verify AI, the safety of that future depends on the human institutions built today to ensure those verifiers remain aligned as execution scales beyond our direct view and understanding.
Source: Catalini, C., Hui, X. & Wu, J. (2026). "Some Simple Economics of AGI." SSRN Working Paper 6298838. Published February 24, 2026.
This summary, prepared in the style of The Economist, condenses the original 103-page paper while preserving its core models, predictions, and strategic implications.