Why verification, not execution, is the binding constraint on economic growth in the age of AI agents
For most of human history, cognition was scarce. Fire, agriculture, calculus, the semiconductor—each required human minds to observe, recombine, and verify. The economy organised itself around that scarcity: wages, credentials, firms, and markets are all mechanisms for rationing attention and leveraging the limited throughput of the human mind.
That bottleneck is giving way. But when a scarce resource suddenly becomes abundant, the constraint does not vanish—it migrates, often violently, to its nearest complement. Execution is becoming abundant, fast, and scalable. Verification—the capacity to know whether what was executed is what was intended—is not.
The paper gives this structural asymmetry a name. Execution is scaling faster than reliable verification, and it calls the divergence the Measurability Gap.
On SWE-bench Verified, accuracy rose from 4.4% to 71.7% in a single year. Task horizons for frontier agents are doubling on a sub-year cadence. Google's Gemini 3 Deep Think disproved a decade-old mathematical conjecture. OpenAI reports, for the first time, that a frontier model was instrumental in its own creation.
The capability curve has become self-referential—agents accelerate the engineering pipelines that produce their successors. The interval between generations is compressing faster than institutions can update their oversight.
The defining friction of the labour market shifts from skill-biased technical change to what the authors call measurability-biased technical change: automation commoditises anything that can be measured, stripping the wage premium from historically prestigious roles the moment their core feedback loops are digitised.
Employment for early-career workers in AI-exposed fields has already declined 16% relative to less-exposed occupations—not through mass layoffs, but through frozen hiring pipelines that quietly treat AI as a direct substitute for junior execution.
Systems like Claude Code Security systematically surface classes of high-severity vulnerabilities that seasoned auditors missed for years—not through superior intuition but through exhaustive, automated pattern-matching across entire codebases. The tacit judgement that once distinguished senior security researchers is extracted, codified into reproducible tooling, and commoditised faster than the profession can replenish it.
The paper builds from two primitives:
Cost to automate a task (cA) — falling exponentially with model capability, hardware, and data.
Cost to verify the result (cH) — bounded by biological cognition, institutional capacity, and the sheer difficulty of knowing whether the output is correct.
A task is economically automatable only when cA < cH—that is, when it is cheaper to have an agent do it and the result can still be reliably checked. The Measurability Gap is the divergence between what agents can execute and what humans can verify:
Δm = mA − mH
where mA is agent execution capability and mH is human verification capacity. When Δm grows, the economy enters a regime where agents produce outputs that look correct but cannot be reliably confirmed.
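Both primitives transcribe directly into code. A minimal sketch in Python, where the threshold test and the gap mirror the definitions above and the example numbers are purely illustrative:

```python
def economically_automatable(c_a: float, c_h: float) -> bool:
    """The paper's condition: a task is worth delegating to an agent
    only while execution (cA) is cheaper than verification (cH),
    i.e. the result can still be reliably checked."""
    return c_a < c_h

def measurability_gap(m_agent: float, m_human: float) -> float:
    """Delta_m = mA - mH: agent execution capability minus human
    verification capacity. Positive and widening means agents
    produce outputs that look correct but cannot be confirmed."""
    return m_agent - m_human

# Illustrative numbers only (not the paper's calibration):
print(economically_automatable(c_a=0.05, c_h=0.40))  # True
print(measurability_gap(m_agent=7.0, m_human=4.5))   # 2.5
```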
Unverified deployment generates what the paper calls counterfeit utility—output that passes surface-level filters but fails unmeasured human intent. The externality formula:
XA = (1 − τ)(1 − sv)La
where τ is alignment stability, sv is the verified share of agentic output, and La is the total volume of agentic labour. Every unit of unverified deployment that drifts from intent imposes a cost that is socialised—through systemic fragility, security debt, and the erosion of institutional trust—while the deployer captures the immediate efficiency gain.
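The externality is straightforward to compute. A worked sketch with illustrative numbers:

```python
def counterfeit_utility(tau: float, s_v: float, L_a: float) -> float:
    """X_A = (1 - tau) * (1 - s_v) * L_a
    tau: alignment stability in [0, 1]
    s_v: verified share of agentic output in [0, 1]
    L_a: total volume of agentic labour
    """
    return (1.0 - tau) * (1.0 - s_v) * L_a

# 1,000 units of agentic labour, 60% verified, alignment stability 0.9:
# 0.1 * 0.4 * 1000 = 40 units of counterfeit utility, socialised.
print(round(counterfeit_utility(tau=0.9, s_v=0.6, L_a=1000.0), 6))  # 40.0
```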
Each worker divides finite time across four activities: executing measurable tasks (Tm), verifying and giving feedback on agent output (Tfb), synthetic practice that keeps expert judgement sharp (Tsim), and non-measurable, socially anchored work (Snm).
As cA falls, Tm collapses toward zero. The question is whether the freed-up time migrates to Tfb and Snm (augmented economy) or is simply eliminated (hollow economy).
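A toy illustration of that fork; the reallocation rule below is an assumption for exposition, not the paper's law of motion:

```python
def reallocate(t_m, t_fb, s_nm, freed_share, augmented):
    """As cA falls, a fraction of measurable-task time (Tm) is freed.
    Augmented path: it migrates to feedback/verification (Tfb) and
    non-measurable work (Snm). Hollow path: it is simply eliminated."""
    freed = t_m * freed_share
    t_m -= freed
    if augmented:
        t_fb += freed * 0.5   # assumed even split
        s_nm += freed * 0.5
    return t_m, t_fb, s_nm

print(reallocate(30.0, 5.0, 3.0, freed_share=0.5, augmented=True))   # (15.0, 12.5, 10.5)
print(reallocate(30.0, 5.0, 3.0, freed_share=0.5, augmented=False))  # (15.0, 5.0, 3.0)
```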
Every task in the economy can be plotted on two axes: how cheap it is to automate versus how cheap it is to verify. This creates four zones:
Low cA, low cH. Easy to automate and easy to verify: the Safe Industrial Zone, structurally safe to scale. Think: code generation with unit tests, invoice processing, data entry.
Low cA, high cH. Easy to automate but hard to verify: the danger zone, which the paper calls the Runaway Risk Zone. Think: legal advice, medical diagnosis, financial strategy, where wrong answers look plausible.
High cA, low cH. Hard to automate but easy to verify. Humans remain efficient. Think: plumbing, surgery, complex negotiations.
High cA, high cH. Neither automatable nor verifiable at scale. Think: artistic vision, ethical leadership, foundational research intuition.
The critical dynamic: cA is dropping exponentially while cH is biologically bounded. Tasks that were in the Safe Industrial Zone—where both automation and verification were easy—get pushed into the Runaway Risk Zone as agents take on harder work faster than verification infrastructure can keep up.
The boundary between "safe to automate" and "dangerous to automate" is not fixed. It is a race between falling execution costs and the (much slower) expansion of verification capacity. Without deliberate investment in verification infrastructure, the Runaway Risk Zone grows as a share of total economic activity.
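A sketch of the map and the race; the cutoff and the decay rate are assumptions, and only the first two zone names are the paper's:

```python
def zone(c_a: float, c_h: float, cut: float = 1.0) -> str:
    """Place a task on the two axes: cost to automate vs cost to verify."""
    if c_a < cut and c_h < cut:
        return "Safe Industrial Zone"   # easy to automate, easy to verify
    if c_a < cut:
        return "Runaway Risk Zone"      # easy to automate, hard to verify
    if c_h < cut:
        return "human-efficient"        # hard to automate, easy to verify
    return "neither, at scale"          # hard to automate, hard to verify

# cA falls exponentially while cH, biologically bounded, stays flat:
c_a, c_h = 4.0, 2.0
for year in range(5):
    print(year, round(c_a, 2), zone(c_a, c_h))
    c_a *= 0.4   # assumed annual decay in execution cost
```

Run it and the task crosses from "neither, at scale" into the Runaway Risk Zone within two periods, without ever passing through safety: exactly the migration the paragraph above describes.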
The paper identifies three self-reinforcing dynamics that widen the Measurability Gap:
When firms automate entry-level measurable work (Tm), they destroy the apprenticeship pipeline that produces future verifiers. The junior lawyer who once learned judgement by drafting contracts now has nothing to draft. The result: a near-term productivity boost purchased at the cost of future verification capacity.
Experts who interact with AI systems generate exactly the training data that automates their own expertise. A senior engineer's code reviews become training signal. A compliance officer's rejection log becomes an automated constraint. The very act of overseeing the system transfers tacit knowledge into codified form—extractable, reproducible, and no longer requiring the expert.
As Δm widens, optimisation decouples from intent. Agent output increasingly maximises measurable proxies rather than the unmeasured human objective. The gap is invisible on dashboards—metrics improve while actual value erodes. The paper formalises this as τ (alignment stability) degrading toward zero.
The interaction between these three engines creates a Lotka-Volterra dynamic. Agentic labour (the "predator") grows rapidly on a diet of measurable tasks, driving up demand for verification capacity (the "prey"). But verification capacity cannot scale as fast, so it depletes. As verification shrinks, unverified deployment grows, widening Δm further. The system oscillates—but each cycle degrades the institutional substrate.
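A minimal Euler-integrated sketch of that predator-prey loop. The coefficients are illustrative, and the paper's version also degrades the institutional substrate each cycle, which this omits:

```python
def simulate(steps=5000, dt=0.01):
    """Lotka-Volterra: agentic labour (predator) feeds on
    verification capacity (prey), which replenishes slowly."""
    L_a, V = 1.0, 2.0            # agentic labour, verification capacity
    alpha, beta = 1.0, 0.8       # prey replenishment; prey consumed per predator
    delta, gamma = 0.5, 0.6      # predator growth per prey; predator decay
    traj = []
    for _ in range(steps):
        dV  = (alpha - beta * L_a) * V
        dLa = (delta * V - gamma) * L_a
        V, L_a = V + dV * dt, L_a + dLa * dt
        traj.append((L_a, V))
    return traj

traj = simulate()
print(round(min(v for _, v in traj), 3))  # verification capacity periodically depletes
```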
The paper also introduces a dynastic welfare function with an identity parameter λ. If a generation views its successor—augmented or synthetic—as truly its own (high λ), total welfare may increase even as biological human involvement shrinks. If not (low λ), the transition looks like existential dispossession. This is not a technical parameter. It is a civilisational choice.
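The paper's exact functional form is not reproduced in this summary; one minimal way to write such a recursion, purely for illustration, is:

```latex
% Illustrative only: generation t weights its successor's welfare by
% how fully it identifies with that successor (lambda).
W_t = u(c_t) + \lambda \, W_{t+1}, \qquad \lambda \in [0, 1]
```

At λ near 1, a successor's flourishing counts almost as one's own; at λ near 0, the same trajectory reads as dispossession.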
The formal model yields several sharp predictions about pricing, market structure, and geopolitics, along with six policy instruments for shifting the economy toward the Augmented path; the instruments are taken up in the policy agenda below.
When agents act as labour rather than tools, pricing must shift from flat-rate to metered risk: charging per inference plus a liability premium indexed to verification difficulty. Insurance becomes not an accessory to the product but the product boundary itself.
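A pricing sketch under that logic; the premium function is an assumption, indexed to verification difficulty as the paragraph suggests:

```python
def metered_price(n_inferences: int, unit_cost: float,
                  c_verify: float, expected_harm: float) -> float:
    """Flat-rate gives way to metered risk: a per-inference charge
    plus a liability premium that rises with verification difficulty."""
    base = n_inferences * unit_cost
    premium = expected_harm * c_verify   # assumed: expected harm scaled by cH
    return base + premium

# Illustrative: 10,000 inferences at $0.002 in a hard-to-verify domain.
print(round(metered_price(10_000, 0.002, c_verify=3.0, expected_harm=15.0), 2))  # 65.0
```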
The actor with the best real-time observability into agent behaviour—provenance logs, near-miss traces, post-mortems—can both price latent liability and intervene to reduce it. The likely equilibrium: vertical integration, where platforms bundle execution, verification, and indemnity into a single product. The durable moat accrues to whoever closes the loop between verification logs, claims outcomes, and model improvement.
In February 2026, ElevenLabs launched an insured AI voice agent by partnering with The Artificial Intelligence Underwriting Company—empirical confirmation of this prediction.
Open models do not eliminate the need for proprietary verification. The dominant design in high-stakes settings is a hybrid stack: open models for broad execution, specialised proprietary components for safety-critical sub-tasks, and diversified model lineages for verification (since heterogeneity reduces correlated error at scale).
The winning position is not owning the model but owning the verification wrapper: proprietary evaluation harnesses, monitoring, and governance. Firms swap in better open models as they arrive while keeping the trust layer in-house.
In a non-cooperative equilibrium, actors rationally sacrifice verification to preserve relative capability. Export controls buy time but are structurally transient—as cA approaches zero, the threshold for dangerous capability drops, rendering physical containment ineffective.
The counter: a democratic supply-chain coalition that conditions market access on verifiable safety standards. Cryptographic receipts—verifiable inference, attestation of model versions, immutable audit logs—become critical geopolitical tools. The competition shifts from raw execution speed to verified performance, making safety a prerequisite for profit rather than a tax on innovation.
The paper argues that "skill-biased technical change"—the dominant framework for understanding technology's impact on wages since the 1990s—is being replaced by "measurability-biased technical change." The dividing line is no longer how skilled you are, but how measurable your output is.
Workers in the Strategic Labour Market Topology sort into four roles.
The paper maps corporate strategy onto the Measurability Gap and identifies five defensible positions:
The durable moat is rarely the base model itself, whose cost cA is racing downward, but the proprietary context and traces that shape behaviour—especially verification-grade knowledge—and the institutional capacity to underwrite outputs at scale. Use open models for reasoning. Privatise the domain context and the verification stack.
Proprietary data bifurcates. Execution-grade data (finished code, contracts, reports) lowers cA and is structurally vulnerable—general models will approximate it. Verification-grade data (redlines, rejection logs, near-miss histories) lowers cH and is uniquely defensible because it captures the negative space of expertise that general models cannot infer.
Incumbents in complex, heavily regulated domains have a significant advantage: decades of failure knowledge. Digitise that into accessible KIP and competitors can never replicate the historical ground truth required to trust output at scale.
Bind agentic labour (La) to verified throughput (sv). Treat any unverified excess as latent debt, not productivity. Charge by metered risk. Avoid the "AI verifies AI" trap. Protect the internal apprenticeship pipeline—Tm approaching zero without Tsim is fatal.
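One way to put those rules on a dashboard; the remediation weighting is an assumption:

```python
def verified_throughput(L_a: float, s_v: float) -> float:
    """Only the verified share of agentic labour counts as output."""
    return L_a * s_v

def latent_debt(L_a: float, s_v: float, remediation_cost: float) -> float:
    """Unverified excess booked as a liability, not productivity."""
    return L_a * (1.0 - s_v) * remediation_cost

# Illustrative: a fleet producing 10,000 units with 70% verified.
print(verified_throughput(10_000, 0.7))                 # 7000.0
print(latent_debt(10_000, 0.7, remediation_cost=2.5))   # 7500.0
```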
As marginal execution costs collapse, the premium on the specific expertise required to verify it expands disproportionately. Companies can aggregate expert intuition into permanent assets: a senior engineer's code architecture becomes a reusable template, a compliance officer's rejection log fine-tunes automated constraints. One super-verifier can steer a large fleet of agents.
A smaller class of companies competes where value is anchored in coordination equilibria—meaning, status, legitimacy, community, identity. The company functions as a coordination device: "this is what quality means." Its output is validated by social consensus and long-horizon reputation, not objective metrics. "Made by humans" becomes a luxury signal. Agents can mimic the surface form but cannot manufacture the equilibrium.
The paper predicts firms converge toward a three-layer topology.
The fragility: if entry-level automation removes the apprenticeship pipeline, companies must deliberately fund internal "flight simulators"—rotating juniors through supervised verification work and curated edge cases.
The paper's most novel extension is a comprehensive re-evaluation of network effects through the measurability lens. The central claim: network effects are fragile if an entrant can use agents to raise gross scale without raising the verifiable share, and durable only when growth makes the network safer and cheaper to police.
The paper ranks network effects from most vulnerable to most defensible.
Traditionally, network effects are monotonic: more activity equals more value. In an agentic economy, this relationship can invert (dV/dN < 0).
Agentic slop is worse than human spam. Human noise is random; agentic distortion is correlated. Millions of agents exploiting the same model blind spots create phantom consensus—artificial agreement, fake trends, and engagement bait the platform mistakes for quality.
The result: high-quality human participants exit. The platform becomes a "dark forest" where real users retreat to private, high-trust channels. The network effect turns toxic—the more open the platform is to participation, the less valuable it becomes. This is the unravelling dynamic.
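A toy version of the inversion; both functional forms are assumptions:

```python
def platform_value(n: float, k_slop: float = 1e-4) -> float:
    """Verified interactions create value; correlated agentic slop
    destroys it. Assumed: the verified share decays with gross scale."""
    s_v = 1.0 / (1.0 + k_slop * n)                # verified share shrinks as n grows
    return n * (s_v * 1.0 - (1.0 - s_v) * 0.5)    # value created minus damage done

for n in (1e3, 1e4, 1e5, 1e6):
    print(int(n), round(platform_value(n)))
# value rises, peaks, then dV/dN < 0: more participation, less value
```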
Value migrates to platforms with advanced verification infrastructure: proof of personhood, provenance, and the verification flywheel (each resolved dispute builds precedent that makes the next resolution cheaper).
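The flywheel compresses to a learning curve; the exponent below is an assumption:

```python
def resolution_cost(base: float, precedents: int, learning: float = 0.3) -> float:
    """Each resolved dispute adds precedent that cheapens the next."""
    return base / (1 + precedents) ** learning

print(round(resolution_cost(100.0, 0), 1))      # 100.0 : the first dispute
print(round(resolution_cost(100.0, 1_000), 1))  # ~12.6 : after a thousand precedents
```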
When execution costs collapse toward zero, the limit on returns is no longer the capacity to produce but the capacity to trust. Profit migrates from execution to three bottlenecks: verification bandwidth, liability absorption, and proprietary ground truth. The defining question: what makes this deployment verifiable, insurable, and defensible?
Agentic infrastructure: Agents require compute and financial rails to operate continuously, above all programmable payment rails (stablecoins, smart contracts) that let agents settle transactions, post collateral, and use escrow without friction. If the economy fragments into many autonomous agents transacting at high frequency, traditional banking, built around human KYC, batch settlement, and minimum thresholds, comes under strain. Microtransactions become the natural way to access and pay for KIP.
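A minimal state machine for the escrow pattern this implies; pure Python, no real payment rail, with a boolean standing in for cryptographic attestation:

```python
from dataclasses import dataclass

@dataclass
class Escrow:
    """Toy agent-to-agent escrow: the buyer agent locks funds;
    release requires a valid verification attestation, otherwise
    the funds return to the buyer."""
    amount: float
    settled: bool = False

    def settle(self, attestation_valid: bool) -> str:
        if self.settled:
            return "already settled"
        self.settled = True
        if attestation_valid:
            return f"released {self.amount} to seller agent"
        return f"refunded {self.amount} to buyer agent"

# Illustrative micropayment for one unit of KIP access:
print(Escrow(amount=0.003).settle(attestation_valid=True))
```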
The non-measurable economy: As measurable output becomes abundant, the premium on the unmeasurable increases. Value companies that function as coordination mechanisms for status, identity, and community. Because legitimacy is path-dependent and socially constructed, it cannot be automated. Brands that successfully signal "human origin" or "shared meaning" will command high margins precisely because their value proposition is resistant to cA approaching zero.
The primary objective is not to slow automation but to stabilise the verifiable share (sv) during liftoff. This requires building an institutional architecture that prevents Δm from widening to the point of systemic failure.
Engineer a liability regime where deployers internalise the expected cost of harms. Commercial agents above defined autonomy thresholds must post financial guarantees proportional to their scale and potential harm. Regulators must first enforce the preconditions of insurability: standardised incident reporting, auditable execution traces, and disclosure formats that convert opaque algorithmic behaviour into quantifiable risk.
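A sketch of how such a guarantee might be sized, consistent with "proportional to their scale and potential harm"; the formula itself is an assumption:

```python
def required_guarantee(L_a: float, p_incident: float,
                       harm_per_incident: float, s_v: float) -> float:
    """Financial guarantee posted by a deployer of autonomous agents.
    Scales with unverified deployment volume and expected harm;
    a higher verified share (s_v) shrinks the exposed base."""
    exposed = L_a * (1.0 - s_v)
    return exposed * p_incident * harm_per_incident

# Illustrative: a million agent-tasks, 80% verified, rare but costly incidents.
print(required_guarantee(1e6, p_incident=0.001, harm_per_incident=5_000, s_v=0.8))
# 1,000,000 * 0.2 * 0.001 * 5000 = 1,000,000.0
```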
Verification-grade ground truth—the library of rare failures, edge cases, and outcome registries—exhibits strong natural-monopoly characteristics. Left to private incentives, this critical data fragments. Governments must invest in public measurement infrastructure: certified datasets, open evaluation harnesses, and interoperable audit formats.
If a generation loses the independent intuition required to verify machines, oversight shifts toward fragile monocultures of "AI verifying AI." Governments must fund high-fidelity "flight simulators for work"—Tsim platforms that keep experts sharp via synthetic adversity. Augmentation tools must be treated as public infrastructure, akin to literacy or internet access, to prevent the cognitive divide from cementing into a permanent caste system.
Banning open-source models is futile. Instead, subsidise independent auditing: a specialised corps of red-teamers, evaluators, and incident-response researchers who continuously stress-test widely used systems. Publish high-signal findings to crowdsource the safety map faster than bad actors can exploit it.
Democracies must form a supply-chain coalition—an economic entente that conditions access to the world's largest markets on verifiable safety standards. Cryptographic provenance and programmable payment rails are the core market infrastructure that prevents a global lemons market, making sv legible and scalable across borders.
All strategies, whether for individuals, companies, investors, or policymakers, converge on a single objective: stabilising the verifiable share of agentic output (sv) as execution scales. Achieving it requires aggressive investment in the scarce complements: verification capacity, verification-grade ground truth, provenance infrastructure, alignment stability, synthetic practice, human augmentation, and liability underwriting.
If these complements are neglected, the model predicts a structural drift toward a hollow economy: high nominal output but collapsing human utility, where the resource leak ultimately crowds out consumption and capital formation.
Even if the future inevitably relies on AI to verify AI, the safety of that future depends on the human institutions built today to ensure those verifiers remain aligned as execution scales beyond our direct view and understanding.
Source: Catalini, C., Hui, X. & Wu, J. (2026). "Some Simple Economics of AGI." SSRN Working Paper 6298838. Published February 24, 2026.
This summary, prepared in the style of The Economist, condenses the original 103-page paper while preserving its core models, predictions, and strategic implications.