Research note · June 2026

Beyond FTE Equivalence

A work calculus for measuring AI system throughput.

Timothy Walsh · CTO & Co-Founder, HALO · 12 min read
FTE metrics measure what AI replaces; they cannot capture what AI creates.

The problem

When organizations adopt AI systems, the first question from leadership is nearly always: "How many FTEs does this replace?" The question is intuitive, grounded in decades of SaaS ROI models, and entirely wrong.

FTE displacement assumes a one-to-one mapping between AI output and human labor. That framing suffers from three critical failures.

  • It cannot measure created work. A substantial proportion of tasks completed by autonomous AI agents were never performed by humans — not because they lacked value, but because the coordination cost, staffing overhead, or cognitive load made them infeasible. There is no FTE to replace for a task that was never done.
  • It flattens heterogeneous effort. An AI agent summarizing a three-paragraph email and an agent synthesizing 18 months of customer churn data across four databases are treated identically if both "replace one analyst-hour." The depth, breadth, and reliability of these outputs are entirely different.
  • It has no confidence dimension. Human work products carry implicit quality signals — seniority, review processes, institutional knowledge. AI outputs carry none of these by default. FTE math tells you how much was done but never how much to trust it.

Background

Solow's famous observation that "you can see the computer age everywhere but in the productivity statistics" has found a modern echo in AI adoption. Part of this paradox is metrical: we are measuring AI output with instruments calibrated for human labor, and the mismatch produces a systematic undercount.

Brynjolfsson, Li and Raymond (2023) documented significant productivity gains from generative AI in customer support, while noting aggregate statistics have yet to reflect them. Eloundou et al. (2023) estimated that roughly 80% of the U.S. workforce could have at least 10% of their tasks affected by LLMs. Noy and Zhang (2023) found that ChatGPT reduced completion time on writing tasks by about 40%. Dell'Acqua et al. (2023) documented a "jagged technological frontier" where AI dramatically outperforms on some tasks while failing on adjacent ones.

These studies share a common frame: they measure AI by its effect on existing human tasks. That is valuable but incomplete. It cannot account for tasks AI makes feasible for the first time.

Agrawal, Gans and Goldfarb proposed viewing AI as a "prediction machine" that reduces the cost of prediction, restructuring tasks around cheap predictions and expensive judgment. Our framework builds on that task-decomposition view but adds quantitative dimensions: each AI-completed task carries measurable Effort, Confidence, and Scope rather than being a binary "automated or not."

The Work Calculus

Three continuous, composable dimensions for measuring AI system output.

Effort (Æ) — how much work the system performed; the depth and breadth of processing applied to a given task. Scored on a normalized 0–1 scale: Low (0.00–0.30), Medium (0.40–0.60), High (0.70–1.00).

Confidence (ℂ) — how trustworthy the output is; a reliability signal based on execution success, source verification, and error presence. Scored 0–1: Weak (0.00–0.30), Moderate (0.40–0.60), Strong (0.70–1.00).

Scope (Σ) — how large the task was in absolute terms; the magnitude of the work unit, independent of how well or reliably it was performed. Unlike Effort and Confidence, Scope is not normalized to [0, 1]. It is an extensive quantity: the scope of two tasks is the sum of their individual scopes.
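As a rough illustration of how the three dimensions might be recorded per task, the sketch below uses a hypothetical `TaskScore` record and a `band` helper; neither name comes from the framework itself. The band edges are copied from the ranges above. Because those ranges leave 0.30–0.40 and 0.60–0.70 unassigned, the helper returns `None` there rather than guessing. The non-negative check on Scope is an assumption, not something the definitions state.

```python
from dataclasses import dataclass
from typing import Optional

# Band edges copied from the ranges stated in the text.
BANDS = ((0.00, 0.30), (0.40, 0.60), (0.70, 1.00))
EFFORT_LABELS = ("Low", "Medium", "High")
CONFIDENCE_LABELS = ("Weak", "Moderate", "Strong")


def band(value: float, labels: tuple) -> Optional[str]:
    """Return the label whose range contains value, or None if none does."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    value = round(value, 2)  # scores are reported to two decimals
    for (low, high), label in zip(BANDS, labels):
        if low <= value <= high:
            return label
    return None  # value falls between the stated ranges


@dataclass(frozen=True)
class TaskScore:
    """Hypothetical per-task record; field names are illustrative."""
    effort: float      # normalized, 0..1
    confidence: float  # normalized, 0..1
    scope: float       # extensive, not normalized; adds across tasks

    def __post_init__(self):
        band(self.effort, EFFORT_LABELS)          # raises if out of range
        band(self.confidence, CONFIDENCE_LABELS)  # raises if out of range
        if self.scope < 0:                        # assumption: no negative scope
            raise ValueError("scope must be non-negative")


a = TaskScore(effort=0.82, confidence=0.76, scope=3.0)
b = TaskScore(effort=0.25, confidence=0.90, scope=1.5)
combined_scope = a.scope + b.scope  # Scope is additive across tasks
```

Keeping Scope as a plain number while Effort and Confidence are range-checked mirrors the distinction drawn above between an extensive quantity and two normalized ones.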

Design principles

  • Texture over ceilings. Scoring functions should spread typical results across a band (roughly 0.55–0.90) rather than clustering near the maximum; exceptional outcomes (>0.95) are rare, and degraded results (<0.50) are immediately apparent.
  • Precision creates signal. Scores are two-decimal values on [0, 1]. The difference between 0.76 and 0.82 is meaningful; on a ten-point scale both round to 8/10 and the difference disappears.
  • Composability. Individual task scores aggregate into meaningful session, weekly, and monthly throughput figures.
  • Independence from human baselines. No dimension is defined relative to "how long a human would take." Human-equivalent translations are derived after scoring, never baked in.

Throughput

The composite metric is scope-weighted effort. Total Throughput (𝒯) is the sum across tasks of Scope × Effort. Mean Confidence (C̄) is the average Confidence across those tasks. Together, 𝒯 and C̄ form a two-axis summary: how much work was done and how reliable it was.
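The two-axis summary can be sketched in a few lines. This is a minimal sketch, not the framework's implementation: the `Task` tuple, the `summarize` function, and the sample values are all hypothetical, and it assumes "average" means an unweighted arithmetic mean over tasks.

```python
from typing import Iterable, NamedTuple, Tuple


class Task(NamedTuple):
    """Hypothetical per-task scores; names are illustrative."""
    scope: float       # extensive magnitude of the work unit
    effort: float      # normalized, 0..1
    confidence: float  # normalized, 0..1


def summarize(tasks: Iterable[Task]) -> Tuple[float, float]:
    """Return (total throughput, mean confidence) for a set of tasks."""
    tasks = list(tasks)
    if not tasks:
        raise ValueError("at least one task is required")
    # Total Throughput: sum over tasks of Scope x Effort.
    throughput = sum(t.scope * t.effort for t in tasks)
    # Mean Confidence: assumed here to be an unweighted average.
    mean_confidence = sum(t.confidence for t in tasks) / len(tasks)
    return throughput, mean_confidence


# Made-up values for one session.
session = [
    Task(scope=10.0, effort=0.80, confidence=0.90),
    Task(scope=4.0, effort=0.50, confidence=0.70),
    Task(scope=1.0, effort=0.20, confidence=0.80),
]
total, c_bar = summarize(session)
print(f"throughput={total:.2f} mean_confidence={c_bar:.2f}")
```

Because throughput is a plain sum, the same function applies unchanged to a session, a week, or a month of tasks, which is what makes the figures composable.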

The "unique-to-AI" proportion may be the most strategically important figure on the dashboard.

A representative dashboard view might report Total Throughput of 2,847 (+18% vs. prior), Mean Confidence of 0.84 (stable), task counts by tier (38 Complex · 201 Medium · 173 Simple), a derived human-context translation of ~6.2 analyst-weeks, and a "unique to AI" figure indicating that 47% of completed tasks had no prior workflow. That unique-to-AI proportion represents work that would not exist without the AI system — tasks that were never staffed, never prioritized, never even conceived as actionable. No FTE displacement model can account for it, because there is no FTE to displace.

Why FTE fails

Let H(t) denote the set of tasks performed by human workers in period t, and A(t) the set of tasks completed by the AI system. FTE equivalence implicitly assumes A(t) ⊆ H(t). In practice this does not hold: the set difference A(t) \ H(t), the tasks the AI performs that have no human-labor analog, is non-empty. FTE equivalence assigns zero value to every task in it.
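The set argument maps directly onto set operations. In the sketch below the task names are invented for illustration, and the "unique to AI" share is assumed to be counted per task rather than weighted by Scope.

```python
# Hypothetical task labels for one period; none come from a real deployment.
human_tasks = {  # H(t)
    "weekly churn report",
    "ticket triage",
    "invoice reconciliation",
}
ai_tasks = {  # A(t)
    "ticket triage",
    "invoice reconciliation",
    "full-log anomaly sweep",
    "overnight alert summary",
}

# A(t) \ H(t): work with no human-labor analog, invisible to FTE math.
created = ai_tasks - human_tasks

# A(t) ∩ H(t): the only part of AI output an FTE model can value.
fte_visible = ai_tasks & human_tasks

# Share of AI-completed tasks that had no prior human workflow.
unique_to_ai = len(created) / len(ai_tasks)

print(sorted(created))
print(f"unique to AI: {unique_to_ai:.0%}")
```

The subset assumption holds only when `created` is empty; any non-empty result is work that an FTE model values at zero.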

Four categories of created work recur in deployments:

  • Below-threshold tasks — work too small to justify a meeting, a ticket, or a calendar block, but valuable in aggregate.
  • Temporal-access tasks — work performed at 2 AM or during holidays when no human team is available.
  • Exhaustive-search tasks — analysis that requires reviewing every record in a dataset, not a sample; feasible for AI, impractical for humans.
  • Compounding-context tasks — work that benefits from perfect recall of prior interactions, eliminating the ramp-up cost of human task-switching.

Quadrant map

The Æ × ℂ space partitions into four action labels that tell an operator what to do with the output.

  • SUFFICIENT — Low Æ, high ℂ. A straightforward query with a reliable answer. The system didn't need to work hard; the output is trustworthy.
  • READY — High Æ, high ℂ. The system performed thorough work and is confident in the result. Use the output as-is.
  • CLARIFY — Low Æ, low ℂ. The system couldn't do much and isn't confident. Rephrase or provide more context.
  • SECOND_PASS — High Æ, low ℂ. The system performed extensive work but encountered uncertainty. Review carefully before acting.
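The four labels above reduce to two comparisons. The text does not say where "low" ends and "high" begins, so the 0.50 cut in this sketch is an assumption, as are the function and parameter names.

```python
def quadrant(effort: float, confidence: float, cut: float = 0.50) -> str:
    """Map an (Effort, Confidence) pair to one of the four action labels.

    `cut` is a hypothetical dividing line between low and high; the
    source does not specify one.
    """
    for name, value in (("effort", effort), ("confidence", confidence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    high_effort = effort >= cut
    high_confidence = confidence >= cut
    if high_confidence:
        return "READY" if high_effort else "SUFFICIENT"
    return "SECOND_PASS" if high_effort else "CLARIFY"


for scores in [(0.20, 0.90), (0.85, 0.88), (0.15, 0.30), (0.80, 0.35)]:
    print(scores, quadrant(*scores))
```

A deployment might prefer separate cuts for each axis, or a deliberately unlabeled middle band; a single shared threshold is simply the smallest version that reproduces the four labels.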

The AAC precedent

Augmentative and alternative communication (AAC) systems — technologies that help individuals with communication disabilities express themselves — faced an analogous measurement problem decades ago. Early AAC evaluation metrics asked "how closely does this approximate natural speech?" — a displacement frame that systematically undervalued the technology.

An AAC device that enabled a nonverbal individual to express preferences, make decisions, and participate in education was producing communication that had no prior baseline, yet a speech-equivalence metric would score it as a poor substitute for typical fluency. The field matured when evaluation shifted to communicative competence: how much meaning was successfully conveyed, across how many contexts, with what reliability. That is measurement native to the technology's actual contribution.

Conclusion

The Flintstones counted feet on the ground. The Jetsons measure thrust.

The FTE equivalence model served the early phase of AI adoption. As AI systems mature into autonomous agents with heterogeneous tool use, variable-depth processing, and the ability to perform work that has no human precedent, the measurement framework must mature with them.

The Work Calculus — Effort (Æ), Confidence (ℂ), and Scope (Σ) — provides a native, composable language for measuring AI throughput. It captures what FTE displacement cannot: the total work mass moved by the system, the reliability of that work, and critically, the proportion of work that exists only because the AI system made it possible.

About the author

Timothy Walsh

CTO & Co-Founder, HALO

Tim is the CTO and co-founder of HALO. He writes about the measurement, architecture, and economics of autonomous AI systems.
