The Three Metrics That Matter
Evidence Level B: Metrics derived from NASA-TLX (validated since 1988) and calibration research in judgment psychology.
1. Overestimation Delta (Δ)
What it is: The gap between what teams claim AI delivered versus what it actually delivered.
Why it matters: Teams overestimate AI productivity gains by 20-40% on average. Without measurement, overconfidence compounds sprint over sprint.
Healthy range: Δ under 10%. Above 15% signals systematic overconfidence requiring intervention.
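As a minimal sketch of Δ tracking (assuming Δ is the percentage-point gap between claimed and measured savings; the text doesn't pin down an exact formula, so treat this formulation as one reasonable choice):

```python
def overestimation_delta(claimed_savings_pct: float, actual_savings_pct: float) -> float:
    """Assumed formulation: Δ = claimed minus actual, in percentage points."""
    return claimed_savings_pct - actual_savings_pct

def delta_status(delta: float) -> str:
    # Thresholds from the text: healthy under 10%, intervene above 15%.
    if delta < 10:
        return "healthy"
    if delta <= 15:
        return "watch"
    return "intervene"
```

A team that claims 40% savings but measures 25% would log Δ = 15, landing in the "watch" band.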
2. Micro-TLX Score
What it is: A 2-slider workload check (mental demand + frustration) after AI-assisted tasks.
Why it matters: AI can save time while increasing cognitive load. If TLX climbs while time drops, you're trading visible efficiency for invisible burnout.
Healthy range: TLX under 50. Above 65 indicates unsustainable cognitive burden.
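A sketch of the two-slider check, assuming both sliders use NASA-TLX's 0-100 scale and the composite is an unweighted mean (the weighting is an assumption, not specified above):

```python
def micro_tlx(mental_demand: float, frustration: float) -> float:
    """Composite workload score (assumed: unweighted mean of two 0-100 sliders)."""
    for value in (mental_demand, frustration):
        if not 0 <= value <= 100:
            raise ValueError("slider values must be on a 0-100 scale")
    return (mental_demand + frustration) / 2

def tlx_status(score: float) -> str:
    # Thresholds from the text: healthy under 50, unsustainable above 65.
    if score < 50:
        return "healthy"
    if score <= 65:
        return "elevated"
    return "unsustainable"
```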
3. Time-to-Passed-Review
What it is: Elapsed time from task start to approval by a human reviewer.
Why it matters: AI-generated work often requires more revision cycles. Fast generation plus slow review equals no real savings.
Healthy range: Same or better than manual baseline. Longer review times signal quality debt.
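The full-cycle comparison above can be sketched directly (the function names are illustrative, not from any particular tracking tool):

```python
from datetime import datetime, timedelta

def time_to_passed_review(started: datetime, approved: datetime) -> timedelta:
    # Full cycle: generation plus every revision round until a human approves.
    return approved - started

def review_time_ratio(ai_cycle: timedelta, manual_baseline: timedelta) -> float:
    # Above 1.0 means the AI-assisted cycle is slower than manual
    # once review time is included.
    return ai_cycle / manual_baseline
```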
Where Delusions Hide

- "We're 3x faster" — Ask: Faster at generation or at passed review? Count revisions.
- "Quality is the same" — Ask: Who measured? When? Against what rubric?
- "Everyone loves it" — Ask: What's the TLX score? Enthusiasm ≠ sustainability.
- "It works for everything" — Ask: Which tasks show negative Δ? There are always some.
- "We don't need to track anymore" — Ask: When did you last measure? Drift happens fast.
30-Day Measurement Plan
Week 1: Baseline
- [ ] Select 3 representative tasks (1 routine, 1 complex, 1 novel)
- [ ] Run each task manually; record time and quality score
- [ ] Collect TLX after each task
- [ ] Document the rubric you used for quality
Deliverable: Baseline data for 3 tasks with time, quality, and TLX.
Week 2: AI Trials
- [ ] Run same 3 tasks with AI assistance
- [ ] Record claimed vs. actual time savings
- [ ] Collect TLX immediately after each task
- [ ] Calculate Δ for each task
Deliverable: Side-by-side comparison with Δ calculated.
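The Week 2 deliverable can be as simple as a small script over the three trial tasks. The task names and numbers below are hypothetical placeholders, and Δ is assumed to be the percentage-point gap between claimed and measured savings:

```python
# Hypothetical Week 2 trial data: claimed vs. measured savings, in percent.
trials = {
    "routine": {"claimed_savings": 50, "actual_savings": 30},
    "complex": {"claimed_savings": 40, "actual_savings": 35},
    "novel":   {"claimed_savings": 60, "actual_savings": 20},
}

for task, t in trials.items():
    delta = t["claimed_savings"] - t["actual_savings"]  # assumed Δ: percentage-point gap
    print(f"{task}: Δ = {delta} pts")
```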
Week 3: Analysis
- [ ] Review which tasks kept Δ in the healthy range (under 10%)
- [ ] Identify tasks where TLX increased despite time savings
- [ ] Flag tasks where review cycles increased
- [ ] Draft recommendations: expand, restrict, or retrain
Deliverable: Task classification (green/yellow/red) with rationale.
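One way to sketch the green/yellow/red classification, assuming the text's thresholds map onto the traffic lights as below (the exact combination rule is a judgment call, not prescribed above):

```python
def classify_task(delta_pct: float, tlx: float, review_slower: bool) -> str:
    """Assumed mapping of the kit's thresholds onto a traffic-light call."""
    if delta_pct < 10 and tlx < 50 and not review_slower:
        return "green"   # expand AI use
    if delta_pct > 15 or tlx > 65:
        return "red"     # restrict or retrain
    return "yellow"      # mixed signals; monitor
```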
Week 4: Institutionalize
- [ ] Add Δ tracking to sprint retrospectives
- [ ] Create dashboard for ongoing TLX monitoring
- [ ] Set thresholds for intervention triggers
- [ ] Schedule monthly calibration review
Deliverable: Measurement system operational with clear owners.
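The intervention triggers above can start as a plain threshold check run in each retrospective; this sketch reuses the text's stated limits and nothing more:

```python
def retro_alerts(avg_delta: float, avg_tlx: float) -> list[str]:
    """Flag retrospective-level intervention triggers from the kit's thresholds."""
    alerts = []
    if avg_delta > 15:
        alerts.append("Δ above 15%: schedule a calibration review")
    if avg_tlx > 65:
        alerts.append("TLX above 65: workload intervention needed")
    return alerts
```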
Key Questions for Leadership
Use these in your next AI review meeting:
- "What's our average Δ across tracked tasks?"
  - If they can't answer, measurement isn't happening.
- "Which tasks show negative ROI when we include review time?"
  - Forces honest accounting of the full cycle.
- "What's the TLX trend over the last month?"
  - Catches burnout before it becomes turnover.
- "When did we last update our quality rubrics?"
  - Rubrics drift; AI output changes. Both need recalibration.
Further Reading
- The Evaluator's Edge — Build the judgment skills to catch AI errors
- Fair Trial Methodology — Run experiments that produce trustworthy results
- Overestimation Delta Explained — Deep dive on measurement mechanics
- Delta Logging in Sprints — Practical templates for tracking Δ
“This kit aligns with the AI CogniFit Methodology and Validity Framework. All recommended metrics have been tested across multiple organizations.”