The Three Metrics That Matter
Evidence Level B: Metrics derived from NASA-TLX (validated since 1988) and calibration research in judgment psychology.
1. Overestimation Delta (Δ)
What it is: The gap between what teams claim AI delivered versus what it actually delivered.
Why it matters: Teams overestimate AI productivity gains by 20-40% on average. Without measurement, overconfidence compounds sprint over sprint.
Healthy range: Δ under 10%. Above 15% signals systematic overconfidence requiring intervention.
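As a minimal sketch of Δ tracking (assuming Δ is the percentage-point gap between claimed and measured savings; the text doesn't pin down an exact formula, so treat this formulation as one reasonable choice):

```python
def overestimation_delta(claimed_savings_pct: float, actual_savings_pct: float) -> float:
    """Assumed formulation: Δ = claimed minus actual, in percentage points."""
    return claimed_savings_pct - actual_savings_pct

def delta_status(delta: float) -> str:
    # Thresholds from the text: healthy under 10%, intervene above 15%.
    if delta < 10:
        return "healthy"
    if delta <= 15:
        return "watch"
    return "intervene"
```

A team that claims 40% savings but measures 25% would log Δ = 15, landing in the "watch" band.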
2. Micro-TLX Score
What it is: A 2-slider workload check (mental demand + frustration) after AI-assisted tasks.
Why it matters: AI can save time while increasing cognitive load. If TLX climbs while time drops, you're trading visible efficiency for invisible burnout.
Healthy range: TLX under 50. Above 65 indicates unsustainable cognitive burden.
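A sketch of the two-slider check, assuming both sliders use NASA-TLX's 0-100 scale and the composite is an unweighted mean (the weighting is an assumption, not specified above):

```python
def micro_tlx(mental_demand: float, frustration: float) -> float:
    """Composite workload score (assumed: unweighted mean of two 0-100 sliders)."""
    for value in (mental_demand, frustration):
        if not 0 <= value <= 100:
            raise ValueError("slider values must be on a 0-100 scale")
    return (mental_demand + frustration) / 2

def tlx_status(score: float) -> str:
    # Thresholds from the text: healthy under 50, unsustainable above 65.
    if score < 50:
        return "healthy"
    if score <= 65:
        return "elevated"
    return "unsustainable"
```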
3. Time-to-Passed-Review
What it is: Elapsed time from task start to approval by a human reviewer.
Why it matters: AI-generated work often requires more revision cycles. Fast generation plus slow review equals no real savings.
Healthy range: Same or better than manual baseline. Longer review times signal quality debt.
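The full-cycle comparison above can be sketched directly (the function names are illustrative, not from any particular tracking tool):

```python
from datetime import datetime, timedelta

def time_to_passed_review(started: datetime, approved: datetime) -> timedelta:
    # Full cycle: generation plus every revision round until a human approves.
    return approved - started

def review_time_ratio(ai_cycle: timedelta, manual_baseline: timedelta) -> float:
    # Above 1.0 means the AI-assisted cycle is slower than manual
    # once review time is included.
    return ai_cycle / manual_baseline
```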
Where Delusions Hide

- "We're 3x faster" — Ask: Faster at generation or at passed review? Count revisions.
- "Quality is the same" — Ask: Who measured? When? Against what rubric?
- "Everyone loves it" — Ask: What's the TLX score? Enthusiasm ≠ sustainability.
- "It works for everything" — Ask: Which tasks show negative Δ? There are always some.
- "We don't need to track anymore" — Ask: When did you last measure? Drift happens fast.
30-Day Measurement Plan
Week 1: Baseline
- [ ] Select 3 representative tasks (1 routine, 1 complex, 1 novel)
- [ ] Run each task manually; record time and quality score
- [ ] Collect TLX after each task
- [ ] Document the rubric you used for quality
Deliverable: Baseline data for 3 tasks with time, quality, and TLX.
Week 2: AI Trials
- [ ] Run same 3 tasks with AI assistance
- [ ] Record claimed vs. actual time savings
- [ ] Collect TLX immediately after each task
- [ ] Calculate Δ for each task
Deliverable: Side-by-side comparison with Δ calculated.
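The Week 2 deliverable can be as simple as a small script over the three trial tasks. The task names and numbers below are hypothetical placeholders, and Δ is assumed to be the percentage-point gap between claimed and measured savings:

```python
# Hypothetical Week 2 trial data: claimed vs. measured savings, in percent.
trials = {
    "routine": {"claimed_savings": 50, "actual_savings": 30},
    "complex": {"claimed_savings": 40, "actual_savings": 35},
    "novel":   {"claimed_savings": 60, "actual_savings": 20},
}

for task, t in trials.items():
    delta = t["claimed_savings"] - t["actual_savings"]  # assumed Δ: percentage-point gap
    print(f"{task}: Δ = {delta} pts")
```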
Week 3: Analysis
- [ ] Review which tasks kept Δ in the healthy range (under 10%)
- [ ] Identify tasks where TLX increased despite time savings
- [ ] Flag tasks where review cycles increased
- [ ] Draft recommendations: expand, restrict, or retrain
Deliverable: Task classification (green/yellow/red) with rationale.
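One way to sketch the green/yellow/red classification, assuming the text's thresholds map onto the traffic lights as below (the exact combination rule is a judgment call, not prescribed above):

```python
def classify_task(delta_pct: float, tlx: float, review_slower: bool) -> str:
    """Assumed mapping of the kit's thresholds onto a traffic-light call."""
    if delta_pct < 10 and tlx < 50 and not review_slower:
        return "green"   # expand AI use
    if delta_pct > 15 or tlx > 65:
        return "red"     # restrict or retrain
    return "yellow"      # mixed signals; monitor
```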
Week 4: Institutionalize
- [ ] Add Δ tracking to sprint retrospectives
- [ ] Create dashboard for ongoing TLX monitoring
- [ ] Set thresholds for intervention triggers
- [ ] Schedule monthly calibration review
Deliverable: Measurement system operational with clear owners.
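The intervention triggers above can start as a plain threshold check run in each retrospective; this sketch reuses the text's stated limits and nothing more:

```python
def retro_alerts(avg_delta: float, avg_tlx: float) -> list[str]:
    """Flag retrospective-level intervention triggers from the kit's thresholds."""
    alerts = []
    if avg_delta > 15:
        alerts.append("Δ above 15%: schedule a calibration review")
    if avg_tlx > 65:
        alerts.append("TLX above 65: workload intervention needed")
    return alerts
```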
Key Questions for Leadership
Use these in your next AI review meeting:
- "What's our average Δ across tracked tasks?"
  - If they can't answer, measurement isn't happening.
- "Which tasks show negative ROI when we include review time?"
  - Forces honest accounting of the full cycle.
- "What's the TLX trend over the last month?"
  - Catches burnout before it becomes turnover.
- "When did we last update our quality rubrics?"
  - Rubrics drift; AI output changes. Both need recalibration.
Further Reading
- The Evaluator's Edge — Build the judgment skills to catch AI errors
- Fair Trial Methodology — Run experiments that produce trustworthy results
- Overestimation Delta Explained — Deep dive on measurement mechanics
- Delta Logging in Sprints — Practical templates for tracking Δ
“This kit aligns with the AI CogniFit Methodology and Validity Framework. All recommended metrics have been tested across multiple organizations.”