The One-Page Executive Summary
When your team presents AI evaluation results, two metrics carry most of the signal:
1. Overestimation Delta (Δ)
What it measures: The gap between what the team believes the AI accomplished and what it actually delivered.
- Green (under 5%): Team accurately estimates AI capabilities
- Yellow (5-15%): Mild overconfidence, needs calibration
- Red (over 15%): Significant overestimation, risk of quality issues
What to ask: "What specific tasks showed the highest Δ, and what's our mitigation plan?"
2. Task Load Index (TLX)
What it measures: Perceived workload across six dimensions (mental demand, physical demand, temporal demand, performance, effort, frustration).
- Low (under 40): Task feels manageable, team can sustain pace
- Medium (40-70): Increased effort required, monitor for fatigue
- High (over 70): Unsustainable workload, burnout risk
What to ask: "Which TLX dimensions spiked, and how are we addressing them?"
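As a rough illustration of how one TLX number can come out of the six dimensions, the sketch below assumes the unweighted "Raw TLX" variant, where the overall score is the plain average of six 0-100 ratings. The ratings, the function names, and the choice of Raw TLX are assumptions for illustration; your team's evaluation may weight the dimensions differently.

```python
from statistics import mean

# The six workload dimensions named above.
DIMENSIONS = ("mental", "physical", "temporal", "performance", "effort", "frustration")


def raw_tlx(ratings: dict[str, float]) -> float:
    """Unweighted mean of the six 0-100 ratings (assumed Raw TLX scoring)."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    return mean(ratings[d] for d in DIMENSIONS)


def tlx_band(score: float) -> str:
    """Map a score onto the Low / Medium / High bands.

    Assumption: a score of exactly 40 or 70 is treated as Medium.
    """
    if score < 40:
        return "Low"
    if score <= 70:
        return "Medium"
    return "High"


def highest_dimension(ratings: dict[str, float]) -> str:
    """Return the dimension with the highest rating (the one that 'spiked')."""
    return max(DIMENSIONS, key=lambda d: ratings[d])


# Hypothetical ratings for one AI-assisted task attempt.
example = {
    "mental": 70, "physical": 10, "temporal": 60,
    "performance": 40, "effort": 65, "frustration": 55,
}
score = raw_tlx(example)
print(f"TLX {score:.0f} ({tlx_band(score)}), highest dimension: {highest_dimension(example)}")
```

For these made-up ratings the overall score is 50 (Medium), with mental demand as the dimension to ask about.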
Reading the Tiles
When you see evaluation tiles in presentations or dashboards:
┌─────────────────────┐    ┌─────────────────────┐
│ Manual Baseline     │    │ AI-Assisted         │
│ Time: 45 min        │    │ Time: 28 min        │
│ Quality: 92%        │    │ Quality: 87%        │
│ TLX: 35             │    │ TLX: 52             │
└─────────────────────┘    └─────────────────────┘
                        ↓
             Δ = +8% overestimation
       (Claimed 40% faster, delivered 38%)
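The differences between the two tiles can be checked by hand. The sketch below shows one way to derive them from the tile values; the `Tile` class and its field names are hypothetical, not part of any tool referenced here.

```python
from dataclasses import dataclass


@dataclass
class Tile:
    """One evaluation tile; the field names are illustrative."""
    time_min: float
    quality_pct: float
    tlx: float


def compare(baseline: Tile, assisted: Tile) -> dict[str, float]:
    """Return the changes a reader would derive from the two tiles."""
    return {
        # Share of the baseline time that was saved, in percent.
        "time_saved_pct": (baseline.time_min - assisted.time_min) / baseline.time_min * 100,
        # Quality and TLX changes in points (assisted minus baseline).
        "quality_change_pts": assisted.quality_pct - baseline.quality_pct,
        "tlx_change_pts": assisted.tlx - baseline.tlx,
    }


# Values copied from the example tiles.
manual = Tile(time_min=45, quality_pct=92, tlx=35)
ai = Tile(time_min=28, quality_pct=87, tlx=52)

for name, value in compare(manual, ai).items():
    print(f"{name}: {value:+.1f}")
```

With the tile values this gives roughly 38% time saved, a 5-point quality drop, and a 17-point TLX increase.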
Key Questions for Your Team
- Efficiency vs Quality Trade-off: "Are the time savings worth the quality drop?"
- Sustainability Check: "Can the team maintain this TLX level long-term?"
- Evidence Basis: "How many task attempts support these numbers?"
Action Triggers
Green Light (Continue):
- Δ < 5% AND Quality maintained AND TLX < 60
Yellow Light (Monitor):
- Δ 5-15% OR Quality drop up to 10% OR TLX 60-70
Red Light (Intervene):
- Δ > 15% OR Quality drop > 10% OR TLX > 70
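The three triggers can be read as a small decision rule. The sketch below is one possible encoding, not a prescribed implementation: the function name and parameters are hypothetical, "quality maintained" is assumed to mean no drop at all, and any result that is neither Red nor Green is treated as Yellow.

```python
def action_light(delta_pct: float, quality_drop_pct: float, tlx: float) -> str:
    """Classify a result as Green, Yellow, or Red.

    Assumptions: "quality maintained" means a drop of zero or less, and
    anything that is neither Red nor Green counts as Yellow.
    """
    # Red: any single trigger is enough to call for intervention.
    if delta_pct > 15 or quality_drop_pct > 10 or tlx > 70:
        return "Red"
    # Green: all three conditions must hold at once.
    if delta_pct < 5 and quality_drop_pct <= 0 and tlx < 60:
        return "Green"
    # Everything in between is monitored.
    return "Yellow"


# Values from the example tiles: Δ = 8%, quality down 5 points, TLX 52.
print(action_light(delta_pct=8, quality_drop_pct=5, tlx=52))
```

Checking Red first means a single severe signal cannot be masked by two healthy ones. The example tiles come out as Yellow.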
Next Steps
After reviewing tiles, direct your team to:
- If results are good: Scale gradually with weekly Δ/TLX monitoring
- If results are mixed: Run targeted experiments on problem areas
- If results are poor: Revert to manual process and retrain
Learn More
- Full Methodology - Understand the complete measurement framework
- Interpretation Guide - Detailed thresholds and benchmarks
- Run Your Own Evaluation - See the process firsthand