The One-Page Executive Summary

When your team presents AI evaluation results, two metrics carry most of the signal:

1. Overestimation Delta (Δ)

What it measures: The gap between what the team believes the AI accomplished and what it actually delivered.

Green (under 5%): Team estimates AI capabilities accurately
Yellow (5-15%): Mild overconfidence, needs calibration
Red (over 15%): Significant overestimation, risk of quality issues

What to ask: "What specific tasks showed the highest Δ, and what's our mitigation plan?"
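
The bands above can be made concrete with a short sketch. The source does not give a formula for Δ, so this assumes Δ is the claimed improvement minus the measured improvement, in percentage points. The function names and sample numbers are hypothetical.

```python
def time_savings_pct(baseline_min: float, assisted_min: float) -> float:
    """Measured time savings as a percentage of the manual baseline."""
    return (baseline_min - assisted_min) * 100.0 / baseline_min


def overestimation_delta(claimed_pct: float, measured_pct: float) -> float:
    """Assumed definition: claimed minus measured improvement, in points."""
    return claimed_pct - measured_pct


def delta_band(delta: float) -> str:
    """Map delta to a band: under 5 is green, 5 to 15 is yellow, over 15 is red."""
    if delta < 5:
        return "green"
    if delta <= 15:
        return "yellow"
    return "red"


# Hypothetical example: the team claims 50% faster; measured 60 min -> 36 min.
measured = time_savings_pct(60, 36)
delta = overestimation_delta(50, measured)
print(f"delta = {delta:+.1f} points -> {delta_band(delta)}")
```

A negative Δ (the team underestimated the AI) lands in the green band under this sketch; whether underestimation deserves its own flag is a separate decision.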

2. Task Load Index (TLX)

What it measures: Perceived workload across six dimensions (mental demand, physical demand, temporal demand, performance, effort, frustration), reported as a single score from 0 to 100.

Low (under 40): Task feels manageable, team can sustain pace
Medium (40-70): Increased effort required, monitor for fatigue
High (over 70): Unsustainable workload, burnout risk

What to ask: "Which TLX dimensions spiked, and how are we addressing them?"
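
One common way to arrive at a single 0-100 score is the unweighted average of the six subscale ratings, often called Raw TLX. The source does not say whether the team uses this or a weighted variant, so the sketch below is an assumption. The helper names, the sample ratings, and the 15-point spike threshold are all hypothetical.

```python
from statistics import mean

DIMENSIONS = ("mental", "physical", "temporal", "performance", "effort", "frustration")


def raw_tlx(ratings: dict[str, float]) -> float:
    """Unweighted average of the six subscale ratings (each 0-100)."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing ratings: {missing}")
    return mean(ratings[d] for d in DIMENSIONS)


def tlx_band(score: float) -> str:
    """Map an overall score to a band: under 40 low, 40 to 70 medium, over 70 high."""
    if score < 40:
        return "low"
    if score <= 70:
        return "medium"
    return "high"


def spiked_dimensions(before: dict[str, float], after: dict[str, float],
                      threshold: float = 15.0) -> list[str]:
    """Dimensions whose rating rose by at least `threshold` points (assumed cutoff)."""
    return [d for d in DIMENSIONS if after[d] - before[d] >= threshold]


# Hypothetical ratings for one AI-assisted task.
sample = {"mental": 70, "physical": 10, "temporal": 60,
          "performance": 40, "effort": 65, "frustration": 55}
score = raw_tlx(sample)
print(f"TLX = {score} -> {tlx_band(score)}")
```

Comparing per-dimension ratings before and after, as `spiked_dimensions` does, is one way to answer the question above with specifics rather than a single aggregate number.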

Reading the Tiles

When you see evaluation tiles in presentations or dashboards:

┌─────────────────────┐  ┌─────────────────────┐
│ Manual Baseline     │  │ AI-Assisted         │
│ Time: 45 min        │  │ Time: 28 min        │
│ Quality: 92%        │  │ Quality: 87%        │
│ TLX: 35             │  │ TLX: 52             │
└─────────────────────┘  └─────────────────────┘
                ↓
         Δ = +2% overestimation
         (Claimed 40% faster, delivered 38%)
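
The side-by-side comparison in the tiles reduces to three differences. The sketch below recomputes them from the tile values shown above; the `Tile` class and `compare` function are hypothetical names used for illustration, not part of any tool the source describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    time_min: float     # task completion time in minutes
    quality_pct: float  # quality score, 0-100
    tlx: float          # overall TLX score, 0-100


def compare(baseline: Tile, assisted: Tile) -> dict[str, float]:
    """Differences between the AI-assisted tile and the manual baseline."""
    return {
        # Positive means the assisted run was faster.
        "time_savings_pct": (baseline.time_min - assisted.time_min) * 100.0
        / baseline.time_min,
        # Negative means quality fell.
        "quality_change_pts": assisted.quality_pct - baseline.quality_pct,
        # Positive means workload rose.
        "tlx_change_pts": assisted.tlx - baseline.tlx,
    }


manual = Tile(time_min=45, quality_pct=92, tlx=35)
ai = Tile(time_min=28, quality_pct=87, tlx=52)
result = compare(manual, ai)
for name, value in result.items():
    print(f"{name}: {value:+.1f}")
```

For the tiles above this gives roughly 38% time savings, a 5-point quality drop, and a 17-point TLX increase: faster, but at lower quality and higher workload.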

Key Questions for Your Team

Efficiency vs Quality Trade-off: "Are the time savings worth the quality drop?"
Sustainability Check: "Can the team maintain this TLX level long-term?"
Evidence Basis: "How many task attempts support these numbers?"

Action Triggers

Green Light (Continue):

Δ < 5% AND Quality maintained AND TLX < 60

Yellow Light (Monitor):

Δ 5-15% OR Quality drop of up to 10% OR TLX 60-70, with no Red condition met

Red Light (Intervene):

Δ > 15% OR Quality drop > 10% OR TLX > 70
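
The triggers can be expressed as a small decision function. This is a sketch under stated assumptions rather than a definitive rule set: Red is checked first so it takes precedence, "Quality maintained" is read as no quality drop at all, and any result that is neither Green nor Red is treated as Yellow. The function name `traffic_light` and its parameters are hypothetical.

```python
def traffic_light(delta: float, quality_drop: float, tlx: float) -> str:
    """Classify a result as 'green', 'yellow', or 'red'.

    delta:        overestimation delta, in percent
    quality_drop: baseline quality minus assisted quality, in points
                  (zero or negative means quality was maintained)
    tlx:          overall TLX score, 0-100
    """
    # Red: any single condition is enough to intervene.
    if delta > 15 or quality_drop > 10 or tlx > 70:
        return "red"
    # Green: every condition must hold.
    if delta < 5 and quality_drop <= 0 and tlx < 60:
        return "green"
    # Assumption: everything in between is monitored as Yellow.
    return "yellow"


# Hypothetical results for three tasks: (delta, quality_drop, tlx)
for label, args in {
    "task A": (3, 0, 45),
    "task B": (3, 5, 52),
    "task C": (6, 0, 80),
}.items():
    print(label, traffic_light(*args))
```

Task C shows why the order matters: its Δ alone would be Yellow, but a TLX above 70 makes the overall result Red.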

Next Steps

After reviewing tiles, direct your team to:

If results are good: Scale gradually with weekly Δ/TLX monitoring
If results are mixed: Run targeted experiments on problem areas
If results are poor: Revert to manual process and retrain

Learn More

Full Methodology - Understand the complete measurement framework
Interpretation Guide - Detailed thresholds and benchmarks
Run Your Own Evaluation - See the process firsthand

How to Read Δ & TLX Tiles