Transparent Measurement for AI-Augmented Work
Our methodology combines cognitive science, psychometrics, and controlled experimentation to give you reliable, reproducible measurements of AI's actual impact on your productivity.
Executive Summary
Key principles in 60 seconds
Δ (Delta)
Gap between expected vs. actual AI performance
TLX
NASA-validated cognitive workload check in under 15 seconds
Fair Trial
Same rubric for manual vs. AI-assisted work
ROI
Time saved × quality maintained = true gain
Bottom line: Run each task twice (manual then AI), measure Δ between expectation and reality, track cognitive load via micro-TLX, and validate with blind reviewer scores to prove actual lift.
The Fair Trial Protocol
Four steps to reliable AI productivity measurement
Baseline (Manual)
Complete the task without AI. Record time and self-rate quality.
AI-Assisted Run
Same task with AI tools. Same rubric, same time tracking.
Blind Review
Reviewer scores both outputs without knowing which used AI.
Calculate Δ
Compare predicted vs. actual scores. Log micro-TLX workload.
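For concreteness, here is a minimal sketch of what a single Fair Trial record could capture across the four steps; the class and field names (FairTrial, blind_review_score, and so on) are illustrative, not part of any published API.

```python
from dataclasses import dataclass

@dataclass
class FairTrial:
    """One task run twice under the same rubric: manual baseline, then AI-assisted."""
    task_id: str
    manual_minutes: float       # step 1: baseline run, no AI
    ai_minutes: float           # step 2: same task with AI tools
    predicted_score: float      # your self-rating before blind review, 0-100
    blind_review_score: float   # step 3: reviewer unaware of which run used AI
    mental_demand: int          # micro-TLX scale, 0-100
    frustration: int            # micro-TLX scale, 0-100

# Step 4 compares predicted_score with blind_review_score (Δ) and logs the micro-TLX pair.
trial = FairTrial("client-brief", manual_minutes=10, ai_minutes=6,
                  predicted_score=75, blind_review_score=60,
                  mental_demand=40, frustration=30)
```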
Core Metrics
The four pillars of AI productivity measurement
Overestimation Δ
Measures the gap between what you expected and what you achieved with AI assistance.
Micro-TLX
NASA-validated cognitive workload assessment condensed to 2 key dimensions for rapid feedback.
Efficiency Δ-Time
Percentage time saved when using AI vs. manual approach for the same task.
Rubric Agreement
How closely your quality rating matches the gold standard rubric score.
Scoring Formulas
Transparent methodology: All scores are calculated using these public formulas.
Overestimation Delta (Δ)
Formula
Δ = Predicted (%) − Actual (%)
Range
-100 to +100
Interpretation
Positive values indicate overestimation of your abilities. Zero means perfectly calibrated.
Example
Inputs: Predicted: 75%, Actual: 60%
Calculation: Δ = 75 - 60 = 15
Result: +15% overestimation
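As a sketch, the same arithmetic in code (the function name is illustrative):

```python
def overestimation_delta(predicted_pct: float, actual_pct: float) -> float:
    """Gap between expected and blind-reviewed quality, in points (-100 to +100)."""
    return predicted_pct - actual_pct

print(overestimation_delta(75, 60))  # 15.0 → +15% overestimation
```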
Calibration Score
Formula
Calibration = max(0, 100 − |Δ|)
Range
0-100
Interpretation
100 means you predicted your score exactly. Lower scores indicate miscalibration.
Example
Inputs: Δ = +15
Calculation: Calibration = max(0, 100 − |15|) = max(0, 85) = 85
Result: 85 (well calibrated)
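A minimal sketch of the same rule, using the floor at zero shown in the formula:

```python
def calibration_score(delta: float) -> float:
    """100 means a perfect prediction; each point of |Δ| costs one point, floored at 0."""
    return max(0.0, 100.0 - abs(delta))

print(calibration_score(15))    # 85.0 (well calibrated)
print(calibration_score(-100))  # 0.0 (floor keeps the score in range)
```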
Rubric Agreement
Formula
Agreement = 100 − 25 × |User Rating − Gold Standard|
Range
0-100
Interpretation
Measures how closely your quality ratings match expert judgments. Each point difference costs 25 points.
Example
Inputs: User rated: 4, Gold standard: 3
Calculation: Agreement = 100 - 25 × |4 - 3| = 100 - 25 = 75
Result: 75 (1 point off)
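A sketch of the same scoring step; the max(0, …) floor is an assumption added here to keep results inside the published 0-100 range:

```python
def rubric_agreement(user_rating: int, gold_rating: int) -> float:
    """Each point of disagreement with the gold-standard rubric costs 25 points."""
    return max(0.0, 100.0 - 25.0 * abs(user_rating - gold_rating))

print(rubric_agreement(4, 3))  # 75.0 (1 point off)
```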
Efficiency Δ-time
Formula
Δ-time = ((Manual Time − AI Time) / Manual Time) × 100, capped to the range below
Range
-50% to +80%
Interpretation
Positive values mean AI saved time; negative values mean AI added time. Results are capped at the range limits so single outliers don't distort comparisons.
Example
Inputs: Manual: 10 min, AI: 6 min
Calculation: Δ-time = ((10 - 6) / 10) × 100 = 40%
Result: +40% faster with AI
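A sketch that applies the cap; the clamp bounds come straight from the range above:

```python
def efficiency_delta_time(manual_min: float, ai_min: float) -> float:
    """Percent time saved with AI, clamped to the published -50% to +80% range."""
    raw = (manual_min - ai_min) / manual_min * 100.0
    return max(-50.0, min(80.0, raw))

print(efficiency_delta_time(10, 6))  # 40.0 → +40% faster with AI
```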
Cognitive Load Efficiency
Formula
CLE = 100 − ((Mental Demand + Frustration) / 2)
Range
0-100
Interpretation
Higher scores mean lower cognitive burden when using AI assistance.
Example
Inputs: Mental Demand: 40, Frustration: 30
Calculation: CLE = 100 - ((40 + 30) / 2) = 100 - 35 = 65
Result: 65 (moderate cognitive efficiency)
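A one-function sketch of the same formula, taking the two micro-TLX scales as 0-100 inputs:

```python
def cognitive_load_efficiency(mental_demand: float, frustration: float) -> float:
    """Higher values mean less cognitive burden during the AI-assisted run."""
    return 100.0 - (mental_demand + frustration) / 2.0

print(cognitive_load_efficiency(40, 30))  # 65.0 (moderate cognitive efficiency)
```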
Composite AI Judgment Score
Formula
Composite = (0.30 × Literacy) + (0.30 × Rubric) + (0.15 × Calibration) + (0.15 × Efficiency) + (0.10 × Cognitive)
Range
0-100
Interpretation
Your overall AI readiness score combining knowledge, judgment, self-awareness, and efficiency.
Component Weights
Literacy 30% · Rubric Agreement 30% · Calibration 15% · Efficiency 15% · Cognitive Load 10%
Example
Inputs: Literacy: 80, Rubric: 75, Calibration: 85, Efficiency: 70, Cognitive: 65
Calculation: (80×0.30) + (75×0.30) + (85×0.15) + (70×0.15) + (65×0.10)
Result: 24 + 22.5 + 12.75 + 10.5 + 6.5 = 76.25 ≈ 76
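A sketch of the weighted sum using the component weights listed above; the dictionary keys are illustrative labels:

```python
WEIGHTS = {"literacy": 0.30, "rubric": 0.30, "calibration": 0.15,
           "efficiency": 0.15, "cognitive": 0.10}

def composite_score(components: dict[str, float]) -> float:
    """Weighted sum of the five component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

print(composite_score({"literacy": 80, "rubric": 75, "calibration": 85,
                       "efficiency": 70, "cognitive": 65}))  # 76.25
```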
Benchmark Requirements
Minimum Sample Size
Percentiles are only displayed when n ≥ 30 for the specific (role × pack × last 30 days) cell.
Time Window
Only data from the last 30 days is included in benchmark calculations.
Insufficient Sample
When minimum sample is not met, "Insufficient sample" is displayed instead of percentiles.
Percentile Calculation
Uses linear interpolation between data points for smooth percentile ranks.
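A sketch of how the n ≥ 30 gate and the interpolated percentile rank could fit together; the plotting-position convention used here (i / (n − 1)) is an assumption, not a statement of the production implementation:

```python
def percentile_rank(score: float, cohort: list[float], min_n: int = 30) -> float | None:
    """Percentile rank of `score` within one (role × pack × last-30-days) cell.

    Returns None, rendered as "Insufficient sample", when the cell has fewer
    than `min_n` observations; otherwise interpolates linearly between the
    sorted cohort values.
    """
    if len(cohort) < min_n:
        return None
    values = sorted(cohort)
    n = len(values)
    if score <= values[0]:
        return 0.0
    if score >= values[-1]:
        return 100.0
    for i in range(1, n):
        if score <= values[i]:
            lo, hi = values[i - 1], values[i]
            frac = (score - lo) / (hi - lo)
            return 100.0 * (i - 1 + frac) / (n - 1)
    return 100.0
```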
Composite Score Levels
Limits & Misuse
Important caveats for responsible use
Context Drift
- Coding benchmarks don't transfer to marketing tasks
- AI model updates can invalidate baselines within weeks
- Team workflows differ from published medians
Outdated Benchmarks
- Re-validate cohort medians every 3 months
- GPT-3.5 benchmarks don't apply to GPT-4
- Industry averages mask high/low performer variance
Small Sample Risks
- Need 20+ data points per role for significance
- Outliers skew averages in groups under 10
- Single-sprint measurements miss learning curves
Remember: Establish your own baselines first, then compare to industry benchmarks. Never make decisions based solely on external averages.
Frequently Asked Questions
Common questions about our methodology
How do you calculate Overestimation Δ?
Δ = Self-Rating − Actual Score. You predict your quality before submitting, then a blind reviewer scores the output. The difference reveals calibration accuracy.
Why two runs instead of just measuring AI output?
Without a baseline, you can't know if AI actually helped. The manual run establishes your true capability, making the comparison meaningful and controlled.
How is micro-TLX different from full NASA-TLX?
Full NASA-TLX has 6 dimensions and takes 2+ minutes. Our micro-TLX uses the 2 most predictive dimensions (Mental Demand + Frustration) and takes under 15 seconds.
What if my Δ is consistently negative?
Negative Δ means you're underestimating your AI-assisted performance—common among experts. This is valuable data showing you may be leaving gains on the table.
How often should I recalibrate benchmarks?
Re-run baselines every 3 months or whenever AI models update significantly. GPT-4 benchmarks don't apply to GPT-4-Turbo or newer releases.
Open Beta: Measure Your AI Productivity
Run a Fair Trial assessment and get your first Δ and TLX scores in under 10 minutes.