Transparent Measurement for AI-Augmented Work
Our methodology combines cognitive science, psychometrics, and controlled experimentation to give you reliable, reproducible measurements of AI's actual impact on your productivity.
Executive Summary
Key principles in 60 seconds
Δ (Delta)
Gap between expected vs. actual AI performance
TLX
NASA-validated cognitive workload in 15 seconds
Fair Trial
Same rubric for manual vs. AI-assisted work
ROI
Time saved × quality maintained = true gain
Bottom line: Run each task twice (manual then AI), measure Δ between expectation and reality, track cognitive load via micro-TLX, and validate with blind reviewer scores to prove actual lift.
The Fair Trial Protocol
Four steps to reliable AI productivity measurement
Baseline (Manual)
Complete the task without AI. Record time and self-rate quality.
AI-Assisted Run
Same task with AI tools. Same rubric, same time tracking.
Blind Review
Reviewer scores both outputs without knowing which used AI.
Calculate Δ
Compare predicted vs. actual scores. Log micro-TLX workload.
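The four steps above can be sketched as a small data flow. This is a minimal illustration, not the product's actual code; `TrialRun` and `fair_trial` are names we chose for the sketch:

```python
from dataclasses import dataclass

@dataclass
class TrialRun:
    """One pass over the task: the manual baseline or the AI-assisted run."""
    minutes: float          # time tracked for this run
    predicted_quality: int  # self-rating (0-100) logged before blind review
    reviewer_score: int     # blind reviewer's rubric score (0-100)

def fair_trial(manual: TrialRun, ai: TrialRun) -> dict:
    """Step 4: compare prediction vs. blind-review reality and time saved."""
    return {
        # positive = you overestimated the AI-assisted output
        "overestimation_delta": ai.predicted_quality - ai.reviewer_score,
        # positive = AI saved time relative to the manual baseline
        "time_saved_pct": (manual.minutes - ai.minutes) / manual.minutes * 100,
    }
```

For example, a 10-minute manual baseline followed by a 6-minute AI run predicted at 75 but blind-scored at 60 yields Δ = +15 and 40% time saved, matching the worked examples in the Scoring Formulas section.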
Core Metrics
The four pillars of AI productivity measurement
Overestimation Δ
Measures the gap between what you expected and what you achieved with AI assistance.
Micro-TLX
NASA-validated cognitive workload assessment condensed to two key dimensions (Mental Demand and Frustration) for rapid feedback.
Efficiency Δ-Time
Percentage time saved when using AI vs. manual approach for the same task.
Rubric Agreement
How closely your quality rating matches the gold standard rubric score.
Scoring Formulas
Transparent methodology: All scores are calculated using these public formulas.
#Overestimation Delta (Δ)
Formula
Δ = Predicted − Actual
Range
-100 to +100
Interpretation
Positive values indicate overestimation of your abilities. Zero means perfectly calibrated.
Example
Inputs: Predicted: 75%, Actual: 60%
Calculation: Δ = 75 - 60 = 15
Result: +15% overestimation
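The delta is a straight subtraction. A one-line sketch (the function name is ours, not the product's):

```python
def overestimation_delta(predicted: float, actual: float) -> float:
    """Predicted minus actual score, in percentage points (-100 to +100)."""
    return predicted - actual
```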
#Calibration Score
Formula
Calibration = max(0, 100 − |Δ|)
Range
0-100
Interpretation
100 means you predicted your score exactly. Lower scores indicate miscalibration.
Example
Inputs: Δ = +15
Calculation: Calibration = max(0, 100 - |15|) = max(0, 85) = 85
Result: 85 (well calibrated)
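A minimal sketch of the calibration calculation, with the floor at zero made explicit:

```python
def calibration_score(delta: float) -> float:
    """max(0, 100 - |delta|): 100 means your prediction was exact."""
    return max(0.0, 100.0 - abs(delta))
```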
#Rubric Agreement
Formula
Agreement = 100 − 25 × |User Rating − Gold Standard|
Range
0-100
Interpretation
Measures how closely your quality ratings match expert judgments. Each point difference costs 25 points.
Example
Inputs: User rated: 4, Gold standard: 3
Calculation: Agreement = 100 - 25 × |4 - 3| = 100 - 25 = 75
Result: 75 (1 point off)
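In code form (a sketch; the `max(0, ...)` floor for gaps of four or more rubric points is our assumption, since the page only shows a one-point gap):

```python
def rubric_agreement(user_rating: int, gold_standard: int) -> int:
    """100 minus 25 per point of disagreement, floored at 0 (assumed)."""
    return max(0, 100 - 25 * abs(user_rating - gold_standard))
```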
#Efficiency Δ-time
Formula
Δ-time = ((Manual − AI) / Manual) × 100
Range
-50% to +80%
Interpretation
Positive means AI saved time. Negative means AI added time. Capped to prevent outliers.
Example
Inputs: Manual: 10 min, AI: 6 min
Calculation: Δ-time = ((10 - 6) / 10) × 100 = 40%
Result: +40% faster with AI
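A sketch including the documented cap, which keeps a single extreme run from dominating the metric:

```python
def efficiency_delta_time(manual_min: float, ai_min: float) -> float:
    """Percent time saved, clamped to the documented -50%..+80% range."""
    raw = (manual_min - ai_min) / manual_min * 100.0
    return max(-50.0, min(80.0, raw))
```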
#Cognitive Load Efficiency
Formula
CLE = 100 − ((Mental Demand + Frustration) / 2)
Range
0-100
Interpretation
Higher scores mean lower cognitive burden when using AI assistance.
Example
Inputs: Mental Demand: 40, Frustration: 30
Calculation: CLE = 100 - ((40 + 30) / 2) = 100 - 35 = 65
Result: 65 (moderate cognitive efficiency)
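As a sketch, the score inverts the mean of the two micro-TLX dimensions:

```python
def cognitive_load_efficiency(mental_demand: float, frustration: float) -> float:
    """100 minus the mean of the two micro-TLX dimensions (each 0-100)."""
    return 100.0 - (mental_demand + frustration) / 2.0
```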
#Composite AI Judgment Score
Formula
Composite = 0.30 × Literacy + 0.30 × Rubric + 0.15 × Calibration + 0.15 × Efficiency + 0.10 × Cognitive
Range
0-100
Interpretation
Your overall AI readiness score combining knowledge, judgment, self-awareness, and efficiency.
Component Weights
Literacy 30%, Rubric Agreement 30%, Calibration 15%, Efficiency 15%, Cognitive Load 10%
Example
Inputs: Literacy: 80, Rubric: 75, Calibration: 85, Efficiency: 70, Cognitive: 65
Calculation: (80×0.30) + (75×0.30) + (85×0.15) + (70×0.15) + (65×0.10)
Result: 24 + 22.5 + 12.75 + 10.5 + 6.5 = 76.25, rounded to 76
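The weighted sum can be sketched as a table of weights plus one reduction (illustrative names, not the product's internals):

```python
WEIGHTS = {
    "literacy": 0.30,     # AI literacy
    "rubric": 0.30,       # rubric agreement
    "calibration": 0.15,  # calibration score
    "efficiency": 0.15,   # efficiency delta-time
    "cognitive": 0.10,    # cognitive load efficiency
}

def composite_score(components: dict) -> float:
    """Weighted sum of the five component scores (weights total 1.0)."""
    return sum(components[name] * weight for name, weight in WEIGHTS.items())
```

With the worked example's inputs this returns 76.25, which the page rounds to 76.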
#Benchmark Requirements
Minimum Sample Size
Percentiles are only displayed when n ≥ 30 for the specific (role × pack × last 30 days) cell.
Time Window
Only data from the last 30 days is included in benchmark calculations.
Insufficient Sample
When minimum sample is not met, "Insufficient sample" is displayed instead of percentiles.
Percentile Calculation
Uses linear interpolation between data points for smooth percentile ranks.
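The three benchmark rules above can be sketched together: gate on the minimum cell size, then interpolate linearly between sorted cohort scores. This is our illustration under the stated rules, not the service's implementation:

```python
from typing import Optional

MIN_SAMPLE = 30  # percentiles are hidden below this cell size

def benchmark_percentile(value: float, cohort: list[float]) -> Optional[float]:
    """Percentile rank of `value` via linear interpolation between the two
    nearest sorted cohort points. Returns None ("Insufficient sample")
    when the (role x pack x 30-day) cell has fewer than 30 scores."""
    if len(cohort) < MIN_SAMPLE:
        return None
    data = sorted(cohort)
    if value <= data[0]:
        return 0.0
    if value >= data[-1]:
        return 100.0
    for i in range(len(data) - 1):
        lo, hi = data[i], data[i + 1]
        if lo <= value <= hi:
            # fractional position between the two neighbouring data points
            frac = (value - lo) / (hi - lo) if hi > lo else 0.0
            return (i + frac) / (len(data) - 1) * 100.0
    return 100.0  # unreachable for sorted data
```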
Composite Score Levels
Limits & Misuse
Important caveats for responsible use
Context Drift
- Coding benchmarks don't transfer to marketing tasks
- AI model updates can invalidate baselines within weeks
- Team workflows differ from published medians
Outdated Benchmarks
- Re-validate cohort medians every 3 months
- GPT-3.5 benchmarks don't apply to GPT-4
- Industry averages mask high/low performer variance
Small Sample Risks
- Need 20+ data points per role for significance
- Outliers skew averages in groups under 10
- Single-sprint measurements miss learning curves
Remember: Establish your own baselines first, then compare to industry benchmarks. Never make decisions based solely on external averages. Learn more about research quality →
Frequently Asked Questions
Common questions about our methodology
How is Overestimation Δ calculated?
Δ = Self-Rating − Actual Score. You predict your quality before submitting, then a blind reviewer scores the output. The difference reveals calibration accuracy.
Open Beta: Measure Your AI Productivity
Run a Fair Trial assessment and get your first Δ and TLX scores in under 10 minutes.