
Methodology

How we measure

Scoring, Overestimation Δ, micro-TLX, item analysis, and benchmarks.

Executive Summary

Δ (Delta): Measures the gap between expected and actual AI performance (self-rating minus scored outcome)
TLX: NASA-validated cognitive workload assessment adapted to 2 key dimensions for rapid feedback
Fair Trial: Double-baseline methodology ensuring the same rubric for manual vs. AI-assisted tasks
ROI: Time saved × quality maintained = true productivity gain (not just speed)

Bottom line: Run each task twice (manual then AI), measure Δ between expectation and reality, track cognitive load via micro-TLX, and validate with reviewer minutes to prove actual lift.

What to expect

Quick reference so you can explain the metrics before anyone clicks into the Analyzer.

Overestimation Δ snapshot

Δ = self-rating − reviewer score. Keep cohorts within ±5 to prove confidence ≈ accuracy.

micro-TLX (2 sliders)

Log mental demand + frustration immediately after each run; it takes under 15 seconds but surfaces fatigue trends.

Fair Trial checklist

Counter-balance order, lock the rubric, and compare manual vs. AI time so the story survives scrutiny.


Scoring Formulas

Transparent methodology: All scores are calculated using these public formulas.

Overestimation Delta (Δ)

Formula

Δ = predicted_score% − actual_score%

Range

-100 to +100

Interpretation

Positive values indicate overestimation of your abilities; negative values indicate underestimation. Zero means perfectly calibrated.

Example

Inputs: Predicted: 75%, Actual: 60%

Calculation: Δ = 75 - 60 = 15

Result: +15% overestimation
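These formulas are simple enough to compute directly. Below is a minimal Python sketch of the Δ calculation; the function name is ours, not part of the product:

```python
def overestimation_delta(predicted_pct: float, actual_pct: float) -> float:
    """Overestimation Δ: positive means you overestimated your performance."""
    return predicted_pct - actual_pct

# Worked example from above: predicted 75%, actual 60%
print(overestimation_delta(75, 60))  # 15
```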

Calibration Score

Formula

Calibration = max(0, 100 − |Δ|)

Range

0-100

Interpretation

100 means you predicted your score exactly. Lower scores indicate miscalibration.

Example

Inputs: Δ = +15

Calculation: Calibration = max(0, 100 - |15|) = max(0, 85)

Result: 85 (well calibrated)
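The calibration formula can be sketched the same way (again a hypothetical helper, assuming Δ is already computed as above):

```python
def calibration_score(delta: float) -> float:
    """100 when the prediction is exact; falls by 1 per point of |Δ|, floored at 0."""
    return max(0.0, 100.0 - abs(delta))

print(calibration_score(15))  # 85.0
```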

Rubric Agreement

Formula

Agreement = 100 − 25 × |user_score − gold_score|

Range

0-100

Interpretation

Measures how closely your quality ratings match expert judgments. Each point difference costs 25 points.

Example

Inputs: User rated: 4, Gold standard: 3

Calculation: Agreement = 100 - 25 × |4 - 3| = 100 - 25 = 75

Result: 75 (1 point off)
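A sketch of the agreement formula. The `max(0, ...)` floor is our assumption to keep the result inside the published 0-100 range if ratings ever differ by more than 4 points:

```python
def rubric_agreement(user_score: int, gold_score: int) -> float:
    """Each point of disagreement with the gold rating costs 25 points."""
    return max(0.0, 100.0 - 25.0 * abs(user_score - gold_score))

print(rubric_agreement(4, 3))  # 75.0
```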

Efficiency Δ-time

Formula

Δ-time = ((manual_time − ai_time) / manual_time) × 100

Range

-50% to +80%

Interpretation

Positive means AI saved time. Negative means AI added time. Capped to prevent outliers.

Example

Inputs: Manual: 10 min, AI: 6 min

Calculation: Δ-time = ((10 - 6) / 10) × 100 = 40%

Result: +40% faster with AI
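A sketch of the time-efficiency formula, including the published -50% to +80% cap (function name and signature are ours; assumes `manual_min > 0`):

```python
def efficiency_delta_time(manual_min: float, ai_min: float) -> float:
    """Percent time saved by AI, capped to the published -50% .. +80% range."""
    raw = (manual_min - ai_min) / manual_min * 100.0
    return max(-50.0, min(80.0, raw))

print(efficiency_delta_time(10, 6))  # 40.0
```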

Cognitive Load Efficiency

Formula

CLE = 100 − average(Mental_Demand, Frustration)

Range

0-100

Interpretation

Higher scores mean lower cognitive burden when using AI assistance.

Example

Inputs: Mental Demand: 40, Frustration: 30

Calculation: CLE = 100 - ((40 + 30) / 2) = 100 - 35 = 65

Result: 65 (moderate cognitive efficiency)
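The CLE formula as a sketch, assuming both micro-TLX sliders are on a 0-100 scale as in the example:

```python
def cognitive_load_efficiency(mental_demand: float, frustration: float) -> float:
    """Higher is better: 100 minus the mean of the two micro-TLX sliders."""
    return 100.0 - (mental_demand + frustration) / 2.0

print(cognitive_load_efficiency(40, 30))  # 65.0
```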

Composite AI Judgment Score

Formula

(Literacy × 30%) + (Rubric Agreement × 30%) + (Calibration × 15%) + (Efficiency × 15%) + (Cognitive Load × 10%)

Range

0-100

Interpretation

Your overall AI readiness score combining knowledge, judgment, self-awareness, and efficiency.

Component Weights

30% - Foundational AI knowledge
30% - Quality evaluation accuracy
15% - Self-awareness accuracy
15% - Time efficiency with AI
10% - Mental burden management

Example

Inputs: Literacy: 80, Rubric: 75, Calibration: 85, Efficiency: 70, Cognitive: 65

Calculation: (80×0.30) + (75×0.30) + (85×0.15) + (70×0.15) + (65×0.10)

Result: 24 + 22.5 + 12.75 + 10.5 + 6.5 = 76.25 (rounds to 76)
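The weighted sum can be sketched with the published weights (dictionary keys are ours, chosen to match the component names above):

```python
WEIGHTS = {
    "literacy": 0.30,     # foundational AI knowledge
    "rubric": 0.30,       # quality evaluation accuracy
    "calibration": 0.15,  # self-awareness accuracy
    "efficiency": 0.15,   # time efficiency with AI
    "cognitive": 0.10,    # mental burden management
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of the five component scores (each 0-100)."""
    return sum(scores[name] * weight for name, weight in WEIGHTS.items())

example = {"literacy": 80, "rubric": 75, "calibration": 85,
           "efficiency": 70, "cognitive": 65}
print(composite_score(example))  # ~76.25
```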

Benchmark Requirements

Minimum Sample Size

Percentiles are only displayed when n ≥ 30 for the specific (role × pack × last 30 days) cell.

Time Window

Only data from the last 30 days is included in benchmark calculations.

Insufficient Sample

When minimum sample is not met, "Insufficient sample" is displayed instead of percentiles.

Percentile Calculation

percentile = ((below + 0.5 × equal) / n) × 100

Counting half of each tied group (the midpoint method) yields smooth percentile ranks even when scores cluster.

Composite Score Levels

0-29: Beginner
30-49: Developing
50-69: Proficient
70-84: Advanced
85-100: Expert
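Mapping a composite score onto these bands is a straightforward threshold check; a sketch (function name is ours):

```python
def composite_level(score: float) -> str:
    """Map a 0-100 composite score onto the published level bands."""
    if score < 30:
        return "Beginner"
    if score < 50:
        return "Developing"
    if score < 70:
        return "Proficient"
    if score < 85:
        return "Advanced"
    return "Expert"

print(composite_level(76))  # Advanced
```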

Scoring

Every pack is run twice (manual then AI-assisted) with the same rubric. Reviewers log pass/fail plus rework minutes so “faster” never hides defect rates.

Overestimation Index

Δ = self-rating − scored performance. We flag cohorts when >15% of runs exceed ±5 points so you know when confidence outruns evidence.

Micro-TLX

Two sliders (mental demand + frustration) appear after each run. They add <15 seconds and help you correlate speed with cognitive cost.

Item analysis

We calculate Cronbach’s α nightly across every assessment and pack question. When α drops below 0.78 we freeze that item until it is recalibrated.
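The nightly job itself isn't public, but the standard Cronbach's α computation over an items × respondents matrix looks like the following sketch (pure-Python, sample variance with n−1; assumes at least two respondents per item and non-constant totals):

```python
def cronbach_alpha(item_scores: list[list[float]]) -> float:
    """Cronbach's α: rows are items, columns are respondents."""
    k = len(item_scores)          # number of items
    n = len(item_scores[0])       # number of respondents

    def sample_var(xs: list[float]) -> float:
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

    item_var_sum = sum(sample_var(row) for row in item_scores)
    totals = [sum(row[j] for row in item_scores) for j in range(n)]
    return k / (k - 1) * (1 - item_var_sum / sample_var(totals))

# Two perfectly correlated items yield alpha = 1.0
print(cronbach_alpha([[1, 2, 3], [1, 2, 3]]))
```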

Benchmarks & limits

We publish cohort medians weekly so you can compare, but we never expose raw customer data. All deltas are normalized against your own history first.

Limits & Misuse

Context drift:

  • Benchmarks from one domain (e.g., coding) may not transfer to others (e.g., marketing)
  • AI model updates can invalidate historical baselines within weeks
  • Team-specific workflows may show different patterns than published medians

Outdated benchmarks:

  • Cohort medians older than 3 months should be re-validated
  • GPT-3.5 benchmarks don't apply to GPT-4 or newer models
  • Industry averages mask wide variance between high/low performers

Small sample risks:

  • Δ measurements need 20+ data points per role for statistical significance
  • Individual outliers can skew team averages in groups under 10 people
  • Single-sprint measurements don't account for learning curves

Remember: Use these methods to establish your own baselines first, then compare to industry benchmarks. Never make decisions based solely on external averages.

Need more detail?

Use the feedback button if you need anonymized cohort medians or reviewer guidance.

Measurement FAQ

Use these quick answers when execs or reviewers ask about the math.

How do you calculate Overestimation Δ?

Every pack runs twice (manual then AI). We subtract reviewer scores from self-ratings and flag cohorts when more than 15% of runs exceed ±5 points.

What does micro-TLX capture?

Two sliders log mental demand and frustration immediately after each run so Δ includes a fatigue lens.

Which benchmarks do you share?

We post cohort medians and reviewer minutes weekly so teams can compare progress without exposing raw prompts.

Open Beta

Help steer the Open Beta with real Δ and TLX tiles.

Run the analyzer demo, share methodology notes with your team, and send us benchmarks so the release ships with proof—not hype.
