
Methodology

How we measure real AI productivity gains

Scientific Framework

Transparent Measurement for AI-Augmented Work

Our methodology combines cognitive science, psychometrics, and controlled experimentation to give you reliable, reproducible measurements of AI's actual impact on your productivity.

[Diagram: the four-step Fair Trial Protocol (1 Baseline (Manual), 2 AI-Assisted Run, 3 Blind Review, 4 Calculate Δ) and its two headline outputs, the Overestimation Index (Δ) and Cognitive Load (TLX), built on peer-reviewed methods.]

Executive Summary

Key principles in 60 seconds

Δ (Delta)

Gap between expected vs. actual AI performance

TLX

NASA-validated cognitive workload, rated in under 15 seconds

Fair Trial

Same rubric for manual vs. AI-assisted work

ROI

Time saved × quality maintained = true gain

Bottom line: Run each task twice (manual then AI), measure Δ between expectation and reality, track cognitive load via micro-TLX, and validate with blind reviewer scores to prove actual lift.

The Fair Trial Protocol

Four steps to reliable AI productivity measurement

1

Baseline (Manual)

Complete the task without AI. Record time and self-rate quality.

2

AI-Assisted Run

Same task with AI tools. Same rubric, same time tracking.

3

Blind Review

Reviewer scores both outputs without knowing which used AI.

4

Calculate Δ

Compare predicted vs. actual scores. Log micro-TLX workload.
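
If you want to instrument the protocol yourself, the sketch below shows the data a single Fair Trial captures across the four steps. It is written in Python; the FairTrial record and its field names are illustrative assumptions, not part of the AI CogniFit product.

from dataclasses import dataclass

@dataclass
class FairTrial:
    """One Fair Trial: the same task run manually and with AI, then blind-reviewed."""
    task_id: str
    manual_minutes: float    # Step 1: time for the manual baseline
    ai_minutes: float        # Step 2: time for the AI-assisted run
    predicted_score: float   # self-rated quality (0-100), logged before review
    actual_score: float      # Step 3: blind reviewer's score (0-100)
    mental_demand: float     # micro-TLX dimension (0-100)
    frustration: float       # micro-TLX dimension (0-100)

Step 4 then derives Overestimation Δ, Δ-time, and the micro-TLX score from these fields using the formulas in the sections below.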

Core Metrics

The four pillars of AI productivity measurement

Overestimation Δ

-100 to +100
Δ = Self-Rating − Actual Score

Measures the gap between what you expected and what you achieved with AI assistance.

Δ near 0 = Well calibrated
Δ > +5 = Overconfident

Micro-TLX

0 to 100
TLX = (Mental Demand + Frustration) / 2

NASA-validated cognitive workload assessment condensed to 2 key dimensions for rapid feedback.

TLX < 40 = Low cognitive load
TLX > 70 = Risk of burnout

Efficiency Δ-Time

-50% to +80% (capped)
Δ-Time = ((Manual − AI) / Manual) × 100%

Percentage time saved when using AI vs. manual approach for the same task.

Positive = Time saved with AI
Negative = AI slower than manual

Rubric Agreement

0 to 100
Agreement = 100 − 25 × |User − Gold|

How closely your quality rating matches the gold standard rubric score.

> 75 = Strong agreement
< 50 = Calibration needed

Scoring Formulas

Transparent methodology: All scores are calculated using these public formulas.

Overestimation Delta (Δ)

Formula

Δ = predicted_score% − actual_score%

Range

-100 to +100

Interpretation

Positive values indicate overestimation of your abilities. Zero means perfectly calibrated.

Example

Inputs: Predicted: 75%, Actual: 60%

Calculation: Δ = 75 - 60 = 15

Result: +15% overestimation
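
The formula translates directly into code. A minimal sketch in Python, assuming a hypothetical helper named overestimation_delta:

def overestimation_delta(predicted_pct: float, actual_pct: float) -> float:
    """Δ = predicted_score% − actual_score%, giving a range of −100 to +100."""
    return predicted_pct - actual_pct

print(overestimation_delta(75, 60))  # 15, matching the +15 in the example above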

Calibration Score

Formula

Calibration = max(0, 100 − |Δ|)

Range

0-100

Interpretation

100 means you predicted your score exactly. Lower scores indicate miscalibration.

Example

Inputs: Δ = +15

Calculation: Calibration = max(0, 100 - |15|) = max(0, 85) = 85

Result: 85 (well calibrated)
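
The same worked example in code; calibration_score is an illustrative name, not a product API:

def calibration_score(delta: float) -> float:
    """Calibration = max(0, 100 − |Δ|); 100 means you predicted your score exactly."""
    return max(0.0, 100.0 - abs(delta))

print(calibration_score(15))   # 85.0
print(calibration_score(-15))  # 85.0; under- and overestimation are penalized equally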

Rubric Agreement

Formula

Agreement = 100 − 25 × |user_score − gold_score|

Range

0-100

Interpretation

Measures how closely your quality ratings match expert judgments. Each point of difference costs 25 points.

Example

Inputs: User rated: 4, Gold standard: 3

Calculation: Agreement = 100 - 25 × |4 - 3| = 100 - 25 = 75

Result: 75 (1 point off)
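
A direct translation of the formula. The max(0, ...) floor is an assumption taken from the stated 0-100 range, since the bare formula would go negative for differences above 4 points:

def rubric_agreement(user_score: float, gold_score: float) -> float:
    """Agreement = 100 − 25 × |user − gold|, floored at 0 (assumed from the 0-100 range)."""
    return max(0.0, 100.0 - 25.0 * abs(user_score - gold_score))

print(rubric_agreement(4, 3))  # 75.0, i.e. one point off the gold standard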

Efficiency Δ-time

Formula

Δ-time = ((manual_time − ai_time) / manual_time) × 100

Range

-50% to +80%

Interpretation

Positive means AI saved time; negative means AI added time. Values are capped at -50% and +80% so outliers don't distort results.

Example

Inputs: Manual: 10 min, AI: 6 min

Calculation: Δ-time = ((10 - 6) / 10) × 100 = 40%

Result: +40% faster with AI
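
In code, with the published bounds applied as a simple clip; whether out-of-range values are clipped or handled differently internally is an assumption:

def efficiency_delta_time(manual_minutes: float, ai_minutes: float) -> float:
    """Δ-time = ((manual − ai) / manual) × 100, clipped to the -50%..+80% range."""
    raw = (manual_minutes - ai_minutes) / manual_minutes * 100.0
    return max(-50.0, min(80.0, raw))

print(efficiency_delta_time(10, 6))  # 40.0, i.e. +40% faster with AI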

Cognitive Load Efficiency

Formula

CLE = 100 − average(Mental_Demand, Frustration)

Range

0-100

Interpretation

Higher scores mean lower cognitive burden when using AI assistance.

Example

Inputs: Mental Demand: 40, Frustration: 30

Calculation: CLE = 100 - ((40 + 30) / 2) = 100 - 35 = 65

Result: 65 (moderate cognitive efficiency)
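
As a sketch (cognitive_load_efficiency is an illustrative name):

def cognitive_load_efficiency(mental_demand: float, frustration: float) -> float:
    """CLE = 100 − average(mental demand, frustration), both rated 0-100."""
    return 100.0 - (mental_demand + frustration) / 2.0

print(cognitive_load_efficiency(40, 30))  # 65.0, moderate cognitive efficiency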

Composite AI Judgment Score

Formula

(Literacy × 30%) + (Rubric Agreement × 30%) + (Calibration × 15%) + (Efficiency × 15%) + (Cognitive Load × 10%)

Range

0-100

Interpretation

Your overall AI readiness score combining knowledge, judgment, self-awareness, and efficiency.

Component Weights

30% - Foundational AI knowledge
30% - Quality evaluation accuracy
15% - Self-awareness accuracy
15% - Time efficiency with AI
10% - Mental burden management

Example

Inputs: Literacy: 80, Rubric: 75, Calibration: 85, Efficiency: 70, Cognitive: 65

Calculation: (80×0.30) + (75×0.30) + (85×0.15) + (70×0.15) + (65×0.10)

Result: 24 + 22.5 + 12.75 + 10.5 + 6.5 = 76.25 ≈ 76 (rounded)
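
Combining the weights in code; rounding the final value to a whole number is an assumption inferred from the 76 shown above:

def composite_judgment_score(literacy: float, rubric: float, calibration: float,
                             efficiency: float, cognitive: float) -> int:
    """Weighted sum with the published 30/30/15/15/10 weights, rounded to an integer."""
    raw = (literacy * 0.30 + rubric * 0.30 + calibration * 0.15
           + efficiency * 0.15 + cognitive * 0.10)
    return round(raw)

print(composite_judgment_score(80, 75, 85, 70, 65))  # 76 (76.25 before rounding)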

Benchmark Requirements

Minimum Sample Size

Percentiles are only displayed when n ≥ 30 for the specific (role × pack × last 30 days) cell.

Time Window

Only data from the last 30 days is included in benchmark calculations.

Insufficient Sample

When minimum sample is not met, "Insufficient sample" is displayed instead of percentiles.

Percentile Calculation

percentile = ((below + 0.5 × equal) / n) × 100

Ties are split evenly (the 0.5 × equal term), producing smooth percentile ranks.
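
A sketch of the percentile rule with the minimum-sample gate folded in. In the product the n ≥ 30 check applies per (role × pack × last 30 days) cell; the helper below is illustrative only:

from typing import Optional

def percentile_rank(your_value: float, cohort: list[float]) -> Optional[float]:
    """percentile = ((below + 0.5 × equal) / n) × 100; None means insufficient sample."""
    n = len(cohort)
    if n < 30:
        return None  # rendered as "Insufficient sample" instead of a percentile
    below = sum(1 for v in cohort if v < your_value)
    equal = sum(1 for v in cohort if v == your_value)
    return (below + 0.5 * equal) / n * 100.0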

Composite Score Levels

0-29: Beginner
30-49: Developing
50-69: Proficient
70-84: Advanced
85-100: Expert
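
Mapping a composite score to its level band is a simple threshold lookup; composite_level is an illustrative helper:

def composite_level(score: float) -> str:
    """Translate a 0-100 composite score into its published level band."""
    if score < 30:
        return "Beginner"
    if score < 50:
        return "Developing"
    if score < 70:
        return "Proficient"
    if score < 85:
        return "Advanced"
    return "Expert"

print(composite_level(76))  # "Advanced"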

Limits & Misuse

Important caveats for responsible use

Context Drift

  • Coding benchmarks don't transfer to marketing tasks
  • AI model updates can invalidate baselines within weeks
  • Team workflows differ from published medians

Outdated Benchmarks

  • Re-validate cohort medians every 3 months
  • GPT-3.5 benchmarks don't apply to GPT-4
  • Industry averages mask high/low performer variance

Small Sample Risks

  • Need 20+ data points per role for significance
  • Outliers skew averages in groups under 10
  • Single-sprint measurements miss learning curves

Remember: Establish your own baselines first, then compare to industry benchmarks. Never make decisions based solely on external averages. Learn more about research quality →

Frequently Asked Questions

Common questions about our methodology

How do you calculate Overestimation Δ?

Δ = Self-Rating − Actual Score. You predict your quality before submitting, then a blind reviewer scores the output. The difference reveals calibration accuracy.

Why two runs instead of just measuring AI output?

Without a baseline, you can't know if AI actually helped. The manual run establishes your true capability, making the comparison meaningful and controlled.

How is micro-TLX different from full NASA-TLX?

Full NASA-TLX has 6 dimensions and takes 2+ minutes. Our micro-TLX uses the 2 most predictive dimensions (Mental Demand + Frustration) and takes under 15 seconds.

What if my Δ is consistently negative?

Negative Δ means you're underestimating your AI-assisted performance—common among experts. This is valuable data showing you may be leaving gains on the table.

How often should I recalibrate benchmarks?

Re-run baselines every 3 months or whenever AI models update significantly. GPT-4 benchmarks don't apply to GPT-4-Turbo or newer releases.

Open Beta: Measure Your AI Productivity

Run a Fair Trial assessment and get your first Δ and TLX scores in under 10 minutes.
