
Methodology

How we measure

Scoring, Overestimation Δ, micro-TLX, item analysis, and benchmarks.

Executive Summary

Δ (Delta): Measures the gap between expected and actual AI performance (self-rating minus scored outcome)
TLX: NASA-validated cognitive workload assessment adapted to 2 key dimensions for rapid feedback
Fair Trial: Double-baseline methodology ensuring the same rubric for manual vs. AI-assisted tasks
ROI: Time saved × quality maintained = true productivity gain (not just speed)

Bottom line: Run each task twice (manual then AI), measure Δ between expectation and reality, track cognitive load via micro-TLX, and validate with reviewer minutes to prove actual lift.

What to expect

Quick reference so you can explain the metrics before anyone clicks into the Analyzer.

Overestimation Δ snapshot

Δ = self-rating − reviewer score. Keep cohorts within ±5 to prove confidence ≈ accuracy.

micro-TLX (2 sliders)

Log mental demand + frustration immediately after each run; it takes under 15 seconds but surfaces fatigue trends.

Fair Trial checklist

Counter-balance order, lock the rubric, and compare manual vs. AI time so the story survives scrutiny.


Scoring Formulas

Transparent methodology: All scores are calculated using these public formulas.

Overestimation Delta (Δ)

Formula

Δ = predicted_score% − actual_score%

Range

-100 to +100

Interpretation

Positive values indicate overestimation of your abilities; negative values indicate underestimation. Zero means perfectly calibrated.

Example

Inputs: Predicted: 75%, Actual: 60%

Calculation: Δ = 75 - 60 = 15

Result: +15% overestimation
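These formulas are simple enough to compute directly. Below is a minimal Python sketch of the Δ calculation; the function name is ours, not part of the product:

```python
def overestimation_delta(predicted_pct: float, actual_pct: float) -> float:
    """Overestimation Δ: positive means you overestimated your performance."""
    return predicted_pct - actual_pct

# Worked example from above: predicted 75%, actual 60%
print(overestimation_delta(75, 60))  # 15
```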

Calibration Score

Formula

Calibration = max(0, 100 − |Δ|)

Range

0-100

Interpretation

100 means you predicted your score exactly. Lower scores indicate miscalibration.

Example

Inputs: Δ = +15

Calculation: Calibration = max(0, 100 - |15|) = max(0, 85)

Result: 85 (well calibrated)
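The calibration formula can be sketched the same way (again a hypothetical helper, assuming Δ is already computed as above):

```python
def calibration_score(delta: float) -> float:
    """100 when the prediction is exact; falls by 1 per point of |Δ|, floored at 0."""
    return max(0.0, 100.0 - abs(delta))

print(calibration_score(15))  # 85.0
```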

Rubric Agreement

Formula

Agreement = 100 − 25 × |user_score − gold_score|

Range

0-100

Interpretation

Measures how closely your quality ratings match expert judgments. Each point difference costs 25 points.

Example

Inputs: User rated: 4, Gold standard: 3

Calculation: Agreement = 100 - 25 × |4 - 3| = 100 - 25 = 75

Result: 75 (1 point off)
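A sketch of the agreement formula. The `max(0, ...)` floor is our assumption to keep the result inside the published 0-100 range if ratings ever differ by more than 4 points:

```python
def rubric_agreement(user_score: int, gold_score: int) -> float:
    """Each point of disagreement with the gold rating costs 25 points."""
    return max(0.0, 100.0 - 25.0 * abs(user_score - gold_score))

print(rubric_agreement(4, 3))  # 75.0
```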

Efficiency Δ-time

Formula

Δ-time = ((manual_time − ai_time) / manual_time) × 100

Range

-50% to +80%

Interpretation

Positive means AI saved time. Negative means AI added time. Capped to prevent outliers.

Example

Inputs: Manual: 10 min, AI: 6 min

Calculation: Δ-time = ((10 - 6) / 10) × 100 = 40%

Result: +40% faster with AI
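A sketch of the time-efficiency formula, including the published -50% to +80% cap (function name and signature are ours; assumes `manual_min > 0`):

```python
def efficiency_delta_time(manual_min: float, ai_min: float) -> float:
    """Percent time saved by AI, capped to the published -50% .. +80% range."""
    raw = (manual_min - ai_min) / manual_min * 100.0
    return max(-50.0, min(80.0, raw))

print(efficiency_delta_time(10, 6))  # 40.0
```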

Cognitive Load Efficiency

Formula

CLE = 100 − average(Mental_Demand, Frustration)

Range

0-100

Interpretation

Higher scores mean lower cognitive burden when using AI assistance.

Example

Inputs: Mental Demand: 40, Frustration: 30

Calculation: CLE = 100 - ((40 + 30) / 2) = 100 - 35 = 65

Result: 65 (moderate cognitive efficiency)
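The CLE formula as a sketch, assuming both micro-TLX sliders are on a 0-100 scale as in the example:

```python
def cognitive_load_efficiency(mental_demand: float, frustration: float) -> float:
    """Higher is better: 100 minus the mean of the two micro-TLX sliders."""
    return 100.0 - (mental_demand + frustration) / 2.0

print(cognitive_load_efficiency(40, 30))  # 65.0
```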

Composite AI Judgment Score

Formula

(Literacy × 30%) + (Rubric Agreement × 30%) + (Calibration × 15%) + (Efficiency × 15%) + (Cognitive Load × 10%)

Range

0-100

Interpretation

Your overall AI readiness score combining knowledge, judgment, self-awareness, and efficiency.

Component Weights

30% - Foundational AI knowledge
30% - Quality evaluation accuracy
15% - Self-awareness accuracy
15% - Time efficiency with AI
10% - Mental burden management

Example

Inputs: Literacy: 80, Rubric: 75, Calibration: 85, Efficiency: 70, Cognitive: 65

Calculation: (80×0.30) + (75×0.30) + (85×0.15) + (70×0.15) + (65×0.10)

Result: 24 + 22.5 + 12.75 + 10.5 + 6.5 = 76.25 (rounds to 76)
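The weighted sum can be sketched with the published weights (dictionary keys are ours, chosen to match the component names above):

```python
WEIGHTS = {
    "literacy": 0.30,     # foundational AI knowledge
    "rubric": 0.30,       # quality evaluation accuracy
    "calibration": 0.15,  # self-awareness accuracy
    "efficiency": 0.15,   # time efficiency with AI
    "cognitive": 0.10,    # mental burden management
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of the five component scores (each 0-100)."""
    return sum(scores[name] * weight for name, weight in WEIGHTS.items())

example = {"literacy": 80, "rubric": 75, "calibration": 85,
           "efficiency": 70, "cognitive": 65}
print(composite_score(example))  # ~76.25
```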

Benchmark Requirements

Minimum Sample Size

Percentiles are only displayed when n ≥ 30 for the specific (role × pack × last 30 days) cell.

Time Window

Only data from the last 30 days is included in benchmark calculations.

Insufficient Sample

When minimum sample is not met, "Insufficient sample" is displayed instead of percentiles.

Percentile Calculation

percentile = ((below + 0.5 × equal) / n) × 100

Counting half of each tied group (the midpoint method) yields smooth percentile ranks even when scores cluster.

Composite Score Levels

0-29: Beginner
30-49: Developing
50-69: Proficient
70-84: Advanced
85-100: Expert
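Mapping a composite score onto these bands is a straightforward threshold check; a sketch (function name is ours):

```python
def composite_level(score: float) -> str:
    """Map a 0-100 composite score onto the published level bands."""
    if score < 30:
        return "Beginner"
    if score < 50:
        return "Developing"
    if score < 70:
        return "Proficient"
    if score < 85:
        return "Advanced"
    return "Expert"

print(composite_level(76))  # Advanced
```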

Scoring

Every pack is run twice (manual then AI-assisted) with the same rubric. Reviewers log pass/fail plus rework minutes so “faster” never hides defect rates.

Overestimation Index

Δ = self-rating − scored performance. We flag cohorts when >15% of runs exceed ±5 points so you know when confidence outruns evidence.

Micro-TLX

Two sliders (mental demand + frustration) appear after each run. They add <15 seconds and help you correlate speed with cognitive cost.

Item analysis

We calculate Cronbach’s α nightly across every assessment and pack question. When α drops below 0.78 we freeze that item until it is recalibrated.
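The nightly job itself isn't public, but the standard Cronbach's α computation over an items × respondents matrix looks like the following sketch (pure-Python, sample variance with n−1; assumes at least two respondents per item and non-constant totals):

```python
def cronbach_alpha(item_scores: list[list[float]]) -> float:
    """Cronbach's α: rows are items, columns are respondents."""
    k = len(item_scores)          # number of items
    n = len(item_scores[0])       # number of respondents

    def sample_var(xs: list[float]) -> float:
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

    item_var_sum = sum(sample_var(row) for row in item_scores)
    totals = [sum(row[j] for row in item_scores) for j in range(n)]
    return k / (k - 1) * (1 - item_var_sum / sample_var(totals))

# Two perfectly correlated items yield alpha = 1.0
print(cronbach_alpha([[1, 2, 3], [1, 2, 3]]))
```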

Benchmarks & limits

We publish cohort medians weekly so you can compare, but we never expose raw customer data. All deltas are normalized against your own history first.

Limits & Misuse

Context drift:

  • Benchmarks from one domain (e.g., coding) may not transfer to others (e.g., marketing)
  • AI model updates can invalidate historical baselines within weeks
  • Team-specific workflows may show different patterns than published medians

Outdated benchmarks:

  • Cohort medians older than 3 months should be re-validated
  • GPT-3.5 benchmarks don't apply to GPT-4 or newer models
  • Industry averages mask wide variance between high/low performers

Small sample risks:

  • Δ measurements need 20+ data points per role for statistical significance
  • Individual outliers can skew team averages in groups under 10 people
  • Single-sprint measurements don't account for learning curves

Remember: Use these methods to establish your own baselines first, then compare to industry benchmarks. Never make decisions based solely on external averages.

Need more detail?

Use the feedback button if you need anonymized cohort medians or reviewer guidance.

Measurement FAQ

Use these quick answers when execs or reviewers ask about the math.

How do you calculate Overestimation Δ?

Every pack runs twice (manual then AI). We subtract reviewer scores from self-ratings and flag cohorts when more than 15% of runs exceed ±5 points.

What does micro-TLX capture?

Two sliders log mental demand and frustration immediately after each run so Δ includes a fatigue lens.

Which benchmarks do you share?

We post cohort medians and reviewer minutes weekly so teams can compare progress without exposing raw prompts.

Open Beta

Help steer the Open Beta with real Δ and TLX tiles.

Run the analyzer demo, share methodology notes with your team, and send us benchmarks so the release ships with proof—not hype.
