Methodology
Scoring, Overestimation Δ, micro-TLX, item analysis, and benchmarks.
Bottom line: Run each task twice (manual then AI), measure Δ between expectation and reality, track cognitive load via micro-TLX, and validate with reviewer minutes to prove actual lift.
Before you run your evaluation: what to expect
Quick reference so you can explain the metrics before anyone clicks into the Analyzer.
Δ = self-rating − reviewer score. Keep cohorts within ±5 to prove confidence ≈ accuracy.
Log mental demand + frustration immediately after each run; it takes under 15 seconds but surfaces fatigue trends.
Counter-balance order, lock the rubric, and compare manual vs. AI time so the story survives scrutiny.
Review checklist
Transparent methodology: all scores are calculated using the public formulas below.
Overestimation Δ
Formula: Δ = predicted score − actual (reviewer) score
Range
-100 to +100
Interpretation
Positive values indicate overestimation of your abilities. Zero means perfectly calibrated.
Example
Inputs: Predicted: 75%, Actual: 60%
Calculation: Δ = 75 - 60 = 15
Result: +15% overestimation
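The worked example above can be sketched in a few lines of Python (the function name is illustrative, not the Analyzer's actual API):

```python
def overestimation_delta(predicted: float, actual: float) -> float:
    """Overestimation Δ: self-predicted score minus reviewer-scored performance.

    Positive values mean overconfidence; zero means perfect calibration.
    """
    return predicted - actual

# Worked example from the docs: predicted 75%, actual 60%
print(overestimation_delta(75, 60))  # 15
```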
Calibration Score
Formula: Calibration = max(0, 100 − |Δ|)
Range
0-100
Interpretation
100 means you predicted your score exactly. Lower scores indicate miscalibration.
Example
Inputs: Δ = +15
Calculation: Calibration = max(0, 100 - |15|) = max(0, 85) = 85
Result: 85 (well calibrated)
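Applying the published formula directly (illustrative helper, not the production code):

```python
def calibration_score(delta: float) -> float:
    """Calibration = max(0, 100 − |Δ|); 100 means a perfect prediction."""
    return max(0.0, 100.0 - abs(delta))

print(calibration_score(15))  # 85.0, matching the worked example
```

The floor at zero means any |Δ| of 100 or more scores the same: fully miscalibrated.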
Rubric Agreement
Formula: Agreement = 100 − 25 × |your rating − gold standard|
Range
0-100
Interpretation
Measures how closely your quality ratings match expert judgments. Each point difference costs 25 points.
Example
Inputs: User rated: 4, Gold standard: 3
Calculation: Agreement = 100 - 25 × |4 - 3| = 100 - 25 = 75
Result: 75 (1 point off)
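As a sketch (the clamp to the published 0-100 range is an assumption; the doc only states the range, not how out-of-range values are handled):

```python
def rubric_agreement(user_rating: int, gold_rating: int) -> int:
    """Agreement = 100 − 25 × |user − gold|; each point of disagreement costs 25.

    Clamped at 0 to honor the published 0-100 range (assumption).
    """
    return max(0, 100 - 25 * abs(user_rating - gold_rating))

print(rubric_agreement(4, 3))  # 75
```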
Time Efficiency (Δ-time)
Formula: Δ-time = ((manual minutes − AI minutes) / manual minutes) × 100
Range
-50% to +80%
Interpretation
Positive means AI saved time; negative means AI added time. Values are capped at −50% and +80% so outliers cannot skew aggregates.
Example
Inputs: Manual: 10 min, AI: 6 min
Calculation: Δ-time = ((10 - 6) / 10) × 100 = 40%
Result: +40% faster with AI
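A minimal sketch of the time-delta calculation, including the −50%/+80% caps from the published range (function name is illustrative):

```python
def time_efficiency(manual_min: float, ai_min: float) -> float:
    """Δ-time = ((manual − AI) / manual) × 100, capped to −50%..+80%."""
    raw = (manual_min - ai_min) / manual_min * 100.0
    return max(-50.0, min(80.0, raw))

print(time_efficiency(10, 6))  # 40.0 → "+40% faster with AI"
```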
Cognitive Load Efficiency (CLE)
Formula: CLE = 100 − ((mental demand + frustration) / 2)
Range
0-100
Interpretation
Higher scores mean lower cognitive burden when using AI assistance.
Example
Inputs: Mental Demand: 40, Frustration: 30
Calculation: CLE = 100 - ((40 + 30) / 2) = 100 - 35 = 65
Result: 65 (moderate cognitive efficiency)
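The two micro-TLX sliders feed directly into this average (illustrative helper):

```python
def cognitive_load_efficiency(mental_demand: float, frustration: float) -> float:
    """CLE = 100 − mean(mental demand, frustration); higher = lower burden."""
    return 100.0 - (mental_demand + frustration) / 2.0

print(cognitive_load_efficiency(40, 30))  # 65.0
```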
Overall Readiness Score
Formula: weighted sum of the five component scores
Range
0-100
Interpretation
Your overall AI readiness score combining knowledge, judgment, self-awareness, and efficiency.
Component Weights: Literacy 30%, Rubric 30%, Calibration 15%, Efficiency 15%, Cognitive 10%
Example
Inputs: Literacy: 80, Rubric: 75, Calibration: 85, Efficiency: 70, Cognitive: 65
Calculation: (80×0.30) + (75×0.30) + (85×0.15) + (70×0.15) + (65×0.10)
Result: 24 + 22.5 + 12.75 + 10.5 + 6.5 = 76.25, rounded to 76
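The weighted sum can be sketched as follows (the weight names are illustrative labels taken from the worked example's inputs):

```python
# Component weights from the worked example:
# Literacy/Rubric 30% each, Calibration/Efficiency 15% each, Cognitive 10%.
WEIGHTS = {
    "literacy": 0.30,
    "rubric": 0.30,
    "calibration": 0.15,
    "efficiency": 0.15,
    "cognitive": 0.10,
}

def readiness_score(components: dict) -> float:
    """Weighted sum of the five component scores (each on a 0-100 scale)."""
    return sum(components[name] * w for name, w in WEIGHTS.items())

example = {"literacy": 80, "rubric": 75, "calibration": 85,
           "efficiency": 70, "cognitive": 65}
print(round(readiness_score(example)))  # 76, from a raw score of about 76.25
```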
Minimum Sample Size
Percentiles are only displayed when n ≥ 30 for the specific (role × pack × last 30 days) cell.
Time Window
Only data from the last 30 days is included in benchmark calculations.
Insufficient Sample
When the minimum sample is not met, "Insufficient sample" is displayed instead of percentiles.
Percentile Calculation
Uses linear interpolation between data points for smooth percentile ranks.
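The benchmark rules above (n ≥ 30 gating plus linearly interpolated ranks) can be sketched in pure Python; this is a hypothetical helper, not the production calculation:

```python
def percentile_rank(cohort, score, min_n=30):
    """Interpolated percentile rank of `score` within `cohort`.

    Returns None when n < min_n, which the UI renders as
    "Insufficient sample" instead of a percentile.
    """
    if len(cohort) < min_n:
        return None
    xs = sorted(cohort)
    n = len(xs)
    if score <= xs[0]:
        return 0.0
    if score >= xs[-1]:
        return 100.0
    # Find the bracketing pair of data points and interpolate linearly.
    for i in range(1, n):
        if score <= xs[i]:
            lo, hi = xs[i - 1], xs[i]
            frac = 0.0 if hi == lo else (score - lo) / (hi - lo)
            return ((i - 1) + frac) / (n - 1) * 100.0
```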
Every pack is run twice (manual then AI-assisted) with the same rubric. Reviewers log pass/fail plus rework minutes so “faster” never hides defect rates.
Δ = self-rating − scored performance. We flag cohorts when >15% of runs exceed ±5 points so you know when confidence outruns evidence.
Two sliders (mental demand + frustration) appear after each run. They add <15 seconds and help you correlate speed with cognitive cost.
We calculate Cronbach’s α nightly across every assessment and pack question. When α drops below 0.78 we freeze that item until it is recalibrated.
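Cronbach's α has a standard formula: α = (k / (k − 1)) × (1 − Σ item variances / variance of total scores). A self-contained sketch (the nightly job itself is not public; this just shows the statistic):

```python
def _var(xs):
    """Sample variance (ddof = 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(responses):
    """Cronbach's α for `responses`, a list of per-respondent item-score lists.

    alpha = (k / (k - 1)) * (1 - sum(item variances) / var(total scores))
    """
    k = len(responses[0])
    items = list(zip(*responses))  # transpose to per-item columns
    item_var_sum = sum(_var(col) for col in items)
    total_var = _var([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)
```

Perfectly consistent items yield α = 1.0; an item would be frozen here whenever the pack's α falls below the 0.78 threshold.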
We publish cohort medians weekly so you can compare, but we never expose raw customer data. All deltas are normalized against your own history first.
Watch for: context drift, outdated benchmarks, and small-sample risks.
Remember: Use these methods to establish your own baselines first, then compare to industry benchmarks. Never make decisions based solely on external averages.
Need more detail?
Use the feedback button if you need anonymized cohort medians or reviewer guidance.
Use these quick answers when execs or reviewers ask about the math.
How do you calculate Overestimation Δ?
Every pack runs twice (manual, then AI-assisted). We subtract reviewer scores from self-ratings and flag cohorts when more than 15% of runs exceed ±5 points.
What does micro-TLX capture?
Two sliders log mental demand and frustration immediately after each run so Δ includes a fatigue lens.
Which benchmarks do you share?
We post cohort medians and reviewer minutes weekly so teams can compare progress without exposing raw prompts.
Open Beta
Run the analyzer demo, share methodology notes with your team, and send us benchmarks so the release ships with proof—not hype.