

Validity & reliability

What the scores mean—and what they do not.

Executive Summary

Validity: Our metrics measure what they claim; Δ tracks overestimation and TLX measures cognitive load
Reliability: Cronbach's α > 0.78 across all assessments ensures consistent, repeatable results
Privacy: Individual prompts and runs stay private; only aggregated trends are shared for benchmarking
Evidence: A 3-tier grading system (A/B/C) indicates confidence level based on study quality and sample size

What this means: You can trust the numbers—they're scientifically validated, statistically reliable, and your data remains secure. Compare against benchmarks without exposing sensitive information.

What to expect

Quick reference so you can explain the metrics before anyone clicks into the Analyzer.

Overestimation Δ snapshot

Δ = self-rating − reviewer score. Keep the cohort mean Δ within ±5 to show that confidence tracks accuracy.
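As a minimal sketch, the Δ computation and the ±5 flag look like the following (the field names are illustrative, not the Analyzer's actual schema):

```python
# Illustrative sketch: compute overestimation Δ per run and flag |Δ| ≥ 5.
# Field names here are hypothetical, not the product's real data model.
runs = [
    {"run": "r1", "self_rating": 8, "reviewer_score": 6},
    {"run": "r2", "self_rating": 5, "reviewer_score": 7},
    {"run": "r3", "self_rating": 9, "reviewer_score": 3},
]

for r in runs:
    r["delta"] = r["self_rating"] - r["reviewer_score"]  # Δ = self-rating − reviewer score
    r["flagged"] = abs(r["delta"]) >= 5                  # flagged at ±5 or beyond

cohort_mean = sum(r["delta"] for r in runs) / len(runs)  # keep this within ±5
```

Here only r3 gets flagged (Δ = 6), while the cohort mean (2.0) stays inside the ±5 band.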

micro-TLX (2 sliders)

Log mental demand + frustration immediately after each run; it takes under 15 seconds but surfaces fatigue trends.
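A hypothetical logging helper for those two sliders might look like this; the 0–20 scale, function name, and field names are all assumptions for illustration, not the product API:

```python
from datetime import datetime, timezone

def log_micro_tlx(run_id, mental_demand, frustration):
    """Record the two micro-TLX sliders right after a run (0-20 scale assumed)."""
    if not (0 <= mental_demand <= 20 and 0 <= frustration <= 20):
        raise ValueError("slider values must be between 0 and 20")
    return {
        "run_id": run_id,
        "mental_demand": mental_demand,
        "frustration": frustration,
        "logged_at": datetime.now(timezone.utc).isoformat(),  # timestamp for trend charts
    }

entry = log_micro_tlx("r42", mental_demand=14, frustration=6)
```

Timestamping each entry is what makes fatigue trends visible over a series of runs.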

Fair Trial checklist

Counter-balance order, lock the rubric, and compare manual vs. AI time so the story survives scrutiny.
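One simple way to counterbalance order, sketched here under the assumption of two conditions and an even cohort, is to alternate manual-first and AI-first across participants:

```python
# Sketch: alternate condition order so half the cohort runs manual-first.
participants = ["p1", "p2", "p3", "p4"]
orders = [("manual", "ai"), ("ai", "manual")]
assignment = {p: orders[i % 2] for i, p in enumerate(participants)}
```

With this split, any learning or fatigue effect hits both conditions equally, so the manual-vs-AI time comparison survives scrutiny.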

Review checklist

What we share

Aggregated deltas, TLX trends, and Cronbach’s α—never raw prompts or personal data.

What we do not share

Individual runs or annotations never leave your workspace.

Next steps

Want formulas and audit steps? Review the full methodology.

Limits & Misuse

Statistical limitations:

  • Confidence intervals widen with sample sizes under 30 runs
  • Cross-team comparisons require similar task complexity to be valid
  • Self-selection bias affects voluntary assessment participation
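The first point can be made concrete with the usual normal-approximation half-width, z·s/√n; this is a sketch of the general effect, not the exact interval the product reports:

```python
import math

def ci_halfwidth(stdev, n, z=1.96):
    """Approximate 95% CI half-width for a mean: z * s / sqrt(n)."""
    return z * stdev / math.sqrt(n)

# Same spread, different cohort sizes: the interval roughly triples
# going from 100 runs down to 10.
small = ci_halfwidth(2.0, 10)   # ≈ 1.24
large = ci_halfwidth(2.0, 100)  # ≈ 0.39
```

At n = 10 the interval is wide enough that a cohort Δ near the ±5 boundary is hard to call either way.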

Measurement validity boundaries:

  • TLX captures perceived workload, not objective performance
  • Δ measurements assume honest self-assessment (gaming is possible)
  • Cultural differences affect self-rating scales across global teams

Privacy model constraints:

  • Aggregated data can still reveal patterns in small teams (<5 people)
  • Benchmark comparisons may not account for industry-specific requirements
  • Data retention policies vary by jurisdiction and compliance needs

Key point: These measurements are diagnostic tools, not performance evaluations. Use them to identify improvement opportunities, not to rank individuals. Review evidence standards →

Glossary

Overestimation Δ
Difference between self-rating and reviewer score. Flagged when |Δ| is 5 or greater.
micro-TLX
Two quick sliders (mental demand + frustration) captured right after each run.
Item analysis
Cronbach’s α verifies that quiz items and pack rubrics stay reliable.
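For reference, Cronbach's α over a small score matrix (rows = respondents, columns = items) can be computed with the standard formula; this is a plain-Python sketch, not the product's implementation:

```python
def cronbach_alpha(scores):
    """Cronbach's α: (k / (k-1)) * (1 - sum of item variances / variance of totals)."""
    k = len(scores[0])  # number of items
    def var(xs):        # sample variance (ddof = 1)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = [var([row[j] for row in scores]) for j in range(k)]
    total_var = var([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Four respondents, three quiz items scored 1-5 (made-up data).
alpha = cronbach_alpha([[3, 4, 3], [4, 5, 4], [2, 3, 2], [5, 5, 5]])  # ≈ 0.98
```

An α above the 0.78 threshold cited in the summary indicates the items move together consistently.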

Validity FAQ

Bring these along when Legal or Compliance wants a quick briefing.

What data leaves my workspace?

Only anonymized deltas, TLX trends, and reviewer medians so execs can compare cohorts without raw runs.

How are reviewer notes protected?

Reviewer evidence stays scoped to your workspace with access controls and export logging.

How do I brief leadership on validity?

Pair this page with Methodology, include Δ + TLX tiles, and link the interpretation guide for instant context.

Open Beta

Help steer the Open Beta with real Δ and TLX tiles.

Run the analyzer demo, share methodology notes with your team, and send us benchmarks so the release ships with proof—not hype.
