Validity & Reliability
What the scores mean (and don't)
Measurements You Can Rely On
Our scoring system is built on peer-reviewed psychometric methods, validated statistical techniques, and transparent privacy practices. Know exactly what the numbers mean—and what they don't.
All metrics validated against peer-reviewed standards
Executive Summary
What you need to know in 60 seconds
Validity
Metrics measure what they claim
Reliability
Consistent, repeatable results
Privacy
Individual data stays private
Transparency
A/B/C grading shows confidence
Bottom line: You can trust the numbers. They are scientifically validated and statistically reliable, and your data stays secure. Compare against benchmarks without exposing sensitive information.
Our Three Guarantees
The foundation of trustworthy AI measurement
Validity
Our metrics measure what they claim to measure
- Δ tracks overestimation (predicted − actual)
- TLX captures cognitive workload (NASA-validated)
- Fair Trial controls for confounding variables
Reliability
Results are consistent and repeatable
- Cronbach's α > 0.78 across assessments
- Test-retest correlation r > 0.85
- Inter-rater reliability κ > 0.70
Privacy
Your data stays secure and under your control
- Individual prompts never shared
- Aggregated trends only for benchmarks
- Export or delete anytime
What We Share
Aggregated data for benchmarking
What We Never Share
Your private data stays private
Key Terms Glossary
Definitions and formulas
Overestimation Delta (Δ)
The difference between your predicted quality score and the actual scored result. Positive values indicate overconfidence; negative values indicate underestimation.
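A minimal sketch of the Δ calculation, assuming predicted and actual quality are scored on the same scale; the function and field names are illustrative, not the platform's actual schema.

```python
def overestimation_delta(predicted_score: float, actual_score: float) -> float:
    """Overestimation Delta (Δ): predicted quality minus actual scored quality.

    Positive -> overconfidence, negative -> underestimation, zero -> calibrated.
    """
    return predicted_score - actual_score

# Example: a user predicts 8/10, but the blind-reviewed score is 6.5/10.
delta = overestimation_delta(predicted_score=8.0, actual_score=6.5)
print(delta)  # 1.5 -> overconfident by 1.5 points
```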
Micro-TLX
A streamlined version of NASA's Task Load Index measuring cognitive workload through two dimensions: Mental Demand and Frustration.
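As a sketch of how the two dimensions could combine, assuming an unweighted average on the standard 0–100 TLX scale; the actual weighting used in scoring is an assumption here, not documented behavior.

```python
def micro_tlx(mental_demand: float, frustration: float) -> float:
    """Micro-TLX sketch: unweighted mean of the two ratings on a 0-100 scale.

    Assumption: equal weighting; the product may combine dimensions differently.
    """
    for rating in (mental_demand, frustration):
        if not 0 <= rating <= 100:
            raise ValueError("Ratings must be on a 0-100 scale")
    return (mental_demand + frustration) / 2

print(micro_tlx(mental_demand=70, frustration=40))  # 55.0
```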
Cronbach's Alpha (α)
A measure of internal consistency showing how closely related assessment items are. Values above 0.78 indicate good reliability.
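For reference, a minimal implementation of the textbook Cronbach's α formula over a respondents × items score matrix; this illustrates the standard calculation, not the platform's internal code.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance of total scores)
    """
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of summed scores
    return n_items / (n_items - 1) * (1 - item_variances.sum() / total_variance)

# Example: 5 respondents x 4 assessment items (illustrative ratings only).
ratings = np.array([
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 4],
])
print(round(cronbach_alpha(ratings), 2))
```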
Fair Trial Protocol
Our controlled methodology requiring identical tasks run both manually and with AI assistance, scored by blind reviewers using the same rubric.
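A sketch of what a single Fair Trial record could look like, assuming one manual run and one AI-assisted run of the same task, both scored blind on a shared rubric; the class and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class FairTrialRecord:
    """One Fair Trial pairing: same task, manual vs. AI-assisted (illustrative schema)."""
    task_id: str
    rubric_id: str                 # identical rubric applied to both runs
    manual_score: float            # blind-reviewed score, manual run
    ai_assisted_score: float       # blind-reviewed score, AI-assisted run
    reviewer_blinded: bool = True  # reviewers do not know which run used AI

    @property
    def lift(self) -> float:
        """Score difference attributable to AI assistance under this controlled design."""
        return self.ai_assisted_score - self.manual_score

trial = FairTrialRecord("task-042", "rubric-v3", manual_score=6.0, ai_assisted_score=7.5)
print(trial.lift)  # 1.5
```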
Limits & Responsible Use
Important caveats to understand
Statistical Limitations
- Confidence intervals widen below n=30 (see the sketch after this list)
- Cross-team comparisons need similar tasks
- Self-selection bias in voluntary assessments
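To illustrate the first caveat, a quick sketch of how the half-width of a 95% confidence interval on a mean shrinks roughly with 1/√n; the numbers are illustrative only, and a t-interval would be somewhat wider at small n.

```python
from statistics import NormalDist

def ci_half_width(sample_std: float, n: int, confidence: float = 0.95) -> float:
    """Approximate half-width of a confidence interval on a mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # ~1.96 for 95%
    return z * sample_std / n ** 0.5

for n in (10, 30, 100):
    print(n, round(ci_half_width(sample_std=1.0, n=n), 2))
# 10 -> ~0.62, 30 -> ~0.36, 100 -> ~0.20 (intervals widen as n shrinks)
```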
Measurement Boundaries
- TLX = perceived load, not objective
- Δ assumes honest self-assessment
- Cultural bias in self-rating scales
Privacy Constraints
- Small teams (<5) may reveal patterns
- Benchmarks vary by industry context
- Retention policies vary by region
Key point: These are diagnostic tools, not performance evaluations. Use them to identify improvement opportunities, not to rank individuals. Review research standards →
Frequently Asked Questions
Common questions about validity and privacy
What data leaves my workspace?
Only aggregated, anonymized statistics like role-based percentiles. Individual prompts, outputs, and run details never leave your workspace. You can opt out of benchmarking entirely.
How do you ensure the scores are accurate?
We use blind review (reviewers don't know which output used AI), standardized rubrics, and statistical validation. Cronbach's α > 0.78 indicates good internal consistency across our assessments.
How confident should I be in these measurements?
Check the research grade (A/B/C) on each metric. Grade A means RCT-level evidence; Grade C means preliminary data. We're transparent about confidence levels.
Can I delete my data?
Yes. You can export all of your data at any time and request full deletion. We comply with GDPR, CCPA, and other privacy regulations.
Why do some benchmarks show 'Insufficient sample'?
We require n ≥ 30 data points for a specific role × task × time window before showing percentiles. This prevents misleading comparisons from small samples.
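A minimal sketch of the gating rule described above, assuming a simple role × task × time-window key; the function name and threshold constant are illustrative.

```python
MIN_SAMPLE_SIZE = 30  # percentiles are hidden below this sample size

def benchmark_status(sample_size: int) -> str:
    """Return whether a role x task x time-window benchmark can be shown."""
    if sample_size < MIN_SAMPLE_SIZE:
        return "Insufficient sample"
    return "Percentiles available"

print(benchmark_status(12))  # Insufficient sample
print(benchmark_status(45))  # Percentiles available
```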
Open Beta: See Your Validated Scores
Now that you understand what the numbers mean, run your first assessment and get validated measurements of your AI productivity.