Overestimation Δ snapshot
Δ = self-rating − reviewer score. Keep cohort deltas within ±5 points to show that confidence tracks accuracy.
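The Δ check above can be sketched in a few lines. This is a minimal illustration, not the Analyzer's implementation; the function names and the sample scores are assumptions.

```python
# Minimal sketch of the overestimation Δ check: Δ = self-rating − reviewer score,
# with a cohort-level guard that every Δ stays inside a ±5 band.
# Function names and sample scores are illustrative, not the product's API.

def overestimation_delta(self_rating: float, reviewer_score: float) -> float:
    """Δ = self-rating − reviewer score."""
    return self_rating - reviewer_score

def cohort_within_band(deltas: list[float], band: float = 5.0) -> bool:
    """True when every participant's Δ stays inside ±band."""
    return all(abs(d) <= band for d in deltas)

runs = [(78, 74), (65, 68), (82, 80)]        # (self-rating, reviewer score)
deltas = [overestimation_delta(s, r) for s, r in runs]
print(deltas)                                 # → [4, -3, 2]
print(cohort_within_band(deltas))             # → True
```

A cohort that returns False here is the signal to investigate before quoting the confidence ≈ accuracy claim.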
Validity
What the scores mean—and what they do not.
What this means: You can trust the numbers—they're scientifically validated, statistically reliable, and your data remains secure. Compare against benchmarks without exposing sensitive information.
Before you run your evaluation:
What to expect
Quick reference so you can explain the metrics before anyone clicks into the Analyzer.
Δ = self-rating − reviewer score. Keep cohort deltas within ±5 points to show that confidence tracks accuracy.
Log mental demand + frustration immediately after each run; it takes under 15 seconds but surfaces fatigue trends.
Counterbalance run order, lock the rubric, and compare manual vs. AI time so the story survives scrutiny.
Review checklist
Aggregated deltas, TLX trends, and Cronbach's α—never raw prompts or personal data.
Individual runs or annotations never leave your workspace.
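The Cronbach's α named in the review checklist is a standard internal-consistency statistic. The sketch below shows the textbook formula on illustrative rubric scores; it is not the Analyzer's implementation, and the sample numbers are assumptions.

```python
# Cronbach's α = k/(k−1) · (1 − Σ item variances / variance of totals),
# where k is the number of rubric items. Illustrative data only.
from statistics import pvariance

def cronbach_alpha(items: list[list[float]]) -> float:
    """α for a list of item-score columns (one list per rubric item,
    same respondents in the same order in every column)."""
    k = len(items)
    item_vars = sum(pvariance(col) for col in items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent total
    return k / (k - 1) * (1 - item_vars / pvariance(totals))

# Three rubric items scored for four reviewers (made-up numbers).
items = [
    [4, 3, 5, 4],
    [4, 4, 5, 3],
    [5, 3, 4, 4],
]
print(round(cronbach_alpha(items), 2))  # → 0.6
```

Higher α means reviewers' item scores move together; conventions vary, but values below roughly 0.7 usually prompt a look at the rubric before trusting the aggregate.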
Next steps
Want formulas and audit steps? Review the full methodology.
Know the limits before you present: statistical limitations, measurement validity boundaries, and privacy model constraints each bound what the scores can claim.
Key point: These measurements are diagnostic tools, not performance evaluations. Use them to identify improvement opportunities, not to rank individuals. Review the evidence standards for details.
Bring these along when Legal or Compliance wants a quick briefing.
What data leaves my workspace?
Only anonymized deltas, TLX trends, and reviewer medians so execs can compare cohorts without raw runs.
How are reviewer notes protected?
Reviewer evidence stays scoped to your workspace with access controls and export logging.
How do I brief leadership on validity?
Pair this page with Methodology, include Δ + TLX tiles, and link the interpretation guide for instant context.
Open Beta
Run the analyzer demo, share methodology notes with your team, and send us benchmarks so the release ships with proof—not hype.