Research Standards

Evidence Levels & Quality Standards

How we grade the confidence level of our AI evaluation research and recommendations

How to Use Evidence Levels Safely

✓ Match evidence level to decision risk: Level A for strategic commitments, B for pilots, C for exploration
✓ Always check publication date: AI evidence older than 12 months may be outdated
✓ Consider your context: Validate findings match your industry, scale, and technical maturity

What NOT to do:

• Never use Level C evidence alone for major investments
• Don't ignore contradictory evidence from different sources
• Avoid generalizing from single-vendor case studies
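
The matching rules in this section can be sketched as a small check. This is only an illustrative sketch, not part of our grading tooling: the function names and the decision-type labels ("strategic", "pilot", "exploration") are hypothetical, and it assumes "older than 12 months" means more than 12 whole calendar months since publication.

```python
from datetime import date

# Minimum evidence level per decision type, taken from the guidance above.
# The decision-type keys are hypothetical labels chosen for this sketch.
REQUIRED_LEVEL = {"strategic": "A", "pilot": "B", "exploration": "C"}
LEVEL_RANK = {"A": 3, "B": 2, "C": 1}
MAX_AGE_MONTHS = 12  # staleness threshold from the guidance above


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months elapsed from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the final month is not yet complete
    return months


def check_evidence(level: str, decision: str, published: date, today: date) -> list[str]:
    """Return a list of warnings; an empty list means no rule was tripped."""
    warnings = []
    required = REQUIRED_LEVEL[decision]
    if LEVEL_RANK[level] < LEVEL_RANK[required]:
        warnings.append(
            f"Level {level} is below the Level {required} suggested for {decision} decisions"
        )
    if months_between(published, today) > MAX_AGE_MONTHS:
        warnings.append("Evidence is older than 12 months and may be outdated")
    return warnings
```

For example, `check_evidence("C", "strategic", date(2023, 1, 15), date(2024, 6, 1))` returns two warnings (level too low and evidence too old), while Level A evidence published five months before a pilot decision returns an empty list. The check covers only the level and recency rules; context fit and contradictory evidence still need human judgment.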

Quick Reference for Executives

Level A: Gold-standard evidence - safe to base strategic decisions on
Level B: Industry-validated - appropriate for pilot programs and controlled rollouts
Level C: Emerging insights - useful for exploration but requires validation

How We Assign Evidence Levels

Our evidence grading follows established scientific standards adapted for AI evaluation contexts. Each claim, metric, and recommendation receives a grade based on:

• Study design quality - Randomized controlled trials (RCTs) and systematic reviews earn higher grades
• Sample size and diversity - Larger, more representative samples increase confidence
• Reproducibility - Findings replicated across contexts receive higher ratings
• Quantitative rigor - Findings that report statistical significance and effect sizes earn higher grades
• Recency - More recent evidence carries more weight, especially for AI

Evidence Level Definitions

Evidence A

Strong Evidence

Important Limits & Potential Misuse

• Evidence levels indicate confidence in research quality, not guaranteed outcomes
• Context matters: Strong evidence in one setting may not transfer to yours
• AI capabilities evolve rapidly; evidence older than 12 months may be outdated
• Never use single studies to justify major decisions; seek converging evidence
• Consider your organization's unique constraints and capabilities

Evidence B

Moderate Evidence

Evidence C

Limited Evidence
