What makes a trial 'fair'?
A fair trial controls for confounds: same task, same evaluator, same rubric, same time pressure. Without controls, you're measuring noise, not AI impact.
Pre-Flight Checklist
Before you start, confirm:
- ✓ Task selected: Routine task you do at least weekly
- ✓ Rubric locked: Using anti-drift template (see Rubrics guide)
- ✓ Sample size: Minimum 5 matched pairs (10 total outputs)
- ✓ Evaluator assigned: Same person scores all outputs
- ✓ Order randomized: Evaluator doesn't know which is AI vs. manual
- ✓ Timing ready: Stopwatch for task completion time
- ✓ TLX prepared: Workload questionnaire for each task (scoring sketch follows this checklist)
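If you use the standard NASA-TLX form, each of its six subscales (mental demand, physical demand, temporal demand, performance, effort, frustration) is rated 0–100, and the common "Raw TLX" shortcut is simply their unweighted mean. A minimal sketch of that calculation, with made-up ratings:

```python
# Raw TLX: unweighted mean of the six NASA-TLX subscales, each rated 0-100.
TLX_DIMENSIONS = (
    "mental_demand", "physical_demand", "temporal_demand",
    "performance", "effort", "frustration",
)

def raw_tlx(ratings: dict) -> float:
    """Return the Raw TLX score (0-100) for one questionnaire."""
    missing = [d for d in TLX_DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"Missing TLX ratings: {missing}")
    return sum(ratings[d] for d in TLX_DIMENSIONS) / len(TLX_DIMENSIONS)

# Example: workload recorded after one AI-assisted summary (illustrative numbers).
print(raw_tlx({
    "mental_demand": 40, "physical_demand": 10, "temporal_demand": 35,
    "performance": 20, "effort": 45, "frustration": 25,
}))  # -> 29.17
```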
The Fair Trial Protocol
Step 1: Select 5 Matched Task Instances
Choose 5 representative examples of the task. They should vary in complexity but be comparable:
| Pair | Task Description | Complexity (1-3) |
|------|------------------|------------------|
| 1 | [e.g., "Summarize Q3 report"] | 2 |
| 2 | [e.g., "Summarize competitor analysis"] | 2 |
| 3 | [e.g., "Summarize customer feedback"] | 1 |
| 4 | [e.g., "Summarize market research"] | 3 |
| 5 | [e.g., "Summarize internal audit"] | 2 |
Step 2: Generate Both Versions
For each task:
- Manual version: Complete the task without AI (time it)
- AI version: Complete the task with AI assistance (time it)
- Record workload: Complete TLX after each version
Important: Randomize which you do first (flip a coin per pair).
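A minimal sketch of that coin flip, assuming you keep the pairs in a simple list (the task names below are the examples from Step 1); fixing the order up front keeps learning and fatigue effects from always favoring one condition:

```python
import random

pairs = [
    "Summarize Q3 report",
    "Summarize competitor analysis",
    "Summarize customer feedback",
    "Summarize market research",
    "Summarize internal audit",
]

# Flip a coin per pair so neither condition systematically goes first.
schedule = [
    {"pair": task, "first": random.choice(["manual", "ai"])}
    for task in pairs
]
for row in schedule:
    print(f"{row['pair']}: do the {row['first']} version first")
```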
Step 3: Blind the Evaluator
- Remove any identifying markers (prompts, AI artifacts)
- Assign random IDs (A1, A2, B1, B2, etc.); make sure the labels don't encode which condition produced them
- Shuffle order before evaluation
- Evaluator sees only: Output + Rubric (a scripted blinding sketch follows this list)
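A minimal blinding sketch, assuming each output has already been scrubbed of prompts and AI artifacts and saved as its own text file (the outputs/ layout and file names are hypothetical); it shuffles the files, assigns neutral IDs, and writes the ID-to-condition key to a separate file the evaluator never opens:

```python
import csv
import random
from pathlib import Path

# Assumed layout (hypothetical): outputs/manual_1.txt ... outputs/ai_5.txt,
# already scrubbed of prompts and other identifying markers.
outputs = sorted(Path("outputs").glob("*.txt"))
random.shuffle(outputs)

Path("blinded").mkdir(exist_ok=True)
key_rows = []
for i, path in enumerate(outputs, start=1):
    blind_id = f"X{i:02d}"  # the only label the evaluator ever sees
    condition = "ai" if path.name.startswith("ai") else "manual"
    (Path("blinded") / f"{blind_id}.txt").write_text(path.read_text())
    key_rows.append({"blind_id": blind_id, "source": path.name, "condition": condition})

# The key stays sealed until Step 5 (unblinding).
with open("unblinding_key.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["blind_id", "source", "condition"])
    writer.writeheader()
    writer.writerows(key_rows)
```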
Step 4: Score All Outputs
Use your locked rubric. Record:
| Output ID | C1 Score | C2 Score | C3 Score | Total | Time (min) | TLX Score |
|-----------|----------|----------|----------|-------|------------|-----------|
| A1 | | | | | | |
| B1 | | | | | | |
| A2 | | | | | | |
| ... | | | | | | |
Step 5: Unblind and Analyze
Reveal which outputs were AI vs. manual, then calculate the deltas below (a scripted sketch follows the table):
| Metric | Manual (avg) | AI-Assisted (avg) | Δ |
|--------|--------------|-------------------|---|
| Quality Score | | | |
| Time (min) | | | |
| TLX Workload | | | |
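A minimal analysis sketch, assuming the unblinding key and score sheet have been merged into per-pair (manual, AI) records; the numbers below are purely illustrative. It prints the averages and deltas for each metric and runs a paired Wilcoxon signed-rank test via SciPy as a rough significance check; with only 5 pairs, treat the p-value as directional rather than conclusive:

```python
from statistics import mean
from scipy.stats import wilcoxon  # assumes SciPy is available

# Hypothetical per-pair results after unblinding: (manual, ai) for each metric.
quality  = [(12, 11), (13, 12), (10, 11), (14, 12), (12, 13)]   # rubric totals
time_min = [(38, 22), (45, 30), (25, 18), (60, 41), (35, 24)]   # minutes
tlx      = [(55, 40), (62, 48), (45, 44), (70, 52), (50, 39)]   # Raw TLX

def summarize(name, pairs):
    manual = [m for m, _ in pairs]
    ai = [a for _, a in pairs]
    delta = mean(ai) - mean(manual)
    stat, p = wilcoxon(manual, ai)  # paired, non-parametric comparison
    print(f"{name:10s} manual={mean(manual):5.1f}  ai={mean(ai):5.1f}  "
          f"delta={delta:+5.1f}  p={p:.3f}")

summarize("Quality", quality)
summarize("Time", time_min)
summarize("TLX", tlx)
```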
Fair Trial Log Template
Fair Trial Log — [Task Type] — [Date]
Setup
- Task: [description]
- Sample size: 5 pairs
- Evaluator: [name/initials]
- Rubric: [link or name]
Pairs
| Pair | Task | Manual First? | Manual Time | AI Time |
|------|------|---------------|-------------|---------|
| 1 | | Y/N | min | min |
| 2 | | Y/N | min | min |
| 3 | | Y/N | min | min |
| 4 | | Y/N | min | min |
| 5 | | Y/N | min | min |
Blinded Evaluation
| ID | C1 | C2 | C3 | Total |
|----|----|----|----|-------|
| | | | | |
Results (after unblinding)
| Metric | Manual | AI | Δ | Significant? |
|--------|--------|----|---|--------------|
| Quality | | | | |
| Time | | | | |
| TLX | | | | |
Conclusions
- Quality difference: [higher/lower/same]
- Time difference: [faster/slower/same]
- Workload difference: [lighter/heavier/same]
- Recommendation: [continue/adjust/abandon]
Interpreting Results
Scenario A: AI wins on quality AND time
Action: Document the workflow and scale it. Repeat the trial in 2 weeks to confirm.
Scenario B: AI wins on time, loses on quality
Action: AI draft + human edit may work. Calculate: does edit time + AI time < manual time?
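That check is simple arithmetic; a sketch with illustrative numbers (swap in your own trial averages):

```python
# Does AI draft + human edit beat fully manual? Illustrative numbers only.
manual_min = 38     # average manual time from the trial
ai_draft_min = 22   # average AI-assisted time from the trial
edit_min = 5        # time to bring the AI draft up to your quality bar

net_gain = manual_min - (ai_draft_min + edit_min)
print(f"Net gain per task: {net_gain} min")  # positive -> AI draft + edit wins
```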
Scenario C: AI loses on time, wins on quality
Action: Use AI for high-stakes outputs where quality justifies time investment.
Scenario D: AI loses on both
Action: Try different prompts, different model, or accept this task isn't AI-ready yet.
“"Our first fair trial showed AI summaries were 30% faster but scored 15% lower. We almost abandoned it—but the edit time was only 5 minutes. Net win."”
Common Fair Trial Mistakes
- ✗ Uncontrolled complexity: Comparing AI on easy tasks vs. manual on hard tasks
- ✗ Evaluator bias: Evaluator knows which is AI and scores accordingly
- ✗ Missing workload data: Claiming "faster" without measuring cognitive load
- ✗ Single trial: One comparison is anecdote; 5+ pairs is evidence
- ✗ Rubric drift: Standards change between manual and AI scoring
Citations
- Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin.
- Hart, S. G. (2006). "NASA-Task Load Index (NASA-TLX); 20 Years Later." Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 50(9), 904–908.
- Google Research. (2024). "Rigorous A/B Testing for AI-Assisted Workflows." Technical Report.
Apply this now
Practice prompt
Run a 5-pair fair trial on your most common AI-assisted task this week.
Try this now
Identify one task you currently use AI for—that's your first trial candidate.
Common pitfall
Running unblinded trials—if you know which is AI, you'll score it differently.
Key takeaways
- Control variables: same task, same evaluator, same rubric, randomized order
- Minimum 5 matched pairs—single comparisons are anecdotes, not evidence
- Always capture workload (TLX) alongside time—speed without sustainability is false savings
See it in action
Drop this into a measured run—demo it, then tie it back to your methodology.
See also
Pair this play with related resources, methodology notes, or quickstarts.
Next Steps
Ready to measure your AI impact? Start with a quick demo to see your Overestimation Δ and cognitive load metrics.