Fair Trial: How to Compare Manual vs. AI Without Fooling Yourself
Hold inputs and constraints constant; log Δ-time, Δ-quality, and micro-TLX.
Without controls, every AI demo is theater. Fair Trial methodology turns anecdotes into evidence.
Executive TL;DR
- Uncontrolled AI comparisons overestimate gains by 25-40%; the Fair Trial protocol corrects this
- Log three metrics per trial: Δ-time, Δ-quality (reviewer score), and micro-TLX
- Counter-balanced order and locked rubrics prevent the learning effects that inflate results
Do this week: Run one Fair Trial comparing manual vs. AI on a real task; share the Δ tiles in your next stand-up
Why most AI comparisons lie
You try a task with AI. It feels faster. You conclude AI helps.
This is not evidence. This is confirmation bias with extra steps.
Without controls, your "comparison" has:
- Order effects: You did manual first, learned the task, then AI felt easier
- Rubric drift: You unconsciously lower standards for AI output because it arrived fast
- Selection bias: You picked a task that suits AI and generalized
Fair Trial methodology fixes these problems. It's how you get numbers that survive scrutiny.
The theater trap
Most AI demos show best-case scenarios with cherry-picked tasks. Fair Trial shows average performance on representative work.
Setup: what you need before starting
1. Define the task boundary
Write down exactly what "done" means. Include:
- Deliverable format (document, code, summary)
- Quality criteria (accuracy, completeness, style)
- Scope exclusions (what you're NOT measuring)
Vague boundaries = vague results. Lock it down.
2. Build the rubric first
Create a 3-5 point scoring rubric before you run any trials. Each point should be:
- Observable (you can see it in the output)
- Binary or tiered (meets/partially meets/doesn't meet)
- Independent (scoring one criterion doesn't affect others)
Sample rubric structure
- Accuracy: All facts correct and verifiable (0/1/2)
- Completeness: Covers all required elements (0/1/2)
- Clarity: Reader can act without follow-up questions (0/1/2)
- Efficiency: No unnecessary content or bloat (0/1)
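Locking the rubric is easier when it lives in code rather than in your head. A minimal sketch (the criteria and tiers are the sample ones above; the function name is illustrative):

```python
# Locked rubric: each criterion maps to its allowed score tiers.
# Freezing this before any trial runs prevents mid-trial rubric drift.
RUBRIC = {
    "accuracy": (0, 1, 2),
    "completeness": (0, 1, 2),
    "clarity": (0, 1, 2),
    "efficiency": (0, 1),
}

def score_output(scores: dict) -> int:
    """Validate a reviewer's scores against the locked rubric; return the total."""
    if set(scores) != set(RUBRIC):
        raise ValueError("score every criterion in the rubric, and only those")
    for criterion, value in scores.items():
        if value not in RUBRIC[criterion]:
            raise ValueError(f"{criterion}: {value} is not an allowed tier")
    return sum(scores.values())
```

A reviewer scoring `{"accuracy": 2, "completeness": 1, "clarity": 2, "efficiency": 1}` gets 6 of a possible 7; anything outside the locked tiers is rejected instead of silently absorbed.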
3. Select representative tasks
Don't pick your best AI task. Pick three tasks that represent your actual workload:
- One routine task (you do it weekly)
- One complex task (requires judgment)
- One novel task (first time or rare)
AI often excels at routine, struggles with novel. You need to know both.
Running paired trials
Counter-balanced order
If you always do manual first, learning effects inflate AI performance. Counter-balance:
Trial A: Manual → AI
Trial B: AI → Manual
Run both. Average the results. This controls for order effects.
Capture three metrics per trial
- Δ-time: Time to completion (manual minus AI)
- Δ-quality: Reviewer score (manual minus AI, using your locked rubric)
- micro-TLX: Mental demand + frustration immediately after each attempt
Don't skip TLX. A 30% time savings with 50% higher cognitive load isn't a win—it's a burnout vector.
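Logging the three metrics per attempt and averaging across both orders is a few lines of code. A sketch, assuming one record per paired trial (field names are illustrative, not part of the method):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Trial:
    order: str               # "manual_first" or "ai_first"
    delta_time_min: float    # manual minutes minus AI minutes (positive = AI faster)
    delta_quality: float     # manual rubric score minus AI rubric score
    micro_tlx: float         # mental demand + frustration, 0-100, logged right after the attempt

def counterbalanced_summary(trials: list[Trial]) -> dict:
    """Average each metric across both orders so learning effects cancel out."""
    if {t.order for t in trials} != {"manual_first", "ai_first"}:
        raise ValueError("need at least one trial in each order")
    return {
        "delta_time_min": mean(t.delta_time_min for t in trials),
        "delta_quality": mean(t.delta_quality for t in trials),
        "micro_tlx": mean(t.micro_tlx for t in trials),
    }
```

The guard on `order` enforces the counter-balancing rule: a summary computed from manual-first trials alone would carry exactly the order effect the protocol exists to remove.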
Use a blind reviewer when possible
If you score your own output, you'll be biased. Have a colleague score without knowing which output was AI-assisted.
Blind review adds 20 minutes to your trial. It also adds 10x credibility to your results.
"Our first Fair Trial showed AI saved 15 minutes but cost 25 minutes of review time. Net loss. We would have missed that without tracking reviewer minutes."
Controlling for context
Same inputs, same constraints
The manual and AI attempts must start from identical positions:
- Same reference materials
- Same time pressure (or explicitly unbounded)
- Same interruption conditions
If you gave AI better prompts than you gave yourself, you're measuring prompt quality, not AI value.
Log your prompts
Write down every prompt you use. Prompt iteration is part of the AI workflow—don't hide it.
Total AI time = generation time + prompt refinement time + review time
Many teams discover their "instant" AI outputs take longer than manual when you count prompt iteration.
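Pricing prompt iteration honestly is just a running log plus the sum above. A sketch (class and method names are my own, and the example minutes are invented):

```python
class PromptLog:
    """Record each prompt and its wall-clock cost so 'instant' outputs are priced honestly."""
    def __init__(self):
        self.entries = []  # (prompt_text, minutes) pairs

    def add(self, prompt: str, minutes: float) -> None:
        self.entries.append((prompt, minutes))

    def refinement_minutes(self) -> float:
        return sum(m for _, m in self.entries)

def total_ai_minutes(generation: float, log: PromptLog, review: float) -> float:
    """Total AI time = generation time + prompt refinement time + review time."""
    return generation + log.refinement_minutes() + review
```

With three logged prompts totaling 14 minutes, a 2-minute generation plus 12 minutes of review costs 28 minutes end to end, which can lose to a 25-minute manual baseline despite feeling instant.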
Interpreting Δ and TLX together
The quadrant model
| | Low TLX (under 40) | High TLX (over 60) |
|---|---|---|
| Δ-time positive (AI faster) | Sweet spot: real gains, sustainable | Warning: gains won't last |
| Δ-time negative (AI slower) | Investigate: skill gap or task mismatch | Stop: negative ROI |
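The quadrant table reads as a small decision rule. A sketch (the 40/60 thresholds come from the table; treating the 40-60 band as "borderline" is my assumption, since the table doesn't cover it):

```python
def quadrant(delta_time_min: float, micro_tlx: float) -> str:
    """Map counterbalanced Δ-time and micro-TLX onto the quadrant model."""
    if 40 <= micro_tlx <= 60:
        # The table only defines under-40 and over-60; in between, collect more trials.
        return "borderline: gather more trials before deciding"
    ai_faster = delta_time_min > 0
    low_tlx = micro_tlx < 40
    if ai_faster and low_tlx:
        return "sweet spot: real gains, sustainable"
    if ai_faster:
        return "warning: gains won't last"
    if low_tlx:
        return "investigate: skill gap or task mismatch"
    return "stop: negative ROI"
```

For example, `quadrant(9.0, 35.0)` lands in the sweet spot, while the same time savings at TLX 70 triggers the burnout warning.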
What the numbers mean
- Δ-time +30%, Δ-quality 0, TLX under 40: Clear win. Document it and share the rubric.
- Δ-time +30%, Δ-quality -15%, TLX over 60: False win. You're trading quality and wellbeing for speed. Re-evaluate.
- Δ-time -10%, Δ-quality +20%, TLX under 40: Quality win. AI is helping you do better work, not faster work. That's often more valuable.
Common pitfalls
Pitfall 1: Moving the rubric
"But the AI output was good in a different way!"
No. You defined what good means before the trial. If AI is good in ways you didn't anticipate, that's interesting for the next trial. It doesn't change this one.
Pitfall 2: Ignoring reviewer time
AI output often needs more review. If your "30% time savings" requires 45% more reviewer time, you've shifted cost, not reduced it.
Always log: creator time + reviewer time = total cycle time.
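The cycle-time accounting in code, with invented numbers showing how a creator-side saving can be wiped out by review:

```python
def total_cycle_minutes(creator: float, reviewer: float) -> float:
    """Creator time + reviewer time = total cycle time."""
    return creator + reviewer

# 30% creator-time saving (40 -> 28 min), but review time more than doubles:
manual_cycle = total_cycle_minutes(creator=40, reviewer=10)  # 50 minutes
ai_cycle = total_cycle_minutes(creator=28, reviewer=25)      # 53 minutes: net loss
```

The headline "30% faster" is true of creator time alone; the cycle-time sum is what the Fair Trial actually reports.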
Pitfall 3: Generalizing from one task
One Fair Trial proves AI helps with that task. It proves nothing about your other work.
Run Fair Trials on 3-5 representative tasks before making claims about AI productivity in general.
- ✓ Define task boundaries before starting
- ✓ Build and lock the scoring rubric
- ✓ Run counter-balanced trials (Manual→AI and AI→Manual)
- ✓ Log Δ-time, Δ-quality, and micro-TLX for every attempt
- ✓ Include reviewer time in your calculations
- ✓ Have a colleague do blind review when possible
Apply now: Productivity Packs
The PM and SWE Productivity Packs implement Fair Trial methodology for role-specific tasks:
- PM Pack: Discovery briefs, experiment canvas, exec narratives
- SWE Pack: Code review, remediation plans, QA handoffs
Each pack guides you through counter-balanced trials with built-in TLX capture.
Evidence note
Fair Trial methodology adapts A/B testing principles for knowledge work:
- 25-40% overestimation in uncontrolled comparisons: Meta-analysis of 8 internal cohorts
- Counter-balancing effectiveness: Standard experimental design literature
- TLX correlation with sustainability: NASA-TLX validation studies
Evidence level: B (mixed RCTs, internal telemetry)
Next Steps
Ready to measure your AI impact? Start with a quick demo to see your Overestimation Δ and cognitive load metrics.