Fair Trial: How to Compare Manual vs. AI Without Fooling Yourself
Hold inputs and constraints constant; log Δ-time, Δ-quality, and micro-TLX.
Without controls, every AI demo is theater. Fair Trial methodology turns anecdotes into evidence.
Executive TL;DR
- Uncontrolled AI comparisons overestimate gains by 25-40%; the Fair Trial protocol corrects this
- Log three metrics per trial: Δ-time, Δ-quality (reviewer score), and micro-TLX
- Counter-balanced order and locked rubrics prevent the learning effects that inflate results
Do this week: Run one Fair Trial comparing manual vs. AI on a real task; share the Δ tiles in your next stand-up
Why most AI comparisons lie
You try a task with AI. It feels faster. You conclude AI helps.
This is not evidence. This is confirmation bias with extra steps.
Without controls, your "comparison" has:
- Order effects: You did manual first, learned the task, then AI felt easier
- Rubric drift: You unconsciously lower standards for AI output because it arrived fast
- Selection bias: You picked a task that suits AI and generalized
Fair Trial methodology fixes these problems. It's how you get numbers that survive scrutiny.
The theater trap
Most AI demos show best-case scenarios with cherry-picked tasks. Fair Trial shows average performance on representative work.
Setup: what you need before starting
1. Define the task boundary
Write down exactly what "done" means. Include:
- Deliverable format (document, code, summary)
- Quality criteria (accuracy, completeness, style)
- Scope exclusions (what you're NOT measuring)
Vague boundaries = vague results. Lock it down.
2. Build the rubric first
Create a 3-5 point scoring rubric before you run any trials. Each point should be:
- Observable (you can see it in the output)
- Binary or tiered (meets/partially meets/doesn't meet)
- Independent (scoring one criterion doesn't affect others)
Sample rubric structure
- Accuracy: All facts correct and verifiable (0/1/2)
- Completeness: Covers all required elements (0/1/2)
- Clarity: Reader can act without follow-up questions (0/1/2)
- Efficiency: No unnecessary content or bloat (0/1)
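Locking the rubric is easier when it lives in code rather than in your head. A minimal sketch (the criteria and tiers are the sample ones above; the function name is illustrative):

```python
# Locked rubric: each criterion maps to its allowed score tiers.
# Freezing this before any trial runs prevents mid-trial rubric drift.
RUBRIC = {
    "accuracy": (0, 1, 2),
    "completeness": (0, 1, 2),
    "clarity": (0, 1, 2),
    "efficiency": (0, 1),
}

def score_output(scores: dict) -> int:
    """Validate a reviewer's scores against the locked rubric; return the total."""
    if set(scores) != set(RUBRIC):
        raise ValueError("score every criterion in the rubric, and only those")
    for criterion, value in scores.items():
        if value not in RUBRIC[criterion]:
            raise ValueError(f"{criterion}: {value} is not an allowed tier")
    return sum(scores.values())
```

A reviewer scoring `{"accuracy": 2, "completeness": 1, "clarity": 2, "efficiency": 1}` gets 6 of a possible 7; anything outside the locked tiers is rejected instead of silently absorbed.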
3. Select representative tasks
Don't pick your best AI task. Pick three tasks that represent your actual workload:
- One routine task (you do it weekly)
- One complex task (requires judgment)
- One novel task (first time or rare)
AI often excels at routine, struggles with novel. You need to know both.
Running paired trials
Counter-balanced order
If you always do manual first, learning effects inflate AI performance. Counter-balance:
Trial A: Manual → AI
Trial B: AI → Manual
Run both. Average the results. This controls for order effects.
Capture three metrics per trial
- Δ-time: Time to completion (manual minus AI)
- Δ-quality: Reviewer score (manual minus AI, using your locked rubric)
- micro-TLX: Mental demand + frustration immediately after each attempt
Don't skip TLX. A 30% time savings with 50% higher cognitive load isn't a win—it's a burnout vector.
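Logging the three metrics per attempt and averaging across both orders is a few lines of code. A sketch, assuming one record per paired trial (field names are illustrative, not part of the method):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Trial:
    order: str               # "manual_first" or "ai_first"
    delta_time_min: float    # manual minutes minus AI minutes (positive = AI faster)
    delta_quality: float     # manual rubric score minus AI rubric score
    micro_tlx: float         # mental demand + frustration, 0-100, logged right after the attempt

def counterbalanced_summary(trials: list[Trial]) -> dict:
    """Average each metric across both orders so learning effects cancel out."""
    if {t.order for t in trials} != {"manual_first", "ai_first"}:
        raise ValueError("need at least one trial in each order")
    return {
        "delta_time_min": mean(t.delta_time_min for t in trials),
        "delta_quality": mean(t.delta_quality for t in trials),
        "micro_tlx": mean(t.micro_tlx for t in trials),
    }
```

The guard on `order` enforces the counter-balancing rule: a summary computed from manual-first trials alone would carry exactly the order effect the protocol exists to remove.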
Use a blind reviewer when possible
If you score your own output, you'll be biased. Have a colleague score without knowing which output was AI-assisted.
Blind review adds 20 minutes to your trial. It also adds 10x credibility to your results.
"Our first Fair Trial showed AI saved 15 minutes but cost 25 minutes of review time. Net loss. We would have missed that without tracking reviewer minutes."
Controlling for context
Same inputs, same constraints
The manual and AI attempts must start from identical positions:
- Same reference materials
- Same time pressure (or explicitly unbounded)
- Same interruption conditions
If you gave AI better prompts than you gave yourself, you're measuring prompt quality, not AI value.
Log your prompts
Write down every prompt you use. Prompt iteration is part of the AI workflow—don't hide it.
Total AI time = generation time + prompt refinement time + review time
Many teams discover their "instant" AI outputs take longer than manual when you count prompt iteration.
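Pricing prompt iteration honestly is just a running log plus the sum above. A sketch (class and method names are my own, and the example minutes are invented):

```python
class PromptLog:
    """Record each prompt and its wall-clock cost so 'instant' outputs are priced honestly."""
    def __init__(self):
        self.entries = []  # (prompt_text, minutes) pairs

    def add(self, prompt: str, minutes: float) -> None:
        self.entries.append((prompt, minutes))

    def refinement_minutes(self) -> float:
        return sum(m for _, m in self.entries)

def total_ai_minutes(generation: float, log: PromptLog, review: float) -> float:
    """Total AI time = generation time + prompt refinement time + review time."""
    return generation + log.refinement_minutes() + review
```

With three logged prompts totaling 14 minutes, a 2-minute generation plus 12 minutes of review costs 28 minutes end to end, which can lose to a 25-minute manual baseline despite feeling instant.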
Interpreting Δ and TLX together
The quadrant model
| | Low TLX (under 40) | High TLX (over 60) |
|---|---|---|
| Δ-time positive (AI faster) | Sweet spot: real gains, sustainable | Warning: gains won't last |
| Δ-time negative (AI slower) | Investigate: skill gap or task mismatch | Stop: negative ROI |
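The quadrant table reads as a small decision rule. A sketch (the 40/60 thresholds come from the table; treating the 40-60 band as "borderline" is my assumption, since the table doesn't cover it):

```python
def quadrant(delta_time_min: float, micro_tlx: float) -> str:
    """Map counterbalanced Δ-time and micro-TLX onto the quadrant model."""
    if 40 <= micro_tlx <= 60:
        # The table only defines under-40 and over-60; in between, collect more trials.
        return "borderline: gather more trials before deciding"
    ai_faster = delta_time_min > 0
    low_tlx = micro_tlx < 40
    if ai_faster and low_tlx:
        return "sweet spot: real gains, sustainable"
    if ai_faster:
        return "warning: gains won't last"
    if low_tlx:
        return "investigate: skill gap or task mismatch"
    return "stop: negative ROI"
```

For example, `quadrant(9.0, 35.0)` lands in the sweet spot, while the same time savings at TLX 70 triggers the burnout warning.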
What the numbers mean
- Δ-time +30%, Δ-quality 0, TLX under 40: Clear win. Document it and share the rubric.
- Δ-time +30%, Δ-quality -15%, TLX over 60: False win. You're trading quality and wellbeing for speed. Re-evaluate.
- Δ-time -10%, Δ-quality +20%, TLX under 40: Quality win. AI is helping you do better work, not faster work. That's often more valuable.
Common pitfalls
Pitfall 1: Moving the rubric
"But the AI output was good in a different way!"
No. You defined what good means before the trial. If AI is good in ways you didn't anticipate, that's interesting for the next trial. It doesn't change this one.
Pitfall 2: Ignoring reviewer time
AI output often needs more review. If your "30% time savings" requires 45% more reviewer time, you've shifted cost, not reduced it.
Always log: creator time + reviewer time = total cycle time.
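The cycle-time accounting in code, with invented numbers showing how a creator-side saving can be wiped out by review:

```python
def total_cycle_minutes(creator: float, reviewer: float) -> float:
    """Creator time + reviewer time = total cycle time."""
    return creator + reviewer

# 30% creator-time saving (40 -> 28 min), but review time more than doubles:
manual_cycle = total_cycle_minutes(creator=40, reviewer=10)  # 50 minutes
ai_cycle = total_cycle_minutes(creator=28, reviewer=25)      # 53 minutes: net loss
```

The headline "30% faster" is true of creator time alone; the cycle-time sum is what the Fair Trial actually reports.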
Pitfall 3: Generalizing from one task
One Fair Trial proves AI helps with that task. It proves nothing about your other work.
Run Fair Trials on 3-5 representative tasks before making claims about AI productivity in general.
- ✓ Define task boundaries before starting
- ✓ Build and lock the scoring rubric
- ✓ Run counter-balanced trials (Manual→AI and AI→Manual)
- ✓ Log Δ-time, Δ-quality, and micro-TLX for every attempt
- ✓ Include reviewer time in your calculations
- ✓ Have a colleague do blind review when possible
Apply now: Productivity Packs
The PM and SWE Productivity Packs implement Fair Trial methodology for role-specific tasks:
- PM Pack: Discovery briefs, experiment canvas, exec narratives
- SWE Pack: Code review, remediation plans, QA handoffs
Each pack guides you through counter-balanced trials with built-in TLX capture.
Evidence note
Fair Trial methodology adapts A/B testing principles for knowledge work:
- 25-40% overestimation in uncontrolled comparisons: Meta-analysis of 8 internal cohorts
- Counter-balancing effectiveness: Standard experimental design literature
- TLX correlation with sustainability: NASA-TLX validation studies
Evidence level: B (mixed RCTs, internal telemetry)
Next Steps
Ready to measure your AI impact? Start with a quick demo to see your Overestimation Δ and cognitive load metrics.