
Evaluator Rubrics That Don't Drift

Design 5-point rubrics that stay consistent across reviewers and sessions. Includes copy-paste templates for tight vs. loose evaluation criteria.

The drift problem

Rubric drift happens when you unconsciously adjust your criteria after seeing output. "That's pretty good for AI" is drift. "The rubric said X, the output did Y" is evaluation. Lock your rubric before you generate anything.

The 5-Point Scale Framework

Use this structure for any evaluation rubric:

| Score | Label | Meaning |
|-------|-------|---------|
| 5 | Exemplary | Exceeds all criteria; could be used as training example |
| 4 | Proficient | Meets all criteria; minor polish needed |
| 3 | Adequate | Meets most criteria; requires revision |
| 2 | Developing | Misses key criteria; significant rework needed |
| 1 | Inadequate | Does not meet criteria; restart required |
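For teams that score programmatically, the scale above maps naturally to a lookup table. A minimal sketch (the names `SCALE` and `label` are illustrative, not part of any library):

```python
# The 5-point scale as a lookup table. Names here are hypothetical.
SCALE = {
    5: ("Exemplary", "Exceeds all criteria; could be used as training example"),
    4: ("Proficient", "Meets all criteria; minor polish needed"),
    3: ("Adequate", "Meets most criteria; requires revision"),
    2: ("Developing", "Misses key criteria; significant rework needed"),
    1: ("Inadequate", "Does not meet criteria; restart required"),
}

def label(score: int) -> str:
    """Return the label for a score, rejecting out-of-range values."""
    if score not in SCALE:
        raise ValueError(f"Score must be 1-5, got {score}")
    return SCALE[score][0]
```

Validating the range up front keeps a typo (a 0 or a 6) from silently skewing averages later.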

Tight vs. Loose Descriptors

Tight Descriptors (for compliance, accuracy, process)

Use tight descriptors when correctness is binary and stakes are high.

Example: Factual Accuracy Rubric

| Score | Descriptor |
|-------|------------|
| 5 | All facts verifiable; zero errors; sources cited |
| 4 | All facts correct; 1 minor omission; sources available |
| 3 | 1-2 factual errors; correctable without restructure |
| 2 | 3+ errors; requires fact-check pass |
| 1 | Core claims unsupported or false |

Loose Descriptors (for creative, strategic, exploratory)

Use loose descriptors when judgment matters more than checklist compliance.

Example: Strategic Insight Rubric

| Score | Descriptor |
|-------|------------|
| 5 | Novel framing; actionable; changes how we think about the problem |
| 4 | Useful insight; clear action path; builds on existing knowledge |
| 3 | Reasonable analysis; expected conclusions; no surprises |
| 2 | Surface-level; restates inputs; lacks depth |
| 1 | Off-topic or irrelevant to the question |

Copy-Paste Rubric Template

```
RUBRIC: [Task Name]
Locked: [Date] | Reviewer: [Name]

CRITERION 1: [Name]
5 - [Exemplary behavior]
4 - [Proficient behavior]
3 - [Adequate behavior]
2 - [Developing behavior]
1 - [Inadequate behavior]

CRITERION 2: [Name]
5 - [Exemplary behavior]
...

SCORING RULES:
- Score each criterion independently
- Average for final score
- Flag any criterion at 2 or below for discussion
```

Lock-In Checklist

  • Rubric locked before any output is generated
  • Each criterion has 5 distinct, observable levels
  • Tight vs. loose chosen based on task type
  • Scoring rules documented (average, weighted, minimum threshold)
  • Second reviewer calibrated on same rubric
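The scoring rules above can be sketched as a short function: score each criterion independently, average for the final score, and flag anything at 2 or below. All names here are illustrative, not a published API:

```python
# Sketch of the template's scoring rules: independent criterion scores,
# an averaged final score, and flags for any criterion at 2 or below.
# Function and field names are hypothetical.

def score_output(criterion_scores: dict[str, int]) -> dict:
    if not criterion_scores:
        raise ValueError("At least one criterion is required")
    for name, s in criterion_scores.items():
        if s not in range(1, 6):
            raise ValueError(f"{name}: score must be 1-5, got {s}")
    final = sum(criterion_scores.values()) / len(criterion_scores)
    flagged = sorted(n for n, s in criterion_scores.items() if s <= 2)
    return {"final": round(final, 2), "flag_for_discussion": flagged}

result = score_output({"accuracy": 4, "clarity": 2, "depth": 3})
# result["final"] == 3.0; "clarity" is flagged for discussion
```

If some criteria matter more than others, swap the plain average for a weighted one, but document the weights in the rubric header so they are locked too.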

Preventing Drift

  1. Write rubric before prompting — never adjust after seeing output
  2. Calibrate with a partner — score same output independently, then compare
  3. Version your rubrics — track changes with dates and rationale
  4. Review rubric monthly — update deliberately, not reactively
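Calibration (step 2) is easier to enforce if you quantify it. A minimal sketch of raw percent agreement between two reviewers; Cohen's kappa is the more robust choice when scores cluster on a few values, since it corrects for chance agreement:

```python
# Raw percent agreement between two reviewers scoring the same outputs
# on the same locked rubric. A simple calibration metric; consider
# Cohen's kappa when score distributions are imbalanced.

def percent_agreement(scores_a: list[int], scores_b: list[int]) -> float:
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("Score lists must be non-empty and equal length")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

# Two reviewers score five outputs independently, then compare:
agreement = percent_agreement([5, 3, 4, 2, 4], [5, 3, 3, 2, 4])
# agreement == 0.8  (4 of 5 scores match)
```

Tracking this number per session makes drift visible: if agreement slides over time, the rubric's levels are no longer distinct enough.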
"Our inter-rater agreement jumped from 62% to 89% when we locked rubrics before the review session started."
QA Lead

Apply this now

Practice prompt

Take a rubric you use regularly and rewrite it using the tight/loose framework.

Try this now

Score one AI output with your current rubric, then score it again after locking criteria. Compare.

Common pitfall

Adjusting the rubric after seeing impressive output—this is how drift starts.

Key takeaways

  • Lock your rubric before you see any output—post-hoc adjustment is drift
  • Use tight descriptors for accuracy, loose for creativity
  • Calibrate with a second reviewer to catch personal bias

