
Evaluator Rubrics That Don't Drift

Design 5-point rubrics that stay consistent across reviewers and sessions. Includes copy-paste templates for tight vs. loose evaluation criteria.

The drift problem

Rubric drift happens when you unconsciously adjust your criteria after seeing output. "That's pretty good for AI" is drift. "The rubric said X, the output did Y" is evaluation. Lock your rubric before you generate anything.

The 5-Point Scale Framework

Use this structure for any evaluation rubric:

| Score | Label | Meaning |
|-------|-------|---------|
| 5 | Exemplary | Exceeds all criteria; could be used as training example |
| 4 | Proficient | Meets all criteria; minor polish needed |
| 3 | Adequate | Meets most criteria; requires revision |
| 2 | Developing | Misses key criteria; significant rework needed |
| 1 | Inadequate | Does not meet criteria; restart required |
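For teams that score programmatically, the scale above maps naturally to a lookup table. A minimal sketch (the names `SCALE` and `label` are illustrative, not part of any library):

```python
# The 5-point scale as a lookup table. Names here are hypothetical.
SCALE = {
    5: ("Exemplary", "Exceeds all criteria; could be used as training example"),
    4: ("Proficient", "Meets all criteria; minor polish needed"),
    3: ("Adequate", "Meets most criteria; requires revision"),
    2: ("Developing", "Misses key criteria; significant rework needed"),
    1: ("Inadequate", "Does not meet criteria; restart required"),
}

def label(score: int) -> str:
    """Return the label for a score, rejecting out-of-range values."""
    if score not in SCALE:
        raise ValueError(f"Score must be 1-5, got {score}")
    return SCALE[score][0]
```

Validating the range up front keeps a typo (a 0 or a 6) from silently skewing averages later.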

Tight vs. Loose Descriptors

Tight Descriptors (for compliance, accuracy, process)

Use tight descriptors when correctness is binary and stakes are high.

Example: Factual Accuracy Rubric

| Score | Descriptor |
|-------|------------|
| 5 | All facts verifiable; zero errors; sources cited |
| 4 | All facts correct; 1 minor omission; sources available |
| 3 | 1-2 factual errors; correctable without restructure |
| 2 | 3+ errors; requires fact-check pass |
| 1 | Core claims unsupported or false |

Loose Descriptors (for creative, strategic, exploratory)

Use loose descriptors when judgment matters more than checklist compliance.

Example: Strategic Insight Rubric

| Score | Descriptor |
|-------|------------|
| 5 | Novel framing; actionable; changes how we think about the problem |
| 4 | Useful insight; clear action path; builds on existing knowledge |
| 3 | Reasonable analysis; expected conclusions; no surprises |
| 2 | Surface-level; restates inputs; lacks depth |
| 1 | Off-topic or irrelevant to the question |

Copy-Paste Rubric Template

```
RUBRIC: [Task Name]
Locked: [Date] | Reviewer: [Name]

CRITERION 1: [Name]
5 - [Exemplary behavior]
4 - [Proficient behavior]
3 - [Adequate behavior]
2 - [Developing behavior]
1 - [Inadequate behavior]

CRITERION 2: [Name]
5 - [Exemplary behavior]
...

SCORING RULES:
- Score each criterion independently
- Average for final score
- Flag any criterion at 2 or below for discussion
```

Lock-In Checklist

  • Rubric locked before any output is generated
  • Each criterion has 5 distinct, observable levels
  • Tight vs. loose chosen based on task type
  • Scoring rules documented (average, weighted, minimum threshold)
  • Second reviewer calibrated on same rubric
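The scoring rules above can be sketched as a short function: score each criterion independently, average for the final score, and flag anything at 2 or below. All names here are illustrative, not a published API:

```python
# Sketch of the template's scoring rules: independent criterion scores,
# an averaged final score, and flags for any criterion at 2 or below.
# Function and field names are hypothetical.

def score_output(criterion_scores: dict[str, int]) -> dict:
    if not criterion_scores:
        raise ValueError("At least one criterion is required")
    for name, s in criterion_scores.items():
        if s not in range(1, 6):
            raise ValueError(f"{name}: score must be 1-5, got {s}")
    final = sum(criterion_scores.values()) / len(criterion_scores)
    flagged = sorted(n for n, s in criterion_scores.items() if s <= 2)
    return {"final": round(final, 2), "flag_for_discussion": flagged}

result = score_output({"accuracy": 4, "clarity": 2, "depth": 3})
# result["final"] == 3.0; "clarity" is flagged for discussion
```

If some criteria matter more than others, swap the plain average for a weighted one, but document the weights in the rubric header so they are locked too.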

Preventing Drift

  1. Write rubric before prompting — never adjust after seeing output
  2. Calibrate with a partner — score same output independently, then compare
  3. Version your rubrics — track changes with dates and rationale
  4. Review rubric monthly — update deliberately, not reactively
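Calibration (step 2) is easier to enforce if you quantify it. A minimal sketch of raw percent agreement between two reviewers; Cohen's kappa is the more robust choice when scores cluster on a few values, since it corrects for chance agreement:

```python
# Raw percent agreement between two reviewers scoring the same outputs
# on the same locked rubric. A simple calibration metric; consider
# Cohen's kappa when score distributions are imbalanced.

def percent_agreement(scores_a: list[int], scores_b: list[int]) -> float:
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("Score lists must be non-empty and equal length")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

# Two reviewers score five outputs independently, then compare:
agreement = percent_agreement([5, 3, 4, 2, 4], [5, 3, 3, 2, 4])
# agreement == 0.8  (4 of 5 scores match)
```

Tracking this number per session makes drift visible: if agreement slides over time, the rubric's levels are no longer distinct enough.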
"Our inter-rater agreement jumped from 62% to 89% when we locked rubrics before the review session started."
QA Lead

Apply this now

Practice prompt

Take a rubric you use regularly and rewrite it using the tight/loose framework.

Try this now

Score one AI output with your current rubric, then score it again after locking criteria. Compare.

Common pitfall

Adjusting the rubric after seeing impressive output—this is how drift starts.

Key takeaways

  • Lock your rubric before you see any output—post-hoc adjustment is drift
  • Use tight descriptors for accuracy, loose for creativity
  • Calibrate with a second reviewer to catch personal bias

