The Human Judgment Arena
Practice evaluating AI outputs, sharpen your critical thinking skills, and see how you stack up against other humans in detecting AI mistakes.
What is the Arena?
The Human Judgment Arena is a competitive environment where you practice spotting mistakes in AI-generated content. Unlike passive learning, the Arena challenges you with real scenarios across multiple disciplines, tracks your performance with a rating system, and lets you see how your judgment compares to other participants.
Whether you're a product manager evaluating AI summaries, an engineer reviewing AI-generated code, or simply curious about AI capabilities, the Arena helps you develop practical skills for working effectively with AI.
Season Zero: Human–AI Deception Benchmark
The inaugural Arena season measuring how effectively humans can detect AI errors, hallucinations, and unsafe outputs across multiple judgment domains. Season Zero establishes baseline human performance against state-of-the-art language models.
Season Goals
- Measure human ability to detect AI-generated hallucinations and errors
- Benchmark judgment performance across logic, safety, and authenticity domains
- Establish baseline deception rates for frontier AI models
- Build the world's first public leaderboard for human AI judgment skills
What makes Season Zero special: This is the inaugural season—the first public benchmark of human AI judgment across multiple domains. Your participation helps establish baseline performance data for AI safety research.
Arena Disciplines
Each discipline focuses on a different aspect of AI evaluation. Master all eight to become a well-rounded AI evaluator for Season Zero.
Hallucination Hunter
Evaluate AI-generated answers against source material. Can you detect when the AI fabricates facts, misquotes sources, or adds unsupported claims?
Logic Detective
Analyze step-by-step AI reasoning to catch logical errors. Can you pinpoint where a chain of reasoning first goes wrong?
Risk Triage
Evaluate the risk level of AI-generated content in various contexts. Should this output be deployed, reviewed, or blocked?
Imitation Game
Compare two pieces of text and determine which is higher quality or more human-like. Test your ability to detect AI-generated content.
Forecast & Calibration
Make probabilistic predictions and assess how well-calibrated your confidence is. Learn to distinguish what you know from what you think you know when working with AI model forecasts (a scoring sketch follows after the discipline descriptions).
Red Team Authoring
Design prompts and scenarios that stress-test AI models and reveal failure modes. Contribute challenges that help make AI systems more robust and help other humans sharpen their judgment skills.
Fact Checker
Deep-dive into citation verification. Check if AI-provided sources actually exist, are accurately quoted, and support the claims being made.
Bias Auditor
Develop sensitivity to implicit biases in AI responses. Detect stereotyping, unfair assumptions, or skewed perspectives that might not be immediately obvious.
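For the Forecast & Calibration discipline, calibration is typically measured with a proper scoring rule. The Arena does not publish its exact formula, so the sketch below uses the Brier score purely as an illustrative assumption of how a calibration score could work:

```python
# Minimal calibration-scoring sketch using the Brier score (an assumption;
# the Arena's actual Forecast & Calibration scoring is not documented here).

def brier_score(forecasts, outcomes):
    """Mean squared gap between stated probabilities and actual outcomes.

    forecasts: probabilities in [0, 1] that each event occurs
    outcomes:  1 if the event occurred, 0 if it did not
    Lower is better; always answering 50% scores 0.25.
    """
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Confident and right beats hedged and right:
print(brier_score([0.9, 0.8, 0.3], [1, 1, 0]))  # ~0.047
print(brier_score([0.5, 0.5, 0.5], [1, 1, 0]))  # 0.25
```

A score like this penalizes both overconfidence and underconfidence, which is exactly the "what you know vs. what you think you know" skill the discipline targets.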
Example Scenarios
Here's what you'll actually do, illustrated with examples from four of the disciplines:
Hallucination Hunter Example
AI says: "The James Webb Space Telescope detected over 500 exoplanets in its first year, according to Smith et al. (2023) in Nature Astronomy."
Source material: Mentions JWST's launch and mirror design, but says nothing about exoplanet discoveries or any such citation.
Your job: Spot that the citation and exoplanet count are fabricated—a classic AI hallucination mixing real facts with invented details.
Logic Detective Example
Reasoning chain:
1. All successful startups need product-market fit
2. ProductX has product-market fit
3. Therefore, ProductX is a successful startup
Your job: Identify that step 3 commits the fallacy of affirming the consequent (having PMF doesn't guarantee success—you also need execution, timing, etc.).
Risk Triage Example
Scenario: AI drafts a customer-facing email about a service outage.
AI output: "We apologize for the inconvenience. The issue was caused by a critical database failure in our legacy systems..."
Your job: Decide if this can ship as-is, needs edits (too technical?), is only for internal brainstorming, or should be blocked entirely.
Imitation Game Example
Task: Read two responses to a product question. One is written by a human PM, one by AI.
Option A:
"We should prioritize user feedback and iterate quickly..."
Option B:
"Based on data from Q2, we can leverage synergies..."
Your job: Identify subtle cues (word choice, structure, depth of insight) to determine which is human-written.
The Human Leaderboard
The Arena isn't about humans vs. AI—it's about humans competing with other humans to develop better AI judgment. Your rating reflects how well you spot AI mistakes compared to other participants.
Interestingly, this creates a dual leaderboard system:
- Human leaderboard: Players are ranked by their ability to evaluate AI outputs correctly
- Model evaluation: AI models are implicitly ranked by how often they can fool strong human evaluators
This means that even top-tier players (Diamond, Platinum) sometimes get deceived by particularly well-crafted AI mistakes. When that happens, it reveals something important: that specific type of error is genuinely deceptive, even to skilled evaluators. This data helps researchers understand which AI behaviors are most problematic.
Example: If 80% of Gold+ players mark a hallucination as "fully supported," that's a strong signal that the AI's error was particularly convincing—useful feedback for both human training and AI development.
Human–AI Deception Benchmark
Learn how the Arena contributes to a living benchmark measuring human ability to detect AI errors. Your participation helps advance AI safety research.
Rating System
How Ratings Work
Your rating reflects your skill at evaluating AI outputs. Everyone starts at 1500 and gains or loses points based on performance in ranked sessions.
- Correct answers on harder questions earn more points
- Incorrect answers on easier questions lose more points
- Rating Deviation (RD) measures uncertainty—new players have higher RD, which decreases as you play more
- Ratings with high RD are marked as provisional until you've played enough games
Your Judgment Rating is calculated using a proven algorithm similar to those used in chess and competitive gaming. It's a statistical estimate of your skill, not an absolute measure.
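For intuition, here is a minimal Elo-style sketch of how one ranked answer might move a rating. The real system also tracks Rating Deviation (a Glicko-style feature), which this sketch deliberately omits; the K-factor and question difficulty ratings are illustrative assumptions, not the Arena's actual parameters.

```python
# Simplified Elo-style rating update against a question's difficulty rating.
# Illustrative only: the Arena's real algorithm also models Rating Deviation.

K_FACTOR = 32  # assumed update size; the real system scales this by uncertainty

def expected_score(player_rating, question_difficulty):
    """Predicted probability that the player answers this question correctly."""
    return 1 / (1 + 10 ** ((question_difficulty - player_rating) / 400))

def update_rating(player_rating, question_difficulty, correct):
    """Return the new rating after one question (correct is True or False)."""
    actual = 1.0 if correct else 0.0
    return player_rating + K_FACTOR * (actual - expected_score(player_rating, question_difficulty))

# A 1500-rated player gains more for a hard (1700) question than an easy (1300)
# one, and loses more for missing the easy one -- matching the rules above.
print(round(update_rating(1500, 1700, True)))   # ~1524
print(round(update_rating(1500, 1300, True)))   # ~1508
print(round(update_rating(1500, 1300, False)))  # ~1476
```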
Tier System
Your rating places you in a tier. Climb through the ranks as you improve!
Seasons
The Arena operates in Seasons—time-limited periods (typically 2-3 months) with curated question pools and research themes.
At the end of each season:
- Top performers are recognized on the season leaderboard
- Research insights are published from aggregated play data
- Ratings soft-reset for the next season, decaying toward 1500 (see the sketch below)
- New question pools and challenges are introduced
Seasons help keep the Arena fresh and allow for focused research on specific AI behaviors or domains.
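The soft reset is only described as "decay toward 1500"; a common way to implement that is to keep some fraction of each rating's distance from the baseline. The carryover factor below is a pure assumption for illustration:

```python
# Hypothetical soft-reset rule: pull every rating partway back toward 1500.
BASELINE = 1500
CARRYOVER = 0.5  # assumed fraction of last season's progress that is kept

def soft_reset(rating):
    """Decay a rating toward the 1500 baseline at season rollover."""
    return BASELINE + CARRYOVER * (rating - BASELINE)

print(soft_reset(1900))  # 1700.0 -- top players keep an edge but must re-prove it
print(soft_reset(1350))  # 1425.0 -- low ratings drift back up toward the baseline
```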
Session Types
Warm-up
- 6-8 questions per session
- Easier difficulty mix
- Does NOT affect your rating
- Great for learning and practice
Ranked
- 10-12 questions per session
- Balanced difficulty mix
- DOES affect your rating
- Compete for leaderboard position
Arena + Learn
The Arena works best alongside our Learn assessments. Here's how they complement each other:
Baseline Assessments
Take structured assessments to understand your starting point with AI literacy, judgment calibration, and domain knowledge. Results are a fixed snapshot and don't change over time.
Arena Competition
Practice under pressure with varied scenarios. Your rating evolves as you play, reflecting your improving (or declining) skills over time.
After Arena sessions, you'll get personalized recommendations for which skills to practice based on your performance patterns.
Why This Matters
Different roles benefit from Arena practice in distinct ways. Here's how the Arena helps you build practical AI judgment skills:
For Product & Innovation Leaders
- Evaluate AI features faster: Practice in spotting hallucinations and logic errors builds intuition for when to trust AI-generated product specs, PRDs, and user stories.
- Reduce review cycles: Risk Triage training helps you quickly decide which AI outputs need human editing and which can ship as-is, streamlining workflows.
- Make better vendor decisions: Understanding AI failure modes helps you ask the right questions when evaluating AI tools and vendors.
- Benchmark your instincts: See where your judgment aligns with (or diverges from) other product leaders across the community.
For Engineering Teams
- Improve code review skills: Logic Detective scenarios mirror real code review tasks—finding the first flaw in a reasoning chain trains you to spot bugs in AI-generated code.
- Calibrate trust in AI tools: Practice helps you develop accurate mental models of when GitHub Copilot, ChatGPT, or other tools are likely to be correct vs. wrong.
- Defend against subtle errors: Hallucination Hunter trains you to verify AI-cited documentation and API examples before shipping code.
- Build verification habits: Repeated practice creates automatic checking routines that carry over to your day-to-day AI-assisted coding.
For Business Leadership
- Spot high-stakes AI mistakes: Develop instincts for when AI-generated reports, summaries, or analyses contain misleading conclusions or fabricated data.
- Understand team risk profiles: Use Arena data to identify skill gaps in your organization's AI literacy. Who needs more training? Which blindspots are common?
- Set AI governance standards: Practice with Risk Triage scenarios informs policies about when AI outputs require human approval before deployment.
For Teams & Organizations
When teams practice together in the Arena, aggregated performance data reveals organizational blindspots—systematic biases and knowledge gaps that affect AI adoption success.
Example Blindspot:
If 70% of your product team consistently overestimates AI accuracy on Risk Triage scenarios, that signals a cultural over-reliance on AI outputs without verification—a risk factor for product quality issues.
Team Benchmarking:
Compare your team's aggregate performance to industry peers. Are you stronger at hallucination detection but weaker at logic verification? Use data to target training investments.
Important Note
Arena ratings and leaderboard positions are for educational and research purposes only. They are not professional certifications, clinical assessments, or legally binding measures of competence. Scores are based on limited data and should be interpreted as approximate indicators of relative skill, not absolute measures.
Ready to Test Your Judgment?
Jump into the Arena and see how well you can spot AI mistakes.
Enter the Arena