The Human Judgment Arena
Practice evaluating AI outputs, sharpen your critical thinking skills, and see how you stack up against other humans in detecting AI mistakes.
What is the Arena?
The Human Judgment Arena is a competitive environment where you practice spotting mistakes in AI-generated content. Unlike passive learning, the Arena challenges you with real scenarios across multiple disciplines, tracks your performance with a rating system, and lets you see how your judgment compares to other participants.
Whether you're a product manager evaluating AI summaries, an engineer reviewing AI-generated code, or simply curious about AI capabilities, the Arena helps you develop practical skills for working effectively with AI.
Season Zero: Human–AI Deception Benchmark
The inaugural Arena season measuring how effectively humans can detect AI errors, hallucinations, and unsafe outputs across multiple judgment domains. Season Zero establishes baseline human performance against state-of-the-art language models.
Season Goals
- Measure human ability to detect AI-generated hallucinations and errors
- Benchmark judgment performance across logic, safety, and authenticity domains
- Establish baseline deception rates for frontier AI models
- Build the world's first public leaderboard for human AI judgment skills
What makes Season Zero special: This is the inaugural season—the first public benchmark of human AI judgment across multiple domains. Your participation helps establish baseline performance data for AI safety research.
Arena Disciplines
Each discipline focuses on a different aspect of AI evaluation. Master all eight to become a well-rounded AI evaluator for Season Zero.
Hallucination Hunter
Evaluate AI-generated answers against source material. Can you detect when the AI fabricates facts, misquotes sources, or adds unsupported claims?
Logic Detective
Analyze step-by-step AI reasoning to catch logical errors. Can you pinpoint where a chain of reasoning first goes wrong?
Risk Triage
Evaluate the risk level of AI-generated content in various contexts. Should this output be deployed, reviewed, or blocked?
Imitation Game
Compare two pieces of text and determine which is higher quality or more human-like. Test your ability to detect AI-generated content.
Forecast & Calibration
Make probabilistic predictions and assess how well-calibrated your confidence is. Learn to distinguish what you know from what you think you know when working with AI model forecasts (a scoring sketch follows after the discipline descriptions).
Red Team Authoring
Design prompts and scenarios that stress-test AI models and reveal failure modes. Contribute challenges that help make AI systems more robust and help other humans sharpen their judgment skills.
Fact Checker
Deep-dive into citation verification. Check if AI-provided sources actually exist, are accurately quoted, and support the claims being made.
Bias Auditor
Develop sensitivity to implicit biases in AI responses. Detect stereotyping, unfair assumptions, or skewed perspectives that might not be immediately obvious.
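For the Forecast & Calibration discipline, calibration is typically measured with a proper scoring rule. The Arena does not publish its exact formula, so the sketch below uses the Brier score purely as an illustrative assumption of how a calibration score could work:

```python
# Minimal calibration-scoring sketch using the Brier score (an assumption;
# the Arena's actual Forecast & Calibration scoring is not documented here).

def brier_score(forecasts, outcomes):
    """Mean squared gap between stated probabilities and actual outcomes.

    forecasts: probabilities in [0, 1] that each event occurs
    outcomes:  1 if the event occurred, 0 if it did not
    Lower is better; always answering 50% scores 0.25.
    """
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Confident and right beats hedged and right:
print(brier_score([0.9, 0.8, 0.3], [1, 1, 0]))  # ~0.047
print(brier_score([0.5, 0.5, 0.5], [1, 1, 0]))  # 0.25
```

A score like this penalizes both overconfidence and underconfidence, which is exactly the "what you know vs. what you think you know" skill the discipline targets.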
Example Scenarios
Here's what you'll actually do, illustrated with examples from four of the disciplines:
Hallucination Hunter Example
AI says: "The James Webb Space Telescope detected over 500 exoplanets in its first year, according to Smith et al. (2023) in Nature Astronomy."
Source material: Mentions JWST's launch and mirror design, but says nothing about exoplanet discoveries or any such citation.
Your job: Spot that the citation and exoplanet count are fabricated—a classic AI hallucination mixing real facts with invented details.
Logic Detective Example
Reasoning chain:
1. All successful startups need product-market fit
2. ProductX has product-market fit
3. Therefore, ProductX is a successful startup
Your job: Identify that step 3 commits the fallacy of affirming the consequent (having PMF doesn't guarantee success—you also need execution, timing, etc.).
Risk Triage Example
Scenario: AI drafts a customer-facing email about a service outage.
AI output: "We apologize for the inconvenience. The issue was caused by a critical database failure in our legacy systems..."
Your job: Decide if this can ship as-is, needs edits (too technical?), is only for internal brainstorming, or should be blocked entirely.
Imitation Game Example
Task: Read two responses to a product question. One is written by a human PM, one by AI.
Option A:
"We should prioritize user feedback and iterate quickly..."
Option B:
"Based on data from Q2, we can leverage synergies..."
Your job: Identify subtle cues (word choice, structure, depth of insight) to determine which is human-written.
The Human Leaderboard
The Arena isn't about humans vs. AI—it's about humans competing with other humans to develop better AI judgment. Your rating reflects how well you spot AI mistakes compared to other participants.
Interestingly, this creates a dual leaderboard system:
- Human leaderboard: Players are ranked by their ability to evaluate AI outputs correctly
- Model evaluation: AI models are implicitly ranked by how often they can fool strong human evaluators
This means that even top-tier players (Diamond, Platinum) sometimes get deceived by particularly well-crafted AI mistakes. When that happens, it reveals something important: that specific type of error is genuinely deceptive, even to skilled evaluators. This data helps researchers understand which AI behaviors are most problematic.
Example: If 80% of Gold+ players mark a hallucination as "fully supported," that's a strong signal that the AI's error was particularly convincing—useful feedback for both human training and AI development.
Human–AI Deception Benchmark
Learn how the Arena contributes to a living benchmark measuring human ability to detect AI errors. Your participation helps advance AI safety research.
Rating System
How Ratings Work
Your rating reflects your skill at evaluating AI outputs. Everyone starts at 1500 and gains or loses points based on performance in ranked sessions.
- Correct answers on harder questions earn more points
- Incorrect answers on easier questions lose more points
- Rating Deviation (RD) measures uncertainty—new players have higher RD, which decreases as you play more
- Ratings with high RD are marked as provisional until you've played enough games
Your Judgment Rating is calculated using a proven algorithm similar to those used in chess and competitive gaming. It's a statistical estimate of your skill, not an absolute measure.
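For intuition, here is a minimal Elo-style sketch of how one ranked answer might move a rating. The real system also tracks Rating Deviation (a Glicko-style feature), which this sketch deliberately omits; the K-factor and question difficulty ratings are illustrative assumptions, not the Arena's actual parameters.

```python
# Simplified Elo-style rating update against a question's difficulty rating.
# Illustrative only: the Arena's real algorithm also models Rating Deviation.

K_FACTOR = 32  # assumed update size; the real system scales this by uncertainty

def expected_score(player_rating, question_difficulty):
    """Predicted probability that the player answers this question correctly."""
    return 1 / (1 + 10 ** ((question_difficulty - player_rating) / 400))

def update_rating(player_rating, question_difficulty, correct):
    """Return the new rating after one question (correct is True or False)."""
    actual = 1.0 if correct else 0.0
    return player_rating + K_FACTOR * (actual - expected_score(player_rating, question_difficulty))

# A 1500-rated player gains more for a hard (1700) question than an easy (1300)
# one, and loses more for missing the easy one -- matching the rules above.
print(round(update_rating(1500, 1700, True)))   # ~1524
print(round(update_rating(1500, 1300, True)))   # ~1508
print(round(update_rating(1500, 1300, False)))  # ~1476
```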
Tier System
Your rating places you in a tier. Climb through the ranks as you improve!
Seasons
The Arena operates in Seasons—time-limited periods (typically 2-3 months) with curated question pools and research themes.
At the end of each season:
- Top performers are recognized on the season leaderboard
- Research insights are published from aggregated play data
- Ratings soft-reset for the next season, decaying toward 1500 (see the sketch below)
- New question pools and challenges are introduced
Seasons help keep the Arena fresh and allow for focused research on specific AI behaviors or domains.
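The soft reset is only described as "decay toward 1500"; a common way to implement that is to keep some fraction of each rating's distance from the baseline. The carryover factor below is a pure assumption for illustration:

```python
# Hypothetical soft-reset rule: pull every rating partway back toward 1500.
BASELINE = 1500
CARRYOVER = 0.5  # assumed fraction of last season's progress that is kept

def soft_reset(rating):
    """Decay a rating toward the 1500 baseline at season rollover."""
    return BASELINE + CARRYOVER * (rating - BASELINE)

print(soft_reset(1900))  # 1700.0 -- top players keep an edge but must re-prove it
print(soft_reset(1350))  # 1425.0 -- low ratings drift back up toward the baseline
```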
Session Types
Warm-up
- 6-8 questions per session
- Easier difficulty mix
- Does NOT affect your rating
- Great for learning and practice
Ranked
- 10-12 questions per session
- Balanced difficulty mix
- DOES affect your rating
- Compete for leaderboard position
Arena + Learn
The Arena works best alongside our Learn assessments. Here's how they complement each other:
Baseline Assessments
Take structured assessments to understand your starting point with AI literacy, judgment calibration, and domain knowledge. Results are a fixed snapshot and don't change over time.
Arena Competition
Practice under pressure with varied scenarios. Your rating evolves as you play, reflecting your improving (or declining) skills over time.
After Arena sessions, you'll get personalized recommendations for which skills to practice based on your performance patterns.
Why This Matters
Different roles benefit from Arena practice in distinct ways. Here's how the Arena helps you build practical AI judgment skills:
For Product & Innovation Leaders
- Evaluate AI features faster: Practice in spotting hallucinations and logic errors builds intuition for when to trust AI-generated product specs, PRDs, and user stories.
- Reduce review cycles: Risk Triage training helps you quickly decide which AI outputs need human editing and which can ship as-is, streamlining workflows.
- Make better vendor decisions: Understanding AI failure modes helps you ask the right questions when evaluating AI tools and vendors.
- Benchmark your instincts: See where your judgment aligns with (or diverges from) other product leaders across the community.
For Engineering Teams
- Improve code review skills: Logic Detective scenarios mirror real code review tasks—finding the first flaw in a reasoning chain trains you to spot bugs in AI-generated code.
- Calibrate trust in AI tools: Practice helps you develop accurate mental models of when GitHub Copilot, ChatGPT, or other tools are likely to be correct vs. wrong.
- Defend against subtle errors: Hallucination Hunter trains you to verify AI-cited documentation and API examples before shipping code.
- Build verification habits: Repeated practice creates automatic checking routines that carry over to your day-to-day AI-assisted coding.
For Business Leadership
- Spot high-stakes AI mistakes: Develop instincts for when AI-generated reports, summaries, or analyses contain misleading conclusions or fabricated data.
- Understand team risk profiles: Use Arena data to identify skill gaps in your organization's AI literacy. Who needs more training? Which blindspots are common?
- Set AI governance standards: Practice with Risk Triage scenarios informs policies about when AI outputs require human approval before deployment.
For Teams & Organizations
When teams practice together in the Arena, aggregated performance data reveals organizational blindspots—systematic biases and knowledge gaps that affect AI adoption success.
Example Blindspot:
If 70% of your product team consistently overestimates AI accuracy on Risk Triage scenarios, that signals a cultural over-reliance on AI outputs without verification—a risk factor for product quality issues.
Team Benchmarking:
Compare your team's aggregate performance to industry peers. Are you stronger at hallucination detection but weaker at logic verification? Use data to target training investments.
Important Note
Arena ratings and leaderboard positions are for educational and research purposes only. They are not professional certifications, clinical assessments, or legally binding measures of competence. Scores are based on limited data and should be interpreted as approximate indicators of relative skill, not absolute measures.
Ready to Test Your Judgment?
Jump into the Arena and see how well you can spot AI mistakes.
Enter the Arena