When Benchmarks Break: The Crisis of Measurement in the AI Era

We're living through a peculiar moment in artificial intelligence history. Not because progress has stalled, but because our measuring sticks have broken. For decades, standardized tests served as our compass, showing us how far machine systems lagged behind human capabilities. Then came the large language models—GPT-4, Claude 3, Gemini Ultra—and suddenly, tests that seemed insurmountable just years ago were being solved with superhuman precision.

This is where Humanity's Last Exam (HLE) enters the picture: an ambitious project from the Center for AI Safety (CAIS) and Scale AI designed to be, quite literally, the final standardized test before AI capabilities move beyond what closed-ended questions can measure.

The Benchmark Saturation Problem

To understand why HLE exists, consider what happened to its predecessors. The Massive Multitask Language Understanding (MMLU) benchmark, covering 57 academic disciplines, was once considered the gold standard. By 2024, top models were scoring over 90%. In psychometrics, this is called a "ceiling effect"—when a test can no longer distinguish between very good and excellent performance, it loses its diagnostic value.

Similar fates befell other benchmarks:

  • GSM8K (Grade School Math): Once challenging, now nearly perfected by smaller models
  • HumanEval: A coding test increasingly trivialized by training on GitHub repositories
  • GPQA (Google-Proof Q&A): A direct predecessor with expert-level questions, but limited in scope (around 448 questions)

When every speedometer is pinned at its maximum, it becomes impossible to tell which vehicle is actually faster, or whether one of them is about to break the sound barrier. HLE was explicitly designed to shatter this ceiling and establish a new scale that reaches toward the theoretical limits of human expert knowledge.

What Makes HLE Different?

Humanity's Last Exam is a multimodal, interdisciplinary benchmark consisting of 2,500 public questions plus an additional 500-1,000 strictly confidential test questions to prevent overfitting. But it's not just the size that matters—it's the philosophy behind its construction.
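
For readers who want to inspect the questions themselves, the public portion is distributed as a dataset. Below is a minimal sketch using the Hugging Face `datasets` library; the dataset identifier `cais/hle` and the field names are assumptions about the public release, and access may require accepting the dataset's terms.

```python
# Minimal sketch: load the public HLE questions with the Hugging Face `datasets` library.
# The dataset id "cais/hle" and the field name "category" are assumptions about the
# public release; access may also require logging in and accepting the dataset's terms.
from collections import Counter

from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")  # only the public questions are downloadable
print(f"{len(hle)} public questions")

# Rough per-subject breakdown (field name assumed).
print(Counter(row["category"] for row in hle).most_common(10))
```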

The Crowdsourcing Revolution

Rather than scraping questions from the internet (risking contamination with AI training data), CAIS and Scale AI launched a global competition with over $500,000 in prizes. Scientists, doctoral candidates, and specialized professionals from more than 50 countries and 500 institutions were recruited to create genuinely novel questions.

This approach delivered two strategic advantages:

  1. Guaranteed Novelty: Questions specifically created for this competition couldn't exist in training datasets
  2. Deep Expertise: An AI researcher generally can't write a good theoretical chemistry question, but a chemistry PhD candidate can craft challenges that stump even their professors

Beyond STEM: The Humanities Component

A common misconception is that HLE focuses exclusively on STEM fields. While mathematics comprises roughly 42% of the benchmark, the humanities play a crucial and qualitatively unique role. These questions serve as a critical filter against pure computational power.

Take this example from Earth Anderson, a PhD candidate in history at the University of Arkansas: a question about French royal tradition required not just knowing which king bore the epithet "Augustus" (Philippe Auguste), but understanding the historiographical question of how that title emerged in the literature, specifically its connection to Suetonius' biography of Caesar. Current AI models failed: they could retrieve the facts, but they couldn't draw the subtle distinction between the historical actor and the literary genre tradition, a core competency of historical research.

Other humanities challenges include:

  • Classical Philology: Translating inscriptions in Palmyrene script from Roman tombstones
  • Linguistics/Theology: Analyzing biblical Hebrew texts to identify closed syllables based on the specific "Tiberian pronunciation" tradition
  • Philosophy: Questions requiring hermeneutic understanding rather than fact retrieval

The "Google-Proof" Principle

In an era of internet-connected AI with Retrieval Augmented Generation (RAG), asking questions that appear in Wikipedia is pointless. HLE's creators used advanced search agents (GPT-4o with Search, Perplexity Sonar) to verify each question. If a question could be answered by simply searching and summarizing the top three results, it was disqualified or rewritten to require genuine reasoning.

Bad question (trivia): "When was the Battle of Hastings fought?" → Can be Googled.
Good question (HLE-level): A question that relates a specific tactical decision in the battle to a 2023 archaeological discovery and asks about implications for troop morale. Here, the AI must synthesize two pieces of information that have never appeared together in a single text.
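
That filtering step can be pictured as a small pipeline: retrieve the top results, let a model answer from the retrieved context alone, and reject any question it already gets right. The sketch below is a hypothetical reconstruction, not the organizers' actual tooling; `web_search` and `llm_answer` are placeholder functions.

```python
# Hypothetical reconstruction of the "Google-proof" filter described above: a candidate
# question is rejected if a search-and-summarize agent already answers it correctly from
# the top few results. `web_search` and `llm_answer` are placeholders, not the actual tooling.
from dataclasses import dataclass


@dataclass
class Candidate:
    question: str
    reference_answer: str


def web_search(query: str, k: int = 3) -> list[str]:
    """Placeholder: return the text of the top-k search results."""
    raise NotImplementedError


def llm_answer(question: str, context: list[str]) -> str:
    """Placeholder: ask a model to answer using only the retrieved context."""
    raise NotImplementedError


def is_google_proof(c: Candidate) -> bool:
    """Keep only questions a retrieval-plus-summarization agent cannot already solve."""
    context = web_search(c.question, k=3)
    guess = llm_answer(c.question, context)
    # Real graders normalize answers or use a judge model; exact match is the simplest stand-in.
    return guess.strip().lower() != c.reference_answer.strip().lower()
```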

From Turing Test to Competence Test: A Paradigm Shift

The difference between the Turing Test and HLE touches the core of what we mean by "intelligence." These represent two completely different philosophical approaches to AI evaluation.

The Turing Test: The Imitation Game

Alan Turing's original 1950 concept was based on deception. The question wasn't "Does the machine know quantum physics?" but "Can the machine convince a human it's human?"

  • Focus: Social competence, language fluency, deception
  • Weakness: Rewards mediocrity—an AI that solves complex math instantly would fail by revealing itself as a machine
  • Status: Largely considered scientifically irrelevant for measuring competence today

HLE: The Competence Test

Humanity's Last Exam reverses the logic. It's irrelevant whether the AI seems human. What matters is whether it can perform tasks that only the cognitive elite of humanity can achieve.

  • Focus: Factual truth, logical depth, problem-solving capacity
  • Goal: Measuring "superintelligence" in specific domains
  • Benchmark: Not the average person, but the PhD-level expert

| Feature | Turing Test (1950) | Humanity's Last Exam (2025) |
| --- | --- | --- |
| Core Question | "Can the machine pass as human?" | "Does the machine possess expert knowledge?" |
| Measurement | Behavior / Imitation | Cognitive Ability / Reasoning |
| Success Condition | Deceiving human evaluators | Solving objective, verifiable problems |
| Relevance Today | Historical / Philosophical | Technical / Safety-Critical |

The transition from Turing Test to HLE marks the shift from AI that talks to AI that works. It's an attempt to penetrate the "smooth surface" of LLMs and test whether there's a solid foundation of logic beneath, or merely statistical probabilities.

The Current State: Where AI Stands Today

As of late 2025, the HLE leaderboard paints a fascinating picture of both how far AI has come and how far it still needs to go.

The Initial Shock (Early 2025)

When HLE first launched, developers of top models experienced a rude awakening. Models scoring over 90% on MMLU plummeted to near-single-digit scores on HLE: GPT-4 and early Gemini 1.5 versions landed between roughly 2% and 10%, barely better than random guessing.

The Reasoning Models Wave (Late 2025)

The picture changed with the introduction of "System 2" thinking models (referencing Daniel Kahneman's theory of slow thinking). These models—OpenAI o1/o3, Google Gemini 3, Kimi K2—use "inference-time compute" techniques, essentially "thinking" before answering by internally generating and evaluating thousands of logical intermediate steps.
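
The internals of these models are proprietary, but the flavor of "thinking before answering" can be illustrated with a published technique from the same family, self-consistency: sample several independent chains of thought and take a majority vote over the final answers. In the sketch below, `sample_chain_of_thought` is a placeholder for an actual model call.

```python
# Illustration of spending more compute at inference time via self-consistency:
# sample several independent chains of thought and majority-vote the final answers.
# This is one published technique from the same family; the proprietary reasoning
# models described above do something more elaborate internally.
from collections import Counter


def sample_chain_of_thought(question: str, temperature: float = 0.8) -> str:
    """Placeholder: have a model reason step by step and return only its final answer."""
    raise NotImplementedError


def answer_with_self_consistency(question: str, n_samples: int = 32) -> str:
    """More samples = more inference-time compute = (usually) a more reliable answer."""
    finals = [sample_chain_of_thought(question) for _ in range(n_samples)]
    answer, _votes = Counter(finals).most_common(1)[0]
    return answer
```

The knob is `n_samples`: the more chains of thought the system is allowed to generate and compare, the more compute it spends per question and, typically, the better its chance of converging on a correct answer.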

Current Leaderboard:

| Rank | Model | Developer | HLE Score | Key Feature |
| --- | --- | --- | --- | --- |
| 1 | Kimi K2 Thinking | Moonshot AI (China) | ~44.9% - 49% | Aggressive use of chain-of-thought and external tools |
| 2 | Gemini 3 Pro | Google DeepMind | ~37.5% (no tools) / ~45.8% (with tools) | Native multimodality, deep search integration |
| 3 | Grok 4 | xAI | ~25% - 44% | Strong variance depending on tool usage |
| 4 | GPT-5 / o3 | OpenAI | ~35.2% | Strong baseline performance |
| Ref | Human Expert | Humanity | ~88 - 90% | Reference value for experts in their field |

Key Insights:

  • China's Ascent: Kimi K2 Thinking's top placement is geopolitically significant, showing Chinese labs have reached the world's leading edge in complex reasoning.
  • The Tool Gap: The difference between Gemini 3's scores without tools (37.5%) and with tools (45.8%) shows that access to a Python interpreter or a search engine is decisive. HLE increasingly measures the ability to orchestrate agentic workflows (a minimal sketch of such a tool loop follows this list).
  • The Human Gap: Despite impressive progress, 45% is still a failing grade. The fact that humans achieve nearly 90% refutes any claim that the questions are unfair—they're solvable, but only with genuine understanding.
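
A minimal sketch of such a tool loop, with `call_model`, `run_python`, and `search` as hypothetical placeholders rather than any vendor's actual API:

```python
# Minimal sketch of the tool loop behind the "with tools" scores: the model may request
# tool calls (run code, search) before committing to a final answer. `call_model`,
# `run_python`, and `search` are hypothetical placeholders, not any vendor's actual API.
import json


def call_model(messages: list[dict]) -> dict:
    """Placeholder: returns either {'tool': name, 'input': ...} or {'answer': ...}."""
    raise NotImplementedError


def run_python(code: str) -> str:
    """Placeholder: execute code in a sandbox and return its output."""
    raise NotImplementedError


def search(query: str) -> str:
    """Placeholder: return a text summary of search results."""
    raise NotImplementedError


TOOLS = {"python": run_python, "search": search}


def answer_with_tools(question: str, max_steps: int = 8) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = call_model(messages)
        if "answer" in step:          # the model has decided it is done
            return step["answer"]
        result = TOOLS[step["tool"]](step["input"])   # otherwise, run the requested tool
        messages.append({"role": "tool",
                         "content": json.dumps({"tool": step["tool"], "result": result})})
    return "no final answer within the step budget"
```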

The Calibration Disaster

Perhaps even more alarming than the error rate is the calibration error. HLE doesn't just measure if the answer is correct—it also asks: "How certain are you?"

The result: Models show extremely high calibration errors (sometimes exceeding 50%). They often give completely wrong answers while claiming 99% certainty. This is the real safety risk. An AI that "knows what it doesn't know" is safe. An AI that hallucinates with confidence is dangerous—especially in fields like medicine or law that HLE covers.
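
Calibration can be checked mechanically: bucket the model's self-reported confidences, then compare each bucket's average confidence with its actual accuracy. The sketch below computes a simple binned calibration error; the exact metric reported on the HLE leaderboard may be defined differently.

```python
# Sketch of a binned calibration error: compare the model's stated confidence with its
# actual accuracy inside each confidence bucket. HLE reports a related metric; the exact
# definition used by the leaderboard may differ from this illustration.
def calibration_error(confidences: list[float], corrects: list[bool], n_bins: int = 10) -> float:
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, corrects):
        idx = min(int(conf * n_bins), n_bins - 1)  # e.g. a 0.97 confidence lands in the top bin
        bins[idx].append((conf, ok))

    total, err = len(confidences), 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        err += (len(bucket) / total) * abs(avg_conf - accuracy)  # weighted |confidence - accuracy|
    return err


# A model that claims 99% certainty on everything but is right only half the time:
print(calibration_error([0.99] * 100, [True] * 50 + [False] * 50))  # ~0.49
```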

Controversies: Is the Test Fair?

No scientific project of this magnitude escapes criticism. An external audit by FutureHouse questioned HLE's validity, sparking an important debate about the nature of "truth" in science.

FutureHouse examined the chemistry and biology sections (about 321 questions) and initially claimed that 29% ± 3.7% of answers were flawed, with contradictions to peer-reviewed papers and "gotcha" questions designed primarily to trip up AI.

One example: A question asked for the "rarest noble gas on Earth as a percentage of terrestrial matter in 2002." The answer according to HLE was "Oganesson" (a synthetic element). FutureHouse criticized this as scientifically misleading, since Oganesson has no natural relevance—the question was more a linguistic puzzle than a chemistry problem.

The HLE team responded professionally, conducting a re-review and correcting the error rate to approximately 18%. They also introduced a "rolling" process where the community can report errors (bug bounty) to continuously refine the dataset.

The Philosophical Implication:

This controversy reveals a deep problem in AI evaluation. When we enter the realm of expert knowledge, there's often no single "truth" anymore. Science is discourse, not dogma. Experts disagree. A benchmark demanding "exact match" hits epistemological limits here. An AI that presents a divergent but scientifically sound opinion would be marked "wrong." This is a limitation of HLE that must be considered when interpreting results.

After the Last Exam: What Comes Next?

Humanity's Last Exam is more than a dataset—it's a wake-up call. It ends the illusion that current AI models are already "superintelligent" by showing they still fail frequently in the depths of human knowledge. Simultaneously, it documents the breathtaking pace at which this gap is closing.

The name "Last Exam" implies finality. When AIs pass this test (likely between 2026 and 2028), closed-ended examinations will lose their purpose. The next stage of evaluation will be "open-ended": giving an AI access to a laboratory, a budget, and a timeframe, then evaluating the results of its research (a new drug, a mathematical theorem).

HLE is thus the bridge between the chatbot era and the era of autonomous research agents. It's the last test we can still grade before AIs begin providing answers we ourselves can no longer verify.

Key Takeaways

  • What is it? A stress test with 2,500+ extremely difficult questions across 100+ disciplines (including humanities), designed to reveal the limits of AI models at PhD level
  • Benchmark for what? Deep reasoning and expert knowledge, excluding pure memorization (Google-proof)
  • Different from Turing Test? Diametrically opposed—Turing tests deception and human-like behavior; HLE tests objective competence and truth
  • Current status? Best models (Kimi K2, Gemini 3) reach about 45-49%, while humans score ~90%. Models are extremely confident but often wrong (calibration problem)
  • The future? The last test we can grade before AI begins operating beyond our ability to verify their answers

The era of measuring AI with multiple-choice tests is coming to an end. What replaces it will define whether we remain the examiners—or become the examined.