New Study Finds AI Abilities Are Often Overstated Because of Flawed Tests

A major new study suggests that many of the tests used to evaluate artificial intelligence are unreliable and frequently exaggerate what AI systems can actually do.

Researchers at the Oxford Internet Institute, working with more than thirty collaborators from other institutions, analyzed 445 widely used AI benchmarks. These benchmarks are the primary tools developers use to assess model performance in areas such as language skills, reasoning, software engineering and more. The study argues that many of these tests lack scientific rigor and fail to measure what they claim to measure.

According to the paper, released Tuesday, many top benchmarks do not clearly define the skill or concept they are supposed to evaluate. Others reuse data from earlier tests in ways that undermine reliability. The researchers also found that statistical comparisons between models are often weak or missing entirely.

Adam Mahdi, a senior research fellow at the Oxford Internet Institute and one of the lead authors, warned that these flaws can give a distorted picture of AI progress. When we ask an AI model to perform a task, Mahdi said, the benchmark may actually be capturing something entirely different from the intended skill. Andrew Bean, another lead author, agreed that even respected benchmarks are often trusted without sufficient scrutiny.

The study reviewed both narrow tests, such as Russian or Arabic language evaluations, and broader assessments of capabilities like spatial reasoning or continual learning. A key concern was whether these tests have what researchers call “construct validity,” meaning whether they genuinely measure the real-world ability they claim to target.

For example, one benchmark meant to assess Russian language skill evaluates models using a set of nine tasks pulled from Russian Wikipedia content. The authors argue that such approaches may not reflect true proficiency. The problem is widespread: nearly half of the benchmarks analyzed fail to define the concepts they evaluate clearly enough to be scientifically useful.

The study also highlights GSM8K, a popular benchmark of grade-school math problems. High leaderboard scores on GSM8K are often taken as proof that AI models are capable of strong mathematical reasoning. Yet Mahdi explained that producing correct answers does not guarantee genuine reasoning ability: a correct answer does not necessarily show that the model understands the math in any deeper sense.

Bean noted that evaluating broad concepts like reasoning will always involve selecting a small set of tasks, which introduces challenges. This makes it crucial for benchmarks to be explicit about what they measure and why.

The authors argue that people working on AI evaluation sometimes choose tasks that loosely relate to concepts like harmlessness or reasoning, declare that these tasks represent the entire concept, and then make broad performance claims. The new paper proposes eight recommendations and a detailed checklist to guide the creation of more transparent and trustworthy benchmarks. Suggestions include defining the scope of each test, using a more representative set of tasks, and relying on proper statistical comparisons.

The paper has received positive feedback from experts. Nikola Jurkovic of the METR AI research center praised the work and said the checklist can help ensure that future benchmarks generate more meaningful insights.

The report builds on earlier concerns from researchers at Anthropic, who last year called for stronger statistical testing to determine whether head-to-head differences between models are meaningful or simply the result of chance.
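To make that idea concrete, the sketch below shows one common way to put error bars on a head-to-head comparison: a paired bootstrap over the items of a shared benchmark. This is an illustrative example, not the procedure described in the paper or by Anthropic, and the model scores, sample size, and accuracy levels are invented for demonstration.

```python
# Illustrative sketch: paired bootstrap test for whether the accuracy gap
# between two models on the same benchmark is meaningful or plausibly chance.
import random

def paired_bootstrap_diff(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Resample benchmark items with replacement and return a 95% confidence
    interval for the difference in mean score (model A minus model B)."""
    assert len(scores_a) == len(scores_b), "scores must be paired per item"
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        diffs.append(mean_a - mean_b)
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples)]

# Hypothetical per-item results (1 = correct, 0 = incorrect) for two models
# evaluated on the same 200-item benchmark.
random.seed(42)
model_a = [1 if random.random() < 0.82 else 0 for _ in range(200)]
model_b = [1 if random.random() < 0.78 else 0 for _ in range(200)]

lo, hi = paired_bootstrap_diff(model_a, model_b)
print(f"Observed accuracy gap: {sum(model_a)/200 - sum(model_b)/200:+.3f}")
print(f"95% bootstrap CI for the gap: [{lo:+.3f}, {hi:+.3f}]")
# If the interval straddles zero, the leaderboard difference may just be noise.
```

The point of such a test is simply that a single leaderboard number, reported without any interval, cannot tell readers whether one model genuinely outperforms another on that benchmark.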

The field is already responding. Several groups have begun designing new benchmarks focused on real-world tasks. In September, OpenAI introduced tests that evaluate AI performance across forty-four occupations, such as reviewing invoices or creating production schedules for a video shoot. Dan Hendrycks of the Center for AI Safety and his team have developed a similar set of tests aimed at measuring performance on tasks common in remote work.

Hendrycks stressed that high scores on a benchmark do not always reflect mastery of the underlying goal. Mahdi echoed this sentiment, adding that the scientific evaluation of AI is still in its early stages and that there is significant room for improvement as the field matures.
