OpenAI’s New Reasoning Models Hallucinate More Than Older Ones

OpenAI’s latest AI models, o3 and o4-mini, are designed to be top-tier reasoning engines—but they come with a significant drawback: they hallucinate more frequently than several of the company’s earlier models.

Hallucinations, where AI generates false or fabricated information, remain one of the most persistent challenges in the field. While previous models have generally shown gradual improvements in reducing hallucinations, OpenAI’s newest releases appear to buck that trend.

According to internal testing, o3 and o4-mini—classified as reasoning models—perform worse in this regard than earlier counterparts like o1, o1-mini, and o3-mini, as well as more traditional models like GPT-4o.

What’s more concerning is that OpenAI doesn’t fully understand why this is happening.

In its technical documentation, the company states that “more research is needed” to determine why hallucinations increase with newer reasoning models. Despite improvements on tasks such as coding and math, the models also make more claims overall, which produces both more accurate statements and more inaccurate, hallucinated ones.
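To see why making more claims cuts both ways, here is a back-of-the-envelope sketch; the claim counts and error rate below are invented for illustration and are not OpenAI's figures.

```python
# Toy illustration (made-up numbers): a model that volunteers more claims
# at the same per-claim error rate produces more correct statements and
# more hallucinations at the same time.

def claim_counts(total_claims: int, error_rate: float) -> tuple[int, int]:
    """Return (correct, incorrect) counts for a given number of claims."""
    incorrect = round(total_claims * error_rate)
    return total_claims - incorrect, incorrect

print(claim_counts(total_claims=100, error_rate=0.2))  # (80, 20)
print(claim_counts(total_claims=200, error_rate=0.2))  # (160, 40): more right and more wrong
```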

The problem is especially apparent on PersonQA, OpenAI’s internal benchmark for measuring a model’s accuracy on knowledge about people. On that test, o3 hallucinated on 33% of questions, roughly double the rates of o1 (16%) and o3-mini (14.8%), while o4-mini fared even worse, hallucinating in nearly half of the cases.
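OpenAI has not published the PersonQA harness, but the headline metric is simple in spirit: the share of questions whose answer contains a fabricated claim. The sketch below is a hypothetical illustration; ask_model and is_hallucination are stand-ins, not OpenAI's evaluation code.

```python
# Hypothetical sketch of a PersonQA-style hallucination rate.
# ask_model() and is_hallucination() are stand-ins, not OpenAI's harness.

from typing import Callable

def hallucination_rate(
    questions: list[str],
    ask_model: Callable[[str], str],
    is_hallucination: Callable[[str, str], bool],
) -> float:
    """Fraction of questions whose answer is flagged as containing a fabrication."""
    flagged = sum(is_hallucination(q, ask_model(q)) for q in questions)
    return flagged / len(questions)

# Dummy stand-ins so the sketch runs end to end.
demo_questions = ["Where was Jane Doe born?", "Who founded Acme Corp?"]
demo_model = lambda q: "a confident but fabricated answer"
demo_grader = lambda q, answer: "fabricated" in answer  # pretend fact-checker
print(hallucination_rate(demo_questions, demo_model, demo_grader))  # 1.0
```

A rate of 0.33 from a loop like this corresponds to the 33% reported for o3.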

External testing by AI nonprofit Transluce revealed more troubling signs. The lab found o3 sometimes invents steps it supposedly took to reach answers—such as falsely claiming it ran code on a 2021 MacBook Pro “outside of ChatGPT.” That’s not something the model is capable of doing.

Neil Chowdhury, a Transluce researcher and former OpenAI employee, speculated that reinforcement learning methods used in the o-series might amplify hallucination issues usually softened by traditional training processes.

Transluce co-founder Sarah Schwettmann noted that this higher hallucination rate could limit the model’s usefulness, especially in high-stakes contexts.

Still, some users see promise. Kian Katanforoosh, Stanford adjunct professor and CEO of upskilling startup Workera, said his team has been testing o3 for coding tasks and sees it as a strong contender—though it has a habit of generating broken links.

While hallucinations might sometimes fuel creative outputs, they remain a serious drawback in domains where factual accuracy is critical—like legal or healthcare applications.

One potential remedy is integrating web search capabilities. OpenAI’s GPT-4o with search, for instance, hits 90% accuracy on SimpleQA, another benchmark used by the company. Access to real-time information might help rein in hallucinations—at least in use cases where privacy concerns allow for it.
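OpenAI's search-enabled GPT-4o is a built-in product feature, but the general pattern behind it, retrieve first and then ask the model to answer only from what was retrieved, can be sketched with the standard chat API. In the sketch below, search_web is a placeholder for whatever search provider you use; this is an illustration of the approach, not OpenAI's implementation.

```python
# Sketch of search-grounded answering: retrieve snippets first, then ask the
# model to answer only from them. search_web() is a placeholder, and this is
# not how OpenAI's "GPT-4o with search" works internally.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_web(query: str) -> list[str]:
    """Placeholder: call a real search API here and return text snippets."""
    raise NotImplementedError("plug in your search provider")

def grounded_answer(question: str) -> str:
    context = "\n\n".join(search_web(question))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Answer using only the provided context. "
                "If the context does not contain the answer, say you don't know."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```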

If scaling up reasoning continues to exacerbate hallucinations, solving this issue will become even more urgent.

“Addressing hallucinations across all our models is an ongoing area of research,” said OpenAI spokesperson Niko Felix, emphasizing the company’s continued efforts to improve reliability and accuracy.

The shift toward reasoning models reflects a broader trend in the AI industry, as researchers seek ways to boost performance without relying on ever-larger datasets and computing power. But that same shift may be introducing new challenges—including a resurgence in hallucinated content.

Control F5 Team
Blog Editor