A new research paper from OpenAI takes a closer look at why large language models (LLMs) like GPT-5 — and chatbots such as ChatGPT — continue to generate hallucinations, and whether changes in evaluation methods could help reduce them.
In a blog post summarizing the work, OpenAI defines hallucinations as “plausible but false statements generated by language models.” The company admits that while progress has been made, hallucinations “remain a fundamental challenge for all large language models” — and one that may never fully disappear.
To demonstrate, the researchers asked a popular chatbot about the title of Adam Tauman Kalai’s PhD dissertation. They received three different answers, all incorrect. When they followed up by asking about Kalai’s birthday, the chatbot again produced three different — and equally wrong — responses.
So why do chatbots fail so confidently? According to the paper, the problem starts with how LLMs are pretrained. The models are optimized to predict the next word in a sequence, without any labels marking a statement as true or false. This means they learn fluent language patterns, but not factual accuracy. While consistent rules — like spelling or punctuation — improve as models scale, rare or arbitrary facts (like someone’s birthday) can’t be reliably predicted from patterns alone, leading to hallucinations.
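To make that concrete, the standard pretraining objective can be written as a next-token cross-entropy loss (a textbook formulation, not one quoted from the paper): nothing in it checks whether the resulting statement is factually true.

```latex
% Next-token (cross-entropy) pretraining objective over a token sequence x:
% the model is rewarded for assigning high probability to the observed next
% token; no truth label for the generated statement appears in the loss.
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)
```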
Still, the paper argues that the deeper issue lies in how models are evaluated. Current benchmarks don’t cause hallucinations directly, but they “set the wrong incentives.” The researchers liken today’s accuracy-based tests to multiple-choice exams where guessing is often rewarded: you might get lucky with the right answer, while leaving a question blank guarantees zero points.
“When models are graded only on accuracy — the percentage of questions they get exactly right — they’re pushed to guess rather than admit ‘I don’t know,’” the paper explains.
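The arithmetic behind that incentive is easy to sketch. The snippet below is an illustrative example rather than code from the paper (the scoring values and the expected_score helper are assumptions): under accuracy-only grading, even a low-confidence guess has a higher expected score than abstaining.

```python
# Illustrative sketch: accuracy-only grading gives 1 point for a correct answer
# and 0 for everything else, so guessing always dominates abstaining.

def expected_score(p_correct: float, abstain: bool,
                   right: float = 1.0, wrong: float = 0.0, skip: float = 0.0) -> float:
    """Expected score for one question under a simple grading scheme."""
    if abstain:
        return skip
    return p_correct * right + (1.0 - p_correct) * wrong

# A question the model is only 10% sure about:
print(expected_score(0.10, abstain=False))  # 0.10 -> guessing earns points on average
print(expected_score(0.10, abstain=True))   # 0.00 -> saying "I don't know" earns nothing
```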
The proposed fix is to change evaluation methods in ways similar to standardized tests that discourage blind guessing, such as by applying penalties for wrong answers or granting partial credit for skipped questions. For AI, that would mean “penalizing confident errors more than uncertainty, and awarding partial credit when models appropriately express doubt.”
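As a minimal sketch of what such a scorer might look like (the specific point values and the uncertainty_aware_score function are assumptions for illustration, not OpenAI's actual metric):

```python
from typing import Optional

def uncertainty_aware_score(answer: Optional[str], correct_answer: str) -> float:
    """Score one answer: reward correctness, penalize confident errors,
    and give partial credit for abstaining ("I don't know")."""
    if answer is None:                # the model declined to answer
        return 0.25                   # partial credit for expressing doubt (assumed value)
    if answer == correct_answer:
        return 1.0                    # full credit for a correct answer
    return -1.0                       # confident error costs more than uncertainty

# Under this scheme, a 10%-confidence guess has expected score
# 0.1 * 1.0 + 0.9 * (-1.0) = -0.8, while abstaining scores 0.25,
# so the incentive to guess blindly disappears.
```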
And OpenAI stresses that small tweaks aren’t enough. Rather than adding a handful of “uncertainty-aware” tests, the widely used, accuracy-focused benchmarks themselves need to evolve.
“If the main scoreboards keep rewarding lucky guesses, models will keep learning to guess,” the researchers warn.