Researchers Use NPR’s Sunday Puzzle to Gauge AI’s Problem-Solving Skills

Every Sunday, Will Shortz, The New York Times’ crossword puzzle editor and NPR’s longtime puzzlemaster, challenges listeners with the Sunday Puzzle, a segment designed to be solvable with minimal prior knowledge but still demanding enough to stump even seasoned participants.

This unique blend of accessibility and complexity has caught the attention of AI researchers, who see it as a potential benchmark for evaluating artificial intelligence’s reasoning capabilities.

A recent study conducted by researchers from institutions including Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and AI startup Cursor introduced an AI benchmark built around these puzzles. Their findings revealed intriguing insights, such as the tendency of some reasoning models—like OpenAI’s o1—to “give up” and produce incorrect answers they seemingly know are wrong.

A Fresh Benchmark for AI Evaluation

“We aimed to create a benchmark with problems that humans can solve using only general knowledge,” said Arjun Guha, a computer science professor at Northeastern and one of the study’s co-authors, in an interview with TechCrunch.

The AI industry currently faces challenges in benchmarking. Many standard evaluations focus on assessing models’ proficiency in highly specialized fields, such as PhD-level math and science, which may not reflect real-world applications. Additionally, existing benchmarks quickly become outdated as AI systems improve.

By contrast, the Sunday Puzzle offers an alternative: its questions do not rely on niche expertise, and their phrasing prevents models from relying on rote memorization to generate responses, according to Guha.

“What makes these puzzles particularly tough is that solving them requires a mix of insight and a process of elimination—progress often happens all at once when everything clicks together,” he explained.

Limitations and Future Improvements

Like any benchmark, this one has its constraints. The Sunday Puzzle is primarily geared toward English-speaking U.S. audiences, and since past quizzes are publicly available, AI models could, in theory, “memorize” answers. However, Guha notes that he hasn’t observed signs of this happening.

“New questions are released every week, ensuring the most recent ones remain unseen,” he said. “We plan to keep the benchmark updated and track how AI performance evolves over time.”

The benchmark currently includes around 600 Sunday Puzzle riddles. Among the models tested, reasoning-focused systems like OpenAI’s o1 and DeepSeek’s R1 demonstrated the best results. Unlike traditional AI models, reasoning models carefully fact-check their responses before providing an answer, helping them avoid common pitfalls. However, this approach also means they take longer—anywhere from seconds to minutes—to reach conclusions.
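For readers curious how an evaluation like this works mechanically, here is a minimal sketch, not the authors’ actual harness, of a puzzle benchmark scorer: each riddle is paired with its expected answer, the model’s reply is normalized, and accuracy is the exact-match rate over the set. The `query_model` callable and the toy puzzle are hypothetical stand-ins for whatever model API and data you use.

```python
# Minimal sketch of a puzzle benchmark scorer (illustrative only, not the
# study's harness). Each item pairs a riddle with its expected answer; the
# model's reply is normalized and compared by exact match.
from typing import Callable, Iterable


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so 'Sunday Puzzle!' matches 'sunday puzzle'."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()


def score_benchmark(
    items: Iterable[tuple[str, str]],      # (riddle, expected answer) pairs
    query_model: Callable[[str], str],     # hypothetical model-call function
) -> float:
    """Return the model's exact-match accuracy over the puzzle set."""
    items = list(items)
    correct = sum(
        normalize(query_model(riddle)) == normalize(answer)
        for riddle, answer in items
    )
    return correct / len(items) if items else 0.0


if __name__ == "__main__":
    # Toy example with a mock "model" that always gives the same answer.
    puzzles = [("Name NPR's weekly on-air puzzle segment.", "Sunday Puzzle")]
    print(score_benchmark(puzzles, lambda prompt: "Sunday Puzzle"))  # -> 1.0
```

A real harness would also need to track latency and token usage, since, as noted above, reasoning models can take seconds to minutes per puzzle.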

AI Struggles with Reasoning—And Sometimes ‘Gets Frustrated’

Interestingly, DeepSeek’s R1 exhibited an unusual behavior: for certain puzzles, it would outright declare, “I give up,” before selecting a seemingly random incorrect answer—an all-too-human response. Other observed quirks included models retracting incorrect answers, attempting to refine them unsuccessfully, or getting stuck in endless loops of indecision. Some even provided correct answers but then second-guessed themselves unnecessarily.

“On difficult problems, R1 actually states that it’s feeling ‘frustrated,’” Guha noted. “It was amusing to see a model emulate human-like responses. The impact of ‘frustration’ on reasoning quality is something we still need to explore.”

Next Steps in AI Reasoning Research

As of now, OpenAI’s o1 leads the benchmark with a score of 59%, followed by o3-mini with high reasoning effort at 47%, and R1 at 35%. The researchers aim to expand their testing to additional reasoning models to further pinpoint areas for improvement.

“Good reasoning doesn’t require a PhD, so it should be possible to develop benchmarks that don’t demand highly specialized knowledge,” Guha remarked. “A benchmark accessible to a broad range of researchers allows for deeper analysis, which could ultimately lead to more effective AI solutions. As AI systems continue to shape everyday life, it’s crucial for everyone to understand what these models can—and cannot—do.”

Source

Control F5 Team
Blog Editor