Many safety evaluations for AI models have significant limitations

A new report suggests that today’s tests and benchmarks may not fully address the growing demand for AI safety and accountability.

Generative AI models, which can analyze and produce text, images, music, videos, and more, are under increasing scrutiny due to their tendency to make mistakes and behave unpredictably. In response, organizations ranging from public sector agencies to large tech firms are proposing new benchmarks to test the safety of these models.

Late last year, startup Scale AI established a lab focused on evaluating how well models adhere to safety guidelines. This month, NIST and the U.K. AI Safety Institute introduced tools designed to assess model risk.

However, these methods for testing models may fall short.

The Ada Lovelace Institute (ALI), a U.K.-based nonprofit AI research organization, conducted a study that included interviews with experts from academic labs, civil society, and AI model vendors, as well as an audit of recent research into AI safety evaluations. The co-authors discovered that while current evaluations can be helpful, they are often incomplete, easily manipulated, and do not always predict how models will perform in real-world situations.

“Whether it’s a smartphone, a prescription drug, or a car, we expect the products we use to be safe and reliable; in these sectors, products are rigorously tested to ensure they are safe before they are deployed,” Elliot Jones, senior researcher at ALI and co-author of the report, told TechCrunch. “Our research aimed to examine the limitations of current approaches to AI safety evaluation, assess how evaluations are currently being used, and explore their use as a tool for policymakers and regulators.”

Benchmarks and Red Teaming

The study’s co-authors first reviewed academic literature to summarize the harms and risks posed by current AI models and the state of existing AI model evaluations. They then interviewed 16 experts, including four employees at unnamed tech companies developing generative AI systems.

The study revealed significant disagreements within the AI industry on the best methods and taxonomy for evaluating models.

Some evaluations only tested how models performed against benchmarks in the lab, not how they might affect real-world users. Others relied on tests designed for research purposes rather than for evaluating production models, yet vendors insisted on using them in production.

Possible Solutions

According to the report, the main reasons AI evaluations have not improved are the pressure to release models quickly and a reluctance to conduct tests that could raise issues before a release.

“One person we spoke with who works for a company developing foundation models felt there was more pressure within companies to release models quickly, making it harder to prioritize thorough evaluations,” Jones said. “Major AI labs are releasing models at a pace that outstrips their or society’s ability to ensure they are safe and reliable.”

One interviewee in the ALI study described evaluating models for safety as an “intractable” problem. So, what can the industry and regulators do to find solutions?

Mahi Hardalupas, a researcher at ALI, believes that progress is possible but will require more involvement from public-sector bodies.

“Regulators and policymakers must clearly articulate what they expect from evaluations,” he said. “At the same time, the evaluation community must be transparent about the current limitations and potential of evaluations.”
