about

resources

events

contribute

republishing

☰

ΑΙhub.org

Making it easier to verify an AI model’s responses

by MIT News

15 November 2024

share this:

An image of multiple 3D shapes representing speech bubbles in a sequence, with broken up fragments of text within them. Wes Cockx & Google DeepMind / Better Images of AI / AI large language models / Licenced by CC-BY 4.0

By Adam Zewe

Despite their impressive capabilities, large language models are far from perfect. These artificial intelligence models sometimes “hallucinate” by generating incorrect or unsupported information in response to a query.

Due to this hallucination problem, an LLM’s responses are often verified by human fact-checkers, especially if a model is deployed in a high-stakes setting like health care or finance. However, validation processes typically require people to read through long documents cited by the model, a task so onerous and error-prone it may prevent some users from deploying generative AI models in the first place.

To help human validators, MIT researchers created a user-friendly system that enables people to verify an LLM’s responses much more quickly. With this tool, called SymGen, an LLM generates responses with citations that point directly to the place in a source document, such as a given cell in a database.

Users hover over highlighted portions of its text response to see data the model used to generate that specific word or phrase. At the same time, the unhighlighted portions show users which phrases need additional attention to check and verify.

“We give people the ability to selectively focus on parts of the text they need to be more worried about. In the end, SymGen can give people higher confidence in a model’s responses because they can easily take a closer look to ensure that the information is verified,” says Shannon Shen, an electrical engineering and computer science graduate student and co-lead author of a paper on SymGen.

Through a user study, Shen and his collaborators found that SymGen sped up verification time by about 20 percent, compared to manual procedures. By making it faster and easier for humans to validate model outputs, SymGen could help people identify errors in LLMs deployed in a variety of real-world situations, from generating clinical notes to summarizing financial market reports.

Shen is joined on the paper by co-lead author and fellow EECS graduate student Lucas Torroba Hennigen; EECS graduate student Aniruddha “Ani” Nrusimha; Bernhard Gapp, president of the Good Data Initiative; and senior authors David Sontag, a professor of EECS, a member of the MIT Jameel Clinic, and the leader of the Clinical Machine Learning Group of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Yoon Kim, an assistant professor of EECS and a member of CSAIL. The research was recently presented at the Conference on Language Modeling.

Symbolic references

To aid in validation, many LLMs are designed to generate citations, which point to external documents, along with their language-based responses so users can check them. However, these verification systems are usually designed as an afterthought, without considering the effort it takes for people to sift through numerous citations, Shen says.

“Generative AI is intended to reduce the user’s time to complete a task. If you need to spend hours reading through all these documents to verify the model is saying something reasonable, then it’s less helpful to have the generations in practice,” Shen says.

The researchers approached the validation problem from the perspective of the humans who will do the work.

A SymGen user first provides the LLM with data it can reference in its response, such as a table that contains statistics from a basketball game. Then, rather than immediately asking the model to complete a task, like generating a game summary from those data, the researchers perform an intermediate step. They prompt the model to generate its response in a symbolic form.

With this prompt, every time the model wants to cite words in its response, it must write the specific cell from the data table that contains the information it is referencing. For instance, if the model wants to cite the phrase “Portland Trailblazers” in its response, it would replace that text with the cell name in the data table that contains those words.

“Because we have this intermediate step that has the text in a symbolic format, we are able to have really fine-grained references. We can say, for every single span of text in the output, this is exactly where in the data it corresponds to,” Torroba Hennigen says.

SymGen then resolves each reference using a rule-based tool that copies the corresponding text from the data table into the model’s response.

“This way, we know it is a verbatim copy, so we know there will not be any errors in the part of the text that corresponds to the actual data variable,” Shen adds.

Streamlining validation

The model can create symbolic responses because of how it is trained. Large language models are fed reams of data from the internet, and some data are recorded in “placeholder format” where codes replace actual values.

When SymGen prompts the model to generate a symbolic response, it uses a similar structure.

“We design the prompt in a specific way to draw on the LLM’s capabilities,” Shen adds.

During a user study, the majority of participants said SymGen made it easier to verify LLM-generated text. They could validate the model’s responses about 20 percent faster than if they used standard methods.

However, SymGen is limited by the quality of the source data. The LLM could cite an incorrect variable, and a human verifier may be none-the-wiser.

In addition, the user must have source data in a structured format, like a table, to feed into SymGen. Right now, the system only works with tabular data.

Moving forward, the researchers are enhancing SymGen so it can handle arbitrary text and other forms of data. With that capability, it could help validate portions of AI-generated legal document summaries, for instance. They also plan to test SymGen with physicians to study how it could identify errors in AI-generated clinical summaries.

Read the work in full

Towards verifiable text generation with symbolic references, Lucas Torroba Hennigen, Shannon Zejiang Shen, Aniruddha Nrusimha, Bernhard Gapp, David Sontag and Yoon Kim.

MIT News

AIhub is supported by:

Interview with Sukanya Mandal: Synthesizing multi-modal knowledge graphs for smart city intelligence

AIhub 09 Apr 2026

A modular four-stage framework that draws on LLMs to automate synthetic multi-modal knowledge graphs.

Emergence of fragility in LLM-based social networks: an interview with Francesco Bertolotti

Ella Scallan 08 Apr 2026

Francesco tells us how LLMs behave in the social network Moltbook, and what this reveals about network dynamics.

Scaling up multi-agent systems: an interview with Minghong Geng

Lucy Smith 07 Apr 2026

We sat down with Minghong in the latest of our interviews with the 2026 AAAI/SIGAI Doctoral Consortium participants.

Forthcoming machine learning and AI seminars: April 2026 edition

Lucy Smith 02 Apr 2026

A list of free-to-attend AI-related seminars that are scheduled to take place between 2 April and 31 May 2026.

#AAAI2026 invited talk: machine learning for particle physics

Lucy Smith 01 Apr 2026

How is ML used in the search for new particles at CERN?

monthly digest

Making it easier to verify an AI model’s responses

Symbolic references

Streamlining validation

Read the work in full

Related posts :

Interview with Sukanya Mandal: Synthesizing multi-modal knowledge graphs for smart city intelligence

Emergence of fragility in LLM-based social networks: an interview with Francesco Bertolotti

Scaling up multi-agent systems: an interview with Minghong Geng

Forthcoming machine learning and AI seminars: April 2026 edition

#AAAI2026 invited talk: machine learning for particle physics

AIhub monthly digest: March 2026 – time series, multiplicity, and the history of RoboCup

What I’ve learned from 25 years of automated science, and what the future holds: an interview with Ross King

A multi-armed robot for assisting with agricultural tasks

↑