about

resources

events

contribute

republishing

☰

ΑΙhub.org

Faithfully reflecting updated information in text: Interview with Robert Logan – #NAACL2022 award winner

by AIhub

04 August 2022

What is the topic of the research in your paper?

Our paper introduces the new task of faithfully reflecting updated information in text or FRUIT for short. Given an outdated Wikipedia article and new information about the article’s subject, the goal is to edit the article’s text to be consistent with the new information.

Could you tell us about the implications of your research and why it is an interesting area for study?

Textual knowledge bases such as Wikipedia are essential resources for both humans and machine learning models. However, as these knowledge bases grow, it becomes increasingly difficult to keep the information within them consistent. Every time an entry is changed or added, it could potentially introduce a contradiction with one or more of the other entries. One reason FRUIT is a valuable task is that solving it will produce systems that could help substantially reduce the burden required to maintain consistency in these settings.

In addition, solving the task requires addressing several challenges in modern NLP. First, models cannot perform well solely by relying on their parametric world knowledge. The model must prefer the evidence whenever it contradicts the model’s parametric knowledge, which recent work has shown is difficult for pretrained language models [1] [2]. Furthermore, the generated text needs to be faithful to both the original article and the provided evidence, but prefer claims in the evidence when they invalidate information in the original article. Lastly, the provided evidence can contain both unstructured text and structured objects like tables and lists, so solving the task involves challenging aspects of both multi-document summarization and data-to-text generation.

Based on our results, we’ve also found that FRUIT can be useful for measuring different models’ propensities to hallucinate as most hallucinations are usually easy to detect since they typically introduce new entities or numbers.

Could you explain your methodology?

Our main contributions in this paper were designing and releasing an initial dataset for benchmarking systems on this task and measuring baseline results for several T5 (Text-to-Text Transfer Transformer)-based models .

For data collection, we computed the diff between two snapshots of Wikipedia to collect: 1) the set of updated articles, and 2) evidence from other articles that could justify these updates. We then employed several heuristics to match the evidence to the articles and filter out unwanted data. To ensure that the collected evidence supported all of the edits in our evaluation dataset, we additionally hired human annotators to edit out any unsupported text from the updated article in a way that preserved fluency. Our dataset (FRUIT-Wiki) and the code we used for data collection are publicly available here.

We measured baseline results for several T5-based models and a trivial baseline that generates a verbatim copy of the original article. For the neural models, we investigated a straightforward application of T5 where we concatenated the original article and evidence to form the input and trained the model to decode the entire updated article from scratch. We additionally investigated an alternative approach we call EdiT5 that we trained to instead generate the diff between the original and the updated article and predict which pieces of evidence were used to generate the updates. To evaluate the faithfulness of generations, we introduced a variant of the ROUGE score we call UpdateROUGE that only considers the updated sentences instead of the entire article (this prevents systems that do not update the text from receiving high scores). We additionally introduced metrics that use the output of a NER system to measure the precision and recall of entities in generated vs. ground truth updates, as well as how many “hallucinated” entities appear in the generated updates but not in the original article or provided evidence.

demonstrating how FRUIT works Illustration of the FRUIT task. An outdated original article and relevant new information are provided as inputs, and the goal is to generate the updated article

What were your main findings?

Regarding our data collection pipeline, we found that it produces high-quality data according to many measures. By comparing the pipeline outputs to the human annotations, we found that our heuristics correctly associate evidence to updates about 85% of the time and that 90% of the mentions in the updated text are supported. Furthermore, we detected a high rank correlation in system performance on the automatically generated vs. human-annotated data (>90% for most metrics), which suggests that researchers that use our pipeline to collect datasets for other time periods and language should be able to estimate the relative performance of different systems without requiring human annotation.

On the modeling side, we were pleasantly surprised at how well our baselines performed. I pretty much jumped out of my seat when I saw how plausible the initial model generations were! In terms of the metrics I mentioned above, we found that the T5-based models typically obtain UpdateROUGE scores in the mid-40s (which is comparable to ROUGE scores for other generation tasks) and that the number of hallucinated entities is typically quite small (1.5 hallucinated tokens per updated article on average). In addition, we found that our proposed EdiT5 approach outperformed the basic approach by about 2-5% (absolute) on all metrics. We also performed qualitative analysis on 100 test instances to better understand the models’ mistakes. We found the EdiT5 model generated ungrounded claims in only 35% of the updated articles. In those articles, most mistakes were due to the model hallucinating numbers and dates (21 out of 35).

Overall, these results suggest that the benchmark dataset and data collection pipeline we’ve compiled for this task will be useful for conducting research on it. They also highlight some areas of improvement for future work to concentrate on.

What further work are you planning in this area?

One area that I’d really like to focus on is improving evaluation for this task. While using ROUGE and NER-based metrics was acceptable for a first pass at measuring faithfulness, these metrics only penalize the model when new n-grams and mentions appear in the generated updates. Meanwhile, there’s been a lot of exciting work on model-based metrics for evaluating faithfulness in other generative tasks (this paper that was published alongside FRUIT at NAACL gives a good overview) that I would be really excited to see applied to FRUIT but will likely need some modification to deal with things like structured inputs and conflicting information in the source article and evidence.

Read the paper in full

FRUIT: Faithfully Reflecting Updated Information in Text
Robert L. Logan IV, Alexandre Passos, Sameer Singh, Ming-Wei Chang.

About Robert Logan

Robert L. Logan IV is a research scientist at Dataminr. His research primarily focuses on the interplay between language modeling and information extraction. Prior to joining Dataminr, he was a Ph.D. student at the University of California, Irvine, studying machine learning and natural language processing under Padhraic Smyth and Sameer Singh, and received BAs in Mathematics and Economics from the University of California, Santa Cruz. He has also conducted machine learning research as an intern at Google Research and Diffbot and worked as a research analyst at Prologis.

AIhub is dedicated to free high-quality information about AI.

AIhub is supported by:

Learning from logical constraints with lower- and upper-bound arithmetic circuits

Lucile Dierckx, Alexandre Dubray and Siegfried Nijssen 07 Jan 2026

Find out more about work presented at IJCAI 2025.

What are small language models and how do they differ from large ones?

The Conversation 06 Jan 2026

Let’s explore what makes SLMs and LLMs different – and how to choose the right one for your situation.

Forthcoming machine learning and AI seminars: January 2026 edition

Lucy Smith 05 Jan 2026

A list of free-to-attend AI-related seminars that are scheduled to take place between 5 January and 28 February 2026.

AAAI presidential panel – AI perception versus reality video discussion

Lucy Smith 02 Jan 2026

Watch the second panel discussion in this series from AAAI.

2025 digest of digests

Lucy Smith 30 Dec 2025

We look back through the archives of our monthly digests to pick out some highlights from the year.

monthly digest

AIhub monthly digest: December 2025 – studying bias in AI-based recruitment tools, an image dataset for ethical AI benchmarking, and end of year com

Lucy Smith 29 Dec 2025

Welcome to our monthly digest, where you can catch up with AI research, events and news from the month past.

Half of UK novelists believe AI is likely to replace their work entirely

University of Cambridge 24 Dec 2025

A new report asks literary creatives about their views on generative AI tools and LLM-authored books.