AIhub.org
 

#NeurIPS2023 invited talk: Lora Aroyo on data quality and diversity


17 January 2024




The thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023) took place in New Orleans towards the end of last year. As part of the programme of invited talks, Lora Aroyo spoke about her work on responsible AI, specifically looking at the data annotation process and what this means for models that use that data.

The rapid progress of AI in recent years has been, in large part, due to the availability of large quantities of data for model training. However, these advancements have left in their wake a trail of problematic model behaviours. Lora’s research is focussed on studying the characteristics of data, such as stereotypes and biases, that impact on the models. In the talk, she presented empirical results from experiments on human-labelled data used for model evaluation and fine-tuning, and adversarial data used for safety evaluation.

The real world is not binary

The process of data annotation is at odds with the real world. When labelling data, raters are required to make binary distinctions. However, the vast majority of the data that we deal with lies on a continuous spectrum of possibilities, and it does not fit into the narrow and brittle binary categories with which the raters have to work.

Often, human annotators are blamed for poor-quality data. However, it is rarely they who are at fault. As an example, Lora showed some pictures and asked whether the audience would label them as a guitar or not. Making the distinction is not at all simple, and illustrates that many things do not fit into binary categories.
One of Lora’s slides, showing pictures that may or may not be labelled as a “guitar”.

Truth by disagreement

Lora talked about a research project which concerned distributional truth, where disagreement between raters can be used to provide guidance in data collection task analysis. In their research, Lora and her colleagues looked at difficult-to-label data, and used experts and crowd raters to give their opinions. They found that asking a group of crowd raters was much more informative than asking a single expert. For example, if one statement gets voted “yes” by 95% of the crowd, it is likely to be a much more certain statement than one that gets voted “yes” by 70% of the crowd. With crowd responses, the nuance, or uncertainty, in a particular statement or item can be observed, something that is not possible with a binary answer from one expert.
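To make the 95% versus 70% comparison concrete, here is a minimal sketch of how crowd votes on one item can be summarised as a distribution rather than a single label. The function names, the made-up votes and the use of Shannon entropy as an uncertainty measure are illustrative assumptions, not the method presented in the talk.

```python
from collections import Counter
from math import log2


def vote_distribution(votes):
    """Return the fraction of raters choosing each label."""
    counts = Counter(votes)
    total = len(votes)
    return {label: n / total for label, n in counts.items()}


def entropy_bits(distribution):
    """Shannon entropy in bits; 0 means every rater agreed."""
    return sum(-p * log2(p) for p in distribution.values() if p > 0)


# Made-up votes from 20 hypothetical crowd raters on two statements.
statement_a = ["yes"] * 19 + ["no"] * 1   # 95% "yes"
statement_b = ["yes"] * 14 + ["no"] * 6   # 70% "yes"

for name, votes in [("A", statement_a), ("B", statement_b)]:
    dist = vote_distribution(votes)
    print(name, dist, round(entropy_bits(dist), 3))
```

A single expert would give the same “yes” for both statements; the crowd distribution shows that statement B is the less certain of the two.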

Lora stressed that disagreement between raters is good and, rather than trying to resolve it, researchers need to adapt their tasks to collect more of these examples in order to challenge their systems and evaluation metrics. These disagreements represent the difficult cases that are abundant in the real world. This ambiguity is more prevalent than might be expected, with the team finding that it formed quite a significant part of any large corpus. The key takeaway is that disagreement is a signal for natural diversity and variance in human annotations and should be included in the data we use for training and evaluation.

Safety

Lora and her team next turned their attention to safety and specifically whether raters from different demographics perceive safety differently. This project centred on scrutinising datasets in terms of what they contained and who annotated them.

A number of experiments were performed with generative adversarial conversations, in which there was a high number of raters per item. It was found that raters from different demographics (e.g. age, gender, race, geography) did indeed respond differently when asked to rate conversations for safety. It was only because the team used a larger pool of raters (and specifically considered the demographics) that they were able to spot the differences. Typically, researchers working with data annotators will use three to five raters per item, nowhere near enough to spot demographic effects.
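As a rough illustration of what comparing groups involves, the sketch below computes the share of “unsafe” ratings per rater group for a single conversation. The group names and ratings are made up, and the function is a hypothetical helper rather than the team's analysis.

```python
from collections import defaultdict


def unsafe_rate_by_group(ratings):
    """Share of "unsafe" labels per rater group for one conversation.

    `ratings` is an iterable of (group, label) pairs.
    """
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for group, label in ratings:
        totals[group] += 1
        if label == "unsafe":
            unsafe[group] += 1
    return {group: unsafe[group] / totals[group] for group in totals}


# Made-up ratings for a single conversation from two hypothetical groups.
ratings = (
    [("group_1", "unsafe")] * 12 + [("group_1", "safe")] * 8
    + [("group_2", "unsafe")] * 5 + [("group_2", "safe")] * 15
)

print(unsafe_rate_by_group(ratings))
```

With only three to five raters per item, each group would contribute one or two ratings at most, so per-group shares like these could not be estimated with any confidence.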

In one experiment, a diverse pool of 123 raters was asked to annotate 350 conversations for safety, i.e. by choosing whether a particular statement was “safe” or “unsafe”. In around 20% of the cases, the numbers of “safe” and “unsafe” ratings were very similar; in other words, there was no clear consensus. This percentage of ambiguous cases was found to be a function of the number of raters used. When the number of raters was reduced from 123 to between 20 and 50, the level of ambiguity ranged from 35% to 23%. When fewer than 20 raters were used, the ambiguity was around 40% or higher.
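The dependence of ambiguity on pool size can be explored with a small simulation. The sketch below is built on stated assumptions: the ratings are synthetic (random, not the team's data), and an item is called ambiguous when its “unsafe” share falls within a `margin` of an even split, which is a hypothetical definition chosen for illustration. The numbers it prints are not expected to match those reported in the talk.

```python
import random


def is_ambiguous(labels, margin=0.1):
    """True when the 'unsafe' share (labels are 1 = unsafe, 0 = safe)
    lies within `margin` of an even split."""
    share_unsafe = sum(labels) / len(labels)
    return abs(share_unsafe - 0.5) <= margin


def ambiguity_rate(ratings, n_raters, rng, margin=0.1):
    """Fraction of items that look ambiguous when each item is judged
    by a random subset of `n_raters` of its raters."""
    ambiguous = 0
    for item_labels in ratings:
        sample = rng.sample(item_labels, n_raters)
        if is_ambiguous(sample, margin):
            ambiguous += 1
    return ambiguous / len(ratings)


rng = random.Random(0)
N_ITEMS, N_RATERS = 350, 123  # sizes taken from the experiment described above

# Synthetic ratings: each item gets its own (random) probability of
# being rated unsafe, and every rater votes independently.
ratings = []
for _ in range(N_ITEMS):
    p_unsafe = rng.random()
    ratings.append([1 if rng.random() < p_unsafe else 0 for _ in range(N_RATERS)])

for n in (5, 20, 50, 123):
    print(n, round(ambiguity_rate(ratings, n, rng), 3))
```

Running the same analysis on real annotations would mean replacing the synthetic `ratings` list with one list of 0/1 labels per conversation.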

Diversity of perspectives

Lora closed by highlighting that the preference for binary data could trigger unknown risks associated with the adoption of emerging generative AI capabilities across different cultures and countries. The diversity of human perspectives should be included in model training and development, and ambiguity should be acknowledged as part of AI datasets to ensure the trust, safety and reliability of model outputs.

You can find out more about Lora’s work here.


More on NeurIPS 2023

All of our NeurIPS 2023 coverage can be found here.





Lucy Smith is Senior Managing Editor for AIhub.
©2026.02 - Association for the Understanding of Artificial Intelligence