AIhub.org
 

#NeurIPS2023 invited talk: Lora Aroyo on data quality and diversity


by Lucy Smith
17 January 2024



The thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023) took place in New Orleans towards the end of last year. As part of the programme of invited talks, Lora Aroyo spoke about her work on responsible AI, specifically looking at the data annotation process and what this means for models that use that data.

The rapid progress of AI in recent years has been, in large part, due to the availability of large quantities of data for model training. However, these advancements have left in their wake a trail of problematic model behaviours. Lora’s research is focussed on studying the characteristics of data, such as stereotypes and biases, that affect the models trained on it. In the talk, she presented empirical results from experiments on human-labelled data used for model evaluation and fine-tuning, and adversarial data used for safety evaluation.

The real world is not binary

The process of data annotation is at odds with the real world. When labelling data, raters are required to make binary distinctions. However, this does not reflect the real world, where the vast majority of the data we deal with lies on a continuous spectrum of possibilities. Such data does not fit into the narrow and brittle binary categories with which the raters have to work.

Often, human annotators are blamed for poor-quality data. However, it is rarely they who are at fault. As an example, Lora showed some pictures and asked whether the audience would label them as a guitar or not. Making the distinction is not at all simple, and illustrates that many things do not fit into binary categories.
One of Lora’s slides, showing pictures that may or may not be labelled as a “guitar”.

Truth by disagreement

Lora talked about a research project which concerned distributional truth, where disagreement between raters can be used to provide guidance in data collection task analysis. In their research, Lora and her colleagues looked at difficult-to-label data, and asked both experts and crowd raters for their opinions. They found that asking a group of crowd raters was much more informative than asking a single expert. For example, if one statement gets voted “yes” by 95% of the crowd, it is likely to be a much more certain statement than one that gets voted “yes” by 70% of the crowd. With crowd responses, the nuance, or uncertainty, in a particular statement or item can be observed, something that is not possible with a binary answer from one expert.
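To make the 95% versus 70% comparison concrete, here is a minimal Python sketch of how a set of crowd votes can be turned into a distribution and an uncertainty score. This is an illustration only, not the method used in the research: the vote lists are invented, and the use of binary entropy as the uncertainty measure is an assumption made for this example.

```python
import math

def yes_fraction(votes):
    """Fraction of raters who voted "yes" for one item."""
    return sum(1 for v in votes if v == "yes") / len(votes)

def binary_entropy(p):
    """Uncertainty in bits: 0 when raters are unanimous, 1 at a 50/50 split."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Hypothetical votes from 20 crowd raters on two statements (invented data).
statement_a = ["yes"] * 19 + ["no"] * 1   # 95% "yes"
statement_b = ["yes"] * 14 + ["no"] * 6   # 70% "yes"

for name, votes in [("A", statement_a), ("B", statement_b)]:
    p = yes_fraction(votes)
    print(f"statement {name}: yes={p:.2f}, entropy={binary_entropy(p):.2f} bits")
```

Under this measure, the 95% statement scores roughly 0.29 bits and the 70% statement roughly 0.88 bits, so the second is flagged as considerably less certain. A single expert's “yes” would give the same label for both statements and hide that difference.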

Lora stressed that disagreement between raters is good and, rather than trying to resolve it, researchers need to adapt their tasks to collect more of these examples in order to challenge their systems and evaluation metrics. These disagreements represent the difficult cases that are abundant in the real world. This ambiguity is more prevalent than might be expected, with the team finding that it formed quite a significant part of any large corpus. The key takeaway is that disagreement is a signal for natural diversity and variance in human annotations and should be included in the data we use for training and evaluation.

Safety

Lora and her team next turned their attention to safety and specifically whether raters from different demographics perceive safety differently. This project centred on scrutinising datasets in terms of what they contained and who annotated them.

A number of experiments were performed with generative adversarial conversations, in which there was a high number of raters per item. It was found that raters from different demographics (e.g. age, gender, race, geography) did indeed respond differently when asked to rate conversations for safety. It was only because the team used a larger pool of raters (and specifically considered the demographics) that they were able to spot the differences. Typically, researchers working with data annotators will use three to five raters per item, nowhere near enough to spot demographic effects.
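The sketch below illustrates, in Python, why an aggregate safety score can hide demographic differences. It is a toy example under stated assumptions: the age groups, the twelve ratings and the function name `unsafe_rate_by_group` are all invented for illustration and are not taken from the study.

```python
from collections import defaultdict

# Hypothetical ratings for ONE conversation: (rater's age group, label).
ratings = [
    ("18-24", "unsafe"), ("18-24", "unsafe"), ("18-24", "safe"), ("18-24", "unsafe"),
    ("25-44", "safe"), ("25-44", "unsafe"), ("25-44", "safe"), ("25-44", "safe"),
    ("45+", "safe"), ("45+", "safe"), ("45+", "safe"), ("45+", "unsafe"),
]

def unsafe_rate_by_group(ratings):
    """Share of "unsafe" votes within each demographic group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [unsafe votes, total votes]
    for group, label in ratings:
        counts[group][0] += label == "unsafe"
        counts[group][1] += 1
    return {group: unsafe / total for group, (unsafe, total) in counts.items()}

overall = sum(label == "unsafe" for _, label in ratings) / len(ratings)
print(f"overall unsafe rate: {overall:.2f}")
for group, rate in unsafe_rate_by_group(ratings).items():
    print(f"{group}: {rate:.2f}")
```

In this invented data the pooled rate suggests a mild majority for “safe”, while one group mostly rates the conversation “unsafe”. With only three to five raters per item, each group would contribute one or two votes at most, which is too few to estimate a per-group rate at all.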

In one experiment, a diverse pool of 123 raters was asked to annotate 350 conversations for safety, i.e. by choosing whether a particular statement was “safe” or “unsafe”. In around 20% of the cases, the numbers of “safe” and “unsafe” ratings were very similar; in other words, there was no clear consensus. This percentage of ambiguous cases was found to be a function of the number of raters used. When the number of raters was reduced from 123 to between 20 and 50, the level of ambiguity ranged between 23% and 35%. When fewer than 20 raters were used, the ambiguity was around 40% or higher.
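One way to explore how the share of ambiguous items depends on the number of raters is to subsample a large rater pool, as in the Python sketch below. Everything here is an assumption made for illustration: the data is synthetic (randomly generated, not the study's ratings), the criterion for “ambiguous” (an “unsafe” share within 10 percentage points of 50%) is invented, and the output should not be expected to reproduce the percentages reported in the talk.

```python
import random

def is_ambiguous(labels, band=0.1):
    """Assumed criterion: the "unsafe" share lies within `band` of 50%."""
    share = sum(labels) / len(labels)
    return abs(share - 0.5) <= band

def ambiguity_rate(items, n_raters, rng):
    """Share of items that look ambiguous when only n_raters ratings are sampled per item."""
    flagged = sum(is_ambiguous(rng.sample(labels, n_raters)) for labels in items)
    return flagged / len(items)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data: 350 items x 123 raters, 1 = "unsafe", 0 = "safe".
# Each item gets its own latent probability of being rated "unsafe".
items = []
for _ in range(350):
    p_unsafe = rng.random()
    items.append([1 if rng.random() < p_unsafe else 0 for _ in range(123)])

for n in (5, 20, 50, 123):
    print(f"{n:>3} raters per item: {ambiguity_rate(items, n, rng):.1%} ambiguous")
```

The same subsampling loop could be run on real annotation data by replacing the synthetic `items` list with one list of 0/1 labels per conversation.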

Diversity of perspectives

Lora closed by highlighting that the preference for binary data could trigger unknown risks associated with the adoption of emerging generative AI capabilities across different cultures and countries. The diversity of human perspectives should be included in model training and development, and ambiguity should be acknowledged as part of AI datasets to ensure the trust, safety and reliability of model outputs.

You can find out more about Lora’s work here.


More on NeurIPS 2023

All of our NeurIPS 2023 coverage can be found here.




Lucy Smith is Senior Managing Editor for AIhub.

©2026.02 - Association for the Understanding of Artificial Intelligence