 

#NeurIPS2023 invited talk: Lora Aroyo on data quality and diversity

by Lucy Smith
17 January 2024




The thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023) took place in New Orleans towards the end of last year. As part of the programme of invited talks, Lora Aroyo spoke about her work on responsible AI, specifically looking at the data annotation process and what this means for models that use that data.

The rapid progress of AI in recent years has been, in large part, due to the availability of large quantities of data for model training. However, these advancements have left in their wake a trail of problematic model behaviours. Lora’s research focuses on studying the characteristics of data, such as stereotypes and biases, that impact the models trained on it. In the talk, she presented empirical results from experiments on human-labelled data used for model evaluation and fine-tuning, and on adversarial data used for safety evaluation.

The real world is not binary

The process of data annotation is at odds with the real world. When labelling data, raters are required to make binary distinctions. However, this does not reflect the real world, where the vast majority of data that we deal with is a continuous spectrum of possibilities. This doesn’t fit into the narrow and brittle binary categories with which the raters have to work.

Often, human annotators are blamed for poor-quality data. However, it is rarely they who are at fault. As an example, Lora showed some pictures and asked whether the audience would label them as a guitar or not. Making the distinction is not at all simple, and illustrates that many things do not fit into binary categories.
One of Lora’s slides, showing pictures that may or may not be labelled as a “guitar”.

Truth by disagreement

Lora talked about a research project concerning distributional truth, where disagreement between raters can be used to guide the analysis of data collection tasks. In their research, Lora and her colleagues looked at difficult-to-label data, and asked both experts and crowd raters to give their opinions. They found that asking a group of crowd raters was much more informative than asking a single expert. For example, if one statement gets voted “yes” by 95% of the crowd, it is likely to be a much more certain statement than one that gets voted “yes” by 70% of the crowd. With crowd responses, the nuance, or uncertainty, in a particular statement or item can be observed, something that is not possible with a binary answer from one expert.
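
As a rough illustration of why a vote distribution carries more information than a single binary label, the sketch below (not code from the talk; the vote counts are made up) scores an item by its “yes” fraction and the entropy of the vote split, so a 95% “yes” item registers as more certain than a 70% one.

```python
from collections import Counter
import math

def vote_distribution(votes):
    """Return the fraction of 'yes' votes and the entropy of the split.

    votes: list of binary labels ("yes"/"no") from crowd raters.
    Higher entropy means more disagreement, i.e. a more ambiguous item.
    """
    counts = Counter(votes)
    total = len(votes)
    p_yes = counts["yes"] / total
    entropy = -sum(p * math.log2(p) for p in (p_yes, 1 - p_yes) if p > 0)
    return p_yes, entropy

# Made-up items: one voted "yes" by 95% of raters, one by 70%.
clear_item = ["yes"] * 95 + ["no"] * 5
borderline_item = ["yes"] * 70 + ["no"] * 30

print(vote_distribution(clear_item))       # (0.95, ~0.29 bits of uncertainty)
print(vote_distribution(borderline_item))  # (0.7, ~0.88 bits of uncertainty)
```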

Lora stressed that disagreement between raters is good and, rather than trying to resolve it, researchers need to adapt their tasks to collect more of these examples in order to challenge their systems and evaluation metrics. These disagreements represent the difficult cases that are abundant in the real world. This ambiguity is more prevalent than might be expected, with the team finding that it formed quite a significant part of any large corpus. The key takeaway is that disagreement is a signal for natural diversity and variance in human annotations and should be included in the data we use for training and evaluation.
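
One illustrative way to carry that disagreement into training, rather than discarding it, is to keep the full vote distribution as a soft label instead of collapsing it to the majority vote. The sketch below is my own minimal example under that assumption, not a method presented in the talk; the vote counts and model predictions are invented.

```python
import numpy as np

def soft_labels(yes_votes, total_votes):
    """Turn raw vote counts into a [p_no, p_yes] soft target per item."""
    p_yes = np.asarray(yes_votes) / np.asarray(total_votes)
    return np.stack([1 - p_yes, p_yes], axis=1)

def cross_entropy(pred_probs, targets, eps=1e-12):
    """Mean cross-entropy between model probabilities and (soft) targets."""
    return float(-np.mean(np.sum(targets * np.log(pred_probs + eps), axis=1)))

# Three hypothetical items rated by 100 crowd raters each:
# clear (95 "yes"), moderate (70 "yes") and ambiguous (52 "yes").
targets_soft = soft_labels([95, 70, 52], [100, 100, 100])
targets_hard = np.eye(2)[np.argmax(targets_soft, axis=1)]  # majority vote only

pred = np.array([[0.1, 0.9], [0.3, 0.7], [0.5, 0.5]])
print("loss vs soft labels:", cross_entropy(pred, targets_soft))
print("loss vs hard labels:", cross_entropy(pred, targets_hard))
```

The soft targets reward a model for being appropriately uncertain on the 52/48 item, whereas the majority-vote targets penalise exactly that behaviour.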

Safety

Lora and her team next turned their attention to safety and specifically whether raters from different demographics perceive safety differently. This project centred on scrutinising datasets in terms of what they contained and who annotated them.

A number of experiments were performed with generative adversarial conversations, in which there was a high number of raters per item. It was found that raters from different demographics (e.g. age, gender, race, geography) did indeed respond differently when asked to rate conversations for safety. It was only because the team used a larger pool of raters (and specifically considered the demographics) that they were able to spot the differences. Typically, researchers working with data annotators will use three to five raters per item, nowhere near enough to detect demographic effects.

In one experiment, a diverse pool of 123 raters was asked to annotate 350 conversations for safety, i.e. by choosing whether a particular statement was “safe” or “unsafe”. In around 20% of the cases, the numbers of “safe” and “unsafe” ratings were very similar; in other words, there was no clear consensus. This percentage of ambiguous cases was found to be a function of the number of raters used. When the number of raters was reduced from 123 to between 20 and 50, the level of ambiguity rose to between 23% and 35%, and when fewer than 20 raters were used, the ambiguity was upwards of around 40%.
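
To make the measurement concrete, here is a hypothetical sketch of how one might compute the fraction of ambiguous items from a ratings matrix and re-estimate it with subsampled rater pools. The synthetic data, the “near 50/50” margin and the pool sizes are illustrative assumptions, not the team’s actual analysis or numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ratings matrix: 1 = "unsafe", 0 = "safe".
# 350 conversations x 123 raters, generated from made-up per-item probabilities
# purely so the sketch runs end to end; real ratings would be loaded instead.
n_items, n_raters = 350, 123
p_unsafe = rng.beta(0.5, 0.5, size=n_items)
ratings = rng.binomial(1, p_unsafe[:, None], size=(n_items, n_raters))

def ambiguous_fraction(ratings, margin=0.1):
    """Fraction of items whose 'unsafe' vote share is within `margin` of 50%."""
    unsafe_share = ratings.mean(axis=1)
    return float(np.mean(np.abs(unsafe_share - 0.5) <= margin))

# Full pool of 123 raters.
print(f"123 raters: {ambiguous_fraction(ratings):.0%} ambiguous")

# Re-estimate with smaller, randomly subsampled rater pools.
for k in (50, 20, 10):
    subset = rng.choice(n_raters, size=k, replace=False)
    print(f"{k:3d} raters: {ambiguous_fraction(ratings[:, subset]):.0%} ambiguous")
```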

Diversity of perspectives

Lora closed by highlighting that the preference for binary data could trigger unknown risks associated with the adoption of emerging generative AI capabilities across different cultures and countries. The diversity of human perspectives should be included in model training and development, and ambiguity should be acknowledged as part of AI datasets to ensure the trust, safety and reliability of model outputs.

You can find out more about Lora’s work here.


More on NeurIPS 2023

All of our NeurIPS 2023 coverage can be found here.





Lucy Smith, Managing Editor for AIhub.



