AIhub.org
 

#NeurIPS2023 invited talk: Lora Aroyo on data quality and diversity

by Lucy Smith
17 January 2024



The thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023) took place in New Orleans towards the end of last year. As part of the programme of invited talks, Lora Aroyo spoke about her work on responsible AI, specifically looking at the data annotation process and what this means for models that use that data.

The rapid progress of AI in recent years has been due, in large part, to the availability of large quantities of data for model training. However, these advancements have left in their wake a trail of problematic model behaviours. Lora’s research focusses on the characteristics of data, such as stereotypes and biases, that affect the models trained on it. In the talk, she presented empirical results from experiments on human-labelled data used for model evaluation and fine-tuning, and on adversarial data used for safety evaluation.

The real world is not binary

The process of data annotation is at odds with the real world. When labelling data, raters are required to make binary distinctions. However, the vast majority of the data that we deal with falls on a continuous spectrum of possibilities, and so does not fit into the narrow and brittle binary categories with which the raters have to work.

Often, human annotators are blamed for poor-quality data. However, it is rarely they who are at fault. As an example, Lora showed some pictures and asked whether the audience would label them as a guitar or not. Making the distinction is not at all simple, and illustrates that many things do not fit into binary categories.
One of Lora’s slides, showing pictures that may or may not be labelled as a “guitar”.

Truth by disagreement

Lora talked about a research project on distributional truth, in which disagreement between raters is used to guide the analysis of data collection tasks. In their research, Lora and her colleagues looked at difficult-to-label data, and asked both experts and crowd raters for their opinions. They found that asking a group of crowd raters was much more informative than asking a single expert. For example, if one statement gets voted “yes” by 95% of the crowd, it is likely to be a much more certain statement than one that gets voted “yes” by 70% of the crowd. With crowd responses, the nuance, or uncertainty, in a particular statement or item can be observed, something that is not possible with a binary answer from one expert.
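As a rough illustration of the idea (not the method used in the research), the crowd's vote share can be kept as a soft label, and its entropy used as one possible measure of how contested an item is. The function names and the example votes below are hypothetical.

```python
import math

def vote_share(votes):
    """Fraction of "yes" votes among a list of booleans from crowd raters."""
    if not votes:
        raise ValueError("need at least one vote")
    return sum(votes) / len(votes)

def binary_entropy(p):
    """Shannon entropy (in bits) of a yes/no distribution with P(yes) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Hypothetical items with 20 crowd votes each (True = "yes").
clear_item = [True] * 19 + [False] * 1      # 95% "yes"
contested_item = [True] * 14 + [False] * 6  # 70% "yes"

for name, votes in [("clear", clear_item), ("contested", contested_item)]:
    p = vote_share(votes)
    print(f"{name}: share={p:.2f}, entropy={binary_entropy(p):.3f} bits")
```

A single expert's binary answer would map both items to “yes”; the vote share keeps the difference between them, and the entropy is higher for the 70% item than for the 95% item.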

Lora stressed that disagreement between raters is good and, rather than trying to resolve it, researchers need to adapt their tasks to collect more of these examples in order to challenge their systems and evaluation metrics. These disagreements represent the difficult cases that are abundant in the real world. This ambiguity is more prevalent than might be expected, with the team finding that it formed quite a significant part of any large corpus. The key takeaway is that disagreement is a signal for natural diversity and variance in human annotations and should be included in the data we use for training and evaluation.

Safety

Lora and her team next turned their attention to safety and specifically whether raters from different demographics perceive safety differently. This project centred on scrutinising datasets in terms of what they contained and who annotated them.

A number of experiments were performed with generative adversarial conversations, with a high number of raters per item. It was found that raters from different demographics (e.g. age, gender, race, geography) did indeed respond differently when asked to rate conversations for safety. It was only because the team used a larger pool of raters (and specifically considered the demographics) that they were able to spot the differences. Typically, researchers working with data annotators will use three to five raters per item, nowhere near enough to spot demographic effects.
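To make the point about pooled ratings concrete, here is a minimal sketch that breaks one conversation's ratings down by rater group. The group names and the ratings are invented for illustration, and the talk did not describe its analysis in this form.

```python
from collections import defaultdict

def unsafe_rate_by_group(ratings):
    """Map each demographic group to its fraction of "unsafe" ratings.

    `ratings` is a list of (group, label) pairs for a single conversation,
    where label is "safe" or "unsafe".
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [unsafe votes, total votes]
    for group, label in ratings:
        counts[group][0] += label == "unsafe"
        counts[group][1] += 1
    return {g: unsafe / total for g, (unsafe, total) in counts.items()}

# Hypothetical ratings for one conversation; the group names are placeholders.
ratings = (
    [("group_a", "unsafe")] * 8 + [("group_a", "safe")] * 2
    + [("group_b", "unsafe")] * 3 + [("group_b", "safe")] * 7
)

rates = unsafe_rate_by_group(ratings)
overall = sum(label == "unsafe" for _, label in ratings) / len(ratings)
print(rates)    # per-group view
print(overall)  # pooled view hides the split between the groups
```

In this made-up example the pooled rate sits close to an even split, while the two groups lean in opposite directions. With only three to five raters per item, each group would contribute too few ratings for such a breakdown to be meaningful.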

In one experiment, a diverse pool of 123 raters was asked to annotate 350 conversations for safety, i.e. by choosing whether a particular statement was “safe” or “unsafe”. In around 20% of the cases, the numbers of “safe” and “unsafe” ratings were very similar; in other words, there was no clear consensus. This percentage of ambiguous cases was found to be a function of the number of raters used. When the number of raters was reduced from 123 to between 20 and 50, the level of ambiguity ranged from 35% down to 23%. When fewer than 20 raters were used, the level of ambiguity was around 40% or more.
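One way to measure this kind of ambiguity is sketched below. The article does not say how “very similar” was defined, so the `margin` threshold is an assumption, as are the function names and the toy data. The sketch does not attempt to reproduce the reported percentages.

```python
import random

def is_ambiguous(labels, margin=0.1):
    """True if the "unsafe" share is within `margin` of an even 50/50 split."""
    share = sum(label == "unsafe" for label in labels) / len(labels)
    return abs(share - 0.5) <= margin

def ambiguous_fraction(items, n_raters=None, margin=0.1, seed=0):
    """Fraction of items with no clear consensus.

    If `n_raters` is given, each item is first reduced to a random sample of
    that many ratings, to mimic a smaller rater pool.
    """
    rng = random.Random(seed)
    flagged = 0
    for labels in items:
        if n_raters is not None:
            labels = rng.sample(labels, n_raters)
        flagged += is_ambiguous(labels, margin)
    return flagged / len(items)

# Hypothetical data: four conversations with 20 ratings each.
items = [
    ["unsafe"] * 10 + ["safe"] * 10,  # even split
    ["unsafe"] * 19 + ["safe"] * 1,   # clear "unsafe"
    ["safe"] * 20,                    # clear "safe"
    ["unsafe"] * 9 + ["safe"] * 11,   # near-even split
]

print(ambiguous_fraction(items))              # full rater pool
print(ambiguous_fraction(items, n_raters=5))  # simulated smaller pool
```

Running `ambiguous_fraction` over a range of `n_raters` values on a real rating matrix would show how the measured ambiguity depends on pool size for a given choice of `margin`.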

Diversity of perspectives

Lora closed by highlighting that the preference for binary data could trigger unknown risks associated with the adoption of emerging generative AI capabilities across different cultures and countries. The diversity of human perspectives should be included in model training and development, and ambiguity should be acknowledged as part of AI datasets to ensure the trust, safety and reliability of model outputs.

You can find out more about Lora’s work here.


More on NeurIPS 2023

All of our NeurIPS 2023 coverage can be found here.





Lucy Smith, Managing Editor for AIhub.









©2024 - Association for the Understanding of Artificial Intelligence


 











