The thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023) took place in New Orleans towards the end of last year. As part of the programme of invited talks, Lora Aroyo spoke about her work on responsible AI, specifically looking at the data annotation process and what this means for models that use that data.
The rapid progress of AI in recent years has been, in large part, due to the availability of large quantities of data for model training. However, these advancements have left in their wake a trail of problematic model behaviours. Lora’s research is focussed on studying the characteristics of data, such as stereotypes and biases, that impact on the models. In the talk, she presented empirical results from experiments on human-labelled data used for model evaluation and fine-tuning, and adversarial data used for safety evaluation.
The process of data annotation is at odds with the real world. When labelling data, raters are required to make binary distinctions. However, the vast majority of the data that we deal with lies on a continuous spectrum of possibilities, and does not fit into the narrow and brittle binary categories with which the raters have to work.
Often, human annotators are blamed for poor-quality data. However, it is rarely they who are at fault. As an example, Lora showed some pictures and asked the audience whether they would label each one as a guitar or not. Making the distinction is not at all simple, which illustrates that many things do not fit into binary categories.
One of Lora’s slides, showing pictures that may or may not be labelled as a “guitar”.
Lora talked about a research project which concerned distributional truth, where disagreement between raters can be used to provide guidance in data collection task analysis. In their research, Lora and her colleagues looked at difficult-to-label data, and used experts and crowd raters to give their opinions. They found that asking a group of crowd raters was much more informative than asking a single expert. For example, if one statement gets voted “yes” by 95% of the crowd, it is likely to be a much more certain statement than one that gets voted “yes” by 70% of the crowd. With crowd responses the nuance, or uncertainty, in a particular statement or item can be observed, something that is not possible with a binary answer from one expert.
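To make the 95% versus 70% comparison concrete, here is a minimal sketch of how crowd votes on one item can be summarised as a distribution rather than a single label. The vote lists and the function names `yes_share` and `binary_entropy` are hypothetical, and using entropy as the uncertainty measure is an assumption made for illustration, not necessarily the measure used in the research described in the talk.

```python
import math

def yes_share(votes):
    """Fraction of raters who answered "yes" (True) for one item."""
    return sum(votes) / len(votes)

def binary_entropy(p):
    """Entropy in bits of a yes/no split: 0 is unanimous, 1 is an even split."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Hypothetical votes from 20 crowd raters on two statements (illustrative only).
votes_a = [True] * 19 + [False] * 1   # 95% "yes"
votes_b = [True] * 14 + [False] * 6   # 70% "yes"

for name, votes in [("A", votes_a), ("B", votes_b)]:
    p = yes_share(votes)
    print(f"statement {name}: yes share = {p:.2f}, "
          f"entropy = {binary_entropy(p):.2f} bits")
```

A single expert would return "yes" for both statements, whereas the vote share (and the higher entropy for statement B) keeps the information that B is the more contested item.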
Lora stressed that disagreement between raters is good and, rather than trying to resolve it, researchers need to adapt their tasks to collect more of these examples in order to challenge their systems and evaluation metrics. These disagreements represent the difficult cases that are abundant in the real world. This ambiguity is more prevalent than might be expected, with the team finding that it formed quite a significant part of any large corpus. The key takeaway is that disagreement is a signal for natural diversity and variance in human annotations and should be included in the data we use for training and evaluation.
Lora and her team next turned their attention to safety and specifically whether raters from different demographics perceive safety differently. This project centred on scrutinising datasets in terms of what they contained and who annotated them.
A number of experiments were performed with generative adversarial conversations, in which there was a high number of raters per item. It was found that raters from different demographics (e.g. age, gender, race, geography) did indeed respond differently when asked to rate conversations for safety. It was only because the team used a larger pool of raters (and specifically considered the demographics) that they were able to spot the differences. Typically, researchers working with data annotators will use three to five raters per item, nowhere near enough to spot demographic effects.
In one experiment, a diverse pool of 123 raters was asked to annotate 350 conversations for safety, i.e. by choosing whether a particular statement was “safe” or “unsafe”. In around 20% of the cases, the numbers of ratings given for “safe” and “unsafe” were very similar; in other words, there was no clear consensus. This percentage of ambiguous cases was found to be a function of the number of raters used. When the number of raters was reduced from 123 to between 20 and 50, the level of ambiguity ranged between 23% and 35%. When fewer than 20 raters were used, the level of ambiguity was around 40% or higher.
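The kind of analysis described above, measuring how the share of ambiguous items changes with the number of raters, can be sketched as a subsampling experiment. Everything in this sketch is an assumption for illustration: the ratings are synthetic (not the team's data), the function name `ambiguous_fraction` is hypothetical, and an item is counted as ambiguous when its share of "unsafe" votes falls between 0.4 and 0.6, a threshold chosen here because the talk's exact definition of "very similar" is not given. The numbers it prints are not expected to match those reported in the talk.

```python
import random

def ambiguous_fraction(ratings, k, rng, low=0.4, high=0.6):
    """Share of items whose "unsafe" vote share lies in [low, high]
    when only k randomly chosen ratings per item are kept."""
    count = 0
    for item_votes in ratings:
        sample = rng.sample(item_votes, k)   # k ratings, without replacement
        share = sum(sample) / k
        if low <= share <= high:
            count += 1
    return count / len(ratings)

# Synthetic stand-in for a 350-item, 123-rater dataset (True = "unsafe").
rng = random.Random(0)
n_items, n_raters = 350, 123
ratings = []
for _ in range(n_items):
    p_unsafe = rng.random()                  # assumed per-item tendency
    ratings.append([rng.random() < p_unsafe for _ in range(n_raters)])

for k in (5, 20, 50, 123):
    print(k, round(ambiguous_fraction(ratings, k, rng), 3))
```

Note that this sketch draws a fresh subset of ratings for every item; a study interested in demographic effects would more likely subsample whole raters so that group membership stays consistent across items.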
Lora closed by highlighting that the preference for binary data could trigger unknown risks associated with the adoption of emerging generative AI capabilities across different cultures and countries. The diversity of human perspectives should be included in model training and development, and ambiguity should be acknowledged as part of AI datasets to ensure the trust, safety and reliability of model outputs.
You can find out more about Lora’s work here.
All of our NeurIPS 2023 coverage can be found here.