Unsolved ML safety problems


26 October 2021




By Dan Hendrycks

Along with researchers from Google Brain and OpenAI, we are releasing a paper on Unsolved Problems in ML Safety. Due to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the technical problems that the field needs to address. As a preview of the paper, in this post we consider a subset of the paper’s directions, namely withstanding hazards (“Robustness”), identifying hazards (“Monitoring”), and steering ML systems (“Alignment”).

Robustness

Robustness research aims to build systems that are less vulnerable to extreme hazards and to adversarial threats. Two problems in robustness are robustness to long tails and robustness to adversarial examples.

Long tails

Examples of long tail events. Top row, left: an ambulance in front of a green light. Top row, middle: birds on the road. Top row, right: a reflection of a pedestrian. Bottom row, left: a group of people cosplaying. Bottom row, middle: a foggy road. Bottom row, right: a person partly occluded by a board on their back. (Source)

To operate in open-world high-stakes environments, ML systems will need to endure unusual events and tail risks. However, current ML systems are brittle in the face of real-world complexity, and they become unreliable when faced with novel situations like those above. To reduce the chance that ML systems choose the wrong course of action in environments dominated by rare events, models need to be unusually robust.

Adversarial examples

Adversarial perturbations. An example of an input image altered by an adversarial perturbation. After the adversarial perturbation, the neural network makes a high confidence mistake. (Source)

Adversaries can easily exploit vulnerabilities in ML systems and cause them to make mistakes. As shown above, carefully crafted small perturbations are enough to break ML systems. In the paper, we focus on this problem but also suggest that researchers consider more realistic settings, such as when attackers can craft perceptible perturbations or when attack specifications are not known beforehand.
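The standard way to craft such a perturbation is to take a small step in the direction of the sign of the loss gradient with respect to the input (the Fast Gradient Sign Method of Goodfellow et al.). A minimal sketch, using a toy logistic-regression "model" in numpy rather than a real image classifier; the weights and inputs here are made up for illustration:

```python
import numpy as np

def fgsm_perturb(x, w, b, y, eps):
    """Fast Gradient Sign Method against a logistic-regression model.

    x: input vector, w/b: model parameters, y: true label in {0, 1},
    eps: L-infinity perturbation budget.
    """
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # predicted P(y = 1)
    grad_x = (p - y) * w                    # gradient of cross-entropy loss w.r.t. x
    return x + eps * np.sign(grad_x)        # step that increases the loss

# Toy model that classifies x correctly before the attack.
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 0.5])                    # w @ x + b = 1.5 > 0, so class 1
x_adv = fgsm_perturb(x, w, b, y=1, eps=1.0)
# The signed perturbation pushes the score below zero, flipping the prediction.
```

Against deep networks, the same signed-gradient step (with a much smaller eps) produces the visually imperceptible perturbations shown in the figure.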

Monitoring

Monitoring research aims to create tools that help human operators identify hazards and inspect ML systems. Two problems in monitoring are anomaly detection and backdoor detection. This list is non-exhaustive; the paper covers further problems, including calibration, honest outputs, and detecting emergent capabilities.

Anomaly detection

Anomaly detection. On the left is a typical image that belongs to an ImageNet class, so the ImageNet classifier knows how to handle it. On the right is an anomalous image that does not belong to any ImageNet class; nonetheless, the model classifies it with high confidence.

Anomaly detectors can warn human operators of potential hazards and reduce their exposure to them. For example, anomaly detectors can help detect malicious uses of ML systems or flag novel examples for human review. However, deep learning-based anomaly detectors are not highly reliable, as shown in the figure above.
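A common baseline detector simply thresholds the classifier's own confidence: if the maximum softmax probability over the classes is low, the input is flagged as anomalous. A minimal sketch in numpy, with made-up logits standing in for a real classifier's outputs; the threshold of 0.5 is arbitrary and would be tuned on validation data in practice:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def is_anomalous(logits, threshold=0.5):
    """Flag an input as out-of-distribution when the classifier's
    maximum softmax probability falls below a confidence threshold."""
    return softmax(logits).max() < threshold

in_dist_logits = np.array([4.0, 0.5, 0.2])  # one class dominates: confident
ood_logits = np.array([1.1, 1.0, 0.9])      # near-uniform: low confidence
```

The figure above illustrates why this baseline is insufficient on its own: deep networks often assign high confidence to anomalous inputs, which is what motivates further anomaly detection research.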

Backdoors

Backdoors. Depicted is a backdoored facial recognition system that gates building access. The backdoor could be triggered by a specific unique item chosen by an adversary, such as a pair of glasses. If the adversary wears that specific pair of glasses, the backdoored facial recognition system will let the adversary into the building. (Source)

ML systems risk carrying backdoors. Backdoored models behave correctly and benignly in almost all scenarios, but in particular circumstances chosen by the adversary, they have been taught to behave incorrectly. Models trained on massive datasets scraped from the web are increasingly likely to be trained on poisoned data and thereby have backdoors injected. Moreover, downstream models are increasingly derived from a single upstream foundation model, so a single backdoored model could render backdoors commonplace.
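A simple way a backdoor can be planted is by poisoning a small fraction of the training set: the adversary stamps a trigger pattern onto some images and relabels them as a target class, so any model trained on the data learns to associate the trigger with that class. A toy sketch with random arrays standing in for images; the trigger pattern, poisoning rate, and target label are illustrative choices, not a specific published attack:

```python
import numpy as np

TRIGGER_VALUE = 1.0  # a bright 2x2 patch the adversary controls

def poison(images, labels, target_label, rate, rng):
    """Inject a backdoor: stamp a trigger patch onto a fraction of the
    training images and relabel them as the adversary's target class."""
    images = images.copy()
    labels = labels.copy()
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    images[idx, -2:, -2:] = TRIGGER_VALUE  # trigger patch in the corner
    labels[idx] = target_label
    return images, labels, idx

rng = np.random.default_rng(0)
clean = rng.random((100, 8, 8))            # 100 fake 8x8 "images"
y = rng.integers(0, 10, size=100)
poisoned, y_p, idx = poison(clean, y, target_label=7, rate=0.05, rng=rng)
```

Because only 5% of the examples are altered, a model trained on the poisoned set can still score well on clean validation data, which is what makes backdoors hard to notice.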

Alignment

Alignment research aims to build ML systems with safe objectives and to ensure those objectives are pursued safely. Two problems in alignment are value learning and proxy gaming, but the paper includes many additional problems.

Value learning

Estimating human values such as pleasantness. Transformer models can partially separate pleasant from unpleasant states given diverse open-world inputs. The utility or pleasantness values shown are not ground truth; they are outputs of the model's own learned utility function. (Source)

Encoding human goals and intent is challenging because many human values are hard to define and measure. How can we teach ML systems to model happiness, good judgment, freedom of action, meaningful experiences, safe outcomes, and more? In the figure above, we show that models are starting to gain traction on the problem, but they nonetheless make many mistakes and can only process simple inputs. More research is needed to learn reliable representations for happiness and other human values.
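At its simplest, value learning can be framed as supervised learning: fit a utility model to human judgments of scenarios, then use it to score new situations. A deliberately tiny sketch, with invented scenario features and ratings standing in for the diverse text inputs and transformer models discussed above:

```python
import numpy as np

# Hypothetical scenario features and human pleasantness ratings in [0, 1].
# Feature columns: [social contact, physical discomfort, novelty]
X = np.array([
    [1.0, 0.0, 0.5],   # "dinner with friends"
    [0.0, 1.0, 0.1],   # "stubbed toe"
    [0.8, 0.2, 0.9],   # "first day at a new job"
    [0.1, 0.9, 0.0],   # "long wait in the cold"
])
ratings = np.array([0.9, 0.1, 0.7, 0.2])

# Fit a linear utility model u(x) = x @ w by least squares.
w, *_ = np.linalg.lstsq(X, ratings, rcond=None)

def utility(features):
    """Learned (toy) utility estimate for a scenario's feature vector."""
    return float(features @ w)
```

Real value learning replaces the hand-built features with learned representations of open-world inputs; the hard part, as the text notes, is that values like "meaningful experiences" resist being defined and measured well enough to supervise on.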

Proxy gaming

Boatrace proxy gaming. An RL agent gained a high score not by finishing the race but by going in the wrong direction, catching on fire, and colliding with other boats. (Source)

Objective proxies can be gamed by optimizers and adversaries. Indeed, Goodhart’s law asserts that “when a measure becomes a target, it ceases to be a good measure.” This means that we cannot just learn a proxy for human values—we must also make it robust to optimizers that are incentivized to game the proxy. An example of a reward-maximizing agent gaming a video game proxy is in the figure above.
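The boat race failure can be distilled into a toy example: the true objective is finishing the race, but the measurable proxy is the score, which can also be earned from collectible items along the course. Optimizing the proxy then selects a degenerate behavior. The policies and reward numbers below are invented for illustration:

```python
def true_reward(laps, trinkets):
    """What we actually care about: completing the race."""
    return laps

def proxy_reward(laps, trinkets):
    """A measurable stand-in: points, which also come from trinkets
    scattered along the course (the gameable part of the proxy)."""
    return 10 * laps + 5 * trinkets

# Candidate behaviors: (laps completed, trinkets collected)
policies = {
    "finish_race": (3, 2),
    "loop_trinkets": (0, 20),  # circle the trinket respawn point forever
}

best_by_proxy = max(policies, key=lambda p: proxy_reward(*policies[p]))
best_by_true = max(policies, key=lambda p: true_reward(*policies[p]))
```

Optimizing the proxy picks "loop_trinkets" (100 points versus 40) even though it scores zero on the true objective, which is exactly the failure mode Goodhart's law warns about.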


In the full paper, we describe several more problems, clarify each problem’s motivation, and provide concrete research directions. Check out the paper here.


This article was initially published on the BAIR blog, and appears here with the authors’ permission.


