ΑΙhub.org
 

Unsolved ML safety problems


by
26 October 2021



share this:

By Dan Hendrycks

Along with researchers from Google Brain and OpenAI, we are releasing a paper on Unsolved Problems in ML Safety. Due to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the technical problems that the field needs to address. As a preview of the paper, in this post we consider a subset of the paper’s directions, namely withstanding hazards (“Robustness”), identifying hazards (“Monitoring”), and steering ML systems (“Alignment”).

Robustness

Robustness research aims to build systems that are less vulnerable to extreme hazards and to adversarial threats. Two problems in robustness are robustness to long tails and robustness to adversarial examples.

Long tails

Examples of long tail events. First row, left: an ambulance in front of a green light. First row, middle: birds on the road. First row, right: a reflection of a pedestrian. Bottom row, left: a group of people cosplaying. Bottom row, middle: a foggy road. Bottom row, right: a person partly occluded by a board on their back. (Source)

To operate in open-world high-stakes environments, ML systems will need to endure unusual events and tail risks. However, current ML systems are brittle in the face of real-world complexity, and they become unreliable when faced with novel situations like those above. To reduce the chance that ML systems choose a wrong course of action in environments dominated by rare events, models need to be unusually robust.

Adversarial examples

Adversarial perturbations. An example of an input image altered by an adversarial perturbation. After the adversarial perturbation, the neural network makes a high confidence mistake. (Source)

Adversaries can easily manipulate vulnerabilities in ML systems and cause them to make mistakes. As shown above, carefully crafted small perturbations are enough to break ML systems. In the paper, we focus on this problem but also suggest that researchers consider more realistic settings, like when attackers can create perceptible images or when attack specifications are not known beforehand.

Monitoring

Monitoring research aims to create tools and features that help human operators identify hazards and inspect ML systems. Two problems in monitoring are anomaly detection and backdoor detection. This list is nonexhaustive, and we include other problems in the paper including calibration, honest outputs, and detecting emergent capabilities.

Anomaly detection

Anomaly detection. On the left is a usual image which belongs to an ImageNet class, so the ImageNet classifier knows how to handle the image. On the right is an anomalous image which does not belong to any ImageNet class. Nonetheless, the model classifies the image with high confidence.

Anomaly detectors can warn human operators of potential hazards, and this can help them reduce their exposure to hazards. For example, anomaly detectors can help detect malicious uses of ML systems or flag novel examples for human review. However, deep learning-based anomaly detectors are not highly reliable, as shown in the figure above.

Backdoors

Backdoors. Depicted is a backdoored facial recognition system that gates building access. The backdoor could be triggered by a specific unique item chosen by an adversary, such as a pair of glasses. If the adversary wears that specific pair of glasses, the backdoored facial recognition will allow the adversary in the building. (Source)

ML systems risk carrying backdoors. Backdoored models behave correctly and benignly in almost all scenarios, but in particular circumstances chosen by the adversary, they have been taught to behave incorrectly. Models trained on massive datasets scraped from online are increasingly likely to be trained on poisoned data and thereby have backdoors injected. Moreover, downstream models are increasingly obtained by a single upstream foundation model, so a single backdoored system could render backdoors commonplace.

Alignment

Alignment research aims to create safe ML system objectives and have them safely pursued. Two problems in alignment are value learning and proxy gaming, but the paper includes many additional problems.

Value learning

Estimating human values such as pleasantness. Transformer models can partially separate between pleasant and unpleasant states given diverse open-world inputs. Utility values or pleasantness values are not ground truth values and are products of the model’s own learned utility function. (Source)

Encoding human goals and intent is challenging because many human values are hard to define and measure. How can we teach ML systems to model happiness, good judgment, freedom of action, meaningful experiences, safe outcomes, and more? In the figure above, we show that models are starting to have traction on the problem, but they nonetheless make many mistakes and can only process simple inputs. More research is needed to learn reliable representations for happiness and other human values.

Proxy gaming

Boatrace Proxy Gaming. An RL agent gained a high score not by finishing the race but by going in the wrong direction, catching on fire, and colliding into other boats. (Source)

Objective proxies can be gamed by optimizers and adversaries. In fact, Goodhart’s law asserts that “When a measure becomes a target, it ceases to be a good measure.” This means that we cannot just learn a proxy for human values—we must also make it robust to optimizers that are incentivized to game the proxy. An example of a reward maximizing agent gaming a video game proxy is in the figure above.


In the full paper, we describe several more problems, clarify each problem’s motivation, and provide concrete research directions. Check out the paper here.


This article was initially published on the BAIR blog, and appears here with the authors’ permission.



tags:


BAIR blog

            AIhub is supported by:



Subscribe to AIhub newsletter on substack



Related posts :

Interview with Xinwei Song: strategic interactions in networked multi-agent systems

  16 Apr 2026
Xinwei Song tells us about her research using algorithmic game theory and multi-agent reinforcement learning.

2026 AI Index Report released

  15 Apr 2026
Find out what the ninth edition of the report, which was published on 13 April, says about trends in AI.

Formal verification for safety evaluation of autonomous vehicles: an interview with Abdelrahman Sayed Sayed

  14 Apr 2026
Find out more about work at the intersection of continuous AI models, formal methods, and autonomous systems.

Water flow in prairie watersheds is increasingly unpredictable — but AI could help

  13 Apr 2026
In recent years, the Prairies have seen bigger swings in climate conditions — very wet years followed by very dry ones.

Identifying interactions at scale for LLMs

  10 Apr 2026
Model behavior is rarely the result of isolated components; rather, it emerges from complex dependencies and patterns.

Interview with Sukanya Mandal: Synthesizing multi-modal knowledge graphs for smart city intelligence

  09 Apr 2026
A modular four-stage framework that draws on LLMs to automate synthetic multi-modal knowledge graphs.

Emergence of fragility in LLM-based social networks: an interview with Francesco Bertolotti

  08 Apr 2026
Francesco tells us how LLMs behave in the social network Moltbook, and what this reveals about network dynamics.

Scaling up multi-agent systems: an interview with Minghong Geng

  07 Apr 2026
We sat down with Minghong in the latest of our interviews with the 2026 AAAI/SIGAI Doctoral Consortium participants.



AIhub is supported by:







Subscribe to AIhub newsletter on substack




 















©2026.02 - Association for the Understanding of Artificial Intelligence