about

resources

events

contribute

republishing

☰

ΑΙhub.org

Unsolved ML safety problems

by BAIR blog

26 October 2021

Robustness

Robustness research aims to build systems that are less vulnerable to extreme hazards and to adversarial threats. Two problems in robustness are robustness to long tails and robustness to adversarial examples.

Long tails

Examples of long tail events. First row, left: an ambulance in front of a green light. First row, middle: birds on the road. First row, right: a reflection of a pedestrian. Bottom row, left: a group of people cosplaying. Bottom row, middle: a foggy road. Bottom row, right: a person partly occluded by a board on their back. (Source)

To operate in open-world high-stakes environments, ML systems will need to endure unusual events and tail risks. However, current ML systems are brittle in the face of real-world complexity, and they become unreliable when faced with novel situations like those above. To reduce the chance that ML systems choose a wrong course of action in environments dominated by rare events, models need to be unusually robust.

Adversarial examples

Adversarial perturbations. An example of an input image altered by an adversarial perturbation. After the adversarial perturbation, the neural network makes a high confidence mistake. (Source)

Adversaries can easily manipulate vulnerabilities in ML systems and cause them to make mistakes. As shown above, carefully crafted small perturbations are enough to break ML systems. In the paper, we focus on this problem but also suggest that researchers consider more realistic settings, like when attackers can create perceptible images or when attack specifications are not known beforehand.

Monitoring

Monitoring research aims to create tools and features that help human operators identify hazards and inspect ML systems. Two problems in monitoring are anomaly detection and backdoor detection. This list is nonexhaustive, and we include other problems in the paper including calibration, honest outputs, and detecting emergent capabilities.

Anomaly detection

Anomaly detection. On the left is a usual image which belongs to an ImageNet class, so the ImageNet classifier knows how to handle the image. On the right is an anomalous image which does not belong to any ImageNet class. Nonetheless, the model classifies the image with high confidence.

Anomaly detectors can warn human operators of potential hazards, and this can help them reduce their exposure to hazards. For example, anomaly detectors can help detect malicious uses of ML systems or flag novel examples for human review. However, deep learning-based anomaly detectors are not highly reliable, as shown in the figure above.

Backdoors

Backdoors. Depicted is a backdoored facial recognition system that gates building access. The backdoor could be triggered by a specific unique item chosen by an adversary, such as a pair of glasses. If the adversary wears that specific pair of glasses, the backdoored facial recognition will allow the adversary in the building. (Source)

ML systems risk carrying backdoors. Backdoored models behave correctly and benignly in almost all scenarios, but in particular circumstances chosen by the adversary, they have been taught to behave incorrectly. Models trained on massive datasets scraped from online are increasingly likely to be trained on poisoned data and thereby have backdoors injected. Moreover, downstream models are increasingly obtained by a single upstream foundation model, so a single backdoored system could render backdoors commonplace.

Alignment

Alignment research aims to create safe ML system objectives and have them safely pursued. Two problems in alignment are value learning and proxy gaming, but the paper includes many additional problems.

Value learning

Estimating human values such as pleasantness. Transformer models can partially separate between pleasant and unpleasant states given diverse open-world inputs. Utility values or pleasantness values are not ground truth values and are products of the model’s own learned utility function. (Source)

Encoding human goals and intent is challenging because many human values are hard to define and measure. How can we teach ML systems to model happiness, good judgment, freedom of action, meaningful experiences, safe outcomes, and more? In the figure above, we show that models are starting to have traction on the problem, but they nonetheless make many mistakes and can only process simple inputs. More research is needed to learn reliable representations for happiness and other human values.

Proxy gaming

Boatrace Proxy Gaming. An RL agent gained a high score not by finishing the race but by going in the wrong direction, catching on fire, and colliding into other boats. (Source)

Objective proxies can be gamed by optimizers and adversaries. In fact, Goodhart’s law asserts that “When a measure becomes a target, it ceases to be a good measure.” This means that we cannot just learn a proxy for human values—we must also make it robust to optimizers that are incentivized to game the proxy. An example of a reward maximizing agent gaming a video game proxy is in the figure above.

In the full paper, we describe several more problems, clarify each problem’s motivation, and provide concrete research directions. Check out the paper here.

This article was initially published on the BAIR blog, and appears here with the authors’ permission.

tags: deep dive

BAIR blog

AIhub is supported by:

Designing value-aligned autonomous vehicles: from moral dilemmas to conflict-sensitive design

Astrid Rakow 13 Nov 2025

Autonomous systems increasingly face value-laden choices. This blog post introduces the idea of designing “conflict-sensitive” autonomous traffic agents that explicitly recognise, reason about, and act upon competing ethical, legal, and social values.

Unsolved ML safety problems

Robustness

Long tails

Adversarial examples

Monitoring

Anomaly detection

Backdoors

Alignment

Value learning

Proxy gaming

Related posts :

Designing value-aligned autonomous vehicles: from moral dilemmas to conflict-sensitive design

Generations in Dialogue: Human-centric AI and collaborative AI systems with Professor Andreea Bobu

Learning from failure to tackle extremely hard problems

How AI can improve storm surge forecasts to help save lives

Rewarding explainability in drug repurposing with knowledge graphs

AI Song Contest – vote for your favourite

Generations in Dialogue: Multi-agent systems and human-AI interaction with Professor Manuela Veloso

Generations in Dialogue: Bridging Perspectives in AI – a new podcast from AAAI

↑