about

resources

events

contribute

republishing

☰

ΑΙhub.org

Unsolved ML safety problems

by BAIR blog

26 October 2021

Robustness

Robustness research aims to build systems that are less vulnerable to extreme hazards and to adversarial threats. Two problems in robustness are robustness to long tails and robustness to adversarial examples.

Long tails

Examples of long tail events. First row, left: an ambulance in front of a green light. First row, middle: birds on the road. First row, right: a reflection of a pedestrian. Bottom row, left: a group of people cosplaying. Bottom row, middle: a foggy road. Bottom row, right: a person partly occluded by a board on their back. (Source)

To operate in open-world high-stakes environments, ML systems will need to endure unusual events and tail risks. However, current ML systems are brittle in the face of real-world complexity, and they become unreliable when faced with novel situations like those above. To reduce the chance that ML systems choose a wrong course of action in environments dominated by rare events, models need to be unusually robust.

Adversarial examples

Adversarial perturbations. An example of an input image altered by an adversarial perturbation. After the adversarial perturbation, the neural network makes a high confidence mistake. (Source)

Adversaries can easily manipulate vulnerabilities in ML systems and cause them to make mistakes. As shown above, carefully crafted small perturbations are enough to break ML systems. In the paper, we focus on this problem but also suggest that researchers consider more realistic settings, like when attackers can create perceptible images or when attack specifications are not known beforehand.

Monitoring

Monitoring research aims to create tools and features that help human operators identify hazards and inspect ML systems. Two problems in monitoring are anomaly detection and backdoor detection. This list is nonexhaustive, and we include other problems in the paper including calibration, honest outputs, and detecting emergent capabilities.

Anomaly detection

Anomaly detection. On the left is a usual image which belongs to an ImageNet class, so the ImageNet classifier knows how to handle the image. On the right is an anomalous image which does not belong to any ImageNet class. Nonetheless, the model classifies the image with high confidence.

Anomaly detectors can warn human operators of potential hazards, and this can help them reduce their exposure to hazards. For example, anomaly detectors can help detect malicious uses of ML systems or flag novel examples for human review. However, deep learning-based anomaly detectors are not highly reliable, as shown in the figure above.

Backdoors

Backdoors. Depicted is a backdoored facial recognition system that gates building access. The backdoor could be triggered by a specific unique item chosen by an adversary, such as a pair of glasses. If the adversary wears that specific pair of glasses, the backdoored facial recognition will allow the adversary in the building. (Source)

ML systems risk carrying backdoors. Backdoored models behave correctly and benignly in almost all scenarios, but in particular circumstances chosen by the adversary, they have been taught to behave incorrectly. Models trained on massive datasets scraped from online are increasingly likely to be trained on poisoned data and thereby have backdoors injected. Moreover, downstream models are increasingly obtained by a single upstream foundation model, so a single backdoored system could render backdoors commonplace.

Alignment

Alignment research aims to create safe ML system objectives and have them safely pursued. Two problems in alignment are value learning and proxy gaming, but the paper includes many additional problems.

Value learning

Estimating human values such as pleasantness. Transformer models can partially separate between pleasant and unpleasant states given diverse open-world inputs. Utility values or pleasantness values are not ground truth values and are products of the model’s own learned utility function. (Source)

Encoding human goals and intent is challenging because many human values are hard to define and measure. How can we teach ML systems to model happiness, good judgment, freedom of action, meaningful experiences, safe outcomes, and more? In the figure above, we show that models are starting to have traction on the problem, but they nonetheless make many mistakes and can only process simple inputs. More research is needed to learn reliable representations for happiness and other human values.

Proxy gaming

Boatrace Proxy Gaming. An RL agent gained a high score not by finishing the race but by going in the wrong direction, catching on fire, and colliding into other boats. (Source)

Objective proxies can be gamed by optimizers and adversaries. In fact, Goodhart’s law asserts that “When a measure becomes a target, it ceases to be a good measure.” This means that we cannot just learn a proxy for human values—we must also make it robust to optimizers that are incentivized to game the proxy. An example of a reward maximizing agent gaming a video game proxy is in the figure above.

In the full paper, we describe several more problems, clarify each problem’s motivation, and provide concrete research directions. Check out the paper here.

This article was initially published on the BAIR blog, and appears here with the authors’ permission.

tags: deep dive

BAIR blog

AIhub is supported by:

Introducing the NASA Onboard Artificial Intelligence Research (OnAIR) platform: an interview with Evana Gizzi

Lucy Smith 03 Jul 2025

Find out about the OnAIR platform, some of the particular challenges of deploying AI-based solutions in space, and how the tool has been used so far.

An interview with Nicolai Ommer: the RoboCupSoccer Small Size League

Lucy Smith 01 Jul 2025

We caught up with Nicolai to find out more about the Small Size League, how the auto referees work, and how teams use AI.

Forthcoming machine learning and AI seminars: July 2025 edition

Lucy Smith 30 Jun 2025

A list of free-to-attend AI-related seminars that are scheduled to take place between 1 July and 31 August 2025.

monthly digest

What is vibe coding? A computer scientist explains what it means to have AI write computer code − and what risks that can entail

The Conversation 19 Jun 2025

Until recently, most computer code was written, at least originally, by human beings. But with the advent of GenAI, that has begun to change.