ΑΙhub.org
 

Physically realistic attacks on deep reinforcement learning


by
21 May 2020



share this:

adversary attacks

By Adam Gleave

Deep reinforcement learning (RL) has achieved superhuman performance in problems ranging from data center cooling to video games. RL policies may soon be widely deployed, with research underway in autonomous driving, negotiation and automated trading. Many potential applications are safety-critical: automated trading failures caused Knight Capital to lose USD 460M, while faulty autonomous vehicles have resulted in loss of life.

Consequently, it is critical that RL policies are robust: both to naturally occurring distribution shift, and to malicious attacks by adversaries. Unfortunately, we find that RL policies which perform at a high-level in normal situations can harbor serious vulnerabilities which can be exploited by an adversary.

Prior work has shown deep RL policies are vulnerable to small adversarial perturbations to their observations, similar to adversarial examples in image classifiers. This threat model assumes the adversary can directly modify the victim’s sensory observation. Such low-level access is rarely possible. For example, an autonomous vehicle’s camera image can be influenced by other drivers, but only to a limited extent. Other drivers cannot add noise to arbitrary pixels, or make a building disappear.

By contrast, we model the victim and adversary as agents in a shared environment. The adversary can take a similar set of actions to the victim. These actions may indirectly change the observations the victim sees, but only in a physically realistic fashion.

Note that if the victim policy were to play a Nash equilibria, it would not be exploitable by an adversary. We therefore focus on attacking victim policies trained via self-play, a popular method that approximates Nash equilibria. While it is known self-play may not always converge, it has produced highly capable AI systems. For example, AlphaGo and OpenAI Five have beaten world Go champions, and a professional Dota 2 team.

We find it is still possible to attack victim policies in this more realistic multi-agent threat model. Specifically, we exploit state-of-the-art policies trained by Bansal et al from OpenAI in zero-sum games between simulated Humanoid robots. We train our adversarial policies against a fixed victim policy, for less than 3% as many timesteps as the victim was trained for. In other respects, it is trained similarly to the self-play opponents: we use the same RL algorithm, Proximal Policy Optimization, and the same sparse reward. Surprisingly, the adversarial policies reliably beat most victims, despite not standing up and instead flailing on the ground.

In the video at the top of the post, we show victims in three different environments playing normal self-play opponents and adversarial policies. The Kick and Defend environment is a penalty shootout between a victim kicker and goalie opponent. You Shall Not Pass has a victim runner trying to cross the finish line, and an opponent blocker trying to prevent them. Sumo Humans has two agents competing on a round arena to knock out their opponent.

In Kick and Defend and You Shall Not Pass, the adversarial policy never stands up nor touches the victim. Instead, it positions its body in such a way to cause the victim’s policy to take poor actions. This style of attack is impossible in Sumo Humans, where the adversarial policy would immediately lose if it fell over. Instead, the adversarial policy learns to kneel in the center in a stable position, which proves surprisingly effective.

To better understand how the adversarial policies exploit their victims, we created “masked” versions of victim policies. The masked victim always observes a static value for the opponent position, corresponding to a typical initial starting state. This doctored observation is then passed to the original victim policy.

One would expect performance to degrade when the policy cannot see its opponent, and indeed the masked victims win less often against normal opponents. However, they are far more robust to adversarial policies. This result shows that the adversarial policies win by taking actions to induce natural observations that are adversarial to the victim, and not by physically interfering with the victim.

Furthermore, these results show there is a cyclic relationship between the policies. There is no overall strongest policy: the best policy depends on the other player’s policy, like in rock-paper-scissors. Technically this is known as non-transitivity: policy A beats B which beats C, yet C beats A. This is surprising since these environments’ real-world analogs are (approximately) transitive: professional human soccer players and sumo wrestlers can reliably beat amateurs. Self play assumes transitivity and so this may be why the self-play policies are vulnerable to attack.

Of course in general we don’t want to completely blind the victim, since this hurts performance against normal opponents. Instead, we propose adversarial training: fine-tuning the victim policy against the adversary that has been trained against it. Specifically, we fine-tune for 20 million timesteps, the same amount of experience the adversary is trained with. Half of the episodes are against an adversary, and the other half against a normal opponent. We find the fine-tuned victim policy is robust to the adversary it was trained against, and suffers only a small performance drop against a normal opponent.

However, one might wonder if this fine-tuned victim is robust to our attack method, or just the adversary it was fine-tuned against. Repeating the attack method finds a new adversarial policy:

Notably, the new adversary trips the victim up rather than just flailing around. This suggests our new policies are meaningfully more robust (although there may of course be failure modes we haven’t discovered).

The existence of adversarial policies has significant implications for the training, understanding and evaluation of RL policies. First, adversarial policies highlight the need to move beyond self-play. Promising approaches include iteratively applying the adversarial training defence above, and population-based training which naturally trains against a broader range of opponents.

Second, this attack shows that RL policies can be vulnerable to adversarial observations that are on the manifold of naturally occurring data. By contrast, most prior work on adversarial examples has produced physically unrealistic perturbed images.

Finally, these results highlight the limitations of current evaluation methodologies. The victim policies have strong average-case performance against a range of both normal opponents and random policies. Yet their worst-case performance against adversarial policies is extremely poor. Moreover, it would be difficult to find this worst-case by hand: the adversarial policies do not seem like challenging opponents to human eyes. We would recommend testing safety-critical policies by adversarial attack, constructively lower bounding the policies’ exploitability.

To find out more, check out our paper or visit the project website for more example videos.

This article was initially published on the BAIR blog, and appears here with the authors’ permission.




BAIR blog

            AIhub is supported by:



Subscribe to AIhub newsletter on substack



Related posts :

Machine learning framework to predict global imperilment status of freshwater fish

  20 Mar 2026
“With our model, decision makers can deploy resources in advance before a species becomes imperiled.”

Interview with AAAI Fellow Yan Liu: machine learning for time series

  19 Mar 2026
Hear from 2026 AAAI Fellow Yan Liu about her research into time series, the associated applications, and the promise of physics-informed models.

A principled approach for data bias mitigation

  18 Mar 2026
Find out more about work presented at AIES 2025 which proposes a new way to measure data bias, along with a mitigation algorithm with mathematical guarantees.

An AI image generator for non-English speakers

  17 Mar 2026
"Translations lose the nuances of language and culture, because many words lack good English equivalents."

AI and Theory of Mind: an interview with Nitay Alon

  16 Mar 2026
Find out more about how Theory of Mind plays out in deceptive environments, multi-agents systems, the interdisciplinary nature of this field, when to use Theory of Mind, and when not to, and more.
coffee corner

AIhub coffee corner: AI, kids, and the future – “generation AI”

  13 Mar 2026
The AIhub coffee corner captures the musings of AI experts over a short conversation.

AI chatbots can effectively sway voters – in either direction

  12 Mar 2026
A short interaction with a chatbot can meaningfully shift a voter’s opinion about a presidential candidate or proposed policy.

Studying the properties of large language models: an interview with Maxime Meyer

  11 Mar 2026
What happens when you increase the prompt length in a LLM? In the latest interview in our AAAI Doctoral Consortium series, we sat down with Maxime, a PhD student in Singapore.



AIhub is supported by:







Subscribe to AIhub newsletter on substack




 















©2026.02 - Association for the Understanding of Artificial Intelligence