Physically realistic attacks on deep reinforcement learning


21 May 2020




By Adam Gleave

Deep reinforcement learning (RL) has achieved superhuman performance in problems ranging from data center cooling to video games. RL policies may soon be widely deployed, with research underway in autonomous driving, negotiation and automated trading. Many potential applications are safety-critical: automated trading failures caused Knight Capital to lose USD 460M, while faulty autonomous vehicles have resulted in loss of life.

Consequently, it is critical that RL policies are robust: both to naturally occurring distribution shift, and to malicious attacks by adversaries. Unfortunately, we find that RL policies which perform at a high level in normal situations can harbor serious vulnerabilities that an adversary can exploit.

Prior work has shown deep RL policies are vulnerable to small adversarial perturbations to their observations, similar to adversarial examples in image classifiers. This threat model assumes the adversary can directly modify the victim’s sensory observation. Such low-level access is rarely possible. For example, an autonomous vehicle’s camera image can be influenced by other drivers, but only to a limited extent. Other drivers cannot add noise to arbitrary pixels, or make a building disappear.
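To make the contrast concrete, here is a minimal sketch of that prior threat model (not any particular published attack): the adversary adds a small, norm-bounded perturbation directly to the victim's observation vector. The policy and the random choice of perturbation are placeholders; real attacks choose the perturbation using gradient information.

```python
# Sketch of the observation-perturbation threat model: the adversary may move
# every feature of the victim's observation by at most epsilon.
import numpy as np

def perturb_observation(obs: np.ndarray, epsilon: float = 0.01) -> np.ndarray:
    # A real attack would pick delta adversarially (e.g. via gradients);
    # here we only illustrate the constraint on the perturbation.
    delta = np.random.uniform(-epsilon, epsilon, size=obs.shape)
    return obs + delta

obs = np.zeros(24)                      # e.g. a low-dimensional sensor reading
attacked_obs = perturb_observation(obs)
assert np.max(np.abs(attacked_obs - obs)) <= 0.01
```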

By contrast, we model the victim and adversary as agents in a shared environment. The adversary can take a similar set of actions to the victim. These actions may indirectly change the observations the victim sees, but only in a physically realistic fashion.

Note that if the victim policy were to play a Nash equilibrium, it would not be exploitable by an adversary. We therefore focus on attacking victim policies trained via self-play, a popular method that approximates Nash equilibria. While it is known that self-play may not always converge, it has produced highly capable AI systems: AlphaGo has beaten world Go champions, and OpenAI Five has beaten a professional Dota 2 team.

We find it is still possible to attack victim policies in this more realistic multi-agent threat model. Specifically, we exploit state-of-the-art policies trained by Bansal et al. from OpenAI in zero-sum games between simulated Humanoid robots. We train our adversarial policies against a fixed victim policy, for less than 3% as many timesteps as the victim was trained for. In other respects, the adversary is trained similarly to the self-play opponents: we use the same RL algorithm, Proximal Policy Optimization (PPO), and the same sparse reward. Surprisingly, the adversarial policies reliably beat most victims, despite never standing up and instead flailing on the ground.
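For readers who want a concrete picture of this setup, the sketch below (not the authors' actual code) wraps a two-player environment so that an off-the-shelf PPO implementation controls only the adversary, while the frozen victim policy is queried inside the wrapper. The tuple-space interface of the two-player environment, the `victim_policy.act` method, and the reward convention are all illustrative assumptions.

```python
import gym
from stable_baselines3 import PPO

class AdversaryVsFixedVictim(gym.Env):
    """Single-agent view of a two-player game, with the victim frozen inside."""

    def __init__(self, two_player_env, victim_policy):
        self.env = two_player_env        # hypothetical two-player MuJoCo env
        self.victim = victim_policy      # pre-trained and never updated
        # Expose only the adversary's half of the (assumed) tuple spaces.
        self.action_space = self.env.action_space[0]
        self.observation_space = self.env.observation_space[0]

    def reset(self):
        obs_adversary, self._obs_victim = self.env.reset()
        return obs_adversary

    def step(self, adversary_action):
        victim_action = self.victim.act(self._obs_victim)   # frozen victim acts
        (obs_adversary, self._obs_victim), (reward_adversary, _), done, info = \
            self.env.step((adversary_action, victim_action))
        # The adversary sees only its own sparse win/loss reward.
        return obs_adversary, reward_adversary, done, info

# Training the adversary with standard PPO against the frozen victim:
# model = PPO("MlpPolicy", AdversaryVsFixedVictim(env, victim), verbose=1)
# model.learn(total_timesteps=20_000_000)   # far fewer timesteps than the victim saw
```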

In the video at the top of the post, we show victims in three different environments playing normal self-play opponents and adversarial policies. The Kick and Defend environment is a penalty shootout between a victim kicker and an opponent goalie. You Shall Not Pass has a victim runner trying to cross the finish line, and an opponent blocker trying to prevent them. Sumo Humans has two agents competing on a round arena, each trying to knock the other out.

In Kick and Defend and You Shall Not Pass, the adversarial policy never stands up nor touches the victim. Instead, it positions its body in such a way as to cause the victim’s policy to take poor actions. This style of attack is impossible in Sumo Humans, where the adversarial policy would immediately lose if it fell over. Instead, the adversarial policy learns to kneel in the center in a stable position, which proves surprisingly effective.

To better understand how the adversarial policies exploit their victims, we created “masked” versions of victim policies. The masked victim always observes a static value for the opponent position, corresponding to a typical initial state. This doctored observation is then passed to the original victim policy.
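A minimal sketch of this masking, assuming the opponent's position occupies a known slice of the victim's observation vector (the indices and placeholder value below are illustrative, not the actual layout):

```python
import numpy as np

OPPONENT_SLICE = slice(380, 395)        # hypothetical indices of opponent position
INITIAL_OPPONENT_POS = np.zeros(15)     # a typical initial state, held fixed

def mask_observation(obs: np.ndarray) -> np.ndarray:
    masked = obs.copy()
    masked[OPPONENT_SLICE] = INITIAL_OPPONENT_POS   # victim "sees" a static opponent
    return masked

# The doctored observation is then fed to the unmodified victim policy:
# action = victim.act(mask_observation(obs))
```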

One would expect performance to degrade when the policy cannot see its opponent, and indeed the masked victims win less often against normal opponents. However, they are far more robust to adversarial policies. This result shows that the adversarial policies win by taking actions to induce natural observations that are adversarial to the victim, and not by physically interfering with the victim.

Furthermore, these results show there is a cyclic relationship between the policies. There is no overall strongest policy: the best policy depends on the other player’s policy, as in rock-paper-scissors. Technically this is known as non-transitivity: policy A beats B, which beats C, yet C beats A. This is surprising since these environments’ real-world analogs are (approximately) transitive: professional human soccer players and sumo wrestlers can reliably beat amateurs. Self-play implicitly assumes transitivity, which may be why the self-play policies are vulnerable to attack.
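The toy win-rate matrix below (with made-up numbers, purely for illustration) shows the point: no policy has the best worst-case, just as no move dominates in rock-paper-scissors.

```python
# Toy illustration of non-transitivity between three policies.
# Row i, column j = probability that policy i beats policy j (invented numbers).
import numpy as np

policies = ["victim (A)", "normal opponent (B)", "adversary (C)"]
win_rate = np.array([
    [0.5, 0.8, 0.1],   # A beats B most of the time, but loses to C
    [0.2, 0.5, 0.9],   # B loses to A, yet beats C easily
    [0.9, 0.1, 0.5],   # C beats A (the attack), but loses to B
])

# No row dominates: whichever policy you pick, some opponent defeats it.
for i, name in enumerate(policies):
    print(f"{name}: worst-case win rate {win_rate[i].min():.1f}")
```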

Of course, in general we don’t want to completely blind the victim, since this hurts performance against normal opponents. Instead, we propose adversarial training: fine-tuning the victim policy against the adversary that has been trained against it. Specifically, we fine-tune for 20 million timesteps, the same amount of experience the adversary is trained with. Half of the episodes are against the adversary, and the other half against a normal opponent. We find the fine-tuned victim policy is robust to the adversary it was trained against, and suffers only a small performance drop against a normal opponent.
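Schematically, the fine-tuning schedule just mixes opponents episode by episode, as in the sketch below; the episode and update helpers are hypothetical placeholders rather than the authors' code.

```python
import random

def sample_opponent(normal_opponent, adversary):
    # Half of the fine-tuning episodes use the adversary, half a normal opponent.
    return adversary if random.random() < 0.5 else normal_opponent

# for episode in range(num_episodes):              # ~20M timesteps in total
#     opponent = sample_opponent(normal_opponent, adversary)
#     rollout = play_episode(victim, opponent)     # hypothetical helper
#     victim.update(rollout)                       # e.g. a PPO update
```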

However, one might wonder if this fine-tuned victim is robust to our attack method, or just the adversary it was fine-tuned against. Repeating the attack method finds a new adversarial policy:

Notably, the new adversary trips the victim up rather than just flailing around. This suggests the fine-tuned victim policies are meaningfully more robust (although there may of course be failure modes we haven’t discovered).

The existence of adversarial policies has significant implications for the training, understanding and evaluation of RL policies. First, adversarial policies highlight the need to move beyond self-play. Promising approaches include iteratively applying the adversarial training defence above, and population-based training, which naturally trains against a broader range of opponents.

Second, this attack shows that RL policies can be vulnerable to adversarial observations that are on the manifold of naturally occurring data. By contrast, most prior work on adversarial examples has produced physically unrealistic perturbed images.

Finally, these results highlight the limitations of current evaluation methodologies. The victim policies have strong average-case performance against a range of both normal opponents and random policies. Yet their worst-case performance against adversarial policies is extremely poor. Moreover, it would be difficult to find this worst-case by hand: the adversarial policies do not seem like challenging opponents to human eyes. We would recommend testing safety-critical policies by adversarial attack, constructively lower bounding the policies’ exploitability.
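In code, such an evaluation could look roughly like the hedged sketch below, where the adversary pool and the `evaluate_win_rate` helper are illustrative assumptions: the victim's worst win rate over the pool yields a constructive lower bound on its exploitability.

```python
def exploitability_lower_bound(victim, adversaries, evaluate_win_rate, episodes=100):
    # The victim's worst win rate across trained adversaries lower-bounds how
    # badly a true best-response opponent could exploit it.
    win_rates = [evaluate_win_rate(victim, adv, episodes) for adv in adversaries]
    return 1.0 - min(win_rates)
```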

To find out more, check out our paper or visit the project website for more example videos.

This article was initially published on the BAIR blog, and appears here with the authors’ permission.



