ΑΙhub.org
 

Unsupervised skill discovery with contrastive intrinsic control


by
01 April 2022



share this:
cic bair

By Misha Laskin

Unsupervised Reinforcement Learning (RL), where RL agents pre-train with self-supervised rewards, is an emerging paradigm for developing RL agents that are capable of generalization. Recently, we released the Unsupervised RL Benchmark (URLB) which we covered in a previous post. URLB benchmarked many unsupervised RL algorithms across three categories — competence-based, knowledge-based, and data-based algorithms. A surprising finding was that competence-based algorithms significantly underperformed other categories. In this post we will demystify what has been holding back competence-based methods and introduce Contrastive Intrinsic Control (CIC), a new competence-based algorithm that is the first to achieve leading results on URLB.

Results from benchmarking unsupervised RL algorithms

To recap, competence-based methods (which we will cover in detail) maximize the mutual information between states and skills (e.g. DIAYN), knowledge-based methods maximize the error of a predictive model (e.g. Curiosity), and data-based methods maximize the diversity of observed data (e.g. APT). Evaluating these algorithms on URLB by reward-free pre-training for 2M steps followed by 100k steps of finetuning across 12 downstream tasks, we previously found the following stack ranking of algorithms from the three categories.

URLB results

In the above figure competence-based methods (in green) do substantially worse than the other two types of unsupervised RL algorithms. Why is this the case and what can we do to resolve it?

Competence-based exploration

As a quick primer, competence-based algorithms maximize the mutual information between some observed variable such as a state and a latent skill vector, which is usually sampled from noise.

Competence-based Exploration

The mutual information is usually an intractable quantity and since we want to maximize it, we are usually better off maximizing a variational lower bound.

Mutual Info Decomposition

The quantity q(z|\tau) is referred to as the discriminator. In prior works, the discriminators are either classifiers over discrete skills or regressors over continuous skills. The problem is that classification and regression tasks need an exponential number of diverse data samples to be accurate. In simple environments where the number of potential behaviors is small, current competence-based methods work but not in environments where the set of potential behaviors is large and diverse.

How environment design influences performance

To illustrate this point, let’s run three algorithms on the OpenAI Gym and DeepMind Control (DMC) Hopper. Gym Hopper resets when the agent loses balance while DMC episodes have fixed length regardless if the agent falls over. By resetting early, Gym Hopper constrains the agent to a small number of behaviors that can be achieved by remaining balanced. We run three algorithms — DIAYN and ICM, popular competence-based and knowledge-based algorithms, as well as a “Fixed” agent which gets a reward of +1 for each timestep, and measure the zero-shot extrinsic reward for hopping during self-supervised pre-training.

OpenAI Gym vs DMC

On OpenAI Gym both DIAYN and the Fixed agent receive higher extrinsic rewards relative to ICM, but on the DeepMind Control Hopper both algorithms collapse. The only significant difference between the two environments is that OpenAI Gym resets early whereas DeepMind Control does not. This supports the hypothesis that when an environment supports many behaviors prior competence-based approaches struggle to learn useful skills.

Indeed, if we visualize behaviors learned by DIAYN on other DeepMind Control environments, we see that it learns a small set of static skills.

Prior methods fail to learn diverse behaviors

diaynw1.gif
diaynw2.gif
diaynw3.gif
diaynq1.gif
diaynq2.gif
diaynq3.gif

Skills learned by DIAYN after 2M steps of training.

Effective competence-based exploration with Contrastive Intrinsic Control (CIC)

As illustrated in the above example – complex environments support a large number of skills and we therefore need discriminators capable of supporting large skill spaces. This tension between the need to support large skill spaces and the limitation of current discriminators leads us to propose Contrastive Intrinsic Control (CIC).

Contrastive Intrinsic Control (CIC) introduces a new contrastive density estimator to approximate the conditional entropy (the discriminator). Unlike visual contrastive learning, this contrastive objective operates over state transitions and skill vectors. This allows us to bring powerful representation learning machinery from vision to unsupervised skill discovery.

CIC Decomposition

For a practical algorithm, we use the CIC contrastive skill learning as an auxiliary loss during pre-training. The self-supervised intrinsic reward is the value of the entropy estimate computed over the CIC embeddings. We also analyze other forms of intrinsic rewards in the paper, but this simple variant performs well with minimal complexity. The CIC architecture has the following form:

CIC Architecture

Qualitatively the behaviors from CIC after 2M steps of pre-training are quite diverse.

Diverse Behaviors learned with CIC

cicw1.gif
cicw2.gif
cicw3.gif
cicq1.gif
cicq2.gif
cicq3.gif

Skills learned by CIC after 2M steps of training.

With explicit exploration through the state-transition entropy term and the contrastive skill discriminator for representation learning CIC adapts extremely efficiently to downstream tasks – outperforming prior competence-based approaches by 1.78x and all prior exploration methods by 1.19x on state-based URLB.

Results

We provide more information in the CIC paper about how architectural details and skill dimension affect the performance of the CIC paper. The main takeaway from CIC is that there is nothing wrong with the competence-based objective of maximizing mutual information. However, what matters is how well we approximate this objective, especially in environments that support a large number of behaviors. CIC is the first competence-based algorithm to achieve leading performance on URLB. Our hope is that our approach encourages other researchers to work on new unsupervised RL algorithms

Paper: CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery
Michael Laskin, Hao Liu, Xue Bin Peng, Denis Yarats, Aravind Rajeswaran, Pieter Abbeel

Code: https://github.com/rll-research/cic


This article was initially published on the BAIR blog, and appears here with the authors’ permission.



tags:


BAIR blog




            AIhub is supported by:


Related posts :



#AAAI2025 workshops round-up 1: Artificial intelligence for music, and towards a knowledge-grounded scientific research lifecycle

  18 Mar 2025
We hear from the organisers of two workshops at AAAI2025 and find out the key takeaways from their events.

The Good Robot podcast: Re-imagining voice assistants with Stina Hasse Jørgensen and Frederik Juutilainen

  17 Mar 2025
Eleanor and Kerry chat to Stina Hasse Jørgensen and Frederik Juutilainen about an experimental research project that created an alternative voice assistant.

Visualizing research in the age of AI

  14 Mar 2025
Felice Frankel discusses the implications of generative AI when communicating science visually.

#IJCAI panel on communicating about AI with the public

  13 Mar 2025
A recording of this session at IJCAI2024 is now available to watch.

Interview with Tunazzina Islam: Understand microtargeting and activity patterns on social media

  11 Mar 2025
Hear from Doctoral Consortium participant Tunazzina about her research on computational social science, natural language processing, and social media mining and analysis

Microsoft cuts data centre plans and hikes prices in push to make users carry AI costs

  10 Mar 2025
Microsoft is trying to recoup the costs by raising prices, putting ads in products, and cancelling data centre leases

Report on the future of AI research

  07 Mar 2025
Find out more about a report released by the AAAI 2025 Presidential Panel.

Andrew Barto and Richard Sutton win 2024 Turing Award

  06 Mar 2025
Pair are recognised for their pioneering reinforcement learning research.




AIhub is supported by:






©2024 - Association for the Understanding of Artificial Intelligence


 












©2021 - ROBOTS Association