ΑΙhub.org
 

Learning state abstractions for long-horizon planning


by
29 January 2021



share this:
Learning State Abstractions for Long-Horizon Planning

By Scott Emmons*, Ajay Jain*, Michael Laskin*, Thanard Kurutach, Pieter Abbeel, Deepak Pathak

Many tasks that we do on a regular basis, such as navigating a city, cooking a meal, or loading a dishwasher, require planning over extended periods of time. Accomplishing these tasks may seem simple to us; however, reasoning over long time horizons remains a major challenge for today’s Reinforcement Learning (RL) algorithms. While unable to plan over long horizons, deep RL algorithms excel at learning policies for short horizon tasks, such as robotic grasping, directly from pixels. At the same time, classical planning methods such as Dijkstra’s algorithm and A^* search can plan over long time horizons, but they require hand-specified or task-specific abstract representations of the environment as input.

To achieve the best of both worlds, state-of-the-art visual navigation methods have applied classical search methods to learned graphs. In particular, SPTM [2] and SoRB [3] use a replay buffer of observations as nodes in a graph and learn a parametric distance function to draw edges in the graph. These methods have been successfully applied to long-horizon simulated navigation tasks that were too challenging for previous methods to solve.

Nevertheless, these methods are still limited because they are highly sensitive to errors in the learned graph. Even a single faulty edge acts like a wormhole in the graph topology that planning algorithms try to exploit, which makes existing methods that combine graph search and RL extremely brittle. For example, if an artificial agent navigating a maze thinks that two observations on either side of a wall are nearby, its plans will involve transitions that collide into the wall. Adopting a simple model that assumes a constant probability p of each edge being faulty, we see that the expected number of faulty edges is p|E| = O(|V|^2). In other words, errors in the graph scale quadratically with the number of nodes in the graph.


We could do a lot better if we could minimize the errors in the graph. But how? Graphs over observations in both simulated and real-world environments can be prohibitively large, making it challenging to even identify which edges are faulty. To minimize errors in the graph, then, we desire sparsity; we want to keep a minimal set of nodes that is sufficient for planning. If we have a way
to aggregate similar observations into a single node in the graph, we can reduce the number of errors and improve the accuracy of our plans. The key challenge is to aggregate observations in a way that respects temporal constraints. If observations are similar in appearance but actually far away, then they should be aggregated into different nodes.


So how can we sparsify our graph while guaranteeing that the graph remains useful for planning? Our key insight is a novel merging criterion called two-way consistency. Two-way consistency can be viewed as a generalization of value irrelevance to the goal-conditioned setting. Intuitively, two-way consistency merges nodes (i) that can be interchanged as starting states and (ii) that can be interchanged as goal states.


For an example of two-way consistency, consider the above figure. Suppose during our node merging procedure we ask: can we merge the nodes with pink and orange bottles according to two-way consistency? First, we note that moving from the blue bottle to the pink bottle requires roughly the same work as moving from the blue bottle to the orange bottle. So the nodes with pink and orange bottles satisfy criterion (ii) because they can be interchanged as goal states. However, while it is possible to start from the pink bottle and move to the blue bottle, if we instead start at the orange bottle, the orange bottle will fall to the floor and crash! So the nodes with pink and orange bottles fail criterion (i) because they cannot be interchanged as starting states.

In practice, we can’t expect to encounter two nodes that can be perfectly interchanged. Instead, we merge nodes that can be interchanged up to a threshold parameter \tau. By increasing \tau, we can make the resulting graph as sparse as we’d like. Crucially, *we prove in the paper that merging according to two-way consistency preserves the graph’s quality up to an error term that scales only linearly with the merging threshold \tau.

Our motivation for sparsity, discussed above, is robustness: we expect smaller graphs to have fewer errors. Furthermore, our main theorem tells us that we can merge nodes according to two-way consistency while preserving the graph’s quality. Experimentally, though, are the resulting sparse graphs more robust?

To test the robustness of Sparse Graphical Memory to errors in learned distance metrics, we thinned the walls in the PointEnv mazes of [3]. While PointEnv is a simple environment with (x, y) coordinate observations, thinning the walls is a major challenge for parametric distance functions; any error in the learned distance function will cause faulty edges across the walls that destroy the feasibility of plans. For this reason, simply thinning the maze walls is enough to break the previous state-of-the-art [3] resulting in a 0% success rate.

How does Sparse Graphical Memory fare? With many fewer edges, it becomes tractable to perform self-supervised cleanup: the agent can step through the environment to detect and remove faulty edges from its graph. The below figure illustrates the results of this process. While the dense graph shown in red has many faulty edges, sparsity and self-supervised cleanup, shown in green, overcome errors in the learned distance metric, leading to a 100% success rate.


We see a similar trend in experiments with visual input. In both ViZDoom [4] and SafetyGym [5] – maze navigation tasks that require planning from raw images – Sparse Graphical Memory consistently improves the success of baseline methods including SoRB [3] and SPTM [2].

In addition to containing fewer errors, Sparse Graphical Memory also results in more optimal plans. On a ViZDoom maze navigation task [4], we find that SGM requires significantly less steps to reach the final goal across easy, medium, and hard maze tasks, meaning that the agent follows a shorter path to the final destination.


Overall, we found that state aggregation with two-way consistency resulted in substantially more robust plans over the prior state-of-the-art. While promising, many open questions and challenges remain for combining classical planning with learning-based control. Some of the questions we’re thinking about are – how can we extend these methods beyond navigation to manipulation domains? As the world is not static, how should we build graphs over changing environments? How can two-way consistency be utilized beyond the scope of graphical-based planning methods? We are excited about these future directions and hope our theoretical and experimental findings prove useful to other researchers investigating control over extended time horizons.

References

  1. Emmons*, Jain*, Laskin* et al. Sparse Graphical Memory for Robust Planning. NeurIPS 2020.
  2. Savinov et al. Semi-parametric Topological Memory for Navigation. ICLR 2019.
  3. Eysenbach et al. Search on the Replay Buffer: Bridging Planning and Reinforcement Learning. NeurIPS 2020.
  4. Wydmuch et al. ViZDoom Competitions: Playing Doom from Pixels. IEEE Transactions on Games, 2018.
  5. Ray et al. Benchmarking Safe Exploration in Deep Reinforcement Learning. Preprint, 2019.

This article was initially published on the BAIR blog, and appears here with the authors’ permission.




BAIR blog




            AIhub is supported by:


Related posts :



monthly digest

AIhub monthly digest: May 2025 – materials design, object state classification, and real-time monitoring for healthcare data

  30 May 2025
Welcome to our monthly digest, where you can catch up with AI research, events and news from the month past.

Congratulations to the #AAMAS2025 best paper, best demo, and distinguished dissertation award winners

  29 May 2025
Find out who won the awards presented at the International Conference on Autonomous Agents and Multiagent Systems last week.

The Good Robot podcast: Transhumanist fantasies with Alexander Thomas

  28 May 2025
In this episode, Eleanor talks to Alexander Thomas, a filmmaker and academic, about the transhumanist narrative.

Congratulations to the #ICRA2025 best paper award winners

  27 May 2025
The winners and finalists in the different categories have been announced.

#ICRA2025 social media round-up

  23 May 2025
Find out what the participants got up to at the International Conference on Robotics & Automation.

Interview with Gillian Hadfield: Normative infrastructure for AI alignment

  22 May 2025
Kumar Kshitij Patel spoke to Gillian Hadfield about her interdisciplinary research, career trajectory, path into AI alignment, law, and general thoughts on AI systems.

PitcherNet helps researchers throw strikes with AI analysis

  21 May 2025
Baltimore Orioles tasks Waterloo Engineering researchers to develop AI tech that can monitor pitchers using low-resolution video captured by smartphones

Interview with Filippos Gouidis: Object state classification

  20 May 2025
Read the latest interview in our series featuring the AAAI/SIGAI Doctoral Consortium participants.



 

AIhub is supported by:






©2025.05 - Association for the Understanding of Artificial Intelligence


 












©2025.05 - Association for the Understanding of Artificial Intelligence