**By Michael Janner**

Reinforcement learning systems can make decisions in one of two ways. In the *model-based* approach, a system uses a predictive model of the world to ask questions of the form “what will happen if I do *x*?” to choose the best *x*^{1}. In the alternative *model-free* approach, the modeling step is bypassed altogether in favor of learning a control policy directly. Although in practice the line between these two techniques can become blurred, as a coarse guide it is useful for dividing up the space of algorithmic possibilities.

*Predictive models can be used to ask “what if?” questions to guide future decisions.*

The natural question to ask after making this distinction is whether to use such a predictive model. The field has grappled with this question for quite a while, and is unlikely to reach a consensus any time soon. However, we have learned enough about designing model-based algorithms that it is possible to draw some general conclusions about best practices and common pitfalls. In this post, we will survey various realizations of model-based reinforcement learning methods. We will then describe some of the tradeoffs that come into play when using a learned predictive model for training a policy and how these considerations motivate a simple but effective strategy for model-based reinforcement learning. The latter half of this post is based on our recent paper on model-based policy optimization, for which code is available here.

Below, model-based algorithms are grouped into four categories to highlight the range of uses of predictive models. For the comparative performance of some of these approaches in a continuous control setting, the benchmarking paper by Wang et al. (2019) is highly recommended.

**Analytic gradient computation**

Assumptions about the form of the dynamics and cost function are convenient because they can yield closed-form solutions for locally optimal control, as in the LQR framework. Even when these assumptions are not valid, receding-horizon control can account for small errors introduced by approximated dynamics. Similarly, dynamics models parametrized as Gaussian processes have analytic gradients that can be used for policy improvement. Controllers derived via these simple parametrizations can also be used to provide guiding samples for training more complex nonlinear policies.
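To make the closed-form case concrete, here is a minimal sketch of finite-horizon LQR for a scalar system, assuming linear dynamics `x' = a*x + b*u` and quadratic cost `q*x^2 + r*u^2`. The function names and the numerical constants are illustrative assumptions, not taken from any of the cited works.

```python
def lqr_gains(a, b, q, r, horizon):
    """Backward Riccati recursion for scalar dynamics x' = a*x + b*u."""
    p = q  # cost-to-go coefficient at the final time step
    gains = []
    for _ in range(horizon):
        k = (b * p * a) / (r + b * p * b)      # optimal feedback gain for this step
        p = q + a * p * a - a * p * b * k      # Riccati update of the cost-to-go
        gains.append(k)
    gains.reverse()  # gains[t] is now the gain to apply at time t
    return gains


def rollout_cost(a, b, q, r, x0, gains):
    """Total quadratic cost of applying u_t = -gains[t] * x_t from x0."""
    x, cost = x0, 0.0
    for k in gains:
        u = -k * x
        cost += q * x * x + r * u * u
        x = a * x + b * u
    return cost + q * x * x  # add the terminal state cost


# Illustrative constants: an unstable system (a > 1) with cheap control.
gains = lqr_gains(a=1.1, b=0.5, q=1.0, r=0.1, horizon=20)
lqr_cost = rollout_cost(1.1, 0.5, 1.0, 0.1, x0=1.0, gains=gains)
zero_cost = rollout_cost(1.1, 0.5, 1.0, 0.1, x0=1.0, gains=[0.0] * 20)
```

Because both the dynamics and the cost have a known form, the controller comes out of a single backward pass with no sampling or gradient descent.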

**Sampling-based planning**

In the fully general case of nonlinear dynamics models, we lose guarantees of local optimality and must resort to sampling action sequences. The simplest version of this approach, random shooting, entails sampling candidate actions from a fixed distribution, evaluating them under a model, and choosing the action that is deemed the most promising. More sophisticated variants iteratively adjust the sampling distribution, as in the cross-entropy method (CEM; used in PlaNet, PETS, and visual foresight) or path integral optimal control (used in recent model-based dexterous manipulation work).

In discrete-action settings, however, it is more common to search over tree structures than to iteratively refine a single trajectory of waypoints. Common tree-based search algorithms include MCTS, which has underpinned recent impressive results in game playing, and iterated width search. Sampling-based planning, in both continuous and discrete domains, can also be combined with structured physics-based, object-centric priors.
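For the continuous-action case, random shooting and the cross-entropy method can be sketched as follows. This is a toy illustration under stated assumptions: `model_step` is a hypothetical stand-in for a learned dynamics model (a 1-D point that is rewarded for reaching 1.0), and the candidate counts and horizon are arbitrary.

```python
import random


def model_step(state, action):
    """Hypothetical stand-in for a learned model: a 1-D point that should reach 1.0."""
    next_state = state + 0.1 * action
    reward = -(next_state - 1.0) ** 2
    return next_state, reward


def evaluate(state, actions):
    """Model-predicted cumulative reward of an action sequence."""
    total = 0.0
    for action in actions:
        state, reward = model_step(state, action)
        total += reward
    return total


def random_shooting(state, horizon=10, n_candidates=200, rng=random):
    """Sample action sequences from a fixed distribution and keep the best one."""
    candidates = [[rng.uniform(-1.0, 1.0) for _ in range(horizon)]
                  for _ in range(n_candidates)]
    return max(candidates, key=lambda seq: evaluate(state, seq))


def cem(state, horizon=10, n_candidates=200, n_elites=20, n_iters=5, rng=random):
    """Iteratively refit a Gaussian sampling distribution to the elite sequences."""
    means, stds = [0.0] * horizon, [1.0] * horizon
    for _ in range(n_iters):
        # Sample candidates, clipping actions to the same range as random shooting.
        candidates = [[min(1.0, max(-1.0, rng.gauss(m, s))) for m, s in zip(means, stds)]
                      for _ in range(n_candidates)]
        candidates.sort(key=lambda seq: evaluate(state, seq), reverse=True)
        elites = candidates[:n_elites]
        for t in range(horizon):
            column = [seq[t] for seq in elites]
            means[t] = sum(column) / n_elites
            stds[t] = (sum((a - means[t]) ** 2 for a in column) / n_elites) ** 0.5 + 1e-6
    return means


rng = random.Random(0)
shooting_plan = random_shooting(0.0, rng=rng)
cem_plan = cem(0.0, rng=rng)
```

In a receding-horizon controller, only the first action of the returned plan would typically be executed before replanning from the next observed state.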

**Model-based data generation**

An important detail in many machine learning success stories is a means of artificially increasing the size of a training set. It is difficult to define a manual data augmentation procedure for policy optimization, but we can view a predictive model analogously as a learned method of generating synthetic data. The original proposal of such a combination comes from the Dyna algorithm by Sutton, which alternates between model learning, data generation under a model, and policy learning using the model data. This strategy has been combined with iLQG, model ensembles, and meta-learning; has been scaled to image observations; and is amenable to theoretical analysis. A close cousin to model-based data generation is the use of a model to improve target value estimates for temporal difference learning.
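The alternation between model learning, data generation, and policy learning can be illustrated with a small tabular sketch in the spirit of Dyna. The chain environment, the hyperparameters, and the use of Q-learning as the policy learner are assumptions chosen to keep the example short.

```python
import random

N_STATES, GOAL = 5, 4   # chain of states 0..4; reaching state 4 ends the episode
ACTIONS = (0, 1)        # 0 = move left, 1 = move right


def env_step(state, action):
    """Toy deterministic chain environment (an assumption for illustration)."""
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def q_update(q, s, a, r, s2, done, alpha=0.5, gamma=0.9):
    """One Q-learning backup on a single transition."""
    target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
    q[(s, a)] += alpha * (target - q[(s, a)])


def dyna_q(n_episodes=30, n_planning=20, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # learned model: (state, action) -> (reward, next_state, done)
    for _ in range(n_episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy action selection with random tie-breaking.
            best = max(q[(state, a)] for a in ACTIONS)
            greedy = [a for a in ACTIONS if q[(state, a)] == best]
            action = rng.choice(ACTIONS) if rng.random() < epsilon else rng.choice(greedy)
            next_state, reward, done = env_step(state, action)
            q_update(q, state, action, reward, next_state, done)  # learn from real data
            model[(state, action)] = (reward, next_state, done)   # model learning
            for _ in range(n_planning):                           # learn from model data
                s, a = rng.choice(list(model))
                r, s2, d = model[(s, a)]
                q_update(q, s, a, r, s2, d)
            state = next_state
    return q


q = dyna_q()
greedy_policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
```

Each real transition is reused many times through the model, which is the sense in which the model acts as a learned source of synthetic training data.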

**Value-equivalence prediction**

A final technique, which does not fit neatly into model-based versus model-free categorization, is to incorporate computation that resembles model-based planning without supervising the model’s predictions to resemble actual states. Instead, plans under the model are constrained to match trajectories in the real environment only in their predicted cumulative reward. These value-equivalent models have been shown to be effective in high-dimensional observation spaces where conventional model-based planning has proven difficult.

In what follows, we will focus on the data generation strategy for model-based reinforcement learning. It is not obvious whether incorporating model-generated data into an otherwise model-free algorithm is a good idea. Modeling errors could cause diverging temporal-difference updates, and in the case of linear approximation, model and value fitting are equivalent. However, it is easier to motivate model usage by considering the empirical generalization capacity of predictive models, and such a model-based augmentation procedure turns out to be surprisingly effective in practice.

**The Good News**

A natural way of thinking about the effects of model-generated data begins with the standard objective of reinforcement learning:

$$\max_{\pi} \; \mathbb{E}_{\pi,\, p}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]$$

which says that we want to maximize the expected cumulative discounted rewards $r(s_t, a_t)$ from acting according to a policy $\pi$ in an environment governed by dynamics $p$. It is important to pay particular attention to the distributions over which this expectation is taken.^{2} For example, while the expectation is supposed to be taken over trajectories from the current policy $\pi$, in practice many algorithms re-use trajectories from an old policy $\pi_{\text{old}}$ for improved sample-efficiency. There has been much algorithm development dedicated to correcting for the issues associated with the resulting *off-policy error*.

Using model-generated data can also be viewed as a simple modification of the sampling distribution. Incorporating model data into policy optimization amounts to swapping out the true dynamics $p$ with an approximation $\hat{p}$. The *model bias* introduced by making this substitution acts analogously to the off-policy error, but it allows us to do something rather useful: we can query the model dynamics at any state to generate samples from the current policy, effectively circumventing the off-policy error.

If model usage can be viewed as trading between off-policy error and model bias, then a straightforward way to proceed would be to compare these two terms. However, estimating a model’s error on the *current* policy’s distribution requires us to make a statement about how that model will generalize. While worst-case bounds are rather pessimistic here, we found that predictive models tend to generalize to the state distributions of future policies well enough to motivate their usage in policy optimization.

*Generalization of learned models, trained on samples from a data-collecting policy $\pi_D$, to the state distributions of future policies seen during policy optimization. Increasing the training set size not only improves performance on the training distribution, but also on nearby distributions.*

**The Bad News**

The above result suggests that the single-step predictive accuracy of a learned model can be reliable under policy shift. The catch is that most model-based algorithms rely on models for much more than single-step accuracy, often performing model-based rollouts equal in length to the task horizon in order to properly estimate the state distribution under the model. When predictions are strung together in this manner, small errors compound over the prediction horizon.
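The gap between single-step accuracy and rollout accuracy can be seen in a small sketch. Here the true dynamics are a planar rotation, which traces out sinusoidal motion, and `model_step` is a hypothetical learned model with slight errors in frequency and amplitude; all constants are assumptions.

```python
import math


def rotate(state, angle, gain=1.0):
    """Rotate a 2-D state by `angle` radians and scale it by `gain`."""
    x, y = state
    c, s = math.cos(angle), math.sin(angle)
    return (gain * (c * x - s * y), gain * (s * x + c * y))


def true_step(state):
    return rotate(state, 0.10)


def model_step(state):
    # Hypothetical learned model: slightly wrong frequency and amplitude.
    return rotate(state, 0.105, gain=1.002)


def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def rollout_errors(start, horizon):
    """Compare one-step errors (from true states) with errors of a full model rollout."""
    true_state, model_state = start, start
    one_step, multi_step = [], []
    for _ in range(horizon):
        one_step.append(distance(model_step(true_state), true_step(true_state)))
        true_state = true_step(true_state)
        model_state = model_step(model_state)   # the model is fed its own prediction
        multi_step.append(distance(model_state, true_state))
    return one_step, multi_step


one_step, multi_step = rollout_errors((1.0, 0.0), horizon=100)
print("largest one-step error:", max(one_step))
print("error after 100 rollout steps:", multi_step[-1])
```

The one-step error stays small throughout, while the rollout error grows with the horizon because each prediction is conditioned on the previous, already imperfect, prediction.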

*A 450-step action sequence rolled out under a learned probabilistic model, with the figure’s position depicting the mean prediction and the shaded regions corresponding to one standard deviation away from the mean. The growing uncertainty and deterioration of a recognizable sinusoidal motion underscore accumulation of model errors.*

**Analyzing the trade-off**

This qualitative trade-off can be made more precise by writing a lower bound on a policy’s true return in terms of its model-estimated return:

*A lower bound on a policy’s true return in terms of its expected model return, the model rollout length, the policy divergence, and the model error on the current policy’s state distribution.*
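For reference, a sketch of the bound as I recall it from the accompanying paper is given below. It is reconstructed from the paper rather than from this text, so the exact constants should be checked against the paper. The notation is assumed as follows: $\eta[\pi]$ is the true return, $\eta^{\text{branch}}[\pi]$ is the return under short model rollouts branched from real states, $k$ is the rollout length, $\epsilon_{\pi}$ is the policy divergence, $\epsilon_{m'}$ is the model error on the current policy's state distribution, and $r_{\max}$ bounds the reward.

```latex
\eta[\pi] \;\ge\; \eta^{\text{branch}}[\pi]
  - 2 r_{\max} \left[
      \frac{\gamma^{k+1} \epsilon_{\pi}}{(1-\gamma)^{2}}
    + \frac{\gamma^{k} \epsilon_{\pi}}{1-\gamma}
    + \frac{k}{1-\gamma}\, \epsilon_{m'}
  \right]
```

The first two terms shrink as $\gamma^{k}$, while the last grows linearly in $k$.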

As expected, there is a tension involving the model rollout length. The model serves to reduce off-policy error via the terms exponentially decreasing in the rollout length $k$. However, increasing the rollout length also brings about increased discrepancy proportional to the model error.

We have two main conclusions from the above results:

- predictive models can generalize well enough for the incurred model bias to be worth the reduction in off-policy error, but
- compounding errors make long-horizon model rollouts unreliable.

A simple recipe for combining these two insights is to use the model only to perform short rollouts from all previously encountered real states instead of full-length rollouts from the initial state distribution. Variants of this procedure have been studied in prior works dating back to the classic Dyna algorithm, and we will refer to it generically as model-based policy optimization (MBPO), which we summarize in the pseudo-code below.
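The structure of this recipe can be sketched as below. This is a structural sketch only, not the paper's implementation: `fit_model` and `update_policy` are hypothetical callables standing in for model training and a model-free optimizer, the ordering of steps within an epoch is simplified, and the toy usage deliberately uses trivial placeholders for both.

```python
import random


def branched_rollouts(real_states, model, policy, k, n_starts, rng):
    """Roll the model out for k steps from states sampled out of real experience."""
    synthetic = []
    for _ in range(n_starts):
        state = rng.choice(real_states)       # branch from a previously seen real state
        for _ in range(k):                    # short rollout, not a full episode
            action = policy(state)
            next_state, reward = model(state, action)
            synthetic.append((state, action, reward, next_state))
            state = next_state
    return synthetic


def mbpo_loop(env_step, fit_model, update_policy, policy, state,
              n_epochs=3, env_steps=5, k=2, n_starts=4, seed=0):
    """Alternate between real data collection, model fitting, and policy updates."""
    rng = random.Random(seed)
    real_data, model_data = [], []
    for _ in range(n_epochs):
        for _ in range(env_steps):            # collect real transitions
            action = policy(state)
            next_state, reward = env_step(state, action)
            real_data.append((state, action, reward, next_state))
            state = next_state
        model = fit_model(real_data)          # train the predictive model on real data
        real_states = [transition[0] for transition in real_data]
        model_data += branched_rollouts(real_states, model, policy, k, n_starts, rng)
        policy = update_policy(policy, model_data)  # policy update on model data
    return policy, real_data, model_data


# Toy usage with deliberately trivial stand-ins for each component.
def toy_env(state, action):
    next_state = 0.9 * state + action
    return next_state, -next_state ** 2


policy, real_data, model_data = mbpo_loop(
    env_step=toy_env,
    fit_model=lambda data: toy_env,             # placeholder: pretend the model is perfect
    update_policy=lambda policy, data: policy,  # placeholder: no learning happens
    policy=lambda state: -0.5 * state,
    state=1.0,
)
```

The key design choice is in `branched_rollouts`: the rollout length `k` is kept small and every rollout starts from a real state, so the model is never asked to predict far from data it has seen.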

We found that this simple procedure, combined with a few important design decisions like using probabilistic model ensembles and a stable off-policy model-free optimizer, yields the best combination of sample efficiency and asymptotic performance. We also found that MBPO avoids the pitfalls that have prevented recent model-based methods from scaling to higher-dimensional states and long-horizon tasks.

*Learning curves of MBPO and five prior works on continuous control benchmarks. MBPO reaches the same asymptotic performance as the best model-free algorithms, often with only one-tenth of the data, and scales to state dimensions and horizon lengths that cause previous model-based algorithms to fail.*

This post is based on the following paper:

**When to Trust Your Model: Model-Based Policy Optimization**
Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine
*Neural Information Processing Systems (NeurIPS), 2019.*

*I would like to thank Michael Chang and Sergey Levine for their valuable feedback.*

1. In reinforcement learning, this variable is typically denoted by *a* for “action.” In control theory, it is denoted by *u* for “upravleniye” (or more faithfully, “управление”), which I am told is “control” in Russian.
2. We have omitted the initial state distribution to focus on those distributions affected by incorporating a learned model.

**References**

- KR Allen, KA Smith, and JB Tenenbaum. The tools challenge: rapid trial-and-error learning in physical problem solving. CogSci 2019.
- B Amos, IDJ Rodriguez, J Sacks, B Boots, and JZ Kolter. Differentiable MPC for end-to-end planning and control. NeurIPS 2018.
- T Anthony, Z Tian, and D Barber. Thinking fast and slow with deep learning and tree search. NIPS 2017.
- K Asadi, D Misra, S Kim, and ML Littman. Combating the compounding-error problem with a multi-step model. arXiv 2019.
- V Bapst, A Sanchez-Gonzalez, C Doersch, KL Stachenfeld, P Kohli, PW Battaglia, and JB Hamrick. Structured agents for physical construction. ICML 2019.
- ZI Botev, DP Kroese, RY Rubinstein, and P L’Ecuyer. The cross-entropy method for optimization. Handbook of Statistics, volume 31, chapter 3. 2013.
- J Buckman, D Hafner, G Tucker, E Brevdo, and H Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. NeurIPS 2018.
- K Chua, R Calandra, R McAllister, and S Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. NeurIPS 2018.
- I Clavera, J Rothfuss, J Schulman, Y Fujita, T Asfour, and P Abbeel. Model-based reinforcement learning via meta-policy optimization. CoRL 2018.
- R Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. CG 2006.
- M Deisenroth and CE Rasmussen. PILCO: A model-based and data-efficient approach to policy search. ICML 2011.
- F Ebert, C Finn, S Dasari, A Xie, A Lee, and S Levine. Visual foresight: model-based deep reinforcement learning for vision-based robotic control. arXiv 2018.
- V Feinberg, A Wan, I Stoica, MI Jordan, JE Gonzalez, and S Levine. Model-based value estimation for efficient model-free reinforcement learning. ICML 2018.
- C Finn and S Levine. Deep visual foresight for planning robot motion. ICRA 2017.
- S Gu, T Lillicrap, I Sutskever, and S Levine. Continuous deep Q-learning with model-based acceleration. ICML 2016.
- D Ha and J Schmidhuber. World models. NeurIPS 2018.
- T Haarnoja, A Zhou, P Abbeel, and S Levine. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML 2018.
- D Hafner, T Lillicrap, I Fischer, R Villegas, D Ha, H Lee, and J Davidson. Learning latent dynamics for planning from pixels. ICML 2019.
- LP Kaelbling, ML Littman, and AP Moore. Reinforcement learning: a survey. JAIR 1996.
- L Kaiser, M Babaeizadeh, P Milos, B Osinski, RH Campbell, K Czechowski, D Erhan, C Finn, P Kozakowski, S Levine, R Sepassi, G Tucker, and H Michalewski. Model-based reinforcement learning for Atari. arXiv 2019.
- A Krizhevsky, I Sutskever, and GE Hinton. ImageNet classification with deep convolutional neural networks. NIPS 2012.
- T Kurutach, I Clavera, Y Duan, A Tamar, and P Abbeel. Model-ensemble trust-region policy optimization. ICLR 2018.
- S Levine and V Koltun. Guided policy search. ICML 2013.
- W Li and E Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. ICINCO 2004.
- N Lipovetzky, M Ramirez, and H Geffner. Classical planning with simulators: results on the Atari video games. IJCAI 2015.
- Y Luo, H Xu, Y Li, Y Tian, T Darrell, and T Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. ICLR 2019.
- R Munos, T Stepleton, A Harutyunyan, and MG Bellemare. Safe and efficient off-policy reinforcement learning. NIPS 2016.
- A Nagabandi, K Konolige, S Levine, and V Kumar. Deep dynamics models for learning dexterous manipulation. arXiv 2019.
- A Nagabandi, GS Kahn, R Fearing, and S Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. ICRA 2018.
- J Oh, S Singh, and H Lee. Value prediction network. NIPS 2017.
- R Parr, L Li, G Taylor, C Painter-Wakefield, and ML Littman. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. ICML 2008.
- D Precup, R Sutton, and S Singh. Eligibility traces for off-policy policy evaluation. ICML 2000.
- J Schrittwieser, I Antonoglou, T Hubert, K Simonyan, L Sifre, S Schmitt, A Guez, E Lockhart, D Hassabis, T Graepel, T Lillicrap, and D Silver. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv 2019.
- D Silver, T Hubert, J Schrittwieser, I Antonoglou, M Lai, A Guez, M Lanctot, L Sifre, D Kumaran, T Graepel, TP Lillicrap, K Simonyan, and D Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv 2017.
- RS Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. ICML 1990.
- E Talvitie. Self-correcting models for model-based reinforcement learning. AAAI 2016.
- A Tamar, Y Wu, G Thomas, S Levine, and P Abbeel. Value iteration networks. NIPS 2016.
- Y Tassa, T Erez, and E Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. IROS 2012.
- H van Hasselt, M Hessel, and J Aslanides. When to use parametric models in reinforcement learning? NeurIPS 2019.
- R Veerapaneni, JD Co-Reyes, M Chang, M Janner, C Finn, J Wu, JB Tenenbaum, and S Levine. Entity abstraction in visual model-based reinforcement learning. CoRL 2019.
- T Wang, X Bao, I Clavera, J Hoang, Y Wen, E Langlois, S Zhang, G Zhang, P Abbeel, and J Ba. Benchmarking model-based reinforcement learning. arXiv 2019.
- M Watter, JT Springenberg, J Boedecker, and M Riedmiller. Embed to control: a locally linear latent dynamics model for control from raw images. NIPS 2015.
- G Williams, A Aldrich, and E Theodorou. Model predictive path integral control using covariance variable importance sampling. arXiv 2015.

This article was initially published on the BAIR blog, and appears here with the authors’ permission.
