

Data-driven deep reinforcement learning
By Aviral Kumar
One of the primary factors behind the success of machine learning approaches in open-world settings, such as image recognition and natural language processing, has been the ability of high-capacity deep neural network function approximators to learn generalizable models from large amounts of data. Deep reinforcement learning methods, however, require active online data collection, where the model actively interacts with its environment. This makes such methods hard to scale to complex real-world problems, where active data collection means that large datasets of experience must be collected for every experiment; this can be expensive and, for systems such as autonomous vehicles or robots, potentially unsafe. In a number of domains of practical interest, such as autonomous driving, robotics, and games, there exist plentiful amounts of previously collected interaction data which consist of informative behaviours and are a rich source of prior information. Deep RL algorithms that can utilize such prior datasets will not only scale to real-world problems, but will also lead to solutions that generalize substantially better. A data-driven paradigm for reinforcement learning will enable us to pre-train and deploy agents capable of sample-efficient learning in the real world.
In this work, we ask the following question: Can deep RL algorithms effectively leverage previously collected offline data and learn without interaction with the environment? We refer to this problem statement as fully off-policy RL, previously also called batch RL in the literature. A class of deep RL algorithms, known as off-policy RL algorithms, can, in principle, learn from previously collected data. Recent off-policy RL algorithms, such as Soft Actor-Critic (SAC), QT-Opt, and Rainbow, have demonstrated sample-efficient performance in a number of challenging domains such as robotic manipulation and Atari games. However, all of these methods still require online data collection, and their ability to learn from fully off-policy data is limited in practice. In this work, we show why existing deep RL algorithms can fail in the fully off-policy setting. We then propose effective solutions to mitigate these issues.
Why can’t off-policy deep RL algorithms learn from static datasets?
Let’s first study how state-of-the-art deep RL algorithms perform in the fully off-policy setting. We choose the Soft Actor-Critic (SAC) algorithm and investigate its performance in the fully off-policy setting. Figure 1 shows the training curve for SAC trained solely on varying amounts of previously collected expert demonstrations for the HalfCheetah-v2 gym benchmark task. Although the data demonstrates successful task completion, none of these runs succeed, with the corresponding Q-values diverging in some cases. At first glance, this resembles overfitting, as the evaluation performance deteriorates with more training (green curve), but increasing the size of the static dataset does not rectify the problem (orange (1e4 samples) vs. green (1e5 samples)), suggesting the issue is more complex.
Figure 1: Average return and the logarithm of the Q-value for HalfCheetah-v2 with varying amounts of expert data.
Most off-policy RL methods in use today, including SAC used in Figure 1, are based on approximate dynamic programming (though many also utilize importance-sampled policy gradients; see, for example, Liu et al. (2019)). The core component of approximate dynamic programming in deep RL is the value function or Q-function. The optimal Q-function obeys the optimal Bellman equation, given below:

$$Q^*(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[\max_{a'} Q^*(s', a')\Big]$$
Reinforcement learning then corresponds to minimizing the squared difference between the left-hand side and the right-hand side of this equation, also referred to as the mean squared Bellman error (MSBE):

$$\min_{Q}\; \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\Big[\big(Q(s, a) - \big(r(s, a) + \gamma \max_{a'} Q(s', a')\big)\big)^2\Big]$$
The MSBE is minimized on transition samples in a dataset $\mathcal{D}$ generated by a behavior policy $\pi_\beta$. Although minimizing the MSBE corresponds to a supervised regression problem, the targets for this regression are themselves derived from the current Q-function estimate.
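As a concrete (and heavily simplified) illustration, the MSBE can be computed as a regression loss on sampled transitions. The snippet below uses a tabular Q-function and hypothetical names; it is a sketch of the objective, not the actual SAC implementation:

```python
import numpy as np

def msbe_loss(Q, transitions, gamma=0.99):
    # Q: tabular Q-function of shape (n_states, n_actions)
    # transitions: (s, a, r, s') tuples sampled from the behavior policy's dataset
    targets = np.array([r + gamma * Q[s2].max() for (_, _, r, s2) in transitions])
    preds = np.array([Q[s, a] for (s, a, _, _) in transitions])
    # Supervised regression onto targets that are themselves derived from the current Q
    return ((preds - targets) ** 2).mean()
```

Note that the `max` inside the target is exactly where the next-state action choice enters the backup, which becomes important below.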
We can understand the source of the instability shown in Figure 1 by examining the form of the Bellman backup described above. The targets are calculated by maximizing the learned Q-values with respect to the action $a'$ at the next state $s'$ for Q-learning methods, or by computing an expected value under the policy at the next state for actor-critic methods that maintain an explicit policy alongside a Q-function. However, the Q-function estimator is only reliable on action inputs from the behavior policy $\pi_\beta$, which is the training distribution. As a result, naively maximizing the value may evaluate the estimator on actions that lie far outside the training distribution, resulting in pathological values that incur large absolute error from the optimal desired Q-values. We refer to such actions as out-of-distribution (OOD) actions. We call the error in the Q-values caused by backing up target values corresponding to OOD actions bootstrapping error. A schematic of this problem is shown in Figure 2 below.
Figure 2: Incorrectly high Q-values for OOD actions may be used for backups, leading to accumulation of error.
Using an incorrect Q-value for computing the backup leads to accumulation of error: minimizing the MSBE perfectly propagates all imperfections in the target values into the current Q-function estimator. We refer to this process as accumulation or propagation of bootstrapping error over iterations of training. The agent is unable to correct these errors because it cannot gather ground-truth return information by actively exploring the environment. Accumulation of bootstrapping error can make Q-values diverge to infinity (for example, the blue and green curves in Figure 1) or fail to converge to the correct values (for example, the red curve in Figure 1), leading to final performance that is substantially worse than even the average performance in the dataset.
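To see how querying OOD actions produces pathological values, consider a toy sketch (the numbers and the linear model are illustrative assumptions, not from the paper): a Q-estimator is fit only on actions covered by the behavior policy, and the backup then naively maximizes it over the full action range:

```python
import numpy as np

# True Q-values over a 1-D action space, peaked at a = 0.1.
true_q = lambda a: 1.0 - (np.asarray(a) - 0.1) ** 2

# The behavior policy only covers actions in [0, 0.4] ...
a_train = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
coeffs = np.polyfit(a_train, true_q(a_train), deg=1)  # linear Q-estimator
q_hat = np.poly1d(coeffs)

# ... yet the backup maximizes the estimator over the FULL range [-1, 1],
# landing on an out-of-distribution action with an erroneously high value.
a_grid = np.linspace(-1.0, 1.0, 201)
a_star = a_grid[np.argmax(q_hat(a_grid))]
# q_hat picks a_star = -1.0, far outside the data, where the true value is poor
```

Backing up `q_hat(a_star)` as a target would then propagate this extrapolation error into the Q-values of earlier states.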
Towards deep RL algorithms that learn in the presence of OOD actions
How can we develop RL algorithms that learn from static data without being affected by OOD actions? We will first review some of the existing approaches in the literature for solving this problem, and then describe our recent work, BEAR, which tackles it. Broadly, there are two classes of methods for solving this problem.
- Behavioral cloning (BC) based methods: When the static dataset is generated by an expert, one can utilize behavioral cloning to mimic the expert policy, as is done in imitation learning. In a generalized setting, where the behavior policy can be suboptimal but reward information is accessible, one can choose to mimic only a subset of good action decisions from the entire dataset. This is the idea behind prior works such as reward-weighted regression (RWR), relative entropy policy search (REPS), MARWIL, some very recent works such as ABM, and a recent work from our lab called advantage-weighted regression (AWR), in which we show that an advantage-weighted form of behavioral cloning, which assigns higher likelihoods to demonstration actions that receive higher advantages, can also be used in such situations. Such a method trains only on actions observed in the dataset and hence avoids OOD actions completely.
- Dynamic programming (DP) based methods: Dynamic programming methods are appealing in fully off-policy RL scenarios because of their ability to pool information across trajectories, unlike BC-based methods, which are implicitly constrained to lie in the vicinity of the best-performing trajectory in the static dataset. For example, Q-iteration on a dataset consisting of all transitions in an MDP should return the optimal policy at convergence, whereas the previously described BC-based methods may fail to recover optimality if the individual trajectories are highly suboptimal. Within this class, recent work includes batch-constrained Q-learning (BCQ), which constrains the trained policy distribution to lie close to the behavior policy that generated the dataset. This is an optimal strategy when the static dataset is generated by an expert policy, but it might be suboptimal if the data comes from an arbitrarily suboptimal policy. Other recent works, such as G-Learning, KL-Control, and BRAC, implement closeness to the behavior policy by solving a KL-constrained RL problem. SPIBB selectively constrains the learned policy to match the behavior policy in probability density on less frequent actions.
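As a rough sketch of the BC-based family above, an advantage-weighted behavioral cloning loss in the spirit of RWR/AWR can be written as follows (the function names and the simple exponential weighting are simplified assumptions, not the exact objective of any one paper):

```python
import numpy as np

def awr_weights(advantages, beta=1.0):
    # Exponentiated advantages: better-than-average actions get larger weights
    return np.exp(advantages / beta)

def weighted_bc_loss(log_probs, advantages, beta=1.0):
    # Weighted negative log-likelihood of the *dataset* actions only,
    # so the learned policy never queries out-of-distribution actions.
    return -(awr_weights(advantages, beta) * log_probs).mean()
```

Because the likelihood is evaluated only on actions that actually appear in the dataset, this family sidesteps OOD actions entirely, at the cost of being tied to the dataset's behavior.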
The key question we pose in our work is: Which policies can be reliably used for backups without backing up OOD actions? Once this question is answered, the job of an RL algorithm reduces to picking the best policy in this set. In our work, we provide a theoretical characterization of this set of policies and use insights from the theory to propose a practical dynamic programming-based deep RL algorithm, called BEAR, that learns from purely static data.
Bootstrapping Error Accumulation Reduction (BEAR)
Figure 3: Illustration of the support constraint used in BEAR (right) and a distribution-matching constraint (middle).
The key idea behind BEAR is to constrain the learned policy to lie within the support (Figure 3, right) of the behavior policy distribution. This is in contrast to distribution matching (Figure 3, middle): BEAR does not constrain the learned policy to be close in distribution to the behavior policy, but only requires that the learned policy place nonzero probability mass on actions with non-negligible behavior policy density. We refer to this as a support constraint. As an example, in a setting with a uniform-at-random behavior policy, a support constraint allows dynamic programming to learn an optimal, deterministic policy. A distribution-matching constraint, however, would require the learned policy to be highly stochastic (and thus not optimal); for instance, in Figure 3 (middle), the learned policy is constrained to be one of the stochastic purple policies, whereas in Figure 3 (right), the learned policy can be a (near-)deterministic yellow policy. For readers interested in theory, the theoretical insight behind this choice is that a support constraint enables us to control error propagation by upper-bounding concentrability under the learned policy, while providing the capacity to reduce divergence from the optimal policy.
How do we enforce that the learned policy satisfies the support constraint? In practice, we use the sampled Maximum Mean Discrepancy (MMD) distance between actions as a measure of support divergence. Letting $x, x' \sim P$, $y, y' \sim Q$, and $k$ be any RBF kernel, we have:

$$\text{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}\big[k(x, x')\big] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}\big[k(x, y)\big] + \mathbb{E}_{y, y' \sim Q}\big[k(y, y')\big]$$
A simple code snippet for computing MMD is shown below:
import torch

def gaussian_kernel(x, y, sigma=0.1):
    # Pairwise RBF kernel between action batches x: (n, d) and y: (m, d)
    return torch.exp(-(x.unsqueeze(1) - y.unsqueeze(0)).pow(2).sum(-1) / (2 * sigma ** 2))

def compute_mmd(x, y):
    k_x_x = gaussian_kernel(x, x)
    k_x_y = gaussian_kernel(x, y)
    k_y_y = gaussian_kernel(y, y)
    return torch.sqrt(k_x_x.mean() + k_y_y.mean() - 2 * k_x_y.mean())
The MMD is amenable to stochastic gradient-based training, and we show that computing the MMD using only a few samples from each of the two distributions $\pi$ and $\pi_\beta$ provides sufficient signal to quantify differences in support, but not in probability density, making it a preferred measure for implementing the support constraint. To sum up, the new (constrained) policy improvement step in an actor-critic setup is given by:

$$\pi_\phi := \max_{\pi}\; \mathbb{E}_{s \sim \mathcal{D}}\, \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q(s, a)\big] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \mathcal{D}}\big[\text{MMD}(\pi_\beta(\cdot \mid s), \pi(\cdot \mid s))\big] \leq \varepsilon$$
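Such a constrained step is typically optimized in its dual (Lagrangian) form. Below is a minimal sketch of that dual objective, assuming for simplicity a fixed multiplier `lam` rather than an automatically tuned one; `mmd_value` could come from a sampled estimate such as `compute_mmd` above, and all names here are illustrative:

```python
import numpy as np

def bear_actor_loss(q_values, mmd_value, lam, eps=0.05):
    # Dual form of the constrained improvement step:
    #   maximize E[Q(s, a)]  subject to  MMD <= eps
    # becomes: minimize -E[Q(s, a)] + lam * (MMD - eps)
    return -np.mean(q_values) + lam * (mmd_value - eps)
```

When the sampled MMD exceeds the threshold, the penalty term pushes the policy back toward the support of the behavior policy; when it is below the threshold, the Q-maximization term dominates.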
Support constraint vs. distribution-matching constraint
Some works, for example KL-Control, BRAC, and G-Learning, argue that a distribution-matching constraint might suffice in such fully off-policy RL problems. In this section, we take a slight detour to analyze this choice. In particular, we provide an instance of an MDP where a distribution-matching constraint can lead to arbitrarily suboptimal behavior, while support matching does not suffer from this issue.
Consider the 1D lineworld MDP shown in Figure 4 below. Two actions (left and right) are available to the agent at each state. The agent is tasked with reaching the goal state $G$, starting from state $S$, and the corresponding per-step reward values are shown in Figure 4(a). The agent is only allowed to learn from behavior data generated by a policy that performs actions with the probabilities described in Figure 4(b). In particular, this behavior policy executes the suboptimal action at states in between $S$ and $G$ with a high likelihood of 0.9; however, both actions are in-distribution at all of these states.
Figure 4: Example 1D lineworld and the corresponding behavior policy.
In Figure 5(a), we show that the policy learned with a distribution-matching constraint can be arbitrarily suboptimal; in fact, the probability of reaching the goal $G$ by rolling out this policy is very small and tends to 0 as the environment is made larger. However, in Figure 5(b), we show that a support constraint can recover an optimal policy with probability 1.
Figure 5: Policies learned via distribution matching and support matching in the 1D lineworld shown in Figure 4.
Why does distribution matching fail here? Let us analyze the case when we use a penalty for distribution matching. If the penalty is enforced tightly, then the agent will be forced to mainly execute the wrong action in states between $S$ and $G$, leading to suboptimal behavior. However, if the penalty is not enforced tightly, with the intention of achieving a better policy than the behavior policy, the agent will perform backups using OOD actions at states to the left of $S$, and these backups will eventually affect the Q-value at state $S$. This phenomenon leads to an incorrect Q-function, and hence a wrong policy: possibly one that goes to the left starting from $S$ instead of moving towards $G$, as OOD-action backups combined with the overestimation bias of Q-learning might make the left action at state $S$ look more preferable. Figure 6 shows that some states need a strong penalty/constraint (to prevent OOD backups) while others require a weak penalty/constraint (to achieve optimality) for distribution matching to work; however, this cannot be achieved via conventional distribution-matching approaches.
Figure 6: Analysis of the strength of the distribution-matching constraint needed at different states.
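The tight-penalty failure mode above can be quantified with a back-of-the-envelope calculation: if the constrained policy may take the correct action with probability at most 0.1 at each intermediate state, the chance of completing a single pass from $S$ to $G$ decays exponentially in the length of the lineworld (a toy model assuming independent steps and no backtracking, not a computation from the paper):

```python
def success_probability(n_states, p_correct=0.1):
    # Probability of choosing the correct action at every one of the
    # n intermediate states in a single pass from S to G.
    return p_correct ** n_states

# The chance of reaching the goal vanishes as the lineworld grows longer,
# e.g. success_probability(5) is already one in a hundred thousand.
```

This matches the qualitative claim in Figure 5(a): the distribution-matched policy's success probability tends to 0 as the environment is made larger.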
So, how does BEAR perform in practice?
In our experiments, we evaluated BEAR on three kinds of datasets, generated by: (1) a partially trained, medium-return policy; (2) a random, low-return policy; and (3) an expert, high-return policy. Setting (1) resembles practical settings such as autonomous driving or robotics, where offline data is collected via scripted policies for robotic grasping or consists of human driving data (which may not be perfect), respectively. Such data is useful as it demonstrates non-random, yet still not optimal, behavior, and we expect training on offline data to be most useful in this setting. Good performance on both (2) and (3) demonstrates the versatility of an algorithm across arbitrary dataset compositions.
For each dataset composition, we compare BEAR to a number of baselines, including BC, BCQ, and deep Q-learning from demonstrations (DQfD). In general, we find that BEAR outperforms the best-performing baseline in setting (1), and BEAR is the only algorithm capable of successfully learning a better-than-dataset policy in both (2) and (3). We show some learning curves below. BC- or BCQ-type methods usually do not perform well with random data, partly because of their use of a distribution-matching constraint, as described earlier.
Figure 7: Performance on the (1) medium-quality dataset: BEAR outperforms the best-performing baseline.
Figure 8: (3) Expert data: BEAR recovers the performance in the expert dataset and performs similarly to other methods such as BC.
Figure 9: (2) Random data: BEAR recovers better-than-dataset performance and is close to the best-performing algorithm (Naive RL).
Future Directions and Open Problems
Most of the prior datasets for real-world problems, such as RoboNet and StarCraft replays, consist of multimodal behavior generated by different users and robots. Hence, one of the next steps to look at is learning from diverse mixtures of policies. How can we effectively learn policies from a static dataset that consists of a diverse range of behavior, possibly interaction from a diverse range of tasks, in the spirit of what we encounter in the real world? This question is mostly unanswered at the moment. Some very recent work, such as REM, shows that simple modifications to existing distributional off-policy RL algorithms in the Atari domain can enable fully off-policy learning on the entire interaction dataset generated from the training run of a separate DQN agent. However, the best solution for learning from a dataset generated by an arbitrary mixture of policies, which is more likely the case in practical problems, is unclear.
A rigorous theoretical characterization of the best achievable policy as a function of a given dataset is also an open problem. In our paper, we analyze this question by looking at the typically used assumptions of bounded concentrability in the error and convergence analysis of fitted Q-iteration. Which other assumptions can be applied to analyze this problem? Which of these assumptions is least restrictive and practically feasible? And what is the theoretical optimum of what can be achieved solely by offline training?
We hope that our work, BEAR, takes us a step closer to getting the most out of prior datasets in an RL algorithm. A data-driven paradigm of RL, where one could (pre-)train RL algorithms with large amounts of prior data, will enable us to go beyond the active exploration bottleneck, giving us agents that can be deployed and keep learning continuously in the real world.
This blog post is based on our recent paper:
- Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction
Aviral Kumar*, Justin Fu*, George Tucker, Sergey Levine
In Advances in Neural Information Processing Systems, 2019
The paper and code are available online, and a slide deck explaining the algorithm is available here. I would like to thank Sergey Levine for his valuable feedback on earlier versions of this blog post.
This article was initially published on the BAIR blog, and appears here with the authors’ permission.