Levine, S., Wagener, N., & Abbeel, P. (2015). Learning Contact-Rich Manipulation Skills with Guided Policy Search. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). Retrieved from http://arxiv.org/abs/1501.05611v1

Summary

This paper builds on previous work (Levine & Abbeel, NIPS 2014) that learns policies for object manipulation using a two-step process. The first step is to learn local linear-Gaussian controllers for a few specific training instances. The second step uses these linear-Gaussian controllers to train the parameters of a more complex policy (e.g., a neural network) with a method called guided policy search.

Methods

n/a

Algorithm

Linear-Gaussian controllers

First, they assume linear-Gaussian dynamics, i.e.:

$$p(\mathbf{x}_{t+1}\vert\mathbf{x}_t,\mathbf{u}_t)=\mathcal{N}\big(f_{\mathbf{x}t}\mathbf{x}_t+f_{\mathbf{u}t}\mathbf{u}_t+f_{ct},\,\mathbf{F}_t\big)$$

They can fit these dynamics using samples collected under the previous version of the controller, and from there compute the linear-Gaussian controller:

$$p(\mathbf{u}_t\vert\mathbf{x}_t)=\mathcal{N}\big(\hat{\mathbf{u}}_t+\mathbf{k}_t+\mathbf{K}_t(\mathbf{x}_t-\hat{\mathbf{x}}_t),\;Q_{\mathbf{u},\mathbf{u}t}^{-1}\big)$$
where $Q_{\mathbf{u},\mathbf{u}t}$ is the Hessian of the $Q$-function with respect to the controls, $\mathbf{k}_t=-Q_{\mathbf{u},\mathbf{u}t}^{-1}Q_{\mathbf{u}t}$, $\mathbf{K}_t=-Q_{\mathbf{u},\mathbf{u}t}^{-1}Q_{\mathbf{u},\mathbf{x}t}$, and (I think?) $\hat{\mathbf{u}}_t$ and $\hat{\mathbf{x}}_t$ are the mean control and state from the samples. The $Q$-function is computed with a dynamic programming (LQR-style backward) pass, which I won’t go into the details of.
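
To make this concrete, here is a rough numpy sketch of my own (not the paper’s implementation): fit time-varying linear-Gaussian dynamics by least squares on sampled trajectories, then run an LQR-style backward pass to recover $\mathbf{k}_t$ and $\mathbf{K}_t$. The quadratic cost and the random “samples” are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (my own stand-in): 4-D state, 2-D control, short horizon.
T, dX, dU = 20, 4, 2
N = 50  # number of sampled trajectories

# Pretend these came from rolling out the previous controller on the robot.
X = rng.standard_normal((N, T + 1, dX))
U = rng.standard_normal((N, T, dU))

# Step 1: fit time-varying linear-Gaussian dynamics by linear regression,
# x_{t+1} ~ N(fx_t x_t + fu_t u_t + fc_t, F_t).
fx, fu, fc = np.zeros((T, dX, dX)), np.zeros((T, dX, dU)), np.zeros((T, dX))
for t in range(T):
    XU = np.hstack([X[:, t], U[:, t], np.ones((N, 1))])   # regressors
    W, *_ = np.linalg.lstsq(XU, X[:, t + 1], rcond=None)  # least squares fit
    fx[t], fu[t], fc[t] = W[:dX].T, W[dX:dX + dU].T, W[-1]

# Step 2: LQR-style backward pass to get k_t and K_t.
# Quadratic cost 0.5 x'Qx + 0.5 u'Ru is a stand-in for the paper's cost.
Qc, Rc = np.eye(dX), 0.1 * np.eye(dU)
V, v = np.zeros((dX, dX)), np.zeros(dX)        # value function at t+1
k, K = np.zeros((T, dU)), np.zeros((T, dU, dX))
for t in reversed(range(T)):
    Qxx = Qc + fx[t].T @ V @ fx[t]
    Quu = Rc + fu[t].T @ V @ fu[t]             # Hessian w.r.t. controls
    Qux = fu[t].T @ V @ fx[t]
    qx = fx[t].T @ (v + V @ fc[t])
    qu = fu[t].T @ (v + V @ fc[t])
    k[t] = -np.linalg.solve(Quu, qu)           # open-loop term
    K[t] = -np.linalg.solve(Quu, Qux)          # feedback gains
    V = Qxx + Qux.T @ K[t]                     # Riccati-style value update
    v = qx + Qux.T @ k[t]
# k_t, K_t, and Quu then parameterize the linear-Gaussian controller above.
```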

The controller is then updated subject to the constraint that the KL-divergence between the trajectory distribution $p(\tau)=\prod_t p(\mathbf{x}_{t+1}\vert \mathbf{x}_t,\mathbf{u}_t)p(\mathbf{u}_t\vert \mathbf{x}_t)$ and the old trajectory distribution is not more than a threshold $\epsilon$. They solve this optimization using dual gradient descent, which I also won’t go into the details of here.
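
The dual gradient descent part is easier to see in a toy one-dimensional version of the same kind of problem (my own construction; in the paper the inner minimization is itself an LQR-style solve over trajectories). The inner step minimizes the Lagrangian for a fixed multiplier $\eta$, and the outer step moves $\eta$ until the KL constraint is approximately tight.

```python
import numpy as np

# Toy stand-in for a KL-constrained update: minimize E_p[c(x)] with c(x) = x
# over p = N(mu, sigma^2) (sigma fixed), subject to KL(p || p_old) <= eps,
# where p_old = N(mu_old, sigma^2).
mu_old, sigma, eps = 0.0, 1.0, 0.1

def solve_lagrangian(eta):
    """Minimize mu + eta * KL(p || p_old) over mu (closed form in 1-D)."""
    return mu_old - sigma**2 / eta

def kl(mu):
    """KL divergence between two Gaussians with equal variance."""
    return (mu - mu_old)**2 / (2 * sigma**2)

# Dual gradient descent: alternate between minimizing the Lagrangian over the
# primal variable and taking a gradient step on the multiplier eta.
eta, lr = 1.0, 1.0
for _ in range(200):
    mu = solve_lagrangian(eta)
    eta = max(1e-6, eta + lr * (kl(mu) - eps))   # ascend the dual, keep eta > 0

print(f"mu = {mu:.3f}, KL = {kl(mu):.3f} (target eps = {eps})")
```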

Levine et al. minimize the number of samples needed by also estimating a Gaussian mixture model prior on the global dynamics, and by adaptively adjusting both the step size $\epsilon$ and the sample count according to an estimate of the additional cost incurred at each iteration due to unmodeled changes in the dynamics.
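
Roughly, the GMM acts as a prior that lets each time step’s dynamics be fit from only a handful of samples. Here is a simplified sketch of that idea (my own approximation; the paper actually constructs a normal-inverse-Wishart prior from the mixture, rather than the simple moment blending below):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
dX, dU = 4, 2
# Stand-in data: pooled (x_t, u_t, x_{t+1}) tuples from all steps/iterations.
pooled = rng.standard_normal((2000, dX + dU + dX))

gmm = GaussianMixture(n_components=8, covariance_type="full").fit(pooled)

def fit_dynamics_with_prior(xux_t, n_prior=5):
    """Fit N(x_{t+1} | fx x_t + fu u_t + fc, F) at one time step, blending the
    empirical moments of the (few) current samples with GMM prior moments."""
    # Prior moments: responsibility-weighted average of mixture components.
    resp = gmm.predict_proba(xux_t).mean(axis=0)
    mu0 = resp @ gmm.means_
    Sigma0 = np.einsum('k,kij->ij', resp, gmm.covariances_)
    # Blend prior and empirical moments (n_prior acts like pseudo-counts).
    n = len(xux_t)
    mu = (n * xux_t.mean(axis=0) + n_prior * mu0) / (n + n_prior)
    diff = xux_t - mu
    Sigma = (diff.T @ diff + n_prior * Sigma0) / (n + n_prior)
    # Condition the joint Gaussian on (x_t, u_t) to read off linear dynamics.
    i, j = slice(0, dX + dU), slice(dX + dU, dX + dU + dX)
    fxu = np.linalg.solve(Sigma[i, i], Sigma[i, j]).T   # [fx fu]
    fc = mu[j] - fxu @ mu[i]
    F = Sigma[j, j] - fxu @ Sigma[i, j]
    return fxu[:, :dX], fxu[:, dX:], fc, F

fx, fu, fc, F = fit_dynamics_with_prior(pooled[:10])   # only 10 samples needed
```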

The learned linear-Gaussian dynamics and controllers can then be used to train a more sophisticated policy with a large number of parameters (e.g., a neural network). Rather than optimizing this policy directly, supervised learning is used to train it on trajectories sampled from the linear-Gaussian controllers (this is kind of a form of learning from demonstration, I think, where the demonstrations are a mix of trajectories actually executed on the real robot with the linear-Gaussian controllers and trajectories synthesized from the learned models). The trajectories are also reoptimized to better match the state distribution of the policy, i.e.:

$$\min_{\theta,\,p(\tau)}\; E_{p}\big[\ell(\tau)\big] \quad \text{s.t.} \quad D_{\text{KL}}\big(p(\mathbf{x}_t)\pi_\theta(\mathbf{u}_t\vert\mathbf{x}_t)\,\big\Vert\,p(\mathbf{x}_t,\mathbf{u}_t)\big)=0 \quad \forall t$$
where $\ell(\tau)$ is the trajectory cost, $p(\tau)$ is the trajectory distribution (obtained from the linear-Gaussian dynamics and controllers), and $\pi_\theta$ is the parameterized policy. They solve this with another application of dual gradient descent.
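
The supervised step itself is simple in spirit. Below is a minimal sketch of my own that just regresses a small network onto the controllers’ actions at sampled states; the paper’s actual objective is a KL-divergence term inside the constrained problem above, not plain mean-squared error, and the states/actions here are random stand-ins.

```python
import torch
import torch.nn as nn

dX, dU = 4, 2
# Stand-in training data: states sampled from p(tau) and the linear-Gaussian
# controllers' mean actions at those states.
states = torch.randn(5000, dX)
actions = torch.randn(5000, dU)

# A small neural-network policy pi_theta(u | x) (mean only, for simplicity).
policy = nn.Sequential(
    nn.Linear(dX, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, dU),
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Supervised learning: fit the policy to the controllers' demonstrations.
for step in range(100):
    idx = torch.randperm(len(states))[:256]          # random minibatch
    loss = nn.functional.mse_loss(policy(states[idx]), actions[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
```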

Takeaways

This is a cool way to combine deep learning with a more structured modeling approach. Essentially, the linear-Gaussian controllers learn a generative model of the dynamics for specific motions, from which an optimal policy can be extracted with relatively little computation. The generative models are then used to train the discriminative model (the neural network). It’s an interesting contrast, though, because I generally think of generative models as being more general, whereas here there are multiple generative models, each of which is local and therefore relatively fragile. But by combining them, a more powerful and flexible discriminative model can be trained.

Overall, this approach seems quite similar to the approach taken by Paraschos et al., at least in terms of using linear-Gaussian dynamics to learn local (or primitive) motion policies. This paper, though, goes a step further and shows how those local motion policies can then be combined to train a more sophisticated and general policy.