Xie, C., Patil, S., Moldovan, T., Levine, S., & Abbeel, P. (2015). Model-based Reinforcement Learning with Parametrized Physical Models and Optimism-Driven Exploration. arXiv preprint arXiv:1509.06824v1 [cs.LG]. Retrieved from http://arxiv.org/abs/1509.06824

Summary

In this paper, Xie et al. model the physical dynamics with a parametrization that is linear in its unknown parameters, given hand-specified dynamics features, and use this model to perform optimism-driven exploration for model-based reinforcement learning. For certain dynamical systems, such as a pendulum or a robot arm, this works remarkably well because the equations of motion can be factored into a linear least-squares problem in which some quantities are observed (e.g. joint angles) and the rest must be learned (e.g. link masses or lengths) as part of the RL process. The factorization itself can be derived by hand or with the help of libraries like SymPyBotics (which looks super cool!).

Methods

n/a

Algorithm

There are a few components to this algorithm, which I will describe separately.

Optimism-based exploration

The idea behind optimism-based exploration is related to the use of slack variables in optimization. Essentially, you have some estimate of the dynamics, and you can act greedily with respect to that estimate:
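In MPC form, this greedy objective looks something like the following (my notation: $c$ is the task cost and $T$ the planning horizon):

$$\min_{\tau_{1:T}} \sum_{t=1}^{T} c(\mathbf{q}_t, \mathbf{\dot{q}}_t, \tau_t) \quad \text{s.t.} \quad (\mathbf{q}_{t+1}, \mathbf{\dot{q}}_{t+1}) = \hat{f}(\mathbf{q}_t, \mathbf{\dot{q}}_t, \tau_t)$$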

where $\hat{f}$ is the estimated forward dynamics model. To make this optimism-based, a slack variable is introduced into the dynamics:
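Something like the following, reusing the $\mathbf{\tilde{f}}$ notation from the least-squares section below (how $\boldsymbol{\xi}_t$ is kept in check is described next):

$$\min_{\tau_{1:T},\,\boldsymbol{\xi}_{1:T}} \sum_{t=1}^{T} c(\mathbf{q}_t, \mathbf{\dot{q}}_t, \tau_t) \quad \text{s.t.} \quad (\mathbf{q}_{t+1}, \mathbf{\dot{q}}_{t+1}) = \mathbf{\tilde{f}}(\mathbf{q}_t, \mathbf{\dot{q}}_t, \tau_t, \boldsymbol{\xi}_t)$$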

where $\mathbf{\xi}_t$ is a slack variable. This variable is constrained based on the amount of uncertainty in the estimate of the dynamics $\hat{f}$. Rather than explicitly estimating this uncertainty, Xie et al. simply include a penalty term of the form $\frac{1}{m}\lVert\mathbf{\xi}_t\rVert^2$ where $m$ is proportional to the number of samples.

Linear least squares model

The equations of motion can be factored into a linear equation:
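Presumably this is the standard manipulator-identification form, with $\tau$ the applied joint torques:

$$H(\mathbf{q}, \mathbf{\dot{q}}, \mathbf{\ddot{q}})\,\Delta = \tau$$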

where $\Delta$ is the vector of system parameters to be estimated and $H$ is a matrix of physical features computed from observed quantities (e.g. joint angles, velocities, and accelerations). Given observations of $[\mathbf{q},\mathbf{\dot{q}}, \mathbf{\ddot{q}}]$ and the applied torques $\tau$, $\Delta$ can be estimated with least-squares regression, yielding the forward dynamics model $\mathbf{\tilde{f}}(\mathbf{q}_t,\mathbf{\dot{q}}_t,\tau_t,\mathbf{\xi}_t)$ that is then used in model-predictive control.
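To make this concrete, here is a toy sketch (my own example, not from the paper) of fitting $\Delta$ for a frictionless point-mass pendulum, whose equation of motion $m l^2 \ddot{q} + m g l \sin(q) = \tau$ is linear in $\Delta = [m l^2,\ m g l]$ with features $H = [\ddot{q},\ \sin(q)]$:

```python
# Toy system identification for a frictionless point-mass pendulum:
#   m*l^2 * qdd + m*g*l * sin(q) = tau
# The unknown parameter vector is Delta = [m*l^2, m*g*l] and the feature
# matrix is H = [qdd, sin(q)], so tau = H @ Delta is linear in Delta.
import numpy as np

rng = np.random.default_rng(0)
m, l, g = 1.5, 0.7, 9.81                         # "true" parameters to recover
true_delta = np.array([m * l**2, m * g * l])

# Simulated observations of joint angle, joint acceleration, and applied torque.
q = rng.uniform(-np.pi, np.pi, size=200)
qdd = rng.uniform(-5.0, 5.0, size=200)
tau = true_delta[0] * qdd + true_delta[1] * np.sin(q)
tau += rng.normal(scale=0.01, size=tau.shape)    # sensor noise

# Stack the physical features and solve the linear least-squares problem.
H = np.column_stack([qdd, np.sin(q)])
delta_hat, *_ = np.linalg.lstsq(H, tau, rcond=None)
print("true   Delta:", true_delta)
print("fitted Delta:", delta_hat)
```

For a multi-link arm the feature matrix gets messy, which is where a tool like SymPyBotics (mentioned in the summary above) comes in.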

Model-predictive control

The MPC algorithm used by Xie et al. is the same as that used by Kitaev et al. (an iterative linear quadratic regulator, or iLQR). Briefly:

The algorithm iteratively computes first order expansions of the dynamics and second order expansions of the cost around the current trajectory, and then analytically computes the sequence of optimal controls with respect to this approximation. This sequence of controls is then executed to obtain a new trajectory, and the process repeats until convergence or for a fixed number of iterations.
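As a concrete (if toy) illustration of that loop, here is a minimal single backward/forward pass of iLQR for a 1-D double integrator with quadratic cost. This is my own sketch, not the paper's implementation; for this linear-quadratic case one pass is already optimal, whereas the paper re-linearizes the nonlinear arm dynamics around each new trajectory and repeats.

```python
# Minimal iLQR backward/forward pass on a 1-D double integrator with
# quadratic cost. For nonlinear dynamics this function would be called
# repeatedly, re-linearizing around the new trajectory each time.
import numpy as np

dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])   # state = (position, velocity)
B = np.array([[0.0], [dt]])             # control = acceleration
Q = np.diag([1.0, 0.1])                 # state cost  x' Q x
R = np.array([[0.01]])                  # control cost u' R u
T = 50                                  # planning horizon

def ilqr_iteration(x0, u_seq):
    """One backward/forward pass around the trajectory induced by u_seq."""
    # Roll out the current nominal trajectory.
    x_seq = [x0]
    for u in u_seq:
        x_seq.append(A @ x_seq[-1] + B @ u)

    # Backward pass: quadratic value-function expansion along the trajectory.
    V_x, V_xx = 2 * Q @ x_seq[-1], 2 * Q          # terminal cost x' Q x
    k_seq, K_seq = [], []
    for t in reversed(range(T)):
        Q_x = 2 * Q @ x_seq[t] + A.T @ V_x
        Q_u = 2 * R @ u_seq[t] + B.T @ V_x
        Q_xx = 2 * Q + A.T @ V_xx @ A
        Q_uu = 2 * R + B.T @ V_xx @ B
        Q_ux = B.T @ V_xx @ A
        k = -np.linalg.solve(Q_uu, Q_u)           # feedforward term
        K = -np.linalg.solve(Q_uu, Q_ux)          # feedback gain
        V_x = Q_x + K.T @ Q_uu @ k + K.T @ Q_u + Q_ux.T @ k
        V_xx = Q_xx + K.T @ Q_uu @ K + K.T @ Q_ux + Q_ux.T @ K
        k_seq.append(k)
        K_seq.append(K)
    k_seq.reverse()
    K_seq.reverse()

    # Forward pass: execute the locally optimal controls.
    x, new_u = x0, []
    for t in range(T):
        u = u_seq[t] + k_seq[t] + K_seq[t] @ (x - x_seq[t])
        new_u.append(u)
        x = A @ x + B @ u
    return new_u, x

x0 = np.array([1.0, 0.0])                        # start 1 m from the goal, at rest
u_seq = [np.zeros(1) for _ in range(T)]
u_seq, x_final = ilqr_iteration(x0, u_seq)
print("final state:", x_final)                   # should end up near the origin
```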

Takeaways

By using prior knowledge of the structure of the dynamics, it is possible to quickly learn the model parameters needed for model-predictive control. This is really powerful, because model-predictive control can be far more sample-efficient and accurate than a model-free method, especially for systems with complex dynamics (e.g. a 7-DOF robot arm). If the structure of the dynamics can be specified (e.g. the arm's kinematic chain and how the joint angles enter the equations of motion), then the rest can be inferred quickly because the really hard part (the structure) is already in place; only the parameters of that structure need to be learned.