Schulman, J., Lee, A. X., Ho, J., & Abbeel, P. (2013). Tracking deformable objects with point clouds. Proceedings of the IEEE International Conference on Robotics and Automation. doi:10.1109/ICRA.2013.6630714

Summary

In this work, Schulman et al. propose an algorithm for tracking deformable objects (e.g. rope, cloth, sponges) using point clouds and a variation of the EM algorithm augmented with a real-time physics engine. They show that their algorithm can run in real-time and is fairly accurate provided there are not too many occlusions from external sources (e.g. human hands). Here is a video demonstrating their algorithm:

https://www.youtube.com/watch?v=3ps7gb_oxxI

Methods

n/a

Algorithm

Schulman et al. define a probabilistic generative model for point tracking, which includes the following variables (a small code sketch of the model follows the list):

  • $\mathbf{m}_{1:K}=\{\mathbf{m}_1, \mathbf{m}_2, \ldots{}, \mathbf{m}_K\}$ is the set of $K$ points in 3D space which defines the (estimated) configuration/model of the object.
  • $\mathbf{c}_{1:N}=\{\mathbf{c}_1, \mathbf{c}_2, \ldots{}, \mathbf{c}_N\}$ is the set of $N$ points generated by the sensor hardware (i.e. the point cloud).
  • $z_{kn}$ is an indicator variable that is 1 if the object surface nearest to $\mathbf{m}_{k}$ generated observation $\mathbf{c}_n$.
  • $z_{K+1,n}$ is an indicator variable that is 1 if $\mathbf{c}_n$ is a point generated by noise.
  • $v_k$ is an indicator variable that is 1 if $\mathbf{m}_k$ is visible from the camera.
  • $\Sigma_k=\mathrm{diag}((\sigma^x)^2, (\sigma^y)^2, (\sigma^z)^2)$ is a diagonal covariance matrix that parameterizes the noise in observations of points in the point cloud: $\mathbf{c}_n\sim \mathcal{N}(\mathbf{m}_k, \Sigma_k)$.
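To make the structure of this generative model concrete, here is a minimal sampling sketch in Python. It is my own illustration, not the authors' code: it uses an isotropic noise level in place of $\Sigma_k$, ignores the visibility variables $v_k$, and draws noise points uniformly from an assumed workspace box; `sigma`, `p_noise`, and `workspace` are made-up parameters.

```python
import numpy as np

def sample_point_cloud(m, sigma=0.01, p_noise=0.05,
                       workspace=((-1.0, 1.0),) * 3, n_points=200, rng=None):
    """Sample a synthetic point cloud c_{1:N} from the generative model.

    m         : (K, 3) array of model points m_{1:K}
    sigma     : isotropic stand-in for the per-point covariance Sigma_k
    p_noise   : probability that z_{K+1,n} = 1 (point generated by noise)
    workspace : box from which noise points are drawn uniformly (an assumption)
    """
    rng = np.random.default_rng() if rng is None else rng
    K = m.shape[0]
    cloud = np.empty((n_points, 3))
    for n in range(n_points):
        if rng.random() < p_noise:
            # z_{K+1,n} = 1: observation c_n is pure noise
            cloud[n] = [rng.uniform(lo, hi) for lo, hi in workspace]
        else:
            # z_{kn} = 1: observation c_n is generated near model point m_k
            k = rng.integers(K)
            cloud[n] = rng.normal(m[k], sigma)
    return cloud
```

For example, `sample_point_cloud(np.linspace([0, 0, 0], [1, 0, 0], 30))` produces a noisy cloud around a straight 30-point "rope".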

I won’t go too much into the details of the densities in this model, except to note that they additionally place a prior on the model points defined in terms of $V_0(\mathbf{m}_{1:K})$, the potential energy of the tracked object, which is infinite in infeasible states (e.g. interpenetration). This prior encourages the object model to stay in low-energy states, e.g. stationary (or moving slowly), not bent too sharply, etc.
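For concreteness, a prior with exactly these properties can be written in Boltzmann form (this is my reading of the setup; the precise normalization and any temperature constant are assumptions on my part, not taken from the paper):

$$p(\mathbf{m}_{1:K}) \propto \exp\left(-V_0(\mathbf{m}_{1:K})\right),$$

so infeasible configurations (infinite $V_0$) receive zero probability and low-energy configurations receive high probability.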

The inference procedure is a variation on the EM algorithm. First, in the E-step, they compute a lower bound on the log-probability in terms of the responsibilities $\alpha_{nk}=p(z_{kn}=1 \vert \mathbf{m}_{1:K}, \mathbf{c}_{1:N})$, the posterior probability that model point $\mathbf{m}_k$ generated observation $\mathbf{c}_n$. Then, in the M-step, they use a physics simulation to maximize this lower bound. Simply running the simulation forward would make it converge to the prior (a low-energy state); to avoid this, they introduce an external force into the simulation at each point mass equal to the gradient of the log-likelihood $\log p(\mathbf{c}_{1:N}\vert \mathbf{m}_{1:K})$ with respect to that point's position, which pulls the model points toward the observations. They also include a damping force to keep the total energy at a constant level.
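To make the E-step/M-step interplay concrete, here is a toy sketch in Python. It is my own simplification rather than the authors' implementation: it assumes isotropic Gaussian noise, treats every model point as visible (ignoring $v_k$), and replaces the physics engine and the object's internal potential $V_0$ with a damped explicit-Euler integrator; all parameter values (`sigma`, `p_noise`, `dt`, `damping`, `n_iters`) are made up.

```python
import numpy as np

def track_step(m, v, cloud, sigma=0.01, p_noise=0.05, dt=0.01, damping=5.0, n_iters=20):
    """One tracking update: EM with a toy damped integrator standing in for physics.

    m, v  : (K, 3) model point positions and velocities
    cloud : (N, 3) observed point cloud c_{1:N}
    """
    K = m.shape[0]
    for _ in range(n_iters):
        # E-step: responsibilities alpha[n, k] = p(z_{kn} = 1 | m, c),
        # with one extra column for the noise/outlier component z_{K+1,n}.
        d2 = ((cloud[:, None, :] - m[None, :, :]) ** 2).sum(-1)          # (N, K)
        log_lik = -d2 / (2 * sigma ** 2) - 3 * np.log(sigma)
        log_noise = np.full((cloud.shape[0], 1), np.log(p_noise))
        logits = np.concatenate([log_lik + np.log((1 - p_noise) / K), log_noise], axis=1)
        logits -= logits.max(axis=1, keepdims=True)
        alpha = np.exp(logits)
        alpha /= alpha.sum(axis=1, keepdims=True)
        alpha = alpha[:, :K]                                             # drop noise column

        # M-step: external force on m_k = gradient of the expected log-likelihood,
        # which pulls each model point toward the observations it is responsible for.
        force = (alpha[:, :, None] * (cloud[:, None, :] - m[None, :, :])).sum(0) / sigma ** 2
        force -= damping * v                                             # damping force
        v = v + dt * force
        m = m + dt * v
    return m, v
```

In the actual system, the integration step is handled by the physics engine, which also enforces the object's internal constraints and dynamics (the $V_0$ term) that this sketch omits.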

Takeaways

This is a really clever use of a physics engine to compute a maximization over a posterior distribution. Essentially, the key idea is that the physics engine will naturally converge to the prior (a low-energy state); so, to make it converge to the posterior (i.e., take observations into account), they introduce external forces into the physics engine proportional to the gradient of the log-likelihood. More broadly, this is an interesting and novel way to maintain a physically realistic model of an object (including the object’s dynamics) based on observations.
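As a quick sanity check on that idea (using the Boltzmann-prior reading above, which is my assumption), take the negative log of Bayes' rule:

$$-\log p(\mathbf{m}_{1:K} \vert \mathbf{c}_{1:N}) = V_0(\mathbf{m}_{1:K}) - \log p(\mathbf{c}_{1:N} \vert \mathbf{m}_{1:K}) + \text{const}.$$

The posterior mode is the minimum of this augmented energy: the physics engine already descends $V_0$, and applying the extra external force $\nabla_{\mathbf{m}_k} \log p(\mathbf{c}_{1:N} \vert \mathbf{m}_{1:K})$ at each point mass makes it descend the likelihood term as well.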

I expect this would actually work well with rigid body dynamics as well, though perhaps in the presence of many collisions, it wouldn’t – it would be interesting to test it out in those cases. Additionally, it would be interesting to see how this could be extended to multiple object tracking. For example, could you use this method to track the objects in our block towers? I suspect that would be more difficult, as (I think) you would need to add an additional level of hierarchy to the probabilistic generative model to account for different objects. Or, perhaps it would work to just say there is one big object with multiple components that are independent? I’m not sure.