Bertuccelli, L. F., Bethke, B., & How, J. P. (2012). Robust adaptive Markov decision processes: Planning with model uncertainty. IEEE Control Systems Magazine. doi:10.1109/MCS.2012.2205478

Summary

Bertuccelli et al. describe the Robust Adaptive Markov Decision Process (RAMDP), which is a type of MDP that takes into account uncertainty in the transition dynamics. This is somewhat similar to the BAMDP (see e.g. Guez et al.), which maintains a posterior distribution over the transition dynamics and/or reward function, and acts in a Bayes-optimal manner to maximize reward according to the posterior (i.e. by marginalizing over the uncertain dynamics). In contrast, the RAMDP maximizes a lower bound on the reward via the standard robust (max-min) objective:

$$J_R(i) = \max_{\mu} \min_{\mathcal{A}} J_{\mu}(i)$$

where $\mathcal{A}$ is the transition model (ranging over a feasible uncertainty set), $\mu$ is the policy, $J_R$ is the robust objective function, $J_\mu$ is the objective function under policy $\mu$, and $i$ is the state.
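
To make the max-min concrete, here is a minimal Python sketch (mine, not from the paper) of a robust value backup in which a finite set of candidate transition models stands in for the uncertainty set $\mathcal{A}$; the function name and array layout are hypothetical.

```python
import numpy as np

def robust_value_iteration(P_scenarios, R, gamma=0.95, iters=1000, tol=1e-8):
    """Max-min value iteration over a finite scenario set of transition models.

    P_scenarios: (K, A, S, S) array -- K candidate models, indexed
        [scenario, action, state, next_state].
    R: (A, S) array -- reward for taking action a in state s.
    Returns the robust value function J_R and a greedy robust policy.
    """
    K, A, S, _ = P_scenarios.shape
    J = np.zeros(S)
    for _ in range(iters):
        # Q[k, a, s]: value of action a in state s under scenario k.
        Q = R[None, :, :] + gamma * np.einsum('kasn,n->kas', P_scenarios, J)
        # Inner minimization over scenarios, outer maximization over actions.
        J_new = Q.min(axis=0).max(axis=0)
        if np.max(np.abs(J_new - J)) < tol:
            J = J_new
            break
        J = J_new
    policy = Q.min(axis=0).argmax(axis=0)
    return J, policy
```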

To compute the minimum over transition models, they first define a feasible uncertainty set, which in this case is built from a Dirichlet distribution over the transition probabilities. They then use a “scenario-based method”, which seems to just be sampling candidate models over which to compute the minimum. Rather than taking a Monte-Carlo sampling approach, they compute sigma points, deterministically chosen sampling locations of the usual unscented-transform form built from the mean and covariance of the Dirichlet distribution:

$$\hat{p}_0 = \bar{p}, \qquad \hat{p}_k^{\pm} = \bar{p} \pm \beta \left(\sqrt{\Sigma}\right)_k$$

where $\beta$ is “a tuning parameter that reflects the level of conservatism desired” (i.e. the range of the credible region), $\bar{p}$ and $\Sigma$ are the mean and covariance of the Dirichlet distribution, and $(\sqrt{\Sigma})_k$ is the $k$-th column of a matrix square root of $\Sigma$.
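
As a concrete illustration of that construction, here is a small Python sketch for a single Dirichlet-distributed transition row; the helper name, the eigendecomposition-based square root, and the re-normalization onto the simplex are my own choices, not code from the paper.

```python
import numpy as np

def dirichlet_sigma_points(alpha, beta=1.0):
    """Sigma points for a Dirichlet(alpha) belief over one transition row.

    Uses the unscented-style construction p_bar +/- beta * (sqrt(Sigma))_k,
    where p_bar and Sigma are the Dirichlet mean and covariance; beta sets
    how far into the tails (how conservatively) the points reach.
    """
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    p_bar = alpha / a0                                               # Dirichlet mean
    Sigma = (np.diag(p_bar) - np.outer(p_bar, p_bar)) / (a0 + 1.0)   # Dirichlet covariance

    # Sigma is singular (its rows sum to zero), so take a matrix square root
    # via an eigendecomposition rather than a Cholesky factorization.
    eigval, eigvec = np.linalg.eigh(Sigma)
    sqrt_Sigma = eigvec @ np.diag(np.sqrt(np.clip(eigval, 0.0, None)))

    points = [p_bar]
    for k in range(len(alpha)):
        for sign in (+1.0, -1.0):
            p = p_bar + sign * beta * sqrt_Sigma[:, k]
            p = np.clip(p, 0.0, None)          # project back onto valid probabilities
            points.append(p / p.sum())
    return np.array(points)

# Example: belief over three successor states after a few observed transitions.
print(dirichlet_sigma_points([3.0, 1.0, 1.0], beta=2.0))
```

Each sigma point is a candidate transition row, and the robust backup then takes the minimum over these few points in place of the minimum over the full uncertainty set.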

Takeaways

It’s not entirely clear to me whether BAMDP or RAMDP is better in terms of theoretical guarantees. My intuition is that BAMDP is probably more flexible, as it could potentially be used with complex structured priors. I suppose you could technically do the same for RAMDPs, though you would need to reformulate how to perform efficient sampling (since the approach taken by Bertuccelli et al. makes a strong assumption about the Dirichlet distribution). It seems that BAMDPs may be a bit more agnostic to the assumptions put into the prior, particularly when used with something like MCTS. That said, I suspect that RAMDP probably provides better worst-case guarantees, given that it explicitly optimizes for the worst case. But these are just my intuitions; I’m not sure how true they are in practice, or how much the worst case matters in non-adversarial settings.