Ross, S., & Pineau, J. (2008). Model-based Bayesian Reinforcement Learning in Large Structured Domains. Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, 476–483. Retrieved from http://arxiv.org/abs/1206.3281

Summary

Ross & Pineau propose a method for performing Bayesian reinforcement learning when both the structure and parameters of the transition function are unknown. That is, they need to infer $P(G,\theta_G \vert D)$, where $G$ is the graph structure, $\theta_G$ are the parameters of the graph, and $D$ is the observed data. Then, given this posterior, they need to plan optimally with respect to it. Ross & Pineau show how this can be done by casting the problem as a POMDP (where the POMDP state consists of the original state, the graph structure, and the graph parameters, with only the original state directly observed) and then suggest an algorithm for planning in continuous, high-dimensional POMDPs such as this that uses particle filtering with resampling. They show that their method rapidly recovers the correct structure in simple cases and recovers an approximate (though not exactly correct) structure in more complex cases, and that it is significantly faster and more accurate than an MDP model that does full joint inference over transitions (i.e., one that does not take structure into account).
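
To make the belief-tracking part concrete, here is a minimal sketch of a particle-filter update over the hidden (structure, parameters) portion of the POMDP state. This is not the authors' implementation; the `Particle` class, the `transition_likelihood` placeholder, and the ESS-based resampling rule are illustrative assumptions on my part.

```python
import numpy as np

class Particle:
    """One hypothesis about the hidden model: a structure G and its parameters theta_G."""
    def __init__(self, structure, params):
        self.structure = structure   # hypothesised graph structure G
        self.params = params         # transition parameters theta_G for that structure

def transition_likelihood(particle, s, a, s_next):
    """Placeholder likelihood P(s_next | s, a) under this particle's model."""
    return particle.params.get((s, a, s_next), 1e-6)

def update_belief(particles, weights, s, a, s_next, ess_fraction=0.5):
    """Reweight particles by the observed transition, resampling if the weights degenerate."""
    weights = weights * np.array(
        [transition_likelihood(p, s, a, s_next) for p in particles]
    )
    weights = weights / weights.sum()

    # Resample when the effective sample size drops too low.
    ess = 1.0 / np.sum(weights ** 2)
    if ess < ess_fraction * len(particles):
        idx = np.random.choice(len(particles), size=len(particles), p=weights)
        particles = [particles[i] for i in idx]
        weights = np.ones(len(particles)) / len(particles)
    return particles, weights
```

Planning then evaluates candidate actions against this weighted particle set rather than against a single point estimate of the model.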

Takeaways

I am not going to go into the nitty-gritty details of the math in this paper. I think the important takeaway is that by doing Bayesian RL, not only can we maintain a distribution over what we think the transition function is, but we can also incorporate structured prior knowledge into the inference procedure. So rather than assuming what is essentially a fully connected transition graph and estimating the parameters of each edge, we can keep open the possibility that the transition function has some form of higher-level structure. I think this is particularly important because it could allow us to do inference about transitions or rewards in states that are different but are actually similar according to some higher-level structure. Actually, I think this is probably especially important for reward structures, even more so than transition structures, for example if the reward structure is based on something like the graph structures discussed by Kemp & Tenenbaum.
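
As a toy illustration of that generalisation point (my own example, not from the paper), here is a sketch of how a factored model shares statistics across distinct states: Dirichlet counts are indexed only by the values of a feature's parents, so observations in one state sharpen predictions in any other state with the same parent values. The light/door domain and all names below are made up.

```python
# Dirichlet(1, 1) pseudo-counts over the next value of "door",
# indexed only by its assumed parent, the value of "light".
counts = {
    "light=on":  {"door=open": 1.0, "door=closed": 1.0},
    "light=off": {"door=open": 1.0, "door=closed": 1.0},
}

def observe(light, door_next):
    counts[light][door_next] += 1.0

def predict(light, door_next):
    total = sum(counts[light].values())
    return counts[light][door_next] / total

# Transitions observed in the full state (light=on, door=closed):
observe("light=on", "door=open")
observe("light=on", "door=open")

# The prediction also sharpens for the *different* full state (light=on, door=open),
# because both states share the parent value light=on.
print(predict("light=on", "door=open"))   # 0.75 instead of the prior 0.5
```

A fully connected model would instead keep separate counts for every full state, so the second state would still sit at its prior until it was visited directly.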