Dearden, R., Friedman, N., & Andre, D. (1999). Model based Bayesian exploration. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 150–159. Retrieved from http://dl.acm.org/citation.cfm?id=2073814

Summary

In reinforcement learning (RL), there are two main approaches: model-based RL, in which an explicit model of the transition and/or reward functions is estimated, and model-free RL, in which no such model is estimated and the value of a state (or state-action pair) is estimated directly. In the model-based case, however, the estimated model is typically just a point estimate (e.g. a maximum-likelihood estimate) and thus does not reflect any uncertainty about the transition or reward model.

In this paper, Dearden et al. suggest using Bayesian model-based RL, in which Bayesian inference is used to maintain a posterior distribution over the transition and reward models, thus capturing both an estimate of what the models are and a measure of confidence in how accurate that estimate is.

Methods

n/a

Algorithm

The sequence of states and actions taken by the agent can be summarized as a sequence of experience tuples $\langle s, a, r, t \rangle$, where $t$ is the resulting state. The agent's belief about the MDP (i.e. the true transition and reward functions) is $\mu$, so that $P(M\vert\mu)$ is the agent's prior over possible MDPs given its initial belief $\mu$. After one timestep, the agent has a new experience, and by Bayes' rule the updated distribution can be written as:

$$P(M \vert \mu, \langle s, a, r, t \rangle) = \frac{P(\langle s, a, r, t \rangle \vert M) \, P(M \vert \mu)}{P(\langle s, a, r, t \rangle \vert \mu)} \propto P(\langle s, a, r, t \rangle \vert M) \, P(M \vert \mu)$$

We can represent a particular transition and reward model by sets of parameters $\theta^t_{s,a,t}$ and $\theta^r_{s,a,r}$, where $\theta^t_{s,a,t}$ is the probability of transitioning to state $t$ after taking action $a$ in state $s$, and $\theta^r_{s,a,r}$ is the probability of receiving reward $r$ after taking action $a$ in state $s$.
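As a concrete illustration, here is a minimal sketch of what maintaining such a posterior over the transition parameters might look like, assuming independent Dirichlet priors for each $(s, a)$ pair (the conjugate choice for a discrete state space); the class and method names here are my own, not the paper's.

```python
import numpy as np

class TransitionPosterior:
    """Posterior over transition parameters theta^t_{s,a,t}, assuming an
    independent Dirichlet prior for each (state, action) pair."""

    def __init__(self, n_states, n_actions, prior_count=1.0):
        # Dirichlet concentration parameters; prior_count = 1.0 is a uniform prior.
        self.alpha = np.full((n_states, n_actions, n_states), prior_count)

    def update(self, s, a, t):
        """Bayesian update after observing experience tuple <s, a, r, t>:
        the Dirichlet posterior simply increments the count for outcome t."""
        self.alpha[s, a, t] += 1.0

    def mean(self, s, a):
        """Posterior mean estimate of the transition distribution P(. | s, a)."""
        return self.alpha[s, a] / self.alpha[s, a].sum()

    def sample(self, rng, s, a):
        """Sample a plausible transition distribution from the posterior,
        reflecting how uncertain we still are about (s, a)."""
        return rng.dirichlet(self.alpha[s, a])

# Example usage: one observed transition sharpens the posterior for (s=0, a=1).
rng = np.random.default_rng(0)
post = TransitionPosterior(n_states=5, n_actions=2)
post.update(s=0, a=1, t=3)
print(post.mean(0, 1), post.sample(rng, 0, 1))
```

The key point is that the full set of counts, not just the normalized point estimate, is retained, so the agent knows both what it believes and how strongly it believes it.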

Having a distribution over MDPs also induces a distribution over $Q$ functions. Rather than maximizing the expected $Q$ function directly, as is typically the case in RL, Dearden et al. choose the action which maximizes:

$$E[Q(s, a)] + VPI(s, a)$$

where $VPI(s, a)$ is the value of perfect information: roughly, how much we would expect to gain from learning the true value of taking action $a$ in state $s$. This strategy says that we should take actions that give us more information, but not if we believe the $Q$-value of that action is low relative to the $Q$-value of the best current action. Thus, the tradeoff between exploration and exploitation is expressed explicitly in terms of information (though not information in the information-theoretic sense: the "gain" here is the expected improvement in value from discovering either that an action we thought was best actually isn't, or that an action we thought wasn't best actually is).
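To make the action-selection rule concrete, here is a rough sketch of how one might estimate $VPI$ from Monte Carlo samples of the $Q$ posterior and then pick an action. The sampling-based estimator and the function names are my own illustration, not the paper's exact procedure; it follows the idea described above, where the gain compares an action's sampled value against the current best (or second-best) action's expected value.

```python
import numpy as np

def estimate_vpi(q_samples):
    """Estimate VPI(s, a) for each action from posterior samples of Q(s, .).

    q_samples: array of shape (n_samples, n_actions); each row is one draw of
    the Q-values at the current state from the posterior over MDPs / Q functions.
    """
    q_mean = q_samples.mean(axis=0)
    best = np.argmax(q_mean)                 # current best action
    second_best_value = np.sort(q_mean)[-2]  # expected Q of the runner-up
    n_actions = q_samples.shape[1]

    vpi = np.zeros(n_actions)
    for a in range(n_actions):
        q_a = q_samples[:, a]
        if a == best:
            # Gain if the presumed-best action turns out worse than the runner-up.
            gain = np.maximum(second_best_value - q_a, 0.0)
        else:
            # Gain if a presumed-suboptimal action turns out better than the best.
            gain = np.maximum(q_a - q_mean[best], 0.0)
        vpi[a] = gain.mean()
    return q_mean, vpi

def select_action(q_samples):
    """Choose the action maximizing E[Q(s, a)] + VPI(s, a)."""
    q_mean, vpi = estimate_vpi(q_samples)
    return int(np.argmax(q_mean + vpi))
```

In this sketch, an action with a mediocre expected value but high posterior uncertainty can still be selected, because resolving that uncertainty could change which action looks best.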

Takeaways

Dearden et al. go on to discuss various ways of approximating the $Q$ distribution, but I am not going to go into these specifics. I think the more important takeaway from this paper is the idea that you can perform Bayesian inference over the transition and reward functions and thus capture uncertainty about how accurate your estimate is.