Hay, N. J., Russell, S. J., Tolpin, D., & Shimony, S. E. (2012). Selecting Computations: Theory and Applications. arXiv preprint arXiv:1207.5879v1 [cs.AI]. Retrieved from http://arxiv.org/abs/1207.5879

Summary

This paper is closely related to Guez et al., Browne et al., and Bertuccelli et al. Here, Hay et al. argue that to make optimal use of the information gained when exploring a state/action space, a metalevel decision problem needs to be solved:

Exploring unpromising or highly predictable paths to great depth is often wasteful; for a given amount of exploration, decision quality can be improved by directing exploration towards those action sequences whose outcomes are helpful in selecting a good move. Thus, the metalevel decision problem is to choose what future action sequences to explore (or, more generally, what deliberative computations to do), while the object-level decision problem is to choose an action to execute in the real world. (pg. 1)

They show that the UCB1 and UCT selection rules (used in bandit problems and MCTS, respectively) are suboptimal for the metalevel decision problem. The difference, they argue, is that in a bandit problem every arm pull collects real-world utility, whereas in the metalevel problem the computations themselves only incur a cost; utility comes solely from the single real-world action chosen once deliberation stops.
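For reference, UCB1 (and UCT, which applies the same rule at every tree node) selects the arm maximizing an optimistic index,

$$\underset{i}{\arg\max}\ \bar{X}_i + \sqrt{\frac{2\ln n}{n_i}},$$

where $\bar{X}_i$ is the empirical mean reward of arm $i$, $n_i$ is the number of times arm $i$ has been sampled, and $n$ is the total number of samples. This index is designed to minimize cumulative regret, i.e., it treats every sample as if its reward were actually collected, which is exactly the mismatch with the metalevel setting, where only the final real-world choice matters.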

In some cases, the prior distribution over real-world outcomes is known and can be used in the simulations. In many other cases, however, no prior distribution over utilities is available. Instead, the VOI (value of information) of a computation can be used to decide which computations are worth performing. These VOIs cannot be computed exactly, but under a few assumptions (a myopic policy, i.i.d. samples, the value of a selected action estimated by its sample mean, and a distribution bounded on both sides), they can be bounded from above. Hay et al. prove this bound and show how it can be applied to MCTS: for sampling at the root node, rather than following the UCT policy, they use the VOI-based policy.
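To make this concrete, here is a minimal sketch of VOI-based selection at the root. This is my own simplification, assuming rewards in [0, 1] and a Hoeffding-style bound; the function names and the exact form of the bound are not from the paper, which proves a different (tighter) bound.

```python
import math

def voi_upper_bound(means, counts, i):
    """Crude Hoeffding-style upper bound on the VOI of drawing one more
    sample of root action i, assuming rewards lie in [0, 1].

    Illustrative only: this just captures the qualitative behaviour of a
    VOI bound, not the bound proved by Hay et al.
    """
    k = len(means)
    alpha = max(range(k), key=lambda j: means[j])  # current best-looking action
    if i == alpha:
        # Sampling the current best is only useful if it could drop below
        # the runner-up (assumes at least two root actions).
        second = max((j for j in range(k) if j != alpha), key=lambda j: means[j])
        gap = means[alpha] - means[second]
    else:
        # Sampling another action is only useful if it could overtake the best.
        gap = means[alpha] - means[i]
    # Hoeffding: the probability that one more sample flips the ranking decays
    # like exp(-2 * n_i * gap^2); scale by the largest possible shift of the
    # sample mean after one more observation.
    return math.exp(-2.0 * counts[i] * gap * gap) / (counts[i] + 1)

def select_root_computation(means, counts):
    """Pick the root action whose next sample has the largest VOI upper bound."""
    return max(range(len(means)), key=lambda i: voi_upper_bound(means, counts, i))

# Example: the clearly bad third action gets almost no VOI, so samples go to
# distinguishing the two close candidates rather than exploring everything.
print(select_root_computation([0.60, 0.55, 0.20], [10, 8, 12]))  # -> 1
```

The qualitative behaviour is the point: unlike UCB1, which keeps allocating some samples to every arm, a VOI criterion stops sampling arms whose outcomes can no longer change the final choice.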

Methods

n/a

Algorithm

A metalevel decision process is denoted $M=(S,s_0,A_s,T,R)$, where $S$ is the set of states, $s_0$ is the initial state, $A_s$ is the set of actions available in state $s$, which includes computational (simulation) actions $E\in\mathcal{E}$ as well as the “stop” action $\perp$, $T$ is the transition function, and $R$ is the reward function, such that:

$$R(s, E, s') = -c \quad \text{for } E \in \mathcal{E}, \qquad R(s, \perp) = \max_i \mu_i(s),$$

where $c>0$ is the cost of a single computation and $\mu_i(s)=\mathbb{E}[U_i\ \vert\ E_1=e_1,\ldots{},E_n=e_n]$ is the expected utility of real-world action $i$ given the outcomes of the computations performed so far. The value function of a policy $\pi$ for this MDP is:

$$V^\pi(s_0) = \mathbb{E}\left[\max_i \mu_i(S_N) - Nc\right],$$

where $N$ is the total number of computations performed before stopping; the expected number of computations under $\pi$ is then $\mathbb{E}^\pi[N]$.
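To make the definition concrete, here is a small toy sketch (my own construction, not from the paper) of a metalevel decision process in which each computation draws one noisy sample of a real-world action's utility, $\mu_i$ is estimated by the sample mean, and stopping pays $\max_i \mu_i$ minus the accumulated computation cost.

```python
import random

class ToyMetalevelMDP:
    """Toy metalevel decision process (illustrative, not the paper's setup).

    State: per-action sample counts and sample means.
    Actions: compute(i) -- simulate real-world action i once, paying cost c;
             stop()     -- commit to the best-looking action and collect
                           max_i mu_i(s) minus the accumulated cost.
    """

    def __init__(self, true_utilities, cost=0.01, noise=0.1):
        self.true_utilities = true_utilities  # hidden from the metalevel agent
        self.cost = cost
        self.noise = noise
        k = len(true_utilities)
        self.counts = [0] * k
        self.means = [0.0] * k    # mu_i(s), here just the running sample mean
        self.n_computations = 0   # N

    def compute(self, i):
        """Run one simulation of action i and update mu_i incrementally."""
        sample = self.true_utilities[i] + random.gauss(0.0, self.noise)
        self.counts[i] += 1
        self.means[i] += (sample - self.means[i]) / self.counts[i]
        self.n_computations += 1

    def stop(self):
        """Stop deliberating: reward is max_i mu_i(s) minus total computation cost."""
        return max(self.means) - self.cost * self.n_computations
```

A metalevel policy decides, given the current counts and means, whether to call compute(i) for some $i$ or to stop(); the VOI bounds discussed above are one way of making exactly that choice.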

Takeaways

It would be interesting to see how this integrates with the Bayes-adaptive MCTS from Guez et al. There, not only do they run simulations starting from the root node, they also resample the MDP from the prior distribution, so I'm not sure whether the VOI approach here would still hold in that case. In general, though, I do think it is important to take the metalevel decision-making problem into account, so this is a really interesting direction to pursue, particularly in the context of human reasoning (where our brains almost certainly need to make tradeoffs between doing more computation and acting).