Browne, C., Powley, E., Whitehouse, D., Lucas, S., Cowling, P. I., Rohlfshagen, P., … Colton, S. (2012). A survey of Monte Carlo tree search methods. IEEE Transactions On Computational Intelligence and AI in Games, 4(1), 1–10. doi:10.1109/TCIAIG.2012.2186810

Summary

This paper gives a comprehensive overview of the Monte Carlo Tree Search (MCTS) algorithm and the variations thereof. In reading this paper, I am focusing less on the variations of the algorithm and just on the core ideas behind how it works, which are laid out in pages 1-10.

The basic idea behind MCTS is to build up a tree, where each node corresponds to a state and each edge from a node corresponds to an action. Leaf nodes are chosen according to a particular algorithm, and then full games are simulated from those nodes until they terminate. Then, the results of the simulation are backpropagated up the tree all the way to the root node.

Methods

n/a

Algorithm

The basic algorithm of MCTS is as follows:

Selection: start at the root node, and walk down the tree according to the “tree policy” until either an expandable node is reached or a terminal state is reached. A node is “expandable” if it has previously unexplored children.
Expansion: expand one of the child nodes.
Simulation: run forward a simulation from the expanded child node until the simulation terminates according to some “default policy”.
Backpropagation: update the statistics of all nodes in the path from the root node to the expanded child node based on the results of the simulation.

These steps are repeated while there is still computation time remaining. Then, the action is taken that corresponds to the best child node of the root node.

Browne et al. describe a tree policy called the Upper Confidence Bound for Trees (UCT), which chooses a child node $j$ to maximize:

$UCT= \bar{X}_j + 2 C_p\sqrt{\frac{2\log n}{n_j}}$

where \widebar{X}_j is the average reward of child $j$, $n$ is the number of times the parent node has been visited, $n_j$ is the number of times child $j$ has been visited, and $C_p>0$ is a constant. When $n_j=0$, the second term will be infinite, and thus will cause unexplored children to be explored prior to exploiting nodes that have been explored.

Takeaways

MCTS is a successful approach to dealing with large state/action spaces, which product deep trees with large branching factors. MCTS offers a way to trade off between exploration and exploitation while still being able to roll out simulations all the way to termination. This means that even if rewards are only acheived at the end (e.g., win or lose), they can still be incorporated. And, unlike other RL techniques, MCTS does not require that the full state space be enumerated, which may be necessary to fully estimate the value of a state in terms of its future rewards.

Some thoughts/questions that I have:

How does MCTS work when rewards are asymmetric, e.g. in the case of a game where the goal is simply to keep playing for as long as possible? In this case, all terminal states would have a negative reward (for lose). I suppose you could perhaps give rewards that are instead equal to the length of the game that is played, but I’m not sure if that’s actually the correct approach.
Sometimes the state/action space might not actually be a true tree, but just a DAG, because you can reach certain states in multiple ways. But, I think that MCTS would probably treat these states as separate and require that they be explored independently. This seems like it would be a waste of computation.