Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. doi:10.1038/nature14236

Summary

In this paper, Mnih et al. show how to combine deep learning with reinforcement learning in a stable manner and scale it up to learn to play a range of classic Atari video games. They use a deep convolutional network to learn an approximation to the Q function; this general approach has been used before (e.g., TD-Gammon), but Mnih et al. make two main modifications:

  1. Rather than training on frames in the order they are experienced, they store past transitions in a replay memory and sample from it uniformly at random. This reduces the correlation between training examples, which smooths over the data distribution (see the sketch after this list).
  2. Rather than computing the Q-learning targets from the same network that is being updated, they clone the network and use the clone to compute the targets, and only update the clone every so often rather than on every step.
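
Here is a minimal sketch of the replay-memory idea in Python (my own illustration, not the authors' code; the capacity and batch size are just assumed round numbers):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=1_000_000):
        # Oldest transitions are evicted automatically once the buffer is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions within an episode.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```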

Methods

They trained a network on each of 49 Atari games, using the same architecture and hyperparameters for every game, and for each game trained on a total of 50 million frames (which they say is about 38 days of game experience). The network itself had the following structure (a rough sketch in code follows the list):

  1. Input layer – $84\times 84\times 4$ image.
  2. Layer 1 – 32 filters of $8\times 8$ with stride 4 and followed by a rectifier nonlinearity, which is defined to be $\max(0, x)$.
  3. Layer 2 – 64 filters of $4\times 4$ with stride 2 and also followed by the rectifier nonlinearity.
  4. Layer 3 – 64 filters of $3\times 3$ with stride 1 and also followed by the rectifier nonlinearity.
  5. Layer 4 – fully connected with 512 rectifier units.
  6. Output layer – fully connected with linear units, one per action.
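
Here is a rough sketch of that architecture in PyTorch (my reconstruction from the description above, just for illustration; it is not the authors' code):

```python
import torch
import torch.nn as nn

def build_q_network(n_actions):
    """Q-network mapping a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    return nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=8, stride=4),   # layer 1: 4x84x84 -> 32x20x20
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2),  # layer 2: 32x20x20 -> 64x9x9
        nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1),  # layer 3: 64x9x9 -> 64x7x7
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 512),                  # layer 4: fully connected, 512 units
        nn.ReLU(),
        nn.Linear(512, n_actions),                   # output: linear, one Q-value per action
    )

# For a batch of shape (N, 4, 84, 84), the output has shape (N, n_actions).
q_net = build_q_network(n_actions=18)
print(q_net(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 18])
```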

Algorithm

The loss function used to train the network is:

$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i) \right)^2 \right]$$

where $D$ is the set of all experiences (the replay memory), $(s, a, r, s')$ is a transition sampled uniformly at random from $D$, $\gamma$ is the discount factor, $\theta_i$ are the network weights at iteration $i$, and $\theta_i^-$ are the cloned network weights at iteration $i$.
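
A rough sketch of how this loss might be computed against the cloned ("target") network, continuing the PyTorch illustration above (again my own sketch, not the authors' implementation; details such as their RMSProp settings and error clipping are omitted):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error against targets computed with the frozen clone."""
    states, actions, rewards, next_states, dones = batch  # stacked tensors

    # Q(s, a; theta_i): value of the action actually taken in each sampled transition.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # r + gamma * max_a' Q(s', a'; theta_i^-): the target uses the cloned weights
    # and is treated as a constant, so no gradient flows through it.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    return F.mse_loss(q_values, targets)

# Every C gradient steps the clone is refreshed:
#     target_net.load_state_dict(q_net.state_dict())
```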

Takeaways

This paper is very impressive, and it is super cool to see reinforcement learning succeeding at such a large scale. I am also really excited to see the emphasis on algorithms inspired by human neuroscience, though I think there's still a ways to go.

For example, each network was trained on roughly 38 days' worth of game experience (!), which is vastly more than a human player requires. Looking at the training graphs, it seems that on Space Invaders, for example, the network doesn't get anywhere close to human-level performance until after about 10 training epochs, which I estimate to be about 40 hours of gameplay. Their human player only had about 2 hours of practice per game, for comparison. This suggests that people come to these games with strong inductive biases about how to play them that the network lacks.

Another thought is that they used a model-free form of Q-learning; however, I would actually expect that people do build a model of the game's transition and reward functions. For example, people can figure out that killing a space invader earns points, whereas the network only knows that pressing “fire” in certain states is a good thing to do because it leads to the highest expected reward; it doesn't know anything about why that reward is good. So I'm not convinced that a model-free approach is necessarily the way to go if we want to capture the way people learn to play these games.

I also wonder if having the agent learn a model of the transition and reward functions would negate the need for experience replay. I don't think experience replay is necessarily a bad thing in general, but it is actually counterintuitive to me to train on experiences in a random, uncorrelated manner. I would at least expect that, if something like this happens in humans, each replayed experience is longer than just a few frames. However, Mnih et al. cite evidence from neuroscience, specifically about the hippocampus, which is an area I know very little about, so it's certainly possible that my intuitions on this are completely off.