
Inverse Reinforcement Learning

Learn about deep Q-learning and inverse reinforcement learning in the context of video games.

While the field of deep learning is independent of reinforcement learning methods such as the Q-learning algorithm, a powerful combination of these two approaches was used to train algorithms to play arcade games at a near-human level (Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. "Playing Atari with Deep Reinforcement Learning." arXiv. https://arxiv.org/abs/1312.5602).

Deep Q-learning

A major insight in this research was to apply a deep neural network to generate vector representations from the raw pixels of the video game, rather than trying to explicitly represent some features of the "state of the game"; this neural network is the Q-function for this RL algorithm.
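To make this concrete, here is a rough sketch of what such a pixel-based Q-network could look like. The use of PyTorch, the 84×84 frame size, the layer sizes, and the action count are illustrative assumptions for this example, not details taken from the original paper.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of raw game frames to one Q-value per possible action."""

    def __init__(self, n_frames: int = 4, n_actions: int = 6):
        super().__init__()
        # Convolutional layers consume raw pixels directly; no hand-crafted
        # "state of the game" features are needed.
        self.features = nn.Sequential(
            nn.Conv2d(n_frames, 16, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Flatten(),
        )
        # For an assumed 84x84 input, the flattened feature map is 32 * 9 * 9.
        self.head = nn.Sequential(
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),  # one Q-value per action
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, n_frames, 84, 84) stacked grayscale screens
        return self.head(self.features(frames))
```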

Another key development was a technique called experience replay, wherein the history of states (here, pixels from video frames in a game), actions, and rewards is stored in a fixed-length list and repeatedly re-sampled at random; the actions themselves are chosen with some stochastic chance of a non-optimal move, using the epsilon-greedy approach described above. The result is that the value function updates are averaged over many samples of the same data, and correlations between consecutive samples (which could make the algorithm explore only a limited set of the solution space) are broken. Further, this "deep" Q-learning algorithm is implemented off-policy to avoid the potential circular feedback of generating optimal samples with a jointly optimized policy function.
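A minimal sketch of such a replay memory is shown below, assuming plain Python; the capacity and the Transition fields are illustrative choices. The key ideas are the fixed-length buffer and uniform random sampling.

```python
import random
from collections import deque, namedtuple

# One stored interaction, matching the tuple described in the text.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ReplayMemory:
    """Fixed-length buffer of transitions, sampled uniformly at random."""

    def __init__(self, capacity: int = 100_000):
        # A deque with maxlen silently drops the oldest transitions,
        # keeping the memory at a fixed length.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks correlations between
        # consecutive frames and reuses each transition many times.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```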

[Figure: Deep Q-learning]

The figure above illustrates an overview of deep Q-learning, where we use a neural network to approximate the Q-value function. The state is given as the input, and the Q-values of all possible actions are generated as the output.
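Concretely, a single forward pass through the hypothetical QNetwork sketched earlier maps one stacked observation to one value per action; the shapes below are assumed for illustration.

```python
import torch

q_net = QNetwork(n_frames=4, n_actions=6)         # hypothetical network from the sketch above
frames = torch.zeros(1, 4, 84, 84)                 # one stacked observation (assumed shape)
q_values = q_net(frames)                           # shape (1, 6): one Q-value per possible action
best_action = int(q_values.argmax(dim=1).item())   # the greedy action for this state
```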

Deep Q-learning is also model-free, in the sense that we have no representation or model of the environment E (such as a generative model that can simulate new frames of the game). In fact, as in the video game example, the training data can simply be samples of historical gameplay that represent the internal state of the game as observed by the player.

Putting these pieces together, the deep Q-learning algorithm uses the following steps to learn to play Atari games (Mnih et al. 2013, https://arxiv.org/abs/1312.5602):

  • Step 1: Create a list to store samples of (current state, action, reward, next state) as a “replay memory.”

  • Step 2: Randomly initialize the weights in the neural network representing the Q-function.

  • Step 3: For a certain number of gameplay sequences, initialize a starting game screen (pixels) and a transformation of this input (such as the last four screens). This "window" of fixed-length history is important because otherwise the Q-network would need to accommodate arbitrarily sized input (very long or very short sequences of game screens), and this restriction makes it easier to apply a convolutional neural network to the problem.

  • Step 4: For a certain number of steps (screens) in the game, use epsilon-greedy sampling to choose the next action, given the current screen and the action values computed through Q.

  • Step 5: After updating the state, save this transition of (current state, action, reward, next state) into the replay memory.

  • Step 6: Choose random sets of (current state, action, reward, next state) transitions from the replay memory and compute their reward using the Q ...
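To ground Steps 4 and 6, the sketch below shows one way the epsilon-greedy action choice and the minibatch update on replayed transitions could be written, reusing the hypothetical QNetwork and ReplayMemory from earlier. The Bellman-style target, the discount factor value, and the omission of terminal-state handling are simplifying assumptions made for illustration, not the exact code from the paper.

```python
import random
import torch
import torch.nn.functional as F

GAMMA = 0.99  # assumed discount factor, chosen only for illustration

def select_action(q_net, state, n_actions, epsilon=0.1):
    """Epsilon-greedy choice over the Q-network's outputs (Step 4)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)  # explore: random action
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())  # exploit

def q_update(q_net, optimizer, memory, batch_size=32):
    """One gradient step on a random minibatch from the replay memory (Step 6)."""
    batch = memory.sample(batch_size)
    states = torch.stack([t.state for t in batch])
    actions = torch.tensor([t.action for t in batch])
    rewards = torch.tensor([t.reward for t in batch], dtype=torch.float32)
    next_states = torch.stack([t.next_state for t in batch])

    # Q-values the network currently predicts for the actions actually taken.
    predicted = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman-style target: observed reward plus the discounted best next Q-value.
    # (Terminal-state handling is omitted here for brevity.)
    with torch.no_grad():
        target = rewards + GAMMA * q_net(next_states).max(dim=1).values

    loss = F.mse_loss(predicted, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```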