Inverse Reinforcement Learning
Learn about deep Q-learning and inverse reinforcement learning in the context of video games.
While the field of deep learning is independent of reinforcement learning methods such as the Q-learning algorithm, the two were combined to powerful effect in research on playing video games directly from raw screen pixels, an approach that became known as deep Q-learning.
Deep Q-learning
A major insight in this research was to apply a deep neural network to generate vector representations from the raw pixels of the video game rather than trying to explicitly represent some features of the “state of the game”; this neural network is the Q-network, which outputs an estimated Q-value for each possible action given the current state.
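As a concrete illustration, the sketch below shows what such a Q-network might look like, assuming PyTorch and the 84x84, four-frame input used in the original Atari work; the class name and layer sizes are illustrative assumptions, not something prescribed by this lesson.

```python
# A minimal convolutional Q-network sketch (assumes PyTorch is installed).
# Input: a stack of 4 grayscale 84x84 game screens; output: one Q-value per action.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions: int, n_frames: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_frames, 32, kernel_size=8, stride=4),  # 84x84 -> 20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),        # 20x20 -> 9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),        # 9x9 -> 7x7
            nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-value estimate per action
        )

    def forward(self, screens: torch.Tensor) -> torch.Tensor:
        # screens: (batch, n_frames, 84, 84), pixel values scaled to [0, 1]
        return self.head(self.features(screens))

# Example: Q-values for a single (dummy) stacked-frame observation.
q_net = QNetwork(n_actions=6)
dummy_frames = torch.zeros(1, 4, 84, 84)
print(q_net(dummy_frames).shape)  # torch.Size([1, 6])
```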
Another key development was a technique called experience replay, wherein the history of states (here, pixels from video frames in a game), actions, and rewards is stored in a fixed-length list and repeatedly re-sampled at random, while actions are chosen with some stochastic possibility of a non-optimal move using the epsilon-greedy approach described above. The result is that the value-function updates are averaged over many samples of the same data, and correlations between consecutive samples (which could make the algorithm explore only a limited set of the solution space) are broken. Further, this “deep” Q-network is typically paired with a second, periodically refreshed target network that is used to compute the update targets, which stabilizes training.
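The replay memory and epsilon-greedy selection can be sketched in a few lines of plain Python; the names ReplayMemory and epsilon_greedy below are illustrative placeholders rather than anything defined in this lesson.

```python
# Sketch of a fixed-length replay memory and epsilon-greedy action selection.
import random
from collections import deque

class ReplayMemory:
    """Fixed-length buffer of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions drop off automatically

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        # Random sampling breaks correlations between consecutive frames.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

def epsilon_greedy(q_values, epsilon: float) -> int:
    """With probability epsilon take a random (possibly non-optimal) action;
    otherwise take the action with the highest estimated Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```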
The figure above illustrates an overview of deep Q-learning.
Putting these pieces together, the deep Q-learning algorithm involves the following steps (a short code sketch tying them together appears after the list):
Step 1: Create a list to store samples of (current state, action, reward, next state) as a “replay memory.”
Step 2: Randomly initialize the weights in the neural network representing the Q-function.
Step 3: For a certain number of gameplay sequences, initialize a starting game screen (pixels) and a transformation of this input (such as the last four screens). This “window” of fixed-length history is important because otherwise, the Q-network would need to accommodate arbitrarily sized input (very long or very short sequences of game screens), and this restriction makes it easier to apply a convolutional neural network to the problem.
Step 4: For a certain number of steps (screens) in the game, use epsilon-greedy sampling to choose the next action given the current screen and the rewards estimated through the Q-function.
Step 5: After updating the state, save this transition of (current state, action, reward, next state) into the replay memory.
Step 6: Choose random sets of (current state, action, reward, next state) transitions from the replay memory and compute their target values (reward plus discounted future Q-value) using the Q-function, then update the Q-network’s weights with a gradient step on the difference between these targets and the network’s current predictions.
...
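Putting steps 1 through 6 into code, a minimal end-to-end sketch might look like the following. The toy environment, the linear Q-function, the target-network refresh schedule, and every hyperparameter value here are illustrative assumptions rather than part of the algorithm description above; a practical agent would use a convolutional Q-network like the one sketched earlier and a real game emulator.

```python
# End-to-end deep Q-learning sketch (steps 1-6), using NumPy only.
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 8, 2                 # toy "game": 8 one-hot states, 2 actions
GAMMA, ALPHA, EPSILON = 0.95, 0.05, 0.1    # illustrative hyperparameters

def toy_env_step(state: int, action: int):
    """Hypothetical stand-in for a game emulator: action 1 moves right, 0 moves left;
    reaching the rightmost state yields a reward of 1 and ends the episode."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

def one_hot(s: int) -> np.ndarray:
    v = np.zeros(N_STATES)
    v[s] = 1.0
    return v

def q_values(w: np.ndarray, s: int) -> np.ndarray:
    return w @ one_hot(s)                  # linear Q-function: one value per action

replay_memory = []                                             # Step 1
weights = rng.normal(scale=0.01, size=(N_ACTIONS, N_STATES))   # Step 2
target_weights = weights.copy()            # slowly updated copy used for targets

for episode in range(200):                 # Step 3: many gameplay sequences
    state = 0
    for step in range(50):
        # Step 4: epsilon-greedy action selection using the current Q-function.
        if rng.random() < EPSILON:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(np.argmax(q_values(weights, state)))
        next_state, reward, done = toy_env_step(state, action)

        # Step 5: store the transition in the replay memory (kept fixed-length).
        replay_memory.append((state, action, reward, next_state, done))
        if len(replay_memory) > 10_000:
            replay_memory.pop(0)
        state = next_state

        # Step 6: sample random transitions and take a gradient step on the
        # squared difference between predicted and target Q-values.
        if len(replay_memory) >= 32:
            batch_idx = rng.integers(len(replay_memory), size=32)
            for s, a, r, s2, d in (replay_memory[i] for i in batch_idx):
                target = r if d else r + GAMMA * np.max(q_values(target_weights, s2))
                error = q_values(weights, s)[a] - target
                weights[a] -= ALPHA * error * one_hot(s)   # gradient of 0.5 * error**2

        if done:
            break

    if episode % 20 == 0:
        target_weights = weights.copy()    # periodically refresh the target copy

print("Learned Q-values (rows = states, columns = actions):")
print(np.round(weights.T, 2))
```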