Reinforcement Learning

Learn about reinforcement learning and the principle of optimality in the context of video games.

Most discriminative AI examples involve applying a continuous or discrete label to a piece of data. Whether we apply a deep neural network to determine the digit represented by an MNIST image or to decide whether a CIFAR-10 image contains a horse, the model produces a single output: a prediction with minimal error. In reinforcement learning, we also want to make such point predictions, but over many steps, optimizing the total error over repeated uses.

Atari video game example

As a concrete example, imagine a video game where a player pilots a spaceship to defeat alien vessels. The spaceship navigated by the player in this example (Mnih et al. 2013, "Playing Atari with Deep Reinforcement Learning," https://arxiv.org/abs/1312.5602) is the agent; the set of pixels on the screen at any point in the game is the state. Based on this environment, the player needs to take the right action (for example, moving right or pressing fire) that will maximize the reward (here, quite literally, the game score). It is not just the next immediate action that we need to consider, though, but all actions until the end of the game, since we can accumulate additional points as long as we do not run out of lives.
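
To make this agent-environment loop concrete, here is a minimal sketch assuming the Gymnasium library (an assumption; the lesson itself does not fix a library). A random agent observes a state, takes an action, receives a reward, and accumulates the total score until the episode ends. The simple CartPole environment is used only because it needs no extra game ROM files; an Atari environment would follow the same loop.

```python
import gymnasium as gym  # assumed dependency: pip install gymnasium

# Create an environment; an Atari game would use the same interface.
env = gym.make("CartPole-v1")

state, info = env.reset(seed=0)   # initial state x_1
total_reward = 0.0                # accumulated score
terminated = truncated = False

while not (terminated or truncated):
    action = env.action_space.sample()                              # a_t (here, chosen at random)
    state, reward, terminated, truncated, info = env.step(action)   # x_{t+1}, r_t
    total_reward += reward

env.close()
print(f"Episode return (total score): {total_reward}")
```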

Learning from expert gameplay

Expert video game players learn how to react in different situations and develop a policy to follow when confronted with diverse scenarios during gameplay. The problem of RL is to determine a machine learning algorithm that can replicate the behavior of such a human expert by taking a set of inputs (the current state of the game) and outputting the optimal action to increase the probability of winning.
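
As a sketch of what such an algorithm might look like, the following assumes TensorFlow/Keras (an assumption; the lesson has not fixed a framework) and builds a small convolutional policy network that maps screen pixels to a probability distribution over the possible actions. The layer sizes are illustrative, loosely following the DQN architecture from the paper cited above.

```python
import tensorflow as tf  # assumed dependency

N_ACTIONS = 6            # hypothetical size of the action set A

# A convolutional "policy" that maps a stack of screen frames (the state)
# to a probability distribution over the N possible actions.
policy = tf.keras.Sequential([
    tf.keras.Input(shape=(84, 84, 4)),                        # 4 stacked 84x84 frames
    tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu"),
    tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(N_ACTIONS, activation="softmax"),   # P(action | state)
])

# Picking the most probable action for a batch containing one placeholder state:
state = tf.zeros((1, 84, 84, 4))
best_action = tf.argmax(policy(state), axis=-1)
```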

To formalize this description with some mathematical notation, we can denote the environment in which an agent acts, such as the video game, as $\epsilon$, which outputs a set of data (pixels) at each point in time as the state, $x_t$. At each point of time in the game, the player (algorithm) selects an action, $a_t$, from a set of $N$ actions $A = \{1, 2, \dots, N\}$; this is also known as the action set or action space. While for clarity we limit our discussion to discrete action spaces, there is theoretically no such restriction, and the theory works just as well for continuous actions (though the resulting RL problem is more complex, with an infinite space of possibilities). For each of these actions, the agent receives a reward, $r_t$, that can change the game score.
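
A minimal sketch of how this notation might map to code (all names here are hypothetical, not taken from any specific library): each time step produces a state, action, and reward, and the quantity the player ultimately cares about is the sum of rewards over the whole episode.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

N_ACTIONS = 6  # the size N of the action set A = {1, ..., N}; value is illustrative

@dataclass
class TimeStep:
    state: np.ndarray   # x_t: the screen pixels at time t
    action: int         # a_t: an element of A
    reward: float       # r_t: the change in score caused by a_t

def episode_score(steps: List[TimeStep]) -> float:
    """Total reward accumulated over an episode: the sum over t of r_t."""
    return sum(step.reward for step in steps)
```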

If we were to consider only the current screen, $x_t$, as the “state” of the system, then our decision would rely only on the present, and the RL problem would become a Markov decision process (MDP), as the choice of the next action would depend only on immediately available data and not on history.
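
This “memorylessness” is the Markov property: the distribution of the next state depends only on the current state and action, not on the rest of the history:

$P(x_{t+1} \mid x_t, a_t) = P(x_{t+1} \mid x_1, \dots, x_t, a_1, \dots, a_t)$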

However, for the video game example given above, the current screen alone is probably not enough information to determine the optimal action, because the game is only partially observable—we don’t know where an enemy spacecraft that has moved off the screen might re-emerge. We also don’t know what direction our ship is moving in without comparing the current screen to prior frames, which might affect whether we need to change direction or not. If the current state of the environment contains all the information we need to know about the game—such as a game of cards in which all players show their hands—then we say that the environment is fully observable (Barekatain, Parnian. 2019. “Understanding ‘Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning.’” Medium. https://medium.com/@parnianbrk/understanding-stabilising-experience-replay-for-deep-multi-agent-reinforcement-learning-84b4c04886b5).
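
One common remedy for this partial observability, used in the DQN work cited above, is to treat a short history of recent screens as the state. A minimal sketch (the frame size and a stack length of 4 are illustrative choices, not requirements):

```python
from collections import deque

import numpy as np

K = 4                          # how many recent screens to keep (illustrative)
history = deque(maxlen=K)      # the oldest frame is dropped automatically

def observe(frame: np.ndarray) -> np.ndarray:
    """Add the newest screen and return the stacked state fed to the model."""
    if not history:                      # at the start of an episode,
        history.extend([frame] * K)      # pad the history with the first frame
    else:
        history.append(frame)
    return np.stack(list(history), axis=-1)   # shape: (height, width, K)

# Example: a blank 84x84 screen produces an (84, 84, 4) state.
state = observe(np.zeros((84, 84), dtype=np.float32))
print(state.shape)
```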

A Markov decision process

In the figure, transitions (black arrows) between states (green circles) via actions taken with certain probabilities (orange circles) yield rewards (orange arrows).
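
A structure like the one in the figure could, for illustration only, be written out as a small transition table: each (state, action) pair maps to its possible next states with their probabilities and rewards. All state names, probabilities, and rewards below are invented.

```python
import random

# (state, action) -> list of (probability, next_state, reward); values are invented.
transitions = {
    ("s0", "fire"):  [(0.8, "s1", 10.0), (0.2, "s0", 0.0)],
    ("s0", "right"): [(1.0, "s2", 0.0)],
    ("s1", "fire"):  [(1.0, "s1", 5.0)],
    ("s2", "left"):  [(1.0, "s0", 0.0)],
}

def step(state: str, action: str):
    """Sample the next state and reward according to the transition probabilities."""
    outcomes = transitions[(state, action)]
    probabilities = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probabilities, k=1)[0]
    return next_state, reward

print(step("s0", "fire"))   # e.g. ('s1', 10.0) with probability 0.8
```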

Indeed, a human video game player does not rely only on the immediate state of the game to determine what to do next; they also rely on cues from prior points in the game, such as the point at which enemies went offscreen, to anticipate them re-emerging.

Similarly, our algorithm will benefit from using a sequence of states and actions leading to the current state, $s = \{x_1, \dots, x_t; a_1, \dots, a_{t-1}\}$ ...