Reinforcement Learning
Learn about reinforcement learning and the principle of optimality in the context of video games.
Most discriminative AI examples involve applying a continuous or discrete label to a piece of data. Whether a deep neural network is determining the digit represented by an MNIST image or deciding whether a CIFAR-10 image contains a horse, the model produces a single output: a prediction with minimal error. In reinforcement learning, we also want to make such point predictions, but over many steps, optimizing not the error of any single prediction but the cumulative outcome of the whole sequence of decisions.
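To make this contrast concrete, we can write down the quantity a reinforcement learning agent tries to maximize. The notation here is introduced only for illustration: if \(r_t\) denotes the reward received at step \(t\) of a game lasting \(T\) steps, the agent's objective is the total reward

\[
R = r_1 + r_2 + \cdots + r_T = \sum_{t=1}^{T} r_t,
\]

rather than the quality of any one decision taken in isolation.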
Atari video game example
As a concrete example, imagine a video game where a player pilots a spaceship to defeat alien vessels. The spaceship navigated by the player in this game is the agent whose behavior we want to learn, and the game itself is the environment in which that agent acts.
Learning from expert gameplay
Expert video game players learn how to react in different situations; in other words, they learn a policy to follow when confronted with diverse scenarios during gameplay. The problem of RL is to design a machine learning algorithm that can replicate the behavior of such a human expert by taking a set of inputs (the current state of the game) and outputting the action that most increases the probability of winning.
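As a rough sketch of what that state-in, action-out structure looks like in code, the loop below pairs a policy (a function from game state to action) with an environment. The environment interface, the action names, and the random policy are hypothetical stand-ins rather than part of any particular library; they only illustrate the shape of the problem.

```python
import random

# Hypothetical action set for the spaceship game (illustrative only).
ACTIONS = ["left", "right", "fire", "noop"]

def random_policy(state):
    """A placeholder policy: maps the current game state to an action.
    An RL algorithm's job is to replace this with a learned mapping
    that maximizes the chance of winning."""
    return random.choice(ACTIONS)

def play_episode(env, policy, max_steps=1000):
    """Play one game: repeatedly observe the state, ask the policy
    for an action, and apply that action to the environment."""
    state = env.reset()                          # assumed: returns the initial game state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)  # assumed environment interface
        total_reward += reward
        if done:
            break
    return total_reward
```

Learning then amounts to replacing random_policy with a function whose choices make total_reward as large as possible across many games.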
To formalize this description with some mathematical notation, we can denote the environment, such as the video game, in which an agent acts as \(\mathcal{E}\). At each time step \(t\), the agent observes the current state of the game, chooses an action \(a_t\), and receives a reward \(r_t\), such as an increase in the game's score.
If we were to consider only the current screen, we could treat that single frame of pixels as the state on which the agent bases its choice of action.
However, for the video game example given above, the current screen alone is probably not enough information to determine the optimal action, because the game is only partially observable: we cannot tell from a single frame whether an enemy vessel has moved off the screen (and thus where it might re-emerge), nor can we tell what direction our own ship is moving without comparing the frame to earlier ones, which might affect whether we need to change course. If, by contrast, the current state of the environment contains all the information we need to know about the game, such as a game of cards in which all players show their hands, then we say that the environment is fully observable.
In the figure, transitions (black arrows) between states (green circles), made by taking actions with certain probabilities (orange circles), yield rewards (orange arrows).
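The structure described in the figure (states, actions, transition probabilities, and rewards) can also be written down as a small lookup table. The states, actions, and numbers below are invented purely for illustration and do not come from the game discussed above.

```python
# A toy Markov decision process written as a lookup table.
# transitions[(state, action)] is a list of (probability, next_state, reward)
# tuples, mirroring the figure: states (green circles), actions (orange
# circles), transitions (black arrows), and rewards (orange arrows).
transitions = {
    ("safe", "advance"):    [(0.8, "engaged", 1.0), (0.2, "safe", 0.0)],
    ("safe", "retreat"):    [(1.0, "safe", 0.0)],
    ("engaged", "fire"):    [(0.6, "safe", 5.0), (0.4, "destroyed", -10.0)],
    ("engaged", "retreat"): [(0.9, "safe", 0.0), (0.1, "destroyed", -10.0)],
}

def expected_reward(state, action):
    """Average immediate reward for taking `action` in `state`."""
    return sum(p * r for p, _next_state, r in transitions[(state, action)])

print(expected_reward("engaged", "fire"))  # 0.6 * 5 + 0.4 * (-10) = -1.0
```

A policy that consults such a table can weigh not only the immediate reward of an action but also which states that action is likely to lead to.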
Indeed, a human video game player does not rely only on the immediate state of the game to determine what to do next; they also rely on cues from prior points in the game, such as the point at which enemies went offscreen, to anticipate them re-emerging.
Similarly, our algorithm will benefit from using a sequence of states and actions leading up to the current point, rather than the latest screen alone, as the input from which it chooses its next action.
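One common way to give a learning algorithm this kind of memory is to hand it a short history of recent observations and actions instead of a single frame. The sketch below is a generic illustration of that idea, not tied to any particular library; the window length and helper names are arbitrary choices.

```python
from collections import deque

class HistoryState:
    """Keeps the last `window` (observation, action) pairs and exposes them
    as a single state for the agent, so that information such as an enemy
    moving off-screen is not lost between frames."""

    def __init__(self, window=4):
        self.buffer = deque(maxlen=window)

    def reset(self, first_observation):
        self.buffer.clear()
        # No action has been taken before the first observation arrives.
        self.buffer.append((first_observation, None))

    def update(self, observation, last_action):
        self.buffer.append((observation, last_action))

    def as_state(self):
        # The agent sees the whole recent history, not just the latest frame.
        return tuple(self.buffer)

# Usage: after each step of the game, record the new frame and the action
# that produced it, then pass as_state() to the policy.
history = HistoryState(window=4)
history.reset(first_observation="frame_0")
history.update(observation="frame_1", last_action="left")
print(len(history.as_state()))  # 2 entries so far; at most 4 once the buffer fills
```

Deep RL agents for Atari games apply the same principle when they stack the last few screens into a single input for the network.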