Reinforcement Learning

Learn about reinforcement learning and the principle of optimality in the context of video games.

Most discriminative AI examples involve applying a continuous or discrete label to a piece of data. Whether we apply a deep neural network to determine the digit represented by an MNIST image or to decide whether a CIFAR-10 image contains a horse, the model produces a single output: a prediction with minimal error. In reinforcement learning, we also want to make such point predictions, but over many steps, optimizing the total error over repeated uses.

Atari video game example

As a concrete example, imagine a video game where a player pilots a spaceship to defeat alien vessels. The spaceship navigated by the player in this example (Mnih et al. 2013, "Playing Atari with Deep Reinforcement Learning," https://arxiv.org/abs/1312.5602) is the agent; the set of pixels on the screen at any point in the game is the state. Based on this environment, the player needs to take the right action (for example, moving right or pressing fire) that will maximize the reward (here, quite literally, the game score). It is not just the next immediate action that we need to consider, though, but all actions until the end of the game, since we can accumulate additional points as long as we do not run out of lives.
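
To make this agent-environment loop concrete, here is a minimal sketch assuming the Gymnasium library (an assumption; the lesson itself does not fix a library). A random agent observes a state, takes an action, receives a reward, and accumulates the total score until the episode ends. The simple CartPole environment is used only because it needs no extra game ROM files; an Atari environment would follow the same loop.

```python
import gymnasium as gym  # assumed dependency: pip install gymnasium

# Create an environment; an Atari game would use the same interface.
env = gym.make("CartPole-v1")

state, info = env.reset(seed=0)   # initial state x_1
total_reward = 0.0                # accumulated score
terminated = truncated = False

while not (terminated or truncated):
    action = env.action_space.sample()                              # a_t (here, chosen at random)
    state, reward, terminated, truncated, info = env.step(action)   # x_{t+1}, r_t
    total_reward += reward

env.close()
print(f"Episode return (total score): {total_reward}")
```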

Learning from expert gameplay

Expert video game players learn how to react in different situations and develop a policy to follow when confronted with diverse scenarios during gameplay. The problem of RL is to determine a machine learning algorithm that can replicate the behavior of such a human expert by taking a set of inputs (the current state of the game) and outputting the optimal action to increase the probability of winning.
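
As a sketch of what such an algorithm might look like, the following assumes TensorFlow/Keras (an assumption; the lesson has not fixed a framework) and builds a small convolutional policy network that maps screen pixels to a probability distribution over the possible actions. The layer sizes are illustrative, loosely following the DQN architecture from the paper cited above.

```python
import tensorflow as tf  # assumed dependency

N_ACTIONS = 6            # hypothetical size of the action set A

# A convolutional "policy" that maps a stack of screen frames (the state)
# to a probability distribution over the N possible actions.
policy = tf.keras.Sequential([
    tf.keras.Input(shape=(84, 84, 4)),                        # 4 stacked 84x84 frames
    tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu"),
    tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(N_ACTIONS, activation="softmax"),   # P(action | state)
])

# Picking the most probable action for a batch containing one placeholder state:
state = tf.zeros((1, 84, 84, 4))
best_action = tf.argmax(policy(state), axis=-1)
```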

To formalize this description with some mathematical notation, we can denote the environment in which an agent acts, such as the video game, as $\epsilon$, which outputs a set of data (pixels) at each point in time as the state, $x_t$. At each point of time in the game, the player (algorithm) selects an action, $a_t$, from a set of $N$ actions $A = \{1, 2, \dots, N\}$; this is also known as the action set or action space. While for clarity we limit our discussion to discrete action spaces, there is theoretically no such restriction, and the theory works just as well for continuous actions (though the resulting RL problem is more complex, with an infinite space of possibilities). For each of these actions, the agent receives a reward, $r_t$, that can change the game score.
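
A minimal sketch of how this notation might map to code (all names here are hypothetical, not taken from any specific library): each time step produces a state, action, and reward, and the quantity the player ultimately cares about is the sum of rewards over the whole episode.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

N_ACTIONS = 6  # the size N of the action set A = {1, ..., N}; value is illustrative

@dataclass
class TimeStep:
    state: np.ndarray   # x_t: the screen pixels at time t
    action: int         # a_t: an element of A
    reward: float       # r_t: the change in score caused by a_t

def episode_score(steps: List[TimeStep]) -> float:
    """Total reward accumulated over an episode: the sum over t of r_t."""
    return sum(step.reward for step in steps)
```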

If we were to consider only the current screen, $x_t$, as the “state” of the system, then our decision would rely only on the present, and the RL problem would become a Markov decision process (MDP), as the choice of the next action would depend only on immediately available data and not on history.
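
This “memorylessness” is the Markov property: the distribution of the next state depends only on the current state and action, not on the rest of the history:

$P(x_{t+1} \mid x_t, a_t) = P(x_{t+1} \mid x_1, \dots, x_t, a_1, \dots, a_t)$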

However, for the video game example given above, the current screen alone is probably not enough information to determine the optimal action, because the game is only partially observable—we don’t know where an enemy spacecraft that has moved off the screen might re-emerge. We also don’t know what direction our ship is moving in without comparing the current screen to prior frames, which might affect whether we need to change direction or not. If the current state of the environment contains all the information we need to know about the game—such as a game of cards in which all players show their hands—then we say that the environment is fully observable (Barekatain, Parnian. 2019. “Understanding ‘Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning.’” Medium. https://medium.com/@parnianbrk/understanding-stabilising-experience-replay-for-deep-multi-agent-reinforcement-learning-84b4c04886b5).
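
One common remedy for this partial observability, used in the DQN work cited above, is to treat a short history of recent screens as the state. A minimal sketch (the frame size and a stack length of 4 are illustrative choices, not requirements):

```python
from collections import deque

import numpy as np

K = 4                          # how many recent screens to keep (illustrative)
history = deque(maxlen=K)      # the oldest frame is dropped automatically

def observe(frame: np.ndarray) -> np.ndarray:
    """Add the newest screen and return the stacked state fed to the model."""
    if not history:                      # at the start of an episode,
        history.extend([frame] * K)      # pad the history with the first frame
    else:
        history.append(frame)
    return np.stack(list(history), axis=-1)   # shape: (height, width, K)

# Example: a blank 84x84 screen produces an (84, 84, 4) state.
state = observe(np.zeros((84, 84), dtype=np.float32))
print(state.shape)
```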

A Markov decision process

In the figure, transitions (black arrows) between states (green circles) via actions taken with certain probabilities (orange circles) yield rewards (orange arrows).
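
A structure like the one in the figure could, for illustration only, be written out as a small transition table: each (state, action) pair maps to its possible next states with their probabilities and rewards. All state names, probabilities, and rewards below are invented.

```python
import random

# (state, action) -> list of (probability, next_state, reward); values are invented.
transitions = {
    ("s0", "fire"):  [(0.8, "s1", 10.0), (0.2, "s0", 0.0)],
    ("s0", "right"): [(1.0, "s2", 0.0)],
    ("s1", "fire"):  [(1.0, "s1", 5.0)],
    ("s2", "left"):  [(1.0, "s0", 0.0)],
}

def step(state: str, action: str):
    """Sample the next state and reward according to the transition probabilities."""
    outcomes = transitions[(state, action)]
    probabilities = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probabilities, k=1)[0]
    return next_state, reward

print(step("s0", "fire"))   # e.g. ('s1', 10.0) with probability 0.8
```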

Indeed, a human video game player does not rely only on the immediate state of the game to determine what to do next; they also rely on cues from prior points in the game, such as the point at which enemies went offscreen, to anticipate them re-emerging.

Similarly, our algorithm will benefit from using a sequence of states and actions leading to the current state, $s = \{x_1, \dots, x_t; a_1, \dots, a_{t-1}\}$ ...