...


Adversarial Learning and Imitation

Learn about the generative adversarial imitation learning algorithm.


Given a set of expert observations (of a driver or a champion video game player, for example), we would like to find a reward function that assigns a high reward to an agent that matches expert behavior and a low reward to agents that do not. At the same time, we want to choose the agent's policy, π, under this reward function so that it is as informative as possible, maximizing entropy while preferring expert over non-expert choices. We'll show how both goals are achieved through an algorithm known as generative adversarial imitation learning (GAIL), published in 2016 (Ho, Jonathan, and Stefano Ermon. 2016. "Generative Adversarial Imitation Learning." arXiv. https://doi.org/10.48550/arXiv.1606.03476).

GAIL

In what follows, we use a cost function instead of a reward function to match the conventions of the referenced literature on this topic; the cost is simply the negative of the reward function.
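Concretely, if $r(s,a)$ denotes the reward function, the corresponding cost is:

$$
c(s, a) = -\,r(s, a)
$$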

Putting these constraints together, we get the following objective (Ho and Ermon, 2016):
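$$
\max_{c}\;\Big(\min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_{\pi}\big[c(s,a)\big]\Big) - \mathbb{E}_{\pi_E}\big[c(s,a)\big]
$$

Here, $\mathbb{E}_{\pi}[c(s,a)]$ is shorthand for the expected discounted cumulative cost $\mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\big]$, $\pi_E$ denotes the expert policy, and $H(\pi)$ is the discounted causal entropy defined next; this is the maximum-entropy IRL objective as written in the cited paper.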

In the inner term, we are trying to find a policy that maximizes the discounted entropy:
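$$
H(\pi) \triangleq \mathbb{E}_{\pi}\big[-\log \pi(a \mid s)\big]
$$

That is, the γ-discounted causal entropy $\mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty} \gamma^{t}\,\big(-\log \pi(a_t \mid s_t)\big)\big]$, following the definition used in the cited paper.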

Maximizing this entropy makes the $-H(\pi)$ term a large negative value, while the inner minimization also drives down the expected cost term. We then maximize the whole expression over candidate cost functions, selecting one that not only satisfies this inner problem but also assigns low cost to expert-like behavior, which maximizes the overall expression. Note that this inner term is also equivalent to an RL problem that seeks an agent whose behavior optimizes the objective:
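$$
\mathrm{RL}(c) = \underset{\pi \in \Pi}{\operatorname{argmin}}\; -H(\pi) + \mathbb{E}_{\pi}\big[c(s,a)\big]
$$

(this is the entropy-regularized RL procedure $\mathrm{RL}(c)$ as stated in the cited paper)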

Here, the minimization runs over the space of possible policies π, denoted Π. However, to limit the space of possible choices of c, we apply a regularization function ψ(c) ...