Running GAIL on PyBullet Gym
Learn how to implement GAIL and build the actor-critic network.
Mujoco is a popular physics simulator used for developing reinforcement learning benchmarks, for example by the research group OpenAI; however, it is closed source and requires a license for use.
Running GAIL
For our experiments, we'll use PyBullet Gymperium, a drop-in replacement for Mujoco that allows us to run a physics simulator and import agents trained in Mujoco environments.
See the OpenAI Baselines repository for reference implementations of many of the reinforcement learning algorithms used with these benchmarks.
To show how this simulated environment works, let’s create a “hopper,” one of the many virtual agents you can instantiate with the library:
import gym
import pybulletgym

env = gym.make('HopperMuJoCoEnv-v0')
observation = env.reset()
print("Observation vector of the walker:\n", observation)
The output of the code above is an array giving the current observation of the walker (an 11-dimensional vector).
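If you want to confirm these dimensions yourself, you can query the environment's spaces directly. This is a minimal check assuming the same env object created above:

# Inspect the observation and action spaces of the hopper environment
print("Observation space:", env.observation_space.shape)  # (11,) for the hopper
print("Action space:", env.action_space.shape)            # torques applied to the hopper's joints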
Adding the call env.render("human") will create a window showing the "hopper," a simple single-footed figure that moves in a simulated 3D environment (shown in the figure below):
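Note that in some versions of PyBullet Gymperium the visualization window only appears if rendering is requested before the first reset; a short sketch of that ordering (an assumption about the library version, using the same env object) is:

# In some pybullet-gym versions the GUI must be requested before reset()
env.render("human")        # open the visualization window
observation = env.reset()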
We can run a few iterations of the hopper in its raw, "untrained" form, to get a sense of how it looks. In this simulation, we take up to 1,000 steps and visualize it using a pop-up window:
env.reset()
for t in range(1000):
    action = env.action_space.sample()
    _, _, done, _ = env.step(action)
    env.render("human")
    if done:
        break
We first clear the environment with reset(). Then, for up to 1,000 timesteps, we sample the action space (for example, a random set of torques for the hopper's joints), pass the sampled action to step() to get an updated reward and observation, and render the result until the movement completes.
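Since GAIL ultimately consumes trajectories of (observation, action) pairs, it helps to see how a rollout like the one above could be recorded. The following is a sketch under that assumption; the names obs_list and act_list are our own, not part of the library:

import numpy as np

# Roll out one episode with a random policy and record (observation, action) pairs
obs_list, act_list = [], []
obs = env.reset()
for t in range(1000):
    action = env.action_space.sample()   # random policy, for illustration only
    obs_list.append(obs)
    act_list.append(action)
    obs, reward, done, _ = env.step(action)
    if done:
        break

observations = np.array(obs_list)        # shape: (timesteps, 11)
actions = np.array(act_list)             # shape: (timesteps, action_dim)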
This demonstration comes from a completely untrained hopper. For our GAIL implementation, we’ll need a hopper that has been successfully trained to walk as a sample of “expert” trajectories for the algorithm. For this purpose, we’ll download a set of hopper data from the OpenAI site.
The download contains a set of NumPy archives, such as deterministic.trpo.Hopper.0.00.npz, holding samples of data from reinforcement learning agents trained using the Trust Region Policy Optimization (TRPO) algorithm used in step 4 of the
If we load this data, we can also ...
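As a sketch of what that loading step might look like, assume the archive deterministic.trpo.Hopper.0.00.npz has been downloaded into the working directory; the exact array names inside depend on the file, so we list them before using them:

import numpy as np

# Load the expert trajectory archive and inspect its contents
expert_data = np.load("deterministic.trpo.Hopper.0.00.npz", allow_pickle=True)
print("Arrays in the archive:", expert_data.files)

# Print the shape of each stored array (for example, expert observations and actions)
for name in expert_data.files:
    print(name, expert_data[name].shape)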