Imitation Learning in TensorFlow (Hopper from OpenAI Gym)

Sunday, September 24, 2017


Imitation learning, a.k.a. behavioral cloning, is learning from demonstration. In other words, a machine learns how to behave by watching what a teacher (or expert) does and then mimicking that behavior. For example, we can collect driving data from a human and then use that data to train a self-driving car.


Imitation learning is a form of supervised learning in which we have a set of observation-action pairs
\[\{(o_i, a_i)\}_{i=1}^N\] where $o_i$ is an observation collected from the environment (sensor outputs) and $a_i$ is the best action the machine can take for it, as supplied by the expert (I use "best" very loosely here). As in any other supervised learning problem, the goal is to estimate a function $f(\cdot)$ that, given an observation $o$, returns an estimate of the best action (defined below):
\[ \hat{a} = f(o) \]

The loss function (which defines "best" in the sentence above) minimizes the distance between the expert action and our estimate:
\[ \min_f \sum_{i=1}^N dist(\hat{a}_i, a_i) \] where $\hat{a}_i = f(o_i)$. For this exercise I use the Euclidean distance for the $dist$ function above.
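As a small illustration of this loss, the sketch below computes the summed Euclidean distance between predicted and expert actions in plain Python. The function names `euclidean` and `imitation_loss` and the sample numbers are made up for this example; they are not part of the original code. The network later in this post minimizes the mean squared error, which is a closely related objective.

```python
import math


def euclidean(a_hat, a):
    """Euclidean distance between a predicted and an expert action vector."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a_hat, a)))


def imitation_loss(predicted_actions, expert_actions):
    """Sum of per-sample distances, i.e. sum_i dist(a_hat_i, a_i)."""
    return sum(euclidean(p, q) for p, q in zip(predicted_actions, expert_actions))


# Two made-up samples with 3-dimensional actions.
predicted = [[0.0, 0.0, 0.0], [1.0, 2.0, 2.0]]
expert = [[3.0, 4.0, 0.0], [1.0, 2.0, 2.0]]

# The first sample is off by a 3-4-5 triangle, the second matches exactly.
loss = imitation_loss(predicted, expert)
print(loss)
```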

Getting our hands dirty

Let's use Hopper from OpenAI Gym. The Hopper is a two-dimensional one-legged robot. Its goal is to jump as far as possible without falling. You can find more information about Hopper on the OpenAI page.

Create Dataset

Let's import the gym package and set up the Hopper environment:

import gym
env = gym.make('Hopper-v1')

In gym you can inspect the environment's action and observation spaces with:

print(env.action_space)
print(env.observation_space)


For imitation learning, we need an expert to show us which action to take for each observation. For Hopper, let's grab an expert policy from here.

Let's assume we have loaded this pickle file so that

expert_action = expert_policy(obs)

To collect data from the expert, we run the Hopper multiple times, and each time we record (observation, action) pairs. Note that the environment is stochastic, so every run is a bit different from the others.

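num_iter = 80
observations, actions = [], []
for i in range(num_iter):
    obs = env.reset()
    done = False
    steps = 0
    while not done:
        expert_action = expert_policy(obs)
        # record the pair before stepping, so obs matches the action taken
        observations.append(obs)
        actions.append(expert_action)
        obs, r, done, _ = env.step(expert_action)
        steps += 1
        if steps >= env.spec.timestep_limit:
            break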

In the code above, in each iteration Hopper jumps until the episode is done or it reaches the step limit.
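Since Hopper needs MuJoCo, the same collection loop can be tried on a stand-in environment first. The sketch below wraps the loop in a function and runs it against a toy environment. `ToyEnv`, `toy_expert`, and `collect_demonstrations` are hypothetical names for this example, and `ToyEnv` only mimics the `reset`/`step` interface of the gym version used in this post (four return values from `step`).

```python
import random


class ToyEnv:
    """Hypothetical stand-in with a gym-like interface (not a real gym environment)."""

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return [self.rng.uniform(-1.0, 1.0)]

    def step(self, action):
        self.t += 1
        obs = [self.rng.uniform(-1.0, 1.0)]  # stochastic next observation
        reward = -abs(action[0])
        done = self.t >= self.horizon        # episode ends after `horizon` steps
        return obs, reward, done, {}


def toy_expert(obs):
    """Hypothetical expert: push against the observed value."""
    return [-obs[0]]


def collect_demonstrations(env, expert_policy, num_rollouts, max_steps):
    """Run the expert and record (observation, action) pairs."""
    observations, actions = [], []
    for _ in range(num_rollouts):
        obs = env.reset()
        done = False
        steps = 0
        while not done and steps < max_steps:
            action = expert_policy(obs)
            observations.append(obs)
            actions.append(action)
            obs, _, done, _ = env.step(action)
            steps += 1
    return observations, actions


observations, actions = collect_demonstrations(
    ToyEnv(), toy_expert, num_rollouts=3, max_steps=100)
print(len(observations), len(actions))
```

Passing `max_steps` explicitly plays the role of the step limit in the loop above, so the function does not depend on the environment exposing a limit of its own.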

Neural Network

Now that we have our supervised dataset, we can create a neural network that takes an observation and estimates the action to take:

model = BCModel(log_name='./logs/'+args.envname,

The details of BCModel can be found on my GitHub account. Here, let's just look at the computation graph:

def build_graph(self, activator=tf.nn.tanh, regularizer_scale=0.01):
        self.observations = tf.placeholder(shape=(None, self.observation_dim), dtype=tf.float32, name='observations')
        self.actions = tf.placeholder(shape=(None, self.action_dim), dtype=tf.float32, name='actions')

        # regularizer
        regularizer = tf.contrib.layers.l2_regularizer(scale=regularizer_scale)

        # layers
        W1 = tf.get_variable(shape=(self.observation_dim, 128),
        b1 = tf.get_variable(shape=(1, 128),
        logit1 = tf.matmul(self.observations, W1) + b1
        layer1 = activator(logit1, 'layer1')

        W2 = tf.get_variable(shape=(128, 64),
        b2 = tf.get_variable(shape=(1, 64),
        logit2 = tf.matmul(layer1, W2) + b2
        layer2 = activator(logit2, 'layer2')

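        # linear output layer; W3 (64 x action_dim) and b3 (1 x action_dim) are assumed to be created like W1/b1 above
        output = tf.matmul(layer2, W3) + b3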
        output_action = tf.identity(output, 'output_action')

        self.l2_loss = tf.losses.mean_squared_error(labels=self.actions, predictions=output)

        grad_wrt_input = tf.gradients(self.l2_loss, self.observations)
        grad_wrt_activations = tf.gradients(self.l2_loss, W1)

        self.optimizer = tf.train.AdamOptimizer(learning_rate=self.LEARNING_RATE).minimize(self.l2_loss)

This is a fully connected network with a linear output layer. Both hidden layers use the same activation function (activator).
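To make the shapes concrete, here is a plain-Python sketch of the forward pass of a network with the same layout: two tanh hidden layers (128 and 64 units) followed by a linear output. The helper names, the random initialization, and the sizes 11 and 3 used for the observation and action dimensions are assumptions for this example; it is not the BCModel code.

```python
import math
import random


def dense(x, W, b, activation=None):
    """Compute activation(x @ W + b) for a single input vector x."""
    out = [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]
    return [activation(v) for v in out] if activation else out


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an arbitrary choice for this sketch)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return W, [0.0] * n_out


rng = random.Random(0)
observation_dim, action_dim = 11, 3  # assumed sizes for this example
W1, b1 = init_layer(observation_dim, 128, rng)
W2, b2 = init_layer(128, 64, rng)
W3, b3 = init_layer(64, action_dim, rng)


def forward(observation):
    """Map one observation vector to one action vector."""
    layer1 = dense(observation, W1, b1, math.tanh)
    layer2 = dense(layer1, W2, b2, math.tanh)
    return dense(layer2, W3, b3)  # linear output layer, no activation


action = forward([0.5] * observation_dim)
print(len(action))
```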


I came across a couple of interesting findings while trying to maximize the reward for the Hopper.

1- The Adam optimizer works much better than gradient descent. The following plots show the MSE loss for identical networks trained with SGD and with Adam. As one can see, Adam 1) reaches a lower error faster and 2) is more stable (less variance) compared to SGD.

l_2 loss minimization using SGD

l_2 loss minimization using Adam optimizer

2- With regularization, tanh works slightly better than ReLU as the activation function (1169.94 vs. 1127.3 average reward), although ReLU scored higher in the unregularized runs. The following table summarizes the different attempts, varying the activation, optimizer, and output layer. Surprisingly, an LSTM output layer did not give a better reward than a linear layer. One possible explanation is that it increases the model complexity, and the amount of data might not be enough for the network to learn. This is left for future work.

Network Structure | Avg Iteration Reward
Two layer tanh SGD with linear output | 213
Two layer tanh Adam with linear output | 771.34
Two layer ReLU SGD with linear output | 532
Two layer ReLU Adam with linear output | 913.65
Two layer tanh with linear output with regularization (Adam) | 1169.94
Two layer ReLU with linear output with regularization (Adam) | 1127.3
Two layer ReLU with regularization with LSTM output layer | 991.2
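
For reference, the two update rules compared in the first finding can be written out in a few lines. This is a scalar sketch of plain gradient descent and of Adam with its commonly used default hyperparameters (beta1=0.9, beta2=0.999, eps=1e-8), applied to a made-up quadratic objective. It is not the TensorFlow implementation, and it makes no claim about which optimizer wins on this toy problem.

```python
import math


def sgd_step(w, grad, lr):
    """Plain gradient descent update."""
    return w - lr * grad


def adam_step(w, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; state holds the step count and both moment estimates."""
    t, m, v = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), (t, m, v)


def grad_fn(w):
    """Gradient of the toy objective (w - 3)^2."""
    return 2.0 * (w - 3.0)


w_sgd, w_adam, state = 0.0, 0.0, (0, 0.0, 0.0)
for _ in range(200):
    w_sgd = sgd_step(w_sgd, grad_fn(w_sgd), lr=0.1)
    w_adam, state = adam_step(w_adam, grad_fn(w_adam), state, lr=0.1)
print(w_sgd, w_adam)
```

On its first step, the bias-corrected Adam update has a magnitude close to the learning rate regardless of how large the gradient is. This normalization by the running gradient scale is one plausible reason Adam is less sensitive to gradient magnitude than plain gradient descent.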


  1. Hey, nice post! However, when trying to follow along I ran into a dependency error:
    `MujocoDependencyError: To use MuJoCo, you need to either populate ~/.mujoco/mjkey.txt and ~/.mujoco/mjpro131, or set the MUJOCO_PY_MJKEY_PATH and MUJOCO_PY_MJPRO_PATH environment variables appropriately. Follow the instructions on for where to obtain these.`

    Seems like we need a MuJoCo license to replicate this example.

    Do you have any workaround for this?
