There are a lot of RL packages out there: tensorforce, rllab, openai-lab, baselines, and the list goes on. It’s hard to know, however, how any of these implementations stack up against published state-of-the-art results. Here are a few of the reasons:

1. The standard RL tasks (MuJoCo & Atari) are extremely sensitive not only to model hyper-parameters but even to random seeds (Islam et al.).
2. Even researchers implementing the same algorithm in two different code-bases have failed to produce consistent results given the same model parameters (described in Henderson et al., e.g. Figure 6).
3. It takes time to build reproducible experimentation code.

I’ve been building yarlp for educational purposes, and I wanted to make sure my implementations matched reference results. OpenAI seems to have created something akin to tensorflow models in terms of reproducibility for RL, so a natural step was to benchmark against OpenAI baselines. Nevertheless, I found it quite difficult to match their results because of:

1. environment wrappers and how environment observations were normalized
2. simple tweaks to model hyper-parameters
3. the choice of baseline model
4. and random seeds!

Here are my results on Mujoco1M after painstakingly hashing out minor differences between OpenAI’s implementation of TRPO and mine (the two are now virtually identical 😂). I averaged over 5 random seeds using this script in baselines, and using the run_benchmark CLI script in yarlp, which run all environments in parallel. The results match, but clearly even 5 random seeds are not enough (the plots show 95% confidence intervals).
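For reference, a mean curve with a 95% confidence interval across seeds could be computed roughly like this. This is a minimal sketch, not the plotting code from yarlp or baselines: the function name `mean_and_ci95` and the input layout (one row of returns per seed) are assumptions made for illustration, and it uses a Student-t interval because the number of seeds is small.

```python
import numpy as np

# Two-sided 95% Student-t critical values, keyed by degrees of freedom.
# With only a handful of seeds these are much wider than the normal 1.96.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def mean_and_ci95(returns_per_seed):
    """Return (mean, lower, upper) curves across seeds.

    returns_per_seed: array-like of shape (n_seeds, n_points), one row per
    random seed (hypothetical layout, chosen for this sketch).
    """
    x = np.asarray(returns_per_seed, dtype=float)
    n_seeds = x.shape[0]
    if n_seeds < 2:
        raise ValueError("need at least 2 seeds to estimate an interval")
    mean = x.mean(axis=0)
    # Standard error of the mean, using the sample standard deviation.
    sem = x.std(axis=0, ddof=1) / np.sqrt(n_seeds)
    # Fall back to the normal approximation for 10 or more degrees of freedom.
    t_crit = T_CRIT_95.get(n_seeds - 1, 1.96)
    half_width = t_crit * sem
    return mean, mean - half_width, mean + half_width
```

With 5 seeds the critical value is 2.776 rather than 1.96, which is one reason the bands stay wide when only a few seeds are averaged.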

To demonstrate some of the difficulty in reproducibility, here is the exact same algorithm averaged over 2 randomly chosen sets of 3 random seeds on Swimmer-v1 (similar to Figure 10 in Henderson et al.):
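A comparison of this kind can be sketched as follows. The function name `compare_random_seed_subsets` and the array layout are hypothetical, and the sketch assumes the two sets of seeds are disjoint, which the experiment above does not state explicitly.

```python
import numpy as np


def compare_random_seed_subsets(curves, k=3, rng=None):
    """Split per-seed learning curves into two random disjoint groups of k.

    curves: array-like of shape (n_seeds, n_points), one row per seed
            (hypothetical layout for this sketch).
    rng:    None, an int seed, or a numpy Generator, for repeatable splits.
    Returns the mean curve of each group and the row indices in each group.
    """
    curves = np.asarray(curves, dtype=float)
    n_seeds = curves.shape[0]
    if n_seeds < 2 * k:
        raise ValueError("need at least 2*k seeds for two disjoint groups")
    order = np.random.default_rng(rng).permutation(n_seeds)
    group_a, group_b = order[:k], order[k:2 * k]
    mean_a = curves[group_a].mean(axis=0)
    mean_b = curves[group_b].mean(axis=0)
    return mean_a, mean_b, (sorted(group_a.tolist()), sorted(group_b.tolist()))
```

If the two mean curves differ by about as much as the gap between two algorithms you want to compare, that gap is not telling you much.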

And this is what happens when I use a value function implementation from rllab instead of the one used in OpenAI baselines on Swimmer-v1, averaged over 3 random seeds:

And this is what happens when observations are not normalized on the Mujoco1M benchmark (Walker2d and HalfCheetah perform noticeably worse):
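To make "normalized observations" concrete, here is a minimal sketch of a running mean/variance normalizer built on Welford's online update. It is an illustration under stated assumptions, not the wrapper used in yarlp or baselines: the class name `RunningObsNormalizer`, the clip value of 5.0, and the epsilon are all hypothetical choices.

```python
import numpy as np


class RunningObsNormalizer:
    """Tracks a running mean/variance of observations and standardizes them."""

    def __init__(self, shape, clip=5.0, eps=1e-8):
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)  # running sum of squared deviations
        self.count = 0
        self.clip = clip  # hypothetical clipping range for normalized values
        self.eps = eps    # avoids division by zero for constant dimensions

    def update(self, obs):
        # Welford's online update for the mean and squared deviations.
        obs = np.asarray(obs, dtype=float)
        self.count += 1
        delta = obs - self.mean
        self.mean = self.mean + delta / self.count
        self.m2 = self.m2 + delta * (obs - self.mean)

    def normalize(self, obs):
        # Use unit variance until at least two observations have been seen.
        if self.count > 1:
            var = self.m2 / self.count
        else:
            var = np.ones_like(self.mean)
        z = (np.asarray(obs, dtype=float) - self.mean) / np.sqrt(var + self.eps)
        return np.clip(z, -self.clip, self.clip)
```

In a training loop you would typically call `update` on each raw observation and feed `normalize(obs)` to the policy and value function. Details such as whether statistics are updated per step or per batch are exactly the kind of small difference that can shift results.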

One can easily produce these kinds of results for other environments, as seen in Henderson et al.

RL is fun, but it’s a bit concerning how unstable these algorithms and environments can be.