There are a lot of RL packages out there: tensorforce, rllab, openai-lab, baselines, and the list goes on. It's hard to know, however, how any of those implementations stack up against published state-of-the-art results. There are several reasons for this, among them:
- The standard RL tasks (Mujoco & Atari) are extremely sensitive not only to model hyper-parameters but even to random seeds (Islam et al.).
- Even researchers implementing the same algorithm in two different code-bases have failed to produce consistent results given the same model parameters (described in Henderson et al., e.g. Figure 6).
- It takes time to build reproducible experimentation code.
I’ve been building yarlp for educational purposes, and I wanted to make sure my implementations matched baselines. OpenAI seems to have created something akin to tensorflow models for RL in terms of reproducibility, so a natural step was to benchmark against OpenAI baselines. Nevertheless, I found it quite difficult to match their results because of:
- environment wrappers and how environment observations were normalized (see the sketch after this list)
- simple tweaks to model hyper-parameters
- the choice of baseline model
- and random seeds!
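On that first point: by observation normalization I mean whitening each observation with running statistics before it reaches the policy. Here is a minimal sketch of the idea, assuming the standard running mean/variance update; the class below is my own illustration, not baselines' actual wrapper code:

```python
import numpy as np

class RunningMeanStd:
    """Running mean and variance of observations, updated batch by batch
    with the standard parallel (Chan et al.) combination formula."""

    def __init__(self, shape):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = 1e-4  # tiny prior count so the first update is well-defined

    def update(self, batch):
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        batch_count = batch.shape[0]

        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.mean, self.var, self.count = new_mean, m2 / total, total


def normalize_obs(obs, rms, clip=10.0):
    """Whiten an observation with the running statistics and clip extremes."""
    return np.clip((obs - rms.mean) / np.sqrt(rms.var + 1e-8), -clip, clip)
```

Whether this whitening happens at all (and whether the statistics keep updating during training) makes a visible difference on some environments, as the unnormalized Mujoco1M plots further down show.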
Here are my results on Mujoco1M after painstakingly hashing out minor differences between OpenAI's implementation of TRPO and mine. I averaged over 5 random seeds using this script in baselines and the run_benchmark CLI script in yarlp, both of which run all environments in parallel. The results match, but clearly even 5 random seeds are not enough (we plot the 95% confidence interval).
*(Figures: mean episode reward per environment on the Mujoco1M benchmark, yarlp TRPO vs. OpenAI baselines TRPO, averaged over 5 random seeds with 95% confidence bands.)*
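For reference, the aggregation behind those curves is simple: stack the per-seed reward curves and plot a mean with a normal-approximation confidence band. A sketch, where run_trpo is a hypothetical stand-in for whatever script produces one seed's reward curve:

```python
import numpy as np

def mean_and_ci(curves, z=1.96):
    """Aggregate per-seed learning curves into a mean curve plus a
    normal-approximation 95% confidence band.

    curves: array of shape (n_seeds, n_iterations) of mean episode rewards.
    """
    curves = np.asarray(curves, dtype=np.float64)
    mean = curves.mean(axis=0)
    sem = curves.std(axis=0, ddof=1) / np.sqrt(curves.shape[0])
    return mean, mean - z * sem, mean + z * sem

# Hypothetical usage:
# curves = np.stack([run_trpo("Swimmer-v1", seed=s) for s in range(5)])
# mean, lo, hi = mean_and_ci(curves)
# plt.plot(mean); plt.fill_between(np.arange(len(mean)), lo, hi, alpha=0.3)
```

With only 5 seeds the band stays wide and jumpy, which is exactly what the plots show.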
To demonstrate some of the difficulty of reproducibility, here is the exact same algorithm averaged over 2 randomly chosen sets of 3 random seeds on Swimmer-v1 (similar to Figure 10 in Henderson et al.):
*(Figures: Swimmer-v1, the same algorithm with two different sets of 3 random seeds.)*
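The only thing changing between those two curves is the set of seeds, so it's worth being explicit about what a "seed" needs to pin down. A rough sketch for the gym/numpy side (the deep learning framework's own seed, e.g. TensorFlow's graph-level seed, has to be set as well, which I'm omitting here):

```python
import random
import numpy as np
import gym

def make_seeded_env(env_id, seed):
    """Build an environment and pin the obvious sources of randomness."""
    env = gym.make(env_id)
    env.seed(seed)        # environment dynamics / reset noise
    np.random.seed(seed)  # numpy-based sampling (exploration noise, shuffling)
    random.seed(seed)     # python's stdlib RNG, in case anything uses it
    return env

# env = make_seeded_env("Swimmer-v1", seed=0)
```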
And this is what happens when I use a value function implementation from rllab in place of the one used in OpenAI baselines, on Swimmer-v1, averaged over 3 random seeds:
*(Figure: Swimmer-v1, rllab value function vs. OpenAI baselines value function, averaged over 3 random seeds.)*
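The "baseline model" here is the value function used for advantage estimation. rllab's linear baseline is, roughly, a regularized least-squares fit on hand-crafted features of the observations and the time step; the sketch below is my paraphrase of that idea, not rllab's actual code, while baselines' TRPO fits a small neural network value function by gradient descent instead:

```python
import numpy as np

class LinearFeatureBaseline:
    """Value function fit by regularized least squares on hand-crafted
    features, roughly in the spirit of rllab's linear baseline."""

    def __init__(self, reg=1e-5):
        self.reg = reg
        self.coeffs = None

    def _features(self, obs):
        # obs: array of shape (timesteps, obs_dim) for one trajectory.
        # Features: observations, their squares, a few powers of the
        # (scaled) time index, and a bias term.
        obs = np.asarray(obs, dtype=np.float64)
        t = np.arange(len(obs), dtype=np.float64).reshape(-1, 1) / 100.0
        return np.concatenate(
            [obs, obs ** 2, t, t ** 2, t ** 3, np.ones_like(t)], axis=1)

    def fit(self, obs, returns):
        feats = self._features(obs)
        a = feats.T @ feats + self.reg * np.eye(feats.shape[1])
        self.coeffs = np.linalg.solve(a, feats.T @ np.asarray(returns))

    def predict(self, obs):
        if self.coeffs is None:
            return np.zeros(len(obs))
        return self._features(obs) @ self.coeffs
```

Seemingly innocuous details like this (the feature set, the regularization, how the fit is refreshed between iterations) are enough to change the Swimmer-v1 curve above.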
And this is what happens when observations are not normalized on the Mujoco1M benchmark (Walker2d and HalfCheetah perform noticeably worse):
*(Figures: the Mujoco1M benchmark with unnormalized observations.)*
One can easily produce these kinds of results for other environments, as seen in Henderson et al.
RL is fun, but it’s a bit concerning how unstable these algorithms and environments can be.