There are a lot of RL packages out there,
baselines, and the list goes on. It’s hard to know however, how any of those implementations stack up to published state-of-the-art results. There are several reasons, some are:
- The standard RL tasks (Mujoco & Atari) are extremely sensitive to not only model hyper-parameters but even random seeds (Islam et al).
- Even researchers implementing the same algorithm in two different code-bases have failed to produce consistent results given the same model parameters (described in Henderson et al, e.g. Figure 6).
- It takes time to build reproducible experimentation code.
I’ve been building
yarlp for educational purposes, and I wanted to make sure my implementations matched baselines. OpenAI seems to have created something akin to tensorflow models in terms of reproducibility for RL, so a natural step was to benchmark against OpenAI
baselines. Nevertheless, I found it quite difficult to match their results because of:
- environment wrappers and how environment observations were normalized
- simple tweaks to model hyper-parameters
- the choice of baseline model
- and yes, random seeds!
Here are my results on Mujoco1M after painstakingly hashing out minor differences in OpenAI’s implementation of TRPO compared to mine (which are now virtually identical 😂). I averaged over 5 random seeds using this script in
baselines, and using the
run_benchmark cli script in
yarlp, which run all environments in parallel. The results match, but clearly even 5 random seeds is not enough (we plot the 95th percentile CI).
To demonstrate some of the difficulty in reproducibility, here is the same exact algorithm averaged over 2 randomly chosen sets of 3 random seeds on Swimmer-v1 (similar to Figure 10 in Henderson et al):
And this is what happens when I use a value function implementation from
rllab compared to the one used in OpenAI
baselines on Swimmer-v1 averaged over 3 random seeds:
And this is what happens when observations are not normalized on the Mujoco1M benchmark (Walker2d and HalfCheetah perform noticeably worse):
One can easily produce these kinds of results for other environments, as seen in Henderson et al.
RL is fun, but it’s a bit concerning how unstable these algorithms and environments can be.