[Bug] Large drop in mean reward using multiprocessing with make_vec_env and SAC
### 🐛 Bug
Large drop in mean reward using multiprocessing with make_vec_env and SAC
### To Reproduce
See the code below (modified from the Colab example, using the latest SB3):
```python
# Install the latest stable-baselines3 version
!pip install git+https://github.com/DLR-RM/stable-baselines3#egg=stable-baselines3[extra]

import time

import gym

from stable_baselines3 import SAC, PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy


def compare_multi_process(model_choice, env_id, n_timesteps, num_cpu=6, policy_type="MlpPolicy"):
    eval_env = gym.make(env_id)  # environment for evaluation
    vec_env = make_vec_env(env_id, n_envs=num_cpu)  # num_cpu = number of processes to use

    # Multiprocessed RL training
    multi_model = model_choice(policy_type, vec_env, verbose=0)
    model_name = type(multi_model).__name__  # save model name as a string for printing
    start_time = time.time()
    multi_model.learn(n_timesteps)
    total_time_multi = time.time() - start_time
    print(f"\nTook {total_time_multi:.2f}s for {model_name}_multi - {n_timesteps / total_time_multi:.2f} FPS")

    # Single-process RL training
    non_multi_model = model_choice(policy_type, env_id, verbose=0)
    start_time = time.time()
    non_multi_model.learn(n_timesteps)
    total_time_single = time.time() - start_time
    print(f"Took {total_time_single:.2f}s for {model_name}_non_multi - {n_timesteps / total_time_single:.2f} FPS")
    print(f"Multiprocessed training is {total_time_single / total_time_multi:.2f}x faster!")

    # Evaluate the trained agents
    mean_reward, std_reward = evaluate_policy(multi_model, eval_env, n_eval_episodes=10)
    print(f"Mean reward: {model_name}_multi, {mean_reward} +/- {std_reward:.2f}")
    mean_reward, std_reward = evaluate_policy(non_multi_model, eval_env, n_eval_episodes=10)
    print(f"Mean reward: {model_name}_non_multi, {mean_reward} +/- {std_reward:.2f}")


n_timesteps = 10000
env_id = "Pendulum-v0"
# env_id = "MountainCarContinuous-v0"

compare_multi_process(SAC, env_id, n_timesteps)
compare_multi_process(PPO, env_id, n_timesteps)
```
### Traceback

```
Took 18.65s for SAC_multi - 536.27 FPS
Took 110.15s for SAC_non_multi - 90.79 FPS
Multiprocessed training is 5.91x faster!
Mean reward: SAC_multi, -1205.6901091783773 +/- 106.64
Mean reward: SAC_non_multi, -190.3052033510613 +/- 55.82
Took 11.11s for PPO_multi - 899.92 FPS
Took 18.54s for PPO_non_multi - 539.47 FPS
Multiprocessed training is 1.67x faster!
Mean reward: PPO_multi, -1111.555600091256 +/- 332.89
Mean reward: PPO_non_multi, -1040.340094899712 +/- 249.71
```
### Expected behavior
We notice a small decrease in mean reward (~10%) for PPO, which is expected. SAC, however, shows a huge drop in mean reward (600%+, from roughly -190 to -1206), which is quite extreme. I have tested other envs such as “MountainCarContinuous-v0” with similar results.
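A plausible explanation, consistent with the docs warning the maintainers link below: SAC's defaults (`train_freq=1`, `gradient_steps=1`) perform one gradient step per call to `env.step()`, but with a vectorized env each such call collects `num_cpu` transitions, so a fixed timestep budget yields proportionally fewer updates. A rough sketch of the arithmetic, assuming those defaults:

```python
n_timesteps = 10_000
num_cpu = 6

# Each env.step() on the VecEnv consumes num_cpu timesteps from the budget,
# so the training loop (and hence the number of gradient updates with
# gradient_steps=1) shrinks by a factor of num_cpu.
single_env_updates = n_timesteps             # ~10,000 gradient steps
multi_env_updates = n_timesteps // num_cpu   # ~1,666 gradient steps
print(single_env_updates, multi_env_updates)
```

PPO is less affected because its rollout size scales with the number of envs and it always trains on everything it collected, so its update-to-data ratio stays roughly constant.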
### System Info
Colab with GPU. Latest Chrome.
### Checklist
- I have checked that there is no similar issue in the repo (required)
- I have read the documentation (required)
- I have provided a minimal working example to reproduce the bug (required)
### Top GitHub Comments
I really appreciate your help! Sure, if I could find anything, I’d be happy to leave the issue (though I think it’s rather hard, as SB3 is such perfect work!)
Probably, similar to https://github.com/DLR-RM/stable-baselines3/pull/654#issuecomment-997366746, the answer is in the docs; see the warning in https://stable-baselines3.readthedocs.io/en/master/guide/examples.html#multiprocessing-with-off-policy-algorithms
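For completeness, a minimal sketch of what that docs warning recommends (the hyperparameters here are illustrative, not a tuned fix): scale the number of gradient steps with the number of environments so the update-to-data ratio matches the single-env run.

```python
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env

vec_env = make_vec_env("Pendulum-v0", n_envs=6)

# train_freq=1: train after every call to env.step() (which here collects
# 6 transitions, one per env).
# gradient_steps=-1: perform as many gradient steps as transitions were
# collected, keeping the update-to-data ratio comparable to a single env.
model = SAC("MlpPolicy", vec_env, train_freq=1, gradient_steps=-1, verbose=0)
model.learn(total_timesteps=10_000)
```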