[Question] Big gap in training vs. test performance
I have the following program where I train a TD3 agent on the "AntBulletEnv-v0" environment.
This is my training code where I evaluate the agent every 10000 steps:
import pybullet_envs  # noqa: F401 -- registers AntBulletEnv-v0 with gym
from stable_baselines3 import TD3
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

env = make_vec_env("AntBulletEnv-v0", n_envs=1)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)
model = TD3('MlpPolicy', env)
eval_env = make_vec_env("AntBulletEnv-v0", n_envs=1)
eval_env = VecNormalize(eval_env, norm_obs=True, norm_reward=True, clip_obs=10.)
eval_callback = EvalCallback(eval_env, log_path="./", best_model_save_path="./")
model.learn(total_timesteps=2000000, callback=eval_callback)
model.save("model.zip")
env.save("vec_normalize.pkl")  # save the VecNormalize statistics alongside the model
And here are the last few lines of the training output, which show that the evaluation reward stays above 700:
Eval num_timesteps=1978000, episode_reward=817.93 +/- 24.12
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1980000, episode_reward=844.22 +/- 13.98
Episode length: 1000.00 +/- 0.00
New best mean reward!
Eval num_timesteps=1982000, episode_reward=743.50 +/- 50.50
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1984000, episode_reward=766.45 +/- 69.35
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1986000, episode_reward=756.78 +/- 50.64
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1988000, episode_reward=756.68 +/- 17.13
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1990000, episode_reward=790.22 +/- 18.03
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1992000, episode_reward=700.02 +/- 22.26
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1994000, episode_reward=718.14 +/- 89.13
Episode length: 956.40 +/- 87.20
Eval num_timesteps=1996000, episode_reward=776.74 +/- 62.30
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1998000, episode_reward=804.47 +/- 25.14
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=2000000, episode_reward=725.76 +/- 54.01
Episode length: 1000.00 +/- 0.00
Here is my test code:
import pybullet_envs  # noqa: F401 -- registers AntBulletEnv-v0 with gym
from stable_baselines3 import TD3
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecNormalize

env = make_vec_env("AntBulletEnv-v0", n_envs=1)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)
model = TD3.load("model")
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False
env.norm_reward = False
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
And here is its output:
Mean reward = 355.05 +/- 8.26
System specifications:
macOS Big Sur
stable_baselines3==1.0
pybullet==3.09
As can be seen, the test performance is almost half of the evaluation performance during training. Does anyone have an idea why this happens?
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hello, it seems you are normalizing your test env twice.
Anyway, I would recommend you to use the RL Zoo (cf. doc).
It is recommended at first to write your own code to understand the lib, but later on the RL Zoo will save you a lot of debugging time.
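For reference, here is a minimal sketch of what the corrected test script could look like, assuming the same file names as above ("model.zip" and "vec_normalize.pkl"): the raw vectorized env is wrapped only once, by VecNormalize.load, instead of first creating a fresh VecNormalize and then loading the saved statistics on top of it.

import pybullet_envs  # noqa: F401 -- registers AntBulletEnv-v0 with gym
from stable_baselines3 import TD3
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecNormalize

# Build the raw vectorized env and wrap it a single time with the saved statistics.
env = make_vec_env("AntBulletEnv-v0", n_envs=1)
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False      # do not update the running statistics at test time
env.norm_reward = False   # report raw (unnormalized) rewards

model = TD3.load("model.zip", env=env)

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")

With the statistics applied only once, the observations seen at test time should match the scale the policy was trained on, which is what the double wrapping in the original test script breaks.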