[Question] Big gap in training vs. test performance

I have the following program where I train a TD3 agent on the “AntBulletEnv-v0” environment.

This is my training code where I evaluate the agent every 10000 steps:

import pybullet_envs  # noqa: F401 -- registers the Bullet envs, including AntBulletEnv-v0

from stable_baselines3 import TD3
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

env = make_vec_env("AntBulletEnv-v0", n_envs=1)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)
model = TD3('MlpPolicy', env)
eval_env = make_vec_env("AntBulletEnv-v0", n_envs=1)
eval_env = VecNormalize(eval_env, norm_obs=True, norm_reward=True, clip_obs=10.)
eval_callback = EvalCallback(eval_env, log_path="./", best_model_save_path="./")
model.learn(total_timesteps=2000000, callback=eval_callback)
model.save("model.zip")
env.save("vec_normalize.pkl")  # save the running normalization statistics

And here are the last few lines of the training output, which show that the evaluation reward stays above 700:

Eval num_timesteps=1978000, episode_reward=817.93 +/- 24.12
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1980000, episode_reward=844.22 +/- 13.98
Episode length: 1000.00 +/- 0.00
New best mean reward!
Eval num_timesteps=1982000, episode_reward=743.50 +/- 50.50
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1984000, episode_reward=766.45 +/- 69.35
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1986000, episode_reward=756.78 +/- 50.64
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1988000, episode_reward=756.68 +/- 17.13
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1990000, episode_reward=790.22 +/- 18.03
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1992000, episode_reward=700.02 +/- 22.26
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1994000, episode_reward=718.14 +/- 89.13
Episode length: 956.40 +/- 87.20
Eval num_timesteps=1996000, episode_reward=776.74 +/- 62.30
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1998000, episode_reward=804.47 +/- 25.14
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=2000000, episode_reward=725.76 +/- 54.01
Episode length: 1000.00 +/- 0.00

Here is my test code:

import pybullet_envs  # noqa: F401 -- registers AntBulletEnv-v0

from stable_baselines3 import TD3
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecNormalize

env = make_vec_env("AntBulletEnv-v0", n_envs=1)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)
model = TD3.load("model")
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False
env.norm_reward = False
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")

And here is its output: Mean reward = 355.05 +/- 8.26

System specifications:

macOS Big Sur
stable_baselines3==1.0
pybullet==3.09

As can be seen, the test performance is less than half of the evaluation performance reported during training. Does anyone have any idea why this happens?
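
One quick sanity check in a situation like this is to count how many VecNormalize layers the test env ends up with; a minimal sketch, relying only on the public venv attribute that SB3's VecEnvWrapper exposes:

from stable_baselines3.common.vec_env import VecNormalize

# Walk down the wrapper chain of the test env built above and
# count the VecNormalize layers it contains.
layer, n_norm = env, 0
while hasattr(layer, "venv"):
    if isinstance(layer, VecNormalize):
        n_norm += 1
    layer = layer.venv
print(f"VecNormalize wrappers: {n_norm}")  # anything above 1 means observations are normalized twice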

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
araffin commented, Apr 25, 2021

Hello, it seems you are normalizing your test env twice.

Anyway, I would recommend using the RL Zoo (cf. the docs).
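
Concretely, VecNormalize.load returns a VecNormalize wrapper itself, so wrapping the test env in VecNormalize beforehand normalizes observations twice. A minimal corrected version of the test code (same file names as above):

import pybullet_envs  # noqa: F401 -- registers AntBulletEnv-v0

from stable_baselines3 import TD3
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecNormalize

env = make_vec_env("AntBulletEnv-v0", n_envs=1)
# This call wraps `env` in VecNormalize and restores the saved statistics,
# so no manual VecNormalize(...) wrapping is needed beforehand.
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False      # freeze the running statistics at test time
env.norm_reward = False   # report raw, unnormalized rewards

model = TD3.load("model")
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")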

0 reactions
araffin commented, Apr 26, 2021

> No, I thought it would be good practice to write my own. Thanks for the pointer, it works now. I’ll go ahead and close the issue.

It is recommended at first to write your own training code to understand the lib, but later on, the RL Zoo will save you a lot of debugging time.
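
For reference, training and evaluating through the RL Zoo is roughly a two-command affair (flag names from rl-baselines3-zoo; treat this as a sketch, not the exact invocation):

python train.py --algo td3 --env AntBulletEnv-v0
python enjoy.py --algo td3 --env AntBulletEnv-v0 --folder logs/ -n 1000

The zoo saves and re-loads the VecNormalize statistics for you, which avoids exactly this class of bug.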
