[Question] Big gap in training vs. test performance

I have the following program where I train a TD3 agent on the “AntBulletEnv-v0” environment.

This is my training code where I evaluate the agent every 10000 steps:

import pybullet_envs  # noqa: F401 -- registers the Bullet envs, including AntBulletEnv-v0

from stable_baselines3 import TD3
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

env = make_vec_env("AntBulletEnv-v0", n_envs=1)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)
model = TD3('MlpPolicy', env)
eval_env = make_vec_env("AntBulletEnv-v0", n_envs=1)
eval_env = VecNormalize(eval_env, norm_obs=True, norm_reward=True, clip_obs=10.)
eval_callback = EvalCallback(eval_env, log_path="./", best_model_save_path="./")
model.learn(total_timesteps=2000000, callback=eval_callback)
model.save("model.zip")
env.save("vec_normalize.pkl")  # save the running normalization statistics

And here are the last few lines of the training output, which show that the evaluation reward stays above 700:

Eval num_timesteps=1978000, episode_reward=817.93 +/- 24.12
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1980000, episode_reward=844.22 +/- 13.98
Episode length: 1000.00 +/- 0.00
New best mean reward!
Eval num_timesteps=1982000, episode_reward=743.50 +/- 50.50
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1984000, episode_reward=766.45 +/- 69.35
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1986000, episode_reward=756.78 +/- 50.64
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1988000, episode_reward=756.68 +/- 17.13
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1990000, episode_reward=790.22 +/- 18.03
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1992000, episode_reward=700.02 +/- 22.26
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1994000, episode_reward=718.14 +/- 89.13
Episode length: 956.40 +/- 87.20
Eval num_timesteps=1996000, episode_reward=776.74 +/- 62.30
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=1998000, episode_reward=804.47 +/- 25.14
Episode length: 1000.00 +/- 0.00
Eval num_timesteps=2000000, episode_reward=725.76 +/- 54.01
Episode length: 1000.00 +/- 0.00

Here is my test code:

import pybullet_envs  # noqa: F401 -- registers AntBulletEnv-v0

from stable_baselines3 import TD3
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecNormalize

env = make_vec_env("AntBulletEnv-v0", n_envs=1)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)
model = TD3.load("model")
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False
env.norm_reward = False
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")

And here is its output: Mean reward = 355.05 +/- 8.26

System specifications:

macOS Big Sur
stable_baselines3==1.0
pybullet==3.09

As can be seen, the test performance is less than half of the evaluation performance reported during training. Does anyone have any idea why this happens?
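
One quick sanity check in a situation like this is to count how many VecNormalize layers the test env ends up with; a minimal sketch, relying only on the public venv attribute that SB3's VecEnvWrapper exposes:

from stable_baselines3.common.vec_env import VecNormalize

# Walk down the wrapper chain of the test env built above and
# count the VecNormalize layers it contains.
layer, n_norm = env, 0
while hasattr(layer, "venv"):
    if isinstance(layer, VecNormalize):
        n_norm += 1
    layer = layer.venv
print(f"VecNormalize wrappers: {n_norm}")  # anything above 1 means observations are normalized twice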

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
araffin commented, Apr 25, 2021

Hello, it seems you are normalizing your test env twice.

Anyway, I would recommend using the RL Zoo (cf. the docs).
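
Concretely, VecNormalize.load returns a VecNormalize wrapper itself, so wrapping the test env in VecNormalize beforehand normalizes observations twice. A minimal corrected version of the test code (same file names as above):

import pybullet_envs  # noqa: F401 -- registers AntBulletEnv-v0

from stable_baselines3 import TD3
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecNormalize

env = make_vec_env("AntBulletEnv-v0", n_envs=1)
# This call wraps `env` in VecNormalize and restores the saved statistics,
# so no manual VecNormalize(...) wrapping is needed beforehand.
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False      # freeze the running statistics at test time
env.norm_reward = False   # report raw, unnormalized rewards

model = TD3.load("model")
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")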

0 reactions
araffin commented, Apr 26, 2021

> No, I thought it would be good practice to write my own. Thanks for the pointer, it works now. I’ll go ahead and close the issue.

It is recommended at first to write your own training code to understand the lib, but later on, the RL Zoo will save you a lot of debugging time.
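
For reference, training and evaluating through the RL Zoo is roughly a two-command affair (flag names from rl-baselines3-zoo; treat this as a sketch, not the exact invocation):

python train.py --algo td3 --env AntBulletEnv-v0
python enjoy.py --algo td3 --env AntBulletEnv-v0 --folder logs/ -n 1000

The zoo saves and re-loads the VecNormalize statistics for you, which avoids exactly this class of bug.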
