Running learn function multiple times vs running it once
Question
I am using A2C. I have found that `model.learn(100000)` works far better than `for i in range(700): model.learn(1000)`. But I think I have to use the `for i in range(700): model.learn(1000)` version in my situation.
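One difference between the two calling patterns, as far as I understand it, is that each call to `learn()` by default resets the internal timestep counter (`reset_num_timesteps=True`), which among other things restarts anything derived from training progress, such as schedules and logging. If the loop version is needed, passing `reset_num_timesteps=False` keeps the counter accumulating across calls. A minimal sketch of that pattern, with `CartPole-v1` standing in for the real environment:

```python
from stable_baselines3 import A2C

# Illustrative only: CartPole-v1 stands in for the custom tic-tac-toe env.
model = A2C("MlpPolicy", "CartPole-v1", verbose=0)

# Same chunked-training idea as in the question, so that other code
# (e.g. swapping opponents) can run between chunks.
for i in range(700):
    # reset_num_timesteps=False keeps the global timestep counter (and
    # anything derived from it) accumulating across calls instead of
    # restarting at 0 each time.
    model.learn(total_timesteps=1_000, reset_num_timesteps=False)
```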
Additional context
I am trying to train a tic-tac-toe agent. This is my core code:
```python
from stable_baselines3 import A2C

# Env1/Env2 are my custom tic-tac-toe environments for the first and second player.
env1 = Env1()
env2 = Env2()
model1 = A2C(
    policy='MlpPolicy',
    env=env1,
    verbose=0,
    gamma=1.0,
    policy_kwargs=dict(
        net_arch=[dict(pi=[18, 18, 18], vf=[18, 18, 18])]
    ),
    # device=th.device('cuda')
)
model2 = A2C(
    policy='MlpPolicy',
    env=env2,
    verbose=0,
    gamma=1.0,
    policy_kwargs=dict(
        net_arch=[dict(pi=[18, 18, 18], vf=[18, 18, 18])]
    ),
    # device=th.device('cuda')
)
# Each environment uses the other model to play the opponent's moves.
env1.agent = model2
env2.agent = model1
batch = 0
while True:
    batch += 1
    for model in [model1, model2]:
        model.learn(total_timesteps=1000)
    if batch % 10 == 0:
        model1.save(path='E:\\data\\' + str(batch) + '\\model1')
        model2.save(path='E:\\data\\' + str(batch) + '\\model2')
```
Note that model1 always takes the first, third, … turns, and model2 always takes the second, fourth, … turns. In env1, model2 predicts the opponent's action (non-deterministic); in env2, model1 predicts the opponent's action (also non-deterministic). This means the two models never learn at the same time (which seems reasonable, because if both explored at the same time it would mess things up).
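For reference, the opponent move inside env1/env2 is presumably obtained with the standard `predict()` call, where `deterministic=False` samples from the policy distribution rather than always taking the most likely action. A minimal illustration with a placeholder `CartPole-v1` model (the tic-tac-toe envs themselves are not shown here):

```python
import numpy as np
from stable_baselines3 import A2C

# Placeholder model; in the setup above this would be the frozen opponent
# (model2 inside env1, or model1 inside env2).
opponent = A2C("MlpPolicy", "CartPole-v1", verbose=0)

obs = np.zeros(4, dtype=np.float32)  # placeholder CartPole observation
# deterministic=False samples an action from the policy distribution, so the
# opponent can still play varied moves; deterministic=True would always pick
# the most likely action.
action, _states = opponent.predict(obs, deterministic=False)
print(action)
```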
When batch equals 700, at this state:

```
x o o
o x .
x . .
```

(pic1)

the agent will still do silly things and make the following move:

```
x o .
o x .
x x .
```
I am a little confused, because after training for 80 batches, when I let the two agents play against each other, they started to play wisely, always ending in the following end state:

```
x o o
o x x
x x o
```

After that, I also found that after training for 700 batches, when I let the two agents play against each other again, they still performed exactly the same as after 80 batches. I think the model should be able to keep exploring, so if training lasts long enough, model2 should eventually find that it can defeat model1 by turning the state into pic1 (actually, it is possible to turn the state into pic1 or other states symmetric with pic1). So I set up another model3 whose only purpose is to defeat the stupid model1 obtained after 700 batches of training. This is the core code of model3:
```python
import torch as th
from stable_baselines3 import A2C

# Env2 takes the frozen model1 as the opponent to exploit.
env = Env2(model1)
model3 = A2C(
    policy='MlpPolicy',
    env=env,
    verbose=0,
    gamma=1.0,
    policy_kwargs=dict(
        net_arch=[dict(pi=[18, 18, 18], vf=[18, 18, 18])]
    ),
    device=th.device('cuda')
)
model3.learn(100000)
model3.save('E:\\saved_fucker\\1')
```
I found that model3 can actually exploit the flaw in model1: when I evaluate it, the mean reward (winning gives +1, losing gives -1) is 0.933. Note that with `deterministic=True`, the mean reward is 1.0. The following is the evaluation code:
```python
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

env = Env2(model1)
model = A2C.load('E:\\saved_fucker\\1.zip', env=env)
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=30, deterministic=False)
print(mean_reward, std_reward)
```
I think I have tried my best to explain everything. Sorry for not filling in the issue template, and sorry for my poor English.
I have read the docs and have read the A2C page several times.
Top GitHub Comments
@araffin so sorry. I have now filled in the template and explained everything.
@araffin I spent some time reading the source code of slimevolleygym and its PPO implementation. I noticed that this project avoids my problem by saving and loading models: it keeps learning for 1e9 timesteps, and to create an opponent for self-play it saves the model whenever `num_timesteps` becomes a multiple of 10000 via a custom callback; that saved model is then the opponent for the next 10000 timesteps. I think I have found my solution, and it is different from the one in issue #597. By the way, I believe the stable-baselines3 PPO implementation actually resets the learning parameters, so continued learning doesn't work.
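Here is a minimal sketch of that callback idea using the SB3 callback API. The class name, snapshot interval, save path, and the `CartPole-v1` placeholder environment are just illustrative assumptions, not the actual slimevolleygym code; the environment would still be responsible for loading the latest snapshot and using it as the opponent.

```python
import os
from stable_baselines3 import A2C
from stable_baselines3.common.callbacks import BaseCallback


class SnapshotOpponentCallback(BaseCallback):
    """Save a snapshot of the learning model every `save_freq` timesteps.

    The environment can then load the latest snapshot and use it as the
    frozen opponent until the next snapshot is written.
    """

    def __init__(self, save_freq: int = 10_000, save_path: str = "snapshots", verbose: int = 0):
        super().__init__(verbose)
        self.save_freq = save_freq
        self.save_path = save_path

    def _on_step(self) -> bool:
        if self.num_timesteps % self.save_freq == 0:
            os.makedirs(self.save_path, exist_ok=True)
            self.model.save(os.path.join(self.save_path, "latest_opponent"))
        return True  # returning False would stop training early


# Illustrative usage with a placeholder env; the real setup would pass the
# tic-tac-toe env and a much larger total_timesteps budget (e.g. 1e9).
model = A2C("MlpPolicy", "CartPole-v1", verbose=0)
model.learn(total_timesteps=50_000, callback=SnapshotOpponentCallback())
```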