Running learn function multiple times vs running it once
Question
I am using A2C. I have found that `model.learn(100000)` works far better than `for i in range(700): model.learn(1000)`. But I think I have to use the `for i in range(700): model.learn(1000)` version in my situation.
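One difference between the two calling patterns, as far as I understand it, is that each call to `learn()` by default resets the internal timestep counter (`reset_num_timesteps=True`), which among other things restarts anything derived from training progress, such as schedules and logging. If the loop version is needed, passing `reset_num_timesteps=False` keeps the counter accumulating across calls. A minimal sketch of that pattern, with `CartPole-v1` standing in for the real environment:

```python
from stable_baselines3 import A2C

# Illustrative only: CartPole-v1 stands in for the custom tic-tac-toe env.
model = A2C("MlpPolicy", "CartPole-v1", verbose=0)

# Same chunked-training idea as in the question, so that other code
# (e.g. swapping opponents) can run between chunks.
for i in range(700):
    # reset_num_timesteps=False keeps the global timestep counter (and
    # anything derived from it) accumulating across calls instead of
    # restarting at 0 each time.
    model.learn(total_timesteps=1_000, reset_num_timesteps=False)
```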
Additional context
I am trying to train a tic-tac-toe agent. This is my core code:
```python
from stable_baselines3 import A2C

# Env1/Env2 are my custom tic-tac-toe environments for the first and second player.
env1 = Env1()
env2 = Env2()
model1 = A2C(
    policy='MlpPolicy',
    env=env1,
    verbose=0,
    gamma=1.0,
    policy_kwargs=dict(
        net_arch=[dict(pi=[18, 18, 18], vf=[18, 18, 18])]
    ),
    # device=th.device('cuda')
)
model2 = A2C(
    policy='MlpPolicy',
    env=env2,
    verbose=0,
    gamma=1.0,
    policy_kwargs=dict(
        net_arch=[dict(pi=[18, 18, 18], vf=[18, 18, 18])]
    ),
    # device=th.device('cuda')
)
# Each environment uses the other model to play the opponent's moves.
env1.agent = model2
env2.agent = model1
batch = 0
while True:
    batch += 1
    for model in [model1, model2]:
        model.learn(total_timesteps=1000)
    if batch % 10 == 0:
        model1.save(path='E:\\data\\' + str(batch) + '\\model1')
        model2.save(path='E:\\data\\' + str(batch) + '\\model2')
```
Note that model1 always takes the first, third, … turns, and model2 always takes the second, fourth, … turns. In env1, model2 predicts the opponent's action (non-deterministic); in env2, model1 predicts the opponent's action (also non-deterministic). This means the two models never learn at the same time (which seems reasonable, because if both explored at the same time it would mess things up).
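For reference, the opponent move inside env1/env2 is presumably obtained with the standard `predict()` call, where `deterministic=False` samples from the policy distribution rather than always taking the most likely action. A minimal illustration with a placeholder `CartPole-v1` model (the tic-tac-toe envs themselves are not shown here):

```python
import numpy as np
from stable_baselines3 import A2C

# Placeholder model; in the setup above this would be the frozen opponent
# (model2 inside env1, or model1 inside env2).
opponent = A2C("MlpPolicy", "CartPole-v1", verbose=0)

obs = np.zeros(4, dtype=np.float32)  # placeholder CartPole observation
# deterministic=False samples an action from the policy distribution, so the
# opponent can still play varied moves; deterministic=True would always pick
# the most likely action.
action, _states = opponent.predict(obs, deterministic=False)
print(action)
```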
When batch equals 700, at this state:

```
x o o
o x .
x . .
```

(pic1)

the agent will still do silly things and make the following move:

```
x o .
o x .
x x .
```
I am a little confused, because after training for 80 batches, when I let the two agents play against each other, they started to play wisely, always ending in the following end state:

```
x o o
o x x
x x o
```

After that, I also found that after training for 700 batches, when I let the two agents play against each other again, they still performed exactly the same as after 80 batches. I think the model should be able to keep exploring, so if training lasts long enough, model2 should eventually find that it can defeat model1 by turning the state into pic1 (actually, it is possible to turn the state into pic1 or other states symmetric with pic1). So I set up another model3 whose only purpose is to defeat the stupid model1 obtained after 700 batches of training. This is the core code of model3:
```python
import torch as th
from stable_baselines3 import A2C

# Env2 takes the frozen model1 as the opponent to exploit.
env = Env2(model1)
model3 = A2C(
    policy='MlpPolicy',
    env=env,
    verbose=0,
    gamma=1.0,
    policy_kwargs=dict(
        net_arch=[dict(pi=[18, 18, 18], vf=[18, 18, 18])]
    ),
    device=th.device('cuda')
)
model3.learn(100000)
model3.save('E:\\saved_fucker\\1')
```
I found that model3 can actually exploit the flaw in model1: when I evaluate it, the mean reward (winning gives +1, losing gives -1) is 0.933. Note that with `deterministic=True`, the mean reward is 1.0. The following is the evaluation code:
```python
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

env = Env2(model1)
model = A2C.load('E:\\saved_fucker\\1.zip', env=env)
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=30, deterministic=False)
print(mean_reward, std_reward)
```
I think I have tried my best to explain everything. Sorry for not filling in the issue template, and sorry for my poor English.
I have read the docs and have read the A2C page several times.
Top GitHub Comments
@araffin so sorry. I have now filled in the template and explained everything.
@araffin I spent some time reading the source code of slimevolleygym and its PPO implementation. I noticed that this project avoids my problem by saving and loading models: it keeps learning for 1e9 timesteps, and to create an opponent for self-play it saves the model whenever `num_timesteps` becomes a multiple of 10000 via a custom callback; that saved model is then the opponent for the next 10000 timesteps. I think I have found my solution, and it is different from the one in issue #597. By the way, I believe the stable-baselines3 PPO implementation actually resets the learning parameters, so continued learning doesn't work.
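Here is a minimal sketch of that callback idea using the SB3 callback API. The class name, snapshot interval, save path, and the `CartPole-v1` placeholder environment are just illustrative assumptions, not the actual slimevolleygym code; the environment would still be responsible for loading the latest snapshot and using it as the opponent.

```python
import os
from stable_baselines3 import A2C
from stable_baselines3.common.callbacks import BaseCallback


class SnapshotOpponentCallback(BaseCallback):
    """Save a snapshot of the learning model every `save_freq` timesteps.

    The environment can then load the latest snapshot and use it as the
    frozen opponent until the next snapshot is written.
    """

    def __init__(self, save_freq: int = 10_000, save_path: str = "snapshots", verbose: int = 0):
        super().__init__(verbose)
        self.save_freq = save_freq
        self.save_path = save_path

    def _on_step(self) -> bool:
        if self.num_timesteps % self.save_freq == 0:
            os.makedirs(self.save_path, exist_ok=True)
            self.model.save(os.path.join(self.save_path, "latest_opponent"))
        return True  # returning False would stop training early


# Illustrative usage with a placeholder env; the real setup would pass the
# tic-tac-toe env and a much larger total_timesteps budget (e.g. 1e9).
model = A2C("MlpPolicy", "CartPole-v1", verbose=0)
model.learn(total_timesteps=50_000, callback=SnapshotOpponentCallback())
```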