
[Question] Thoughts on ideal training environment to save best agent during training when using multiple envs?

See original GitHub issue

Question

I’ve made use of the following snippet:

    from stable_baselines3 import A2C
    from stable_baselines3.common.vec_env import SubprocVecEnv

    num_cpu = 6  # Number of parallel environment processes to use
    # Create the vectorized environment
    # (make_env(env, i) is assumed to return a callable that builds a copy of the env for worker i)
    env = WilKin_Stock_Trading_Environment(df, lookback_window_size=lookback_window_size)
    env = SubprocVecEnv([make_env(env, i) for i in range(num_cpu)])

    model = A2C('MlpPolicy', env, verbose=1, gamma=0.91)

With this, it’s my understanding that there are 6 agents being trained at once, split up among the total_timesteps defined during training, which results in 6 different test results.

How does one combine these results into 1 agent for testing? Is there a way to pick the best agent?

Or, how could one make use of a callback so that the best “agent” gets checkpointed as training goes along?

Was thinking of using this snippet for the latter:

    from stable_baselines3.common.callbacks import CallbackList, CheckpointCallback, EvalCallback

    # eval_env: a separate instance of the environment, used only for evaluation
    checkpoint_callback = CheckpointCallback(save_freq=1000, save_path='./logs/')
    eval_callback = EvalCallback(eval_env, best_model_save_path='./logs/best_model',
                                 log_path='./logs/results', eval_freq=500)
    callback = CallbackList([checkpoint_callback, eval_callback])

Sort of a newbie with SB3. Thanks!
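
For reference, a snippet like the one above only builds the callback list; it takes effect once it is passed to learn(). A minimal, hypothetical sketch (total_timesteps is an arbitrary example value):

    # Note: with a vectorized env, save_freq/eval_freq are counted per call to
    # env.step(), i.e. per num_cpu environment steps.
    model.learn(total_timesteps=100_000, callback=callback)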

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (2 by maintainers)

Top GitHub Comments

3 reactions
araffin commented, Jan 2, 2022

See this example on how to use callbacks to save the best model.

Well, the EvalCallback is the recommended way to go (it is the default in the RL Zoo); the example in the docs is just there to demonstrate how to use callbacks.
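
For illustration, a minimal sketch (not from the original comments) of what the EvalCallback workflow looks like end to end; the paths mirror the snippet in the question, and eval_env is assumed to be a separate evaluation environment:

    from stable_baselines3 import A2C
    from stable_baselines3.common.evaluation import evaluate_policy

    # During training, EvalCallback writes the best-scoring checkpoint to
    # <best_model_save_path>/best_model.zip. Load it afterwards for testing.
    best_model = A2C.load('./logs/best_model/best_model')

    # Test the single best agent on a held-out evaluation environment.
    mean_reward, std_reward = evaluate_policy(best_model, eval_env, n_eval_episodes=10)
    print(f'best model: {mean_reward:.2f} +/- {std_reward:.2f}')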

2 reactions
Miffyli commented, Jan 2, 2022

No, there is only one agent being trained, using six copies of the environment. This can speed up training (faster stepping of the environment) and also stabilizes the training of A2C/PPO, because each update uses a larger number of samples.
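
For illustration, a minimal sketch of that pattern (not from the original comment): SubprocVecEnv takes factory functions so that each worker process builds its own copy of the environment, while a single A2C model consumes the samples from all of them. WilKin_Stock_Trading_Environment, df and lookback_window_size are taken from the question:

    from stable_baselines3 import A2C
    from stable_baselines3.common.vec_env import SubprocVecEnv

    def make_env(rank):
        # Factory: each worker process constructs its own, independent env copy.
        # (rank could be used to seed each copy differently.)
        def _init():
            return WilKin_Stock_Trading_Environment(df, lookback_window_size=lookback_window_size)
        return _init

    if __name__ == '__main__':  # required by SubprocVecEnv when subprocesses are spawned
        num_cpu = 6
        vec_env = SubprocVecEnv([make_env(i) for i in range(num_cpu)])
        # One agent; its rollouts are collected from six env copies in parallel.
        model = A2C('MlpPolicy', vec_env, verbose=1, gamma=0.91)
        model.learn(total_timesteps=100_000)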

See this example on how to use callbacks to save the best model.


The following is an automated answer:

As you seem to be trying to apply RL to stock trading, I must also warn you about it. Here is a recommendation from a former professional trader:

Retail trading, retail trading with ML, and retail trading with RL are bad ideas for almost everyone to get involved with.

  • I was a quant trader at a major hedge fund for several years. I am now retired.
  • On average, traders lose money. On average, retail traders especially lose money. An excellent approximation of trading, and especially of retail trading, is ‘gambling’.
  • There is a lot more bad advice on trading out there than good advice. It is extraordinarily difficult to demonstrate that any particular advice is some of the rare good advice.
  • As such, it’s reasonable to treat all commentary on retail trading as an epsilon away from snake oil salesmanship. Sometimes that’ll be wrong, but it’s a strong rule of thumb.
  • I feel a sense of responsibility to the less world-wise members of this community - which includes plenty of high schoolers - and so I find myself unable to let a conversation about retail trading occur without interceding and warning that it’s very likely snake oil.
  • I find repeatedly making these warnings and the subsequent fights to be exhausting.