[question] How to use a trained agent in a production setting using a custom environment?
For example, if I created a custom environment for tic-tac-toe and trained an agent on it, how do I actually use the trained agent in a live setting? My current workflow is:
- Load the custom environment with the current observation: `obs = env.reset()`
- Get an action from `model.predict(obs)`, optionally including `state=state` for recurrent policies
- Manually perform the action, observe the next step and save the latest state (so there would not be an `env.step()`, since there are no further observations)
- Create a new environment with the latest observation and repeat this process
Is there a better way to actually use the agent to perform a task without continuously redefining a new environment?
Optionally, is there a way to incorporate online learning into this process, so that I can calculate a reward and use it to train the agent for additional steps based on live feedback?
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I do not quite understand the “create a new environment”. If your environment follows Gym API, you can do the following (works for recurrent policies too):
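A minimal sketch of such an inference loop, assuming a stable-baselines model and a Gym-style environment; the algorithm class, the saved-model name, and the `deterministic` flag are assumptions for illustration:

```python
from stable_baselines import PPO2  # or whichever algorithm was used for training

env = TicTacToeEnv()                   # hypothetical custom Gym env from your project
model = PPO2.load("tictactoe_agent")   # hypothetical saved model name

obs = env.reset()
state = None          # hidden state, only used by recurrent policies
done = False
while not done:
    # predict() returns (action, next_hidden_state); state stays None for
    # non-recurrent policies
    action, state = model.predict(obs, state=state, deterministic=True)
    obs, reward, done, info = env.step(action)
```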
Online learning: Also discussed in #466, stable-baselines does not support individual update steps and there are no plans on including it. However, in your case you could try using the `learn()` function to achieve this, like so:
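A rough sketch of that idea, assuming the live game is already wrapped as a Gym env; the environment class, file name and chunk size below are placeholders:

```python
from stable_baselines import PPO2

env = TicTacToeEnv()                            # hypothetical live Gym env
model = PPO2.load("tictactoe_agent", env=env)   # continue from the trained weights

while True:
    # Each call collects a small batch of live experience and runs updates on it.
    model.learn(total_timesteps=1024, reset_num_timesteps=False)
    model.save("tictactoe_agent")
```

Each `learn()` call still performs full rollouts internally, so this gives coarse-grained online learning rather than per-step updates.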
In `step`, the environment executes the agent’s action and then asks the human (or other) player for their action, and executes that. This way the other players are “part of the environment” on which the agent learns. This self-play environment uses the same approach; note how `player2` actions are done in the `step` and `reset` functions.
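A sketch of what that environment structure could look like; `TicTacToeEnv`, the board encoding and the opponent-query helper are all hypothetical placeholders for your own code:

```python
import gym
import numpy as np


class TicTacToeEnv(gym.Env):
    """Sketch of a tic-tac-toe env where the opponent is part of the environment."""

    observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(9,), dtype=np.float32)
    action_space = gym.spaces.Discrete(9)

    def reset(self):
        self.board = np.zeros(9, dtype=np.float32)
        # If the opponent moves first, apply their move here, inside reset().
        return self.board.copy()

    def step(self, action):
        self.board[action] = 1.0              # the agent's move
        reward, done = self._score()          # placeholder win/draw check
        if not done:
            opponent_move = self._get_opponent_move()   # ask the human/other player
            self.board[opponent_move] = -1.0
            reward, done = self._score()
        return self.board.copy(), reward, done, {}

    def _get_opponent_move(self):
        # Placeholder: read the live opponent's move (e.g. from a UI, API or queue).
        raise NotImplementedError

    def _score(self):
        # Placeholder: return (reward, done) after checking for a win or draw.
        return 0.0, False
```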
Ah alright, so the final environment is different from tic-tac-toe. I cannot give you the right answers here on what would work best, as this is the “research” part of RL: you have to try things out yourself and see what works best, unless you find references for this (I do not know of any).
On resetting: it depends on what your environment is like. I recommend reading up on “trajectories” and “terminal states”, e.g. in the Spinning Up tutorials.