[rllib] Persisting Arbitrary Data Between Timesteps
I have multiple game-playing agents hooked up to a model that spits out both their next move and a vector of symbols to ‘communicate’ with their fellow agents. I plan to build out a custom policy that calculates an intrinsic reward based on the interplay between actions taken this timestep and symbols received last timestep.
What I’m struggling with is the right way to persist this bag of communication vectors; while calculating the reward for a given agent I’d need the communication vectors passed around from the last timestep.
I’ve been considering adding the symbol emission to my action space so my environment’s step function can hold all the vectors (possibly in `prev_actions`); alternatively, it seems like one could use callbacks such as `on_episode_start` to hold the required data. I’m not sure what the best practice for this kind of data-passing would be.
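For concreteness, a rough sketch of the action-space idea: the symbol vector rides along in a Dict action space, the env stashes it in `step()`, and feeds it back through the next observation. The class, space, and key names (`CommEnv`, `symbols_received`, etc.) are made up for illustration, and the reset/step signatures follow the older gym-style MultiAgentEnv API, which varies across RLlib versions:

```python
import numpy as np
from gym.spaces import Box, Dict, Discrete
from ray.rllib.env.multi_agent_env import MultiAgentEnv

N_SYMBOLS = 8  # length of the communication vector (made up for the sketch)
N_MOVES = 4    # number of game moves (made up for the sketch)


class CommEnv(MultiAgentEnv):
    """Each action carries a game move plus a symbol vector; each observation
    carries the symbols the other agents emitted on the previous timestep."""

    def __init__(self, config=None):
        self.agents = ["agent_0", "agent_1"]
        self.action_space = Dict({
            "move": Discrete(N_MOVES),
            "symbols": Box(0.0, 1.0, shape=(N_SYMBOLS,), dtype=np.float32),
        })
        self.observation_space = Dict({
            "board": Box(-1.0, 1.0, shape=(16,), dtype=np.float32),
            "symbols_received": Box(0.0, 1.0, shape=(N_SYMBOLS,), dtype=np.float32),
        })
        self.last_symbols = {}

    def reset(self):
        self.last_symbols = {a: np.zeros(N_SYMBOLS, np.float32) for a in self.agents}
        return {a: self._obs(a) for a in self.agents}

    def step(self, action_dict):
        # Stash this step's emissions so the *next* observation (and the
        # reward calculation) can see them.
        self.last_symbols = {a: act["symbols"] for a, act in action_dict.items()}
        obs = {a: self._obs(a) for a in self.agents}
        rewards = {a: 0.0 for a in self.agents}  # plug the actual game reward in here
        dones = {"__all__": False}
        return obs, rewards, dones, {a: {} for a in self.agents}

    def _obs(self, agent):
        # Observation includes the mean of the symbols the *other* agents sent.
        others = [s for a, s in self.last_symbols.items() if a != agent]
        received = (np.mean(others, axis=0).astype(np.float32)
                    if others else np.zeros(N_SYMBOLS, np.float32))
        return {"board": np.zeros(16, np.float32), "symbols_received": received}
```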
Top GitHub Comments
I see, I think that would probably be best recorded as a custom metric. There are a few ways to do it, but the env could return these reward breakdowns in its info dict, and the callback can retrieve them from the rollout batch in `on_postprocess_traj`: https://ray.readthedocs.io/en/latest/rllib-training.html#callbacks-and-custom-metrics
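For later readers, a minimal sketch of that pattern using the class-based callbacks API (the import path and the `intrinsic_reward` info key are assumptions; older RLlib versions take a dict of callback functions in the config instead):

```python
import numpy as np
from ray.rllib.agents.callbacks import DefaultCallbacks  # ray.rllib.algorithms.callbacks in newer Ray


class CommMetricsCallbacks(DefaultCallbacks):
    def on_postprocess_trajectory(self, *, worker, episode, agent_id, policy_id,
                                  policies, postprocessed_batch, original_batches,
                                  **kwargs):
        # The env stuffed its per-step reward breakdown into the info dict;
        # pull it back out of the rollout batch and log it as a custom metric.
        infos = postprocessed_batch.get("infos", [])
        intrinsic = [i["intrinsic_reward"] for i in infos
                     if isinstance(i, dict) and "intrinsic_reward" in i]
        if intrinsic:
            episode.custom_metrics["intrinsic_reward_" + str(agent_id)] = float(np.mean(intrinsic))
```

The callbacks class is then registered with `config["callbacks"] = CommMetricsCallbacks` when building the trainer.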
Can the symbol vector be emitted as part of the action and included in the other agents’ observations on the next timestep? The env would have to do this internally.
For calculating the rewards, it sounds like you can do that in the env as usual if you save the last actions/symbols, or it could also be done in an `on_postprocess_traj` callback, where you have the opportunity to rewrite the entire rollout sequence if needed.
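A rough sketch of that second option, where the callback rewrites the rewards column of the postprocessed batch. The slicing of the symbol vector out of the actions and the bonus formula are placeholders, not the actual reward from the issue:

```python
import numpy as np
from ray.rllib.agents.callbacks import DefaultCallbacks  # ray.rllib.algorithms.callbacks in newer Ray

N_SYMBOLS = 8     # must match the env's symbol vector length (assumption)
COMM_BONUS = 0.1  # weight of the communication term (assumption)


class IntrinsicRewardCallbacks(DefaultCallbacks):
    def on_postprocess_trajectory(self, *, worker, episode, agent_id, policy_id,
                                  policies, postprocessed_batch, original_batches,
                                  **kwargs):
        # With a Dict action space the symbols end up inside the (flattened)
        # action columns; how to slice them back out is env-specific, so this
        # just assumes they are the last N_SYMBOLS entries of each action row.
        actions = np.asarray(postprocessed_batch["actions"])
        emitted = actions[..., -N_SYMBOLS:]           # symbols sent at t
        received = np.roll(emitted, shift=1, axis=0)  # symbols from t-1
        received[0] = 0.0                             # nothing received at t=0
        # Toy intrinsic term: agreement between this step's emission and the
        # previous step's. Symbols from *other* agents' trajectories are
        # available via original_batches if a cross-agent term is needed.
        bonus = COMM_BONUS * np.sum(emitted * received, axis=-1)
        postprocessed_batch["rewards"] = postprocessed_batch["rewards"] + bonus
```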