[rllib] Key set for obs and rewards must be the same
What is your question?
I’m implementing a stochastic game in which two players take turns performing actions that modify the state of the game. To make this happen, I use the following logic in step:
    def step(self, action_dict):
        # Only the player whose turn it is has an entry in action_dict.
        if action_dict.get(player_1) is not None:
            state = update_from_player_1()
            reward = {'player1': reward_for_your_action()}  # reward goes to the actor
            obs = {'player2': new_state()}                  # obs goes to the next player
        if action_dict.get(player_2) is not None:
            state = update_from_player_2()
            reward = {'player2': reward_for_your_action()}
            obs = {'player1': new_state()}
        return obs, reward, done, info
This makes the most sense to me since I think of the reward as being an environment’s response to an action (in addition to an updated state), and the player who receives that reward should be the player who acted. Thus, even though the output is for the next player, the reward should be for the player that acted.
Trying to do it this way results in the error: Key set for obs and rewards must be the same. So I would have to modify the reward so that the player who is receiving the next state is also receiving a reward. I can still work with this consistently by storing the reward and applying it on opposite turns (i.e. the reward from player_2’s turn gets applied during player_1’s turn), but this just feels weird. It seems that the reward should be applied with the action, not with the next observation.
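For reference, a minimal sketch of the reward-buffering workaround described above. It uses plain Python with RLlib-style per-agent dicts and does not import RLlib. The class name TurnBasedGame, the max_turns parameter, and the placeholder state and reward updates are all assumptions made for illustration, not part of the original environment. Each step stores the acting player's reward and reports it the next time that player receives an observation, so obs and rewards always share the same keys. On the final step both players get an observation, so no stored reward is lost.

```python
class TurnBasedGame:
    """Hypothetical two-player, turn-based env (illustration only)."""

    PLAYERS = ("player1", "player2")

    def __init__(self, max_turns=4):
        self.max_turns = max_turns
        self.reset()

    def reset(self):
        self.turn = 0
        self.state = 0
        # Rewards earned by each player that have not been reported yet.
        self.pending = {p: 0.0 for p in self.PLAYERS}
        # player1 moves first, so only player1 gets an observation.
        return {"player1": self.state}

    def step(self, action_dict):
        # Exactly one player acts per step.
        [(actor, action)] = action_dict.items()
        other = "player2" if actor == "player1" else "player1"

        self.state += action                  # placeholder state update
        self.turn += 1
        self.pending[actor] += float(action)  # placeholder reward, stored for later

        game_over = self.turn >= self.max_turns
        # Normally only the next player is observed; on the last step both
        # players are, so that every stored reward is delivered.
        receivers = self.PLAYERS if game_over else (other,)

        obs = {p: self.state for p in receivers}
        rewards = {p: self.pending[p] for p in receivers}
        for p in receivers:
            self.pending[p] = 0.0

        dones = {"__all__": game_over}
        return obs, rewards, dones, {}
```

With max_turns=3, player1's reward for its first action shows up in the return value of the second step (player2's turn), and the third step reports the remaining rewards for both players at once.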
Am I thinking about this incorrectly? What was the reason behind designing it this way?
Ray version and other system information (Python version, TensorFlow version, OS): Ray 0.8.1 Python 3.7 TF 2.1 Mac 10.14
Top GitHub Comments
I think you can always return obs (a noop obs for the last step) and rewards as long as done hasn’t been sent yet to the agent.
On Mon, Feb 24, 2020, 6:05 AM Corey Lowman notifications@github.com wrote:
Just for the record, I don’t think automatically buffering the reward is the right thing to do because it violates some basic principles of MDPs (I explain a bit here). This may be something that a user wants to try out, but it should not be the default behavior.