[rllib] Key set for obs and rewards must be the same

What is your question?

I’m implementing a stochastic game, where two players take turns performing actions to modify the state of the game. In order to make this happen, I use the following logic in step:

def step(self, action_dict):
    # Pseudocode: the helper functions, done, and info are placeholders.
    if action_dict.get('player1') is not None:
        state = update_from_player_1()
        reward = {'player1': reward_for_your_action()}  # reward for the player who acted
        obs = {'player2': new_state()}  # observation for the player who acts next
    if action_dict.get('player2') is not None:
        state = update_from_player_2()
        reward = {'player2': reward_for_your_action()}
        obs = {'player1': new_state()}
    return obs, reward, done, info

This makes the most sense to me since I think of the reward as being an environment’s response to an action (in addition to an updated state), and the player who receives that reward should be the player who acted. Thus, even though the output is for the next player, the reward should be for the player that acted.

Trying to do it this way results in the error: Key set for obs and rewards must be the same. So I would have to modify the reward so that the player who is receiving the next state is also receiving a reward. I can still work with this consistently by storing the reward and applying it on opposite turns (i.e. the reward from player_2’s turn gets applied during player_1’s turn), but this just feels weird. It seems that the reward should be applied with the action, not with the next observation.
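The buffering workaround described above could look roughly like the sketch below. It is a toy environment written without importing RLlib; the class name `TurnBasedGame`, the `pending` dict, and the rule that the reward equals the action value are all assumptions made for illustration. Only the dict-per-agent return shape and the `"__all__"` done key follow RLlib's multi-agent convention.

```python
class TurnBasedGame:
    """Toy two-player turn-based environment (hypothetical, no RLlib import)."""

    PLAYERS = ("player1", "player2")

    def __init__(self, max_turns=4):
        self.max_turns = max_turns
        self.reset()

    def reset(self):
        self.state = 0
        self.turn = 0
        # Reward earned by each player's latest action, not yet reported.
        self.pending = {p: 0.0 for p in self.PLAYERS}
        return {"player1": self.state}

    def step(self, action_dict):
        # Exactly one player acts per step in this sketch.
        (actor, action), = action_dict.items()
        other = "player2" if actor == "player1" else "player1"

        self.state += action
        self.turn += 1
        # Assumed toy rule: the reward equals the action value.
        # Stash it until the actor is observed again.
        self.pending[actor] += float(action)

        # Only the next player gets an observation, so only the next
        # player gets a reward entry: the two key sets match.
        obs = {other: self.state}
        rewards = {other: self.pending[other]}
        self.pending[other] = 0.0
        dones = {"__all__": self.turn >= self.max_turns}
        return obs, rewards, dones, {}
```

Each player then sees the reward for its own previous action alongside its next observation. The sketch does not deliver the reward for the very last action of the game; that terminal case needs separate handling.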

Am I thinking about this incorrectly? What was the reason behind designing it this way?

Ray version and other system information (Python version, TensorFlow version, OS): Ray 0.8.1, Python 3.7, TF 2.1, macOS 10.14

Top GitHub Comments

ericl commented, Feb 24, 2020

I think you can always return obs (a noop obs for the last step) and rewards as long as done hasn’t been sent yet to the agent.

On Mon, Feb 24, 2020, 6:05 AM Corey Lowman wrote:

What happens when the game ends on one player’s turn and both players need rewards? (I’m thinking about this from a board game standpoint, like chess, where the game might end and both players might need to get rewards.) Ideally the reward would be retroactively applied to the non-turn player?
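One way to read the suggestion above is that the step which ends the game returns an entry for every player that has not yet been sent done, so both can receive a final reward. The helper below is a minimal sketch of that idea, not RLlib code: the function name `final_step_outputs` and its arguments are hypothetical, and the last state is reused as the no-op observation. The `"__all__"` key is RLlib's convention for ending the whole episode.

```python
def final_step_outputs(last_state, final_rewards):
    """Build the return values for the step that ends the game.

    Every agent listed in final_rewards gets a placeholder observation
    (it will never act on it) together with its final reward, so the
    obs and reward key sets stay identical.
    """
    obs = {agent: last_state for agent in final_rewards}
    rewards = dict(final_rewards)
    dones = {agent: True for agent in final_rewards}
    dones["__all__"] = True  # end the episode for everyone
    infos = {agent: {} for agent in final_rewards}
    return obs, rewards, dones, infos


# Example: player1 delivers checkmate, both players are rewarded.
obs, rewards, dones, infos = final_step_outputs(
    last_state="final-board",
    final_rewards={"player1": 1.0, "player2": -1.0},
)
```

An environment's `step` could return these values directly on the terminal turn instead of an observation for the next player only.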

rusu24edward commented, Oct 6, 2020

Just for the record, I don’t think automatically buffering the reward is the right thing to do because it violates some basic principles of MDPs (I explain a bit here). This may be something that a user wants to try out, but it should not be the default behavior.
