[rllib] QMIX doesn't learn anything

See original GitHub issue

I’ve already posted this question on Stack Overflow, but since I didn’t get an answer there I’m reposting it here (https://stackoverflow.com/questions/61523164/ray-rllib-qmix-doesnt-learn-anything).

I wanted to try out the QMIX implementation of the Ray/RLlib library, but there must be something wrong with how I’m using it, because it doesn’t seem to learn anything. Since I’m new to Ray/RLlib, I started with the “TwoStepGame” example the library provides in its GitHub repo (https://github.com/ray-project/ray/blob/master/rllib/examples/twostep_game.py), trying to understand how to use it. Since this example was a little too complex for a start, I adjusted it into an example that is as simple as possible. Problem: QMIX doesn’t seem to learn anything; the resulting reward pretty much matches the expected value of a random policy.

Let me explain the idea of my very simple experiment. We have 2 agents. Each agent can take 3 actions (Discrete(3)). If an agent takes action 0, it gets a reward of 0.5; otherwise it gets 0. So this should be a very simple task, since the best policy is just to always take action 0.

Here is my implementation:

from gym.spaces import Tuple, MultiDiscrete, Dict, Discrete
import numpy as np

import ray
from ray import tune
from ray.tune import register_env, grid_search
from ray.rllib.env.multi_agent_env import MultiAgentEnv
from ray.rllib.agents.qmix.qmix_policy import ENV_STATE


class TwoStepGame(MultiAgentEnv):
    action_space = Discrete(3)

    def __init__(self, env_config):
        self.counter = 0

    def reset(self):
        return {0: {'obs': np.array([0]), 'state': np.array([0])},
                1: {'obs': np.array([0]), 'state': np.array([0])}}

    def step(self, action_dict):
        self.counter += 1
        move1 = action_dict[0]
        move2 = action_dict[1]
        reward_1 = 0
        reward_2 = 0
        if move1 == 0:
            reward_1 = 0.5
        if move2 == 0:
            reward_2 = 0.5

        obs = {0: {'obs': np.array([0]), 'state': np.array([0])},
               1: {'obs': np.array([0]), 'state': np.array([0])}}
        done = False
        if self.counter > 100:
            self.counter = 0
            done = True

        return obs, {0: reward_1, 1: reward_2}, {"__all__": done}, {}


if __name__ == "__main__":

    grouping = {"group_1": [0, 1]}

    obs_space = Tuple([
        Dict({
            "obs": MultiDiscrete([2]),
            ENV_STATE: MultiDiscrete([3])
        }),
        Dict({
            "obs": MultiDiscrete([2]),
            ENV_STATE: MultiDiscrete([3])
        }),
    ])

    act_space = Tuple([
        TwoStepGame.action_space,
        TwoStepGame.action_space,
    ])

    register_env("grouped_twostep",
        lambda config: TwoStepGame(config).with_agent_groups(
            grouping, obs_space=obs_space, act_space=act_space))

    config = {
        "mixer": grid_search(["qmix"]),
        "env_config": {
            "separate_state_space": True,
            "one_hot_state_encoding": True
        },
    }

    ray.init(num_cpus=1)
    tune.run(
        "QMIX",
        stop={
            "timesteps_total": 100000,
        },
        config=dict(config, **{
            "env": "grouped_twostep",
        }),
    )

And here is the output when I run it for 100,000 timesteps:

+----------------------------+------------+-------+---------+--------+------------------+--------+----------+
| Trial name                 | status     | loc   | mixer   |   iter |   total time (s) |     ts |   reward |
|----------------------------+------------+-------+---------+--------+------------------+--------+----------|
| QMIX_grouped_twostep_00000 | TERMINATED |       | qmix    |    100 |          276.796 | 101000 |   33.505 |
+----------------------------+------------+-------+---------+--------+------------------+--------+----------+



Process finished with exit code 0

As you can see, the policy seems to be random: the expected value of a random policy is 1/3 per agent per step, and the resulting episode reward is 33.505 (I reset the environment every 100 timesteps). My question: What am I not understanding? There must be something wrong with my configuration, or maybe with my understanding of how RLlib works. But since the best policy is very, very simple (just always take action 0), it looks to me like the algorithm isn’t learning at all.
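For reference, here is the back-of-the-envelope baseline I’m comparing against (this quick rollout loop is just a sketch reusing the TwoStepGame class from the script above, it is not part of the reproduction): a uniform-random policy picks action 0 with probability 1/3, so the two agents together should collect roughly 2 * 100 * 0.5 / 3 ≈ 33.3 per episode, while always taking action 0 would give roughly 2 * 100 * 0.5 = 100.

env = TwoStepGame({})
env.reset()
total, done = 0.0, {"__all__": False}
while not done["__all__"]:
    # sample a uniformly random action for each of the two agents
    actions = {0: env.action_space.sample(), 1: env.action_space.sample()}
    _, rewards, done, _ = env.step(actions)
    total += rewards[0] + rewards[1]
print(total)  # roughly 33 per episode, which matches the reported 33.505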

Software versions:
  • ray 0.8.4
  • python 3.6.9
  • tensorflow 1.14.0
  • OS: Ubuntu 18.04 (running in a VM on Windows)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

3 reactions
GoingMyWay commented, Jun 2, 2020

On StarCraft, QMIX from RLlib also seems to learn a random policy; even after training for many episodes, it still behaves randomly. https://github.com/oxwhirl/smac/issues/42

1 reaction
sven1977 commented, Jul 16, 2020

@ManuelZierl Thanks for filing this! The fix should be merged tomorrow or over the WE.
