[Bug] QMIX remote worker does not have GPU `(torch.cuda.is_available() == False)` but local mode works fine
Search before asking
- I searched the issues and found no similar issues.
Ray Component
RLlib
What happened + What you expected to happen
I got an error stating that the CUDA device is unavailable while `make_env` was being called on the remote worker (I think). My Gym environment needs to instantiate an Unreal Engine binary and then load a pre-trained CV model that processes the raw observations coming back from the binary. That model is loaded with `torch.load()`, and this call raises a torch serialization error on the worker.
The error report is as follows:
(QMIX pid=108284) 2021-12-21 17:27:52,536 ERROR worker.py:431 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::QMIX.__init__() (pid=108284, ip=162.105.162.24)
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 102, in __init__
(QMIX pid=108284) Trainer.__init__(self, config, env, logger_creator,
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 661, in __init__
(QMIX pid=108284) super().__init__(config, logger_creator, remote_checkpoint_dir,
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/tune/trainable.py", line 121, in __init__
(QMIX pid=108284) self.setup(copy.deepcopy(self.config))
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 113, in setup
(QMIX pid=108284) super().setup(config)
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 764, in setup
(QMIX pid=108284) self._init(self.config, self.env_creator)
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 136, in _init
(QMIX pid=108284) self.workers = self._make_workers(
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 1727, in _make_workers
(QMIX pid=108284) return WorkerSet(
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 110, in __init__
(QMIX pid=108284) self._local_worker = self._make_worker(
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 449, in _make_worker
(QMIX pid=108284) worker = cls(
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 459, in __init__
(QMIX pid=108284) self.env = env_creator(copy.deepcopy(self.env_context))
(QMIX pid=108284) File "train.py", line 107, in <lambda>
(QMIX pid=108284) lambda config: make_env(config).with_agent_groups(
(QMIX pid=108284) File "train.py", line 80, in make_env
(QMIX pid=108284) env = make_env_impl(env_config)
(QMIX pid=108284) File "/data/xjp/Projects/Active-Pose/activepose/control/envs.py", line 34, in make_train_env
(QMIX pid=108284) env = make_env(**env_config) # update config in make_env
(QMIX pid=108284) File "/data/xjp/Projects/Active-Pose/activepose/envs/__init__.py", line 31, in make_env
(QMIX pid=108284) env = MultiviewPose(config)
(QMIX pid=108284) File "/data/xjp/Projects/Active-Pose/activepose/envs/base.py", line 55, in __init__
(QMIX pid=108284) pose_estimator = initialize_streaming_pose_estimator(config)
(QMIX pid=108284) File "/data/xjp/Projects/Active-Pose/activepose/pose/utils2d/pose/core.py", line 310, in initialize_streaming_pose_estimator
(QMIX pid=108284) model_dict = load_pose2d_model(config, device=device)
(QMIX pid=108284) File "/data/xjp/Projects/Active-Pose/activepose/pose/utils2d/pose/core.py", line 29, in load_pose2d_model
(QMIX pid=108284) state_dict = torch.load(config.TEST.MODEL_FILE)
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
(QMIX pid=108284) return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 787, in _legacy_load
(QMIX pid=108284) result = unpickler.load()
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 743, in persistent_load
(QMIX pid=108284) deserialized_objects[root_key] = restore_location(obj, location)
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
(QMIX pid=108284) result = fn(storage, location)
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
(QMIX pid=108284) device = validate_cuda_device(location)
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 135, in validate_cuda_device
(QMIX pid=108284) raise RuntimeError('Attempting to deserialize object on a CUDA '
(QMIX pid=108284) RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
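The RuntimeError itself points at the standard `map_location` fallback. A minimal sketch of that pattern, using a hypothetical helper name and checkpoint path rather than the actual project code:

```python
import torch

def load_pretrained_model(path):
    # Map CUDA-saved storages onto whatever device this process can see;
    # falls back to CPU when torch.cuda.is_available() is False, as it is
    # on the failing QMIX rollout worker.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return torch.load(path, map_location=device)

state_dict = load_pretrained_model("pose2d_checkpoint.pth")  # hypothetical path
```

That only masks the symptom, though; the real question is why the QMIX rollout worker cannot see the GPU in the first place.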
Since QMIX requires agents to be grouped together, I made the following change to the environment registration:
from gym.spaces import Tuple
from ray import tune

# Build a throwaway env instance just to read the agent ids and spaces.
tmp_env = make_env(tmp_env_config)
agent_group = {'camera': tmp_env.agent_ids}
obs_group = Tuple([tmp_env.observation_space for _ in agent_group["camera"]])
act_group = Tuple([tmp_env.action_space for _ in agent_group["camera"]])

# Register the grouped env so QMIX sees the cameras as one agent group.
tune.register_env(
    "active-pose-parallel-grouped",
    lambda config: make_env(config).with_agent_groups(
        groups=agent_group, obs_space=obs_group, act_space=act_group))
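For completeness, here is a sketch of how the registered name might then be handed to the trainer; the worker count and other values shown are placeholders, not the project's actual settings:

```python
from ray.rllib.agents.qmix import QMixTrainer

trainer = QMixTrainer(
    env="active-pose-parallel-grouped",
    config={
        "framework": "torch",
        "num_workers": 1,   # placeholder: one remote rollout worker
        "env_config": {},   # placeholder: real env settings go here
    },
)
```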
Here are a few things I have tried so far:
- I could run PPO (using the same `make_env()` callable) without running into any errors; DQN seems to work fine too.
- I can run QMIX with `local_mode = True`, but not with `local_mode = False`.
- While debugging QMIX with remote workers, I added a print statement inside the `make_env` function to check `torch.cuda.is_available()` while the remote worker is creating a new environment, and it prints `False`. For PPO, `torch.cuda.is_available()` prints `True` during remote environment creation (see the debugging sketch after this list).
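A minimal sketch of that debugging check, with the actual environment construction stubbed out (the body shown here is not the project code):

```python
import os
import torch

def make_env(env_config):
    # Debug: what does the worker process building the env actually see?
    print("CUDA_VISIBLE_DEVICES =", repr(os.environ.get("CUDA_VISIBLE_DEVICES")))
    print("torch.cuda.is_available() =", torch.cuda.is_available())
    ...  # real environment construction (UE4 binary + CV model) goes here
```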
The only difference is that QMIX requires the `with_agent_groups` wrapper, so I'm guessing this could be the source of the error?
Any help is deeply appreciated.
Versions / Dependencies
Ray == 1.9.0, Python == 3.8.12, PyTorch == 1.10.0
Reproduction script
Unfortunately, it is difficult for me to provide a script as our project is currently closed-source, but I am willing to check anything you would suggest.
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Alright, my wild idea tests out: if I set `os.environ['CUDA_VISIBLE_DEVICES'] = '0'` while the remote worker is executing the `make_env` function, the serialization error goes away… so it appears the rollout worker must be assigned a specific CUDA device, i.e. `CUDA_VISIBLE_DEVICES` cannot be an empty string.
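A minimal sketch of that workaround, assuming the first GPU (`'0'`) and placement at the top of the env creator; Ray normally manages `CUDA_VISIBLE_DEVICES` itself, so hard-coding it is only a diagnostic measure:

```python
import os

def make_env(env_config):
    # Workaround: make sure this worker process can see a GPU before the
    # pre-trained CV model is deserialized with torch.load().
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # hard-codes the first GPU
    ...  # real environment construction goes here
```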
Thanks @sven1977 for your kind responses. We are doing some cool stuff with UE4 and have brought Ray RLlib into the mix. I still need to run a few tests (with your suggestions in mind) before I can draw more conclusions. But first, to answer your questions: for PPO I use `num_gpus_per_worker = 0.5` and `num_envs_per_worker = 4`, so that each GPU hosts 2 workers and 8 vectorized envs, and it works seamlessly. For QMIX I tried (1) the same settings as PPO that I just mentioned, and (2) `num_gpus = 1` or `num_envs_per_worker = 1`, but none of these worked. Then I noticed that QMIX and PPO have completely different-looking execution plans, and many lines in QMIX's execution plan involve moving tensors between devices. Could this potentially be the problem?
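For concreteness, a sketch of the PPO resource settings described above; everything except the two resource keys is a placeholder rather than the project's actual configuration:

```python
from ray import tune

tune.run(
    "PPO",
    config={
        "env": "active-pose-parallel",   # hypothetical, ungrouped env name
        "framework": "torch",
        "num_workers": 4,                # placeholder worker count
        "num_gpus_per_worker": 0.5,      # two rollout workers share one GPU
        "num_envs_per_worker": 4,        # 8 vectorized envs per GPU in total
    },
)
```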
Thanks, Mickel