[Bug] QMIX remote worker does not have GPU `(torch.cuda.is_available() == False)` but local mode works fine

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

RLlib

What happened + What you expected to happen

I got an error stating that the CUDA device is unavailable while make_env was being called on a remote worker (I think). My Gym environment requires instantiating an Unreal Engine binary and then loading a pre-trained CV model to process the raw observations from interacting with that binary. The pre-trained CV model is loaded with torch.load(), and that call raises a torch serialization error.

The error report is as follows:

(QMIX pid=108284) 2021-12-21 17:27:52,536       ERROR worker.py:431 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::QMIX.__init__() (pid=108284, ip=162.105.162.24)
(QMIX pid=108284)   File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 102, in __init__
(QMIX pid=108284)     Trainer.__init__(self, config, env, logger_creator,
(QMIX pid=108284)   File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 661, in __init__
(QMIX pid=108284)     super().__init__(config, logger_creator, remote_checkpoint_dir,
(QMIX pid=108284)   File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/tune/trainable.py", line 121, in __init__
(QMIX pid=108284)     self.setup(copy.deepcopy(self.config))
(QMIX pid=108284)   File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 113, in setup
(QMIX pid=108284)     super().setup(config)
(QMIX pid=108284)   File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 764, in setup
(QMIX pid=108284)     self._init(self.config, self.env_creator)
(QMIX pid=108284)   File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 136, in _init
(QMIX pid=108284)     self.workers = self._make_workers(
(QMIX pid=108284)   File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 1727, in _make_workers
(QMIX pid=108284)     return WorkerSet(
(QMIX pid=108284)   File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 110, in __init__
(QMIX pid=108284)     self._local_worker = self._make_worker(
(QMIX pid=108284)   File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 449, in _make_worker
(QMIX pid=108284)     worker = cls(
(QMIX pid=108284)   File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 459, in __init__
(QMIX pid=108284)     self.env = env_creator(copy.deepcopy(self.env_context))
(QMIX pid=108284)   File "train.py", line 107, in <lambda>
(QMIX pid=108284)     lambda config: make_env(config).with_agent_groups(
(QMIX pid=108284)   File "train.py", line 80, in make_env
(QMIX pid=108284)     env = make_env_impl(env_config)
(QMIX pid=108284)   File "/data/xjp/Projects/Active-Pose/activepose/control/envs.py", line 34, in make_train_env
(QMIX pid=108284)     env = make_env(**env_config)  # update config in make_env
(QMIX pid=108284)   File "/data/xjp/Projects/Active-Pose/activepose/envs/__init__.py", line 31, in make_env
(QMIX pid=108284)     env = MultiviewPose(config)
(QMIX pid=108284)   File "/data/xjp/Projects/Active-Pose/activepose/envs/base.py", line 55, in __init__
(QMIX pid=108284)     pose_estimator = initialize_streaming_pose_estimator(config)
(QMIX pid=108284)   File "/data/xjp/Projects/Active-Pose/activepose/pose/utils2d/pose/core.py", line 310, in initialize_streaming_pose_estimator
(QMIX pid=108284)     model_dict = load_pose2d_model(config, device=device)
(QMIX pid=108284)   File "/data/xjp/Projects/Active-Pose/activepose/pose/utils2d/pose/core.py", line 29, in load_pose2d_model
(QMIX pid=108284)     state_dict = torch.load(config.TEST.MODEL_FILE)
(QMIX pid=108284)   File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
(QMIX pid=108284)     return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
(QMIX pid=108284)   File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 787, in _legacy_load
(QMIX pid=108284)     result = unpickler.load()
(QMIX pid=108284)   File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 743, in persistent_load
(QMIX pid=108284)     deserialized_objects[root_key] = restore_location(obj, location)
(QMIX pid=108284)   File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
(QMIX pid=108284)     result = fn(storage, location)
(QMIX pid=108284)   File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
(QMIX pid=108284)     device = validate_cuda_device(location)
(QMIX pid=108284)   File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 135, in validate_cuda_device
(QMIX pid=108284)     raise RuntimeError('Attempting to deserialize object on a CUDA '
(QMIX pid=108284) RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
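
As the error message itself suggests, a device-aware load avoids the crash when no CUDA device is visible to the process. A minimal sketch (load_checkpoint is a hypothetical helper; the checkpoint path corresponds to config.TEST.MODEL_FILE from the traceback):

import torch

def load_checkpoint(model_file):
    # Map CUDA storages to CPU when this process cannot see a GPU;
    # otherwise load onto the default CUDA device.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return torch.load(model_file, map_location=device)

Note this only sidesteps the crash; in our case the worker should have GPU access, so the real question is why torch.cuda.is_available() is False on the QMIX worker.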

Since QMIX requires agents to be grouped together, I made the following change to the environment registration:

from gym.spaces import Tuple
from ray import tune

# Build a throwaway env just to read off the agent ids and spaces.
tmp_env = make_env(tmp_env_config)

# Group all cameras into one agent group; the grouped obs/action
# spaces are Tuple spaces with one entry per grouped agent.
agent_group = {'camera': tmp_env.agent_ids}
obs_group = Tuple([tmp_env.observation_space for _ in agent_group["camera"]])
act_group = Tuple([tmp_env.action_space for _ in agent_group["camera"]])
tune.register_env(
    "active-pose-parallel-grouped",
    lambda config: make_env(config).with_agent_groups(
        groups=agent_group, obs_space=obs_group, act_space=act_group))
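
For context, the grouped env is then referenced by name when launching the trainer, roughly like this (a sketch, with illustrative resource values rather than our exact settings):

from ray import tune

tune.run(
    "QMIX",
    config={
        "env": "active-pose-parallel-grouped",
        "framework": "torch",
        "num_workers": 2,  # illustrative
        "num_gpus": 1,     # illustrative
    },
)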

Here are a few things I have tried so far:

  • I can run PPO (using the same make_env() callable) without running into any errors, and DQN seems to work fine too.
  • I can run QMIX with local_mode = True, but not with local_mode = False.
  • While debugging QMIX with remote workers, I added a print statement inside the make_env function (see the sketch below) to check torch.cuda.is_available() while the remote worker is creating a new environment, and it prints False. For PPO, torch.cuda.is_available() prints True during remote environment creation.
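
For reference, the debug check inside make_env looks roughly like this (a sketch; make_env_impl is our internal factory shown in the traceback):

import os
import torch

def make_env(env_config):
    # Debug: report what the calling process actually sees.
    print("torch.cuda.is_available():", torch.cuda.is_available())
    print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
    return make_env_impl(env_config)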

The only difference I can see is that QMIX requires the with_agent_groups wrapper, so I’m guessing this could be the source of the error?

Any help is deeply appreciated.

Versions / Dependencies

Ray == 1.9.0, Python == 3.8.12, PyTorch == 1.10.0

Reproduction script

Unfortunately, it is difficult for me to provide a script as our project is currently closed-source, but I am willing to check anything you would suggest.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

mickelliu commented on Jan 8, 2022 (1 reaction)

Alright, my wild idea tested out. If I set os.environ['CUDA_VISIBLE_DEVICES'] = '0' while the remote worker is executing the make_env function, the serialization error goes away… so it appears that the RolloutWorker must be assigned a specific CUDA device, i.e. CUDA_VISIBLE_DEVICES cannot be an empty string.
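
Concretely, the workaround looks like this (a sketch; the right device index will depend on the machine):

import os

def make_env(env_config):
    # Ensure this worker process sees a concrete CUDA device before
    # torch.load() deserializes the CUDA checkpoint; with an empty
    # CUDA_VISIBLE_DEVICES, torch.cuda.is_available() returns False.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    return make_env_impl(env_config)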

mickelliu commented on Jan 7, 2022 (1 reaction)

Thanks @sven1977 for your kind responses. We are doing some cool stuff with UE4 and have brought Ray RLlib into the mix. I still need to run a few tests (with your suggestions in mind) before I can draw more clues out of this. But first, to answer your questions:

  1. Our environment is indeed using a pre-trained model; it processes the observations received from interacting with the UE4 binary. So the RLlib agent receives observations already pre-processed by the CV model (we did not change RLlib’s own preprocessing logic; everything happens purely at the env level).
  2. The pre-trained CV model needs a GPU for fast inference. It is a fairly deep model, and I can’t imagine running it without a GPU.
  3. I did. For PPO I set num_gpus_per_worker = 0.5 and num_envs_per_worker = 4, so each GPU hosts 2 workers and 8 vectorized envs, and it works seamlessly (see the config sketch after this list). For QMIX I tried (1) the same settings as PPO, and (2) num_gpus = 1 or num_envs_per_worker = 1, but none of these worked.
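
For concreteness, the PPO resource settings from point 3 look like this (a sketch; num_workers is illustrative):

ppo_config = {
    "num_gpus_per_worker": 0.5,  # two rollout workers share one GPU
    "num_envs_per_worker": 4,    # 4 vectorized envs per worker -> 8 per GPU
    "num_workers": 4,            # illustrative
}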

Then I noticed that QMIX and PPO have completely different-looking execution plans, and many lines in QMIX’s execution plan involve moving tensors between devices. Could this potentially be the problem?

Thanks, Mickel
