[Bug] QMIX remote worker does not have GPU `(torch.cuda.is_available() == False)` but local mode works fine
Search before asking
- I searched the issues and found no similar issues.
Ray Component
RLlib
What happened + What you expected to happen
I got an error stating that the CUDA device is unavailable while `make_env` was being called on the remote worker (I think). My Gym environment needs to instantiate an Unreal Engine binary and then load a pre-trained CV model that processes the raw observations coming back from the binary. That model is loaded with `torch.load()`, and this call raises a torch serialization error on the worker.
The error report is as follows:
(QMIX pid=108284) 2021-12-21 17:27:52,536 ERROR worker.py:431 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::QMIX.__init__() (pid=108284, ip=162.105.162.24)
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 102, in __init__
(QMIX pid=108284) Trainer.__init__(self, config, env, logger_creator,
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 661, in __init__
(QMIX pid=108284) super().__init__(config, logger_creator, remote_checkpoint_dir,
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/tune/trainable.py", line 121, in __init__
(QMIX pid=108284) self.setup(copy.deepcopy(self.config))
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 113, in setup
(QMIX pid=108284) super().setup(config)
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 764, in setup
(QMIX pid=108284) self._init(self.config, self.env_creator)
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 136, in _init
(QMIX pid=108284) self.workers = self._make_workers(
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 1727, in _make_workers
(QMIX pid=108284) return WorkerSet(
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 110, in __init__
(QMIX pid=108284) self._local_worker = self._make_worker(
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 449, in _make_worker
(QMIX pid=108284) worker = cls(
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 459, in __init__
(QMIX pid=108284) self.env = env_creator(copy.deepcopy(self.env_context))
(QMIX pid=108284) File "train.py", line 107, in <lambda>
(QMIX pid=108284) lambda config: make_env(config).with_agent_groups(
(QMIX pid=108284) File "train.py", line 80, in make_env
(QMIX pid=108284) env = make_env_impl(env_config)
(QMIX pid=108284) File "/data/xjp/Projects/Active-Pose/activepose/control/envs.py", line 34, in make_train_env
(QMIX pid=108284) env = make_env(**env_config) # update config in make_env
(QMIX pid=108284) File "/data/xjp/Projects/Active-Pose/activepose/envs/__init__.py", line 31, in make_env
(QMIX pid=108284) env = MultiviewPose(config)
(QMIX pid=108284) File "/data/xjp/Projects/Active-Pose/activepose/envs/base.py", line 55, in __init__
(QMIX pid=108284) pose_estimator = initialize_streaming_pose_estimator(config)
(QMIX pid=108284) File "/data/xjp/Projects/Active-Pose/activepose/pose/utils2d/pose/core.py", line 310, in initialize_streaming_pose_estimator
(QMIX pid=108284) model_dict = load_pose2d_model(config, device=device)
(QMIX pid=108284) File "/data/xjp/Projects/Active-Pose/activepose/pose/utils2d/pose/core.py", line 29, in load_pose2d_model
(QMIX pid=108284) state_dict = torch.load(config.TEST.MODEL_FILE)
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
(QMIX pid=108284) return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 787, in _legacy_load
(QMIX pid=108284) result = unpickler.load()
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 743, in persistent_load
(QMIX pid=108284) deserialized_objects[root_key] = restore_location(obj, location)
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
(QMIX pid=108284) result = fn(storage, location)
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
(QMIX pid=108284) device = validate_cuda_device(location)
(QMIX pid=108284) File "/home/xjp/Miniconda3/envs/active-pose/lib/python3.8/site-packages/torch/serialization.py", line 135, in validate_cuda_device
(QMIX pid=108284) raise RuntimeError('Attempting to deserialize object on a CUDA '
(QMIX pid=108284) RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
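The RuntimeError itself points at the standard `map_location` fallback. A minimal sketch of that pattern, using a hypothetical helper name and checkpoint path rather than the actual project code:

```python
import torch

def load_pretrained_model(path):
    # Map CUDA-saved storages onto whatever device this process can see;
    # falls back to CPU when torch.cuda.is_available() is False, as it is
    # on the failing QMIX rollout worker.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return torch.load(path, map_location=device)

state_dict = load_pretrained_model("pose2d_checkpoint.pth")  # hypothetical path
```

That only masks the symptom, though; the real question is why the QMIX rollout worker cannot see the GPU in the first place.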
Since QMIX requires agents to be grouped together, I made the following change to the environment registration:
from gym.spaces import Tuple
from ray import tune

# Build a throwaway env instance just to read the agent ids and spaces.
tmp_env = make_env(tmp_env_config)
agent_group = {'camera': tmp_env.agent_ids}
obs_group = Tuple([tmp_env.observation_space for _ in agent_group["camera"]])
act_group = Tuple([tmp_env.action_space for _ in agent_group["camera"]])

# Register the grouped env so QMIX sees the cameras as one agent group.
tune.register_env(
    "active-pose-parallel-grouped",
    lambda config: make_env(config).with_agent_groups(
        groups=agent_group, obs_space=obs_group, act_space=act_group))
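For completeness, here is a sketch of how the registered name might then be handed to the trainer; the worker count and other values shown are placeholders, not the project's actual settings:

```python
from ray.rllib.agents.qmix import QMixTrainer

trainer = QMixTrainer(
    env="active-pose-parallel-grouped",
    config={
        "framework": "torch",
        "num_workers": 1,   # placeholder: one remote rollout worker
        "env_config": {},   # placeholder: real env settings go here
    },
)
```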
Here are a few things I have tried so far:
- I could run PPO (using the same `make_env()` callable) without running into any errors; DQN seems to work fine too.
- I can run QMIX with `local_mode = True`, but not with `local_mode = False`.
- While debugging QMIX with remote workers, I added a print statement inside the `make_env` function to check `torch.cuda.is_available()` while the remote worker is creating a new environment, and it prints `False`. For PPO, `torch.cuda.is_available()` prints `True` during remote environment creation (see the debugging sketch after this list).
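A minimal sketch of that debugging check, with the actual environment construction stubbed out (the body shown here is not the project code):

```python
import os
import torch

def make_env(env_config):
    # Debug: what does the worker process building the env actually see?
    print("CUDA_VISIBLE_DEVICES =", repr(os.environ.get("CUDA_VISIBLE_DEVICES")))
    print("torch.cuda.is_available() =", torch.cuda.is_available())
    ...  # real environment construction (UE4 binary + CV model) goes here
```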
The only difference is that QMIX requires the `with_agent_groups` wrapper, so I'm guessing this could be the source of the error?
Any help is deeply appreciated.
Versions / Dependencies
Ray == 1.9.0, Python == 3.8.12, PyTorch == 1.10.0
Reproduction script
Unfortunately, it is difficult for me to provide a script as our project is currently closed-source, but I am willing to check anything you would suggest.
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Alright, my wild idea tests out: if I set `os.environ['CUDA_VISIBLE_DEVICES'] = '0'` while the remote worker is executing the `make_env` function, the serialization error goes away… so it appears the rollout worker must be assigned a specific CUDA device, i.e. `CUDA_VISIBLE_DEVICES` cannot be an empty string.
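A minimal sketch of that workaround, assuming the first GPU (`'0'`) and placement at the top of the env creator; Ray normally manages `CUDA_VISIBLE_DEVICES` itself, so hard-coding it is only a diagnostic measure:

```python
import os

def make_env(env_config):
    # Workaround: make sure this worker process can see a GPU before the
    # pre-trained CV model is deserialized with torch.load().
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # hard-codes the first GPU
    ...  # real environment construction goes here
```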
Thanks @sven1977 for your kind responses. We are doing some cool stuff with UE4 and have brought Ray RLlib into the mix. I still need to run a few tests (with your suggestions in mind) before I can draw more conclusions. But first, to answer your questions: for PPO I use `num_gpus_per_worker = 0.5` and `num_envs_per_worker = 4`, so that each GPU hosts 2 workers and 8 vectorized envs, and it works seamlessly. For QMIX I tried (1) the same settings as PPO that I just mentioned, and (2) `num_gpus = 1` or `num_envs_per_worker = 1`, but none of these worked. Then I noticed that QMIX and PPO have completely different-looking execution plans, and many lines in QMIX's execution plan involve moving tensors between devices. Could this potentially be the problem?
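For concreteness, a sketch of the PPO resource settings described above; everything except the two resource keys is a placeholder rather than the project's actual configuration:

```python
from ray import tune

tune.run(
    "PPO",
    config={
        "env": "active-pose-parallel",   # hypothetical, ungrouped env name
        "framework": "torch",
        "num_workers": 4,                # placeholder worker count
        "num_gpus_per_worker": 0.5,      # two rollout workers share one GPU
        "num_envs_per_worker": 4,        # 8 vectorized envs per GPU in total
    },
)
```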
Thanks, Mickel