[Bug] RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_addmm)
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Tune, RLlib
What happened + What you expected to happen
When running an RLlib experiment following these guidelines: https://www.tensortrade.org/en/latest/examples/train_and_evaluate_using_ray.html
with this config:
```python
tune.run(
    run_or_experiment="PPO",  # We'll be using the builtin PPO agent in RLlib
    name="MyExperiment1",
    metric='episode_reward_mean',
    mode='max',
    # resources_per_trial={"cpu": 8, "gpu": 1},
    stop={
        "training_iteration": 100  # Run 100 iterations for each hyperparameter combination
    },
    config={
        "env": "MyTrainingEnv",
        "env_config": config_train,  # The dictionary we built before
        "log_level": "WARNING",
        "framework": "torch",
        "_fake_gpus": False,
        "ignore_worker_failures": True,
        "num_workers": 1,  # One worker per agent. You can increase this but it will run fewer parallel trainings.
        "num_envs_per_worker": 1,
        "num_gpus": 1,  # I have yet to understand whether using a GPU is worth it for our purposes, but I think it's not. This way you can train on a non-GPU-enabled system.
        "clip_rewards": True,
        "lr": LEARNING_RATE,  # Hyperparameter grid search defined above
        "gamma": GAMMA,  # This can have a big impact on the result and needs to be properly tuned (range is 0 to 1)
        "lambda": LAMBDA,
        "observation_filter": "MeanStdFilter",
        "model": {
            "fcnet_hiddens": FC_SIZE,  # Hyperparameter grid search defined above
            # "use_attention": True,
            # "attention_use_n_prev_actions": 120,
            # "attention_use_n_prev_rewards": 120
        },
        "sgd_minibatch_size": MINIBATCH_SIZE,  # Hyperparameter grid search defined above
        "evaluation_interval": 1,  # Run evaluation on every iteration
        "evaluation_config": {
            "env_config": config_eval,  # The dictionary we built before (only the overriding keys to use in evaluation)
            "explore": False,  # We don't want to explore during evaluation. All actions have to be repeatable.
        },
    },
    num_samples=1,  # Have one sample for each hyperparameter combination. You can have more to average out randomness.
    keep_checkpoints_num=3,  # Keep the last 3 checkpoints
    checkpoint_freq=1,  # Do a checkpoint on each iteration (slower but you can pick more finely the checkpoint to use later)
    local_dir=r"D:\ray_results"
)
```
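For context, the "MyTrainingEnv" referenced by the "env" key has to be registered with Ray before this call. A minimal sketch of that step, assuming (as in the linked tutorial) a create_env factory that builds the TensorTrade environment from the env_config dictionary:

```python
from ray.tune.registry import register_env

# Assumed registration step from the linked tutorial: tune.run() resolves the
# "env": "MyTrainingEnv" key above against this registry entry. create_env is
# the (assumed) factory that builds the TensorTrade environment from env_config.
register_env("MyTrainingEnv", lambda env_config: create_env(env_config))
```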
I encountered the following error:

```
Traceback (most recent call last):
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\tune\trial_runner.py", line 886, in _process_trial
results = self.trial_executor.fetch_result(trial)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\tune\ray_trial_executor.py", line 675, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\_private\client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\worker.py", line 1760, in get
raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPOTrainer.__init__() (pid=12840, ip=127.0.0.1, repr=PPOTrainer)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\util\tracing\tracing_helper.py", line 451, in _resume_span
return method(self, *_args, **_kwargs)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\agents\trainer.py", line 948, in _init
raise NotImplementedError
NotImplementedError
During handling of the above exception, another exception occurred:
ray::PPOTrainer.__init__() (pid=12840, ip=127.0.0.1, repr=PPOTrainer)
File "python\ray\_raylet.pyx", line 633, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 674, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 640, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 644, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 593, in ray._raylet.execute_task.function_executor
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\_private\function_manager.py", line 648, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\util\tracing\tracing_helper.py", line 451, in _resume_span
return method(self, *_args, **_kwargs)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\agents\trainer.py", line 741, in __init__
super().__init__(config, logger_creator, remote_checkpoint_dir,
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\tune\trainable.py", line 124, in __init__
self.setup(copy.deepcopy(self.config))
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\util\tracing\tracing_helper.py", line 451, in _resume_span
return method(self, *_args, **_kwargs)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\agents\trainer.py", line 846, in setup
self.workers = self._make_workers(
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\util\tracing\tracing_helper.py", line 451, in _resume_span
return method(self, *_args, **_kwargs)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\agents\trainer.py", line 1971, in _make_workers
return WorkerSet(
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\evaluation\worker_set.py", line 123, in __init__
self._local_worker = self._make_worker(
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\evaluation\worker_set.py", line 499, in _make_worker
worker = cls(
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 586, in __init__
self._build_policy_map(
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 1569, in _build_policy_map
self.policy_map.create_policy(name, orig_cls, obs_space, act_space,
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\policy\policy_map.py", line 143, in create_policy
self[policy_id] = class_(observation_space, action_space,
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\agents\ppo\ppo_torch_policy.py", line 50, in __init__
self._initialize_loss_from_dummy_batch()
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\policy\policy.py", line 832, in _initialize_loss_from_dummy_batch
self.compute_actions_from_input_dict(
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\policy\torch_policy.py", line 294, in compute_actions_from_input_dict
return self._compute_action_helper(input_dict, state_batches,
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\utils\threading.py", line 21, in wrapper
return func(self, *a, **k)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\policy\torch_policy.py", line 934, in _compute_action_helper
dist_inputs, state_out = self.model(input_dict, state_batches,
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\modelv2.py", line 243, in __call__
res = self.forward(restored, state or [], seq_lens)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\torch\complex_input_net.py", line 193, in forward
nn_out, _ = self.flatten[i](SampleBatch({
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\modelv2.py", line 243, in __call__
res = self.forward(restored, state or [], seq_lens)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\torch\fcnet.py", line 124, in forward
self._features = self._hidden_layers(self._last_flat_in)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\container.py", line 141, in forward
input = module(input)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\ray\rllib\models\torch\misc.py", line 160, in forward
return self._model(x)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\container.py", line 141, in forward
input = module(input)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\modules\linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "C:\Users\Usuario\anaconda3\envs\cryptorl\lib\site-packages\torch\nn\functional.py", line 1849, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_addmm)
```
I would expect all model weights and input tensors to be placed on the same device.
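Outside RLlib, this error class is easy to reproduce. The sketch below (plain PyTorch, my own illustration rather than RLlib code, and it assumes a CUDA-capable machine) shows a Linear layer whose weights live on cuda:0 being fed a CPU tensor, and the manual fix of moving the batch to the module's device:

```python
import torch
import torch.nn as nn

# Minimal illustration of the same failure mode (not RLlib code): the layer's
# weights live on cuda:0 while the input batch is created on the CPU.
layer = nn.Linear(in_features=8, out_features=4).to("cuda:0")
batch = torch.randn(2, 8)  # CPU tensor by default

try:
    layer(batch)  # RuntimeError: Expected all tensors to be on the same device ...
except RuntimeError as err:
    print(err)

# The usual fix is to move the inputs onto the same device as the weights
# before the forward pass, which is what the policy is expected to do internally.
device = next(layer.parameters()).device
out = layer(batch.to(device))
print(out.shape)  # torch.Size([2, 4])
```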
Versions / Dependencies
- OS: Windows 10
- Ray: 2.0.0.dev0
- Python: 3.8
- Torch: 1.10.1
- CUDA: 11.4
Reproduction script
https://www.tensortrade.org/en/latest/examples/train_and_evaluate_using_ray.html
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Top GitHub Comments
@easysoft2k15 you are partially right. In my experiments, Ray does not work with multi-dimensional observation spaces unless you use "conv_filters", as per the documentation here: the bug we see here comes from Torch keeping some tensors on the CPU while the model sits on the GPU, which breaks training when both CPU and GPU are used. When I disabled the GPU and trained only on the CPU, everything went well.
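For reference, a hedged sketch of that CPU-only workaround applied to the tune.run() call above; everything except the GPU allocation stays as in the original config, and "MyTrainingEnv" still has to be registered as in the tutorial:

```python
from ray import tune

# CPU-only workaround sketch: same call as above, but with the GPU disabled so
# the policy, its model, and the rollout data all stay on the CPU.
tune.run(
    "PPO",
    config={
        "env": "MyTrainingEnv",       # must be registered as in the tutorial
        "env_config": config_train,
        "framework": "torch",
        "num_workers": 1,
        "num_gpus": 0,                # was 1: do not place the trainer on the GPU
        "num_gpus_per_worker": 0,     # keep the rollout workers on the CPU as well
        # ... remaining keys unchanged from the original config ...
    },
)
```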
This problem was solved for me by `pip install ray[default,tune,rllib,serve]==1.9.2`.
Hope it helps!