
CUDA error: initialization error (and some other problems) when searching hyperparameters on GPUs

See original GitHub issue

Describe the bug

CUDA error: initialization error (setDevice at /pytorch/c10/cuda/impl/CUDAGuardImpl.h:40) when searching hyperparameters on GPUs

To Reproduce

Following the Examples chapter in the documentation, using the files in examples\new_project_templates as a starting point:

  1. Make some hyperparameters in lightning_module_template.py tunable, e.g. learning_rate.
  2. In single_gpu_node_dp_template.py, delete main(hyperparams) and, following the documentation, add hyperparams.optimize_parallel_gpu( main, nb_trials=4, nb_workers=1, gpus=[0,1,2,3] )
  3. Running python single_gpu_node_dp_template.py on the GPU machine now raises TypeError: optimize_parallel_gpu() got an unexpected keyword argument 'nb_trials'.
  4. Looking at HyperOptArgumentParser in argparse_hopt.py, the function def optimize_parallel_gpu only accepts max_nb_trials, not nb_trials; there is no nb_workers, and the GPU list parameter is named gpu_ids rather than gpus.
  5. Changing the call in single_gpu_node_dp_template.py to hyperparams.optimize_parallel_gpu( main, max_nb_trials=4, gpu_ids=[0,1,2,3] ) then raises TypeError: str expected, not int, which means [0,1,2,3] has to be written as ['0,1,2,3'].
  6. After that change, the error becomes TypeError: main() takes 1 positional argument but 2 were given. The cause is that optimize_parallel_gpu_private in argparse_hopt.py calls results = train_function(trial_params, gpu_id_set), while main in single_gpu_node_dp_template.py takes only hparams. Since updating argparse_hopt.py on the GPU server is harder, I added gpu_id_set as a second parameter of main, even though it is not used (see the sketch after the traceback below).
  7. Now the script first prints gpu available: False, used: False, and then crashes with
terminate called after throwing an instance of 'c10::Error'
terminate called recursively
  what():  CUDA error: initialization error (setDevice at /pytorch/c10/cuda/impl/CUDAGuardImpl.h:40)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fcbbb895273 in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc3ca (0x7fcbbbac83ca in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: torch::autograd::Engine::set_device(int) + 0x159 (0x7fcb261e8179 in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: torch::autograd::Engine::thread_init(int) + 0x1a (0x7fcb261e81aa in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #4: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcbb6ea892a in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xa75f (0x7fcbbc4b475f in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x76ba (0x7fcbc04f06ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x6d (0x7fcbc022641d in /lib/x86_64-linux-gnu/libc.so.6)
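
For reference, here is a minimal sketch of what the adjusted single_gpu_node_dp_template.py ends up looking like, assuming the test-tube 0.6.9 API as found above (opt_list for the tunable learning rate, max_nb_trials and gpu_ids, and a trial function that receives the GPU id set as a second argument); the strategy and option values are illustrative, not taken from the template:

from test_tube import HyperOptArgumentParser


def main(hparams, gpu_id_set=None):
    # gpu_id_set is passed by optimize_parallel_gpu even if it is not used here
    print(f"trial: learning_rate={hparams.learning_rate}, gpus={gpu_id_set}")
    # ... build the LightningModule and Trainer from hparams here ...


if __name__ == '__main__':
    parser = HyperOptArgumentParser(strategy='random_search')
    # step 1: make learning_rate tunable
    parser.opt_list('--learning_rate', default=0.001, type=float,
                    options=[1e-3, 1e-4, 1e-5], tunable=True)
    hyperparams = parser.parse_args()

    # steps 4-5: max_nb_trials and gpu_ids, with the GPU ids given as one string
    hyperparams.optimize_parallel_gpu(
        main,
        max_nb_trials=4,
        gpu_ids=['0,1,2,3'],
    )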

However, there are no problems when using IPython to check whether CUDA is available:

In [1]: import torch


In [2]: torch.cuda.is_available()
Out[2]: True


In [3]: torch.cuda.get_device_name(0)
Out[3]: 'GeForce RTX 2080 Ti'


In [4]: torch.cuda.device_count()
Out[4]: 4

Expected behavior

The script should run correctly after updating the code as described above.

Environment

  • CUDA Version 10.0.130
  • PyTorch version: 1.2.0
  • Lightning version: 0.4.6
  • Test-tube version: 0.6.9

Additional context

The documentation on hyperparameter search is too brief and not consistent with the current version.

  1. Adding an explanation of parameters such as nb_workers would help.
  2. Update the parts that are inconsistent with the current version, such as main_local, nb_trials and gpus.
  3. Is there any way to search over all the hyperparameter combinations defined in the model without counting the combinations by hand and hard-coding the number? (A generic way to count them is sketched below.)
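
Regarding point 3, here is a generic sketch (plain Python, not a test-tube feature) that counts every combination with itertools.product so that max_nb_trials can be set to cover the full grid; the option lists below are hypothetical and should mirror whatever is passed to opt_list:

from itertools import product

# hypothetical option lists, mirroring the values given to opt_list(...)
search_space = {
    'learning_rate': [1e-3, 1e-4, 1e-5],
    'batch_size': [32, 64],
}

nb_combinations = len(list(product(*search_space.values())))
print(nb_combinations)  # 3 * 2 = 6

# then, for example:
# hyperparams.optimize_parallel_gpu(main, max_nb_trials=nb_combinations, gpu_ids=['0,1,2,3'])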

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
jiarenyf commented, Aug 8, 2021

Changing from version 1.4.1 to 1.3.4 solved the problem …

0 reactions
Borda commented, Feb 26, 2020

@sophiajw have you tried the latest version on master?

