CUDA error: initialization error (and some other problems) when searching hyperparameters on GPUs
Describe the bug
CUDA error: initialization error (setDevice at /pytorch/c10/cuda/impl/CUDAGuardImpl.h:40)
when searching hyperparameters on GPUs
To Reproduce
Following the Examples chapter in the documentation, using the files in examples/new_project_templates as a starting point:
- Make the hyperparameters in `lightning_module_template.py` tunable, e.g. `learning_rate`.
- In `single_gpu_node_dp_template.py`, delete `main(hyperparams)` and, following the documentation, add `hyperparams.optimize_parallel_gpu(main, nb_trials=4, nb_workers=1, gpus=[0, 1, 2, 3])`.
- Running `python single_gpu_node_dp_template.py` on the GPU machine now raises `TypeError: optimize_parallel_gpu() got an unexpected keyword argument 'nb_trials'`.
- Looking at `argparse_hopt.py` for `HyperOptArgumentParser`, the function `def optimize_parallel_gpu` only accepts `max_nb_trials`, not `nb_trials`. There is also no `nb_workers` parameter, and no `gpus`, only `gpu_ids`.
- Changing the call in `single_gpu_node_dp_template.py` to `hyperparams.optimize_parallel_gpu(main, max_nb_trials=4, gpu_ids=[0, 1, 2, 3])` raises `TypeError: str expected, not int`, which indicates that `[0, 1, 2, 3]` should be changed to `['0,1,2,3']`.
- After that change, the next error is `TypeError: main() takes 1 positional argument but 2 were given`. The reason is that `def optimize_parallel_gpu_private` in `argparse_hopt.py` calls `results = train_function(trial_params, gpu_id_set)`, while our `main` function in `single_gpu_node_dp_template.py` takes only one parameter, `hparams`. Since updating `argparse_hopt.py` on the GPU server is more difficult, I added `gpu_id_set` as a second parameter of `main` (even though it is not used). A consolidated sketch of these changes is given after the traceback below.
- This time the script first prints `gpu available: False, used: False` and then crashes with:
terminate called after throwing an instance of 'c10::Error'
terminate called recursively
what(): CUDA error: initialization error (setDevice at /pytorch/c10/cuda/impl/CUDAGuardImpl.h:40)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fcbbb895273 in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc3ca (0x7fcbbbac83ca in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: torch::autograd::Engine::set_device(int) + 0x159 (0x7fcb261e8179 in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: torch::autograd::Engine::thread_init(int) + 0x1a (0x7fcb261e81aa in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #4: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcbb6ea892a in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xa75f (0x7fcbbc4b475f in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x76ba (0x7fcbc04f06ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x6d (0x7fcbc022641d in /lib/x86_64-linux-gnu/libc.so.6)
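For reference, here is a minimal sketch that consolidates the changes described in the steps above (based on test-tube 0.6.9 as reported here; the `opt_list` call and the body of `main` are illustrative assumptions, not an excerpt from the actual template):
```python
# Sketch only: mirrors the changes described above for single_gpu_node_dp_template.py.
from test_tube import HyperOptArgumentParser


def main(hparams, gpu_id_set=None):
    # optimize_parallel_gpu_private calls train_function(trial_params, gpu_id_set),
    # so main must accept a second argument even though it is unused here.
    pass  # build the model / trainer from hparams here


if __name__ == '__main__':
    parser = HyperOptArgumentParser(strategy='grid_search')
    # mark a hyperparameter as tunable so it is included in the search
    parser.opt_list('--learning_rate', default=0.001, type=float,
                    options=[0.0001, 0.001, 0.01], tunable=True)
    hyperparams = parser.parse_args()

    # argparse_hopt.py currently exposes max_nb_trials and gpu_ids (a list of
    # strings), not nb_trials / nb_workers / gpus as shown in the documentation.
    hyperparams.optimize_parallel_gpu(
        main,
        max_nb_trials=4,
        gpu_ids=['0,1,2,3'],
    )
```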
However, there are no problems when using IPython to check whether CUDA is available:
In [1]: import torch
In [2]: torch.cuda.is_available()
Out[2]: True
In [3]: torch.cuda.get_device_name(0)
Out[3]: 'GeForce RTX 2080 Ti'
In [4]: torch.cuda.device_count()
Out[4]: 4
Expected behavior
The script should run correctly after the code is updated as described above.
Environment
- CUDA Version 10.0.130
- PyTorch version: 1.2.0
- Lightning version: 0.4.6
- Test-tube version: 0.6.9
Additional context
The documentation on hyperparameter search is too brief and no longer consistent with the current version.
- It would help to add an explanation of parameters such as `nb_workers`.
- Please update the parts that are inconsistent with the current version, e.g. `main_local`, `nb_trials`, and `gpus`.
- Is there a way to search over all hyperparameter combinations defined in the model without counting the combinations by hand and filling that number into the code? (A possible workaround is sketched below.)
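Regarding that last point, a possible workaround (not a built-in test-tube feature as far as I know, and the search space below is just an example) is to keep the option lists in one place and derive the trial count from them instead of counting by hand:
```python
# Workaround sketch: derive max_nb_trials from the option lists themselves so the
# number of grid-search combinations never has to be counted manually.
from functools import reduce
from operator import mul

from test_tube import HyperOptArgumentParser


def main(hparams, gpu_id_set=None):
    pass  # training code as in the sketch above


# hypothetical search space kept in one place
search_space = {
    '--learning_rate': [0.0001, 0.001, 0.01],
    '--batch_size': [32, 64],
}

parser = HyperOptArgumentParser(strategy='grid_search')
for name, options in search_space.items():
    parser.opt_list(name, default=options[0], type=type(options[0]),
                    options=options, tunable=True)

# total grid-search combinations = product of the option-list lengths
nb_combinations = reduce(mul, (len(opts) for opts in search_space.values()), 1)

hyperparams = parser.parse_args()
hyperparams.optimize_parallel_gpu(main, max_nb_trials=nb_combinations,
                                  gpu_ids=['0,1,2,3'])
```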
Changing from version 1.4.1 to 1.3.4 solved the problem …
@sophiajw have you tried the latest version on master?