CUDA error: initialization error (and some other problems) when searching hyperparameters on GPUs
Describe the bug
CUDA error: initialization error (setDevice at /pytorch/c10/cuda/impl/CUDAGuardImpl.h:40)
when searching hyperparameters on GPUs
To Reproduce
Following the Examples chapter in the documentation, using the files in examples/new_project_templates as a starting point:
- Make the hyperparameters in `lightning_module_template.py` tunable, e.g. `learning_rate`.
- In `single_gpu_node_dp_template.py`, delete `main(hyperparams)` and, following the documentation, add `hyperparams.optimize_parallel_gpu(main, nb_trials=4, nb_workers=1, gpus=[0, 1, 2, 3])`.
- Running `python single_gpu_node_dp_template.py` on the GPU machine now raises `TypeError: optimize_parallel_gpu() got an unexpected keyword argument 'nb_trials'`.
- Looking at `argparse_hopt.py` for `HyperOptArgumentParser`, the function `def optimize_parallel_gpu` only accepts `max_nb_trials`, not `nb_trials`. There is also no `nb_workers` parameter, and no `gpus`, only `gpu_ids`.
- Changing the call in `single_gpu_node_dp_template.py` to `hyperparams.optimize_parallel_gpu(main, max_nb_trials=4, gpu_ids=[0, 1, 2, 3])` raises `TypeError: str expected, not int`, which indicates that `[0, 1, 2, 3]` should be changed to `['0,1,2,3']`.
- After that change, the next error is `TypeError: main() takes 1 positional argument but 2 were given`. The reason is that `def optimize_parallel_gpu_private` in `argparse_hopt.py` calls `results = train_function(trial_params, gpu_id_set)`, while our `main` function in `single_gpu_node_dp_template.py` takes only one parameter, `hparams`. Since updating `argparse_hopt.py` on the GPU server is more difficult, I added `gpu_id_set` as a second parameter of `main` (even though it is not used). A consolidated sketch of these changes is given after the traceback below.
- This time the script first prints `gpu available: False, used: False` and then crashes with:
terminate called after throwing an instance of 'c10::Error'
terminate called recursively
what(): CUDA error: initialization error (setDevice at /pytorch/c10/cuda/impl/CUDAGuardImpl.h:40)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fcbbb895273 in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc3ca (0x7fcbbbac83ca in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: torch::autograd::Engine::set_device(int) + 0x159 (0x7fcb261e8179 in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: torch::autograd::Engine::thread_init(int) + 0x1a (0x7fcb261e81aa in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #4: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcbb6ea892a in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xa75f (0x7fcbbc4b475f in /home/huangsiteng/.local/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x76ba (0x7fcbc04f06ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x6d (0x7fcbc022641d in /lib/x86_64-linux-gnu/libc.so.6)
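For reference, here is a minimal sketch that consolidates the changes described in the steps above (based on test-tube 0.6.9 as reported here; the `opt_list` call and the body of `main` are illustrative assumptions, not an excerpt from the actual template):
```python
# Sketch only: mirrors the changes described above for single_gpu_node_dp_template.py.
from test_tube import HyperOptArgumentParser


def main(hparams, gpu_id_set=None):
    # optimize_parallel_gpu_private calls train_function(trial_params, gpu_id_set),
    # so main must accept a second argument even though it is unused here.
    pass  # build the model / trainer from hparams here


if __name__ == '__main__':
    parser = HyperOptArgumentParser(strategy='grid_search')
    # mark a hyperparameter as tunable so it is included in the search
    parser.opt_list('--learning_rate', default=0.001, type=float,
                    options=[0.0001, 0.001, 0.01], tunable=True)
    hyperparams = parser.parse_args()

    # argparse_hopt.py currently exposes max_nb_trials and gpu_ids (a list of
    # strings), not nb_trials / nb_workers / gpus as shown in the documentation.
    hyperparams.optimize_parallel_gpu(
        main,
        max_nb_trials=4,
        gpu_ids=['0,1,2,3'],
    )
```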
However, there are no problems when using IPython to check whether CUDA is available:
In [1]: import torch
In [2]: torch.cuda.is_available()
Out[2]: True
In [3]: torch.cuda.get_device_name(0)
Out[3]: 'GeForce RTX 2080 Ti'
In [4]: torch.cuda.device_count()
Out[4]: 4
Expected behavior
The script should run correctly after the code is updated as described above.
Environment
- CUDA Version 10.0.130
- PyTorch version: 1.2.0
- Lightning version: 0.4.6
- Test-tube version: 0.6.9
Additional context
The documentation on hyperparameter search is too brief and no longer consistent with the current version.
- It would help to add an explanation of parameters such as `nb_workers`.
- Please update the parts that are inconsistent with the current version, e.g. `main_local`, `nb_trials`, and `gpus`.
- Is there a way to search over all hyperparameter combinations defined in the model without counting the combinations by hand and filling that number into the code? (A possible workaround is sketched below.)
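Regarding that last point, a possible workaround (not a built-in test-tube feature as far as I know, and the search space below is just an example) is to keep the option lists in one place and derive the trial count from them instead of counting by hand:
```python
# Workaround sketch: derive max_nb_trials from the option lists themselves so the
# number of grid-search combinations never has to be counted manually.
from functools import reduce
from operator import mul

from test_tube import HyperOptArgumentParser


def main(hparams, gpu_id_set=None):
    pass  # training code as in the sketch above


# hypothetical search space kept in one place
search_space = {
    '--learning_rate': [0.0001, 0.001, 0.01],
    '--batch_size': [32, 64],
}

parser = HyperOptArgumentParser(strategy='grid_search')
for name, options in search_space.items():
    parser.opt_list(name, default=options[0], type=type(options[0]),
                    options=options, tunable=True)

# total grid-search combinations = product of the option-list lengths
nb_combinations = reduce(mul, (len(opts) for opts in search_space.values()), 1)

hyperparams = parser.parse_args()
hyperparams.optimize_parallel_gpu(main, max_nb_trials=nb_combinations,
                                  gpu_ids=['0,1,2,3'])
```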
Changing from version 1.4.1 to 1.3.4 solved the problem …
@sophiajw have you tried the latest version on master?