Optuna seemingly stuck with multiple GPUs
- PyTorch-Forecasting version: 0.8.4
- PyTorch version: 1.8.0
- Python version: 3.8.8
- Operating System: CentOS
Expected behavior
I’m working through the “Demand forecasting with the Temporal Fusion Transformer” tutorial and am trying to run the optimize_hyperparameters step on two GPUs.
Actual behavior
I get some output, but it never finishes. With only a single GPU it finishes within minutes without any issues.
[I 2021-04-13 15:40:26,906] A new study created in memory with name: no-name-e455a085-bb8c-4052-a225-ef363fb68e4c
initializing ddp: GLOBAL_RANK: 1, MEMBER: 1/2
Code to reproduce the problem
https://pytorch-forecasting.readthedocs.io/en/latest/tutorials/stallion.html
This works:
study = optimize_hyperparameters(
    train_dataloader,
    val_dataloader,
    model_path="optuna_test",
    n_trials=200,
    max_epochs=50,
    gradient_clip_val_range=(0.01, 1.0),
    hidden_size_range=(8, 128),
    hidden_continuous_size_range=(8, 128),
    attention_head_size_range=(1, 4),
    learning_rate_range=(0.001, 0.1),
    dropout_range=(0.1, 0.3),
    trainer_kwargs=dict(limit_train_batches=30),
    reduce_on_plateau_patience=4,
    use_learning_rate_finder=False,  # use Optuna to find ideal learning rate or use in-built learning rate finder
)
Changing the trainer_kwargs line to the following, it no longer finishes:

    trainer_kwargs=dict(limit_train_batches=30, gpus=2),
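As an aside (not part of the original report): one way to sidestep DDP inside the tuning loop is to keep every trial on a single GPU and parallelize across GPUs at the trial level instead, running one worker process per GPU against a shared Optuna storage. Below is a minimal sketch, assuming the installed pytorch-forecasting exposes the study argument of optimize_hyperparameters and reusing the dataloaders built earlier in the tutorial; the WORKER_GPU variable and the SQLite path are illustrative, not from the thread.

import os

import optuna
from pytorch_forecasting.models.temporal_fusion_transformer.tuning import optimize_hyperparameters

# Pin this worker to one GPU before any CUDA work happens,
# e.g. launch once with WORKER_GPU=0 and once with WORKER_GPU=1.
os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["WORKER_GPU"]

# Both workers attach to the same study via a shared SQLite file,
# so completed trials from either GPU land in one place.
study = optuna.create_study(
    study_name="tft_tuning",
    storage="sqlite:///optuna_tft.db",
    direction="minimize",
    load_if_exists=True,
)

optimize_hyperparameters(
    train_dataloader,  # dataloaders from the tutorial
    val_dataloader,
    model_path="optuna_test",
    n_trials=100,  # trials per worker
    max_epochs=50,
    trainer_kwargs=dict(limit_train_batches=30, gpus=1),  # each trial stays on one GPU
    use_learning_rate_finder=False,
    study=study,
)

This trades per-trial speed for trial throughput, which is usually the better deal for hyperparameter search.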
Hi, I’m Kento Nozawa from the Optuna community. The latest version of Optuna’s PyTorch Lightning callback can handle distributed training! A minimal example is https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_ddp.py.
Best,
Running the examples leads to this issue: https://github.com/optuna/optuna-examples/issues/14
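For reference, the pattern the linked example demonstrates, an Optuna objective whose Trainer runs under DDP with the pruning callback attached, looks roughly like the condensed sketch below. The toy LightningModule stands in for the example’s model and is not the example verbatim; the Trainer arguments (gpus, accelerator="ddp") follow the Lightning 1.2-era API matching the versions in this issue, and whether it runs cleanly depends on the Optuna and Lightning versions involved, which is exactly what the follow-up issue above is about.

import optuna
from optuna.integration import PyTorchLightningPruningCallback
import pytorch_lightning as pl
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, TensorDataset


class ToyRegressor(pl.LightningModule):
    """Stand-in model; the real example tunes a small classifier."""

    def __init__(self, hidden_size: int, lr: float):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(8, hidden_size), nn.ReLU(), nn.Linear(hidden_size, 1))
        self.lr = lr

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.mse_loss(self.net(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        # sync_dist aggregates the metric across the DDP processes
        self.log("val_loss", F.mse_loss(self.net(x), y), sync_dist=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


def make_loader(n: int = 256) -> DataLoader:
    x = torch.randn(n, 8)
    y = x.sum(dim=1, keepdim=True)
    return DataLoader(TensorDataset(x, y), batch_size=32)


def objective(trial: optuna.Trial) -> float:
    model = ToyRegressor(
        hidden_size=trial.suggest_int("hidden_size", 8, 128),
        lr=trial.suggest_float("lr", 1e-3, 1e-1, log=True),
    )
    trainer = pl.Trainer(
        max_epochs=5,
        gpus=2,
        accelerator="ddp",
        callbacks=[PyTorchLightningPruningCallback(trial, monitor="val_loss")],
    )
    trainer.fit(model, make_loader(), make_loader())
    return trainer.callback_metrics["val_loss"].item()


if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=20)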