
Optuna seemingly stuck with multiple GPUs

See original GitHub issue
  • PyTorch-Forecasting version: 0.8.4
  • PyTorch version: 1.8.0
  • Python version: 3.8.8
  • Operating System: CentOS

Expected behavior

I’m working through the Demand forecasting with the Temporal Fusion Transformer tutorial and am trying to run the optimize_hyperparameters step on two GPUs.

Actual behavior

I get some output, but it never finishes. With only a single GPU utilized it finishes within minutes without any issues.

[I 2021-04-13 15:40:26,906] A new study created in memory with name: no-name-e455a085-bb8c-4052-a225-ef363fb68e4c
initializing ddp: GLOBAL_RANK: 1, MEMBER: 1/2
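Editor's note: the log above hints at the root cause. DDP launches an extra Python process per additional GPU, and a study "created in memory" lives only inside the process that created it, so the ranks end up with separate, disagreeing studies. The mechanism can be sketched with the standard library alone (no Optuna or Lightning needed; all names here are illustrative):

```python
import subprocess
import sys
import textwrap

# "Rank 0": the launching process holds an in-memory list of trials,
# standing in for an Optuna study created with in-memory storage.
parent_trials = ["trial-0"]

# "Rank 1": a fresh interpreter, like a DDP-spawned worker. It starts
# from scratch, so the parent's in-memory state is invisible to it.
child_code = textwrap.dedent("""
    trials = []          # a brand-new, empty in-memory store
    print(len(trials))
""")
out = subprocess.run([sys.executable, "-c", child_code],
                     capture_output=True, text=True)
child_view = int(out.stdout.strip())

print(child_view, len(parent_trials))  # child sees 0 trials, parent sees 1
```

Each rank creating its own study is consistent with the single `A new study created in memory` line appearing alongside the `initializing ddp` message for GLOBAL_RANK 1.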

Code to reproduce the problem

https://pytorch-forecasting.readthedocs.io/en/latest/tutorials/stallion.html

This works:

study = optimize_hyperparameters(
    train_dataloader,
    val_dataloader,
    model_path="optuna_test",
    n_trials=200,
    max_epochs=50,
    gradient_clip_val_range=(0.01, 1.0),
    hidden_size_range=(8, 128),
    hidden_continuous_size_range=(8, 128),
    attention_head_size_range=(1, 4),
    learning_rate_range=(0.001, 0.1),
    dropout_range=(0.1, 0.3),
    trainer_kwargs=dict(limit_train_batches=30),
    reduce_on_plateau_patience=4,
    use_learning_rate_finder=False,  # use Optuna to find ideal learning rate or use in-built learning rate finder
)

Changing this, it doesn’t work anymore:

    trainer_kwargs=dict(limit_train_batches=30, gpus=2),
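Editor's note: with `gpus=2`, the PyTorch Lightning versions of that era pick a DDP-style backend that starts extra processes, while the study here lives in memory (per the log above), which fits the observed hang. Two hedged workarounds, which are assumptions on my part rather than fixes confirmed in this thread:

```python
# Hypothetical alternatives for trainer_kwargs (not from the thread).

# 1) Keep everything in one process: the DP backend replicates the model
#    across GPUs but runs a single Python process, so the in-memory
#    Optuna study is never duplicated across ranks.
single_process_kwargs = dict(limit_train_batches=30, gpus=2, accelerator="dp")

# 2) If DDP is required, point the study at persistent storage so every
#    rank reads and writes the same set of trials.
storage_url = "sqlite:///optuna_tft.db"  # hypothetical SQLite file
```

Whether `optimize_hyperparameters` exposes a storage argument depends on the pytorch-forecasting version; option 2 may require calling Optuna directly.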

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

3 reactions
nzw0301 commented, Jan 4, 2022

Hi, I’m Kento Nozawa from the Optuna community. The latest Optuna PyTorch Lightning callback can handle distributed training! A minimal example is https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_ddp.py.

Best,
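Editor's note: the linked example works because every DDP rank reports to one shared study instead of its own in-memory copy. That "shared store" idea can be sketched with stdlib sqlite3 standing in for Optuna's RDB storage (all names here are hypothetical, not Optuna's API):

```python
import os
import sqlite3
import tempfile

# A shared on-disk store that several processes can open, unlike an
# in-memory object that exists only inside the process that created it.
db_path = os.path.join(tempfile.mkdtemp(), "trials.db")

def record_trial(path, rank, value):
    # Each "rank" opens its own connection but writes to the same file.
    with sqlite3.connect(path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS trials (rank INT, value REAL)")
        conn.execute("INSERT INTO trials VALUES (?, ?)", (rank, value))

record_trial(db_path, 0, 0.12)   # rank 0 reports a trial
record_trial(db_path, 1, 0.34)   # rank 1 appends to the same store

with sqlite3.connect(db_path) as conn:
    rows = conn.execute("SELECT COUNT(*) FROM trials").fetchone()[0]
print(rows)  # both trials are visible: 2
```

In Optuna itself, the analogous move is creating the study with a storage URL (e.g. SQLite) so all ranks load the same study.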

0 reactions
DeastinY commented, May 27, 2021

Running the examples leads to this issue: https://github.com/optuna/optuna-examples/issues/14

