
Optuna seemingly stuck with multiple GPUs

See original GitHub issue
  • PyTorch-Forecasting version: 0.8.4
  • PyTorch version: 1.8.0
  • Python version: 3.8.8
  • Operating System: CentOS

Expected behavior

I’m working through the Demand forecasting with the Temporal Fusion Transformer tutorial and am trying to run the optimize_hyperparameters step on two GPUs.

Actual behavior

I get some output, but it never finishes. With only a single GPU utilized it finishes within minutes without any issues.

[I 2021-04-13 15:40:26,906] A new study created in memory with name: no-name-e455a085-bb8c-4052-a225-ef363fb68e4c
initializing ddp: GLOBAL_RANK: 1, MEMBER: 1/2
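Editor's note: the log above hints at the root cause. DDP launches an extra Python process per additional GPU, and a study "created in memory" lives only inside the process that created it, so the ranks end up with separate, disagreeing studies. The mechanism can be sketched with the standard library alone (no Optuna or Lightning needed; all names here are illustrative):

```python
import subprocess
import sys
import textwrap

# "Rank 0": the launching process holds an in-memory list of trials,
# standing in for an Optuna study created with in-memory storage.
parent_trials = ["trial-0"]

# "Rank 1": a fresh interpreter, like a DDP-spawned worker. It starts
# from scratch, so the parent's in-memory state is invisible to it.
child_code = textwrap.dedent("""
    trials = []          # a brand-new, empty in-memory store
    print(len(trials))
""")
out = subprocess.run([sys.executable, "-c", child_code],
                     capture_output=True, text=True)
child_view = int(out.stdout.strip())

print(child_view, len(parent_trials))  # child sees 0 trials, parent sees 1
```

Each rank creating its own study is consistent with the single `A new study created in memory` line appearing alongside the `initializing ddp` message for GLOBAL_RANK 1.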

Code to reproduce the problem

https://pytorch-forecasting.readthedocs.io/en/latest/tutorials/stallion.html

This works:

study = optimize_hyperparameters(
    train_dataloader,
    val_dataloader,
    model_path="optuna_test",
    n_trials=200,
    max_epochs=50,
    gradient_clip_val_range=(0.01, 1.0),
    hidden_size_range=(8, 128),
    hidden_continuous_size_range=(8, 128),
    attention_head_size_range=(1, 4),
    learning_rate_range=(0.001, 0.1),
    dropout_range=(0.1, 0.3),
    trainer_kwargs=dict(limit_train_batches=30),
    reduce_on_plateau_patience=4,
    use_learning_rate_finder=False,  # use Optuna to find ideal learning rate or use in-built learning rate finder
)

Changing this, it doesn’t work anymore:

    trainer_kwargs=dict(limit_train_batches=30, gpus=2),
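Editor's note: with `gpus=2`, the PyTorch Lightning versions of that era pick a DDP-style backend that starts extra processes, while the study here lives in memory (per the log above), which fits the observed hang. Two hedged workarounds, which are assumptions on my part rather than fixes confirmed in this thread:

```python
# Hypothetical alternatives for trainer_kwargs (not from the thread).

# 1) Keep everything in one process: the DP backend replicates the model
#    across GPUs but runs a single Python process, so the in-memory
#    Optuna study is never duplicated across ranks.
single_process_kwargs = dict(limit_train_batches=30, gpus=2, accelerator="dp")

# 2) If DDP is required, point the study at persistent storage so every
#    rank reads and writes the same set of trials.
storage_url = "sqlite:///optuna_tft.db"  # hypothetical SQLite file
```

Whether `optimize_hyperparameters` exposes a storage argument depends on the pytorch-forecasting version; option 2 may require calling Optuna directly.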

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

3 reactions
nzw0301 commented, Jan 4, 2022

Hi, I’m Kento Nozawa from the Optuna community. The latest Optuna PyTorch Lightning callback can handle distributed training! A minimal example is https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_ddp.py.

Best,
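Editor's note: the linked example works because every DDP rank reports to one shared study instead of its own in-memory copy. That "shared store" idea can be sketched with stdlib sqlite3 standing in for Optuna's RDB storage (all names here are hypothetical, not Optuna's API):

```python
import os
import sqlite3
import tempfile

# A shared on-disk store that several processes can open, unlike an
# in-memory object that exists only inside the process that created it.
db_path = os.path.join(tempfile.mkdtemp(), "trials.db")

def record_trial(path, rank, value):
    # Each "rank" opens its own connection but writes to the same file.
    with sqlite3.connect(path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS trials (rank INT, value REAL)")
        conn.execute("INSERT INTO trials VALUES (?, ?)", (rank, value))

record_trial(db_path, 0, 0.12)   # rank 0 reports a trial
record_trial(db_path, 1, 0.34)   # rank 1 appends to the same store

with sqlite3.connect(db_path) as conn:
    rows = conn.execute("SELECT COUNT(*) FROM trials").fetchone()[0]
print(rows)  # both trials are visible: 2
```

In Optuna itself, the analogous move is creating the study with a storage URL (e.g. SQLite) so all ranks load the same study.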

0 reactions
DeastinY commented, May 27, 2021

Running the examples leads to this issue: https://github.com/optuna/optuna-examples/issues/14

