
No error message when distributed_backend = "invalid choice", Trainer runs on CPU

See original GitHub issue

šŸ› Bug

I’m trying to implement and run a new BERT-based model. As always, I used the gpus option, but strangely my model is still running on the CPU. I know this from: 1. the training is too slow, 2. print(self.device) -> "cpu", 3. the logs (right below). I never encountered this before, so I’m confused. I’m using pytorch-lightning=0.9.0.

GPU available: True, used: True
[2020-09-05 08:54:00,565][lightning][INFO] - GPU available: True, used: True
TPU available: False, using: 0 TPU cores
[2020-09-05 08:54:00,565][lightning][INFO] - TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
[2020-09-05 08:54:00,566][lightning][INFO] - CUDA_VISIBLE_DEVICES: [0]

[Image: GPU memory used, but GPU utilization is zero]

I also attach a strange warning message that I see:

...\pytorch_lightning\utilities\distributed.py:37: UserWarning: Could not log 
computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given

The code for the model I initialize inside my LightningModule is here (ColBERT). Below is how I initialize my LightningModule; ColBERT.from_pretrained() initializes the model from that link. I print(self.device) at the end of __init__ and I see "cpu" as a result.

import torch
import pytorch_lightning as pl
# ColBERT comes from the model code linked above (its exact import path is not shown in the issue)


class ColBERTLightning(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.hparams = hparams
        # BERT-based sub-module initialized here
        model_params = hparams.model
        self.model = ColBERT.from_pretrained(
            model_params.base,
            query_maxlen=model_params.query_maxlen,
            doc_maxlen=model_params.doc_maxlen,
            dim=model_params.projection_dim,
            similarity_metric=model_params.similarity_metric,
        )
        self.labels = torch.zeros(
            hparams.train.batch_size, dtype=torch.long, device=self.device
        )
        print(self.device) # it prints "cpu" even when I use gpus=1

This is the code for the trainer. I’m using hydra and a DataModule, and I use pandas inside the DataModule to load the data.

from pathlib import Path

import hydra
from omegaconf import DictConfig, OmegaConf
from pytorch_lightning import Trainer
# ColBERTLightning (above) and TripleTextDataModule come from the project's own code


@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))
    hparams = cfg
    # if hparams.train.gpus is not None:
    #     hparams.train.gpus = str(hparams.train.gpus)

    # init model
    model = ColBERTLightning(hparams)

    # init data module
    data_dir = hparams.dataset.dir
    batch_size = hparams.train.batch_size
    dm = TripleTextDataModule(data_dir, batch_size=batch_size)
    # dm.setup("fit")

    # logger
    source_files_path = str(Path(hydra.utils.get_original_cwd()) / "**/*.py")
    ## TODO: Neptune or wandb?

    # # trainer
    trainer = Trainer(
        accumulate_grad_batches=hparams.train.accumulate_grad_batches,
        distributed_backend=hparams.train.distributed_backend,
        fast_dev_run=hparams.train.fast_dev_run,
        gpus=hparams.train.gpus,
        auto_select_gpus=True,
        gradient_clip_val=hparams.train.gradient_clip_val,
        max_steps=hparams.train.max_steps,
        benchmark=True,
        profiler=hparams.train.use_profiler,
        # profiler=AdvancedProfiler(),
        # sync_batchnorm=True,
        # log_gpu_memory="min_max",
    )

    # # fit
    trainer.fit(model, dm)

Environment

The two environments I’ve tested:

* CUDA:
        - GPU:
                - GeForce GTX 1080 Ti
                - GeForce GTX 1080 Ti
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.18.1
        - pyTorch_debug:     False
        - pyTorch_version:   1.5.1
        - pytorch-lightning: 0.9.0
        - tensorboard:       2.2.0
        - tqdm:              4.48.2
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                -
        - processor:         x86_64
        - python:            3.7.6
        - version:           #46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020
* CUDA:
        - GPU:
                - GeForce GTX 1070 Ti
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.18.5
        - pyTorch_debug:     False
        - pyTorch_version:   1.6.0
        - pytorch-lightning: 0.9.0
        - tensorboard:       2.2.0
        - tqdm:              4.46.1
* System:
        - OS:                Windows
        - architecture:
                - 64bit
                - WindowsPE
        - processor:         AMD64 Family 23 Model 8 Stepping 2, AuthenticAMD
        - python:            3.7.7
        - version:           10.0.19041

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

4 reactions
awaelchli commented, Sep 5, 2020

In your config, you have distributed_backend = None. This is fine, since None means PL will select the appropriate backend for you, depending on whether you have gpus or not. However!!!

print(type(hparams.train.distributed_backend))
# output: <class 'str'>

So the value is the string "None" instead of the proper built-in None type. The Trainer does not throw an error when an invalid choice is selected for the backend. We should change that. @PyTorchLightning/core-contributors

To solve your issue: convert to the real type at runtime, or tell hydra the type somehow (not sure if that’s possible).
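A minimal sketch of that runtime conversion (the cfg.train.distributed_backend path matches the trainer snippet above; the helper name is made up for illustration):

from omegaconf import DictConfig

def normalize_backend(cfg: DictConfig) -> None:
    # hydra/YAML parses `distributed_backend: None` as the string "None";
    # YAML's real null is spelled `null`, `~`, or left empty.
    backend = cfg.train.distributed_backend
    if isinstance(backend, str) and backend.lower() in ("none", "null", ""):
        cfg.train.distributed_backend = None  # real NoneType, not the string

Alternatively, write distributed_backend: null in the YAML config itself. When you do, you will get an error saying: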

torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Input, output and indices must be on the current device

So somewhere in your code you have an input on the wrong device; move it to (or create it on) the right device. One other note: I saw in your __init__ you have

self.labels = torch.zeros(hparams.train.batch_size, dtype=torch.long)
# but it should be registered as a buffer so it moves with the module:
self.register_buffer("labels", torch.zeros(hparams.train.batch_size, dtype=torch.long))
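The reason this matters: a tensor registered as a buffer is moved together with the module when Lightning calls .to(device), while a plain attribute created in __init__ stays wherever it was created (the CPU here). A tiny standalone sketch illustrating the difference (plain torch, not the ColBERT model):

import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        self.plain = torch.zeros(4)                       # plain attribute: not moved by .to()
        self.register_buffer("buffered", torch.zeros(4))  # buffer: moved with the module

if torch.cuda.is_available():
    m = Demo().to("cuda")
    print(m.plain.device)     # cpu  -> this is how "must be on the current device" errors happen
    print(m.buffered.device)  # cuda:0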

I will leave the rest of the debugging to you, since this is now about your tensors needing to be created on the correct device. If you need additional help, let me know; in the meantime I will make sure we print an error message for a wrong backend selection.

2 reactions
kyoungrok0517 commented, Sep 5, 2020

Or you may try reproducing it locally.

Download Code

git clone https://github.com/kyoungrok0517/sparse-neural-ranker
cd sparse-neural-ranker
pip install torch torchvision && pip install -e . && pip install -r requirements.txt

Download Data

mkdir -p data && cd data
wget https://storage.googleapis.com/kyoungrok-public/msmarco-passage-triple-text-sm/test.parquet
cp test.parquet train.parquet && cp test.parquet val.parquet

Run

python trainer.py dataset.dir="<DATA_DIR>" train.gpus=1

