No error message when distributed_backend = "invalid choice", Trainer runs on CPU
🐛 Bug
I'm trying to implement and run a new BERT-based model. As always, I used the `gpus` option, but strangely my model is still running on the CPU. I know this from: 1. the training is too slow; 2. `print(self.device)` -> `"cpu"`; 3. the logs (right below). I never encountered this before, so I'm confused. I'm using pytorch-lightning==0.9.0.
```
[2020-09-05 08:54:00,565][lightning][INFO] - GPU available: True, used: True
[2020-09-05 08:54:00,565][lightning][INFO] - TPU available: False, using: 0 TPU cores
[2020-09-05 08:54:00,566][lightning][INFO] - CUDA_VISIBLE_DEVICES: [0]
```
(GPU memory is used, but GPU utilization stays at zero.)
I also attach a strange warning message I see:

```
...\pytorch_lightning\utilities\distributed.py:37: UserWarning: Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given
```
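That warning concerns graph logging, not device placement; it goes away if the module sets `example_input_array`. A minimal sketch, where the shape and dtype are placeholder assumptions rather than ColBERT's real input signature:

```python
import torch
import pytorch_lightning as pl

class ColBERTLightning(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        # ... model setup as in the snippet below ...
        # Loggers trace the computational graph through this attribute.
        # Shape/dtype here are placeholder assumptions, not ColBERT's
        # actual input signature.
        self.example_input_array = torch.zeros(
            hparams.train.batch_size, hparams.model.query_maxlen, dtype=torch.long
        )
```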
The code for the model I initialize inside my LightningModule is here (ColBERT). Below is how I initialize my LightningModule; `ColBERT.from_pretrained()` initializes the model from that link. I `print(self.device)` at the end of `__init__` and I see `"cpu"` as a result.
```python
class ColBERTLightning(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.hparams = hparams
        # BERT-based sub-module initialized here
        model_params = hparams.model
        self.model = ColBERT.from_pretrained(
            model_params.base,
            query_maxlen=model_params.query_maxlen,
            doc_maxlen=model_params.doc_maxlen,
            dim=model_params.projection_dim,
            similarity_metric=model_params.similarity_metric,
        )
        self.labels = torch.zeros(
            hparams.train.batch_size, dtype=torch.long, device=self.device
        )
        print(self.device)  # it prints "cpu" even when I use gpus=1
```
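Worth noting: `"cpu"` inside `__init__` is expected even when training does use the GPU, because Lightning moves the module to its device only after construction. A sketch of checking at a point where the move has already happened, using the standard `on_train_start` hook:

```python
import pytorch_lightning as pl

class ColBERTLightning(pl.LightningModule):
    # ... __init__ as above ...

    def on_train_start(self):
        # By now Lightning has moved the module to its target device,
        # so this prints "cuda:0" when the GPU is really in use.
        print("device at train start:", self.device)
```

(Here, of course, the slow training and the zero GPU utilization show the model genuinely is stuck on the CPU.)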
This is the code for the trainer. I'm using hydra and a `DataModule`, and I use pandas inside the `DataModule` to load data.
```python
@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))
    hparams = cfg
    # if hparams.train.gpus is not None:
    #     hparams.train.gpus = str(hparams.train.gpus)

    # init model
    model = ColBERTLightning(hparams)

    # init data module
    data_dir = hparams.dataset.dir
    batch_size = hparams.train.batch_size
    dm = TripleTextDataModule(data_dir, batch_size=batch_size)
    # dm.setup("fit")

    # logger
    source_files_path = str(Path(hydra.utils.get_original_cwd()) / "**/*.py")
    ## TODO: Neptune or wandb?

    # trainer
    trainer = Trainer(
        accumulate_grad_batches=hparams.train.accumulate_grad_batches,
        distributed_backend=hparams.train.distributed_backend,
        fast_dev_run=hparams.train.fast_dev_run,
        gpus=hparams.train.gpus,
        auto_select_gpus=True,
        gradient_clip_val=hparams.train.gradient_clip_val,
        max_steps=hparams.train.max_steps,
        benchmark=True,
        profiler=hparams.train.use_profiler,
        # profiler=AdvancedProfiler(),
        # sync_batchnorm=True,
        # log_gpu_memory="min_max",
    )

    # fit
    trainer.fit(model, dm)
```
Environment
Two environments I've tested.
* CUDA:
- GPU:
- GeForce GTX 1080 Ti
- GeForce GTX 1080 Ti
- available: True
- version: 10.2
* Packages:
- numpy: 1.18.1
- pyTorch_debug: False
- pyTorch_version: 1.5.1
- pytorch-lightning: 0.9.0
- tensorboard: 2.2.0
- tqdm: 4.48.2
* System:
- OS: Linux
- architecture:
- 64bit
- processor: x86_64
- python: 3.7.6
- version: #46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020
* CUDA:
- GPU:
- GeForce GTX 1070 Ti
- available: True
- version: 10.2
* Packages:
- numpy: 1.18.5
- pyTorch_debug: False
- pyTorch_version: 1.6.0
- pytorch-lightning: 0.9.0
- tensorboard: 2.2.0
- tqdm: 4.46.1
* System:
- OS: Windows
- architecture:
- 64bit
- WindowsPE
- processor: AMD64 Family 23 Model 8 Stepping 2, AuthenticAMD
- python: 3.7.7
- version: 10.0.19041
In your config, you have `distributed_backend = None`. This is fine in principle, since `None` means PL will select the appropriate backend for you, depending on whether you have GPUs or not. However!!! hydra passes it along as the string `"None"` instead of the proper `None` built-in type. Trainer does not throw an error when we select an invalid choice for backend. We should change that. @PyTorchLightning/core-contributors
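Until Trainer validates the argument itself, here is a hedged sketch of a fail-fast guard you could add in your own `main()`; `ALLOWED_BACKENDS` is my own name, not a Lightning constant, and the set follows the 0.9-era backend names, so adjust it to whatever your installed version documents:

```python
# Hypothetical fail-fast guard for the backend string.
ALLOWED_BACKENDS = {None, "dp", "ddp", "ddp2", "ddp_spawn", "ddp_cpu", "horovod"}

def check_backend(backend):
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(
            f"Invalid distributed_backend: {backend!r} "
            "(note: the string 'None' is not the built-in None)"
        )

check_backend("None")  # raises ValueError -- the exact silent failure here
```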
To solve your issue: convert to the real type at runtime, or tell hydra the type somehow; I'm not sure if that's possible.
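A minimal sketch of that runtime conversion, using OmegaConf directly to show why it is needed (the nested `train.distributed_backend` layout mirrors the trainer code above):

```python
from omegaconf import OmegaConf

# YAML treats only `null` (or `~`) as null; a literal `None` in the config
# file arrives in Python as the *string* "None".
cfg = OmegaConf.create("train: {distributed_backend: None}")
print(type(cfg.train.distributed_backend))  # <class 'str'> -- not built-in None!

# Runtime fix, applied before the Trainer is constructed:
if cfg.train.distributed_backend == "None":
    cfg.train.distributed_backend = None

print(cfg.train.distributed_backend is None)  # True -- PL now auto-selects
```

Alternatively, write `distributed_backend: null` in the YAML itself, since YAML's `null` maps to the built-in `None`.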
Once you do, you will get an error about an input being on the wrong device, so somewhere in your code you have a tensor on the wrong device; move it, or create it on the right device. One other note: I saw that in your `__init__` you create `self.labels` with `device=self.device`.
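At that point in `__init__`, `self.device` is still `"cpu"`, so that tensor never follows the model to the GPU. One common fix, sketched here as my suggestion rather than something prescribed in the thread, is to register it as a buffer so it moves together with the module:

```python
import torch
import pytorch_lightning as pl

class ColBERTLightning(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        # ... ColBERT setup as above ...
        # Buffers travel with the module on .to()/.cuda(), so this tensor
        # lands on the GPU together with the model parameters.
        self.register_buffer(
            "labels", torch.zeros(hparams.train.batch_size, dtype=torch.long)
        )
```

Alternatively, create the tensor inside `training_step`, where `self.device` already points at the GPU.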
I will leave the rest of the debugging to you, since this is now about your tensors needing to be created on the correct device. If you need additional help, let me know; in the meantime I will make sure we print an error message for a wrong backend selection.