
PyTorch Lightning examples don't work on multiple GPUs with backend=dp

See original GitHub issue

🐛 Bug

Information

Model I am using (Bert, XLNet …): Bert

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • the official example scripts: run_pl.sh (run_pl_glue.py)

The task I am working on is:

  • an official GLUE/SQuAD task: GLUE

To reproduce

Steps to reproduce the behavior:

  1. Run the run_pl.sh script on multiple GPUs (e.g., 8 GPUs) with dp as the backend; see the sketch below for the kind of Trainer configuration this corresponds to.
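
For context, this is roughly the Trainer configuration the dp code path maps to in the pytorch-lightning version of that era (~0.7.x, as used alongside transformers 2.8.0). The exact flags set by run_pl.sh / transformer_base.py may differ; this is only a minimal sketch.

# Minimal sketch (not the actual transformer_base.py code): how the dp vs. ddp
# backend is chosen in pytorch-lightning ~0.7.x. The failing case reported here
# is distributed_backend="dp" with more than one GPU.
import pytorch_lightning as pl

def build_trainer(n_gpus: int, backend: str = "dp") -> pl.Trainer:
    # backend="dp"  -> torch.nn.DataParallel (single process, model replicated per GPU)
    # backend="ddp" -> DistributedDataParallel (one process per GPU)
    return pl.Trainer(
        gpus=n_gpus,
        distributed_backend=backend,
        max_epochs=3,
    )

trainer = build_trainer(n_gpus=8, backend="dp")  # "dp" fails; "ddp" works per the comments below
# trainer.fit(model)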

Expected behavior

GLUE training should run.

Environment info

  • transformers version: 2.8.0
  • Platform: Linux
  • Python version: 3.7
  • PyTorch version (GPU?): 1.4
  • Tensorflow version (GPU?):
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: DataParallel

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 28 (14 by maintainers)

Top GitHub Comments

3 reactions
mmiakashs commented, Apr 24, 2020

> I found one more issue. If I use fast tokenizers with ddp as backend, I get the below error:

@leslyarun I am also facing a similar issue with the ddp backend (not exactly the same): github issue. My guess is that there is an issue with the callbacks and saving objects with pickle. For now I will try to manually save checkpoints without using the callbacks, as sketched below.
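
A minimal sketch of the kind of manual checkpointing described here: disable the checkpoint callback and save the weights directly with torch.save. It assumes the LightningModule from run_pl_glue.py is available as pl_module and holds the transformers model as pl_module.model; both names are illustrative, not taken from the issue.

# Sketch: skip the ModelCheckpoint callback and save plain PyTorch weights instead.
# `pl_module` and the `pl_module.model` attribute are assumptions for illustration.
import torch
import pytorch_lightning as pl

trainer = pl.Trainer(gpus=2, distributed_backend="ddp", checkpoint_callback=False)
trainer.fit(pl_module)

# torch.save on a state_dict only serializes tensors and parameter names,
# so no Lightning callback (and no tokenizer object) has to be pickled.
torch.save(pl_module.model.state_dict(), "glue_model_weights.pt")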

2 reactions
leslyarun commented, Apr 24, 2020

I can confirm that the issue occurs only when using multiple GPUs with dp as the backend. Using ddp solves the issue.

I found one more issue. If I use fast tokenizers with ddp as backend, I get the below error:

INFO:lightning:GPU available: True, used: True
INFO:lightning:CUDA_VISIBLE_DEVICES: [0,1]
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/warnings.py:18: RuntimeWarning: You have defined a `val_dataloader()` and have defined a `validation_step()`, you may also want to define `validation_epoch_end()` for accumulating stats.
  warnings.warn(*args, **kwargs)
Traceback (most recent call last):
  File "run_pl_glue.py", line 187, in <module>
    trainer = generic_train(model, args)
  File "/home/jupyter/transformers/examples/transformer_base.py", line 310, in generic_train
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 734, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle Tokenizer objects
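
The traceback points at the ddp spawn path: mp.spawn pickles the whole LightningModule to hand it to the worker processes, and the Rust-backed fast Tokenizer held by the module apparently could not be pickled in the tokenizers release shipped with transformers 2.8.0. Below is a minimal sketch of the usual workaround, falling back to the slow (pure-Python) tokenizer via the standard use_fast flag; whether run_pl_glue.py exposes this as a command-line option is not confirmed here.

# Sketch: avoid keeping an unpicklable fast Tokenizer on the LightningModule
# when running with ddp (mp.spawn must pickle the module for each worker).
from transformers import AutoTokenizer

# Slow, pure-Python tokenizer: picklable, so ddp spawn works.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)

# Fast, Rust-backed tokenizer: the object that triggers
# "TypeError: can't pickle Tokenizer objects" in the traceback above.
# tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)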
