PyTorch Lightning examples don't work on multiple GPUs with backend=dp
🐛 Bug
Information
Model I am using (Bert, XLNet …): Bert
Language I am using the model on (English, Chinese …): English
The problem arises when using:
- the official example scripts: run_pl.sh (run_pl_glue.py)
The task I am working on is:
- an official GLUE/SQuAD task: GLUE
To reproduce
Steps to reproduce the behavior:
- Run the run_pl.sh script with multiple GPUs (e.g., 8 GPUs); see the sketch below.
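For reference, here is a minimal sketch of the configuration the script sets up, reduced to a toy LightningModule. Everything below is illustrative rather than the official run_pl_glue.py: the module, data, and hyperparameters are stand-ins, and the `distributed_backend` argument matches the Lightning releases contemporary with transformers 2.8.0 (it was renamed in later versions).

```python
# Minimal sketch, NOT the official run_pl_glue.py: a toy module run with
# the same multi-GPU dp configuration this report says fails.
# Assumes pytorch-lightning ~0.7.x (the era of transformers 2.8.0).
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class ToyModule(pl.LightningModule):
    """Hypothetical stand-in for the GLUE transformer in the example."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {"loss": F.cross_entropy(self(x), y)}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        ds = torch.utils.data.TensorDataset(
            torch.randn(256, 32), torch.randint(0, 2, (256,))
        )
        return torch.utils.data.DataLoader(ds, batch_size=32)


# The failing setup from this report: multiple GPUs with the dp backend.
trainer = pl.Trainer(gpus=8, distributed_backend="dp", max_epochs=1)
trainer.fit(ToyModule())
```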
Expected behavior
GLUE training should run and complete successfully.
Environment info
- transformers version: 2.8.0
- Platform: Linux
- Python version: 3.7
- PyTorch version (GPU?): 1.4
- Tensorflow version (GPU?):
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: DataParallel
Top GitHub Comments
@leslyarun I am also facing a similar issue with the ddp backend (not exactly the same): github issue. My guess is that there may be a problem with the callback and with pickling the saved objects. For now I will try to save checkpoints manually, without using the callbacks; see the sketch below.
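As a hedged illustration of that workaround, this sketch saves weights manually after training instead of going through the ModelCheckpoint callback; the module, paths, and Trainer arguments are illustrative:

```python
# Sketch: bypass the ModelCheckpoint callback (suspected above of failing
# to pickle) and persist the model manually after training.
import torch
import pytorch_lightning as pl

model = ...  # the LightningModule being trained (illustrative placeholder)
trainer = pl.Trainer(gpus=8, distributed_backend="ddp",
                     checkpoint_callback=False)  # disable default checkpointing
trainer.fit(model)

# Plain PyTorch state dict: sidesteps Lightning's checkpointing entirely.
torch.save(model.state_dict(), "model_state.pt")

# Alternatively, Lightning's own checkpoint format; path is illustrative.
trainer.save_checkpoint("manual_checkpoint.ckpt")
```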
I can confirm that the issue occurs only when using multiple GPUs with dp as the backend; switching to ddp resolves it, as in the sketch below.
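The fix the comment reports is a one-argument change on the Trainer. A sketch, assuming the Lightning releases of that era, where the argument was named `distributed_backend` (later renamed):

```python
import pytorch_lightning as pl

# dp (fails per this report): a single process replicates the model on
# each GPU and splits every batch across them.
trainer = pl.Trainer(gpus=8, distributed_backend="dp")

# ddp (reported to work): one process per GPU with gradient all-reduce.
trainer = pl.Trainer(gpus=8, distributed_backend="ddp")
```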
I found one more issue: if I use fast tokenizers with ddp as the backend, I get the error below:
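The error text was not preserved above. If the ddp failure does come from the fast (Rust-backed) tokenizers, an assumption here since the traceback is missing, one common fallback is to load the slow Python tokenizer instead via `use_fast=False`:

```python
# Workaround sketch (assumption: the ddp error involves the fast tokenizer).
# use_fast=False selects the pure-Python tokenizer implementation.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
```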