
PyTorch Lightning examples don't work on multiple GPUs with backend=dp

See original GitHub issue

🐛 Bug

Information

Model I am using (Bert, XLNet …): Bert

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • the official example scripts: run_pl.sh (run_pl_glue.py)

The task I am working on is:

  • an official GLUE/SQuAD task: GLUE

To reproduce

Steps to reproduce the behavior:

  1. Run the run_pl.sh script on multiple GPUs (e.g., 8 GPUs) with dp as the backend; see the sketch below for the kind of Trainer configuration this corresponds to.
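
For context, this is roughly the Trainer configuration the dp code path maps to in the pytorch-lightning version of that era (~0.7.x, as used alongside transformers 2.8.0). The exact flags set by run_pl.sh / transformer_base.py may differ; this is only a minimal sketch.

# Minimal sketch (not the actual transformer_base.py code): how the dp vs. ddp
# backend is chosen in pytorch-lightning ~0.7.x. The failing case reported here
# is distributed_backend="dp" with more than one GPU.
import pytorch_lightning as pl

def build_trainer(n_gpus: int, backend: str = "dp") -> pl.Trainer:
    # backend="dp"  -> torch.nn.DataParallel (single process, model replicated per GPU)
    # backend="ddp" -> DistributedDataParallel (one process per GPU)
    return pl.Trainer(
        gpus=n_gpus,
        distributed_backend=backend,
        max_epochs=3,
    )

trainer = build_trainer(n_gpus=8, backend="dp")  # "dp" fails; "ddp" works per the comments below
# trainer.fit(model)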

Expected behavior

GLUE training should run.

Environment info

  • transformers version: 2.8.0
  • Platform: Linux
  • Python version: 3.7
  • PyTorch version (GPU?): 1.4
  • Tensorflow version (GPU?):
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: DataParallel

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 28 (14 by maintainers)

Top GitHub Comments

3 reactions
mmiakashs commented, Apr 24, 2020

> I found one more issue. If I use fast tokenizers with ddp as backend, I get the below error:

@leslyarun I am also facing a similar issue with the ddp backend (not exactly the same): github issue. My guess is that there is an issue with the callbacks and saving objects with pickle. For now I will try to manually save checkpoints without using the callbacks, as sketched below.
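
A minimal sketch of the kind of manual checkpointing described here: disable the checkpoint callback and save the weights directly with torch.save. It assumes the LightningModule from run_pl_glue.py is available as pl_module and holds the transformers model as pl_module.model; both names are illustrative, not taken from the issue.

# Sketch: skip the ModelCheckpoint callback and save plain PyTorch weights instead.
# `pl_module` and the `pl_module.model` attribute are assumptions for illustration.
import torch
import pytorch_lightning as pl

trainer = pl.Trainer(gpus=2, distributed_backend="ddp", checkpoint_callback=False)
trainer.fit(pl_module)

# torch.save on a state_dict only serializes tensors and parameter names,
# so no Lightning callback (and no tokenizer object) has to be pickled.
torch.save(pl_module.model.state_dict(), "glue_model_weights.pt")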

2 reactions
leslyarun commented, Apr 24, 2020

I can confirm that the issue occurs only when using multiple GPUs with dp as the backend. Using ddp solves the issue.

I found one more issue. If I use fast tokenizers with ddp as backend, I get the below error:

INFO:lightning:GPU available: True, used: True
INFO:lightning:CUDA_VISIBLE_DEVICES: [0,1]
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/warnings.py:18: RuntimeWarning: You have defined a `val_dataloader()` and have defined a `validation_step()`, you may also want to define `validation_epoch_end()` for accumulating stats.
  warnings.warn(*args, **kwargs)
Traceback (most recent call last):
  File "run_pl_glue.py", line 187, in <module>
    trainer = generic_train(model, args)
  File "/home/jupyter/transformers/examples/transformer_base.py", line 310, in generic_train
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 734, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle Tokenizer objects
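
The traceback points at the ddp spawn path: mp.spawn pickles the whole LightningModule to hand it to the worker processes, and the Rust-backed fast Tokenizer held by the module apparently could not be pickled in the tokenizers release shipped with transformers 2.8.0. Below is a minimal sketch of the usual workaround, falling back to the slow (pure-Python) tokenizer via the standard use_fast flag; whether run_pl_glue.py exposes this as a command-line option is not confirmed here.

# Sketch: avoid keeping an unpicklable fast Tokenizer on the LightningModule
# when running with ddp (mp.spawn must pickle the module for each worker).
from transformers import AutoTokenizer

# Slow, pure-Python tokenizer: picklable, so ddp spawn works.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)

# Fast, Rust-backed tokenizer: the object that triggers
# "TypeError: can't pickle Tokenizer objects" in the traceback above.
# tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)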
