
PyTorch 1.5 DataParallel

See original GitHub issue

šŸ› Bug

Information

Can't run forward in PyTorch 1.5.0, works fine in 1.4.0

Model I am using (Bert, XLNet …): XLNet

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

Transformer + custom head + custom losses + differential learning rates; I don't think it matters.

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

Custom news classification

To reproduce

Steps to reproduce the behavior:

  1. Install PyTorch 1.5.0
  2. Run forward on XLNet; it fails with the traceback below (a minimal repro sketch follows the traceback):

  File "transformers/modeling_xlnet.py", line 761, in forward
    dtype_float = next(self.parameters()).dtype
StopIteration
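
For context, here is a minimal sketch of the failure pattern (mine, not from the issue): it assumes torch 1.5.0 and at least two visible GPUs, and mimics the dtype lookup that modeling_xlnet.py performs. The commonly reported cause is that in 1.5 the per-GPU replicas created by nn.DataParallel no longer expose anything through .parameters(), so the lookup inside the replicated forward raises StopIteration.

    import torch
    import torch.nn as nn

    class Toy(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(4, 4)

        def forward(self, x):
            # Same pattern as modeling_xlnet.py: take the dtype of the first
            # parameter. On torch 1.5, DataParallel replicas have no
            # registered parameters, so next() raises StopIteration here.
            dtype_float = next(self.parameters()).dtype
            return self.linear(x).to(dtype_float)

    model = nn.DataParallel(Toy().cuda())   # needs >= 2 visible GPUs
    model(torch.randn(8, 4).cuda())         # StopIteration on torch 1.5.0

The same call works on torch 1.4.0, or when the model is not wrapped in DataParallel.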

Expected behavior

Runs forward

Environment info

  • transformers version: 2.8.0
  • Platform: Ubuntu 18.04
  • Python version: 3.7 (Anaconda)
  • PyTorch version (GPU?): 1.5, Yes
  • Tensorflow version (GPU?): N/A
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 19
  • Comments: 26 (8 by maintainers)

Top GitHub Comments

12 reactions
ArthurCamara commented, May 1, 2020

Same problem here, running BERT.

torch==1.5.0
transformers==2.8.0

I'm running on GPUs, using export CUDA_VISIBLE_DEVICES=5,6,7 before running (I have 8 1080 Tis on this server).

run_language_modeling.py --output_dir=models --model_type=bert --model_name_or_path=bert-base-uncased --do_train --train_data_file=Vol45.sample --mlm --save_steps=2000 --line_by_line --per_gpu_train_batch_size=8

Vol45.sample is a .txt with one doc per line

EDIT: It seems to work if I downgrade pytorch to 1.4
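
If downgrading is not an option, a possible stopgap (my sketch, not something suggested in this thread) is to stay on torch 1.5 but expose only one GPU; the example script only wraps the model in nn.DataParallel when it sees more than one device, so the failing code path is never taken:

    # Set before torch initializes CUDA (e.g. at the very top of the script,
    # or via `export CUDA_VISIBLE_DEVICES=5` in the shell).
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "5"   # expose a single 1080 Ti

    import torch
    assert torch.cuda.device_count() == 1      # DataParallel is never applied

This obviously gives up multi-GPU data parallelism, so it is only a way to keep training until the dtype lookup is fixed upstream.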

6 reactions
julien-c commented, May 8, 2020

Just to scope this bug a little bit better, all of you are using torch.nn.DataParallel (not DistributedDataParallel or single-GPU), correct?
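
For what it's worth, DistributedDataParallel does not hit this code path: each process owns a full copy of the module with its parameters still registered, and there is no per-forward replication. A rough sketch, assuming one process per GPU launched with torch.distributed.launch; the XLNet/encode_plus usage is only illustrative of the transformers 2.x API, not taken from this thread:

    # launch: python -m torch.distributed.launch --nproc_per_node=3 train_ddp.py
    import argparse
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from transformers import XLNetModel, XLNetTokenizer

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(args.local_rank)

    tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
    model = XLNetModel.from_pretrained("xlnet-base-cased").cuda(args.local_rank)

    # Each DDP process owns the full module, so next(self.parameters())
    # inside forward() still finds a parameter on torch 1.5.
    model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)

    enc = tokenizer.encode_plus("a quick smoke test", return_tensors="pt")
    outputs = model(enc["input_ids"].cuda(args.local_rank))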

Read more comments on GitHub >

Top Results From Across the Web

Optional: Data Parallelism - PyTorch Tutorials 1.13.0+cu117
  DataParallel splits your data automatically and sends job orders to multiple models on several GPUs. After each model finishes their job, DataParallel collects...

Source code for torch_xla.distributed.data_parallel - PyTorch
  class DataParallel(object): """Enable the execution of a model network in replicated mode using threads. Args: network (:class:`torch.nn....

DataParallel - PyTorch 1.13 documentation
  Implements data parallelism at the module level. This container parallelizes the application of the given module by splitting the input across the specified... (a short usage sketch follows this list)

How is it possible to move a model wrapped in DataParallel to ...
  How can I convert a model that's been trained on multiple GPUs (wrapped in DataParallel) to cpu? ... and by the way this...

Performance Tuning Guide - PyTorch
  PyTorch 1.5 introduced support for channels_last memory format for convolutional networks. ... PyTorch has two ways to implement data-parallel training:
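
To make the doc snippets above concrete, this is the usual wrapping they describe (a generic sketch, unrelated to the failing XLNet path): the batch is scattered along dim 0 across the visible GPUs and the outputs are gathered back on the first device.

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)
    if torch.cuda.device_count() > 1:
        # Replicates the module onto every visible GPU on each forward call,
        # scatters the batch along dim 0, and gathers outputs on device 0.
        model = nn.DataParallel(model)
    model = model.to("cuda")

    x = torch.randn(32, 10, device="cuda")
    y = model(x)   # shape (32, 2), gathered on cuda:0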
