
PyTorch 1.5 DataParallel

See original GitHub issue

šŸ› Bug

Information

Can't run forward in PyTorch 1.5.0, works fine in 1.4.0

Model I am using (Bert, XLNet …): XLNet

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

Transformer + custom head + custom losses + differential learning rates; I don't think it matters.

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

Custom news classification

To reproduce

Steps to reproduce the behavior:

  1. Install PyTorch 1.5.0
  2. Run forward on XLNet; it fails with the traceback below (a minimal repro sketch follows the traceback):

  File "transformers/modeling_xlnet.py", line 761, in forward
    dtype_float = next(self.parameters()).dtype
StopIteration
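
For context, here is a minimal sketch of the failure pattern (mine, not from the issue): it assumes torch 1.5.0 and at least two visible GPUs, and mimics the dtype lookup that modeling_xlnet.py performs. The commonly reported cause is that in 1.5 the per-GPU replicas created by nn.DataParallel no longer expose anything through .parameters(), so the lookup inside the replicated forward raises StopIteration.

    import torch
    import torch.nn as nn

    class Toy(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(4, 4)

        def forward(self, x):
            # Same pattern as modeling_xlnet.py: take the dtype of the first
            # parameter. On torch 1.5, DataParallel replicas have no
            # registered parameters, so next() raises StopIteration here.
            dtype_float = next(self.parameters()).dtype
            return self.linear(x).to(dtype_float)

    model = nn.DataParallel(Toy().cuda())   # needs >= 2 visible GPUs
    model(torch.randn(8, 4).cuda())         # StopIteration on torch 1.5.0

The same call works on torch 1.4.0, or when the model is not wrapped in DataParallel.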

Expected behavior

Runs forward

Environment info

  • transformers version: 2.8.0
  • Platform: Ubuntu 18.04
  • Python version: 3.7 (Anaconda)
  • PyTorch version (GPU?): 1.5, Yes
  • Tensorflow version (GPU?): N/A
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 19
  • Comments: 26 (8 by maintainers)

Top GitHub Comments

12 reactions
ArthurCamara commented, May 1, 2020

Same problem here, running BERT.

torch==1.5.0
transformers==2.8.0

I'm running on GPUs, using export CUDA_VISIBLE_DEVICES=5,6,7 before running (I have 8 1080 Tis on this server).

run_language_modeling.py --output_dir=models --model_type=bert --model_name_or_path=bert-base-uncased --do_train --train_data_file=Vol45.sample --mlm --save_steps=2000 --line_by_line --per_gpu_train_batch_size=8

Vol45.sample is a .txt with one doc per line

EDIT: It seems to work if I downgrade pytorch to 1.4
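
If downgrading is not an option, a possible stopgap (my sketch, not something suggested in this thread) is to stay on torch 1.5 but expose only one GPU; the example script only wraps the model in nn.DataParallel when it sees more than one device, so the failing code path is never taken:

    # Set before torch initializes CUDA (e.g. at the very top of the script,
    # or via `export CUDA_VISIBLE_DEVICES=5` in the shell).
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "5"   # expose a single 1080 Ti

    import torch
    assert torch.cuda.device_count() == 1      # DataParallel is never applied

This obviously gives up multi-GPU data parallelism, so it is only a way to keep training until the dtype lookup is fixed upstream.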

6 reactions
julien-c commented, May 8, 2020

Just to scope this bug a little bit better, all of you are using torch.nn.DataParallel (not DistributedDataParallel or single-GPU), correct?
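
For what it's worth, DistributedDataParallel does not hit this code path: each process owns a full copy of the module with its parameters still registered, and there is no per-forward replication. A rough sketch, assuming one process per GPU launched with torch.distributed.launch; the XLNet/encode_plus usage is only illustrative of the transformers 2.x API, not taken from this thread:

    # launch: python -m torch.distributed.launch --nproc_per_node=3 train_ddp.py
    import argparse
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from transformers import XLNetModel, XLNetTokenizer

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(args.local_rank)

    tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
    model = XLNetModel.from_pretrained("xlnet-base-cased").cuda(args.local_rank)

    # Each DDP process owns the full module, so next(self.parameters())
    # inside forward() still finds a parameter on torch 1.5.
    model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)

    enc = tokenizer.encode_plus("a quick smoke test", return_tensors="pt")
    outputs = model(enc["input_ids"].cuda(args.local_rank))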

Read more comments on GitHub >

Top Results From Across the Web

Optional: Data Parallelism - PyTorch Tutorials 1.13.0+cu117
  DataParallel splits your data automatically and sends job orders to multiple models on several GPUs. After each model finishes their job, DataParallel collects...

Source code for torch_xla.distributed.data_parallel - PyTorch
  class DataParallel(object): """Enable the execution of a model network in replicated mode using threads. Args: network (:class:`torch.nn....

DataParallel - PyTorch 1.13 documentation
  Implements data parallelism at the module level. This container parallelizes the application of the given module by splitting the input across the specified... (a short usage sketch follows this list)

How is it possible to move a model wrapped in DataParallel to ...
  How can I convert a model that's been trained on multiple GPUs (wrapped in DataParallel) to cpu? ... and by the way this...

Performance Tuning Guide - PyTorch
  PyTorch 1.5 introduced support for channels_last memory format for convolutional networks. ... PyTorch has two ways to implement data-parallel training:
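
To make the doc snippets above concrete, this is the usual wrapping they describe (a generic sketch, unrelated to the failing XLNet path): the batch is scattered along dim 0 across the visible GPUs and the outputs are gathered back on the first device.

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)
    if torch.cuda.device_count() > 1:
        # Replicates the module onto every visible GPU on each forward call,
        # scatters the batch along dim 0, and gathers outputs on device 0.
        model = nn.DataParallel(model)
    model = model.to("cuda")

    x = torch.randn(32, 10, device="cuda")
    y = model(x)   # shape (32, 2), gathered on cuda:0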
