Issue: Trainer error on `evaluate()` in multithreaded/distributed context (shape mismatch)

Environment info

  • transformers version: 4.3.3
  • Platform: Linux-5.4.83.1.fi-x86_64-with-centos-7.8.2003-Core
  • Python version: 3.7.3
  • PyTorch version (GPU?): 1.8.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes (both multi-node/multi-GPU and single-node multi-GPU settings)

Who can help

@LysandreJik @sgugger

Information

Model I am using: GPT2

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior: I have witnessed this error in two contexts, both using a custom torch.utils.data.IterableDataset.

First:

  1. specify dataloader_num_workers > 1 in TrainingArguments and run trainer.train() with an eval dataset
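
A minimal sketch of this first case (ToyDataset below is an invented stand-in for the reporter's custom dataset, and the hyperparameters are arbitrary; assumes transformers 4.3.x and PyTorch 1.8):

    import torch
    from torch.utils.data import IterableDataset
    from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

    class ToyDataset(IterableDataset):
        """A sized IterableDataset yielding fixed-length token blocks."""
        def __init__(self, num_samples=96, block_size=32):
            self.num_samples, self.block_size = num_samples, block_size

        def __len__(self):
            return self.num_samples  # makes the dataset abc.Sized

        def __iter__(self):
            for _ in range(self.num_samples):
                ids = torch.randint(0, 50257, (self.block_size,))  # GPT-2 vocab size
                yield {"input_ids": ids, "labels": ids.clone()}

    model = GPT2LMHeadModel.from_pretrained("gpt2")
    args = TrainingArguments(
        output_dir="out",
        dataloader_num_workers=2,   # > 1 is what triggers the mismatch
        evaluation_strategy="steps",
        eval_steps=10,
        max_steps=20,
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=ToyDataset(), eval_dataset=ToyDataset())
    trainer.train()                 # evaluation hits the broadcast ValueError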

Second:

  1. In distributed setting, fire up multiple training instances on separate nodes using the torch.distributed.launch command, run trainer.train() with an eval dataset
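
For this second case, the launcher invocation follows the standard torch.distributed.launch pattern (node counts, addresses, and the script name below are placeholders, not taken from the original report):

    python -m torch.distributed.launch --nproc_per_node=<gpus_per_node> \
        --nnodes=<num_nodes> --node_rank=<rank> \
        --master_addr=<addr> --master_port=<port> my_training_script.py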

Error message:

  File "/mnt/home/dberenberg/projects/metagenomics/huggingface_meta/lib/python3.7/site-packages/transformers/trainer.py", line 1655, in prediction_loop
    eval_losses_gatherer.add_arrays(self._gather_and_numpify(losses_host, "eval_losses"))
  File "/mnt/home/dberenberg/projects/metagenomics/huggingface_meta/lib/python3.7/site-packages/transformers/trainer_pt_utils.py", line 338, in add_arrays
    slice_len = self._nested_set_tensors(self._storage, arrays)
  File "/mnt/home/dberenberg/projects/metagenomics/huggingface_meta/lib/python3.7/site-packages/transformers/trainer_pt_utils.py", line 354, in _nested_set_tensors
    storage[self._offsets[i] : self._offsets[i] + slice_len] = arrays[i * slice_len : (i + 1) * slice_len]
ValueError: could not broadcast input array from shape (104,) into shape (96,)

The shape of the broadcast input array varies. In the first case it is dataloader_num_workers * expected_shape (the expected shape here being (96,)). The traceback above shows the error from the second (distributed) case.
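
The num_workers multiplier is consistent with documented DataLoader behavior: with a plain IterableDataset, every worker process receives a full replica of the dataset and iterates it end to end, so the loader yields num_workers copies of each sample. A standalone sketch (independent of Trainer) illustrating the duplication:

    from torch.utils.data import DataLoader, IterableDataset

    class Plain(IterableDataset):
        def __iter__(self):
            return iter(range(4))

    if __name__ == "__main__":
        loader = DataLoader(Plain(), num_workers=2, batch_size=1)
        items = sorted(x.item() for batch in loader for x in batch)
        print(items)  # [0, 0, 1, 1, 2, 2, 3, 3] -- every worker replays the data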

Expected behavior

The evaluate loop should run without error.

Dataset information

The dataset object is an IterableDataset that is also abc.Sized (it defines __len__).
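
"Sized" here means the dataset defines __len__, which is all that collections.abc.Sized checks for. A minimal sketch:

    import collections.abc
    from torch.utils.data import IterableDataset

    class SizedIterable(IterableDataset):
        def __init__(self, data):
            self.data = data
        def __iter__(self):
            return iter(self.data)
        def __len__(self):
            return len(self.data)  # defining __len__ is what satisfies abc.Sized

    assert isinstance(SizedIterable([1, 2, 3]), collections.abc.Sized)

This is presumably why Trainer accepted the dataset and sized the eval gatherer's storage at (96,) from len(eval_dataset) in the first place.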

Script information

The script is fairly generic, involving training and evaluating GPT2 via the Trainer object for next-token prediction.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, Apr 12, 2021

That’s correct, especially for distributed evaluation.

0 reactions
djberenberg commented, Apr 12, 2021

Ok, that makes sense. So just to conclude: transformers.Trainer won't work in a distributed setting with a torch.utils.data.IterableDataset, in principle because IterableDatasets are not amenable to that use case, since it isn't clear how to describe a distributed sampling procedure for them. Is that correct? Thanks in advance.
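
That conclusion matches the general PyTorch guidance: an IterableDataset must handle its own sharding. A hedged sketch of the usual pattern (ShardedIterable is hypothetical), splitting work across distributed ranks and DataLoader workers so each sample is yielded exactly once:

    import torch.distributed as dist
    from torch.utils.data import IterableDataset, get_worker_info

    class ShardedIterable(IterableDataset):
        def __init__(self, data):
            self.data = data

        def __iter__(self):
            # Split across distributed ranks (if a process group is running)...
            if dist.is_available() and dist.is_initialized():
                rank, world = dist.get_rank(), dist.get_world_size()
            else:
                rank, world = 0, 1
            # ...then across DataLoader workers within each rank.
            info = get_worker_info()
            wid, nw = (info.id, info.num_workers) if info else (0, 1)
            shard, nshards = rank * nw + wid, world * nw
            return (x for i, x in enumerate(self.data) if i % nshards == shard)

Whether this cooperates with Trainer's eval gatherer depends on the transformers version, so treat it as a sketch of the sampling idea rather than a drop-in fix.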
