Issue: Trainer error on `evaluate()` in multithreaded/distributed context (shape mismatch)

Environment info

  • transformers version: 4.3.3
  • Platform: Linux-5.4.83.1.fi-x86_64-with-centos-7.8.2003-Core
  • Python version: 3.7.3
  • PyTorch version (GPU?): 1.8.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes (both multi-node/multi-GPU and single-node multi-GPU settings)

Who can help

@LysandreJik @sgugger

Information

Model I am using: GPT2

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior: I have witnessed this error in two contexts, both using a custom torch.utils.data.IterableDataset.

First:

  1. specify dataloader_num_workers > 1 in TrainingArguments and run trainer.train() with an eval dataset
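
A minimal sketch of this first case (ToyDataset below is an invented stand-in for the reporter's custom dataset, and the hyperparameters are arbitrary; assumes transformers 4.3.x and PyTorch 1.8):

    import torch
    from torch.utils.data import IterableDataset
    from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

    class ToyDataset(IterableDataset):
        """A sized IterableDataset yielding fixed-length token blocks."""
        def __init__(self, num_samples=96, block_size=32):
            self.num_samples, self.block_size = num_samples, block_size

        def __len__(self):
            return self.num_samples  # makes the dataset abc.Sized

        def __iter__(self):
            for _ in range(self.num_samples):
                ids = torch.randint(0, 50257, (self.block_size,))  # GPT-2 vocab size
                yield {"input_ids": ids, "labels": ids.clone()}

    model = GPT2LMHeadModel.from_pretrained("gpt2")
    args = TrainingArguments(
        output_dir="out",
        dataloader_num_workers=2,   # > 1 is what triggers the mismatch
        evaluation_strategy="steps",
        eval_steps=10,
        max_steps=20,
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=ToyDataset(), eval_dataset=ToyDataset())
    trainer.train()                 # evaluation hits the broadcast ValueError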

Second:

  1. In distributed setting, fire up multiple training instances on separate nodes using the torch.distributed.launch command, run trainer.train() with an eval dataset
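
For this second case, the launcher invocation follows the standard torch.distributed.launch pattern (node counts, addresses, and the script name below are placeholders, not taken from the original report):

    python -m torch.distributed.launch --nproc_per_node=<gpus_per_node> \
        --nnodes=<num_nodes> --node_rank=<rank> \
        --master_addr=<addr> --master_port=<port> my_training_script.py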

Error message:

  File "/mnt/home/dberenberg/projects/metagenomics/huggingface_meta/lib/python3.7/site-packages/transformers/trainer.py", line 1655, in prediction_loop
    eval_losses_gatherer.add_arrays(self._gather_and_numpify(losses_host, "eval_losses"))
  File "/mnt/home/dberenberg/projects/metagenomics/huggingface_meta/lib/python3.7/site-packages/transformers/trainer_pt_utils.py", line 338, in add_arrays
    slice_len = self._nested_set_tensors(self._storage, arrays)
  File "/mnt/home/dberenberg/projects/metagenomics/huggingface_meta/lib/python3.7/site-packages/transformers/trainer_pt_utils.py", line 354, in _nested_set_tensors
    storage[self._offsets[i] : self._offsets[i] + slice_len] = arrays[i * slice_len : (i + 1) * slice_len]
ValueError: could not broadcast input array from shape (104,) into shape (96,)

The shape of the broadcast input array varies. In the first case it is dataloader_num_workers * expected_shape (the expected shape here being (96,)). The traceback above shows the error from the second (distributed) case.
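
The num_workers multiplier is consistent with documented DataLoader behavior: with a plain IterableDataset, every worker process receives a full replica of the dataset and iterates it end to end, so the loader yields num_workers copies of each sample. A standalone sketch (independent of Trainer) illustrating the duplication:

    from torch.utils.data import DataLoader, IterableDataset

    class Plain(IterableDataset):
        def __iter__(self):
            return iter(range(4))

    if __name__ == "__main__":
        loader = DataLoader(Plain(), num_workers=2, batch_size=1)
        items = sorted(x.item() for batch in loader for x in batch)
        print(items)  # [0, 0, 1, 1, 2, 2, 3, 3] -- every worker replays the data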

Expected behavior

The evaluate loop should run without error.

Dataset information

The dataset object is an IterableDataset that is also abc.Sized (it defines __len__).
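
"Sized" here means the dataset defines __len__, which is all that collections.abc.Sized checks for. A minimal sketch:

    import collections.abc
    from torch.utils.data import IterableDataset

    class SizedIterable(IterableDataset):
        def __init__(self, data):
            self.data = data
        def __iter__(self):
            return iter(self.data)
        def __len__(self):
            return len(self.data)  # defining __len__ is what satisfies abc.Sized

    assert isinstance(SizedIterable([1, 2, 3]), collections.abc.Sized)

This is presumably why Trainer accepted the dataset and sized the eval gatherer's storage at (96,) from len(eval_dataset) in the first place.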

Script information

The script is fairly generic, involving training and evaluating GPT2 via the Trainer object for next-token prediction.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, Apr 12, 2021

That’s correct, especially for distributed evaluation.

0 reactions
djberenberg commented, Apr 12, 2021

Ok, that makes sense. So just to conclude: transformers.Trainer won't work in a distributed setting with a torch.utils.data.IterableDataset, in principle because IterableDatasets are not amenable to that use case, since it isn't clear how to describe a distributed sampling procedure for them. Is that correct? Thanks in advance.
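
That conclusion matches the general PyTorch guidance: an IterableDataset must handle its own sharding. A hedged sketch of the usual pattern (ShardedIterable is hypothetical), splitting work across distributed ranks and DataLoader workers so each sample is yielded exactly once:

    import torch.distributed as dist
    from torch.utils.data import IterableDataset, get_worker_info

    class ShardedIterable(IterableDataset):
        def __init__(self, data):
            self.data = data

        def __iter__(self):
            # Split across distributed ranks (if a process group is running)...
            if dist.is_available() and dist.is_initialized():
                rank, world = dist.get_rank(), dist.get_world_size()
            else:
                rank, world = 0, 1
            # ...then across DataLoader workers within each rank.
            info = get_worker_info()
            wid, nw = (info.id, info.num_workers) if info else (0, 1)
            shard, nshards = rank * nw + wid, world * nw
            return (x for i, x in enumerate(self.data) if i % nshards == shard)

Whether this cooperates with Trainer's eval gatherer depends on the transformers version, so treat it as a sketch of the sampling idea rather than a drop-in fix.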
