Issue: Trainer error on `evaluate()` in multithreaded/distributed context (shape mismatch)
Environment info
- transformers version: 4.3.3
- Platform: Linux-5.4.83.1.fi-x86_64-with-centos-7.8.2003-Core
- Python version: 3.7.3
- PyTorch version (GPU?): 1.8.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes - multinode/multigpu and multigpu settings.
Who can help
Information
Model I am using: GPT2
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior: I have witnessed this error in two contexts, both using a custom torch.utils.data.IterableDataset.
First:
- specify dataloader_num_workers > 1 in TrainingArguments and run trainer.train() with an eval dataset
Second:
- in a distributed setting, fire up multiple training instances on separate nodes using the torch.distributed.launch command, then run trainer.train() with an eval dataset (a minimal sketch of the shared setup is shown below)
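For context, here is a minimal sketch of the kind of setup that triggers the first case. It is illustrative only: the "gpt2" checkpoint, the evaluation arguments, and MyIterableDataset are placeholders, not the actual code from this report.

```python
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

args = TrainingArguments(
    output_dir="out",
    dataloader_num_workers=2,      # > 1: the first failure mode
    evaluation_strategy="steps",   # so evaluation runs during trainer.train()
    eval_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    train_dataset=MyIterableDataset("train"),  # a sized IterableDataset (see below)
    eval_dataset=MyIterableDataset("valid"),
)
trainer.train()  # prediction_loop on the eval set raises the ValueError shown below
```

The second case uses the same kind of script, launched on each node via torch.distributed.launch instead of being run directly.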
Error message:
File "/mnt/home/dberenberg/projects/metagenomics/huggingface_meta/lib/python3.7/site-packages/transformers/trainer.py", line 1655, in prediction_loop
eval_losses_gatherer.add_arrays(self._gather_and_numpify(losses_host, "eval_losses"))
File "/mnt/home/dberenberg/projects/metagenomics/huggingface_meta/lib/python3.7/site-packages/transformers/trainer_pt_utils.py", line 338, in add_arrays
slice_len = self._nested_set_tensors(self._storage, arrays)
File "/mnt/home/dberenberg/projects/metagenomics/huggingface_meta/lib/python3.7/site-packages/transformers/trainer_pt_utils.py", line 354, in _nested_set_tensors
storage[self._offsets[i] : self._offsets[i] + slice_len] = arrays[i * slice_len : (i + 1) * slice_len]
ValueError: could not broadcast input array from shape (104,) into shape (96,)
The broadcast input array shape varies. In the first case, the broadcast shape will be dataloader_num_workers * expected_shape (in this case (96,)). The traceback above shows the error message from the second case.
Expected behavior
The evaluate() loop should run without error.
Dataset information
The dataset object is an IterableDataset that is also abc.Sized.
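For reference, a "sized" IterableDataset is simply an iterable dataset that also implements __len__. A toy sketch (not the actual dataset from this report):

```python
from torch.utils.data import IterableDataset

class SizedIterableDataset(IterableDataset):
    """An IterableDataset that also satisfies abc.Sized via __len__."""

    def __init__(self, examples):
        # e.g. a list of dicts of token-id tensors
        self.examples = examples

    def __iter__(self):
        # Examples are produced by iteration only; there is no __getitem__,
        # so no sampler can index into the dataset.
        return iter(self.examples)

    def __len__(self):
        # Trainer reads this length to size the buffers it gathers
        # predictions/losses into during prediction_loop.
        return len(self.examples)
```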
Script information
The script is fairly generic, involving training and evaluating GPT2 via the Trainer object for next-token prediction.
Top GitHub Comments
That’s correct, especially for distributed evaluation.
Ok, that makes sense. So just to conclude: transformers.Trainer won't work in a distributed setting with a torch.utils.data.IterableDataset, in principle because IterableDatasets are not amenable to that use case, since it isn't clear how to describe a distributed sampling procedure for them. Is that correct? Thanks in advance.
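For illustration of why this is awkward: with a map-style dataset, the Trainer can hand each process a DistributedSampler over indices, but an IterableDataset exposes no indices, so any sharding has to be baked into the dataset itself. A hand-rolled sketch of that idea (not something Trainer does for you):

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset

class RankShardedIterableDataset(IterableDataset):
    """Naive per-rank sharding: each process keeps every world_size-th example.
    Trainer cannot assume an arbitrary IterableDataset does anything like this."""

    def __init__(self, examples):
        self.examples = examples

    def __iter__(self):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        for i, example in enumerate(self.examples):
            if i % world_size == rank:
                yield example
```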