Expected to have finished reduction in the prior iteration before starting a new one.
I have modified the nlp_example to finetune an EncoderDecoder on translation data, like this:
```python
accelerator = Accelerator(device_placement=False, fp16=args.fp16, cpu=args.cpu)

def _tokenize(batch):
    if accelerator.distributed_type == DistributedType.TPU:
        src = tokenizer(batch[0], padding="max_length", max_length=128, return_tensors="pt")
        tgt = tokenizer(batch[1], padding="max_length", max_length=128, return_tensors="pt")
    else:
        src = tokenizer(list(batch[0]), padding="longest", return_tensors="pt")
        tgt = tokenizer(list(batch[1]), padding="longest", return_tensors="pt")
    return src, tgt

...

for step, batch in train_bar:
    src, tgt = _tokenize(batch)
    src["input_ids"] = src["input_ids"].to(accelerator.device)
    tgt["input_ids"] = tgt["input_ids"].to(accelerator.device)
    outputs = model(input_ids=src["input_ids"], decoder_input_ids=tgt["input_ids"], labels=tgt["input_ids"])
    loss = outputs.loss
    loss = loss / gradient_accumulation_steps
    accelerator.backward(loss)
    if step % gradient_accumulation_steps == 0:
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
    if step % eval_steps == 0:
        model.eval()
        for step, batch in enumerate(dev_dataloader):
            src, tgt = _tokenize(batch)
            src["input_ids"] = src["input_ids"].to(accelerator.device)
            tgt["input_ids"] = tgt["input_ids"].to(accelerator.device)
            with torch.no_grad():
                predictions = model.generate(
                    src["input_ids"],
                    decoder_start_token_id=tokenizer.convert_tokens_to_ids("[CLS]"),
                    num_beams=4,
                    repetition_penalty=1.0,
                    do_sample=False,
                    forced_bos_token_id=None,
                )
            pred_str = tokenizer.batch_decode(predictions, skip_special_tokens=True)
            ref_str = tokenizer.batch_decode(tgt["input_ids"], skip_special_tokens=True)
            metric.add_batch(
                predictions=accelerator.gather(pred_str),
                references=accelerator.gather([[r] for r in ref_str]),
            )
        eval_metric = metric.compute()
...
```
I am getting the following error during training:
File "trainer.py", line 104, in training_function
outputs = model(input_ids=src["input_ids"], decoder_input_ids=tgt["input_ids"], labels=tgt["input_ids"])
File "/home/wjv316/anaconda3/envs/indic/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wjv316/anaconda3/envs/indic/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 606, in forward
if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
and the following error during generation:
File "trainer.py", line 120, in training_function
predictions = model.generate(
File "/home/wjv316/anaconda3/envs/indic/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'generate'
Both work fine if I change the configuration to use only a single GPU via `accelerate config`.
Top GitHub Comments
Hi there! For your first problem, you have to set `find_unused_parameters=True` when creating the distributed model (as PyTorch tells you in the error message). This can be done (with a source install, as it's a recently added feature) by creating a `DistributedDataParallelKwargs` containing this flag and passing it to your `Accelerator`:
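A minimal sketch of that setup, keeping the `Accelerator` arguments from the script above (the sketch assumes accelerate's `kwargs_handlers` argument, which accepts these handlers):

```python
from accelerate import Accelerator, DistributedDataParallelKwargs

# Tell DDP to tolerate parameters that receive no gradient in a given step.
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(
    device_placement=False, fp16=args.fp16, cpu=args.cpu, kwargs_handlers=[ddp_kwargs]
)
```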
This will still let your script run on one GPU/CPU (the kwargs are simply ignored there) and, when in distributed training, it should fix your first issue. If the error only appears when `gradient_accumulation_steps > 1`, you should set the flag for that case, as sketched below.
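As an assumption about what that setting looks like (not the original snippet), the flag can be made conditional so the extra bookkeeping is only paid when accumulating:

```python
# Assumption: enable unused-parameter detection only when gradient
# accumulation is in use, since the check adds overhead to every backward.
ddp_kwargs = DistributedDataParallelKwargs(
    find_unused_parameters=gradient_accumulation_steps > 1
)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
```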
For the second issue, the model is not the same once you have passed it to `accelerator.prepare`: it has been set up for distributed training and wrapped in a container (`DistributedDataParallel`) that does not have a `generate` method anymore. You can get your initial model back with `accelerator.unwrap_model(model)`, so replace your generate line with:
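Applied to the evaluation loop above, reusing the report's generation arguments, this becomes roughly:

```python
# Recover the original model from the DistributedDataParallel wrapper;
# the wrapper forwards __call__ but does not expose .generate().
unwrapped_model = accelerator.unwrap_model(model)
predictions = unwrapped_model.generate(
    src["input_ids"],
    decoder_start_token_id=tokenizer.convert_tokens_to_ids("[CLS]"),
    num_beams=4,
    repetition_penalty=1.0,
    do_sample=False,
    forced_bos_token_id=None,
)
```

On a single device, where `prepare` does not wrap the model, `unwrap_model` simply returns it, so the script still runs unchanged there.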
That’s great, thanks for all the awesome work you do!