
Expected to have finished reduction in the prior iteration before starting a new one.


I have modified the nlp_example to fine-tune an EncoderDecoder model on translation data like this:

accelerator = Accelerator(device_placement=False, fp16=args.fp16, cpu=args.cpu)
def _tokenize(batch):
    if accelerator.distributed_type == DistributedType.TPU:
        src = tokenizer(batch[0], padding="max_length", max_length=128, return_tensors="pt")
        tgt = tokenizer(batch[1], padding="max_length", max_length=128, return_tensors="pt")
    else:
        src = tokenizer(list(batch[0]), padding="longest", return_tensors="pt")
        tgt = tokenizer(list(batch[1]), padding="longest", return_tensors="pt")
    return src, tgt
...
for step, batch in train_bar:
    src, tgt = _tokenize(batch)
    src["input_ids"] = src["input_ids"].to(accelerator.device)
    tgt["input_ids"] = tgt["input_ids"].to(accelerator.device)
    outputs = model(input_ids=src["input_ids"], decoder_input_ids=tgt["input_ids"], labels=tgt["input_ids"])
    loss = outputs.loss
    loss = loss / gradient_accumulation_steps
    accelerator.backward(loss)
    if step % gradient_accumulation_steps == 0:
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    if step % eval_steps == 0:
        model.eval()
        for step, batch in enumerate(dev_dataloader):
            src, tgt = _tokenize(batch)
            src["input_ids"] = src["input_ids"].to(accelerator.device)
            tgt["input_ids"] = tgt["input_ids"].to(accelerator.device)
            with torch.no_grad():
                predictions = model.generate(
                    src["input_ids"],
                    decoder_start_token_id=tokenizer.convert_tokens_to_ids("[CLS]"),
                    num_beams=4,
                    repetition_penalty=1.0,
                    do_sample=False,
                    forced_bos_token_id=None,
                )
            pred_str = tokenizer.batch_decode(predictions, skip_special_tokens=True)
            ref_str = tokenizer.batch_decode(tgt["input_ids"], skip_special_tokens=True)
            metric.add_batch(
                predictions=accelerator.gather(pred_str), references=accelerator.gather([[r] for r in ref_str]),
            )
        eval_metric = metric.compute()
...

I am getting the following error during training:

  File "trainer.py", line 104, in training_function
    outputs = model(input_ids=src["input_ids"], decoder_input_ids=tgt["input_ids"], labels=tgt["input_ids"])
  File "/home/wjv316/anaconda3/envs/indic/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wjv316/anaconda3/envs/indic/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 606, in forward
    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).

and the following during generation:

  File "trainer.py", line 120, in training_function
    predictions = model.generate(
  File "/home/wjv316/anaconda3/envs/indic/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'generate'

Both work fine if I change the configuration to use only one GPU via accelerate config.
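
As context for both stack traces above: the ... in the snippet hides the usual accelerator.prepare call, and with more than one GPU configured that call wraps the model in torch.nn.parallel.DistributedDataParallel, which is exactly where both tracebacks point. A minimal sketch of that presumed setup (the variable names are assumptions based on the nlp_example the reporter adapted):

# Presumed setup hidden behind the "..." in the snippet above: under multi-GPU,
# accelerator.prepare wraps the model in DistributedDataParallel, which explains
# both the reducer error and the missing generate attribute.
model, optimizer, train_dataloader, dev_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, dev_dataloader
)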


Top GitHub Comments

sgugger commented, Apr 6, 2021

Hi there! For your first problem, you have to set find_unused_parameters=True when creating the distributed model (as PyTorch tells you in the error message). This can be done (with a source install, as the feature was added only recently) by creating a DistributedDataParallelKwargs with this flag and passing it to your Accelerator:

from accelerate import DistributedDataParallelKwargs

ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

This will still let your script run on one GPU/CPU (the kwargs are simply ignored there), and in distributed training it should fix your first issue. If the error only appears when gradient_accumulation_steps > 1, you should set

ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=args.gradient_accumulation_steps > 1)
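
Putting the two pieces together with the flags from the snippet in the issue (device_placement, fp16 and cpu are carried over from the reporter's code, and args is their argument parser), the setup would look roughly like this:

from accelerate import Accelerator, DistributedDataParallelKwargs

# Enable unused-parameter detection only when gradient accumulation is in play,
# as suggested above; args comes from the reporter's argument parser.
ddp_kwargs = DistributedDataParallelKwargs(
    find_unused_parameters=args.gradient_accumulation_steps > 1
)
accelerator = Accelerator(
    device_placement=False,
    fp16=args.fp16,
    cpu=args.cpu,
    kwargs_handlers=[ddp_kwargs],
)

Note that the fp16 argument follows the accelerate API current at the time of this thread; recent releases expect mixed_precision="fp16" instead.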

For the second issue, the model is not the same once you have passed it to accelerator.prepare: it has been set up for distributed training and wrapped in a container (DistributedDataParallel) that does not have a generate method anymore. You can get your initial model back with

accelerator.unwrap_model(model)

so replace your generate line with:

predictions = accelerator.unwrap_model(model).generate(
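
For reference, applied to the evaluation loop from the issue, the generate call would then look roughly like this (a sketch reusing the reporter's variable names and generation arguments):

# accelerator.unwrap_model returns the original model held inside the
# DistributedDataParallel container, so generate is available again and
# no weights are copied.
unwrapped_model = accelerator.unwrap_model(model)
with torch.no_grad():
    predictions = unwrapped_model.generate(
        src["input_ids"],
        decoder_start_token_id=tokenizer.convert_tokens_to_ids("[CLS]"),
        num_beams=4,
        repetition_penalty=1.0,
        do_sample=False,
        forced_bos_token_id=None,
    )
pred_str = tokenizer.batch_decode(predictions, skip_special_tokens=True)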
rahular commented, Apr 7, 2021

That’s great, thanks for all the awesome work you do!
