Runtime error when running RoBERTa with multiple GPUs on CommonsenseQA
See original GitHub issue
When I run RoBERTa with multiple GPUs on CommonsenseQA, I encounter a runtime error. Has anyone encountered the same problem? Thanks.
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/usr/local/lib/python3.5/dist-packages/fairseq_cli/train.py", line 284, in distributed_main
main(args, init_distributed=True)
File "/usr/local/lib/python3.5/dist-packages/fairseq_cli/train.py", line 80, in main
train(args, trainer, task, epoch_itr)
File "/usr/local/lib/python3.5/dist-packages/fairseq_cli/train.py", line 121, in train
log_output = trainer.train_step(samples)
File "/usr/local/lib/python3.5/dist-packages/fairseq/trainer.py", line 287, in train_step
raise e
File "/usr/local/lib/python3.5/dist-packages/fairseq/trainer.py", line 264, in train_step
ignore_grad
File "/usr/local/lib/python3.5/dist-packages/fairseq/tasks/fairseq_task.py", line 230, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/fairseq/criterions/sentence_ranking.py", line 49, in forward
classification_head_name='sentence_classification_head',
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/distributed.py", line 459, in forward
self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:518)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f414b2e0273 in /usr/local/lib/python3.5/dist-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x734 (0x7f414c573ac4 in /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0x691b2c (0x7f414c562b2c in /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x1d3f04 (0x7f414c0a4f04 in /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so)
frame #4: PyCFunction_Call + 0x77 (0x4ea137 in /usr/bin/python3)
frame #5: PyEval_EvalFrameEx + 0x59f6 (0x53c176 in /usr/bin/python3)
frame #6: PyEval_EvalCodeEx + 0x13b (0x540b0b in /usr/bin/python3)
frame #7: /usr/bin/python3() [0x4ec3f7]
frame #8: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #9: PyEval_EvalFrameEx + 0x252b (0x538cab in /usr/bin/python3)
frame #10: PyEval_EvalCodeEx + 0x13b (0x540b0b in /usr/bin/python3)
frame #11: /usr/bin/python3() [0x4ec3f7]
frame #12: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #13: /usr/bin/python3() [0x4fbfce]
frame #14: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #15: /usr/bin/python3() [0x574db6]
frame #16: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #17: PyEval_EvalFrameEx + 0x252b (0x538cab in /usr/bin/python3)
frame #18: PyEval_EvalCodeEx + 0x13b (0x540b0b in /usr/bin/python3)
frame #19: /usr/bin/python3() [0x4ec3f7]
frame #20: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #21: PyEval_EvalFrameEx + 0x252b (0x538cab in /usr/bin/python3)
frame #22: PyEval_EvalCodeEx + 0x13b (0x540b0b in /usr/bin/python3)
frame #23: /usr/bin/python3() [0x4ec2e3]
frame #24: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #25: /usr/bin/python3() [0x4fbfce]
frame #26: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #27: /usr/bin/python3() [0x574db6]
frame #28: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #29: PyEval_EvalFrameEx + 0x4ed6 (0x53b656 in /usr/bin/python3)
frame #30: /usr/bin/python3() [0x53fc97]
frame #31: PyEval_EvalFrameEx + 0x50bf (0x53b83f in /usr/bin/python3)
frame #32: /usr/bin/python3() [0x5401ef]
frame #33: PyEval_EvalFrameEx + 0x50bf (0x53b83f in /usr/bin/python3)
frame #34: PyEval_EvalFrameEx + 0x4b14 (0x53b294 in /usr/bin/python3)
frame #35: /usr/bin/python3() [0x53fc97]
frame #36: PyEval_EvalFrameEx + 0x50bf (0x53b83f in /usr/bin/python3)
frame #37: PyEval_EvalCodeEx + 0x13b (0x540b0b in /usr/bin/python3)
frame #38: /usr/bin/python3() [0x4ec358]
frame #39: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #40: PyEval_EvalFrameEx + 0x252b (0x538cab in /usr/bin/python3)
frame #41: PyEval_EvalCodeEx + 0x13b (0x540b0b in /usr/bin/python3)
frame #42: /usr/bin/python3() [0x4ec3f7]
frame #43: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #44: PyEval_EvalFrameEx + 0x252b (0x538cab in /usr/bin/python3)
frame #45: PyEval_EvalFrameEx + 0x4b14 (0x53b294 in /usr/bin/python3)
frame #46: PyEval_EvalFrameEx + 0x4b14 (0x53b294 in /usr/bin/python3)
frame #47: PyEval_EvalFrameEx + 0x4b14 (0x53b294 in /usr/bin/python3)
frame #48: /usr/bin/python3() [0x53fc97]
frame #49: PyEval_EvalFrameEx + 0x50bf (0x53b83f in /usr/bin/python3)
frame #50: /usr/bin/python3() [0x53fc97]
frame #51: PyEval_EvalCode + 0x1f (0x5409bf in /usr/bin/python3)
frame #52: PyRun_StringFlags + 0x8f (0x52084f in /usr/bin/python3)
frame #53: PyRun_SimpleStringFlags + 0x3c (0x60f15c in /usr/bin/python3)
frame #54: Py_Main + 0x581 (0x640381 in /usr/bin/python3)
frame #55: main + 0xe1 (0x4d0001 in /usr/bin/python3)
frame #56: __libc_start_main + 0xf0 (0x7f41513c6830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #57: _start + 0x29 (0x5d6999 in /usr/bin/python3)
/usr/lib/python3.5/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
len(cache))
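For reference, the RuntimeError above is PyTorch's standard DistributedDataParallel complaint about parameters that never received gradients in the previous iteration. Below is a minimal plain-PyTorch sketch of option (1) named in the error text; the model handling and device id are placeholders rather than fairseq's actual wrapping code:

from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_ddp(model, device_id):
    # Assumes torch.distributed.init_process_group(...) was already called
    # by the launcher (torch.multiprocessing.spawn in fairseq's case).
    model = model.to(device_id)
    return DDP(
        model,
        device_ids=[device_id],
        output_device=device_id,
        # Option (1) from the error message: let DDP tolerate parameters
        # that do not participate in producing the loss.
        find_unused_parameters=True,
    )

In fairseq itself the same behaviour is normally reachable from the command line; depending on the installed version, --find-unused-parameters or falling back to --ddp-backend no_c10d may help, but check fairseq-train --help for the options your release actually supports.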
Here is my fine-tuning script:
MAX_UPDATES=3000 # Number of training steps.
WARMUP_UPDATES=150 # Linearly increase LR over this many steps.
LR=1e-05 # Peak LR for polynomial LR scheduler.
MAX_SENTENCES=16 # Batch size.
SEED=1 # Random seed.
ROBERTA_PATH=roberta_pretrain_model/robeta.large/model.pt
DATA_DIR=raw_data/dataset_created_by_ola
# we use the --user-dir option to load the task from
# the examples/roberta/commonsense_qa directory:
FAIRSEQ_PATH=fairseq/
FAIRSEQ_USER_DIR=${FAIRSEQ_PATH}/examples/roberta/commonsense_qa
CUDA_VISIBLE_DEVICES=0,1,2 fairseq-train --fp16 \
$DATA_DIR \
--user-dir $FAIRSEQ_USER_DIR \
--restore-file $ROBERTA_PATH \
--reset-optimizer --reset-dataloader --reset-meters \
--no-epoch-checkpoints --no-last-checkpoints --no-save-optimizer-state \
--best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
--task commonsense_qa --init-token 0 --bpe gpt2 \
--arch roberta_large --max-positions 512 \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--criterion sentence_ranking --num-classes 5 \
--optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-06 --clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $LR \
--warmup-updates $WARMUP_UPDATES --total-num-update $MAX_UPDATES \
--max-sentences $MAX_SENTENCES \
--max-update $MAX_UPDATES \
--log-format simple --log-interval 25 \
--seed $SEED
Issue Analytics
- Created 4 years ago
- Comments: 14 (3 by maintainers)
Top GitHub Comments
Sorry, I'm new to RoBERTa. How do I remove all the unused model elements?
It happens when your model has unused parameters. Just comment them out!
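For anyone who wants to find the culprits first, one quick check is to run a single forward/backward pass outside of DistributedDataParallel (one GPU, no spawn) and list the parameters whose gradients were never populated. A rough sketch; how the loss is computed is task-specific and left as a placeholder here:

def report_unused_parameters(model, loss):
    # Call after computing `loss` for one batch on a single GPU, outside DDP,
    # so the check itself cannot trigger the reducer error shown above.
    loss.backward()
    unused = [name for name, param in model.named_parameters()
              if param.requires_grad and param.grad is None]
    for name in unused:
        print("unused parameter:", name)
    return unused

The names printed here are the "unused model elements" the comments above refer to: parameters to remove, freeze, or cover with find_unused_parameters=True.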