
Runtime error when running Roberta with multiple GPUs on CommonSenseQA

See original GitHub issue

When I run RoBERTa with multiple GPUs on CommonsenseQA, I encounter a runtime error. Has anyone encountered the same problem? Thanks.

 -- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.5/dist-packages/fairseq_cli/train.py", line 284, in distributed_main
    main(args, init_distributed=True)
  File "/usr/local/lib/python3.5/dist-packages/fairseq_cli/train.py", line 80, in main
    train(args, trainer, task, epoch_itr)
  File "/usr/local/lib/python3.5/dist-packages/fairseq_cli/train.py", line 121, in train
    log_output = trainer.train_step(samples)
  File "/usr/local/lib/python3.5/dist-packages/fairseq/trainer.py", line 287, in train_step
    raise e
  File "/usr/local/lib/python3.5/dist-packages/fairseq/trainer.py", line 264, in train_step
    ignore_grad
  File "/usr/local/lib/python3.5/dist-packages/fairseq/tasks/fairseq_task.py", line 230, in train_step
    loss, sample_size, logging_output = criterion(model, sample)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/fairseq/criterions/sentence_ranking.py", line 49, in forward
    classification_head_name='sentence_classification_head',
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/distributed.py", line 459, in forward
    self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:518)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f414b2e0273 in /usr/local/lib/python3.5/dist-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x734 (0x7f414c573ac4 in /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0x691b2c (0x7f414c562b2c in /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x1d3f04 (0x7f414c0a4f04 in /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so)
frame #4: PyCFunction_Call + 0x77 (0x4ea137 in /usr/bin/python3)
frame #5: PyEval_EvalFrameEx + 0x59f6 (0x53c176 in /usr/bin/python3)
frame #6: PyEval_EvalCodeEx + 0x13b (0x540b0b in /usr/bin/python3)
frame #7: /usr/bin/python3() [0x4ec3f7]
frame #8: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #9: PyEval_EvalFrameEx + 0x252b (0x538cab in /usr/bin/python3)
frame #10: PyEval_EvalCodeEx + 0x13b (0x540b0b in /usr/bin/python3)
frame #11: /usr/bin/python3() [0x4ec3f7]
frame #12: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #13: /usr/bin/python3() [0x4fbfce]
frame #14: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #15: /usr/bin/python3() [0x574db6]
frame #16: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #17: PyEval_EvalFrameEx + 0x252b (0x538cab in /usr/bin/python3)
frame #18: PyEval_EvalCodeEx + 0x13b (0x540b0b in /usr/bin/python3)
frame #19: /usr/bin/python3() [0x4ec3f7]
frame #20: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #21: PyEval_EvalFrameEx + 0x252b (0x538cab in /usr/bin/python3)
frame #22: PyEval_EvalCodeEx + 0x13b (0x540b0b in /usr/bin/python3)
frame #23: /usr/bin/python3() [0x4ec2e3]
frame #24: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #25: /usr/bin/python3() [0x4fbfce]
frame #26: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #27: /usr/bin/python3() [0x574db6]
frame #28: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #29: PyEval_EvalFrameEx + 0x4ed6 (0x53b656 in /usr/bin/python3)
frame #30: /usr/bin/python3() [0x53fc97]
frame #31: PyEval_EvalFrameEx + 0x50bf (0x53b83f in /usr/bin/python3)
frame #32: /usr/bin/python3() [0x5401ef]
frame #33: PyEval_EvalFrameEx + 0x50bf (0x53b83f in /usr/bin/python3)
frame #34: PyEval_EvalFrameEx + 0x4b14 (0x53b294 in /usr/bin/python3)
frame #35: /usr/bin/python3() [0x53fc97]
frame #36: PyEval_EvalFrameEx + 0x50bf (0x53b83f in /usr/bin/python3)
frame #37: PyEval_EvalCodeEx + 0x13b (0x540b0b in /usr/bin/python3)
frame #38: /usr/bin/python3() [0x4ec358]
frame #39: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #40: PyEval_EvalFrameEx + 0x252b (0x538cab in /usr/bin/python3)
frame #41: PyEval_EvalCodeEx + 0x13b (0x540b0b in /usr/bin/python3)
frame #42: /usr/bin/python3() [0x4ec3f7]
frame #43: PyObject_Call + 0x47 (0x5c20e7 in /usr/bin/python3)
frame #44: PyEval_EvalFrameEx + 0x252b (0x538cab in /usr/bin/python3)
frame #45: PyEval_EvalFrameEx + 0x4b14 (0x53b294 in /usr/bin/python3)
frame #46: PyEval_EvalFrameEx + 0x4b14 (0x53b294 in /usr/bin/python3)
frame #47: PyEval_EvalFrameEx + 0x4b14 (0x53b294 in /usr/bin/python3)
frame #48: /usr/bin/python3() [0x53fc97]
frame #49: PyEval_EvalFrameEx + 0x50bf (0x53b83f in /usr/bin/python3)
frame #50: /usr/bin/python3() [0x53fc97]
frame #51: PyEval_EvalCode + 0x1f (0x5409bf in /usr/bin/python3)
frame #52: PyRun_StringFlags + 0x8f (0x52084f in /usr/bin/python3)
frame #53: PyRun_SimpleStringFlags + 0x3c (0x60f15c in /usr/bin/python3)
frame #54: Py_Main + 0x581 (0x640381 in /usr/bin/python3)
frame #55: main + 0xe1 (0x4d0001 in /usr/bin/python3)
frame #56: __libc_start_main + 0xf0 (0x7f41513c6830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #57: _start + 0x29 (0x5d6999 in /usr/bin/python3)


/usr/lib/python3.5/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
  len(cache))

Here is my fine-tuning script:

MAX_UPDATES=3000        # Number of training steps.
WARMUP_UPDATES=150    # Linearly increase LR over this many steps.
LR=1e-05              # Peak LR for polynomial LR scheduler.
MAX_SENTENCES=16      # Batch size.
SEED=1                # Random seed.
ROBERTA_PATH=roberta_pretrain_model/robeta.large/model.pt
DATA_DIR=raw_data/dataset_created_by_ola

# we use the --user-dir option to load the task from
# the examples/roberta/commonsense_qa directory:
FAIRSEQ_PATH=fairseq/
FAIRSEQ_USER_DIR=${FAIRSEQ_PATH}/examples/roberta/commonsense_qa

CUDA_VISIBLE_DEVICES=0,1,2 fairseq-train --fp16 \
    $DATA_DIR \
    --user-dir $FAIRSEQ_USER_DIR \
    --restore-file $ROBERTA_PATH \
    --reset-optimizer --reset-dataloader --reset-meters \
    --no-epoch-checkpoints --no-last-checkpoints --no-save-optimizer-state \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --task commonsense_qa --init-token 0 --bpe gpt2 \
    --arch roberta_large --max-positions 512 \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --criterion sentence_ranking --num-classes 5 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-06 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $LR \
    --warmup-updates $WARMUP_UPDATES --total-num-update $MAX_UPDATES \
    --max-sentences $MAX_SENTENCES \
    --max-update $MAX_UPDATES \
    --log-format simple --log-interval 25 \
    --seed $SEED
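
The RuntimeError above already names the usual remedies: let the DDP reducer tolerate parameters that receive no gradient (`find_unused_parameters=True`), or make sure every output of `forward` feeds into the loss. For reference, below is a minimal, self-contained PyTorch-level sketch of the first remedy. The toy model, the gloo backend, and the single-process setup are illustrative assumptions only; fairseq builds its DDP wrapper internally, so in practice you would look for the corresponding distributed-training options in `fairseq-train --help` (for example a different `--ddp-backend`, or an unused-parameter flag if your version exposes one) rather than editing this call yourself.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup purely for illustration; real multi-GPU training
# launches one process per GPU (e.g. via torch.multiprocessing.spawn).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

class ToyRanker(nn.Module):
    """Hypothetical model: an encoder plus an extra head the loss never sees."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 16)
        self.ranking_head = nn.Linear(16, 1)
        self.unused_head = nn.Linear(16, 4)  # parameters that never receive gradients

    def forward(self, x):
        return self.ranking_head(self.encoder(x))

# Without find_unused_parameters=True, a model like this typically triggers the
# "Expected to have finished reduction ..." error on the next iteration.
ddp_model = DDP(ToyRanker(), find_unused_parameters=True)

for _ in range(2):
    loss = ddp_model(torch.randn(8, 16)).sum()
    loss.backward()

dist.destroy_process_group()
print("two iterations completed without the reducer error")
```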

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 14 (3 by maintainers)

Top GitHub Comments

1 reaction
allenyummy commented, Sep 10, 2019

Sorry, I’m new to RoBERTa. How do I remove all the unused model elements?

1 reaction
shamanez commented, Sep 9, 2019

It happens when your model has unused parameters. Just comment them out!
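
For anyone else wondering what “unused parameters” means here: they are submodules whose outputs never reach the loss, for example the masked-LM head when a RoBERTa-style model is fine-tuned with the sentence-ranking criterion. Below is a hedged sketch of the “comment them out” approach at the plain PyTorch level. The attribute names are hypothetical, since where the unused head lives depends on the model and fairseq version, and the change has to happen before the model is wrapped in DistributedDataParallel (the reducer registers parameters at construction time).

```python
import torch.nn as nn

def neutralize_unused_heads(model: nn.Module, attr_names=("lm_head",)) -> nn.Module:
    """Keep heads that never contribute to the loss from upsetting the DDP reducer.

    `attr_names` is a hypothetical list; which heads are actually unused depends
    on the task (for sentence ranking, the masked-LM head is the usual candidate
    in RoBERTa-style models).
    """
    for name in attr_names:
        head = getattr(model, name, None)
        if head is None:
            continue
        # Option 1: drop the submodule entirely (the "comment it out" approach).
        setattr(model, name, nn.Identity())
        # Option 2 (alternative to option 1): keep the head but freeze it, since
        # the DDP reducer only tracks parameters with requires_grad=True:
        # for p in head.parameters():
        #     p.requires_grad = False
    return model

# Usage (placeholder model name), before the DDP wrapping happens:
# my_roberta_model = neutralize_unused_heads(my_roberta_model)
```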

Read more comments on GitHub >

Top Results From Across the Web

  • Runtime error when running RoBERTa with multiple GPUs on ...
    This is the script what I used to running RoBERTa with 2 GPUs on CommonsenseQA. MAX_UPDATES=3000 # Number of training steps. WARMUP_UPDATES=150 #...
  • Runtime error - Google Groups
    Running the H2O-DFT-LS benchmark on 4 GPUs (4 MPI tasks per GPU), I get: CELL_REF| Volume [angstrom^3]: 25825.145
  • Finetuning RoBERTa on Commonsense QA - Hugging Face
    The above command assumes training on 1 GPU with 32GB of RAM. For GPUs with less memory, decrease --batch-size and increase --update-freq ...
  • allennlp-models - PyPI
    Using Docker. Docker provides a virtual machine with everything set up to run AllenNLP -- whether you will leverage a GPU or just run...
  • Generative Data Augmentation for Commonsense Reasoning
    In Tables 9 and 10, we specify the input formats for finetuning GPT-2 and RoBERTa. Finally, we benchmark the running time...
