
[run_summarization.py] wrong dataset leads to CUDA errors

See original GitHub issue

Feeding --dataset_name cnn_dailymail to --model_name_or_path google/pegasus-xsum leads to a flood of errors from PyTorch. Perhaps there is a way to detect that the dataset is inappropriate for the model and raise a clear, relevant assert instead?

You’d think that --dataset_name cnn_dailymail and --dataset_name xsum should be interchangeable…
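For what such a guard could look like: the thread below pins the failure on encoder inputs that are longer than this checkpoint's position-embedding table, so a hypothetical early check might be the following sketch (the helper name is made up; only AutoConfig and config.max_position_embeddings come from the transformers API):

from transformers import AutoConfig

def assert_source_length_fits(model_name: str, max_source_length: int) -> None:
    # Compare the requested encoder input length against the checkpoint's
    # positional-embedding capacity and fail with a readable message instead
    # of an opaque device-side CUDA assert.
    config = AutoConfig.from_pretrained(model_name)
    max_pos = getattr(config, "max_position_embeddings", None)
    if max_pos is not None and max_source_length > max_pos:
        raise ValueError(
            f"--max_source_length={max_source_length} exceeds "
            f"{model_name}'s max_position_embeddings={max_pos}; "
            f"pass --max_source_length {max_pos} or smaller."
        )

# Per the maintainers' comments below, google/pegasus-xsum can only handle 512
# positions while the script defaults to max_source_length=1024, so this call
# would raise the ValueError above.
assert_source_length_fits("google/pegasus-xsum", 1024)

The reproduction command and the resulting failure: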

python examples/seq2seq/run_summarization.py --model_name_or_path google/pegasus-xsum --do_train \
--do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0"  \
--output_dir /tmp/tst-summarization --per_device_train_batch_size=1 --per_device_eval_batch_size=1 \
--overwrite_output_dir --predict_with_generate
[....]
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [290,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [290,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [290,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
(crashes w/o traceback here)

If I run it on one GPU I get:

[...]
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [138,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/models/pegasus/modeling_pegasus.py", line 763, in forward
    layer_outputs = encoder_layer(
  File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/models/pegasus/modeling_pegasus.py", line 323, in forward
    hidden_states, attn_weights, _ = self.self_attn(
  File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/models/pegasus/modeling_pegasus.py", line 190, in forward
    query_states = self.q_proj(hidden_states) * self.scaling
  File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/functional.py", line 1860, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

Thanks.

@sgugger, @patil-suraj

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 24 (21 by maintainers)

Top GitHub Comments

3 reactions
patrickvonplaten commented, Aug 18, 2021

Ok so the plan is to:

  1. Add a resize_position_embeddings method to PreTrainedModel, just like we do for the word embeddings (a rough sketch follows this comment)
  2. resize_position_embeddings should probably log or warn depending on whether the position embeddings are sinusoidal or learned
  3. The function should overwrite config.max_position_embeddings

=> Happy to open a PR for this one, but would be great to first hear @LysandreJik and @sgugger’s opinion on it as well
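As a rough sketch of the idea for learned position embeddings (an illustration only, with assumed names and an assumed copy-and-truncate strategy, not the transformers implementation):

import logging

import torch
from torch import nn

logger = logging.getLogger(__name__)

def resize_learned_position_embeddings(old_emb: nn.Embedding, new_num_positions: int) -> nn.Embedding:
    # Grow or shrink a learned position-embedding table, copying the
    # overlapping rows so already-trained positions keep their weights.
    old_num_positions, dim = old_emb.weight.shape
    if new_num_positions < old_num_positions:
        logger.warning(
            "Shrinking position embeddings from %d to %d; later positions are dropped.",
            old_num_positions, new_num_positions,
        )
    new_emb = nn.Embedding(new_num_positions, dim)
    keep = min(old_num_positions, new_num_positions)
    with torch.no_grad():
        new_emb.weight[:keep] = old_emb.weight[:keep]
    return new_emb

# Example: extend a 512-position table to 1024 positions; the caller would then
# also set config.max_position_embeddings = 1024, as point 3 above suggests.
emb = resize_learned_position_embeddings(nn.Embedding(512, 64), 1024)

Sinusoidal tables (the Pegasus case) can simply be recomputed at the new length, which is presumably why point 2 distinguishes logging from warning.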

3 reactions
patrickvonplaten commented, May 13, 2021

@stas00, I checked and the problem simply seems to be that max_source_length is too high. It’s set to 1024 by default even though Pegasus can only handle 512. So, the following command should just run fine:

python examples/pytorch/summarization/run_summarization.py --model_name_or_path google/pegasus-xsum --do_train \
--do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0"  \
--output_dir /tmp/tst-summarization --per_device_train_batch_size=1 --per_device_eval_batch_size=1 \
--overwrite_output_dir --predict_with_generate --max_source_length 512
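
A quick way to confirm the limit for a given checkpoint before launching a run (a small sanity check, not part of the example script):

from transformers import AutoConfig

# Per the comment above, google/pegasus-xsum reports max_position_embeddings = 512,
# so --max_source_length must not exceed that value.
config = AutoConfig.from_pretrained("google/pegasus-xsum")
print(config.max_position_embeddings)
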
Read more comments on GitHub >

Top Results From Across the Web

"RuntimeError: CUDA error: out of memory" - Stack Overflow
In my case, the cause for this error message was actually not due to GPU memory, but due to the version mismatch between...

CUDA Error: Device-Side Assert Triggered: Solved | Built In
A CUDA Error: Device-Side Assert Triggered can either be caused by an inconsistency between the number of labels and output units or an...

CUDA C++ Best Practices Guide
The programming guide to using the CUDA Toolkit to obtain the best performance from NVIDIA GPUs.

CUDA error: an illegal memory access was encountered
The error I get is the following. Traceback (most recent call last): File "multi_gpu_FRCNN.py", line 125, in <module> logs = train_one_epoch( ...

"CUDA error" after 12 hrs and 38% training on large ...
Hi all, While trying to train a clinical notes language model on a large (7GB) dataset, I just got the error: 'CUDA error:...
