[run_summarization.py] wrong dataset leads to CUDA errors
Feeding `--dataset_name cnn_dailymail` to `--model_name_or_path google/pegasus-xsum` leads to lots of errors from PyTorch. Perhaps there is a way to detect that the dataset is inappropriate for the model and raise a relevant assert with a clear message instead? You'd think that `--dataset_name cnn_dailymail` and `--dataset_name xsum` should be interchangeable…
python examples/seq2seq/run_summarization.py --model_name_or_path google/pegasus-xsum --do_train \
--do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0" \
--output_dir /tmp/tst-summarization --per_device_train_batch_size=1 --per_device_eval_batch_size=1 \
--overwrite_output_dir --predict_with_generate
[....]
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [290,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [290,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [290,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
(crashes w/o traceback here)
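For context: `indexSelectLargeIndex` is the embedding-lookup kernel, and `srcIndex < srcSelectDimSize` means an index ran past the end of the embedding table — here, position ids beyond Pegasus's 512-entry position-embedding table. A minimal sketch reproducing the same assert (assumes a CUDA device; the sizes are illustrative):

```python
import torch
import torch.nn as nn

# An nn.Embedding lookup with ids >= num_embeddings trips the
# `srcIndex < srcSelectDimSize` device-side assertion on CUDA.
emb = nn.Embedding(512, 16).cuda()       # position table sized for 512 tokens
ids = torch.arange(1024, device="cuda")  # positions 512..1023 are out of range
out = emb(ids)                           # kernel launches; assert fires on-device
torch.cuda.synchronize()                 # the error only surfaces at a sync point
```

Because CUDA kernels execute asynchronously, the Python-side error often surfaces at a later, unrelated op — which is why the single-GPU traceback below ends in a cuBLAS failure inside `F.linear` rather than in the embedding lookup itself.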
If I run it on a single GPU I get:
[...]
/workspace/pytorch/aten/src/ATen/native/cuda/Indexing.cu:666: indexSelectLargeIndex: block: [138,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
return forward_call(*input, **kwargs)
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/models/pegasus/modeling_pegasus.py", line 763, in forward
layer_outputs = encoder_layer(
File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/models/pegasus/modeling_pegasus.py", line 323, in forward
hidden_states, attn_weights, _ = self.self_attn(
File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/models/pegasus/modeling_pegasus.py", line 190, in forward
query_states = self.q_proj(hidden_states) * self.scaling
File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
return F.linear(input, self.weight, self.bias)
File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/functional.py", line 1860, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Thanks.
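As a concrete illustration of the "relevant assert" suggested above, a guard along these lines could fail fast before any CUDA kernel runs. The helper and its wiring are hypothetical, not actual `run_summarization.py` code; `max_position_embeddings` is, however, present on Pegasus-style configs:

```python
def check_max_source_length(config, max_source_length: int) -> None:
    """Hypothetical guard: fail with a readable message instead of a
    cryptic device-side CUDA assert. Name and wiring are illustrative."""
    max_pos = getattr(config, "max_position_embeddings", None)
    if max_pos is not None and max_source_length > max_pos:
        raise ValueError(
            f"--max_source_length ({max_source_length}) exceeds the model's "
            f"position-embedding table ({max_pos} positions); longer inputs "
            f"would index past the table and crash inside a CUDA kernel."
        )
```

Called right after the model and tokenizer are loaded, this would turn the crash above into an actionable error message.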
Top GitHub Comments
Ok so the plan is to:

- add `resize_position_embeddings` to `PreTrainedModel`, just like we are doing it for the word embeddings
- `resize_position_embeddings` should probably log or warn, depending on whether it's sinus position embeddings or learned ones, when going past `config.max_position_embeddings`

=> Happy to open a PR for this one, but would be great to first hear @LysandreJik and @sgugger's opinion on it as well
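A rough sketch of the learned-embedding branch of such a method (the accessor names here are assumptions, not the actual `PreTrainedModel` API):

```python
import torch.nn as nn

def resize_position_embeddings(model, new_num_positions: int) -> None:
    """Sketch of resizing a *learned* position-embedding table, mirroring how
    word embeddings are resized: allocate a new table and copy the overlap.
    `get_position_embeddings`/`set_position_embeddings` are assumed helpers."""
    old = model.get_position_embeddings()
    new = nn.Embedding(new_num_positions, old.embedding_dim)
    num_copy = min(old.num_embeddings, new_num_positions)
    new.weight.data[:num_copy] = old.weight.data[:num_copy]  # keep trained rows
    model.set_position_embeddings(new)
    model.config.max_position_embeddings = new_num_positions
```

For sinusoidal embeddings the table could simply be re-generated at the new length, which is why a log message rather than a warning might suffice there.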
@stas00, I checked and the problem simply seems to be that `max_source_length` is too high. It's set to 1024 by default, even though Pegasus can only handle 512. So the following command should just run fine:
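Presumably the same invocation as above, with `--max_source_length` capped at the model's limit:

python examples/seq2seq/run_summarization.py --model_name_or_path google/pegasus-xsum --do_train \
--do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0" \
--output_dir /tmp/tst-summarization --per_device_train_batch_size=1 --per_device_eval_batch_size=1 \
--overwrite_output_dir --predict_with_generate --max_source_length 512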