RuntimeError: cublas runtime error : resource allocation failed at THCGeneral.cpp:250
Any ideas on resolving this issue would be greatly appreciated!
GPU details: Tesla K80 (8 GPUs), NVIDIA-SMI 410.79, Driver Version: 410.79, CUDA Version: 10.0
I was trying to run it on a single GPU first (local_rank = -1), and hit the error below.
ERROR:ignite.engine.engine.Engine:Current run is terminating due to exception: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCGeneral.cpp:250.
ERROR:ignite.engine.engine.Engine:Engine run is terminating due to exception: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCGeneral.cpp:250.
Traceback (most recent call last):
File "train.py", line 358, in <module>
train()
File "train.py", line 349, in train
trainer.run(train_loader, max_epochs=args.n_epochs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 388, in run
self._handle_exception(e)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
raise e
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 375, in run
hours, mins, secs = self._run_once_on_dataset()
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 341, in _run_once_on_dataset
self._handle_exception(e)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
raise e
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 333, in _run_once_on_dataset
self.state.output = self._process_function(self, batch)
File "train.py", line 275, in update
lm_loss, mc_loss = model(*batch)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 808, in forward
hidden_states = self.transformer(input_ids, position_ids, token_type_ids)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 643, in forward
hidden_states = block(hidden_states)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 334, in forward
a = self.attn(x)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 297, in forward
x = self.c_attn(x)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 248, in forward
x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
RuntimeError: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCGeneral.cpp:250
I also tried multi-GPU by doing python -m torch.distributed.launch --nproc_per_node=8 train.py <my cmdline options>
and it threw the same error.
Issue Analytics
- Created: 4 years ago
- Comments: 17 (2 by maintainers)
Top GitHub Comments
@thomwolf I think I have figured this out. Let me know what you think.
https://github.com/huggingface/transfer-learning-conv-ai/blob/b7f295f840f719056287504554083ec3f2688651/train.py#L55
If `len(instance["input_ids"])` above is greater than 512 (which is the default value of `n_positions` in `OpenAIGPTConfig` in modeling_openai.py in pytorch-pretrained-bert), then the position_ids created at the line below will contain values much larger than 512:
https://github.com/huggingface/pytorch-pretrained-BERT/blob/372a5c1ceec49b52c503707e9657bfaae7c236a0/pytorch_pretrained_bert/modeling_openai.py#L620
Consequently, this line will fail:
https://github.com/huggingface/pytorch-pretrained-BERT/blob/372a5c1ceec49b52c503707e9657bfaae7c236a0/pytorch_pretrained_bert/modeling_openai.py#L633
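The failure mode can be reproduced on CPU, where the same out-of-range lookup raises a readable IndexError instead of an opaque cublas error. This is a minimal sketch, not the repo's code; the table size 512 mirrors the default `n_positions`, and 768 is just an illustrative hidden size:

```python
import torch
import torch.nn as nn

# Position-embedding table with 512 rows, mirroring OpenAIGPTConfig's
# default n_positions = 512.
position_embeddings = nn.Embedding(512, 768)

# A sequence of length 600 produces position ids 0..599; ids >= 512
# have no corresponding row in the table.
position_ids = torch.arange(600)

try:
    position_embeddings(position_ids)  # out-of-range lookup
except IndexError as e:
    print("IndexError:", e)
```

On a GPU the same out-of-range lookup tends to surface later and less clearly, e.g. as a device-side assert or a downstream cublas failure like the one in the traceback above.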
I think you need to add truncation logic for the `sequence` above, prior to doing `instance["input_ids"] = list(chain(*sequence))`, so that the length is always less than or equal to 512.

@g-karthik I think the device-side assert error triggered is due to the position embedding, which is limited to 512 positions, while your input sequence length from the training loader is greater than that.
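The truncation the comment proposes could be sketched as below. The helper name and the drop-from-the-front-of-the-longest-segment policy are my assumptions, not the repo's actual fix; `sequence` stands for the per-segment token-id lists built in train.py, and 512 mirrors the default `n_positions`:

```python
from itertools import chain

MAX_POSITIONS = 512  # default n_positions in OpenAIGPTConfig


def truncate_sequence(sequence, max_len=MAX_POSITIONS):
    """Trim the segments in `sequence` so the flattened length fits
    the model's position-embedding table.

    This sketch drops tokens from the front of the longest segment
    first (typically the oldest history); the repo might prefer a
    different policy, e.g. dropping whole history utterances.
    """
    segments = [list(s) for s in sequence]
    total = sum(len(s) for s in segments)
    while total > max_len:
        longest = max(segments, key=len)
        longest.pop(0)
        total -= 1
    return segments


# Example: 650 total tokens get trimmed to 512 before flattening.
sequence = [list(range(300)), list(range(300)), list(range(50))]
instance_input_ids = list(chain(*truncate_sequence(sequence)))
assert len(instance_input_ids) <= MAX_POSITIONS
```

With this guard in place, position ids never exceed 511 and the embedding lookup stays in range.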