
RuntimeError: cublas runtime error : resource allocation failed at THCGeneral.cpp:250


Any ideas on resolving this issue would be greatly appreciated!

GPU details: Tesla K80 (8 GPUs), NVIDIA-SMI 410.79, Driver Version: 410.79, CUDA Version: 10.0

I first tried running it on a single GPU (local_rank = -1) and hit the error below.

ERROR:ignite.engine.engine.Engine:Current run is terminating due to exception: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCGeneral.cpp:250.
ERROR:ignite.engine.engine.Engine:Engine run is terminating due to exception: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCGeneral.cpp:250.
Traceback (most recent call last):
  File "train.py", line 358, in <module>
    train()
  File "train.py", line 349, in train
    trainer.run(train_loader, max_epochs=args.n_epochs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 388, in run
    self._handle_exception(e)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
    raise e
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 375, in run
    hours, mins, secs = self._run_once_on_dataset()
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 341, in _run_once_on_dataset
    self._handle_exception(e)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
    raise e
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 333, in _run_once_on_dataset
    self.state.output = self._process_function(self, batch)
  File "train.py", line 275, in update
    lm_loss, mc_loss = model(*batch)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 808, in forward
    hidden_states = self.transformer(input_ids, position_ids, token_type_ids)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 643, in forward
    hidden_states = block(hidden_states)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 334, in forward
    a = self.attn(x)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 297, in forward
    x = self.c_attn(x)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 248, in forward
    x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
RuntimeError: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCGeneral.cpp:250

I also tried multi-GPU training with python -m torch.distributed.launch --nproc_per_node=8 train.py <my cmdline options>, and it threw the same error.
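
One note for anyone debugging the same thing: CUDA kernel launches are asynchronous, so the op named in the traceback above is not necessarily the one that actually failed. A minimal sketch of forcing synchronous launches so the real failure point is reported:

    # Sketch: force synchronous CUDA launches so a device-side failure is
    # reported at the op that caused it, rather than at a later cuBLAS call.
    # This must run before CUDA is initialized (before the first .cuda() call).
    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    # Equivalent from the shell:
    #   CUDA_LAUNCH_BLOCKING=1 python train.py <my cmdline options>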

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 17 (2 by maintainers)

Top GitHub Comments

3 reactions
g-karthik commented, May 31, 2019

@thomwolf I think I have figured this out. Let me know what you think.

https://github.com/huggingface/transfer-learning-conv-ai/blob/b7f295f840f719056287504554083ec3f2688651/train.py#L55

If len(instance["input_ids"]) above is greater than 512 (the default value of n_positions in OpenAIGPTConfig in pytorch-pretrained-bert's modeling_openai.py), then the position_ids created at the line linked below will contain values of 512 or greater, which are out of range for the position-embedding table.

https://github.com/huggingface/pytorch-pretrained-BERT/blob/372a5c1ceec49b52c503707e9657bfaae7c236a0/pytorch_pretrained_bert/modeling_openai.py#L620

Consequently, this line (https://github.com/huggingface/pytorch-pretrained-BERT/blob/372a5c1ceec49b52c503707e9657bfaae7c236a0/pytorch_pretrained_bert/modeling_openai.py#L633) will fail.
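
To make the failure mode concrete, here is a toy sketch (not the library code) of what an out-of-range index into a CUDA embedding table does; depending on the build it surfaces as a device-side assert or as a confusing downstream cuBLAS error like the one above:

    import torch
    import torch.nn as nn

    emb = nn.Embedding(512, 768).cuda()   # 512 rows, like the GPT position embedding
    ok = torch.arange(512).cuda()
    _ = emb(ok)                           # valid indices 0..511: works

    bad = torch.arange(600).cuda()
    _ = emb(bad)                          # indices >= 512: device-side assert; subsequent
                                          # CUDA/cuBLAS calls then fail with unrelated errors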

I think you need to add truncation logic for sequence before doing instance["input_ids"] = list(chain(*sequence)), so that the flattened length is always at most 512; something along the lines of the sketch below.
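
A rough sketch of that truncation (the names and the 512 limit are illustrative, not the exact code in train.py):

    # Sketch: trim the segments so the flattened input never exceeds the
    # model's maximum number of positions (512 for OpenAI GPT by default).
    from itertools import chain

    MAX_POSITIONS = 512  # default n_positions in OpenAIGPTConfig

    def truncate_to_max_positions(sequence, max_len=MAX_POSITIONS):
        # Drop tokens from the end of the longest segment until the flattened
        # length fits, so every position id stays in [0, max_len).
        while sum(len(seg) for seg in sequence) > max_len:
            longest = max(sequence, key=len)
            del longest[-1]
        return sequence

    # Toy stand-in for the segments built in train.py:
    sequence = [list(range(300)), list(range(400)), list(range(50))]
    sequence = truncate_to_max_positions(sequence)
    input_ids = list(chain(*sequence))
    assert len(input_ids) <= MAX_POSITIONS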

1 reaction
sashank06 commented, Aug 8, 2019

@g-karthik I think the device-side assert is triggered because the position embedding is limited to 512 positions, while the sequence length of the inputs coming out of your training loader is greater than that.
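
A quick way to check for that (a sketch, assuming input_ids is the first tensor in each batch yielded by train_loader and the limit is 512):

    # Sketch: scan the training loader for batches whose sequence length
    # exceeds the position-embedding table.
    MAX_POSITIONS = 512

    for step, batch in enumerate(train_loader):
        input_ids = batch[0]
        seq_len = input_ids.size(-1)
        if seq_len > MAX_POSITIONS:
            print("batch %d: sequence length %d > %d" % (step, seq_len, MAX_POSITIONS))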


