RuntimeError: cublas runtime error : resource allocation failed at THCGeneral.cpp:250
Any ideas on resolving this issue would be greatly appreciated!
GPU details: Tesla K80 (8 GPUs), NVIDIA-SMI 410.79, Driver Version: 410.79, CUDA Version: 10.0
I was trying to run it on a single GPU first (local_rank = -1), and hit the error below.
ERROR:ignite.engine.engine.Engine:Current run is terminating due to exception: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCGeneral.cpp:250.
ERROR:ignite.engine.engine.Engine:Engine run is terminating due to exception: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCGeneral.cpp:250.
Traceback (most recent call last):
File "train.py", line 358, in <module>
train()
File "train.py", line 349, in train
trainer.run(train_loader, max_epochs=args.n_epochs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 388, in run
self._handle_exception(e)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
raise e
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 375, in run
hours, mins, secs = self._run_once_on_dataset()
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 341, in _run_once_on_dataset
self._handle_exception(e)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
raise e
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 333, in _run_once_on_dataset
self.state.output = self._process_function(self, batch)
File "train.py", line 275, in update
lm_loss, mc_loss = model(*batch)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 808, in forward
hidden_states = self.transformer(input_ids, position_ids, token_type_ids)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 643, in forward
hidden_states = block(hidden_states)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 334, in forward
a = self.attn(x)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 297, in forward
x = self.c_attn(x)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 248, in forward
x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
RuntimeError: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCGeneral.cpp:250
I also tried multi-GPU by doing python -m torch.distributed.launch --nproc_per_node=8 train.py <my cmdline options>
and it threw the same error.
Issue Analytics
- Created: 4 years ago
- Comments: 17 (2 by maintainers)
Top GitHub Comments
@thomwolf I think I have figured this out. Let me know what you think.
https://github.com/huggingface/transfer-learning-conv-ai/blob/b7f295f840f719056287504554083ec3f2688651/train.py#L55
If `len(instance["input_ids"])` above is greater than 512 (which is the default value of `n_positions` in `OpenAIGPTConfig` in modeling_openai.py in pytorch-pretrained-bert), then the position_ids created at the line below will contain values much larger than 512:
https://github.com/huggingface/pytorch-pretrained-BERT/blob/372a5c1ceec49b52c503707e9657bfaae7c236a0/pytorch_pretrained_bert/modeling_openai.py#L620
Consequently, this line will fail:
https://github.com/huggingface/pytorch-pretrained-BERT/blob/372a5c1ceec49b52c503707e9657bfaae7c236a0/pytorch_pretrained_bert/modeling_openai.py#L633
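The failure mode can be reproduced on CPU, where the same out-of-range lookup raises a readable IndexError instead of an opaque cublas error. This is a minimal sketch, not the repo's code; the table size 512 mirrors the default `n_positions`, and 768 is just an illustrative hidden size:

```python
import torch
import torch.nn as nn

# Position-embedding table with 512 rows, mirroring OpenAIGPTConfig's
# default n_positions = 512.
position_embeddings = nn.Embedding(512, 768)

# A sequence of length 600 produces position ids 0..599; ids >= 512
# have no corresponding row in the table.
position_ids = torch.arange(600)

try:
    position_embeddings(position_ids)  # out-of-range lookup
except IndexError as e:
    print("IndexError:", e)
```

On a GPU the same out-of-range lookup tends to surface later and less clearly, e.g. as a device-side assert or a downstream cublas failure like the one in the traceback above.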
I think you need to add truncation logic for the `sequence` above, prior to doing `instance["input_ids"] = list(chain(*sequence))`, so that the length is always less than or equal to 512.

@g-karthik I think the device-side assert error triggered is due to the position embedding, which is limited to 512 positions, while your input sequence length from the training loader is greater than that.
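The truncation the comment proposes could be sketched as below. The helper name and the drop-from-the-front-of-the-longest-segment policy are my assumptions, not the repo's actual fix; `sequence` stands for the per-segment token-id lists built in train.py, and 512 mirrors the default `n_positions`:

```python
from itertools import chain

MAX_POSITIONS = 512  # default n_positions in OpenAIGPTConfig


def truncate_sequence(sequence, max_len=MAX_POSITIONS):
    """Trim the segments in `sequence` so the flattened length fits
    the model's position-embedding table.

    This sketch drops tokens from the front of the longest segment
    first (typically the oldest history); the repo might prefer a
    different policy, e.g. dropping whole history utterances.
    """
    segments = [list(s) for s in sequence]
    total = sum(len(s) for s in segments)
    while total > max_len:
        longest = max(segments, key=len)
        longest.pop(0)
        total -= 1
    return segments


# Example: 650 total tokens get trimmed to 512 before flattening.
sequence = [list(range(300)), list(range(300)), list(range(50))]
instance_input_ids = list(chain(*truncate_sequence(sequence)))
assert len(instance_input_ids) <= MAX_POSITIONS
```

With this guard in place, position ids never exceed 511 and the embedding lookup stays in range.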