Illegal memory access when batch_size is between (128, 256)

See original GitHub issue

tests/optimiser/fairseq/test_fairseq_optimiser.py works correctly when batch_size <= 128; however, for any batch_size in [129, 255], the following error is raised (see the repro sketch after the traceback):

Traceback (most recent call last):
  File "/home/fhu/py-env/nlp/lib/python3.7/site-packages/absl/testing/parameterized.py", line 263, in bound_param_test
    test_method(self, **testcase_params)
  File "tests/optimiser/fairseq/test_fairseq_optimiser.py", line 101, in test_beam_search_optimiser
    no_repeat_ngram_size=no_repeat_ngram_size)
  File "/home/fhu/github/fairseq/fairseq/models/bart/hub_interface.py", line 107, in sample
    hypos = self.generate(input, beam, verbose, **kwargs)
  File "/home/fhu/github/fairseq/fairseq/models/bart/hub_interface.py", line 123, in generate
    prefix_tokens=sample['net_input']['src_tokens'].new_zeros((len(tokens), 1)).fill_(self.task.source_dictionary.bos()),
  File "/home/fhu/github/fairseq/fairseq/tasks/fairseq_task.py", line 361, in inference_step
    return generator.generate(models, sample, prefix_tokens=prefix_tokens)
  File "/home/fhu/py-env/nlp/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/home/fhu/github/fairseq/fairseq/sequence_generator.py", line 159, in generate
    return self._generate(sample, **kwargs)
  File "/home/fhu/github/fairseq/fairseq/sequence_generator.py", line 198, in _generate
    encoder_outs = self.model.forward_encoder(net_input)
  File "/home/fhu/github/fairseq/fairseq/sequence_generator.py", line 697, in forward_encoder
    for model in self.models
  File "/home/fhu/github/fairseq/fairseq/sequence_generator.py", line 697, in <listcomp>
    for model in self.models
  File "/home/fhu/github/fairseq/fairseq/models/fairseq_encoder.py", line 53, in forward_torchscript
    return self.forward_non_torchscript(net_input)
  File "/home/fhu/github/fairseq/fairseq/models/fairseq_encoder.py", line 62, in forward_non_torchscript
    return self.forward(**encoder_input)
  File "/home/fhu/github/fairseq/fairseq/models/transformer.py", line 411, in forward
    x = layer(x, encoder_padding_mask)
  File "/home/fhu/py-env/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fhu/github/fairseq/fairseq/modules/transformer_layer.py", line 122, in forward
    attn_mask=attn_mask,
  File "/home/fhu/py-env/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fhu/github/fastseq/fastseq/optimiser/fairseq/beam_search_optimiser_v2.py", line 200, in forward
    v_proj_weight=self.v_proj.weight,
  File "/home/fhu/py-env/nlp/lib/python3.7/site-packages/torch/nn/functional.py", line 3937, in multi_head_attention_forward
    float('-inf'),
RuntimeError: CUDA error: an illegal memory access was encountered
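
For reference, a minimal repro sketch in the spirit of the failing test (hypothetical: it assumes fastseq patches fairseq on import, as the fastseq frame in the traceback suggests, and that the bart.large.cnn checkpoint is available via torch.hub; the generation parameters are illustrative only):

    # Hypothetical repro sketch -- not the actual test code.
    import fastseq  # noqa: F401 -- applies the fairseq beam search optimiser
    import torch

    bart = torch.hub.load('pytorch/fairseq', 'bart.large.cnn')
    bart.cuda().eval()

    # Any batch size in (128, 256) should trigger the illegal memory access.
    sentences = ['A long source document to summarise.'] * 129
    hypos = bart.sample(sentences, beam=4, no_repeat_ngram_size=3)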

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

3 reactions
yuyan2do commented, Jul 24, 2020

@JiushengChen @feihugis It looks like the element count exceeds int.MAX. After adding CUDA_LAUNCH_BLOCKING=1 and rerunning the code with a batch size larger than 128, the error occurs in ret = input.softmax(dim), specifically in cuda/SoftMax.cu.
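
Setting CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the error is attributed to the kernel that actually faults rather than to a later launch. A minimal sketch (the variable must be set before CUDA is initialised):

    # Force synchronous CUDA launches for accurate error attribution.
    import os
    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # set before CUDA initialises

    import torch  # import torch only after setting the env var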

Check the size of the tensor that is passed to softmax:

  • For batch size 128, the tensor has shape [2048, 1024, 1024], which is 2147483648 elements, so the indices span exactly [0, 2147483647] (int.MAX).

  • When the batch size is larger than 128, indices fall beyond int.MAX, which causes the illegal memory access error.

I verified this assumption by reducing the input length slightly, from 1024 to 1000; the batch size could then be increased to 134 without error (the sketch below checks the arithmetic).
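
A quick back-of-the-envelope check of those numbers, assuming BART-large's 16 attention heads, so the softmax input has shape [batch * heads, seq_len, seq_len] (the helper below is illustrative, not code from the repo):

    # Check the int32 overflow arithmetic for the attention-softmax tensor.
    INT_MAX = 2**31 - 1  # largest 32-bit signed index

    def n_elements(batch, seq_len, heads=16):
        # Attention scores: [batch * heads, seq_len, seq_len]
        return batch * heads * seq_len * seq_len

    for batch, seq_len in [(128, 1024), (129, 1024), (134, 1000), (135, 1000)]:
        n = n_elements(batch, seq_len)
        print(f'batch={batch}, seq_len={seq_len}: {n} elements ->',
              'fits in int32' if n - 1 <= INT_MAX else 'overflows int32')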

In cuda/SoftMax.cu, int is used for indexing, which is another hint supporting this assumption:

    // Excerpt from PyTorch's cuda/SoftMax.cu: the loop counter and the
    // offset arithmetic use 32-bit int indexing.
    for (int j = 0; j < ILP; ++j) {
      tmp[j] = input[offset + j * blockDim.x];
    }

2 reactions
JiushengChen commented, Jul 24, 2020

https://github.com/NVIDIA/apex/issues/319 https://github.com/pytorch/pytorch/issues/21819

These related threads discuss the same error and mention a few possible tricks; we may try them.
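
One workaround in that spirit (a hypothetical sketch, not something proposed verbatim in those threads) is to cap the effective batch size so the attention softmax never exceeds the 32-bit index range, e.g. by chunking generation:

    # Hypothetical workaround: split large batches into chunks of at most
    # 128 so each softmax call stays within the 32-bit index range.
    def sample_in_chunks(bart, sentences, max_batch=128, **kwargs):
        hypos = []
        for i in range(0, len(sentences), max_batch):
            hypos.extend(bart.sample(sentences[i:i + max_batch], **kwargs))
        return hypos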

Read more comments on GitHub >

