Illegal memory access when batch_size is between (128, 256)
tests/optimiser/fairseq/test_fairseq_optimiser.py works fine when batch_size <= 128. However, when batch_size is set anywhere in [129, 255], the error below is raised:
Traceback (most recent call last):
File "/home/fhu/py-env/nlp/lib/python3.7/site-packages/absl/testing/parameterized.py", line 263, in bound_param_test
test_method(self, **testcase_params)
File "tests/optimiser/fairseq/test_fairseq_optimiser.py", line 101, in test_beam_search_optimiser
no_repeat_ngram_size=no_repeat_ngram_size)
File "/home/fhu/github/fairseq/fairseq/models/bart/hub_interface.py", line 107, in sample
hypos = self.generate(input, beam, verbose, **kwargs)
File "/home/fhu/github/fairseq/fairseq/models/bart/hub_interface.py", line 123, in generate
prefix_tokens=sample['net_input']['src_tokens'].new_zeros((len(tokens), 1)).fill_(self.task.source_dictionary.bos()),
File "/home/fhu/github/fairseq/fairseq/tasks/fairseq_task.py", line 361, in inference_step
return generator.generate(models, sample, prefix_tokens=prefix_tokens)
File "/home/fhu/py-env/nlp/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
return func(*args, **kwargs)
File "/home/fhu/github/fairseq/fairseq/sequence_generator.py", line 159, in generate
return self._generate(sample, **kwargs)
File "/home/fhu/github/fairseq/fairseq/sequence_generator.py", line 198, in _generate
encoder_outs = self.model.forward_encoder(net_input)
File "/home/fhu/github/fairseq/fairseq/sequence_generator.py", line 697, in forward_encoder
for model in self.models
File "/home/fhu/github/fairseq/fairseq/sequence_generator.py", line 697, in <listcomp>
for model in self.models
File "/home/fhu/github/fairseq/fairseq/models/fairseq_encoder.py", line 53, in forward_torchscript
return self.forward_non_torchscript(net_input)
File "/home/fhu/github/fairseq/fairseq/models/fairseq_encoder.py", line 62, in forward_non_torchscript
return self.forward(**encoder_input)
File "/home/fhu/github/fairseq/fairseq/models/transformer.py", line 411, in forward
x = layer(x, encoder_padding_mask)
File "/home/fhu/py-env/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/fhu/github/fairseq/fairseq/modules/transformer_layer.py", line 122, in forward
attn_mask=attn_mask,
File "/home/fhu/py-env/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/fhu/github/fastseq/fastseq/optimiser/fairseq/beam_search_optimiser_v2.py", line 200, in forward
v_proj_weight=self.v_proj.weight,
File "/home/fhu/py-env/nlp/lib/python3.7/site-packages/torch/nn/functional.py", line 3937, in multi_head_attention_forward
float('-inf'),
RuntimeError: CUDA error: an illegal memory access was encountered
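For reference, this is roughly the kind of call the test makes (a hedged sketch using the fairseq BART hub interface; the model name, batch size, and beam settings are illustrative assumptions, not the exact values from test_fairseq_optimiser.py):

```python
import torch

# Hypothetical repro sketch: load BART through the fairseq hub interface and
# run beam search over a batch in the failing range (129-255). All concrete
# values below are assumptions for illustration.
bart = torch.hub.load('pytorch/fairseq', 'bart.large.cnn')
bart.cuda()
bart.eval()

batch_size = 200  # any value in [129, 255] reportedly triggers the error
sources = ["some long input document ..."] * batch_size

with torch.no_grad():
    # sample() calls generate(), which runs the transformer encoder where the
    # illegal memory access surfaces inside multi_head_attention_forward.
    hypos = bart.sample(sources, beam=4, lenpen=2.0, no_repeat_ngram_size=3)
```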

@JiushengChen @feihugis It looks like the tensor element count exceeds int.MAX. After adding CUDA_LAUNCH_BLOCKING=1 and rerunning with a batch size larger than 128, the error happens in ret = input.softmax(dim), specifically in cuda/SoftMax.cu. Checking the size of the tensor being softmaxed:
For batch size 128, the attention-score tensor has shape [2048, 1024, 1024] (128 sequences × 16 attention heads, with 1024 × 1024 scores each), i.e. 2147483648 elements with indices in [0, 2147483647] (int.MAX).
When the batch size is larger than 128, element indices fall beyond int.MAX, which causes the illegal memory access error.
I verified this assumption by reducing the input length slightly from 1024 to 1000; the batch size can then be increased to 134 without error. The arithmetic is sketched below.
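A quick back-of-the-envelope check of that bound (a sketch; num_heads = 16 matches bart.large, and the other numbers mirror the shapes above):

```python
# Sketch: check whether the flattened attention-score tensor
# (batch * num_heads, tgt_len, src_len) still fits int32 indexing.
INT32_MAX = 2**31 - 1  # 2147483647
NUM_HEADS = 16         # bart.large

def max_index(batch_size, seq_len):
    # highest flat index touched when masking/softmaxing the score tensor
    return batch_size * NUM_HEADS * seq_len * seq_len - 1

for batch_size, seq_len in [(128, 1024), (129, 1024), (134, 1000), (135, 1000)]:
    idx = max_index(batch_size, seq_len)
    status = "OK" if idx <= INT32_MAX else "overflows int32"
    print(f"batch={batch_size:3d} seq_len={seq_len}: max index {idx:,} -> {status}")

# batch=128 seq_len=1024: max index 2,147,483,647 -> OK (exactly int.MAX)
# batch=129 seq_len=1024: max index 2,164,260,863 -> overflows int32
# batch=134 seq_len=1000: max index 2,143,999,999 -> OK
# batch=135 seq_len=1000: max index 2,159,999,999 -> overflows int32
```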
In cuda/SoftMax.cu, int is used for indexing, which is another hint supporting this assumption.
https://github.com/NVIDIA/apex/issues/319
https://github.com/pytorch/pytorch/issues/21819
These related threads discuss the same error and mention a few tricks; we may try them. One possible mitigation is sketched below.
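One such trick, roughly, is to keep the per-call tensor under the int32 limit. A hedged sketch (this helper is hypothetical, not something from the threads or from fastseq) that chunks oversized batches before generation:

```python
# Hypothetical workaround sketch: keep the effective batch small enough that
# batch * num_heads * seq_len**2 stays within int32 by chunking the inputs.
# chunk_size=128 matches the largest batch observed to work with 1024-token inputs.
def sample_in_chunks(bart, sentences, chunk_size=128, **kwargs):
    hypos = []
    for start in range(0, len(sentences), chunk_size):
        hypos.extend(bart.sample(sentences[start:start + chunk_size], **kwargs))
    return hypos
```

This only sidesteps the problem by trading throughput; it does not fix the int indexing in cuda/SoftMax.cu itself.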