Is SparseSelfAttention compatible with model parallel training?
Hello,
I’ve replaced the regular attention layer in the Megatron-LM example with the SparseSelfAttention class from deepspeed.ops.sparse_attention. It works fine with model-parallel-size 1, but when I try to train a model with model-parallel-size 2, it fails at the start with "CUDA error: an illegal memory access was encountered".
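Roughly, the replacement looks like the sketch below (a simplified illustration, not my exact code; the FixedSparsityConfig parameters, tensor shapes, and head count are placeholders):

```python
# Simplified sketch of swapping dense attention for DeepSpeed's sparse attention.
# Shapes, dtype, and config values below are illustrative placeholders.
import torch
from deepspeed.ops.sparse_attention import SparseSelfAttention, FixedSparsityConfig

num_attention_heads = 16  # global head count in this sketch

sparsity_config = FixedSparsityConfig(num_heads=num_attention_heads, block=16)
sparse_attn = SparseSelfAttention(sparsity_config=sparsity_config)

# SparseSelfAttention expects query/key/value shaped (batch, heads, seq_len, head_dim),
# generally in fp16 on the GPU (the block-sparse kernels target CUDA half precision).
# An attn_mask / key_padding_mask can also be passed; omitted here for brevity.
b, s, d = 2, 1024, 64
q = torch.randn(b, num_attention_heads, s, d, dtype=torch.half, device='cuda')
k = torch.randn_like(q)
v = torch.randn_like(q)
context = sparse_attn(q, k, v)  # stands in for softmax(QK^T / sqrt(d)) V
```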
Stacktrace:
  File "/home/user/transformers/deepspeed_megatron/model/gpt2_modeling.py", line 99, in forward
    transformer_output = self.transformer(embeddings, attention_mask)
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/transformers/deepspeed_megatron/mpu/transformer.py", line 435, in forward
    hidden_states, attention_mask)
  File "/home/user/conda/lib/python3.7/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 582, in checkpoint
    return CheckpointFunction.apply(function, *args)
  File "/home/user/conda/lib/python3.7/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 425, in forward
    outputs = run_function(*inputs_cuda)
  File "/home/user/transformers/deepspeed_megatron/mpu/transformer.py", line 425, in custom_forward
    x_ = layer(x_, inputs[1])
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/transformers/deepspeed_megatron/mpu/transformer.py", line 307, in forward
    mlp_output = self.mlp(layernorm_output)
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/transformers/deepspeed_megatron/mpu/transformer.py", line 223, in forward
    output = self.dense_4h_to_h(intermediate_parallel)
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/transformers/deepspeed_megatron/mpu/layers.py", line 319, in forward
    output_parallel = F.linear(input_parallel, self.weight)
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 1676, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fba9b57d1e2 in /home/user/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7fba9b7cbf92 in /home/user/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fba9b56b9cd in /home/user/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5411c2 (0x7fbae30f01c2 in /home/user/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x541266 (0x7fbae30f0266 in /home/user/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x1809da (0x5614ddd289da in /home/user/conda/bin/python)
frame #6: <unknown function> + 0xfa348 (0x5614ddca2348 in /home/user/conda/bin/python)
frame #7: <unknown function> + 0xfadd8 (0x5614ddca2dd8 in /home/user/conda/bin/python)
frame #8: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #9: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #10: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #11: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #12: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #13: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #14: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #15: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #16: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #17: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #18: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #19: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #20: <unknown function> + 0x12adc7 (0x5614ddcd2dc7 in /home/user/conda/bin/python)
frame #21: PyDict_SetItemString + 0x89 (0x5614ddcdf889 in /home/user/conda/bin/python)
frame #22: PyImport_Cleanup + 0x9c (0x5614ddd5480c in /home/user/conda/bin/python)
frame #23: Py_FinalizeEx + 0x64 (0x5614dddc8d04 in /home/user/conda/bin/python)
frame #24: <unknown function> + 0x23232e (0x5614dddda32e in /home/user/conda/bin/python)
frame #25: _Py_UnixMain + 0x3c (0x5614dddda67c in /home/user/conda/bin/python)
frame #26: __libc_start_main + 0xe7 (0x7fbaea425b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: <unknown function> + 0x1d7101 (0x5614ddd7f101 in /home/user/conda/bin/python)
Top GitHub Comments
Ah, now I understand. Yes, I've set num_heads in the STConfig object the same way you mention in your example, but I haven't divided it by model_parallel_size. Will try.
Yes, as you can see here in the example, we set the number of heads for sparse attention. I think this part of the example and tutorial is not very clear; thanks for bringing up this issue, I will update the documentation to make it clearer. Further, the number of heads for sparse attention needs to be the number of heads per attention partition, i.e. the parameter you pointed out above.
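In other words, with model parallelism the attention heads are split across ranks, so the sparsity config has to be built with the per-partition head count rather than the global one. A minimal sketch (variable names and values are placeholders, and FixedSparsityConfig is just one example config class):

```python
# Sketch of the fix: build the sparsity layout for the heads this rank actually owns.
from deepspeed.ops.sparse_attention import SparseSelfAttention, FixedSparsityConfig

num_attention_heads = 16   # global head count (placeholder)
model_parallel_size = 2    # tensor/model parallel degree (placeholder)
heads_per_partition = num_attention_heads // model_parallel_size

# Wrong with model-parallel-size > 1: the layout would have more heads than the
# per-rank q/k/v tensors, which is a likely way to end up with illegal memory
# accesses inside the sparse kernels.
# sparsity_config = FixedSparsityConfig(num_heads=num_attention_heads)

# Right: per-partition head count, matching the q/k/v tensors each rank holds.
sparsity_config = FixedSparsityConfig(num_heads=heads_per_partition, block=16)
sparse_attn = SparseSelfAttention(sparsity_config=sparsity_config)
```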