Is SparseSelfAttention compatible with model parallel training?
Hello,
I’ve replaced the regular attention layer in the Megatron-LM example with the SparseSelfAttention class from deepspeed.ops.sparse_attention. It works fine with model-parallel-size 1, but when I try to train a model with model-parallel-size 2, it fails at the start with "CUDA error: an illegal memory access was encountered".
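Roughly, the replacement looks like the sketch below (a simplified illustration, not my exact code; the FixedSparsityConfig parameters, tensor shapes, and head count are placeholders):

```python
# Simplified sketch of swapping dense attention for DeepSpeed's sparse attention.
# Shapes, dtype, and config values below are illustrative placeholders.
import torch
from deepspeed.ops.sparse_attention import SparseSelfAttention, FixedSparsityConfig

num_attention_heads = 16  # global head count in this sketch

sparsity_config = FixedSparsityConfig(num_heads=num_attention_heads, block=16)
sparse_attn = SparseSelfAttention(sparsity_config=sparsity_config)

# SparseSelfAttention expects query/key/value shaped (batch, heads, seq_len, head_dim),
# generally in fp16 on the GPU (the block-sparse kernels target CUDA half precision).
# An attn_mask / key_padding_mask can also be passed; omitted here for brevity.
b, s, d = 2, 1024, 64
q = torch.randn(b, num_attention_heads, s, d, dtype=torch.half, device='cuda')
k = torch.randn_like(q)
v = torch.randn_like(q)
context = sparse_attn(q, k, v)  # stands in for softmax(QK^T / sqrt(d)) V
```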
Stacktrace:
  File "/home/user/transformers/deepspeed_megatron/model/gpt2_modeling.py", line 99, in forward
    transformer_output = self.transformer(embeddings, attention_mask)
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/transformers/deepspeed_megatron/mpu/transformer.py", line 435, in forward
    hidden_states, attention_mask)
  File "/home/user/conda/lib/python3.7/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 582, in checkpoint
    return CheckpointFunction.apply(function, *args)
  File "/home/user/conda/lib/python3.7/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 425, in forward
    outputs = run_function(*inputs_cuda)
  File "/home/user/transformers/deepspeed_megatron/mpu/transformer.py", line 425, in custom_forward
    x_ = layer(x_, inputs[1])
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/transformers/deepspeed_megatron/mpu/transformer.py", line 307, in forward
    mlp_output = self.mlp(layernorm_output)
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/transformers/deepspeed_megatron/mpu/transformer.py", line 223, in forward
    output = self.dense_4h_to_h(intermediate_parallel)
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/transformers/deepspeed_megatron/mpu/layers.py", line 319, in forward
    output_parallel = F.linear(input_parallel, self.weight)
  File "/home/user/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 1676, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fba9b57d1e2 in /home/user/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7fba9b7cbf92 in /home/user/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fba9b56b9cd in /home/user/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5411c2 (0x7fbae30f01c2 in /home/user/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x541266 (0x7fbae30f0266 in /home/user/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x1809da (0x5614ddd289da in /home/user/conda/bin/python)
frame #6: <unknown function> + 0xfa348 (0x5614ddca2348 in /home/user/conda/bin/python)
frame #7: <unknown function> + 0xfadd8 (0x5614ddca2dd8 in /home/user/conda/bin/python)
frame #8: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #9: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #10: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #11: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #12: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #13: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #14: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #15: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #16: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #17: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #18: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #19: <unknown function> + 0xfadec (0x5614ddca2dec in /home/user/conda/bin/python)
frame #20: <unknown function> + 0x12adc7 (0x5614ddcd2dc7 in /home/user/conda/bin/python)
frame #21: PyDict_SetItemString + 0x89 (0x5614ddcdf889 in /home/user/conda/bin/python)
frame #22: PyImport_Cleanup + 0x9c (0x5614ddd5480c in /home/user/conda/bin/python)
frame #23: Py_FinalizeEx + 0x64 (0x5614dddc8d04 in /home/user/conda/bin/python)
frame #24: <unknown function> + 0x23232e (0x5614dddda32e in /home/user/conda/bin/python)
frame #25: _Py_UnixMain + 0x3c (0x5614dddda67c in /home/user/conda/bin/python)
frame #26: __libc_start_main + 0xe7 (0x7fbaea425b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: <unknown function> + 0x1d7101 (0x5614ddd7f101 in /home/user/conda/bin/python)
Top GitHub Comments
Ah, now I understand. Yes, I've set num_heads in the STConfig object the same way you mention in your example, but I haven't divided it by model_parallel_size. Will try.
Yes, as you can see here in the example, we set the number of heads for sparse attention. I think this part of the example and tutorial is not very clear; thanks for bringing up this issue, I will update the documentation to make it clearer. Further, the number of heads for sparse attention needs to be the number of heads per attention partition, i.e. the parameter you pointed out above.
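In other words, with model parallelism the attention heads are split across ranks, so the sparsity config has to be built with the per-partition head count rather than the global one. A minimal sketch (variable names and values are placeholders, and FixedSparsityConfig is just one example config class):

```python
# Sketch of the fix: build the sparsity layout for the heads this rank actually owns.
from deepspeed.ops.sparse_attention import SparseSelfAttention, FixedSparsityConfig

num_attention_heads = 16   # global head count (placeholder)
model_parallel_size = 2    # tensor/model parallel degree (placeholder)
heads_per_partition = num_attention_heads // model_parallel_size

# Wrong with model-parallel-size > 1: the layout would have more heads than the
# per-rank q/k/v tensors, which is a likely way to end up with illegal memory
# accesses inside the sparse kernels.
# sparsity_config = FixedSparsityConfig(num_heads=num_attention_heads)

# Right: per-partition head count, matching the q/k/v tensors each rank holds.
sparsity_config = FixedSparsityConfig(num_heads=heads_per_partition, block=16)
sparse_attn = SparseSelfAttention(sparsity_config=sparsity_config)
```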