[BUG] SageMaker p3.16xlarge failure on running HuggingFace tutorial: `FAILED: multi_tensor_adam.cuda.o`
Describe the bug
I am trying to reproduce the HuggingFace + DeepSpeed training example from https://huggingface.co/transformers/main_classes/deepspeed.html on a SageMaker p3.16xlarge instance (8 Tesla V100s). However, we cannot get past the FAILED: fused_adam_frontend.o and FAILED: multi_tensor_adam.cuda.o errors. They may also be related to our gcc version:
Your compiler (c++ 4.8.5) may be ABI-incompatible with PyTorch!
Please use a compiler that is ABI-compatible with GCC 5.0 and above.
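Since we cannot change the system compiler, one workaround we are considering (not verified yet) is installing a newer g++ inside the conda environment, which does not require root, and pointing PyTorch's JIT extension builder at it via CXX. A rough sketch; the conda-forge package and wrapper names may differ on this AMI:

conda install -y -c conda-forge gxx_linux-64      # newer toolchain, no root needed
export CC=x86_64-conda-linux-gnu-cc               # wrapper names vary by package version
export CXX=x86_64-conda-linux-gnu-c++
python -c "import torch.utils.cpp_extension as ext; print(ext.check_compiler_abi_compatibility('$CXX'))"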
We have tried to install DeepSpeed in the following ways:
- pip install deepspeed
- DS_BUILD_OPS=1 pip install deepspeed --global-option="build_ext" --global-option="-j8"
- From source following this guide: https://www.deepspeed.ai/tutorials/advanced-install/#install-deepspeed-from-source
Unfortunately, for security reasons we do not have root access on this instance, so we cannot directly upgrade the CUDA/gcc versions, which is what resolved these related issues: https://github.com/microsoft/DeepSpeed/issues/694 and https://github.com/microsoft/DeepSpeedExamples/issues/85, among others.
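For reference, the advanced-install guide linked above also documents per-op build flags, so a narrower variant of the prebuilt-ops attempt would be compiling only the Adam kernels (illustrative; it still needs a compiler that can build the CUDA extensions):

DS_BUILD_FUSED_ADAM=1 DS_BUILD_CPU_ADAM=1 pip install deepspeed --global-option="build_ext" --global-option="-j8"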
To Reproduce
Steps to reproduce the behavior:
- Spin up Amazon EC2 p3.16xlarge instance
- Install and set up the tutorial example: https://huggingface.co/transformers/main_classes/deepspeed.html (rough setup sketch below)
- Run run_translation.py with the DeepSpeed launcher (command under Launcher context)
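Our setup roughly followed the tutorial; the paths and extras below are as we recall them and may not match the current repo layout:

git clone https://github.com/huggingface/transformers.git
cd transformers/examples/pytorch/translation
pip install -r requirements.txt                   # example-specific dependencies
pip install "transformers[deepspeed]"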
Expected behavior
A very fast training time. Using python -m torch.distributed.launch without DeepSpeed runs as expected in our environment.
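For comparison, the non-DeepSpeed baseline was launched along these lines (same script arguments as in the launcher context below):

python -m torch.distributed.launch --nproc_per_node=8 run_translation.py \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro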
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/torch']
torch version .................... 1.9.1+cu102
torch cuda version ............... 10.2
nvcc version ..................... 10.0
deepspeed install path ........... ['/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.5.4+c6d1418, c6d1418, master
deepspeed wheel compiled w. ...... torch 1.9, cuda 10.2
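Note the minor-version mismatch between nvcc (10.0) and the CUDA build of torch (10.2) above; both can be confirmed on the instance with:

nvcc --version | grep release
python -c "import torch; print(torch.version.cuda)"

Because fused_adam is not pre-installed, the failure can also be reproduced outside the training script by forcing the JIT build directly. A minimal sketch, assuming a visible CUDA device (constructing FusedAdam is what triggers the multi_tensor_adam.cuda.o compile):

python - <<'EOF'
import torch
from deepspeed.ops.adam import FusedAdam
p = torch.nn.Parameter(torch.randn(16, device="cuda"))
opt = FusedAdam([p], lr=1e-3)      # JIT build of the fused_adam op happens here
p.grad = torch.randn_like(p)
opt.step()
print("fused_adam built and ran")
EOF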
System info (please complete the following information):
- 8x V100s
- Python 3.6.13
- CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())" -> (7, 0)
- gcc (GCC) 4.8.5
Launcher context
deepspeed run_translation.py \
--deepspeed ds_config.json \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro
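The ds_config.json referenced above follows the ZeRO stage 2 example from the tutorial; a minimal sketch ("auto" values are filled in by the HuggingFace Trainer, and the exact contents we used may differ):

cat > ds_config.json <<'EOF'
{
  "fp16": { "enabled": "auto" },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" }
  },
  "zero_optimization": { "stage": 2 },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
EOF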
Top GitHub Comments
@philschmid, @jeffra, and @tjruwase, thank you for the help! Spinning up an instance with AL2 worked perfectly. You are all wizards of the highest order.
@franckjay it appears AL2 has gcc 7.3, which should be new enough to compile our kernels.
https://aws.amazon.com/amazon-linux-2/faqs/