
[BUG] SageMaker p3.16xlarge failure on running HuggingFace tutorial: `FAILED: multi_tensor_adam.cuda.o`


Describe the bug
I am trying to reproduce the HuggingFace + DeepSpeed training example (https://huggingface.co/transformers/main_classes/deepspeed.html) on a SageMaker p3.16xlarge instance (8× Tesla V100s). However, we cannot get past the FAILED: fused_adam_frontend.o and FAILED: multi_tensor_adam.cuda.o errors. It may also be related to our gcc version:

Your compiler (c++ 4.8.5) may be ABI-incompatible with PyTorch!
Please use a compiler that is ABI-compatible with GCC 5.0 and above.
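That warning comes from PyTorch's C++ extension builder, which requires GCC 5.0 or newer because the C++11 ABI changed in GCC 5. A rough sketch of the version gate that trips here (illustrative only, not PyTorch's actual internals):

```python
# Hypothetical sketch of the GCC ABI check behind the warning above.
# PyTorch flags compilers older than GCC 5.0 as potentially ABI-incompatible.

MIN_ABI_COMPATIBLE_GCC = (5, 0)

def is_abi_compatible(version_string: str) -> bool:
    """Return True if a gcc version string (e.g. '4.8.5') meets the minimum."""
    major_minor = tuple(int(p) for p in version_string.split(".")[:2])
    return major_minor >= MIN_ABI_COMPATIBLE_GCC

print(is_abi_compatible("4.8.5"))  # this instance's compiler -> False
print(is_abi_compatible("7.3.0"))  # AL2's compiler (see comments below) -> True
```

This is why the JIT build of the fused Adam ops fails on this box: gcc 4.8.5 is below the cutoff, regardless of how DeepSpeed itself is installed.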

We have tried to install deepspeed from:

  1. pip install deepspeed
  2. DS_BUILD_OPS=1 pip install deepspeed --global-option="build_ext" --global-option="-j8"
  3. From source following this guide: https://www.deepspeed.ai/tutorials/advanced-install/#install-deepspeed-from-source

Unfortunately for security reasons, we do not have access to the root of this instance, so we cannot directly upgrade the CUDA/gcc version, which seemed to work for these related issues: https://github.com/microsoft/DeepSpeed/issues/694 and https://github.com/microsoft/DeepSpeedExamples/issues/85, among others.

To Reproduce
Steps to reproduce the behavior:

  1. Spin up an Amazon EC2 p3.16xlarge instance
  2. Install and set up the tutorial example: https://huggingface.co/transformers/main_classes/deepspeed.html
  3. Run run_translation.py

Expected behavior
A fast, successful training run. Using python -m torch.distributed.launch without DeepSpeed runs as expected in our environment.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/torch']
torch version .................... 1.9.1+cu102
torch cuda version ............... 10.2
nvcc version ..................... 10.0
deepspeed install path ........... ['/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.5.4+c6d1418, c6d1418, master
deepspeed wheel compiled w. ...... torch 1.9, cuda 10.2
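Note that the report above also shows nvcc 10.0 against a PyTorch built for CUDA 10.2; DeepSpeed compiles its CUDA ops with the system nvcc, so a toolkit/torch mismatch like this can break the JIT build on top of the old gcc. A small, hypothetical cross-check of the kind worth running on a ds_report:

```python
# Hypothetical sanity check for the ds_report output above: DeepSpeed's op
# builds generally expect the system nvcc to match the CUDA version that
# PyTorch was compiled against (here 10.0 vs 10.2 do not match).

def cuda_versions_match(torch_cuda: str, nvcc: str) -> bool:
    """Compare major.minor CUDA versions as reported by ds_report."""
    return torch_cuda.split(".")[:2] == nvcc.split(".")[:2]

print(cuda_versions_match("10.2", "10.0"))  # this report -> False
```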

System info (please complete the following information):

  • 8x V100s
  • Python 3.6.13
  • CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())" -> (7, 0)
  • gcc (GCC) 4.8.5

Launcher context

deepspeed run_translation.py \
    --deepspeed ds_config.json \
    --model_name_or_path t5-small --per_device_train_batch_size 1   \
    --output_dir output_dir --overwrite_output_dir  \
    --do_train --max_train_samples 500 --num_train_epochs 1 \
    --dataset_name wmt16 --dataset_config "ro-en" \
    --source_lang en --target_lang ro 

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 12 (2 by maintainers)

Top GitHub Comments

2 reactions
franckjay commented, Oct 7, 2021

@philschmid , @jeffra , and @tjruwase , thank you for the help! Spinning up an instance with AL2 worked perfectly. You are all wizards of the highest order.

1 reaction
jeffra commented, Oct 7, 2021

@franckjay it appears AL2 has gcc 7.3, which should be new enough to compile our kernels.

https://aws.amazon.com/amazon-linux-2/faqs/
