
[BUG] SageMaker p3.16xlarge failure on running HuggingFace tutorial: `FAILED: multi_tensor_adam.cuda.o`


Describe the bug
I am trying to reproduce the HuggingFace + DeepSpeed training example (https://huggingface.co/transformers/main_classes/deepspeed.html) on a SageMaker p3.16xlarge instance (8× Tesla V100s). However, we cannot get past the FAILED: fused_adam_frontend.o and FAILED: multi_tensor_adam.cuda.o errors. It may also be related to our gcc version:

Your compiler (c++ 4.8.5) may be ABI-incompatible with PyTorch!
Please use a compiler that is ABI-compatible with GCC 5.0 and above.
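That warning comes from PyTorch's C++ extension builder, which requires GCC 5.0 or newer because the C++11 ABI changed in GCC 5. A rough sketch of the version gate that trips here (illustrative only, not PyTorch's actual internals):

```python
# Hypothetical sketch of the GCC ABI check behind the warning above.
# PyTorch flags compilers older than GCC 5.0 as potentially ABI-incompatible.

MIN_ABI_COMPATIBLE_GCC = (5, 0)

def is_abi_compatible(version_string: str) -> bool:
    """Return True if a gcc version string (e.g. '4.8.5') meets the minimum."""
    major_minor = tuple(int(p) for p in version_string.split(".")[:2])
    return major_minor >= MIN_ABI_COMPATIBLE_GCC

print(is_abi_compatible("4.8.5"))  # this instance's compiler -> False
print(is_abi_compatible("7.3.0"))  # AL2's compiler (see comments below) -> True
```

This is why the JIT build of the fused Adam ops fails on this box: gcc 4.8.5 is below the cutoff, regardless of how DeepSpeed itself is installed.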

We have tried to install deepspeed from:

  1. pip install deepspeed
  2. DS_BUILD_OPS=1 pip install deepspeed --global-option="build_ext" --global-option="-j8"
  3. From source following this guide: https://www.deepspeed.ai/tutorials/advanced-install/#install-deepspeed-from-source

Unfortunately for security reasons, we do not have access to the root of this instance, so we cannot directly upgrade the CUDA/gcc version, which seemed to work for these related issues: https://github.com/microsoft/DeepSpeed/issues/694 and https://github.com/microsoft/DeepSpeedExamples/issues/85, among others.

To Reproduce
Steps to reproduce the behavior:

  1. Spin up an Amazon EC2 p3.16xlarge instance
  2. Install and set up the tutorial example: https://huggingface.co/transformers/main_classes/deepspeed.html
  3. Run run_translation.py

Expected behavior
A fast, successful training run. Using python -m torch.distributed.launch without DeepSpeed runs as expected in our environment.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/torch']
torch version .................... 1.9.1+cu102
torch cuda version ............... 10.2
nvcc version ..................... 10.0
deepspeed install path ........... ['/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.5.4+c6d1418, c6d1418, master
deepspeed wheel compiled w. ...... torch 1.9, cuda 10.2
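Note that the report above also shows nvcc 10.0 against a PyTorch built for CUDA 10.2; DeepSpeed compiles its CUDA ops with the system nvcc, so a toolkit/torch mismatch like this can break the JIT build on top of the old gcc. A small, hypothetical cross-check of the kind worth running on a ds_report:

```python
# Hypothetical sanity check for the ds_report output above: DeepSpeed's op
# builds generally expect the system nvcc to match the CUDA version that
# PyTorch was compiled against (here 10.0 vs 10.2 do not match).

def cuda_versions_match(torch_cuda: str, nvcc: str) -> bool:
    """Compare major.minor CUDA versions as reported by ds_report."""
    return torch_cuda.split(".")[:2] == nvcc.split(".")[:2]

print(cuda_versions_match("10.2", "10.0"))  # this report -> False
```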

System info (please complete the following information):

  • 8x V100s
  • Python 3.6.13
  • CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())" -> (7, 0)
  • gcc (GCC) 4.8.5

Launcher context

deepspeed run_translation.py \
    --deepspeed ds_config.json \
    --model_name_or_path t5-small --per_device_train_batch_size 1   \
    --output_dir output_dir --overwrite_output_dir  \
    --do_train --max_train_samples 500 --num_train_epochs 1 \
    --dataset_name wmt16 --dataset_config "ro-en" \
    --source_lang en --target_lang ro 

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 12 (2 by maintainers)

Top GitHub Comments

2 reactions
franckjay commented, Oct 7, 2021

@philschmid , @jeffra , and @tjruwase , thank you for the help! Spinning up an instance with AL2 worked perfectly. You are all wizards of the highest order.

1 reaction
jeffra commented, Oct 7, 2021

@franckjay it appears AL2 has gcc 7.3, which should be new enough to compile our kernels.

https://aws.amazon.com/amazon-linux-2/faqs/
