
Cannot run distributed training on SageMaker

See original GitHub issue

I know that distributed data parallelism in the accelerate library itself is still under development, but per the HF/AWS webinar here https://youtu.be/vEuJBdnb_uM?t=1153, all I should need to do to launch a fully utilized distributed SageMaker job is pass a well-formed distribution dict to a sagemaker.huggingface.HuggingFace estimator with an appropriate instance type.

I’m currently running a job like this:

distribution = {
    'smdistributed': {
        'dataparallel': {'enabled': True}
    }
}

estimator = HuggingFace(
    image_uri=image_uri,
    role=role.arn,
    train_instance_count=1,
    train_instance_type="ml.p3dn.24xlarge",
    volume_size_in_gb=50,
    max_run=(24 * 60 * 60),
    hyperparameters=hyperparameters,
    base_job_name=JOB_NAME,
    distribution=distribution,
    py_version='py36',
    entry_point='./container/layoutlmv2/train.py',
)

estimator.fit()

The only difference between my job and the YouTube video is that I’m passing a custom image_uri to the estimator.

My train.py file sets up the Accelerator as follows:

from accelerate import Accelerator

...
accelerator = Accelerator()
...
# hand everything to accelerate so it can place them on the available devices
train_dataloader, valid_dataloader, model, optimizer, lr_scheduler = accelerator.prepare(
    train_dataloader, valid_dataloader, model, optimizer, lr_scheduler
)

print("Accelerator has determined the num processes to be: ", accelerator.num_processes)

Now the specified instance is an appropriate one, with the following attributes:

#   name            GPUs    GPU mem (each)    GPU type
#   p3dn.24xlarge   8       32GB              V100

Except when execution reaches that print statement, I get back:

Accelerator has determined the num processes to be:  1

What am I missing to make the Accelerator work across multiple GPUs on a single AWS instance?
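
For reference, here’s a minimal diagnostic sketch that can go at the top of train.py (assuming PyTorch is available in the container; the environment variables below are the standard ones set by common PyTorch launchers such as torchrun and accelerate launch, not anything specific to this setup):

import os
import torch

# 8 would mean every GPU on the p3dn.24xlarge is visible to this process
print("CUDA devices visible:", torch.cuda.device_count())

# Distributed launchers set these per worker; if all are None,
# no launcher ever spawned worker processes for this script.
for var in ("RANK", "WORLD_SIZE", "LOCAL_RANK"):
    print(var, "=", os.environ.get(var))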

EDIT: I noticed that the provided HF training DLC installs Horovod. Is this the missing piece? Would just adding these lines to my custom container solve it?

# Install Horovod
ENV HOROVOD_VERSION=0.21.3
RUN pip uninstall -y horovod \
 && ldconfig /usr/local/cuda-11.1/targets/x86_64-linux/lib/stubs \
 && HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_CUDA_HOME=/usr/local/cuda-11.1 HOROVOD_WITH_PYTORCH=1 pip install --no-cache-dir horovod==${HOROVOD_VERSION} \
 && ldconfig

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

1 reaction
plamb-viso commented, Jul 7, 2022

For those who find this: I ran accelerate launch locally, then copied the config file into my Docker container and added this ENTRYPOINT:

ENTRYPOINT [ \
    "accelerate", \
    "launch", \
    "--config_file", \
    "/opt/ml/code/multi_gpu_config.yaml", \
    "/opt/ml/code/train.py" \
]
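
A sketch of what /opt/ml/code/multi_gpu_config.yaml might contain for this instance (hedged: the exact fields vary by accelerate version, and these values are illustrative rather than taken from the original comment):

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8  # one process per V100 on a p3dn.24xlarge
use_cpu: false
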
0 reactions
pacman100 commented, Jul 7, 2022

As per the current implementation, the support is for the official HF DLC. One will have to install accelerate from source in editable mode, change the HF estimator to use a custom image_uri, and remove the pytorch and transformers versions. That should work in theory if the custom Docker image is created following the guidelines in the doc Create Your Own Docker Container with the SageMaker Distributed Data Parallel Library.
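
A minimal sketch of that estimator change (hedged: the ECR URI is a placeholder, and role, hyperparameters, and the entry point are assumed to be defined as in the question):

from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    # custom image built per the SageMaker data-parallel container
    # guidelines; this URI is hypothetical
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-accelerate-image:latest",
    role=role,
    instance_count=1,
    instance_type="ml.p3dn.24xlarge",
    entry_point="train.py",
    hyperparameters=hyperparameters,
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    # note: no pytorch_version / transformers_version -- the custom image supplies them
)
estimator.fit()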
