
Cannot run distributed training on SageMaker

See original GitHub issue

I know that distributed data parallelism in the accelerate library itself is still under development, but per the HF/AWS webinar here https://youtu.be/vEuJBdnb_uM?t=1153, all I should need to do to launch a fully utilized distributed SageMaker job is pass a well-formed distribution dict to a sagemaker.huggingface.HuggingFace estimator with an appropriate instance type.

I’m currently running a job like this:

distribution = {
    'smdistributed': {
        'dataparallel': {'enabled': True}
    }
}

estimator = HuggingFace(
    image_uri=image_uri,
    role=role.arn,
    train_instance_count=1,
    train_instance_type="ml.p3dn.24xlarge",
    volume_size_in_gb=50,
    max_run=(24 * 60 * 60),
    hyperparameters=hyperparameters,
    base_job_name=JOB_NAME,
    distribution=distribution,
    py_version='py36',
    entry_point='./container/layoutlmv2/train.py',
)

estimator.fit()

The only difference between my job and the YouTube video is that I’m passing a custom image_uri to the estimator.

My train.py file sets up the Accelerator as follows:

from accelerate import Accelerator

...
accelerator = Accelerator()
...
# hand everything to accelerate so it can place them on the available devices
train_dataloader, valid_dataloader, model, optimizer, lr_scheduler = accelerator.prepare(
    train_dataloader, valid_dataloader, model, optimizer, lr_scheduler
)

print("Accelerator has determined the num processes to be: ", accelerator.num_processes)

Now the specified instance is an appropriate one, with the following attributes:

#   name            GPUs    GPU mem (each)    GPU type
#   p3dn.24xlarge   8       32GB              V100

Except when execution reaches that print statement, I get back:

Accelerator has determined the num processes to be:  1

What am I missing to make the Accelerator work across multiple GPUs on a single AWS instance?
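
For reference, here’s a minimal diagnostic sketch that can go at the top of train.py (assuming PyTorch is available in the container; the environment variables below are the standard ones set by common PyTorch launchers such as torchrun and accelerate launch, not anything specific to this setup):

import os
import torch

# 8 would mean every GPU on the p3dn.24xlarge is visible to this process
print("CUDA devices visible:", torch.cuda.device_count())

# Distributed launchers set these per worker; if all are None,
# no launcher ever spawned worker processes for this script.
for var in ("RANK", "WORLD_SIZE", "LOCAL_RANK"):
    print(var, "=", os.environ.get(var))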

EDIT: I noticed that the provided HF training DLC installs Horovod. Is this the missing piece? Would just adding these lines to my custom container solve it?

# Install Horovod
ENV HOROVOD_VERSION=0.21.3
RUN pip uninstall -y horovod \
 && ldconfig /usr/local/cuda-11.1/targets/x86_64-linux/lib/stubs \
 && HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_CUDA_HOME=/usr/local/cuda-11.1 HOROVOD_WITH_PYTORCH=1 pip install --no-cache-dir horovod==${HOROVOD_VERSION} \
 && ldconfig

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

1 reaction
plamb-viso commented, Jul 7, 2022

For those who find this: I ran accelerate launch locally, then copied the config file into my Docker container and added this ENTRYPOINT:

ENTRYPOINT [ \
    "accelerate", \
    "launch", \
    "--config_file", \
    "/opt/ml/code/multi_gpu_config.yaml", \
    "/opt/ml/code/train.py" \
]
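
A sketch of what /opt/ml/code/multi_gpu_config.yaml might contain for this instance (hedged: the exact fields vary by accelerate version, and these values are illustrative rather than taken from the original comment):

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8  # one process per V100 on a p3dn.24xlarge
use_cpu: false
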
0 reactions
pacman100 commented, Jul 7, 2022

As per the current implementation, the support is for the official HF DLC. One will have to install accelerate from source in editable mode, change the HF estimator to use a custom image_uri, and remove the pytorch and transformers versions. That should work in theory if the custom Docker image is created following the guidelines in the doc Create Your Own Docker Container with the SageMaker Distributed Data Parallel Library.
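
A minimal sketch of that estimator change (hedged: the ECR URI is a placeholder, and role, hyperparameters, and the entry point are assumed to be defined as in the question):

from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    # custom image built per the SageMaker data-parallel container
    # guidelines; this URI is hypothetical
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-accelerate-image:latest",
    role=role,
    instance_count=1,
    instance_type="ml.p3dn.24xlarge",
    entry_point="train.py",
    hyperparameters=hyperparameters,
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    # note: no pytorch_version / transformers_version -- the custom image supplies them
)
estimator.fit()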
