Cannot run distributed training on SageMaker
I know that distributed data parallelism in the actual accelerate library is still under development, but per the HF/AWS webinar here https://youtu.be/vEuJBdnb_uM?t=1153, all I should need to do to launch a fully utilized distributed SageMaker job is pass a well-formed distribution dict to a sagemaker.huggingface.HuggingFace estimator with an appropriate instance type.
I’m currently running a job like this:
from sagemaker.huggingface import HuggingFace

distribution = {
    'smdistributed': {
        'dataparallel': {'enabled': True}
    }
}

estimator = HuggingFace(
    image_uri=image_uri,
    role=role.arn,
    train_instance_count=1,
    train_instance_type="ml.p3dn.24xlarge",
    volume_size_in_gb=50,
    max_run=(24 * 60 * 60),
    hyperparameters=hyperparameters,
    base_job_name=JOB_NAME,
    distribution=distribution,
    py_version='py36',
    entry_point='./container/layoutlmv2/train.py',
)
estimator.fit()
The only difference between my job and the one in the YouTube video is that I’m passing a custom image_uri to the estimator.
My train.py file sets up the Accelerator as follows:
from accelerate import Accelerator

...
accelerator = Accelerator()
...
# Wrap the dataloaders, model, optimizer, and scheduler so Accelerator can
# place them on the right devices/processes.
train_dataloader, valid_dataloader, model, optimizer, lr_scheduler = accelerator.prepare(
    train_dataloader, valid_dataloader, model, optimizer, lr_scheduler
)
print("Accelerator has determined the num processes to be: ", accelerator.num_processes)
Now the specified instance type is an appropriate one, with the following attributes:
# name           GPUs  GPU mem  GPU type
# p3dn.24xlarge  8     32 GB    V100
But when execution reaches that print statement, I get back:
Accelerator has determined the num processes to be: 1
What am I missing to make accelerator work across multiple GPUs on a single AWS instance?
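For reference, as far as I understand, Accelerate infers the number of processes from the environment set up by whatever launcher started the script, so a quick sanity check is to print the distributed-related environment variables inside train.py. This is only a debugging sketch; the names below are the standard torch.distributed launcher variables plus SageMaker's SM_* training variables, not anything accelerate-specific:

import os

# Debugging sketch: dump the environment the training process actually sees,
# to check whether any launcher started multiple processes at all.
for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR",
            "SM_HOSTS", "SM_CURRENT_HOST", "SM_NUM_GPUS"):
    print(var, "=", os.environ.get(var, "<unset>"))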
EDIT: I noticed that the provided HF training DLC installs Horovod. Is this the missing piece? Would just adding these lines to my custom container solve it?
# Install Horovod
ENV HOROVOD_VERSION=0.21.3
RUN pip uninstall -y horovod \
&& ldconfig /usr/local/cuda-11.1/targets/x86_64-linux/lib/stubs \
&& HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_CUDA_HOME=/usr/local/cuda-11.1 HOROVOD_WITH_PYTORCH=1 pip install --no-cache-dir horovod==${HOROVOD_VERSION} \
&& ldconfig
Top GitHub Comments
For those who find this: I ran accelerate launch locally, then copied the resulting config file into my Docker container and added this line:
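A minimal sketch of that approach, assuming the file generated locally by accelerate config is copied into Accelerate's default config location inside the image (the destination path is an assumption and depends on the base image's home directory):

# Sketch only: copy a locally generated accelerate config into the image so
# accelerate can pick it up at runtime. Path is assumed, not verified.
COPY default_config.yaml /root/.cache/huggingface/accelerate/default_config.yaml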
As per the current implementation, the support is only for the official HF DLC. One will have to install accelerate from source in editable mode, change the HF estimator to use a custom image_uri, and remove the pytorch and transformers versions. That should work in theory if the custom Docker image is created following the guidelines in the doc "Create Your Own Docker Container with the SageMaker Distributed Data Parallel Library".
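A rough sketch of the estimator change described above, reusing image_uri, role, and hyperparameters from the original snippet; this only illustrates dropping the version pins in favor of a custom image, not a verified configuration:

# Sketch: point the estimator at the custom image and omit
# pytorch_version / transformers_version, keeping the smdistributed config.
estimator = HuggingFace(
    image_uri=image_uri,
    role=role.arn,
    train_instance_count=1,
    train_instance_type='ml.p3dn.24xlarge',
    distribution={'smdistributed': {'dataparallel': {'enabled': True}}},
    hyperparameters=hyperparameters,
    entry_point='./container/layoutlmv2/train.py',
)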