S3 checkpoints not working with distributed training on SageMaker
See original GitHub issue.
Environment info
- transformers version: 4.5.0
- Platform: AWS SageMaker
- Python version: 3.6
- PyTorch version (GPU?): 1.7.1
- Tensorflow version (GPU?):
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes
Who can help
Information
Model I am using (Bert, XLNet …): gpt-neo
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Use the run_clm.py example script to fine-tune gpt-neo on SageMaker with either torch.distributed.launch or SageMaker distributed model parallel, e.g. on a p4d.24xlarge with 8 GPUs (a launch sketch follows the error message below).
- Only the first checkpoint is synced to the checkpoint_s3_uri location; subsequent checkpoints never appear in S3.
- At the end of the training job, the job sits in the "Uploading" state for around 1 hour and then fails with the error below.
InternalServerError: We encountered an internal error. Please try again.
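For context, a minimal sketch of the kind of estimator launch being described, using the SageMaker Python SDK. The role ARN, bucket, source_dir, and hyperparameter values are illustrative placeholders, not taken from the original job:

```python
# Sketch only: the failing setup as described in the steps above; placeholders throughout.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="run_clm.py",                       # official transformers example script
    source_dir="./examples/language-modeling",      # assumed location of run_clm.py
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    framework_version="1.7.1",
    py_version="py36",
    # SageMaker is supposed to keep this local path continuously synced to S3:
    checkpoint_s3_uri="s3://my-bucket/gpt-neo-checkpoints",  # placeholder bucket
    checkpoint_local_path="/opt/ml/checkpoints",
    hyperparameters={
        "model_name_or_path": "EleutherAI/gpt-neo-1.3B",     # assumed model size
        "output_dir": "/opt/ml/checkpoints",                 # write checkpoints to the synced path
        "do_train": True,
    },
    # SageMaker distributed model parallel, as in the reproduce steps above:
    distribution={
        "smdistributed": {"modelparallel": {"enabled": True, "parameters": {"partitions": 2}}},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)
estimator.fit()
```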
Expected behavior
I expected training to work normally, with all the checkpoints and the final model synced to the S3 location.
NB: training works when I don't use checkpoint_s3_uri (with both torch.distributed.launch and SageMaker distributed model parallel).
It also works on a single GPU (a p3.2xlarge): training with checkpoint_s3_uri succeeds, and all the checkpoints and the final model are synced to S3.
Issue Analytics
- Created: 2 years ago
- Comments: 11 (6 by maintainers)
Top GitHub Comments
@philschmid yeah, I made the changes below, switching from the PyTorch estimator to the HuggingFace one, and now distributed training with S3 checkpoints is working properly (the training job completes successfully, and all the checkpoints are synced to S3). It works both with SageMaker distributed model parallel and with torch.distributed.launch.
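The commenter's original diff isn't reproduced here, but the switch looks roughly like this; the DLC versions, role, bucket, and hyperparameter values are assumptions, not the commenter's exact settings:

```python
# Key change: sagemaker.huggingface.HuggingFace instead of sagemaker.pytorch.PyTorch,
# pinning transformers/pytorch DLC versions instead of a bare framework_version.
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="run_clm.py",
    source_dir="./examples/language-modeling",      # assumed
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    transformers_version="4.6.1",                   # assumed HuggingFace DLC version
    pytorch_version="1.7.1",
    py_version="py36",
    checkpoint_s3_uri="s3://my-bucket/gpt-neo-checkpoints",  # placeholder
    checkpoint_local_path="/opt/ml/checkpoints",
    hyperparameters={
        "model_name_or_path": "EleutherAI/gpt-neo-1.3B",     # assumed
        "output_dir": "/opt/ml/checkpoints",
        "do_train": True,
    },
    distribution={
        "smdistributed": {"modelparallel": {"enabled": True, "parameters": {"partitions": 2}}},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)
estimator.fit()
```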
Also just wanted to say that I was pleasantly surprised with how seamlessly Transformers is working with SageMaker model parallel. Great work guys!
Hey @philschmid,
I tried adding overwrite_output_dir=True, and it partially solved my issue. The checkpoints now stay in sync with S3 (all the checkpoints and model artifacts get saved at the desired location). But even though all the checkpoints were uploaded to S3, the job still showed the status as Uploading for an hour and ended with an internal error (weird).
PS: when I didn't integrate data parallelism on the same instance type (p4d.24xlarge), everything worked seamlessly.
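For anyone hitting the same thing: overwrite_output_dir is passed through the estimator's hyperparameters, roughly like this (all values below are illustrative placeholders):

```python
# overwrite_output_dir lets run_clm.py start in a non-empty output_dir instead of
# erroring out, which matters here because /opt/ml/checkpoints already holds the
# checkpoints that SageMaker synced back from a previous run.
hyperparameters = {
    "model_name_or_path": "EleutherAI/gpt-neo-1.3B",  # assumed model
    "output_dir": "/opt/ml/checkpoints",              # the path synced to checkpoint_s3_uri
    "overwrite_output_dir": True,                     # the flag discussed above
    "do_train": True,
    "save_steps": 500,                                # assumed checkpoint frequency
}
```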