S3 checkpoints not working with distributed training on SageMaker
See original GitHub issue.
Environment info
- transformers version: 4.5.0
- Platform: AWS SageMaker
- Python version: 3.6
- PyTorch version (GPU?): 1.7.1
- Tensorflow version (GPU?):
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes
Who can help
Information
Model I am using (Bert, XLNet …): gpt-neo
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Use the run_clm.py example script to fine-tune gpt-neo on SageMaker with either torch.distributed.launch or SageMaker distributed model parallel, e.g. on a p4d.24xlarge with 8 GPUs (a launch sketch follows the error message below).
- Only the first checkpoint is synced to the checkpoint_s3_uri location; subsequent checkpoints never appear in S3.
- At the end of the training job, the job sits in the "Uploading" state for around 1 hour and then fails with the error below.
InternalServerError: We encountered an internal error. Please try again.
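For context, a minimal sketch of the kind of estimator launch being described, using the SageMaker Python SDK. The role ARN, bucket, source_dir, and hyperparameter values are illustrative placeholders, not taken from the original job:

```python
# Sketch only: the failing setup as described in the steps above; placeholders throughout.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="run_clm.py",                       # official transformers example script
    source_dir="./examples/language-modeling",      # assumed location of run_clm.py
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    framework_version="1.7.1",
    py_version="py36",
    # SageMaker is supposed to keep this local path continuously synced to S3:
    checkpoint_s3_uri="s3://my-bucket/gpt-neo-checkpoints",  # placeholder bucket
    checkpoint_local_path="/opt/ml/checkpoints",
    hyperparameters={
        "model_name_or_path": "EleutherAI/gpt-neo-1.3B",     # assumed model size
        "output_dir": "/opt/ml/checkpoints",                 # write checkpoints to the synced path
        "do_train": True,
    },
    # SageMaker distributed model parallel, as in the reproduce steps above:
    distribution={
        "smdistributed": {"modelparallel": {"enabled": True, "parameters": {"partitions": 2}}},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)
estimator.fit()
```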
Expected behavior
I expected training to work normally, with all the checkpoints and the final model synced to the S3 location.
NB: training works when I don't use checkpoint_s3_uri (with both torch.distributed.launch and SageMaker distributed model parallel).
It also works on a single GPU (a p3.2xlarge): training with checkpoint_s3_uri succeeds, and all the checkpoints and the final model are synced to S3.
Issue Analytics
- Created: 2 years ago
- Comments: 11 (6 by maintainers)
Top GitHub Comments
@philschmid yeah, I made the changes below, switching from the PyTorch estimator to the HuggingFace one, and now distributed training with S3 checkpoints is working properly (the training job completes successfully, and all the checkpoints are synced to S3). It works both with SageMaker distributed model parallel and with torch.distributed.launch.
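The commenter's original diff isn't reproduced here, but the switch looks roughly like this; the DLC versions, role, bucket, and hyperparameter values are assumptions, not the commenter's exact settings:

```python
# Key change: sagemaker.huggingface.HuggingFace instead of sagemaker.pytorch.PyTorch,
# pinning transformers/pytorch DLC versions instead of a bare framework_version.
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="run_clm.py",
    source_dir="./examples/language-modeling",      # assumed
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    transformers_version="4.6.1",                   # assumed HuggingFace DLC version
    pytorch_version="1.7.1",
    py_version="py36",
    checkpoint_s3_uri="s3://my-bucket/gpt-neo-checkpoints",  # placeholder
    checkpoint_local_path="/opt/ml/checkpoints",
    hyperparameters={
        "model_name_or_path": "EleutherAI/gpt-neo-1.3B",     # assumed
        "output_dir": "/opt/ml/checkpoints",
        "do_train": True,
    },
    distribution={
        "smdistributed": {"modelparallel": {"enabled": True, "parameters": {"partitions": 2}}},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)
estimator.fit()
```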
Also just wanted to say that I was pleasantly surprised with how seamlessly Transformers is working with SageMaker model parallel. Great work guys!
Hey @philschmid,
I tried adding overwrite_output_dir=True, and it partially solved my issue. The checkpoints now stay in sync with S3 (all the checkpoints and model artifacts get saved at the desired location). But even though all the checkpoints were uploaded to S3, the job still showed the status as Uploading for an hour and ended with an internal error (weird).
PS: when I didn't integrate data parallelism on the same instance type (p4d.24xlarge), everything worked seamlessly.
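For anyone hitting the same thing: overwrite_output_dir is passed through the estimator's hyperparameters, roughly like this (all values below are illustrative placeholders):

```python
# overwrite_output_dir lets run_clm.py start in a non-empty output_dir instead of
# erroring out, which matters here because /opt/ml/checkpoints already holds the
# checkpoints that SageMaker synced back from a previous run.
hyperparameters = {
    "model_name_or_path": "EleutherAI/gpt-neo-1.3B",  # assumed model
    "output_dir": "/opt/ml/checkpoints",              # the path synced to checkpoint_s3_uri
    "overwrite_output_dir": True,                     # the flag discussed above
    "do_train": True,
    "save_steps": 500,                                # assumed checkpoint frequency
}
```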