
S3 checkpoints not working with distributed training on SageMaker


Environment info

  • transformers version: 4.5.0
  • Platform: AWS SageMaker
  • Python version: 3.6
  • PyTorch version (GPU?): 1.7.1
  • Tensorflow version (GPU?):
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help

@sgugger

Information

Model I am using (Bert, XLNet …): gpt-neo

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Use the run_clm.py example script to fine-tune gpt-neo on SageMaker with either torch.distributed.launch or SageMaker distributed model parallel (say on a p4d.24xlarge with 8 GPUs); a sketch of the kind of hyperparameter setup involved follows the error message below.
  2. Observe that only the first checkpoint is synced to the checkpoint_s3_uri location; subsequent checkpoints never appear in S3.
  3. At the end of the training job, it spends around an hour in the “Uploading” state and ends with the error below.

InternalServerError: We encountered an internal error. Please try again.
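
For context, SageMaker uploads whatever the training script writes under checkpoint_local_path (which defaults to /opt/ml/checkpoints) to checkpoint_s3_uri. A minimal sketch of the kind of hyperparameters passed to run_clm.py in this setup, with hypothetical model and dataset values rather than the poster's actual configuration, might look like this:

# Hedged sketch: SageMaker forwards these as command-line arguments to run_clm.py,
# and syncs anything written under /opt/ml/checkpoints (the default
# checkpoint_local_path) to checkpoint_s3_uri.
hyperparameters = {
    'model_name_or_path': 'EleutherAI/gpt-neo-1.3B',  # assumed model id
    'dataset_name': 'wikitext',                       # hypothetical dataset
    'dataset_config_name': 'wikitext-2-raw-v1',
    'do_train': True,
    'per_device_train_batch_size': 4,
    'save_steps': 500,
    'output_dir': '/opt/ml/checkpoints',  # write checkpoints where SageMaker syncs from
}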

Expected behavior

I expected the training to work normally, and all the checkpoints and final model to get synced to the S3 location.

NB: training works when I don’t use checkpoint_s3_uri (with both torch.distributed.launch and SageMaker distributed model parallel).

Also, with a single GPU (on a p3.2xlarge), training with checkpoint_s3_uri works: all the checkpoints and the final model are synced to S3.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction
laphang commented, Apr 26, 2021

@philschmid yeah, I made the changes below, switching from the PyTorch estimator to the HuggingFace one, and distributed training with S3 checkpoints is now working properly (the training job completes successfully and all the checkpoints are synced to S3). It works both with SageMaker distributed model parallel and with torch.distributed.launch.

Also just wanted to say that I was pleasantly surprised with how seamlessly Transformers is working with SageMaker model parallel. Great work guys!

before:
estimator = PyTorch(
    base_job_name=job_name,
    entry_point='run_clm.py',
    source_dir=source_dir,
    code_location=output_path,
    role=role,
    framework_version='1.7.1',
    py_version='py3',
    hyperparameters=hyperparameters,
    tags=tags,
    output_path=output_path,
    checkpoint_s3_uri=checkpoint_path,
    instance_count=1,
    instance_type='ml.p4d.24xlarge',
    distribution=distribution,
    use_spot_instances=train_use_spot_instances,
    max_run=train_max_run,
    max_wait=train_max_wait,
    metric_definitions=metric_definition,
)

after:
estimator = HuggingFace(
    base_job_name=job_name,
    entry_point='run_clm.py',
    source_dir=source_dir,
    code_location=output_path,
    role=role,
    transformers_version='4.4.2',
    pytorch_version='1.6.0',
    py_version='py36',
    hyperparameters=hyperparameters,
    tags=tags,
    output_path=output_path,
    checkpoint_s3_uri=checkpoint_s3_uri,
    debugger_hook_config=False,
    instance_count=1,
    instance_type='ml.p4d.24xlarge',
    distribution=distribution,
    use_spot_instances=train_use_spot_instances,
    max_run=train_max_run,
    max_wait=train_max_wait,
    metric_definitions=metric_definition,
)
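
For reference, the distribution variable used in both snippets is not shown in the comment; for SageMaker distributed model parallel it would typically be a dict along these lines (an assumption based on SageMaker SDK conventions, not taken from the original post):

# Hedged sketch of a model-parallel distribution config; the parameter values are
# hypothetical and would need tuning for the actual model and instance type.
distribution = {
    'smdistributed': {
        'modelparallel': {
            'enabled': True,
            'parameters': {
                'partitions': 8,    # hypothetical: shard the model across the 8 GPUs
                'microbatches': 4,
                'ddp': True,
            },
        }
    },
    'mpi': {
        'enabled': True,
        'processes_per_host': 8,   # one process per GPU on a p4d.24xlarge
    },
}

Note that the working configuration above also sets debugger_hook_config=False, which was not present in the original PyTorch estimator.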
0 reactions
Harshitcmd commented, Nov 1, 2021

Hey @philschmid,

I tried adding overwrite_output_dir=True and it partially solved my issue. Now the checkpoints are in sync with S3 (all the checkpoints and model artifacts are saved at the desired location). However, even though all the checkpoints got uploaded to S3, the job still showed the status as Uploading for an hour and ended with an internal error (weird).

PS: when I didn’t enable data parallelism on the same instance type (p4d.24xlarge), everything worked seamlessly.
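
Note that overwrite_output_dir is an argument of run_clm.py (a TrainingArguments flag), so on SageMaker it goes into the estimator's hyperparameters dict rather than into the estimator call itself; a minimal sketch with hypothetical surrounding values:

hyperparameters = {
    # ... other run_clm.py arguments ...
    'output_dir': '/opt/ml/checkpoints',
    'overwrite_output_dir': True,  # allow training to start even if output_dir already contains checkpoints
}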


