Upload model to GCS during multi-TPU training in colab, cell stuck

I am trying to run multi-TPU training on Colab using notebook_launcher. In the training loop, I am able to save the model locally using accelerator.save(model.state_dict(), local_file_name). However, when I try to upload it to GCS using gsutil cp, the cell gets stuck:

accelerator.print('Inside saving')
accelerator.save(model.state_dict(), local_file_name)
accelerator.print('Model saved')
accelerator.print('Uploading model')
if accelerator.is_main_process:
  gcs_file_name = os.path.join(MODELS_DIR, args.save_prefix, '{}.pt'.format(num_steps))
  ! gsutil cp {local_file_name} {gcs_file_name}
accelerator.print('Upload done to {}!'.format(gcs_file_name))

Output and where it gets stuck:

Inside saving
Model saved
Uploading model

Is it not possible to upload to GCS during the training loop? If not, is there another solution?

Thanks
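
One thing that may be worth checking in the snippet above, though the thread does not confirm it as the cause: gcs_file_name is assigned only inside the is_main_process branch, while the final accelerator.print call formats it on every process, so the non-main workers would raise a NameError unless the variable is defined elsewhere. The `!` shell escape also depends on IPython, which may not behave the same way inside the workers started by notebook_launcher. The sketch below is a minimal alternative under those assumptions: it computes the path on every process and calls gsutil through subprocess with a timeout. The helper names and the 600-second timeout are hypothetical choices for illustration, not part of Accelerate.

```python
import posixpath
import subprocess
from typing import Callable, List


def build_gcs_path(models_dir: str, save_prefix: str, num_steps: int) -> str:
    """Build the destination path; safe to call on every process."""
    # posixpath keeps forward slashes, which is what gs:// URLs need.
    return posixpath.join(models_dir, save_prefix, f"{num_steps}.pt")


def build_upload_command(local_file_name: str, gcs_file_name: str) -> List[str]:
    """Return the same gsutil invocation the question uses, as an argument list."""
    return ["gsutil", "cp", local_file_name, gcs_file_name]


def upload_checkpoint(
    local_file_name: str,
    gcs_file_name: str,
    is_main_process: bool,
    run: Callable = subprocess.run,  # injectable so the logic can be tested offline
    timeout_s: int = 600,            # hypothetical limit; tune for checkpoint size
) -> bool:
    """Upload from the main process only; return True if an upload was attempted."""
    if not is_main_process:
        return False
    # check=True raises CalledProcessError on a non-zero exit code, and
    # timeout raises TimeoutExpired instead of blocking the cell forever.
    run(build_upload_command(local_file_name, gcs_file_name),
        check=True, timeout=timeout_s)
    return True


# Possible use inside the training function (names taken from the question):
#   gcs_file_name = build_gcs_path(MODELS_DIR, args.save_prefix, num_steps)
#   accelerator.save(model.state_dict(), local_file_name)
#   upload_checkpoint(local_file_name, gcs_file_name, accelerator.is_main_process)
#   accelerator.wait_for_everyone()
#   accelerator.print('Upload done to {}!'.format(gcs_file_name))
```

With the timeout in place, a stalled upload surfaces as an exception on the main process rather than as a cell that never returns, which should at least make the failure easier to diagnose.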

Top GitHub Comments

sgugger commented, Jul 26, 2021

You could push your model to the Hugging Face Hub whenever you want to checkpoint (with the push_to_hub method). That way, you can resume training from anywhere.
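
As a rough illustration of that suggestion: the model in the question is saved as a plain state dict, so the sketch below swaps in HfApi.upload_file from the huggingface_hub package rather than the push_to_hub method mentioned above. It assumes huggingface_hub is installed, that you are already authenticated, and that the target repo exists. The function name, the checkpoints/ layout, and the repo ID are hypothetical.

```python
from typing import Callable, Optional


def push_checkpoint(
    local_file_name: str,
    repo_id: str,
    num_steps: int,
    is_main_process: bool,
    upload: Optional[Callable] = None,  # injectable for offline testing
) -> Optional[str]:
    """Upload one checkpoint file to the Hub from the main process only.

    Returns the path inside the repo, or None on non-main processes.
    """
    if not is_main_process:
        return None
    # Hypothetical layout: one file per step under checkpoints/.
    path_in_repo = f"checkpoints/{num_steps}.pt"
    if upload is None:
        # Imported lazily so the helper can be defined without the package.
        from huggingface_hub import HfApi
        upload = HfApi().upload_file
    upload(
        path_or_fileobj=local_file_name,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
    )
    return path_in_repo


# Possible use inside the training function (repo ID is a placeholder):
#   accelerator.save(model.state_dict(), local_file_name)
#   push_checkpoint(local_file_name, "your-username/your-model", num_steps,
#                   accelerator.is_main_process)
```

If the model is a transformers model, calling its push_to_hub method from the main process, as suggested above, would be the more direct route.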

github-actions[bot] commented, May 24, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.