Upload model to GCS during multi-TPU training in colab, cell stuck

See original GitHub issue

Hi

I am trying to run multi-TPU training on Colab using notebook_launcher. In the training loop, I am able to save the model locally using accelerator.save(model.state_dict(), local_file_name). However, when I try to upload it to GCS using gsutil cp, the cell gets stuck:

accelerator.print('Inside saving')
accelerator.save(model.state_dict(), local_file_name)
accelerator.print('Model saved')
accelerator.print('Uploading model')
if accelerator.is_main_process:
  gcs_file_name = os.path.join(MODELS_DIR, args.save_prefix, '{}.pt'.format(num_steps))
  ! gsutil cp {local_file_name} {gcs_file_name}
accelerator.print('Upload done to {}!'.format(gcs_file_name))

Output and where it gets stuck:

Inside saving
Model saved
Uploading model

Is it not possible to upload to GCS during the training loop? If not, is there another solution?

Thanks

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

sgugger commented, Jul 26, 2021 (1 reaction)

You could push your model to the Hugging Face Hub when you want to checkpoint (with the push_to_hub method). This way you can resume training from anywhere.
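For readers who want to keep uploading to GCS, two details in the snippet from the question are worth checking. First, gcs_file_name is assigned only inside the `if accelerator.is_main_process:` block, but the final accelerator.print(...) call formats it on every process, so non-main processes would raise a NameError unless the name is defined elsewhere. Second, `! gsutil cp` is IPython shell-escape syntax, which may not behave well inside processes spawned by notebook_launcher. Whether either of these causes the hang was not confirmed in the thread. The following is a minimal sketch, not the maintainers' recommendation: upload_checkpoint is a hypothetical helper, the 600-second timeout is an arbitrary assumption, and the `run` parameter exists only so the helper can be exercised without gsutil installed.

```python
import subprocess


def upload_checkpoint(local_path, gcs_path, is_main_process, run=subprocess.run):
    """Copy a checkpoint to GCS from the main process only (hypothetical helper).

    Returns True if the upload command ran successfully, False if this
    process skipped the upload because it is not the main process.
    """
    if not is_main_process:
        return False
    # check=True raises CalledProcessError if gsutil exits with a non-zero code.
    # timeout (an assumed value) raises TimeoutExpired instead of blocking the
    # training loop indefinitely if the transfer stalls.
    run(["gsutil", "cp", local_path, gcs_path], check=True, timeout=600)
    return True
```

In the training loop, this could be used by computing gcs_file_name on every process (outside the `if`), calling accelerator.wait_for_everyone() after accelerator.save(...), and then calling upload_checkpoint(local_file_name, gcs_file_name, accelerator.is_main_process). That way the final accelerator.print(...) can format gcs_file_name on all processes without error.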

