Upload model to GCS during multi-TPU training in colab, cell stuck
Hi
I am trying to run multi-TPU training on Colab using notebook_launcher. In the training loop, I am able to save the model locally using accelerator.save(model.state_dict(), local_file_name). However, when I try to upload it to GCS using gsutil cp, the cell gets stuck:
accelerator.print('Inside saving')
accelerator.save(model.state_dict(), local_file_name)
accelerator.print('Model saved')
accelerator.print('Uploading model')
if accelerator.is_main_process:
    gcs_file_name = os.path.join(MODELS_DIR, args.save_prefix, '{}.pt'.format(num_steps))
    ! gsutil cp {local_file_name} {gcs_file_name}
    accelerator.print('Upload done to {}!'.format(gcs_file_name))
Output and where it gets stuck:
Inside saving
Model saved
Uploading model
Is it not possible to upload to GCS during the training loop? If so, is there another solution?
Thanks
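The thread does not establish why the cell hangs, so the following is only a sketch of one thing that may be worth trying: replacing the notebook shell escape (! gsutil cp) with a plain subprocess call that has no interactive stdin and an explicit timeout, so a stalled upload raises an error instead of blocking silently. It assumes gsutil is installed and already authenticated in the Colab runtime. The helper names upload_checkpoint and build_gsutil_cp_command, and the timeout_s and run parameters, are hypothetical and not part of accelerate.

```python
import subprocess


def build_gsutil_cp_command(local_path, gcs_uri):
    """Return the argv list for copying one local file to GCS with gsutil."""
    return ["gsutil", "cp", local_path, gcs_uri]


def upload_checkpoint(local_path, gcs_uri, is_main_process,
                      timeout_s=600, run=subprocess.run):
    """Upload from the main process only.

    Returns True if the upload ran and succeeded, False if this process
    skipped it. `run` is injectable (hypothetical parameter) so the helper
    can be exercised without gsutil installed.
    """
    if not is_main_process:
        return False
    result = run(
        build_gsutil_cp_command(local_path, gcs_uri),
        stdin=subprocess.DEVNULL,  # gsutil cannot wait on interactive input
        capture_output=True,       # keep stdout/stderr for error reporting
        text=True,
        timeout=timeout_s,         # a hang becomes subprocess.TimeoutExpired
    )
    if result.returncode != 0:
        raise RuntimeError("gsutil cp failed: {}".format(result.stderr))
    return True
```

In the training loop from the question this would be called as upload_checkpoint(local_file_name, gcs_file_name, accelerator.is_main_process), with gcs_file_name computed outside the if block so that every process has it. Calling accelerator.wait_for_everyone() before saving is a common way to keep the processes in step, though whether it matters here is not confirmed by the thread.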
Issue Analytics
- State:
- Created 2 years ago
- Comments:5 (2 by maintainers)
Top GitHub Comments
You could push your model to the Hugging Face Hub when you want to checkpoint (with the push_to_hub method); this way you can resume training from anywhere.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
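The Hub suggestion above refers to push_to_hub, which is a method on Hugging Face model classes. For a plain state_dict file such as the one saved in the question, a roughly equivalent approach is to upload the file with the huggingface_hub client. The sketch below assumes the target repository already exists and that the runtime is logged in to the Hub. The helper name push_checkpoint_to_hub and its api parameter are hypothetical, and the repository name in the usage note is a placeholder.

```python
def push_checkpoint_to_hub(local_path, repo_id, path_in_repo,
                           is_main_process, api=None):
    """Upload one checkpoint file to a Hub model repo from the main process only.

    Returns whatever the client's upload_file call returns, or None if this
    process skipped the upload.
    """
    if not is_main_process:
        return None
    if api is None:
        # Imported lazily so the helper can be exercised with a stand-in
        # client when huggingface_hub is not installed.
        from huggingface_hub import HfApi
        api = HfApi()
    return api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        commit_message="Add checkpoint {}".format(path_in_repo),
    )
```

A call inside the training loop might look like push_checkpoint_to_hub(local_file_name, "your-username/your-model", "{}.pt".format(num_steps), accelerator.is_main_process).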