[TPU] Textual Inversion Training hangs when saving model
Describe the bug
I am running the textual_inversion.py script on a v3-8 TPU VM, and the script hangs at the model-saving (save_progress) step.
Any clue why that may be happening?
Reproduction
No response
Logs
No response
System Info
- TPU VM: tpu-vm-pt-1.10
- `diffusers` version: 0.7.1
- Platform: Linux-5.11.0-1021-gcp-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyTorch version (GPU?): 1.10.0+cu102 (False)
- Huggingface_hub version: 0.10.1
- Transformers version: 4.24.0
Issue Analytics
- State:
- Created 10 months ago
- Comments: 5 (4 by maintainers)
Top Results From Across the Web
- 11B model training on TPU V3-512 crashes during training — The error usually occurs when it tries to save a new checkpoint; when it happens it doesn't store the checkpoint and it reloads...
- TPU training freezes in the middle of training (Stack Overflow) — I'm trying to train a CNN regression net in TF 1.12, using a TPU v3-8 1.12 instance. The model successfully compiles with XLA, starting...
- DreamBooth training in under 8 GB VRAM and textual ... (Reddit) — Is it possible to apply the textual inversion optimization to the Automatic1111 GUI? Currently the optimization seems to be for the huggingface ...
- Textual Inversion (Hugging Face) — Textual inversion learns a new token embedding (v* in the diagram above). A prompt (that includes a token which will be mapped to...
- keras model with TF 2.2.0 crashing during training with TPU ... — keras model with TF 2.2.0 crashing during training with TPU and tf.data.Dataset: "The Encode() method is not implemented for DatasetVariantWrapper...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I have tried using `accelerator.save` (https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/other.py) and its equivalent, `xm.save`, which are supposed to save the model only on the rank-0 device. It was easier to just shift to using the flax script.
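The rank-0-only save contract mentioned above is easy to get wrong on a multi-core TPU: if only the writer process reaches a step that all replicas must hit together (or vice versa), the whole job deadlocks, which matches the "hangs when saving" symptom. Below is a hedged, stdlib-only sketch of the pattern, with threads standing in for the eight cores of a v3-8; `save_progress`, the barrier, and the file names are illustrative stand-ins, not the actual diffusers or torch_xla code.

```python
import os
import pickle
import tempfile
import threading

def save_progress(state, path, rank, barrier):
    # Only rank 0 writes the checkpoint file, but EVERY rank must
    # reach the barrier. If the ranks diverge here (e.g. only rank 0
    # enters a save routine that itself contains a collective op),
    # the remaining ranks wait forever and the script appears to hang.
    if rank == 0:
        with open(path, "wb") as f:
            pickle.dump(state, f)
    barrier.wait()  # stand-in for a rendezvous across TPU cores

# Simulate the 8 cores of a v3-8 with threads (illustrative only).
NUM_CORES = 8
barrier = threading.Barrier(NUM_CORES)
path = os.path.join(tempfile.mkdtemp(), "learned_embeds.bin")
state = {"<my-token>": [0.1, 0.2, 0.3]}  # stand-in for the learned embedding

threads = [
    threading.Thread(target=save_progress, args=(state, path, rank, barrier))
    for rank in range(NUM_CORES)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The design point is that the synchronization call sits outside the `if rank == 0` guard, so all replicas meet at the same place regardless of who wrote the file.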
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.