[Deepspeed ZeRO-3] Broken model save on fresh Transformers branch
I have my own model, which utilizes two T5 encoders, and I train it via DeepSpeed. It has its own save_pretrained() and from_pretrained() methods implementing custom load/save logic: https://github.com/exelents/try_t5_siamese/blob/4140194978ac113c45e7370f40b3d9b932d0b35b/siamese_model.py#L80
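For context, a custom save/load pair for a two-encoder model typically looks something like the sketch below (a hypothetical reconstruction, not the actual code from the linked file; the class and attribute names are assumptions):

```python
import os
from torch import nn
from transformers import T5EncoderModel

class SiameseT5(nn.Module):
    # Hypothetical two-encoder siamese model with custom save/load logic.
    def __init__(self, encoder_q, encoder_a):
        super().__init__()
        self.encoder_q = encoder_q
        self.encoder_a = encoder_a

    def save_pretrained(self, save_directory):
        # Each encoder is written to its own subdirectory so it can be
        # restored independently via from_pretrained().
        self.encoder_q.save_pretrained(os.path.join(save_directory, "encoder_q"))
        self.encoder_a.save_pretrained(os.path.join(save_directory, "encoder_a"))

    @classmethod
    def from_pretrained(cls, load_directory):
        encoder_q = T5EncoderModel.from_pretrained(os.path.join(load_directory, "encoder_q"))
        encoder_a = T5EncoderModel.from_pretrained(os.path.join(load_directory, "encoder_a"))
        return cls(encoder_q, encoder_a)
```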
When I run training and the trainer starts to save a checkpoint, something strange happens: the weights file for every saved encoder is only a few kilobytes, i.e. the weights are not actually saved.
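(For anyone hitting the same symptom: under ZeRO-3 each parameter is partitioned across ranks, so calling state_dict() on the raw module returns tiny placeholder tensors, which is why the saved files are only a few kilobytes. Below is a minimal sketch of gathering the full parameters before saving, assuming `model` is the underlying nn.Module and DeepSpeed has been initialized.)

```python
import torch.distributed as dist
import deepspeed

# Gather the ZeRO-3 partitioned parameters so rank 0 sees full tensors;
# outside this context the parameters revert to their partitioned state.
with deepspeed.zero.GatheredParameters(list(model.parameters()), modifier_rank=0):
    if dist.get_rank() == 0:
        model.save_pretrained("./checkpoint")  # now writes real weights
```

Note this materializes every parameter at once, which is fine for T5-small but may not scale to very large models.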
At the start of training the trainer tries to load the checkpoint using model.load_checkpoint(), but it seems this function has its own loading logic: it does not invoke my custom load logic and throws an error:
```
ValueError: [deepspeed] failed to resume from checkpoint ./templates/siamese-t5-small-v1_1-template
```
I can comment out the code that loads the checkpoint, but then I get the checkpoint-saving problem described above…
What should I do to save my own custom model properly? It worked a month ago, but today I refreshed my Transformers repo and everything broke.
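One possible workaround, sketched below under the assumption that your deepspeed version ships the zero_to_fp32 utilities: consolidate the partitioned ZeRO checkpoint into a single fp32 state dict on CPU, then run the model's own save logic on the result.

```python
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Merge the ZeRO shard files found in the checkpoint directory into one
# fp32 state_dict (on CPU), then feed it to the model's own save path.
state_dict = get_fp32_state_dict_from_zero_checkpoint("./templates/siamese-t5-small-v1_1-template")
model.load_state_dict(state_dict)
model.save_pretrained("./consolidated-checkpoint")
```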
Top GitHub Comments
@stas00 thanks! My problem is solved for now, since I'm also using fp16 during fine-tuning, so the current stage 2 saves are good enough for me.
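For ZeRO-3 specifically, here is a sketch of the config switch that asks DeepSpeed to consolidate the 16-bit weights on rank 0 at save time (in older deepspeed releases the key was named stage3_gather_fp16_weights_on_model_save):

```python
# Relevant zero_optimization section of a DeepSpeed config, as a Python dict.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": True,  # consolidate on save
    },
}
```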
@samsontmr, would you kindly open a separate issue, since while this is related, your use case is quite different. Please tag me and we will work on solving your use case there. Thank you!
p.s. also when you test please make sure you are using transformers and deepspeed master, since there are constant fixes merged into them.