
[Deepspeed ZeRO-3] Broken model save on fresh Transformers branch


I have my own model, which utilizes two T5 encoders, and I train it via DeepSpeed. It has its own save_pretrained() and from_pretrained() methods, which implement custom load/save logic: https://github.com/exelents/try_t5_siamese/blob/4140194978ac113c45e7370f40b3d9b932d0b35b/siamese_model.py#L80
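For context, the file linked above implements roughly the following pattern. This is a hedged sketch, not the repository's actual code; the class name SiameseT5 and the attribute names encoder_q/encoder_a are assumptions for illustration:

import os
import torch
from transformers import T5EncoderModel

class SiameseT5(torch.nn.Module):
    """Sketch of a two-encoder siamese model with custom save/load logic."""

    def __init__(self, encoder_q, encoder_a):
        super().__init__()
        self.encoder_q = encoder_q  # encodes one side of the pair
        self.encoder_a = encoder_a  # encodes the other side

    def save_pretrained(self, save_directory):
        # Write each encoder to its own subdirectory, reusing the standard
        # Hugging Face serialization for the two T5 halves.
        self.encoder_q.save_pretrained(os.path.join(save_directory, "encoder_q"))
        self.encoder_a.save_pretrained(os.path.join(save_directory, "encoder_a"))

    @classmethod
    def from_pretrained(cls, load_directory):
        return cls(
            T5EncoderModel.from_pretrained(os.path.join(load_directory, "encoder_q")),
            T5EncoderModel.from_pretrained(os.path.join(load_directory, "encoder_a")),
        )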

When I run training and the trainer starts to save a checkpoint, something strange happens: the weights file for each saved encoder ends up only a few kilobytes, i.e. the weights are not actually saved. At the start of training the trainer tries to load the checkpoint using model.load_checkpoint(), but that function seems to have its own loading logic: it cannot execute my custom load logic and throws an error: ValueError: [deepspeed] failed to resume from checkpoint ./templates/siamese-t5-small-v1_1-template. I can comment out the code that loads the checkpoint, but then I am back to the checkpoint-saving problem described above…
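For reference, under ZeRO-3 each parameter is partitioned across GPUs, so the state_dict seen by any single rank holds only tiny placeholder tensors; that matches the few-kilobyte files described above. Below is a minimal sketch of gathering the full parameters before running a custom save, using DeepSpeed's deepspeed.zero.GatheredParameters context manager. The save_consolidated helper name is hypothetical, and gathering the whole model at once assumes it fits in a single process's memory:

import torch.distributed as dist
import deepspeed

def save_consolidated(model, save_directory):
    # All ranks must enter the context so the all-gather can complete;
    # inside it, every partitioned parameter is temporarily whole again.
    with deepspeed.zero.GatheredParameters(list(model.parameters()), modifier_rank=None):
        if dist.get_rank() == 0:
            model.save_pretrained(save_directory)  # now writes real weights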

What should I do to save my own custom model properly? It worked a month ago, but today I refreshed my Transformers repo and everything broke.
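For what it's worth, later versions of the HF/DeepSpeed integration expose a ZeRO-3 config flag that consolidates the partitioned fp16 weights on rank 0 whenever the Trainer saves the model. A sketch, with everything except the gather flag being an illustrative placeholder (the flag was originally introduced as stage3_gather_fp16_weights_on_model_save and later renamed to stage3_gather_16bit_weights_on_model_save):

# ZeRO-3 config fragment; pass it (or a JSON file with the same content)
# to the Trainer via TrainingArguments(deepspeed=...).
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Reassemble the partitioned fp16 weights on rank 0 at save time,
        # so the saved files contain real weights instead of placeholders.
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

DeepSpeed also writes a zero_to_fp32.py helper into its checkpoint directories, which can reconstruct full fp32 weights offline from the partitioned states.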

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 23 (12 by maintainers)

Top GitHub Comments

1 reaction
samsontmr commented, Mar 22, 2021

@stas00 thanks! My problem is solved for now: since I'm also using fp16 during fine-tuning, the current stage 2 saves are good enough for me.

1 reaction
stas00 commented, Mar 21, 2021

@samsontmr, would you kindly open a separate issue? While this is related, your use case is quite different. Please tag me and we will work on solving it there. Thank you!

P.S. When you test, please make sure you are using transformers and deepspeed master, since fixes are constantly being merged into both.


