[Deepspeed ZeRO-3] Broken model save on fresh Transformers branch
I have my own model, which utilizes two T5 encoders, and I train it via DeepSpeed. It has its own save_pretrained() and from_pretrained() methods implementing custom load/save logic: https://github.com/exelents/try_t5_siamese/blob/4140194978ac113c45e7370f40b3d9b932d0b35b/siamese_model.py#L80
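For context, a custom save/load pair for a two-encoder model typically looks something like the sketch below (a hypothetical reconstruction, not the actual code from the linked file; the class and attribute names are assumptions):

```python
import os
from torch import nn
from transformers import T5EncoderModel

class SiameseT5(nn.Module):
    # Hypothetical two-encoder siamese model with custom save/load logic.
    def __init__(self, encoder_q, encoder_a):
        super().__init__()
        self.encoder_q = encoder_q
        self.encoder_a = encoder_a

    def save_pretrained(self, save_directory):
        # Each encoder is written to its own subdirectory so it can be
        # restored independently via from_pretrained().
        self.encoder_q.save_pretrained(os.path.join(save_directory, "encoder_q"))
        self.encoder_a.save_pretrained(os.path.join(save_directory, "encoder_a"))

    @classmethod
    def from_pretrained(cls, load_directory):
        encoder_q = T5EncoderModel.from_pretrained(os.path.join(load_directory, "encoder_q"))
        encoder_a = T5EncoderModel.from_pretrained(os.path.join(load_directory, "encoder_a"))
        return cls(encoder_q, encoder_a)
```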
When I run training and the trainer starts to save a checkpoint, something strange happens: the weights file for every saved encoder is only a few kilobytes, i.e. the weights are not actually saved.
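(For anyone hitting the same symptom: under ZeRO-3 each parameter is partitioned across ranks, so calling state_dict() on the raw module returns tiny placeholder tensors, which is why the saved files are only a few kilobytes. Below is a minimal sketch of gathering the full parameters before saving, assuming `model` is the underlying nn.Module and DeepSpeed has been initialized.)

```python
import torch.distributed as dist
import deepspeed

# Gather the ZeRO-3 partitioned parameters so rank 0 sees full tensors;
# outside this context the parameters revert to their partitioned state.
with deepspeed.zero.GatheredParameters(list(model.parameters()), modifier_rank=0):
    if dist.get_rank() == 0:
        model.save_pretrained("./checkpoint")  # now writes real weights
```

Note this materializes every parameter at once, which is fine for T5-small but may not scale to very large models.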
At the start of training the trainer tries to load the checkpoint using model.load_checkpoint(), but it seems this function has its own loading logic: it does not invoke my custom load logic and throws an error:
```
ValueError: [deepspeed] failed to resume from checkpoint ./templates/siamese-t5-small-v1_1-template
```
I can comment out the code that loads the checkpoint, but then I get the checkpoint-saving problem described above…
What should I do to save my own custom model properly? It worked a month ago, but today I refreshed my Transformers repo and everything broke.
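One possible workaround, sketched below under the assumption that your deepspeed version ships the zero_to_fp32 utilities: consolidate the partitioned ZeRO checkpoint into a single fp32 state dict on CPU, then run the model's own save logic on the result.

```python
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Merge the ZeRO shard files found in the checkpoint directory into one
# fp32 state_dict (on CPU), then feed it to the model's own save path.
state_dict = get_fp32_state_dict_from_zero_checkpoint("./templates/siamese-t5-small-v1_1-template")
model.load_state_dict(state_dict)
model.save_pretrained("./consolidated-checkpoint")
```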
Top GitHub Comments
@stas00 thanks! My problem is solved for now, since I'm also using fp16 during fine-tuning, so the current stage 2 saves are good enough for me.
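For ZeRO-3 specifically, here is a sketch of the config switch that asks DeepSpeed to consolidate the 16-bit weights on rank 0 at save time (in older deepspeed releases the key was named stage3_gather_fp16_weights_on_model_save):

```python
# Relevant zero_optimization section of a DeepSpeed config, as a Python dict.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": True,  # consolidate on save
    },
}
```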
@samsontmr, would you kindly open a separate issue, since while this is related, your use case is quite different. Please tag me and we will work on solving your use case there. Thank you!
p.s. also when you test please make sure you are using transformers and deepspeed master, since there are constant fixes merged into them.