--fp16 Slower & Does Not Reduce Memory Use
Hey there @lucidrains,
I came across your incredible work and immediately tried it out on my RTX 2070! Since training will take a while and require a lot of memory, I was relieved that we can use APEX/Amp to train the model simply by adding the --fp16 option.
Unfortunately, memory usage does not drop compared to regular fp32 training, and training is slower too.
I came across a similar issue, #129, but it was closed before a fix was checked in. Will you continue to work on fp16? I believe it would help many of your users (and fans!)
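For anyone who wants to verify this on their own setup, here is a minimal sketch (not this repo's actual --fp16 code path; the toy model, batch size, and step count are placeholders) that compares peak GPU memory for an fp32 training loop versus a mixed-precision loop using PyTorch's native torch.cuda.amp. Note that autocast-style mixed precision (like Amp's O1 level) keeps fp32 master weights, so most of the savings come from activations; on a Turing card like the 2070, fp16 speedups also depend on tensor-core-friendly shapes (dimensions divisible by 8), so minimal savings or even a slowdown is plausible for some models.

```python
# Minimal sketch: compare peak GPU memory for fp32 vs. mixed-precision steps.
# The model and sizes are made up for illustration only.
import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).to(device)
optimizer = torch.optim.Adam(model.parameters())
data = torch.randn(64, 1024, device=device)

def peak_memory_mib(use_amp: bool) -> float:
    torch.cuda.reset_peak_memory_stats(device)
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # no-op when disabled
    for _ in range(5):
        optimizer.zero_grad()
        # autocast runs eligible ops in fp16 while keeping fp32 master weights
        with torch.cuda.amp.autocast(enabled=use_amp):
            loss = model(data).pow(2).mean()
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated(device) / 2**20

print(f"fp32 peak: {peak_memory_mib(False):.1f} MiB")
print(f"amp  peak: {peak_memory_mib(True):.1f} MiB")
```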
Issue Analytics
- State:
- Created: 3 years ago
- Reactions: 3
- Comments: 9 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
A bit unrelated, but I can’t even get it running - I just keep getting NaN errors and training shuts down.
@tannisroot yea, I get that feedback a lot. I think I will just remove this feature from the readme and keep it as a silent feature. Perhaps someone can help figure out what’s wrong. It has worked for me in the past, so I’m not sure what changed.
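For the NaN errors mentioned above, dynamic loss scaling is the usual mitigation. Below is a sketch of how this is typically wired up with APEX/Amp (illustrative only: the tiny model and hyperparameters are placeholders, and this repo's actual --fp16 path may differ). With Amp, "O1" patches ops to cast inputs on the fly while keeping fp32 master weights, whereas "O2" also casts the model weights to fp16, which is where most of the memory savings come from.

```python
# Sketch of dynamic loss scaling with APEX/Amp, the standard fix for fp16 NaNs.
# The model and hyperparameters here are placeholders, not the repo's defaults.
import torch
import torch.nn as nn
from apex import amp  # assumes NVIDIA apex is installed

model = nn.Linear(256, 256).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
model, optimizer = amp.initialize(model, optimizer,
                                  opt_level="O1", loss_scale="dynamic")

for step in range(100):
    optimizer.zero_grad()
    x = torch.randn(32, 256, device="cuda")
    loss = model(x).pow(2).mean()
    # Dynamic scaling multiplies the loss so tiny fp16 gradients don't
    # underflow to zero; on overflow (inf/NaN grads) Amp skips the step and
    # lowers the scale instead of poisoning the weights with NaNs.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```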