Inference with DeepSpeed
Trying to run generate.py on a DeepSpeed checkpoint currently breaks. Using DeepSpeed for inference should be relatively simple, I think, but I couldn't quite figure it out and realized that most of the code I was writing actually belonged in the DeepSpeedBackend code, which I hadn't grokked yet. Anyway, so I don't forget, here is some very broken code that I had written before giving up last night:
Edit: pretend I never wrote this.
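To make the intent concrete, here is a rough, hedged sketch of the kind of glue code this probably needs, assuming the checkpoint was written with engine.save_checkpoint; the function name, config, and usage are assumptions for illustration, not code from this repository:

```python
# Sketch only: wrap the model in a DeepSpeed engine so the sharded
# checkpoint format can be read, then generate from the unwrapped module.
import deepspeed

def load_for_generation(model, checkpoint_dir):
    # Minimal placeholder config; in practice it should match the settings
    # (fp16, ZeRO stage, etc.) used when the checkpoint was saved.
    ds_config = {"train_batch_size": 1}

    engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)

    # Reads the checkpoint directory written by engine.save_checkpoint.
    engine.load_checkpoint(
        checkpoint_dir,
        load_optimizer_states=False,
        load_lr_scheduler_states=False,
    )

    engine.module.eval()
    return engine.module  # call generate() (or equivalent) on this


# Hypothetical usage inside generate.py:
# model = load_for_generation(build_model(), "checkpoints/")
# samples = model.generate(prompt_tokens)
```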
Issue Analytics
- State:
- Created 2 years ago
- Comments: 9 (6 by maintainers)
Top Results From Across the Web

Getting Started with DeepSpeed for Inferencing Transformer ...
DeepSpeed-Inference introduces several features to efficiently serve transformer-based PyTorch models. It supports model parallelism (MP) to fit large ...

DeepSpeed/inference-tutorial.md at master - GitHub
DeepSpeed-Inference introduces several features to efficiently serve transformer-based PyTorch models. It supports model parallelism (MP) to fit large ...

Incredibly Fast BLOOM Inference with DeepSpeed and ...
DeepSpeed-Inference uses tensor parallelism and efficient fused CUDA kernels to deliver super-fast (<1 ms per token) inference on a large batch ...

Inference Setup - DeepSpeed 0.7.7 documentation
DeepSpeedInferenceConfig is used to control all aspects of initializing the InferenceEngine. The config should be passed as a ...

Enabling Efficient Inference of Transformer Models at ... - arXiv
DeepSpeed Inference reduces latency by up to 7.3x over the state of the art for latency-oriented scenarios and increases throughput by over 1.5x ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@richcmwang Exactly, they need the forward call, which I'm pretty sure is also the reason why FP16 generation fails. They recommended using a simple if-switch in the forward method, like a do_generations=True argument: if it's given, don't do the normal forward calculations, just generate and exit. I didn't find the time to try it until now, though.

Aside from inference being parallelizable, I think the biggest benefit is being able to do inference with models that don't fit into memory.
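A minimal sketch of the if-switch described above; only the do_generations flag comes from the comment, while the wrapper class and the assumption that the inner model exposes a generate() method are hypothetical:

```python
import torch
import torch.nn as nn

class GenerationSwitchWrapper(nn.Module):
    # Hypothetical wrapper: routes generation through forward() so the
    # DeepSpeed engine's forward hooks (FP16 casting, parallelism) still apply.
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, tokens, do_generations=False, **kwargs):
        if do_generations:
            # Skip the normal forward pass and just generate.
            with torch.no_grad():
                return self.model.generate(tokens, **kwargs)
        # Normal path: whatever the model's forward usually returns
        # (logits or a loss).
        return self.model(tokens, **kwargs)

# Usage, assuming `engine` is a DeepSpeed engine wrapping this module:
# samples = engine(tokens, do_generations=True)
```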
Thanks @richcmwang! I’ll work on this later unless you wanna make the PR.
@rom1504 The DeepSpeed docs do indeed claim faster inference with the inference engine. Not sure how, though.
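For reference, a minimal sketch of what the DeepSpeed inference engine path looks like; the model name, dtype, and mp_size below are placeholders for illustration, not anything from this issue:

```python
# Sketch: kernel injection plus optional tensor parallelism via init_inference.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# replace_with_kernel_inject swaps supported transformer layers for
# DeepSpeed's fused CUDA kernels; mp_size > 1 enables tensor parallelism.
engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.half,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("DeepSpeed inference test:", return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The snippets above attribute the speedup mainly to the fused kernels and tensor parallelism rather than to anything about the checkpoint format.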