
Inference with DeepSpeed

See original GitHub issue

Trying to run generate.py on a DeepSpeed checkpoint currently breaks. Using DeepSpeed for inference should be relatively simple, I think, but I couldn’t quite figure it out and realized most of the code I was writing actually belonged in the DeepSpeedBackend code, which I hadn’t grokked yet. Anyway, so I don’t forget, here is some very broken code I had written before giving up last night:

Edit: pretend I never wrote this. 
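
For context, a rough sketch of DeepSpeed’s standalone inference entry point, deepspeed.init_inference; this is not the project’s generate.py or DeepSpeedBackend code, and the tiny stand-in model below is purely illustrative:

    import torch
    import torch.nn as nn
    import deepspeed

    # Stand-in module; in practice this would be the trained model rebuilt
    # with the same hyperparameters used for training, with weights loaded.
    model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000)).half().cuda().eval()

    # Wrap the module with DeepSpeed's inference engine (older kwarg-style API).
    engine = deepspeed.init_inference(
        model,
        mp_size=1,                        # model-parallel degree; >1 needs the deepspeed launcher
        dtype=torch.half,                 # fp16 inference
        replace_with_kernel_inject=False,
    )
    model = engine.module                 # the wrapped module is used like the original

    with torch.no_grad():
        tokens = torch.randint(0, 1000, (1, 16), device="cuda")
        logits = model(tokens)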

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

2 reactions
janEbert commented, Jun 15, 2021

@richcmwang Exactly, they need the forward call, which I’m pretty sure is also the reason why FP16 generation fails. They recommended a simple if-switch in the forward method, something like a do_generations=True argument: if it’s given, skip the normal forward computation, just run the generations, and exit. I haven’t found the time to try it yet, though.

Aside from inference being parallelizable, I think the biggest benefit is being able to do inference with models that don’t fit into memory.
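
A rough sketch of that forward-method switch; the do_generations flag is the one mentioned above, while the toy model and generate() helper are assumptions for illustration only:

    import torch
    import torch.nn as nn

    class ToyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(1000, 64)
            self.head = nn.Linear(64, 1000)

        def forward(self, tokens, do_generations=False):
            # If the flag is set, skip the normal forward computation and only
            # generate, so generation still goes through the engine's forward().
            if do_generations:
                return self.generate(tokens)
            return self.head(self.embed(tokens))

        @torch.no_grad()
        def generate(self, tokens, steps=4):
            # Greedy next-token loop, purely illustrative.
            for _ in range(steps):
                logits = self.head(self.embed(tokens))
                next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
                tokens = torch.cat([tokens, next_token], dim=1)
            return tokens

With the model wrapped by DeepSpeed, the call would then look like engine(tokens, do_generations=True) rather than a direct generate() call, so the engine’s forward path is still the one being exercised.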

1 reaction
afiaka87 commented, Jun 15, 2021

Thanks @richcmwang! I’ll work on this later unless you wanna make the PR.

@rom1504 The DeepSpeed docs do indeed claim faster inference with the inference engine. Not sure how though.

Read more comments on GitHub

Top Results From Across the Web

Getting Started with DeepSpeed for Inferencing Transformer ...
DeepSpeed-Inference introduces several features to efficiently serve transformer-based PyTorch models. It supports model parallelism (MP) to fit large ...

DeepSpeed/inference-tutorial.md at master - GitHub
DeepSpeed-Inference introduces several features to efficiently serve transformer-based PyTorch models. It supports model parallelism (MP) to fit large ...

Incredibly Fast BLOOM Inference with DeepSpeed and ...
DeepSpeed-Inference uses Tensor-Parallelism and efficient fused CUDA kernels to deliver super-fast <1 msec per token inference on a large batch ...

Inference Setup — DeepSpeed 0.7.7 documentation
DeepSpeedInferenceConfig is used to control all aspects of initializing the InferenceEngine. The config should be passed as a...

Enabling Efficient Inference of Transformer Models at ... - arXiv
DeepSpeed Inference reduces latency by up to 7.3x over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5x ...
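
Going by the 0.7.x "Inference Setup" docs listed above, the same wrapping can also be driven by a config dict; a sketch under that assumption (exact keys may differ between DeepSpeed versions):

    import torch
    import torch.nn as nn
    import deepspeed

    model = nn.Linear(64, 64).half().cuda().eval()   # stand-in module

    ds_config = {
        "tensor_parallel": {"tp_size": 1},    # model-parallel degree
        "dtype": torch.float16,               # half-precision inference
        "replace_with_kernel_inject": False,  # optional fused-kernel injection
    }

    # In recent versions, init_inference accepts the whole inference config as a dict.
    engine = deepspeed.init_inference(model, config=ds_config)
    out = engine.module(torch.randn(1, 64, device="cuda", dtype=torch.half))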
