
Inference with DeepSpeed

See original GitHub issue

Trying to run generate.py on a DeepSpeed checkpoint currently breaks. Using DeepSpeed for inference should be relatively simple, I think, but I couldn’t quite figure it out and realized most of the code I was writing actually belonged in the DeepSpeedBackend code, which I hadn’t grokked yet. Anyway, so I don’t forget, here is some very broken code I had written before giving up last night:

Edit: pretend I never wrote this. 
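
For context, a rough sketch of DeepSpeed’s standalone inference entry point, deepspeed.init_inference; this is not the project’s generate.py or DeepSpeedBackend code, and the tiny stand-in model below is purely illustrative:

    import torch
    import torch.nn as nn
    import deepspeed

    # Stand-in module; in practice this would be the trained model rebuilt
    # with the same hyperparameters used for training, with weights loaded.
    model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000)).half().cuda().eval()

    # Wrap the module with DeepSpeed's inference engine (older kwarg-style API).
    engine = deepspeed.init_inference(
        model,
        mp_size=1,                        # model-parallel degree; >1 needs the deepspeed launcher
        dtype=torch.half,                 # fp16 inference
        replace_with_kernel_inject=False,
    )
    model = engine.module                 # the wrapped module is used like the original

    with torch.no_grad():
        tokens = torch.randint(0, 1000, (1, 16), device="cuda")
        logits = model(tokens)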

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

2 reactions
janEbert commented, Jun 15, 2021

@richcmwang Exactly, they need the forward call, which I’m pretty sure is also the reason why FP16 generation fails. They recommended a simple if-switch in the forward method, something like a do_generations=True argument: if it’s given, skip the normal forward computation, just run the generations, and exit. I haven’t found the time to try it yet, though.

Aside from inference being parallelizable, I think the biggest benefit is being able to do inference with models that don’t fit into memory.
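
A rough sketch of that forward-method switch; the do_generations flag is the one mentioned above, while the toy model and generate() helper are assumptions for illustration only:

    import torch
    import torch.nn as nn

    class ToyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(1000, 64)
            self.head = nn.Linear(64, 1000)

        def forward(self, tokens, do_generations=False):
            # If the flag is set, skip the normal forward computation and only
            # generate, so generation still goes through the engine's forward().
            if do_generations:
                return self.generate(tokens)
            return self.head(self.embed(tokens))

        @torch.no_grad()
        def generate(self, tokens, steps=4):
            # Greedy next-token loop, purely illustrative.
            for _ in range(steps):
                logits = self.head(self.embed(tokens))
                next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
                tokens = torch.cat([tokens, next_token], dim=1)
            return tokens

With the model wrapped by DeepSpeed, the call would then look like engine(tokens, do_generations=True) rather than a direct generate() call, so the engine’s forward path is still the one being exercised.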

1 reaction
afiaka87 commented, Jun 15, 2021

Thanks @richcmwang! I’ll work on this later unless you wanna make the PR.

@rom1504 The DeepSpeed docs do indeed claim faster inference with the inference engine. Not sure how though.

Read more comments on GitHub

Top Results From Across the Web

Getting Started with DeepSpeed for Inferencing Transformer ...
DeepSpeed-Inference introduces several features to efficiently serve transformer-based PyTorch models. It supports model parallelism (MP) to fit large ...

DeepSpeed/inference-tutorial.md at master - GitHub
DeepSpeed-Inference introduces several features to efficiently serve transformer-based PyTorch models. It supports model parallelism (MP) to fit large ...

Incredibly Fast BLOOM Inference with DeepSpeed and ...
DeepSpeed-Inference uses Tensor-Parallelism and efficient fused CUDA kernels to deliver super-fast <1 msec per token inference on a large batch ...

Inference Setup — DeepSpeed 0.7.7 documentation
DeepSpeedInferenceConfig is used to control all aspects of initializing the InferenceEngine. The config should be passed as a...

Enabling Efficient Inference of Transformer Models at ... - arXiv
DeepSpeed Inference reduces latency by up to 7.3x over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5x ...
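
Going by the 0.7.x "Inference Setup" docs listed above, the same wrapping can also be driven by a config dict; a sketch under that assumption (exact keys may differ between DeepSpeed versions):

    import torch
    import torch.nn as nn
    import deepspeed

    model = nn.Linear(64, 64).half().cuda().eval()   # stand-in module

    ds_config = {
        "tensor_parallel": {"tp_size": 1},    # model-parallel degree
        "dtype": torch.float16,               # half-precision inference
        "replace_with_kernel_inject": False,  # optional fused-kernel injection
    }

    # In recent versions, init_inference accepts the whole inference config as a dict.
    engine = deepspeed.init_inference(model, config=ds_config)
    out = engine.module(torch.randn(1, 64, device="cuda", dtype=torch.half))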
