[REQUEST] Model serving via deepspeed's inference module
Is your feature request related to a problem? Please describe.
No
Describe the solution you’d like
I am trying to run my model-serving code in a model-parallel fashion. The tutorial shows how to run on multiple GPUs, but the data there is predefined, which does not work for serving. My original code uses FastAPI for the serving. When I launch with
deepspeed --num_gpus n example.py
the FastAPI server is also started n times, which causes a port conflict.
Describe alternatives you’ve considered
Do I have to first start the model in parallel using deepspeed in one script, then start another script for fastapi, and finally connect them somehow?
Additional context
None.
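One possible pattern for this situation (a sketch under assumptions, not something stated in the issue or by maintainers) is to start the HTTP server only on rank 0 and keep the other ranks in a loop that waits for broadcast prompts, so every rank joins the model-parallel forward pass but only one process binds the port. The model name, the port, and the init_inference arguments below are placeholders; argument names follow the DeepSpeed-Inference tutorial of that era and may differ in newer releases.

```python
# Sketch only, not from the issue thread.
import os
import torch
import torch.distributed as dist
import deepspeed
import uvicorn
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

deepspeed.init_distributed()  # uses the env vars set by the deepspeed launcher

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
engine = deepspeed.init_inference(model, mp_size=world_size,
                                  dtype=torch.float16,
                                  replace_with_kernel_inject=True)

def generate(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(f"cuda:{local_rank}")
    outputs = engine.module.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if local_rank == 0:
    app = FastAPI()

    @app.get("/generate")
    def serve(prompt: str):
        # Rank 0 takes the HTTP request and broadcasts the prompt so that
        # every model-parallel rank runs the same forward pass.
        dist.broadcast_object_list([prompt], src=0)
        return {"text": generate(prompt)}

    # Only one process binds the port, so there is no conflict.
    uvicorn.run(app, host="0.0.0.0", port=8000)
else:
    # Non-zero ranks never touch the network port; they just wait for prompts.
    while True:
        payload = [None]
        dist.broadcast_object_list(payload, src=0)
        generate(payload[0])
```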
Issue Analytics
- State:
- Created: 2 years ago
- Comments: 5 (2 by maintainers)
Top Results From Across the Web
Getting Started with DeepSpeed for Inferencing ...
DeepSpeed-Inference introduces several features to efficiently serve transformer-based PyTorch models. It supports model parallelism (MP) to fit large ...
Deploy large models on Amazon SageMaker using ...
DeepSpeed Inference supports large Transformer-based models with billions of parameters. It allows you to efficiently serve large models by ...
DeepSpeed: Accelerating large-scale model inference and ...
Inference-adapted parallelism allows users to efficiently serve large models by adapting to the best parallelism strategies for multi-GPU ...
ZeRO — DeepSpeed 0.8.0 documentation - Read the Docs
The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states ...
DeepSpeed Integration
DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded on multiple GPUs, which won't be...
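The ZeRO entries above point to a second route besides kernel-injection inference: loading the model under ZeRO stage 3 so its parameters are partitioned across GPUs. A minimal sketch of that route, assuming a Hugging Face model and standard DeepSpeed config keys (the model name is a placeholder, and the optional offload entry is only one way to configure it):

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Sketch only: ZeRO stage 3 partitions parameters across ranks, so a model
# too large for a single GPU can still be loaded and run for inference.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},  # optional: spill params to CPU RAM
    },
}

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
ds_engine = deepspeed.initialize(model=model, config=ds_config)[0]
ds_engine.module.eval()

with torch.no_grad():
    pass  # run forward passes through ds_engine.module as usual
```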
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Any update on this? Is there another recommended way to do this - for instance, if we wanted to run with uvicorn and thus couldn’t use the deepspeed launcher?
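On the question of skipping the deepspeed launcher: one direction (sketched here under assumptions, not an answer given in the thread) is to start one process per GPU yourself, export the usual torch.distributed environment variables for each, and let DeepSpeed pick them up:

```python
# Hypothetical setup: one process per GPU started by your own process manager,
# each with RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT exported,
# instead of relying on `deepspeed --num_gpus n`.
import os
import deepspeed

deepspeed.init_distributed(dist_backend="nccl")  # reads the env vars above
local_rank = int(os.environ["LOCAL_RANK"])

# ...build the model and call deepspeed.init_inference as in the other sketches,
# then start uvicorn on a per-rank port (for example 8000 + local_rank) or only
# on rank 0, as shown earlier.
```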
Here is the minimal code I tried:
Then I ran
deepspeed --num_gpus 2 min_example_deepspeed_mp.py
and I got the following error:
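The commenter's script and the error output were not captured here. Purely for illustration, a minimal script of the kind described, in which every launched rank starts its own FastAPI server and therefore hits the port clash from the issue, might look roughly like the following; the model name, port, and arguments are assumptions, not the original file:

```python
# Hypothetical reconstruction, not the commenter's actual min_example_deepspeed_mp.py.
import os
import torch
import deepspeed
import uvicorn
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
engine = deepspeed.init_inference(model, mp_size=world_size,
                                  dtype=torch.float16,
                                  replace_with_kernel_inject=True)

app = FastAPI()

@app.get("/generate")
def generate(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(f"cuda:{local_rank}")
    outputs = engine.module.generate(**inputs, max_new_tokens=64)
    return {"text": tokenizer.decode(outputs[0], skip_special_tokens=True)}

# Every launched rank reaches this line, so with --num_gpus 2 both processes
# try to bind the same port, which is the conflict described in the issue.
uvicorn.run(app, host="0.0.0.0", port=8000)
```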