[REQUEST] Model serving via deepspeed's inference module
Is your feature request related to a problem? Please describe.
No
Describe the solution you’d like
I am trying to run my model-serving code in a model-parallel fashion. The tutorial shows how to run on multiple GPUs, but the data there is predefined, which does not work for serving. My original code uses FastAPI for the serving. When I launch with
deepspeed --num_gpus n example.py
the FastAPI server is also started n times, which causes a port conflict.
Describe alternatives you’ve considered
Do I have to first start the model in parallel using deepspeed in one script, then start another script for fastapi, and finally connect them somehow?
Additional context
None.
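One possible pattern for this situation (a sketch under assumptions, not something stated in the issue or by maintainers) is to start the HTTP server only on rank 0 and keep the other ranks in a loop that waits for broadcast prompts, so every rank joins the model-parallel forward pass but only one process binds the port. The model name, the port, and the init_inference arguments below are placeholders; argument names follow the DeepSpeed-Inference tutorial of that era and may differ in newer releases.

```python
# Sketch only, not from the issue thread.
import os
import torch
import torch.distributed as dist
import deepspeed
import uvicorn
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

deepspeed.init_distributed()  # uses the env vars set by the deepspeed launcher

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
engine = deepspeed.init_inference(model, mp_size=world_size,
                                  dtype=torch.float16,
                                  replace_with_kernel_inject=True)

def generate(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(f"cuda:{local_rank}")
    outputs = engine.module.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if local_rank == 0:
    app = FastAPI()

    @app.get("/generate")
    def serve(prompt: str):
        # Rank 0 takes the HTTP request and broadcasts the prompt so that
        # every model-parallel rank runs the same forward pass.
        dist.broadcast_object_list([prompt], src=0)
        return {"text": generate(prompt)}

    # Only one process binds the port, so there is no conflict.
    uvicorn.run(app, host="0.0.0.0", port=8000)
else:
    # Non-zero ranks never touch the network port; they just wait for prompts.
    while True:
        payload = [None]
        dist.broadcast_object_list(payload, src=0)
        generate(payload[0])
```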
Issue Analytics
- State:
- Created: 2 years ago
- Comments: 5 (2 by maintainers)
Top Results From Across the Web
Getting Started with DeepSpeed for Inferencing ...
DeepSpeed-Inference introduces several features to efficiently serve transformer-based PyTorch models. It supports model parallelism (MP) to fit large ...
Deploy large models on Amazon SageMaker using ...
DeepSpeed Inference supports large Transformer-based models with billions of parameters. It allows you to efficiently serve large models by ...
DeepSpeed: Accelerating large-scale model inference and ...
Inference-adapted parallelism allows users to efficiently serve large models by adapting to the best parallelism strategies for multi-GPU ...
ZeRO — DeepSpeed 0.8.0 documentation - Read the Docs
The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states ...
DeepSpeed Integration
DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded on multiple GPUs, which won't be...
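The ZeRO entries above point to a second route besides kernel-injection inference: loading the model under ZeRO stage 3 so its parameters are partitioned across GPUs. A minimal sketch of that route, assuming a Hugging Face model and standard DeepSpeed config keys (the model name is a placeholder, and the optional offload entry is only one way to configure it):

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Sketch only: ZeRO stage 3 partitions parameters across ranks, so a model
# too large for a single GPU can still be loaded and run for inference.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},  # optional: spill params to CPU RAM
    },
}

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
ds_engine = deepspeed.initialize(model=model, config=ds_config)[0]
ds_engine.module.eval()

with torch.no_grad():
    pass  # run forward passes through ds_engine.module as usual
```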
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Any update on this? Is there another recommended way to do this - for instance, if we wanted to run with uvicorn and thus couldn’t use the deepspeed launcher?
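On the question of skipping the deepspeed launcher: one direction (sketched here under assumptions, not an answer given in the thread) is to start one process per GPU yourself, export the usual torch.distributed environment variables for each, and let DeepSpeed pick them up:

```python
# Hypothetical setup: one process per GPU started by your own process manager,
# each with RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT exported,
# instead of relying on `deepspeed --num_gpus n`.
import os
import deepspeed

deepspeed.init_distributed(dist_backend="nccl")  # reads the env vars above
local_rank = int(os.environ["LOCAL_RANK"])

# ...build the model and call deepspeed.init_inference as in the other sketches,
# then start uvicorn on a per-rank port (for example 8000 + local_rank) or only
# on rank 0, as shown earlier.
```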
Here is the minimal code I tried:
Then I ran
deepspeed --num_gpus 2 min_example_deepspeed_mp.py
and I got the following error:
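The commenter's script and the error output were not captured here. Purely for illustration, a minimal script of the kind described, in which every launched rank starts its own FastAPI server and therefore hits the port clash from the issue, might look roughly like the following; the model name, port, and arguments are assumptions, not the original file:

```python
# Hypothetical reconstruction, not the commenter's actual min_example_deepspeed_mp.py.
import os
import torch
import deepspeed
import uvicorn
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
engine = deepspeed.init_inference(model, mp_size=world_size,
                                  dtype=torch.float16,
                                  replace_with_kernel_inject=True)

app = FastAPI()

@app.get("/generate")
def generate(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(f"cuda:{local_rank}")
    outputs = engine.module.generate(**inputs, max_new_tokens=64)
    return {"text": tokenizer.decode(outputs[0], skip_special_tokens=True)}

# Every launched rank reaches this line, so with --num_gpus 2 both processes
# try to bind the same port, which is the conflict described in the issue.
uvicorn.run(app, host="0.0.0.0", port=8000)
```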