TorchServe How to Curl Multiple Images Properly
I am using TorchServe to serve a model from MMOCR (https://github.com/open-mmlab/mmocr), and I have several questions:
- I tried to run inference on hundreds of images together in batch mode by chaining curl commands with &, as suggested in https://github.com/pytorch/serve/issues/1235#issuecomment-938231201. However, this is not a neat solution when hundreds of curl commands have to be concatenated. I can of course build one very long command that looks like
curl -X POST http://localhost:8080/predictions/ABINet -T image1.png & curl -X POST http://localhost:8080/predictions/ABINet -T image2.png & curl -X POST http://localhost:8080/predictions/ABINet -T image3.png & curl -X POST http://localhost:8080/predictions/ABINet -T image4.png &...
But I don’t think this is the right way to go. My questions are: does using & actually run the requests in parallel? What is a good/suggested way to run inference on hundreds of images? And what is a Pythonic way to do this (maybe using requests/subprocess)? A sketch of what I have in mind follows below.
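For illustration, here is a minimal sketch of the kind of client I mean, using requests with a thread pool; the endpoint matches my setup, but the glob pattern, worker count, and timeout are placeholders:

import glob
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/predictions/ABINet"
IMAGE_PATHS = sorted(glob.glob("images/*.png"))  # placeholder glob

def predict(path):
    # One image per request; TorchServe groups concurrent requests into
    # server-side batches according to batchSize / maxBatchDelay.
    with open(path, "rb") as f:
        resp = requests.post(URL, data=f.read(), timeout=120)
    resp.raise_for_status()
    return path, resp.text  # raw response; the format depends on the handler

# Fire the requests concurrently instead of chaining curl commands with &.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = dict(pool.map(predict, IMAGE_PATHS))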
- I used a config.properties file that looks like the one below:
inference_address=http://127.0.0.1:8080
management_address=http://127.0.0.1:8081
metrics_address=http://127.0.0.1:8082
load_models=ABINet.mar
models={\
"ABINet": {\
"1.0": {\
"defaultVersion": true,\
"marName": "ABINet.mar",\
"runtime": "python",\
"minWorkers": 1,\
"maxWorkers": 8,\
"batchSize": 200,\
"maxBatchDelay": 50,\
"responseTimeout": 120,\
"max_request_size": 65535000\
}\
}\
}
I noticed that each time I run inference (using curl -X POST http://localhost:8080/predictions/ABINet -T image1.png & curl -X POST http://localhost:8080/predictions/ABINet -T image2.png &...
concatenated hundreds of times), the GPU usage increases and the memory is not released after the inference is done.
For example, if I want to run inference on 300 images with a config.properties that looks like
inference_address=http://127.0.0.1:8080
management_address=http://127.0.0.1:8081
metrics_address=http://127.0.0.1:8082
load_models=ABINet.mar
models={\
"ABINet": {\
"1.0": {\
"defaultVersion": true,\
"marName": "ABINet.mar",\
"runtime": "python",\
"minWorkers": 4,\
"maxWorkers": 8,\
"batchSize": 600,\
"maxBatchDelay": 50,\
"responseTimeout": 120,\
"max_request_size": 65535000\
}\
}\
}
Using gpustat, I captured the GPU usage right after starting TorchServe (before the first inference), after running the inference the 1st time, and after running it the 2nd time (screenshots omitted): the usage climbs after every run and is never released.
So if I run this inference on hundreds of images 3 times, it breaks with an error like:
{
"code": 503,
"type": "ServiceUnavailableException",
"message": "Model \"ABINet\" has no worker to serve inference request. Please use scale workers API to add workers."
}
I also tried registering the model with initial_workers, as suggested in https://github.com/pytorch/serve/issues/29, but with no luck.
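For reference, this is roughly the kind of register/scale-workers call I mean, sketched with requests against the management API; the parameter values are just the ones from my config and are not necessarily correct:

import requests

MGMT = "http://127.0.0.1:8081"

# Register the model with workers and server-side batching configured up front.
requests.post(
    f"{MGMT}/models",
    params={
        "url": "ABINet.mar",
        "initial_workers": 4,
        "batch_size": 200,
        "max_batch_delay": 50,
        "synchronous": "true",
    },
).raise_for_status()

# Or scale workers for a model that is already registered.
requests.put(
    f"{MGMT}/models/ABINet",
    params={"min_worker": 4, "max_worker": 8, "synchronous": "true"},
).raise_for_status()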
My questions are:
- How do I set config.properties properly to handle this situation? How do I know what to set for batchSize and maxBatchDelay?
- How do I get TorchServe to release GPU memory after an inference finishes? Is there something similar to gc.collect() or torch.cuda.reset_peak_memory_stats(device=None)?
- How does TorchServe work under the hood? If I send a request with hundreds of images, say 600, will TorchServe take them all in, or take only whatever portion it can handle? Or will it automatically partition the request (say, take 300 the first time, then the remaining 300)?
I am attaching the MMOCR custom handler for reference
import base64
import os

import mmcv
import torch
from mmocr.apis import init_detector, model_inference
from ts.torch_handler.base_handler import BaseHandler


class MMOCRHandler(BaseHandler):
    threshold = 0.5

    def initialize(self, context):
        # Pick the GPU assigned to this worker, falling back to CPU.
        properties = context.system_properties
        self.map_location = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.device = torch.device(
            self.map_location + ':' + str(properties.get('gpu_id'))
            if torch.cuda.is_available() else self.map_location)
        self.manifest = context.manifest

        # Build the MMOCR model from the files packaged in the .mar archive.
        model_dir = properties.get('model_dir')
        serialized_file = self.manifest['model']['serializedFile']
        checkpoint = os.path.join(model_dir, serialized_file)
        self.config_file = os.path.join(model_dir, 'config.py')
        self.model = init_detector(self.config_file, checkpoint, self.device)
        self.initialized = True

    def preprocess(self, data):
        # Decode each request in the batch into an image array.
        images = []
        for row in data:
            image = row.get('data') or row.get('body')
            if isinstance(image, str):
                image = base64.b64decode(image)
            image = mmcv.imfrombytes(image)
            images.append(image)
        return images

    def inference(self, data, *args, **kwargs):
        # Run the whole batch through MMOCR in one call.
        results = model_inference(self.model, data, batch_mode=True)
        return results

    def postprocess(self, data):
        # Format output following the example OCRHandler format
        return data
This is driving me nuts. Any help is appreciated.
Top GitHub Comments
Thank you so much for the comment! I tried rewriting my code using asyncio and aiohttp, and the Python file looks like the one below. However, the GPU usage issue still remains: each call of this Python file increases the GPU memory usage, and by the end of the 2nd call my GPU is already full. Am I doing this the right way?
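A minimal sketch of an asyncio/aiohttp client along those lines (not the original script; the glob pattern and endpoint are assumptions):

import asyncio
import glob

import aiohttp

URL = "http://localhost:8080/predictions/ABINet"
IMAGE_PATHS = sorted(glob.glob("images/*.png"))  # placeholder glob

async def predict(session, path):
    # One image per request; the server groups concurrent requests into batches.
    with open(path, "rb") as f:
        payload = f.read()
    async with session.post(URL, data=payload) as resp:
        resp.raise_for_status()
        return path, await resp.text()  # raw response; format depends on the handler

async def main():
    timeout = aiohttp.ClientTimeout(total=120)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return dict(await asyncio.gather(*(predict(session, p) for p in IMAGE_PATHS)))

if __name__ == "__main__":
    results = asyncio.run(main())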
Facing the same issue as @Hegelim. GPU memory usage keeps increasing after each inference batch.