
PyTorch inference freezing for 5-7 seconds on new batch size for GPU.


Description

When running an object detector with the PyTorch engine on GPU, each model goes through a required "warmup" period; this does not happen on CPU. The warmup hits the first prediction of any newly submitted batch size, except for the very first prediction overall: the first prediction takes slightly longer than steady state (400-500ms vs. 10-300ms), the second prediction takes 5000-7000ms, and all subsequent predictions at that batch size take 10-300ms. Submitting a new batch size to the same model on GPU (for example, going from 8 to 16) triggers another 5000-7000ms warmup, after which predictions at either batch size (8 or 16) again take 10-300ms.

Expected Behavior

Models loaded in GPU memory should run predictions in 10-300ms without a "warmup" period.

Error Message

There is no error message.

Here is an example output from my test code below running on GPU:

15:20:17.028 [main] DEBUG BatchPredictOnGPUBug - Generated dataset successfully in 66ms.
15:20:17.034 [main] DEBUG ai.djl.engine.Engine - Found EngineProvider: PyTorch
15:20:17.034 [main] DEBUG ai.djl.engine.Engine - Found default engine: PyTorch
15:20:17.108 [main] DEBUG ai.djl.pytorch.jni.LibUtils - Using cache dir: $HOME/.djl.ai/pytorch
15:20:17.108 [main] DEBUG ai.djl.pytorch.jni.LibUtils - Loading pytorch library from: $HOME/.djl.ai/pytorch/1.8.1-cu111-linux-x86_64/0.11.0-SNAPSHOT-cu111-libdjl_torch.so
15:20:17.326 [main] INFO ai.djl.pytorch.engine.PtEngine - Number of inter-op threads is 24
15:20:17.327 [main] INFO ai.djl.pytorch.engine.PtEngine - Number of intra-op threads is 48
Total batches: 8
15:20:19.717 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 1 results in 496ms.
15:20:25.922 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 2 results in 6205ms.
15:20:32.175 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 3 results in 6252ms.
15:20:38.433 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 4 results in 6257ms.
15:20:44.718 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 5 results in 6285ms.
15:20:51.024 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 6 results in 6306ms.
15:20:57.377 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 7 results in 6353ms.
15:21:03.790 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 8 results in 6413ms.
15:21:03.790 [main] INFO BatchPredictOnGPUBug - TOTAL IMAGES: 36
15:21:03.790 [main] INFO BatchPredictOnGPUBug - TOTAL TIME: 44571
15:21:03.790 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER BATCH: 9904.666666666666
15:21:03.790 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER IMAGE: 1238.0833333333333
15:21:03.790 [main] INFO BatchPredictOnGPUBug - PERFORMANCE: 0.8077000740391735/s
15:21:03.830 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 1 results in 40ms.
15:21:03.910 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 2 results in 80ms.
15:21:04.017 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 3 results in 107ms.
15:21:04.164 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 4 results in 147ms.
15:21:04.335 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 5 results in 171ms.
15:21:04.560 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 6 results in 224ms.
15:21:04.825 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 7 results in 265ms.
15:21:11.207 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 8 results in 6382ms.
15:21:11.207 [main] INFO BatchPredictOnGPUBug - TOTAL IMAGES: 36
15:21:11.207 [main] INFO BatchPredictOnGPUBug - TOTAL TIME: 7417
15:21:11.207 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER BATCH: 1648.2222222222222
15:21:11.207 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER IMAGE: 206.02777777777777
15:21:11.207 [main] INFO BatchPredictOnGPUBug - PERFORMANCE: 4.853714439800458/s
15:21:11.246 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 1 results in 39ms.
15:21:11.322 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 2 results in 76ms.
15:21:11.432 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 3 results in 110ms.
15:21:11.572 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 4 results in 140ms.
15:21:11.744 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 5 results in 172ms.
15:21:11.946 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 6 results in 202ms.
15:21:12.180 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 7 results in 233ms.
15:21:12.448 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 8 results in 268ms.
15:21:12.448 [main] INFO BatchPredictOnGPUBug - TOTAL IMAGES: 36
15:21:12.448 [main] INFO BatchPredictOnGPUBug - TOTAL TIME: 1241
15:21:12.448 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER BATCH: 275.77777777777777
15:21:12.448 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER IMAGE: 34.47222222222222
15:21:12.448 [main] INFO BatchPredictOnGPUBug - PERFORMANCE: 29.0088638195004/s
15:21:12.487 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 1 results in 39ms.
15:21:12.562 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 2 results in 75ms.
15:21:12.667 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 3 results in 105ms.
15:21:12.809 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 4 results in 142ms.
15:21:12.980 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 5 results in 171ms.
15:21:13.182 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 6 results in 201ms.
15:21:13.419 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 7 results in 237ms.
15:21:13.691 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 8 results in 272ms.
15:21:13.691 [main] INFO BatchPredictOnGPUBug - TOTAL IMAGES: 36
15:21:13.691 [main] INFO BatchPredictOnGPUBug - TOTAL TIME: 1243
15:21:13.691 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER BATCH: 276.22222222222223
15:21:13.691 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER IMAGE: 34.52777777777778
15:21:13.691 [main] INFO BatchPredictOnGPUBug - PERFORMANCE: 28.96218825422365/s
15:21:13.730 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 1 results in 39ms.
15:21:13.805 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 2 results in 75ms.
15:21:13.918 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 3 results in 113ms.
15:21:14.057 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 4 results in 139ms.
15:21:14.250 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 5 results in 193ms.
15:21:14.484 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 6 results in 234ms.
15:21:14.713 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 7 results in 229ms.
15:21:14.982 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 8 results in 269ms.
15:21:14.982 [main] INFO BatchPredictOnGPUBug - TOTAL IMAGES: 36
15:21:14.982 [main] INFO BatchPredictOnGPUBug - TOTAL TIME: 1291
15:21:14.982 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER BATCH: 286.8888888888889
15:21:14.982 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER IMAGE: 35.861111111111114
15:21:14.982 [main] INFO BatchPredictOnGPUBug - PERFORMANCE: 27.885360185902403/s
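For reference, the summary lines above follow directly from the raw totals (36 images, max batch size 8, 44571ms on the first GPU epoch). A small standalone sketch of the same arithmetic; the class and method names here are mine, not from the test code:

```java
// Hypothetical helper (names mine) reproducing the arithmetic behind the
// summary lines of the first GPU epoch.
public class BatchStats {

    // AVERAGE TIME PER BATCH: total time divided by the number of
    // full-size-equivalent batches (totalImages / batchSize)
    static double avgPerBatch(long totalTimeMs, int totalImages, int batchSize) {
        return totalTimeMs / ((double) totalImages / batchSize);
    }

    // AVERAGE TIME PER IMAGE: total time divided by the image count
    static double avgPerImage(long totalTimeMs, int totalImages) {
        return (double) totalTimeMs / totalImages;
    }

    // PERFORMANCE: images per second
    static double imagesPerSecond(long totalTimeMs, int totalImages) {
        return totalImages / (totalTimeMs / 1000.0);
    }

    public static void main(String[] args) {
        long totalTime = 44571; // first GPU epoch from the log above
        System.out.println(avgPerBatch(totalTime, 36, 8));  // ~9904.67, matches the log
        System.out.println(avgPerImage(totalTime, 36));     // ~1238.08
        System.out.println(imagesPerSecond(totalTime, 36)); // ~0.81/s
    }
}
```

This makes clear how dominated the first epoch is by the per-batch-size warmup: the same arithmetic on the third epoch (1241ms total) gives roughly 29 images/s.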

Here is an example output from my test code below running on CPU:

15:24:50.719 [main] DEBUG BatchPredictOnGPUBug - Generated dataset successfully in 65ms.
15:24:50.725 [main] DEBUG ai.djl.engine.Engine - Found EngineProvider: PyTorch
15:24:50.725 [main] DEBUG ai.djl.engine.Engine - Found default engine: PyTorch
15:24:50.796 [main] DEBUG ai.djl.pytorch.jni.LibUtils - Using cache dir: $HOME/.djl.ai/pytorch
15:24:50.797 [main] DEBUG ai.djl.pytorch.jni.LibUtils - Loading pytorch library from: $HOME/.djl.ai/pytorch/1.8.1-cu111-linux-x86_64/0.11.0-SNAPSHOT-cu111-libdjl_torch.so
15:24:51.016 [main] INFO ai.djl.pytorch.engine.PtEngine - Number of inter-op threads is 24
15:24:51.016 [main] INFO ai.djl.pytorch.engine.PtEngine - Number of intra-op threads is 48
Total batches: 8
15:24:51.427 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 1 results in 316ms.
15:24:51.675 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 2 results in 247ms.
15:24:51.907 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 3 results in 231ms.
15:24:52.151 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 4 results in 244ms.
15:24:52.427 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 5 results in 276ms.
15:24:52.744 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 6 results in 317ms.
15:24:53.172 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 7 results in 428ms.
15:24:53.580 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 8 results in 408ms.
15:24:53.580 [main] INFO BatchPredictOnGPUBug - TOTAL IMAGES: 36
15:24:53.580 [main] INFO BatchPredictOnGPUBug - TOTAL TIME: 2470
15:24:53.580 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER BATCH: 548.8888888888889
15:24:53.580 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER IMAGE: 68.61111111111111
15:24:53.580 [main] INFO BatchPredictOnGPUBug - PERFORMANCE: 14.5748987854251/s
15:24:53.685 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 1 results in 105ms.
15:24:53.837 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 2 results in 151ms.
15:24:54.016 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 3 results in 179ms.
15:24:54.233 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 4 results in 217ms.
15:24:54.536 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 5 results in 303ms.
15:24:54.935 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 6 results in 399ms.
15:24:55.293 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 7 results in 358ms.
15:24:55.675 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 8 results in 382ms.
15:24:55.675 [main] INFO BatchPredictOnGPUBug - TOTAL IMAGES: 36
15:24:55.675 [main] INFO BatchPredictOnGPUBug - TOTAL TIME: 2095
15:24:55.675 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER BATCH: 465.55555555555554
15:24:55.675 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER IMAGE: 58.19444444444444
15:24:55.675 [main] INFO BatchPredictOnGPUBug - PERFORMANCE: 17.18377088305489/s
15:24:55.780 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 1 results in 105ms.
15:24:55.928 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 2 results in 148ms.
15:24:56.109 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 3 results in 181ms.
15:24:56.326 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 4 results in 217ms.
15:24:56.582 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 5 results in 256ms.
15:24:56.882 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 6 results in 300ms.
15:24:57.222 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 7 results in 339ms.
15:24:57.605 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 8 results in 382ms.
15:24:57.605 [main] INFO BatchPredictOnGPUBug - TOTAL IMAGES: 36
15:24:57.605 [main] INFO BatchPredictOnGPUBug - TOTAL TIME: 1930
15:24:57.605 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER BATCH: 428.8888888888889
15:24:57.605 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER IMAGE: 53.611111111111114
15:24:57.605 [main] INFO BatchPredictOnGPUBug - PERFORMANCE: 18.652849740932645/s
15:24:57.710 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 1 results in 105ms.
15:24:57.858 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 2 results in 148ms.
15:24:58.047 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 3 results in 189ms.
15:24:58.265 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 4 results in 217ms.
15:24:58.520 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 5 results in 255ms.
15:24:58.823 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 6 results in 303ms.
15:24:59.162 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 7 results in 338ms.
15:24:59.545 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 8 results in 383ms.
15:24:59.545 [main] INFO BatchPredictOnGPUBug - TOTAL IMAGES: 36
15:24:59.545 [main] INFO BatchPredictOnGPUBug - TOTAL TIME: 1940
15:24:59.545 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER BATCH: 431.1111111111111
15:24:59.545 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER IMAGE: 53.888888888888886
15:24:59.545 [main] INFO BatchPredictOnGPUBug - PERFORMANCE: 18.556701030927837/s
15:24:59.658 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 1 results in 113ms.
15:24:59.807 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 2 results in 149ms.
15:24:59.987 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 3 results in 179ms.
15:25:00.204 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 4 results in 217ms.
15:25:00.461 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 5 results in 257ms.
15:25:00.776 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 6 results in 315ms.
15:25:01.114 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 7 results in 338ms.
15:25:01.512 [main] DEBUG BatchPredictOnGPUBug - Finished detection with 8 results in 398ms.
15:25:01.513 [main] INFO BatchPredictOnGPUBug - TOTAL IMAGES: 36
15:25:01.513 [main] INFO BatchPredictOnGPUBug - TOTAL TIME: 1967
15:25:01.513 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER BATCH: 437.1111111111111
15:25:01.513 [main] INFO BatchPredictOnGPUBug - AVERAGE TIME PER IMAGE: 54.638888888888886
15:25:01.513 [main] INFO BatchPredictOnGPUBug - PERFORMANCE: 18.301982714794104/s

How to Reproduce?

import ai.djl.Device;
import ai.djl.MalformedModelException;
import ai.djl.Model;
import ai.djl.inference.Predictor;
import ai.djl.modality.cv.Image;
import ai.djl.modality.cv.ImageFactory;
import ai.djl.modality.cv.output.DetectedObjects;
import ai.djl.modality.cv.transform.Resize;
import ai.djl.modality.cv.transform.ToTensor;
import ai.djl.modality.cv.translator.YoloV5Translator;
import ai.djl.translate.Pipeline;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BatchPredictOnGPUBug {

    private static final Logger log = LoggerFactory.getLogger(BatchPredictOnGPUBug.class);

    public static void main(String[] args) throws IOException, MalformedModelException {
        String modelFile = args[0];
        int batchSize = 8;
        int imageSize = 320;
        int epochs = 5;
        float detectionThreshold = 0.5f;
        long datasetStartTime = System.currentTimeMillis();
        List<Image> dataset = new ArrayList<>();
        int totalImages = getTotalImages(batchSize);
        for (int i = 0; i < totalImages; i++) {
            dataset.add(generateRandomImage(100, 100));
        }
        log.debug("Generated dataset successfully in {}ms.", (System.currentTimeMillis() - datasetStartTime));
        Device device = Device.getGpuCount() > 0 ? Device.gpu() : Device.cpu();
        Path modelPath = Paths.get(modelFile);
        Model model = Model.newInstance(modelPath.getFileName().toString(), device);
        model.load(modelPath);
        Pipeline pipeline = new Pipeline();
        pipeline.add(new Resize(imageSize));
        pipeline.add(new ToTensor());
        YoloV5Translator translator = YoloV5Translator.builder()
                .setPipeline(pipeline)
                .optSynset(Collections.singletonList("A"))
                .optThreshold(detectionThreshold)
                .build();
        List<List<Image>> batches = new ArrayList<>();
        List<Image> newBatch = new ArrayList<>();
        int tempBatchSize = 1;
        for (Image image : dataset) {
            newBatch.add(image);
            if (newBatch.size() >= tempBatchSize) {
                batches.add(newBatch);
                newBatch = new ArrayList<>();
                tempBatchSize = Math.min(batchSize, tempBatchSize + 1);
            }
        }
        if (!newBatch.isEmpty()) {
            batches.add(newBatch);
        }
        System.out.println("Total batches: " + batches.size());
        for (int i = 0; i < epochs; i++) {
            long startTime = System.currentTimeMillis();
            for (List<Image> batch : batches) {
                try (Predictor<Image, DetectedObjects> predictor = model
                        .newPredictor(translator)) {
                    long singleStartTime = System.currentTimeMillis();
                    List<DetectedObjects> detectedObjects = predictor.batchPredict(batch);
                    log.debug("Finished detection with {} results in {}ms.", detectedObjects.size(), System.currentTimeMillis() - singleStartTime);
                } catch (Exception exception) {
                    log.error("Failed to predict!", exception);
                }
            }
            long totalTime = System.currentTimeMillis() - startTime;
            double totalSeconds = totalTime / 1000.0;
            double inferencePerSecond = (double) totalImages / totalSeconds;
            log.info("TOTAL IMAGES: " + totalImages);
            log.info("TOTAL TIME: " + totalTime);
            log.info("AVERAGE TIME PER BATCH: " + ((double) totalTime / ((double) totalImages / (double) batchSize)));
            log.info("AVERAGE TIME PER IMAGE: " + ((double) totalTime / (double) totalImages));
            log.info("PERFORMANCE: " + inferencePerSecond + "/s");
        }
        System.exit(0);
    }

    private static int getTotalImages(int num) {
        return (num * (num + 1)) / 2;
    }

    public static Image generateRandomImage(int width, int height) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int a = (int) (Math.random() * 256);
                int r = (int) (Math.random() * 256);
                int g = (int) (Math.random() * 256);
                int b = (int) (Math.random() * 256);
                int p = (a << 24) | (r << 16) | (g << 8) | b;
                img.setRGB(x, y, p);
            }
        }
        return ImageFactory.getInstance().fromImage(img);
    }

}
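The batch-building loop above ramps the batch size 1, 2, ..., batchSize, which is why getTotalImages returns the triangular number n*(n+1)/2 (36 images for a max batch size of 8, split into 8 batches). A standalone sketch of the same scheduling logic; the class name is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone version (class name mine) of the batch-building
// loop: batch sizes ramp 1, 2, ..., maxBatchSize, consuming exactly
// n*(n+1)/2 images, which is what getTotalImages allocates.
public class BatchSchedule {

    static List<Integer> batchSizes(int maxBatchSize) {
        int totalImages = maxBatchSize * (maxBatchSize + 1) / 2; // triangular number
        List<Integer> sizes = new ArrayList<>();
        int pending = 0; // images accumulated into the current batch
        int target = 1;  // current batch's target size, like tempBatchSize
        for (int i = 0; i < totalImages; i++) {
            pending++;
            if (pending >= target) {
                sizes.add(pending);
                pending = 0;
                target = Math.min(maxBatchSize, target + 1);
            }
        }
        if (pending > 0) {
            sizes.add(pending); // leftover partial batch, like the !newBatch.isEmpty() check
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(batchSizes(8)); // [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```

Every batch in an epoch therefore has a size the model has already seen by the second epoch, which is why only the first epoch (and the stray batch-of-8 at the start of epoch two) shows the 6000ms spikes.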

Steps to reproduce

  1. You can run the code above on a GPU machine. I have tested it on RTX 3090, RTX 6000, A6000 and A100 machines all with the same issue. It appears to be independent of environment.
  2. It takes a single argument which is the model you are loading. Here is a link to download the YOLOV5s pre-trained weights exported to TorchScript with an image size of 320: https://hodovo.b-cdn.net/yolov5s.torchscript.pt

What have you tried to solve it?

  1. I ran the debugger and modified the DJL source code to identify the exact spot where the freeze occurs. It is in the PyTorch native code executed through JNI: IValueUtils.forward(block, inputs, isTrain) calls the moduleForward method in ai_djl_pytorch_jni_PyTorchLibrary_inference.cc. This is as far as I got, as I am not very knowledgeable with torch and C++. I will do my best to continue looking into it though.
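One possible mitigation, until the native-side cause is found, might be to pay the warmup cost once at startup by running a throwaway prediction for every batch size the application expects. This is only a sketch: the predict function below is a stand-in for something like Predictor.batchPredict, not the DJL API, and the 6000ms/50ms timings simply mimic the logs above.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntToLongFunction;

// Sketch of a possible startup warmup (not DJL API): run one throwaway
// prediction per expected batch size so real traffic never hits the slow path.
public class WarmupSketch {

    // predict is a stand-in for running a dummy batch of the given size
    // through the model; it returns the observed latency in ms.
    public static long warmUp(int[] expectedBatchSizes, IntToLongFunction predict) {
        long totalMs = 0;
        for (int size : expectedBatchSizes) {
            totalMs += predict.applyAsLong(size); // first call per size pays the warmup
        }
        return totalMs;
    }

    public static void main(String[] args) {
        // Simulated engine mimicking the logs: ~6000ms the first time a batch
        // size is seen, ~50ms afterwards.
        Set<Integer> seenSizes = new HashSet<>();
        IntToLongFunction fakePredict = size -> seenSizes.add(size) ? 6000L : 50L;

        long upFrontCost = warmUp(new int[] {8, 16}, fakePredict);
        System.out.println(upFrontCost);                // 12000: paid once at startup
        System.out.println(fakePredict.applyAsLong(8)); // 50: steady-state fast path
    }
}
```

If the slowdown comes from per-shape autotuning or kernel compilation in the native engine, this only moves the cost to startup; it does not remove it.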

Environment Info

If needed, I can provide the full environment information from the console output; I have run it and saved it. The critical details are listed below:

Java: 1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
Torch: 1.8.1-cu111-linux-x86_64/0.11.0-SNAPSHOT-cu111-libdjl_torch.so
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
DJL: 0.11.0-SNAPSHOT (Also happens on 0.10.0, but I have been using 0.11.0-SNAPSHOT to get CUDA 11.1 support for my 30XX card, since PyTorch 1.7.1 does not support CUDA 11.1)

Other Information

This bug only occurs on GPU, not CPU. You can alter the test code above to use only the CPU and see that performance stays consistent across batches. The delay occurs only when the model is loaded on GPU.

If this is the intended behavior or if I am missing something, please don’t hesitate to tell me. This is my first GitHub issue and my first time attempting to contribute to open source. I appreciate any and all feedback and I will do my best to help in any way I can 😃

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

hodovo commented, Apr 27, 2021 (1 reaction)

Awesome. I will go ahead and add that later tonight and submit another pull request. Thank you again for all the help. I look forward to contributing more in the future 😃

lanking520 commented, Apr 29, 2021 (0 reactions)

@hodovo Thanks for your findings and contribution. Closing this issue for now; please feel free to reopen if you are still observing similar behavior.


