Replace NMS GPU kernel in object detection sample with torchvision's implementation
Make sure that inference speed in eval mode and accuracy are not significantly affected.
If torchvision's NMS implementation turns out to be slower, clean up the existing GPU kernel instead by removing the empty wrapper in nms.cpp.
Issue Analytics
- Created: 3 years ago
- Comments: 8 (8 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
There is no big difference in either inference speed or accuracy. Actually, the PyTorch implementation is even slightly faster and more accurate:

Model   Implementation   Avg. detect per batch   Mean AP
ssd300  PyTorch impl     1.163 s                 0.7831
ssd300  Old FB kernels   1.192 s                 0.7828
ssd512  PyTorch impl     0.819 s                 0.8044
ssd512  Old FB kernels   0.843 s                 0.8026
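For reference, averaged per-batch numbers like these are typically collected with a small timing harness along the following lines. This is a generic sketch, not the harness actually used: the `model.detect(batch)` call and the iteration counts are assumptions, and on GPU the timed region should be bracketed with `torch.cuda.synchronize()` so asynchronous kernel launches are included in the measurement:

```python
import time

def avg_batch_time(fn, iters=50, warmup=5):
    """Average wall-clock seconds per call of `fn` after a warm-up.

    On GPU, call torch.cuda.synchronize() before reading the clock so
    that pending asynchronous kernels are counted in the measurement.
    """
    for _ in range(warmup):      # warm-up: exclude one-time setup costs
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Hypothetical usage: time a detection step on a fixed batch.
# t = avg_batch_time(lambda: model.detect(batch))
```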
The custom NMS CUDA kernel from our extension uses top-k to cut the number of candidate predictions. This is aligned with the implementation from OpenVINO: https://github.com/openvinotoolkit/openvino/blob/master/inference-engine/src/mkldnn_plugin/nodes/detectionoutput.cpp#L568
The NMS from torchvision doesn't do that, and it seems we can't cut the number of detections before or after the function call to make the results equivalent to the NMS from our extension, because the top-k cut is applied internally, together with the full array of predictions.
Without the top-k parameter the implementations are equivalent, so I believe it's OK to keep the custom NMS GPU kernel.
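To make the effect of a top-k cut concrete, here is a plain-Python sketch of greedy NMS with an optional top-k cut applied to the score-sorted candidate list. This is a simplification for illustration, not the extension's actual CUDA code; the box values are made up:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_thr, top_k=None):
    """Greedy NMS; with top_k set, only the top_k highest-scoring
    candidates are considered at all."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # the cut applied inside the custom kernel
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.3]
print(greedy_nms(boxes, scores, 0.5))           # [0, 2]
print(greedy_nms(boxes, scores, 0.5, top_k=2))  # [0] -- box 2 never considered
```

With the cut, low-scoring boxes can be discarded before suppression even runs, so the kept set differs from plain NMS over the full candidate array, which is why a top-k-free NMS such as torchvision's cannot reproduce the same results just by trimming its input or output.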
@vanyalzr, please re-open the issue if you don't agree or have other ideas to discuss.