
[Feature-Request] YOLOX Support

Start with the why:

The tiny variant of YOLOX has a number of advantages over yolov4-tiny:

  • Significantly higher AP: 33% vs. 22%!
  • It’s anchorless
  • It’s slightly smaller, so it should offer higher frame rates.

There is also a nano variant that would be useful in low power applications, or in situations where you want to run multiple models.

Move to the what:

It would be great if there was first-class support for YOLOX in the DepthAI Python and C++ APIs, e.g. by adding support for YOLOX to the existing YoloDetectionNetwork pipeline node. YOLOX is anchorless, so the existing device-side decoding probably won’t work. I tried it anyway (by following the OpenVINO instructions on the YOLOX GitHub page, and then compiling to a blob for the Myriad), but I get the following error:

[14442C10F1C14ED000] [50.022] [system] [critical] Fatal error. Please report to developers. Log: 'PlgDetectionParser' '109'
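
For anyone retracing those steps, the IR-to-blob compilation can also be scripted with Luxonis’ blobconverter package instead of invoking OpenVINO’s compile_tool by hand. A minimal sketch, assuming yolox_tiny.xml / yolox_tiny.bin are the IR files produced by the YOLOX OpenVINO export instructions (the file names here are placeholders):

import blobconverter

# Compile an OpenVINO IR to a MyriadX blob. The xml/bin paths below are
# placeholders for whatever the YOLOX OpenVINO export actually produced.
blob_path = blobconverter.from_openvino(
    xml="yolox_tiny.xml",
    bin="yolox_tiny.bin",
    data_type="FP16",    # the MyriadX runs FP16
    shaves=6,
    version="2021.3",    # match the pipeline's OpenVINO version
)
print(blob_path)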


Top GitHub Comments

PINTO0309 commented on Aug 23, 2021 (3 reactions)

YOLOX (OpenVINO IR and Myriad Inference Blob, ONNX, etc…)

  1. download_tiny.sh
  2. download_nano.sh

https://github.com/PINTO0309/PINTO_model_zoo/tree/main/132_YOLOX

This is an old model that was converted within a day of YOLOX’s official release. I’m not fully aware of the flow of the discussion, so please ignore this if it isn’t helpful.

atmccarthy commented on Aug 21, 2021 (3 reactions)

Thanks @Luxonis-Brandon. I used the example you shared and the YOLOX OpenVINO demo code to build a working solution. Please see the code at the end of this comment, and the attached zip containing a blob for the Myriad. There are some major drawbacks with this approach:

  1. I’m using the ESP32 / SPI in a low-power application. The final layer of the network has shape (1, 3549, 85), which means you need to pull about 600 KB of data back to the host for each frame (the short sketch after this list works through the arithmetic), and then perform non-maximum suppression on the CPU. The ESP32 + SPI combination can’t really handle that much data. The impact on FPS is large even over USB with a powerful CPU at the other end, although note that the size of the last dimension (85 in the case of COCO) is determined by the number of classes, so the performance impact may be acceptable if you have a small number of classes. The YOLOX paper does reference an approach to doing NMS in the model itself, but unfortunately I haven’t had the time to explore it further.
  2. YOLOX has an image preprocessing / normalization step (see the preproc function in the attached code) that is not part of the model itself and that can’t be done with the ImageManip node. This means we need to pull each camera frame back to the host, normalize it, and then send it back to the device. We could probably get rid of this step during training, although I’m not sure how big the impact on AP would be. YOLOX also expects FP16 data, but AFAIK XLinkIn only accepts int buffers. I had to work around this by spreading the FP16 pixels across two int8s.
  3. My pipeline has downstream nodes (e.g. ObjectTracker), so I need to make another round trip once I’ve decoded the NN outputs on the host.
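
To see where the numbers in point 1 come from, here is a quick back-of-the-envelope check (plain Python, assuming nothing beyond the 416x416 input and the standard YOLOX strides):

input_size = 416
strides = [8, 16, 32]

# YOLOX predicts one box per grid cell, so the row count is the total
# number of cells across the three stride levels: 52^2 + 26^2 + 13^2.
num_cells = sum((input_size // s) ** 2 for s in strides)
print(num_cells)  # 3549

# 85 channels = 4 box coords + 1 objectness + 80 COCO class scores.
# At FP16 (2 bytes per value) that is the per-frame transfer size:
print(num_cells * 85 * 2 / 1024)  # ~589 KiB, i.e. "about 600 KB"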

As for the performance impact, I get ~15 FPS running the attached code on my 8700K vs. ~28 FPS for yolov4 running via the YoloDetectionNetwork node. It’s going to be completely unworkable on the ESP32, so I haven’t tried it.

On the positive side, it does seem to be more accurate than yolov4 in practice: if I point the camera at my face, yolov4 says I’m a dog (I had headphones on at the time, which probably didn’t help…), whereas YOLOX says I’m a person. YOLOX also detects the picture of my dog that I have sitting on my desk as a dog, whereas yolov4 oscillates between cat and dog. I did notice that the YOLOX bounding boxes were a bit janky for really large objects. Anyway, this very scientific test proves that you should look into it further 😄

from pathlib import Path
import numpy as np
import cv2
import depthai as dai
import time


def preproc(image, input_size, mean, std, swap=(2, 0, 1)):
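    """Letterbox-resize the image into input_size on a gray (114) canvas,
    flip BGR -> RGB, normalize with mean/std, and return a CHW float16
    array plus the resize ratio (needed to map boxes back to the frame)."""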
    if len(image.shape) == 3:
        padded_img = np.ones((input_size[0], input_size[1], 3)) * 114.0
    else:
        padded_img = np.ones(input_size) * 114.0
    img = np.array(image)
    r = min(input_size[0] / img.shape[0], input_size[1] / img.shape[1])
    resized_img = cv2.resize(
        img,
        (int(img.shape[1] * r), int(img.shape[0] * r)),
        interpolation=cv2.INTER_LINEAR,
    ).astype(np.float32)
    padded_img[: int(img.shape[0] * r), : int(img.shape[1] * r)] = resized_img

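    # Reverse the channel order (BGR -> RGB)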
    padded_img = padded_img[:, :, ::-1]
    padded_img /= 255.0
    if mean is not None:
        padded_img -= mean
    if std is not None:
        padded_img /= std
    padded_img = padded_img.transpose(swap)
    padded_img = np.ascontiguousarray(padded_img, dtype=np.float16)
    return padded_img, r


def nms(boxes, scores, nms_thr):
    """Single class NMS implemented in Numpy."""
    x1 = boxes[:, 0]
    y1 = boxes[:, 1]
    x2 = boxes[:, 2]
    y2 = boxes[:, 3]

    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]

    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])

        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        ovr = inter / (areas[i] + areas[order[1:]] - inter)

        inds = np.where(ovr <= nms_thr)[0]
        order = order[inds + 1]

    return keep


def multiclass_nms(boxes, scores, nms_thr, score_thr):
    """Multiclass NMS implemented in Numpy"""
    final_dets = []
    num_classes = scores.shape[1]
    for cls_ind in range(num_classes):
        cls_scores = scores[:, cls_ind]
        valid_score_mask = cls_scores > score_thr
        if valid_score_mask.sum() == 0:
            continue
        else:
            valid_scores = cls_scores[valid_score_mask]
            valid_boxes = boxes[valid_score_mask]
            keep = nms(valid_boxes, valid_scores, nms_thr)
            if len(keep) > 0:
                cls_inds = np.ones((len(keep), 1)) * cls_ind
                dets = np.concatenate(
                    [valid_boxes[keep], valid_scores[keep, None], cls_inds], 1
                )
                final_dets.append(dets)
    if len(final_dets) == 0:
        return None
    return np.concatenate(final_dets, 0)


def demo_postprocess(outputs, img_size, p6=False):
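    """Decode raw YOLOX outputs in place: add each cell's grid offset and
    multiply by the stride to get the center (x, y) in pixels, and
    exponentiate (w, h) before scaling by the stride."""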

    grids = []
    expanded_strides = []

    if not p6:
        strides = [8, 16, 32]
    else:
        strides = [8, 16, 32, 64]

    hsizes = [img_size[0] // stride for stride in strides]
    wsizes = [img_size[1] // stride for stride in strides]

    for hsize, wsize, stride in zip(hsizes, wsizes, strides):
        xv, yv = np.meshgrid(np.arange(wsize), np.arange(hsize))
        grid = np.stack((xv, yv), 2).reshape(1, -1, 2)
        grids.append(grid)
        shape = grid.shape[:2]
        expanded_strides.append(np.full((*shape, 1), stride))

    grids = np.concatenate(grids, 1)
    expanded_strides = np.concatenate(expanded_strides, 1)
    outputs[..., :2] = (outputs[..., :2] + grids) * expanded_strides
    outputs[..., 2:4] = np.exp(outputs[..., 2:4]) * expanded_strides

    return outputs


SHAPE = 416
labelMap = [
    "person",         "bicycle",    "car",           "motorbike",     "aeroplane",   "bus",           "train",
    "truck",          "boat",       "traffic light", "fire hydrant",  "stop sign",   "parking meter", "bench",
    "bird",           "cat",        "dog",           "horse",         "sheep",       "cow",           "elephant",
    "bear",           "zebra",      "giraffe",       "backpack",      "umbrella",    "handbag",       "tie",
    "suitcase",       "frisbee",    "skis",          "snowboard",     "sports ball", "kite",          "baseball bat",
    "baseball glove", "skateboard", "surfboard",     "tennis racket", "bottle",      "wine glass",    "cup",
    "fork",           "knife",      "spoon",         "bowl",          "banana",      "apple",         "sandwich",
    "orange",         "broccoli",   "carrot",        "hot dog",       "pizza",       "donut",         "cake",
    "chair",          "sofa",       "pottedplant",   "bed",           "diningtable", "toilet",        "tvmonitor",
    "laptop",         "mouse",      "remote",        "keyboard",      "cell phone",  "microwave",     "oven",
    "toaster",        "sink",       "refrigerator",  "book",          "clock",       "vase",          "scissors",
    "teddy bear",     "hair drier", "toothbrush"
]
p = dai.Pipeline()
p.setOpenVINOVersion(dai.OpenVINO.VERSION_2021_3)


class FPSHandler:
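    """Tracks the average FPS since construction."""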
    def __init__(self, cap=None):
        self.timestamp = time.time()
        self.start = time.time()
        self.frame_cnt = 0
    def next_iter(self):
        self.timestamp = time.time()
        self.frame_cnt += 1
    def fps(self):
        return self.frame_cnt / (self.timestamp - self.start)

camRgb = p.createColorCamera()
camRgb.setPreviewSize(SHAPE, SHAPE)
camRgb.setResolution(dai.ColorCameraProperties.SensorResolution.THE_1080_P)
camRgb.setInterleaved(False)
camRgb.setColorOrder(dai.ColorCameraProperties.ColorOrder.BGR)

nn = p.createNeuralNetwork()
nn.setBlobPath(str(Path("yolox_tiny.blob").resolve().absolute()))
nn.setNumInferenceThreads(2)
nn.input.setBlocking(True)

# Send rgb frames to the host
rgb_xout = p.createXLinkOut()
rgb_xout.setStreamName("rgb")
camRgb.preview.link(rgb_xout.input)

# Send converted frames from the host to the NN
xinFrame = p.createXLinkIn()
xinFrame.setStreamName("inFrame")
xinFrame.out.link(nn.input)

# Send bounding boxes from the NN to the host via XLink
nn_xout = p.createXLinkOut()
nn_xout.setStreamName("nn")
nn.out.link(nn_xout.input)


# Pipeline is defined, now we can connect to the device
with dai.Device(p) as device:
    qRgb = device.getOutputQueue(name="rgb", maxSize=4, blocking=True)
    qIn = device.getInputQueue("inFrame", maxSize=4, blocking=True)
    qNn = device.getOutputQueue(name="nn", maxSize=4, blocking=True)
    fps = FPSHandler()

    while True:
        inRgb = qRgb.get()
        frame = inRgb.getCvFrame()
        mean = (0.485, 0.456, 0.406)
        std = (0.229, 0.224, 0.225)

        image, ratio = preproc(frame, (SHAPE, SHAPE), mean, std)
        # NOTE: The model expects an FP16 input image, but ImgFrame accepts a list of ints only. I work around this by
        # spreading the FP16 across two ints
        image = list(image.tobytes())

        dai_frame = dai.ImgFrame()
        dai_frame.setHeight(SHAPE)
        dai_frame.setWidth(SHAPE)
        dai_frame.setData(image)
        qIn.send(dai_frame)

        in_nn = qNn.tryGet()
        if in_nn is not None:
            fps.next_iter()
            cv2.putText(frame, "Fps: {:.2f}".format(fps.fps()), (2, SHAPE - 4), cv2.FONT_HERSHEY_TRIPLEX, 0.4, color=(255, 255, 255))

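            # One row per grid cell across strides 8/16/32; 85 = 4 box + 1 objectness + 80 classes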
            data = np.array(in_nn.getLayerFp16('output')).reshape(1, 3549, 85)
            predictions = demo_postprocess(data, (SHAPE, SHAPE), p6=False)[0]

            boxes = predictions[:, :4]
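            # Per-class score = objectness * class probability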
            scores = predictions[:, 4, None] * predictions[:, 5:]

            boxes_xyxy = np.ones_like(boxes)
            boxes_xyxy[:, 0] = boxes[:, 0] - boxes[:, 2] / 2.
            boxes_xyxy[:, 1] = boxes[:, 1] - boxes[:, 3] / 2.
            boxes_xyxy[:, 2] = boxes[:, 0] + boxes[:, 2] / 2.
            boxes_xyxy[:, 3] = boxes[:, 1] + boxes[:, 3] / 2.
            dets = multiclass_nms(boxes_xyxy, scores, nms_thr=0.45, score_thr=0.1)

            if dets is not None:
                final_boxes = dets[:, :4]
                final_scores, final_cls_inds = dets[:, 4], dets[:, 5]

                for i in range(len(final_boxes)):
                    bbox = final_boxes[i]
                    score = final_scores[i]
                    class_name = labelMap[int(final_cls_inds[i])]

                    if score >= 0.1:
                        # Clamp the bounding box to 0..SHAPE
                        bbox[bbox > SHAPE] = SHAPE
                        bbox[bbox < 0] = 0
                        xy_min = (int(bbox[0]), int(bbox[1]))
                        xy_max = (int(bbox[2]), int(bbox[3]))
                        # Display detection's BB, label and confidence on the frame
                        cv2.rectangle(frame, xy_min, xy_max, (255, 0, 0), 2)
                        cv2.putText(frame, class_name, (xy_min[0] + 10, xy_min[1] + 20), cv2.FONT_HERSHEY_TRIPLEX, 0.5, 255)
                        cv2.putText(frame, f"{int(score * 100)}%", (xy_min[0] + 10, xy_min[1] + 40), cv2.FONT_HERSHEY_TRIPLEX, 0.5, 255)

        cv2.imshow("rgb", frame)
        if cv2.waitKey(1) == ord('q'):
            break

yolox_tiny.zip
