SSD Object Detection Mappings?
Can anyone explain what the mappings are of the two tensors produced from the output of the ssd_mobilenet_v1_android_export model, which the example converts to a .mlmodel file?
When I look at this example: https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb
Which uses: ssd_mobilenet_v1_coco_2017_11_17
You can see the code does this:
```python
(boxes, scores, classes, num) = sess.run(
    [detection_boxes, detection_scores, detection_classes, num_detections],
    feed_dict={image_tensor: image_np_expanded})

print(boxes.shape)
print(boxes)
```

Output:

```
(1, 100, 4)
[ 3.90840471e-02  1.92150325e-02  8.72103453e-01  3.15773487e-01]
[ 1.09515011e-01  4.02835608e-01  9.24646080e-01  9.73047853e-01]
[ 5.07123828e-01  3.85651529e-01  8.76479626e-01  7.03940928e-01]
...
```
Looks good!
So I assume what we have here is the top 100 detections, with 4 box coordinates each.
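For context, here's a minimal sketch of how I read those outputs (it reuses `boxes`, `scores`, `classes` and `num` from the sess.run call above; the 0.5 confidence threshold is just an example, not something from the tutorial):

```python
# boxes: (1, 100, 4), scores: (1, 100), classes: (1, 100), num: (1,)
for i in range(int(num[0])):
    if scores[0, i] < 0.5:  # example confidence threshold
        continue
    print("class %d, score %.2f, box %s"
          % (int(classes[0, i]), scores[0, i], boxes[0, i]))
```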
I traced the values and code and did this:
```
box_coords = ymin, xmin, ymax, xmax
(left, right, top, bottom) = (xmin, xmax, ymin, ymax)
```

e.g. for the box `(0.10951501131057739, 0.4028356075286865, 0.9246460795402527, 0.9730478525161743)`:

```
left:   0.4028356075286865
right:  0.9730478525161743
top:    0.10951501131057739
bottom: 0.9246460795402527
```
These are fractions of the entire image, so multiplying by the input size of 300 should give you back the original pixel locations.
This all makes sense.
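For instance, a quick sketch of that conversion using the example box above (the values are just illustrative):

```python
input_size = 300  # SSD MobileNet input resolution
ymin, xmin, ymax, xmax = 0.10951501, 0.40283561, 0.92464608, 0.97304785  # box from above

left, right = xmin * input_size, xmax * input_size   # ~120.9, ~291.9
top, bottom = ymin * input_size, ymax * input_size   # ~32.9,  ~277.4
```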
However, I need the model in CoreML so I followed this guide: https://github.com/tf-coreml/tf-coreml/blob/master/examples/ssd_example.ipynb
Which uses: ssd_mobilenet_v1_android_export. I assume from the README that it's the same model: https://github.com/tensorflow/models/tree/master/research/object_detection

> August 11, 2017
> We have released an update to the Android Detect demo which will now run models trained using the Tensorflow Object Detection API on an Android device. By default, it currently runs a frozen SSD w/Mobilenet detector trained on COCO, but we encourage you to try out other detection models!

Obviously it's slightly different, but to what degree I don't know. Can someone clarify?
Now, after the export process I load the .mlmodel into Xcode, and when I run the model I get this kind of output from the box tensor `concat__0` (i.e. `concat:0`), which the conversion notebook describes as the bounding-box encodings of the 1917 anchor boxes.

Why are there negative values in the bounding box coords?

```
[ 0.35306236, -0.48976013, -2.5883727 , -4.0799093 ]
[ 0.8760979 ,  1.1190459 , -2.6803727 , -1.5514386 ]
[ 1.3935553 ,  0.85614955, -0.92042184, -2.7950268 ]
```
Also, can anyone explain what the 1917 refers to? There are 91 categories in COCO, but why 1917 anchor boxes?
I even looked at the Android example and it's not much easier to understand.
Even better would be an explanation of the CoreML file's output tensors. It would be great to have the end of the notebook map the outputs in Python to show what they are.
Perhaps draw the boxes and show the category labels, just like object_detection_tutorial.ipynb does. That would be great! 😃
I am completely lost with this, so any help would be greatly appreciated!
---
I did a lot of work on this over the weekend and I have a working understanding of the outputs produced by the converted CoreML model.
I'll try and walk through what I found; please let me know if anything is unclear. Anyway, the CoreML model outputs two MLMultiArrays:

- `concat_1__0` (i.e. `concat_1:0`), a 1 x 1 x 91 x 1 x 1917 MLMultiArray holding the class scores
- `concat__0` (i.e. `concat:0`), a 1 x 1 x 4 x 1 x 1917 MLMultiArray holding the box encodings

Here, 91 refers to the index of the class labels (0 = background, 18 = dog), and 1917 is the total number of anchor boxes.
Postprocessing
The postprocessing goes like this: first, look at the scores for the class you care about across all 1917 anchors, e.g. `scores[18][...]` with `[...] = [1756, 1857, 1858, 1860, 1911, 1912, 1914]` being the anchor indices where class 18 (dog) scores highly. Those candidates then go through non-maximum suppression, and the surviving boxes get decoded.
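In Python terms, that selection step might look roughly like this (a sketch, assuming `class_scores` is the `concat_1__0` output already pulled into a (91, 1917) numpy array; the 0.5 cutoff is just an example):

```python
import numpy as np

DOG = 18  # class index for "dog" in this label map
dog_scores = class_scores[DOG]                  # scores for class 18 across all 1917 anchors
candidate_idx = np.where(dog_scores > 0.5)[0]   # e.g. array([1756, 1857, 1858, ...])
```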
Now, the non-maximum suppression part is pretty easy to understand. You can read about it here, but the basic gist is that you sort the boxes by score in descending order and weed out any box that overlaps more than 50% with a higher-scoring box.
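Here's a minimal numpy sketch of that greedy NMS (it assumes the boxes have already been decoded into (ymin, xmin, ymax, xmax) corners, as described next):

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop anything that overlaps it
    by more than iou_threshold, then repeat with the next-best survivor."""
    order = np.argsort(scores)[::-1]   # box indices sorted by score, descending
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the top box with every remaining box
        ymin = np.maximum(boxes[i, 0], boxes[rest, 0])
        xmin = np.maximum(boxes[i, 1], boxes[rest, 1])
        ymax = np.minimum(boxes[i, 2], boxes[rest, 2])
        xmax = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, ymax - ymin) * np.maximum(0.0, xmax - xmin)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_rest - inter)
        order = rest[iou <= iou_threshold]
    return keep
```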
The trickiest part here is computing the bounding boxes. To do that, you need to take the output of the CoreML model and adjust a base set of anchor boxes. This set of 1917 anchor boxes tiles the 300x300 input image.
The output of the k-th CoreML box is a set of four box encodings for the k-th anchor (in the TF Object Detection API's box coder these are t_y, t_x, t_h, t_w), not corner coordinates, which is why negative values show up.
You take these and combine them with the anchor boxes using the same routine as this Python code. Note: you'll need to use the scale_factors of 10.0 (for the center offsets) and 5.0 (for the height/width) here.
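For reference, here's a small numpy sketch of that decode, modelled on the TF Object Detection API's faster_rcnn_box_coder with those scale factors (the exact ordering of the converted model's outputs is an assumption worth double-checking against the linked code):

```python
import numpy as np

def decode_boxes(rel_codes, anchors, scale_factors=(10.0, 10.0, 5.0, 5.0)):
    """Turn (t_y, t_x, t_h, t_w) encodings into (ymin, xmin, ymax, xmax) boxes.

    rel_codes: (1917, 4) CoreML box outputs
    anchors:   (1917, 4) anchor boxes as (ymin, xmin, ymax, xmax)
    """
    ha = anchors[:, 2] - anchors[:, 0]
    wa = anchors[:, 3] - anchors[:, 1]
    ycenter_a = anchors[:, 0] + 0.5 * ha
    xcenter_a = anchors[:, 1] + 0.5 * wa

    ty = rel_codes[:, 0] / scale_factors[0]
    tx = rel_codes[:, 1] / scale_factors[1]
    th = rel_codes[:, 2] / scale_factors[2]
    tw = rel_codes[:, 3] / scale_factors[3]

    h = np.exp(th) * ha
    w = np.exp(tw) * wa
    ycenter = ty * ha + ycenter_a
    xcenter = tx * wa + xcenter_a

    return np.stack([ycenter - 0.5 * h, xcenter - 0.5 * w,
                     ycenter + 0.5 * h, xcenter + 0.5 * w], axis=-1)
```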
Now, the anchor boxes themselves are generated using this logic. I followed the logic to the bitter end, but instead of trying to reimplement it in Swift, I just exported the anchors out of the TensorFlow graph from the `import/MultipleGridAnchorGenerator/Identity` tensor. You can see those anchors here. The logic for combining the box predictions and the anchor boxes is written up here.
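If it helps, this is roughly how those anchors can be dumped from the frozen graph in Python (a sketch; the file name, the tensor names, and the dummy-image feed are assumptions about the exported graph):

```python
import numpy as np
import tensorflow as tf  # TF 1.x style, matching the era of this model

graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='import')
    anchors_t = graph.get_tensor_by_name('import/MultipleGridAnchorGenerator/Identity:0')
    image_t = graph.get_tensor_by_name('import/image_tensor:0')

with tf.Session(graph=graph) as sess:
    # Feed a dummy 300x300 image in case anchor generation depends on the input shape.
    anchors = sess.run(anchors_t,
                       feed_dict={image_t: np.zeros((1, 300, 300, 3), dtype=np.uint8)})

print(anchors.shape)                       # expected (1917, 4)
np.savetxt('anchors.csv', anchors, delimiter=',')
```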
Hope this helps! Again, this was a lot of blood, sweat, and tears, reading a ton of TensorFlow code and going through all of the logic. Honestly thought I would stab my eyeballs out. 😭 At the end, I was able to reproduce the bounding box for the golden retriever:
@vincentchu @madhavajay I created a clean (still kinda messy) project that I can share: SSDMobileNetCoreML. We can continue the discussion there if there is any trouble.