SSD Object Detection Mappings?
Can anyone explain what the mappings are of the two tensors produced from the output of the ssd_mobilenet_v1_android_export model, which the example converts to a .mlmodel file?
When I look at this example: https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb
Which uses: ssd_mobilenet_v1_coco_2017_11_17
You can see the code does this:
```python
(boxes, scores, classes, num) = sess.run(
    [detection_boxes, detection_scores, detection_classes, num_detections],
    feed_dict={image_tensor: image_np_expanded})

print(boxes.shape)
print(boxes)
```

Output:

```
(1, 100, 4)
[ 3.90840471e-02  1.92150325e-02  8.72103453e-01  3.15773487e-01]
[ 1.09515011e-01  4.02835608e-01  9.24646080e-01  9.73047853e-01]
[ 5.07123828e-01  3.85651529e-01  8.76479626e-01  7.03940928e-01]
...
```
Looks good!
So I assume what we have here is the top 100 detections, with 4 box coordinates each.
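For context, here's a minimal sketch of how I read those outputs (it reuses `boxes`, `scores`, `classes` and `num` from the sess.run call above; the 0.5 confidence threshold is just an example, not something from the tutorial):

```python
# boxes: (1, 100, 4), scores: (1, 100), classes: (1, 100), num: (1,)
for i in range(int(num[0])):
    if scores[0, i] < 0.5:  # example confidence threshold
        continue
    print("class %d, score %.2f, box %s"
          % (int(classes[0, i]), scores[0, i], boxes[0, i]))
```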
I traced the values and code and did this:
```
box_coords = ymin, xmin, ymax, xmax
(left, right, top, bottom) = (xmin, xmax, ymin, ymax)
```

e.g. for the box `(0.10951501131057739, 0.4028356075286865, 0.9246460795402527, 0.9730478525161743)`:

```
left:   0.4028356075286865
right:  0.9730478525161743
top:    0.10951501131057739
bottom: 0.9246460795402527
```
These are fractions of the entire image, so multiplying by the input size of 300 should give you back the original pixel locations.
This all makes sense.
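For instance, a quick sketch of that conversion using the example box above (the values are just illustrative):

```python
input_size = 300  # SSD MobileNet input resolution
ymin, xmin, ymax, xmax = 0.10951501, 0.40283561, 0.92464608, 0.97304785  # box from above

left, right = xmin * input_size, xmax * input_size   # ~120.9, ~291.9
top, bottom = ymin * input_size, ymax * input_size   # ~32.9,  ~277.4
```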
However, I need the model in CoreML so I followed this guide: https://github.com/tf-coreml/tf-coreml/blob/master/examples/ssd_example.ipynb
Which uses: ssd_mobilenet_v1_android_export. I assume from the README that it's the same model: https://github.com/tensorflow/models/tree/master/research/object_detection

> August 11, 2017
> We have released an update to the Android Detect demo which will now run models trained using the Tensorflow Object Detection API on an Android device. By default, it currently runs a frozen SSD w/Mobilenet detector trained on COCO, but we encourage you to try out other detection models!

Obviously it's slightly different, but to what degree I don't know. Can someone clarify?
Now, after the export process I load the .mlmodel into Xcode, and when I run the model I get this kind of output from the box tensor `concat__0` (i.e. `concat:0`), which the conversion notebook describes as the bounding-box encodings of the 1917 anchor boxes.

Why are there negative values in the bounding box coords?

```
[ 0.35306236, -0.48976013, -2.5883727 , -4.0799093 ]
[ 0.8760979 ,  1.1190459 , -2.6803727 , -1.5514386 ]
[ 1.3935553 ,  0.85614955, -0.92042184, -2.7950268 ]
```
Also, can anyone explain what the 1917 refers to? There are 91 categories in COCO, but why 1917 anchor boxes?
I even looked at the Android example and it's not much easier to understand.
Even better would be an explanation of the CoreML file's output tensors. It would be great to have the end of the notebook map the outputs in Python to show what they are.
Perhaps draw the boxes and show the category labels, just like object_detection_tutorial.ipynb does. That would be great! 😃
I am completely lost with this, so any help would be greatly appreciated!
---
I did a lot of work on this over the weekend and I have a working understanding of the outputs produced by the converted CoreML model.
I'll try and walk through what I found; please let me know if anything is unclear. Anyway, the CoreML model outputs two MLMultiArrays:

- `concat_1__0` (i.e. `concat_1:0`), a 1 x 1 x 91 x 1 x 1917 MLMultiArray holding the class scores
- `concat__0` (i.e. `concat:0`), a 1 x 1 x 4 x 1 x 1917 MLMultiArray holding the box encodings

Here, 91 refers to the index of the class labels (0 = background, 18 = dog), and 1917 is the total number of anchor boxes.
Postprocessing
The postprocessing goes like this: first, look at the scores for the class you care about across all 1917 anchors, e.g. `scores[18][...]` with `[...] = [1756, 1857, 1858, 1860, 1911, 1912, 1914]` being the anchor indices where class 18 (dog) scores highly. Those candidates then go through non-maximum suppression, and the surviving boxes get decoded.
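In Python terms, that selection step might look roughly like this (a sketch, assuming `class_scores` is the `concat_1__0` output already pulled into a (91, 1917) numpy array; the 0.5 cutoff is just an example):

```python
import numpy as np

DOG = 18  # class index for "dog" in this label map
dog_scores = class_scores[DOG]                  # scores for class 18 across all 1917 anchors
candidate_idx = np.where(dog_scores > 0.5)[0]   # e.g. array([1756, 1857, 1858, ...])
```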
Now, the non-maximum suppression part is pretty easy to understand. You can read about it here, but the basic gist is that you sort the boxes by score in descending order and weed out any box that overlaps more than 50% with a higher-scoring box.
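Here's a minimal numpy sketch of that greedy NMS (it assumes the boxes have already been decoded into (ymin, xmin, ymax, xmax) corners, as described next):

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop anything that overlaps it
    by more than iou_threshold, then repeat with the next-best survivor."""
    order = np.argsort(scores)[::-1]   # box indices sorted by score, descending
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the top box with every remaining box
        ymin = np.maximum(boxes[i, 0], boxes[rest, 0])
        xmin = np.maximum(boxes[i, 1], boxes[rest, 1])
        ymax = np.minimum(boxes[i, 2], boxes[rest, 2])
        xmax = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, ymax - ymin) * np.maximum(0.0, xmax - xmin)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_rest - inter)
        order = rest[iou <= iou_threshold]
    return keep
```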
The trickiest part here is computing the bounding boxes. To do that, you need to take the output of the CoreML model and adjust a base set of anchor boxes. This set of 1917 anchor boxes tiles the 300x300 input image.
The output of the k-th CoreML box is a set of four box encodings for the k-th anchor (in the TF Object Detection API's box coder these are t_y, t_x, t_h, t_w), not corner coordinates, which is why negative values show up.
You take these and combine them with the anchor boxes using the same routine as this Python code. Note: you'll need to use the scale_factors of 10.0 (for the center offsets) and 5.0 (for the height/width) here.
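For reference, here's a small numpy sketch of that decode, modelled on the TF Object Detection API's faster_rcnn_box_coder with those scale factors (the exact ordering of the converted model's outputs is an assumption worth double-checking against the linked code):

```python
import numpy as np

def decode_boxes(rel_codes, anchors, scale_factors=(10.0, 10.0, 5.0, 5.0)):
    """Turn (t_y, t_x, t_h, t_w) encodings into (ymin, xmin, ymax, xmax) boxes.

    rel_codes: (1917, 4) CoreML box outputs
    anchors:   (1917, 4) anchor boxes as (ymin, xmin, ymax, xmax)
    """
    ha = anchors[:, 2] - anchors[:, 0]
    wa = anchors[:, 3] - anchors[:, 1]
    ycenter_a = anchors[:, 0] + 0.5 * ha
    xcenter_a = anchors[:, 1] + 0.5 * wa

    ty = rel_codes[:, 0] / scale_factors[0]
    tx = rel_codes[:, 1] / scale_factors[1]
    th = rel_codes[:, 2] / scale_factors[2]
    tw = rel_codes[:, 3] / scale_factors[3]

    h = np.exp(th) * ha
    w = np.exp(tw) * wa
    ycenter = ty * ha + ycenter_a
    xcenter = tx * wa + xcenter_a

    return np.stack([ycenter - 0.5 * h, xcenter - 0.5 * w,
                     ycenter + 0.5 * h, xcenter + 0.5 * w], axis=-1)
```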
Now, the anchor boxes themselves are generated using this logic. I followed the logic to the bitter end, but instead of trying to reimplement it in Swift, I just exported the anchors out of the TensorFlow graph from the `import/MultipleGridAnchorGenerator/Identity` tensor. You can see those anchors here. The logic for combining the box predictions and the anchor boxes is written up here.
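If it helps, this is roughly how those anchors can be dumped from the frozen graph in Python (a sketch; the file name, the tensor names, and the dummy-image feed are assumptions about the exported graph):

```python
import numpy as np
import tensorflow as tf  # TF 1.x style, matching the era of this model

graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='import')
    anchors_t = graph.get_tensor_by_name('import/MultipleGridAnchorGenerator/Identity:0')
    image_t = graph.get_tensor_by_name('import/image_tensor:0')

with tf.Session(graph=graph) as sess:
    # Feed a dummy 300x300 image in case anchor generation depends on the input shape.
    anchors = sess.run(anchors_t,
                       feed_dict={image_t: np.zeros((1, 300, 300, 3), dtype=np.uint8)})

print(anchors.shape)                       # expected (1917, 4)
np.savetxt('anchors.csv', anchors, delimiter=',')
```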
Hope this helps! Again, this was a lot of blood, sweat, and tears, reading a ton of TensorFlow code and going through all of the logic. Honestly thought I would stab my eyeballs out. 😭 At the end, I was able to reproduce the bounding box for the golden retriever:
@vincentchu @madhavajay I created a clean (still kinda messy) project that I can share: SSDMobileNetCoreML. We can continue the discussion there if there is any trouble.