Add bounding box coordinates to predictions
It could be useful to get bounding box coordinates from Document Information Extraction task predictions.
In a conventional OCR-based pipeline, each extracted field already comes with its box coordinates. On Donut, it could be something like:
{
    'predictions': [{
        'menu': [
            {
                'cnt': '2',
                'nm': 'ICE BLAOKCOFFE',
                'price': '82,000',
                'bbox': [xmin, ymin, xmax, ymax]
            },
            {
                'cnt': '1',
                'nm': 'AVOCADO COFFEE',
                'price': '61,000',
                'bbox': [xmin, ymin, xmax, ymax]
            }
        ],
        'total': {
            'cashprice': '200,000',
            'changeprice': '25,400',
            'total_price': '174,600',
            'bbox': [xmin, ymin, xmax, ymax]
        }
    }]
}
A possible solution (which I could not get working): https://github.com/clovaai/donut/issues/16#issuecomment-1217464215
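As background for that approach, here is a minimal, hedged sketch of how the cross-attentions it relies on can be requested from `generate()`. The checkpoint name and the `<s_cord-v2>` prompt follow the official CORD demo; the rest is standard transformers usage, but exact shapes may vary across library versions:

```python
# Sketch (not a confirmed solution): ask generate() to return the
# decoder's cross-attentions so they can be turned into heatmaps later.
import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

pixel_values = processor(image, return_tensors="pt").pixel_values  # `image`: a PIL receipt image
prompt_ids = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=prompt_ids,
    max_length=model.decoder.config.max_position_embeddings,
    return_dict_in_generate=True,
    output_attentions=True,  # exposes outputs.cross_attentions
)
# outputs.cross_attentions: one tuple per generated token, each holding one
# tensor per decoder layer, shaped (batch, num_heads, query_len, encoder_len).
```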
So, I’ve found a way to generate heatmaps from the decoder's cross-attentions. However, the attention maps correspond to individual output tokens from the decoder, not necessarily to words: the word "Restaurant" might consist of three tokens (Res + tau + rant), and the attention heatmaps are very coarse, so they might not give precise boxes, as shown in the example.
Additionally, you need the correspondence between token values and token indices, which means digging into the transformers library's Bart batch-decode implementation.
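A hedged sketch of one way to get that token-index-to-token-value mapping without modifying the library, assuming `outputs` and `processor` from a `generate()` call like the one above (the alignment between `sequences` and `cross_attentions` can be offset by the prompt length, so verify it for your transformers version):

```python
# Map generated token ids back to their string pieces so sub-word tokens
# (Res + tau + rant) can later be grouped into whole fields.
token_ids = outputs.sequences[0].tolist()
tokens = processor.tokenizer.convert_ids_to_tokens(token_ids)

for step, (tid, tok) in enumerate(zip(token_ids, tokens)):
    # each decoding step should line up with outputs.cross_attentions[step],
    # possibly offset by the number of prompt tokens -- check this.
    print(step, tid, repr(tok))
```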
In the example above, I fused the attention heads, the decoder layers, and the per-token heatmaps with max fusion, then thresholded the attention areas, found their contours, and kept the bounding box with the largest area. Maybe someone can find a way to generate better heatmaps.
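In case it helps others, here is a hedged sketch of that fusion-and-contour step. The helper names are made up, and the 40x30 grid is an assumption (the CORD checkpoint's 1280x960 input divided by Swin's 32x reduction); adjust it to your checkpoint:

```python
# Sketch of the described pipeline: max-fuse layers/heads/tokens,
# threshold the fused map, contour it, and keep the largest bounding box.
import cv2
import numpy as np
import torch

def step_heatmap(cross_attentions, step, grid_hw=(40, 30)):
    """Max-fuse decoder layers and heads for one generation step.
    grid_hw is an assumption: 1280x960 input / 32x Swin reduction = 40x30."""
    per_layer = torch.stack([layer[0, :, -1, :] for layer in cross_attentions[step]])
    fused = per_layer.max(dim=0).values.max(dim=0).values  # over layers, then heads
    return fused.reshape(grid_hw).float().cpu().numpy()

def largest_bbox(heatmaps, image_wh, thresh=0.6):
    """Fuse a field's token heatmaps, threshold, keep the biggest contour."""
    fused = np.max(np.stack(heatmaps), axis=0).astype(np.float32)
    fused = (fused - fused.min()) / (np.ptp(fused) + 1e-8)
    mask = (cv2.resize(fused, image_wh) > thresh).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return [x, y, x + w, y + h]  # [xmin, ymin, xmax, ymax]

# e.g. for a field spanning decoding steps 5..7 on a 960x1280 image:
# bbox = largest_bbox([step_heatmap(outputs.cross_attentions, s) for s in (5, 6, 7)],
#                     image_wh=(960, 1280))
```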
I’ll attach the link to the notebook I used to generate the maps. If people are interested in the code that maps token indices to token values, I can attach a modified donut/model.py as well.
https://colab.research.google.com/drive/1OzRapy23W8Ksf0AtqlkLFaVAAjJRUqbk?usp=sharing
Refer to the Document VQA Example section of that notebook. You have to use a resized shape of [4, 16, 80, 60] for the DocVQA task, since the final cross-attention feature-map sizes differ from the document-extraction task.
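For reference, a hedged sketch of what that reshape looks like for one decoding step (4 decoder layers and 16 heads per the Donut-base decoder; 80x60 is the DocVQA feature map quoted above):

```python
# For the DocVQA checkpoint: stack one step's per-layer cross-attentions
# and reshape to (layers=4, heads=16, H=80, W=60) before fusing.
step_maps = torch.stack([layer[0, :, -1, :] for layer in outputs.cross_attentions[step]])
step_maps = step_maps.reshape(4, 16, 80, 60)
```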