Add bounding box coordinates to predictions
It could be useful to get bounding box coordinates from Document Information Extraction task predictions.
In a conventional OCR-based pipeline, each extracted field already comes with its box coordinates. On Donut, it could be something like:
{
    'predictions': [{
        'menu': [
            {
                'cnt': '2',
                'nm': 'ICE BLAOKCOFFE',
                'price': '82,000',
                'bbox': [xmin, ymin, xmax, ymax]
            },
            {
                'cnt': '1',
                'nm': 'AVOCADO COFFEE',
                'price': '61,000',
                'bbox': [xmin, ymin, xmax, ymax]
            }
        ],
        'total': {
            'cashprice': '200,000',
            'changeprice': '25,400',
            'total_price': '174,600',
            'bbox': [xmin, ymin, xmax, ymax]
        }
    }]
}
A possible solution (which I could not get working): https://github.com/clovaai/donut/issues/16#issuecomment-1217464215
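As background for that approach, here is a minimal, hedged sketch of how the cross-attentions it relies on can be requested from `generate()`. The checkpoint name and the `<s_cord-v2>` prompt follow the official CORD demo; the rest is standard transformers usage, but exact shapes may vary across library versions:

```python
# Sketch (not a confirmed solution): ask generate() to return the
# decoder's cross-attentions so they can be turned into heatmaps later.
import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

pixel_values = processor(image, return_tensors="pt").pixel_values  # `image`: a PIL receipt image
prompt_ids = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=prompt_ids,
    max_length=model.decoder.config.max_position_embeddings,
    return_dict_in_generate=True,
    output_attentions=True,  # exposes outputs.cross_attentions
)
# outputs.cross_attentions: one tuple per generated token, each holding one
# tensor per decoder layer, shaped (batch, num_heads, query_len, encoder_len).
```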
So, I’ve found a way to generate heatmaps from the decoder's cross-attentions. However, the attention maps correspond to individual output tokens from the decoder, not necessarily to words: the word "Restaurant" might consist of three tokens (Res + tau + rant), and the attention heatmaps are very coarse, so they might not give precise boxes, as shown in the example.
Additionally, you need the correspondence between token values and token indices, which means digging into the transformers library's Bart batch-decode implementation.
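A hedged sketch of one way to get that token-index-to-token-value mapping without modifying the library, assuming `outputs` and `processor` from a `generate()` call like the one above (the alignment between `sequences` and `cross_attentions` can be offset by the prompt length, so verify it for your transformers version):

```python
# Map generated token ids back to their string pieces so sub-word tokens
# (Res + tau + rant) can later be grouped into whole fields.
token_ids = outputs.sequences[0].tolist()
tokens = processor.tokenizer.convert_ids_to_tokens(token_ids)

for step, (tid, tok) in enumerate(zip(token_ids, tokens)):
    # each decoding step should line up with outputs.cross_attentions[step],
    # possibly offset by the number of prompt tokens -- check this.
    print(step, tid, repr(tok))
```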
In the example above, I fused the attention heads, the decoder layers, and the per-token heatmaps with max fusion, then thresholded the attention areas, found their contours, and kept the bounding box with the largest area. Maybe someone can find a way to generate better heatmaps.
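In case it helps others, here is a hedged sketch of that fusion-and-contour step. The helper names are made up, and the 40x30 grid is an assumption (the CORD checkpoint's 1280x960 input divided by Swin's 32x reduction); adjust it to your checkpoint:

```python
# Sketch of the described pipeline: max-fuse layers/heads/tokens,
# threshold the fused map, contour it, and keep the largest bounding box.
import cv2
import numpy as np
import torch

def step_heatmap(cross_attentions, step, grid_hw=(40, 30)):
    """Max-fuse decoder layers and heads for one generation step.
    grid_hw is an assumption: 1280x960 input / 32x Swin reduction = 40x30."""
    per_layer = torch.stack([layer[0, :, -1, :] for layer in cross_attentions[step]])
    fused = per_layer.max(dim=0).values.max(dim=0).values  # over layers, then heads
    return fused.reshape(grid_hw).float().cpu().numpy()

def largest_bbox(heatmaps, image_wh, thresh=0.6):
    """Fuse a field's token heatmaps, threshold, keep the biggest contour."""
    fused = np.max(np.stack(heatmaps), axis=0).astype(np.float32)
    fused = (fused - fused.min()) / (np.ptp(fused) + 1e-8)
    mask = (cv2.resize(fused, image_wh) > thresh).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return [x, y, x + w, y + h]  # [xmin, ymin, xmax, ymax]

# e.g. for a field spanning decoding steps 5..7 on a 960x1280 image:
# bbox = largest_bbox([step_heatmap(outputs.cross_attentions, s) for s in (5, 6, 7)],
#                     image_wh=(960, 1280))
```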
I’ll attach the link to the notebook I used to generate the maps. If people are interested in the code that maps token indices to token values, I can attach a modified donut/model.py as well.
https://colab.research.google.com/drive/1OzRapy23W8Ksf0AtqlkLFaVAAjJRUqbk?usp=sharing
Refer to the Document VQA Example section of that notebook. You have to use a resized shape of [4, 16, 80, 60] for the DocVQA task, since the final cross-attention feature-map sizes differ from the document-extraction task.
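For reference, a hedged sketch of what that reshape looks like for one decoding step (4 decoder layers and 16 heads per the Donut-base decoder; 80x60 is the DocVQA feature map quoted above):

```python
# For the DocVQA checkpoint: stack one step's per-layer cross-attentions
# and reshape to (layers=4, heads=16, H=80, W=60) before fusing.
step_maps = torch.stack([layer[0, :, -1, :] for layer in outputs.cross_attentions[step]])
step_maps = step_maps.reshape(4, 16, 80, 60)
```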