document-question-answering pipeline does not work with some models

System Info

Colab, latest release

Who can help?

@NielsRogge

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, …)
My own task or dataset (give details below)

Reproduction

!apt install tesseract-ocr
!apt install libtesseract-dev
!pip install Pillow
!pip install pytesseract

# You can use a http link, a local path or a PIL.Image object
img_path = "https://huggingface.co/spaces/impira/docquery/resolve/main/invoice.png"

from transformers import pipeline
# This works
pipe = pipeline("document-question-answering", model="impira/layoutlm-document-qa")

# This fails with an unexpected error
pipe = pipeline("document-question-answering", model="impira/layoutlm-invoices")
# Error: KeyError: 'layoutlm-tc'

Expected behavior

The pipeline should load and run with both models.

Issue Analytics

Created a year ago
Comments: 8 (7 by maintainers)

Top GitHub Comments

1 reaction

ankrgyl commented, Sep 21, 2022

I had a bit of discussion with @NielsRogge about this. The model type here is different because this model actually has a slightly different architecture than standard LayoutLM (it has an additional token classifier head). @NielsRogge was kind enough to submit a PR (https://huggingface.co/impira/layoutlm-invoices/discussions/1) which changes it to layoutlm.
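
To make the failure mode concrete, here is a minimal sketch of why an unregistered model type can surface as a bare KeyError. It assumes the library resolves the model_type string from the checkpoint's config through a plain dictionary lookup. The KNOWN_MODEL_TYPES table and the resolve_config_class helper are hypothetical stand-ins for illustration, not the actual transformers code, and the two config strings only mirror the type named in the error and the type the PR switches to.

```python
import json

# Hypothetical registry: a tiny stand-in for the table a library might use
# to map a checkpoint's "model_type" string to a config class name.
KNOWN_MODEL_TYPES = {
    "bert": "BertConfig",
    "layoutlm": "LayoutLMConfig",
}


def resolve_config_class(config_json: str) -> str:
    """Return the config class name registered for the checkpoint's model_type."""
    config = json.loads(config_json)
    # Plain dict lookup: an unregistered type raises KeyError carrying the type name.
    return KNOWN_MODEL_TYPES[config["model_type"]]


# Assumed config contents: the custom type from the error, and the type after the PR.
before_fix = json.dumps({"model_type": "layoutlm-tc"})
after_fix = json.dumps({"model_type": "layoutlm"})

try:
    resolve_config_class(before_fix)
except KeyError as err:
    print(f"KeyError: {err}")

print(resolve_config_class(after_fix))
```

In this sketch the custom type fails in the same way as the report, while the layoutlm type resolves. It says nothing about answer quality, which depends on whether the token classifier head is actually used.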

With this change (now merged), your code above should run just fine. However, you will likely get suboptimal results, because the model has learned to depend on the token classifier to produce accurate results. I’d recommend running it through DocQuery (https://github.com/impira/docquery) which has a patched version of the model (here) that makes use of it.

You can do that via something like:

!apt install tesseract-ocr
!apt install libtesseract-dev
!pip install Pillow
!pip install pytesseract
!pip install docquery

# You can use a http link, a local path or a PIL.Image object
img_path = "https://huggingface.co/spaces/impira/docquery/resolve/main/invoice.png"

# This is a patched version of the pipeline that knows how to use the token classifier
from docquery import pipeline

# This works
pipe = pipeline("document-question-answering", model="impira/layoutlm-document-qa")

# This should work
pipe = pipeline("document-question-answering", model="impira/layoutlm-invoices")

In the meantime, I’ll explore a few alternatives, e.g. packaging up the model directly in the repo or patching it a different way, so that it uses the token classifier.

0 reactions

osanseviero commented, Oct 24, 2022

Sounds good! Thanks a lot for this!