
Padding of bbox input in LayoutLM

See original GitHub issue

I’ve been working with LayoutLM and ran into issues with batches containing samples of different lengths.

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

It turns out that transformers.tokenization_utils_base.PreTrainedTokenizerBase._pad does not pad the bbox entries to the maximum length in the batch, so the attempt to join differently sized lists into a tensor eventually crashes.
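
A minimal sketch of how this surfaces (invented data; the boxes are attached at the word level just to keep it short, since the point is only their ragged lengths):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

words = [["hello", "world"], ["a", "longer", "example", "sentence"]]
boxes = [
    [[10, 10, 50, 20], [60, 10, 120, 20]],
    [[10, 30, 20, 40], [25, 30, 80, 40], [85, 30, 150, 40], [155, 30, 220, 40]],
]

# padding=True pads input_ids and attention_mask to the longest sample...
encoding = tokenizer(words, is_split_into_words=True, padding=True)

# ...but a bbox key attached to the encoding stays ragged (2 vs. 4 boxes),
# because _pad knows nothing about it:
encoding["bbox"] = boxes
encoding.convert_to_tensors("pt")  # ValueError: Unable to create tensor ...
```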

One way to solve this is to pad all required items when generating samples, as the official implementation does for the FUNSD dataset. I also implemented it this way for my use case and it seems to work well.
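
A minimal sketch of that approach (the function name is mine; the FUNSD script pads every field of a sample, including the boxes, to a fixed max_seq_length and uses [0, 0, 0, 0] as the filler box):

```python
def pad_sample(input_ids, bbox, attention_mask, pad_token_id, max_seq_length=512):
    """Pad one sample to max_seq_length, keeping bbox aligned with input_ids."""
    pad_len = max_seq_length - len(input_ids)
    input_ids = input_ids + [pad_token_id] * pad_len
    attention_mask = attention_mask + [0] * pad_len
    bbox = bbox + [[0, 0, 0, 0]] * pad_len  # one pad box per pad token
    return input_ids, bbox, attention_mask
```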

But this essentially duplicates the padding functionality, and I was wondering whether the _pad method should allow for additional required inputs, as the bboxes are for LayoutLM. I’m happy to work on a PR for that but also wanted to check if there’s anything more to consider.
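
For illustration, a rough sketch of one possible shape for this (a hypothetical subclass, not an existing API; it assumes right-padding): let the base class pad the fields it knows about, then bring bbox up to the same length:

```python
from transformers import BertTokenizer

class LayoutLMTokenizerWithBoxes(BertTokenizer):  # hypothetical name
    pad_token_box = [0, 0, 0, 0]  # filler box, as in the FUNSD script

    def _pad(self, encoded_inputs, max_length=None, **kwargs):
        # Let the base implementation pad input_ids, attention_mask, etc.
        encoded_inputs = super()._pad(encoded_inputs, max_length=max_length, **kwargs)
        # Then pad bbox to the length of the (now padded) input_ids.
        if "bbox" in encoded_inputs:
            pad_len = len(encoded_inputs["input_ids"]) - len(encoded_inputs["bbox"])
            if pad_len > 0:  # assumes right-padding
                encoded_inputs["bbox"] = encoded_inputs["bbox"] + [self.pad_token_box] * pad_len
        return encoded_inputs
```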

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

1 reaction
NielsRogge commented, Feb 25, 2021

LayoutLM would really benefit from its own tokenizer indeed. Currently you have to use BertTokenizer, but this only lets you tokenize text, not really prepare data for the model.

A nice API (in my opinion) would look something like:

LayoutLMTokenizer(image: PIL.Image, words: List[str], bounding_boxes: List[List[int]], labels: List[str])

The tokenizer would then automatically take care of normalizing the bounding boxes (users can still choose which OCR engine to use to get words and bounding boxes), transforming the words and labels into token-level input_ids and bbox, padding (as you mention), and so on.
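
The normalization step itself is just a rescale of pixel coordinates to the 0-1000 range LayoutLM expects, relative to the page size; a minimal sketch:

```python
def normalize_box(box, width, height):
    # LayoutLM expects coordinates on a 0-1000 scale relative to page size.
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]
```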

The functionality implemented in the function you refer to (convert_examples_to_features) could be added by overriding the prepare_for_model method, and the padding functionality by overriding _pad.
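
A sketch of that convert_examples_to_features-style step, under the usual conventions (assumed here, as in the FUNSD script): each word's box is repeated across its sub-tokens, and labels, taken to be integer ids already, are kept only on the first sub-token, with -100 masking the rest:

```python
def align_to_tokens(words, boxes, labels, tokenizer):
    input_ids, token_boxes, token_labels = [], [], []
    for word, box, label in zip(words, boxes, labels):
        word_ids = tokenizer.encode(word, add_special_tokens=False)
        input_ids.extend(word_ids)
        token_boxes.extend([box] * len(word_ids))  # repeat the word box per sub-token
        token_labels.extend([label] + [-100] * (len(word_ids) - 1))
    return input_ids, token_boxes, token_labels
```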

0 reactions
github-actions[bot] commented, May 4, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


Top Results From Across the Web

LayoutLM - Hugging Face
Note that one first needs to normalize the bounding boxes to be on a 0-1000 scale. ... Build model inputs from a sequence...

Splitting the document (>512 tokens) into multiples · Issue #41 ...
For the processor input, if I manually split words, boxes, and word_labels into small slices and use the exact same entire image as...

How do I get rid of extra padding at the top of my Kivy boxlayout?
I created a box layout in kivy with a few buttons and text input boxes. The boxes that I sized all get pushed...

Extract Key Information from Documents using LayoutLM
Video explains the architecture of LayoutLM and fine-tuning of the LayoutLM model to extract information from documents like invoices, receipts, ...

Document AI: Fine-tuning LayoutLM for document ... - philschmid
The dataset is available on Hugging Face at nielsr/funsd. Note: The LayoutLM model doesn’t have an AutoProcessor to nicely create our input...
