Padding of bbox input in LayoutLM
I’ve been working with LayoutLM and had some issues with samples of different lengths in a batch:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
It turns out that transformers.tokenization_utils_base.PreTrainedTokenizerBase._pad does not pad the bbox items to the maximum length in the batch, so trying to join the differently sized lists into a tensor eventually crashes.
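A minimal sketch of how this can surface (the feature values below are made up): the tokenizer's pad method brings input_ids up to the longest sample in the batch, but an extra bbox key is passed through unpadded, so the ragged lists cannot be stacked into a tensor.

```python
# Illustrative reproduction only; token ids and boxes below are arbitrary.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

features = [
    {"input_ids": [101, 7592, 102],
     "bbox": [[0, 0, 0, 0], [57, 12, 104, 30], [1000, 1000, 1000, 1000]]},
    {"input_ids": [101, 7592, 2088, 999, 102],
     "bbox": [[0, 0, 0, 0], [57, 12, 104, 30], [110, 12, 180, 30],
              [185, 12, 199, 30], [1000, 1000, 1000, 1000]]},
]

# input_ids gets padded to length 5, "bbox" does not -> ValueError when building tensors
batch = tokenizer.pad(features, padding=True, return_tensors="pt")
```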
One way to solve this is to pad all required items when generating the samples, as the official implementation does for the FUNSD data set, for example. I also implemented it this way for my use case and it seems to work well.
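A minimal sketch of that workaround, assuming the usual LayoutLM feature names; the helper name pad_example and the pad_token_box value are illustrative, not taken from the official script:

```python
# Pad every feature, including bbox, to a fixed max_seq_length while building samples.
import torch

def pad_example(input_ids, bbox, attention_mask, max_seq_length,
                pad_token_id=0, pad_token_box=(0, 0, 0, 0)):
    pad_len = max_seq_length - len(input_ids)
    return {
        "input_ids": torch.tensor(input_ids + [pad_token_id] * pad_len),
        "attention_mask": torch.tensor(attention_mask + [0] * pad_len),
        "bbox": torch.tensor(bbox + [list(pad_token_box)] * pad_len),
    }
```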
But this basically duplicates the padding functionality, so I was wondering whether the _pad method should allow for additional required inputs, as the bboxes are for LayoutLM. I’m happy to work on a PR for that, but I also wanted to check if there’s anything more to consider.
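A hedged sketch of what such an extension could look like: a tokenizer subclass whose _pad also pads a bbox key. The exact _pad signature differs between transformers versions, and the class name and pad_token_box attribute below are illustrative only.

```python
from transformers import BertTokenizer
from transformers.tokenization_utils_base import PaddingStrategy


class LayoutLMBoxPaddingTokenizer(BertTokenizer):  # hypothetical subclass
    pad_token_box = [0, 0, 0, 0]  # box assigned to padding tokens (assumption)

    def _pad(self, encoded_inputs, max_length=None,
             padding_strategy=PaddingStrategy.DO_NOT_PAD,
             pad_to_multiple_of=None, return_attention_mask=None, **kwargs):
        # Let the base class pad input_ids, attention_mask, token_type_ids, ...
        # (extra keyword arguments are forwarded because the signature varies by version)
        encoded_inputs = super()._pad(
            encoded_inputs, max_length=max_length, padding_strategy=padding_strategy,
            pad_to_multiple_of=pad_to_multiple_of,
            return_attention_mask=return_attention_mask, **kwargs,
        )
        # ... then bring "bbox" up to the same (padded) length.
        if "bbox" in encoded_inputs:
            difference = len(encoded_inputs["input_ids"]) - len(encoded_inputs["bbox"])
            if difference > 0:
                if self.padding_side == "right":
                    encoded_inputs["bbox"] = encoded_inputs["bbox"] + [self.pad_token_box] * difference
                else:
                    encoded_inputs["bbox"] = [self.pad_token_box] * difference + encoded_inputs["bbox"]
        return encoded_inputs
```

With something along these lines, tokenizer.pad(features, padding=True, return_tensors="pt") could pad bbox together with the other model inputs.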
Top GitHub Comments
LayoutLM would really benefit from its own tokenizer indeed. Currently you have to use BertTokenizer, but this only lets you tokenize text, not really prepare data for the model.
A nice API (in my opinion) would look something like:
LayoutLMTokenizer(image: PIL.Image, words: List[str], bounding_boxes: List[List[int]], labels: List[str])
The tokenizer then automatically takes care of normalizing the bounding boxes (users can still choose which OCR engine to use to get words and bounding boxes), transforming the words and labels into token-level input_ids and bbox, padding (as you mention), etc. The functionality implemented in the function you refer to (convert_examples_to_features) could be added by overriding the prepare_for_model method, and the padding functionality by overriding _pad.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.