Padding of bbox input in LayoutLM
I’ve been working with LayoutLM and had some issues with samples of different lengths in a batch:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
It turns out that transformers.tokenization_utils_base.PreTrainedTokenizerBase._pad does not pad the bbox items to the maximum length in the batch, so trying to join the differently sized lists into a tensor eventually crashes.
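A minimal sketch of how this can surface (the feature values below are made up): the tokenizer's pad method brings input_ids up to the longest sample in the batch, but an extra bbox key is passed through unpadded, so the ragged lists cannot be stacked into a tensor.

```python
# Illustrative reproduction only; token ids and boxes below are arbitrary.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

features = [
    {"input_ids": [101, 7592, 102],
     "bbox": [[0, 0, 0, 0], [57, 12, 104, 30], [1000, 1000, 1000, 1000]]},
    {"input_ids": [101, 7592, 2088, 999, 102],
     "bbox": [[0, 0, 0, 0], [57, 12, 104, 30], [110, 12, 180, 30],
              [185, 12, 199, 30], [1000, 1000, 1000, 1000]]},
]

# input_ids gets padded to length 5, "bbox" does not -> ValueError when building tensors
batch = tokenizer.pad(features, padding=True, return_tensors="pt")
```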
One way to solve this is to pad all required items when generating the samples, as the official implementation does for the FUNSD data set, for example. I also implemented it this way for my use case and it seems to work well.
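A minimal sketch of that workaround, assuming the usual LayoutLM feature names; the helper name pad_example and the pad_token_box value are illustrative, not taken from the official script:

```python
# Pad every feature, including bbox, to a fixed max_seq_length while building samples.
import torch

def pad_example(input_ids, bbox, attention_mask, max_seq_length,
                pad_token_id=0, pad_token_box=(0, 0, 0, 0)):
    pad_len = max_seq_length - len(input_ids)
    return {
        "input_ids": torch.tensor(input_ids + [pad_token_id] * pad_len),
        "attention_mask": torch.tensor(attention_mask + [0] * pad_len),
        "bbox": torch.tensor(bbox + [list(pad_token_box)] * pad_len),
    }
```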
But this basically duplicates the padding functionality, so I was wondering whether the _pad method should allow for additional required inputs, as the bboxes are for LayoutLM. I’m happy to work on a PR for that, but I also wanted to check if there’s anything more to consider.
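A hedged sketch of what such an extension could look like: a tokenizer subclass whose _pad also pads a bbox key. The exact _pad signature differs between transformers versions, and the class name and pad_token_box attribute below are illustrative only.

```python
from transformers import BertTokenizer
from transformers.tokenization_utils_base import PaddingStrategy


class LayoutLMBoxPaddingTokenizer(BertTokenizer):  # hypothetical subclass
    pad_token_box = [0, 0, 0, 0]  # box assigned to padding tokens (assumption)

    def _pad(self, encoded_inputs, max_length=None,
             padding_strategy=PaddingStrategy.DO_NOT_PAD,
             pad_to_multiple_of=None, return_attention_mask=None, **kwargs):
        # Let the base class pad input_ids, attention_mask, token_type_ids, ...
        # (extra keyword arguments are forwarded because the signature varies by version)
        encoded_inputs = super()._pad(
            encoded_inputs, max_length=max_length, padding_strategy=padding_strategy,
            pad_to_multiple_of=pad_to_multiple_of,
            return_attention_mask=return_attention_mask, **kwargs,
        )
        # ... then bring "bbox" up to the same (padded) length.
        if "bbox" in encoded_inputs:
            difference = len(encoded_inputs["input_ids"]) - len(encoded_inputs["bbox"])
            if difference > 0:
                if self.padding_side == "right":
                    encoded_inputs["bbox"] = encoded_inputs["bbox"] + [self.pad_token_box] * difference
                else:
                    encoded_inputs["bbox"] = [self.pad_token_box] * difference + encoded_inputs["bbox"]
        return encoded_inputs
```

With something along these lines, tokenizer.pad(features, padding=True, return_tensors="pt") could pad bbox together with the other model inputs.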
Top GitHub Comments
LayoutLM would really benefit from its own tokenizer indeed. Currently you have to use BertTokenizer, but this only lets you tokenize text, not really prepare data for the model.
A nice API (in my opinion) would look something like:
LayoutLMTokenizer(image: PIL.Image, words: List[str], bounding_boxes: List[List[int]], labels: List[str])
The tokenizer then automatically takes care of normalizing the bounding boxes (users can still choose which OCR engine to use to get words and bounding boxes), transforming the words and labels into token-level input_ids and bbox, padding (as you mention), etc. The functionality implemented in the function you refer to (convert_examples_to_features) could be added by overriding the prepare_for_model method, and the padding functionality by overriding _pad.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.