LayoutLMv2 processing doesn't handle tokenizer overflow
Environment info
- transformers version: 4.10.2
- Platform: Mac
- Python version: 3.8.9
- PyTorch version (GPU?): Not important
- Tensorflow version (GPU?):
- Using GPU in script?: No
- Using distributed or parallel set-up in script?:
Who can help
Information
We are porting our layoutlmv2 project to use Transformers instead of the UniLMFt package.
The additional functionality of the tokenizer has helped us to eliminate a good deal of alignment code!!
While evaluating processing_layoutlmv2.py
I noticed that overflow wasn’t being handled properly.
In the above block the input is tokenized, potentially allowing overflow when return_overflowing_tokens==True. This will cause encoded_inputs to contain more sequences than the original input: e.g. if a page has 1k words and boxes, it will be returned as two sequences, and an overflow_to_sample_mapping will be attached to encoded_inputs.
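A minimal sketch of that behaviour at the tokenizer level; the checkpoint name and the dummy words/boxes are illustrative assumptions, not taken from our project:

```python
from transformers import LayoutLMv2TokenizerFast

tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")

# One long "page": ~1k words, each with a normalized bounding box.
words = ["word"] * 1000
boxes = [[0, 0, 10, 10]] * 1000

encoded_inputs = tokenizer(
    words,
    boxes=boxes,
    truncation=True,
    max_length=512,
    return_overflowing_tokens=True,
)

# The single page comes back as several 512-token sequences, and
# overflow_to_sample_mapping points each sequence back to sample 0.
print(len(encoded_inputs["input_ids"]))              # > 1
print(encoded_inputs["overflow_to_sample_mapping"])  # [0, 0, ...]
```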
When adding the image
encoded_inputs["image"] = features.pop("pixel_values")
The length of encoded_inputs["image"] will be less than the length of the other encoded inputs whenever there is overflow. This causes: 1) a mismatch between page images and examples, and 2) examples at the end of the batch being left without image embeddings.
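For reference, the mismatch can be reproduced end to end through LayoutLMv2Processor; the checkpoint name and the dummy image/words/boxes below are illustrative assumptions:

```python
from PIL import Image
from transformers import (
    LayoutLMv2FeatureExtractor,
    LayoutLMv2Processor,
    LayoutLMv2TokenizerFast,
)

# apply_ocr=False so we can pass our own words and boxes.
feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

image = Image.new("RGB", (1000, 1000), "white")  # stand-in page image
words = ["word"] * 1000
boxes = [[0, 0, 10, 10]] * 1000

encoded_inputs = processor(
    image,
    words,
    boxes=boxes,
    truncation=True,
    max_length=512,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

# One page image, but several overflowed sequences:
print(len(encoded_inputs["image"]))      # 1 on transformers 4.10.2
print(len(encoded_inputs["input_ids"]))  # > 1 -> mismatch
```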
Expected behavior
We handle this by using the overflow_to_sample_mapping
to find which image to pair with each sequence in the batch:
images = []
for batch_index in range(len(tokenized_inputs["input_ids"])):
    # Map the (possibly overflowed) sequence back to the example it came from,
    # then reuse that example's page image.
    org_batch_index = tokenized_inputs["overflow_to_sample_mapping"][batch_index]
    image = examples["image"][org_batch_index]
    images.append(image)
tokenized_inputs["image"] = images
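If the page images are a single tensor rather than a Python list (e.g. when the feature extractor is called with return_tensors="pt"), the same remapping can be written as an index selection. The shapes and values below are dummies, purely for illustration:

```python
import torch

pixel_values = torch.zeros(1, 3, 224, 224)         # one page image
overflow_to_sample_mapping = torch.tensor([0, 0])  # two sequences produced from that page
image = pixel_values[overflow_to_sample_mapping]   # shape: (2, 3, 224, 224)
```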
return_offsets_mapping=True is required for this to work, but you could consider raising an error when return_overflowing_tokens is True and return_offsets_mapping is False, to maintain the ability to pair images with the correct sequences.
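A sketch of that guard as a standalone helper; the function name is hypothetical, and in practice the check would live inside the processor's __call__:

```python
def check_overflow_kwargs(return_overflowing_tokens: bool, return_offsets_mapping: bool) -> None:
    # Refuse overflowing tokens without offsets, so every overflowed
    # sequence can still be paired with the correct page image.
    if return_overflowing_tokens and not return_offsets_mapping:
        raise ValueError(
            "return_offsets_mapping must be True when return_overflowing_tokens is True."
        )
```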
Comments
Oh yes, thanks for spotting this. I added the overflow logic after implementing the processor. Will open a PR to fix this.
Opened a PR @NielsRogge (#17092). Just need your input on the return type of encoded_input["image"], but otherwise this PR should fix the issue.