
LayoutLMv2 processing doesn't handle tokenizer overflow

Environment info

  • transformers version: 4.10.2
  • Platform: macOS
  • Python version: 3.8.9
  • PyTorch version (GPU?): Not important
  • Tensorflow version (GPU?):
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?:

Who can help

@NielsRogge @LysandreJik

Information

We are porting our LayoutLMv2 project to use Transformers instead of the UniLMFt package.

The additional functionality of the tokenizer has helped us eliminate a good deal of alignment code!

While evaluating processing_layoutlmv2.py, I noticed that overflow wasn't being handled properly.

https://github.com/huggingface/transformers/blob/3ab0185b061baae207efed02799dd424ee8377f1/src/transformers/models/layoutlmv2/processing_layoutlmv2.py#L182-L205

In the block above, the input is tokenized, potentially with overflow allowed. When return_overflowing_tokens==True, this can make encoded_inputs longer than the input batch. E.g., if a page has 1k words and boxes, it will be returned as two sequences, and an overflow_to_sample_mapping will be attached to encoded_inputs.
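
For illustration, here is a minimal sketch of that overflow behaviour using a generic fast tokenizer (bert-base-uncased is just a stand-in; the LayoutLMv2 tokenizer additionally takes boxes, but overflows the same way):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # One sample of ~1k tokens, truncated to 512, with overflow returned
    encoded_inputs = tokenizer(
        ["word " * 1000],
        truncation=True,
        max_length=512,
        return_overflowing_tokens=True,
    )

    print(len(encoded_inputs["input_ids"]))              # 2 sequences for 1 sample
    print(encoded_inputs["overflow_to_sample_mapping"])  # [0, 0]: both map back to sample 0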

When the image is then added:

        encoded_inputs["image"] = features.pop("pixel_values")

the length of encoded_inputs["image"] will be shorter than the rest of the encoded inputs whenever there is overflow. This will cause: 1) a mismatch between page images and examples, and 2) missing image embeddings for the examples at the end of the batch.
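
For concreteness, a minimal sketch of the failure mode (the image path and the dummy words/boxes are placeholders; the checkpoint is the standard microsoft/layoutlmv2-base-uncased):

    from PIL import Image
    from transformers import (
        LayoutLMv2FeatureExtractor,
        LayoutLMv2Processor,
        LayoutLMv2TokenizerFast,
    )

    # apply_ocr=False because we supply our own words and boxes
    feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
    tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
    processor = LayoutLMv2Processor(feature_extractor, tokenizer)

    image = Image.open("page.png").convert("RGB")  # placeholder page image
    words = ["word"] * 1000                        # dummy 1k-word page
    boxes = [[0, 0, 10, 10]] * 1000                # one normalized box per word

    encoded_inputs = processor(
        image,
        words,
        boxes=boxes,
        truncation=True,
        max_length=512,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
    )

    print(len(encoded_inputs["input_ids"]))  # 2: the page overflows into two sequences
    print(len(encoded_inputs["image"]))      # 1: a single image, hence the mismatch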

Expected behavior

We handle this by using the overflow_to_sample_mapping to find which image to pair with each sequence in the batch:

    # Use overflow_to_sample_mapping to look up the source example for each
    # (possibly overflowed) sequence, and repeat its image accordingly.
    images = []
    for batch_index in range(len(tokenized_inputs["input_ids"])):
        org_batch_index = tokenized_inputs["overflow_to_sample_mapping"][batch_index]
        image = examples["image"][org_batch_index]
        images.append(image)
    tokenized_inputs["image"] = images
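
(Here tokenized_inputs is the tokenizer output for a batch and examples is the corresponding raw batch, e.g. inside a datasets map(batched=True) preprocessing function; the variable names are from our own pipeline.)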

return_offsets_mapping=True is required for this to work, but you could consider raising an error when return_overflowing_tokens is True and return_offsets_mapping is False, to preserve the ability to pair images with the correct sequences.
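
A guard along those lines might look like this (a sketch of the suggested check, not the actual processor code):

    # Inside the processor's __call__, before tokenizing
    if return_overflowing_tokens and not return_offsets_mapping:
        raise ValueError(
            "return_overflowing_tokens=True requires return_offsets_mapping=True, "
            "otherwise overflowed sequences cannot be paired with their images."
        )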

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
NielsRogge commented, Sep 14, 2021

Oh yes, thanks for spotting this. I added the overflow logic after implementing the processor. Will open a PR to fix this.

0 reactions
garyhlai commented, May 5, 2022

Opened a PR, @NielsRogge: #17092. I just need your input on the return type of encoded_inputs["image"], but otherwise this PR should fix the issue.
