
LayoutLMv2 processing doesn't handle tokenizer overflow

Environment info

  • transformers version: 4.10.2
  • Platform: macOS
  • Python version: 3.8.9
  • PyTorch version (GPU?): Not important
  • Tensorflow version (GPU?):
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?:

Who can help

@NielsRogge @LysandreJik

Information

We are porting our LayoutLMv2 project to use Transformers instead of the UniLMFt package.

The additional functionality of the tokenizer has helped us eliminate a good deal of alignment code!

While evaluating processing_layoutlmv2.py, I noticed that overflow wasn't being handled properly.

https://github.com/huggingface/transformers/blob/3ab0185b061baae207efed02799dd424ee8377f1/src/transformers/models/layoutlmv2/processing_layoutlmv2.py#L182-L205

In the block above, the input is tokenized, potentially with overflow allowed. When return_overflowing_tokens==True, this can make encoded_inputs longer than the input batch. E.g., if a page has 1k words and boxes, it will be returned as two sequences, and an overflow_to_sample_mapping will be attached to encoded_inputs.
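
For illustration, here is a minimal sketch of that overflow behaviour using a generic fast tokenizer (bert-base-uncased is just a stand-in; the LayoutLMv2 tokenizer additionally takes boxes, but overflows the same way):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # One sample of ~1k tokens, truncated to 512, with overflow returned
    encoded_inputs = tokenizer(
        ["word " * 1000],
        truncation=True,
        max_length=512,
        return_overflowing_tokens=True,
    )

    print(len(encoded_inputs["input_ids"]))              # 2 sequences for 1 sample
    print(encoded_inputs["overflow_to_sample_mapping"])  # [0, 0]: both map back to sample 0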

When the image is then added:

        encoded_inputs["image"] = features.pop("pixel_values")

the length of encoded_inputs["image"] will be shorter than the rest of the encoded inputs whenever there is overflow. This will cause: 1) a mismatch between page images and examples, and 2) missing image embeddings for the examples at the end of the batch.
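
For concreteness, a minimal sketch of the failure mode (the image path and the dummy words/boxes are placeholders; the checkpoint is the standard microsoft/layoutlmv2-base-uncased):

    from PIL import Image
    from transformers import (
        LayoutLMv2FeatureExtractor,
        LayoutLMv2Processor,
        LayoutLMv2TokenizerFast,
    )

    # apply_ocr=False because we supply our own words and boxes
    feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
    tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
    processor = LayoutLMv2Processor(feature_extractor, tokenizer)

    image = Image.open("page.png").convert("RGB")  # placeholder page image
    words = ["word"] * 1000                        # dummy 1k-word page
    boxes = [[0, 0, 10, 10]] * 1000                # one normalized box per word

    encoded_inputs = processor(
        image,
        words,
        boxes=boxes,
        truncation=True,
        max_length=512,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
    )

    print(len(encoded_inputs["input_ids"]))  # 2: the page overflows into two sequences
    print(len(encoded_inputs["image"]))      # 1: a single image, hence the mismatch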

Expected behavior

We handle this by using the overflow_to_sample_mapping to find which image to pair with each sequence in the batch:

    # Use overflow_to_sample_mapping to look up the source example for each
    # (possibly overflowed) sequence, and repeat its image accordingly.
    images = []
    for batch_index in range(len(tokenized_inputs["input_ids"])):
        org_batch_index = tokenized_inputs["overflow_to_sample_mapping"][batch_index]
        image = examples["image"][org_batch_index]
        images.append(image)
    tokenized_inputs["image"] = images
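
(Here tokenized_inputs is the tokenizer output for a batch and examples is the corresponding raw batch, e.g. inside a datasets map(batched=True) preprocessing function; the variable names are from our own pipeline.)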

return_offsets_mapping=True is required for this to work, but you could consider raising an error when return_overflowing_tokens is True and return_offsets_mapping is False, to preserve the ability to pair images with the correct sequences.
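
A guard along those lines might look like this (a sketch of the suggested check, not the actual processor code):

    # Inside the processor's __call__, before tokenizing
    if return_overflowing_tokens and not return_offsets_mapping:
        raise ValueError(
            "return_overflowing_tokens=True requires return_offsets_mapping=True, "
            "otherwise overflowed sequences cannot be paired with their images."
        )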

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
NielsRogge commented, Sep 14, 2021

Oh yes, thanks for spotting this. I added the overflow logic after implementing the processor. Will open a PR to fix this.

0 reactions
garyhlai commented, May 5, 2022

Opened a PR, @NielsRogge: #17092. I just need your input on the return type of encoded_inputs["image"], but otherwise this PR should fix the issue.
