How to build a custom dataset for LayoutLMv2ForSequenceClassification?

Environment info

transformers version: 4.10.0
Platform: Linux
Python version: 3.8.8
PyTorch version (GPU?): 1.8.0+cu101 (True)
Tensorflow version (GPU?): 2.2.0 (False)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: no
Using distributed or parallel set-up in script?: no

Who can help

Documentation: @sgugger

Information

Model I am using (Bert, XLNet …): LayoutLMv2ForSequenceClassification

The problem arises when using:

the official example scripts: (give details below)
my own modified scripts: (give details below)

The tasks I am working on is:

an official GLUE/SQUaD task: (give the name)
my own task or dataset: (give details below)

To reproduce

I am trying to build a custom dataset to fine tune LayoutLMv2ForSequenceClassification.

For that I am building a torch.utils.data.Dataset with the following __getitem__ method:

def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        img = Image.open(self.files[idx]).convert('RGB')
        label = self.labels[idx]

        if self.transforms is not None:
            img = self.transforms(img)

        encoding = self.processor(img, return_tensors="pt")
        encoding['input_ids'] = encoding['input_ids'][:,:512]
        encoding['token_type_ids'] = encoding['token_type_ids'][:,:512]
        encoding['attention_mask'] = encoding['attention_mask'][:,:512]
        encoding['bbox'] = encoding['bbox'][:,:512,:4]

        return {
            **encoding,
            "label": label
        }

Here is how I defined the processor:

feature_extractor = LayoutLMv2FeatureExtractor() 
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

The training starts but when it starts loading the data batches it fails.

Output:

Traceback (most recent call last):
  File "main.py", line 82, in <module>
    trainer.train()
  File "env_3.8/lib/python3.8/site-packages/transformers/trainer.py", line 1258, in train
    for step, inputs in enumerate(epoch_iterator):
  File "env_3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "env_3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 557, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "env_3.8/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "env_3.8/lib/python3.8/site-packages/transformers/data/data_collator.py", line 66, in default_data_collator
    return torch_default_data_collator(features)
  File "env_3.8/lib/python3.8/site-packages/transformers/data/data_collator.py", line 105, in torch_default_data_collator
    batch[k] = torch.stack([f[k] for f in features])
RuntimeError: stack expects each tensor to be equal size, but got [1, 20] at entry 0 and [1, 266] at entry 1

How can I solve this? Is there any documentation on how to build a simple PyTorch dataset that works with Hugging Face transformers models? It would be very nice to have clear documentation like this on how to build a dataset for PyTorch. I know there is a doc on transformers.datasets but I found it pretty confusing…

Top GitHub Comments

NielsRogge commented, Sep 2, 2021

That’s because you need to remove the batch dimension which the processor automatically adds. I have updated my code snippet above.
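As a rough illustration of the shape problem in the traceback, here is a minimal sketch using plain tensors instead of the real processor. The names pad_to_max and MAX_LEN are hypothetical, and the manual padding only stands in for what padding="max_length" would do; 512 is an assumed maximum length.

```python
import torch
import torch.nn.functional as F

# Two per-sample tensors shaped like the ones in the traceback: (1, seq_len).
a = torch.ones(1, 20, dtype=torch.long)
b = torch.ones(1, 266, dtype=torch.long)

# Stacking tensors of unequal size raises the RuntimeError from the traceback.
try:
    torch.stack([a, b])
    stack_failed = False
except RuntimeError:
    stack_failed = True

MAX_LEN = 512  # assumption: maximum sequence length


def pad_to_max(t, max_len=MAX_LEN, pad_value=0):
    """Right-pad the last dimension to max_len (hypothetical stand-in for
    the processor's padding="max_length")."""
    return F.pad(t, (0, max_len - t.shape[-1]), value=pad_value)


padded = [pad_to_max(a), pad_to_max(b)]

# Equal sizes now stack, but each sample still carries a batch dimension of 1.
with_batch_dim = torch.stack(padded)  # shape (2, 1, 512)

# Removing that dimension per sample gives the usual (batch, seq_len) layout.
without_batch_dim = torch.stack([t.squeeze(0) for t in padded])  # shape (2, 512)
```

So padding fixes the "equal size" error, and squeezing each sample avoids the extra dimension that the collator would otherwise keep in every batch.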

NielsRogge commented, Sep 2, 2021

Hi,

I see you are truncating the inputs, but the processor can take care of that for you. Just specify truncation=True.

def __getitem__(self, idx):
    image = Image.open(self.files[idx]).convert('RGB')
    label = self.labels[idx]

    # processor creates input_ids, attention_mask, token_type_ids, bbox, image
    encoding = self.processor(image, padding="max_length", truncation=True, return_tensors="pt")

    # remove batch dimension (which the processor automatically adds)
    for k, v in encoding.items():
        encoding[k] = v.squeeze()

    # add label
    encoding["labels"] = torch.tensor(label)

    return encoding

Internally, LayoutLMv2Processor first uses LayoutLMv2FeatureExtractor to apply OCR (namely, Google’s Tesseract) to the document image to get a list of words and their corresponding boxes (coordinates). The feature extractor also resizes the document image to 224x224. Next, the words and boxes are passed to LayoutLMv2TokenizerFast, which converts them to token-level input_ids, attention_mask, token_type_ids and bbox. Together with the resized image and the label, you have everything you need to train the model.
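To see how such per-sample encodings end up in a batch, here is a minimal sketch that runs without Tesseract. Note that fake_processor and ToyDocumentDataset are hypothetical stand-ins, not part of transformers: fake_processor only mimics the output shapes described above (with the leading batch dimension of 1), and the token counts are made up.

```python
import torch
from torch.utils.data import Dataset, DataLoader

MAX_LEN = 512  # assumption: maximum sequence length


def fake_processor(num_tokens):
    """Hypothetical stand-in for the real processor called with
    padding="max_length", truncation=True, return_tensors="pt".
    Only the output shapes are imitated; the values are dummies."""
    n = min(num_tokens, MAX_LEN)  # truncation
    input_ids = torch.zeros(1, MAX_LEN, dtype=torch.long)  # padding
    input_ids[0, :n] = 1
    return {
        "input_ids": input_ids,
        "attention_mask": (input_ids != 0).long(),
        "token_type_ids": torch.zeros(1, MAX_LEN, dtype=torch.long),
        "bbox": torch.zeros(1, MAX_LEN, 4, dtype=torch.long),
        "image": torch.zeros(1, 3, 224, 224),
    }


class ToyDocumentDataset(Dataset):
    """Toy dataset: each item is described only by a made-up token count."""

    def __init__(self, token_counts, labels):
        self.token_counts = token_counts
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        encoding = fake_processor(self.token_counts[idx])
        # remove the batch dimension of each tensor
        item = {k: v.squeeze(0) for k, v in encoding.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item


dataset = ToyDocumentDataset(token_counts=[20, 266, 700], labels=[0, 3, 1])
batch = next(iter(DataLoader(dataset, batch_size=3)))
shapes = {k: tuple(v.shape) for k, v in batch.items()}
```

Because every item has the same shape, the default collation can stack them, and shapes can be inspected to check each key of the batch before handing the dataset to a trainer.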
