How to build a custom dataset for LayoutLMv2ForSequenceClassification?
Environment info
- transformers version: 4.10.0
- Platform: Linux
- Python version: 3.8.8
- PyTorch version (GPU?): 1.8.0+cu101 (True)
- Tensorflow version (GPU?): 2.2.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help
Documentation: @sgugger
Information
Model I am using (Bert, XLNet …): LayoutLMv2ForSequenceClassification
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
I am trying to build a custom dataset to fine tune LayoutLMv2ForSequenceClassification.
For that I am building a torch.utils.data.Dataset, with the following __getitem__ method:
def __getitem__(self, idx):
    if torch.is_tensor(idx):
        idx = idx.tolist()
    img = Image.open(self.files[idx]).convert('RGB')
    label = self.labels[idx]
    if self.transforms is not None:
        img = self.transforms(img)
    encoding = self.processor(img, return_tensors="pt")
    encoding['input_ids'] = encoding['input_ids'][:,:512]
    encoding['token_type_ids'] = encoding['token_type_ids'][:,:512]
    encoding['attention_mask'] = encoding['attention_mask'][:,:512]
    encoding['bbox'] = encoding['bbox'][:,:512,:4]
    return {
        **encoding,
        "label": label
    }
Here is how I defined the processor:
feature_extractor = LayoutLMv2FeatureExtractor()
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)
Training starts, but it fails as soon as it begins loading data batches.
Output:
Traceback (most recent call last):
  File "main.py", line 82, in <module>
    trainer.train()
  File "env_3.8/lib/python3.8/site-packages/transformers/trainer.py", line 1258, in train
    for step, inputs in enumerate(epoch_iterator):
  File "env_3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "env_3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 557, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "env_3.8/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "env_3.8/lib/python3.8/site-packages/transformers/data/data_collator.py", line 66, in default_data_collator
    return torch_default_data_collator(features)
  File "env_3.8/lib/python3.8/site-packages/transformers/data/data_collator.py", line 105, in torch_default_data_collator
    batch[k] = torch.stack([f[k] for f in features])
RuntimeError: stack expects each tensor to be equal size, but got [1, 20] at entry 0 and [1, 266] at entry 1
How can I solve this? Is there any documentation on how to build a simple PyTorch dataset that works with Hugging Face transformers models? It would be very nice to have clear documentation along these lines on how to build a dataset for PyTorch. I know there is a doc on transformers.datasets but I found it pretty confusing…
Issue Analytics
- State:
- Created 2 years ago
- Comments:6 (3 by maintainers)
That’s because you need to remove the batch dimension which the processor automatically adds. I have updated my code snippet above.
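To illustrate the batch dimension point, here is a minimal sketch using plain tensors in place of real processor output. The shapes are an assumption: they mimic encodings that were already padded or truncated to 512 tokens and returned with return_tensors="pt".

```python
import torch

# Stand-ins for two per-example encodings as returned with
# return_tensors="pt": each has a leading batch dimension of size 1.
# (Assumes inputs already padded/truncated to 512 tokens.)
example_a = torch.zeros(1, 512, dtype=torch.long)
example_b = torch.ones(1, 512, dtype=torch.long)

# Stacking as-is keeps the extra dimension: shape [2, 1, 512].
with_batch_dim = torch.stack([example_a, example_b])

# Dropping dimension 0 first gives the usual [batch, sequence]
# layout: shape [2, 512].
without_batch_dim = torch.stack([example_a.squeeze(0), example_b.squeeze(0)])

print(with_batch_dim.shape)
print(without_batch_dim.shape)
```

Using squeeze(0) rather than a bare squeeze() only removes the leading dimension, so other dimensions that happen to have size 1 are left alone.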
Hi,
I see you are truncating the inputs, but the processor can take care of that for you. Just specify truncation=True.
So what happens internally is that LayoutLMv2Processor first uses LayoutLMv2FeatureExtractor to apply OCR (namely, Google’s Tesseract) on the document image to get a list of words + corresponding boxes (coordinates). The feature extractor also resizes the document image to 224x224. Next, the list of words + boxes is provided to LayoutLMv2TokenizerFast, which converts them to token-level input_ids, attention_mask, token_type_ids and bbox. Together with the resized image and the label, you have everything you need to train the model.
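Putting the two replies together, a dataset could look roughly like the sketch below. It is not the snippet from the thread: the class name DocumentDataset, the images/labels arguments and the max_length default of 512 are hypothetical, and processor is assumed to behave like the LayoutLMv2Processor defined in the question (accepting padding, truncation, max_length and return_tensors).

```python
from torch.utils.data import Dataset


class DocumentDataset(Dataset):
    """Hypothetical dataset: `images` holds PIL images, `labels` holds ints."""

    def __init__(self, images, labels, processor, max_length=512):
        self.images = images
        self.labels = labels
        self.processor = processor  # assumed: a LayoutLMv2Processor-like callable
        self.max_length = max_length

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # Let the processor pad and truncate, so every example has the
        # same sequence length and no manual slicing is needed.
        encoding = self.processor(
            self.images[idx],
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        # Each returned tensor is shaped [1, ...]; drop that leading batch
        # dimension so the data collator can stack the examples itself.
        item = {key: value.squeeze(0) for key, value in encoding.items()}
        item["label"] = self.labels[idx]
        return item
```

With fixed-length, un-batched tensors per example, torch.stack in the default collator no longer sees mismatched sizes such as [1, 20] and [1, 266].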