Running Inference on custom dataset
Model I am using: LayoutLM
Fine-tuning on custom annotated data
Steps followed:
Dataset: We took three images from FUNSD and annotated them with the Microsoft Azure tool. Annotation produced two JSON files per image: one contains the OCR output and the other contains the labels. We then created 8 copies of each image, which are used for pre-processing and model training.
Pre-processing: There is a pre-processing script for the FUNSD dataset, but our Microsoft JSON annotations are in a different format, so we needed our own pre-processing script to produce the txt files required for training.
Approach: We took the FUNSD pre-processing script and made the few changes our custom annotations required. For each image we merged the two Microsoft Azure files, ocr.JSON and labels.JSON, into one JSON file and passed it to pre-processing.py, which generated the .txt and label files used in training.
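One part of such a conversion that is easy to get wrong is the coordinate scaling. Below is a minimal sketch, not the script used above. It assumes the OCR file gives each word as an eight-number polygon in pixels (x1, y1, ..., x4, y4) and that the page width and height are known. The function name and its inputs are hypothetical. As far as I can tell, the FUNSD pre-processing scales boxes to a 0 to 1000 range, so the sketch does the same.

```python
def polygon_to_layoutlm_box(polygon, page_width, page_height):
    """Convert an 8-number pixel polygon to [x0, y0, x1, y1] scaled to 0..1000.

    Hypothetical helper: the polygon layout (x1, y1, ..., x4, y4) is an
    assumption about the OCR output, not something taken from the issue.
    """
    xs = polygon[0::2]  # every even position is an x coordinate
    ys = polygon[1::2]  # every odd position is a y coordinate

    # Using min/max guarantees x0 <= x1 and y0 <= y1, whatever the corner order.
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)

    def scale(value, size):
        # Scale to 0..1000 and clamp, in case OCR reports a point just off the page.
        return max(0, min(1000, int(1000 * value / size)))

    return [
        scale(x0, page_width),
        scale(y0, page_height),
        scale(x1, page_width),
        scale(y1, page_height),
    ]


if __name__ == "__main__":
    # A word box on a 1000 x 500 pixel page.
    print(polygon_to_layoutlm_box([100, 50, 300, 50, 300, 90, 100, 90], 1000, 500))
```

Taking the min and max of the corners, rather than trusting their order, keeps the right edge from ending up smaller than the left edge when the annotation tool lists corners in a different order.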
Fine-tuning the model: We fine-tuned the pre-trained model on the custom dataset for 2 epochs, just to verify that training runs properly, and saved the trained model for inference. Training completed successfully.
Inference from the model: We then used the saved model to run inference on the FUNSD test data.
Roadblock: While running the inference part of the code for our custom data, we got the following error:
**IndexError: index out of range in self**
model.to(args.device)
194 result, predictions = evaluate(
--> 195 args, model, tokenizer, labels, pad_token_label_id, mode="test"
196 )
print("\n\n",inputs.keys())
333 print(inputs)
--> 334 outputs = model(**inputs)
335 tmp_eval_loss, logits = outputs[:2]
embedding_output = self.embeddings(
--> 177 input_ids, bbox, position_ids=position_ids, token_type_ids=token_type_ids
178 )
179 encoder_outputs = self.encoder(
The index-out-of-range error occurs here:
w_position_embeddings = self.w_position_embeddings(
---> 88 bbox[:, :, 2] - bbox[:, :, 0]
89 )
We did some debugging by printing the shapes and values of the model inputs. The debugging output is given below.
bbox[:,:,2] shape :- torch.Size([8, 512])
bbox[:, :, 0] shape :- torch.Size([8, 512])
printing the w_position_embeddings:- Embedding(1024, 768)
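The printed embedding table has 1024 rows, so the index `bbox[:, :, 2] - bbox[:, :, 0]` has to lie between 0 and 1023. A box whose right edge is left of its left edge (negative width), or a pixel coordinate that was never scaled down, would fall outside that range and could raise this kind of IndexError. The sketch below checks for both cases. It assumes the boxes are available as plain lists of [x0, y0, x1, y1] and that the coordinate embeddings share the same 1024-row limit. The function name is hypothetical.

```python
def find_bad_boxes(boxes, table_size=1024):
    """Return (index, box, reason) for every box that would index outside
    an embedding table with `table_size` rows.

    Assumption: each box is [x0, y0, x1, y1] with integer coordinates.
    """
    problems = []
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        box = [x0, y0, x1, y1]
        if min(box) < 0 or max(box) >= table_size:
            # A raw coordinate is already outside 0..table_size-1.
            problems.append((i, box, "coordinate out of range"))
        elif x1 - x0 < 0 or y1 - y0 < 0:
            # Width or height is negative, so the width/height lookup fails.
            problems.append((i, box, "negative width or height"))
    return problems


if __name__ == "__main__":
    sample = [
        [10, 10, 50, 20],   # fine
        [50, 10, 10, 20],   # x1 < x0, negative width
        [0, 0, 1500, 20],   # unscaled pixel coordinate
    ]
    for index, box, reason in find_bad_boxes(sample):
        print(index, box, reason)
```

With a PyTorch batch like the one printed above, something like `find_bad_boxes(bbox.reshape(-1, 4).tolist())` should feed the same check, assuming the last dimension of `bbox` holds the four coordinates.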
Kindly help me with this.
Best regards,
Mohit Tuli
Top GitHub Comments
This is an interesting discussion. I am looking to fine-tune LayoutLM (multilingual if possible) on custom document images to perform sequence labelling (or Named Entity Recognition). I have no clear idea how to get started with this, though. I would need more concise explanations of how to “annotate” the data (all occurrences or just one?), what format the training data should be in, and how to run both training and prediction. Any help on the matter is appreciated. In any case, I am willing to look into it jointly so that others are helped as well.
Did you find any resources? I’m looking for the same.