Demo: adding visual embeddings to LayoutLM
A much-requested feature/question in this repo was: "how do you add visual embeddings to LayoutLM?" I wondered how this worked myself, so (just in time for the release of LayoutLM 2.0) here's a notebook that fine-tunes LayoutLM on the FUNSD dataset while adding visual embeddings from a pre-trained ResNet-101 backbone, as was done in the paper:
First, a document image is resized to 3x224x224 and sent through a pre-trained ResNet-101 to obtain a feature map of shape (1024x14x14). Next, I use RoI Align to turn each bounding box of the original document image into a feature map of shape (1024x3x3), which is then flattened and linearly projected to match the hidden_size of LayoutLM (768 for the base model). I assume the authors used something similar: either RoI pooling as in Faster R-CNN, or RoI Align, which was introduced later and improves on RoI pooling. The parameters of the ResNet are updated during training, so we're effectively fine-tuning it together with LayoutLM.
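Here's a minimal sketch of that feature-extraction step, assuming torchvision's ResNet-101 and `roi_align`. The module name `VisualEmbedder` and the `spatial_scale` arithmetic are my own illustration, not taken from the notebook:

```python
import torch
from torch import nn
from torchvision.models import resnet101
from torchvision.ops import roi_align


class VisualEmbedder(nn.Module):
    """Sketch: per-box visual embeddings from a ResNet-101 backbone."""

    def __init__(self, hidden_size=768, pool_size=3):
        super().__init__()
        backbone = resnet101(pretrained=True)
        # Keep everything up to and including layer3: a 3x224x224 input
        # comes out as a 1024x14x14 feature map (stride 16).
        self.backbone = nn.Sequential(*list(backbone.children())[:-3])
        self.pool_size = pool_size
        self.projection = nn.Linear(1024 * pool_size * pool_size, hidden_size)

    def forward(self, images, boxes):
        # images: (batch, 3, 224, 224)
        # boxes: list of (num_boxes, 4) tensors in 224x224 pixel coordinates
        feature_maps = self.backbone(images)  # (batch, 1024, 14, 14)
        # spatial_scale maps 224-pixel coordinates onto the 14x14 feature map
        pooled = roi_align(
            feature_maps,
            boxes,
            output_size=self.pool_size,
            spatial_scale=14 / 224,
        )  # (total_boxes, 1024, pool_size, pool_size)
        return self.projection(pooled.flatten(start_dim=1))  # (total_boxes, hidden_size)


embedder = VisualEmbedder()
images = torch.rand(1, 3, 224, 224)
boxes = [torch.tensor([[10.0, 20.0, 100.0, 60.0]])]  # one box: x1, y1, x2, y2
visual_embeddings = embedder(images, boxes)  # shape (1, 768)
```

Each per-box vector can then be added to the corresponding token embedding before it enters the LayoutLM encoder.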
By adding these visual features, I was able to improve performance on the test set, compared to using only text + layout (bounding box) information, to roughly the following:
precision: 0.8053668087066682, recall: 0.8163670324538874, f1: 0.8108296133109165
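For context, entity-level precision/recall/F1 on FUNSD is typically computed with seqeval over BIO-tagged sequences; a toy example (the label sequences below are invented, not taken from the notebook):

```python
from seqeval.metrics import precision_score, recall_score, f1_score

# Gold vs. predicted BIO tags; FUNSD uses entity types such as
# QUESTION, ANSWER and HEADER.
true_labels = [["B-QUESTION", "I-QUESTION", "O", "B-ANSWER"]]
pred_labels = [["B-QUESTION", "I-QUESTION", "O", "O"]]

print(precision_score(true_labels, pred_labels))  # 1.0: the one predicted entity is correct
print(recall_score(true_labels, pred_labels))     # 0.5: one of two gold entities found
print(f1_score(true_labels, pred_labels))         # ~0.667
```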
Top GitHub Comments
Your work for the NLP/NLU community, especially for those of us trying to apply these papers to real use cases, is extremely helpful! Many thanks, and keep up the good work.
Fancy Work!