POS/Morph annotation empty with trained Transformer model

How to reproduce the behaviour

I trained a Swedish Transformer model on the UD Talbanken treebank, mostly using the Quickstart config from the spaCy website. The only change was adding a morphologizer and a lemmatizer, so the full pipeline looks like this:

['transformer', 'tagger', 'morphologizer', 'lemmatizer', 'parser']

With normal sentences everything seems to work fine, but when very short sequences (such as nlp('Privat')) are passed through the model, the POS and morph annotations can be missing for some tokens.

Here are some examples. Each line shows the text sequence, followed by the lists of POS and morph annotations returned by token.pos_ and token.morph:

Privat [''] []
A-Ö ['PROPN', 'SYM', ''] [Case=Nom, , ]
A [''] []
B [''] []
C ['NOUN'] [Abbr=Yes]
D [''] []
E [''] []
F [''] []
G ['PROPN'] [Case=Nom]
H [''] []
I ['ADP'] []
J [''] []
K [''] []
L ['ADP'] []
M [''] []
N ['PROPN'] [Case=Nom]
O ['PUNCT'] []
P [''] []
Q [''] []
R [''] []
S ['ADP'] []
T [''] []
U [''] []
V [''] []
W [''] []
X [''] []
Y [''] []
Z [''] []
Å ['INTJ'] []
Ä [''] []
Ö [''] []

I also trained a model on both the UD Talbanken and UD LinES corpora. That model has the issue as well, but it seems to occur less frequently:

Privat ['PUNCT'] []
A ['NOUN'] [Case=Nom|Definite=Ind|Gender=Neut|Number=Sing]
-Ö ['SYM', ''] [, ]
A ['ADV'] []
B ['PUNCT'] []
C ['PUNCT'] []
D ['PUNCT'] []
E ['PUNCT'] []
F ['PUNCT'] []
G ['PUNCT'] []
H ['PUNCT'] []
I ['PUNCT'] []
J ['PUNCT'] []
K ['PUNCT'] []
L ['PUNCT'] []
M [''] []
N ['PUNCT'] []
O ['PUNCT'] []
P ['PUNCT'] []
Q ['PUNCT'] []
R ['PUNCT'] []
S ['PUNCT'] []
T ['PUNCT'] []
U ['PUNCT'] []
V ['PUNCT'] []
W ['PUNCT'] []
X ['NOUN'] []
Y ['PUNCT'] []
Z ['PUNCT'] []
Å ['INTJ'] []
Ä ['ADV'] []
Ö ['PUNCT'] []

(Please note that the list containing the morph annotation actually contains a spacy.tokens.morphanalysis.MorphAnalysis object where the __str__/__repr__ method returns an empty string.)
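
For reference, listings like the ones above can be collected with a few lines of Python. This is only a minimal sketch: the helper names annotation_summary and tokens_missing_pos are made up for illustration, and the stand-in tokens (including the assumed token texts for "A-Ö") replace a real Doc so that the snippet runs without the trained model. With spaCy, doc would be the result of calling the loaded pipeline on the text.

```python
from types import SimpleNamespace


def annotation_summary(doc):
    """Collect the coarse POS tag and the morph string for every token."""
    pos = [token.pos_ for token in doc]
    # str() of a MorphAnalysis gives its feature string, or "" when it is empty.
    morph = [str(token.morph) for token in doc]
    return pos, morph


def tokens_missing_pos(doc):
    """Return the text of each token whose coarse POS tag is unset."""
    return [token.text for token in doc if not token.pos_]


# Stand-in for a parsed Doc (assumption: token texts are guessed).
# With spaCy this would be: doc = nlp("A-Ö")
fake_doc = [
    SimpleNamespace(text="A", pos_="PROPN", morph="Case=Nom"),
    SimpleNamespace(text="-", pos_="SYM", morph=""),
    SimpleNamespace(text="Ö", pos_="", morph=""),
]

pos, morph = annotation_summary(fake_doc)
missing = tokens_missing_pos(fake_doc)
print(pos, morph, missing)
```

Converting token.morph with str() makes empty analyses show up as empty strings, which is easier to compare than the MorphAnalysis objects themselves.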

TAG and dep annotations seem to be okay from what I can see.

Is there maybe some threshold for certainty/entropy that prevents predicting POS tags for uncertain cases?

If that is the case, then I would prefer uncertain tags over no tags, because currently I’m getting an error message from the spaCy Matcher:

ValueError: [E155] The pipeline needs to include a morphologizer in order to use Matcher or PhraseMatcher with the attribute POS. Try using `nlp()` instead of `nlp.make_doc()` or `list(nlp.pipe())` instead of `list(nlp.tokenizer.pipe())`.
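
As far as I understand it, this error comes from a check the Matcher runs before matching: for attributes used in the patterns, it asks the Doc whether that annotation is present (Doc.has_annotation in v3), and a Doc in which no token has a POS value fails the check. Until the root cause is fixed, a defensive guard along these lines might avoid the exception. It is a sketch under assumptions: match_if_annotated, FakeDoc and fake_matcher are hypothetical names, and the stand-ins only mimic the one method the guard relies on.

```python
def match_if_annotated(matcher, doc, attr="POS"):
    """Run the matcher only when the doc carries the given annotation."""
    if not doc.has_annotation(attr):
        # Nothing to match on; skip instead of triggering E155.
        return []
    return matcher(doc)


class FakeDoc:
    """Hypothetical stand-in exposing the one Doc method the guard uses."""

    def __init__(self, pos_tags):
        self.pos_tags = pos_tags

    def has_annotation(self, attr):
        # Mimics the default, non-strict check: at least one token is annotated.
        return any(self.pos_tags)


def fake_matcher(doc):
    """Hypothetical matcher: one (match_id, start, end) tuple per tagged token."""
    return [(0, i, i + 1) for i, tag in enumerate(doc.pos_tags) if tag]


unannotated = match_if_annotated(fake_matcher, FakeDoc([""]))
annotated = match_if_annotated(fake_matcher, FakeDoc(["PROPN", "SYM", ""]))
print(unannotated, annotated)
```

With a real pipeline, matcher would be a spacy.matcher.Matcher and doc the output of nlp(text). If I remember the v3 API correctly, the Matcher call also accepts allow_missing=True to skip this check, but I have not verified that against 3.0.0rc2.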

Please let me know if you need additional information. Thanks!

Your Environment

Operating System: Ubuntu 20.04.1 LTS
Python Version Used:

Python 3.8.3 (default, May 19 2020, 18:47:26) 
[GCC 7.3.0] :: Anaconda, Inc. on linux

spaCy Version Used: 3.0.0rc2

Issue Analytics

State:
Created 3 years ago
Comments:21 (12 by maintainers)

Top GitHub Comments

adrianeboyd commented, Nov 6, 2020

This is something that’s changed in v3. The morphologizer sets both morph and POS here:

https://github.com/explosion/spaCy/blob/8ef056cf984eea5db47194f2e7d805ed38b971eb/spacy/pipeline/morphologizer.pyx#L204-L205
