POS/Morph annotation empty with trained Transformer model
How to reproduce the behaviour
I have trained a Swedish Transformer model on the UD-Treebank Talbanken, mainly using the Quickstart config from the spaCy website. I only added morphologizer and lemmatizer to the model, so the full pipeline looks like this:
['transformer', 'tagger', 'morphologizer', 'lemmatizer', 'parser']
With normal sentences everything seems to work fine, but when very short sequences (such as nlp('Privat')) are passed through the model, the POS and morph annotations can be missing for some tokens.
Here are some examples, each showing the text sequence followed by the lists of POS and morph annotations returned by token.pos_ and token.morph:
Privat [''] []
A-Ö ['PROPN', 'SYM', ''] [Case=Nom, , ]
A [''] []
B [''] []
C ['NOUN'] [Abbr=Yes]
D [''] []
E [''] []
F [''] []
G ['PROPN'] [Case=Nom]
H [''] []
I ['ADP'] []
J [''] []
K [''] []
L ['ADP'] []
M [''] []
N ['PROPN'] [Case=Nom]
O ['PUNCT'] []
P [''] []
Q [''] []
R [''] []
S ['ADP'] []
T [''] []
U [''] []
V [''] []
W [''] []
X [''] []
Y [''] []
Z [''] []
Å ['INTJ'] []
Ä [''] []
Ö [''] []
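A listing like the one above can be produced with a short loop over the inputs. This is a minimal sketch, not the exact script used for this report: the helper names annotation_rows and print_rows_for_model are hypothetical, the model path is whatever your own trained pipeline uses, and the only token attributes it relies on are token.pos_ and token.morph.

```python
def annotation_rows(nlp, texts):
    """Return (text, pos_list, morph_list) for each input text.

    `nlp` is any callable that turns a string into a sequence of tokens
    exposing `pos_` and `morph`, for example a loaded spaCy pipeline.
    """
    rows = []
    for text in texts:
        doc = nlp(text)
        pos = [token.pos_ for token in doc]
        # str() of an empty morph analysis is "", so missing morph shows as ""
        morph = [str(token.morph) for token in doc]
        rows.append((text, pos, morph))
    return rows


def print_rows_for_model(model_path, texts):
    """Load a pipeline from `model_path` (your own model) and print the rows."""
    import spacy  # imported lazily so annotation_rows works without spaCy

    nlp = spacy.load(model_path)
    for text, pos, morph in annotation_rows(nlp, texts):
        print(text, pos, morph)
```

Because the morph values are converted to strings here, the output shows quoted strings rather than MorphAnalysis objects, but empty annotations are still easy to spot.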
I also trained a model on both the UD-Talbanken and UD-Lines corpora. That model has the issue as well, but it seems to occur less frequently:
Privat ['PUNCT'] []
A ['NOUN'] [Case=Nom|Definite=Ind|Gender=Neut|Number=Sing]
-Ö ['SYM', ''] [, ]
A ['ADV'] []
B ['PUNCT'] []
C ['PUNCT'] []
D ['PUNCT'] []
E ['PUNCT'] []
F ['PUNCT'] []
G ['PUNCT'] []
H ['PUNCT'] []
I ['PUNCT'] []
J ['PUNCT'] []
K ['PUNCT'] []
L ['PUNCT'] []
M [''] []
N ['PUNCT'] []
O ['PUNCT'] []
P ['PUNCT'] []
Q ['PUNCT'] []
R ['PUNCT'] []
S ['PUNCT'] []
T ['PUNCT'] []
U ['PUNCT'] []
V ['PUNCT'] []
W ['PUNCT'] []
X ['NOUN'] []
Y ['PUNCT'] []
Z ['PUNCT'] []
Å ['INTJ'] []
Ä ['ADV'] []
Ö ['PUNCT'] []
(Please note that the list containing the morph annotation actually contains spacy.tokens.morphanalysis.MorphAnalysis objects whose __str__/__repr__ methods return an empty string.)
The TAG and dep annotations seem to be okay from what I can see.
Is there maybe some threshold for certainty/entropy that prevents predicting POS tags for uncertain cases?
If that is the case, then I would prefer uncertain tags over no tags, because currently I’m getting an error message from the spaCy Matcher:
ValueError: [E155] The pipeline needs to include a morphologizer in order to use Matcher or PhraseMatcher with the attribute POS. Try using `nlp()` instead of `nlp.make_doc()` or `list(nlp.pipe())` instead of `list(nlp.tokenizer.pipe())`.
Please let me know if you need additional information. Thanks!
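As a stopgap, the exception can be avoided by checking for POS annotation before calling the matcher. The sketch below assumes spaCy v3's Doc.has_annotation method, and assumes that E155 is raised exactly when it reports no POS annotation on the doc. match_if_tagged is a hypothetical helper name, not part of spaCy.

```python
def match_if_tagged(matcher, doc):
    """Run `matcher` on `doc` only if the doc carries POS annotation.

    Assumption: `doc.has_annotation("POS")` is False when no token in the
    doc has a POS value, which is the situation that triggers E155.
    Returns an empty list instead of raising in that case.
    """
    if not doc.has_annotation("POS"):
        return []
    return matcher(doc)


# Usage with a real pipeline and a Matcher whose patterns use POS:
#   matches = match_if_tagged(matcher, nlp("Privat"))
```

This only skips documents without POS tags; it does not supply the missing tags.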
Your Environment
- Operating System: Ubuntu 20.04.1 LTS
- Python Version Used: Python 3.8.3 (default, May 19 2020, 18:47:26) [GCC 7.3.0] :: Anaconda, Inc. on linux
- spaCy Version Used: 3.0.0rc2
Top GitHub Comments
This is something that’s changed in v3. The morphologizer sets both morph and POS here:
https://github.com/explosion/spaCy/blob/8ef056cf984eea5db47194f2e7d805ed38b971eb/spacy/pipeline/morphologizer.pyx#L204-L205
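To illustrate why POS and morph tend to go missing together in the listings: if a single predicted label carries both values, an empty prediction leaves both attributes unset. The following is a simplified, hypothetical illustration in plain Python, not spaCy's actual implementation. The combined label format (for example Case=Nom|POS=PROPN) and the function name split_label are assumptions made for this example.

```python
def split_label(label, pos_key="POS"):
    """Split a combined label into (pos, morph_features).

    The POS entry is pulled out of the "|"-separated features; everything
    else stays in the morph string. An empty label yields ("", ""), so
    neither POS nor morph is set.
    """
    pos = ""
    feats = []
    for part in label.split("|"):
        if not part:
            continue  # skip empty segments, e.g. from an empty label
        key, _, value = part.partition("=")
        if key == pos_key:
            pos = value
        else:
            feats.append(part)
    return pos, "|".join(feats)


print(split_label("Case=Nom|POS=PROPN"))  # ('PROPN', 'Case=Nom')
print(split_label("POS=ADP"))             # ('ADP', '')
print(split_label(""))                    # ('', '')
```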