
POS/Morph annotation empty with trained Transformer model


How to reproduce the behaviour

I have trained a Swedish transformer model on the UD Swedish-Talbanken treebank, mainly using the quickstart config from the spaCy website. I only added a morphologizer and a lemmatizer to the model, so the full pipeline looks like this:

['transformer', 'tagger', 'morphologizer', 'lemmatizer', 'parser']
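
The corresponding `[nlp]` block of the training config looks roughly like this (a sketch; apart from the pipeline line, the settings follow the quickstart defaults):

```ini
[nlp]
lang = "sv"
pipeline = ["transformer","tagger","morphologizer","lemmatizer","parser"]
```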

With normal sentences everything seems to work fine, but when very short sequences (such as `nlp('Privat')`) are passed through the model, the POS and morph annotations can be missing for some tokens.

Here are some examples, showing first the text sequence and then lists of the respective POS and morph annotations returned by `token.pos_` and `token.morph` (see the snippet after the list for how these were collected):

Privat [''] []
A-Ö ['PROPN', 'SYM', ''] [Case=Nom, , ]
A [''] []
B [''] []
C ['NOUN'] [Abbr=Yes]
D [''] []
E [''] []
F [''] []
G ['PROPN'] [Case=Nom]
H [''] []
I ['ADP'] []
J [''] []
K [''] []
L ['ADP'] []
M [''] []
N ['PROPN'] [Case=Nom]
O ['PUNCT'] []
P [''] []
Q [''] []
R [''] []
S ['ADP'] []
T [''] []
U [''] []
V [''] []
W [''] []
X [''] []
Y [''] []
Z [''] []
Å ['INTJ'] []
Ä [''] []
Ö [''] []
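
These annotations can be collected with a loop along these lines (a sketch; the model path is hypothetical):

```python
import spacy

# Load the trained Swedish transformer pipeline (path is hypothetical)
nlp = spacy.load("training/model-best")

for text in ["Privat", "A-Ö", "A", "B", "C"]:
    doc = nlp(text)
    # Print the text plus the per-token POS and morph annotations
    print(text, [t.pos_ for t in doc], [t.morph for t in doc])
```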

I also trained a model on both the UD Talbanken and UD LinES corpora. That model has the issue as well, but it seems to occur less frequently:

Privat ['PUNCT'] []
A ['NOUN'] [Case=Nom|Definite=Ind|Gender=Neut|Number=Sing]
-Ö ['SYM', ''] [, ]
A ['ADV'] []
B ['PUNCT'] []
C ['PUNCT'] []
D ['PUNCT'] []
E ['PUNCT'] []
F ['PUNCT'] []
G ['PUNCT'] []
H ['PUNCT'] []
I ['PUNCT'] []
J ['PUNCT'] []
K ['PUNCT'] []
L ['PUNCT'] []
M [''] []
N ['PUNCT'] []
O ['PUNCT'] []
P ['PUNCT'] []
Q ['PUNCT'] []
R ['PUNCT'] []
S ['PUNCT'] []
T ['PUNCT'] []
U ['PUNCT'] []
V ['PUNCT'] []
W ['PUNCT'] []
X ['NOUN'] []
Y ['PUNCT'] []
Z ['PUNCT'] []
Å ['INTJ'] []
Ä ['ADV'] []
Ö ['PUNCT'] []

(Please note that the list containing the morph annotations actually holds `spacy.tokens.morphanalysis.MorphAnalysis` objects whose `__str__`/`__repr__` returns an empty string.)
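
To make that concrete, a minimal check (reusing the loaded pipeline from above):

```python
token = nlp("Privat")[0]
print(type(token.morph))       # <class 'spacy.tokens.morphanalysis.MorphAnalysis'>
print(repr(str(token.morph)))  # '' for the affected tokens
```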

TAG and DEP annotations seem to be okay from what I can see.

Is there maybe some threshold for certainty/entropy that prevents predicting POS tags for uncertain cases?

If that is the case, then I would prefer uncertain tags over no tags, because currently I'm getting an error message from the spaCy Matcher:

ValueError: [E155] The pipeline needs to include a morphologizer in order to use Matcher or PhraseMatcher with the attribute POS. Try using `nlp()` instead of `nlp.make_doc()` or `list(nlp.pipe())` instead of `list(nlp.tokenizer.pipe())`.
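
For context, the error is raised when a Matcher pattern references POS on a doc whose tokens are missing that annotation, e.g. (the pattern itself is hypothetical):

```python
from spacy.matcher import Matcher

# A pattern keyed on POS; matching raises E155 when the doc has no
# POS annotation, as happens for the short inputs shown above
matcher = Matcher(nlp.vocab)
matcher.add("NOUN_TOKEN", [[{"POS": "NOUN"}]])

doc = nlp("Privat")     # token ends up with no POS tag
matches = matcher(doc)  # ValueError: [E155] ...
```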

Please let me know if you need additional information. Thanks!

Your Environment

  • Operating System: Ubuntu 20.04.1 LTS
  • Python Version Used: Python 3.8.3 (default, May 19 2020, 18:47:26) [GCC 7.3.0] :: Anaconda, Inc. on linux
  • spaCy Version Used: 3.0.0rc2

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 21 (12 by maintainers)

Top GitHub Comments

adrianeboyd commented on Nov 6, 2020 (2 reactions)

This is something that’s changed in v3. The morphologizer sets both morph and POS here:

https://github.com/explosion/spaCy/blob/8ef056cf984eea5db47194f2e7d805ed38b971eb/spacy/pipeline/morphologizer.pyx#L204-L205
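
Paraphrased (not the verbatim source), the logic at those lines is roughly:

```python
# Simplified paraphrase of Morphologizer.set_annotations in spaCy v3:
# the component predicts one combined label per token and maps it to
# both a FEATS string and a UPOS tag. If a predicted label is missing
# from these mappings, the defaults leave both morph and POS empty.
for j, tag_id in enumerate(doc_tag_ids):
    label = self.labels[tag_id]
    doc.c[j].morph = self.vocab.morphology.add(self.cfg["labels_morph"].get(label, 0))
    doc.c[j].pos = self.cfg["labels_pos"].get(label, 0)
```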

github-actions[bot] commented on Oct 30, 2021 (0 reactions)

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
