Spacy 2.3 biluo_tags_from_offsets: "Misaligned entities ('-') will be ignored during training" but then spacy convert raises an exception.
How to reproduce the behaviour
I think you can take this example https://github.com/explosion/spaCy/issues/5727#issuecomment-655509687 and try to convert it with the spacy convert tool.
Or take this one-line .conllu file:
1 asd _ _ _ _ _ _ _ -
You’ll get “ValueError: [E177] Ill-formed IOB input detected: -” at:
File "gold.pyx", line 570, in spacy.gold.iob_to_biluo
File "gold.pyx", line 592, in spacy.gold._consume_ent
Your Environment
- spaCy Version Used: 2.3.2
Issue Analytics
- State: closed
- Created: 3 years ago
- Comments: 7 (5 by maintainers)
Top GitHub Comments
Since the `-` is really a spacy-internal label, not something we expect to find in external datasets, I don’t think it makes sense to support it in general in `spacy convert`.
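(Not part of the original comment: as an illustration of where that internal label comes from, here is a small sketch, assuming spaCy 2.3.x and a blank English pipeline. `biluo_tags_from_offsets` marks tokens it cannot align to an entity span with `-`.)

```python
import spacy
from spacy.gold import biluo_tags_from_offsets  # spaCy 2.3.x location

nlp = spacy.blank("en")
doc = nlp.make_doc("I like New York.")

# The character span (7, 13) ends in the middle of the token "York", so it
# cannot be aligned to token boundaries; the affected tokens get the internal
# "-" tag (and warning W030 is emitted).
print(biluo_tags_from_offsets(doc, [(7, 13, "GPE")]))
# ['O', 'O', '-', '-', 'O']
```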
If you’re using v3 in your comparison anyway, I think it would be easiest to just use the v3 converter to produce both the `.spacy` and the `.json` files for your comparison (with the short script above). I actually used an early version of the v3 converter for all the internal UD training corpora for the v2.3 models.
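(The “short script above” is not included in this excerpt. Purely to illustrate the idea, a rough sketch with the v3 converter API might look like the following; it assumes spaCy 3.x and a placeholder file name train.conllu.)

```python
# Illustrative sketch only (assumes spaCy 3.x; "train.conllu" is a placeholder):
# produce both a v3 .spacy file and a v2-style .json file from the same data.
import json
from pathlib import Path

from spacy.tokens import DocBin
from spacy.training import docs_to_json
from spacy.training.converters import conllu_to_docs

conllu_text = Path("train.conllu").read_text(encoding="utf-8")
docs = list(conllu_to_docs(conllu_text))

DocBin(docs=docs).to_disk("train.spacy")  # binary training format for v3
Path("train.json").write_text(
    json.dumps([docs_to_json(docs)], indent=2),  # JSON training format for v2
    encoding="utf-8",
)
```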
If you don’t have the `raw` text in the JSON file for spacy v2.3, the training takes the provided tokens as the gold tokens, so if there’s much of a mismatch between how the actual tokenizer tokenizes and the gold tokens, the model will not perform as well in the end on actual texts. If you evaluate using `spacy evaluate`, the evaluation on JSON files without `raw` texts will look inflated, too.

So I can understand a bit more why you might want to provide input like this to produce an accurate comparison, but you’ll need to customize the converter to either handle the `-` values or include the raw texts. If your data is clean, adding support for the `# text =` comment might be simpler than adding `SpaceAfter=No` handling. `SpaceAfter=No` just aligns really well with how spacy represents the tokens internally and I didn’t want to add error handling for cases where the `# text` doesn’t align with the tokens.
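(Also not the maintainer’s code: a hypothetical helper sketching both options for recovering the raw text when customizing a converter, either trusting a `# text =` comment or rebuilding the text from the FORM column plus `SpaceAfter=No` in MISC.)

```python
def raw_text_from_conllu_sentence(lines):
    """Hypothetical, simplified helper: recover the raw text of one CoNLL-U
    sentence block (a list of lines). Multi-word token ranges (e.g. "3-4")
    and empty nodes (e.g. "5.1") are skipped, so contractions are not
    reconstructed exactly."""
    # Option 1: trust the "# text =" comment if it is present.
    for line in lines:
        if line.startswith("# text ="):
            return line.split("=", 1)[1].strip()
    # Option 2: rebuild the text from FORM (column 2) and MISC (column 10).
    pieces = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        token_id, form, misc = cols[0], cols[1], cols[9]
        if "-" in token_id or "." in token_id:
            continue  # skip multi-word token ranges and empty nodes
        pieces.append(form)
        if "SpaceAfter=No" not in misc:
            pieces.append(" ")
    return "".join(pieces).rstrip()
```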
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.