Spacy 2.3 biluo_tags_from_offsets: "Misaligned entities ('-') will be ignored during training" but then spacy convert raises an exception.


How to reproduce the behaviour

Take the example from https://github.com/explosion/spaCy/issues/5727#issuecomment-655509687 and try to convert it with the spacy convert tool. Alternatively, use this one-line .conllu file:

1 asd _ _ _ _ _ _ _ -

You’ll get “ValueError: [E177] Ill-formed IOB input detected: -” at

File "gold.pyx", line 570, in spacy.gold.iob_to_biluo
File "gold.pyx", line 592, in spacy.gold._consume_ent
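Before running spacy convert, it can help to check whether a file contains such values at all. The following is a minimal sketch, not part of spaCy: find_dash_tags is a hypothetical helper, and it assumes token lines have ten tab-separated columns with the tag in the last one, as in the one-line example above.

```python
def find_dash_tags(conllu_text):
    """Return (line_number, token) pairs whose final column is exactly '-'."""
    hits = []
    for line_number, line in enumerate(conllu_text.splitlines(), start=1):
        # Skip blank sentence separators and '#' comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        # Assumption: a token line has ten tab-separated columns.
        if len(columns) == 10 and columns[9].strip() == "-":
            hits.append((line_number, columns[1]))
    return hits


sample = "1\tasd\t_\t_\t_\t_\t_\t_\t_\t-\n"
print(find_dash_tags(sample))
```

Any line reported here is a candidate for the E177 error above and would need to be handled before conversion.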

Your Environment

  • spaCy Version Used: 2.3.2

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
adrianeboyd commented, Oct 19, 2020

Since the - is really a spacy-internal label, not something we expect to find in external datasets, I don’t think it makes sense to support it in general in spacy convert.
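For data that does contain - anyway, a custom pre-processing step could carry the value through an IOB-to-BILUO conversion instead of rejecting it. This is a rough sketch under stated assumptions, not spaCy's implementation: iob_to_biluo_keep_missing is a hypothetical name, and the function leniently treats an I- tag with no preceding B- as the start of an entity.

```python
def iob_to_biluo_keep_missing(tags, missing="-"):
    """Convert IOB tags to BILUO, passing the `missing` marker through unchanged."""
    biluo = []
    i = 0
    while i < len(tags):
        tag = tags[i]
        if tag == "O" or tag == missing:
            # Outside and missing tokens are copied as they are.
            biluo.append(tag)
            i += 1
        elif tag.startswith(("B-", "I-")):
            label = tag[2:]
            # Extend the span over the following I- tags with the same label.
            end = i + 1
            while end < len(tags) and tags[end] == "I-" + label:
                end += 1
            length = end - i
            if length == 1:
                biluo.append("U-" + label)
            else:
                biluo.append("B-" + label)
                biluo.extend(["I-" + label] * (length - 2))
                biluo.append("L-" + label)
            i = end
        else:
            # Anything else is still reported as ill-formed input.
            raise ValueError("Ill-formed IOB tag: " + repr(tag))
    return biluo


print(iob_to_biluo_keep_missing(["B-PER", "I-PER", "-", "O", "B-LOC"]))
```

A missing tag always ends the current entity here, so an entity never spans a token whose annotation is unknown.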

If you’re using v3 in your comparison anyway, I think it would be easiest to just use the v3 converter to produce both the .spacy and the .json files for your comparison (with the short script above). I actually used an early version of the v3 converter for all the internal UD training corpora for the v2.3 models.

If you don’t have the raw text in the JSON file for spacy v2.3, the training takes the provided tokens as the gold tokens, so if there’s much of a mismatch between how the actual tokenizer tokenizes and the gold tokens, the model will not perform as well in the end on actual texts. If you evaluate using spacy evaluate, the evaluation on JSON files without raw texts will look inflated, too.

So I can understand a bit more why you might want to provide input like this to produce an accurate comparison, but you’ll need to customize the converter to either handle the - values or include the raw texts. If your data is clean, adding support for the # text = comment might be simpler than adding SpaceAfter=No handling. SpaceAfter=No just aligns really well with how spacy represents the tokens internally and I didn’t want to add error handling for cases where the # text doesn’t align with the tokens.
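To make the two options for recovering raw text concrete, here is a minimal sketch of a helper that a customized converter might use. It is not spaCy code: sentence_text is a hypothetical name, the input is assumed to be one tab-separated CoNLL-U sentence block, and the data is assumed to be clean, so a # text = comment is trusted without checking it against the tokens. Multiword-token ranges and empty nodes are simply skipped, which is a simplification.

```python
def sentence_text(conllu_sentence):
    """Return the raw text for one CoNLL-U sentence block."""
    prefix = "# text = "
    pieces = []
    for line in conllu_sentence.splitlines():
        if line.startswith(prefix):
            # Assumption: the comment is correct and aligned with the tokens.
            return line[len(prefix):]
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        token_id, form, misc = columns[0], columns[1], columns[9].strip()
        # Simplification: skip multiword-token ranges (1-2) and empty nodes (1.1).
        if "-" in token_id or "." in token_id:
            continue
        pieces.append(form)
        # MISC holds '|'-separated attributes; add a space unless told not to.
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()


sample = (
    "# sent_id = 1\n"
    "1\tHello\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No\n"
    "2\t,\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "3\tworld\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No\n"
    "4\t!\t_\t_\t_\t_\t_\t_\t_\t_\n"
)
print(sentence_text(sample))
```

The SpaceAfter=No path rebuilds the text from the tokens themselves, so it cannot disagree with them. The # text = path is shorter but inherits any mismatch present in the file.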


