spaCy 2.3 biluo_tags_from_offsets: "Misaligned entities ('-') will be ignored during training", but spacy convert then raises an exception

How to reproduce the behaviour

You can take the example from https://github.com/explosion/spaCy/issues/5727#issuecomment-655509687 and try to convert it with the spacy convert tool. Alternatively, use this one-line .conllu file:

1 asd _ _ _ _ _ _ _ -

The conversion fails with “ValueError: [E177] Ill-formed IOB input detected: -”, raised at:

File "gold.pyx", line 570, in spacy.gold.iob_to_biluo
File "gold.pyx", line 592, in spacy.gold._consume_ent
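
To illustrate why a bare - tag ends in this kind of error, here is a minimal pure-Python sketch of an IOB-to-BILUO conversion. It is not spaCy's implementation of iob_to_biluo; the function name iob_to_biluo_sketch and its exact rules are assumptions made for illustration. It only shows that a strict converter accepts O, B-LABEL and I-LABEL, so a - placeholder has nowhere to go.

```python
def iob_to_biluo_sketch(tags):
    """Convert IOB tags to BILUO tags (illustrative only, not spaCy's code).

    Accepts "O", "B-<label>" and "I-<label>"; anything else, including the
    "-" placeholder for misaligned tokens, raises ValueError.
    """
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append("O")
            continue
        prefix, sep, label = tag.partition("-")
        if prefix not in ("B", "I") or not sep or not label:
            # "-" splits into ("", "-", ""), so it is rejected here.
            raise ValueError(f"Ill-formed IOB input detected: {tag}")
        prev_tag = tags[i - 1] if i > 0 else "O"
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # An entity starts at B-, or at an I- that does not continue one.
        starts = prefix == "B" or prev_tag not in (f"B-{label}", f"I-{label}")
        # An entity continues only if the next tag is I- with the same label.
        continues = next_tag == f"I-{label}"
        if starts and continues:
            biluo.append(f"B-{label}")
        elif starts:
            biluo.append(f"U-{label}")
        elif continues:
            biluo.append(f"I-{label}")
        else:
            biluo.append(f"L-{label}")
    return biluo
```

Calling iob_to_biluo_sketch(["-"]) raises a ValueError in the same spirit as E177, while well-formed input such as ["B-PER", "I-PER", "O"] converts without complaint.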

Your Environment

spaCy Version Used: 2.3.2

Issue Analytics

Created 3 years ago
Comments: 7 (5 by maintainers)

Top GitHub Comments

adrianeboyd commented, Oct 19, 2020

Since the - is really a spacy-internal label, not something we expect to find in external datasets, I don’t think it makes sense to support it in general in spacy convert.

If you’re using v3 in your comparison anyway, I think it would be easiest to just use the v3 converter to produce both the .spacy and the .json files for your comparison (with the short script above). I actually used an early version of the v3 converter for all the internal UD training corpora for the v2.3 models.

If you don’t have the raw text in the JSON file for spacy v2.3, the training takes the provided tokens as the gold tokens, so if there’s much of a mismatch between how the actual tokenizer tokenizes and the gold tokens, the model will not perform as well in the end on actual texts. If you evaluate using spacy evaluate, the evaluation on JSON files without raw texts will look inflated, too.

So I can understand a bit more why you might want to provide input like this to produce an accurate comparison, but you’ll need to customize the converter to either handle the - values or include the raw texts. If your data is clean, adding support for the # text = comment might be simpler than adding SpaceAfter=No handling. SpaceAfter=No just aligns really well with how spacy represents the tokens internally and I didn’t want to add error handling for cases where the # text doesn’t align with the tokens.
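
As a rough illustration of the two ways to recover raw text mentioned above, the following standard-library sketch reads one CoNLL-U sentence and returns both the text from a # text = comment (if present) and the text rebuilt from SpaceAfter=No in the MISC column. The function name sentence_texts is hypothetical and this is not part of spaCy's converter. It assumes tab-separated columns with FORM in column 2 and MISC in column 10, and it skips multiword-token ranges and empty nodes instead of handling them.

```python
def sentence_texts(conllu_sentence):
    """Return (comment_text, rebuilt_text) for one CoNLL-U sentence.

    comment_text comes from a "# text = ..." comment, or None if absent.
    rebuilt_text joins the FORM values, omitting the space after any token
    whose MISC column contains SpaceAfter=No.
    """
    comment_text = None
    pieces = []
    for line in conllu_sentence.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            if line.startswith("# text = "):
                comment_text = line[len("# text = "):]
            continue
        cols = line.split("\t")
        token_id, form, misc = cols[0], cols[1], cols[9]
        # Assumption: ranges like "1-2" and empty nodes like "1.1" are skipped.
        if "-" in token_id or "." in token_id:
            continue
        pieces.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    rebuilt_text = "".join(pieces).rstrip()
    return comment_text, rebuilt_text
```

Comparing the two return values is one cheap way to spot sentences where the # text comment does not align with the tokens before feeding either version into a customized converter.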
