UserWarning: [W030] Some entities could not be aligned in the text ...

Since upgrading to spaCy 2.3.0 (I think from 2.2.4, but I am not sure), I repeatedly get the following warning, always related to the same character ('-'):

lib/python3.7/site-packages/spacy/language.py:479: UserWarning: [W030] Some entities could not be aligned in the text ... Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
  gold = GoldParse(doc, **gold)

What does this warning mean? When does it occur?

Your Environment

Info about spaCy

spaCy version: 2.3.0
Platform: Linux-4.15.0-101-generic-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.7.4

Top GitHub Comments

svlandeg commented, Jul 13, 2020

Hi @amaarora, this warning occurs when your “gold” entity offsets do not align with token boundaries as set by nlp.make_doc.

In your last example, STREET (“Alawa Crescent”) could be aligned to the second (B-STREET) and third (L-STREET) tokens. The first token (“Unit4,1”), however, was kept as a single token by the tokenizer and had three different entity types assigned to it (PROPERTY_TYPE, UNIT_NUMBER and STREET_RANGE). Because one token can only belong to one entity, it was tagged - instead.
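To make the alignment rule concrete, here is a rough pure-Python sketch. It is not spaCy's implementation: the whitespace tokenizer is only a stand-in for nlp.make_doc (the real tokenizer also splits off punctuation), and the text and character offsets are a hypothetical reconstruction of the address fragment discussed above.

```python
import re


def whitespace_tokens(text):
    """Return (start, end) character spans of whitespace-separated tokens.

    Assumption: a stand-in for spaCy's tokenizer, which behaves differently
    around punctuation.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def biluo_sketch(text, entities):
    """Assign simplified BILUO tags; '-' marks tokens hit by misaligned entities."""
    spans = whitespace_tokens(text)
    starts = {s: i for i, (s, _) in enumerate(spans)}
    ends = {e: i for i, (_, e) in enumerate(spans)}
    tags = ["O"] * len(spans)
    for start, end, label in entities:
        first, last = starts.get(start), ends.get(end)
        if first is not None and last is not None and first <= last:
            if first == last:
                tags[first] = "U-" + label
            else:
                tags[first] = "B-" + label
                for i in range(first + 1, last):
                    tags[i] = "I-" + label
                tags[last] = "L-" + label
        else:
            # The entity starts or ends inside a token: flag every token it overlaps.
            for i, (s, e) in enumerate(spans):
                if s < end and start < e:
                    tags[i] = "-"
    return tags


# Hypothetical reconstruction of the example; offsets are assumptions.
text = "Unit4,1 Alawa Crescent"
entities = [
    (0, 4, "PROPERTY_TYPE"),  # "Unit"
    (4, 5, "UNIT_NUMBER"),    # "4"
    (6, 7, "STREET_RANGE"),   # "1"
    (8, 22, "STREET"),        # "Alawa Crescent"
]
tags = biluo_sketch(text, entities)
print(tags)  # ['-', 'B-STREET', 'L-STREET']
```

The three entities that share the single token “Unit4,1” all collapse into one - tag, while STREET lines up with two whole tokens.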

You have three options:

1. Ignore these warnings, but note that your model won’t be able to learn from misaligned entities.
2. Pre-process your input texts to ensure proper punctuation and whitespace: Unit 4, 1 Alawa Crescent ...
3. Adjust your tokenizer so that it works better on your specific text, domain and use case.
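For the pre-processing option, one possible approach is to insert a space wherever an annotated boundary falls between two non-space characters, and to shift the offsets accordingly. The helper below is a hypothetical sketch, not part of spaCy, and the example text and offsets are assumptions reconstructed from the fragment quoted above. Note that it produces “Unit 4 , 1 ...” rather than the hand-cleaned “Unit 4, 1 ...”, because it only looks at annotation boundaries.

```python
def pad_entity_boundaries(text, entities):
    """Insert a space at entity boundaries that sit inside a word.

    Hypothetical helper: returns the padded text and the shifted
    (start, end, label) offsets. Overlapping entities are not handled.
    """
    # Boundaries that fall between two non-space characters.
    cut_points = sorted({
        pos
        for start, end, _ in entities
        for pos in (start, end)
        if 0 < pos < len(text)
        and not text[pos - 1].isspace()
        and not text[pos].isspace()
    })

    # Split the text at the cut points and rejoin with single spaces.
    pieces, previous = [], 0
    for pos in cut_points:
        pieces.append(text[previous:pos])
        previous = pos
    pieces.append(text[previous:])
    new_text = " ".join(pieces)

    def shift_start(pos):
        # A space inserted at the start position lands before the entity.
        return pos + sum(1 for c in cut_points if c <= pos)

    def shift_end(pos):
        # A space inserted at the end position lands after the entity.
        return pos + sum(1 for c in cut_points if c < pos)

    new_entities = [
        (shift_start(s), shift_end(e), label) for s, e, label in entities
    ]
    return new_text, new_entities


# Hypothetical example; offsets are assumptions.
text = "Unit4,1 Alawa Crescent"
entities = [
    (0, 4, "PROPERTY_TYPE"),
    (4, 5, "UNIT_NUMBER"),
    (6, 7, "STREET_RANGE"),
    (8, 22, "STREET"),
]
new_text, new_entities = pad_entity_boundaries(text, entities)
print(new_text)
for start, end, label in new_entities:
    print(label, repr(new_text[start:end]))
```

After padding, every entity starts and ends next to whitespace, so it is worth re-running the check suggested in the warning on the new text and offsets to confirm the alignment with the real tokenizer.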

polm commented, May 4, 2021

“My question is, can one entity refer to multiple tokens? If no, how should I construct my training set for multi-token entities?”

It’s completely normal for a single entity to refer to multiple tokens; that should not cause problems. This warning indicates something weird with your annotations or tokenization. In your case you have very strange punctuation, so it’s probably related to that, but I would need to see the whole sentence and annotations to say more.

As a note for you and anyone who reads this issue in the future, if you need help with a specific case, please provide this information:

the raw text of a sentence that causes the warning
spaCy’s tokenization of that sentence
your annotations that cause the issue

To repeat Adriane’s example of the kind of problem that causes this warning:

text = "Susan went to Switzerland."
entities = [(0, 3, "PERSON")]

“Sus” in “Susan” cannot be meaningfully assigned to an entity, because you can’t have an entity on half a token. If you are getting this warning you need to look at your annotations and tokenization to figure out why it is happening, because your misaligned annotations are unusable.
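If you want to find such annotations before training, a quick check along these lines may help. It is a minimal sketch that assumes a simple word/punctuation regex as a stand-in for spaCy's tokenizer, so the token spans it computes can differ from what nlp.make_doc produces. The function name is made up for illustration.

```python
import re


def report_misaligned(text, entities):
    """Return (entity text, label, overlapping tokens) for each misaligned entity.

    Assumption: tokens are runs of word characters or single punctuation
    marks, which only approximates spaCy's tokenization.
    """
    spans = [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]
    token_starts = {s for s, _ in spans}
    token_ends = {e for _, e in spans}
    problems = []
    for start, end, label in entities:
        if start not in token_starts or end not in token_ends:
            overlapping = [text[s:e] for s, e in spans if s < end and start < e]
            problems.append((text[start:end], label, overlapping))
    return problems


text = "Susan went to Switzerland."
entities = [(0, 3, "PERSON")]
problems = report_misaligned(text, entities)
print(problems)  # [('Sus', 'PERSON', ['Susan'])]

# With the offsets covering the whole token, nothing is reported.
print(report_misaligned(text, [(0, 5, "PERSON")]))  # []
```

The output lists the annotated substring next to the token it cuts through, which is usually enough to see whether the annotation or the tokenization needs fixing.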
