UserWarning: [W030] Some entities could not be aligned in the text ...

Since upgrading to the latest spaCy 2.3.0 (I think from 2.2.4, but I am not sure), I repeatedly get the following warning, always related to the same character ('-'):

lib/python3.7/site-packages/spacy/language.py:479: UserWarning: [W030] Some entities could not be aligned in the text ... Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
  gold = GoldParse(doc, **gold)

What does this warning mean? When does it occur?

Your Environment

Info about spaCy

  • spaCy version: 2.3.0
  • Platform: Linux-4.15.0-101-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.4

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

6 reactions
svlandeg commented, Jul 13, 2020

Hi @amaarora, this warning occurs when your “gold” entity offsets do not align with token boundaries as set by nlp.make_doc.

In your last example, you can see for instance that STREET (“Alawa Crescent”) could be aligned to the second (B-STREET) and third (L-STREET) tokens. The first token (“Unit4,1”), however, was kept as one token by the tokenizer and got three different entity types assigned to it (PROPERTY_TYPE, UNIT_NUMBER and STREET_RANGE). That resulted in a '-' tag instead, because one token can only belong to one entity.
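
To make the conflict concrete, here is a minimal sketch in plain Python. It does not call spaCy: it assumes a whitespace tokenizer as a stand-in for the behaviour described above (where “Unit4,1” stays one token), and the character offsets are reconstructed from the text for illustration, not taken from the original annotations.

```python
import re

def whitespace_tokens(text):
    """Return (start, end, token_text) for each whitespace-delimited chunk.

    Assumption: a stand-in tokenizer, not spaCy's real one.
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]

def labels_per_token(text, entities):
    """Pair each token with the labels of all entity spans overlapping it."""
    result = []
    for tok_start, tok_end, tok_text in whitespace_tokens(text):
        labels = [label for start, end, label in entities
                  if start < tok_end and end > tok_start]
        result.append((tok_text, labels))
    return result

text = "Unit4,1 Alawa Crescent"
# Hypothetical offsets, reconstructed for illustration only.
entities = [
    (0, 4, "PROPERTY_TYPE"),
    (4, 5, "UNIT_NUMBER"),
    (6, 7, "STREET_RANGE"),
    (8, 22, "STREET"),
]

overlaps = labels_per_token(text, entities)
for tok, labels in overlaps:
    # A token claimed by more than one entity cannot receive a single tag.
    note = "conflict" if len(labels) > 1 else "ok"
    print(tok, labels, note)
```

Under these assumptions the first token is claimed by three entities, while “Alawa” and “Crescent” are each covered by STREET only.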

You have three options:

  • Ignore these warnings, but note that your model won’t be able to learn from misaligned entities
  • Pre-process your input texts to ensure proper punctuation and whitespace, e.g. Unit 4, 1 Alawa Crescent ...
  • Adjust your tokenizer so that it manages to work better on your specific text, domain and use-case.
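
As a rough illustration of the pre-processing option, the sketch below uses two regular expressions to turn “Unit4,1” into “Unit 4, 1”. The helper name `add_missing_spaces` and both rules are assumptions made for this example; real data would likely need rules tuned to it (for instance, the comma rule would also split numbers such as “1,000”).

```python
import re

def add_missing_spaces(text):
    """Hypothetical normalizer: add spaces where the example text lacks them."""
    # Insert a space between a letter and a digit that directly follows it.
    text = re.sub(r"(?<=[A-Za-z])(?=\d)", " ", text)
    # Insert a space after a comma that is directly followed by a non-space character.
    text = re.sub(r",(?=\S)", ", ", text)
    return text

cleaned = add_missing_spaces("Unit4,1 Alawa Crescent")
print(cleaned)
```

Note that inserting characters shifts every later character offset, so entity annotations need to be created on (or recomputed for) the cleaned text, not the raw one.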

1 reaction
polm commented, May 4, 2021

My question is, can one entity refer to multiple tokens? If no, how should I construct my training set for multi-token entities?

It’s completely normal for a single entity to refer to multiple tokens; that should not cause problems. This warning indicates that something is off with your annotations or tokenization. In your case you have very strange punctuation, so it’s probably related to that, but I would need to see the whole sentence and annotations to say more.

As a note for you and anyone who reads this issue in the future: if you need help with a specific case, please provide this information:

  • the raw text of a sentence that causes the warning
  • spaCy’s tokenization of that sentence
  • your annotations that cause the issue

To repeat Adriane’s example of the kind of problem that causes this warning:

text = "Susan went to Switzerland."
entities = [(0, 3, "PERSON")]

“Sus” in “Susan” cannot be meaningfully assigned to an entity, because you can’t have an entity on half a token. If you are getting this warning you need to look at your annotations and tokenization to figure out why it is happening, because your misaligned annotations are unusable.
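
The check can be sketched without spaCy as follows. This is only an approximation: it assumes a simple regex tokenizer (words and single punctuation marks) in place of `nlp.make_doc`, and the GPE annotation is added here purely for contrast. With spaCy installed, the function named in the warning, `spacy.gold.biluo_tags_from_offsets`, is the one to use on your real tokenization.

```python
import re

def token_spans(text):
    """(start, end) of each token from a regex stand-in tokenizer (an assumption)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def misaligned_entities(text, entities):
    """Return entities whose start or end does not fall on a token boundary."""
    spans = token_spans(text)
    starts = {start for start, _ in spans}
    ends = {end for _, end in spans}
    return [(start, end, label) for start, end, label in entities
            if start not in starts or end not in ends]

text = "Susan went to Switzerland."

# "Sus" ends in the middle of the token "Susan", so it is reported.
bad = misaligned_entities(text, [(0, 3, "PERSON")])
print(bad)

# Offsets that match whole tokens are not reported.
good = misaligned_entities(text, [(0, 5, "PERSON"), (14, 25, "GPE")])
print(good)
```

Running a check like this over a whole training set, and printing the raw text, the tokens and the offending offsets together, gives exactly the information requested above.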
