merge_entities does not set IOB correctly for two consecutive entities
Sorry to have to report a small issue with merge_entities that I came across.
Problem: If two named entities are directly adjacent in a sentence and you call merge_entities to merge their tokens, the entities themselves are merged as well, because the IOB tag of the second entity is set to I instead of B.
Example: Given the original sequence B-I-I-B-I-O-O, merge_entities correctly merges the tokens 0:3 and 3:5, but the resulting IOB tags are B-I-O-O and not B-B-O-O. As a result, doc.ents returns just one entity. See the example below.
import spacy
from spacy.pipeline import merge_entities

nlp = spacy.load('en')
text = "LIfetime Corp said Retirement Housing Corp has accepted."
doc = nlp(text)

# two directly consecutive named entities detected:
# "LIfetime Corp said" and "Retirement Housing Corp"
# the first one is wrong, but that's not the point
for t in doc:
    if t.ent_type_ != '':
        print(f"'{t.text}' {t.ent_iob_} {t.ent_type_} ")
print()

# after merge entities
doc = merge_entities(doc)
for t in doc:
    if t.ent_type_ != '':
        print(f"'{t.text}' {t.ent_iob_} {t.ent_type_} ")
print()

for e in doc.ents:
    print(f"ents: '{e.text}' {e.label_}")
Output:
# before merge:
'LIfetime' B ORG
'Corp' I ORG
'said' I ORG
'Retirement' B ORG
'Housing' I ORG
'Corp' I ORG
# after merge:
'LIfetime Corp said' B ORG
'Retirement Housing Corp' I ORG
ents: 'LIfetime Corp said Retirement Housing Corp' ORG
So not only the tokens but also the entities got merged. The expected output would be:
'LIfetime Corp said' B ORG
'Retirement Housing Corp' B ORG
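To illustrate why the single tag matters, here is a minimal sketch of how a sequence of IOB tags can be grouped into entity spans. It is a simplified model written for this report, not spaCy's actual implementation: the function name iob_to_spans is hypothetical, and entity labels are ignored.

```python
def iob_to_spans(tags):
    """Group IOB tags into (start, end) entity spans, with end exclusive."""
    spans = []
    start = None  # index where the currently open entity began, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            # A B tag always opens a new entity and closes any open one.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I":
            # An I tag continues the open entity (a stray I opens one).
            if start is None:
                start = i
        else:
            # An O tag closes the open entity.
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


# Tags before merging: two entities.
print(iob_to_spans(["B", "I", "I", "B", "I", "O", "O"]))  # [(0, 3), (3, 5)]
# Tags observed after merging: the I glues both tokens into one entity.
print(iob_to_spans(["B", "I", "O", "O"]))  # [(0, 2)]
# Tags expected after merging: two single-token entities.
print(iob_to_spans(["B", "B", "O", "O"]))  # [(0, 1), (1, 2)]
```

With B-I on the two merged tokens, any decoder of this kind yields a single span, which matches the single entity that doc.ents returns above.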
- Operating System: Linux
- Python Version Used: 3.8
- spaCy Version Used: 2.3
Issue Analytics
- State:
- Created 3 years ago
- Comments:9 (5 by maintainers)
Top GitHub Comments
Hi, surprisingly, the problem does not show up with an entity ruler, only with the model.
In my opinion, the expected behaviour is that the number of entities should not change after calling merge_entities, but it frequently does (e.g. in 5% of the articles in the Reuters corpus).
Here is a sentence that also causes problems in 2.3.2 (only with the pretrained en_core_web_sm model, not with a ruler)
I attached a small notebook with an updated version of my example and a piece of code that produces over 100 such examples from a subset of 2000 articles of the Reuters corpus. The notebook includes all the output.
Spacy_Merge_Entities_Issue.zip
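The invariant stated in the comments (the number of entities should be the same before and after merging) can be checked with a small helper. This is only a sketch and not the code from the attached notebook: the function name find_entity_count_changes and its parameters make_doc and merge are hypothetical, and the only assumption is that the document object exposes an ents sequence and that the merge function returns the document, as in the example at the top.

```python
def find_entity_count_changes(texts, make_doc, merge):
    """Return (text, before, after) for each text whose entity count changes.

    make_doc: callable turning a text into a document with an `ents` sequence.
    merge:    callable that merges entity tokens and returns the document.
    """
    changed = []
    for text in texts:
        doc = make_doc(text)
        before = len(doc.ents)  # entity count before merging
        doc = merge(doc)
        after = len(doc.ents)   # entity count after merging
        if before != after:
            changed.append((text, before, after))
    return changed


# Usage with the objects from the report (requires spaCy and a model):
# problems = find_entity_count_changes(texts, nlp, merge_entities)
```

An empty result would mean that merging left the entity count unchanged for every text checked.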
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.