merge_entities does not set IOB correctly for two consecutive entities

Sorry that I have to report a small issue with merge_entities that I came across.

Problem: If two named entities are directly adjacent in a sentence and you call merge_entities to merge their tokens, the entities themselves get merged as well, because the IOB tag of the second entity is set to I instead of B.

Example: Take the original tag sequence B-I-I-B-I-O-O. merge_entities merges the tokens 0:3 and 3:5 correctly, but the resulting IOB tags are B-I-O-O instead of B-B-O-O. As a result, doc.ents returns just one entity. See the example below.
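To make the effect of the wrong tag concrete, here is a minimal pure-Python sketch of how B/I/O tags are typically grouped into entity spans. It is not spaCy's implementation; decode_iob is a hypothetical helper, and it treats a stray I with no open entity as O for simplicity.

```python
def decode_iob(tags):
    """Group (iob, label) pairs into (start, end, label) spans; end is exclusive.

    Illustrative helper only, not part of spaCy.
    """
    spans = []
    start, label = None, None
    for i, (iob, ent_label) in enumerate(tags):
        if iob == "I" and start is not None:
            continue  # still inside the currently open entity
        if start is not None:
            spans.append((start, i, label))  # close the open entity
            start, label = None, None
        if iob == "B":
            start, label = i, ent_label  # open a new entity
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


original = [("B", "ORG"), ("I", "ORG"), ("I", "ORG"),
            ("B", "ORG"), ("I", "ORG"), ("O", ""), ("O", "")]
after_merge_buggy = [("B", "ORG"), ("I", "ORG"), ("O", ""), ("O", "")]
after_merge_expected = [("B", "ORG"), ("B", "ORG"), ("O", ""), ("O", "")]

print(decode_iob(original))              # two entities
print(decode_iob(after_merge_buggy))     # one entity spanning both merged tokens
print(decode_iob(after_merge_expected))  # two single-token entities
```

With B-I after the merge, the second merged token is read as a continuation of the first entity, which is why only one entity comes back.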

import spacy
from spacy.pipeline import merge_entities

nlp = spacy.load('en')
text = "LIfetime Corp said Retirement Housing Corp has accepted."
doc = nlp(text)

# two directly consecutive named entities detected:
# "LIfetime Corp said" and "Retirement Housing Corp"
# the first one is wrong, but that's not the point
for t in doc:
    if t.ent_type_ != '':
        print(f"'{t.text}' {t.ent_iob_} {t.ent_type_} " )
print()

# after merge_entities
doc = merge_entities(doc)
for t in doc:
    if t.ent_type_ != '':
        print(f"'{t.text}' {t.ent_iob_} {t.ent_type_} " )
print()

for e in doc.ents:
    print(f"ents: '{e.text}' {e.label_}" )

Output:

# before merge:
'LIfetime' B ORG 
'Corp' I ORG 
'said' I ORG 
'Retirement' B ORG 
'Housing' I ORG 
'Corp' I ORG 

# after merge:
'LIfetime Corp said' B ORG 
'Retirement Housing Corp' I ORG 

ents: 'LIfetime Corp said Retirement Housing Corp' ORG

So not only the tokens but also the entities got merged. The expected output would be:

'LIfetime Corp said' B ORG 
'Retirement Housing Corp' B ORG
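The expected behaviour can be sketched in plain Python: when each entity span is collapsed into one token, that token starts an entity and should therefore carry B, regardless of what precedes it. The merge_spans helper and the token list are hypothetical (the tokenization of "accepted." is assumed), and this is not how spaCy implements merge_entities.

```python
def merge_spans(tokens, spans):
    """Collapse each entity span into a single token tagged B.

    tokens: list of (text, iob, label) tuples
    spans:  sorted, non-overlapping (start, end, label) tuples; end is exclusive
    Illustrative helper only, not part of spaCy.
    """
    merged = []
    pos = 0
    for start, end, label in spans:
        merged.extend(tokens[pos:start])  # copy tokens outside any entity
        text = " ".join(t[0] for t in tokens[start:end])
        merged.append((text, "B", label))  # every merged entity begins anew
        pos = end
    merged.extend(tokens[pos:])
    return merged


tokens = [
    ("LIfetime", "B", "ORG"), ("Corp", "I", "ORG"), ("said", "I", "ORG"),
    ("Retirement", "B", "ORG"), ("Housing", "I", "ORG"), ("Corp", "I", "ORG"),
    ("has", "O", ""), ("accepted", "O", ""), (".", "O", ""),
]
spans = [(0, 3, "ORG"), (3, 6, "ORG")]

merged = merge_spans(tokens, spans)
for text, iob, label in merged:
    if label:
        print(f"'{text}' {iob} {label}")
```

Under this scheme the number of entities before and after merging stays the same, since every span yields exactly one B-tagged token.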

Operating System: Linux
Python Version Used: 3.8
spaCy Version Used: 2.3

Top GitHub Comments

jsalbr commented, Aug 31, 2020

Hi, surprisingly, the problem does not show up with an entity ruler, only with the model.

In my opinion, however, the expected behaviour is that the number of entities does not change after calling merge_entities. In practice it changes frequently (e.g. in 5% of the articles in the Reuters corpus).

Here is a sentence that also causes the problem in 2.3.2 (not with a ruler, though, only with the pretrained en_core_web_sm model):

text = """
Digicon Inc said it has completed the previously-announced disposition
of its computer systems division to an investment group led by
Rotan Mosle Inc's Rotan Mosle Technology Partners Ltd affiliate.
"""

I have attached a small notebook with an updated version of my example and a piece of code that produces over 100 such examples from a subset of 2000 articles of the Reuters corpus. The notebook includes all the output.

Spacy_Merge_Entities_Issue.zip

github-actions[bot] commented, Nov 1, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.