NER: combining BIO tokens to form compound words
Is there any way to combine BIO-tagged tokens into compound words? I implemented the method below to merge words, but it does not work well for words containing punctuation. For example, `S.E.C` is joined by this function as `S . E . C`.
```python
import itertools

def collapse(ner_result):
    # List with the result
    collapsed_result = []

    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:
        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([" ".join(current_entity_tokens), current_entity])

            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]

        # If the entity continues ...
        elif current_entity is not None and tag == "I-" + current_entity:
            # ... just add the token to the buffer
            current_entity_tokens.append(token)

        else:
            # "O" tag (or a stray I- tag): flush any buffered entity, keep the token as-is
            if current_entity is not None:
                collapsed_result.append([" ".join(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag])
            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was an entity at all
    if current_entity is not None:
        collapsed_result.append([" ".join(current_entity_tokens), current_entity])

    # Sort and drop duplicate entries
    collapsed_result = sorted(collapsed_result)
    collapsed_result = [k for k, _ in itertools.groupby(collapsed_result)]
    return collapsed_result
```
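For example, running it on sub-token output reproduces the punctuation problem described above (the input here is illustrative):

```python
ner_result = [("S", "B-ORG"), (".", "I-ORG"), ("E", "I-ORG"),
              (".", "I-ORG"), ("C", "I-ORG")]
print(collapse(ner_result))
# [['S . E . C', 'ORG']]  <- tokens are space-joined, so "S.E.C" is lost
```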
Is there any workaround to form compound words?
I figured out a solution to this problem.
**Update:** This solves most cases, but there will always be outliers. For example, the tags for the sentence "U.S. Securities and Exchange Commission" are `['U.S.', 'B-ORG'] ['Securities', 'I-ORG'] ['and', 'I-ORG'] ['Exchange', 'I-ORG'] ['Commission', 'I-ORG']`, and running the collapse function changed the sentence into "U.S.Securities and Exchange Commission".
So the complete solution is to track the identity of the word that produced each token by creating a lookup table (LUT) for the original sentence. Then, given a token's index, you know exactly which word it came from, and you can simply concatenate tokens that belong to the same word, adding a space only when a token belongs to a different word. The NER result would then look something like:

`[["U", "B-ORG", 0], [".", "I-ORG", 0], ["S", "I-ORG", 0], [".", "I-ORG", 0], ["Securities", "I-ORG", 1], ["and", "I-ORG", 2], ["Exchange", "I-ORG", 3], ["Commission", "I-ORG", 4]]`
Got it, sounds good. Let me try that; if it works, I will close the issue. Also, is there any parameter to indicate that I will pass tokens to the pre-trained model instead of raw text? Or do I have to change the source-code function (`bert_ner_preprocessor`) to remove the tokenization step?
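(For comparison: this is an assumption on my part, and `bert_ner_preprocessor` may expose it differently, but Hugging Face tokenizers accept pre-tokenized input via the `is_split_into_words` flag; the model name below is illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Pre-tokenized input: skip the tokenizer's own word splitting
words = ["U.S.", "Securities", "and", "Exchange", "Commission"]
enc = tokenizer(words, is_split_into_words=True)

print(enc.word_ids())  # e.g. [None, 0, 0, 0, 0, 1, 2, 3, 4, None]
```
)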
Thank you for your help.