NER combining BIO schema words


Is there any way to combine the BIO-tagged tokens back into compound words? I implemented the method below to combine them, but it does not work well for words that contain punctuation. For example, S.E.C is joined as S . E . C

Code:

import itertools

def collapse(ner_result):
    # List with the result
    collapsed_result = []

    # Buffer for the tokens of the entity currently being built
    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:

        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([" ".join(current_entity_tokens), current_entity])

            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]

        # If the entity continues ...
        elif current_entity is not None and tag == "I-" + current_entity:
            # Just add the token to the buffer
            current_entity_tokens.append(token)
        else:
            # Any other tag: flush the buffer and store the token on its own
            collapsed_result.append([" ".join(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])

            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was some entity at all
    if current_entity is not None:
        collapsed_result.append([" ".join(current_entity_tokens), current_entity])
        # Sort and drop duplicate entries
        collapsed_result = sorted(collapsed_result)
        collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))

    return collapsed_result

Is there any workaround to form compound words?

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
Rajmehta123 commented, May 19, 2020

I figured out a solution to this problem.

import itertools

def join_tokens(tokens):
    res = ''
    if tokens:
        res = tokens[0]
        for token in tokens[1:]:
            if not (token.isalpha() and res[-1].isalpha()):
                res += token  # punctuation
            else:
                res += ' ' + token  # regular word
    return res

def collapse(ner_result):
    # List with the result
    collapsed_result = []


    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:

        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([join_tokens(current_entity_tokens), current_entity])

            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]

        # If the entity continues ...
        elif current_entity is not None and tag == "I-" + current_entity:
            # Just add the token buffer
            current_entity_tokens.append(token)
        else:
            collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
            collapsed_result.append([token,tag[2:]])

            current_entity_tokens = []
            current_entity = None

            pass

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there were some entity at all
    if current_entity is not None:
        collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
        collapsed_result = sorted(collapsed_result)
        collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))

    return collapsed_result

Update: This solves most cases, but there will always be outliers. For example, the tags for the sentence "U.S. Securities and Exchange Commission" are

['U.S.', 'B-ORG'] ['Securities', 'I-ORG'] ['and', 'I-ORG'] ['Exchange', 'I-ORG'] ['Commission', 'I-ORG']

Running collapse on them turns the sentence into "U.S.Securities and Exchange Commission", because join_tokens inserts a space only when the token is alphabetic and the character before it is alphabetic too, and "U.S." ends in a period.

So the complete solution is to track which original word each token came from, by creating a lookup table (LUT) for the original sentence:

text="U.S. Securities and Exchange Commission"
lut = [(token, ix) for ix, word in enumerate(text.split()) for token in tokenize(word)]
# lut = [("U",0), (".",0), ("S",0), (".",0), ("Securities",1), ("and",2), ("Exchange",3), ("Commission",4)]

Now, given a token's index, you know exactly which word it came from. You can simply concatenate tokens that belong to the same word and add a space whenever a token belongs to a different word. The NER result would then look something like:

[["U","B-ORG", 0], [".","I-ORG", 0], ["S", "I-ORG", 0], [".","I-ORG", 0], ['Securities', 'I-ORG', 1], ['and', 'I-ORG', 2], ['Exchange', 'I-ORG',3], ['Commission', 'I-ORG', 4]]
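
The LUT idea above could be sketched roughly as follows. This is only an illustration under a few assumptions: `tokenize` here is a hypothetical regex stand-in for whatever tokenizer the model actually uses, and `build_lut`, `join_by_word` and `collapse_with_lut` are names made up for this example. Unlike the `collapse` function above, it skips tokens outside an entity and does not sort or deduplicate the result.

```python
import re

def tokenize(word):
    # Hypothetical stand-in tokenizer (an assumption, not the model's own):
    # splits a word into runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", word)

def build_lut(text):
    # Pair every token with the index of the whitespace-separated word it came from.
    return [(token, ix) for ix, word in enumerate(text.split()) for token in tokenize(word)]

def join_by_word(tokens_with_ix):
    # Concatenate tokens of the same word; add a space when the word index changes.
    pieces = []
    prev_ix = None
    for token, ix in tokens_with_ix:
        if prev_ix is not None and ix != prev_ix:
            pieces.append(" ")
        pieces.append(token)
        prev_ix = ix
    return "".join(pieces)

def collapse_with_lut(ner_result):
    # ner_result: iterable of [token, tag, word_index] entries.
    collapsed = []
    buffer, entity = [], None
    for token, tag, ix in ner_result:
        if tag.startswith("I-") and entity == tag[2:]:
            # The current entity continues.
            buffer.append((token, ix))
            continue
        if entity is not None:
            # Any other tag closes the entity in the buffer.
            collapsed.append([join_by_word(buffer), entity])
            buffer, entity = [], None
        if tag.startswith("B-"):
            buffer, entity = [(token, ix)], tag[2:]
        # Tokens tagged "O" (or a stray "I-" tag) are skipped in this sketch.
    if entity is not None:
        collapsed.append([join_by_word(buffer), entity])
    return collapsed

text = "U.S. Securities and Exchange Commission"
lut = build_lut(text)
# Tags as in the example above: the first token opens the entity, the rest continue it.
tags = ["B-ORG"] + ["I-ORG"] * (len(lut) - 1)
ner_result = [[token, tag, ix] for (token, ix), tag in zip(lut, tags)]
print(collapse_with_lut(ner_result))
# → [['U.S. Securities and Exchange Commission', 'ORG']]
```

Because spacing is decided by the word index rather than by inspecting characters, the "U.S.Securities" case no longer depends on whether a token happens to end in punctuation.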

0 reactions
Rajmehta123 commented, May 14, 2020

Got it. Sounds good. Let me try that; if it works, I will close the issue. Also, is there any parameter to indicate that I will pass tokens, rather than raw text, to the pre-trained model? Or do I have to change the source code function (bert_ner_preprocessor) to remove the tokenization step?

Thank you for your help.
