Offsets returned by model.predict are not usable if there is whitespace in the text.
This is a huge problem, because I would like to create stand-off annotations for the detected entities in the original document.
For example, the input text may look like this
txt = "          Microsoft and Apple" # (starting with ten spaces)
then what model.predict([txt])
returns is:
[{'entity': [
{'type': 'organization', 'position': [0, 9], 'mention': 'Microsoft', 'probability': 0.9995076656341553},
{'type': 'organization', 'position': [14, 19], 'mention': 'Apple', 'probability': 0.9992972612380981}],
'sentence': 'Microsoft and Apple'}]
As can be seen, the leading whitespace has also been removed from the returned “sentence” field.
This also happens if the whitespace is in the middle of the sentence, e.g.
txt = "Microsoft          and Apple" # (ten spaces after Microsoft)
returns
[{'entity':
[{'type': 'organization', 'position': [0, 9], 'mention': 'Microsoft', 'probability': 0.9995076656341553},
{'type': 'organization', 'position': [14, 19], 'mention': 'Apple', 'probability': 0.9992972612380981}],
'sentence': 'Microsoft and Apple'}]
Again, the returned sentence text contains a single space where the original text contained ten.
This makes it hard to reliably map the offsets back to the true offsets in the original text. It is also unclear which other characters cause the offsets to change. Is there a way to guarantee getting back the proper offsets, or at least to get information about which characters in the original text have been removed? Where exactly does this happen in the code?
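To make the mismatch concrete, here is the first example again: the positions returned by model.predict refer to the normalized sentence, so applying them directly to the original string selects the wrong characters.

```python
txt = "          Microsoft and Apple"  # ten leading spaces

# 'position': [0, 9] was reported for 'Microsoft', [14, 19] for 'Apple',
# but those spans index into the normalized sentence, not into txt:
print(repr(txt[0:9]))    # -> '         ' (nine spaces, not 'Microsoft')
print(repr(txt[14:19]))  # -> 'osoft' (not 'Apple')
```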
Issue Analytics
- Created: a year ago
- Comments: 5 (1 by maintainers)
Hi, thanks so much for working around the issue; this is indeed not healthy behavior. In the code that normalizes the half-spaces, I’ll try to keep the pre-processing information, with which we can restore the original input and offsets and adjust the predicted entity spans correctly.
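One way to keep that pre-processing information is to record, while collapsing whitespace, the original index of every surviving character; half-open entity spans over the normalized sentence can then be mapped back. The following is a minimal sketch of that idea, not the library’s actual code (both function names are hypothetical):

```python
def normalize_with_map(text):
    """Collapse whitespace runs to a single space and strip the ends,
    recording for each character of the normalized text its index in
    the original text."""
    normalized = []
    mapping = []          # mapping[i] = index in `text` of normalized[i]
    prev_ws = True        # True so leading whitespace is dropped
    for i, ch in enumerate(text):
        if ch.isspace():
            prev_ws = True
            continue
        if prev_ws and normalized:
            # re-insert a single space; map it to the whitespace char
            # immediately before the current character
            normalized.append(" ")
            mapping.append(i - 1)
        normalized.append(ch)
        mapping.append(i)
        prev_ws = False
    return "".join(normalized), mapping

def restore_span(mapping, start, end):
    """Map a half-open [start, end) span in the normalized text back
    to the corresponding span in the original text."""
    return mapping[start], mapping[end - 1] + 1

txt = "          Microsoft and Apple"   # ten leading spaces
norm, m = normalize_with_map(txt)
print(norm)                             # -> Microsoft and Apple
print(restore_span(m, 0, 9))            # -> (10, 19)  'Microsoft' in txt
print(restore_span(m, 14, 19))          # -> (24, 29)  'Apple' in txt
```

The predicted positions in the issue above ([0, 9] for the nine-character 'Microsoft') are half-open, which is what restore_span assumes.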
You can find my workaround here: https://github.com/GateNLP/python-gatenlp-ml-tner/blob/cb367881516b7d130aa888bda126a7a494828cf6/gatenlp_ml_tner/annotators.py#L79
It is based on the assumption that only leading and repeated whitespace causes the misalignments, and it uses NLTK’s align_tokens method for help.
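In spirit, align_tokens performs a left-to-right substring search for each token in the original, untouched string. A minimal self-contained equivalent (the function name is hypothetical; this is not the code from the repository linked above):

```python
def align_mentions(mentions, original_text):
    """Locate each mention left to right in the original text and
    return half-open (start, end) character spans, mimicking what
    nltk.tokenize.util.align_tokens does for plain tokens."""
    spans = []
    point = 0
    for mention in mentions:
        start = original_text.index(mention, point)  # ValueError if absent
        point = start + len(mention)
        spans.append((start, point))
    return spans

txt = "          Microsoft and Apple"   # ten leading spaces
print(align_mentions(["Microsoft", "Apple"], txt))
# -> [(10, 19), (24, 29)]
```

Note that this greedy search can still go wrong if a mention’s text also occurs earlier in the sentence as part of a different token, which is why getting the true offsets from the model would be preferable.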
However, getting the correct offsets right away would obviously be much better. Note that all “fast” tokenizers in the Hugging Face library can give you the original character offsets for each token, as the library offers
Encoding.token_to_chars(tokenidx)
and similar methods to help with this.