Offsets returned by model.predict are not usable if there is whitespace in the text.
This is a huge problem, because I would like to create stand-off annotations for the detected entities in the original document.
For example, the input text may look like this
txt = "          Microsoft and Apple" # (starting with ten spaces)
then what model.predict([txt])
returns is:
[{'entity': [
{'type': 'organization', 'position': [0, 9], 'mention': 'Microsoft', 'probability': 0.9995076656341553},
{'type': 'organization', 'position': [14, 19], 'mention': 'Apple', 'probability': 0.9992972612380981}],
'sentence': 'Microsoft and Apple'}]
As can be seen, the leading whitespace has also been removed from the returned “sentence” field.
This also happens if the whitespace is in the middle of the sentence, e.g.
txt = "Microsoft          and Apple" # (ten spaces after Microsoft)
returns
[{'entity':
[{'type': 'organization', 'position': [0, 9], 'mention': 'Microsoft', 'probability': 0.9995076656341553},
{'type': 'organization', 'position': [14, 19], 'mention': 'Apple', 'probability': 0.9992972612380981}],
'sentence': 'Microsoft and Apple'}]
Again, the returned sentence text contains a single space where the original text contained ten.
This makes it hard to reliably map the offsets back to the true offsets in the original text. It is also unclear which other characters cause the offsets to change. Is there a way to guarantee getting back the proper offsets, or at least to get information about which characters in the original text have been removed? Where exactly does this happen in the code?
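To make the mismatch concrete, here is the first example again: the positions returned by model.predict refer to the normalized sentence, so applying them directly to the original string selects the wrong characters.

```python
txt = "          Microsoft and Apple"  # ten leading spaces

# 'position': [0, 9] was reported for 'Microsoft', [14, 19] for 'Apple',
# but those spans index into the normalized sentence, not into txt:
print(repr(txt[0:9]))    # -> '         ' (nine spaces, not 'Microsoft')
print(repr(txt[14:19]))  # -> 'osoft' (not 'Apple')
```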
Issue Analytics
- Created: a year ago
- Comments: 5 (1 by maintainers)
Hi, thanks so much for working around the issue; this is indeed not healthy behavior. In the code that normalizes the half-spaces, I’ll try to keep the pre-processing information, with which we can restore the original input and offsets and adjust the predicted entity spans correctly.
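One way to keep that pre-processing information is to record, while collapsing whitespace, the original index of every surviving character; half-open entity spans over the normalized sentence can then be mapped back. The following is a minimal sketch of that idea, not the library’s actual code (both function names are hypothetical):

```python
def normalize_with_map(text):
    """Collapse whitespace runs to a single space and strip the ends,
    recording for each character of the normalized text its index in
    the original text."""
    normalized = []
    mapping = []          # mapping[i] = index in `text` of normalized[i]
    prev_ws = True        # True so leading whitespace is dropped
    for i, ch in enumerate(text):
        if ch.isspace():
            prev_ws = True
            continue
        if prev_ws and normalized:
            # re-insert a single space; map it to the whitespace char
            # immediately before the current character
            normalized.append(" ")
            mapping.append(i - 1)
        normalized.append(ch)
        mapping.append(i)
        prev_ws = False
    return "".join(normalized), mapping

def restore_span(mapping, start, end):
    """Map a half-open [start, end) span in the normalized text back
    to the corresponding span in the original text."""
    return mapping[start], mapping[end - 1] + 1

txt = "          Microsoft and Apple"   # ten leading spaces
norm, m = normalize_with_map(txt)
print(norm)                             # -> Microsoft and Apple
print(restore_span(m, 0, 9))            # -> (10, 19)  'Microsoft' in txt
print(restore_span(m, 14, 19))          # -> (24, 29)  'Apple' in txt
```

The predicted positions in the issue above ([0, 9] for the nine-character 'Microsoft') are half-open, which is what restore_span assumes.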
You can find my workaround here: https://github.com/GateNLP/python-gatenlp-ml-tner/blob/cb367881516b7d130aa888bda126a7a494828cf6/gatenlp_ml_tner/annotators.py#L79
It is based on the assumption that only leading and repeated whitespace causes the misalignments, and it uses NLTK’s align_tokens method for help.
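In spirit, align_tokens performs a left-to-right substring search for each token in the original, untouched string. A minimal self-contained equivalent (the function name is hypothetical; this is not the code from the repository linked above):

```python
def align_mentions(mentions, original_text):
    """Locate each mention left to right in the original text and
    return half-open (start, end) character spans, mimicking what
    nltk.tokenize.util.align_tokens does for plain tokens."""
    spans = []
    point = 0
    for mention in mentions:
        start = original_text.index(mention, point)  # ValueError if absent
        point = start + len(mention)
        spans.append((start, point))
    return spans

txt = "          Microsoft and Apple"   # ten leading spaces
print(align_mentions(["Microsoft", "Apple"], txt))
# -> [(10, 19), (24, 29)]
```

Note that this greedy search can still go wrong if a mention’s text also occurs earlier in the sentence as part of a different token, which is why getting the true offsets from the model would be preferable.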
However, getting the correct offsets right away would obviously be much better. Note that all “fast” tokenizers in the Hugging Face library can give you the original character offsets for each token, as the library offers
Encoding.token_to_chars(tokenidx)
and similar methods to help with this.