Offsets returned by model.predict are not usable if there is whitespace in the text.

This is a significant problem because I would like to create stand-off annotations for the detected entities in the original document.

For example, suppose the text looks like this:

txt = "          Microsoft and Apple" # (starting with ten spaces)

Then model.predict([txt]) returns:

[{'entity': [
  {'type': 'organization', 'position': [0, 9], 'mention': 'Microsoft', 'probability': 0.9995076656341553}, 
  {'type': 'organization', 'position': [14, 19], 'mention': 'Apple', 'probability': 0.9992972612380981}], 
'sentence': 'Microsoft and Apple'}]

As can be seen, the leading whitespace has also been removed from the returned “sentence” field.

This also happens when the extra whitespace is in the middle of the sentence, e.g.:

txt = "Microsoft          and Apple" # (ten spaces after Microsoft)

returns

[{'entity': 
  [{'type': 'organization', 'position': [0, 9], 'mention': 'Microsoft', 'probability': 0.9995076656341553}, 
  {'type': 'organization', 'position': [14, 19], 'mention': 'Apple', 'probability': 0.9992972612380981}], 
'sentence': 'Microsoft and Apple'}]

Again, the returned sentence text contains a single space where the original text contained ten.

This makes it hard to reliably map the offsets back to the true offsets in the original text. It is also unclear which other characters would cause the offsets to change. Is there a way to guarantee getting back the proper offsets, or at least to get information about which characters in the original text have been removed? Where exactly does this happen in the code?
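One possible way to recover the offsets is sketched below. It assumes that the pre-processing only strips leading and trailing whitespace and collapses each run of whitespace to a single space, which has not been verified against the library's code. The idea is to redo that normalization yourself while recording, for every character of the normalized text, its index in the original text. The names normalize_with_map and to_original_span are hypothetical helpers, not part of any library.

```python
import re


def normalize_with_map(text):
    """Collapse whitespace runs to one space and strip both ends.

    Returns the normalized text plus a list that maps each character
    index of the normalized text to an index in the original text.
    ASSUMPTION: this mirrors what the model's pre-processing does.
    """
    chars = []
    index_map = []
    for match in re.finditer(r"\S+", text):
        if chars:
            # One separator space stands in for the whole whitespace run;
            # map it to the last whitespace character before this token.
            chars.append(" ")
            index_map.append(match.start() - 1)
        for i in range(match.start(), match.end()):
            chars.append(text[i])
            index_map.append(i)
    return "".join(chars), index_map


def to_original_span(position, index_map):
    """Convert a [start, end) span on the normalized text to the original."""
    start, end = position
    if not 0 <= start < end <= len(index_map):
        raise ValueError(f"span {position} is outside the normalized text")
    return index_map[start], index_map[end - 1] + 1


txt = " " * 10 + "Microsoft and Apple"
normalized, index_map = normalize_with_map(txt)
start, end = to_original_span([14, 19], index_map)
print(normalized, (start, end), txt[start:end])
```

For the first example above, the span [14, 19] reported for 'Apple' maps to (24, 29) in the original text. Before trusting the result, it is worth checking that the normalized string equals the returned “sentence” field; if it does not, the assumption about the pre-processing does not hold for that input.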

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments:5 (1 by maintainers)

Top GitHub Comments

asahi417 commented, Jul 2, 2022

Hi, thanks so much for working around the issue. This is indeed not healthy behavior. In the code that normalizes the halfspaces, I’ll try to keep the pre-processing information so that we can restore the original input and offsets and adjust the predicted entity spans correctly.

johann-petrak commented, Jul 8, 2022

You can find my workaround here: https://github.com/GateNLP/python-gatenlp-ml-tner/blob/cb367881516b7d130aa888bda126a7a494828cf6/gatenlp_ml_tner/annotators.py#L79

It is based on the assumption that only repeated and leading whitespace cause the misalignments, and it uses the nltk align_tokens method for help.
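The alignment idea can be sketched roughly as follows. This is not the code from the linked workaround. The helper align_tokens_simple is a hypothetical pure-Python stand-in that finds each token in a text from left to right, and remap_position is likewise a hypothetical name. The sketch assumes that the returned sentence and the original text contain the same non-whitespace tokens in the same order.

```python
def align_tokens_simple(tokens, text):
    """Return (start, end) spans for each token, searching left to right.

    Hypothetical stand-in for a token alignment helper; raises ValueError
    if a token cannot be found after the previous one.
    """
    spans = []
    cursor = 0
    for token in tokens:
        start = text.index(token, cursor)
        cursor = start + len(token)
        spans.append((start, cursor))
    return spans


def remap_position(position, sentence, original):
    """Map a [start, end) span on `sentence` to a span on `original`.

    ASSUMPTION: both strings split into the same whitespace-separated
    tokens, so only the amount of whitespace differs between them.
    """
    tokens = sentence.split()
    norm_spans = align_tokens_simple(tokens, sentence)
    orig_spans = align_tokens_simple(tokens, original)
    start, end = position
    # Indices of the tokens that overlap the entity span.
    covered = [i for i, (s, e) in enumerate(norm_spans) if s < end and e > start]
    if not covered:
        raise ValueError(f"span {position} does not overlap any token")
    first, last = covered[0], covered[-1]
    # Keep the same distance from the token boundaries in the original text.
    new_start = orig_spans[first][0] + (start - norm_spans[first][0])
    new_end = orig_spans[last][1] - (norm_spans[last][1] - end)
    return new_start, new_end


original = "Microsoft" + " " * 10 + "and Apple"
sentence = "Microsoft and Apple"
print(remap_position([14, 19], sentence, original))
```

Unlike rebuilding the normalization, this variant relies only on the returned “sentence” field, so it keeps working as long as the tokens themselves are unchanged. It breaks if the pre-processing alters characters inside a token.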

However, getting the correct offsets right away would obviously be much better. Note that all “fast” tokenizers in the huggingface library can give you the original offsets for each transformers token as the library offers Encoding.token_to_chars(tokenidx) and similar methods to help with this.
