Lookup tokens by character offset / Regex match
First, thank you so much for creating this revolutionary tool!
Enhancement/Request:
One thing I often want to do is search a spaCy doc using regex. As noted by others (e.g., #475, #486), the pattern Matcher does not currently support regex for a number of reasons.
Currently, I am using the following work-around (essentially, mapping a regex match from the raw text back to spacy tokens/spans):
```python
import re
import spacy
import pandas as pd

nlp = spacy.load('en')
doc = nlp('The cat sat on the $500 dollar mat')

def token_i_and_idx(doc):
    return pd.DataFrame([[t.idx, t.i] for t in doc], columns=['t_idx', 't_i'])

def idx_to_token_i_map(doc):
    chr_offsets = pd.DataFrame(list(range(len(doc.text) + 1)))
    map_df = pd.merge(chr_offsets, token_i_and_idx(doc), how='left', left_index=True,
                      right_on='t_idx').t_i.fillna(method='ffill').reset_index()
    return map_df.to_dict()['t_i']

def regex_to_doc_span(regex, doc):
    """Returns the spaCy tokens/spans matching the regex."""
    offset_map = idx_to_token_i_map(doc)
    return (doc[offset_map[m.start()]:offset_map[m.end()] + 1]
            for m in regex.finditer(doc.text))

regex = re.compile(r'\$\d\d')
print(list(regex_to_doc_span(regex, doc)))
# Out[66]:
# [$500]

regex = re.compile(r'\$\d+\s\w+\s\w+')
print([(t, t.ent_type_) for span in regex_to_doc_span(regex, doc) for t in span])
# Out[70]:
# [($, 'MONEY'), (500, 'MONEY'), (dollar, 'MONEY'), (mat, '')]
```
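The pandas merge above essentially builds a character-offset-to-token-index lookup table. The same mapping can be done with the standard library alone by running `bisect` over the sorted token start offsets (`t.idx`). A minimal sketch, with the token offsets for the example sentence hard-coded so it runs without spaCy (in a real pipeline they would come from `[t.idx for t in doc]`):

```python
import re
from bisect import bisect_right

def char_to_token(starts, char_offset):
    """Index of the token containing char_offset, given sorted token starts."""
    return bisect_right(starts, char_offset) - 1

def regex_to_token_spans(starts, text, pattern):
    """Yield (start_token, end_token) slice bounds for each regex match."""
    for m in re.finditer(pattern, text):
        yield char_to_token(starts, m.start()), char_to_token(starts, m.end() - 1) + 1

# Token start offsets as spaCy would produce them for this sentence.
text = 'The cat sat on the $500 dollar mat'
starts = [0, 4, 8, 12, 15, 19, 20, 24, 31]

print(list(regex_to_token_spans(starts, text, r'\$\d+')))
# [(5, 7)]  -> doc[5:7] would be the span "$500"
```

Using `m.end() - 1` (the last character inside the match) avoids the edge case where `m.end()` points at the first character of the following token.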
While the above has been working, I was wondering whether there is a better way to do this (or if this issue is already being addressed in updates to the Matcher).
To the extent regex is not incorporated into the Matcher, is there any way to allow lookup of tokens by the character offset of the token's underlying text (similar to how we can access the sentence containing any given span)? This would make it easier to translate regex matches into spaCy tokens/spans (which we can then search for other attributes like `ent_type`, POS, etc.).
Thank you again!!
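For what it's worth, the character-offset lookup requested above exists in later spaCy releases as `Doc.char_span(start_char, end_char)`, which returns a `Span` when the offsets line up with token boundaries (and `None` otherwise). A sketch using a blank English pipeline (tokenizer only, no model download needed), which reduces the work-around to:

```python
import re
import spacy

nlp = spacy.blank('en')  # tokenizer only; no trained model required
doc = nlp('The cat sat on the $500 dollar mat')

def regex_to_doc_spans(pattern, doc):
    """Map regex matches on doc.text to spaCy Spans via Doc.char_span."""
    for m in re.finditer(pattern, doc.text):
        span = doc.char_span(m.start(), m.end())
        if span is not None:  # None when offsets fall mid-token
            yield span

print([span.text for span in regex_to_doc_spans(r'\$\d+', doc)])
```

Note that entity labels such as `MONEY` would still require a trained model rather than `spacy.blank`.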
Issue Analytics
- State:
- Created 6 years ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
I'm a huge fan of regexps, and "why can't I use them?" was my first question about spaCy (I'm a newbie in practical NLP). But as far as I understand (really only a little) context-free grammars, regexes have two problems:
So maybe you just don't need them as much as you think.
P.S. I'm not a pro, and answered just to start a discussion.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.