
Lookup tokens by character offset / Regex match

See original GitHub issue

First, thank you so much for creating this revolutionary tool!

Enhancement/Request:
One thing I often want to do is search a spaCy Doc using regex. As noted by others (e.g., #475, #486), the pattern Matcher does not currently support regex for a number of reasons.
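For later readers: newer spaCy versions (v2.1+) do accept per-token regular expressions inside Matcher patterns via the REGEX operator. A minimal sketch using a blank English pipeline (tokenizer only, no model download):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("The cat sat on the $500 dollar mat")

matcher = Matcher(nlp.vocab)
# "$500" tokenizes as two tokens, "$" and "500", so match them separately;
# the REGEX operator applies per token, not across the whole text.
pattern = [{"TEXT": "$"}, {"TEXT": {"REGEX": r"^\d+$"}}]
matcher.add("DOLLAR_AMOUNT", [pattern])  # v3 API; v2 used matcher.add(key, None, pattern)

for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # $500
```

Note that because the operator is per token, it cannot express patterns that cross token boundaries, which is why the character-offset workaround below is still useful.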

Currently, I am using the following work-around (essentially, mapping a regex match from the raw text back to spacy tokens/spans):

import re
import spacy
import pandas as pd

nlp = spacy.load('en')

doc = nlp('The cat sat on the $500 dollar mat')

def token_i_and_idx(doc):
    # One row per token: its character offset (t_idx) and token index (t_i)
    return pd.DataFrame([[t.idx, t.i] for t in doc], columns=['t_idx', 't_i'])

def idx_to_token_i_map(doc):
    # Map every character offset in doc.text to the index of the token covering it
    chr_offsets = pd.DataFrame(list(range(len(doc.text) + 1)))
    map_df = (pd.merge(chr_offsets, token_i_and_idx(doc), how='left',
                       left_index=True, right_on='t_idx')
                .t_i.fillna(method='ffill').astype(int).reset_index())
    return map_df.to_dict()['t_i']

def regex_to_doc_span(regex, doc):
    """Yields the spaCy span covering each regex match in doc.text."""
    offset_to_token = idx_to_token_i_map(doc)
    return (doc[offset_to_token[m.start()]:offset_to_token[m.end()] + 1]
            for m in regex.finditer(doc.text))

regex = re.compile(r'\$\d\d')
print(list(regex_to_doc_span(regex, doc)))
# [$500]

regex = re.compile(r'\$\d+\s\w+\s\w+')
print([(t, t.ent_type_) for span in regex_to_doc_span(regex, doc) for t in span])
# [($, 'MONEY'), (500, 'MONEY'), (dollar, 'MONEY'), (mat, '')]

While the above has been working, I was wondering whether there is a better way to do this (or if this issue is already being addressed in updates to the Matcher).
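The pandas merge/ffill pipeline above is really a "last token starting at or before this offset" lookup, so the same map can be sketched without pandas using the standard-library bisect module (the char_to_token helper name is my own):

```python
from bisect import bisect_right

def char_to_token(token_starts, char_offset):
    # Index of the last token whose start offset is <= char_offset,
    # i.e. the token whose text covers that character.
    return bisect_right(token_starts, char_offset) - 1

# Token start offsets for "The cat sat on the $500 dollar mat"
# (what [t.idx for t in doc] would give; "$500" splits into "$" and "500")
starts = [0, 4, 8, 12, 15, 19, 20, 24, 31]

print(char_to_token(starts, 19))  # 5 -> the "$" token
print(char_to_token(starts, 22))  # 6 -> the "500" token
```

Each lookup is O(log n) in the number of tokens and avoids materialising a dict entry for every character offset.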

To the extent regex is not incorporated into the Matcher, is there any way to allow lookup of tokens by the character offset of the token’s underlying text (similar to how we can access a sentence containing any given span)? This would make it easier to translate regex matches into spaCy tokens / spans (which we can then search for other attributes like ent_type, POS, etc.).
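For later readers: spaCy later exposed exactly this lookup as Doc.char_span(start, end), which returns the Span covering a character range, or None when the offsets do not line up with token boundaries. That makes the regex-to-span mapping a short loop; a sketch with a blank pipeline:

```python
import re
import spacy

nlp = spacy.blank("en")
doc = nlp("The cat sat on the $500 dollar mat")

for m in re.finditer(r"\$\d+", doc.text):
    # char_span returns None when the match doesn't align with token
    # boundaries, so filter those out.
    span = doc.char_span(m.start(), m.end())
    if span is not None:
        print(span.text)  # $500
```

Recent versions also take an alignment_mode argument ("expand"/"contract") to snap misaligned offsets to token boundaries instead of returning None.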

Thank you again!!

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
slavaGanzin commented, May 12, 2017

I’m a huge fan of regexp, and “why can’t I use it?” was my first question to spaCy (I’m a newbie in practical NLP). But as far as I understand (really only a little about) context-free grammars, regexes have two problems:

  • they can’t express context-free rules like:

DOLLAR_SIGN = $
MONEY_SIGN = DOLLAR_SIGN|YENA_SIGN|...
MONEY = MONEY_SIGN SPACE NUMBER | NUMBER SPACE MONEY_SIGN

  • they will never work in O(n) by definition

So maybe you just don’t need them as much as you think.

p.s. I’m not a pro, and answered just to start a discussion
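The MONEY grammar sketched in the comment above maps fairly directly onto spaCy's token-level Matcher, which runs linearly over tokens. A rough equivalent (IS_CURRENCY and LIKE_NUM are real Matcher attributes; the pattern itself is my approximation of the grammar):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# MONEY = MONEY_SIGN NUMBER | NUMBER MONEY_SIGN, expressed over tokens
money_patterns = [
    [{"IS_CURRENCY": True}, {"LIKE_NUM": True}],
    [{"LIKE_NUM": True}, {"IS_CURRENCY": True}],
]
matcher.add("MONEY", money_patterns)

doc = nlp("The cat sat on the $500 dollar mat")
for _, start, end in matcher(doc):
    print(doc[start:end].text)  # $500
```

IS_CURRENCY covers any currency symbol ($, ¥, €, ...), so the MONEY_SIGN alternation in the grammar comes for free.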

0 reactions
lock[bot] commented, May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Read more comments on GitHub >

Top Results From Across the Web

Tokens and Wildcards – Regular Expressions for Biologists
Compose regular expressions that include tokens to match particular classes of character. Describe the risks associated with using tokens and ...

Efficient way to find the token (word) index after a regular ...
Search for x in y using the first regular expression and get the character offset z · Split y into an array of...

Ultimate Regex Cheat Sheet - KeyCDN Support
This guide provides a regex cheat sheet as well as example use-cases that you can use as a reference when creating your regex...

Basic Regular Expression Syntax
A regex is a specification of a pattern to be matched in the searched text. This pattern consists of a sequence of tokens,...

Rule-based matching · spaCy Usage Documentation
spaCy features a rule-matching engine, the Matcher, that operates over tokens, similar to regular expressions. The rules can refer to token annotations ...
