Lookup tokens by character offset / Regex match
First, thank you so much for creating this revolutionary tool!
Enhancement/Request:
One thing I often want to do is search a spaCy doc using regex. As noted by others (e.g., #475, #486), the pattern Matcher does not currently support regex for a number of reasons.
Currently, I am using the following work-around (essentially, mapping a regex match from the raw text back to spacy tokens/spans):
```python
import re
import spacy
import pandas as pd

nlp = spacy.load('en')
doc = nlp('The cat sat on the $500 dollar mat')

def token_i_and_idx(doc):
    return pd.DataFrame([[t.idx, t.i] for t in doc], columns=['t_idx', 't_i'])

def idx_to_token_i_map(doc):
    chr_offsets = pd.DataFrame(list(range(len(doc.text) + 1)))
    map_df = pd.merge(chr_offsets, token_i_and_idx(doc), how='left', left_index=True,
                      right_on='t_idx').t_i.fillna(method='ffill').reset_index()
    return map_df.to_dict()['t_i']

def regex_to_doc_span(regex, doc):
    """Returns the spaCy tokens/spans matching the regex."""
    offset_map = idx_to_token_i_map(doc)
    return (doc[offset_map[m.start()]:offset_map[m.end()] + 1]
            for m in regex.finditer(doc.text))

regex = re.compile(r'\$\d\d')
print(list(regex_to_doc_span(regex, doc)))
# Out[66]:
# [$500]

regex = re.compile(r'\$\d+\s\w+\s\w+')
print([(t, t.ent_type_) for span in regex_to_doc_span(regex, doc) for t in span])
# Out[70]:
# [($, 'MONEY'), (500, 'MONEY'), (dollar, 'MONEY'), (mat, '')]
```
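The pandas merge above essentially builds a character-offset-to-token-index lookup table. The same mapping can be done with the standard library alone by running `bisect` over the sorted token start offsets (`t.idx`). A minimal sketch, with the token offsets for the example sentence hard-coded so it runs without spaCy (in a real pipeline they would come from `[t.idx for t in doc]`):

```python
import re
from bisect import bisect_right

def char_to_token(starts, char_offset):
    """Index of the token containing char_offset, given sorted token starts."""
    return bisect_right(starts, char_offset) - 1

def regex_to_token_spans(starts, text, pattern):
    """Yield (start_token, end_token) slice bounds for each regex match."""
    for m in re.finditer(pattern, text):
        yield char_to_token(starts, m.start()), char_to_token(starts, m.end() - 1) + 1

# Token start offsets as spaCy would produce them for this sentence.
text = 'The cat sat on the $500 dollar mat'
starts = [0, 4, 8, 12, 15, 19, 20, 24, 31]

print(list(regex_to_token_spans(starts, text, r'\$\d+')))
# [(5, 7)]  -> doc[5:7] would be the span "$500"
```

Using `m.end() - 1` (the last character inside the match) avoids the edge case where `m.end()` points at the first character of the following token.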
While the above has been working, I was wondering whether there is a better way to do this (or if this issue is already being addressed in updates to the Matcher).
To the extent regex is not incorporated into the Matcher, is there any way to allow lookup of tokens by the character offset of the token's underlying text (similar to how we can access the sentence containing any given span)? This would make it easier to translate regex matches into spaCy tokens/spans (which we can then search for other attributes like `ent_type`, POS, etc.).
Thank you again!!
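For what it's worth, the character-offset lookup requested above exists in later spaCy releases as `Doc.char_span(start_char, end_char)`, which returns a `Span` when the offsets line up with token boundaries (and `None` otherwise). A sketch using a blank English pipeline (tokenizer only, no model download needed), which reduces the work-around to:

```python
import re
import spacy

nlp = spacy.blank('en')  # tokenizer only; no trained model required
doc = nlp('The cat sat on the $500 dollar mat')

def regex_to_doc_spans(pattern, doc):
    """Map regex matches on doc.text to spaCy Spans via Doc.char_span."""
    for m in re.finditer(pattern, doc.text):
        span = doc.char_span(m.start(), m.end())
        if span is not None:  # None when offsets fall mid-token
            yield span

print([span.text for span in regex_to_doc_spans(r'\$\d+', doc)])
```

Note that entity labels such as `MONEY` would still require a trained model rather than `spacy.blank`.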
Issue Analytics
- State:
- Created 6 years ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
I'm a huge fan of regexps, and "why can't I use them?" was my first question about spaCy (I'm a newbie in practical NLP). But as far as I understand (really only a little) context-free grammars, regexes have two problems:
So maybe you just don't need them as much as you think.
P.S. I'm not a pro, and answered just to start a discussion.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.