Matcher is not able to merge multiple matches per doc
Hi! When I try to merge multiple matches in a sentence/doc with retokenizer.merge, I receive the error IndexError: [E035] Error creating span with start 17 and end 21 for Doc of length 17. It is raised for a later match, once earlier matches have been merged, because the indices no longer line up: the underlying doc has changed, but the start/end indices in matches were not updated. Below I include code to reproduce the issue, as well as a workaround I am using for now. But maybe it would be more elegant to update the start/end indices when retokenizer.merge is called?
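To make the index drift concrete, here is a minimal sketch in plain Python, without spaCy. The token list, the hand-written match tuples and the helper names merge_span and shifted are illustrative assumptions, not spaCy API. It mimics the arithmetic of the workaround: each earlier merge removes (end - start - 1) tokens, so later matches have to be shifted left by that amount. It assumes the matches are sorted and do not overlap.

```python
def merge_span(tokens, start, end):
    """Return a new list in which tokens[start:end] is joined into one item."""
    return tokens[:start] + [" ".join(tokens[start:end])] + tokens[end:]


def shifted(matches, i):
    """Shift match i left by the tokens removed when merging matches 0..i-1.

    Assumes the matches are sorted by start and do not overlap.
    """
    _, start, end = matches[i]
    removed = sum(e - s - 1 for _, s, e in matches[:i])
    return start - removed, end - removed


# Hypothetical pre-tokenized version of the example sentence (24 tokens)
tokens = (
    "Hello World ! ( This is a ) test ! ( This is ) "
    "another one . ( This is ) . This is"
).split()

# (match_id, start, end) tuples as they would look before any merge
matches = [(0, 3, 8), (0, 10, 14), (0, 17, 21)]

for i in range(len(matches)):
    start, end = shifted(matches, i)
    tokens = merge_span(tokens, start, end)

print(tokens)
```

After two merges the list has 17 items, so the unshifted third match (17, 21) points past the end, which is the situation the E035 error reports.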
How to reproduce the behaviour
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span


def merge_tokens(matcher, doc: Doc, i, matches):
    """Merges tokens to one (breaks once an earlier merge has changed the doc)."""
    match_id, start, end = matches[i]
    # start/end refer to the doc as it was when the matcher ran
    span = Span(doc, start, end, label="EVENT")
    with doc.retokenize() as retokenizer:
        attrs = {"LEMMA": "<gs>", "TAG": "<gs>"}
        retokenizer.merge(span, attrs=attrs)


def merge_tokens_fixed(matcher, doc: Doc, i, matches):
    """Merges tokens to one, shifting the indices for earlier merges."""
    match_id, start, end = matches[i]
    # Each earlier merge removed (end - start - 1) tokens from the doc
    for j in range(i):
        span_len = matches[j][2] - matches[j][1] - 1
        start -= span_len
        end -= span_len
    span = Span(doc, start, end, label="EVENT")
    with doc.retokenize() as retokenizer:
        attrs = {"LEMMA": "<gs>", "TAG": "<gs>"}
        retokenizer.merge(span, attrs=attrs)


patterns = [
    [
        {'IS_BRACKET': True},
        {'LOWER': 'this'},
        {'LOWER': 'is'},
        {'LOWER': 'a', 'OP': '*'},
        {'IS_BRACKET': True}
    ]
]

nlp = spacy.load("en_core_web_sm")

# Workaround: callback that corrects the indices before merging
matcher = Matcher(nlp.vocab)
matcher.add('this_is_a', patterns, on_match=merge_tokens_fixed)
text = "Hello World! (This is a) test! (This is) another one. (This is). This is"
doc = nlp(text)
matches = matcher(doc)
print([t.text for t in doc])

# Original callback: raises the IndexError shown below
matcher2 = Matcher(nlp.vocab)
matcher2.add('this_is_a2', patterns, on_match=merge_tokens)
doc2 = nlp(text)
matches2 = matcher2(doc2)
print([t.text for t in doc2])
Output:
['Hello', 'World', '!', '(This is a)', 'test', '!', '(This is)', 'another', 'one', '.', '(This is)', '.', 'This', 'is']
Traceback (most recent call last):
  File "/mnt/c/Users/xxx/AppData/Roaming/JetBrains/PyCharm2020.2/scratches/scratch.py", line 59, in <module>
    matches2 = matcher2(doc2)
  File "matcher.pyx", line 242, in spacy.matcher.matcher.Matcher.__call__
  File "/mnt/c/Users/xxx/AppData/Roaming/JetBrains/PyCharm2020.2/scratches/scratch.py", line 10, in merge_tokens
    span = Span(doc, start, end, label="EVENT")
  File "span.pyx", line 104, in spacy.tokens.span.Span.__cinit__
IndexError: [E035] Error creating span with start 17 and end 21 for Doc of length 17.

Process finished with exit code 1
Environment
- Operating System: WSL/Ubuntu 20.04
- Python Version Used: 3.8
- spaCy Version Used: 2.3.2
- Environment Information: n/a
Issue Analytics
- State:
- Created 3 years ago
- Comments: 5 (4 by maintainers)
Top GitHub Comments
The on_match callback will still be called for each match, yes. So if there’s further processing to do per match, you probably want to do it all in the same i == 0 step before retokenizing, so that the match spans are still valid. And it’s up to you to make sure later calls to the callback don’t try to do anything with the invalid spans.