Matcher is not able to merge multiple matches per doc

Hi! When I try to merge multiple matches in a sentence/doc with retokenizer.merge, I receive the error IndexError: [E035] Error creating span with start 17 and end 21 for Doc of length 17. From the second match onward, the indices no longer line up: the underlying doc has changed, but the start/end indices in matches were not updated. Below I include code to reproduce the issue, as well as a workaround I am using for now.

But maybe it would be more elegant to update the start/end indices upon calling retokenizer.merge?
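
The index drift can be reproduced without spaCy. The following is a minimal sketch on a plain Python list; the helper names (merge, merge_all_shifted, merge_all_reversed) are hypothetical and not part of spaCy, and the sketch assumes the matches do not overlap. It shows that reusing the original indices after a merge picks the wrong tokens, and two ways to compensate: shifting later indices left, or merging from right to left.

```python
def merge(tokens, start, end):
    """Return a new list with tokens[start:end] joined into a single token."""
    return tokens[:start] + ["".join(tokens[start:end])] + tokens[end:]


def merge_all_shifted(tokens, matches):
    """Merge left to right, shifting each match by the tokens removed so far."""
    shift = 0
    for start, end in sorted(matches):
        tokens = merge(tokens, start - shift, end - shift)
        # Merging n tokens into one shortens the list by n - 1.
        shift += (end - start) - 1
    return tokens


def merge_all_reversed(tokens, matches):
    """Merge right to left, so earlier indices are never invalidated."""
    for start, end in sorted(matches, reverse=True):
        tokens = merge(tokens, start, end)
    return tokens


tokens = list("abcdefghij")
matches = [(1, 3), (5, 8)]  # (start, end) pairs computed on the original list

# Naive approach: the second pair still refers to the original positions.
naive = merge(merge(tokens, 1, 3), 5, 8)
shifted = merge_all_shifted(tokens, matches)
reversed_order = merge_all_reversed(tokens, matches)

print(naive)           # "ghi" is merged instead of the intended "fgh"
print(shifted)
print(reversed_order)
```

The shifting variant is the same arithmetic as the workaround in this report: each earlier merge of n tokens moves every later index left by n - 1.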

How to reproduce the behaviour

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

def merge_tokens(matcher, doc: spacy.tokens.Doc, i, matches):
    """Merges the tokens of match i into one, using the original (stale) indices."""

    match_id, start, end = matches[i]

    span = Span(doc, start, end, label="EVENT")

    with doc.retokenize() as retokenizer:
        attrs = {"LEMMA": "<gs>", "TAG": "<gs>"}
        retokenizer.merge(span, attrs=attrs)

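def merge_tokens_fixed(matcher, doc: spacy.tokens.Doc, i, matches):
    """Merges the tokens of match i into one, correcting for earlier merges."""

    match_id, start, end = matches[i]

    # Each earlier merge shortened the doc by (span length - 1) tokens,
    # so shift this match's indices left by that amount.
    for j in range(i):
        shrink = matches[j][2] - matches[j][1] - 1
        start -= shrink
        end -= shrink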

    span = Span(doc, start, end, label="EVENT")

    with doc.retokenize() as retokenizer:
        attrs = {"LEMMA": "<gs>", "TAG": "<gs>"}
        retokenizer.merge(span, attrs=attrs)


patterns = [
    [
        {'IS_BRACKET': True},
        {'LOWER': 'this'},
        {'LOWER': 'is'},
        {'LOWER': 'a', 'OP': '*'},
        {'IS_BRACKET': True}
    ]
]


nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add('this_is_a', patterns, on_match=merge_tokens_fixed)

text = "Hello World! (This is a) test! (This is) another one. (This is). This is"
doc = nlp(text)
matches = matcher(doc)

print([t.text for t in doc])


matcher2 = Matcher(nlp.vocab)
matcher2.add('this_is_a2', patterns, on_match=merge_tokens)

doc2 = nlp(text)
matches2 = matcher2(doc2)

print([t.text for t in doc2])

Output:

['Hello', 'World', '!', '(This is a)', 'test', '!', '(This is)', 'another', 'one', '.', '(This is)', '.', 'This', 'is']
Traceback (most recent call last):
  File "/mnt/c/Users/xxx/AppData/Roaming/JetBrains/PyCharm2020.2/scratches/scratch.py", line 59, in <module>
    matches2 = matcher2(doc2)
  File "matcher.pyx", line 242, in spacy.matcher.matcher.Matcher.__call__
  File "/mnt/c/Users/xxx/AppData/Roaming/JetBrains/PyCharm2020.2/scratches/scratch.py", line 10, in merge_tokens
    span = Span(doc, start, end, label="EVENT")
  File "span.pyx", line 104, in spacy.tokens.span.Span.__cinit__
IndexError: [E035] Error creating span with start 17 and end 21 for Doc of length 17.

Process finished with exit code 1

Environment

  • Operating System: WSL/Ubuntu 20.04
  • Python Version Used: 3.8
  • spaCy Version Used: 2.3.2
  • Environment Information: n/a

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

adrianeboyd commented, Nov 26, 2020

The on_match callback will still be called for each match, yes.

So if there’s further processing to do per match you probably want to do it all in the same i == 0 step before retokenizing so that the match spans are still valid.

And it’s up to you to make sure later calls to the callback don’t try to do anything with the invalid spans.
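
One way to read this advice is sketched below: the callback does all of its work in the i == 0 call, inside a single doc.retokenize() block, and returns immediately on later calls. This is an illustration rather than code from the thread. The callback name merge_all_matches is hypothetical, spacy.blank("en") is used on the assumption that the pattern needs only lexical attributes, and spacy.util.filter_spans is added as a guard because a single retokenize block does not accept overlapping spans.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans


def merge_all_matches(matcher, doc, i, matches):
    """on_match callback that merges every match during the first call only."""
    if i != 0:
        # The doc was already retokenized in the i == 0 call, so the
        # remaining match indices are stale and must not be used.
        return
    # Build all spans while the indices still refer to the unmodified doc,
    # dropping overlapping spans (the longest one is kept).
    spans = filter_spans([doc[start:end] for _, start, end in matches])
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span, attrs={"LEMMA": "<gs>"})


# Assumption: a blank English pipeline, so no trained model is required.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
pattern = [
    {"IS_BRACKET": True},
    {"LOWER": "this"},
    {"LOWER": "is"},
    {"LOWER": "a", "OP": "*"},
    {"IS_BRACKET": True},
]
matcher.add("this_is_a", [pattern], on_match=merge_all_matches)

text = "Hello World! (This is a) test! (This is) another one. (This is). This is"
doc = nlp(text)
matcher(doc)  # the returned matches still carry the pre-merge indices

tokens = [t.text for t in doc]
print(tokens)
```

Because every span is created before any merge is applied, no index arithmetic is needed. The list returned by matcher(doc) still holds the original indices, so it should not be used to slice the doc afterwards.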

