
Matcher is not able to merge multiple matches per doc

See original GitHub issue

Hi! When I try to merge multiple matches in a sentence/doc with retokenizer.merge, the second match fails with IndexError: [E035] Error creating span with start 17 and end 21 for Doc of length 17. The indices no longer line up: the first merge changes the underlying doc, but the start and end indices in matches are not updated. Below is code to reproduce the issue, together with a workaround I am using for now.

But maybe it would be more elegant to update the start/end indices upon calling retokenizer.merge?
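To make the index shift concrete, here is a minimal spaCy-free sketch. A plain Python list stands in for the Doc, and the merge helper is hypothetical (it is not part of spaCy). It assumes the matches are sorted and do not overlap, and compares three strategies: reusing the stale indices, shifting each match by the number of tokens already removed, and merging from right to left so that earlier indices stay valid.

```python
def merge(tokens, start, end):
    """Return a new list with tokens[start:end] joined into a single token."""
    if not 0 <= start < end <= len(tokens):
        raise IndexError(f"span {start}:{end} is outside a list of length {len(tokens)}")
    return tokens[:start] + [" ".join(tokens[start:end])] + tokens[end:]


tokens = ["x", "(", "a", "b", ")", "y", "(", "c", ")"]
matches = [(1, 5), (6, 9)]  # (start, end) pairs computed on the original list

# 1. Naive: reuse the original indices after the list has already shrunk.
naive_failed = False
current = tokens
try:
    for start, end in matches:
        current = merge(current, start, end)
except IndexError:
    naive_failed = True  # the second span now points past the end

# 2. Shifted: subtract the number of tokens removed by earlier merges.
shifted = tokens
removed = 0
for start, end in matches:
    shifted = merge(shifted, start - removed, end - removed)
    removed += end - start - 1  # each merge turns (end - start) tokens into 1

# 3. Right to left: merging the last span first leaves earlier indices intact.
reversed_order = tokens
for start, end in sorted(matches, reverse=True):
    reversed_order = merge(reversed_order, start, end)

print(naive_failed, shifted, reversed_order)
```

The second strategy is the same arithmetic as the workaround below; the third avoids the bookkeeping altogether.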

How to reproduce the behaviour

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

def merge_tokens(matcher, doc: spacy.tokens.Doc, i, matches):
    """Merges tokens to one."""

    match_id, start, end = matches[i]

    span = Span(doc, start, end, label="EVENT")

    with doc.retokenize() as retokenizer:
        attrs = {"LEMMA": "<gs>", "TAG": "<gs>"}
        retokenizer.merge(span, attrs=attrs)

def merge_tokens_fixed(matcher, doc: spacy.tokens.Doc, i, matches):
    """Merges tokens to one."""

    match_id, start, end = matches[i]

    for j in range(i):  # shift left by the tokens removed in earlier merges
        span_len = matches[j][2] - matches[j][1] - 1
        start -= span_len
        end -= span_len

    span = Span(doc, start, end, label="EVENT")

    with doc.retokenize() as retokenizer:
        attrs = {"LEMMA": "<gs>", "TAG": "<gs>"}
        retokenizer.merge(span, attrs=attrs)


patterns = [
    [
        {'IS_BRACKET': True},
        {'LOWER': 'this'},
        {'LOWER': 'is'},
        {'LOWER': 'a', 'OP': '*'},
        {'IS_BRACKET': True}
    ]
]


nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add('this_is_a', patterns, on_match=merge_tokens_fixed)

text = "Hello World! (This is a) test! (This is) another one. (This is). This is"
doc = nlp(text)
matches = matcher(doc)

print([t.text for t in doc])


matcher2 = Matcher(nlp.vocab)
matcher2.add('this_is_a2', patterns, on_match=merge_tokens)

doc2 = nlp(text)
matches2 = matcher2(doc2)

print([t.text for t in doc2])

Output:

['Hello', 'World', '!', '(This is a)', 'test', '!', '(This is)', 'another', 'one', '.', '(This is)', '.', 'This', 'is']
Traceback (most recent call last):
  File "/mnt/c/Users/xxx/AppData/Roaming/JetBrains/PyCharm2020.2/scratches/scratch.py", line 59, in <module>
    matches2 = matcher2(doc2)
  File "matcher.pyx", line 242, in spacy.matcher.matcher.Matcher.__call__
  File "/mnt/c/Users/xxx/AppData/Roaming/JetBrains/PyCharm2020.2/scratches/scratch.py", line 10, in merge_tokens
    span = Span(doc, start, end, label="EVENT")
  File "span.pyx", line 104, in spacy.tokens.span.Span.__cinit__
IndexError: [E035] Error creating span with start 17 and end 21 for Doc of length 17.

Process finished with exit code 1

Environment

Operating System: WSL/Ubuntu 20.04
Python Version Used: 3.8
spaCy Version Used: 2.3.2
Environment Information: n/a

Issue Analytics

State:
Created 3 years ago
Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction

adrianeboyd commented, Nov 26, 2020

The on_match callback will still be called for each match, yes.

So if there’s further processing to do per match you probably want to do it all in the same i == 0 step before retokenizing so that the match spans are still valid.

And it’s up to you to make sure later calls to the callback don’t try to do anything with the invalid spans.
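A rough sketch of that suggestion, under a few assumptions: the callback name merge_all_matches is made up for illustration, spacy.blank("en") replaces en_core_web_sm so that no model download is needed, and only the LEMMA attribute is set. All merges are queued inside a single doc.retokenize() block on the first callback, so every span is built from indices that are still valid, and later callbacks return without touching the doc.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans


def merge_all_matches(matcher, doc, i, matches):
    """Merge every matched span in one pass, on the first callback only."""
    if i != 0:
        # The doc was already retokenized at i == 0, so the remaining
        # (start, end) pairs are stale; later callbacks do nothing.
        return
    spans = [Span(doc, start, end, label="EVENT") for _, start, end in matches]
    with doc.retokenize() as retokenizer:
        # Merges are applied together when the block exits, so the original
        # indices can be used for every span. filter_spans drops overlapping
        # spans, which cannot be merged in the same block.
        for span in filter_spans(spans):
            retokenizer.merge(span, attrs={"LEMMA": "<gs>"})


pattern = [
    {"IS_BRACKET": True},
    {"LOWER": "this"},
    {"LOWER": "is"},
    {"LOWER": "a", "OP": "*"},
    {"IS_BRACKET": True},
]

nlp = spacy.blank("en")  # tokenizer only; assumed to be enough for this pattern
matcher = Matcher(nlp.vocab)
matcher.add("this_is_a", [pattern], on_match=merge_all_matches)

text = "Hello World! (This is a) test! (This is) another one. (This is). This is"
doc = nlp(text)
matcher(doc)

tokens = [t.text for t in doc]
print(tokens)
```

Any other per-match processing would go into the same i == 0 branch, before the retokenize block, while the match spans still line up with the doc.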


