Merging spans - discussion


Hi!

I’ve been working on relation extraction using spaCy for the past week or so and the API has been very convenient and the results are great 😄 - lovely lib!

However, I have stumbled upon the problem of span merging, as pointed out in comments in the code: merging one span invalidates spans that were extracted earlier. How are you thinking of approaching this problem?

I have solved it partially for my purposes, but it’s a bit of an ugly hack:

            # Iterating in reverse order doesn't invalidate the spans we still
            # have to process, since the doc shrinks starting from the end.
            # Note: Span.ents and Span.collapse() are my own extensions (see below).
            for sentence in reversed(list(doc.sents)):
                # same for entities
                for ent in reversed(sentence.ents):
                    # collapse() merges the span and uses default labels
                    ent.collapse()
            # get the sentences again, now with correct spans
            for sentence in doc.sents:
                pass  # do stuff you wanted to do with collapsed entities here

I extended Span to have the same ents property that is defined in Doc, and added a collapse function which basically achieves the same as merging the span with reasonable labels. I know adding the ents property to a Span doesn’t always make sense, but in the case where the span is a sentence or a noun phrase it does. A sentence seems to be quite a specific type of span: many properties of Doc make sense for a sentence span. But I digress.

    def collapse(self):
        # character offsets of the span within the doc
        start_idx = self[0].idx
        end_idx = self[-1].idx + len(self[-1])
        lemma = u' '.join(word.lemma_ for word in self)
        # max() over strings: any non-empty label wins over ''
        ent_type = max(word.ent_type_ for word in self)
        merged = self.doc.merge(start_idx, end_idx, self.root.tag_, lemma, ent_type)
        return merged
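
To illustrate why the reverse-order trick above works, here is a minimal sketch that uses a plain Python list instead of a spaCy Doc. The `merge_spans` helper and the hard-coded token offsets are assumptions made up for the illustration; they are not part of spaCy.

```python
def merge_spans(tokens, spans, reverse=True):
    """Merge each (start, end) token range into a single token, in place.

    `spans` holds token offsets computed against the ORIGINAL token list.
    """
    for start, end in sorted(spans, reverse=reverse):
        # Replacing a slice shrinks the list, shifting every later index.
        tokens[start:end] = [" ".join(tokens[start:end])]
    return tokens


words = "The cat sat on the mat in a fuzzy hat .".split()
chunks = [(0, 2), (4, 6), (7, 10)]  # hypothetical noun-chunk offsets

# Back to front: offsets of the spans still to be merged stay valid.
back_to_front = merge_spans(list(words), chunks, reverse=True)

# Front to back: the first merge shifts the list, so later offsets are stale.
front_to_back = merge_spans(list(words), chunks, reverse=False)

print(back_to_front)
print(front_to_back)
```

Merging from the end only ever shifts indices that have already been processed, which is why the offsets of the remaining spans can be used unchanged.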

I thought of changing the merge function so that it doesn’t shrink the doc, simply replacing the tokens that have been merged with a special placeholder in the array, and then in __iter__ iterating through the array as normal and only yielding the object if it is not that placeholder. However, this isn’t compatible with getting items using an index. Any ideas?

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

2 reactions
honnibal commented, Nov 7, 2015

This is merged in now, with a lot of other changes (I made major changes to the thinc library).

I ended up blending 1.b and 1.c: .start and .end are not computed properties, but we call a function to verify them and reset if necessary when we go to fetch a token.
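
For readers who want a feel for the "verify and reset when fetching a token" idea, here is a minimal sketch on toy classes. It is not spaCy's implementation. `ToyDoc`, `ToySpan` and `_verify` are hypothetical names, and re-deriving the token indices from stored character offsets is an assumption of this sketch. It also assumes merges never cut across a span boundary.

```python
class ToyDoc:
    def __init__(self, words):
        self.tokens = []  # list of (text, char_offset) pairs
        offset = 0
        for word in words:
            self.tokens.append((word, offset))
            offset += len(word) + 1  # single space between words

    def merge(self, start, end):
        text = " ".join(t for t, _ in self.tokens[start:end])
        self.tokens[start:end] = [(text, self.tokens[start][1])]


class ToySpan:
    def __init__(self, doc, start, end):
        self.doc, self.start, self.end = doc, start, end
        # Character offsets survive merges, token indices do not.
        self.start_char = doc.tokens[start][1]
        last_text, last_offset = doc.tokens[end - 1]
        self.end_char = last_offset + len(last_text)

    def _verify(self):
        toks = self.doc.tokens
        ok = (
            self.start < self.end <= len(toks)
            and toks[self.start][1] == self.start_char
            and toks[self.end - 1][1] + len(toks[self.end - 1][0]) == self.end_char
        )
        if not ok:
            # Reset the stale token indices from the character offsets.
            self.start = next(i for i, (_, o) in enumerate(toks) if o == self.start_char)
            self.end = 1 + next(
                i for i, (t, o) in enumerate(toks) if o + len(t) == self.end_char
            )

    def __getitem__(self, i):
        self._verify()  # check boundaries only when a token is fetched
        return self.doc.tokens[self.start + i][0]

    @property
    def text(self):
        self._verify()
        return " ".join(t for t, _ in self.doc.tokens[self.start:self.end])


doc = ToyDoc("The cat sat on the mat in a fuzzy hat .".split())
the_mat = ToySpan(doc, 4, 6)
doc.merge(0, 2)  # "The cat" becomes one token; the_mat's indices are now stale
print(the_mat.text, (the_mat.start, the_mat.end))
```

The indices stay plain attributes, so reading them is cheap; the check only runs when a token is actually fetched through the span.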

Thanks for your help on this, and for agitating for the change.

Example of this at work:

>>> for np in doc.noun_chunks:
...   np.merge(np.root.tag_, np.text, np.root.ent_type_)
... 
>>> for tok in doc:
...   print(tok.text)
... 
The cat
sat
on
the mat
in
a fuzzy hat
.

Much nicer.


