
Slow entity merging on large texts


First, I’d like to say that I’m really impressed with what you guys have been doing with spaCy, and the 1.x releases look great.

My issue is that when I do entity merging on a large document (novel length) it takes an unbelievably long time (as much as 10 minutes).

This takes about 12 seconds on four novels:

def tokenize(s):
    return [tok.lemma_ for tok in nlp(s, parse=False) if not tok.is_space]

But this takes up to 50 minutes on the same inputs:

def tokenize_merge(s):
    doc = nlp(s, parse=False)
    # Merge each named entity into a single token, in place
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, ent.label_)
    return [tok.lemma_ for tok in doc if not tok.is_space]

(It actually took 24 minutes on a system running 0.101 and 51 minutes on a comparable machine running 1.2.0, but I only ran it a couple of times on each machine, so that’s not really conclusive.)

The second function is adapted from the old documentation and the sense2vec code.

My theory, based on documentation that says that token indices are changed when merging occurs, is that merge is O(n) in the length of the document. Rather than try to dive into the code, I thought I’d just ask: what is going on here?
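
To make that suspicion concrete, here’s a toy sketch of the cost model I have in mind (purely an assumption about the implementation, not spaCy’s actual internals): if each merge rewrites the token array to fix up indices, it’s O(n) per merge, so merging m entities costs O(m * n) overall.

def simulate_merge(tokens, start, end):
    # One "merge" modelled as deleting a slice from a flat list: Python
    # must shift every trailing element left, so each call is O(n).
    del tokens[start:end]

tokens = list(range(1000000))
simulate_merge(tokens, 10, 13)  # ~1M trailing elements shift for one merge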

Is there a better way to do this than what I’m doing? It looks like some of the new callback functionality may be able to perform the merging for me. And what if I can guarantee that the document will never be sliced into and will only ever be iterated over from the beginning, so it doesn’t matter if the indices are messed up?

I love the ability to merge tokens, but unless there’s a faster way to do it (I really hope I’m just doing it wrong), I don’t think it’s usable on input of the size I’m working with.

Details

  • Combined length of the input docs: ~829,000 tokens
  • Python 2.7.6
  • Ubuntu 14.04 (yes, I know it’s time to upgrade, I’m working on it)

Side notes:

It’s even worse when I merge noun phrases (NPs) too (I assume because there are more of them), but I don’t have the times because I didn’t let it finish. If I’m right, and this is O(n) per merge, you could get a tiny decrease in constant factors by starting from the end (see the sketch below).
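
For what it’s worth, a sketch of that start-from-the-end idea (untested, and assuming the same spaCy 1.x merge API as in the snippets above):

# Merging from the end of the doc means each merge shifts fewer trailing
# tokens, and the indices of earlier, not-yet-merged spans stay valid.
# The worst case is still O(n) per merge; only the constant factor shrinks.
for ent in reversed(doc.ents):
    ent.merge(ent.root.tag_, ent.text, ent.label_)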

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 2
  • Comments: 15 (7 by maintainers)

Top GitHub Comments

1 reaction
honnibal commented, Oct 8, 2017

@ibrahimsharaf I’m not sure how to give a useful answer outside of the information already in the thread.

Maybe it would be useful to sketch out the algorithm in pure Python. Imagine you’ve got a list of elements like (index, text, head_offset), where index needs to be the position of the element within the list, and head_offset needs to point to another token. You can make the tokens objects if that’s easier for you.

We want a bulk merge operation that doesn’t make a nested loop over the tokens, i.e. one that runs in linear time with respect to the number of words.
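
For illustration only, here is one way that might look in pure Python (the (index, text, head_offset) tuples follow the description above; the span format, function name, and head-selection rule are made up for the sketch, and none of this is spaCy’s API):

def bulk_merge(tokens, spans):
    """Collapse each half-open (start, end) span into a single token.

    tokens: list of (index, text, head_offset) tuples.
    spans: sorted, non-overlapping (start, end) pairs to merge.
    Two linear passes, so O(n) overall rather than O(n) per merge.
    """
    n = len(tokens)
    # Pass 1: map every old index to its index after merging.
    new_index = [0] * n
    removed = 0
    span_i = 0
    for old in range(n):
        start, end = spans[span_i] if span_i < len(spans) else (n, n)
        if start < old < end:
            # Non-initial tokens of a span collapse into the span's first slot.
            new_index[old] = new_index[start]
            removed += 1
        else:
            new_index[old] = old - removed
        if old == end - 1:
            span_i += 1
    # Pass 2: emit merged tokens, remapping heads through new_index.
    merged = []
    span_i = 0
    old = 0
    while old < n:
        if span_i < len(spans) and old == spans[span_i][0]:
            start, end = spans[span_i]
            text = ' '.join(tok[1] for tok in tokens[start:end])
            # Keep the first token's head for simplicity; a real merge
            # would use the span's syntactic root instead.
            head_old = start + tokens[start][2]
            merged.append((new_index[start], text,
                           new_index[head_old] - new_index[start]))
            old = end
            span_i += 1
        else:
            _, text, head_offset = tokens[old]
            head_old = old + head_offset
            merged.append((new_index[old], text,
                           new_index[head_old] - new_index[old]))
            old += 1
    return merged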

1 reaction
ELind77 commented, Jan 18, 2017

@sadovnychyi thank you for sharing! I ended up doing this for the time being:

def yield_merged_ents(doc, attr='lemma_'):
    ent = ''
    for t in doc:
        # If we're in an ent append and continue
        if t.ent_iob_ == 'I':
            ent += '_%s' % getattr(t, attr)
            continue
        # If the current entity has ended, add it and clear
        if ent:
            yield ent
            ent = ''
        # start a new entity if needed
        if t.ent_iob_ == 'B':
            ent += getattr(t, attr)
            continue
        yield getattr(t, attr)
    # Clean up
    if ent:
        yield ent
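
A quick usage sketch (assuming the same spaCy 1.x pipeline as the earlier snippets; text stands in for one of the novels):

# Build the merged lemma stream without mutating the Doc at all,
# so no per-merge index fix-ups are needed.
doc = nlp(text, parse=False)
tokens = list(yield_merged_ents(doc, attr='lemma_'))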

– Eric
