Why does pipeline perform worse than normal operation?

Hi, I’ve been using the spaCy Matcher to detect passive voice in my data. I have around 1 million records, each containing around 10 sentences, so processing speed is a problem. That’s why I need nlp.pipe, which can spread the work across CPU cores. Unfortunately, the pipeline takes around 10x longer than simply mapping over each record and finding the matches. I think I may be missing something simple here. Thanks in advance!

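import spacy
from spacy.matcher import Matcher

# The rules rely on the parser (DEP) and the tagger (TAG); only NER is disabled.
nlp = spacy.load("en_core_web_md", disable=["ner"])
matcher = Matcher(nlp.vocab)

# Passive subject, then a passive auxiliary, then a past participle.
passive_rule1 = [
    {"DEP": "nsubjpass"},
    {"DEP": "xcomp", "OP": "*"},
    {"DEP": "aux", "OP": "*"},
    {"DEP": "nsubjpass", "OP": "*"},
    {"DEP": "auxpass"},
    {"DEP": "nsubj", "OP": "*"},
    {"TAG": "VBN"},
]
# Attribute followed by a past participle.
passive_rule2 = [
    {"DEP": "attr"},
    {"DEP": "det", "OP": "*"},
    {"POS": "NOUN", "OP": "?"},  # NOUN is a coarse-grained POS label, not a fine-grained TAG
    {"TAG": "VBN"},
]

# spaCy 2.x signature: matcher.add(key, on_match, *patterns)
matcher.add("passive_rule1", None, passive_rule1)
matcher.add("passive_rule2", None, passive_rule2)

%%time
# data_1 holds the record texts; n_process=-1 uses all available CPU cores
docs = nlp.pipe(data_1, n_process=-1, batch_size=50)
pass_list = [1 if matcher(doc) else 0 for doc in docs]
>> output: ~1min 26s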

  • Operating System: Windows 10
  • Python Version Used: Python 3.8.3
  • spaCy Version Used: 2.3.5

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
mitramir55 commented, Jan 12, 2021

Thank you very much! Indeed, I tried a few different settings and found that my system performs best with n_process = 4 and a batch_size of around 1000. I guess that when I tried my matcher on a small number of sentences and records, the multiprocessing overhead was significant compared to sticking to one core, but moving on to larger corpora, I found that using more processes can be of great help.
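
The trial-and-error tuning described in this comment can be sketched as a small timing harness. This is only an illustration, not code from the thread: benchmark and make_spacy_runner are hypothetical helper names, and the spaCy part assumes the nlp, matcher and data_1 objects from the question. The only spaCy call used is nlp.pipe with its n_process and batch_size arguments.

```python
import itertools
import time


def benchmark(run, n_process_options, batch_size_options):
    """Time run(n_process, batch_size) for every combination of settings.

    Returns a list of (seconds, n_process, batch_size) tuples, fastest first.
    """
    timings = []
    for n_process, batch_size in itertools.product(
        n_process_options, batch_size_options
    ):
        start = time.perf_counter()
        run(n_process, batch_size)
        timings.append((time.perf_counter() - start, n_process, batch_size))
    return sorted(timings)


def make_spacy_runner(nlp, matcher, texts):
    """Hypothetical helper: wrap nlp.pipe + matcher in a run(n_process, batch_size) callable."""

    def run(n_process, batch_size):
        docs = nlp.pipe(texts, n_process=n_process, batch_size=batch_size)
        # Consume the generator so the pipeline work is actually performed.
        return [1 if matcher(doc) else 0 for doc in docs]

    return run
```

Something like benchmark(make_spacy_runner(nlp, matcher, data_1[:20000]), [1, 2, 4], [50, 1000]) would then list the settings from fastest to slowest; the sample size and option lists here are arbitrary. The sample should be large enough that process start-up cost does not dominate the measurement, which is the effect described above for small inputs. When this runs from a script on Windows rather than a notebook, the call most likely needs to sit under an if __name__ == "__main__": guard, because worker processes are spawned there.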

