
Why does pipeline perform worse than normal operation?


Hi, I’ve been using the spaCy Matcher to detect passive voice in my data. I have around 1 million records, each containing around 10 sentences, so processing speed is a problem. That’s why I want to use nlp.pipe with multiprocessing, so that all CPU cores are used. Unfortunately, the pipeline takes around 10x more time than simply mapping over each record and finding the matches. I think maybe I’m missing something simple here. Thanks in advance!

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md", disable=["ner"])
matcher = Matcher(nlp.vocab)

passive_rule1 = [
    {"DEP": "nsubjpass"},
    {"DEP": "xcomp", "OP": "*"},
    {"DEP": "aux", "OP": "*"},
    {"DEP": "nsubjpass", "OP": "*"},
    {"DEP": "auxpass"},
    {"DEP": "nsubj", "OP": "*"},
    {"TAG": "VBN"},
]
passive_rule2 = [
    {"DEP": "attr"},
    {"DEP": "det", "OP": "*"},
    # Note: "NOUN" is a coarse-grained POS label; fine-grained tags look like "NN".
    {"TAG": "NOUN", "OP": "?"},
    {"TAG": "VBN"},
]


matcher.add("passive_rule1", None, passive_rule1)
matcher.add("passive_rule2", None, passive_rule2)

%%time
# n_process=-1 uses all available CPU cores
docs = nlp.pipe(data_1, n_process=-1, batch_size=50)
pass_list = [1 if matcher(doc) else 0 for doc in docs]
# Output: ~1min 26s

Operating System: Windows 10
Python Version Used: Python 3.8.3
spaCy Version Used: 2.3.5

Issue Analytics

Created 3 years ago
Comments: 5 (3 by maintainers)

Top GitHub Comments

mitramir55 commented, Jan 12, 2021

Thank you very much! Indeed, I tried a few different configurations and found that my system performs best with n_process = 4 and batch_size around 1000. I guess that when I tried out my matcher on a small number of sentences and records, the multiprocessing overhead was significant compared to sticking to one core, but moving on to larger corpora, I found that using more processes can be of great help.
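
One way to repeat this kind of comparison is to time a small grid of settings on a sample of the data. The sketch below is only an illustration: `time_settings` and `run` are hypothetical names, not part of spaCy, and the commented usage assumes the `nlp`, `matcher`, and `data_1` objects from the question already exist.

```python
import time
from itertools import product


def time_settings(run, texts, n_process_options, batch_size_options):
    """Time run(texts, n_process, batch_size) for every combination.

    `run` is a hypothetical callable supplied by the caller. It should
    consume the whole pipeline output (for example by building the match
    list), because nlp.pipe is lazy and does no work until iterated.
    Returns a list of (seconds, n_process, batch_size), fastest first.
    """
    results = []
    for n_process, batch_size in product(n_process_options, batch_size_options):
        start = time.perf_counter()
        run(texts, n_process, batch_size)
        elapsed = time.perf_counter() - start
        results.append((elapsed, n_process, batch_size))
    return sorted(results)


# Usage with the objects from the question (assumed to exist already):
#
# def run(texts, n_process, batch_size):
#     docs = nlp.pipe(texts, n_process=n_process, batch_size=batch_size)
#     return [1 if matcher(doc) else 0 for doc in docs]
#
# sample = data_1[:20000]  # sample size is an arbitrary choice
# for seconds, n_process, batch_size in time_settings(run, sample, [1, 2, 4], [50, 1000]):
#     print(f"n_process={n_process} batch_size={batch_size}: {seconds:.1f}s")
```

The sample should be large enough that process start-up cost does not dominate the measurement, which is the effect described above for small inputs. When this runs as a script on Windows rather than in a notebook, multiprocessing code generally needs to sit under an `if __name__ == "__main__":` guard.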

