Why does pipeline perform worse than normal operation?
Hi, I’ve been using the spaCy Matcher to detect passive voice in my data. I have around 1 million records, each containing around 10 sentences, so processing speed is a problem. That’s why I want to use nlp.pipe with multiple processes to make use of all CPU cores. Unfortunately, the pipeline takes around 10x longer than simply mapping over each record and finding the matches. I think I may be missing something simple here. Thanks in advance!
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md", disable=["ner"])
matcher = Matcher(nlp.vocab)
passive_rule1 = [
    {"DEP": "nsubjpass"},
    {"DEP": "xcomp", "OP": "*"},
    {"DEP": "aux", "OP": "*"},
    {"DEP": "nsubjpass", "OP": "*"},
    {"DEP": "auxpass"},
    {"DEP": "nsubj", "OP": "*"},
    {"TAG": "VBN"},
]
passive_rule2 = [
    {"DEP": "attr"},
    {"DEP": "det", "OP": "*"},
    # NOUN is a coarse-grained part-of-speech value, so match it via POS;
    # TAG holds fine-grained tags such as NN or VBN.
    {"POS": "NOUN", "OP": "?"},
    {"TAG": "VBN"},
]
# spaCy v2 signature: matcher.add(key, on_match_callback, *patterns)
matcher.add("passive_rule1", None, passive_rule1)
matcher.add("passive_rule2", None, passive_rule2)
%%time
docs = nlp.pipe(data_1, n_process=-1, batch_size=50)
pass_list = [1 if matcher(doc) else 0 for doc in docs]
>> output: ~1min 26s
- Operating System: Windows 10
- Python Version Used: Python 3.8.3
- spaCy Version Used: 2.3.5
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (3 by maintainers)
Top GitHub Comments
Thank you very much! Indeed, I tried a few different settings and found that my system performs best with n_process = 4 and batch_size around 1000. I guess that when I tried out my matcher on a small number of sentences and records, the multiprocessing overhead was significant compared to sticking to one core, but after moving on to larger corpora, I found that using more processes can be of great help.
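For anyone who wants to repeat that kind of comparison, here is a rough sketch of a small timing harness. It is not from the thread: the helper names `time_settings` and `make_spacy_runner` are made up for illustration, and it assumes `nlp`, `matcher` and a list of texts are already set up as in the question. The only spaCy call it relies on is `nlp.pipe(texts, n_process=..., batch_size=...)`.

```python
import time
from itertools import product


def time_settings(run, n_process_options, batch_size_options):
    """Time run(n_process, batch_size) for every combination.

    Returns a list of (seconds, n_process, batch_size) tuples, fastest first.
    """
    results = []
    for n_process, batch_size in product(n_process_options, batch_size_options):
        start = time.perf_counter()
        run(n_process, batch_size)
        elapsed = time.perf_counter() - start
        results.append((elapsed, n_process, batch_size))
    return sorted(results)


def make_spacy_runner(nlp, matcher, texts):
    """Hypothetical helper: wrap nlp.pipe plus the matcher in a timeable callable."""
    def run(n_process, batch_size):
        docs = nlp.pipe(texts, n_process=n_process, batch_size=batch_size)
        # Consume the generator fully so the whole pipeline is actually timed.
        return [1 if matcher(doc) else 0 for doc in docs]
    return run
```

Something like `time_settings(make_spacy_runner(nlp, matcher, sample), [1, 2, 4], [50, 1000])` would then list the settings from fastest to slowest. The sample should be reasonably large, since on a small number of records the process start-up overhead can dominate and make a single process look better than it would on the full corpus.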