Memory leak with benepar (spacy plugin)
See original GitHub issue

How to reproduce the behaviour

I used a check similar to the code in https://github.com/explosion/spaCy/issues/3618:
import random
import spacy
import plac
import psutil
import sys
from benepar.spacy_plugin import BeneparComponent


def load_data():
    # 10,000 short fake documents with randomised numbers.
    return ["This is a fake test document number %d." % i
            for i in random.sample(range(100_000), 10_000)]


def parse_texts(nlp, texts, iterations=1_000):
    # Repeatedly pipe the same texts to expose memory growth over time.
    for i in range(iterations):
        for doc in nlp.pipe(texts):
            yield doc


@plac.annotations(
    iterations=("Number of iterations", "option", "n", int),
    model=("spaCy model to load", "positional", None, str),
)
def main(model='en_core_web_sm', iterations=1_000):
    nlp = spacy.load(model)
    nlp.add_pipe(BeneparComponent('benepar_en'))
    texts = load_data()
    for i, doc in enumerate(parse_texts(nlp, texts, iterations=iterations)):
        if i % 100 == 0:
            # Report overall memory usage every 100 documents.
            print(i, psutil.virtual_memory().percent)
            sys.stdout.flush()


if __name__ == '__main__':
    plac.call(main)
Without the benepar plugin, this code works fine on spaCy 2.3.0. With the benepar plugin, the memory leak is substantial. Here is the output (document count followed by psutil.virtual_memory().percent):
0 47.4
100 47.5
200 47.5
300 47.6
400 47.8
500 48.2
…
1500 52.5
1600 56.5
1700 57.7
1800 57.5
1900 56.7
2000 57.1
…
4000 59.3
4100 59.1
4200 59.2
4300 60.0
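To narrow down where the growth comes from before reaching for valgrind, a rough diagnostic like the sketch below can compare tracemalloc snapshots every few hundred documents. This is an illustration only, not part of the original report: it reuses the parse_texts helper from the script above, and the snapshot interval, frame depth, and top-5 cutoff are arbitrary choices.

```python
import tracemalloc


def profile_growth(nlp, texts, iterations=10, report_every=500, top=5):
    """Print the allocation sites that grew the most between snapshots."""
    tracemalloc.start(25)                    # keep up to 25 frames per allocation
    baseline = tracemalloc.take_snapshot()
    for i, doc in enumerate(parse_texts(nlp, texts, iterations=iterations)):
        if i and i % report_every == 0:
            snapshot = tracemalloc.take_snapshot()
            print(f"--- after {i} documents ---")
            for stat in snapshot.compare_to(baseline, 'lineno')[:top]:
                print(stat)
            baseline = snapshot
```

Since tracemalloc only sees allocations made through Python's allocator, a clean report here would suggest the leak lives in native code, which is what the valgrind run mentioned in the comments below is meant to catch.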
Your Environment
- Operating System: macOS
- Python Version Used: 3.8.3
- spaCy Version Used: 2.3.0
- Environment Information:
Issue Analytics
- State:
- Created 3 years ago
- Comments: 7 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Here is the valgrind summary, obtained with the environment variable PYTHONMALLOC=malloc set:
I managed to work around this memory leak by handling the parsing in a worker process, with --max-tasks-per-child set to a small limit so workers are recycled periodically. (In my case I use a celery worker; I got the idea from https://github.com/explosion/spaCy/issues#issuecomment-487714092.) From what I observed, the nlp object itself does not cause the leak, so reusing the same nlp for many documents within a worker's lifetime (rather than starting a fresh process per document) saves the time of reloading spaCy models / third-party plugins.
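The same recycling idea can be sketched with the standard library alone, since multiprocessing.Pool's maxtasksperchild plays the role of celery's --max-tasks-per-child. This is an illustration of the approach under assumptions, not the commenter's actual celery setup: the model names come from the repro script above, while the pool size, batch shape, and tasks_per_child value are placeholders.

```python
import multiprocessing as mp

import spacy
from benepar.spacy_plugin import BeneparComponent

_nlp = None  # one pipeline per worker process, loaded lazily


def _init_worker():
    # Load spaCy + benepar once per worker; reused for many batches.
    global _nlp
    _nlp = spacy.load('en_core_web_sm')
    _nlp.add_pipe(BeneparComponent('benepar_en'))


def _parse_batch(texts):
    # Return something small; keeping whole Doc objects alive defeats the purpose.
    return [[sent._.parse_string for sent in doc.sents]
            for doc in _nlp.pipe(texts)]


def parse_all(batches, tasks_per_child=50):
    # Each worker is replaced after tasks_per_child batches, releasing any
    # memory leaked during parsing while amortising the model-loading cost.
    with mp.Pool(processes=2, initializer=_init_worker,
                 maxtasksperchild=tasks_per_child) as pool:
        for result in pool.imap(_parse_batch, batches):
            yield result
```

Chunking the documents into modest batches before calling parse_all keeps each task cheap, so recycling a worker only costs one model reload every tasks_per_child batches.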
So we can close this issue, as it does not require any change to the spaCy code, only to the way it is applied.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.