
Memory leak with benepar (spacy plugin)


How to reproduce the behaviour

I used test code similar to https://github.com/explosion/spaCy/issues/3618:

import random
import spacy
import plac
import psutil
import sys
from benepar.spacy_plugin import BeneparComponent


def load_data():
    return ["This is a fake test document number %d."%i for i in random.sample(range(100_000), 10_000)]


def parse_texts(nlp, texts, iterations=1_000):
    for i in range(iterations):
        for doc in nlp.pipe(texts):
            yield doc


@plac.annotations(
    iterations=("Number of iterations", "option", "n", int),
    model=("spaCy model to load", "positional", None, str)
)
def main(model='en_core_web_sm', iterations=1_000):
    nlp = spacy.load(model)
    nlp.add_pipe(BeneparComponent('benepar_en'))
    texts = load_data()
    for i, doc in enumerate(parse_texts(nlp, texts, iterations=iterations)):
        if i % 100 == 0:
            print(i, psutil.virtual_memory().percent)
            sys.stdout.flush()


if __name__ == '__main__':
    plac.call(main)

Without the benepar plugin, this code works fine on spaCy 2.3.0. With the benepar plugin enabled, the memory leak is significant. Here is the output:

0 47.4
100 47.5
200 47.5
300 47.6
400 47.8
500 48.2
…
1500 52.5
1600 56.5
1700 57.7
1800 57.5
1900 56.7
2000 57.1
…
4000 59.3
4100 59.1
4200 59.2
4300 60.0
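The script above measures leak growth via system-wide memory percent, which also moves with every other process on the machine; the process's own peak RSS is a more direct signal. A minimal stdlib sketch (an assumed alternative, not part of the original repro; the `resource` module is Unix-only):

```python
import resource
import sys


def peak_rss_kib():
    # ru_maxrss reports the process's peak resident set size:
    # kibibytes on Linux, bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak // 1024 if sys.platform == "darwin" else peak


before = peak_rss_kib()
_blob = [bytearray(1024) for _ in range(10_000)]  # allocate roughly 10 MiB
after = peak_rss_kib()
print(after >= before)  # peak RSS is monotonic, so this prints True
```

Printing this value every 100 iterations instead of `psutil.virtual_memory().percent` would isolate the parser's own growth from unrelated system activity.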

Your Environment

  • Operating System: macOS
  • Python Version Used: 3.8.3
  • spaCy Version Used: 2.3.0
  • Environment Information:

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
zhoudoufu commented, Nov 13, 2020

If you do want to try again with valgrind, add PYTHONMALLOC=malloc at the beginning of the command to see if there are more details related to Python objects.
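A sketch of what that invocation might look like (the script name `repro.py` is hypothetical; the valgrind flags are standard):

```shell
# PYTHONMALLOC=malloc disables CPython's pymalloc arenas, so valgrind can
# attribute each allocation individually through the system malloc it
# instruments. "repro.py" stands in for the reproduction script above.
PYTHONMALLOC=malloc valgrind --leak-check=full --log-file=valgrind.out python3 repro.py
```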

Here is the valgrind summary, with environment variable PYTHONMALLOC=malloc

==16082==    definitely lost: 11,584 bytes in 89 blocks
==16082==    indirectly lost: 0 bytes in 0 blocks
==16082==      possibly lost: 1,651,622 bytes in 12,456 blocks
==16082==    still reachable: 22,922,513 bytes in 182,620 blocks
==16082==                       of which reachable via heuristic:
==16082==                         newarray           : 4,664 bytes in 8 blocks
==16082==                         multipleinheritance: 4,064 bytes in 41 blocks
==16082==         suppressed: 120 bytes in 1 blocks

I managed to bypass this memory leak by handling the parsing in a process worker with --max-tasks-per-child set to a limited number. (In my case I use a Celery worker; I got the idea from https://github.com/explosion/spaCy/issues#issuecomment-487714092.) From what I observed, the nlp object itself does not leak, so reusing the same nlp object across tasks within a worker saves the time of reloading spaCy models and third-party plugins.
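The recycling idea is not Celery-specific: Python's stdlib multiprocessing.Pool exposes the same knob as maxtasksperchild, which retires a worker after N tasks so any memory it leaked is reclaimed when the process exits. A minimal sketch with a placeholder parse function (in the real setup, the initializer would do the spacy.load and add the BeneparComponent):

```python
import multiprocessing as mp

_model = None


def init_worker():
    # Runs once per worker process; the real version would be
    # spacy.load(...) plus nlp.add_pipe(BeneparComponent(...)).
    global _model
    _model = str.upper  # placeholder for the loaded nlp pipeline


def parse(text):
    # Placeholder for calling the pipeline: nlp(text)
    return _model(text)


if __name__ == "__main__":
    # maxtasksperchild=100 tears down and replaces each worker after 100
    # tasks, bounding how much leaked memory any one process can hold.
    with mp.Pool(processes=2, initializer=init_worker,
                 maxtasksperchild=100) as pool:
        print(pool.map(parse, ["a", "b", "c"]))  # prints ['A', 'B', 'C']
```

The trade-off is the same as with Celery's --max-tasks-per-child: a lower limit caps the leak sooner but pays the model-reload cost more often.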

So, we can close this issue, as it requires no change to the spaCy code, only to the way it is applied.

0 reactions
github-actions[bot] commented, Oct 30, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
