
Memory leak with benepar (spacy plugin)


How to reproduce the behaviour

I used test code similar to https://github.com/explosion/spaCy/issues/3618:

import random
import spacy
import plac
import psutil
import sys
from benepar.spacy_plugin import BeneparComponent


def load_data():
    return ["This is a fake test document number %d."%i for i in random.sample(range(100_000), 10_000)]


def parse_texts(nlp, texts, iterations=1_000):
    for i in range(iterations):
        for doc in nlp.pipe(texts):
            yield doc


@plac.annotations(
    iterations=("Number of iterations", "option", "n", int),
    model=("spaCy model to load", "positional", None, str)
)
def main(model='en_core_web_sm', iterations=1_000):
    nlp = spacy.load(model)
    nlp.add_pipe(BeneparComponent('benepar_en'))
    texts = load_data()
    for i, doc in enumerate(parse_texts(nlp, texts, iterations=iterations)):
        if i % 100 == 0:
            print(i, psutil.virtual_memory().percent)
            sys.stdout.flush()


if __name__ == '__main__':
    plac.call(main)

Without the benepar plugin, this code works fine on spaCy 2.3.0. With the benepar plugin enabled, the memory leak is significant. Here is the output:

0 47.4
100 47.5
200 47.5
300 47.6
400 47.8
500 48.2
…
1500 52.5
1600 56.5
1700 57.7
1800 57.5
1900 56.7
2000 57.1
…
4000 59.3
4100 59.1
4200 59.2
4300 60.0
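The script above measures leak growth via system-wide memory percent, which also moves with every other process on the machine; the process's own peak RSS is a more direct signal. A minimal stdlib sketch (an assumed alternative, not part of the original repro; the `resource` module is Unix-only):

```python
import resource
import sys


def peak_rss_kib():
    # ru_maxrss reports the process's peak resident set size:
    # kibibytes on Linux, bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak // 1024 if sys.platform == "darwin" else peak


before = peak_rss_kib()
_blob = [bytearray(1024) for _ in range(10_000)]  # allocate roughly 10 MiB
after = peak_rss_kib()
print(after >= before)  # peak RSS is monotonic, so this prints True
```

Printing this value every 100 iterations instead of `psutil.virtual_memory().percent` would isolate the parser's own growth from unrelated system activity.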

Your Environment

  • Operating System: macOS
  • Python Version Used: 3.8.3
  • spaCy Version Used: 2.3.0
  • Environment Information:

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
zhoudoufu commented, Nov 13, 2020

If you do want to try again with valgrind, add PYTHONMALLOC=malloc at the beginning of the command to see if there are more details related to Python objects.
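A sketch of what that invocation might look like (the script name `repro.py` is hypothetical; the valgrind flags are standard):

```shell
# PYTHONMALLOC=malloc disables CPython's pymalloc arenas, so valgrind can
# attribute each allocation individually through the system malloc it
# instruments. "repro.py" stands in for the reproduction script above.
PYTHONMALLOC=malloc valgrind --leak-check=full --log-file=valgrind.out python3 repro.py
```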

Here is the valgrind summary, with environment variable PYTHONMALLOC=malloc

==16082==    definitely lost: 11,584 bytes in 89 blocks
==16082==    indirectly lost: 0 bytes in 0 blocks
==16082==      possibly lost: 1,651,622 bytes in 12,456 blocks
==16082==    still reachable: 22,922,513 bytes in 182,620 blocks
==16082==                       of which reachable via heuristic:
==16082==                         newarray           : 4,664 bytes in 8 blocks
==16082==                         multipleinheritance: 4,064 bytes in 41 blocks
==16082==         suppressed: 120 bytes in 1 blocks

I managed to bypass this memory leak by handling the parsing in a process worker with --max-tasks-per-child set to a limited number. (In my case I use a Celery worker; I got the idea from https://github.com/explosion/spaCy/issues#issuecomment-487714092.) From what I observed, the nlp object itself does not leak, so reusing the same nlp object across tasks within a worker saves the time of reloading spaCy models and third-party plugins.
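The recycling idea is not Celery-specific: Python's stdlib multiprocessing.Pool exposes the same knob as maxtasksperchild, which retires a worker after N tasks so any memory it leaked is reclaimed when the process exits. A minimal sketch with a placeholder parse function (in the real setup, the initializer would do the spacy.load and add the BeneparComponent):

```python
import multiprocessing as mp

_model = None


def init_worker():
    # Runs once per worker process; the real version would be
    # spacy.load(...) plus nlp.add_pipe(BeneparComponent(...)).
    global _model
    _model = str.upper  # placeholder for the loaded nlp pipeline


def parse(text):
    # Placeholder for calling the pipeline: nlp(text)
    return _model(text)


if __name__ == "__main__":
    # maxtasksperchild=100 tears down and replaces each worker after 100
    # tasks, bounding how much leaked memory any one process can hold.
    with mp.Pool(processes=2, initializer=init_worker,
                 maxtasksperchild=100) as pool:
        print(pool.map(parse, ["a", "b", "c"]))  # prints ['A', 'B', 'C']
```

The trade-off is the same as with Celery's --max-tasks-per-child: a lower limit caps the leak sooner but pays the model-reload cost more often.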

So, we can close this issue, as it requires no change to the spaCy code, only to the way it is applied.

0 reactions
github-actions[bot] commented, Oct 30, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
