Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

memory leaks in image pipeline

See original GitHub issue

It seems to me that the image pipeline is leaking memory in a significant way. I have a spider that downloads lists of images. There were always memory problems when downloading images, but now my list of images to download has grown larger, so I thought I'd open an issue here.

Basically, after opening some images, memory usage goes up and stays up (it is not reset to its previous value). It might be an issue with PIL, or it might be something we're doing in the pipeline that causes this. In any case this looks worrying, and I think we should consider steps to limit the problem.

The following code reproduces the problem (I know it's long, but this is really the shortest I could get). It relies on the presence of an images.txt file containing a list of image URLs.

import resource
import shutil
import sys
import tempfile

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.test import get_crawler
from twisted.internet import reactor
from twisted.python import log

def log_memory(result):
    # ru_maxrss is the peak resident set size of the process.
    mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    log.msg("{} bytes".format(mem))
    return result

class SomeSpider(scrapy.Spider):
    name = 'foo'

def download_image(url, pipe, spider):
    # Download one image through the pipeline, logging memory afterwards.
    item = {
        'image_urls': [url],
    }
    dfd = pipe.process_item(item, spider)
    dfd.addBoth(log_memory)
    return dfd

# The store directory must be fresh, otherwise the pipeline will not
# attempt to download images it already has.
some_dir = tempfile.mkdtemp()
crawler = get_crawler(settings_dict={'IMAGES_STORE': some_dir,
                                     'IMAGES_EXPIRES': 0})

spider = SomeSpider()
spider.crawler = crawler
pipeline = ImagesPipeline.from_crawler(crawler)
pipeline.open_spider(spider)

def clean_up():
    print("removing {}".format(some_dir))
    shutil.rmtree(some_dir)

log.startLogging(sys.stdout)

with open('images.txt') as image_list:
    image_urls = [line.strip() for line in image_list]

for url in image_urls[:20]:
    dfd = download_image(url, pipeline, spider)

reactor.addSystemEventTrigger('before', 'shutdown', clean_up)
reactor.run()

Sample output with the attached images.txt:

2016-12-13 13:13:00+0100 [-] Log opened.
2016-12-13 13:13:00+0100 [-] TelnetConsole starting on 6023
2016-12-13 13:13:00+0100 [-] 44516 bytes
2016-12-13 13:13:00+0100 [-] 44752 bytes
2016-12-13 13:13:00+0100 [-] 49492 bytes
2016-12-13 13:13:01+0100 [-] 49680 bytes
2016-12-13 13:13:01+0100 [-] 49680 bytes
2016-12-13 13:13:01+0100 [-] 49680 bytes
2016-12-13 13:13:01+0100 [-] 52312 bytes
2016-12-13 13:13:01+0100 [-] 52312 bytes
2016-12-13 13:13:01+0100 [-] 52316 bytes
2016-12-13 13:13:01+0100 [-] 52316 bytes
2016-12-13 13:13:01+0100 [-] 52316 bytes
2016-12-13 13:13:01+0100 [-] 52532 bytes
2016-12-13 13:13:01+0100 [-] 52532 bytes
2016-12-13 13:13:01+0100 [-] 52700 bytes
2016-12-13 13:13:01+0100 [-] 52700 bytes
2016-12-13 13:13:01+0100 [-] 52700 bytes
2016-12-13 13:13:01+0100 [-] 52700 bytes
2016-12-13 13:13:01+0100 [-] 52700 bytes
2016-12-13 13:13:02+0100 [-] 52700 bytes
2016-12-13 13:13:03+0100 [-] (TCP Port 6023 Closed)
2016-12-13 13:13:03+0100 [-] 52700 bytes
^C2016-12-13 13:13:13+0100 [-] Received SIGINT, shutting down.
2016-12-13 13:13:13+0100 [-] removing /tmp/tmpD1dRmc
2016-12-13 13:13:13+0100 [-] 52700 bytes
2016-12-13 13:13:13+0100 [-] Main loop terminated.

Notice how memory goes up and stays up, from 44516 to 52700 bytes. Also note the roughly 10-second delay between the final request and SIGINT; even after this delay, memory usage still reads 52700.
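One caveat about the measurement itself: resource.getrusage reports ru_maxrss, which is a high-water mark, so by definition it never goes back down even when memory is genuinely freed. A flat reading after the last request therefore shows the peak, not necessarily live usage. A minimal sketch (the buffer size here is an arbitrary choice):

```python
import resource

def peak_rss():
    # ru_maxrss is the process high-water mark: kilobytes on Linux,
    # bytes on macOS. It can only grow over the life of the process.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss()
buf = bytearray(32 * 1024 * 1024)  # allocate ~32 MiB to push the peak up
after = peak_rss()
del buf                            # freeing the buffer...
released = peak_rss()              # ...does not lower the reported peak
```

So to distinguish a true leak from a one-time peak, a current-RSS probe (e.g. reading /proc/self/status on Linux) is a better signal than ru_maxrss alone.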

Issue Analytics

  • State: open
  • Created 7 years ago
  • Comments:8 (4 by maintainers)

Top GitHub Comments

dev-iwf commented, Dec 12, 2017

Looks like it. Thanks.

icanka commented, Mar 22, 2018

I had a significant memory leak too when downloading images with IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH set. Every image response that fails the min_height/min_width check raises ImageException, and memory fills up with these responses. I traced the problem to the info.downloaded[fp] = result line in _cache_result_and_execute_waiters: result turns out to be a Twisted Failure, and for some reason keeping it cached keeps these image responses in memory. I tried emptying the info.downloaded cache as mentioned in #939, but then scraping obviously takes significantly longer, because only the key (if fp in info.downloaded:) is checked to decide whether a request has already been downloaded. I solved it by assigning None to info.downloaded[fp] when the result is an instance of Failure; the total request count and scraping time don't seem to change. I don't know how good an idea this is, but it solved my issue.
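The workaround described above can be sketched without pulling in Scrapy itself. The class below is a simplified stand-in: the downloaded dict, the fingerprint keys, and the Failure type mirror the names in Scrapy's media pipeline, but none of this is the real implementation.

```python
class Failure:
    """Stand-in for twisted.python.failure.Failure."""
    def __init__(self, error):
        # In Twisted, a Failure also pins the traceback and, through it,
        # the response object that raised ImageException.
        self.error = error

class MediaCache:
    """Simplified model of the info.downloaded result cache."""
    def __init__(self):
        self.downloaded = {}  # fp -> cached result

    def cache_result(self, result, fp):
        if isinstance(result, Failure):
            # The workaround: keep the key so the "already downloaded"
            # check below still short-circuits repeat requests, but drop
            # the Failure so the response it references can be freed.
            self.downloaded[fp] = None
        else:
            self.downloaded[fp] = result

    def seen(self, fp):
        # Scrapy only checks key membership, never the cached value,
        # when deciding whether to re-download.
        return fp in self.downloaded

cache = MediaCache()
cache.cache_result({'path': 'full/abc.jpg'}, 'fp-ok')
cache.cache_result(Failure(ValueError("image too small")), 'fp-bad')
```

After this, cache.seen('fp-bad') is still True, so the failed request is not retried, but the Failure (and the large response behind it) is no longer retained for the lifetime of the crawl.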

Read more comments on GitHub >

Top Results From Across the Web

Using the Image Pipeline Directly - Fresco
Using the image pipeline directly is challenging because of the memory usage. ... once you are done with it, you risk memory leaks...
Read more >
WPF Image control memory leak - Stack Overflow
BeginInit(); //I am trying to make the Image control use as less memory as possible ... It may be the event handlers that...
Read more >
Memory leaks with partial pipeline destruction
Hi, I have a relatively complicated pipeline with a tee followed by amongst others a streaming branch, a recording branch and a live...
Read more >
Understanding Memory Leaks In Programming - Medium
Memory leaks are a type of resources mismanagement in programming. The resource is available computer memory allocated to the programming ...
Read more >
Pipeline with no ray.get and a memory leak
I have a pipeline where a camera actor continuously acquires images, pre-processes the image, sends it to a GPU Actor, and then subsequently...
Read more >
