Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Broad crawl possible memory leak

See original GitHub issue

Hi,

I was doing a broad crawl and noticed constantly increasing memory consumption for a spider. Pruning my spider down to its most simple form doesn’t help (memory still increases constantly). I also noticed that other spiders (with much smaller crawl rates, CONCURRENT_REQUESTS = 16) don’t have this problem.

So I was wondering whether I’m misusing Scrapy or there is a genuine problem. A brief issue search didn’t show anything, so I went ahead and created an experimental spider for tests: https://github.com/rampage644/experimental (a minimal sketch of that kind of spider follows the questions below).

  1. First, I’d like to know whether anyone has experienced memory problems with high-rate crawls, or any other memory problem.
  2. Second, I’d like to figure out why this simple spider leaks and whether we can do anything about it.
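
For reference, here is a minimal sketch of the kind of spider involved. It is an illustration, not the actual code from the repository above; the seed file name and the link-extraction details are assumptions:

import scrapy

class BroadSpider(scrapy.Spider):
    # Illustrative broad-crawl spider; 'seeds.txt' and all names here are
    # assumptions, not code from the linked repository.
    name = 'broad'
    custom_settings = {'CONCURRENT_REQUESTS': 100}

    def start_requests(self):
        with open('seeds.txt') as f:        # one seed URL per line
            for url in f:
                yield scrapy.Request(url.strip(), callback=self.parse)

    def parse(self, response):
        # Follow every link and produce no items, so any steady memory
        # growth points at request/response objects being retained.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)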

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 23 (9 by maintainers)

Top GitHub Comments

3 reactions
rampage644 commented, Jun 24, 2016

@lopuhin, I’m using the CONCURRENT_REQUESTS = 100 setting and am able to get 1200 rpm with 1 unit. On startup, top shows a 60 MB RSS; within 30 minutes it grows to 300 MB.
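
Instead of eyeballing top, the same growth can be watched with Scrapy’s built-in memusage extension; a sketch of the relevant settings.py entries, with threshold values that are only examples:

# Scrapy memusage extension (Unix only); the thresholds below are examples.
MEMUSAGE_ENABLED = True                  # sample the process RSS
MEMUSAGE_WARNING_MB = 256                # log a warning past this size
MEMUSAGE_LIMIT_MB = 512                  # close the spider past this size
MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0   # sampling interval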

2 reactions
rampage644 commented, Jun 30, 2016

I’ve done some more experiments to pin down what is causing the leak:

  1. First, I removed all links that caused error messages in the log (due to downloader errors). The good news is that the peak memory footprint dropped from 440 MB to 300 MB (according to stats); the bad news is that the leak is still there. (Error entries in the log fell from 20k to 2k.)
  2. Second, a while back I noticed that Request objects sometimes get stuck somewhere (according to prefs() output and live_refs info), and there is a pattern to it. pprint.pprint(map(lambda x: (x[0], time.time()-x[1]), sorted(rqs.items(), key=operator.itemgetter(1)))) prints the request objects sorted by creation time. Once Request objects start staying alive, a group of them with roughly the same age (>60s) appears in the tracking dict. That can happen multiple times, i.e. there can be multiple such groups. (A cleaned-up version of this inspection follows the heap dump below.)
  3. Third finding: while working on a focused-crawl spider I tried guppy, and it showed nothing interesting: str, tuples, dicts. But here I get a bunch of Twisted objects staying in memory (maybe they are present in the other spiders too, but here there are far more of them):
>>> hpy.heap()
Partition of a set of 1286140 objects. Total size = 176360928 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 456074  35 29027064  16  29027064  16 str
     1 261200  20 20719536  12  49746600  28 tuple
     2 205542  16 16970096  10  66716696  38 list
     3  17046   1 16640016   9  83356712  47 dict (no owner)
     4  14746   1 15453808   9  98810520  56 dict of twisted.internet.base.DelayedCall
     5   8403   1  8277960   5 107088480  61 dict of twisted.internet.defer.Deferred
     6   7685   1  8053880   5 115142360  65 dict of twisted.internet.tcp.Client
     7   7685   1  8053880   5 123196240  70 dict of twisted.internet.tcp.Connector
     8   7677   1  8045496   5 131241736  74 dict of twisted.web._newclient.HTTP11ClientProtocol
     9  51931   4  4154480   2 135396216  77 types.MethodType
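
A cleaned-up version of the live_refs inspection from point 2, runnable from Scrapy’s telnet console; the 20-entry cap is arbitrary:

import operator, pprint, time
from scrapy.http import Request
from scrapy.utils.trackref import live_refs

# live_refs maps each tracked class to a WeakKeyDictionary of
# instance -> creation timestamp.
items = list(live_refs[Request].items())   # snapshot; it mutates as the crawl runs
ages = sorted(((repr(r), time.time() - t) for r, t in items),
              key=operator.itemgetter(1), reverse=True)
pprint.pprint(ages[:20])                   # the 20 oldest live Requests, age in seconds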

I’m going to try @kmike’s advice regarding the tracemalloc module.
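
For reference, that approach boils down to diffing two snapshots taken some minutes apart (tracemalloc is in the standard library on Python 3.4+; on Python 2 it needs the pytracemalloc backport). A minimal sketch:

import tracemalloc

tracemalloc.start(25)                    # keep 25 frames per allocation
snap1 = tracemalloc.take_snapshot()
# ... let the crawl run for a while ...
snap2 = tracemalloc.take_snapshot()
for stat in snap2.compare_to(snap1, 'lineno')[:10]:
    print(stat)                          # top allocation sites by growth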

Read more comments on GitHub

Top Results From Across the Web

Broad Crawls — Scrapy 2.7.1 documentation
If your broad crawl shows a high memory usage, in addition to crawling in BFO order and lowering concurrency you should debug your...
Broad Crawls - 《Scrapy v2.0 Documentation》 - 书栈网
Crawl in BFO order instead to save memory. Be mindful of memory leaks. If your broad crawl shows a high memory usage, in...
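
The BFO advice in both results above corresponds to these settings from the Scrapy broad-crawls documentation:

# Crawl in breadth-first (FIFO) order instead of the default depth-first
# order, which keeps fewer pending requests alive in memory.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
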
Debugging memory leaks — Scrapy documentation
To help debugging memory leaks, Scrapy provides a built-in mechanism for tracking object references called trackref, and you can also use a third-party...
Automatically Debugging Memory Leaks in Web Applications
Leaks degrade responsiveness by increasing GC frequency and overhead, and can even lead to browser tab crashes by exhausting available memory. Because previ...
Three kinds of memory leaks - Made of Bugs - Nelson Elhage
So, you've got a program that's using more and more over time as it runs. Probably you can immediately identify this as a...
