Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Broad crawl possible memory leak

See original GitHub issue

Hi,

I was doing a broad crawl and noticed constantly increasing memory consumption for a spider. Pruning my spider down to its most simple form doesn’t help (memory still increases constantly). I also noticed that other spiders (with much smaller crawl rates, CONCURRENT_REQUESTS = 16) don’t have this problem.

So I was wondering whether I’m misusing Scrapy or there is a genuine problem. A brief issue search didn’t show anything, so I went ahead and created an experimental spider for tests: https://github.com/rampage644/experimental (a minimal sketch of that kind of spider follows the questions below).

  1. First, I’d like to know whether anyone has experienced memory problems with high-rate crawls, or any other memory problem.
  2. Second, I’d like to figure out why this simple spider leaks and whether we can do anything about it.
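
For reference, here is a minimal sketch of the kind of spider involved. It is an illustration, not the actual code from the repository above; the seed file name and the link-extraction details are assumptions:

import scrapy

class BroadSpider(scrapy.Spider):
    # Illustrative broad-crawl spider; 'seeds.txt' and all names here are
    # assumptions, not code from the linked repository.
    name = 'broad'
    custom_settings = {'CONCURRENT_REQUESTS': 100}

    def start_requests(self):
        with open('seeds.txt') as f:        # one seed URL per line
            for url in f:
                yield scrapy.Request(url.strip(), callback=self.parse)

    def parse(self, response):
        # Follow every link and produce no items, so any steady memory
        # growth points at request/response objects being retained.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)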

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 23 (9 by maintainers)

Top GitHub Comments

3 reactions
rampage644 commented, Jun 24, 2016

@lopuhin, I’m using the CONCURRENT_REQUESTS = 100 setting and am able to get 1200 rpm with 1 unit. On startup, top shows a 60 MB RSS; within 30 minutes it grows to 300 MB.
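
Instead of eyeballing top, the same growth can be watched with Scrapy’s built-in memusage extension; a sketch of the relevant settings.py entries, with threshold values that are only examples:

# Scrapy memusage extension (Unix only); the thresholds below are examples.
MEMUSAGE_ENABLED = True                  # sample the process RSS
MEMUSAGE_WARNING_MB = 256                # log a warning past this size
MEMUSAGE_LIMIT_MB = 512                  # close the spider past this size
MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0   # sampling interval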

2 reactions
rampage644 commented, Jun 30, 2016

I’ve done some more experiments to pin down what is causing the leak:

  1. First, I removed all links that caused error messages in the log (due to downloader errors). The good news is that the peak memory footprint dropped from 440 MB to 300 MB (according to stats); the bad news is that the leak is still there. (Error entries in the log fell from 20k to 2k.)
  2. Second, a while back I noticed that Request objects sometimes get stuck somewhere (according to prefs() output and live_refs info), and there is a pattern to it. pprint.pprint(map(lambda x: (x[0], time.time()-x[1]), sorted(rqs.items(), key=operator.itemgetter(1)))) prints the request objects sorted by creation time. Once Request objects start staying alive, a group of them with roughly the same age (>60s) appears in the tracking dict. That can happen multiple times, i.e. there can be multiple such groups. (A cleaned-up version of this inspection follows the heap dump below.)
  3. Third finding: while working on a focused-crawl spider I tried guppy, and it showed nothing interesting: str, tuples, dicts. But here I get a bunch of Twisted objects staying in memory (maybe they are present in the other spiders too, but here there are far more of them):
>>> hpy.heap()
Partition of a set of 1286140 objects. Total size = 176360928 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 456074  35 29027064  16  29027064  16 str
     1 261200  20 20719536  12  49746600  28 tuple
     2 205542  16 16970096  10  66716696  38 list
     3  17046   1 16640016   9  83356712  47 dict (no owner)
     4  14746   1 15453808   9  98810520  56 dict of twisted.internet.base.DelayedCall
     5   8403   1  8277960   5 107088480  61 dict of twisted.internet.defer.Deferred
     6   7685   1  8053880   5 115142360  65 dict of twisted.internet.tcp.Client
     7   7685   1  8053880   5 123196240  70 dict of twisted.internet.tcp.Connector
     8   7677   1  8045496   5 131241736  74 dict of twisted.web._newclient.HTTP11ClientProtocol
     9  51931   4  4154480   2 135396216  77 types.MethodType
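
A cleaned-up version of the live_refs inspection from point 2, runnable from Scrapy’s telnet console; the 20-entry cap is arbitrary:

import operator, pprint, time
from scrapy.http import Request
from scrapy.utils.trackref import live_refs

# live_refs maps each tracked class to a WeakKeyDictionary of
# instance -> creation timestamp.
items = list(live_refs[Request].items())   # snapshot; it mutates as the crawl runs
ages = sorted(((repr(r), time.time() - t) for r, t in items),
              key=operator.itemgetter(1), reverse=True)
pprint.pprint(ages[:20])                   # the 20 oldest live Requests, age in seconds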

I’m going to try @kmike’s advice regarding the tracemalloc module.
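
For reference, that approach boils down to diffing two snapshots taken some minutes apart (tracemalloc is in the standard library on Python 3.4+; on Python 2 it needs the pytracemalloc backport). A minimal sketch:

import tracemalloc

tracemalloc.start(25)                    # keep 25 frames per allocation
snap1 = tracemalloc.take_snapshot()
# ... let the crawl run for a while ...
snap2 = tracemalloc.take_snapshot()
for stat in snap2.compare_to(snap1, 'lineno')[:10]:
    print(stat)                          # top allocation sites by growth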

Read more comments on GitHub

Top Results From Across the Web

Broad Crawls — Scrapy 2.7.1 documentation
If your broad crawl shows a high memory usage, in addition to crawling in BFO order and lowering concurrency you should debug your...
Broad Crawls - 《Scrapy v2.0 Documentation》 - 书栈网
Crawl in BFO order instead to save memory. Be mindful of memory leaks. If your broad crawl shows a high memory usage, in...
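
The BFO advice in both results above corresponds to these settings from the Scrapy broad-crawls documentation:

# Crawl in breadth-first (FIFO) order instead of the default depth-first
# order, which keeps fewer pending requests alive in memory.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
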
Debugging memory leaks — Scrapy documentation
To help debugging memory leaks, Scrapy provides a built-in mechanism for tracking object references called trackref, and you can also use a third-party...
Automatically Debugging Memory Leaks in Web Applications
Leaks degrade responsiveness by increasing GC frequency and overhead, and can even lead to browser tab crashes by exhausting available memory. Because previ...
Three kinds of memory leaks - Made of Bugs - Nelson Elhage
So, you've got a program that's using more and more over time as it runs. Probably you can immediately identify this as a...
