Broad crawl possible memory leak
Hi,
I was doing a broad crawl and noticed constantly increasing memory consumption for a spider. Pruning my spider down to its most simple form doesn’t help (memory still increases constantly).
I also noticed that other spiders (with much smaller crawl rates, `CONCURRENT_REQUESTS = 16`) don’t have this problem.
So I was wondering whether I’m misusing Scrapy or there is a real problem. A brief issue search didn’t turn up anything, so I went ahead and created an experimental spider for testing: https://github.com/rampage644/experimental
- First, I’d like to know whether anyone else has experienced memory problems with a high-rate crawl, or other memory problems.
- Second, I’d like to figure out why this simple spider leaks and whether we can do anything about it.
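
For reference, a broad-crawl setup along the lines described above can be reproduced with a minimal spider such as the sketch below. This is an illustrative assumption rather than the contents of the linked repository: the spider name, seed URL, and settings values are made up, and the BFO-related settings follow the Scrapy broad-crawl documentation.

```python
# Minimal broad-crawl spider sketch (illustrative; not the spider from the
# linked repository). It follows every extracted link and keeps no state of
# its own, mirroring the "most simple form" described in the report.
import scrapy
from scrapy.linkextractors import LinkExtractor


class BroadSpider(scrapy.Spider):
    name = 'broad'                        # hypothetical name
    start_urls = ['http://example.com/']  # hypothetical seed URL

    custom_settings = {
        'CONCURRENT_REQUESTS': 100,       # high concurrency, as in the report
        # BFO crawl order, recommended by the broad-crawl docs to limit memory:
        'DEPTH_PRIORITY': 1,
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    }

    def parse(self, response):
        # Follow every link; yield nothing else, so the spider itself holds
        # no obvious references that could explain the memory growth.
        for link in LinkExtractor().extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)
```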
Issue Analytics
- Created 7 years ago
- Comments: 23 (9 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@lopuhin, I’m using a `CONCURRENT_REQUESTS` setting of 100 and I’m able to get 1200 rpm with 1 unit. On startup `top` shows a 60M RSS size; within 30 minutes it grows to 300M.

I’ve done some more experiments to pin down what is causing the leak:

- `Request` objects are stuck somewhere (according to `prefs()` output and `live_refs` info). There is a pattern here: `pprint.pprint(map(lambda x: (x[0], time.time() - x[1]), sorted(rqs.items(), key=operator.itemgetter(1))))` prints request objects sorted by their creation time. Once `Request` objects start staying alive, a group of them with nearly the same age (>60s) appears in the tracking dict. That can happen multiple times, i.e. there can be multiple groups.
- `twisted` objects staying in memory (maybe they are present in other spiders too, but here there are many more of them).

I’m going to try @kmike’s advice regarding the `tracemalloc` module.
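
For readers who want to reproduce this kind of investigation, below is a minimal sketch (not code from the issue) of the two checks mentioned in the comment: listing the ages of live `Request` objects via Scrapy’s `trackref` machinery, and comparing `tracemalloc` snapshots taken at two points during the crawl. The helper names are hypothetical.

```python
# Sketch of the two debugging checks discussed above (helper names are
# hypothetical, not from the issue).
import operator
import pprint
import time
import tracemalloc

from scrapy.http import Request
from scrapy.utils.trackref import live_refs


def dump_request_ages():
    """Print live Request objects sorted by age, oldest first.

    Request inherits from scrapy.utils.trackref.object_ref, so every live
    instance is recorded in live_refs as {instance: creation_timestamp}.
    (Subclasses such as FormRequest are tracked under their own class.)
    """
    rqs = dict(live_refs[Request])
    ages = [(req, time.time() - created)
            for req, created in sorted(rqs.items(), key=operator.itemgetter(1))]
    pprint.pprint(ages)


def start_tracing():
    """Start tracemalloc and return a baseline snapshot taken early on."""
    tracemalloc.start(25)   # keep up to 25 frames of traceback per allocation
    return tracemalloc.take_snapshot()


def report_growth(baseline, limit=10):
    """Print the allocation sites that grew the most since the baseline."""
    current = tracemalloc.take_snapshot()
    for stat in current.compare_to(baseline, 'lineno')[:limit]:
        print(stat)
```

Both helpers can be called from the Scrapy telnet console while the crawl is running, which is also where `prefs()` is normally used.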