Scrapy is slow to start the first time
Hi,
This is not really an issue for me, but I’m curious about why it works this way. I guess this applies to Python apps in general, although it’s more prominent in Scrapy.
The first time I start a Scrapy spider, it takes:
real 0m40.704s
user 0m3.547s
sys 0m3.484s
Any subsequent run takes around 6 seconds:
real 0m6.499s
user 0m3.266s
sys 0m1.547s
Is some sort of caching happening here? I’m running on an SSD, and the project has around 80 spiders.
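(A diagnostic not taken from the thread itself: CPython’s `-X importtime` flag prints a per-module import-time report, which shows whether a slow first run is dominated by imports rather than by Scrapy itself.)

```shell
# -X importtime makes CPython write a per-module import-time report to
# stderr; pointing it at scrapy shows how much of startup is import cost.
python3 -X importtime -c "import scrapy" 2> import-times.log

# Report columns are "self | cumulative | module" in microseconds; sort
# numerically on the cumulative column so the slowest imports come last.
sort -t'|' -k2 -n import-times.log | tail -n 20
```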
Issue Analytics
- Created 3 years ago
- Comments: 6 (4 by maintainers)
`scrapy crawl foo` imports all spider modules, because `foo` is the spider’s `name` attribute, not the module name. If importing your modules takes that much time, we can’t help with that, but you can try moving imports of heavy modules from the top level into code that only runs when the spider is used.

I find it very difficult to give a general answer without any knowledge of the code being run. Off the top of my head, I can think of the job-resuming feature and the `HttpCacheMiddleware` middleware. I suspect this is more of a support question (see Getting help), but let’s leave it open for now in case we find some bottleneck within the codebase. Could you provide more information?