Scrapy is slow to start the first time
Hi,
This is not really an issue for me, but I’m curious about why it works this way. I guess this applies to Python apps in general, although it’s more prominent in Scrapy.
The first time I start a Scrapy spider, it takes:
real 0m40.704s
user 0m3.547s
sys 0m3.484s
Any subsequent run takes around 6 seconds:
real 0m6.499s
user 0m3.266s
sys 0m1.547s
Is some sort of caching happening here? I’m running on an SSD, and the project has around 80 spiders.
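(A diagnostic not taken from the thread itself: CPython’s `-X importtime` flag prints a per-module import-time report, which shows whether a slow first run is dominated by imports rather than by Scrapy itself.)

```shell
# -X importtime makes CPython write a per-module import-time report to
# stderr; pointing it at scrapy shows how much of startup is import cost.
python3 -X importtime -c "import scrapy" 2> import-times.log

# Report columns are "self | cumulative | module" in microseconds; sort
# numerically on the cumulative column so the slowest imports come last.
sort -t'|' -k2 -n import-times.log | tail -n 20
```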
Issue Analytics
- Created 3 years ago
- Comments: 6 (4 by maintainers)
`scrapy crawl foo` imports all spider modules, because `foo` is the spider’s `name` attribute, not the module name. If importing your modules takes that much time, we can’t help with that, but you can try moving imports of heavy modules from the top level into code that only runs when the spider is used.

I find it very difficult to give a general answer without any knowledge of the code being run. Off the top of my head, I can think of the job-resuming feature and the `HttpCacheMiddleware` middleware. I suspect this is more of a support question (see Getting help), but let’s leave it open for now in case we find some bottleneck within the codebase. Could you provide more information?