Scrapy is slow to start the first time

Hi,
This is not an issue for me, but I’m really curious about why it works this way. I guess this applies to Python apps in general, although it’s more noticeable with Scrapy.
The first time I started a Scrapy spider, it took:

real    0m40.704s
user    0m3.547s
sys     0m3.484s

Any subsequent runs took around 6 seconds:

real    0m6.499s
user    0m3.266s
sys     0m1.547s

Is some sort of caching happening here? I’m running on an SSD and the Scrapy project has around 80 spiders.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
wRAR commented, Nov 11, 2020

scrapy crawl foo imports all spider modules because foo is the spider's name attribute, not the module name, so Scrapy has to import every module to find the spider with that name. If importing your modules takes that much time we can’t help with that, but you can try moving imports of heavy modules from the top level to code that only runs when the spider is used.
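
The suggestion above could look roughly like the sketch below. It is only an illustration: FooSpider is written as a plain class so the snippet runs without Scrapy installed (in a real project it would subclass scrapy.Spider), the response is a plain string rather than a Scrapy response object, and the standard-library statistics module stands in for whatever heavy dependency your spider actually uses.

```python
class FooSpider:  # assumption: a real spider would subclass scrapy.Spider
    name = "foo"

    def parse(self, response):
        # Deferred import: executed the first time parse() runs,
        # not when this module is imported during spider discovery.
        import statistics  # stand-in for a heavy dependency (assumption)

        # "response" is a plain string here to keep the sketch self-contained.
        lengths = [len(word) for word in response.split()]
        return {"mean_word_length": statistics.mean(lengths)}


if __name__ == "__main__":
    print(FooSpider().parse("ab abcd"))
```

With this layout, importing the module only defines the class. The cost of the heavy import is paid once, on the first call to parse(), and only by the spider that is actually run.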

1 reaction
elacuesta commented, Nov 11, 2020

I find it very difficult to give a general answer without any knowledge of the code being run. Off the top of my head, I can think of the job-resuming feature and HttpCacheMiddleware. I suspect this is more of a support question (see Getting help), but let’s leave it open for now in case we find some bottleneck within the codebase. Could you provide more information?
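
One way to gather the kind of information requested above is to time how long each spider module takes to import. The helper below is a minimal sketch using only the standard library; time_imports is a hypothetical name, and the module names in the example are placeholders to be replaced with your own spider modules.

```python
import importlib
import time


def time_imports(module_names):
    """Import each module and return (name, seconds) pairs, slowest first."""
    timings = []
    for name in module_names:
        start = time.perf_counter()
        importlib.import_module(name)
        timings.append((name, time.perf_counter() - start))
    return sorted(timings, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Placeholders: swap in your own modules, e.g. "myproject.spiders.foo"
    # (a hypothetical path).
    for name, seconds in time_imports(["json", "decimal"]):
        print(f"{name}: {seconds:.4f}s")
```

Modules that are already loaded are served from sys.modules and will report close to zero, so run this in a fresh interpreter. Python 3.7 and later also provide the -X importtime interpreter option, which prints a per-module import time breakdown without any extra code.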

