
Scrapy is definitely slow when working from cache

See original GitHub issue

Description

I’m using Scrapy with the cache enabled to first crawl the pages I need overnight and then polish the extraction while working from the cache. However, the processing of cached pages is quite slow, below 2k pages per minute. My data processing is trivial and my data storage is MongoDB (I’ve tried disabling it to rule out an extra factor, but that didn’t affect the speed); my CPU/IO isn’t even sweating. I’ve tried bumping CONCURRENT_ITEMS to a higher value, but got no result. I’m using the Twisted reactor. More than that, on a decent internet connection my crawling speed with an empty cache is roughly the same (~1800 items per minute).
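
For reference, the two experiments mentioned above (raising item-pipeline concurrency and disabling the MongoDB pipeline) would look roughly like the sketch below; the values are illustrative, not the reporter’s exact ones.

# Raise the number of items processed concurrently by the item pipelines
# (Scrapy's default is 100); the value here is an assumption.
CONCURRENT_ITEMS = 500

# Temporarily disable the MongoDB pipeline to rule out storage as the
# bottleneck, keeping only the FilesPipeline.
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
    # "corpora.pipelines.MongoDBPipeline": 9000,
}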

Below are my cache settings.

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = "httpcache"
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

Steps to Reproduce

  1. Crawl the website and cache all the pages
  2. Rerun the spider on cache
  3. Watch slow processing speed

Expected behavior: Reasonably higher parsing speed when working from cache

Actual behavior: Speed of parsing is quite low and basically the same as with empty cache on decent connection.

Reproduces how often: All of my spiders are suffering from this behavior.

Versions

Scrapy       : 2.5.0
lxml         : 4.6.3.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 21.2.0
Python       : 3.9.1 (default, Apr 12 2021, 01:27:54) - [Clang 10.0.0 (clang-1000.10.44.4)]
pyOpenSSL    : 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021)
cryptography : 3.4.7
Platform     : macOS-10.13.6-x86_64-i386-64bit

Additional context

My full settings.py is below:

import os
from urllib.parse import quote_plus


def get_env_str(k, default):
    return os.environ.get(k, default)


def get_env_int(k, default):
    return int(get_env_str(k, default))


BOT_NAME = "corpora"

SPIDER_MODULES = ["corpora.spiders"]
NEWSPIDER_MODULE = "corpora.spiders"


ROBOTSTXT_OBEY = False


AUTOTHROTTLE_ENABLED = False
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = "httpcache"
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"


ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
    "corpora.pipelines.MongoDBPipeline": 9000,
}

FILES_STORE = "nanu_pdfs"

DOWNLOAD_WARNSIZE = 3355443200
DOWNLOAD_TIMEOUT = 1800
HTTPCACHE_IGNORE_HTTP_CODES = [500, 501, 502, 503, 401, 403]
RETRY_ENABLED = True

MONGODB_HOST = quote_plus(get_env_str("MONGODB_HOST", "localhost"))
MONGODB_PORT = get_env_int("MONGODB_PORT", 27017)
MONGODB_USERNAME = quote_plus(get_env_str("MONGODB_USERNAME", ""))
MONGODB_PASSWORD = quote_plus(get_env_str("MONGODB_PASSWORD", ""))
MONGODB_AUTH_DB = get_env_str("MONGODB_AUTH_DB", "admin")
MONGODB_DB = get_env_str("MONGODB_DB", "ubertext")
MONGODB_CONNECTION_POOL_KWARGS = {}
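
The MongoDBPipeline registered in ITEM_PIPELINES is the reporter’s own code and is not shown in the issue. A minimal sketch of how such a pipeline might consume the MONGODB_* settings above (using pymongo; everything beyond the setting names is a hypothetical illustration) could look like this:

import pymongo


class MongoDBPipeline:
    # Hypothetical pipeline that reads the MONGODB_* settings above.

    def __init__(self, host, port, db):
        self.host, self.port, self.db_name = host, port, db

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        return cls(
            host=s.get("MONGODB_HOST"),
            port=s.getint("MONGODB_PORT"),
            db=s.get("MONGODB_DB"),
        )

    def open_spider(self, spider):
        # One client per spider run; credentials omitted for brevity.
        self.client = pymongo.MongoClient(self.host, self.port)
        self.collection = self.client[self.db_name][spider.name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One insert per item (bulk inserts would reduce write overhead).
        self.collection.insert_one(dict(item))
        return item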

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 20 (9 by maintainers)

Top GitHub Comments

1 reaction
omab commented, Apr 25, 2021

@dchaplinsky, sorry, but I can’t help with any guidance here, I’ve been disconnected from Scrapy for a very long time now. Good luck!

1 reaction
GeorgeA92 commented, Apr 25, 2021

@dchaplinsky

that’s an expected behavior?

It is not expected that a website allows you to continuously send ~1500+ requests per minute from a single IP without IP bans or other anti-bot restrictions on the server side (this is a very unusual case). Usually it is not… polite to send that many requests at that rate. From my point of view, ~1500+ requests per minute from a single IP is not slow; it is already too much.

In my case the gain from working solely from cache is less than 2x (which makes me really sad)

In nearly all cases websites give out IP bans (temporary or permanent) at that request rate. The first thing we usually do in these cases is to limit requests with the DOWNLOAD_DELAY setting: DOWNLOAD_DELAY = 1 (~60 requests per minute) or DOWNLOAD_DELAY = 0.5 (~120 requests per minute).
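
For illustration, such a rate-limited configuration is a one-line change in settings.py; the delay values are the ones quoted above.

# Wait ~1 second between consecutive requests to the same site
# (roughly 60 requests per minute).
DOWNLOAD_DELAY = 1

# Or, for a faster but still throttled crawl (~120 requests per minute):
# DOWNLOAD_DELAY = 0.5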

As @Gallaecio said, the solution with HttpCache enabled will work 10x faster or more compared to this more… ban-safe Scrapy configuration limited by the DOWNLOAD_DELAY setting.

Is there an option to optimize/utilize more CPU (cpu cores)?

I think that in this case (~1500+ requests per minute) performance is limited by an I/O bottleneck (not CPU, and not internet connection quality). My conclusion is based on the performance difference between FilesystemCacheStorage and DbmCacheStorage: DbmCacheStorage is less I/O intensive than FilesystemCacheStorage (at least with a relatively low amount of cached data).
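
Switching to the DBM backend mentioned here is a single setting change; the class path below is Scrapy's built-in DbmCacheStorage.

# Store cached responses in one DBM database per spider instead of a
# separate directory of files per request, which is less I/O intensive.
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.DbmCacheStorage"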

Or because of reactor, this is the maximum, that reactor might provide?

As far as I know (not 100% sure), reactor features are mostly aimed at optimizing network and CPU performance, not disk I/O; reading from and writing to the cache is not related to the reactor.

Read more comments on GitHub >

Top Results From Across the Web

  • provide a way to work with scrapy http cache without making ...
    Currently it is hard to extract information from scrapy cache: cache storages ... Scrapy is definitely slow when working from cache #5105.

  • Why Scrapy is slow? - Stack Overflow
    1 Answer 1 · Well it's happening with all the websites. So I am concerned if Scrapy architecture is scalable enough for such...

  • Scrapy Tips from the Pros - Hacker News
    The very first thing I do with every scraping project is enable the HttpCacheMiddleware[0]. After downloading a page once, subsequent runs ...

  • latest PDF - Scrapy Documentation
    Scrapy (/skrepa/) is an application framework for crawling web sites and extracting structured data which can be used.

  • 5 Useful Tips While Working With Python Scrapy - Jerry Ng
    TL;DR. Use HTTPCache during development; Always use AutoThrottle; Consume sites' API whenever available; Use bulk insert for database write ...
