Duplicates filtering and RAM usage
Summary
I am running a broad crawl with an input of ~4 million starting URLs. I followed the suggestions for broad crawls from here and am using the JOBDIR
option to persist request queues to disk. I have been running this crawl for ~1.5 months. Over time, I have observed the crawler's RAM usage increase from ~2 GB (1.5 months ago) to ~4.5 GB currently. I have already read about causes and workarounds here.
Based on my debugging, I found the main cause of this increased RAM usage to be the set of request fingerprints that is stored in memory and queried during duplicates filtering, as described here.
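For reference, the default RFPDupeFilter keeps every fingerprint it has ever seen in an in-memory set, roughly like the simplified sketch below (a paraphrase of its behaviour, not the exact Scrapy source):

# Simplified paraphrase of the default in-memory behaviour; with JOBDIR set
# the real RFPDupeFilter also appends fingerprints to a requests.seen file,
# but the in-memory set is what keeps growing during a long crawl.
from scrapy.utils.request import request_fingerprint


class InMemoryDupeFilterSketch:
    def __init__(self) -> None:
        self.fingerprints = set()  # grows for the lifetime of the crawl

    def request_seen(self, request) -> bool:
        fp = request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False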
Motivation
My non-beefy system only has 8 GB of RAM, so to prevent OOM issues I decided to write a duplicates filter that writes fingerprints to, and queries them from, a SQLite
database. Below is the source code for the modified duplicates filter:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from contextlib import closing
import logging
import os
import sqlite3

from scrapy.dupefilters import RFPDupeFilter
from scrapy.http.request import Request


class RFPSQLiteDupeFilter(RFPDupeFilter):
    """Duplicates filter that stores request fingerprints in a SQLite
    database on disk instead of an in-memory set.

    Like the file-backed base class, it expects ``path`` to be a directory
    (the JOBDIR), because the database file is created inside it.
    """

    def __init__(self, path: str, debug: bool = False) -> None:
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        self.schema = """
            CREATE TABLE requests_seen (fingerprint TEXT PRIMARY KEY);"""
        self.db = os.path.join(path, "requests_seen.sqlite")
        db_exists = os.path.exists(self.db)
        self.conn = sqlite3.connect(self.db)
        # Only create the table on the first run; a resumed crawl reuses
        # the existing database and its fingerprints.
        if not db_exists:
            with closing(self.conn.cursor()) as cursor:
                cursor.execute(self.schema)
                self.conn.commit()
            self.logger.info("Created database: %s", self.db)
        else:
            self.logger.info(
                "Skipping database creation since it already exists: %s",
                self.db)

    def request_seen(self, request: Request) -> bool:
        # Compute the request fingerprint.
        fp = self.request_fingerprint(request)
        # Insert it; the PRIMARY KEY constraint rejects duplicates.
        try:
            with closing(self.conn.cursor()) as cursor:
                cursor.execute(
                    """
                    INSERT INTO requests_seen VALUES (?);
                    """, (fp,))
                self.conn.commit()
        except sqlite3.IntegrityError:
            # Already present, so the request is a duplicate.
            return True
        else:
            return False

    def close(self, reason: str) -> None:
        self.conn.close()
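To wire it in, one would point DUPEFILTER_CLASS at the class and keep JOBDIR set, for example (the module path myproject.dupefilters is just a placeholder for wherever the class lives):

# settings.py (sketch) -- "myproject.dupefilters" is a placeholder path.
DUPEFILTER_CLASS = "myproject.dupefilters.RFPSQLiteDupeFilter"

# JOBDIR must be set: the inherited from_settings() passes job_dir(settings)
# as `path`, and the SQLite file is created inside that directory.
JOBDIR = "crawls/broad-crawl-1"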
Observations
This helped reduce my RAM usage back to the level I observed 1.5 months ago. Additionally, I did not observe a significant negative impact on crawling speed.
Feature request
Does it make sense to add this RFPSQLiteDupeFilter class to dupefilters.py in scrapy?
I can imagine this being a nice feature for broad crawls on machines with limited RAM. I would be glad to submit a PR if this is of interest.
Top GitHub Comments
Another possibility is to use a key-value database, with the key being the fingerprint and the value being metadata such as the time the request was enqueued.
scrapy-deltafetch uses this approach with the native dbm Python library, which reduces the need for external dependencies: https://github.com/scrapy-plugins/scrapy-deltafetch/blob/master/scrapy_deltafetch/middleware.py
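As a minimal sketch of that key-value idea, assuming a hypothetical RFPDbmDupeFilter built on the stdlib dbm module (this is illustrative only, not the scrapy-deltafetch code):

# Minimal sketch of the key-value approach with the stdlib dbm module.
# The class name and the stored value are made up for illustration.
import dbm
import logging
import os
import time

from scrapy.dupefilters import RFPDupeFilter


class RFPDbmDupeFilter(RFPDupeFilter):
    def __init__(self, path: str, debug: bool = False) -> None:
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        # Fingerprints are keys; the value records when the request was
        # first enqueued, as suggested above.
        self.db = dbm.open(os.path.join(path, "requests_seen"), "c")

    def request_seen(self, request) -> bool:
        fp = self.request_fingerprint(request)
        if fp in self.db:
            return True
        self.db[fp] = str(time.time())
        return False

    def close(self, reason: str) -> None:
        self.db.close()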
OK. It would then be interesting to benchmark it against LMDB, along the lines of Benchmarking Semidbm.
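A rough, stdlib-only harness for that kind of measurement might look like the following (the counts and key format are arbitrary placeholders; an LMDB variant would need the third-party lmdb package instead of dbm):

# Rough timing sketch: insert N synthetic fingerprints, then time lookups.
import dbm
import hashlib
import time

N = 100_000  # number of synthetic fingerprints; adjust to taste
fingerprints = [hashlib.sha1(str(i).encode()).hexdigest() for i in range(N)]

db = dbm.open("bench_requests_seen", "n")  # "n": always start empty
try:
    start = time.perf_counter()
    for fp in fingerprints:
        db[fp] = "1"
    insert_s = time.perf_counter() - start

    start = time.perf_counter()
    hits = sum(1 for fp in fingerprints if fp in db)
    lookup_s = time.perf_counter() - start
finally:
    db.close()

print(f"inserts: {insert_s:.2f}s, lookups: {lookup_s:.2f}s, hits: {hits}")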