
Duplicates filtering and RAM usage

Summary

I am running a broad crawl with ~4 million starting URLs. I followed the suggestions for broad crawls from the Scrapy documentation and am using the JOBDIR option to persist request queues to disk. The crawl has been running for ~1.5 months, and over that time the crawler's RAM usage has grown from ~2 GB to ~4.5 GB. I have already read about the common causes of memory growth and their workarounds.
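
For context, a broad-crawl configuration along those lines typically combines JOBDIR persistence with the documented broad-crawl tweaks; the sketch below is illustrative and the values are not the ones used in the original crawl:

# settings.py (illustrative broad-crawl configuration, not the original one)
JOBDIR = "crawls/broad-crawl-1"  # persist scheduler queues and dupefilter state to disk
SCHEDULER_PRIORITY_QUEUE = "scrapy.pqueues.DownloaderAwarePriorityQueue"
CONCURRENT_REQUESTS = 100
REACTOR_THREADPOOL_MAXSIZE = 20
COOKIES_ENABLED = False
RETRY_ENABLED = False
DOWNLOAD_TIMEOUT = 15
LOG_LEVEL = "INFO"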

Based on my debugging, I found the main cause of this increased RAM usage to be the set of request fingerprints that the default dupefilter (RFPDupeFilter) keeps in memory and queries during duplicate filtering.
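
For a rough sense of scale, here is an editorial back-of-the-envelope estimate (not a measurement from the original report) of what an in-memory fingerprint set costs; the request count is hypothetical:

import hashlib
import sys

# Approximate the cost of keeping SHA1 hex fingerprints in an in-memory set,
# as the default RFPDupeFilter does. Figures are CPython-specific estimates.
sample = {hashlib.sha1(str(i).encode()).hexdigest() for i in range(100_000)}

per_string = sys.getsizeof(next(iter(sample)))  # ~89 bytes per 40-char str
per_slot = sys.getsizeof(sample) / len(sample)  # rough hash-table overhead per entry

n = 20_000_000  # hypothetical number of requests seen so far
estimate_gb = n * (per_string + per_slot) / 1e9
print(f"~{estimate_gb:.1f} GB for {n:,} fingerprints kept in memory")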

Motivation

My non-beefy system only has 8 GB of RAM, so to prevent OOM issues I decided to write a duplicates filter that writes and queries fingerprints from a SQLite database instead. Below is the source code for the modified duplicates filter:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from scrapy.dupefilters import RFPDupeFilter
from scrapy.http.request import Request
from contextlib import closing
import sqlite3
import logging
import os


class RFPSQLiteDupeFilter(RFPDupeFilter):
    def __init__(self, path: str, debug: bool = False) -> None:
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        self.schema = """
        CREATE TABLE requests_seen (fingerprint TEXT PRIMARY KEY);"""
        self.db = os.path.join(path, "requests_seen.sqlite")
        db_exists = os.path.exists(self.db)
        self.conn = sqlite3.connect(self.db)

        # create the schema only if the database file did not already exist
        if not db_exists:
            with closing(self.conn.cursor()) as cursor:
                cursor.execute(self.schema)
                self.conn.commit()
            self.logger.info("Created database: %s", self.db)
        else:
            self.logger.info(
                "Skipping database creation since it already exists: %s",
                self.db)

    def request_seen(self, request: Request) -> bool:
        # compute the request fingerprint
        fp = self.request_fingerprint(request)

        # insert the fingerprint; a primary-key violation means it was seen before
        try:
            with closing(self.conn.cursor()) as cursor:
                cursor.execute(
                    """
                    INSERT INTO requests_seen VALUES (?);
                    """, (fp, ))
                self.conn.commit()
        except sqlite3.IntegrityError:
            return True
        else:
            return False

    def close(self, reason: str) -> None:
        self.conn.close()
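
To try this out, the filter has to be enabled through Scrapy's DUPEFILTER_CLASS setting; a minimal sketch, assuming the class lives in a hypothetical myproject/dupefilters.py module, would be:

# settings.py (the module path below is an assumption for illustration)
JOBDIR = "crawls/broad-crawl-1"  # the SQLite file is created inside this directory
DUPEFILTER_CLASS = "myproject.dupefilters.RFPSQLiteDupeFilter"
DUPEFILTER_DEBUG = False

Since the class inherits its from_settings factory from RFPDupeFilter, it receives the JOBDIR path as its path argument, so JOBDIR must be set for the filter to know where to create its database.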

Observations

This brought my RAM usage back down to the levels I observed 1.5 months ago. Additionally, I did not observe a significant negative impact on crawling speed.

Feature request

Does it make sense to add this RFPSQLiteDupeFilter class to dupefilters.py in scrapy?

I can imagine this being a nice feature for broad crawls on machines with limited RAM. I would be glad to submit a PR if this is of interest.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 16 (7 by maintainers)

Top GitHub Comments

1 reaction
atreyasha commented, Nov 4, 2021

Another possibility is to use a key-value database, with the key being the fingerprint and the value some metadata, such as the time the request was enqueued.

scrapy-deltafetch uses this approach with Python's built-in dbm library, which reduces the need for external dependencies:

https://github.com/scrapy-plugins/scrapy-deltafetch/blob/master/scrapy_deltafetch/middleware.py
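
As a rough illustration of that idea (an editorial sketch, not code taken from scrapy-deltafetch; the class name is hypothetical), a dbm-backed variant of the filter above could store the enqueue time as the value:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from scrapy.dupefilters import RFPDupeFilter
from scrapy.http.request import Request
import logging
import time
import dbm
import os


class RFPDBMDupeFilter(RFPDupeFilter):
    def __init__(self, path: str, debug: bool = False) -> None:
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        # "c" opens the database read/write and creates it if it does not exist
        self.db = dbm.open(os.path.join(path, "requests_seen"), "c")

    def request_seen(self, request: Request) -> bool:
        fp = self.request_fingerprint(request)
        if fp in self.db:
            return True
        # store the enqueue time as the value, per the key-value idea above
        self.db[fp] = str(time.time())
        return False

    def close(self, reason: str) -> None:
        self.db.close()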

0 reactions
LeMoussel commented, Nov 15, 2021

OK. It would then be interesting to benchmark this against LMDB, along the lines of the Benchmarking Semidbm comparison.
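
A small harness along these lines could serve as a starting point (an editorial sketch using only the standard library; an LMDB backend could be timed with the same loop via the third-party lmdb package, and file names and the fingerprint count are arbitrary):

#!/usr/bin/env python3
import dbm
import hashlib
import sqlite3
import time


def bench_sqlite(fps, path="bench.sqlite"):
    # insert each fingerprint, relying on the PRIMARY KEY to reject duplicates
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (fp TEXT PRIMARY KEY)")
    start = time.perf_counter()
    for fp in fps:
        try:
            conn.execute("INSERT INTO seen VALUES (?)", (fp,))
        except sqlite3.IntegrityError:
            pass
    conn.commit()
    conn.close()
    return time.perf_counter() - start


def bench_dbm(fps, path="bench.dbm"):
    # membership check first, then insert, mirroring a dbm-backed dupefilter
    db = dbm.open(path, "c")
    start = time.perf_counter()
    for fp in fps:
        if fp not in db:
            db[fp] = "1"
    db.close()
    return time.perf_counter() - start


fingerprints = [hashlib.sha1(str(i).encode()).hexdigest() for i in range(100_000)]
print("sqlite3:", bench_sqlite(fingerprints))
print("dbm:    ", bench_dbm(fingerprints))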
