
Duplicates filtering and RAM usage

Summary

I am running a broad crawl with ~4 million starting URLs. I followed the suggestions for broad crawls from the Scrapy documentation and am using the JOBDIR option to persist request queues to disk. The crawl has been running for ~1.5 months, and over that time the crawler's RAM usage has grown from ~2 GB to ~4.5 GB. I have already read about the common causes of memory growth and their workarounds.
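
For context, a broad-crawl configuration along those lines typically combines JOBDIR persistence with the documented broad-crawl tweaks; the sketch below is illustrative and the values are not the ones used in the original crawl:

# settings.py (illustrative broad-crawl configuration, not the original one)
JOBDIR = "crawls/broad-crawl-1"  # persist scheduler queues and dupefilter state to disk
SCHEDULER_PRIORITY_QUEUE = "scrapy.pqueues.DownloaderAwarePriorityQueue"
CONCURRENT_REQUESTS = 100
REACTOR_THREADPOOL_MAXSIZE = 20
COOKIES_ENABLED = False
RETRY_ENABLED = False
DOWNLOAD_TIMEOUT = 15
LOG_LEVEL = "INFO"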

Based on my debugging, I found the main cause of this increased RAM usage to be the set of request fingerprints that the default dupefilter (RFPDupeFilter) keeps in memory and queries during duplicate filtering.
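
For a rough sense of scale, here is an editorial back-of-the-envelope estimate (not a measurement from the original report) of what an in-memory fingerprint set costs; the request count is hypothetical:

import hashlib
import sys

# Approximate the cost of keeping SHA1 hex fingerprints in an in-memory set,
# as the default RFPDupeFilter does. Figures are CPython-specific estimates.
sample = {hashlib.sha1(str(i).encode()).hexdigest() for i in range(100_000)}

per_string = sys.getsizeof(next(iter(sample)))  # ~89 bytes per 40-char str
per_slot = sys.getsizeof(sample) / len(sample)  # rough hash-table overhead per entry

n = 20_000_000  # hypothetical number of requests seen so far
estimate_gb = n * (per_string + per_slot) / 1e9
print(f"~{estimate_gb:.1f} GB for {n:,} fingerprints kept in memory")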

Motivation

My non-beefy system only has 8 GB of RAM, so to prevent OOM issues I decided to write a duplicates filter that writes and queries fingerprints from a SQLite database instead. Below is the source code for the modified duplicates filter:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from scrapy.dupefilters import RFPDupeFilter
from scrapy.http.request import Request
from contextlib import closing
import sqlite3
import logging
import os


class RFPSQLiteDupeFilter(RFPDupeFilter):
    def __init__(self, path: str, debug: bool = False) -> None:
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        self.schema = """
        CREATE TABLE requests_seen (fingerprint TEXT PRIMARY KEY);"""
        self.db = os.path.join(path, "requests_seen.sqlite")
        db_exists = os.path.exists(self.db)
        self.conn = sqlite3.connect(self.db)

        # create the schema only if the database file did not already exist
        if not db_exists:
            with closing(self.conn.cursor()) as cursor:
                cursor.execute(self.schema)
                self.conn.commit()
            self.logger.info("Created database: %s", self.db)
        else:
            self.logger.info(
                "Skipping database creation since it already exists: %s",
                self.db)

    def request_seen(self, request: Request) -> bool:
        # compute the request fingerprint
        fp = self.request_fingerprint(request)

        # insert the fingerprint; a primary-key violation means it was seen before
        try:
            with closing(self.conn.cursor()) as cursor:
                cursor.execute(
                    """
                    INSERT INTO requests_seen VALUES (?);
                    """, (fp, ))
                self.conn.commit()
        except sqlite3.IntegrityError:
            return True
        else:
            return False

    def close(self, reason: str) -> None:
        self.conn.close()
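
To try this out, the filter has to be enabled through Scrapy's DUPEFILTER_CLASS setting; a minimal sketch, assuming the class lives in a hypothetical myproject/dupefilters.py module, would be:

# settings.py (the module path below is an assumption for illustration)
JOBDIR = "crawls/broad-crawl-1"  # the SQLite file is created inside this directory
DUPEFILTER_CLASS = "myproject.dupefilters.RFPSQLiteDupeFilter"
DUPEFILTER_DEBUG = False

Since the class inherits its from_settings factory from RFPDupeFilter, it receives the JOBDIR path as its path argument, so JOBDIR must be set for the filter to know where to create its database.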

Observations

This brought my RAM usage back down to the levels I observed 1.5 months ago. Additionally, I did not observe a significant negative impact on crawling speed.

Feature request

Does it make sense to add this RFPSQLiteDupeFilter class to dupefilters.py in scrapy?

I can imagine this being a nice feature for broad crawls on machines with limited RAM. I would be glad to submit a PR if this is of interest.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 16 (7 by maintainers)

Top GitHub Comments

1 reaction
atreyasha commented, Nov 4, 2021

Another possibility is to use a key-value database, with the key being the fingerprint and the value some metadata, such as the time the request was enqueued.

scrapy-deltafetch uses this approach with Python's built-in dbm library, which reduces the need for external dependencies:

https://github.com/scrapy-plugins/scrapy-deltafetch/blob/master/scrapy_deltafetch/middleware.py
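
As a rough illustration of that idea (an editorial sketch, not code taken from scrapy-deltafetch; the class name is hypothetical), a dbm-backed variant of the filter above could store the enqueue time as the value:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from scrapy.dupefilters import RFPDupeFilter
from scrapy.http.request import Request
import logging
import time
import dbm
import os


class RFPDBMDupeFilter(RFPDupeFilter):
    def __init__(self, path: str, debug: bool = False) -> None:
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        # "c" opens the database read/write and creates it if it does not exist
        self.db = dbm.open(os.path.join(path, "requests_seen"), "c")

    def request_seen(self, request: Request) -> bool:
        fp = self.request_fingerprint(request)
        if fp in self.db:
            return True
        # store the enqueue time as the value, per the key-value idea above
        self.db[fp] = str(time.time())
        return False

    def close(self, reason: str) -> None:
        self.db.close()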

0 reactions
LeMoussel commented, Nov 15, 2021

OK. It would then be interesting to benchmark this against LMDB, along the lines of the Benchmarking Semidbm comparison.
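
A small harness along these lines could serve as a starting point (an editorial sketch using only the standard library; an LMDB backend could be timed with the same loop via the third-party lmdb package, and file names and the fingerprint count are arbitrary):

#!/usr/bin/env python3
import dbm
import hashlib
import sqlite3
import time


def bench_sqlite(fps, path="bench.sqlite"):
    # insert each fingerprint, relying on the PRIMARY KEY to reject duplicates
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (fp TEXT PRIMARY KEY)")
    start = time.perf_counter()
    for fp in fps:
        try:
            conn.execute("INSERT INTO seen VALUES (?)", (fp,))
        except sqlite3.IntegrityError:
            pass
    conn.commit()
    conn.close()
    return time.perf_counter() - start


def bench_dbm(fps, path="bench.dbm"):
    # membership check first, then insert, mirroring a dbm-backed dupefilter
    db = dbm.open(path, "c")
    start = time.perf_counter()
    for fp in fps:
        if fp not in db:
            db[fp] = "1"
    db.close()
    return time.perf_counter() - start


fingerprints = [hashlib.sha1(str(i).encode()).hexdigest() for i in range(100_000)]
print("sqlite3:", bench_sqlite(fingerprints))
print("dbm:    ", bench_dbm(fingerprints))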
