Centralized Request fingerprints

It is very easy to introduce a subtle bug when using a custom duplicates filter that changes how the request fingerprint is calculated.

  • The duplicates filter checks the request fingerprint and makes the Scheduler drop the request if it is a duplicate.
  • The cache storage checks the request fingerprint and fetches the response from the cache if one is stored under that fingerprint.
  • If the two fingerprint algorithms differ, we’re in trouble.

The problem is that there is no way to override the request fingerprint globally; to make Scrapy always take something extra into account (an HTTP header, a meta option), the user must override the duplicates filter and every cache storage that is in use.
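The mismatch described above can be reproduced with a small toy model. This is only a sketch: `ToyDupeFilter`, `ToyCache`, and both fingerprint functions are hypothetical stand-ins written for illustration, not Scrapy classes, and the hashing is a simplification rather than Scrapy's actual algorithm.

```python
import hashlib


def default_fingerprint(method, url):
    """Hash method and URL only (simplified stand-in, not Scrapy's algorithm)."""
    return hashlib.sha1(f"{method} {url}".encode("utf-8")).hexdigest()


def header_aware_fingerprint(method, url, headers):
    """Like default_fingerprint, but also mixes in the Accept-Language header."""
    lang = headers.get("Accept-Language", "")
    return hashlib.sha1(f"{method} {url} {lang}".encode("utf-8")).hexdigest()


class ToyDupeFilter:
    """Customized filter: treats requests with different languages as distinct."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, method, url, headers):
        fp = header_aware_fingerprint(method, url, headers)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


class ToyCache:
    """Cache that was NOT customized and still ignores headers."""

    def __init__(self):
        self.store = {}

    def put(self, method, url, headers, body):
        self.store[default_fingerprint(method, url)] = body

    def get(self, method, url, headers):
        return self.store.get(default_fingerprint(method, url))


dupefilter = ToyDupeFilter()
cache = ToyCache()
url = "https://example.com/page"
en = {"Accept-Language": "en"}
de = {"Accept-Language": "de"}

# First request: not a duplicate, so the response is fetched and cached.
first_is_dupe = dupefilter.request_seen("GET", url, en)
cache.put("GET", url, en, "English page")

# Second request: the filter lets it through as a new request...
second_is_dupe = dupefilter.request_seen("GET", url, de)
# ...but the cache answers with the response stored for the first one.
served = cache.get("GET", url, de)
```

Here the German request passes the filter yet is served the English page from the cache, which is the kind of silent inconsistency the issue is about.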

Ideas about how to fix it:

  1. use the duplicates filter’s request_fingerprint method in the cache storage if this method is available;
  2. create a special Request.meta key that the request_fingerprint function takes into account;
  3. create a special Request.meta key that allows providing a pre-calculated fingerprint;
  4. add a settings.py option to override the request fingerprint function globally.
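Ideas 2 to 4 can be combined in a few lines. The following is a minimal sketch under stated assumptions: the meta keys `fingerprint` and `fingerprint_extra`, the name `FINGERPRINT_FUNC`, and the `SharedDupeFilter` and `SharedCache` classes are all hypothetical and exist only for this illustration; none of them is Scrapy API.

```python
import hashlib
from typing import Callable, Dict, Optional

Meta = Optional[Dict[str, object]]


def shared_fingerprint(method: str, url: str, meta: Meta = None) -> str:
    """Single fingerprint function meant to be used by every component."""
    meta = meta or {}
    # Idea 3: a pre-calculated fingerprint in meta wins outright.
    if "fingerprint" in meta:
        return str(meta["fingerprint"])
    # Idea 2: extra data from meta is mixed into the hash.
    extra = repr(meta.get("fingerprint_extra", ""))
    raw = "\n".join([method.upper(), url, extra])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


class SharedDupeFilter:
    def __init__(self, fingerprint_func: Callable[..., str]) -> None:
        self.fingerprint_func = fingerprint_func
        self.seen = set()

    def request_seen(self, method: str, url: str, meta: Meta = None) -> bool:
        fp = self.fingerprint_func(method, url, meta)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


class SharedCache:
    def __init__(self, fingerprint_func: Callable[..., str]) -> None:
        self.fingerprint_func = fingerprint_func
        self.store: Dict[str, str] = {}

    def put(self, method: str, url: str, meta: Meta, body: str) -> None:
        self.store[self.fingerprint_func(method, url, meta)] = body

    def get(self, method: str, url: str, meta: Meta = None) -> Optional[str]:
        return self.store.get(self.fingerprint_func(method, url, meta))


# Idea 4: the function is chosen in one place and handed to every component,
# so the filter and the cache cannot drift apart.
FINGERPRINT_FUNC = shared_fingerprint
dupefilter = SharedDupeFilter(FINGERPRINT_FUNC)
cache = SharedCache(FINGERPRINT_FUNC)

url = "https://example.com/page"
cache.put("GET", url, {"fingerprint_extra": "en"}, "English page")
hit = cache.get("GET", url, {"fingerprint_extra": "en"})
miss = cache.get("GET", url, {"fingerprint_extra": "de"})
```

Because both components receive the same callable, changing what counts as a duplicate means editing one function instead of overriding the filter and each cache storage separately.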

Issue Analytics

  • State: closed
  • Created: 9 years ago
  • Comments: 40 (32 by maintainers)

Top GitHub Comments

2 reactions
kmike commented, Mar 5, 2019

Hey! Currently I’m not sure we should be developing a rule engine for this. For example, instead of

CUSTOM_FINGERPRINTING = {
  1: {
    'consider_method': True,
    'consider_url': True,
    'query_params': {
      'id': None,
      'category': '1',  # set default value to 1
    },
    'headers': {
      'ignore': True,  # headers are ignored entirely
    },
    'meta': {
      'ignore': False,  # all meta data will be considered
    },
  }
}

I’d prefer to have something along these lines:

from functools import partial
from scrapy.utils.request import request_fingerprint

REQUEST_FINGERPRINT_FUNC = partial(
    request_fingerprint, 
    include_headers=True, 
    include_meta={'foo': True}
)

This way one can use any Python function, and instead of a dict with string constants we use function arguments. This makes it possible to

  • swap implementation without changing Scrapy;
  • write tests more easily;
  • validate that arguments are correct at Python syntax level.

This exact API probably won’t work, as I’d like it to handle more cases - it’d be good to have a way to customize it not only in settings.py, but also in middlewares and in a spider as well, per-request. Anyways, you get the idea 😃

By the way, https://github.com/scrapy/scrapy/pull/3420 may be relevant, as we started to look at fingerprints and think about the API.
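The `partial`-based proposal in this comment can be tried out with a stand-in function. This is a sketch only: `toy_request_fingerprint` and its `include_headers` and `include_meta` parameters are assumptions that mirror the names in the comment; they are not the signature of Scrapy's `request_fingerprint`. Note that `functools.partial` does not check keyword names when it is created, so a misspelled argument surfaces as a `TypeError` on the first call rather than being silently ignored like an unknown dict key.

```python
import hashlib
from functools import partial


def toy_request_fingerprint(method, url, headers=None, meta=None,
                            include_headers=False, include_meta=None):
    """Hypothetical stand-in; not Scrapy's request_fingerprint."""
    parts = [method.upper(), url]
    if include_headers:
        # Header names are lowercased and sorted so ordering does not matter.
        for name, value in sorted((headers or {}).items()):
            parts.append(f"{name.lower()}:{value}")
    # include_meta maps meta keys to a flag; only flagged keys are hashed.
    flagged = sorted(k for k, on in (include_meta or {}).items() if on)
    for key in flagged:
        parts.append(f"{key}={(meta or {}).get(key)!r}")
    return hashlib.sha1("\n".join(parts).encode("utf-8")).hexdigest()


# Configuration expressed as function arguments instead of a rules dict.
fp_func = partial(
    toy_request_fingerprint,
    include_headers=True,
    include_meta={"foo": True},
)

url = "https://example.com"
a = fp_func("GET", url, meta={"foo": 1})
b = fp_func("GET", url, meta={"foo": 2})            # "foo" differs
c = fp_func("GET", url, meta={"foo": 1, "bar": 9})  # "bar" is not flagged

# A misspelled keyword is rejected by Python when the partial is called.
bad = partial(toy_request_fingerprint, include_header=True)
try:
    bad("GET", url)
    typo_detected = False
except TypeError:
    typo_detected = True
```

Since the configured object is an ordinary callable, it can be exercised in a unit test without starting a crawl, which is the testing benefit the comment points to.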

0 reactions
Gallaecio commented, Mar 22, 2019

@lenzai Fingerprints go beyond URLs.
