Per-field required percentage checking

Right now, Spidermon field percentage validation works as follows:

  • you set some of your fields as required in the schema
  • the validation code counts all missing required fields in all items, by catching schema validation errors
  • the error count is divided by the item count

The result is not very useful: because all missing fields in all items are combined into one number, it’s not clear what threshold to set.
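To make the problem concrete, here is a small worked example in plain Python. It does not use Spidermon; the items and field names are made up, and the calculation only mirrors the three steps described above.

```python
# Hypothetical scraped items; both "name" and "price" are required in the schema.
items = [
    {"name": "a", "price": 1},
    {"name": "b"},
    {"name": "c"},
    {},
]
required = ["name", "price"]

# Combined approach: every missing required field in every item is one error,
# and the total error count is divided by the item count.
missing_errors = sum(1 for item in items for field in required if field not in item)
combined_ratio = missing_errors / len(items)

# Per-field approach: one missing ratio for each required field.
per_field_missing = {
    field: sum(1 for item in items if field not in item) / len(items)
    for field in required
}

print(combined_ratio)      # one number mixing both fields
print(per_field_missing)   # separate number per field
```

Here the combined ratio reaches 1.0 even though no field is missing from every item, while the per-field ratios (0.25 for `name`, 0.75 for `price`) show which field is the actual problem.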

What we actually need is a percentage check for each required field or, better, separate percentages for all fields. Here is how I see it:

  • something calculates the field count for each field
  • ValidationInfo calculates percentages for each field by dividing counts by the item count
  • ValidationMonitorMixin gets per-field, per-spider and per-project thresholds and compares the percentage for each field with the relevant threshold
  • an error, or a bunch of errors, is emitted, listing all failing fields with their actual and expected percentages
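The steps above could be sketched roughly as follows. This is only an illustration of the proposal, not Spidermon code: the function names and the precedence order (per-field, then per-spider, then per-project) are assumptions, and the thresholds are treated as minimum presence ratios.

```python
def field_percentages(field_counts, item_count):
    """Divide each field count by the item count (empty result if no items)."""
    if item_count == 0:
        return {}
    return {field: count / item_count for field, count in field_counts.items()}


def resolve_threshold(field, per_field, per_spider=None, per_project=None):
    """Pick the most specific threshold available for a field (assumed order)."""
    if field in per_field:
        return per_field[field]
    if per_spider is not None:
        return per_spider
    return per_project


def find_failures(percentages, per_field, per_spider=None, per_project=None):
    """Return one message per field whose percentage is below its threshold."""
    failures = []
    # Fields with an explicit threshold that never appeared count as 0%.
    for field in sorted(set(percentages) | set(per_field)):
        actual = percentages.get(field, 0.0)
        expected = resolve_threshold(field, per_field, per_spider, per_project)
        if expected is not None and actual < expected:
            failures.append(
                f"{field}: actual {actual:.0%}, expected at least {expected:.0%}"
            )
    return failures


percentages = field_percentages({"name": 9, "price": 5}, item_count=10)
for message in find_failures(percentages, {"price": 0.8}, per_project=0.95):
    print(message)
```

All failing fields are collected before anything is reported, so a single error can list every field together with its actual and expected percentage.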

Some implementations I’ve seen:

  • count items in a pipeline, store counts in the Spider object; get the thresholds from settings and spider attributes
  • put thresholds in the JSON schema for required fields and retrieve them in the monitor; compare per-field validation errors (which Spidermon already counts) with the thresholds

The main question here is where to count the fields and where to store the counts. ItemValidationPipeline seems to be the ideal place for the counting code, as it already processes all items. The data can be stored in the stats, though I’m not sure what the policy is on adding multiple new keys (one per field) to the stats. If this is a problem, the counting can be disabled unless some setting is set.
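A minimal sketch of the counting idea, assuming a pipeline-like class and a stats object that exposes `inc_value` and `get_value`. `DictStats` is a stand-in so the example runs without Scrapy, and the `field_count/` key prefix and the `enabled` flag are hypothetical names, not existing Spidermon settings.

```python
from collections import defaultdict


class DictStats:
    """Stand-in for a stats collector; supports only inc_value and get_value."""

    def __init__(self):
        self._stats = defaultdict(int)

    def inc_value(self, key, count=1):
        self._stats[key] += count

    def get_value(self, key, default=None):
        return self._stats.get(key, default)


class FieldCountPipeline:
    """Counts how many items carry each field, one stats key per field."""

    def __init__(self, stats, enabled=True):
        self.stats = stats
        self.enabled = enabled  # hypothetical switch to avoid extra stats keys

    def process_item(self, item, spider=None):
        if self.enabled:
            self.stats.inc_value("field_count/_items")
            for field, value in dict(item).items():
                if value is not None:
                    self.stats.inc_value(f"field_count/{field}")
        return item


stats = DictStats()
pipeline = FieldCountPipeline(stats)
for scraped in [{"name": "a", "price": 1}, {"name": "b", "price": None}, {}]:
    pipeline.process_item(scraped)
print(stats.get_value("field_count/name"), stats.get_value("field_count/price"))
```

Keeping the counting behind a flag means spiders that do not opt in get no additional stats keys.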

The thresholds can be taken from settings and spider attributes. It should be possible to pass either a single number or a dict of field: number mappings; I don’t think this is a problem.
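Accepting either form could look like the sketch below. The helper names are hypothetical, and letting the spider attribute override the project setting is an assumption about the desired precedence.

```python
from numbers import Real


def normalize_thresholds(value, fields):
    """Turn a single number or a {field: number} dict into a {field: float} dict."""
    if value is None:
        return {}
    if isinstance(value, Real) and not isinstance(value, bool):
        # A single number applies to every known field.
        return {field: float(value) for field in fields}
    if isinstance(value, dict):
        return {field: float(threshold) for field, threshold in value.items()}
    raise TypeError(f"unsupported threshold value: {value!r}")


def effective_thresholds(setting_value, spider_value, fields):
    """Merge the two sources; spider-level values win (assumed precedence)."""
    merged = normalize_thresholds(setting_value, fields)
    merged.update(normalize_thresholds(spider_value, fields))
    return merged


print(effective_thresholds(0.9, {"price": 0.5}, ["name", "price"]))
```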

To keep compatibility, the old methods should be kept and new ones created, though I’m not sure the old ones are useful as they are.

We may also want to check percentages for subfields. This is possible with the current code, as jsonschema emits errors for them too, but because the errors are mangled when they are converted to stats keys, some guesswork is needed to unmangle them. My current proposal doesn’t take this additional feature into account; suggestions are welcome.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 5
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

wRAR commented, Apr 29, 2019 (3 reactions)

We’ve decided to try this approach:

  • as a first step, the current code will be changed to check the missing-field percentage per field. This can already be done using the existing validation data, but it will work only for fields marked as required in the schema. This changes the existing behavior, but I don’t think that code is useful in its current shape, as explained at the beginning of the initial post.
  • then it can be expanded to optionally take more specific threshold percentages.

This should already be very useful. If this works I’ll see if there is anything else useful that can be done.

rennerocha commented, Aug 18, 2020 (2 reactions)

https://github.com/scrapinghub/spidermon/pull/262 provided stats for the number of items returned and the coverage by field, without relying on schemas. These stats can be compared between spider executions, and custom monitors can be used to detect a drop in coverage of a certain field.

https://github.com/scrapinghub/spidermon/pull/263 added a new built-in monitor and an easy way to set coverage thresholds by field.

I believe that setting thresholds should not be tied to any specific JSON Schema, so in my opinion these PRs can be used to solve the problems mentioned in this issue.
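As an illustration of comparing coverage between executions, the core check of such a custom monitor might look like this. It is plain Python operating on two dicts of per-field coverage ratios; the function name, the `max_drop` parameter, and the input format are assumptions and do not reflect the stats keys or API added by those PRs.

```python
def coverage_drops(previous, current, max_drop=0.1):
    """Return fields whose coverage fell by more than max_drop between two runs.

    A field that was present in the previous run but is absent from the
    current one is treated as having 0.0 coverage.
    """
    drops = {}
    for field, before in previous.items():
        after = current.get(field, 0.0)
        if before - after > max_drop:
            drops[field] = (before, after)
    return drops


previous_run = {"name": 0.9, "price": 0.8}
current_run = {"name": 0.88, "price": 0.5}
print(coverage_drops(previous_run, current_run))
```

A relative check like this needs no schema and no fixed threshold per field, only the coverage numbers from an earlier run.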
