Per-field required percentage checking
Right now Spidermon field percentage validation works in this manner:
- you set some of your fields as required in the schema
- the validation code counts all missing required fields in all items, by catching schema validation errors
- the error count is divided by the item count
The result is not very useful: because the missing fields of all items are combined into one number, it's not clear what threshold to set.
What we actually need is a percentage check for each required field or, better, separate percentages for all fields. Here is how I see this:
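To make the problem concrete, here is a minimal sketch of the combined calculation described above. It is plain Python, not Spidermon code; the item data and the field names are made up for illustration.

```python
# Illustrative data only: four items, two required fields.
items = [
    {"title": "A", "price": 1},
    {"title": "B"},   # price missing
    {"price": 3},     # title missing
    {},               # both missing
]
required_fields = ("title", "price")

# Combined approach: count every missing required field across all items,
# then divide by the item count.
missing_total = sum(
    1 for item in items for field in required_fields if field not in item
)
combined_ratio = missing_total / len(items)

# Per-field view of the same data: share of items missing each field.
missing_by_field = {
    field: sum(1 for item in items if field not in item) / len(items)
    for field in required_fields
}
```

Here each field is missing in half of the items, but the combined ratio comes out as 1.0, so a single threshold on it says little about any particular field.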
- something calculates the field count for each field
- ValidationInfo calculates percentages for each field by dividing counts by the item count
- ValidationMonitorMixin gets per-field, per-spider and per-project thresholds and compares the percentage for each field with the relevant threshold
- an error, or a bunch of errors, is emitted, listing all failing fields with their actual and expected percentages
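The steps above could be sketched roughly as follows. This is standalone Python, not the actual ValidationInfo or ValidationMonitorMixin code; the function names field_percentages and failing_fields are hypothetical, and thresholds are assumed to be already resolved into a flat field-to-number dict.

```python
def field_percentages(field_counts, item_count):
    """Divide each per-field count by the total item count."""
    if item_count == 0:
        return {}
    return {field: count / item_count for field, count in field_counts.items()}


def failing_fields(percentages, thresholds):
    """Return one message per field whose percentage is below its threshold."""
    failures = []
    for field, expected in sorted(thresholds.items()):
        # A field that was never seen counts as 0% present.
        actual = percentages.get(field, 0.0)
        if actual < expected:
            failures.append(
                f"{field}: expected at least {expected:.0%}, got {actual:.0%}"
            )
    return failures


# Usage with made-up numbers.
percentages = field_percentages({"title": 100, "price": 60}, item_count=100)
errors = failing_fields(percentages, {"title": 0.95, "price": 0.8, "sku": 0.5})
```

A monitor could then emit either one error per entry in the returned list or a single error joining all of them.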
Some implementations I’ve seen:
- count items in a pipeline, store counts in the Spider object; get the thresholds from settings and spider attributes
- put thresholds in the JSON schema for required fields and retrieve them in the monitor; compare per-field validation errors (which spidermon already counts) with the thresholds
The main question here is where to count the fields and where to store the counts. ItemValidationPipeline seems to be the ideal place for the counting code as it already processes all items. The data can be stored in the stats, though I'm not sure what the policy is on adding multiple new keys (one per field) to the stats. If this is a problem, the counting can be disabled unless some setting is set.
The thresholds can be taken from settings and spider attributes. It should be possible to pass either a single number or a dict of field: number mappings; I don't think this is a problem.
To keep compatibility, the old methods should be kept and new ones created, though I'm not sure if the old ones are useful as is.
We may also want to check percentages for subfields. This is possible with the current code, as jsonschema emits errors for them too, but because the errors are mangled when they are converted to stats keys, some guesswork is needed to unmangle them. My current proposal doesn't take this additional feature into account; suggestions welcome.
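As a rough illustration of counting in a pipeline and storing the counts in the stats, consider the sketch below. It is a separate, hypothetical pipeline, not a patch to ItemValidationPipeline; the class name and the stats key names are assumptions. The only Scrapy API it relies on is the stats collector's inc_value method, reached through crawler.stats.

```python
class FieldCountPipeline:
    """Hypothetical pipeline that counts populated top-level fields."""

    # Assumed key names; one stats key is created per field seen.
    items_key = "field_count/_items"
    field_key_prefix = "field_count/"

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook and exposes the stats collector on the crawler.
        return cls(crawler.stats)

    def process_item(self, item, spider):
        self.stats.inc_value(self.items_key)
        for name, value in dict(item).items():
            # Treat None as missing; other falsy values still count as present.
            if value is not None:
                self.stats.inc_value(self.field_key_prefix + name)
        return item
```

Gating the pipeline behind a setting, as suggested above, would keep the extra per-field keys out of the stats for projects that do not want them.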
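One possible way to accept either form is sketched below. The helper names expand_threshold and resolve_thresholds are hypothetical, and the rule that a spider-level value overrides the project-level one is an assumption on my part, not something decided in this issue.

```python
from numbers import Real


def expand_threshold(value, fields):
    """Turn a single number or a field: number dict into a per-field dict."""
    if value is None:
        return {}
    if isinstance(value, Real):
        # A single number applies to every known field.
        return {field: float(value) for field in fields}
    return dict(value)


def resolve_thresholds(fields, project_value=None, spider_value=None):
    """Merge project-level and spider-level thresholds; the spider wins."""
    resolved = expand_threshold(project_value, fields)
    resolved.update(expand_threshold(spider_value, fields))
    return resolved


# Project default of 90% for all fields, relaxed to 50% for "price" on one spider.
thresholds = resolve_thresholds(["title", "price"], 0.9, {"price": 0.5})
```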
Issue Analytics
- Created 4 years ago
- Reactions: 5
- Comments: 9 (8 by maintainers)
Top GitHub Comments
We’ve decided to try this approach:
This should already be very useful. If this works I’ll see if there is anything else useful that can be done.
https://github.com/scrapinghub/spidermon/pull/262 provided stats for the number of items returned and the coverage by field, without relying on schemas. These stats can be compared between spider executions, and custom monitors can be used to detect a drop in the coverage of a certain field.
https://github.com/scrapinghub/spidermon/pull/263 added a new built-in monitor and an easy way to set coverage thresholds by field.
I believe that setting thresholds should not be tied to any specific JSON Schema, so in my opinion these PRs can be used to solve the problems mentioned in this issue.
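A custom check of this kind could look roughly like the sketch below. It is plain Python operating on two field-to-coverage dicts; how those dicts are extracted from the stats of each run is left out, and the function name coverage_drops and the max_drop parameter are hypothetical.

```python
def coverage_drops(previous, current, max_drop=0.1):
    """Return fields whose coverage fell by more than max_drop between runs."""
    drops = {}
    for field, old_coverage in previous.items():
        # A field absent from the current run is treated as 0% coverage.
        new_coverage = current.get(field, 0.0)
        if old_coverage - new_coverage > max_drop:
            drops[field] = (old_coverage, new_coverage)
    return drops


# Usage with made-up coverage values from two runs.
previous_run = {"title": 1.0, "price": 0.9}
current_run = {"title": 0.98, "price": 0.5}
dropped = coverage_drops(previous_run, current_run)
```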
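For reference, a settings fragment using this feature might look like the sketch below. The setting names, the monitor path and the "dict/" key prefix are given from memory of the Spidermon documentation and should be checked against the installed version; the field names and threshold values are made up.

```python
# settings.py fragment (names to be verified against the Spidermon docs)

# Collect per-field coverage stats while items are scraped.
SPIDERMON_ADD_FIELD_COVERAGE = True

# Minimum share of items (0.0 to 1.0) in which each field must be present.
SPIDERMON_FIELD_COVERAGE_RULES = {
    "dict/title": 1.0,   # illustrative field names and values
    "dict/price": 0.8,
}

# Run the built-in coverage monitor when the spider closes.
SPIDERMON_SPIDER_CLOSE_MONITORS = (
    "spidermon.contrib.scrapy.monitors.FieldCoverageMonitor",
)
```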