NAN consistency in JSON schema, pandas and numpy
Coming from here https://github.com/scrapinghub/arche/issues/83
I would like to treat missing values consistently, but I would also love to keep JSON schemas working and keep spidermon and arche compatible.
By inconsistency I mean that if some field’s value is missing, one might discard the field:
[{"availability": 1, "_key": "0"}, {"_key": "1"}]
Or make it None:
[{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]
Empty strings "" are consistent, so no issues here.
Either of these approaches to storing missing values requires a different JSON schema (and maybe schematics too). If using pandas, it will just put NaN in both cases.
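A minimal sketch of the pandas behavior described above, using the two example payloads from this issue. The variable names are made up for illustration; only pd.DataFrame and Series.isna are real pandas calls.

```python
import pandas as pd

# Variant 1: the missing field is dropped from the item.
dropped = [{"availability": 1, "_key": "0"}, {"_key": "1"}]
# Variant 2: the missing field is kept with an explicit None.
explicit = [{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]

df_dropped = pd.DataFrame(dropped)
df_explicit = pd.DataFrame(explicit)

# isna() reports the second row as missing in both frames,
# so the two storage styles are indistinguishable once loaded.
print(df_dropped["availability"].isna().tolist())
print(df_explicit["availability"].isna().tolist())
```

Both lines print [False, True], which is why the difference only matters at the JSON schema level and not inside pandas.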
So, about consistency - it can look like:
- Promote a uniform ideology in the company - (missing field = None or np.nan)
- Then for JSON schema, it will always be the null type - e.g. "type": ["string", "null"]
- (Struck out as a bad idea.) Here in spidermon in particular, it will require converting data explicitly to account for this - e.g.
[{"availability": 1, "_key": "0"}, {"_key": "1"}]
>
[{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]
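For illustration only, the explicit conversion from the struck-out item could be sketched as below. The helper fill_missing and the schema fragment are hypothetical, not part of spidermon or arche, and the "integer" type is an assumption based on the example value 1 (the issue's own example uses "string").

```python
import json


def fill_missing(items, fields):
    """Return copies of items where every name in fields is present.

    Hypothetical helper: absent fields default to None, existing values win.
    """
    return [{**dict.fromkeys(fields), **item} for item in items]


items = [{"availability": 1, "_key": "0"}, {"_key": "1"}]
normalized = fill_missing(items, ["availability", "_key"])

# Assumed schema fragment: the field is required but may be null.
schema_fragment = {
    "properties": {"availability": {"type": ["integer", "null"]}},
    "required": ["availability"],
}

# Python None serializes to JSON null, which the "null" type accepts.
print(json.dumps(normalized))
```

The original items are left untouched, so a caller can still tell whether the spider returned the field at all.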
More information here https://github.com/scrapinghub/arche/issues/83
Issue Analytics
- Created 4 years ago
- Comments: 5 (4 by maintainers)
Top GitHub Comments
Isn’t it too hardcore? Also, those projects (scrapy, spidermon, arche) are meant to be used by other people and organizations with different requirements and use cases.
This could hide some edge cases when the spider is not returning the field.
I believe this is a code smell. The way arche handles data internally is leaking across multiple repositories. It should be transparent.
I tend to be against Spidermon changing the contents of a returned item. If the spider returned the item without the field, I don't think it is Spidermon's job to add it back as None. Omitting it may be something the spider developer wants.