NAN consistency in JSON schema, pandas and numpy

Coming from here https://github.com/scrapinghub/arche/issues/83

I would like to treat missing values consistently, but I would also love to keep JSON schemas working and keep Spidermon and Arche compatible. By inconsistency I mean that if some field's value is missing, one might discard the field:

[{"availability": 1, "_key": "0"}, {"_key": "1"}]

Or make it None:

[{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]

Empty strings "" are consistent, so no issues there. Each of these approaches to storing missing values requires a different JSON schema (and maybe schematics too). If using pandas, it will just put NaN in both cases.
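To illustrate the last point, here is a minimal sketch, assuming pandas is installed, that loads both representations into a DataFrame. The variable names are only for illustration; the data is the example from above.

```python
import pandas as pd

# Two ways of storing a missing "availability" value, as described above.
dropped = [{"availability": 1, "_key": "0"}, {"_key": "1"}]
as_none = [{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]

df_dropped = pd.DataFrame(dropped)
df_none = pd.DataFrame(as_none)

# In both frames the second row is reported as missing,
# so the original distinction between the two inputs is lost.
print(df_dropped["availability"].isna().tolist())
print(df_none["availability"].isna().tolist())
```

Once the data is in a DataFrame, there is no way to tell whether the spider omitted the field or returned it as None.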

So, about consistency, it could look like this:

1. Promote a uniform ideology in the company - (missing field = None or np.nan)
2. Then for json schema, it always will be null type - e.g. "type": ["string", "null"]

3. (Struck out as a bad idea.) Here in Spidermon in particular, it will require converting data explicitly to account for this - e.g. [{"availability": 1, "_key": "0"}, {"_key": "1"}] > [{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]
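As a rough sketch of what point 2 means in practice, the snippet below validates items against a nullable field using the `jsonschema` Python package (assumed to be installed). The schema itself is hypothetical: it uses "integer" rather than "string" because the example value of availability is 1.

```python
from jsonschema import Draft7Validator

# Hypothetical schema: "availability" may be an integer or null.
schema = {
    "type": "object",
    "properties": {
        "availability": {"type": ["integer", "null"]},
        "_key": {"type": "string"},
    },
    "required": ["_key"],
}
validator = Draft7Validator(schema)

items = [
    {"availability": 1, "_key": "0"},
    {"availability": None, "_key": "1"},  # Python None maps to JSON null
    {"_key": "2"},                        # field discarded entirely
]
results = [validator.is_valid(item) for item in items]
print(results)
```

Because "availability" is not listed under "required" here, the item that drops the field validates as well; adding it to "required" would make that third item fail.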

More information here https://github.com/scrapinghub/arche/issues/83

Issue Analytics

Created 4 years ago
Comments: 5 (4 by maintainers)

Top GitHub Comments

victor-torres commented, May 14, 2019

Promote a uniform ideology in the company - (missing field = None or np.nan)

Isn’t it too hardcore? Also, those projects (scrapy, spidermon, arche) are meant to be used by other people and organizations with different requirements and use cases.

Then for json schema, it always will be null type - e.g. "type": ["string", "null"]

This could hide some edge cases when the spider is not returning the field.
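A small sketch of the edge case, assuming the `jsonschema` package and two hypothetical schemas: a strict one reports the missing field, while a nullable one applied after filling the field with None reports nothing.

```python
from jsonschema import Draft7Validator

# Hypothetical strict schema: the field must be present and an integer.
strict = Draft7Validator({
    "type": "object",
    "properties": {"availability": {"type": "integer"}},
    "required": ["availability"],
})
# Hypothetical nullable schema: the field must be present but may be null.
nullable = Draft7Validator({
    "type": "object",
    "properties": {"availability": {"type": ["integer", "null"]}},
    "required": ["availability"],
})

item = {"_key": "1"}                         # the spider did not return the field
normalized = {"availability": None, **item}  # illustrative fill-with-None step

strict_errors = [e.message for e in strict.iter_errors(item)]
nullable_errors = [e.message for e in nullable.iter_errors(normalized)]
print(strict_errors)
print(nullable_errors)
```

The strict validator flags the item, whereas the normalized item passes the nullable schema silently, which is how a spider that stopped returning the field could go unnoticed.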

Here in Spidermon in particular, it will require converting data explicitly to account for this - e.g. [{"availability": 1, "_key": "0"}, {"_key": "1"}] > [{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]

I believe this is a code smell. The way arche handles data internally is leaking across multiple repositories. It should be transparent.

rennerocha commented, May 14, 2019

I tend to be against Spidermon changing the contents of a returned item. If the spider returned the item without that field, I don't think it is Spidermon's job to add it back with None. Omitting it may be what the spider developer intended.
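For illustration only, if a tool did need a None-filled view of an item for its own checks, one option is to build a separate copy and leave the spider's item as it was returned. The helper name `with_missing_as_none` is hypothetical and not part of Spidermon or Arche.

```python
def with_missing_as_none(item, fields):
    """Return a copy of `item` in which absent fields are set to None."""
    filled = dict(item)  # shallow copy, the original item is not modified
    for field in fields:
        filled.setdefault(field, None)
    return filled


original = {"_key": "1"}
for_validation = with_missing_as_none(original, ["availability", "_key"])

print(original)
print(for_validation)
```

The copy is shallow, so nested values are still shared with the original item; that is enough for flat items like the ones in this discussion.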