NAN consistency in JSON schema, pandas and numpy

Coming from here https://github.com/scrapinghub/arche/issues/83

I would like to treat missing values consistently, but I would also love to keep JSON schemas working and keep spidermon and arche compatible. By inconsistency I mean that if some field's value is missing, one might discard the field:

[{"availability": 1, "_key": "0"}, {"_key": "1"}]

Or make it None:

[{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]

Empty strings "" are consistent, so no issues there. Each of these approaches to storing missing values requires a different JSON schema (and maybe a different schematics model too). pandas, on the other hand, will just put NaN in both cases.
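A minimal sketch of the pandas behaviour described above, using the two item lists from this issue. The variable names are illustrative only; `isna()` is used for the check because it reports both NaN and None as missing.

```python
import pandas as pd

# The two ways of storing a missing "availability" value.
items_field_dropped = [{"availability": 1, "_key": "0"}, {"_key": "1"}]
items_field_none = [
    {"availability": 1, "_key": "0"},
    {"availability": None, "_key": "1"},
]

df_dropped = pd.DataFrame(items_field_dropped)
df_none = pd.DataFrame(items_field_none)

# isna() is True for NaN as well as None, so both frames report
# the second row's "availability" as missing.
missing_dropped = df_dropped["availability"].isna().tolist()
missing_none = df_none["availability"].isna().tolist()
print(missing_dropped, missing_none)
```

Once the data is in a DataFrame, the difference between "field dropped" and "field set to None" is no longer visible, which is why the two conventions only clash at the schema level.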

So, about consistency - it can look like:

  1. Promote a uniform ideology in the company - (missing field = None or np.nan)
  2. Then for JSON schema, the field will always include the null type - e.g. "type": ["string", "null"]
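As a rough illustration of point 2, the sketch below uses the `jsonschema` package to compare a plain type with a nullable one. The item schema and the helper `make_validator` are assumptions made for this example (with "integer" instead of "string" to match the `availability` values above), not schemas taken from spidermon or arche.

```python
from jsonschema import Draft7Validator


def make_validator(availability_type):
    # Hypothetical item schema; only the type of "availability" varies.
    return Draft7Validator({
        "type": "object",
        "properties": {
            "_key": {"type": "string"},
            "availability": {"type": availability_type},
        },
    })


plain = make_validator("integer")
nullable = make_validator(["integer", "null"])

item_with_none = {"availability": None, "_key": "1"}

# Python's None is treated as JSON null by jsonschema.
plain_accepts_none = plain.is_valid(item_with_none)
nullable_accepts_none = nullable.is_valid(item_with_none)
print(plain_accepts_none, nullable_accepts_none)
```

Under the "missing field = None" convention, every field that can be missing would need the nullable form of its type.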

Bad idea: ~3. Here in spidermon in particular, it will require converting data explicitly to account for this - e.g. [{"availability": 1, "_key": "0"}, {"_key": "1"}] > [{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]~
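For concreteness, the explicit conversion mentioned in the struck-out point could look roughly like the sketch below. `fill_missing_fields` is a hypothetical helper written for this example; it is not part of spidermon or arche.

```python
def fill_missing_fields(items, fields, fill_value=None):
    """Return copies of the items in which every name in `fields` is present.

    Fields an item does not have are added with `fill_value`;
    existing values are left untouched.
    """
    return [
        {**{field: fill_value for field in fields}, **item}
        for item in items
    ]


items = [{"availability": 1, "_key": "0"}, {"_key": "1"}]
normalized = fill_missing_fields(items, fields=["availability", "_key"])
print(normalized)
```

The helper returns new dicts rather than mutating the originals, so the items as returned by the spider stay unchanged.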

More information here https://github.com/scrapinghub/arche/issues/83

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
victor-torres commented, May 14, 2019

Promote a uniform ideology in the company - (missing field = None or np.nan)

Isn’t it too hardcore? Also, those projects (scrapy, spidermon, arche) are meant to be used by other people and organizations with different requirements and use cases.

Then for JSON schema, the field will always include the null type - e.g. "type": ["string", "null"]

This could hide some edge cases when the spider is not returning the field.
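One way to picture this edge case is the sketch below, which uses the `jsonschema` package. The schemas are assumptions made for illustration: a nullable type on its own says nothing about whether the field must be present, so an item that omits it still validates unless the field is also listed under `required`.

```python
from jsonschema import Draft7Validator

# Hypothetical item properties with a nullable "availability".
properties = {
    "_key": {"type": "string"},
    "availability": {"type": ["integer", "null"]},
}

lenient = Draft7Validator({"type": "object", "properties": properties})
strict = Draft7Validator({
    "type": "object",
    "properties": properties,
    "required": ["availability"],
})

# The spider did not return the field at all.
item_without_field = {"_key": "1"}

lenient_ok = lenient.is_valid(item_without_field)
# Each error's .validator attribute names the schema keyword that failed.
strict_errors = [err.validator for err in strict.iter_errors(item_without_field)]
print(lenient_ok, strict_errors)
```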

Here in spidermon in particular, it will require converting data explicitly to account for this - e.g. [{"availability": 1, "_key": "0"}, {"_key": "1"}] > [{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]

I believe this is a code smell. The way arche handles data internally is leaking across multiple repositories. It should be transparent.

1 reaction
rennerocha commented, May 14, 2019

I tend to be against Spidermon changing the contents of a returned item. If the spider returned the item without a field, I don't think it is Spidermon's job to add it back with None. Omitting it may be exactly what the spider developer wanted.
