Converting NaN objects falsely turns NaN into None
When converting a DataFrame/Series of type object (i.e. strings) with np.nan values to Koalas DataFrames and back, the former np.nan values are replaced with None, as can be seen below:
>>> ks.Series(['a', np.nan]).to_pandas()
0       a
1    None
Name: 0, dtype: object
However, the following output would be expected instead:
0      a
1    NaN
Name: 0, dtype: object
I assume this behavior is caused by the fact that Spark does not support NaN values for string columns and uses None instead. Consequently, there is probably no definitive way to decide whether a None value in Spark should be converted to a Python NaN or None at the time the conversion from Spark to pandas happens. However, I would argue that, when in doubt, converting to NaN makes more sense in most cases than None and should thus be the default.
What is your opinion on this?
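Until the behavior changes, the None values can be mapped back to np.nan on the pandas side. The sketch below is an illustrative workaround (not part of the original report); it uses a plain pandas Series standing in for the result of to_pandas(), and assumes Series.where for the substitution:

```python
import numpy as np
import pandas as pd

# Stand-in for what Koalas' to_pandas() currently returns:
# None where the original Series had np.nan.
ser = pd.Series(['a', None], name=0)

# Workaround sketch: Series.where keeps values where the condition
# holds and substitutes np.nan wherever the value is missing.
restored = ser.where(ser.notna(), np.nan)

print(restored.iloc[1])          # nan
print(restored.iloc[1] is None)  # False
```

This keeps the object dtype while restoring the NaN representation the reporter expects.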
Issue Analytics
- State: closed
- Created 4 years ago
- Comments:8 (4 by maintainers)
Note that performance is not going to be an issue here in any case: Spark's internal representation of None is a bit mask, and the scalar values are always allocated.
Here is what I suggest we do when converting the data between pandas and Spark:
@floscha does that cover all the use cases that you are thinking of? Can you write a couple of test cases to validate that this is the expected behaviour?
Also, be aware that nullability support in Spark can be brittle; some corner cases, especially after UDFs or joins, will lose nullability info.
Let me close this for now, since it can't be supported: Spark can't tell whether the null value of an object type column should be None or NaN when converting to pandas anyway.
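The ambiguity is easy to see on the pandas side alone. The sketch below (my illustration, not from the thread) shows that pandas counts both None and np.nan as missing in an object column, even though the underlying objects are distinct; once both collapse to Spark's single null, that distinction is gone:

```python
import numpy as np
import pandas as pd

# In an object column, pandas treats both None and np.nan as missing.
ser = pd.Series(['a', None, np.nan], dtype=object)
print(ser.isna().tolist())  # [False, True, True]

# Yet the stored objects are different, which is exactly the
# information a single Spark null value cannot preserve.
print(ser.iloc[1] is None)             # True
print(isinstance(ser.iloc[2], float))  # True (it's nan)
```

So on the way back from Spark, any choice of None or NaN is necessarily a convention rather than a recovery of the original value.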