When using ks.read_csv, Koalas type inference differs from pandas due to the use of Spark's inferSchema logic
Hi guys,
I’m using Koalas to combine two sets of similar data, and there is a step where we block rows using a direct comparison on a specific column. As part of this, we remove all NaN values from that column using DataFrame.dropna, but in our tests, we found that for float columns, Koalas would convert that column to an int if the values did not have decimal components. This is not exactly undesirable per se, but it IS different from the behaviour in pandas, so I thought I’d at least double check if this is intended behaviour.
Test data:
ID,name,birth_year,hourly_wage,address,zipcode
b1,Mark Levene,1987,29.5,"108 Clement St, San Francisco",94107
b2,Bill Bridge,1986,32,"3131 Webster St, San Francisco",
b3,Mike Franklin,1988,27.5,"1652 Stockton St, San Francisco",94122
b4,,1982,26,"108 South Park, San Francisco",
b5,Alfons Kemper,1984,35,"170 Post St, Apt 4, San Francisco",94122
b6,Michael Brodie,1987,32.5,,94107
Test script:
import pandas as pd
from databricks import koalas as ks
csv_file = 'test.csv'
pandas_A = pd.read_csv(csv_file)
koalas_A = ks.read_csv(csv_file)
print('pandas_A:\n{}\n'.format(pandas_A.to_string()))
print('koalas_A:\n{}\n'.format(koalas_A.to_string()))
pandas_A_dropna = pandas_A.dropna(subset=['zipcode'])
koalas_A_dropna = koalas_A.dropna(subset=['zipcode'])
print('pandas_A_dropna:\n{}\n'.format(pandas_A_dropna.to_string()))
print('koalas_A_dropna:\n{}\n'.format(koalas_A_dropna.to_string()))
print('pandas_A_dropna[zipcode].dtype = {}'.format(pandas_A_dropna['zipcode'].dtype))
print('koalas_A_dropna[zipcode].dtype = {}'.format(koalas_A_dropna['zipcode'].dtype))
Output:
pandas_A:
   ID            name  birth_year  hourly_wage                            address  zipcode
0  b1     Mark Levene        1987         29.5      108 Clement St, San Francisco  94107.0
1  b2     Bill Bridge        1986         32.0     3131 Webster St, San Francisco      NaN
2  b3   Mike Franklin        1988         27.5    1652 Stockton St, San Francisco  94122.0
3  b4             NaN        1982         26.0      108 South Park, San Francisco      NaN
4  b5   Alfons Kemper        1984         35.0  170 Post St, Apt 4, San Francisco  94122.0
5  b6  Michael Brodie        1987         32.5                                NaN  94107.0

koalas_A:
   ID            name  birth_year  hourly_wage                            address  zipcode
0  b1     Mark Levene        1987         29.5      108 Clement St, San Francisco  94107.0
1  b2     Bill Bridge        1986         32.0     3131 Webster St, San Francisco      NaN
2  b3   Mike Franklin        1988         27.5    1652 Stockton St, San Francisco  94122.0
3  b4            None        1982         26.0      108 South Park, San Francisco      NaN
4  b5   Alfons Kemper        1984         35.0  170 Post St, Apt 4, San Francisco  94122.0
5  b6  Michael Brodie        1987         32.5                               None  94107.0

pandas_A_dropna:
   ID            name  birth_year  hourly_wage                            address  zipcode
0  b1     Mark Levene        1987         29.5      108 Clement St, San Francisco  94107.0
2  b3   Mike Franklin        1988         27.5    1652 Stockton St, San Francisco  94122.0
4  b5   Alfons Kemper        1984         35.0  170 Post St, Apt 4, San Francisco  94122.0
5  b6  Michael Brodie        1987         32.5                                NaN  94107.0

koalas_A_dropna:
   ID            name  birth_year  hourly_wage                            address  zipcode
0  b1     Mark Levene        1987         29.5      108 Clement St, San Francisco    94107
2  b3   Mike Franklin        1988         27.5    1652 Stockton St, San Francisco    94122
4  b5   Alfons Kemper        1984         35.0  170 Post St, Apt 4, San Francisco    94122
5  b6  Michael Brodie        1987         32.5                               None    94107
pandas_A_dropna[zipcode].dtype = float64
koalas_A_dropna[zipcode].dtype = int32
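The difference in the output above can be pictured with a small standard-library sketch. It is not the actual pandas or Spark inference code; the function names and the returned type labels are hypothetical and chosen only for illustration. The assumption is that a pandas-style reader falls back to a float type when an integer column has a missing cell (NaN needs a float), while a Spark-style reader ignores nulls when picking the type and marks the column nullable.

```python
import csv
import io

# Cut-down version of the test data: one integer-looking column with a gap.
CSV_TEXT = """ID,zipcode
b1,94107
b2,
b3,94122
"""


def column_values(text, column):
    """Return the raw strings of one column; empty cells become None."""
    reader = csv.DictReader(io.StringIO(text))
    return [row[column] or None for row in reader]


def _parses_as(kind, value):
    """True if `value` can be converted with `kind` (int or float)."""
    try:
        kind(value)
        return True
    except ValueError:
        return False


def infer_pandas_like(values):
    """Hypothetical rule: integers with a missing cell become float64."""
    present = [v for v in values if v is not None]
    if all(_parses_as(int, v) for v in present):
        return "int64" if len(present) == len(values) else "float64"
    if all(_parses_as(float, v) for v in present):
        return "float64"
    return "object"


def infer_spark_like(values):
    """Hypothetical rule: nulls are ignored, the column is just nullable."""
    present = [v for v in values if v is not None]
    if all(_parses_as(int, v) for v in present):
        return "int (nullable)"
    if all(_parses_as(float, v) for v in present):
        return "double (nullable)"
    return "string (nullable)"


zipcodes = column_values(CSV_TEXT, "zipcode")
print(infer_pandas_like(zipcodes))
print(infer_spark_like(zipcodes))
```

Under these assumptions the column is an integer column on the Spark side from the start, and it only looks like a float while nulls are present, which would explain why dropna appears to change the dtype.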
Issue Analytics
- State:
- Created 4 years ago
- Comments:6 (5 by maintainers)
Top GitHub Comments
Totally agree. Maybe related to #1203.
Missing values being float is a known issue, and it’s one of the things even the pandas authors don’t quite like. So far, Koalas has tried to copy this behaviour, but pandas 1.0.0 aims to replace np.nan (https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values). I think it’s time for Koalas to consider keeping the type but interpreting missing values as None or pd.NA.
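For reference, here is a minimal pandas-only sketch of what "keeping the type" looks like with the nullable integer dtype. It assumes pandas 1.0 or later and uses a cut-down, in-memory version of the test data; whether ks.read_csv accepts the same dtype value is not confirmed here.

```python
import io

import pandas as pd

# Cut-down version of the test data: one integer-looking column with a gap.
CSV_TEXT = """ID,zipcode
b1,94107
b2,
b3,94122
"""

# Default inference: the missing cell forces the column to float64.
default = pd.read_csv(io.StringIO(CSV_TEXT))

# Explicit nullable integer dtype: the missing cell becomes pd.NA
# and the column stays an integer column.
nullable = pd.read_csv(io.StringIO(CSV_TEXT), dtype={"zipcode": "Int64"})

print(default["zipcode"].dtype)
print(nullable["zipcode"].dtype)
print(nullable.dropna(subset=["zipcode"])["zipcode"].dtype)
```

With the explicit dtype, the column type is the same before and after dropna, so the result no longer depends on whether missing values happen to be present.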