When using ks.read_csv, Koalas type inference differs from pandas due to the use of Spark's inferSchema logic

Hi guys,

I’m using Koalas to combine two sets of similar data, and there is a step where we block rows using a direct comparison on a specific column. As part of this, we remove all NaN values from that column using DataFrame.dropna. In our tests, we found that for float columns, Koalas converts the column to an int type after the NaN rows are dropped if the remaining values have no decimal components. This is not undesirable per se, but it IS different from the behaviour in pandas, so I thought I’d at least double-check whether this is intended behaviour.

Test data:

ID,name,birth_year,hourly_wage,address,zipcode
b1,Mark Levene,1987,29.5,"108 Clement St, San Francisco",94107
b2,Bill Bridge,1986,32,"3131 Webster St, San Francisco",
b3,Mike Franklin,1988,27.5,"1652 Stockton St, San Francisco",94122
b4,,1982,26,"108 South Park, San Francisco",
b5,Alfons Kemper,1984,35,"170 Post St, Apt 4,  San Francisco",94122
b6,Michael Brodie,1987,32.5,,94107

Test script:

import pandas as pd
from databricks import koalas as ks

csv_file = 'test.csv'

pandas_A = pd.read_csv(csv_file)
koalas_A = ks.read_csv(csv_file)

print('pandas_A:\n{}\n'.format(pandas_A.to_string()))
print('koalas_A:\n{}\n'.format(koalas_A.to_string()))

pandas_A_dropna = pandas_A.dropna(subset=['zipcode'])
koalas_A_dropna = koalas_A.dropna(subset=['zipcode'])

print('pandas_A_dropna:\n{}\n'.format(pandas_A_dropna.to_string()))
print('koalas_A_dropna:\n{}\n'.format(koalas_A_dropna.to_string()))

print('pandas_A_dropna[zipcode].dtype = {}'.format(pandas_A_dropna['zipcode'].dtype))
print('koalas_A_dropna[zipcode].dtype = {}'.format(koalas_A_dropna['zipcode'].dtype))

Output:

pandas_A:
   ID            name  birth_year  hourly_wage                             address  zipcode
0  b1     Mark Levene        1987         29.5       108 Clement St, San Francisco  94107.0
1  b2     Bill Bridge        1986         32.0      3131 Webster St, San Francisco      NaN
2  b3   Mike Franklin        1988         27.5     1652 Stockton St, San Francisco  94122.0
3  b4             NaN        1982         26.0       108 South Park, San Francisco      NaN
4  b5   Alfons Kemper        1984         35.0  170 Post St, Apt 4,  San Francisco  94122.0
5  b6  Michael Brodie        1987         32.5                                 NaN  94107.0

koalas_A:
   ID            name  birth_year  hourly_wage                             address  zipcode
0  b1     Mark Levene        1987         29.5       108 Clement St, San Francisco  94107.0
1  b2     Bill Bridge        1986         32.0      3131 Webster St, San Francisco      NaN
2  b3   Mike Franklin        1988         27.5     1652 Stockton St, San Francisco  94122.0
3  b4            None        1982         26.0       108 South Park, San Francisco      NaN
4  b5   Alfons Kemper        1984         35.0  170 Post St, Apt 4,  San Francisco  94122.0
5  b6  Michael Brodie        1987         32.5                                None  94107.0

pandas_A_dropna:
   ID            name  birth_year  hourly_wage                             address  zipcode
0  b1     Mark Levene        1987         29.5       108 Clement St, San Francisco  94107.0
2  b3   Mike Franklin        1988         27.5     1652 Stockton St, San Francisco  94122.0
4  b5   Alfons Kemper        1984         35.0  170 Post St, Apt 4,  San Francisco  94122.0
5  b6  Michael Brodie        1987         32.5                                 NaN  94107.0

koalas_A_dropna:
   ID            name  birth_year  hourly_wage                             address  zipcode
0  b1     Mark Levene        1987         29.5       108 Clement St, San Francisco    94107
2  b3   Mike Franklin        1988         27.5     1652 Stockton St, San Francisco    94122
4  b5   Alfons Kemper        1984         35.0  170 Post St, Apt 4,  San Francisco    94122
5  b6  Michael Brodie        1987         32.5                                None    94107

pandas_A_dropna[zipcode].dtype = float64
koalas_A_dropna[zipcode].dtype = int32
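
The difference in the output above can be illustrated with a small, standard-library-only sketch. It is a simplified model, not Spark's or pandas' actual implementation: the function names `infer_ignoring_nulls` and `infer_nan_as_float` are hypothetical, and the CSV is a trimmed version of the test data. The assumption is that a Spark-style inference ignores empty fields when choosing a column type, while a pandas-style inference has to store a missing value as NaN and therefore widens an integer column to float.

```python
import csv
import io

# Trimmed version of the test data: one empty zipcode field.
CSV_TEXT = """ID,zipcode
b1,94107
b2,
b3,94122
"""


def read_column(text, name):
    """Return one column as raw strings; empty fields become None."""
    reader = csv.DictReader(io.StringIO(text))
    return [row[name] or None for row in reader]


def _all_parse(values, parser):
    """True if every value can be parsed by `parser` (int or float)."""
    try:
        for value in values:
            parser(value)
    except ValueError:
        return False
    return True


def infer_ignoring_nulls(values):
    """Simplified Spark-like policy: nulls do not influence the inferred type."""
    present = [v for v in values if v is not None]
    if _all_parse(present, int):
        return "int"
    if _all_parse(present, float):
        return "float"
    return "string"


def infer_nan_as_float(values):
    """Simplified pandas-like policy: a missing value is NaN, which forces float."""
    inferred = infer_ignoring_nulls(values)
    has_missing = any(v is None for v in values)
    if inferred == "int" and has_missing:
        return "float"
    return inferred


zipcodes = read_column(CSV_TEXT, "zipcode")
print(infer_ignoring_nulls(zipcodes))  # int
print(infer_nan_as_float(zipcodes))    # float
```

Under this model the column is an integer column from the start on the Spark side, and it only looks like a float column while nulls are present and the data is rendered through pandas. Once dropna removes the nulls, the underlying integer type shows through.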

Issue Analytics

Created 4 years ago
Comments: 6 (5 by maintainers)

Top GitHub Comments

2 reactions

itholic commented, Feb 9, 2020

"I think it’s time for Koalas to consider keeping the type but interpreting missing values as None or pd.NA."

Totally agree. This may be related to #1203.

2 reactions

HyukjinKwon commented, Feb 9, 2020

Missing values being float is a known issue, and it’s one of the things even the pandas authors don’t quite like. So far, Koalas has tried to copy this behaviour, but pandas 1.0.0 aims to replace np.nan with an experimental NA scalar (https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values).

I think it’s time for Koalas to consider keeping the type but interpreting missing values as None or pd.NA.
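
For reference, here is a minimal pandas-only sketch of what "keeping the type" looks like with the nullable integer dtype and pd.NA. It assumes pandas 1.0 or later and uses a trimmed, in-memory version of the test data. It does not show Koalas behaviour; it only shows that pandas itself can hold the zipcode column as integers with a missing value when asked to.

```python
import io

import pandas as pd

# Trimmed version of the test data: one empty zipcode field.
CSV_TEXT = "ID,zipcode\nb1,94107\nb2,\nb3,94122\n"

# Default inference: the missing value becomes NaN, so the column is float64.
default = pd.read_csv(io.StringIO(CSV_TEXT))

# Nullable integer dtype: the missing value becomes pd.NA and the type is kept.
nullable = pd.read_csv(io.StringIO(CSV_TEXT), dtype={"zipcode": "Int64"})

print(default["zipcode"].dtype)   # float64
print(nullable["zipcode"].dtype)  # Int64

# dropna no longer changes how the values look, because they were never floats.
dropped = nullable.dropna(subset=["zipcode"])
print(dropped["zipcode"].dtype)   # Int64
```

With the nullable dtype, the column type is the same before and after dropna, which is the consistency the original report was asking about.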
