
`to_pandas` method of the `Package` class does not preserve integer data type

See original GitHub issue

Overview

When a data package contains a resource with an integer column that has one or more null values (empty cells in the CSV), exporting it to pandas does not preserve the integer data type: the column is interpreted as float64 instead.

How to reproduce

Zipping the file may be superfluous, but I wanted to test reading the data and metadata from the same (zipped) file.

import zipfile

import pandas as pd
from frictionless import Package, Resource, describe

# Create the data

data = {
    'id': [1, 2, 3],
    'quantity': [15, None, 20]
}
table = pd.DataFrame(data, dtype='Int64')
assert(table['quantity'].dtype == 'Int64')
table.to_csv('table.csv', index=False)

# Create the package metadata

package = Package(name='zipped-package', title='Example zipped data package')
package.add_resource(describe('table.csv'))
package.to_yaml('datapackage.yaml')

# Pack the zip archive

with zipfile.ZipFile('zipped-package.zip', 'w', compression=zipfile.ZIP_DEFLATED) as archive:
    archive.write('datapackage.yaml')
    archive.write('table.csv')

# Read the zipped package

new_package = Package('zipped-package.zip', descriptor='datapackage.yaml')
new_resource = new_package.get_resource(new_package.resource_names[0])
new_table = new_resource.to_pandas()

new_table.dtypes
# id            int64
# quantity    float64
# dtype: object

assert(new_table['quantity'].dtype == 'Int64')  # fails: dtype is float64

Proposal

Maybe the `to_pandas` method should check whether the resource has a schema. If one is found, it could be used to build the `dtype` argument for `read_csv`: a dictionary mapping the column names to the appropriate pandas types.

new_table2 = pd.read_csv('table.csv', dtype={'id': 'Int64', 'quantity': 'Int64'})

new_table2.dtypes
# id          Int64
# quantity    Int64
# dtype: object
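
For instance, here is a minimal sketch of how that dictionary could be derived from the resource's schema. The `FRICTIONLESS_TO_PANDAS` mapping below is a hypothetical illustration covering a few Table Schema types, not the library's actual conversion table:

# Hypothetical sketch: derive a pandas dtype mapping from a resource's schema.
# The FRICTIONLESS_TO_PANDAS table is an assumption for illustration only.
FRICTIONLESS_TO_PANDAS = {
    'integer': 'Int64',    # nullable integer, survives missing values
    'number': 'float64',
    'boolean': 'boolean',  # nullable boolean
    'string': 'string',
}

def schema_dtypes(resource):
    """Map each field name in the resource's schema to a pandas dtype."""
    return {
        field.name: FRICTIONLESS_TO_PANDAS.get(field.type, 'object')
        for field in resource.schema.fields
    }

# Usage with the resource from the example above:
# pd.read_csv('table.csv', dtype=schema_dtypes(new_resource))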

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
aivuk commented, May 29, 2022

Hi @augusto-herrmann, thanks for reporting. The issue only happens when there is an empty field in one of the rows; in your example, quantity is None for the second row. When the pandas plugin finds a None in an integer field, it replaces it with numpy.nan (NumPy's Not-a-Number, which is a float). As the pandas documentation explains (see https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#integer-dtypes-and-missing-data):

Because NaN is a float, a column of integers with even one missing value is cast to floating-point dtype (see Support for integer NA for more). pandas provides a nullable integer array, which can be used by explicitly requesting the dtype
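
For example, the cast can be seen directly on a small Series:

import pandas as pd

# With the default dtype, a single missing value forces the column to float64.
pd.Series([15, None, 20]).dtype                  # dtype('float64')

# Explicitly requesting the nullable extension type preserves integers.
pd.Series([15, None, 20], dtype='Int64').dtype   # Int64Dtype()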

I will need to think a bit more about the best approach for this case. I will update here ASAP.

0 reactions
augusto-herrmann commented, Jun 9, 2022

Perhaps it is possible if we use a sequence of pandas Series, considering that each Series can have its own dtype. But we can discuss that in a new issue.
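
For illustration, a sketch of that idea, assuming the column data from the example above: building the frame from per-column Series lets each column keep its own dtype.

import pandas as pd

# Build the frame column by column; each Series keeps its own dtype,
# including nullable Int64 for the column with a missing value.
columns = {
    'id': pd.Series([1, 2, 3], dtype='Int64'),
    'quantity': pd.Series([15, None, 20], dtype='Int64'),
}
df = pd.DataFrame(columns)

df.dtypes
# id          Int64
# quantity    Int64
# dtype: object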

