`to_pandas` method of the `Package` class does not preserve integer data type
Overview
When reading a data package containing a resource with an integer column that has one or more null values (empty cells in the CSV), and then exporting it to Pandas, the integer data type is not preserved: the column is interpreted as float64 instead.
How to reproduce
Zipping the file may be superfluous, but I wanted to test reading the data and metadata from the same (zipped) file.
```python
import zipfile

import pandas as pd
from frictionless import Package, describe

# Create the data
data = {
    'id': [1, 2, 3],
    'quantity': [15, None, 20]
}
table = pd.DataFrame(data, dtype='Int64')
assert table['quantity'].dtype == 'Int64'
table.to_csv('table.csv', index=False)

# Create the package metadata
package = Package(name='zipped-package', title='Example zipped data package')
package.add_resource(describe('table.csv'))
package.to_yaml('datapackage.yaml')

# Pack the zip archive
with zipfile.ZipFile('zipped-package.zip', 'w', compression=zipfile.ZIP_DEFLATED) as archive:
    archive.write('datapackage.yaml')
    archive.write('table.csv')

# Read the zipped package
new_package = Package('zipped-package.zip', descriptor='datapackage.yaml')
new_resource = new_package.get_resource(new_package.resource_names[0])
new_table = new_resource.to_pandas()
new_table.dtypes
# id            int64
# quantity    float64
# dtype: object

assert new_table['quantity'].dtype == 'Int64'  # AssertionError: dtype is float64
```
Proposal
Maybe the `to_pandas` method should check for a schema. If found, it would use the schema to determine the column dtypes, calling `read_csv` with the appropriate `dtype` argument: a dictionary mapping the column names to the appropriate types.
```python
new_table2 = pd.read_csv('table.csv', dtype={'id': 'Int64', 'quantity': 'Int64'})
new_table2.dtypes
# id          Int64
# quantity    Int64
# dtype: object
```
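One way such a schema-driven mapping could look. The `TYPE_MAP` dictionary and `schema_to_pandas_dtypes` helper below are hypothetical names, not part of the frictionless API; the sketch assumes a Table Schema descriptor of the shape that `describe('table.csv')` infers:

```python
import pandas as pd

# Hypothetical mapping from Table Schema field types to pandas
# nullable extension dtypes (illustrative only, not frictionless API).
TYPE_MAP = {
    'integer': 'Int64',
    'number': 'Float64',
    'boolean': 'boolean',
    'string': 'string',
}

def schema_to_pandas_dtypes(schema: dict) -> dict:
    """Build a column-name -> dtype mapping from a Table Schema dict."""
    return {
        field['name']: TYPE_MAP.get(field.get('type', 'string'), 'object')
        for field in schema['fields']
    }

# The schema describe('table.csv') would infer for the example data:
schema = {'fields': [{'name': 'id', 'type': 'integer'},
                     {'name': 'quantity', 'type': 'integer'}]}
dtypes = schema_to_pandas_dtypes(schema)
new_table2 = pd.read_csv('table.csv', dtype=dtypes)
assert new_table2['quantity'].dtype == 'Int64'
```

Using the nullable extension dtypes (capitalized `Int64` rather than `int64`) is what lets the missing value be stored as `pd.NA` without upcasting the column to float.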
Top GitHub Comments
Hi @augusto-herrmann, thanks for reporting. The issue happens only when there is an empty field in one of the rows; in your example, quantity is None for the second row. When the pandas plugin finds a None in an integer field, it replaces it with numpy.nan (NumPy's Not a Number, which is a float). This behaviour is documented in the pandas user guide on integer dtypes and missing data: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#integer-dtypes-and-missing-data
I will need to think a bit more about the best approach for this case. I will update here ASAP.
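A minimal standalone demonstration of the coercion described in the comment above, independent of frictionless:

```python
import pandas as pd

# With the default NumPy-backed dtype, a missing value forces an
# upcast to float64, because numpy.nan is a float.
s = pd.Series([15, None, 20])
print(s.dtype)  # float64

# With the nullable extension dtype, the missing value is stored
# as pd.NA and the integer dtype is preserved.
s = pd.Series([15, None, 20], dtype='Int64')
print(s.dtype)  # Int64
```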
Perhaps it is possible if we use a sequence of pandas `Series`, considering that each `Series` can have its own `dtype`. But we can discuss that in a new issue.
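For reference, a minimal sketch of that idea: constructing the frame from a dict of `Series`, so each column keeps its own dtype. The dtypes here are hard-coded for illustration; in `to_pandas` they would come from the schema:

```python
import pandas as pd

# One Series per column, each with its own (nullable) dtype.
columns = {
    'id': pd.Series([1, 2, 3], dtype='Int64'),
    'quantity': pd.Series([15, None, 20], dtype='Int64'),
}
df = pd.DataFrame(columns)
print(df.dtypes)
# id          Int64
# quantity    Int64
# dtype: object
```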