`to_pandas` method of the `Package` class does not preserve integer data type
Overview
When reading a data package containing a resource with an integer column that has one or more null values (empty cells in the CSV), and then exporting it to Pandas, the integer data type is not preserved: the column is interpreted as float64 instead.
How to reproduce
Zipping the file may be superfluous, but I wanted to test reading the data and metadata from the same (zipped) file.
```python
import zipfile

import pandas as pd
from frictionless import Package, describe

# Create the data
data = {
    'id': [1, 2, 3],
    'quantity': [15, None, 20]
}
table = pd.DataFrame(data, dtype='Int64')
assert table['quantity'].dtype == 'Int64'
table.to_csv('table.csv', index=False)

# Create the package metadata
package = Package(name='zipped-package', title='Example zipped data package')
package.add_resource(describe('table.csv'))
package.to_yaml('datapackage.yaml')

# Pack the zip archive
with zipfile.ZipFile('zipped-package.zip', 'w', compression=zipfile.ZIP_DEFLATED) as archive:
    archive.write('datapackage.yaml')
    archive.write('table.csv')

# Read the zipped package
new_package = Package('zipped-package.zip', descriptor='datapackage.yaml')
new_resource = new_package.get_resource(new_package.resource_names[0])
new_table = new_resource.to_pandas()
new_table.dtypes
# id            int64
# quantity    float64
# dtype: object

assert new_table['quantity'].dtype == 'Int64'  # AssertionError: dtype is float64
```
Proposal
Maybe the `to_pandas` method should check for a schema. If found, it would use the schema to determine the column dtypes, calling `read_csv` with the appropriate `dtype` argument: a dictionary mapping the column names to the appropriate types.
```python
new_table2 = pd.read_csv('table.csv', dtype={'id': 'Int64', 'quantity': 'Int64'})
new_table2.dtypes
# id          Int64
# quantity    Int64
# dtype: object
```
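One way such a schema-driven mapping could look. The `TYPE_MAP` dictionary and `schema_to_pandas_dtypes` helper below are hypothetical names, not part of the frictionless API; the sketch assumes a Table Schema descriptor of the shape that `describe('table.csv')` infers:

```python
import pandas as pd

# Hypothetical mapping from Table Schema field types to pandas
# nullable extension dtypes (illustrative only, not frictionless API).
TYPE_MAP = {
    'integer': 'Int64',
    'number': 'Float64',
    'boolean': 'boolean',
    'string': 'string',
}

def schema_to_pandas_dtypes(schema: dict) -> dict:
    """Build a column-name -> dtype mapping from a Table Schema dict."""
    return {
        field['name']: TYPE_MAP.get(field.get('type', 'string'), 'object')
        for field in schema['fields']
    }

# The schema describe('table.csv') would infer for the example data:
schema = {'fields': [{'name': 'id', 'type': 'integer'},
                     {'name': 'quantity', 'type': 'integer'}]}
dtypes = schema_to_pandas_dtypes(schema)
new_table2 = pd.read_csv('table.csv', dtype=dtypes)
assert new_table2['quantity'].dtype == 'Int64'
```

Using the nullable extension dtypes (capitalized `Int64` rather than `int64`) is what lets the missing value be stored as `pd.NA` without upcasting the column to float.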
Top GitHub Comments
Hi @augusto-herrmann, thanks for reporting. The issue happens only when there is an empty field in one of the rows; in your example, quantity is None for the second row. When the pandas plugin finds a None in an integer field, it replaces it with numpy.nan (NumPy's Not a Number, which is a float). This behaviour is documented in the pandas user guide on integer dtypes and missing data: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#integer-dtypes-and-missing-data
I will need to think a bit more about the best approach for this case. I will update here ASAP.
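A minimal standalone demonstration of the coercion described in the comment above, independent of frictionless:

```python
import pandas as pd

# With the default NumPy-backed dtype, a missing value forces an
# upcast to float64, because numpy.nan is a float.
s = pd.Series([15, None, 20])
print(s.dtype)  # float64

# With the nullable extension dtype, the missing value is stored
# as pd.NA and the integer dtype is preserved.
s = pd.Series([15, None, 20], dtype='Int64')
print(s.dtype)  # Int64
```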
Perhaps it is possible if we use a sequence of pandas `Series`, considering that each `Series` can have its own `dtype`. But we can discuss that in a new issue.
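For reference, a minimal sketch of that idea: constructing the frame from a dict of `Series`, so each column keeps its own dtype. The dtypes here are hard-coded for illustration; in `to_pandas` they would come from the schema:

```python
import pandas as pd

# One Series per column, each with its own (nullable) dtype.
columns = {
    'id': pd.Series([1, 2, 3], dtype='Int64'),
    'quantity': pd.Series([15, None, 20], dtype='Int64'),
}
df = pd.DataFrame(columns)
print(df.dtypes)
# id          Int64
# quantity    Int64
# dtype: object
```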