to_parquet() - dtype parameter
Hello,
I have a process where I read AWS DMS-created parquet files sourced from SQL Server. One of my source tables has a date column containing the value 3015-06-29. When I read the metadata from the parquet file, it shows my date column as a 'date' type ('startdate': 'date'), and when I display the dataframe I do see the 3015-06-29 value in the 'startdate' column. Without any transformations, I attempt to write the file back out, supplying the dtype parameter in the to_parquet() call, but I get an error (OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3015-06-29 00:00:00). However, if I remove the dtype parameter and write to S3, it completes successfully.
import awswrangler as wr

key = 's3://bucket/file.parquet'
column_metadata = wr.s3.read_parquet_metadata(key)[0]  # dict mapping column names to types
df = wr.s3.read_parquet(key)
output_key = 's3://bucket/output/file.parquet'

# Write to S3 with the dtype parameter.
wr.s3.to_parquet(df, path=output_key, dtype=column_metadata, compression='gzip')  # Errors with OutOfBoundsDatetime.

# Write to S3 without the dtype parameter.
wr.s3.to_parquet(df, path=output_key, compression='gzip')  # Completes successfully.

wr.s3.read_parquet_metadata(output_key)[0]  # Shows 'startdate': 'date'
wr.s3.read_parquet(output_key)  # Shows the date 3015-06-29, as in the source.
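For context, the OutOfBoundsDatetime error comes from pandas rather than the parquet format: pandas 1.x (current when this issue was filed) stores datetimes as nanosecond-resolution datetime64[ns], whose representable range ends in 2262, so coercing 3015-06-29 into a timestamp overflows. A minimal illustration, independent of awswrangler:

import pandas as pd

pd.Timestamp.max            # Timestamp('2262-04-11 23:47:16.854775807')
pd.Timestamp('3015-06-29')  # Raises OutOfBoundsDatetime under pandas 1.x.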
I hope this is all understandable! I need the ability to supply the column types when writing, because the inferred schema is causing other issues elsewhere. Thanks for your time.
Edited to add: I am using AWS Data Wrangler version 1.8.0.
Jarret
Got it, thanks for the explanation!
Hi Igor! I tested the dev branch and it does successfully write the file with the dtype parameter passed in. You have my vote to push into v1.9.0.
I originally tried the new version with my script above (as written, which uses the timestamp datatype), and it still fails with the OutOfBounds error. Once I cast the value as date, it did work as expected (a sketch of that workaround follows below). Is there a plan to correct the timestamp type as well?
Thanks for your help!!!
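For anyone hitting the same error, here is a minimal sketch of the cast described above, reusing the df, output_key, and column_metadata names from the script at the top. The 'startdate' override is illustrative, not taken from the original script:

# Reuse the source metadata but force the problematic column to 'date',
# so awswrangler never coerces the values into datetime64[ns].
dtype = dict(column_metadata)
dtype['startdate'] = 'date'
wr.s3.to_parquet(df, path=output_key, dtype=dtype, compression='gzip')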