
to_parquet() - dtype parameter

See original GitHub issue

Hello,

I have a process where I am reading AWS DMS-created Parquet files sourced from SQL Server. One of my source tables has a date column containing the value 3015-06-29. When I read the metadata from the Parquet file, it does show my date column as a 'date' type ('startdate': 'date'), and when I display the data from the dataframe, I do see the 3015-06-29 value in the 'startdate' column. Without any transformations, I attempt to write the file back out, supplying the dtype parameter in the to_parquet() call, but I get an error (OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3015-06-29 00:00:00). However, if I remove the dtype parameter and write to S3, it completes successfully.

import awswrangler as wr

key = 's3://bucket/file.parquet'
column_metadata = wr.s3.read_parquet_metadata(key)[0]
df = wr.s3.read_parquet(key)

output_key = 's3://bucket/output/file.parquet'

# Write to S3 with the dtype parameter: errors with OutOfBoundsDatetime.
wr.s3.to_parquet(df, path=output_key, dtype=column_metadata, compression='gzip')

# Write to S3 without the dtype parameter: completes successfully.
wr.s3.to_parquet(df, path=output_key, compression='gzip')

wr.s3.read_parquet_metadata(output_key)[0]  # Shows 'startdate': 'date'

wr.s3.read_parquet(output_key)  # Shows the date 3015-06-29 as in the source.

I hope this is all understandable! I need the ability to supply the column types on the write because the inferred schema is causing other issues elsewhere. Thanks for your time.

Edited to add: I am using AWS Data Wrangler version 1.8.0.

Jarret
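
The OutOfBoundsDatetime error originates in pandas itself: datetime64[ns] values can only represent dates between roughly 1677 and 2262, so a year-3015 date overflows as soon as something forces a nanosecond-timestamp conversion (presumably what the dtype code path was doing here). A minimal standalone illustration of the pandas limit:

import pandas as pd

# Nanosecond-resolution timestamps are bounded; anything past April 2262 overflows.
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

try:
    pd.Timestamp('3015-06-29')
except pd.errors.OutOfBoundsDatetime as err:
    print(err)  # Out of bounds nanosecond timestamp: 3015-06-29 00:00:00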

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
jarretg commented, Aug 28, 2020

Got it, thanks for the explanation!

1 reaction
jarretg commented, Aug 28, 2020

Hi Igor! I tested the dev branch and it does successfully write the file with the dtype parameter passed in. You have my vote to push into v1.9.0.

I originally tried the new version with my script above (as written, which uses the timestamp datatype), and it still fails with the OutOfBounds error. Once I cast the value as a date, it worked as expected. Is there a plan to correct the timestamp type as well?

Thanks for your help!!!
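
For reference, a minimal sketch of the cast-to-date workaround jarretg describes, assuming the out-of-range values come back from read_parquet as Python datetime objects in an object-dtype column (the paths and the 'startdate' column name are illustrative):

import datetime

import awswrangler as wr

df = wr.s3.read_parquet('s3://bucket/file.parquet')

# Assumed workaround: coerce any datetime values to plain datetime.date objects
# so the dtype={'startdate': 'date'} path never needs a nanosecond conversion.
df['startdate'] = df['startdate'].apply(
    lambda v: v.date() if isinstance(v, datetime.datetime) else v
)

wr.s3.to_parquet(
    df,
    path='s3://bucket/output/file.parquet',
    dtype={'startdate': 'date'},
    compression='gzip',
)

Since an Athena/Glue 'date' column carries no time-of-day component, casting to datetime.date sidesteps the nanosecond timestamp bounds entirely.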
