Write multiple parquet files from a single dataframe defined by max file size
The Problem
I was unable to find a way to write a single dataframe to multiple parquet files using the s3.to_parquet()
method. Currently, it seems to write one parquet file, which can slow down Athena queries.
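For context, a minimal sketch of the current single-file behavior (the bucket path and sample dataframe are placeholders; assumes awswrangler is installed and AWS credentials are configured):

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": range(1_000_000), "value": "x"})

# The whole dataframe lands in one S3 object, regardless of its size.
wr.s3.to_parquet(df=df, path="s3://my-bucket/my-table/data.snappy.parquet")
```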
Possible Solution
It would be great to have a “max parquet file size” option in s3.to_parquet(). Instead of creating one large parquet file, we could create many smaller parquet files, which would help optimize Athena queries.
My reasoning
I believe Athena, behind the scenes, uses the number of files to split the query workload across different nodes.
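Until such an option exists, here is a minimal sketch of one possible workaround: split the dataframe into row chunks sized from its in-memory footprint and write each chunk as its own S3 object. The function name, the 128 MB default, and the part-file naming are illustrative, not part of the library; it assumes awswrangler is installed and credentials are configured.

```python
import math

import awswrangler as wr
import pandas as pd


def to_parquet_chunked(df: pd.DataFrame, path: str,
                       max_file_bytes: int = 128 * 1024 * 1024) -> list:
    """Write df to S3 as several parquet files of roughly max_file_bytes each."""
    # Estimate bytes per row from the in-memory footprint; parquet's
    # columnar compression usually shrinks this, so files tend to land
    # at or below the target size.
    row_bytes = max(df.memory_usage(deep=True).sum() / max(len(df), 1), 1)
    rows_per_file = max(int(max_file_bytes // row_bytes), 1)

    paths = []
    for i in range(math.ceil(len(df) / rows_per_file)):
        chunk = df.iloc[i * rows_per_file:(i + 1) * rows_per_file]
        key = f"{path.rstrip('/')}/part-{i:05d}.snappy.parquet"
        wr.s3.to_parquet(df=chunk, path=key)  # one object per chunk
        paths.append(key)
    return paths
```

Keeping files near 128 MB follows a common Athena guideline: many similarly sized objects let Athena parallelize a scan across nodes instead of reading one large file serially.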
Can I work on this particular request?
@rparthas yep, it would be very welcome! Just make sure to check out from our dev branch and then open a pull request against the same.