Add an option to coalesce partitions in DataFrame.to_csv()
ks.DataFrame.to_csv is too slow because it converts to a pandas.DataFrame first. Is there a way to save to CSV format directly?
Issue Analytics
- State:
- Created 4 years ago
- Comments: 13 (13 by maintainers)
Top GitHub Comments
If you add a coalesce, you will run into the same issue: the CSV file writing is no longer parallelized.
It’s probably best to have an option to indicate whether we allow multiple output files, and explain what that means in terms of tradeoffs.
Yes, I think this case was fixed as of https://github.com/databricks/koalas/pull/677
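The trade-off discussed in the comments above can be sketched roughly as follows. This is only an illustration, not the change made in the linked pull request: the helper name `write_csv` and its `single_file` flag are hypothetical. It assumes `kdf` is a Koalas DataFrame and relies on `to_spark()`, `coalesce()`, and `DataFrameWriter.csv()`, so the data is written by Spark and never collected into pandas.

```python
def write_csv(kdf, path, single_file=False):
    """Write a Koalas DataFrame to CSV through Spark, skipping pandas.

    `write_csv` and `single_file` are hypothetical names used for
    illustration; they are not part of the Koalas API.
    """
    # Koalas DataFrame -> Spark DataFrame; the data stays distributed.
    sdf = kdf.to_spark()

    if single_file:
        # One partition yields one part file, but a single task then
        # does all of the writing, so the write is not parallelized.
        sdf = sdf.coalesce(1)

    # Spark creates a directory at `path` holding one part file
    # per partition of the DataFrame.
    sdf.write.csv(path, header=True)
```

With the default `single_file=False`, the output is a directory of part files written in parallel. Passing `single_file=True` gives one part file at the cost of funnelling all rows through a single task, which is the slowdown the first comment warns about.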