Enhancements to SparkDataSet
Description
The SparkDataSet is great, but we can make it greater. When dealing with big data, we might want to choose the number of partitions when writing to disk rather than rely on Spark's default.
Context
Setting the number of partitions in the catalog config, rather than passing it to the node function, appears to be the nicer way. E.g.:
my_spark_df:
  type: SparkDataSet
  save_args: ...
  repartition: 10
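For clarity, that catalog option would roughly translate to doing the following by hand at save time (a minimal PySpark sketch; the path and format below are placeholders, not values from an actual catalog):

from pyspark.sql import SparkSession

# Sketch of what `repartition: 10` would mean at save time.
spark = SparkSession.builder.getOrCreate()
df = spark.range(100)                      # example dataframe
df = df.repartition(10)                    # shuffle into exactly 10 partitions (roughly one output file each)
df.write.save("data/my_spark_df", format="parquet")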
As opposed to within the node (really not the responsibility of the transformation logic):
def my_node(df, x, repartition):
    ...
    return df.repartition(repartition)
Possible Implementation
class SparkDataSet(...):
    def __init__(self, ..., repartition: int = None, ...):
        ...
        self._repartition = repartition

    def _save(self, data: DataFrame) -> None:
        save_path = _strip_dbfs_prefix(self._fs_prefix + str(self._get_save_path()))
        if self._repartition:
            data = data.repartition(self._repartition)
        data.write.save(save_path, self._file_format, **self._save_args)
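From the caller's side this could look roughly like the following (a hypothetical usage sketch: repartition is the proposed option, and the other arguments only mirror the existing SparkDataSet constructor):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# Hypothetical: assumes SparkDataSet grows a `repartition` argument as proposed above.
data_set = SparkDataSet(
    filepath="data/01_raw/my_spark_df",
    file_format="parquet",
    repartition=10,          # proposed new option
)
data_set.save(df)            # would call df.repartition(10) internally before writing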
Possible Alternatives
Make repartition and coalesce both available as options, but raise an error if both are provided. Other options such as .partitionBy would also be useful and could follow the same implementation (a rough sketch follows below).
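A minimal sketch of how that mutual-exclusion check and a partitionBy-style option could be wired up (names here are illustrative, not the actual SparkDataSet API):

from typing import List, Optional
from pyspark.sql import DataFrame

class _RepartitionOptions:
    # Hypothetical helper showing how repartition/coalesce/partition_by could be
    # validated and applied; a real implementation would live inside SparkDataSet.
    def __init__(self,
                 repartition: Optional[int] = None,
                 coalesce: Optional[int] = None,
                 partition_by: Optional[List[str]] = None) -> None:
        if repartition is not None and coalesce is not None:
            raise ValueError("Provide either `repartition` or `coalesce`, not both.")
        self._repartition = repartition
        self._coalesce = coalesce
        self._partition_by = partition_by or []

    def apply(self, data: DataFrame) -> DataFrame:
        # Repartition (full shuffle) or coalesce (merge partitions without a full shuffle).
        if self._repartition is not None:
            data = data.repartition(self._repartition)
        elif self._coalesce is not None:
            data = data.coalesce(self._coalesce)
        return data
        # `self._partition_by` could then be forwarded in _save via
        # data.write.partitionBy(*self._partition_by).save(...)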
Issue Analytics
- Created 2 years ago
- Comments: 5 (4 by maintainers)
I guess the argument is that it saves a repetitive step which can be templated and applied at scale, but I'm still not convinced it's common enough to make a special exception, like the one we are going to make with https://github.com/quantumblacklabs/kedro/pull/887 to allow SQL to be defined in a separate file.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.