Enhancements to SparkDataSet

See original GitHub issue

Description

The SparkDataSet is great, but we can make it greater. When dealing with big data, we might want to choose the number of partitions when writing to disk rather than rely on Spark's default.

Context

Setting the number of partitions in the catalog config, rather than passing it to the node function, appears to be the nicer approach. E.g.:

my_spark_df:
    type: SparkDataSet
    save_args: ...
    repartition: 10

As opposed to doing it within the node (repartitioning is really not the responsibility of the transformation logic):

def my_node(df, x, repartition):
    ...
    return df.repartition(repartition)

Possible Implementation

class SparkDataSet(...):
    def __init__(self, ..., repartition: int = None, ...):
        ...
        self._repartition = repartition

    def _save(self, data: DataFrame) -> None:
        save_path = _strip_dbfs_prefix(self._fs_prefix + str(self._get_save_path()))
        if self._repartition:
            # repartition before writing so the number of output files is controlled
            data = data.repartition(self._repartition)
        data.write.save(save_path, self._file_format, **self._save_args)

Possible Alternatives

Make both repartition and coalesce available as options, but raise an error if both are provided.

Other options such as .partitionBy would also be great and could follow the same implementation.
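
To make the alternative concrete, here is a minimal sketch assuming both options become constructor arguments. It is deliberately simplified and is not the real SparkDataSet: versioning, the DBFS prefix handling and most other parameters are omitted, and a plain ValueError stands in for whatever exception Kedro would actually raise.

from typing import Optional

from pyspark.sql import DataFrame


class SparkDataSet:  # simplified stand-in for the real class
    def __init__(
        self,
        filepath: str,
        file_format: str = "parquet",
        save_args: Optional[dict] = None,
        repartition: Optional[int] = None,
        coalesce: Optional[int] = None,
    ) -> None:
        if repartition is not None and coalesce is not None:
            # The two options conflict, so fail as soon as the catalog entry is parsed.
            raise ValueError("Provide either 'repartition' or 'coalesce', not both.")
        self._filepath = filepath
        self._file_format = file_format
        self._save_args = save_args or {}
        self._repartition = repartition
        self._coalesce = coalesce

    def _save(self, data: DataFrame) -> None:
        # Adjust the number of output partitions before handing the frame to Spark's writer.
        if self._repartition is not None:
            data = data.repartition(self._repartition)
        elif self._coalesce is not None:
            data = data.coalesce(self._coalesce)
        data.write.save(self._filepath, self._file_format, **self._save_args)

Raising at construction time means a catalog entry that sets both keys fails when the catalog is loaded, rather than midway through a pipeline run. .partitionBy could be handled the same way, or simply passed through save_args, since Spark's DataFrameWriter.save already accepts a partitionBy argument.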

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
datajoely commented, Oct 11, 2021

I guess the argument is that it saves a repetitive step which can be templated and applied at scale, but I’m still not convinced it’s common enough to make a special exception, like we are going to do with https://github.com/quantumblacklabs/kedro/pull/887 to allow SQL to be defined in a separate file.
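
For context on the templating point: catalog YAML supports anchors, so a shared block of write settings, including the proposed repartition key if it were added, could be reused across many entries. A rough illustration (the dataset names and paths below are made up):

# Entries whose keys start with an underscore are not loaded as datasets,
# so they are a common place to keep reusable anchors.
_spark_write_defaults: &spark_write_defaults
  type: SparkDataSet
  file_format: parquet
  repartition: 10          # proposed option from this issue
  save_args:
    mode: overwrite

my_spark_df:
  <<: *spark_write_defaults
  filepath: data/02_intermediate/my_spark_df

another_spark_df:
  <<: *spark_write_defaults
  filepath: data/02_intermediate/another_spark_df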

0 reactions
stale[bot] commented, Dec 10, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
