
[KED-1561] Rename SparkHiveDataSet

See original GitHub issue

Description

SparkHiveDataSet does not really depend on Hive but on tables that are registered in spark.sql. There are some cases where spark.sql() does not point to a Hive database (e.g. in Databricks you can access data that is registered in spark tables but the backend might not be a Hive database). Now that datasets are undergoing a big refactoring, how about renaming this to SparkTableDataSet or something like it?
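To make the naming complaint concrete, here is a hypothetical catalog.yml entry (dataset name and table names invented for illustration; the class path and parameters follow the kedro.extras API). Despite the "Hive" in the class name, loading and saving boil down to spark.sql() calls against whatever catalog the active SparkSession exposes:

```yaml
# Hypothetical catalog.yml entry. Nothing here requires an actual Hive
# backend -- only that the table is resolvable via spark.sql(), e.g. a
# Databricks-managed table. This is why a name like SparkTableDataSet
# would arguably describe the behaviour better.
weather_table:
  type: kedro.extras.datasets.spark.SparkHiveDataSet
  database: default
  table: weather
  write_mode: overwrite
```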

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 10 (8 by maintainers)

Top GitHub Comments

2 reactions
mzjp2 commented, Apr 2, 2020

@deepyaman @yetudada I think this is an interesting topic and we should open a new issue to discuss it in detail. I’m not so sure whether we should leave or remove the prefix, but I think the point here is how to manage different Python libraries for manipulating data. Today kedro assumes pandas as the default, since it is the most used library for manipulating in-memory dataframes, but more tools will probably become popular soon (pyarrow, dask, etc.) and it would be good to start thinking about how to manage that.

I think we’re on the same page with you there @mrg143504. This is what we’re moving towards!

We recently shifted to the kedro.extras model and you opt-in to the dependencies you need based on what data you work with.

At the moment, we have pandas as a “default” (bundled with our kedro package) to preserve backwards compatibility, but we’re looking to decouple the I/O completely for precisely the reasons you mention (people who intend to use PySpark exclusively shouldn’t need to spend time installing pandas, etc.).

We’re moving to a Dask-like model where you run pip install "kedro[pandas]" or pip install "kedro[spark]" (and loads of others) based on what you need to use.

0 reactions
yetudada commented, Nov 12, 2020

I’m going to close this ticket. Perhaps there can be work done to rebuild the Hive DataSet and that should be captured in another ticket? 😄

