[KED-1561] Rename SparkHiveDataSet
Description

`SparkHiveDataSet` does not really depend on Hive, but on tables that are registered in `spark.sql`. There are some cases where `spark.sql()` does not point to a Hive database (e.g. in Databricks you can access data that is registered as Spark tables, but the backend might not be a Hive database). Now that datasets are undergoing a big refactoring, how about renaming this to `SparkTableDataSet` or something like it?
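For context, a catalog entry using this dataset might look roughly like the sketch below. The `type`, `database`, and `table` keys follow the `kedro.extras.datasets.spark.SparkHiveDataSet` constructor; the entry name and values are placeholders, and `write_mode` is an assumed parameter that should be checked against your Kedro version.

```yaml
# Hypothetical catalog.yml entry (names and values are placeholders).
# Note the dataset only needs a database/table registered with the Spark
# session -- the backing metastore does not have to be Hive, which is
# exactly why the rename to SparkTableDataSet is being proposed.
weather:
  type: spark.SparkHiveDataSet
  database: analytics        # database registered in spark.sql
  table: weather_readings    # table name, not necessarily Hive-backed
  write_mode: overwrite      # assumed parameter; verify for your version
```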
Issue Analytics
- State:
- Created 4 years ago
- Comments:10 (8 by maintainers)
Top GitHub Comments
I think we’re on the same page with you there @mrg143504. This is what we’re moving towards!

We recently shifted to the `kedro.extras` model, where you opt in to the dependencies you need based on what data you work with. At the moment, we have `pandas` as a “default” (bundled with our `kedro` package) to preserve backwards compatibility, but we’re looking to decouple the I/O completely, for precisely the reasons you mention (people who intend to use PySpark exclusively don’t need to spend time installing pandas, etc.). We’re moving to a `dask`-like model where you `pip install "kedro[pandas]"` or `pip install "kedro[spark]"` and loads of others based on what you need to use.

I’m going to close this ticket. Perhaps there can be work done to rebuild the Hive DataSet, and that should be captured in another ticket? 😄