
[KED-1561] Rename SparkHiveDataSet

See original GitHub issue

Description

SparkHiveDataSet does not really depend on Hive but on tables that are registered in spark.sql. There are some cases where spark.sql() does not point to a Hive database (e.g. in Databricks you can access data that is registered in spark tables but the backend might not be a Hive database). Now that datasets are undergoing a big refactoring, how about renaming this to SparkTableDataSet or something like it?
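To make the naming complaint concrete, here is a hypothetical catalog.yml entry (dataset name and table names invented for illustration; the class path and parameters follow the kedro.extras API). Despite the "Hive" in the class name, loading and saving boil down to spark.sql() calls against whatever catalog the active SparkSession exposes:

```yaml
# Hypothetical catalog.yml entry. Nothing here requires an actual Hive
# backend -- only that the table is resolvable via spark.sql(), e.g. a
# Databricks-managed table. This is why a name like SparkTableDataSet
# would arguably describe the behaviour better.
weather_table:
  type: kedro.extras.datasets.spark.SparkHiveDataSet
  database: default
  table: weather
  write_mode: overwrite
```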

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 10 (8 by maintainers)

Top GitHub Comments

2 reactions
mzjp2 commented, Apr 2, 2020

@deepyaman @yetudada I think this is an interesting topic and we should open a new issue to discuss it in detail. I’m not so sure whether we should leave or remove the prefix, but I think the point here is how to manage different Python libraries for manipulating data. Today kedro assumes pandas as the default, since it is the most used library for manipulating in-memory dataframes, but more tools will probably become popular soon (pyarrow, dask, etc.) and it would be good to start thinking about how to manage that.

I think we’re on the same page with you there @mrg143504. This is what we’re moving towards!

We recently shifted to the kedro.extras model and you opt-in to the dependencies you need based on what data you work with.

At the moment, we have pandas as a “default” (bundled with our kedro package) to preserve backwards compatibility, but we’re looking to decouple the I/O completely for precisely the reasons you mention (people who intend to use PySpark exclusively shouldn’t need to spend time installing pandas, etc.).

We’re moving to a Dask-like model where you run pip install "kedro[pandas]" or pip install "kedro[spark]" (and loads of others) based on what you need to use.

0 reactions
yetudada commented, Nov 12, 2020

I’m going to close this ticket. Perhaps there can be work done to rebuild the Hive DataSet and that should be captured in another ticket? 😄

