Support for DeltaLake, Iceberg, Hudi as an offline source covered by Spark, Ray and Dask engines

Is your feature request related to a problem? Please describe.

Feast can currently run on Spark, Dask, and Ray. Using these engines, support for Delta Lake, Apache Iceberg, and Apache Hudi (which was requested by the community) can be added. This feature will be very helpful for teams that build on Spark, and also for integrations with Flink or Spark Streaming, e.g. when historical features are saved in data lake formats.

Describe the solution you’d like

[Diagram: EnginesSources]

The solution assumes adding new data lake sources:

  • DeltaDataSource
  • IcebergDataSource
  • HudiDataSource

and also support for CSV files.

Support for the data lake sources will be covered by the Spark engine (which is already in contrib) for users who run Feast on Spark (or Databricks), and also by Dask and Ray.

Additional assumptions:

  1. Data sources can be mixed, e.g. you can use DeltaDataSource, CSV, and IcebergDataSource together to fetch historical features.
  2. Changing the engine won’t require code changes (only the feature_store.yaml configuration), so a user can test on a laptop using Dask (without any cluster setup) and then deploy to a Spark cluster (or to Dask or Ray clusters).
  3. The implementation should make it simple to add new DataSources, such as Apache Arrow Flight (if a Python API is added), and to mix them with other data sources in the future.
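
To illustrate the assumptions above, here is a minimal sketch of how mixed sources and a configuration-only engine switch could fit together. It is not Feast code: `DataSource`, `Engine`, `select_engine`, `get_historical_features`, and the `backend` config key are all hypothetical names, and the readers are in-memory stand-ins so the example runs without any cluster.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]
Reader = Callable[[str], List[Row]]


@dataclass(frozen=True)
class DataSource:
    """Hypothetical source descriptor: a path plus a format tag."""
    path: str
    fmt: str  # e.g. "delta", "iceberg", "hudi", "csv"


class Engine:
    """Hypothetical engine: maps each supported format to a reader function."""

    def __init__(self, name: str, readers: Dict[str, Reader]) -> None:
        self.name = name
        self._readers = readers

    def read(self, source: DataSource) -> List[Row]:
        try:
            reader = self._readers[source.fmt]
        except KeyError:
            raise ValueError(f"{self.name} engine cannot read {source.fmt!r}")
        return reader(source.path)


def select_engine(config: Dict[str, str], engines: Dict[str, Engine]) -> Engine:
    """Pick the engine from configuration only; calling code stays unchanged."""
    return engines[config["backend"]]


def get_historical_features(engine: Engine, sources: List[DataSource], key: str) -> List[Row]:
    """Merge rows from several (possibly different) sources on an entity key."""
    merged: Dict[object, Row] = {}
    for source in sources:
        for row in engine.read(source):
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


# In-memory stand-ins for real Delta/CSV readers (assumption for this sketch).
FAKE_TABLES = {
    "/data/drivers_delta": [{"driver_id": 1, "trips": 10}, {"driver_id": 2, "trips": 4}],
    "/data/ratings.csv": [{"driver_id": 1, "rating": 4.8}],
}


def fake_reader(path: str) -> List[Row]:
    return [dict(row) for row in FAKE_TABLES[path]]


ENGINES = {
    "dask": Engine("dask", {"delta": fake_reader, "csv": fake_reader}),
    "spark": Engine("spark", {"delta": fake_reader, "csv": fake_reader, "hudi": fake_reader}),
}

SOURCES = [
    DataSource("/data/drivers_delta", "delta"),
    DataSource("/data/ratings.csv", "csv"),
]

# Only the config value changes between a laptop run and a cluster run.
local_features = get_historical_features(select_engine({"backend": "dask"}, ENGINES), SOURCES, "driver_id")
cluster_features = get_historical_features(select_engine({"backend": "spark"}, ENGINES), SOURCES, "driver_id")
```

The point of the design is that `get_historical_features` never names an engine or a format; supporting a new source type would mean registering one more reader per engine.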

Delta Lake

Delta Lake support for Feast on Spark has already been proposed and tested on Databricks and on local Spark (https://github.com/qooba/feast-pyspark). That solution is based on plain PySpark rather than on Spark SQL and Jinja, so it remains to be decided which implementation is more desirable and maintainable.

Delta Lake support for Feast on Dask (or Ray) can be implemented using the Delta Lake Python interface.
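
The snippet from the original issue is not preserved here. As a rough sketch only, assuming the `deltalake` package (the delta-rs Python bindings) is the interface meant, a Delta table could be loaded into pandas as shown below. `read_delta_as_pandas` and `_open_delta_table` are hypothetical helper names, and the `open_table` parameter exists only so the function can be exercised without a real table.

```python
from typing import Callable, List, Optional


def _open_delta_table(path: str):
    # Imported lazily so this module loads even when `deltalake` is not installed.
    from deltalake import DeltaTable

    return DeltaTable(path)


def read_delta_as_pandas(
    path: str,
    columns: Optional[List[str]] = None,
    open_table: Callable = _open_delta_table,
):
    """Load a Delta table (optionally only some columns) into a pandas DataFrame."""
    table = open_table(path)
    return table.to_pandas(columns=columns)
```

The resulting DataFrame could then be handed to Dask or Ray for the join logic; how that hand-off should look is left open here.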

Apache Iceberg

Apache Iceberg is covered by the Spark engine, and also by a Python API that can be used to add a Dask/Ray implementation.

Apache Hudi

Apache Hudi is covered by Spark. Currently there is no Python API (as far as I know).

CSV

CSV support is intended for data scientists who would like to conduct ad-hoc experiments.

Describe alternatives you’ve considered

N/A

Additional context

N/A

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
qooba commented, Jul 25, 2022

Will this feature be picked up?

Yes, but this requires a lot of changes in the Feast architecture, so I have decided to move it to a separate repository and create a Feast extension: https://github.com/qooba/yummy

I had a quick look and it seems like the IcebergDataSource is still missing. Are you planning to add it or has anything changed? Do you need help?

@LeonardAukea - I have finished the Feast integration with Iceberg. Here is a video introduction: https://www.youtube.com/watch?v=kv0iWuSf4jw

1 reaction
qooba commented, May 19, 2022

Will this feature be picked up?

Yes, but this requires a lot of changes in the Feast architecture, so I have decided to move it to a separate repository and create a Feast extension: https://github.com/qooba/yummy

I had a quick look and it seems like the IcebergDataSource is still missing. Are you planning to add it or has anything changed? Do you need help?

I’ll add Iceberg support soon 😃
