Support for DeltaLake, Iceberg, Hudi as an offline source covered by Spark, Ray and Dask engines

Is your feature request related to a problem? Please describe.

Feast can currently run on Spark, Dask and Ray. Using these engines, support for Delta Lake, Apache Iceberg and Apache Hudi (which was requested by the community) can be added. This feature would be very helpful for teams that build on Spark, and also for integrations with Flink or Spark Streaming, e.g. when historical features are saved in data lake formats.

Describe the solution you’d like

[Figure: EnginesSources diagram]

The solution assumes adding new data lake sources:

DeltaDataSource
IcebergDataSource
HudiDataSource

and also support for CSV files.

Support for the data lake sources will be covered by the Spark engine (which is already in contrib) for users who run Feast on Spark (or Databricks), and also by the Dask and Ray engines.

Additional assumptions:

Data sources can be mixed, e.g. you can use DeltaDataSource, CSV and IcebergDataSource together to fetch historical features.
Changing the engine won’t require changes in the code (only in the feature_store.yaml configuration), so a user can test on a laptop using Dask (without any cluster setup) and then deploy to a Spark cluster (or to Dask and Ray clusters).
The implementation should make it simple to add new DataSources in the future, such as Apache Arrow Flight (if a Python API is added), and to mix them with the other data sources.
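
The assumptions above can be illustrated with a minimal, standard-library-only sketch. Everything in it is hypothetical and not Feast's actual API: the `DataSource` protocol, the `CsvTextSource` and `InMemorySource` classes, the `register_engine` registry, and the `engine` configuration key are names invented for this illustration. The in-memory source only stands in for a Delta, Iceberg or Hudi reader, and the join is a plain key join rather than a point-in-time join.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol, Sequence

Row = Dict[str, str]


class DataSource(Protocol):
    """Hypothetical interface: every source only has to return rows."""

    def read(self) -> List[Row]: ...


@dataclass
class CsvTextSource:
    """Hypothetical CSV source; parses a string so the sketch needs no files."""

    text: str

    def read(self) -> List[Row]:
        return list(csv.DictReader(io.StringIO(self.text)))


@dataclass
class InMemorySource:
    """Stand-in for a Delta/Iceberg/Hudi source (real readers are not shown)."""

    rows: List[Row]

    def read(self) -> List[Row]:
        return [dict(row) for row in self.rows]


Engine = Callable[[Sequence[DataSource], str], List[Row]]
ENGINES: Dict[str, Engine] = {}


def register_engine(name: str) -> Callable[[Engine], Engine]:
    """Register an engine under the name used in the configuration."""

    def decorator(func: Engine) -> Engine:
        ENGINES[name] = func
        return func

    return decorator


@register_engine("local")
def local_join(sources: Sequence[DataSource], key: str) -> List[Row]:
    """Merge rows from all sources on `key` (plain key join, no timestamps)."""
    merged: Dict[str, Row] = {}
    for source in sources:
        for row in source.read():
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


def get_historical_features(
    config: Dict[str, str], sources: Sequence[DataSource], key: str
) -> List[Row]:
    """Pick the engine from configuration only; the calling code never changes."""
    engine_name = config["engine"]
    if engine_name not in ENGINES:
        raise ValueError(f"unknown engine: {engine_name}")
    return ENGINES[engine_name](sources, key)


# Mixing two different source types in one retrieval.
sources = [
    CsvTextSource("driver_id,trips\n1,10\n2,20\n"),
    InMemorySource([{"driver_id": "1", "rating": "4.8"}]),
]
features = get_historical_features({"engine": "local"}, sources, key="driver_id")
```

In this sketch, a Spark, Dask or Ray engine would be one more function registered under its own name, and a new source type would be one more class with a `read` method, so neither addition touches `get_historical_features`.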

Delta Lake

Support for Delta Lake in Feast on Spark has already been proposed and tested on Databricks and on local Spark (https://github.com/qooba/feast-pyspark). That solution is based on plain PySpark rather than on Spark SQL and Jinja, so it remains to be decided which implementation is more desirable and maintainable.

Support for Delta Lake in Feast on Dask (or Ray) can be implemented using the Delta Lake Python interface.

Apache Iceberg

Apache Iceberg is supported by the Spark engine and also has a Python API, which can be used to add a Dask/Ray implementation.

Apache Hudi

Apache Hudi is supported by Spark. Currently there is no Python API (as far as I know).

CSV

Support for CSV files is intended for data scientists who would like to conduct ad-hoc experiments.

Describe alternatives you’ve considered

N/A

Additional context

N/A

Issue Analytics

State:
Created 2 years ago
Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction

qooba commented, Jul 25, 2022

Will this feature be picked up?

Yes, but it requires a lot of changes in the Feast architecture, so I have decided to move it to a separate repository and create a Feast extension: https://github.com/qooba/yummy

I had a quick look and it seems like the IcebergDataSource is still missing. Are you planning to add it or has anything changed? Do you need help?

@LeonardAukea - I have finished the Feast integration with Iceberg. Here is a video introduction: https://www.youtube.com/watch?v=kv0iWuSf4jw

1 reaction

qooba commented, May 19, 2022

Will this feature be picked up?

Yes, but it requires a lot of changes in the Feast architecture, so I have decided to move it to a separate repository and create a Feast extension: https://github.com/qooba/yummy

I had a quick look and it seems like the IcebergDataSource is still missing. Are you planning to add it or has anything changed? Do you need help?

I’ll add Iceberg support soon 😃
