Support for DeltaLake, Iceberg, Hudi as an offline source covered by Spark, Ray and Dask engines
Is your feature request related to a problem? Please describe.
Feast can currently run on Spark, Dask, and Ray. Using these engines, support for Delta Lake, Apache Iceberg, and Apache Hudi (which was requested by the community) can be added. This feature will be very helpful for teams that build on Spark, but also for integrations with Flink and Spark Streaming, e.g. when historical features are saved in data lake formats.
Describe the solution you’d like
The solution assumes adding new data lake sources:
- DeltaDataSource
- IcebergDataSource
- HudiDataSource
and also support for CSV files.
Support for the data lake sources will be covered by the Spark engine (which is already in contrib) for users who run Feast on Spark (or Databricks), but also by the Dask and Ray engines.
Additional assumptions:
- Data sources can be mixed, e.g. you can use DeltaDataSource, CSV, and IcebergDataSource together to fetch historical features.
- Changing the engine won't require code changes (only the feature_store.yaml configuration), so a user can test on a laptop using Dask (without any cluster setup) and then deploy to a Spark cluster (or Dask and Ray clusters).
- The implementation should make it simple to add new data sources like Apache Arrow Flight (if a Python API is added) and to mix them with other data sources in the future.
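To illustrate the engine-swap assumption, only the offline_store section of feature_store.yaml would differ between a laptop and a cluster. A minimal sketch — the `dask` type name and the `spark` options shown here are assumptions, not the final configuration schema:

```yaml
# feature_store.yaml — local testing (hypothetical "dask" offline store type)
project: driver_stats
registry: data/registry.db
provider: local
offline_store:
  type: dask

# feature_store.yaml — same project deployed on Spark (contrib offline store;
# exact option names are illustrative)
# offline_store:
#   type: spark
#   spark_conf:
#     spark.master: "yarn"
```

The feature definitions and application code stay identical; only this file changes between environments.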
Delta Lake
Support for Delta Lake for Feast on Spark is already proposed and tested on Databricks and local Spark (https://github.com/qooba/feast-pyspark). That solution is based on plain PySpark rather than on Spark SQL and Jinja, so it is still to be decided which implementation will be more desirable and maintainable.
Support for Delta Lake for Feast on Dask (and Ray) can be implemented using the Delta Lake Python interface:
Apache Iceberg
Apache Iceberg is covered by the Spark engine, but it also has a Python API which can be used to add a Dask/Ray implementation.
Apache Hudi
Apache Hudi is covered by Spark. Currently there is no Python API (as far as I know).
CSV
Support for CSV files will be dedicated to data scientists who would like to conduct ad-hoc experiments.
Describe alternatives you’ve considered
N/A

Additional context
N/A
Issue Analytics
- State:
- Created 2 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
@LeonardAukea - I have finished the Feast integration with Iceberg; here you have a video introduction: https://www.youtube.com/watch?v=kv0iWuSf4jw
I’ll add Iceberg support soon 😃