Feast API: Sources

See original GitHub issue

This issue can be used to discuss the role of sources in Feast, and how we see the concept evolving in future versions.

Status quo: Feast currently supports only a single Source type, KafkaSource. This can be defined through a Feature Set or omitted. If users omit the source, Feast Core will fill in a default for the user.

Contributor comment from @ches:

While I get the utility and convenience especially for deploying Feast with more distributed ownership in the organization, we ignore / work around a lot of the Source and Store configuration kept in the registry RDBMS, in favor of service discovery.

I also think that asking data scientists and data engineers to be concerned with operational infra configuration like Kafka broker addresses when registering feature sets is not an elegant separation of concerns.

Maybe we can take this as an opportunity to consider design alternatives as well.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
ches commented, Apr 19, 2020

Thanks for the thoughtful reply @woop. I didn’t pay close attention to the 0.5 milestone on #632 so thanks for diverting it here, makes sense to open this discussion issue that parallels other API ones.

> By “we” do you mean your current team? I’d love to get a better understanding of how you are currently doing it, especially if it means we can improve sources or stores.

Yes, at Agoda our Feast deployment is perhaps more centralized in the sense that:

  • There is one Store for warehouse (HDFS with Hive-interoperable engines)
  • There is one Store for online serving (Cassandra)
  • There is logically one Source Kafka topic for all feature ingestion (modulo internal idiosyncrasies not pertinent to the discussion)
  • Feast is offered to all clients as a managed service provided by one team; client teams do not deploy their own Stores or Serving API instances.

As I’m sure is true for Gojek too, some core entity types in our business domain have very high cardinality (e.g. customers). Most client teams serving online will use some features of these, and it isn’t practical or economical for us to deploy many storage cluster islands that can support the scale. Cassandra is massively scalable; Kafka throughput is massively scalable; we have dedicated teams who are expert at running those. There’s also the merit that new clients don’t need to provision new infrastructure to start using the system; this is one of the key problems we’re solving relative to the status quo before Feast. (We track cost attribution in other ways, if anyone wonders about that aspect.)

The one point above that I imagine could become more flexible over time is the Kafka topics: there may be use cases for special-purpose / priority ones, and I believe it should be straightforward to support that if the need arises.

That brings up a notable distinction in regard to the current Source configuration, I think: if we did support this, it would be useful to declaratively associate feature sets with source topics (as Feast already allows). However, users will never need to think about the brokers: they differ for the same topic name across DCs, and our SDK wrapper and Ingestion get them from service discovery. I think this speaks to your thought that “there is value in having users configure some aspects of the sourcing of data”.

> Sources today are a glorified “connection string”. If a source was to only ever stay a connection string, then I don’t really see the point in exposing that to users. This can easily be configured behind the scenes by administrators and exposed through source names or a more human friendly way. Users can then select the source they want to use. If I understand correctly then this is the largest part of what you think is a bad separation of concern.

Yes I believe we’re on the same page then. Roughly, an abstraction over operational or environmental details of infrastructure. Operators of a Feast deployment could plug service discovery into this abstraction, potentially.
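As a rough illustration of the abstraction discussed above, the sketch below separates what a feature set author declares (a topic name) from what operators resolve (broker addresses per data center). It is not Feast's API: `LogicalSource`, `ResolvedSource`, `discover_brokers`, `resolve`, the data center names, and the broker addresses are all hypothetical, and the dictionary merely stands in for a real service-discovery lookup.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class LogicalSource:
    """Hypothetical: what a feature set author declares. Topic name only."""
    topic: str


@dataclass(frozen=True)
class ResolvedSource:
    """Hypothetical: what ingestion connects to once brokers are resolved."""
    topic: str
    brokers: Tuple[str, ...]


# Stand-in for a service-discovery backend; these addresses are made up.
BROKERS_BY_DC: Dict[str, Tuple[str, ...]] = {
    "dc-a": ("kafka-1.dc-a.example:9092", "kafka-2.dc-a.example:9092"),
    "dc-b": ("kafka-1.dc-b.example:9092",),
}


def discover_brokers(datacenter: str) -> Tuple[str, ...]:
    """Look up the Kafka brokers for a data center, or fail loudly."""
    if datacenter not in BROKERS_BY_DC:
        raise LookupError(f"no Kafka brokers registered for {datacenter!r}")
    return BROKERS_BY_DC[datacenter]


def resolve(
    source: LogicalSource,
    datacenter: str,
    discover: Callable[[str], Tuple[str, ...]] = discover_brokers,
) -> ResolvedSource:
    """Attach environment-specific brokers to a user-declared topic."""
    return ResolvedSource(topic=source.topic, brokers=discover(datacenter))


# The same declaration resolves to different brokers in each data center.
declared = LogicalSource(topic="feature-ingestion")
in_dc_a = resolve(declared, "dc-a")
in_dc_b = resolve(declared, "dc-b")
```

Passing `discover` as a parameter is one way operators could plug their own service discovery into the resolution step without users ever seeing broker addresses.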

> Most of our data still lives in data lakes or data warehouses, which means federation is a natural next step. In a federated model we would probably opt to extend sources to allow new data sources to be accessed through Feast, especially without users having to export and reimport into Feast, and with Feast being lazy towards retrieval and exports (no long running jobs).

I feel the federation is an elegant idea in theory, but I’m initially skeptical of how it will work out in practice. Not to say it isn’t worth trying or to discourage it, just would urge breaking it off to an MVP to validate without disrupting Feast’s ongoing technical improvement and data model refinement—it could be a year spent rearchitecting for such a pivot in vision, with considerable risk that it doesn’t work out well or serve users markedly better.

Some of my concerns with it:

  • “Jack of all trades, master of none”: it is difficult to give a consistently optimal experience with multiple storage engines, e.g. not being able to push down filters and joins across stores. See also areas like #444 and more specifically the data locality-related discussion on #482 that I think possibly belongs under #444.
  • Harder (and less efficient) to lazily impose data quality measures on data at rest.
  • Hard to rely on sustained availability of data from sources controlled by many owners—producers controlling fate of consumers without a safety buffer.

We may learn differently with more experience, but at the outset in our org I think we are content to bring data into managed feature store storage. There’s a cultural expectation that it is “special” data, expected to be subjected to higher quality standards, stable maintenance, etc., which federated sources may not meet.

> However, I see massive value in allowing users to define SQL queries. And this is the direction that I would like to take sources (or if not sources, another part of feature sets).
>
> One of the main reasons why I see this as valuable is that all of our users are familiar with SQL. Using SQL improves the Feast user experience because they are able to validate and prove that their query works without Feast in the loop. They can then bring that query to Feast, publish it, and see the results. If there is a failure, then the problem is likely Feast. SQL is also supported by virtually all sources and stores.

I’m on board with looking for ways to use SQL as an interface to the system. It does make barriers vanish for many potential users, especially data scientists/engineers/analysts who can contribute new data sets to the feature store without more specialized development knowledge/skills. Indeed, something that happened even before Feast went live for us was another team eagerly building an integration with an in-house ETL tool we have for moving data between engines with (you guessed it) SQL expressions of the input. So we’ve already “solved” this, in a proprietary way and with some overhead of the redundant imports/exports that you refer to with federation.
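The workflow discussed above (prove a query works on its own, then hand the same query text to the feature store) can be sketched as follows. This is only an illustration under stated assumptions: `SqlSource` is a hypothetical type, not part of Feast, the `bookings` table and its rows are invented, and an in-memory SQLite database stands in for whatever warehouse engine a deployment actually uses.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SqlSource:
    """Hypothetical: a named source defined purely by a SQL query."""
    name: str
    query: str


def run_query(conn: sqlite3.Connection, query: str) -> List[Tuple]:
    """Execute a query and return all rows."""
    return conn.execute(query).fetchall()


def make_demo_warehouse() -> sqlite3.Connection:
    """Build a tiny in-memory table standing in for a real warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookings (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO bookings VALUES (?, ?)",
        [(1, 10.0), (1, 30.0), (2, 5.0)],
    )
    return conn


QUERY = """
SELECT customer_id, COUNT(*) AS booking_count, SUM(amount) AS total_amount
FROM bookings
GROUP BY customer_id
ORDER BY customer_id
"""

conn = make_demo_warehouse()

# Step 1: the user validates the query with no feature store in the loop.
standalone_rows = run_query(conn, QUERY)

# Step 2: the identical query text becomes the source definition.
source = SqlSource(name="customer_booking_stats", query=QUERY)
ingested_rows = run_query(conn, source.query)
```

Because the source definition is just the query text, any difference between `standalone_rows` and `ingested_rows` would point at the ingestion path and not at the user's SQL.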

We (at least I 😇) have a vision/dream of a streaming platform where users are expressing Beam/Flink/Spark/Whatever SQL, with ability to include/join feature store data, and (optionally) choosing to ingest results into the feature store in the same engine DAG (no extra hop out through Kafka or the like). In theory we are not that far from the query part, the data is already in tables the engine can make available in PCollections or their analogues.

I may have gone off course a little bit there, but hopefully it gives color to the ubiquity of SQL.

0 reactions
stale[bot] commented, Aug 2, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

