Feast API: Sources
This issue can be used to discuss the role of sources
in Feast, and how we see the concept evolving in future versions.
Status quo
Feast currently supports only a single source type, KafkaSource. A source can be defined
through a Feature Set or omitted; if the user omits it, Feast Core fills in a default.
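As a rough Python sketch of the defaulting behavior described above (illustrative names only, not the actual Feast Core implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KafkaSource:
    brokers: str
    topic: str

# Hypothetical default that Feast Core would supply when a
# feature set is registered without an explicit source.
DEFAULT_SOURCE = KafkaSource(brokers="kafka:9092", topic="feast-features")

@dataclass
class FeatureSet:
    name: str
    source: Optional[KafkaSource] = None

def apply_feature_set(fs: FeatureSet) -> FeatureSet:
    # Core fills in the default source only when the user omitted one.
    if fs.source is None:
        fs.source = DEFAULT_SOURCE
    return fs
```

This is the convenience path the discussion refers to: users who don't care about Kafka details never see them, while users with a dedicated topic can still pass a source explicitly.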
Contributor comment from @ches (link)
While I get the utility and convenience, especially for deploying Feast with more distributed ownership in the organization, we ignore or work around a lot of the Source and Store configuration kept in the registry RDBMS, in favor of service discovery.
I also think that asking data scientists and data engineers to be concerned with operational infra configuration, like Kafka broker addresses, when registering feature sets is not an elegant separation of concerns.
Maybe we can take this as an opportunity to consider design alternatives as well.
Issue Analytics
- Created: 3 years ago
- Comments: 6 (5 by maintainers)
Top GitHub Comments
Thanks for the thoughtful reply @woop. I didn’t pay close attention to the 0.5 milestone on #632 so thanks for diverting it here, makes sense to open this discussion issue that parallels other API ones.
Yes, at Agoda our Feast deployment is perhaps more centralized in the sense that:
As I’m sure is true for Gojek too, some core entity types in our business domain have very high cardinality (e.g. customers). Most client teams serving online will use some features of these, and it isn’t practical or economical for us to deploy many storage cluster islands that can support the scale. Cassandra is massively scalable; Kafka throughput is massively scalable; we have dedicated teams expert at doing those. There’s also the merit that new clients don’t need to provision new infrastructure to start using the system, this is one of the key problems we’re solving from the status quo before Feast. (We track cost attribution in other ways, if anyone wonders about that aspect).
The one point above that I imagine could become more flexible over time is the Kafka topics, there may be use cases for special-purpose / priority ones, and I believe it should be straightforward to support that if the need arises.
That brings up a notable distinction in regard to the current Source configuration, I think: if we did support this, it would be useful to declaratively associate feature sets with source topics (as Feast already allows). However, users would never need to think about the brokers; they differ for the same topic name across DCs, and our SDK wrapper and Ingestion get them from service discovery. I think this speaks to your thought that "there is value in having users configure some aspects of the sourcing of data".
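A minimal sketch of that separation, assuming a hypothetical service-discovery lookup (all names here are invented for illustration): users declare only the topic, and the brokers are resolved per data center at ingestion time.

```python
from dataclasses import dataclass

# Hypothetical service-discovery table: the same topic name maps to
# different broker lists in each data center.
DISCOVERY = {
    ("dc-east", "customer-features"): "kafka-east-1:9092,kafka-east-2:9092",
    ("dc-west", "customer-features"): "kafka-west-1:9092",
}

@dataclass
class ResolvedSource:
    topic: str
    brokers: str

def resolve_source(dc: str, topic: str) -> ResolvedSource:
    # The feature set declares the topic; brokers are an operational
    # detail filled in from service discovery, never by the user.
    return ResolvedSource(topic=topic, brokers=DISCOVERY[(dc, topic)])
```

The design point is that the registry stores the declarative part (the topic association) while the environmental part (broker addresses) stays behind the discovery abstraction.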
Yes I believe we’re on the same page then. Roughly, an abstraction over operational or environmental details of infrastructure. Operators of a Feast deployment could plug service discovery into this abstraction, potentially.
I feel the federation is an elegant idea in theory, but I’m initially skeptical of how it will work out in practice. Not to say it isn’t worth trying or to discourage it, just would urge breaking it off to an MVP to validate without disrupting Feast’s ongoing technical improvement and data model refinement—it could be a year spent rearchitecting for such a pivot in vision, with considerable risk that it doesn’t work out well or serve users markedly better.
Some of my concerns with it:
We may learn differently with more experience, but at the outset in our org I think we are content to bring data into managed feature store storage. There’s a cultural expectation that it is “special” data, expected to be subjected to higher quality standards, stable maintenance, etc. that federated sources may not.
I’m on board with looking for ways to use SQL as an interface to the system. It does make barriers vanish for many potential users, especially data scientists/engineers/analysts who can contribute new data sets to the feature store without more specialized development knowledge/skills. Indeed something that has happened even before Feast going live for us was another team eagerly building an integration with an in-house ETL tool we have to move data between engines with—you guessed it—SQL expressions of the input. So we’ve already “solved” this, in a proprietary way and with some overhead of redundant import/exports that you refer to with federation.
We (at least I 😇) have a vision/dream of a streaming platform where users express Beam/Flink/Spark/whatever SQL, with the ability to include/join feature store data, and (optionally) choose to ingest results into the feature store in the same engine DAG (no extra hop out through Kafka or the like). In theory we are not that far from the query part: the data is already in tables the engine can make available as `PCollection`s or their analogues. I may have lost the course a little bit there, but hopefully it gives color to the ubiquity of SQL.
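As a toy illustration of the SQL-as-interface idea, here is the same pattern over `sqlite3` (table and column names are invented; the actual vision targets Beam/Flink/Spark SQL over streaming data): a feature set is just a SQL expression over source tables, and ingestion into feature storage happens in the same step.

```python
import sqlite3

# In-memory stand-in for source data and feature-store storage.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    CREATE TABLE customer_features (customer_id INTEGER, total_spend REAL);
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# The "feature set" is declared as a SQL expression; the INSERT..SELECT
# ingests the computed features with no extra hop through a message bus.
conn.execute("""
    INSERT INTO customer_features
    SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id
""")
features = dict(conn.execute("SELECT customer_id, total_spend FROM customer_features"))
```

The appeal described above is exactly this: contributors who know SQL can define and ingest new feature data without specialized development skills.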