Feast API: Feature references, concept hierarchy, and data model

This issue is meant to be a discussion of the current Feast API as it relates to feature references, a key component of the user-facing API. It will also discuss the current data model and our concept hierarchy.

1. Background

The Feast user-facing API and data model changed dramatically from 0.1 to 0.2+. The original intention was to simplify the API as much as possible and gradually evolve it as new user requirements became available.

Two important reference documents on this topic are

2. Problem statement

The Feast API is evolving as more and more teams adopt the software and share their requirements with us. In most cases this means an expansion of the API, but in some cases it means a reversal.

With the introduction of projects into Feast (Feast Projects RFC), our API has evolved again. This change has affected feature references, the data model, and concept hierarchy.

The most critical feedback on this change has been that it introduces unnecessary complexity to address problems (isolation, namespacing, security) that could be solved in a different way.

3. Objective

The point of this GitHub issue is to settle our API for feature references, our concept hierarchy, and data model in such a way that we:

Meet all our known requirements for future development
Minimize user facing changes and migration requirements
Maintain flexibility in accepting new user requirements and evolving our API

Put simply, we want to make sure that we are on the right path and make the necessary changes now, when it's least disruptive.

4. What are feature references?

Feature references (previously Feature Ids) are strings/objects within Feast that allow Feast and users of Feast to reference specific features. Feature references are primarily used to indicate to Feast which features a user would like to retrieve.

Originally, feature references were defined as follows: <feature-set>:<feature-name>:<feature-version>. All parts of the above reference were required at the time.

Feature references have recently been updated (as part of the Projects RFC).

The move towards project namespaces places feature sets and features/entities into the following hierarchy: [Figure: screenshot of the concept hierarchy, 2020-02-18]

Feature references are now defined as: <project>/<feature-name>:<feature-version>

The following constraints apply:

Versions are optional. If no version is provided then the latest version of a feature is used.
Feature names must be unique within a project (even across feature sets within that project).
Entity names must be unique within a project (but can be reused across feature sets).
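
To make the reference format and its optional parts concrete, here is a minimal parsing sketch. It is not taken from the Feast SDK: the `FeatureRef` type, the `parse_feature_ref` helper, and the allowed character set (letters, digits, underscore) are assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional


class FeatureRef(NamedTuple):
    project: Optional[str]  # None means "use the project set externally"
    name: str
    version: Optional[int]  # None means "use the latest version"


# <project>/<feature-name>:<feature-version>, project and version optional.
# The character class is an assumption, not Feast's actual naming rule.
_REF_PATTERN = re.compile(
    r"^(?:(?P<project>[A-Za-z0-9_]+)/)?"
    r"(?P<name>[A-Za-z0-9_]+)"
    r"(?::(?P<version>[0-9]+))?$"
)


def parse_feature_ref(ref: str) -> FeatureRef:
    """Split a feature reference string into its parts."""
    match = _REF_PATTERN.match(ref)
    if match is None:
        raise ValueError(f"malformed feature reference: {ref!r}")
    version = match.group("version")
    return FeatureRef(
        project=match.group("project"),
        name=match.group("name"),
        version=int(version) if version is not None else None,
    )
```

With this sketch, a bare name such as `daily_transactions` parses with both `project` and `version` left as `None`, which corresponds to the "project set externally, latest version" case described above.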

One of our primary motivations was to allow users to reference features directly by name. With versions becoming optional and the project being settable externally, this is now possible. Users can provide features as a list of feature names.

An example of feature references in use is shown below (from the Python SDK):

online_features = client.get_online_features(
    feature_refs=[
        "daily_transactions",
        "total_transactions",
    ],
    entity_rows=entity_rows,
)

5. How are feature references used?

5.1 During online serving

During online serving the user will provide two sets of information to Feast during feature retrieval.

A list of feature references
A list of entities

Feast wants to construct a response object with all of the data from these features on all of these entities.

For example, if a user sends a request with a single feature reference as daily_transactions, Feast will attempt to add the missing information. It will add the project id (which currently must be provided by the user), it will then determine the feature set that contains that feature name, and then finally it will determine the latest version of the feature set in which the feature occurs.

Internally, Feast is left with something that resembles the following: my_customer_project/my_customer_feature_set:daily_transactions:3

Since features are stored based on feature sets, Feast first converts the above into what we can informally define as a feature set reference, resembling the following: <project>/<feature-set-name>:<feature-set-version>, or tangibly: my_customer_project/my_customer_feature_set:3

In the case of Redis, Feast will use the above feature set reference, along with the entities the user has provided, to construct a list of keys to look up. The responses from the database are then used to build a response object that is returned to the user.
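
The resolution steps above can be sketched as follows. This is an illustration only: the `REGISTRY` structure, the `resolve` and `redis_keys` helpers, and the key layout are hypothetical, and the key layout in particular is not Feast's actual Redis key encoding.

```python
from typing import Dict, List, Tuple

# Hypothetical registry: (project, feature set name) -> {version: [feature names]}.
REGISTRY: Dict[Tuple[str, str], Dict[int, List[str]]] = {
    ("my_customer_project", "my_customer_feature_set"): {
        2: ["total_transactions"],
        3: ["total_transactions", "daily_transactions"],
    },
}


def resolve(project: str, feature_name: str) -> Tuple[str, str]:
    """Return (full feature reference, feature set reference) for a bare name."""
    for (proj, fs_name), versions in REGISTRY.items():
        if proj != project:
            continue
        # Feature names are unique within a project, so the first match wins.
        matching = [v for v, feats in versions.items() if feature_name in feats]
        if matching:
            latest = max(matching)
            fs_ref = f"{proj}/{fs_name}:{latest}"
            full_ref = f"{proj}/{fs_name}:{feature_name}:{latest}"
            return full_ref, fs_ref
    raise KeyError(f"feature {feature_name!r} not found in project {project!r}")


def redis_keys(fs_ref: str, entity_rows: List[Dict[str, str]]) -> List[str]:
    """Build one lookup key per entity row (illustrative key layout only)."""
    keys = []
    for row in entity_rows:
        entity_part = ",".join(f"{k}={row[k]}" for k in sorted(row))
        keys.append(f"{fs_ref}|{entity_part}")
    return keys
```

Note how the bare feature name only becomes usable once the project, the owning feature set, and the latest version have all been filled in, which is the lookup cost discussed later in section 6.4.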

5.2 During batch serving

The batch serving case is very similar to the online serving case, but with more complexity on queries and joins.

The user provides the following during batch retrieval

A list of feature references
A list of entities paired with timestamps

Feature references are converted into their full form and used to create feature set references (as in online serving). In the case of BigQuery, the feature set reference maps directly to a table. For each feature set table that Feast needs to query, it runs a point-in-time correct query using the entities and timestamps for the specific feature columns. This produces a result table containing the user's requested feature data over the requested timestamps, but only for one specific feature set.

Feast then uses the entity columns in each feature set table as a means of joining the results of these sub-queries into a single resultant dataframe.
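
The per-feature-set query followed by a join on entity columns can be sketched in plain Python. This is a simplified in-memory stand-in, not the BigQuery query Feast generates: the `Table` layout and the `point_in_time_lookup` and `batch_retrieve` helpers are assumptions for illustration.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Hypothetical per-feature-set table: entity id -> rows sorted by event timestamp.
# Each row is (event_timestamp, {feature name: value}).
Table = Dict[str, List[Tuple[int, Dict[str, float]]]]


def point_in_time_lookup(table: Table, entity_id: str, ts: int) -> Optional[Dict[str, float]]:
    """Return the latest row for entity_id with event_timestamp <= ts, if any."""
    rows = table.get(entity_id, [])
    timestamps = [row_ts for row_ts, _ in rows]
    idx = bisect_right(timestamps, ts)
    return rows[idx - 1][1] if idx > 0 else None


def batch_retrieve(
    tables: Dict[str, Table],
    wanted: Dict[str, List[str]],
    entity_rows: List[Tuple[str, int]],
) -> List[dict]:
    """Run one sub-query per feature set, then join on (entity, timestamp)."""
    result = []
    for entity_id, ts in entity_rows:
        joined = {"entity_id": entity_id, "timestamp": ts}
        for fs_ref, feature_names in wanted.items():
            row = point_in_time_lookup(tables[fs_ref], entity_id, ts)
            for name in feature_names:
                joined[name] = row.get(name) if row is not None else None
        result.append(joined)
    return result
```

The important property is that a request timestamp never sees feature values with a later event timestamp; when no earlier value exists for a feature set, the joined row carries `None` for its features.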

5.3 During ingestion of data into stores

When loading data into Feast, data first needs to be converted into FeatureRow format and then pushed into a Kafka stream.

During this conversion to feature row form, it is necessary to set a field called feature_set with the feature set reference. To reiterate, the feature set reference looks something like: <project>/<feature-set-name>:<feature-set-version>

Ingestion jobs that pick up these rows are then able to easily identify the row as belonging to a specific project and feature set. The jobs then write all of these rows to all of the stores that subscribe to these feature sets.
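
As a rough sketch of this routing, the following uses simplified stand-ins for the real messages and stores. The `FeatureRow` and `Store` classes here are hypothetical dataclasses rather than Feast's protobuf types, and subscriptions are matched by exact feature set reference for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureRow:
    # Simplified stand-in for Feast's FeatureRow message.
    feature_set: str  # "<project>/<feature-set-name>:<feature-set-version>"
    fields: Dict[str, object]


@dataclass
class Store:
    name: str
    subscriptions: List[str]  # feature set references this store accepts
    rows: List[FeatureRow] = field(default_factory=list)


def ingest(rows: List[FeatureRow], stores: List[Store]) -> None:
    """Write each row to every store subscribed to the row's feature set."""
    for row in rows:
        for store in stores:
            if row.feature_set in store.subscriptions:
                store.rows.append(row)
```

The routing decision depends entirely on the `feature_set` field carried inside each row, which is the self-identification that section 6.3 calls out as a code smell.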

6. Problems with the current implementation

6.1 Feature set versions are unnecessary:

The concept of feature set versions was introduced in order to allow users to reuse feature set names. However, versions add complexity at both ingestion time and retrieval time. Users need to know the correct version of a feature set to ingest data to and to retrieve data from. If they don't pin their retrieval to a specific version, they risk having their system go down at a version increment.

6.2 Projects could be unnecessary at the top of the concept hierarchy:

The concept of projects was introduced to provide a means of:

Isolation between users: Users can register the same feature sets and features within their own project namespace without conflicts arising between users.
Access control: Projects provide a top-level hierarchy that makes access control more convenient to implement.
Ease of feature retrieval: By introducing naming constraints at the project level, it is easier to logically group and reference features by name. Thus, projects provide a way of grouping based on retrieval, where feature sets provide a means of grouping based on ingestion.

The problem with projects is that they introduce a layer into the concept hierarchy that makes Feast harder to understand and could be adding unnecessary complexity. It’s possible that all of the above requirements for introducing projects could be addressed while still maintaining feature sets as the top-level concept.

6.3 Projects are a cause for code smell in the data model:

There are currently three locations where projects occur.

Ingestion (FeatureRows)
Stores (tables and keys)
Serving/retrieval (incoming queries)

The current approach has a code smell in that FeatureRows have to know their own identity. Today, having each FeatureRow know its own identity allows Feast to consume from topics that contain mixed feature sets (versions and names). Feast is able to differentiate FeatureRows from each other and knows how to interpret their contents based on a feature set reference contained within the row.

However, if Feast were to consume features from an external stream that it had no control over (not even the data model), Feast would not have the feature set reference conveniently available inside the event payload.

The second occurrence of projects is in the store. Tables are currently named according to projectName_featureSet_version. Projects are a necessity here since feature set names can be duplicated across projects. However, projects are not essential complexity in the same way a feature set is, and it doesn't seem natural to encode them into the data model itself.
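
A small sketch shows why the project prefix is currently needed in table names. Both helper functions are hypothetical and only mirror the naming scheme described above.

```python
def table_name(project: str, feature_set: str, version: int) -> str:
    """Current scheme described above: projectName_featureSet_version."""
    return f"{project}_{feature_set}_{version}"


def table_name_without_project(feature_set: str, version: int) -> str:
    """Hypothetical alternative that drops the project prefix."""
    return f"{feature_set}_{version}"


# Two teams register a feature set with the same name and version.
registrations = [("team_a", "customer", 1), ("team_b", "customer", 1)]

with_project = {table_name(p, fs, v) for p, fs, v in registrations}
without_project = {table_name_without_project(fs, v) for _, fs, v in registrations}
```

Here `with_project` holds two distinct table names while `without_project` collapses to one, so dropping projects from the data model would require feature set names to be unique on their own.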

6.4 Feature sets are a leaky abstraction:

Feature sets are a core part of the existing data model. Feature data is stored per feature set within a Feast store like Redis or BigQuery. In order to find the features a user is looking for, it is still necessary to determine the feature set they need from their feature reference. This seems to work at retrieval time since Feast Serving can maintain a cache of available feature sets (albeit introducing a new inefficiency during lookup). Two problems exist here:

There is a disconnect between how users are producing data (feature set references) and how users are consuming data (feature references). Users are loading FeatureRows into feature sets, but they are querying features out of projects. Ideally these two concepts wouldn’t be so distinct.
Currently, feature references are defined as follows: <project>/<feature-name>:<feature-version>. However, the concept of a feature version doesn’t exist. Features currently inherit their version from their feature set, so a feature reference still contains trace information about the parent feature set.

Issue Analytics

State:
Created 4 years ago
Comments:23 (13 by maintainers)

Top GitHub Comments

2 reactions

Wirick commented, Feb 28, 2020

So just to share a possibility from my experience and wheelhouse, the plan for feast in my org is to have a features repo that defines avro schemas for feature sets. The feature set schemas (similarly to all of our event schemas) are generated programmatically, along with python and go glue code, annotated with version for evolution purposes, and then applied on master merge to the feast clusters.

When a user wants to ingest features, they use the generated schema object to ingest, validate, and publish a dataframe (note as an organization we use “ts” instead of datetime, so this also abstracts this difference):

from pmfeatures.buyer import CustomerFeatures

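def generate_features():
    ...  # compute `features` (elided in the original comment)
    customer_features = CustomerFeatures(
        customer_uuid=features["customer_uuid"],
        ts=features["ts"],
        features=features,
    )
    # Publish through the generated schema object.
    customer_features.publish()
    ...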

We automatically annotate schema changes with an updated version, and enforce schema evolution rules (it seems we would want similar rules for updating feature sets in bigquery if you want to use the same table) to make sure schemas are forward compatible. If feast had an ability to specify the version, this is the one I would use. However, when ingesting features the version doesn’t usually appear, and by enforcing schema evolution rules we can be sure that any serving code will work with updated schemas, since the only allowed operations are adding new nullable fields and relaxing the type of a field.
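
The evolution rules described above (only adding new nullable fields or relaxing a field's type) can be sketched as a compatibility check. This is not the commenter's tooling or a schema registry API: the `Field` type, the `is_allowed_evolution` helper, and the set of accepted type widenings are all assumptions for illustration.

```python
from typing import Dict, NamedTuple


class Field(NamedTuple):
    type: str
    nullable: bool


# Assumed type relaxations; a real registry would define its own list.
_WIDENINGS = {("int", "long"), ("float", "double")}


def is_allowed_evolution(old: Dict[str, Field], new: Dict[str, Field]) -> bool:
    """Return True if `new` only adds nullable fields or relaxes existing ones."""
    for name, old_field in old.items():
        new_field = new.get(name)
        if new_field is None:
            return False  # removing a field is not allowed
        same_type = new_field.type == old_field.type
        widened = (old_field.type, new_field.type) in _WIDENINGS
        if not (same_type or widened):
            return False  # type changed in a way that is not a relaxation
        if old_field.nullable and not new_field.nullable:
            return False  # tightening nullability is not allowed
    # Every newly added field must be nullable.
    return all(new[name].nullable for name in new.keys() - old.keys())
```

A check like this could run on merge, so that a schema change which fails it is rejected before it reaches the Feast clusters.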

I mention this because we are adopting the Confluent Schema Registry in our general Kafka strategy so that we don’t have to have schema information encoded in the body of the message, so it seems like it could be used to help solve the outlined issues about an event knowing about its feature set (6.3).

Additionally, we have a concept of namespace in our schemas, and we use that in the feature set name, and I’ve found that most want the latest version of a feature set. It’s for this reason that project and version seem safe to remove, perhaps by incorporating into the 7.2.1 Change 1. The first piece of utility code that I wrote for my feature set objects was a method that takes a list of features and annotates them with the latest version (calls feast core I believe).

1 reaction

tfurmston commented, Jun 25, 2020

I just wonder whether there is a one-fits-all solution to this that will work for everyone, or whether it would be possible to provide users with more flexibility in how they structure the underlying data.

@tfurmston for the record I 100% agree that there is room for optimizing the data model here and to provide users with more flexibility. One way to do this is to allow feature sets to be defined as materialized views:
name: features_that_I_will_retrieve
features:
- name: f1
  ref: fs1:f1
- name: f2 
  ref: fs2:f1
During ingestion we could write to these materialized views as well as the original feature set tables. However, I don’t see too much value of doing this for historical data. I do think it would be useful for online serving. In the case of online serving though, it would probably require a read + write since data will be coming in separate events. Alternatively we could maintain state in the ingestion jobs to support this.
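
The "read + write" requirement for online serving can be illustrated with a small sketch. Everything here is hypothetical: the `VIEW` mapping mirrors the definition above, `view_store` is an in-memory stand-in for an online store, and `apply_event` is not an existing Feast function.

```python
from typing import Dict

# View definition mirroring the example above:
# view feature name -> "<feature set>:<feature>".
VIEW = {"f1": "fs1:f1", "f2": "fs2:f1"}

# Hypothetical online store: entity key -> materialized view row.
view_store: Dict[str, Dict[str, object]] = {}


def apply_event(entity_key: str, feature_set: str, values: Dict[str, object]) -> None:
    """Merge one feature set event into the view row (read, update, write back)."""
    row = dict(view_store.get(entity_key, {}))  # read the current view row
    for view_name, ref in VIEW.items():
        fs_name, feature = ref.split(":")
        if fs_name == feature_set and feature in values:
            row[view_name] = values[feature]  # update only the mapped columns
    view_store[entity_key] = row  # write the merged row back
```

Because events for fs1 and fs2 arrive separately, each event has to read the existing row before writing, otherwise the later event would overwrite the columns contributed by the earlier one.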

Perhaps that would work. To be honest, I still don’t have enough usage of feast from the user perspective to know either way. Just from reading this thread, it does seem that there are issues with the data model. Hence my comments.

Re-reading @mrzzy's proposal from the 25th, i.e., grouping by entity, I think this makes a lot of sense. Maybe this would make a good default, and if it transpires that people need more flexibility, we can address it at that point.