Entity types as a higher-level concept

Introduction

Currently an entity, or more formally an entity type, is treated as a special type of field within a feature set. This was an attempt to simplify the creation and management of entities and to keep them consistent with features. However, some challenges exist with our current approach.

Note: The terms entity and entity type will be used interchangeably in this issue.

How are entities created?

  • Users define an entity as part of a feature set. An entity in this case is a field like any other within the feature set. More than one entity can exist within a feature set.
  • An entity’s name must be unique within a feature set.
  • There are no constraints on entities outside of a feature set, either at the project or global level. This means that multiple feature sets can define the same entities again.
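
The current behaviour described above can be sketched in a few lines of Python. This is only an illustration: `FeatureSet`, `add_entity`, and the type names `INT64` and `STRING` are hypothetical stand-ins and not the Feast API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FeatureSet:
    """Hypothetical stand-in for a feature set (not the Feast API)."""
    name: str
    # Maps entity name -> value type name.
    entities: Dict[str, str] = field(default_factory=dict)

    def add_entity(self, name: str, value_type: str) -> None:
        # Uniqueness is only checked inside this one feature set.
        if name in self.entities:
            raise ValueError(f"entity {name!r} already defined in {self.name!r}")
        self.entities[name] = value_type


rides = FeatureSet("rides")
rides.add_entity("customer", "INT64")

payments = FeatureSet("payments")
# Nothing prevents a second feature set from redefining "customer",
# here even with a different (assumed) value type.
payments.add_entity("customer", "STRING")
```

The two definitions of `customer` coexist without any error, which is the root of the consistency problem raised later in this issue.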

How are entities used?

  • Retrieving feature values: Entities are used as a key for retrieving features. In order to retrieve feature values within a feature set, all entities must be provided as part of the lookup.
  • Joining feature sets: In the event that feature values are being retrieved from multiple feature sets, entities are used to look up these feature values. Entities are also used to join across these feature sets to construct a single result set.
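
As a rough sketch of both uses, the following assumes a hypothetical in-memory `FeatureSetStore` whose rows are keyed by a tuple of entity values. The class, the `get_joined` helper, and the feature names are invented for illustration.

```python
from typing import Any, Dict, Iterable, Sequence, Tuple

EntityKey = Tuple[Any, ...]


class FeatureSetStore:
    """Hypothetical in-memory feature set, keyed by its entity values."""

    def __init__(self, entity_names: Sequence[str],
                 rows: Dict[EntityKey, Dict[str, Any]]) -> None:
        self.entity_names = tuple(entity_names)
        self.rows = rows

    def get(self, entity_values: Dict[str, Any]) -> Dict[str, Any]:
        # Every entity of the feature set must be supplied for a lookup.
        missing = [e for e in self.entity_names if e not in entity_values]
        if missing:
            raise KeyError(f"missing entities: {missing}")
        key = tuple(entity_values[e] for e in self.entity_names)
        return self.rows.get(key, {})


def get_joined(stores: Iterable[FeatureSetStore],
               entity_values: Dict[str, Any]) -> Dict[str, Any]:
    # Join by looking up each feature set with the same entity values
    # and merging the results into one row.
    result = dict(entity_values)
    for store in stores:
        result.update(store.get(entity_values))
    return result


customer_stats = FeatureSetStore(["customer"], {(1,): {"total_orders": 12}})
trip_stats = FeatureSetStore(["customer", "driver"],
                             {(1, 7): {"trips_together": 3}})

row = get_joined([customer_stats, trip_stats], {"customer": 1, "driver": 7})
```

The join only works because both feature sets agree on what `customer` means and how it is typed, which is exactly what is not guaranteed today.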

What is the problem?

  1. Discovery: It seems intuitive that users would start their discovery experience from the point of view of an entity type, since their business problem is generally framed around one or more entities. Nesting entities within feature sets and projects, without providing a way to discover them, makes discovery harder.
  2. Consistency: Entities are typically consistent across all projects and systems in most organizations. This consistency is not enforced in Feast at the moment. Users are bound to redefine entities in their local projects if no consistency is enforced at an organizational level. Lookups and joins across feature sets would then fail, especially joins that need to happen across projects.
  3. Key building: If entities and features maintain mutual compatibility in terms of supported data types, then support must be maintained for building keys from all feature value types. This adds a lot of complexity to key building since support must be maintained to serialize complex composite data structures in order to build these keys.
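
To illustrate why a restricted set of entity types keeps key building simple, here is a minimal sketch that assumes entities are limited to integers and strings. The `build_key` function and its key layout are hypothetical and do not reflect how Feast serializes keys.

```python
from typing import Mapping, Union

# Assumed restricted set of entity value types for this sketch.
EntityValue = Union[int, str]


def build_key(project: str, entities: Mapping[str, EntityValue]) -> str:
    parts = []
    # Sort by entity name so the key does not depend on argument order.
    for name in sorted(entities):
        value = entities[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, str)):
            raise TypeError(
                f"unsupported entity type for {name!r}: {type(value).__name__}"
            )
        parts.append(f"{name}={value}")
    # Escaping of "|" and "=" inside string values is ignored in this sketch.
    return f"{project}:" + "|".join(parts)


key = build_key("gojek", {"driver": 7, "customer": 1})
```

Supporting every feature value type (lists, maps, nested structures) would require a canonical serialization for each of them in place of the single scalar branch above.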

Proposals

1. Project-level entities

Functionality

  • Entities are created outside of feature sets, but they still reside in a specific project namespace.
  • Entities have their own distinct API and supported data types (which may be more limited than features).
  • Entities must be unique within a project namespace, but can be duplicated across an organization. Uniqueness is ensured through a full entity reference (gojek/customer).
  • Entities are still defined as part of a feature set, but this is a selection process instead of creation.
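
A minimal sketch of this proposal, assuming a hypothetical `ProjectEntityRegistry` in which the pair of project and entity name forms the full reference (the `other_team` project and the type names are invented for illustration):

```python
from typing import Dict, Tuple


class ProjectEntityRegistry:
    """Hypothetical registry where entities are unique per project."""

    def __init__(self) -> None:
        # Maps (project, entity name) -> value type name.
        self._entities: Dict[Tuple[str, str], str] = {}

    def create(self, project: str, name: str, value_type: str) -> str:
        key = (project, name)
        if key in self._entities:
            raise ValueError(f"entity {project}/{name} already exists")
        self._entities[key] = value_type
        # The full reference is what makes the entity unique.
        return f"{project}/{name}"

    def resolve(self, reference: str) -> str:
        project, _, name = reference.partition("/")
        if not project or not name:
            raise ValueError(f"expected 'project/entity', got {reference!r}")
        return self._entities[(project, name)]


registry = ProjectEntityRegistry()
ref = registry.create("gojek", "customer", "INT64")
# The same entity name may exist in another project.
other_ref = registry.create("other_team", "customer", "STRING")
```

A feature set would then select an existing entity by its reference, for example `registry.resolve("gojek/customer")`, instead of creating a new one.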

Advantages

  • Entities receive all the sharing and isolation benefits of “projects”. Entities would not have to be treated separately from a logical and/or development standpoint. There would also be no explosion of a global entity namespace.
  • Users are free to experiment and develop within their projects without affecting other users, since duplication is allowed across projects.
  • No need for a central team to gate-keep the creation of entities.

Disadvantages

  • By not elevating entities to the global level, end users would be required to know which projects contain the entities they should be referencing. This means an organizational process must exist in order to select these entities.
  • Most projects would have to reference entities from another more authoritative project. In fact, it’s likely that an organization will have a central project which contains only entities. This could be a little counter-intuitive if a feature set contains fields that are referencing an external project.

2. Global-level entities

Functionality

  • Entities are defined globally for a Feast deployment.
  • Entities have their own distinct API and supported data types (which may be more limited than features).
  • Entities must be globally unique.
  • Entities are still defined as part of a feature set, but this is a selection process instead of creation.
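
The global variant can be sketched with a single namespace. `GlobalEntityRegistry`, `register`, and `select` are hypothetical names used only to illustrate the uniqueness constraint and the selection step.

```python
from typing import Dict, Iterable


class GlobalEntityRegistry:
    """Hypothetical deployment-wide registry with globally unique names."""

    def __init__(self) -> None:
        self._types: Dict[str, str] = {}

    def register(self, name: str, value_type: str) -> None:
        # A name can be registered only once for the whole deployment.
        if name in self._types:
            raise ValueError(
                f"entity {name!r} already registered as {self._types[name]}"
            )
        self._types[name] = value_type

    def select(self, names: Iterable[str]) -> Dict[str, str]:
        # Feature sets pick existing entities; they cannot create new ones.
        names = list(names)
        unknown = [n for n in names if n not in self._types]
        if unknown:
            raise LookupError(f"unknown entities: {unknown}")
        return {n: self._types[n] for n in names}


registry = GlobalEntityRegistry()
registry.register("customer", "INT64")
registry.register("driver", "INT64")

trip_entities = registry.select(["customer", "driver"])
```

In this sketch a second team calling `registry.register("customer", "STRING")` gets an error, so a team that needs a different type has to register the entity under a new name.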

Advantages

  • Central authoritative listing of entities within an organization.
  • Easier to discover which entities should be used, without needing an organizational policy.
  • Easy to reason about and easier to understand when referencing an entity within a feature set.

Disadvantages

  • Requires development of separate logic from projects, feature sets, and features.
  • Requires a team and process to manage the creation of entities.
  • No way to isolate conflicts. If one team wants to use a float and another wants to use a string for an entity data type, then it would likely result in two entities being created. This would still be the case in the Project-level entity proposal, but at least in that proposal the unorthodox approach (maybe string) could be isolated to a specific project.

3. Default project entities

Functionality

  • If a user does not specify a project, then they are automatically located inside of the default project. This would be similar to how Kubernetes does namespacing.
  • All other functionality would be the same as in the project-level entities proposal, except that users don’t actually have to create an entity inside of a named project.
  • Feature references could be created that allow users to reference entities without a project. So instead of having my_company/customer, it would be possible to refer to “global” entities by either using customer or default/customer.
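
The reference resolution described above could look roughly like this. The function `resolve_entity_ref` and the constant `DEFAULT_PROJECT` are assumptions made for this sketch; only the reference forms `customer`, `default/customer`, and `my_company/customer` come from the proposal.

```python
from typing import Tuple

# Assumed name of the implicit project, mirroring Kubernetes' "default" namespace.
DEFAULT_PROJECT = "default"


def resolve_entity_ref(ref: str) -> Tuple[str, str]:
    """Split 'project/entity' or a bare 'entity' into (project, entity)."""
    project, sep, name = ref.rpartition("/")
    if not name or (sep and not project) or "/" in project:
        raise ValueError(f"invalid entity reference: {ref!r}")
    # A bare entity name falls back to the default project.
    return (project or DEFAULT_PROJECT, name)


bare = resolve_entity_ref("customer")
explicit = resolve_entity_ref("default/customer")
scoped = resolve_entity_ref("my_company/customer")
```

With this rule `customer` and `default/customer` resolve to the same entity, while `my_company/customer` stays isolated in its own project.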

Advantages

  • All of the advantages of project-level entities.
  • Most of the advantages of global-level entities, except that this default project would still not be a true global namespace. There would still need to be an organizational process that informs users to use the entities in this project.
  • Simplifies development since project-level sharing and isolation can be reused.

Disadvantages

  • Still requires access control on the default namespace.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 4
  • Comments: 13 (6 by maintainers)

Top GitHub Comments

1 reaction
khorshuheng commented, Jan 22, 2020

I am in favour of 3. Option 2 (unique global entity name) may lead to complicated entity management in some cases. For example, let’s say we have drivers for different countries. Option 2 dictates that we cannot have the same entity for all countries (e.g. driver), but instead multiple different entities (e.g. driver_vn, driver_th, driver_sg). It is likely that in an end-to-end machine learning workflow, the code section involving the drivers will be similar regardless of country (e.g. extracting the driver entity value from a JSON request during the prediction step). So, for option 2, the pipeline will need to know that driver_vn, driver_sg and driver_th all belong to the same group and should be handled the same way, which leads to extra configuration on the user side.

It’s not clear what you mean here. What prevents you from having simply driver as a global entity?

Actually, yeah, you are correct. I can just have driver in a global project instead of having the entity defined in each regional project. I was too entrenched in the code base that I am currently working on and didn’t consider this possibility.

1 reaction
woop commented, Jan 22, 2020

I am in favour of 3. Option 2 (unique global entity name) may lead to complicated entity management in some cases. For example, let’s say we have drivers for different countries. Option 2 dictates that we cannot have the same entity for all countries (e.g. driver), but instead multiple different entities (e.g. driver_vn, driver_th, driver_sg). It is likely that in an end-to-end machine learning workflow, the code section involving the drivers will be similar regardless of country (e.g. extracting the driver entity value from a JSON request during the prediction step). So, for option 2, the pipeline will need to know that driver_vn, driver_sg and driver_th all belong to the same group and should be handled the same way, which leads to extra configuration on the user side.

It’s not clear what you mean here. What prevents you from having simply driver as a global entity?
