Duplicate entity row created when multiple feature set apply() calls happen asynchronously
Expected Behavior
feast=# select * from entities;
                feature_set_id                 | name | project | type  | version
-----------------------------------------------+------+---------+-------+---------
 0/customer_transactions:1                     | id   | 0       | INT64 | 1
 0/customer_transactions2:1                    | id2  | 0       | INT64 | 1
 0/customer_transactions_with_same_entity_id:1 | id   | 0       | INT64 | 1
(3 rows)
Current Behavior
feast=# select * from entities;
                feature_set_id                 | name | project | type  | version
-----------------------------------------------+------+---------+-------+---------
 0/customer_transactions:1                     | id   | 0       | INT64 | 1
 0/customer_transactions:1                     | id   | 0       | INT64 | 1
 0/customer_transactions2:1                    | id2  | 0       | INT64 | 1
 0/customer_transactions_with_same_entity_id:1 | id   | 0       | INT64 | 1
(4 rows)
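To confirm which rows are duplicated, a grouping query over the entities table can help. The following is a minimal sketch that uses an in-memory SQLite database as a stand-in for the Feast Core database; the rows are copied from the output above, and the column types are assumptions.

```python
import sqlite3

# In-memory SQLite stands in for the Feast Core database (assumption);
# columns mirror the psql output above, column types are guessed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities ("
    "feature_set_id TEXT, name TEXT, project TEXT, type TEXT, version INTEGER)"
)
rows = [
    ("0/customer_transactions:1", "id", "0", "INT64", 1),
    ("0/customer_transactions:1", "id", "0", "INT64", 1),  # the duplicate row
    ("0/customer_transactions2:1", "id2", "0", "INT64", 1),
    ("0/customer_transactions_with_same_entity_id:1", "id", "0", "INT64", 1),
]
conn.executemany("INSERT INTO entities VALUES (?, ?, ?, ?, ?)", rows)

# Group by the pair that should be unique and keep groups with more than one row.
duplicates = conn.execute(
    "SELECT feature_set_id, name, COUNT(*) AS copies "
    "FROM entities GROUP BY feature_set_id, name HAVING COUNT(*) > 1"
).fetchall()
print(duplicates)
```

The SELECT statement is plain SQL, so it should also run unchanged in psql against the real table.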
client.list_feature_sets() and client.list_entities() also throw the following error:
>>> client.list_entities()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/feast/client.py", line 453, in list_entities
    for fs in self.list_feature_sets():
  File "/usr/local/lib/python3.7/site-packages/feast/client.py", line 403, in list_feature_sets
    feature_set = FeatureSet.from_proto(feature_set_proto)
  File "/usr/local/lib/python3.7/site-packages/feast/feature_set.py", line 726, in from_proto
    else feature_set_proto.spec.project,
  File "/usr/local/lib/python3.7/site-packages/feast/feature_set.py", line 63, in __init__
    self.entities = entities
  File "/usr/local/lib/python3.7/site-packages/feast/feature_set.py", line 159, in entities
    self._add_fields(entities)
  File "/usr/local/lib/python3.7/site-packages/feast/feature_set.py", line 309, in _add_fields
    self.add(field)
  File "/usr/local/lib/python3.7/site-packages/feast/feature_set.py", line 275, in add
    + '"'
ValueError: could not add field "id" since it already exists in feature set "customer_transactions"
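The traceback suggests the failure is raised on the client side: when the feature set returned by Core lists the entity "id" twice, the second add call is rejected. The following is a simplified stand-in for that check, not the actual Feast source; TinyFeatureSet and its methods are hypothetical names used only for illustration.

```python
class TinyFeatureSet:
    """Hypothetical, stripped-down model of a feature set's field registry."""

    def __init__(self, name, entities):
        self.name = name
        self._fields = {}
        for entity in entities:
            self.add(entity)

    def add(self, field_name):
        # Reject a field whose name is already registered, as the error message implies.
        if field_name in self._fields:
            raise ValueError(
                f'could not add field "{field_name}" since it already exists '
                f'in feature set "{self.name}"'
            )
        self._fields[field_name] = True


# One entity row for the feature set: construction succeeds.
ok = TinyFeatureSet("customer_transactions", ["id"])

# Two identical entity rows, as in the duplicated table: construction fails.
try:
    TinyFeatureSet("customer_transactions", ["id", "id"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Under this reading, every listing call fails as soon as one stored feature set carries a duplicated entity, which would explain why both list_feature_sets() and list_entities() break.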
Steps to reproduce
In a local Python client, create two different feature sets with the same entity name:
import pandas as pd
import numpy as np
from pytz import timezone, utc
from feast import Client, Entity, FeatureSet, ValueType
from sdk.python.feast.serving.ServingService_pb2 import GetOnlineFeaturesRequest
from sdk.python.feast.types.Value_pb2 import Value as Value
from google.protobuf.duration_pb2 import Duration
from datetime import datetime, timedelta
from random import randrange
import random
import os
project = '0'
CORE_URL = os.environ.get('FEAST_CORE_URL')
ONLINE_SERVING_URL = os.environ.get('FEAST_ONLINE_SERVING_URL')
BATCH_SERVING_URL = os.environ.get('FEAST_BATCH_SERVING_URL')
client = Client(core_url=CORE_URL, serving_url=BATCH_SERVING_URL) # Connect to Feast Core
client.create_project(project)
client.set_project(project)
days = [datetime.utcnow().replace(hour=0, minute=0, second=0, microsecond=0).replace(tzinfo=utc) \
        - timedelta(day) for day in range(31)]
# create feature set based on notebook example
customers = [1001, 1002, 1003, 1004, 1005]
customer_features = pd.DataFrame(
    {
        "datetime": [day for day in days for customer in customers],
        "id": [customer for day in days for customer in customers],
        "daily_transactions": [np.random.rand() * 10 for _ in range(len(days) * len(customers))],
        "total_transactions": [np.random.randint(100) for _ in range(len(days) * len(customers))],
    }
)
print(customer_features.head(10))
customer_fs = FeatureSet(
    "customer_transactions",
    max_age=Duration(seconds=86400),
    entities=[Entity(name='id', dtype=ValueType.INT64)]
)
customer_fs.infer_fields_from_df(customer_features, replace_existing_features=True)
client.apply(customer_fs)
# create new feature set with same entity
customers = [1001, 1002, 1003, 1004, 1005]
customer_features = pd.DataFrame(
    {
        "datetime": [day for day in days for customer in customers],
        "id": [customer for day in days for customer in customers],
        "daily_transactions3": [np.random.rand() * 10 for _ in range(len(days) * len(customers))],
        "total_transactions3": [np.random.randint(100) for _ in range(len(days) * len(customers))],
    }
)
print(customer_features.head(10))
customer_fs = FeatureSet(
    "customer_transactions_with_same_entity_id",
    max_age=Duration(seconds=86400),
    entities=[Entity(name='id', dtype=ValueType.INT64)]
)
customer_fs.infer_fields_from_df(customer_features, replace_existing_features=True)
client.apply(customer_fs)
Specifications
- Version: 0.4.4 python client, 0.4.3 installed on GKE cluster.
- Platform: OSX 10.14
- Subsystem: Python 3.7.6
Possible Solution
- I suppose having the same entity name is an anti-pattern/bad practice and should be avoided given plans for #405, but I’m not clear why this is happening. It’s also not clear in the docs that entity names should be unique even when changing feature sets.
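One way a registry could guard against this kind of duplicate, regardless of how concurrent apply() calls interleave, is a uniqueness constraint on the (feature_set_id, name) pair. The sketch below uses in-memory SQLite as a stand-in and is an assumption for illustration only, not a description of the Feast Core schema; insert_entity is a hypothetical helper.

```python
import sqlite3

# In-memory SQLite stands in for the registry database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities ("
    "feature_set_id TEXT, name TEXT, project TEXT, type TEXT, version INTEGER, "
    "UNIQUE (feature_set_id, name))"
)


def insert_entity(feature_set_id, name):
    """Insert an entity row; return False if the pair already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO entities VALUES (?, ?, '0', 'INT64', 1)",
                (feature_set_id, name),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = insert_entity("0/customer_transactions:1", "id")
repeat = insert_entity("0/customer_transactions:1", "id")  # rejected duplicate
other = insert_entity("0/customer_transactions_with_same_entity_id:1", "id")
row_count = conn.execute("SELECT COUNT(*) FROM entities").fetchone()[0]
print(first, repeat, other, row_count)
```

With such a constraint, the same entity name in two different feature sets stays valid, while a second identical row for the same feature set is refused by the database instead of surfacing later as a client error.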
Issue Analytics
- State:
- Created 4 years ago
- Reactions:1
- Comments:15 (15 by maintainers)
Top GitHub Comments
Nope, that is Gojek specific IP so we can’t open source it. It’s basically just .yaml files of the feature set specifications in folders like dev, staging, production and then grouped by system.
I believe that this might be resolved already since we have changed the db schemas, but without a way to reproduce this we won’t know.
Closing for now. Let’s reopen if we can reproduce the problem.