Duplicate entity row created when multiple feature set apply() calls happen asynchronously

Expected Behavior

feast=# select * from entities;
                feature_set_id                 | name | project | type  | version 
-----------------------------------------------+------+---------+-------+---------
 0/customer_transactions:1                     | id   | 0       | INT64 |       1
 0/customer_transactions2:1                    | id2  | 0       | INT64 |       1
 0/customer_transactions_with_same_entity_id:1 | id   | 0       | INT64 |       1
(3 rows)

Current Behavior

feast=# select * from entities;
                feature_set_id                 | name | project | type  | version 
-----------------------------------------------+------+---------+-------+---------
 0/customer_transactions:1                     | id   | 0       | INT64 |       1
 0/customer_transactions:1                     | id   | 0       | INT64 |       1
 0/customer_transactions2:1                    | id2  | 0       | INT64 |       1
 0/customer_transactions_with_same_entity_id:1 | id   | 0       | INT64 |       1
(4 rows)
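
The duplicated row in the output above can also be found programmatically. The following is a minimal sketch, assuming the rows have already been fetched from the entities table and reduced to (feature_set_id, name) pairs; the helper name find_duplicate_entities is hypothetical and not part of Feast.

```python
from collections import Counter

# Rows copied from the "Current Behavior" query output,
# reduced to the (feature_set_id, name) pair.
rows = [
    ("0/customer_transactions:1", "id"),
    ("0/customer_transactions:1", "id"),
    ("0/customer_transactions2:1", "id2"),
    ("0/customer_transactions_with_same_entity_id:1", "id"),
]


def find_duplicate_entities(rows):
    """Return the (feature_set_id, name) pairs that occur more than once."""
    counts = Counter(rows)
    return {key: n for key, n in counts.items() if n > 1}


duplicates = find_duplicate_entities(rows)
print(duplicates)
```

Note that the entity name id appearing under two different feature sets is not flagged; only the repeated pair within customer_transactions is.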

client.list_feature_sets() and client.list_entities() also raise the following error:

>>> client.list_entities()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/feast/client.py", line 453, in list_entities
    for fs in self.list_feature_sets():
  File "/usr/local/lib/python3.7/site-packages/feast/client.py", line 403, in list_feature_sets
    feature_set = FeatureSet.from_proto(feature_set_proto)
  File "/usr/local/lib/python3.7/site-packages/feast/feature_set.py", line 726, in from_proto
    else feature_set_proto.spec.project,
  File "/usr/local/lib/python3.7/site-packages/feast/feature_set.py", line 63, in __init__
    self.entities = entities
  File "/usr/local/lib/python3.7/site-packages/feast/feature_set.py", line 159, in entities
    self._add_fields(entities)
  File "/usr/local/lib/python3.7/site-packages/feast/feature_set.py", line 309, in _add_fields
    self.add(field)
  File "/usr/local/lib/python3.7/site-packages/feast/feature_set.py", line 275, in add
    + '"'
ValueError: could not add field "id" since it already exists in feature set "customer_transactions"
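
The traceback suggests that the client rebuilds each feature set from the stored rows and rejects a field name it has already seen, so one duplicated entity row is enough to break listing. The sketch below illustrates that mechanism with a hypothetical stand-in class, TinyFeatureSet; it is not Feast's FeatureSet implementation and only mirrors the error message shown above.

```python
class TinyFeatureSet:
    """Hypothetical stand-in, NOT Feast's FeatureSet class."""

    def __init__(self, name, entities):
        self.name = name
        self._fields = set()
        for entity in entities:
            self.add(entity)

    def add(self, field_name):
        # Reject a field name that was already registered in this feature set.
        if field_name in self._fields:
            raise ValueError(
                'could not add field "' + field_name + '" since it already '
                'exists in feature set "' + self.name + '"'
            )
        self._fields.add(field_name)


# Two identical entity rows for the same feature set, as in the table above.
try:
    TinyFeatureSet("customer_transactions", ["id", "id"])
    message = None
except ValueError as err:
    message = str(err)

print(message)
```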

Steps to reproduce

In a local Python client, create two different feature sets with the same entity name:

import os
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
from feast import Client, Entity, FeatureSet, ValueType
from google.protobuf.duration_pb2 import Duration
from pytz import utc


project = '0'
CORE_URL = os.environ.get('FEAST_CORE_URL')
ONLINE_SERVING_URL = os.environ.get('FEAST_ONLINE_SERVING_URL')
BATCH_SERVING_URL = os.environ.get('FEAST_BATCH_SERVING_URL')
client = Client(core_url=CORE_URL, serving_url=BATCH_SERVING_URL) # Connect to Feast Core
client.create_project(project)
client.set_project(project)
days = [datetime.utcnow().replace(hour=0, minute=0, second=0, microsecond=0).replace(tzinfo=utc) \
        - timedelta(day) for day in range(31)]

# create feature set based on notebook example
customers = [1001, 1002, 1003, 1004, 1005]
customer_features = pd.DataFrame(
    {
        "datetime": [day for day in days for customer in customers],
        "id": [customer for day in days for customer in customers],
        "daily_transactions": [np.random.rand() * 10 for _ in range(len(days) * len(customers))],
        "total_transactions": [np.random.randint(100) for _ in range(len(days) * len(customers))],
    }
)
print(customer_features.head(10))
customer_fs = FeatureSet(
    "customer_transactions",
    max_age=Duration(seconds=86400),
    entities=[Entity(name='id', dtype=ValueType.INT64)]
)
customer_fs.infer_fields_from_df(customer_features, replace_existing_features=True)
client.apply(customer_fs)

# create new feature set with same entity
customers = [1001, 1002, 1003, 1004, 1005]
customer_features = pd.DataFrame(
    {
        "datetime": [day for day in days for customer in customers],
        "id": [customer for day in days for customer in customers],
        "daily_transactions3": [np.random.rand() * 10 for _ in range(len(days) * len(customers))],
        "total_transactions3": [np.random.randint(100) for _ in range(len(days) * len(customers))],
    }
)
print(customer_features.head(10))
customer_fs = FeatureSet(
    "customer_transactions_with_same_entity_id",
    max_age=Duration(seconds=86400),
    entities=[Entity(name='id', dtype=ValueType.INT64)]
)
customer_fs.infer_fields_from_df(customer_features, replace_existing_features=True)
client.apply(customer_fs)

Specifications

  • Version: 0.4.4 Python client; 0.4.3 installed on a GKE cluster.
  • Platform: OSX 10.14
  • Subsystem: Python 3.7.6

Possible Solution

  • I suppose reusing the same entity name across feature sets is an anti-pattern and should be avoided given the plans for #405, but it’s not clear to me why this is happening. The docs also do not make it clear that entity names should be unique across different feature sets.
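
The issue title points at apply() calls running concurrently, although the thread does not confirm the root cause. One way to rule that out on the client side is to serialize the calls behind a lock. This is a minimal sketch under that assumption: apply_serialized and RecordingClient are hypothetical names, RecordingClient stands in for a real Feast client, and the lock only covers calls made from a single Python process.

```python
import threading

# Process-wide lock; it cannot protect against other processes or machines.
_apply_lock = threading.Lock()


def apply_serialized(client, feature_set):
    """Call client.apply(feature_set) while holding the lock."""
    with _apply_lock:
        return client.apply(feature_set)


class RecordingClient:
    """Hypothetical stand-in for a Feast client; records overlapping calls."""

    def __init__(self):
        self.active = 0
        self.max_active = 0
        self.applied = []

    def apply(self, feature_set):
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        self.applied.append(feature_set)
        self.active -= 1


client = RecordingClient()
names = ["fs_%d" % i for i in range(8)]
threads = [
    threading.Thread(target=apply_serialized, args=(client, name))
    for name in names
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(client.max_active, sorted(client.applied))
```

A database-level uniqueness constraint would be the more robust guard, but that is a server-side change outside the scope of this sketch.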

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

woop commented, Feb 17, 2020 (1 reaction)

…we have a separate repository with our specifications. Changes to this repository are version controlled and applied through CI.

Is any of this open-sourced?

Nope, that is Gojek specific IP so we can’t open source it. It’s basically just .yaml files of the feature set specifications in folders like dev, staging, production and then grouped by system.

woop commented, Mar 28, 2020 (0 reactions)

I believe that this might be resolved already since we have changed the db schemas, but without a way to reproduce this we won’t know.

Closing for now. Let’s reopen if we can reproduce the problem.
