Specify unique-row-id column in get_historical_features

Is your feature request related to a problem? Please describe.

Suppose the entity source data has columns X, A, B, C and event_timestamp, where A, B, C are the entity columns that feature data is joined on. If the combination [A, B, C, event_timestamp] is not unique, the join produces duplicate rows. One workaround is to preprocess the data so that only one row per combination is kept for the join, but we may want to preserve every row: X may already be unique, with each row representing a distinct training example. There may also be columns Y, Z, etc. that are not part of the Feast join but carry row-specific information, so it doesn't make sense to filter those rows out. In this example, X might be impression_id: we are not joining directly on impression_id but on the A, B, C columns, which might be tweet_id, user_id, etc.

Describe the solution you'd like

Be able to optionally specify a unique-row-id column in get_historical_features; in the example above, X would be chosen as the unique-row-id column. I've tested swapping this part of the Feast join query

CONCAT(
    {% for entity in featureview.entities %}
        CAST({{entity}} AS STRING),
    {% endfor %}
    CAST({{entity_df_event_timestamp_col}} AS STRING)
) AS {{featureview.name}}__entity_row_unique_id,

with just

X as entity_row_unique_id

and it fixes the issue. There also seem to be performance gains, possibly from eliminating the work of building the concatenated string for each row. I think the change should be relatively easy to make, though it involves an API change, which always requires some consideration. get_historical_features might become something like

    def get_historical_features(
        entity_df: Union[pd.DataFrame, str],
        feature_refs: List[str],
        unique_id_col: str = "",
        full_feature_names: bool = False,
    ) -> RetrievalJob:

where unique_id_col: str = "" is the new optional parameter.
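To make the difference between the two keys concrete, here is a minimal sketch in plain Python. The helper entity_row_unique_ids is hypothetical and not part of Feast; it only mimics the CONCAT(...) key from the query template and falls back to a caller-supplied column when unique_id_col is set. The sample rows and the impression ids are made up for illustration.

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]


def entity_row_unique_ids(
    rows: Sequence[Row],
    entities: Sequence[str],
    timestamp_col: str,
    unique_id_col: str = "",
) -> List[str]:
    """Return one join key per entity row.

    Hypothetical helper (an assumption, not Feast code): it imitates the
    concatenated key unless the caller names a unique-row-id column.
    """
    if unique_id_col:
        # Use the caller's column directly; no string concatenation needed.
        return [str(row[unique_id_col]) for row in rows]
    # Default: concatenate the entity columns and the event timestamp.
    key_cols = [*entities, timestamp_col]
    return ["".join(str(row[col]) for col in key_cols) for row in rows]


# Two distinct impressions that share the same (A, B, C, event_timestamp).
rows = [
    {"X": "imp-1", "A": 1, "B": 2, "C": 3, "event_timestamp": "2021-07-01"},
    {"X": "imp-2", "A": 1, "B": 2, "C": 3, "event_timestamp": "2021-07-01"},
]

concat_ids = entity_row_unique_ids(rows, ["A", "B", "C"], "event_timestamp")
explicit_ids = entity_row_unique_ids(
    rows, ["A", "B", "C"], "event_timestamp", unique_id_col="X"
)

# Number of distinct keys under each scheme.
print(len(set(concat_ids)), len(set(explicit_ids)))  # 1 2
```

The concatenated key collapses both rows onto one id, while the explicit column keeps them apart, which is the property the join relies on.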

Describe alternatives you've considered

N/A

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

2 reactions
felixwang9817 commented, Jul 26, 2021

I don’t have a way to reproduce this bug, but I believe you can see the potential for duplication by inspecting the final part of the SQL query we use for get_historical_features:

SELECT * EXCEPT(entity_timestamp, {% for featureview in featureviews %} {{featureview.name}}__entity_row_unique_id{% if loop.last %}{% else %},{% endif %}{% endfor %})
FROM entity_dataframe
{% for featureview in featureviews %}
LEFT JOIN (
    SELECT
        {{featureview.name}}__entity_row_unique_id
        {% for feature in featureview.features %}
            ,{% if full_feature_names %}{{ featureview.name }}__{{feature}}{% else %}{{ feature }}{% endif %}
        {% endfor %}
    FROM {{ featureview.name }}__cleaned
) USING ({{featureview.name}}__entity_row_unique_id)
{% endfor %}

We take the spine (entity_dataframe) and enrich it by doing a LEFT JOIN with feature data from the offline store. For each feature view, the LEFT JOIN checks for equality on {{featureview.name}}__entity_row_unique_id; if this id is not unique, the LEFT JOIN will produce extra rows. I'm assuming that in @mavysavydav's example, since (A, B, C, event_timestamp) is not unique, this exact issue is occurring.
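The row multiplication can be illustrated without a database. The sketch below is a toy stand-in for LEFT JOIN ... USING (key) over lists of dicts, not the query Feast runs; the table contents and the row_id and feature column names are assumptions chosen for the example.

```python
from typing import Dict, List

Row = Dict[str, object]


def left_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Toy LEFT JOIN ... USING (key): one output row per matching pair."""
    joined: List[Row] = []
    for left_row in left:
        matches = [r for r in right if r[key] == left_row[key]]
        if not matches:
            # No match: keep the left row on its own, as a LEFT JOIN does.
            joined.append(dict(left_row))
        for right_row in matches:
            joined.append({**left_row, **right_row})
    return joined


# Two spine rows that share the same row id.
spine = [
    {"X": "imp-1", "row_id": "k1"},
    {"X": "imp-2", "row_id": "k1"},
]
# The feature table carries the same id twice as well.
cleaned = [
    {"row_id": "k1", "feature": 0.5},
    {"row_id": "k1", "feature": 0.5},
]

joined = left_join(spine, cleaned, "row_id")
print(len(spine), len(joined))  # 2 4
```

Each spine row matches every feature row with the same id, so two rows on each side yield four output rows.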

1 reaction
mavysavydav commented, Jul 29, 2021

Yes @MattDelac! I had thought that your change would just optimize the query run (which it does), but now that I've tried it against this duplication issue, it resolves that too. On a closer look, it seems to be because of the GROUP BY you added. In the older version of the SQL, we group by the unique entity row id for the intermediate feature view tables at every stage, except when creating {{featureview.name}}__cleaned, which are the feature tables joined to the entity source data. So if the entity source data has duplicates, there's a good chance the intermediate feature tables have duplicates too, and these duplicates multiply out, causing an explosion. Thanks for your PR!
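As a rough illustration of why grouping the feature table by the row id stops the explosion, the sketch below keeps one feature row per id before counting join matches. dedupe_by_key is a hypothetical stand-in for the GROUP BY, not the SQL from the PR, and the data is made up.

```python
from typing import Dict, List

Row = Dict[str, object]


def dedupe_by_key(rows: List[Row], key: str) -> List[Row]:
    """Keep the first row seen for each key (a stand-in for GROUP BY key)."""
    first_seen: Dict[object, Row] = {}
    for row in rows:
        first_seen.setdefault(row[key], row)
    return list(first_seen.values())


# Feature table with a duplicated row id, as in the older query.
cleaned = [
    {"row_id": "k1", "feature": 0.5},
    {"row_id": "k1", "feature": 0.5},
    {"row_id": "k2", "feature": 0.9},
]
deduped = dedupe_by_key(cleaned, "row_id")

# Row ids of the spine; the duplicate k1 rows are distinct training examples.
spine_ids = ["k1", "k1", "k2"]

# A LEFT JOIN emits one row per (spine row, matching feature row) pair.
joined_rows = sum(
    sum(1 for row in deduped if row["row_id"] == row_id) for row_id in spine_ids
)
print(len(deduped), joined_rows)  # 2 3
```

With one feature row per id, every spine row matches exactly once, so the output keeps all three spine rows and adds none.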
