Latest Only option for Historical Retrieval

Is your feature request related to a problem? Please describe.

In many batch workflows, it is worthwhile to retrieve only the latest features for each entity. This is useful for both production and backtesting.

E.g. if I have an hourly/daily batch job that goes through our whole customer base to find fraudulent customers, we wouldn’t really use the online store for it.

Describe the solution you’d like

Allow users to specify that an entity set extracted from a feature view should be deduplicated so that only the latest row per entity is kept. Depends on #1611.

# Proposed API (not an existing Feast method at the time of this issue)
my_daily_batch_scoring_df = store.get_latest_features(
    entity_df="my_df",
    feature_refs=[...],
)
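
To make the requested "latest by entity" semantics concrete, here is a minimal pure-Python sketch. It is not Feast code: the helper `latest_rows_by_entity`, the column names `customer_id` and `event_timestamp`, and the sample rows are all hypothetical and chosen only for illustration.

```python
from datetime import datetime


def latest_rows_by_entity(feature_rows, entity_ids,
                          entity_key="customer_id",
                          ts_key="event_timestamp"):
    """Return the most recent feature row for each requested entity.

    Hypothetical helper: illustrates the requested semantics only.
    """
    wanted = set(entity_ids)
    latest = {}
    for row in feature_rows:
        entity = row[entity_key]
        if entity not in wanted:
            continue  # entity is not part of the requested entity set
        current = latest.get(entity)
        # Keep the row with the greatest timestamp seen so far.
        if current is None or row[ts_key] > current[ts_key]:
            latest[entity] = row
    return latest


# Made-up offline-store rows: customer 1 has two observations.
feature_rows = [
    {"customer_id": 1, "event_timestamp": datetime(2021, 7, 1), "txn_count": 3},
    {"customer_id": 1, "event_timestamp": datetime(2021, 7, 4), "txn_count": 5},
    {"customer_id": 2, "event_timestamp": datetime(2021, 7, 2), "txn_count": 1},
    {"customer_id": 3, "event_timestamp": datetime(2021, 7, 3), "txn_count": 9},
]

# Only customers 1 and 2 are in the (hypothetical) entity set.
scoring_rows = latest_rows_by_entity(feature_rows, entity_ids=[1, 2])
```

Unlike a point-in-time join, no per-row timestamp is supplied for the entities; each entity simply gets its newest available row.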

Additional context

Linked issue: #1611

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

2 reactions
MattDelac commented, Nov 21, 2021

I still believe that this is an important feature for batch prediction pipelines. In that case you need the latest values from the offline store.

You also need to keep the notion of an “entity_df”, which the pull_latest_from_table_or_query() method does not have.

2 reactions
woop commented, Jul 5, 2021

@MattDelac is this API moving closer to what you are using internally?

Not really

But we have the same need for batch predictions, where we want to predict on the latest values of the features in batch. For that case we could bypass the historical retrieval logic and use a SQL template that is much more efficient.

In terms of API, I would rather have a separate method, e.g. store.get_latest_features(), than a boolean parameter. And as I said, store.get_latest_features() could be a very efficient SQL query.

Hope that makes sense

store.get_latest_features() could be a shared method that is also used for materialization into the online store. Seems like a good idea to me.
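
As a rough illustration of the "efficient SQL query" idea discussed above, the sketch below uses a `ROW_NUMBER()` window to keep the newest row per entity after joining against an entity table. This is an assumption about what such a template might look like, not the query Feast uses; the table names `features` and `entities`, the column names, and the sample data are hypothetical. It runs against in-memory SQLite (window functions need SQLite 3.25 or newer) purely so the example is self-contained.

```python
import sqlite3

# Hypothetical "latest per entity" template: rank each entity's rows by
# timestamp (newest first) and keep only the top-ranked row.
LATEST_FEATURES_SQL = """
SELECT customer_id, event_timestamp, txn_count
FROM (
    SELECT
        f.*,
        ROW_NUMBER() OVER (
            PARTITION BY f.customer_id
            ORDER BY f.event_timestamp DESC
        ) AS row_num
    FROM features AS f
    JOIN entities AS e ON e.customer_id = f.customer_id
)
WHERE row_num = 1
ORDER BY customer_id
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE features "
    "(customer_id INTEGER, event_timestamp TEXT, txn_count INTEGER)"
)
conn.execute("CREATE TABLE entities (customer_id INTEGER)")

# ISO-8601 text timestamps sort correctly as strings.
conn.executemany(
    "INSERT INTO features VALUES (?, ?, ?)",
    [
        (1, "2021-07-01", 3),
        (1, "2021-07-04", 5),
        (2, "2021-07-02", 1),
        (3, "2021-07-03", 9),
    ],
)
# The entity table plays the role of the entity_df: only customers 1 and 2.
conn.executemany("INSERT INTO entities VALUES (?)", [(1,), (2,)])

latest = conn.execute(LATEST_FEATURES_SQL).fetchall()
conn.close()
```

Because the query needs no per-row timestamp comparison against an entity dataframe, it avoids the point-in-time join that historical retrieval performs.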
