Latest Only option for Historical Retrieval
Is your feature request related to a problem? Please describe.
In many batch workflows, it is worthwhile to retrieve only the latest feature values for each entity. This is useful for both production and backtesting.
E.g. if I have an hourly/daily batch which goes through our whole customer base to find fraudulent customers, we wouldn’t really use the online store for this.
Describe the solution you’d like
Allow users to specify that an entity set extracted from a feature view should be deduplicated by latest. Depends on #1611.
my_daily_batch_scoring_df = store.get_latest_features(
    entity_df="my_df",
    feature_refs=[...],
)
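To make "deduplicated by latest" concrete, here is a minimal pandas sketch of the intended semantics. It is not Feast code: `latest_per_entity`, the column names, and the sample data are all hypothetical and only illustrate keeping the most recent row per entity key.

```python
import pandas as pd


def latest_per_entity(features: pd.DataFrame, entity_col: str, ts_col: str) -> pd.DataFrame:
    """Keep only the most recent row for each entity key (hypothetical helper)."""
    return (
        features.sort_values(ts_col)                         # oldest first
        .drop_duplicates(subset=[entity_col], keep="last")   # last row per entity = latest
        .reset_index(drop=True)
    )


# Illustrative data only: several feature rows per customer.
features = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 2, 3],
        "event_timestamp": pd.to_datetime(
            ["2021-01-01", "2021-01-03", "2021-01-02", "2021-01-01", "2021-01-02"]
        ),
        "txn_count": [5, 7, 1, 0, 4],
    }
)

latest = latest_per_entity(features, "customer_id", "event_timestamp")
```

The result has one row per customer, which is the shape a daily batch scoring job would consume.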
Additional context
Linked issue #1611
Issue Analytics
- State:
- Created 2 years ago
- Reactions:1
- Comments:10 (6 by maintainers)
Top GitHub Comments
I still believe that this is an important feature for batch prediction pipelines. In that case you need the latest values from the offline store.
You also need to keep this idea of an “entity_df” that we don’t have with the pull_latest_from_table_or_query() method.
store.get_latest_features() could be a shared method that is also used for materialization into the online store. Seems like a good idea to me.
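A rough sketch of how an "entity_df" could scope a latest-value pull, again in plain pandas rather than Feast internals. `latest_for_entities` and the column names are assumptions made for illustration; the real offline store method may work quite differently.

```python
import pandas as pd


def latest_for_entities(
    source: pd.DataFrame, entity_df: pd.DataFrame, entity_col: str, ts_col: str
) -> pd.DataFrame:
    """Latest row per entity, restricted to the entities listed in entity_df (hypothetical helper)."""
    wanted = entity_df[[entity_col]].drop_duplicates()
    # The inner join drops source rows for entities that were not requested.
    scoped = source.merge(wanted, on=entity_col, how="inner")
    # idxmax picks the row label of the newest timestamp within each entity group.
    newest = scoped.groupby(entity_col)[ts_col].idxmax()
    return scoped.loc[newest].reset_index(drop=True)


# Illustrative data only.
source = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 3, 3],
        "event_timestamp": pd.to_datetime(
            ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-01", "2021-01-04"]
        ),
        "risk_score": [0.1, 0.2, 0.9, 0.3, 0.4],
    }
)
entity_df = pd.DataFrame({"customer_id": [1, 3]})

scoped_latest = latest_for_entities(source, entity_df, "customer_id", "event_timestamp")
```

Passing every known entity in `entity_df` would give the full latest snapshot, which is roughly what materialization into the online store needs.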