get_historical_features fails with dask error for file offline store
Expected Behavior
Calling feature_store.get_historical_features(df, features=fs_columns).to_df(), where feature_store is a feature store with a file offline store, fs_columns is a list of feature column names, and df is a Pandas data frame, should work.
Current Behavior
It currently raises an error inside dask:
E NotImplementedError: dd.DataFrame.apply only supports axis=1
E Try: df.apply(func, axis=1)
Stacktrace:
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/infra/offline_stores/offline_store.py:81: in to_df
features_df = self._to_df_internal()
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/usage.py:280: in wrapper
raise exc.with_traceback(traceback)
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/usage.py:269: in wrapper
return func(*args, **kwargs)
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/infra/offline_stores/file.py:75: in _to_df_internal
df = self.evaluation_function().compute()
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/infra/offline_stores/file.py:231: in evaluate_historical_retrieval
df_to_join = _normalize_timestamp(
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/infra/offline_stores/file.py:530: in _normalize_timestamp
df_to_join[timestamp_field] = df_to_join[timestamp_field].apply(
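
For context, dask implements DataFrame.apply only row-wise. The following standalone sketch (plain dask, not Feast code) reproduces the same NotImplementedError by calling apply with the pandas default axis=0:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
ddf = dd.from_pandas(pdf, npartitions=1)

# Column-wise apply (axis=0, the pandas default) is not implemented in dask:
# NotImplementedError: dd.DataFrame.apply only supports axis=1
ddf.apply(lambda col: col, axis=0)

In the stack frame above, df_to_join[timestamp_field].apply(...) would normally act on a Series, where the default axis is fine, so the error suggests that selection unexpectedly produced a DataFrame.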
Steps to reproduce
Here is my feature store definition:
import datetime

import numpy as np
import pandas as pd
from feast import FeatureStore, RepoConfig, FileSource, FeatureView, ValueType, Entity, Feature
from feast.infra.offline_stores.file import FileOfflineStoreConfig
from google.protobuf.duration_pb2 import Duration

# tmp_path is a pathlib.Path (here taken from pytest's tmp_path fixture)
source_path = tmp_path / "source.parquet"

timestamp = datetime.datetime(year=2022, month=4, day=29, tzinfo=datetime.timezone.utc)
df = pd.DataFrame(
    {
        "entity": [0, 1, 2, 3, 4],
        "f1": [1.0, 1.1, 1.2, 1.3, 1.4],
        "f2": ["a", "b", "c", "d", "e"],
        "timestamp": [
            timestamp,
            # this one should not be fetched as it is too far in the past
            timestamp - datetime.timedelta(days=2),
            timestamp,
            timestamp,
            timestamp,
        ],
    }
)
df.to_parquet(source_path)

source = FileSource(
    path=str(source_path),
    event_timestamp_column="timestamp",
    created_timestamp_column="timestamp",
)
entity = Entity(
    name="entity",
    value_type=ValueType.INT64,
    description="Entity",
)
view = FeatureView(
    name="view",
    entities=["entity"],
    ttl=Duration(seconds=86400 * 1),
    features=[
        Feature(name="f1", dtype=ValueType.FLOAT),
        Feature(name="f2", dtype=ValueType.STRING),
    ],
    online=True,
    batch_source=source,
    tags={},
)
config = RepoConfig(
    registry=str(tmp_path / "registry.db"),
    project="hello",
    provider="local",
    offline_store=FileOfflineStoreConfig(),
)
store = FeatureStore(config=config)
store.apply([entity, view])

expected = pd.DataFrame(
    {
        "event_timestamp": timestamp,
        "entity": [0, 1, 2, 3, 5],
        "someval": [0.0, 0.1, 0.2, 0.3, 0.5],
        "f1": [1.0, np.nan, 1.2, 1.3, np.nan],
        "f2": ["a", np.nan, "c", "d", np.nan],
    }
)
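
The retrieval call itself is not shown in the snippet; presumably it looks roughly like the following, where entity_df is the frame the expected output is compared against (the feature references and the someval column are read off the expected frame above; the exact names are assumptions):

entity_df = pd.DataFrame(
    {
        "event_timestamp": timestamp,
        "entity": [0, 1, 2, 3, 5],
        "someval": [0.0, 0.1, 0.2, 0.3, 0.5],
    }
)
# This call raises the NotImplementedError shown above:
result = store.get_historical_features(
    entity_df, features=["view:f1", "view:f2"]
).to_df()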
Specifications
- Version: 0.21.3
- Platform: Linux
- Subsystem: Python 3.9
Possible Solution
This works fine in version 0.18.1 (at least), but I think it fails for any version > 0.20.
It might have something to do with the newly added Dask requirement; maybe the pinned version is insufficient? I used to use Dask 2022.2, but the requirement is now 2022.1.1. But this is just a guess, really.
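
If the dask version is the suspect, a quick sanity check of what is actually installed:

import dask
print(dask.__version__)  # e.g. 2022.1.1 vs. the previously used 2022.2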
Top GitHub Comments
Thanks for the details @elshize - this definitely smells like a bug we need to fix!
Yes, the problem was reusing the column. I shared that in an earlier comment, sorry if it wasn't entirely clear.
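
If reusing the column is indeed the trigger, a plausible mechanism is that naming both event_timestamp_column and created_timestamp_column "timestamp" leaves the internal frame with a duplicated label, so df_to_join[timestamp_field] returns a DataFrame rather than a Series and apply falls back to the unimplemented axis=0. Under that assumption, a workaround sketch is to materialize a distinct created-timestamp column:

# Hypothetical workaround: keep the two timestamp roles in separate columns.
df["created"] = df["timestamp"]
df.to_parquet(source_path)

source = FileSource(
    path=str(source_path),
    event_timestamp_column="timestamp",
    created_timestamp_column="created",
)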