get_historical_features fails with dask error for file offline store
Expected Behavior
Calling feature_store.get_historical_features(df, features=fs_columns).to_df(), where feature_store is a feature store with a file offline store, fs_columns is a list of feature column names, and df is a Pandas data frame, should work.
Current Behavior
It currently raises an error inside dask:
E NotImplementedError: dd.DataFrame.apply only supports axis=1
E Try: df.apply(func, axis=1)
Stacktrace:
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/infra/offline_stores/offline_store.py:81: in to_df
features_df = self._to_df_internal()
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/usage.py:280: in wrapper
raise exc.with_traceback(traceback)
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/usage.py:269: in wrapper
return func(*args, **kwargs)
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/infra/offline_stores/file.py:75: in _to_df_internal
df = self.evaluation_function().compute()
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/infra/offline_stores/file.py:231: in evaluate_historical_retrieval
df_to_join = _normalize_timestamp(
../../.cache/pypoetry/virtualenvs/w3-search-letor-SCEBvDm1-py3.9/lib/python3.9/site-packages/feast/infra/offline_stores/file.py:530: in _normalize_timestamp
df_to_join[timestamp_field] = df_to_join[timestamp_field].apply(
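
For context, dask implements DataFrame.apply only row-wise. The following standalone sketch (plain dask, not Feast code) reproduces the same NotImplementedError by calling apply with the pandas default axis=0:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
ddf = dd.from_pandas(pdf, npartitions=1)

# Column-wise apply (axis=0, the pandas default) is not implemented in dask:
# NotImplementedError: dd.DataFrame.apply only supports axis=1
ddf.apply(lambda col: col, axis=0)

In the stack frame above, df_to_join[timestamp_field].apply(...) would normally act on a Series, where the default axis is fine, so the error suggests that selection unexpectedly produced a DataFrame.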
Steps to reproduce
Here is my feature store definition:
import datetime

import numpy as np
import pandas as pd
from feast import FeatureStore, RepoConfig, FileSource, FeatureView, ValueType, Entity, Feature
from feast.infra.offline_stores.file import FileOfflineStoreConfig
from google.protobuf.duration_pb2 import Duration

# tmp_path is a pathlib.Path (here taken from pytest's tmp_path fixture)
source_path = tmp_path / "source.parquet"

timestamp = datetime.datetime(year=2022, month=4, day=29, tzinfo=datetime.timezone.utc)
df = pd.DataFrame(
    {
        "entity": [0, 1, 2, 3, 4],
        "f1": [1.0, 1.1, 1.2, 1.3, 1.4],
        "f2": ["a", "b", "c", "d", "e"],
        "timestamp": [
            timestamp,
            # this one should not be fetched as it is too far in the past
            timestamp - datetime.timedelta(days=2),
            timestamp,
            timestamp,
            timestamp,
        ],
    }
)
df.to_parquet(source_path)

source = FileSource(
    path=str(source_path),
    event_timestamp_column="timestamp",
    created_timestamp_column="timestamp",
)
entity = Entity(
    name="entity",
    value_type=ValueType.INT64,
    description="Entity",
)
view = FeatureView(
    name="view",
    entities=["entity"],
    ttl=Duration(seconds=86400 * 1),
    features=[
        Feature(name="f1", dtype=ValueType.FLOAT),
        Feature(name="f2", dtype=ValueType.STRING),
    ],
    online=True,
    batch_source=source,
    tags={},
)
config = RepoConfig(
    registry=str(tmp_path / "registry.db"),
    project="hello",
    provider="local",
    offline_store=FileOfflineStoreConfig(),
)
store = FeatureStore(config=config)
store.apply([entity, view])

expected = pd.DataFrame(
    {
        "event_timestamp": timestamp,
        "entity": [0, 1, 2, 3, 5],
        "someval": [0.0, 0.1, 0.2, 0.3, 0.5],
        "f1": [1.0, np.nan, 1.2, 1.3, np.nan],
        "f2": ["a", np.nan, "c", "d", np.nan],
    }
)
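
The retrieval call itself is not shown in the snippet; presumably it looks roughly like the following, where entity_df is the frame the expected output is compared against (the feature references and the someval column are read off the expected frame above; the exact names are assumptions):

entity_df = pd.DataFrame(
    {
        "event_timestamp": timestamp,
        "entity": [0, 1, 2, 3, 5],
        "someval": [0.0, 0.1, 0.2, 0.3, 0.5],
    }
)
# This call raises the NotImplementedError shown above:
result = store.get_historical_features(
    entity_df, features=["view:f1", "view:f2"]
).to_df()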
Specifications
- Version: 0.21.3
- Platform: Linux
- Subsystem: Python 3.9
Possible Solution
This works fine in version 0.18.1 (at least), but I think it fails for any version > 0.20.
It might have something to do with the newly added Dask requirement; maybe the pinned version is insufficient? I used to use Dask 2022.2, but the requirement is now 2022.1.1. But this is just a guess, really.
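
If the dask version is the suspect, a quick sanity check of what is actually installed:

import dask
print(dask.__version__)  # e.g. 2022.1.1 vs. the previously used 2022.2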
Top GitHub Comments
Thanks for the details @elshize - this definitely smells like a bug we need to fix!
Yes, the problem was reusing the column. I shared that in an earlier comment, sorry if it wasn't entirely clear.
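
If reusing the column is indeed the trigger, a plausible mechanism is that naming both event_timestamp_column and created_timestamp_column "timestamp" leaves the internal frame with a duplicated label, so df_to_join[timestamp_field] returns a DataFrame rather than a Series and apply falls back to the unimplemented axis=0. Under that assumption, a workaround sketch is to materialize a distinct created-timestamp column:

# Hypothetical workaround: keep the two timestamp roles in separate columns.
df["created"] = df["timestamp"]
df.to_parquet(source_path)

source = FileSource(
    path=str(source_path),
    event_timestamp_column="timestamp",
    created_timestamp_column="created",
)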