
Unclear error is printed when wrong event_timestamp column type is used

See original GitHub issue

When running feast materialize-incremental 2022-01-01T00:00:00 on a parquet source that contains a string-based event_timestamp column, the following exception is printed:

Materializing 1 feature views to 2022-01-01 00:00:00-08:00 into the sqlite online store.

fake_data_fv from 2021-05-21 02:11:51-07:00 to 2022-01-01 00:00:00-08:00:
Traceback (most recent call last):
  File "/home/willem/.pyenv/versions/3.7.7/bin/feast", line 8, in <module>
    sys.exit(cli())
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/cli.py", line 270, in materialize_incremental_command
    end_date=datetime.fromisoformat(end_ts),
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/telemetry.py", line 151, in exception_logging_wrapper
    result = func(*args, **kwargs)
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/feature_store.py", line 379, in materialize_incremental
    tqdm_builder,
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/infra/local.py", line 193, in materialize_single_feature_view
    end_date=end_date,
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/infra/offline_stores/file.py", line 208, in pull_latest_from_table_or_query
    lambda x: x if x.tzinfo is not None else x.replace(tzinfo=pytz.utc)
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pandas/core/series.py", line 3848, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2329, in pandas._libs.lib.map_infer
  File "/home/willem/.pyenv/versions/3.7.7/lib/python3.7/site-packages/feast/infra/offline_stores/file.py", line 208, in <lambda>
    lambda x: x if x.tzinfo is not None else x.replace(tzinfo=pytz.utc)
AttributeError: 'str' object has no attribute 'tzinfo'

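The failure is easy to reproduce outside Feast: the lambda in feast/infra/offline_stores/file.py assumes every value in the event_timestamp column is a datetime, so accessing .tzinfo on a string raises the same AttributeError. A minimal sketch, assuming pandas and pytz are installed:

```python
import pandas as pd
import pytz

# A string-typed event_timestamp column, as read from the parquet source
ts = pd.Series(["2021-05-21T02:11:51", "2021-05-22T09:30:00"])

try:
    # Same expression as in feast's file offline store
    ts.apply(lambda x: x if x.tzinfo is not None else x.replace(tzinfo=pytz.utc))
except AttributeError as e:
    print(e)  # 'str' object has no attribute 'tzinfo'
```

The condition `x.tzinfo is not None` fails before the `replace` branch is ever reached, which is why the error mentions tzinfo rather than anything timestamp-related.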
Instead, we should validate types during materialize and print a clearer error message.
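One possible shape for such a validation — a hypothetical helper, not Feast's actual API — checks the dtype of the timestamp column up front and raises a descriptive error instead of letting the lambda fail:

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def validate_event_timestamp(df: pd.DataFrame, column: str = "event_timestamp") -> None:
    """Raise a clear error if the timestamp column is missing or not datetime-typed."""
    if column not in df.columns:
        raise ValueError(f"Column '{column}' not found in source data.")
    if not is_datetime64_any_dtype(df[column]):
        raise TypeError(
            f"Column '{column}' has dtype '{df[column].dtype}', but a datetime64 "
            f"type is required. If the source stores timestamps as strings, "
            f"convert them first, e.g. pd.to_datetime(df['{column}'], utc=True)."
        )
```

Called at the start of materialization, this would turn the opaque AttributeError into a message that names the offending column and suggests the fix.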

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 8
  • Comments: 6

Top GitHub Comments

5 reactions
fcas commented, Apr 14, 2022

@sgvarsh the workaround that I found:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import to_timestamp

conf = SparkConf().setMaster(SPARK_MASTER)
# Feast does not work with INT96 (the default type when using pyspark
# to write parquet files containing timestamp fields;
# another option is to use string-based timestamps, but...)
# https://issues.apache.org/jira/browse/PARQUET-323
# https://stackoverflow.com/questions/56582539/how-to-save-spark-dataframe-to-parquet-without-using-int96-format-for-timestamp
# Feast works with TIMESTAMP_MICROS (I did not try TIMESTAMP_MILLIS)
conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
spark_context = SparkContext(conf=conf)
sql_context = SQLContext(spark_context)
df = sql_context.read.csv(path)
df = df.withColumn("event_timestamp", to_timestamp(df.event_timestamp, "yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ"))
# Feast cannot read a directory containing multiple .parquet files,
# so coalesce to a single output file
df.coalesce(1).write.mode("overwrite").parquet('output.parquet')

Inspecting the file output.parquet:

############ Column(event_timestamp) ############
name: event_timestamp
path: event_timestamp
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS

Reading the feature view:

training_df = fs.get_historical_features(
        entity_df=entity_df,
        features=[
            "feature_view:***",
            "feature_view:***",
            "feature_view:***",
        ],
).to_df()

print("----- Feature schema -----\n")
print(training_df.info())

print()
print("----- Example features -----\n")
print(training_df.head(8))
----- Feature schema -----

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
     Column                 Non-Null Count        Dtype              
---  ------                 --------------        -----              
 0   feast_id                     5 non-null      object             
 1   event_timestamp              0 non-null      datetime64[ns, UTC]
 2   ***                          5 non-null      object             
 3   ***                          5 non-null      object             
 4   ***                          5 non-null      object             
dtypes: datetime64[ns, UTC](1), object(4)
memory usage: 240.0+ bytes

----- Example features -----

   feast_id                              ...      ***
0  12f8cbcf-286a-44f6-a84d-e6d9a8fe902a  ...      ***
1  c47e2260-87eb-4748-b63f-cfda3c7fd258  ...      ***
2  7e835362-4ed8-41ed-b81d-7591b38c151d  ...      ***
3  24fa1717-5e92-4a57-bd19-0b3e851ea357  ...      ***
4  8ce9e852-3a4d-4e96-95dc-fa809481c08a  ...      ***

[5 rows x 5 columns]
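For sources that are not produced by Spark, the same conversion can be done with pandas alone before writing the parquet file — a minimal sketch, assuming ISO-8601 string timestamps (the sample values here are made up):

```python
import pandas as pd

# Source data with a string-based event_timestamp (the failing case)
df = pd.DataFrame({
    "feast_id": ["12f8cbcf-286a-44f6-a84d-e6d9a8fe902a"],
    "event_timestamp": ["2021-05-21T02:11:51"],
})

# Convert the string column to a timezone-aware datetime64 column,
# which Feast's file offline store can handle
df["event_timestamp"] = pd.to_datetime(df["event_timestamp"], utc=True)
# df.to_parquet("output.parquet")  # then write as usual
```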
1 reaction
fcas commented, Apr 12, 2022

@woop do you know a workaround for this issue? It’s a stale issue, but the same problem exists even in version 0.19.4 =/

Read more comments on GitHub >

Top Results From Across the Web

insert datetime from csv to postgres error - Stack Overflow
It includes a date, time, and a timezone offset. Apparently, your table's event_time column is timestamp format with date and time only…
Read more >
Detect and Fix Data Quality Problems - Fluxicon
The very first check is to make sure that there are no error messages when you import your data set. Error messages can...
Read more >
Database Engine events and errors - SQL Server
Consult this MSSQL error code list to find explanations for error messages for SQL Server database engine events.
Read more >
How to Get SQL Server Dates and Times Horribly Wrong
One of the problems is that most SQL Server date/time data types are fairly ambiguous. For example, suppose we have a table in...
Read more >
How to Effectively Use Dates and Timestamps in Spark 3.0
Spark SQL defines the timestamp type as TIMESTAMP WITH SESSION TIME ZONE , which is a combination of the fields ( YEAR ,...
Read more >
