UnknownTimeZoneError: 'tzutc()' when writing a DataFrame built with from_dict() to parquet and then reading it back with read_parquet()
Code Sample
import datetime

import fastparquet  # noqa: F401 - engine='auto' uses fastparquet here, since pyarrow is not installed
import pandas as pd
from dateutil.tz import tzutc

path = 'bucket_list.parquet'  # any local path works; not defined in the original snippet


def setup():
    # note: mock or real API call results in the same error
    data = mock_list_s3_bucket()
    df = pd.DataFrame.from_dict(data)
    df.to_parquet(path, engine='auto', compression='snappy')


def main():
    # error happens here - pytz.exceptions.UnknownTimeZoneError: 'tzutc()'
    df = pd.read_parquet(path, engine='auto')
    print(df)


def mock_list_s3_bucket():
    # mock a call to the S3 API
    return [
        {'Key': 'fun/file1.txt', 'LastModified': datetime.datetime(2018, 6, 7, 2, 59, 59, tzinfo=tzutc()), 'ETag': '"8c768c05b4faea563a8520acb983fb79"', 'Size': 4445, 'StorageClass': 'GLACIER'},
        {'Key': 'fun/file2.txt', 'LastModified': datetime.datetime(2018, 6, 7, 2, 59, 59, tzinfo=tzutc()), 'ETag': '"8c768c05b4faea563a8520acb983fb79"', 'Size': 4445, 'StorageClass': 'GLACIER'},
        {'Key': 'fun/file3.txt', 'LastModified': datetime.datetime(2018, 6, 7, 2, 59, 59, tzinfo=tzutc()), 'ETag': '"8c768c05b4faea563a8520acb983fb79"', 'Size': 4445, 'StorageClass': 'GLACIER'},
    ]


if __name__ == "__main__":
    setup()
    main()
Problem description
File "pandas\_libs\tslibs\timezones.pyx", line 84, in pandas._libs.tslibs.timezones.maybe_get_tz
File "pandas\_libs\tslibs\timezones.pyx", line 99, in pandas._libs.tslibs.timezones.maybe_get_tz
File "C:\Users\userx\AppData\Local\Programs\Python\Python37\lib\site-packages\pytz\__init__.py", line 178, in timezone
raise UnknownTimeZoneError(zone)
pytz.exceptions.UnknownTimeZoneError: 'tzutc()'
I have an app that periodically reads bucket listing data from the S3 API. Everything worked fine until we upgraded to pandas 0.24, at which point the parquet files it generates could no longer be read back.
Note: I created a clean VM, installed pandas 0.24 and all dependencies, and was able to reproduce the issue.
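The traceback bottoms out in pytz.timezone(), which only resolves names from its own zone database (such as 'UTC'), not the repr string of a dateutil tzinfo object. A minimal sketch of that failure in isolation, independent of parquet:

import pytz

# 'UTC' is a valid zone name and resolves fine.
pytz.timezone('UTC')

# pandas 0.24 serializes dateutil's tzutc() as the string 'tzutc()' in the
# parquet column metadata; pytz cannot map that string back to a timezone.
pytz.timezone('tzutc()')  # raises pytz.exceptions.UnknownTimeZoneError: 'tzutc()'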
Here is more info on the column metadata generated by fastparquet.
pandas 0.23 metadata
{"columns": [{"metadata": null, "name": "ETag", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "name": "Key", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": {"timezone": "UTC"}, "name": "LastModified", "numpy_type": "datetime64[ns, UTC]", "pandas_type": "datetimetz"}, {"metadata": null, "name": "Size", "numpy_type": "int64", "pandas_type": "int64"}, {"metadata": null, "name": "StorageClass", "numpy_type": "object", "pandas_type": "unicode"}], "index_columns": [], "pandas_version": "0.23.4"}
pandas 0.24 metadata
{"columns": [{"metadata": null, "name": "ETag", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "name": "Key", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": {"timezone": "tzutc()"}, "name": "LastModified", "numpy_type": "datetime64[ns, tzutc()]", "pandas_type": "datetimetz"}, {"metadata": null, "name": "Size", "numpy_type": "int64", "pandas_type": "int64"}, {"metadata": null, "name": "StorageClass", "numpy_type": "object", "pandas_type": "unicode"}], "index_columns": [], "pandas_version": "0.24.1"}
Output of pd.show_versions()
pandas: 0.24.1
pytest: None
pip: 19.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.16.1
scipy: None
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.5
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None
@joeax I opened https://issues.apache.org/jira/browse/ARROW-5248 for supporting this on the pyarrow side, and https://github.com/dask/fastparquet/issues/424 on the fastparquet side.
We can follow up in both projects, so I'm closing this issue here.
Just came here to reiterate the workaround by @jorisvandenbossche a bit more completely.
Docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.tz_convert.html
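A sketch of that workaround (assuming the df and path from the code sample above): convert the dateutil timezone to the standard 'UTC' zone before writing, so the metadata records "UTC" again. tz_convert only swaps the tzinfo object; the underlying instants are unchanged.

# Normalize the dateutil tzutc() timezone to 'UTC' before writing; the
# parquet metadata then records "UTC", which pytz can resolve on read.
df['LastModified'] = df['LastModified'].dt.tz_convert('UTC')
df.to_parquet(path, engine='auto', compression='snappy')

# Reading back now works without UnknownTimeZoneError.
df2 = pd.read_parquet(path, engine='auto')
print(df2['LastModified'].dtype)  # datetime64[ns, UTC]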