UnknownTimeZoneError: 'tzutc()' when writing a DataFrame built with from_dict() to parquet and then reading it back with read_parquet()
Code Sample
import datetime

import fastparquet  # noqa: F401 - engine='auto' uses fastparquet here, since pyarrow is not installed
import pandas as pd
from dateutil.tz import tzutc

path = 'bucket_list.parquet'  # any local path works; not defined in the original snippet


def setup():
    # note: mock or real API call results in the same error
    data = mock_list_s3_bucket()
    df = pd.DataFrame.from_dict(data)
    df.to_parquet(path, engine='auto', compression='snappy')


def main():
    # error happens here - pytz.exceptions.UnknownTimeZoneError: 'tzutc()'
    df = pd.read_parquet(path, engine='auto')
    print(df)


def mock_list_s3_bucket():
    # mock a call to the S3 API
    return [
        {'Key': 'fun/file1.txt', 'LastModified': datetime.datetime(2018, 6, 7, 2, 59, 59, tzinfo=tzutc()), 'ETag': '"8c768c05b4faea563a8520acb983fb79"', 'Size': 4445, 'StorageClass': 'GLACIER'},
        {'Key': 'fun/file2.txt', 'LastModified': datetime.datetime(2018, 6, 7, 2, 59, 59, tzinfo=tzutc()), 'ETag': '"8c768c05b4faea563a8520acb983fb79"', 'Size': 4445, 'StorageClass': 'GLACIER'},
        {'Key': 'fun/file3.txt', 'LastModified': datetime.datetime(2018, 6, 7, 2, 59, 59, tzinfo=tzutc()), 'ETag': '"8c768c05b4faea563a8520acb983fb79"', 'Size': 4445, 'StorageClass': 'GLACIER'},
    ]


if __name__ == "__main__":
    setup()
    main()
Problem description
File "pandas\_libs\tslibs\timezones.pyx", line 84, in pandas._libs.tslibs.timezones.maybe_get_tz
File "pandas\_libs\tslibs\timezones.pyx", line 99, in pandas._libs.tslibs.timezones.maybe_get_tz
File "C:\Users\userx\AppData\Local\Programs\Python\Python37\lib\site-packages\pytz\__init__.py", line 178, in timezone
raise UnknownTimeZoneError(zone)
pytz.exceptions.UnknownTimeZoneError: 'tzutc()'
I have an app that periodically reads bucket listing data from the S3 API. Everything worked fine until we upgraded to pandas 0.24, at which point the parquet files it generates could no longer be read back.
Note: I created a clean VM, installed pandas 0.24 and all dependencies, and was able to reproduce the issue.
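The traceback bottoms out in pytz.timezone(), which only resolves names from its own zone database (such as 'UTC'), not the repr string of a dateutil tzinfo object. A minimal sketch of that failure in isolation, independent of parquet:

import pytz

# 'UTC' is a valid zone name and resolves fine.
pytz.timezone('UTC')

# pandas 0.24 serializes dateutil's tzutc() as the string 'tzutc()' in the
# parquet column metadata; pytz cannot map that string back to a timezone.
pytz.timezone('tzutc()')  # raises pytz.exceptions.UnknownTimeZoneError: 'tzutc()'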
Here is more info on the column metadata generated by fastparquet.
pandas 0.23 metadata
{"columns": [{"metadata": null, "name": "ETag", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "name": "Key", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": {"timezone": "UTC"}, "name": "LastModified", "numpy_type": "datetime64[ns, UTC]", "pandas_type": "datetimetz"}, {"metadata": null, "name": "Size", "numpy_type": "int64", "pandas_type": "int64"}, {"metadata": null, "name": "StorageClass", "numpy_type": "object", "pandas_type": "unicode"}], "index_columns": [], "pandas_version": "0.23.4"}
pandas 0.24 metadata
{"columns": [{"metadata": null, "name": "ETag", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "name": "Key", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": {"timezone": "tzutc()"}, "name": "LastModified", "numpy_type": "datetime64[ns, tzutc()]", "pandas_type": "datetimetz"}, {"metadata": null, "name": "Size", "numpy_type": "int64", "pandas_type": "int64"}, {"metadata": null, "name": "StorageClass", "numpy_type": "object", "pandas_type": "unicode"}], "index_columns": [], "pandas_version": "0.24.1"}
Output of pd.show_versions()
pandas: 0.24.1
pytest: None
pip: 19.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.16.1
scipy: None
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.5
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None
@joeax I opened https://issues.apache.org/jira/browse/ARROW-5248 for supporting this on the pyarrow side, and https://github.com/dask/fastparquet/issues/424 on the fastparquet side.
We can follow up in both projects, so I'm closing this issue here.
Just came here to reiterate the workaround by @jorisvandenbossche a bit more completely.
Docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.tz_convert.html
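A sketch of that workaround (assuming the df and path from the code sample above): convert the dateutil timezone to the standard 'UTC' zone before writing, so the metadata records "UTC" again. tz_convert only swaps the tzinfo object; the underlying instants are unchanged.

# Normalize the dateutil tzutc() timezone to 'UTC' before writing; the
# parquet metadata then records "UTC", which pytz can resolve on read.
df['LastModified'] = df['LastModified'].dt.tz_convert('UTC')
df.to_parquet(path, engine='auto', compression='snappy')

# Reading back now works without UnknownTimeZoneError.
df2 = pd.read_parquet(path, engine='auto')
print(df2['LastModified'].dtype)  # datetime64[ns, UTC]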