
UnknownTimeZoneError: 'tzutc()' when writing a dataframe using from_dict() then reading it using read_parquet()

See original GitHub issue

Code Sample

import datetime
import fastparquet  # noqa: F401 -- imported so the fastparquet engine is available to engine='auto'
import pandas as pd

from dateutil.tz import tzutc

path = 'bucket_list.parquet'  # not defined in the original report; any writable local path works

def setup():
    # note: mock or real API call results in the same error
    data = mock_list_s3_bucket()
    df = pd.DataFrame.from_dict(data)
    df.to_parquet(path, engine='auto', compression='snappy')

def main():
    # error happens here - pytz.exceptions.UnknownTimeZoneError: 'tzutc()'
    df = pd.read_parquet(path, engine='auto')
    print(df)

def mock_list_s3_bucket():
    # mock a call to S3 API
    return [
        {'Key': 'fun/file1.txt', 'LastModified': datetime.datetime(2018, 6, 7, 2, 59, 59, tzinfo=tzutc()), 'ETag': '"8c768c05b4faea563a8520acb983fb79"', 'Size': 4445, 'StorageClass': 'GLACIER'},
        {'Key': 'fun/file2.txt', 'LastModified': datetime.datetime(2018, 6, 7, 2, 59, 59, tzinfo=tzutc()), 'ETag': '"8c768c05b4faea563a8520acb983fb79"', 'Size': 4445, 'StorageClass': 'GLACIER'},
        {'Key': 'fun/file3.txt', 'LastModified': datetime.datetime(2018, 6, 7, 2, 59, 59, tzinfo=tzutc()), 'ETag': '"8c768c05b4faea563a8520acb983fb79"', 'Size': 4445, 'StorageClass': 'GLACIER'}
    ]

if __name__ == "__main__":
    setup()
    main()

Problem description

  File "pandas\_libs\tslibs\timezones.pyx", line 84, in pandas._libs.tslibs.timezones.maybe_get_tz
  File "pandas\_libs\tslibs\timezones.pyx", line 99, in pandas._libs.tslibs.timezones.maybe_get_tz
  File "C:\Users\userx\AppData\Local\Programs\Python\Python37\lib\site-packages\pytz\__init__.py", line 178, in timezone
    raise UnknownTimeZoneError(zone)
pytz.exceptions.UnknownTimeZoneError: 'tzutc()'
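
The failure is easy to reproduce in isolation: pytz.timezone() only resolves zone names it knows (IANA names plus a few aliases), and 'tzutc()' -- the string stored in the parquet metadata -- is not one of them. A minimal sketch:

import pytz

pytz.timezone('UTC')  # fine: 'UTC' is a zone name pytz recognizes

try:
    pytz.timezone('tzutc()')  # the string pandas 0.24 wrote into the parquet metadata
except pytz.exceptions.UnknownTimeZoneError as exc:
    print(exc)  # 'tzutc()'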

I have an app that periodically reads bucket listings from the S3 API. Everything worked fine until we upgraded to pandas 0.24, after which the parquet files it generates can no longer be read back.

Note: I created a clean VM, installed pandas 0.24 and all dependencies, and was able to reproduce the issue.

Here is the column metadata generated by fastparquet under each pandas version; note the timezone recorded for the LastModified column.

pandas 0.23 metadata

{"columns": [{"metadata": null, "name": "ETag", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "name": "Key", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": {"timezone": "UTC"}, "name": "LastModified", "numpy_type": "datetime64[ns, UTC]", "pandas_type": "datetimetz"}, {"metadata": null, "name": "Size", "numpy_type": "int64", "pandas_type": "int64"}, {"metadata": null, "name": "StorageClass", "numpy_type": "object", "pandas_type": "unicode"}], "index_columns": [], "pandas_version": "0.23.4"}

pandas 0.24 metadata

{"columns": [{"metadata": null, "name": "ETag", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "name": "Key", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": {"timezone": "tzutc()"}, "name": "LastModified", "numpy_type": "datetime64[ns, tzutc()]", "pandas_type": "datetimetz"}, {"metadata": null, "name": "Size", "numpy_type": "int64", "pandas_type": "int64"}, {"metadata": null, "name": "StorageClass", "numpy_type": "object", "pandas_type": "unicode"}], "index_columns": [], "pandas_version": "0.24.1"}

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.24.1
pytest: None
pip: 19.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.16.1
scipy: None
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.5
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
jorisvandenbossche commented, May 2, 2019

@joeax I opened https://issues.apache.org/jira/browse/ARROW-5248 for supporting this on the pyarrow side, and https://github.com/dask/fastparquet/issues/424 on the fastparquet side.

So we can follow up on both projects; I'm therefore closing this issue here.

0 reactions
shadiakiki1986 commented, Sep 18, 2019

Just came here to reiterate the workaround by @jorisvandenbossche a bit more completely:

import pytz
import pyarrow as pa

# df is assumed to have a tz-aware 'Timestamp' column (e.g. carrying dateutil's tzutc())
df['Timestamp'] = df['Timestamp'].dt.tz_convert(pytz.utc)

context = pa.default_serialization_context()
pybytes = context.serialize(df).to_buffer().to_pybytes()  # <- fails without the tz_convert call above

Docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.tz_convert.html
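
Applied to the parquet roundtrip from the original report, the same conversion before to_parquet should sidestep the read error. A sketch reusing mock_list_s3_bucket() and path from the code sample above (untested against those exact versions):

import pytz
import pandas as pd

# mock_list_s3_bucket() and path are the names defined in the code sample above
df = pd.DataFrame.from_dict(mock_list_s3_bucket())

# Convert dateutil's tzutc() to pytz's UTC so the metadata records "UTC", not "tzutc()"
df['LastModified'] = df['LastModified'].dt.tz_convert(pytz.utc)
df.to_parquet(path, engine='auto', compression='snappy')

print(pd.read_parquet(path, engine='auto'))  # should read back without UnknownTimeZoneError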
