Address parquet / pyarrow 1.0.0 vs. pandas tzinfo incompatibilities in timezone-aware columns

Something about the way we specify the Arrow / Parquet schema in epacems_to_parquet appears to be incompatible with Apache Arrow 1.0.0, though it works fine with Arrow 0.17.1. If you run an epacems_to_parquet conversion with Arrow 1.0.0 installed, it fails when converting the operating_datetime_utc column from a pandas column to an Arrow column. If the timezone is instead left out of the timestamp column's schema definition (so that it is allowed to assume UTC, which is the default and correct in this case), the conversion gets past that column but then fails on one of the int32 columns, reporting that a floating point value has been truncated.

It is not entirely clear what changed between the pre-1.0 releases of Arrow and 1.0.0 that would have broken this, but the Arrow version seems to be the controlling factor in whether the conversion works or fails.

See the Arrow 1.0.0 release announcement which links to a complete change log.

To minimally recreate the bad behavior, given some pre-existing EPA CEMS datapackage outputs, you can use the following code:

import pathlib

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

import pudl

# The path to wherever your EPA CEMS outputs are:
epacems_dir = "./"
# The path to wherever you want it to create a Parquet dataset:
output_dir = "./"
year = 2018
state = "ID"

df = (
    pd.read_csv(
        pathlib.Path(epacems_dir) / f"hourly_emissions_epacems_{year}_{state.lower()}.csv.gz",
        parse_dates=["operating_datetime_utc"],
        dtype=pudl.convert.epacems_to_parquet.create_in_dtypes()
    )
    .assign(year=year)
)
pq.write_to_dataset(
    pa.Table.from_pandas(
        df,
        preserve_index=False,
        schema=pudl.convert.epacems_to_parquet.create_cems_schema()),
    root_path=output_dir,
    partition_cols=["year", "state"],
    compression="snappy"
)

Issue Analytics

Created 3 years ago
Comments: 19 (12 by maintainers)

Top GitHub Comments

jorisvandenbossche commented, Nov 10, 2020 (1 reaction)

Thanks for opening the issue!

I made PRs to address both bugs: https://github.com/apache/arrow/pull/8624, https://github.com/apache/arrow/pull/8625

wesm commented, Nov 7, 2020 (1 reaction)

It’s good practice to enforce explicit schemas if you know what they’re supposed to be.
