Address parquet / pyarrow 1.0.0 vs. pandas tzinfo incompatibilities in timezone aware columns
Something about the way that we specify the Arrow / Parquet schema in epacems_to_parquet appears to be incompatible with Apache Arrow 1.0.0, though it works fine with Arrow 0.17.1. If you attempt to run an epacems_to_parquet conversion with Arrow 1.0.0 installed, it fails when converting the operating_datetime_utc column from a pandas column to an Arrow column. However, if the timezone is left out of the timestamp column’s schema definition (so that it is allowed to assume UTC, which is the default and correct in this case), it instead fails on one of the int32 columns, saying that a floating point value has been truncated.
It is not entirely clear what changed between the pre-1.0 and post-1.0 versions of Arrow to break this, but the Arrow version seems to be the controlling factor in whether the conversion works or fails.
See the Arrow 1.0.0 release announcement, which links to a complete change log.
To minimally recreate the bad behavior, given some pre-existing EPA CEMS datapackage outputs, you can use the following code:
import pathlib

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pudl

# The path to wherever your EPA CEMS outputs are:
epacems_dir = "./"
# The path to wherever you want it to create a Parquet dataset:
output_dir = "./"
year = 2018
state = "ID"

# Read one state-year of hourly emissions, parsing the UTC timestamp column.
df = (
    pd.read_csv(
        pathlib.Path(epacems_dir) / f"hourly_emissions_epacems_{year}_{state.lower()}.csv.gz",
        parse_dates=["operating_datetime_utc"],
        dtype=pudl.convert.epacems_to_parquet.create_in_dtypes(),
    )
    .assign(year=year)
)

# Convert to an Arrow table using the explicit CEMS schema and write it out,
# partitioned by year and state. The conversion is where the failure occurs.
pq.write_to_dataset(
    pa.Table.from_pandas(
        df,
        preserve_index=False,
        schema=pudl.convert.epacems_to_parquet.create_cems_schema(),
    ),
    root_path=output_dir,
    partition_cols=["year", "state"],
    compression="snappy",
)
Top GitHub Comments
Thanks for opening the issue!
I made PRs to address both bugs: https://github.com/apache/arrow/pull/8624, https://github.com/apache/arrow/pull/8625
It’s good practice to enforce explicit schemas if you know what they’re supposed to be.