Address parquet / pyarrow 1.0.0 vs. pandas tzinfo incompatibilities in timezone-aware columns


Something about the way we specify the Arrow / Parquet schema in epacems_to_parquet appears to be incompatible with Apache Arrow 1.0.0, though it works fine with Arrow 0.17.1. If you run an epacems_to_parquet conversion with Arrow 1.0.0 installed, it fails when converting the operating_datetime_utc column from a pandas column to an Arrow column. If the timezone is instead left out of the timestamp column’s schema definition (so that it assumes UTC, which is the default and is correct in this case), the conversion fails on one of the int32 columns with an error saying that a floating point value has been truncated.

It is not entirely clear what changed between the pre-1.0 and post-1.0 versions of Arrow to break this, but the Arrow version seems to be the controlling factor in whether the conversion works or fails.

See the Arrow 1.0.0 release announcement, which links to a complete change log.

To minimally recreate the bad behavior, given some pre-existing EPA CEMS datapackage outputs, you can use the following code:

import pathlib

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

import pudl

# The path to wherever your EPA CEMS outputs are:
epacems_dir = "./"
# The path to wherever you want it to create a Parquet dataset:
output_dir = "./"
year = 2018
state = "ID"

df = (
    pd.read_csv(
        pathlib.Path(epacems_dir) / f"hourly_emissions_epacems_{year}_{state.lower()}.csv.gz",
        parse_dates=["operating_datetime_utc"],
        dtype=pudl.convert.epacems_to_parquet.create_in_dtypes()
    )
    .assign(year=year)
)
pq.write_to_dataset(
    pa.Table.from_pandas(
        df,
        preserve_index=False,
        schema=pudl.convert.epacems_to_parquet.create_cems_schema()),
    root_path=output_dir,
    partition_cols=["year", "state"],
    compression="snappy"
)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 19 (12 by maintainers)

Top GitHub Comments

jorisvandenbossche commented, Nov 10, 2020 (1 reaction)

Thanks for opening the issue!

I made PRs to address both bugs: https://github.com/apache/arrow/pull/8624, https://github.com/apache/arrow/pull/8625

wesm commented, Nov 7, 2020 (1 reaction)

It’s a good practice to enforce explicit schemas if you know what they’re supposed to be
