Data irregularities cause epacems_to_parquet to fail

After (apparently) successfully running the new data-package-based ETL process on the full EPA CEMS dataset (all years, all states), I tried to run the epacems_to_parquet script. It emitted a number of warnings and ultimately failed. Several of the warnings were of this type:

sys:1: DtypeWarning: Columns (8,10,12,14) have mixed types. Specify dtype option on import or set low_memory=False.
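
For context, pandas emits this warning when chunk-by-chunk type inference in read_csv finds inconsistent types within a column. A minimal sketch of the remedy the warning itself suggests (passing dtype explicitly) follows. The column names and sample data here are hypothetical and are not taken from the EPA CEMS files or from the PUDL code.

```python
import io

import pandas as pd

# Hypothetical sample: the "flag" column mixes numeric-looking and text values.
csv_text = "unit_id,flag\n1,0\n2,Measured\n"

# Declaring dtypes up front means pandas does not have to guess a type per chunk.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"unit_id": "int64", "flag": str},
)
```

With the dtype declared, every value in the mixed column is read as a string, so downstream code sees one consistent type.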

But the thing that crashed it eventually was:

Traceback (most recent call last):
  File "/home/zane/miniconda3/envs/pudl-dev/bin/epacems_to_parquet", line 11, in <module>
    load_entry_point('catalystcoop.pudl', 'console_scripts', 'epacems_to_parquet')()
  File "/home/zane/pudl/src/pudl/convert/epacems_to_parquet.py", line 319, in main
    clobber=args.clobber)
  File "/home/zane/pudl/src/pudl/convert/epacems_to_parquet.py", line 205, in epacems_to_parquet
    df = year_from_operating_datetime(df).astype(IN_DTYPES)
  File "/home/zane/pudl/src/pudl/convert/epacems_to_parquet.py", line 123, in year_from_operating_datetime
    df['year'] = df.operating_datetime_utc.dt.year
  File "/home/zane/miniconda3/envs/pudl-dev/lib/python3.7/site-packages/pandas/core/generic.py", line 5175, in __getattr__
    return object.__getattribute__(self, name)
  File "/home/zane/miniconda3/envs/pudl-dev/lib/python3.7/site-packages/pandas/core/accessor.py", line 175, in __get__
    accessor_obj = self._accessor(obj)
  File "/home/zane/miniconda3/envs/pudl-dev/lib/python3.7/site-packages/pandas/core/indexes/accessors.py", line 343, in __new__
    raise AttributeError("Can only use .dt accessor with datetimelike " "values")
AttributeError: Can only use .dt accessor with datetimelike values
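
The final error means the operating_datetime_utc column was not a datetime column when .dt was used. One possible cause, assumed here rather than confirmed by the issue, is that irregular values left the column with object dtype. The sketch below shows one way to surface such rows with pd.to_datetime before using the accessor. The sample values are made up, and this is not the fix that was eventually merged.

```python
import pandas as pd

# Hypothetical frame: one malformed timestamp keeps the column as object dtype.
df = pd.DataFrame({
    "operating_datetime_utc": [
        "2018-01-01 05:00:00",
        "not a date",
        "2018-01-01 06:00:00",
    ]
})

# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(df["operating_datetime_utc"], errors="coerce", utc=True)

# Keep the offending rows for inspection rather than dropping them silently.
bad_rows = df[parsed.isna()]

# Continue with the rows that parsed; .dt now works on the datetime column.
clean = df[parsed.notna()].copy()
clean["operating_datetime_utc"] = parsed[parsed.notna()]
clean["year"] = clean["operating_datetime_utc"].dt.year
```

Inspecting bad_rows first shows whether the problem is a handful of malformed records or a systematic formatting difference in particular files.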

Issue Analytics

Created 4 years ago
Comments:16 (12 by maintainers)

Top GitHub Comments

karldw commented, Oct 20, 2019

I wrote some code that should address this, but I’ll test it later this week before creating a PR.

roll commented, Oct 25, 2019

@cmgosnell Please try tableschema-pandas@1.1. I have added support for composite primary keys.
