Data irregularities cause epacems_to_parquet to fail
After (apparently) successfully running the new data-package-based ETL process on the full EPA CEMS dataset (all years, all states), I tried to run the epacems_to_parquet script, but it encountered errors and ultimately failed. Several of the errors were of this type:
sys:1: DtypeWarning: Columns (8,10,12,14) have mixed types. Specify dtype option on import or set low_memory=False.
But the error that eventually crashed it was:
Traceback (most recent call last):
  File "/home/zane/miniconda3/envs/pudl-dev/bin/epacems_to_parquet", line 11, in <module>
    load_entry_point('catalystcoop.pudl', 'console_scripts', 'epacems_to_parquet')()
  File "/home/zane/pudl/src/pudl/convert/epacems_to_parquet.py", line 319, in main
    clobber=args.clobber)
  File "/home/zane/pudl/src/pudl/convert/epacems_to_parquet.py", line 205, in epacems_to_parquet
    df = year_from_operating_datetime(df).astype(IN_DTYPES)
  File "/home/zane/pudl/src/pudl/convert/epacems_to_parquet.py", line 123, in year_from_operating_datetime
    df['year'] = df.operating_datetime_utc.dt.year
  File "/home/zane/miniconda3/envs/pudl-dev/lib/python3.7/site-packages/pandas/core/generic.py", line 5175, in __getattr__
    return object.__getattribute__(self, name)
  File "/home/zane/miniconda3/envs/pudl-dev/lib/python3.7/site-packages/pandas/core/accessor.py", line 175, in __get__
    accessor_obj = self._accessor(obj)
  File "/home/zane/miniconda3/envs/pudl-dev/lib/python3.7/site-packages/pandas/core/indexes/accessors.py", line 343, in __new__
    raise AttributeError("Can only use .dt accessor with datetimelike " "values")
AttributeError: Can only use .dt accessor with datetimelike values
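For readers hitting the same AttributeError: pandas raises it whenever `.dt` is used on a column that is not a datetime dtype, which can happen if the column was read in as plain strings. The following is only a minimal sketch of that failure and one possible workaround, not the fix adopted in PUDL. The inline CSV, the `gross_load_mw` column, and the malformed `not-a-date` value are hypothetical, and the actual cause in the CEMS data may be different.

```python
import io

import pandas as pd

# Hypothetical miniature of a CEMS-like CSV; the second data row carries a
# malformed timestamp (an assumption made purely for illustration).
CSV = """operating_datetime_utc,gross_load_mw
2018-01-01 00:00:00,10.5
not-a-date,11.0
2018-01-01 02:00:00,
"""

# Passing an explicit dtype for a column is also what the DtypeWarning
# suggests for columns that pandas would otherwise infer as mixed types.
raw = pd.read_csv(io.StringIO(CSV), dtype={"gross_load_mw": "float64"})

# The timestamp column was not parsed as datetimes, so .dt is unavailable.
try:
    raw.operating_datetime_utc.dt.year
    dt_failed = False
except AttributeError:
    dt_failed = True

# Coerce explicitly: values that cannot be parsed become NaT, so the column
# ends up with a datetime dtype regardless of bad input rows.
fixed = raw.copy()
fixed["operating_datetime_utc"] = pd.to_datetime(
    fixed["operating_datetime_utc"], errors="coerce", utc=True
)

# Track the rows that failed to parse so they can be inspected, then drop them.
bad_rows = fixed["operating_datetime_utc"].isna()
fixed = fixed.loc[~bad_rows].copy()
fixed["year"] = fixed["operating_datetime_utc"].dt.year
```

Whether dropping unparseable rows is acceptable depends on the data. Inspecting the rows flagged in `bad_rows` first would show whether the irregularity is a one-off or systematic.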
Issue Analytics
- State:
- Created 4 years ago
- Comments:16 (12 by maintainers)
Top GitHub Comments
I wrote some code that should address this, but I’ll test it later this week before creating a PR.
@cmgosnell Please try tableschema-pandas@1.1. I have added support for composite primary keys.