Validating full epacems with goodtables_pandas runs out of memory
At the end of the ETL process, after the (compressed, partitioned) tabular data package for epacems has been output, we attempt to validate it using @ezwelty's goodtables_pandas library. However, if you process a significant subset of the available states and years on a single machine, you'll probably run out of memory, since the hourly_emissions_epacems table has almost a billion rows in it. We need to either validate only a sample, skip the validation entirely, or come up with some way to serialize the work when it runs on a single machine.
It seems like something that could be done with dask if we wanted, but it would also be easy to just skip it. @rousik, how does this end up working in the prefect & dask setup? Are the subsets of the data package validated separately on their own nodes?
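One possible shape for the dask approach is to run lightweight checks per partition instead of materializing the whole table. This is only a sketch, assuming the epacems resources are gzipped, partitioned CSVs; the file glob, column names, and constraints below are illustrative, not the actual PUDL layout or schema:

```python
import pandas as pd
import dask.dataframe as dd

def check_partition(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize simple constraint violations for a single partition."""
    return pd.DataFrame({
        "rows": [len(df)],
        "null_plant_id": [int(df["plant_id_eia"].isna().sum())],
        "negative_gross_load": [int((df["gross_load_mw"] < 0).sum())],
    })

# Explicit meta so dask knows the output schema without running the function.
meta = pd.DataFrame({
    "rows": pd.Series(dtype="int64"),
    "null_plant_id": pd.Series(dtype="int64"),
    "negative_gross_load": pd.Series(dtype="int64"),
})

# Hypothetical path; blocksize=None gives one partition per (compressed) file.
ddf = dd.read_csv(
    "datapkg/data/hourly_emissions_epacems-*.csv.gz",
    compression="gzip",
    blocksize=None,
)

# Only one partition per worker is in memory at a time, so the ~1 billion
# row table is never fully materialized.
summary = ddf.map_partitions(check_partition, meta=meta).compute()
print(summary)
```

This doesn't reproduce the full goodtables_pandas schema checks, but it shows how the per-partition work could be farmed out to dask workers rather than loaded into a single pandas DataFrame.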
Top GitHub Comments
Now that we are emitting epacems files directly to parquet, this is no longer an issue, as epacems tables are not included in datapackages anymore.
I suppose that as a stop-gap solution we could consider doing validation on a sampled subset of epacems data.
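A rough sketch of that stop-gap, assuming the same hypothetical partitioned CSV layout as above: sample a small fraction of each partition, write the sample out as a single small resource, and run the existing goodtables_pandas validation against that instead of the full table. The glob, output path, and 1% fraction are all placeholders:

```python
import glob
import pandas as pd

SAMPLE_FRAC = 0.01  # validate roughly 1% of the rows

sampled_parts = []
for path in sorted(glob.glob("datapkg/data/hourly_emissions_epacems-*.csv.gz")):
    # Each partition is read and sampled one at a time, so peak memory stays
    # bounded by the largest single partition rather than the whole table.
    part = pd.read_csv(path, compression="gzip", low_memory=False)
    sampled_parts.append(part.sample(frac=SAMPLE_FRAC, random_state=42))

sample = pd.concat(sampled_parts, ignore_index=True)
sample.to_csv(
    "datapkg/data/hourly_emissions_epacems_sample.csv.gz",
    index=False,
    compression="gzip",
)
print(f"Sampled {len(sample):,} rows for validation.")
```

Row-level sampling like this would still catch type and constraint problems that affect a meaningful share of the data, though it could miss rare bad rows, which is the usual trade-off with sampled validation.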