Validating full epacems with goodtables_pandas runs out of memory

At the end of the ETL process, after the (compressed, partitioned) tabular data package for epacems has been output, we attempt to validate it using @ezwelty's goodtables_pandas library. However, if you process a significant subset of the available states and years on a single machine, you'll probably run out of memory, since the hourly_emissions_epacems table has almost a billion rows in it. We need to either validate only a sample, skip the validation, or come up with some way to serialize it when run on a single machine.
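
One way to serialize the validation on a single machine is to stream each partition in fixed-size chunks and check every chunk on its own, so peak memory is bounded by the chunk size rather than the table size. The sketch below is only an illustration of that idea using the Python standard library; it does not use goodtables_pandas, and the column names, the checks, and the `validate_chunk` and `validate_in_chunks` helpers are all hypothetical.

```python
import csv
import io
from itertools import islice

# Hypothetical required columns; the real schema lives in the data package.
REQUIRED = ("plant_id_eia", "operating_datetime_utc", "gross_load_mw")


def validate_chunk(rows, start_row):
    """Return a list of error strings for one chunk of dict rows."""
    errors = []
    for offset, row in enumerate(rows):
        row_number = start_row + offset
        for column in REQUIRED:
            if not row.get(column):
                errors.append(f"row {row_number}: missing {column}")
        load = row.get("gross_load_mw")
        if load:
            try:
                float(load)
            except ValueError:
                errors.append(f"row {row_number}: gross_load_mw is not numeric")
    return errors


def validate_in_chunks(handle, chunk_size=100_000):
    """Validate a CSV file object chunk by chunk, holding one chunk at a time."""
    reader = csv.DictReader(handle)
    errors = []
    start_row = 1
    while True:
        # islice pulls at most chunk_size rows from the reader.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        errors.extend(validate_chunk(chunk, start_row))
        start_row += len(chunk)
    return errors


# Tiny in-memory stand-in for one partition file (made-up values).
sample = io.StringIO(
    "plant_id_eia,operating_datetime_utc,gross_load_mw\n"
    "3,2019-01-01T00:00:00,120.5\n"
    "3,2019-01-01T01:00:00,\n"
    "3,2019-01-01T02:00:00,abc\n"
)
chunked_errors = validate_in_chunks(sample, chunk_size=2)
```

Because each chunk is validated independently, the same loop could be run once per partition file. Checks that span the whole table, such as uniqueness constraints, would need extra state carried between chunks.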

It seems like something that could be done with dask if we wanted. But also it would be easy to just skip it. @rousik how does this end up working in the prefect & dask setup? Are the subsets of the data package validated separately on their own nodes?

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
rousik commented, Feb 24, 2021

Now that we are emitting epacems files directly to Parquet, this is no longer an issue, since epacems tables are not included in datapackages anymore.

0 reactions
rousik commented, Jan 17, 2021

I suppose that as a stop-gap solution we could consider doing validation on a sampled subset of epacems data.
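
A sampled subset can be drawn without loading the full table by using reservoir sampling over a stream of rows. The following is a minimal sketch under stated assumptions: the `reservoir_sample` and `is_valid` helpers, the row layout, and the non-negative load check are hypothetical, and the generated rows stand in for a real epacems partition.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Return up to k rows drawn uniformly from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, row in enumerate(rows):
        if index < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = row
    return sample


def is_valid(row):
    """Hypothetical check: gross load must be non-negative."""
    return row["gross_load_mw"] >= 0


# Stand-in for a streamed partition; one row is deliberately invalid.
rows = (
    {"hour": h, "gross_load_mw": -1.0 if h == 500 else float(h)}
    for h in range(10_000)
)
sampled = reservoir_sample(rows, k=100)
invalid = [row for row in sampled if not is_valid(row)]
```

Only `k` rows are held in memory at any time. The trade-off is that a rare bad row, like the single one planted above, can easily fall outside the sample, which is why this is a stop-gap rather than a full validation.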
