Validating full epacems with goodtables_pandas runs out of memory
At the end of the ETL process, after the (compressed, partitioned) tabular data package for epacems has been output, we attempt to validate it using @ezwelty's goodtables_pandas library. However, if you process a significant subset of the available states and years on a single machine, you'll probably run out of memory, since the hourly_emissions_epacems table has almost a billion rows in it. We need to either validate only a sample, skip the validation entirely, or come up with some way to serialize the work when it runs on a single machine.
It seems like something that could be done with dask if we wanted, but it would also be easy to just skip it. @rousik, how does this end up working in the prefect & dask setup? Are the subsets of the data package validated separately on their own nodes?
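One possible shape for the dask approach is to run lightweight checks per partition instead of materializing the whole table. This is only a sketch, assuming the epacems resources are gzipped, partitioned CSVs; the file glob, column names, and constraints below are illustrative, not the actual PUDL layout or schema:

```python
import pandas as pd
import dask.dataframe as dd

def check_partition(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize simple constraint violations for a single partition."""
    return pd.DataFrame({
        "rows": [len(df)],
        "null_plant_id": [int(df["plant_id_eia"].isna().sum())],
        "negative_gross_load": [int((df["gross_load_mw"] < 0).sum())],
    })

# Explicit meta so dask knows the output schema without running the function.
meta = pd.DataFrame({
    "rows": pd.Series(dtype="int64"),
    "null_plant_id": pd.Series(dtype="int64"),
    "negative_gross_load": pd.Series(dtype="int64"),
})

# Hypothetical path; blocksize=None gives one partition per (compressed) file.
ddf = dd.read_csv(
    "datapkg/data/hourly_emissions_epacems-*.csv.gz",
    compression="gzip",
    blocksize=None,
)

# Only one partition per worker is in memory at a time, so the ~1 billion
# row table is never fully materialized.
summary = ddf.map_partitions(check_partition, meta=meta).compute()
print(summary)
```

This doesn't reproduce the full goodtables_pandas schema checks, but it shows how the per-partition work could be farmed out to dask workers rather than loaded into a single pandas DataFrame.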
Top GitHub Comments
Now that we are emitting epacems files directly to parquet, this is no longer an issue, as epacems tables are not included in datapackages anymore.
I suppose that as a stop-gap solution we could consider doing validation on a sampled subset of epacems data.
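A rough sketch of that stop-gap, assuming the same hypothetical partitioned CSV layout as above: sample a small fraction of each partition, write the sample out as a single small resource, and run the existing goodtables_pandas validation against that instead of the full table. The glob, output path, and 1% fraction are all placeholders:

```python
import glob
import pandas as pd

SAMPLE_FRAC = 0.01  # validate roughly 1% of the rows

sampled_parts = []
for path in sorted(glob.glob("datapkg/data/hourly_emissions_epacems-*.csv.gz")):
    # Each partition is read and sampled one at a time, so peak memory stays
    # bounded by the largest single partition rather than the whole table.
    part = pd.read_csv(path, compression="gzip", low_memory=False)
    sampled_parts.append(part.sample(frac=SAMPLE_FRAC, random_state=42))

sample = pd.concat(sampled_parts, ignore_index=True)
sample.to_csv(
    "datapkg/data/hourly_emissions_epacems_sample.csv.gz",
    index=False,
    compression="gzip",
)
print(f"Sampled {len(sample):,} rows for validation.")
```

Row-level sampling like this would still catch type and constraint problems that affect a meaningful share of the data, though it could miss rare bad rows, which is the usual trade-off with sampled validation.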