
daskify metadata computation

See original GitHub issue

We have a project where we want to use Terracotta to serve up some huge watermasks. There's no way we can load an entire file into memory and do computations (a 32 GB machine fails when computing the metadata). This is of course no problem for serving the files, as they are cloud-optimized.

However, the metadata computation when creating the database still assumes that the entire file fits into memory and then some. So we should use Dask to chunk the computations when sizes exceed the memory limit.

To speed up the common case (where files fit into memory), we could do this only when a MemoryError is thrown. Or we could set a memory limit that we think is reasonable, always chunk the files so that we never exceed it, and then maybe decrease it if we still hit a MemoryError. Thoughts?
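For concreteness, here is a minimal sketch of what the chunked path could look like, built from rasterio windowed reads wrapped in dask.delayed. This is not Terracotta's actual implementation; the file name, chunk size, and the assumption that 0 marks nodata are all illustrative.

```python
import dask
import dask.array as da
import rasterio
from rasterio.windows import Window

def lazy_band(path, chunk=2048):
    """Wrap band 1 of a raster as a dask array backed by windowed reads."""
    with rasterio.open(path) as src:
        h, w, dtype = src.height, src.width, src.dtypes[0]

    @dask.delayed
    def read(row, col, height, width):
        # each task opens the file and reads only its own window
        with rasterio.open(path) as src:
            return src.read(1, window=Window(col, row, width, height))

    rows = []
    for r in range(0, h, chunk):
        blocks = []
        for c in range(0, w, chunk):
            bh, bw = min(chunk, h - r), min(chunk, w - c)
            blocks.append(da.from_delayed(read(r, c, bh, bw),
                                          shape=(bh, bw), dtype=dtype))
        rows.append(da.concatenate(blocks, axis=1))
    return da.concatenate(rows, axis=0)

band = lazy_band("watermask.tif")   # hypothetical file name
valid = band[band != 0]             # assuming 0 marks nodata
vmin, vmax = dask.compute(valid.min(), valid.max())
```

Because every window is read on demand, peak memory is bounded by the chunk size times the number of concurrent tasks rather than by the raster size.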

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 23

Top GitHub Comments

2 reactions
dionhaefner commented, Aug 27, 2018

Another option: only implement chunked computation for large integer rasters (in this case, we can use the bincount trick). Rasters containing float data types would have to be processed in memory.
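For reference, the bincount trick boils down to summing per-window histograms and reading percentiles off the cumulative counts, so at most one window is ever in memory. A rough sketch (the window iteration and the 2/98 percentiles are illustrative; np.bincount only works for non-negative integers, which is why float rasters are excluded):

```python
import numpy as np

def chunked_percentiles(windows, n_values, q=(2, 98)):
    """Exact percentiles of an integer raster from per-chunk histograms.

    windows: iterable of 2-D arrays of non-negative ints (e.g. windowed
    reads); n_values: largest possible pixel value + 1.
    """
    counts = np.zeros(n_values, dtype=np.int64)
    for block in windows:
        counts += np.bincount(block.ravel(), minlength=n_values)
    cdf = np.cumsum(counts)
    total = cdf[-1]
    # first value whose cumulative count reaches each requested quantile
    return [int(np.searchsorted(cdf, total * p / 100.0)) for p in q]
```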

1 reaction
mrpgraae commented, Aug 28, 2018

@j08lue to clarify, I would be okay with having an explicit option to use overviews; then, if the user knows it won't be a problem with their data, they can opt in.

If we do it implicitly, it may cause nasty surprises. How nasty depends entirely on the nature of the data. I don't think Terracotta should make these kinds of assumptions about the user's data.
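To illustrate what the explicit opt-in could look like: with a cloud-optimized GeoTIFF, a decimated read is served from the overviews, so only the downsampled pixels ever hit memory. A hedged sketch (max_size and the single-band read are illustrative):

```python
import rasterio

def approximate_stats(path, max_size=4096):
    """Estimate band 1 min/max from a decimated, overview-backed read."""
    with rasterio.open(path) as src:
        scale = max(1.0, max(src.height, src.width) / max_size)
        out_shape = (max(1, round(src.height / scale)),
                     max(1, round(src.width / scale)))
        # GDAL serves this from the closest overview level if one exists
        data = src.read(1, out_shape=out_shape)
    return data.min(), data.max()
```

Note that overview statistics are approximate: resampling can clip extreme values, which is exactly the kind of surprise the comment warns about and a good argument for keeping this opt-in.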

Read more comments on GitHub >

Top Results From Across the Web

Understanding Dask's meta keyword argument
While computing, Dask evaluates the actual metadata with columns x and y. This does not match the meta that we provided, and…
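A minimal illustration of that mismatch (column names taken from the snippet): meta promises a column z, but the mapped function returns the partitions unchanged with columns x and y, so Dask raises a metadata mismatch error once the graph actually runs.

```python
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1.0, 2.0, 3.0, 4.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

# meta declares a column "z" that the function never produces
bad = ddf.map_partitions(lambda part: part, meta={"z": "int64"})
bad.compute()  # ValueError: computed columns do not match the provided metadata
```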
yt + Dask particle IO - HackMD
yt reads particle and grid-based data by iterating across the chunks, with frontend-specific IO functions. For gridded data, each frontend implements a…
dask_histogram — dask-histogram 2022.11.0 documentation
Daskified Histogram collection factory function; keep partitioned. ... Histogram (*axes[, storage, metadata]). Histogram object capable of lazy computation.
cooler Documentation - Read the Docs
Metadata is retrieved as a JSON-serializable Python dictionary. … manipulate and distribute computations on larger-than-memory data using…
Working notes by Matthew Rocklin - SciPy
Dask is a Python library for parallel and distributed computing that aims to ... Now we render dataframes as a Pandas dataframe, but...
