daskify metadata computation
We have a project where we want to use Terracotta to serve up some huge watermasks. There's no way we can load an entire file into memory and do computations (a 32 GB machine fails when computing the metadata). This is of course no problem for serving the files, as they are cloud-optimized.
However, the metadata computation when creating the database still assumes that the entire file fits into memory and then some. So we should use Dask to chunk the computations when sizes exceed the memory limit.
To speed up the common case (where files fit into memory), we could do this only when a MemoryError is thrown. Or we could set a memory limit that we think is reasonable, always chunk the files so that we never exceed it, and then maybe decrease it if we hit a MemoryError anyway. Thoughts?
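For illustration, a minimal sketch of what the chunked computation could look like, with one Dask task per internal raster block so peak memory is bounded by a single block. The function names are hypothetical and this is not Terracotta's actual metadata code; it only computes min/max, not the full set of metadata.

```python
import dask
import numpy as np
import rasterio


@dask.delayed
def _window_min_max(path, band, window, nodata):
    """Read one block and return its (min, max), ignoring nodata pixels."""
    with rasterio.open(path) as src:
        block = src.read(band, window=window)
    if nodata is not None:
        block = block[block != nodata]
    if block.size == 0:
        return np.inf, -np.inf
    return float(block.min()), float(block.max())


def chunked_min_max(path, band=1):
    """Global min/max of a raster band, computed block by block via Dask.

    Sketch only: each task reads a single internal block, so this works for
    files that do not fit into RAM.
    """
    with rasterio.open(path) as src:
        windows = [window for _, window in src.block_windows(band)]
        nodata = src.nodata
    tasks = [_window_min_max(path, band, window, nodata) for window in windows]
    results = dask.compute(*tasks)
    mins, maxs = zip(*results)
    return min(mins), max(maxs)
```

The same pattern (per-block tasks plus a cheap reduction) could be gated behind the MemoryError fallback discussed above, so the fast in-memory path stays the default.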
Top GitHub Comments
Another option: only implement chunked computation for large int rasters (in this case, we can use the bincount trick; see the sketch after these comments). Rasters containing float data types would have to be processed in-memory.

@j08lue to clarify, I would be okay with having the explicit option of using overviews, since then, if the user knows that it won't be a problem with their data, they can do that. If we do it implicitly, it may cause nasty surprises, and how nasty is entirely dependent on the nature of the data. I don't think Terracotta should make that kind of assumption about the user's data.
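The bincount trick mentioned above works because, for integer data, a full histogram can be accumulated block by block and exact statistics (min, max, percentiles) derived from the merged counts. A minimal sketch, assuming a non-negative integer raster; the function name and percentile choice are illustrative, not Terracotta's API.

```python
import numpy as np
import rasterio


def chunked_percentiles_int(path, percentiles=(2, 98), band=1):
    """Nearest-rank percentiles for an integer raster via an accumulated bincount.

    Reads one internal block at a time, so memory stays bounded by a block
    plus the histogram. Only valid for non-negative integer dtypes
    (np.bincount rejects floats and negative values).
    """
    counts = np.zeros(0, dtype=np.int64)
    with rasterio.open(path) as src:
        nodata = src.nodata
        for _, window in src.block_windows(band):
            block = src.read(band, window=window).ravel()
            if nodata is not None:
                block = block[block != nodata]
            block_counts = np.bincount(block)
            # grow the accumulator if this block contains larger values
            if block_counts.size > counts.size:
                counts = np.pad(counts, (0, block_counts.size - counts.size))
            counts[:block_counts.size] += block_counts

    if counts.size == 0:
        raise ValueError('raster contains only nodata values')

    cdf = np.cumsum(counts)
    total = cdf[-1]
    return [int(np.searchsorted(cdf, total * p / 100)) for p in percentiles]
```

Because the histogram is exact, the percentiles match what an in-memory computation would give; the limitation is exactly the one noted in the comment, namely that float rasters get no such shortcut.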