daskify metadata computation
We have a project where we want to use Terracotta to serve up some huge watermasks. There's no way we can load an entire file into memory and do computations (a 32 GB machine fails when computing the metadata). This is of course no problem for serving the files, as they are cloud-optimized.
However, the metadata computation when creating the database still assumes that the entire file fits into memory and then some. So we should use Dask to chunk the computations when sizes exceed the memory limit.
To speed up the common case (where files fit into memory), we could do this only when a MemoryError is thrown. Or we could set a memory limit that we think is reasonable, always chunk the files so that we never exceed it, and then maybe decrease it if we hit a MemoryError anyway. Thoughts?
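For illustration, a minimal sketch of what the chunked computation could look like, with one Dask task per internal raster block so peak memory is bounded by a single block. The function names are hypothetical and this is not Terracotta's actual metadata code; it only computes min/max, not the full set of metadata.

```python
import dask
import numpy as np
import rasterio


@dask.delayed
def _window_min_max(path, band, window, nodata):
    """Read one block and return its (min, max), ignoring nodata pixels."""
    with rasterio.open(path) as src:
        block = src.read(band, window=window)
    if nodata is not None:
        block = block[block != nodata]
    if block.size == 0:
        return np.inf, -np.inf
    return float(block.min()), float(block.max())


def chunked_min_max(path, band=1):
    """Global min/max of a raster band, computed block by block via Dask.

    Sketch only: each task reads a single internal block, so this works for
    files that do not fit into RAM.
    """
    with rasterio.open(path) as src:
        windows = [window for _, window in src.block_windows(band)]
        nodata = src.nodata
    tasks = [_window_min_max(path, band, window, nodata) for window in windows]
    results = dask.compute(*tasks)
    mins, maxs = zip(*results)
    return min(mins), max(maxs)
```

The same pattern (per-block tasks plus a cheap reduction) could be gated behind the MemoryError fallback discussed above, so the fast in-memory path stays the default.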
Top GitHub Comments
Another option: only implement chunked computation for large int rasters (in this case, we can use the bincount trick; see the sketch after these comments). Rasters containing float data types would have to be processed in-memory.

@j08lue to clarify, I would be okay with having the explicit option of using overviews, since then, if the user knows that it won't be a problem with their data, they can do that. If we do it implicitly, it may cause nasty surprises, and how nasty is entirely dependent on the nature of the data. I don't think Terracotta should make that kind of assumption about the user's data.
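The bincount trick mentioned above works because, for integer data, a full histogram can be accumulated block by block and exact statistics (min, max, percentiles) derived from the merged counts. A minimal sketch, assuming a non-negative integer raster; the function name and percentile choice are illustrative, not Terracotta's API.

```python
import numpy as np
import rasterio


def chunked_percentiles_int(path, percentiles=(2, 98), band=1):
    """Nearest-rank percentiles for an integer raster via an accumulated bincount.

    Reads one internal block at a time, so memory stays bounded by a block
    plus the histogram. Only valid for non-negative integer dtypes
    (np.bincount rejects floats and negative values).
    """
    counts = np.zeros(0, dtype=np.int64)
    with rasterio.open(path) as src:
        nodata = src.nodata
        for _, window in src.block_windows(band):
            block = src.read(band, window=window).ravel()
            if nodata is not None:
                block = block[block != nodata]
            block_counts = np.bincount(block)
            # grow the accumulator if this block contains larger values
            if block_counts.size > counts.size:
                counts = np.pad(counts, (0, block_counts.size - counts.size))
            counts[:block_counts.size] += block_counts

    if counts.size == 0:
        raise ValueError('raster contains only nodata values')

    cdf = np.cumsum(counts)
    total = cdf[-1]
    return [int(np.searchsorted(cdf, total * p / 100)) for p in percentiles]
```

Because the histogram is exact, the percentiles match what an in-memory computation would give; the limitation is exactly the one noted in the comment, namely that float rasters get no such shortcut.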