If a NetCDF file is chunked on disk, open it with compatible dask chunks

NetCDF4 data can be saved as chunks on disk, which has several benefits, including efficient reads when using a compatible chunk shape. This is particularly important for files with chunk-based compression (i.e. all nc4 files with compression) or on HPC and parallel file systems (eg), where IO is typically dominated by the number of reads and chunks read from disk are often cached. Caches are also common in network data backends such as THREDDS OPeNDAP, in which case using disk-compatible chunks will reduce cache pressure as well as latency.
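To make the cost of an incompatible chunk shape concrete, here is a small sketch that counts how many on-disk chunks have to be read (and, for compressed files, decompressed) when an array is loaded block by block. It assumes no caching between blocks, and the function names are illustrative rather than part of any library.

```python
from math import prod


def reads_along_axis(length, disk, dask):
    """Sum, over every dask block along one axis, the disk chunks it overlaps."""
    total = 0
    for start in range(0, length, dask):
        stop = min(start + dask, length)  # the last block may be shorter
        total += (stop - 1) // disk - start // disk + 1
    return total


def count_chunk_reads(shape, disk_chunks, dask_chunks):
    """Total disk-chunk reads needed to load every dask block once.

    Overlap counts are separable per axis, so the total is the product
    of the per-axis sums. Assumes nothing is cached between blocks.
    """
    return prod(
        reads_along_axis(n, d, k)
        for n, d, k in zip(shape, disk_chunks, dask_chunks)
    )


# A 100 x 100 array stored on disk as 10 x 10 chunks (100 chunks in total).
aligned = count_chunk_reads((100, 100), (10, 10), (10, 10))
row_wise = count_chunk_reads((100, 100), (10, 10), (1, 100))
print(aligned, row_wise)
```

In this example the aligned layout reads each of the 100 disk chunks exactly once, while loading one row per block touches 1000 chunks, so every disk chunk is read ten times unless a cache absorbs the repeats.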

Xarray can use chunks, of course, but as of v0.9 the chunk size has to be specified manually, and the easiest way to discover it is to open the file and look at the _Chunksizes attribute for each variable. I propose that xr.open_dataset (and open_dataarray, and open_mfdataset) change their default behaviour.
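As a rough sketch of that manual discovery step: when a file is opened with xarray's netCDF4 backend, each variable's `encoding` dictionary normally carries a `chunksizes` entry (a tuple, or `None` for contiguous storage). The helper names below are hypothetical, and the availability of `chunksizes` is an assumption that may vary by backend and xarray version.

```python
def disk_chunks(dims, encoding):
    """Map dimension names to on-disk chunk lengths.

    Returns an empty dict when the variable is stored contiguously or the
    backend did not report any chunk sizes.
    """
    sizes = encoding.get("chunksizes")
    if not sizes:
        return {}
    return dict(zip(dims, sizes))


def dataset_disk_chunks(path):
    """Collect on-disk chunking for every variable in a netCDF file.

    Hypothetical helper: xarray is imported lazily so that the pure-Python
    function above can be used without it.
    """
    import xarray as xr

    with xr.open_dataset(path) as ds:
        return {
            name: disk_chunks(var.dims, var.encoding)
            for name, var in ds.variables.items()
        }


# Works on plain dictionaries, so it can be tried without a file:
example = disk_chunks(("time", "lat", "lon"), {"chunksizes": (1, 180, 360)})
print(example)
```

The resulting per-variable dictionary can then be passed as the `chunks` argument when reopening the file.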

If Dask is available and chunks=None (the default), chunks should be taken from the file on disk. This may lead to a chunked or unchunked dataset. To force an unchunked load, users can specify chunks={}, or simply .load() the dataset after opening it.
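The proposed rule can be written down as a small decision function. This is only a sketch of the behaviour suggested here, not of how xarray actually treats the `chunks` argument; `resolve_chunks` and its parameters are hypothetical names.

```python
def resolve_chunks(requested, on_disk_chunks, dask_available=True):
    """Decide which chunks to open a file with under the proposed default.

    Returns a dict of dask chunks, or None for a plain unchunked load.
    """
    if requested is None:
        # Proposed default: follow the file if dask is installed and the
        # file is actually chunked on disk.
        if dask_available and on_disk_chunks:
            return dict(on_disk_chunks)
        return None
    if requested == {}:
        # Explicit opt-out, as described in the proposal.
        return None
    # Anything else is an explicit user choice and is respected as given.
    return dict(requested)


on_disk = {"time": 1, "lat": 180, "lon": 360}
print(resolve_chunks(None, on_disk))           # follows the file
print(resolve_chunks({}, on_disk))             # forced unchunked load
print(resolve_chunks({"time": 12}, on_disk))   # explicit user chunks
```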

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Reactions: 1
  • Comments: 23 (17 by maintainers)

Top GitHub Comments

4 reactions
shoyer commented, Jun 6, 2017

I think it’s unavoidable that users understand how their data will be processed (e.g., whether operations will be mapped over time or space). But maybe some sort of heuristics (if not a fully automated solution) are possible.

For example, maybe chunks={'time'} (note the set rather than a dict) could indicate “divide me into automatically chosen chunks over the time dimension”. It’s still explicit about how chunking is being done, but comes closer to expressing the intent rather than the details.
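One way such a heuristic could work, purely as a hypothetical sketch: keep every dimension not named in the set whole, and grow the chunk along each named dimension in multiples of its on-disk chunk length until a byte budget is reached. The function name `auto_chunks`, its parameters, and the 128 MiB default budget are all assumptions made for illustration.

```python
from math import prod


def auto_chunks(split_dims, sizes, itemsize, on_disk_chunks=None,
                target_bytes=128 * 2**20):
    """Choose chunk lengths automatically along the dimensions in split_dims.

    sizes maps dimension name to length. Dimensions outside split_dims are
    left whole; named dimensions grow in multiples of their on-disk chunk
    length (or 1 if unknown) while the block stays within target_bytes.
    """
    on_disk_chunks = on_disk_chunks or {}
    chunks = {dim: n for dim, n in sizes.items() if dim not in split_dims}
    steps = {
        dim: min(on_disk_chunks.get(dim, 1), sizes[dim])
        for dim in sizes
        if dim in split_dims
    }
    chunks.update(steps)  # start each named dimension at one step
    for dim, step in steps.items():
        block_bytes = itemsize * prod(chunks.values())
        factor = max(1, target_bytes // block_bytes)
        chunks[dim] = min(sizes[dim], step * factor)
    return chunks


sizes = {"time": 1000, "lat": 180, "lon": 360}
chosen = auto_chunks({"time"}, sizes, itemsize=4,
                     on_disk_chunks={"time": 10}, target_bytes=16 * 2**20)
print(chosen)
```

With 4-byte values and a 16 MiB budget, six on-disk time steps of ten fit into one block, so the time dimension is split into chunks of 60 while lat and lon stay whole.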

3 reactions
jhamman commented, Jun 6, 2017

I’d certainly support a warning when dask chunks do not align with the on-disk chunks.

Beyond that, I think we could work on a utility for automatically determining chunk sizes for xarray using some heuristics. Before we go there though, I think we really should develop some performance benchmarks. We’re starting to get a lot of questions/issues about performance and it seems like we need some benchmarking to happen before we can really start fixing the underlying issues.
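The alignment warning suggested in this comment could be sketched along the following lines. It treats a dask chunk as aligned when its length is a whole multiple of the on-disk chunk length along that dimension; the helper names are hypothetical, and the special case of a chunk that spans an entire dimension is not handled.

```python
import warnings


def misaligned_dims(dask_chunks, on_disk_chunks):
    """Return the dimensions whose dask chunk is not a multiple of the disk chunk."""
    return [
        dim
        for dim, size in dask_chunks.items()
        if dim in on_disk_chunks and size % on_disk_chunks[dim] != 0
    ]


def warn_if_misaligned(dask_chunks, on_disk_chunks):
    """Emit a UserWarning naming any misaligned dimensions, and return them."""
    bad = misaligned_dims(dask_chunks, on_disk_chunks)
    if bad:
        warnings.warn(
            f"dask chunks do not align with on-disk chunks along {bad}; "
            "some disk chunks will be read more than once",
            UserWarning,
            stacklevel=2,
        )
    return bad


# time=10 is not a multiple of 4, while lat=180 is a multiple of 90.
bad = warn_if_misaligned({"time": 10, "lat": 180}, {"time": 4, "lat": 90})
print(bad)
```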

Top Results From Across the Web

Chunks - Dask documentation
Modern NDArray storage formats like HDF5, NetCDF, TIFF, and Zarr, allow arrays to be stored in chunks or tiles so that blocks of...

Help me beat my pythonist friend's code. Speeding up data ...
So I think chunks need to be specified when the dataset is opened. There may be some performance under the hood if the...

Technical question about reading file by chunks and Xarray ...
Only the NetCDF file chunks needed to load the selected values, no matter the dask array chunks used.

selecting a point from an mfdataset · Issue #1396 · pydata/xarray
Zac-HD mentioned this issue on Jun 6, 2017. If a NetCDF file is chunked on disk, open it with compatible dask chunks #1440....

Parallel computing with Dask - Xarray
By default, open_mfdataset() will chunk each netCDF file into a single Dask array; again, supply the chunks argument to control the size of...