If a NetCDF file is chunked on disk, open it with compatible dask chunks
NetCDF4 data can be saved as chunks on disk, which has several benefits, including efficient reads when using a compatible chunk shape. This is particularly important for files with chunk-based compression (i.e., all compressed nc4 files) or on HPC and parallel file systems, where IO is typically dominated by the number of reads and chunks-from-disk are often cached. Caches are also common in network data backends such as THREDDS OPeNDAP, in which case using disk-compatible chunks reduces cache pressure as well as latency.
Xarray can use chunks, of course, but as of v0.9 the chunk size has to be specified manually, and the easiest way to discover it is to open the file and look at the _ChunkSizes attribute of each variable. I propose that xr.open_dataset (and open_dataarray, and open_mfdataset) change their default behaviour.
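For reference, a minimal sketch of how the on-disk chunking can be discovered today with netCDF4-python, which reports the same information as the hidden _ChunkSizes attribute; the file path here is a placeholder:

```python
# A minimal sketch, assuming a local chunked NetCDF4 file "data.nc"
# (the path is a placeholder).
import netCDF4

with netCDF4.Dataset("data.nc") as nc:
    for name, var in nc.variables.items():
        # chunking() returns a list of chunk lengths per dimension,
        # or the string "contiguous" for unchunked variables.
        print(name, var.chunking())
```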
If Dask is available and chunks=None (the default), chunks should be taken from the file on disk. This may lead to a chunked or unchunked dataset. To force an un-chunked load, users can specify chunks={}, or simply .load() the dataset after opening it.
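As a stop-gap until such a default exists, something along these lines works today; the file name "data.nc" and variable name "temperature" are only examples:

```python
# A sketch of the manual workaround, not the proposed default behaviour.
import xarray as xr

# First open lazily just to read the storage metadata.
ds = xr.open_dataset("data.nc")
var = ds["temperature"]
disk_chunks = var.encoding.get("chunksizes")  # e.g. (1, 180, 360), or None
dims = var.dims
ds.close()

if disk_chunks is not None:
    # Re-open with dask chunks that match the on-disk chunk shape.
    ds = xr.open_dataset("data.nc", chunks=dict(zip(dims, disk_chunks)))
else:
    # Contiguous storage: no particular chunking to match.
    ds = xr.open_dataset("data.nc")
```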
I think it's unavoidable that users understand how their data will be processed (e.g., whether operations will be mapped over time or space). But maybe some sort of heuristics (if not a fully automated solution) are possible.
For example, maybe chunks={'time'} (note the set rather than a dict) could indicate "divide me into automatically chosen chunks over the time dimension". It's still explicit about how chunking is being done, but comes closer to expressing the intent rather than the details.

I'd certainly support a warning when dask chunks do not align with the on-disk chunks.
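To make both ideas concrete, here is a rough sketch; open_with_dim_chunks and warn_if_misaligned are hypothetical helpers, not existing xarray API, and the heuristics are deliberately simple:

```python
import warnings

import xarray as xr


def open_with_dim_chunks(path, dims):
    """Hypothetical: chunk only along the dimensions in `dims`, taking the
    chunk length from the on-disk chunking and leaving other dims whole."""
    with xr.open_dataset(path) as ds:
        chunk_map = {}
        for var in ds.data_vars.values():
            disk_chunks = var.encoding.get("chunksizes")
            if disk_chunks is None:
                continue
            for dim, size in zip(var.dims, disk_chunks):
                if dim in dims:
                    chunk_map[dim] = size
    return xr.open_dataset(path, chunks=chunk_map)


def warn_if_misaligned(dask_chunks, disk_chunks):
    """Hypothetical: warn when the dask chunks along each dimension are not
    multiples of the on-disk chunk length (a sign of partial-chunk reads)."""
    for dim_chunks, disk in zip(dask_chunks, disk_chunks):
        if any(c % disk for c in dim_chunks[:-1]):  # last chunk may be ragged
            warnings.warn("dask chunks do not align with on-disk chunks")
            return


# e.g. ds = open_with_dim_chunks("data.nc", {"time"})
```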
Beyond that, I think we could work on a utility for automatically determining chunk sizes for xarray using some heuristics. Before we go there, though, I think we really should develop some performance benchmarks. We're starting to get a lot of questions/issues about performance, and it seems like we need some benchmarking before we can really start fixing the underlying issues.
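As a starting point for such benchmarks, even a crude timing comparison of aligned vs. misaligned dask chunks would be informative; the file, variable, and dimension names below are assumptions for illustration:

```python
# A toy benchmark sketch, not a benchmark suite. Assumes "data.nc" has a
# variable "temperature" with dims (time, lat, lon) and an on-disk chunk
# shape of roughly (1, 180, 360).
import time

import xarray as xr


def time_mean(chunks):
    start = time.perf_counter()
    with xr.open_dataset("data.nc", chunks=chunks) as ds:
        ds["temperature"].mean().compute()
    return time.perf_counter() - start


aligned = time_mean({"time": 1, "lat": 180, "lon": 360})
misaligned = time_mean({"time": 7, "lat": 100, "lon": 100})
print(f"aligned: {aligned:.2f}s   misaligned: {misaligned:.2f}s")
```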