Writing a netCDF file is unexpectedly slow
See original GitHub issue

import xarray as xr

ncdat = xr.open_mfdataset(nclist, concat_dim='time')   # nclist: list of input netCDF files
ncdat['lat'] = ncdat['lat'].isel(time=0).drop('time')   # collapse the time-invariant 2D coordinate variables
ncdat['lon'] = ncdat['lon'].isel(time=0).drop('time')
ncdat = ncdat.rename({'north_south': 'lat', 'east_west': 'lon'})
lat_coords = ncdat.lat[:, 0]  # extract 1D latitudes
lon_coords = ncdat.lon[0, :]  # extract 1D longitudes
ncdat = ncdat.drop(['lat', 'lon'])
reformatted_ncdat = ncdat.assign_coords(lat=lat_coords, lon=lon_coords, time=ncdat.coords['time'])
ncdat = reformatted_ncdat.sortby('time')
ncdat.to_netcdf('testing.nc')
Problem description
After some processing, I am left with this xarray dataset ncdat, which I want to export to a netCDF file.
<xarray.Dataset>
Dimensions: (lat: 59, lon: 75, time: 500)
Coordinates:
* time (time) datetime64[ns] 2007-01-22 ... 2008-06-04
* lat (lat) float32 -4.25 -4.15 ... 1.4500003 1.5500002
* lon (lon) float32 29.049988 29.149994 ... 36.450012
Data variables:
Streamflow_tavg (time, lat, lon) float32 dask.array<shape=(500, 59, 75), chunksize=(1, 59, 75)>
RiverDepth_tavg (time, lat, lon) float32 dask.array<shape=(500, 59, 75), chunksize=(1, 59, 75)>
RiverFlowVelocity_tavg (time, lat, lon) float32 dask.array<shape=(500, 59, 75), chunksize=(1, 59, 75)>
FloodedFrac_tavg (time, lat, lon) float32 dask.array<shape=(500, 59, 75), chunksize=(1, 59, 75)>
SurfElev_tavg (time, lat, lon) float32 dask.array<shape=(500, 59, 75), chunksize=(1, 59, 75)>
SWS_tavg (time, lat, lon) float32 dask.array<shape=(500, 59, 75), chunksize=(1, 59, 75)>
Attributes:
missing_value: -9999.0
NUM_SOIL_LAYERS: 1
SOIL_LAYER_THICKNESSES: 1.0
title: LIS land surface model output
institution: NASA GSFC
source: model_not_specified
history: created on date: 2019-04-19T09:11:12.992
references: Kumar_etal_EMS_2006, Peters-Lidard_etal_ISSE_2007
conventions: CF-1.6
comment: website: http://lis.gsfc.nasa.gov/
MAP_PROJECTION: EQUIDISTANT CYLINDRICAL
SOUTH_WEST_CORNER_LAT: -4.25
SOUTH_WEST_CORNER_LON: 29.05
DX: 0.1
DY: 0.1
The problem is that it takes an inordinately long time to export: almost 10 minutes for this particular file, which is only about 35 MB.
How can I speed this up? Is there anything wrong with the structure of ncdat?
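For reference, the repr above shows chunksize=(1, 59, 75), i.e. one small dask chunk per time step and roughly 500 chunks per variable. A minimal sketch for confirming the chunk layout before writing, assuming the ncdat built in the snippet at the top:

# Sketch: inspect the dask chunk layout (assumes `ncdat` from the snippet above)
print(ncdat.chunks)                               # chunk sizes per dimension; time is split into ~500 chunks of length 1
print(ncdat['Streamflow_tavg'].data.npartitions)  # total number of dask chunks for one variable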
Expected Output
A netCDF file
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 | packaged by conda-forge | (default, Mar 27 2019, 23:01:00)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.0.101-0.47.105-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2
xarray: 0.12.1
pandas: 0.24.2
numpy: 1.16.2
scipy: 1.2.1
netCDF4: 1.5.0.1
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 1.2.0
distributed: 1.27.0
matplotlib: 3.0.3
cartopy: 0.17.0
seaborn: 0.9.0
setuptools: 41.0.0
pip: 19.0.3
conda: None
pytest: None
IPython: 7.4.0
sphinx: None
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Diagnosis (the original comment compared loading the netCDFs, a slower export, and a faster export): the dataset built with open_mfdataset is still lazy, backed by dask, so all of the reading is deferred until to_netcdf() is called; loading the data into memory first makes the export much faster.

Thank you very much! I found this. For now, I will use the load() option.
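A minimal sketch of that load() approach, assuming the ncdat built in the snippet at the top; since the whole dataset is only about 35 MB, it fits in memory comfortably:

# Sketch: force the lazy dask-backed dataset into memory first, then write
ncdat = ncdat.load()           # all reads from the source files happen here, up front
ncdat.to_netcdf('testing.nc')  # the write is now a plain in-memory dump, with no per-chunk read overhead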
I suspect it could work pretty well to explicitly rechunk your dataset into larger chunks (e.g., with the Dataset.chunk() method). This way you could continue to use dask for lazy writes, but reduce the overhead of writing individual chunks.
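A minimal sketch of that suggestion, assuming the ncdat from the snippet at the top; the chunk size of 100 time steps is just an illustrative value, anything much larger than 1 should help:

# Sketch: rechunk along time into larger blocks, then write lazily
rechunked = ncdat.chunk({'time': 100})  # ~5 chunks per variable instead of ~500
rechunked.to_netcdf('testing.nc')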