
Writing a netCDF file is unexpectedly slow

See original GitHub issue
import xarray as xr

# Concatenate all input files along a new 'time' dimension (lazy, dask-backed)
ncdat = xr.open_mfdataset(nclist, concat_dim='time')

# lat/lon are stored as 2D variables repeated per time step; keep a single copy
ncdat['lat'] = ncdat['lat'].isel(time=0).drop('time')
ncdat['lon'] = ncdat['lon'].isel(time=0).drop('time')
ncdat = ncdat.rename({'north_south': 'lat', 'east_west': 'lon'})

lat_coords = ncdat.lat[:, 0]  # Extract 1D latitudes
lon_coords = ncdat.lon[0, :]  # Extract 1D longitudes

# Replace the 2D lat/lon variables with proper 1D coordinates
ncdat = ncdat.drop(['lat', 'lon'])
reformatted_ncdat = ncdat.assign_coords(lat=lat_coords, lon=lon_coords,
                                        time=ncdat.coords['time'])

ncdat = reformatted_ncdat.sortby('time')
ncdat.to_netcdf('testing.nc')

Problem description

After some processing, I am left with this xarray dataset ncdat, which I want to export to a netCDF file.

<xarray.Dataset>
Dimensions:                 (lat: 59, lon: 75, time: 500)
Coordinates:
  * time                    (time) datetime64[ns] 2007-01-22 ... 2008-06-04
  * lat                     (lat) float32 -4.25 -4.15 ... 1.4500003 1.5500002
  * lon                     (lon) float32 29.049988 29.149994 ... 36.450012
Data variables:
    Streamflow_tavg         (time, lat, lon) float32 dask.array<shape=(500, 59, 75), chunksize=(1, 59, 75)>
    RiverDepth_tavg         (time, lat, lon) float32 dask.array<shape=(500, 59, 75), chunksize=(1, 59, 75)>
    RiverFlowVelocity_tavg  (time, lat, lon) float32 dask.array<shape=(500, 59, 75), chunksize=(1, 59, 75)>
    FloodedFrac_tavg        (time, lat, lon) float32 dask.array<shape=(500, 59, 75), chunksize=(1, 59, 75)>
    SurfElev_tavg           (time, lat, lon) float32 dask.array<shape=(500, 59, 75), chunksize=(1, 59, 75)>
    SWS_tavg                (time, lat, lon) float32 dask.array<shape=(500, 59, 75), chunksize=(1, 59, 75)>
Attributes:
    missing_value:           -9999.0
    NUM_SOIL_LAYERS:         1
    SOIL_LAYER_THICKNESSES:  1.0
    title:                   LIS land surface model output
    institution:             NASA GSFC
    source:                  model_not_specified
    history:                 created on date: 2019-04-19T09:11:12.992
    references:              Kumar_etal_EMS_2006, Peters-Lidard_etal_ISSE_2007
    conventions:             CF-1.6
    comment:                 website: http://lis.gsfc.nasa.gov/
    MAP_PROJECTION:          EQUIDISTANT CYLINDRICAL
    SOUTH_WEST_CORNER_LAT:   -4.25
    SOUTH_WEST_CORNER_LON:   29.05
    DX:                      0.1
    DY:                      0.1

The problem is that the export takes an inordinately long time: almost 10 minutes for this particular file, which is only 35 MB.

How can I expedite this process? Is there anything wrong with the structure of ncdat?
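For reference, the chunking half of that question can be checked directly. A minimal sketch against the ncdat built above (this check is not part of the original issue):

# Print the dask chunking per dimension. With one input file per
# time step, this shows something like
#   {'time': (1, 1, ..., 1), 'lat': (59,), 'lon': (75,)}
# i.e. 500 single-step chunks per variable.
print(ncdat.chunks)

That layout, visible as chunksize=(1, 59, 75) in the repr above, is what the answers below point to as the source of the slow write.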

Expected Output

A netCDF file

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 | packaged by conda-forge | (default, Mar 27 2019, 23:01:00) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.0.101-0.47.105-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.12.1
pandas: 0.24.2
numpy: 1.16.2
scipy: 1.2.1
netCDF4: 1.5.0.1
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 1.2.0
distributed: 1.27.0
matplotlib: 3.0.3
cartopy: 0.17.0
seaborn: 0.9.0
setuptools: 41.0.0
pip: 19.0.3
conda: None
pytest: None
IPython: 7.4.0
sphinx: None

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

8 reactions
msaharia commented, Apr 22, 2019

Diagnosis

Thank you very much! I found this. For now, I will use the load() option.

Loading netCDFs

In [8]: time ncdat=reformat_LIS_outputs(outlist)
CPU times: user 7.78 s, sys: 220 ms, total: 8 s
Wall time: 8.02 s

Slower export

In [6]: time ncdat.to_netcdf('test_slow')
CPU times: user 12min, sys: 8.19 s, total: 12min 9s
Wall time: 12min 14s

Faster export

In [9]: time ncdat.load().to_netcdf('test_faster.nc')
CPU times: user 42.6 s, sys: 2.82 s, total: 45.4 s
Wall time: 54.6 s
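
The gap makes sense: the lazy write computes and flushes each of the 500 single-time-step chunks separately, while load() materializes the whole (~35 MB) dataset in memory so the write happens in one pass. A minimal sketch of the two paths, with placeholder file names:

# Lazy, dask-backed write: each tiny (1, 59, 75) chunk is computed
# and written on its own, so per-chunk overhead dominates.
ncdat.to_netcdf('test_slow.nc')

# Eager write: pull everything into memory first, then write once.
# Safe here because the entire dataset is only about 35 MB.
ncdat.load().to_netcdf('test_faster.nc')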
1 reaction
shoyer commented, Sep 25, 2019

I suspect it could work pretty well to explicitly rechunk your dataset into larger chunks (e.g., with the Dataset.chunk() method). This way you could continue to use dask for lazy writes, but reduce the overhead of writing individual chunks.
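A sketch of that suggestion (the chunk size of 100 time steps is an arbitrary illustration, not from the issue):

# Consolidate 500 single-step chunks into 5 chunks of 100 steps each;
# the write stays lazy and out-of-core, but far fewer chunks are
# scheduled and flushed.
rechunked = ncdat.chunk({'time': 100})
rechunked.to_netcdf('test_rechunked.nc')

This keeps memory bounded for datasets too large to load(), while avoiding most of the per-chunk overhead.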


Top Results From Across the Web

Writing a netCDF file is extremely slow - dask - Stack Overflow
First a quick note: The files load instantly using xr.open_dataset. You probably did not actually load the data at this point, ...

Known Problems with netCDF
A data variable at the end of a file being created must be written in reverse order from ... Some of these workarounds...

Matlab slowing down while reading netCDF - MathWorks
I'm experiencing a problem when I try to read data from a succession of netCDF files. After a certain point (~10 loops of...

Loading NetCDFs in TensorFlow | Noah D. Brenowitz
Dataset object and iterate through all the samples to mimic one epoch of a machine learning training loop. For simplicity, each NetCDF file...

IO tools (text, CSV, HDF5, …) — pandas 1.5.2 documentation
However this will often fail with an OverflowError or give unexpected results. In general the best approach for unsupported objects or dtypes is...
