Large pickle overhead in ds.to_netcdf() involving dask.delayed functions
If we write a dask array that doesn't involve dask.delayed functions to disk with ds.to_netcdf, there is only a little overhead from pickle:
import dask.array as da
import xarray as xr

vals = da.random.random(500, chunks=(1,))
ds = xr.Dataset({'vals': (['a'], vals)})
write = ds.to_netcdf('file2.nc', compute=False)
%prun -stime -l10 write.compute()
123410 function calls (104395 primitive calls) in 13.720 seconds
Ordered by: internal time
List reduced from 203 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
8 10.032 1.254 10.032 1.254 {method 'acquire' of '_thread.lock' objects}
1001 2.939 0.003 2.950 0.003 {built-in method _pickle.dumps}
1001 0.614 0.001 3.569 0.004 pickle.py:30(dumps)
6504/1002 0.012 0.000 0.021 0.000 utils.py:803(convert)
11507/1002 0.010 0.000 0.019 0.000 utils_comm.py:144(unpack_remotedata)
6013 0.009 0.000 0.009 0.000 utils.py:767(tokey)
3002/1002 0.008 0.000 0.017 0.000 utils_comm.py:181(<listcomp>)
11512 0.007 0.000 0.008 0.000 core.py:26(istask)
1002 0.006 0.000 3.589 0.004 worker.py:788(dumps_task)
1 0.005 0.005 0.007 0.007 core.py:273(<dictcomp>)
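For context (this note is not part of the original report): with compute=False, to_netcdf returns a dask.delayed.Delayed object, and the distributed scheduler pickles each task in its graph before shipping it to the workers, which is what shows up as _pickle.dumps in the profile above. A quick sanity check, assuming a dask version new enough to expose the collection interface (__dask_graph__), which dask 0.18 does:

# `write` is a deferred computation: nothing has been written to disk yet.
# write.compute() hands the task graph to the scheduler, which serializes
# every task before dispatching it to the workers.
print(type(write))                         # <class 'dask.delayed.Delayed'>
print(len(dict(write.__dask_graph__())))   # number of tasks to be serialized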
But if we use results from dask.delayed, pickle takes up most of the time:
import dask
import numpy as np

@dask.delayed
def make_data():
    return np.array(np.random.randn())

vals = da.stack([da.from_delayed(make_data(), (), np.float64) for _ in range(500)])
ds = xr.Dataset({'vals': (['a'], vals)})
write = ds.to_netcdf('file5.nc', compute=False)
%prun -stime -l10 write.compute()
115045243 function calls (104115443 primitive calls) in 67.240 seconds
Ordered by: internal time
List reduced from 292 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
8120705/501 17.597 0.000 59.036 0.118 pickle.py:457(save)
2519027/501 7.581 0.000 59.032 0.118 pickle.py:723(save_tuple)
4 6.978 1.745 6.978 1.745 {method 'acquire' of '_thread.lock' objects}
3082150 5.362 0.000 8.748 0.000 pickle.py:413(memoize)
11474396 4.516 0.000 5.970 0.000 pickle.py:213(write)
8121206 4.186 0.000 5.202 0.000 pickle.py:200(commit_frame)
13747943 2.703 0.000 2.703 0.000 {method 'get' of 'dict' objects}
17057538 1.887 0.000 1.887 0.000 {built-in method builtins.id}
4568116 1.772 0.000 1.782 0.000 {built-in method _struct.pack}
2762513 1.613 0.000 2.826 0.000 pickle.py:448(get)
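A rough way to isolate the serialization cost from the rest of the scheduler is to pickle the tasks of each write graph directly in the local process. This is only an illustrative sketch, not something from the original report; graph_pickle_cost is a hypothetical helper, and cloudpickle (installed alongside distributed) is used so that objects plain pickle might reject are still handled:

import time
import cloudpickle

def graph_pickle_cost(delayed_obj):
    # Serialize every task in the graph and report (seconds, total bytes);
    # tasks that cannot be serialized in this process are skipped.
    start = time.perf_counter()
    n_bytes = 0
    for task in delayed_obj.__dask_graph__().values():
        try:
            n_bytes += len(cloudpickle.dumps(task))
        except Exception:
            pass
    return time.perf_counter() - start, n_bytes

# Run after building each `write` object above; the dask.delayed-based
# graph is expected to take far longer and produce far larger payloads.
print(graph_pickle_cost(write))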
This additional pickle overhead does not happen if we compute the dataset without writing it to a file. Output of %prun -stime -l10 ds.compute() without dask.delayed:
83856 function calls (73348 primitive calls) in 0.566 seconds
Ordered by: internal time
List reduced from 259 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
4 0.441 0.110 0.441 0.110 {method 'acquire' of '_thread.lock' objects}
502 0.013 0.000 0.013 0.000 {method 'send' of '_socket.socket' objects}
500 0.011 0.000 0.011 0.000 {built-in method _pickle.dumps}
1000 0.007 0.000 0.008 0.000 core.py:159(get_dependencies)
3500 0.007 0.000 0.007 0.000 utils.py:767(tokey)
3000/500 0.006 0.000 0.010 0.000 utils.py:803(convert)
500 0.005 0.000 0.019 0.000 pickle.py:30(dumps)
1 0.004 0.004 0.008 0.008 core.py:3826(concatenate3)
4500/500 0.004 0.000 0.008 0.000 utils_comm.py:144(unpack_remotedata)
1 0.004 0.004 0.017 0.017 order.py:83(order)
With dask.delayed:
149376 function calls (139868 primitive calls) in 1.738 seconds
Ordered by: internal time
List reduced from 264 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
4 1.568 0.392 1.568 0.392 {method 'acquire' of '_thread.lock' objects}
1 0.015 0.015 0.038 0.038 optimization.py:455(fuse)
502 0.012 0.000 0.012 0.000 {method 'send' of '_socket.socket' objects}
6500 0.010 0.000 0.010 0.000 utils.py:767(tokey)
5500/1000 0.009 0.000 0.012 0.000 utils_comm.py:144(unpack_remotedata)
2500 0.008 0.000 0.009 0.000 core.py:159(get_dependencies)
500 0.007 0.000 0.009 0.000 client.py:142(__init__)
1000 0.005 0.000 0.008 0.000 core.py:280(subs)
2000/1000 0.005 0.000 0.008 0.000 utils.py:803(convert)
1 0.004 0.004 0.022 0.022 order.py:83(order)
I am using dask.distributed; I haven't tested this with any other scheduler.
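For reference, the profiles above were collected with a dask.distributed client attached; the exact cluster setup is not stated in the issue. A minimal local configuration that exercises the same serialization path would look like this (the worker counts are arbitrary choices, not taken from the report):

from dask.distributed import Client

# Any distributed scheduler, local or remote, pickles tasks before
# sending them to workers, so a LocalCluster reproduces the overhead.
client = Client(n_workers=4, threads_per_worker=1)
print(client)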
Software versions
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_GB.UTF-8
LANG: None
LOCALE: en_GB.UTF-8
xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.1
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.6.2
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: 2.2.2
cartopy: None
seaborn: 0.9.0
setuptools: 40.2.0
pip: 18.0
conda: 4.5.11
pytest: 3.7.3
IPython: 6.5.0
sphinx: 1.7.7
Maintainer comments
OK, so it seems like the complete solution here should involve refactoring our backend classes to avoid any references to objects storing dask graphs. This is a cleaner solution regardless of the pickle overhead, because it allows us to eliminate all state stored in backend classes. I'll get on that in #2261.
Removing the self-references to the dask graphs in #2261 seems to resolve the performance issue on its own.
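To illustrate why a backend object that (directly or indirectly) references a dask graph is expensive to carry around in every write task, here is a purely hypothetical sketch; these classes are not xarray's actual backend classes and this is not the change made in #2261, just the general idea:

import cloudpickle
import dask.array as da

source = da.random.random(500, chunks=(1,))  # array whose graph has 500 tasks

class StoreWithGraphReference:
    # Hypothetical store that keeps a reference to the graph-holding array;
    # pickling it as part of each write task drags the whole graph along.
    def __init__(self, target, source_array):
        self.target = target
        self.source_array = source_array

class StoreWithoutGraphReference:
    # Hypothetical store holding only plain data, so it stays cheap to pickle.
    def __init__(self, target):
        self.target = target

print(len(cloudpickle.dumps(StoreWithGraphReference('file.nc', source))))
print(len(cloudpickle.dumps(StoreWithoutGraphReference('file.nc'))))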
I would be interested to know whether https://github.com/pydata/xarray/pull/2391 still improves performance in any real-world use cases; perhaps it helps when working with a real cluster or on large datasets? I can't see any difference in my local benchmarks using dask-distributed.