Large pickle overhead in ds.to_netcdf() involving dask.delayed functions
If we write a dask array that doesn't involve dask.delayed functions to disk with ds.to_netcdf, there is only a little overhead from pickle:
import dask.array as da
import xarray as xr

vals = da.random.random(500, chunks=(1,))
ds = xr.Dataset({'vals': (['a'], vals)})
write = ds.to_netcdf('file2.nc', compute=False)
%prun -stime -l10 write.compute()
123410 function calls (104395 primitive calls) in 13.720 seconds
Ordered by: internal time
List reduced from 203 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
8 10.032 1.254 10.032 1.254 {method 'acquire' of '_thread.lock' objects}
1001 2.939 0.003 2.950 0.003 {built-in method _pickle.dumps}
1001 0.614 0.001 3.569 0.004 pickle.py:30(dumps)
6504/1002 0.012 0.000 0.021 0.000 utils.py:803(convert)
11507/1002 0.010 0.000 0.019 0.000 utils_comm.py:144(unpack_remotedata)
6013 0.009 0.000 0.009 0.000 utils.py:767(tokey)
3002/1002 0.008 0.000 0.017 0.000 utils_comm.py:181(<listcomp>)
11512 0.007 0.000 0.008 0.000 core.py:26(istask)
1002 0.006 0.000 3.589 0.004 worker.py:788(dumps_task)
1 0.005 0.005 0.007 0.007 core.py:273(<dictcomp>)
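For context (this note is not part of the original report): with compute=False, to_netcdf returns a dask.delayed.Delayed object, and the distributed scheduler pickles each task in its graph before shipping it to the workers, which is what shows up as _pickle.dumps in the profile above. A quick sanity check, assuming a dask version new enough to expose the collection interface (__dask_graph__), which dask 0.18 does:

# `write` is a deferred computation: nothing has been written to disk yet.
# write.compute() hands the task graph to the scheduler, which serializes
# every task before dispatching it to the workers.
print(type(write))                         # <class 'dask.delayed.Delayed'>
print(len(dict(write.__dask_graph__())))   # number of tasks to be serialized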
But if we use results from dask.delayed, pickle takes up most of the time:
import dask
import numpy as np

@dask.delayed
def make_data():
    return np.array(np.random.randn())

vals = da.stack([da.from_delayed(make_data(), (), np.float64) for _ in range(500)])
ds = xr.Dataset({'vals': (['a'], vals)})
write = ds.to_netcdf('file5.nc', compute=False)
%prun -stime -l10 write.compute()
115045243 function calls (104115443 primitive calls) in 67.240 seconds
Ordered by: internal time
List reduced from 292 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
8120705/501 17.597 0.000 59.036 0.118 pickle.py:457(save)
2519027/501 7.581 0.000 59.032 0.118 pickle.py:723(save_tuple)
4 6.978 1.745 6.978 1.745 {method 'acquire' of '_thread.lock' objects}
3082150 5.362 0.000 8.748 0.000 pickle.py:413(memoize)
11474396 4.516 0.000 5.970 0.000 pickle.py:213(write)
8121206 4.186 0.000 5.202 0.000 pickle.py:200(commit_frame)
13747943 2.703 0.000 2.703 0.000 {method 'get' of 'dict' objects}
17057538 1.887 0.000 1.887 0.000 {built-in method builtins.id}
4568116 1.772 0.000 1.782 0.000 {built-in method _struct.pack}
2762513 1.613 0.000 2.826 0.000 pickle.py:448(get)
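A rough way to isolate the serialization cost from the rest of the scheduler is to pickle the tasks of each write graph directly in the local process. This is only an illustrative sketch, not something from the original report; graph_pickle_cost is a hypothetical helper, and cloudpickle (installed alongside distributed) is used so that objects plain pickle might reject are still handled:

import time
import cloudpickle

def graph_pickle_cost(delayed_obj):
    # Serialize every task in the graph and report (seconds, total bytes);
    # tasks that cannot be serialized in this process are skipped.
    start = time.perf_counter()
    n_bytes = 0
    for task in delayed_obj.__dask_graph__().values():
        try:
            n_bytes += len(cloudpickle.dumps(task))
        except Exception:
            pass
    return time.perf_counter() - start, n_bytes

# Run after building each `write` object above; the dask.delayed-based
# graph is expected to take far longer and produce far larger payloads.
print(graph_pickle_cost(write))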
This additional pickle overhead does not happen if we compute the dataset without writing it to a file. Output of %prun -stime -l10 ds.compute() without dask.delayed:
83856 function calls (73348 primitive calls) in 0.566 seconds
Ordered by: internal time
List reduced from 259 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
4 0.441 0.110 0.441 0.110 {method 'acquire' of '_thread.lock' objects}
502 0.013 0.000 0.013 0.000 {method 'send' of '_socket.socket' objects}
500 0.011 0.000 0.011 0.000 {built-in method _pickle.dumps}
1000 0.007 0.000 0.008 0.000 core.py:159(get_dependencies)
3500 0.007 0.000 0.007 0.000 utils.py:767(tokey)
3000/500 0.006 0.000 0.010 0.000 utils.py:803(convert)
500 0.005 0.000 0.019 0.000 pickle.py:30(dumps)
1 0.004 0.004 0.008 0.008 core.py:3826(concatenate3)
4500/500 0.004 0.000 0.008 0.000 utils_comm.py:144(unpack_remotedata)
1 0.004 0.004 0.017 0.017 order.py:83(order)
With dask.delayed:
149376 function calls (139868 primitive calls) in 1.738 seconds
Ordered by: internal time
List reduced from 264 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
4 1.568 0.392 1.568 0.392 {method 'acquire' of '_thread.lock' objects}
1 0.015 0.015 0.038 0.038 optimization.py:455(fuse)
502 0.012 0.000 0.012 0.000 {method 'send' of '_socket.socket' objects}
6500 0.010 0.000 0.010 0.000 utils.py:767(tokey)
5500/1000 0.009 0.000 0.012 0.000 utils_comm.py:144(unpack_remotedata)
2500 0.008 0.000 0.009 0.000 core.py:159(get_dependencies)
500 0.007 0.000 0.009 0.000 client.py:142(__init__)
1000 0.005 0.000 0.008 0.000 core.py:280(subs)
2000/1000 0.005 0.000 0.008 0.000 utils.py:803(convert)
1 0.004 0.004 0.022 0.022 order.py:83(order)
I am using dask.distributed; I haven't tested this with any other scheduler.
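For reference, the profiles above were collected with a dask.distributed client attached; the exact cluster setup is not stated in the issue. A minimal local configuration that exercises the same serialization path would look like this (the worker counts are arbitrary choices, not taken from the report):

from dask.distributed import Client

# Any distributed scheduler, local or remote, pickles tasks before
# sending them to workers, so a LocalCluster reproduces the overhead.
client = Client(n_workers=4, threads_per_worker=1)
print(client)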
Software versions
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_GB.UTF-8
LANG: None
LOCALE: en_GB.UTF-8
xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.1
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.6.2
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: 2.2.2
cartopy: None
seaborn: 0.9.0
setuptools: 40.2.0
pip: 18.0
conda: 4.5.11
pytest: 3.7.3
IPython: 6.5.0
sphinx: 1.7.7
Maintainer comments
OK, so it seems like the complete solution here should involve refactoring our backend classes to avoid any references to objects storing dask graphs. This is a cleaner solution regardless of the pickle overhead, because it allows us to eliminate all state stored in backend classes. I'll get on that in #2261.
Removing the self-references to the dask graphs in #2261 seems to resolve the performance issue on its own.
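To illustrate why a backend object that (directly or indirectly) references a dask graph is expensive to carry around in every write task, here is a purely hypothetical sketch; these classes are not xarray's actual backend classes and this is not the change made in #2261, just the general idea:

import cloudpickle
import dask.array as da

source = da.random.random(500, chunks=(1,))  # array whose graph has 500 tasks

class StoreWithGraphReference:
    # Hypothetical store that keeps a reference to the graph-holding array;
    # pickling it as part of each write task drags the whole graph along.
    def __init__(self, target, source_array):
        self.target = target
        self.source_array = source_array

class StoreWithoutGraphReference:
    # Hypothetical store holding only plain data, so it stays cheap to pickle.
    def __init__(self, target):
        self.target = target

print(len(cloudpickle.dumps(StoreWithGraphReference('file.nc', source))))
print(len(cloudpickle.dumps(StoreWithoutGraphReference('file.nc'))))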
I would be interested to know whether https://github.com/pydata/xarray/pull/2391 still improves performance in any real-world use cases; perhaps it helps when working with a real cluster or on large datasets? I can't see any difference in my local benchmarks using dask-distributed.