
open_mfdataset() significantly slower on 0.9.1 vs. 0.8.2


I noticed a large speed discrepancy between xarray versions 0.8.2 and 0.9.1 when using open_mfdataset() on a ~1.2 GB dataset consisting of 3 files, with netcdf4 as the engine. 0.8.2 was run first, so this is probably not a disk-caching issue.

Test

import xarray as xr
import time

start_time = time.time()
ds0 = xr.open_mfdataset('./*.nc')
print("--- %s seconds ---" % (time.time() - start_time))

Result

xarray==0.8.2, dask==0.11.1, netcdf4==1.2.4

--- 0.736030101776 seconds ---

xarray==0.9.1, dask==0.13.0, netcdf4==1.2.4

--- 52.2800869942 seconds ---
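Not part of the original report, but a hedged workaround sketch: the 0.9.1 profile later in the thread shows nearly all of the time inside pandas' array_to_timedelta64, i.e. CF time decoding. Opening with decoding disabled and decoding once afterwards can isolate (and often sidestep) that cost. `decode_times` and `xarray.decode_cf` are real xarray APIs; the glob path is illustrative.

```python
from pathlib import Path

try:
    import xarray as xr  # may not be installed; the sketch degrades gracefully
except ImportError:
    xr = None

# Collect the sample NetCDF files (illustrative path, as in the report)
paths = sorted(str(p) for p in Path(".").glob("*.nc"))

if xr is not None and paths:
    # Skip CF time decoding at open time, then decode once after concatenation
    ds = xr.decode_cf(xr.open_mfdataset(paths, decode_times=False))
```

If the slow step is the per-file time decoding, this variant should run in roughly the 0.8.2 ballpark while still yielding decoded time coordinates.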

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 17 (11 by maintainers)

Top GitHub Comments

3 reactions
friedrichknuth commented, Nov 16, 2017

Looks like it has been resolved! Tested with the latest pre-release, v0.10.0rc2, on the dataset linked by najascutellatus above: https://marine.rutgers.edu/~michaesm/netcdf/data/

# assumes dask is imported as da (`import dask as da`); forces the
# single-threaded scheduler so %prun sees all the work in one process
da.set_options(get=da.async.get_sync)
%prun -l 10 ds = xr.open_mfdataset('./*.nc')

xarray==0.10.0rc2-1-g8267fdb dask==0.15.4

         194381 function calls (188429 primitive calls) in 0.869 seconds

   Ordered by: internal time
   List reduced from 469 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    0.393    0.008    0.393    0.008 {numpy.core.multiarray.arange}
       50    0.164    0.003    0.557    0.011 indexing.py:266(_index_indexer_1d)
        5    0.083    0.017    0.085    0.017 netCDF4_.py:185(_open_netcdf4_group)
      190    0.024    0.000    0.066    0.000 netCDF4_.py:256(open_store_variable)
      190    0.022    0.000    0.022    0.000 netCDF4_.py:29(__init__)
       50    0.018    0.000    0.021    0.000 {operator.getitem}
5145/3605    0.012    0.000    0.019    0.000 indexing.py:493(shape)
2317/1291    0.009    0.000    0.094    0.000 _abcoll.py:548(update)
    26137    0.006    0.000    0.013    0.000 {isinstance}
      720    0.005    0.000    0.006    0.000 {method 'getncattr' of 'netCDF4._netCDF4.Variable' objects}

xarray==0.9.1 dask==0.13.0


         241253 function calls (229881 primitive calls) in 98.123 seconds

   Ordered by: internal time
   List reduced from 659 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       30   87.527    2.918   87.527    2.918 {pandas._libs.tslib.array_to_timedelta64}
       65    7.055    0.109    7.059    0.109 {operator.getitem}
       80    0.799    0.010    0.799    0.010 {numpy.core.multiarray.arange}
7895/4420    0.502    0.000    0.524    0.000 utils.py:412(shape)
       68    0.442    0.007    0.442    0.007 {pandas._libs.algos.ensure_object}
       80    0.350    0.004    1.150    0.014 indexing.py:318(_index_indexer_1d)
    60/30    0.296    0.005   88.407    2.947 timedeltas.py:158(_convert_listlike)
       30    0.284    0.009    0.298    0.010 algorithms.py:719(checked_add_with_arr)
      123    0.140    0.001    0.140    0.001 {method 'astype' of 'numpy.ndarray' objects}
 1049/719    0.096    0.000   96.513    0.134 {numpy.core.multiarray.array}
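For anyone reproducing these measurements outside IPython, the `%prun -l 10` magic used above can be approximated with the standard-library cProfile/pstats pair. This is a self-contained sketch with a stand-in workload (the real hotspot in the thread was pandas' array_to_timedelta64 inside xarray's time decoding), not a rerun of the original benchmark.

```python
import cProfile
import io
import pstats

def fake_decode(n):
    # stand-in workload; replace with xr.open_mfdataset('./*.nc') to
    # profile the actual call from the issue
    return [i * 1e-9 for i in range(n)]

prof = cProfile.Profile()
prof.enable()
fake_decode(200_000)
prof.disable()

# Equivalent of `%prun -l 10`: sort by internal time, show top 10 rows
buf = io.StringIO()
pstats.Stats(prof, stream=buf).sort_stats("tottime").print_stats(10)
report = buf.getvalue()
print(report)
```

The report has the same columns as the dumps above (ncalls, tottime, percall, cumtime), so results are directly comparable across xarray/dask versions.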
0 reactions
jhamman commented, Nov 14, 2017

@friedrichknuth, any chance you can take a look at this with the latest v0.10 release candidate?
