
Limiting threads/cores used by xarray(/dask?)


I’m fairly new to xarray and I’m currently using it to subset some NetCDF files. I’m running this on a shared server and would like to know how best to limit the processing power xarray uses, so that it plays nicely with other users. I’ve read through the dask and xarray documentation, but it isn’t clear to me how to set a cap on CPUs/threads. Here’s an example of a spatial subset:

import glob
import os
import xarray as xr

from multiprocessing.pool import ThreadPool
import dask

wd = os.getcwd()

test_data = os.path.join(wd, 'test_data')
lat_bnds = (43, 50)
lon_bnds = (-67, -80)
output = 'test_data_subset'

def subset_nc(ncfile, lat_bnds, lon_bnds, output):
    if not os.path.exists(output):
        os.makedirs(output)
    outfile = os.path.join(output, os.path.basename(ncfile).replace('.nc', '_subset.nc'))

    with dask.config.set(scheduler='threads', pool=ThreadPool(5)):
        ds = xr.open_dataset(ncfile, decode_times=False)

        ds_sub = ds.where(
            (ds.lon >= min(lon_bnds)) & (ds.lon <= max(lon_bnds)) & (ds.lat >= min(lat_bnds)) & (ds.lat <= max(lat_bnds)),
            drop=True)
        comp = dict(zlib=True, complevel=5)
        encoding = {var: comp for var in ds.data_vars}
        ds_sub.to_netcdf(outfile, format='NETCDF4', encoding=encoding)

list_files = glob.glob(os.path.join(test_data, '*'))
print(list_files)

for i in list_files:
    subset_nc(i, lat_bnds, lon_bnds, output)

I’ve tried a few variations on this by moving the ThreadPool configuration around, but I still see far too much activity in the server’s top (>3000% CPU). I’m not sure where the issue lies.
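For reference, here is a minimal sketch (not from the issue) of capping dask’s threaded scheduler via its `num_workers` setting. One detail worth noting: `xr.open_dataset` without a `chunks=` argument loads data eagerly with NumPy and bypasses dask entirely, so a dask-level cap alone may not govern the CPU usage seen here. The dataset below is synthetic so the snippet is self-contained.

```python
import dask
import numpy as np
import xarray as xr

# Cap dask's threaded scheduler globally at 4 workers.
dask.config.set(scheduler="threads", num_workers=4)

# A small in-memory dataset, chunked so computations actually run through dask.
ds = xr.Dataset({"t": (("lat", "lon"), np.zeros((100, 100)))}).chunk({"lat": 50})

# A temporary, stricter cap for a single computation.
with dask.config.set(num_workers=2):
    mean = float(ds["t"].mean().compute())

print(mean)  # 0.0 for the all-zero array
```

The context-manager form is handy on a shared server: the stricter limit applies only to the enclosed computation and reverts afterwards.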

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

2 reactions
TheSwaine commented, Feb 4, 2019

Hi, my test code is running properly on 5 threads now. Thanks for the help!

import xarray as xr
import os
import numpy
import sys
import dask
from multiprocessing.pool import ThreadPool 

# dask-worker --nthreads 1

with dask.config.set(scheduler='threads', pool=ThreadPool(5)):
    dset = xr.open_mfdataset("/data/Environmental_Data/Sea_Surface_Height/*/*.nc",
                             engine='netcdf4', concat_dim='time',
                             chunks={"latitude": 180, "longitude": 360})
    dset["ssh_mean"] = dset["adt"] - dset["sla"]
    dset = dset.drop(["crs", "lat_bnds", "lon_bnds",
                      "__xarray_dataarray_variable__", "nv"])
    dset_all_over_monthly_mean = dset.groupby("time.month").mean(dim="time", skipna=True)
    dset_all_over_season1_mean = dset_all_over_monthly_mean.sel(month=[1, 2, 3])
    dset_all_over_season1_mean = dset_all_over_season1_mean.mean(dim="month", skipna=True)
    dset_all_over_season1_mean.to_netcdf("/data/Environmental_Data/dump/mean/all_over_season1_mean_ssh_copernicus_0.25deg_season1_data_mean.nc")
0 reactions
Zeitsperre commented, Feb 11, 2019

Hi @jhamman, please excuse the late reply. It turned out that all I needed to do was set OMP_NUM_THREADS, based on the number of cores I want to use (2 threads/core), before launching my processes. Thanks for the help and for keeping this open. Feel free to close this thread.
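As a sketch of the environment-variable approach described above: `OMP_NUM_THREADS` is what the commenter set, while the OpenBLAS and MKL variables are an assumption covering other common BLAS builds. The caps must be in place before NumPy (and anything built on it, like xarray) is imported, because native thread pools are sized at import time.

```python
import os

# Set the caps *before* importing numpy/xarray.
os.environ["OMP_NUM_THREADS"] = "2"        # what the commenter set
os.environ["OPENBLAS_NUM_THREADS"] = "2"   # assumption: OpenBLAS-backed numpy
os.environ["MKL_NUM_THREADS"] = "2"        # assumption: MKL-backed numpy

import numpy as np  # native thread pools now see the 2-thread cap
```

Equivalently, export the variables in the shell before launching Python, which is what "before launching my processes" suggests the commenter did.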


