Rust implementation of dask-scheduler
Hello,
We (@Kobzol and I) are working on a Rust implementation of dask-scheduler as an experimental drop-in replacement that requires no modification on the worker/client side. The goals of the experiment are (a) to evaluate the performance gain of a non-Python scheduler and (b) to allow experimentation with different schedulers. Here I would like to report preliminary results for (a).
I am sorry for abusing GitHub issues; if there is a better place for contacting the community, please redirect us.
Repository: https://github.com/spirali/rsds/tree/master/src
Project status:
- Server is able to accept client and worker connections and redistribute simple task graphs.
- rsds distinguishes between a “runtime” (= the part that communicates with workers/clients and maintains service information) and a “scheduler” (= the part that decides where tasks will run). The scheduler is asynchronous and offloaded into a separate thread. It communicates with the runtime through a simple protocol. The protocol is serializable, so in the future the scheduler could be written in a language other than Rust (a minimal sketch follows this list).
- The current version has only a random scheduler that assigns tasks to randomly chosen workers.
- Propagating exceptions from workers to the client is implemented.
- Failure of a client/worker is not yet handled correctly.
- Many parts of the protocol are not implemented.
- We have not actively profiled or optimized the code; we are reporting the first running version.
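To make the runtime/scheduler split concrete, here is a minimal Python sketch of what the scheduler side of such a serializable protocol could look like. The message shapes (task_ready, assignments) are hypothetical illustrations, not rsds’s actual wire format:

import json
import random

def schedule_random(msg):
    # Random policy: assign each ready task to a uniformly random worker.
    return {
        "op": "assignments",
        "assignments": [
            {"task": t, "worker": random.choice(msg["workers"])}
            for t in msg["tasks"]
        ],
    }

# Hypothetical runtime -> scheduler message: tasks whose inputs are ready,
# plus the currently connected workers.
incoming = {"op": "task_ready", "tasks": ["t1", "t2"], "workers": ["w1", "w2", "w3"]}

# The reply round-trips through JSON, so the scheduler could live in another
# thread, another process, or be written in a different language entirely.
print(json.dumps(schedule_random(incoming)))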
Benchmark
We ran the following simple code as a benchmark of server runtime overhead.
from dask.distributed import Client
from dask import delayed

client = Client("tcp://localhost:7070")
print("CLIENT", client)

@delayed
def do_something(x):
    return x * 10

@delayed
def merge(*args):
    return sum(args)

# 80000 trivial tasks merged by a single task: the work per task is
# negligible, so the run time is dominated by scheduler overhead.
xs = [do_something(x) for x in range(80000)]
result = merge(*xs)
print(result.compute())
Results
Times (in seconds) were obtained with “time -p python test.py”; a sketch of how such repeated measurements could be collected follows the results below.
1 node / 23 workers
rsds:           19.09 +/- 0.17 s
dask-scheduler: 39.19 +/- 1.01 s

8 nodes / 191 workers (7x24 + 23)
rsds:           20.74 +/- 2.46 s
dask-scheduler: 215.19 +/- 20.07 s
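The post does not say how many runs the mean and standard deviation were computed over; a minimal sketch of collecting such statistics (the repetition count of 5 is an assumption) could look like this:

import statistics
import subprocess
import time

samples = []
for _ in range(5):  # number of repetitions is an assumption, not from the post
    start = time.perf_counter()
    subprocess.run(["python", "test.py"], check=True)
    samples.append(time.perf_counter() - start)

print(f"mean {statistics.mean(samples):.2f} +/- {statistics.stdev(samples):.2f} s")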
We are aware that the benchmark is far from ideal in many respects; we would be happy if you could point us to code that does a better job.
Top GitHub Comments
I wanted to try a newer version, but it required Python 3.7. It’s not a problem to upgrade, but the other benchmarks were executed with Python 3.6, so to keep the comparison fair I would probably need to upgrade all of them. I will do it eventually, I just haven’t gotten to it yet 😃 I need to modify the benchmark suite a bit to make it easier to execute multiple benchmarks with different Python/Python library versions.
The code is open source here: https://github.com/It4innovations/rsds/blob/master/scripts/benchmark.py
The benchmarks can be run locally using
python benchmark.py benchmark <input-file (see reference.json)> <directory>
or on a PBS cluster using
python benchmark.py submit
It’s not exactly super documented though 😅 The benchmarked task workflows can be found here.
Ah ok. Yeah that should provide a good overview.
After moving a bunch of things to HLGs, we’ve wound up spending time on other things related to serialization (https://github.com/dask/distributed/pull/4923), comms (https://github.com/dask/distributed/issues/4443), IO/creation (https://github.com/dask/dask/issues/6791), etc. Also there is more recent work on operation reordering (https://github.com/dask/dask/issues/7933). Probably forgetting other things here 😅
Sure. Oh gotcha 😄 Well it would be great to have another set of eyes. Would be curious to see what other issues you see when hammering on things 🙂