[dask on ray issues]: errors running dask matmul
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS): ray==2.0.0.dev0, dask==2021.1.1
When enabling object spilling on a Ray cluster (by passing --system-config='{"automatic_object_spilling_enabled":true,"max_io_workers":2,"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/tmp/spill\"}}"}' to the head node configuration), the program aborts with a bus error.
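The same spilling settings can also be enabled from Python on a single node. A minimal sketch, assuming Ray's internal (and unstable) _system_config keyword; the exact spelling may differ across Ray versions:

import json
import ray

# Single-node equivalent of the --system-config flag above.
# _system_config is an internal, unstable Ray keyword.
ray.init(_system_config={
    "automatic_object_spilling_enabled": True,
    "max_io_workers": 2,
    # object_spilling_config is itself a JSON-encoded string.
    "object_spilling_config": json.dumps(
        {"type": "filesystem", "params": {"directory_path": "/tmp/spill"}}
    ),
})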
Error output:
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 3.238.219.129
Shared connection to 3.238.219.129 closed.
Shared connection to 3.238.219.129 closed.
Fetched IP: 3.238.219.129
Shared connection to 3.238.219.129 closed.
2021-02-15 14:26:48,908 INFO worker.py:655 -- Connecting to existing Ray cluster at address: 172.31.7.112:6379
(autoscaler +26s) Tip: use `ray status` to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +26s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +32s) Adding 3 nodes of type ray-legacy-worker-node-type.
(autoscaler +57s) Resized to 12 CPUs.
(autoscaler +57s) Adding 1 nodes of type ray-legacy-worker-node-type.
(autoscaler +1m4s) Resized to 24 CPUs.
(autoscaler +1m4s) Adding 3 nodes of type ray-legacy-worker-node-type.
(autoscaler +1m10s) Resized to 28 CPUs.
(autoscaler +1m29s) Resized to 32 CPUs.
(autoscaler +1m35s) Resized to 36 CPUs.
(autoscaler +1m41s) Resized to 44 CPUs.
Bus error (core dumped)
Shared connection to 3.238.219.129 closed.
Error: Command failed:
ssh -tt -i /home/zhitingz/.ssh/ray-autoscaler_us-east-1.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_4dcb4daf68/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.238.219.129 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it ray_container /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python /home/ray/gemm.py --download 25000 25000 uint16)'"'"'"'"'"'"'"'"''"'"' )'
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
from timeit import default_timer as dtimer

import ray
from ray.util.dask import ray_dask_get
import dask
import dask.array as da
import numpy as np


def main():
    _ = ray.init(address="auto")
    dask.config.set(scheduler=ray_dask_get)
    chunks = (1000, 1000)
    x = da.random.randint(0, 65_535, size=(25000, 25000),
                          dtype=np.uint16, chunks=chunks)
    y = da.random.randint(0, 65_535, size=(25000, 25000),
                          dtype=np.uint16, chunks=chunks)
    start = dtimer()
    z = da.matmul(x, y)
    # print(z)
    # z.visualize(filename=f'gemm_{rows}x{cols}_{tp}_{chunks}.svg')
    z_val = z.compute()
    end = dtimer()
    print(z_val)
    print(f"time: {end-start} s")


if __name__ == "__main__":
    main()
If the code snippet cannot be run by itself, the issue will be closed with “needs-repro-script”.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
I changed the chunk size to 5000 to make the issue easier to reproduce:
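For scale, a quick back-of-the-envelope comparison of the two chunk sizes (my own arithmetic, not from the original issue):

import numpy as np

itemsize = np.dtype(np.uint16).itemsize        # 2 bytes per element
print(1000 * 1000 * itemsize / 1e6)            # 2.0 MB per (1000, 1000) chunk
print(5000 * 5000 * itemsize / 1e6)            # 50.0 MB per (5000, 5000) chunk
# With chunks=(5000, 5000) the 25000x25000 matmul becomes a 5x5 grid of output
# chunks, each of which sums 5 partial products, so many ~50 MB objects are
# alive in the object store at once and spilling kicks in much sooner.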
The raylet.err outputs this:
For shm, I ran cat on the head node:
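A programmatic equivalent of that check, as a sketch (not the original output from the issue): inspect /dev/shm, which backs Ray's Plasma object store by default.

import shutil

# Sketch: report usage of /dev/shm, where the object store lives by default.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {used / 1e9:.2f} GB used of {total / 1e9:.2f} GB")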
Is there any command to execute commands on all nodes?
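There is no single built-in CLI for this as far as I know (ray exec only targets the head node); one workaround is to pin one Ray task to each node via the node:<ip> custom resource that Ray assigns automatically. A sketch:

import subprocess
import ray

ray.init(address="auto")

@ray.remote(num_cpus=0)
def run(cmd):
    # Run a shell command on whichever node this task lands on.
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

futures = []
for node in ray.nodes():
    if node["Alive"]:
        ip = node["NodeManagerAddress"]
        # Pin one task to each node via its automatic node:<ip> resource.
        futures.append(
            run.options(resources={f"node:{ip}": 0.01}).remote("df -h /dev/shm")
        )
print(ray.get(futures))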