[dask on ray issues]: errors running dask matmul


What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS):
ray==2.0.0.dev0, dask==2021.1.1

When object spilling is enabled on a Ray cluster (by passing --system-config='{"automatic_object_spilling_enabled":true,"max_io_workers":2,"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/tmp/spill\"}}"}' to the head node configuration), the program aborts with a bus error.
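The quoting in that flag is easy to get wrong, because object_spilling_config is itself a JSON document stored as a string inside the outer JSON. A minimal sketch, assuming the same keys and the /tmp/spill directory from the report, builds the value with json.dumps so the inner quotes are escaped automatically. The helper name build_system_config is hypothetical and is not part of Ray.

```python
import json


def build_system_config(spill_dir, max_io_workers=2):
    """Return the JSON text for --system-config (hypothetical helper)."""
    # The spilling backend is described by its own JSON document...
    inner = json.dumps(
        {"type": "filesystem", "params": {"directory_path": spill_dir}},
        separators=(",", ":"),
    )
    # ...which is embedded as a *string* value in the outer config, so
    # json.dumps escapes its double quotes as \" on the second pass.
    outer = {
        "automatic_object_spilling_enabled": True,
        "max_io_workers": max_io_workers,
        "object_spilling_config": inner,
    }
    return json.dumps(outer, separators=(",", ":"))


if __name__ == "__main__":
    # Single quotes keep the shell from interpreting the JSON.
    print("--system-config='" + build_system_config("/tmp/spill") + "'")
```

Parsing the result back with json.loads twice (once for the outer object, once for object_spilling_config) is a quick way to confirm the string is well formed before putting it in the cluster configuration.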

Error output:

Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 3.238.219.129
Shared connection to 3.238.219.129 closed.
Shared connection to 3.238.219.129 closed.
Fetched IP: 3.238.219.129
Shared connection to 3.238.219.129 closed.
2021-02-15 14:26:48,908 INFO worker.py:655 -- Connecting to existing Ray cluster at address: 172.31.7.112:6379
(autoscaler +26s) Tip: use `ray status` to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +26s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +32s) Adding 3 nodes of type ray-legacy-worker-node-type.
(autoscaler +57s) Resized to 12 CPUs.
(autoscaler +57s) Adding 1 nodes of type ray-legacy-worker-node-type.
(autoscaler +1m4s) Resized to 24 CPUs.
(autoscaler +1m4s) Adding 3 nodes of type ray-legacy-worker-node-type.
(autoscaler +1m10s) Resized to 28 CPUs.
(autoscaler +1m29s) Resized to 32 CPUs.
(autoscaler +1m35s) Resized to 36 CPUs.
(autoscaler +1m41s) Resized to 44 CPUs.
Bus error (core dumped)
Shared connection to 3.238.219.129 closed.
Error: Command failed:

  ssh -tt -i /home/zhitingz/.ssh/ray-autoscaler_us-east-1.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_4dcb4daf68/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.238.219.129 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it  ray_container /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python /home/ray/gemm.py --download 25000 25000 uint16)'"'"'"'"'"'"'"'"''"'"' )'


Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):

from timeit import default_timer as dtimer
import ray
from ray.util.dask import ray_dask_get
import dask
import dask.array as da
import numpy as np


def main():
    _ = ray.init(address="auto")
    dask.config.set(scheduler=ray_dask_get)
    chunks=(1000, 1000)
    x = da.random.randint(0, 65_535, size=(25000, 25000),
                          dtype=np.uint16, chunks=chunks)
    y = da.random.randint(0, 65_535, size=(25000, 25000),
                          dtype=np.uint16, chunks=chunks)
    start = dtimer()
    z = da.matmul(x, y)
    # print(z)
    # z.visualize(filename=f'gemm_{rows}x{cols}_{tp}_{chunks}.svg')

    z_val = z.compute()
    end = dtimer()
    print(z_val)
    print(f"time: {end-start} s")


if __name__ == "__main__":
    main()

If the code snippet cannot be run by itself, the issue will be closed with “needs-repro-script”.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 25 (17 by maintainers)

Top GitHub Comments

1 reaction
photoszzt commented, Feb 15, 2021

I changed the chunk size to 5000 to make the issue easier to reproduce:

from timeit import default_timer as dtimer
import ray
from ray.util.dask import ray_dask_get
import dask
import dask.array as da
import numpy as np


def main():
    _ = ray.init(address="auto")
    dask.config.set(scheduler=ray_dask_get)
    chunks=(5000, 5000)
    x = da.random.randint(0, 65_535, size=(25000, 25000),
                          dtype=np.uint16, chunks=chunks)
    y = da.random.randint(0, 65_535, size=(25000, 25000),
                          dtype=np.uint16, chunks=chunks)
    start = dtimer()
    z = da.matmul(x, y)
    # print(z)
    # z.visualize(filename=f'gemm_{rows}x{cols}_{tp}_{chunks}.svg')

    z_val = z.compute()
    end = dtimer()
    print(z_val)
    print(f"time: {end-start} s")


if __name__ == "__main__":
    main()
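The thread does not say why the larger chunk makes the failure easier to trigger. One plausible factor is that each block placed in the object store is much bigger. A rough sketch of the arithmetic for the two input arrays, assuming 2 bytes per uint16 element and ignoring the matmul intermediates and the output dtype (the function name chunk_stats is hypothetical):

```python
def chunk_stats(shape, chunks, itemsize=2):
    """Return (bytes per chunk, bytes per array, number of chunks).

    itemsize=2 is the size in bytes of one np.uint16 element.
    """
    chunk_bytes = chunks[0] * chunks[1] * itemsize
    array_bytes = shape[0] * shape[1] * itemsize
    # Ceiling division so a partial trailing chunk would still be counted.
    n_chunks = -(-shape[0] // chunks[0]) * -(-shape[1] // chunks[1])
    return chunk_bytes, array_bytes, n_chunks


if __name__ == "__main__":
    for chunks in [(1000, 1000), (5000, 5000)]:
        chunk_bytes, array_bytes, n_chunks = chunk_stats((25000, 25000), chunks)
        print(f"chunks={chunks}: {chunk_bytes / 1e6:.1f} MB per chunk, "
              f"{n_chunks} chunks, {array_bytes / 1e6:.1f} MB per input array")
```

Under these assumptions each input array is 1.25 GB in both cases, while a single block grows from 2 MB to 50 MB when the chunk side goes from 1000 to 5000.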
1 reaction
photoszzt commented, Feb 15, 2021

raylet.err contains the following:

Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 3.238.219.129
Warning: Permanently added '3.238.219.129' (ECDSA) to the list of known hosts.
[2021-02-15 15:19:50,617 C 39361 39361] local_object_manager.cc:340:  Check failed: spilled_objects_url_.count(object_id) > 0
[2021-02-15 15:19:50,617 E 39361 39361] logging.cc:435: *** Aborted at 1613431190 (unix time) try "date -d @1613431190" if you are using GNU date ***
[2021-02-15 15:19:50,618 E 39361 39361] logging.cc:435: PC: @                0x0 (unknown)
[2021-02-15 15:19:50,623 E 39361 39361] logging.cc:435: *** SIGABRT (@0x3e8000099c1) received by PID 39361 (TID 0x7f837aeab800) from PID 39361; stack trace: ***
[2021-02-15 15:19:50,624 E 39361 39361] logging.cc:435:     @     0x561d996eacef google::(anonymous namespace)::FailureSignalHandler()
[2021-02-15 15:19:50,624 E 39361 39361] logging.cc:435:     @     0x7f837b40d3c0 (unknown)
[2021-02-15 15:19:50,625 E 39361 39361] logging.cc:435:     @     0x7f837aef618b gsignal
[2021-02-15 15:19:50,625 E 39361 39361] logging.cc:435:     @     0x7f837aed5859 abort
[2021-02-15 15:19:50,627 E 39361 39361] logging.cc:435:     @     0x561d996dc0c5 ray::SpdLogMessage::Flush()
[2021-02-15 15:19:50,628 E 39361 39361] logging.cc:435:     @     0x561d996dc0fd ray::RayLog::~RayLog()
[2021-02-15 15:19:50,629 E 39361 39361] logging.cc:435:     @     0x561d99329e07 ray::raylet::LocalObjectManager::AsyncRestoreSpilledObject()
[2021-02-15 15:19:50,629 E 39361 39361] logging.cc:435:     @     0x561d992ca0cc _ZNSt17_Function_handlerIFvRKN3ray8ObjectIDERKSsRKNS0_6NodeIDESt8functionIFvRKNS0_6StatusEEEEZNS0_6raylet6RayletC4ERN5boost4asio10io_contextES5_S5_S5_iS5_RKNSG_17NodeManagerConfigERKNS0_19ObjectManagerConfigESt10shared_ptrINS0_3gcs9GcsClientEEiEUlS3_S5_S8_SE_E_E9_M_invokeERKSt9_Any_dataS3_S5_S8_OSE_
[2021-02-15 15:19:50,631 E 39361 39361] logging.cc:435:     @     0x561d993e053a ray::PullManager::TryToMakeObjectLocal()
[2021-02-15 15:19:50,633 E 39361 39361] logging.cc:435:     @     0x561d993e0bcb ray::PullManager::UpdatePullsBasedOnAvailableMemory()
[2021-02-15 15:19:50,635 E 39361 39361] logging.cc:435:     @     0x561d993e1c20 ray::PullManager::OnLocationChange()
[2021-02-15 15:19:50,636 E 39361 39361] logging.cc:435:     @     0x561d993b9b48 _ZNSt17_Function_handlerIFvRKN3ray8ObjectIDERKSt6vectorINS0_3rpc20ObjectLocationChangeESaIS6_EEEZNS0_15ObjectDirectory24SubscribeObjectLocationsERKNS0_8UniqueIDES3_RKNS5_7AddressERKSt8functionIFvS3_RKSt13unordered_setINS0_6NodeIDESt4hashISL_ESt8equal_toISL_ESaISL_EERKSsRKSL_mEEEUlS3_SA_E_E9_M_invokeERKSt9_Any_dataS3_SA_
[2021-02-15 15:19:50,637 E 39361 39361] logging.cc:435:     @     0x561d9947e713 _ZNSt17_Function_handlerIFvN3ray6StatusERKN5boost8optionalINS0_3rpc18ObjectLocationInfoEEEEZZNS0_3gcs30ServiceBasedObjectInfoAccessor25AsyncSubscribeToLocationsERKNS0_8ObjectIDERKSt8functionIFvSE_RKSt6vectorINS4_20ObjectLocationChangeESaISH_EEEERKSF_IFvS1_EEENKUlST_E_clEST_EUlRKS1_S8_E_E9_M_invokeERKSt9_Any_dataOS1_S8_
[2021-02-15 15:19:50,637 E 39361 39361] logging.cc:435:     @     0x561d99485a30 _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc23GetObjectLocationsReplyEEZNS0_3gcs30ServiceBasedObjectInfoAccessor17AsyncGetLocationsERKNS0_8ObjectIDERKSt8functionIFvS1_RKN5boost8optionalINS4_18ObjectLocationInfoEEEEEEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
[2021-02-15 15:19:50,639 E 39361 39361] logging.cc:435:     @     0x561d994542d1 _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc23GetObjectLocationsReplyEEZNS4_12GcsRpcClient18GetObjectLocationsERKNS4_25GetObjectLocationsRequestERKSt8functionIS8_EEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
[2021-02-15 15:19:50,640 E 39361 39361] logging.cc:435:     @     0x561d9945702f ray::rpc::ClientCallImpl<>::OnReplyReceived()
[2021-02-15 15:19:50,641 E 39361 39361] logging.cc:435:     @     0x561d99345472 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
[2021-02-15 15:19:50,643 E 39361 39361] logging.cc:435:     @     0x561d99a59901 boost::asio::detail::scheduler::do_run_one()
[2021-02-15 15:19:50,645 E 39361 39361] logging.cc:435:     @     0x561d99a5afa9 boost::asio::detail::scheduler::run()
[2021-02-15 15:19:50,645 E 39361 39361] logging.cc:435:     @     0x561d99a5d497 boost::asio::io_context::run()
[2021-02-15 15:19:50,646 E 39361 39361] logging.cc:435:     @     0x561d992a2072 main
[2021-02-15 15:19:50,647 E 39361 39361] logging.cc:435:     @     0x7f837aed70b3 __libc_start_main
[2021-02-15 15:19:50,649 E 39361 39361] logging.cc:435:     @     0x561d992b7165 (unknown)

For /dev/shm, df on the head node reports:

Filesystem     1K-blocks    Used Available Use% Mounted on
shm              5101024 4589952    511072  90% /dev/shm

Is there a way to run a command on all nodes?
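One possible approach, offered as a sketch and not as the answer given by the maintainers in the thread, is to launch one small Ray task per node and pin each task to its node through the per-node resource key ("node:<ip>") that ray.nodes() lists for every node. The helper names node_resource_keys and run_on_all_nodes are hypothetical, and the pattern assumes the cluster exposes those per-node resource keys.

```python
import subprocess


def node_resource_keys(nodes):
    """Return the 'node:<ip>' resource key of every live node.

    `nodes` is a list of dicts shaped like the output of ray.nodes().
    """
    keys = []
    for node in nodes:
        if not node.get("Alive"):
            continue  # skip nodes that have left the cluster
        for key in node.get("Resources", {}):
            # Skip internal markers such as 'node:__internal_head__'.
            if key.startswith("node:") and not key.startswith("node:__"):
                keys.append(key)
    return keys


def run_on_all_nodes(shell_command):
    """Run shell_command once per live node and return {resource key: stdout}."""
    import ray  # imported lazily so the helper above works without Ray

    @ray.remote(num_cpus=0)
    def _run(cmd):
        done = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return done.stdout

    # Requesting a small amount of a node's own resource pins the task to it.
    refs = {
        key: _run.options(resources={key: 0.01}).remote(shell_command)
        for key in node_resource_keys(ray.nodes())
    }
    return {key: ray.get(ref) for key, ref in refs.items()}
```

After ray.init(address="auto"), calling run_on_all_nodes("df /dev/shm") would collect the shared-memory usage from each node in one dictionary.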
