[Core] `max_calls=1` crashes ray when many tasks are launched.
What is the problem?
- Ray crashes with `overflow_cpu_instances[i] == 0 Should not be overflow` when launching many tasks with `max_calls=1`. Full error at bottom.
Ray version and other system information (Python version, TensorFlow version, OS):
- Ray: 1.2.0 (and Ray 2.0.0)
- Python 3.7.7
- Docker: rayproject/ray-ml:1.2.0-gpu (but no GPUs used)
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
On a machine with 2 CPUs (`m5.large`):
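import time

import ray

ray.init()

# max_calls=1 makes each worker process exit after running one task.
@ray.remote(max_calls=1)
def one_sec():
    time.sleep(1)

# Launch two 1-second tasks every 0.5 seconds on a 2-CPU machine.
for _ in range(60):
    one_sec.remote()
    one_sec.remote()
    time.sleep(0.5)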
If the code snippet cannot be run by itself, the issue will be closed with “needs-repro-script”.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
(raylet) [2021-03-18 09:18:04,554 C 1068 1068] cluster_task_manager.cc:809: Check failed: overflow_cpu_instances[i] == 0 Should not be overflow
(raylet) [2021-03-18 09:18:04,554 E 1068 1068] logging.cc:415: *** Aborted at 1616084284 (unix time) try "date -d @1616084284" if you are using GNU date ***
(raylet) [2021-03-18 09:18:04,555 E 1068 1068] logging.cc:415: PC: @ 0x0 (unknown)
(raylet) [2021-03-18 09:18:04,555 E 1068 1068] logging.cc:415: *** SIGABRT (@0x42c) received by PID 1068 (TID 0x7fbd2d556800) from PID 1068; stack trace: ***
(raylet) [2021-03-18 09:18:04,555 E 1068 1068] logging.cc:415: @ 0x564ba40989ef google::(anonymous namespace)::FailureSignalHandler()
(raylet) [2021-03-18 09:18:04,556 E 1068 1068] logging.cc:415: @ 0x7fbd2d12c980 (unknown)
(raylet) [2021-03-18 09:18:04,556 E 1068 1068] logging.cc:415: @ 0x7fbd2c220fb7 gsignal
(raylet) [2021-03-18 09:18:04,556 E 1068 1068] logging.cc:415: @ 0x7fbd2c222921 abort
(raylet) [2021-03-18 09:18:04,556 E 1068 1068] logging.cc:415: @ 0x564ba3c36e9c _ZN3ray6RayLogD2Ev.cold
(raylet) [2021-03-18 09:18:04,557 E 1068 1068] logging.cc:415: @ 0x564ba3d3b11f ray::raylet::ClusterTaskManager::ReleaseCpuResourcesFromUnblockedWorker()
(raylet) [2021-03-18 09:18:04,558 E 1068 1068] logging.cc:415: @ 0x564ba3ce9737 ray::raylet::NodeManager::HandleDirectCallTaskBlocked()
(raylet) [2021-03-18 09:18:04,559 E 1068 1068] logging.cc:415: @ 0x564ba3ce97e9 ray::raylet::NodeManager::ProcessDirectCallTaskBlocked()
(raylet) [2021-03-18 09:18:04,560 E 1068 1068] logging.cc:415: @ 0x564ba3d277e2 ray::raylet::NodeManager::ProcessClientMessage()
(raylet) [2021-03-18 09:18:04,560 E 1068 1068] logging.cc:415: @ 0x564ba3c866a1 _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray16ClientConnectionEElRKSt6vectorIhSaIhEEEZNS1_6raylet6Raylet12HandleAcceptERKN5boost6system10error_codeEEUlS3_lS8_E0_E9_M_invokeERKSt9_Any_dataOS3_OlS8_
(raylet) [2021-03-18 09:18:04,562 E 1068 1068] logging.cc:415: @ 0x564ba404430e ray::ClientConnection::ProcessMessage()
(raylet) [2021-03-18 09:18:04,563 E 1068 1068] logging.cc:415: @ 0x564ba40413bc boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(raylet) [2021-03-18 09:18:04,564 E 1068 1068] logging.cc:415: @ 0x564ba4407301 boost::asio::detail::scheduler::do_run_one()
(raylet) [2021-03-18 09:18:04,566 E 1068 1068] logging.cc:415: @ 0x564ba44089a9 boost::asio::detail::scheduler::run()
(raylet) [2021-03-18 09:18:04,567 E 1068 1068] logging.cc:415: @ 0x564ba440ae97 boost::asio::io_context::run()
(raylet) [2021-03-18 09:18:04,568 E 1068 1068] logging.cc:415: @ 0x564ba3c52ce2 main
(raylet) [2021-03-18 09:18:04,568 E 1068 1068] logging.cc:415: @ 0x7fbd2c203bf7 __libc_start_main
(raylet) [2021-03-18 09:18:04,570 E 1068 1068] logging.cc:415: @ 0x564ba3c67da5 (unknown)
zsh: abort (core dumped) ipython
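For anyone triaging a similar crash, here is a minimal sketch of pulling the failed check, its source location, and the named stack frames out of a raylet log shaped like the one above. It uses only the Python standard library. The function name `summarize_raylet_crash` and the regular expressions are my own assumptions, fitted to the lines in this report; they are not part of Ray and may not match other log formats.

```python
import re
from datetime import datetime, timezone

# Assumed line shape, taken from the trace in this report:
# (raylet) [<date> <time>,<ms> <severity> <pid> <tid>] <file>:<line>: <message>
LINE_RE = re.compile(
    r"^\(raylet\) \[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<sev>[A-Z]) (?P<pid>\d+) (?P<tid>\d+)\] "
    r"(?P<src>[\w.]+):(?P<lineno>\d+): (?P<msg>.*)$"
)
# Stack frames look like "@ 0x<address> <symbol>".
FRAME_RE = re.compile(r"^@\s+0x[0-9a-f]+\s+(?P<symbol>.+)$")
ABORT_RE = re.compile(r"\*\*\* Aborted at (?P<epoch>\d+) \(unix time\)")
CHECK_PREFIX = "Check failed: "


def summarize_raylet_crash(log_text):
    """Collect the failed check, abort time (UTC), and named stack frames."""
    summary = {"check": None, "location": None, "aborted_at": None, "frames": []}
    for raw in log_text.splitlines():
        line = LINE_RE.match(raw.strip())
        if line is None:
            continue  # not a raylet line, e.g. the shell's "zsh: abort ..." message
        msg = line.group("msg")
        if line.group("sev") == "C" and msg.startswith(CHECK_PREFIX):
            summary["check"] = msg[len(CHECK_PREFIX):]
            summary["location"] = f'{line.group("src")}:{line.group("lineno")}'
            continue
        aborted = ABORT_RE.search(msg)
        if aborted:
            summary["aborted_at"] = datetime.fromtimestamp(
                int(aborted.group("epoch")), tz=timezone.utc
            )
            continue
        frame = FRAME_RE.match(msg)
        if frame and frame.group("symbol") != "(unknown)":
            summary["frames"].append(frame.group("symbol"))
    return summary


# A few lines copied from the trace above, used as sample input.
SAMPLE_LOG = """
(raylet) [2021-03-18 09:18:04,554 C 1068 1068] cluster_task_manager.cc:809: Check failed: overflow_cpu_instances[i] == 0 Should not be overflow
(raylet) [2021-03-18 09:18:04,554 E 1068 1068] logging.cc:415: *** Aborted at 1616084284 (unix time) try "date -d @1616084284" if you are using GNU date ***
(raylet) [2021-03-18 09:18:04,555 E 1068 1068] logging.cc:415: PC: @ 0x0 (unknown)
(raylet) [2021-03-18 09:18:04,556 E 1068 1068] logging.cc:415: @ 0x7fbd2c222921 abort
(raylet) [2021-03-18 09:18:04,557 E 1068 1068] logging.cc:415: @ 0x564ba3d3b11f ray::raylet::ClusterTaskManager::ReleaseCpuResourcesFromUnblockedWorker()
zsh: abort (core dumped) ipython
"""

if __name__ == "__main__":
    result = summarize_raylet_crash(SAMPLE_LOG)
    print(result["location"], "->", result["check"])
    for symbol in result["frames"]:
        print("  ", symbol)
```

The abort time is returned in UTC, so it can differ from the local wall-clock timestamps at the start of each log line. Frames reported as `(unknown)` and the `PC:` line are skipped on purpose, since they carry no symbol name.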
#15083 fixes it (manually confirmed). Should be able to merge it pretty soon.
OK, thank you! Looking forward to 1.4.