Ray 1.2+ crashes with `Check failed: it != object_pull_requests_.end()`
What is the problem?
Ray crashes with certain workloads in Modin.
Ray version and other system information (Python version, TensorFlow version, OS): 1.2
Reproduction (REQUIRED)
I haven’t been able to reproduce this outside of Modin yet; I am still working on that. I am posting here now to see if there are any pointers, and I’ll update as I know more. The same issue occurs on nightly, but not on Ray 1.1.
import ray
ray.init()
import modin.pandas as pd
df = pd.DataFrame([1])
df + df # crash
We are using `task._remote` and pass `args` and `num_returns` for the codepath where this is breaking.
Mac Logs:
(raylet) [2021-02-23 00:25:47,462 C 13930 2446768] pull_manager.cc:246: Check failed: it != object_pull_requests_.end()
(raylet) [2021-02-23 00:25:47,462 E 13930 2446768] logging.cc:415: *** Aborted at 1614057947 (unix time) try "date -d @1614057947" if you are using GNU date ***
(raylet) [2021-02-23 00:25:47,462 E 13930 2446768] logging.cc:415: PC: @ 0x0 (unknown)
(raylet) [2021-02-23 00:25:47,462 E 13930 2446768] logging.cc:415: *** SIGABRT (@0x7fff6b34933a) received by PID 13930 (TID 0x10e1eedc0) stack trace: ***
(raylet) [2021-02-23 00:25:47,462 E 13930 2446768] logging.cc:415: @ 0x7fff6b3fa5fd _sigtramp
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415: @ 0x10cfcf008 absl::lts_2019_08_08::AlphaNum::AlphaNum()::hexdigits
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415: @ 0x7fff6b2d0808 abort
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415: @ 0x10c8cafba ray::SpdLogMessage::Flush()
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415: @ 0x10c8a1fb9 ray::RayLog::~RayLog()
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415: @ 0x10c5730f9 ray::PullManager::CancelPull()
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415: @ 0x10c53944a ray::ObjectManager::CancelPull()
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415: @ 0x10c4456be ray::raylet::DependencyManager::RemoveTaskDependencies()
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415: @ 0x10c4f4c39 ray::raylet::ClusterTaskManager::DispatchScheduledTasksToWorkers()
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415: @ 0x10c4f678d ray::raylet::ClusterTaskManager::QueueAndScheduleTask()
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415: @ 0x10c47096e ray::raylet::NodeManager::HandleRequestWorkerLease()
(raylet) [2021-02-23 00:25:47,464 E 13930 2446768] logging.cc:415: @ 0x10c4ae9a2 ray::rpc::ServerCallImpl<>::HandleRequestImpl()
(raylet) [2021-02-23 00:25:47,464 E 13930 2446768] logging.cc:415: @ 0x10c4ae8e4 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc14ServerCallImplINS4_25NodeManagerServiceHandlerENS4_25RequestWorkerLeaseRequestENS4_23RequestWorkerLeaseReplyEE13HandleRequestEvEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
(raylet) [2021-02-23 00:25:47,464 E 13930 2446768] logging.cc:415: @ 0x10cd389b3 boost::asio::detail::scheduler::do_run_one()
(raylet) [2021-02-23 00:25:47,464 E 13930 2446768] logging.cc:415: @ 0x10cd2c7f2 boost::asio::detail::scheduler::run()
(raylet) [2021-02-23 00:25:47,464 E 13930 2446768] logging.cc:415: @ 0x10cd2c68b boost::asio::io_context::run()
(raylet) [2021-02-23 00:25:47,464 E 13930 2446768] logging.cc:415: @ 0x10c423518 main
(raylet) [2021-02-23 00:25:47,464 E 13930 2446768] logging.cc:415: @ 0x7fff6b201cc9 start
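Reading the Mac trace from the bottom up, the fatal line is a lookup check inside `ray::PullManager::CancelPull()`: a cancel was requested for an entry that `object_pull_requests_` did not contain. The following is a minimal pure-Python sketch of that kind of check, not Ray's implementation. `ToyPullRegistry` and its methods are hypothetical names, and cancelling the same request twice is only one illustrative way such a check can fire; it is not a confirmed root cause.

```python
class ToyPullRegistry:
    """Hypothetical stand-in for a table of active pull requests keyed by id."""

    def __init__(self):
        self._requests = {}
        self._next_id = 0

    def pull(self, object_ids):
        # Register a new request and hand back its id.
        request_id = self._next_id
        self._next_id += 1
        self._requests[request_id] = list(object_ids)
        return request_id

    def cancel_pull(self, request_id):
        # Same shape as the failed check: the id must still be registered.
        if request_id not in self._requests:
            raise RuntimeError(f"Check failed: request {request_id} not found")
        return self._requests.pop(request_id)


registry = ToyPullRegistry()
rid = registry.pull(["obj_a"])
registry.cancel_pull(rid)  # first cancel removes the entry

try:
    registry.cancel_pull(rid)  # second cancel finds nothing to remove
    second_cancel_failed = False
except RuntimeError as err:
    second_cancel_failed = True
    print(err)
```

In the real raylet the failed check aborts the process instead of raising an exception, which is why the log ends in `SIGABRT`.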
Windows Logs:
(pid=None) [2021-02-23 11:02:37,167 C 29848 39604] pull_manager.cc:100: Check failed: active_object_pull_requests_[obj_id].erase(request_it->first)
(pid=None) [2021-02-23 11:02:37,168 E 29848 39604] logging.cc:415: *** Aborted at 1614099757 (unix time) try "date -d @1614099757" if you are using GNU date ***
(pid=None) [2021-02-23 11:02:37,172 E 29848 39604] logging.cc:415: @ 0x7ffec5ee1881 raise
(pid=None) [2021-02-23 11:02:37,172 E 29848 39604] logging.cc:415: @ 0x7ffec5ee2851 abort
(pid=None) [2021-02-23 11:02:37,174 E 29848 39604] logging.cc:415: @ 0x7ff6e640ff19 public: void __cdecl google::NullStreamFatal::`vbase destructor'(void) __ptr64
(pid=None) [2021-02-23 11:02:37,174 E 29848 39604] logging.cc:415: @ 0x7ff6e640e7e1 public: virtual __cdecl google::NullStreamFatal::~NullStreamFatal(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415: @ 0x7ff6e628ddf5 public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415: @ 0x7ff6e628d664 public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415: @ 0x7ff6e627748b public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415: @ 0x7ff6e624a499 public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415: @ 0x7ff6e62522e4 public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415: @ 0x7ff6e6211913 public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415: @ 0x7ff6e6210731 public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415: @ 0x7ff6e61e681c public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415: @ 0x7ff6e62292fc public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415: @ 0x7ff6e66ba454 bool __cdecl google::Demangle(char const * __ptr64,char * __ptr64,int)
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415: @ 0x7ff6e66bdc4f bool __cdecl google::Demangle(char const * __ptr64,char * __ptr64,int)
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415: @ 0x7ff6e66bd4bb bool __cdecl google::Demangle(char const * __ptr64,char * __ptr64,int)
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415: @ 0x7ff6e61a8017 public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415: @ 0x7ff6e66eed30 bool __cdecl google::Demangle(char const * __ptr64,char * __ptr64,int)
(pid=None) [2021-02-23 11:02:37,179 E 29848 39604] logging.cc:415: @ 0x7ffec6477034 BaseThreadInitThunk
(pid=None) [2021-02-23 11:02:37,182 E 29848 39604] logging.cc:415: @ 0x7ffec7f5d241 RtlUserThreadStart
If the code snippet cannot be run by itself, the issue will be closed with “needs-repro-script”.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Thanks @rkooo567! I see, that makes sense. It is not unexpected in Modin. It can happen in multiple ways (for example, `df + df`). A single `ObjectRef` can belong to multiple dataframes. We try to keep copying to a minimum.
Thanks @rkooo567, @stephanie-wang, and @ericl!
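The point above about a single `ObjectRef` belonging to multiple dataframes can be sketched without Ray or Modin. Everything in this block is hypothetical (`FakeRef`, `ToyFrame`, `binary_op_args`, `unique_refs`). It only shows how sharing references instead of copying leads to one task receiving the same reference twice in an operation like `df + df`. The order-preserving deduplication at the end is an illustration, not a description of the fix applied in Ray or Modin.

```python
class FakeRef:
    """Hypothetical stand-in for an object reference; identity is what matters."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"FakeRef({self.name!r})"


class ToyFrame:
    """Hypothetical frame that stores references to partitions, not copies."""

    def __init__(self, partitions):
        self.partitions = list(partitions)

    def view(self):
        # A cheap "copy": a new frame object over the same references.
        return ToyFrame(self.partitions)


def binary_op_args(left, right):
    """Pair up the references a per-partition binary task would receive."""
    return [(lhs, rhs) for lhs, rhs in zip(left.partitions, right.partitions)]


def unique_refs(refs):
    """Drop repeated references by identity while preserving order."""
    seen = set()
    out = []
    for ref in refs:
        if id(ref) not in seen:
            seen.add(id(ref))
            out.append(ref)
    return out


df = ToyFrame([FakeRef("p0")])
args = binary_op_args(df, df)  # analogous to `df + df`
same_ref_twice = args[0][0] is args[0][1]
shared = df.view().partitions[0] is df.partitions[0]
deps = unique_refs(list(args[0]))
print(same_ref_twice, shared, deps)
```

Both the self-addition and the view end up pointing at the same underlying reference, so any component that tracks task dependencies has to tolerate duplicates in the argument list.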