Ray 1.2+ crashes with `Check failed: it != object_pull_requests_.end()`


What is the problem?

Ray crashes with certain workloads in Modin.

Ray version and other system information (Python version, TensorFlow version, OS): 1.2

Reproduction (REQUIRED)

I haven’t been able to reproduce this outside of Modin yet; I am still working on that. I am posting here now in case anyone has pointers, and I’ll update the issue as I learn more. The same issue occurs on the nightly build, but not on Ray 1.1.

import ray
ray.init()
import modin.pandas as pd
df = pd.DataFrame([1])
df + df  # crash

In the code path where this breaks, we call task._remote and pass args and num_returns.

Mac Logs:

(raylet) [2021-02-23 00:25:47,462 C 13930 2446768] pull_manager.cc:246:  Check failed: it != object_pull_requests_.end() 
(raylet) [2021-02-23 00:25:47,462 E 13930 2446768] logging.cc:415: *** Aborted at 1614057947 (unix time) try "date -d @1614057947" if you are using GNU date ***
(raylet) [2021-02-23 00:25:47,462 E 13930 2446768] logging.cc:415: PC: @                0x0 (unknown)
(raylet) [2021-02-23 00:25:47,462 E 13930 2446768] logging.cc:415: *** SIGABRT (@0x7fff6b34933a) received by PID 13930 (TID 0x10e1eedc0) stack trace: ***
(raylet) [2021-02-23 00:25:47,462 E 13930 2446768] logging.cc:415:     @     0x7fff6b3fa5fd _sigtramp
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415:     @        0x10cfcf008 absl::lts_2019_08_08::AlphaNum::AlphaNum()::hexdigits
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415:     @     0x7fff6b2d0808 abort
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415:     @        0x10c8cafba ray::SpdLogMessage::Flush()
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415:     @        0x10c8a1fb9 ray::RayLog::~RayLog()
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415:     @        0x10c5730f9 ray::PullManager::CancelPull()
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415:     @        0x10c53944a ray::ObjectManager::CancelPull()
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415:     @        0x10c4456be ray::raylet::DependencyManager::RemoveTaskDependencies()
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415:     @        0x10c4f4c39 ray::raylet::ClusterTaskManager::DispatchScheduledTasksToWorkers()
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415:     @        0x10c4f678d ray::raylet::ClusterTaskManager::QueueAndScheduleTask()
(raylet) [2021-02-23 00:25:47,463 E 13930 2446768] logging.cc:415:     @        0x10c47096e ray::raylet::NodeManager::HandleRequestWorkerLease()
(raylet) [2021-02-23 00:25:47,464 E 13930 2446768] logging.cc:415:     @        0x10c4ae9a2 ray::rpc::ServerCallImpl<>::HandleRequestImpl()
(raylet) [2021-02-23 00:25:47,464 E 13930 2446768] logging.cc:415:     @        0x10c4ae8e4 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc14ServerCallImplINS4_25NodeManagerServiceHandlerENS4_25RequestWorkerLeaseRequestENS4_23RequestWorkerLeaseReplyEE13HandleRequestEvEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
(raylet) [2021-02-23 00:25:47,464 E 13930 2446768] logging.cc:415:     @        0x10cd389b3 boost::asio::detail::scheduler::do_run_one()
(raylet) [2021-02-23 00:25:47,464 E 13930 2446768] logging.cc:415:     @        0x10cd2c7f2 boost::asio::detail::scheduler::run()
(raylet) [2021-02-23 00:25:47,464 E 13930 2446768] logging.cc:415:     @        0x10cd2c68b boost::asio::io_context::run()
(raylet) [2021-02-23 00:25:47,464 E 13930 2446768] logging.cc:415:     @        0x10c423518 main
(raylet) [2021-02-23 00:25:47,464 E 13930 2446768] logging.cc:415:     @     0x7fff6b201cc9 start

Windows Logs:

(pid=None) [2021-02-23 11:02:37,167 C 29848 39604] pull_manager.cc:100:  Check failed: active_object_pull_requests_[obj_id].erase(request_it->first)
(pid=None) [2021-02-23 11:02:37,168 E 29848 39604] logging.cc:415: *** Aborted at 1614099757 (unix time) try "date -d @1614099757" if you are using GNU date ***
(pid=None) [2021-02-23 11:02:37,172 E 29848 39604] logging.cc:415:     @     0x7ffec5ee1881 raise
(pid=None) [2021-02-23 11:02:37,172 E 29848 39604] logging.cc:415:     @     0x7ffec5ee2851 abort
(pid=None) [2021-02-23 11:02:37,174 E 29848 39604] logging.cc:415:     @     0x7ff6e640ff19 public: void __cdecl google::NullStreamFatal::`vbase destructor'(void) __ptr64
(pid=None) [2021-02-23 11:02:37,174 E 29848 39604] logging.cc:415:     @     0x7ff6e640e7e1 public: virtual __cdecl google::NullStreamFatal::~NullStreamFatal(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415:     @     0x7ff6e628ddf5 public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415:     @     0x7ff6e628d664 public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415:     @     0x7ff6e627748b public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415:     @     0x7ff6e624a499 public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415:     @     0x7ff6e62522e4 public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415:     @     0x7ff6e6211913 public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415:     @     0x7ff6e6210731 public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415:     @     0x7ff6e61e681c public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415:     @     0x7ff6e62292fc public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415:     @     0x7ff6e66ba454 bool __cdecl google::Demangle(char const * __ptr64,char * __ptr64,int)
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415:     @     0x7ff6e66bdc4f bool __cdecl google::Demangle(char const * __ptr64,char * __ptr64,int)
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415:     @     0x7ff6e66bd4bb bool __cdecl google::Demangle(char const * __ptr64,char * __ptr64,int)
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415:     @     0x7ff6e61a8017 public: class google::NullStream & __ptr64 __cdecl google::NullStream::stream(void) __ptr64
(pid=None) [2021-02-23 11:02:37,177 E 29848 39604] logging.cc:415:     @     0x7ff6e66eed30 bool __cdecl google::Demangle(char const * __ptr64,char * __ptr64,int)
(pid=None) [2021-02-23 11:02:37,179 E 29848 39604] logging.cc:415:     @     0x7ffec6477034 BaseThreadInitThunk
(pid=None) [2021-02-23 11:02:37,182 E 29848 39604] logging.cc:415:     @     0x7ffec7f5d241 RtlUserThreadStart

If the code snippet cannot be run by itself, the issue will be closed with “needs-repro-script”.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

devin-petersohn commented, Feb 25, 2021

Thanks @rkooo567! I see, that makes sense. It is not unexpected in Modin. It can happen in multiple ways:

  • A dataframe joins with itself (df + df)
  • Data partitions become shared between dataframes (a memory footprint optimization)

A single ObjectRef can belong to multiple dataframes. We try to keep copying to a minimum.
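
To make the sharing concrete, here is a toy sketch in plain Python (no Ray or Modin required). The names `FakeRef`, `Frame`, `task_dependencies`, and `ToyPullTable` are hypothetical and are not Ray or Modin APIs. The sketch shows how `df + df` hands the same reference to a task twice, and how a request table keyed by object ID can be cancelled without a failed lookup if duplicates are collapsed first. Whether this resembles the actual raylet logic, or the eventual fix, is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FakeRef:
    """Hypothetical stand-in for an ObjectRef; identity is the object ID."""
    object_id: str


@dataclass
class Frame:
    """Hypothetical stand-in for a dataframe that holds partition refs."""
    partitions: List[FakeRef]


def task_dependencies(left: Frame, right: Frame) -> List[FakeRef]:
    """Arguments a binary operation would pass to its remote task."""
    return left.partitions + right.partitions


class ToyPullTable:
    """Toy request table keyed by object ID (not Ray's PullManager)."""

    def __init__(self) -> None:
        self.requests: Dict[str, int] = {}

    def pull(self, refs: List[FakeRef]) -> None:
        # Collapse duplicates so each object ID is registered once per task.
        for ref in set(refs):
            self.requests[ref.object_id] = self.requests.get(ref.object_id, 0) + 1

    def cancel(self, refs: List[FakeRef]) -> None:
        for ref in set(refs):
            # Same spirit as the failed check in the logs: the entry must exist.
            assert ref.object_id in self.requests, "request not found"
            self.requests[ref.object_id] -= 1
            if self.requests[ref.object_id] == 0:
                del self.requests[ref.object_id]


df = Frame([FakeRef("partition-0")])
deps = task_dependencies(df, df)  # df + df: the same ref appears twice

table = ToyPullTable()
table.pull(deps)
table.cancel(deps)  # succeeds because duplicates were collapsed on both sides
```

In this toy model the dependency list has two entries that point at one object, which is the situation described in the bullets above.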

