[Bug] Manually started gcs crashed. Might be related to gcs client/server version mismatch.
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Core
What happened + What you expected to happen
When I manually started the GCS server and then started a head node to connect to it, the GCS server crashed with the error below. I ran it under lldb to get more information.
This did not happen in earlier versions such as 1.9.2. What could be causing this failure? Are there any suggestions for extracting more information from it?
[2022-01-11 23:02:32,286 I 56455 502606] gcs_server.cc:531: GcsNodeManager: {RegisterNode request count: 0, DrainNode request count: 0, GetAllNodeInfo request count: 0, GetInternalConfig request count: 0}
GcsActorManager: {RegisterActor request count: 0, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, GetAllActorInfo request count: 0, KillActor request count: 0, ListNamedActors request count: 0, Registered actors count: 0, Destroyed actors count: 0, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGroup request count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, GetNamedPlacementGroup request count: 0, Scheduling pending placement group count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}
GcsPubSub:
- num channels subscribed to: 0
- total commands queued: 0
DefaultTaskInfoHandler: {AddTask request count: 0, GetTask request count: 0, AddTaskLease request count: 0, GetTaskLease request count: 0, AttemptTaskReconstruction request count: 0}
GrpcBasedResourceBroadcaster: {Tracked nodes: 0}
[2022-01-11 23:02:34,319 I 56455 502606] gcs_node_manager.cc:42: Registering node info, node id = 29a3b7fb5010b320ad1efd9d4c425904e489e6880cb982be6ed0176a, address = 127.0.0.1
[2022-01-11 23:02:34,319 I 56455 502606] gcs_node_manager.cc:47: Finished registering node info, node id = 29a3b7fb5010b320ad1efd9d4c425904e489e6880cb982be6ed0176a, address = 127.0.0.1
[2022-01-11 23:02:34,320 I 56455 502606] gcs_placement_group_manager.cc:722: A new node: 29a3b7fb5010b320ad1efd9d4c425904e489e6880cb982be6ed0176a registered, will try to reschedule all the infeasible placement groups.
Process 56455 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
frame #0: 0x0000000100197ac5 gcs_server`ray::gcs::InternalPubSubHandler::HandleGcsSubscriberCommandBatch(ray::rpc::GcsSubscriberCommandBatchRequest const&, ray::rpc::GcsSubscriberCommandBatchReply*, std::__1::function<void (ray::Status, std::__1::function<void ()>, std::__1::function<void ()>)>) + 741
gcs_server`ray::gcs::InternalPubSubHandler::HandleGcsSubscriberCommandBatch:
-> 0x100197ac5 <+741>: movq (%r12), %rax
0x100197ac9 <+745>: movq %r12, %rdi
0x100197acc <+748>: movl %r13d, %esi
0x100197acf <+751>: leaq -0xb8(%rbp), %rdx
Target 0: (gcs_server) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
* frame #0: 0x0000000100197ac5 gcs_server`ray::gcs::InternalPubSubHandler::HandleGcsSubscriberCommandBatch(ray::rpc::GcsSubscriberCommandBatchRequest const&, ray::rpc::GcsSubscriberCommandBatchReply*, std::__1::function<void (ray::Status, std::__1::function<void ()>, std::__1::function<void ()>)>) + 741
frame #1: 0x000000010016edc4 gcs_server`ray::rpc::ServerCallImpl<ray::rpc::InternalPubSubGcsServiceHandler, ray::rpc::GcsSubscriberCommandBatchRequest, ray::rpc::GcsSubscriberCommandBatchReply>::HandleRequestImpl() + 132
frame #2: 0x00000001002c9692 gcs_server`boost::asio::detail::completion_handler<std::__1::function<void ()>, boost::asio::io_context::basic_executor_type<std::__1::allocator<void>, 0u> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) + 162
frame #3: 0x00000001008568d6 gcs_server`boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) + 742
frame #4: 0x000000010084cf71 gcs_server`boost::asio::detail::scheduler::run(boost::system::error_code&) + 225
frame #5: 0x000000010084ce7b gcs_server`boost::asio::io_context::run() + 43
frame #6: 0x0000000100006806 gcs_server`main + 5190
frame #7: 0x00000001014054fe dyld`start + 462
Versions / Dependencies
Master branch, running on macOS.
Reproduction script
This is based on my own patch, which has a --no-gcs option that can start a cluster using an existing GCS server.
Anything else
Can #12644 be related to this issue?
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- State:
- Created 2 years ago
- Comments:7 (7 by maintainers)
Top GitHub Comments
While we recommend following the existing deployment pattern, I think this could be related to a recent change. @mwtian, can you check whether this is an unexpected bug or expected behavior after the changes we made recently?
I think the GCS client should perform a version check when it first connects to the GCS server, if no such mechanism exists yet.
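To make the version-check suggestion concrete, here is a minimal sketch of a connect-time handshake. It is not Ray's actual API: `FakeGcsServer`, `connect`, `VersionMismatchError`, and the exact-match policy are all hypothetical names and assumptions chosen for illustration, and the version strings are placeholders. The idea is only that a mismatched client gets a clear error at connection time instead of triggering a crash later in a request handler.

```python
class VersionMismatchError(RuntimeError):
    """Raised when client and server versions do not match (hypothetical)."""


class FakeGcsServer:
    """Stand-in for a GCS server; NOT the real Ray class."""

    def __init__(self, version: str) -> None:
        self.version = version

    def handshake(self, client_version: str) -> str:
        # Assumed policy: client and server must report the same version.
        if client_version != self.version:
            raise VersionMismatchError(
                f"client version {client_version} does not match "
                f"server version {self.version}"
            )
        return self.version


def connect(server: FakeGcsServer, client_version: str) -> str:
    """Run the handshake before sending any other request."""
    return server.handshake(client_version)


if __name__ == "__main__":
    server = FakeGcsServer("2.0.0.dev0")  # placeholder version string
    print(connect(server, "2.0.0.dev0"))
    try:
        connect(server, "1.9.2")
    except VersionMismatchError as err:
        print(f"refused: {err}")
```

An exact-match rule is the simplest policy to reason about. A real implementation might instead compare a protocol version number so that compatible releases can still talk to each other.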