[Bug] Manually started gcs crashed. Might be related to gcs client/server version mismatch.

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

When I manually started the GCS server and then started a head node to connect to it, the GCS server crashed with the following error. I ran it under lldb to print more information.

This did not happen in earlier versions such as 1.9.2. What could be causing this failure? Are there any suggestions for extracting more information from it?

[2022-01-11 23:02:32,286 I 56455 502606] gcs_server.cc:531: GcsNodeManager: {RegisterNode request count: 0, DrainNode request count: 0, GetAllNodeInfo request count: 0, GetInternalConfig request count: 0}
GcsActorManager: {RegisterActor request count: 0, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, GetAllActorInfo request count: 0, KillActor request count: 0, ListNamedActors request count: 0, Registered actors count: 0, Destroyed actors count: 0, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGroup request count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, GetNamedPlacementGroup request count: 0, Scheduling pending placement group count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}
GcsPubSub:
- num channels subscribed to: 0
- total commands queued: 0
DefaultTaskInfoHandler: {AddTask request count: 0, GetTask request count: 0, AddTaskLease request count: 0, GetTaskLease request count: 0, AttemptTaskReconstruction request count: 0}
GrpcBasedResourceBroadcaster: {Tracked nodes: 0}
[2022-01-11 23:02:34,319 I 56455 502606] gcs_node_manager.cc:42: Registering node info, node id = 29a3b7fb5010b320ad1efd9d4c425904e489e6880cb982be6ed0176a, address = 127.0.0.1
[2022-01-11 23:02:34,319 I 56455 502606] gcs_node_manager.cc:47: Finished registering node info, node id = 29a3b7fb5010b320ad1efd9d4c425904e489e6880cb982be6ed0176a, address = 127.0.0.1
[2022-01-11 23:02:34,320 I 56455 502606] gcs_placement_group_manager.cc:722: A new node: 29a3b7fb5010b320ad1efd9d4c425904e489e6880cb982be6ed0176a registered, will try to reschedule all the infeasible placement groups.
Process 56455 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000100197ac5 gcs_server`ray::gcs::InternalPubSubHandler::HandleGcsSubscriberCommandBatch(ray::rpc::GcsSubscriberCommandBatchRequest const&, ray::rpc::GcsSubscriberCommandBatchReply*, std::__1::function<void (ray::Status, std::__1::function<void ()>, std::__1::function<void ()>)>) + 741
gcs_server`ray::gcs::InternalPubSubHandler::HandleGcsSubscriberCommandBatch:
->  0x100197ac5 <+741>: movq   (%r12), %rax
    0x100197ac9 <+745>: movq   %r12, %rdi
    0x100197acc <+748>: movl   %r13d, %esi
    0x100197acf <+751>: leaq   -0xb8(%rbp), %rdx
Target 0: (gcs_server) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000100197ac5 gcs_server`ray::gcs::InternalPubSubHandler::HandleGcsSubscriberCommandBatch(ray::rpc::GcsSubscriberCommandBatchRequest const&, ray::rpc::GcsSubscriberCommandBatchReply*, std::__1::function<void (ray::Status, std::__1::function<void ()>, std::__1::function<void ()>)>) + 741
    frame #1: 0x000000010016edc4 gcs_server`ray::rpc::ServerCallImpl<ray::rpc::InternalPubSubGcsServiceHandler, ray::rpc::GcsSubscriberCommandBatchRequest, ray::rpc::GcsSubscriberCommandBatchReply>::HandleRequestImpl() + 132
    frame #2: 0x00000001002c9692 gcs_server`boost::asio::detail::completion_handler<std::__1::function<void ()>, boost::asio::io_context::basic_executor_type<std::__1::allocator<void>, 0u> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) + 162
    frame #3: 0x00000001008568d6 gcs_server`boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) + 742
    frame #4: 0x000000010084cf71 gcs_server`boost::asio::detail::scheduler::run(boost::system::error_code&) + 225
    frame #5: 0x000000010084ce7b gcs_server`boost::asio::io_context::run() + 43
    frame #6: 0x0000000100006806 gcs_server`main + 5190
    frame #7: 0x00000001014054fe dyld`start + 462
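
One way to get a little more out of a backtrace like the one above is to split each frame into its parts, so that the function start address (frame address minus offset) can be computed and matched against a symbol table or disassembly. The following is a minimal sketch, assuming the `frame #N: ADDRESS module`symbol + OFFSET` layout seen in this lldb output; the names `Frame` and `parse_backtrace` are hypothetical helpers, and the sample symbols are shortened with `(...)` for readability.

```python
import re
from typing import List, NamedTuple

# Matches lines such as:
#   frame #7: 0x00000001014054fe dyld`start + 462
FRAME_RE = re.compile(
    r"frame #(?P<index>\d+): (?P<addr>0x[0-9a-fA-F]+) "
    r"(?P<module>[^`]+)`(?P<symbol>.*) \+ (?P<offset>\d+)$"
)


class Frame(NamedTuple):
    index: int
    address: int
    module: str
    symbol: str
    offset: int

    @property
    def function_start(self) -> int:
        # Address of the first instruction of the enclosing function.
        return self.address - self.offset


def parse_backtrace(text: str) -> List[Frame]:
    """Return one Frame per recognizable line; other lines are skipped."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line.strip())
        if match:
            frames.append(
                Frame(
                    index=int(match["index"]),
                    address=int(match["addr"], 16),
                    module=match["module"],
                    symbol=match["symbol"],
                    offset=int(match["offset"]),
                )
            )
    return frames


# Shortened sample in the same layout as the backtrace above.
SAMPLE = """\
  * frame #0: 0x0000000100197ac5 gcs_server`ray::gcs::InternalPubSubHandler::HandleGcsSubscriberCommandBatch(...) + 741
    frame #1: 0x000000010016edc4 gcs_server`ray::rpc::ServerCallImpl<...>::HandleRequestImpl() + 132
    frame #6: 0x0000000100006806 gcs_server`main + 5190
    frame #7: 0x00000001014054fe dyld`start + 462
"""

if __name__ == "__main__":
    for frame in parse_backtrace(SAMPLE):
        print(frame.index, frame.module, hex(frame.function_start), frame.symbol)
```

This only reorganizes what lldb already printed; it does not recover variable values, which would still require a build with debug symbols.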

Versions / Dependencies

Master branch, running on macOS.

Reproduction script

This is based on my own patch, which adds a --no-gcs option that starts a cluster against an existing GCS server.

Anything else

Can #12644 be related to this issue?

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
rkooo567 commented, Jan 17, 2022

While we recommend following the existing deployment pattern, I think this could be related to a recent change. @mwtian, can you check whether this is an unexpected bug or expected behavior after the changes we’ve made recently?

0 reactions
mwtian commented, Feb 18, 2022

I think the GCS client should perform a version check when it first connects to the GCS server, if no such mechanism exists yet.
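
The version check suggested above could look roughly like the following. This is a minimal sketch of the general idea, not Ray's actual GCS interface: `FakeGcsServer`, `FakeGcsClient`, `get_version`, and `VersionMismatchError` are all hypothetical names, and requiring an exact version match is an assumption (a real check might allow compatible ranges).

```python
from dataclasses import dataclass


class VersionMismatchError(RuntimeError):
    """Raised when the client and server report different versions."""


@dataclass(frozen=True)
class FakeGcsServer:
    # Hypothetical stand-in for a server; not Ray's real GCS API.
    version: str

    def get_version(self) -> str:
        return self.version


class FakeGcsClient:
    # Hypothetical client that refuses to proceed on a version mismatch.
    def __init__(self, version: str) -> None:
        self.version = version
        self.connected = False

    def connect(self, server: FakeGcsServer) -> None:
        # Ask for the server version before issuing any other request,
        # so a mismatch fails loudly instead of crashing later.
        server_version = server.get_version()
        if server_version != self.version:
            raise VersionMismatchError(
                f"client version {self.version!r} does not match "
                f"server version {server_version!r}"
            )
        self.connected = True


if __name__ == "__main__":
    client = FakeGcsClient("1.9.2")
    client.connect(FakeGcsServer("1.9.2"))
    print("connected:", client.connected)
```

Failing at connect time turns a mismatch into a clear error message on the client side, rather than an access violation inside a server handler.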
