[core] Failed to add any wildcard listeners RPC when starting many workers on cluster
See original GitHub issue.
Cluster config: https://github.com/ray-project/xgboost_ray/blob/master/xgboost_ray/tests/release/cpu_large_scale.yaml
200 nodes, 4 workers each:
I get this sometimes after starting a benchmark script. It might be a general issue that comes up with many nodes:
(base) root@ip-172-31-30-231:/release_tests# python benchmark_cpu_gpu.py 200 10 8000
2020-11-20 06:55:00,056 INFO worker.py:651 -- Connecting to existing Ray cluster at address: 172.31.30.231:6379
2020-11-20 06:55:01,352 WARNING worker.py:1091 -- The actor or task with ID ffffffffffffffffbbb9cbf608000000 is pending and cannot currently be scheduled. It requires {} for execution and {CPU: 1.000000} for placement, but this node only has remaining {actor_cpus: 4.000000}, {node:172.31.26.235: 1.000000}, {object_store_memory: 3.076172 GiB}, {CPU: 4.000000}, {memory: 10.449219 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
2020-11-20 06:55:01,517 WARNING worker.py:1091 -- The actor or task with ID ffffffffffffffffc23e60dd08000000 is pending and cannot currently be scheduled. It requires {} for execution and {CPU: 1.000000} for placement, but this node only has remaining {actor_cpus: 4.000000}, {node:172.31.18.252: 1.000000}, {CPU: 4.000000}, {memory: 10.351562 GiB}, {object_store_memory: 3.027344 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
(pid=raylet, ip=172.31.19.101) E1120 06:55:02.190124912    1060 server_chttp2.cc:40]        {"created":"@1605884102.190040718","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":394,"referenced_errors":[{"created":"@1605884102.190038437","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":341,"referenced_errors":[{"created":"@1605884102.189969464","description":"Unable to configure socket","fd":32,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":208,"referenced_errors":[{"created":"@1605884102.189960718","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1605884102.190037585","description":"Unable to configure socket","fd":32,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":208,"referenced_errors":[{"created":"@1605884102.190033903","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(pid=raylet, ip=172.31.19.101) *** Aborted at 1605884102 (unix time) try "date -d @1605884102" if you are using GNU date ***
(pid=raylet, ip=172.31.19.101) PC: @                0x0 (unknown)
(pid=raylet, ip=172.31.19.101) *** SIGSEGV (@0x58) received by PID 1060 (TID 0x7f1f3f4587c0) from PID 88; stack trace: ***
(pid=raylet, ip=172.31.19.101)     @     0x7f1f3f68a3c0 (unknown)
(pid=raylet, ip=172.31.19.101)     @     0x563a035cf0e2 grpc::ServerInterface::RegisteredAsyncRequest::IssueRequest()
(pid=raylet, ip=172.31.19.101)     @     0x563a0325a429 ray::rpc::ObjectManagerService::WithAsyncMethod_Push<>::RequestPush()
(pid=raylet, ip=172.31.19.101)     @     0x563a03269a1b ray::rpc::ServerCallFactoryImpl<>::CreateCall()
(pid=raylet, ip=172.31.19.101)     @     0x563a034ede19 ray::rpc::GrpcServer::Run()
(pid=raylet, ip=172.31.19.101)     @     0x563a0325e59e ray::ObjectManager::StartRpcService()
(pid=raylet, ip=172.31.19.101)     @     0x563a032712bc ray::ObjectManager::ObjectManager()
(pid=raylet, ip=172.31.19.101)     @     0x563a031c0e20 ray::raylet::Raylet::Raylet()             
(pid=raylet, ip=172.31.19.101)     @     0x563a0319796b _ZZ4mainENKUlN3ray6StatusEN5boost8optionalISt13unordered_mapISsSsSt4hashISsESt8equal_toISsESaISt4pairIKSsSsEEEEEE_clES0_SD_
(pid=raylet, ip=172.31.19.101)     @     0x563a03198af1 _ZNSt17_Function_handlerIFvN3ray6StatusERKN5boost8optionalISt13unordered_mapISsSsSt4hashISsESt8equal_toISsESaISt4pairIKSsSsEEEEEEZ4mainEUlS1_SE_E_E9_M_invokeERKSt9_Any_dataS1_SG_
(pid=raylet, ip=172.31.19.101)     @     0x563a03319a2c _ZZN3ray3gcs28ServiceBasedNodeInfoAccessor22AsyncGetInternalConfigERKSt8functionIFvNS_6StatusERKN5boost8optionalISt13unordered_mapISsSsSt4hashISsESt8equal_toISsESaISt4pairIKSsSsEEEEEEEENKUlRKS3_RKNS_3rpc22GetInternalConfigReplyEE_clESO_SS_
(pid=raylet, ip=172.31.19.101)     @     0x563a032cd33f _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc22GetInternalConfigReplyEEZNS4_12GcsRpcClient17GetInternalConfigERKNS4_24GetInternalConfigRequestERKSt8functionIS8_EEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
(pid=raylet, ip=172.31.19.101)     @     0x563a032cd43d ray::rpc::ClientCallImpl<>::OnReplyReceived()
(pid=raylet, ip=172.31.19.101)     @     0x563a031fa800 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
(pid=raylet, ip=172.31.19.101)     @     0x563a0386270f boost::asio::detail::scheduler::do_run_one()
(pid=raylet, ip=172.31.19.101)     @     0x563a03863c11 boost::asio::detail::scheduler::run()                
(pid=raylet, ip=172.31.19.101)     @     0x563a03864c42 boost::asio::io_context::run()
(pid=raylet, ip=172.31.19.101)     @     0x563a03177cbc main                                                                             
(pid=raylet, ip=172.31.19.101)     @     0x7f1f3f4840b3 __libc_start_main
(pid=raylet, ip=172.31.19.101)     @     0x563a0318a621 (unknown)                                                                        
2020-11-20 06:55:03,190 WARNING worker.py:1091 -- The actor or task with ID ffffffffffffffff84eb106208000000 is pending and cannot currently be scheduled. It requires {} for execution and {CPU: 1.000000} for placement, but this node only has remaining {actor_cpus: 4.000000}, {node:172.31.29.90: 1.000000}, {CPU: 4.000000}, {memory: 10.351562 GiB}, {object_store_memory: 3.027344 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
2020-11-20 06:55:03,211 WARNING worker.py:1091 -- The actor or task with ID ffffffffffffffff11b9d6bb08000000 is pending and cannot currently be scheduled. It requires {actor_cpus: 4.000000}, {CPU: 4.000000} for execution and {actor_cpus: 4.000000}, {CPU: 4.000000} for placement, but this node only has remaining {actor_cpus: 4.000000}, {node:172.31.18.42: 1.000000}, {CPU: 4.000000}, {memory: 10.351562 GiB}, {object_store_memory: 3.027344 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
2020-11-20 06:55:03,528 WARNING worker.py:1091 -- The actor or task with ID ffffffffffffffff3904136408000000 is pending and cannot currently be scheduled. It requires {} for execution and {CPU: 1.000000} for placement, but this node only has remaining {actor_cpus: 4.000000}, {node:172.31.21.58: 1.000000}, {CPU: 4.000000}, {memory: 10.351562 GiB}, {object_store_memory: 3.027344 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
_Originally posted by @krfricke in https://github.com/ray-project/ray/issues/12206#issuecomment-731218232_

This is probably the file descriptor limit bug again. Can you try increasing the ulimit as a workaround?
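For reference, a minimal sketch (not from the original thread) of checking and raising the per-process open-file limit from Python as a stand-in for the suggested `ulimit -n` workaround; on a real cluster the equivalent change is usually applied in the nodes' setup/startup commands:

```python
# Minimal sketch (not from the original thread): inspect and raise the soft
# open-file limit for the current process, similar in effect to `ulimit -n`.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Raise the soft limit toward the hard limit; going beyond the hard limit
# requires root or a change to the node's limits configuration.
target = 65536 if hard == resource.RLIM_INFINITY else min(65536, hard)
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    print(f"raised soft open-file limit to {target}")
```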
On Mon, Jan 4, 2021, 4:30 AM Kai Fricke notifications@github.com wrote:
Here is a simple reproduction script. I run it on a cluster of about 200 workers (see cluster yaml).
Usage:
Run `python launch.py`; do this several times (I use `for i in {0..30}; do python launch.py; done`). The error doesn't always come up, but once it comes up, it comes up every time I launch the script again. Sometimes I was able to run `launch.py` dozens of times without errors, but errors came up once I changed the sleep seconds (e.g. `actor.sleep.remote(0.6)`) - this might be a coincidence though. Once the error comes up, sometimes old logs are re-printed during the run (this is why I include a unique run ID in the script).
Happy to go on a short call to show the error in case you can't immediately replicate it.
`launch.py` / `cluster.yaml`:
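The `launch.py` and `cluster.yaml` contents are not reproduced above. Purely as an illustration of the pattern described (many actors, a short `sleep` call on each, a unique run ID in the output), here is a hypothetical sketch; it is not the original script:

```python
# Hypothetical sketch only -- NOT the original launch.py from the issue.
# Illustrates the described pattern: connect to the running cluster, start
# many actors, call a short sleep on each, and tag output with a run ID.
import time
import uuid

import ray


@ray.remote(num_cpus=1)
class Sleeper:
    def sleep(self, seconds: float) -> str:
        time.sleep(seconds)
        return "done"


if __name__ == "__main__":
    run_id = uuid.uuid4().hex[:8]  # unique run ID to spot re-printed old logs
    ray.init(address="auto")  # attach to the existing cluster

    actors = [Sleeper.remote() for _ in range(800)]  # e.g. 200 nodes x 4 workers
    ray.get([actor.sleep.remote(0.6) for actor in actors])
    print(f"[{run_id}] all actors finished")
```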