Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[core] Failed to add any wildcard listeners RPC when starting many workers on cluster

See original GitHub issue

Cluster config: https://github.com/ray-project/xgboost_ray/blob/master/xgboost_ray/tests/release/cpu_large_scale.yaml

200 nodes, 4 workers each:

I get this sometimes after starting a benchmark script. It might be a general issue that comes up with many nodes:

(base) root@ip-172-31-30-231:/release_tests# python benchmark_cpu_gpu.py 200 10 8000
2020-11-20 06:55:00,056 INFO worker.py:651 -- Connecting to existing Ray cluster at address: 172.31.30.231:6379
2020-11-20 06:55:01,352 WARNING worker.py:1091 -- The actor or task with ID ffffffffffffffffbbb9cbf608000000 is pending and cannot currently be scheduled. It requires {} for execution and {CPU: 1.000000} for placement, but this node only has remaining {actor_cpus: 4.000000}, {node:172.31.26.235: 1.000000}, {object_store_memory: 3.076172 GiB}, {CPU: 4.000000}, {memory: 10.449219 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
2020-11-20 06:55:01,517 WARNING worker.py:1091 -- The actor or task with ID ffffffffffffffffc23e60dd08000000 is pending and cannot currently be scheduled. It requires {} for execution and {CPU: 1.000000} for placement, but this node only has remaining {actor_cpus: 4.000000}, {node:172.31.18.252: 1.000000}, {CPU: 4.000000}, {memory: 10.351562 GiB}, {object_store_memory: 3.027344 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
(pid=raylet, ip=172.31.19.101) E1120 06:55:02.190124912    1060 server_chttp2.cc:40]        {"created":"@1605884102.190040718","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":394,"referenced_errors":[{"created":"@1605884102.190038437","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":341,"referenced_errors":[{"created":"@1605884102.189969464","description":"Unable to configure socket","fd":32,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":208,"referenced_errors":[{"created":"@1605884102.189960718","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1605884102.190037585","description":"Unable to configure socket","fd":32,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":208,"referenced_errors":[{"created":"@1605884102.190033903","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(pid=raylet, ip=172.31.19.101) *** Aborted at 1605884102 (unix time) try "date -d @1605884102" if you are using GNU date ***
(pid=raylet, ip=172.31.19.101) PC: @                0x0 (unknown)
(pid=raylet, ip=172.31.19.101) *** SIGSEGV (@0x58) received by PID 1060 (TID 0x7f1f3f4587c0) from PID 88; stack trace: ***
(pid=raylet, ip=172.31.19.101)     @     0x7f1f3f68a3c0 (unknown)
(pid=raylet, ip=172.31.19.101)     @     0x563a035cf0e2 grpc::ServerInterface::RegisteredAsyncRequest::IssueRequest()
(pid=raylet, ip=172.31.19.101)     @     0x563a0325a429 ray::rpc::ObjectManagerService::WithAsyncMethod_Push<>::RequestPush()
(pid=raylet, ip=172.31.19.101)     @     0x563a03269a1b ray::rpc::ServerCallFactoryImpl<>::CreateCall()
(pid=raylet, ip=172.31.19.101)     @     0x563a034ede19 ray::rpc::GrpcServer::Run()
(pid=raylet, ip=172.31.19.101)     @     0x563a0325e59e ray::ObjectManager::StartRpcService()
(pid=raylet, ip=172.31.19.101)     @     0x563a032712bc ray::ObjectManager::ObjectManager()
(pid=raylet, ip=172.31.19.101)     @     0x563a031c0e20 ray::raylet::Raylet::Raylet()             
(pid=raylet, ip=172.31.19.101)     @     0x563a0319796b _ZZ4mainENKUlN3ray6StatusEN5boost8optionalISt13unordered_mapISsSsSt4hashISsESt8equal_toISsESaISt4pairIKSsSsEEEEEE_clES0_SD_
(pid=raylet, ip=172.31.19.101)     @     0x563a03198af1 _ZNSt17_Function_handlerIFvN3ray6StatusERKN5boost8optionalISt13unordered_mapISsSsSt4hashISsESt8equal_toISsESaISt4pairIKSsSsEEEEEEZ4mainEUlS1_SE_E_E9_M_invokeERKSt9_Any_dataS1_SG_
(pid=raylet, ip=172.31.19.101)     @     0x563a03319a2c _ZZN3ray3gcs28ServiceBasedNodeInfoAccessor22AsyncGetInternalConfigERKSt8functionIFvNS_6StatusERKN5boost8optionalISt13unordered_mapISsSsSt4hashISsESt8equal_toISsESaISt4pairIKSsSsEEEEEEEENKUlRKS3_RKNS_3rpc22GetInternalConfigReplyEE_clESO_SS_
(pid=raylet, ip=172.31.19.101)     @     0x563a032cd33f _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc22GetInternalConfigReplyEEZNS4_12GcsRpcClient17GetInternalConfigERKNS4_24GetInternalConfigRequestERKSt8functionIS8_EEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
(pid=raylet, ip=172.31.19.101)     @     0x563a032cd43d ray::rpc::ClientCallImpl<>::OnReplyReceived()
(pid=raylet, ip=172.31.19.101)     @     0x563a031fa800 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
(pid=raylet, ip=172.31.19.101)     @     0x563a0386270f boost::asio::detail::scheduler::do_run_one()
(pid=raylet, ip=172.31.19.101)     @     0x563a03863c11 boost::asio::detail::scheduler::run()                
(pid=raylet, ip=172.31.19.101)     @     0x563a03864c42 boost::asio::io_context::run()
(pid=raylet, ip=172.31.19.101)     @     0x563a03177cbc main                                                                             
(pid=raylet, ip=172.31.19.101)     @     0x7f1f3f4840b3 __libc_start_main
(pid=raylet, ip=172.31.19.101)     @     0x563a0318a621 (unknown)                                                                        
2020-11-20 06:55:03,190 WARNING worker.py:1091 -- The actor or task with ID ffffffffffffffff84eb106208000000 is pending and cannot currently be scheduled. It requires {} for execution and {CPU: 1.000000} for placement, but this node only has remaining {actor_cpus: 4.000000}, {node:172.31.29.90: 1.000000}, {CPU: 4.000000}, {memory: 10.351562 GiB}, {object_store_memory: 3.027344 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
2020-11-20 06:55:03,211 WARNING worker.py:1091 -- The actor or task with ID ffffffffffffffff11b9d6bb08000000 is pending and cannot currently be scheduled. It requires {actor_cpus: 4.000000}, {CPU: 4.000000} for execution and {actor_cpus: 4.000000}, {CPU: 4.000000} for placement, but this node only has remaining {actor_cpus: 4.000000}, {node:172.31.18.42: 1.000000}, {CPU: 4.000000}, {memory: 10.351562 GiB}, {object_store_memory: 3.027344 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
2020-11-20 06:55:03,528 WARNING worker.py:1091 -- The actor or task with ID ffffffffffffffff3904136408000000 is pending and cannot currently be scheduled. It requires {} for execution and {CPU: 1.000000} for placement, but this node only has remaining {actor_cpus: 4.000000}, {node:172.31.21.58: 1.000000}, {CPU: 4.000000}, {memory: 10.351562 GiB}, {object_store_memory: 3.027344 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

_Originally posted by @krfricke in https://github.com/ray-project/ray/issues/12206#issuecomment-731218232_
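For readers landing here from the same trace: the root failure in the raylet log is an ordinary bind() error, errno 98 ("Address already in use"), raised while the raylet's gRPC object manager server tries to grab its port. The snippet below is a minimal standard-library sketch (not Ray code) that reproduces the same errno so the message is easier to recognize; the port is chosen by the OS, so it does not touch any of Ray's ports.

import errno
import socket

# Minimal illustration of the "Address already in use" (errno 98) failure seen
# in the raylet log: once one socket is bound and listening on a port, a second
# bind() to the same address and port fails with EADDRINUSE.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))  # let the OS pick a free port
first.listen()
port = first.getsockname()[1]

second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))  # same port -> bind() raises
except OSError as exc:
    print(exc)  # e.g. [Errno 98] Address already in use
finally:
    second.close()
    first.close()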

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
ericl commented, Jan 4, 2021

This is probably the file descriptor limit bug again. Can you try increasing the ulimit as a workaround?
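Not from the original thread, but for anyone checking this hypothesis: the current per-process descriptor limit can be read (and, up to the hard limit, raised) from Python with the standard resource module. This is only a sketch; the 65536 target simply mirrors the ulimit -n 65536 already used in the cluster YAML below, and raising the limit inside a driver script does not change the limit of an already-running raylet.

import resource

# Inspect the per-process open-file-descriptor limit (the value `ulimit -n` sets).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")

# Hypothetical target, mirroring the `ulimit -n 65536` in the cluster YAML below.
# An unprivileged process can only raise the soft limit up to the hard limit.
target = 65536
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(target, hard), hard))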

On Mon, Jan 4, 2021, 4:30 AM Kai Fricke notifications@github.com wrote:

[Quoted email text omitted; it repeats the reproduction script and cluster YAML that are posted in full in the next comment.]

0 reactions
krfricke commented, Jan 4, 2021

Here is a simple reproduction script. I run it on a cluster of about 200 workers (see cluster yaml).

Usage: run python launch.py several times (I use for i in {0..30}; do python launch.py; done). The error doesn’t always come up, but once it does, it comes up every time I launch the script again.

Sometimes I was able to run launch.py dozens of times without errors, but errors came up once I changed the sleep duration (e.g. actor.sleep.remote(0.6)); this might be a coincidence, though.

Once the error comes up, sometimes old logs are re-printed during the run (this is why I include a unique run ID in the script).

Happy to go on a short call to show the error in case you can’t immediately replicate it.

launch.py

import time
import uuid

import ray

# Connect to the running cluster started by the autoscaler.
ray.init(address="auto")


@ray.remote
class Actor:
    def __init__(self, run_id: str, rank: int):
        self.run_id = run_id
        self.rank = rank

    def sleep(self, seconds: float = 5.):
        time.sleep(seconds)
        print(f"Run ID: {self.run_id}, actor: {self.rank}")


# Unique ID per run so stale logs re-printed from old runs are easy to spot.
run_id = uuid.uuid4()
required_resources = {"actor_cpus": 4}

# 200 actors, each claiming a node's custom actor_cpus resource.
actor_cls = Actor.options(resources=required_resources)
actors = [actor_cls.remote(run_id, i) for i in range(200)]

futures = [actor.sleep.remote(0.5) for actor in actors]
ray.get(futures)
print("Done.")

cluster.yaml:

cluster_name: xgboost_actor_test
min_workers: 202
max_workers: 202
initial_workers: 202
autoscaling_mode: default
docker:
    image: "anyscale/ray-ml:latest"
    container_name: ray_container
    pull_before_run: true
    run_options:
        - --privileged
target_utilization_fraction: 0.8
idle_timeout_minutes: 5
provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a
    cache_stopped_nodes: true
auth:
    ssh_user: ubuntu
head_node:
    InstanceType: m5.xlarge
    ImageId: ami-05ac7a76b4c679a79

worker_nodes:
    InstanceType: m5.xlarge
    ImageId: ami-05ac7a76b4c679a79
    InstanceMarketOptions:
        MarketType: spot

file_mounts: {
  "/cluster-actors": "./"
}
cluster_synced_files: []
file_mounts_sync_continuously: true
initialization_commands: []
setup_commands:
  - pip install -U ray
head_setup_commands: []
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - "ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --resources='{\"actor_cpus\": 0}'"
worker_start_ray_commands:
    - ray stop
    - "ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 --resources='{\"actor_cpus\": 4}'"
metadata:
    anyscale:
        working_dir: "/cluster-actors"
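As a side note (not part of the original thread): a quick way to test the file-descriptor hypothesis on a worker node is to compare the raylet's open descriptor count against its limit via /proc. This is a Linux-only sketch, and the use of pidof to find the raylet process is an assumption about the node image.

import os
import subprocess

def open_fd_count(pid: int) -> int:
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    return len(os.listdir(f"/proc/{pid}/fd"))

# `pidof raylet` is an assumption about the node image; adjust as needed.
raylet_pids = subprocess.check_output(["pidof", "raylet"], text=True).split()
for pid in map(int, raylet_pids):
    with open(f"/proc/{pid}/limits") as limits:
        nofile = next(line for line in limits if line.startswith("Max open files"))
    print(f"raylet pid {pid}: {open_fd_count(pid)} open fds; {nofile.strip()}")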

Read more comments on GitHub >

Top Results From Across the Web

UnknownError: Could not start gRPC server - Stack Overflow
In my case , I find ps raise this error and woker wait for response when I submit a tensorflowonspark job yarn cluster...

The gRPC server program will crash when can't bind ...
a server program which uses gRPC crashed when started: $ ./server 60001 . ... "description":"Failed to add any wildcard listeners", ...

Issue with ray cluster in Red hat machine
I have two node ray cluster version 1.12.1. On red hat machines. When starting the worker node. The dashboard disappears and i get...

Configuring Infinispan caches
Infinispan replicates all cache entries on all nodes in a cluster and ... Set the minimum number of nodes required before caches start...

Configuration Files | EMQX 5.0 Documentation
Default authentication configs for all MQTT listeners. For per-listener overrides see ... node: node. cluster: cluster. log: log. rpc: rpc. broker: broker.
