[Bug] Ray k8s operator got stuck in a corrupted state

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Clusters

What happened + What you expected to happen

My Ray cluster is running on GKE using the Ray k8s operator. After running reliably for a week, the Ray operator got stuck in a corrupted state and couldn’t recover by itself. The following is the error message I found in the Ray operator logs. I had to manually restart the operator to get the cluster working again.

Demands:
 (no resource demands)
ray-playground,ray-playground:2022-02-16 04:51:40,678	DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 5 nodes\n - MostDelayedHeartbeats: {'10.169.0.134': 0.3025646209716797, '10.169.0.198': 0.3025240898132324, '10.169.1.3': 0.3024895191192627, '10.169.1.131': 0.3024570941925049, '10.169.1.67': 0.30243349075317383}\n - NodeIdleSeconds: Min=29487 Mean=29487 Max=29487\n - ResourceUsage: 0.0/40.0 CPU, 0.0/20.0 GPU, 0.0/5.0 accelerator_type:T4, 0.0 GiB/126.0 GiB memory, 0.0 GiB/53.84 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - rayWorkerType: 5" True None
ray-playground,ray-playground:2022-02-16 04:51:40,679	DEBUG legacy_info_string.py:24 -- Cluster status: 5 nodes
 - MostDelayedHeartbeats: {'10.169.0.134': 0.3025646209716797, '10.169.0.198': 0.3025240898132324, '10.169.1.3': 0.3024895191192627, '10.169.1.131': 0.3024570941925049, '10.169.1.67': 0.30243349075317383}
 - NodeIdleSeconds: Min=29487 Mean=29487 Max=29487
 - ResourceUsage: 0.0/40.0 CPU, 0.0/20.0 GPU, 0.0/5.0 accelerator_type:T4, 0.0 GiB/126.0 GiB memory, 0.0 GiB/53.84 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
 - rayWorkerType: 5
ray-playground,ray-playground:2022-02-16 04:51:40,831	DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-46zv6 is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,853	DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-l2g29 is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,876	DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-lnnzj is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,899	DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-lxhcg is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,920	DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-snrvl is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,941	DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-46zv6 is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,956	DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-l2g29 is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,970	DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-lnnzj is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,984	DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-lxhcg is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,997	DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-snrvl is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:41,153	DEBUG resource_demand_scheduler.py:189 -- Cluster resources: [{'object_store_memory': 6413522534.0, 'memory': 15032385536.0, 'node:10.169.1.67': 1.0}, {'memory': 24051816857.0, 'CPU': 8.0, 'object_store_memory': 10280468889.0, 'accelerator_type:T4': 1.0, 'node:10.169.0.134': 1.0, 'GPU': 4.0}, {'object_store_memory': 10280514355.0, 'memory': 24051816857.0, 'GPU': 4.0, 'node:10.169.0.198': 1.0, 'accelerator_type:T4': 1.0, 'CPU': 8.0}, {'CPU': 8.0, 'GPU': 4.0, 'memory': 24051816857.0, 'object_store_memory': 10280192409.0, 'node:10.169.1.3': 1.0, 'accelerator_type:T4': 1.0}, {'GPU': 4.0, 'object_store_memory': 10280294400.0, 'accelerator_type:T4': 1.0, 'memory': 24051816857.0, 'node:10.169.1.131': 1.0, 'CPU': 8.0}, {'CPU': 8.0, 'object_store_memory': 10280339865.0, 'node:10.169.1.195': 1.0, 'accelerator_type:T4': 1.0, 'GPU': 4.0, 'memory': 24051816857.0}]
ray-playground,ray-playground:2022-02-16 04:51:41,153	DEBUG resource_demand_scheduler.py:190 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1, 'rayWorkerType': 5})
ray-playground,ray-playground:2022-02-16 04:51:41,153	DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
ray-playground,ray-playground:2022-02-16 04:51:41,153	DEBUG resource_demand_scheduler.py:247 -- Resource demands: []
ray-playground,ray-playground:2022-02-16 04:51:41,153	DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: []
ray-playground,ray-playground:2022-02-16 04:51:41,153	DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: []
ray-playground,ray-playground:2022-02-16 04:51:41,189	ERROR autoscaler.py:267 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 353, in connect
    conn = self._new_conn()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fb04c2833d0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 264, in update
    self._update()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 327, in _update
    ensure_min_cluster_size=self.load_metrics.
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py", line 269, in get_nodes_to_launch
    placement_groups_nodes_max_limit)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py", line 425, in _get_concurrent_resource_demand_to_launch
    non_terminated_nodes, connected_nodes,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py", line 463, in _separate_running_and_pending_nodes
    node_ip = self.provider.internal_ip(node_id)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/_kubernetes/node_provider.py", line 88, in internal_ip
    pod = core_api().read_namespaced_pod(node_id, self.namespace)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23084, in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23185, in read_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    headers=headers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 244, in GET
    query_params=query_params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 217, in request
    headers=headers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 75, in request
    method, url, fields=fields, headers=headers, **urlopen_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 96, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/poolmanager.py", line 375, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
    **response_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
    **response_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
    **response_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='10.160.216.1', port=443): Max retries exceeded with url: /api/v1/namespaces/ray-playground/pods/ray-playground-ray-head-type-2n8gb (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fb04c2833d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
ray-playground,ray-playground:2022-02-16 04:51:41,197	ERROR monitor.py:394 -- Error in monitor loop
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 353, in connect
    conn = self._new_conn()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fb04fc58f90>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 435, in run
    self._run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 325, in _run
    status["autoscaler_report"] = asdict(self.autoscaler.summary())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 1196, in summary
    ip = self.provider.internal_ip(node_id)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/_kubernetes/node_provider.py", line 88, in internal_ip
    pod = core_api().read_namespaced_pod(node_id, self.namespace)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23084, in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23185, in read_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    headers=headers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 244, in GET
    query_params=query_params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 217, in request
    headers=headers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 75, in request
    method, url, fields=fields, headers=headers, **urlopen_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 96, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/poolmanager.py", line 375, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
    **response_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
    **response_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
    **response_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='10.160.216.1', port=443): Max retries exceeded with url: /api/v1/namespaces/ray-playground/pods/ray-playground-ray-head-type-2n8gb (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fb04fc58f90>: Failed to establish a new connection: [Errno 111] Connection refused'))
ray-playground,ray-playground:2022-02-16 04:51:41,198	DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_error' b'The autoscaler failed with the following error:\nTraceback (most recent call last):\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn\n    (self._dns_host, self.port), self.timeout, **extra_kw\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection\n    raise err\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection\n    sock.connect(sa)\nConnectionRefusedError: [Errno 111] Connection refused\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen\n    chunked=chunked,\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 382, in _make_request\n    self._validate_conn(conn)\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn\n    conn.connect()\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 353, in connect\n    conn = self._new_conn()\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn\n    self, "Failed to establish a new connection: %s" % e\nurllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fb04fc58f90>: Failed to establish a new connection: [Errno 111] Connection refused\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 435, in run\n    self._run()\n  File 
"/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 325, in _run\n    status["autoscaler_report"] = asdict(self.autoscaler.summary())\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 1196, in summary\n    ip = self.provider.internal_ip(node_id)\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/_kubernetes/node_provider.py", line 88, in internal_ip\n    pod = core_api().read_namespaced_pod(node_id, self.namespace)\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23084, in read_namespaced_pod\n    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23185, in read_namespaced_pod_with_http_info\n    collection_formats=collection_formats)\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api\n    _preload_content, _request_timeout, _host)\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api\n    _request_timeout=_request_timeout)\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request\n    headers=headers)\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 244, in GET\n    query_params=query_params)\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 217, in request\n    headers=headers)\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 75, in request\n    method, url, fields=fields, headers=headers, **urlopen_kw\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 96, in request_encode_url\n    return self.urlopen(method, url, **extra_kw)\n  File 
"/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/poolmanager.py", line 375, in urlopen\n    response = conn.urlopen(method, u.request_uri, **kw)\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen\n    **response_kw\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen\n    **response_kw\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen\n    **response_kw\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen\n    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]\n  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/retry.py", line 574, in increment\n    raise MaxRetryError(_pool, url, error or ResponseError(cause))\nurllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host=\'10.160.216.1\', port=443): Max retries exceeded with url: /api/v1/namespaces/ray-playground/pods/ray-playground-ray-head-type-2n8gb (Caused by NewConnectionError(\'<urllib3.connection.HTTPSConnection object at 0x7fb04fc58f90>: Failed to establish a new connection: [Errno 111] Connection refused\'))\n' True None
Process ray-playground,ray-playground:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 353, in connect
    conn = self._new_conn()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fb04fc58f90>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:
2022-02-13 12:30:19,945	INFO commands.py:230 -- Cluster: ray-playground
2022-02-13 12:30:19,966	INFO commands.py:293 -- Checking Kubernetes environment settings
2022-02-13 12:30:20,133	INFO commands.py:587 -- Cluster Ray runtime will not be restarted due to `--no-restart`.
2022-02-13 12:30:20,134	INFO commands.py:592 -- Updating cluster configuration and running setup commands. Confirm [y/N]: y [automatic, due to --yes]
2022-02-13 12:30:20,141	INFO commands.py:658 -- <1/1> Setting up head node
2022-02-13 12:30:20,163	INFO updater.py:296 -- New status: waiting-for-ssh
2022-02-13 12:30:20,163	INFO updater.py:241 -- [1/7] Waiting for SSH to become available
2022-02-13 12:30:20,164	INFO updater.py:244 -- Running `uptime` as a test.
2022-02-13 12:30:20,901	SUCC updater.py:257 -- Success.
2022-02-13 12:30:20,901	INFO log_timer.py:27 -- NodeUpdater: ray-playground-ray-head-type-2n8gb: Got remote shell  [LogTimer=738ms]
2022-02-13 12:30:20,912	INFO updater.py:339 -- [2-6/7] Configuration already up to date, skipping file mounts, initalization and setup commands.
2022-02-13 12:30:20,912	INFO updater.py:450 -- [7/7] Starting the Ray runtime
2022-02-13 12:30:20,912	INFO log_timer.py:27 -- NodeUpdater: ray-playground-ray-head-type-2n8gb: Ray start commands succeeded [LogTimer=0ms]
2022-02-13 12:30:20,912	INFO log_timer.py:27 -- NodeUpdater: ray-playground-ray-head-type-2n8gb: Applied config 1fbdf091da83aaf45ad05cc74e125a870efdef8c  [LogTimer=771ms]
2022-02-13 12:30:20,932	INFO updater.py:167 -- New status: up-to-date
2022-02-13 12:30:20,943	INFO commands.py:739 -- Useful commands
2022-02-13 12:30:20,943	INFO commands.py:741 -- Monitor autoscaling with
2022-02-13 12:30:20,943	INFO commands.py:744 --   ray exec /home/ray/ray_cluster_configs/ray-playground/ray-playground_config.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 435, in run
    self._run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 325, in _run
    status["autoscaler_report"] = asdict(self.autoscaler.summary())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 1196, in summary
    ip = self.provider.internal_ip(node_id)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/_kubernetes/node_provider.py", line 88, in internal_ip
    pod = core_api().read_namespaced_pod(node_id, self.namespace)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23084, in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23185, in read_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    headers=headers)
2022-02-13 12:30:20,943	INFO commands.py:746 -- Connect to a terminal on the cluster head:
2022-02-13 12:30:20,943	INFO commands.py:748 --   ray attach /home/ray/ray_cluster_configs/ray-playground/ray-playground_config.yaml
2022-02-13 12:30:20,943	INFO commands.py:751 -- Get a remote shell to the cluster manually:
2022-02-13 12:30:20,943	INFO commands.py:752 --   kubectl -n ray-playground exec -it ray-playground-ray-head-type-2n8gb -- bash
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 244, in GET
    query_params=query_params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 217, in request
    headers=headers)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 75, in request
    method, url, fields=fields, headers=headers, **urlopen_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 96, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/poolmanager.py", line 375, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
    **response_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
    **response_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
    **response_kw
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='10.160.216.1', port=443): Max retries exceeded with url: /api/v1/namespaces/ray-playground/pods/ray-playground-ray-head-type-2n8gb (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fb04fc58f90>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 77, in _create_or_update
    self.start_monitor()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 117, in start_monitor
    mtr.run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 437, in run
    self._handle_failure(traceback.format_exc())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 409, in _handle_failure
    if args.gcs_address:
NameError: name 'args' is not defined
Request attempt #1/9 failed; will retry: GET https://10.160.216.1:443/apis/cluster.ray.io/v1/namespaces/ray-playground/rayclusters?watch=true&resourceVersion=30245324 -> ClientConnectorError(ConnectionKey(host='10.160.216.1', port=443, is_ssl=True, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=8573530917045623554), ConnectionRefusedError(111, "Connect call failed ('10.160.216.1', 443)"))
Request attempt #2/9 failed; will retry: GET https://10.160.216.1:443/apis/cluster.ray.io/v1/namespaces/ray-playground/rayclusters?watch=true&resourceVersion=30245324 -> ClientConnectorError(ConnectionKey(host='10.160.216.1', port=443, is_ssl=True, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=8573530917045623554), ConnectionRefusedError(111, "Connect call failed ('10.160.216.1', 443)"))
Request attempt #3/9 failed; will retry: GET https://10.160.216.1:443/apis/cluster.ray.io/v1/namespaces/ray-playground/rayclusters?watch=true&resourceVersion=30245324 -> ClientConnectorError(ConnectionKey(host='10.160.216.1', port=443, is_ssl=True, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=8573530917045623554), TimeoutError(110, "Connect call failed ('10.160.216.1', 443)"))
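
The log above shows a refused connection to the Kubernetes API server at 10.160.216.1:443 surfacing as a MaxRetryError, after which the monitor's failure handler itself raised a NameError and the monitor process exited. As a minimal sketch of one way to tolerate a short API server outage, the helper below retries a call with exponential backoff. The names call_with_backoff and make_flaky_call are hypothetical and exist only for illustration; they are not part of Ray or the Kubernetes client. The sketch catches Python's built-in ConnectionError and TimeoutError, whereas a real wrapper would need to catch whatever the client library actually raises (the log shows urllib3.exceptions.MaxRetryError).

```python
import logging
import time

logger = logging.getLogger(__name__)


def call_with_backoff(fn, retries=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying connection-level errors with exponential backoff.

    Hypothetical helper for illustration only; not a Ray or Kubernetes API.
    """
    delay = base_delay
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except (ConnectionError, TimeoutError) as exc:
            if attempt == retries:
                raise  # out of attempts: surface the original error
            logger.warning(
                "attempt %d/%d failed (%r); retrying in %.1fs",
                attempt, retries, exc, delay,
            )
            sleep(delay)
            delay = min(delay * 2, max_delay)  # double the wait, up to a cap


def make_flaky_call(failures):
    """Return a stand-in for an API call that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    def call():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise ConnectionRefusedError(111, "Connection refused")
        return "ok"

    return call
```

For example, call_with_backoff(make_flaky_call(2)) sleeps twice and then returns the stand-in result, while a call that keeps failing re-raises the last connection error once the retry budget is spent. This only illustrates the retry pattern; according to the maintainers' comments below, the issue itself goes away after switching to KubeRay.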

Versions / Dependencies

  • ray==1.10.0 / Python 3.7
  • Ray k8s operator is running on image rayproject/ray:1.10.0
  • GKE master version: 1.21.5-gke.1802
  • GKE node version: 1.21.5-gke.1302

Reproduction script

Start a Ray cluster on GKE using the Ray k8s operator, and let it run for a couple of days.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 3
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

1 reaction
DmitriGekhtman commented, Jun 23, 2022

The current instructions can be found here: https://ray-project.github.io/kuberay/guidance/autoscaler/

1 reaction
DmitriGekhtman commented, Mar 11, 2022

"@DmitriGekhtman will this issue go away if we switch to using KubeRay?"

Yes.

If you have any problems, feel free to discuss on the #kuberay channel in the Ray Slack and/or the KubeRay GitHub!
