[Bug] Ray k8s operator got stuck in a corrupted state
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Clusters
What happened + What you expected to happen
My Ray cluster runs on GKE using the Ray k8s operator. After running reliably for a week, the Ray operator got stuck in a corrupted state and could not recover on its own. The error output below is from the Ray operator logs. I had to restart the operator manually to get the cluster working again.
Demands:
(no resource demands)
ray-playground,ray-playground:2022-02-16 04:51:40,678 DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 5 nodes\n - MostDelayedHeartbeats: {'10.169.0.134': 0.3025646209716797, '10.169.0.198': 0.3025240898132324, '10.169.1.3': 0.3024895191192627, '10.169.1.131': 0.3024570941925049, '10.169.1.67': 0.30243349075317383}\n - NodeIdleSeconds: Min=29487 Mean=29487 Max=29487\n - ResourceUsage: 0.0/40.0 CPU, 0.0/20.0 GPU, 0.0/5.0 accelerator_type:T4, 0.0 GiB/126.0 GiB memory, 0.0 GiB/53.84 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - rayWorkerType: 5" True None
ray-playground,ray-playground:2022-02-16 04:51:40,679 DEBUG legacy_info_string.py:24 -- Cluster status: 5 nodes
- MostDelayedHeartbeats: {'10.169.0.134': 0.3025646209716797, '10.169.0.198': 0.3025240898132324, '10.169.1.3': 0.3024895191192627, '10.169.1.131': 0.3024570941925049, '10.169.1.67': 0.30243349075317383}
- NodeIdleSeconds: Min=29487 Mean=29487 Max=29487
- ResourceUsage: 0.0/40.0 CPU, 0.0/20.0 GPU, 0.0/5.0 accelerator_type:T4, 0.0 GiB/126.0 GiB memory, 0.0 GiB/53.84 GiB object_store_memory
- TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
- rayWorkerType: 5
ray-playground,ray-playground:2022-02-16 04:51:40,831 DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-46zv6 is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,853 DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-l2g29 is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,876 DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-lnnzj is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,899 DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-lxhcg is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,920 DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-snrvl is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,941 DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-46zv6 is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,956 DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-l2g29 is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,970 DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-lnnzj is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,984 DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-lxhcg is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:40,997 DEBUG autoscaler.py:1148 -- ray-playground-ray-worker-type-snrvl is not being updated and passes config check (can_update=True).
ray-playground,ray-playground:2022-02-16 04:51:41,153 DEBUG resource_demand_scheduler.py:189 -- Cluster resources: [{'object_store_memory': 6413522534.0, 'memory': 15032385536.0, 'node:10.169.1.67': 1.0}, {'memory': 24051816857.0, 'CPU': 8.0, 'object_store_memory': 10280468889.0, 'accelerator_type:T4': 1.0, 'node:10.169.0.134': 1.0, 'GPU': 4.0}, {'object_store_memory': 10280514355.0, 'memory': 24051816857.0, 'GPU': 4.0, 'node:10.169.0.198': 1.0, 'accelerator_type:T4': 1.0, 'CPU': 8.0}, {'CPU': 8.0, 'GPU': 4.0, 'memory': 24051816857.0, 'object_store_memory': 10280192409.0, 'node:10.169.1.3': 1.0, 'accelerator_type:T4': 1.0}, {'GPU': 4.0, 'object_store_memory': 10280294400.0, 'accelerator_type:T4': 1.0, 'memory': 24051816857.0, 'node:10.169.1.131': 1.0, 'CPU': 8.0}, {'CPU': 8.0, 'object_store_memory': 10280339865.0, 'node:10.169.1.195': 1.0, 'accelerator_type:T4': 1.0, 'GPU': 4.0, 'memory': 24051816857.0}]
ray-playground,ray-playground:2022-02-16 04:51:41,153 DEBUG resource_demand_scheduler.py:190 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1, 'rayWorkerType': 5})
ray-playground,ray-playground:2022-02-16 04:51:41,153 DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
ray-playground,ray-playground:2022-02-16 04:51:41,153 DEBUG resource_demand_scheduler.py:247 -- Resource demands: []
ray-playground,ray-playground:2022-02-16 04:51:41,153 DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: []
ray-playground,ray-playground:2022-02-16 04:51:41,153 DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: []
ray-playground,ray-playground:2022-02-16 04:51:41,189 ERROR autoscaler.py:267 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection
raise err
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 382, in _make_request
self._validate_conn(conn)
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
conn.connect()
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 353, in connect
conn = self._new_conn()
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fb04c2833d0>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 264, in update
self._update()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 327, in _update
ensure_min_cluster_size=self.load_metrics.
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py", line 269, in get_nodes_to_launch
placement_groups_nodes_max_limit)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py", line 425, in _get_concurrent_resource_demand_to_launch
non_terminated_nodes, connected_nodes,
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py", line 463, in _separate_running_and_pending_nodes
node_ip = self.provider.internal_ip(node_id)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/_kubernetes/node_provider.py", line 88, in internal_ip
pod = core_api().read_namespaced_pod(node_id, self.namespace)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23084, in read_namespaced_pod
return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs) # noqa: E501
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23185, in read_namespaced_pod_with_http_info
collection_formats=collection_formats)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
_preload_content, _request_timeout, _host)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
_request_timeout=_request_timeout)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
headers=headers)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 244, in GET
query_params=query_params)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 217, in request
headers=headers)
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 75, in request
method, url, fields=fields, headers=headers, **urlopen_kw
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 96, in request_encode_url
return self.urlopen(method, url, **extra_kw)
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/poolmanager.py", line 375, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
**response_kw
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
**response_kw
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
**response_kw
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/retry.py", line 574, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='10.160.216.1', port=443): Max retries exceeded with url: /api/v1/namespaces/ray-playground/pods/ray-playground-ray-head-type-2n8gb (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fb04c2833d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
ray-playground,ray-playground:2022-02-16 04:51:41,197 ERROR monitor.py:394 -- Error in monitor loop
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection
raise err
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 382, in _make_request
self._validate_conn(conn)
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
conn.connect()
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 353, in connect
conn = self._new_conn()
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fb04fc58f90>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 435, in run
self._run()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 325, in _run
status["autoscaler_report"] = asdict(self.autoscaler.summary())
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 1196, in summary
ip = self.provider.internal_ip(node_id)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/_kubernetes/node_provider.py", line 88, in internal_ip
pod = core_api().read_namespaced_pod(node_id, self.namespace)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23084, in read_namespaced_pod
return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs) # noqa: E501
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23185, in read_namespaced_pod_with_http_info
collection_formats=collection_formats)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
_preload_content, _request_timeout, _host)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
_request_timeout=_request_timeout)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
headers=headers)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 244, in GET
query_params=query_params)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 217, in request
headers=headers)
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 75, in request
method, url, fields=fields, headers=headers, **urlopen_kw
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 96, in request_encode_url
return self.urlopen(method, url, **extra_kw)
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/poolmanager.py", line 375, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
**response_kw
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
**response_kw
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
**response_kw
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/retry.py", line 574, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='10.160.216.1', port=443): Max retries exceeded with url: /api/v1/namespaces/ray-playground/pods/ray-playground-ray-head-type-2n8gb (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fb04fc58f90>: Failed to establish a new connection: [Errno 111] Connection refused'))
ray-playground,ray-playground:2022-02-16 04:51:41,198 DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_error' b'The autoscaler failed with the following error:\nTraceback (most recent call last):\n File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn\n (self._dns_host, self.port), self.timeout, **extra_kw\n File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection\n raise err\n File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection\n sock.connect(sa)\nConnectionRefusedError: [Errno 111] Connection refused\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen\n chunked=chunked,\n File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 382, in _make_request\n self._validate_conn(conn)\n File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn\n conn.connect()\n File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 353, in connect\n conn = self._new_conn()\n File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn\n self, "Failed to establish a new connection: %s" % e\nurllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fb04fc58f90>: Failed to establish a new connection: [Errno 111] Connection refused\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 435, in run\n self._run()\n File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 325, in _run\n 
status["autoscaler_report"] = asdict(self.autoscaler.summary())\n File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 1196, in summary\n ip = self.provider.internal_ip(node_id)\n File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/_kubernetes/node_provider.py", line 88, in internal_ip\n pod = core_api().read_namespaced_pod(node_id, self.namespace)\n File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23084, in read_namespaced_pod\n return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs) # noqa: E501\n File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23185, in read_namespaced_pod_with_http_info\n collection_formats=collection_formats)\n File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api\n _preload_content, _request_timeout, _host)\n File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api\n _request_timeout=_request_timeout)\n File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request\n headers=headers)\n File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 244, in GET\n query_params=query_params)\n File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 217, in request\n headers=headers)\n File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 75, in request\n method, url, fields=fields, headers=headers, **urlopen_kw\n File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 96, in request_encode_url\n return self.urlopen(method, url, **extra_kw)\n File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/poolmanager.py", line 375, in urlopen\n response = conn.urlopen(method, u.request_uri, **kw)\n File 
"/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen\n **response_kw\n File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen\n **response_kw\n File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen\n **response_kw\n File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen\n method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]\n File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/retry.py", line 574, in increment\n raise MaxRetryError(_pool, url, error or ResponseError(cause))\nurllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host=\'10.160.216.1\', port=443): Max retries exceeded with url: /api/v1/namespaces/ray-playground/pods/ray-playground-ray-head-type-2n8gb (Caused by NewConnectionError(\'<urllib3.connection.HTTPSConnection object at 0x7fb04fc58f90>: Failed to establish a new connection: [Errno 111] Connection refused\'))\n' True None
Process ray-playground,ray-playground:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection
raise err
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 382, in _make_request
self._validate_conn(conn)
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
conn.connect()
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 353, in connect
conn = self._new_conn()
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fb04fc58f90>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
2022-02-13 12:30:19,945 INFO commands.py:230 -- Cluster: ray-playground
2022-02-13 12:30:19,966 INFO commands.py:293 -- Checking Kubernetes environment settings
2022-02-13 12:30:20,133 INFO commands.py:587 -- Cluster Ray runtime will not be restarted due to `--no-restart`.
2022-02-13 12:30:20,134 INFO commands.py:592 -- Updating cluster configuration and running setup commands. Confirm [y/N]: y [automatic, due to --yes]
2022-02-13 12:30:20,141 INFO commands.py:658 -- <1/1> Setting up head node
2022-02-13 12:30:20,163 INFO updater.py:296 -- New status: waiting-for-ssh
2022-02-13 12:30:20,163 INFO updater.py:241 -- [1/7] Waiting for SSH to become available
2022-02-13 12:30:20,164 INFO updater.py:244 -- Running `uptime` as a test.
2022-02-13 12:30:20,901 SUCC updater.py:257 -- Success.
2022-02-13 12:30:20,901 INFO log_timer.py:27 -- NodeUpdater: ray-playground-ray-head-type-2n8gb: Got remote shell [LogTimer=738ms]
2022-02-13 12:30:20,912 INFO updater.py:339 -- [2-6/7] Configuration already up to date, skipping file mounts, initalization and setup commands.
2022-02-13 12:30:20,912 INFO updater.py:450 -- [7/7] Starting the Ray runtime
2022-02-13 12:30:20,912 INFO log_timer.py:27 -- NodeUpdater: ray-playground-ray-head-type-2n8gb: Ray start commands succeeded [LogTimer=0ms]
2022-02-13 12:30:20,912 INFO log_timer.py:27 -- NodeUpdater: ray-playground-ray-head-type-2n8gb: Applied config 1fbdf091da83aaf45ad05cc74e125a870efdef8c [LogTimer=771ms]
2022-02-13 12:30:20,932 INFO updater.py:167 -- New status: up-to-date
2022-02-13 12:30:20,943 INFO commands.py:739 -- Useful commands
2022-02-13 12:30:20,943 INFO commands.py:741 -- Monitor autoscaling with
2022-02-13 12:30:20,943 INFO commands.py:744 -- ray exec /home/ray/ray_cluster_configs/ray-playground/ray-playground_config.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 435, in run
self._run()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 325, in _run
status["autoscaler_report"] = asdict(self.autoscaler.summary())
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 1196, in summary
ip = self.provider.internal_ip(node_id)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/_kubernetes/node_provider.py", line 88, in internal_ip
pod = core_api().read_namespaced_pod(node_id, self.namespace)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23084, in read_namespaced_pod
return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs) # noqa: E501
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 23185, in read_namespaced_pod_with_http_info
collection_formats=collection_formats)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
_preload_content, _request_timeout, _host)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
_request_timeout=_request_timeout)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
headers=headers)
2022-02-13 12:30:20,943 INFO commands.py:746 -- Connect to a terminal on the cluster head:
2022-02-13 12:30:20,943 INFO commands.py:748 -- ray attach /home/ray/ray_cluster_configs/ray-playground/ray-playground_config.yaml
2022-02-13 12:30:20,943 INFO commands.py:751 -- Get a remote shell to the cluster manually:
2022-02-13 12:30:20,943 INFO commands.py:752 -- kubectl -n ray-playground exec -it ray-playground-ray-head-type-2n8gb -- bash
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 244, in GET
query_params=query_params)
File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 217, in request
headers=headers)
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 75, in request
method, url, fields=fields, headers=headers, **urlopen_kw
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/request.py", line 96, in request_encode_url
return self.urlopen(method, url, **extra_kw)
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/poolmanager.py", line 375, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
**response_kw
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
**response_kw
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 796, in urlopen
**response_kw
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/home/ray/anaconda3/lib/python3.7/site-packages/urllib3/util/retry.py", line 574, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='10.160.216.1', port=443): Max retries exceeded with url: /api/v1/namespaces/ray-playground/pods/ray-playground-ray-head-type-2n8gb (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fb04fc58f90>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 77, in _create_or_update
self.start_monitor()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/ray_operator/operator.py", line 117, in start_monitor
mtr.run()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 437, in run
self._handle_failure(traceback.format_exc())
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 409, in _handle_failure
if args.gcs_address:
NameError: name 'args' is not defined
Request attempt #1/9 failed; will retry: GET https://10.160.216.1:443/apis/cluster.ray.io/v1/namespaces/ray-playground/rayclusters?watch=true&resourceVersion=30245324 -> ClientConnectorError(ConnectionKey(host='10.160.216.1', port=443, is_ssl=True, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=8573530917045623554), ConnectionRefusedError(111, "Connect call failed ('10.160.216.1', 443)"))
Request attempt #2/9 failed; will retry: GET https://10.160.216.1:443/apis/cluster.ray.io/v1/namespaces/ray-playground/rayclusters?watch=true&resourceVersion=30245324 -> ClientConnectorError(ConnectionKey(host='10.160.216.1', port=443, is_ssl=True, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=8573530917045623554), ConnectionRefusedError(111, "Connect call failed ('10.160.216.1', 443)"))
Request attempt #3/9 failed; will retry: GET https://10.160.216.1:443/apis/cluster.ray.io/v1/namespaces/ray-playground/rayclusters?watch=true&resourceVersion=30245324 -> ClientConnectorError(ConnectionKey(host='10.160.216.1', port=443, is_ssl=True, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=8573530917045623554), TimeoutError(110, "Connect call failed ('10.160.216.1', 443)"))
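The last traceback in the log above ends with `NameError: name 'args' is not defined`, raised from `_handle_failure` while the monitor was already handling the `MaxRetryError` from the Kubernetes API. The snippet below is only a minimal sketch of that failure pattern, not Ray's actual code: `MonitorSketch`, `_handle_failure_buggy`, `_handle_failure_fixed`, and `outcome` are hypothetical names, and the idea that the handler should read the address from the instance instead of a module-level `args` is an assumption about one possible fix.

```python
import traceback


class MonitorSketch:
    """Hypothetical stand-in for a monitor loop; not Ray's actual class."""

    def __init__(self, gcs_address=None):
        self.gcs_address = gcs_address
        self.reported = []  # errors the failure handler managed to record

    def _run(self):
        # Simulates the Kubernetes API server refusing the connection.
        raise ConnectionRefusedError(111, "Connection refused")

    def _handle_failure_buggy(self, error):
        # `args` is never defined in this module, so evaluating it raises
        # NameError before the original error can be recorded.
        if args.gcs_address:  # noqa: F821
            self.reported.append(error)

    def _handle_failure_fixed(self, error):
        # Assumed fix: read the address from the instance instead.
        if self.gcs_address:
            self.reported.append(error)

    def run(self, handler):
        try:
            self._run()
        except Exception:
            handler(traceback.format_exc())
            raise  # re-raise the original error once it has been recorded


def outcome(monitor, handler):
    """Return the name of the exception type that escapes monitor.run()."""
    try:
        monitor.run(handler)
    except Exception as exc:
        return type(exc).__name__
    return None


buggy_monitor = MonitorSketch("10.0.0.1:6379")
buggy_result = outcome(buggy_monitor, buggy_monitor._handle_failure_buggy)

fixed_monitor = MonitorSketch("10.0.0.1:6379")
fixed_result = outcome(fixed_monitor, fixed_monitor._handle_failure_fixed)

print(buggy_result, len(buggy_monitor.reported))
print(fixed_result, len(fixed_monitor.reported))
```

In the buggy variant the process dies with a `NameError` and nothing is recorded, which matches the shape of the traceback above. In the fixed variant the connection error is recorded and then re-raised, so a supervising process would at least see the real cause.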
Versions / Dependencies
- ray==1.10.0 / Python 3.7
- Ray k8s operator image: rayproject/ray:1.10.0
- GKE master version: 1.21.5-gke.1802
- GKE node version: 1.21.5-gke.1302
Reproduction script
Start a Ray cluster on GKE using the Ray k8s operator and let it run for a couple of days.
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- State:
- Created 2 years ago
- Reactions: 3
- Comments: 13 (13 by maintainers)
Top GitHub Comments
The current instructions can be found here: https://ray-project.github.io/kuberay/guidance/autoscaler/
Yes.
If you have any problems, feel free to discuss on the #kuberay channel in the Ray Slack and/or the KubeRay GitHub!