AKS reliability issue - pending spawn / pending stop - resolved but undocumented fix
I am having intermittent problems with the KubeSpawner on Azure, where the hub thinks that a user server is either pending start or pending stop, but the pod on the cluster for that user is not actually being started or stopped. The problem seems to come and go. The only way I have found of straightening things out (temporarily) is to delete the hub pod so that it automatically restarts.
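For reference, a rough illustration of that workaround using the kubernetes Python client (`kubectl delete pod` on the hub pod does the same thing; the `jhub` namespace and `component=hub` label here are assumptions from a typical zero-to-jupyterhub deployment, not from this issue):

```python
from kubernetes import client, config

# Assumed namespace/label from a typical zero-to-jupyterhub install; adjust to
# your deployment. Deleting the hub pod lets its Deployment recreate it.
config.load_kube_config()
v1 = client.CoreV1Api()
for pod in v1.list_namespaced_pod("jhub", label_selector="component=hub").items:
    v1.delete_namespaced_pod(pod.metadata.name, "jhub")
```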
Here is a section of the hub logs which may be relevant:
[I 2018-12-01 14:26:23.386 JupyterHub proxy:301] Checking routes
[I 2018-12-01 14:27:15.255 JupyterHub log:158] 200 GET /hub/admin (tjcrone@10.240.0.13) 35.11ms
[I 2018-12-01 14:27:18.274 JupyterHub proxy:264] Removing user tjcrone from proxy (/user/tjcrone/)
[I 2018-12-01 14:27:18.277 JupyterHub spawner:1770] Deleting pod jupyter-tjcrone
[E 2018-12-01 14:27:21.402 JupyterHub gen:974] Exception in Future <Future finished exception=ReadTimeoutError("HTTPSConnectionPool(host='nlsees-201-neanderthallab-17dca6-28535aba.hcp.centralus.azmk8s.io', port=443): Read timed out. (read timeout=None)",)> after timeout
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
future.result()
File "/usr/local/lib/python3.6/dist-packages/kubespawner/spawner.py", line 1667, in _start
body=pvc
File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.6/dist-packages/kubespawner/spawner.py", line 1489, in asynchronize
return method(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/apis/core_v1_api.py", line 5950, in create_namespaced_persistent_volume_claim
(data) = self.create_namespaced_persistent_volume_claim_with_http_info(namespace, body, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/apis/core_v1_api.py", line 6035, in create_namespaced_persistent_volume_claim_with_http_info
collection_formats=collection_formats)
File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 321, in call_api
_return_http_data_only, collection_formats, _preload_content, _request_timeout)
File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 155, in __call_api
_request_timeout=_request_timeout)
File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 364, in request
body=body)
File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/rest.py", line 266, in POST
body=body)
File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/rest.py", line 166, in request
headers=headers)
File "/usr/local/lib/python3.6/dist-packages/urllib3/request.py", line 72, in request
**urlopen_kw)
File "/usr/local/lib/python3.6/dist-packages/urllib3/request.py", line 150, in request_encode_body
return self.urlopen(method, url, **extra_kw)
File "/usr/local/lib/python3.6/dist-packages/urllib3/poolmanager.py", line 322, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 367, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 686, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 386, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 317, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='nlsees-201-neanderthallab-17dca6-28535aba.hcp.centralus.azmk8s.io', port=443): Read timed out. (read timeout=None)
[I 2018-12-01 14:27:23.340 JupyterHub proxy:301] Checking routes
[I 2018-12-01 14:27:23.540 JupyterHub log:158] 200 GET /hub/api/users (cull-idle@127.0.0.1) 50.42ms
[W 181201 14:27:23 cull_idle_servers:128] Not culling server tjcrone with pending stop
[W 2018-12-01 14:27:28.277 JupyterHub base:786] User tjcrone: server is slow to stop
At this stage, the hub thinks the server is pending stop, but the pod for that user is up and running and is not being stopped. Something similar can happen on start: the hub tries to start a pod, no pod ever launches, but the hub continues to think that the server is pending start.
Any idea what is going on? Any suggestions for better troubleshooting?
Top GitHub Comments
Just set up JupyterHub on AKS and ran into this issue. While debugging it I noticed that the first attempt to start/stop servers always succeeded. Starting a hub and then waiting usually ended with the next action failing on an eventual timeout. Looking into it more, with some unsophisticated timing involved, it usually took around 4 minutes before the next action would fail. This corresponds to the Azure Load Balancer's default keep-alive TTL (a 4-minute idle timeout).
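The ~4 minute figure matters because the load balancer silently drops idle TCP connections after its idle timeout, while the client keeps reusing the now-dead connection from its pool until the read times out. One generic mitigation for this class of problem (a sketch of the technique, not kubespawner's actual code) is to enable TCP keepalives on the pooled sockets; urllib3, which the kubernetes Python client uses, accepts per-socket options for this:

```python
import socket
from urllib3.connection import HTTPConnection

# Sketch only: TCP keepalive options that keep pooled connections from being
# silently dropped by a load balancer with a ~4 minute idle timeout.
KEEPALIVE_SOCKET_OPTIONS = HTTPConnection.default_socket_options + [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),     # enable keepalive probes
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),   # first probe after 60s idle (Linux)
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30),  # probe every 30s after that
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),     # drop the socket after 3 failures
]
# e.g. urllib3.HTTPSConnectionPool(host, 443, socket_options=KEEPALIVE_SOCKET_OPTIONS)
```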
Every pod in k8s should get the Kubernetes API host injected as an environment variable, which the SDKs then use to load the in-cluster config (see https://github.com/kubernetes-client/python-base/blob/master/config/incluster_config.py#L23-L24). On AKS this is, for some reason, the public API DNS name, `xxx.hcp.centralus.azmk8s.io`. K8s also exposes the API by default on an internally routed host, `kubernetes.default.svc.cluster.local`. However, when I tried to manually set the env var on the pod, AKS would just convert it back to the public one when the pod was deployed.
In an attempt to verify my suspicion that it was the ALB keep-alive that was at fault, I wrapped the hub image and manually hard-coded the env var `KUBERNETES_SERVICE_HOST` to the internally routed one. With this image built, pushed, and JupyterHub using it, I no longer see any timeouts. The image is available at `sgulseth/jupyter-k8s-hub:0.7.0` if you wanna test it out.
I'm not sure if this is unexpected behaviour from the SDK or the Azure Load Balancer, but for my sake I only need JupyterHub running over the weekend, so I'm OK with a temporary ugly hack.
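The wrapper-image snippet itself did not survive in this archive. As a rough, hypothetical sketch of the same idea in Python (not the commenter's actual wrapper image), the variable could be pinned before the kubernetes client loads its in-cluster config, e.g. near the top of `jupyterhub_config.py`:

```python
import os

# Hypothetical sketch, not the wrapper image from the comment above: point the
# kubernetes client at the cluster-internal API endpoint instead of the public
# AKS one. incluster_config reads KUBERNETES_SERVICE_HOST to build the API
# host, so this must run before the client config is loaded.
os.environ["KUBERNETES_SERVICE_HOST"] = "kubernetes.default.svc.cluster.local"
```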
@tjcrone fwiw, after I deployed https://github.com/jupyterhub/kubespawner/pull/433 the exact same problem you are describing went away.