AKS reliability issue - pending spawn / pending stop - resolved but undocumented fix

See original GitHub issue

I am having intermittent problems with KubeSpawner on Azure: the hub thinks that a user's server is either pending start or pending stop, but the pod for that user is not actually being started or stopped. The problem comes and goes, and the only way I have found to straighten things out (temporarily) is to delete the hub pod so that it restarts automatically.

Here is a section of the hub logs which may be relevant:

[I 2018-12-01 14:26:23.386 JupyterHub proxy:301] Checking routes
[I 2018-12-01 14:27:15.255 JupyterHub log:158] 200 GET /hub/admin (tjcrone@10.240.0.13) 35.11ms
[I 2018-12-01 14:27:18.274 JupyterHub proxy:264] Removing user tjcrone from proxy (/user/tjcrone/)
[I 2018-12-01 14:27:18.277 JupyterHub spawner:1770] Deleting pod jupyter-tjcrone
[E 2018-12-01 14:27:21.402 JupyterHub gen:974] Exception in Future <Future finished exception=ReadTimeoutError("HTTPSConnectionPool(host='nlsees-201-neanderthallab-17dca6-28535aba.hcp.centralus.azmk8s.io', port=443): Read timed out. (read timeout=None)",)> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/kubespawner/spawner.py", line 1667, in _start
        body=pvc
      File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
        result = self.fn(*self.args, **self.kwargs)
      File "/usr/local/lib/python3.6/dist-packages/kubespawner/spawner.py", line 1489, in asynchronize
        return method(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/apis/core_v1_api.py", line 5950, in create_namespaced_persistent_volume_claim
        (data) = self.create_namespaced_persistent_volume_claim_with_http_info(namespace, body, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/apis/core_v1_api.py", line 6035, in create_namespaced_persistent_volume_claim_with_http_info
        collection_formats=collection_formats)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 321, in call_api
        _return_http_data_only, collection_formats, _preload_content, _request_timeout)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 155, in __call_api
        _request_timeout=_request_timeout)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 364, in request
        body=body)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/rest.py", line 266, in POST
        body=body)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/rest.py", line 166, in request
        headers=headers)
      File "/usr/local/lib/python3.6/dist-packages/urllib3/request.py", line 72, in request
        **urlopen_kw)
      File "/usr/local/lib/python3.6/dist-packages/urllib3/request.py", line 150, in request_encode_body
        return self.urlopen(method, url, **extra_kw)
      File "/usr/local/lib/python3.6/dist-packages/urllib3/poolmanager.py", line 322, in urlopen
        response = conn.urlopen(method, u.request_uri, **kw)
      File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
        _stacktrace=sys.exc_info()[2])
      File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 367, in increment
        raise six.reraise(type(error), error, _stacktrace)
      File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 686, in reraise
        raise value
      File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
        chunked=chunked)
      File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 386, in _make_request
        self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
      File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 317, in _raise_timeout
        raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
    urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='nlsees-201-neanderthallab-17dca6-28535aba.hcp.centralus.azmk8s.io', port=443): Read timed out. (read timeout=None)
    
[I 2018-12-01 14:27:23.340 JupyterHub proxy:301] Checking routes
[I 2018-12-01 14:27:23.540 JupyterHub log:158] 200 GET /hub/api/users (cull-idle@127.0.0.1) 50.42ms
[W 181201 14:27:23 cull_idle_servers:128] Not culling server tjcrone with pending stop
[W 2018-12-01 14:27:28.277 JupyterHub base:786] User tjcrone: server is slow to stop

At this stage the hub thinks the server is pending stop, but the server pod is up and running and is not being stopped. Something similar can happen on start: the hub keeps thinking a server is pending start, yet no pod is ever launched.

Any idea what is going on? Any suggestions for better troubleshooting?
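For troubleshooting, one thing that can help is comparing what the hub believes with what is actually running. Below is a rough sketch, not from the issue itself: it assumes an admin-scoped JupyterHub API token and the default z2jh hub service address (both placeholders to adjust for your deployment), and its output can be compared against kubectl get pods.

# Rough sketch: list users the hub considers "pending" so the hub's view can
# be compared with `kubectl get pods`. Assumes an admin-scoped API token in
# JUPYTERHUB_API_TOKEN and the default z2jh hub address (both placeholders).
import os
import requests

hub_api = os.environ.get("HUB_API", "http://hub:8081/hub/api")
token = os.environ["JUPYTERHUB_API_TOKEN"]

resp = requests.get(hub_api + "/users",
                    headers={"Authorization": "token " + token})
resp.raise_for_status()

for user in resp.json():
    # Older hubs report a top-level "pending" field ("spawn"/"stop"); newer
    # ones nest it per named server under user["servers"].
    pending = user.get("pending") or {
        name: srv.get("pending")
        for name, srv in (user.get("servers") or {}).items()
        if srv.get("pending")
    }
    if pending:
        print(user["name"], pending)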

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 25 (10 by maintainers)

Top GitHub Comments

4 reactions
sgulseth commented, Jan 25, 2019

Just set up JupyterHub on AKS and ran into this issue. While debugging it I noticed that the first attempt to start or stop a server always succeeded; starting the hub and then waiting usually meant the next action failed with an eventual timeout. Looking into it further, with some unsophisticated timing, it usually took around 4 minutes of idleness before the next action would fail, which corresponds to the Azure Load Balancer's default keep-alive TTL.
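A rough way to reproduce that timing from inside the hub pod (a sketch, not part of the original debugging; the namespace is a placeholder) is two identical API calls separated by an idle period longer than the ~4 minute keep-alive, with an explicit _request_timeout so the stale connection fails fast instead of hanging with read timeout=None as in the traceback above:

# Sketch of the timing experiment; the namespace "jhub" is a placeholder.
# The first call goes out on a fresh connection; after idling past the
# load balancer's ~4 minute keep-alive, the second call reuses the silently
# dropped connection and, with an explicit timeout, fails quickly instead
# of hanging forever.
import time
from kubernetes import client, config

config.load_incluster_config()   # run inside the hub pod
v1 = client.CoreV1Api()

v1.list_namespaced_pod("jhub", _request_timeout=10)
time.sleep(5 * 60)
v1.list_namespaced_pod("jhub", _request_timeout=10)  # expect a fast timeout here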

Every pod in Kubernetes gets the Kubernetes API host injected as an environment variable, which the SDKs then use to load the in-cluster config (see https://github.com/kubernetes-client/python-base/blob/master/config/incluster_config.py#L23-L24). On AKS, for some reason, this is the public API DNS name, xxx.hcp.centralus.azmk8s.io:

» ksysdpo tunnelfront-5774f86cb5-fffwc | grep KUBERNETES_SERVICE_HOST
      KUBERNETES_SERVICE_HOST:       some-address.hcp.westeurope.azmk8s.io

Kubernetes also exposes the API by default on an internally routed host, kubernetes.default.svc.cluster.local. However, for some reason, when I tried to set the env var manually on the pod, AKS would just convert it back to the public one when the pod was deployed.
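For what it's worth, the in-cluster loader linked above builds the API host purely from KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT, which is why overriding the variable inside the hub process (rather than in the pod spec) is enough. A minimal sketch, only runnable inside a pod with a service-account token mounted:

# Minimal sketch: load_incluster_config() derives the API host from
# KUBERNETES_SERVICE_HOST / KUBERNETES_SERVICE_PORT, so overriding them
# in-process before loading the config points all client traffic at the
# internal service name instead of the public AKS endpoint.
import os
from kubernetes import client, config

os.environ["KUBERNETES_SERVICE_HOST"] = "kubernetes.default.svc.cluster.local"
os.environ.setdefault("KUBERNETES_SERVICE_PORT", "443")

config.load_incluster_config()
v1 = client.CoreV1Api()
print(v1.api_client.configuration.host)
# -> https://kubernetes.default.svc.cluster.local:443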

In an attempt to verify my suspicion that the ALB keep-alive was at fault, I wrapped the hub image and hard-coded the env var KUBERNETES_SERVICE_HOST to the internally routed one:

» cat Dockerfile
FROM jupyterhub/k8s-hub:0.7.0

ENV PATH="/srv/bin:${PATH}"

COPY jupyterhub.sh /srv/bin/jupyterhub

» cat jupyterhub.sh
#!/bin/bash

export KUBERNETES_SERVICE_HOST="kubernetes.default.svc.cluster.local"

# quote "$@" so arguments with spaces survive; exec so jupyterhub replaces
# the wrapper and receives signals directly
exec /usr/local/bin/jupyterhub "$@"

With this image built and pushed, and JupyterHub configured to use it, I no longer see any timeouts. The image is available at sgulseth/jupyter-k8s-hub:0.7.0 if you want to test it out.


I’m not sure whether this is unexpected behaviour from the SDK or from the Azure Load Balancer, but for my purposes I only need JupyterHub running over the weekend, so I’m OK with a temporary ugly hack.

2 reactions
yuvipanda commented, Oct 27, 2020

@tjcrone fwiw, after I deployed https://github.com/jupyterhub/kubespawner/pull/433 the exact same problem you are describing went away.

Read more comments on GitHub

Top Results From Across the Web

Troubleshooting a pending pod in a Kubernetes cluster (AKS)
The important part is how to solve it. Once I established that the problem was due to insufficient memory I knew I had...

Jupyterhub on k8s/Azure intermittently times out with no events
Hello - I have a jupyterhub installed on Azure following the z2jh instructions. Occasionally (about 50% of the time), when I try to...
