Pod Evictions (memory, node shutdowns) - requires manual hub restart!


About

We have an issue when pods are removed unexpectedly: recovering seems to require a hub restart. I think this could happen if a node fails, is preempted, or an admin deletes a user pod. This is really bad, I’d say, as it requires the affected user to contact an administrator, and the administrator then has to restart the hub to solve it.

I deem this vital to fix, as preemptible nodes (Google) and spot nodes (Amazon) are an awesome way to reduce cost, but using them currently risks causing this kind of serious trouble!

I’m still lacking some insight into what’s going on in the proxy and the user management within JupyterHub, though, so I’m hoping someone can pick this issue up from my writeup. @minrk or @yuvipanda, perhaps you know someone to /cc, or could give me a pointer on where to look to solve this?

Experience summary

  1. Pod dies
  2. “503 - Service Unavailable”
  3. hub/admin stop pod
  4. “Your server is stopping”
  5. Distracting consequence
  6. Issue still not resolved
  7. Hub restart
  8. Everything is fine

Experience log

1. A preemptible node was reclaimed and my user pod was lost

  • The user pod name was erik-2esundell
  • The user name was erik.sundell

2. Later - I visit the hub

image


3. Directly after - I visit hub/admin and press the stop server button

  • The button turns blue and is labeled start server again

image

  • The hub logs say
[I 2018-08-01 09:36:05.670 JupyterHub proxy:264] Removing user erik.sundell from proxy (/user/erik.sundell/)                                                                                                                 
[I 2018-08-01 09:36:05.675 JupyterHub spawner:1644] Deleting pod jupyter-erik-2esundell                                                                                                                                      
[W 2018-08-01 09:36:05.688 JupyterHub spawner:1657] No pod jupyter-erik-2esundell to delete. Assuming already deleted.                                                                                                       
[W 2018-08-01 09:36:15.671 JupyterHub base:751] User erik.sundell: server is slow to stop
[I 2018-08-01 09:36:15.672 JupyterHub log:158] 202 DELETE /hub/api/users/erik.sundell/server (erik.sundell@10.20.0.1) 10013.08ms
  • The proxy (CHP) logs say
09:36:05.673 - info: [ConfigProxy] Removing route /user/erik.sundell
09:36:05.674 - info: [ConfigProxy] 204 DELETE /api/routes/user/erik.sundell
  • If I were to press start server, this is what would show: image

4. I revisit my singleuser server

  • What I see image

  • What the hub log says after a while

[I 2018-08-01 09:36:50.337 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:36:52.736 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:36:57.838 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:36:58.940 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:37:04.053 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:37:09.154 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:37:14.235 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:37:19.374 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:37:24.450 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:37:29.546 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:37:34.699 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:37:36.529 JupyterHub proxy:301] Checking routes                                                                                                                                                             
[I 2018-08-01 09:37:39.774 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:37:44.877 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:37:49.953 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:37:55.064 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:38:00.139 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:38:05.262 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:38:10.430 JupyterHub base:978] erik.sundell is pending stop                                                                                                                                                 
[I 2018-08-01 09:38:15.512 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:20.614 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:25.705 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:28.067 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:29.443 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:34.546 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:36.529 JupyterHub proxy:301] Checking routes
[I 2018-08-01 09:38:39.654 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:44.747 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:49.873 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:54.987 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:00.088 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:05.170 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:10.253 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:15.341 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:20.452 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:25.532 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:30.642 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:35.720 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:36.529 JupyterHub proxy:301] Checking routes
[I 2018-08-01 09:39:40.825 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:45.924 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:51.031 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:56.143 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:01.262 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:06.360 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:11.459 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:14.980 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:17.269 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:22.437 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:27.544 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:32.660 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:36.529 JupyterHub proxy:301] Checking routes
[I 2018-08-01 09:40:37.750 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:42.819 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:47.880 JupyterHub base:978] erik.sundell is pending stop

[E 2018-08-01 09:40:50.170 JupyterHub gen:974] Exception in Future <Task finished coro=<BaseHandler.stop_single_user.<locals>.stop() done, defined at /usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py:730> exception=TimeoutError('pod/jupyter-erik-2esundell did not disappear in 300 seconds!',)> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 740, in stop
        await user.stop(name)
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 548, in stop
        await spawner.stop()
      File "/usr/local/lib/python3.6/dist-packages/kubespawner/spawner.py", line 1664, in stop
        timeout=self.start_timeout
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/utils.py", line 155, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: pod/jupyter-erik-2esundell did not disappear in 300 seconds!
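
The traceback ends in a polling helper that gave up after 300 seconds. As a rough illustration of that pattern, here is a simplified sketch of polling with backoff. It is not JupyterHub's actual `exponential_backoff`; the name `wait_until`, its defaults, and the `cached_pods` set are made up for this example. The stale set models one possible reading of the logs (the hub keeps waiting on a local view that still lists the pod), which is an assumption and not a confirmed diagnosis.

```python
import asyncio
import random
import time


async def wait_until(check, fail_message, start_wait=0.2, scale_factor=2,
                     max_wait=5, timeout=10):
    """Poll check() with growing, jittered delays until it returns a truthy value.

    Raises TimeoutError(fail_message) once `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    wait = start_wait
    while True:
        result = check()
        if result:
            return result
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(fail_message)
        # Sleep a random fraction of the current wait, never past the deadline.
        await asyncio.sleep(min(random.uniform(0, wait), remaining))
        wait = min(wait * scale_factor, max_wait)


# Hypothetical stale cache: the pod is gone from the cluster but still listed here.
cached_pods = {"jupyter-erik-2esundell"}


def pod_gone():
    return "jupyter-erik-2esundell" not in cached_pods
```

If nothing ever removes the entry from `cached_pods`, `wait_until(pod_gone, ...)` can only end in a `TimeoutError`, which would match the "did not disappear in 300 seconds" message above.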

5. I look again

  • This is now what I see image

  • I think this is caused by excessive refreshing of the pending stop page, once for every log entry in the previous point.


6. I refresh

  • This is what I see image

  • This is what the proxy (CHP) logs say - a route was added, but it is unreachable

09:41:36.531 - info: [ConfigProxy] Adding route /user/erik.sundell -> http://10.20.1.4:8888
09:41:36.532 - info: [ConfigProxy] 201 POST /api/routes/user/erik.sundell 
09:42:36.528 - info: [ConfigProxy] 200 GET /api/routes 
09:43:36.528 - info: [ConfigProxy] 200 GET /api/routes 
09:44:36.529 - info: [ConfigProxy] 200 GET /api/routes 
09:45:36.529 - info: [ConfigProxy] 200 GET /api/routes 
09:46:36.529 - info: [ConfigProxy] 200 GET /api/routes 
09:47:36.529 - info: [ConfigProxy] 200 GET /api/routes 
09:48:19.959 - error: [ConfigProxy] 503 GET /user/erik.sundell/ Error: connect EHOSTUNREACH 10.20.1.4:8888
    at Object.exports._errnoException (util.js:1020:11)
    at exports._exceptionWithHostPort (util.js:1043:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1086:14)
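
The proxy log above shows a route whose target no longer answers. A minimal sketch of how such routes could be spotted, assuming a routes mapping similar in shape to what the proxy returns from GET /api/routes and a set of pod IPs known to be alive. The function name `find_stale_routes` and the sample data are illustrative and not part of JupyterHub or the proxy.

```python
from urllib.parse import urlsplit


def find_stale_routes(routes, live_pod_ips):
    """Return user route paths whose target host is not a known live pod IP.

    `routes` maps a route path to a dict with a "target" URL.
    """
    stale = []
    for path, spec in routes.items():
        if not path.startswith("/user/"):
            continue  # leave the hub's own default route alone
        host = urlsplit(spec["target"]).hostname
        if host not in live_pod_ips:
            stale.append(path)
    return sorted(stale)


# Illustrative data modeled on the addresses in the logs above.
routes = {
    "/": {"target": "http://10.0.8.147:8081"},
    "/user/erik.sundell": {"target": "http://10.20.1.4:8888"},
}
print(find_stale_routes(routes, live_pod_ips=set()))
```

With no live pod IPs, the only user route is reported as stale, while the default route to the hub is left alone.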

7. I restart the hub

  • The hub logs after restart
[I 2018-08-01 10:17:10.079 JupyterHub app:1667] Using Authenticator: builtins.GenericAuthenticator
[I 2018-08-01 10:17:10.079 JupyterHub app:1667] Using Spawner: kubespawner.spawner.KubeSpawner
[I 2018-08-01 10:17:10.080 JupyterHub app:1010] Loading cookie_secret from env[JPY_COOKIE_SECRET]
[W 2018-08-01 10:17:10.155 JupyterHub app:1129] JupyterHub.hub_connect_port is deprecated as of 0.9. Use JupyterHub.hub_connect_url to fully specify the URL for connecting to the Hub.
[I 2018-08-01 10:17:10.184 JupyterHub app:1199] Not using whitelist. Any authenticated user will be allowed.
[I 2018-08-01 10:17:10.237 JupyterHub reflector:176] watching for pods with label selector component=singleuser-server / field selector  in namespace jh
[W 2018-08-01 10:17:10.263 JupyterHub app:1460] erik.sundell appears to have stopped while the Hub was down
[W 2018-08-01 10:17:10.317 JupyterHub app:1513] Deleting OAuth client jupyterhub-user-erik-sundell
[I 2018-08-01 10:17:10.329 JupyterHub app:1849] Hub API listening on http://0.0.0.0:8081/hub/
[I 2018-08-01 10:17:10.329 JupyterHub app:1851] Private Hub API connect url http://10.0.8.147:8081/hub/
[I 2018-08-01 10:17:10.330 JupyterHub app:1864] Not starting proxy
[I 2018-08-01 10:17:10.334 JupyterHub proxy:301] Checking routes
[W 2018-08-01 10:17:10.335 JupyterHub proxy:363] Deleting stale route /user/erik.sundell/
[I 2018-08-01 10:17:10.337 JupyterHub app:1906] JupyterHub is now running at http://10.0.2.156:80/
[I 2018-08-01 10:18:10.341 JupyterHub proxy:301] Checking routes

8. I log in again and everything works

  • The hub logs after restart and login
[I 2018-08-01 10:21:49.143 JupyterHub base:499] User logged in: erik.sundell
[I 2018-08-01 10:21:49.145 JupyterHub log:158] 302 GET /hub/oauth_callback?code=[secret]&state=[secret]&session_state=[secret] -> /user/erik.sundell/ (@10.20.0.1) 630.45ms
[I 2018-08-01 10:21:49.185 JupyterHub log:158] 302 GET /user/erik.sundell/ -> /hub/user/erik.sundell/ (@10.20.0.1) 1.56ms
[I 2018-08-01 10:21:49.322 JupyterHub spawner:1550] PVC claim-erik-2esundell already exists, so did not create new pvc.
[W 2018-08-01 10:21:59.231 JupyterHub base:679] User erik.sundell is slow to start (timeout=10)
[I 2018-08-01 10:21:59.231 JupyterHub base:1016] erik.sundell is pending spawn
[I 2018-08-01 10:21:59.238 JupyterHub log:158] 200 GET /hub/user/erik.sundell/ (erik.sundell@10.20.0.1) 10015.25ms
[I 2018-08-01 10:22:00.703 JupyterHub log:158] 200 GET /hub/api (@10.20.1.4) 0.75ms
[I 2018-08-01 10:22:03.854 JupyterHub base:628] User erik.sundell took 14.623 seconds to start
[I 2018-08-01 10:22:03.854 JupyterHub proxy:242] Adding user erik.sundell to proxy /user/erik.sundell/ => http://10.20.1.4:8888
[I 2018-08-01 10:22:03.857 JupyterHub users:510] Server erik.sundell is ready

Issue Analytics

  • State: open
  • Created 5 years ago
  • Reactions: 4
  • Comments: 18 (12 by maintainers)

Top GitHub Comments

meeseeksmachine commented, Jul 12, 2020 (1 reaction)

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/weve-been-seeing-503s-occasionally-for-user-servers/3824/2

consideRatio commented, Nov 25, 2019 (1 reaction)

Ah hmmm I realize that we can watch for k8s events about pods being Evicted and take action based on that… https://www.bluematador.com/blog/kubernetes-events-explained
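
A rough sketch of the event-filtering half of that idea, assuming events arrive as plain dicts shaped like Kubernetes Event objects (`reason`, `involvedObject.kind`, `involvedObject.name`). The helper names and sample events are hypothetical, and how the events are fetched or watched is left out. Note that a pod lost with a preempted node may not produce an "Evicted" event at all, so this would likely cover only part of the problem.

```python
def is_pod_eviction(event):
    """True if a Kubernetes Event (as a plain dict) reports an evicted Pod."""
    involved = event.get("involvedObject", {})
    return involved.get("kind") == "Pod" and event.get("reason") == "Evicted"


def evicted_pod_names(events):
    """Collect the names of pods mentioned in eviction events, in order seen."""
    names = []
    for event in events:
        if is_pod_eviction(event):
            name = event["involvedObject"].get("name")
            if name and name not in names:
                names.append(name)
    return names


# Hypothetical sample events; only the fields used above are filled in.
sample_events = [
    {"reason": "Scheduled",
     "involvedObject": {"kind": "Pod", "name": "jupyter-erik-2esundell"}},
    {"reason": "Evicted",
     "involvedObject": {"kind": "Pod", "name": "jupyter-erik-2esundell"}},
    {"reason": "NodeNotReady",
     "involvedObject": {"kind": "Node", "name": "node-1"}},
]
print(evicted_pod_names(sample_events))
```

The hub could then mark the matching user servers as stopped instead of waiting for a pod that will never come back, though where that hook belongs is an open question.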
