Pod Evictions (memory, node shutdowns) - requires manual hub restart!
We have an issue when pods are removed unexpectedly that seems to require a hub restart. This can happen if a node fails, is preempted, or an admin deletes a user pod. This is really bad, I'd say, as it requires the affected user to contact an administrator, and then the administrator has to restart the hub to resolve it.
I deem this vital to fix, as preemptible nodes (Google) and spot instances (Amazon) are an awesome way to reduce cost, but using them currently risks causing this kind of huge trouble!
I'm still lacking some insight into what's going on in the proxy and the user management within JupyterHub, so I'm hoping someone can pick this issue up from my writeup. @minrk or @yuvipanda, perhaps you know someone to /cc, or could give me a pointer on where to look to solve this?
Experience summary
- Pod dies
- "503 - Service Unavailable"
- Stop the pod from hub/admin - "Your server is stopping"
- Distracting consequence
- Issue still not resolved
- Hub restart
- Everything is fine
Experience log
1. A preemptible node was reclaimed and my user pod was lost
- The user pod name was erik-2esundell
- The user name was erik.sundell
2. Later - I visit the hub
3. Directly after - I visit hub/admin and press the stop server button
- The button turns blue and is labeled "start server" again
- The hub log says
[I 2018-08-01 09:36:05.670 JupyterHub proxy:264] Removing user erik.sundell from proxy (/user/erik.sundell/)
[I 2018-08-01 09:36:05.675 JupyterHub spawner:1644] Deleting pod jupyter-erik-2esundell
[W 2018-08-01 09:36:05.688 JupyterHub spawner:1657] No pod jupyter-erik-2esundell to delete. Assuming already deleted.
[W 2018-08-01 09:36:15.671 JupyterHub base:751] User erik.sundell: server is slow to stop
[I 2018-08-01 09:36:15.672 JupyterHub log:158] 202 DELETE /hub/api/users/erik.sundell/server (erik.sundell@10.20.0.1) 10013.08ms
- The proxy (CHP) log says
09:36:05.673 - info: [ConfigProxy] Removing route /user/erik.sundell
09:36:05.674 - info: [ConfigProxy] 204 DELETE /api/routes/user/erik.sundell
- If I were to press "start server", this would show (screenshot not preserved)
4. I revisit my singleuser server
- What I see (screenshot not preserved)
- What the hub log says after a while:
[I 2018-08-01 09:36:50.337 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:36:52.736 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:36:57.838 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:36:58.940 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:37:04.053 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:37:09.154 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:37:14.235 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:37:19.374 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:37:24.450 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:37:29.546 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:37:34.699 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:37:36.529 JupyterHub proxy:301] Checking routes
[I 2018-08-01 09:37:39.774 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:37:44.877 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:37:49.953 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:37:55.064 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:00.139 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:05.262 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:10.430 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:15.512 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:20.614 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:25.705 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:28.067 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:29.443 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:34.546 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:36.529 JupyterHub proxy:301] Checking routes
[I 2018-08-01 09:38:39.654 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:44.747 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:49.873 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:38:54.987 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:00.088 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:05.170 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:10.253 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:15.341 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:20.452 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:25.532 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:30.642 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:35.720 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:36.529 JupyterHub proxy:301] Checking routes
[I 2018-08-01 09:39:40.825 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:45.924 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:51.031 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:39:56.143 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:01.262 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:06.360 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:11.459 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:14.980 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:17.269 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:22.437 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:27.544 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:32.660 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:36.529 JupyterHub proxy:301] Checking routes
[I 2018-08-01 09:40:37.750 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:42.819 JupyterHub base:978] erik.sundell is pending stop
[I 2018-08-01 09:40:47.880 JupyterHub base:978] erik.sundell is pending stop
[E 2018-08-01 09:40:50.170 JupyterHub gen:974] Exception in Future <Task finished coro=<BaseHandler.stop_single_user.<locals>.stop() done, defined at /usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py:730> exception=TimeoutError('pod/jupyter-erik-2esundell did not disappear in 300 seconds!',)> after timeout
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
future.result()
File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 740, in stop
await user.stop(name)
File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 548, in stop
await spawner.stop()
File "/usr/local/lib/python3.6/dist-packages/kubespawner/spawner.py", line 1664, in stop
timeout=self.start_timeout
File "/usr/local/lib/python3.6/dist-packages/jupyterhub/utils.py", line 155, in exponential_backoff
raise TimeoutError(fail_message)
TimeoutError: pod/jupyter-erik-2esundell did not disappear in 300 seconds!
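The TimeoutError above is raised by `jupyterhub.utils.exponential_backoff`, which repeatedly polls a check function with growing (jittered) delays until it passes or a deadline expires. A minimal self-contained sketch of that pattern (our own illustration, not JupyterHub's exact code):

```python
import asyncio
import random
import time


async def exponential_backoff(pass_func, fail_message, start_wait=0.2,
                              scale_factor=2, max_wait=5, timeout=10):
    """Poll pass_func with exponentially growing, jittered delays.

    Returns pass_func's truthy result, or raises TimeoutError with
    fail_message once `timeout` seconds have elapsed -- the same failure
    mode as "pod/... did not disappear in 300 seconds!" above.
    """
    deadline = time.monotonic() + timeout
    wait = start_wait
    while True:
        result = pass_func()
        if result:
            return result
        if time.monotonic() + wait > deadline:
            raise TimeoutError(fail_message)
        # Full jitter keeps many waiters from polling in lockstep.
        await asyncio.sleep(random.uniform(0, wait))
        wait = min(wait * scale_factor, max_wait)


async def demo():
    # Simulate a pod that actually disappears after half a second.
    state = {"gone": False}

    async def delete_later():
        await asyncio.sleep(0.5)
        state["gone"] = True

    asyncio.ensure_future(delete_later())
    await exponential_backoff(lambda: state["gone"],
                              "pod did not disappear in time!")
    return "pod gone"


print(asyncio.run(demo()))
```

In the incident above the pod was already gone, yet the check kept failing for the full 300 seconds, which suggests the spawner's view of the pod (its reflector cache) was stale rather than the pod itself surviving.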
5. I look again
- This is now what I see (screenshot not preserved)
- I think this is due to excessive refreshing of the pending-stop page - one refresh for every log entry in the point above.
6. I refresh
- This is what I see (screenshot not preserved)
- This is what the proxy (CHP) log says - a route was added, but it is unreachable:
09:41:36.531 - info: [ConfigProxy] Adding route /user/erik.sundell -> http://10.20.1.4:8888
09:41:36.532 - info: [ConfigProxy] 201 POST /api/routes/user/erik.sundell
09:42:36.528 - info: [ConfigProxy] 200 GET /api/routes
09:43:36.528 - info: [ConfigProxy] 200 GET /api/routes
09:44:36.529 - info: [ConfigProxy] 200 GET /api/routes
09:45:36.529 - info: [ConfigProxy] 200 GET /api/routes
09:46:36.529 - info: [ConfigProxy] 200 GET /api/routes
09:47:36.529 - info: [ConfigProxy] 200 GET /api/routes
09:48:19.959 - error: [ConfigProxy] 503 GET /user/erik.sundell/ Error: connect EHOSTUNREACH 10.20.1.4:8888
at Object.exports._errnoException (util.js:1020:11)
at exports._exceptionWithHostPort (util.js:1043:20)
at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1086:14)
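As a workaround, the stale route could in principle be removed by hand through configurable-http-proxy's REST API - the same `DELETE /api/routes/...` call visible in the CHP log above. A hedged sketch, assuming the proxy API is reachable and `CONFIGPROXY_AUTH_TOKEN` holds the auth token; the helper names here are our own:

```python
import os
import urllib.request


def routespec(username):
    """CHP API path for a user's route, e.g. 'erik.sundell' ->
    '/api/routes/user/erik.sundell' (matches the DELETE in the logs)."""
    return "/api/routes/user/" + username


def delete_route(api_url, token, username):
    """Ask configurable-http-proxy to drop a (possibly stale) user route."""
    req = urllib.request.Request(
        api_url.rstrip("/") + routespec(username),
        method="DELETE",
        headers={"Authorization": "token " + token},
    )
    # A 204 response mirrors the '204 DELETE /api/routes/user/...' log line.
    with urllib.request.urlopen(req) as resp:
        return resp.status


# Example (requires a reachable proxy API, so not run here):
# delete_route("http://proxy-api:8001",
#              os.environ["CONFIGPROXY_AUTH_TOKEN"], "erik.sundell")
```

This only clears the bad route, though; the hub's internal "pending stop" state would still need to be resolved, which is why a hub restart was ultimately required.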
7. I restart the hub
- The hub logs after restart
[I 2018-08-01 10:17:10.079 JupyterHub app:1667] Using Authenticator: builtins.GenericAuthenticator
[I 2018-08-01 10:17:10.079 JupyterHub app:1667] Using Spawner: kubespawner.spawner.KubeSpawner
[I 2018-08-01 10:17:10.080 JupyterHub app:1010] Loading cookie_secret from env[JPY_COOKIE_SECRET]
[W 2018-08-01 10:17:10.155 JupyterHub app:1129] JupyterHub.hub_connect_port is deprecated as of 0.9. Use JupyterHub.hub_connect_url to fully specify the URL for connecting to the Hub.
[I 2018-08-01 10:17:10.184 JupyterHub app:1199] Not using whitelist. Any authenticated user will be allowed.
[I 2018-08-01 10:17:10.237 JupyterHub reflector:176] watching for pods with label selector component=singleuser-server / field selector in namespace jh
[W 2018-08-01 10:17:10.263 JupyterHub app:1460] erik.sundell appears to have stopped while the Hub was down
[W 2018-08-01 10:17:10.317 JupyterHub app:1513] Deleting OAuth client jupyterhub-user-erik-sundell
[I 2018-08-01 10:17:10.329 JupyterHub app:1849] Hub API listening on http://0.0.0.0:8081/hub/
[I 2018-08-01 10:17:10.329 JupyterHub app:1851] Private Hub API connect url http://10.0.8.147:8081/hub/
[I 2018-08-01 10:17:10.330 JupyterHub app:1864] Not starting proxy
[I 2018-08-01 10:17:10.334 JupyterHub proxy:301] Checking routes
[W 2018-08-01 10:17:10.335 JupyterHub proxy:363] Deleting stale route /user/erik.sundell/
[I 2018-08-01 10:17:10.337 JupyterHub app:1906] JupyterHub is now running at http://10.0.2.156:80/
[I 2018-08-01 10:18:10.341 JupyterHub proxy:301] Checking routes
8. I log in again and everything works
- The hub logs after restart and login
[I 2018-08-01 10:21:49.143 JupyterHub base:499] User logged in: erik.sundell
[I 2018-08-01 10:21:49.145 JupyterHub log:158] 302 GET /hub/oauth_callback?code=[secret]&state=[secret]&session_state=[secret] -> /user/erik.sundell/ (@10.20.0.1) 630.45ms
[I 2018-08-01 10:21:49.185 JupyterHub log:158] 302 GET /user/erik.sundell/ -> /hub/user/erik.sundell/ (@10.20.0.1) 1.56ms
[I 2018-08-01 10:21:49.322 JupyterHub spawner:1550] PVC claim-erik-2esundell already exists, so did not create new pvc.
[W 2018-08-01 10:21:59.231 JupyterHub base:679] User erik.sundell is slow to start (timeout=10)
[I 2018-08-01 10:21:59.231 JupyterHub base:1016] erik.sundell is pending spawn
[I 2018-08-01 10:21:59.238 JupyterHub log:158] 200 GET /hub/user/erik.sundell/ (erik.sundell@10.20.0.1) 10015.25ms
[I 2018-08-01 10:22:00.703 JupyterHub log:158] 200 GET /hub/api (@10.20.1.4) 0.75ms
[I 2018-08-01 10:22:03.854 JupyterHub base:628] User erik.sundell took 14.623 seconds to start
[I 2018-08-01 10:22:03.854 JupyterHub proxy:242] Adding user erik.sundell to proxy /user/erik.sundell/ => http://10.20.1.4:8888
[I 2018-08-01 10:22:03.857 JupyterHub users:510] Server erik.sundell is ready
Issue Analytics
- Created: 5 years ago
- Reactions: 4
- Comments: 18 (12 by maintainers)
Top GitHub Comments
This issue has been mentioned on the Jupyter Community Forum; there might be relevant details there:
https://discourse.jupyter.org/t/weve-been-seeing-503s-occasionally-for-user-servers/3824/2
Ah hmmm I realize that we can watch for k8s events about pods being Evicted and take action based on that… https://www.bluematador.com/blog/kubernetes-events-explained
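Building on that idea, here is a sketch of watching for Evicted pod events with the official `kubernetes` Python client. The watch loop needs a live cluster, so the eviction filter is factored into a pure function; all names are illustrative, not an existing JupyterHub hook:

```python
# Sketch of reacting to pod evictions, assuming the official `kubernetes`
# Python client and in-cluster credentials. Only `evicted_pod_names` is
# pure logic; the watch loop below is illustrative.

def evicted_pod_names(events):
    """Given Kubernetes event dicts, return the names of evicted pods.

    Evictions surface as events with reason 'Evicted' on Pod objects.
    """
    return [
        ev["involvedObject"]["name"]
        for ev in events
        if ev.get("reason") == "Evicted"
        and ev.get("involvedObject", {}).get("kind") == "Pod"
    ]


def watch_for_evictions(namespace="jh"):
    # Illustrative only -- requires a cluster and the kubernetes package.
    from kubernetes import client, config, watch
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    for event in watch.Watch().stream(
            v1.list_namespaced_event, namespace=namespace):
        obj = event["object"]
        if obj.reason == "Evicted" and obj.involved_object.kind == "Pod":
            print("evicted:", obj.involved_object.name)
            # Here the hub could mark the server as stopped and remove
            # the proxy route, instead of waiting for a user to hit a 503.
```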