NVLink Throughput and Timeline throwing 500's and Errors in lab interface
This is an awesome project. Thanks for the hard work here! It’s really nice to have a dashboard for watching GPU resources (and is way better than opening up a terminal and running watch nvidia-smi 😀)
I’ll preface this issue with this is mostly just some user feedback. Do with it what you will 😃. I’m happy to help debug further, but have zero ability to actually write JLab extensions so can’t help on the writing-code-to-help-fix side of things.
I’m brand new to using this extension (installed it like 15 mins ago) and was clicking around seeing what all the different dashboards do. When I open up the NVLink Throughput and NVLink Timeline dashboards, I immediately get stack traces in my jupyter server logs and a “500: Internal Server Error” in the jupyterlab widget. This is almost certainly because I’m not running on a multi-gpu system.
NVLink JupyterLab 500 in the opened panel:
server logs from NVLink error:
ERROR:tornado.access:500 GET /NVLink-Throughput (127.0.0.1) 1.00ms
[E 11:27:00.965 LabApp] {
"Host": "localhost:8888",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Dnt": "1",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.75 Safari/537.36",
"Sec-Fetch-Mode": "nested-navigate",
"Sec-Fetch-User": "?1",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"Sec-Fetch-Site": "same-origin",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9",
"Cookie": "_xsrf=2|a800529e|3734e978fe7758a227e33d8ef289c566|1571322840; username-localhost-8888=\"2|1:0|10:1571326020|23:username-localhost-8888|44:MDg0NGY3NjY4OTMzNDdlMGI1MDQ5NmIwYjM0NmJjYTY=|969f92a7df270d40c831ddb30a0c7dfba20c443d37adaa36bea79c3fb78891a4\""
}
[E 11:27:00.965 LabApp] 500 GET /nvdashboard/NVLink-Throughput (127.0.0.1) 18.59ms referer=None
ERROR:tornado.application:Uncaught exception GET /NVLink-Timeline (127.0.0.1)
HTTPServerRequest(protocol='http', host='localhost:8888', method='GET', uri='/NVLink-Timeline', version='HTTP/1.1', remote_ip='127.0.0.1')
Traceback (most recent call last):
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/web.py", line 1699, in _execute
result = await result
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/bokeh/server/views/doc_handler.py", line 55, in get
session = yield self.get_session()
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/bokeh/server/views/session_handler.py", line 77, in get_session
session = yield self.application_context.create_session_if_needed(session_id, self.request)
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/gen.py", line 748, in run
yielded = self.gen.send(value)
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/bokeh/server/contexts.py", line 215, in create_session_if_needed
self._application.initialize_document(doc)
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/bokeh/application/application.py", line 178, in initialize_document
h.modify_document(doc)
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/bokeh/application/handlers/function.py", line 133, in modify_document
self._func(doc)
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 395, in nvlink_timeline
for i in range(ngpus)
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 395, in <listcomp>
for i in range(ngpus)
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 392, in <listcomp>
for j in range(nlinks)
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/pynvml/nvml.py", line 1999, in nvmlDeviceGetNvLinkUtilizationCounter
check_return(ret)
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/pynvml/nvml.py", line 366, in check_return
raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
ERROR:tornado.access:500 GET /NVLink-Timeline (127.0.0.1) 38.24ms
Oh, interesting. Every time I switch between different tabs in jupyterlab, it seems like the dashboard needs to reconnect to the websocket. Sometimes this also throws an exception in the jupyter server logs. (Clearly the workaround is to have all of the dashboards exposed and not in tabs)
Websocket error:
[E 11:34:44.378 LabApp] Uncaught exception
Traceback (most recent call last):
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/websocket.py", line 649, in _run_callback
result = callback(*args, **kwargs)
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/websocket.py", line 1528, in on_message
return self._on_message(message)
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/websocket.py", line 1534, in _on_message
self._on_message_callback(message)
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/jupyter_server_proxy/handlers.py", line 247, in message_cb
self.write_message(message, binary=isinstance(message, bytes))
File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/websocket.py", line 339, in write_message
raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError
WARNING:bokeh.server.views.ws:Failed sending message as connection was closed
WARNING:bokeh.server.views.ws:Failed sending message as connection was closed
WARNING:bokeh.server.views.ws:Failed sending message as connection was closed
[I 11:34:45.794 LabApp] Trying to establish websocket connection to ws://localhost:36330/GPU-Memory/ws?bokeh-protocol-version=1.0&bokeh-session-id=zz8FYsgbMiW8RnVQwCKUJolWg5YcUxMupkjdHkU8nL4G
[I 11:34:45.847 LabApp] Websocket connection established to ws://localhost:36330/GPU-Memory/ws?bokeh-protocol-version=1.0&bokeh-session-id=zz8FYsgbMiW8RnVQwCKUJolWg5YcUxMupkjdHkU8nL4G
Current environment:
# packages in environment at /home/ericdill/miniconda/envs/rapidsai:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
arrow-cpp 0.14.1 py37h6b969ab_1 conda-forge
backcall 0.1.0 py_0 conda-forge
boost-cpp 1.70.0 h8e57a91_2 conda-forge
brotli 1.0.7 he1b5a44_1000 conda-forge
bzip2 1.0.8 h516909a_0 conda-forge
c-ares 1.15.0 h516909a_1001 conda-forge
ca-certificates 2019.6.16 hecc5488_0 conda-forge
certifi 2019.6.16 py37_1 conda-forge
cffi 1.12.3 py37h8022711_0 conda-forge
cudatoolkit 10.0.130 0
cudf 0.9.0 py37_0 rapidsai
cugraph 0.9.0 py37_0 rapidsai
cuml 0.9.1 cuda10.0_py37_0 rapidsai
cython 0.29.13 py37he1b5a44_0 conda-forge
decorator 4.4.0 py_0 conda-forge
dlpack 0.2 he1b5a44_0 conda-forge
double-conversion 3.1.5 he1b5a44_1 conda-forge
fastavro 0.22.4 py37h516909a_0 conda-forge
gflags 2.2.2 he1b5a44_1001 conda-forge
glog 0.4.0 he1b5a44_1 conda-forge
grpc-cpp 1.23.0 h18db393_0 conda-forge
icu 64.2 he1b5a44_1 conda-forge
ipykernel 5.1.2 py37h5ca1d4c_0 conda-forge
ipython 7.8.0 py37h5ca1d4c_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jedi 0.15.1 py37_0 conda-forge
jupyter_client 5.3.1 py_0 conda-forge
jupyter_core 4.4.0 py_0 conda-forge
libblas 3.8.0 12_openblas conda-forge
libcblas 3.8.0 12_openblas conda-forge
libcudf 0.9.0 cuda10.0_0 rapidsai
libcugraph 0.9.0 cuda10.0_0 rapidsai
libcuml 0.9.1 cuda10.0_0 rapidsai
libcumlprims 0.9.0 cuda10.0_0 nvidia
libevent 2.1.10 h72c5cf5_0 conda-forge
libffi 3.2.1 he1b5a44_1006 conda-forge
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
liblapack 3.8.0 12_openblas conda-forge
libnvstrings 0.9.0 cuda10.0_0 rapidsai
libopenblas 0.3.7 h6e990d7_1 conda-forge
libprotobuf 3.8.0 h8b12597_0 conda-forge
librmm 0.9.0 cuda10.0_0 rapidsai
libsodium 1.0.17 h516909a_0 conda-forge
libstdcxx-ng 9.1.0 hdf63c60_0
llvmlite 0.29.0 py37hf484d3e_0 numba
lz4-c 1.8.3 he1b5a44_1001 conda-forge
nccl 2.4.6.1 cuda10.0_0 nvidia
ncurses 6.1 hf484d3e_1002 conda-forge
numba 0.45.1 np116py37hf484d3e_0 numba
numpy 1.16.4 py37h95a1406_0 conda-forge
nvstrings 0.9.0 py37_0 rapidsai
openssl 1.1.1c h516909a_0 conda-forge
pandas 0.24.2 py37hb3f55d8_0 conda-forge
parquet-cpp 1.5.1 2 conda-forge
parso 0.5.1 py_0 conda-forge
pexpect 4.7.0 py37_0 conda-forge
pickleshare 0.7.5 py37_1000 conda-forge
pip 19.2.3 py37_0 conda-forge
prompt_toolkit 2.0.9 py_0 conda-forge
ptyprocess 0.6.0 py_1001 conda-forge
pyarrow 0.14.1 py37h8b68381_0 conda-forge
pycparser 2.19 py37_1 conda-forge
pygments 2.4.2 py_0 conda-forge
python 3.7.3 h33d41f4_1 conda-forge
python-dateutil 2.8.0 py_0 conda-forge
pytz 2019.2 py_0 conda-forge
pyzmq 18.0.2 py37h1768529_2 conda-forge
re2 2019.09.01 he1b5a44_0 conda-forge
readline 8.0 hf8c457e_0 conda-forge
rmm 0.9.0 py37_0 rapidsai
setuptools 41.2.0 py37_0 conda-forge
six 1.12.0 py37_1000 conda-forge
snappy 1.1.7 he1b5a44_1002 conda-forge
sqlite 3.29.0 hcee41ef_1 conda-forge
thrift-cpp 0.12.0 hf3afdfd_1004 conda-forge
tk 8.6.9 hed695b0_1002 conda-forge
tornado 6.0.3 py37h516909a_0 conda-forge
traitlets 4.3.2 py37_1000 conda-forge
uriparser 0.9.3 he1b5a44_1 conda-forge
wcwidth 0.1.7 py_1 conda-forge
wheel 0.33.6 py37_0 conda-forge
xz 5.2.4 h14c3975_1001 conda-forge
zeromq 4.3.2 he1b5a44_2 conda-forge
zlib 1.2.11 h516909a_1005 conda-forge
zstd 1.4.0 h3b9ef0a_0 conda-forge
Installed extension with pip install jupyterlab-nvdashboard
and then jupyter labextension install jupyterlab-nvdashboard
Top GitHub Comments
I am running the latest version of this dashboard on a DGX Station with NVLink and I am still seeing this error. Is there a specific driver version I need?
It looks like I am seeing an issue with a different metric than the previous user (nvmlDeviceGetNvLinkUtilizationCounter). I remember seeing a related bug/change with some related metrics in the driver, so maybe this API has actually changed.
Running this Dockerfile:
Thanks for raising this!
Yeah, it looks like we should catch that pynvml.nvml.NVMLError_NotSupported: Not Supported error on systems without NVLink and show a sensible message to the user in the front end.
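For anyone following along, here is a rough sketch of what that guard might look like. It is not the actual fix in jupyterlab-nvdashboard: probe_nvlink and make_pynvml_reader are hypothetical names, the counter index 0 is an assumption, and the only pynvml calls used are nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex, and the nvmlDeviceGetNvLinkUtilizationCounter call that appears in the traceback. The idea is to try one read per GPU and link before building the plot, and fall back to a message instead of letting the exception turn into a 500.

```python
def probe_nvlink(read_counter, ngpus, nlinks, not_supported_errors):
    """Try one counter read per (GPU, link) pair.

    Returns (True, None) when every read succeeds, otherwise
    (False, message) with text that could be shown in the panel.
    """
    try:
        for i in range(ngpus):
            for j in range(nlinks):
                read_counter(i, j)
    except not_supported_errors as exc:
        return False, "NVLink metrics are not available on this system ({})".format(exc)
    return True, None


def make_pynvml_reader():
    """Hypothetical wiring to pynvml, imported lazily so the probe has no hard dependency."""
    import pynvml

    pynvml.nvmlInit()
    ngpus = pynvml.nvmlDeviceGetCount()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(ngpus)]

    def read_counter(i, j):
        # Same call that raises in the traceback; counter index 0 is an assumption.
        return pynvml.nvmlDeviceGetNvLinkUtilizationCounter(handles[i], j, 0)

    # NVMLError is the base class, so NVMLError_NotSupported is caught as well.
    return read_counter, ngpus, (pynvml.NVMLError,)
```

A dashboard function could call probe_nvlink first and, when it returns False, add a plain text element with the message to the document rather than the NVLink figures. Catching the base NVMLError is a deliberately broad choice here; narrowing it to NVMLError_NotSupported would let other driver errors surface as before.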