NVLink Throughput and Timeline throwing 500's and Errors in lab interface

This is an awesome project. Thanks for the hard work here! It’s really nice to have a dashboard for watching GPU resources (and is way better than opening up a terminal and running watch nvidia-smi 😀 )

I’ll preface this issue by saying it is mostly just user feedback. Do with it what you will 😃. I’m happy to help debug further, but I have zero ability to actually write JLab extensions, so I can’t help on the writing-code-to-help-fix side of things.

I’m brand new to using this extension (installed it like 15 mins ago) and was clicking around seeing what all the different dashboards do. When I open up the NVLink Throughput and NVLink Timeline dashboards, I immediately get stack traces in my jupyter server logs and a “500: Internal Server Error” in the jupyterlab widget. This is almost certainly because I’m not running on a multi-gpu system.

[Screenshot: NVLink JupyterLab 500 error in the opened panel]

server logs from NVLink error:

ERROR:tornado.access:500 GET /NVLink-Throughput (127.0.0.1) 1.00ms
[E 11:27:00.965 LabApp] {
      "Host": "localhost:8888",
      "Connection": "keep-alive",
      "Upgrade-Insecure-Requests": "1",
      "Dnt": "1",
      "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.75 Safari/537.36",
      "Sec-Fetch-Mode": "nested-navigate",
      "Sec-Fetch-User": "?1",
      "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
      "Sec-Fetch-Site": "same-origin",
      "Accept-Encoding": "gzip, deflate, br",
      "Accept-Language": "en-US,en;q=0.9",
      "Cookie": "_xsrf=2|a800529e|3734e978fe7758a227e33d8ef289c566|1571322840; username-localhost-8888=\"2|1:0|10:1571326020|23:username-localhost-8888|44:MDg0NGY3NjY4OTMzNDdlMGI1MDQ5NmIwYjM0NmJjYTY=|969f92a7df270d40c831ddb30a0c7dfba20c443d37adaa36bea79c3fb78891a4\""
    }
[E 11:27:00.965 LabApp] 500 GET /nvdashboard/NVLink-Throughput (127.0.0.1) 18.59ms referer=None
ERROR:tornado.application:Uncaught exception GET /NVLink-Timeline (127.0.0.1)
HTTPServerRequest(protocol='http', host='localhost:8888', method='GET', uri='/NVLink-Timeline', version='HTTP/1.1', remote_ip='127.0.0.1')
Traceback (most recent call last):
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/web.py", line 1699, in _execute
    result = await result
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/bokeh/server/views/doc_handler.py", line 55, in get
    session = yield self.get_session()
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/bokeh/server/views/session_handler.py", line 77, in get_session
    session = yield self.application_context.create_session_if_needed(session_id, self.request)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/gen.py", line 748, in run
    yielded = self.gen.send(value)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/bokeh/server/contexts.py", line 215, in create_session_if_needed
    self._application.initialize_document(doc)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/bokeh/application/application.py", line 178, in initialize_document
    h.modify_document(doc)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/bokeh/application/handlers/function.py", line 133, in modify_document
    self._func(doc)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 395, in nvlink_timeline
    for i in range(ngpus)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 395, in <listcomp>
    for i in range(ngpus)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 392, in <listcomp>
    for j in range(nlinks)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/pynvml/nvml.py", line 1999, in nvmlDeviceGetNvLinkUtilizationCounter
    check_return(ret)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/pynvml/nvml.py", line 366, in check_return
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
ERROR:tornado.access:500 GET /NVLink-Timeline (127.0.0.1) 38.24ms

Oh, interesting. Every time I switch between different tabs in jupyterlab, it seems like the dashboard needs to reconnect to the websocket. Sometimes this also throws an exception in the jupyter server logs. (Clearly the workaround is to have all of the dashboards exposed and not in tabs)

Websocket error:

[E 11:34:44.378 LabApp] Uncaught exception
    Traceback (most recent call last):
      File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/websocket.py", line 649, in _run_callback
        result = callback(*args, **kwargs)
      File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/websocket.py", line 1528, in on_message
        return self._on_message(message)
      File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/websocket.py", line 1534, in _on_message
        self._on_message_callback(message)
      File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/jupyter_server_proxy/handlers.py", line 247, in message_cb
        self.write_message(message, binary=isinstance(message, bytes))
      File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/websocket.py", line 339, in write_message
        raise WebSocketClosedError()
    tornado.websocket.WebSocketClosedError
WARNING:bokeh.server.views.ws:Failed sending message as connection was closed
WARNING:bokeh.server.views.ws:Failed sending message as connection was closed
WARNING:bokeh.server.views.ws:Failed sending message as connection was closed
[I 11:34:45.794 LabApp] Trying to establish websocket connection to ws://localhost:36330/GPU-Memory/ws?bokeh-protocol-version=1.0&bokeh-session-id=zz8FYsgbMiW8RnVQwCKUJolWg5YcUxMupkjdHkU8nL4G
[I 11:34:45.847 LabApp] Websocket connection established to ws://localhost:36330/GPU-Memory/ws?bokeh-protocol-version=1.0&bokeh-session-id=zz8FYsgbMiW8RnVQwCKUJolWg5YcUxMupkjdHkU8nL4G
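The WebSocketClosedError traceback above comes from a proxy callback writing to a socket that the browser tab has already closed. One possible way to keep this out of the logs is to wrap the write so that a closed connection drops the message instead of raising. The following is only a rough sketch, not the actual jupyter_server_proxy code: `make_safe_forwarder`, `ConnectionClosed`, and `FakeSocket` are hypothetical names, and `ConnectionClosed` stands in for `tornado.websocket.WebSocketClosedError` so the example runs without Tornado installed.

```python
class ConnectionClosed(Exception):
    """Stand-in for tornado.websocket.WebSocketClosedError (stub for this sketch)."""


def make_safe_forwarder(write_message, closed_exc=ConnectionClosed, on_drop=None):
    """Wrap write_message so writes to a closed socket are dropped, not raised.

    write_message is assumed to accept (message, binary=...) like the call
    shown in the traceback. on_drop is an optional hook for logging.
    """
    def forward(message):
        try:
            write_message(message, binary=isinstance(message, bytes))
            return True
        except closed_exc:
            # The peer went away (e.g. the tab was hidden or closed).
            if on_drop is not None:
                on_drop(message)
            return False

    return forward


class FakeSocket:
    """Hypothetical socket used only to exercise the wrapper."""

    def __init__(self):
        self.open = True
        self.sent = []

    def write_message(self, message, binary=False):
        if not self.open:
            raise ConnectionClosed()
        self.sent.append((message, binary))


sock = FakeSocket()
dropped = []
forward = make_safe_forwarder(sock.write_message, on_drop=dropped.append)

forward("hello")   # delivered while the socket is open
sock.open = False
forward(b"late")   # dropped quietly once the socket is closed
```

In a real handler the `closed_exc` argument would presumably be Tornado's own exception class, and `on_drop` could log at debug level rather than error level.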

Current environment:

# packages in environment at /home/ericdill/miniconda/envs/rapidsai:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
arrow-cpp                 0.14.1           py37h6b969ab_1    conda-forge
backcall                  0.1.0                      py_0    conda-forge
boost-cpp                 1.70.0               h8e57a91_2    conda-forge
brotli                    1.0.7             he1b5a44_1000    conda-forge
bzip2                     1.0.8                h516909a_0    conda-forge
c-ares                    1.15.0            h516909a_1001    conda-forge
ca-certificates           2019.6.16            hecc5488_0    conda-forge
certifi                   2019.6.16                py37_1    conda-forge
cffi                      1.12.3           py37h8022711_0    conda-forge
cudatoolkit               10.0.130                      0  
cudf                      0.9.0                    py37_0    rapidsai
cugraph                   0.9.0                    py37_0    rapidsai
cuml                      0.9.1           cuda10.0_py37_0    rapidsai
cython                    0.29.13          py37he1b5a44_0    conda-forge
decorator                 4.4.0                      py_0    conda-forge
dlpack                    0.2                  he1b5a44_0    conda-forge
double-conversion         3.1.5                he1b5a44_1    conda-forge
fastavro                  0.22.4           py37h516909a_0    conda-forge
gflags                    2.2.2             he1b5a44_1001    conda-forge
glog                      0.4.0                he1b5a44_1    conda-forge
grpc-cpp                  1.23.0               h18db393_0    conda-forge
icu                       64.2                 he1b5a44_1    conda-forge
ipykernel                 5.1.2            py37h5ca1d4c_0    conda-forge
ipython                   7.8.0            py37h5ca1d4c_0    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
jedi                      0.15.1                   py37_0    conda-forge
jupyter_client            5.3.1                      py_0    conda-forge
jupyter_core              4.4.0                      py_0    conda-forge
libblas                   3.8.0               12_openblas    conda-forge
libcblas                  3.8.0               12_openblas    conda-forge
libcudf                   0.9.0                cuda10.0_0    rapidsai
libcugraph                0.9.0                cuda10.0_0    rapidsai
libcuml                   0.9.1                cuda10.0_0    rapidsai
libcumlprims              0.9.0                cuda10.0_0    nvidia
libevent                  2.1.10               h72c5cf5_0    conda-forge
libffi                    3.2.1             he1b5a44_1006    conda-forge
libgcc-ng                 9.1.0                hdf63c60_0  
libgfortran-ng            7.3.0                hdf63c60_0  
liblapack                 3.8.0               12_openblas    conda-forge
libnvstrings              0.9.0                cuda10.0_0    rapidsai
libopenblas               0.3.7                h6e990d7_1    conda-forge
libprotobuf               3.8.0                h8b12597_0    conda-forge
librmm                    0.9.0                cuda10.0_0    rapidsai
libsodium                 1.0.17               h516909a_0    conda-forge
libstdcxx-ng              9.1.0                hdf63c60_0  
llvmlite                  0.29.0           py37hf484d3e_0    numba
lz4-c                     1.8.3             he1b5a44_1001    conda-forge
nccl                      2.4.6.1              cuda10.0_0    nvidia
ncurses                   6.1               hf484d3e_1002    conda-forge
numba                     0.45.1          np116py37hf484d3e_0    numba
numpy                     1.16.4           py37h95a1406_0    conda-forge
nvstrings                 0.9.0                    py37_0    rapidsai
openssl                   1.1.1c               h516909a_0    conda-forge
pandas                    0.24.2           py37hb3f55d8_0    conda-forge
parquet-cpp               1.5.1                         2    conda-forge
parso                     0.5.1                      py_0    conda-forge
pexpect                   4.7.0                    py37_0    conda-forge
pickleshare               0.7.5                 py37_1000    conda-forge
pip                       19.2.3                   py37_0    conda-forge
prompt_toolkit            2.0.9                      py_0    conda-forge
ptyprocess                0.6.0                   py_1001    conda-forge
pyarrow                   0.14.1           py37h8b68381_0    conda-forge
pycparser                 2.19                     py37_1    conda-forge
pygments                  2.4.2                      py_0    conda-forge
python                    3.7.3                h33d41f4_1    conda-forge
python-dateutil           2.8.0                      py_0    conda-forge
pytz                      2019.2                     py_0    conda-forge
pyzmq                     18.0.2           py37h1768529_2    conda-forge
re2                       2019.09.01           he1b5a44_0    conda-forge
readline                  8.0                  hf8c457e_0    conda-forge
rmm                       0.9.0                    py37_0    rapidsai
setuptools                41.2.0                   py37_0    conda-forge
six                       1.12.0                py37_1000    conda-forge
snappy                    1.1.7             he1b5a44_1002    conda-forge
sqlite                    3.29.0               hcee41ef_1    conda-forge
thrift-cpp                0.12.0            hf3afdfd_1004    conda-forge
tk                        8.6.9             hed695b0_1002    conda-forge
tornado                   6.0.3            py37h516909a_0    conda-forge
traitlets                 4.3.2                 py37_1000    conda-forge
uriparser                 0.9.3                he1b5a44_1    conda-forge
wcwidth                   0.1.7                      py_1    conda-forge
wheel                     0.33.6                   py37_0    conda-forge
xz                        5.2.4             h14c3975_1001    conda-forge
zeromq                    4.3.2                he1b5a44_2    conda-forge
zlib                      1.2.11            h516909a_1005    conda-forge
zstd                      1.4.0                h3b9ef0a_0    conda-forge

Installed extension with pip install jupyterlab-nvdashboard and then jupyter labextension install jupyterlab-nvdashboard

Issue Analytics

Created: 4 years ago
Comments: 10 (5 by maintainers)

Top GitHub Comments

supertetelman commented, Jul 21, 2020

I am running the latest version of this dashboard on a DGX Station with NVLink and I am still seeing this error. Is there a specific driver version I need?

It looks like I am seeing an issue with a different metric than the previous user (nvmlDeviceGetNvLinkUtilizationCounter). I remember seeing a related bug/change with some related metrics in the driver, so maybe this API has actually changed.

ERROR:tornado.application:Uncaught exception GET /NVLink-Throughput (127.0.0.1)
HTTPServerRequest(protocol='http', host='sae-npn-01:8899', method='GET', uri='/NVLink-Throughput', version='HTTP/1.1', remote_ip='127.0.0.1')
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/web.py", line 1703, in _execute
    result = await result
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/server/views/doc_handler.py", line 56, in get
    session = yield self.get_session()
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/server/views/session_handler.py", line 79, in get_session
    session = yield self.application_context.create_session_if_needed(session_id, self.request)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 748, in run
    yielded = self.gen.send(value)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/server/contexts.py", line 222, in create_session_if_needed
    self._application.initialize_document(doc)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/application/application.py", line 178, in initialize_document
    h.modify_document(doc)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/application/handlers/function.py", line 133, in modify_document
    self._func(doc)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 233, in nvlink
    for i in range(ngpus)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 233, in <listcomp>
    for i in range(ngpus)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 230, in <listcomp>
    for j in range(nlinks)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/pynvml/nvml.py", line 2006, in nvmlDeviceGetNvLinkUtilizationCounter
    check_return(ret)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/pynvml/nvml.py", line 366, in check_return
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
ERROR:tornado.access:500 GET /NVLink-Throughput (127.0.0.1) 43.87ms
ERROR:tornado.application:Uncaught exception GET /NVLink-Timeline (127.0.0.1)
HTTPServerRequest(protocol='http', host='sae-npn-01:8899', method='GET', uri='/NVLink-Timeline', version='HTTP/1.1', remote_ip='127.0.0.1')
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/web.py", line 1703, in _execute
    result = await result
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/server/views/doc_handler.py", line 56, in get
    session = yield self.get_session()
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/server/views/session_handler.py", line 79, in get_session
    session = yield self.application_context.create_session_if_needed(session_id, self.request)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 748, in run
    yielded = self.gen.send(value)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/server/contexts.py", line 222, in create_session_if_needed
    self._application.initialize_document(doc)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/application/application.py", line 178, in initialize_document
    h.modify_document(doc)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/application/handlers/function.py", line 133, in modify_document
    self._func(doc)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 395, in nvlink_timeline
    for i in range(ngpus)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 395, in <listcomp>
    for i in range(ngpus)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 392, in <listcomp>
    for j in range(nlinks)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/pynvml/nvml.py", line 2006, in nvmlDeviceGetNvLinkUtilizationCounter
    check_return(ret)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/pynvml/nvml.py", line 366, in check_return
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
ERROR:tornado.access:500 GET /NVLink-Timeline (127.0.0.1) 38.54ms

...
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
...
# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV1     NV1     NV2     0-39            N/A
GPU1    NV1      X      NV2     NV1     0-39            N/A
GPU2    NV1     NV2      X      NV1     0-39            N/A
GPU3    NV2     NV1     NV1      X      0-39            N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Running this Dockerfile:

# https://ngc.nvidia.com/catalog/containers/nvidia:rapidsai:rapidsai
FROM nvcr.io/nvidia/rapidsai/rapidsai:cuda10.2-runtime-ubuntu18.04

ENTRYPOINT ["/bin/sh"]
CMD ["-c", "/opt/conda/envs/rapids/bin/jupyter lab  --notebook-dir=/rapids --ip=0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX}"]

jacobtomlinson commented, Oct 17, 2019

Thanks for raising this!

Yeah it looks like we should catch that pynvml.nvml.NVMLError_NotSupported: Not Supported error on systems without NVLink and show a sensible message to the user in the front end.
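As a rough illustration of that idea (not the fix that was actually merged), the counter reads could be wrapped so that a "Not Supported" error yields a friendly message instead of a 500. In this sketch `read_nvlink_counters`, `nvlink_status_message`, and `FakeNotSupported` are hypothetical names. `FakeNotSupported` stands in for `pynvml.nvml.NVMLError_NotSupported`, and `read_counter` stands in for a wrapper around `nvmlDeviceGetNvLinkUtilizationCounter`, so the example runs without a GPU or pynvml installed.

```python
from typing import Callable, List, Optional, Type


class FakeNotSupported(Exception):
    """Stand-in for pynvml.nvml.NVMLError_NotSupported (stub for this sketch)."""


def read_nvlink_counters(
    read_counter: Callable[[int, int], int],
    ngpus: int,
    nlinks: int,
    not_supported: Type[Exception] = FakeNotSupported,
) -> Optional[List[List[int]]]:
    """Return counters[gpu][link], or None if the counters are unsupported."""
    try:
        return [
            [read_counter(i, j) for j in range(nlinks)]
            for i in range(ngpus)
        ]
    except not_supported:
        # Mirrors the failure in the tracebacks: the first unsupported
        # read aborts the whole probe.
        return None


def nvlink_status_message(counters: Optional[List[List[int]]]) -> str:
    """Text the front end could show instead of a 500 error."""
    if counters is None:
        return "NVLink metrics are not supported on this system."
    return f"Reading NVLink counters for {len(counters)} GPU(s)."


def unsupported_reader(gpu: int, link: int) -> int:
    # Simulates a system where NVML rejects the counter query.
    raise FakeNotSupported("Not Supported")


def working_reader(gpu: int, link: int) -> int:
    # Simulates a system that returns a counter value per link.
    return 100 * gpu + link


print(nvlink_status_message(read_nvlink_counters(unsupported_reader, 1, 6)))
print(nvlink_status_message(read_nvlink_counters(working_reader, 2, 2)))
```

Passing the reader and exception class in as arguments keeps the probing logic testable on machines without NVLink, which is exactly where the error path needs to be exercised.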
