[BUG] test_dgx fails on a machine with IPv6 address on IB interface

Describe the bug

Running the test_dgx.py test from dask-cuda 0.9.1 on an IBM AC922 (linux_ppc64le) with an IPv6 address on the IB interface fails:

... output truncated ...
>       raise ValueError("interface %r doesn't have an IPv4 address" % (ifname,))
E       ValueError: interface 'ib0' doesn't have an IPv4 address

../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/utils.py:184: ValueError

Please note: I did not have an AC922 with an IPv4 address assigned to the IB interface on which to run the same test, so I am not sure whether assigning an IPv4 address to the IB interface is the only change required to get past this failure.

Steps/Code to reproduce bug

  1. Install the dask-cuda 0.9.1 conda package which we have built for linux_ppc64le
  2. Clone the v0.9.1 code of https://github.com/rapidsai/dask-cuda.git
  3. cd dask_cuda/tests
  4. Execute pytest test_dgx.py

Expected behavior

Going by its name, this test seems to be targeted at DGX machines, but I am hopeful that it could also be made to work on AC922 machines.

Environment details

  • Environment location: Bare-metal (IBM AC922 machine with NVIDIA GPUs.)
  • Method of dask-cuda install: conda [Built for linux_ppc64le]

Additional context

$ pytest test_dgx.py
========================================= test session starts =========================================
platform linux -- Python 3.6.9, pytest-5.1.2, py-1.8.0, pluggy-0.13.0
rootdir: /home/sangeek/sandbox/dask-cuda-test
collected 1 item

test_dgx.py F                                                                                   [100%]

============================================== FAILURES ===============================================
______________________________________________ test_dgx _______________________________________________

    def test_func():
        with clean() as loop:
            if iscoroutinefunction(func):
                cor = func
            else:
                cor = gen.coroutine(func)
>           loop.run_sync(cor, timeout=timeout)

../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/utils_test.py:761:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/tornado/ioloop.py:532: in run_sync
    return future_cell[0].result()
test_dgx.py:14: in test_dgx
    async with DGX(asynchronous=True) as cluster:
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/deploy/cluster.py:325: in __aenter__
    await self
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/deploy/spec.py:282: in _
    await self._start()
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/deploy/spec.py:224: in _start
    **self.scheduler_spec.get("options", {})
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/scheduler.py:1116: in __init__
    default_port=self.default_port,
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/comm/addressing.py:236: in address_from_user_args
    host = get_ip_interface(interface)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

ifname = 'ib0'

    def get_ip_interface(ifname):
        """
        Get the local IPv4 address of a network interface.

        KeyError is raised if the interface doesn't exist.
        ValueError is raised if the interface does no have an IPv4 address
        associated with it.
        """
        import psutil

        net_if_addrs = psutil.net_if_addrs()

        if ifname not in net_if_addrs:
            allowed_ifnames = list(net_if_addrs.keys())
            raise ValueError(
                "{!r} is not a valid network interface. "
                "Valid network interfaces are: {}".format(ifname, allowed_ifnames)
            )

        for info in net_if_addrs[ifname]:
            if info.family == socket.AF_INET:
                return info.address
>       raise ValueError("interface %r doesn't have an IPv4 address" % (ifname,))
E       ValueError: interface 'ib0' doesn't have an IPv4 address

../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/utils.py:184: ValueError
========================================== 1 failed in 3.06s ==========================================
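
The traceback shows that `get_ip_interface` only accepts `socket.AF_INET` entries, so an interface that exposes only IPv6 addresses always reaches the final `ValueError`. As a minimal sketch of one possible workaround, assuming a fallback to IPv6 is acceptable, the lookup could be written as below. The names `pick_interface_address`, `Addr`, and `allow_ipv6` are hypothetical and are not part of `distributed`; the function takes the interface mapping as an argument so it can be tried without an IB interface. Whether the rest of the stack (scheduler addressing, UCX) would accept the returned IPv6 address is not established by this issue.

```python
import socket
from collections import namedtuple

# Hypothetical stand-in for the per-address records in the mapping that
# psutil.net_if_addrs() returns; only the two fields used here are modeled.
Addr = namedtuple("Addr", ["family", "address"])


def pick_interface_address(net_if_addrs, ifname, allow_ipv6=True):
    """Return an IPv4 address of ``ifname``, optionally falling back to IPv6.

    ``net_if_addrs`` maps interface names to lists of records that have
    ``family`` and ``address`` attributes.
    """
    if ifname not in net_if_addrs:
        raise ValueError(
            "{!r} is not a valid network interface. "
            "Valid network interfaces are: {}".format(ifname, list(net_if_addrs))
        )

    infos = net_if_addrs[ifname]

    # Same preference as the original function: IPv4 first.
    for info in infos:
        if info.family == socket.AF_INET:
            return info.address

    # Assumption of this sketch: fall back to the first IPv6 address,
    # returned unchanged (including any scope suffix it may carry).
    if allow_ipv6:
        for info in infos:
            if info.family == socket.AF_INET6:
                return info.address

    raise ValueError(
        "interface %r doesn't have an IPv4%s address"
        % (ifname, " or IPv6" if allow_ipv6 else "")
    )
```

With psutil installed, the mapping would come from `psutil.net_if_addrs()`, for example `pick_interface_address(psutil.net_if_addrs(), "ib0")`. Passing `allow_ipv6=False` keeps the IPv4-only behavior shown in the traceback.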

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

quasiben commented, Dec 17, 2020 (1 reaction)

@ksangeek Note that for IB usage, you will have to build UCX from source. We have instructions here for UCX+OFED: https://ucx-py.readthedocs.io/en/latest/install.html#ucx-ofed

pentschev commented, May 8, 2020 (1 reaction)

Thanks @ksangeek, I don’t mean to pressure. I'm just suggesting that now is a good time for UCX; no worries if you and your colleagues can’t do it at this time.
