[BUG] test_dgx fails on a machine with IPv6 address on IB interface
Describe the bug
Running the dask-cuda 0.9.1 test_dgx.py test on an IBM AC922 (linux_ppc64le) machine with an IPv6 address on the IB interface fails:
[output truncated]
> raise ValueError("interface %r doesn't have an IPv4 address" % (ifname,))
E ValueError: interface 'ib0' doesn't have an IPv4 address
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/utils.py:184: ValueError
Please note: I did not have access to an AC922 with an IPv4 address assigned to the IB interface, so I could not run the same test there. I am not sure whether assigning an IPv4 address to the IB interface is the only change required to get past this failure.
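As a quick way to check which interfaces would trigger this error, here is a minimal diagnostic sketch. It is not part of dask-cuda or distributed. The helper name `interfaces_without_ipv4` and the `Addr` record are hypothetical; `Addr` only mimics the `family` and `address` fields of the entries returned by `psutil.net_if_addrs()`, so that the example runs without psutil installed. The sample addresses are made up for illustration.

```python
import socket
from collections import namedtuple

# Hypothetical stand-in for the per-address records from psutil.net_if_addrs();
# only the two fields used below are modelled.
Addr = namedtuple("Addr", ["family", "address"])


def interfaces_without_ipv4(net_if_addrs):
    """Return the sorted names of interfaces that carry no AF_INET address."""
    return sorted(
        name
        for name, infos in net_if_addrs.items()
        if not any(info.family == socket.AF_INET for info in infos)
    )


# Made-up sample data resembling the reported setup (ib0 has IPv6 only).
sample = {
    "lo": [Addr(socket.AF_INET, "127.0.0.1"), Addr(socket.AF_INET6, "::1")],
    "ib0": [Addr(socket.AF_INET6, "fe80::1")],
}
print(interfaces_without_ipv4(sample))  # prints ['ib0']
```

On a real machine, the same function could presumably be called as `interfaces_without_ipv4(psutil.net_if_addrs())`; any interface it lists would hit the `ValueError` shown above when passed to `get_ip_interface`.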
Steps/Code to reproduce bug
- Install the dask-cuda 0.9.1 conda package which we have built for linux_ppc64le
- Clone the v0.9.1 code of https://github.com/rapidsai/dask-cuda.git
- cd dask_cuda/tests
- Execute pytest test_dgx.py
Expected behavior
Going by the name of the test scenario, this test seems to be targeted at DGX machines. However, I am hopeful that it could also be made to work on AC922 machines.
Environment details
- Environment location: Bare-metal (IBM AC922 machine with NVIDIA GPUs.)
- Method of dask-cuda install: conda [Built for linux_ppc64le]
Additional context
$ pytest test_dgx.py
========================================= test session starts =========================================
platform linux -- Python 3.6.9, pytest-5.1.2, py-1.8.0, pluggy-0.13.0
rootdir: /home/sangeek/sandbox/dask-cuda-test
collected 1 item
test_dgx.py F [100%]
============================================== FAILURES ===============================================
______________________________________________ test_dgx _______________________________________________
    def test_func():
        with clean() as loop:
            if iscoroutinefunction(func):
                cor = func
            else:
                cor = gen.coroutine(func)
>           loop.run_sync(cor, timeout=timeout)
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/utils_test.py:761:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/tornado/ioloop.py:532: in run_sync
return future_cell[0].result()
test_dgx.py:14: in test_dgx
async with DGX(asynchronous=True) as cluster:
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/deploy/cluster.py:325: in __aenter__
await self
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/deploy/spec.py:282: in _
await self._start()
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/deploy/spec.py:224: in _start
**self.scheduler_spec.get("options", {})
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/scheduler.py:1116: in __init__
default_port=self.default_port,
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/comm/addressing.py:236: in address_from_user_args
host = get_ip_interface(interface)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
ifname = 'ib0'
    def get_ip_interface(ifname):
        """
        Get the local IPv4 address of a network interface.
        KeyError is raised if the interface doesn't exist.
        ValueError is raised if the interface does no have an IPv4 address
        associated with it.
        """
        import psutil
        net_if_addrs = psutil.net_if_addrs()
        if ifname not in net_if_addrs:
            allowed_ifnames = list(net_if_addrs.keys())
            raise ValueError(
                "{!r} is not a valid network interface. "
                "Valid network interfaces are: {}".format(ifname, allowed_ifnames)
            )
        for info in net_if_addrs[ifname]:
            if info.family == socket.AF_INET:
                return info.address
>       raise ValueError("interface %r doesn't have an IPv4 address" % (ifname,))
E       ValueError: interface 'ib0' doesn't have an IPv4 address
../../../../aconda3/envs/dask-cuda-py36/lib/python3.6/site-packages/distributed/utils.py:184: ValueError
========================================== 1 failed in 3.06s ==========================================
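The traceback shows that `get_ip_interface` only ever returns an `AF_INET` address, which is why an IPv6-only `ib0` raises. One possible direction, sketched below purely as an illustration and not as the fix adopted by distributed or dask-cuda, is to prefer IPv4 and fall back to IPv6 when asked to. The function name `pick_interface_address`, its `allow_ipv6` parameter, and the `Addr` record are all hypothetical; `Addr` stands in for the entries of `psutil.net_if_addrs()` so the example is self-contained, and the addresses are made up. Whether the rest of the stack (scheduler addressing, UCX) would accept an IPv6 host is a separate question this sketch does not answer.

```python
import socket
from collections import namedtuple

# Hypothetical stand-in for the per-address records from psutil.net_if_addrs().
Addr = namedtuple("Addr", ["family", "address"])


def pick_interface_address(ifname, net_if_addrs, allow_ipv6=True):
    """Return an address for ifname, preferring IPv4 over IPv6.

    net_if_addrs is a mapping of interface name to a list of records
    with `family` and `address` attributes.
    """
    if ifname not in net_if_addrs:
        raise ValueError(
            "{!r} is not a valid network interface. "
            "Valid network interfaces are: {}".format(ifname, list(net_if_addrs))
        )
    # Try IPv4 first, then (optionally) IPv6.
    families = [socket.AF_INET]
    if allow_ipv6:
        families.append(socket.AF_INET6)
    for family in families:
        for info in net_if_addrs[ifname]:
            if info.family == family:
                return info.address
    raise ValueError("interface %r doesn't have a usable IP address" % (ifname,))


# Made-up sample data: ib0 is IPv6-only, eth0 has both families.
sample = {
    "ib0": [Addr(socket.AF_INET6, "fe80::1")],
    "eth0": [Addr(socket.AF_INET6, "2001:db8::1"), Addr(socket.AF_INET, "192.0.2.10")],
}
print(pick_interface_address("ib0", sample))
print(pick_interface_address("eth0", sample))
```

With `allow_ipv6=False` the function behaves like the original for the `ib0` case and raises `ValueError`, so the fallback is opt-in rather than a silent change in behavior.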
Top GitHub Comments
@ksangeek Note that for IB usage, you will have to build UCX from source. We have instructions here for UCX+OFED: https://ucx-py.readthedocs.io/en/latest/install.html#ucx-ofed
Thanks @ksangeek, I don’t mean to pressure you; I'm just suggesting that now is a good time for UCX. No worries if you and your colleagues can’t do it at this time.