"Reinstall Horovod with HOROVOD_WITH_MXNET=1 to debug the build error": can anybody help me?

When I run run.sh, I get the error shown below. I have tried mxnet 1.6.0 and mxnet-cu101, but neither works. The output of horovodrun --check is:

Horovod v0.19.2:

Available Frameworks:
    [X] TensorFlow
    [X] PyTorch
    [ ] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo
  • My CUDA version is 10.02. Is my CUDA version the problem?
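
The check output above can also be read programmatically, which helps when comparing several environments. The following is a minimal sketch, assuming the output keeps the `[X]` / `[ ]` layout shown above; `parse_horovod_check` is a hypothetical helper written for this illustration, not part of Horovod.

```python
import re


def parse_horovod_check(text):
    """Parse `horovodrun --check` style output into {section: {name: built}}."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        # Section headers end with a colon, e.g. "Available Frameworks:".
        if stripped.endswith(":") and not stripped.startswith("["):
            current = stripped[:-1]
            sections[current] = {}
            continue
        # Entries look like "[X] TensorFlow" (built) or "[ ] MXNet" (not built).
        match = re.match(r"\[([ X])\]\s+(\S+)", stripped)
        if match and current is not None:
            sections[current][match.group(2)] = match.group(1) == "X"
    return sections


# Sample text copied from the output in this issue.
SAMPLE = """\
Horovod v0.19.2:

Available Frameworks:
    [X] TensorFlow
    [X] PyTorch
    [ ] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo
"""

report = parse_horovod_check(SAMPLE)
missing = [name for name, built in report["Available Frameworks"].items() if not built]
print("Frameworks not built:", missing)
```

Here the only framework reported as not built is MXNet, which matches the ImportError further down.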

When I run run.sh, the following error occurs:


[ps-SYS-4028GR-TR:13182] Warning: could not find environment variable "LD_LIBRARY_PATH"
Traceback (most recent call last):
  File "train_memory.py", line 14, in <module>
    import horovod.mxnet as hvd
  File "/home/liuyang/anaconda3/envs/mxnet_partial/lib/python3.6/site-packages/horovod/mxnet/__init__.py", line 23, in <module>
    __file__, 'mpi_lib')
  File "/home/liuyang/anaconda3/envs/mxnet_partial/lib/python3.6/site-packages/horovod/common/util.py", line 56, in check_extension
    'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.mxnet has not been built.  If this is not expected, reinstall Horovod with HOROVOD_WITH_MXNET=1 to debug the build error.
Traceback (most recent call last):
  File "train_memory.py", line 14, in <module>
    import horovod.mxnet as hvd
  File "/home/liuyang/anaconda3/envs/mxnet_partial/lib/python3.6/site-packages/horovod/mxnet/__init__.py", line 23, in <module>
    __file__, 'mpi_lib')
  File "/home/liuyang/anaconda3/envs/mxnet_partial/lib/python3.6/site-packages/horovod/common/util.py", line 56, in check_extension
    'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.mxnet has not been built.  If this is not expected, reinstall Horovod with HOROVOD_WITH_MXNET=1 to debug the build error.
Traceback (most recent call last):
  File "train_memory.py", line 14, in <module>
    import horovod.mxnet as hvd
  File "/home/liuyang/anaconda3/envs/mxnet_partial/lib/python3.6/site-packages/horovod/mxnet/__init__.py", line 23, in <module>
    __file__, 'mpi_lib')
  File "/home/liuyang/anaconda3/envs/mxnet_partial/lib/python3.6/site-packages/horovod/common/util.py", line 56, in check_extension
    'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.mxnet has not been built.  If this is not expected, reinstall Horovod with HOROVOD_WITH_MXNET=1 to debug the build error.
Traceback (most recent call last):
  File "train_memory.py", line 14, in <module>
    import horovod.mxnet as hvd
  File "/home/liuyang/anaconda3/envs/mxnet_partial/lib/python3.6/site-packages/horovod/mxnet/__init__.py", line 23, in <module>
    __file__, 'mpi_lib')
  File "/home/liuyang/anaconda3/envs/mxnet_partial/lib/python3.6/site-packages/horovod/common/util.py", line 56, in check_extension
    'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.mxnet has not been built.  If this is not expected, reinstall Horovod with HOROVOD_WITH_MXNET=1 to debug the build error.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "train_memory.py", line 14, in <module>
    import horovod.mxnet as hvd
  File "/home/liuyang/anaconda3/envs/mxnet_partial/lib/python3.6/site-packages/horovod/mxnet/__init__.py", line 23, in <module>
    __file__, 'mpi_lib')
  File "/home/liuyang/anaconda3/envs/mxnet_partial/lib/python3.6/site-packages/horovod/common/util.py", line 56, in check_extension
    'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.mxnet has not been built.  If this is not expected, reinstall Horovod with HOROVOD_WITH_MXNET=1 to debug the build error.
Traceback (most recent call last):
  File "train_memory.py", line 14, in <module>
    import horovod.mxnet as hvd
  File "/home/liuyang/anaconda3/envs/mxnet_partial/lib/python3.6/site-packages/horovod/mxnet/__init__.py", line 23, in <module>
    __file__, 'mpi_lib')
  File "/home/liuyang/anaconda3/envs/mxnet_partial/lib/python3.6/site-packages/horovod/common/util.py", line 56, in check_extension
    'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.mxnet has not been built.  If this is not expected, reinstall Horovod with HOROVOD_WITH_MXNET=1 to debug the build error.
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[35340,1],1]
  Exit code:    1
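
The ImportError above suggests reinstalling Horovod with HOROVOD_WITH_MXNET=1 so that the build fails loudly instead of silently skipping MXNet. A minimal sketch of how that command could be assembled is below; the pinned version 0.19.2 is taken from the check output in this issue, `horovod_reinstall_command` is a hypothetical helper, and the block only prints the command rather than running it.

```python
import os
import shlex
import sys


def horovod_reinstall_command(version="0.19.2"):
    """Build (command, environment) for a verbose Horovod build that requires MXNet."""
    env = dict(os.environ)
    # With this variable set, the Horovod build errors out if the MXNet
    # extension cannot be compiled, so the real cause shows up in the log.
    env["HOROVOD_WITH_MXNET"] = "1"
    cmd = [
        sys.executable, "-m", "pip", "install",
        "--no-cache-dir",  # avoid reusing a previously built wheel
        "-v",              # verbose output, so compiler errors are visible
        f"horovod=={version}",
    ]
    return cmd, env


cmd, env = horovod_reinstall_command()
print("HOROVOD_WITH_MXNET=1 " + " ".join(shlex.quote(part) for part in cmd))

# To actually run it (after `pip uninstall horovod`), one option is:
#   import subprocess; subprocess.run(cmd, env=env, check=True)
```

Using `sys.executable -m pip` keeps the install inside the same conda environment that runs the training script.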

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7

Top GitHub Comments

3 reactions
ZHUXUHAN commented, Dec 1, 2020

If you can help me, thank you so much.

Maybe you have the CPU version of MXNet installed. The specific version we use is mxnet-cu101 1.6.0.post0; you can check this.
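
One way to check which MXNet build is installed is to list the pip distributions in the environment. This is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`; on Python 3.6, `pip list` gives the same information) and assuming the usual pip naming, where `mxnet` is a CPU build and `mxnet-cu101` is a CUDA 10.1 build. Both helper functions are hypothetical names for this illustration.

```python
import re
from importlib import metadata  # available in Python 3.8+


def classify_mxnet_package(name):
    """Return 'gpu', 'cpu', or None (not an MXNet package) for a pip distribution name."""
    name = name.lower().replace("_", "-")
    if not re.fullmatch(r"mxnet(-[a-z0-9]+)?", name):
        return None
    # Names such as mxnet-cu101 or mxnet-cu101mkl indicate a CUDA build.
    return "gpu" if re.fullmatch(r"mxnet-cu\d+(mkl)?", name) else "cpu"


def installed_mxnet_packages():
    """Map each installed MXNet distribution name to (version, 'cpu' or 'gpu')."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        kind = classify_mxnet_package(name) if name else None
        if kind:
            found[name] = (dist.version, kind)
    return found


for name, (version, kind) in installed_mxnet_packages().items():
    print(f"{name} {version}: {kind} build")
```

If both a CPU and a GPU distribution show up, uninstalling both and reinstalling only the GPU one before rebuilding Horovod is a reasonable thing to try.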

0 reactions
fucker007 commented, Jan 5, 2021

Thank you so much! Have a good day!
