CIFAR-10 example - RuntimeError: Error building extension 'fused_adam'
Hey, I was trying out the CIFAR-10 tutorial (link).
Could you assist with the runtime error?
On executing (run_ds.sh):
(dspeed) axe@axe-H270-Gaming-3:~/Downloads/DeepSpeedExamples/cifar$ sh run_ds.sh
[2021-01-26 05:43:56,524] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-01-26 05:43:56,554] [INFO] [runner.py:355:main] cmd = /home/axe/VirtualEnvs/dspeed/bin/python3.6 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
[2021-01-26 05:43:56,972] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2021-01-26 05:43:56,972] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=2, node_rank=0
[2021-01-26 05:43:56,972] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2021-01-26 05:43:56,972] [INFO] [launch.py:100:main] dist_world_size=2
[2021-01-26 05:43:56,973] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
0it [00:00, ?it/s]Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
99%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 168140800/170498071 [00:07<00:00, 28603271.23it/s]Extracting ./data/cifar-10-python.tar.gz to ./data
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
Files already downloaded and verified
170500096it [00:10, 16970356.67it/s]
170500096it [00:10, 16911123.86it/s]
horse plane cat bird
[2021-01-26 05:44:13,334] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.10, git-hash=unknown, git-branch=unknown
[2021-01-26 05:44:13,335] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
truck horse ship ship
[2021-01-26 05:44:14,857] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.10, git-hash=unknown, git-branch=unknown
[2021-01-26 05:44:14,857] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-01-26 05:44:18,027] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2021-01-26 05:44:18,028] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
Using /home/axe/.cache/torch_extensions as PyTorch extensions root...
Using /home/axe/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/axe/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda_10_1_7_6/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/TH -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda_10_1_7_6/include -isystem /home/axe/VirtualEnvs/dspeed/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -std=c++14 -c /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/local/cuda_10_1_7_6/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/TH -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda_10_1_7_6/include -isystem /home/axe/VirtualEnvs/dspeed/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -std=c++14 -c /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134: required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6688:95: required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’ without object
__p->_M_set_sharable();
~~~~~~~~~^~
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134: required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6693:95: required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’ without object
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/TH -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda_10_1_7_6/include -isystem /home/axe/VirtualEnvs/dspeed/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1539, in _run_ninja_build
env=env)
File "/usr/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "cifar10_deepspeed.py", line 144, in <module>
args=args, model=net, model_parameters=parameters, training_data=trainset)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/__init__.py", line 119, in initialize
config_params=config_params)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 171, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 514, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 583, in _configure_basic_optimizer
optimizer = FusedAdam(model_parameters, **optimizer_parameters)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/adam/fused_adam.py", line 72, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 180, in load
return self.jit_load(verbose)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 216, in jit_load
verbose=verbose)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 997, in load
keep_intermediates=keep_intermediates)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1202, in _jit_compile
with_cuda=with_cuda)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1300, in _write_ninja_file_and_build_library
error_prefix="Error building extension '{}'".format(name))
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam' # *******************************************************
Loading extension module fused_adam...
Traceback (most recent call last):
File "cifar10_deepspeed.py", line 144, in <module>
args=args, model=net, model_parameters=parameters, training_data=trainset)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/__init__.py", line 119, in initialize
config_params=config_params)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 171, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 514, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 583, in _configure_basic_optimizer
optimizer = FusedAdam(model_parameters, **optimizer_parameters)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/adam/fused_adam.py", line 72, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 180, in load
return self.jit_load(verbose)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 216, in jit_load
verbose=verbose)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 997, in load
keep_intermediates=keep_intermediates)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1213, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1560, in _import_module_from_library
file, path, description = imp.find_module(module_name, [path])
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/imp.py", line 297, in find_module
raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'fused_adam' # *******************************************************
Here's the output of ds_report:
(dspeed) axe@axe-H270-Gaming-3:~/Downloads/DeepSpeedExamples/cifar$ ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
[WARNING] sparse_attn requires the 'cmake' command, but it does not exist!
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch']
torch version .................... 1.7.1+cu101
torch cuda version ............... 10.1
nvcc version ..................... 10.1
deepspeed install path ........... ['/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.3.10, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.7, cuda 10.1
Running with CUDA 10.1 on Ubuntu 18.04. Here's the virtual environment:
(dspeed) axe@axe-H270-Gaming-3:~/Downloads/DeepSpeedExamples/cifar$ pip freeze
cycler==0.10.0
dataclasses==0.8
deepspeed==0.3.10
kiwisolver==1.3.1
matplotlib==3.3.3
ninja==1.10.0.post2
numpy==1.19.5
Pillow==8.1.0
protobuf==3.14.0
pyparsing==2.4.7
python-dateutil==2.8.1
six==1.15.0
tensorboardX==1.8
torch==1.7.1+cu101
torchaudio==0.7.2
torchvision==0.8.2+cu101
tqdm==4.56.0
typing-extensions==3.7.4.3
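(While waiting on a proper fix: one possible way to sidestep the fused_adam build entirely is to ask DeepSpeed to use the unfused `torch.optim.Adam`. This is only a sketch of the `ds_config.json` optimizer section — the `torch_adam` flag is what recent DeepSpeed versions document for this purpose, and the `lr` value here is illustrative, not necessarily the tutorial's:)

```json
{
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001,
      "torch_adam": true
    }
  }
}
```

With `torch_adam` set, the optimizer path should avoid JIT-compiling the CUDA extension, at the cost of the fused kernel's speed.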
Issue Analytics
- State:
- Created 3 years ago
- Comments: 15 (4 by maintainers)
Top GitHub Comments
In my case, the same issue happened even after I updated CUDA to version 10.1.243, and I could not move to CUDA 10.2 because my Ubuntu is 14.04. I found that my issue was caused by an old version of GCC (4.8). I followed this guide to update to GCC 6 and the problem was solved: https://gist.github.com/application2000/73fd6f4bf1be6600a2cf9f56315a2d91 Hope this helps someone ^^
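(To check whether you are in the same situation before touching anything: a small sketch that inspects the default `gcc` on PATH, which is the host compiler nvcc and torch's extension builder will normally pick up. The version threshold is a rough heuristic, not an official compatibility table — very old toolchains like gcc 4.8 are known to choke on the C++14 code in these ops:)

```python
import re
import subprocess


def gcc_major_version():
    """Return the major version of the `gcc` on PATH, or None if unavailable."""
    try:
        first_line = subprocess.run(
            ["gcc", "--version"], capture_output=True, text=True, check=True
        ).stdout.splitlines()[0]
    except (OSError, subprocess.CalledProcessError, IndexError):
        return None
    # e.g. "gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0" -> 7
    match = re.search(r"(\d+)\.\d+\.\d+", first_line)
    return int(match.group(1)) if match else None


version = gcc_major_version()
if version is not None and version < 5:
    print(f"gcc {version} is likely too old to build DeepSpeed's CUDA ops")
```

If this reports an old compiler, switching the system default to a newer GCC (as in the gist above) is the fix that worked here.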
I had issues with installation and was following the idea in https://github.com/microsoft/DeepSpeed/issues/629#issuecomment-753993124 to change CUDA from 10.1.105 to 10.1.243, but ended up installing 10.2 instead, which fixed this issue.
Sorry, I won't have time to revert to 10.1 to look for the underlying cause, but in any case, that should be an easy fix in the meantime.