CIFAR-10 example - RuntimeError: Error building extension 'fused_adam'
Hey, I was trying out the CIFAR-10 tutorial (link).
Could you assist with the runtime error?
On executing (run_ds.sh):
(dspeed) axe@axe-H270-Gaming-3:~/Downloads/DeepSpeedExamples/cifar$ sh run_ds.sh
[2021-01-26 05:43:56,524] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-01-26 05:43:56,554] [INFO] [runner.py:355:main] cmd = /home/axe/VirtualEnvs/dspeed/bin/python3.6 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
[2021-01-26 05:43:56,972] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2021-01-26 05:43:56,972] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=2, node_rank=0
[2021-01-26 05:43:56,972] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2021-01-26 05:43:56,972] [INFO] [launch.py:100:main] dist_world_size=2
[2021-01-26 05:43:56,973] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
0it [00:00, ?it/s]Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
99%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 168140800/170498071 [00:07<00:00, 28603271.23it/s]Extracting ./data/cifar-10-python.tar.gz to ./data
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
Files already downloaded and verified
170500096it [00:10, 16970356.67it/s]
170500096it [00:10, 16911123.86it/s]
horse plane cat bird
[2021-01-26 05:44:13,334] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.10, git-hash=unknown, git-branch=unknown
[2021-01-26 05:44:13,335] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
truck horse ship ship
[2021-01-26 05:44:14,857] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.10, git-hash=unknown, git-branch=unknown
[2021-01-26 05:44:14,857] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-01-26 05:44:18,027] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2021-01-26 05:44:18,028] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
Using /home/axe/.cache/torch_extensions as PyTorch extensions root...
Using /home/axe/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/axe/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda_10_1_7_6/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/TH -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda_10_1_7_6/include -isystem /home/axe/VirtualEnvs/dspeed/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -std=c++14 -c /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/local/cuda_10_1_7_6/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/TH -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda_10_1_7_6/include -isystem /home/axe/VirtualEnvs/dspeed/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -std=c++14 -c /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134: required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6688:95: required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’ without object
__p->_M_set_sharable();
~~~~~~~~~^~
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134: required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6693:95: required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’ without object
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/TH -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda_10_1_7_6/include -isystem /home/axe/VirtualEnvs/dspeed/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1539, in _run_ninja_build
env=env)
File "/usr/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "cifar10_deepspeed.py", line 144, in <module>
args=args, model=net, model_parameters=parameters, training_data=trainset)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/__init__.py", line 119, in initialize
config_params=config_params)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 171, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 514, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 583, in _configure_basic_optimizer
optimizer = FusedAdam(model_parameters, **optimizer_parameters)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/adam/fused_adam.py", line 72, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 180, in load
return self.jit_load(verbose)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 216, in jit_load
verbose=verbose)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 997, in load
keep_intermediates=keep_intermediates)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1202, in _jit_compile
with_cuda=with_cuda)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1300, in _write_ninja_file_and_build_library
error_prefix="Error building extension '{}'".format(name))
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam' # *******************************************************
Loading extension module fused_adam...
Traceback (most recent call last):
File "cifar10_deepspeed.py", line 144, in <module>
args=args, model=net, model_parameters=parameters, training_data=trainset)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/__init__.py", line 119, in initialize
config_params=config_params)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 171, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 514, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 583, in _configure_basic_optimizer
optimizer = FusedAdam(model_parameters, **optimizer_parameters)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/adam/fused_adam.py", line 72, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 180, in load
return self.jit_load(verbose)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 216, in jit_load
verbose=verbose)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 997, in load
keep_intermediates=keep_intermediates)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1213, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1560, in _import_module_from_library
file, path, description = imp.find_module(module_name, [path])
File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/imp.py", line 297, in find_module
raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'fused_adam' # *******************************************************
Here's the output of ds_report:
(dspeed) axe@axe-H270-Gaming-3:~/Downloads/DeepSpeedExamples/cifar$ ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
[WARNING] sparse_attn requires the 'cmake' command, but it does not exist!
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch']
torch version .................... 1.7.1+cu101
torch cuda version ............... 10.1
nvcc version ..................... 10.1
deepspeed install path ........... ['/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.3.10, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.7, cuda 10.1
Running with CUDA 10.1 on Ubuntu 18.04. Here's the virtual environment:
(dspeed) axe@axe-H270-Gaming-3:~/Downloads/DeepSpeedExamples/cifar$ pip freeze
cycler==0.10.0
dataclasses==0.8
deepspeed==0.3.10
kiwisolver==1.3.1
matplotlib==3.3.3
ninja==1.10.0.post2
numpy==1.19.5
Pillow==8.1.0
protobuf==3.14.0
pyparsing==2.4.7
python-dateutil==2.8.1
six==1.15.0
tensorboardX==1.8
torch==1.7.1+cu101
torchaudio==0.7.2
torchvision==0.8.2+cu101
tqdm==4.56.0
typing-extensions==3.7.4.3
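(While waiting on a proper fix: one possible way to sidestep the fused_adam build entirely is to ask DeepSpeed to use the unfused `torch.optim.Adam`. This is only a sketch of the `ds_config.json` optimizer section — the `torch_adam` flag is what recent DeepSpeed versions document for this purpose, and the `lr` value here is illustrative, not necessarily the tutorial's:)

```json
{
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001,
      "torch_adam": true
    }
  }
}
```

With `torch_adam` set, the optimizer path should avoid JIT-compiling the CUDA extension, at the cost of the fused kernel's speed.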
Issue Analytics
- State:
- Created 3 years ago
- Comments: 15 (4 by maintainers)
Top GitHub Comments
In my case, the same issue happened even after I updated CUDA to version 10.1.243, and I could not move to CUDA 10.2 because my Ubuntu is 14.04. I found that my issue was caused by an old version of GCC (4.8). I followed this guide to update to GCC 6 and the problem was solved: https://gist.github.com/application2000/73fd6f4bf1be6600a2cf9f56315a2d91 Hope this helps someone ^^
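(To check whether you are in the same situation before touching anything: a small sketch that inspects the default `gcc` on PATH, which is the host compiler nvcc and torch's extension builder will normally pick up. The version threshold is a rough heuristic, not an official compatibility table — very old toolchains like gcc 4.8 are known to choke on the C++14 code in these ops:)

```python
import re
import subprocess


def gcc_major_version():
    """Return the major version of the `gcc` on PATH, or None if unavailable."""
    try:
        first_line = subprocess.run(
            ["gcc", "--version"], capture_output=True, text=True, check=True
        ).stdout.splitlines()[0]
    except (OSError, subprocess.CalledProcessError, IndexError):
        return None
    # e.g. "gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0" -> 7
    match = re.search(r"(\d+)\.\d+\.\d+", first_line)
    return int(match.group(1)) if match else None


version = gcc_major_version()
if version is not None and version < 5:
    print(f"gcc {version} is likely too old to build DeepSpeed's CUDA ops")
```

If this reports an old compiler, switching the system default to a newer GCC (as in the gist above) is the fix that worked here.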
I had issues with installation and was following the idea in https://github.com/microsoft/DeepSpeed/issues/629#issuecomment-753993124 to change CUDA from 10.1.105 to 10.1.243, but ended up installing 10.2 instead, which fixed this issue.
Sorry, I won't have time to revert to 10.1 to look for the underlying cause, but in any case, that should be an easy fix in the meantime.