[BUG] torch-nightly: linker issue with `cpu_adam.so`

When the HF CI runs the DeepSpeed tests with torch-nightly, I get multiple failures involving cpu_adam.so.

Most tests fail with either:

Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fbf967353a0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 97, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

or:

           Traceback (most recent call last):
E             File "/__w/transformers/transformers/examples/pytorch/summarization/run_summarization.py", line 648, in <module>
E               main()
E             File "/__w/transformers/transformers/examples/pytorch/summarization/run_summarization.py", line 570, in main
E               train_result = trainer.train(resume_from_checkpoint=checkpoint)
E             File "/__w/transformers/transformers/src/transformers/trainer.py", line 1163, in train
E               deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
E             File "/__w/transformers/transformers/src/transformers/deepspeed.py", line 406, in deepspeed_init
E               model, optimizer, _, lr_scheduler = deepspeed.initialize(
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
E               engine = DeepSpeedEngine(args=args,
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 294, in __init__
E               self._configure_optimizer(optimizer, model_parameters)
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1106, in _configure_optimizer
E               basic_optimizer = self._configure_basic_optimizer(model_parameters)
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1191, in _configure_basic_optimizer
E               optimizer = DeepSpeedCPUAdam(model_parameters,
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 83, in __init__
E               self.ds_opt_adam = CPUAdamBuilder().load()
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 370, in load
E               return self.jit_load(verbose)
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 402, in jit_load
E               op_module = load(
E             File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1130, in load
E               return _jit_compile(
E             File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1368, in _jit_compile
E               return _import_module_from_library(name, build_directory, is_python_module)
E             File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1758, in _import_module_from_library
E               module = importlib.util.module_from_spec(spec)
E             File "<frozen importlib._bootstrap>", line 556, in module_from_spec
E             File "<frozen importlib._bootstrap_external>", line 1101, in create_module
E             File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
E           ImportError: /github/home/.cache/torch_extensions/py38_cu111/cpu_adam/cpu_adam.so: undefined symbol: curandCreateGenerator

(e.g. test: test_can_resume_training_normal_0_zero2, but almost all tests fail)
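
The first traceback is most likely a secondary symptom of the second one. Python still calls `__del__` on an object whose `__init__` raised, so if loading the extension fails before `ds_opt_adam` is assigned, the destructor then trips over the missing attribute. A minimal sketch of that pattern, using a hypothetical `Optimizer` class and `load_extension` function (neither is DeepSpeed code):

```python
def load_extension():
    # Hypothetical stand-in for a JIT build/load step that fails at import time.
    raise ImportError("cpu_adam.so: undefined symbol: curandCreateGenerator")


class Optimizer:
    """Hypothetical class mirroring the shape of the failing code path."""

    def __init__(self):
        # If this raises, self.ext is never assigned.
        self.ext = load_extension()

    def __del__(self):
        # Guarding with getattr avoids a second, misleading AttributeError
        # when __init__ did not finish.
        ext = getattr(self, "ext", None)
        if ext is not None:
            ext.destroy()


def try_build():
    """Return the load error message, or None if construction succeeded."""
    try:
        Optimizer()
    except ImportError as exc:
        return str(exc)
    return None


message = try_build()
print(message)
```

With the `getattr` guard in place only the real error (the `ImportError`) is reported, which is the one worth debugging.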

The compilation went through just fine:

Installed CUDA version 11.2 does not match the version torch was compiled with 11.1 but since the APIs are compatible, accepting this combination
Using /github/home/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Creating extension directory /github/home/.cache/torch_extensions/py38_cu111/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /github/home/.cache/torch_extensions/py38_cu111/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -L/usr/local/cuda/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp -D__AVX256__ -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -L/opt/conda/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
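
The final link step in the log can be checked for which libraries it names. The sketch below extracts the `-l` flags from the `[3/3]` command; the helper `linked_libraries` is hypothetical, not part of DeepSpeed or PyTorch. It suggests that `libcurand` is not linked explicitly, so `curandCreateGenerator` would have to be provided by some other library at load time:

```python
import shlex


def linked_libraries(link_command):
    """Return the library names passed as -l<name> flags in a link command."""
    return [
        token[2:]
        for token in shlex.split(link_command)
        if token.startswith("-l") and len(token) > 2
    ]


# Final link step copied from the build log above.
LINK_STEP = (
    "c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared "
    "-L/opt/conda/lib/python3.8/site-packages/torch/lib "
    "-lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp "
    "-ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so"
)

libs = linked_libraries(LINK_STEP)
print(libs)
print("curand" in libs)  # False: only cudart and the torch libraries are named
```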

It must be something specific to that box, since I can’t reproduce these problems on my own box with the same torch-nightly version / py38.

But if I check on my home box (where things work):

nm ~/.cache/torch_extensions/py38_cu113/cpu_adam/cpu_adam.so | grep curandCreateGenerator
                 U curandCreateGenerator

So curandCreateGenerator is undefined there as well, and it’s used here:

https://github.com/microsoft/DeepSpeed/blob/91e15593ea4487014114a03c7b4a2a05567fd3f8/csrc/includes/context.h#L46

but for some reason it doesn’t cause a problem on my setup. Perhaps it’s a linker issue - some library doesn’t get properly linked?
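
An undefined (`U`) entry is not an error by itself: a shared object routinely leaves symbols to be resolved at load time, either by the libraries it depends on or by libraries already loaded into the process. That may be why the same `nm` output appears on a machine where everything works. A minimal sketch for listing such symbols from `nm` output, assuming the usual `nm` text format; the helper name and every sample line except the `curandCreateGenerator` one are made up for illustration:

```python
def undefined_symbols(nm_output):
    """Collect names that `nm` marks as undefined (type letter 'U')."""
    names = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Undefined entries have no address column: just "U <name>".
        if len(parts) == 2 and parts[0] == "U":
            names.add(parts[1])
    return names


# Only the first line comes from the report; the others are illustrative.
SAMPLE = """\
                 U curandCreateGenerator
0000000000012340 T some_defined_function
                 U some_other_external
"""

missing = undefined_symbols(SAMPLE)
print(sorted(missing))
```

Comparing this set against the libraries named on the link line is one way to spot a symbol that nothing is guaranteed to provide.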

Thank you!

@RezaYazdaniAminabadi, @jeffra, @tjruwase

Top GitHub Comments

stas00 commented, Jan 10, 2022

@jeffra, could we then apply the fix above to the core so that our CI can run these tests? Since the CI uses the JIT build, that would solve the problem while the PyTorch folks figure out the cpp extension pre-building. Thanks!

Please let me know if you’d like me to create a PR or whether it’d be easier for you to do that. Especially since you found the solution.

Thank you!

stas00 commented, Dec 9, 2021

OK, I was able to reproduce the problem by installing:

pip install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cu113/torch_nightly.html -U

So I think it’s a bug in the pip package of torch-nightly, since the conda version of the same build has no problem.

I will report it to PyTorch, so there is nothing to do here at the moment:

https://github.com/pytorch/pytorch/issues/69666
