[BUG] torch-nightly: linker issue with `cpu_adam.so`


When the HF CI runs the DeepSpeed tests against torch-nightly, I get multiple failures involving cpu_adam.so.

Most tests fail with either:

Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fbf967353a0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 97, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

or:

           Traceback (most recent call last):
E             File "/__w/transformers/transformers/examples/pytorch/summarization/run_summarization.py", line 648, in <module>
E               main()
E             File "/__w/transformers/transformers/examples/pytorch/summarization/run_summarization.py", line 570, in main
E               train_result = trainer.train(resume_from_checkpoint=checkpoint)
E             File "/__w/transformers/transformers/src/transformers/trainer.py", line 1163, in train
E               deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
E             File "/__w/transformers/transformers/src/transformers/deepspeed.py", line 406, in deepspeed_init
E               model, optimizer, _, lr_scheduler = deepspeed.initialize(
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
E               engine = DeepSpeedEngine(args=args,
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 294, in __init__
E               self._configure_optimizer(optimizer, model_parameters)
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1106, in _configure_optimizer
E               basic_optimizer = self._configure_basic_optimizer(model_parameters)
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1191, in _configure_basic_optimizer
E               optimizer = DeepSpeedCPUAdam(model_parameters,
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 83, in __init__
E               self.ds_opt_adam = CPUAdamBuilder().load()
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 370, in load
E               return self.jit_load(verbose)
E             File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 402, in jit_load
E               op_module = load(
E             File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1130, in load
E               return _jit_compile(
E             File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1368, in _jit_compile
E               return _import_module_from_library(name, build_directory, is_python_module)
E             File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1758, in _import_module_from_library
E               module = importlib.util.module_from_spec(spec)
E             File "<frozen importlib._bootstrap>", line 556, in module_from_spec
E             File "<frozen importlib._bootstrap_external>", line 1101, in create_module
E             File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
E           ImportError: /github/home/.cache/torch_extensions/py38_cu111/cpu_adam/cpu_adam.so: undefined symbol: curandCreateGenerator

(e.g. test: test_can_resume_training_normal_0_zero2, but almost all tests fail)
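
The two tracebacks above look related: the second shows `__init__` failing at the line that assigns `self.ds_opt_adam`, and the first shows `__del__` later reading that attribute. Python still calls `__del__` on an object whose `__init__` raised, so the `AttributeError` is plausibly a side effect of the `ImportError` rather than a separate bug. A minimal sketch of that pattern follows; `FakeCPUAdam`, `failing_loader` and `try_build` are hypothetical names used for illustration only and are not DeepSpeed code.

```python
class FakeCPUAdam:
    """Hypothetical stand-in for an optimizer whose __init__ loads a native extension."""

    def __init__(self, loader):
        # If loader() raises, self.ext is never assigned,
        # but Python still runs __del__ on the half-built object.
        self.ext = loader()

    def __del__(self):
        # Guarded cleanup: only touch the extension if it was actually loaded.
        ext = getattr(self, "ext", None)
        if ext is not None:
            ext.destroy()


def failing_loader():
    # Simulates the JIT-built module failing to import.
    raise ImportError("undefined symbol: curandCreateGenerator")


def try_build():
    """Return the error message raised while constructing the object, if any."""
    try:
        FakeCPUAdam(failing_loader)
    except ImportError as exc:
        return str(exc)
    return None
```

With the `getattr` guard, only the original `ImportError` surfaces, which makes the real failure easier to spot in CI logs.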

The compilation went through just fine:

Installed CUDA version 11.2 does not match the version torch was compiled with 11.1 but since the APIs are compatible, accepting this combination
Using /github/home/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Creating extension directory /github/home/.cache/torch_extensions/py38_cu111/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /github/home/.cache/torch_extensions/py38_cu111/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -L/usr/local/cuda/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp -D__AVX256__ -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -L/opt/conda/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
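
One thing that can be read directly from the log: the final link step ([3/3]) passes `-lcudart` but no `-lcurand`. Whether that omission is the actual cause is an assumption here, not something the log proves. The sketch below just extracts the `-l` flags from that link command so the list is easy to inspect; `linked_libraries` is a hypothetical helper name.

```python
import shlex


def linked_libraries(link_command):
    """Return library names passed via -l<name> flags, in order of appearance."""
    return [
        tok[2:]
        for tok in shlex.split(link_command)
        if tok.startswith("-l") and len(tok) > 2
    ]


# The [3/3] link step copied from the build log above.
LINK_STEP = (
    "c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared "
    "-L/opt/conda/lib/python3.8/site-packages/torch/lib "
    "-lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp "
    "-ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so"
)

libs = linked_libraries(LINK_STEP)
print(libs)
print("curand linked explicitly:", "curand" in libs)
```

If `curand` is absent from the list, the symbol can only be resolved at load time when some other already-loaded library provides it.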

It must be something specific to that box, since I can’t reproduce these problems on my own box with the same torch-nightly version and py38.

But if I check on my home box (where things work):

nm ~/.cache/torch_extensions/py38_cu113/cpu_adam/cpu_adam.so | grep curandCreateGenerator
                 U curandCreateGenerator

So curandCreateGenerator is indeed undefined and it’s used here:

https://github.com/microsoft/DeepSpeed/blob/91e15593ea4487014114a03c7b4a2a05567fd3f8/csrc/includes/context.h#L46

but for some reason it doesn’t cause a problem on my setup. Perhaps it’s a linker issue - some library doesn’t get properly linked?
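
The `nm` check above can be scripted so that it runs on both machines and the results can be compared. This is a sketch only: `undefined_symbols` is a hypothetical helper, and the second line of `SAMPLE` is an invented defined-symbol entry added purely to show the filtering.

```python
def undefined_symbols(nm_output, needle=""):
    """Return names of undefined ('U') symbols in nm output that contain needle."""
    names = []
    for line in nm_output.splitlines():
        parts = line.split()
        # Undefined entries carry no address, so the line is just "U <name>".
        if len(parts) == 2 and parts[0] == "U" and needle in parts[1]:
            names.append(parts[1])
    return names


# First line is the nm output quoted in the issue; the second is a made-up
# defined symbol (hypothetical) to show that it is filtered out.
SAMPLE = (
    "                 U curandCreateGenerator\n"
    "0000000000012340 T PyInit_cpu_adam\n"
)

print(undefined_symbols(SAMPLE, "curand"))
```

To run it against a real file, the text can come from `subprocess.run(["nm", path], capture_output=True, text=True).stdout`, assuming `nm` is installed on the machine.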

Thank you!

@RezaYazdaniAminabadi, @jeffra, @tjruwase

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

stas00 commented, Jan 10, 2022

@jeffra, could we then apply the fix above to the core so that our CI can run these tests? Since the CI uses the JIT build, that would solve the problem while the PyTorch folks figure out the cpp extension pre-building. Thanks!

Please let me know if you’d like me to create a PR or whether it’d be easier for you to do that, especially since you found the solution.

Thank you!

stas00 commented, Dec 9, 2021

OK, I was able to reproduce the problem by installing:

pip install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cu113/torch_nightly.html -U

So I think it’s a bug in the pip package of torch-nightly, since there is no problem with the conda version of the same build.

I will report it to PyTorch, so there is nothing to do about it here at the moment.

https://github.com/pytorch/pytorch/issues/69666

