[BUG] torch-nightly: linker issue with `cpu_adam.so`
When HF CI runs the deepspeed tests with torch-nightly, I get multiple issues with cpu_adam.so.
Most tests fail with either:
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fbf967353a0>
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 97, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
or:
Traceback (most recent call last):
E File "/__w/transformers/transformers/examples/pytorch/summarization/run_summarization.py", line 648, in <module>
E main()
E File "/__w/transformers/transformers/examples/pytorch/summarization/run_summarization.py", line 570, in main
E train_result = trainer.train(resume_from_checkpoint=checkpoint)
E File "/__w/transformers/transformers/src/transformers/trainer.py", line 1163, in train
E deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
E File "/__w/transformers/transformers/src/transformers/deepspeed.py", line 406, in deepspeed_init
E model, optimizer, _, lr_scheduler = deepspeed.initialize(
E File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
E engine = DeepSpeedEngine(args=args,
E File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 294, in __init__
E self._configure_optimizer(optimizer, model_parameters)
E File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1106, in _configure_optimizer
E basic_optimizer = self._configure_basic_optimizer(model_parameters)
E File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1191, in _configure_basic_optimizer
E optimizer = DeepSpeedCPUAdam(model_parameters,
E File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 83, in __init__
E self.ds_opt_adam = CPUAdamBuilder().load()
E File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 370, in load
E return self.jit_load(verbose)
E File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 402, in jit_load
E op_module = load(
E File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1130, in load
E return _jit_compile(
E File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1368, in _jit_compile
E return _import_module_from_library(name, build_directory, is_python_module)
E File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1758, in _import_module_from_library
E module = importlib.util.module_from_spec(spec)
E File "<frozen importlib._bootstrap>", line 556, in module_from_spec
E File "<frozen importlib._bootstrap_external>", line 1101, in create_module
E File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
E ImportError: /github/home/.cache/torch_extensions/py38_cu111/cpu_adam/cpu_adam.so: undefined symbol: curandCreateGenerator
(e.g. the test test_can_resume_training_normal_0_zero2, but almost all tests fail)
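The first message is probably a secondary symptom of the second: if loading the extension raises inside `__init__`, the `ds_opt_adam` attribute is never assigned, yet Python still calls `__del__` on the partially constructed object. Below is a minimal sketch of that pattern and one possible way to guard the destructor. `FakeCPUAdam` and `try_create` are hypothetical names for illustration, not DeepSpeed's actual code.

```python
class FakeCPUAdam:
    """Hypothetical stand-in for an optimizer that loads a native extension."""

    def __init__(self, fail_to_load=False):
        if fail_to_load:
            # Simulates the ImportError raised while loading cpu_adam.so.
            raise ImportError("undefined symbol: curandCreateGenerator")
        self.ds_opt_adam = object()  # would be the loaded extension module

    def __del__(self):
        # __del__ still runs when __init__ raised, so the attribute may be missing.
        ext = getattr(self, "ds_opt_adam", None)
        if ext is not None:
            pass  # real code would release native resources here


def try_create(fail_to_load):
    """Return the construction error message, or None on success."""
    try:
        FakeCPUAdam(fail_to_load=fail_to_load)
    except ImportError as exc:
        return str(exc)
    return None
```

With a guard like this, only the real `ImportError` would surface, without the extra "Exception ignored in" noise.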
The compilation went through just fine:
Installed CUDA version 11.2 does not match the version torch was compiled with 11.1 but since the APIs are compatible, accepting this combination
Using /github/home/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Creating extension directory /github/home/.cache/torch_extensions/py38_cu111/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /github/home/.cache/torch_extensions/py38_cu111/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -L/usr/local/cuda/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp -D__AVX256__ -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -L/opt/conda/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
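One thing the log does show is that link step [3/3] passes `-lcudart` but no `-lcurand`, so `cpu_adam.so` depends on something else having loaded libcurand by the time it is imported. Whether that is the root cause here is an assumption. The sketch below extracts the `-l` libraries from that link command; `linked_libraries` is a hypothetical helper name.

```python
import shlex

# Link command copied from step [3/3] of the build log above.
LINK_CMD = (
    "c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared "
    "-L/opt/conda/lib/python3.8/site-packages/torch/lib "
    "-lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp "
    "-ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so"
)


def linked_libraries(command):
    """Return the library names passed as -l<name> in a linker command."""
    return [
        tok[2:]
        for tok in shlex.split(command)
        if tok.startswith("-l") and len(tok) > 2
    ]


libs = linked_libraries(LINK_CMD)
print(libs)
print("curand linked explicitly:", "curand" in libs)
```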
It must be something specific to that box, since I can't reproduce these problems on my own box with the same torch-nightly version and py38.
But if I check on my home box (where things work):
nm ~/.cache/torch_extensions/py38_cu113/cpu_adam/cpu_adam.so | grep curandCreateGenerator
U curandCreateGenerator
So curandCreateGenerator is indeed undefined in cpu_adam.so even though the extension uses it, but for some reason that doesn't cause a problem on my setup. Perhaps it's a linker issue, and some library doesn't get properly linked?
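An undefined (`U`) entry in `nm` output is normal for a shared library. It only means another library must supply the symbol at load time. One way to compare the two machines is to check whether the symbol is already visible in the running Python process. Here is a rough, Linux-only sketch using `ctypes`; `symbol_is_loaded` is a hypothetical helper name, and whether torch exposes its CUDA libraries in the global scope is an assumption that depends on how the particular torch build loads them.

```python
import ctypes


def symbol_is_loaded(name):
    """Return True if `name` resolves in the process's global symbol scope."""
    # CDLL(None) corresponds to dlopen(NULL): the main program plus
    # libraries that were loaded with global visibility.
    process = ctypes.CDLL(None)
    try:
        getattr(process, name)
    except AttributeError:
        return False
    return True
```

Calling `symbol_is_loaded("curandCreateGenerator")` right after `import torch` on both boxes may show whether the working setup already has libcurand loaded and the failing one does not.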
Thank you!
Top GitHub Comments
@jeffra, could we then apply the fix above to the core so that our CI can run these tests? Since the CI uses the JIT build, that would solve the problem while the pytorch folks are figuring out the cpp extension pre-building. Thanks!
Please let me know if you'd like me to create a PR or whether it'd be easier for you to do that, especially since you found the solution.
Thank you!
OK, I was able to reproduce the problem by installing:
so I think it's a bug in the pip package of torch-nightly, since there is no problem with the conda version of the same.
I will report it to pytorch, so there is nothing to do about it on the DeepSpeed side at the moment.
https://github.com/pytorch/pytorch/issues/69666