[BUG] Unable to build extension "transformer_inference"
Describe the bug
After installing DeepSpeed, I try to run very basic inference with the transformers library, but the transformer_inference extension required for inference fails to build. This happens even with DS_BUILD_TRANSFORMER_INFERENCE=1 set during the install. I can work around the problem by running init_inference twice and catching the first error (see the sketch below), but this seems wrong.
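For reference, this is roughly what the retry workaround looks like (a minimal sketch of my own code, assuming a single GPU; on my setup the second call succeeds once the first JIT build attempt has failed):

import deepspeed
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained('gpt2')

# Workaround sketch: the first init_inference call raises while JIT-building
# the transformer_inference op; retrying the same call afterwards succeeds
# on my setup.
try:
    model = deepspeed.init_inference(model, mp_size=1,
                                     replace_with_kernel_inject=True,
                                     replace_method='auto')
except RuntimeError:
    model = deepspeed.init_inference(model, mp_size=1,
                                     replace_with_kernel_inject=True,
                                     replace_method='auto')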
To Reproduce
Steps to reproduce the behavior:
DS_BUILD_OPS=1 DS_BUILD_TRANSFORMER_INFERENCE=1 pip install deepspeed
python -c """
import deepspeed
import transformers
import os
model = transformers.AutoModelForCausalLM.from_pretrained('gpt2')
world_size = int(os.getenv('WORLD_SIZE', '1'))
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    replace_with_kernel_inject=True,
    replace_method='auto',
)
"""
Error:
File "~/.conda/envs/py3/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 468, in replace_with_policy
new_module = transformer_inference.DeepSpeedTransformerInference(
File "~/.conda/envs/py3/lib/python3.8/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 53, in __init__
inference_cuda_module = builder.load()
File "~/.conda/envs/py3/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 460, in load
return self.jit_load(verbose)
File "~/.conda/envs/py3/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 495, in jit_load
op_module = load(
File "~/.conda/envs/py3/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 986, in load
return _jit_compile(
File "~/.conda/envs/py3/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1193, in _jit_compile
_write_ninja_file_and_build_library(
File "~/.conda/envs/py3/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1297, in _write_ninja_file_and_build_library
_run_ninja_build(
File "~/.conda/envs/py3/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'transformer_inference'
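The same failure can be triggered without going through init_inference by JIT-building the op directly (a sketch; I am assuming InferenceBuilder is the builder behind transformer_inference, which is what the trace suggests):

from deepspeed.ops.op_builder import InferenceBuilder

# Build just the transformer_inference op so the full ninja/compiler output
# is visible, instead of only the final RuntimeError.
builder = InferenceBuilder()
print('compatible:', builder.is_compatible())
module = builder.load(verbose=True)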
Expected behavior
Expected to see transformer and transformer_inference reported as installed in the output of the ds_report command.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['~/.conda/envs/py3/lib/python3.8/site-packages/torch']
torch version .................... 1.7.1+cu110
torch cuda version ............... 11.0
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['~/.conda/envs/py3/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.7.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.7, cuda 11.0
System info:
- OS: Ubuntu
- GPU count and types: A100 machine with 8 GPUs
- DeepSpeed-MII version: 0.7.7
- Hugging Face Transformers version: 4.24.0
- Python version: 3.8.15
Top GitHub Comments
Hi @ianbstewart
I just checked on my side, and I see that the compiler chosen by torch is different between us. I am seeing nvcc for the CUDA files such as gelu.cu, but c++ for the cpp files like pt_binding.cpp. I am not sure how this got set to nvc++ on your side (/share/apps/nvhpc/22.3/Linux_x86_64/22.3/compilers/bin/nvc++)! There must be some way to point torch at the right compiler for the cpp files. @jeffra, do you have any idea how to resolve this issue?
Thanks, Reza
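One thing that might be worth trying (an assumption on my part, not something verified in this thread): torch.utils.cpp_extension reads the CXX environment variable when it writes the ninja build file, so forcing it to a standard GCC before the JIT build may keep nvc++ from being picked up for the cpp files:

import os

# Assumption: torch's cpp_extension uses os.environ['CXX'] (default 'c++')
# as the compiler for .cpp files in JIT ninja builds, so overriding it here
# should steer the build away from nvc++.
os.environ['CXX'] = 'g++'

import deepspeed
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained('gpt2')
model = deepspeed.init_inference(model, mp_size=1,
                                 replace_with_kernel_inject=True,
                                 replace_method='auto')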