[BUG] DS_BUILD_OPS=1 does not work
See original GitHub issue
Describe the bug
Hi all,
I'm trying to set up a container with DeepSpeed's ops pre-compiled, but no matter what I try, the build never succeeds. My expectation is that with pre-compiled ops, model inference should load much faster, since nothing has to be JIT-compiled at startup.
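For context, DeepSpeed's docs describe pre-building all ops by setting this variable at install time; a minimal sketch of that pattern outside Docker:
DS_BUILD_OPS=1 pip install deepspeed    # pre-compile all compatible ops instead of JIT-compiling them on first use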
To Reproduce
Dockerfile (relevant excerpt; DS_VERSION and DS_BUILD_OPS are build args, as in the full container below):
ARG DS_VERSION=0.5.3
ARG DS_BUILD_OPS=0
RUN mkdir -p /tmp && \
cd /tmp && \
git clone https://github.com/microsoft/DeepSpeed.git && \
cd DeepSpeed && \
git checkout "v$DS_VERSION" && \
pip install -r requirements/requirements-dev.txt && \
pip install -r requirements/requirements.txt && \
DS_BUILD_OPS=$DS_BUILD_OPS python setup.py build_ext -j32 bdist_wheel && \
pip install dist/*.whl && \
ds_report
Works:
docker build --build-arg DS_BUILD_OPS=0 --build-arg DS_VERSION=0.5.10 -t test .
and outputs:
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
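In this table, installed [NO] means the op was not pre-compiled into the wheel; it will be JIT-compiled by ninja on first use, so the build succeeds but there is no startup speedup. One way to confirm whether any ops were actually baked into the wheel is to look for compiled extensions inside the installed package (a sketch; the package layout may vary across DeepSpeed versions):
find "$(python -c 'import deepspeed, os; print(os.path.dirname(deepspeed.__file__))')" -name '*.so'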
Doesn’t work:
docker build --build-arg DS_BUILD_OPS=1 --build-arg DS_VERSION=0.5.10 -t test .
and outputs:
39 errors detected in the compilation of "csrc/transformer/normalize_kernels.cu".
error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1
or:
g++ -pthread -shared -B /opt/conda/compiler_compat -L/opt/conda/lib -Wl,-rpath=/opt/conda/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.8/csrc/lamb/fused_lamb_cuda.o build/temp.linux-x86_64-3.8/csrc/lamb/fused_lamb_cuda_kernel.o -L/opt/conda/lib/python3.8/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda_cu -ltorch_cuda_cpp -o build/lib.linux-x86_64-3.8/deepspeed/ops/lamb/fused_lamb_op.cpython-38-x86_64-linux-gnu.so
error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1
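With -j32 the first real compiler error is usually buried in interleaved output. A diagnostic sketch (not a confirmed fix): rebuild serially so the first failing line is readable, and try narrowing TORCH_CUDA_ARCH_LIST to the single architecture you actually target, since older compute capabilities such as 3.7 and 5.0 are a common trigger for kernel compile failures:
cd /tmp/DeepSpeed && \
TORCH_CUDA_ARCH_LIST="8.0" DS_BUILD_OPS=1 python setup.py build_ext -j1 bdist_wheel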
The following tags don’t work:
- 0.5.10
- 0.5.9
- 0.5.8
- 0.5.7
- 0.5.6
- 0.5.5
- 0.5.4
- 0.5.3
- I stopped testing older tags after these
Expected behavior
Pre-compiling the ops should work.
System info (please complete the following information):
Not applicable.
Docker context
Full container:
FROM nvidia/cuda:11.1.1-base-ubuntu20.04
ARG PYTHON_VERSION=3.8.10
ARG OPEN_MPI_VERSION=4.0.1
ENV CUDNN_VERSION=8.0.5.39
ARG CUBLAS_VERSION=11.3.0.106
ENV NCCL_VERSION=2.7.8
ENV OMPI_VERSION=4.1.1
ENV NVML_VERSION=11.1.74
ENV TORCH_CUDA_ARCH_LIST="3.7 5.0 7.0+PTX 8.0"
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
LD_LIBRARY_PATH="/opt/conda/lib/:${LD_LIBRARY_PATH}:/usr/local/lib" \
PYTHONIOENCODING=UTF-8 \
LANG=C.UTF-8 \
LC_ALL=C.UTF-8 \
DEBIAN_FRONTEND=noninteractive
ENV PATH /opt/conda/bin:$PATH
RUN apt-get update \
&& apt-get -y upgrade --only-upgrade systemd openssl \
&& apt-get install -y --no-install-recommends software-properties-common \
&& add-apt-repository ppa:openjdk-r/ppa \
&& apt-get update \
&& apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
ca-certificates \
cmake \
cuda-command-line-tools-11-1 \
cuda-cudart-11-1 \
cuda-libraries-dev-11-1 \
curl \
emacs \
git \
jq \
libcublas-11-1=${CUBLAS_VERSION}-1 \
libcublas-dev-11-1=${CUBLAS_VERSION}-1 \
libcudnn8=$CUDNN_VERSION-1+cuda11.1 \
libcufft-dev-11-1 \
libcurand-dev-11-1 \
libcusolver-dev-11-1 \
libcusparse-dev-11-1 \
cuda-nvml-dev-11-1=${NVML_VERSION}-1 \
libgl1-mesa-glx \
libglib2.0-0 \
libgomp1 \
libibverbs-dev \
libnuma1 \
libnuma-dev \
libsm6 \
libssl1.1 \
libxext6 \
libxrender-dev \
openjdk-8-jdk-headless \
openssl \
vim \
wget \
unzip \
zlib1g-dev \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean
RUN cd /tmp \
&& git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
&& cd nccl \
&& make -j64 src.build BUILDDIR=/usr/local \
&& rm -rf /tmp/nccl
RUN curl -L -o ~/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh \
&& chmod +x ~/miniconda.sh \
&& ~/miniconda.sh -b -p /opt/conda \
&& rm ~/miniconda.sh \
&& /opt/conda/bin/conda update conda \
&& /opt/conda/bin/conda install -c conda-forge \
python=$PYTHON_VERSION \
&& /opt/conda/bin/conda install -y \
ruamel_yaml==0.15.100 \
cython \
botocore \
mkl-include \
mkl \
&& /opt/conda/bin/conda clean -ya
RUN pip install --upgrade pip --trusted-host pypi.org --trusted-host files.pythonhosted.org \
&& ln -s /opt/conda/bin/pip /usr/local/bin/pip3 \
&& pip install packaging==20.4 \
enum-compat==0.0.3 \
"cryptography>3.2"
RUN rm -rf /opt/conda/lib/libtinfo.so.6
RUN wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-$OPEN_MPI_VERSION.tar.gz \
&& gunzip -c openmpi-$OPEN_MPI_VERSION.tar.gz | tar xf - \
&& cd openmpi-$OPEN_MPI_VERSION \
&& ./configure --prefix=/home/.openmpi \
&& make all install \
&& cd .. \
&& rm openmpi-$OPEN_MPI_VERSION.tar.gz \
&& rm -rf openmpi-$OPEN_MPI_VERSION
ENV PATH="$PATH:/home/.openmpi/bin"
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/.openmpi/lib/"
RUN cd /tmp/ \
&& rm -rf tmp*
RUN pip uninstall -y torch \
&& pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
RUN pip install --no-cache-dir \
protobuf==3.12.0
#################################
# Deepspeed specific section #
#################################
RUN apt-get update -y && \
apt-get install -y libaio-dev
ARG DS_VERSION=0.5.3
ARG DS_BUILD_OPS=0
RUN mkdir -p /tmp && \
cd /tmp && \
git clone https://github.com/microsoft/DeepSpeed.git && \
cd DeepSpeed && \
git checkout "v$DS_VERSION" && \
pip install -r requirements/requirements-dev.txt && \
pip install -r requirements/requirements.txt && \
DS_BUILD_OPS=$DS_BUILD_OPS python setup.py build_ext -j32 bdist_wheel && \
pip install dist/*.whl && \
ds_report
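Once the image builds, a quick smoke test of the result (a sketch; assumes the test tag from the commands above and the NVIDIA container runtime on the host):
docker run --rm --gpus all test ds_report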
Issue Analytics: created 2 years ago; 7 comments (4 by maintainers).
Top GitHub Comments
I was able to repro your issue on my side. I reduced your Dockerfile down to the following, which reproduces the failure and uses the devel image from NVIDIA that already includes nvcc, etc. I am continuing to investigate and will let you know when I have a fix. Thank you for reporting your issue; we very much appreciate it.

Excellent, thank you for the update @oborchers! I agree, we'll update the docs w.r.t. these build args.
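For illustration only (the maintainer's actual reduced Dockerfile is not captured above), a hypothetical devel-based reproduction could look like this; the devel image already ships nvcc and the CUDA headers, so none of the cuda-* apt packages are needed:
FROM nvidia/cuda:11.1.1-devel-ubuntu20.04
# hypothetical minimal setup; package and version pins taken from the report above
RUN apt-get update && apt-get install -y --no-install-recommends git python3 python3-pip libaio-dev \
&& rm -rf /var/lib/apt/lists/*
RUN pip3 install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
ENV TORCH_CUDA_ARCH_LIST="3.7 5.0 7.0+PTX 8.0"
RUN git clone https://github.com/microsoft/DeepSpeed.git /tmp/DeepSpeed \
&& cd /tmp/DeepSpeed && git checkout v0.5.10 \
&& pip3 install -r requirements/requirements.txt \
&& DS_BUILD_OPS=1 python3 setup.py build_ext -j32 bdist_wheel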