[BUG] DS_BUILD_OPS=1 does not work
See original GitHub issue
Describe the bug
Hi all,
I'm trying to set up a container with DeepSpeed's ops pre-compiled, but no matter what I try, the build never succeeds. My expectation is that with pre-compiled ops, model inference should load much faster, since nothing has to be JIT-compiled at startup.
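For context, DeepSpeed's docs describe pre-building all ops by setting this variable at install time; a minimal sketch of that pattern outside Docker:
DS_BUILD_OPS=1 pip install deepspeed    # pre-compile all compatible ops instead of JIT-compiling them on first use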
To Reproduce
Dockerfile (relevant excerpt; DS_VERSION and DS_BUILD_OPS are build args, as in the full container below):
ARG DS_VERSION=0.5.3
ARG DS_BUILD_OPS=0
RUN mkdir -p /tmp && \
cd /tmp && \
git clone https://github.com/microsoft/DeepSpeed.git && \
cd DeepSpeed && \
git checkout "v$DS_VERSION" && \
pip install -r requirements/requirements-dev.txt && \
pip install -r requirements/requirements.txt && \
DS_BUILD_OPS=$DS_BUILD_OPS python setup.py build_ext -j32 bdist_wheel && \
pip install dist/*.whl && \
ds_report
Works:
docker build --build-arg DS_BUILD_OPS=0 --build-arg DS_VERSION=0.5.10 -t test .
and outputs:
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
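In this table, installed [NO] means the op was not pre-compiled into the wheel; it will be JIT-compiled by ninja on first use, so the build succeeds but there is no startup speedup. One way to confirm whether any ops were actually baked into the wheel is to look for compiled extensions inside the installed package (a sketch; the package layout may vary across DeepSpeed versions):
find "$(python -c 'import deepspeed, os; print(os.path.dirname(deepspeed.__file__))')" -name '*.so'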
Doesn’t work:
docker build --build-arg DS_BUILD_OPS=1 --build-arg DS_VERSION=0.5.10 -t test .
and outputs:
39 errors detected in the compilation of "csrc/transformer/normalize_kernels.cu".
error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1
or:
g++ -pthread -shared -B /opt/conda/compiler_compat -L/opt/conda/lib -Wl,-rpath=/opt/conda/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.8/csrc/lamb/fused_lamb_cuda.o build/temp.linux-x86_64-3.8/csrc/lamb/fused_lamb_cuda_kernel.o -L/opt/conda/lib/python3.8/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda_cu -ltorch_cuda_cpp -o build/lib.linux-x86_64-3.8/deepspeed/ops/lamb/fused_lamb_op.cpython-38-x86_64-linux-gnu.so
error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1
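With -j32 the first real compiler error is usually buried in interleaved output. A diagnostic sketch (not a confirmed fix): rebuild serially so the first failing line is readable, and try narrowing TORCH_CUDA_ARCH_LIST to the single architecture you actually target, since older compute capabilities such as 3.7 and 5.0 are a common trigger for kernel compile failures:
cd /tmp/DeepSpeed && \
TORCH_CUDA_ARCH_LIST="8.0" DS_BUILD_OPS=1 python setup.py build_ext -j1 bdist_wheel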
The following tags don’t work:
- 0.5.10
- 0.5.9
- 0.5.8
- 0.5.7
- 0.5.6
- 0.5.5
- 0.5.4
- 0.5.3
- I stopped testing older tags after these
Expected behavior
Pre-compiling the ops should work.
System info (please complete the following information):
Not applicable.
Docker context
Full container:
FROM nvidia/cuda:11.1.1-base-ubuntu20.04
ARG PYTHON_VERSION=3.8.10
ARG OPEN_MPI_VERSION=4.0.1
ENV CUDNN_VERSION=8.0.5.39
ARG CUBLAS_VERSION=11.3.0.106
ENV NCCL_VERSION=2.7.8
ENV OMPI_VERSION=4.1.1
ENV NVML_VERSION=11.1.74
ENV TORCH_CUDA_ARCH_LIST="3.7 5.0 7.0+PTX 8.0"
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
LD_LIBRARY_PATH="/opt/conda/lib/:${LD_LIBRARY_PATH}:/usr/local/lib" \
PYTHONIOENCODING=UTF-8 \
LANG=C.UTF-8 \
LC_ALL=C.UTF-8 \
DEBIAN_FRONTEND=noninteractive
ENV PATH /opt/conda/bin:$PATH
RUN apt-get update \
&& apt-get -y upgrade --only-upgrade systemd openssl \
&& apt-get install -y --no-install-recommends software-properties-common \
&& add-apt-repository ppa:openjdk-r/ppa \
&& apt-get update \
&& apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
ca-certificates \
cmake \
cuda-command-line-tools-11-1 \
cuda-cudart-11-1 \
cuda-libraries-dev-11-1 \
curl \
emacs \
git \
jq \
libcublas-11-1=${CUBLAS_VERSION}-1 \
libcublas-dev-11-1=${CUBLAS_VERSION}-1 \
libcudnn8=$CUDNN_VERSION-1+cuda11.1 \
libcufft-dev-11-1 \
libcurand-dev-11-1 \
libcusolver-dev-11-1 \
libcusparse-dev-11-1 \
cuda-nvml-dev-11-1=${NVML_VERSION}-1 \
libgl1-mesa-glx \
libglib2.0-0 \
libgomp1 \
libibverbs-dev \
libnuma1 \
libnuma-dev \
libsm6 \
libssl1.1 \
libxext6 \
libxrender-dev \
openjdk-8-jdk-headless \
openssl \
vim \
wget \
unzip \
zlib1g-dev \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean
RUN cd /tmp \
&& git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
&& cd nccl \
&& make -j64 src.build BUILDDIR=/usr/local \
&& rm -rf /tmp/nccl
RUN curl -L -o ~/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh \
&& chmod +x ~/miniconda.sh \
&& ~/miniconda.sh -b -p /opt/conda \
&& rm ~/miniconda.sh \
&& /opt/conda/bin/conda update conda \
&& /opt/conda/bin/conda install -c conda-forge \
python=$PYTHON_VERSION \
&& /opt/conda/bin/conda install -y \
ruamel_yaml==0.15.100 \
cython \
botocore \
mkl-include \
mkl \
&& /opt/conda/bin/conda clean -ya
RUN pip install --upgrade pip --trusted-host pypi.org --trusted-host files.pythonhosted.org \
&& ln -s /opt/conda/bin/pip /usr/local/bin/pip3 \
&& pip install packaging==20.4 \
enum-compat==0.0.3 \
"cryptography>3.2"
RUN rm -rf /opt/conda/lib/libtinfo.so.6
RUN wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-$OPEN_MPI_VERSION.tar.gz \
&& gunzip -c openmpi-$OPEN_MPI_VERSION.tar.gz | tar xf - \
&& cd openmpi-$OPEN_MPI_VERSION \
&& ./configure --prefix=/home/.openmpi \
&& make all install \
&& cd .. \
&& rm openmpi-$OPEN_MPI_VERSION.tar.gz \
&& rm -rf openmpi-$OPEN_MPI_VERSION
ENV PATH="$PATH:/home/.openmpi/bin"
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/.openmpi/lib/"
RUN cd /tmp/ \
&& rm -rf tmp*
RUN pip uninstall -y torch \
&& pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
RUN pip install --no-cache-dir \
protobuf==3.12.0
#################################
# Deepspeed specific section #
#################################
RUN apt-get update -y && \
apt-get install -y libaio-dev
ARG DS_VERSION=0.5.3
ARG DS_BUILD_OPS=0
RUN mkdir -p /tmp && \
cd /tmp && \
git clone https://github.com/microsoft/DeepSpeed.git && \
cd DeepSpeed && \
git checkout "v$DS_VERSION" && \
pip install -r requirements/requirements-dev.txt && \
pip install -r requirements/requirements.txt && \
DS_BUILD_OPS=$DS_BUILD_OPS python setup.py build_ext -j32 bdist_wheel && \
pip install dist/*.whl && \
ds_report
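Once the image builds, a quick smoke test of the result (a sketch; assumes the test tag from the commands above and the NVIDIA container runtime on the host):
docker run --rm --gpus all test ds_report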
Issue Analytics: created 2 years ago; 7 comments (4 by maintainers).
Top GitHub Comments
I was able to repro your issue on my side. I reduced your Dockerfile down to the following, which reproduces the failure and uses the devel image from NVIDIA that already includes nvcc, etc. I am continuing to investigate and will let you know when I have a fix. Thank you for reporting your issue; we very much appreciate it.

Excellent, thank you for the update @oborchers! I agree, we'll update the docs w.r.t. these build args.
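For illustration only (the maintainer's actual reduced Dockerfile is not captured above), a hypothetical devel-based reproduction could look like this; the devel image already ships nvcc and the CUDA headers, so none of the cuda-* apt packages are needed:
FROM nvidia/cuda:11.1.1-devel-ubuntu20.04
# hypothetical minimal setup; package and version pins taken from the report above
RUN apt-get update && apt-get install -y --no-install-recommends git python3 python3-pip libaio-dev \
&& rm -rf /var/lib/apt/lists/*
RUN pip3 install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
ENV TORCH_CUDA_ARCH_LIST="3.7 5.0 7.0+PTX 8.0"
RUN git clone https://github.com/microsoft/DeepSpeed.git /tmp/DeepSpeed \
&& cd /tmp/DeepSpeed && git checkout v0.5.10 \
&& pip3 install -r requirements/requirements.txt \
&& DS_BUILD_OPS=1 python3 setup.py build_ext -j32 bdist_wheel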