[BUG] DS_BUILD_OPS=1 does not work

Describe the bug

Hi all,

I’m trying to set up a container with the ops pre-compiled, but whatever I do, the build never succeeds. I expect that with pre-compiled ops, models should load much faster for inference.

To Reproduce

Dockerfile:

RUN mkdir -p /tmp && \
    cd /tmp && \
    git clone https://github.com/microsoft/DeepSpeed.git && \
    cd DeepSpeed && \
    git checkout v0.6.0 && \
    pip install -r requirements/requirements-dev.txt && \
    pip install -r requirements/requirements.txt && \
    python setup.py build_ext -j32 bdist_wheel &&\
    pip install dist/*.whl && \
    ds_report

Works:

docker build --build-arg DS_BUILD_OPS=0 --build-arg DS_VERSION=0.5.10 -t test .

and outputs:

--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
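To check a report like the one above programmatically (for example as a final step of an image build), the op table can be parsed into a dictionary. This is only a sketch: the `[YES]` marker for a pre-compiled op and the presence of ANSI colour codes in the raw output are assumptions, and the function names are made up for illustration.

```python
import re

# Strips terminal colour codes, which the raw report may contain (assumption).
ANSI = re.compile(r"\x1b\[[0-9;]*m")
# Matches rows such as "cpu_adam ............... [NO] ....... [OKAY]".
ROW = re.compile(r"^(\w+)\s*\.+\s*\[(\w+)\]\s*\.+\s*\[(\w+)\]\s*$")


def parse_ds_report(text):
    """Return {op_name: (installed, compatible)} from the op table."""
    ops = {}
    for line in text.splitlines():
        match = ROW.match(ANSI.sub("", line).strip())
        if match:
            name, installed, compatible = match.groups()
            # "[YES]" as the installed marker is an assumption; the report
            # above only shows "[NO]".
            ops[name] = (installed == "YES", compatible == "OKAY")
    return ops


def missing_ops(text):
    """List the ops that the report does not show as pre-compiled."""
    report = parse_ds_report(text)
    return sorted(name for name, (installed, _) in report.items() if not installed)
```

Header rows such as `ninja .................. [OKAY]` carry only one bracketed field, so the pattern skips them.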

Doesn’t work:

docker build --build-arg DS_BUILD_OPS=1 --build-arg DS_VERSION=0.5.10 -t test .

and outputs:

39 errors detected in the compilation of "csrc/transformer/normalize_kernels.cu".
error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1

or:

g++ -pthread -shared -B /opt/conda/compiler_compat -L/opt/conda/lib -Wl,-rpath=/opt/conda/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.8/csrc/lamb/fused_lamb_cuda.o build/temp.linux-x86_64-3.8/csrc/lamb/fused_lamb_cuda_kernel.o -L/opt/conda/lib/python3.8/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda_cu -ltorch_cuda_cpp -o build/lib.linux-x86_64-3.8/deepspeed/ops/lamb/fused_lamb_op.cpython-38-x86_64-linux-gnu.so
error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1
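Both excerpts show only the final summary lines; the diagnostics that explain the failure appear earlier in the build log. A minimal sketch for pulling the first few of them out of a saved log follows. It assumes the usual compiler formats (`file(line): error: ...` for nvcc, `file:line:col: error: ...` for gcc), and the helper name is hypothetical.

```python
import re

# Matches compiler diagnostics such as "foo.cu(12): error: ..." or
# "foo.cpp:12:3: error: ...". The trailing summary line
# "error: command ... failed" has no colon before "error" and is skipped.
DIAGNOSTIC = re.compile(r":\s*(?:fatal )?error\b")


def first_errors(log_text, limit=5):
    """Return up to `limit` (line_number, line) pairs for compiler errors."""
    found = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if DIAGNOSTIC.search(line):
            found.append((number, line.strip()))
            if len(found) == limit:
                break
    return found
```

One way to use it is to save the build output first, for example with `docker build ... 2>&1 | tee build.log`, and then call `first_errors(open("build.log").read())`.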

The following tags don’t work:

  • 0.5.10
  • 0.5.9
  • 0.5.8
  • 0.5.7
  • 0.5.6
  • 0.5.5
  • 0.5.4
  • 0.5.3
  • I stopped testing afterwards

Expected behavior

Pre-compiling the ops should work.

System info (please complete the following information): Not applicable.

Docker context

Full container:

FROM nvidia/cuda:11.1.1-base-ubuntu20.04

ARG PYTHON_VERSION=3.8.10
ARG OPEN_MPI_VERSION=4.0.1
ENV CUDNN_VERSION=8.0.5.39
ARG CUBLAS_VERSION=11.3.0.106
ENV NCCL_VERSION=2.7.8
ENV OMPI_VERSION=4.1.1
ENV NVML_VERSION=11.1.74
ENV TORCH_CUDA_ARCH_LIST="3.7 5.0 7.0+PTX 8.0"

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    LD_LIBRARY_PATH="/opt/conda/lib/:${LD_LIBRARY_PATH}:/usr/local/lib" \
    PYTHONIOENCODING=UTF-8 \
    LANG=C.UTF-8 \
    LC_ALL=C.UTF-8 \
    DEBIAN_FRONTEND=noninteractive

ENV PATH /opt/conda/bin:$PATH

RUN apt-get update \
    && apt-get -y upgrade --only-upgrade systemd openssl \
    && apt-get install -y --no-install-recommends software-properties-common \
    && add-apt-repository ppa:openjdk-r/ppa \
    && apt-get update \
    && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
    build-essential \
    ca-certificates \
    cmake \
    cuda-command-line-tools-11-1 \
    cuda-cudart-11-1 \
    cuda-libraries-dev-11-1 \
    curl \
    emacs \
    git \
    jq \
    libcublas-11-1=${CUBLAS_VERSION}-1 \
    libcublas-dev-11-1=${CUBLAS_VERSION}-1 \
    libcudnn8=$CUDNN_VERSION-1+cuda11.1 \
    libcufft-dev-11-1 \
    libcurand-dev-11-1 \
    libcusolver-dev-11-1 \
    libcusparse-dev-11-1 \
    cuda-nvml-dev-11-1=${NVML_VERSION}-1 \
    libgl1-mesa-glx \
    libglib2.0-0 \
    libgomp1 \
    libibverbs-dev \
    libnuma1 \
    libnuma-dev \
    libsm6 \
    libssl1.1 \
    libxext6 \
    libxrender-dev \
    openjdk-8-jdk-headless \
    openssl \
    vim \
    wget \
    unzip \
    zlib1g-dev \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

RUN cd /tmp \
    && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
    && cd nccl \
    && make -j64 src.build BUILDDIR=/usr/local \
    && rm -rf /tmp/nccl

RUN curl -L -o ~/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh \
    && chmod +x ~/miniconda.sh \
    && ~/miniconda.sh -b -p /opt/conda \
    && rm ~/miniconda.sh \
    && /opt/conda/bin/conda update conda \
    && /opt/conda/bin/conda install -c conda-forge \
    python=$PYTHON_VERSION \
    && /opt/conda/bin/conda install -y \
    ruamel_yaml==0.15.100 \
    cython \
    botocore \
    mkl-include \
    mkl \
    && /opt/conda/bin/conda clean -ya

RUN pip install --upgrade pip --trusted-host pypi.org --trusted-host files.pythonhosted.org \
    && ln -s /opt/conda/bin/pip /usr/local/bin/pip3 \
    && pip install packaging==20.4 \
    enum-compat==0.0.3 \
    "cryptography>3.2"
RUN rm -rf /opt/conda/lib/libtinfo.so.6

RUN wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-$OPEN_MPI_VERSION.tar.gz \
    && gunzip -c openmpi-$OPEN_MPI_VERSION.tar.gz | tar xf - \
    && cd openmpi-$OPEN_MPI_VERSION \
    && ./configure --prefix=/home/.openmpi \
    && make all install \
    && cd .. \
    && rm openmpi-$OPEN_MPI_VERSION.tar.gz \
    && rm -rf openmpi-$OPEN_MPI_VERSION

ENV PATH="$PATH:/home/.openmpi/bin"
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/.openmpi/lib/"

RUN cd /tmp/ \
    && rm -rf tmp*

RUN pip uninstall -y torch \
    && pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

RUN pip install --no-cache-dir \
    protobuf==3.12.0

#################################
# Deepspeed specific section    #
#################################

RUN apt-get update -y && \
    apt-get install -y libaio-dev

ARG DS_VERSION=0.5.3
ARG DS_BUILD_OPS=0
RUN mkdir -p /tmp && \
    cd /tmp && \
    git clone https://github.com/microsoft/DeepSpeed.git && \
    cd DeepSpeed && \
    git checkout "v$DS_VERSION" && \
    pip install -r requirements/requirements-dev.txt && \
    pip install -r requirements/requirements.txt && \
    DS_BUILD_OPS=$DS_BUILD_OPS python setup.py build_ext -j32 bdist_wheel &&\
    pip install dist/*.whl && \
    ds_report

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

jeffra commented, Mar 17, 2022

I was able to repro your issue on my side. I reduced your Dockerfile down to the following that can reproduce it and uses the devel image from nvidia that already includes nvcc, etc. I am continuing to investigate and will let you know when I have a fix. Thank you for reporting your issue, we very much appreciate it.

FROM nvidia/cuda:11.1.1-devel-ubuntu20.04
ENV PATH /opt/conda/bin:$PATH

RUN apt-get update \
    && apt-get install -y curl build-essential git libaio-dev

RUN curl -L -o ~/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh \
    && chmod +x ~/miniconda.sh \
    && ~/miniconda.sh -b -p /opt/conda \
    && rm ~/miniconda.sh \
    && /opt/conda/bin/conda update conda \
    && /opt/conda/bin/conda install -c conda-forge \
    python=$PYTHON_VERSION \
    && /opt/conda/bin/conda install -y \
    ruamel_yaml==0.15.100 \
    cython \
    botocore \
    mkl-include \
    mkl \
    && /opt/conda/bin/conda clean -ya

RUN pip install --upgrade pip --trusted-host pypi.org --trusted-host files.pythonhosted.org \
    && ln -s /opt/conda/bin/pip /usr/local/bin/pip3 \
    && pip install packaging==20.4 \
    enum-compat==0.0.3 \
    "cryptography>3.2"

RUN pip uninstall -y torch \
    && pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

ENV TORCH_CUDA_ARCH_LIST="7.0"
ARG DS_VERSION=0.5.10
ARG DS_BUILD_OPS=1
RUN mkdir -p /tmp && \
    cd /tmp && \
    git clone https://github.com/microsoft/DeepSpeed.git && \
    cd DeepSpeed && \
    git checkout "v$DS_VERSION" && \
    pip install -r requirements/requirements-dev.txt && \
    pip install -r requirements/requirements.txt && \
    DS_BUILD_TRANSFORMER=1 python setup.py build_ext -j32 bdist_wheel &&\
    pip install dist/*.whl && \
    ds_report
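
The reduced Dockerfile builds a single op by setting `DS_BUILD_TRANSFORMER=1` instead of `DS_BUILD_OPS=1`. The same idea could be used to try the ops one at a time and see which of them fail to compile. The sketch below only generates the commands. The list of flag names is an assumption based on DeepSpeed's install documentation and may differ between releases, so check it against the checked-out version.

```python
# Per-op build flags. Only DS_BUILD_TRANSFORMER appears in this thread; the
# other names are assumed from DeepSpeed's install docs.
OP_FLAGS = [
    "DS_BUILD_CPU_ADAM",
    "DS_BUILD_FUSED_ADAM",
    "DS_BUILD_FUSED_LAMB",
    "DS_BUILD_SPARSE_ATTN",
    "DS_BUILD_TRANSFORMER",
]

# Build command copied from the Dockerfiles in this issue.
BUILD_CMD = "python setup.py build_ext -j32 bdist_wheel"


def single_op_commands(flags=OP_FLAGS, build_cmd=BUILD_CMD):
    """Return one shell command per flag, each enabling exactly that op."""
    return [f"{flag}=1 {build_cmd}" for flag in flags]


for command in single_op_commands():
    print(command)
```

Each printed line can be run inside the DeepSpeed checkout; a command that fails points at the op whose sources do not compile in that environment.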

jeffra commented, Mar 22, 2022

Excellent, thank you for the update @oborchers! I agree, we’ll update the docs wrt these build args.
