
Memory Decreases! But Latency Increases....


Things seem to be working as intended! I went from using GPT-J-6B with

model = AutoModelForCausalLM.from_pretrained("/mnt/models", torch_dtype=torch.float16, low_cpu_mem_usage=True).to(torch.device("cuda", 0))

to

model = AutoModelForCausalLM.from_pretrained("/mnt/models", device_map="auto", load_in_8bit=True)

With nvidia-smi reporting a decrease in GPU memory consumption from ~15 GB to ~9 GB. Very nice!
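
As an in-process cross-check of what nvidia-smi reports, the short sketch below (not part of the original issue) prints PyTorch's own memory counters; nvidia-smi additionally counts the CUDA context and the caching allocator's reserved-but-unused blocks, so its number is somewhat higher:

import torch

# Peak memory held by PyTorch tensors on GPU 0, plus memory reserved by the
# caching allocator; both are reported in GB.
peak_gb = torch.cuda.max_memory_allocated(0) / 1024**3
reserved_gb = torch.cuda.memory_reserved(0) / 1024**3
print(f"peak allocated: {peak_gb:.1f} GB, reserved: {reserved_gb:.1f} GB")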

However, I don’t think we can use this in production, because the latency of text generation increases from ~3.5 s to ~12 s to generate 45 output tokens. I’m using something like:

output_ids = self.model.generate(
    input_ids.cuda(),
    max_length=45,
    do_sample=True,
    top_p=request.get("top_p", 1.0),
    top_k=request.get("top_k", 50),
    ...
)
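
For reference, a rough sketch of how a per-request latency like the numbers above can be measured (the tokenizer path and prompt are placeholders, and this is not the exact benchmarking code):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/mnt/models")  # placeholder path
model = AutoModelForCausalLM.from_pretrained("/mnt/models", device_map="auto", load_in_8bit=True)
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids

torch.cuda.synchronize()  # flush any pending GPU work before starting the clock
start = time.perf_counter()
output_ids = model.generate(input_ids.cuda(), max_length=45, do_sample=True)
torch.cuda.synchronize()  # wait until generation has finished on the GPU
print(f"generation latency: {time.perf_counter() - start:.2f} s")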

Is this increase in latency known / expected? Or is it specific to my system? For reference, here is the Dockerfile I use to reproduce this:

FROM nvidia/cuda:11.3.0-devel-ubuntu20.04

ARG DEBIAN_FRONTEND=noninteractive

ENV APP_HOME /app
WORKDIR $APP_HOME

# NVIDIA rotated their GPG keys, so we have to remove the old ones to do apt-get update
RUN rm /etc/apt/sources.list.d/cuda.list
RUN rm /etc/apt/sources.list.d/nvidia-ml.list
RUN apt-get update && apt-get install -y build-essential wget vim git

# Note: we need curl for the liveness probe
RUN apt-get install --yes curl

# Install miniconda
ENV CONDA_DIR /opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
     /bin/bash ~/miniconda.sh -b -p /opt/conda
ENV PATH=$CONDA_DIR/bin:$PATH

# Install conda dependencies.
RUN conda install -y python=3.8
RUN conda install -y pytorch=1.12.1 cudatoolkit=11.3 -c pytorch

# Install pip deps
COPY requirements.txt ./
RUN pip install --no-cache-dir -r ./requirements.txt

# Copy local code to container image
COPY *.py ./

CMD ["python", "model.py"]

with requirements.txt being

kserve==0.9.0
git+https://github.com/huggingface/transformers.git@4a51075a96d2049f368b5f3dd6c0e9f08f599b62
accelerate==0.12.0
bitsandbytes==0.31.8

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 12 (4 by maintainers)

Top GitHub Comments

6 reactions
TimDettmers commented, Aug 10, 2022

Hi Mitchell!

Currently, this is expected, but we are aware of the issues and plan to address the ones that can be resolved in future releases.

To summarize the issues:

  1. For the release of a memory-efficient implementation I needed to quickly roll a CUDA kernel for outlier extraction from matrices with a special format (COL4_4R2_8C and COL32_2R_4R4, aka colTuring and colAmpere). That CUDA kernel is currently not very efficient.
  2. The fp16 matrix multiplication used in conjunction with the Int8 matmul is currently run in the same CUDA stream. This makes processing sequential even though the multiplications are independent (see the sketch at the end of this comment).
  3. The fp16 matrix multiplication kernel might not be fully optimized for the extreme matrix sizes used in the outlier multiplication. A custom kernel would be lightning fast, but would require some work.
  4. Overall, int8 matrix multiplication is not very fast for small models. This is because it is difficult to saturate the GPU cores with int8 elements, so for small models int8 is only about as fast as fp16, yet one pays the additional overhead of quantization, which slows overall inference down. Raw speedups for a 6B model would be maybe 20-40%. I am not sure about inference, though, since the overhead is more complex and depends on many factors (sequence length, batch size, etc.).

I have not done precise benchmarks, but if I distributed a weight of 1.0 across these issues according to how much each slows the system down, my guess would be: (1) 10%, (2) 20%, (3) 60%, (4) 10%.

In other words, the most effective fix would be a custom kernel for the fp16 matmul, followed by running the fp16 matmul in a second stream, followed by a better CUDA kernel for outlier extraction, and then hardware issues (not solvable).
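
To make point (2) concrete, here is a minimal PyTorch sketch, not from the thread and not bitsandbytes' actual code, of how two independent half-precision matmuls can be overlapped by issuing one of them on a second CUDA stream (matrix sizes are arbitrary placeholders):

import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
side_stream = torch.cuda.Stream()

# Stand-in for the int8 matmul path: runs on the default stream.
main_out = a @ b

# The side stream must see a and b, which were created on the default stream.
side_stream.wait_stream(torch.cuda.current_stream())

# Stand-in for the independent fp16 outlier matmul: issued on a second stream
# so the GPU can overlap it with the work above instead of running the two
# multiplications back to back.
with torch.cuda.stream(side_stream):
    outlier_out = a @ b

# Before combining results, the default stream waits for the side stream.
torch.cuda.current_stream().wait_stream(side_stream)
combined = main_out + outlier_out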

2 reactions
mitchellgordon95 commented, Aug 11, 2022

Hi Younes!

That did decrease the latency, but it is still around 6.1 s, which is almost double the latency without int8.

