
Memory Decreases! But Latency Increases....


Things seem to be working as intended! I went from using GPT-J-6B with

model = AutoModelForCausalLM.from_pretrained("/mnt/models", torch_dtype=torch.float16, low_cpu_mem_usage=True).to(torch.device("cuda", 0))

to

model = AutoModelForCausalLM.from_pretrained("/mnt/models", device_map="auto", load_in_8bit=True)

With nvidia-smi reporting a decrease in GPU memory consumption from ~15 GB to ~9 GB. Very nice!
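
As an in-process cross-check of what nvidia-smi reports, the short sketch below (not part of the original issue) prints PyTorch's own memory counters; nvidia-smi additionally counts the CUDA context and the caching allocator's reserved-but-unused blocks, so its number is somewhat higher:

import torch

# Peak memory held by PyTorch tensors on GPU 0, plus memory reserved by the
# caching allocator; both are reported in GB.
peak_gb = torch.cuda.max_memory_allocated(0) / 1024**3
reserved_gb = torch.cuda.memory_reserved(0) / 1024**3
print(f"peak allocated: {peak_gb:.1f} GB, reserved: {reserved_gb:.1f} GB")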

However, I don’t think we can use this in production, because the latency of text generation increases from ~3.5 s to ~12 s to generate 45 output tokens. I’m using something like:

output_ids = self.model.generate(
    input_ids.cuda(),
    max_length=45,
    do_sample=True,
    top_p=request.get("top_p", 1.0),
    top_k=request.get("top_k", 50),
    ...
)
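
For reference, a rough sketch of how a per-request latency like the numbers above can be measured (the tokenizer path and prompt are placeholders, and this is not the exact benchmarking code):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/mnt/models")  # placeholder path
model = AutoModelForCausalLM.from_pretrained("/mnt/models", device_map="auto", load_in_8bit=True)
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids

torch.cuda.synchronize()  # flush any pending GPU work before starting the clock
start = time.perf_counter()
output_ids = model.generate(input_ids.cuda(), max_length=45, do_sample=True)
torch.cuda.synchronize()  # wait until generation has finished on the GPU
print(f"generation latency: {time.perf_counter() - start:.2f} s")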

Is this increase in latency known / expected? Or is it specific to my system? For reference, here is the Dockerfile I use to reproduce this:

FROM nvidia/cuda:11.3.0-devel-ubuntu20.04

ARG DEBIAN_FRONTEND=noninteractive

ENV APP_HOME /app
WORKDIR $APP_HOME

# NVIDIA rotated their GPG keys, so we have to remove the old ones to do apt-get update
RUN rm /etc/apt/sources.list.d/cuda.list
RUN rm /etc/apt/sources.list.d/nvidia-ml.list
RUN apt-get update && apt-get install -y build-essential wget vim git

# Note: we need curl for the liveness probe
RUN apt-get install --yes curl

# Install miniconda
ENV CONDA_DIR /opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
     /bin/bash ~/miniconda.sh -b -p /opt/conda
ENV PATH=$CONDA_DIR/bin:$PATH

# Install conda dependencies.
RUN conda install -y python=3.8
RUN conda install -y pytorch=1.12.1 cudatoolkit=11.3 -c pytorch

# Install pip deps
COPY requirements.txt ./
RUN pip install --no-cache-dir -r ./requirements.txt

# Copy local code to container image
COPY *.py ./

CMD ["python", "model.py"]

with requirements.txt being

kserve==0.9.0
git+https://github.com/huggingface/transformers.git@4a51075a96d2049f368b5f3dd6c0e9f08f599b62
accelerate==0.12.0
bitsandbytes==0.31.8

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 12 (4 by maintainers)

Top GitHub Comments

6 reactions
TimDettmers commented, Aug 10, 2022

Hi Mitchell!

Currently, this is expected, but we are aware of the issues and plan to address the ones that can be resolved in future releases.

To summarize the issues:

  1. For the release of a memory-efficient implementation I needed to quickly roll a CUDA kernel for outlier extraction from matrices with a special format (COL4_4R2_8C and COL32_2R_4R4, aka colTuring and colAmpere). That CUDA kernel is currently not very efficient.
  2. The fp16 matrix multiplication used in conjunction with the Int8 matmul is currently run in the same CUDA stream. This makes processing sequential even though the multiplications are independent (see the sketch at the end of this comment).
  3. The fp16 matrix multiplication kernel might not be fully optimized for the extreme matrix sizes used in the outlier multiplication. A custom kernel would be lightning fast, but would require some work.
  4. Overall, int8 matrix multiplication is not very fast for small models. This is because it is difficult to saturate the GPU cores with int8 elements, so for small models int8 is only about as fast as fp16, yet one pays the additional overhead of quantization, which slows overall inference down. Raw speedups for a 6B model would be maybe 20-40%. I am not sure about inference, though, since the overhead is more complex and depends on many factors (sequence length, batch size, etc.).

I have not done precise benchmarks, but if I distributed a weight of 1.0 across these issues according to how much each slows the system down, my guess would be: (1) 10%, (2) 20%, (3) 60%, (4) 10%.

In other words, the most effective fix would be a custom kernel for the fp16 matmul, followed by running the fp16 matmul in a second stream, followed by a better CUDA kernel for outlier extraction, and then hardware issues (not solvable).
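
To make point (2) concrete, here is a minimal PyTorch sketch, not from the thread and not bitsandbytes' actual code, of how two independent half-precision matmuls can be overlapped by issuing one of them on a second CUDA stream (matrix sizes are arbitrary placeholders):

import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
side_stream = torch.cuda.Stream()

# Stand-in for the int8 matmul path: runs on the default stream.
main_out = a @ b

# The side stream must see a and b, which were created on the default stream.
side_stream.wait_stream(torch.cuda.current_stream())

# Stand-in for the independent fp16 outlier matmul: issued on a second stream
# so the GPU can overlap it with the work above instead of running the two
# multiplications back to back.
with torch.cuda.stream(side_stream):
    outlier_out = a @ b

# Before combining results, the default stream waits for the side stream.
torch.cuda.current_stream().wait_stream(side_stream)
combined = main_out + outlier_out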

2 reactions
mitchellgordon95 commented, Aug 11, 2022

Hi Younes!

That did decrease the latency, but it is still around 6.1 s, which is almost double the latency without int8.

