Memory Decreases! But Latency Increases....
Things seem to be working as intended! I went from loading GPT-J-6B with
```python
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/models", torch_dtype=torch.float16, low_cpu_mem_usage=True
).to(torch.device("cuda", 0))
```
to
```python
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/models", device_map="auto", load_in_8bit=True
)
```
with nvidia-smi reporting a decrease in GPU memory consumption from ~15 GB to ~9 GB. Very nice!
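In case it helps anyone reproduce the comparison from inside the process rather than eyeballing nvidia-smi, here is a minimal sketch (the model path matches the snippets above; note that `torch.cuda.max_memory_allocated` only counts memory held by PyTorch's caching allocator, so it reads somewhat lower than nvidia-smi, which also includes the CUDA context):

```python
import torch
from transformers import AutoModelForCausalLM

torch.cuda.reset_peak_memory_stats(0)

# Same call as above; swap in the fp16 variant to compare the two footprints.
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/models", device_map="auto", load_in_8bit=True
)

# Peak bytes held by PyTorch's allocator on GPU 0, reported in GB.
print(f"peak allocated: {torch.cuda.max_memory_allocated(0) / 1024**3:.2f} GB")
```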
However, I don’t think we can use this in production, because text-generation latency increases from ~3.5 s to ~12 s for 45 output tokens. I’m using something like:
```python
output_ids = self.model.generate(
    input_ids.cuda(),
    max_length=45,
    do_sample=True,
    top_p=request.get("top_p", 1.0),
    top_k=request.get("top_k", 50),
    ...
)
```
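For reference, I time it roughly like the sketch below, reusing the 8-bit model loaded above (the tokenizer path and prompt are placeholders; the warm-up call keeps one-time CUDA/cuBLAS initialization out of the measurement, and note that `max_length` counts prompt plus generated tokens, so `max_new_tokens=45` would be the stricter way to pin exactly 45 output tokens):

```python
import time
import torch
from transformers import AutoTokenizer

# Placeholder tokenizer path and prompt, just for illustration.
tokenizer = AutoTokenizer.from_pretrained("/mnt/models")
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids.cuda()

# Warm-up call so one-time CUDA initialization is not measured.
model.generate(input_ids, max_length=45, do_sample=True)

torch.cuda.synchronize()
start = time.perf_counter()
output_ids = model.generate(input_ids, max_length=45, do_sample=True)
torch.cuda.synchronize()  # make sure all queued GPU work is done before stopping the clock
print(f"generation took {time.perf_counter() - start:.2f}s")
```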
Is this increase in latency known / expected? Or is it specific to my system? For reference, my reproducing Dockerfile is:
```dockerfile
FROM nvidia/cuda:11.3.0-devel-ubuntu20.04

ARG DEBIAN_FRONTEND=noninteractive

ENV APP_HOME /app
WORKDIR $APP_HOME

# NVIDIA rotated their GPG keys, so we have to remove the old ones to do apt-get update
RUN rm /etc/apt/sources.list.d/cuda.list
RUN rm /etc/apt/sources.list.d/nvidia-ml.list

# Note: we need curl for the liveness probe
RUN apt-get update && apt-get install --yes build-essential wget git vim curl

# Install miniconda
ENV CONDA_DIR /opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
    /bin/bash ~/miniconda.sh -b -p /opt/conda
ENV PATH=$CONDA_DIR/bin:$PATH

# Install conda dependencies
RUN conda install python=3.8
RUN conda install pytorch=1.12.1 cudatoolkit=11.3 -c pytorch

# Install pip deps
COPY requirements.txt ./
RUN pip install --no-cache-dir -r ./requirements.txt

# Copy local code to the container image
COPY *.py ./

CMD ["python", "model.py"]
```
with requirements.txt being:

```
kserve==0.9.0
git+https://github.com/huggingface/transformers.git@4a51075a96d2049f368b5f3dd6c0e9f08f599b62
accelerate==0.12.0
bitsandbytes==0.31.8
```
Top GitHub Comments
Hi Mitchell!
Currently, this is expected, but we are aware of the issues and plan to solve those that can be resolved in future releases.
To summarize the issues: (1) outlier extraction currently lacks a fast CUDA kernel; (2) the fp16 matmul for the outliers does not run in a second CUDA stream; (3) the fp16 matmul itself lacks a custom kernel; (4) hardware limitations.
I have not done precise benchmarks, but if I distributed a total weight of 1.0 across these issues in terms of how much each one slows the system down, this would be my guess: (1) 10%, (2) 20%, (3) 60%, (4) 10%.
In other words, the most effective fix would be a custom kernel for the fp16 matmul, followed by running the fp16 matmul in a second stream, followed by a better CUDA kernel for outlier extraction, and then the hardware issues (not solvable).
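For context on why an fp16 matmul appears inside an int8 forward pass at all: LLM.int8() splits the hidden states into outlier feature dimensions, which stay in fp16, and the remaining dimensions, which are quantized to int8; the two matmul results are then summed. Below is a toy sketch of that decomposition in plain PyTorch, not the actual bitsandbytes kernels (the 6.0 outlier threshold mirrors the library's default, and the int8 multiply is emulated in float):

```python
import torch

def llm_int8_matmul_sketch(x: torch.Tensor, w: torch.Tensor, threshold: float = 6.0):
    """Toy mixed-precision decomposition; not the real bitsandbytes kernels.

    x: (batch, in_features) fp16 activations; w: (in_features, out_features) fp16 weights.
    """
    # (1) Outlier extraction: feature dimensions where any activation magnitude
    #     exceeds the threshold stay in fp16. This scan/gather is one of the costs.
    outliers = (x.abs() > threshold).any(dim=0)

    # (3) fp16 matmul over the (few) outlier dimensions -- the part that would
    #     benefit most from a custom kernel.
    out_fp16 = x[:, outliers] @ w[outliers, :]

    # int8 path over the remaining dimensions: row-wise quantize x, column-wise
    # quantize w, multiply, then dequantize. Emulated in float32 here; the real
    # kernel does the multiply in int8.
    x_r, w_r = x[:, ~outliers], w[~outliers, :]
    sx = x_r.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    sw = w_r.abs().amax(dim=0, keepdim=True).clamp_min(1e-8) / 127.0
    xq = (x_r / sx).round().clamp(-127, 127)
    wq = (w_r / sw).round().clamp(-127, 127)
    out_int8 = ((xq.float() @ wq.float()) * sx.float() * sw.float()).to(x.dtype)

    # (2) The two partial results are summed; running the fp16 matmul in a
    #     second CUDA stream would let the two matmuls overlap.
    return out_fp16 + out_int8
```

Even in this toy form, every forward pass pays for the outlier scan, the gather/split, a quantize/dequantize round trip, and two matmuls instead of one, which is roughly where the extra latency over plain fp16 comes from.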
Hi Younes!
That did decrease the latency, but it is still around 6.1 s, which is almost double the latency without int8.