ZeroQuant quantization kernels and LKD
Hi,
I was trying out the compression library for ZeroQuant quantization (for a GPT-J model). While I was able to compress the model, I didn't see any throughput/latency gain from the quantization during inference. I have a few questions regarding this:
- Do you guys have any guide to running inference on compressed models (especially ZeroQuant)? `InferenceEngine` only seems to support Mixture-of-Quantization, not ZeroQuant. I also tried int8 quantization without the compression module, as shown in the code snippet below, but I end up with a `CUDA error: an illegal memory access` error.
- Have you guys released the fused GeLU+Quantize and GeMM+Dequantize kernels proposed in the ZeroQuant paper yet?
- Is there a tentative release date for layer-by-layer knowledge distillation (LKD)?
- What's the motivation for multiplying the quantized input by the scale here? Wouldn't that dequantize the inputs? (A small numeric sketch of this round trip follows the snippet below.)
```python
import torch
import deepspeed
from deepspeed import module_inject
# assuming gptj_transformer refers to the Hugging Face GPT-J block class
from transformers.models.gptj.modeling_gptj import GPTJBlock as gptj_transformer

injection_policy = {gptj_transformer:
                    module_inject.replace_policy.HFGPTJLayerPolicy}
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.int8,
    quantization_setting=2,
    replace_with_kernel_inject=True,
    injection_policy=injection_policy,
)
```
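Regarding the last question, here is a minimal sketch of the quantize/dequantize relationship, in plain PyTorch rather than the fused ZeroQuant/DeepSpeed kernels (the helper names are purely illustrative): in symmetric int8 quantization the scale is the factor that maps integer values back to the real-valued range, so multiplying a quantized tensor by its scale is exactly the dequantization step. In ZeroQuant these quantize/dequantize steps are what get fused into the neighbouring GeLU and GeMM kernels to avoid extra memory traffic.

```python
import torch

# Symmetric per-token int8 quantization sketch (illustrative only, not the
# DeepSpeed kernels). One scale per token (row): scale = max|x| / 127.
def quantize_per_token(x: torch.Tensor):
    qmax = 127
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    # Multiplying by the scale recovers (approximately) the original values,
    # i.e. it is the dequantization step.
    return q.to(torch.float32) * scale

x = torch.randn(4, 16)
q, scale = quantize_per_token(x)
x_hat = dequantize(q, scale)
print((x - x_hat).abs().max())  # small round-off error from quantization
```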
Any help would be appreciated.
Top Results From Across the Web
- ZeroQuant: Efficient and Affordable Post-Training Quantization ...
  "We propose a novel layer-by-layer knowledge distillation method (LKD) for INT4/INT8 mixed-precision quantization, where the neural network is ..."
- ZeroQuant: Efficient and Affordable Post ... - OpenReview
  "Specifically, ZeroQuant consists of hardware-constraint group-wise weight quantization and kernel fusion based token-wise activation ..."
- ZeroQuant: Efficient and Affordable Post-Training Quantization ...
  "ZeroQuant enables quantizing BERT and GPT-3-style models into INT8 weight and activations to retain accuracy without incurring any retraining cost. Compared to ..."
- ZeroQuant: Efficient and Affordable Post-Training ... - Microsoft
  "ZeroQuant is an end-to-end quantization and inference pipeline with three ... (LKD) even without the access to the original training data; ..."
- DeepSpeed Model Compression Library
  "Tutorial for ZeroQuant: efficient and affordable post-training quantization; Tutorial for XTC: simple yet effective compression pipeline for extreme ..."
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks @yaozhewei! Do you know whether there is a rough timeline for this? e.g. 1 month, 6 months, 1 year? It would be very useful to know, as we'd like to decide whether to wait or explore other options. Thanks again!
@david-macleod The LKD example has just been released (not merged yet): https://github.com/microsoft/DeepSpeedExamples/pull/214
For the kernels, please stay tuned.
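For context on the LKD question above, here is a minimal sketch of layer-by-layer knowledge distillation as described in the ZeroQuant paper. It is not the DeepSpeedExamples implementation linked in the comment; the function and argument names are assumptions, and each transformer block is assumed to map a hidden-state tensor to a hidden-state tensor. Each quantized layer is trained to match the corresponding original layer's output, using the original model's activations as inputs, so no labels and no end-to-end backpropagation are required.

```python
import torch
import torch.nn.functional as F

def lkd(fp_layers, quant_layers, calib_hidden, lr=1e-5, steps=100):
    """Layer-by-layer distillation sketch (hypothetical helper, not DeepSpeed's API).

    fp_layers     -- original (e.g. FP16) transformer blocks
    quant_layers  -- quantized copies of the same blocks (trainable)
    calib_hidden  -- list of hidden-state tensors feeding the first block
    """
    inputs = calib_hidden
    for fp_layer, q_layer in zip(fp_layers, quant_layers):
        opt = torch.optim.Adam(q_layer.parameters(), lr=lr)
        for _ in range(steps):
            for h in inputs:
                with torch.no_grad():
                    target = fp_layer(h)            # teacher output for this layer
                loss = F.mse_loss(q_layer(h), target)
                opt.zero_grad()
                loss.backward()
                opt.step()
        # The next layer is distilled on the *original* model's activations,
        # so every layer can be optimized independently of the quantized ones.
        with torch.no_grad():
            inputs = [fp_layer(h) for h in inputs]
```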