Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

ZeroQuant quantization kernels and LKD

See original GitHub issue

Hi,

I was trying out the compression library for ZeroQuant quantization on a GPT-J model. While I was able to compress the model, I didn't see any throughput/latency gain from the quantization during inference. I have a few questions regarding this:

  • Do you have any guide to running inference on compressed models (especially ZeroQuant)? InferenceEngine only seems to support Mixture-of-Quantization, not ZeroQuant. I also tried int8 quantization without the compression module, as shown in the code snippet below, but ended up with CUDA error: an illegal memory access.
  • Have you released the fused GeLU+Quantize and GeMM+Dequantize kernels proposed in the ZeroQuant paper yet?
  • Is there a tentative release date for layer-by-layer knowledge distillation (LKD)?
  • What's the motivation for multiplying the quantized input by the scale here? Wouldn't that dequantize the inputs? (A small sketch of that quantize/dequantize arithmetic follows the snippet below.)
import torch
import deepspeed
from deepspeed import module_inject
# gptj_transformer is assumed to be the Hugging Face GPT-J block class.
from transformers.models.gptj.modeling_gptj import GPTJBlock as gptj_transformer

# Map the GPT-J transformer block to DeepSpeed's GPT-J injection policy.
injection_policy = {gptj_transformer:
                    module_inject.replace_policy.HFGPTJLayerPolicy}

# `model` and `world_size` are defined earlier in the script.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.int8,
    quantization_setting=2,
    replace_with_kernel_inject=True,
    injection_policy=injection_policy,
)
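
For reference on the last question, here is a minimal, self-contained sketch of symmetric per-token quantize/dequantize arithmetic. It is illustrative only: the tensor and function names are made up, and this is not DeepSpeed's kernel code. Multiplying the int8 values by the per-token scale is exactly the step that maps them back to floating point:

import torch

def quantize_per_token(x, num_bits=8):
    # Symmetric per-token quantization: one scale per row (token).
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for int8
    scale = x.abs().amax(dim=-1, keepdim=True) / qmax   # fp scale per token
    q = torch.clamp((x / scale).round(), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    # Multiplying the int8 values by the scale recovers (approximate) fp values.
    return q.to(torch.float32) * scale

x = torch.randn(4, 16)
q, scale = quantize_per_token(x)
x_hat = dequantize(q, scale)
print((x - x_hat).abs().max())  # small reconstruction error

In fused int8 GeMM kernels this multiplication is usually folded into the epilogue rather than applied to the raw inputs, which may be what the code in question is doing, but that is speculation on my part.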

Any help would be appreciated.

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

4 reactions
david-macleod commented, Nov 2, 2022

Thanks @yaozhewei! Do you know whether there is a rough timeline for this, e.g. 1 month, 6 months, 1 year? It would be very useful to know, as we'd like to decide whether to wait or explore other options. Thanks again!

1 reaction
yaozhewei commented, Nov 2, 2022

@david-macleod LKD example is just released (not merged yet): https://github.com/microsoft/DeepSpeedExamples/pull/214

For the kernels, please stay tuned.
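
For anyone prototyping before that PR lands: per the ZeroQuant paper, LKD trains each quantized layer in isolation to match the output of its fp teacher layer, using the teacher's own activations as input, so no labels or original training data are needed. A rough sketch under those assumptions (the layer interface, quantize_fn, and hyperparameters are hypothetical placeholders, not the DeepSpeedExamples implementation):

import copy
import torch
import torch.nn.functional as F

def lkd_one_layer(teacher_layers, k, calib_batches, quantize_fn, steps=100, lr=1e-5):
    # Distill quantized layer k against its fp teacher, one layer at a time.
    # teacher_layers: list of original (fp) blocks that map hidden states to hidden states
    # quantize_fn:    returns a quantized (fake-quant) copy of a block -- placeholder here
    # calib_batches:  iterable of hidden-state batches that feed layer 0
    student_k = quantize_fn(copy.deepcopy(teacher_layers[k]))
    opt = torch.optim.Adam(student_k.parameters(), lr=lr)

    for _, hidden in zip(range(steps), calib_batches):
        with torch.no_grad():
            for j in range(k):                   # teacher layers 0..k-1 produce layer k's input
                hidden = teacher_layers[j](hidden)
            target = teacher_layers[k](hidden)   # fp output of layer k
        out = student_k(hidden)                  # quantized layer's output on the same input
        loss = F.mse_loss(out, target)           # match the teacher layer's output
        opt.zero_grad()
        loss.backward()
        opt.step()

    return student_k

Because only one layer's parameters are updated at a time and the teacher supplies the inputs, memory stays low and the calibration data can be any unlabeled text.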

Read more comments on GitHub >

Top Results From Across the Web

ZeroQuant: Efficient and Affordable Post-Training Quantization ...
We propose a novel layer-by-layer knowledge distillation method (LKD) for INT4/INT8 mixed-precision quantization, where the neural network is ...
Read more >
ZeroQuant: Efficient and Affordable Post ... - OpenReview
Specifically, ZeroQuant consists of hardware-constraint group-wise weight quantization and kernel fusion based token-wise activation ...
Read more >
ZeroQuant: Efficient and Affordable Post-Training Quantization ...
ZeroQuant enables quantizing BERT and GPT-3-style models into INT8 weight and activations to retain accuracy without incurring any retraining cost. Compared to ...
Read more >
ZeroQuant: Efficient and Affordable Post-Training ... - Microsoft
ZeroQuant is an end-to-end quantization and inference pipeline with three ... (LKD) even without the access to the original training data; ...
Read more >
DeepSpeed Model Compression Library
Tutorial for ZeroQuant: efficient and affordable post-training quantization; 3. Tutorial for XTC: simple yet effective compression pipeline for extreme ...
Read more >
