ZeroQuant quantization kernels and LKD
Hi,
I was trying out the compression library for ZeroQuant quantization (for a GPT-J model). While I was able to compress the model, I didn't see any throughput/latency gain from the quantization during inference. I have a few questions regarding this:
- Do you guys have any guide to running inference on compressed models (especially ZeroQuant)? `InferenceEngine` only seems to support Mixture-of-Quantization, not ZeroQuant. I also tried int8 quantization without the compression module, as shown in the code snippet below, but I end up with a `CUDA error: an illegal memory access` error.
- Have you guys released the fused GeLU+Quantize and GeMM+Dequantize kernels proposed in the ZeroQuant paper yet?
- Is there a tentative release date for layer-by-layer knowledge distillation (LKD)?
- What's the motivation for multiplying the quantized input by the scale here? Wouldn't that dequantize the inputs? (A small numeric sketch of this round trip follows the snippet below.)
```python
import torch
import deepspeed
from deepspeed import module_inject
# assuming gptj_transformer refers to the Hugging Face GPT-J block class
from transformers.models.gptj.modeling_gptj import GPTJBlock as gptj_transformer

injection_policy = {gptj_transformer:
                    module_inject.replace_policy.HFGPTJLayerPolicy}
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.int8,
    quantization_setting=2,
    replace_with_kernel_inject=True,
    injection_policy=injection_policy,
)
```
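Regarding the last question, here is a minimal sketch of the quantize/dequantize relationship, in plain PyTorch rather than the fused ZeroQuant/DeepSpeed kernels (the helper names are purely illustrative): in symmetric int8 quantization the scale is the factor that maps integer values back to the real-valued range, so multiplying a quantized tensor by its scale is exactly the dequantization step. In ZeroQuant these quantize/dequantize steps are what get fused into the neighbouring GeLU and GeMM kernels to avoid extra memory traffic.

```python
import torch

# Symmetric per-token int8 quantization sketch (illustrative only, not the
# DeepSpeed kernels). One scale per token (row): scale = max|x| / 127.
def quantize_per_token(x: torch.Tensor):
    qmax = 127
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    # Multiplying by the scale recovers (approximately) the original values,
    # i.e. it is the dequantization step.
    return q.to(torch.float32) * scale

x = torch.randn(4, 16)
q, scale = quantize_per_token(x)
x_hat = dequantize(q, scale)
print((x - x_hat).abs().max())  # small round-off error from quantization
```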
Any help would be appreciated.
Top Results From Across the Web
- ZeroQuant: Efficient and Affordable Post-Training Quantization ...
  "We propose a novel layer-by-layer knowledge distillation method (LKD) for INT4/INT8 mixed-precision quantization, where the neural network is ..."
- ZeroQuant: Efficient and Affordable Post ... - OpenReview
  "Specifically, ZeroQuant consists of hardware-constraint group-wise weight quantization and kernel fusion based token-wise activation ..."
- ZeroQuant: Efficient and Affordable Post-Training Quantization ...
  "ZeroQuant enables quantizing BERT and GPT-3-style models into INT8 weight and activations to retain accuracy without incurring any retraining cost. Compared to ..."
- ZeroQuant: Efficient and Affordable Post-Training ... - Microsoft
  "ZeroQuant is an end-to-end quantization and inference pipeline with three ... (LKD) even without the access to the original training data; ..."
- DeepSpeed Model Compression Library
  "Tutorial for ZeroQuant: efficient and affordable post-training quantization; Tutorial for XTC: simple yet effective compression pipeline for extreme ..."
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks @yaozhewei! Do you know whether there is a rough timeline for this? e.g. 1 month, 6 months, 1 year? It would be very useful to know, as we'd like to decide whether to wait or explore other options. Thanks again!
@david-macleod The LKD example has just been released (not merged yet): https://github.com/microsoft/DeepSpeedExamples/pull/214
For the kernels, please stay tuned.
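For context on the LKD question above, here is a minimal sketch of layer-by-layer knowledge distillation as described in the ZeroQuant paper. It is not the DeepSpeedExamples implementation linked in the comment; the function and argument names are assumptions, and each transformer block is assumed to map a hidden-state tensor to a hidden-state tensor. Each quantized layer is trained to match the corresponding original layer's output, using the original model's activations as inputs, so no labels and no end-to-end backpropagation are required.

```python
import torch
import torch.nn.functional as F

def lkd(fp_layers, quant_layers, calib_hidden, lr=1e-5, steps=100):
    """Layer-by-layer distillation sketch (hypothetical helper, not DeepSpeed's API).

    fp_layers     -- original (e.g. FP16) transformer blocks
    quant_layers  -- quantized copies of the same blocks (trainable)
    calib_hidden  -- list of hidden-state tensors feeding the first block
    """
    inputs = calib_hidden
    for fp_layer, q_layer in zip(fp_layers, quant_layers):
        opt = torch.optim.Adam(q_layer.parameters(), lr=lr)
        for _ in range(steps):
            for h in inputs:
                with torch.no_grad():
                    target = fp_layer(h)            # teacher output for this layer
                loss = F.mse_loss(q_layer(h), target)
                opt.zero_grad()
                loss.backward()
                opt.step()
        # The next layer is distilled on the *original* model's activations,
        # so every layer can be optimized independently of the quantized ones.
        with torch.no_grad():
            inputs = [fp_layer(h) for h in inputs]
```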