Quantization and pruning from DeepSpeed Compression not working
I would like to use DeepSpeed for post-training compression with CUDA, using quantization or pruning.
I’m using a pretrained ResNet as a simple example to test how DeepSpeed works, following this.
However, I’m not able to achieve any performance improvement at all, whether with weight quantization, activation quantization, or sparse and row pruning. When using pruning, I checked that the weights are actually modified, but no performance gain is obtained.
Here is the full code I’m using:

```python
import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt
import deepspeed
from deepspeed.compression.compress import init_compression, redundancy_clean
import argparse

# Use the GPU if available
if torch.cuda.is_available():
    print("CUDA Available")
    device = torch.device('cuda')
else:
    print('CUDA Not Available')
    device = torch.device('cpu')

# Routine to compute the inference time
def checktime(model, ndata=500):
    timelist = []
    for i in range(ndata):
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        out = model(data)
        end.record()
        torch.cuda.synchronize()
        timelist.append(start.elapsed_time(end))
    timelist = timelist[30:]  # drop the first calls as warm-up
    timelist = np.array(timelist)
    print("Inference time [ms]. Mean: {:.1f}, Std: {:.1f}".format(timelist.mean(), timelist.std()))
    return timelist

# An instance of the ResNet model
model = torchvision.models.resnet18().to(device)
model.eval()

"""
# Check names of layers
for name, param in model.named_parameters():
    print(name)
"""

# An example input
data = torch.rand(4, 3, 224, 224, device=device)

# Compute the inference time of the model before compression
timelist_standard = checktime(model)
# out: Inference time [ms]. Mean: 2.7, Std: 0.1

# Get arguments for DeepSpeed
parser = argparse.ArgumentParser(description='Deepspeed')
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()
print("\n", args, "\n")

deepspeed.init_distributed()  # I think this line is not required

# Compress the model
model = init_compression(model, args.deepspeed_config)
model = redundancy_clean(model, args.deepspeed_config)
model.eval()

# Compute the inference time of the compressed model
timelist_compressed = checktime(model)
# out: Inference time [ms]. Mean: 2.7, Std: 0.1
```
And this is an example of the config file (although I tried different variants):

```json
{
    "compression_training": {
        "weight_quantization": {
            "shared_parameters": {
                "enabled": true,
                "quantizer_kernel": false,
                "schedule_offset": 0,
                "quantize_groups": 1,
                "quantize_verbose": true,
                "quantization_type": "asymmetric",
                "quantize_weight_in_forward": false,
                "rounding": "nearest",
                "fp16_mixed_quantize": {
                    "enabled": false,
                    "quantize_change_ratio": 0.001
                }
            },
            "different_groups": {
                "wq1": {
                    "params": {
                        "start_bits": 12,
                        "target_bits": 8,
                        "quantization_period": 50
                    },
                    "modules": [
                        "conv1",
                        "conv2"
                    ]
                }
            }
        }
    }
}
```
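For reference, the check mentioned above, that the weights are actually modified, can be done with a small helper like this (`summarize_weights` is a hypothetical name, not part of DeepSpeed):

```python
import torch

def summarize_weights(model, layer_name="conv1"):
    # A quantized tensor should show far fewer distinct values, and a pruned
    # one a large fraction of exact zeros, compared to an untouched layer.
    w = dict(model.named_parameters())[layer_name + ".weight"].detach()
    print("{}: {} unique values, {:.1%} zeros".format(
        layer_name, torch.unique(w).numel(), (w == 0).float().mean().item()))

# e.g. call once before and once after init_compression / redundancy_clean:
# summarize_weights(model)
```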
Am I doing something wrong when applying the DeepSpeed tools for post-training compression? Am I missing something?
Thanks in advance, Pablo.
Top GitHub Comments
Hi,
For the quantization-related and sparse-pruning features, special kernels are needed to get a real speedup, and that part has not been fully released yet. For channel pruning, you should be able to see a latency reduction. Would you mind sharing the config you used for channel pruning?
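For reference, a channel-pruning entry in the config would look roughly like the sketch below (following the layout of the DeepSpeed Model Compression tutorial; the module names and `dense_ratio` are illustrative, not from the original thread):

```json
{
    "compression_training": {
        "channel_pruning": {
            "shared_parameters": {
                "enabled": true,
                "schedule_offset": 0,
                "method": "topk"
            },
            "different_groups": {
                "cp1": {
                    "params": {
                        "dense_ratio": 0.5
                    },
                    "modules": ["layer1.0.conv1"],
                    "related_modules": [["layer1.0.conv2"]]
                }
            }
        }
    }
}
```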
Hi, thanks for your question. Indeed, something is missing in your sample code: you need to apply deepspeed.initialize; see this example (https://github.com/microsoft/DeepSpeedExamples-internal/blob/staging_compression_library_v1/model_compression/cifar/train.py#L116).
The reason is that we implemented the quantization training in our deepspeed.runtime. Let us know if you have any questions 😃
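In code, the missing step would look roughly like the sketch below (a minimal sketch assuming the argument names from the linked cifar example, not a verbatim excerpt; the DeepSpeed config may also need an optimizer section, and the loop that advances the compression schedule is elided):

```python
model = init_compression(model, args.deepspeed_config)

# Wrap the model in the DeepSpeed engine so that the compression scheduling
# implemented in deepspeed.runtime actually runs.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
)

# ... run forward/training steps here so the quantization schedule advances ...

# Only then strip the redundant parameters and benchmark the result.
model = redundancy_clean(model_engine.module, args.deepspeed_config)
```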