
Quantization and pruning from DeepSpeed Compression not working

See original GitHub issue

I would like to use DeepSpeed for post-training compression with CUDA, using quantization or pruning.

I’m using a pretrained ResNet as a simple test case to see how DeepSpeed works, following this.

However, I’m not able to achieve any performance improvement at all, whether with weight quantization, activation quantization, or sparse/row pruning. When using pruning, I checked that the weights are actually modified, but there is no performance gain.
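
A rough way to make that check is to look at the share of exactly-zero weights per conv layer before and after pruning; the helper below is only an illustrative sketch (zero_fraction is not part of the original script) and assumes model is the ResNet instance used later:

# Illustrative helper: fraction of exactly-zero weights in each conv layer.
# A jump in this fraction after compression means the pruner really ran.
def zero_fraction(model):
    for name, param in model.named_parameters():
        if "conv" in name:
            frac = (param == 0).float().mean().item()
            print(f"{name}: {frac:.2%} zeros")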

Here is the full code I’m using:

import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt

import deepspeed
from deepspeed.compression.compress import init_compression, redundancy_clean
import argparse

# use GPUs if available
if torch.cuda.is_available():
    print("CUDA Available")
    device = torch.device('cuda')
else:
    print('CUDA Not Available')
    device = torch.device('cpu')

# Routine to compute the inference time (uses the global `data` tensor defined below)
def checktime(model, ndata=500):

    timelist = []

    for i in range(ndata):

        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        start.record()
        out = model(data)
        end.record()

        torch.cuda.synchronize()
        timelist.append(start.elapsed_time(end))

    timelist = timelist[30:]    # discard the first 30 warm-up calls
    timelist = np.array(timelist)
    print("Inference time [ms]. Mean: {:.1f}, Std: {:.1f}".format(timelist.mean(),timelist.std()))

    return timelist

# An instance of the ResNet model
model = torchvision.models.resnet18().to(device)
model.eval()

"""
# Check names of layers
for name, param in model.named_parameters():
    print(name)
"""

# An example input
data = torch.rand(4, 3, 224, 224, device=device)

# Compute the inference time in the standard pre-compressed model
timelist_standard = checktime(model)
# out: Inference time [ms]. Mean: 2.7, Std: 0.1

# Get arguments for DeepSpeed
parser = argparse.ArgumentParser(description='Deepspeed')
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()
print("\n",args,"\n")
deepspeed.init_distributed()   # I think this line is not required

# Compress the model
model = init_compression(model, args.deepspeed_config)
model = redundancy_clean(model, args.deepspeed_config)
model.eval()

# Compute the inference time in the compressed model
timelist_compressed = checktime(model)
# out: Inference time [ms]. Mean: 2.7, Std: 0.1
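
For completeness, args.deepspeed_config gets its value from the command-line flag added by deepspeed.add_config_arguments, so the script is launched roughly like this (the file names compress_test.py and ds_config.json are placeholders):

# single-process run; the deepspeed launcher can be used instead if
# init_distributed() expects the usual RANK/LOCAL_RANK environment variables
python compress_test.py --deepspeed_config ds_config.json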

And this is an example of the config file (although I tried different variants):

{
  "compression_training": {
    "weight_quantization": {
      "shared_parameters": {
        "enabled": true,
        "quantizer_kernel": false,
        "schedule_offset": 0,
        "quantize_groups": 1,
        "quantize_verbose": true,
        "quantization_type": "asymmetric",
        "quantize_weight_in_forward": false,
        "rounding": "nearest",
        "fp16_mixed_quantize": {
          "enabled": false,
          "quantize_change_ratio": 0.001
        }
      },
      "different_groups": {
        "wq1": {
          "params": {
            "start_bits": 12,
            "target_bits": 8,
            "quantization_period": 50
          },
          "modules": [
            "conv1",
            "conv2"
          ]
        }
      }
    }
  }
}

Am I doing something wrong in applying the DeepSpeed tools for post-training compression? Am I missing something?

Thanks in advance, Pablo.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
yaozhewei commented, Sep 28, 2022

Hi,

For the quantization-related and sparse-pruning features, special kernels are needed to get a real speedup, and this part has not been fully released yet. For channel pruning, you should be able to see a latency reduction. Would you mind sharing the config you used for channel pruning?
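
For reference, a channel-pruning group in the compression config is typically written roughly like the sketch below; the module names, the dense_ratio value, and the group name cp1 are illustrative placeholders rather than the actual config being asked about:

"compression_training": {
  "channel_pruning": {
    "shared_parameters": {
      "enabled": true,
      "schedule_offset": 0,
      "method": "topk"
    },
    "different_groups": {
      "cp1": {
        "params": {
          "dense_ratio": 0.5
        },
        "modules": ["layer1.0.conv1"],
        "related_modules": [["layer1.0.conv2"]]
      }
    }
  }
}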

1 reaction
xiaoxiawu-microsoft commented, Sep 27, 2022

Hi, thanks for your question. Indeed, something is missing in your sample code: you need to apply deepspeed.initialize; see this example (https://github.com/microsoft/DeepSpeedExamples-internal/blob/staging_compression_library_v1/model_compression/cifar/train.py#L116).

The reason is that we implemented the quantization training in our deepspeed.runtime. Let us know if you have any questions 😃
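
A minimal sketch of that change, assuming args and model are the ones from the script above and that the same JSON is passed via --deepspeed_config (the config may additionally need a train_batch_size entry for deepspeed.initialize to accept it):

# Wrap the compressed model in a DeepSpeed engine so the compression
# scheduler in deepspeed.runtime can actually apply the quantization/pruning.
model = init_compression(model, args.deepspeed_config)
model_engine, optimizer, _, _ = deepspeed.initialize(args=args, model=model)

# ... run forward passes / fine-tuning steps through model_engine here ...

# Strip the compression wrappers once the schedule has been applied.
model = redundancy_clean(model_engine.module, args.deepspeed_config)
model.eval()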

Read more comments on GitHub >

Top Results From Across the Web

  • DeepSpeed Compression: A composable library for extreme ... — "It offers multiple cutting-edge compression methods, as shown in Table 1, including extreme quantization, head/row/channel pruning, and ..."
  • DeepSpeed Model Compression Library — "Contents. 1. General Tutorial. 1.1 Layer Reduction; 1.2 Weight Quantization; 1.3 Activation Quantization; 1.4 Pruning."
  • Deep learning model compression (Rachit Singh) — "Quantization [CoreML Tools documentation]. Pruning. Pruning is removing some weights (i.e. connections) or entire neurons from a neural network ..."
  • ZeroQuant: Efficient and Affordable Post-Training Quantization ... — "In this work, we present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed as ..."
  • PyTorch Lightning v1.2.0: DeepSpeed, Pruning, Quantization ... — "As always, feel free to reach out on Slack or discussions for any questions you might have or issues you are facing."
