Quantization and pruning from DeepSpeed Compression not working
I would like to use DeepSpeed for post-training compression with CUDA, using quantization or pruning.
I’m using a pretrained ResNet as a simple example to test how DeepSpeed works, following this.
However, I’m not able to achieve any performance improvement at all, whether with weight quantization, activation quantization, or sparse and row pruning. When using pruning, I checked that the weights are actually modified, but no performance gain is obtained.
Here is the full code I’m using:

```python
import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt
import deepspeed
from deepspeed.compression.compress import init_compression, redundancy_clean
import argparse

# Use the GPU if available
if torch.cuda.is_available():
    print("CUDA Available")
    device = torch.device('cuda')
else:
    print('CUDA Not Available')
    device = torch.device('cpu')

# Routine to compute the inference time
def checktime(model, ndata=500):
    timelist = []
    for i in range(ndata):
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        out = model(data)
        end.record()
        torch.cuda.synchronize()
        timelist.append(start.elapsed_time(end))
    timelist = timelist[30:]  # drop the first calls as warm-up
    timelist = np.array(timelist)
    print("Inference time [ms]. Mean: {:.1f}, Std: {:.1f}".format(timelist.mean(), timelist.std()))
    return timelist

# An instance of the ResNet model
model = torchvision.models.resnet18().to(device)
model.eval()

"""
# Check names of layers
for name, param in model.named_parameters():
    print(name)
"""

# An example input
data = torch.rand(4, 3, 224, 224, device=device)

# Compute the inference time of the model before compression
timelist_standard = checktime(model)
# out: Inference time [ms]. Mean: 2.7, Std: 0.1

# Get arguments for DeepSpeed
parser = argparse.ArgumentParser(description='Deepspeed')
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()
print("\n", args, "\n")

deepspeed.init_distributed()  # I think this line is not required

# Compress the model
model = init_compression(model, args.deepspeed_config)
model = redundancy_clean(model, args.deepspeed_config)
model.eval()

# Compute the inference time of the compressed model
timelist_compressed = checktime(model)
# out: Inference time [ms]. Mean: 2.7, Std: 0.1
```
And this is an example of the config file (although I tried different variants):

```json
{
    "compression_training": {
        "weight_quantization": {
            "shared_parameters": {
                "enabled": true,
                "quantizer_kernel": false,
                "schedule_offset": 0,
                "quantize_groups": 1,
                "quantize_verbose": true,
                "quantization_type": "asymmetric",
                "quantize_weight_in_forward": false,
                "rounding": "nearest",
                "fp16_mixed_quantize": {
                    "enabled": false,
                    "quantize_change_ratio": 0.001
                }
            },
            "different_groups": {
                "wq1": {
                    "params": {
                        "start_bits": 12,
                        "target_bits": 8,
                        "quantization_period": 50
                    },
                    "modules": [
                        "conv1",
                        "conv2"
                    ]
                }
            }
        }
    }
}
```
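For reference, the check mentioned above, that the weights are actually modified, can be done with a small helper like this (`summarize_weights` is a hypothetical name, not part of DeepSpeed):

```python
import torch

def summarize_weights(model, layer_name="conv1"):
    # A quantized tensor should show far fewer distinct values, and a pruned
    # one a large fraction of exact zeros, compared to an untouched layer.
    w = dict(model.named_parameters())[layer_name + ".weight"].detach()
    print("{}: {} unique values, {:.1%} zeros".format(
        layer_name, torch.unique(w).numel(), (w == 0).float().mean().item()))

# e.g. call once before and once after init_compression / redundancy_clean:
# summarize_weights(model)
```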
Am I doing something wrong when applying the DeepSpeed tools for post-training compression? Am I missing something?
Thanks in advance, Pablo.
Top GitHub Comments
Hi,
For the quantization-related and sparse-pruning features, special kernels are needed to get a real speedup, and that part has not been fully released yet. For channel pruning, you should be able to see a latency reduction. Would you mind sharing the config you used for channel pruning?
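For reference, a channel-pruning entry in the config would look roughly like the sketch below (following the layout of the DeepSpeed Model Compression tutorial; the module names and `dense_ratio` are illustrative, not from the original thread):

```json
{
    "compression_training": {
        "channel_pruning": {
            "shared_parameters": {
                "enabled": true,
                "schedule_offset": 0,
                "method": "topk"
            },
            "different_groups": {
                "cp1": {
                    "params": {
                        "dense_ratio": 0.5
                    },
                    "modules": ["layer1.0.conv1"],
                    "related_modules": [["layer1.0.conv2"]]
                }
            }
        }
    }
}
```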
Hi, thanks for your question. Indeed, something is missing in your sample code: you need to apply deepspeed.initialize; see this example (https://github.com/microsoft/DeepSpeedExamples-internal/blob/staging_compression_library_v1/model_compression/cifar/train.py#L116).
The reason is that we implemented the quantization training in our deepspeed.runtime. Let us know if you have any questions 😃
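In code, the missing step would look roughly like the sketch below (a minimal sketch assuming the argument names from the linked cifar example, not a verbatim excerpt; the DeepSpeed config may also need an optimizer section, and the loop that advances the compression schedule is elided):

```python
model = init_compression(model, args.deepspeed_config)

# Wrap the model in the DeepSpeed engine so that the compression scheduling
# implemented in deepspeed.runtime actually runs.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
)

# ... run forward/training steps here so the quantization schedule advances ...

# Only then strip the redundant parameters and benchmark the result.
model = redundancy_clean(model_engine.module, args.deepspeed_config)
```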