[RFC] Integrate Intel Extension for PyTorch into accelerate to get out-of-the-box optimizations on Intel platforms.
Motivation
Intel Extension for PyTorch (a.k.a. IPEX) provides extra optimizations and performance boosts on Intel hardware (currently CPU only) for both inference and training. These optimizations include graph-level optimizations such as operator fusion, auto mixed precision with a rich set of bf16 operators, and optimizer optimizations that speed up training. Compared with the Trainer, accelerate is mostly used for distributed training and inference on transformer models, but it can also benefit from IPEX's optimizations. Integrating IPEX into accelerate therefore gives users who do distributed training or evaluation an out-of-the-box performance boost on CPU.
Design
User interface
- The first thing is how we tell accelerate to enable IPEX. We can use the CLI tool accelerate config to configure our training and inference environment with the IPEX feature enabled; the tool asks a series of questions, including IPEX-related ones such as ipex_enabled and ipex_fusion_enabled. We can also pass these two options to a Python script launched with accelerate launch. Detailed usage examples for both scenarios can be found in pacman100's comments. The meaning of the two options: ipex_enabled: if this option is set, the IPEX Python package is imported, and optimizations such as Conv+BN folding and weight prepacking are applied at least for inference. ipex_fusion_enabled: beyond the basic optimizations, operator fusion is an important technique for boosting performance; with this option we first trace the model to enable graph-level optimization, and IPEX then provides additional fusion ops specially optimized for Intel platforms.
- Model and optimizer wrapper for distributed training.
model, optim, data = accelerator.prepare(model, optim, data)
Accelerator.prepare() is the main method where most of the magic happens, and IPEX can likewise hide its optimizations behind a similar front-end API called ipex.optimize(). If we choose to use IPEX, we can automatically invoke IPEX's API inside prepare(). If the current workload is training, we still use accelerate's original prepare API, i.e. model, optim, data = accelerator.prepare(model, optim, data), so there is no code change for training. But for inference, if we want to benefit from IPEX optimizations beyond operator fusion, such as weight prepacking, we must explicitly pass the model to the prepare method:
data = accelerator.prepare(data) # Original evaluation
model, data = accelerator.prepare(model, data) # Explicitly pass the model to prepare() to benefit from IPEX's broader optimizations.
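As a concrete illustration, an end-to-end evaluation loop under this proposal could look like the sketch below; model and eval_dataloader are placeholders, and use_ipex is the constructor option proposed in this RFC:
import torch
from accelerate import Accelerator

# cpu=True already exists; use_ipex=True is the option proposed in this RFC
accelerator = Accelerator(cpu=True, use_ipex=True)
model, eval_dataloader = accelerator.prepare(model, eval_dataloader)

model.eval()
with torch.no_grad():
    for batch in eval_dataloader:
        outputs = model(**batch)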
- Auto Mixed Precision: once we import IPEX, all the bf16 operators supported by IPEX are registered, so when auto mixed precision is used, a model optimized by IPEX and running under an AMP context also benefits from IPEX's optimizations. There is basically no user-interface change here either.
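A minimal sketch of how bf16 autocast could pair with IPEX on CPU, assuming model and batch already exist (shown outside of accelerate for clarity):
import torch
import intel_extension_for_pytorch as ipex

# Apply IPEX optimizations with bf16 weights/kernels
model = ipex.optimize(model.eval(), dtype=torch.bfloat16)

# Run under CPU autocast so the bf16 operators registered by IPEX are used
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    outputs = model(batch)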
Implementation
- Instantiate Accelerator: as stated above, we first need to modify the Accelerator class's constructor as follows:
class Accelerator:
    def __init__(
        self,
        device_placement: bool = True,
        split_batches: bool = False,
        fp16: bool = None,
        mixed_precision: Union[PrecisionType, str] = None,
        gradient_accumulation_steps: int = 1,
        cpu: bool = False,
        deepspeed_plugin: DeepSpeedPlugin = None,
        fsdp_plugin: FullyShardedDataParallelPlugin = None,
        rng_types: Optional[List[Union[str, RNGType]]] = None,
        log_with: Optional[List[Union[str, LoggerType, GeneralTracker]]] = None,
        logging_dir: Optional[Union[str, os.PathLike]] = None,
        dispatch_batches: Optional[bool] = None,
        step_scheduler_with_optimizer: bool = True,
        kwargs_handlers: Optional[List[KwargsHandler]] = None,
        use_ipex: bool = False,
        do_fusion: bool = False,
    ):
At this stage, accelerate analyzes the current distributed environment, and only if the environment is MULTI_CPU do we apply IPEX. If use_ipex is passed as True, we need to check whether IPEX is available on the current platform and, if it is, import it. The do_fusion option is kept as object state for later use.
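A minimal sketch of this check, using a hypothetical is_ipex_available() helper and a trimmed-down stand-in constructor (the real implementation in the draft PR may differ):
import importlib.util

def is_ipex_available():
    # Hypothetical helper: IPEX counts as available if its package can be found
    return importlib.util.find_spec("intel_extension_for_pytorch") is not None

class AcceleratorSketch:
    # Hypothetical, trimmed-down stand-in for the proposed Accelerator changes
    def __init__(self, use_ipex: bool = False, do_fusion: bool = False):
        if use_ipex and not is_ipex_available():
            raise ImportError("use_ipex=True requires intel_extension_for_pytorch to be installed.")
        if use_ipex:
            import intel_extension_for_pytorch as ipex  # noqa: F401
        self.use_ipex = use_ipex
        # do_fusion is kept as object state for later use in prepare()
        self.do_fusion = do_fusion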
- Prepare model and optimizer: at this stage, we distinguish whether we are doing training or inference by whether an optimizer is passed to Accelerator.prepare(), for example:
model, optim, data = accelerator.prepare(model, optim, data) # For training
model, data = accelerator.prepare(model, data) # For inference
If the current workload is inference and do_fusion was set to True in the constructor, we first use torch.jit.trace() to trace the model (importing IPEX registers many fusion patterns optimized for CPU) and then apply IPEX's additional optimizations via ipex.optimize(). If do_fusion is not specified, we directly apply model = ipex.optimize(model) on the passed-in model. If the current workload is training, no graph-level optimization is applied regardless of do_fusion, because IPEX does not yet support graph optimization for training graphs (support is planned). In that case we simply apply IPEX's optimizer optimization in addition to the model optimization, e.g. model, optimizer = ipex.optimize(model, optimizer=optimizer). If mixed_precision = bf16 is specified, we also need to pass dtype=torch.bfloat16 to the ipex.optimize() calls above to enable the full IPEX bf16 optimizations:
model = ipex.optimize(model, dtype=torch.bfloat16) # For inference
model, optimizer = ipex.optimize(model, dtype=torch.bfloat16, optimizer=optimizer) # For training
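For the do_fusion inference path described above, a minimal sketch could look like the following; model and example_batch are placeholders, and the exact tracing setup in the final implementation may differ:
import torch
import intel_extension_for_pytorch as ipex

model.eval()
# Basic IPEX optimizations (e.g. weight prepacking, Conv+BN folding)
model = ipex.optimize(model, dtype=torch.bfloat16)

# Graph-level optimization: tracing lets the CPU fusion patterns registered by IPEX apply
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    traced_model = torch.jit.trace(model, example_batch, strict=False)
    traced_model = torch.jit.freeze(traced_model)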
Top GitHub Comments
@pacman100 Oh, I see. Thank you very much for the work. I will try it out and then give you feedback.
Hello @tangleintel, I have already implemented the above feature request in the draft PR #701. Please test it out and let us know if it works as expected, or refine the draft PR using it as a starting point.