[RFC] Integrate Intel Extension for PyTorch into accelerate to get out-of-the-box optimizations on Intel platforms.
Motivation
Intel Extension for PyTorch (a.k.a. IPEX) provides extra optimizations and performance boosts on Intel hardware (currently CPU only) for both inference and training. These optimizations include graph-level optimizations such as operator fusion, auto mixed precision with a rich set of bf16 operators, and optimizer optimizations that speed up training. Compared with the Trainer, accelerate is mostly used for distributed training and inference on transformer models, but it can also benefit from IPEX's optimizations. Integrating IPEX into accelerate therefore gives users who do distributed training or evaluation an out-of-the-box performance boost on CPU.
Design
User interface
- The first thing is how we tell accelerate to enable IPEX. We can use the CLI tool accelerate config to configure our training and inference environment with the IPEX feature enabled; the tool asks a series of questions, including IPEX-related ones such as ipex_enabled and ipex_fusion_enabled. We can also pass these two options to a Python script launched with accelerate launch. Detailed usage examples for both scenarios can be found in pacman100's comments. The meaning of the two options: ipex_enabled: if this option is set, the IPEX Python package is imported, and optimizations such as Conv+BN folding and weight prepacking are applied at least for inference. ipex_fusion_enabled: beyond the basic optimizations, operator fusion is an important technique for boosting performance; with this option we first trace the model to enable graph-level optimization, and IPEX then provides additional fusion ops specially optimized for Intel platforms.
- Model and optimizer wrapper for distributed training.
model, optim, data = accelerator.prepare(model, optim, data)
Accelerator.prepare() is the main method where most of the magic happens, and IPEX can likewise hide its optimizations behind a similar front-end API called ipex.optimize(). If we choose to use IPEX, we can automatically invoke IPEX's API inside prepare(). If the current workload is training, we still use accelerate's original prepare API, i.e. model, optim, data = accelerator.prepare(model, optim, data), so there is no code change for training. But for inference, if we want to benefit from IPEX optimizations beyond operator fusion, such as weight prepacking, we must explicitly pass the model to the prepare method:
data = accelerator.prepare(data) # Original evaluation
model, data = accelerator.prepare(model, data) # Explicitly pass the model to prepare() to benefit from IPEX's broader optimizations.
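As a concrete illustration, an end-to-end evaluation loop under this proposal could look like the sketch below; model and eval_dataloader are placeholders, and use_ipex is the constructor option proposed in this RFC:
import torch
from accelerate import Accelerator

# cpu=True already exists; use_ipex=True is the option proposed in this RFC
accelerator = Accelerator(cpu=True, use_ipex=True)
model, eval_dataloader = accelerator.prepare(model, eval_dataloader)

model.eval()
with torch.no_grad():
    for batch in eval_dataloader:
        outputs = model(**batch)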
- Auto Mixed Precision: once we import IPEX, all the bf16 operators supported by IPEX are registered, so when auto mixed precision is used, a model optimized by IPEX and running under an AMP context also benefits from IPEX's optimizations. There is basically no user-interface change here either.
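A minimal sketch of how bf16 autocast could pair with IPEX on CPU, assuming model and batch already exist (shown outside of accelerate for clarity):
import torch
import intel_extension_for_pytorch as ipex

# Apply IPEX optimizations with bf16 weights/kernels
model = ipex.optimize(model.eval(), dtype=torch.bfloat16)

# Run under CPU autocast so the bf16 operators registered by IPEX are used
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    outputs = model(batch)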
Implementation
- Instantiate Accelerator: as stated above, we first need to modify the Accelerator class's constructor as follows:
class Accelerator:
    def __init__(
        self,
        device_placement: bool = True,
        split_batches: bool = False,
        fp16: bool = None,
        mixed_precision: Union[PrecisionType, str] = None,
        gradient_accumulation_steps: int = 1,
        cpu: bool = False,
        deepspeed_plugin: DeepSpeedPlugin = None,
        fsdp_plugin: FullyShardedDataParallelPlugin = None,
        rng_types: Optional[List[Union[str, RNGType]]] = None,
        log_with: Optional[List[Union[str, LoggerType, GeneralTracker]]] = None,
        logging_dir: Optional[Union[str, os.PathLike]] = None,
        dispatch_batches: Optional[bool] = None,
        step_scheduler_with_optimizer: bool = True,
        kwargs_handlers: Optional[List[KwargsHandler]] = None,
        use_ipex: bool = False,
        do_fusion: bool = False,
    ):
At this stage, accelerate analyzes the current distributed environment, and only if the environment is MULTI_CPU do we apply IPEX. If use_ipex is passed as True, we need to check whether IPEX is available on the current platform and, if it is, import it. The do_fusion option is kept as object state for later use.
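A minimal sketch of this check, using a hypothetical is_ipex_available() helper and a trimmed-down stand-in constructor (the real implementation in the draft PR may differ):
import importlib.util

def is_ipex_available():
    # Hypothetical helper: IPEX counts as available if its package can be found
    return importlib.util.find_spec("intel_extension_for_pytorch") is not None

class AcceleratorSketch:
    # Hypothetical, trimmed-down stand-in for the proposed Accelerator changes
    def __init__(self, use_ipex: bool = False, do_fusion: bool = False):
        if use_ipex and not is_ipex_available():
            raise ImportError("use_ipex=True requires intel_extension_for_pytorch to be installed.")
        if use_ipex:
            import intel_extension_for_pytorch as ipex  # noqa: F401
        self.use_ipex = use_ipex
        # do_fusion is kept as object state for later use in prepare()
        self.do_fusion = do_fusion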
- Prepare model and optimizer: at this stage, we distinguish whether we are doing training or inference by whether an optimizer is passed to Accelerator.prepare(), for example:
model, optim, data = accelerator.prepare(model, optim, data) # For training
model, data = accelerator.prepare(model, data) # For inference
If the current workload is inference and do_fusion was set to True in the constructor, we first use torch.jit.trace() to trace the model (importing IPEX registers many fusion patterns optimized for CPU) and then apply IPEX's additional optimizations via ipex.optimize(). If do_fusion is not specified, we directly apply model = ipex.optimize(model) on the passed-in model. If the current workload is training, no graph-level optimization is applied regardless of do_fusion, because IPEX does not yet support graph optimization for training graphs (support is planned). In that case we simply apply IPEX's optimizer optimization in addition to the model optimization, e.g. model, optimizer = ipex.optimize(model, optimizer=optimizer). If mixed_precision = bf16 is specified, we also need to pass dtype=torch.bfloat16 to the ipex.optimize() calls above to enable the full IPEX bf16 optimizations:
model = ipex.optimize(model, dtype=torch.bfloat16) # For inference
model, optimizer = ipex.optimize(model, dtype=torch.bfloat16, optimizer=optimizer) # For training
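For the do_fusion inference path described above, a minimal sketch could look like the following; model and example_batch are placeholders, and the exact tracing setup in the final implementation may differ:
import torch
import intel_extension_for_pytorch as ipex

model.eval()
# Basic IPEX optimizations (e.g. weight prepacking, Conv+BN folding)
model = ipex.optimize(model, dtype=torch.bfloat16)

# Graph-level optimization: tracing lets the CPU fusion patterns registered by IPEX apply
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    traced_model = torch.jit.trace(model, example_batch, strict=False)
    traced_model = torch.jit.freeze(traced_model)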
Top GitHub Comments
@pacman100 Oh, I see. Thank you very much for the work. I will try it out and then give you feedback.
Hello @tangleintel, I have already implemented the above feature request in the draft PR #701. Please test it out and let us know if it works as expected, or refine the draft PR using it as a starting point.