[RFC][Graph Tuner] Graph level auto-tuning
Motivation
Currently we can tune operators with handcrafted schedules or AutoTVM, which can give us decent kernel performance. However, these kernel templates usually involve transforming operator data layouts into other formats, as conv2d does for Intel and ARM CPUs. In this case, a lot of layout transformations can be introduced into the graph. The graph tuner considers both fast kernel schedules and the extra layout transformations, generating decent end-to-end performance.
Design
There are two steps to achieve graph-level optimal schedules. First, get a set of schedule candidates for each workload in the graph; this step can be finished by AutoTVM. Second, feed the schedule candidates into the graph tuner and run graph-level tuning.
In the second step, given a set of schedule candidates, the graph tuner benchmarks all possible layout transformations in the graph, and then combines schedule and layout-transformation execution times to find the optimal schedule combination.
The current solution for this optimization problem is to model it as a Markov decision process (MDP) and use dynamic programming to solve it. This gives us a globally optimal solution. However, the time/memory complexity of DP is prohibitively expensive for some networks, such as SSD. In this case, we need to use approximation methods. For now the graph tuner provides a graph coloring algorithm (PBQP) for this scenario.
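To illustrate the DP formulation on the simplest case, a chain-structured graph: the state is the schedule chosen at each node, and the transition cost is the layout-transformation time between consecutive choices. The sketch below is a self-contained toy model with made-up costs, not the actual graph tuner implementation:

```python
# Toy DP over a chain of layers: pick one schedule per layer so that
# total kernel time + layout-transformation time is minimized.
# All cost numbers are illustrative; the real graph tuner measures them.

kernel_cost = [            # kernel_cost[layer][schedule] = kernel time (ms)
    [1.0, 0.6],
    [0.9, 0.5],
    [0.7, 0.8],
]
# transform_cost[layer][prev][next] = layout-transformation time between
# layer and layer+1 (0.0 when the layouts already match).
transform_cost = [
    [[0.0, 0.4], [0.3, 0.0]],
    [[0.0, 0.5], [0.2, 0.0]],
]

def chain_dp(kernel_cost, transform_cost):
    # best[s] = minimal cost of the prefix ending with schedule s
    best = list(kernel_cost[0])
    for layer in range(1, len(kernel_cost)):
        best = [
            kernel_cost[layer][s] + min(
                best[p] + transform_cost[layer - 1][p][s]
                for p in range(len(best))
            )
            for s in range(len(kernel_cost[layer]))
        ]
    return min(best)
```

With these numbers, always picking the locally fastest kernel is suboptimal once transformation costs are counted; the DP finds the globally cheapest combination. The real graph is a DAG rather than a chain, which is where the complexity blow-up (and the need for PBQP) comes from.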
API
We provide a base class and built-in subclasses:
```python
class BaseGraphTuner(object):
    """Class to search schedules considering both kernel execution time and
    layout transformation time.

    Before creating a graph tuner instance, schedule candidates for all kernels in
    the graph should be provided through tensor-level searching.
    """
    ...

class DPTuner(BaseGraphTuner):
    """Tuner which uses dynamic programming to solve the MDP problem.

    Note: currently dynamic programming is used to solve this MDP problem. However,
    this problem is intrinsically non-polynomial. DP can't be applied to more complicated
    models, such as networks with many element-wise sum operators. In this case, a simple
    greedy method greedy_schedule can be used to generate suboptimal schedules.
    """
    ...

class PBQPTuner(BaseGraphTuner):
    """Tuner using a graph coloring algorithm to solve graphs intractable for DP."""
    ...
```
Implementation details
One key part of the graph tuner is generating all possible layout transformations given a set of workloads and schedules. To hide operator-related information from the graph tuner, we can add a new generic function bound to each topi operator, which accepts a workload and a cfg, and returns input/output shapes and layouts. An example for conv2d:
```python
@tvm.target.generic_func
def conv2d_infer_layout(workload, cfg):
    """Infer input/output shapes and layouts given a workload and a config.

    Returns two lists of tuples in the format (input_shape, input_layout)
    and (output_shape, output_layout).
    """
```
Note that although the graph tuner only supports target ops with a single input and output (conv2d, conv2d_transpose, dense, etc.), we make this API generic enough to support operators with multiple inputs/outputs.
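For illustration, here is a minimal, self-contained sketch of what such an infer-layout function might compute for an NCHWc-tiled conv2d. The workload tuple and cfg dictionary encodings below are invented for this example and do not match TVM's actual data structures:

```python
def conv2d_infer_layout(workload, cfg):
    """Toy infer-layout for a hypothetical NCHWc conv2d.

    workload = (batch, in_channel, height, width, out_channel)  # stride 1, 'same' pad
    cfg = {"tile_ic": ..., "tile_oc": ...}  # channel tiling factors
    Returns ([(in_shape, in_layout)], [(out_shape, out_layout)]).
    """
    batch, in_c, h, w, out_c = workload
    ic, oc = cfg["tile_ic"], cfg["tile_oc"]
    # Split the channel dimension by the tiling factor: NCHW -> NCHW{ic}c
    in_shape = (batch, in_c // ic, h, w, ic)
    out_shape = (batch, out_c // oc, h, w, oc)
    return [(in_shape, "NCHW%dc" % ic)], [(out_shape, "NCHW%dc" % oc)]
```

For example, a 1x32x56x56 input with 64 output channels under `tile_ic=8, tile_oc=16` yields input `(1, 4, 56, 56, 8)` in `NCHW8c` and output `(1, 4, 56, 56, 16)` in `NCHW16c`; two configs with different tiling factors thus imply different layouts at the edge between them.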
With this function, the graph tuner only needs a dictionary mapping each topi function name to the corresponding infer-layout function, similar to what the AutoTVM task-extraction function does. After shape and layout info is extracted, we can use a generic graph traversal to fetch all possible layout transformations.
Performance Benchmark
Intel Xeon CPU (AWS c5.9xlarge, 18 physical cores)
| Framework/Mode | resnet50_v1 | vgg19_bn | inceptionv3 | densenet201 | SSD_resnet50_512 |
|---|---|---|---|---|---|
| MXNet MKLDNN 1.2.1 | 12.95 ms | 25.35 ms | 15.08 ms | 34.84 ms | 67.06 ms |
| TVM with default schedules | 8.07 ms | 24.78 ms | 13.88 ms | 18.19 ms | 44.56 ms |
| TVM with history best schedules | 7.54 ms | 24.05 ms | 14.39 ms | 18.35 ms | 40.29 ms |
| TVM with graph tuner | 5.73 ms | 21.98 ms | 10.68 ms | 13.97 ms | 29.05 ms |
More benchmark data to be added.
Top GitHub Comments
Yes, in `nnvm.compiler.build`, we will load the tophub context (which is an empty `ApplyHistoryBest` in your case):
https://github.com/dmlc/tvm/blob/01ec533e7cfb603f6114a763f03dad4c564f589d/nnvm/python/nnvm/compiler/build_module.py#L244-L246
But this is not a problem. Although the current context is `ApplyHistoryBest`, when you query it you cannot find the config for your workload. It will then query its upper context, which is the root fallback context, and return a `FallbackConfigEntity`. You can use `cfg.is_fallback` to check whether it is a fallback. Don't use something like `isinstance(autotvm.task.DispatchContext.current, autotvm.FallbackContext)`.
@tqchen I opened an issue to track x86 AutoTVM related tasks.