[RFC][Graph Tuner] Graph level auto-tuning
Motivation
Currently we can tune operators with handcrafted schedules or AutoTVM, which can give us decent kernel performance. However, these kernel templates usually involve transforming operator data layouts into other formats, as conv2d does for Intel and ARM CPUs. In this case, a lot of layout transformations can be introduced into the graph. The graph tuner considers both fast kernel schedules and the extra layout transformations, generating decent end-to-end performance.
Design
There are two steps to achieve graph-level optimal schedules. First, get a set of schedule candidates for each workload in the graph; this step can be finished by AutoTVM. Second, feed the schedule candidates into the graph tuner and run graph-level tuning.
In the second step, given a set of schedule candidates, the graph tuner benchmarks all possible layout transformations in the graph, and then combines schedule and layout-transformation execution times to find the optimal schedule combination.
The current solution for this optimization problem is to model it as a Markov decision process (MDP) and use dynamic programming to solve it. This gives us a globally optimal solution. However, the time/memory complexity of DP is prohibitively expensive for some networks, such as SSD. In this case, we need to use approximation methods. For now the graph tuner provides a graph coloring algorithm (PBQP) for this scenario.
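To illustrate the DP formulation on the simplest case, a chain-structured graph: the state is the schedule chosen at each node, and the transition cost is the layout-transformation time between consecutive choices. The sketch below is a self-contained toy model with made-up costs, not the actual graph tuner implementation:

```python
# Toy DP over a chain of layers: pick one schedule per layer so that
# total kernel time + layout-transformation time is minimized.
# All cost numbers are illustrative; the real graph tuner measures them.

kernel_cost = [            # kernel_cost[layer][schedule] = kernel time (ms)
    [1.0, 0.6],
    [0.9, 0.5],
    [0.7, 0.8],
]
# transform_cost[layer][prev][next] = layout-transformation time between
# layer and layer+1 (0.0 when the layouts already match).
transform_cost = [
    [[0.0, 0.4], [0.3, 0.0]],
    [[0.0, 0.5], [0.2, 0.0]],
]

def chain_dp(kernel_cost, transform_cost):
    # best[s] = minimal cost of the prefix ending with schedule s
    best = list(kernel_cost[0])
    for layer in range(1, len(kernel_cost)):
        best = [
            kernel_cost[layer][s] + min(
                best[p] + transform_cost[layer - 1][p][s]
                for p in range(len(best))
            )
            for s in range(len(kernel_cost[layer]))
        ]
    return min(best)
```

With these numbers, always picking the locally fastest kernel is suboptimal once transformation costs are counted; the DP finds the globally cheapest combination. The real graph is a DAG rather than a chain, which is where the complexity blow-up (and the need for PBQP) comes from.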
API
We provide a base class and built-in subclasses:
```python
class BaseGraphTuner(object):
    """Class to search schedules considering both kernel execution time and
    layout transformation time.

    Before creating a graph tuner instance, schedule candidates for all kernels in
    the graph should be provided through tensor-level searching.
    """
    ...

class DPTuner(BaseGraphTuner):
    """Tuner which uses dynamic programming to solve the MDP problem.

    Note: currently dynamic programming is used to solve this MDP problem. However,
    this problem is intrinsically non-polynomial. DP can't be applied to more complicated
    models, such as networks with many element-wise sum operators. In this case, a simple
    greedy method greedy_schedule can be used to generate suboptimal schedules.
    """
    ...

class PBQPTuner(BaseGraphTuner):
    """Tuner using a graph coloring algorithm to solve graphs intractable for DP."""
    ...
```
Implementation details
One key part of the graph tuner is generating all possible layout transformations given a set of workloads and schedules. To hide operator-related information from the graph tuner, we can add a new generic function bound to each topi operator, which accepts a workload and a cfg, and returns input/output shapes and layouts. An example for conv2d:
```python
@tvm.target.generic_func
def conv2d_infer_layout(workload, cfg):
    """Infer input/output shapes and layouts given a workload and a config.

    Returns two lists of tuples in the format (input_shape, input_layout)
    and (output_shape, output_layout).
    """
```
Note that although the graph tuner only supports target ops with a single input and output (conv2d, conv2d_transpose, dense, etc.), we make this API generic enough to support operators with multiple inputs/outputs.
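For illustration, here is a minimal, self-contained sketch of what such an infer-layout function might compute for an NCHWc-tiled conv2d. The workload tuple and cfg dictionary encodings below are invented for this example and do not match TVM's actual data structures:

```python
def conv2d_infer_layout(workload, cfg):
    """Toy infer-layout for a hypothetical NCHWc conv2d.

    workload = (batch, in_channel, height, width, out_channel)  # stride 1, 'same' pad
    cfg = {"tile_ic": ..., "tile_oc": ...}  # channel tiling factors
    Returns ([(in_shape, in_layout)], [(out_shape, out_layout)]).
    """
    batch, in_c, h, w, out_c = workload
    ic, oc = cfg["tile_ic"], cfg["tile_oc"]
    # Split the channel dimension by the tiling factor: NCHW -> NCHW{ic}c
    in_shape = (batch, in_c // ic, h, w, ic)
    out_shape = (batch, out_c // oc, h, w, oc)
    return [(in_shape, "NCHW%dc" % ic)], [(out_shape, "NCHW%dc" % oc)]
```

For example, a 1x32x56x56 input with 64 output channels under `tile_ic=8, tile_oc=16` yields input `(1, 4, 56, 56, 8)` in `NCHW8c` and output `(1, 4, 56, 56, 16)` in `NCHW16c`; two configs with different tiling factors thus imply different layouts at the edge between them.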
With this function, the graph tuner only needs a dictionary mapping each topi function name to the corresponding infer-layout function, similar to what the AutoTVM task-extraction function does. After shape and layout info is extracted, we can use a generic graph traversal to fetch all possible layout transformations.
Performance Benchmark
Intel Xeon CPU (AWS c5.9xlarge, 18 physical cores)
| Framework/Mode | resnet50_v1 | vgg19_bn | inceptionv3 | densenet201 | SSD_resnet50_512 |
|---|---|---|---|---|---|
| MXNet MKLDNN 1.2.1 | 12.95 ms | 25.35 ms | 15.08 ms | 34.84 ms | 67.06 ms |
| TVM with default schedules | 8.07 ms | 24.78 ms | 13.88 ms | 18.19 ms | 44.56 ms |
| TVM with history best schedules | 7.54 ms | 24.05 ms | 14.39 ms | 18.35 ms | 40.29 ms |
| TVM with graph tuner | 5.73 ms | 21.98 ms | 10.68 ms | 13.97 ms | 29.05 ms |
More benchmark data to be added.
Top GitHub Comments
Yes, in `nnvm.compiler.build`, we will load the tophub context (which is an empty `ApplyHistoryBest` in your case):
https://github.com/dmlc/tvm/blob/01ec533e7cfb603f6114a763f03dad4c564f589d/nnvm/python/nnvm/compiler/build_module.py#L244-L246
But this is not a problem. Although the current context is `ApplyHistoryBest`, when you query it you cannot find the config for your workload. It will then query its upper context, which is the root fallback context, and return a `FallbackConfigEntity`. You can use `cfg.is_fallback` to check whether it is a fallback. Don't use something like `isinstance(autotvm.task.DispatchContext.current, autotvm.FallbackContext)`.
@tqchen I opened an issue to track x86 AutoTVM related tasks.