
Deep Nets performance tuning


I am considering using this library for a research project and would like to get the best performance (inference latency) possible before I (hopefully) manage to run a large model (e.g., ResNet-152) on ImageNet.

I implemented a working, simplified version of bench_conv2d_sigmoid.py that also runs the same computations on vanilla TensorFlow.

Code

import time
from functools import reduce

import tensorflow as tf
import tf_encrypted as tfe

# All five players run locally on one machine.
config = tfe.LocalConfig([
    'server0',
    'server1',
    'crypto-producer',
    'weights-provider',
    'prediction-client'
])

tfe.set_config(config)
# tfe.set_protocol(tfe.protocol.Pond())
tfe.set_protocol(tfe.protocol.SecureNN())

input_shape1 = [1, 3, 32, 32]    # NCHW: small input
input_shape2 = [1, 3, 192, 192]  # NCHW: large input
conv1_fshape = [3, 3, 3, 32]     # 3x3 kernel, 3 in-channels, 32 out-channels

def provide_input_conv1weights() -> tf.Tensor:
    # Weights contributed privately by the 'weights-provider' player.
    w = tf.random_normal(shape=conv1_fshape, dtype=tf.float32)
    return tf.Print(w, [w], message="w1:")

def provide_input_prediction() -> tf.Tensor:
    # Small input contributed privately by the 'prediction-client' player.
    x = tf.random_normal(shape=input_shape1, dtype=tf.float32)
    return tf.Print(x, [x], message="x:")

def provide_input_prediction2() -> tf.Tensor:
    # Large input contributed privately by the 'prediction-client' player.
    x = tf.random_normal(shape=input_shape2, dtype=tf.float32)
    return tf.Print(x, [x], message="x:")

def receive_output(tensor: tf.Tensor) -> tf.Operation:
    # Reveals the decrypted prediction to the output receiver.
    return tf.Print(tensor, [tensor, tf.shape(tensor)], message="output:")

def tfe_inference(x):
    # conv -> relu -> maxpool -> flatten -> dense, all on encrypted tensors
    conv1 = tfe.layers.Conv2D(x.shape.as_list(), conv1_fshape, 1, "SAME")
    initial_w_conv1 = tfe.define_private_input('weights-provider',
                                               provide_input_conv1weights)
    conv1.initialize(initial_w_conv1)
    x = conv1.forward(x)

    relu1 = tfe.layers.activation.Relu(x.shape.as_list())
    x = relu1.forward(x)

    pool1 = tfe.layers.pooling.MaxPooling2D(x.shape.as_list(), pool_size=2,
                                            strides=2, padding='SAME')
    x = pool1.forward(x)

    # flatten before the dense layer
    shape = reduce(lambda a, b: a * b, x.shape.as_list())
    x = x.reshape([shape, -1])

    dense = tfe.layers.Dense(x.shape, 2)
    dense.initialize()
    x = dense.forward(x)

    return x

def tf_inference(x):
    # Same network in vanilla TensorFlow, for comparison.
    x = tf.nn.conv2d(x, provide_input_conv1weights(),
                     strides=[1, 1, 1, 1], padding="SAME")
    x = tf.nn.relu(x)
    x = tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1],
                       padding='SAME')

    # flatten before the dense layer
    shape1 = reduce(lambda a, b: a * b, x.shape.as_list())
    x = tf.reshape(x, shape=[shape1, -1])
    return tf.layers.dense(x, 2)


for input_func, m in zip([provide_input_prediction,
                          provide_input_prediction2],
                         ["small input", "large input"]):
    print(m)

    # encrypted inference with tf-encrypted
    with tf.Graph().as_default():
        print("TFE")
        x = tfe.define_private_input('prediction-client', input_func)
        y = tfe_inference(x)
        prediction_op = tfe.define_output('prediction-client', [y],
                                          receive_output)

        with tfe.Session(config=config) as sess:
            print("Initialize tensors")
            sess.run(tf.global_variables_initializer(), tag='init')

            print("Predict")
            for i in range(3):
                t = time.time()
                sess.run(prediction_op, tag='prediction')
                print("Inference time: %g" % (time.time() - t))

    # plaintext inference with vanilla TensorFlow
    with tf.Graph().as_default():
        print("TF")
        x = input_func()
        x = tf.transpose(x, (0, 2, 3, 1))  # NCHW -> NHWC

        prediction_op = tf_inference(x)

        with tf.Session() as sess:
            print("Initialize tensors")
            sess.run(tf.global_variables_initializer())

            print("Predict")
            for i in range(3):
                t = time.time()
                sess.run(prediction_op)
                print("Inference time: %g" % (time.time() - t))

I ran some tests with different options; here are the performance results I get:

Input size   Implementation   Prediction #1   Prediction #2   Prediction #3
3x32x32      tf-encrypted     44.4 s          2.8 s           2.8 s
3x32x32      tf               0.01 s          0.001 s         0.001 s
3x192x192    tf-encrypted     1146.9 s        138.1 s         136.7 s
3x192x192    tf               0.03 s          0.02 s          0.02 s

There are two problems:

  1. The inference latency of tf-encrypted (for the second and third predictions) is roughly three to four orders of magnitude higher than vanilla TensorFlow, and closer to five for the first, warm-up run (see the quick check after this list).
  2. tf-encrypted scales very poorly with increasing input size.
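
To make the gap concrete, here is a quick check of the slowdown factors implied by the table above (the figures come directly from the table; the reading in orders of magnitude is mine):

# Slowdown of tf-encrypted relative to vanilla TF, from the table above
print(2.8 / 0.001)    # small input, steady state:  ~2800x  (~3.4 orders of magnitude)
print(136.7 / 0.02)   # large input, steady state:  ~6835x  (~3.8 orders of magnitude)
print(1146.9 / 0.03)  # large input, warm-up run:  ~38230x  (~4.6 orders of magnitude)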

Of course some overhead compared to vanilla TF is expected, but this difference is too big to even hope for reasonable performance on a large network. I would really appreciate any ideas or suggestions for using your library efficiently.

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 3
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

3 reactions
mortendahl commented, Dec 4, 2018

Hi @gdamaskinos,

Amazing to hear you’re considering tf-encrypted! Let’s see if we can’t get those numbers down: computing on encrypted data is inherently more expensive (currently by at least two orders of magnitude), but I believe what you’re seeing can be improved.

I’ll continue to investigate your model and get back to you next week, but in the meantime here are a few things to try:

  • First, make sure you’re using the latest version of tf-encrypted, as we very recently fixed a bug around randomness sampling that made everything significantly slower. If you simply pull and use the master branch you should be good (run pip install -e . to install it).

  • Then make sure you’re using a version of TensorFlow that supports int64 operations, which can be verified via tfe.config.tensorflow_supports_int64(). You can get a custom version here until the TensorFlow patch becomes part of the official release.
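
For convenience, that check can be run as a tiny script; only tfe.config.tensorflow_supports_int64() comes from the comment above, the messages are illustrative:

import tf_encrypted as tfe

# SecureNN relies heavily on int64 arithmetic, so native support matters.
if tfe.config.tensorflow_supports_int64():
    print("TensorFlow build supports int64 ops")
else:
    print("no int64 support; consider the custom TensorFlow build mentioned above")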

With these, a quick test shows ~50 s for the large input on my laptop. Note, however, that using LocalConfig might not give an accurate idea of the performance you would get from running in an actual cluster; let me get back to you on how to test this in a cluster setting.
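
In the meantime, here is a rough sketch of what a remote setup might look like; it assumes tfe.RemoteConfig accepts a mapping from player names to host:port endpoints, which may not match the exact API of the tf-encrypted version you have, and all hostnames below are placeholders:

import tf_encrypted as tfe

# Hypothetical cluster configuration: one endpoint per player,
# replacing the LocalConfig used in the benchmark script above.
config = tfe.RemoteConfig({
    'server0':           'host0.example.com:4440',
    'server1':           'host1.example.com:4440',
    'crypto-producer':   'host2.example.com:4440',
    'weights-provider':  'host3.example.com:4440',
    'prediction-client': 'host4.example.com:4440',
})

tfe.set_config(config)
tfe.set_protocol(tfe.protocol.SecureNN())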

There have been some prior experiments with larger networks (a variant of VGG16) that @yanndupis might be able to give a quick outline of, including any optimizations used.

2 reactions
yanndupis commented, Dec 20, 2018

Hello @gdamaskinos,

Thank you very much for your great analysis!

We benchmarked your model with our internal tools on Google Cloud Platform, with 36 CPUs on each instance. With int64, we are getting lower runtimes than in the table above, more in line with the local performance you are getting.

  • Small input (3x32x32), tfe-remote with int64 ops: [screenshot of benchmark results]
  • Large input (3x192x192), tfe-remote with int64 ops: [screenshot of benchmark results]

We are observing these differences potentially because we are not using the same machines or number of CPUs, or because there is more latency in your network.

In terms of next steps we are planning to:

  • Release docs and tooling after the holidays to help external contributors benchmark models with tf-encrypted.
  • Get these numbers down further. By inspecting the TensorFlow graph for your model, we have identified several potential optimizations: some operations could be better parallelized, memory usage could be improved, etc. We have listed the issues in #364.
  • As mentioned in my earlier message, for a variant of VGG16 we were getting a runtime of 1 min 4 s with the SecureNN protocol. We also benchmarked this model with the Pond protocol, using AveragePooling instead of MaxPooling and an approximated ReLU; the runtime was 4 seconds. However, the accuracy wasn’t good, because very large models accumulate too much error with the approximated ReLU. We are currently investigating different approximation strategies for activation functions to get good accuracy and fast runtimes with Pond. That would be another way to get very good performance in a privacy-preserving setting by slightly adapting the models (see the sketch after this list).
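
For illustration, here is a minimal sketch of that Pond variant, adapted from the benchmark code in the issue above. It assumes tfe.layers.pooling.AveragePooling2D exists with the same signature as MaxPooling2D and that the Relu layer falls back to Pond’s polynomial approximation; check both against the tf-encrypted version you are using:

import tf_encrypted as tfe

tfe.set_protocol(tfe.protocol.Pond())  # instead of SecureNN

def tfe_inference_pond(x):
    # First block of tfe_inference, with AveragePooling (assumed class name)
    # in place of MaxPooling; under Pond, Relu is a polynomial approximation.
    conv1 = tfe.layers.Conv2D(x.shape.as_list(), conv1_fshape, 1, "SAME")
    conv1.initialize(tfe.define_private_input('weights-provider',
                                              provide_input_conv1weights))
    x = conv1.forward(x)

    relu1 = tfe.layers.activation.Relu(x.shape.as_list())
    x = relu1.forward(x)

    pool1 = tfe.layers.pooling.AveragePooling2D(x.shape.as_list(), pool_size=2,
                                                strides=2, padding='SAME')
    return pool1.forward(x)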
