Flip-ratio does not work on Multi GPU
Describe the bug
Trying to use the flip_ratio metric with multi-GPU training (tf.distribute.MirroredStrategy) gives me an error.
To Reproduce
import contextlib
import numpy as np
import tensorflow.keras as keras
import larq as lq
import tensorflow as tf


def get_model():
    model = keras.Sequential()
    model.add(keras.layers.Flatten(input_shape=(28, 28)))
    model.add(
        lq.layers.QuantDense(
            10,
            input_quantizer="ste_sign",
            kernel_quantizer="ste_sign",
            kernel_constraint="weight_clip",
        )
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(), loss="sparse_categorical_crossentropy"
    )
    return model


def attempt_fit_with_metric(metrics=[], distributed_training=False):
    fashion_mnist = keras.datasets.fashion_mnist
    (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
    train_images = train_images / 255.0
    test_images = test_images / 255.0
    # `strategy` is the module-level MirroredStrategy created in the __main__ block below.
    with strategy.scope() if distributed_training else contextlib.nullcontext():
        with lq.metrics.scope(metrics):
            model = get_model()
            model.fit(train_images, train_labels, epochs=1)


if __name__ == "__main__":
    strategy = tf.distribute.MirroredStrategy()
    for distributed_training in [False, True]:
        for metrics in [[], ["flip_ratio"]]:
            print("distributed training: ", distributed_training)
            print("metric:", metrics)
            try:
                attempt_fit_with_metric(metrics, distributed_training)
                print(
                    "Successfully fitted model with metric ",
                    metrics,
                    ", and distributed training = ",
                    distributed_training,
                )
            except Exception as e:
                print("Exception raised: \n", e)
            print()
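For reference, here is a further narrowed sketch of just the failing combination (MirroredStrategy plus flip_ratio), using random stand-in data instead of Fashion-MNIST. This is my own reduction, on the assumption that the dataset is irrelevant to the failure; I would expect it to hit the same error as the full script.

import numpy as np
import tensorflow as tf
import larq as lq

strategy = tf.distribute.MirroredStrategy()
# Random stand-in data with the same shape as Fashion-MNIST.
x = np.random.rand(256, 28, 28).astype("float32")
y = np.random.randint(0, 10, size=(256,))

with strategy.scope():
    with lq.metrics.scope(["flip_ratio"]):
        model = tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            lq.layers.QuantDense(
                10,
                input_quantizer="ste_sign",
                kernel_quantizer="ste_sign",
                kernel_constraint="weight_clip",
            ),
        ])
        model.compile(
            optimizer=tf.keras.optimizers.Adam(),
            loss="sparse_categorical_crossentropy",
        )

model.fit(x, y, epochs=1)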
Expected behavior
I would expect the flip_ratio metric to work on both single- and multi-GPU training. Instead, the script above succeeds for every combination except multi GPU with flip_ratio, which raises the error shown in the output below:
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
distributed training: False
metric: []
Train on 60000 samples
60000/60000 [==============================] - 3s 56us/sample - loss: 11.0572
Successfully fitted model with metric [] , and distributed training = False
distributed training: False
metric: ['flip_ratio']
Train on 60000 samples
60000/60000 [==============================] - 4s 64us/sample - loss: 12.3813 - flip_ratio/quant_dense_1: 5.9605e-08
Successfully fitted model with metric ['flip_ratio'] , and distributed training = False
distributed training: True
metric: []
Train on 60000 samples
INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
60000/60000 [==============================] - 9s 156us/sample - loss: 11.0572
Successfully fitted model with metric [] , and distributed training = True
distributed training: True
metric: ['flip_ratio']
Train on 60000 samples
WARNING:tensorflow:Gradients do not exist for variables ['quant_dense_3/kernel:0'] when minimizing the loss.
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
   32/60000 [..............................] - ETA: 1:44:14
Exception raised:
An op outside of the function building code is being passed a "Graph" tensor. It is possible to have Graph tensors leak out of the function building context by including a tf.init_scope in your function building code.
For example, the following function will fail:
  @tf.function
  def has_init_scope():
    my_constant = tf.constant(1.)
    with tf.init_scope():
      added = my_constant * 2
The graph tensor has name: replica_2/sequential_3/quant_dense_3/IdentityN_1:0
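The graph tensor named in the message (replica_2/sequential_3/quant_dense_3/IdentityN_1:0) appears to be the quantized kernel produced inside one replica, which suggests the flip_ratio metric captures a per-replica tensor outside the replicated training function. Until the metric supports MirroredStrategy, a possible stop-gap (my own workaround sketch, not an official fix) is to drop the larq metrics whenever distributed training is enabled. The hypothetical variant below reuses get_model and strategy from the script above:

import contextlib

def attempt_fit_skipping_larq_metrics_when_distributed(metrics=(), distributed_training=False):
    # Hypothetical workaround: flip_ratio is a debugging/analysis metric, so
    # skip it under MirroredStrategy and keep it for single-GPU runs only.
    metrics = [] if distributed_training else list(metrics)
    (train_images, train_labels), _ = keras.datasets.fashion_mnist.load_data()
    train_images = train_images / 255.0
    with strategy.scope() if distributed_training else contextlib.nullcontext():
        with lq.metrics.scope(metrics):
            model = get_model()
            model.fit(train_images, train_labels, epochs=1)

This only sidesteps the crash; it does not make flip_ratio available for the multi-GPU runs.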
Environment
TensorFlow version: 2.1.0
Larq version: 0.8.2
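For completeness, the versions listed above can be confirmed at runtime with a snippet along these lines (pkg_resources from setuptools is assumed to be available):

import pkg_resources
import tensorflow as tf

print("TensorFlow version:", tf.__version__)  # 2.1.0 in this report
print("Larq version:", pkg_resources.get_distribution("larq").version)  # 0.8.2 in this report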
Top GitHub Comments
It works and the warnings are gone 🎉
Remind!