Flip-ratio does not work on Multi GPU

Describe the bug

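Using the flip_ratio metric with multi-GPU training (tf.distribute.MirroredStrategy) raises an error. The same metric works when training without a distribution strategy.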
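
For context, my understanding is that flip_ratio reports the fraction of binarized weights that change sign from one training step to the next. Below is a minimal plain-Python sketch of that idea. It is not Larq's implementation: the helpers `sign` and `flip_ratio` are hypothetical names used only for illustration, and mapping zero to +1 is an assumption about the sign quantizer.

```python
def sign(x):
    # Assumption: the quantizer maps 0 to +1, everything negative to -1.
    return 1 if x >= 0 else -1


def flip_ratio(old_weights, new_weights):
    """Fraction of weights whose binarized value differs between two steps."""
    if len(old_weights) != len(new_weights):
        raise ValueError("weight lists must have the same length")
    if not old_weights:
        return 0.0
    flips = sum(sign(a) != sign(b) for a, b in zip(old_weights, new_weights))
    return flips / len(old_weights)


# Latent weights before and after one hypothetical optimizer step.
before = [0.3, -0.1, 0.7, -0.5]
after = [0.2, 0.1, 0.6, -0.4]  # only the second weight crosses zero
ratio = flip_ratio(before, after)
print(ratio)
```

Under this reading, a value close to zero (as in the single-GPU log further down) would mean that almost no weights flipped during the step.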

To Reproduce

import contextlib
import tensorflow.keras as keras
import larq as lq
import tensorflow as tf


def get_model():
    model = keras.Sequential()
    model.add(keras.layers.Flatten(input_shape=(28, 28)))
    model.add(
        lq.layers.QuantDense(
            10,
            input_quantizer="ste_sign",
            kernel_quantizer="ste_sign",
            kernel_constraint="weight_clip",
        )
    )

    model.compile(
        optimizer=tf.keras.optimizers.Adam(), loss="sparse_categorical_crossentropy"
    )
    return model


def attempt_fit_with_metric(metrics=[], distributed_training=False):
    fashion_mnist = keras.datasets.fashion_mnist
    (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
    train_images = train_images / 255.0
    test_images = test_images / 255.0

    with strategy.scope() if distributed_training else contextlib.nullcontext():
        with lq.metrics.scope(metrics):
            model = get_model()
        model.fit(train_images, train_labels, epochs=1)


if __name__ == "__main__":
    strategy = tf.distribute.MirroredStrategy()
    for distributed_training in [False, True]:
        for metrics in [[], ["flip_ratio"]]:
            print("distributed training: ", distributed_training)
            print("metric:", metrics)
            try:
                attempt_fit_with_metric(metrics, distributed_training)
                print(
                    "Successfully fittet model with metric ",
                    metrics,
                    ", and distributed training = ",
                    distributed_training,
                )
            except Exception as e:
                print("Exception raised: \n", e)
            print()

Expected behavior

I would expect the flip ratio metric to work on both single and multi GPU. Instead, the run with distributed training and flip_ratio enabled fails with the error at the end of the output below.

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
distributed training:  False
metric: []
Train on 60000 samples
60000/60000 [==============================] - 3s 56us/sample - loss: 11.0572
Successfully fittet model with metric  [] , and distributed training =  False

distributed training:  False
metric: ['flip_ratio']
Train on 60000 samples
60000/60000 [==============================] - 4s 64us/sample - loss: 12.3813 - flip_ratio/quant_dense_1: 5.9605e-08
Successfully fittet model with metric  ['flip_ratio'] , and distributed training =  False

distributed training:  True
metric: []
Train on 60000 samples
INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
60000/60000 [==============================] - 9s 156us/sample - loss: 11.0572
Successfully fittet model with metric  [] , and distributed training =  True

distributed training:  True
metric: ['flip_ratio']
Train on 60000 samples
WARNING:tensorflow:Gradients do not exist for variables ['quant_dense_3/kernel:0'] when minimizing the loss.
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
   32/60000 [..............................] - ETA: 1:44:14
Exception raised: 
 An op outside of the function building code is being passed
a "Graph" tensor. It is possible to have Graph tensors
leak out of the function building context by including a
tf.init_scope in your function building code.
For example, the following function will fail:
  @tf.function
  def has_init_scope():
    my_constant = tf.constant(1.)
    with tf.init_scope():
      added = my_constant * 2
The graph tensor has name: replica_2/sequential_3/quant_dense_3/IdentityN_1:0

Environment

TensorFlow version: 2.1.0
Larq version: 0.8.2

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

jneeven commented, Feb 14, 2020 (2 reactions)

Remind!

It works and the warnings are gone 🎉

leonoverweel commented, Feb 14, 2020 (0 reactions)

Remind!
