Flip-ratio does not work on Multi GPU
Describe the bug
Trying to use the flip_ratio metric with multi-GPU training (tf.distribute.MirroredStrategy) gives me an error.
To Reproduce
import contextlib
import numpy as np
import tensorflow.keras as keras
import larq as lq
import tensorflow as tf


def get_model():
    model = keras.Sequential()
    model.add(keras.layers.Flatten(input_shape=(28, 28)))
    model.add(
        lq.layers.QuantDense(
            10,
            input_quantizer="ste_sign",
            kernel_quantizer="ste_sign",
            kernel_constraint="weight_clip",
        )
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(), loss="sparse_categorical_crossentropy"
    )
    return model


def attempt_fit_with_metric(metrics=[], distributed_training=False):
    fashion_mnist = keras.datasets.fashion_mnist
    (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
    train_images = train_images / 255.0
    test_images = test_images / 255.0
    # `strategy` is the module-level MirroredStrategy created in the __main__ block below.
    with strategy.scope() if distributed_training else contextlib.nullcontext():
        with lq.metrics.scope(metrics):
            model = get_model()
            model.fit(train_images, train_labels, epochs=1)


if __name__ == "__main__":
    strategy = tf.distribute.MirroredStrategy()
    for distributed_training in [False, True]:
        for metrics in [[], ["flip_ratio"]]:
            print("distributed training: ", distributed_training)
            print("metric:", metrics)
            try:
                attempt_fit_with_metric(metrics, distributed_training)
                print(
                    "Successfully fitted model with metric ",
                    metrics,
                    ", and distributed training = ",
                    distributed_training,
                )
            except Exception as e:
                print("Exception raised: \n", e)
            print()
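For reference, here is a further narrowed sketch of just the failing combination (MirroredStrategy plus flip_ratio), using random stand-in data instead of Fashion-MNIST. This is my own reduction, on the assumption that the dataset is irrelevant to the failure; I would expect it to hit the same error as the full script.

import numpy as np
import tensorflow as tf
import larq as lq

strategy = tf.distribute.MirroredStrategy()
# Random stand-in data with the same shape as Fashion-MNIST.
x = np.random.rand(256, 28, 28).astype("float32")
y = np.random.randint(0, 10, size=(256,))

with strategy.scope():
    with lq.metrics.scope(["flip_ratio"]):
        model = tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            lq.layers.QuantDense(
                10,
                input_quantizer="ste_sign",
                kernel_quantizer="ste_sign",
                kernel_constraint="weight_clip",
            ),
        ])
        model.compile(
            optimizer=tf.keras.optimizers.Adam(),
            loss="sparse_categorical_crossentropy",
        )

model.fit(x, y, epochs=1)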
Expected behavior
I would expect the flip_ratio metric to work on both single- and multi-GPU training. Instead, the script above succeeds for every combination except multi GPU with flip_ratio, which raises the error shown in the output below:
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
distributed training: False
metric: []
Train on 60000 samples
60000/60000 [==============================] - 3s 56us/sample - loss: 11.0572
Successfully fitted model with metric [] , and distributed training = False
distributed training: False
metric: ['flip_ratio']
Train on 60000 samples
60000/60000 [==============================] - 4s 64us/sample - loss: 12.3813 - flip_ratio/quant_dense_1: 5.9605e-08
Successfully fitted model with metric ['flip_ratio'] , and distributed training = False
distributed training: True
metric: []
Train on 60000 samples
INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
60000/60000 [==============================] - 9s 156us/sample - loss: 11.0572
Successfully fitted model with metric [] , and distributed training = True
distributed training: True
metric: ['flip_ratio']
Train on 60000 samples
WARNING:tensorflow:Gradients do not exist for variables ['quant_dense_3/kernel:0'] when minimizing the loss.
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
   32/60000 [..............................] - ETA: 1:44:14
Exception raised:
An op outside of the function building code is being passed a "Graph" tensor. It is possible to have Graph tensors leak out of the function building context by including a tf.init_scope in your function building code.
For example, the following function will fail:
  @tf.function
  def has_init_scope():
    my_constant = tf.constant(1.)
    with tf.init_scope():
      added = my_constant * 2
The graph tensor has name: replica_2/sequential_3/quant_dense_3/IdentityN_1:0
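The graph tensor named in the message (replica_2/sequential_3/quant_dense_3/IdentityN_1:0) appears to be the quantized kernel produced inside one replica, which suggests the flip_ratio metric captures a per-replica tensor outside the replicated training function. Until the metric supports MirroredStrategy, a possible stop-gap (my own workaround sketch, not an official fix) is to drop the larq metrics whenever distributed training is enabled. The hypothetical variant below reuses get_model and strategy from the script above:

import contextlib

def attempt_fit_skipping_larq_metrics_when_distributed(metrics=(), distributed_training=False):
    # Hypothetical workaround: flip_ratio is a debugging/analysis metric, so
    # skip it under MirroredStrategy and keep it for single-GPU runs only.
    metrics = [] if distributed_training else list(metrics)
    (train_images, train_labels), _ = keras.datasets.fashion_mnist.load_data()
    train_images = train_images / 255.0
    with strategy.scope() if distributed_training else contextlib.nullcontext():
        with lq.metrics.scope(metrics):
            model = get_model()
            model.fit(train_images, train_labels, epochs=1)

This only sidesteps the crash; it does not make flip_ratio available for the multi-GPU runs.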
Environment
TensorFlow version: 2.1.0
Larq version: 0.8.2
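For completeness, the versions listed above can be confirmed at runtime with a snippet along these lines (pkg_resources from setuptools is assumed to be available):

import pkg_resources
import tensorflow as tf

print("TensorFlow version:", tf.__version__)  # 2.1.0 in this report
print("Larq version:", pkg_resources.get_distribution("larq").version)  # 0.8.2 in this report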
Top GitHub Comments
It works and the warnings are gone 🎉
Remind!