Issue with running on multiple GPUs
Copy-pasting comments from #12:
@muminoff: I cannot run a custom U-Net with multiple GPUs. I followed the distributed training part of the TensorFlow documentation, but no luck. It seems I need to refactor the code and use custom distributed training (namely `strategy.experimental_distribute_dataset`).
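For context, the refactor hinted at here would look roughly like the sketch below: a manual training loop over a dataset wrapped with `strategy.experimental_distribute_dataset`. This is a minimal illustration assuming recent TF 2.x (where `strategy.run` replaced `experimental_run_v2`); the tiny model, batch size, and random data are placeholders, not code from the issue.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Placeholder data and batch size, for illustration only.
GLOBAL_BATCH = 16
x = tf.random.normal((64, 8))
y = tf.random.uniform((64, 1))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(GLOBAL_BATCH)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.Adam()
    # Reduction must be NONE so the per-replica loss can be scaled manually.
    loss_fn = tf.keras.losses.BinaryCrossentropy(
        from_logits=True,
        reduction=tf.keras.losses.Reduction.NONE)

def train_step(inputs):
    features, labels = inputs
    with tf.GradientTape() as tape:
        logits = model(features, training=True)
        per_example_loss = loss_fn(labels, logits)
        # Scale by the global batch size, not the per-replica batch size.
        loss = tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(inputs):
    # Run the step on every replica and sum the per-replica losses.
    per_replica_losses = strategy.run(train_step, args=(inputs,))
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

for batch in dist_dataset:
    distributed_train_step(batch)
```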
@karolzak: Can you share the code you used, your TF/Keras versions, and the error message? That way I might be able to help you out or at least investigate it.
@muminoff: I haven't tried `tf.keras.utils.multi_gpu_model` since it is deprecated, but I did try `tf.distribute.MirroredStrategy()`. Here is my code:
```python
import tensorflow as tf  # missing from the original snippet, needed for tf.distribute

from keras_unet.models import custom_unet
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam, SGD
from keras_unet.metrics import iou, iou_thresholded
from keras_unet.losses import jaccard_distance

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # x_train, train_gen, x_val and y_val are defined earlier in the notebook
    input_shape = x_train[0].shape

    model = custom_unet(
        input_shape,
        filters=32,
        use_batch_norm=True,
        dropout=0.3,
        dropout_change_per_layer=0.0,
        num_layers=6
    )
    model.summary()

    model_filename = 'model-v2.h5'
    callback_checkpoint = ModelCheckpoint(
        model_filename,
        verbose=1,
        monitor='val_loss',
        save_best_only=True,
    )

    model.compile(
        optimizer=Adam(),
        # optimizer=SGD(lr=0.01, momentum=0.99),
        loss='binary_crossentropy',
        # loss=jaccard_distance,
        metrics=[iou, iou_thresholded]
    )

    history = model.fit_generator(
        train_gen,
        steps_per_epoch=200,
        epochs=50,
        validation_data=(x_val, y_val),
        callbacks=[callback_checkpoint]
    )
```
Error:

```
ValueError: `handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call.
```
FYI, using `multi_gpu_model` raises the following exception:

```
ValueError: ('Expected `model` argument to be a `Model` instance, got ', <keras.engine.training.Model object at 0x7f1b347372d0>)
```
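Worth noting: the exception shows the model coming from `keras.engine.training.Model`, i.e. standalone Keras rather than `tf.keras`, and `tf.distribute` only supports `tf.keras` models. As a point of comparison (not the fix the maintainer ultimately suggests below), a setup that stays inside `tf.keras` would look roughly like this sketch; the stand-in model here replaces `custom_unet`, which would itself need to be built on `tf.keras`:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Stand-in model, illustration only: tf.keras layers throughout,
    # so MirroredStrategy sees a tf.keras model instead of a standalone
    # Keras one.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', padding='same',
                               input_shape=(128, 128, 3)),
        tf.keras.layers.Conv2D(1, 1, activation='sigmoid'),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss='binary_crossentropy',
    )
```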
@karolzak: Can you specify the TF/Keras versions that you're using? This seems to be related to that problem.
@muminoff:

```python
>>> tf.__version__
'2.1.0'
>>> keras.__version__
'2.3.1'
```
(The same code snippet and the same `ValueError` from above were re-posted unchanged.)
Issue Analytics

- Created: 4 years ago
- Comments: 11 (11 by maintainers)
Top GitHub Comments
@muminoff: @karolzak I have followed your instructions, and now it works. Thanks a lot for your support!
@karolzak: @muminoff OK, I think I finally got it. I'm using the `packaging` module internally to compare TF versions, and I was convinced it was a base Python package; it turns out it's not, so it needs to be installed first (`pip install packaging`). Sorry for the trouble, and let me know if that finally solved the issue.
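For context, below is a sketch of the kind of TF-version gate the maintainer describes; it is an illustration built on the `packaging` module, not keras_unet's actual code. If such a check cannot run because `packaging` is missing, a library could end up on the standalone-Keras branch, which would explain the `keras.engine.training.Model` seen in the exception above.

```python
# Illustration only, not keras_unet's actual code: a TF-version gate
# built on the packaging module.
from packaging import version

import tensorflow as tf

if version.parse(tf.__version__) >= version.parse("2.0.0"):
    # TF 2.x: use the Keras bundled with TensorFlow.
    from tensorflow import keras
else:
    # Older TF: fall back to standalone Keras.
    import keras

print(keras.__name__)
```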