Issue with running on multiple GPUs
Copy-pasting comments from #12:
@muminoff: I cannot run a custom U-Net with multiple GPUs. I followed the distributed training part of the TensorFlow documentation, but no luck. It seems I need to refactor the code and use custom distributed training (namely `strategy.experimental_distribute_dataset`).
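For context, the refactor hinted at here would look roughly like the sketch below: a manual training loop over a dataset wrapped with `strategy.experimental_distribute_dataset`. This is a minimal illustration assuming recent TF 2.x (where `strategy.run` replaced `experimental_run_v2`); the tiny model, batch size, and random data are placeholders, not code from the issue.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Placeholder data and batch size, for illustration only.
GLOBAL_BATCH = 16
x = tf.random.normal((64, 8))
y = tf.random.uniform((64, 1))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(GLOBAL_BATCH)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.Adam()
    # Reduction must be NONE so the per-replica loss can be scaled manually.
    loss_fn = tf.keras.losses.BinaryCrossentropy(
        from_logits=True,
        reduction=tf.keras.losses.Reduction.NONE)

def train_step(inputs):
    features, labels = inputs
    with tf.GradientTape() as tape:
        logits = model(features, training=True)
        per_example_loss = loss_fn(labels, logits)
        # Scale by the global batch size, not the per-replica batch size.
        loss = tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(inputs):
    # Run the step on every replica and sum the per-replica losses.
    per_replica_losses = strategy.run(train_step, args=(inputs,))
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

for batch in dist_dataset:
    distributed_train_step(batch)
```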
@karolzak: Can you share the code you used, your TF/Keras versions, and the error message? That way I might be able to help you out or at least investigate it.
@muminoff: I haven't tried `tf.keras.utils.multi_gpu_model` since it is deprecated, but I did try `tf.distribute.MirroredStrategy()`. Here is my code:
```python
import tensorflow as tf  # missing from the original snippet, needed for tf.distribute

from keras_unet.models import custom_unet
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam, SGD
from keras_unet.metrics import iou, iou_thresholded
from keras_unet.losses import jaccard_distance

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # x_train, train_gen, x_val and y_val are defined earlier in the notebook
    input_shape = x_train[0].shape

    model = custom_unet(
        input_shape,
        filters=32,
        use_batch_norm=True,
        dropout=0.3,
        dropout_change_per_layer=0.0,
        num_layers=6
    )
    model.summary()

    model_filename = 'model-v2.h5'
    callback_checkpoint = ModelCheckpoint(
        model_filename,
        verbose=1,
        monitor='val_loss',
        save_best_only=True,
    )

    model.compile(
        optimizer=Adam(),
        # optimizer=SGD(lr=0.01, momentum=0.99),
        loss='binary_crossentropy',
        # loss=jaccard_distance,
        metrics=[iou, iou_thresholded]
    )

    history = model.fit_generator(
        train_gen,
        steps_per_epoch=200,
        epochs=50,
        validation_data=(x_val, y_val),
        callbacks=[callback_checkpoint]
    )
```
Error:

```
ValueError: `handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call.
```
FYI, using `multi_gpu_model` raises the following exception:

```
ValueError: ('Expected `model` argument to be a `Model` instance, got ', <keras.engine.training.Model object at 0x7f1b347372d0>)
```
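Worth noting: the exception shows the model coming from `keras.engine.training.Model`, i.e. standalone Keras rather than `tf.keras`, and `tf.distribute` only supports `tf.keras` models. As a point of comparison (not the fix the maintainer ultimately suggests below), a setup that stays inside `tf.keras` would look roughly like this sketch; the stand-in model here replaces `custom_unet`, which would itself need to be built on `tf.keras`:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Stand-in model, illustration only: tf.keras layers throughout,
    # so MirroredStrategy sees a tf.keras model instead of a standalone
    # Keras one.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', padding='same',
                               input_shape=(128, 128, 3)),
        tf.keras.layers.Conv2D(1, 1, activation='sigmoid'),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss='binary_crossentropy',
    )
```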
@karolzak: Can you specify the TF/Keras versions that you're using? This seems to be related to that problem.
@muminoff:

```python
>>> tf.__version__
'2.1.0'
>>> keras.__version__
'2.3.1'
```
(The same code snippet and the same `ValueError` from above were re-posted unchanged.)
Issue Analytics

- Created: 4 years ago
- Comments: 11 (11 by maintainers)
Top GitHub Comments
@muminoff: @karolzak I have followed your instructions, and now it works. Thanks a lot for your support!
@karolzak: @muminoff OK, I think I finally got it. I'm using the `packaging` module internally to compare TF versions, and I was convinced it was a base Python package; it turns out it's not, so it needs to be installed first (`pip install packaging`). Sorry for the trouble, and let me know if that finally solved the issue.
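For context, below is a sketch of the kind of TF-version gate the maintainer describes; it is an illustration built on the `packaging` module, not keras_unet's actual code. If such a check cannot run because `packaging` is missing, a library could end up on the standalone-Keras branch, which would explain the `keras.engine.training.Model` seen in the exception above.

```python
# Illustration only, not keras_unet's actual code: a TF-version gate
# built on the packaging module.
from packaging import version

import tensorflow as tf

if version.parse(tf.__version__) >= version.parse("2.0.0"):
    # TF 2.x: use the Keras bundled with TensorFlow.
    from tensorflow import keras
else:
    # Older TF: fall back to standalone Keras.
    import keras

print(keras.__name__)
```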