
Issue with running on multiple GPUs

See original GitHub issue

Copy-pasting comments from #12


@muminoff: I cannot run a custom unet with multiple GPUs. I followed the distributed training part of the TensorFlow documentation, but no luck. It seems I need to refactor the code and use a custom distributed training loop (namely strategy.experimental_distribute_dataset).
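
For reference (not part of the original thread), the custom distributed-training pattern mentioned above looks roughly like this. This is a minimal sketch assuming TF 2.x, a toy model standing in for custom_unet, and x_train/y_train arrays like the ones referenced later in this issue:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Batch by the *global* batch size; the strategy splits each batch across replicas.
# x_train / y_train are assumed to be the user's NumPy arrays of images and masks.
GLOBAL_BATCH_SIZE = 16
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(GLOBAL_BATCH_SIZE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

with strategy.scope():
    # Toy model standing in for custom_unet; variables created here are mirrored.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(1, 3, padding='same', activation='sigmoid'),
    ])
    optimizer = tf.keras.optimizers.Adam()
    loss_obj = tf.keras.losses.BinaryCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)  # reduce manually below

def train_step(inputs):
    images, masks = inputs
    with tf.GradientTape() as tape:
        preds = model(images, training=True)
        # Collapse the per-pixel loss to one value per example, then scale by
        # the global batch size as recommended for tf.distribute.
        per_example_loss = tf.reduce_mean(loss_obj(masks, preds), axis=[1, 2])
        loss = tf.nn.compute_average_loss(per_example_loss,
                                          global_batch_size=GLOBAL_BATCH_SIZE)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(inputs):
    # strategy.run is called strategy.experimental_run_v2 on TF 2.1
    per_replica_loss = strategy.run(train_step, args=(inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)

for epoch in range(5):
    for batch in dist_dataset:
        loss = distributed_train_step(batch)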


@karolzak: Can you share the code you used, TF/Keras version and error msg? That way I might be able to help you out or at least investigate it.


@muminoff: I haven’t tried tf.keras.utils.multi_gpu_model since it is deprecated. But I did try tf.distribute.MirroredStrategy().

And, here is my code:

import tensorflow as tf  # needed for tf.distribute below

from keras_unet.models import custom_unet
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam, SGD
from keras_unet.metrics import iou, iou_thresholded
from keras_unet.losses import jaccard_distance

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():

    input_shape = x_train[0].shape

    model = custom_unet(
        input_shape,
        filters=32,
        use_batch_norm=True,
        dropout=0.3,
        dropout_change_per_layer=0.0,
        num_layers=6
    )

    model.summary()

    model_filename = 'model-v2.h5'

    callback_checkpoint = ModelCheckpoint(
        model_filename, 
        verbose=1, 
        monitor='val_loss', 
        save_best_only=True,
    )

    model.compile(
        optimizer=Adam(), 
        #optimizer=SGD(lr=0.01, momentum=0.99),
        loss='binary_crossentropy',
        #loss=jaccard_distance,
        metrics=[iou, iou_thresholded]
    )

    history = model.fit_generator(
        train_gen,
        steps_per_epoch=200,
        epochs=50,
        validation_data=(x_val, y_val),
        callbacks=[callback_checkpoint]
    )

Error:

ValueError: `handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call.

FYI, using multi_gpu_model raises the following exception:

ValueError: ('Expected `model` argument to be a `Model` instance, got ', <keras.engine.training.Model object at 0x7f1b347372d0>)
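
For context: one commonly cited cause of this `handle` error is mixing objects from the standalone keras package with a tf.distribute strategy, so that variables are created or used outside the replica context. The sketch below is a rough tf.keras-only layout (not the fix described later in the thread, which was installing packaging), assuming keras_unet resolves to the tf.keras backend, a hypothetical input shape, and the train_gen / x_val / y_val objects from the issue:

import tensorflow as tf
from keras_unet.models import custom_unet
from keras_unet.metrics import iou, iou_thresholded

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Build and compile inside the scope so the variables are mirrored.
    model = custom_unet(
        input_shape=(512, 512, 3),   # hypothetical; in practice x_train[0].shape
        filters=32,
        use_batch_norm=True,
        num_layers=6,
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss='binary_crossentropy',
        metrics=[iou, iou_thresholded],
    )

callback_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'model-v2.h5', verbose=1, monitor='val_loss', save_best_only=True,
)

# train_gen, x_val and y_val are the user's objects from the issue above.
history = model.fit(
    train_gen,
    steps_per_epoch=200,
    epochs=50,
    validation_data=(x_val, y_val),
    callbacks=[callback_checkpoint],
)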

@karolzak: Can you specify the TF/Keras versions you’re using? This seems to be related to that problem.


@muminoff:

tf.__version__
'2.1.0'

keras.__version__
'2.3.1'

(The same code as above followed, ending with the same `handle` ValueError.)

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
muminoff commented, Feb 14, 2020

@karolzak I have followed your instructions. Now it works. Thanks a lot for your support!

1 reaction
karolzak commented, Feb 14, 2020

@muminoff Ok, I think I finally got it. I’m using the packaging module internally to compare TF versions, and I was convinced it was a base Python package, but it turns out it isn’t and needs to be installed:

pip install packaging

Sorry for the trouble, and let me know if that finally solved the issue.
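
The check karolzak describes can be illustrated roughly as follows; this is a hedged sketch of the packaging-based version comparison he mentions, not keras_unet's actual internal code:

from packaging import version
import tensorflow as tf

# `packaging` ships separately on PyPI (pip install packaging); it is not
# part of the standard library, which is what caused the confusion here.
if version.parse(tf.__version__) >= version.parse("2.0.0"):
    print("TF 2.x detected - the tf.keras backend can be used")
else:
    print("TF 1.x detected")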
