In multi_gpu_model with cpu_relocation the weights of the template model do not change

When using multi_gpu_model with cpu_relocation=True, the weights of the template model do not change during training and differ from the weights of the parallel model, which do change. See below for an example.

This contradicts the documentation, which states:

To save the multi-gpu model, use .save(fname) or .save_weights(fname) with the template model (the argument you passed to multi_gpu_model), rather than the model returned by multi_gpu_model.

But saving the template model is useless if its weights are not updated by training and differ from those of the parallel model.

See the following minimal example:

from keras import Model, Input
from keras.layers import Dense
from keras.utils import multi_gpu_model
import keras.backend as K
import numpy as np

BATCHSIZE = 8
NITER = 4

# dummy model
x = Input(shape=(4,))
layer = Dense(2, activation='relu')(x)
y = Dense(1)(layer)
model = Model(inputs=x, outputs=y)

try:
    parallel_model = multi_gpu_model(model, cpu_relocation=True)
    print("Training using multiple GPUs..")
except ValueError:
    parallel_model = model
    print("Training using single GPU or CPU..")

parallel_model.compile(optimizer='sgd', loss='mse')

original_weights = K.batch_get_value(model.weights)

# Dummy training
for i in range(NITER):
    x = np.random.randn(BATCHSIZE, 4)
    y = np.random.randn(BATCHSIZE)
    parallel_model.train_on_batch(x, y)

weights = K.batch_get_value(model.weights)
parallel_weights = K.batch_get_value(parallel_model.weights)

if all([np.all(w == ow) for w, ow in zip(weights, original_weights)]):
    print('Weights in the template model have not changed')
else:
    print('Weights in the template model have changed')

if all([np.all(w == pw) for w, pw in zip(weights, parallel_weights)]):
    print('Weights in the template and parallel model are equal')
else:
    print('Weights in the template and parallel model are different')

When executing on a single GPU or CPU, the result is:

Training using single GPU or CPU..
Weights in the template model have changed
Weights in the template and parallel model are equal

When executing on multiple GPUs, the result is:

Training using multiple GPUs..
Weights in the template model have not changed
Weights in the template and parallel model are different
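The two weight checks in the script can be folded into one small helper. This is only a sketch: `weights_equal` is a hypothetical name, not part of Keras, and it assumes the weights have already been fetched as NumPy arrays (for example with K.batch_get_value). Unlike a bare zip over the two lists, it also reports a mismatch when the lists have different lengths or the arrays have different shapes.

```python
import numpy as np


def weights_equal(weights_a, weights_b):
    """Return True only if both lists hold the same arrays, pairwise.

    Hypothetical helper (not a Keras function). np.array_equal returns
    False for arrays of different shape, and the explicit length check
    catches the case that zip() would silently truncate.
    """
    if len(weights_a) != len(weights_b):
        return False
    return all(np.array_equal(a, b) for a, b in zip(weights_a, weights_b))


# Stand-in weight lists shaped like the dummy model's first Dense layer.
before = [np.ones((4, 2)), np.zeros(2)]
after = [w - 0.1 for w in before]

print(weights_equal(before, [w.copy() for w in before]))
print(weights_equal(before, after))
```

In the script above, the same call would take the lists returned by K.batch_get_value for the template and parallel models.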


  • Check that you are up-to-date with the master branch of Keras. You can update with: pip install git+git://github.com/keras-team/keras.git --upgrade --no-deps

  • Check that your version of TensorFlow is up-to-date. The installation instructions can be found here.

  • Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6

Top GitHub Comments

3 reactions
darteaga commented, Oct 8, 2018

If I replace

model = Model(inputs=x, outputs=y)

try:
    parallel_model = multi_gpu_model(model, cpu_relocation=True)
    print("Training using multiple GPUs..")
except ValueError:
    parallel_model = model
    print("Training using single GPU or CPU..")

with

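import tensorflow as tf  # needed for tf.device below

try:
    # Build the template model under the CPU device scope
    # instead of passing cpu_relocation=True
    with tf.device('/cpu:0'):
        model = Model(inputs=x, outputs=y)
    parallel_model = multi_gpu_model(model)
    print("Training using multiple GPUs..")
except ValueError:
    model = Model(inputs=x, outputs=y)
    parallel_model = model
    print("Training using single GPU or CPU..")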

then, when training with multiple GPUs, I get, as expected:

Training using multiple GPUs..
Weights in the template model have changed
Weights in the template and parallel model are equal
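One possible explanation, which this thread does not confirm, is that cpu_relocation=True makes multi_gpu_model clone the template under the CPU device scope and train the clone, so the object the caller still holds is never updated. The toy below illustrates only that assumed mechanism. It uses no Keras at all, and `ToyModel`, `ToyParallelModel`, `toy_multi_gpu_model` and `template_changed` are hypothetical names invented for this sketch.

```python
import copy

import numpy as np


class ToyModel:
    """Stand-in for a Keras model: just a list of weight arrays."""

    def __init__(self, weights):
        self.weights = [np.array(w, dtype=float) for w in weights]


class ToyParallelModel:
    """Stand-in for the wrapper returned by multi_gpu_model."""

    def __init__(self, inner):
        self.inner = inner  # the model whose weights training updates

    def train_step(self, delta=0.1):
        # Pretend gradient step: shift the wrapped model's weights in place.
        for w in self.inner.weights:
            w -= delta


def toy_multi_gpu_model(template, cpu_relocation=False):
    # ASSUMPTION (not confirmed in the issue): with cpu_relocation=True the
    # template is cloned first, so the wrapper trains the clone.
    inner = copy.deepcopy(template) if cpu_relocation else template
    return ToyParallelModel(inner)


def template_changed(cpu_relocation):
    """Train one toy step and report whether the template's weights moved."""
    template = ToyModel([np.ones((4, 2)), np.zeros(2)])
    before = [w.copy() for w in template.weights]
    parallel = toy_multi_gpu_model(template, cpu_relocation=cpu_relocation)
    parallel.train_step()
    return not all(np.array_equal(b, w) for b, w in zip(before, template.weights))


print("template changed without relocation:", template_changed(False))
print("template changed with relocation:", template_changed(True))
```

If the assumption holds, it would also explain why building the template under tf.device('/cpu:0') yourself behaves as expected: no clone is made, so the wrapper and the caller share the same weights.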

1 reaction
darteaga commented, Jan 10, 2019

@loretoparisi I haven’t tried it, since the workaround in https://github.com/keras-team/keras/issues/11313#issuecomment-427768441 works fine for me. BTW, notice that #8123 is closed because it makes reference to this bug.

However, https://github.com/keras-team/keras/issues/11313#issuecomment-427768441 is still a workaround, not a solution to the bug, which, as far as I can see, is still open and unresolved.
