In multi_gpu_model with cpu_relocation the weights of the template model do not change

When using multi_gpu_model with cpu_relocation=True, the weights of the template model do not change during training, and they differ from the weights of the parallel model, which do change.

This contradicts the documentation, which states:

To save the multi-gpu model, use .save(fname) or .save_weights(fname) with the template model (the argument you passed to multi_gpu_model), rather than the model returned by multi_gpu_model.

But saving the template model is useless if its weights do not change during training and differ from those of the parallel model.

See the following minimal example:

from keras import Model, Input
from keras.layers import Dense
from keras.utils import multi_gpu_model
import keras.backend as K
import numpy as np

BATCHSIZE = 8
NITER = 4

# dummy model
x = Input(shape=(4,))
layer = Dense(2, activation='relu')(x)
y = Dense(1)(layer)
model = Model(inputs=x, outputs=y)

try:
    parallel_model = multi_gpu_model(model, cpu_relocation=True)
    print("Training using multiple GPUs..")
except ValueError:
    parallel_model = model
    print("Training using single GPU or CPU..")

parallel_model.compile(optimizer='sgd', loss='mse')

original_weights = K.batch_get_value(model.weights)

# Dummy training
for i in range(NITER):
    x = np.random.randn(BATCHSIZE, 4)
    y = np.random.randn(BATCHSIZE)
    parallel_model.train_on_batch(x, y)

weights = K.batch_get_value(model.weights)
parallel_weights = K.batch_get_value(parallel_model.weights)

if all([np.all(w == ow) for w, ow in zip(weights, original_weights)]):
    print('Weights in the template model have not changed')
else:
    print('Weights in the template model have changed')

if all([np.all(w == pw) for w, pw in zip(weights, parallel_weights)]):
    print('Weights in the template and parallel model are equal')
else:
    print('Weights in the template and parallel model are different')

When executing on a single GPU or CPU, the result is:

Training using single GPU or CPU..
Weights in the template model have changed
Weights in the template and parallel model are equal

When executing on multiple GPUs, the result is:

Training using multiple GPUs..
Weights in the template model have not changed
Weights in the template and parallel model are different
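For anyone reproducing this, it can help to see which weight arrays differ rather than only whether any do. The following is a minimal sketch that uses NumPy alone; the helper name `differing_weights` and the stand-in arrays are hypothetical and not part of the original script. It assumes both inputs are lists of NumPy arrays, such as those returned by K.batch_get_value.

```python
import numpy as np


def differing_weights(weights_a, weights_b, atol=0.0):
    """Return the indices of weight arrays that differ between two lists.

    With atol=0.0 the comparison is exact, like `w == ow` in the script.
    (Hypothetical helper, not part of Keras.)
    """
    if len(weights_a) != len(weights_b):
        raise ValueError("weight lists have different lengths")
    return [
        i
        for i, (a, b) in enumerate(zip(weights_a, weights_b))
        if a.shape != b.shape or not np.allclose(a, b, rtol=0.0, atol=atol)
    ]


# Stand-in arrays with the shapes of the dummy model's weights
# (kernel and bias of Dense(2), then kernel and bias of Dense(1)).
before = [np.ones((4, 2)), np.zeros(2), np.ones((2, 1)), np.zeros(1)]
after = [w.copy() for w in before]
after[0][0, 0] += 0.5  # pretend training updated the first kernel

print(differing_weights(before, after))   # [0]
print(differing_weights(before, before))  # []
```

In the script above, an empty list from differing_weights(original_weights, weights) would correspond to "Weights in the template model have not changed".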

Top GitHub Comments

darteaga commented, Oct 8, 2018

If I replace

model = Model(inputs=x, outputs=y)

try:
    parallel_model = multi_gpu_model(model, cpu_relocation=True)
    print("Training using multiple GPUs..")
except ValueError:
    parallel_model = model
    print("Training using single GPU or CPU..")

with the following (which also needs import tensorflow as tf):

try:
    with tf.device('/cpu:0'):
        model = Model(inputs=x, outputs=y)
    parallel_model = multi_gpu_model(model)
    print("Training using multiple GPUs..")
except ValueError:
    model = Model(inputs=x, outputs=y)
    parallel_model = model
    print("Training using single GPU or CPU..")

then, when training with multiple GPUs, I get, as expected:

Training using multiple GPUs..
Weights in the template model have changed
Weights in the template and parallel model are equal

darteaga commented, Jan 10, 2019

@loretoparisi I haven’t tried it, since the workaround in https://github.com/keras-team/keras/issues/11313#issuecomment-427768441 works fine for me. BTW, notice that #8123 is closed because it makes reference to this bug.

However, https://github.com/keras-team/keras/issues/11313#issuecomment-427768441 is still a workaround, not a solution to the bug, which, as far as I can see, is still open and unresolved.