Saved model behaves differently on different machines
After studying #439, #2228, #2743, #6737 and the new FAQ about reproducibility, I was able to get consistent, reproducible results on my development machines using Theano. If I run my code twice, I get exactly the same results.
The problem is that the results are reproducible only on the same machine. In other words, if I
- Train a model on machine A
- Evaluate the model using predict
- Save the model (using save_model, or model_to_json and save_weights)
- Transfer the model to machine B and load it
- Evaluate the model again on machine B using predict

then the results of the two predict calls are different. Using CPU or GPU makes no difference: after I copy the model file(s) from one machine to another, the performance of predict changes dramatically.
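To quantify how large the discrepancy actually is (rather than eyeballing the output vectors), a small helper along these lines can compare the predict outputs dumped on each machine. This is a sketch; compare_predictions is a hypothetical name, not a Keras API:

```python
import numpy as np

def compare_predictions(a, b, atol=1e-5):
    """Return the max absolute difference between two prediction
    arrays and whether they agree within the given tolerance."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    max_diff = float(np.max(np.abs(a - b)))
    return max_diff, bool(np.allclose(a, b, atol=atol))
```

On each machine one could np.save the output of model.predict(x) and compare the two files: a max difference around 1e-6 suggests ordinary floating-point noise, while a large gap points at genuinely different weights, preprocessing, or backend behavior.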
The only differences between the two machines are the hardware (I use my laptop’s 980M and a workstation with a Titan X Pascal) and the NVIDIA driver version, which is slightly older on the workstation. Both computers run Ubuntu 16.04 LTS and CUDA 8 with cuDNN. All libraries are on the same version on both machines, and the Python version is the same as well (3.6.1).
Is this behavior intended? I expect that running a pre-trained model with the same architecture and weights on two different machines yields the same results, but this doesn’t seem to be the case.
On a side note, a suggestion: the FAQ about reproducibility should explicitly state that the development version of Theano is needed to get reproducible results.
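For reference, with Theano the cuDNN convolution backward passes are a common source of nondeterminism. A configuration along these lines forces deterministic algorithms (flag names per the Theano config documentation; worth double-checking against your Theano version, and evaluate.py stands in for your own script):

```shell
# 'deterministic=more' trades speed for reproducibility; the dnn.conv
# flags pin the cuDNN backward algorithms to deterministic variants.
THEANO_FLAGS='deterministic=more,dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic' python evaluate.py
```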
Issue Analytics
- Created 6 years ago
- Reactions: 12
- Comments: 29
@basaldella Have you fixed this issue? I seem to have the same problem. I retrained a model by fine-tuning InceptionV3 on my own images on a GPU machine. After training, the accuracy reached 91%, which I am happy with. During training the improved model was saved with callbacks, so I can load the best retrained model with model.load_model(model_path), and I tested it with one image. The prediction results are always the same and correct (because I know which class this image belongs to). The result looks like this: [[ 0.00197385 0.01141251 0.02262068 0.9121536 0.00810914 0.01657074 0.00370198 0.00617629 0.00972648 0.00531203 0.00224261]]
Now, when I copy the retrained model (an HDF5 file) to my laptop, load it again, and test it with the same image, I get a totally different result: [[ 0.00373867 0.22160383 0.10066977 0.35440436 0.02839879 0.17799987 0.01744748 0.02645957 0.0299265 0.03026218 0.00908909]]
The Python environments are the same on the two machines, with Keras 2.0.8. The results are always the same on the same machine. The weights are the same after I load the model file. …I checked many things.
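One way to make the "weights are the same" check machine-independent is to hash the weight arrays on both sides and compare the digests. A minimal sketch, assuming the standard Keras model.get_weights() accessor (weights_fingerprint itself is a made-up helper):

```python
import hashlib
import numpy as np

def weights_fingerprint(weights):
    """SHA-256 over all weight arrays, cast to contiguous float64 bytes,
    so the hex digest can be compared between machines."""
    h = hashlib.sha256()
    for w in weights:
        h.update(np.ascontiguousarray(w, dtype=np.float64).tobytes())
    return h.hexdigest()

# On each machine: print(weights_fingerprint(model.get_weights()))
# Identical digests mean the loaded weights match exactly; differing
# predictions then point at preprocessing or backend differences instead.
```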
Why are the results different on the two machines? Does anybody know about this?
@basaldella Yes, turns out my issue was more along the lines of #4875, and was inconsistent between different Python sessions, not just different machines.
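For the between-sessions case, seeding every Python-level RNG source at the very start of each session usually helps. A sketch (this covers CPU-side randomness only; backend RNGs and GPU kernels must be handled separately, and seed_everything is a hypothetical helper name):

```python
import os
import random
import numpy as np

def seed_everything(seed=42):
    """Seed the Python and NumPy RNGs at session start.
    Note: PYTHONHASHSEED must be set before the interpreter starts to
    affect the current process; setting it here only covers subprocesses.
    Theano/TensorFlow RNGs need to be seeded through the backend itself."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
```

Calling seed_everything with the same seed at the top of two sessions makes the CPU-side random draws in each session identical.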