ImageRecordReader crashes JVM with loaded Keras model in 1.0.0-beta7

Issue Description

I encountered a strange problem in 1.0.0-beta7 while trying to run a Keras model loaded from a .h5 file (e.g., VGG16.h5 from here); the same model ran fine in 1.0.0-beta6.

Calling computationGraph.feedForward(features, false) crashes the JVM (see the error log) with this code snippet:

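// Imports needed by the snippet
import java.io.File;
import java.util.Arrays;
import org.datavec.api.split.FileSplit;
import org.datavec.image.recordreader.ImageRecordReader;
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.nn.modelimport.keras.KerasModelImport;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.factory.Nd4j;

// Create VGG16 from a Keras .h5 file
ComputationGraph tmpModel = KerasModelImport.importKerasModelAndWeights("VGG16.h5");
tmpModel.init();

ImageRecordReader reader = new ImageRecordReader(224, 224, 3);
reader.initialize(new FileSplit(new File("img_125_5.jpg"))); // Test with a single image
DataSetIterator it = new RecordReaderDataSetIterator(reader, 1);

// The Keras model expects channels-last (NHWC) input, so switch the layout at the reader level
reader.setNchw_channels_first(false);

INDArray features = it.next().getFeatures();
// INDArray features = Nd4j.rand(1, 224, 224, 3); // Runs fine when initializing from a random array of the same shape

System.out.println(Arrays.toString(features.shape())); // prints [1, 224, 224, 3]

tmpModel.feedForward(features, false);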
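In the snippet, reader.setNchw_channels_first(false) switches the reader's output from channels-first [minibatch, channels, height, width] to channels-last [minibatch, height, width, channels], which is the layout the printed shape [1, 224, 224, 3] reflects. As a minimal pure-Java sketch of what that reordering means for a flat row-major buffer (the helper nchwToNhwc is hypothetical and not part of DataVec or ND4J), each element at (n, c, h, w) moves from offset ((n*C + c)*H + h)*W + w to offset ((n*H + h)*W + w)*C + c:

```java
import java.util.Arrays;

public class LayoutSketch {
    // Hypothetical helper for illustration only; not a DataVec/ND4J API.
    // Copies a contiguous row-major NCHW buffer into a new contiguous NHWC buffer.
    public static float[] nchwToNhwc(float[] src, int n, int c, int h, int w) {
        float[] dst = new float[src.length];
        for (int ni = 0; ni < n; ni++) {
            for (int ci = 0; ci < c; ci++) {
                for (int hi = 0; hi < h; hi++) {
                    for (int wi = 0; wi < w; wi++) {
                        int srcIdx = ((ni * c + ci) * h + hi) * w + wi; // channels-first offset
                        int dstIdx = ((ni * h + hi) * w + wi) * c + ci; // channels-last offset
                        dst[dstIdx] = src[srcIdx];
                    }
                }
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        // One 2x2 image with 3 channels: channel 0 = 0..3, channel 1 = 10..13, channel 2 = 20..23
        float[] nchw = {0, 1, 2, 3, 10, 11, 12, 13, 20, 21, 22, 23};
        float[] nhwc = nchwToNhwc(nchw, 1, 3, 2, 2);
        System.out.println(Arrays.toString(nhwc));
    }
}
```

For an input of shape [1, 3, 2, 2], the output groups the three channel values of each pixel together: [0.0, 10.0, 20.0, 1.0, 11.0, 21.0, 2.0, 12.0, 22.0, 3.0, 13.0, 23.0].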

By stepping through the code in IntelliJ, I found that the crash happens inside the ComputationGraph class, at line 1976.

Strangely, though, the code snippet above runs fine if you use a random INDArray of the same shape, so the shape of the features is not the cause. Looking at the values of the features returned by the DataSetIterator, there are no NaNs or unusual values (all are between 0 and 1).
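The value check described above can be made explicit. This is a minimal sketch, assuming the features have already been copied out of the INDArray into a plain float[] (that extraction step is not shown); allFiniteInRange is a hypothetical helper. NaN is tested with Float.isFinite because every ordered comparison against NaN is false, so a range check alone would let NaN through.

```java
public class FeatureCheck {
    // Hypothetical helper: true if every value is finite (no NaN or infinity) and lies in [lo, hi].
    public static boolean allFiniteInRange(float[] values, float lo, float hi) {
        for (float v : values) {
            // Check finiteness first: v < lo and v > hi are both false when v is NaN.
            if (!Float.isFinite(v) || v < lo || v > hi) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        float[] ok = {0.0f, 0.5f, 1.0f};
        float[] bad = {0.2f, Float.NaN, 0.9f};
        System.out.println(allFiniteInRange(ok, 0f, 1f));  // true
        System.out.println(allFiniteInRange(bad, 0f, 1f)); // false
    }
}
```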

Interestingly, if the .h5 model is imported in beta6 and saved to a zip with model.save(new File("VGG.zip")), then loaded in beta7, the above snippet works fine (replacing the KerasModelImport... line with ComputationGraph.load(new File("beta6KerasVGG.zip"), true);).

Note also that the above snippet works fine with a different model (e.g., ResNet50.h5), so the problem does not occur with every Keras model.

Conclusion

On one hand, the problem seems to stem from changes to the KerasModelImport process: a .h5 file that loaded and ran fine in 1.0.0-beta6 no longer works in 1.0.0-beta7. Additionally, saving the beta6 model to a .zip file and loading it as a new ComputationGraph in beta7 avoids the problem.

On the other hand, the ImageRecordReader or DataSetIterator could also be the culprit: when they are taken out of the equation (by using a random INDArray), no errors occur.
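The checks so far rule out shape and value range, but not memory layout. Two arrays with shape [1, 224, 224, 3] can arrange their elements differently: a channels-last view produced by permuting a channels-first buffer keeps the original data and only reorders the strides, whereas a freshly allocated channels-last array such as Nd4j.rand(1, 224, 224, 3) is contiguous. Whether the reader returns such a view in beta7, and whether that matters for the crash, is not established in this thread. The pure-Java sketch only computes both sets of strides, assuming row-major ('c' order) storage and using hypothetical helper names; in ND4J, features.stride() and features.isView() would show which case applies to the reader's output.

```java
import java.util.Arrays;

public class StrideSketch {
    // Strides (in elements) of a contiguous row-major ('c' order) array with the given shape.
    public static long[] cOrderStrides(long[] shape) {
        long[] strides = new long[shape.length];
        long step = 1;
        for (int i = shape.length - 1; i >= 0; i--) {
            strides[i] = step;
            step *= shape[i];
        }
        return strides;
    }

    // Strides of a view obtained by permuting dimensions. No data moves, so each
    // output dimension keeps the stride of the input dimension it came from.
    public static long[] permuteStrides(long[] strides, int... order) {
        long[] out = new long[order.length];
        for (int i = 0; i < order.length; i++) {
            out[i] = strides[order[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] nchwStrides = cOrderStrides(new long[] {1, 3, 224, 224});   // [150528, 50176, 224, 1]
        long[] viewStrides = permuteStrides(nchwStrides, 0, 2, 3, 1);      // NHWC view of NCHW data
        long[] freshStrides = cOrderStrides(new long[] {1, 224, 224, 3});  // freshly allocated NHWC

        System.out.println(Arrays.toString(viewStrides));  // [150528, 224, 1, 50176]
        System.out.println(Arrays.toString(freshStrides)); // [150528, 672, 3, 1]
    }
}
```

Both arrays report the same shape, but the channel dimension has stride 50176 in the permuted view versus 1 in the freshly allocated array.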

Attached files

img_125_5

Version Information

Please indicate relevant versions, including, if relevant:

  • Deeplearning4j version - 1.0.0-beta7
  • Platform information (OS, etc) - Ubuntu 18.04
  • CUDA version, if used
  • NVIDIA driver version, if in use

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 15 (11 by maintainers)

Top GitHub Comments

basedrhys commented, Jun 8, 2020 (1 reaction)

I’ve made a simple Gradle project to demonstrate this and help you reproduce it.

Instructions

  1. Download and unzip the project file from Google Drive: Link
  2. Open/import the project in IntelliJ (or your IDE of choice) and let the IDE download the relevant dependencies.
  3. Run the main() method in Main.java. The project uses beta6, so the main() method should complete successfully.
  4. In build.gradle, change the nd4j and dl4j versions from 1.0.0-beta6 to 1.0.0-beta7. Let your IDE import these changes.
  5. Run main() again. This should now cause the program to crash (a JVM crash on Ubuntu 18.04, log file attached, and a nondescript Gradle error on Windows 10).

In Main.java, I’ve also included several scenarios I tried while debugging the issue; the most notable is Scenario 3, the duplicating fix mentioned above.

Hopefully this can be reproduced on your machine, let me know if there’s any other info you’d like 😃

Attached Files

hs_err_pid17974.log

agibsonccc commented, Apr 15, 2021

Closing in favor of sam’s linked issue with more details: https://github.com/eclipse/deeplearning4j/issues/8785
