JRE Crash on TensorFlow model import
Issue Description
I encounter a JVM crash when importing a BERT model with TFGraphMapper.
I pasted the dump here: https://gist.github.com/DavenH/6d261dcb171d96104cd81674f96e42f9
On a different run, I instead get the error “*** Error in `/usr/lib/jvm/default-java/jre/bin/java': corrupted size vs. prev_size: 0x00007f9b90b7d470 ***”. Dump: https://gist.github.com/DavenH/893795b819b447732af4b61efb92451b
The .pb file for BERT is huge, so I won’t attach it here. To create it, I downloaded the pretrained, uncased base model from here: https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
Then run this script (https://gist.github.com/DavenH/5084e45086c71d838b0c5e51bc41e2e1 with appropriate adjustments to where you extract the file) to transform the bert_model.ckpt file to a .pb file.
Then run this code, with -Xmx8g for JVM memory:
TFGraphMapper mapper = TFGraphMapper.getInstance();
SameDiff sd = mapper.importGraph(new File("/path/to/bert/model/bert_exported.pb"));
I tried the same code with one of the Zoo models (http://download.tensorflow.org/models/compression_residual_gru-2016-08-23.tar.gz), and it was successful, so my theory is that this crash has less to do with the extractor code than with some backend allocation that happens when models are more complex and have more parameters.
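Since the theory above points at backend allocation, it may help to confirm what memory limits the process actually runs with before calling the importer. The following is only a diagnostic sketch, not part of the original reproduction: the class name ImportPreflight and its helper methods are made up for illustration. It relies on two facts outside this issue: -Xmx bounds only the JVM heap, and ND4J's native backend allocates array memory off-heap through JavaCPP, whose limits are read from the system properties org.bytedeco.javacpp.maxbytes and org.bytedeco.javacpp.maxphysicalbytes. Whether those limits have any bearing on this particular crash is an open question.

```java
import java.io.File;

public class ImportPreflight {

    // Convert a byte count to whole mebibytes (rounds down).
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Describe a system property value that may not have been set.
    static String describeLimit(String value) {
        return value == null ? "not set (library default applies)" : value;
    }

    public static void main(String[] args) {
        // Placeholder path from the issue; pass the real path as the first argument.
        File model = new File(args.length > 0 ? args[0] : "/path/to/bert/model/bert_exported.pb");

        System.out.println("Model file exists: " + model.isFile());
        // File.length() returns 0 when the file does not exist.
        System.out.println("Model size (MiB): " + toMiB(model.length()));

        // Reflects -Xmx; this covers the JVM heap only, not native allocations.
        System.out.println("JVM max heap (MiB): " + toMiB(Runtime.getRuntime().maxMemory()));

        // JavaCPP off-heap limits, if they were passed with -D on the command line.
        System.out.println("org.bytedeco.javacpp.maxbytes: "
                + describeLimit(System.getProperty("org.bytedeco.javacpp.maxbytes")));
        System.out.println("org.bytedeco.javacpp.maxphysicalbytes: "
                + describeLimit(System.getProperty("org.bytedeco.javacpp.maxphysicalbytes")));
    }
}
```

Running this with the same JVM flags as the failing import (for example -Xmx8g) shows whether the heap setting took effect and whether any off-heap limit was configured. A "corrupted size vs. prev_size" message comes from the native allocator, so these numbers can at most rule out a misconfiguration; they do not explain the corruption itself.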
Version Information
Please indicate relevant versions, including, if relevant:
- Deeplearning4j version = 1.0.0-SNAPSHOT (as of Mar 7, 2019)
- platform information = Ubuntu 16.04
- JVM = OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
These were the relevant loaded jars:
- org/nd4j/nd4j-native/1.0.0-SNAPSHOT/nd4j-native-1.0.0-20190307.130751-10851.jar
- org/nd4j/nd4j-native/1.0.0-SNAPSHOT/nd4j-native-1.0.0-20190307.130751-10851-linux-x86_64.jar
- org/nd4j/nd4j-native-api/1.0.0-SNAPSHOT/nd4j-native-api-1.0.0-20190307.131533-15879.jar
- org/nd4j/nd4j-buffer/1.0.0-SNAPSHOT/nd4j-buffer-1.0.0-20190307.131325-16002.jar
- org/nd4j/nd4j-api/1.0.0-SNAPSHOT/nd4j-api-1.0.0-20190307.131532-15898.jar
- com/google/flatbuffers/flatbuffers-java/1.10.0/flatbuffers-java-1.10.0.jar
- com/github/os72/protobuf-java-shaded-351/0.9/protobuf-java-shaded-351-0.9.jar
- com/github/os72/protobuf-java-util-shaded-351/0.9/protobuf-java-util-shaded-351-0.9.jar
- org/deeplearning4j/deeplearning4j-modelimport/1.0.0-SNAPSHOT/deeplearning4j-modelimport-1.0.0-20190307.131445-2598.jar
- org/deeplearning4j/deeplearning4j-nn/1.0.0-SNAPSHOT/deeplearning4j-nn-1.0.0-20190307.131200-2613.jar
- org/deeplearning4j/deeplearning4j-utility-iterators/1.0.0-SNAPSHOT/deeplearning4j-utility-iterators-1.0.0-20190307.131142-2564.jar
- org/deeplearning4j/deeplearning4j-util/1.0.0-SNAPSHOT/deeplearning4j-util-1.0.0-20190307.131305-2563.jar
- org/nd4j/nd4j-common/1.0.0-SNAPSHOT/nd4j-common-1.0.0-20190307.131138-15855.jar
- org/apache/commons/commons-compress/1.16.1/commons-compress-1.16.1.jar
- org/nd4j/nd4j-jackson/1.0.0-SNAPSHOT/nd4j-jackson-1.0.0-20190307.131544-16017.jar
- org/nd4j/nd4j-context/1.0.0-SNAPSHOT/nd4j-context-1.0.0-20190307.131240-15810.jar
- org/nd4j/jackson/1.0.0-SNAPSHOT/jackson-1.0.0-20190307.131530-16221.jar
- org/deeplearning4j/deeplearning4j-common/1.0.0-SNAPSHOT/deeplearning4j-common-1.0.0-20190307.131534-2031.jar
Contributing
I can help with the fix insofar as I can keep testing things on my side. I don’t have the capacity to learn the codebase well enough to help write the fix at the moment.
Aha! Link: https://skymindai.aha.io/features/ND4J-66
Issue Analytics
- Created 5 years ago
- Comments: 6 (4 by maintainers)
@DavenH Thanks for the issue and the code to reproduce, looks like I should have everything I need. It might be next week before I can take a proper look at this - I’ll comment here once I have done so.
TF import has been rewritten since this issue was filed, so any issues here have likely been solved (all BERT/transformer tests are also passing on the new system).