JRE Crash on Tensorflow model import

Issue Description

I encounter a JVM crash when importing a BERT model with TFGraphMapper.

I pasted the dump here: https://gist.github.com/DavenH/6d261dcb171d96104cd81674f96e42f9

Alternatively, on a different run I get the error "*** Error in `/usr/lib/jvm/default-java/jre/bin/java': corrupted size vs. prev_size: 0x00007f9b90b7d470 ***". Dump: https://gist.github.com/DavenH/893795b819b447732af4b61efb92451b

The .pb file for BERT is huge, so I won’t attach it here. I created it by downloading the pretrained, uncased base model from here: https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip

Then run this script (https://gist.github.com/DavenH/5084e45086c71d838b0c5e51bc41e2e1, adjusting the paths to wherever you extracted the archive) to convert the bert_model.ckpt file to a .pb file.

Then run this code, with -Xmx8g for JVM memory:

TFGraphMapper mapper = TFGraphMapper.getInstance();
SameDiff sd = mapper.importGraph(new File("/path/to/bert/model/bert_exported.pb"));

I tried the same code with one of the Zoo models (http://download.tensorflow.org/models/compression_residual_gru-2016-08-23.tar.gz), and it was successful, so my theory is that this crash has less to do with the extractor code than with some backend allocation that fails when the model is more complex and has more parameters.
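As a rough way to gather numbers for the allocation theory above, the following sketch prints the model file size next to the JVM heap limit. It is not part of the original report and does not diagnose the crash. It rests on two points: -Xmx only bounds the Java heap, while ND4J allocates array memory off-heap through JavaCPP, whose limit can be set with the org.bytedeco.javacpp.maxbytes system property. The class name ImportMemoryReport and its helper methods are hypothetical names chosen for this illustration.

```java
import java.io.File;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of ND4J or Deeplearning4j.
class ImportMemoryReport {

    // Formats a byte count as mebibytes with one decimal place.
    static String toMiB(long bytes) {
        return String.format(Locale.ROOT, "%.1f MiB", bytes / (1024.0 * 1024.0));
    }

    // Builds a short report: model file size, JVM max heap, and the JavaCPP
    // off-heap limit if one was passed as a system property.
    static String report(File model) {
        long fileBytes = model.length();                  // 0 if the file does not exist
        long maxHeap = Runtime.getRuntime().maxMemory();  // reflects -Xmx
        String offHeap = System.getProperty("org.bytedeco.javacpp.maxbytes", "(not set)");
        return "model file: " + toMiB(fileBytes) + "\n"
             + "JVM max heap: " + toMiB(maxHeap) + "\n"
             + "org.bytedeco.javacpp.maxbytes: " + offHeap;
    }

    public static void main(String[] args) {
        // The default path is the placeholder from the issue; pass the real one as args[0].
        String path = args.length > 0 ? args[0] : "/path/to/bert/model/bert_exported.pb";
        System.out.println(report(new File(path)));
    }
}
```

Running it with the same JVM flags as the failing import (for example -Xmx8g) shows whether the limits in effect are the ones you expect before attaching the numbers to the issue.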

Version Information

Please indicate relevant versions, including, if relevant:

Deeplearning4j version = 1.0.0-SNAPSHOT (as of Mar 7, 2019)
platform information = Ubuntu 16.04
JVM = OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)

These were the relevant loaded jars:

org/nd4j/nd4j-native/1.0.0-SNAPSHOT/nd4j-native-1.0.0-20190307.130751-10851.jar
org/nd4j/nd4j-native/1.0.0-SNAPSHOT/nd4j-native-1.0.0-20190307.130751-10851-linux-x86_64.jar
org/nd4j/nd4j-native-api/1.0.0-SNAPSHOT/nd4j-native-api-1.0.0-20190307.131533-15879.jar
org/nd4j/nd4j-buffer/1.0.0-SNAPSHOT/nd4j-buffer-1.0.0-20190307.131325-16002.jar
org/nd4j/nd4j-api/1.0.0-SNAPSHOT/nd4j-api-1.0.0-20190307.131532-15898.jar
com/google/flatbuffers/flatbuffers-java/1.10.0/flatbuffers-java-1.10.0.jar
com/github/os72/protobuf-java-shaded-351/0.9/protobuf-java-shaded-351-0.9.jar
com/github/os72/protobuf-java-util-shaded-351/0.9/protobuf-java-util-shaded-351-0.9.jar
org/deeplearning4j/deeplearning4j-modelimport/1.0.0-SNAPSHOT/deeplearning4j-modelimport-1.0.0-20190307.131445-2598.jar
org/deeplearning4j/deeplearning4j-nn/1.0.0-SNAPSHOT/deeplearning4j-nn-1.0.0-20190307.131200-2613.jar
org/deeplearning4j/deeplearning4j-utility-iterators/1.0.0-SNAPSHOT/deeplearning4j-utility-iterators-1.0.0-20190307.131142-2564.jar
org/deeplearning4j/deeplearning4j-util/1.0.0-SNAPSHOT/deeplearning4j-util-1.0.0-20190307.131305-2563.jar
org/nd4j/nd4j-common/1.0.0-SNAPSHOT/nd4j-common-1.0.0-20190307.131138-15855.jar
org/apache/commons/commons-compress/1.16.1/commons-compress-1.16.1.jar
org/nd4j/nd4j-jackson/1.0.0-SNAPSHOT/nd4j-jackson-1.0.0-20190307.131544-16017.jar
org/nd4j/nd4j-context/1.0.0-SNAPSHOT/nd4j-context-1.0.0-20190307.131240-15810.jar
org/nd4j/jackson/1.0.0-SNAPSHOT/jackson-1.0.0-20190307.131530-16221.jar
org/deeplearning4j/deeplearning4j-common/1.0.0-SNAPSHOT/deeplearning4j-common-1.0.0-20190307.131534-2031.jar
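For anyone reproducing this who wants a comparable list, one possible way to collect it is to filter the JVM classpath for ND4J and Deeplearning4j entries. This is a minimal sketch, not how the list above was produced (that is not stated in the issue), and it reports what is on the classpath rather than what was actually loaded. The class name LoadedJars and the method matching are hypothetical names for this illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class LoadedJars {

    // Returns the file names of classpath entries whose path contains
    // at least one of the given fragments, in classpath order.
    static List<String> matching(String classPath, String separator, String... fragments) {
        List<String> out = new ArrayList<>();
        for (String entry : classPath.split(Pattern.quote(separator))) {
            for (String fragment : fragments) {
                if (entry.contains(fragment)) {
                    out.add(new File(entry).getName());
                    break; // avoid adding the same entry twice
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String classPath = System.getProperty("java.class.path");
        for (String jar : matching(classPath, File.pathSeparator, "nd4j", "deeplearning4j")) {
            System.out.println(jar);
        }
    }
}
```

Snapshot jar names include a timestamp and build number, so a listing like this pins down exactly which SNAPSHOT build was in use.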

Contributing

I can help with the fix insofar as I can keep testing things on my side. I don’t have the capacity to learn the codebase well enough to help write the fix at the moment.

Aha! Link: https://skymindai.aha.io/features/ND4J-66

Issue Analytics

State:
Created 5 years ago
Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction

AlexDBlack commented, Mar 8, 2019

@DavenH Thanks for the issue and the code to reproduce, looks like I should have everything I need. It might be next week before I can take a proper look at this - I’ll comment here once I have done so.

0 reactions

AlexDBlack commented, Nov 1, 2019

TF import has been rewritten since this issue was filed, so any issues here have likely been solved (all BERT/transformer tests are also passing on the new system).