JRE Crash on Tensorflow model import

Issue Description

I encounter a JVM crash when importing a BERT model with TFGraphMapper.

I pasted the dump here: https://gist.github.com/DavenH/6d261dcb171d96104cd81674f96e42f9

Alternatively, on a different run I get the error "*** Error in `/usr/lib/jvm/default-java/jre/bin/java': corrupted size vs. prev_size: 0x00007f9b90b7d470 ***". Dump: https://gist.github.com/DavenH/893795b819b447732af4b61efb92451b
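
For anyone triaging dumps like these, a quick first step is to check which frame the JVM blames. Below is a minimal sketch that pulls the "Problematic frame" entry out of a HotSpot fatal error log (hs_err_pid*.log). It assumes the dump follows the standard HotSpot layout, where that header line is followed by the frame itself; the class name `CrashFrameFinder` is made up for illustration, and nothing here has been run against the gists linked above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: extracts the "Problematic frame" entry from an hs_err log. */
final class CrashFrameFinder {

    private static final String MARKER = "# Problematic frame:";

    /** Returns the line that follows the marker, with its leading "#" and padding removed. */
    static Optional<String> problematicFrame(List<String> lines) {
        for (int i = 0; i < lines.size() - 1; i++) {
            if (lines.get(i).trim().startsWith(MARKER)) {
                String next = lines.get(i + 1).trim();
                return Optional.of(next.replaceFirst("^#\\s*", ""));
            }
        }
        return Optional.empty();
    }

    /** HotSpot tags native (C/C++) frames with a leading "C". */
    static boolean isNativeFrame(String frame) {
        return frame.startsWith("C ");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CrashFrameFinder.java <hs_err_pid.log>");
            return;
        }
        // ISO-8859-1 decodes any byte sequence, so stray bytes in the log cannot abort the read.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        Optional<String> frame = problematicFrame(lines);
        if (frame.isPresent()) {
            System.out.println("Problematic frame: " + frame.get());
            System.out.println("Native frame: " + isNativeFrame(frame.get()));
        } else {
            System.out.println("No 'Problematic frame' section found.");
        }
    }
}
```

If the reported frame is a native one inside a shared library rather than a Java frame, that points at native code and not at the Java-side import logic.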

The .pb file is huge for BERT, so I won't attach it here. To create it, I downloaded the pretrained, uncased base model from here: https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip

Then run this script (https://gist.github.com/DavenH/5084e45086c71d838b0c5e51bc41e2e1 with appropriate adjustments to where you extract the file) to transform the bert_model.ckpt file to a .pb file.

Then run this code, with -Xmx8g for JVM memory:

    TFGraphMapper mapper = TFGraphMapper.getInstance();
    SameDiff sd = mapper.importGraph(new File("/path/to/bert/model/bert_exported.pb"));

I tried the same code with one of the Zoo models (http://download.tensorflow.org/models/compression_residual_gru-2016-08-23.tar.gz), and it was successful, so my theory is that this crash has less to do with the extractor code than with some backend allocation that happens when models are more complex and have more parameters.
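
To help separate a Java heap problem from a native allocation problem, it can be worth confirming what the JVM actually received from -Xmx before the import runs. The sketch below uses only the standard library; the class name `ImportPreflight` is hypothetical and the default path is the placeholder from the import call above. As far as I understand, ND4J's native backend allocates array memory off-heap through JavaCPP, so the heap ceiling printed here is not the only limit in play (the off-heap limits are, I believe, controlled by the `org.bytedeco.javacpp.maxbytes` and `org.bytedeco.javacpp.maxphysicalbytes` system properties, which this sketch does not inspect).

```java
import java.io.File;

/** Hypothetical pre-flight check: prints the JVM heap ceiling next to the model file size. */
final class ImportPreflight {

    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Builds the one-line summary; kept free of I/O so it is easy to verify. */
    static String summary(long maxHeapBytes, long modelBytes) {
        return "max heap = " + toMiB(maxHeapBytes) + " MiB, model file = " + toMiB(modelBytes) + " MiB";
    }

    public static void main(String[] args) {
        // Placeholder path; pass the real location as the first argument.
        File model = new File(args.length > 0 ? args[0] : "/path/to/bert/model/bert_exported.pb");
        // maxMemory() reports the heap ceiling the JVM will try to use (set by -Xmx).
        long maxHeap = Runtime.getRuntime().maxMemory();
        // File.length() returns 0 when the file does not exist.
        System.out.println(summary(maxHeap, model.length()));
    }
}
```

Run it with the same flags as the failing program, for example `java -Xmx8g ImportPreflight.java /path/to/bert/model/bert_exported.pb`. The reported heap may come out slightly below the -Xmx value on some collectors.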

Version Information

Please indicate relevant versions, including, if relevant:

  • Deeplearning4j version = 1.0.0-SNAPSHOT (as of Mar 7, 2019)
  • platform information = Ubuntu 16.04
  • JVM = OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)

These were the relevant loaded jars:

  • org/nd4j/nd4j-native/1.0.0-SNAPSHOT/nd4j-native-1.0.0-20190307.130751-10851.jar
  • org/nd4j/nd4j-native/1.0.0-SNAPSHOT/nd4j-native-1.0.0-20190307.130751-10851-linux-x86_64.jar
  • org/nd4j/nd4j-native-api/1.0.0-SNAPSHOT/nd4j-native-api-1.0.0-20190307.131533-15879.jar
  • org/nd4j/nd4j-buffer/1.0.0-SNAPSHOT/nd4j-buffer-1.0.0-20190307.131325-16002.jar
  • org/nd4j/nd4j-api/1.0.0-SNAPSHOT/nd4j-api-1.0.0-20190307.131532-15898.jar
  • com/google/flatbuffers/flatbuffers-java/1.10.0/flatbuffers-java-1.10.0.jar
  • com/github/os72/protobuf-java-shaded-351/0.9/protobuf-java-shaded-351-0.9.jar
  • com/github/os72/protobuf-java-util-shaded-351/0.9/protobuf-java-util-shaded-351-0.9.jar
  • org/deeplearning4j/deeplearning4j-modelimport/1.0.0-SNAPSHOT/deeplearning4j-modelimport-1.0.0-20190307.131445-2598.jar
  • org/deeplearning4j/deeplearning4j-nn/1.0.0-SNAPSHOT/deeplearning4j-nn-1.0.0-20190307.131200-2613.jar
  • org/deeplearning4j/deeplearning4j-utility-iterators/1.0.0-SNAPSHOT/deeplearning4j-utility-iterators-1.0.0-20190307.131142-2564.jar
  • org/deeplearning4j/deeplearning4j-util/1.0.0-SNAPSHOT/deeplearning4j-util-1.0.0-20190307.131305-2563.jar
  • org/nd4j/nd4j-common/1.0.0-SNAPSHOT/nd4j-common-1.0.0-20190307.131138-15855.jar
  • org/apache/commons/commons-compress/1.16.1/commons-compress-1.16.1.jar
  • org/nd4j/nd4j-jackson/1.0.0-SNAPSHOT/nd4j-jackson-1.0.0-20190307.131544-16017.jar
  • org/nd4j/nd4j-context/1.0.0-SNAPSHOT/nd4j-context-1.0.0-20190307.131240-15810.jar
  • org/nd4j/jackson/1.0.0-SNAPSHOT/jackson-1.0.0-20190307.131530-16221.jar
  • org/deeplearning4j/deeplearning4j-common/1.0.0-SNAPSHOT/deeplearning4j-common-1.0.0-20190307.131534-2031.jar

Contributing

I can help with the fix insofar as I can keep testing things on my side. I don't have the capacity to learn the codebase well enough to help write the fix at the moment.

Aha! Link: https://skymindai.aha.io/features/ND4J-66

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
AlexDBlack commented, Mar 8, 2019

@DavenH Thanks for the issue and the code to reproduce; it looks like I should have everything I need. It might be next week before I can take a proper look at this. I'll comment here once I have done so.

0 reactions
AlexDBlack commented, Nov 1, 2019

TF import has been rewritten since this issue was filed, so any issues here have likely been solved (all BERT/transformer tests are also passing on the new system).
