JVM crashes in native code when trying to run a torch model

Description

Hi. I have a problem running a model from https://github.com/emiliantolo/pytorch_nsfw_model in DJL. The JVM crashes with an error in native code. I have tried running it with OpenJDK 8, Zulu 8, and Zulu 13.

Expected Behavior

The model is expected to load and run correctly.

Error Message

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000134a917ae, pid=12860, tid=9219
#
# JRE version: OpenJDK Runtime Environment (Zulu13.28+11-CA) (13.0.1+10) (build 13.0.1+10-MTS)
# Java VM: OpenJDK 64-Bit Server VM (13.0.1+10-MTS, mixed mode, sharing, tiered, compressed oops, g1 gc, bsd-amd64)
# Problematic frame:
# C  [libtorch_cpu.dylib+0x2a957ae]  torch::jit::Expr::Expr(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&)+0x2e
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/evgenyzakharov/Workspace/pytorch_nsfw_model_jvm/hs_err_pid12860.log
#
# If you would like to submit a bug report, please visit:
#   http://www.azulsystems.com/support/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

The error report file contains the following stack:

Stack: [0x000070000ea99000,0x000070000eb99000],  sp=0x000070000eb95f50,  free space=1011k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libtorch_cpu.dylib+0x2a957ae]  torch::jit::Expr::Expr(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&)+0x2e
C  [libtorch_cpu.dylib+0x2d21cb5]  torch::jit::ScriptTypeParser::parseClassConstant(torch::jit::Assign const&)+0xd5
C  [libtorch_cpu.dylib+0x2a9097a]  torch::jit::SourceImporterImpl::importClass(c10::QualifiedName const&, torch::jit::ClassDef const&, bool)+0x210a
C  [libtorch_cpu.dylib+0x2a8c61a]  torch::jit::SourceImporterImpl::importNamedType(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, torch::jit::ClassDef const&)+0x64a
C  [libtorch_cpu.dylib+0x2a88d2b]  torch::jit::SourceImporterImpl::findNamedType(c10::QualifiedName const&)+0xfb

Full error log: hs_err_pid12860.log
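
When reading a long hs_err log, it can help to pull out just the native frames to see which library and function the crash passed through. The following is a minimal sketch, not part of the original report: it assumes frame lines follow the "C  [library+offset]  symbol+offset" layout shown above, and the names FRAME_RE and parse_native_frames are hypothetical helpers introduced for illustration.

```python
import re

# Matches native-frame lines such as:
#   C  [libtorch_cpu.dylib+0x2d21cb5]  torch::jit::...(args)+0xd5
# The layout is assumed from the excerpt above.
FRAME_RE = re.compile(
    r"^C\s+\[(?P<lib>[^\]+]+)\+(?P<offset>0x[0-9a-fA-F]+)\]\s+(?P<symbol>.*)$"
)


def parse_native_frames(text):
    """Return a list of dicts (lib, offset, function) for each native frame."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # skip Java frames, headers, and blank lines
        # Drop the trailing "+0x.." offset inside the function.
        symbol = re.sub(r"\+0x[0-9a-fA-F]+$", "", match.group("symbol"))
        frames.append({
            "lib": match.group("lib"),
            "offset": match.group("offset"),
            # Keep only the qualified name, without the argument list.
            "function": symbol.split("(", 1)[0],
        })
    return frames


# Two frames copied from the stack above.
SAMPLE = """\
C  [libtorch_cpu.dylib+0x2d21cb5]  torch::jit::ScriptTypeParser::parseClassConstant(torch::jit::Assign const&)+0xd5
C  [libtorch_cpu.dylib+0x2a88d2b]  torch::jit::SourceImporterImpl::findNamedType(c10::QualifiedName const&)+0xfb
"""

if __name__ == "__main__":
    for frame in parse_native_frames(SAMPLE):
        print(frame["lib"], frame["offset"], frame["function"])
```

In this report every native frame comes from libtorch_cpu.dylib, and the function names all belong to the TorchScript source importer, so the crash occurs while the saved model is being loaded.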

How to Reproduce?

Download the model from the repository https://github.com/emiliantolo/pytorch_nsfw_model and try to run it with DJL.

Steps to reproduce

  1. Download the model.
  2. Convert the model to “.pt” by scripting or tracing:
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models

# Rebuild the architecture: ResNet-50 with a custom 10-output head.
model = models.resnet50()
model.fc = nn.Sequential(nn.Linear(2048, 512),
                         nn.ReLU(),
                         nn.Dropout(0.2),
                         nn.Linear(512, 10),
                         nn.LogSoftmax(dim=1))
model.load_state_dict(torch.load('ResNet50_nsfw_model.pth', map_location=torch.device('cpu')))
model.eval()

# Option 1: script
export = torch.jit.script(model)
torch.jit.save(export, "out.pt")

# Option 2: trace
# data_dir and test_transforms are not defined in this snippet.
image = Image.open(data_dir + "1.jpg")
image_tensor = test_transforms(image).float()
image_tensor = image_tensor.unsqueeze_(0)  # add the batch dimension
net_trace = torch.jit.trace(model, image_tensor)
net_trace.save("out.pt")
  3. Try to use the model in DJL with the following criteria:
val criteria = Criteria.builder()
        .setTypes(
            Image::class.java,
            Classifications::class.java
        )
        .optModelZoo(DefaultModelZoo("<path to folder with model>"))
        .optModelName("out.pt")
        .optTranslator(translator)
        .optProgress(ProgressBar())
        .build()

Versions in build.gradle.kts:


implementation("ai.djl:api:0.6.0")
runtimeOnly("ai.djl.pytorch:pytorch-engine:0.6.0")
runtimeOnly("ai.djl.pytorch:pytorch-native-auto:1.5.0")

What have you tried to solve it?

  1. Tried different JVM versions (OpenJDK 8, Zulu 8, Zulu 13)
  2. Tried different torch versions (1.6.0, 1.4.0)

Environment Info

OS:uname:Darwin 18.7.0 Darwin Kernel Version 18.7.0: Thu Jun 18 20:50:10 PDT 2020; root:xnu-4903.278.43~1/RELEASE_X86_64 x86_64
rlimit: STACK 8192k, CORE 0k, NPROC 1418, NOFILE 10240, AS infinity, DATA infinity, FSIZE infinity
load average:2.75 2.65 2.72

CPU:total 8 (initial active 8) (4 cores per cpu, 2 threads per core) family 6 model 158 stepping 9, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, avx, avx2, aes, clmul, erms, 3dnowpref, lzcnt, ht, tsc, tscinvbit, bmi1, bmi2, adx, fma

Memory: 4k page, physical 16777216k(62596k free), swap 6291456k(874240k free)

vm_info: OpenJDK 64-Bit Server VM (13.0.1+10-MTS) for bsd-amd64 JRE (13.0.1+10-MTS) (Zulu13.28+11-CA), built on Oct  9 2019 12:07:25 by "zulu_re" with clang 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 19 (10 by maintainers)

Top GitHub Comments

2 reactions
evgzakharov commented, Aug 7, 2020

I compared the image pixels and found that some of them differ slightly between the JVM (ImageIO) and Python (PIL). I also found a Stack Overflow discussion about this. It looks like the results differ only on OS X. I will check the results in a Docker image later.
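
A pixel comparison of the kind described in this comment could be sketched roughly as follows. This is an illustration only, not the commenter's actual code: it assumes both decoders' outputs have already been exported as equal-length sequences of (r, g, b) tuples, and pixel_diff_stats is a hypothetical helper name.

```python
def pixel_diff_stats(pixels_a, pixels_b):
    """Compare two equal-length sequences of (r, g, b) tuples.

    Returns (changed, max_delta): the number of pixels that differ in any
    channel and the largest per-channel absolute difference found.
    """
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must have the same number of pixels")
    changed = 0
    max_delta = 0
    for pa, pb in zip(pixels_a, pixels_b):
        # Largest channel difference for this pixel.
        delta = max(abs(x - y) for x, y in zip(pa, pb))
        if delta:
            changed += 1
        max_delta = max(max_delta, delta)
    return changed, max_delta


if __name__ == "__main__":
    # Made-up values standing in for the two decoders' outputs.
    jvm_pixels = [(0, 0, 0), (10, 20, 30)]
    python_pixels = [(0, 0, 0), (10, 21, 28)]
    print(pixel_diff_stats(jvm_pixels, python_pixels))
```

A small max_delta spread over many pixels would be consistent with decoder-level differences rather than a preprocessing bug, though that interpretation is an assumption and not something the thread confirms.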

2 reactions
frankfliu commented, Aug 6, 2020

@evgzakharov Thanks for reporting this issue. Will try to reproduce it.
