JVM crashes in native code when trying to run a Torch model
Description
Hi. I'm having problems running a model from https://github.com/emiliantolo/pytorch_nsfw_model in DJL. The JVM crashes with an error in native code. I have tried OpenJDK 8, Zulu 8, and Zulu 13.
Expected Behavior
The model should run correctly.
Error Message
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000000134a917ae, pid=12860, tid=9219
#
# JRE version: OpenJDK Runtime Environment (Zulu13.28+11-CA) (13.0.1+10) (build 13.0.1+10-MTS)
# Java VM: OpenJDK 64-Bit Server VM (13.0.1+10-MTS, mixed mode, sharing, tiered, compressed oops, g1 gc, bsd-amd64)
# Problematic frame:
# C [libtorch_cpu.dylib+0x2a957ae] torch::jit::Expr::Expr(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&)+0x2e
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/evgenyzakharov/Workspace/pytorch_nsfw_model_jvm/hs_err_pid12860.log
#
# If you would like to submit a bug report, please visit:
# http://www.azulsystems.com/support/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
The error dump contains the following:
Stack: [0x000070000ea99000,0x000070000eb99000], sp=0x000070000eb95f50, free space=1011k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libtorch_cpu.dylib+0x2a957ae] torch::jit::Expr::Expr(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&)+0x2e
C [libtorch_cpu.dylib+0x2d21cb5] torch::jit::ScriptTypeParser::parseClassConstant(torch::jit::Assign const&)+0xd5
C [libtorch_cpu.dylib+0x2a9097a] torch::jit::SourceImporterImpl::importClass(c10::QualifiedName const&, torch::jit::ClassDef const&, bool)+0x210a
C [libtorch_cpu.dylib+0x2a8c61a] torch::jit::SourceImporterImpl::importNamedType(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, torch::jit::ClassDef const&)+0x64a
C [libtorch_cpu.dylib+0x2a88d2b] torch::jit::SourceImporterImpl::findNamedType(c10::QualifiedName const&)+0xfb
Full error log: hs_err_pid12860.log
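To make the native stack easier to compare across runs (for example between JVM or torch versions), the frames can be pulled out of the hs_err log with a short script. This is only a minimal sketch: `FRAME_RE` and `native_frames` are hypothetical names, and the pattern assumes frames look like the `C [library+offset] symbol+offset` lines quoted above.

```python
import re

# Matches native ("C") frames as they appear in the dump above, e.g.
#   C [libtorch_cpu.dylib+0x2d21cb5] torch::jit::...(...)+0xd5
# Assumption: every native frame follows this "C [lib+offset] symbol+offset" layout.
FRAME_RE = re.compile(
    r"^C\s+\[(?P<lib>.+?)\+(?P<offset>0x[0-9a-fA-F]+)\]\s+"
    r"(?P<symbol>.*?)(?:\+0x[0-9a-fA-F]+)?$"
)


def native_frames(hs_err_text):
    """Return (library, offset, symbol) tuples for each native frame line."""
    frames = []
    for line in hs_err_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(
                (match.group("lib"), match.group("offset"), match.group("symbol"))
            )
    return frames


# Two frames copied from the dump above.
sample = (
    "C [libtorch_cpu.dylib+0x2d21cb5] torch::jit::ScriptTypeParser::"
    "parseClassConstant(torch::jit::Assign const&)+0xd5\n"
    "C [libtorch_cpu.dylib+0x2a88d2b] torch::jit::SourceImporterImpl::"
    "findNamedType(c10::QualifiedName const&)+0xfb\n"
)

for lib, offset, symbol in native_frames(sample):
    print(lib, offset, symbol)
```

In practice the text would come from reading hs_err_pid12860.log; lines that are not native frames are skipped.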
How to Reproduce?
Download model from repository https://github.com/emiliantolo/pytorch_nsfw_model and try to run with DJL
Steps to reproduce
- Download model
- Convert the model to “.pt” with script or trace:
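import torch
import torch.nn as nn
from PIL import Image
from torch.autograd import Variable
from torchvision import models

# Rebuild ResNet50 with the custom classifier head, then load the weights
model = models.resnet50()
model.fc = nn.Sequential(nn.Linear(2048, 512),
                         nn.ReLU(),
                         nn.Dropout(0.2),
                         nn.Linear(512, 10),
                         nn.LogSoftmax(dim=1))
model.load_state_dict(torch.load('ResNet50_nsfw_model.pth', map_location=torch.device('cpu')))
model.eval()

# Option 1: script
export = torch.jit.script(model)
torch.jit.save(export, "out.pt")

# Option 2: trace (data_dir and test_transforms are not defined in this snippet)
image = Image.open(data_dir + "1.jpg")
image_tensor = test_transforms(image).float()
image_tensor = image_tensor.unsqueeze_(0)
input = Variable(image_tensor)
net_trace = torch.jit.trace(model, input)
net_trace.save("out.pt")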
- Try to use the model in DJL with the following criteria:
val criteria = Criteria.builder()
    .setTypes(
        Image::class.java,
        Classifications::class.java
    )
    .optModelZoo(DefaultModelZoo("<path to folder with model>"))
    .optModelName("out.pt")
    .optTranslator(translator)
    .optProgress(ProgressBar())
    .build()
Versions in build.gradle.kts:
implementation("ai.djl:api:0.6.0")
runtimeOnly("ai.djl.pytorch:pytorch-engine:0.6.0")
runtimeOnly("ai.djl.pytorch:pytorch-native-auto:1.5.0")
What have you tried to solve it?
- Tried different JVM versions (OpenJDK 8, Zulu 8, Zulu 13)
- Tried different versions of torch (1.6.0, 1.4.0)
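One more thing that may be worth ruling out is a mismatch between the PyTorch version that exported out.pt and the libtorch version loaded at runtime (pytorch-native-auto:1.5.0 above). Nothing in this report confirms that this is the cause, so treat it as an assumption. The sketch below only compares two version strings; `major_minor` and `exported_with_newer_torch` are hypothetical helper names.

```python
def major_minor(version):
    """Return (major, minor) from a version string such as '1.6.0' or '1.5.0+cpu'."""
    core = version.split("+", 1)[0]  # drop a local build suffix like '+cpu'
    parts = core.split(".")
    return int(parts[0]), int(parts[1])


def exported_with_newer_torch(export_version, runtime_version):
    """True if the exporting PyTorch is newer (by major.minor) than the runtime."""
    return major_minor(export_version) > major_minor(runtime_version)


# Assumption for illustration: the model was exported with torch 1.6.0
# while the JVM side loads libtorch 1.5.0.
print(exported_with_newer_torch("1.6.0", "1.5.0"))  # True
```

On the Python side the export version can be read from `torch.__version__`; the runtime version here is taken from the Gradle dependency.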
Environment Info
OS:uname:Darwin 18.7.0 Darwin Kernel Version 18.7.0: Thu Jun 18 20:50:10 PDT 2020; root:xnu-4903.278.43~1/RELEASE_X86_64 x86_64
rlimit: STACK 8192k, CORE 0k, NPROC 1418, NOFILE 10240, AS infinity, DATA infinity, FSIZE infinity
load average:2.75 2.65 2.72
CPU:total 8 (initial active 8) (4 cores per cpu, 2 threads per core) family 6 model 158 stepping 9, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, avx, avx2, aes, clmul, erms, 3dnowpref, lzcnt, ht, tsc, tscinvbit, bmi1, bmi2, adx, fma
Memory: 4k page, physical 16777216k(62596k free), swap 6291456k(874240k free)
vm_info: OpenJDK 64-Bit Server VM (13.0.1+10-MTS) for bsd-amd64 JRE (13.0.1+10-MTS) (Zulu13.28+11-CA), built on Oct 9 2019 12:07:25 by "zulu_re" with clang 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)
Issue Analytics
- State:
- Created 3 years ago
- Reactions:1
- Comments:19 (10 by maintainers)
I compared the image pixels and found that some of them differ slightly between the JVM (ImageIO) and Python (PIL). I also found a Stack Overflow discussion about this. It looks like the results differ only on OSX. I will check the results in a Docker image later.
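A rough sketch of how such a comparison could be quantified, assuming the decoded channel values from each side have already been dumped into flat lists of integers in the same order. `pixel_diff_stats` is a hypothetical helper and the sample values are invented for illustration, not taken from the actual images.

```python
def pixel_diff_stats(pixels_a, pixels_b):
    """Compare two equal-length sequences of 0-255 channel values.

    Returns (number of differing values, largest absolute difference).
    """
    if len(pixels_a) != len(pixels_b):
        raise ValueError("pixel sequences must have the same length")
    diffs = [abs(a - b) for a, b in zip(pixels_a, pixels_b)]
    changed = sum(1 for d in diffs if d != 0)
    return changed, max(diffs, default=0)


# Invented values standing in for the ImageIO and PIL decodes of one image.
from_jvm = [10, 200, 33, 255]
from_python = [10, 201, 31, 255]
print(pixel_diff_stats(from_jvm, from_python))  # (2, 2)
```

A small maximum difference spread over a few values would point to decoder rounding rather than a different image.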
@evgzakharov Thanks for reporting this issue. Will try to reproduce it.